Rate-adaptive Bayesian independent component analysis

We consider independent component analysis (ICA) using a Bayesian approach. The latent sources are allowed to be block-wise independent while the underlying block structure is unknown. We consider prior distributions on the block structure, the mixing matrix and the marginal density functions of the latent sources using a Dirichlet mixture and random series priors. We obtain a minimax-optimal posterior contraction rate for the joint density of the latent sources. This finding reveals that Bayesian ICA adaptively achieves the optimal rate of convergence according to the unknown smoothness level of the true marginal density functions and the unknown block structure. We evaluate the empirical performance of the proposed method by simulation studies.


Introduction
Independent component analysis (ICA) refers to the problem of recovering unknown independent source signals given observations of their linear combinations. More specifically, letting S = (S_1, . . . , S_d)^T be the independent unknown source signals and W be an unknown d × d matrix, ICA aims to estimate W and the distribution of S given n independent and identically distributed (i.i.d.) observations of X generated from the model

X = AS,    (1)

where A = W^{-1} is usually called the mixing matrix. In recent years, ICA has been widely used in signal processing, machine learning and brain imaging, among many other areas of application (Roberts and Everson, 2001; Hyvärinen, Karhunen and Oja, 2001). There are two common sets of approaches to solving ICA problems. The first set, which builds on parametric assumptions about the marginal densities of S, includes the maximum likelihood approach (Bell and Sejnowski, 1995; Lee, Girolami and Sejnowski, 1999), minimizing mutual information (Cardoso, 1999) and, more generally, optimizing contrast functions such as the Kullback-Leibler divergence, entropy and non-Gaussianity measures (Comon, 1994; Hyvärinen, 1999). However, as the distribution of S is usually unknown in practice, it may be more appealing to consider an alternative class of methods that views ICA as a semiparametric model without any parametric assumptions on S. Popular methods include the kernel method (Bach and Jordan, 2002), maximum likelihood (Hastie and Tibshirani, 2003), B-spline approximation (Chen and Bickel, 2006) and log-concave ICA projection (Samworth and Yuan, 2012).
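To fix ideas, the following minimal sketch (our own illustration: the marginal distributions, the mixing matrix and the sample size are arbitrary choices, not taken from the paper) generates i.i.d. observations from model (1) and confirms that knowing W = A^{-1} recovers the sources exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of i.i.d. observations

# Independent, non-Gaussian source signals S = (S_1, S_2, S_3)^T.
S = np.column_stack([
    rng.uniform(-1.0, 1.0, n),     # S_1
    rng.laplace(0.0, 1.0, n),      # S_2
    rng.exponential(1.0, n) - 1.0  # S_3, centered
])

# A non-singular mixing matrix A; W = A^{-1} is the unmixing matrix.
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.2, 1.0]])

X = S @ A.T           # each row is one observation of X = A S
W = np.linalg.inv(A)  # ICA estimates W (up to permutation/scaling) from X alone
assert np.allclose(X @ W.T, S)
```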
Bayesian ICA has gained popularity due to its flexibility in incorporating prior information and its ease of use in making inference (e.g., chapter 20 of Hyvärinen, Karhunen and Oja, 2001). It has particular application in biomedical image processing when there is a need to impose mathematical constraints on the mixing matrix. For example, in electroencephalography analysis, it may be helpful to restrict all elements in the mixing matrix to be non-negative so that the ongoing potential from the cortex will have the same sign after being observed at the scalp (Roberts and Choudrey, 2003, 2005). Another example is in biological neural network studies such as modeling the visual cortex, where it is commonly believed that only a small proportion of neural activity is actually active at any time (Olshausen and Field, 1996; Bell and Sejnowski, 1997). In this situation, Bayesian ICA is particularly useful for imposing prior knowledge of sparsity on the mixing matrix in the modeling process (Hyvärinen and Karthikesh, 2000).
Recent progress has been made in the development of computational algorithms for Bayesian ICA, such as variational Bayes, mean field approximation and other methods (Winther and Petersen, 2007; Højen-Sørensen, Winther and Hansen, 2002). However, there are considerable gaps between the theoretical properties of the proposed Bayesian methods and the existing theory in the frequentist literature. In this paper, we fill this gap by considering two commonly used priors on marginal densities, namely a Dirichlet mixture and a random series prior, and establishing their asymptotic properties. More specifically, we show that the proposed estimation procedure leads to a posterior estimate of the joint density of S that converges to the frequentist truth at the optimal minimax rate (up to logarithmic factors). The rate is equivalent to the nonparametric rate of simultaneously estimating d one-dimensional density functions, and is determined by the worst smoothness level of the marginal densities. Consequently, Bayesian ICA can be viewed as a sufficient dimension reduction technique that avoids "the curse of dimensionality" in an asymptotic sense. These results connect to existing work in the frequentist literature, e.g., Samarov and Tsybakov (2004); Chen and Bickel (2006); Samworth and Yuan (2012). An additional advantage of the Bayesian method is that no tuning process is required, as the prior automatically adapts to the unknown smoothness levels. Of practical relevance, we also consider block ICA, an extension of classical ICA in which the latent sources are allowed to be block-wise independent while the block structure is unknown. This problem is sometimes referred to as multi-dimensional ICA (Cardoso, 1998).
The rest of the paper is organized as follows. We propose a Bayesian approach for block ICA and discuss the choices of the prior in Section 2. We present the main results on posterior contraction rates in Section 3. In Section 4, we discuss the posterior computation and give simulation examples to illustrate the empirical performance of the proposed method. Proofs are given in the Appendix.

Statistical setting
We consider a Bayesian ICA model with an unknown block structure:

X = AS,    (2)

where A is the mixing matrix, S is the vector of source signals and we observe n i.i.d. copies of X. We assume that S is block-wise independent with respect to a partition I = I_1 ∪ · · · ∪ I_t of {1, . . . , d}, i.e., S_i and S_j are independent if i and j belong to different blocks of I. Denoting by S_{I_j} the sub-vector of S with indexes in I_j, the joint density function of S can be written as

p(S) = ∏_{j=1}^{t} g_j(S_{I_j}),    (3)

where g_j is a multivariate density on R^{|I_j|} if I_j contains multiple indexes (|I_j| > 1). Clearly, when I = {1} ∪ {2} ∪ · · · ∪ {d}, the proposed model reduces to the classical ICA model in which all components of S are mutually independent. Our goal is to propose appropriate prior distributions on the partition I, the mixing matrix A and the marginal distributions of S, denoted by g = (g_1, . . . , g_t).
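As an illustration of the block-wise setting (a hypothetical sketch: the partition, the within-block dependence and the mixing matrix are our own choices), the following generates data from model (2) under the partition I = {1, 2} ∪ {3}, so that S_1 and S_2 are dependent on each other but independent of S_3:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
partition = [[0, 1], [2]]  # I = {1, 2} U {3}, zero-based indexing

# Block {1, 2}: the two coordinates share a latent mixture label, hence are
# dependent within the block; block {3} is drawn independently.
z = 2.0 * rng.integers(0, 2, n) - 1.0             # mixture label in {-1, +1}
block12 = rng.normal(np.column_stack([z, z]), 0.5)
block3 = rng.uniform(-1.0, 1.0, (n, 1))
S = np.hstack([block12, block3])

A = np.array([[1.0, 0.4, 0.2],
              [0.3, 1.0, 0.1],
              [0.2, 0.5, 1.0]])
X = S @ A.T  # observed data; both A and the partition are unknown to the analyst
```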

Prior construction
We construct the prior in a hierarchical way, first on the block partition, then on the mixing matrix and corresponding marginal density functions. The following steps provide more details.
(A1) Prior on the block partition Π_P: Assigning priors on the block structure is equivalent to assigning prior distributions on the partitions of {1, . . . , d}. We use a uniform prior, i.e., for every possible partition I of {1, . . . , d}, its prior probability is Π_P(I) = B_d^{-1}, where B_d is the d-th Bell number. Clearly, if it is known that there is no block structure, i.e., d_1 = · · · = d_d = 1, then this step is no longer needed.
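For small d, the uniform prior Π_P can be built by explicit enumeration. A minimal sketch (our own illustration) lists all partitions of {1, 2, 3} and assigns each of the B_3 = 5 partitions probability 1/5:

```python
def set_partitions(elems):
    """Recursively enumerate all partitions of a list (Bell-number many)."""
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for smaller in set_partitions(rest):
        # place `first` into each existing block in turn ...
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        # ... or into a new block of its own
        yield [[first]] + smaller

parts = list(set_partitions([1, 2, 3]))
print(len(parts))  # 5, the Bell number B_3
uniform_prior = {tuple(map(tuple, p)): 1.0 / len(parts) for p in parts}
```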
(A2) Prior on the mixing matrix Π_A: We consider i.i.d. continuous distributions Π_A^1 on each element a_ij of A satisfying the tail condition

Π_A^1(|a_ij| > t) ≤ c_1 exp(−c_1' t^{τ_1}) for every t > 0,    (4)

for some constants c_1, c_1', τ_1 > 0. This condition is easily satisfied by distributions such as the exponential, Gamma and Laplace distributions. It is possible to impose a sparsity structure (as discussed by Hyvärinen and Karthikesh, 2000) by considering Π_A^1(a_ij) ∝ exp{−G(a_ij)}, where G is a positive, convex function, say G(x) = |x|. It can be easily verified that these sparse priors satisfy the proposed condition when G(x) is a polynomial in |x|. In Roberts and Choudrey (2005), the authors proposed using a rectified Gaussian distribution (a Gaussian distribution restricted to [0, ∞)) as the prior for every element of A. Clearly, that prior also satisfies condition (4). It is well known that the solution to ICA is unique only up to block-wise permutation and scaling (Theis, 2005). Therefore, in practice, one may also consider scaling restrictions on the mixing matrix, e.g., rescaling each row of A to have unit norm, in which case we require the prior density of A, supported on the rescaled parameter space Ω, to satisfy Π_A(‖A − x‖ ≤ ε) ≥ c_1 ε^{τ_2} for every x ∈ Ω, sufficiently small ε > 0 and some constants c_1 > 0, τ_2 ≥ 0. This can be done by first constructing prior distributions on each element of A as in Π_A^1, and then performing a transformation (standardization, sign change) if needed.
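As a concrete instance (our own sketch, not the paper's implementation), i.i.d. Laplace priors with unit scale satisfy Π_A^1(|a_ij| > t) = exp(−t), so condition (4) holds with τ_1 = 1; rescaling each row afterwards imposes the unit-norm scaling restriction:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3

# Laplace(0, 1) prior on each a_ij: P(|a_ij| > t) = exp(-t) for t >= 0,
# an exponential tail, so the tail condition (4) holds with tau_1 = 1.
A = rng.laplace(0.0, 1.0, size=(d, d))

# Optional scaling restriction: rescale each row of A to unit Euclidean norm
# (the ICA solution is only identified up to block-wise permutation and scaling).
A = A / np.linalg.norm(A, axis=1, keepdims=True)
print(np.linalg.norm(A, axis=1))  # [1. 1. 1.]
```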
Given a partition I = {I_1, . . . , I_t} and a fixed mixing matrix, the induced joint density function p(S) is obtained as in (3). Our last step is to build prior distributions on the marginal density functions g_1, . . . , g_t. Denote the size of block i by d_i = |I_i|. We consider two sets of priors, depending on the support of X. If X is bounded, we consider a random series prior based on a tensor-product B-spline expansion. If X is unbounded, we consider a Dirichlet mixture prior with Gaussian kernels.
(A3.1) Random series prior Π_g^S: We consider the B-spline basis (tensor-product if d_i ≥ 2) of order q with fixed, equally spaced knots; see de Boor (2001) for an introduction. We assign independent priors on g_1, . . . , g_t; in particular, we write g_i as

g_i(s) = Ψ( ∑_{j=1}^{J_i} θ_{ij} B_j(s) ) / ∫ Ψ( ∑_{j=1}^{J_i} θ_{ij} B_j(u) ) du,

where Ψ ∈ C^∞ is a prechosen, nonnegative and monotone link function (e.g., exponential) that ensures the validity of g_i as a density, and B_j(s) is the B-spline (tensor-product of B-splines if d_i ≥ 2) basis function. The number of basis terms J_i controls the accuracy and the complexity of the model.
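A single draw from this prior for a one-dimensional block can be sketched as follows (our own illustration: the exponential link, i.i.d. Gaussian coefficients and J_i = 10 are arbitrary choices, and the normalizing integral is approximated by a Riemann sum):

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(3)
q, J = 4, 10  # order-4 (cubic) B-splines with J basis functions

# Clamped, equally spaced knot sequence on [0, 1] (length J + q).
knots = np.concatenate([np.zeros(q - 1),
                        np.linspace(0.0, 1.0, J - q + 2),
                        np.ones(q - 1)])
theta = rng.normal(0.0, 1.0, J)  # random series coefficients

grid = np.linspace(0.0, 1.0, 1001)
B = np.column_stack([BSpline(knots, np.eye(J)[j], q - 1)(grid) for j in range(J)])

f = B @ theta  # the random series sum_j theta_j B_j(s)
g = np.exp(f)  # exponential link Psi guarantees positivity
g /= g.mean()  # Riemann-sum normalization so that g integrates to ~1 on [0, 1]
```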
We assign a prior on the number of basis terms J_i satisfying

exp(−c_2 j (log j)^{κ_1}) ≤ Π(J_i = j) ≤ exp(−c_2' j (log j)^{κ_2})

for some fixed constants c_2, c_2' > 0, 0 ≤ κ_2 ≤ κ_1 ≤ 1 and any positive integer j. This condition is satisfied by discrete distributions such as the Poisson and geometric distributions (Shen and Ghosal, 2015). For tensor-product B-splines, i.e., d_i ≥ 2, we take J_i^{1/d_i} as the number of basis expansion terms in each direction. Given fixed J_1, . . . , J_t, we consider independent J_i-dimensional priors on the corresponding coefficient vectors θ_i satisfying

Π(‖θ_i − θ_0‖ ≤ ε) ≥ exp(−c_3 J_i log(1/ε)) and Π(θ_i ∉ [−M, M]^{J_i}) ≤ J_i exp(−c_3' M^{κ_3})

for any finite θ_0, sufficiently small ε > 0, large M and some positive constants c_3, c_3', κ_3. Examples include independent Gaussian or Laplace priors on each element of θ_i, and joint distributions such as the Dirichlet distribution on θ_i.

(A3.2) Dirichlet mixture prior Π_g^K: For each i = 1, . . . , t, we write

g_i(s) = ∫ φ_{Σ_i}(s − μ) dF_i(μ),

where φ_Σ is the normal kernel function with mean 0 and covariance Σ, F_i is given a Dirichlet process prior and a prior is placed on the covariance matrix Σ_i.
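A truncated draw from the Dirichlet mixture prior for a one-dimensional block can be sketched via stick-breaking as follows (our own illustration: the truncation level, concentration parameter, standard normal base measure and fixed bandwidth are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, K, sigma = 1.0, 50, 0.3  # concentration, truncation level, kernel bandwidth

# Stick-breaking representation of F_i ~ DP(alpha, N(0, 1)), truncated at K atoms.
v = rng.beta(1.0, alpha, K)
w = v * np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])  # mixture weights
mu = rng.normal(0.0, 1.0, K)                               # atoms from the base measure

def g(s):
    """Density draw g(s) = sum_k w_k * phi_sigma(s - mu_k), a truncated DP mixture."""
    s = np.atleast_1d(s)[:, None]
    phi = np.exp(-0.5 * ((s - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return phi @ w

print(g(np.array([-1.0, 0.0, 1.0])))
```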

Identifiability and uniqueness
Identifiability, uniqueness and separability results play a central role in ICA problems, since they allow ICA algorithms to identify the mixing matrix uniquely (up to changes of scale and permutation) and to recover the source signals. These results were obtained by Comon (1994); Eriksson and Koivunen (2004) for standard ICA, and then extended to block ICA by Theis (2004). Here, we consider the model defined in (2), and say its solution is identifiable and unique if the following two conditions hold for any two solutions (A_1, S_1, I_1) and (A_2, S_2, I_2): (1) A_1 (resp. A_2) can be obtained by a linear column transformation of A_2 (resp. A_1); and (2) the source signals S_1 and S_2 have the same distribution up to changes of scale and permutations. If these conditions are satisfied for the two solutions (A_1, S_1, I_1) and (A_2, S_2, I_2), we say they are equivalent, and define T as the transformation such that T(A_1, S_1, I_1) = (A_2, S_2, I_2). Let p_1 and p_2 be the density functions of the joint distributions produced by (A_1, S_1, I_1) and (A_2, S_2, I_2). Then p_1 and p_2 are also equivalent under T, and we write T(p_1) = p_2.
We assume the following conditions on the true data generating process.
(B1) True model: Suppose that there exist a partition I^0 = {I_1^0, . . . , I_{t_0}^0} of the index set {1, . . . , d} and a non-singular mixing matrix A_0 = (A_{10}, . . . , A_{d0})^T such that the true density function p_0 can be written in a product form after unmixing,

p_0(x) = |det A_0|^{-1} ∏_{i=1}^{t_0} g_i^0((A_0^{-1} x)_{I_i^0}),

with marginal densities g_i^0, i = 1, . . . , t_0. (B2) Source densities: For every i = 1, . . . , t_0, assume that g_i^0 is not a degenerate point mass and does not follow a normal distribution (a joint normal distribution if d_i > 1). (B3) Mixing matrix: Let d^* = max_{1 ≤ i ≤ t_0} d_i. We assume that every sub-matrix of size d^* × d^* of A_0^{-1} is either invertible or zero.
Conditions (B1)-(B3) are essentially equivalent to those assumed in Theis (2004, 2005). Condition (B2) rules out a joint normal distribution, but still allows a marginal normal distribution for sources in a block of size greater than one. Condition (B3) is often called d^*-admissibility, and is trivially satisfied for standard ICA, where d^* = 1. The following lemma asserts that these conditions are sufficient for the identifiability and uniqueness of block ICA.

Lemma 1. Suppose that (B1)-(B3) hold for model (2). Then its solution is identifiable and unique.
The proof of Lemma 1 is essentially the same as that of Theorem 5.1 in Theis (2004), except that the sizes of the blocks (d_i) can differ. Whether (B1)-(B3) are also necessary is not clear. If d^* = 1, (B2) requires that every marginal density not follow a normal distribution, which is stronger than the necessary condition for the standard ICA model of allowing at most one Gaussian signal (Eriksson and Koivunen, 2004).
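Condition (B3) can be checked numerically for a given candidate mixing matrix. A minimal sketch (our own illustration, with an arbitrary example matrix and a crude numerical tolerance):

```python
import numpy as np
from itertools import combinations

def is_admissible(W, d_star, tol=1e-10):
    """Check (B3): every d* x d* sub-matrix of W = A^{-1} is invertible or zero."""
    d = W.shape[0]
    for rows in combinations(range(d), d_star):
        for cols in combinations(range(d), d_star):
            sub = W[np.ix_(rows, cols)]
            if np.all(np.abs(sub) < tol):
                continue  # all-zero sub-matrix: allowed
            if abs(np.linalg.det(sub)) < tol:
                return False  # nonzero but singular sub-matrix: violates (B3)
    return True

A = np.array([[1.0, 0.5, 0.0],
              [0.2, 1.0, 0.3],
              [0.0, 0.4, 1.0]])
print(is_admissible(np.linalg.inv(A), d_star=2))
```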

Posterior contraction rate for random series priors
We first consider random series priors in (A1), (A2) and (A3.1). The following assumptions are needed.
(C1) We assume that g_i^0 belongs to a Hölder class C^{α_i} for some unknown smoothness values α_i ∈ (0, q], i = 1, . . . , t_0, where q > 0 is a fixed constant. (C2) We assume that the true joint density p_0 is defined on [0, 1]^d without loss of generality, and is bounded below by a constant m_p > 0. (C3) Denote the support of the prior distribution Π_A by S_A. We assume that the true mixing matrix A_0 belongs to a known compact set A_0 ⊂ S_A on which the density function of Π_A is bounded below by a constant m_A > 0.
Let Π_n be the posterior distribution of p given the observed data. The following theorem gives the posterior contraction rate for random series priors; its proof is given in the Appendix.

Theorem 1. Assume that conditions (B1)-(B3) and (C1)-(C3) hold, and consider the priors specified in (A1), (A2) and (A3.1). Then there exist a transformation T_0 of the type defined in the identifiability discussion above and a constant M > 0 such that Π_n(p : d_H(p, T_0(p_0)) ≥ M ε̄_n | X_1, . . . , X_n) → 0 in probability as n → ∞, where d_H is the Hellinger distance and

ε̄_n = max_{1 ≤ i ≤ t_0} n^{−α_i/(2α_i + d_i)} (log n)^{α_i/(2α_i + d_i) + (1 − κ_2)/2}.
Theorem 1 states that the posterior distribution of the joint density p contracts around the true p_0 within an equivalence class under permutation/scaling transformations. For classical ICA estimation with no block structure, the contraction rate reduces to n^{−α^*/(2α^* + 1)} up to a logarithmic factor, where α^* = min(α_1, . . . , α_d) is the worst smoothness level among all directions. Note that this rate corresponds, without the logarithmic factor, to the classical nonparametric estimation rate for one-dimensional density functions, and it is faster than the usual rate n^{−ᾱ/(2ᾱ + d)} for estimating a d-dimensional function, where ᾱ = d/(∑_{i=1}^{d} α_i^{−1}) is the harmonic mean of the smoothness levels. The assumption 0 < α_1, . . . , α_{t_0} ≤ q is needed to ensure sufficient approximation ability of the B-spline functions used in the prior (de Boor, 2001). In the prior construction, we do not assume any prior knowledge about the true block structure or the smoothness parameters. Hence the proposed Bayesian estimation procedure is rate-adaptive to the smoothness levels in (0, q]. Note that the rate also depends on κ_2, which reflects the tail decay rate of the prior distribution of J_i. A Poisson prior satisfies κ_2 = 1 and hence helps improve the contraction rate.
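To make the comparison concrete, here is a worked instance with illustrative numbers of our own choosing (logarithmic factors omitted): take d = 3 mutually independent sources, each with smoothness α_i = 2, so that α^* = ᾱ = 2.

```latex
% d = 3, alpha_1 = alpha_2 = alpha_3 = 2, hence alpha^* = 2 and \bar{\alpha} = 2:
\epsilon_n^{\mathrm{ICA}}
  = n^{-\alpha^*/(2\alpha^* + 1)} = n^{-2/5}
  \;\ll\;
  n^{-\bar{\alpha}/(2\bar{\alpha} + d)} = n^{-2/7}
  = \epsilon_n^{\mathrm{full}} \qquad (n \to \infty),
```

so exploiting the independence structure yields the faster one-dimensional rate n^{-2/5} rather than the three-dimensional rate n^{-2/7}.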
It is possible to extend our result by considering anisotropic smoothness levels within each block; i.e., condition (C1) can be replaced by (C1') Assume that g_i^0 belongs to a tensor-Sobolev class with smoothness levels α_i = (α_{i1}, . . . , α_{i,d_i})^T for some unknown smoothness values α_{ij} ∈ (0, q] ∩ N, i = 1, . . . , t_0, where q > 0 is a fixed constant.
Then the posterior contraction rate in Theorem 1 becomes

ε̄_n = max_{1 ≤ i ≤ t_0} n^{−ᾱ_i/(2ᾱ_i + d_i)} (log n)^{ᾱ_i/(2ᾱ_i + d_i) + (1 − κ_2)/2},

where ᾱ_i = d_i/(∑_{j=1}^{d_i} α_{ij}^{−1}) is the harmonic mean of the elements of α_i. This rate can be viewed as the worst rate of estimating t_0 independent anisotropic density functions. If we ignore the block structure and assume that the mixing matrix is known, then the rate agrees with those obtained in the literature for estimating a multi-dimensional anisotropic density function (de Jonge and van Zanten, 2012; Arbel, Gayraud and Rousseau, 2013; Belitser and Serra, 2014; Shen and Ghosal, 2015). Here, the smoothness levels have to be natural numbers to ensure good approximation properties of the tensor-product B-splines; similar assumptions appeared in Shen and Ghosal (2016).

Posterior contraction rate for Dirichlet mixture priors
Next, we consider the Dirichlet mixture priors as specified in (A1), (A2) and (A3.2). The density functions are now defined on R^d. Two further assumptions, (C4) and (C5), on the true data-generating density are needed.
Condition (C4) requires the tail of p_0 to decay exponentially fast. Condition (C5) imposes smoothness conditions on g_1^0, . . . , g_{t_0}^0. Under (C4) and (C5), we obtain a posterior contraction result for the Dirichlet mixture prior (Theorem 2) that parallels Theorem 1.
Obtaining the optimal rate puts some restrictions on the prior distribution of the covariance kernel Σ_i. For example, if d_i = 1 for some i, then one may need to use a squared inverse-Gamma (instead of an inverse-Gamma) prior on the kernel covariance for g_i to obtain the optimal rate of posterior convergence. Similar arguments appeared in Theorem 1 of Shen, Tokdar and Ghosal (2013).
Note that consistency of the joint distribution of the signals does not necessarily imply consistency of the block structure or of the marginal densities. Our results can thus be viewed as a "prediction consistency" statement. Intuitively, one would expect that the marginal density estimates do not deviate from the truth (up to permutation of indexes). It will be interesting to establish posterior consistency results for these quantities in future work, building, for example, on the techniques developed by Juditsky, Lepski and Tsybakov (2009). In this paper, we only consider random series and Dirichlet mixture priors on the marginal density functions. We believe that optimal posterior contraction rates can also be obtained using other types of priors on the marginal densities, such as a Gaussian process or a Pitman-Yor process (Bhattacharya, Pati and Dunson, 2014; Scricciolo, 2014). In addition, it will be of interest to consider convergence under the sup-norm using the results of Castillo (2014).

Posterior computation and simulation studies
We generate 300 samples of incoming signals S_1, S_2, S_3 and choose the mixing matrix A as in (13). For posterior computation, we use the priors in (A1), (A2) and (A3.1). For the prior (A1) on the block structure, we exclude the case of no block structure and consider four cases, I_1 = {{1, 2}, {3}}, I_2 = {{1, 3}, {2}}, I_3 = {{2, 3}, {1}} and I_4 = {{1}, {2}, {3}}; i.e., I_4 means mutual independence, and I_1, I_2, I_3 each mean two independent densities of dimensions one and two. The prior probability is then 1/4 for each partition. For (A2), we put a normal prior on each element of A independently, and rescale each row of A to have unit norm. For (A3.1), we consider an identity link function and use the standardized B-spline basis. Then there is no need to include the normalizing integral as long as the coefficient vector θ_i belongs to a J_i-dimensional simplex for each i (Shen and Ghosal, 2015). In other words, if d_i = 1, the corresponding marginal density can be written as g_i(s) = ∑_{j=1}^{J_i} θ_{ij} B̄_j(s), where B̄_j is the B-spline basis function standardized to integrate to one. We choose cubic splines as the basis and fix J_i = 10^{d_i} for computational convenience. We use Dir(1, . . . , 1) as the prior for θ.
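For a size-one block this construction is easy to reproduce. A minimal sketch of one prior draw (our own illustration; the exact basis integrals are replaced by a grid approximation):

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(5)
q, J = 4, 10  # cubic splines, J_i = 10 basis terms for a block of size one

knots = np.concatenate([np.zeros(q - 1),
                        np.linspace(0.0, 1.0, J - q + 2),
                        np.ones(q - 1)])
theta = rng.dirichlet(np.ones(J))  # Dir(1, ..., 1) draw on the J-simplex

grid = np.linspace(0.0, 1.0, 2001)
B = np.column_stack([BSpline(knots, np.eye(J)[j], q - 1)(grid) for j in range(J)])
B /= B.mean(axis=0)  # standardize each basis function to integrate to ~1 on [0, 1]

g = B @ theta        # simplex weights + standardized basis => g is a density
print(g.mean())      # ~1.0: g integrates to one, no normalizing integral needed
```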
The main challenge in posterior computation is updating the block structure and the corresponding coefficients θ. To accommodate a parameter space of varying dimension, we use a reversible jump Markov chain Monte Carlo (MCMC) approach (Green, 1995). In particular, if the block structure at the current stage is I_i (i = 1, 2, 3), then we let I at the next stage be either the same or I_4, with equal transition probability 1/2. If the current block structure is I_4, then I at the next step can be any of I_i, i = 1, . . . , 4, with equal probability. To illustrate how dimension matching works, we first consider the move from the "lower-dimensional" I_4 to the "higher-dimensional" I_3. The coefficients under I_4 are θ_1^{(k)}, . . . , θ_J^{(k)} for k = 1, 2, 3, i.e., the coefficients of each marginal density. We keep the coefficients for the marginal density of S_3 the same and update the coefficients for the joint density of S_1 and S_2, denoted by θ_{ij} for i, j = 1, . . . , J: we generate i.i.d. auxiliary random variables η_{11}, η_{12}, . . . , η_{JJ} and set θ_{ij} from the current marginal coefficients θ_i^{(1)}, θ_j^{(2)} and the noise η_{ij}, so that the dimensions match. Conversely, to move from the "higher-dimensional" I_3 to the "lower-dimensional" I_4, we simply let the coefficients for the marginal densities of S_1 and S_2 be the corresponding margins of θ_{ij}, namely θ_i^{(1)} = ∑_j θ_{ij} and θ_j^{(2)} = ∑_i θ_{ij}.

We run the model for 25,000 MCMC iterations and discard the first 5,000. The results are summarized in Figure 1. The first two subplots show that the original signals (S_2 on the x-axis and S_3 on the y-axis) and the reconstructed signals agree well. The third subplot gives the posterior selection frequency of the four block structures, with I_1 chosen over 67% of the time. The last subplot gives the posterior mean of the marginal density of S_1 (dashed line) and the true marginal density (solid line). We find that the reconstructed density fits the data reasonably well. The estimated mixing matrix A^* is fairly close to the true A. We also run the model for a larger sample size, n = 1000, and find similar patterns in the results; the true block structure is correctly chosen over 69% of the time. When computing A^*, we use the posterior mean of each matrix element; as a result, each row does not have exactly unit length as desired. One alternative is to use the Karcher mean instead of the arithmetic mean.
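Returning to the dimension-matching moves described above, the following sketch conveys their flavor (our own reconstruction under stated assumptions: the Dirichlet form of the auxiliary variables η, the product-form split proposal and the summation-based merge are illustrative guesses, and the reversible-jump acceptance ratio with its Jacobian is omitted entirely):

```python
import numpy as np

rng = np.random.default_rng(6)
J = 10  # number of basis terms per direction

def split_move(theta1, theta2, conc=50.0):
    """I_4 -> I_3: propose joint coefficients theta_ij for the (S_1, S_2) block
    from the two marginal coefficient vectors, using auxiliary Dirichlet noise
    eta to match dimensions (J + J parameters -> J * J parameters)."""
    eta = rng.dirichlet(np.full(J * J, conc)).reshape(J, J)
    theta12 = np.outer(theta1, theta2) * eta
    return theta12 / theta12.sum()  # renormalize to the simplex

def merge_move(theta12):
    """I_3 -> I_4: deterministic reverse move, marginalizing the joint
    coefficients by summation over each index."""
    return theta12.sum(axis=1), theta12.sum(axis=0)

theta1, theta2 = rng.dirichlet(np.ones(J)), rng.dirichlet(np.ones(J))
theta12 = split_move(theta1, theta2)
t1, t2 = merge_move(theta12)
print(theta12.shape, round(t1.sum(), 6), round(t2.sum(), 6))  # (10, 10) 1.0 1.0
```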
In the second example, we compare the performance of the proposed method with two other popular ICA methods. The first is FastICA, which is based on minimizing approximations to entropy (Hyvärinen, Karhunen and Oja, 2001). The second is ProDenICA, which uses semiparametric density estimation with cubic splines (Hastie and Tibshirani, 2003). Both methods are implemented in the R package "ProDenICA". We generate data from a three-dimensional source signal with no block structure, i.e., the sources are mutually independent. We use the same mixing matrix as in (13). The marginal densities are uniform(−1, 1), Beta(2, 5) rescaled to (−1, 1), and t(3) truncated between −1 and 1. For each method, we compute the Amari metric, which takes values in [0, 1] (Hyvärinen, Karhunen and Oja, 2001), between the estimated mixing matrix and the truth A. For FastICA and ProDenICA, we assume the block structure is known and solve the classical ICA problem. For our method, we consider both scenarios: unknown block structure ("Bayes 1") and known block structure ("Bayes 2"). Boxplots of the Amari metric based on 500 replications are summarized in Figure 2, which shows the Amari distance between the mixing matrix and its estimate for FastICA, ProDenICA, and the proposed Bayes method with unknown/known block structure. Our method performs better than FastICA, and nearly as well as ProDenICA when the block structure is assumed known. Comparing Bayes 1 with Bayes 2, we find a significant improvement from using the true block structure, which suggests room for future work on improving the estimation accuracy of the block structure.
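For reference, one common normalization of the Amari metric is sketched below (our own implementation; the exact normalization used in the paper and in the ProDenICA package may differ in details, but any version vanishes exactly when the two matrices agree up to permutation and scaling):

```python
import numpy as np

def amari_metric(A_hat, A):
    """Amari distance in [0, 1]; zero iff A_hat^{-1} A is a scaled permutation."""
    P = np.abs(np.linalg.inv(A_hat) @ A)
    d = P.shape[0]
    row = (P.sum(axis=1) / P.max(axis=1) - 1.0).sum()
    col = (P.sum(axis=0) / P.max(axis=0) - 1.0).sum()
    return (row + col) / (2.0 * d * (d - 1))

A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.2, 1.0]])
perm_scale = np.array([[0.0, 2.0, 0.0],   # a scaled permutation matrix
                       [0.0, 0.0, -1.0],
                       [0.5, 0.0, 0.0]])
print(amari_metric(A @ perm_scale, A))  # ~0: equivalent up to permutation/scaling
print(amari_metric(np.eye(3), A))       # > 0: genuinely different mixing
```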

Discussion
In this paper, we study the posterior contraction rate of Bayesian ICA with an unknown block structure. In practice, a common extension is to include random noise in the output, i.e., X = AS + E, where the random noise E can be Gaussian or non-Gaussian (Eloyan and Ghosh, 2013). This problem is closely connected to Bayesian density deconvolution. It will be of interest to extend our method to accommodate random noise and obtain the posterior contraction rate of the joint density using the recent results in Sarkar et al. (2014) and Donnet et al. (2015).
In the posterior computation, we use reversible jump MCMC, which only allows for splitting and merging when updating the block structure. This may lead to low acceptance probabilities when the number of source signals becomes large. It is of interest to explore more efficient computational algorithms in future work.

Appendix: Proofs
Proof of Theorem 1. Throughout the proof, we use Π as generic notation for the prior on p, and C for a universal positive constant whose value may change with the context. For any a, b ∈ R, we write a ≲ b if a ≤ Cb, and a ≳ b if a ≥ Cb. In view of the definition of identifiability and uniqueness for p, we drop the transformation T_0 from the statement and work directly with p_0 without loss of generality. The proof proceeds by verifying the conditions of Theorem 1 of Ghosal, Ghosh and van der Vaart (2000) and Theorem 2.1 of Ghosal and van der Vaart (2001), listed below:

Π(p ∉ F_n) ≲ exp(−(C + 4) n ε_n^2),    (14)
log D(ε̄_n, F_n, d_H) ≲ n ε̄_n^2,    (15)
Π(p : K(p_0, p) ≤ ε_n^2, V(p_0, p) ≤ ε_n^2) ≳ exp(−n ε_n^2),    (16)

where F_n, called a sieve, is a subset of the parameter space of p, D(ε̄_n, F_n, d_H) is the ε̄_n-packing number of F_n under the Hellinger distance d_H, K(p_0, p) = ∫ p_0 log(p_0/p) and V(p_0, p) = ∫ p_0 log^2(p_0/p) are the first- and second-order Kullback-Leibler divergences, and ε_n, ε̄_n > 0 are two sequences of numbers going to 0. In particular, we let

ε_n = max_{1 ≤ i ≤ t_0} n^{−α_i/(2α_i + d_i)} (log n)^{α_i/(2α_i + d_i)}, ε̄_n = ε_n (log n)^{(1 − κ_2)/2}.

Note that κ_2 ∈ [0, 1]; hence ε̄_n is the posterior contraction rate because ε̄_n ≥ ε_n. We define F_n by considering sieves on the block structure, the mixing matrix and the marginal densities, where F_I is the collection of all possible partitions of {1, . . . , d} and F_A is defined by

F_A = {A = (a_{ij})_{d×d} : |a_{ij}| ≤ n^{1/τ_1}, i, j = 1, . . . , d}.
Given any I and A, suppose that there are t blocks, with sizes d_1, . . . , d_t. We can then form a sieve on the marginal densities g = (g_1, . . . , g_t) as F_{g|I,A} = F_{g_1|I,A} × · · · × F_{g_t|I,A}, where F_{g_i|I,A} restricts g_i to expansions with at most J̄_i basis terms and bounded coefficients. To verify condition (14), note that

Π(p ∉ F_n) ≤ Π(I ∉ F_I) + Π(A ∉ F_A) + ∑_{i=1}^{t} Π(g_i ∉ F_{g_i|I,A}).

The first term is zero because F_I contains all partitions of {1, . . . , d}, and the second term is bounded by d^2 c_1 exp(−c_1' n) by condition (4). For the third term, the following holds for any partition I and mixing matrix A:

Π(g_i ∉ F_{g_i|I,A}) ≤ exp(−c_2' J̄_i (log n)^{κ_2}) + J̄_i exp(−c_3 n) ≲ exp(−C n^{d_i/(2α_i + d_i)} (log n)^{2α_i/(2α_i + d_i)}).