Independent Component Analysis for Brain Fmri Does Indeed Select for Maximal Independence

4000 characters (w/ spaces) 10 refs Introduction: A recent paper by Daubechies et al. claims that two independent component analysis (ICA) algorithms, Infomax and FastICA, widely used for fMRI analysis


Introduction
Independent component analysis (ICA) [1][2][3][4] is a widely used signal processing approach that has been applied to areas including speech separation, communications, and functional magnetic resonance (fMRI) data analysis. Given a set of linearly mixed observations, recovering the underlying components is an ill-defined problem. However, the assumption of independence among the sources turns out to be surprisingly powerful and effective for a wide range of problems in various practical domains.
Sparsity is another commonly imposed assumption that arises naturally from the principle of parsimony: the simplest explanation is preferred. Sparsity is also motivated by evidence of neuronal coding efficiency and sparse coding in the nervous system. Sparse representations can help avoid the problem of overfitting while also leading to solutions that are easier to interpret. Applications of sparse signal processing methods include dictionary learning [5], speech separation [6], and feature learning [7].
Daubechies et al. [8] claims that ICA for fMRI optimizes for sparsity rather than independence. This is established by first noting that Infomax and FastICA are two algorithms widely used for fMRI analysis and then showing that they separate sparse components better than independent ones on a synthetic dataset. Recreating the synthetic dataset and conducting additional experiments shows that the FastICA and Infomax algorithms indeed do what they are designed to do. Both ICA algorithms can separate sources with either high or low degrees of sparsity, as long as the distributional assumptions of the algorithms are approximately met. To understand the conditions under which these algorithms work requires correct interpretation of what the sources are in an ICA formulation. We examine exactly what the sources are in the examples given in Daubechies et al. [8] and show that there is an important mismatch between the concept of source therein and what an ICA source actually is, which is ultimately at the heart of the unsupported conclusions presented in Daubechies et al. [8].

Review and Critique of the Presented Evidence
We now briefly review the evidence presented in Daubechies et al. [8] to support the claim that Infomax [3] and FastICA [9] select for sparsity and not independence. Following Daubechies et al., we refer to the versions of the two algorithms with their default nonlinearities, sigmoid for Infomax, which is a good match for sources with super-Gaussian distributions, and the high kurtosis nonlinearity for FastICA. Daubechies et al. [8] exhibits experimental results in which 1) ICA algorithm performance suffers when the assumptions on the sources are violated, and 2) ICA algorithms can separate sources in certain cases even if the sources are not strictly independent. The two points above, both of which were already widely known in the ICA community at the time, are not sufficient evidence to support the claim that ICA selects for sparsity and not independence. In addition, Daubechies et al. [8] presents a case in which the sources are somewhat dependent but also very sparse, and Infomax and FastICA do well. This result is used to claim that it is sparsity rather than independence that matters. We augment this experiment with new evidence which shows that the same ICA algorithms perform equally well in the case of both minimum and maximum sparsity (using the definition of sparsity in Daubechies et al. [8]), suggesting that the role of sparsity (if any) is minor in the separation performance.
Additional evidence in Daubechies et al. [8] involves a discussion of sparsity in which it is claimed that ICA can separate Gaussian sources (See Legend of Fig.8 in Daubechies et al. [8]) which are also sparse (utilizing a definition of sparsity different from the one initially provided in Daubechies et al. [8]). If true, such a result would support their claim about the role of sparsity in ICA, since it is well established that blind ICA algorithms are not able to separate two or more Gaussian sources. However, as we show, in that example the sources as they are generated are highly non-Gaussian, and the sparsity mentioned in Daubechies et al. [8] does not actually refer to the sources. Rather, it refers to vectors that span parts of both sources. This renders their statement incorrect and hence, does not support the claim being made (see Section ''Sparsity and sources that are mixture of Gaussians'' for details).
Finally, the paper [8] is focused on showing cases where FastICA and Infomax perform well or poorly, and from these cases the claim is made that this applies to ICA of fMRI in general. There is mention that a more general algorithm [10] does not work for fMRI, but there is no evidence presented to support this claim. As we later discuss in Section ''On the application of ICA to fMRI,'' other ICA algorithms had indeed been used on fMRI data with success, at the time of the publication [8]. Since then, more flexible ICA algorithms have been applied to fMRI data and noted to demonstrate even better performance than the widely used Infomax and FastICA [11]. Hence, while emphasizing that Infomax and FastICA are not the only two algorithms that have been applied to fMRI analysis, we also note that the prevalence of the use of these two is largely due to the availability of the code for these algorithms and their default use in toolbox implementations for fMRI analyses. Since most of the fMRI community does not specialize in the development of blind source separation algorithms, they have since opted in general for the use of these two implementations. And although they do perform reasonably well on fMRI data, sparsity is not the major driver of this success.

Experiments on Synthetic Data: Boxes
We now describe the synthetic dataset used in the original paper [8]. Two components C 1 and C 2 are generated as follows:   (2)) for the various algorithms at those four points (see Table 1 for full list). Also note the best separation performance of Infomax and FastICA for the maximum kurtosis case, which corresponds to almost the lowest level of sparsity In all cases, a controls the relative position of the boxes, and a~0 gives statistical independence between C 1 and C 2 .
The Statistical Properties of Synthetic Data in Daubechies et al. [8] Daubechies et al. [8] argues, based largely on results from synthetic datasets using boxes to represent activated regions of a component (see details above), that it is sparsity rather than independence that enables the recovery of the components. However, the case where the algorithms fail is actually due to a mismatch between the algorithms' assumptions and the statistical properties of the simulated data. In addition, we demonstrate a case where they perform best, which corresponds to almost the lowest sparsity (i.e., not sparse). To facilitate cross-referencing, in the results presented herein, we use the first definition of sparsity ( #Vi #V ) provided in Daubechies et al. [8]. Note, however, that the quantification of sparsity may be ambiguous: see Section ''On the definition of sparsity'' below, and the two definitions of sparsity in Daubechies et al. [8].
Let us first concentrate on the choice of sources. In Figure 1, we see the excess kurtosis of the simulated sources changes with the relative size of the activation region. For medium and large boxes, the two cases where Infomax and FastICA are noted to fail, the kurtosis values are close to that of a Gaussian (i.e., zero), almost corresponding to the two zero-crossings. Moreover, in these cases the distributions are bimodal, far from the unimodal super-Gaussian assumptions that underpin the nonlinearities of Infomax and FastICA used in Daubechies et al. [8]. The paper [8] showed that Infomax with a non-linearity matched to super-Gaussian sources fails for medium and large boxes, roughly regardless of the relative position of the box; but it was not noted that the sources C i were very close to Gaussian (in the sense of kurtosis) and in disagreement with the nonlinearity. Both of these facts create very challenging scenarios for ICA algorithms based on the assumption   of unimodal, super-Gaussian sources, as is the case in Infomax and FastICA, and of course sources are not even close to the ''ideal'' setup for these algorithms, contrary to the claim on p.10418 in Daubechies et al. [8]. In fact, under these scenarios components would not be expected to be well separated with either of these algorithms -because of the mismatch of the distribution (for Infomax) and an approximately zero kurtosis (for FastICA). It is noted in Daubechies et al. [8] that the sources are designed by matching their cumulative distribution function (CDF) to the nonlinearity of the algorithm, resulting in ''optimal'' detectability for Example 1 [8], and (intentionally) enforcing a ''slight mismatch'' for Example 2 [8]. First, these two CDFs are actually the same, except for a scaling factor, which would translate to the so-called scaling ambiguity in ICA. More importantly though, there is a mismatch in vocabulary between what is being identified as the underlying ICA source in Daubechies et al. [8] and what it actually is in the experiment. Specifically, the nonlinearity matches solely to the activation part of the components thereby neglecting the background, whereas the ICA source is to be understood as a combination of the two, and thus has a distribution that is a mixture distribution, i.e., a weighted sum of both activation and background distributions. Hence the claim (p. 10418, 1st column): ''For the first choice, the parameters of our ICA implementations provide optimal 'detectability' in the sense that the nonlinear function defined by the parameter setting of the algorithm coincides with the CDF of the signal source;'' is incorrect since the source in this linear source separation framework cannot refer to only a part of the underlying distribution. As it turns out, in Example 2 [8] there is actually a large mismatch (rather than a ''slight mismatch'') with respect to the algorithm's nonlinearity in that the source distributions are essentially bimodal (see Figure 1, medium box inset).

Boxes Revisited
In the boxes experiment, there are four quantities that are varied: the relative position of the boxes (controlling the amount of overlap), the size of the boxes (small, medium, large), the distribution of the marginal (i.e., the source C i ), and the joint distribution. The shift of the box changes the amount of overlap and, thus, the joint distribution/dependence. The box size controls the sparsity (small box = high sparsity, large box = low sparsity) through the proportion of v[V i , and thus changes the marginal distribution of the sources C i . Clearly, there is dependence between all four quantities, which makes interpretation of the results ambiguous at the least. This is a side effect of the way the sources are sampled in Daubechies et al. [8], which is not independent and identically distributed (i.i.d.) due to the use of the indicator function to define boxes in the spatial map (the sampling distribution is not identical but, instead, conditioned on the location of each sample). With such a design it is very difficult to understand what causes the experimental differences, which is contrary to the claim [8] that it is ''easy to change each of these characteristics separately''. In addition, in the experiments, a single fixed mixing matrix is used, which is not an ideal way to evaluate performance as results are then biased to a specific (and unjustified) set of mixing matrix parameter choice.
In order to furnish a clear, unbiased interpretation of the effect of the marginal source distributions (closely related to the box-sizes in Daubechies et al. [8]) on the performance of ICA algorithms that exploit non-Gaussianity, we first eliminate the effects of all other parameters by limiting ourselves to the case of two independent sources C 1 and C 2 . Then we generate samples directly from marginal distributions that match those in Daubechies et al. [8].
Since the sources defined in Daubechies et al. [8] have distributions that are of mixture type, we can write the CDF of each source C i as W Ci~q W x z 1{q ð ÞW y , where 0vqƒ1 with q~# V i #V , and then draw a set of i.i.d. samples. Under these conditions, the joint distribution of all samples reads: where v is the sample index and V j j~#V. The first equality follows from independent sampling, the second equality from the independence between components C 1 and C 2 , and the third equality from the samples being identically distributed (same distribution regardless of the sample index v). As such, we may generate all samples using independent samples from the inverse Here W x is the logistic distribution for activation and W y is the logistic distribution for background as defined in Daubechies et al. [8], and q is the relative area of the activation. To achieve the required visual contrasts -small, medium, large and very large boxes, at any desired position -we reorder the twodimensional samples, never decoupling the realizations of the sources. The final result, while having a similar visual appearance as the experiments of Daubechies et al. [8], retains the joint pdf. This eliminates possible confusion with respect to the influence of the different box parameters on the results of our experiments. We then compute our results using four algorithms: 1) Infomax with the standard sigmoid nonlinearity that assumes a unimodal super-Gaussian source, called Infomax (super); 2) FastICA with the same nonlinearity used in Daubechies et al. [8], which is y 3 ; 3) Infomax with a nonlinearity which assumes a sub-Gaussian source, called Infomax (sub); and 4) ICA-EBM (ICA by entropy bound minimization), a much more flexible ICA algorithm [12] able to deal with both super-and sub-Gaussian sources.
Results are averaged over 100 source realizations (each using a different random full-rank mixing matrix A) and 10 ICA runs (see Table 1). We also report two performance metrics, first, using the metric chosen in Daubechies et al. [8] EI{WAE, which is not invariant to the scaling and permutation ambiguities inherent to ICA. Hence, we also report the results using the inter-symbol interference (ISI), or normalized Moreau-Amari index [13], which is invariant to the scaling and permutation ambiguities: Here, p ik are the elements of the matrix P~WA, and L is the number of sources. This performance metric is bounded between 0 and 1 and the lower the ISI value the better the separation performance (the performance metric is zero if and only if the model is identified up to the scaling and permutation ambiguities).
As expected, the most flexible approach, the ICA-EBM algorithm, performs well (ISI,0.1) in all cases (Table 1). Infomax (sub) performs well to moderately-well for the large and medium boxes, both of which are bimodal and have a kurtosis that is close to that of a Gaussian random variable. Infomax (super) and FastICA perform marginally well or poorly in those cases but perform very well for the cases of very large boxes (maximum kurtosis) and for small boxes. This makes intuitive sense, as highkurtosis data matches the underlying assumptions of both Infomax (super) and FastICA in that the source distributions are unimodal and strongly super-Gaussian. These results directly contradict the claim in Daubechies et al. [8] that Infomax (super) and FastICA select for sparsity, since the maximum kurtosis case also has the lowest sparsity of the four (again using the first definition of sparsity in Daubechies et al. [8]).

Sparsity and Sources that Are Mixture of Gaussians
In the sparsity section in Daubechies et al. [8, p.10421, Fig. 8] there are several incorrect statements that are important and require a careful critique. First, Daubechies et al. [8] claims that the sources in the so-called ''promotional material for ICA'' are Gaussian. We show below that they are in fact highly non-Gaussian. Second, a definition of sparsity different from the one proposed earlier in the paper [8] is used to claim that the sources are sparse. We show that this sparsity does not refer to the sources and in actuality they are not sparse. Finally, we correct several other statements within that section.

Counter proof to claim of Gaussian sources
To identify the distribution of the sources in this example, it is sufficient to look along the mixing directions a and b. Observations are defined as r~cr 1 z 1{c ð Þr 2~c a 1 azb 1 b ½ z 1{c ð Þ a 2 azb 2 b ½ . Reordering the terms gives rã ca 1 z 1{c ð Þa 2 ½ zb cb 1 z 1{c ð Þb 2 ½ a b ½ : s~As, s~s a s b ½ T , s a~c a 1 z 1{c ð Þa 2 , s b~c b 1 z 1{c ð Þb 2 . Thus, the sources can be identified as mixture distributions. Their distributions are given as p sa~l p a1 z 1{l ð Þp a2 and p s b~l p b 1 z 1{l ð Þp b 2 , where l is the parameter of the Bernoulli distribution of which c are the realizations. Notice that contrary to what one might expect, a mixture of Gaussian random variables through a Bernoulli random variable c as above, in general does not yield a Gaussian random variable, but rather a random variable whose pdf is a weighted mixture of two independent Gaussian pdfs. Finally, since the distributions p a1 , p a2 , p b 1 , and p b 2 (p a1~pb 2~p t ð Þ) are all Gaussian and no two distributions in a mixture have the same variance -e.g., for the choice in Daubechies et al. [8], p a2 t ð Þ~10p a1 10t ð Þ, which implies Var a 2 ð Þ~100Var a 1 ð Þ~100s 2 -the resulting distribution must be non-Gaussian whenever l 6 [ 0,1 f g. Hence the statement in Daubechies et al. [8] that ''Each component has a Gaussian distribution'', is incorrect; the components are in actuality highly non-Gaussian (see Figure 2 (A-B)).

Critique of the claim of sparse components
In the same section there is a claim that the components (i.e. sources) in this example are sparse: '' Fig. 8 depicts processes with 2 sparse rather than independent components''. The definition of Bsparsity in this section regards the number of elements (B) in a zero-mean random vector that have variance close to zero. Thus, a 1-sparse 2D random vector means that the variance of one of the two elements is close to zero. However, B-sparsity in this section does not refer to the components at all; rather, it refers to parts of the components together, specifically, the 2D Gaussian vectors a 1 b 1 ½ T and a 2 b 2 ½ T , which are 2D, 1-sparse vector processes. In actuality, however, the components s a and s b are not sparse for the choice of l~50% and l~30% used in M 1 and M 2 , respectively. This is because Var s a ð Þ~s 2 99lz1 100 and Var s b ð Þ~s 2 100{99l 100 are far from 0 for the choices of l in M 1 and M 2 , and both are typically &0 for l= 0,1 f g. Therefore, it cannot be sparsity that is driving these algorithms towards the solution.

A few additional clarifications
There are two other sentences in the section on sparsity in Daubechies et al. [8] which require some clarification. First, in the sentence ''However, in the example given here, is Gaussian; because ICA methods cannot separate mixtures of independent Gaussian processes, the successful separation of components by Infomax and FastICA underscores again their ability to identify sparse components'' p t ð Þ is not the distribution of the components s. In addition, the statement instills belief that this example has only a single mixing process, when in fact it has two: 1) the mixing of the (Gaussian) a k 's and b k 's through l, which gives the (non-Gaussian) sources s i , and 2) the mixing of sources s i through the mixing matrix A~ab ½ . The statement suggests Infomax and FastICA can unmix the Gaussian random variables a k ,b k which constitute the mixture distribution of a source (i.e. the two parts of a single source s i ) which is clearly incorrect (they unmix the sources s i , not their subparts). Lastly, the sentence ''Infomax or FastICA identify the 2 special directions a and b correctly as the components'' incorrectly labels a and b as components, when they actually are the mixing coefficients that make up the A matrix.

ICA of Sources with Mixture of Gaussians Distribution
The discussion related to the example in Figure 8 of the original paper [8] initially notes that mixtures of independent Gaussian random variables cannot be recovered by ICA, which is true if each source comes from a single Gaussian distribution, and the algorithms are only based on higher-order statistics, as in the case of Infomax and FastICA (i.e., the algorithms do not exploit sample correlation). However, these algorithms (and many others that have been developed and also applied to fMRI data [14]) can separate sources whose probability density can be represented via a Gaussian mixture model, as long as the resulting distribution itself is not a Gaussian. The latter is the case in the example presented in Figure 8 of Daubechies et al. [8], which was incorrectly seen as evidence that sparsity was the driving force helping ICA to recover Gaussian sources. We showed that the sparsity mentioned in Daubechies et al. [8] is not related to the sources. Also, this example utilizes a mixture of Gaussians as the sources. With the parameters described in Daubechies et al. [8], the sources are in fact super-Gaussian (i.e. they have positive excess kurtosis, as shown in Table 2). Infomax and FastICA with nonlinearities selected to match a super-Gaussian distribution are expected to successfully separate such sources, as also is the more flexible ICA-EBM algorithm [12]. Conversely, Infomax with a nonlinearity selected to be sensitive to sub-Gaussian sources is expected to exhibit suboptimal performance (see Table 2). This can also be visualized in Figure 2 where we show the sources and the mixtures for the case of l~30% as described in Daubechies et al. [8]. This example again points to the confusion discussed in the Section ''The statistical properties of synthetic data in Daubechies et al. [8]'' with respect to the definition of the underlying ICA sources, i.e., what is actually being simulated and what is assumed in Daubechies et al. [8].

On the Definition of Sparsity
In coding theory, whether in transmission or in storage of a signal, a trade-off often is necessary between attainable compression rates and signal restoration error. In this context, sparsity is a signal property that allows for high compression rates, while compromising only little in the restoration error. A sparse signal generally consists of N~#V coefficients of which n%N coefficients concentrate all information within the signal. Indeed, under the hypothesis that coding a string of zeroes has little cost in resources with respect to coding whatever floating/integer number, all other N{n coefficients could be set to zero without significant loss of information but with a substantial gain in compression rate.
A legitimate question now is what about a signal of which all but 1 coefficient differ from a number, say, m. Let that one coefficient equal zero. Is that signal sparse? Under the above definition, the signal would not be considered as sparse, since only a single coefficient could be coded as a zero without introducing a reconstruction error. However, if we would allow for coding a shift by m, then coding N{1 coefficients as zero would result in a reconstruction error e upper bounded by m k k (and we would find e~0 with probability N {1 ). It is clear from this very simple example that it is important to appropriately choose the origin for the coordinate system (z 0 ) in which one foresees to evaluate the sparseness of the signal. For the model considered in Daubechies et al. [8], we plot the sparsity measure ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Here, the ordinary sparsity measure (as understood in Daubechies et al. [8]) is taken with respect to z 0~{ 1, i.e., the mean of the ''background distribution'', with sparsity decreasing as the active region size increases (see Figure 3). Note that for fMRI we typically use zero-mean samples when using ICA, thus measuring our sparsity with respect to the mean of the mixture model.

On the Application of ICA to fMRI
We also note that, contrary to the claims in Daubechies et al. [8], Infomax and FastICA, though the most widely used at the time -due in large part to their availability in fMRI-friendly software packages -were not the only ICA algorithms that had been applied to fMRI analysis with success at the time [15,16]. This trend has continued and in recent years even more flexible algorithms such as those based on entropy bound minimization (EBM) or full blind source separation (FBSS) have been used increasingly to analyze fMRI data, outperforming both Infomax and FastICA [14,17,18]. In general, we would recommend that these and other more recent algorithms preferentially be applied to fMRI, as they are generally more robust to non-super-Gaussian and/or multimodal distributed sources which can occur in real fMRI data, observed in the context of certain artifacts. These algorithms and many others are implemented in the group ICA of fMRI toolbox (GIFT; http://mialab.mrn.org/software/gift). An interesting historical note is that before extended Infomax [19] was introduced, there was confusion as to how ICA of fMRI really worked when it was applied as temporal ICA and early results indeed were not convincing -since time courses are more likely to be sub-Gaussian than super-Gaussian [20], whereas in the spatial ICA case super-Gaussian sources are more common. Another important point regarding the real fMRI experiment mentioned in Daubechies et al. [8] is that each voxel is identified as belonging to only one underlying source (page 10416, left col, third paragraph). Such an approach is perhaps a reflection of the way one might approach an fMRI experiment with a sparsity focus, but in reality, and more in line with the complexity and connectivity of the human brain, each voxel typically has a contribution from multiple components (sources), making this an ideal case for ICA.

Conclusions
We reviewed the main claim made in Daubechies et al. [8] and its supporting evidence. We revisit the initial experiments and present new evidence showing conclusively that the arguments fall short of supporting the claim that Infomax and FastICA select for sparsity and not for independence. While pointing out that the use of other metrics for fMRI analysis such as sparsity -besides independence, which is widely used -is a reasonable goal, the claims that are used to justify this desire are misleading at best and in some cases are simply incorrect. In summary, we show that ICA algorithms, including FastICA and Infomax, are indeed doing what they were designed to do, maximize independence.