On the Estimation of Entropy in the FastICA Algorithm

The fastICA algorithm is a popular dimension reduction technique used to reveal patterns in data. Here we show that the approximations used in fastICA can result in patterns not being successfully recognised. We demonstrate this problem using a two-dimensional example where a clear structure is immediately visible to the naked eye, but where the projection chosen by fastICA fails to reveal this structure. This implies that care is needed when applying fastICA. We discuss how the problem arises and how it is intrinsically connected to the approximations that form the basis of the computational efficiency of fastICA.


I. INTRODUCTION
I NDEPENDENT Component Analysis (ICA) is an established dimension reduction technique that obtains a set of orthogonal one-dimensional projections of high dimensional data, preserving some of the original data's structure. ICA is also used as a method for blind source separation and is closely connected to projection pursuit. The method is extensively used in a wide variety of areas such as facial recognition, stock market analysis, and telephone communications. For a comprehensive background on the mathematical principles underlying ICA and some practical applications, see Hyvärinen et al. [1], Hyvärinen [2] or Stone [3].
For ICA, the projections that are determined to be "interesting" are those that maximise the non-Gaussianity of the data, which can be measured in several ways. One quantity for this measurement that is used frequently in the ICA literature is entropy. For distributions with the same variance, the Gaussian distribution is the one which maximises entropy and all other distributions have entropy that is strictly less than this value. Therefore, as ICA looks for projections that maximise non-Gaussianity of the data, we want to minimise the value of entropy. A minimisation problem with respect to the entropy estimation must be solved to find the optimum projection of the data. Different methods are available for both the estimation of entropy and the optimisation that have different speed-accuracy trade-offs.
A widely used method to perform approximate ICA in higher dimensions is fastICA (Hyvärinen and Oja [4]). This method has found applications in areas as wide ranging as facial recognition (Draper et al. [5]), medicine (Yang et al. [6]) and fault detection in wind turbines (Farhat et al. [7]). Recent work on extensions of the agorithm can be seen Corresponding author: P. Smith (email: mmpws@leeds.ac.uk).
in [8], [9] and [10]. Because of its sustained use in many applied sciences, analysis and evaluation of the strengths and weaknesses of the algorithm is still required. The fastICA method uses a series of substitutions and approximations of the projected density and its entropy. These approximations make the method faster by trading off some accuracy in the result, and therefore care is needed when implementing the algorithm.
In this paper we study the fastICA method with the aim to compare it to true ICA. To achieve this we directly approximate entropy using the m-spacing method (Beirlant et al. [11]), and then use a standard optimisation technique to find the projection that minimises estimated entropy. Here, we will refer to this method as the m-spacing ICA method. The m-spacing method is shown to be consistent in Hall [12] and therefore m-spacing ICA converges to true ICA for large sample size.
We compare the fastICA method to the m-spacing ICA method on a two-dimensional example, where the resulting projections for the two methods differ significantly.  The main result is shown in Figure 1. This shows the full (two-dimensional) data and the different densities that result from the projections along the "optimal" directions found by m-spacing and fastICA. The density with five clear peaks (solid line) occurs from the projection of m-spacing ICA, whereas the other density (dashed line) is that of the projected data using fastICA. It is worth noting that the fact that these projections are approximately orthogonal is purely arXiv:1805.10206v2 [stat.ML] 19 Jul 2018 coincidental. It is clear that there is some structure which is preserved by the m-spacing ICA but is hidden by the fastICA method. This is an important result, as ICA is often used to find underlying structures in high-dimensional data sets and thus we would hope that some structure is also found in twodimensional examples. This is clearly not the case here. We remark that this failure by fastICA to preserve some structure is relatively robust in this example. Indeed, changing the parameters used in the fastICA method does not significantly change the outcome, and the underlying structure is still lost.
The m-spacing ICA method theoretically gives accurate results and thus is an obvious candidate for implementing ICA. However, in higher dimensions the calculations and optimisation required make it computationally expensive. Using Jacobi rotations, Learned-Miller and John [13] have implemented a method similar to m-spacing ICA in up to 16 dimensions.
Some problems with the fastICA method have previously been pointed out in the literature. Learned-Miller and John [13] use tests from Bach and Jordan [14] with performance measured by the Amari error (Amari et al. [15]) to compare fastICA to other ICA methods. They find that these perform better than fastICA on many examples. Focussing on a different aspect, Wei [16] investigates issues with the convergence of the iterative scheme employed by fastICA to optimise the objective function. The present paper identifies a more fundamental problem with fastICA. We demonstrate that the approximations used in fastICA can lead to an objective function where the maxima no longer correspond to directions of low entropy (see Figure 1). This paper is structured as follows. In Section II and III we describe the m-spacing and fastICA methods respectively. Section IV contains the example which is our main result. We conclude in Section V.

II. THE m-SPACING METHOD
In this section we define entropy and introduce one method for its approximation, which has been used previously in a slightly different way within the ICA method by Learned-Miller and John [13].
Suppose we have a one-dimensional random variable X with density f : R → [0, ∞) and variance σ 2 . Then the entropy of the distribution of X is defined to be whenever this integral exists.
Entropy can take any value on the interval (−∞, c], where c = 1 2 1 + log(2πσ 2 ) is the entropy of a Gaussian random variable with variance σ 2 (see, for example, Cover and Thomas [17]). As the definition of entropy involves the integral of a density, the estimation of this quantity from data is not trivial.
For a survey of different methods to estimate entropy from data, see Beirlant et al. [11]. For this paper we just consider the m-spacing estimator, originally given in Vasicek [18].
From Beirlant et al. [11], we have an entropy approximation based on sample spacings. Suppose we have a one-dimension sample, y 1 , y 2 , . . . , y n ∈ R such that y 1 ≤ y 2 ≤ · · · ≤ y n , and we have an m-spacing Y i,m = y i+m − y i for m ∈ 3, . . . , n − 1 and i ∈ 1, 2, . . . , n − m. Then the m-spacing approximation for entropy H(f ) is given by is the digamma function. This is a realisation of the general m-spacing formula given in Hall [12], and shown to be consistent under the condition Suppose we have data d 1 , d 2 , . . . , d n ∈ R p and set d = (d 1 , d 2 , . . . , d n ) ∈ R p×n . Finding a projection of the data in ICA is equivalent to finding w T d for some vector w ∈ R p , with w T w = 1. We want to find the vector w such that the entropy relating to the density of y = w T d is minimised. That is, using the m-spacing method, we want to find w * , where w * = argmin w, w =1 H m,n (w T d). In the example of this paper, the optim function in R with the Nelder-Mead method is used to obtain w * and the associated y value.
It is worth noting that the m-spacing approximation of entropy does not require a density estimation step. Also, in general the objective function is not very smooth. A method to attempt to overcome this non-smoothness (and the resulting local extremals, which can cause numerical optimisation issues) is given in Learned-Miller & John [13], which involves replicating the data with some added Gaussian noise. The mspacing method also requires sorting of the data, which has computational cost of order n log n.

III. THE FASTICA METHOD
We briefly describe the fastICA method from Hyvärinen and Oja [4] in order to show how surrogates are used in the method to bypass the need to calculate entropy directly. The surrogate used in fastICA is to negentropy as opposed to entropy. Negentropy is a transformation of entropy that is zero when the density is Gaussian, and strictly greater than zero otherwise. Suppose as before we have a random variable X with density f and variance σ 2 . Then the negentropy is given by where ϕ( · ; σ 2 ) is the density of a N (0, σ 2 ) random variable. The fastICA method estimates J(f ) by approximating both f and J(·), as explained below. We will use the typography fastICA when we are discussing the theoretical method, and fastICA when we are discussing the R code from the fastICA CRAN package explicitly.
We first introduce a set of functions G i , i = 1, 2, . . . , I, that is required as part of the fastICA method, and that does not grow faster than quadratically. Set for some α i , β i , γ i , δ i , i = 1, 2, . . . , I, such that the following conditions are satisfied, for i, j = 1, 2, . . . , I, where δ {i=j} = 1 if i = j and zero otherwise. Note that in the fastICA method we can theoretically specify many such functions G 1 , G 2 , . . . , G I , although the fastICA algorithm only allows I = 1. Therefore, going forward we just consider the case I = 1, with one G function (and thus one K function).
In fastICA the actual density f is approximated by a density f 0 chosen to minimise negentropy (and hence maximise entropy) under the constraints where Replacing f with f 0 allows fastICA to be computationally efficient, but the efficiency gains come at a price: the entropy of f 0 may be minimised for different projections than those that minimise the entropy of f . It can be shown that f 0 takes the form for all x ∈ R and for some constants A, κ, η and a that depend on c.
In fastICA there is a choice of two functions to use, G(x) := (1/α) log cosh(αx), α ∈ [1, 2], and G(x) := − exp(−x 2 /2). In the example of this paper all choices give very similar results and thus we only consider G(x) = (1/α) log cosh(αx), with α = 1 for simplicity. The approximations used are obtained as follows: • We have data y ∈ R with associated unknown density f ; • Use lower bound (in terms of negentropy) for f , given by f 0 ; for all x ∈ R.
• Approximate J(f 0 ) by second order Taylor expansion, where Y is a random variable with density f , Z ∼ N (0, 1), and C some constant; • By approximating the expectations using sample means, we obtain where y 1 , . . . y N are the projected samples, and z 1 , . . . , z l are samples from a standard Gaussian. HereĴ * (y) ≈ C ·Ĵ(f 0 ).
After these approximations, an iterative step is employed to approximate y * = argmax y=w T d w∈R p , w T w=1Ĵ * (y).
In the literature regarding fastICA it is often the convergence of the iterative step that is examined. It can be shown, for example in Wei [16], that this iterative step fails to find a good approximation for y * in certain situations. In contrast, we consider the mathematical substitutions and approximations from J(f ) toĴ * (f 0 ).
The steps in this chain of approximations are illustrated in Figure 2. The approximations to the density and the entropy used in fastICA dramatically decrease the computational time needed to find projections of the data. Unlike m-spacing, the negentropy approximationĴ * (w T d) is a simple Monte-Carlo estimator and does not require sorting of the data. It also benefits from the fact that an approximate derivative of w →Ĵ * (w T d) can be derived analytically. This increase of speed comes at a trade off in accuracy of the directions of projections, as discussed using the example in Section IV.
We remark that the substitution of the density f 0 that maximises Gaussianity for each projection for the actual density f results in a negentropy J(f 0 ) that is a lower bound of J(f ). Therefore,Ĵ * (y) is an approximate lower bound for J(f ) and there is no necessary relationship of the location of the maximum of the two functions.

DATA
To illustrate the kind of problems which can occur during the approximation from f tof 0 , we construct an example where the density f in the direction of maximum negentropy is significantly different tof 0 in the same direction. This results in fastICA selecting a sub-optimal projection, as shown below.
With the data distributed as in Figure 1, the negentropy over ϑ ∈ [0, π) used in both the m-spacing ICA method and the fastICA method is shown in Figure 3. The respective projections are taken in the direction that maximises negentropy in m-spacing (solid line), and the approximation to negentropy in fastICA (dashed line). The search is only performed on the half unit circle, as directions that are π apart result in the same projection. It is clear from Figure 3 that the fastICA result is poor, giving an "optimal" direction that is about π/2 away from the actual optimum. The fastICA objective function misses the peak that appears when using m-spacing.
It can be shown that for sufficiently small c, the approximation for the densityf 0 that maximises entropy given the constraints in (3) is 'close to' f 0 , and the speed of convergence is of order c 2 for c → 0. Therefore, it is our belief that the majority of the loss of accuracy occurs in the step where the surrogate f 0 is used instead of f , rather than in the later estimation steps for J(f 0 ) andĴ * . This can be shown by comparing the objective functions J(f ), J(f 0 ) andĴ * (y), and by comparing the densities f , f 0 andf 0 , although this is not present in this paper. Here, J(f 0 ) andĴ * (y) give similar directions for the maximum, and these differ significantly from the location of the maximum of J(f ). This is a theoretical problem with the fastICA method, and is not a result of computational or implementation issues with fastICA.
The distribution as shown in Figure 1 was found in the following way: The initial distribution was a set of Gaussian clouds arranged to form a regular hexagonal shape. Partial optimisation was then performed to move these Gaussian clouds so that fastICA found a direction that resulted in a low value of the m-spacing objective function compared to its maximum. Figure 4a shows the density f (dashed-dotted line) and respective approximation f 0 (long dashed line) along the direction found by m-spacing ICA (i.e. the solid line in Figure 3). Figure 4b shows the analogue for the direction found by fastICA (i.e. the dotted line in Figure 3). The dasheddotted lines in Figure 4 show the same densities as given in Figure 1.  In this paper we have given an example where the fastICA method misses structure in the data that is obvious to the naked eye, while the m-spacing method still performs well. As this example is very simple the fastICA result is concerning, and this is magnified when working in high dimensions as visual inspection is no longer easy. There is clearly some issue with the objective function used for fastICA having the property of being an approximation of the lower bound of negentropy, as this does not necessarily capture the actual behaviour of negentropy over varying projections.
To conclude this paper, we ask the following questions which could make for interesting future work.
Is there a way, a priori, to know whether fastICA will work? This is especially pertinent when fastICA is used with high dimensional data.
The trade-off in accuracy for the fastICA method comes at the point where the density f is substituted with f 0 . Therefore, are there other methods similar to that of fastICA but that use a different surrogate density which more closely reflects the true projection density?
If these two options are not possible, then potentially a completely different method for "fast" ICA is needed, one that either gives a "good" approximation for all distributions, or where it is known when it breaks down. ACKNOWLEDGMENT P. Smith was funded by NERC DTP SPHERES, grant number NE/L002574/1.