Separation of instantaneous mixtures of a particular set of dependent sources using classical ICA methods

This article deals with the problem of blind source separation in the case of a linear and instantaneous mixture. We first investigate the behavior of known independent component analysis (ICA) methods in the case where the independence assumption is violated: specific dependent sources are introduced and it is shown that, depending on the source vector, the separation may be successful or not. For sources which are a probability mixture of the previous dependent ones and of independent sources, we introduce an extended ICA model. More generally, depending on the value of a hidden latent process at the same time, the unknown components of the linear mixture are assumed either mutually independent or dependent. We propose for this model a separation method which combines: (i) a classical ICA separation performed using the set of samples whose components are conditionally independent, and (ii) a method for estimation of the latent process. The latter task is performed by iterative conditional estimation (ICE). It is an estimation technique in the case of incomplete data, which is particularly appealing because it requires only weak conditions.


Introduction
For the last decades, blind source separation (BSS) has been an active research problem: this popularity comes from the wide panel of potential applications such as audio processing, telecommunications, biology, etc. In the case of a linear multi-input/multi-output (MIMO) instantaneous system, BSS corresponds to independent component analysis (ICA), which is now a well recognized concept [1]. Contrary to other frameworks where techniques take advantage of a strong information on the diversity, for instance through the knowledge of the array manifold in antenna array processing, the core assumption in ICA is much milder and reduces to the statistical mutual independence between the inputs. However, the latter assumption is not mandatory in BSS. For instance, in the case of static mixtures, sources can be separated *Correspondence: marc.castella@it-sudparis.eu 1 Institut Mines-Télécom/Télécom Sudparis, CNRS UMR 5157 SAMOVAR, 9 rue Charles Fourier, 91011Évry Cedex, France Full list of author information is available at the end of the article if they are only decorrelated, provided that their nonstationarity or their color can be exploited. Other properties such as the fact that sources belong to a finite alphabet can alternatively be utilized [2,3] and do not require statistical independence. We consider in this article the case of dependent sources without assuming nonstationarity nor color.
To the best of authors' knowledge, only few references have tackled the issue of dependent source separation [4][5][6][7][8][9][10][11][12][13][14][15], although the interest in dependent sources has been witnessed by studies in various applied domains such as cosmology [6,13,14], biology/medicine [7,8,16], feature extraction [17]. Among the interesting proposed extensions of ICA to dependent components, we should mention tree-dependent models [11] and models with dependence in variance profiles [12]. Contrary to the mentioned articles, our approach is based on the selection of an appropriately chosen sub-sample of the available data, which then feeds the entry of a classical ICA method.
Among ICA or BSS methods, one can distinguish two approaches: some methods recover the sources one by http://asp.eurasipjournals.com/content/2013/1/62 one, which is what we refer to as multi-input/singleoutput (MISO) approaches. These approaches are often used in conjunction with a so-called deflation procedure [18,19]. In contrast, other approaches, which will be referred to as MIMO recover all the sources simultaneously.
Inspired from [20][21][22][23], we investigate in a first part of the article the behavior of the kurtosis contrast function: this criterion is well-known in MISO BSS approaches and we study some of its properties in some specific cases of dependent sources.
In a second part of the article, we investigate a particular model which combines an ICA model with a probabilistic model on the sources, making them either dependent or independent at different time instants. Our method exploits the "independent part" of the source components. Although it is possible to refine our model by introducing a temporal dependence, it assumes neither nonstationarity nor color of the sources. We would like to outline the difference between our study and [17]: the latter assumes a conditional independence of the sources, whereas, depending on a hidden process, we assume either conditional independence or dependence. The proposed separation method which is introduced relies on iterative conditional estimation (ICE), which has been introduced recently [24].
The considered model and notations are specified in Section 2. In Section 3, specific dependent sources are introduced and the behavior of the kurtosis contrast function is investigated. Then, Section 4 introduces a genuine model of dependent sources, for which separation is possible. The principles of our method and a discussion on ICE are provided in Section 5. The algorithm is precisely described in Section 6, where a parallel is also made with the accept-reject random generation method. Some simulations are provided in Section 7 and Section 8 concludes authors' study.

Linear mixture
We consider a set of T samples of vector observations. At each time instant t ∈ {1, . . . , T} the observed vector is denoted by x(t) (x 1 (t), . . . , x N (t)) T . We assume that these observations result from a linear mixture of N unknown and unobserved source signals. More precisely and in other words, there exists a matrix A ∈ R N×N and a vector valued process s(t) (s 1 (t), . . . , s N (t)) T such that: x(t) = As(t), ∀t ∈ {1, . . . , T}.
Let X (x(1), . . . , x(T)) be the N × T matrix with all samples of the observations and S (s (1), . . . , s(T)) be the N × T matrix with all sources samples. The matrix A is unknown and the objective consists in recovering S from X only: this is the so-called blind source separation problem. We will assume here that A is a square leftinvertible matrix and the problem thus reduces to the estimation of A or its inverse. A solution has been developed for long and is known as ICA [1]. It generally requires two assumptions: the source components should be non Gaussian-except possibly one of them-and they should be statistically mutually independent. With these assumptions, it is known that one can estimate a matrix B ∈ R N×N such that y(t) = Bx(t) restores the sources up to some ambiguities, namely ordering and scaling factors. In this article, with no loss of generality, we assume that the sources are zero-mean and have unit power. Finally, note that if A is a tall matrix (i.e., there are more observations than sources), a dimensionality reduction technique such as the principal component analysis (PCA) can be used to obtain a mixture with as many observations as sources.

Notations
In the following, B denotes the estimated inverse of A and is referred to as the separating matrix. Defining G BA the combined mixing-separating matrix, the BSS problem is solved if G is a so-called trivial matrix, i.e., the product of a diagonal matrix with a permutation: these are well known ambiguities of BSS.
In Section 3, we will study separation criteria as functions of G. Source separation sometimes proceeds iteratively, extracting one source at a time (e.g., deflation approach). In this case (and particularly in Section 3), we will write y(t) = bx(t) = gs(t) where b and g = bA, respectively, correspond to a row of B and G and y(t) denotes the only output of the separating algorithm. In this case, the separation criteria are considered as functions of g.
Finally, Cum {.} denotes the cumulant of a set of random variables (see e.g., [1]) and Var{.} denotes the variance of a random variable. For a random vector s = (s 1 , . . . , s N ) T and for any multi-index i = (i 1 , . . . , i N ), we introduce the notation: We denote by N (μ, σ 2 ) the Gaussian law with mean μ and variance σ 2 and by L(λ) the Laplace (i.e., double-exponential) distribution with zero-mean and scale parameter λ. The symbol ∼ denotes the law followed by a random variable or equality between probability distributions; the conditional distribution of a random variable (or vector) r knowing X under a parameter value θ is denoted by P(r | X; θ). Finally, we denote by δ(.) the Kronecker function which equals 1 if the argument in parenthesis holds true and 0 otherwise. http://asp.eurasipjournals.com/content/2013/1/62

A class of dependent sources
In this section, we introduce particular dependent sources based on products of independent signals. These models were shown to be useful when dealing with underdetermined [20] and nonlinear [23] mixtures. We assume that all processes are stationary. At each time instant t, the vectors s(t) and x(t) are realizations of random vectors. Since no confusion is possible, in this section only, we drop the time index t and these vectors are denoted, respectively, by s and x.

Specific sources and properties
Binary phase shift keying (BPSK) signals have specificity that will allow us to obtain source vectors with interesting properties. By definition, BPSK sources take values s = +1 or s = −1 with equal probability 1/2. We define the following source vector: A1. Let be a BPSK random variable and a a real-valued non Gaussian random variable with non-zero fourth-order cumulant κ (a) We assume also that a is independent of and E{a} = E{a 3 } = 0, E{a 2 } = 1. Then, we define the source vector s (s 1 , s 2 , s 3 ) T as follows: Interestingly, the following lemma holds true: Proof. From A1, s 1 = a and s 3 = have zero-mean by definition and so does s 2 = a by independence of a and . Hence E{s 1 } = E{s 2 } = E{s 3 } = 0 and for such centered random variables, it is known that cumulants can be expressed in terms of moments as follows: Cum s i , s j , s k , s l = E{s i s j s k s l } − E{s i s j }E{s k s l } − E{s i s k }E{s j s l } − E{s i s l }E{s j s k } It is then possible to check all cases of Equations (2) and (3), using again A1. We obtain that (2) vanishes for i = j and the decorrelation of the sources follows. The values of the fourth-order cumulants in the lemma are obtained similarly.
Depending on s 1 = a, more can be proved about the source vector defined by A1. For example, if the probability distribution of a is symmetric, then s 2 and s 3 are independent. On the contrary s 1 and s 2 are generally not independent. An even more specific case is obtained when s 1 = a is itself BPSK. Using in Lemma 1 the fact that, in the latter specific case s 2 2 = a 2 = 1, and calculating also the pairwise probability functions, we obtain the following result:

Properties of the kurtosis contrast function with sources satisfying A1
Consider a source vector s which satisfies Assumption A1. If in addition s 1 = a is BPSK, the arguments in [22] prove that a mixture of such a source vector can be separated by many classical ICA algorithms such as CoM2 [25], JADE [26], and FastICA [27]: this follows straightforwardly from Lemma 2 and the fact that the corresponding algorithms rely only on the vanishing of the cross-cumulants of the sources, that is on Equation (4). We now extend this result to more general distributions of a and state the following proposition a :   • If κ (a) 4 > 2, the maximization of (5) leads to extraction of either s 1 = a or s 3 = , • if −2 < κ (a) 4 < 2, the maximization of (5) leads to extraction of s 3 = , • if κ (a) 4 = −2, the maximization of (5) leads to extraction of either one of the sources s 1 , s 2 or s 3 , • if κ (a) 4 = 2, (5) does not define any contrast function and its maximization leads either to extraction of s 3 or to a mixture of s 1 and s 2 only.

Remark 1.
Since a is zero-mean, unit-variance, the fourth-order cumulant satisfies κ cases are hence given in the above proposition.

Remark 2.
The above result characterizes the global maximum of the criterion. However, one should remember that most optimization algorithms search for a local maximum only and may therefore fail to reach the global maximum: however, in simulations (Section 3.3), we did not observe convergence to any spurious local maximum. The same remark holds for Propositions 2, 3, and 4.
Proof. First note that for all α ≥ 1, the criterion in (5) reaches its maxima for the same values of g. We hence consider α = 1 in the proof. Using y = gs, the multilinearity of the cumulants and Lemma 1 which holds for sources satisfying A1, we obtain: Since the maxima of the criterion (5) under the constraint g 2 = g 2 1 + g 2 2 + g 2 3 = 1 are either minima or maxima of κ (y) 4 under the same constraint, we introduce the following Lagrangian: The sought extrema necessarily satisfy: ∂L ∂g = 0 and: ∂L ∂λ = 0 ( 7 ) The corresponding system is polynomial. Using wellknown algebraic techniques now implemented in many computer algebra systems [28], one can get a system of triangular equations equivalent to (7), whose solutions can be given explicitly. For κ (a) 4 = 2, there are 26 solutions to (7), some of them being complex-valued depending on κ (a) 4 . In Table 1, we give the real-valued solutions to (7) and the corresponding value of κ  Table 1 and check which one maximizes |κ  4 3 and, using again the Lagrangian, it can be checked by hand that the extrema of κ (y) 4 subject to g 2 = 1 satisfy either g 1 = g 2 = 0, g 2 3 = 1 or g 2 1 + g 2 2 = 1, g 3 = 0.
Amazingly, the above result does not hold any longer if one considers a mixture of only the first two components of the sources given by Assumption A1.

Proposition 2.
Let y = gs where the vector of sources is given by the first two components s = (s 1 , s 2 ) T of the sources defined by A1. The function in Equation (5) satisfies: it is a contrast function and its maximization leads to extraction of either s 1 or s 2 (that is: g = (±1, 0) or g = (0, ±1)), it is not a contrast function and its maximization leads to a non-separating solution of the type g = (± 1 ).  Proof. The proof is similar to the proof of Proposition 1. Indeed, we have κ (y)

Number of solutions Solution in
. Using a computer algebra system, one can then check that (7) is satisfied at 8 points. These points correspond to values of (g 1 , g 2 ) and κ (y) 4 which are given in the first three rows of Table 1 (precisely those rows for which g 3 = 0). Then, (5) is a contrast function if and only if its maximization yields a separating solution, that is if |. The proposition then follows easily.

Pairwise independent sources
We now investigate the particular case of pairwise independent sources and introduce the following source vector: This case has been considered in [20], where it has been shown that Cum {s 1 , s 2 , s 3 , s 4 } = 1 and all other cross-cumulants vanish. The latter cumulant value shows that the sources are mutually dependent; although it can be shown that they are pairwise independent. It should be clear that pairwise independence is not equivalent to mutual independence but in an ICA context, it is relevant to recall the following proposition, which is a direct consequence of Darmois' theorem ( [25], p. 294):

Property 1. Let s be a random vector with mutually independent components, and x = Gs. Then the mutual independence of the entries of x is equivalent to their pairwise independence.
Based on this proposition, the ICA algorithm in [25] searches for an output vector with pairwise independent components. Let us stress that this holds only if the source vector has mutually independent components: pairwise independence is indeed not sufficient to ensure identifiability as we will see in following section.

Pairwise independence is not sufficient
We first have the following preliminary result: Proof. If y = gs, using the fact that s 2 i = 1 for i = 1, . . . , 4, we have with the particular sources given by A2: Since (s 1 , s 2 , s 3 ) take all possible values in {−1, 1} 3 , we deduce from y 2 = 1 that the following equations necessarily hold: First observe that values given in (9) indeed satisfy (10). Yet, if a polynomial system of N equations of degree d in N variables admits a finite number of solutions b , then there can be at most d N distinct solutions. Hence we have found them all in (9), since (9) provides us with 16 solutions for (g 1 , g 2 , g 3 , g 4 ).
Using the above result, we are now able to specify the output of classical ICA algorithms when applied to a mixture of sources which satisfy A2.

Constant modulus and contrasts based on fourth order cumulants
The constant modulus (CM) criterion is one of the most known criteria for BSS. In the real valued case, it simplifies to: Proposition 3. For the sources given by A2, the minimization of the constant modulus criterion with respect to g leads to either one of the solutions given by Equation (9).
Proof. We know that the minimum value of the constant modulus criterion is zero and that this value can be reached (for g having one entry being ±1 and other entries zero). Moreover, the vanishing of the constant modulus criterion implies that y 2 −1 = 0 almost surely and one can then apply Lemma 3.
A connection can now be established with the fourth-order auto-cumulant if we impose the following constraint: E{y 2 } = 1 (or equivalently g 2 = 1 since y = gs) (12) 4 leads to either one of the solutions given by Equation (9).  Proof. Part (i) follows from the arguments given above. In addition, using multilinearity of the cumulants and (8), we have for y = gs: The result then follows straightforwardly from the study of the polynomial function in Equation (13). Indeed, optimizing (13) leads to the following Lagrangian: After solving the polynomial system which cancels the Jacobian of the above expression, one can check that all solutions are such that |κ (y) 4 | ≤ 2. Part (ii) of the proposition easily follows.

Context
We illustrate with a few simulations Propositions 1 and 2. The random variable a in Assumption A1 has been generated as the following mixture of Gaussians: μ can take any value in ] 0, 1[ and we have set σ 2 = 1 − μ 2 . The latter choice ensures that E{a 2 } = 1, whereas we obviously have E{a} = E{a 3 } = 0. Finally, we have κ We generated mixtures of the sources given by Assumption A1: we mixed the three sources s 1 , s 2 , s 3 or the two sources s 1 , s 2 only with a matrix A randomly generated in R 3×3 or R 2×2 , respectively.

Algorithm
We used the algorithm CoM2 described in [25] and ([1], Chap. 5). It relies on the following MIMO extension of criterion (5): The algorithm in [25] first operates a prewhitening and the maximization of the above criterion is performed over the set of orthogonal matrices. From the results of Propositions 1 and 2 we expect the separation results which are given in Table 2, depending on κ (a) 4 and the number of mixed sources. In Table 2, G is said separating when Above, G is said separating when G = PD where P is a permutation matrix and D = diag(±1, . . . , ±1) and

Results
We provide some results illustrating Propositions 1 and 2 through the behavior of the algorithm CoM2. Let us define the following performance criterion: We have 0 ≤ τ k ≤ 1 and τ k is close to zero whenever source k is well separated. We performed 100 Monte-Carlo runs both in the case where the three sources A1 and in the case where only s 1 , s 2 are mixed. The average results are provided in Tables 3 and 4. We see in Table 3 that τ 3 is small in all cases, indicating that the source s 3 is indeed separated by the algorithm. On the contrary, one can see in Tables 3 and  4 that τ 1 , τ 2 are small only for μ = 0.9: this corresponds to a value of κ (a) 4 for which the obtained G is theoretically separating. On the contrary, for μ = 0.5, corresponding to a value of κ (a) 4 for which the obtained G does not theoretically separate s 1 and s 2 , the values of τ 1 and τ 2 are close to 0.5.

Extended ICA model
We have seen that, depending on the value of κ (a) 4 , the classical optimization criteria in Equations (5) and (16) are not contrasts any longer for the first two sources given by Assumption A1. We now introduce a new statistical model of dependent sources, which consists in a probability mixture of sources. One component of the probability mixture satisfies the requirement of ICA, whereas the other component of the probability mixture is dependent. As an interesting example of dependent sources, we will consider the first two sources defined by A1, where a is the mixture of Gaussians proposed in Section 3.3 with μ = 0.5: this choice is justified by the previous results, which state that such sources cannot be separated by classical algorithms. We show that for our model, the separation is possible based on ICA and on the subset of samples where ICA assumptions are satisfied.

Latent variables
We first extend ICA methods and relax the independence assumption. The basic idea consists in introducing a hidden process r(t) such that, depending on the particular value of r(t) at instant t, the independence assumption is relaxed at time t. In this article, we will assume that r(t) can take two values only in the set {0, 1}. Let r (r(1), . . . , r(T)). We assume more precisely: One can see that, conditionally on r(t) = 0, the source components s(t) at time t satisfy the usual assumptions required by ICA. In a BSS context, if r were known, one could easily apply ICA techniques by discarding the time instants where the sources do not satisfy the ICA assumptions. To be more precise, let us define: and: X 0 (x(t)) t∈I 0 The set I 0 is the set of time instants where the components of s(t) are independent and non Gaussian, except possibly one of them. Then the subset X 0 of the whole set X of the observations satisfies the assumptions usually required by ICA techniques. The core idea of our method consists in performing alternatively and iteratively an estimation of B (corresponding to A −1 ) and of the hidden data r.

Typical sources separated by the proposed method
The sources that we consider satisfy Assumptions A3., A4., and A5. given previously in Section 4.1. We now detail our particular choices for A4., A5., which have been used in simulations and in the sequel to illustrate the assumptions. These particular choices are denoted hereunder by A4.' and A5.' or A5.". We have considered two particular examples with N = 2 sources. In both cases, Assumption A4. particularizes to: A4.' When r(t) = 0, the components of s(t) are mutually independent, uniformly distributed on [ − √ 3, √ 3] (that is: zero-mean and unit variance).

Example 1
As a first example, we particularize Assumption A5. as follows: A5.' When r(t) = 1, then s 1 (t) = a(t) and is an independent BPSK random variable and a(t) is an independent zero-mean, unit-variance random variable.
In simulations, we choose a(t) as a mixture of Gaussians whose distribution is given in Equation (15). A typical realization of a distribution satisfying A3., A4.' and A5.' is illustrated by the simulated values shown in Figure 1a.
Conditionally on r(t) = 1, the vector (s 1 (t), s 2 (t)) T corresponds to the first two sources in Assumption A1, which means that for r(t) = 1, (s 1 (t), s 2 (t)) T lie on one of the two bisectors of the horizontal and vertical axes. functions (such as CoM2) should not separate any linear mixture of (s 1 (t), s 2 (t))T based on the samples where r(t) = 1.
On the contrary, considering the samples X 0 only amounts to removing the set of dependent points that lie on the two bisectors of Figure 2a. In such a case, the remaining samples in X 0 satisfy the usual requirement for ICA and any ICA algorithm should succeed in separating a linear mixture of the sources.

Example 2
The previous example is an extreme case where, conditionally on r(t) = 1, either s 1 (t) = s 2 (t) or s 1 (t) = −s 2 (t). Thus, the set of samples where s 1 (t) and s 2 (t) are dependent lies on the two bisectors of the horizontal and vertical axes. We considered in this example the case where, conditionally on r(t) = 1, the components of (s 1 (t), s 2 (t)) T are dependent but have a continuous joint density. We particularize Assumption A5. as follows: A5." Let u 1 ∼ N (0, σ λ ) and: u 2 ∼ L(λ) be independent random variables where λ is a positive parameter and σ 2 λ = 2(1 − 1 λ 2 ). When r(t) = 1, P(s(t) | r(t) = 1) is such that: To say it differently, the two components of are independent, the first one being Gaussian and the second one being Laplace distributed. One can verify that the choice of σ 2 λ ensures that, conditionally on r(t) = 1, the sources are unit-variance. Such a distribution density is illustrated by simulated values in Figure 2a with λ = 5 and in Figure 2b with λ = √ 2. Considering X 0 only amounts to removing the cloud set of dependent points on the distributions. Visually, this seems much more difficult in Figure 2b than in Figure 2a, and much more difficult in this example than in Example 1 in Figure 1a. We will discuss further in Section 7 the influence of a good knowledge of r: surprisingly, it is not necessarily crucial in our method to have a good knowledge of r.

Complete and incomplete data
Let us denote by θ = (B, η) the set of parameters to be estimated from the data: in this notation, we stress that θ consists of the separating matrix B and of the parameter vector η which characterize the distribution of r. Let us call (r, X) the set of complete data, whereas X alone is the set of incomplete data: since r is a hidden process, the model described in Section 4 corresponds to the situation where only incomplete data is available for estimation of the searched parameters θ. Note that the adjective blind is used to emphasize that S is unavailable, whereas incomplete emphasizes that r is unavailable.

Iterative conditional estimation
Iterative conditional estimation (ICE) is an iterative estimation method that applies in the context of incomplete data and that has been proposed in the problem of image segmentation [24,29,30].
Another well-known iterative estimation technique is the expectation-maximization (EM) algorithm, which is based on the maximum likelihood estimator. Contrary to EM, the underlying estimator in ICE can be of any kind. This makes ICE more widely applicable in cases where the likelihood computation or maximization becomes intractable [29,31], for example in the non gaussian case. In the case where the maximum likelihood is chosen as the underlying estimator, ICE show similarities with EM (see e.g., [32] for an experimental comparative study). It has been proven in [33] that in the case of distributions belonging to the exponential family, EM is one particular case of ICE. Finally, the interest of ICE and its convergence in the case of data that are probability mixtures has been showed in [34].
We now shortly describe ICE. The prerequisites in order to apply ICE are the following: • there exist an estimator from complete dataθ(r, X), • one is able either to calculate E{θ(r, X) | X; θ} or to draw random variables according to P(r | X; θ).
Starting from an initial guess of the parameters, the method consists in finding iteratively a sequence of estimates of θ, where each estimate is based on the previous one. More precisely, ifθ [0] is the first guess, the sequence of ICE estimates is defined by: where E{. | X;θ [q−1] } denotes the expectation conditionally on X and with parameter valuesθ [q−1] . If the above conditional expectation cannot be computed, it can be replaced by a sample mean, that is (20) can be replaced by: where K ∈ N * is fixed and each r (k) is drawn according to the a posteriori law P(r | X;θ [q−1] ). Note that if θ is vectorvalued, (20) can be used for those components for which it can be computed, and (21) can be used otherwise.
Remark that the two conditions requested in order to apply ICE are very weak, which is the reason for our interest in ICE. In fact concerning the first one, there would be no hope to perform incomplete data estimation if no complete data estimator exists, whereas the second requirement consists only in being able to simulate random values according to the a posteriori law.

Applicability of ICE and assumed distributions
In this section, we give details about how the two conditions given in 5.2 for applicability of ICE are fulfilled.
First, as explained in Section 4.1, knowing r provides an easy way of estimating a separating matrix by considering as in (18) the subset X 0 of the samples. A complete data estimatorθ(r, X) hence exists.
To use ICE, one should additionally know the law P(r | X; η). Since X = AS, we have P(r | X; η) = P(r, X; η) P(X; η) = P(r, S; η) P(S; η) (22) and P(r |, X; η) is identical to the law P(r | S; η), where S = A −1 X = BX. An expression for the latter law is available if a model is assumed for P(S, r; η). Importantly, the chosen model used in the ICE estimation algorithm can be different from the model followed by the simulated data that are processed by the algorithm. This is crucial, because actual distributions are generally unknown in practical BSS problems. In particular, we here choose a model which follows Assumptions A3., A4., and A5., but which is different from the particular Assumptions A4.' , A5.' , and A5." in Section 4.2. More precisely, translating http://asp.eurasipjournals.com/content/2013/1/62 Assumption A3. only, the joint distribution of (r, S) under the parameter value η reads: The above equation holds both for the assumed distribution and for the distribution of the simulated data. On the contrary, the expressions of P(r; η) and P(s(t) | r(t); η) that are assumed in ICE differ from the ones of the simulated data and they are given in the next paragraphs.

Assumed P(s(t) | r(t)) in ICE
First, note that in the following, similarly to the real data distribution, P(s(t) | r(t); η) = P(s(t) | r(t)) does not depend on the parameters η to be estimated. Experimentally, we observed that, for robustness reasons, the distribution P(s(t) | r(t)) assumed in ICE should not have a bounded support or be too specific. For this reason, we assumed the following distributions: A typical realization of the distribution of (s 1 (t), s 2 (t)) T assumed in ICE is given in Figure 3. This distribution is of course different from the ones of the simulated data in Figures 1 and 2. However, simulations will confirm that it is a reasonable choice.
Conditionally on r(t) = 0, (s 1 (t), s 2 (t)) T is assumed an independent, zero-mean, unit-variance Gaussian vector: this is an uninformative distribution having a density with non bounded support. Conditionally on r(t) = 1, each component of u in the assumed distribution is a mixture of a Laplace and a Gaussian law. Experimentally, and from the observation of Figures 1 and 3, the latter choice seems a relevant approximation of the data model from Example 1 in Section 4.2.1. The parameter λ should make a compromise between a good fit to the data (λ large) and robustness of the method (λ small). Also, the assumed conditional distribution P(s(t) | r(t) = 1) is invariant with respect to any permutation and sign change. Such a symmetry is necessary in our method because ICA algorithms leave permutation ambiguities. For this reason, the same distribution has been considered to model the data source signals generated in Example 2 (Section 4.2.2).

Assumed P(r; η) in ICE
We propose two different models for the latent process r.

I.i.d. latent process
The simplest case is when r is an i.i.d. Bernoulli process, that is η is a scalar parameter in [ 0, 1] and: with for all t ∈ {1, . . . , T}:

Markov latent process
We can alternatively consider that r is a stationary Markov Chain, that is: where P(r(t) | r(t − 1)) is given by a transition matrix independent of t. In this case, the parameters η consist of the different transition probabilities and the initial probabilities. The main advantage of considering a Markov model is that the posterior transitions P(r(t) | r(t − 1), X; η) can be calculated by an efficient forward-backward algorithm [35]. A sampling of the hidden process according to the posterior law P(r | X; η) is hence possible, making the ICE method applicable [29].

Combined ICA/ICE separation algorithm
In this section, we first detail our separation algorithm which combines ICA and ICE. We then interpret ICE in terms of random variable generation.

Algorithm
We denote by ICA one among all possible ICA algorithms [1] such as JADE [26], CoM2 [25], FastICA [27], etc. Given a set of observation samples X, the separating matrix estimated by the ICA algorithm is denoted by B = ICA(X).
With these notations, a complete data estimator of the separating matrix is provided by B = ICA(X 0 ), where X 0 has been defined in Equation (18). Our algorithm consists in an ICE estimation of the parameters θ = (B, η). The parameters η which characterize r are estimated according to (20), whereas the separating matrix B is estimated according to (21) with K = 1. Here is a summary of the algorithm: Initialize the parametersθ [0] = B [0] ,η [0] . For q = 1, 2, . . . , q max , repeat: 1. calculate P r | X;θ [q] and drawr [q] according to this distribution, 2. update the separating matrix: 3. update the parameters η of the process r.
Steps 1 and 3 of the above algorithm are further detailed in the following sections. Note that, if the parameters η are available as an additional information, they need not be estimated and step 3 of the algorithm is not necessary.

Details in the case of an i.i.d. latent process
We here detail the three steps of our algorithm in the case where r is i.i.d. As we will see, in this case, ICE is akin to generating a set of random samples that satisfy the usual assumptions of ICA with an accept-reject method.

Step 2: accept-reject random variable generation
One can see that, when selectingX [q] 0 in the second step of our algorithm, some samples are kept, others are thrown away. A close parallel can be drawn with random variable generation by the acceptance-rejection method.

Lemma 4. Let s be a random variable (or random vector) with probability distribution given by
Letr ∈ {0, 1} be a binary random variable with probability distribution such that: Then, the conditional distribution P(s |r = 0) is P 0 (s).
Lemma 4 relates our algorithm to the accept-reject algorithm for random variable generation. Indeed, drawingr and conditioning onr = 0 to get P(s |r = 0) is performed in our algorithm by drawingr [q] (t) and keeping only those samples s(t) of S, wherer [q] (t) = 0, that is, it amounts to rejecting those samples wherer [q] (t) = 1. In other terms, through the ICE algorithm, the target http://asp.eurasipjournals.com/content/2013/1/62 distribution P 0 (s) is generated using the instrumental distribution g(s) = ηP 0 (s) + (1 − η)P 1 (s). Samples from this instrumental distribution g(s) are given by the data themselves. By doing so, we obtain a set of data following the distribution P 0 (s). Since P 0 (s(t)) = P(s(t) | r(t) = 0) and according to Assumption A4., this is precisely the distribution under which the ICA algorithm is applicable.

Remark 3.
It is known that the probability distribution of s in Lemma 4 can be seen as the marginal of (r, s), where r is a Bernoulli process and the conditional laws of s knowing r are given by P 0 and P 1 . However,r is drawn independently of r and in particular,r is different from r. It means that in our algorithm, the original latent process r and the ICE samplingr [q] may be quite different, although the selected samples are distributed following P 0 . This will be illustrated in Section 7.1.2.

Step 3
A complete data estimator of the parameter η is given by the empirical frequencyη = 1 T T t=1 δ(r(t) = 0). Equation (20) then yields the following update rule for the parameter η:

Markov latent process
We here detail the three steps of our algorithm in the case where r is a Markov process.

Step 1
Here, we assume in the ICE part of the procedure that r is a Markov process. Then, the posterior probability P(r | X;θ [q] ) is Markov, and its transitions can be calculated according to the forward-backward or Baum-Welch algorithm. At each step q of the algorithm,r [q] is then generated according to P(r | X;θ [q] ) as a non-stationary Markov chain. It is out of the scope to further detail these well-known procedures and the reader is referred to [29,30,35] and references therein for more explanations.

Step 2
It is unchanged, although the interpretation as an acceptreject random variable generation does not hold any longer.

Simulations
We performed several simulations for different numbers of samples, respectively, T = 1000, T = 5000, T = 10000. We have set q max = 30 in our algorithm: this value has been determined empirically by testing values of q max up to 150. For values greater than q max = 30, no significant quality improvement has been observed and this choice seems hence satisfactory as far as ICE convergence is concerned. Although highly interesting, a more detailed analysis of the convergence speed/rate of our algorithm is out of the scope of the article. The separation quality criterion considered is the mean square error (MSE): the provided value is a mean of the MSE over all sources. Additionally, we considered the empirical segmentation rate provided by the last samplingr [q max ] in Step 1 of the ICE procedure. We define this segmentation rate as follows: This corresponds to the proportion of time samples which are correctly classified as corresponding either to independent (r(t) = 0) sources or to dependent (r(t) = 1) sources. In all cases, the source signals and the mixing matrix have been randomly generated. The classical algorithm used in our simulations, which has been denoted ICA in Section 6 was CoM2 [25]. For comparison, we considered as a worst case reference the result when CoM2 is applied to the data, simply ignoring that the ICA assumptions are violated at some sample times. On the other hand, as an optimal best case reference, we considered the complete data situation, that is when r is available and the separation is performed based on X 0 .
In our simulations, we generated data according to the following models: • sources satisfying Example 1 in Section 4.2. We chose the process a(t) as a mixture of Gaussians as defined in Equation (15), with μ = 0.5. According to Section 3, the CoM2 algorithm, if applied on the dependent data only, is not successful in separating the sources. • sources satisfying Example 2 in Section 4.2. In this case, we choose λ = √ 2.

I.i.d. latent process with known η
We first provide some simulations in the case where r is an i.i.d. Bernoulli process with known η parameter (see Equation (24)).

Example 1
Separation results for sources generated according to  Figure 5 are the individual results for 100 Monte-Carlo realizations. In Figure 4, the MSE values have been plotted depending on η ∈[ 0, 1] for T = 1000 (a), T = 5000 (b), T = 10000 (c) samples. Note that for η = 0, all samples of the data sources are dependent and the complete data separation is therefore not applicable. Similarly, when setting η to zero in the ICE algorithm, all samples are necessarily classified as dependent and our method is not applicable. For this reason, when η = 0 has been used in the simulated sources we have set η to 0.1 in the algorithm.

Monte-Carlo realizations whereas values in
Naturally, the complete data case provides the best results: when η tends to zero, the number of samples in X 0 decreases, which explains the increasing MSE for smaller η values. On the other hand, ignoring the dependence provides results that are unacceptable for η smaller than 0.7, approximately. It seems however that CoM2 is able to separate the particular sources considered for η greater then 0.7, approximately: from the results in Section 3, the existence of such a limit value above which CoM2 is successful seems quite natural. For η smaller than 0.7 the proposed method provides a very significant improvement in terms of separation quality. The classification rate ϒ ICE has also been plotted in Figure 4 and one can observe that samples of dependent/independent sources are quite well classified, which corroborates the good MSE values obtained with our method.
Depending on the parameter λ, the source distribution assumed in the ICE estimation and described in Section 5.3.1 is closer to or farther from the real source distribution of Example 1, which has part of its samples lying on the two bisectors (see the comparison between Figure 1a and Figure 3a,b). In Figure 5, we tested the influence of the parameter λ for a fixed value of η = 0.5. The MSE of 100 Monte-Carlo realizations have been plotted for T = 1000 and T = 10000 samples. The corresponding segmentation rate ϒ ICE has been plotted in the lower part of the figures. First, it can be seen in Figure 5 that the separation results of the Monte-Carlo realizations can be clearly separated in two groups: • for a minority of cases, the segmentation rate ϒ ICE is close to 0.5, meaning that the procedure completely fails to classify the latent process r. In such a case, the corresponding MSE is important (around 0.5) and separation is unsuccessful. This situation occurs more often for large values of λ: a greater value of λ consequently implies less robustness of our method. succeeds in classifying approximately the latent process r. In such a case, the corresponding MSE is low and the separation is successful. One can see that, in such a case where the separation is successful, a greater value of λ comes with a greater ϒ ICE , hence a better segmentation of r and a lower MSE, hence a better separation quality.
In conclusion, the choice of λ should be a compromise between good separation quality (in case of success) and robustness in order to limit the rate of separation failure.

Example 2
The results for source signals following Example 2 (see Section 4.2.2 with λ = √ 2) are given in Figure 6. Similarly to the previous section, for T = 1000 (a), T = 5000 (b), T = 10000 (c) samples and depending on η ∈[ 0, 1] the MSE values have been plotted in the top graph and the segmentation rate ϒ ICE in the bottom graph.
As expected, the complete data estimation based on X 0 of the separating matrix provides again the best results. In comparison with the case where the dependence of the sources is ignored, our method provides a very significant improvement, in particular for η lying between 0.4 and 0.8, approximately. Very interestingly, as witnessed in Figure 6, the good performance in terms of separation and MSE is obtained with very poor performance in terms of classification of the latent process r: note in particular that for η = 0.5, we have a low MSE whereas ϒ ICE is close to 0.5. The poor classification can be easily understood when comparing Figure 2b with Figure 1a. The important point in this observation is that a good classification of the latent process r is a sufficient but not a necessary condition for good ICA estimation. This is in contrast with the former example but is fully justified by the interpretation of Step 2 of the algorithm in Section 6.2. Indeed, according to Lemma 4, the ICE part of our algorithm selects a set of samplesX 0 which is such that its distribution satisfies the usual ICA assumptions, although the classification may well be very poor.

I.i.d. latent process with unknown η
We now consider the case where r is i.i.d. but the parameter η in Equation (25) is unknown. In this case, η is estimated by the ICE algorithm. The results have been averaged over 1000 Monte-Carlo realizations and are gathered in Table 5  and Example 2. In the case of Example 1, we have set the parameter λ = 5 in the assumed distribution of Section 5.3.1, whereas we have set λ = √ 2 in the case of Example 2. For comparison, we provide the MSE results when η is known and when the algorithm CoM2 is applied, just ignoring that the generated source signals are dependent.
One can see from the values in Table 5 that our separation method remains valid even in the case where η is estimated.

Markov latent process
As previously indicated, the ICE estimation algorithm is able to take into account the Markov dependence of r.
We have considered the importance of modeling r as a Markov or i.i.d. process.
The process r has been generated as a stationary Markov chain with transition matrix 0.9 0.1 0.1 0.9 , in which case P(r(t) = 0) = P(r(t) = 1) = 1 2 . On the corresponding data set with r Markov, our separation method has been performed both in the case where r is modeled as i.i.d. and in the case where r is modeled as a Markov chain. In both cases, the parameters in η have been estimated from the incomplete data. Figure 7 provides the results for sources from Example 1 (with λ = 15 in the algorithm) and Figure 8 provides the results for sources from Example 2 (with λ = √ 2 in the algorithm). The separation results with complete data   and with CoM2 ignoring the dependence of the sources have also been plotted and, as expected, they correspond respectively to the best/worse MSE value. On the top plots of Figure 7, one can see that, in sources from Example 1, the performance is improved when taking into account the Markovianity of r. More precisely, similarly to the simulations in Section 7.1.1 with λ = 15, the MSE values show a successful separation for a majority of realizations: in such a case, the Markovianity assumption clearly improves the MSE. However, the Markovianity property does not significantly improve the robustness of the method, since the number of realizations where separation has failed is approximately identical with a Markov and an i.i.d. model. Interestingly, a different behavior can be seen in Figure 8 with data generated according to Example 2. Indeed, the performance has not improved when r has been modeled as a Markov chain. Intuitively, this may come from the fact that the two components of the probability mixture in the source distribution are harder to distinguish, as it has been illustrated in Figure 2.

Conclusion
In this article, we have first studied the behavior of kurtosis based contrast functions in specific cases of dependent sources. Observing that the criteria are polynomial functions of the parameters, we have been able to explicitly give the theoretical maxima as a function of a kurtosis value parameter κ (a) 4 . The behavior of the classical kurtosis contrast function can thus be understood depending on κ (a) 4 and its restricted validity has been proved. We have then introduced a model of dependent sources which consists in a probability mixture of two distributions. One component of the mixture satisfies the ICA assumption and, using an ICE estimation method, we have been able to exploit this information in order to perform separation. This example suggests that many more dependent sources might be separated if an adequate model of their distribution is provided. More generally, since the distribution model as a probability mixture may be an approximate one, an interesting problem would be to know to what extent a given distribution may be approximated by our proposed model in order to perform BSS. Finally, simulations have confirmed and validated our theoretical results.
Endnotes a It rectifies an error in [22]. b One can show that the number of solutions of (10) is indeed finite. Note that one can also solve the polynomial equations, as has been done in the proof of Proposition 1.