MAP-Based Underdetermined Blind Source Separation of Convolutive Mixtures by Hierarchical Clustering and (cid:2) 1 -Norm Minimization

,


INTRODUCTION
The high-quality separation of speech sources is an important prerequisite for further processing such as speech recognition in environments with several active speakers. Often, the underlying mixing process is unknown, thus requiring blind source separation (BSS). In general, we can distinguish two cases depending on the number of sources N and the number of sensors M: Since overdetermined BSS (N < M) can be reduced to determined BSS (N = M) [1], we refer to both as determined BSS. Most approaches deal with determined BSS [2,3], but in reality BSS is often underdetermined. While the area of underdetermined BSS is attracting increasing attention [4][5][6][7][8][9][10][11][12], it remains a challenging task.
Most existing approaches for underdetermined BSS were proposed for instantaneous mixtures. In this paper, we use [13,14] as our basis for proposing an algorithm for underdetermined BSS that deals with convolutive mixtures in the time-frequency domain. We start from a general Bayesian approach, which leads to a two-stage framework. In the first stage, we have to estimate the mixing matrix. In the second, stage the actual source signals are estimated.
Several of the previously proposed algorithms for the first stage are based on histograms and developed for only two sensors [7,9,11]. Some could, in principle, be enhanced for higher dimensions M. But since histograms are based on densities, the so-called curse of dimensionality [15] sets practical limits to the number of usable sensors. This problem becomes even worse with complex numbers, which double the histogram dimensions due to their real and imaginary parts or amplitude and phase, respectively. Complex numbers are necessary if BSS is performed in the time-frequency domain. Some methods approach complex numbers by applying real-valued algorithms to the real and imaginary parts or amplitude and phase [6,12], which is not always applicable. Some approaches extract features such as 2 EURASIP Journal on Advances in Signal Processing the direction of arrival (DOA), or work on the amplitude relation between two sensor outputs [4,5,7,16]. In both cases, only two sensors can contribute, no matter how many sensors are available.
Other algorithms such as GeoICA [8] or AICA [10] resemble self-organizing maps (SOMs) and could more easily be applied to convolutive mixtures. However, their convergence depends heavily on initial values [15]. Usually, countermeasures are computationally expensive.
Here we propose the use of hierarchical clustering to estimate the mixing matrix. This method can work directly on complex-valued samples. While it does not limit the usable numbers of sensors, it prevents the convergence problems that can occur with SOM-based algorithms.
In the second stage, we separate the mixtures using the estimated mixing matrix from the first stage. We assume statistical independence and Laplacian probability density functions (PDFs) for the sources [17]. This leads to constrained 1 -norm minimization. Since we are considering convolutive mixtures, we work in the time-frequency domain. This reduces the convolutive mixtures to instantaneous mixtures, which are easier to handle. As a result, we have to deal with complex numbers. Therefore we investigate the difference between real-and complex-valued 1 -norm minimizations and its implication for the underdetermined BSS of convolutive mixtures.
In Section 2, we first explain the general framework before providing details about the hierarchical clustering in Section 3 and the source separation based on 1 -norm minimization in Section 4. In Sections 4.2 and 4.3, we present a detailed description of real-and complex-valued 1 -norm minimizations before considering their differences. The consequences of these differences for practical applications are described in Section 5 together with experimental results. They demonstrate the performance for convolutively mixed speech data in a real room with reverberation time T R = 120 milliseconds.

GENERAL FRAMEWORK
We consider a convolutive mixing model with N speech sources s i (t) (i = 1, . . . , N) and M (M < N) sensors that yield linearly mixed signals x j (t) (j = 1, . . . , M). The mixing can be described by where h ji (t) denotes the impulse response from source i to sensor j. Instead of solving the problem in the time domain, we choose a narrowband approach in the time-frequency domain by applying a short-time Fourier transform (STFT). While a wideband approach would be desirable, extension of the proposed method is not as straightforward as described for example in [18]. This is because this problem has a different structure from traditional adaptive filtering problems. Following [13], we can approximate the mixing process in the time-frequency domain as where and τ denotes the time frame. This reduces the problem from convolutive to instantaneous mixtures in each frequency bin f . For simplicity, we will omit the frequency and time-frame dependence. Switching to the time-frequency domain has the additional advantage of making it easier to exploit the time-frequency sparseness of speech sources [6]. Sparseness of a signal means that only a few instances have a value significantly different from zero. During speech activity, the amplitude of a speech signal in the time domain is usually significantly different from zero, and therefore not sparse. The higher sparseness in the time-frequency domain can be explained by the harmonic structure of speech signals. During voiced speech, the energy of a speech signal is concentrated around multiples of the speaker's fundamental frequency. Ideally, the frequency bands in between do not carry any energy. This means that in the time-frequency domain, only a few frequency bins have high values at each time instance τ, while most frequency bins have a value close to zero. This is by definition a sparse signal. In addition, the fundamental frequency depends on the time instance τ, which means that the signal is also sparse with respect to τ. Together with the frequency sparseness and the speaker dependency, this leads to less overlap in the timefrequency domain than in the time domain. Using a sparse signal representation is very important as regards ensuring good separation performance since the separation is built on the assumption of sparse source signals.
The disadvantage of narrowband BSS in the timefrequency domain is the internal permutation problem, which results in incorrect frequency bin alignment. In our framework, we use a clustering-based method to reduce the permutation problem [3,19]. We also apply the minimumdistortion principle [2] to solve the scaling problem.
In determined BSS, the mixing matrix H is square and (assuming full rank) invertible. Therefore, the BSS problem can be solved by either inverting an estimate of the mixing matrix or directly estimating its inverse and solving (2) for S.
However, this approach does not work in underdetermined BSS where the mixing matrix is not invertible. Instead, we follow a general Bayesian approach, which leads to an optimal solution in a statistical sense. In general, we search for an estimation of the source signals S and mixing matrix H that maximize the a posteriori P(S, H|X). If we make the usually reasonable assumption that the source signals and mixing matrix are statistically independent, this problem can be written as If we assume additive white Gaussian noise with variance ν 2 at the sensors, then the likelihood P(X|S, H) also has a Gaussian distribution according to We will limit ourselves to the noiseless case (ν 2 → 0), which leads to a Dirac impulse for the likelihood It requires the maximum of the a posteriori to fulfill HS = X, which turns (3) into the constrained problem If we further assume that we know the mixing matrix H (or can provide an estimate for it as shown in Section 3), then P(H) is also a Dirac impulse. So we only have to estimate the source signals S, and (7) results in max S P(S) s.t. HS = X.
Therefore we follow a two-stage approach as utilized in [6,8] consisting of blind mixing model recovery (BMMR) and blind source recovery (BSR). To estimate the mixing matrix A in the BMMR step, we propose the use of hierarchical clustering as described in detail in Section 3. To eventually separate the signals in the BSR step, we specify a source model P(S) and provide a solution for (8) in Section 4. Finally, the inverse STFT is applied to obtain time-domain signals. The overall system is depicted in Figure 1.

BLIND MIXING MODEL RECOVERY
Several algorithms have already been proposed for BMMR. They usually have the common feature that they assume sparseness of the original signals. Without being mentioned, it is usually assumed that the sources are located at different spatial positions (space sparseness). In addition, they commonly assume a certain degree of time-frequency sparseness, which ideally means that the time-dependent spectra of the sources do not overlap even after being mixed. Rewriting (2), we can express ideal time-frequency sparseness by This means that each time-frequency instance originates only from a single source and represents a scaled version of the corresponding mixing vector h q ( f ). q depends on the frequency f and time τ.
If we assume stationary source positions, the mixing vector h q ( f ) is constant for all τ. Since h q ( f ) is related to the position of the qth source, it is also different for each source. This means ideally that the time-frequency samples X( f , τ), that originate from the qth source, cluster at each frequency f around the corresponding mixing vectors h q ( f ).
However, depending on the mixing system and the actual time-frequency sparseness of the source signals, the mixed signals will also have components of other mixing vectors stemming from other sources. Therefore the mixtures will be spread around the mixing vectors but still form clusters for each source.

Hierarchical clustering
To avoid the problems discussed in Section 1, such as the curse of dimensionality or poor convergence, we propose the use of a hierarchical clustering algorithm for finding the clusters around the mixing vectors. We follow an agglomerative (bottom-up) strategy. [15]. This means that the starting point is the single samples, considering them as clusters that contain only one object. Clusters are then combined, so that the number of clusters decreases while the average number of objects per cluster increases. In the following, we assume phase and amplitude normalized samples where ϕ X1 denotes the phase of the first vector component of X and |·| p denotes the p -norm defined by The combination of clusters into new clusters is an iterative process based on the distance between the current clusters.
Starting from the normalized samples, the distance between each pair of clusters is calculated, resulting in a distance matrix. At each level of the iteration, the two clusters with the least distance are combined to form a new binary cluster ( Figure 2). This process is called linking and is repeated until the number of clusters has decreased to a predetermined value c, N ≤ c ≤ P (P is the total number of samples).  To measure the distance between clusters, we have to distinguish between two different problems. First we need a distance measure d(X τ1 , X τ2 ) that is applicable to Mdimensional complex vector spaces. While there are several possibilities, we currently use the Euclidean distance based on the normalized samples, which is defined by where ·, · stands for the inner product and * stands for complex conjugation. When a new cluster is formed, we need to enhance this distance measure to relate the new cluster to the other clusters. The method we employ here is called the nearestneighbor technique. Let C 1 and C 2 denote two clusters as illustrated in Figure 3. Then the distance d(C 1 , C 2 ) between these clusters is defined as the minimum distance between its samples by As mentioned earlier, most of the samples will cluster around the mixing vectors h i , depending on the degree of sparseness of the original signals. Special attention must be paid to the remaining samples (outliers), which are randomly scattered in the space between the mixing vectors due to nonideal sparseness (and noise if applicable). Usually they are far away from other samples and will be combined with other clusters only at higher levels of the clustering process (i.e., when only few clusters are left). This led us to the idea of setting the final number of clusters at a high value: By doing so, we avoid linking these outliers with the clusters around the mixing vectors h i and therefore avoid distortions. This results in greater robustness. More important, however, is the fact that we avoid combining desired clusters. Since the outliers are often far away from other clusters, desired clusters might be closer to each other than to outliers. Experiments showed that the exact value of c does not matter as long as it is above 60 for N ∈ {3, 4, 5}. This approach requires distance calculations, but with a well-designed implementation as used here, the computational complexity can become as low as O(n 2 ) [20], where n denotes the number of samples per frequency bin. An example of the resulting clusters is shown in Figure 4. Here, as with the experiments in Section 5, we chose c = 100. An example where desired clusters were unintentionally combined because too small a value c was chosen is shown in Figure 5. Further experimental details are given in Section 5.

Estimation of mixing matrix
Assuming that the clusters around the mixing vectors h i have the highest densities, and therefore the highest numbers of samples, we finally chose the N clusters with the largest numbers of samples. Thereby, the number of sources N must be known. To obtain the mixing vectors, we average over all the samples of each cluster, where |C i | denotes the cardinality of cluster C i . Thereby, we assume that the influence of other sources has zero mean.

Advantages of hierarchical clustering
Among the most important advantages of the above hierarchical clustering algorithm is the fact that it works directly on the sample data in any vector space with arbitrary dimensions. The only requirement is the definition of a distance Stefan Winter et al. measure for the considered vector space. Therefore, it can easily be applied to the complex-valued data that occurs in time-frequency domain convolutive BSS.
No initial values are required for the mixing vectors h i . This means, in particular, that if the assumption of clusters with high densities around the mixing vectors is true, then the algorithm converges to those clusters.
Besides choosing a distance measure, there is only the single parameter c that determines the number of clusters. Experiments have shown that the choice of this parameter is quite insensitive as long as it is above a certain limit that would combine desired clusters. Its choice is, in general, related to the sparseness of the sources. The sparser the signals are, the smaller the value of c can be, because the number of outliers that must be avoided will be smaller.
While the considered signals must have some degree of sparseness, they do not have to be statistically independent at this point to obtain clusters.

BLIND SOURCE RECOVERY
Unmixed signals cannot be directly obtained, because the mixing matrix cannot be inverted in underdetermined BSS. Several approaches have been proposed to solve blind source recovery [17]. Of these approaches, we chose the shortestpath algorithm, which is based on maximum a posteriori (MAP) estimation, assuming statistical independence and Laplacian PDFs for the sources.

MAP-based cost function
Using a maximum a posteriori (MAP) approach, we have shown in Section 2, that once we know the mixing matrix H, we have to solve the constrained problem (8) in order to obtain a statistically optimal estimate for the source signals S. If we assume mutually independent source signals whose spectral components have statistically independent phases and amplitudes with uniform and one-sided Laplacian distributions, respectively, the cost function results in for each time instance τ. |S i | denotes the amplitude of S i .

1 -norm minimization of real-valued problems
If we had to consider only real-valued problems (K = R), we could employ linear programming (LP) [21], which solves problems of the form where c, Here 1 stands for a unity matrix with appropriate dimensions. S + and S − are derived from S by setting all negative values or positive values, respectively, at zero. Although powerful algorithms for linear programming exist, they are still time consuming. Depending on the dimensions of the problem, we can obtain a faster combinatorial algorithm if we use a certain property of the solution. It can be shown [8,22] that the N-dimensional vector S that solves (16) contains at least N − M zeros if the columns of H are normalized. The normalization can be assumed for BSS due to the scaling ambiguity.
The lower limit for the number of zeros can be considered a constraint imposed by the MAP estimation and can easily be fulfilled by setting N − M elements of the solution at zero. Then we only have to determine the remaining M elements. Assuming that we know where to place the zeros, the remaining elements are found by multiplying the inverse of the quadratic matrix built by the remaining mixing vectors h i with the constraining vector X: The correct placement of the zeros can be determined by combinatorially testing all possibilities and accepting the one with the smallest 1 -norm. As a simple example, let us consider According to the dimensions of the problem, at least one element of the solution S must be zero. The 1 -norm of the 6 EURASIP Journal on Advances in Signal Processing possible solutions is The notation of (22) reflects the above description of setting one element at zero and inverting the remaining quadratic matrix. The chosen solution would be the one corresponding to (21). This combinatorial method is based on the shortestpath algorithm [8] and the 0 -norm that basically counts the number of nonzero elements. The combinatorial method stands in contrast to the approach in [23] where conditions are given for which the 0 -norm can be calculated by an pnorm with 0 < p ≤ 1.

1 -norm minimization of complex-valued problems
If complex numbers are involved, then (18) can no longer be applied because such numbers possess a continuous phase in contrast to a discrete phase of real numbers. Thus we cannot use algorithms that solve linear programming problems for complex-valued problems. However, 1 -norm minimization problems (16) with complex numbers (K = C) can be transformed to second-order cone programming (SOCP) problems in the following way. Equation (16) is equivalent to By decomposing t = N i=1 t i , t i ∈ R, the second constraint |S| 1 ≤ t can be expressed by where (·) and (·) denote the real and imaginary parts, respectively. Thus we can rewrite (16) as By defining we can write The second constraint in (28) can be interpreted as a secondorder cone for each i. Equation (28) describes an SOCP problem [24], which can be solved numerically for example with SeDuMi [16].

-norm minimizations
In contrast to the real-valued 1 -norm minimization problem where a minimum number of zeros can be guaranteed theoretically in the optimal solution, the number of zeros cannot be predicted with complex-valued problems as the following simple example shows. Let Then the 1 -norm of the solution obtained by SOCP is given by It does not contain any zeros as we would expect with real numbers, yet it solves (16). In comparison, the 1 -norm of the optimal combinatorial solution is given by This observation reveals a very important difference from real-valued problems and prevents the theoretical justification of a procedure similar to the combinatorial approach in Section 4.2. To better explain this difference between real and complex numbers, we take a look at a general solution based on a combinatorial solution and the nullspace N (H) of H. Even though the combinatorial solution S comb does not necessarily minimize the 1 -norm, it fulfills together with the SOCP solution S socp that By defining the difference S = S socp − S comb , (32) becomes This means that if we have a combinatorial solution, we can limit our search for the minimum 1 -norm solution to the nullspace N (H), that is, where H − is an arbitrary generalized inverse of H. For N = 3 and M = 2, we can express the combinatorial solution and the nullspace without loss of generality by Here f i j is a summand that only depends on H and X, which are constant for any given problem. If only real values are involved, then (37) describes a piecewise linear function depending on α whose slope can only change a limited number of times in a discrete manner.
However, once complex numbers are involved, their imaginary part results in an inherent 2 -norm, which leads to smooth slopes as they appear with second-order or higher polynomials. This behavior becomes obvious in (28). There the 1 -norm is changed from the sum of the absolute values of real numbers to the sum of the 2   minimization compared with its real counterpart. An example is shown in Figure 6, where the dependence of 1 -norm on α is shown (here only the dependence on the real part of α is shown). The combinatorial solution that minimizes the 1 -norm is given there for α = 0. However, this is not the solution of (16), which is rather obtained for α = 0.

EXPERIMENTAL RESULTS
Even though the combinatorial solution (CS) with a minimum number of zeros in Section 4.2 cannot be justified theoretically for complex numbers, in practice its performance is comparable to, or even better than, that of the SOCP solution. In our experiments, we separated mixtures that we obtained from clean speech signals and recorded room impulse responses. We tested both approaches with both the estimated and the original mixing matrices with different numbers of sources (N ∈ {3, 4, 5}) and sensors (M ∈ {2, 3}). We performed four experiments for each scenario. Each of the four experiments had a different combination of speakers drawn from six male and female English speakers. Further experimental conditions are summarized in Table 1 and Figure 7. For comparison, we also applied a time-frequencymasking approach to the same mixtures [25].
To measure the performance, we decomposed an estimated signal s in the time domain into a filtered version s target of the original signal, a filtered mixture e interf of the interfering signals and e artif , which accounts for artifacts introduced by the separation algorithm [26,27], The results are shown in Tables 2, 3, 4, and 5. The performance values of each combination give the average for the involved signals. The specific sources and sensors used in each scenario are indicated in the caption of each table following the numbering in Figure 7.
To evaluate the performance improvement, we provide the input SDR, SIR, and SAR measured at a single sensor in Table 6.
A subjective evaluation of the separated sources supports the result.
The SOCP solution and combinatorial solution yield similar results with the estimated mixing matrix. However, the combinatorial solution performs better with the optimal mixing matrix.
Although the difference in performance quality is negligible in practical applications with estimated mixing matrices, the computational complexity reveals great differences. The combinatorial solution has a low initial computational complexity but grows exponentially with the input dimension N. On the other hand, the SOCP solution has a high computational complexity even for low input dimensions N, but even in the worst case it grows only according to denotes the precision of the numerical algorithm [16]. Figure 8 illustrates this fact and shows on a logarithmic scale the time required by the two approaches to separate the sources in one frequency bin with 230 time frames for different numbers of sources and sensors. The simulations for Figure 8 were performed on a 2.4 GHz PC based on random data and mixing matrices.
One reason for the big difference in the initial computational complexity can be found in the reusability of previous results. For underdetermined BSS in the time-frequency domain, the minimum 1 -norm solution must be calculated several times with the same mixing matrix. The combinatorial solution is built on the inverses of selected mixing vectors. Once they are calculated, they can be reused as long as the mixing matrix does not change. In contrast, SOCP cannot profit from the reuse of earlier results due to its algorithmic nature.      The time-frequency masking approach yields better separation in terms of the SIR than the proposed methods. This is because the time-frequency masking approach uses only time-frequency instances that originate from a single source with high confidence. In contrast, the proposed methods do not evaluate the confidence about the origin of a timefrequency instance but use all instances for separation in a uniform way. On the other hand, by using all time-frequency instances, the proposed methods result in fewer artifacts, as expressed by a higher SAR.

CONCLUSION
Starting from a general Bayesian approach, we derived a framework for underdetermined BSS for convolutive speech mixtures consisting of two main steps. In the first step, we estimate the mixing matrix based on hierarchical clustering. This method can work directly on complex mixture samples. It also prevents the convergence problems that can occur with SOM-based methods such as GeoICA. Experimental results confirmed that the assumption of sparseness in time-frequency and space, and therefore, clusters around the mixing vectors, is sufficiently fulfilled for convolutively mixed speech signals in the time-frequency domain.
To estimate the source signals, in the second step we assumed Laplacian priors and arrived at an 1 -norm minimization problem. We investigated the consequence of dealing with complex numbers as an result of the time-frequencydomain approach. Although the combinatorial solution with at least N − M zeros is not theoretically justified for complex numbers, its performance quality is comparable to, or even better than, that of the SOCP solution. In addition, the combinatorial solution has the advantage that it is faster for underdetermined BSS problems with low input/output dimensions. His current research interests include speech signal processing, array signal processing, adaptive filtering, and its applications to acoustic human/machine interfaces.
Hiroshi Sawada received the B.E., M.E., and Ph.D. degrees in information science from Kyoto University, Kyoto, Japan, in 1991, 1993, and 2001, respectively. In 1993, he joined NTT Communication Science Laboratories, where he is now a Senior Research Scientist. From 1993 to 2000, he was engaged in research on the computer-aided design of digital systems, logic synthesis, and computer architecture. Since 2000, he has been engaged in research on signal processing, microphone array, and blind source separation (BSS). More specifically, he is working on the frequency-domain BSS for acoustic convolutive mixtures using independent component analysis (ICA). He serves as an Associate Editor of the IEEE Transactions on Audio, Speech and Language Processing. He is a Senior Member of the IEEE, and a Member of the Institute of Electronics, Information and Communication Engineers (IEICE), and the Acoustical Society of Japan (ASJ). He received the 9th TELECOM System Technology Award for Student from the Telecommunications Advancement Foundation in 1994, and the Best Paper Award of the IEEE Circuit and System Society in 2000.
Shoji Makino received the B.E., M.E., and Ph.D. degrees from Tohoku University, Japan, in 1979, 1981, and 1993, respectively. He is an Executive Manager at the NTT Communication Science Laboratories. He is also a Guest Professor at the Hokkaido University. His research interests include blind source separation of convolutive mixtures of speech, adaptive filtering technologies, and realization of acoustic echo cancellation. He is the author or coauthor of more than 200 articles in journals and conference proceedings and has been responsible for more than 150 patents. He is a Member of both the Awards Board and the