Combining Superdirective Beamforming and Frequency-Domain Blind Source Separation for Highly Reverberant Signals

Frequency-domain blind source separation (BSS) performs poorly in high reverberation because the independence assumption collapses at each frequency bin when the number of bins increases. To improve the separation result, this paper proposes a method that combines two techniques by using beamforming as a preprocessor for blind source separation. With the sound source locations assumed to be known, the mixed signals are dereverberated and enhanced by beamforming; the beamformed signals are then further separated by blind source separation. To implement the proposed method, a superdirective fixed beamformer is designed for beamforming, and an interfrequency dependence-based permutation alignment scheme is presented for frequency-domain blind source separation. Because beamforming shortens the mixing filters and reduces noise before blind source separation, the combined method works better in reverberation. The performance of the proposed method is investigated by separating up to 4 sources in different environments with reverberation times from 100 ms to 700 ms. Simulation results verify that the proposed method outperforms beamforming or blind source separation used alone. Analysis demonstrates that the proposed method is computationally efficient and appropriate for real-time processing.


Introduction
The objective of acoustic source separation is to estimate original sound sources from the mixed signals. This technique has found a lot of applications in noise-robust speech recognition and high-quality hands-free telecommunication systems. A classical example is to separate audio sources observed in a real room, known as a cocktail party environment, where a number of people are talking concurrently. A lot of research has focused on the problem but development is currently still in progress. Two kinds of techniques are promising in achieving source separation with multiple microphones: beamforming and blind source separation.
Beamforming is a technique used in sensor arrays for directional signal reception [1,2]. Based on a model of the wavefront from acoustic sources, it can enhance the target direction and suppress unwanted ones by coherently summing signals from the sensors. Beamforming can be classified as either fixed or adaptive, depending on how the beamformer weights are chosen. The weights of a fixed beamformer do not depend on the array data and are chosen to present a specified response for all scenarios. The most conventional fixed beamformer is the delay-and-sum beamformer, which, however, requires a large number of microphones to achieve high performance. Another fixed beamformer, the filter-and-sum beamformer, achieves a superdirective response with optimized weights. The weights of an adaptive beamformer are chosen based on the statistics of the array data to optimize the array response. In a source separation system, each source signal may be obtained separately using the directivity of the array if the directions of the sources are known. However, beamforming has limited performance in highly reverberant conditions because it cannot suppress the interfering reverberation coming from the desired direction.
Blind source separation (BSS) is a technique for recovering the source signals from observed signals with the mixing process unknown [3]. It relies only on the independence assumption of the source signals to estimate them from the mixtures. The cocktail party problem is a challenge because the mixing process is convolutive, where the observations are combinations of filtered versions of the sources. A large number of unmixing filter coefficients must be calculated simultaneously to recover the original signals. The convolutive BSS problem can be solved in the time domain or the frequency domain [4]. In time-domain BSS, the separation network is derived by optimizing a time-domain cost function [5][6][7]. However, these approaches may not be effective due to slow convergence and large computational load. In frequency-domain BSS, the observed time-domain signals are converted into the time-frequency domain by short-time Fourier transform (STFT); then instantaneous BSS is applied to each frequency bin, after which the separated signals of all frequency bins are combined and inverse-transformed to the time domain [8,9]. Although satisfactory instantaneous separation may be achieved within all frequency bins, combining them to recover the original sources is a challenge because of the unknown permutations associated with individual frequency bins. This is the permutation ambiguity problem. There are two common strategies to solve this problem. The first strategy is to exploit the interfrequency dependence of separated signals [10,11]. The second strategy is to exploit the position information of sources such as direction of arrival [12,13]. By analyzing the directivity pattern formed by a separation matrix, source direction can be estimated and permutations aligned. Generally these two strategies can be combined to get a better permutation alignment [14].
EURASIP Journal on Audio, Speech, and Music Processing
Besides the permutation problem, another fundamental problem also limits the performance of frequency-domain BSS: the dilemma in determining the STFT analysis frame length [15][16][17]. Frames shorter than the mixing filters generate incomplete instantaneous mixtures, while long frames collapse the independence measure at each frequency bin and disturb separation. The conflict is even more severe in high reverberation with long mixing filters. Generally, a frequency-domain BSS method that works well in low (100-200 ms) reverberation has degraded performance in medium (200-500 ms) and high (>500 ms) reverberation. Since the problem originates from a processing step in frequency-domain BSS that approximates linear convolutions with circular convolutions, we call it the "circular convolution approximation problem". This problem will be further elaborated in Section 2.2. Although great progress has been made on the permutation problem in recent years, few methods have been proposed with good separation results in a highly reverberant environment.
To improve the separation performance in high reverberation, this paper proposes a method which combines beamforming and blind source separation. Assuming the sound source locations are known, the proposed method employs beamforming as a preprocessor for blind source separation. With beamforming reducing reverberation and enhancing signal-to-noise ratio, blind source separation works well in reverberant environments, and thus the combined method performs better than using either of the two methods alone. Since the proposed method requires the knowledge of source locations for beamforming, it is a semiblind method. However, the source locations may be estimated with an array sound source localization algorithm or using other approaches, which is beyond the scope of this paper [18,19].
In fact, the relationship between blind source separation and beamforming has been intensively investigated in recent years, and adaptive beamforming is commonly used to explain the physical principle of convolutive BSS [15,20]. In addition, many approaches have been presented that combine both techniques. Some of these combined approaches are aimed at resolving the permutation ambiguity inherent in frequency-domain BSS [12,21], whereas other approaches utilize beamforming to provide a good initialization for BSS or to accelerate its convergence [22][23][24]. As far as we know, there have been no systematic studies on a direct application of the BSS-beamforming combination to highly reverberant environments.
The rest of the paper is organized as follows. Frequency-domain BSS and its circular convolution approximation problem are introduced in Section 2. The proposed method combining BSS and beamforming is presented in Section 3. Section 4 gives experimental results in various reverberant environments. Finally, conclusions are drawn in Section 5.

Frequency-Domain BSS and Its Fundamental Problem
Suppose $N$ sources $s(n) = [s_1(n), \ldots, s_N(n)]^T$ are observed as $x(n) = [x_1(n), \ldots, x_M(n)]^T$ at $M$ microphones. With the mixing channels modeled by FIR filters of length $P$, the convolutive mixing process is formulated as

$$x(n) = H(n) * s(n) = \sum_{p=0}^{P-1} H(p)\, s(n-p), \tag{1}$$

where $H(n)$ is a sequence of $M \times N$ matrices containing the impulse responses of the mixing channels, and the operator "$*$" denotes matrix convolution. For separation, we use FIR filters of length $L$ and obtain the estimated source signal vector $y(n) = [y_1(n), \ldots, y_N(n)]^T$ by

$$y(n) = W(n) * x(n) = \sum_{l=0}^{L-1} W(l)\, x(n-l), \tag{2}$$

where $W(n)$ is a sequence of $N \times M$ matrices containing the unmixing filters. The unmixing network $W(n)$ can be obtained by a frequency-domain BSS approach. After transforming the signals to the time-frequency domain using a blockwise $L$-point short-time Fourier transform (STFT), the convolution becomes a multiplication

$$X(m, f) = H(f)\, S(m, f), \tag{3}$$

where $m$ is the frame index, $X(m, f)$ and $S(m, f)$ are the STFTs of $x(n)$ and $s(n)$, and $H(f)$ is the Fourier transform of $H(n)$. Frequency-domain BSS assumes that the time series at each bin are mutually independent. It is then possible to separate them using complex-valued instantaneous BSS algorithms such as FastICA [25] and Infomax [26,27], which are considered to be quite mature. However, there are scaling and permutation ambiguities at each bin. This is expressed as

$$Y(m, f) = W(f)\, X(m, f) = \Pi(f)\, D(f)\, S(m, f), \tag{4}$$

where $Y(m, f)$ is the STFT of $y(n)$, $W(f)$ is the Fourier transform of $W(n)$, $\Pi(f)$ is a permutation matrix, and $D(f)$ is a scaling matrix, all at frequency $f$. The source permutation and gain indeterminacy are problems inherent in frequency-domain BSS. It is necessary to correct them before transforming the signals back to the time domain.
Finally the unmixing network W(n) is obtained by inverse Fourier transforming W( f ), and the estimated source y(n) is obtained by filtering x(n) through W(n). The workflow of the frequency-domain BSS is shown in Figure 1.

Circular Convolution Approximation Problem.
Besides permutation and scaling ambiguities, another problem also affects the performance of frequency-domain BSS: the STFT circular convolution approximation. In the frequency domain, the convolutive mixture is reduced to an instantaneous mixture for each frequency bin. The model (3) is simple but generates two errors for a short STFT analysis frame length L [16].
(1) The STFT covers only L samples of the impulse response H(n), not its entirety.
(2) Equation (3) is only an approximation since it implies a circular convolution but not a linear convolution in the time domain; it is correct only when the mixing filter length P is short compared to L.
As a result, it is necessary to work with L ≫ P to ensure the accuracy of (3). However in that case, the instantaneous separation performance is saturated before reaching a sufficient separation, because decreased time resolution for STFT and fewer data available in each frequency bin will collapse the independence assumption and deteriorate instantaneous separation [15,17].
In a nutshell, short frames make the conversion to an instantaneous mixture incomplete, while long ones disturb the separation. This contradiction is even more severe in highly reverberant environments, where the mixing filters are much longer than the STFT analysis frame. This is the reason for the poor performance of frequency-domain BSS in high reverberation.
It is necessary to work with L ≫ P to ensure the accuracy of (3). In this case, however, long frames worsen the time resolution in the time-frequency domain and decrease the number of samples in each bin. As a result, the independence of the source signals decreases greatly at some bins, which deteriorates instantaneous BSS and hence significantly reduces convolutive BSS performance in high reverberation [15,17]. In other words, short frames make the conversion to an instantaneous mixture incomplete, while long ones disturb the separation. The conflict becomes more severe in highly reverberant environments and leads to degraded performance.
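The two errors listed above can be illustrated numerically on a single frame. The following is a minimal sketch, assuming NumPy, a synthetic exponentially decaying random filter as a stand-in for a room impulse response, and no analysis window; the function names and parameter values are hypothetical and are not part of the proposed method. It compares the bin-wise product model of (3), which corresponds to a circular convolution, with the true linear convolution.

```python
import numpy as np

def bin_product_model(h, s, frame_len):
    """One-frame version of model (3): IFFT(FFT(h) * FFT(s)), i.e. a circular convolution."""
    H = np.fft.fft(h, frame_len)  # crops h when len(h) > frame_len (error 1)
    S = np.fft.fft(s, frame_len)
    return np.real(np.fft.ifft(H * S))  # wrap-around differs from linear convolution (error 2)

def relative_error(filter_len, frame_len, seed=0):
    """Relative error between the frame-wise model and the true linear convolution."""
    rng = np.random.default_rng(seed)
    decay = np.exp(-4.0 * np.arange(filter_len) / filter_len)
    h = rng.standard_normal(filter_len) * decay  # synthetic stand-in for a mixing filter
    s = rng.standard_normal(frame_len)           # one frame of source signal
    linear = np.convolve(h, s)[:frame_len]
    approx = bin_product_model(h, s, frame_len)
    return np.linalg.norm(linear - approx) / np.linalg.norm(linear)

# Filter much shorter than the frame (P << L) versus filter longer than the frame (P > L).
err_short_filter = relative_error(filter_len=16, frame_len=1024)
err_long_filter = relative_error(filter_len=2048, frame_len=1024)
print(f"P=16,   L=1024: relative error {err_short_filter:.3f}")
print(f"P=2048, L=1024: relative error {err_long_filter:.3f}")
```

In this sketch the error stays small when the filter is much shorter than the frame and grows substantially when the filter exceeds the frame, which is the situation encountered in high reverberation.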

Combined Separation Method
Based on the analysis above, the circular convolution approximation problem seriously degrades the separation performance in high reverberation. However, the problem may be mitigated if the mixing filters become shorter. With a directive response that enhances the desired direction and suppresses unwanted ones, beamforming can attenuate the reflected paths and hence shorten the mixing filters indirectly. It thus may help compensate for the deficiency of blind source separation. From another point of view, beamforming makes primary use of spatial information, while blind source separation utilizes the statistical information contained in the signals.
Integrating both pieces of information should help obtain better separation results, just as our ears do when separating audio signals [28]. In summary, if we use beamforming as a preprocessor for blind source separation, at least three advantages can be achieved.
(1) The interfering residuals due to reverberation after beamforming are further reduced by blind source separation.
(2) The poor separation performance of blind source separation in reverberant environments is compensated for by beamforming, which suppresses the reflected paths and shortens the mixing filters.
(3) The beamformer enhances the source in its path and suppresses the ones outside. It thus enhances the signal-to-noise ratio and provides a cleaner output for blind source separation to process.
Assuming the source directions are known, we propose a combined method as illustrated in Figure 2. For N sources received by an array of M microphones, N beams are formed towards them, respectively. Then the N beamformed outputs are fed to blind source separation to recover the N sources. The workflow of the proposed method is shown in Figure 3.
The mixing stage is expressed as

$$u(n) = H(n) * s(n), \tag{5}$$

where $s(n) = [s_1(n), \ldots, s_N(n)]^T$ is the source vector, $u(n) = [u_1(n), \ldots, u_M(n)]^T$ is the observed vector, $H(n)$ is a sequence of $M \times N$ matrices containing the impulse responses of the mixing channels, and the operator "$*$" denotes matrix convolution.
The beamforming stage is expressed as

$$x(n) = B(n) * u(n) = B(n) * H(n) * s(n) = F(n) * s(n), \tag{6}$$

where $x(n) = [x_1(n), \ldots, x_N(n)]^T$ is the beamformed output vector, $B(n)$ is a sequence of $N \times M$ matrices containing the impulse responses of the beamformer, and $F(n)$ is the global impulse response obtained by combining $H(n)$ and $B(n)$.
The blind source separation stage is expressed as

$$y(n) = W(n) * x(n), \tag{7}$$

where $y(n) = [y_1(n), \ldots, y_N(n)]^T$ is the estimated source signal vector and $W(n)$ is a sequence of $N \times N$ matrices containing the unmixing filters. It can be seen from (5)-(7) that, with beamforming reducing reverberation and enhancing the signal-to-noise ratio, the combined method replaces the original mixing network $H(n)$, which results from the room impulse response, with a new mixing network $F(n)$, which is easier to separate.
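The cascade of mixing and beamforming can be checked numerically. The sketch below, assuming NumPy and random filters in place of real room and beamformer responses, implements matrix convolution and verifies that filtering the sources through H(n) and then B(n) equals filtering them once through the global response F(n) = B(n) * H(n). The helper names `matrix_convolve` and `filter_signals` are hypothetical.

```python
import numpy as np

def matrix_convolve(A, B):
    """Matrix convolution C(n) = sum_k A(k) B(n - k).

    A has shape (La, I, J), B has shape (Lb, J, K); the result has shape (La + Lb - 1, I, K).
    """
    La, I, J = A.shape
    Lb, J2, K = B.shape
    assert J == J2, "inner dimensions must agree"
    C = np.zeros((La + Lb - 1, I, K))
    for i in range(I):
        for k in range(K):
            for j in range(J):
                C[:, i, k] += np.convolve(A[:, i, j], B[:, j, k])
    return C

def filter_signals(A, s):
    """Apply a filter matrix A (taps, outputs, inputs) to signals s (inputs, samples)."""
    return matrix_convolve(A, s.T[:, :, None])[:, :, 0].T

rng = np.random.default_rng(0)
N, M = 2, 4                              # sources, microphones
H = rng.standard_normal((8, M, N))       # stand-in mixing filters, P = 8
B = rng.standard_normal((6, N, M))       # stand-in beamforming filters, Q = 6
s = rng.standard_normal((N, 50))         # source signals

u = filter_signals(H, s)                 # mixing stage (5)
x_two_stage = filter_signals(B, u)       # beamforming stage (6)
F = matrix_convolve(B, H)                # global response F(n) = B(n) * H(n)
x_global = filter_signals(F, s)          # equivalent single-stage filtering
print(F.shape, np.allclose(x_two_stage, x_global))
```

The blind source separation stage therefore only has to unmix the N × N network F(n) rather than the original M × N network H(n).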
Regarding the implementation details, two techniques are employed: a superdirective beamformer, which can fully exploit the dereverberation and noise reduction ability of a microphone array, and frequency-domain blind source separation, which is well known for its fast convergence and low computational cost. These two techniques are addressed below.
3.1. Beamforming.
A beamformer can be implemented as a fixed one or an adaptive one. For the combined method, an adaptive beamformer is less appropriate than a fixed one, for the following reasons.
(1) An adaptive beamformer obtains its directive response mainly by analyzing the statistical information contained in the array data, not by utilizing the spatial information directly. Its essence is similar to that of convolutive blind source separation [15]. Cascading them together is equivalent to using the same technique repeatedly, hence contributing little to performance improvement.
(2) An adaptive beamformer generally adapts its weights during breaks in the target signal [1]. However, it is a challenge to predict signal breaks when several people are talking concurrently. This significantly limits the applicability of adaptive beamforming to source separation.
In contrast, a fixed beamformer, which relies mainly on spatial information, does not have such disadvantages. It is data-independent and more stable. Given a look direction, the directive response is obtained for all scenarios. Thus a fixed beamformer is preferred in the proposed method.
Fixed beamforming achieves a directional response by coherently summing signals from multiple sensors based on a model of the wavefront from acoustic sources. The most common beamformer is the delay-and-sum one; a filter-and-sum beamformer, however, achieves a superdirective response with optimized weights. Its principle is given in Figure 4. The beamformer produces a weighted sum of signals from M sensors to enhance the target direction [29].

Figure 4: Principle of a filter-and-sum beamformer.

A frequency-domain method is employed to design the superdirective beamformer. Consider a beamformer model with a target source $r(t)$ and background noise $n(t)$. The component received by the $l$th sensor is $u_l(t) = r_l(t) + n_l(t)$ in the time domain; similarly, in the frequency domain, the $l$th sensor output is $u_l(f) = r_l(f) + n_l(f)$. The array output in the frequency domain is

$$x(f) = b^H(f)\, u(f), \tag{8}$$

where $b(f) = [b_1(f), \ldots, b_M(f)]^T$ is the beamforming weight vector composed of the beamforming weights for each sensor, $u(f) = [u_1(f), \ldots, u_M(f)]^T$ is the output vector composed of the outputs from each sensor, and $(\cdot)^H$ denotes conjugate transpose. The vector $b(f)$ depends on the array geometry and source directivity, as well as on the array output optimization criterion, such as a signal-to-noise ratio (SNR) gain criterion [29][30][31].
Suppose $r(f) = [r_1(f), \ldots, r_M(f)]^T$ is the source vector, which is composed of the target source signals from the sensors, and $n(f)$ is the noise vector, which is composed of the spatially diffuse noises from the sensors. The array gain is a measure of the improvement in signal-to-noise ratio. It is defined as the ratio of the SNR at the output of the beamforming array to the SNR at a single reference microphone. For development of the theory, the reference SNR is defined, as in [29], to be the ratio of the average signal power spectral density over the microphone array, $\sigma_r^2(f) = E\{r^H(f)\, r(f)\}/M$, to the average noise power spectral density over the array, $\sigma_n^2(f) = E\{n^H(f)\, n(f)\}/M$. By derivation, the array gain at frequency $f$ is expressed as

$$G(f) = \frac{b^H(f)\, R_{rr}(f)\, b(f)}{b^H(f)\, R_{nn}(f)\, b(f)}, \tag{9}$$

where $R_{rr}(f) = r(f)\, r^H(f)/\sigma_r^2(f)$ is the normalized signal cross-power spectral density matrix, and $R_{nn}(f) = n(f)\, n^H(f)/\sigma_n^2(f)$ is the normalized noise cross-power spectral density matrix. Provided $R_{nn}(f)$ is nonsingular, the array gain is maximized with the weight vector

$$b_{\mathrm{opt}}(f) = R_{nn}^{-1}(f)\, r(f). \tag{10}$$

The terms $R_{nn}(f)$ and $r(f)$ in (10) depend on the array geometry and the target source direction. For a circular array, the calculation of $R_{nn}(f)$ and $r(f)$ is given as follows [2]. Figure 5 shows an $M$-element circular array with a radius of $r$ and a target source coming from the direction $(\theta, \phi)$. The elements are equally spaced around the circumference, and their positions are determined from the layout of the array. The source vector $r(f)$ is derived from these positions and the source direction, as in (12), where $k = 2\pi f/c$ is the wave number and $c$ is the sound velocity.
The normalized noise cross-power spectral density matrix $R_{nn}(f)$ is expressed as

$$\left(R_{nn}(f)\right)_{m_1 m_2} = \frac{\sin(k \rho_{m_1 m_2})}{k \rho_{m_1 m_2}}, \tag{13}$$

where $(R_{nn}(f))_{m_1 m_2}$ is the $(m_1, m_2)$ entry of the matrix $R_{nn}(f)$ (equal to 1 for $m_1 = m_2$), $m_1, m_2 = 1, \ldots, M$, $k$ is the wave number, and $\rho_{m_1 m_2}$ is the distance between the two microphones $m_1$ and $m_2$. After calculating the beamforming vector by (10), (12), and (13) at each frequency bin, the time-domain beamforming filter $b(n)$ is obtained by inverse Fourier transforming $b_{\mathrm{opt}}(f)$.
The procedure above designs a beamformer with only one target direction. For $N$ sources with known directions, $N$ beams are designed pointing at them, respectively. Finally, supposing the observed vector at the $M$ sensors is $u(n) = [u_1(n), \ldots, u_M(n)]^T$, the multiple beamforming is formulated as

$$x(n) = B(n) * u(n) = \sum_{q=0}^{Q-1} B(q)\, u(n-q), \tag{14}$$

where $B(n)$ is a sequence of $N \times M$ matrices containing the impulse responses of the beamformer, $Q$ is the length of the beamforming filter, and $x(n) = [x_1(n), \ldots, x_N(n)]^T$ is the beamformed output vector.
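The design at a single frequency can be sketched as follows, assuming NumPy. Several choices here are illustrative assumptions rather than details taken from the paper: the array lies in the x-y plane with the first element on the x axis, θ is the polar angle and φ the azimuth, the source is in the far field, the sound velocity is 343 m/s, and a small diagonal loading term is added to the noise matrix for numerical robustness. The weights are additionally normalized for unit response toward the target, which does not change the array gain.

```python
import numpy as np

SOUND_SPEED = 343.0  # m/s, assumed value

def circular_array_positions(num_mics, radius):
    """Assumed layout: elements equally spaced on a circle in the x-y plane."""
    angles = 2.0 * np.pi * np.arange(num_mics) / num_mics
    return np.stack([radius * np.cos(angles), radius * np.sin(angles),
                     np.zeros(num_mics)], axis=1)

def source_vector(positions, theta, phi, freq):
    """Far-field source vector r(f); theta is the polar angle, phi the azimuth (assumed)."""
    k = 2.0 * np.pi * freq / SOUND_SPEED  # wave number
    direction = np.array([np.sin(theta) * np.cos(phi),
                          np.sin(theta) * np.sin(phi),
                          np.cos(theta)])
    return np.exp(1j * k * (positions @ direction))

def diffuse_noise_matrix(positions, freq):
    """Diffuse-noise coherence sin(k rho) / (k rho) between all microphone pairs."""
    k = 2.0 * np.pi * freq / SOUND_SPEED
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    return np.sinc(k * dist / np.pi)  # np.sinc(x) = sin(pi x) / (pi x)

def superdirective_weights(R_nn, r):
    """Weights proportional to R_nn^{-1} r, scaled so that b^H r = 1."""
    b = np.linalg.solve(R_nn, r)
    return b / np.conj(np.vdot(b, r))

def array_gain(b, r, R_nn):
    """Array gain |b^H r|^2 / (b^H R_nn b) for a unit-modulus source vector."""
    return np.abs(np.vdot(b, r)) ** 2 / np.real(np.vdot(b, R_nn @ b))

positions = circular_array_positions(num_mics=16, radius=0.2)
freq = 1000.0
r = source_vector(positions, theta=np.pi / 2, phi=np.pi / 3, freq=freq)
R = diffuse_noise_matrix(positions, freq) + 1e-2 * np.eye(16)  # diagonal loading (assumption)

b_sd = superdirective_weights(R, r)
b_ds = r / 16  # delay-and-sum weights for comparison
gain_sd = array_gain(b_sd, r, R)
gain_ds = array_gain(b_ds, r, R)
print(f"array gain: superdirective {gain_sd:.2f}, delay-and-sum {gain_ds:.2f}")
```

Repeating this computation at each frequency bin and inverse Fourier transforming the resulting weights would yield one time-domain beamforming filter per microphone, and repeating it for each source direction would yield the rows of B(n).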

Frequency-Domain Blind Source Separation.
As discussed before, the workflow of frequency-domain blind source separation is shown in Figure 1. Three realization details will be addressed: instantaneous BSS, permutation alignment, and scaling correction.

Figure 6: Simulated room environment with a microphone array beamformer.

Instantaneous BSS.
After decomposing time-domain convolutive mixing into frequency-domain instantaneous mixing, it is possible to perform separation at each frequency bin with a complex-valued instantaneous BSS algorithm.
Here we use the Scaled Infomax algorithm, which is not sensitive to initial values and is able to converge to the optimal solution within 100 iterations [32].

Permutation Alignment.
The permutation ambiguity inherent in frequency-domain BSS is a challenge in the combined method. Generally, there are two approaches to cope with the permutation problem. One is to exploit the dependence of separated signals across frequencies. The other is to exploit the position information of the sources: the directivity pattern of the mixing/unmixing matrix provides a good reference for permutation alignment. However, in the combined method, the directivity information contained in the mixing matrix no longer exists after beamforming. Even if the source positions are known, they are not very helpful for permutation alignment. Consequently, what we can use for permutation alignment is only the first reference: the interfrequency dependence of separated signals. In [33] we have proposed a permutation alignment approach with good results, which is based on an interfrequency dependence measure: the powers of separated signals. Its principle is briefly given below.
An interfrequency dependence measure, the correlation coefficient of separated signal power ratios, exhibits a clear interfrequency dependence among all frequencies. Suppose the $M \times N$ mixing network at frequency $f$ can be estimated from the separation network by

$$A(f) = W^{-1}(f) = [a_1(f), \ldots, a_N(f)],$$

where $a_i(f)$ is the $i$th column vector of $A(f)$, and $(\cdot)^{-1}$ denotes inversion of a square matrix or pseudoinversion of a rectangular matrix. The power ratio, which measures the dominance of the $i$th separated signal in the observations at frequency $f$, is defined, as in [11], to be

$$v_i^f(m) = \frac{\left\| a_i(f)\, Y_i(m, f) \right\|^2}{\sum_{k=1}^{N} \left\| a_k(f)\, Y_k(m, f) \right\|^2}, \tag{17}$$

where the denominator is the total power of the observed signals $X(m, f)$ and the numerator is the power of the $i$th separated signal. Being in the range $[0, 1]$, (17) is close to 1 when the $i$th separated signal is dominant, and close to 0 when others are dominant. The power ratio measure can clearly exhibit the signal activity due to the sparsity of speech signals.
The correlation coefficient of signal power ratios can be used for measuring interfrequency dependence and solving the permutation problem. The normalized binwise correlation coefficient between two power ratio sequences is

$$\rho\!\left(v_i^{f_1}, v_j^{f_2}\right) = \frac{r_{ij}(f_1, f_2) - \mu_i(f_1)\, \mu_j(f_2)}{\sigma_i(f_1)\, \sigma_j(f_2)}, \tag{18}$$

where $i$ and $j$ are indices of two separated channels, $f_1$ and $f_2$ are two frequencies, and $r_{ij}(f_1, f_2) = E\{v_i^{f_1} v_j^{f_2}\}$, $\mu_i(f) = E\{v_i^f\}$, and $\sigma_i(f) = \sqrt{E\{(v_i^f)^2\} - \mu_i^2(f)}$ are, respectively, the correlation, mean, and standard deviation over time $m$ (the time index $m$ is omitted for clarity). Note that $E\{\cdot\}$ denotes expectation. Being in the range $[-1, 1]$, (18) tends to be high if the output channels $i$ and $j$ originate from the same source and low if they represent different sources. This property will be used for aligning the permutation.
Reference [33] has proposed a permutation alignment approach based on the power ratio measure. Binwise permutation alignment is applied first across all frequency bins, using the correlation of separated signal powers; then the full frequency band is partitioned into small regions based on the binwise permutation alignment result. Finally, regionwise permutation alignment is performed, which can prevent the spreading of the misalignment at isolated frequency bins to others and thus improves permutation. This permutation alignment approach is employed in the proposed method.
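The binwise step can be illustrated with a small sketch, assuming NumPy and synthetic data. It computes the power ratio of (17) at two bins and chooses the permutation that maximizes the sum of correlation coefficients (18) against a reference bin. This is a simplified illustration only: the exhaustive search over permutations, the synthetic on/off envelopes, and the function names are assumptions, and the regionwise stage of [33] is not reproduced.

```python
import numpy as np
from itertools import permutations

def power_ratio(A, Y, eps=1e-12):
    """Power ratio (17). A: (M, N) estimated mixing matrix; Y: (N, T) separated frames."""
    power = (np.linalg.norm(A, axis=0) ** 2)[:, None] * np.abs(Y) ** 2
    return power / (power.sum(axis=0, keepdims=True) + eps)

def align_to_reference(v_ref, v):
    """Return perm such that v[perm[i]] best matches v_ref[i] in the sense of (18)."""
    n = v.shape[0]
    corr = np.corrcoef(v_ref, v)[:n, n:]  # corr[i, j] = rho(v_ref[i], v[j])
    return max(permutations(range(n)),
               key=lambda p: sum(corr[i, p[i]] for i in range(n)))

# Synthetic example: two sources with complementary on/off activity, observed at two bins.
rng = np.random.default_rng(0)
T = 400
active = (rng.random(T) < 0.5).astype(float)
envelopes = np.stack([0.05 + active, 1.05 - active])

def make_bin():
    """Separated signals at one bin: shared envelopes, independent complex carriers."""
    carriers = rng.standard_normal((2, T)) + 1j * rng.standard_normal((2, T))
    return envelopes * carriers

A = np.eye(2)                     # identity mixing estimate keeps the example simple
Y_ref = make_bin()                # reference bin, channels in the correct order
Y_swapped = make_bin()[::-1]      # second bin with its two outputs swapped

v_ref = power_ratio(A, Y_ref)
v_swapped = power_ratio(A, Y_swapped)
perm = align_to_reference(v_ref, v_swapped)
print(perm)                       # indices that restore the reference channel order
```

Applying `perm` to the rows of the separated signals (and of W(f)) at the second bin restores the channel order of the reference bin.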

Scaling Correction.
The scaling indeterminacy can be resolved relatively easily by using the Minimal Distortion Principle [34]:

$$W_s(f) = \operatorname{diag}\!\left(W_p^{-1}(f)\right) W_p(f), \tag{19}$$

where $W_p(f)$ is $W(f)$ after permutation correction and $W_s(f)$ is the one after scaling correction; $(\cdot)^{-1}$ denotes inversion of a square matrix or pseudoinversion of a rectangular matrix; $\operatorname{diag}(\cdot)$ retains only the main diagonal components of the matrix.

Computational Complexity
Suppose there are $N$ sources and $M$ microphones, the length of the input signal is $T$, the length of the beamforming filter is $Q$, and the length of the unmixing filter is $L$. The beamforming filtering and unmixing filtering can be implemented by FFT. The computation cost of the proposed algorithm is summarized in Table 1. (The computation cost of separation filter estimation is given in [33].) For convenience, only complex-valued multiplication operations are considered. To summarize, the total computation cost for the $MT$ input data points is

$$c_{\mathrm{total}} = 2NT \left[ M \log_2 Q + N \left( 2\,\mathrm{iter} + 12 + \log_2 L \right) \right]. \tag{20}$$

The average computation for each sample time with $M$ input data points is obtained by dividing (20) by $T$.
We think the result is quite acceptable. For 4 sources recorded by a 16-element microphone array, iter = 100, Q = L = 2048, the average computation involves about 7200 complex-valued multiplications for each sample time (with 16 sample points). Thus, in terms of computational complexity, the proposed algorithm is promising for real-time applications.

Experiment Results and Analysis
We evaluate the performance of the proposed method in simulated experiments in two parts. The first part verifies the dereverberation performance of beamforming. The second investigates the performance of the proposed method in various reverberant conditions, and compares it with a BSS-only method and a beamforming-only one.
The implementation details of the algorithm are as follows. For blind source separation, the Tukey window is used in the STFT, with a shift size of 1/4 of the window length. The iteration number of the instantaneous Scaled Infomax algorithm is 100. The processing bandwidth is between 100 and 3750 Hz (the sampling rate being 8 kHz). The STFT frame size varies according to the experimental conditions. For beamforming, a circular microphone array is used to design the beamformer with a filter length of 2048; the array size varies according to the experimental conditions.

Simulation Environment and Evaluation Measures.
The simulation environment is shown in Figure 6. The room size is 7 m × 5 m × 3 m, and all sources and microphones are 1.5 m high. The room impulse response was obtained by using the image method [35], and the reverberation time was controlled by varying the absorption coefficient of the wall.
The separation performance is measured by the signal-to-interference ratio (SIR) in dB.
Before beamforming, the input SIR of the $J$th channel, $\mathrm{SIRIN}_J$, is computed from the elements $h_{Jk}(n)$ of the mixing system $H(n)$ (see (1)), where $M$ is the total number of microphones and $\|\cdot\|$ denotes the 2-norm. After beamforming, the SIR of the $J$th channel, $\mathrm{SIRBM}_J$, is computed from the elements $f_{Jk}(n)$ of $F(n)$, where $N$ is the total number of beams. After blind source separation, the SIR of the $J$th channel is

$$\mathrm{SIROUT}_J = 10 \log_{10} \frac{\max_k \left\| g_{Jk}(n) \right\|^2}{\sum_{k=1}^{N} \left\| g_{Jk}(n) \right\|^2 - \max_k \left\| g_{Jk}(n) \right\|^2},$$

where $N$ is the total number of sources and $g_{Jk}(n)$ is an element of $G(n) = W(n) * B(n) * H(n)$, the overall impulse response matrix obtained by combining the mixing system, beamforming, and blind source separation.
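The output SIR can be computed directly from the overall impulse responses. The following is a minimal sketch, assuming NumPy, an array layout of (taps, outputs, sources) for G(n), and the interpretation of the norm terms as filter energies (squared 2-norms); the function name and the toy response are hypothetical.

```python
import numpy as np

def output_sir_db(G):
    """SIR per output channel J, in dB, from overall impulse responses G of shape (taps, N, N)."""
    energy = np.sum(np.abs(G) ** 2, axis=0)      # energy[J, k] = ||g_Jk(n)||^2
    target = energy.max(axis=1)                  # strongest source in each output
    interference = energy.sum(axis=1) - target   # energy of all remaining sources
    return 10.0 * np.log10(target / interference)

# Toy overall response: each output passes one source and leaks 10% of the other.
G = np.zeros((3, 2, 2))
G[0] = [[1.0, 0.1],
        [0.1, 1.0]]
sir_db = output_sir_db(G)
print(sir_db)
```

An amplitude leakage of 0.1 corresponds to an energy ratio of 100, that is, an SIR of 20 dB for each channel in this toy case.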

Dereverberation Experiment.
The proposed algorithm is used for separating three sources using a 16-element circular microphone array with a radius of 0.2 m. The environment is shown in Figure 6. The simulated room reverberation time is RT60 = 300 ms, where RT60 is the time required for the sound level to decrease by 60 dB. This is a medium reverberant condition. One typical room impulse response is shown in Figure 7(a). Three source locations (2, 4, 6) are used, and the sources are two male speech signals and one female speech signal of 8 seconds each. Three beams are formed by the microphone array pointing at the three sources, respectively. The impulse responses associated with the global transfer function of beamforming are shown in Figure 8; they are calculated from the impulse responses of the mixing filters and beamforming filters using $F(n) = B(n) * H(n)$. It can be seen that the diagonal components in Figure 8 dominate the off-diagonal ones. This implies that the target sources are dominant in the outputs. To demonstrate the dereverberation performance of beamforming, Figure 8(a) is enlarged in Figure 7(b) and compared with the original impulse response in Figure 7(a). The mixing filter clearly becomes shorter after beamforming, and the reverberation becomes smaller. This indicates that dereverberation is achieved. Thus, the two advantages of beamforming, dereverberation and noise reduction, are observed as expected, and the new mixing network F(n) should be easier to separate than the original mixing network. In this experiment, the average input SIR is SIRIN = −2.8 dB, and the output one, enhanced by beamforming, is SIRBM = 3.3 dB. Setting the STFT frame size at 2048 and applying BSS to the beamformed signals, we get an average output SIR of the combined method of SIROUT = 16.3 dB, a 19.1 dB improvement over the input: a 6.1 dB improvement at the beamforming stage, and a further 13 dB improvement at the BSS stage.

Experiments in Reverberant Environments.
Three experiments are conducted to investigate the performance of the proposed method and compare it with the BSS-only and the beamforming-only method. The first examines the performance of the BSS-only method in medium reverberation with different STFT frame sizes. The second compares the performance of the proposed method and the other two methods in various reverberant conditions. The third examines the performance of the proposed method with various microphone array sizes.

BSS with Different STFT Frame Size.
The simulation environment for the BSS-only method, shown in Figure 9, is the same as that in Figure 6 except that the microphone array is replaced by four linearly arranged microphones. The distance between any two adjacent microphones is 6 cm. The reverberation time is RT60 = 300 ms. One 2 × 2 case (2 sources and 2 microphones) and one 4 × 4 case (4 sources and 4 microphones) were simulated. For the 2 × 2 case, microphones B and C and source locations (2, 6) are used. The sources are one male speech signal and one female speech signal of 8 seconds each. For the 4 × 4 case, all four microphones and four source locations (1, 2, 4, 6) are used. The sources are two male speech signals and two female speech signals of 8 seconds each. Blind source separation is tested with STFT frame sizes ranging from 512 to 5120. The output SIR of blind source separation is calculated in a manner similar to the one presented in Section 4.1. The simulation results are shown in Figure 10. The performance in the 2 × 2 case is always better than that in the 4 × 4 case, since it is easier to separate 2 sources than 4 sources. In both the 2 × 2 and 4 × 4 cases, the separation performance peaks at the STFT frame size of 2048. This verifies the earlier discussion about the dilemma in determining the STFT frame size: the separation performance saturates before reaching a sufficient performance level.
Obviously, an optimal STFT frame size may exist for a specific reverberation. However, due to complex acoustical environments and varieties of source signals, it is difficult to determine this value precisely. How to choose an appropriate frame length may be a topic of our future research. Generally, 1024 or 2048 can be used as a common frame length. Here we use an analysis frame length of 2048 for all reverberant conditions in the remaining experiments.

Performance Comparison among Three Methods.
The performances of the combined method, the BSS-only method, and the beamforming-only method are compared in different reverberant environments. The beamforming-only method is equal to the first processing stage of the combined method. The simulation environment of the combined method is shown in Figure 6 and that of the BSS-only method in Figure 9. For the combined method, a 16-element microphone array with a radius of 0.2 m is used. Various combinations of source locations are tested (2 sources and 4 sources). The sources are two male speech signals and two female speech signals of 8 seconds each. RT60 ranges from 100 ms to 700 ms in increments of 200 ms. The average input SIR does not vary significantly with the reverberation time: it is about 0 dB for 2 sources and −5 dB for 4 sources. For all three methods, the STFT frame size is set at 2048. The separation results are shown in Figure 11, with each panel depicting the output SIRs of the three methods for one source combination. It is observed in Figure 11 that, for each source configuration, the output SIRs of all methods decrease with increasing reverberation; however, the combined method always outperforms the other two. Beamforming performs worst among the three methods; nevertheless, it provides a good preprocessing result, and hence the combined method works better than the BSS-only method.
It is interesting to investigate how much improvement can be obtained from beamforming preprocessing under different reverberation conditions. To measure the contribution of this preprocessing, we define the relative improvement of the combined method over the BSS-only method in terms of the output SIRs, with the subscripts (·) b and (·) c standing for the BSS-only method and the combined method, respectively. We calculate the relative performance improvement for the 4 separation scenarios listed in Figure 11 and show the average result in Figure 12. As discussed previously, the performance is improved by the combined method for all reverberant conditions. However, it is also observed in Figure 12 that the improvement in low reverberation is not as large as in medium and high reverberation. That is, the use of beamforming in low reverberation is not as beneficial as it is in high reverberation. The reason is that BSS can work well alone when the circular convolution approximation problem is not evident, as in low reverberation, and thus the contribution of preprocessing is small. On the other hand, when the circular convolution approximation problem becomes severe in high reverberation, the contribution of preprocessing becomes crucial and hence the separation performance is improved significantly. The experiments in this part illustrate the superiority of the proposed method over using beamforming or blind source separation alone. The comparison between the proposed method and other hybrid methods in different reverberant conditions will be further investigated in our future research.
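As an illustration of how such a relative improvement could be computed and averaged over scenarios, the following minimal sketch assumes one common form, the SIR gain normalised by the BSS-only output SIR, (SIR_c − SIR_b) / SIR_b. This form is an assumption and may differ from the paper's own definition. The function names and the numbers in the usage example are hypothetical and are not results from the experiments.

```python
def relative_improvement(sir_b_db, sir_c_db):
    """Relative improvement of the combined method over BSS-only.

    Assumed form: (SIR_c - SIR_b) / SIR_b, with both output SIRs in dB.
    Requires a positive BSS-only SIR so that the ratio is meaningful.
    """
    if sir_b_db <= 0.0:
        raise ValueError("BSS-only output SIR must be positive for this ratio")
    return (sir_c_db - sir_b_db) / sir_b_db


def mean_relative_improvement(pairs):
    """Average the relative improvement over (sir_b_db, sir_c_db) pairs."""
    values = [relative_improvement(b, c) for b, c in pairs]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Made-up output SIRs (dB) for four scenarios, for illustration only.
    scenarios = [(6.0, 9.0), (4.0, 8.0), (5.0, 7.5), (8.0, 10.0)]
    print(f"mean relative improvement: {mean_relative_improvement(scenarios):.2%}")
```

A ratio of this kind grows when the BSS-only baseline is low, which matches the qualitative observation that preprocessing contributes most in medium and high reverberation.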

Performance of the Combined Method with Different Microphone Array Size.
Since the performance of a beamformer is significantly affected by the array size, it is reasonable to ask how much the array size affects the performance of the proposed method. Experiments are carried out to answer this question. The simulation environment is shown in Figure 6. Three microphone arrays are used to design the beamformer: an 8-element array with a radius of 0.1 m, a 16-element array with a radius of 0.2 m, and a 24-element array with a radius of 0.2 m. Various combinations of source locations are tested (2 sources and 4 sources). The sources are two male speeches and two female speeches of 8 seconds each. The STFT frame size is set at 2048. The performance of the proposed combined method under RT60 of 300 ms (medium reverberation) and 700 ms (high reverberation) is shown in Figures 13 and 14, respectively. It can be seen that, for all source configurations, the separation performance improves with increasing array size. For example, in the two bottom panels of Figure 14, the output SIR with an 8-element array is only about 2 dB, but rises to about 6 dB with a 24-element array. A higher output SIR can be anticipated for larger array sizes. However, the better performance is obtained at the cost of a higher computational load and the additional hardware associated with more microphones. Thus, a tradeoff should be considered in actual applications.
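The geometry of the three arrays can be compared with a short calculation. This is a minimal sketch that assumes the microphones are spaced uniformly on a circle (the text gives only the element count and radius) and a speed of sound of 343 m/s. The helper names are hypothetical.

```python
import math


def adjacent_spacing(n_mics, radius_m):
    """Chord length between neighbouring microphones on a uniform circle."""
    return 2.0 * radius_m * math.sin(math.pi / n_mics)


def half_wavelength_limit_hz(spacing_m, c=343.0):
    """Frequency at which the spacing equals half a wavelength (c assumed)."""
    return c / (2.0 * spacing_m)


if __name__ == "__main__":
    arrays = [(8, 0.1), (16, 0.2), (24, 0.2)]  # (elements, radius in metres)
    for n, r in arrays:
        d = adjacent_spacing(n, r)
        print(f"{n:2d} mics, radius {r:.1f} m: aperture {2 * r:.1f} m, "
              f"spacing {d * 100:.2f} cm, "
              f"half-wavelength limit {half_wavelength_limit_hz(d):.0f} Hz")
```

Under the uniform-circle assumption, the 8-element and 16-element arrays have almost the same adjacent spacing (about 7.7 cm and 7.8 cm) but the aperture doubles, whereas the 24-element array keeps the 0.4 m aperture and reduces the spacing to about 5.2 cm.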

Conclusion
Given the poor performance of blind source separation in high reverberation, the paper proposes a method which combines beamforming and blind source separation. Using superdirective beamforming as a preprocessor of frequency-domain blind source separation, the combined method is able to integrate the advantages of both techniques and compensate for the weaknesses of each used alone. Simulations in different conditions (RT60 from 100 ms to 700 ms) illustrate the superiority of the proposed method over using beamforming or blind source separation alone, and the performance improvement increases with the microphone array size. The proposed method is promising for real-time processing because of its high computational efficiency.