1 Introduction

Over the last 10–15 years, research on machine interfaces for voice pick-up in reverberant and noisy environments has been actively conducted using multi-channel systems such as microphone arrays [14]. Multi-channel techniques have proven useful in many applications such as hearing aids, hands-free communication, robotics, audio and video conferencing systems, and speech recognition [1, 2, 5, 6]. One of the most popular techniques applied to multi-microphone systems is optimal beamforming [1]. Optimal beamformers are formulated to exploit the spatial information of desired and undesired signals in such a way that the desired signal is extracted and the undesired signals are suppressed [1, 2]. Many methods have been proposed for determining the location of the desired source, such as a predefined, well-determined array geometry combined with source localization [7, 8], or a calibration method using training samples of pre-recorded desired and undesired sources [9, 10]. Based on this spatial information, optimal beamformers are designed to suppress the contribution of all undesired signals while preserving the contribution of the desired signal [1, 11, 12]. Specifically, the optimal beamformer weights are calculated using knowledge about the location of the target signal and the array geometry. It is also possible to obtain estimates of the speech and noise correlation matrices. These estimates are then used to form the optimal beamformer weights; for this method to be efficient, a priori knowledge about the statistical characteristics of the noise is necessary. When the background noise is stationary over the measurement period, either a voice activity detector (VAD) estimate [2] or a relative transfer function (RTF) estimate [13] can be found. Either of these estimates can be used to form optimal beamformers [2]. This leads us to a more general case where the spatial knowledge is not known a priori and the observed mixture signals are the only available information for speech separation and noise suppression. In this case, blind source separation (BSS) techniques can be deployed for separating the different sound sources. Many BSS techniques using microphone arrays have been proposed for speech separation in both the time domain and the frequency domain. Some prominent BSS techniques for speech separation are independent component analysis (ICA), maximum likelihood, second-order gradient, and kurtosis maximization [14–18]. Most BSS techniques are based on either the statistical independence or the non-stationarity of the different input sources in the observed signal.

Speech separation in a cocktail party or multiple-speaker environment is one of the significant problems in speech enhancement research. It occurs when the observed signals are obtained from several speakers at different spatial locations. Here, the spatial separation of the speech sources is very important because all speech signals have the same spectral characteristics. We can distinguish two different cases:

  1. When the sources' spatial information is available, many separation techniques such as steering beamforming, optimum beamforming, and post-filtering have been proposed [3, 4, 6, 10, 19]. In [19], we introduced a post-filtering method, implemented after an optimum beamformer, to extract the desired speech source from a mixture of signals in multiple-speaker environments. However, the source spatial information in those studies was obtained using a calibration method.

  2. When the sources' spatial information is not available, blind separation techniques in a multiple-speaker environment need to be employed. For this scenario, a number of different BSS techniques have been proposed for the case of two speech sources in both the time domain and the time–frequency domain [4, 18, 20–22]. When the number of speech sources is more than two, blind signal separation becomes a more complicated and computationally intensive problem [23–25]. For this case, popular blind separation techniques extract the desired source signal by finding a separating vector that maximizes a statistical property (such as non-Gaussianity in the ICA technique) of the extracted source signals [4, 24, 26, 27].

In this paper, a blind signal separation method is proposed which estimates the source spatial information without prior knowledge about the spatial locations of the speech sources in a three-speaker environment. Once estimated, the source spatial information is used to design optimum beamformers for extracting the speech sources from the observed signal. The estimation is performed in the frequency domain. Here, a spatial localization technique employing the steered response power phase transform (SRP-PHAT) is proposed for estimating each source's spatial information from the observed signal. SRP-PHAT localization employs the cross-correlation and phase transform weighting of the received signals from all microphone pairs in the array [28]. From the SRP-PHAT estimates, the proposed spatial localization technique calculates the spatial information of the three speech sources. Based on this information, an optimum beamformer is proposed for the extraction of each individual speech source from the observed signal. A permutation alignment is used to group each extracted signal into the correct source output before transforming the outputs into the time domain. Performance evaluations show that the proposed algorithm offers a good interference suppression level while maintaining low speech distortion.

Fig. 1 Position of three speakers and the microphone array in the three-speaker environment

The paper is organized as follows: Sect. 2 outlines the problem formulation and details the signal model. In Sect. 3, the spatial localization method is derived and discussed in detail. Section 4 provides the details and derivation of the optimum beamforming technique. Section 5 discusses the method used for permutation alignment. In Sect. 6, the experimental results are presented and discussed. Finally, Sect. 7 summarizes the paper.

2 Problem formulation

Consider a linear microphone array, as shown in Fig. 1, consisting of L microphones with observed mixture signals \({\mathbf x}(n)\). The observed signals are a speech mixture from three speakers sitting in front of the microphones. The observed sampled signal \({\mathbf x}(n)\) at one time instant is an \(L\times 1\) vector, which can be expressed as

$$\begin{aligned} {\mathbf x}(n)={\mathbf s}_1(n)+{\mathbf s}_2(n)+{\mathbf s}_3(n) \end{aligned}$$
(1)

where \({\mathbf s}_{1}(n)\), \({\mathbf s}_2(n)\) and \({\mathbf s}_3(n)\) are the received signals from each respective speech source. In the short-time Fourier transform (STFT) domain, the observed signal can be written as

$$\begin{aligned} {\mathbf x}(\omega ,k)={\mathbf s}_1(\omega ,k)+{\mathbf s}_2(\omega ,k)+{\mathbf s}_3(\omega ,k) \end{aligned}$$
(2)

where \({\mathbf x}(\omega ,k)\) is the observed signal and \({\mathbf s}_1(\omega ,k)\), \({\mathbf s}_2(\omega ,k)\) and \({\mathbf s}_3(\omega ,k)\) are the contributions from the first, the second and the third speech sources, respectively. The objective is to separate each individual source signal from the observed signal. As such, one speech source is treated as the desired source while the others become undesired, in a round-robin fashion. A VAD cannot be employed to detect the active or inactive periods of the desired source because all sources can be active at the same time. Thus, a spatial localization technique needs to be employed. Here, SRP-PHAT is utilized to estimate the spatial information for each speech source based only on the statistics of the observed signal.
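As a concrete illustration of this signal model, the following minimal sketch (ours, not from the paper) computes the STFT-domain frames \({\mathbf x}(\omega ,k)\) of an L-channel recording; the sampling rate and window length are assumed values.

```python
# Minimal sketch (illustrative): STFT of an L-channel observed signal.
# x has shape (L, n_samples); fs and nfft are assumed parameters.
from scipy.signal import stft

def observed_stft(x, fs=16000, nfft=512):
    """Return X[l, omega, k]: the STFT frames x(omega, k) for each channel l."""
    _, _, X = stft(x, fs=fs, nperseg=nfft, axis=-1)
    return X  # shape (L, n_freqs, n_frames)
```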

3 Spatial localization technique employing SRP-PHAT

For the SRP-PHAT processing, we divide the observed signal sequence into Q blocks, each consisting of N samples with the indexes \( [(q-1)N+1,qN], \,\,\,1\le q\le Q. \) The estimated correlation matrix \({\mathbf {R}}(\omega ,q)\) of the observed signal in the qth block can be obtained as

$$\begin{aligned} {\mathbf {R}}(\omega ,q)=\frac{1}{N}\sum \limits _{k=(q-1)N+1}^{qN} {\mathbf x}(\omega ,k){{\mathbf x}^H(\omega ,k)}. \end{aligned}$$
(3)

Denote by \({\mathbf {R}}(\omega )\) the estimated correlation matrix of the observed signal. This matrix can be obtained based on \({\mathbf R}(\omega ,q)\) as

$$\begin{aligned} {\mathbf {R}}(\omega )=\frac{1}{QN}\sum \limits _{k=1}^{QN} {\mathbf x}(\omega ,k){{\mathbf x}^H(\omega ,k)}=\frac{1}{Q}\sum \limits _{q=1}^{Q} {\mathbf {R}}(\omega ,q). \end{aligned}$$
(4)
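To make (3) and (4) concrete, the following sketch (an illustration under assumed array shapes, not the authors' code) estimates the per-block correlation matrices and their average for one frequency bin.

```python
import numpy as np

def block_correlations(X, N):
    """Per-block correlation matrices R(omega, q), Eq. (3), and their
    average R(omega), Eq. (4), for one frequency bin.

    X: STFT frames at one frequency, shape (L, K) with K = Q * N.
    """
    L, K = X.shape
    Q = K // N
    R_q = np.empty((Q, L, L), dtype=complex)
    for q in range(Q):
        Xq = X[:, q * N:(q + 1) * N]      # frames of the q-th block
        R_q[q] = Xq @ Xq.conj().T / N     # (1/N) * sum_k x(omega,k) x(omega,k)^H
    R_avg = R_q.mean(axis=0)              # Eq. (4): average over all blocks
    return R_q, R_avg
```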

Clearly, during the conversation each speech source can be either active or inactive. Therefore, there exist periods in which all speech sources are inactive. Since \({\mathbf {R}}(\omega )\) in (4) is the average of all estimated correlation matrices \({\mathbf {R}}(\omega ,q)\), this matrix can be used as a reference to detect non-speech blocks or blocks with low speech presence. Thus, we propose to use a threshold \(\varepsilon \,{R}(\ell ,\ell ,\omega )\) to detect speech presence, where \(\varepsilon \) is a pre-set threshold, \(0<\varepsilon <1\), and \(\ell \) is a reference microphone. The value \({R}(\ell ,\ell ,\omega )\) is the \((\ell ,\ell )\mathrm{th}\) element of the matrix \({\mathbf {R}}(\omega )\).

Denote by \(\mathcal{S}\) the index set of all blocks with at least one active speech source. Based on the proposed threshold, this set can be obtained as

$$\begin{aligned} \mathcal{S}=\{q, ~ 1 \le q \le Q: ~ R(\ell ,\ell ,\omega ,q)>\varepsilon \,{R}(\ell ,\ell ,\omega )\} \end{aligned}$$
(5)

where \(R(\ell ,\ell ,\omega ,q)\) is the \((\ell ,\ell )\mathrm{th}\) element of the matrix \({\mathbf {R}}(\omega ,q)\). Note that \(\mathcal{S}\) is not an empty set since \({R}(\ell ,\ell ,\omega )\) is the average of \(R(\ell ,\ell ,\omega ,q)\), see (4). For each \(q\in \mathcal{S}\), denote by \({\bar{\mathbf {R}}}(\omega ,q)\) the normalized correlation matrix of the qth block

$$\begin{aligned} {\bar{\mathbf {R}}}(\omega ,q)=\frac{{\mathbf {R}}(\omega ,q)}{{ {R}}(\ell ,\ell ,\omega ,q)}. \end{aligned}$$
(6)
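The speech-presence test (5) and the normalization (6) can be sketched as follows; ell and eps correspond to the paper's \(\ell \) and \(\varepsilon \), and the default values are assumptions.

```python
def active_blocks(R_q, R_avg, ell=0, eps=0.1):
    """Index set S of speech-active blocks, Eq. (5), and the normalized
    correlation matrices R_bar(omega, q), Eq. (6), for q in S."""
    thresh = eps * R_avg[ell, ell].real   # diagonal entries are real-valued powers
    S = [q for q in range(len(R_q)) if R_q[q, ell, ell].real > thresh]
    R_bar = {q: R_q[q] / R_q[q, ell, ell].real for q in S}
    return S, R_bar
```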

By assuming that the speech signals of the three speakers are statistically independent, the matrix \({\mathbf R}(\omega ,q)\) can be decomposed as

$$\begin{aligned} {\mathbf {R}}(\omega ,q)={\mathbf {R}}_1(\omega ,q)+{\mathbf {R}}_2(\omega ,q)+{\mathbf {R}}_3(\omega ,q) \end{aligned}$$
(7)

where \({\mathbf {R}}_1(\omega ,q)\), \({\mathbf {R}}_2(\omega ,q)\) and \({\mathbf {R}}_3(\omega ,q)\) are the correlation matrices for the first, the second and the third speech signals, respectively. We have

$$\begin{aligned} {\mathbf {R}}(\omega ,q)=p_1(\omega ,q){\bar{\mathbf {R}}}_1(\omega )+p_2(\omega ,q){\bar{\mathbf {R}}}_2(\omega )+p_3(\omega ,q){\bar{\mathbf {R}}}_3(\omega ) \end{aligned}$$
(8)

where \(p_1(\omega ,q)\), \(p_2(\omega ,q)\) and \(p_3(\omega ,q)\) are the PSDs and \({\bar{\mathbf {R}}}_1(\omega )\), \({\bar{\mathbf {R}}}_2(\omega )\) and \({\bar{\mathbf {R}}}_3(\omega )\) the normalized spatial correlation matrices of the first, the second and the third speech signals, respectively, whose \((\ell ,\ell )\)th elements are equal to 1. Based on the idea of DOA estimation of acoustic signals using the near-field model [29], the spatial correlation matrices of the speakers' speech signals are available. Since the \((\ell ,\ell )\)th elements of the normalized spatial correlation matrices \({\bar{\mathbf {R}}}_1(\omega )\), \({\bar{\mathbf {R}}}_2(\omega )\) and \({\bar{\mathbf {R}}}_3(\omega )\) are one, it follows from (8) that (6) can be rewritten as

$$\begin{aligned} {\bar{\mathbf {R}}}(\omega ,q)&=\frac{p_1(\omega ,q)}{p_1(\omega ,q)+p_2(\omega ,q)+p_3(\omega ,q)}{\bar{\mathbf {R}}}_1(\omega )\\ &\quad +\frac{p_2(\omega ,q)}{p_1(\omega ,q)+p_2(\omega ,q)+p_3(\omega ,q)}{\bar{\mathbf {R}}}_2(\omega )\\ &\quad +\frac{p_3(\omega ,q)}{p_1(\omega ,q)+p_2(\omega ,q)+p_3(\omega ,q)}{\bar{\mathbf {R}}}_3(\omega ). \end{aligned}$$
(9)

Eq. (9) can then be expressed as

$$\begin{aligned} {\bar{\mathbf {R}}}(\omega ,q)=\gamma _1(\omega ,q){\bar{\mathbf {R}}}_1(\omega )+\gamma _2(\omega ,q){\bar{\mathbf {R}}}_2(\omega )+\gamma _3(\omega ,q){\bar{\mathbf {R}}}_3(\omega ) \end{aligned}$$
(10)

where the values \(\gamma _1(\omega ,q)\), \(\gamma _2(\omega ,q)\) and \(\gamma _3(\omega ,q)\) represent, respectively, the proportions of the matrices \({\bar{\mathbf {R}}}_1(\omega )\), \({\bar{\mathbf {R}}}_2(\omega )\) and \({\bar{\mathbf {R}}}_3(\omega )\) in the normalized correlation matrix \({\bar{\mathbf {R}}}(\omega ,q)\), i.e.,

$$\begin{aligned} \gamma _1(\omega ,q)=\frac{p_1(\omega ,q)}{p_1(\omega ,q)+p_2(\omega ,q)+p_3(\omega ,q)} \end{aligned}$$
(11)

and

$$\begin{aligned} \gamma _2(\omega ,q)=\frac{p_2(\omega ,q)}{p_1(\omega ,q)+p_2(\omega ,q)+p_3(\omega ,q)} \end{aligned}$$
(12)

and

$$\begin{aligned} \gamma _3(\omega ,q)=\frac{p_3(\omega ,q)}{p_1(\omega ,q)+p_2(\omega ,q)+p_3(\omega ,q)}. \end{aligned}$$
(13)

Since \(p_1(\omega ,q)\ge 0\), \(p_2(\omega ,q)\ge 0\) and \(p_3(\omega ,q)\ge 0\) we have

$$\begin{aligned} \gamma _1(\omega ,q) \ge 0, ~ \gamma _2(\omega ,q) \ge 0, ~ \gamma _3(\omega ,q) \ge 0 \end{aligned}$$
(14)

and

$$\begin{aligned} \gamma _1(\omega ,q)+\gamma _2(\omega ,q)+\gamma _3(\omega ,q)=1. \end{aligned}$$
(15)

Since \({\mathbf {R}}(\omega )\) in (4) is the correlation matrix of the observed signal, it follows that

$$\begin{aligned} {\bar{\mathbf {R}}}(\omega )=\gamma _1(\omega ){\bar{\mathbf {R}}}_1(\omega )+\gamma _2(\omega ){\bar{\mathbf {R}}}_2(\omega )+\gamma _3(\omega ){\bar{\mathbf {R}}}_3(\omega ) \end{aligned}$$
(16)

where \({\bar{\mathbf {R}}}(\omega )\) is the normalized correlation matrix of the observed signal. The values \(\gamma _1(\omega )\), \(\gamma _2(\omega )\) and \(\gamma _3(\omega )\) represent, respectively, the proportions of the matrices \({\bar{\mathbf {R}}}_1(\omega )\), \({\bar{\mathbf {R}}}_2(\omega )\) and \({\bar{\mathbf {R}}}_3(\omega )\) in the matrix \({\bar{\mathbf {R}}}(\omega )\), also

$$\begin{aligned} \gamma _1(\omega ) \ge 0, ~ \gamma _2(\omega ) \ge 0, ~ \gamma _3(\omega ) \ge 0 \end{aligned}$$
(17)

and

$$\begin{aligned} \gamma _1(\omega )+\gamma _2(\omega )+\gamma _3(\omega )=1. \end{aligned}$$
(18)

In the sequel, a spatial localization technique employing SRP-PHAT is proposed. Here, the \((m,n)\)th element of \({\mathbf {R}}(\omega ,q)\) is the cross-correlation between the mth and nth microphone signals in the qth block. As such, the SRP-PHAT in block q can be estimated as follows

$$\begin{aligned} {\Psi }({{\bar{\mathbf {R}}}(\omega ,q)})=\sum _{m=1}^{L}\sum _{n=m+1}^{L}{ {\bar{R}}}(m,n,\omega ,q) \end{aligned}$$
(19)

where \({{\bar{R}}}(m,n,\omega ,q)\) is the \((m,n)\)th element of the normalized correlation matrix \({\bar{\mathbf {R}}}(\omega ,q)\). From (19) and (10), we have

$$\begin{aligned} {\Psi }({\bar{\mathbf {R}}}(\omega ,q))&=\gamma _1(\omega ,q){\Psi }({\bar{\mathbf {R}}}_1(\omega ))+\gamma _2(\omega ,q){\Psi }({\bar{\mathbf {R}}}_2(\omega ))\\ &\quad +\gamma _3(\omega ,q){\Psi }({\bar{\mathbf {R}}}_3(\omega )). \end{aligned}$$
(20)

Clearly, Eq. (20) shows the relative contributions of the three speech sources in block q. During the conversation, each speech source can be active or inactive, so the correlation matrices of blocks in which only one speech source is active are useful for estimating the source spatial information. In a block where only one source is active, the contribution of that source is 1 and the contributions of the other sources are 0. In the complex plane, based on (14), (15) and (20), the point \({\Psi }({\bar{\mathbf {R}}}(\omega ,q))\) is located inside a triangle whose vertices are the points \({\Psi }({\bar{\mathbf {R}}}_1(\omega ))\), \({\Psi }({\bar{\mathbf {R}}}_2(\omega ))\) and \({\Psi }({\bar{\mathbf {R}}}_3(\omega ))\). In addition, based on (17) and (18), the point \({\Psi }({\bar{\mathbf {R}}}(\omega ))\) is located inside this triangle too, see Fig. 2a. As such, the normalized spatial correlation matrices \({\bar{\mathbf {R}}}_1(\omega )\), \({\bar{\mathbf {R}}}_2(\omega )\) and \({\bar{\mathbf {R}}}_3(\omega )\) can be estimated by detecting the triangle vertices among the blocks' SRP-PHAT values of the observed signal, see Fig. 2b. Hence, a spatial detection of the speech sources is proposed that employs an algorithm for finding the triangle vertices, i.e., the blocks in which only one source is active.
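A minimal sketch of the SRP-PHAT functional (19): it sums the upper-triangular (microphone-pair) entries of a normalized correlation matrix, yielding one complex point per block; the variable names are ours.

```python
import numpy as np

def srp_phat(R_bar_q):
    """Psi(R_bar(omega, q)) per Eq. (19): sum over pairs (m, n) with n > m."""
    iu = np.triu_indices(R_bar_q.shape[0], k=1)
    return R_bar_q[iu].sum()              # one complex point in the plane

# One point per active block, plus the point for the averaged matrix:
# psi_q = {q: srp_phat(R_bar[q]) for q in S}
# psi_avg = srp_phat(R_avg / R_avg[ell, ell].real)   # Psi(R_bar(omega))
```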

The block in which only the first source is active is detected as block \(q_1\) as follows:

$$\begin{aligned} {q_1}=\arg \max _{q} |{\Psi }({\bar{\mathbf {R}}}(\omega ,q))-{\Psi }({\bar{\mathbf {R}}}(\omega ))| \end{aligned}$$
(21)

where \(|\cdot |\) denotes the absolute value. The block in which only the second source is active is detected as block \(q_2\) as follows:

$$\begin{aligned} {q_2}=\arg \max _{q} |{\Psi }({\bar{\mathbf {R}}}(\omega ,q))-{\Psi }({\bar{\mathbf {R}}}(\omega ,q_1))|. \end{aligned}$$
(22)

The block in which only the third source is active is detected as block \(q_3\) as follows:

$$\begin{aligned} {q_3}=\arg \max _{q} \{|{\Psi }({\bar{\mathbf {R}}}(\omega ,q))-{\Psi }({\bar{\mathbf {R}}}(\omega ,q_1))|+|{\Psi }({\bar{\mathbf {R}}}(\omega ,q))-{\Psi }({\bar{\mathbf {R}}}(\omega ,q_2))|\}. \end{aligned}$$
(23)
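The vertex search (21)–(23) then reduces to three maximizations over the active blocks, as in this sketch (psi_q and psi_avg are the per-block and average SRP-PHAT values from above):

```python
def detect_vertices(psi_q, psi_avg):
    """Single-source blocks q1, q2, q3 per Eqs. (21)-(23).

    psi_q: {q: Psi(R_bar(omega, q))} for q in S; psi_avg: Psi(R_bar(omega)).
    """
    qs = list(psi_q)
    q1 = max(qs, key=lambda q: abs(psi_q[q] - psi_avg))           # Eq. (21)
    q2 = max(qs, key=lambda q: abs(psi_q[q] - psi_q[q1]))         # Eq. (22)
    q3 = max(qs, key=lambda q: abs(psi_q[q] - psi_q[q1])
                             + abs(psi_q[q] - psi_q[q2]))         # Eq. (23)
    return q1, q2, q3
```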
Fig. 2 a The triangle with SRP-PHAT vertices in the complex plane; b SRP-PHAT values of the observed signal at a frequency of 2100 Hz from the simulation in Sect. 6

Here, the correlation matrix of the observed signal in a block with only one active source contains only the spatial characteristics of that source. As such, the normalized spatial correlation matrix for the active source can be estimated as the normalized correlation matrix of such a block. To reduce the correlation mismatch, we propose to estimate the normalized spatial correlation matrices of the speech sources by averaging the estimated normalized correlation matrices of the I blocks whose SRP-PHAT values are nearest to the estimated triangle vertices.

The average is employed to reduce the estimation error which can occur due to the limited number of samples in each block. Then, \(\mathcal{S}_{1}\), \(\mathcal{S}_{2}\), and \(\mathcal{S}_{3}\) are defined as subsets of \(\mathcal{S}\), where each subset contains the indexes of the I blocks whose SRP-PHAT values are nearest to those of blocks \(q_1\), \(q_2\), and \(q_3\), respectively. In practice, the value I can be chosen smaller than 5% of the number of elements in \(\mathcal{S}\). The normalized spatial correlation matrix \({{\hat{\bar{\mathbf {R}}}}}_1(\omega )\) for the first source can be estimated as follows:

$$\begin{aligned} {{\hat{\bar{\mathbf {R}}}}}_1(\omega )=\frac{1}{I}\sum _{q\in \mathcal{S}_{1}}{\bar{\mathbf {R}}}(\omega ,q). \end{aligned}$$
(24)

The normalized spatial correlation matrix \({{\hat{\bar{\mathbf {R}}}}}_2(\omega )\) for the second source can be estimated as follows:

$$\begin{aligned} {{\hat{\bar{\mathbf {R}}}}}_2(\omega )=\frac{1}{I}\sum _{q\in \mathcal{S}_{2}}{\bar{\mathbf {R}}}(\omega ,q). \end{aligned}$$
(25)

The normalized spatial correlation matrix \({{\hat{\bar{\mathbf {R}}}}}_3(\omega )\) for the third source can be estimated as follows:

$$\begin{aligned} {{\hat{\bar{\mathbf {R}}}}}_3(\omega )=\frac{1}{I}\sum _{q\in \mathcal{S}_{3}}{\bar{\mathbf {R}}}(\omega ,q). \end{aligned}$$
(26)

Due to the small value of I, the proportion of undesired sources in the matrices \({{\hat{\bar{\mathbf {R}}}}}_1(\omega )\), \({{\hat{\bar{\mathbf {R}}}}}_2(\omega )\), and \({{\hat{\bar{\mathbf {R}}}}}_3(\omega )\) is close to zero and their contribution can be neglected. These matrices are now used to estimate the optimum beamformer in each frequency bin.
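A sketch of the averaging in (24)–(26); the function name and the way the I nearest blocks are selected (by distance in the SRP-PHAT plane) are our reading of the text.

```python
def estimate_spatial_matrix(psi_q, R_bar, q_vertex, I=10):
    """Average of the I blocks whose SRP-PHAT values are nearest to the
    vertex block, per Eqs. (24)-(26)."""
    nearest = sorted(psi_q, key=lambda q: abs(psi_q[q] - psi_q[q_vertex]))[:I]
    return sum(R_bar[q] for q in nearest) / I

# R1_hat = estimate_spatial_matrix(psi_q, R_bar, q1)
# R2_hat = estimate_spatial_matrix(psi_q, R_bar, q2)
# R3_hat = estimate_spatial_matrix(psi_q, R_bar, q3)
```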

4 Optimum beamformer using spatial information

Based on the estimated normalized spatial correlation matrices \({{\hat{\bar{\mathbf {R}}}}}_1(\omega )\), \({{\hat{\bar{\mathbf {R}}}}}_2(\omega )\), and \({{\hat{\bar{\mathbf {R}}}}}_3(\omega )\), an optimum beamformer is proposed for each desired source in each frequency bin \(\omega \). For extracting one speech source from the observed signal, the optimum beamformer should suppress all undesired sources whilst preserving the desired one. First, the first source is assumed to be the desired source, so the other two sources are undesired; denote by \({\mathbf w}_1(\omega )\) the filter weight for the first source in the frequency bin \(\omega \). The filter weight \({\mathbf w}_1(\omega )\) is designed to minimize the two weighted cost functions \({{\mathbf w}_1(\omega )}^{H}{{\hat{\bar{\mathbf {R}}}}}_{2}(\omega ){\mathbf w}_1(\omega )\) and \({{\mathbf w}_1(\omega )}^{H}{{\hat{\bar{\mathbf {R}}}}}_{3}(\omega ){\mathbf w}_1(\omega )\) while maintaining the source direction as follows:

$$\begin{aligned} \left\{ \begin{array}{lll} \min \limits _{{\mathbf w}_1(\omega )}{{\mathbf w}_1(\omega )}^H{{\hat{\bar{\mathbf {R}}}}}_{2}(\omega ){{\mathbf w}_1(\omega )}\;,{{\mathbf w}_1(\omega )}^H{{\hat{\bar{\mathbf {R}}}}}_{3}(\omega ){{\mathbf w}_1(\omega )}\\ \text {subject to} ~ ~{{\mathbf w}_1(\omega )}^H{{\hat{\bar{\mathbf {d}}}}}_1(\omega )=1 \end{array}\right. \end{aligned}$$
(27)

where \({{\hat{\bar{\mathbf {d}}}}}_1(\omega )\) is the estimated cross-correlation vector between the first source's signals at the array microphones and at the \(\ell \)th reference microphone. This vector is also the \(\ell \)th column of the matrix \({{\hat{\bar{\mathbf {R}}}}}_1(\omega )\). Thus, from (27), we propose to minimize the combined weighted cost function \({{\mathbf w}_1(\omega )}^H\left[ {{\hat{\bar{\mathbf {R}}}}}_{2}(\omega )+{{\hat{\bar{\mathbf {R}}}}}_{3}(\omega ) \right] {\mathbf w}_1(\omega )\), and the filter weight \({\mathbf w}_1(\omega )\) can be obtained by solving the optimization problem

$$\begin{aligned} \left\{ \begin{array}{lll} {\text{ min }}\,\,{\mathbf w}_1^H(\omega )\left[ {{\hat{\bar{\mathbf {R}}}}}_2(\omega )+{{\hat{\bar{\mathbf {R}}}}}_3(\omega )\right] {\mathbf w}_1(\omega )\\ \text {subject}\,\, \text {to}\,\, {\mathbf w}_1^H(\omega ){{\hat{\bar{\mathbf {d}}}}}_1(\omega )=1 \end{array}\right. \end{aligned}$$
(28)

Similarly, the beamformer weight \({\mathbf w}_2(\omega )\) for the second source can be obtained as the solution to the optimization problem

$$\begin{aligned} \left\{ \begin{array}{lll} {\text{ min }}\,\,{\mathbf w}_2^H(\omega )\left[ {{\hat{\bar{\mathbf {R}}}}}_1(\omega )+{{\hat{\bar{\mathbf {R}}}}}_3(\omega )\right] {\mathbf w}_2(\omega )\\ \text {subject}\,\, \text {to}\,\, {\mathbf w}_2^H(\omega ){{\hat{\bar{\mathbf {d}}}}}_2(\omega )=1 \end{array}\right. \end{aligned}$$
(29)

where \({{\hat{\bar{\mathbf {d}}}}}_2(\omega )\) is the \(\ell \)th column of the matrix \({{\hat{\bar{\mathbf {R}}}}}_2(\omega )\). The beamformer weight \({\mathbf w}_3(\omega )\) for the third source can be obtained as the solution to the optimization problem

$$\begin{aligned} \left\{ \begin{array}{lll} {\text{ min }}\,\,{\mathbf w}_3^H(\omega )\left[ {{\hat{\bar{\mathbf {R}}}}}_1(\omega )+{{\hat{\bar{\mathbf {R}}}}}_2(\omega )\right] {\mathbf w}_3(\omega )\\ \text {subject}\,\, \text {to}\,\, {\mathbf w}_3^H(\omega ){{\hat{\bar{\mathbf {d}}}}}_3(\omega )=1 \end{array}\right. \end{aligned}$$
(30)

where \({{\hat{\bar{\mathbf {d}}}}}_3(\omega )\) is the \(\ell \)th column of the matrix \({{\hat{\bar{\mathbf {R}}}}}_3(\omega )\). The solutions to the three optimization problems can be expressed as

$$\begin{aligned} {\mathbf w}_1(\omega )=\frac{\left[ {{\hat{\bar{\mathbf {R}}}}}_2(\omega )+{{\hat{\bar{\mathbf {R}}}}}_3(\omega )\right] ^{-1}{{\hat{\bar{\mathbf {d}}}}}_1(\omega )}{{{{\hat{\bar{\mathbf {d}}}}}_1^H(\omega )}\left[ {{\hat{\bar{\mathbf {R}}}}}_2(\omega )+{{\hat{\bar{\mathbf {R}}}}}_3(\omega )\right] ^{-1}{{{\hat{\bar{\mathbf {d}}}}}_1(\omega )}} \end{aligned}$$
(31)

and

$$\begin{aligned} {\mathbf w}_2(\omega )=\frac{\left[ {{\hat{\bar{\mathbf {R}}}}}_1(\omega )+{{\hat{\bar{\mathbf {R}}}}}_3(\omega )\right] ^{-1}{{\hat{\bar{\mathbf {d}}}}}_2(\omega )}{{{{\hat{\bar{\mathbf {d}}}}}_2^H(\omega )}\left[ {{\hat{\bar{\mathbf {R}}}}}_1(\omega )+{{\hat{\bar{\mathbf {R}}}}}_3(\omega )\right] ^{-1}{{{\hat{\bar{\mathbf {d}}}}}_2(\omega )}} \end{aligned}$$
(32)

and

$$\begin{aligned} {\mathbf w}_3(\omega )=\frac{\left[ {{\hat{\bar{\mathbf {R}}}}}_1(\omega )+{{\hat{\bar{\mathbf {R}}}}}_2(\omega )\right] ^{-1}{{\hat{\bar{\mathbf {d}}}}}_3(\omega )}{{{{\hat{\bar{\mathbf {d}}}}}_3^H(\omega )}\left[ {{\hat{\bar{\mathbf {R}}}}}_1(\omega )+{{\hat{\bar{\mathbf {R}}}}}_2(\omega )\right] ^{-1}{{{\hat{\bar{\mathbf {d}}}}}_3(\omega )}} \end{aligned}$$
(33)
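The closed-form solutions (31)–(33) share the structure \({\mathbf w}={\mathbf R}^{-1}{\mathbf d}/({\mathbf d}^H{\mathbf R}^{-1}{\mathbf d})\); a sketch follows, in which the small diagonal loading term is our addition (not in the paper) to guard against an ill-conditioned matrix.

```python
import numpy as np

def beamformer_weight(R_undesired, d, loading=1e-6):
    """w = R^{-1} d / (d^H R^{-1} d), per Eqs. (31)-(33).
    The diagonal loading is an assumption for numerical stability."""
    L = len(d)
    R = R_undesired + loading * (np.trace(R_undesired).real / L) * np.eye(L)
    Rd = np.linalg.solve(R, d)            # R^{-1} d without an explicit inverse
    return Rd / (d.conj() @ Rd)

# For the first source (the others follow by symmetry), with ell the
# reference microphone and X_bin the (L, n_frames) STFT frames of one bin:
# d1 = R1_hat[:, ell]
# w1 = beamformer_weight(R2_hat + R3_hat, d1)
# y1 = w1.conj() @ X_bin                  # beamformer output, Eq. (34) below
```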

The beamformer outputs for the three sources are calculated as

$$\begin{aligned} y_1(\omega ,k)={{\mathbf w}_1^H(\omega )}{\mathbf x}(\omega ,k) \end{aligned}$$
(34)

and

$$\begin{aligned} y_2(\omega ,k)={{\mathbf w}_2^H(\omega )}{\mathbf x}(\omega ,k) \end{aligned}$$
(35)

and

$$\begin{aligned} y_3(\omega ,k)={{\mathbf w}_3^H(\omega )}{\mathbf x}(\omega ,k). \end{aligned}$$
(36)

The remaining problem is to align the beamformer outputs in different frequency bins to the same source. In the sequel, the correlation between the beamformer outputs at neighboring frequencies is employed to overcome the permutation problem.

5 Permutation alignment

Since the optimum beamformers operate in each frequency bin independently, permutation alignment is needed before transforming the signals to the time domain. Here, the correlation approach is chosen for the permutation alignment. The permutation decision is based on the inter-frequency correlation of the output signal amplitudes, under the assumption that the amplitudes of the outputs originating from one speech signal are correlated across adjoining frequencies. The permutation alignment can be performed continuously, starting from a reference frequency in the middle of the frequency range. In this case, the permutation correlation is performed in two directions, with increasing and decreasing frequency indexes, until the ends of the frequency range are reached. For two neighboring frequencies \(\omega _{m}\) and \(\omega _{m+1}\), the correlations between the ith beamformer output at frequency \(\omega _{m}\) and the jth beamformer output at frequency \(\omega _{m+1}\) are obtained as follows:

Fig. 3 Time domain plots of the original speech signals and the observed signal at the fourth microphone

$$\begin{aligned} \text {cor}_{i,j}= \frac{\mu ({|y_{i}(\omega _{m},k)y_{j}(\omega _{m+1},k)|})- \mu ({|y_{i}(\omega _{m},k)|})\mu ({|y_{j}(\omega _{m+1},k)|})}{\sigma (|y_{i}(\omega _{m},k)|) \sigma (|y_{j}(\omega _{m+1},k)|)} \end{aligned}$$
(37)

where \(\mu (\cdot )\) and \(\sigma (\cdot )\) are, respectively, the mean and the standard deviation of the argument. The permutation decision is made by choosing the permutation alignment \(\Pi \) that maximizes the total correlation:

$$\begin{aligned} {\Pi }=\arg \max _{\Pi } \sum _{i,j \in \Pi }\text {cor}_{i,j}. \end{aligned}$$
(38)

After the permutation alignment, the three output signals in all frequencies are passed through the synthesis filters to obtain the three separated speech sources in the time domain.
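A sketch of the neighbor-frequency alignment (37)–(38) for three outputs; Y_m and Y_m1 are assumed names for the beamformer outputs at frequencies \(\omega _m\) and \(\omega _{m+1}\).

```python
from itertools import permutations
import numpy as np

def align_neighbor(Y_m, Y_m1):
    """Reorder the outputs at omega_{m+1} using the permutation that maximizes
    the summed amplitude correlations, per Eqs. (37)-(38).

    Y_m, Y_m1: complex arrays of shape (3, n_frames).
    """
    A, B = np.abs(Y_m), np.abs(Y_m1)
    def cor(a, b):                        # Eq. (37): amplitude correlation
        return ((a * b).mean() - a.mean() * b.mean()) / (a.std() * b.std())
    best = max(permutations(range(3)),
               key=lambda p: sum(cor(A[i], B[p[i]]) for i in range(3)))
    return Y_m1[list(best)]
```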

Fig. 4 Time domain plots of the second-order BSS algorithm outputs

Fig. 5 Time domain plots of the proposed algorithm outputs

Table 1 The interference suppression and the source distortion levels in the outputs of the proposed blind speech separation algorithm

6 Experimental results

For performance evaluation of the proposed blind speech separation algorithm, a simulation is performed in a real room environment using a linear microphone array consisting of 6 microphones. Here, the distance between two adjacent microphones is 6 cm, and the positions of the three speakers are shown in Fig. 1. The distances between the array and the speakers are about 1–1.5 m. The duration of the observed signal is \(150\,\)s, and the value N was chosen as the number of samples in a \(0.5\,\)s period while I and \(\varepsilon \) were chosen as 10 and 0.1, respectively. With the chosen N and I, the evaluation time for each speech source is about \(5\,\)s which, in our experience, is sufficient for evaluating the spatial characteristics of a speech source. We conducted our numerical experiments on an HP laptop with an Intel Core i7 and 16 GB RAM, using Matlab (R2013b).

The observed signals are decomposed into sub-bands using an oversampled analysis filter bank. Here, an oversampling factor of two is chosen to reduce the aliasing effects between adjacent sub-bands [30]. After the decomposition, the proposed algorithm is applied in each sub-band. Figure 3 shows time domain plots of the three speech signals and the observed signal. The speech signals from the three speakers occur at different times and can overlap with each other in the observed signal. The overlapping signals simulate simultaneous conversation.

We have compared the suggested method with a second-order BSS algorithm. Figure 4 shows the results when the second-order blind signal separation (BSS) algorithm is used for separating the observed signal. This second-order BSS algorithm was used in [22] for speech separation in a two-speaker environment. Figure 4 depicts time domain plots of the three outputs of the second-order BSS algorithm, i.e., the speech signals extracted for the three speakers from the observed signal. As Fig. 4 shows, there is little difference between the three output signals, indicating that the separation was not successful.

Figure 5 depicts time domain plots of the three outputs when the proposed blind separation algorithm is used for separating the observed signal. The three outputs are the speech signals extracted for the three speakers from the observed signal. Thus, Fig. 5 shows that the proposed algorithm can separate the three speech signals from the observed mixture. Informal listening tests suggest good listening quality for the output signals of the proposed algorithm. Table 1 also shows that the computation time of the proposed algorithm is lower than that of the second-order BSS algorithm.

To quantify the performance of the second-order BSS algorithm and the proposed algorithm, the interference suppression (IS) and source distortion (SD) measures presented in [31] are employed. As such, the speech signal from one speaker is viewed as the desired signal and the other speech signals are interference. Table 1 shows the IS and SD levels for the three outputs of the second-order BSS algorithm and the proposed algorithm; the proposed algorithm has better performance. In addition, the proposed blind speech separation algorithm offers a good interference suppression level (5–7 dB) whilst maintaining a low distortion level (−26 to −29 dB) for the desired source.

7 Summary

In this paper, a new blind speech separation algorithm in the frequency domain was developed for the three-speaker environment. Since the positions of the sources are unknown, SRP-PHAT localization is used for estimating the spatial information of all speakers in each frequency bin. Based on that information, an optimum beamformer is designed for each speech source to extract the desired signal. Permutation alignment is applied before transforming the signals to the time domain. Simulation results show that the proposed blind speech separation algorithm offers a good interference suppression level whilst maintaining a low distortion level for the desired source.