Weighted Frequency Smoothing for Enhanced Speaker Localization

The coherent signal subspace method may be used in order to apply subspace localization methods (e.g. MUSIC) to coherent sources. This method involves a focusing process followed by frequency smoothing, which is intended to decorrelate the signals from coherent sources. In practice, however, only moderate decorrelation is obtained, which may lead to performance degradation. Although decorrelation can be improved by widening the smoothing bandwidth, a wider bandwidth may increase the focusing error, and the smoothing bandwidth is in any case limited by the bandwidth of the actual signal. In this paper, a weighted frequency smoothing method that improves decorrelation for a given bandwidth is proposed. It is shown that better decorrelation is obtained by selecting the weights to be inversely proportional to the source signal power at the given frequency. However, since the power of the source is not known, it is estimated by the trace of the array spatial covariance matrix. An experimental study is presented that investigates the effect of the proposed weighting on DOA estimation of speech sources in a reverberant environment.


I. INTRODUCTION
Many audio signal processing applications require accurate direction-of-arrival (DOA) estimation in order to achieve satisfactory performance, in fields such as speech processing and acoustic scene analysis [1], [2]. One common method used for DOA estimation is the MUSIC (MUltiple SIgnal Classification) algorithm [3], due to its favorably low computational complexity and excellent resolution. However, its performance is inadequate in the presence of coherent sources, such as reflections in a reverberant environment, due to the deficient rank of the sources' cross-correlation matrix [4]. The coherent signal subspace (CSS) method [5] may be used in order to apply subspace localization methods (e.g. MUSIC) to coherent sources. There are two stages in CSS: a focusing process and frequency smoothing of the array spatial correlation matrices (SCM). The focusing process aims to remove the frequency dependence of the array steering vectors in order to preserve the spatial information during frequency smoothing. The frequency smoothing is applied to decorrelate the sources and to increase the rank of the sources' cross-correlation matrix. Many speaker localization methods employ the CSS method to overcome problems due to reverberation [6], [7], [8], [9], [10], [11], [12]. These methods assume that frequency smoothing fully decorrelates the coherent reflections. In practice, however, only moderate decorrelation is obtained, which may lead to performance degradation. In [13], the PHALCOR algorithm was proposed to localize the early reflections of a single source in a room using a spherical microphone array. A key step in PHALCOR is the so-called phase alignment transform of the SCM, which can be viewed as a generalization of frequency smoothing. In PHALCOR it was shown to be beneficial (both in theory and in practice) to normalize each SCM by its trace.
In this work, we show that a similar approach is also beneficial for frequency smoothing that follows focusing, and investigate the performance of this approach for the localization of multiple simultaneous sources, while using a general array configuration that requires CSS processing. Furthermore, we prove that choosing the frequency smoothing weights to be inversely proportional to the source power at the given frequency is optimal in terms of decorrelation. We then show that an estimate for the source power can be easily obtained from the trace of the SCM. The proposed weighting is incorporated in a direct path dominance (DPD) test based method for speaker localization in reverberant environments and tested on real recordings from the LOCATA challenge [14], [15] with three different microphone arrays. The results show that the proposed weighting can significantly improve DOA estimation performance.

II. NOTATIONS
The notations that are used throughout the paper are given hereafter. Boldface lowercase letters are used for vectors, while boldface uppercase letters for matrices. The transpose and conjugate transpose operations are denoted by the superscripts T and H , respectively. The complex conjugate is denoted by the superscript * . tr(·) denotes the trace operation, and E[·] denotes the statistical expectation operation.

III. SYSTEM MODEL
In this section we outline the system model. We consider a sound field in a room, comprised of a single source with a frequency domain signal s(f) and a DOA Ω_0 relative to some measurement point within the room, where Ω = (θ, φ), and θ ∈ [0, π] and φ ∈ [−π, π) denote the elevation and azimuth angles, respectively. The sound from the source propagates in the room and is repeatedly reflected from the various room boundaries. Each reflection is modeled as a separate source, described by a DOA Ω_k and a frequency domain signal s_k(f) for the k'th reflection. Let x(f) denote the vector of Q microphone signals, as a function of frequency. Assuming the sources are in the far field, x(f) is described by the following model:

x(f) = V(f) s(f) + n(f),    (1)

where n(f) is the noise term, which includes signals from late reverberation; only the first K reflections are not included in n(f). The columns of the steering matrix V(f) = [v(f, Ω_0), ..., v(f, Ω_K)] are the array's steering vectors v(f, Ω), where v_q(f, Ω) represents the transfer function between a source with DOA Ω and the q'th microphone, and s(f) = [s_0(f), ..., s_K(f)]^T is the vector of source signals. It is assumed that the k'th reflection is a delayed and scaled copy of the source signal [16]:

s_k(f) = α_k e^{−i2πf τ_k} s_0(f),    (5)

where the scaling coefficient is α_k and the delay is τ_k. s_0(f) stands for the direct sound, and τ_0 and α_0 are accordingly normalized to 0 and 1, respectively. It is assumed that the delays are sorted such that τ_{k−1} ≤ τ_k.
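As a concrete sketch of this model, the following builds x(f) for a single frequency bin; the function name and all parameter values are illustrative, not part of the paper:

```python
import numpy as np

def array_signal(s0, f, alphas, taus, V, n=None):
    """Sketch of x(f) = V(f) s(f) + n(f), where each reflection is a delayed,
    scaled copy of the direct sound: s_k(f) = alpha_k e^{-i 2 pi f tau_k} s0(f).

    s0:     direct-sound spectrum value at frequency f (complex scalar)
    alphas: scaling coefficients alpha_k (alpha_0 = 1 for the direct sound)
    taus:   delays tau_k in seconds (tau_0 = 0), sorted in increasing order
    V:      (Q, K+1) steering matrix, one column per source DOA
    n:      optional (Q,) noise vector (late reverberation and sensor noise)
    """
    s = s0 * np.asarray(alphas) * np.exp(-2j * np.pi * f * np.asarray(taus))
    x = V @ s
    return x if n is None else x + n
```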
Let R_x(f) denote the SCM at frequency f:

R_x(f) = E[x(f) x^H(f)].    (6)

Substituting (1) into (6), and assuming n(f) and s(f) are uncorrelated, yields:

R_x(f) = V(f) R_s(f) V^H(f) + R_n(f),    (7)

where:

R_s(f) = E[s(f) s^H(f)],    (8)
R_n(f) = E[n(f) n^H(f)].    (9)

Note that the assumption that n(f) and s(f) are uncorrelated is only approximate, since, for example, the reflection with index K + 1 (which belongs to the noise vector) may be correlated with the reflection with index K (which belongs to the sources vector). In practice, the signals that have been converted to the frequency domain are time windowed, so, for non-stationary source signals such as speech, if τ_{K+1} is larger than the window length and the speech correlation time, the reflection with index K + 1 will not be correlated with the direct sound. Furthermore, since reflection amplitudes typically decrease with delay, the correlation between late reverberation and the source signal contributes terms that are often negligible compared to other terms in R_x(f).
By substituting (5) and (2) in (8) we have:

R_s(f) = σ_s^2(f) h(f) h^H(f),    (10)

where:

σ_s^2(f) = E[|s_0(f)|^2],    (11)
h(f) = [α_0 e^{−i2πf τ_0}, ..., α_K e^{−i2πf τ_K}]^T.

As is apparent from (10), R_s(f) is a rank-1 matrix, which causes subspace based methods such as MUSIC to fail [5].
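This rank deficiency is easy to verify numerically; the amplitudes, delays, and frequency below are illustrative values only:

```python
import numpy as np

f = 1000.0                              # frequency in Hz (illustrative)
alphas = np.array([1.0, 0.6, 0.3])      # hypothetical reflection amplitudes
taus = np.array([0.0, 2e-3, 5e-3])      # hypothetical delays in seconds
sigma_s2 = 1.0                          # source power at f

# Source vector up to s0(f): entries alpha_k e^{-i 2 pi f tau_k}
h = alphas * np.exp(-2j * np.pi * f * taus)
R_s = sigma_s2 * np.outer(h, h.conj())  # sources' cross-correlation matrix

print(np.linalg.matrix_rank(R_s))       # 1, regardless of the number of reflections
```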

IV. FOCUSING AND FREQUENCY SMOOTHING
The CSS method was originally derived to facilitate subspace localization for coherent sources (e.g. employing the MUSIC method). CSS is a two-stage method comprising focusing and frequency smoothing. For a given frequency f′ and a target frequency f, a focusing matrix T(f, f′) is any Q × Q matrix that translates the steering matrix V(f′) to the steering matrix V(f):

T(f, f′) V(f′) = V(f).    (13)

Let R̂_x(f, f′) be the focused SCM:

R̂_x(f, f′) = T(f, f′) R_x(f′) T^H(f, f′).    (14)

By substituting (7) in (14), we have:

R̂_x(f, f′) = V(f) R_s(f′) V^H(f) + T(f, f′) R_n(f′) T^H(f, f′).    (15)

The focusing operation removes the steering vector dependence on f′, as demonstrated in (15), where V(f) is common to all transformed SCMs. Thus, spatial information is preserved in the smoothed steering matrices. Let R̃_x(f) be a frequency smoothed SCM, around a center frequency f:

R̃_x(f) = Σ_{j=−J}^{J} w_j R̂_x(f, f + jΔf),    (17)

where J controls the bandwidth of the smoothing, Δf is the frequency resolution and w_{−J}, ..., w_J are frequency-dependent weights. The weights can be either all equal [7], [17], [18], [19], or designed to improve performance. For example, in [5], [20], [21], the weights were chosen to be proportional to the SNR. In Section V, we suggest a method for choosing the weights that improves decorrelation. By substituting (7) in (17), we have:

R̃_x(f) = V(f) R̃_s(f) V^H(f) + R̃_n(f),    (18)

where:

R̃_s(f) = Σ_{j=−J}^{J} w_j R_s(f + jΔf),    (19)
R̃_n(f) = Σ_{j=−J}^{J} w_j T(f, f + jΔf) R_n(f + jΔf) T^H(f, f + jΔf).    (20)

As a summation of different rank-1 matrices, R̃_s(f) has a potentially higher effective rank than R_s(f). Assuming that the smoothing operation succeeds in decorrelating the sources, and given that there are more sensors than sources, i.e. K + 1 < Q, subspace localization methods (e.g. MUSIC) can be applied, leading to DOA estimation of the K + 1 sources (direct sound and K reflections). The processing in the above discussion required a known focusing matrix T(f, f′). However, by definition, to calculate a focusing matrix one needs to know the steering matrix, i.e. the DOAs of the sources. Moreover, in the case where there are more sources than sensors, such a matrix may not exist. Therefore, in practice, only approximate focusing matrices are used.
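A minimal sketch of the smoothing step, assuming the focused SCMs have already been computed (the focusing itself is array dependent and omitted here; the function name is ours):

```python
import numpy as np

def frequency_smoothing(R_focused, weights):
    """Weighted frequency smoothing of focused SCMs.

    R_focused: (2J+1, Q, Q) array; R_focused[j] is the focused SCM at
               frequency f + (j - J) * df
    weights:   (2J+1,) weights w_{-J}, ..., w_J
    Returns the (Q, Q) smoothed SCM: sum_j w_j * R_focused[j].
    """
    w = np.asarray(weights)
    return np.tensordot(w, np.asarray(R_focused), axes=(0, 0))
```

With uniform weights w_j = 1 this reduces to a plain sum of the focused SCMs.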
Various methods exist to design approximate focusing matrices that do not require the DOA [7], [17], [22], [23]. For certain array configurations, it is possible to design a focusing matrix that achieves very small focusing errors for all possible directions simultaneously [24]. A discussion of conditions determining the accuracy of approximate focusing matrices can be found in [7]. There, it is shown that accurate focusing may be obtained even for the case where there are more sources than sensors if the number of spherical harmonics coefficients required for accurate representation of the steering matrices is less than or equal to Q.
Focusing and frequency smoothing play a significant role in speaker localization under reverberation. For speaker localization, the smoothed SCM is often constructed in the time-frequency (TF) domain. To overcome reverberation, direct path dominance (DPD) test based methods select TF bins that exhibit a contribution from a single dominant source, which was shown to be the speaker, and use only these bins to estimate speaker direction. The test in [8] selects bins in which the smoothed SCM is approximately of rank one. The method in [6] selects TF bins in which the first eigenvector of the smoothed SCM is similar to a steering vector. These methods assume full decorrelation. In practice, however, only moderate decorrelation is obtained, as will be demonstrated in Section VI. Decorrelation can be improved by widening the smoothing bandwidth, but a wider bandwidth may increase focusing error [7]. Also, the bandwidth is limited by the actual bandwidth of the source signal. The following section presents a method to improve the decorrelation for a given bandwidth, mitigating the above challenge.

V. FREQUENCY SMOOTHING WEIGHTS FOR IMPROVED DECORRELATION
In this section, frequency smoothing weights for improved decorrelation are derived. In the following, we omit the dependence on the center frequency f for brevity, and denote f_j = f + jΔf. The (k, k′) entry of R̃_s satisfies:

[R̃_s]_{k,k′} = α_k α*_{k′} a^H_τ w,    (21)

where τ = τ_k − τ_{k′} is the relative delay, w = [w_{−J}, ..., w_J]^T, and a_τ is a vector with entries [a_τ]_j = σ_s^2(f_j) e^{i2πf_j τ}. We would like to choose the weights w_{−J}, ..., w_J such that the off-diagonal entries of R̃_s are small in absolute value compared to the diagonal entries. As the delays of the reflections are typically not known, we propose to minimize |a^H_τ w|^2, averaged over all possible delays τ:

min_w ∫_0^{1/Δf} |a^H_τ w|^2 dτ  subject to  a^H_0 w = C.    (25)

The equality constraint aims to prevent the minimization of the diagonal entries of R̃_s, and the specific value of the constant C is not important (as long as it is positive). The integration interval is from 0 to Δf^{−1} since a^H_τ w is periodic in τ with period Δf^{−1}. We now proceed to solve the optimization problem in (25). First, we rewrite the objective in quadratic form:

∫_0^{1/Δf} |a^H_τ w|^2 dτ = w^H A w,  A = ∫_0^{1/Δf} a_τ a^H_τ dτ,

where the (j, j′) entry of A is:

A_{j,j′} = σ_s^2(f_j) σ_s^2(f_{j′}) ∫_0^{1/Δf} e^{i2π(f_j − f_{j′})τ} dτ = Δf^{−1} σ_s^4(f_j) δ_{j,j′},

so that A is diagonal. Hence, the optimization problem in (25) is equivalent to the following weighted least norm problem:

min_w w^H A w  subject to  a^H_0 w = C,
the solution of which is given by [25]:

w = C A^{−1} a_0 / (a^H_0 A^{−1} a_0),

which implies:

w_j = C_1 / σ_s^2(f_j),    (37)

where C_1 is a positive constant. Eq. (37) shows that the weights which give optimal decorrelation in the average sense for unknown delays are inversely proportional to the power of the source signal at the given frequency. This weighting is similar to that used by the phase transform (PHAT) for time delay estimation [26]. Using these weights with C_1 = 1, we have:

|a^H_τ w| = |Σ_{j=−J}^{J} e^{−i2π(f+jΔf)τ}| = |sin((2J+1)πΔfτ) / sin(πΔfτ)|.

This Dirichlet kernel D_n(x) = sin((n+1/2)x)/sin(x/2) behaves in a sinc-like manner: its main lobe is centered around x = 0 and its first null is found at x = π/(n+1/2). As a result, the main lobe of |a^H_τ w| will be centered around τ = 0 and its first null will be found at τ = B^{−1}, where B ≜ Δf(2J+1). It is now evident that J determines the temporal resolution, which affects the ability to decorrelate reflections with different delays. This result can be used as a guideline for choosing J. For example, weighted frequency smoothing with a bandwidth B of 500 Hz can decorrelate the direct sound from reflections with delays larger than 2 ms, for which the expression |a^H_τ w| obtains a small value. This point is further demonstrated in Section VII.
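The sinc-like main lobe and the null at τ = B⁻¹ can be checked numerically; the values of Δf and J below are illustrative:

```python
import numpy as np

def kernel_mag(tau, f, df, J):
    """|a_tau^H w| with the decorrelation weights (C1 = 1): the weighted sum
    collapses to |sum_j exp(-i 2 pi (f + j df) tau)|, a Dirichlet kernel."""
    k = np.arange(-J, J + 1)
    return abs(np.exp(-2j * np.pi * (f + k * df) * tau).sum())

df, J, f = 31.25, 8, 1000.0           # illustrative parameters
B = df * (2 * J + 1)                  # smoothing bandwidth, ~531 Hz here
print(kernel_mag(0.0, f, df, J))      # main-lobe peak: 2J + 1 = 17
print(kernel_mag(1.0 / B, f, df, J))  # first null: ~0
```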

VI. EFFICIENT ESTIMATION OF THE SOURCE POWER
In the previous section it was shown that the j'th frequency smoothing weight w_j should be inversely proportional to σ_s^2(f_j). However, σ_s^2(f) is not usually known and can only be estimated from the data. The trace of R_x provides a very rough but simple estimate of σ_s^2(f). By ignoring R_n in (7) and substituting (10), we get [13]:

tr(R_x(f)) ≈ σ_s^2(f) Σ_k |α_k|^2 ‖v(Ω_k)‖^2 (1 + b_k),

where:

b_k = (1 / (|α_k|^2 ‖v(Ω_k)‖^2)) Σ_{k′>k} 2 Re{α_k α*_{k′} e^{−i2πf(τ_k−τ_{k′})} v^H(Ω_{k′}) v(Ω_k)},

and Re{·} is a function that outputs only the real part of a complex scalar. We argue that b_k is typically small compared to 1 in the case where the coherent sources are room reflections, since room reflection amplitudes usually decay rapidly, such that for k′ > k the ratio |α_{k′}|/|α_k| is expected to be small. Furthermore, when two reflections have similar amplitudes but very different DOAs, then for many array configurations their steering vectors' inner product is small, being the gain outside the main lobe for a delay-and-sum beamformer [2]. Neglecting b_k, we obtain an estimate of σ_s^2(f) multiplied by the frequency-independent term Σ_k |α_k|^2 ‖v(Ω_k)‖^2, which is absorbed into the constant C_1 when computing the weights.
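The decorrelation weights can thus be computed directly from the data; a sketch, where the function name, the arbitrary normalization, and the small floor guarding against near-silent bins are our additions, not part of the derivation:

```python
import numpy as np

def decorrelation_weights(scms, floor=1e-12):
    """w_j proportional to 1 / sigma_s^2(f_j), with sigma_s^2(f_j) estimated
    (up to a frequency-independent constant) by tr(R_x(f_j)).

    scms: sequence of (Q, Q) spatial covariance matrices, one per frequency bin
    """
    power = np.array([np.real(np.trace(R)) for R in scms])
    return 1.0 / np.maximum(power, floor)
```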

VII. DEMONSTRATION OF SPEECH DECORRELATION
In Section V it was shown that in order to improve decorrelation, the frequency smoothing weights should be inversely proportional to the source signal power. These weights act to whiten the source signal; hence, weighted and standard (non-weighted) frequency smoothing are approximately the same for a white source signal. These weights are therefore beneficial only for a non-white source signal, such as speech. This section demonstrates the quality of decorrelation obtained by the proposed weights (as opposed to uniform and SNR-related weights, henceforth denoted "SNR weights") for a speech signal. Further, the effect of the various weightings on the SNR is studied. We assume two coherent sources, representing a pair of reflections in the room, and we examine the decorrelation as a function of their relative delay τ. In the previous sections the analysis required frequency domain signals, which are usually obtained by means of the short time Fourier transform (STFT). Using this format promotes localized analysis in both the time and frequency domains. Here, we assume that the window length of the STFT is larger than the maximal delay τ_K (in the present case, where there are only two reflections, τ_K is simply their relative delay τ) to a sufficient extent such that it is possible to apply the multiplicative transfer function (MTF) approximation in the STFT for (5) [27].
A scenario of two coherent sources impinging on a microphone array was simulated. A 12-element microphone array mounted on the Nao robot head was employed; the same array is described in detail in the documentation of the recent LOCalization And TrAcking (LOCATA) challenge [15]. Two coherent sources were simulated using a 3-second speech signal from the TIMIT database [29] and its delayed, unattenuated copy, both sampled at a sampling frequency of 16 kHz. The microphone signals were simulated by convolving the source signals with the array steering vectors from source directions Ω_0 = (100°, 25°) and Ω_1 = (20°, 2°). Finally, white Gaussian noise was added to the microphone signals, simulating additive noise. The microphone, noise, and source signals were transformed by the STFT with an FFT length of 512 samples, a Hamming window of 512 samples (32 ms) and a 50% overlap. The matrices R_x, R_s, R_n from (7), (8) and (9) were constructed in the STFT domain by:

R_x(i, j) = x(i, j) x^H(i, j),  R_s(i, j) = s(i, j) s^H(i, j),  R_n(i, j) = n(i, j) n^H(i, j),

where i and j denote time and frequency indices, respectively, n(i, j) and x(i, j) denote the STFT of the noise and the microphone signals, respectively, and s(i, j) = [s_0(i, j), s_1(i, j)]^T, where s_k(i, j) denotes the STFT of the k'th source signal. Then, R̃_s and R̃_n were calculated according to (19) and (20) with the following weights: uniform weights, w_j = 1; SNR weights [5], [20], w_j = σ_s^2(f_j); and the proposed weights, w_j = 1/σ_s^2(f_j), referred to as decorrelation weights. For the SNR and decorrelation weights, σ_s^2(f_j) is either replaced by the actual source power |s_0(i, j)|^2 or estimated by tr(R_x(i, j)).
As a performance measure we use the correlation coefficient

ρ = |[R̃_s]_{1,2}| / √([R̃_s]_{1,1} [R̃_s]_{2,2}),

which obtains values in the range [0, 1], where ρ = 0 signifies uncorrelated sources and ρ = 1 signifies fully correlated sources. The analytical expression of ρ with the decorrelation weights is obtained by substituting (37) and (21):

ρ(τ) = (1 / (2J+1)) |sin((2J+1)πΔfτ) / sin(πΔfτ)|.    (45)

Eq. (45) shows that ρ(τ) is independent of the center frequency f. The SNR and the SNR gain are defined as SNR = tr(R̃_s) / tr(R̃_n) and G = SNR / SNR_in, respectively, where SNR_in denotes the SNR when computed with uniform weights. Fig. 1 depicts the average ρ (over time frames) as a function of the delay τ for a center frequency of 1 kHz. It shows that the theoretical ρ has a sinc-like shape, and that poor decorrelation is obtained for delays that are shorter than the main-lobe width; it also shows that decorrelation tends to improve for larger delays as the sinc decays. Further, Fig. 1 shows that the decorrelation weights perform best and that the performance of the true decorrelation weights slightly deviates from the theoretical ρ as the delay increases. This is due to the violation of the MTF approximation for delays that are significant compared to the STFT window length. Fig. 1(c) shows that the deviation from the theoretical ρ decreases for a larger window. Comparing Fig. 1(a) and (b), the effect of a wider smoothing bandwidth is evident: it decreases both the main lobe width and the side lobe levels. It should be noted that by incorporating tapering, one can trade off the width of the main lobe against the levels of the side lobes. Also, recall that the theoretical ρ is periodic with period Δf^{−1}, and hence equals 1 for τ = Δf^{−1}. Nevertheless, results for the empirical ρ do not follow those for the theoretical ρ for delays of such length, since for such delays most reflections are naturally uncorrelated, due to the non-stationarity of the speech signal. Fig. 2 depicts the average ρ as a function of SNR_in for a delay of τ = 8 ms, which represents the delay of a typical early reflection in a room. Fig.
2 shows that the estimated weights perform similarly to the uniform weights for a low SNR_in, and similarly to the true weights for a high SNR_in. This is an expected result since for a low SNR_in, tr(R_x(i, j)) is approximately tr(R_n(i, j)), which is approximately frequency independent for white noise. Fig. 3 depicts the average SNR gain as a function of SNR_in for a delay of τ = 8 ms. It shows that the SNR weights increase the SNR by about 7 dB, and the decorrelation weights decrease the SNR by about 19 dB. For white noise, the decorrelation weights are inversely proportional to the SNR weights, hence promoting frequencies with low SNR and reducing the SNR. The results in Figs. 2 and 3 suggest the existence of a trade-off between SNR gain and decorrelation, where improving decorrelation is at the expense of reducing the SNR and vice versa. Consequently, one may prefer to use decorrelation weights only when the SNR is high. As is evident from Figs. 2 and 3, this is inherently achieved by the estimated decorrelation weights due to the built-in regularization term tr(R_n(i, j)) in their denominator.
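The qualitative effect of the weighting on ρ can be reproduced in a few lines for two coherent sources with a non-white spectrum; the spectrum shape and all parameter values below are illustrative, not those of the experiment:

```python
import numpy as np

J, df, f_c, tau = 8, 31.25, 1000.0, 8e-3   # illustrative parameters
freqs = f_c + np.arange(-J, J + 1) * df
sigma2 = 1.0 / (1.0 + ((freqs - 900.0) / 50.0) ** 2)  # hypothetical non-white power

def h(f):
    # Source vector for a direct sound and one reflection (alpha_1 = 0.5)
    return np.array([1.0, 0.5 * np.exp(-2j * np.pi * f * tau)])

def rho(w):
    # Smoothed sources' cross-correlation matrix (as in eq. (19)) and its
    # normalized off-diagonal magnitude
    R = sum(wj * s2 * np.outer(h(f), h(f).conj())
            for wj, s2, f in zip(w, sigma2, freqs))
    return abs(R[0, 1]) / np.sqrt(R[0, 0].real * R[1, 1].real)

print(rho(np.ones_like(sigma2)))  # uniform weights
print(rho(1.0 / sigma2))          # decorrelation weights: smaller rho
```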

VIII. THEORETICAL ANALYSIS OF DOA PERFORMANCE
The previous section showed that the decorrelation weights improve decorrelation but decrease the SNR. This section theoretically studies the DOA estimation performance of the CSS method with the various weights for a case of two coherent sources, which may represent a direct sound and its reflection in a room. First, a commonly used analysis approach is presented, which examines the DOA error against the theoretical Cramér-Rao lower bound (CRLB), while assuming a known number of sources and determining the dimension of the signal subspace accordingly. Then, a case that may be more relevant for speaker localization is examined, in which the number of sources is unknown, and only the DOA of the direct sound is of interest and is estimated by using a one-dimensional signal subspace. The analysis in this section assumes a low sample support, where only a single time frame is available, as is often the case in real speaker localization scenarios due to reverberation and movements of the speaker or the array. Also, to prevent the focusing error from affecting the results, the input signal x(f) is simulated as an Ambisonics signal that follows the model in (1), but with frequency-independent steering vectors. For the Ambisonics signal, the steering vector elements are the conjugate spherical harmonics functions Y_n^m(Ω)* with order n = 0, 1, 2, ... and degree m = −n, ..., n [2], where the steering vector entry index is n^2 + n + m + 1. We use spherical harmonics up to order n = 3, which is analogous to a 16-element array.
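The spherical-harmonic indexing above can be sketched as a one-line helper (the function name is ours):

```python
def sh_index(n, m):
    """1-based steering-vector entry index of the spherical harmonic of
    order n and degree m, per the n^2 + n + m + 1 convention above."""
    return n * n + n + m + 1

print(sh_index(0, 0))  # 1: the first entry
print(sh_index(3, 3))  # 16: orders up to n = 3 give 16 entries
```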
Two coherent sources were simulated: a direct sound with DOA Ω_0 = (90°, 10°) and a reflection with DOA Ω_1 = (90°, 40°), a relative delay τ_1 = 8 ms, and a scaling coefficient α_1 = 0.5. The source signal is colored Gaussian noise with a triangular spectrum with a central frequency at 1 kHz and a slope of approximately 0.02 dB/Hz. 512 source signal time samples were generated with a sampling frequency of 16 kHz. The Ambisonics signal was simulated by multiplying the source signals with the steering functions from the corresponding source directions. Finally, white Gaussian noise with covariance R_n = σ_n^2 I was added to the simulated Ambisonics signal. The simulated signal was transformed by an FFT with a length of 512 samples. The CSS method was applied with the various weights and with a smoothing bandwidth of B ≈ 500 Hz (J = 8). The decorrelation and SNR weights were computed with the estimated source power, as in the previous section. Finally, the MUSIC spectrum was calculated with a 2-dimensional signal subspace for a known elevation θ = 90° and an azimuth angle φ in the range [−180°, 180°) with a spatial resolution of 0.001°. The two source azimuths were estimated by the two largest peaks in the spectrum. The azimuth estimation error was computed after assigning estimates to sources in a way that minimizes the overall error.
The CRLB for estimating the source azimuths Φ = [φ_1, φ_2], in the case where the sources' cross-correlation matrices R_s(f_j) and the noise power σ_n^2 are unknown, is [28]:

CRLB(Φ) = (σ_n^2 / 2T) {Σ_j Re[(V̇^H P⊥_V V̇) ⊙ (R_s(f_j) V^H R_x^{−1}(f_j) V R_s(f_j))^T]}^{−1}.

Here, T denotes the number of statistically independent frames, ⊙ denotes the Hadamard product, P⊥_V = I − V(V^H V)^{−1} V^H is the projection matrix onto the noise subspace, and V̇ = [∂v(φ_1)/∂φ_1, ∂v(φ_2)/∂φ_2] is a derivative matrix. Note that factors such as the power, delay, and correlation of the source signals influence the sources' cross-correlation matrices R_s(f_j), which, in turn, affect the CRLB. However, this particular formula does not explicitly specify the relationship between these factors and the bound. Fig. 4 shows the RMSE (over 1000 independent trials) in the direct sound's DOA estimation versus SNR_in, alongside the CRLB for a single frame. The figure shows that the CSS with uniform and SNR weights approaches the CRLB at high SNR values. The figure also shows that the decorrelation weights perform worst, while the uniform and SNR weights perform similarly. The performance degradation with the decorrelation weights is attributed to the SNR loss with these weights. The results suggest that the moderate decorrelation achieved by the uniform and SNR weights is sufficient for accurate DOA estimation in the present case, where the signal subspace dimension is known and equals the number of sources.
For speaker localization, however, only the DOA of the direct sound is of interest, while the total number of sources (both the direct sound and reflections) is unknown and challenging to estimate due to the coherency of the sources and their time-varying number that may exceed the number of microphones. The theoretical analysis in [6], which studied the DOA estimation of a direct sound source in the presence of coherent reflected sources, has shown that with proper decorrelation and sufficient spatial separation between the sources, the DOA information of the direct sound is contained in the first eigenvector, which corresponds to the largest eigenvalue. Hence, the speaker localization method in [6] uses only the first eigenvector to construct the signal subspace for MUSIC. In the following we adopted a similar approach, where we estimated only the DOA of the direct sound by using a signal subspace of a single dimension. All other simulation parameters remained the same. Fig. 5 shows the RMSE (over 1000 independent trials) in the direct sound's DOA estimation as a function of SNR_in. The figure shows that, in contrast to the errors in Fig. 4, the errors with a one-dimensional signal subspace are smaller than 10° for all SNRs. The figure also shows that the decorrelation weights achieve substantially lower errors than the SNR and uniform weights. This result is attributed to the improved decorrelation that leads to better estimation of the direct path signal subspace. Additionally, it can be seen that all of the weights function similarly at low SNR values. This result concurs with the previous section's conclusion that for white noise the estimated SNR and decorrelation weights approach the uniform weights as the SNR drops.

IX. EXPERIMENTAL STUDY
Section VII shows that frequency smoothing with uniform or SNR weights leads to moderate decorrelation, and that decorrelation improves with the proposed decorrelation weights. Yet, many localization methods employ frequency smoothing with uniform or SNR weights and assume full decorrelation. One such method is the DPD test based method for speaker localization under reverberation [7], [8]. This method operates in the STFT domain and overcomes reverberation by identifying TF bins with a dominant direct path and using only these bins for estimating speakers' directions. The method employs frequency smoothing and requires good decorrelation for an appropriate selection of the direct path bins. This section demonstrates the advantage of the proposed weighting when incorporated into this DPD test based method.
The DPD test is applied with the various frequency smoothing weights to real-world recordings of multiple static speakers in a room with three microphone arrays from the LOCATA challenge [14]. The microphone arrays included a 32-microphone spherical array (Eigenmike), a 12-microphone array mounted on a robot's head (Benchmark), and a 15-microphone planar array (DICIT). The data was recorded in a laboratory of size 7.1 m × 9.8 m × 3 m with an approximate reverberation time of T 60 = 0.55 s. Thirteen scenarios involving two, three or four static speakers were recorded by each of the arrays. The wideband SNR is approximately 13 dB for the Eigenmike, 20 dB for the Benchmark and 8 dB for the DICIT array.
The recordings were down-sampled from 48 kHz to 16 kHz and then transformed by the STFT with a Hamming window of 512 samples (32 ms), a 50% overlap and an FFT length of 512 samples. The matrix R_x(i, j) was transformed by focusing matrices, as in (14). The WINGS focusing transformations, presented in [7], with a spherical harmonics order of N = 12 were employed for the Eigenmike and the Benchmark arrays, and the unitary focusing transformations [7] of order N = 30 were employed for the DICIT array. The matrix R̃_x was computed according to (17) with the various weights and for various center frequencies; hence, in the following it will be denoted with time and frequency indices as R̃_x(i, j), where j indicates the center frequency. The set of TF bins selected by the DPD test was [7], [8]:

A = {(i, j) : σ_1(i, j) / σ_2(i, j) > T_H},

where σ_1(i, j) and σ_2(i, j) are the largest and second largest singular values of R̃_x(i, j) and T_H is a user defined threshold. A high singular value ratio implies a single dominant source and hence a dominant direct path. Therefore, for a sufficiently high threshold, DOA estimation using TF bins in A is accurate. Furthermore, due to the sparsity of the speech signal in the time-frequency domain [29], it is highly probable that A includes direct path bins of each of the speakers, enabling the detection and DOA estimation of several speakers [8]. MUSIC with a source subspace of a single dimension and with an angular resolution of approximately 3° was applied to each of the selected bins. For the DICIT array, whose configuration is less suitable for DOA estimation in 3D, only the azimuth angle was estimated, and the elevation angle was assumed to be 80°. Let Ω̂(i, j) denote the DOA estimate at the (i, j) bin and Ω_l denote the DOA of the l'th speaker. The DOA estimation error at the (i, j) bin is defined with respect to the direction of the closest source:

ε(i, j) = min_l ∠(Ω̂(i, j), Ω_l),

where ∠ denotes angular distance.
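The singular-value-ratio test above can be sketched per TF bin as follows; the matrices and threshold value are illustrative:

```python
import numpy as np

def dpd_pass(R_smoothed, thresh):
    """Singular-value-ratio DPD test for one TF bin: a large sigma_1/sigma_2
    indicates a single dominant (direct-path) source."""
    s = np.linalg.svd(R_smoothed, compute_uv=False)
    return s[0] / s[1] > thresh

# A nearly rank-1 SCM (single dominant source) passes; a two-source SCM fails.
v = np.array([1.0, 1j, -1.0, -1j]) / 2.0           # unit-norm steering vector
R1 = np.outer(v, v.conj()) + 1e-4 * np.eye(4)      # dominant direct path
R2 = np.outer(v, v.conj()) + np.eye(4)             # strong second component
print(dpd_pass(R1, 10.0), dpd_pass(R2, 10.0))      # True False
```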
The performance of the DPD test is evaluated by examining the RMSE of DOA estimates at the selected bins. The RMSE is defined as:

RMSE = √((1/|A|) Σ_{(i,j)∈A} ε(i, j)^2),

where |A| denotes the cardinality of the set A. The RMSE represents the mean error at an individual bin. The final DOA estimate is typically obtained by fusing the DOA estimates from multiple bins, and therefore its error tends to be smaller. For each recording, the set A was calculated for center frequencies in the range of 0.5–4 kHz. The focusing error [7] in this range is smaller than −50 dB for the Eigenmike, −23 dB for the Benchmark, and −2 dB for the DICIT arrays. The test threshold T_H was set such that a given percentage of the TF bins in that range would pass the test. Fig. 6 depicts the average (over recordings) RMSE for the various microphone arrays as a function of the DPD test threshold. Fig. 6 shows that the DPD test with decorrelation weights performs best, and that performance tends to improve for a wider smoothing bandwidth. The improved decorrelation obtained by the decorrelation weights and by wider bandwidths leads to an appropriate selection of the direct path bins. With poor decorrelation, bins with multiple coherent reflections may exhibit a high singular value ratio (due to a deficient rank of R̃_s(i, j)), leading to their selection by the test and hence to degraded localization accuracy. The performance of the sigma ratio (SR)-based DPD test presented above was compared to various state-of-the-art DPD tests [6], [7], [30], [31], [32]. The SR-based [7], [8] and the EDS-based (enhanced decomposition of the direct sound) [6] DPD tests, which incorporate frequency smoothing, were implemented with a smoothing bandwidth of B ≈ 500 Hz and with the various weights. To enable the application of the EDS-based test to the employed arrays, the method was applied to the frequency smoothed SCM.
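The per-bin error and RMSE above can be sketched as follows, with DOAs represented as unit vectors; the helper names and example values are ours:

```python
import numpy as np

def bin_error_deg(est, speakers):
    """eps(i, j): angular distance (degrees) from a bin's DOA estimate to the
    closest speaker. est: (3,) unit vector; speakers: (L, 3) unit vectors."""
    cosines = np.clip(speakers @ est, -1.0, 1.0)
    return np.degrees(np.arccos(cosines)).min()

def rmse_deg(errors):
    """RMSE over the per-bin errors of the selected set A."""
    e = np.asarray(errors, dtype=float)
    return np.sqrt(np.mean(e ** 2))

est = np.array([0.0, 0.0, 1.0])
speakers = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(bin_error_deg(est, speakers))  # 0.0: the estimate matches the closest speaker
```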
The DPD tests based on the local space domain distance (LSDD) [31], the coherent-to-diffuse ratio (CDR) [30], and DOA estimates consistency (EC) [32] do not incorporate frequency smoothing of the SCM. In the CDR- and EC-based methods, the LSDD-based DOA estimator [31] was used for the DOA estimation, while the CDR and EC measures were used as DPD measures, with which the set of direct-path bins A was calculated. Table I presents the averaged RMSE of the various DPD tests. It shows that the decorrelation weights also enhance the performance of the EDS-based DPD test, which achieved the smallest error among the tested methods.

X. CONCLUSION
Many speaker localization methods employ frequency smoothing to decorrelate room reflections and assume full decorrelation. In this work, we have shown that, in practice, only moderate decorrelation is often obtained, which reduces localization accuracy. A weighted frequency smoothing method has been proposed to improve decorrelation performance, and optimal weights were derived. Experimental studies demonstrated the improved decorrelation obtained by the proposed weights and the enhanced localization accuracy when they were combined with a DPD test based method for speaker localization. These results suggest that the proposed weights may be useful for other frequency-smoothing-based methods, including localization methods for coherent sources other than speech [17], [19], [23], spatial filtering [22], [33], [34], speech enhancement [12] and focusing frequency selection [21].