Exploiting Deep Neural Networks and Head Movements for Robust Binaural Localization of Multiple Sources in Reverberant Environments

This paper presents a novel machine-hearing system that exploits deep neural networks (DNNs) and head movements for robust binaural localization of multiple sources in reverberant environments. DNNs are used to learn the relationship between the source azimuth and binaural cues, consisting of the complete cross-correlation function (CCF) and interaural level differences (ILDs). In contrast to many previous binaural hearing systems, the proposed approach is not restricted to localization of sound sources in the frontal hemifield. Due to the similarity of binaural cues in the frontal and rear hemifields, front–back confusions often occur. To address this, a head movement strategy is incorporated in the localization model to help reduce the front–back errors. The proposed DNN system is compared to a Gaussian-mixture-model-based system that employs interaural time differences (ITDs) and ILDs as localization features. Our experiments show that the DNN is able to exploit information in the CCF that is not available in the ITD cue, which together with head movements substantially improves localization accuracies under challenging acoustic scenarios, in which multiple talkers and room reverberation are present.

The human auditory system determines the azimuth of sounds in the horizontal plane by using two principal cues: interaural time differences (ITDs) and interaural level differences (ILDs). A number of authors have proposed binaural sound localisation systems that use the same approach, by extracting ITDs and ILDs from acoustic recordings made at each ear of an artificial head [3]–[6]. Typically, these systems first use a bank of cochlear filters to split the incoming sound into a number of frequency bands. The ITD and ILD are then estimated in each band, and statistical models such as Gaussian mixture models (GMMs) are used to determine the source azimuth from the corresponding binaural cues [6]. Furthermore, the robustness of this approach to varying acoustic conditions can be improved by using multi-conditional training (MCT). This introduces uncertainty into the statistical models of the binaural cues, enabling them to handle the effects of reverberation and interfering sound sources [4]–[7].

In contrast to many previous machine systems, the approach proposed here is not restricted to sound localisation in the frontal hemifield; we consider source positions in the 360° azimuth range around the head. In this unconstrained case, the location of a sound cannot be uniquely determined by ITDs and ILDs; due to the similarity of these cues in the frontal and rear hemifields, front-back confusions occur [8]. Although machine listening studies have noted this as a problem [6], [9], listeners rarely make such confusions because head movements, as well as spectral cues due to the pinnae, play an important role in resolving front-back confusions [8], [10], [11].

Relatively few machine localisation systems have attempted to incorporate head movements. Braasch et al.
[12] averaged cross-correlation patterns across different head orientations in order to resolve front-back confusions in anechoic conditions. More recently, May et al. [6] combined head movements and MCT in a system that achieved robust sound localisation performance in reverberant conditions. In their approach, the localisation system included a hypothesis-driven feedback stage which triggered a head movement when the azimuth could not be unambiguously estimated. Subsequently, Ma et al. [9] evaluated the effectiveness of different head movement strategies, using a complex acoustic environment that included multiple sources and room reverberation. In agreement with studies on human sound localisation [13], they found that localisation errors were minimised by a strategy that rotated the head towards the target sound source.

2329-9290 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
environment, in which a binaural receiver is moved in order to simulate the head rotation of a human listener. The output from the DNN is combined with a head movement strategy to robustly localise multiple talkers in reverberant environments.

A. Binaural Feature Extraction
An auditory front-end was employed to analyse the binaural ear signals with a bank of 32 overlapping Gammatone filters, with centre frequencies uniformly spaced on the equivalent rectangular bandwidth (ERB) scale between 80 Hz and 8 kHz [18]. Inner-hair-cell processing was approximated by half-wave rectification. No low-pass filtering was employed to simulate the loss of phase-locking at high frequencies, as previous studies have shown that in general classifiers are able to exploit the high-frequency structure [4]. Afterwards, the CCF between the right and left ears was computed independently for each frequency band using overlapping frames of 20 ms with a 10 ms shift. The CCF was further normalised by the auto-correlation value at lag zero [4] and evaluated for time lags in the range of ±1.1 ms.
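As a rough sketch (not the authors' code), the per-band normalised CCF described above might be computed as follows for a single 20 ms frame; the lag range, frame length, and lag-zero normalisation follow the text, while the function name and sign convention for the lag are our own assumptions:

```python
import numpy as np

def normalized_ccf(left, right, fs=16000, max_lag_ms=1.0):
    """Cross-correlation of one frame of left/right band signals,
    normalised by the lag-zero auto-correlation values.
    Sketch only: the Gammatone filterbank, half-wave rectification,
    and framing are assumed to have been applied upstream."""
    max_lag = int(fs * max_lag_ms / 1000)  # e.g. 16 samples at 16 kHz
    norm = np.sqrt(np.dot(left, left) * np.dot(right, right)) + 1e-12
    ccf = np.empty(2 * max_lag + 1)
    for i, lag in enumerate(range(-max_lag, max_lag + 1)):
        if lag >= 0:
            ccf[i] = np.dot(left[lag:], right[:len(right) - lag])
        else:
            ccf[i] = np.dot(left[:lag], right[-lag:])
    return ccf / norm
```

For identical ear signals the normalised CCF peaks at lag zero with value 1; a pure delay between the ears moves the peak to the corresponding lag.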

Two binaural features, ITDs and ILDs, are typically used in binaural localisation systems [1]. The ITD is estimated as the lag corresponding to the maximum in the CCF. The ILD corresponds to the energy ratio between the left and right ears within the analysis window, expressed in dB. In this study, instead of estimating the ITD, the entire CCF was used as the localisation feature. This approach was motivated by two observations. First, computation of ITDs involves a peak-picking operation which may not be robust in the presence of noise and reverberation. Second, there are systematic changes in the CCF with source azimuth (in particular, changes in the main peak with respect to its side peaks). Even in multi-source scenarios, these can be exploited by a suitable classifier. For signals sampled at 16 kHz, the CCF with a lag range of ±1 ms produced a 33-dimensional binaural feature space for each frequency band. This was supplemented by the ILD, forming a final 34-dimensional (34D) feature vector.
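The contrast between the fragile peak-picking ITD and the full CCF feature can be sketched as follows; the helper names and the ILD sign convention (left over right) are our assumptions, not the authors':

```python
import numpy as np

def itd_from_ccf(ccf, fs=16000):
    # Peak-picking ITD estimate: the lag (in seconds) of the CCF maximum.
    # A single spurious peak under noise shifts this estimate entirely,
    # whereas the full CCF retains the main-peak/side-peak structure.
    max_lag = (len(ccf) - 1) // 2
    return (np.argmax(ccf) - max_lag) / fs

def band_feature(ccf, left, right):
    """34D per-band feature: 33 CCF lags (±1 ms at 16 kHz) plus the ILD,
    the dB energy ratio between the left- and right-ear signals."""
    ild = 10.0 * np.log10((np.sum(left**2) + 1e-12) /
                          (np.sum(right**2) + 1e-12))
    return np.concatenate([ccf, [ild]])
```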

DNNs were used to map the 34D binaural feature set to corresponding azimuth angles. A separate DNN was trained for each of the 32 frequency bands. Employing frequency-dependent DNNs was found to be effective for localising simultaneous sound sources. Although simultaneous sources overlap in time, within a local time frame each frequency band is mostly dominated by a single source (Bregman's [19] notion of 'exclusive allocation'). Hence, this allows training using single-source data and removes the need to include multi-source data for training.

The DNN consists of an input layer, two hidden layers, and an output layer. The input layer contained 34 nodes, and each node was assumed to be a Gaussian random variable with zero mean and unit variance. The 34D binaural feature inputs for each frequency band were Gaussian-normalised, and white Gaussian noise (variance 0.4) was added to avoid overfitting, before being used as input to the DNN. The hidden layers had sigmoid activation functions, and each layer contained 128 hidden nodes. The number of hidden nodes was selected heuristically: more hidden nodes increased the computation time but did not improve localisation accuracy. The output layer contained 72 nodes corresponding to the 72 azimuth angles in the full 360° azimuth range, with a 5° step. A 'softmax' activation function was applied at the output layer. The same DNN architecture was used for all frequency bands and was not optimised for individual frequencies.
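A minimal forward-pass sketch of this 34-128-128-72 architecture is given below, with random placeholder weights rather than trained values (the training-time input noise is omitted); it is illustrative only, not the authors' implementation:

```python
import numpy as np

def dnn_forward(x, params):
    """Per-band DNN forward pass: 34 inputs, two sigmoid hidden layers
    of 128 units, and a 72-way softmax over 5-degree azimuth bins."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = x
    for W, b in params[:-1]:
        h = sigmoid(h @ W + b)       # sigmoid hidden layers
    W, b = params[-1]
    logits = h @ W + b
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

def init_params(rng, sizes=(34, 128, 128, 72)):
    # Random placeholder weights; real weights come from training.
    return [(0.1 * rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]
```

The output is a 72-bin probability distribution over azimuth, one bin per 5° step.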

The neural network was initialised with a single hidden layer, and the number of hidden layers was gradually increased in later training phases. In each training phase, mini-batch gradient descent with a batch size of 128 was used, including a momentum term with the momentum rate set to 0.5. The initial learning rate was set to 1, which gradually decreased to 0.05 after 20 epochs.
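The optimiser settings described here can be sketched as follows; since the exact decay curve from 1 to 0.05 is not specified in the text, a geometric decay is assumed:

```python
def learning_rate(epoch, lr0=1.0, lr_final=0.05, decay_epochs=20):
    """Learning-rate schedule sketch: decay from 1 to 0.05 over
    20 epochs, then hold constant. Geometric decay is an assumption."""
    if epoch >= decay_epochs:
        return lr_final
    return lr0 * (lr_final / lr0) ** (epoch / decay_epochs)

def momentum_step(w, v, grad, lr, mu=0.5):
    # Classical momentum update with momentum rate 0.5, as in the paper.
    v = mu * v - lr * grad
    return w + v, v
```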

After the learning rate decreased to 0.05, it was held constant for a further 5 epochs. We also included a validation set and the [...] where P(k) is the prior probability of each azimuth k. [...] The target location was given by the azimuth k that maximised the posterior probability.

A second posterior distribution is computed for the signal block after the completion of the head movement. If a peak in the first posterior distribution corresponds to a true source position, then it will appear in the second posterior distribution, shifted by an amount corresponding to the angle of head rotation (assuming that sources are stationary before and after the head movement). On the other hand, if a peak is due to a phantom source, it will not occur in the second posterior distribution, as shown in the bottom panel of Fig. 2. By exploiting this relationship, potential phantom source peaks are identified and eliminated from both posterior distributions. After the phantom sources have been removed, the two posterior distributions are averaged to further emphasise the local peaks corresponding to true sources. The most prominent peaks in the averaged posterior distribution were assumed to correspond to active source positions. Here the number of active sources was assumed to be known a priori.

The proposed approach to exploiting head movements is based on late information fusion: the information from the model predictions is integrated. This is in contrast to the approach in [12], which adopted early fusion at the feature level by averaging cross-correlation patterns across different head orientations.
Late fusion is preferred here for two reasons: i) head rotation is not needed during model training, so it is more straightforward to generate data for training robust localisation models (DNNs); ii) early feature fusion tends to lose information which can otherwise be exploited by the system. As a result, the proposed system is able to deal with overlapping sound sources in reverberant conditions, while the system reported in [12] was tested in anechoic conditions with a single source.

[...] four room conditions with various amounts of reverberation.
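The phantom-elimination step can be sketched as below for 72 azimuth bins of 5°; the peak threshold, the matching rule, and the sign convention for the rotation-induced shift are our assumptions, not values from the paper:

```python
import numpy as np

def fuse_posteriors(p1, p2, rotation_deg, step_deg=5, peak_thresh=0.02):
    """Late-fusion sketch of the head-movement strategy: a peak in the
    first posterior survives only if a matching peak appears in the
    second posterior, shifted by the head-rotation angle; surviving
    posteriors are then averaged and renormalised."""
    shift = int(round(rotation_deg / step_deg))
    p2_aligned = np.roll(p2, shift)   # undo the rotation-induced shift
    keep = (p1 > peak_thresh) & (p2_aligned > peak_thresh)
    fused = np.where(keep, 0.5 * (p1 + p2_aligned), 0.0)
    s = fused.sum()
    return fused / s if s > 0 else fused
```

A true source appears in both posteriors at bins that differ by the rotation angle; a phantom peak appears in only one and is zeroed out.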

The loudspeakers were placed around the HATS on an arc in the horizontal plane, at a radius of 1.5 m, between ±90° measured at 5° intervals. Table I [...] were created using the same anechoic HRIRs recorded using a KEMAR dummy head [20]. This approach was used in preference to adding reverberation during training, since previous studies (e.g., [5]) suggested that it was more likely to generalise well across a wide range of reverberant test conditions.

The training material consisted of speech sentences from the TIMIT database [22]. A set of 30 sentences was randomly selected [...] was due to test conditions rather than signal variation. Since the duration of each GRID sentence was different, and there was silence of various lengths at the beginning of each sentence, the central 1 s segment of each sentence was selected for evaluation.

Note that although the models were trained and evaluated using speech signals, our systems are not intended to localise only speech sources. Therefore a frequency range from 80 Hz to 8 kHz was selected for the signals sampled at 16 kHz. Our previous studies [6], [15] also showed that 32 Gammatone filters (see Section II-A) provide a good tradeoff between frequency resolution and computational cost. As the evaluation included localisation of up to three overlapping talkers, using too few filters would result in insufficient frequency resolution to reliably localise multiple talkers.

The baseline system was a state-of-the-art localisation system [6] that modelled both ITD and ILD features within a GMM framework. As in [6], the GMM modelled the binaural features using 16 Gaussian components and diagonal covariance matrices for each azimuth and each frequency band. The GMM parameters were initialised by 15 iterations of the k-means clustering algorithm and further refined using 5 iterations of the expectation-maximisation (EM) algorithm. The second localisation model was the proposed DNN system using the CCF and ILD features. Each DNN employed four layers, including two hidden layers each consisting of 128 hidden nodes (see Section II-B).
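For reference, scoring a feature frame under such a diagonal-covariance GMM amounts to a log-sum-exp over component log-densities; the sketch below shows only the scoring step (k-means initialisation and EM training are omitted), with the function name being ours:

```python
import numpy as np

def diag_gmm_loglik(x, weights, means, variances):
    """Log-likelihood of feature frame x (shape D) under a diagonal-
    covariance GMM with K components (weights: K, means/variances: KxD),
    as used per azimuth and per frequency band by the GMM baseline."""
    # Componentwise log N(x; mu_k, diag(var_k)), summed over dimensions.
    log_norm = -0.5 * (np.log(2 * np.pi * variances)
                       + (x - means) ** 2 / variances).sum(axis=1)
    a = np.log(weights) + log_norm
    m = a.max()
    return m + np.log(np.exp(a - m).sum())  # stable log-sum-exp
```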

Both localisation systems were evaluated using different training strategies (clean training and MCT), various localisation feature sets (ITD, ILD and CCF), and with or without head movements. When no head movement was employed, the source azimuths were estimated using the entire 1 s segment from each acoustic mixture. If head movement was used, the 1 s segment was divided into two 0.5 s long blocks, and the second block was provided to the system after completion of a head movement. Therefore, in both conditions the same signal duration was used for localisation.

The gross accuracy of localisation was measured by comparing true source azimuths with the estimated azimuths. The number of active speech sources N was assumed to be known a priori, and the N azimuths with the largest posterior probabilities were selected as the estimated azimuths. Localisation of a source was considered accurate if the estimated azimuth was no more than 5° away from the true source azimuth:

dist(φ, φ̂) ≤ θ

where dist(·) is the angular distance between two azimuths, φ is the true source azimuth, φ̂ is the estimated azimuth, and θ is the threshold in degrees (5° in this study). This metric is preferred to RMS error because our study is concerned with full 360° localisation, and localisation errors in degrees are often large due to front-back confusions.

[...] The DNN system was substantially more robust than the GMM system, but its performance also decreased significantly when multiple talkers were present. The benefit of the MCT method became more apparent for both systems in this scenario: the average localisation accuracy increased from 62.9% to 92.6% for the GMM system and from 87% to 95% for the DNN system. Across all the room conditions, the largest benefits were observed in room B, where the direct-to-reverberant ratio was the lowest, and in room D, where the reverberation time T60 was the longest.
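The accuracy criterion above requires an angular distance that wraps around at 360°; a direct implementation might look as follows (function names are ours):

```python
def angular_dist(a, b):
    """Smallest absolute difference between two azimuths in degrees,
    accounting for wrap-around at 360 degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def is_correct(true_az, est_az, theta=5.0):
    # A localisation is counted as correct if the estimate lies within
    # theta degrees (5 in this study) of the true azimuth.
    return angular_dist(true_az, est_az) <= theta
```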

Errors made in 360° localisation could be due to front-back confusion as well as interference caused by reverberation and overlapping talkers. Figure 5 shows errors made by both the GMM and the DNN systems, using either clean training or MCT, in different room conditions. The errors due to front-back confusions are indicated by white bars for each system. Here a localisation error is considered to be a front-back confusion when the estimated azimuth is within ±20° of the azimuth that would produce the same ITDs in the rear hemifield. It is clear that front-back confusions contributed a large portion of the localisation errors for both systems, in particular when clean training was used. When the MCT method was used, not only were the errors due to interference from reverberation and overlapping talkers (non-white bar portions in Fig. 5) greatly reduced, but the systems also produced substantially fewer front-back errors (white bars in Fig. 5). As will be discussed in the next section, without head movements the main cues distinguishing between front-back azimuth pairs lie in the combination of interaural level and time differences (or ITD-related features such as the cross-correlation function). MCT provides the training stage with better regularisation of the features, which improves the generalisation of the learned models and better discriminates the front-back confusable azimuths.

[Table: localisation error rates for 1, 2 and 3 talkers per room and feature set. The models were trained using the MCT method. The best feature set for each system is marked in bold font.]

Fig. 6. Comparison of localization error rates produced by various systems using different spatial features. Localization was not restricted to the frontal hemifield, so that front-back errors can occur, as indicated by the white bars for each system. No head movement strategy was employed.
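The front-back confusion criterion used in this analysis can be sketched as follows, assuming azimuths measured in degrees over the full 360° range with 0° at the front; the mirroring rule (reflection about the interaural axis) and function names are our assumptions:

```python
def frontback_mirror(az):
    """Azimuth in the opposite hemifield with (approximately) the same
    ITD: reflection about the interaural axis, assuming 0 deg = front."""
    return (180.0 - az) % 360.0

def is_frontback_confusion(true_az, est_az, tol=20.0):
    # An error is labelled a front-back confusion when the estimate
    # falls within +/-20 deg of the mirrored true azimuth.
    d = abs(est_az - frontback_mirror(true_az)) % 360.0
    return min(d, 360.0 - d) <= tol
```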

It is also worth noting that the training and testing stages used [...]

When ILDs were not used, the localisation errors were largely due to an increased number of front-back errors, as suggested by Fig. 6. For single-talker localisation in rooms B and D, without ILDs almost all the errors made by the systems were front-back errors. When ILDs were used, the number of front-back errors was greatly reduced in all conditions. This suggests that the ILD cue plays a major role in resolving front-back confusions. ITDs alone may appear more symmetric between the front and back hemifields, but together with ILDs they create the necessary asymmetries (due to the KEMAR head with pinnae) for the models to learn the differences between front and back azimuths.

Table III also lists the localisation results of the GMM system when using the same CCF-ILD feature set as the DNN system. The GMM failed to extract the systematic structure in the CCF spanning multiple feature dimensions, most likely due to its inferior ability to model correlated features. Its average localisation accuracy was only 88.5%, compared to 95% for the DNN system, and again it suffered most in the more reverberant conditions such as rooms B and D.

Table IV lists the gross localisation accuracies with and without head movement. All systems were trained using the MCT method and employed their respective best-performing features (GMM ITD-ILD and DNN CCF-ILD). Both the GMM and DNN systems benefitted from the use of head movements. It is clear from Fig. 7 that the localisation errors were almost entirely due to front-back confusions in one-talker localisation. By exploiting the head movement, the systems managed to remove most of the front-back errors and achieved near 100% localisation accuracies. In two- or three-talker localisation, the number of front-back errors was also [...]

Fig. 7. Localization error rates produced by various systems with or without head movement when localizing one, two, or three overlapping talkers. Localization was performed in the 360° azimuth range so that front-back errors can occur, as indicated by the white bars for each system.

Fig. 8 shows the localisation error rates as a function of azimuth. The error rates here were averaged across the 1-, 2- and 3-talker localisation tasks. Across most room conditions, sound localisation was generally more reliable at central locations than at lateral source locations. This is particularly the case for the GMM system, as shown in Fig. 8, where the localisation error rates for sources at the sides were above 20% even in the least reverberant room A. It is also clear from [...]

Fig. 9. Localization error rates produced by various systems as a function of the azimuth for the Auditorium3 task. Localization was performed in the full 360° azimuth range so that front-back errors can occur, as indicated by the white bars for each system.

Finally, Fig. 9 shows the localisation error rates using the [...] For the GMM system the benefit is particularly pronounced for the source at 51°, with localisation errors reduced from 14% to 4% in two-source localisation and from 36% to 14% in three-source localisation. The rear source at 131° appeared to be difficult for the GMM system to localise even with head movement, with a 20% error rate in two-source localisation. The DNN system with head movements was able to reduce the error rate for the rear source at 131° to 8%.

In general, the performance of the models for the 51° and 131° locations was worse than for the other source locations when multiple sources were present at the same time. This is most likely due to the nature of the room acoustics at these locations, e.g., they are further away from the listener and closer to walls.

This paper presented a machine-hearing framework that combines DNNs and head movements for robust localisation of multiple sources in reverberant conditions. Since simultaneous talkers were located in the full 360° azimuth range, front-back confusions occurred. Compared to a GMM-based system, the proposed DNN system was able to exploit the rich information provided by the entire CCF, and thus substantially reduced localisation errors. The MCT method was effective in combatting reverberation, and allowed anechoic signals to be used for training a robust localisation model that generalised well to unseen reverberant conditions and to mismatched artificial heads between training and testing conditions. It was also found that the inclusion of ILDs was necessary for reducing front-back confusions in reverberant rooms. The use of head rotation further increased the robustness of the proposed system, with an average localisation accuracy of 96% in acoustic scenarios where up to three competing talkers and room reverberation were present.

In the current study, the use of DNNs allowed higher-dimensional feature vectors to be exploited for localisation, in comparison with previous studies [4]–[6]. This could be carried further by exploiting additional context within the DNN, in either the time or the frequency dimension. Moreover, it is possible to complement the features used here with other binaural features, e.g., a measure of interaural coherence [24], as well as monaural localisation cues, which are known to be important for judgment of elevation angles [25], [26]. Visual features might also be combined with acoustic features in order to achieve audio-visual source localisation.

The proposed system has been realised in a real-world human-robot interaction scenario. The azimuth posterior distributions from the DNN for each processing block were temporally smoothed using a leaky integrator, and head rotation was triggered if a front-back confusion was detected in the integrated posterior distribution. Audio signals acquired during head rotation were not processed. Such a scheme can be more practical for a robotic platform, as head rotation often produces self-noise which makes the audio unusable.
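The leaky integration of per-block posteriors can be sketched as a one-pole recursive average; the leak factor alpha below is an assumed value, not one reported in the paper:

```python
import numpy as np

def leaky_integrate(posteriors, alpha=0.9):
    """Temporal smoothing of a sequence of per-block azimuth posteriors
    with a leaky integrator: s <- alpha * s + (1 - alpha) * p."""
    smoothed = np.zeros_like(posteriors[0])
    for p in posteriors:
        smoothed = alpha * smoothed + (1.0 - alpha) * p
    return smoothed
```

With a constant input the smoothed posterior converges towards that input, while transient spurious peaks are attenuated.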

One limitation of the current systems is that the number of active sources is assumed to be known a priori. This could be improved by including a source number estimator that is either