A Binaural Model Predicting Speech Intelligibility in the Presence of Stationary Noise and Noise-Vocoded Speech Interferers for Normal-Hearing and Hearing-Impaired Listeners

A binaural model is presented which predicts the effect of audibility on the intelligibility of speech in the presence of speech-shaped noise and vocoded-speech maskers. It takes the calibrated target and masker signals (independently) and the listener’s tonal audiogram at each ear as inputs. Model predictions are compared to speech reception thresholds (SRTs) measured for normal-hearing (NH) and hearing-impaired (HI) listeners in the presence of two uncorrelated speech-spectrum noises or two vocoded-speech maskers, which were either (artificially) spatially separated or co-located with the frontal speech target. The artificial spatial separation was realized by presenting each masker to a different single ear using headphones, while the target was presented diotically as coming from the front. Audibility was varied by testing four different sensation levels for the combined maskers. The model allows for a good prediction of the decrease of SRT and the increase of spatial release from masking (based primarily on better-ear glimpsing here) with increasing audibility. For both groups of listeners, the averaged absolute prediction error across conditions was between 0.6 and 1.7 dB.


Introduction
Possessing two ears is useful for understanding speech in noise. It allows a competing sound source to cause less masking when it is spatially separated from the target. This spatial release from masking (SRM) relies on interaural level and time differences in the signals reaching the ears (ILDs and ITDs, respectively) [1, 2]. It has been shown that SRM is substantially reduced for HI listeners [3]. While several binaural models exist to describe SRM for NH listeners (see [2] for references), we are aware of only one model proposed to predict SRM also for HI listeners. This model has been tested using SRTs measured in the presence of a single noise masker, which was either stationary [1] or modulated in amplitude [4]. The aim of the present study was to propose an alternative modeling approach to predict the effect of audibility on SRTs and SRM for both NH and HI listeners. The performance of this model was tested here for speech presented against two maskers (either stationary noises or vocoded speech).

Received 27 February 2018, accepted 10 July 2018.

Model
The proposed model is based on an updated implementation of the model of Collin and Lavandier [2], which predicts binaural speech intelligibility in the presence of multiple non-stationary noises, but does not take hearing impairment and audibility into account. It combines the effects of better-ear listening and binaural unmasking and is based on two inputs: the ear signals generated by the target and those generated by the sum of all interferers. Based on these inputs, the model computes the better-ear signal-to-noise ratio (SNR), as the maximum value of the SNR at the left and right ears, and the binaural unmasking advantage (in dB) from the target and masker interaural parameters. The computation is realized in frequency bands and followed by integration across bands. Adding the better-ear and binaural unmasking components, the model finally produces a (broadband) "effective binaural ratio". Binaural ratios are inverted in order to be compared to SRTs, so that high ratios correspond to low thresholds (high intelligibility).
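The per-band combination of the two components can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the use of a plain unweighted mean for the across-band integration are assumptions (the paper does not specify the band-weighting scheme here).

```python
import numpy as np

def effective_binaural_ratio(snr_left_db, snr_right_db, bu_advantage_db):
    """Per band: better-ear SNR (max of the two ears, in dB) plus the
    binaural unmasking advantage (in dB)."""
    better_ear_db = np.maximum(snr_left_db, snr_right_db)
    return better_ear_db + bu_advantage_db

def broadband_ratio(per_band_ratios_db):
    """Illustrative across-band integration (unweighted mean; assumption)."""
    return float(np.mean(per_band_ratios_db))
```

The broadband ratio is then inverted (sign-flipped) so that a higher ratio maps to a lower predicted threshold.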
The predictions are based on short-term predictions averaged across time. To avoid target speech pauses mistakenly leading to a reduction in predicted intelligibility, the model considers interfering energy as a function of time and target energy averaged across time. Instead of replacing the target speech by a stationary signal with similar long-term spectrum and interaural parameters and applying the short-term analysis to this signal [2], the present implementation computes the long-term statistics of the target only once and combines them with the short-term statistics of the noise to compute binaural ratios within each time frame (before averaging). The model uses 24-ms half-overlapping Hann windows as time frames [2] and a gammatone filterbank with center frequencies ranging from 30 to 19885 Hz and two filters per equivalent rectangular bandwidth (ERB). A ceiling corresponding to the maximum better-ear SNR allowed per frequency band and time frame is also applied, to prevent this SNR from tending to infinity during interferer pauses. The ceiling value was set to 20 dB.
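The short-term framing and the SNR ceiling can be sketched as below. The sampling rate, the RMS-based level estimate, and the function names are assumptions for illustration; only the 24-ms half-overlapping Hann windows and the 20 dB ceiling come from the text.

```python
import numpy as np

FS = 44100                 # assumed sampling rate (Hz)
FRAME = int(0.024 * FS)    # 24-ms frames
HOP = FRAME // 2           # half-overlapping
CEILING_DB = 20.0          # maximum better-ear SNR per band and frame

def framed_levels_db(x, frame=FRAME, hop=HOP):
    """Short-term levels (dB) of one band signal, Hann-windowed RMS per frame."""
    w = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    levels = np.empty(n_frames)
    for i in range(n_frames):
        seg = x[i * hop:i * hop + frame] * w
        levels[i] = 10 * np.log10(np.mean(seg ** 2) + 1e-20)
    return levels

def better_ear_snr_db(target_db, masker_db):
    """Frame-wise SNR limited by the ceiling, so it cannot tend to
    infinity during interferer pauses."""
    return np.minimum(target_db - masker_db, CEILING_DB)
```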
In order to take into account the effects of audibility for both NH and HI listeners, the model proposed here introduces several modifications to the initial model of Collin and Lavandier. The absolute broadband level of the target and masker signals and the listener's tonal audiogram, all expressed in dB sound pressure level (SPL), are required as additional inputs to the model. Predictions are computed separately for each listener (results below are averaged across listeners). While computing the binaural ratios, the target and masker levels are compared to the levels of internal noises that are based on the listener's hearing thresholds. For the better-ear component, the SNR is computed in each time frame, frequency band and at each ear by subtracting from the target level the maximum of the masker and internal noise levels. The binaural unmasking advantage is set to 0 dB as soon as the masker or target level is below the internal noise level at one ear; otherwise this component of the model is not modified. As discussed below, this rather "crude" model of binaural unmasking for HI listeners was not properly tested with the stimuli considered below, because those did not contain realistic ITDs. Further investigation is needed concerning this component of the model.
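These two audibility rules can be written compactly. A minimal sketch, assuming per-band dB levels in one time frame; the function names and the (left, right) pair representation are illustrative, not from the paper.

```python
import numpy as np

def audible_snr_db(target_db, masker_db, internal_db):
    """Better-ear SNR with audibility: the effective masker at an ear is the
    louder of the external masker and the listener's internal noise."""
    return target_db - np.maximum(masker_db, internal_db)

def gated_bu_advantage_db(bu_db, target_lr_db, masker_lr_db, internal_lr_db):
    """Binaural unmasking advantage, set to 0 dB as soon as the target or
    masker level falls below the internal noise at either ear.
    Each *_lr_db argument is a (left, right) pair of dB levels."""
    for t, m, n in zip(target_lr_db, masker_lr_db, internal_lr_db):
        if t < n or m < n:
            return 0.0
    return bu_db
```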
The internal noise considered at each ear in the model is spectrally shaped using the tonal audiogram. The audiometric pure-tone thresholds (given in dB HL) are converted into eardrum levels (in dB SPL), and the resulting levels are then interpolated to obtain their values at the center frequencies used in the model. The level conversion was realized by adding to the pure-tone thresholds the reference equivalent sound pressure levels for the TDH-39 headphones [5] and nominal values for the transformation from 6-cc coupler to eardrum levels [6]. Within the audiometric frequency range, the thresholds in dB SPL were interpolated on a logarithmic frequency scale. For frequencies below 250 Hz and above 6 kHz, the threshold was set to the value in dB SPL at 250 Hz and 6 kHz, respectively. Individual pure-tone thresholds were considered separately for the left and right ears. The internal noise levels are then obtained by adding a value in dB to the interpolated thresholds. This value, margin, sets the broadband level of the internal noise in dB SPL. It is a free parameter of the model, assumed to be constant across frequency and within subject group (i.e., NH or HI), but different between subject groups.
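The threshold-to-internal-noise mapping can be sketched as below. The dB HL to dB SPL offsets shown are placeholders: the real values come from the TDH-39 reference levels [5] and the coupler-to-eardrum transforms [6], which are tabulated in the cited standards, not here.

```python
import numpy as np

# Audiometric frequencies (Hz) and PLACEHOLDER HL-to-SPL offsets (dB);
# substitute the tabulated values from [5] and [6] in a real implementation.
AUDIO_FREQS = np.array([250.0, 500.0, 1000.0, 2000.0, 4000.0, 6000.0])
HL_TO_SPL_DB = np.array([25.5, 11.5, 7.0, 9.0, 9.5, 15.5])

def internal_noise_levels_db(thresholds_hl, center_freqs, margin_db):
    """Internal noise spectrum (dB SPL) at the model's center frequencies,
    for one ear."""
    spl = np.asarray(thresholds_hl, float) + HL_TO_SPL_DB
    # Interpolate on a logarithmic frequency scale; np.interp clamps to the
    # edge values, which matches holding the 250 Hz and 6 kHz thresholds
    # below and above the audiometric range.
    interp = np.interp(np.log10(center_freqs), np.log10(AUDIO_FREQS), spl)
    return interp + margin_db
```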
The model predictions presented here were computed using the stimuli from two related experiments [7, 8] briefly summarized in Section 3. For each condition, two minutes of the masker signal were considered, and the target was represented by averaging 120 and 128 target sentences for experiments 1 and 2, respectively, after all sentences had been truncated to the duration of the shortest sentence. The masker and averaged target signals were all convolved with the impulse response of the headphones used for data collection, measured on a 4128C Brüel & Kjær head and torso simulator. All signals were calibrated to the sound levels (dB SPL) used in the experiments.

Data
The experimental data and stimuli used to verify the proposed binaural model were taken from two experiments described in detail in [7] and [8]. Experiment 1 evaluated the effect of temporal masker fluctuations on SRT and SRM in NH and HI listeners [7]. Experiment 2 focused explicitly on the effect of sensation level (and thus audibility) on SRT and SRM in fluctuating noise [8].
In both experiments, SRTs were measured adaptively using BKB-like target sentences [9] in the presence of two noise-vocoded speech interferers. The noise vocoder was applied to minimize informational masking effects. It was realized with five frequency channels with a bandwidth of four critical bands each, and was applied separately to each of the two speech maskers. The target speech was unprocessed (i.e., not vocoded) and always presented from 0° azimuth, whereas the two interferers were either co-located with the target or (artificially) spatially separated. The target speech and the co-located interferers were spatialized by applying the same across-ear averaged head-related transfer function for frontal sound incidence from [10] to both ears. The spatially separated interferers were realized artificially such that one was presented to the left ear and the other to the right ear, realizing "infinite" broadband ILDs but no ITDs. All stimuli were presented via equalized Sennheiser HD215 headphones and were filtered such that they had the same long-term spectrum as the target speech.
In experiment 1, the combined interferer level was set to 60 dB SPL and the level of the target speech was adjusted adaptively such that, on average, 50% of the words were correctly understood. The resulting SNR provided an estimate of the SRT. To partly compensate for the loss in audibility, the HI listeners received linear amplification according to the National Acoustic Laboratories' Revised, Profound prescription formula (NAL-RP, [11]). Moreover, two uncorrelated speech-shaped noise interferers were tested in addition to the noise-vocoded speech interferers. Ten NH listeners (hearing thresholds below 15 dB HL) with a mean age of 33.1 years and ten HI listeners with a mean age of 66.9 years participated in experiment 1. All HI listeners had symmetric, mild to moderate, sloping, sensorineural hearing loss with a four-frequency (0.5, 1, 2, 4 kHz) average hearing loss (4-FAHL) of 37.8 ± 7.1 dB HL.
In experiment 2, all stimuli were audibility-equalized across frequency by providing amplification (or attenuation) equivalent to the individually measured detection thresholds for speech-shaped noise filtered into nine different frequency regions. SRTs were measured for the noise-vocoded speech interferers presented at four different sensation levels (0, 10, 20 and 30 dB) relative to the individual SRTs in quiet. It should be noted that 0 dB SL corresponds to very low levels in dB SPL, in particular for the NH listeners. The level of the target was varied adaptively relative to the combined interferer level during each SRT measurement. By varying the overall level of the stimuli in this experiment, their audibility was varied. Ten NH listeners with a mean age of 23.2 years and ten HI listeners with a mean age of 70.3 years participated in experiment 2, but not all HI listeners could be tested at the higher sensation levels due to loudness tolerance issues. All HI listeners had symmetric, mild to moderate, sloping, sensorineural hearing loss with a 4-FAHL of 29.1 ± 8.0 dB HL.

Predictions
Predicted differences of (inverted) binaural ratio between conditions can be directly compared to corresponding SRT differences. To compare absolute thresholds rather than relative differences, a reference needs to be chosen. For each listener considered here, the reference was the individual average SRT across conditions in the experiment. To obtain the predicted SRTs of each listener, inverted ratios were centered on this average SRT (by subtracting their mean and adding the average SRT). In other words, the individual average predicted SRT was aligned to the individual average measured SRT, so that we only aimed at predicting the differences across conditions within each group of listeners and experiment (i.e., within each panel of Figures 1 and 2).
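The centering step described above amounts to a one-line operation per listener. A minimal sketch, with an assumed function name:

```python
import numpy as np

def predicted_srts_db(inverted_ratios_db, measured_srts_db):
    """Center one listener's inverted binaural ratios on that listener's mean
    measured SRT, so only across-condition differences are predicted."""
    ratios = np.asarray(inverted_ratios_db, float)
    return ratios - ratios.mean() + np.mean(measured_srts_db)
```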
Prediction performance was evaluated in terms of the Bravais-Pearson correlation between measured and predicted SRTs (Corr), the mean absolute prediction error (MeanErr, absolute differences between measured and predicted SRTs averaged across conditions), and the maximum absolute prediction error (MaxErr). The value of the free parameter margin was chosen to minimize MeanErr in experiment 2, independently for each group of listeners, resulting in a margin of −11 dB and −22 dB for the NH and HI listeners, respectively. The same margin values were used for the modeling of experiment 1 (considered here for validation).
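The three evaluation metrics are standard and can be computed as follows (the function name is illustrative):

```python
import numpy as np

def evaluate(measured_srts_db, predicted_srts_db):
    """Return (Corr, MeanErr, MaxErr): Bravais-Pearson correlation, mean and
    maximum absolute prediction error across conditions."""
    m = np.asarray(measured_srts_db, float)
    p = np.asarray(predicted_srts_db, float)
    corr = np.corrcoef(m, p)[0, 1]
    err = np.abs(m - p)
    return corr, err.mean(), err.max()
```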

Figure 2. Mean SRTs with standard errors across NH (top panel) and HI (bottom panel) listeners measured and predicted in experiment 2 for four overall masker levels (the audibility of the target and maskers increased with overall level). The two vocoded speech maskers were either spatially separated (sep.) or co-located (col.) with the frontal target. A small horizontal offset has been added to the model predictions to reduce symbol overlap.
The measured and predicted SRTs of experiment 1 are presented in Figure 1. The model closely predicted both the SRM and the masking release associated with envelope modulations in the masker (SSN vs. vocoded speech) for the NH listeners (MeanErr below 1 dB; Corr was computed on only four points here, so it should be considered with caution). Predictions were less accurate for the HI listeners (even if MeanErr remained below 2 dB). In particular, the model overestimated the SRM for the SSNs. This SRM was not present in the data. As a result, the model also overestimated the effect of the masker envelope modulations in the co-located condition: the model predicts an advantage that is also not apparent in the data for the HI listeners.
Figure 2 presents the mean SRTs measured in experiment 2 for the NH (top panel) and HI (bottom panel) listeners, along with the model predictions for each group. SRTs are plotted as a function of overall masker level (increasing level corresponds to increasing audibility for both target and maskers). The model described accurately both the decrease of SRT and the increase of SRM with increasing audibility for both groups of listeners (Corr above 0.98 and MeanErr below 1 dB).

Discussion
While tested on data measured with diotic and dichotic stimuli reproduced over headphones, the binaural model proposed here was able to predict rather accurately the release from masking due to better-ear glimpsing in the presence of two maskers, the dip-listening advantage associated with envelope modulations in these maskers, and the effect of audibility on both SRTs and better-ear glimpsing, for NH and HI listeners. Predictions were less accurate for the HI listeners in experiment 1, which was not used to define the value of the free parameter of the model. Prediction performance was at least as good as for previous binaural NH models [2, 4], with a MeanErr between 0.6 and 1.7 dB. When stimulus levels are well above hearing thresholds, the proposed model is equivalent to that of Collin and Lavandier [2]. This was the case for the NH listeners in experiment 1, and the similar prediction performance obtained highlights the backward compatibility of the model. Even if the better-ear glimpsing component of the model seems validated by the first predictions presented here, the model needs to be further tested using stimuli with realistic ILDs and ITDs. In particular, the binaural unmasking component of the model relying on ITDs could not be tested here.
The free parameter margin, used to obtain the model's internal noise levels from the tonal audiograms, had to be set to different values for the NH and HI listeners. This is an important limitation of the model. When considering a panel of listeners with increasing hearing losses, it would be more relevant to be able to use a single model for all listeners, so in the future margin would at least need to be made dependent on the degree of hearing loss. The fact that this is not the case in the current model might explain why less accurate predictions were obtained for the HI listeners of experiment 1, who had a larger average hearing loss (4-FAHL) than the HI listeners of experiment 2, who were the listeners considered when defining margin.
The difference in margin obtained for the NH and HI listeners could reflect potential effects of reduced spectral and temporal resolution for the HI listeners, but also additional effects of cognitive differences between the two groups due to the age confound (i.e., young NH vs. old HI). In a different modeling framework (at least in terms of implementation, even if the concept of the present model is quite similar), Beutelmann et al. model the effects of audibility by adding independent internal noises at each ear. Their internal noise is also spectrally shaped using the tonal audiogram, and its levels are set 1 dB [1] or 4 dB [4] above the audiometric thresholds. These values are positive and much smaller in magnitude than the offsets of −11 dB and −22 dB used here. Even if the differences in model implementation might explain part of this discrepancy, more investigation is needed concerning this important parameter of the proposed model. It should be noted that, apart from the use of a different margin, the current model is identical for NH and HI listeners (e.g., in terms of spectral and temporal resolution). This might need further refinement as well.
Importantly, prediction performance was quite similar for both groups of listeners. While using identical parameters for NH and HI predictions, the binaural model proposed by Beutelmann et al. could predict well the SRTs measured in the presence of one SSN in different rooms [1]. The SRM was generally overestimated for HI listeners, but the difference in MeanErr between NH and HI listeners was only 0.5 dB. In the presence of an envelope-modulated noise [4], the predictions were less accurate for the HI compared to the NH listeners (Corr in the range 0.59-0.80 and 0.80-0.93, and MeanErr of 4 and 3 dB, respectively).
Only predictions averaged across listeners were presented here. The model can be applied to predict SRTs for individual listeners, but care should be taken to ensure that these individual SRTs are not influenced by the potential confounding effect of the sentence material, which needs to be counterbalanced across conditions (as was the case for the averaged SRTs considered here). The proposed model could be a useful tool to investigate individual differences between HI listeners in the future.

Figure 1. Mean SRTs with standard errors across NH (top panel) and HI (bottom panel) listeners measured and predicted in experiment 1. The two maskers were either speech-shaped noise (SSN) or vocoded speech, and either spatially separated (sep.) or co-located (col.) with the frontal target. A small horizontal offset has been added to the model predictions to reduce symbol overlap.