Speech Intelligibility Prediction Based on Mutual Information

This paper deals with the problem of predicting the average intelligibility of noisy and potentially processed speech signals, as observed by a group of normal-hearing listeners. We propose a model which performs this prediction based on the hypothesis that intelligibility is monotonically related to the mutual information between critical-band amplitude envelopes of the clean signal and the corresponding noisy/processed signal. The resulting intelligibility predictor turns out to be a simple function of the mean-square error (mse) that arises when estimating a clean critical-band amplitude using a minimum mean-square error (mmse) estimator based on the noisy/processed amplitude. The proposed model predicts that speech intelligibility cannot be improved by any processing of noisy critical-band amplitudes. Furthermore, the proposed intelligibility predictor performs well (ρ > 0.95) in predicting the intelligibility of speech signals contaminated by additive noise and potentially non-linearly processed using time-frequency weighting.


Jesper Jensen and Cees H. Taal
Index Terms: Instrumental measures, noise reduction, objective distortion measures, speech enhancement, speech intelligibility prediction.

I. INTRODUCTION
Monaural speech intelligibility prediction methods aim at predicting the average intelligibility of noisy and/or processed speech signals, as judged by a group of listeners. Our motivation for studying speech intelligibility predictors is twofold. Firstly, reliable intelligibility predictors are of great practical importance, e.g., in guiding the development process of speech processing algorithms, and replacing costly listening tests in early stages of the development phase. Secondly, the development and study of intelligibility predictors may lead to a better understanding of the mechanisms behind human speech intelligibility.
Historically, two main branches of intelligibility predictors may be identified: methods based on the Articulation Index (AI) [1], proposed first by French and Steinberg [2] and later refined by Kryter [3], and the Speech Transmission Index (STI) [4] proposed by Steeneken and Houtgast [5].
The basic AI approach assumes that intelligibility is a function of the speech information available to the listener across several frequency bands, each of which carries an independent contribution to the total intelligibility. Assuming that speech and masker signals are available in isolation, effective signal-to-noise ratios (SNRs) are computed for each frequency band; the SNRs are then limited to a certain pre-specified SNR range, normalized to a value between 0 and 1, and combined as a perceptually weighted average. The AI approach has later been refined further and standardized as the Speech Intelligibility Index (SII) [6]. AI and SII are based on long-term spectra of speech and masker and therefore may be inaccurate for fluctuating maskers. To reduce this problem, Rhebergen proposed the Extended SII [7], which divides the speech and masker signal into short-time frames (9-20 ms), computes the instantaneous SII value for each frame, and then averages the per-frame SII values to find a final intelligibility prediction. Another extension of the SII is the Coherence SII (CSII), which was proposed to better take into account non-linear distortions such as peak- and center-clipping [8].
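The basic AI computation described above (per-band SNRs limited to a fixed range, normalized to a value between 0 and 1, and combined as a weighted average) can be illustrated with a minimal Python sketch. The function name, the limits of -15 dB and +15 dB, and the example weights are illustrative assumptions; they are not taken from the AI or SII specifications.

```python
def articulation_index_sketch(band_snrs_db, band_weights, lo_db=-15.0, hi_db=15.0):
    """AI-style index: clip per-band SNRs, map them to [0, 1], weighted average.

    lo_db, hi_db and band_weights are illustrative assumptions, not values
    from any standard.
    """
    if len(band_snrs_db) != len(band_weights):
        raise ValueError("one weight per band is required")
    total_weight = sum(band_weights)
    score = 0.0
    for snr_db, weight in zip(band_snrs_db, band_weights):
        clipped = min(max(snr_db, lo_db), hi_db)           # limit to the SNR range
        audibility = (clipped - lo_db) / (hi_db - lo_db)   # normalize to [0, 1]
        score += weight * audibility
    return score / total_weight


# Hypothetical example: three bands with equal importance.
example_index = articulation_index_sketch([-20.0, 0.0, 12.0], [1.0, 1.0, 1.0])
```

Because the per-band contributions are simply added, this sketch also makes the independence assumption of the AI approach explicit: each band contributes to the score regardless of what happens in the other bands.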
The AI-based methods described above were originally formulated with the focus on simple linear degradations, e.g., linear filtering and additive, uncorrelated noise. The Speech Transmission Index (STI) [5], [9] extends the type of degradations to convolutive noise sources, such as reverberation and the effects of room acoustics. The STI is based on changes in the modulation transfer function. Specifically, STI relies on the observation that reverberation and additive noise tend to reduce the depth of the temporal amplitude/intensity modulations compared to the clean reference signal. Originally, STI used synthetic bandpass filtered probe signals with various acoustic center frequencies, intensity-modulated with a range of low-frequency sinusoidal modulators, whose frequencies were chosen to emulate the dominating modulation frequencies present in human speech. Later, in an attempt to better take into account the effects of various non-linear processing operations, such as dynamic amplitude compression [10] and envelope clipping [11], the class of speech STI (sSTI) methods was introduced [12], which replaced the artificial probe signals by actual speech signals. More recently, Jørgensen and Dau presented a speech intelligibility prediction model based on the envelope power signal-to-noise ratio $\mathrm{SNR}_{\mathrm{env}}$ at the output of a modulation filter bank [13]. This model showed promising results for noisy speech subjected to reverberation and spectral subtraction, but has only been evaluated for stationary speech-shaped noise.
The AI- and STI-based intelligibility predictors considered as a whole are suitable for a range of distortion types including additive noise, convolutive noise, filtering, and clipping, but they are less suited for speech signals distorted by non-stationary noise sources and processed by time-varying and non-linear filtering systems such as those typically used in single-channel speech enhancement systems [14], [15]. To better take these types of distortion into account, new intelligibility predictors were proposed, such as the method of Christiansen and Dau [16] and the Short-Time Objective Intelligibility (STOI) measure [17] by Taal et al. STOI shows similarities to the speech-based STI methods [12] in that speech envelopes extracted with bandpass filters are compared; however, unlike most variants of the speech-based STI methods, which are based on long-term statistics, STOI compares the envelopes via short-term measures.
In this study we constrain ourselves to monaural intelligibility prediction, that is, only one realization of the noisy/processed signal and the clean reference is available. We further assume that the noise is additive but not necessarily stationary, and we consider processing methods which can be described in a time-frequency analysis-modification-synthesis framework, e.g., [18]: in the analysis stage the signal is decomposed into time-frequency units, typically using a short-time band-pass filter bank; in the modification stage gain factors are multiplied onto the time-frequency units; and in the synthesis stage the modified time-frequency units are used to reconstruct processed time-domain signals. Since the gain factors are not necessarily constant across time and generally depend on short-term signal characteristics, the resulting processing may be time-varying and non-linear.
The proposed intelligibility prediction model makes use of basic information theoretic tools such as entropy and mutual information [19]. It appears natural to use tools developed to characterize information transmission. After all, the speech communication process can be viewed as the process of transmitting a speech signal across a time-varying, non-linear channel (the acoustic channel, the auditory periphery, and the higher stages of the auditory pathway) to reach the brain of the receiver; see also the work of Allen [20], who observed that the expression for the AI shows strong similarities to the expression for the capacity of a memoryless Gaussian channel, and the work of Leijon [21], who studied the potential relationship between AI (and SII) and the acoustic-to-auditory information rate. More specifically, the basic idea of the proposed method is to compare the critical-band amplitude envelopes of the clean and noisy/processed signal and estimate the intelligibility of the noisy/processed signal based on this comparison. In particular, we assume that the clean critical-band envelopes contain all information relevant for speech intelligibility, and consider the question: how much information (measured in bits) about the clean envelopes can be extracted, on average, by observing the envelopes of the noisy/processed signal? If the noisy/processed envelopes provide no information whatsoever, i.e., the mutual information between clean and noisy/processed envelopes is zero bits, then we expect the intelligibility of the noisy/processed signal to be zero. If, on the other hand, the noisy/processed envelopes provide much information about the clean envelopes, we expect the intelligibility of the noisy/processed input signal to be high.
The proposed intelligibility prediction model shares characteristics with the method proposed in [22], although the motivation for the proposed model is quite different. Specifically, the proposed model arises as a consequence of describing speech information transmission in a simple model of the auditory periphery, whereas the method in [22] has a more heuristic foundation in that it replaces the linear correlation operation used in the STOI model with a generalization, namely mutual information. Furthermore, the proposed model employs a short-term stationary signal model, whereas the method in [22] assumes the clean and noisy/processed speech signals to be realizations of strictly (long-term) stationary stochastic processes. Finally, the proposed model relies on lower bounds of mutual information, leading to simple equations in terms of second-order statistics, whereas the method in [22] estimates mutual information, which generally involves higher-order statistics.
The proposed intelligibility prediction model also bears some similarities to the STOI model [17], as it compares critical-band amplitude envelopes in terms of the linear correlation coefficient. However, whereas the use of linear correlation in STOI has a heuristic foundation, it follows in the proposed model as a consequence of the assumed signal model and a crude model of the auditory periphery; in this sense, the proposed model might be seen as a better motivated model.
With the proposed model, we have aimed at simplicity. For example, the proposed model does not make use of band importance functions to emphasize certain critical bands over others. Instead, each critical band contributes equally to intelligibility. In fact, from the information theoretical path followed in this paper, band importance functions are hard to justify. Furthermore, the proposed model appears to work well without them (see Section V). If one were to introduce band-importance functions or other additional free parameters, which model aspects not taken into account by the model, we expect performance to increase.
The article is organized as follows. In the following section we introduce the basic auditory model used in the proposed method. Section III derives the proposed mutual information lower bound. Section IV presents implementational details and discusses the numerical values of the few free parameters of the proposed method. In Section V the proposed intelligibility predictor is compared to intelligibility predictors from the literature for several noise sources and processing conditions. Finally, Section VI concludes the work.

II. AUDITORY FRONT-END AND NOTATION
We consider a crude signal processing model of the auditory periphery, which is similar in structure to front-ends used in speech enhancement [23], automatic speech recognition [24], and intelligibility predictors [17]. The model consists of a bandpass filter bank simulating the bandpass filtering characteristics of the cochlea, and a full-wave rectification, which simulates coarsely the mechanism of the hair cell transduction in the inner ear. The resulting "inner representations" are rough abstractions of the signal transmitted via the auditory nerve to the higher stages of the auditory system.
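The model is shown in more detail in Fig. 1. We use capital letters to denote random processes and variables and lower-case letters to denote the corresponding realizations. Let $S(n)$ and $X(n)$ denote random processes modeling a clean speech input signal and the corresponding noisy/processed signal, respectively. Both signals are divided into frames and transformed with a DFT, and the DFT coefficients of each frame are grouped into critical bands. Let $S(k,m)$ denote the DFT coefficient of the clean signal at frequency index $k$ and frame index $m$. The clean critical-band amplitude in band $j$ and frame $m$ is

$$A_j(m) = \sqrt{\sum_{k \in \mathrm{CB}_j} |S(k,m)|^2}, \tag{1}$$

where $\mathrm{CB}_j$ denotes the set of DFT indices belonging to critical band $j$. The noisy/processed critical-band amplitude $R_j(m)$ is defined analogously from $X(n)$.

Speech signals contain silence regions, which certainly do not contribute to speech intelligibility and therefore can be excluded from the mutual information computation. For this reason a simple energy-based per-frame VAD (details are given in Section IV) is applied to $S(n)$, resulting in the frame index set $\mathcal{M}_A$ of speech-active frames. In an identical manner, the VAD in the lower branch identifies low- and high-energy frames in $X(n)$. The low-energy frames are typically i) noise-only (i.e., silence) frames, or ii) they occur due to certain types of aggressive processing which essentially suppress entire signal frames, which do carry speech information. The high-energy frames are represented by the frame index set $\mathcal{M}_R$. Let $M$ denote the number of frames in a given speech sentence, let $J$ denote the number of critical bands, and let $\mathbf{A}$ and $\mathbf{R}$ denote random super vectors, formed by stacking critical-band amplitude spectra for successive frames. We are interested in the average mutual information (to be defined exactly below) between clean and noisy/processed critical-band amplitudes, i.e., $\frac{1}{|\mathcal{M}_A| J} I(\mathbf{A};\mathbf{R})$, where $|\cdot|$ denotes set cardinality, and $|\mathcal{M}_A| J$ estimates the number of speech-active critical-band amplitudes in the clean signal. Assuming that the entries in each super vector are statistically independent, an assumption which is routinely made in the area of speech enhancement, it is easy to verify that the mutual information decomposes into a summation of mutual information terms,

$$I(\mathbf{A};\mathbf{R}) = \sum_{j=1}^{J} \sum_{m=1}^{M} I\big(A_j(m);R_j(m)\big) = \sum_{j=1}^{J} \sum_{m \in \mathcal{M}_A \cap \mathcal{M}_R} I\big(A_j(m);R_j(m)\big).$$

The second equation follows because summation over the frame index set $\mathcal{M}_A \cap \mathcal{M}_R$, where both signals $S(n)$ and $X(n)$ are speech active, excludes terms which are all zero. Specifically, silence frames in $S(n)$ are excluded, and speech information loss due to over-suppressed frames in $X(n)$ is taken into account (that is, all terms in such frames are set to zero). An alternative, and perhaps physiologically more plausible, implementation of the described VAD function is to replace the VAD blocks by additive, uncorrelated internal noise sources, see e.g. [27].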
For notational convenience, we skip in the following the subband and frame index where possible, and simply replace $A_j(m)$ and $R_j(m)$ by $A$ and $R$, respectively. The mutual information between clean and noisy/processed critical-band amplitudes is given by [19]

$$I(A;R) = h(A) - h(A|R),$$

where the differential entropy of $A$ is

$$h(A) = -\int f_A(a) \log f_A(a)\, da,$$

and the conditional differential entropy is

$$h(A|R) = -\int f_R(r) \int f_{A|R}(a|r) \log f_{A|R}(a|r)\, da\, dr. \tag{2}$$

For certain simple situations, the joint probability density function (pdf) $f_{A,R}(a,r)$ may be given and the conditional differential entropy $h(A|R)$ might be derived analytically. However, in general, since the exact processing leading to $R$ may be complicated or even unknown, deriving or estimating from limited data the joint pdf $f_{A,R}(a,r)$ needed to compute $h(A|R)$ is difficult at best. Instead, to circumvent this difficulty, we propose to lower bound the mutual information $I(A;R)$; as we show in the following, this requires only second-order statistics of $(A,R)$.

III. LOWER BOUNDS ON $I(A;R)$
We derive lower bounds on the mutual information $I(A;R)$ by upper bounding the conditional entropy $h(A|R)$; see, e.g., the work of Bialek et al. [28] for another application of this procedure.

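A. Upper Bounds on $h(A|R)$
From the expression in Eq. (2) for the conditional entropy $h(A|R)$, it follows that

$$h(A|R) = \int f_R(r)\, h(A|R=r)\, dr \le \int f_R(r)\, \tfrac{1}{2} \log\!\left(2\pi e\, \sigma^2_{A|r}\right) dr \le \tfrac{1}{2} \log\!\left(2\pi e \int f_R(r)\, \sigma^2_{A|r}\, dr\right). \tag{3}$$

The first inequality holds because the maximum entropy pdf for a non-negative random variable with a given variance $\sigma^2$ approaches a Gaussian pdf for large means, which has differential entropy $\tfrac{1}{2}\log(2\pi e \sigma^2)$. The second inequality follows from Jensen's inequality [19, Thm. 2.6.2] and the fact that $\log(\cdot)$ is concave.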
The quantity $\sigma^2_{A|r}$ is the variance of a non-negative random variable distributed according to the pdf $f_{A|R}(a|r)$. Let $\mu_{A|r}$ denote the mean of this variable. Then,

$$\sigma^2_{A|r} = E\!\left[(A - \mu_{A|r})^2 \,\middle|\, R = r\right] = E\!\left[\big(A - E[A|R=r]\big)^2 \,\middle|\, R = r\right]. \tag{4}$$

The second equation follows by recalling that the conditional mean $E[A|R=r]$ is identical to the minimum mean-square error (mmse) estimator of the clean random variable $A$ upon observing the noisy and/or processed realization $r$. So, it is clear that $\sigma^2_{A|r}$ is nothing more than the mean-square error (mse) resulting from estimating $A$ upon observing $r$, using an mmse estimator.
Let $\sigma^2_{\mathrm{mmse}}$ denote $\sigma^2_{A|r}$ averaged across all realizations of the noisy/processed critical-band amplitude $R$, that is

$$\sigma^2_{\mathrm{mmse}} = \int f_R(r)\, \sigma^2_{A|r}\, dr. \tag{5}$$

Inserting Eq. (4) in Eq. (3) and using Eq. (5), we arrive at

$$h(A|R) \le \tfrac{1}{2} \log\!\left(2\pi e\, \sigma^2_{\mathrm{mmse}}\right). \tag{6}$$

To find $\sigma^2_{\mathrm{mmse}}$ we must form the mmse estimator and average the resulting mse across realizations of the noisy and/or processed critical-band amplitudes. Finding closed-form expressions for $\sigma^2_{\mathrm{mmse}}$ generally requires knowledge (or assumption) of the joint pdf $f_{A,R}(a,r)$. This has been a central topic in the area of single-channel speech enhancement over the last decades, for the case where a clean speech signal is contaminated by additive and independent noise; so, in this special case, it might be possible to derive closed-form expressions for $\sigma^2_{\mathrm{mmse}}$, and the bound could be evaluated. However, for the more general situation considered in this paper, the observations $R$ may be a result of some, potentially unknown, processing applied to the noisy observations, so that the joint pdf $f_{A,R}(a,r)$ would certainly be unknown, and estimating $\sigma^2_{\mathrm{mmse}}$ reliably from limited observations would be difficult.
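To circumvent this practical difficulty, observe that replacing the conditional mean estimator with the linear mmse estimator $\hat{A}_{\mathrm{lmmse}}$ leads to an mse of $\sigma^2_{\mathrm{lmmse}} \ge \sigma^2_{\mathrm{mmse}}$, with equality for jointly Gaussian $(A,R)$. It therefore follows that a looser upper bound on the conditional differential entropy based on linear mmse estimators is given by

$$h(A|R) \le \tfrac{1}{2} \log\!\left(2\pi e\, \sigma^2_{\mathrm{lmmse}}\right). \tag{7}$$

The quantity $\sigma^2_{\mathrm{lmmse}}$ is a function of second-order statistics, rather than the joint pdf $f_{A,R}(a,r)$. To derive an expression for $\sigma^2_{\mathrm{lmmse}}$, let $\mu_A$ and $\mu_R$ denote expected values of $A$ and $R$, respectively, and let $\sigma_{AR}$, $\sigma^2_A$, and $\sigma^2_R$ denote the cross-covariance between $A$ and $R$, the variance of $A$, and the variance of $R$, respectively. The linear mmse (lmmse) estimator is then given by (e.g., [30]),

$$\hat{A}_{\mathrm{lmmse}} = \mu_A + \frac{\sigma_{AR}}{\sigma^2_R}\,(R - \mu_R). \tag{8}$$

Inserting Eq. (8) in the expression for the mse, $\sigma^2_{\mathrm{lmmse}} = E[(A - \hat{A}_{\mathrm{lmmse}})^2]$, we get

$$\sigma^2_{\mathrm{lmmse}} = \sigma^2_A - \frac{\sigma^2_{AR}}{\sigma^2_R}. \tag{9}$$

With the derived upper bounds on $h(A|R)$ we have the following lower bounds on the mutual information $I(A;R)$, denoted $I_{\mathrm{mmse}}$ and $I_{\mathrm{lmmse}}$:

$$I(A;R) \ge I_{\mathrm{mmse}} \triangleq \max\!\left\{0,\; h(A) - \tfrac{1}{2}\log\!\left(2\pi e\, \sigma^2_{\mathrm{mmse}}\right)\right\} \ge I_{\mathrm{lmmse}} \triangleq \max\!\left\{0,\; h(A) - \tfrac{1}{2}\log\!\left(2\pi e\, \sigma^2_{\mathrm{lmmse}}\right)\right\}. \tag{10}$$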
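As a numerical illustration of the lmmse-based quantities in this subsection, the following Python sketch estimates the second-order statistics from paired samples and checks that the empirical mse of the linear estimator equals the clean variance minus the squared covariance divided by the variance of the observation. The synthetic data and all function names are hypothetical and serve only this illustration.

```python
import random


def second_order_stats(clean, observed):
    """Sample means, variances and covariance of paired amplitude samples."""
    n = len(clean)
    mu_a = sum(clean) / n
    mu_r = sum(observed) / n
    var_a = sum((a - mu_a) ** 2 for a in clean) / n
    var_r = sum((r - mu_r) ** 2 for r in observed) / n
    cov_ar = sum((a - mu_a) * (r - mu_r) for a, r in zip(clean, observed)) / n
    return mu_a, mu_r, var_a, var_r, cov_ar


def lmmse_mse(clean, observed):
    """Closed-form lmmse error variance: var_a - cov_ar^2 / var_r."""
    _, _, var_a, var_r, cov_ar = second_order_stats(clean, observed)
    return var_a - cov_ar ** 2 / var_r


def empirical_lmmse_mse(clean, observed):
    """Mean squared error of the linear estimator applied to the samples."""
    mu_a, mu_r, _, var_r, cov_ar = second_order_stats(clean, observed)
    gain = cov_ar / var_r
    errors = [a - (mu_a + gain * (r - mu_r)) for a, r in zip(clean, observed)]
    return sum(e * e for e in errors) / len(errors)


# Hypothetical non-negative "amplitudes": a clean value plus a disturbance.
rng = random.Random(0)
clean = [abs(rng.gauss(0.0, 1.0)) for _ in range(2000)]
observed = [abs(a + rng.gauss(0.0, 0.5)) for a in clean]
```

The closed-form value depends only on means, variances, and the covariance, which is what makes it practical to estimate when the joint pdf is unknown.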

B. Differential Entropy
The bounds discussed in this paper are functions of the entropy $h(A)$ of the clean speech critical-band amplitudes. To derive an expression for this quantity, we note that when the frame size is large compared to the correlation time of the clean signal $S(n)$, then the real and imaginary parts of the DFT coefficients can be considered independent and can be modeled as zero-mean Gaussian variables; see [25] and, e.g., [26]. Assuming further that the DFT coefficients within the same critical band are identically distributed (that is, the speech power spectral density is constant within the band), then $A$ given in Eq. (1) is a (scaled) chi-distributed random variable with $k$ degrees of freedom, where $k$ is twice the number of DFT coefficients in the band. To derive an expression for $h(A)$, note first that in the special case when the real and imaginary parts of the DFT coefficients are zero-mean, unit-variance Gaussians, then the corresponding critical-band amplitude, $A_0$, has an expected value of

$$\mu_k = \sqrt{2}\, \frac{\Gamma\big((k+1)/2\big)}{\Gamma(k/2)},$$

a variance of

$$\sigma^2_k = k - \mu_k^2,$$

and a differential entropy given by [19, Table 16.1]

$$h(A_0) = \log \Gamma(k/2) + \tfrac{1}{2}\Big(k - \log 2 - (k-1)\,\psi(k/2)\Big),$$

where $\Gamma(\cdot)$ and $\psi(\cdot)$ denote the Gamma and the digamma function, respectively. In the general case, where the real and imaginary parts of the DFT coefficients are not unit-variance, the differential entropy of the corresponding critical-band amplitude is

$$h(A) = h(A_0) - \tfrac{1}{2} \log \sigma^2_k + \tfrac{1}{2} \log \sigma^2_A, \tag{11}$$

where we used the fact that $h(cZ) = h(Z) + \log |c|$ [19] for any random variable $Z$ and constant $c$. Thus, the differential entropy $h(A)$ is a simple function of the variance $\sigma^2_A$ of the critical-band amplitudes, because the two first terms in Eq. (11) are functions only of the number of degrees of freedom and can therefore be computed offline.

Fig. 2 plots $h(A)$ as a function of the number of degrees of freedom $k$ for the case where the critical-band amplitude variance is unity, $\sigma^2_A = 1$. For comparison, the differential entropy of a unit-variance Gaussian, $\tfrac{1}{2}\log(2\pi e)$, is included. Clearly, $h(A)$ is close to the upper bound except for the lowest frequency critical bands, where $k$ is small. It can be argued that estimation of $h(A)$ in the present context is not crucial. Specifically, if the main interest is to determine how a given type of processing leading to the processed signal $R_1$ compares to another type of processing leading to a processed signal $R_2$ in terms of intelligibility, then we are interested in determining whether $h(A) - h(A|R_1) > h(A) - h(A|R_2)$. As $h(A)$ appears on both sides of this inequality sign, the exact value of $h(A)$ becomes irrelevant. Inserting Eq. (11) in the lmmse-based lower bound in Eq. (10), we find the following lower bound on mutual information

$$I_{\mathrm{lmmse}} = \max\!\left\{0,\; h(A_0) - \tfrac{1}{2}\log \sigma^2_k - \tfrac{1}{2}\log(2\pi e) + \tfrac{1}{2}\log\frac{\sigma^2_A}{\sigma^2_{\mathrm{lmmse}}}\right\} \text{ nats}. \tag{12}$$
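The statement that the entropy of a unit-variance chi-distributed amplitude lies just below the Gaussian entropy can be checked numerically. The sketch below assumes the standard chi-distribution formulas (unit-variance Gaussian components, k degrees of freedom) and uses a simple recurrence-plus-series approximation of the digamma function; the function names are hypothetical.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:                      # shift the argument upwards
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def chi_entropy(k):
    """Differential entropy (nats) of a chi variable with k degrees of freedom."""
    return math.lgamma(k / 2.0) + 0.5 * (k - math.log(2.0) - (k - 1) * digamma(k / 2.0))


def chi_variance(k):
    """Variance of a chi variable with k degrees of freedom."""
    mean = math.sqrt(2.0) * math.exp(math.lgamma((k + 1) / 2.0) - math.lgamma(k / 2.0))
    return k - mean * mean


def unit_variance_chi_entropy(k):
    """Entropy after rescaling the chi variable to unit variance."""
    return chi_entropy(k) - 0.5 * math.log(chi_variance(k))


GAUSSIAN_ENTROPY = 0.5 * math.log(2.0 * math.pi * math.e)   # unit-variance Gaussian


def entropy_gap(k):
    """How far the unit-variance chi entropy lies below the Gaussian entropy."""
    return GAUSSIAN_ENTROPY - unit_variance_chi_entropy(k)
```

In this sketch, `entropy_gap(k)` is the negative of the sum of the terms that depend only on the number of degrees of freedom, so it can be tabulated once per critical band.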

C. Observations
Remark 1: In the special case when $A$ and $R$ are statistically independent, then $\sigma_{AR} = 0$, $\sigma^2_{\mathrm{lmmse}} = \sigma^2_A$, and the last term in Eq. (12) is zero. In this case, the sum of the first three terms is negative (the difference between the solid and the dashed curve in Fig. 2), but the max-operator ensures $I_{\mathrm{lmmse}} = 0$, as expected.
Remark 2: The expression $I_{\mathrm{lmmse}}$ is scaling-invariant, that is, $I_{\mathrm{lmmse}}(A; cR) = I_{\mathrm{lmmse}}(A; R)$ for any non-zero constant $c$. This is typically a desirable property, e.g., when the processing leading to $R$ (which may be unknown) reduces the general signal level significantly. However, the proposed model does not take masking effects into account: if a given spectral component is suppressed below the masking threshold, either due to spread of masking or the threshold in quiet, it would still (erroneously) contribute positively to speech intelligibility according to the proposed model. For the processing types considered in the simulation experiment, however, this potential weakness does not appear to play a big role.
Remark 3: From Eq. (12) it follows that

$$I_{\mathrm{lmmse}} \approx \tfrac{1}{2} \log \frac{\sigma^2_A}{\sigma^2_{\mathrm{lmmse}}}, \tag{13}$$

because the sum of the first three terms is close to 0. Since the denominator $\sigma^2_{\mathrm{lmmse}}$ is an mse arising from estimating a clean quantity $A$ based on a noisy/processed observation $R$, the ratio $\sigma^2_A / \sigma^2_{\mathrm{lmmse}}$ may be recognized as an "SNR-type" of measure [18]. In fact, it closely resembles the frequency-weighted segmental SNR (fwSNRseg) with constant band-importance functions, see [18, p. 504], [31]. However, whereas fwSNRseg in this case would be interpreted as a predictor of the quality of a noisy signal enhanced by an lmmse-estimator [31], the developed theory suggests that it can be interpreted differently: it characterizes the intelligibility of the noisy/processed signal $R$, not the quality of the enhanced signal $\hat{A}_{\mathrm{lmmse}}$.
Remark 4: As a consequence of Remark 2, linear processing of noisy critical-band amplitudes cannot improve intelligibility beyond that of the underlying noisy signal. To extend the range of this statement further, recall that we introduced the linear estimator in Eq. (8) only for ease of estimation. The argument can be repeated with the generally non-linear estimator from Eq. (4), and the conclusion is that no processing of noisy critical-band amplitudes, linear or otherwise, allows intelligibility improvements. This prediction is in line with the results by Loizou [32] and Taal et al. [15], who showed that single-channel noise reduction systems in general provide no or very modest intelligibility improvements.
Remark 5: The proposed intelligibility predictor shows similarities to the STOI measure [17]. Let $\rho = \sigma_{AR} / (\sigma_A \sigma_R)$ denote the linear correlation coefficient. Inserting Eq. (9) in Eq. (13) and using this expression for $\rho$, we find

$$I_{\mathrm{lmmse}} \approx -\tfrac{1}{2} \log\!\left(1 - \rho^2\right). \tag{14}$$

The proposed intelligibility predictor averages these values of $-\tfrac{1}{2}\log(1 - \rho^2)$ across speech-active time-frequency units (see Eq. (15) in the next section) and uses this average as a predictor of intelligibility. STOI, on the other hand, computes the average of $\rho$ across speech-active time-frequency units. In this way, the proposed intelligibility predictor and STOI show strong similarities; the main difference to STOI appears to be the non-linearity applied to $\rho$ before averaging. STOI is mainly heuristically motivated (for example, computing the linear correlation coefficient for speech-active time-frequency units with a subsequent averaging operation is not linked to any underlying theoretical reasoning). However, Eq. (14) is a consequence of the assumed signal model and auditory model and can in this sense be seen to offer some theoretical justification of the less well-motivated choices made in STOI.
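The difference between averaging correlation coefficients directly and averaging a non-linearly mapped version of them, as discussed in Remark 5, can be made concrete with a short Python sketch. It assumes the logarithmic mapping of Eq. (14) with natural logarithms; the function names and the example correlation values are hypothetical.

```python
import math


def info_from_correlation(rho):
    """Approximate information (nats) for correlation rho: -0.5 * ln(1 - rho^2)."""
    if abs(rho) >= 1.0:
        return math.inf          # perfectly correlated amplitudes: unbounded
    return -0.5 * math.log(1.0 - rho * rho)


def mean_information(rhos):
    """Average of the non-linearly mapped correlations (proposed-style score)."""
    return sum(info_from_correlation(r) for r in rhos) / len(rhos)


def mean_correlation(rhos):
    """Plain average of the correlations (STOI-style score)."""
    return sum(rhos) / len(rhos)


# Hypothetical correlation coefficients for three time-frequency units.
rhos = [0.2, 0.5, 0.9]
scores = (mean_correlation(rhos), mean_information(rhos))
```

The mapping grows without bound as the correlation approaches one, so a few highly correlated units can dominate the average unless the per-unit contribution is limited.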
Remark 6: The proposed intelligibility predictor does not make use of band-importance functions. As above, this is not a choice, but rather a consequence of the model. Although many existing predictors do make use of band-importance functions, e.g., [6], [33], both the proposed predictor and STOI appear to work quite well without them.

IV. IMPLEMENTATION
Our implementation shows similarities to the STOI model described in [17]. Signals are resampled to a sampling frequency of 10 kHz, to ensure that the frequency region relevant for speech intelligibility is covered [2]. Signals are divided into overlapping frames, and a Hann analysis window is applied before the DFT is computed. DFT coefficients are grouped into one-third octave bands, with the center frequency of the lowest band set to 150 Hz, and the center frequency of the highest band set to approximately 4.3 kHz.
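The grouping of DFT coefficients into one-third octave bands and the computation of a critical-band amplitude can be sketched as follows. The DFT order (512), the number of bands (15), and the band-edge convention (center frequency times 2^(±1/6)) are assumptions made for this illustration only; the function names are hypothetical as well.

```python
import math


def third_octave_bands(fs=10000, nfft=512, num_bands=15, lowest_center_hz=150.0):
    """Return (center_frequency_hz, dft_bin_indices) for each one-third octave band.

    fs is the sampling frequency stated in the text; nfft and num_bands are
    assumed values.
    """
    bin_hz = fs / nfft
    bands = []
    for j in range(num_bands):
        center = lowest_center_hz * 2.0 ** (j / 3.0)
        low_edge = center * 2.0 ** (-1.0 / 6.0)
        high_edge = center * 2.0 ** (1.0 / 6.0)
        first_bin = round(low_edge / bin_hz)
        last_bin = max(round(high_edge / bin_hz), first_bin + 1)   # exclusive
        bands.append((center, list(range(first_bin, last_bin))))
    return bands


def critical_band_amplitude(dft_frame, bins):
    """Square root of the summed squared DFT magnitudes in one band."""
    return math.sqrt(sum(abs(dft_frame[k]) ** 2 for k in bins))


bands = third_octave_bands()
```

With these assumed values the lowest band contains only two DFT coefficients, which is the situation in which the chi-distributed amplitude deviates most from a Gaussian.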
The VAD blocks in Fig. 1 are implemented by identifying signal frames whose energy lies within a fixed range (in dB) of the energy of the signal frame with maximum energy. The indices of these signal frames are collected in the index sets $\mathcal{M}_A$ and $\mathcal{M}_R$ for the clean and noisy/processed signals, respectively.
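An energy-based frame selection of this kind can be sketched in a few lines of Python. The 40 dB range used as the default below is an assumed placeholder and not a value taken from this paper; the function name is hypothetical.

```python
def active_frames(frame_energies, range_db=40.0):
    """Indices of frames whose energy is within range_db of the strongest frame.

    range_db is an assumed placeholder value.
    """
    peak = max(frame_energies)
    if peak <= 0.0:
        return set()                                  # no energy at all
    threshold = peak * 10.0 ** (-range_db / 10.0)     # dB range to energy ratio
    return {m for m, energy in enumerate(frame_energies) if energy >= threshold}


# Hypothetical per-frame energies for a clean and a processed signal.
clean_active = active_frames([1.0, 1e-5, 0.5, 1e-3])
processed_active = active_frames([0.8, 0.7, 1e-6, 0.2])
both_active = clean_active & processed_active
```

Frames that are active in the clean signal but not in the processed signal are dropped from the intersection, which is how over-suppressed frames end up contributing zero information.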
Let $a_j(m)$ and $r_j(m)$ denote the critical-band amplitudes in band $j$ with frame indices $m$. The first and second moments needed to evaluate $I_{\mathrm{lmmse}}$ via Eqs. (12) and (9) are estimated using first-order recursive smoothing, i.e.,

$$\hat{\mu}_{A_j}(m) = \alpha\, \hat{\mu}_{A_j}(m-1) + (1 - \alpha)\, a_j(m),$$

and similarly for the other moments, where $\alpha$ denotes the smoothing constant. Let $\hat{I}_{\mathrm{lmmse}}\big(A_j(m); R_j(m)\big)$ denote the estimate of $I_{\mathrm{lmmse}}$ obtained by replacing expected values by recursively estimated moments. The average per sentence mutual information is finally computed as

$$\bar{I} = \frac{1}{|\mathcal{M}_A|\, J} \sum_{j=1}^{J} \sum_{m \in \mathcal{M}_A \cap \mathcal{M}_R} \min\!\left\{\hat{I}_{\mathrm{lmmse}}\big(A_j(m); R_j(m)\big),\; I_{\max}\right\}. \tag{15}$$

We have introduced here an upper limit $I_{\max}$ on the information content per critical-band amplitude, so that the final information score is not dominated by a single high-information time-frequency unit. The idea of upper limiting the impact of a single time-frequency unit is not new; it has, e.g., been used in the SII intelligibility measure, where the estimated critical-band SNR is upper limited to 15 dB [6]. It can be motivated by the observation that at a sufficiently high SNR, a signal is perfectly intelligible, and increasing the SNR beyond this point cannot increase intelligibility further.
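The recursive moment estimation and the capped per-sentence average can be sketched as below. This is a simplified illustration under several assumptions: the smoothing constant (0.9) and the upper limit (2.0 nats) are placeholder values, the moments are initialized from the first frame, the term that depends only on the degrees of freedom is passed in as `offset` (zero by default, corresponding to the approximation in Remark 3), and all function names are hypothetical.

```python
import math


def smooth(previous, new, alpha):
    """First-order recursive smoothing of a running moment."""
    return alpha * previous + (1.0 - alpha) * new


def per_sentence_information(clean, processed, clean_active, processed_active,
                             alpha=0.9, i_max=2.0, offset=0.0):
    """Average capped information per speech-active critical-band amplitude.

    clean[j][m] and processed[j][m] hold amplitudes for band j and frame m.
    alpha, i_max and offset are assumed placeholder values.
    """
    num_bands = len(clean)
    both_active = clean_active & processed_active
    total = 0.0
    for a_band, r_band in zip(clean, processed):
        a0, r0 = a_band[0], r_band[0]
        # running estimates of E[a], E[r], E[a^2], E[r^2], E[a r]
        m_a, m_r, m_aa, m_rr, m_ar = a0, r0, a0 * a0, r0 * r0, a0 * r0
        for m, (a, r) in enumerate(zip(a_band, r_band)):
            m_a = smooth(m_a, a, alpha)
            m_r = smooth(m_r, r, alpha)
            m_aa = smooth(m_aa, a * a, alpha)
            m_rr = smooth(m_rr, r * r, alpha)
            m_ar = smooth(m_ar, a * r, alpha)
            var_a = m_aa - m_a * m_a
            var_r = m_rr - m_r * m_r
            cov = m_ar - m_a * m_r
            if m not in both_active or var_a <= 1e-12 or var_r <= 1e-12:
                continue                               # frame contributes zero
            mse = max(var_a - cov * cov / var_r, 1e-12)   # lmmse error variance
            info = max(0.0, offset + 0.5 * math.log(var_a / mse))
            total += min(info, i_max)                  # cap per-unit information
    return total / (len(clean_active) * num_bands)
```

Normalizing by the number of clean speech-active units, rather than by the number of units that survive the intersection, means that frames suppressed by the processing lower the score instead of being ignored.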
The values of the three parameters, i.e., the smoothing constant $\alpha$, the VAD threshold, and the upper limit $I_{\max}$, are summarized in Table I. The value of $\alpha$ corresponds to a time constant of 250 ms. The idea of averaging signal statistics over time spans such as 250 ms, which are longer than the 20-40 ms often used in speech processing applications, e.g., [18], but much shorter than the typical long-term statistics, e.g., as used in SII [6], is not new. For example, in the STOI model [17], statistics were computed across time spans of roughly 400 ms, and it was suggested that this time span could be linked to temporal integration processes taking place in the auditory system. Performance with the proposed model is not sensitive to the exact value of $\alpha$. The choice of VAD threshold is not controversial. For clean speech, most speech frames have an energy content larger than this threshold. The value of $I_{\max}$ (in nats) was determined heuristically, but prediction performance does not seem to be very sensitive with respect to the exact value of this parameter either.

V. SIMULATION RESULTS
In the following we evaluate the proposed intelligibility predictor using noisy speech signals processed with different time-frequency weighting strategies. We compare the performance of the proposed method with that of algorithms proposed in the literature. The sample rates mentioned below are those used in the listening experiments. When applying the proposed method, the signals are down- or upsampled to 10 kHz.

A. Signals and Processing Conditions
1) Additive Noise: The first set of signals is from the study described by Kjems in [34]. In this study, speech signals from the Dantale II sentence test [35] are contaminated by four different additive noise sources. The speech material consists of 150 five-word sentences spoken by a female Danish speaker. The noise sources are i) stationary speech-shaped noise created by filtering white noise through a shaping filter with a frequency response corresponding to the long-term spectrum of the speech sentences, ii) car cabin noise recorded in a car driving on the highway, iii) bottle hall noise, and iv) cafeteria noise, which is a recording of a conversation in Danish between a male and a female speaker, i.e., two-talker babble, equalized to have the same long-term spectrum as the test sentences [34]. The sample rate is 20 kHz.
Kjems conducted a listening test to establish the 20% and 80% speech reception thresholds (SRTs) for each noise source, and a logistic function was fitted to the SRTs to estimate the underlying psychometric function. Finally, we generated noisy test signals with SNRs from -20 dB to 5 dB in steps of 2.5 dB, and the corresponding intelligibility scores were established by sampling the psychometric functions at these input SNR values. The total number of conditions therefore amounts to 4 noise types x 11 SNRs = 44 conditions. Fifteen normal-hearing subjects in the age range 25-52 years participated in the test.
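The step of fitting a logistic psychometric function to two SRT points and sampling it on an SNR grid can be sketched as follows. The two-parameter logistic form, the function names, and the example SRT values of -9 dB and -5 dB are assumptions for illustration; they are not the values measured in [34].

```python
import math


def logistic_from_srts(srt20_db, srt80_db):
    """Two-parameter logistic passing through the 20% and 80% SRT points."""
    midpoint = 0.5 * (srt20_db + srt80_db)                  # 50% point
    scale = (srt80_db - srt20_db) / (2.0 * math.log(4.0))   # logit(0.8) = ln 4

    def psychometric(snr_db):
        return 1.0 / (1.0 + math.exp(-(snr_db - midpoint) / scale))

    return psychometric


# Eleven SNR values in 2.5 dB steps ending at 5 dB.
snr_grid_db = [5.0 - 2.5 * i for i in range(10, -1, -1)]

# Hypothetical SRT values, for illustration only.
psychometric = logistic_from_srts(-9.0, -5.0)
intelligibility = [psychometric(snr) for snr in snr_grid_db]
```

Two points determine the two parameters exactly, so no iterative fitting is needed in this simplified setting.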
2) Ideal Binary Mask Signals: In a second experiment, Kjems [34] processed the noisy signals using the technique of ideal time-frequency segregation (ITFS) [36], and measured the intelligibility of the resulting signals for different processing conditions. More specifically, the ITFS processing decomposes the clean signal, the noise signal, and the noisy signal into time-frequency tiles. In the implementation of Kjems, the time-domain signals were analyzed in the short-term spectral domain using a gammatone filter bank with 64 channels, each with a bandwidth of 1 ERB, and with channel center frequencies linearly spaced on the ERB scale between 55 and 7500 Hz. The filterbank signals were segmented into 20-ms windowed frames with an overlap of 50%, and the energies $E_S(j,m)$, $E_N(j,m)$, and $E_X(j,m)$ of the $j$th subband and $m$th frame were computed for the clean, noise, and noisy signal, respectively. Then, a binary-mask value was computed for each time-frequency unit. Finally, the resulting binary mask signal was upsampled to the signal sample rate of 20 kHz and point-wise multiplied with the noisy filterbank output, and the result was passed through a gammatone synthesis filter bank to synthesize the corresponding processed time-domain signal.
Two methods for deriving the binary mask signal were compared. In the ideal binary mask (IBM) method the binary mask signal $B(j,m)$ was found by comparing the local target-to-noise ratio to a threshold LC according to

$$B(j,m) = \begin{cases} 1 & \text{if } 10\log_{10}\dfrac{E_S(j,m)}{E_N(j,m)} > \mathrm{LC}, \\ 0 & \text{otherwise}, \end{cases} \tag{16}$$

where $E_S(j,m)$ and $E_N(j,m)$ denote the clean and noise energies of the $j$th subband and $m$th frame. In the target binary mask (TBM) method, the binary mask signal was found by replacing the local time-frequency noise energy with the value of the long-term speech spectrum, evaluated in the $j$th gammatone filter. For both the IBM and TBM methods, the sparsity of the binary mask is a function of the threshold LC: the higher the value of LC, the fewer 1's in the binary mask. Eight values of LC and three input SNRs were used: two SNRs were selected corresponding to the 20% and 50% SRT, and the third SNR was held fixed. The total number of test conditions was (4 noise types (IBM) + 3 noise types (TBM)) x 8 LCs x 3 SNRs = 168. As above, 15 normal-hearing listeners participated in the intelligibility test, and the sampling rate was 20 kHz.
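The two mask definitions can be sketched in Python as follows. The sketch assumes that per-unit energies are already available as nested lists indexed by subband and frame, and that a unit is set to 1 when the local ratio strictly exceeds LC; the function names and example values are hypothetical.

```python
def ideal_binary_mask(target_energy, noise_energy, lc_db):
    """1 where the local target-to-noise ratio exceeds lc_db, else 0.

    The comparison is done in the linear domain to avoid log(0).
    """
    ratio = 10.0 ** (lc_db / 10.0)
    return [[1 if t > n * ratio else 0 for t, n in zip(t_row, n_row)]
            for t_row, n_row in zip(target_energy, noise_energy)]


def target_binary_mask(target_energy, long_term_speech_energy, lc_db):
    """Like the IBM, but the reference is the long-term speech energy per subband."""
    ratio = 10.0 ** (lc_db / 10.0)
    return [[1 if t > reference * ratio else 0 for t in t_row]
            for t_row, reference in zip(target_energy, long_term_speech_energy)]


# Hypothetical energies for two subbands and two frames.
target = [[10.0, 1.0], [0.1, 5.0]]
noise = [[1.0, 1.0], [1.0, 1.0]]
mask_lc0 = ideal_binary_mask(target, noise, 0.0)
mask_lc8 = ideal_binary_mask(target, noise, 8.0)
```

Raising the threshold from 0 dB to 8 dB removes one of the two retained units in this example, illustrating how a higher LC yields a sparser mask.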
3) Single-Channel Enhancement Signals: The last set of signals consists of noisy signals processed by three single-channel noise reduction algorithms [37]. In contrast to the ITFS processing, which cannot be realized in practice because it requires knowledge of the clean speech signal and the noise signal realizations separately, the study in [37] considered single-channel noise reduction methods which can be realized in practice. We included this data set because it is important to understand if a given intelligibility predictor is limited to synthetic signals, or if it can actually be applied to signals generated by practical applications.
Noisy signals were processed using a DFT-based analysis-modification-synthesis framework. Three processing methods were considered: two methods for finding a binary mask as in Eq. (16) but using only the noisy speech signal, and, in addition, a state-of-the-art single-channel noise reduction method, where the gain values applied to time-frequency units are not constrained to be binary (but are non-negative). Finally, the noisy unprocessed signals were included in the test as well, leading to four processing conditions in total.
The listening test was a closed Dutch speech-in-noise intelligibility test proposed in [38]. As in the previous section, the test sentences consisted of 5 words, spoken by a female speaker. The signals were sampled at a rate of 8 kHz, and degraded by speech-shaped noise at five different SNRs, namely 8, 6, 4, 2 and 0 dB. Thirteen native Dutch speaking subjects participated in the test. Each processing condition was presented five times, and each sentence was used only once. The order of presenting the different algorithms and SNRs was randomized. The signals were presented diotically through headphones (Sennheiser HD 600). For each processing method and SNR pair, the intelligibility scores were averaged across the 13 listeners and the 5 repetitions, leading to 4 conditions × 5 SNRs = 20 processing conditions for which average intelligibility scores are computed.

B. Per Sentence Mutual Information vs Intelligibility
For any good intelligibility predictor, the intelligibility-vs-prediction curve is monotonically increasing. Clearly, Fig. 3 shows a strong monotonic relation between speech intelligibility and mutual information, for all three test conditions. With the proposed intelligibility predictor, one is able to predict relative intelligibility: if the predictor outcome is larger for one noisy/processed signal than for another (where the underlying clean speech signal is the same), we would expect the intelligibility of the first to be larger than that of the second.
However, in order to estimate absolute intelligibility, a mapping is needed between the outcome of the intelligibility predictor and the true underlying intelligibility. This mapping is a function of many factors, including the noise type, the test type, the processing applied to the noisy signal, the redundancy of the speech material, and, obviously, the intelligibility predictor.
For additive noise experiments, the psychometric curve is often modeled as a logistic function of the input SNR. In [17], [39] it was proposed to extend the use of this function to model the relationship between the outcome of the intelligibility predictor and the true underlying intelligibility, i.e., the intelligibility is modeled as a logistic function of the predictor outcome, whose test-specific model parameters are estimated to fit the intelligibility data. In Fig. 3, the parameters were estimated to fit the data in each subfigure in a least-squares sense; the resulting logistic function is overlaid on each subfigure.
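A minimal sketch of such a mapping is given below. The parametrization $f(d) = 100 / (1 + \exp(a d + b))$ with two free parameters $a$ and $b$ is an assumption made for illustration, as are the function names. Note also that the fit shown here is a simple linear regression on logit-transformed scores, not the least-squares fit in the intelligibility domain described above; it is used only because it needs no optimization library.

```python
import math

def logistic(d, a, b):
    """Map a predictor outcome d to an intelligibility score in percent.
    Assumed parametrization: 100 / (1 + exp(a*d + b))."""
    return 100.0 / (1.0 + math.exp(a * d + b))

def fit_logistic_logit(d_values, scores, eps=0.5):
    """Estimate (a, b) by linear regression of log(100/s - 1) on d.
    Scores are clipped to [eps, 100 - eps] so the logit stays finite."""
    ys = []
    for s in scores:
        s = min(max(s, eps), 100.0 - eps)
        ys.append(math.log(100.0 / s - 1.0))
    n = len(d_values)
    d_mean = sum(d_values) / n
    y_mean = sum(ys) / n
    sxx = sum((d - d_mean) ** 2 for d in d_values)
    sxy = sum((d - d_mean) * (y - y_mean) for d, y in zip(d_values, ys))
    a = sxy / sxx
    b = y_mean - a * d_mean
    return a, b

# Synthetic example: scores generated from known parameters (made-up values).
d_values = [0.1, 0.3, 0.5, 0.7, 0.9]
scores = [logistic(d, -8.0, 3.0) for d in d_values]
a_hat, b_hat = fit_logistic_logit(d_values, scores)
```

With this parametrization, a negative `a` gives a mapping that increases with the predictor outcome, as required of a monotonic intelligibility-vs-prediction curve.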
To evaluate numerically the performance of intelligibility predictors, we use two figures of merit, namely the normalized linear correlation coefficient $\rho$ between the average intelligibility scores obtained through listening tests and the outcomes of the intelligibility predictors, and the root mean-square prediction error $\sigma$. Let $\hat{s}_i$ denote the intelligibility prediction for the $i$th processing condition, and let $s_i$ denote the average across listeners in the corresponding intelligibility test. Furthermore, let $\bar{\hat{s}}$ and $\bar{s}$ denote the averages of $\hat{s}_i$ and $s_i$, respectively, across listening test conditions $i$, and let $S$ denote the number of test conditions. The normalized linear correlation coefficient is then defined as

$$\rho = \frac{\sum_{i=1}^{S} (s_i - \bar{s})(\hat{s}_i - \bar{\hat{s}})}{\sqrt{\sum_{i=1}^{S} (s_i - \bar{s})^2 \, \sum_{i=1}^{S} (\hat{s}_i - \bar{\hat{s}})^2}} \tag{17}$$

and the root mean-square prediction error is defined as

$$\sigma = \sqrt{\frac{1}{S} \sum_{i=1}^{S} (s_i - \hat{s}_i)^2}. \tag{18}$$
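The two figures of merit can be computed directly from paired lists of predicted and measured scores. The following sketch uses the standard definitions of the linear (Pearson) correlation coefficient and the root mean-square error; the function names are illustrative.

```python
import math

def correlation_coefficient(predicted, measured):
    """Normalized linear correlation between predictions and measurements."""
    n = len(predicted)
    p_mean = sum(predicted) / n
    m_mean = sum(measured) / n
    cov = sum((p - p_mean) * (m - m_mean) for p, m in zip(predicted, measured))
    p_var = sum((p - p_mean) ** 2 for p in predicted)
    m_var = sum((m - m_mean) ** 2 for m in measured)
    return cov / math.sqrt(p_var * m_var)

def rms_prediction_error(predicted, measured):
    """Root mean-square difference between measurements and predictions."""
    n = len(predicted)
    return math.sqrt(sum((m - p) ** 2 for p, m in zip(predicted, measured)) / n)

# Toy scores in percent (made-up numbers): a constant offset of 10 points.
predicted = [10.0, 50.0, 90.0]
measured = [20.0, 60.0, 100.0]
rho = correlation_coefficient(predicted, measured)
sigma = rms_prediction_error(predicted, measured)
```

The toy example shows why both measures are reported: a constant offset leaves the correlation at its maximum while the prediction error is clearly non-zero.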

C. Comparison to Other Intelligibility Predictors
In this section we compare the performance of the proposed intelligibility predictor, which will be abbreviated as SIMI (Speech Intelligibility prediction based on Mutual Information), to several methods from the literature, see Table II. Specifically, we consider STOI [17], CSII-MID (the mid-level coherence SII as proposed in [8]), CSII-BIF (the coherence SII with signal-dependent band importance functions, as described in [33]), STI-NCM (the normalized covariance speech transmission index as proposed in [12]), STI-NCM-BIF (the normalized covariance speech transmission index with signal-dependent band-importance functions, as described in [33]), and NSEC (the Normalized Subband Envelope Correlation method as proposed in [40]). We evaluate the performance of these intelligibility predictors in terms of the correlation coefficient and the root mean-square prediction error, computed from a $K$-fold cross-validation procedure. Specifically, each data set is randomly divided into $K$ equal-size subsets, the free parameters in the logistic function are fitted to $K-1$ of the subsets, and the two performance measures are then computed based on prediction of the remaining subset. This procedure is repeated for each subset, and the averages of the resulting correlation coefficients and prediction errors are computed. Tables III and IV summarize these results. The values in brackets are found by estimating the logistic parameters using the entire data set, and computing the performance measures from the same set; these are included here for comparison with values reported in the literature, which are often computed in this way, e.g., [17].
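The cross-validation procedure can be sketched as follows. As before, the logistic parametrization $100 / (1 + \exp(a d + b))$, the logit-regression fit (used in place of a least-squares fit in the intelligibility domain), and all function names are assumptions made for illustration; the data are synthetic.

```python
import math
import random

def logistic(d, a, b):
    # Assumed parametrization of the predictor-to-intelligibility mapping.
    return 100.0 / (1.0 + math.exp(a * d + b))

def fit_logit(ds, ss, eps=0.5):
    # Linear regression on logit-transformed scores (illustrative fit).
    ys = [math.log(100.0 / min(max(s, eps), 100.0 - eps) - 1.0) for s in ss]
    n = len(ds)
    dm, ym = sum(ds) / n, sum(ys) / n
    a = (sum((d - dm) * (y - ym) for d, y in zip(ds, ys))
         / sum((d - dm) ** 2 for d in ds))
    return a, ym - a * dm

def pearson(x, y):
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    cov = sum((u - xm) * (v - ym) for u, v in zip(x, y))
    return cov / math.sqrt(sum((u - xm) ** 2 for u in x)
                           * sum((v - ym) ** 2 for v in y))

def rmse(x, y):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(x, y)) / len(x))

def k_fold_scores(d_values, scores, k, seed=0):
    """Average correlation and RMS error over k held-out subsets."""
    idx = list(range(len(d_values)))
    random.Random(seed).shuffle(idx)          # random division of the data set
    folds = [idx[i::k] for i in range(k)]     # equal sizes if k divides len
    rhos, sigmas = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_logit([d_values[i] for i in train],
                         [scores[i] for i in train])
        pred = [logistic(d_values[i], a, b) for i in held_out]
        meas = [scores[i] for i in held_out]
        rhos.append(pearson(pred, meas))
        sigmas.append(rmse(pred, meas))
    return sum(rhos) / k, sum(sigmas) / k

# Synthetic, noise-free data: 20 conditions generated from known parameters.
d_values = [0.1 + 0.04 * i for i in range(20)]
scores = [logistic(d, -8.0, 3.0) for d in d_values]
mean_rho, mean_sigma = k_fold_scores(d_values, scores, k=5)
```

Because the mapping parameters are never fitted on the held-out subset, the averaged scores are less optimistic than those obtained by fitting and evaluating on the same data, which is the distinction between the cross-validated values and the bracketed values.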
From Tables III and IV it can be seen that most intelligibility predictors work well in the case of additive noise, leading to high correlation coefficients; STI-NCM-BIF is an exception, but it should be noted that this method was proposed in [33] for single-channel noise reduced speech. For the ITFS processed signals, only SIMI and STOI work well, both resulting in high linear correlation coefficients. Most other methods essentially fail in this situation. Similar results were reported in [41]. For the single-channel enhanced speech signals, most intelligibility predictors perform well. It is worth noting that STI-NCM-BIF works particularly well in this situation; this result is in quantitative agreement with the results reported in [33]. The results of Tables III and IV are also in general agreement with the results reported in [17], [41]. Note finally that SIMI and STOI are the only methods which work well for all conditions.

VI. CONCLUSION
Algorithms for estimating the outcome of intelligibility listening tests are of interest both because they can reduce the number of costly listening tests during algorithm development and because they have the potential to lead to new insights into the auditory system.
Historically, a wide range of intelligibility predictors have been proposed with varying validity domains, including additive (stationary or non-stationary) or convolutive noise types, and several types of signal processing, including filtering, clipping, etc. In this work we consider the situation of additive, but not necessarily stationary, noise sources, and non-linear processing which can generally be referred to as time-frequency weighting. This class of processing methods is quite broad and is, for example, used in single-channel noise reduction algorithms.
In this context, we pursue the hypothesis that intelligibility could be monotonically related to the Shannon information about the (unknown) clean critical-band envelopes, which can be learned by observing their noisy and potentially processed counterparts. We derive lower bounds for this mutual information, which turn out to be analytically tractable. Specifically, the information lower bound can be computed as a function of the minimum mean-square error (mmse) arising from estimating the clean critical-band amplitude from its noisy/processed counterpart.
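The connection between the mmse and a mutual information lower bound can be illustrated by a standard information-theoretic argument. The sketch below uses generic notation that is an assumption of this illustration ($X$ for a clean critical-band amplitude, $Y$ for its noisy/processed counterpart); the bound derived in the paper, Eq. (15), may take a different form.

```latex
\begin{align}
I(X;Y) &= h(X) - h(X \mid Y) \\
       &= h(X) - h\big(X - E[X \mid Y] \,\big|\, Y\big) \\
       &\ge h(X) - h\big(X - E[X \mid Y]\big) \\
       &\ge h(X) - \tfrac{1}{2} \log\!\big(2 \pi e \, \mathrm{mmse}\big),
\qquad \mathrm{mmse} = E\!\left[\big(X - E[X \mid Y]\big)^2\right].
\end{align}
```

The first inequality holds because conditioning cannot increase differential entropy, and the second because a Gaussian has the largest differential entropy among all distributions with a given variance. A smaller mmse therefore gives a larger lower bound on the mutual information.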
The proposed model has a number of surprising consequences. Traditionally, in speech signal processing the mse between a clean time-frequency unit and an estimate thereof has been linked to the speech quality resulting from the estimator in question. According to this paradigm, using an mmse estimator would lead to the highest speech quality. However, the proposed model suggests that it could be interpreted differently: the mmse could be viewed as an indicator of the intelligibility of the underlying noisy (and potentially processed) signal. Furthermore, it is interesting to note that whereas several of the intelligibility predictors proposed in the literature are heuristically motivated, the proposed mmse-based predictor is a consequence of a simple auditory model and signal model, and the assumption that mutual information can be used as a principle for comparing inner representations. Finally, the proposed model predicts that processing of noisy critical-band amplitudes (based on the noisy critical-band amplitudes only) cannot lead to intelligibility improvements, a prediction which is in line with several existing intelligibility test results, e.g., [15], [32].
Simulation experiments with the proposed method show that it is able to reliably estimate the average intelligibility of speech signals contaminated by stationary and non-stationary noise sources as well as time-frequency processed noisy speech. In a comparison with other intelligibility predictors from the literature, this performance was only equalled by the STOI intelligibility predictor [17].
It is of interest to study in the future whether the proposed principle is valid in a more general context than covered in this article. For example, the auditory model presented in this article is very simple; it would be of interest to study prediction performance if the proposed mutual information principle were combined with a more physiologically plausible model, e.g., the modulation filter based model of Dau et al. [42] and the intelligibility prediction model by Jørgensen et al. [13].
Another topic for future research is the extension of the proposed principle to situations which are more natural to human listeners, e.g., a binaural listening setup. It appears possible that phenomena such as the spatial unmasking effect, e.g., [43], may be predicted well by such an extended model.

Fig. 1. Proposed intelligibility prediction scheme. It is assumed that the critical-band amplitude envelopes of the clean speech signal contain all information relevant for speech intelligibility. The intelligibility of the noisy/processed signal is estimated as the average information in its critical-band envelopes about the clean amplitude envelopes.
signal, and the corresponding noisy/processed signal, respectively. Band pass filtered signals are obtained by dividing the time-domain input signals into successive, overlapping analysis frames, applying an analysis window, and transforming these time-domain frames to the frequency domain using a Discrete Fourier Transform. The resulting DFT coefficients are given by

Fig. 2. Differential entropy of a unit-variance chi-distributed variable as a function of degrees of freedom. The degrees of freedom of interest are marked on the curve. The dashed line indicates the differential entropy of a unit-variance Gaussian.
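The quantities in this comparison can be reproduced numerically from the closed-form entropy, mean and variance of the chi distribution. The sketch below is illustrative only: the function names are hypothetical, entropies are in nats, and the digamma function is approximated by a recurrence followed by an asymptotic series, since it is not part of the Python standard library.

```python
import math

def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:                 # shift the argument upwards
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += (math.log(x) - 0.5 / x
               - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))
    return result

def chi_entropy_unit_variance(k):
    """Differential entropy (nats) of a chi variable with k degrees of
    freedom after scaling it to unit variance."""
    half = 0.5 * k
    # Entropy of the standard chi distribution with k degrees of freedom.
    h = math.lgamma(half) + 0.5 * (k - math.log(2.0)
                                   - (k - 1.0) * digamma(half))
    mean = math.sqrt(2.0) * math.exp(math.lgamma(half + 0.5)
                                     - math.lgamma(half))
    variance = k - mean * mean
    # Scaling by 1/sqrt(variance) shifts the entropy by -0.5*log(variance).
    return h - 0.5 * math.log(variance)

# Differential entropy of a unit-variance Gaussian, for comparison.
GAUSSIAN_ENTROPY = 0.5 * math.log(2.0 * math.pi * math.e)

entropies = {k: chi_entropy_unit_variance(k) for k in (1, 2, 4, 8, 50)}
```

Since the Gaussian maximizes differential entropy for a given variance, every value in `entropies` lies below `GAUSSIAN_ENTROPY`, and the gap shrinks as the number of degrees of freedom grows.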

Fig. 3. Per-sentence mutual information versus measured intelligibility for various processing conditions. (a) Additive noise, (b) ITFS-processed noisy signals, and (c) noisy signals processed by single-channel noise reduction algorithms. See text for further details on test signals and conditions.

Fig. 3 plots the per-sentence mutual information lower bound as calculated by Eq. (15) versus the intelligibility measured in listening tests and averaged across test subjects. The free parameters of the model were chosen according to Table I for all subfigures.

TABLE I. PARAMETER VALUES IN PROPOSED MODEL.

TABLE II. INTELLIGIBILITY PREDICTORS FOR COMPARISON.

TABLE III. PERFORMANCE OF INTELLIGIBILITY PREDICTORS IN TERMS OF LINEAR CORRELATION COEFFICIENT ρ, EQ. (17), FOR DIFFERENT NOISE/PROCESSING CONDITIONS. THE PERFORMANCE SCORES ARE ESTIMATED USING K-FOLD CROSS-VALIDATION. PERFORMANCE SCORES IN BRACKETS ARE COMPUTED BY FITTING THE LOGISTIC FUNCTION TO THE ENTIRE DATA SET, AND COMPUTING THE RESULTING CORRELATION COEFFICIENT ACROSS THE SAME DATA SET.

TABLE IV. PERFORMANCE OF INTELLIGIBILITY PREDICTORS IN TERMS OF ROOT MEAN-SQUARE PREDICTION ERROR, EQ. (18), FOR DIFFERENT NOISE/PROCESSING CONDITIONS. THE PERFORMANCE SCORES ARE ESTIMATED USING K-FOLD CROSS-VALIDATION. PERFORMANCE SCORES IN BRACKETS ARE COMPUTED BY FITTING THE LOGISTIC FUNCTION TO THE ENTIRE DATA SET, AND COMPUTING THE RESULTING PREDICTION ERROR ACROSS THE SAME DATA SET.