Speech intelligibility prediction based on modulation frequency-selective processing

Speech intelligibility models can provide insights regarding the auditory processes involved in human speech perception and communication. One successful approach to modelling speech intelligibility has been based on the analysis of the amplitude modulations present in speech as well as in competing interferers. This review covers speech intelligibility models that include a modulation-frequency selective processing stage, i.e., a modulation filterbank, as part of their front end. The speech-based envelope power spectrum model [sEPSM; Jørgensen and Dau (2011). J. Acoust. Soc. Am. 130(3), 1475-1487], several variants of the sEPSM including modifications with respect to temporal resolution, spectro-temporal processing and binaural processing, as well as the speech-based computational auditory signal processing and perception model [sCASP; Relaño-Iborra et al. (2019). J. Acoust. Soc. Am. 146(5), 3306-3317], which is based on an established auditory signal detection and masking model, are discussed. The key processing stages of these models for the prediction of speech intelligibility across a variety of acoustic conditions are addressed in relation to competing modeling approaches. The strengths and weaknesses of the modulation-based analysis are outlined and perspectives presented, particularly in connection with the challenge of predicting the consequences of individual hearing loss on speech intelligibility. © 2022 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)


Introduction
Speech intelligibility is a fundamental element of human communication. Normal-hearing (NH) listeners typically communicate efficiently even in adverse acoustic conditions, whereas hearing-impaired (HI) listeners often report challenges in noisy and reverberant environments (e.g., Festen and Plomp, 1990; Peters et al., 1998; George et al., 2006; Strelcyk and Dau, 2009; Christiansen and Dau, 2012; for a review see Moore, 2003), even in the case of aided hearing through state-of-the-art hearing instruments (e.g., Christiansen et al., 2012; Johannesen et al., 2016). To enable the design of advanced compensation strategies that may mitigate the effects of hearing impairment on speech perception, it seems essential to understand the auditory mechanisms involved in the processing of speech and the cues that are crucial for speech intelligibility across acoustic conditions, both in the healthy and the impaired auditory system. Speech intelligibility models provide a convenient tool for evaluating such mechanisms. Most models aim to quantitatively predict listeners' performance, but they differ markedly in terms of the details and complexity of the auditory processes reflected in the front end as well as the 'decision making' in the back end. From the relatively simple metrics developed during the 20th century (e.g., the articulation index; French and Steinberg, 1947; Kryter, 1962; ANSI, 1969) to more recent data-driven approaches (e.g., Schädler et al., 2015; Spille et al., 2018), intelligibility models have substantially contributed to our current knowledge of the auditory (as well as non-auditory) processes involved in speech perception, and have inspired and supported the development of applications such as hearing-aid algorithms and speech enhancement strategies (e.g., Keidser et al., 2011; Kowalewski et al., 2020).
This review focuses on a particular set of computational models that assume a spectral decomposition of the temporal envelope, inspired by behavioral evidence of a modulation-frequency selective process in the auditory system (e.g., Houtgast, 1989; Bacon and Grantham, 1989; Dau et al., 1997a, 1999). The speech intelligibility models discussed in this contribution are all based on the principles first proposed by Jørgensen and Dau (2011) in their 'speech-based envelope power spectrum model' (sEPSM). The sEPSM combines a relatively simple linear peripheral model with a frequency-selective analysis in the amplitude modulation domain (i.e., a modulation filterbank) to predict speech intelligibility. In contrast to prior models based on the analysis of amplitude modulations, such as the speech transmission index (STI; Houtgast and Steeneken, 1971), which uses the modulation transfer function (MTF) as the intelligibility metric, the sEPSM uses the signal-to-noise ratio in the modulation domain (SNR env) to predict speech intelligibility scores.
Access to amplitude modulations or, more generally, envelope fluctuations, i.e., the slow temporal fluctuations of the acoustic signal, has been demonstrated to be critical for the comprehension of speech. Shannon et al. (1995) showed that artificial stimuli with severely degraded spectral content, but intact temporal envelopes, were still highly intelligible, emphasizing the importance of modulation cues for speech understanding. Drullman et al. (1994a, 1994b) demonstrated in a study with normal-hearing listeners how reducing the envelope modulation content in sentences, vowels and consonants reduced their intelligibility. Their findings were later replicated by Elliott and Theunissen (2009), who derived a "modulation transfer function for speech", showing that low-pass filtering the temporal envelopes of speech signals, such that fewer modulations were available to the listeners, markedly reduced the intelligibility of the stimuli.
Evidence for modulation-frequency selective processing in the auditory system was first provided by the work of Houtgast (1989). Using a modulation detection paradigm, in which a sinusoidal target modulation was embedded in a modulation masker centered around the target, Houtgast (1989) showed that modulation detection thresholds stabilized when the masker-modulation bandwidth was widened beyond a certain bandwidth, suggesting a bandpass filter-like processing in the modulation domain, conceptually similar to the auditory-filter-based processing in a corresponding masking paradigm in the audio-frequency domain (Fletcher, 1940; Moore and Patterson, 1986). Bacon and Grantham (1989) used modulation masking experiments to obtain modulation masking patterns that showed the highest thresholds when the center frequencies of the narrowband modulation maskers were near the (sinusoidal) target modulation frequency. The conceptualization of this frequency-selective processing as a modulation filterbank was first proposed by Fassel and Püschel (1993) and subsequently formalized as a signal processing model by Münker and Püschel (1993). Later, Dau et al. (1997a, 1997b) integrated an implementation of a modulation filterbank within a model of the auditory periphery and showed that this model could account for modulation detection thresholds for narrow-band noise carriers of different bandwidths and different center frequencies as well as for modulation-masking data using a harmonic tone-complex masker. The model outperformed classical low-pass models of modulation processing (e.g., Viemeister, 1979; Forrest and Green, 1987) that could account neither for the masking data nor for the effects of the carrier bandwidths. The Dau et al. (1997a, 1997b) models analyzed the internal representation at the output of the modulation filterbank stage by means of a cross-correlation between the processed stimuli and a supra-threshold template (Dau et al., 1996a, 1996b). This correlation-based metric was used by an ideal observer stage to predict the perceptual thresholds.
A slightly different approach was considered by Dau et al. (1999), who proposed the envelope power at the output of a hypothetical modulation filter as a metric to account for amplitude modulation detection thresholds for different types of noise carriers with several bandwidths. This concept was later formalized as the envelope power spectrum model (EPSM; Ewert and Dau, 2000). The EPSM did not consider any peripheral filtering nor any non-linear preprocessing, but simply applied a modulation bandpass filterbank to the extracted envelopes of the stimuli and calculated the ac-coupled power at the outputs of the modulation filters. The EPSM could account for amplitude modulation detection thresholds as well as modulation masking patterns and modulation transfer functions for carriers of different bandwidths. This approach outperformed low-pass filter models of modulation processing, further supporting the concept of frequency selectivity in the envelope domain. In a later study (Ewert et al., 2002), the EPSM framework was further conceptualized as the signal-to-noise ratio calculated at the output of a modulation filterbank and was used to account for masked-threshold patterns obtained using pure-tone maskers.
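The EPSM's core operation, extracting the Hilbert envelope and measuring the ac-coupled envelope power in one modulation band, can be sketched as follows. This is a minimal illustration, not the original Ewert and Dau (2000) implementation; the FFT-based band integration, function names and normalization are assumptions made here.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed via the FFT."""
    n = len(x)
    spec = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    return np.abs(np.fft.ifft(spec * h))

def envelope_power(x, fs, f_mod, q=1.0):
    """Envelope power integrated over a modulation band centered at f_mod
    with bandwidth f_mod / q, normalized by the squared mean (DC) envelope.
    The result is proportional to the EPSM's normalized envelope power."""
    env = hilbert_envelope(x)
    dc = np.mean(env)
    power_spec = np.abs(np.fft.rfft(env - dc)) ** 2   # ac-coupled envelope spectrum
    freqs = np.fft.rfftfreq(len(env), 1.0 / fs)
    bw = f_mod / q
    in_band = (freqs >= f_mod - bw / 2) & (freqs <= f_mod + bw / 2)
    return np.sum(power_spec[in_band]) / (len(env) ** 2 * dc ** 2)
```

For a noise carrier that is sinusoidally amplitude modulated at 8 Hz, this measure is expected to be much larger in the 8-Hz modulation band than in a band centered well away from the imposed modulation frequency.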
The EPSM is conceptually similar to the 'classical' power spectrum model of masking (PSM; Fletcher, 1940; Moore and Glasberg, 1987) operating in the audio-frequency domain. The PSM assumes that the auditory system is selective in the audio-frequency domain and defines the effectiveness of masking based on the power of the masker processed through the same auditory filter as the target signal. Likewise, the EPSM is based on the concept that the amount of modulation masking depends on the amount of masker modulation energy passing through the same modulation filter as the target modulation. While the PSM concept can be linked to the biophysical properties of the basilar membrane's frequency-to-place transformation, the physiological correlate of frequency-selective processing in the envelope domain, as proposed in the EPSM, is less clear, even though various physiologically inspired models have been presented that propose different mechanisms to establish envelope-frequency selective processing in the auditory brainstem and the primary auditory cortex (e.g., Langner and Schreiner, 1988; Nelson and Carney, 2004; Hewitt and Meddis, 1994; Dicke et al., 2007; Carney, 2018). Jørgensen and Dau (2011) combined the EPSM with a linear auditory pre-processing stage, a gammatone filterbank (Patterson et al., 1987), and applied this modeling framework to speech stimuli, extending its predictive power from modulation detection and masking conditions towards speech intelligibility conditions. The resulting model was termed the 'speech-based' envelope power spectrum model (sEPSM) and used the (speech) signal-to-noise ratio in the envelope domain (SNR env), in combination with an ideal observer back end, to predict speech reception thresholds (SRTs) for NH listeners in conditions with speech presented in stationary noise, reverberant speech and nonlinearly processed speech.
The sEPSM model was later extended to account for the effects of non-stationary maskers on speech intelligibility (i.e., to account for the improvement in speech intelligibility observed for fluctuating maskers, commonly termed 'masking release') by proposing a modulation-frequency dependent (i.e., 'multi-resolution') analysis of the temporal envelope of the stimuli (mr-sEPSM; Jørgensen et al., 2013). Other extensions to the model introduced a spectro-temporal processing stage to account for the effects of phase distortions on speech intelligibility (sEPSMx; Chabot-Leclerc et al., 2014); a correlation-based back end to account for phase distortions and the effects of non-linear noise reduction (sEPSMcorr; Relaño-Iborra et al., 2016); a PSM-based extension to account for masking data (mre-GPSM; Biberger and Ewert, 2016, 2017); and a binaural extension to account for binaural masking release in speech perception (Chabot-Leclerc et al., 2016). The aforementioned approaches all assumed a linear front end, a gammatone filterbank, while proposing different metrics and back-end approaches to account for a wider range of listening conditions than the original sEPSM.
It is well known that processing in the cochlea is highly nonlinear and level dependent (e.g., Rhode, 1971; Kemp, 1979; Robles et al., 1986), and it has been argued that cochlear nonlinearities, such as the compressive basilar-membrane input-output function, inner- and outer-hair-cell transduction processes, suppression effects and synaptic transmission, may critically affect the representation of sounds, including speech, at cochlear and retro-cochlear levels (e.g., Plack et al., 2004; Dubno et al., 2007; Horwitz et al., 2007; Johannesen et al., 2016). This should be particularly relevant when investigating the effects of hearing loss on speech intelligibility, since the most typical hearing loss is of sensorineural origin and arises in the cochlea. State-of-the-art nonlinear cochlear models, such as the auditory nerve model of Carney (1993) and more recent variants of this model (e.g., Zilany et al., 2014; Bruce et al., 2018) or the dual-resonance nonlinear filterbank (DRNL; Meddis et al., 2001; Lopez-Poveda and Meddis, 2001), have been considered as front ends of speech intelligibility models to account for effects of, e.g., sound pressure level and different types of hearing loss on intelligibility. For example, Scheidiger (2017) combined the sEPSM model with the auditory-nerve processing of Zilany et al. (2014), and Relaño-Iborra et al. (2019) combined the correlation-based sEPSMcorr with the non-linear preprocessing proposed by Jepsen and Dau (2011), based on the DRNL of Lopez-Poveda and Meddis (2001).
The present review provides a comprehensive overview of these modulation-based modeling frameworks and explores their capabilities and shortcomings by discussing which computational stages in the front end and back end of the models are crucial for the prediction of various datasets. Some perspectives are outlined, providing potential paths to further increase the predictive power of the currently available modeling frameworks.

The speech-based envelope power spectrum model
Jørgensen and Dau (2011) extended the EPSM (Ewert and Dau, 2000) from a model that was developed to predict data from modulation detection and masking experiments towards a 'speech-based' model, referred to as the sEPSM, that was intended to account for speech intelligibility data. This was achieved by coupling the EPSM framework to a model of peripheral auditory preprocessing. In both models, the SNR in the envelope domain, SNR env, was used as the decision metric. Below, the key elements of the most basic version of the sEPSM are described first. Next, more advanced versions of the sEPSM are described that were designed to cover a broader range of speech intelligibility conditions.

Intelligibility of speech in stationary noise: effects of reverberation and noise suppression
The sEPSM (Jørgensen and Dau, 2011) was designed to account for the intelligibility of speech in the presence of stationary noise by analyzing the modulation content of the noisy speech mixture and of the noise alone, assuming that the auditory system has 'access' to the noise-alone component and that the combination of noise and speech signals is additive. The essential assumption of the modeling approach was that the intelligibility of the speech in the noisy background would be affected by the intrinsic envelope fluctuations of the interfering noise signal (Stone et al., 2011, 2012).
The model structure is indicated in Fig. 1 (leftmost column). The sEPSM consists of a gammatone filterbank with filters centered between 63 Hz and 8 kHz with equivalent rectangular bandwidths as specified by Glasberg and Moore (1990) and a 1/3-octave spacing between the center frequencies, followed by envelope extraction using the Hilbert transform and a modulation filterbank consisting of a low-pass filter (f_mod,cut-off = 1 Hz) in parallel with six band-pass filters centered at modulation frequencies from 2 to 64 Hz with octave spacing and a Q-factor of 1. The ac-coupled power at the outputs of all auditory and modulation filters is calculated separately for the noisy speech and the noise alone. From these terms, the SNR env metric is calculated as:

$\mathrm{SNR}_{\mathrm{env},i,j} = \frac{P_{\mathrm{env},i,j,S+N} - P_{\mathrm{env},i,j,N}}{P_{\mathrm{env},i,j,N}}$ (1)

where $P_{\mathrm{env},i,j,S+N}$ represents the envelope power of the speech and noise mixture for the i-th auditory filter and the j-th modulation filter, and $P_{\mathrm{env},i,j,N}$ indicates the corresponding envelope power for the noise alone. The resulting multi-channel SNR env values are combined across all auditory and modulation frequency channels as the square root of the summed squared SNR env values, assuming that signal information is combined optimally across the processing channels (Green and Swets, 1966; Dau et al., 1997a, 1997b):

$\mathrm{SNR}_{\mathrm{env}} = \sqrt{\sum_{i}\sum_{j} \mathrm{SNR}_{\mathrm{env},i,j}^{2}}$ (2)

The final SNR env is converted into a sensitivity value of an ideal observer that transforms the model metric into speech intelligibility units (i.e., percentage of correct words or sentences). The model is fitted once to the speech material under consideration (e.g., open-set sentences, closed-set sentences, words). This calibration is assumed to account for aspects related to, e.g., speech complexity, clarity of the talker, vocabulary size and redundancy of the speech corpus of interest, but does not provide any information regarding the degradations/processing algorithms evaluated by the model. Details regarding the processing stages in the model and the mapping to speech intelligibility are described by Jørgensen and Dau (2011).
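The decision metric of Eqs. (1) and (2) can be sketched compactly, assuming the envelope powers have already been computed per auditory channel i and modulation channel j. The array shapes, names and the lower limit on the envelope powers are illustrative assumptions, not the documented sEPSM parameters.

```python
import numpy as np

def snr_env(p_env_mix, p_env_noise, floor_db=-30.0):
    """Per-channel envelope SNR, Eq. (1). A lower limit on the envelope
    powers (an assumption here) avoids division by zero and negative
    speech estimates."""
    floor = 10.0 ** (floor_db / 10.0)
    p_n = np.maximum(p_env_noise, floor)
    p_s = np.maximum(p_env_mix - p_n, floor)   # estimated speech envelope power
    return p_s / p_n

def combine(snr):
    """Optimal combination across auditory and modulation channels, Eq. (2):
    square root of the sum of squared per-channel SNRenv values."""
    return np.sqrt(np.sum(snr ** 2))
```

For example, if the mixture's envelope power is twice the noise-alone power in every one of 22 auditory x 7 modulation channels, each channel yields an SNRenv of 1 and the combined value is sqrt(154).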
The model was shown to account well for the intelligibility of speech in the presence of stationary noise. Furthermore, the model was successful in predicting the intelligibility of noisy speech distorted by reverberation and in conditions where the noisy speech was processed using spectral subtraction. Spectral subtraction is a noise-cancelling algorithm that can be detrimental to speech intelligibility even though it increases the SNR in the acoustic input domain. This phenomenon has been referred to as the 'noise reduction paradox' (e.g., Berouti et al., 1979; Ludvigsen, 1993), and it has been a challenge for most models of speech intelligibility to account for (Jørgensen and Dau, 2011; Relaño-Iborra et al., 2016). The left panel of Fig. 2 shows the sEPSM predictions (dark gray squares) as a function of the amount of noise reduction, reflected by the over-subtraction factor α, i.e., for higher over-subtraction factors, more noise power is subtracted from the mixture's power spectrum, as proposed by Berouti et al. (1979, Eq. (3)). The model accounts for the increase in the speech reception thresholds (SRTs) for spectrally subtracted noisy speech, as observed in the data (open squares). In contrast, the speech-based speech transmission index (sSTI; light gray squares; Payton and Braida, 1999), a well-established speech intelligibility model that also operates in the modulation domain, albeit using the modulation transfer function as a metric, predicts an improvement in speech intelligibility (i.e., a decrease of the model units) with increasing α. Within the sEPSM, the changes in the modulation power of the noise (alone) resulting from the noise-reduction processing ('musical noise') lead to a decrease of SNR env and thus an increase of the resulting predicted SRT.
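The kind of processing evaluated here, power spectral subtraction with an over-subtraction factor α, can be sketched as follows. This is a hedged illustration in the spirit of Berouti et al. (1979); the STFT framing, the spectral-floor parameter and all names are assumptions made here, not the exact algorithm used in the experiments.

```python
import numpy as np

def spectral_subtraction(noisy, noise_psd_est, alpha=2.0, beta=0.01,
                         nfft=512, hop=256):
    """Subtract alpha times the estimated noise power spectrum from each
    short-time frame, with a spectral floor beta * frame power that limits
    negative values (the residual fluctuations cause 'musical noise')."""
    window = np.hanning(nfft)
    out = np.zeros_like(noisy)
    for start in range(0, len(noisy) - nfft, hop):
        frame = noisy[start:start + nfft] * window
        spec = np.fft.rfft(frame)
        power = np.abs(spec) ** 2
        clean_power = np.maximum(power - alpha * noise_psd_est,
                                 beta * power)            # floored subtraction
        spec_hat = np.sqrt(clean_power) * np.exp(1j * np.angle(spec))
        out[start:start + nfft] += np.fft.irfft(spec_hat, nfft) * window
    return out
```

Raising α removes more noise power per frame but also distorts the remaining modulation content, which is precisely the property the sEPSM exploits to predict the elevated SRTs.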
In addition to accounting for the effects of spectral subtraction, the sEPSM was shown to account for the effects of reverberation on noisy speech (not shown here).The model captures the increase in SRTs resulting from increases in room reverberation ( Jørgensen and Dau, 2011 ).

Speech intelligibility in the presence of non-stationary maskers
While the sEPSM is able to account for the effects of reverberation and noise reduction on speech intelligibility in the conditions described above, the model fails to account for the listening advantage resulting from slow temporal fluctuations in non-stationary maskers (i.e., the effect known as 'release from masking'). The ability to exploit temporal regions where the power of the noise masker is decreased was implemented by modifying the SNR env calculation from a long-term analysis (i.e., where the entire signal was considered) to a short-term analysis, which was termed the 'multi-resolution' sEPSM (mr-sEPSM; Jørgensen et al., 2013), inspired by previous model approaches such as the extended speech intelligibility index (ESII; Rhebergen et al., 2006) or the 'glimpsing model' (Cooke, 2006).
As for the original sEPSM, the mr-sEPSM consists of a gammatone filterbank, envelope extraction and a modulation filterbank (see Fig. 1, second column). However, the mr-sEPSM contains two additional modulation filters centered at modulation frequencies of 128 and 256 Hz. For each auditory filter, only modulation filters with center frequencies up to one-fourth of the center frequency of that auditory filter are further processed (Verhey et al., 1999). The main conceptual difference between the sEPSM and the mr-sEPSM occurs in the back end, after modulation filtering. The outputs of each combination of auditory and modulation filter are analyzed using short-term temporal windows, with durations inversely proportional to the modulation center frequency (e.g., the output of the 2-Hz modulation filter is analyzed using a 500-ms window and the output of the 256-Hz modulation filter is analyzed using a 3.9-ms window), such that one SNR env value is obtained per time window in a given modulation and auditory filter. The SNR env values are then integrated across time:

$\overline{\mathrm{SNR}}_{\mathrm{env},i,j} = \frac{1}{N(j)}\sum_{n=1}^{N(j)} \mathrm{SNR}_{\mathrm{env},i,j}(n)$ (3)

where $\mathrm{SNR}_{\mathrm{env},i,j}(n)$ represents the SNR env calculated for the n-th time window of the i-th auditory filter and the j-th modulation filter, and N(j) is the total number of temporal windows for the j-th modulation filter. The time-averaged SNR env values are combined across auditory and modulation filters, following Eq. (2). The resulting metric is transformed into a sensitivity value and mapped to a speech intelligibility score by an ideal observer equivalent to that in the original sEPSM.

[Fig. 2 caption, partial: Left panel replotted from Jørgensen and Dau (2011); sSTI after Payton and Braida (1999). Right panel: SRTs for speech in the presence of stationary noise (SSN), 4-Hz amplitude-modulated speech-shaped noise (SAM) and the international speech test signal (ISTS; Holube et al., 2010); human data measured by Jørgensen et al. (2013).]
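The multi-resolution segmentation and the time averaging of Eq. (3) can be sketched as follows. The channel list, the power floor and the function names are illustrative assumptions; the window duration equals one period of the modulation center frequency, as in the examples above (500 ms at 2 Hz, about 3.9 ms at 256 Hz).

```python
import numpy as np

# Octave-spaced modulation channels assumed here (1-Hz low-pass plus band-pass filters)
MOD_CFS = [1, 2, 4, 8, 16, 32, 64, 128, 256]

def allowed_mod_channels(f_aud, mod_cfs=MOD_CFS):
    """Selection rule (Verhey et al., 1999): keep only modulation filters
    with center frequencies up to one-fourth of the auditory-filter
    center frequency f_aud."""
    return [f for f in mod_cfs if f <= f_aud / 4.0]

def window_length(f_mod, fs):
    """Window duration of one modulation period, in samples."""
    return max(int(round(fs / f_mod)), 1)

def multires_snr_env(p_mix_env, p_noise_env, fs, f_mod, floor=1e-3):
    """Split the envelope-power time series of one auditory/modulation
    channel into windows of one modulation period, compute one SNRenv per
    window (Eq. (1)) and average over the windows (Eq. (3))."""
    n = window_length(f_mod, fs)
    snrs = []
    for s in range(0, len(p_mix_env) - n + 1, n):
        p_n = max(np.mean(p_noise_env[s:s + n]), floor)
        p_s = max(np.mean(p_mix_env[s:s + n]) - p_n, floor)
        snrs.append(p_s / p_n)
    return np.mean(snrs)
```

During masker dips, the windowed noise envelope power drops, so individual windows yield large short-term SNRenv values that survive the averaging; this is how the framework captures 'glimpsing' in fluctuating maskers.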
The right panel of Fig. 2 shows SRT predictions for speech presented in three maskers: a stationary speech-shaped noise (SSN), a 4-Hz amplitude-modulated speech-shaped noise (SAM) and the international speech test signal (ISTS; Holube et al., 2010). The human data collected by Jørgensen et al. (2013), indicated as open squares, showed a decrease in SRT (i.e., an increase in intelligibility) for the non-stationary maskers. The mr-sEPSM (black diamonds) accounts for these changes in the SRT, as does the 'reference' model, the speech-based ESII (ESII-s; Meyer and Brand, 2013; light gray diamonds). As expected, the long-term sEPSM fails to predict the benefit of utilizing speech cues in temporally fluctuating maskers.
The mr-sEPSM was designed to be backwards compatible with the original sEPSM such that, in addition to accounting for the effects of non-stationary maskers, it still accounts for the effects of stationary noises, reverberation and spectrally subtracted noisy speech.

Effects of signal phase distortions on speech intelligibility
The SNR env, like other energy-based decision metrics such as the SII or the STI (ANSI, 1969, 1997; Steeneken and Houtgast, 1980; Pavlovic, 1987; Payton and Braida, 1999), does not take phase information of the input signals into account. Thus, any potential changes in intelligibility derived from phase distortions in the speech signal cannot be accounted for by the sEPSM or the mr-sEPSM. Elhilali et al. (2003) proposed a model, the spectro-temporal modulation index (STMI), that analyzes temporal and spectral modulations as well as joint spectro-temporal modulations by employing a two-dimensional modulation filterbank. The STMI can account for phase-distortion effects, such as the phase jitter associated with telephone transmission and the desynchronization of frequency channels (i.e., inter-channel delays). Inspired by this approach, a two-dimensional SNR env metric (i.e., an SNR env obtained after processing through a spectro-temporal modulation filterbank) was explored by Chabot-Leclerc et al. (2014). Their analysis demonstrated that, while such a joint spectro-temporal modulation filterbank analysis can account for the intelligibility of phase-distorted noisy speech, consistent with Elhilali et al. (2003) and their STMI approach, a much simpler across-frequency process at the output of the one-dimensional (purely temporal) modulation filterbank seems sufficient to account for the data. Chabot-Leclerc et al.
(2014) proposed an 'across-channel' version of the sEPSM, referred to as the sEPSMx, inspired by models of comodulation masking release (e.g., van de Par and Kohlrausch, 1998; Dau et al., 2013), that combines the one-dimensional (temporal-only) modulation filtering process with a measure of across-(audio-)frequency variability at the output of the auditory preprocessing. A complex spectro-temporal modulation filterbank was thus argued not to be required for speech intelligibility prediction, at least not for the experimental conditions considered. As can be seen in Fig. 1 (third column), the sEPSMx has the same processing pathway as the original 'long-term' sEPSM (i.e., gammatone filterbank, envelope extraction, and modulation filterbank). The SNR env is computed for each modulation and auditory filter, according to Eq. (1), and the resulting SNR env values are then weighted as follows:

$\mathrm{SNR}^{w}_{\mathrm{env},i,j} = \beta\,\sigma_{j}^{2}\,\mathrm{SNR}_{\mathrm{env},i,j}$ (4)

where σ_j² represents the variance of the normalized envelope power of the speech-and-noise mixture across all auditory channels for the j-th modulation filter, and β is a free parameter that controls the overall weighting and is fitted once, together with the ideal observer calibration. The resulting weighted SNR env values are combined across all channels following Eq. (2) and transformed into sensitivity values. Intelligibility scores are obtained by means of an ideal observer equivalent to that in the original sEPSM.
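The across-channel weighting idea can be sketched as follows: modulation channels whose envelope power varies strongly across auditory channels receive larger weights. The normalization by the global maximum and the simple multiplicative weight are simplifications assumed here; see Chabot-Leclerc et al. (2014) for the documented formulation.

```python
import numpy as np

def across_channel_weights(p_env_mix, beta=1.0):
    """p_env_mix: array of shape (n_audio_channels, n_mod_channels) with
    the envelope power of the noisy mixture. Returns one weight per
    modulation channel, based on the variance of the normalized envelope
    power across auditory channels, scaled by the free parameter beta."""
    sigma2 = np.var(p_env_mix / np.max(p_env_mix), axis=0)
    return beta * sigma2

def weighted_combine(snr_env, weights):
    """Scale each modulation channel's SNRenv by its weight, then combine
    across all channels as in Eq. (2)."""
    return np.sqrt(np.sum((snr_env * weights[None, :]) ** 2))
```

Phase jitter decorrelates the envelopes across auditory channels and thereby changes this variance term, which is what allows the sEPSMx to react to distortions that leave the per-channel envelope power largely unchanged.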
The left panel of Fig. 3 shows speech intelligibility data collected by Chabot-Leclerc et al. (2014) for speech in the presence of a speech-shaped noise (SSN), where the noisy mixture was further distorted by introducing random phase changes, as described by Elhilali et al. (2003, Eq. (8)). The magnitude of the phase distortions was controlled by the phase jitter parameter (α), which defines the range of the uniform distribution from which the phase distortions are drawn, such that each time sample of the distorted speech is assigned a random phase in the range [0, 2απ].
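A phase-jitter distortion of this kind can be sketched as follows: each sample of the analytic signal is rotated by a random phase drawn uniformly from [0, 2απ], so that α = 0 leaves the signal intact and α = 1 fully randomizes the phase. The analytic-signal implementation is an illustrative assumption, not necessarily the exact processing of Elhilali et al. (2003).

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal; its real part equals the input."""
    n = len(x)
    spec = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    return np.fft.ifft(spec * h)

def phase_jitter(x, alpha, rng=None):
    """Rotate each sample by a random phase theta(t) ~ U[0, 2*alpha*pi]
    and keep the real part of the result."""
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.uniform(0.0, 2.0 * alpha * np.pi, size=len(x))
    return np.real(analytic_signal(x) * np.exp(1j * theta))
```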
The non-linear relationship between the phase jitter parameter and speech intelligibility is well captured by the sEPSMx (dark gray triangles), outperforming the STMI, whereas, as expected, the mr-sEPSM (black diamonds) fails to predict any change in speech intelligibility due to phase distortions.
The sEPSMx was shown to be backwards compatible with the sEPSM, such that, in addition to phase distortions, it accounts for the effects of stationary maskers, reverberation and spectral subtraction on speech intelligibility. However, since the sEPSMx does not include any short-term analysis in its documented realization, it cannot account for the intelligibility of speech in non-stationary maskers (Chabot-Leclerc et al., 2014).

Spatial effects in speech intelligibility
The models described above represent monaural models of auditory processing and therefore cannot account for binaural effects on speech intelligibility. Chabot-Leclerc et al. (2016) proposed a binaural extension of the sEPSM framework, the B-sEPSM, that combines two realizations of the mr-sEPSM (one for each ear) with an additional binaural processing path implemented using a short-time equalization-cancellation (EC) process (i.e., a process that simulates binaural unmasking; see, e.g., Durlach, 1963; Wan et al., 2014).
The two monaural paths are equivalent to the mr-sEPSM model, except for the addition of a random phase jitter to the envelopes extracted at the output of the gammatone filterbank. This was necessary to limit the efficacy of the EC process while not affecting the monaural paths, as they are energy based. The EC process is performed individually for each auditory channel (i.e., band) with a short-time resolution of 20-ms windows with 50% overlap. For each window and each auditory band, the equalization stage selects the optimal interaural time difference (ITD) and interaural level difference (ILD). Changes in SRT resulting from different numbers of maskers and changes in the location of the noise sources can also be accounted for by the B-sEPSM. Additionally, the effects of reverberation for single-masker conditions can be successfully predicted (Chabot-Leclerc et al., 2016; Fig. 4). The model has a similar predictive power as other binaural models, such as the short-term binaural speech intelligibility model (stBSIM; Beutelmann et al., 2010) or the short-time version of the EC model of Wan et al. (2014). Despite the inclusion of a binaural unmasking path in the B-sEPSM, Chabot-Leclerc et al. (2016) showed that a 'better-ear' strategy (i.e., the selection of the monaural channel with the higher SNR env for each time window), instead of the output after 'true' binaural processing, was sufficient to account for the data in most conditions. However, the binaural path became crucial to account for the results of conditions where the stimuli were designed such that ITDs were present but no other binaural cues (e.g., ILDs) were available.
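The short-time 'better-ear' strategy discussed above can be sketched very simply: per window and auditory channel, the ear with the higher SNRenv is selected. The array layout and the helper for the 20-ms, 50%-overlap framing are illustrative assumptions.

```python
import numpy as np

def better_ear(snr_env_left, snr_env_right):
    """Per-window, per-channel selection of the ear with the higher
    SNRenv; inputs have shape (n_windows, n_channels)."""
    return np.maximum(snr_env_left, snr_env_right)

def frame_starts(n_samples, fs, win_ms=20.0, overlap=0.5):
    """Start indices of the short-time windows (20 ms, 50% overlap as in
    the B-sEPSM description); returns (start_indices, window_length)."""
    win = int(fs * win_ms / 1000.0)
    hop = int(win * (1.0 - overlap))
    return list(range(0, n_samples - win + 1, hop)), win
```

The full binaural path would, in addition, run an EC stage per window and band and keep the best of the left-ear, right-ear and EC outputs; the sketch above covers only the better-ear selection that was found sufficient in most conditions.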

Accounting for psychoacoustic masking data using the EPSM framework
The sEPSM model was originally designed to predict speech intelligibility. However, Biberger and Ewert (2016, 2017) extended the sEPSM framework to account for psychoacoustic data, including intensity discrimination, spectral masking with narrow-band and pure-tone maskers, and forward masking with noise carriers.
Their model, termed the multi-resolution (envelope) generalized power spectrum model (mre-GPSM), combined the classical PSM (Fletcher, 1940; Moore and Glasberg, 1987) with the mr-sEPSM model (Jørgensen et al., 2013). As indicated in Fig. 1 (fifth column), this was achieved by calculating the (audio-domain) power at the outputs of the auditory filters and computing an SNR value in the audio domain, termed SNR DC, that was then combined with the SNR env (i.e., the modulation-based SNR) at the decision stage following:

$\mathrm{SNR} = \sqrt{\alpha\left[\sum_{i=1}^{I}\sum_{j=1}^{J(i)}\mathrm{SNR}_{\mathrm{env},i,j}^{2}\right] + \beta\left[\sum_{i=1}^{I}\mathrm{SNR}_{\mathrm{DC},i}^{2}\right]}$ (5)

The left-hand bracket represents the combination of all SNR env values across the I auditory filters and the corresponding J(i) modulation filters, where J(i) represents the number of modulation channels that comply with the selection rule f_c,mod < (1/4)·f_c,aud (Verhey et al., 1999; Jørgensen et al., 2013). The right-hand bracket represents the combination of SNR DC values across the I auditory filters. α and β are free parameters that are fitted experimentally to correct for the (incorrect) assumption of independence between SNR env and SNR DC in the model. In addition, in the mre-GPSM, in contrast to the mr-sEPSM, the SNR env values are logarithmically compressed before combination with the SNR DC.
The model can account for the intelligibility of speech in the presence of fluctuating maskers as well as in conditions with reverberation and noise suppression (using spectral subtraction). By including the PSM processing path, the mre-GPSM can account for a broad range of psychoacoustic data with an accuracy similar to that of the auditory processing and perception model of Dau et al. (1997a, 1997b). Additionally, the GPSM is backwards compatible with the original (not speech-based) EPSM and can therefore account for different aspects of modulation perception, such as amplitude modulation detection, discrimination and masking.
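A rough sketch of how the audio-domain SNR and the envelope-domain SNR could be combined at the decision stage is given below. The combination rule, the specific logarithmic compression and the parameter values are illustrative assumptions consistent with the description in the text, not the exact mre-GPSM formulation.

```python
import numpy as np

def gpsm_metric(snr_env, snr_dc, alpha=1.0, beta=1.0):
    """Combine envelope-domain and audio-domain SNRs into one decision
    metric. snr_env: shape (I, J) across auditory and modulation
    channels; snr_dc: length-I array across auditory channels. alpha and
    beta are free weights (fitted in the model)."""
    snr_env_c = np.log10(1.0 + np.asarray(snr_env))   # compression (sketch)
    env_term = np.sum(snr_env_c ** 2)                  # left-hand bracket
    dc_term = np.sum(np.asarray(snr_dc) ** 2)          # right-hand bracket
    return np.sqrt(alpha * env_term + beta * dc_term)
```

With the envelope term at zero, the metric reduces to the audio-domain combination alone, which is the limiting case in which the mre-GPSM behaves like a classical power spectrum model.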

Correlation-based models
The SNR_env metric was shown to account for a broad range of listening conditions when used in different model configurations. However, this (modulation) energy-based metric, as mentioned above (Sect. 1.4), cannot account for degradations that affect the phase of the speech stimuli without an additional 'fitting' of the model back end, such as the across-frequency process introduced in the sEPSM_x (Chabot-Leclerc et al., 2014). Other types of non-linear speech processing that affect the (audio-)frequency relationships in the stimuli, such as ideal time-frequency segregation (ITFS), should likewise present a challenge for the SNR_env-based models, even though these models can predict effects of single-channel noise reduction (i.e., spectral subtraction). An alternative approach to estimating the (speech) signal in a noisy speech mixture is based on the cross-correlation of the clean speech with the noisy speech mixture (e.g., Green and Swets, 1966). Thus, instead of estimating the signal through the ratio of the noisy mixture and the noise alone, as is done in the power spectrum model approaches, intelligibility can also be derived from the correlation between the assumed clean (or supra-threshold) speech signal and the noisy speech representation. Below, results obtained with the correlation-based modeling approach are presented, still assuming a modulation-frequency selective analysis in the front-end processing.

Effects of time-frequency degradations on speech intelligibility
Relaño-Iborra et al. (2016) proposed a model that combined the modulation-frequency selective structure of the mr-sEPSM with a correlation metric in its back end, inspired by the success of the short-time objective intelligibility measure (STOI; Taal et al., 2011). In contrast to STOI, which only considers an (audio-)frequency analysis of the stimuli, the correlation metric is applied in the envelope domain (i.e., at the output of the modulation filterbank), analogous to the SNR_env metric. For this 'template-matching' approach (i.e., where a clean speech reference or 'template' is used to evaluate the degraded signal), the inputs to the model change from the speech mixture and the noise alone (for the SNR_env metrics) to the speech mixture and the speech alone (i.e., the clean speech).
As indicated in Fig. 1 (sixth column), the correlation-based sEPSM (sEPSM_corr; Relaño-Iborra et al., 2016) shares its preprocessing path with that of the mr-sEPSM, i.e., it includes a gammatone filterbank, an envelope extraction stage and a modulation filterbank. However, instead of computing the power at the outputs of the modulation filters, two additional processing steps are performed: first, in order to account for the limitations in modulation phase sensitivity, a second Hilbert envelope is extracted for modulation channels with center frequencies above 10 Hz, i.e., phase information is removed for the higher modulation channels (Langner and Schreiner, 1988). Second, the resulting modulation envelopes are logarithmically compressed to satisfy Weber's law in the modulation domain (Ewert and Dau, 2004). The resulting internal representations are then cross-correlated at lag zero in 'multi-resolution' time windows analogous to those of the mr-sEPSM (i.e., the window length is modulation-frequency dependent). In contrast to the mr-sEPSM, where SNR_env values are averaged across time windows, the correlation values are collapsed across time using a 'multiple-looks' approach (Viemeister and Wakefield, 1991):

χ_i,j = [ Σ_{k=1..K(j)} χ²_i,j,k ]^(1/2)

where χ_i,j,k represents the correlation value obtained for the i-th auditory filter and j-th modulation filter in the k-th time window, and K(j) represents the number of temporal windows for the corresponding modulation filter. The resulting correlations are averaged across modulation and auditory filters (see footnote 2). The final model output is transformed into an intelligibility score using a logistic function, as in STOI, that is fitted once to the speech material (Relaño-Iborra et al., 2016).
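The back-end steps for a single (auditory, modulation) channel pair can be sketched as follows. This is one plausible reading of the description above (zero-lag correlation per multi-resolution window, clipping of negative correlations to zero as per footnote 2, and a root-sum-of-squares 'multiple-looks' combination); normalization details in the published model may differ, and the function names are illustrative.

```python
import numpy as np

def zero_lag_corr(x, y):
    """Pearson correlation at lag zero within one time window."""
    x = x - x.mean()
    y = y - y.mean()
    denom = np.sqrt(np.sum(x**2) * np.sum(y**2))
    return float(np.dot(x, y) / denom) if denom > 0 else 0.0

def channel_correlation(clean, noisy, win_len):
    """Multi-resolution analysis for one (auditory, modulation) channel:
    correlate the clean template and the noisy mixture in consecutive
    windows of modulation-frequency-dependent length win_len, clip
    negative window correlations to zero, and combine the K 'looks'
    with a root-sum-of-squares rule (Viemeister and Wakefield, 1991)."""
    n_win = len(clean) // win_len
    looks = []
    for k in range(n_win):
        seg = slice(k * win_len, (k + 1) * win_len)
        looks.append(max(0.0, zero_lag_corr(clean[seg], noisy[seg])))
    return float(np.sqrt(np.sum(np.square(looks))))
```

In the full model, these per-channel values would then be averaged across modulation and auditory filters and mapped to an intelligibility score via the fitted logistic function.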
The right panel of Fig. 3 shows intelligibility scores for speech mixed with a masker consisting of a conversation between two speakers in a cafeteria, processed using ITFS. ITFS is a noise-reduction technique that removes time-frequency (TF) units from the mixture when the local SNR is below a certain relative criterion (RC; x-axis) by means of an ideal binary mask (IBM). The data (open squares) collected by Kjems et al. (2009) showed that applying such processing increased speech intelligibility when the proportion of removed TF units was small (i.e., the RC was low and the IBM density was high), whereas intelligibility decreased as the RC and the proportion of removed TF units increased (i.e., low IBM density). The mr-sEPSM (black diamonds) cannot account for this phenomenon, since it predicts that any ITFS applied to the speech mixture would be beneficial to intelligibility. Furthermore, the mr-sEPSM overestimates the overall improvement in speech intelligibility regardless of the RC. Since removing TF units increases the overall modulation energy, regardless of the amount of speech information preserved after processing, the SNR_env is oversensitive to this type of processing. In contrast, the sEPSM_corr (dark gray circles) accurately predicts the effects of ITFS, showing similar performance to the 'reference model' STOI (which was designed to account for these effects), even though the sEPSM_corr slightly underestimates the drop in intelligibility at higher RCs.
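The ITFS processing described here can be illustrated with a small sketch. The local-SNR definition and the array layout below are simplified assumptions for illustration, not the exact parameterization of Kjems et al. (2009), where the keep/remove criterion is defined relative to the broadband SNR.

```python
import numpy as np

def ideal_binary_mask(speech_power, noise_power, rc_db, eps=1e-12):
    """Ideal binary mask: keep a time-frequency unit when its local SNR
    (in dB) exceeds the criterion rc_db, remove it otherwise. Inputs
    are T-F power arrays of equal shape (simplified illustration)."""
    local_snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    return (local_snr_db > rc_db).astype(float)

def apply_itfs(mixture_tf, mask):
    """Ideal time-frequency segregation: zero out the removed units."""
    return mixture_tf * mask
```

Raising the criterion removes more units (lower IBM density), which, as discussed above, eventually degrades intelligibility in the human data but is misread as beneficial by the modulation-energy-based SNR_env.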
In addition to accounting for the effects of ITFS, the sEPSM_corr was shown to account for the intelligibility of speech in the presence of fluctuating maskers, and of speech processed using spectral subtraction and phase-jitter distortion. In fact, as a result of the use of the correlation-based metric, the sEPSM_corr accounts for the phase-jitter distortion data without the need for an explicit across-frequency analysis (as assumed in the sEPSM_x and the STMI). The model, however, does not account for the effects of reverberation, a limitation shared by other correlation-based models, such as STOI, as discussed further below.

Effects of sound pressure level on speech intelligibility
ARTICLE IN PRESS
JID: HEARES [m5G; September 24, 2022; 3:23]

The EPSM-based models all use a linear gammatone filterbank in their peripheral processing. However, basilar-membrane processing is non-linear and level dependent (e.g., Rhode, 1971; Robles et al., 1986). The EPSM framework must therefore fail to capture effects of sound pressure level on speech intelligibility. Furthermore, as hearing impairment commonly results in a linearization of the cochlear response (e.g., Plack et al., 2004; Johannesen et al., 2016), this framework might not account for the effects of hearing impairment on speech intelligibility. Several non-linear models of peripheral processing have been proposed (e.g., Carney, 1993; Jepsen et al., 2008; Brown et al., 2011; Zilany et al., 2014). Relaño-Iborra et al. (2019) investigated the potential of one such model as a predictor of speech intelligibility. They combined the computational auditory signal processing and perception model (CASP; Jepsen et al., 2008; Jepsen and Dau, 2011), which includes the dual-resonance non-linear filterbank (DRNL; Meddis et al., 2001; Lopez-Poveda and Meddis, 2001) in its front end, with the modulation-frequency selective processing described by Dau et al. (1997a, 1997b) and the sEPSM family to predict speech intelligibility. CASP had previously been shown to successfully predict listener behavior in non-speech psychoacoustic tasks, thus warranting further investigation as a speech intelligibility predictor.

2 Before summation of the correlation coefficients across audio and modulation channels, all negative correlations are set to zero, under the assumption that a negative correlation represents a lack of intelligibility. Thus, the intermediate correlations are limited to the range between 0 (not intelligible) and 1 (fully intelligible).
In addition to the DRNL, the model, termed the 'speech-based' CASP (sCASP; Relaño-Iborra et al., 2019), includes simulations of outer- and middle-ear filtering as well as expansion and adaptation stages (see Fig. 1, seventh column) before the modulation decomposition. The modulation processing in sCASP differs slightly from that in the EPSM: sCASP employs a cascade of complex frequency-shifted first-order low-pass filters instead of band-pass filters. The filter frequencies are logarithmically scaled, and the modulation filters have a Q factor of 2. The internal representations produced at the outputs of these stages are analyzed using a correlation-based back end inspired by the sEPSM_corr (Relaño-Iborra et al., 2016). The correlation back end from the sEPSM_corr was modified such that the correlation is performed over both modulation and audio frequency channels in each multi-resolution temporal window. Furthermore, the correlation values are averaged across temporal windows instead of using the multiple-looks approach of the sEPSM_corr. This '2D' correlation approach is conceptually similar to the across-frequency analysis performed in the sEPSM_x (Chabot-Leclerc et al., 2014). However, the correlation in sCASP is simply computed for the concatenated outputs of all audio channels for a specific modulation frequency and, thus, the model does not require any additional fitting of across-frequency weights. The resulting correlation values are linearly averaged across time windows and modulation channels and transformed into intelligibility scores using a logistic function analogous to that of the sEPSM_corr.
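The cascade element mentioned above, a complex frequency-shifted first-order low-pass filter, can be sketched as a single complex one-pole filter whose pole sits at the modulation center frequency. The discrete-time realization below is a minimal illustration under assumed conventions (the pole-radius/bandwidth mapping and the function name are not taken from the published sCASP code).

```python
import numpy as np

def mod_filter(x, fs, fc, q=2.0):
    """Sketch of one complex frequency-shifted first-order low-pass
    (effectively band-pass) modulation filter: a single complex pole
    at center frequency fc (Hz) with bandwidth fc/q (Q = 2 as in
    sCASP). x is a real-valued envelope signal sampled at fs (Hz)."""
    bw = fc / q                        # bandwidth in Hz (Q factor of 2)
    r = np.exp(-np.pi * bw / fs)       # pole radius set by the bandwidth
    pole = r * np.exp(2j * np.pi * fc / fs)
    y = np.zeros(len(x), dtype=complex)
    prev = 0.0 + 0.0j
    for n, xn in enumerate(x):
        prev = (1.0 - r) * xn + pole * prev   # one-pole complex recursion
        y[n] = prev
    return y  # |y| approximates the band-limited modulation envelope
```

Fed with an amplitude modulation at its center frequency, the filter responds with near-unity gain, while off-frequency modulations are strongly attenuated, which is the behavior the logarithmically scaled cascade exploits.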
The inclusion of a non-linear front end in the model allows it to predict effects of level on speech intelligibility, including effects of hearing thresholds (footnote 3) as well as the decrease of intelligibility at high stimulation levels, commonly referred to as the 'roll-over' effect (e.g., Festen, 1993; Studebaker et al., 1999; Summers and Molis, 2004). The solid lines in the left panel of Fig. 4 show sCASP predictions of SRTs as a function of the speech presentation level for three maskers: SSN (black), 4-Hz modulated SSN (SAM, purple) and the ISTS (Holube et al., 2010; blue). These simulations are not intended to predict a specific dataset but, instead, are shown to evaluate the trends in the model predictions at higher presentation levels. The predictions show an increase in the SRTs at higher levels, as well as a decrease of the SRT difference between the SSN and each of the two fluctuating maskers, i.e., decreasing release from masking with increasing level. This is consistent with experimental data (e.g., Summers and Molis, 2004). The amount of masking release as a function of the speech level is shown in the right panel of Fig. 4, with sCASP predictions indicated by triangles. Corresponding predictions obtained with a 'linear version' of the model (the sEPSM_corr) are shown by the dashed lines (left panel) and filled circles (right panel), respectively. As expected, the sEPSM_corr fails to predict changes in speech intelligibility with level, nor can it capture the changes in the amount of masking release at the higher speech presentation levels observed in the listener data (Summers and Molis, 2004).
In addition to accounting for the effects of level on speech intelligibility in the above conditions, sCASP accounts for the intelligibility of speech in all previously considered experimental paradigms (i.e., speech in the presence of fluctuating maskers, spectrally subtracted noisy speech, phase-jittered speech and speech processed with ITFS). Furthermore, the inclusion of the DRNL stage in the model allows for the representation of sensorineural hearing loss within the framework, such that this model might be valuable for the investigation of speech perception in HI listeners. However, as for its predecessor, the sEPSM_corr, sCASP fails to account for the effects of reverberation on the intelligibility of noisy speech.

Perceptual effects of hearing impairment on speech intelligibility
Speech intelligibility is typically lower for HI people than for the NH population (e.g., Festen and Plomp, 1990; Peters et al., 1998; George et al., 2006). While, clinically, the most established measure of hearing loss is the pure-tone audiogram, hearing impairment can also lead to 'supra-threshold' deficits not reflected in the audiogram, such as reduced spectral and temporal resolution and loudness recruitment (e.g., Glasberg and Moore, 1989; Strelcyk and Dau, 2009; Summers et al., 2013; Johannesen et al., 2016; Sanchez Lopez et al., 2018). However, the underlying sources of the perceptual deficits are not fully understood. Therefore, modeling frameworks that can predict the performance of HI listeners are valuable for a better understanding of the relation between a given type of hearing impairment, its representation in the auditory system and its perceptual consequences, including speech perception. Several models, with different levels of complexity, have been proposed to predict speech intelligibility for HI listeners (e.g., ANSI S3.5-1997; Kates and Arehart, 2014; Kollmeier et al., 2016; Scheidiger et al., 2018). Here, modeling attempts that are based on modulation-frequency selective processing are presented and discussed.
All sEPSM-based models (e.g., sEPSM, mr-sEPSM, sEPSM_corr) include a thresholding stage that takes the audibility of the stimuli into account. After the (audio-)frequency decomposition, the energy at the output of the peripheral filters is compared to the standard NH thresholds in free field (ISO 389-7:2005; see also footnote 3). Audibility deficits can therefore be implemented at this stage based on the listener's audiometric thresholds. Scheidiger and colleagues reported that predictions based on audibility deficits alone did not correctly describe the SRTs of HI listeners for speech in stationary (Scheidiger et al., 2014) or fluctuating interferers (Scheidiger and Dau, 2015), even though the model captured the reduced masking release for the HI listeners relative to that of the NH listeners, which is consistent with results from other studies (e.g., George et al., 2006; Bernstein and Grant, 2009). Simulating an additional loss of frequency selectivity (by a widening of the auditory filters) did not improve the predictive power of the HI predictions with the mr-sEPSM (Scheidiger, 2017), as shown in Fig. 5 (grey symbols).
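The thresholding stage described above amounts to a per-channel comparison of output level against a hearing threshold. The sketch below illustrates this under assumed conventions; the threshold values, channel layout and the function name are illustrative, and in the actual models the NH thresholds would come from ISO 389-7 free-field data.

```python
import numpy as np

def apply_audibility(channel_levels_db, nh_thresholds_db, hearing_loss_db=None):
    """Sketch of the sEPSM-style audibility check: a peripheral channel
    contributes to the prediction only if its output level exceeds the
    hearing threshold in that channel. hearing_loss_db (audiometric
    thresholds re: NH) shifts the NH thresholds per channel, which is
    how audibility deficits can be implemented."""
    thr = np.asarray(nh_thresholds_db, dtype=float)
    if hearing_loss_db is not None:
        thr = thr + np.asarray(hearing_loss_db, dtype=float)
    return np.asarray(channel_levels_db, dtype=float) > thr
```

As the text notes, such audibility-only manipulations capture some group effects (e.g., reduced masking release) but fall short of explaining the elevated SRTs of HI listeners.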
Relaño-Iborra (2019) implemented audibility loss within the sEPSM_corr and showed similar limitations in accounting for the SRT data of the HI listeners, as illustrated in Fig. 5 (brown symbols). When compared to the data for NH and HI listeners obtained by Jørgensen et al. (2013) and Christiansen and Dau (2012), respectively (open symbols in Fig. 5), it can be seen that both models underestimate the elevation of the SRTs in the HI group, except for the sEPSM in the condition with the ISTS interferer. Neither model reflects the variance across listeners' SRTs observed in the HI data. Thus, regardless of the back end (i.e., the SNR_env metric in the mr-sEPSM or the correlation-based metric in the sEPSM_corr), the approaches using a linear front end fail to predict HI data for some of the 'basic' speech intelligibility conditions.

Fig. 5 caption (fragment): Data were obtained by Jørgensen et al. (2013) for the NH conditions and by Christiansen and Dau (2012) for the HI conditions. The middle line in all boxplots represents the median SRT, while the boxes cover the 75% and 25% confidence regions. The whiskers cover the entire region of occurrences, except for the outliers, which are depicted as red crosses.

Several speech intelligibility models have been proposed that include a non-linear peripheral processing stage. For example, the auditory-nerve (AN) model of Carney (1993), or more recent variants of it, has been combined with different back ends to predict speech intelligibility, such as the STMI (Zilany and Bruce, 2007), the 'neurogram similarity index' (Hines and Harte, 2012; Bruce et al., 2013), the SNR_env (Scheidiger, 2017; Scheidiger et al., 2017) and correlation-based metrics (Scheidiger et al., 2018; Zaar and Carney, 2022, this issue). The 'AN-sEPSM' of Scheidiger et al. (2017) (Fig. 1, rightmost column) includes simulations of middle-ear filtering, a basilar-membrane (chirp) filter with a forward control path to account for level effects, an inner hair cell (IHC) stage, an IHC-AN synapse model and a non-homogeneous Poisson process underlying spike generation. The average firing rates generated at the output of each peripheral filter are then processed in this model using the same modulation filterbank and subsequent processing as in the mr-sEPSM. The model was fitted to the individual listeners' profiles using estimates of outer hair cell (OHC) loss and IHC loss. However, this model failed to account for the SRTs of NH listeners obtained at conversational levels (65 dB SPL), whereas it produced accurate predictions at low sound pressure levels (e.g., below 50 dB SPL), at which the model's front end essentially behaves linearly. One assumption of the SNR_env-based models is that the power of the speech can be estimated from the difference between the power of the noisy speech and the power of the noise alone, but this assumption is not valid when the different signals follow non-linear processing paths.
In contrast, predictions obtained with sCASP (Relaño-Iborra, 2019, 2021), which includes non-linear peripheral processing and a correlation-based back end, showed reasonable agreement with measured speech intelligibility, as shown in Fig. 5 (light green symbols), even though predictions for individual listeners remain a challenge, i.e., the across-listener variability in the data is markedly underestimated by the model. Additionally, sCASP does not produce accurate predictions of the HI data in the SSN condition. However, the results are promising and suggest that a correlation metric might be suitable for capturing the consequences of processing deficits in the impaired auditory system. This is in line with recent approaches that have used a correlation metric in combination with the AN model to predict speech intelligibility for HI listeners (Scheidiger et al., 2018; Zaar and Carney, 2022, this issue).

General discussion
This review has discussed speech intelligibility models that assume a modulation-frequency selective process in their front end, inspired by previous modeling work describing the processing and perception of modulated sounds. The premise of the models is that temporal fluctuations of the envelopes of the sounds are crucial for speech intelligibility, such that a degraded envelope representation of the target speech in a given background typically leads to decreased speech intelligibility (e.g., Drullman et al., 1994a; Shannon et al., 1995; Elliott and Theunissen, 2009). Several models that include modulation processing, such as the STI (Houtgast et al., 1980; Payton and Braida, 1999) and the STMI (Elhilali et al., 2003), the latter of which utilizes a two-dimensional (spectro-temporal) modulation analysis, were not explicitly described here but were considered as comparison models (see Figs. 2 and 3). The models reviewed here do not analyze spectral modulations explicitly (i.e., they do not include a spectral modulation filter, as in the case of the STMI, although some incorporate an across-frequency analysis, e.g., the sEPSM_x and sCASP) and use an envelope-domain SNR or a correlation metric in their back end. In addition, models of 'microscopic' speech perception focusing on consonant perception were not considered here, despite the existence of models that can predict the intelligibility of such stimuli using modulation-frequency selective processing (e.g., Jürgens et al., 2014; Zaar et al., 2017; Zaar and Dau, 2017). All approaches represent bottom-up signal processing models that evaluate the effect of certain signal modifications through a transmission channel (e.g., a room, a non-linear transmission, a HI system) on speech intelligibility without having any 'knowledge about speech' with respect to, e.g., its structure and phonetic elements, its meaning, context, statistical validity or linguistic relevance. Thus, these models, as most other 'classical' approaches, are purely stimulus driven and essentially evaluate the effects of the proposed (auditory) processing in the signal path on speech reception. Their predictions are therefore commonly limited to estimates of SRTs or percentage correct, and they do not cover tasks like speech or speaker identification and sound segregation, which generally require training of models using large-scale datasets.
The discussed models showed a reasonably wide range of applicability in terms of the acoustic conditions considered, including speech in the presence of stationary or fluctuating maskers, noisy speech degraded by reverberation and phase jitter, and noisy speech processed using ideal time-frequency segregation and spectral subtraction. The models with the best overall predictive power, the sEPSM_corr and sCASP, can account for the effects of both signal degradations and signal enhancements on speech intelligibility. All models showed better performance in the conditions they were designed for compared to traditional metrics of speech intelligibility such as the SII (ANSI S3.5-1997) or the STI (Houtgast et al., 1980). A strength of these models is their fitting procedure; whether the transformation from model units to speech intelligibility scores is performed using an ideal observer (e.g., sEPSM, mr-sEPSM, sEPSM_x) or a logistic fit (e.g., sEPSM_corr, sCASP), the mapping is performed once, generally for the 'reference' condition of speech in the presence of stationary noise. Thus, the models do not receive any information about the 'test' conditions, i.e., they produce 'true' predictions on unseen data. Additionally, none of these models requires any band-importance weighting, as is the case for models like the AI, SII, STI and their variants.
In terms of the specific role of modulation-frequency selective processing for speech intelligibility prediction, Jørgensen and Dau (2011) investigated the contribution of this stage to the predictions of the sEPSM by replacing the modulation filterbank of the model with a single modulation low-pass filter (f_cut-off = 150 Hz), reflecting some sluggishness in the processing of modulations but with no modulation-frequency selectivity. This model version failed to predict the data for the spectral subtraction condition. Similarly, Jørgensen et al. (2013) evaluated alternative versions of the mr-sEPSM that included low-pass modulation processing or equal-duration short-time windows across all modulations (as opposed to the 'multi-resolution' approach with time constants matched to the modulation filter bandwidth in the mr-sEPSM). They demonstrated that the combination of multi-resolution time analysis with modulation-frequency selective processing was responsible for the accurate predictions of the model. Even though neither the 'low-pass' mr-sEPSM nor the 'equal-window' short-time variant performed as poorly as the low-pass sEPSM, their predictions were substantially less accurate than those of the mr-sEPSM. These effects were also shown for earlier versions of the sEPSM_corr, whose predictive power was reduced when using a modulation low-pass filter approach and an equal-window short-term approach (Relaño-Iborra, 2015). The need for a multi-resolution analysis seems to be consistent with results from other modeling frameworks, such as the ESII (Rhebergen et al., 2006), for which (audio-)frequency dependent multi-resolution analysis windows are used, providing improved performance compared to ESII variants that used equal-duration analysis windows (Rhebergen and Versfeld, 2005).
The relative contributions of different modulation filters to speech intelligibility have been analyzed within the modeling framework (Jørgensen et al., 2013; Steinmetzger et al., 2019). It was shown, both for the SNR_env- and the correlation-based models, that the low (< 8 Hz) and higher modulation frequencies (> 64 Hz) contributed most to the predictive power of the models. This was further reinforced by Steinmetzger et al. (2019), who proposed a revised version of the sEPSM_corr that removed modulation bands whose energy fell below a minimum threshold. Steinmetzger et al. (2019) showed that when the medium modulation frequencies (16 to 64 Hz) were not considered, the sEPSM_corr performed better overall, particularly in conditions where the periodicity in the masker resulted in an improvement of speech intelligibility, referred to as a 'masker-periodicity benefit' (Steinmetzger and Rosen, 2015).
Two decision metrics have been discussed in the present study: the SNR_env and the correlation between the clean and the noisy speech. While the SNR_env, an energy metric, was shown not to be suitable for predicting the effects of phase-jitter distortions or ITFS processing, it seems to be well suited for the prediction of speech intelligibility in conditions including reverberation (Jørgensen and Dau, 2011; Chabot-Leclerc et al., 2016). Some attempts to model the effects of reverberation on speech intelligibility have been made using correlation metrics (Relaño-Iborra et al., 2016). However, reverberation alters the temporal properties of speech, which challenges the correlation-based 'template-matching' approach (whereas in the SNR_env approach both the noise reference and the noisy target are equally degraded by the reverberation). On the other hand, correlation metrics can account for phase-jitter distortions due to their phase sensitivity, capturing within-channel envelope phase changes in the degraded signal. Both approaches may thus reflect strategies that need to be considered to better understand the processes underlying speech perception in a broad range of conditions that include distortions and degradations.
A limitation of the overall modeling strategy presented in this review is that a 'reference signal' is required, i.e., the noise alone in the case of the SNR_env metric and the clean speech 'template' in the case of the correlation-based metric. This implies that the models have a priori knowledge of the stimuli, well beyond what listeners might have access to in real life. Moreover, this reference requirement limits their application with respect to real-time processing. With the increased attention to 'non-intrusive' (i.e., reference-free) models for predicting speech intelligibility (e.g., Schädler et al., 2015, 2016; Exter and Meyer, 2016; Karbasi et al., 2016; Spille et al., 2018), it remains to be explored whether the SNR_env- or the correlation-based models have the capacity to serve as front ends to more complex data-driven approaches to modeling speech intelligibility. Some methods have been proposed for computing STOI in a non-intrusive manner (Sørensen et al., 2017, 2018), and these may be applicable to some of the models discussed here. In addition, the models discussed here might be challenged in conditions where speech interferers are present and fundamental-frequency (F0) related cues and 'informational' masking may play a role (e.g., Darwin et al., 2003; Vestergaard and Patterson, 2009; Léger et al., 2014; Shen and Souza, 2018), since neither of these aspects is taken into account by the models. Additionally, by focusing on the processing of temporal envelopes, the models discussed in this review may disregard effects of the stimuli's temporal fine structure (TFS) and its neural representation on speech intelligibility. This might limit the models' ability to, e.g., account for speech intelligibility data from older listeners (e.g., Füllgrabe et al., 2015) or listeners with a moderate hearing loss (e.g., Buss et al., 2004). TFS and envelope-based processing could be combined in the form of a two-path model, as
proposed by Ewert et al. (2020). Such an approach could be used to extend several of the models presented in the present review. Indeed, one major motivation for developing speech intelligibility models is to better understand the potential impact of hearing impairment on speech communication and to have a framework for designing and evaluating hearing-instrument signal processing strategies. Several models that aimed to account for the degraded speech intelligibility typically observed for HI listeners were presented here. The most successful predictions in this regard were obtained using sCASP (Relaño-Iborra et al., 2019, 2021), suggesting that correlation metrics, in combination with a non-linear front end, may be able to account, at least partially, for hearing-impairment deficits that result in a degraded signal representation. This is in line with current work using an AN-model-based front end in combination with a correlation back end (Scheidiger et al., 2018; Zaar and Carney, 2022, this issue).
However, most current speech intelligibility models are challenged to account for the consequences of individual hearing loss, even if group-level differences between NH and HI listeners can partly be accounted for (Holube and Kollmeier, 1996; Beutelmann and Brand, 2006; Jürgens and Brand, 2009; Jürgens et al., 2014; Scheidiger et al., 2018; but see Zaar and Carney, 2022, this issue, for some promising results regarding individualized predictions). A limiting factor might be the 'implementation' of hearing loss in these models, which is often based on audiometric thresholds and estimates of OHC and IHC loss derived from psychoacoustic (detection and masking) data, which themselves are subject to noisiness and resolution errors (Jepsen and Dau, 2011; Johannesen et al., 2014). If models are to be used to verify hearing-instrument outcomes (i.e., in a 'model in the loop' design), accounting for the individual perceptual and physiological deficits (e.g., dead regions; Moore, 2004) is crucial, such that 'supra-threshold' deficits can be represented in these model frameworks (e.g., Lopez-Poveda, 2014). In that regard, studies that may show a relationship between hearing-loss types and perceptual deficits are especially useful, e.g., in the form of auditory profiles, such as those proposed by Sanchez-Lopez et al. (2018, 2020). Specifically, for the modulation-selective models discussed in this review, the relationship between supra-threshold speech perception in HI listeners and its potential link to degraded modulation processing in the impaired auditory system remains to be clarified (Ihlefeld et al., 2016; Heeringa and van Dijk, 2019).

Conclusions
This paper reviews speech intelligibility models that exploit modulation-frequency selective filtering in their front-end processing. The models implemented different realizations of such modulation-frequency selectivity and different degrees of complexity of the overall auditory (front-end) processing, and were combined with different metrics in the back end to map the internal representation of the stimuli at the output of the auditory processing to speech intelligibility. The capabilities and limitations of the different models in predicting speech intelligibility data (from the literature) across a wide range of conditions were discussed, including speech presented in stationary or fluctuating maskers, noisy speech degraded by reverberation and phase jitter, and noisy speech processed through ideal time-frequency segregation and spectral subtraction. While some of the models show good predictive power across the considered conditions, there are remaining challenges, particularly with respect to predicting the consequences of individual hearing loss on speech intelligibility. Overall, the presented modeling approaches might provide a valuable basis for exploring and better understanding the interplay between effects of acoustic preprocessing, effects of auditory signal processing and speech intelligibility.
Most of the models presented and discussed in this review are available through the AMToolbox (https://www.amtoolbox.org; Majdak et al., 2022) and/or by request to the authors.

Fig. 2. Left panel: Speech reception thresholds (SRTs) for a spectrally subtracted speech and speech-shaped noise mixture for different amounts of noise subtraction (alpha). Human data are shown as open squares and predictions obtained with the speech-based Envelope Power Spectrum Model (sEPSM; Jørgensen and Dau, 2011) are shown as filled dark gray squares. Light gray squares show predictions obtained with the speech-based Speech Transmission Index (sSTI; Payton and Braida, 1999). Replotted from Jørgensen and Dau (2011). Right panel: SRTs for speech in the presence of stationary speech-shaped noise (SSN), 4-Hz amplitude-modulated speech-shaped noise (SAM) and the international speech test signal (ISTS; Holube et al., 2010). Human data measured by Jørgensen et al. (2013) are shown as open squares. Predictions of the sEPSM are shown as black diamonds and those of the multi-resolution sEPSM (mr-sEPSM; Jørgensen et al., 2013) as gray squares. Predictions using the sentence-based Extended Speech Intelligibility Index (ESII-s; Meyer and Brand, 2013; replotted from Biberger and Ewert, 2017) are shown as light gray diamonds.

Fig. 3. Left panel: Percentage of correctly understood words as a function of the amount of phase-jitter distortion. Human data collected by Chabot-Leclerc et al. (2014) are shown as open squares. Predictions obtained with the multi-resolution sEPSM (mr-sEPSM; Jørgensen et al., 2013) are shown as black diamonds, predictions obtained with the spectro-temporal modulation index (STMI; Elhilali et al., 2003) are shown as light gray triangles, and predictions obtained with the across-frequency sEPSM (sEPSMx; Chabot-Leclerc et al., 2014) are shown as dark gray triangles. Right panel: Intelligibility scores for noisy speech processed using ideal time-frequency segregation. The abscissa corresponds to the 'relative criterion', a measure of the mask density from dense (negative values) to sparse (positive values) binary masks. Human data collected by Kjems et al. (2009) are shown as open squares. Predictions obtained with the correlation-based sEPSM (sEPSMcorr; Relaño-Iborra et al., 2016) are shown as dark gray circles. Predictions obtained with the short-time objective intelligibility measure (STOI; Taal et al., 2011) are shown as light gray circles. 'UN' indicates intelligibility scores and predictions obtained for the unprocessed noisy speech condition.

Fig. 4. Left panel: Speech reception thresholds (SRTs) as a function of the speech sound pressure level for speech mixed with three maskers: stationary speech-shaped noise (SSN; black), 4-Hz amplitude-modulated speech-shaped noise (SAM; purple) and the international speech test signal (ISTS; Holube et al., 2010; blue). Model predictions obtained with the speech-based computational auditory signal processing and perception model (sCASP; Relaño-Iborra et al., 2019) are shown as solid lines and corresponding predictions obtained with the correlation-based sEPSM (sEPSMcorr; Relaño-Iborra et al., 2016) with a linear peripheral filtering stage are shown as dashed lines. Right panel: Release from masking relative to the SSN reference condition. sCASP predictions are shown as triangles and sEPSMcorr predictions as circles.

Fig. 5. Measured (open symbols) and predicted SRTs (filled symbols) obtained for NH (diamonds) and HI (boxplots) listeners in conditions with SSN (left panel), SAM (middle panel) and ISTS (right panel) interferers. Predictions of the speech-based CASP (sCASP; Relaño-Iborra et al., 2019) are shown in light green; predictions of the correlation-based sEPSM (sEPSMcorr; Relaño-Iborra et al., 2016) are shown in brown; and predictions of the multi-resolution sEPSM (mr-sEPSM; Jørgensen et al., 2013; HI simulations replotted from Scheidiger, 2017) are shown in gray. The data were collected by Jørgensen et al. (2013) for the NH conditions and by Christiansen and Dau (2012) for the HI conditions. The middle line in each boxplot represents the median SRT, while the boxes cover the 75% and 25% confidence regions. The whiskers cover the full range of the data, excluding outliers, which are depicted as red crosses.