Multiframe maximum a posteriori estimators for single-microphone speech enhancement

Multiframe maximum a posteriori (MAP) estimators are applied to a single-microphone noise reduction problem. Several attempts have been made to exploit the interframe correlation (IFC) between speech coefficients in the short-time Fourier transform domain. In a noise-reduction algorithm, all available information in the recorded signals should be utilized optimally in the estimation process. Single-microphone multiframe minimum variance distortionless response filters and single-microphone multiframe Wiener filters (MFWFs) have been presented in this framework. Incorporating the concept of IFC into the MAP estimator leads to multiframe MAP estimators for the single-microphone case. In each time-frequency unit, the current and a finite number of past noisy signals are utilized to develop the estimators. A complex factor is adopted to model the IFC between speech signals, which allows the application of multiframe MAP estimators. The noise reduction performance of the proposed estimators is compared with the joint MAP estimator (which ignores the correlation between successive frames) and with benchmark MFWFs and speech-distortion weighted interframe Wiener filters for different input noise types. These evaluations verify that the proposed methods exhibit good performance.


| INTRODUCTION
It is barely possible to imagine life without sounds; people talk and communicate by voice. However, undesired background noise always exists and makes listening unpleasant or, in some cases, impossible. Speech enhancement has been a significant issue for researchers because of its several applications, such as hands-free communication, hearing aids, teleconferencing, and human-to-human and human-to-machine communication [1].
Because implementation of the fast Fourier transform is straightforward, noise-reduction algorithms are usually performed in the short-time Fourier transform (STFT) domain. In 1984, Ephraim and Malah introduced a minimum mean square error (MMSE) estimator to extract the short-time spectral amplitude (STSA) of clean speech from a noisy observed signal in the STFT domain. They assumed that successive speech frames are uncorrelated and considered noise and speech signals as statistical random variables with independent Gaussian distributions in each time-frequency unit (TFU) [2]. They also derived an MMSE-STSA estimator under signal presence uncertainty and proposed a decision-directed approach to estimate the a priori signal-to-noise ratio (SNR). In addition, Ephraim and Malah extended their estimator to minimize the mean square error in the log-spectral domain, called MMSE-log STSA [3].
In [4], Wolfe and Godsill proposed simple alternatives to the MMSE-STSA rule based on the maximum a posteriori (MAP) criterion. They developed MAP estimators to extract the phase and amplitude of a clean speech signal. These estimators assumed that successive speech frames are uncorrelated. It was also shown that the proposed joint MAP phase estimate is precisely equal to the noisy signal phase [4]. Both MMSE and MAP estimators were then generalized by other researchers, for example, by considering β-order MMSE criteria [5] or other probability distributions for speech signals [6,7].
In [8], Lotter et al. extended the MMSE-STSA estimator (presented in [2]) to a multimicrophone case. First, they derived a multimicrophone estimation of MMSE-STSA, which requires knowledge of the direction of arrival (DOA) and confluent hypergeometric series. Second, using the magnitude of noisy observed signals, they introduced a direction-independent MAP estimator. Trawicki and Johnson in [9] presented an extension of the work of Lotter et al., wherein the optimal MMSE estimators for STSA, log-spectral amplitude, and spectral phase of clean speech signal were developed. The proposed spectral phase estimator considerably improves estimation accuracy but requires an initial time alignment.
All previously mentioned algorithms assume that successive speech frames are uncorrelated. However, it is well known that this assumption is not accurate. The low-pass nature of the speech signal and its inherent coherence, along with the overlap procedure applied in the STFT, introduce high correlation, especially between consecutive frames. In the speech enhancement problem, we are primarily interested in utilizing all available information in the noisy signal. In recent years, the issue of interframe and interband correlations has received much attention and has been applied in many speech-related applications [10][11][12][13][14][15][16].
In [10], the authors considered interframe correlation (IFC) in the STFT domain for single-microphone noise reduction and extended two well-known beam-formers to the multiframe case. In each TFU, they utilized the current and a finite number of past noisy signals, enabling them to introduce a new model similar to the multimicrophone cases. Consequently, multiframe minimum variance distortionless response (MFMVDR) and multiframe Wiener filters (MFWFs) (which are usually employed in multimicrophone algorithms) were developed for the first time for single-microphone cases. It was shown that a significant noise reduction could be obtained using the MFMVDR and MFWF if IFC can be accurately estimated.
Several researchers have probed the issue of IFC and its importance in speech enhancement. The sensitivity of the MFMVDR filter to estimation errors in the speech correlation matrix and the corresponding speech IFC vector was investigated in [15]. In [12], the authors proposed blind maximum-likelihood (ML) and MAP estimators for the speech IFC vector. In addition, to set the trade-off between speech distortion and noise reduction, speech-distortion weighted interframe Wiener filters have been proposed in [16]. In [11], Momeni et al. proposed a single-microphone speech presence probability (SPP) estimator based on interframe and interband correlations. It was shown that detection accuracy is increased by taking these correlations into account. Applying the SPP presented in [13], they also proposed a conditional MMSE filter.
Although several contributions have focussed on beamforming techniques, to the best of our knowledge, there has not been a detailed investigation on multiframe Bayesian estimators in the STFT domain that considers signal statistical properties. Inspired by the method presented in [10], we incorporate the concept of IFC in Bayesian estimators and show the similarities and differences between the proposed estimators and the previous multiframe beam-forming algorithms in a single-microphone scenario. Similar to [4], we propose multiframe estimators considering the MAP criterion and present so-called multiframe MAP (MFMAP) estimators. In each TFU, the current and a finite number of past frames of noisy (observed) signals are considered to develop the estimators. It should be noted that our approach is substantially different from the method in [12]. The blind ML and MAP estimators in [12] have been applied to derive the clean speech IFC vector using an extensive database required for the MFMVDR estimator.
The first part of this research is dedicated to exploiting both the phase and amplitude of the clean speech signal, resulting in the joint MFMAP (JMFMAP) estimator. Although spectral phase estimation has been ignored in many previous works, several attempts have recently been devoted to phase processing, which shows its significant role in improving speech quality and intelligibility [17]. A comprehensive review of phase importance and its effect on speech enhancement has been presented in [18]. In [19], Gerkmann derived the joint MMSE-optimal estimate of both the phase and amplitude of the clean speech signal, given uncertain a priori knowledge of the phase.
Considering IFC, we first propose joint multiframe MAP estimators for both the phase and amplitude of the clean speech signal. This work can be regarded as an extension of the single-microphone joint MAP estimator (presented by Wolfe and Godsill in [4], which uses only the information of the current frame) to the multiframe case.
In the second part of this paper, we derive multiframe MAP estimators of the amplitude of the clean speech signal, while the phase is kept unprocessed. These multiframe MAP estimators of the amplitude (MFMAPEA) are derived conditioned on either the noisy signal vector (named MFMAPEA-Sig) or the amplitude of the noisy signal vector (named MFMAPEA-Amp).
The remainder of this paper is organized as follows. We review the statistical properties of signals in Section 2. In Section 3, we incorporate the concept of IFC in the Bayesian estimators and introduce multiframe MAP estimators. Section 4 illustrates how the noise reduction performance will be affected by the use of MFMAP estimators compared with the single-microphone joint MAP estimator and the benchmark MFWFs. Finally, some concluding remarks are presented in Section 5.

| STATISTICAL PROPERTIES OF SIGNALS
Consider a single microphone that captures the clean speech signal x(t) in a noisy field. We assume that the received signal is corrupted by uncorrelated additive noise v(t). The noisy (observed) signal y(t) at time index t is given by $y(t) = x(t) + v(t)$, which in the STFT domain reads
$$Y_{m,k} = X_{m,k} + V_{m,k}, \tag{1}$$
where $Y_{m,k}$, $X_{m,k}$, and $V_{m,k}$ are the noisy signal, the clean speech, and the additive noise signals, respectively, with $m$ the frame index and $k$ the discrete frequency index. Let $Y_{m,k} = R_{m,k} e^{j\vartheta_{m,k}}$ and $X_{m,k} = A_{m,k} e^{j\alpha_{m,k}}$ denote the spectral amplitudes and phases of the noisy and clean speech signals, respectively. For simplicity, we omit the discrete frequency index and only write it when referring to a specific TFU. It is assumed that the noise and clean speech signals are zero-mean random processes. The main goal of this work is to estimate $A_m$ and $\alpha_m$.
It is assumed that the real and imaginary parts of the clean speech signal are statistically independent Gaussian variables. The amplitude and phase of the clean speech signal can then be considered statistically independent and modelled by Rayleigh and uniform distributions, respectively, given by
$$p(A_m) = \frac{2 A_m}{\sigma_x^2(m)} \exp\!\left(-\frac{A_m^2}{\sigma_x^2(m)}\right),\ A_m \ge 0; \qquad p(\alpha_m) = \frac{1}{2\pi},\ \alpha_m \in [-\pi, \pi). \tag{2}$$
Hence, their joint probability density function (pdf) is written as follows:
$$p(A_m, \alpha_m) = \frac{A_m}{\pi \sigma_x^2(m)} \exp\!\left(-\frac{A_m^2}{\sigma_x^2(m)}\right), \tag{3}$$
where $\sigma_x^2(m)$ denotes the variance of the clean speech signal at time frame $m$.
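As a quick numerical sanity check of this statistical model (with an illustrative variance value, not from the paper), the sketch below draws circularly symmetric complex Gaussian coefficients and verifies that their amplitude and phase follow the Rayleigh and uniform distributions above.

```python
import numpy as np

# A complex Gaussian STFT coefficient X with independent zero-mean real and
# imaginary parts of equal variance has Rayleigh amplitude and uniform phase.
# sigma2_x plays the role of the speech variance sigma_x^2(m).
rng = np.random.default_rng(0)
sigma2_x = 2.0
n = 200_000

X = rng.normal(0, np.sqrt(sigma2_x / 2), n) + 1j * rng.normal(0, np.sqrt(sigma2_x / 2), n)
A, alpha = np.abs(X), np.angle(X)

# Rayleigh with E{A^2} = sigma_x^2  ->  mean amplitude sqrt(pi * sigma_x^2) / 2
print(np.mean(A ** 2))                # ~ sigma2_x
print(np.mean(A))                     # ~ np.sqrt(np.pi * sigma2_x) / 2 ~ 1.2533
print(np.mean(alpha), np.var(alpha))  # ~ 0 and ~ pi^2 / 3 (uniform on [-pi, pi))
```

The check confirms that assuming independent Gaussian real/imaginary parts is equivalent to the Rayleigh-amplitude, uniform-phase model used in the derivations.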
Under the assumption of a Gaussian distribution for the noise signal, the conditional pdf of $Y_m$ given the amplitude and phase of the clean speech signal is given by
$$p(Y_m \mid A_m, \alpha_m) = \frac{1}{\pi \sigma_v^2(m)} \exp\!\left(-\frac{\left|Y_m - A_m e^{j\alpha_m}\right|^2}{\sigma_v^2(m)}\right), \tag{4}$$
where $\sigma_v^2(m)$ is the variance of the noise signal at time frame $m$.

| MULTIFRAME MAP ESTIMATORS
In recent years, IFC and its potential in speech enhancement have been reported in several research works. When a noise reduction algorithm is applied, it is important that all available information in the recorded signals is optimally utilized. Thus, multiframe beam-forming algorithms that consider IFC have been presented in [10]. These algorithms aim to find optimal filter coefficients based on various criteria and make no assumptions on the statistical properties of the clean speech and noise signals. In this work, however, we take these properties into account to derive optimal MFMAP estimators.
Like the aforementioned multiframe beam-forming algorithms, in the $m$-th TFU we utilize the current and a finite number of past noisy signals at the same frequency index, expressed as a noisy vector $\mathbf{y}_m$ such as
$$\mathbf{y}_m = \left[Y_m, Y_{m-1}, \ldots, Y_{m-L+1}\right]^T, \tag{5}$$
where the superscript $T$ denotes the transpose operation and $L$ is the length of the vector; it indicates how many frames are used to model the IFC. The vectors $\mathbf{x}_m$ and $\mathbf{v}_m$ are defined similarly, such that $\mathbf{y}_m = \mathbf{x}_m + \mathbf{v}_m$. We also define $\mathbf{r}_m$ as the amplitude of the noisy vector, that is, $\mathbf{r}_m = \left[R_m, R_{m-1}, \ldots, R_{m-L+1}\right]^T$. Assuming that the speech and noise signals are uncorrelated, the correlation matrix of the noisy signal is given by
$$\Phi_{\mathbf{y}_m} = E\{\mathbf{y}_m \mathbf{y}_m^H\} = \Phi_{\mathbf{x}_m} + \Phi_{\mathbf{v}_m}, \tag{6}$$
where $E\{\cdot\}$ denotes the expectation operator, and $\Phi_{\mathbf{x}_m}$ and $\Phi_{\mathbf{v}_m}$ denote the correlation matrices of the clean speech and noise signals, respectively.
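The multiframe construction can be sketched numerically. The snippet below (synthetic data; `stack_frames` and `sample_corr` are illustrative helper names, not from the paper) stacks the current and past coefficients of one frequency bin into length-L vectors and checks that the correlation matrix of the noisy vectors is approximately the sum of the speech and noise correlation matrices.

```python
import numpy as np

# Sketch: build y_m = [Y_m, Y_{m-1}, ..., Y_{m-L+1}]^T for one frequency bin
# and verify Phi_y ~= Phi_x + Phi_v for uncorrelated speech and noise.
rng = np.random.default_rng(1)
L, n_frames = 4, 100_000

X = rng.normal(size=n_frames) + 1j * rng.normal(size=n_frames)          # "speech"
V = 0.5 * (rng.normal(size=n_frames) + 1j * rng.normal(size=n_frames))  # "noise"
Y = X + V

def stack_frames(sig, L):
    """Row m holds [s_m, s_{m-1}, ..., s_{m-L+1}] (valid frames only)."""
    return np.stack([sig[L - 1 - i:len(sig) - i] for i in range(L)], axis=1)

def sample_corr(frames):
    """Sample estimate of E{z z^H} from the rows of `frames`."""
    return frames.T @ frames.conj() / len(frames)

Phi_y = sample_corr(stack_frames(Y, L))
Phi_xv = sample_corr(stack_frames(X, L)) + sample_corr(stack_frames(V, L))
print(np.max(np.abs(Phi_y - Phi_xv)))   # small: correlations are additive
```

The off-diagonal entries of these matrices are exactly the interframe correlations the multiframe estimators exploit.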
In practice, the correlation matrix of the noisy signal is recursively estimated as a linear combination of the correlation matrix at previous frames and the current information, that is,
$$\hat{\Phi}_{\mathbf{y}_m} = \lambda_y \hat{\Phi}_{\mathbf{y}_{m-1}} + (1 - \lambda_y)\, \mathbf{y}_m \mathbf{y}_m^H. \tag{7}$$
Similarly, in silent frames, the correlation matrix of the noise is given by
$$\hat{\Phi}_{\mathbf{v}_m} = \lambda_v \hat{\Phi}_{\mathbf{v}_{m-1}} + (1 - \lambda_v)\, \mathbf{y}_m \mathbf{y}_m^H, \tag{8}$$
where $\lambda_y$ and $\lambda_v$ denote forgetting factors. An estimate of the correlation matrix of the clean speech signal is given by
$$\hat{\Phi}_{\mathbf{x}_m} = \hat{\Phi}_{\mathbf{y}_m} - \hat{\Phi}_{\mathbf{v}_m}. \tag{9}$$
Because of the errors that occur in the estimation of the correlation matrices, negative eigenvalues of $\hat{\Phi}_{\mathbf{x}_m}$ should be set to zero to ensure that the resulting matrix is positive semidefinite.
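A minimal sketch of this recursive update and the eigenvalue flooring is given below; the helper names and the toy noise matrix are illustrative, not from the paper.

```python
import numpy as np

def update_corr(Phi_prev, y_m, forget):
    """Recursive estimate: Phi_m = forget * Phi_{m-1} + (1 - forget) * y_m y_m^H."""
    return forget * Phi_prev + (1 - forget) * np.outer(y_m, y_m.conj())

def estimate_speech_corr(Phi_y, Phi_v):
    """Phi_x = Phi_y - Phi_v, with negative eigenvalues floored at zero."""
    Phi_x = Phi_y - Phi_v
    # symmetrize before eigh so small numerical asymmetries do not matter
    w, U = np.linalg.eigh(0.5 * (Phi_x + Phi_x.conj().T))
    return (U * np.maximum(w, 0.0)) @ U.conj().T

# toy usage: run the recursion, then subtract an (assumed known) noise matrix
rng = np.random.default_rng(2)
Phi_y = np.eye(3, dtype=complex)
for _ in range(200):
    y = rng.normal(size=3) + 1j * rng.normal(size=3)
    Phi_y = update_corr(Phi_y, y, 0.8)

Phi_x = estimate_speech_corr(Phi_y, 1.5 * np.eye(3))
print(np.all(np.linalg.eigvalsh(Phi_x) >= -1e-8))   # True: PSD after flooring
```

The flooring step matters in practice: subtracting two noisy estimates routinely produces small negative eigenvalues that would otherwise make the speech correlation matrix indefinite.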
Contrary to the multiframe beam-forming algorithms, which make no assumptions on the IFC model, an appropriate model must be adopted to derive the MFMAP estimators. This model represents the relation between the amplitude and phase of the clean speech in the complex STFT domain at time frame $(m-i)$, $X_{m-i}$, and the amplitude and phase of the clean speech at time frame $m$, $X_m$. In this work, to address this issue, we propose a linear model for the amplitudes, that is,
$$A_{m-i} = C_{i,m} A_m. \tag{10}$$
A linear model simplifies the derivation of the estimators because, as will be shown in the following, it results in quadratic expressions. We also model the phase difference between consecutive frames by multiplication with an exponential factor (considering the time-shift property of the fast Fourier transform). Hence, the IFC is fully modelled by a complex factor $S_{i,m} = C_{i,m} e^{-j\beta_{i,m}}$ (where $C_{i,m}$ affects the amplitude and $\beta_{i,m}$ indicates the phase shift). The correlation between $X_{m-i}$ and $X_m$ can be modelled as
$$X_{m-i} = S_{i,m} X_m, \tag{11}$$
where $C_{i,m}$ is defined via
$$C_{i,m} = \sqrt{\frac{\sigma_x^2(m-i)}{\sigma_x^2(m)}}. \tag{12}$$
According to Equation (11), the current and past noisy signals can be written in terms of the clean speech amplitude and phase as follows:
$$Y_{m-i} = C_{i,m} A_m e^{j(\alpha_m - \beta_{i,m})} + V_{m-i}, \quad i = 0, \ldots, L-1. \tag{13}$$
Obviously, $C_{0,m} = 1$ and $\beta_{0,m} = 0$. Later, we explain how to estimate the $\beta_{i,m}$'s.
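As a concrete illustration of the complex IFC factor (with arbitrary example values, not from the paper): multiplying $X_m$ by $S_{i,m} = C_{i,m} e^{-j\beta_{i,m}}$ scales the amplitude by $C_{i,m}$ and shifts the phase by $-\beta_{i,m}$.

```python
import numpy as np

# Illustrative values only: A_m = 2, alpha_m = 0.7, C = 0.9, beta = 0.3.
X_m = 2.0 * np.exp(1j * 0.7)
C, beta = 0.9, 0.3

S = C * np.exp(-1j * beta)   # complex IFC factor S_{i,m}
X_prev = S * X_m             # model for X_{m-i}

print(abs(X_prev))           # ~1.8  (= C * A_m)
print(np.angle(X_prev))      # ~0.4  (= alpha_m - beta)
```

This is exactly why a single complex factor per lag suffices: it carries both the amplitude ratio and the phase advance between the two frames.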
Assuming that the noise signals at different frames are independent, the conditional pdf of $\mathbf{y}_m$ given the amplitude and phase of the clean speech signal at frame $m$ is given by
$$p(\mathbf{y}_m \mid A_m, \alpha_m) = \prod_{i=0}^{L-1} \frac{1}{\pi \sigma_v^2(m-i)} \exp\!\left(-\frac{\left|Y_{m-i} - C_{i,m} A_m e^{j(\alpha_m - \beta_{i,m})}\right|^2}{\sigma_v^2(m-i)}\right), \tag{14}$$
where the variance of the clean signal, $\sigma_x^2(m-i)$, and the variance of the noise signal at time frame $m-i$, $\sigma_v^2(m-i)$, can be found as the diagonal elements of the speech and noise correlation matrices, respectively, that is,
$$\sigma_x^2(m-i) = \left[\Phi_{\mathbf{x}_m}\right]_{i+1,\,i+1}, \qquad \sigma_v^2(m-i) = \left[\Phi_{\mathbf{v}_m}\right]_{i+1,\,i+1}. \tag{15}$$
In the following, we employ the MAP criterion for single-microphone speech enhancement, resulting in multiframe MAP estimators. We first develop a joint MFMAP estimator for both the phase and amplitude of the clean speech signal, which can be interpreted as an extension of the single-microphone joint MAP estimator presented in [4] to the multiframe case. Next, we derive two estimators for the amplitude only, while the phase is kept unchanged.

| Joint multiframe MAP estimator for phase and amplitude
Here, we propose an algorithm that estimates both the spectral phase and amplitude by maximizing the posterior distribution of $A_m$ and $\alpha_m$ given the noisy observed vector $\mathbf{y}_m$ as
$$(\hat{A}_m, \hat{\alpha}_m) = \arg\max_{A_m,\, \alpha_m} \frac{p(\mathbf{y}_m \mid A_m, \alpha_m)\, p(A_m, \alpha_m)}{p(\mathbf{y}_m)}. \tag{16}$$
Because the denominator is independent of the phase and amplitude, it is sufficient to maximize only the numerator, as
$$(\hat{A}_m, \hat{\alpha}_m) = \arg\max_{A_m,\, \alpha_m} p(\mathbf{y}_m \mid A_m, \alpha_m)\, p(A_m, \alpha_m). \tag{17}$$
To find the optimal JMFMAP estimator, we combine Equations (14) and (3) and obtain the numerator of Equation (16). Because ln(⋅) is a monotonically increasing function, the maximization problem in Equation (17) is equivalent to maximizing its natural logarithm. Ignoring the terms that do not affect the optimization, we proceed with
$$J(A_m, \alpha_m) = \ln A_m - \frac{A_m^2}{\sigma_x^2(m)} + \sum_{i=0}^{L-1} \frac{2 C_{i,m} A_m R_{m-i} \cos(\vartheta_{m-i} + \beta_{i,m} - \alpha_m) - C_{i,m}^2 A_m^2}{\sigma_v^2(m-i)}. \tag{19}$$

3.1.1 | Joint multiframe MAP estimator to extract the clean signal phase

Considering the important role of phase processing, we first focus on exploiting the phase of the clean signal. This is carried out by maximizing Equation (19) with respect to $\alpha_m$. To this end, by differentiating Equation (19) with respect to $\alpha_m$ and setting the result to zero, we obtain
$$\sum_{i=0}^{L-1} \frac{C_{i,m} R_{m-i}}{\sigma_v^2(m-i)} \sin(\vartheta_{m-i} + \beta_{i,m} - \alpha_m) = 0. \tag{20}$$
To find the optimal JMFMAP estimator of the phase, we use the trigonometric relation [20] indicating that the sum of a finite number of sine terms with different phases and amplitudes can be represented as one single sine term, whose amplitude and phase follow from the sum of the corresponding complex phasors. Consequently, the optimal estimator of the phase is obtained as
$$\hat{\alpha}_m = \arg\left\{\sum_{i=0}^{L-1} n_{i,m}\, e^{j(\vartheta_{m-i} + \beta_{i,m})}\right\}, \tag{24}$$
where $n_{i,m}$ is proportional to $C_{i,m} R_{m-i} / \sigma_v^2(m-i)$. In the case of $L = 1$ (i.e. exploiting the current frame only), Equation (24) becomes equal to the noisy phase, which is also the joint MAP estimate of the phase in the single-microphone case, as presented in [4].
To provide the optimal estimation of the phase, it is required to compute $n_{i,m}$ and $\beta_{i,m}$. To this end, we consider the a priori and a posteriori SNRs, respectively defined as
$$\zeta_{m-i} = \frac{\sigma_x^2(m-i)}{\sigma_v^2(m-i)}, \qquad \gamma_{m-i} = \frac{R_{m-i}^2}{\sigma_v^2(m-i)}, \tag{25}$$
where $R_{m-i}^2$ is directly computed from the amplitude of the noisy (observed) signals. Using the above definitions and Equation (12), $n_{i,m}$ is written in terms of the a priori and a posteriori SNRs. To estimate $\beta_{i,m}$, we consider the estimation of signal parameters via rotational invariance techniques (ESPRIT) algorithm [21], which is usually employed for DOA estimation using microphone arrays. When the array comprises two subarrays that differ only in phase, the ESPRIT algorithm is convenient for exploiting the phase difference. A brief introduction to the ESPRIT algorithm is provided in the Appendix.
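The SNR definitions above translate directly into code. The sketch below (illustrative helper name; the correlation matrices and noisy amplitudes are assumed given) reads the per-lag variances off the diagonals and forms the two SNRs.

```python
import numpy as np

def snrs(Phi_x, Phi_v, r_m):
    """A priori and a posteriori SNRs per lag i = 0..L-1 (sketch)."""
    sigma2_x = np.real(np.diag(Phi_x))   # sigma_x^2(m - i) from speech matrix
    sigma2_v = np.real(np.diag(Phi_v))   # sigma_v^2(m - i) from noise matrix
    zeta = sigma2_x / sigma2_v           # a priori SNR
    gamma = r_m ** 2 / sigma2_v          # a posteriori SNR (noisy amplitudes)
    return zeta, gamma

# toy values: L = 2, speech variances [4, 2], noise variances [1, 0.5]
zeta, gamma = snrs(np.diag([4.0, 2.0]), np.diag([1.0, 0.5]), np.array([3.0, 1.0]))
print(zeta)    # [4. 4.]
print(gamma)   # [9. 2.]
```
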
Inspired by the idea behind ESPRIT, we present a method to estimate $\beta_{i,m}$. For example, to estimate $\beta_{2,m}$ (the phase difference between the clean signal at the $m$-th time frame, $X_m$, and the clean signal at the $(m-2)$-th time frame, $X_{m-2}$), only the noisy observed signals are applied, such as
$$\begin{bmatrix} Y_m \\ Y_{m-2} \end{bmatrix} = \begin{bmatrix} 1 \\ C_{2,m} e^{-j\beta_{2,m}} \end{bmatrix} X_m + \begin{bmatrix} V_m \\ V_{m-2} \end{bmatrix}. \tag{27}$$
It should be noted that the noisy observed signal at each TFU is considered as one subarray in our problem. The model presented in Equation (27) makes it convenient to estimate $\beta_{2,m}$ [21]. To this end, we first need to define a new vector containing $Y_m$ and $Y_{m-2}$, that is, $\tilde{\mathbf{z}}_{2,m} = [Y_m, Y_{m-2}]^T$, and compute its correlation matrix $\hat{\Phi}_{\tilde{\mathbf{z}}_{2,m}}$ using a recursive approach similar to Equation (7). The ESPRIT algorithm is then applied to $\hat{\Phi}_{\tilde{\mathbf{z}}_{2,m}}$ to compute $\beta_{2,m}$. This procedure is repeated until all $\beta_{i,m}$'s have been calculated.
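The ESPRIT-style idea can be sketched on synthetic data. With two one-element "subarrays" $[Y_m]$ and $[Y_{m-2}]$, the rank-one signal part of the correlation matrix of $\tilde{\mathbf{z}}_{2,m}$ has principal eigenvector proportional to $[1,\ C e^{-j\beta}]^T$, so $\beta$ can be read off the angle of the component ratio. All values below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n, C_true, beta_true = 100_000, 0.9, 0.5

# synthetic clean coefficients and the two noisy "subarray" observations
X = rng.normal(size=n) + 1j * rng.normal(size=n)
Y0 = X + 0.1 * (rng.normal(size=n) + 1j * rng.normal(size=n))
Y2 = C_true * np.exp(-1j * beta_true) * X \
     + 0.1 * (rng.normal(size=n) + 1j * rng.normal(size=n))

Z = np.stack([Y0, Y2], axis=1)
Phi = Z.T @ Z.conj() / n           # sample estimate of E{z z^H}
w, U = np.linalg.eigh(Phi)
u = U[:, -1]                       # principal eigenvector (signal subspace)
beta_hat = -np.angle(u[1] / u[0])  # rotation between the two subarrays
print(beta_hat)                    # ~ 0.5
```

The sign ambiguity of the eigenvector cancels in the ratio `u[1] / u[0]`, which is why the phase difference is identifiable from the correlation matrix alone.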
3.1.2 | Joint multiframe MAP estimator to extract the amplitude of the clean signal

In this subsection, the proposed JMFMAP estimator aims to find the optimal spectral amplitude. To this end, after computing the spectral phase based on Equation (24), we maximize the posterior distribution in Equation (19) with respect to $A_m$. Differentiating with respect to $A_m$ and setting the result equal to zero yields a quadratic function in terms of $A_m$, which can be rewritten in terms of the a priori and a posteriori SNRs using Equation (26). The JMFMAP estimator of the amplitude is computed by applying a gain factor to the noisy amplitude (i.e. $\hat{A}_m = G_m R_m$), where the gain $G_m$ is obtained as the positive root of this quadratic equation (Equation (33)). Again, it can be seen that in the case of $L = 1$, Equation (33) simplifies to the joint MAP estimate of the amplitude in the single-microphone case, as presented in [4].

Briefly, the proposed JMFMAP estimator operates as follows:
1. In each TFU, define the noisy signal vector as in Equation (5).
2. Compute the correlation matrices of Equations (7)-(9).
3. Compute the variances of the clean and noise signals based on Equation (15).
4. Compute the a priori and a posteriori SNRs according to Equation (25).
5. Compute $n_{i,m}$ as stated by Equation (26) and $\beta_{i,m}$ using ESPRIT.
6. Compute the phase and amplitude according to Equations (24) and (33), respectively.
So far, we have presented the JMFMAP estimators for the phase and amplitude of the clean signal. It was observed that the phase difference between consecutive frames is required; for this purpose, we employed an ESPRIT-like algorithm. In the following, we propose two other estimators that extract only the amplitude while keeping the phase unprocessed.

| Multiframe MAP estimators of amplitude
In this section, we present two different MFMAP estimators of amplitude (or briefly, MFMAPEAs) that respectively utilize the noisy signal vector ($\mathbf{y}_m$) or only its amplitude ($\mathbf{r}_m$). The first estimator can be considered an extension of the single-microphone MAP estimator of amplitude (by Wolfe and Godsill in [4]) to the multiframe case. The second one builds on the multimicrophone MAP estimator by Lotter et al. in [8].

| Multiframe MAP estimator of amplitude given the noisy signal vector
Assuming the noisy signal vector is available, we propose an estimator that extracts the clean signal's amplitude and keeps its phase unchanged. The MFMAPEA-Sig estimator only considers the spectral amplitude and aims to maximize the posterior distribution of $A_m$ given the noisy vector, that is,
$$\hat{A}_m = \arg\max_{A_m} p(A_m \mid \mathbf{y}_m). \tag{34}$$
Similar to what was described in Section 3.1, it is sufficient to maximize only the numerator in Equation (34).
In this section, our main objective is to estimate only the amplitude. Using Equation (14), the joint pdf $p(\mathbf{y}_m, A_m)$ is obtained by integrating the joint pdf of $\mathbf{y}_m$, $A_m$, and $\alpha_m$ over the phase. The integral is simplified as in [9]: the summation of the cosine terms is first combined into a single cosine term [9], and the phase integral then yields $I_0(\cdot)$, the modified Bessel function of the first kind and order zero. Ignoring the terms that do not affect the optimization, we proceed with $\ln p(\mathbf{y}_m, A_m)$. The spectral amplitude that maximizes the posterior distribution given the noisy vector is computed by differentiating Equation (42) with respect to $A_m$ and setting the result equal to zero. This yields a quadratic function in terms of $A_m$, which can be written in terms of the a priori and a posteriori SNRs using Equations (26) and (31). The MFMAPEA-Sig estimator is computed by applying a gain factor to the noisy amplitude (i.e. $\hat{A}_m = G_m R_m$), where the gain $G_m$ is obtained as the positive root of this quadratic equation (Equation (45)). We also observe that in the special case of $L = 1$, Equation (45) simplifies to the MAP estimate of the amplitude in the single-microphone case, as presented in [4].
Briefly, the proposed MFMAPEA-Sig operates similarly to the JMFMAP estimator: in each TFU, the noisy signal vector is defined as in Equation (5) and the correlation matrices, variances, and SNRs are computed as before, but only the amplitude is estimated, according to Equation (45), while the noisy phase is kept unprocessed. Again, it is seen that we need to calculate the phase difference between consecutive frames, which in turn requires an additional procedure and extra computational load. To overcome this problem and provide a less complex estimator, in the next subsection we propose another MFMAP estimator for the amplitude that does not require any knowledge of the parameter $\beta_{i,m}$.

| Multiframe MAP estimator of amplitude given the amplitude of the noisy vector
In [8], Lotter et al. proposed a direction-independent MAP estimator for amplitude in the multimicrophone case given the noisy signal's amplitude. Inspired by their work, we propose our second MFMAPEA, which aims to maximize the posterior distribution of A m given the amplitude of the noisy signal vector (instead of the whole noisy signal vector).
We propose the MFMAPEA-Amp algorithm, which can be considered an extension of the multichannel MAP estimator (presented in [8]) to the multiframe case.
The optimization problem can be written as follows:
$$\hat{A}_m = \arg\max_{A_m} p(A_m \mid \mathbf{r}_m). \tag{46}$$
Because the steps to obtain the optimal gain are similar to those in Lotter et al.'s work, we refer the reader to [8] for more details. In our proposed MFMAPEA-Amp, instead of the amplitudes of the noisy signals at different microphones, we utilize the amplitudes of the current and a finite number of past noisy (observed) signals.
The proposed MFMAPEA-Amp is practically computed by applying a gain factor to the noisy amplitude (i.e. $\hat{A}_m = G_m R_m$), where the gain $G_m$ is a function of the a priori and a posteriori SNRs $\zeta_{m-i}$ and $\gamma_{m-i}$, $i = 0, \ldots, L-1$. Briefly, the proposed MFMAPEA-Amp operates as the MFMAPEA-Sig does: in each TFU, the noisy signal vector is defined as in Equation (5) and the correlation matrices and SNRs are computed, but neither $n_{i,m}$ nor $\beta_{i,m}$ is required, and the amplitude is obtained directly from the gain $G_m$.

| SIMULATION RESULTS
In this section, to demonstrate the important role of interframe correlation, we report our comparative evaluations of the noise reduction performance of the proposed MFMAP estimators and three baseline methods: the joint MAP (JMAP) estimator of both phase and amplitude presented by Wolfe and Godsill in [4], in which the correlation between successive frames is ignored (the first two proposed estimators, JMFMAP and MFMAPEA, are generalizations of the JMAP to the multiframe case); the benchmark MFWFs [10]; and the speech-distortion weighted interframe Wiener filter (SDW-IFWF) [16], in which the current and a finite number of past noisy signals are utilized. The trade-off parameter in the SDW-IFWF was set to μ = 0.9, which gave the best performance in our experiments. To demonstrate the potential of phase processing for noise reduction, we also use the oracle value of β and evaluate the JMFMAP estimator's performance when the accurate estimate of β is provided (named JMFMAP-β; β indicates the spectral speech phase difference between consecutive frames).
The performance of the considered estimators is compared in terms of four objective measures, namely perceptual evaluation of speech quality (PESQ) [22], short-time objective intelligibility (STOI) [23], segmental SNR (SegSNR) [24] improvement between the enhanced signal and the noisy reference signal, and finally, log-spectral distortion (LSD) [25]. The clean speech signal has been used as the reference signal.
The PESQ value is used as a measure for speech quality. STOI is a measure to evaluate the intelligibility of speech. SegSNR considers both noise reduction and speech distortion, and LSD value is defined as a measure for speech distortion; the lower the LSD, the lower the speech distortion. To compute the SegSNR values, an ideal voice activity detection (VAD) is applied to distinguish the speech frames.
During the first experiment, we used samples of eight male and eight female speakers from the TIMIT database [26] as the clean speech samples. TIMIT is a well-known clean-speech data set widely used to artificially produce noisy speech by adding different noise signals. The presented evaluation results are the average over these 16 samples. Additive noises corrupt the microphone signals at input full-band SNRs ranging from −10 to 15 dB. Signals are sampled at f_s = 16 kHz. The STFT processing is implemented using NFFT = 256 with 75%-overlapping frames and a Hamming analysis window.
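The analysis setup just described (16 kHz sampling, NFFT = 256, 75% overlap, Hamming window) can be sketched with a minimal NumPy STFT; `stft_frames` is an illustrative helper, not the implementation used in the paper.

```python
import numpy as np

fs, nfft, hop = 16_000, 256, 64     # 75% overlap -> hop = nfft // 4
window = np.hamming(nfft)

def stft_frames(signal, nfft=nfft, hop=hop):
    """Windowed one-sided STFT; returns an array of shape (frames, bins)."""
    n = 1 + (len(signal) - nfft) // hop
    frames = np.stack([signal[m * hop:m * hop + nfft] for m in range(n)])
    return np.fft.rfft(frames * window, n=nfft, axis=1)

# 1 s of a 1 kHz tone: the spectral peak should land on bin 1000 / 62.5 = 16
S = stft_frames(np.sin(2 * np.pi * 1000 * np.arange(fs) / fs))
print(S.shape)                 # (247, 129): frames x one-sided frequency bins
k_peak = int(np.argmax(np.abs(S[10])))
print(k_peak * fs / nfft)      # 1000.0 Hz
```

With these parameters each frame spans 16 ms and the hop is 4 ms, which is the time grid on which the interframe correlation is measured.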
To apply the JMAP estimation, it is required to compute $\zeta_m$ and $\gamma_m$, given by
$$\zeta_m = \frac{\sigma_x^2(m)}{\sigma_v^2(m)}, \qquad \gamma_m = \frac{R_m^2}{\sigma_v^2(m)}. \tag{48}$$
In practice, a decision-directed approach [2] is applied to estimate $\zeta_m$ as
$$\hat{\zeta}_m = a_\zeta \frac{\hat{A}_{m-1}^2}{\sigma_v^2(m-1)} + (1 - a_\zeta) \max(\gamma_m - 1, 0), \tag{49}$$
where $a_\zeta$ is a forgetting factor. In addition, $R_m^2$ is directly computed using the amplitude of the noisy signal at time frame $m$.
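The decision-directed recursion is the standard form from [2]; a minimal sketch with illustrative values:

```python
import numpy as np

def decision_directed(A_prev, sigma2_v_prev, gamma_m, a_zeta=0.98):
    """Decision-directed a priori SNR estimate:
    zeta_hat_m = a * A_hat_{m-1}^2 / sigma_v^2(m-1) + (1 - a) * max(gamma_m - 1, 0)."""
    return a_zeta * (A_prev ** 2) / sigma2_v_prev \
        + (1 - a_zeta) * np.maximum(gamma_m - 1.0, 0.0)

# illustrative values: previous amplitude estimate 2, unit noise variance,
# current a posteriori SNR 5  ->  0.98 * 4 + 0.02 * 4 = 4.0
zeta = decision_directed(A_prev=2.0, sigma2_v_prev=1.0, gamma_m=5.0)
print(round(float(zeta), 2))   # 4.0
```

The max(·, 0) clamp keeps the a priori SNR nonnegative in frames where the a posteriori SNR drops below 0 dB.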
In the following, we evaluate the noise reduction performance in two scenarios: (1) the correlation matrix of the noise is updated based on the noise signals, according to Equation (8), to avoid the effect of the estimation error of the noise variance; and (2) only the noisy signals are available.

| Correlation matrix of noise updated based on noise signals
First, we experimentally found that the best performance is achieved by choosing λ_y = λ_v = 0.8 in Equations (7) and (8), respectively. In addition, we set a_ζ = 0.98 in Equation (49), as presented in [2].
For white Gaussian noise at input full-band SNR = 5 dB, Figure 1 depicts the performance of the proposed MFMAP estimators in terms of PESQ, STOI, SegSNR improvement, and LSD for several numbers of successive frames L.
We observe from Figure 1(a) that the PESQ improvement provided by selecting L > 1 is better than that for L = 1 for all MFMAP estimators. It also shows that L = 5 results in the best PESQ performance for all MFMAP estimators. This is mainly because, for L = 1, each TFU is processed independently, disregarding the interframe correlation. However, for speech enhancement in single-microphone scenarios, it is essential to utilize all available information in the recorded signal. Indeed, the harmonic and quasi-periodic nature of vowels and some voiced speech leads to slowly varying amplitudes, introducing a high interframe correlation between consecutive frames.
In terms of STOI, Figure 1(b) shows that all MFMAP estimators take the largest value by selecting L = 4. As expected, JMFMAP-β and JMFMAP, which estimate the clean speech phase, outperform the others. This result can be supported by [17,27], verifying the perceptual importance of phase. In [17,27], it was shown that the enhanced spectral speech phase has an important role in improving speech quality and intelligibility.
In terms of SegSNR, Figure 1(c) indicates that the SegSNR value for L > 1 is improved compared with L = 1 for JMFMAP-β and MFMAPEA-Amp. For the JMFMAP and MFMAPEA-Sig estimators, although the SegSNR improvement for L = 2 is slightly greater than for L = 1, the change is not substantial. This can be justified by the errors in the β-estimation procedure, which reduce JMFMAP and MFMAPEA-Sig performance. However, the MFMAPEA-Amp estimator, which makes no assumption on β and consequently does not need to estimate this parameter, performs well. These results emphasize that the impact of the β-parameter is crucial. Figure 1(d) illustrates that the LSD value of JMFMAP-β, JMFMAP, and MFMAPEA-Sig for L = 2 is slightly lower than for L = 1. The best performance for MFMAPEA-Amp is obtained by selecting L = 5. Considering the overlap of the speech signal and the noise in both the time and frequency domains, part of the speech signal is also eliminated when we reduce the noise. Therefore, noise reduction usually comes at the price of signal distortion.
Note that by increasing L, the proposed linear model in Equation (10) may become inaccurate and introduce some errors. We select the number of successive frames as L = 4 in the following simulations. In our simulations, the STFT processing was implemented using NFFT = 256 at the sampling frequency f_s = 16 kHz, yielding 32 ms of temporal correlation. Table 1 depicts the performance of the proposed MFMAP estimators, the joint MAP estimator, the MFWF, and the SDW-IFWF for additive white Gaussian noise at input full-band SNRs ranging from −10 to 15 dB. To provide a fair comparison when selecting L = 4 for the multiframe processing, we apply a 32 ms frame length for the JMAP, which only considers the information of the current frame (L = 1). We have highlighted the best performance for each SNR in bold.
It can be seen that the MFWF and SDW-IFWF provide greater PESQ improvement. As expected, among the proposed MFMAP estimators, JMFMAP-β delivers the best performance for all input full-band SNRs, followed by MFMAPEA-Amp, which assumes nothing about the parameter β_{i,m}. It should be noted that although the JMFMAP estimator extracts both the spectral phase and amplitude, the estimation errors of β_{i,m} impair its performance. To estimate the β_{i,m}'s, we apply the ESPRIT algorithm, which is sensitive to the number of sensors in each subarray. Considering the single-microphone nature of our proposed method, each subarray consists of only one observation; this may introduce some error in computing β_{i,m}. This leads to the conclusion that further investigation is required to improve the robustness of JMFMAP and MFMAPEA-Sig against β_{i,m}-estimation errors or to improve the estimation accuracy.
Concerning STOI, the MFWF and SDW-IFWF outperform the remaining estimators in low and mid SNRs. It is also seen that the multiframe MAP estimators deliver higher STOI performance than JMAP, especially in low SNRs. Comparing the proposed MFMAP estimators, JMFMAP-β, as expected, yields the best performance. Moreover, the performance of the JMFMAP estimator and MFMAPEA-Sig is slightly better than that of MFMAPEA-Amp.
Regarding SegSNR, Table 1 indicates that the MFWF, JMFMAP-β, and the joint MAP estimator perform best in low, mid, and high SNRs, respectively. Moreover, MFMAPEA-Amp obtains a greater SegSNR improvement than JMFMAP-β in low SNRs.
Finally, we investigate LSD in the presence of white Gaussian noise. After the SDW-IFWF and MFWF, the MFMAPEA-Amp yields the best performance among the multiframe estimators. Furthermore, the JMAP estimator delivers the best performance in high SNRs.
It should be noted that there are two parallel comparisons. Compared with the JMAP estimator, which ignores the correlation between successive frames, the proposed MFMAP estimators achieve a significant speech quality improvement. These results demonstrate that it is possible to improve some existing single-frame Bayesian methods (in this case, MAP estimators) using the interframe correlation concept. On the other side, compared with the multiframe beamforming-based algorithms, such as MFWF or SDW-IFWF, at least when the exact VAD is available, the proposed multiframe MAP-based algorithms achieve less improvement, especially in low SNRs.

FIGURE 1 Performance of the proposed multiframe MAP estimators in terms of PESQ, STOI, SegSNR improvement, and LSD in the case of white noise at input full-band SNR = 5 dB for several numbers of successive frames. LSD, log-spectral distortion; MAP, maximum a posteriori; PESQ, perceptual evaluation of speech quality; SegSNR, segmental SNR; SNR, signal-to-noise ratio; STOI, short-time objective intelligibility

Figure 2 presents the performance results in the presence of different types of additive input noise, including 'stationary pink', 'non-stationary babble', 'low-frequency factory', and 'high-frequency printer' noises. In addition, to provide a more realistic scenario, we used real recorded signals (from [28]), where the clean speech signals were selected from the public audiobook database, LibriVox, and the noise signals were drawn from AudioSet [29]. The signals are recorded at a sampling frequency of f_s = 16 kHz. The noisy signals are created as the sum of the clean and noise signals at an input full-band SNR = 5 dB. The STFT processing is implemented using NFFT = 256 with 75%-overlapping frames and a Hamming analysis window. Figure 2 illustrates the average performance of the considered estimators over 16 signals. These results are consistent with the evaluation results shown in Table 1.
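The mixing step described above, summing clean speech and noise at a target full-band SNR, can be sketched as follows; this is an illustrative implementation of the standard procedure, not the authors' code.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db=5.0):
    """Scale the noise so the clean/noise power ratio equals snr_db, then add."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # choose scale such that 10*log10(p_clean / (scale**2 * p_noise)) = snr_db
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise
```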
It is seen that using multiframe estimators (either MAP-oriented or Wiener-based filters), a significant speech quality improvement can be achieved. Although the multiframe MAP estimators, except MFMAPEA-Amp, require knowledge of the β-parameter, and the Wiener filters need to compute the inverse of the noise correlation matrix, the speech quality improvement provided by exploiting the interframe correlation is considerable and cannot easily be ignored. The above-mentioned comparative results have also been validated through an informal mean opinion score (MOS)-like listening test. Links to sample noisy and enhanced files for the realistic scenarios with real recordings are available at https://pws.yazd.ac.ir/sprl/Ranjbaryan-IET-SampleWaves.

| Only noisy signals available
In this case, we possess only the noisy signals. To apply JMAP estimation, σ²_v(m) was computed using the improved minima controlled recursive averaging method [30]. However, to update the noise correlation matrix, we use the recursive approach mentioned in [24], where only noisy signals are available. In the multiframe case, the noise correlation matrix Φ̂_v(m) is given by Equation (51), where λ_n denotes the forgetting factor for noise, and ζ̂_m represents the a priori SNR as defined in Equation (49).
It should be noted that noise statistics do not usually change drastically between successive frames. Consequently, we can utilize Φ̂_v(m−1) instead of Φ̂_v(m) in Equation (51). We assume that the first 10 frames consist of noise only and therefore use them to obtain an initial estimate of the noise correlation matrix. In our extensive experiments, we found that the best performance is achieved by choosing λ_y = λ_n = 0.9.
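A heavily hedged sketch of the recursion structure described above: an exponentially smoothed estimate of the L×L noise correlation matrix with forgetting factor λ_n = 0.9, initialized from the first 10 frames, which are assumed noise-only. The paper's actual update in Equation (51) additionally involves the a priori SNR ζ̂_m; this minimal version only illustrates the forgetting-factor recursion and the noise-only initialization, not the exact weighting.

```python
import numpy as np

def recursive_noise_corr(frames, lam_n=0.9, n_init=10):
    """frames: (M, L) array of multiframe noisy STFT vectors for one frequency bin."""
    L = frames.shape[1]
    # initialize from the first n_init frames, assumed to contain noise only
    Phi_v = np.zeros((L, L), dtype=complex)
    for m in range(n_init):
        y = frames[m][:, None]
        Phi_v += (y @ y.conj().T) / n_init
    # recursive update with forgetting factor lam_n on the remaining frames
    estimates = []
    for m in range(n_init, frames.shape[0]):
        y = frames[m][:, None]
        Phi_v = lam_n * Phi_v + (1.0 - lam_n) * (y @ y.conj().T)
        estimates.append(Phi_v.copy())
    return estimates
```

In practice such an update would be gated by the a priori SNR (or SPP) so that speech-dominated frames do not leak into the noise statistics.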
In addition, in the following, taking the SPP into account, we consider a mixed scenario. Given the low-pass nature of the speech signal, we consider the interframe correlation for the cases where speech is present (SPP(m) = 1). On the other side, in the case of silent frames (SPP(m) = 0), we ignore the multiframe correlation and apply the JMAP estimator. The SPP-based estimator is given by

X̂(m) = SPP(m) X̂_MF(m) + (1 − SPP(m)) X̂_JMAP(m),    (52)

where SPP(m) denotes the SPP at the current time frame, X̂_MF(m) denotes the estimated clean speech signal provided by the multiframe methods (multiframe MAP estimators and Wiener filters), and X̂_JMAP(m) represents the estimated clean speech signal obtained using the JMAP estimator.
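The SPP-weighted combination of Equation (52) is a simple convex blend, sketched below; a soft SPP in [0, 1] interpolates between the multiframe and JMAP estimates, reducing to pure switching at the extremes.

```python
import numpy as np

def spp_combine(spp, x_mf, x_jmap):
    """Blend multiframe and JMAP estimates by speech presence probability."""
    spp = np.asarray(spp)
    return spp * x_mf + (1.0 - spp) * x_jmap
```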
To compute SPP(m), we used the single-microphone SPP estimator proposed in [11], which utilizes both interframe and interband correlations.
Using Equation (52), Table 2 depicts the performance of the considered SPP-based algorithms in the presence of additive white Gaussian noise at input full-band SNRs ranging from −10 to 15 dB. We have highlighted the best performance for each SNR in bold.
In terms of PESQ, the JMFMAP-β estimator delivers the best performance. The next ranks go to MFMAPEA-Amp and MFWF, whose performances are very close.

FIGURE 2 Performance of the joint MAP estimator, proposed MFMAP estimators, MFWF, and SDW-IFWF in terms of PESQ, STOI, SegSNR improvement, and LSD at input full-band SNR = 5 dB for different types of input noise, when the exact VAD is available. LSD, log-spectral distortion; MAP, maximum a posteriori; MFMAP, multiframe MAP; PESQ, perceptual evaluation of speech quality; SDW-IFWF, speech-distortion weighted interframe Wiener filter; SegSNR, segmental SNR; SNR, signal-to-noise ratio; STOI, short-time objective intelligibility; VAD, voice activity detection

It was shown that the performance of the multiframe algorithms (either MFMAP or MFWFs) is highly sensitive to accurate estimation of the noise correlation matrix. Indeed, the highly time-varying IFC leads to an inaccurate correlation matrix. This leads to the conclusion that further investigation is needed to improve the robustness of the multiframe estimators against noise correlation matrix errors. It is also observed that when an accurate estimate of the β-parameter is available, the multiframe MAP estimator has the potential to improve the PESQ values.
Regarding STOI improvement, the comparative performance of the estimators in low SNRs can be ranked as JMFMAP-β, then MFMAPEA-Sig and JMFMAP (very close), then JMAP, MFWF, and MFMAPEA-Amp (very close). In high SNRs, JMAP delivers the best improvement for white Gaussian noise.
In terms of SegSNR, among the multiframe MAP estimators, JMFMAP-β outperforms the rest for all input SNRs, followed by MFMAPEA-Amp. The performance of MFWF and SDW-IFWF lies between those of JMFMAP and MFMAPEA-Sig in mid and high SNRs. It is also seen that JMAP provides the greatest improvement in high SNRs.
Concerning the LSD values, the results for all considered estimators are very close, with the MFWF delivering slightly better results than the others in low SNRs and JMAP in high SNRs.

| CONCLUSION
In this work, statistical speech properties have been combined with the concept of interframe correlation in time-frequency analysis. Assuming Gaussian statistical modelling of the speech and noise signals, several multiframe MAP estimators for the single-microphone case have been introduced. In each TFU, a vector consisting of the observations at the current frame and a finite number of past frames was considered to derive the estimators. Consequently, closed-form solutions for both the phase and the amplitude of the clean speech signal were provided, relying only on second-order statistical properties.
In this contribution, the effect of different numbers of frames was investigated. Simulation results confirmed that it is possible to improve the performance of existing single-frame Bayesian methods using the interframe correlation concept. In addition, the performance of the proposed multiframe MAP estimators was compared with that of JMAP, MFWF, and SDW-IFWF for various kinds of noise. It was observed that, similar to the multiframe beamforming methods, our proposed multiframe MAP estimators deliver significant improvement if the noise correlation matrix is accurately estimated. In addition, using real recorded signals, the practical efficiency of the considered estimators was demonstrated in realistic scenarios. This work can be considered a starting point for combining both statistical speech properties and the IFC concept. A significant advantage of this framework is that it can be extended to different distributions for the clean speech and noise signals. It would also be interesting to investigate hybrid approaches that employ deep learning techniques to obtain noise statistics while assuming Gaussian or super-Gaussian distributions for modelling the speech signals.