On Improvement of Speech Intelligibility and Quality: A Survey of Unsupervised Single Channel Speech Enhancement Algorithms

Single channel speech enhancement (SCSE) [1]-[16] is one of the most widely researched problems in speech-related applications such as Automatic Speech Recognition (ASR) [17], Speaker Identification (SI) [18], and Human-Machine Interaction [19]. The problem arises whenever an interfering noise signal degrades the target speech signal. The interfering noise may be convolutive [20] or additive. Convolutive noise is produced by reverberation. However, additive noise is usually assumed, since this assumption leads to simpler solutions and, in practice, algorithms built on it have achieved more satisfactory results [21] [22]. Additive noise significantly degrades the quality and intelligibility of speech. For this reason, the objective of speech enhancement algorithms is to estimate the underlying clean speech from the noisy speech in order to increase the intelligibility and quality of the noisy speech signal [21] [22]. The fundamental structure of SCSE is shown in Fig. 1. Speech enhancement is required in many everyday situations, for example: (i) talking on a mobile phone in a noisy environment, (ii) listening to a call in a noisy street or factory, and (iii) sitting in a subway or travelling in a car. In these situations, speech enhancement can ease communication by reducing the noise. A number of speech enhancement methods have been designed as front-ends to make ASR systems robust by decreasing the mismatch between the training and testing stages. In ASR, a speech enhancement method is applied to reduce the noise prior to the feature extraction phase. Another important application of speech enhancement is for individuals who use hearing aid devices.
Speech signals are highly redundant, and normal-hearing listeners can comprehend the target speech even at adverse signal-to-noise ratios (SNRs) [23]-[26]. For instance, a normal-hearing individual can comprehend approximately 50% of the words spoken in speech corrupted by multitalker noise at an SNR of 0 dB [27]. However, for individuals with hearing loss, various parts of the speech may be entirely inaudible or significantly distorted. Therefore, the perceived speech signals have little redundancy. Consequently, individuals with hearing loss experience difficulty in noisy environments [28]-[30]. Considerable attention is therefore given to designing robust speech enhancement algorithms that decrease listening effort and improve speech intelligibility [31] [32]. Combinations of such algorithms with contemporary digital signal processing systems are implemented in a number of speech-related devices.
Single-channel speech enhancement algorithms are divided into two major categories: Supervised SCSE (S-SCSE) algorithms and Unsupervised SCSE (U-SCSE) algorithms. In U-SCSE algorithms, a statistical model is assumed for the speech and noise, and the underlying clean speech is estimated from the noisy speech without prior knowledge of the speaker identity or the noise. Thus, no supervision or classification of the signals is required. In contrast, S-SCSE algorithms use trained models for speech and noise. The model parameters are learned from training samples of speech and noise, the separate speech and noise models are combined, and the speech enhancement task is then performed. In this category, therefore, prior supervision and classification of the speech or noise type is required. This paper presents a survey of U-SCSE algorithms.
The remainder of the paper is organized as follows: Section II presents an extensive review of U-SCSE algorithms in terms of speech intelligibility and quality. Section III presents experiments performed to evaluate the speech intelligibility and quality of U-SCSE algorithms. Section IV presents the concluding remarks of the survey. Finally, Section V presents important research problems that require further study.

II. Classification of U-SCSE Algorithms
This category covers a wide range of U-SCSE algorithms; however, the general classification is not limited to the algorithms presented here. U-SCSE algorithms rely on a statistical model, and the underlying clean speech is estimated from the noisy input utterances without prior knowledge of the speaker identity or the noise. A general classification and the fundamental framework of U-SCSE algorithms are shown in Figs. 2 and 3. In the following subsections, we provide a taxonomy-based review of U-SCSE algorithms.

A. Spectral Subtraction-based Speech Enhancement Algorithms
Spectral subtraction (SS) is simple and effective, and it is one of the earliest methods proposed for reducing noise distortion. The noise is assumed to be additive. SS-based speech enhancement was initially proposed by Boll [33]. In SS, an estimate of the underlying clean speech spectrum is obtained by subtracting an estimate of the noise spectrum from the noisy spectrum. The noise spectrum is estimated and updated during pause periods, i.e., in the absence of speech. Such algorithms rest on two hypotheses: (i) the noise is a stationary or slowly varying process, and (ii) the noise spectrum does not vary drastically between update periods. The enhanced speech is obtained by inverse transforming the estimated spectrum using the noisy phase. Following the basic principle of SS, assume that a noisy signal z(n) is composed of the clean speech s(n) and the additive noise e(n):

$$z(n) = s(n) + e(n) \tag{1}$$

Computing the STFT of (1), we obtain:

$$Z(\omega,k) = S(\omega,k) + E(\omega,k) \tag{2}$$

where ω denotes the frequency and k the frame index. The estimated noise magnitude spectrum $|\hat{E}(\omega,k)|$ is subtracted from the noisy speech magnitude spectrum $|Z(\omega,k)|$, and the inverse Fourier transform of the difference spectrum is taken using the noisy phase to produce the enhanced speech signal:

$$\hat{s}(n) = \mathrm{IDFT}\left\{ \left( |Z(\omega,k)| - |\hat{E}(\omega,k)| \right) e^{j \angle Z(\omega,k)} \right\} \tag{3}$$

Since noise signals in real-world environments are non-stationary and time-variant, the subtraction in (3) can produce negative values for the estimated clean speech magnitude spectrum, which results in a musical noise artifact in the enhanced speech. Recent research has sought to reduce the musical noise artifact. Some highly ranked studies on SS for speech enhancement are reviewed below.
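The subtraction principle described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any surveyed paper. It assumes NumPy is available, that the first few frames contain noise only (a stand-in for a pause detector), and that negative differences are replaced by a small spectral floor. The function name and the parameters `noise_frames`, `frame_len`, `hop` and `floor` are hypothetical.

```python
import numpy as np

def spectral_subtraction(z, noise_frames=5, frame_len=256, hop=128, floor=0.01):
    """Magnitude spectral subtraction; returns the enhanced time-domain signal."""
    # Periodic Hann window: copies overlapped by 50% sum to one, so plain
    # overlap-add reconstructs the interior of the signal.
    window = np.hanning(frame_len + 1)[:-1]
    n_frames = 1 + (len(z) - frame_len) // hop
    frames = np.stack([z[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    # Assumption: the first `noise_frames` frames are speech pauses.
    noise_mag = mag[:noise_frames].mean(axis=0)

    # Subtract the noise magnitude; floor the result so it never goes negative.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)

    # Resynthesize with the noisy phase and overlap-add the frames.
    enhanced = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(z))
    for i in range(n_frames):
        out[i * hop:i * hop + frame_len] += enhanced[i]
    return out
```

The flooring step is one common way to handle the negative values mentioned above; the isolated spectral peaks that survive it are what is perceived as musical noise.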
Lu and Loizou [34] proposed a spectral subtraction algorithm based on a geometric approach, which addressed the shortcomings of the traditional SS algorithm. An efficient scheme is proposed to estimate the cross-terms that involve the phase differences between the speech and noise signals. An analysis of the suppression function showed that the algorithm shares the properties of the conventional minimum mean square error (MMSE) algorithm. The evaluation confirmed that the geometric approach performed considerably better than the conventional spectral subtractive algorithm. A similar approach is also presented in [35].
Paliwal et al. [36] explored an unconventional domain, the modulation domain, for the speech enhancement task and demonstrated the capability of SS in this new domain. An analysis-modification-synthesis (AMS) framework is used, and the musical noise artifact is reduced by applying the SS algorithm in the modulation domain. Moreover, the effect of the frame duration on speech quality was examined. The results indicated that frame durations of 180-280 ms gave the best results in terms of spectral distortion and temporal slurring. To further improve speech quality, a fusion with the MMSE method in the short-time spectral domain is presented, in which the magnitude spectra of the two methods are combined. Consistent improvements in speech quality were achieved across different SNRs.
Zhang and Zhao [37] proposed an approach that performs the subtraction on the real and imaginary spectra independently in the modulation domain, so that both an enhanced magnitude and an enhanced phase are obtained through SS. Inoue et al. [38] provided a theoretical analysis, based on higher-order statistics, of the musical noise artifact created by SS. Power SS is taken as the commonly used form. SS is generalized to arbitrary exponent parameters, and the amount of musical noise is compared across several exponent domains. Less musical noise was observed in lower-exponent spectral domains, which offered good quality and intelligible speech.
Miyazaki et al. [39] provided a theoretical analysis of the musical noise artifact in an iterative SS method. Weak nonlinear signal processing is applied iteratively to obtain high-quality speech with little musical noise. The generation of musical noise is formulated by tracking changes in the kurtosis of the noise spectrum. Optimal internal parameters are derived theoretically so that no musical noise is produced, and it is shown that a fixed point in kurtosis yields no musical noise artifact.
Antonio et al. [40] proposed an improved SS-based algorithm for real-time noise cancellation and applied it to gunshot acoustic signals. A pre-processing approach based on a spectral suppression algorithm is applied instead of post-filtering, which requires a priori information about the direction of arrival of the desired signals. Ban and Kim [41] proposed an algorithm for reducing reverberant noise in remote-talking speech recognition. SS is used, and the spectra of the late reverberant signals are estimated from delayed and attenuated versions of the reverberant signals. The unpredictable weight sequences are estimated via a Viterbi decoding method based on the reverberation model. The weight sequences are then replaced with fixed weights in SS without estimating the reverberation time.
Hu and Wang [42] proposed a novel algorithm to separate unvoiced speech from non-speech interference. The voiced speech and the periodic parts of the interference are removed first. The interference then becomes stationary, and the noise energy in unvoiced intervals is estimated using the separated speech in adjacent voiced intervals. SS is applied to create time-frequency segments in unvoiced intervals, and the unvoiced segments are then grouped. The grouping is based on the frequency characteristics of the unvoiced segments, using thresholding and Bayesian classification.
Kokkinakis et al. [43] described and evaluated the capability of SS to suppress late reflections and compared it with the ideal reverberant masking (IRM) approach. The speech intelligibility results indicated that SS can suppress additive reverberant energy to a degree similar to that attained by IRM. Hu and Yu [44] proposed an adaptive noise spectral estimator for subtraction-based speech enhancement techniques. The proposed method derives the noise spectrum adaptively from a preliminary estimate of the noise spectrum together with the current noisy speech spectrum. The fundamental framework of SS remains intact even when the gain of every spectral component is altered. Listening tests confirmed the superiority of the noise adaptation technique in suppressing the musical noise artifact and improving quality.

B. Statistical Model-based Speech Enhancement Algorithms
In the simplest statistical model-based speech enhancement algorithms, the speech and noise signals are assumed to be stationary, so the resulting filter coefficients remain unchanged. Noise suppression can then be realized easily with Finite Impulse Response (FIR) or Infinite Impulse Response (IIR) filters. However, noise sources, and speech signals in particular, are highly non-stationary, because speech generation follows a time-varying process. Using the noisy spectrum Z(ω,k), the short-time noise power spectral density (PSD) and the frequency-domain signal-to-noise ratio (SNR) are therefore estimated to determine the weighting gains. The spectral weighting is performed by multiplying the noisy spectrum Z(ω,k) by the weighting gains G(ω,k), which yields an estimate of the DFT coefficients of the underlying clean speech:

$$\hat{S}(\omega,k) = G(\omega,k)\, Z(\omega,k) \tag{4}$$

The computation of the weighting gains depends on the particular speech enhancement algorithm and is usually a function of the short-term noise PSD estimate and the SNR estimates γ(ω,k) and ξ(ω,k):

$$\gamma(\omega,k) = \frac{|Z(\omega,k)|^2}{\sigma_E^2(\omega,k)}, \qquad \xi(\omega,k) = \frac{\sigma_S^2(\omega,k)}{\sigma_E^2(\omega,k)} \tag{5}$$

where γ(ω,k) and ξ(ω,k) denote the a posteriori and a priori SNR, and $\sigma_S^2$ and $\sigma_E^2$ denote the variances of the clean speech and noise signals. The noise variance $\sigma_E^2$ is estimated during non-speech (pause) periods using the standard recursive equation:

$$\hat{\sigma}_E^2(\omega,k) = \beta\, \hat{\sigma}_E^2(\omega,k-1) + (1-\beta)\, |Z(\omega,k)|^2 \tag{6}$$

where β is the smoothing factor and $\hat{\sigma}_E^2(\omega,k-1)$ is the noise estimate in the previous frame. The a priori SNR can be estimated with the decision-directed (DD) approach [1]:

$$\hat{\xi}(\omega,k) = \alpha\, \frac{|\hat{S}(\omega,k-1)|^2}{\hat{\sigma}_E^2(\omega,k-1)} + (1-\alpha)\, \max\{\gamma(\omega,k) - 1,\ 0\} \tag{7}$$

where α is the weighting parameter, and $|\hat{S}(\omega,k-1)|^2$ and $\hat{\sigma}_E^2(\omega,k-1)$ are the power spectrum estimates of the clean speech and noise at frame k-1, respectively. In the following subsections, prominent and recent statistical speech enhancement algorithms based on Wiener filtering (WF), minimum mean square error (MMSE) estimation, and Gaussian and super-Gaussian models are surveyed.
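The recursive noise update and the decision-directed a priori SNR estimate can be written per frequency bin as follows. This is a hedged sketch in plain Python. The function names are hypothetical, and the default values of the smoothing factor `beta` and the weighting parameter `alpha` are illustrative choices, not values taken from the survey.

```python
def update_noise_psd(prev_noise_psd, noisy_power, beta=0.98):
    """Recursive noise PSD update, to be applied only in speech pauses."""
    return beta * prev_noise_psd + (1.0 - beta) * noisy_power

def a_posteriori_snr(noisy_power, noise_psd):
    """A posteriori SNR: noisy power over the noise PSD estimate."""
    return noisy_power / noise_psd

def decision_directed_snr(prev_clean_power, prev_noise_psd, gamma, alpha=0.98):
    """Decision-directed a priori SNR estimate for the current frame.

    prev_clean_power: |S_hat|^2 of the previous frame
    prev_noise_psd:   noise PSD estimate of the previous frame
    gamma:            a posteriori SNR of the current frame
    """
    ml_term = max(gamma - 1.0, 0.0)  # instantaneous estimate, clipped at zero
    return alpha * prev_clean_power / prev_noise_psd + (1.0 - alpha) * ml_term
```

A weighting parameter close to one smooths the estimate heavily across frames, at the price of a slower reaction to speech onsets.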

Wiener Filtering
Wiener filtering-based speech enhancement minimizes the mean square error (MSE) between the estimated speech magnitude spectrum and the original signal magnitude spectrum. The optimal Wiener filter gain is formulated as follows [45]:

$$G_{WF}(\omega,k) = \frac{\xi(\omega,k)}{1 + \xi(\omega,k)} \tag{8}$$

Over the years, Wiener filtering and its variants have been used for the speech enhancement task. We review some of the highly ranked research studies on WF algorithms.
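As a small illustration of the Wiener weighting rule, the sketch below maps an a priori SNR value to a gain and applies it to noisy DFT bins. The function names and the list-based interface are assumptions made for this example only.

```python
def wiener_gain(xi):
    """Wiener gain as a function of the a priori SNR xi (linear scale)."""
    return xi / (1.0 + xi)

def apply_gain(noisy_bins, xi_bins):
    """Weight each noisy DFT bin by the Wiener gain of its a priori SNR."""
    return [wiener_gain(xi) * z for z, xi in zip(noisy_bins, xi_bins)]
```

The gain approaches one for bins dominated by speech and zero for bins dominated by noise, so a bin at 0 dB a priori SNR is attenuated by half.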
Huijun et al. [46] proposed an SCSE algorithm that exploits connections between time frames to minimize residual noise. In contrast to traditional speech enhancement methods, which apply a post-processor after a standard algorithm such as spectral subtraction, the proposed method applies a hybrid Wiener spectrogram filter (HWSF) to reduce noise, followed by a multi-blade post-processor that exploits two-dimensional features of the spectrogram to preserve speech quality and further reduce the residual noise. Spectrogram comparisons showed that the proposed method significantly reduced musical noise distortion. The usefulness of the method is further confirmed by objective assessments and informal subjective listening tests. Jahangir and Douglas [47] proposed a frequency-domain optimal linear estimator with perceptual post-filtering. The method incorporates the masking properties of the human hearing system to make the residual noise inaudible. A modified way of computing the tonality coefficients and relative threshold offsets is presented for the best possible estimate of the noise masking threshold. The method was evaluated for noise reduction and speech quality under many noisy conditions and yielded better results than [1].
Almajai and Milner [48] examined the use of visual speech information to enhance noisy speech. Visual and audio speech features are analyzed to identify the pair with the highest audio-visual correlation. The research revealed that higher audio-visual correlation exists within individual phonemes than across entire speech. This correlation is exploited in a visually driven Wiener filter, which obtains clean speech and noise power spectrum statistics from the visual features. Clean speech statistics are estimated from the visual features within a maximum a posteriori framework, which is incorporated inside the states of a hidden Markov network to provide phoneme localization. Noise statistics are obtained using a novel audio-visual voice activity detector, which uses visual speech features to make robust speech/non-speech classifications. The method was evaluated subjectively and objectively, and both evaluations confirmed its superiority.
Marwa et al. [49] presented an adaptive Wiener filtering approach for speech enhancement. The approach adapts the filter transfer function from sample to sample based on the speech signal statistics (the local mean and variance). The method is implemented in the time domain to accommodate the time-varying nature of speech. The approach is evaluated against conventional frequency-domain spectral subtraction, wavelet denoising methods and Wiener filtering using different speech quality metrics. The results showed the superiority of the proposed Wiener filtering method.
Xia and Bao [50] proposed a speech enhancement approach based on a Weighted Denoising Auto-encoder (WDA) and noise classification. A weighted reconstruction loss function is introduced into the standard Denoising Auto-encoder (DAE), and the relationship between the power spectra of the clean and noisy speech is expressed by the WDA structure. The sub-band power spectrum of the clean speech is estimated from the noisy speech using the WDA. The a priori SNR is estimated using a Posteriori SNR Controlled Recursive Averaging (PCRA) approach. The enhanced speech is obtained with a Wiener filter in the frequency domain. Moreover, a GMM-based noise classification method is employed to make the proposed method suitable for various conditions. The experimental results demonstrated that the proposed method achieved improved objective speech quality, with effective noise reduction and SNR improvement at low speech distortion.
Kristian and Marc [51] investigated speech-distortion weighted inter-frame Wiener filters for SCSE in a filterbank configuration. The configuration uses a regularization parameter to trade off speech distortion against noise reduction. The method depends on the estimation of inter-frame correlation coefficients, and it is shown that these coefficients can be estimated robustly using a secondary, higher-resolution filterbank. It is then demonstrated that real-valued scalar gains can be applied directly in the higher-resolution filterbank instead of inter-frame filtering in the primary filterbank, which leads to robust noise reduction performance for any value of the regularization parameter.

MMSE Estimators
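The minimum mean square error (MMSE) estimator [1] belongs to an important class of estimators and estimates the spectral magnitudes. The MMSE estimator minimizes the quadratic error of the speech spectral amplitude:

$$\hat{A}(\omega,k) = \arg\min_{\hat{A}} E\left\{ \left( A(\omega,k) - \hat{A} \right)^2 \right\} = E\left\{ A(\omega,k) \mid Z(\omega,k) \right\} \tag{9}$$

where $A(\omega,k) = |S(\omega,k)|$ is the clean speech spectral amplitude. Assuming Gaussian models for the speech and noise, the final weighting rule is given according to [1] as:

$$G_{MMSE}(\omega,k) = \Gamma(1.5)\, \frac{\sqrt{v(\omega,k)}}{\gamma(\omega,k)}\; {}_1F_1\left(-0.5;\ 1;\ -v(\omega,k)\right) \tag{10}$$

$$v(\omega,k) = \frac{\xi(\omega,k)}{1 + \xi(\omega,k)}\, \gamma(\omega,k) \tag{11}$$

where Γ(·) and ${}_1F_1(\cdot)$ denote the Gamma function and the confluent hypergeometric function, respectively. We review some of the highly ranked research studies on MMSE algorithms.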
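The Gaussian-model weighting rule, written with the Gamma function and the confluent hypergeometric function, can be evaluated numerically as sketched below. The example assumes the standard short-time spectral amplitude form of the gain from [1], and it evaluates the hypergeometric function through a plain power series after a Kummer transformation. The function names are hypothetical, and the series is only suitable for moderate arguments (for very large v the exponential underflows).

```python
import math

def kummer_m(a, b, x, max_terms=500):
    """Power series of the confluent hypergeometric function 1F1(a; b; x), x >= 0."""
    total, term = 1.0, 1.0
    for m in range(max_terms):
        term *= (a + m) / (b + m) * x / (m + 1)  # ratio of consecutive terms
        total += term
        if abs(term) < 1e-16 * abs(total):
            break
    return total

def mmse_stsa_gain(xi, gamma):
    """MMSE spectral amplitude gain for a priori SNR xi and a posteriori SNR gamma."""
    v = xi / (1.0 + xi) * gamma
    # Kummer transformation: 1F1(-0.5; 1; -v) = exp(-v) * 1F1(1.5; 1; v),
    # which turns an alternating series into one with positive terms only.
    hyp = math.exp(-v) * kummer_m(1.5, 1.0, v)
    return math.gamma(1.5) * math.sqrt(v) / gamma * hyp
```

At high SNR this gain approaches the Wiener gain ξ/(1+ξ), which gives a convenient sanity check for an implementation.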
Basheera et al. [52] proposed novel optimum linear and nonlinear estimators, derived in the MMSE sense, to reduce the distortion of the original speech. Linear and nonlinear bilateral Laplacian gain estimators are proposed. The observed signal is first decorrelated through a real transform to obtain its moment coefficients, and the estimators are then applied to the speech signal in the decorrelated domain. An analysis of the MSE of the estimators suggests a significant improvement. Kandagatla and Subbaiah [53] derived a joint MMSE estimator of the speech coefficients under phase uncertainty. The uncertain phase is used for amplitude estimation. Furthermore, new phase-blind estimators are designed using the Nakagami power spectral density function and the generalized Gamma distribution as speech and noise priors.
Hamid et al. [54] addressed the problem of speech enhancement using the β-order MMSE-STSA estimator. The advantages of Laplacian speech modeling and of the β-order cost function are combined in the MMSE estimation. An analytical solution is presented for the β-order MMSE-STSA estimator assuming Laplacian priors for the DFT coefficients of the clean speech. A Gaussian distribution is assumed for the real and imaginary parts of the DFT coefficients of the noise. Using approximations of the joint PDF and the Bessel function, an improved closed-form version of the estimator is also presented.
Gerkmann and Krawczyk [55] derived an MMSE-optimal estimator for the clean speech spectral amplitude. It is shown that the phase contains additional information that can be used to distinguish outliers in the noise from the target signal. Matthew and Bernard [56] proposed a Bayesian STSA estimator with a stochastic-deterministic speech model, which incorporates a priori information through a non-zero mean. Analytical expressions are derived for the speech STFT magnitude in the MMSE sense and for the phase in the maximum-likelihood sense. An approach for estimating the a priori stochastic-deterministic speech model parameters is described, based on the harmonically related sinusoidal components in the STFT frames and on the deviations in magnitude and phase of the components between successive STFT frames.

C. Signal Subspace-based Speech Enhancement Algorithms
Signal subspace-based [57] [58] speech enhancement approaches use the Karhunen-Loève transform (KLT), singular value decomposition (SVD) or eigenvalue decomposition (EVD) to decompose the noisy speech into a signal-plus-noise subspace, known as the signal subspace, and an orthogonal noise subspace. The components that fall within the noise subspace are eliminated. The signal-subspace components are processed separately to remove noise using a diagonal gain matrix, which is possible because the components in the subspace are uncorrelated. The entries of the gain matrix are computed by time-domain or spectral-domain estimators. The covariance matrix $R_Z$ of the noisy speech can be written as:

$$R_Z = R_S + R_E \tag{12}$$

where $R_S$ and $R_E$ are the covariance matrices of the clean speech and noise signals. $R_Z$ is assumed to have a higher rank than $R_S$. For white noise, the EVDs of the covariance matrices are given as:

$$R_S = V \Lambda_S V^T, \qquad R_E = \sigma^2 I, \qquad R_Z = V \left( \Lambda_S + \sigma^2 I \right) V^T$$

where $\Lambda_S$ is a diagonal matrix containing the eigenvalues, V is an orthonormal matrix containing the eigenvectors, $\sigma^2$ is the noise variance and I is the identity matrix. The speech enhancement process is represented by a filtering operation on the input speech vector z:

$$\hat{s} = \Psi z$$

The filtering matrix Ψ is given by equation (18):

$$\Psi = V_P\, G_P\, V_P^T \tag{18}$$

where $G_P$ is a diagonal matrix holding the weighted eigenvalues of $R_Z$, $V_P$ contains the eigenvectors that span the signal subspace, and $V_P^T$ and $V_P$ are the KLT and its inverse, respectively. We review highly ranked research studies on signal subspace (SigSub) algorithms.
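A compact sketch of the subspace filtering operation follows. It assumes NumPy, white noise with a known variance, and a time-domain-constrained gain λ/(λ + μσ²) on each eigen-direction. These are choices made for illustration and are not prescribed by the text. The function name and the parameters `order` and `mu` are hypothetical.

```python
import numpy as np

def subspace_enhance(z, noise_var, order=16, mu=1.0):
    """Signal-subspace filtering of a 1-D signal corrupted by white noise."""
    # Stack overlapping length-`order` vectors and estimate the covariance R_Z.
    n_vec = len(z) - order + 1
    vectors = np.stack([z[i:i + order] for i in range(n_vec)])
    r_z = vectors.T @ vectors / n_vec

    # EVD of R_Z; for white noise the clean-speech eigenvalues are the
    # noisy eigenvalues minus the noise variance (clipped at zero).
    eigvals, v = np.linalg.eigh(r_z)
    lam_s = np.maximum(eigvals - noise_var, 0.0)

    # Diagonal gain matrix: zero gain removes the noise subspace entirely.
    gains = lam_s / (lam_s + mu * noise_var)
    psi = v @ np.diag(gains) @ v.T  # KLT, gain, inverse KLT

    # Filter every vector and average the overlapping estimates.
    filtered = vectors @ psi.T
    out = np.zeros(len(z))
    counts = np.zeros(len(z))
    for i in range(n_vec):
        out[i:i + order] += filtered[i]
        counts[i:i + order] += 1
    return out / counts
```

The parameter `mu` plays the role of a Lagrange multiplier: larger values remove more residual noise at the cost of more speech distortion.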
Borowicz and Petrovsky [59] examined speech enhancement methods based on the perceptually motivated signal subspace. Lagrange multipliers are used to modify the spectral-domain-constrained (SDC) estimator. The residual noise power spectrum is shaped with an algorithm that computes the Lagrange multipliers accurately. The proposed approach uses masking phenomena for residual noise shaping and is optimal for the case of colored noise. The results show that the proposed method outperformed the competing methods, providing high noise reduction and improved speech quality.
Mohammad et al. [60] proposed a non-unitary spectral transformation of the residual noise based on the diagonalization of the covariance matrices of the clean speech and noise signals. With this transformation, the optimization problem can be solved without any constraints on the structure of the matrices involved.
Vera [61] pointed out that the estimation of the signal subspace dimension is critical and depends on the noise variance as well as the SNR, both of which fluctuate across temporal segments of speech and across frequency bands. It is therefore proposed to work on frames in all critical bands using a threshold on the noise variance. Belhedi et al. [62] used a soft mask as the core of the proposed approach. The method produces two signals of different quality and makes them available in two separate channels. The channels are classified via fuzzy logic, which needs two parameters: one determines quality and intelligibility, whereas the other determines the gender of the speaker via an F0 tracking method. The proposed approach achieved an average improvement of 59.5% in SIR, 67.9% in PESQ, and 10.5% in TPS.
Sudeep and Kishore [63] proposed a perceptual subspace approach that uses the masking properties of the human auditory system together with variance normalization to determine the gain parameters. An estimator is used to determine the filter coefficients. The noise is handled by replacing the noise variance with the Rayleigh quotient. The variance is normalized by removing spikes, which avoids rapid increases or decreases in the power of the output samples and makes the output more intelligible.

D. Computational Auditory Scene Analysis-based Speech Enhancement Algorithms
Computational Auditory Scene Analysis (CASA) is the field of computational study that aims to achieve human performance in Auditory Scene Analysis (ASA) by using one or two microphone recordings of the acoustic scene. This definition reflects the biological relevance of the field, by limiting the number of microphones to two, and states the functional goal of CASA. CASA uses perceptually motivated mechanisms. Over the years, CASA-based methods have been used for speech enhancement; here we review some of the work from recent years.
Bao and Abdulla [64] proposed a new ideal ratio mask (IRM) representation that uses inter-channel correlation. The power ratio of speech and noise is adaptively reallocated during the construction of the ratio mask, so that more speech components are retained while noise components are masked. A channel-weight contour is used to modify the mask in all Gammatone filterbank channels.
Wang et al. [65] proposed an IRM estimation method that relies on the spectral dependency in the speech cochleagram to enhance noisy speech. A data field model is established to capture the time-frequency relationships of the cochleagram with adjacent spectral information in order to estimate the IRM. First, a pre-processing stage is used to obtain initial time-frequency values of the noise and speech. Then the data field model is used to obtain the forms of the speech and noise potentials. Subsequently, the optimal potentials, which reflect their respective optimal distributions, are obtained from the optimal influence factors. Finally, the masking values are obtained from the speech and noise potentials to restore the clean speech signal.
Wang et al. [66] considered novel speech and noise models and presented two model-based soft decision methods. A ratio mask is computed from exact Bayesian estimators of speech and noise. Additionally, a probabilistic mask is estimated with a variable local criterion. Liang et al. [67] exploited local correlation knowledge in two ways to improve performance. First, a potential function based on time-frequency segmentation is derived to represent directly the local correlation between the mask labels of neighboring units. It is shown that the time-frequency units belonging to one segment are mostly dominated by one source. Second, a local noise level tracking stage is integrated. The local noise level is obtained by averaging several neighboring time-frequency units and is regarded as an accurate estimate of the noise energy. It is used as an intermediate auxiliary variable to represent the correlation. A high-dimensional posterior distribution is simulated by a Markov Chain Monte Carlo (MCMC) approach. During the iterations, the correlation is fully exploited to compute the acceptance ratio. The estimated ideal binary mask (IBM) is obtained using the expectation operator. Compared with a Bayesian approach, the proposed approach yielded a considerably large performance gain in terms of SNR gain and HIT-FA rate. Narayanan and Wang [68] presented a CASA-based system for robust SNR estimation. The method uses an estimate of the IBM to separate a time-frequency representation of the noisy speech into speech-dominated and noise-dominated regions. The energy within each region is summed to obtain the filtered global SNR. An SNR transformation is then applied to convert the estimated SNR into the true global SNR of the noisy speech signal.
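For reference, the two masks discussed in this subsection can be computed as follows when the speech and noise energies of each time-frequency unit are known, which is only the case for the ideal (oracle) masks. This is a sketch in plain Python. The function names, the exponent `beta` (0.5 is a commonly used value) and the local criterion `lc_db` are assumptions for illustration and are not the estimators of the cited works.

```python
def ideal_ratio_mask(speech_energy, noise_energy, beta=0.5):
    """Soft mask in [0, 1] per time-frequency unit: (S / (S + N)) ** beta."""
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        mask.append([(s / (s + n)) ** beta if s + n > 0 else 0.0
                     for s, n in zip(s_row, n_row)])
    return mask

def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """Binary mask: 1 where the local SNR exceeds the local criterion lc_db."""
    threshold = 10.0 ** (lc_db / 10.0)  # local criterion on a linear scale
    return [[1 if s > threshold * n else 0 for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(speech_energy, noise_energy)]
```

In practice these energies are not available, and the surveyed methods estimate the mask from the noisy signal alone.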
Hu and Wang [69] proposed a tandem algorithm that estimates the pitch of a target utterance and separates the voiced regions of the target speech. First, a coarse estimate of the target pitch is obtained, and the estimate is then used to segregate the target speech using harmonicity and temporal continuity. Lee and Kwon [70] proposed a CASA-based speech separation system that matches the missing speech parts using a shape analysis method.
May and Dau [71] presented a method that estimates the ideal binary mask from noisy speech through supervised learning of AMS features, using auditory-inspired modulation filterbanks with logarithmically scaled filters. A spectro-temporal integration stage is incorporated to exploit speech activity information in neighboring time-frequency units.

E. Empirical Mode Decomposition-based Speech Enhancement Algorithms
Empirical Mode Decomposition (EMD) [72] directly extracts the energy associated with different intrinsic time scales. EMD is an adaptive approach that decomposes nonlinear and non-stationary data in the following steps: (i) find the local maxima and minima of the data; (ii) construct the upper and lower envelopes through the local maxima and the local minima, respectively; (iii) compute the mean of the two envelopes and subtract this mean envelope from the input data to obtain a candidate intrinsic mode function (IMF). The steps are repeated on the candidate until it qualifies as an IMF, and then on the residual to extract further IMFs.
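The sifting steps listed above can be sketched as follows. The example assumes NumPy and, for brevity, builds the envelopes by linear interpolation with `np.interp`, whereas EMD implementations normally use cubic splines. It also uses a fixed number of sifting iterations instead of a proper stopping criterion. The function names and the parameter `n_sift` are hypothetical.

```python
import numpy as np

def sift_once(x):
    """One sifting step: subtract the mean of the upper and lower envelopes."""
    n = np.arange(len(x))
    interior = np.arange(1, len(x) - 1)
    # (i) local maxima and minima of the data
    maxima = interior[(x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:])]
    minima = interior[(x[1:-1] < x[:-2]) & (x[1:-1] <= x[2:])]
    if len(maxima) < 2 or len(minima) < 2:
        return None  # too few extrema to build envelopes
    # (ii) envelopes through the extrema (linear here, a simplification)
    upper = np.interp(n, maxima, x[maxima])
    lower = np.interp(n, minima, x[minima])
    # (iii) subtract the mean envelope from the data
    return x - (upper + lower) / 2.0

def first_imf(x, n_sift=5):
    """Approximate the first IMF with a fixed number of sifting iterations."""
    h = np.asarray(x, dtype=float).copy()
    for _ in range(n_sift):
        candidate = sift_once(h)
        if candidate is None:
            break
        h = candidate
    return h
```

Subtracting the returned IMF from the input gives the residual on which the next IMF would be extracted.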
Upadhyay and Pachori [73] proposed a novel speech enhancement method for suppressing stationary and non-stationary noise sources. The variational mode decomposition (VMD) and EMD approaches are combined to develop the new idea for speech enhancement. Firstly, the EMD decomposes the input noisy speech into the IMFs. The VMD is then applied on the summation of preferred IMFs. The Hurst exponent was used to select the IMFs. The proposed speech enhancement method reduced low and high-frequency noise sources and showed enhanced speech quality.
Khaldi et al. [74] presented a speech enhancement method that exploited the combined effects of EMD and the local statistics of the speech signal by utilizing the adaptive centre weighted averaging filter. The speech signals were segmented into frames and all frames were segmented down by EMD into IMFs. The filtered IMFs depend on the voiced or unvoiced frame. An energy norm was utilized to classify the voiced frames and a stationarity index was used between unvoiced and transient chain. Zao et al., [75] proposed a speech enhancement scheme based on the adoption of Hurst exponent during the selection of IMFs to reconstruct the target speech.
Hamid et al., [76] proposed a novel data adaptive thresholding approach. The noisy speech signals and fractional Gaussian noises were mixed to generate the complex noisy signal. Bivariate EMD was used to decompose the complex noisy signal into complex-valued IMFs and all IMFs were segmented into short-time frames for processing. The variances of the IMFs of fractional Gaussian noise computed inside the frames were used as the reference to categorize subsequent frames of noisy speech into signal-dominant and noise-dominant frames, respectively. A soft thresholding method is used at noise-dominant frames to decrease the effects of noise. Every frame and IMF of the speech signals were combined to yield the enhanced speech signal.
Chatlani and Soraghan [77] used EMD as a post-processing stage for filtering low-frequency noise. An adaptive approach was designed to choose the IMF index that separates the noise components from the speech components, using second-order IMF statistics. The low-frequency noise components were then removed by a partial reconstruction from the IMFs. Khaldi et al. [78] used EMD in fully data-driven approaches to noise reduction. The noisy speech signal was decomposed adaptively into IMFs using the sifting process. The signal was reconstructed from IMFs that had been filtered using the MMSE filter and thresholded using a shrinkage function.
The U-SCSE algorithms provide acceptable speech quality and noise reduction for many real-world noise sources. Along with several advantages, however, they also have limitations. Tables I and II summarize the advantages and limitations of various U-SCSE algorithms. These limitations point to several research areas that need further investigation.

Entries from Tables I-II (problem statements, methodologies, contributions and limitations of U-SCSE algorithms):

Problem: Speech enhancement to improve speech quality and to reduce the musical noise distortion.
Methodology: The magnitude spectrum of the noisy signal is computed using the FFT. The noise spectrum is updated using noise estimators. The gain is estimated using a modified gain function and multiplied with the noisy spectrum to enhance the speech.
Contributions: Performed significantly better than the traditional spectral subtraction algorithm in terms of speech quality and musical noise artifact.
Limitations: Speech intelligibility is not evaluated, and only informal tests were conducted for the evaluations. Research on the impact of noise reduction on speech intelligibility is required.

Reference: [36]
Algorithm: MOD-SS
Problem: Speech enhancement to improve speech quality and intelligibility in the modulation domain.
Methodology: The method uses an AMS-based modulation domain. Each frequency component of the acoustic magnitude spectra is processed frame-wise across time using a modulation AMS framework, and the enhanced modulation spectrum is computed.
Contributions: A new speech enhancement domain for spectral subtraction is explored. Better speech quality, speech intelligibility and noise reduction are obtained.
Limitations: Although the proposed method offered better results, the combination with other domains adds complexity, and the complexity of the method is not discussed.

Reference: [37]
Algorithm: MOD-SS
Problem: Speech enhancement to improve speech quality in the modulation domain.
Methodology: Magnitude subtraction is adopted and extended into the modulation frequency domain for separate enhancement of the real and imaginary spectra. The noise is estimated in the real and imaginary spectra and the estimated speech is reconstructed.
Contributions: Subtraction is performed on the real and imaginary spectra separately in the modulation frequency domain. Better noise reduction and speech quality are achieved.
Limitations: The speech intelligibility potential of the proposed method is not discussed. The method estimates the phase, yet its complexity is not discussed.

Reference: [39]
Algorithm: VAD-SS
Problem: Speech enhancement for better results and musical noise reduction, analysed through the kurtosis of the noise spectra.
Methodology: An iterative weak-nonlinear method is used to obtain quality speech with less musical artifact. The generation of musical artifact is formulated by tracking changes in the kurtosis of the noise spectrum. Optimal internal parameters are derived theoretically so that no musical artifact is produced in terms of kurtosis.
Contributions: The proposed method provided better results, and the generation of the musical noise artifact is formulated through the kurtosis of the noise spectra.
Limitations: No theoretical explanation is given; only experimental results are presented. Speech quality and intelligibility are not discussed.

III. Speech Intelligibility and Quality Potential of Various U-SCSE Algorithms
Tables I-II present the problem statements, methodologies, contributions and limitations of the U-SCSE algorithms. It is clear from Tables I-II that the U-SCSE algorithms addressed the speech enhancement problem effectively with respect to noise reduction, musical noise artifact and speech quality. Speech enhancement is commonly used as the front-end to automatic speech recognition systems, where speech intelligibility is the more important attribute. The survey of the different classes above shows that speech intelligibility is not fully explored in most of the U-SCSE algorithms. This section provides an extensive experimental evaluation of the quality and intelligibility potential of the U-SCSE approaches.

A. Methods
The experiments use objective measures to evaluate and validate the performance of the speech enhancement algorithms. The U-SCSE algorithms are evaluated in terms of speech intelligibility and quality on a set of 60 noisy speech sentences spoken by female and male speakers. The noisy stimuli are generated by adding four real-world background noises to the clean speech utterances at several signal-to-noise ratios (SNRs). The clean speech sentences are selected randomly from the standard IEEE database [85]. Four nonstationary noise sources (street, exhibition hall, airport, and multitalker babble noise) are taken from the Aurora database [86]. The speech utterances are mixed with the noise at four SNRs from 0 dB to 15 dB in steps of 5 dB, applying ITU-T P.51. The sampling rate is fixed at 8 kHz. Table III provides the details of the speech enhancement algorithms used in the experiments.

Two evaluation measures are used to assess the U-SCSE algorithms. PESQ [87] is used for speech quality; it is the ITU-T P.862 standard, which replaced the obsolete ITU-T P.861 standard because the latter performed inadequately in evaluating speech enhancement. The PESQ score ranges from −0.5 to 4.5, but in the experiments the score follows the mean opinion score (MOS) scale, that is, a range of 1.0 to 4.5. The PESQ scores are calculated using the following equation:

$$\mathrm{PESQ} = \eta_0 + \eta_1 D_{\mathrm{ind}} + \eta_2 A_{\mathrm{ind}}$$

where η0 = 4.5, η1 = −0.1 and η2 = −0.039, and D_ind and A_ind denote the average disturbance value and the average asymmetrical disturbance value, respectively.

A separate evaluation metric is used to assess the intelligibility of the enhanced speech. The short-time objective intelligibility (STOI) measure [88] is used for this purpose. The STOI scores are calculated by the equation:

$$f(d) = \frac{100}{1 + \exp(a d + b)}$$

where d is the STOI output and the parameters a and b are set according to [8]: a = −17.4906 and b = 9.6921.

Entries from Tables I-II (continued):

Problem: Speech enhancement for better quality and to reduce the musical noise distortion.
Methodology: Speech-distortion weighted inter-frame Wiener filters (SDW-IFWF) for noise reduction are implemented in a filter bank structure. The filters use a regularization parameter as a tradeoff between speech distortion and noise reduction. The method depends on the estimation of inter-frame correlation coefficients, which are estimated more robustly using a secondary higher-resolution filter bank (HRFB).
Contributions: The scalar SDW-IFWF gain is implemented in a HRFB, matching a principle in the crucial lower-resolution filter bank, to improve speech quality and noise reduction with less musical artifact. The algorithm provided improved results in terms of speech quality.
Limitations: The speech intelligibility potential of the proposed algorithm is neither discussed nor evaluated.

Reference: [52]
Algorithm: LBLG-NBLG
Problem: Speech enhancement for better speech quality and low speech distortion.
Methodology: The estimators are derived on the basis of MMSE to reduce the distortion of the underlying speech. The musical artifact is reduced without affecting the noise reduction. LBLG and NBLG estimators are proposed. The input signal is decorrelated to obtain moment coefficients. The estimators are applied to estimate the clean signal in the decorrelated domain, and the signal is then recovered in the time domain.
Contributions: The proposed method obtained better speech quality and noise reduction. Non-linear and linear bilateral Laplacian estimators are derived to improve the speech quality.
Limitations: Although the method produced better speech quality than traditional methods, its speech intelligibility and complexity are not fully explored.

Reference: [60]
Algorithm: EPW-Sub
Problem: Speech separation in an optimized subspace for improved quality and intelligibility.
Methodology: The separation is achieved by optimizing the subspace, decomposing the mixture signal into three subspaces: sparse, sub-sparse and low-rank. Soft masking is used for the final decision. Two signals of different qualities are provided in two separate channels. The channel classification is made using fuzzy logic with two parameters. An F0 tracking algorithm is proposed to classify gender.
Contributions: An embedded pre-whitening subspace method based on the controlled spectral domain is proposed for better speech quality and noise reduction in colored noise.
Limitations: Although the proposed method offered better results, speech intelligibility in nonstationary noise sources is not discussed.

Reference: [65]
Algorithm: CASA-SE
Problem: Speech enhancement for improved quality and intelligibility in the data-driven cochleagram domain.
Methodology: The Ideal Ratio Mask is estimated in the data-driven cochleagram domain to enhance the noisy speech.
Contributions: The proposed method obtained a considerable gain in speech quality. Better results in terms of energy loss and residual noise are reported.
Limitations: The proposed algorithm has not incorporated the DF model into the STFT domain. The complexity of the algorithm is not discussed.

Table III (fragment): class 5, Empirical Mode Decomposition (EMD), algorithm H-EMD [75].
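The stimulus generation and the intelligibility score mapping described in the Methods subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the experiments: `mix_at_snr` scales the noise using the average power of the whole signals and does not reproduce the ITU-T level measurement, and `stoi_to_percent` assumes that the STOI output d is mapped to a percentage through a logistic function with the parameters a and b quoted in the text. Both function names are hypothetical.

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` after scaling it to the requested SNR (dB).

    Assumption: SNR is defined on the average power of the whole signals,
    and both sequences have the same length.
    """
    p_clean = sum(c * c for c in clean) / len(clean)
    p_noise = sum(v * v for v in noise) / len(noise)
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return [c + gain * v for c, v in zip(clean, noise)]


def stoi_to_percent(d, a=-17.4906, b=9.6921):
    """Logistic mapping of a STOI value d to a score between 0 and 100."""
    return 100.0 / (1.0 + math.exp(a * d + b))


# Toy signals (illustrative only), mixed at the four SNRs used in the text.
clean = [math.sin(2 * math.pi * n / 16) for n in range(800)]
noise = [math.cos(2 * math.pi * n / 7) for n in range(800)]
stimuli = {snr: mix_at_snr(clean, noise, snr) for snr in (0, 5, 10, 15)}
```

Because a is negative, the mapped score increases monotonically with d, so a higher STOI value always corresponds to a higher predicted intelligibility.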

B. Results and Discussion
A performance comparison at two levels is presented in this section. First, a within-class performance comparison of the U-SCSE algorithms is established. The five classes are spectral-subtractive, statistical-model, Wiener-filtering, subspace and EMD-type algorithms. This comparison was conducted to identify significant performance differences between algorithms of the same class. Second, an across-class performance comparison is conducted to find the algorithm(s) that performed better in all noisy situations.

Within-Class Algorithm Comparison
Table IV provides the PESQ (speech quality) results, whereas the average speech intelligibility results are shown in Fig. 4. Of the three tested spectral-subtractive algorithms, multi-band spectral subtraction (MBSS) [81] consistently performed best across all noisy situations in terms of speech quality. The MBSS and SS-RDC [80] methods performed equally well, except in the 0 dB exhibition hall noise and 0 dB street noise conditions. The noise distortion of the SS-RDC algorithm was considerably lower than that of the MBSS and SS [79] approaches in all noisy situations. In terms of speech intelligibility, the MBSS and SS-RDC approaches performed equally in most noisy situations, except for 0 dB exhibition hall noise and 0 dB street noise, where the MBSS algorithm performed notably better and introduced less speech distortion. In brief, MBSS performed better than SS-RDC and SS, providing better overall speech intelligibility and quality. For speech quality, the two subspace approaches performed equally for most SNRs and noise types, except for 0 dB babble noise.
The two Wiener-type algorithms performed well for most SNR conditions and for the four types of noise, except for 0 dB airport noise and 0 dB babble noise. For speech quality, the WF [45] performed significantly better than the WWF [82] approach at all SNRs and noise sources. WWF performed poorly in all noise sources at almost all SNRs, and significant residual noise remained in the enhanced speech. In contrast, WF-as offered better speech quality and significant noise reduction. For speech intelligibility, WF-as performed well at all SNRs and noise sources compared to the WWF method. Significant speech distortion was observed in the output speech of the WWF approach.
The two statistical-model based approaches performed well for most SNRs and noise types. The log-MMSE (LMMSE) [2] performed significantly better than the MMSE-SPU [1] approach at all SNRs and noise sources. MMSE-SPU performed poorly in all noise sources at almost all SNRs, and significant residual noise was observed in the enhanced speech. In contrast, LMMSE offered better speech quality and significant noise reduction. For speech intelligibility, MMSE-SPU performed very poorly at all SNRs and noise sources. The low speech intelligibility reflects the higher speech distortion introduced by MMSE-SPU. LMMSE offered better speech intelligibility, with comparatively less speech distortion in the output speech.
The generalized subspace approach, KLT [83], performed significantly better than the pKLT [84] approach at all SNRs and noise sources except 0 dB exhibition hall noise. The KLT approach was more successful in suppressing the background noise and improving the perceptual speech quality. In terms of speech intelligibility, the KLT and pKLT approaches performed equally well at all SNRs and noise sources except 0 dB exhibition hall noise. No significant improvement in speech intelligibility was observed for the pKLT approach, whereas KLT improved speech intelligibility marginally.
In terms of speech quality, the EMD-H [75] algorithm performed well for all SNRs and noise types, except for 0 dB exhibition hall noise and 0 dB street noise. EMD-H was successful in suppressing the background noise and improving the perceptual quality and speech intelligibility at all SNRs and noise sources.

Across-Class Algorithm Comparison
Tables V-VI report the results of the ANOVA statistical analysis for speech quality and intelligibility. An asterisk in Tables V-VI indicates the absence of a statistically significant difference between the algorithm with the highest score and the marked algorithm. The U-SCSE algorithms marked by an asterisk in Table V therefore performed similarly. Table V indicates that no single algorithm can be categorized as the best, and that several speech enhancement algorithms performed equally well across SNRs and noise types. In terms of speech quality, MMSE-SPU, LMMSE, WF, EMD-H and MBSS performed equally well across all SNRs. Table VI reports the results of the ANOVA statistical analysis for speech intelligibility. The MMSE-SPU, LMMSE, MBSS and WF algorithms performed well. All algorithms produced low speech distortion (high intelligibility) across all SNRs and noise sources. The KLT, SS-RDC and WWF algorithms also performed well in isolated SNR situations.
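As a rough illustration of the statistic underlying such an analysis, the sketch below computes the one-way ANOVA F ratio for groups of scores. It is an assumption that a one-way design applies here; the text does not state the exact ANOVA design or any post hoc test. The function name `one_way_anova_f` is hypothetical, and the scores in the usage lines are made-up placeholders, not results from Tables IV-VI.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the individual scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Placeholder scores for three hypothetical algorithms (not measured data).
scores = [[2.1, 2.3, 2.2, 2.4], [2.0, 2.2, 2.1, 2.3], [1.6, 1.8, 1.7, 1.9]]
f_stat, df_b, df_w = one_way_anova_f(scores)
```

The F ratio is then compared against the F distribution with (df_between, df_within) degrees of freedom to decide whether the group means differ significantly; that lookup is omitted here.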

IV. Conclusion
This paper presented a comprehensive review of the different classes of unsupervised single-channel speech enhancement algorithms for improving the intelligibility and quality of contaminated speech. Various classes of unsupervised speech enhancement approaches for enhancing noisy speech have been discussed. We have summarized representative algorithms of the Spectral Subtraction (SS), Wiener Filtering (WF), Minimum Mean Square Error (MMSE) estimator, Signal Subspace (SigSub) and EMD types, explained state-of-the-art approaches, and reviewed many related studies. The review suggests that unsupervised speech enhancement methods provide acceptable speech quality, but their speech intelligibility potential remains moderate. The algorithms of the unsupervised class achieve good noise reduction; however, reducing the residual noise artifact and speech distortion requires further research. Different unsupervised speech enhancement approaches have distinctive advantages that make them appropriate for speech enhancement, but they also have some serious limitations. Tables I-II summarize the problem statements, methodologies, contributions and limitations of many speech enhancement algorithms. On the basis of the limitations extracted from the reviewed papers and of the experimental results, it is concluded that unsupervised speech enhancement improves speech quality, but its potential for improving speech intelligibility requires further research. The algorithms can use noise estimators, but obtaining an accurate noise estimate is itself a difficult task. An overly aggressive noise estimate may remove important speech content, which in turn affects speech intelligibility, whereas an underestimate of the noise may leave residual noise. We have outlined various problems that need research in order to design robust single-channel speech enhancement algorithms.
The rapid progress in unsupervised speech enhancement algorithms is likely to continue in the future. To conclude, the following open research problems, extracted from the reviewed studies, are outlined:

Generalization to the Nonstationary Noise Sources:
Although U-SCSE algorithms provide promising speech quality in stationary noise sources, their performance in nonstationary noise sources is not high. Effective noise estimation must be integrated with U-SCSE algorithms to obtain better speech quality and noise reduction.

Speech Intelligibility in Nonstationary Noise Sources:
U-SCSE provides enhanced speech with very low speech intelligibility. More effective algorithms are required that can improve speech intelligibility in nonstationary noise sources.

Musical Noise Artifact and Speech Distortion:
Unsupervised speech enhancement algorithms provide acceptable noise reduction; however, the reduction of the residual noise artifact and speech distortion requires further research.