Single-channel speech enhancement using inter-component phase relations
Introduction
In many signal processing applications including radar, image and speech processing, the problem of interest is to detect the desired signal in a noisy observation. While many previous studies were dedicated to deriving new estimators for amplitude and frequency of signal components (harmonics) (Kay, 1993, Van Trees, 2004), the estimation of spectral phase has been less addressed.
In speech signal processing, the spectral phase was historically reported to be perceptually unimportant, following the early experiments by Wang and Lim (1982) and Vary (1985). In particular, Vary reported that humans perceive phase distortion only below a signal-to-noise ratio (SNR) of 6 dB; hence, the noisy spectral phase suffices at sufficiently high SNRs. Later on, Aarabi (2006) and Alsteris and Paliwal (2007) reported that the spectral phase could be helpful for speech applications including automatic speech recognition and noise reduction. More recently, overviews of phase-aware signal processing for speech applications have thoroughly demonstrated the advantages and potential of incorporating phase processing (Mowlaee, Kulmer, Stahl, Mayer, 2016a, Gerkmann, Krawczyk, Le Roux, 2015, Mowlaee, Saeidi, Stylianou, 2016b).
The slow progress of research on phase-aware processing, and on the importance of phase in speech applications in general, can be explained as follows: (i) historically, the spectral phase of speech signals was believed to be unimportant, as reported in the early studies (for a full review we refer to Mowlaee et al. (2016a, Ch. 1)); (ii) in contrast to the magnitude spectrum, phase wrapping obscures any accessible pattern in the Fourier-domain phase spectrum, which complicates the phase analysis of a given speech signal (Mowlaee et al., 2016b); (iii) phase processing is computationally complex and requires sophisticated algorithms with accurate prior statistics or a fundamental frequency estimate (see e.g. Mowlaee and Kulmer, 2015b); (iv) little or no attention has been dedicated to the relations between harmonic components in speech; hence, the phases of harmonics have been estimated independently or by relying on the phase of the fundamental harmonic.
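To illustrate point (ii), the following minimal Python sketch (a didactic illustration, not part of the original study) shows how wrapping destroys the visible pattern of an otherwise smooth phase trajectory, and how unwrapping recovers it as long as successive increments stay below pi:

```python
import numpy as np

# A linearly increasing "true" phase, e.g. one harmonic evolving across
# analysis frames (hypothetical slope, chosen below pi per step so that
# unwrapping is unambiguous).
true_phase = 0.4 * np.arange(50)               # radians
wrapped = np.angle(np.exp(1j * true_phase))    # principal value in (-pi, pi]

# The wrapped sequence jumps by about 2*pi and shows no obvious pattern...
assert np.any(np.abs(np.diff(wrapped)) > np.pi)

# ...while unwrapping recovers the smooth underlying trajectory.
unwrapped = np.unwrap(wrapped)
assert np.allclose(unwrapped, true_phase)
```

When the per-step increment exceeds pi, as with real speech at typical hop sizes, the ambiguity cannot be resolved by unwrapping alone, which is one reason phase analysis is harder than magnitude analysis.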
It is important to note that an enhanced spectral phase obtained from a noisy speech observation can be used directly for signal reconstruction and hence to enhance the noisy speech signal. Furthermore, an estimated clean spectral phase can also be used to derive improved spectral amplitude estimators in an iterative (Mowlaee, Stahl, Kulmer, 2017, Mowlaee, Saeidi, 2013) or non-iterative (Gerkmann, Krawczyk, Le Roux, 2015, Krawczyk, Gerkmann, 2016) configuration. As the achievable improvement from a phase-aware processing framework is limited by the accuracy of the spectral phase estimator stage, a challenging research topic is to find novel approaches that provide accurate and robust estimators of the clean spectral phase from a noisy observation. Robust and accurate spectral phase information opens up opportunities for further performance improvements in other speech applications, including automatic speech recognition (Fahringer et al., 2016), speech synthesis (Espic et al., 2017), source separation (Mayer et al., 2017) and emotion recognition (Deng et al., 2016).
The previous attempts at spectral phase estimation can be divided into the following groups (Chacon and Mowlaee, 2014): (i) Griffin-Lim (GL) (Griffin and Lim, 1984) based methods, which exploit the consistency of the short-time Fourier transform (STFT) spectrogram and iteratively reconstruct the spectral phase from an initial estimate of the spectral magnitude (see Mowlaee and Watanabe, 2013 for an overview); (ii) model-based methods, such as short-time Fourier transform phase improvement (STFTPI) (Krawczyk and Gerkmann, 2014), which relies on a harmonic model to predict the spectral phase across time using the phase vocoder principle and across frequency by compensating for the analysis window phase response. Another model-based phase estimator is the geometry-based approach, where an additional time-frequency constraint (Mowlaee and Saeidi, 2014) is used to remove the ambiguity in the chosen spectral phase pairs. Three types of constraints were proposed for the geometry-based phase estimator: group delay deviation, instantaneous frequency deviation and relative phase shift (RPS) (Saratxaga et al., 2009). As another model-based approach, time-frequency smoothing of the unwrapped harmonic phase was proposed by applying the harmonic model plus phase decomposition (Degottex and Erro, 2014b) followed by a smoothing filter (Kulmer, Mowlaee, 2015b, Mowlaee, Kulmer, 2015b, Mowlaee, Kulmer, 2015a); and (iii) statistical methods: maximum a posteriori harmonic (MAP) (Kulmer, Mowlaee, 2015a, Mowlaee, Stahl, Kulmer, 2017), temporal smoothing of the unwrapped harmonic phase (TSUP) (Kulmer, Mowlaee, 2015b, Kulmer, Mowlaee, Watanabe, 2014) and least-squares (LS) (Chacon and Mowlaee, 2014).
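As a rough illustration of the temporal phase prediction underlying STFTPI-style methods in group (ii), the following Python sketch propagates harmonic phases across frames with the phase vocoder rule; the function name, hop size and pitch trajectory are hypothetical choices for illustration, not the published algorithm:

```python
import numpy as np

def predict_harmonic_phases(phi0, f0_traj, hop, fs, num_harmonics):
    """Propagate harmonic phases across frames with the phase vocoder rule
    phi_h(l+1) = phi_h(l) + 2*pi*h*f0(l)*hop/fs, wrapped to (-pi, pi]."""
    h = np.arange(1, num_harmonics + 1)
    phases = [np.asarray(phi0, dtype=float)]
    for f0 in f0_traj[:-1]:
        step = 2 * np.pi * h * f0 * hop / fs   # phase advance per hop
        phases.append(np.angle(np.exp(1j * (phases[-1] + step))))
    return np.stack(phases)

# Hypothetical setup: constant 100 Hz pitch, 8 kHz sampling, 80-sample hop.
pred = predict_harmonic_phases(np.zeros(4), [100.0] * 5,
                               hop=80, fs=8000, num_harmonics=4)
```

With a 100 Hz pitch, an 8 kHz sampling rate and an 80-sample hop, each hop spans exactly one pitch period, so the predicted phases of all harmonics repeat from frame to frame.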
In all previous phase estimators, the underlying relation between harmonic phases, i.e., the phase structure across harmonics, is either not directly taken into account (Krawczyk, Gerkmann, 2014, Kulmer, Mowlaee, 2015a, Kulmer, Mowlaee, 2015b) or relies only on the phase of the fundamental frequency used as the reference (Mowlaee, Saeidi, 2014, Mowlaee, Kulmer, 2015b, Mowlaee, Kulmer, 2015a). For example, the geometry-based phase estimator with the RPS constraint (Mowlaee and Saeidi, 2014) takes into account the relation between the harmonic phases and the fundamental frequency phase. Smoothing across RPS has also been considered in Mowlaee and Kulmer (2015b). In these methods, the phase estimation performance relies on the accuracy of the fundamental frequency phase, which itself depends on the fundamental frequency estimation accuracy; this limits the performance in low-frequency noise scenarios. Furthermore, since the underlying phase structure across harmonics is not taken into account, the harmonic phases are estimated independently.
In this paper, we argue that two issues limit the achievable performance of the existing spectral phase estimators: (i) relying on the fundamental frequency phase, and (ii) neglecting the phase structure across harmonics in the speech signal. Therefore, we propose new phase estimators that rely on the inter-component phase relations (ICPR) of a polyharmonic signal like speech. In our earlier publication (Pirolt et al., 2017), we reported preliminary results on the usefulness of a phase quasi-invariant constraint for phase estimation and speech enhancement. Here, we present the ICPR in detail for a polyharmonic signal (here, speech) and report their usefulness for speech enhancement in different noise scenarios. The three phase relations are: Phase Invariance (PI), Phase Quasi-Invariance (PQI), and Bi-Phase (see Section 2 for an overview). We apply these phase relations as constraints to derive harmonic phase estimators, which are then used for speech enhancement, yielding a phase-enhanced speech signal. Throughout the experiments, we demonstrate that the newly derived phase estimators result in improved perceived quality and speech intelligibility and a lower phase estimation error compared to the benchmark methods.
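Section 2 defines the three relations precisely; as a hedged preview, a Bi-Phase can be sketched with the standard bispectral combination phi_k + phi_l - phi_{k+l}, which is invariant to a time shift of the waveform (the exact ICPR formulation used in the paper may differ):

```python
import numpy as np

def biphase(phi, k, l):
    """Bi-phase of harmonics k and l, using the standard bispectral
    combination phi_k + phi_l - phi_{k+l} (assumed here for illustration;
    Section 2 gives the exact ICPR formulation). phi[h-1] is the phase
    of harmonic h."""
    return np.angle(np.exp(1j * (phi[k - 1] + phi[l - 1] - phi[k + l - 1])))

phi = np.array([0.3, -1.2, 0.5, 2.0, -0.4, 1.1, 0.0, 0.9])  # hypothetical

# A time shift adds h*theta to the phase of harmonic h, yet the bi-phase
# is unchanged: k*theta + l*theta - (k + l)*theta = 0.
shifted = phi + np.arange(1, 9) * 0.7
assert np.isclose(biphase(phi, 2, 3), -0.3)
assert np.isclose(biphase(phi, 2, 3), biphase(shifted, 2, 3))
```

This shift invariance is what makes such inter-component quantities attractive as constraints: they capture structure across harmonics that individual harmonic phases, which all change under a time shift, do not.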
The rest of the paper is organized as follows. Section 2 presents background on the ICPR for polyharmonic signals in general, focusing on three phase relations: Phase Invariance, Phase Quasi-Invariance and Bi-Phase. Section 3 details the proposed phase estimators relying on each of the three constraints (PI, PQI and Bi-Phase). Section 4 presents proof-of-concept experiments and speech enhancement results: a comparative study of phase estimation performance and of the achievable speech enhancement results versus the relevant benchmark methods, followed by a discussion. Section 5 concludes the paper.
Background on inter-component phase relations in polyharmonic signals
In this section, we review the theory and applications of phase processing techniques that exploit the following underlying principle: the parameters of a particular harmonic are considered in relation to the parameters of other harmonics of the same oscillation process. This principle provides the basis for a number of inter-component phase processing methods and reveals special properties of signals that cannot be observed by conventional magnitude and power spectrum analysis methods.
Proposed phase estimators
In this section, we present the proposed phase estimators relying on the ICPR. Fig. 1 shows the block diagram of the speech enhancement setup that uses the proposed phase estimation framework.
Let x(n) and y(n) denote the clean and noisy signal, respectively, in the time domain. The noisy signal represents a mixture of the clean speech and additive noise.
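Under this additive model, a minimal synthetic sketch (all amplitudes, phases, the pitch and the noise level are hypothetical) generates a polyharmonic clean signal, adds noise, and verifies that the clean harmonic phases can be read off at the harmonic DFT bins:

```python
import numpy as np

# Additive model y(n) = x(n) + v(n) with a polyharmonic clean signal x(n).
# All parameters (pitch, amplitudes, phases, noise level) are hypothetical.
fs, f0 = 8000, 100.0
n = np.arange(400)                      # 50 ms frame at 8 kHz
amps = [1.0, 0.5, 0.25]
phis = [0.0, 0.8, -1.3]                 # clean harmonic phases (radians)
x = sum(a * np.cos(2 * np.pi * h * f0 * n / fs + p)
        for h, (a, p) in enumerate(zip(amps, phis), start=1))
v = 0.1 * np.random.default_rng(1).standard_normal(len(n))
y = x + v                               # noisy observation

# With an integer number of pitch periods in the frame, the clean harmonic
# phases appear directly at the harmonic DFT bins (100 Hz -> bin 5, etc.).
assert np.isclose(np.angle(np.fft.rfft(x)[5]), phis[0], atol=1e-8)
assert np.isclose(np.angle(np.fft.rfft(x)[10]), phis[1], atol=1e-8)
```

In the noisy observation y, these bin phases are perturbed by the noise, which is precisely the degradation the proposed estimators aim to undo.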
Experiment setup
We randomly chose 50 utterances from 20 speakers (10 male and 10 female) of the GRID corpus (Cooke et al., 2006). The utterances were corrupted by the following noise types: white, babble and factory noise taken from the NOISEX-92 database (Varga et al., 1992), and car and street noise taken from the NOIZEUS database (Hu and Loizou, 2007). The SNR levels ranged from to 10 dB in 5 dB steps.
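Mixing at a given SNR amounts to scaling the noise so the clean-to-noise power ratio hits the target level; the following sketch shows one common way to do this (the helper name and toy signals are illustrative, not the exact toolchain of the experiments):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + scaled noise has the requested SNR (dB).
    Illustrative helper; not necessarily the tooling used in the paper."""
    noise = noise[:len(clean)]
    gain = np.sqrt(np.mean(clean ** 2)
                   / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return clean + gain * noise

rng = np.random.default_rng(2)
clean = np.sin(2 * np.pi * 0.01 * np.arange(4000))  # toy "speech" signal
noisy = mix_at_snr(clean, rng.standard_normal(4000), 5.0)

# The realized SNR matches the requested 5 dB by construction.
realized = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
assert np.isclose(realized, 5.0)
```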
To quantify the error between the estimated and the clean ICPR values, we define
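The snippet above is truncated; as an illustrative stand-in (not the paper's actual definition), a circular error metric for phase-valued quantities such as the ICPR could be written as:

```python
import numpy as np

def circular_phase_error(est, ref):
    """Mean absolute wrapped phase difference in radians.
    NOTE: a generic circular error metric, assumed for illustration; the
    paper defines its own error measure for the ICPR values."""
    diff = np.angle(np.exp(1j * (np.asarray(est) - np.asarray(ref))))
    return float(np.mean(np.abs(diff)))

# A 2*pi offset does not count as error on the circle:
assert np.isclose(circular_phase_error([0.1 + 2 * np.pi], [0.1]), 0.0)
```

Working on the unit circle avoids penalizing harmless 2*pi ambiguities, which a plain Euclidean difference of phase values would.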
Conclusion
In this paper, after an overview of the importance of spectral phase estimation and of recent phase estimation methods in speech signal processing, we proposed new harmonic phase estimators for noisy speech that rely on the relations between the harmonics of a polyharmonic signal such as speech. The phase structure was defined by three constraints: Phase Invariance, Phase Quasi-Invariance, and Bi-Phase. These constraints were used to derive estimators for the clean spectral phase from a noisy observation.
Acknowledgments
The authors would like to thank the anonymous reviewers for their valuable comments that helped to considerably improve this paper. The work of Pejman Mowlaee was supported by the Austrian Science Fund (project number P28070-N33).
References
- Short-time phase spectrum in speech processing: a review and some experimental results. Elsevier Signal Process. (2007)
- Subjective comparison and evaluation of speech enhancement algorithms. Speech Commun. (2007)
- Advances in phase-aware signal processing in speech communication. Speech Commun. (2016)
- Iterative joint MAP single-channel speech enhancement given non-uniform phase prior. Speech Commun. (2017)
- Role of modulation magnitude and phase spectrum towards speech intelligibility. Speech Commun. (2011)
- Noise suppression by spectral magnitude estimation - mechanism and theoretical limits. Elsevier Signal Process. (1985)
- Phase-Based Speech Processing (2006)
- Digital phase processing methods of ultra wide band signals (in Russian). J. Radioeng. Electron. (Signal Gener. Trans. Recept. Radio Syst.) (1994)
- Studying the connection between quasi-harmonic components of a speech signal. Proceedings of the Twenty-Fourth Session of the Russian Acoustical Society (2011)
- Barysenka, S. Y., Vorobiov, V. I., Mowlaee, P., 2017. Single-channel speech enhancement using inter-component phase...
- A synthesis approach of bi-spectral organized signals (in Russian). Tech. Phys. Lett.
- Analysis of voiced speech by means of bispectrum. Electron. Lett.
- Least squares phase estimation of mixed signals. Proceedings of the International Speech Communication Association Interspeech
- An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am.
- A measure of phase randomness for the harmonic model in speech synthesis. Proceedings of the International Speech Communication Association Interspeech
- A uniform phase representation for the harmonic model in speech synthesis applications. EURASIP J. Audio Speech Music Process.
- Exploitation of phase-based features for whispered speech emotion recognition. IEEE Access
- Speech enhancement using a minimum mean square error log-spectral amplitude estimator. IEEE Trans. Audio Speech Lang. Process.
- Direct modelling of magnitude and phase spectra for statistical parametric speech synthesis. Proceedings of the International Speech Communication Association Interspeech
- Phase-aware signal processing for automatic speech recognition. Proceedings of the International Speech Communication Association Interspeech
- Speech enhancement using the bispectrum. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing
- Methods and Techniques of Radar Recognition (an English translation of a book originally published in Russian in 1984)
- On speech intelligibility estimation of phase-aware single-channel speech enhancement. Proceedings of the International Speech Communication Association Interspeech
- Phase invariant method in radio-wave propagation experiments (in Russian). Prikladnaja Radioelektronika, Kharkiv National University of Radioelectronics, Kharkiv, Ukraine
- Phase Related Processes of Nonlinear Acoustics: Modulated Waves (in Russian)
- Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Trans. Audio Speech Lang. Process.
- Phase processing for single-channel speech enhancement: history and recent advances. IEEE Signal Process. Mag.
- PEFAC - a pitch estimation algorithm robust to high levels of noise. IEEE Trans. Audio Speech Lang. Process.
- Signal estimation from modified short-time Fourier transform. IEEE Trans. Audio Speech Lang. Process.
- DFT-domain based single-microphone noise reduction for speech enhancement. Synthesis Lectures on Speech and Audio Processing
- Analysis of the phase unwrapping algorithm. Appl. Opt.
- Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory
- The importance of phase on voice quality assessment. Proceedings of the International Speech Communication Association Interspeech
- STFT phase reconstruction in voiced speech for an improved single-channel speech enhancement. IEEE Trans. Audio Speech Lang. Process.