Single-channel speech enhancement using inter-component phase relations
Introduction
In many signal processing applications including radar, image and speech processing, the problem of interest is to detect the desired signal in a noisy observation. While many previous studies were dedicated to deriving new estimators for amplitude and frequency of signal components (harmonics) (Kay, 1993, Van Trees, 2004), the estimation of spectral phase has been less addressed.
In speech signal processing, the spectral phase was historically reported to be perceptually unimportant, following the early experiments by Wang and Lim (1982) and Vary (1985). In particular, Vary reported that humans perceive phase distortion only below a signal-to-noise ratio (SNR) of 6 dB; hence, the noisy spectral phase suffices at sufficiently high SNRs. Later on, Aarabi (2006) and Alsteris and Paliwal (2007) reported that the spectral phase could be helpful for speech applications including automatic speech recognition and noise reduction. More recently, overviews of phase-aware signal processing for speech applications have thoroughly demonstrated the advantages and potential of incorporating phase processing (Mowlaee, Kulmer, Stahl, Mayer, 2016a, Gerkmann, Krawczyk, Le Roux, 2015, Mowlaee, Saeidi, Stylianou, 2016b).
The slow progress of research on phase-aware processing, and on the importance of phase in speech applications in general, can be explained as follows: (i) historically, the spectral phase of speech signals was believed to be unimportant, as reported in the early studies (for a full review we refer to Mowlaee et al. (2016a, Ch. 1)); (ii) in contrast to the magnitude spectrum, phase wrapping obscures any accessible pattern in the Fourier-domain phase spectrum, which complicates the phase analysis of a given speech signal (Mowlaee et al., 2016b); (iii) phase processing is computationally complex and requires sophisticated algorithms with accurate prior statistics or a fundamental frequency estimate (see e.g. Mowlaee and Kulmer, 2015b); (iv) little or no attention has been dedicated to the relations between harmonic components in speech; hence, the phases of harmonics have been estimated independently or by relying on the phase of the fundamental harmonic.
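To illustrate point (ii), the following minimal Python sketch (a didactic illustration, not part of the original study) shows how wrapping destroys the visible pattern of an otherwise smooth phase trajectory, and how unwrapping recovers it as long as successive increments stay below pi:

```python
import numpy as np

# A linearly increasing "true" phase, e.g. one harmonic evolving across
# analysis frames (hypothetical slope, chosen below pi per step so that
# unwrapping is unambiguous).
true_phase = 0.4 * np.arange(50)               # radians
wrapped = np.angle(np.exp(1j * true_phase))    # principal value in (-pi, pi]

# The wrapped sequence jumps by about 2*pi and shows no obvious pattern...
assert np.any(np.abs(np.diff(wrapped)) > np.pi)

# ...while unwrapping recovers the smooth underlying trajectory.
unwrapped = np.unwrap(wrapped)
assert np.allclose(unwrapped, true_phase)
```

When the per-step increment exceeds pi, as with real speech at typical hop sizes, the ambiguity cannot be resolved by unwrapping alone, which is one reason phase analysis is harder than magnitude analysis.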
It is important to note that an enhanced spectral phase obtained from a noisy speech observation can be used directly for signal reconstruction and hence to enhance the noisy speech signal. Furthermore, an estimated clean spectral phase can also be used to derive improved spectral amplitude estimators in an iterative (Mowlaee, Stahl, Kulmer, 2017, Mowlaee, Saeidi, 2013) or non-iterative (Gerkmann, Krawczyk, Le Roux, 2015, Krawczyk, Gerkmann, 2016) configuration. As the achievable improvement from a phase-aware processing framework is limited by the accuracy of the spectral phase estimator stage, a challenging research topic is to find novel approaches that provide accurate and robust estimators of the clean spectral phase from a noisy observation. Robust and accurate spectral phase information opens up opportunities for further performance improvements in other speech applications, including automatic speech recognition (Fahringer et al., 2016), speech synthesis (Espic et al., 2017), source separation (Mayer et al., 2017) and emotion recognition (Deng et al., 2016).
The previous attempts at spectral phase estimation can be divided into the following groups (Chacon and Mowlaee, 2014): (i) Griffin-Lim (GL) (Griffin and Lim, 1984) based methods, which exploit the consistency of the short-time Fourier transform (STFT) spectrogram and iteratively reconstruct the spectral phase from an initial estimate of the spectral magnitude (see Mowlaee and Watanabe, 2013 for an overview); (ii) model-based methods, such as short-time Fourier transform phase improvement (STFTPI) (Krawczyk and Gerkmann, 2014), which relies on a harmonic model to predict the spectral phase across time using the phase vocoder principle and across frequency by compensating for the analysis window phase response. Another model-based phase estimator is the geometry-based approach, where an additional time-frequency constraint (Mowlaee and Saeidi, 2014) is used to remove the ambiguity in the chosen spectral phase pairs. Three types of constraints were proposed for the geometry-based phase estimator: group delay deviation, instantaneous frequency deviation and relative phase shift (RPS) (Saratxaga et al., 2009). As another model-based approach, time-frequency smoothing of the unwrapped harmonic phase was proposed by applying the harmonic model plus phase decomposition (Degottex and Erro, 2014b) followed by a smoothing filter (Kulmer, Mowlaee, 2015b, Mowlaee, Kulmer, 2015b, Mowlaee, Kulmer, 2015a); and (iii) statistical methods: maximum a posteriori harmonic (MAP) (Kulmer, Mowlaee, 2015a, Mowlaee, Stahl, Kulmer, 2017), temporal smoothing of the unwrapped harmonic phase (TSUP) (Kulmer, Mowlaee, 2015b, Kulmer, Mowlaee, Watanabe, 2014) and least-squares (LS) (Chacon and Mowlaee, 2014).
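As a rough illustration of the temporal phase prediction underlying STFTPI-style methods in group (ii), the following Python sketch propagates harmonic phases across frames with the phase vocoder rule; the function name, hop size and pitch trajectory are hypothetical choices for illustration, not the published algorithm:

```python
import numpy as np

def predict_harmonic_phases(phi0, f0_traj, hop, fs, num_harmonics):
    """Propagate harmonic phases across frames with the phase vocoder rule
    phi_h(l+1) = phi_h(l) + 2*pi*h*f0(l)*hop/fs, wrapped to (-pi, pi]."""
    h = np.arange(1, num_harmonics + 1)
    phases = [np.asarray(phi0, dtype=float)]
    for f0 in f0_traj[:-1]:
        step = 2 * np.pi * h * f0 * hop / fs   # phase advance per hop
        phases.append(np.angle(np.exp(1j * (phases[-1] + step))))
    return np.stack(phases)

# Hypothetical setup: constant 100 Hz pitch, 8 kHz sampling, 80-sample hop.
pred = predict_harmonic_phases(np.zeros(4), [100.0] * 5,
                               hop=80, fs=8000, num_harmonics=4)
```

With a 100 Hz pitch, an 8 kHz sampling rate and an 80-sample hop, each hop spans exactly one pitch period, so the predicted phases of all harmonics repeat from frame to frame.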
In all previous phase estimators, the underlying relation between harmonic phases, i.e., the phase structure across harmonics, is either not directly taken into account (Krawczyk, Gerkmann, 2014, Kulmer, Mowlaee, 2015a, Kulmer, Mowlaee, 2015b) or relies only on the phase of the fundamental frequency used as the reference (Mowlaee, Saeidi, 2014, Mowlaee, Kulmer, 2015b, Mowlaee, Kulmer, 2015a). For example, the geometry-based phase estimator with the RPS constraint (Mowlaee and Saeidi, 2014) takes into account the relation between the harmonic phases and the fundamental frequency phase. Smoothing across RPS has also been considered in Mowlaee and Kulmer (2015b). In these methods, the phase estimation performance relies on the accuracy of the fundamental frequency phase, which itself depends on the fundamental frequency estimation accuracy; this limits the performance in low-frequency noise scenarios. Furthermore, since the underlying phase structure across harmonics is not taken into account, the harmonic phases are estimated independently.
In this paper, we argue that two issues limit the achievable performance of the existing spectral phase estimators: (i) relying on the fundamental frequency phase, and (ii) neglecting the phase structure across harmonics in the speech signal. Therefore, we propose new phase estimators that rely on the inter-component phase relations (ICPR) of a polyharmonic signal like speech. In our earlier publication (Pirolt et al., 2017), we reported preliminary results on the usefulness of a phase quasi-invariant constraint for phase estimation and speech enhancement. Here, we present the ICPR in detail for a polyharmonic signal (here, speech) and report their usefulness for speech enhancement in different noise scenarios. The three phase relations are: Phase Invariance (PI), Phase Quasi-Invariance (PQI), and Bi-Phase (see Section 2 for an overview). We apply these phase relations as constraints to derive harmonic phase estimators, which are then used for speech enhancement, yielding a phase-enhanced speech signal. Throughout the experiments, we demonstrate that the newly derived phase estimators result in improved perceived quality and speech intelligibility and a lower phase estimation error compared to the benchmark methods.
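Section 2 defines the three relations precisely; as a hedged preview, a Bi-Phase can be sketched with the standard bispectral combination phi_k + phi_l - phi_{k+l}, which is invariant to a time shift of the waveform (the exact ICPR formulation used in the paper may differ):

```python
import numpy as np

def biphase(phi, k, l):
    """Bi-phase of harmonics k and l, using the standard bispectral
    combination phi_k + phi_l - phi_{k+l} (assumed here for illustration;
    Section 2 gives the exact ICPR formulation). phi[h-1] is the phase
    of harmonic h."""
    return np.angle(np.exp(1j * (phi[k - 1] + phi[l - 1] - phi[k + l - 1])))

phi = np.array([0.3, -1.2, 0.5, 2.0, -0.4, 1.1, 0.0, 0.9])  # hypothetical

# A time shift adds h*theta to the phase of harmonic h, yet the bi-phase
# is unchanged: k*theta + l*theta - (k + l)*theta = 0.
shifted = phi + np.arange(1, 9) * 0.7
assert np.isclose(biphase(phi, 2, 3), -0.3)
assert np.isclose(biphase(phi, 2, 3), biphase(shifted, 2, 3))
```

This shift invariance is what makes such inter-component quantities attractive as constraints: they capture structure across harmonics that individual harmonic phases, which all change under a time shift, do not.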
The rest of the paper is organized as follows. Section 2 presents background on the ICPR for polyharmonic signals in general, focusing on three phase relations: Phase Invariance, Phase Quasi-Invariance and Bi-Phase. Section 3 details the proposed phase estimators relying on each of the three constraints (PI, PQI and Bi-Phase). Section 4 presents proof-of-concept experiments and speech enhancement results: a comparative study of phase estimation performance and of the achievable speech enhancement results versus the relevant benchmark methods, followed by a discussion. Section 5 concludes the paper.
Background on inter-component phase relations in polyharmonic signals
In this section, we review the theory and applications of phase processing techniques that exploit the following underlying principle: the parameters of a particular harmonic are considered in relation to the parameters of other harmonics of the same oscillation process. This principle provides the basis for a number of inter-component phase processing methods and reveals special properties of signals that cannot be observed by conventional magnitude and power spectrum analysis methods.
Proposed phase estimators
In this section, we present the proposed phase estimators relying on the ICPR. Fig. 1 shows the block diagram of the speech enhancement setup that uses the proposed phase estimation framework.
Let x(n) and y(n) denote the clean and noisy signal, respectively, in the time domain. The noisy signal represents a mixture of the clean speech and additive noise.
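Under this additive model, a minimal synthetic sketch (all amplitudes, phases, the pitch and the noise level are hypothetical) generates a polyharmonic clean signal, adds noise, and verifies that the clean harmonic phases can be read off at the harmonic DFT bins:

```python
import numpy as np

# Additive model y(n) = x(n) + v(n) with a polyharmonic clean signal x(n).
# All parameters (pitch, amplitudes, phases, noise level) are hypothetical.
fs, f0 = 8000, 100.0
n = np.arange(400)                      # 50 ms frame at 8 kHz
amps = [1.0, 0.5, 0.25]
phis = [0.0, 0.8, -1.3]                 # clean harmonic phases (radians)
x = sum(a * np.cos(2 * np.pi * h * f0 * n / fs + p)
        for h, (a, p) in enumerate(zip(amps, phis), start=1))
v = 0.1 * np.random.default_rng(1).standard_normal(len(n))
y = x + v                               # noisy observation

# With an integer number of pitch periods in the frame, the clean harmonic
# phases appear directly at the harmonic DFT bins (100 Hz -> bin 5, etc.).
assert np.isclose(np.angle(np.fft.rfft(x)[5]), phis[0], atol=1e-8)
assert np.isclose(np.angle(np.fft.rfft(x)[10]), phis[1], atol=1e-8)
```

In the noisy observation y, these bin phases are perturbed by the noise, which is precisely the degradation the proposed estimators aim to undo.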
Experiment setup
We randomly chose 50 utterances from 20 speakers (10 male and 10 female) of the GRID corpus (Cooke et al., 2006). The utterances were corrupted by the following noise types: white, babble and factory noise taken from the NOISEX-92 database (Varga et al., 1992), and car and street noise taken from the NOIZEUS database (Hu and Loizou, 2007). The SNR levels ranged from to 10 dB in 5 dB steps.
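Mixing at a given SNR amounts to scaling the noise so the clean-to-noise power ratio hits the target level; the following sketch shows one common way to do this (the helper name and toy signals are illustrative, not the exact toolchain of the experiments):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + scaled noise has the requested SNR (dB).
    Illustrative helper; not necessarily the tooling used in the paper."""
    noise = noise[:len(clean)]
    gain = np.sqrt(np.mean(clean ** 2)
                   / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return clean + gain * noise

rng = np.random.default_rng(2)
clean = np.sin(2 * np.pi * 0.01 * np.arange(4000))  # toy "speech" signal
noisy = mix_at_snr(clean, rng.standard_normal(4000), 5.0)

# The realized SNR matches the requested 5 dB by construction.
realized = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
assert np.isclose(realized, 5.0)
```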
To quantify the error between the estimated and the clean ICPR values, we define
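The snippet above is truncated; as an illustrative stand-in (not the paper's actual definition), a circular error metric for phase-valued quantities such as the ICPR could be written as:

```python
import numpy as np

def circular_phase_error(est, ref):
    """Mean absolute wrapped phase difference in radians.
    NOTE: a generic circular error metric, assumed for illustration; the
    paper defines its own error measure for the ICPR values."""
    diff = np.angle(np.exp(1j * (np.asarray(est) - np.asarray(ref))))
    return float(np.mean(np.abs(diff)))

# A 2*pi offset does not count as error on the circle:
assert np.isclose(circular_phase_error([0.1 + 2 * np.pi], [0.1]), 0.0)
```

Working on the unit circle avoids penalizing harmless 2*pi ambiguities, which a plain Euclidean difference of phase values would.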
Conclusion
In this paper, after an overview of the importance of spectral phase estimation and of recent phase estimation methods in speech signal processing, we proposed new harmonic phase estimators for noisy speech that rely on the relations between the harmonics of a polyharmonic signal such as speech. The phase structure was defined by three constraints: Phase Invariance, Phase Quasi-Invariance, and Bi-Phase. These constraints were used to derive estimators for the clean spectral phase from a noisy observation.
Acknowledgments
The authors would like to thank the anonymous reviewers for their valuable comments that helped to considerably improve this paper. The work of Pejman Mowlaee was supported by the Austrian Science Fund (project number P28070-N33).
References
- Short-time phase spectrum in speech processing: a review and some experimental results. Elsevier Signal Process. (2007)
- Subjective comparison and evaluation of speech enhancement algorithms. Speech Commun. (2007)
- Advances in phase-aware signal processing in speech communication. Speech Commun. (2016)
- Iterative joint MAP single-channel speech enhancement given non-uniform phase prior. Speech Commun. (2017)
- Role of modulation magnitude and phase spectrum towards speech intelligibility. Speech Commun. (2011)
- Noise suppression by spectral magnitude estimation - mechanism and theoretical limits. Elsevier Signal Process. (1985)
- Phase-Based Speech Processing (2006)
- Digital phase processing methods of ultra wide band signals (in Russian). J. Radioeng. Electron. (Signal Gener. Trans. Recept. Radio Syst.) (1994)
- Studying the connection between quasi-harmonic components of a speech signal. Proceedings of the Twenty-Fourth Session of the Russian Acoustical Society (2011)
- Barysenka, S. Y., Vorobiov, V. I., Mowlaee, P., 2017. Single-channel speech enhancement using inter-component phase...
- A synthesis approach of bi-spectral organized signals (in Russian). Tech. Phys. Lett.
- Analysis of voiced speech by means of bispectrum. Electron. Lett.
- Least squares phase estimation of mixed signals. Proceedings of the International Speech Communication Association Interspeech
- An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am.
- A measure of phase randomness for the harmonic model in speech synthesis. Proceedings of the International Speech Communication Association Interspeech
- A uniform phase representation for the harmonic model in speech synthesis applications. EURASIP J. Audio Speech Music Process.
- Exploitation of phase-based features for whispered speech emotion recognition. IEEE Access
- Speech enhancement using a minimum mean square error log-spectral amplitude estimator. IEEE Trans. Audio Speech Lang. Process.
- Direct modelling of magnitude and phase spectra for statistical parametric speech synthesis. Proceedings of the International Speech Communication Association Interspeech
- Phase-aware signal processing for automatic speech recognition. Proceedings of the International Speech Communication Association Interspeech
- Speech enhancement using the bispectrum. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing
- Methods and Techniques of Radar Recognition (an English translation of a book originally published in Russian in 1984)
- On speech intelligibility estimation of phase-aware single-channel speech enhancement. Proceedings of the International Speech Communication Association Interspeech
- Phase invariant method in radio-wave propagation experiments (in Russian). Prikladnaja Radioelektronika, Kharkiv National University of Radioelectronics, Kharkiv, Ukraine
- Phase Related Processes of Nonlinear Acoustics: Modulated Waves (in Russian)
- Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Trans. Audio Speech Lang. Process.
- Phase processing for single-channel speech enhancement: history and recent advances. IEEE Signal Process. Mag.
- PEFAC - a pitch estimation algorithm robust to high levels of noise. IEEE Trans. Audio Speech Lang. Process.
- Signal estimation from modified short-time Fourier transform. IEEE Trans. Audio Speech Lang. Process.
- DFT-domain based single-microphone noise reduction for speech enhancement. Synthesis Lectures on Speech and Audio Processing
- Analysis of the phase unwrapping algorithm. Appl. Opt.
- Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory
- The importance of phase on voice quality assessment. Proceedings of the International Speech Communication Association Interspeech
- STFT phase reconstruction in voiced speech for an improved single-channel speech enhancement. IEEE Trans. Audio Speech Lang. Process.