Speech Communication, Volume 99, May 2018, Pages 144–160

Single-channel speech enhancement using inter-component phase relations

https://doi.org/10.1016/j.specom.2018.03.009

Highlights

  • The paper provides an introduction to the notion of inter-component phase relations of polyharmonic signals.

  • The performance of the proposed approach is evaluated in terms of speech quality and intelligibility and compared with benchmark methods.

  • A joint application of the proposed approach with conventional amplitude enhancement methods is described.

Abstract

Phase-aware processing has recently attracted considerable interest in the speech signal processing field, as successful results have been reported for various applications including automatic speech/speaker recognition, noise reduction, anti-spoofing and speech synthesis. In all these applications, the success of the applied phase-aware processing method is predominantly determined by the robustness and accuracy of the clean spectral phase estimate obtained from the noisy observation. Therefore, in this paper, we first consider the inter-component phase relations of polyharmonic signals such as speech, captured by the Phase Invariance, Phase Quasi-Invariance and Bi-Phase constraints. Then, relying on these constraints between harmonics as a phase structure, we propose phase estimators. Through various experiments we demonstrate the usefulness of the newly proposed methods. We further report the achievable speech enhancement performance of the proposed phase estimators and compare them with benchmark methods in terms of perceived quality, speech intelligibility and phase estimation accuracy. The proposed methods show improved performance averaged over different noise scenarios and signal-to-noise ratios.

Introduction

In many signal processing applications including radar, image and speech processing, the problem of interest is to detect the desired signal in a noisy observation. While many previous studies were dedicated to deriving new estimators for amplitude and frequency of signal components (harmonics) (Kay, 1993, Van Trees, 2004), the estimation of spectral phase has been less addressed.

In speech signal processing, the spectral phase was historically reported to be perceptually unimportant, following the early experiments by Wang and Lim (1982) and Vary (1985). In particular, Vary reported that humans perceive phase distortion only below a signal-to-noise ratio (SNR) of 6 dB; hence, the noisy spectral phase suffices at sufficiently high SNRs. Later on, Aarabi (2006) and Alsteris and Paliwal (2007) reported that the spectral phase could be helpful for speech applications including automatic speech recognition and noise reduction. More recently, overviews of phase-aware signal processing for speech applications have thoroughly demonstrated the advantages and potential of incorporating phase processing (Mowlaee, Kulmer, Stahl, Mayer, 2016a, Gerkmann, Krawczyk, Le Roux, 2015, Mowlaee, Saeidi, Stylianou, 2016b).

The reasons why research on phase-aware processing, or more generally on the importance of phase in speech applications, progressed slowly can be explained as follows: (i) historically, the spectral phase of speech signals was believed to be unimportant, as reported in the early studies (for a full review we refer to Mowlaee et al. (2016a, Ch. 1)); (ii) in contrast to the magnitude spectrum, phase wrapping obscures any accessible pattern in the Fourier-domain phase spectrum, which complicates the phase analysis of a given speech signal (Mowlaee et al., 2016b); (iii) phase processing is computationally complex and requires sophisticated algorithms with accurate prior statistics or a fundamental frequency estimate (see e.g. Mowlaee and Kulmer, 2015b); (iv) little or no attention has been dedicated to the relations between harmonic components in speech; hence, the phases of the harmonics have been estimated independently or relying on the phase of the fundamental harmonic.

It is important to note that an enhanced spectral phase obtained from a noisy speech observation can be used directly for signal reconstruction and hence to enhance the noisy speech signal. Furthermore, an estimated clean spectral phase can also be used to derive improved spectral amplitude estimators in an iterative (Mowlaee, Stahl, Kulmer, 2017, Mowlaee, Saeidi, 2013) or non-iterative (Gerkmann, Krawczyk, Le Roux, 2015, Krawczyk, Gerkmann, 2016) configuration. As the achievable improvement of a phase-aware processing framework is limited by the accuracy of the spectral phase estimation stage, a challenging research topic is to find novel approaches that provide accurate and robust estimates of the clean spectral phase from a noisy observation. Robust and accurate spectral phase information opens up opportunities for further improved performance in other speech applications including automatic speech recognition (Fahringer et al., 2016), speech synthesis (Espic et al., 2017), source separation (Mayer et al., 2017) and emotion recognition (Deng et al., 2016).

Previous attempts at spectral phase estimation can be divided into the following groups (Chacon and Mowlaee, 2014): (i) Griffin–Lim (GL) based methods (Griffin and Lim, 1984), which exploit the consistency of the short-time Fourier transform (STFT) spectrogram and iteratively reconstruct the spectral phase from an initial estimate of the spectral magnitude (see Mowlaee and Watanabe, 2013 for an overview); (ii) model-based methods, such as short-time Fourier transform phase improvement (STFTPI) (Krawczyk and Gerkmann, 2014), which relies on a harmonic model to predict the spectral phase across time using the phase vocoder principle and across frequency by compensating for the analysis window phase response. Another model-based phase estimator is the geometry-based approach, where an additional time-frequency constraint (Mowlaee and Saeidi, 2014) is used to remove the ambiguity in the chosen spectral phase pairs. Three types of constraints were proposed for the geometry-based phase estimator: group delay deviation, instantaneous frequency deviation and relative phase shift (RPS) (Saratxaga et al., 2009). As another model-based approach, time-frequency smoothing of the unwrapped harmonic phase was proposed, applying the harmonic model plus phase decomposition (Degottex and Erro, 2014b) followed by a smoothing filter (Kulmer, Mowlaee, 2015b, Mowlaee, Kulmer, 2015b, Mowlaee, Kulmer, 2015a); and (iii) statistical methods: maximum a posteriori (MAP) harmonic phase estimation (Kulmer, Mowlaee, 2015a, Mowlaee, Stahl, Kulmer, 2017), temporal smoothing of the unwrapped harmonic phase (TSUP) (Kulmer, Mowlaee, 2015b, Kulmer, Mowlaee, Watanabe, 2014) and least-squares (LS) estimation (Chacon and Mowlaee, 2014).
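The GL iteration in group (i) admits a compact sketch: starting from an arbitrary phase, the signal is repeatedly resynthesized and re-analyzed, keeping the target STFT magnitude and updating only the phase. The following minimal Python sketch illustrates the idea; the frame length, overlap and random phase initialization are arbitrary demo choices, not settings from the paper:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=50, nperseg=256, noverlap=192, seed=0):
    """Reconstruct a time signal whose STFT magnitude approximates `mag`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # arbitrary initial phase
    for _ in range(n_iter):
        # resynthesize with the current phase, then re-analyze ...
        _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
        _, _, Z = stft(x, nperseg=nperseg, noverlap=noverlap)
        # ... and keep only the new phase; the magnitude is reset to the target
        phase = np.exp(1j * np.angle(Z[:, :mag.shape[1]]))
    _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
    return x
```

Each iteration projects between the set of spectrograms with the target magnitude and the set of consistent STFTs, which is why the reconstruction error of the magnitude decreases over the iterations.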

In all previous phase estimators, the underlying relation between harmonic phases, i.e. the phase structure across harmonics, is either not directly taken into account (Krawczyk, Gerkmann, 2014, Kulmer, Mowlaee, 2015a, Kulmer, Mowlaee, 2015b) or relies only on the phase of the fundamental frequency used as the reference (Mowlaee, Saeidi, 2014, Mowlaee, Kulmer, 2015b, Mowlaee, Kulmer, 2015a). For example, the geometry-based phase estimator with the RPS constraint (Mowlaee and Saeidi, 2014) takes the relation between the harmonic phases and the fundamental frequency phase into account. Smoothing across RPS has also been considered in Mowlaee and Kulmer (2015b). In these methods, the phase estimation performance depends on the accuracy of the fundamental frequency phase, which in turn depends on the fundamental frequency estimation accuracy. This limits the performance in low-frequency noise scenarios. Furthermore, the underlying phase structure across harmonics is not taken into account; therefore, the harmonic phases are estimated independently.

In this paper, we argue that the two aforementioned issues, (i) relying on the fundamental frequency phase and (ii) neglecting the phase structure across harmonics in the speech signal, limit the achievable performance of existing spectral phase estimators. Therefore, we propose new phase estimators that rely on the inter-component phase relations (ICPR) of a polyharmonic signal such as speech. In our earlier publication (Pirolt et al., 2017), we reported preliminary results on the usefulness of the phase quasi-invariance constraint for phase estimation and speech enhancement. Here, we present the ICPR in detail for a polyharmonic signal (here, speech) and report their usefulness in speech enhancement for different noise scenarios. The three phase relations are: Phase Invariance (PI), Phase Quasi-Invariance (PQI), and Bi-Phase (see Section 2 for an overview). We apply these phase relations as constraints to derive harmonic phase estimators. The derived estimators are then applied to speech enhancement, whereby a phase-enhanced speech signal is obtained. Throughout the experiments, we demonstrate that the newly derived phase estimators yield improved perceived quality and speech intelligibility and a lower phase estimation error compared with the benchmark methods.

The rest of the paper is organized as follows. Section 2 presents background on the ICPR of polyharmonic signals in general, focusing on three phase relations: Phase Invariance, Phase Quasi-Invariance and Bi-Phase. In Section 3, we detail the proposed phase estimators relying on each of the three constraints (PI, PQI and Bi-Phase). Section 4 presents proof-of-concept experiments and speech enhancement results: a comparative study of phase estimation performance and of the achievable speech enhancement results versus the relevant benchmark methods, followed by a discussion. Section 5 concludes the work.

Section snippets

Background on inter-component phase relations in polyharmonic signals

In this section, we review the theory and applications of phase processing techniques that exploit the following underlying principle: the parameters of a particular harmonic are considered in relation to the parameters of other harmonics of the same oscillation process. This principle provides a basis for a number of inter-component phase processing methods and reveals special properties of signals that conventional magnitude and power spectrum analysis methods fail to capture.
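A toy numerical sketch of this principle (illustrative values, not taken from the paper): for a harmonic signal with fundamental f0, the phase of harmonic k measured from a frame starting at sample s is φk + 2πk·f0·s/fs, so each individual phase rotates with the frame position. Combinations whose harmonic numbers cancel, such as the bi-phase φ1 + φ2 − φ3 (since f1 + f2 = f3) and the invariant φ1 + φ3 − 2φ2 (since f1 + f3 = 2f2), stay constant:

```python
import numpy as np

fs, f0, N = 8000, 200.0, 400          # demo values: 400 samples = 10 pitch periods
phi = np.array([0.3, 1.1, -0.7])      # arbitrary initial phases of harmonics 1..3
t = np.arange(fs) / fs
x = sum(np.cos(2 * np.pi * k * f0 * t + phi[k - 1]) for k in (1, 2, 3))

def harmonic_phases(start):
    """Phases of harmonics 1..3 measured from one rectangular-windowed frame."""
    X = np.fft.rfft(x[start:start + N])
    bins = [round(k * f0 * N / fs) for k in (1, 2, 3)]   # exact bins 10, 20, 30
    return np.angle(X[bins])

for start in (0, 137, 1024):          # arbitrary frame positions
    p1, p2, p3 = harmonic_phases(start)
    # individual phases rotate with the frame position ...
    # ... but combinations whose harmonic numbers cancel do not:
    biphase = np.angle(np.exp(1j * (p1 + p2 - p3)))       # f1 + f2 = f3
    invariant = np.angle(np.exp(1j * (p1 + p3 - 2 * p2))) # f1 + f3 = 2*f2
    print(f"{start:5d}  p1={p1:+.3f}  biphase={biphase:+.3f}  PI={invariant:+.3f}")
```

Running this prints a different p1 for every frame position while biphase and PI stay fixed, which is exactly the kind of structure that magnitude-only analysis cannot see.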

Proposed phase estimators

In this section, we present the proposed phase estimators relying on the ICPR. Fig. 1 shows the block diagram of the speech enhancement setup that uses the proposed phase estimation framework.

Let x(n) and y(n) denote the clean and noisy signal, respectively, in time domain. The noisy signal y(n)=x(n)+ν(n) represents a mixture of the
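To connect this signal model with the ICPR idea, the following toy sketch shows how a time-invariant phase combination can be denoised by temporal smoothing. The noise level, frame parameters and the use of a circular mean are illustrative assumptions, not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
fs, f0, N = 8000, 200.0, 400                     # demo values
phi = np.array([0.3, 1.1, -0.7])                 # arbitrary clean harmonic phases
t = np.arange(2 * fs) / fs
clean = sum(np.cos(2 * np.pi * k * f0 * t + phi[k - 1]) for k in (1, 2, 3))
noisy = clean + 0.7 * rng.standard_normal(clean.size)   # y(n) = x(n) + v(n)

def biphase(start):
    """Noisy bi-phase estimate from one frame (harmonics on exact bins 10/20/30)."""
    X = np.fft.rfft(noisy[start:start + N])
    p1, p2, p3 = np.angle(X[[10, 20, 30]])
    return np.angle(np.exp(1j * (p1 + p2 - p3)))

frames = np.array([biphase(s) for s in range(0, noisy.size - N, N)])
# circular mean over frames: valid because the clean bi-phase is time-invariant,
# so averaging only reduces the noise-induced variance
smoothed = np.angle(np.mean(np.exp(1j * frames)))
true_bp = np.angle(np.exp(1j * (phi[0] + phi[1] - phi[2])))  # clean value
```

The per-frame estimates scatter around the clean bi-phase, while the circular mean lands much closer to it; estimators built on such invariant combinations can therefore pool information across frames without tracking the rotating phase of each harmonic.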

Experiment setup

We randomly chose 50 utterances from 20 speakers (10 male and 10 female) of the GRID corpus (Cooke et al., 2006). The utterances were corrupted by the following noise types: white, babble and factory noise taken from the NOISEX-92 database (Varga et al., 1992), and car and street noise taken from the NOIZEUS database (Hu and Loizou, 2007). The SNR levels ranged from 5 to 10 dB in 5 dB steps.

In order to quantify the error introduced between the estimated versus the clean ICPR values, we define
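The definition itself is truncated in this snippet. As a generic illustration only (an assumption, not necessarily the paper's measure), a phase error must be computed on wrapped differences, e.g.:

```python
import numpy as np

def wrapped_phase_error(est, ref):
    """Mean absolute phase difference, wrapped to (-pi, pi]."""
    d = np.angle(np.exp(1j * (np.asarray(est) - np.asarray(ref))))
    return float(np.mean(np.abs(d)))

# wrapping matters near +/-pi: the raw difference below is 2*pi - 0.2,
# but the wrapped error is only 0.2
print(wrapped_phase_error([np.pi - 0.1], [-np.pi + 0.1]))  # ~0.2
```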

Conclusion

In this paper, after an overview of the importance of spectral phase estimation and of recent phase estimation methods in speech signal processing, we proposed new harmonic phase estimators for noisy speech that rely on the relations between the harmonics of a polyharmonic signal such as speech. The phase structure was defined by three constraints: Phase Invariance, Phase Quasi-Invariance, and Bi-Phase. These constraints were used to derive estimators for the clean spectral phase from a noisy

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments that helped to considerably improve this paper. The work of Pejman Mowlaee was supported by the Austrian Science Fund (project number P28070-N33).

References (71)

  • G.N. Bochkov et al.

    A synthesis approach of bi-spectral organized signals (in Russian)

    Tech. Phys. Lett.

    (1995)
  • B. Boyanov et al.

    Analysis of voiced speech by means of bispectrum

    Electron. Lett.

    (1991)
  • C. Chacon et al.

    Least squares phase estimation of mixed signals

    Proceedings of the International Speech Communication Association Interspeech

    (2014)
  • M. Cooke et al.

    An audio-visual corpus for speech perception and automatic speech recognition

    J. Acoust. Soc. Am.

    (2006)
  • G. Degottex et al.

    A measure of phase randomness for the harmonic model in speech synthesis

    Proceedings of the International Speech Communication Association Interspeech

    (2014)
  • G. Degottex et al.

    A uniform phase representation for the harmonic model in speech synthesis applications

    EURASIP J. Audio Speech Music Process.

    (2014)
  • J. Deng et al.

    Exploitation of phase-based features for whispered speech emotion recognition

    IEEE Access

    (2016)
  • Y. Ephraim et al.

    Speech enhancement using a minimum mean square error log-spectral amplitude estimator

    IEEE Trans. Acoust. Speech Signal Process.

    (1985)
  • F. Espic et al.

    Direct modelling of magnitude and phase spectra for statistical parametric speech synthesis

    Proceedings of the International Speech Communication Association Interspeech

    (2017)
  • J. Fahringer et al.

    Phase-aware signal processing for automatic speech recognition

    Proceedings of the International Speech Communication Association Interspeech

    (2016)
  • R. Fulchiero et al.

    Speech enhancement using the bispectrum

    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing

    (1993)
  • V.G. Nebabin

    Methods and Techniques of Radar Recognition (an English translation of a book originally published in Russian in 1984)

    (1994)
  • A. Gaich et al.

    On speech intelligibility estimation of phase-aware single-channel speech enhancement

    Proceedings of the International Speech Communication Association Interspeech

    (2015)
  • Y. Galayev et al.

    Phase invariant method in radio-wave propagation experiments (in Russian)

    Prikladnaja Radioelektronika, Kharkiv National University of Radioelectronics, Kharkiv, Ukraine

    (2009)
  • A.M. Gavrilov

    Phase Related Processes of Nonlinear Acoustics: Modulated Waves (in Russian)

    (2009)
  • T. Gerkmann et al.

    Unbiased MMSE-based noise power estimation with low complexity and low tracking delay

    IEEE Trans. Audio Speech Lang. Process.

    (2012)
  • T. Gerkmann et al.

    Phase processing for single-channel speech enhancement: history and recent advances

    IEEE Signal Process. Mag.

    (2015)
  • S. Gonzalez et al.

    PEFAC - a pitch estimation algorithm robust to high levels of noise

    IEEE Trans. Audio Speech Lang. Process.

    (2014)
  • D. Griffin et al.

    Signal estimation from modified short-time Fourier transform

    IEEE Trans. Acoust. Speech Signal Process.

    (1984)
  • R.C. Hendriks et al.

    DFT-domain based single-microphone noise reduction for speech enhancement

    Synthesis Lectures on Speech and Audio Processing

    (2013)
  • K. Itoh

    Analysis of the phase unwrapping algorithm

    Appl. Opt.

    (1982)
  • S.M. Kay

    Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory

    (1993)
  • M. Koutsogiannaki et al.

    The importance of phase on voice quality assessment

    Proceedings of the International Speech Communication Association Interspeech

    (2014)
  • M. Krawczyk et al.

    STFT phase reconstruction in voiced speech for an improved single-channel speech enhancement

    IEEE Trans. Audio, Speech Lang. Process.

    (2014)