Estimation of pitch period of speech signal using a new dyadic wavelet algorithm

doi:10.1016/S0020-0255(99)00055-9

Information Sciences

Volume 119, Issues 1–2, 1 October 1999, Pages 21-39

https://doi.org/10.1016/S0020-0255(99)00055-9

Abstract

An algorithm based on the dyadic wavelet transform (DyWT) has been developed for detecting the pitch period. The pitch period is regarded as an important parameter in designing and developing automatic speaker recognition/identification systems. In this paper, we have developed two methods for detecting the pitch period of synthetic signals. In the first method, we estimated the pitch period from the original signal by calculating reasonably accurate values for the dilation parameter L. In the second method, the pitch period was estimated from the power spectrum of the signal. Several experiments were performed, under noisy and ideal environmental conditions, to evaluate the accuracy and robustness of the proposed methodology. The experiments showed that the proposed techniques were successful in estimating pitch periods. The best-case estimation accuracy was found to be approximately 100%.

Introduction

The aim of a speaker recognition system is to determine the identity of the speaker. In a generic speaker recognition system, the desired features are first extracted from the speech signal. The extracted features are then used as input to another sub-system, which makes the decision regarding the verification or identification of the speaker. Feature extraction consists of extracting characteristic parameters of a signal to be used for speaker recognition, and the extraction of salient features is one of the most important steps in solving the problem. The goal is to extract features that remain invariant for a given speaker while staying distinct from the features of an impostor.

Traditional speaker identification techniques, such as the autocorrelation or cepstrum-based methods, fail to provide accurate results because of the wide range of variations present in real speech signals [1]. The pitch period is considered an important parameter that can be used reliably for the identification of the speaker [2], [3]. Therefore, good performance of an automatic speaker recognition system is strongly related to the accuracy and reliability with which the pitch periods of the speech signals can be detected and estimated.

The detection of the pitch period is the most important task in any automatic speech signal analysis. Once the pitch period has been identified, a more detailed examination of the speech signal can be performed. The algorithms for pitch detection can generally be divided into two categories: (1) event detection based algorithms, and (2) non-event detection based algorithms. The event detection method is based on calculating the autocorrelation function of a signal that exhibits the same periodicity as the signal to be analyzed. The disadvantage of the autocorrelation method is that it is unsuitable for non-stationary signals and is computationally complex. The non-event based pitch detectors are computationally simple compared to the event based detectors, but are insensitive to pitch period variations during the measurement interval. They are also not suitable for a wide range of speakers.
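For orientation, the conventional autocorrelation approach mentioned above can be sketched as follows. This is a minimal illustration of the baseline technique, not the method proposed in this paper; the function name `autocorr_pitch_period`, the search range `fmin`/`fmax`, and the two-harmonic test signal are assumptions made for the example.

```python
import numpy as np

def autocorr_pitch_period(x, fs, fmin=50.0, fmax=500.0):
    """Estimate the pitch period (in seconds) from the autocorrelation peak.

    fmin and fmax bound the pitch search range in Hz (assumed values).
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # Full autocorrelation; keep only the non-negative lags.
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_min = int(fs / fmax)
    lag_max = min(int(fs / fmin), len(r) - 1)
    # The lag of the largest autocorrelation value inside the search range.
    lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    return lag / fs

# Hypothetical test signal: 100 Hz fundamental plus one harmonic, 0.1 s long.
fs = 8000
t = np.arange(800) / fs
x = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)
period = autocorr_pitch_period(x, fs)
```

Because the whole frame contributes to a single estimate, this kind of estimator returns one period per analysis window, which is why it cannot follow period-to-period variations inside the window.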

Since the glottal closure is marked by a sharp discontinuity in the speech signal, it can in some sense be related to the edge detection problem in image processing. A procedure for obtaining an optimal edge detector was provided by Canny [4]. In subsequent work, Mallat [5] showed that multiscale Canny edge detection is equivalent to finding the local maxima of a wavelet transform. Kadambe and Boudreaux-Bartels [6] recognized the similarity between edge detection in image processing and event based pitch detection in speech recognition. They developed a wavelet based scheme for pitch detection in speech recognition, and showed that the wavelet based method is superior to the traditional pitch estimation techniques. Obaidat et al. [7] evaluated the performance of the Gaussian Window (GW), first derivative Gaussian (DG), Modulated Gaussian (MG), and one-sided exponential window (EW) wavelets. They observed that the DG wavelet gives the best estimation accuracy for the pitch period of speech signals. The wavelet transform is a very promising technique for time-frequency analysis. It maps a signal from its domain (time or spatial) to another domain using a set of special signals called wavelets (little waves) [8]. Several different types of wavelets have been applied to solve problems in many areas of science and engineering [9].

In this paper, we present an algorithm based on the dyadic wavelet transform for detecting the pitch period of synthetic speech signals under ideal and noisy conditions (see Fig. 1). The dyadic wavelet transform (DyWT) is a scale-discretized version of the continuous wavelet transform (CWT), which was previously used to analyze phonocardiogram signals [10], [11]. The next section of this paper contains the theory underlying the algorithmic steps and the criterion for estimating and selecting wavelet dilation parameters. Results and the relevant discussion are given in Section 3 of the paper. Finally, conclusions are presented in Section 4.

Section snippets

Wavelet transform

The wavelet transform of a signal $f(n)$ is defined as: $W_{s} f(n)=f(n)\otimes\Psi_{s}(n)=\frac{1}{s}\int_{-\infty}^{\infty} f(t)\,\Psi\left(\frac{n-t}{s}\right) dt$ where $s$ is the scale factor and $\Psi_{s}(n)=\frac{1}{s}\Psi\left(\frac{n}{s}\right)$ is the dilation of a basic wavelet $\Psi(n)$ by the scale factor $s$ (where $s=2^{L}$).
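The definition above amounts to convolving the signal with a dilated wavelet at each dyadic scale $s=2^{L}$. The sketch below illustrates this with NumPy. It assumes a sampled first-derivative-of-Gaussian (DG) wavelet, the type reported in [7] as most accurate; the wavelet actually used by the authors is not stated in this excerpt, and the names `dg_wavelet`, `dywt` and the truncation parameter `half_width` are hypothetical.

```python
import numpy as np

def dg_wavelet(s, half_width=4.0):
    """Sampled DG wavelet dilated by s, i.e. (1/s) * psi(t/s).

    psi(u) = -u * exp(-u^2 / 2); the support is truncated at
    +/- half_width * s samples (an assumption for this sketch).
    """
    k = int(half_width * s)
    u = np.arange(-k, k + 1) / s
    return -u * np.exp(-0.5 * u**2) / s

def dywt(f, levels):
    """Return {L: W_s f with s = 2**L}, computed as the discrete convolution f * psi_s.

    The signal is assumed to be longer than the longest wavelet, so that
    mode="same" keeps each output aligned with the input samples.
    """
    f = np.asarray(f, dtype=float)
    return {L: np.convolve(f, dg_wavelet(2.0**L), mode="same") for L in levels}

# Hypothetical input: a unit step at sample 100, a crude stand-in for the
# sharp discontinuity at glottal closure.
f = np.zeros(200)
f[100:] = 1.0
W = dywt(f, levels=[1, 2, 3])
step_locations = {L: int(np.argmax(w)) for L, w in W.items()}
```

With this wavelet, the transform of a step has its maximum at the location of the step at every scale, which is the property the pitch detector relies on.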

We can write the function $f(n)$ as $f(n)=\frac{1}{c}\iint W_{s}f(t)\,\frac{1}{s}\,\Psi\left(\frac{n-t}{s}\right) dt\,\frac{ds}{s^{2}}$. The function $\Psi(n)$ can be considered a wavelet if it satisfies the following admissibility condition: $\int \Psi(n)\,dn=0.$ In our work we consider wavelets with compact support and vanishing moments because of their many advantages.
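As a quick check of the admissibility condition, consider the first derivative of a Gaussian, the DG wavelet mentioned in the Introduction. The unit-variance form below is an assumption chosen for illustration.

```latex
\Psi(n) = \frac{d}{dn}\, e^{-n^{2}/2} = -n\, e^{-n^{2}/2},
\qquad
\int_{-\infty}^{\infty} \Psi(n)\, dn
  = \Big[\, e^{-n^{2}/2} \,\Big]_{-\infty}^{\infty} = 0 .
```

The DG wavelet is not compactly supported, so this example illustrates only the zero-mean condition, not the compact-support property preferred in this work.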

Relation between the characteristic points and WT

According to the

Results and discussion

The DyWT was computed at scales $L+1$, $L+2$ and $L+3$ for both schemes. The first scheme estimates the pitch period from the original signal by calculating the values of the dilation parameter L with reasonable accuracy. The second scheme estimates the pitch period from the power spectrum of the signal. We detected the instant at which the glottis closes by locating the local maxima of the DyWT (a local maximum is retained only if it exceeds the threshold, which is equal to $0.3 \times \text{global maximum}$). We then estimated the pitch period
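The thresholding step described here can be sketched as follows. The 0.3 factor comes from the text; how the period is then derived from the detected instants is not given in this excerpt, so the sketch assumes it is the spacing between consecutive maxima and, as a further assumption, takes the median spacing. The function names and the synthetic input are hypothetical.

```python
import numpy as np

def glottal_closure_instants(w, rel_threshold=0.3):
    """Indices of local maxima of w that exceed rel_threshold * global maximum."""
    w = np.asarray(w, dtype=float)
    threshold = rel_threshold * w.max()
    mid = w[1:-1]
    # A sample is a peak if it rises above its left neighbour, does not fall
    # below its right neighbour, and exceeds the threshold.
    is_peak = (mid > w[:-2]) & (mid >= w[2:]) & (mid > threshold)
    return np.flatnonzero(is_peak) + 1

def pitch_period(w, fs, rel_threshold=0.3):
    """Median spacing between detected instants, in seconds (assumed estimator)."""
    gci = glottal_closure_instants(w, rel_threshold)
    if len(gci) < 2:
        return None
    return float(np.median(np.diff(gci))) / fs

# Hypothetical DyWT output: strong maxima every 80 samples, plus weaker
# maxima in between that fall below the 0.3 threshold.
fs = 8000
w = np.zeros(800)
w[40::80] = 1.0
w[80::80] = 0.2
gci = glottal_closure_instants(w)
T = pitch_period(w, fs)
```

In this synthetic case the weaker maxima are rejected by the threshold, so only the strong maxima spaced 80 samples apart contribute to the estimate.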

Conclusion

In this paper, we presented algorithms based on the WT for the detection of pitch points of speech signals. The algorithms were evaluated on several synthetic speech signals. The performance of the algorithms was found to be excellent even in a noisy environment. It was observed that the DyWT of the power spectrum of the speech signal provided accurate estimates of the pitch period for signals corrupted by white Gaussian noise. Also, our approach was found to be successful in calculating the scale L

Acknowledgments

This work was supported by a grant from the Department of Defense (DoD) under contract No. NSA-0-96-5.

References (11)

S. Kadambe, The application of time-frequency and time-scale representations in speech analysis, Ph.D. Thesis,...
W. Hess, Pitch Determination of Speech Signals: Algorithms and Devices, Springer, Berlin,...
L.R. Rabiner, R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, NJ,...
J. Canny, A computational approach to edge detection, IEEE Trans. PAMI (1986)
S. Mallat et al., Characterization of signals from multiscale edges, IEEE Trans. PAMI (1992)

There are more references available in the full text version of this article.
