Complex-Domain Pitch Estimation Algorithm for Narrowband Speech Signals

We propose a complex-domain pitch estimation algorithm for narrowband speech signals, which utilizes a complex spectrum containing both amplitude and phase spectrum information. Traditional frequency-domain pitch estimation algorithms assume that a speech signal has a harmonic structure; they estimate a pitch by calculating the distance between adjacent peaks of the amplitude spectrum corresponding to harmonics. However, only a few peaks can be detected from narrowband speech signals because of their limited bandwidth, resulting in pitch estimation errors. In this article, phase differences between harmonics are utilized as an additional cue for pitch estimation. The phase difference between harmonics refers to the two-step phase difference between successive analysis frames and between the lowest-order harmonic and other harmonics, which is theoretically derived using the pitch. When the phase spectrum for a higher-order harmonic is shifted by the theoretical value, it agrees with the phase spectrum for the lowest-order harmonic. Therefore, for each pitch candidate, the proposed method calculates the shifted phase spectra and combines them with the amplitude spectrum to generate complex spectra. When a pitch candidate is correct, the complex spectra add in the same direction in the cumulative sum, which emphasizes harmonics even for narrowband speech signals. Results of the objective evaluation show that the proposed method accurately estimates the pitch from narrowband speech signals.


I. INTRODUCTION
In a public switched telephone network, speech signals are encoded using speech coding techniques to enable communication with low latency. Narrowband speech coding techniques such as G.711 [1] and Adaptive Multi-Rate [2] limit the bandwidth of the decoded speech signal to a narrowband of 300-3400 Hz, whereas wideband speech coding techniques such as Enhanced Voice Services [3] limit it to a wideband of 50-14400 Hz. This bandwidth limitation reduces speech intelligibility on the phone when listeners try to understand unknown words or names, particularly for unvoiced or plosive utterances [4]. Consequently, narrowband speech provides poorer syllable intelligibility than wideband speech. Old public switched telephone networks, where system renovation is impractical because of the effort and cost required, may only support narrowband speech coding techniques [5]. Artificial bandwidth extension (ABE) is a speech enhancement approach for narrowband speech signals that reconstructs the missing lower spectrum (0-300 Hz) and upper spectrum (3400-7000 Hz) from the existing narrowband spectrum of 300-3400 Hz [6], [7], [8], [9], [10], [11], [12]. Because of the missing lower-order harmonics, the narrowband speech signal has only a partial harmonic structure. ABE for the missing lower spectrum reconstructs the harmonic structure by generating multiple sinusoidal waves corresponding to the missing lower-order harmonics; this is referred to as sinusoidal synthesis [10], [11], [12]. A fundamental frequency (or pitch) is critical for sinusoidal synthesis to accurately reconstruct the harmonic structure. Whereas the fundamental frequency is a physical property of the speech signal, corresponding to the lowest-order harmonic, pitch is the perceived property corresponding to the relative highness of the sound. In speech signal processing, fundamental frequency and pitch are treated as synonyms; in this article, the term 'pitch' is used.
For narrowband speech coding techniques [1], a pitch must be estimated from the narrowband speech signal. Because artifacts such as human voices and car sounds may be mixed in the narrowband speech signal, pitch estimation algorithms should be robust to noise.
Parametric pitch estimation algorithms estimate a pitch using a harmonic model without training predictive models [19], [20], [21], [22], [23]. The harmonic model represents the speech signal as the sum of sinusoidal waves, with a harmonic structure, and noise. The most appropriate model parameters are calculated using nonlinear least squares (NLS) [19], [20]. In addition, a pitch tracking method that considers the temporal smoothness of the model parameters can reduce the pitch estimation error in noisy environments [21], [22]. However, because parametric pitch estimation algorithms have focused on speech signals with a full harmonic structure, they have difficulty accurately estimating the model parameters from narrowband speech signals with a partial harmonic structure. A parametric pitch estimation algorithm for speech signals with a partial harmonic structure has also been devised [23], but iterative updates are required to estimate the model parameters. The calculation time must be reduced as much as possible because public switched telephone networks require real-time processing.
Frequency-domain pitch estimation algorithms estimate a pitch from the amplitude spectrum using the harmonic summation algorithm with a short processing time [24], [25], [26], [27], [28]. Since harmonics correspond to peaks of the amplitude spectrum, a pitch is defined as the distance between adjacent peaks. The harmonic summation algorithm calculates this distance using the cumulative sum of the amplitude spectra at frequency indices corresponding to harmonics [24], [25], [26]. The summation of residual harmonics (SRH) algorithm [25] suppresses the effects of vocal tract resonances and noise using the flat amplitude spectrum of the excitation signal. In addition, the pitch estimation error can be reduced using the Viterbi algorithm, which considers the pitch transition from past analysis frames [27], [28]. However, only a few peaks can be detected from narrowband speech signals due to their limited bandwidth, and therefore overtones or subharmonics may be obtained; these are called 'octave errors.' Octave errors are an inherent problem of pitch estimation algorithms, even for wideband speech signals, and they are observed especially in narrowband speech signals with missing lower harmonics. Thus, it is difficult to estimate the pitch from narrowband speech signals using only the amplitude spectrum.
We have focused on the phase spectrum as another cue for pitch estimation for narrowband speech signals. The phase spectrum has not received as much attention in speech signal processing as the amplitude spectrum. Recently, it has been reported that the phase spectrum is also closely related to the harmonic structure, using the short-time Fourier transform (STFT) representation [35]. Phase-aware speech enhancement methods have been devised based on models that separate the phase spectrum into linear and unwrapped phases [36], [37] and approximate a speech signal using the sum of sinusoidal waves with a harmonic structure [43], [44]. Pitch estimation algorithms using the phase spectrum have also been devised, which utilize the phase difference between successive analysis frames [29], [30] and group delay [30], [31], [32], [33]. Note that harmonics are the key to pitch estimation, as is the case in frequency-domain pitch estimation algorithms. Our previous study [34] utilized the phase difference between harmonics, which is defined as the two-step phase difference between successive analysis frames and between the lowest-order harmonic and other harmonics. While group delay, which is useful for speech signal processing, differentiates the phase spectrum with respect to frequency, the phase difference between harmonics focuses only on harmonics and is theoretically derived using the pitch. By checking whether the phase differences between harmonics agree with the theoretical value for each pitch candidate, the pitch can be obtained even when only a few peaks can be detected on the amplitude spectrum. However, the phase spectrum may rapidly fluctuate at the beginning and end of voiced active frames or deteriorate in severely noisy environments, resulting in pitch estimation errors.
In this article, we propose a complex-domain pitch estimation algorithm for narrowband speech signals. The word "complex-domain" denotes a complex-valued mathematical representation such as a complex spectrum containing both amplitude and phase spectrum information, as utilized in other studies [38], [39]. The proposed method is inspired by the idea that the amplitude spectrum can be interpreted as a complex spectrum whose phase spectrum is all zeros. That is, frequency-domain pitch estimation algorithms using the amplitude spectrum, which work stably even at the beginning and end of voiced active frames in noisy environments, have degrees of freedom for the phase spectrum. The proposed method thus introduces phase differences between harmonics as an additional cue for pitch estimation. Our previous study showed that the phase difference between harmonics can be derived using the pitch [34]. When the phase spectrum for a higher-order harmonic is shifted by the theoretical value, it agrees with the phase spectrum for the lowest-order harmonic. Therefore, for each pitch candidate, the proposed method calculates the shifted phase spectra and combines them with the amplitude spectrum to generate complex spectra. When a pitch candidate is correct, the complex spectra add in the same direction in the cumulative sum, emphasizing harmonics even for narrowband speech signals. Finally, a pitch is obtained using the Viterbi algorithm with the cumulative sum. The proposed method can be applied to wideband speech signals as well as narrowband speech signals; pitch estimation using both the amplitude and phase spectra is especially effective for narrowband speech signals with poor pitch estimation cues.
The proposed method is related to several prior works [40], [41], [42]. Das et al. developed a complex-domain pitch estimation algorithm using the extended complex Kalman filter, which requires prior knowledge of the type of signal [40], [41]. Drugman et al. estimated the spectral boundary separating periodic and aperiodic components from the amplitude and phase spectra, because the phase spectrum conveys relevant information about harmonics [42]. The contributions of this work are summarized as follows:
- We propose a complex-domain pitch estimation algorithm using amplitude and phase spectra without prior knowledge.
- Experimental results show that the proposed method is suitable for narrowband speech signals regardless of gender and suppresses pitch estimation errors even in noisy environments at 0 dB.
- We show that the computational complexity of the proposed method is comparable to that of online pitch estimation algorithms.

II. RELATIONSHIP BETWEEN THE PHASE SPECTRUM AND HARMONICS
This section discusses the relationship between the phase spectrum and harmonics using the STFT representation. In this article, the same STFT representation as in phase estimation methods based on a sinusoidal model [43], [44] is used. The sinusoidal model assumes that a speech signal can be represented as the sum of sinusoidal waves with a harmonic structure. Let n denote a sample index. A speech signal under the sinusoidal model is written as

s(n) = \sum_{h=1}^{H} 2 A_h(n) \cos\left( \frac{2\pi h f(n)}{F_s} n + \Omega_h \right),  (1)

where H is the number of harmonics, 2A_h(n) is a real-valued amplitude, f(n) is a pitch, F_s is the sampling rate, and Ω_h is the initial phase. With a frequency index k, the STFT representation at the l-th frame is defined as

S_l(k) = \sum_{n=0}^{N-1} w(n) s(n + lM) e^{-j 2\pi k n / N},  (2)

where N is the number of samples to be analyzed, M is the hop size, and w(n) is the analysis window. The pitch and amplitude may not always be stable, particularly in female speech toward the end of a word or phrase. Nonetheless, the speech signal can be considered quasi-stationary within a short analysis frame of 30-80 ms, where the amplitude and pitch are nearly constant [26], [27]. In this work, the analysis frame is set to 40 ms; thus, the assumption is

A_h(n + lM) \approx A_{h,l},  f(n + lM) \approx f_l  (0 \le n < N),

so that substituting (1) into (2) yields the STFT representation of the harmonic model, denoted (3). The STFT representation analyzes a signal using band-pass filters with N frequency bands, which are determined by the analysis window. As seen in (3), the output of each band-pass filter contains the analysis results of the H harmonic components. It is assumed that the frequency resolution of the STFT representation is sufficiently high that only a single harmonic is present in a given frequency index. A frequency index can take only integer values, whereas harmonic frequencies are generally non-integer. Following the traditional methods [43], [44], the nearest frequency index to the hth (h = 1, ..., H) harmonic is selected as

k_h = \mathrm{round}\left( N h f_l / F_s \right),  (4)

where round(·) denotes rounding to the nearest integer. In addition, it is assumed that the sideband attenuation of the band-pass filters is sufficiently large that spectral leakage can be neglected.
With these assumptions, the STFT representation of the hth harmonic in (3) reduces to

S_l(k_h) \approx A_{h,l} W(k_h - N h f_l / F_s) e^{j \left( 2\pi h f_l l M / F_s + \Omega_h \right)},  (7)

where W(k) denotes the discrete Fourier transform of the analysis window. Letting φ_W(k) denote the phase spectrum of W(k), the phase spectrum at the harmonic is

φ_l(k_h) = 2\pi h f_l l M / F_s + \Omega_h + φ_W(k_h - N h f_l / F_s),  (8)

i.e., the phase spectrum at the harmonic shifts over time and includes the initial phase and a window-dependent term. Let us focus on the phase difference between successive frames. Because the pitch changes slowly within voiced active frames, it can be approximated as

φ_{l+1}(k_h) - φ_l(k_h) \approx 2\pi h f_l M / F_s.  (9)

This phase difference is theoretically derived using the pitch and does not depend on the initial phase or the window function. The presence of the harmonic structure can therefore be determined from the continuity of the phase spectrum based on this phase difference. However, the phase spectrum is unstable under noise, so the phase difference between the lowest-order harmonic and the other harmonics is further calculated. Using (9), the phase difference between harmonics is defined as

Δφ_l(k_h) = \left[ φ_{l+1}(k_h) - φ_l(k_h) \right] - \left[ φ_{l+1}(k_1) - φ_l(k_1) \right].  (10)

Substituting (9) into (10) gives the relationship between the phase spectrum and harmonics,

Δφ_l(k_h) \approx ρ(h),  (11)

with the phase lead

ρ(h) = 2\pi (h - 1) f_l M / F_s.  (12)

Fig. 1 shows the relationship between the phase spectrum and harmonics. For the correct pitch candidate, the phase spectra of the higher-order and the lowest-order harmonics exhibit a linear relationship. The validity of the phase difference between harmonics for pitch estimation is discussed in the following section.
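The inter-frame phase relation can be checked numerically. The sketch below, with illustrative parameter values and a synthetic stationary harmonic signal (an assumption for this demonstration, not the article's data), verifies that the STFT phase at the bin nearest the hth harmonic advances by about 2πhf_lM/F_s between successive frames:

```python
import numpy as np

# Numerical check of the inter-frame phase-difference relation: for a
# stationary harmonic signal, the STFT phase at the bin nearest the h-th
# harmonic advances by about 2*pi*h*f0*M/Fs between successive frames
# (modulo 2*pi). Parameter values are illustrative.
Fs, N, M = 8000, 320, 80          # sampling rate, frame length, hop size
f0, H = 200.0, 3                  # pitch and number of harmonics
n = np.arange(N + M)
x = sum(np.cos(2 * np.pi * h * f0 * n / Fs + 0.3 * h) for h in range(1, H + 1))
w = np.hanning(N)
S0 = np.fft.rfft(w * x[:N])       # frame l
S1 = np.fft.rfft(w * x[M:M + N])  # frame l + 1

devs = []
for h in range(1, H + 1):
    k = int(round(N * h * f0 / Fs))           # bin nearest the h-th harmonic
    measured = np.angle(S1[k]) - np.angle(S0[k])
    predicted = 2 * np.pi * h * f0 * M / Fs
    # wrap the deviation into [-pi, pi) before comparing
    devs.append((measured - predicted + np.pi) % (2 * np.pi) - np.pi)
print(np.max(np.abs(devs)))                   # close to zero
```

Note that the initial phases (0.3·h here) cancel in the difference, as the derivation predicts.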

III. PHASE DIFFERENCES BETWEEN HARMONICS
This section discusses the validity of the phase difference between harmonics for pitch estimation. In a natural environment, a speech signal can suffer from noise. In addition, narrowband speech signals are obtained through a band-pass filter whose cutoff frequencies are 300 and 3400 Hz. A narrowband speech signal is defined as

x(n) = g\left( s(n) + e(n) \right),  (13)

where e(n) is an additive noise signal and g(·) is a band-pass filtering function. Here, the sampling delay of the band-pass filtering has been compensated.
The jth pitch candidate is represented as f^{(j)} (j ∈ P), where P denotes the set of indices of the pitch candidates. In this article, the pitch lies within 50-400 Hz. When a pitch candidate is correct, the phase spectra for the higher-order harmonics, shifted based on the phase difference between harmonics, agree with the phase spectrum for the lowest-order harmonic. Note that a narrowband speech signal has lost several lower-order harmonics due to its limited bandwidth. The frequency index containing the hth harmonic for the jth pitch candidate in the narrowband is therefore redefined as

k_h^j = \mathrm{round}\left( N (h + \tilde{f}^{(j)}) f^{(j)} / F_s \right),  (14)

where \tilde{f}^{(j)} = \lfloor 300 / f^{(j)} \rfloor is the order of the highest harmonic in the missing lower band and \lfloor \cdot \rfloor denotes the floor function. For notational simplicity, we write k_h^j as k*. Since harmonics possess greater amplitude than other frequencies, phase spectra encompassing them exhibit a high signal-to-noise ratio (SNR). Indeed, phase enhancement methods [43], [44] reconstruct the degraded phase spectrum from phase spectra encompassing harmonics even in noisy environments at 0 dB. Based on this knowledge, we develop a robust pitch estimation algorithm using phase spectra encompassing harmonics. Let φ_l(k) be the phase spectrum of the narrowband speech signal. The phase spectrum for the harmonic is then given as

φ_l(k*) = 2\pi (h + \tilde{f}^{(j)}) f_l l M / F_s + \Omega_h + φ_W(k* - N (h + \tilde{f}^{(j)}) f_l / F_s) + φ_g(k*),  (15)

where φ_g(k) denotes the phase shift of the band-pass filtering. Using (11) and (12), the shifted phase spectrum based on the phase difference between harmonics is generated for each pitch candidate as

\hat{φ}_l(k*) = φ_l(k*) - ρ_l^{(j)}(h),  (16)

with the phase lead

ρ_l^{(j)}(h) = 2\pi (h - 1) f^{(j)} M / F_s,

applied through the two-step phase difference so that the initial-phase, window, and band-pass terms cancel. When a pitch candidate is correct such that f^{(j)} = f_l, we have

\hat{φ}_l(k*) = φ_l(k_1^j).  (17)

Hence, a pitch can be obtained from narrowband speech signals using the shifted phase spectrum, without detecting peaks of the amplitude spectrum.
Let us examine the validity of the shifted phase spectrum for pitch estimation. In this discussion, the similarity between the phase spectra at the frequency indices k_1^j and k_h^j is measured as

Z_l^h(j) = \cos\left( \hat{φ}_l(k_h^j) - φ_l(k_1^j) \right).  (18)

Fig. 2(a), (b), and (c) show Z_l^2(j), Z_l^3(j), and Z_l^4(j), respectively, for a clean narrowband speech signal with a pitch of 211 Hz. The similarities approach 1 at several pitch candidates because the phase spectrum is constrained to its principal value of [0, 2π); this is called 'phase wrapping.' Due to phase wrapping, the shifted phase spectra can coincide with the phase spectrum for the lowest-order harmonic even at incorrect pitch candidates. As shown in Fig. 2(d), the similarity averaged over harmonics reaches its maximum at the correct pitch candidate; using several shifted phase spectra thus suppresses the phase-wrapping effect. However, the phase spectrum may fluctuate rapidly at the beginning and end of voiced active frames or deteriorate in severely noisy environments. Moreover, (17) also holds for overtones, resulting in octave errors. To solve these issues, a complex-domain pitch estimation algorithm using both the phase and amplitude spectra is presented in the following section.
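The candidate-scoring idea above can be sketched in code. The cosine similarity, the 1 Hz candidate grid, and the use of the inter-frame phase advance in place of the full two-step construction are assumptions for illustration, not the article's exact definitions:

```python
import numpy as np

# Sketch of the phase-difference cue: for each pitch candidate, predict the
# inter-frame phase advance at the bins containing the lowest H harmonics
# present in the 300-3400 Hz band, and score the agreement of the measured
# two-step differences (relative to the lowest-order harmonic) with a
# cosine similarity. Parameters and signal are illustrative.
Fs, N, M, H = 8000, 320, 80, 4
f0 = 220.0                                     # true pitch (illustrative)
n = np.arange(N + M)
# narrowband signal: only harmonics inside 300-3400 Hz survive
x = sum(np.cos(2 * np.pi * h * f0 * n / Fs + 0.7 * h)
        for h in range(1, 16) if 300 <= h * f0 <= 3400)
w = np.hanning(N)
S0 = np.fft.rfft(w * x[:N])
S1 = np.fft.rfft(w * x[M:M + N])

def score(fc):
    miss = int(np.floor(300.0 / fc))           # highest missing harmonic order
    orders = [miss + 1 + h for h in range(H)]  # lowest H orders in the band
    bins = [int(round(N * m * fc / Fs)) for m in orders]
    d = [np.angle(S1[k]) - np.angle(S0[k]) for k in bins]
    return float(np.mean(
        [np.cos((d[h] - d[0])
                - 2 * np.pi * (orders[h] - orders[0]) * fc * M / Fs)
         for h in range(1, H)]))

cands = np.arange(50.0, 401.0)                 # candidate pitches in Hz
best = cands[int(np.argmax([score(f) for f in cands]))]
print(best)                                    # close to the true 220 Hz
```

With the overtone (440 Hz) outside the 50-400 Hz range, the score peaks near the true pitch; inside the range, overtones would also score highly, which is exactly the octave ambiguity the article addresses next.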

IV. COMPLEX-DOMAIN PITCH ESTIMATION ALGORITHM
In this section, we propose a complex-domain pitch estimation algorithm using the phase and amplitude spectra. Fig. 3 shows a schematic representation of the proposed method. The proposed method obtains a cue for pitch estimation from the amplitude spectrum of the excitation signal using the harmonic summation algorithm [25]. The excitation signal is given by the linear predictive coding technique as the linear prediction (LP) residual [45]. The proposed method combines the shifted phase spectrum with the amplitude spectrum of the excitation signal to generate a complex spectrum. Let U_l(k) denote the amplitude spectrum of the excitation signal. The complex spectrum is generated as

X_l^{(j)}(k_h^j) = U_l(k_h^j) e^{j \hat{φ}_l(k_h^j)}.  (19)

The proposed method then obtains the distance between the adjacent peaks from the complex spectrum using the harmonic summation algorithm, which calculates the cumulative sum of the complex spectra,

Q_l(j) = \left| \sum_{h=1}^{H} X_l^{(j)}(k_h^j) \right|.  (20)

Fig. 4 shows the vector diagrams of the harmonic summation algorithm. When the shifted phase spectra are the same for the correct pitch candidate, each complex spectrum is added in the same direction, and the cumulative sum of the complex spectra then coincides with that of the amplitude spectra. Otherwise, the cumulative sum of the complex spectra is smaller. The introduction of the shifted phase spectrum therefore emphasizes the peaks of the amplitude spectrum. Let us examine the validity of the cumulative sum of the complex spectra for pitch estimation. Here, a male narrowband speech signal with a pitch of 103 Hz is used, and a white noise signal is added to it at an SNR of 0 dB. Fig. 5(a), (b), and (c) show the averaged similarity between the phase spectra for harmonics, the cumulative sum of the amplitude spectra, and the cumulative sum of the complex spectra, respectively. The averaged similarity has some peaks at incorrect pitch candidates because the phase spectrum can be degraded by noise.
Pitch estimation using only the phase spectrum thus becomes unstable in severely noisy environments. The cumulative sum of the amplitude spectra suppresses the noise effect but also has peaks at several pitch candidates. Because peaks of the amplitude spectrum are not easily detected from narrowband speech signals with a partial harmonic structure, the pitch estimation error cannot be suppressed using only the amplitude spectrum. Conversely, the cumulative sum of the complex spectra records a peak at the correct pitch candidate. These results confirm that the complex spectrum is an efficient cue for pitch estimation for narrowband speech signals in noisy environments. However, there are still peaks at the pitch candidates for overtones, resulting in octave errors.
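The complex-domain harmonic summation can be sketched as follows. Two simplifications versus the article are assumptions for illustration: the signal spectrum is used instead of the LP-residual spectrum, and the inter-frame phase advance stands in for the two-step phase difference:

```python
import numpy as np

# Sketch of complex-domain harmonic summation: combine the amplitude
# spectrum with phases aligned by the theoretically derived phase lead and
# take the magnitude of the cumulative sum over H harmonics. For the correct
# candidate the complex terms add in phase, so the cumulative sum approaches
# the amplitude-only sum; otherwise it is smaller.
Fs, N, M, H = 8000, 320, 80, 4
f0 = 110.0                                     # true pitch (illustrative)
n = np.arange(N + M)
x = sum(np.cos(2 * np.pi * h * f0 * n / Fs + 1.1 * h)
        for h in range(1, 31) if 300 <= h * f0 <= 3400)
rng = np.random.default_rng(0)
x = x + 0.5 * rng.standard_normal(x.size)      # additive white noise
w = np.hanning(N)
S0, S1 = np.fft.rfft(w * x[:N]), np.fft.rfft(w * x[M:M + N])
U = np.abs(S1)                                 # amplitude spectrum

def cumsum_mag(fc, use_phase):
    miss = int(np.floor(300.0 / fc))
    total = 0.0 + 0.0j
    for h in range(H):
        m = miss + 1 + h                       # harmonic order in the band
        k = int(round(N * m * fc / Fs))
        phase = (np.angle(S1[k]) - np.angle(S0[k])
                 - 2 * np.pi * m * fc * M / Fs) if use_phase else 0.0
        total += U[k] * np.exp(1j * phase)
    return abs(total)

# at the true pitch the complex sum nearly reaches the amplitude-only sum;
# at a wrong candidate (e.g. 137 Hz) the misaligned phases shrink it
print(cumsum_mag(110.0, True) / cumsum_mag(110.0, False))
print(cumsum_mag(137.0, True) / cumsum_mag(137.0, False))
```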
When a pitch candidate is an overtone, the amplitude spectrum also has a peak at the subharmonic. To prevent octave errors, the subharmonic subtraction algorithm [25] subtracts the cumulative sum of the amplitude spectra for subharmonics from that for harmonics. However, narrowband speech signals have also lost subharmonics in the lower spectrum. In this article, the subharmonic subtraction algorithm is therefore extended to the complex domain.
When a pitch candidate is an overtone such that f^{(j)} = R · f_l (R ∈ N − {1}), (11) is also valid. The proposed method therefore introduces an additional cue into the subharmonic subtraction algorithm: whether the shifted phase spectra agree with the phase spectrum for the lowest-order harmonic. Let (h + r/R + \tilde{f}^{(j)}) f^{(j)} (r = 1, ..., R − 1) denote a subharmonic for the Rth overtone at a pitch candidate. With the frequency index containing the subharmonic, k_{h,r}^j, the proposed method shifts the phase spectrum for the subharmonic by the corresponding phase lead, derived for the fractional harmonic order h + r/R. When a pitch candidate is an overtone, \hat{φ}_l(k_{h,r}^j) = φ_l(k_1^j). The proposed method calculates the cumulative sum of the complex spectra for the subharmonics as

Q_l^{sub}(j, R) = \frac{1}{R-1} \left| \sum_{h=1}^{H} \sum_{r=1}^{R-1} X_l^{(j)}(k_{h,r}^j) \right|,  (21)

and applies the subharmonic subtraction algorithm to (21):

\hat{Q}_l(j) = Q_l(j) - \max_{R} Q_l^{sub}(j, R).

In this article, the second and third overtones (R = 2, 3) are considered. Fig. 5(d) shows the cumulative sum of the complex spectra with the subharmonic subtraction algorithm. The maximum cumulative sum corresponds to the correct pitch candidate, whereas the others are attenuated. The octave error can thus be suppressed using the subharmonic subtraction algorithm in the complex domain. In addition, the proposed method enhances robustness using a temporal smoothing process. Here, the scale of \hat{Q}_l(j) is normalized because the power of the narrowband speech signal changes over time. Let \tilde{Q}_l(j) denote the normalized cumulative sum. The normalized cumulative sum is averaged over L frames as

\bar{Q}_l(j) = \frac{1}{L} \sum_{i=0}^{L-1} \tilde{Q}_{l-i}(j).

Fig. 5(e) shows the cumulative sum of the complex spectra with the subharmonic subtraction algorithm and the temporal smoothing process. The noise effect is suppressed by the temporal smoothing, emphasizing the peak at the correct pitch candidate.
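The subharmonic subtraction step can be sketched with the same alignment idea. The combination rule (averaging over r, taking the maximum over R) and the simplified phase alignment are assumptions for illustration:

```python
import numpy as np

# Sketch of subharmonic subtraction in the complex domain: an overtone
# candidate also aligns the complex spectra at the subharmonic positions
# (h + r/R), so subtracting the subharmonic cumulative sum penalizes it.
Fs, N, M, H = 8000, 320, 80, 4
f0 = 110.0                                    # true pitch (illustrative)
n = np.arange(N + M)
x = sum(np.cos(2 * np.pi * h * f0 * n / Fs + 0.4 * h)
        for h in range(1, 31) if 300 <= h * f0 <= 3400)
w = np.hanning(N)
S0, S1 = np.fft.rfft(w * x[:N]), np.fft.rfft(w * x[M:M + N])
U = np.abs(S1)

def aligned_term(order, fc):
    k = int(round(N * order * fc / Fs))       # bin nearest the (fractional) order
    phase = np.angle(S1[k]) - np.angle(S0[k]) - 2 * np.pi * order * fc * M / Fs
    return U[k] * np.exp(1j * phase)

def Q(fc):                                    # harmonic cumulative sum
    miss = int(np.floor(300.0 / fc))
    return abs(sum(aligned_term(miss + 1 + h, fc) for h in range(H)))

def Qsub(fc, R):                              # subharmonic cumulative sum
    miss = int(np.floor(300.0 / fc))
    terms = [aligned_term(miss + 1 + h + r / R, fc)
             for h in range(H) for r in range(1, R)]
    return abs(sum(terms)) / (R - 1)

def Qhat(fc):
    return Q(fc) - max(Qsub(fc, R) for R in (2, 3))

print(round(Qhat(110.0), 1), round(Qhat(220.0), 1))  # overtone is attenuated
```

For the 220 Hz overtone, the R = 2 subharmonic positions land exactly on the odd harmonics of 110 Hz, so the subtracted term nearly cancels the harmonic sum.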
Finally, the pitch is determined among the pitch candidates using the Viterbi algorithm, which calculates a Viterbi score for each pitch candidate by considering the pitch transition from the previous frames. The Viterbi score is defined as

V_l(j) = \bar{Q}_l(j) + \max_{i \in P} a(i, j) V_{l*}(i),

where a(i, j) is the transition probability between the ith and jth pitch candidates and l* denotes the latest voiced active frame. Consequently, the proposed method outputs the pitch

\hat{f}_l = f^{(j*)},  j* = \arg\max_{j \in P} V_l(j).
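The tracking step can be sketched with a toy score matrix. The Gaussian transition weight and the synthetic scores are assumptions for illustration, not the article's trained probabilities:

```python
import numpy as np

# Sketch of Viterbi pitch tracking: each frame supplies a score per pitch
# candidate, a transition weight a(i, j) penalizing large pitch jumps links
# consecutive voiced frames, and dynamic programming selects the best path.
cands = np.arange(50.0, 401.0, 5.0)           # pitch candidates in Hz
P = cands.size
sigma = 20.0                                  # tolerated pitch drift per frame
a = np.exp(-0.5 * ((cands[:, None] - cands[None, :]) / sigma) ** 2)

# toy per-frame scores: true pitch near 110 Hz; frame 2 is corrupted by a
# stronger spurious peak at the 220 Hz overtone
L = 5
scores = np.full((L, P), 0.1)
scores[:, np.argmin(np.abs(cands - 110))] = 1.0
scores[2, np.argmin(np.abs(cands - 220))] = 1.5

V = scores[0].copy()
back = np.zeros((L, P), dtype=int)
for l in range(1, L):
    weighted = a * V[:, None]                 # weighted[i, j] = a(i, j) * V(i)
    back[l] = np.argmax(weighted, axis=0)
    V = scores[l] + weighted[back[l], np.arange(P)]

j = int(np.argmax(V))                         # best final candidate
path = [j]
for l in range(L - 1, 0, -1):                 # backtrack
    j = int(back[l, j])
    path.append(j)
track = cands[np.array(path[::-1])]
print(track)                                  # stays at 110 Hz despite the overtone peak
```

The near-zero transition weight between 110 and 220 Hz makes the isolated overtone peak too expensive to visit, so the decoded track stays on the smooth pitch contour.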

V. EXPERIMENTAL EVALUATION
Objective experiments were conducted to validate the performance of the proposed method. We randomly selected 800 speech signals of 10 male and 10 female speakers from PTDB-TUG [46]. The speech signals had been coded in 16 bits at a sampling rate of 16000 Hz. Following the preprocessing scheme of Pulakka et al. [10], we simulated narrowband speech signals on the public switched telephone network using modified mobile station input filters. First, speech signals were high-pass filtered with an infinite impulse response filter whose cutoff frequency was 300 Hz with 3 dB attenuation; a zero-phase digital filter was employed to avoid phase distortion. Since the first high-pass filter did not sufficiently attenuate frequencies below 300 Hz, a second high-pass filter was applied to achieve an 80 dB attenuation at 200 Hz; following Abel's preprocessing scheme [11], a finite impulse response filter was employed for this stage. The high-pass filtered speech signal was low-pass filtered with an infinite impulse response filter whose cutoff frequency was 3400 Hz with 50 dB attenuation, and then downsampled to a sampling rate of 8000 Hz. Finally, narrowband speech signals were obtained by encoding and decoding the downsampled speech signal using G.711 [1].
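A rough numpy-only sketch of such a narrowband simulation chain is shown below. The windowed-sinc FIR designs, tap counts, and cutoffs are illustrative assumptions; the article uses specific IIR/FIR mobile-station input filters and a final G.711 encode/decode step, which are not reproduced here:

```python
import numpy as np

# Narrowband simulation sketch: high-pass at ~300 Hz, low-pass at 3400 Hz,
# then downsample 16 kHz -> 8 kHz. Linear-phase FIRs with centered
# convolution stand in for the article's zero-phase filtering.
def sinc_fir(numtaps, cutoff, fs, highpass=False):
    # windowed-sinc FIR design (Hamming window); numtaps must be odd
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = 2.0 * cutoff / fs * np.sinc(2.0 * cutoff / fs * n)
    h *= np.hamming(numtaps)
    if highpass:                     # spectral inversion of the low-pass
        h = -h
        h[(numtaps - 1) // 2] += 1.0
    return h

def simulate_narrowband(x, fs=16000):
    hp = sinc_fir(501, 300.0, fs, highpass=True)   # remove 0-300 Hz
    lp = sinc_fir(501, 3400.0, fs)                 # anti-alias at 3400 Hz
    y = np.convolve(np.convolve(x, hp, mode="same"), lp, mode="same")
    return y[::2], fs // 2                         # downsample to 8 kHz

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 1000 * t)
y, fs_nb = simulate_narrowband(x, fs)
Y = np.abs(np.fft.rfft(y))                         # 1 Hz bins (1 s at 8 kHz)
print(Y[100] / Y[1000])                            # 100 Hz strongly attenuated
```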
White, cockpit, destroyer, factory, and babble noise signals were used from NOISEX-92 [47]. The noise signals had been coded in 16 bits with a sampling rate of 19980 Hz. The noise signal was also preprocessed using the modified mobile station input filter and added to the narrowband speech signal at different SNR levels, ranging from −10 to 20 dB in steps of 5 dB.
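Adding noise at a prescribed SNR amounts to scaling the noise so that the speech-to-noise power ratio matches the requested level. A minimal sketch (textbook scaling with toy signals, not code from the article):

```python
import numpy as np

# Mix a noise signal into (narrowband) speech at a target SNR by scaling
# the noise to the required power.
def mix_at_snr(speech, noise, snr_db):
    noise = noise[:speech.size]
    gain = np.sqrt(np.mean(speech ** 2)
                   / (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(1)
s = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)   # toy "speech"
e = rng.standard_normal(8000)                          # toy noise
y = mix_at_snr(s, e, 0.0)
achieved = 10.0 * np.log10(np.mean(s ** 2) / np.mean((y - s) ** 2))
print(achieved)                                        # 0.0 dB by construction
```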
Online processing favors a short analysis frame, but shortening the frame also reduces the spectral resolution. The proposed method (PROP) analyzed the narrowband speech signal with a frame length of 40 ms (N = 320) and a hop size of 10 ms (M = 80). In this case, the distance between the center frequencies of adjacent band-pass filters in the STFT analysis was 25 Hz; this frequency resolution is sufficient to capture the lowest pitch of 50 Hz. In addition, a Hanning window with a frame length of 40 ms was adopted, so that spectral leakage to the other frequency bands could be neglected. In this article, the proposed method set H = 4 and L = 3.
The performance of the pitch estimation algorithms was evaluated by measuring the gross pitch error (GPE) [14], defined as

GPE = N_E / N_V,

where N_E denotes the number of frames in which the relative error of the estimated pitch is higher than 20% and N_V denotes the number of voiced active frames. In this work, we also introduced the octave error rate (OER) as a metric to assess octave errors:

OER = N_{OE} / N_V,

where N_{OE} denotes the number of frames in which the estimated pitch deviates from the reference by more than one octave. We further define the proportion of octave errors to total pitch estimation errors:

p_{OE} = N_{OE} / N_E.

Here, the number and positions of the voiced active frames are assumed to be known, to evaluate the maximum performance of the pitch estimation algorithms. The proposed method was compared with the following traditional pitch estimation algorithms: YIN [14], the pitch estimation filter with amplitude compression (PEFAC) algorithm [26], SRH [25], the parametric pitch estimation algorithm (NLS) [19], [22], and the pitch estimation algorithm using phase differences between harmonics (PD) [34]. The frame length of PEFAC and SRH was 90 ms with a hop size of 10 ms, and the frame length of the others was 40 ms. Fig. 6 shows the original (red line) and estimated (white line) pitch tracks for a male narrowband speech signal in a clean environment. Fig. 6(a) depicts a spectrogram of the original speech signal, and Fig. 6(b)-(g) depict spectrograms of the narrowband speech signals. The narrowband speech signal lacks the lower-order harmonics because of its limited bandwidth, resulting in pitch estimation errors for YIN, PEFAC, and SRH. Specifically, PEFAC and SRH suffered from octave errors because only a few peaks can be detected from the amplitude spectrum in the narrowband. In addition, pitch estimation errors were observed for NLS because the model parameters were not fully estimated from the narrowband speech signal with a partial harmonic structure.
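The three metrics follow directly from their definitions. A minimal sketch with toy frames (interpreting "more than one octave" as the estimate lying outside [ref/2, 2·ref] is an assumption):

```python
import numpy as np

# GPE = N_E / N_V, OER = N_OE / N_V, p_OE = N_OE / N_E,
# computed over voiced frames only.
def pitch_errors(ref, est, voiced):
    ref = np.asarray(ref, float)
    est = np.asarray(est, float)
    v = np.asarray(voiced, bool)
    rel = np.abs(est[v] - ref[v]) / ref[v]
    gross = rel > 0.20                         # relative error above 20 %
    octave = (est[v] > 2.0 * ref[v]) | (est[v] < 0.5 * ref[v])
    n_e, n_oe, n_v = gross.sum(), octave.sum(), v.sum()
    return n_e / n_v, n_oe / n_v, (n_oe / n_e if n_e else 0.0)

ref = [100, 100, 100, 100, 100]                # toy reference track
est = [101, 250, 130, 110, 95]                 # one octave error, one gross error
voiced = [1, 1, 1, 1, 1]
gpe, oer, p_oe = pitch_errors(ref, est, voiced)
print(gpe, oer, p_oe)                          # 0.4 0.2 0.5
```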
PD suppressed the pitch estimation error without detecting peaks on the amplitude spectrum but was unstable at the beginning and end of voiced active frames where the phase spectrum rapidly fluctuated. PROP accurately estimated the pitch from the narrowband speech signal using both the amplitude and phase spectra.
The performance of the pitch estimation algorithms was also evaluated for narrowband speech signals in clean and noisy environments. In this experiment, to verify the effectiveness of the shifted phase spectrum and the subharmonic subtraction algorithm, two variants of the proposed method were added: PROP w/o PD estimates the pitch from the cumulative sum of the amplitude spectra with the subharmonic subtraction algorithm, and PROP w/o SH estimates the pitch from the cumulative sum of the complex spectra without the subharmonic subtraction algorithm. Table I shows the resulting GPE. For the traditional pitch estimation algorithms, the GPE exceeded 0.17 even in a clean environment because of the limited bandwidth. Notably, because the frequency-domain and parametric pitch estimation algorithms target wideband speech signals with a full harmonic structure, the GPE was higher for male narrowband speech signals, which lose more lower-order harmonics, than for female ones. PD achieved a lower GPE than YIN, SRH, and PEFAC without detecting peaks of the amplitude spectrum, but a higher GPE than NLS because of its instability at the beginning and end of voiced active frames. In contrast, PROP achieved the lowest GPE for both male and female narrowband speech signals; the proposed method thus estimates the pitch precisely from narrowband speech signals regardless of gender. Compared with PROP w/o PD, PROP reduced the GPE by 0.039, implying that the pitch estimation error for narrowband speech signals can be suppressed using both the amplitude and phase spectra. Table II shows the resulting OER. In a clean environment, PROP recorded the lowest OER at 0.033, whereas YIN, SRH, and PEFAC recorded more than 0.174. The proposed method is thus an efficient pitch estimation algorithm for narrowband speech signals of either gender. Compared with PROP w/o SH, PROP reduced the OER by 0.027.
These results show that the subharmonic subtraction algorithm in the complex-domain is effective in preventing the octave error for narrowband speech signals.
In noisy environments, the GPE and OER increased compared with those in a clean environment. For PD, the GPE more than doubled because of the deteriorated phase spectrum. PROP, which uses both the amplitude and phase spectra, achieved the lowest GPE for all noise signals at low SNRs. NLS, employing a harmonic model with additive white Gaussian noise, reduced the GPE for female speech signals in White noise, but PROP mitigated pitch estimation errors more effectively in the other cases. These results show that the proposed method provides stable pitch estimation even in noisy environments. For the OER, PROP achieved the lowest OER for the white noise signal, but NLS outperformed PROP for the other noise signals. Since non-stationary noise signals suddenly distort the phase spectrum, the proposed method, which relies on the phase spectrum, has difficulty preventing octave errors in such conditions. Here, PD and NLS recorded less than 0.08, whereas YIN, SRH, and PEFAC recorded more than 0.22. The proposed method therefore suppressed octave errors at the same level as the state-of-the-art method, NLS. In Destroyer, Factory, and Babble noise, the lowest p_OE of PROP was 0.257, while that of NLS was 0.179. NLS, which did not capture the harmonic structure, as shown in Fig. 6(e), made pitch estimation errors of less than one octave. For PROP, the proportion of octave errors increased even though the total pitch estimation error was lower than that of NLS. Since we set an extensive pitch range to accommodate both male and female speakers, octave errors are likely to occur. Hence, postprocessing that limits the pitch range based on speaker identification methods [48], [49] would further suppress pitch estimation errors.
The proposed method assumes that the phase spectra encompassing the lowest and higher-order harmonics are linearly related. This assumption can be checked by comparing PROP w/o PD and PROP: if the relationship between the phase spectra is not linear, the performance of PROP should fall below that of PROP w/o PD, which omits the phase information. As shown in Table I, PROP recorded the lowest GPE in the clean condition, indicating a linear relationship between the phase spectra. In noisy environments, PROP registered a lower GPE with White, Cockpit, and Destroyer noise at 0 dB, while PROP w/o PD reported a better GPE for female speech signals in Factory and Babble noise. Since the harmonic structure deteriorated in those conditions, the linear relationship between the phase spectra no longer held. Hence, robustness toward non-stationary noise remains a challenge for the proposed method. In this article, we estimated the pitch from the deteriorated phase spectrum; preprocessing with phase enhancement methods [50], [51] would further improve the performance.
Figs. 7 and 8 show the GPE and OER results in noisy environments at different SNR levels. PROP suppressed pitch estimation errors most efficiently within the −5 to 20 dB range. These results confirm that the proposed method is a robust pitch estimation algorithm for narrowband speech signals. In severely noisy environments at an SNR of −10 dB, PROP was comparable to NLS; the deteriorated phase spectrum caused pitch estimation errors, and preprocessing with phase enhancement methods [50], [51] would improve the performance there. For the white noise signal, PROP also achieved the lowest OER regardless of the SNR level. For the non-stationary noise signals, NLS outperformed PROP at SNRs below 0 dB because the sudden phase spectrum distortion caused octave errors; at SNRs of 0 dB or higher, PROP achieved an OER less than or equal to that of NLS. The proposed method thus prevented octave errors more efficiently than the state-of-the-art methods in mildly noisy environments.
Finally, we discuss the complexity of the pitch estimation algorithms. Let P denote the number of pitch candidates. The computational complexity of the proposed method is comparable to that of online pitch estimation algorithms used in real-time applications, and thus the proposed method is also applicable to online processing.

VI. CONCLUSION
We proposed a complex-domain pitch estimation algorithm for narrowband speech signals and verified its performance through simulation experiments. The proposed method achieved the lowest GPE and OER in a clean environment for both male and female narrowband speech signals. In noisy environments, the proposed method also recorded the lowest GPE even at an SNR of −10 dB, and an OER less than or equal to that of the state-of-the-art method at SNRs of 0 dB or higher. These results imply that the proposed method is a robust pitch estimation algorithm for narrowband speech signals, regardless of gender. In severely noisy environments at SNRs below 0 dB, octave errors were observed because the phase spectrum was suddenly distorted by the non-stationary noise signals. Future work includes incorporating speaker identification methods [48], [49] and phase enhancement methods [50], [51] to prevent octave errors in severely noisy environments, and glottal closure instant detection algorithms [52], [53] to detect voiced active frames. The code of the proposed method is available at https://github.com/Yuya-Hosoda.