Harmonic Differences Method for Robust Fundamental Frequency Detection in Wideband and Narrowband Speech Signals

In this article, a novel pitch determination algorithm based on the harmonic differences method (HDM) is proposed. Most current algorithms rely on autocorrelation, the cepstrum, or, most recently, convolutional neural networks, and they suffer from limitations (small datasets, restriction to wideband or narrowband signals, musical sounds, temporal smoothing, etc.) as well as accuracy and speed problems. Very few works exploit the spacing between the harmonics. HDM is designed for both wideband and narrowband (telephone) speech and searches for the most frequently repeated difference between the harmonics of the speech signal. We use three vowel databases in our experiments: the Hillenbrand Vowel Database, the Texas Vowel Database, and vowels from the TIMIT corpus. We compare HDM with the autocorrelation, cepstrum, YIN, YAAPT, CREPE, and FCN algorithms. The results show that harmonic differences are a reliable and fast choice for robust pitch detection, and that HDM is superior to the other methods in most cases.


Introduction
Pitch is an extraordinarily complicated and distinctive feature of human speech and plays a major role in the perception of human conversation as well as in human-computer interaction. Pitch detection has a long and disputed history spanning more than a century. A myriad of methods have been proposed, but achieving reasonable resolution with a fast implementation is still a formidable task, especially in narrowband (telephone), noisy, multipitch, and multitalker speech, because of the extremely complicated structure of the frequency spectrum. Pitch helps us to identify important cues about the speaker, such as identity, gender, and emotional state, or about the tones of a musical instrument. It has a wide range of applications in emotion and gender recognition, speech synthesis, human-computer interaction, and detection of the symptoms of pathological disorders at early stages. Fundamental frequency is the measurable quantity corresponding to pitch and is measured on periodic signals (musical tones) or quasiperiodic signals (speech). Pitch detection algorithms (PDA) and pitch tracking algorithms are extensively used to extract the fundamental frequency of a person's speech or of a musical tone. Fundamental frequency is the frequency of vocal cord oscillation, and it can vary widely among men, women, boys, and girls. Therefore, its exact calculation is crucial in a variety of applications, from human-computer interaction to early detection of pathological symptoms.
This article is organized as follows: Section 1 discusses the foundations and importance of pitch detection and tracking; Section 2 deals with the literature overview, historical background, algorithms, novelties of this article, datasets, ground truth methods, error measures, difficulties, application areas, and related algorithm domains; Section 3 describes the novel HDM algorithm; Section 4 describes the datasets used in this article and the experimental setup; Section 5 presents the results of the wideband and narrowband experiments; Section 6 is devoted to gender detection results; and finally, Section 7 concludes this work with evaluations and future studies.

Historical Background.
The saga of pitch determination and tracking begins with the dispute between August Seebeck and Georg Simon Ohm on the mysterious missing fundamental. In 1841, August Seebeck showed that the pitch of a sound does not depend on the tone containing a component at the fundamental frequency. The debate over the missing fundamental began shortly after this [1][2][3]. In 1843, Ohm firmly rejected this idea and stated that the quality of a tone depends solely on the number and relative strength of its partial simple tones [4]. This law was championed by Helmholtz in his masterpiece [5].
Rayleigh, in his famous study The Theory of Sound (1877), claimed on the basis of his experiments with sirens that pitch could not simply be associated with period. But, like loudness and timbre, pitch is not something that can be measured directly [6]. In 1924, Harvey Fletcher showed that even if several lower harmonics of a waveform were removed, the pitch remained the same. The pitch was very closely related to the difference in frequency between the harmonics, even though that frequency did not exist in the sound source [7]. This idea is one of the foundations of this study, which exploits these differences to extract the fundamental frequency.
In 1938, Schouten showed that the missing fundamental effect could not be explained as a nonlinear difference tone, as Helmholtz had claimed, and strongly supported Seebeck [8]. Schouten's theory is known as the residue theory of pitch, periodicity pitch, or virtual pitch [9][10][11]. Today, scientists agree that Seebeck disproved Ohm's idea. The missing fundamental has a practical manifestation in telephone conversations. In telephone signals, the frequency spectrum is limited to between 300 Hz and 4000 Hz, but we can still perceive the voice clearly and distinctly enough to capture the talker's identity, gender, and even emotional state. For instance, if we have a signal with harmonics at 400 Hz, 600 Hz, 800 Hz, 1000 Hz, and 1200 Hz, the pitch of the signal is still perceived as 200 Hz, and the autocorrelation of the signal with and without the 200 Hz component remains nearly the same, as depicted in Figure 1 [12]. The autocorrelation model of pitch perception dates back to Licklider's "duplex" and "triplex" models. Licklider resolved the dilemma between Seebeck and Ohm definitively in favor of Seebeck. Periodicity pitch theory succeeded place pitch theory after Licklider's "duplex" and "triplex" theories [13][14][15][16]. This theory was investigated in depth by Ritsma, and some of its limitations were shown [17]. Today, the debate is far from over, and two mainstream theories of hearing compete against each other: the Place Code theory of Ohm and Helmholtz and the Temporal Code Theory of Seebeck and Wever.
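The missing-fundamental example above can be checked numerically. The following is a minimal sketch, not part of the original study: it synthesizes the harmonic complex with and without the 200 Hz component and picks the lag that maximizes a plain (unnormalized) autocorrelation. The function names, the 16 kHz sampling rate, the 0.1 s signal length, and the 80 to 500 Hz search range are assumptions chosen for illustration.

```python
import math

def harmonic_signal(freqs_hz, fs=16000, n=1600):
    """Sum of unit-amplitude sinusoids at the given frequencies (illustrative)."""
    return [sum(math.sin(2 * math.pi * f * t / fs) for f in freqs_hz)
            for t in range(n)]

def autocorr_pitch(x, fs=16000, fmin=80.0, fmax=500.0):
    """Return fs / lag for the lag that maximizes the raw autocorrelation."""
    lag_min = int(fs / fmax)   # shortest period searched
    lag_max = int(fs / fmin)   # longest period searched
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        val = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return fs / best_lag

# Same harmonic series, with and without the 200 Hz fundamental.
full = harmonic_signal([200, 400, 600, 800, 1000, 1200])
missing = harmonic_signal([400, 600, 800, 1000, 1200])
f_full = autocorr_pitch(full)
f_missing = autocorr_pitch(missing)
print(f_full, f_missing)
```

In this idealized setting both signals share the same 80-sample period, so the autocorrelation peak does not move when the fundamental is removed. Real speech is only quasiperiodic, so the agreement is looser in practice.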

Innovations in This Article.
Pitch detection is an extensively studied research field. Today, pitch detection methods are successful in wideband, clean human speech. But there are still many formidable obstacles to be resolved, particularly in narrowband (telephone) speech [67,68], multipitch [69][70][71][72][73][74][75][76][77][78], multitalker, and noisy speech signals [69,95,96,108,109]. Various techniques have been implemented, including autocorrelation, cepstrum, phase-based, and, most recently, CNN methods. However, an interesting clue for pitch determination is largely neglected and rarely used. Harmonic spacings are strong pieces of evidence for pitch determination even though human speech is only quasiperiodic. This method was first used by Seneff [39] for narrowband speech on real-time data in the band between 210 Hz and 1050 Hz, by iteratively assigning weight factors to the pitch candidates obtained from harmonic spacings with the LDVT (Lincoln Digital Voice Terminal). Seneff's work is dated, and its dataset and results are not clearly presented. There are other works up to the 21st century. Wu [40] tried harmonic spacings in guitar sound using the GCD (Greatest Common Divisor of largest peak) with a table, and Dziubiński and Kostek [187] in various musical instruments with an Artificial Neural Network on a matrix of sets of harmonics. We apply this method to both wideband and narrowband (400 Hz-3400 Hz) speech signals on large speech datasets, eliminating some of the limitations while improving accuracy and speed by determining the most repeated difference between the harmonics using a histogram. This article revives this technique and demonstrates that harmonic spacings can reliably be used to obtain state-of-the-art results quickly and efficiently in both wideband and narrowband telephone speech samples. More detailed explanations are presented in Section 3.
Pitch detection can be implemented in clean wideband speech, narrowband telephony speech, noisy speech, and multipitch musical sounds. In telephony speech, the signal is usually band-passed between 300 Hz and 4000 Hz to save bandwidth. In this study, we applied the band-pass filter twice to remove all residual content below 400 Hz and above 3400 Hz. Many types of noise can be added to the pitch datasets, including babble noise, exhibition noise, HF (high-frequency) channel noise, restaurant noise, street noise, white noise, pink noise, brown noise, and pub noise. The NOISEX [153] dataset is a publicly available noise dataset. Pitch datasets can be recorded in studio, office, living room, and car interior environments.
In pitch datasets, speaker profiles, distribution of gender, distribution of age, mother tongue, distribution of dialect, distribution of profession or education, pathologies, number of speakers, contents, speaking style, read speech, answering speech, command and control speech, descriptive speech, nonprompted speech, spontaneous speech, neutral and emotional speech, general recording setup (telephone, on-site, field, wizard-of-oz), annotation, technical specifications (sampling rate, sample type, number of channels, file formats), corpus structure, release plan and validation procedure, meta data, recording protocol (session id, speaker id, recording date, environmental conditions, technical recording conditions), postprocessing, pronunciation dictionary, and validation are important issues [135].

Ground Truth Determination of Pitch Datasets.
Another important issue for pitch datasets is the exact determination of ground truth. Hand-editing, EGG, and Laryngograph are widely used as ground truth methods. SIGMA, Hilbert Envelope-based detection (HE), the Zero Frequency Resonator-based method (ZFR), the Dynamic Programming Phase Slope Algorithm (DYPSA), the Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS), and Yet Another GCI Algorithm (YAGA) are some other methods used for finding ground truth of pitch samples [154][155][156][157][158].
They each have advantages and disadvantages, and hand-editing by an expert is usually still needed. In some cases, detecting f0 manually can be quite difficult even for human experts. The EGG and the laryngogram, particularly the differentiated laryngogram, provide a signal that makes automatic f0 calculation easier using available methods, as shown in Figure 2.
Vocal Tract Model, Uniform Lossless Tube, and Two-Tube Model can be used to model the f0 and other formant frequency structures [178][179][180][181][182][183][184][185][186]. Using the laws of conservation of mass, momentum, and energy, it can be shown that sound wave propagation inside a lossless tube satisfies the equations:

$$-\frac{\partial p}{\partial x} = \rho \, \frac{\partial (u/A)}{\partial t}, \qquad -\frac{\partial u}{\partial x} = \frac{1}{\rho c^{2}} \, \frac{\partial (pA)}{\partial t} + \frac{\partial A}{\partial t}$$

where p = p(x, t) is the sound pressure at position x and time t, u = u(x, t) is the volume velocity at position x and time t, ρ is the air density inside the tube, c is the sound velocity, and A = A(x, t) is the cross-sectional area as a function of distance and time.

HDM Algorithm
The novel harmonic differences method (HDM) exploits the differences between the harmonics in the power spectrum of the signal, with the goal of finding the most frequently repeated difference. Harmonic spacings have been tried by Seneff [39] for telephony speech on real-time data with the LDVT, by Wu on guitar sound [40], and by Dziubiński and Kostek [187] for various musical instrument sounds. Seneff used the area under the hump in the frequency spectrum, created a list of 8 estimates, and applied temporal smoothing using a median filter. We try to find the most repeated difference without temporal smoothing. We extend these works by eliminating many steps in the implementation and generalizing the method to all kinds of sound waves, improving both speed and accuracy. Matlab 2019 student version is used for HDM, autocorrelation, cepstrum, YIN, and YAAPT. Python 3.6.5 is used for the CREPE and FCN implementations. The general outline and pseudocode of the algorithm are as follows:

(1) Take the discrete Fourier transform and run the peak picking algorithm between the minimum f0 and the maximum threshold frequency F_th. Each sample is taken with a length of 1024 and is Hamming windowed before processing to smooth the signal and avoid edge effects. Windowing is necessary because of the DFT's vulnerability to discontinuities. The windowed Fourier transform is given as follows:

$$X[k] = \sum_{n=0}^{N-1} x[n] \, w[n] \, e^{-j 2 \pi k n / N}, \qquad w[n] = 0.54 - 0.46 \cos\left(\frac{2 \pi n}{N-1}\right)$$

where x[n] is the speech sample, w[n] is the Hamming window, and N = 1024. The sampling rate is 16000 Hz, yielding a frequency resolution of 16000/1024 = 15.625 Hz. Despite this low FFT resolution, the results of HDM are very promising. During this search, the minimum power of a partial should be at least (1/γ) times the magnitude of the largest partial. Here, γ is a prespecified constant, and α, β, γ, f0_min, and F_th (the upper-limit threshold frequency) are determined empirically with more than 7 million experiments on the Hillenbrand dataset. There is no limit on the number of peaks to collect, but we set a threshold frequency value F_th and collect all peaks below that threshold. This threshold frequency is also determined empirically.
(2) A spectral bin is accepted as a true peak only if its amplitude is larger than both the previous and the next FFT coefficient.
(3) Remove from this list the unnecessary peaks that are closer to one another than the minimum f0 value.
(4) Remove from this list the spurious peaks that are weaker than a predefined empirical ratio of the previous and next spectrum values.
(5) Handle special cases such as frequencies below 110 Hz: if the amplitude of the first entry is less than 1/5 of that of the second entry, remove the first entry.
    if f_1 < 110 Hz and A_1 < (A_2/β) then
        remove f_1 from F, J ⟵ J − 1
    end if
(6) Find the differences between the adjacent entries in the list
    [...]
    if abs(f0_wb − f0_nb) < (f0_min/2) then
        f0_wb = f0_nb
    end if
    else
        f0_wb ⟵ 0, f0_nb ⟵ 0
    return f0_wb, f0_nb

HDM does not use temporal smoothing. Although it is possible to obtain better results using different parameters for each dataset, we use the same parameter set for all 3 datasets and for all wideband and narrowband experiments. Our goal is to find a global parameter set that achieves the best results for wideband and narrowband telephone speech in all datasets.

The relation between GPE and α is shown in Figure 3. For small values of α, GPE is very high, and above 5 it remains nearly constant. A typical value for α is between 6 and 12. This parameter is used to eliminate the spurious peaks in the list F of candidate pitches. Elimination of redundant peaks is a key point in capturing the correct spacing between the harmonics. The β parameter is used to handle special cases such as frequencies below 110 Hz. Because of the nonlinearity and complexity of the equal loudness curves, different regions of the human speech spectrum have different loudness. Therefore, we used the β parameter for handling low-frequency components. The effect of the α and β parameters on GPE is depicted in Figure 4. The linear region of the human auditory system is usually considered to lie between 20 and 1000 Hz (we believe that a value between 1100 and 1200 Hz is a better choice for the upper limit), and the logarithmic region is the rest of the audible spectrum [188][189][190].
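The outline above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' Matlab implementation: `F_TH` and `GAMMA` are hypothetical values rather than the empirically tuned parameters, step (4) with its α ratio is omitted, the wideband/narrowband reconciliation at the end of the pseudocode is not reproduced, and the "most repeated difference" is taken as the mode of the bin spacings between adjacent peaks. The function names are invented for this example.

```python
import cmath
import math
from collections import Counter

FS = 16000       # sampling rate given in the text (Hz)
N = 1024         # sample length given in the text
F0_MIN = 82.0    # lowest ground-truth f0 mentioned in the text (Hz)
F_TH = 1250.0    # hypothetical upper threshold frequency (Hz)
GAMMA = 10.0     # hypothetical: keep peaks of at least max/GAMMA
BETA = 5.0       # low-frequency rule from step (5): A1 < A2/BETA

def magnitude_spectrum(frame, max_bin):
    """Magnitudes of the Hamming-windowed DFT for bins 0..max_bin."""
    n = len(frame)
    xw = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
          for i, s in enumerate(frame)]
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(xw)))
            for k in range(max_bin + 1)]

def hdm_sketch(frame, fs=FS):
    """Return the most repeated spacing between spectral peaks, in Hz."""
    res = fs / len(frame)                      # 15.625 Hz for 16 kHz / 1024
    lo, hi = math.ceil(F0_MIN / res), int(F_TH / res)
    mags = magnitude_spectrum(frame, hi + 1)
    floor = max(mags[lo:hi + 1]) / GAMMA
    # Steps (1)-(2): local maxima that also pass the relative amplitude floor.
    peaks = [k for k in range(lo, hi + 1)
             if mags[k] > mags[k - 1] and mags[k] > mags[k + 1]
             and mags[k] >= floor]
    # Step (3): drop peaks closer than F0_MIN to the previously kept peak.
    kept = []
    for k in peaks:
        if not kept or (k - kept[-1]) * res >= F0_MIN:
            kept.append(k)
    # Step (5): remove a weak first peak below 110 Hz.
    if (len(kept) >= 2 and kept[0] * res < 110.0
            and mags[kept[0]] < mags[kept[1]] / BETA):
        kept.pop(0)
    if len(kept) < 2:
        return 0.0
    # Step (6) and histogram: mode of the adjacent peak spacings.
    diffs = [b - a for a, b in zip(kept, kept[1:])]
    spacing_bins, _ = Counter(diffs).most_common(1)[0]
    return spacing_bins * res

def harmonics(freqs_hz, fs=FS, n=N):
    """Synthetic test frame: sum of unit-amplitude sinusoids."""
    return [sum(math.sin(2 * math.pi * f * t / fs) for f in freqs_hz)
            for t in range(n)]

f0_wb = hdm_sketch(harmonics([200, 400, 600, 800, 1000, 1200]))
f0_nb = hdm_sketch(harmonics([400, 600, 800, 1000, 1200]))
print(f0_wb, f0_nb)
```

For this synthetic 200 Hz series, the dominant peak spacing is 13 bins (203.125 Hz) whether or not the fundamental is present, which is within one 15.625 Hz resolution step of the true value. A pure-Python DFT is used only to keep the sketch dependency-free; a practical implementation would use an FFT routine.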
Another important parameter is the value of minimum f0, and it heavily affects the GPE as shown in Figure 5. Hillenbrand and Texas datasets provide ground truth value for f0, and minimum fundamental frequency value is 82 Hz.
In the Hillenbrand dataset, the fundamental frequency ground truth is calculated using autocorrelation followed by hand-editing. In the Texas Vowel dataset, the fundamental frequency ground truth is calculated by visual inspection together with semiautomatic LPC analysis. In the TIMIT corpus, transcriptions are obtained using the program SPIRE of MIT and then hand-verified by experienced acoustic phoneticians. But there is no fundamental frequency ground truth for the TIMIT vowels. Therefore, we used the average f0 values found by the HDM, autocorrelation, cepstrum, YAAPT, CREPE, and FCN methods in a consistent manner, in conjunction with the gender labels provided by the TIMIT dataset. In some samples of the TIMIT dataset, finding the ground truth is very difficult even by expert visual inspection of the frequency spectrum. The harmonics and the spacing between them can be spread nearly randomly. Such a sample from the TIMIT dataset is depicted in Figure 6.
Autocorrelation of a signal with the symmetry property can be obtained using the following equation:

$$r_k = r_{-k} = \sum_{n=0}^{N-1-k} x[n] \, x[n+k]$$

where k is the lag number. Cepstrum has a complexity of O(N log N), and the power cepstrum can be calculated using the following formula:

$$C(\tau) = \left| \mathcal{F}^{-1} \left\{ \log \left( \left| \mathcal{F} \{ x(t) \} \right|^{2} \right) \right\} \right|^{2}$$

As can be seen from Table 2, HDM is the fastest method, followed by cepstrum and autocorrelation. For autocorrelation and cepstrum, we used Naotoshi SEO's (http://note.sonots.com/SciSoftware/Pitch.html) implementations, but [...]. YIN [66], YAAPT [67,68], and FCN [113] employ temporal smoothing, whereas HDM, autocorrelation, cepstrum, and CREPE [110] do not. HDM produces nearly identical results without Hamming windowing, but the rest of the algorithms have worse results without the Hamming window; in particular, cepstrum doubles its error margin on the TIMIT dataset. No other preprocessing was applied to the waveforms. In cepstrum and autocorrelation, we imposed an upper-limit frequency of 500 Hz. Without this limit, their performance only gets worse. This makes them unsuitable for the detection of high pitch values. In angry emotional speech samples, pitch values of up to 700 Hz can be seen.

Results
Many error measures are used to evaluate pitch detection algorithms. Gross Pitch Error (GPE), which is the average error, and voicing detection error (VDE), where applicable, are among the most used. Additionally, we introduce two different error measures. The first is e10, which denotes the number of samples that deviate by more than 10% from the ground truth value. The second is the gender detection error: gender detection is another useful application of pitch estimation, and its error can be very useful for the evaluation of these algorithms. GPE and e10 are, respectively, defined as [...]

In the Hillenbrand and Texas Vowel datasets, ground truth pitch values are given, and we can make a solid comparison with our predictions. In Figure 7, such a comparison is depicted for the Hillenbrand dataset. Boy, girl, man, and woman samples can clearly be seen, giving an intuition about the frequency regions of these samples. Male voices have definitely lower f0 values compared with boys, girls, and women. It is nearly impossible to separate boy, girl, and woman voices using only f0 values. In the Hillenbrand dataset, the average f0 is 236.0 Hz for boys, 238.35 Hz for girls, 131.21 Hz for men, and 220.40 Hz for women. The maximum f0 values for boys, girls, men, and women are 320, 303, 224, and 307 Hz, respectively. The minimum f0 values for boys, girls, men, and women are 183, 188, 90, and 149 Hz, respectively. There is no boy or girl sample with an f0 value lower than 180 Hz. Only 7 males have pitch values greater than 190 Hz, and only 5 women have pitch values lower than 160 Hz. In the Texas dataset, there are kid, man, and woman classes. In the Texas dataset, the difference [...] Table 3.

Table 3: Number of samples per class in each dataset.
Hillenbrand: 324 boy, 228 girl, 540 male, 576 female
Texas: 1232 kid, 972 male, 1110 female
TIMIT: 54357 male, 24017 female

In the TIMIT dataset, there are no supplied ground truth pitch values. As seen from Table 4, in the wideband Hillenbrand dataset, the proposed HDM has the smallest GPE, and AC has the smallest e10 error. Although autocorrelation is an old technique, it is highly successful on this dataset.
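The e10 measure described in this section can be sketched directly from its verbal definition. The following is a minimal illustration with invented data. Since the exact GPE formula is not reproduced here, `mean_relative_error_percent` is only a hypothetical stand-in for an "average error" and should not be read as the article's definition.

```python
def e10_count(estimates, truths, tolerance=0.10):
    """Number of samples deviating by more than 10% from the ground truth."""
    return sum(1 for est, ref in zip(estimates, truths)
               if abs(est - ref) > tolerance * ref)

def mean_relative_error_percent(estimates, truths):
    """Hypothetical average-error measure: mean of |est - ref| / ref, in percent."""
    errors = [abs(est - ref) / ref for est, ref in zip(estimates, truths)]
    return 100.0 * sum(errors) / len(errors)

# Invented example: the second estimate is an octave error (400 Hz vs 200 Hz).
truths = [100.0, 200.0, 250.0, 130.0]
estimates = [105.0, 400.0, 251.0, 130.0]

e10 = e10_count(estimates, truths)
avg = mean_relative_error_percent(estimates, truths)
print(e10, avg)
```

The example shows why the two kinds of measure are complementary: a single octave error dominates the average error, while e10 simply counts it as one outlier.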
In the Texas dataset, FCN has the smallest GPE, and AC has the smallest e10 error. In the TIMIT dataset, FCN is the best performing method in GPE, followed closely by HDM and cepstrum, and the proposed HDM is the best method in e10 error. YIN produces too many outliers, even though we tried many different parameters for YIN to find better results. A box plot of the abovementioned algorithms is depicted in Figure 8. We emphasize that in the AC and cepstrum implementations we imposed an upper-limit frequency of 500 Hz; otherwise, these methods produce worse results. No upper limit is applied in the remaining methods. Convolutional neural network methods are quite successful in wideband implementations, but, as shown next, they are nearly blind in narrowband telephone speech data.
We now extend our experiments to telephony speech. For this purpose, we apply a band-pass filter to our datasets twice to completely remove the frequencies below 400 Hz and above 3400 Hz. In some telephony speech, the pass band lies between 300 Hz and 4000 Hz. We selected 400 Hz as the lower cutoff because the highest f0 value in our datasets is 392 Hz; by selecting 400 Hz as the threshold, we remove the fundamental frequency from all samples in all datasets. Detecting f0 when the fundamental itself is absent is one of the objectives of this algorithm.
In the band-passed Hillenbrand dataset, cepstrum is, remarkably, the most successful algorithm in all error types, as shown in Table 5. HDM is second in GPE, and AC is second in e10 error. The performance of CREPE and FCN is very disappointing in band-passed speech. In the narrowband Texas dataset, cepstrum is again the most successful algorithm in all error types, as shown in Table 5. HDM is second in GPE, and AC is second in e10 error.
In the narrowband TIMIT dataset, our novel HDM algorithm is superior to all other methods in the GPE and e10 error measures, as seen in Table 5. YAAPT is second best in GPE, and AC is second in e10 error. YAAPT is primarily designed for telephone speech. In the TIMIT dataset, 22670 samples are shorter than 1024 samples in length. This may explain the failure of cepstrum in the narrowband TIMIT dataset. To further clarify the cause of this failure, we removed the samples that are shorter than 1024 and reran cepstrum on [...] Figure 9. CREPE and FCN are nearly useless in narrowband speech. This may be because their training is done only on wideband speech; they need to be trained for narrowband speech as well. Convolutional neural networks are quite successful in wideband speech data; however, we must keep in mind that they are extremely slow compared with HDM, AC, and cepstrum, as shown in Table 2. HDM is the fastest method in these experiments.
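The effect of the 400 Hz to 3400 Hz pass band on the harmonic series can be illustrated with simple arithmetic. The sketch below idealizes the band-pass filter as a brick wall that keeps only harmonics inside the band, which is an assumption for illustration (a real filter has a finite roll-off). The function name is hypothetical.

```python
def surviving_harmonics(f0, low=400.0, high=3400.0):
    """Harmonics k*f0 that fall inside an idealized [low, high] pass band."""
    kept = []
    k = 1
    while k * f0 <= high:
        if k * f0 >= low:
            kept.append(k * f0)
        k += 1
    return kept

# Highest f0 reported for the datasets: 392 Hz, just below the 400 Hz cutoff.
h = surviving_harmonics(392.0)
spacings = [b - a for a, b in zip(h, h[1:])]
print(h)
print(spacings)
```

Even for the highest f0 in the datasets, the fundamental itself is removed, yet every spacing between the surviving harmonics still equals f0. This is the property that a harmonic-difference approach relies on in narrowband speech.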

Gender Detection Implementations
Gender detection is an application of fundamental frequency detection. Although gender is not determined by pitch alone, it is highly related to the pitch value. Pitch has specific ranges for men, women, boys, and girls. Therefore, gender-based evaluation of the f0 algorithms is a good measure of their robustness. Here, we present the gender detection errors for the wideband and band-passed TIMIT dataset. In the TIMIT dataset, the gender information is given by the first letter of the name of the speech sample.
As seen from Table 6, in the wideband TIMIT dataset, HDM is the best method in gender detection by a significant margin, FCN is second, and cepstrum is third. The success of cepstrum in male speech samples is well known. In the TIMIT dataset, there are 24017 female and 54357 male samples. This may explain the success of cepstrum on this large dataset. From Table 7, we can clearly conclude that HDM has no match, specifically in male samples.
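A gender detection error of this kind can be sketched as follows. The decision rule is an assumption for illustration only: the 165 Hz threshold is hypothetical and not taken from this article, and the sample names are invented strings that merely follow the convention that the first letter (F or M) encodes the gender.

```python
GENDER_THRESHOLD_HZ = 165.0  # hypothetical decision boundary, not from the article

def true_gender(sample_name):
    """Gender label from the first letter of the sample name (F or M)."""
    return sample_name[0].upper()

def predicted_gender(f0_hz, threshold=GENDER_THRESHOLD_HZ):
    """Classify as male below the threshold, female otherwise."""
    return "M" if f0_hz < threshold else "F"

def gender_detection_error(samples):
    """Percentage of (name, f0) pairs whose predicted gender is wrong."""
    wrong = sum(1 for name, f0 in samples
                if predicted_gender(f0) != true_gender(name))
    return 100.0 * wrong / len(samples)

# Invented (name, estimated f0) pairs; the third one is a male sample
# whose estimated f0 lies above the threshold.
samples = [("MABC0_iy", 118.0), ("FXYZ0_ae", 210.0),
           ("MQRS0_aa", 190.0), ("FLMN0_uw", 230.0)]
error_percent = gender_detection_error(samples)
print(error_percent)
```

Because a gross pitch error such as an octave jump usually moves the estimate across any reasonable male/female boundary, this error indirectly measures how often an algorithm produces such outliers.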

Conclusions
The experimental results show that the proposed harmonic differences method can safely be used to detect fundamental frequency in wideband and narrowband telephony speech. The new algorithm is particularly successful on the large TIMIT dataset. The Fast Fourier Transform has an inherent resolution problem, but in this article, despite the low resolution of the implementation, the results are satisfactory. HDM is robust to band-limiting and moderate inharmonicity. The HDM algorithm is the fastest method, and further speed improvements can be expected. FCN and CREPE perform remarkably well on wideband data, but they are too slow compared with the other methods. Therefore, they cannot be used for real-time applications, but they can be helpful in ground truth determination. An interesting finding is the highly disappointing results of the FCN and CREPE algorithms in narrowband speech. Although they are quite successful on wideband speech datasets, they produced low success rates on all band-passed datasets. We should bear in mind that FCN and CREPE are end-to-end algorithms that take the raw waveform as input without using frequency-domain descriptors. Most of the useful pitch information lies in the low part (0-400 Hz) of the spectrum, which is removed from the band-passed signal, and without this data, FCN and CREPE are unable to extract the necessary features for pitch determination. The cepstrum algorithm is very old compared to YIN, YAAPT, CREPE, and FCN, but in some cases it can give better predictions. FCN and CREPE are both CNN-based methods; FCN uses temporal smoothing, whereas CREPE does not, but FCN is still much faster than CREPE.
In future work, we plan to implement temporal smoothing in HDM. Temporal smoothing can be quite effective in f0 detection and is used by many algorithms, including YIN, YAAPT, and FCN. Another future direction is testing the ability of HDM in noisy environments and on musical sounds. Pitch refinement is another technique that can be incorporated into HDM.