An Improved Speech Base Frequency Extraction Algorithm

Traditional fundamental frequency extraction algorithms mainly include the cepstrum method and the short-time autocorrelation method. Because of the particular characteristics of dialect audio and the influence of noise, some endpoint detection methods cannot correctly select the effective part of the signal, and some fundamental frequency extraction methods cannot produce data consistent with the trend of the tone values. To address these problems, this paper proposes an improved double threshold method for fundamental frequency extraction, based on the characteristics of the initials and vowels of single-word tones. The method extracts the continuous vowel segment of a single-word utterance and pairs it with a fundamental frequency extraction algorithm suited to the classification of single-word tones, so that effective fundamental frequency data is obtained. The method is more effective than the other methods compared in speech tone recognition, and it is of great significance to the study of speech tone, especially dialect tone.


Introduction
Chinese syllables are mainly composed of initials, finals and tones. Because the composition of Chinese intonation differs from that of other languages, studying the characteristics of Chinese tones is an important topic. The change of fundamental frequency determines the tone, which is a very important phonetic feature that can effectively distinguish semantics and emotions [1]. The fundamental frequency plays an important role in the study of speech features and is applied in many areas, such as tone recognition [2], emotion recognition [3], automatic music notation [4] and speaker recognition [5].
Fundamental frequency extraction methods can be divided into time domain, frequency domain, cepstral domain and other domain analysis methods, according to the parameters they extract. Signal processing methods in the time domain mainly include energy and average amplitude, average zero-crossing rate, and the autocorrelation function [6]. The autocorrelation function method is prone to frequency-doubling and frequency-halving errors [7] under some conditions (such as low SNR). To address this kind of problem, a fundamental frequency extraction algorithm using the autocorrelation function based on linear prediction [8] and a weighted autocorrelation algorithm based on the wavelet packet transform [9] have been proposed. The wavelet transform has the advantages of low computational complexity, suitability for transient data analysis, and easy implementation in hardware [10]. On this basis, an algorithm for extracting the fundamental frequency based on the wavelet transform [11] and a method of de-noising the fundamental frequency with the wavelet transform [12] have been proposed. The cepstrum method, which takes the logarithm of the magnitude spectrum of the speech signal and then applies an inverse Fourier transform [7], extracts the fundamental frequency well when there is no noise. There are also improved variants [13], noise removal and optimization of the improved algorithm [14], and a weighted-square operation combining autocorrelation and cepstrum [15], which have improved the robustness and accuracy of extraction.
The double threshold method is based on the short-time average energy and the short-time zero-crossing rate. The principle is that the final part of a syllable can be found by comparing the short-time average energy, while the initial part can be found by using the short-time average zero-crossing rate. Existing improvements include fusing speech enhancement to improve accuracy at low SNR [16], adjusting the number of thresholds and applying smoothing filtering [17], and replacing the short-time average zero-crossing rate with a better spectral centroid feature [18].
For dialects, the traditional fundamental frequency extraction algorithms designed for Mandarin cannot meet the requirements for fundamental frequency data in tone recognition research, and they cannot extract the fundamental frequency of dialects as well as they do for Mandarin, which seriously hinders the study of Chinese dialects. The reason is that the tone values of dialects are more complex, and the traditional extraction algorithms cannot distinguish the voiced part and the change of pitch frequency well.
This work aims to find a new algorithm for extracting the fundamental frequency of a single word that meets these data requirements. Using the Linyi City data set for the experiments, a new method is proposed that improves endpoint extraction in audio preprocessing and combines it with the simplified inverse filtering method of fundamental frequency extraction, which optimizes the extraction result. The fundamental frequency data extracted by this method is basically consistent with the known tone values, which indicates that it meets the requirements for fundamental frequency data extraction.

Fundamental Frequency Extraction Process
The proposed method can be divided into two steps: endpoint detection and fundamental frequency extraction. Among traditional endpoint detection methods, the double threshold method is the most widely used. Unlike the traditional method, the improved method reduces the influence of the short-time zero-crossing rate during extraction and focuses on extracting the voiced segments for further processing. After the audio is converted into data, amplitude normalization and frame splitting are carried out. Through endpoint extraction, the vowel segments are found and cut out, so that invalid segments are excluded from the next operation. The extracted data is divided into frames again and processed into the form needed for the next step (figure 1). After repeated attempts, the fundamental frequency extracted by the simplified inverse filtering method was found to follow the tone trend best on the dialect corpus.
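The preprocessing described above (amplitude normalization followed by frame splitting) can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: peak normalization is an assumption, since the text does not say which normalization is used, and the function names `normalize_amplitude` and `split_frames` are hypothetical.

```python
def normalize_amplitude(samples):
    """Scale samples so the largest absolute value is 1 (assumed peak normalization)."""
    peak = max((abs(x) for x in samples), default=0.0)
    if peak == 0:
        return [0.0 for _ in samples]  # all-zero input stays all-zero
    return [x / peak for x in samples]


def split_frames(samples, frame_len, frame_shift):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames


normalized = normalize_amplitude([1.0, -2.0, 0.5, 0.0, 1.5, -0.5])
frames = split_frames(normalized, frame_len=4, frame_shift=2)
```

The same `split_frames` helper can be reused for the second framing pass after the vowel segment has been cut out.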

Endpoint Extraction
First, the length of the leading silent (no-speech) segment is set, and the number of silent frames is obtained from the frame length and frame shift. The energy and the zero-crossing rate of each frame are then calculated (equations (1) and (2)). In equation (1), the clipped sample $\tilde{x}(n)$ is a piecewise function: $\tilde{x}(n) = x(n)$ when $|x(n)| > s$, and $\tilde{x}(n) = 0$ otherwise, where $s$ is a very small number. In equation (2), the function $\mathrm{sgn}(\cdot)$ is also piecewise: its value is $1$ when its argument is greater than or equal to $0$, and $-1$ otherwise.
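The per-frame quantities described above can be sketched as follows. Equations (1) and (2) are not reproduced here, so this sketch uses the common textbook definitions of short-time energy and zero-crossing count as an assumption. Applying the small-value clipping before counting zero crossings, the formula in `silent_frame_count`, and all function names are likewise assumptions made for illustration.

```python
def clip_small(x, s):
    """Piecewise clipping: keep x when |x| > s, otherwise return 0."""
    return x if abs(x) > s else 0.0


def sgn(x):
    """Sign function as described in the text: 1 for x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1


def short_time_energy(frame):
    """Sum of squared samples in one frame (assumed definition)."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame, s=1e-3):
    """Number of sign changes in one frame after clipping small samples to 0."""
    clipped = [clip_small(x, s) for x in frame]
    changes = sum(abs(sgn(b) - sgn(a)) for a, b in zip(clipped, clipped[1:]))
    return changes / 2  # each sign change contributes 2 to the sum


def silent_frame_count(silence_len, frame_len, frame_shift):
    """Number of full frames that fit in the leading silent segment (assumed formula)."""
    if silence_len < frame_len:
        return 0
    return (silence_len - frame_len) // frame_shift + 1


def silence_statistics(frames, num_silent):
    """Average energy (AMP) and average zero-crossing rate (ZCR) of the leading frames."""
    head = frames[:num_silent]
    if not head:
        return 0.0, 0.0
    amp = sum(short_time_energy(f) for f in head) / len(head)
    zcr = sum(zero_crossing_rate(f) for f in head) / len(head)
    return amp, zcr
```

With clipping in place, low-level noise that hovers around zero in the silent segment does not inflate the zero-crossing count.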
We obtain the average energy value AMP and the average zero-crossing rate value ZCR of the leading silent segment. The thresholds amp1, amp2 and zcr1 are set according to these two values, and a two-level decision follows. The first-level decision is a rough one: the energy of each frame is compared with amp2, and any frame whose energy is greater than this value must belong to a speech segment. Frames that do not pass the first-level decision are given a second-level decision, in which the energy and zero-crossing rate of the frame are compared with amp1 and zcr1 (the value of zcr1 is 0, so it has no effect). If the conditions are met, the frame is considered to be in a speech segment. This decision can reduce the error caused by pronunciation problems. The length of each run of consecutive frames that meet the conditions is then counted: if it exceeds a designated value δ, the run is identified as a speech segment; otherwise it is treated as empty. Comparing the frames in turn gives the beginning and end of each segment.
Some utterances are detected as a single segment (figure 2), while others yield multiple segments (figure 3), so it is necessary to distinguish effectively which segments are valuable. After some processing, the valuable segment is cut out, and further processing is carried out on the newly obtained fragments.
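The two-level decision can be illustrated with the following sketch. The formulas that derive amp1, amp2 and zcr1 from AMP and ZCR are not given in the text, so the thresholds are simply passed in as parameters. Combining the second-level energy and zero-crossing tests with a logical AND is an assumption, chosen because it makes zcr1 = 0 have no effect as the text states; the name `detect_segments` and the parameter `min_len` (standing in for the designated run length) are hypothetical.

```python
def detect_segments(energies, zcrs, amp1, amp2, zcr1=0, min_len=3):
    """Return (start, end) frame-index pairs of detected speech segments.

    Level 1: energy > amp2 marks a frame as speech.
    Level 2: otherwise, energy > amp1 and zcr >= zcr1 marks it as speech
             (assumed combination; with zcr1 = 0 only the energy test matters).
    Runs of speech frames longer than min_len are kept as segments.
    """
    segments = []
    run_start = None
    n = min(len(energies), len(zcrs))
    for i in range(n):
        is_speech = energies[i] > amp2 or (energies[i] > amp1 and zcrs[i] >= zcr1)
        if is_speech and run_start is None:
            run_start = i  # a new run of speech frames begins
        elif not is_speech and run_start is not None:
            if i - run_start > min_len:
                segments.append((run_start, i - 1))
            run_start = None  # run too short or already stored
    if run_start is not None and n - run_start > min_len:
        segments.append((run_start, n - 1))  # run reaches the last frame
    return segments


energies = [0.1, 0.1, 5, 6, 7, 8, 0.1, 5, 0.1]
zcrs = [0] * len(energies)
segments = detect_segments(energies, zcrs, amp1=1, amp2=4, min_len=3)
```

In this example the run of four high-energy frames is kept, while the isolated high-energy frame near the end is discarded because its run is too short.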

Extraction of Basic Frequency by Simplified Inverse Filtering Method
This paper designs a screening algorithm that optimizes and screens the obtained segments according to the observed characteristics of the audio segments, yielding the effective segments for fundamental frequency extraction. Because fundamental frequency extraction and endpoint detection require different preprocessing, the audio must be processed again: the data is downsampled, split into frames, and passed through endpoint detection again to ensure the accuracy of the audio segment. A simplified inverse filtering algorithm is then used to extract the fundamental frequency. With this improvement, endpoint detection and fundamental frequency extraction are combined to improve the detection result for dialect speech.

Figure 4. Steps of the simplified inverse filtering method [19].
The specific steps are as follows. The speech signal X is passed through a low-frequency band-pass filter to obtain X1; based on experience, a passband of 60 to 500 Hz is selected. X1 is then downsampled by 4:1 to obtain X2. After frame splitting, the LPC coefficients are extracted and the autocorrelation function is calculated to obtain RU1. RU is obtained by raising the sampling rate of RU1 by a factor of 4, and the pitch period of the speech is obtained by detecting the peak of the correlation function (figure 4).
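The steps above can be sketched for a single frame as follows. This is an illustrative approximation under stated assumptions, not the implementation of [19]: the windowed-sinc filter design, the LPC order of 4, the use of the LPC inverse-filter residual for the autocorrelation, and all function names are assumptions. The final 4x interpolation of the autocorrelation (RU1 to RU) is omitted, so the lag resolution is that of the downsampled signal.

```python
import math


def bandpass_fir(fs, low, high, taps=101):
    """Windowed-sinc band-pass FIR coefficients (Hamming window; assumed design)."""
    mid = (taps - 1) / 2
    h = []
    for n in range(taps):
        t = n - mid
        if t == 0:
            ideal = 2.0 * (high - low) / fs
        else:
            ideal = (math.sin(2 * math.pi * high * t / fs)
                     - math.sin(2 * math.pi * low * t / fs)) / (math.pi * t)
        h.append(ideal * (0.54 - 0.46 * math.cos(2 * math.pi * n / (taps - 1))))
    return h


def fir_filter(x, h):
    """Causal FIR filtering, output has the same length as x."""
    return [sum(h[k] * x[n - k] for k in range(min(len(h), n + 1)))
            for n in range(len(x))]


def autocorr(x, max_lag):
    """Biased autocorrelation r[0..max_lag]."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson(r, order):
    """Levinson-Durbin recursion; returns inverse-filter coefficients, a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:
            break  # signal already perfectly predicted (or silent)
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def pitch_from_decimated(x, fs_d, low=60.0, high=500.0, order=4):
    """LPC inverse filtering plus autocorrelation peak picking on a downsampled frame."""
    r = autocorr(x, order)
    if r[0] <= 0:
        return 0.0  # silent frame
    a = levinson(r, order)
    resid = [sum(a[k] * x[n - k] for k in range(min(order, n) + 1))
             for n in range(len(x))]
    min_lag = max(1, int(fs_d / high))
    max_lag = min(len(x) - 1, int(fs_d / low))
    if max_lag < min_lag:
        return 0.0  # frame too short for the requested pitch range
    ru = autocorr(resid, max_lag)
    best = max(range(min_lag, max_lag + 1), key=lambda lag: ru[lag])
    return fs_d / best


def sift_pitch(frame, fs, low=60.0, high=500.0, decim=4, order=4):
    """Band-pass filter, downsample by decim, then estimate the pitch in Hz."""
    x1 = fir_filter(frame, bandpass_fir(fs, low, high))
    x2 = x1[::decim]
    return pitch_from_decimated(x2, fs / decim, low, high, order)
```

The search range for the autocorrelation peak is tied to the same 60 to 500 Hz band as the filter, so an estimate can never fall outside the band that was kept. Because the pitch is read from an integer lag at the downsampled rate, the estimate is coarse; the interpolation step in the text exists to refine it.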

Experiment
Three experiments were carried out for the method proposed in this paper.

The Data Set
The data set used in this experiment is the Linyi dialect data set, which is derived from the special data set of language protection and includes speech data from 8 counties in Linyi. For each region, there are more than 2000 words, vocalizations, sentences and other fragments recorded by several speakers. The recorder is Yanmei Shao of Shandong Normal University.
In this experiment, only the word fragments of elderly male speakers were used, covering 8 different tone values. The Linyi dialect database has been verified by professionally trained dialect researchers, which meets the verification requirements of this experiment.

Endpoint Detection Experiment
As the national common language, Mandarin has simple and clear tones, which makes it easy to recognize in voice feature extraction. Compared with dialects, Mandarin is better suited to all kinds of methods. The processing and extraction method described in this paper for Linyi dialect pronunciation should therefore also be applicable to Mandarin, and this experiment tests the effect of the method on Mandarin. The Mandarin database used in this experiment was recorded by professional recording personnel, so its standard Mandarin serves as a reference. The experimental results are shown in table 1. The proposed method extracts the first and fourth tones well, with the highest accuracy of 97.9%, but it is slightly worse for the second and third tones, with the lowest accuracy only reaching 81.5%.

Experiment with Different Tone Values
Because different tone values have different characteristics, the method in this paper does not adapt to them equally well. The accuracy will therefore not be exactly the same for different tone values, and it must be measured through experiments.
Audio with different tone values from several areas was selected and processed with the method in this paper, and the rate at which the result agrees with the trend of the tone value was measured. The results are shown in table 2. In general, the accuracy exceeds 60%; the accuracy for Linshu County's audio is the highest, reaching 70.1%. The accuracy for tone value 53 in Fei County is too low and is treated as an outlier and ignored. For a single tone value, the best result is 77.8% for tone value 312 in Linshu, and most results stay around 65%, which shows a good effect.
The experiments show that the accuracy for the same tone value differs between regions (the accuracy for tone value 44 was 67.3% in Linshu and 71.4% in Pingyi), and that the accuracy for different tone values differs within a region (the four tone values in Linshu have different accuracies). In general, the accuracy of the method described in this paper meets the requirements for fundamental frequency data, but it needs further improvement.

Comparison Experiments with Different Methods
Traditional fundamental frequency extraction methods are mostly designed for Mandarin and are not very suitable for dialect pronunciation. In contrast, the method in this paper focuses on fundamental frequency extraction for dialects, so it is more targeted and performs better.
In this experiment, the accuracy of the proposed method is compared with that of traditional methods, and each tone is evaluated with several of them: the linear correlation cepstrum method, the autocorrelation function method, the cepstrum method and the average amplitude difference method. The results in table 3 show that, for the tones of Mandarin, the proposed method is not significantly different from the traditional algorithms. Its results for the second and third tones are similar to theirs, but its result for the first tone is better, so overall it is better than the traditional algorithms. For the dialect tone values, the results obtained in this paper are the best, with the accuracy maintained above 60%. Among the traditional methods, the best result is only 46.2%, and the autocorrelation function method is the worst. The other three algorithms perform differently for different tone values; because their results vary widely and their accuracy is not high, they are worse than the method in this paper on the whole. The tone direction obtained by the traditional fundamental frequency extraction methods deviates more from the labelled tone, because the traditional endpoint extraction methods and fundamental frequency extraction algorithms are not suitable for processing dialect audio.

Conclusion
This paper proposes a new fundamental frequency extraction method that combines the improved double threshold method with the simplified inverse filtering method. Compared with other methods, its accuracy in extracting the fundamental frequency is slightly higher, but there is still room for improvement. Accuracy was evaluated in three experiments: a test on Mandarin, a comparison across different tone values, and a comparison with traditional methods. The results show that this method is more effective than the other methods in extracting the endpoints and fundamental frequencies of single-word dialect utterances, and it is of great significance to the study of single-word tones, especially dialect tones.