Speech Recognition of Oral English Teaching Based on Deep Belief Network

Abstract: Oral English teaching faces several common problems: teaching methods are often inefficient, and many learners speak English poorly. The development of computer-aided language learning offers a possible solution to these problems. Based on techniques of speech recognition, cloud computing and deep learning, this paper applies the deep belief network (DBN) to recognize speech in oral English teaching, and establishes a multi-parameter evaluation model for the pronunciation quality of oral English among college students. The model combines the merits of subjective and objective evaluations, and assesses pronunciation from four aspects: pitch, speech rate, rhythm and intonation. Finally, the proposed model was verified through speech recognition and pronunciation evaluation experiments on 26 non-English majors from a college. The results show that the proposed evaluation model outputs credible results that are consistent with those of experts, as evidenced by the consistency rate, the neighbourhood consistency rate and the Pearson correlation coefficient. The research provides a feasible way to evaluate the oral English proficiency of learners, laying the basis for improving the teaching and learning efficiency of oral English.


Introduction
In the era of economic globalization, trade is booming across borders. English, as an international language, has achieved unprecedented importance globally, and English learning is being encouraged in many countries. For various reasons, however, the teaching of English as a second language has many deficiencies. Many English learners perform well in listening, reading and writing, but have difficulty in speaking. Computer-aided language learning has made it possible for them to overcome this difficulty.
Currently, many learners practise oral English by listening to and repeating audio-visual materials on mobile phones and MP3 players. However, these devices cannot evaluate or correct the learner's pronunciation. Against this backdrop, many colleges around the world have organized oral communication internships and interactive language programs, and developed speech recognition and scoring techniques, giving learners good opportunities to sharpen their oral English [1]. Fruitful results have also been achieved in pronunciation evaluation.
With the emergence of deep neural networks (DNNs) and deep learning (DL), many DL-based neural networks (NNs) are being applied to speech recognition. Technology giants like Microsoft, Google and Baidu have all designed DNN speech recognition models, which can recognize speech accurately and rapidly [2].
Based on the above analysis, this paper briefly introduces relevant theories on speech signal pre-processing, feature extraction, and DL networks, and applies the deep belief network (DBN) to recognize speech in oral English teaching. Then, a multi-parameter evaluation model was established for the pronunciation quality of oral English among college students, and verified through simulation experiments.

Speech signal preprocessing and feature extraction
Speech recognition process: To convert the learner's spoken English into machine-recognizable digital information, the computer's sound card first digitizes the analogue voice signal, and the voice signal is then pre-processed to extract the characteristic parameters. A reference model is then established by training on these parameters. After the training, pattern matching and speech recognition are performed on the test sentences, and the final recognition results are obtained through post-processing [3].
Voice signal preprocessing: The process of voice signal preprocessing is shown in Figure 1.

Fig. 1. Process of speech signal preprocessing (stages recoverable from the figure: pre-emphasis, framing, endpoint detection)
iJET, Vol. 15, No. 10, 2020

Pre-emphasis: To flatten the spectrum of the speech signal, its high-frequency part was boosted by a 6 dB/oct pre-emphasis digital filter. Formula (1) shows the filter response function [4]. For an input speech signal x(n), the result y(n) after pre-emphasis is given by formula (2).

$$H(z) = 1 - \mu z^{-1} \tag{1}$$

$$y(n) = x(n) - \mu\, x(n-1) \tag{2}$$

where $\mu$ is the pre-emphasis coefficient.
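The pre-emphasis step can be illustrated with a minimal Python sketch. It assumes the usual first-order difference form y(n) = x(n) - mu * x(n-1); the function name `pre_emphasis` and the default coefficient 0.97 are illustrative assumptions, since the paper does not state the value it used.

```python
def pre_emphasis(x, mu=0.97):
    """Apply y(n) = x(n) - mu * x(n-1) to a list of samples.

    mu=0.97 is an assumed, commonly quoted value, not one taken from the paper.
    The first sample has no predecessor and is passed through unchanged.
    """
    if not x:
        return []
    return [x[0]] + [x[n] - mu * x[n - 1] for n in range(1, len(x))]


if __name__ == "__main__":
    # A constant (low-frequency) signal is strongly attenuated after the first sample.
    print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
```

A constant input is almost cancelled, while rapid sample-to-sample changes pass through, which is the high-frequency boost described above.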
Framing: Based on the short-term stationary characteristics of the speech signal, framing was performed on the speech signal stream using a half-frame overlap method [5] to analyse the time series of the characteristic parameters in the speech signal.
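As a rough illustration of half-frame overlap, the sketch below splits a sample list into frames whose hop size is half the frame length. The frame length of 256 samples and the name `frame_signal` are assumptions made for the example; trailing samples that do not fill a whole frame are simply dropped here.

```python
def frame_signal(x, frame_len=256):
    """Split x into frames of frame_len samples with a hop of frame_len // 2.

    frame_len=256 is an assumed value; the paper does not give its frame length.
    """
    hop = frame_len // 2  # half-frame overlap
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append(x[start:start + frame_len])
    return frames


if __name__ == "__main__":
    for frame in frame_signal(list(range(8)), frame_len=4):
        print(frame)
```

Each sample away from the signal edges therefore appears in two consecutive frames, which smooths the frame-to-frame trajectory of the extracted parameters.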
Window: The purpose of windowing is to emphasize the speech waveform near the sampling point and attenuate the rest of the waveform. The calculation is shown in formula (3). The Hanning, Hamming and rectangular windows are the three most commonly used window functions [6]. In this paper, the widely used Hamming window was applied to the speech signals; its definition is shown in formula (4).

$$s_w(n) = s(n)\, w(n) \tag{3}$$

$$w(n) = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{N-1}\right), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

where s(n) is a frame of the speech signal, w(n) is the window function, and N is the frame length.

Endpoint detection: The performance of speech recognition depends directly on the quality of the endpoint detection algorithm. The dual-threshold endpoint detection method [7] is widely used at present; it not only detects the endpoints of valid voice signals accurately, but also improves the processing efficiency of the system [8]. It was therefore applied in this study for endpoint detection.
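The windowing and endpoint detection steps can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses the standard Hamming window definition, computes the short-time energy of each windowed frame, and applies two energy thresholds only. Dual-threshold detectors often also use the zero-crossing rate, which is omitted here, and the threshold values in the example are made up.

```python
import math


def hamming(n_samples):
    """Standard Hamming window of length n_samples (n_samples > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def short_time_energy(frames):
    """Energy of each frame after applying a Hamming window."""
    w = hamming(len(frames[0]))
    return [sum((s * wn) ** 2 for s, wn in zip(frame, w)) for frame in frames]


def detect_endpoints(energy, high, low):
    """Return (start, end) frame indices of the first speech segment, or None.

    A frame above the high threshold confirms speech; the segment is then
    extended in both directions while the energy stays above the low threshold.
    """
    peak = next((i for i, e in enumerate(energy) if e > high), None)
    if peak is None:
        return None
    start = peak
    while start > 0 and energy[start - 1] > low:
        start -= 1
    end = peak
    while end + 1 < len(energy) and energy[end + 1] > low:
        end += 1
    return start, end


if __name__ == "__main__":
    # Hypothetical per-frame energies; high=1.0 and low=0.1 are illustrative.
    energy = [0.0, 0.2, 1.5, 2.0, 0.3, 0.05, 0.0]
    print(detect_endpoints(energy, high=1.0, low=0.1))
```

The high threshold guards against short noise bursts, while the low threshold recovers the weak onset and offset of the utterance.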
Speech feature parameter extraction: The original speech signal contains a lot of interference and redundant information, and can only be used for speech processing after this information is removed. Therefore, it is necessary to extract speech feature parameters from the original speech signal [9]. The Mel frequency cepstrum coefficient (MFCC), the linear prediction cepstrum coefficient (LPCC), and the fast Fourier transform (FFT) spectral coefficient are common frequency-domain speech features. Among them, the MFCC improves the performance of speech signal processing and is robust [10], so it was used as the speech feature parameter in this study.
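For readers unfamiliar with MFCC, the following is a minimal, dependency-free sketch of the usual pipeline for a single frame: power spectrum, triangular mel filterbank, logarithm, and DCT-II. All parameter values (20 filters, 12 coefficients, 8 kHz sampling) and function names are assumptions for illustration; the paper does not specify its configuration, and a practical system would use an FFT rather than the naive DFT shown here.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (slow, but fine for a sketch)."""
    n_fft = len(frame)
    spec = []
    for k in range(n_fft // 2 + 1):
        re = sum(frame[n] * math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft))
        im = -sum(frame[n] * math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft))
        spec.append((re * re + im * im) / n_fft)
    return spec


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    mel_points = [i * top / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * n_bins
        for k in range(left, centre):       # rising edge
            filt[k] = (k - left) / (centre - left)
        for k in range(centre, right):      # falling edge
            filt[k] = (right - k) / (right - centre)
        bank.append(filt)
    return bank


def mfcc(frame, sample_rate, n_filters=20, n_ceps=12):
    """Return n_ceps cepstral coefficients (c1..c_n) for one frame."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # A small constant avoids log(0) for empty filters.
    log_energy = [math.log(sum(f * s for f, s in zip(filt, spec)) + 1e-10) for filt in bank]
    # DCT-II decorrelates the log filterbank energies.
    return [sum(log_energy[j] * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j in range(n_filters))
            for c in range(1, n_ceps + 1)]


if __name__ == "__main__":
    frame = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(128)]
    print([round(c, 3) for c in mfcc(frame, 8000)])
```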

Deep learning and neural networks
Deep learning [11] is a relatively new field of machine learning research that draws heavily on unsupervised learning. Its essence is to improve the accuracy of prediction or classification, and to discover the internal connections and characteristics of the data, by building a learning model with multiple hidden layers that simulates how the human brain analyses the training data. The core ideas of deep learning [12] are: (1) unsupervised learning is used to pre-train each layer of the network; (2) only one layer is trained at a time with unsupervised learning, and its training result is taken as the input to the next higher layer; (3) supervised learning is adopted to adjust all layers.
The deep belief network (DBN) [13] is trained with an unsupervised greedy layer-wise learning algorithm. It is an efficient deep learning model built by stacking several restricted Boltzmann machines (RBMs). The training is conducted layer by layer from the bottom to the top, and finally the network is fine-tuned by a traditional global learning algorithm (such as the BP algorithm), so that the model converges to a local optimum, as shown in Figure 2.
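The greedy layer-wise idea can be sketched in plain Python as below. This is an illustrative toy, not the network used in the paper: each RBM is trained with one step of contrastive divergence (CD-1), a common choice that the paper does not explicitly name, and the supervised BP fine-tuning stage is omitted. The class name `RBM`, the function `pretrain_stack`, the learning rate and the layer sizes are all assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class RBM:
    """Binary restricted Boltzmann machine trained with CD-1 (assumed here)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.W = [[self.rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b = [0.0] * n_visible  # visible biases
        self.c = [0.0] * n_hidden   # hidden biases

    def hidden_probs(self, v):
        return [sigmoid(self.c[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.c))]

    def visible_probs(self, h):
        return [sigmoid(self.b[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b))]

    def cd1_update(self, v0, lr=0.1):
        """One contrastive-divergence step on a single training vector."""
        h0 = self.hidden_probs(v0)
        h0_sample = [1.0 if self.rng.random() < p else 0.0 for p in h0]
        v1 = self.visible_probs(h0_sample)   # reconstruction
        h1 = self.hidden_probs(v1)
        for i in range(len(v0)):
            for j in range(len(h0)):
                self.W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
        for i in range(len(v0)):
            self.b[i] += lr * (v0[i] - v1[i])
        for j in range(len(h0)):
            self.c[j] += lr * (h0[j] - h1[j])


def pretrain_stack(data, layer_sizes, epochs=5):
    """Train RBMs bottom-up; each layer's hidden activations feed the next layer."""
    rbms, inputs = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(len(inputs[0]), n_hidden)
        for _ in range(epochs):
            for v in inputs:
                rbm.cd1_update(v)
        inputs = [rbm.hidden_probs(v) for v in inputs]
        rbms.append(rbm)
    return rbms


if __name__ == "__main__":
    toy_data = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
    stack = pretrain_stack(toy_data, layer_sizes=[3, 2])
    print(len(stack), "RBM layers pre-trained")
```

In a full DBN, the pre-trained weights would initialise a feed-forward network that is then fine-tuned with labelled data.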

Multi-parameter pronunciation quality evaluation
Pronunciation quality evaluation process: (1) Subjective evaluation: After listening to the test voice, a language expert identifies pronunciation errors and the differences between the test voice and standard speech based on linguistic knowledge, and then evaluates the spoken language level of the test subject [14]. However, because it depends on the experience and personal impressions of the expert, this method is highly subjective, making it difficult to guarantee the reliability of the evaluation results.
(2) Objective evaluation: Objective evaluation [15] refers to using a computer-based pronunciation quality evaluation system to extract features from the spoken English of a test subject, match them against previously extracted feature parameters of standard speech, and output an evaluation score for the test voice. This method can reduce evaluation bias and improve evaluation efficiency.
Evaluation index: Different groups have different requirements and evaluation standards for oral English learning. Taking college students as an example, this paper establishes a multi-parameter pronunciation quality evaluation model for college students' oral English learning. Figure 3 shows the specific evaluation indexes.

(2) Speech rate evaluation: Speech rate is a measure of how fast a speaker talks [16]. This paper uses a speech rate based on speech length to evaluate the speaking speed of English learners. The calculation is shown in formula (5). The calculated value was then compared with the set speech rate threshold to evaluate the speech rate and give the learner feedback on their oral English.

$$\text{rate ratio} = \frac{Len_{Std}}{Len_{Text}} \tag{5}$$

where $Len_{Text}$ and $Len_{Std}$ are the lengths of the test sentence and the standard sentence, respectively.
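A length-based speech rate check of this kind can be sketched as follows. The sketch assumes the ratio is taken as standard length over test length, so that values above 1 mean the learner spoke faster than the reference; the tolerance of 0.2 and the feedback labels are hypothetical, since the paper does not report its threshold.

```python
def speech_rate_ratio(len_test, len_std):
    """Ratio of standard-sentence length to test-sentence length (assumed direction)."""
    if len_test <= 0 or len_std <= 0:
        raise ValueError("sentence lengths must be positive")
    return len_std / len_test


def rate_feedback(ratio, tolerance=0.2):
    """Map the ratio to coarse feedback; tolerance is an illustrative threshold."""
    if ratio > 1.0 + tolerance:
        return "too fast"
    if ratio < 1.0 - tolerance:
        return "too slow"
    return "appropriate"


if __name__ == "__main__":
    # Durations in seconds: the learner took 3.0 s for a sentence the reference reads in 2.0 s.
    print(rate_feedback(speech_rate_ratio(len_test=3.0, len_std=2.0)))
```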
(3) Rhythm evaluation: The rhythm of language can be divided into three types: emphasized stress, incomplete stress, and complete stress [17]. English is a stress-timed language, and the tempo of a sentence is determined by the number of stressed syllables. Figure 4 shows the rhythm evaluation mechanism.

b) Structure the sentence: Although individual speakers differ in their pronunciation characteristics, their pronunciations still follow certain rules. Therefore, before the test, the test sentences were regularized to be close to the standard sentences in order to obtain more objective evaluation results.

c) Calculate the intensity curve matching graph: Based on the original dynamic time warping (DTW) algorithm [18], R (reference template) and T (test template) were divided isochronally into N and M frames, and the matching path was divided into three segments along the X axis: $(1, X_a)$, $(X_a+1, X_b)$, $(X_b+1, N)$, where $X_a$ and $X_b$ are the closest integers to

$$X_a = \frac{1}{3}(2M - N), \qquad X_b = \frac{2}{3}(2N - M)$$

Dynamic matching was performed only when $2M - N \ge 3$ and $2N - M \ge 2$. The bounds $y_{min}$ and $y_{max}$ are calculated as:

$$y_{min} = \begin{cases} \dfrac{x}{2}, & 0 \le x \le X_b \\ 2x + (M - 2N), & X_b < x \le N \end{cases}$$

$$y_{max} = \begin{cases} 2x, & 0 \le x \le X_a \\ \dfrac{x}{2} + \left(M - \dfrac{N}{2}\right), & X_a < x \le N \end{cases}$$

Each frame on the X axis was matched only against the frames in $[y_{min}, y_{max}]$; as the X coordinate moved forward, the corresponding Y-axis frames followed the same regularity, and the cumulative distance is:

$$D(x, y) = d(x, y) + \min\{D(x-1, y),\ D(x-1, y-1),\ D(x-1, y-2)\}$$

where d(x, y) is the local distance between frame x and frame y.

With the stress threshold and non-stress threshold set as in formulas (12) and (13), the double-threshold comparison method was then used to detect stress endpoints. Meanwhile, the duration of stressed speech was set to 100 ms, to determine the number of stresses in the sentence.
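The matching of two intensity curves by DTW can be sketched as follows. The sketch assumes the slope-limited step pattern in which the X index always advances by one frame while the Y index advances by zero, one or two frames; this step pattern is an assumption of the example. For simplicity it fills the whole distance table instead of restricting the search to the band between y_min and y_max, and it works on one-dimensional curves with an absolute-difference local distance, both of which are simplifications.

```python
import math


def dtw_distance(ref, test):
    """Cumulative DTW distance between two 1-D curves.

    ref is laid along the X axis and test along the Y axis. From (x-1, *) the
    path may reach (x, y) from y, y-1 or y-2, so unreachable end points yield
    math.inf (the two curves differ too much in length to be matched).
    """
    n, m = len(ref), len(test)
    D = [[math.inf] * m for _ in range(n)]
    D[0][0] = abs(ref[0] - test[0])
    for x in range(1, n):
        for y in range(m):
            best = D[x - 1][y]
            if y >= 1:
                best = min(best, D[x - 1][y - 1])
            if y >= 2:
                best = min(best, D[x - 1][y - 2])
            if best < math.inf:
                D[x][y] = abs(ref[x] - test[y]) + best
    return D[n - 1][m - 1]


if __name__ == "__main__":
    standard = [0.1, 0.5, 0.9, 0.4, 0.1]        # hypothetical intensity curve
    learner = [0.1, 0.4, 0.6, 0.9, 0.5, 0.1]    # same shape, slightly stretched
    print(dtw_distance(standard, learner))
```

A small cumulative distance indicates that the learner's intensity contour follows the standard one closely after time alignment.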
e) Calculate rhythm correlation: The Pairwise Variability Index (PVI) can be used to determine the difference in syllable length and measure the correlation of language rhythms.Based on the differences in English pronunciation length, this paper improves the formula for calculating the PVI, as shown in formula (14), which is used as a basis for systematic evaluation.
where Len is the standard sentence length, m = min(the number of standard sentence units, the number of test sentence units), and $d_k$ is the duration of the k-th speech unit segment.

f) Rhythm evaluation and feedback: The intensity curve matching, the number of stresses, and the dPVI parameters of the standard sentence and the test sentence were compared comprehensively to obtain the final evaluation result for oral rhythm.
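For orientation, the sketch below computes the standard normalised Pairwise Variability Index (nPVI) over a sequence of unit durations. This is the widely used textbook form, not the improved formula (14) of this paper, which is not reproduced here; the example durations are invented.

```python
def npvi(durations):
    """Normalised PVI: mean of |d_k - d_{k+1}| / mean(d_k, d_{k+1}), times 100."""
    if len(durations) < 2:
        raise ValueError("need at least two unit durations")
    total = 0.0
    for a, b in zip(durations, durations[1:]):
        total += abs(a - b) / ((a + b) / 2.0)
    return 100.0 * total / (len(durations) - 1)


if __name__ == "__main__":
    # Hypothetical syllable durations in seconds for a standard and a test sentence.
    standard = [0.12, 0.30, 0.10, 0.28, 0.15]
    learner = [0.20, 0.22, 0.19, 0.21, 0.20]
    print(round(npvi(standard), 1), round(npvi(learner), 1))
```

A learner who gives every syllable a similar duration obtains a much lower index than a stress-timed reference, which is the kind of rhythmic difference the PVI is meant to expose.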
(4) Intonation evaluation: Intonation is the variation in spoken pitch across a sentence. For the same sentence, different intonations can convey different meanings; the lexical meaning of a sentence plus the meaning of its intonation can be regarded as its full meaning. The five basic intonations in English are rising intonation (↗), falling intonation (↘), rising-falling intonation (∧), falling-rising intonation (∨), and flat intonation (→). Pitch is the most basic and important constituent element of intonation. In discourse, the height of the sound is expressed as intonation, and the different rising and falling patterns of the intonation are determined by the pitch. This paper uses the autocorrelation function (ACF) in the time domain to extract the pitch of English sentences, then uses a median filter to smooth the pitch, calculates the fit of the fundamental frequency curves through DTW, and finally evaluates the intonation.
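The first two stages of this pipeline, ACF pitch extraction and median smoothing, can be sketched as below. The pitch search range of 60 to 500 Hz, the window width of 3 and the function names are assumptions of the example; voicing decisions and the DTW comparison of the two fundamental frequency curves are left out.

```python
import math


def acf_pitch(frame, sample_rate, f_min=60.0, f_max=500.0):
    """Estimate F0 (Hz) as the lag that maximises the autocorrelation.

    Only lags corresponding to [f_min, f_max] are searched; returns 0.0 when
    no positive autocorrelation peak is found.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if r > best_val:
            best_lag, best_val = lag, r
    return sample_rate / best_lag if best_lag else 0.0


def median_filter(values, width=3):
    """Sliding median; windows are truncated at the edges of the sequence."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = sorted(values[max(0, i - half):i + half + 1])
        out.append(window[len(window) // 2])
    return out


if __name__ == "__main__":
    sr = 8000
    frame = [math.sin(2 * math.pi * 200 * n / sr) for n in range(400)]
    print(acf_pitch(frame, sr))
    # A single octave error (400) inside an otherwise smooth contour is removed.
    print(median_filter([100, 101, 400, 102, 99]))
```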

Simulation experiment and result analysis
Data source: In this paper, 26 college students who were not English majors (15 males and 11 females) were recruited as test subjects through posters, online solicitation and other means. Ten common spoken English sentences were recorded for each subject using the CoolEdit recording software.
Evaluation of speech recognition: Using the pronunciations of 55 people in a data set from the UCI machine learning database as the training set and the pronunciations of 33 people as the test set, a comparative analysis was carried out between the speech recognition rate of the model proposed in this study and those of other models.
Time warping is a problem for most neural networks. In this paper, segmentation and averaging were used to pre-process the speech signal. First, the feature parameters were divided into N segments according to formula (15), and then each M(i) was divided into M segments. Next, the mean vector of each sub-segment was calculated, and the output of the dimensionality-reduced feature parameter was a K × M × N matrix S(K, T) composed of the averages of the respective segments.
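The segmentation and averaging idea can be illustrated with the following sketch, which turns a variable-length sequence of K-dimensional feature vectors into a fixed number of mean vectors. Equal-length splitting is an assumption made for the example, because formula (15) is not reproduced here; the function names are likewise illustrative.

```python
def split_even(seq, parts):
    """Split seq into `parts` contiguous chunks of near-equal length."""
    n = len(seq)
    bounds = [i * n // parts for i in range(parts + 1)]
    return [seq[bounds[i]:bounds[i + 1]] for i in range(parts)]


def mean_vector(frames):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    k = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(k)]


def segment_average(features, n_segments, m_subsegments):
    """Reduce T feature vectors to n_segments * m_subsegments mean vectors."""
    if len(features) < n_segments * m_subsegments:
        raise ValueError("too few frames for the requested segmentation")
    out = []
    for segment in split_even(features, n_segments):
        for sub in split_even(segment, m_subsegments):
            out.append(mean_vector(sub))
    return out


if __name__ == "__main__":
    # Eight frames of 2-dimensional toy features.
    feats = [[float(t), 2.0 * t] for t in range(8)]
    for vec in segment_average(feats, n_segments=2, m_subsegments=2):
        print(vec)
```

Whatever the utterance length T, the output always has N × M vectors of dimension K, which gives the network a fixed-size input.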
In formula (15), M(i) is the speech feature parameter of the i-th segment after segmentation.
Pronunciation evaluation experiment: To validate the English pronunciation quality evaluation model and its method, this paper uses three indexes, namely the Pearson correlation coefficient, the consistency rate and the neighbourhood consistency rate, to test the agreement between manual evaluation and machine evaluation. Drawing on the English pronunciation characteristics of college students and on previous research results, this paper selects four indicators to evaluate college students' oral English pronunciation: pitch, rhythm, intonation, and speech rate. Each indicator is scored on a four-level scale: A, B, C, and D (i.e., 4 points, 3 points, 2 points, and 1 point, respectively). In addition, two outstanding college English teachers were selected to evaluate the recorded spoken English of the test subjects. The two-sided test results showed that the evaluations of the two teachers were basically consistent, indicating that the manual evaluation data were valid.
As described above, a comparative analysis was performed on the results of machine evaluation and manual evaluation. Figure 6 shows the experimental results of the evaluation indexes in terms of the number of samples, and Figure 7 shows them in terms of the statistical indexes. Among the 260 English sentences, the consistency rate and the neighbourhood consistency rate between machine and manual evaluation were all above 80% for the four evaluation indicators, and the Pearson correlation coefficients were 0.85, 0.496, 0.557, and 0.631, which indicates that the proposed evaluation method is credible.
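The three agreement indexes can be computed as in the sketch below. The definitions are the commonly used ones and are assumptions here, since the paper does not spell them out: the consistency rate is the share of samples where the machine and human grades are identical, and the neighbourhood consistency rate is the share where they differ by at most one grade. The scores in the example are invented.

```python
import math


def consistency_rate(machine, human):
    """Share of samples with identical machine and human grades."""
    return sum(m == h for m, h in zip(machine, human)) / len(machine)


def neighbourhood_consistency_rate(machine, human):
    """Share of samples whose grades differ by at most one level."""
    return sum(abs(m - h) <= 1 for m, h in zip(machine, human)) / len(machine)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Hypothetical grades on the 4-point scale (A=4 ... D=1).
    machine = [4, 3, 2, 1, 3]
    human = [4, 3, 3, 1, 1]
    print(consistency_rate(machine, human))
    print(neighbourhood_consistency_rate(machine, human))
    print(round(pearson(machine, human), 3))
```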

Conclusion
Speech recognition and pronunciation evaluation technologies are the core of computer-aided language learning. This paper studies speech recognition in oral English teaching based on the deep belief network. The specific conclusions are as follows:
• The DBN was applied to recognize speech in oral English teaching, and a multi-parameter evaluation model was established for the pronunciation quality of oral English among college students.
• The simulation results of the speech recognition experiments show that the speech recognition rate of the DBN-based model was 96.21%, which is better than those of the other models.
• The simulation results of the pronunciation evaluation experiments show that the consistency rate and the neighbourhood consistency rate between machine and manual evaluation were all above 80% for the four evaluation indicators, and the Pearson correlation coefficients were 0.85, 0.496, 0.557, and 0.631, which supports the credibility of the evaluation method.

Fig. 2. The main framework of deep learning algorithms

Fig. 3. Multi-parameter pronunciation quality evaluation index for college students

(1) Pitch evaluation: In this study, the MFCC coefficient was used as the evaluation standard of pitch. Specifically, the MFCC feature parameters of the test speech and the standard speech are extracted, and the MFCC feature correlation coefficients are combined with the DBN-based speech recognition model to recognize the speech and evaluate the pitch of English learners.

Fig. 4. Specific steps of rhythm evaluation mechanism

a) Extract the short-term energy value to form an intensity curve. The short-term energy is calculated as:

$$E_n = \sum_{m=-\infty}^{\infty} \left[x(m)\, w(n-m)\right]^2$$

where x(m) is the speech signal and w(·) is the window function.

d) Determine the stress units and the number of stresses: Through experiments and previous research experience, this paper sets the stress threshold and the non-stress threshold as shown in formulas (12) and (13).

Fig. 5. Comparison of recognition rates under different models