Article

A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training

1 Xinjiang Multilingual Information Technology Laboratory, Urumqi 830017, China
2 College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(2), 870; https://doi.org/10.3390/s23020870
Submission received: 4 December 2022 / Revised: 1 January 2023 / Accepted: 8 January 2023 / Published: 12 January 2023
(This article belongs to the Section Intelligent Sensors)

Abstract

Building a good speech recognition system usually requires a large amount of paired speech and text data, which poses a big challenge for low-resource languages such as Kazakh. In recent years, unsupervised pre-training has achieved good performance in low-resource speech recognition, but it is rarely applied to Kazakh and other Central and West Asian languages. In this paper, wav2vec 2.0 is improved by integrating a factorized TDNN layer to better preserve the relationship between neighboring time steps before and after quantization; the resulting model is called wav2vec-F. An unsupervised pre-training strategy is used to learn latent speech representations from a large amount of unlabeled audio, optimized with a noise-contrastive binary classification task, and the representations are applied to the cross-language ASR task. At the same time, speech synthesis is used to boost speech recognition performance. The experiments show that wav2vec-F can effectively exploit unlabeled data from non-target languages and that multi-language pre-training is clearly better than single-language pre-training. Data augmentation with speech synthesis brings substantial gains. Compared with the baseline model, the word error rate on Librispeech's test-clean set is reduced by 1.9% on average. On the Kazakh KSC test set, pre-training using only Kazakh reduces the word error rate by 3.8%. Pre-training on multiple languages combined with a small amount of TTS-synthesized Kazakh speech achieves a word error rate of 8.6% on the KSC test set with only 10 h of labeled data, which is comparable to previous end-to-end models trained with 30 times more labeled data.

1. Introduction

Compared with traditional automatic speech recognition frameworks [1], which separate acoustic, pronunciation, and language modelling, sequence-based models [2,3,4,5,6] have shown remarkable performance in speech recognition tasks in recent years. They use neural networks to learn the speech-to-text mapping directly, avoiding complex modelling pipelines. The Transformer [7] is a widely used sequence-to-sequence model that has proven to be a powerful tool for building end-to-end speech recognition systems [8,9,10]. However, the end-to-end approach requires a large amount of annotated data to reach good performance, which poses a significant challenge [11] for low-resource languages that cannot supply enough labelled data for end-to-end modelling. Unpaired data are easier to collect than labelled data. It is therefore worth exploring how to use unpaired speech and text data to improve a low-resource speech recognition system under the constraint of limited annotated data.
Two main strategies have been proposed to make the most of unpaired data: unsupervised pre-training and semi-supervised learning. For unsupervised pre-training, the bidirectional encoder representations from Transformers (BERT) [12] and generative pre-training (GPT) [13] in natural language processing use large amounts of unlabeled data to learn general feature representations; the training objective depends only on the input features themselves. Fine-tuning then transfers the learned knowledge to downstream tasks, considerably speeding up model convergence. In semi-supervised learning [14,15,16], encoders are usually used to reconstruct large amounts of unpaired data to strengthen the features extracted from a small amount of paired data. In computer vision, unsupervised pre-training also shows broad application prospects for tasks such as capturing statistical regularities [17], analyzing learned biases [18], and object detection [19].
In the field of speech recognition, researchers have also proposed several unsupervised pre-training methods. Contrastive Predictive Coding (CPC) [20] combines autoregressive modelling and noise contrastive estimation with predictive coding to extract speech representations from high-dimensional data in an unsupervised manner by predicting future information. Wav2vec [21] applies CPC to speech recognition, trains on large amounts of unlabeled audio, and uses the resulting representations to improve the acoustic model, yielding a better feature extractor than hand-crafted designs. By incorporating a quantization module into the wav2vec model to discretize continuous acoustic features into a finite dictionary, vq-wav2vec [22] improves on the state of the art on the Wall Street Journal and TIMIT benchmarks by exploiting BERT pre-training. Wav2vec 2.0 [23] fuses BERT's masked sequence modelling with the discrete CPC approach into a single model that masks the speech input in the latent space and solves a contrastive task defined over a quantization of the jointly learned latent representations, showing its feasibility for low-resource speech recognition. Autoregressive predictive coding (APC) [24] learns general speech representations that can be transferred to different tasks on different datasets; it aims to preserve information for a wide range of downstream tasks and does not require any phone or word boundary labels, allowing the model to benefit from large amounts of unlabeled data. Jiang et al. [25] applied APC to speech recognition and effectively reduced the amount of downstream labeled data and the number of model parameters while improving recognition. In addition to CPC and APC, masked predictive coding (MPC) [26] uses a structure similar to the Masked LM (MLM) objective in BERT to pre-train Transformer-based models.
Low-resource languages account for a large proportion of the world's languages [11], yet most mature speech recognition systems are built for a few widely spoken languages. Although researchers have studied speech recognition under low-resource conditions [27,28,29], work on low-resource languages such as Kazakh and other Central and Western Asian languages is still at an early stage: resources are scarce in audio, text, pronunciation dictionaries, and phoneme inventories. Inspired by wav2vec 2.0 and MPC, this paper integrates factorized TDNN (TDNN-F) layers [30] into wav2vec 2.0 to reduce the loss of latent speech features when speech passes through the quantization module; the proposed method is therefore called wav2vec-F. Unsupervised pre-training for Kazakh speech recognition has not yet been well studied. We consider both single-language and multi-language pre-training and adopt the complementary use of ASR and TTS to support the low-resource Kazakh speech recognition task. We evaluate wav2vec-F on Librispeech and the Kazakh dataset KSC and compare the recognition results of replacing TDNN-F with other types of network layers. The experimental results show that combining wav2vec 2.0 with the factorized TDNN better preserves the relationship between time steps before and after quantization, retaining more speech features, and demonstrate the feasibility of the model for cross-language knowledge transfer.

2. Related Work

In this section, we briefly review the work most related to this article in three parts: BERT, CPC, and wav2vec 2.0.

2.1. BERT

BERT is a bidirectional language representation model proposed by Devlin et al. [12] that involves two steps: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all parameters are then fine-tuned using labeled data from the downstream task. A notable feature of BERT is its unified architecture across tasks: the difference between the pre-trained architecture and the final downstream architecture is slight. Figure 1 shows the two-stage training process of BERT.

2.2. CPC

Contrastive Predictive Coding (CPC) is a general unsupervised learning method proposed by van den Oord et al. [20]. It uses next-step prediction to learn representations of high-dimensional signals in an unsupervised manner. The model consists of two parts: a nonlinear encoder g_enc and an autoregressive model g_ar. Given an input speech signal x = (x_1, x_2, …, x_T), g_enc first encodes it into a latent embedding space with lower temporal resolution, f_t = g_enc(x_t); f_t is then fed to g_ar, which generates a context representation c_t = g_ar(f_{≤t}). Figure 2 shows the architecture of the Contrastive Predictive Coding model.
The CPC model is optimized by minimizing a loss based on noise contrastive estimation (NCE). At each time step t, given the context representation c_t and its K future embeddings {f_{t+k}}, 1 ≤ k ≤ K, the loss is defined as follows:

$$\mathcal{L}_t = -\frac{1}{K}\sum_{k=1}^{K}\log\left[\frac{\exp\left(f_{t+k}^{\top}\, h_k(c_t)\right)}{\sum_{\tilde{f}\in N_t}\exp\left(\tilde{f}^{\top}\, h_k(c_t)\right)}\right] \tag{1}$$

where N_t is a set of negative sample embeddings and h_k(·) is a step-specific transformation.
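To make the objective concrete, the following is a minimal PyTorch sketch of this NCE-based loss for a single time step; the tensor shapes, the list of predictors h_k, and the convention that the positive sample is included in the normalization are our own assumptions for illustration, not details taken from [20].

```python
import torch

def cpc_nce_loss(c_t, future_f, negatives, predictors):
    """NCE-based CPC loss for one time step t (a sketch, not the authors' code).

    c_t:        (D,) context vector from the autoregressive model g_ar.
    future_f:   (K, D) true future embeddings f_{t+1}, ..., f_{t+K} from g_enc.
    negatives:  (K, N, D) negative embeddings drawn from N_t for each step k.
    predictors: list of K linear layers, predictors[k] playing the role of h_k.
    """
    losses = []
    for k in range(future_f.size(0)):
        pred = predictors[k](c_t)                      # h_k(c_t), shape (D,)
        pos = torch.exp(future_f[k] @ pred)            # exp(f_{t+k}^T h_k(c_t))
        neg = torch.exp(negatives[k] @ pred).sum()     # sum over f~ in N_t
        losses.append(-torch.log(pos / (pos + neg)))   # positive included in the denominator
    return torch.stack(losses).mean()                  # average over the K prediction steps
```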

2.3. Wav2vec 2.0

The wav2vec 2.0 model architecture is shown in Figure 3a. It is a framework for self-supervised learning from raw audio. The raw audio is encoded by a multi-layer CNN; the resulting latent representations are then masked in a way similar to masked language modeling and fed to a Transformer network to generate contextual speech representations, and the model is trained with a contrastive task.

2.3.1. Quantized Representation

The model first learns discrete units and then the context representations. Product quantization is used to discretize the output of the feature encoder into a finite set of speech representations: quantized entries are selected from multiple codebooks and concatenated. Given G codebooks, each containing V entries e ∈ R^{V×d/G}, one entry is selected from each codebook, the resulting vectors e_1, …, e_G are concatenated, and a linear transformation R^d → R^f is applied to obtain q ∈ R^f. In the forward pass, selecting the codebook entry with the maximum score is a discrete operation that is not differentiable, so back-propagation cannot pass through it directly. To solve this problem, the Gumbel softmax [31] is adopted; the principle is shown in Figure 4, and the formula is:
$$p_{g,v} = \frac{\exp\left((l_{g,v} + n_v)/\tau\right)}{\sum_{k=1}^{V}\exp\left((l_{g,k} + n_k)/\tau\right)} \tag{2}$$
where n = −log(−log(u)) and u is sampled uniformly from [0, 1]. In the forward pass, the codeword i is selected by i = argmax_j p_{g,j}; in the backward pass, the true gradient of the Gumbel softmax output is used.
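The sketch below illustrates this quantization step in PyTorch, using the built-in Gumbel-softmax with a hard (straight-through) forward pass; the class name, intermediate tensor shapes, and output dimension are illustrative assumptions, while G, V, and the entry dimension follow the configuration given later in Section 4.2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelQuantizer(nn.Module):
    def __init__(self, in_dim, G=2, V=320, entry_dim=128, out_dim=256):
        super().__init__()
        self.G, self.V = G, V
        self.codebooks = nn.Parameter(torch.randn(G, V, entry_dim))   # G codebooks, V entries each
        self.to_logits = nn.Linear(in_dim, G * V)                     # produces the logits l_{g,v}
        self.proj = nn.Linear(G * entry_dim, out_dim)                 # linear map R^d -> R^f

    def forward(self, z, tau=2.0):
        # z: (B, T, in_dim) latent speech representations
        logits = self.to_logits(z).view(*z.shape[:2], self.G, self.V)
        # Gumbel softmax: hard one-hot selection in the forward pass,
        # soft gradient in the backward pass (straight-through estimator).
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
        q = torch.einsum('btgv,gvd->btgd', onehot, self.codebooks)    # pick one entry per codebook
        q = q.reshape(*z.shape[:2], -1)                               # concatenate e_1, ..., e_G
        return self.proj(q)                                           # quantized representation q
```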

2.3.2. Contrastive Training

The context representations C are used for contrastive learning, conditioned on the masked latent speech representations Z. The model must identify the true quantized latent speech representation for a masked time step within a set of distractor samples. Unlike autoregressive training, contrastive training requires the model to distinguish the representations at masked time steps from those at other time steps. This change from a regression task to a classification task makes self-supervised training more effective.

3. Methods

3.1. Model

The proposed model is shown in Figure 3b. It is composed of multiple convolutional neural network layers, factorized time-delay neural network (TDNN-F) layers, and Transformer layers.
The feature encoder consists of a convolutional neural network followed by a factorized time-delay neural network. It takes the raw audio X as input and generates the latent speech representations Z = z_1, …, z_T for T time steps. Before Z is fed to the Transformer, each time step is sampled with probability p as a mask starting position, and the M time steps that follow each sampled start are masked; in other words, every latent speech representation of an utterance is regarded as a candidate starting time step with probability p. The Transformer layers capture high-level content from Z, in a manner similar to [2], to produce the contextual representations C. At the same time, the quantization module discretizes Z into a finite set of speech representations using product quantization [32]. The network structure of the model is shown in Figure 5.
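For clarity, a minimal sketch of this masking procedure is given below; overlapping spans and boundary handling are simplified, and the replacement of masked positions by a learned embedding is only noted in a comment rather than implemented.

```python
import torch

def compute_mask_indices(batch, T, p=0.065, M=10):
    """Return a boolean mask of shape (batch, T).

    Each time step is chosen as a span start with probability p,
    and the M steps following each start are masked (spans may overlap).
    """
    mask = torch.zeros(batch, T, dtype=torch.bool)
    starts = torch.rand(batch, T) < p                    # candidate start positions
    for b in range(batch):
        for t in torch.nonzero(starts[b]).flatten().tolist():
            mask[b, t:t + M] = True                      # mask the M following steps
    return mask

# During pre-training, the masked positions of Z would be replaced by a shared
# learned embedding before Z is fed to the Transformer.
```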

3.2. Loss Function

The loss function L consists of two parts: the contrastive loss L_m and the codebook diversity loss L_d for the feature encoder:
$$L = L_m + \alpha L_d \tag{3}$$
where α is a tuned hyperparameter.
During pre-training, given the contextual output c_t at a masked time step t, the model must select the correct quantized representation q_t from a set of K + 1 candidates q̃ ∈ Q_t, which includes q_t and K negative samples. Negatives are sampled uniformly at random from other masked time steps of the same utterance. The contrastive loss is defined as:
$$L_m = -\log\frac{\exp\left(\mathrm{sim}(c_t, q_t)/\kappa\right)}{\sum_{\tilde{q}\in Q_t}\exp\left(\mathrm{sim}(c_t, \tilde{q})/\kappa\right)} \tag{4}$$
where sim(c, q) = c^⊤q/(‖c‖‖q‖) computes the cosine similarity between the context representation c_t and the quantized latent speech representation q_t, and κ is a temperature.
The contrastive task depends on positive and negative examples drawn from the codebook representations, while the diversity loss L_d is designed to encourage equal use of the codebook entries, based on the averaged softmax distribution p̄_g over the entries of each codebook:
$$L_d = \frac{1}{GV}\sum_{g=1}^{G}\sum_{v=1}^{V}\bar{p}_{g,v}\log \bar{p}_{g,v} \tag{5}$$
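The two terms of Equations (3)–(5) can be sketched in PyTorch as follows; the per-step formulation, the tensor shapes, and the small epsilon inside the logarithm are our own simplifications rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, negatives, kappa=0.1):
    """L_m for one masked step: c_t (D,), q_t (D,), negatives (K, D)."""
    candidates = torch.cat([q_t.unsqueeze(0), negatives], dim=0)       # q_t plus K distractors
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1)   # sim(c_t, q~)
    return F.cross_entropy(sims.unsqueeze(0) / kappa,
                           torch.zeros(1, dtype=torch.long))           # correct entry at index 0

def diversity_loss(probs):
    """L_d from the averaged codebook probabilities p_bar with shape (G, V)."""
    G, V = probs.shape
    return (probs * torch.log(probs + 1e-7)).sum() / (G * V)

def total_loss(c_t, q_t, negatives, probs, alpha=0.1):
    return contrastive_loss(c_t, q_t, negatives) + alpha * diversity_loss(probs)
```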

4. Experimental Setup

4.1. Datasets

In this paper, speech datasets in four languages, English, Chinese, Uyghur, and Kazakh, are used for all experiments. Table 1 presents the details of these datasets.
The English speech data come from the Librispeech corpus [33], which contains about 1000 h of speech that has been carefully segmented and aligned. This paper adopts the train-clean-100 subset, with about 100 h of speech, 251 speakers, and 28,541 utterances in total. For comparison with wav2vec 2.0, the full 960 h of Librispeech training audio are also used as unlabeled data to test performance.
The Chinese speech corpus is Primewords Chinese Corpus Set 1, a speech dataset established by Shanghai Yuan Language Information Technology Co., Ltd. It contains 100 h of Chinese speech with a transcription accuracy of over 98% at a confidence level of 95%, covering 296 speakers and 50,384 utterances in total.
The Uyghur speech data use the train-clean-100 subset of the 1000 h Uyghur speech dataset built in our laboratory, with 198 speakers and 58,333 utterances in total.
The Kazakh speech data come from KSC [34], which contains about 330 h of Kazakh speech. Speech subsets of different durations are randomly selected from it as fine-tuning data, and the standard validation and test splits are used. The text data for the speech synthesis system come from our laboratory's 40 h Kazakh speech dataset; 2000 labeled sentences are randomly selected and synthesized, yielding about 4 h of speech.

4.2. Pre-Training Configuration

The CNN encoder has 7 layers; each layer contains a temporal convolution, layer normalization, and a GELU activation function. The temporal convolutions have 512 channels, kernel widths of (10, 3, 3, 3, 3, 2, 2), and strides of (5, 2, 2, 2, 2, 2, 2), giving an overall stride of about 20 ms and a receptive field of 25 ms. The Factorized-TDNN module has 13 layers, composed of 1 TDNN layer, 8 TDNN-F layers, 3 Dense-ReLU layers, and 1 statistics pooling layer. Each TDNN-F layer contains 2 SOrthConv layers, 1 temporal convolution, batch normalization, and a ReLU activation. The architecture of the Factorized-TDNN is shown in Table 2.
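To show how a TDNN-F block factorizes a wide temporal convolution through a low-rank bottleneck, a simplified PyTorch sketch is given below; the dimensions follow Table 2, but the semi-orthogonal constraint of [30], the skip connections, and the exact splice contexts are only indicated in comments, so this is an illustrative sketch rather than the exact layer used in the paper.

```python
import torch
import torch.nn as nn

class TDNNFLayer(nn.Module):
    """One factorized TDNN block: 1024 -> 256 bottleneck -> 1024.

    The first (bottleneck) convolution stands in for the SOrthConv layers; in [30]
    its weight matrix is kept approximately semi-orthogonal during training, which
    is not enforced here. Skip connections from Table 2 are also omitted.
    """
    def __init__(self, dim=1024, bottleneck=256, context=3):
        super().__init__()
        self.factor1 = nn.Conv1d(dim, bottleneck, kernel_size=2,
                                 dilation=context)   # e.g. splice context {t-3, t}
        self.factor2 = nn.Conv1d(bottleneck, dim, kernel_size=2,
                                 dilation=context)   # e.g. splice context {t, t+3}
        self.bn = nn.BatchNorm1d(dim)
        self.relu = nn.ReLU()

    def forward(self, x):  # x: (B, dim, T)
        y = self.factor2(self.factor1(x))
        return self.relu(self.bn(y))
```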
The self-attention module consists of 12 Transformer layers with a model dimension of 768 and eight self-attention heads. For the mask operation, p is set to 0.065 and M to 10. The quantization module uses G = 2 codebooks, V = 320 entries per codebook, and an entry dimension of 128. The calculation inside the quantization module is shown in Figure 6. In Equation (2), l is a vector of dimension (2, 320), and τ controls the sharpness of the sampling distribution; it is annealed from 2 to 0.5 by a factor of 0.999995 at each update. The learning rate is set to 5 × 10−4 and optimized with Adam [35]; the learning rate warms up over the first 10% of updates, stays constant for the next 40%, and then decays linearly for the remainder. In the loss function (Equation (3)), α is set to 0.1. In the contrastive loss (Equation (4)), we use κ = 0.1 and K = 100 negatives. All experiments were run on one NVIDIA GeForce RTX 3090 graphics card with a batch size of 4, and pre-training was stopped after 100 epochs.
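The tri-stage learning-rate schedule described above (warm-up, hold, linear decay) could be sketched as follows; the peak value follows the stated 5 × 10−4, while the total number of updates is a placeholder.

```python
def learning_rate(step, total_steps, peak_lr=5e-4):
    """Tri-stage schedule: warm up for 10% of updates, hold for 40%, then decay linearly."""
    warmup = int(0.10 * total_steps)
    hold = int(0.40 * total_steps)
    if step < warmup:
        return peak_lr * step / max(1, warmup)           # linear warm-up
    if step < warmup + hold:
        return peak_lr                                   # constant phase
    remaining = total_steps - warmup - hold
    return peak_lr * max(0.0, (total_steps - step) / max(1, remaining))  # linear decay
```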

4.3. Modeling Unit

The pre-training data contain English (LS), Chinese (Ma), Uyghur (Uy), and Kazakh (KSC). Different modeling units are selected for different languages: Chinese uses a character-based writing system, so characters are used as its modeling units, while the modeling units for English and Uyghur are determined by the BPE algorithm [36]. See Table 3 for details.
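As an illustration, subword units could be produced with a BPE tool such as SentencePiece; the paper only specifies the BPE algorithm [36], so the toolkit choice, file names, and vocabulary size below are assumptions.

```python
import sentencepiece as spm

# Train a BPE model on the transcripts of one language (illustrative settings;
# the paper specifies BPE [36] but not a particular toolkit or vocabulary size).
spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",   # hypothetical path to the training text
    model_prefix="bpe_uyghur",
    vocab_size=1000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe_uyghur.model")
print(sp.encode("example transcript", out_type=str))  # subword units used as modeling units
```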

4.4. The TTS Configuration

The ESPnet-TTS toolkit [37] is used to build an end-to-end speech synthesis system based on Tacotron 2 [38], following the LJSpeech [39] configuration. The model input is a character sequence consisting of 42 Cyrillic letters and 1 symbol ("|"); the output is a sequence of 80-dimensional Mel filter-bank features. A Parallel WaveGAN [40] vocoder converts these acoustic features into time-domain waveform samples without any additional speech preprocessing such as filtering or normalization. In the Tacotron 2 system, the encoder is a bidirectional LSTM layer with 512 units (256 per direction), and the decoder is a stack of two unidirectional LSTM layers with 1024 units. Adam is used to optimize the parameters with an initial learning rate of 10−3 over 200 training epochs, and a dropout rate of 0.5 is applied for regularization.

4.5. Pre-Training and Fine-Tuning

First, wav2vec 2.0 and wav2vec-F are each pre-trained on the 960 h of Librispeech audio. After pre-training, each model is fine-tuned on labeled data, using the same splits as [33]. Next, the 330 h of KSC audio are used to pre-train both models, fine-tuning on the same splits as [34], and the results are compared with previous DNN-HMM, E2E-LSTM, and E2E-Transformer results. Finally, 100 h of audio from each of English, Chinese, and Uyghur are used to pre-train monolingual models; pairwise combinations are used to pre-train bilingual models; and the 100 h sets of all three languages are used together to pre-train a multilingual model. A total of 2000 sentences are randomly selected from our laboratory's 40 h Kazakh dataset and synthesized into speech with the Kazakh TTS model, and a multilingual model that includes the target language is obtained by pre-training on the three languages together with the synthesized Kazakh speech. Speech data of 10 min, 1 h, 5 h, 10 h, and 20 h are randomly selected from the KSC training set as fine-tuning data.

4.6. Decoding

After fine-tuning, a 4-gram language model is used for decoding; KenLM [41] is used to train the 4-gram model on the KSC LM corpus. During decoding, a beam-search decoder [42] is used with a beam size of 1500.
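The sketch below shows one way the trained 4-gram model could be combined with acoustic scores, using the kenlm Python bindings purely for illustration; the model path and interpolation weight are assumptions, and the actual system relies on the wav2letter++-style beam-search decoder [42] rather than this simple hypothesis rescoring.

```python
import kenlm

lm = kenlm.Model("ksc_4gram.arpa")   # hypothetical path to the trained 4-gram model

def rescore(hypotheses, lm_weight=0.5):
    """Combine acoustic and LM scores for a list of (text, acoustic_score) pairs."""
    rescored = []
    for text, am_score in hypotheses:
        lm_score = lm.score(text, bos=True, eos=True)   # log10 probability of the hypothesis
        rescored.append((text, am_score + lm_weight * lm_score))
    return sorted(rescored, key=lambda x: x[1], reverse=True)
```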

4.7. Supervised Model Comparison Experiment

The DNN-HMM model is built with the Kaldi framework using the "nnet3 + chain" setup, following the Wall Street Journal (WSJ) recipe. The acoustic model also adopts TDNN-F and is trained with the lattice-free maximum mutual information (LF-MMI) criterion. The input is MFCC features with cepstral mean and variance normalization, extracted every 10 ms over a 25 ms window, and a 3-gram language model built with SRILM is used for decoding.
The E2E models are built with the ESPnet framework, following the WSJ recipe. Two different encoder–decoder architectures, based on LSTM and Transformer, are trained with the CTC criterion. The input speech is represented by 80-dimensional filter-bank features with a frame shift of about 10 ms and a window of 25 ms. The LSTM-based encoder consists of three bidirectional LSTM layers with 1024 units per direction, the decoder is a unidirectional LSTM, and the initial learning rate is set to 1; this model is trained for 20 epochs with the Adadelta optimizer. The Transformer-based system consists of 12 encoder and 6 decoder blocks with 4 self-attention heads, 256-dimensional hidden states, and a feed-forward dimension of 2048; the dropout rate is 0.1 and the initial learning rate is 10, trained for 160 epochs with the Noam optimizer. A language model built from two RNN layers with 650 LSTM units, trained on the training-set transcriptions, is used for decoding.
All three models above use speed perturbation with factors of 0.9, 1.0, and 1.1, and SpecAugment is also used for data augmentation.
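For reference, a minimal sketch of these two augmentations is shown below; the mask widths and the use of torchaudio resampling for speed perturbation are our own choices, not the exact recipe settings of the compared systems.

```python
import torch
import torchaudio  # assumed available for resampling-based speed perturbation

def spec_augment(features, freq_mask=27, time_mask=100):
    """features: (T, F) log filter-bank features; applies one frequency and one time mask."""
    T, F_dim = features.shape
    f = torch.randint(0, freq_mask + 1, (1,)).item()
    f0 = torch.randint(0, max(1, F_dim - f), (1,)).item()
    features[:, f0:f0 + f] = 0.0                       # frequency mask
    t = torch.randint(0, time_mask + 1, (1,)).item()
    t0 = torch.randint(0, max(1, T - t), (1,)).item()
    features[t0:t0 + t, :] = 0.0                       # time mask
    return features

def speed_perturb(waveform, sample_rate, factor):
    """Resampling-based speed perturbation with factor in {0.9, 1.0, 1.1}."""
    if factor == 1.0:
        return waveform
    new_rate = int(sample_rate / factor)               # shorter/longer signal, played at sample_rate
    return torchaudio.functional.resample(waveform, sample_rate, new_rate)
```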

5. Results

In this paper, the wav2vec 2.0 architecture is used for pre-training as the baseline system, and the proposed model is pre-trained without supervision on different languages. By training single-language and multi-language models, we show that the proposed model can effectively learn cross-language speech representations in an unsupervised way, and we analyze the influence of language similarity on cross-language transfer.

5.1. Pre-Training for the Librispeech 960 h

The baseline and the proposed model are trained on all Librispeech training subsets, and fine-tuning is performed on the labeled 10 min, 1 h, 10 h, and 100 h splits, divided as in [33]. The evaluation results on test-clean are shown in Table 4. Wav2vec-F outperforms the baseline for every amount of labeled fine-tuning data, with an average word error rate reduction of 1.9% compared with wav2vec 2.0. Passing the audio through the Factorized-TDNN layers after the convolutional network and before quantization retains more contextual information.

5.2. Pre-Training for KSC 330 h

For pre-training using only Kazakh, the KSC training set is used as unlabeled raw audio. The model is then fine-tuned on the validation set and evaluated on both the validation and test sets. The results, compared with the three supervised models DNN-HMM, E2E-LSTM, and E2E-Transformer, are shown in Table 5. Without SpeedPerturb or SpecAugment, the WERs of wav2vec-F on the validation and test sets are 6.1% and 5.0%, respectively, which are 39% and 42.5% lower than those of the E2E-Transformer model and 4.7% and 3.8% lower than those of the baseline model.

5.3. Pre-Training for Multiple Languages

First, the baseline model (wav2vec 2.0) and wav2vec-F are each pre-trained on 100 h of audio in English, Chinese, and Uyghur, respectively. The models are then fine-tuned on 10 min, 1 h, 5 h, 10 h, and 20 h of labeled Kazakh data and evaluated on the KSC test set; the results are shown in Table 6. When pre-training on a single language, the word error rate obtained with Uyghur is clearly lower than with English or Chinese, and the results after mixed training on English + Uyghur or Chinese + Uyghur are all better than those after mixed training on English + Chinese. This suggests that Uyghur is a more suitable source language than English or Chinese for transferring knowledge to Kazakh speech recognition. Since Uyghur and Kazakh both belong to the Turkic branch of the Altaic family, we believe that cross-language knowledge transfer between languages of the same family can achieve better results. More importantly, whenever another language is added to the single-language data, the final word error rate decreases, suggesting that the model can learn universal phonological features.
When the three languages are mixed for pre-training, the word error rate matches that of supervised learning with the E2E-Transformer using only 20 h of labeled target-language audio. When TTS-synthesized Kazakh audio is added to the pre-training data, the word error rate drops further: with only 10 h of labeled target-language audio, the recognition accuracy is similar to that of the E2E-Transformer. This shows that data augmentation with speech synthesis can bring large gains to speech recognition when only unpaired data are available.

5.4. Contrast with Other Network Layers

To demonstrate the effectiveness of the proposed method, we replace the TDNN-F layers with TDNN layers [43], BiLSTM layers [44], DFSMN layers [45], and TDNN-LSTM layers [46]. Table 7 shows the results of fusing the different network layers with wav2vec 2.0. In these experiments, the audio of Librispeech's train-clean-100 subset is used for pre-training, and 10 h of labeled Kazakh audio is used for fine-tuning, with evaluation on the KSC test set. Wav2vec-F achieves the best recognition result while adding close to the fewest additional parameters.

6. Conclusions

This paper proposes wav2vec-F for unsupervised speech pre-training, which uses unpaired speech audio and text for speech recognition, learns latent speech representations from the waveforms of unlabeled audio, and applies them to cross-language ASR tasks. On the Librispeech benchmark, wav2vec-F outperforms wav2vec 2.0; on the KSC benchmark, it outperforms wav2vec 2.0 and previous supervised methods. The experimental results also show that multi-language pre-training is more effective than single-language pre-training, and that it is worthwhile for low-resource languages to use other accessible high-resource languages for knowledge transfer. In pre-training, better results can be obtained with a language close to the target language. Compared with supervised training, the proposed method can exploit audio data unrelated to the target language: given the same amount of mixed data from other languages, the recognition result is similar to that of supervised learning when only 10 h of target-language data are used for fine-tuning. Furthermore, recognition is best when only target-language data are used for pre-training. In future work, we will continue to explore how training with non-target-language data can match or even surpass training with only target-language data.

Author Contributions

Writing—original draft, W.M.; writing—review and editing, N.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China—Research on Key Technologies of Speech Recognition of Chinese and Western Asian Languages under Resource Constraints (Grant No. 62066043), and the National Language Commission key Project—Research on Speech Keyword Search Technology of Chinese and Western Asian Languages (Grant No. ZDI135-133).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Big Island, HI, USA, 11–15 December 2011. [Google Scholar]
  2. Mohamed, A.; Okhonko, D.; Zettlemoyer, L. Transformers with convolutional context for ASR. arXiv 2019, arXiv:1904.11660. [Google Scholar]
  3. Chiu, C.C.; Sainath, T.N.; Wu, Y.; Prabhavalkar, R.; Nguyen, P.; Chen, Z.; Kannan, A.; Weiss, R.J.; Rao, K.; Gonina, E.; et al. State-of-the-art speech recognition with sequence-to-sequence models. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4774–4778. [Google Scholar]
  4. Bahdanau, D.; Chorowski, J.; Serdyuk, D.; Brakel, P.; Bengio, Y. End-to-end attention-based large vocabulary speech recognition. In Proceedings of the 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 4945–4949. [Google Scholar]
  5. Chan, W.; Jaitly, N.; Le, Q.V.; Vinyals, O. Listen, attend and spell. arXiv 2015, arXiv:1508.01211. [Google Scholar]
  6. Chorowski, J.K.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, Proceedings of the 29th Annual Conference on Neural Information Processing Systems, NIPS 2015, Montreal, Canada, 7–12 December 2015; NeurIPS: La Jolla, CA, USA, 2015; Volume 28. [Google Scholar]
  7. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; NeurIPS: La Jolla, CA, USA, 2017; Volume 30. [Google Scholar]
  8. Karita, S.; Chen, N.; Hayashi, T.; Hori, T.; Inaguma, H.; Jiang, Z.; Someki, M.; Soplin, N.E.Y.; Yamamoto, R.; Wang, X.; et al. A comparative study on transformer vs rnn in speech applications. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 449–456. [Google Scholar]
  9. Nakatani, T. Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. In Proceedings of the Proc. Interspeech, Graz, Austria, 15–19 September 2019. [Google Scholar]
  10. Dong, L.; Xu, S.; Xu, B. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5884–5888. [Google Scholar]
  11. Lewis, M.P.; Simon, G.; Fennig, C.D. Ethnologue: Languages of the World, 19th ed. 2016. Available online: http://www.ethnologue.com (accessed on 1 December 2022).
  12. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  13. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 3 December 2022).
  14. Ling, S.; Liu, Y.; Salazar, J.; Kirchhoff, K. Deep contextualized acoustic representations for semi-supervised speech recognition. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6429–6433. [Google Scholar]
  15. Karita, S.; Watanabe, S.; Iwata, T.; Delcroix, M.; Ogawa, A.; Nakatani, T. Semi-supervised end-to-end speech recognition using text-to-speech and autoencoders. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6166–6170. [Google Scholar]
  16. Li, B.; Sainath, T.N.; Pang, R.; Wu, Z. Semi-supervised training for end-to-end models via weak distillation. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 2837–2841. [Google Scholar]
  17. Caron, M.; Bojanowski, P.; Mairal, J.; Joulin, A. Unsupervised pre-training of image features on non-curated data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 2959–2968. [Google Scholar]
  18. Steed, R.; Caliskan, A. Image representations learned with unsupervised pre-training contain human-like biases. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual, 3–10 March 2021; pp. 701–713. [Google Scholar]
  19. Dai, Z.; Cai, B.; Lin, Y.; Chen, J. Up-detr: Unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1601–1610. [Google Scholar]
  20. Oord, A.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  21. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised pre-training for speech recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019. [Google Scholar] [CrossRef] [Green Version]
  22. Baevski, A.; Schneider, S.; Auli, M. vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv 2019, arXiv:1910.05453. [Google Scholar]
  23. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
  24. Chung, Y.-A.; Hsu, W.-N.; Tang, H.; Glass, J. An unsupervised autoregressive model for speech representation learning. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019. [Google Scholar] [CrossRef] [Green Version]
  25. Jiang, D.; Li, W.; Zhang, R.; Cao, M.; Luo, N.; Han, Y.; Zou, W.; Han, K.; Li, X. A further study of unsupervised pretraining for transformer based speech recognition. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6538–6542. [Google Scholar]
  26. Jiang, D.; Lei, X.; Li, W.; Luo, N.; Hu, Y.; Zou, W.; Li, X. Improving transformer-based speech recognition using unsupervised pre-training. arXiv 2019, arXiv:1910.09932. [Google Scholar]
  27. Bansal, S.; Kamper, H.; Livescu, K.; Lopez, A.; Goldwater, S. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. arXiv 2018, arXiv:1809.01431. [Google Scholar]
  28. Hsu, J.Y.; Chen, Y.J.; Lee, H. Meta learning for end-to-end low-resource speech recognition. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7844–7848. [Google Scholar]
  29. Yi, C.; Zhou, S.; Xu, B. Efficiently fusing pretrained acoustic and linguistic encoders for low-resource speech recognition. IEEE Signal Process. Lett. 2021, 28, 788–792. [Google Scholar] [CrossRef]
  30. Povey, D.; Cheng, G.; Wang, Y.; Li, K.; Xu, H.; Yarmohammadi, M.; Khudanpur, S. Semi-orthogonal low-rank matrix factorization for deep neural networks. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 3743–3747. [Google Scholar]
  31. Jang, E.; Gu, S.; Poole, B. Categorical reparameterization with gumbel-softmax. arXiv 2016, arXiv:1611.01144. [Google Scholar]
  32. Jegou, H.; Douze, M.; Schmid, C. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 117–128. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  33. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An asr corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar]
  34. Khassanov, Y.; Mussakhojayeva, S.; Mirzakhmetov, A.; Adiyev, A.; Nurpeiissov, M.; Varol, H. A crowdsourced open-source Kazakh speech corpus and initial speech recognition baseline. arXiv 2020, arXiv:2009.10334. [Google Scholar]
  35. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  36. Zhou, S.; Xu, S.; Xu, B. Multilingual end-to-end speech recognition with a single transformer on low-resource languages. arXiv 2018, arXiv:1806.05059. [Google Scholar]
  37. Hayashi, T.; Yamamoto, R.; Inoue, K.; Yoshimura, T.; Watanabe, S.; Toda, T.; Takeda, K.; Zhang, Y.; Tan, X. ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7654–7658. [Google Scholar]
  38. Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerrv-Ryan, R.; et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4779–4783. [Google Scholar]
  39. Ito, K.; Johnson, L. The LJ Speech Dataset. 2017. Available online: https://keithito.com/LJ-Speech-Dataset/ (accessed on 1 December 2022).
  40. Yamamoto, R.; Song, E.; Kim, J.M. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6199–6203. [Google Scholar]
  41. Heafield, K. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, Scotland, UK, 30–31 July 2011; pp. 187–197. [Google Scholar]
  42. Pratap, V.; Hannun, A.; Xu, Q.; Cai, J.; Kahn, J.; Synnaeve, G.; Liptchinsky, V.; Collobert, R. Wav2letter++: A fast open-source speech recognition system. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6460–6464. [Google Scholar]
  43. Jin, C.; He, B.; Hui, K.; Sun, L. TDNN: A two-stage deep neural network for prompt-independent automated essay scoring. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 1088–1097. [Google Scholar]
  44. Xiong, W.; Droppo, J.; Huang, X.; Seide, F.; Seltzer, M.; Stolcke, A.; Yu, D.; Zweig, G. Achieving human parity in conversational speech recognition. arXiv 2016, arXiv:1610.05256. [Google Scholar]
  45. Zhang, S.; Lei, M.; Yan, Z.; Dai, L. Deep-FSMN for large vocabulary continuous speech recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5869–5873. [Google Scholar]
  46. Peddinti, V.; Wang, Y.; Povey, D.; Khudanpur, S. Low latency acoustic modeling using temporal convolution and LSTMs. IEEE Signal Process. Lett. 2017, 25, 373–377. [Google Scholar] [CrossRef]
Figure 1. Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned.
Figure 2. Overview of Contrastive Predictive Coding.
Figure 3. (a) wav2vec 2.0; (b) wav2vec-F.
Figure 4. Gumbel softmax.
Figure 5. Illustration of the overall network structure of our proposed model.
Figure 6. Calculation flow inside the quantization module.
Table 1. Specifications of the LibriSpeech, Primewords Chinese Corpus Set 1, Uyghur (our), and KSC corpus datasets.
| Dataset | Train (h) | Dev (h) | Test (h) | Total (h) | Speakers |
|---|---|---|---|---|---|
| LibriSpeech | 960.9 | 10.7 | 10.5 | 982.1 | 2484 |
| LibriSpeech (train-clean-100) | 100.6 | (other) 5.3/(clean) 5.4 | (other) 5.1/(clean) 5.4 | 121.8 | 251 |
| Uyghur (our) | 960.2 | 10.3 | 10.1 | 980.6 | 3179 |
| Uyghur (train-clean-100) | 93.9 | 5.2 | 5.2 | 104.3 | 198 |
| KSC | 318.4 | 7.1 | 7.1 | 332.6 | - |
| Primewords Chinese Corpus Set 1 | 96.9 | 1.1 | 0.9 | 98.9 | 296 |
Table 2. Architecture of Factorized-TDNN.
| Layer | Layer Type | Context Factor 1 | Context Factor 2 | Skip Conn. from Layer | Size | Inner Size |
|---|---|---|---|---|---|---|
| 1 | TDNN-ReLU | t−2, t+2 | | | 512 | |
| 2 | F-TDNN-ReLU | t−2, t | t, t+2 | | 1024 | 256 |
| 3 | F-TDNN-ReLU | t | t | | 1024 | 256 |
| 4 | F-TDNN-ReLU | t−3, t | t, t+3 | | 1024 | 256 |
| 5 | F-TDNN-ReLU | t | t | 3 | 1024 | 256 |
| 6 | F-TDNN-ReLU | t−3, t | t, t+3 | | 1024 | 256 |
| 7 | F-TDNN-ReLU | t−3, t | t, t+3 | 2, 4 | 1024 | 256 |
| 8 | F-TDNN-ReLU | t−3, t | t, t+3 | | 1024 | 256 |
| 9 | F-TDNN-ReLU | t | t | 4, 6, 8 | 1024 | 256 |
| 10 | Dense-ReLU | t | t | | 2048 | |
| 11 | Pooling (mean + stddev) | full-seq | | | 2 × 2048 | |
| 13 | Dense-ReLU | | | | 512 | |
Table 3. Specification of data sets for each language.
| Data | Training Utts | Validating Utts | Units |
|---|---|---|---|
| LS 960 h | 253,117 | 28,124 | subword |
| KSC 330 h | 132,513 | 14,723 | subword |
| LS 100 h | 25,687 | 2854 | subword |
| Ma 100 h | 45,346 | 5038 | character |
| Uy 100 h | 52,487 | 5846 | subword |
Table 4. The results (WER) on the Librispeech test set when training on the low-resource labeled data setups of 10 min, 1 h, 10 h, and the clean 100 h subset of Librispeech.
| Model | Unlabeled Data | 10 min | 1 h | 10 h | 100 h |
|---|---|---|---|---|---|
| Baseline | LS 960 h | 30.8 | 27.1 | 18.4 | 9.6 |
| wav2vec-F | LS 960 h | 30.2 | 26.7 | 18.2 | 9.3 |
Table 5. The performance (WER) of different methods on the KSC valid set and test set when pre-training on the train set and fine-tuning on the valid set.
| Model | Unlabeled Data | LM | SpeedPerturb | SpecAugment | Valid | Test |
|---|---|---|---|---|---|---|
| DNN-HMM | - | Yes | Yes | Yes | 14.9 | 13.8 |
| E2E-LSTM | - | Yes | Yes | Yes | 13.1 | 11.7 |
| E2E-Transformer | - | Yes | Yes | Yes | 10.0 | 8.7 |
| Baseline | KSC 330 h | Yes | No | No | 6.4 | 5.2 |
| wav2vec-F | KSC 330 h | Yes | No | No | 6.1 | 5.0 |
Table 6. Results of fine-tuning Kazakh data with different time settings when pre-training with different non-target language data as well as mixed data.
| Model | Unlabeled Data | 10 min | 1 h | 5 h | 10 h | 20 h |
|---|---|---|---|---|---|---|
| Baseline | LS 100 h | 87.7 | 56.1 | 37.2 | 23.6 | 17.1 |
| Baseline | Ma 100 h | 77.4 | 39.3 | 29.3 | 20.3 | 11.5 |
| Baseline | Uy 100 h | 70.0 | 35.1 | 25.2 | 15.9 | 10.8 |
| wav2vec-F | LS 100 h | 87.5 | 55.6 | 34.5 | 22.1 | 16.7 |
| wav2vec-F | Ma 100 h | 77.4 | 38.0 | 24.2 | 15.3 | 10.5 |
| wav2vec-F | Uy 100 h | 68.6 | 34.2 | 23.5 | 14.7 | 10.6 |
| wav2vec-F | LS 100 h + Ma 100 h | 61.0 | 25.7 | 23.0 | 15.6 | 11.4 |
| wav2vec-F | LS 100 h + Uy 100 h | 58.2 | 27.9 | 21.7 | 13.3 | 8.9 |
| wav2vec-F | Ma 100 h + Uy 100 h | 57.4 | 25.6 | 20.2 | 14.5 | 9.0 |
| wav2vec-F | LS 100 h + Ma 100 h + Uy 100 h | 48.3 | 25.4 | 13.0 | 10.2 | 8.5 |
| wav2vec-F | LS 100 h + Ma 100 h + Uy 100 h + TTS 4 h | 38.9 | 19.2 | 10.8 | 8.6 | 6.7 |
Table 7. Results of fusing different types of network layers with wav2vec 2.0; "+" indicates the number of added parameters.
| Fused Network Layer | None (Baseline) | TDNN | BiLSTM | DFSMN | TDNN-LSTM | TDNN-F |
|---|---|---|---|---|---|---|
| WER (%) | 23.6 | 24.3 | 23.2 | 22.4 | 22.9 | 22.1 |
| Params (M) | - | +19 | +41 | +27 | +40 | +20 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Meng, W.; Yolwas, N. A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training. Sensors 2023, 23, 870. https://doi.org/10.3390/s23020870

