Transfer Learning Based Free-Form Speech Command Classification for Low-Resource Languages

Current state-of-the-art speech-based user interfaces use data-intensive methodologies to recognize free-form speech commands. However, this is not viable for low-resource languages, which lack speech data. This restricts the usability of such interfaces to a limited number of languages. In this paper, we propose a methodology to develop a robust domain-specific speech command classification system for low-resource languages using speech data of a high-resource language. In this transfer learning-based approach, we used a Convolutional Neural Network (CNN) to identify a fixed set of intents using an ASR-based character probability map. We achieved overall accuracies of 93.16% and 76.30% on Sinhala and Tamil datasets, respectively, using an English-based ASR, which attests to the robustness of the proposed approach.


Introduction
Speech command-based user interfaces are becoming popular since they are more natural for end-users to interact with. Google Assistant and Amazon Alexa are examples of such commercial services, with applications ranging from smartphones to home automation. These are capable of identifying the intent of free-form speech commands given by the user. To enable this kind of service, Automatic Speech Recognition (ASR) systems and Natural Language Understanding (NLU) systems work together with a very high level of accuracy (Ram et al., 2018).
If the ASR or NLU components produce suboptimal results, this directly affects the final output (Yaman et al., 2008; Rao et al., 2018). Hence, to get good results in ASR systems, it is common to use very large speech corpora (Hannun et al., 2014; Amodei et al., 2016; Chiu et al., 2018). However, low-resource languages (LRLs) do not have this luxury. Here, languages that have a limited presence on the Internet and those that lack electronic resources for speech and/or language processing are referred to as LRLs (Besacier et al., 2014). For this reason, despite their applicability, speech-based user interfaces are limited to common languages. For LRLs, researchers have focused on narrower scopes such as recognition of digits or keywords (Manamperi et al., 2018; Chen et al., 2015). However, free-form commands are difficult to manage in this way since there can be overlaps between commands. Buddhika et al. (2018) and Chen et al. (2018) present approaches that classify speech directly into intents. In particular, Buddhika et al. (2018) have given some attention to the low-resource setting. Additionally, transfer learning is used to address the issue of limited data in some ASR-based research (Huang et al., 2013; Kunze et al., 2017).
In this paper, we present an improved and effective methodology to classify domain-specific free-form speech commands that combines these direct classification and transfer learning approaches. Here, we use a character probability map from an ASR model trained on English to identify intents. The performance of this methodology is evaluated using Sinhala (Buddhika et al., 2018) and newly collected Tamil datasets. The proposed approach can reach a reasonable accuracy using limited training data.
The rest of the paper is organized as follows. Section 2 presents related work, and Section 3 describes the methodology used. Sections 4 and 5 provide details of the datasets and experiments. Section 6 presents a detailed analysis of the obtained results. Finally, Section 7 concludes the paper.

Related Work
Most previous research has used separate ASR and NLU components to classify speech intents. In this approach, transcripts generated by the ASR module are fed as input to a separate text classifier (Yaman et al., 2008; Rao et al., 2018).
Here, an erroneous transcript from the ASR module can affect the final results of this cascaded system (Yaman et al., 2008; Rao et al., 2018). In this approach, two separately trained subsystems are connected to work jointly. As a solution for these issues, Yaman et al. (2008) proposed a joint optimization technique and the use of the n-best list of the ASR output. Later, He and Deng (2013) extended this work by developing a generalized framework. However, these systems require a large amount of speech data, the corresponding transcripts, and their class labels. Further, the ASR component used in these systems requires language models and phoneme dictionaries to function, which are difficult to find for low-resource languages.
This cascading approach is effective when there is a highly accurate ASR in the target language. Rao et al. (2018) present such a system for navigating an entertainment platform in English. Here, they used a separate ASR system to convert speech into text. More importantly, they highlight that lower ASR performance affects the entire system.
More recently, researchers have presented approaches that aim to go beyond cascading ASR components. In this way, they have tried to eliminate the use of intermediate text representations and have used automatically generated acoustic-level features for classification. Liu et al. (2017) proposed topic identification in speech without the need for manual transcriptions and phoneme dictionaries. Here, the input features are bottleneck features extracted from a conventional ASR system trained with transcribed multilingual data. These features are then classified through CNN and SVM classifiers. Additionally, Lee et al. (2015) have highlighted the effectiveness of this kind of bottleneck feature when comparing different speech queries. Chen et al. (2018) and Buddhika et al. (2018) present two different direct classification approaches to determine the intent of a given spoken utterance. Chen et al. (2018) used a neural network-based acoustic model and a CNN-based classifier. However, this requires transcripts of the speech data to train the acoustic model; thus, accuracy depends on the availability of a large amount of speech data. One advantage of this approach is that the final model can be optimized once the two models are combined. Buddhika et al. (2018) classified speech directly using the MFCCs (Mel-frequency Cepstral Coefficients) of the speech signals as features. In this approach, they used only 10 hours of speech data to achieve reasonable accuracy.

Methodology
In Section 2, we showed that the research work of Liu et al. (2017), Chen et al. (2018), and Buddhika et al. (2018) has benefited from the direct speech classification approach. Additionally, as shown in the work of Lee et al. (2015) and Liu et al. (2017), it is beneficial to use automatically discovered acoustic features. Therefore, our key idea is to reuse an ASR neural network that is well trained on a high-resource language as a feature transformation module. This is known as transfer learning (Pan and Yang, 2010). Here, we try to reuse the knowledge learned from one task in another associated task. Current well-trained neural network-based end-to-end ASR models are capable of converting a given spoken utterance into the corresponding character sequence. Therefore, these ASR models can convert speech into some character representation. Our approach is to reuse this ability in low-resource speech classification.
We used the DeepSpeech (DS) model (Hannun et al., 2014) as the ASR model. The DS model consists of 5 hidden layers, including a bidirectional recurrent layer. The input to the model is a time series of audio features, one set per time slice. MFCC coefficients are used as features. The model converts this input sequence $x^{(i)}$ into a sequence of character probabilities $y^{(i)}$, with $\hat{y}_t = P(c_t \mid x)$, where $c_t \in \{a, b, c, \ldots, z, \text{space}, \text{apostrophe}, \text{blank}\}$ in the English model. These probability values are calculated by a softmax layer. Finally, the corresponding transcript is generated from the probabilities via beam search decoding, with or without a language model.
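To make the shape of this output concrete, the following is a minimal sketch and not the DS implementation. It assumes a CTC-style output layer (which the blank symbol suggests), replaces beam search with the simpler greedy best-path decoding, and uses made-up logits; `one_hot_logits` is a hypothetical helper for building toy inputs.

```python
import math

# Alphabet of the English model as listed in the text: a-z, space, apostrophe, blank.
ALPHABET = [chr(c) for c in range(ord("a"), ord("z") + 1)] + [" ", "'", "<blank>"]
BLANK = len(ALPHABET) - 1


def softmax(logits):
    """Turn one time step of raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(prob_map):
    """Best-path decoding (a simplification of beam search):
    take the argmax per step, merge repeated symbols, drop blanks."""
    out, prev = [], None
    for row in prob_map:
        idx = max(range(len(row)), key=row.__getitem__)
        if idx != prev and idx != BLANK:
            out.append(ALPHABET[idx])
        prev = idx
    return "".join(out)


def one_hot_logits(index, strength=5.0):
    """Hypothetical helper: logits that strongly favour one symbol."""
    logits = [0.0] * len(ALPHABET)
    logits[index] = strength
    return logits


# Toy utterance: frames favouring h, h, blank, i (invented for illustration).
frames = [ALPHABET.index("h"), ALPHABET.index("h"), BLANK, ALPHABET.index("i")]
prob_map = [softmax(one_hot_logits(i)) for i in frames]

print(len(prob_map), len(prob_map[0]))  # time steps x alphabet size: 4 29
print(greedy_decode(prob_map))          # hi
```

Each row of `prob_map` sums to one, so the whole map is a (time steps x 29) matrix of normalized character probabilities.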
Here, we selected the intermediate probability values as the transfer learning features from the model. Any feature generated after this layer is ineffective since it is affected by the beam search, which outputs only the best possible character sequence. Before the final softmax layer, there is a bidirectional recurrent layer, which is critical for detecting sequence features in speech. Without this layer, the model is useless (Hannun et al., 2014; Amodei et al., 2016). Hence, the only feasible place to extract features is after the softmax layer. Additionally, this layer provides normalized probability values for each time step. Figure 1 shows a visualization of this intermediate character probability map for a Sinhala speech query containing 'ශේෂය කීයද - śēṣaya kīyada'.
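Since utterances differ in length, the extracted maps have a varying number of time steps, while a classifier typically expects a fixed input size. The sketch below shows one way this could be handled. The padding scheme (rows with all mass on the blank symbol), the function name `to_fixed_length`, and the choice of `max_steps` are assumptions for illustration and are not taken from the paper.

```python
NUM_SYMBOLS = 29  # a-z (26) + space + apostrophe + blank
BLANK = NUM_SYMBOLS - 1


def to_fixed_length(prob_map, max_steps):
    """Truncate or pad a (T x 29) probability map to (max_steps x 29)."""
    # Assumption: padding rows place all probability mass on the blank symbol,
    # so every row of the result is still a valid distribution.
    pad_row = [0.0] * NUM_SYMBOLS
    pad_row[BLANK] = 1.0
    clipped = [list(row) for row in prob_map[:max_steps]]
    clipped.extend(list(pad_row) for _ in range(max_steps - len(clipped)))
    return clipped


# Two toy maps with uniform probabilities: one shorter, one longer than max_steps.
short_map = [[1.0 / NUM_SYMBOLS] * NUM_SYMBOLS for _ in range(3)]
long_map = [[1.0 / NUM_SYMBOLS] * NUM_SYMBOLS for _ in range(8)]

features = [to_fixed_length(m, max_steps=5) for m in (short_map, long_map)]
print([(len(f), len(f[0])) for f in features])  # [(5, 29), (5, 29)]
```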

Datasets
We used two different free-form speech command datasets to measure the accuracy of the proposed methodology. The first one is a Sinhala dataset and contains audio clips in the banking domain (Buddhika et al., 2018). Since it was difficult to find other such datasets for low-resource languages, we created another dataset in the Tamil language. The original Sinhala dataset contained 10 hours of speech data from 152 male and 63 female students aged between 20 and 25 years. We had to revalidate the dataset since it included some misclassified, overly long, and erroneous speech queries. The final dataset contained 7624 samples totaling 7.5 hours. The Tamil dataset contains 0.5 hours of speech data from 40 male and female students in the same age group. There were 400 samples in the Tamil dataset. The length of each audio clip is less than 7 seconds.
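As a quick sanity check on these figures, the mean clip duration can be worked out from the stated totals. The totals are presumably rounded, so the results are approximate; the helper name `mean_clip_seconds` is ours.

```python
def mean_clip_seconds(total_hours, num_samples):
    """Average clip length in seconds from total duration and sample count."""
    return total_hours * 3600 / num_samples


sinhala = mean_clip_seconds(7.5, 7624)  # validated Sinhala dataset
tamil = mean_clip_seconds(0.5, 400)     # Tamil dataset

print(f"Sinhala: {sinhala:.2f} s, Tamil: {tamil:.2f} s")
# Sinhala: 3.54 s, Tamil: 4.50 s
```

Both averages are consistent with the stated upper bound of 7 seconds per clip.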

Experiments
For the transfer learning task, we considered the DeepSpeech (DS) model (Hannun et al., 2014). Given the DS English model, we extracted the intermediate probability features for a given speech sample and then fed them into the classifier. Further, we employed a Bayesian optimization-based algorithm for hyperparameter tuning (Bergstra et al., 2013). Since the datasets are small, we used 5-fold cross-validation to evaluate the accuracy.
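Section 6 compares 1D and 2D CNN classifiers on these features. The sketch below illustrates the usual difference between the two when applied to a (time steps x symbols) map: a 1D kernel spans all symbol columns and slides only along time, while a 2D kernel is a small patch that slides along both axes. The kernel sizes and the all-ones inputs are assumptions for illustration; the actual layer configuration is not given in this section. As in most CNN libraries, the operation shown is cross-correlation without kernel flipping.

```python
def conv1d_valid(feature_map, kernel):
    """1D convolution over time: the kernel covers every symbol column."""
    k = len(kernel)
    steps = len(feature_map) - k + 1
    return [
        sum(
            kernel[i][j] * feature_map[t + i][j]
            for i in range(k)
            for j in range(len(kernel[0]))
        )
        for t in range(steps)
    ]


def conv2d_valid(feature_map, kernel):
    """2D convolution: a small kernel slides over both time and symbol axes."""
    kh, kw = len(kernel), len(kernel[0])
    rows = len(feature_map) - kh + 1
    cols = len(feature_map[0]) - kw + 1
    return [
        [
            sum(
                kernel[i][j] * feature_map[r + i][c + j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Toy input: 10 time steps x 29 symbols, all ones (placeholder values).
feature_map = [[1.0] * 29 for _ in range(10)]
kernel_1d = [[1.0] * 29 for _ in range(3)]  # 3 time steps x all 29 symbols
kernel_2d = [[1.0] * 3 for _ in range(3)]   # 3 x 3 patch

out_1d = conv1d_valid(feature_map, kernel_1d)
out_2d = conv2d_valid(feature_map, kernel_2d)

print(len(out_1d))                  # 8 outputs, one per time position
print(len(out_2d), len(out_2d[0]))  # 8 27 outputs over time and symbol positions
```

The 2D kernel here has 9 weights against 87 for the 1D kernel, which is one possible intuition for why the two behave differently at different data sizes, although the paper does not state this.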
We selected the method presented in Buddhika et al. (2018) as our benchmark. In their work, they used the first 13 MFCC features as input for SVM and FFN classifiers. Since we had to revalidate the Sinhala dataset, we reevaluated the accuracy values on the validated dataset using 5-fold cross-validation. Additionally, we performed the same experiments on the newly collected Tamil dataset to examine the language independence of the proposed method. Table 2 summarizes the outcomes of these different approaches. In all experiments, the class distribution among all data splits was nearly equal.
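A split with nearly equal class distribution across folds is commonly obtained by stratification. The following is a minimal sketch of stratified 5-fold assignment using only the standard library. The function name, the seed, and the toy label counts are assumptions and do not reflect the actual intent distribution of the datasets.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share per class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Toy labels for three intent types (counts invented for illustration).
labels = [1] * 50 + [2] * 30 + [3] * 20
folds = stratified_folds(labels, k=5)

for held_out in folds:
    train = [i for fold in folds if fold is not held_out for i in fold]
    # Each held-out fold mirrors the overall class proportions.
    print(len(train), dict(Counter(labels[i] for i in held_out)))
```

In each round, one fold is held out for evaluation and the remaining four are used for training; the reported accuracy would be the average over the five rounds.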
In this work, we are concerned about the amount of available data. Hence, we evaluated how the accuracy of the best performing approaches changes with the number of training samples. We performed this on the Sinhala dataset since it has more than 4000 data samples. For each sample size, we drew 20 random samples of that size and performed 5-fold cross-validation on each. Figure 3 summarizes the experiment results.
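The repeated-subsampling procedure can be sketched as follows. This is an outline under stated assumptions: `evaluate` stands in for a full 5-fold cross-validation run on the chosen subset, and `dummy_evaluate` is a hypothetical placeholder that returns an arbitrary number rather than a real accuracy.

```python
import random
import statistics


def subsample_scores(num_available, sample_size, evaluate, repeats=20, seed=0):
    """Draw `repeats` random subsets of `sample_size` indices and score each."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        subset = rng.sample(range(num_available), sample_size)  # without replacement
        scores.append(evaluate(subset))
    return statistics.mean(scores), statistics.stdev(scores)


def dummy_evaluate(subset):
    """Hypothetical stand-in for cross-validated accuracy on `subset`."""
    # Placeholder only: fraction of even indices, not a real accuracy.
    return sum(1 for i in subset if i % 2 == 0) / len(subset)


# 7624 is the size of the validated Sinhala dataset; 500 is one sample size.
mean_score, spread = subsample_scores(7624, 500, dummy_evaluate)
print(f"mean={mean_score:.3f}, stdev={spread:.3f}")
```

Repeating this for a range of sample sizes gives the mean and spread of the accuracy at each size.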
In another experiment, we examined the end-to-end text output of the DS English model for a given Sinhala speech query. Table 4 presents some of these outputs.

Result and Discussion
We were able to achieve 93.16% and 76.30% overall accuracy for the Sinhala and Tamil datasets, respectively, using 5-fold cross-validation. Table 2 provides a comparison of the previous approaches and ours. It clearly shows that the proposed method is more viable than the previous direct speech feature classification approach. One possible reason is the reduction of noise in the speech signals: the DS model is capable of removing this noise since it is already trained on noisy data. Another reason is the reduction of the feature space. Additionally, in this way, we can obtain more accurate results using a small dataset. Table 3 shows the averaged precision, recall, and F1-score values for each intent class in the two datasets. In the Sinhala dataset, all classes achieve an F1-score above 0.9, except for the type 4 intent. The type 1 intent shows the highest F1-score of all, probably because of the higher number of data samples available for this class. Despite that, the type 6 intent also reports a 0.93 F1-score even with a lower number of data samples. The Tamil data shows a slightly different result. Intent types 4 and 5 report the lowest scores in the Tamil dataset, and the number of speech queries from these classes is comparatively low in the dataset. Further, the lower recall values for intent classes 4 and 5 indicate that the Tamil classifier misses many of the true instances of these classes.
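For reference, the per-class metrics discussed above can be computed as in the following sketch. The labels and predictions are invented to show how a class can have high precision but low recall; they are not results from the paper.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy example: three of the four true "type 4" queries are predicted as "type 1".
y_true = [4, 4, 4, 4, 1, 1, 1, 1]
y_pred = [4, 1, 1, 1, 1, 1, 1, 1]

for label, (p, r, f1) in per_class_scores(y_true, y_pred).items():
    print(f"type {label}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
# type 1: P=0.57 R=1.00 F1=0.73
# type 4: P=1.00 R=0.25 F1=0.40
```

In this toy case, every prediction of type 4 is correct (precision 1.00), but only one of the four true type 4 queries is found (recall 0.25), which is the pattern a low recall value signals.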
Compared to the Sinhala data with a sample size of 500, the Tamil dataset reports higher overall accuracy with 400 samples. The Tamil dataset contains code-mixed speech queries, since code-mixing is more natural in speech. The mixed-in words are in English. Additionally, the feature generator model (the DS model) is also trained on English data. This can result in higher overall accuracy on the Tamil dataset. Additionally, type 6 intent commands contain English words in both datasets, and this can result in a higher precision value.
Further, sentences that share more words with sentences of a different intent type, and sentences of limited length, tend to be misclassified more often. Hence, types 3 and 4 in the Sinhala dataset and types 2 and 4 in the Tamil dataset show lower accuracy. As Figure 3 shows, 1000 samples are enough to achieve nearly 80% overall accuracy. After that, the accuracy reaches saturation. Furthermore, the approach reports 77% overall accuracy for the Tamil dataset with 320 training samples. This highlights the effectiveness of the proposed transfer learning approach in limited data situations.
Additionally, Figure 3 shows the most effective CNN model type for classifying sequential feature maps as a function of the number of available data samples. As it shows, 2D CNN-based classifiers are useful when there is a very limited amount of data. However, when there is relatively more data (more than 4000 samples in the Sinhala dataset), 1D CNN-based classifiers give higher results. We can see this effect on the Tamil dataset as well. As Table 2 shows, the 1D CNN model accuracy is low compared to the 2D CNN model with 400 data samples.
Further, we examined the speech decoding capability of the English model (see Table 4). Here, 'Utterance' is the pronounced Sinhala sentence, and 'Eng. Transcript' is the ideal English transcript. 'DS output' lists the generated transcripts from the full model. In these generated outputs, the first few characters are decoded correctly. However, in the latter part, this decoding is compromised by the possible character sequences of the English language, since the model is trained on English. From this, we can infer that this character probability map is closer to a text representation than the MFCC features. Hence, this can improve the classification accuracy.

Conclusion
In this study, we proposed a method to identify the intent of free-form commands in a low-resource language. We used an ASR model trained on the English language to classify the Sinhala and Tamil low-resource datasets. The proposed method outperforms previous work and, even with a limited number of samples, can reach a reasonable accuracy.
CNN-based classifiers perform well in the classification of character probability maps generated by ASRs. Further, 1D CNN models work better with a higher number of samples, while 2D CNN models work better with a small amount of data. In the future, we plan to extend this study by incorporating more data from different languages and domains.

Figure 1: Visualization of probability output for Sinhala utterance

Figure 2: Architecture of the final model

Figure 3: CNN classifier accuracy variance with the number of samples (Sinhala dataset)

Table 1 summarizes the details of the datasets.

Table 2: Summary of results with different approaches and overall accuracy values

Table 3: Classification results of best performing models (F1: F1-score, P: precision, R: recall)

Table 4: DS transcript for some Sinhala utterances