Automatic Speech Recognition Using Limited Vocabulary: A Survey

ABSTRACT Automatic Speech Recognition (ASR) is an active field of research due to its large number of applications and the proliferation of interfaces or computing devices that can support speech processing. However, the bulk of applications are based on well-resourced languages that overshadow under-resourced ones. Yet, ASR represents an undeniable means to promote such languages, especially when designing human-to-human or human-to-machine systems involving illiterate people. An approach to design an ASR system targeting under-resourced languages is to start with a limited vocabulary. ASR using a limited vocabulary is a subset of the speech recognition problem that focuses on the recognition of a small number of words or sentences. This paper aims to provide a comprehensive view of mechanisms behind ASR systems as well as techniques, tools, projects, recent contributions, and possible future directions in ASR using a limited vocabulary. This work consequently provides a way forward when designing an ASR system using limited vocabulary. Although an emphasis is put on limited vocabulary, most of the tools and techniques reported in this survey can be applied to ASR systems in general. 
Abbreviations
ACC: Accuracy; AM: Acoustic Model; ASR: Automatic Speech Recognition; BD-4SK-ASR: Basic Dataset for Sorani Kurdish Automatic Speech Recognition; CER: Character Error Rate; CMU: Carnegie Mellon University; CNN: Convolutional Neural Network; CNTK: CogNitive ToolKit; CUED: Cambridge University Engineering Department; DCT: Discrete Cosine Transformation; DL: Deep Learning; DNN: Deep Neural Network; DRL: Deep Reinforcement Learning; DWT: Discrete Wavelet Transform; FFT: Fast Fourier Transformation; GMM: Gaussian Mixture Model; HMM: Hidden Markov Model; HTK: Hidden Markov Model ToolKit; JASPER: Just Another Speech Recognizer; LDA: Linear Discriminant Analysis; LER: Letter Error Rate; LGB: Light Gradient Boosting Machine; LM: Language Model; LPC: Linear Predictive Coding; LVCSR: Large Vocabulary Continuous Speech Recognition; LVQ: Learning Vector Quantization Algorithm; MFCC: Mel-Frequency Cepstrum Coefficient; ML: Machine Learning; PCM: Pulse-Code Modulation; PPVT: Peabody Picture Vocabulary Test; RASTA: RelAtive SpecTral; RLAT: Rapid Language Adaptation Toolkit; S2ST: Speech-to-Speech Translation; SAPI: Speech Application Programming Interface; SDK: Software Development Kit; SVASR: Small Vocabulary Automatic Speech Recognition; WER: Word Error Rate


Introduction
Automatic speech recognition (ASR) is the process and the related technology applied to convert a speech signal into the matching sequence of words or other linguistic entities using algorithms implemented in computing devices (Indurkhya and Damerau 2010). ASR has become an exciting field for many researchers. Presently, users prefer to use devices such as computers, smartphones, or any other connected device through speech. Current speech processing techniques (encompassing speech synthesis, speech processing, speaker identification or verification) pave the way to create human-to-machine voice interfaces. ASR can be applied in several applications including voice services (Yadava and Jayanna 2017), program control and data entry (Hauser, Sabir, and Thoma 1999), avionics (Noyes and Starr 2007), disabled assistance (Mayer 2018), amongst others. Although ASR can be advantageous in easing human-to-machine communication, in many cases it goes beyond helpful and becomes absolutely necessary. For example, contexts marked by low literacy levels or by the threatened extinction of under-resourced languages are ideal candidates for ASR (Besacier et al. 2014). In fact, the high penetration of communication tools such as smartphones in the developing world (Albabtain et al. 2014) and their increasing presence in rural areas (Ebongue 2015; Ebongue Louis 2015) provide an unprecedented opportunity to develop voice-based applications that can help to mitigate the low literacy levels in those areas. Smartphones offer many advantages over a PC-based interface, such as high mobility and portability, easy recharge of their batteries, and conventional embedded features such as microphones and speakers.

Motivation
In regions with low literacy levels, people are used to speaking local languages that are often considered as under-resourced languages because of the lack or insufficiency of formal written grammar and vocabulary. Since people do not know how to read or to write well-resourced languages (such as English or French), the development of ASR systems for under-resourced languages appears as an appealing solution to overcome this limitation. However, due to the complexity of the task, limited vocabulary must be considered. This paper focuses on limited vocabulary in ASR to allow researchers who wish to work on under-resourced languages to have an overview on how to develop a speech recognition system for limited vocabulary. In contrast to limited vocabulary systems, large vocabulary continuous speech recognition (LVCSR) systems are usually trained on thousands of hours of speech and billions of words of text (Saon and Chien 2012). The development of large vocabulary systems is complex since the larger the vocabulary, the harder the manipulation of learning algorithms, with more rules needed to build the dataset. LVCSR systems can be very efficient when they are applied on similar domains to those on which they were trained (Orosanu and Jouvet 2018). However, they are not robust enough to handle mismatched training and test conditions as the context may not be well handled. In fact, most of the input can be silence or contain background noise, which can be mistaken for speech; this increases the false positive rate (Warden 2018). Thus, LVCSR systems are not suitable for transfer learning targeting small or limited vocabulary.

Position with Other Surveys
Extensive research has been done regarding speech recognition using limited vocabulary. Among the existing surveys, the authors in (Lima and Costa-Abreu 2020) focus on the Portuguese language (and its variations). The authors consider Portuguese as an understudied language compared to English, Arabic, and Asian languages. Among Asian languages, Indian languages have received particular attention. Works on the development of ASR systems dealing with Indian languages such as Hindi, Punjabi, Tamil, amongst others, are presented in (Kurian 2014). Another language-specific survey is provided in (Ronzhin et al. 2006), where the authors focus on Russian language specificities and on the models applied for the development of Russian speech recognition systems in some organizations, both in Russia and abroad. A broader survey inspecting more than 120 promising works on biometric recognition (including voice) based on deep learning (DL) models is provided in (Minaee et al. 2020). In the latter, the authors present the strengths and possible uses of DL. A narrower survey is presented in (Desai, Dhameliya, and Desai 2013; Mishaim et al. 2021) and highlights the major subjects and improvements made in ASR. Additionally, a survey focusing on various feature extraction techniques in speech processing is provided in (Hibare and Vibhute 2014). The work in (Gong 1995) attempts to provide a comprehensive survey on noise-resistant features as well as similarity measurement, speech enhancement, and speech model compensation in noisy contexts. Due to the increasing penetration rate of mobile devices, the work in (Zaykovskiy 2006) investigates different approaches for providing ASR technology to mobile users. Approaches used to design chatbots and a comparison between different design techniques from nine papers are given in (Abdul-Kader and Woods 2015). To evaluate the reliability of recognition results, the work in (Jiang 2005) summarizes most research works related to confidence measures.
One of the first comprehensive surveys on speech recognition systems for under-resourced languages is found in (Besacier et al. 2014), notwithstanding the fact that many of the issues and approaches presented in the paper apply to speech technology in general. In this survey, the authors do not focus on limited vocabulary. The authors in (Rajashri, Ambewadikar, and Baheti 2021) have reviewed ASR for small vocabulary, but they do not clearly describe the methods and techniques used to build the models. Moreover, their survey is limited to the Marathi language only. Table 1 provides a summary of the recent surveys on ASR.

Contributions
Despite the plethora of surveys, a survey on ASR using limited vocabularies is yet to be conducted. However, ASR using limited vocabulary is a tremendous opportunity as a starting point for the development of speech recognition systems for under-resourced languages. This survey helps to fill this gap by presenting a summary of works done on ASR for limited vocabulary. For a better understanding, the ASR principle is detailed along with the approach to i) build ASR systems, ii) construct datasets and iii) evaluate the performance of such systems. Furthermore, closed- and open-source toolkits and frameworks are also presented. Therefore, such a study can rapidly and easily assist researchers who want to build speech recognition systems using limited vocabulary. The contributions of this paper are as follows:
• A description of fundamental aspects of ASR;
• A description of tools and processes for creating ASR systems;
• A summary of important contributions in ASR with limited vocabulary; and
• Orientations for future works.
The rest of the paper is organized around eleven sections. Section 2 presents the methodology used to conduct this survey. Section 3 provides an understanding of "limited vocabulary," and Section 4 describes the principle of speech recognition. Section 5 describes the techniques used for ASR, and Section 6 deals with the management of datasets. Section 7 presents the traditional performance metrics, and Section 8 provides an insight into the speech recognition frameworks. Section 9 summarizes works on speech recognition using limited vocabulary. Section 10 discusses possible future directions, followed by a conclusion in Section 11.

Table 1. Summary of recent surveys on ASR.
Reference | Objective | Limitations | Specific language | Limited vocabulary
(Gong 1995) | To present noise-resistant features and similarity measurement, speech enhancement and speech model compensation in noisy environments. | No description of the construction mechanism of an ASR. | No | No
(Ronzhin et al. 2006) | To expose specificities, methods, and applied models for the development of Russian speech recognition. | Although the authors provided a broad view of works related to ASR for Russian, they did not describe the ASR development methods. They focused on systems and their performance. | |
(Abdul-Kader and Woods 2015) | To present techniques used to design chatbots. | Does not detail the speech recognition mechanism or the procedure for creating a dataset. | |
(Lima and Costa-Abreu 2020) | | Focuses only on Portuguese-based ASR (corpora, approaches), even if tools can be used regardless of the language. | Portuguese | No
(Minaee et al. 2020) | To provide an insight into biometric recognition, including voice, based on DL models. | Not specific to speech recognition. Only provides insight with no description of the ASR mechanism or dataset construction. | No | No
(Rajashri, Ambewadikar, and Baheti 2021) | To review ASR systems or interfaces for specific tasks. | Limited to ASR for small vocabulary for the Marathi language only. | Marathi | Yes
(Mishaim et al. 2021) | To present a comparison between currently used ASR techniques and deep learning methods. | The authors do not dwell on the dataset construction techniques used in the different papers cited. | |

Methodology
The research methodology is composed of three main steps, namely the collection of papers, filtering to keep only relevant papers with significant findings, and analysis of the selected papers. The procedures used for collection and filtering are detailed below.

Collection
The collection of papers has been performed by a keyword-based search from common sources, such as Scopus, IEEE Xplore, ACM, ScienceDirect, and PubMed. In addition, the search has been extended to web scientific indexing services, namely Web of Science and Google Scholar. The aim was to collect as many relevant papers as possible regarding ASR. For that purpose, several keywords have been used, such as: ["Speech Recognition"] AND ["limited vocabulary" OR "vocabulary" OR "commands"].

Filtering
This step consists of identifying relevant papers by reading the abstract and screening the papers. Papers that do not directly deal with the topic or do not provide substantial contributions have been removed. However, some papers providing relevant results have not been considered for the synthesis because they did not provide most of the needed parameters, such as the speech recognition language, the error rate, the type of environment, the exact size of the vocabulary and the number of speakers. We have selected 30 papers published between 2000 and 2021 for data synthesis.

Analyzing
Each paper has been analyzed based on the following criteria: the language for which the recognition system was designed; the toolkit used to design the models; the size of the dataset; the type of environment (noisy or not) for the recognition; the number of speakers considered for the elaboration of the dataset; and the accuracy or recognition rate.

Understanding of Limited Vocabulary
A vocabulary, in our context, is defined as a closed list of lexical units that can be recognized by an ASR system. The size of the vocabulary and the selection of lexical units in the vocabulary strongly influence the performance of the automatic transcription system, since words outside the vocabulary cannot be recognized by the system. Speech recognition systems basically access a dictionary of phonemes and words. Obviously, it is easier to seek the meaning of one out of ten words in a ten-word dictionary than one out of thousands of words in Webster's dictionary (Agnes and Guralnik 1999). A phoneme is, in essence, the smallest unit of phonetic speech that distinguishes one word from another (Clements 1985). Every word can be deconstructed into the units (phonemes) of individual sounds that constitute that word. The number of words or sentences an ASR system can recognize is an important classification criterion. As proposed in (Saksamudre, Shrishrimal, and Deshmukh 2015), an ASR system can be classified based on the size of the vocabulary as follows: Small Vocabulary (up to 100 words or sentences); Medium Vocabulary (between 101 and 1000 words or sentences); Large Vocabulary (between 1001 and 10,000 words or sentences); and Very-large Vocabulary (more than 10,000 words or sentences).
A second classification is proposed by Whittaker and Woodland (Whittaker and Woodland 2001). They also classified ASR systems into four categories: small vocabulary (up to 1000 words), medium vocabulary (up to 10,000 words), large vocabulary (up to 100,000 words) and very/extra large vocabulary (more than 100,000 words). In this second case, limited vocabulary can be considered as a small vocabulary, meaning up to 1000 words. This size will be used in the rest of the paper as the maximum size for limited vocabulary.
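The two classifications above can be written down as simple threshold checks. The following is a minimal sketch; the function names are hypothetical, and it assumes that the fourth class of the second scheme starts above 100,000 words.

```python
def classify_saksamudre(size: int) -> str:
    """Size classes as listed for (Saksamudre, Shrishrimal, and Deshmukh 2015)."""
    if size < 1:
        raise ValueError("vocabulary size must be positive")
    if size <= 100:
        return "small"
    if size <= 1000:
        return "medium"
    if size <= 10_000:
        return "large"
    return "very large"


def classify_whittaker_woodland(size: int) -> str:
    """Size classes as listed for (Whittaker and Woodland 2001)."""
    if size < 1:
        raise ValueError("vocabulary size must be positive")
    if size <= 1000:
        return "small"
    if size <= 10_000:
        return "medium"
    if size <= 100_000:
        return "large"
    return "very large"  # assumption: anything above 100,000 words


def is_limited_vocabulary(size: int) -> bool:
    """Limited vocabulary as used in this survey: at most 1000 words."""
    return classify_whittaker_woodland(size) == "small"
```

For instance, a 500-word command set is "medium" under the first scheme but "small" (and therefore limited) under the second one.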

Automatic Speech Recognition
In the basic principle of ASR, the person speaking produces pressure variations in the larynx. The sounds produced are captured and digitized through the microphone and transmitted through a medium or a network. Digitized sounds are transformed into acoustic units (or acoustic vectors) via an acoustic model (AM). Thereafter, the recognition engine analyses this sequence of acoustic vectors by comparing it with those in its memory (its language model [LM]) and proposes the most likely candidate sequence. It is therefore necessary for the sequence of acoustic vectors to approximate one of the sequences memorized by the recognition engine. The core of an ASR system can be seen as a mathematical model that can generate a text corresponding to the recognized pieces of speech input (Ghai and Singh 2012).

Architecture
Audio signals need to be digitized before the recognition process starts. The digitization of the signal requires the selection of an appropriate sampling frequency to catch the high-pitched voices (Kraleva and Kralev 2009). In general, all ASR systems have the same architecture whether the vocabulary is limited, medium, large, or very large. This architecture can be modified or supplemented according to the recognition to be performed. ASR is usually composed of five typical components: Feature extraction; Acoustic Model; Language Model; Pronunciation Model; and Decoder.
In the architecture illustrated in Figure 1, the speech signal is received and then features are extracted. The obtained parameters are passed to the decoder, which uses the language, the acoustic, and the pronunciation models for learning.

Feature Extraction
Feature extraction is the first step for an ASR system. It converts the speech waveform into a set of feature vectors with the aim of having high discrimination between phonemes (Lokhande 2015). Feature extraction performs all the required measurements on the selected segment that will be used to make a decision (Doukas, Bardis, and Markovskyi 2017). The measured features may be used to update long-term statistical measures to facilitate the adaptation of the process to varying environmental conditions (mainly the background). Feature extraction determines the voice areas in the recording to be transcribed and extracts sequences of acoustic parameters from them. There are many techniques for feature extraction, as reported in (Narang and Gupta 2015), including:
• RelAtive SpecTral (RASTA): It is designed to decrease the impact of noise as well as to enhance speech. This technique is widely used for noisy speech.
• Linear Discriminant Analysis (LDA) and Probabilistic LDA (Ioffe 2006): This technique uses the state-dependent variables of Hidden Markov Model-based (HMM) on i-vector extraction. The i-vector is a low dimensional vector with a fixed length that contains relevant information.
• Mel-frequency cepstrum coefficients (MFCCs): It is the most commonly used technique, with a frameshift and length usually between 20 and 32 ms, using 1024 frequency bins, 26 mel channels and between 10 and 40 cepstral coefficients with cepstral mean normalization (Murphy 2012; Murshed et al. 2020; Padmanabhan and Premkumar 2015; Renals and Grefenstette 2000). This technique has low complexity and a high ACC of recognition.
The Mel-Frequency Cepstrum Coefficient (MFCC) technique is the usual method for feature extraction in most papers tackling the design of speech recognition systems for limited vocabulary (Gerazov and Ivanovski 2013; F. Huang 2011). The public sphinxbase library provides an implementation of this method that can be used directly, as was done in (X. Liu and Zhou 2014).
Figure 2 provides a brief description of the MFCC method, encompassing six steps as described below.
(1) Framing and windowing: For the acoustic parameters to be stable, the speech signal must be examined over a sufficiently short period. This step aims to cut windows of 20 to 30 ms.
(2) Hamming window: The Hamming window is used to reduce the spectral distortion of the signal. This is in contrast to the rectangular window, which is simple but can cause problems since it slices the signal boundaries abruptly; the Hamming window tapers the signal values toward zero at the boundaries, which helps avoid discontinuities. This is expressed mathematically as follows:
$$y(n) = x(n)\, w(n), \quad 0 \le n \le N - 1 \quad (1)$$
where $n$ is the index of the sample within the window, $N$ is the number of samples in each frame, $y(n)$ the output signal, $x(n)$ the input signal and $w(n)$ the Hamming window defined as:
$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1 \quad (2)$$
(3) Fast Fourier Transformation (FFT): The FFT converts time domain windows into the frequency domain by discretising and interpolating the window onto a regular grid before the Fourier transform is applied. The FFT decreases the computation requirements compared to the direct evaluation of the discrete Fourier transform. The latter is defined as:
$$Y(k) = \sum_{n=0}^{N-1} y(n)\, e^{-j 2\pi k n / N}, \quad k = 0, \dots, K - 1 \quad (3)$$
where $N$ is the number of samples in the window, $k$ is the index of the coefficient and $K$ is the number of coefficients.
(4) Mel filter banks: They allow reproduction of the selectivity of the human auditory system by providing a coefficient that gives the energy of the signal in each band:
$$X(m) = \sum_{k=0}^{K-1} |Y(k)|^{2}\, W(k, m), \quad m = 1, \dots, M \quad (4)$$
where $m$ is the index of the filter, $M$ is the number of filters and $W(k, m)$ is the weight function giving the contribution of the $k$th energy spectrum bin to the $m$th output band.
(5) Log: This step allows one to obtain the logarithmic Mel spectrum and to compress the sum $X(m)$ using:
$$X'(m) = \ln(X(m)) \quad (5)$$
(6) Discrete Cosine Transform (DCT): The DCT reduces the influence of low-energy components. MFCC coefficients are obtained by the discrete cosine transform given by:
$$c(i) = \sum_{m=1}^{M} X'(m) \cos\left(\frac{\pi i}{M}\left(m - \frac{1}{2}\right)\right), \quad i = 1, \dots, C \quad (6)$$
where $c(i)$ is the $i$th cepstral coefficient and $C$ is the number of coefficients retained. After this last step, the coefficients are returned as outputs.
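The six MFCC steps above can be sketched for a single frame in plain Python. This is an illustrative sketch under stated assumptions, not the sphinxbase implementation: the filters are assumed triangular and evenly spaced on the mel scale (using the common conversion 2595 log10(1 + f/700)), a naive discrete Fourier transform stands in for the FFT, and the defaults of 10 filters and 5 coefficients are arbitrary small values chosen for readability.

```python
import cmath
import math


def hamming(N):
    """Hamming window: w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def power_spectrum(frame):
    """Squared DFT magnitude for bins 0..N/2 (naive O(N^2) DFT, same values as an FFT)."""
    N = len(frame)
    spectrum = []
    for k in range(N // 2 + 1):
        s = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        spectrum.append(abs(s) ** 2)
    return spectrum


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def mel_filterbank(M, K, sample_rate):
    """Assumed triangular weights W[m][k], evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(top * i / (M + 1)) for i in range(M + 2)]  # in Hz
    bin_hz = (sample_rate / 2) / (K - 1)
    bank = []
    for m in range(1, M + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(K):
            f = k * bin_hz
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc_frame(frame, sample_rate, M=10, n_coeffs=5):
    """Steps (2) to (6) for one already-cut frame (step 1 is the framing itself)."""
    windowed = [x * w for x, w in zip(frame, hamming(len(frame)))]   # step 2
    power = power_spectrum(windowed)                                 # step 3
    bank = mel_filterbank(M, len(power), sample_rate)                # step 4
    energies = [sum(p * w for p, w in zip(power, row)) for row in bank]
    log_energies = [math.log(e + 1e-10) for e in energies]           # step 5 (1e-10 avoids log(0))
    return [                                                         # step 6 (DCT)
        sum(log_energies[m] * math.cos(math.pi * i * (m + 0.5) / M) for m in range(M))
        for i in range(1, n_coeffs + 1)
    ]
```

As a usage example, a 256-sample frame at an assumed sampling rate of 8 kHz corresponds to a 32 ms window: `mfcc_frame([math.sin(2 * math.pi * 440 * n / 8000) for n in range(256)], 8000)` returns five cepstral coefficients for a 440 Hz tone.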

Acoustic Modeling
Acoustic modeling (AM) of speech typically describes how statistical representations of the feature vector sequences computed from the speech waveform are established. It aims to predict the most likely phonemes in the audio that has been given as input (Benkerzaz, Y, and A 2019). Challenging configurations, mainly involving noise, may occur. Two cases can be identified: either the noise is isolated to a single band of frequencies, or the noise is not isolated, meaning it spans several bands. In the first case, AMs are still reliable enough to make the right decision. In the second case, models lack reliable information and are prone to mistakes (Fish 2006). The Bayesian process applied to ASR is as follows. Assuming that $x$ is a sequence of unknown acoustic vectors and $w_i$ ($i = 1, \dots, K$) is one of the $K$ possible classes for an observation, the recognized class is given by:
$$w^{*} = \arg\max_{w_i} P(w_i \mid x) = \arg\max_{w_i} \frac{P(x \mid w_i)\, P(w_i)}{P(x)} \quad (7)$$
The recognized word $w^{*}$ will therefore be the one that maximizes this quantity among all the candidate words $w_i$. The probability $P(x \mid w_i)$ of observing a signal $x$ given the sequence $w_i$ requires an AM for its estimation. The prior probability $P(w_i)$ of the sequence is independent of the signal and needs an LM for its estimation. $P(x)$ is the probability of observing the sequence of acoustic vectors $x$. It is the same for each phoneme sequence (because $P(x)$ does not depend on $w_i$), so it can be ignored (Renals and Grefenstette 2000). For ASR using limited vocabulary, the most commonly used AM to estimate the probability $P(x \mid w_i)$ is the HMM (Chaudhuri, Raj, and Ezzat 2011; Gerazov and Ivanovski 2013; X. Liu and Zhou 2014; Tamgno et al. 2012). The HMM is considered a generator of acoustic vectors. It is a finite-state automaton in which the transition from state $q_i$ to state $q_j$ has a probability $a_{ij}$ at each time unit. This transition generates an acoustic vector $x_t$ with a probability density $b_j(x_t)$.
The HMM is then given by $M = (\pi_i, A, B)$, where $\pi_i$ represents the initial probability distribution, $A$ is the matrix of transition probabilities $a_{ij}$, and $B$ is the set of observation densities $B = \{b_j(x_t)\}$. Several approaches for learning HMM models have been proposed, such as the maximum likelihood estimation (Baggenstoss 2001) and forward-backward estimation (Yu and Kobayashi 2003).
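To make the decision rule concrete, the following minimal sketch scores an observation sequence against one HMM per word and picks the word with the largest product of likelihood and prior. It makes a simplifying assumption that departs from the text: observations are discrete symbols, so B is a table of emission probabilities rather than a set of densities. The two toy word models and their numbers are hypothetical.

```python
def forward_likelihood(obs, pi, A, B):
    """P(x | w) for a discrete-observation HMM M = (pi, A, B), via the forward pass."""
    n_states = len(pi)
    # alpha[j] = probability of the observations so far, ending in state j
    alpha = [pi[j] * B[j][obs[0]] for j in range(n_states)]
    for x_t in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][x_t]
            for j in range(n_states)
        ]
    return sum(alpha)


def recognize(obs, models, priors):
    """Return the word maximizing P(x | w_i) P(w_i); P(x) is dropped as it is common to all words."""
    scores = {w: forward_likelihood(obs, *models[w]) * priors[w] for w in models}
    return max(scores, key=scores.get), scores


# Hypothetical two-state, two-symbol word models (left-to-right topology).
A_LEFT_TO_RIGHT = [[0.7, 0.3], [0.0, 1.0]]
MODELS = {
    "yes": ([1.0, 0.0], A_LEFT_TO_RIGHT, [[0.9, 0.1], [0.2, 0.8]]),
    "no": ([1.0, 0.0], A_LEFT_TO_RIGHT, [[0.1, 0.9], [0.8, 0.2]]),
}
PRIORS = {"yes": 0.5, "no": 0.5}
```

With the observation sequence `[0, 0, 1, 1]`, `recognize` selects "yes", because that model emits symbol 0 early and symbol 1 late with high probability. A real system would use continuous densities such as Gaussian mixtures and work with log probabilities.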

Language Model
The LM in ASR is used to predict the most likely word sequence for a given text. In limited vocabulary, the training text is divided into word classes during the training phase. Based on word classes, the class-based n-gram LM is then built. Thereafter, the standard bigram model, the class-based bigram model, and the interpolated model are obtained and used by the speech recognition system (Militaru and Lazar 2017). The statistical language model (for a sequence of words $W = w_1, w_2, \dots, w_N$) consists in calculating the probability $P(W)$:
$$P(W) = \prod_{i=1}^{N} P(w_i \mid h_i) \quad (8)$$
where $h_i = w_1, \dots, w_{i-1}$ is considered as the history of the word $w_i$ and $P(w_i \mid h_i)$ is the probability of the word $w_i$, knowing all the previous words.
In practice, as the sequence of words $h_i$ becomes richer, an estimation of the values of the conditional probabilities $P(w_i \mid h_i)$ becomes more and more difficult because no training corpus can contain all possible combinations of $h_i = w_1, \dots, w_{i-1}$.
To reduce the complexity of the LM, and consequently of its learning, the n-gram approach can be used. The principle is the same, and only the history is limited to the previous $n - 1$ words. The probability $P(W)$ is thus approximated as:
$$P(W) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) \quad (9)$$
It is possible to estimate the probability of the occurrence of a word $w_i$ in the learning corpus:
$$P(w_i) = \frac{C(w_i)}{C} \quad (10)$$
where $C(w_i)$ is the number of times the word $w_i$ has been observed in the learning corpus and $C$ the total number of words in the corpus. In practice, depending on the size of the learning corpus, different sizes of the history can be chosen. We then speak of a unigram model if $n = 1$ (without history), a bigram if $n = 2$ or a trigram if $n = 3$.
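The n-gram approach above can be illustrated with a count-based bigram model (n = 2). This is a minimal sketch: the sentence boundary markers `<s>` and `</s>` and the function names are assumptions, and the estimates are plain relative frequencies without smoothing, so any unseen word pair receives probability zero.

```python
from collections import Counter

BOUNDARIES = ("<s>", "</s>")  # assumed sentence start/end markers


def train_bigram(sentences):
    """Count single words and adjacent word pairs over a list of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def unigram_prob(word, unigrams):
    """P(w_i) = C(w_i) / C, with C counted over real words only."""
    total = sum(c for w, c in unigrams.items() if w not in BOUNDARIES)
    return unigrams[word] / total


def bigram_prob(prev, word, unigrams, bigrams):
    """Relative-frequency estimate: P(w_i | w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1})."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


def sentence_prob(sentence, unigrams, bigrams):
    """P(W) approximated as the product of bigram probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word, unigrams, bigrams)
    return prob
```

Trained on the hypothetical command corpus `["turn left", "turn right", "go left", "stop"]`, the model assigns a probability of 0.25 to "turn left" and 0 to the unseen order "left turn", which is how such a model expresses syntactic constraints in a small command grammar.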
A PocketSphinx-based n-gram language model is used to express syntactic constraints between words. For ASR with limited vocabulary, the online tool LMTool of CMU is recommended to train voice data on the network server (Ashraf et al. 2010; Chaudhuri, Raj, and Ezzat 2011; X. Liu and Zhou 2014). LMTool uses a corpus in the form of an ASCII text file to generate the language model and the dictionary. Also for limited-vocabulary ASR, many authors use the HTK decoding parameters, such as the language model scale factor (Al-Qatab and Ainon 2010; Alumae Alotaibi, & Huda, 2009; Qiao, Sherwani, and Rosenfeld 2010).

Pronunciation Model
The pronunciation dictionary models describe how a word is pronounced and represented. For small vocabulary, word-based models are used to define whole words as individual sound units. The pronunciation dictionary is a part of the pronunciation model. Translation of speech signal into text is achieved by classifying the speech signal into small sound units. The pronunciation model then determines how these small units can be combined to form valid words. For limited vocabularies, pronunciation models can be constructed by using handwritten word pronunciations, deriving them with phonological rules, or finding frequent pronunciations in a hand-transcribed corpus (J. Fosler-Lussier 1999). Another way to design the pronunciation model for limited vocabularies is to develop independent statistical models for each word in the dataset. The idea is to design a system in which each word has a number of parts, and then a model is trained to recognize each part of the word (E. Fosler-Lussier 2003). Once the acoustic, the language, and the pronunciation models are developed, the decoder uses them to output the text corresponding to the received audio signal.
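As an illustration of how a pronunciation dictionary lets small sound units be combined into valid words, the following sketch segments a phoneme sequence into words found in a lexicon. The lexicon entries, the ARPAbet-like phoneme symbols, and the function name are all hypothetical, and the sketch assumes an error-free phoneme sequence, which a real recognizer does not have.

```python
# Hypothetical hand-written lexicon: word -> phoneme sequence.
LEXICON = {
    "yes": ("y", "eh", "s"),
    "no": ("n", "ow"),
    "stop": ("s", "t", "aa", "p"),
    "go": ("g", "ow"),
}


def phonemes_to_words(phonemes, lexicon=LEXICON):
    """Segment a phoneme sequence into lexicon words; return None if impossible."""
    phonemes = tuple(phonemes)
    by_pronunciation = {pron: word for word, pron in lexicon.items()}
    # covered[i] holds one word sequence that spells out phonemes[:i] exactly
    covered = {0: []}
    for end in range(1, len(phonemes) + 1):
        for start in range(end):
            chunk = phonemes[start:end]
            if start in covered and chunk in by_pronunciation:
                covered[end] = covered[start] + [by_pronunciation[chunk]]
                break
    return covered.get(len(phonemes))
```

For example, `phonemes_to_words(["g", "ow", "s", "t", "aa", "p"])` yields the words "go" and "stop", while a sequence containing a unit outside the lexicon yields `None`.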

Decoder
A decoder is seen as a graph search algorithm which combines acoustic and linguistic knowledge to automatically transcribe the input record (Benkerzaz, Y, and A 2019). The goal of decoding is to deduce the sequence of states that generated the given observations. From this sequence of states, it is easy to find the most likely sequence of phonemes that matches the observed parameters. For a limited vocabulary, the Viterbi search algorithm (Viterbi 1967) uses the probabilities of the AM and those of the LM to accomplish the decoding task. The Viterbi decoder is good for short commands, meaning for small or limited vocabulary (Hui 2019;Novak 2010). It seeks the most probable candidate among states in an HMM. This search is performed given the probability of observations obtained from the AM, for each time step for each of the states (cepstral coefficient vector corresponding to the time step).
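The Viterbi search described above can be sketched as follows for a single HMM. As a simplifying assumption, observations are discrete symbols and B is a table of emission probabilities; the function name and the toy model used in the usage note are hypothetical. A decoder for a full system would also fold in the LM probabilities and work in the log domain to avoid numerical underflow on long inputs.

```python
def viterbi(obs, pi, A, B):
    """Return the most probable state sequence and its probability for a discrete HMM."""
    n = len(pi)
    # delta[j] = probability of the best path ending in state j at the current time step
    delta = [pi[j] * B[j][obs[0]] for j in range(n)]
    back_pointers = []
    for x_t in obs[1:]:
        # best predecessor of each state j
        best_prev = [max(range(n), key=lambda i: delta[i] * A[i][j]) for j in range(n)]
        delta = [delta[best_prev[j]] * A[best_prev[j]][j] * B[j][x_t] for j in range(n)]
        back_pointers.append(best_prev)
    # backtrack from the best final state
    state = max(range(n), key=lambda j: delta[j])
    best_prob = delta[state]
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_prob
```

With a hypothetical two-state model `pi = [1.0, 0.0]`, `A = [[0.7, 0.3], [0.0, 1.0]]`, `B = [[0.9, 0.1], [0.2, 0.8]]` and the observations `[0, 0, 1, 1]`, the decoded state sequence is `[0, 0, 1, 1]`: the model stays in the first state while symbol 0 is observed and moves to the second state when symbol 1 appears.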

Automatic Speech Recognition Approaches
Three techniques in artificial intelligence are used for ASR in general, namely machine learning (ML), DL, and deep reinforcement learning (DRL).

Machine Learning
Machine learning (ML) is an artificial intelligence technique that refers to systems that can learn by themselves (Myers 2019). ML implies teaching a computer to recognize patterns in contrast to the traditional approach, which consists of programming a computer with specific rules. The teaching is done through a training process that involves feeding large amounts of audio data to the algorithm and allowing it to learn from data and detect patterns that can later be used to achieve some tasks (Murphy 2012;Murshed et al., 2020).
The different ML steps for speech recognition are provided in Figure 3 and are detailed as follows: (1) The first step is to select and prepare a training dataset composed of audios (from words or sentences) that have been acquired through microphones. This data will be used to feed the ML model during the learning process so that it can determine texts corresponding to audio inputs. Data must be meticulously prepared, organized and cleaned with the aim to mitigate bias during the training process.
(2) The second step is to perform a pre-processing on the input data. This pre-processing includes the reduction of the noise in the audio and the enhancement of data.
(3) The third step consists of choosing a parametric class where the model will be searched. This is a fine-tuned process that is run on the training dataset. For speech recognition using limited vocabulary, some algorithms such as the Maximum Likelihood Linear Regression algorithm (X. Liu and Zhou 2014) can be used to train the AM, and the Viterbi algorithm (F. Huang 2011; X. Liu and Zhou 2014) for decoding. The type of algorithm to use depends on the type of problem to be solved.
(4) The fourth step is the training of the algorithm. This is an iterative process. After running the algorithm, the results are compared with the expected ones. The weights and biases are eventually tuned via the back propagation optimization, to increase the accuracy of the algorithm. This process is repeated until a certain criterion is met and the resulting trained model is saved for further analysis with the test data.
(5) The fifth and final step is to use and improve the model. The model is then used on new data.
Different ML methods have been used for acoustic modeling in speech recognition systems (Padmanabhan and Premkumar 2015). The evaluation, decoding and training of HMMs are done by the forward-backward (Yu and Kobayashi 2003), Viterbi (Viterbi 1967) and Baum-Welch (Baggenstoss 2001) algorithms, respectively. Padmanabhan and Premkumar (2015) review these methods.

Deep Learning
Deep Learning (DL) is a set of algorithms in ML. It uses model architectures made up of multiple non-linear transformations (neural networks) to model high-level abstractions in data (Jinyu et al. 2016). Deep Neural Networks (DNNs) work well for ASR when compared with Gaussian Mixture Model-based HMM (GMM-HMM) systems, and they even outperform the latter in certain tasks (Hinton et al. 2012). DL employs the Convolutional Neural Network (CNN) approach, which has the ability to automatically learn the invariant features needed to distinguish and classify the audio (Abhishek 2017). By learning multiple levels of data representations, DL can derive higher-level features from lower-level ones to form a hierarchy. For instance, in a speech classification task, the DL model can take phoneme values in the input layer and assign labels to the word in the sentence in the output layer. Between these two layers, there is a set of hidden layers that build successive higher-order features that are less sensitive to conditions such as noise in the user's environment (Hernández-Blanco, Herrera-Flores, Tomás, and Navarro-Colorado 2019).
DL can be implemented using various tools. However, TensorFlow seems to be one of the best tools currently available (Dhankar 2017). Figure 4 gives the steps of DL in ASR. Data augmentation helps the model to generalize better, thereby reducing overfitting and improving performance (Salamon and Bello 2017). Data augmentation creates a rich, diverse set of data from a small amount of data. It can be applied as a pre-processing step before training the model or later, directly in real-time. Different augmentation policies can be applied to audio data, such as time warping, frequency masking, and time masking. Recently, a new augmentation method called SpecAugment has been proposed by Park et al. (Park 2019) for ASR systems. They combined the warping of the features and the masking of blocks of frequency channels, as well as of blocks of time steps. To ease the augmentation process, a recent free MATLAB toolbox called Audiogmenter has been proposed (Maguolo et al. 2019).
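The frequency masking and time masking policies mentioned above can be sketched as follows. This is a simplified illustration in the spirit of SpecAugment, not the implementation of Park et al.: time warping is left out, the spectrogram is assumed to be a list of rows indexed as `spec[frequency][time]`, masked values are assumed to be set to 0.0, and the function names and default mask widths are hypothetical.

```python
import random


def frequency_mask(spec, max_width, rng):
    """Zero out a random block of consecutive frequency channels (rows)."""
    n_freq = len(spec)
    width = rng.randint(0, max_width)          # block height, possibly 0
    start = rng.randint(0, n_freq - width)     # first masked channel
    return [
        [0.0] * len(row) if start <= f < start + width else list(row)
        for f, row in enumerate(spec)
    ]


def time_mask(spec, max_width, rng):
    """Zero out a random block of consecutive time steps (columns)."""
    n_time = len(spec[0])
    width = rng.randint(0, max_width)
    start = rng.randint(0, n_time - width)
    return [
        [0.0 if start <= t < start + width else value for t, value in enumerate(row)]
        for row in spec
    ]


def augment(spec, max_freq_width=2, max_time_width=3, seed=None):
    """Apply one frequency mask and one time mask; the input is left unchanged."""
    rng = random.Random(seed)
    return time_mask(frequency_mask(spec, max_freq_width, rng), max_time_width, rng)
```

Calling `augment` several times with different seeds on the same spectrogram produces several distinct training examples from one recording, which is the effect sought when only a small dataset is available.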
The feature extraction process aims to remove the non-dominant features, thereby reducing the training time and the complexity of the developed models.

Deep Reinforcement Learning
A speech recognition system can be vulnerable to a noisy environment. To address this issue, deep reinforcement learning (DRL) can achieve complex goals in an iterative manner, which makes it suitable for such applications. Reinforcement learning is a popular paradigm of ML, which involves agents learning their behavior by trial and error. DRL is a combination of standard reinforcement learning with DL to overcome the limitations of reinforcement learning in complex environments with large state spaces or high computation requirements. DRL enables software-defined agents to learn the best actions possible in virtual environments to attain their goals (Mnih, Kavukcuoglu, and Silver 2020). This technique has recently been applied to limited vocabulary such as the "Speech Command" dataset in (Rajapakshe et al. 2020) or larger vocabulary such as (Kala and Shinozaki 2018). Regardless of the artificial intelligence technique that is used, an important prerequisite remains, namely the dataset.

Construction
Speech recognition research has traditionally required the resources of large organizations, such as universities or corporations (Warden 2018). Microphone recordings can capture data coming from multiple sources, which permits the collection of enough data for recognition. Although this approach remains the most frequent use case, it is subject to some challenges including speaker localization, speech enhancement, and ASR in distant-microphone scenarios (Cohen, Benesty, and Gannot 2010; Vincent, Virtanen, & Gannot, 2018). Datasets with limited vocabulary may be a subset of larger speech datasets (Glasser 2019). Datasets are used to train and test ASR engines. It is important to have a multi-speaker database (Gerazov and Ivanovski 2013). To avoid inconsistencies between datasets, the collection is done during a short period (the same day if possible) (Hofe, 2013). For the best results, the corpus of acoustic data used for learning should be recorded in a good-quality recording environment, although this does not have to be a professional studio. For instance, the authors in (Tamgno et al. 2012) used an affordable four-channel Handy Recorder (H4n). The creation of an audio dataset goes through four steps, as shown in Figure 5: requirement definition, corpus creation, voice recording, and labeling of the voice database.
a.) Requirement definition: The first step consists of selecting a suitable environment, which can be a room with closed doors to reduce noise. The use of studio-recorded audio is unrealistic, as such recordings are free of background noise, made with high-quality microphones, and produced in a formal setting. Good ASR models should work equally well in moderately noisy environments with natural voices. For good recognition, audio recordings should be made with several people (female and male) having different tones. The data should have a fixed and short duration to facilitate the training of the learning model and the evaluation process (Warden 2018).
b.) Corpus creation: After requirement definition, words or sentences are chosen. The choice is made according to the needs, and it is limited to the context in which speech recognition will be performed. Only necessary data should be selected.
c.) Voice recording: Sounds are recorded using a microphone that matches the desired conditions. The recording setup should minimize differences between training conditions and test conditions. Speech recordings are generally performed in an anechoic room and are usually digitized at 20 kHz using 16 bits (Alumae and Vohandu 2004; Glasser 2019), at 8 kHz (Tamgno et al. 2012), or at 16 kHz (Hofe et al. 2013; Warden 2018). The waveform audio file format container with file extension .wav is generally used (Glasser 2019; Warden 2018). WAV files encoded with Pulse-Code Modulation (PCM) provide uncompressed, high-fidelity digital sound. Since this format is easy to handle in the pre-processing phase of speech recognition and in further processing, audio files obtained in other formats after the recording (for instance OGG, WMA, MID, etc.) need to be converted into WAV format.
d.) Labeling of the voice database: Most ML models are trained in a supervised manner. For supervised learning to work, a set of labeled data from which the model can learn to make correct decisions is required. The labeling step aims to identify raw data (text and audio files) and to add informative labels that specify the context so that an ML model can learn from it. Each recorded file is marked by sub-words or phonemes; the labels indicate which phonemes or words were spoken in the audio recording. To label the data, humans make judgments about some aspects of the unlabeled audio file. For example, one might ask annotators to label all audio files containing a given word or phoneme. The ML model learns from these human-supplied labels, and the resulting trained model is used to make predictions on new data. Some software, such as Audio Labeler, allows one to define reference labels for audio datasets and to visualize these labels interactively.
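For the voice recording step, a short check that a file really is uncompressed PCM at the expected sampling rate can save trouble later. The sketch below uses Python's standard wave module to write a mono 16-bit PCM file at 16 kHz (one of the rates cited above) and to read its parameters back. The helper names write_pcm16_wav and describe_wav and the synthetic 440 Hz tone are assumptions made for illustration; a real pipeline would read recorded audio instead.

```python
import io
import math
import struct
import wave


def write_pcm16_wav(stream, samples, sample_rate=16000):
    """Write float samples in [-1, 1] as mono 16-bit PCM WAV data."""
    with wave.open(stream, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 2 bytes per sample = 16 bits
        wav.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wav.writeframes(frames)


def describe_wav(stream):
    """Return the basic recording parameters of a WAV stream."""
    with wave.open(stream, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "bits": 8 * wav.getsampwidth(),
            "sample_rate": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


# One second of a 440 Hz tone, kept in memory rather than on disk.
buffer = io.BytesIO()
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
write_pcm16_wav(buffer, tone)
buffer.seek(0)
info = describe_wav(buffer)
```

Running such a check over every file before training helps keep the sampling rate and bit depth consistent across the corpus.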
The minimal structure of a dataset for limited vocabulary takes into account elements such as the path to the audio file, the text corresponding to the audio file, the gender of the speaker, the age and the language used if there are many languages in the dataset.
If there are several speakers, the speaker's index serves as one of the keys. Some researchers also record the emotion of the speaker, indicating whether the speaker is neutral, happy, angry, surprised or sad (Zhou et al. 2021).
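The minimal structure described above can be sketched as one record per utterance, stored in a CSV manifest. The field names, the Utterance class, and the example values below are hypothetical; they only mirror the elements listed in the text (audio path, transcript, speaker index, gender, age, language, and optionally emotion).

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Utterance:
    audio_path: str           # path to the audio file
    text: str                 # word or sentence spoken in the file
    speaker_id: int           # index of the speaker (a key when there are several)
    gender: str
    age: int
    language: str             # useful when the dataset mixes languages
    emotion: str = "neutral"  # optional annotation


def write_manifest(utterances, stream):
    """Write one CSV row per utterance, with a header naming each field."""
    writer = csv.DictWriter(stream, fieldnames=[f.name for f in fields(Utterance)])
    writer.writeheader()
    for utterance in utterances:
        writer.writerow(asdict(utterance))


# Illustrative entry; the path and values are made up.
manifest = io.StringIO()
write_manifest([Utterance("clips/0001.wav", "yes", 0, "female", 31, "en")], manifest)
```

Keeping the manifest separate from the audio files makes it easy to filter by speaker or language when building training and test splits.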

Languages and Datasets
In the literature, several works have developed datasets, such as VoiceHome2 (Bertin, 2019), a French corpus for distant-microphone speech processing in domestic environments. Also regarding French, the work in (Mezzoudj et al. 2018) proposes a multi-source data selection for the training of LMs dedicated to the transcription of broadcast news and TV shows. An important work focusing on the English language is the reduced voice command database (Pleshkova, Bekyarski, and Zahariev 2019). It has been created from a worldwide cloud speech database, in combination with training, testing and real-time recognition algorithms based on artificial intelligence and DL neural networks. Another important English dataset, covering the digits from 0 to 9 and 10 short phrases, is the AV Digits Database (Petridis et al. 2018). For the digits, 53 participants (41 males and 12 females) were asked to read digits in English in random order five times. For the phrases, 33 participants were asked to record sentences, each repeated five times in 3 different modes: neutral, whisper and silent speech. A larger dataset is the Isolet dataset proposed in (Asuncion and Newman 2007). This dataset contains 150 voices divided into five groups of 30 people. Each speaker pronounces each letter of the English alphabet twice, which provides a set of 52 training examples for each speaker.
Apart from French and English, other languages, such as Sorani Kurdish have been tackled. BD-4SK-ASR (Basic Dataset for Sorani Kurdish Automatic Speech Recognition) is an experimental dataset which is used in the first attempt in developing an ASR system for Sorani Kurdish (Qader and Hassani 2019).
A very large project run by Mozilla is the Common Voice project, initiated with the aim of producing an open-source database for ASR (Ardila et al. 2020). The Mozilla Common Voice dataset, created in 2017, is intended for developers of language processing tools. In November 2020, more than 60 languages were represented on the platform, including French, English, Chinese, Danish and Norwegian. It is very important that speech recognition systems be tested for efficiency, regardless of the language.

Performance Metrics
The quality of the output transcript of an ASR system is traditionally measured by the word error rate (WER) metric:
\[ \mathrm{WER} = \frac{S + D + I}{N} \]
where S is the number of incorrect words substituted, I is the number of extra words inserted, D is the number of words deleted and N is the number of words in the correct transcript. The WER metric is expressed as a percentage. Note that WER can exceed 100%, because inserted words count as errors while N only counts the words of the correct transcript, and that the WER threshold for acceptable performance depends on the application. Johnson et al. (Johnson et al. 1999) have shown that it is still possible to get a good retrieval performance with a WER value up to 66%, although the precision begins to fall off quickly when WER gets above 30% (Johnson et al. 1999). With an estimate of the WER value and a known threshold for usable transcripts, processing resources can be allocated only to those files predicted to yield usable results. If a spoken word is not included in the vocabulary, the corresponding transcript will be considered as an error and the WER value will increase.
For the detection of keywords, the authors in (Anh and Thi 2020) proposed a method that averages the accuracies of the individual keywords (acc_i). Each acc_i is computed from the number of keywords correctly predicted (N_cp), the number of keywords not yet predicted (N_ny) and the number of keywords incorrectly predicted (N_ip); the ACC is then obtained as the mean of the acc_i values.
The WER metric works well when the morphology of the language is simple (Besacier et al. 2014). Otherwise, more adequate metrics should be applied, such as the Letter or Character Error Rate (LER or CER) (Kurimo et al. 2006), the Syllable Error Rate (SylER) (C. Huang et al. 2000) or the Speaker Attributed Word Error Rate (SA-WER) (Galibert 2013).
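To make the WER definition concrete, the sketch below computes it as (S + D + I) / N, where the minimal total number of substitutions, deletions and insertions is obtained with a standard word-level edit distance. The function name word_error_rate and the whitespace tokenization are simplifying assumptions for this example; production scoring tools usually normalize the text first.

```python
def word_error_rate(reference, hypothesis):
    """Return (S + D + I) / N for two whitespace-tokenized transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("the reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


one_substitution = word_error_rate("turn on the light", "turn off the light")
two_insertions = word_error_rate("yes", "yes yes please")
```

The first call gives 0.25 (one substitution over four reference words). The second gives 2.0, that is 200%, which shows how insertions alone can push the WER above 100%.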
The use of Output Probability Distributions (OPD) together with a secondary classification is one way to improve the accuracy of isolated-word ASR with a limited vocabulary (Thambiratnam and Sridharan 2000). The approach models the relationships between words. An OPD represents the distribution of the log probabilities produced by a set of HMMs: one HMM is trained for each word of the vocabulary, an utterance is passed to every HMM, and the resulting log probabilities are concatenated to give the OPD.
The LMs can be evaluated separately from the AM and the most commonly used measure is perplexity (Jelinek et al. 1977). This measure is calculated on a text not seen during training (test-set perplexity). Another measure is based on Shannon's game (Shannon 1951). Existing frameworks are used to develop ASR systems. Comparing these and previous works can enable newcomers in the field to develop ASR systems.
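As a small illustration of the test-set perplexity mentioned above, the sketch below evaluates a unigram LM with add-one smoothing on held-out tokens, using perplexity = exp(-(1/N) * sum of log p(w_i)). The unigram model, the smoothing choice, and the decision to include test words in the vocabulary are simplifying assumptions for this example; real LMs for ASR are typically N-gram or neural models.

```python
import math
from collections import Counter


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed unigram model on test_tokens."""
    counts = Counter(train_tokens)
    # Assumption: the vocabulary also covers unseen test words, so every
    # test token receives a non-zero probability.
    vocab_size = len(set(train_tokens) | set(test_tokens))
    total = len(train_tokens)

    log_prob = 0.0
    for token in test_tokens:
        p = (counts[token] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(test_tokens))


commands = ["yes", "no", "stop", "go"]
pp = unigram_perplexity(commands, commands)
```

With four equally likely command words, the perplexity is 4 (up to floating-point rounding): the model is as uncertain as a uniform choice among four words, which is the usual intuition behind the measure.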

Speech Recognition Frameworks or Toolkits
Several toolkits have been developed to train ASR models on datasets. Some of them are open source and others are not; this section presents both kinds.

Closed-source Code Systems
Several closed-source systems are available for ASR, namely the Dragon Mobile software development kit (SDK), Google Speech Recognition API, SiriKit, Yandex SpeechKit and Microsoft Speech API (Matarneh et al. 2017).
The Dragon Mobile SDK, developed by Nuance since 2011, provides speech services to enhance applications with speech recognition and text-to-speech functionality. It consists of a set of sample projects, documentation, and a framework to ease the integration of speech services into any application. It is a trialware that requires a paid subscription after 90 days. Popular mobile platforms are supported (Android, iOS, and Windows Phone). It has been used in some works like (Fujiwara 2016) where a custom phonetic alphabet has been optimized to facilitate text entry on small displays.
The Yandex SpeechKit can be used to integrate speech recognition, text-to-speech, music identification, and Yandex voice activation into Android mobile applications. The Yandex SpeechKit supports the following languages for speech recognition and text-to-speech: Russian, English, Ukrainian and Turkish. Yandex SpeechKit has been recently used in (Prozorov and Tatarinova 2019).
SiriKit is a toolkit that allows one to integrate Siri into a third-party iOS or macOS application, to benefit from the ASR capabilities. Siri is a speech recognition personal assistant unveiled by Apple in 2011 and only works on iOS and macOS (Rawat, Gupta, and Kumar 2014). Even though potential dangers and ethical issues have been reported (Zeng 2015), SiriKit is used in numerous works including (Herbert and Kang 2019).
The Speech Application Programming Interface (SAPI) developed by Microsoft enables developers to integrate speech recognition and speech synthesis within Windows applications (Gaida et al. 2014). Several versions of the API have been released either as part of a Speech SDK or as part of the Windows Operating System. SAPI is used in Microsoft Office, Microsoft Agent and Microsoft Speech Server. Several research works have also made use of this API (Shi and Maier 1996).
Google Speech Recognition API is a C# toolkit based on a model trained with English and 120 other languages. Google has improved its speech recognition by using a DNN in its applications, reaching an 8% error rate in 2015 compared to 23% in 2013 (Këpuska 2017). Table 2 provides a summary of selected closed-source code speech recognition toolkits.
As with conventional closed-source software, these toolkits are not flexible and are limited to the languages they have been designed for. Such limitations encouraged the research community to develop open-source frameworks and toolkits.

Open-source Code Systems
A plethora of open-source frameworks, engines, or toolkits for ASR systems are proposed in the literature. The following is a non-exhaustive list of the main works or projects. Most of them can easily handle small and large vocabularies.
The HTK was implemented in the late 1980s and is maintained by the Speech Vision and Robotics Group of the Cambridge University Engineering Department (CUED) (Young 2002); it has been available to the research community since early 2000. It provides recipes to build baseline systems with HMM. HTK is considered a very simple and effective tool for research (Qiao, Sherwani, and Rosenfeld 2010; Supriya and Handore 2017). It can build a noise-robust ASR system for moderately noisy environments, especially for small vocabulary systems. It is a practical solution to develop fast and accurate Small Vocabulary Automatic Speech Recognition (SVASR) (Hatala 2019).
One of the most popular toolkits is CMU Sphinx, designed for both mobile and server applications. CMU Sphinx is in fact a set of libraries and tools that can be used to develop speech-enabled applications. It was developed at Carnegie Mellon University in the late 1980s (Lee 1988). Several versions have been released, including Sphinx 1 to 4 and PocketSphinx for hand-held devices (Huggins-Daines et al. 2006). CMU Sphinx is currently attracting the attention of the research community. It offers the possibility to build new LMs using its Language Modeling Tool.
An earlier toolkit is the Rapid Language Adaptation Toolkit (RLAT), introduced in (Schultz 2009) and used in (Vu et al. 2010). The website for the project is, unfortunately, no longer available.
Kaldi is an extendable and modular toolkit for ASR (Povey 2011). The large community behind the project provides numerous third-party modules that can be used for several tasks. It supports DNN and offers excellent documentation on its website. Several works have been based on this toolkit, including (Guglani and Mishra 2018).
Microsoft also proposes an open-source Cognitive Toolkit (CNTK). It is used to create DNNs that power many Microsoft services and products. It enables researchers and data scientists to easily code such neural networks at the right abstraction level, and to efficiently train and test them on production-scale data (Banerjee, Hamidouche, and Panda 2016). Although it is still used, it is a deprecated framework.
Julius is software primarily designed for LVCSR. It is based on word N-gram and context-dependent HMM. It is used in several works including (Sharma et al. 2019).
Simon toolkit is a general public license speech recognition framework developed in C++. It is designed to be as flexible as possible and it works with any language or dialect. Simon makes use of KDE libraries, CMU Sphinx or Julius together with HTK and it runs on Windows and Linux. Praat is a framework that enables speech analysis, synthesis, and manipulation. In addition, it allows speech labeling and segmentation. It has been used in (Pleva, Juhár, and Thiessen 2015).
Mozilla Common Voice is a free speech recognition software for developers that can be integrated into projects. It works with DNN technology and targets several languages (Ardila et al. 2020). Besides Common Voice, Mozilla has also developed DeepSpeech, an open-source Speech-To-Text engine. It makes use of a model trained by the ML techniques proposed in (Hannun 2014).
Another large project is OpenSMILE (open-source Speech and Music Interpretation by Large-space Extraction), which is completely free to use for research purposes. It has received a lot of attention from the research community and claims more than 150,000 downloads. A recent model called Jasper (Just Another Speech Recognizer) was introduced in 2019 (Li 2019). It can be used with the TensorFlow-based OpenSeq2Seq toolkit, which supports, among others, speech recognition, speech commands and speech synthesis. It is used in recent works such as (S. Liu et al. 2021).
Other recent engines for ASR have been released, such as Fairseq and Wav2Letter++ (both developed by Facebook), Athena, ESPnet, and Vosk, which is an offline ASR toolkit. Table 3 provides a summary of some open-source speech recognition toolkits.
Several works have been performed for limited vocabularies in ASR using the above toolkits. We make a summary of these works in the following section.

Remarks
Even though European and Asian languages are well represented in the selected works, we noticed a particular focus on well-resourced languages, namely English, Chinese, and Romanian. Only a few works deal with African languages, namely (Tamgno et al. 2012) and Yoruba (Qiao, Sherwani, and Rosenfeld 2010). The weak representation of African languages can be explained by the fact that most of them are under-resourced, in addition to the lack of local skills and of awareness about the potential of ASR systems to prevent the extinction of under-resourced languages. Although most of the works did consider noisy environments, the severity was moderate and usually limited to natural environmental noises. Due to the limited size of the vocabulary, such noise levels have been easily mitigated, which explains the competitive ACC in most works.
We observe that the HTK Toolkit is the most used for speech recognition with regard to limited vocabulary. Its success is explained by the fact that it eases the manipulation of HMM parameters for the training and testing stages of system development.
In general, the Bayesian equation allows one to find the probability that a word will be recognized in the speech recognition process for limited vocabulary, and in turn the sequence of words corresponding to a speech sequence. HMM and DNN are decent AMs that achieve good speech recognition results. HMMs are used to account for variability in speech, and DNNs with many hidden layers have been shown to outperform GMMs on a variety of speech recognition benchmarks. The construction of datasets for ASR with limited vocabulary is done with speech recorded and digitized at sampling rates between 8 kHz and 20 kHz in most cases. Finally, the ACC of the system is calculated by using the number of words correctly predicted, the number of words not yet predicted, and the number of words incorrectly predicted. The ACC of ASR for limited vocabulary in previous works is between 46% and 100%.
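The Bayesian formulation mentioned in these remarks is commonly written as follows. The notation is an assumption made here for illustration: X denotes the sequence of acoustic observations, W a candidate word (or word sequence) from the limited vocabulary, P(X | W) the score given by the AM and P(W) the prior given by the LM.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \arg\max_{W} P(X \mid W)\, P(W)
```

The last step holds because P(X) does not depend on W. With a limited vocabulary, the maximization runs over a small set of candidates, which is one reason such systems can remain accurate with modest models.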

Future Directions
The world has more than 7000 languages according to the Ethnologue website (see Note 1), with the majority being under-resourced and even endangered. ASR systems offer an unprecedented opportunity to sustain such under-resourced languages and to fight the extinction of endangered ones. We should differentiate two types of under-resourced languages: those with an acknowledged written form and those without. In the first case, new datasets should be created, which requires new approaches for data recording and labeling, especially when there are not enough native speakers during the creation process. This is a challenge, especially with tonal languages. In fact, most languages in the developing world and especially in Sub-Saharan Africa are tonal (Downing & Rialland, 2016). In a recent survey on ASR for tonal languages (Kaur, Singh, and Kadyan 2021), only two African languages were reported. In addition, some languages (especially dialects) share commonalities. Therefore, ASR systems with limited vocabulary targeting multiple similar languages can be designed.
In the second case, when the language does not have an acknowledged written form, a new approach should be designed. Normally, an ASR system aims to transcribe speech into text. In ASR using a limited vocabulary, the transcript is usually a command or a short answer that can be used by an application or system to perform an action. In this scenario, the system can be a combination of a Speech-to-Speech Translation (S2ST) system and a speech recognition system. First, the speech in a non-written language is directly translated into speech in a well-resourced language such as English; then the transcript in the well-resourced language is generated and sent to the application. A direct S2ST model has been developed in (Jia 2019), translating Spanish into English without passing through text. Their dataset is a subset of the Fisher dataset and is composed of parallel utterance pairs. The construction of datasets for ASR systems using limited vocabulary of non-written languages can also be based on the same principle.
Regardless of whether the language is written or not, more noise resistant models should be developed, because the available data for under-resourced languages could be of low quality.
The limited computing resources and poor internet connectivity in some regions can prevent the use of ASR systems. There is currently a shift of ASR models from the cloud to the edge. It is achieved by reducing the size of models and making them fast enough to be executed on typical mobile devices. The latest optimization techniques to achieve this, such as pruning, efficient Recurrent Neural Network variants and quantization, are presented in (Shangguan et al. 2019). Although they provide remarkable results, such as reducing the size by 8.5x and increasing the speed by 4.5x, there is still a need to develop light and offline models that can be deployed on low-resource devices, such as off-the-shelf smartphones or Raspberry Pi/Arduino modules.
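To give an idea of why quantization shrinks a model, the sketch below maps floating-point weights to 8-bit integers with a single scale factor and then reconstructs approximate values. This is a generic symmetric int8 scheme chosen for illustration, not the specific method of (Shangguan et al. 2019); the function names and the toy weight values are hypothetical.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] plus one shared scale."""
    # Fall back to a scale of 1.0 when all weights are zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]   # toy values standing in for a layer
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

Each weight is now stored in one byte instead of four (for 32-bit floats), at the cost of a rounding error of at most half the scale per weight. Whether that error is acceptable has to be checked on the recognition task itself.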

Conclusion
This paper presented a review of ASR systems with a focus on limited vocabulary. After introducing the ASR principle in general and the usual techniques used to perform recognition, this paper also discussed the management of datasets, the performance metrics to evaluate ASR systems, and toolkits to develop such systems. From the analysis of selected papers on ASR using limited vocabulary, HMM and DNN-based AMs achieve good speech recognition results. DNNs even outperform GMMs on a variety of speech recognition benchmarks. Datasets with limited vocabulary are constructed with speech sampled at rates between 8 kHz and 20 kHz. The evaluation of systems is mainly based on the ACC rather than the WER metric. Despite the satisfactory results, there is still much to do. In fact, developed systems deal mostly with well-resourced languages and most models are still running on servers. We hope that the ideas for future directions discussed regarding under-resourced/unwritten languages (investigating direct speech-to-speech translation) and designing "on the edge ASR systems for limited vocabulary" will draw the attention of the research community to develop systems especially for the developing world, where the limitation in terms of computing resources coupled with the lack of connectivity constitutes a significant barrier.

Note
1. https://www.ethnologue.com

Disclosure statement
No potential conflict of interest was reported by the author(s).