Conversational telephone speech recognition for Lithuanian☆
Introduction
Lithuanian belongs to the Baltic subgroup of the Indo-European languages and is one of the least spoken European languages, with only about 3.5 million speakers. Although the language was standardized during the late 19th and early 20th centuries, most of its phonetic and morphological features were preserved (Vaišnienė et al., 2012). The language is characterized by rich inflection, a complex stress system, and flexible word order. Lithuanian is written using the Latin alphabet with some additional language-specific characters, as well as some characters borrowed from other languages. There are two main dialects, Aukštaitian (High Lithuanian) and Samogitian (Žemaitian or Low Lithuanian), each with sub-dialects. The dominant dialect is Aukštaitian, spoken in the eastern and central parts of Lithuania by 3 million speakers. Samogitian is spoken in the west of the country by only about 0.5 million speakers.
This paper reports on research work aimed at developing conversational telephone speech (CTS) recognition and keyword spotting (KWS) systems for the Lithuanian language. Speech recognition systems making use of statistical acoustic and language models are typically trained on large data sets. Three main resources are needed: (1) telephone speech recordings with corresponding transcriptions for acoustic model training, (2) written texts for language modeling, and (3) a pronunciation dictionary.
There have been only a few studies reporting on speech recognition for Lithuanian, in part due to the sparsity of the available linguistic e-resources. Systems for isolated word recognition are described in Lipeika et al. (2002), Maskeliūnas et al. (2015), Filipovič and Lipeika (2004), Raškinis and Raškinienė (2003), and Vaičiūnas and Raškinis (2006). Lithuanian broadcast speech recognition systems trained on 9 h of transcribed speech are described in Laurinčiukaitė and Lipeika (2015), Šilingas et al. (2004), and Šilingas (2005); the first investigated syllable-based units, while the latter two investigated different phonemic units. In the context of the Quaero program (www.quaero.org), a transcription system for broadcast audio in Lithuanian was developed without any manually transcribed training data and achieved a 28% word error rate (WER) (Lamel, 2013). Using only 3 h of transcribed audio data and semi-supervised training, this result was later improved to 18.3% (Lileikytė et al., 2016). Gales et al. (2015) describe a unicode-based graphemic system for the transcription of conversational telephone speech in Lithuanian. The system, developed within the IARPA Babel program, obtained a WER of 68.6% with 3 h of transcribed training data, and of 48.3% using 40 h of transcribed training data.
Transcribing conversational telephone speech is a more challenging task than transcribing broadcast news, which predominantly consists of prepared speech by professional speakers. In spontaneous speech, speaking rates and styles vary across speakers, and grammar rules are not strictly followed. Example phrases illustrating some common phenomena found in casual speech are given in Table 1. Hesitations and filler sounds occur frequently in conversational speech, appearing in 30% of the speaker turns (counted in the training transcripts). Disfluencies and/or unintelligible words are marked in 25% of the speaker turns. Moreover, the audio signal has a reduced bandwidth of 3.4 kHz and can be corrupted by noise and channel distortion.
The research reported in this paper was carried out in the context of the IARPA Babel program using the IARPA-babel304b-v1.0b corpus. The data were collected in a wide variety of environments and cover a broad range of speakers with respect to gender, age, and dialect. The audio was recorded in various conditions, such as on the street, in a car, in a restaurant, or in an office, and with different recording devices, such as cell phones and hands-free microphones.
This study uses the same training and test resources as Gales et al. (2015) for two conditions: the full language pack (FLP) with approximately 40 h of transcribed telephone speech and the very limited language pack (VLLP) comprised of only a 3 h subset of the FLP transcribed speech. An additional 40 h set of untranscribed data was available for semi-supervised training. A 26 million word text corpus, collected from the Web (Wikipedia, subtitles and other sources) and filtered by BBN (Zhang et al., 2015), was provided. Although the harvesting process searches the Web for texts containing n-grams that are frequent in the transcribed audio, the recovered texts are for the most part quite different from conversational speech. The available resources for acoustic and language model training are very small compared to the 2000 h of transcribed audio and over a billion words of text that are available for the American English conversational telephony task (Prasad et al., 2005).
The pronunciation dictionary is an important component of the system. To generate one, either a grapheme-based or a phoneme-based approach can be used. The advantage of using graphemes is that pronunciations are easily derived from the orthographic form. Grapheme-based systems have been shown to work reasonably well for various languages, such as Dutch, German, Italian, and Spanish (Kanthak and Ney, 2002; Killer et al., 2003). Yet some languages, such as English, have a weak correspondence between graphemes and phonemes, and using graphemes leads to a degradation in system performance. Phoneme-based systems usually provide better results as they better represent speech production. However, designing the pronunciation dictionary (Adda-Decker and Lamel, 2000) or a set of grapheme-to-phoneme rules requires linguistic expertise, making it a costly process. Lithuanian has quite a strong correspondence between the orthographic and phonemic forms, making it relatively easy to write grapheme-to-phoneme conversion rules, in contrast to English, which requires numerous exceptions.
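To make the grapheme-to-phoneme idea concrete, the sketch below implements two illustrative rule types for Lithuanian: digraph mapping and palatalization of consonants before front vowels. The rule tables and phone symbols are hypothetical simplifications for illustration, not the rule set used to build this paper's dictionary.

```python
# A minimal, hypothetical grapheme-to-phoneme sketch for Lithuanian.
# The rule tables below are illustrative simplifications, not the rules
# used to build the dictionary described in this paper.

DIGRAPHS = {"ch": "x", "dž": "dʒ", "dz": "dz"}   # two-letter graphemes
SOFT_TRIGGERS = set("iįyeęė")                    # front vowels that palatalize
CONSONANTS = set("bcdfghjklmnprstvzčšž")         # digraph outputs are not re-softened here

def g2p(word):
    """Map an orthographic word to a list of phone symbols."""
    phones, i, w = [], 0, word.lower()
    while i < len(w):
        if w[i:i + 2] in DIGRAPHS:               # longest-match digraph rule first
            sym, i = DIGRAPHS[w[i:i + 2]], i + 2
        else:
            sym, i = w[i], i + 1
        # palatalization rule: a consonant is soft before a front vowel
        if sym in CONSONANTS and i < len(w) and w[i] in SOFT_TRIGGERS:
            sym += "'"
        phones.append(sym)
    return phones

# č is palatalized before e; the remaining letters map one-to-one
print(g2p("česnakas"))
```

Because Lithuanian orthography is close to phonemic, even a short rule cascade like this covers most words; the exceptions and stress-dependent vowel qualities are what require linguistic expertise.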
This article is an extension of our previous work (Lileikyte et al., 2015). New research results are reported, providing keyword spotting results for both the FLP and VLLP conditions and an analysis of keyword spotting performance with a focus on out-of-vocabulary (OOV) keyword detection. In this study two techniques for enhancing the detection of OOV keywords are explored: using Web resources to augment the lexicon and language model, and using subword units to enhance KWS. We investigate two types of subword units: character N-grams and morpheme subwords. The use of full-word and subword units for KWS is compared. Moreover, we explore the impact of acoustic model data augmentation using semi-supervised training. To assess the benefits of using augmented texts for keyword spotting, we analyze whether OOV words become in-vocabulary (INV) or remain OOV.
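As a concrete illustration of the character-N-gram subword units mentioned above, a keyword can be decomposed into overlapping fixed-length character sequences, so that an OOV keyword can be matched against subword hypotheses. The padding symbol and the choice n = 3 below are illustrative assumptions, not the exact segmentation used in this work.

```python
# Hypothetical character-N-gram decomposition of a keyword; the padding
# symbol '#' and n = 3 are illustrative choices, not the segmentation
# actually used in this work.

def char_ngrams(word, n=3):
    # pad so that word-boundary n-grams become distinct subword units
    padded = "#" + word + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("labas"))  # → ['#la', 'lab', 'aba', 'bas', 'as#']
```

Any keyword, in-vocabulary or not, maps onto this closed subword inventory, which is what makes subword search robust to OOV terms.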
This study addresses the following questions: (1) which set of phonemic units should be used? (2) is a phoneme-based system better than a grapheme-based one? (3) how much improvement can be obtained by using untranscribed audio and Web texts for model training? (4) how much do subword units improve keyword spotting?
The next section describes the phonemic inventory of the Lithuanian language. Section 3 describes the experimental conditions. An overview of the speech-to-text and keyword spotting systems is provided in Section 4. Experimental results comparing different sets of phonemes and graphemes are given in Section 5, and Section 6 investigates semi-supervised training and the use of Web texts for speech recognition. Section 7 focuses on the improvement of out-of-vocabulary keyword detection. Finally, in Section 8 this work is summarized and some conclusions are drawn.
Lithuanian phonemic inventory
The Lithuanian alphabet contains 32 letters. While most of them are Latin, there is also ė, as well as letters borrowed from Czech (š, ž) and Polish (ą, ę). Lithuanian is generally described as having 56 phonemes, comprised of 11 vowels and 45 consonants (Pakerys, 2003). Consonants are classified as soft (palatalized) or hard (not palatalized), where the soft consonants always occur before certain vowels (i, į, y, e, ę, ė). There are 8 diphthongs that are composed of two vowels (ai, au, ei, ui,
Corpus and task description
This section describes the speech and text corpora and the experimental conditions used in this work. All the experiments reported in this paper use data provided by the IARPA Babel program (Harper), more specifically the IARPA-babel304b-v1.0b data set. The data are comprised of spontaneous telephone conversations, and as mentioned earlier, two conditions are considered:
Speech-to-text and keyword spotting
The acoustic models for the speech-to-text (STT) systems are built via a flat start training, where the initial segmentation is performed without any a priori information. The acoustic models are tied-state, left-to-right 3-state HMMs with Gaussian mixture observation densities (Gauvain et al., 2002). The models are triphone-based and word position-dependent. The system uses discriminatively trained stacked bottleneck acoustic features extracted from a deep neural network that were provided by
Phoneme-based and grapheme-based systems
Several phoneme-based and grapheme-based systems are evaluated, contrasting different sets of elementary units and mappings for rarely seen units. One contrast explores modeling complex sounds such as affricates and diphthongs explicitly as single units versus splitting them into sequences of units. Another compares explicitly modeling the soft consonants with simply allowing them to be contextual variants of their hard counterparts.
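The two unit inventories can be contrasted with a small sketch: a diphthong is either kept as one acoustic unit or rewritten as a sequence of its component vowels. The mapping below covers only a few vowel-vowel diphthongs and is a hypothetical simplification of the inventories actually compared.

```python
# Hypothetical contrast between two unit inventories: keeping a
# diphthong as a single unit vs. splitting it into a vowel sequence.
# Only a few diphthongs are listed here, for illustration.

DIPHTHONG_SPLIT = {"ai": ["a", "i"], "au": ["a", "u"],
                   "ei": ["e", "i"], "ui": ["u", "i"]}

def split_units(phones):
    """Rewrite a phone sequence so diphthongs become two-vowel sequences."""
    out = []
    for p in phones:
        out.extend(DIPHTHONG_SPLIT.get(p, [p]))
    return out

# "laikas" (time): the diphthong unit 'ai' split into 'a' + 'i'
print(split_units(["l", "ai", "k", "a", "s"]))  # → ['l', 'a', 'i', 'k', 'a', 's']
```

Splitting shrinks the unit inventory, which matters when training data is too scarce to estimate rare single-unit models reliably.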
Impact of Web data and untranscribed audio
In the above FLP experiments only the manual transcriptions were used for language modeling. To build the VLLP systems, the Web data were also used for training 3-gram language models, and the remaining 77 h of untranscribed data for semi-supervised acoustic model training. These extra resources help to reduce the performance difference between the two conditions. The following experiments aim to assess the impact of the Web data and semi-supervised training for both the FLP and VLLP
Improving keyword search
Out-of-vocabulary keywords are a challenge for keyword search as they can dramatically affect keyword spotting performance. Various methods have been proposed to address the problem of detecting OOV keywords. One common approach is to convert word lattices to phoneme (or grapheme) lattices and perform phone/grapheme-based string search (Siohan and Bacchiani, 2005; Karakos et al., 2014). As proposed in Hartmann et al. (2014), Chaudhari and Picheny (2007), He
Summary
This paper has reported on research carried out to develop systems for transcription and keyword search in conversational telephone speech for the low-resourced Lithuanian language. According to the linguistic literature, the phonemic inventory for Lithuanian is generally described with 56 phonemes. However, when resources are limited, some of the phonemes may not be sufficiently (or at all) observed. Experiments were carried out with different phoneme inventories to determine the best set of
Disclaimer
The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.
Acknowledgments
We would like to thank our IARPA Babel partners for sharing resources (BUT for the bottleneck features and BBN for the Web data), and Grégory Gelly for providing the voice activity detector.
This research was in part supported by the French National Agency for Research as part of the SALSA (Speech And Language technologies for Security Applications) project under grant ANR-14-CE28-0021, and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense US Army Research
References (41)
- et al., The LIMSI broadcast news transcription system, Speech Communication (2002)
- et al., Lightly supervised and unsupervised acoustic model training, Comput. Speech Lang. (2002)
- et al., Finding consensus in speech recognition: word error minimization and other applications of confusion networks, Comput. Speech Lang. (2000)
- et al., The use of Lexica in automatic speech recognition, Proceedings of the 2009 Lexicon Development for Speech and Language Processing (2000)
- et al., Improvements in phone based audio search via constrained match with high order confusion estimates, Proceedings of the 2007 Automatic Speech Recognition & Understanding (ASRU) (2007)
- et al., Using proxies for OOV keywords in the keyword search task, Proceedings of the 2013 Automatic Speech Recognition & Understanding (ASRU) (2013)
- et al., Automatic keyword selection for keyword search development and tuning, Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014)
- et al., Development of HMM/neural network-based medium-vocabulary isolated-word Lithuanian speech recognition system, Informatica (2004)
- et al., Results of the 2006 spoken term detection evaluation, Proceedings of the 2007 International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (2007)
- et al., Lattice-based unsupervised acoustic model training, Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2011)
- Unicode-based graphemic systems for limited resource languages, Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Minimum word error training of RNN-based voice activity detection, Proceedings of the 2015 Conference of the International Speech Communication Association (INTERSPEECH)
- Teoriniai Lietuvių Fonologijos Pagrindai [Theoretical Foundations of Lithuanian Phonology]
- On improving speech recognition and keyword spotting with automatically generated morphological units, Proceedings of the 2015 Language and Technology Conference (LTC)
- Combination of multilingual and semi-supervised training for under-resourced languages, Proceedings of the 2014 Conference of the International Speech Communication Association (INTERSPEECH)
- Comparing decoding strategies for subword-based keyword spotting in low-resourced languages, Proceedings of the 2014 Conference of the International Speech Communication Association (INTERSPEECH)
- Subword-based modeling for handling OOV words in keyword spotting, Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition, Proceedings of the 2002 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Normalization of phonetic keyword search scores, Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
☆ This paper has been recommended for acceptance by Roger K. Moore.