Conversational telephone speech recognition for Lithuanian

https://doi.org/10.1016/j.csl.2017.11.005

Highlights

  • Research on a conversational speech recognition and keyword spotting system for the low e-resourced Lithuanian language.

  • Grapheme-based and phoneme-based systems are studied for 2 training conditions (3 or 40 hours of transcribed audio data).

  • The Web texts are shown to significantly improve both speech recognition and keyword spotting performance.

  • Subword units (cross-word character and morpheme-based) are shown to improve the detection of out-of-vocabulary words.

  • The best keyword spotting results are achieved by combining keyword hits from word and subword systems.

Abstract

The research presented in the paper addresses conversational telephone speech recognition and keyword spotting for the Lithuanian language. Lithuanian can be considered a low e-resourced language as little transcribed audio data, and more generally, only limited linguistic resources are available electronically. Part of this research explores the impact of reducing the amount of linguistic knowledge and manual supervision when developing the transcription system. Since designing a pronunciation dictionary requires language-specific expertise, the need for manual supervision was assessed by comparing phonemic and graphemic units for acoustic modeling. Although the Lithuanian language is generally described in the linguistic literature with 56 phonemes, under low-resourced conditions some phonemes may not be sufficiently observed to be modeled. Therefore different phoneme inventories were explored to assess the effects of explicitly modeling diphthongs, affricates and soft consonants. The impact of using Web data for language modeling and additional untranscribed audio data for semi-supervised training was also measured. Out-of-vocabulary (OOV) keywords are a well-known challenge for keyword search. While word-based keyword search is quite effective for in-vocabulary words, OOV keywords are largely undetected. Morpheme-based subword units are compared with character n-gram-based units for their capacity to detect OOV keywords. Experimental results are reported for two training conditions defined in the IARPA Babel program: the full language pack and the very limited language pack, for which, respectively, 40 h and 3 h of transcribed training data are available. For both conditions, grapheme-based and phoneme-based models are shown to obtain comparable transcription and keyword spotting results. The use of Web texts for language modeling is shown to significantly improve both speech recognition and keyword spotting performance. Combining full-word and subword units leads to the best keyword spotting results.

Introduction

Lithuanian belongs to the Baltic subgroup of Indo-European languages and is one of the least spoken European languages, with only about 3.5 million speakers. Although the language was standardized during the late 19th and early 20th centuries, most of its phonetic and morphological features were preserved (Vaišnienė et al., 2012). The language is characterized by rich inflection, a complex stress system, and a flexible word order. Lithuanian is written using the Latin alphabet with some additional language-specific characters, as well as some characters borrowed from other languages. There are two main dialects, Aukštaitian (High Lithuanian) and Samogitian (Žemaitian or Low Lithuanian), each with sub-dialects. The dominant dialect is Aukštaitian, spoken in the eastern and central parts of Lithuania by 3 million speakers. Samogitian is spoken in the west of the country by only about 0.5 million speakers.

This paper reports on research work aimed at developing conversational telephone speech (CTS) recognition and keyword spotting (KWS) systems for the Lithuanian language. Speech recognition systems making use of statistical acoustic and language models are typically trained on large data sets. Three main resources are needed: (1) telephone speech recordings with corresponding transcriptions for acoustic model training, (2) written texts for language modeling, and (3) a pronunciation dictionary.

There have been only a few studies reporting on speech recognition for Lithuanian, in part due to the sparsity of the available linguistic e-resources. Systems for isolated word recognition are described in Lipeika et al. (2002), Maskeliūnas et al. (2015), Filipovič and Lipeika (2004), Raškinis and Raškinienė (2003), Vaičiūnas and Raškinis (2006). In Laurinčiukaitė and Lipeika (2015), Šilingas et al. (2004), Šilingas (2005), Lithuanian broadcast speech recognition systems were trained on 9 h of transcribed speech, with Laurinčiukaitė and Lipeika (2015) investigating syllable units and Šilingas et al. (2004), Šilingas (2005) investigating different phonemic units. In the context of the Quaero program (www.quaero.org), a transcription system for broadcast audio in Lithuanian was developed without any manually transcribed training data and achieved a 28% word error rate (WER) (Lamel, 2013). Using only 3 h of transcribed audio data and semi-supervised training, this result was later improved to 18.3% (Lileikytė et al., 2016). In Gales et al. (2015) a unicode-based graphemic system for the transcription of conversational telephone speech in Lithuanian is described. The system, developed within the IARPA Babel program, obtained a WER of 68.6% with 3 h of transcribed training data, and of 48.3% using 40 h of transcribed training data.

Transcribing conversational telephone speech is a more challenging task than transcribing broadcast news, which consists predominantly of prepared speech from professional speakers. In spontaneous speech, speaking rates and styles vary across speakers and grammar rules are not strictly followed. Example phrases illustrating some common phenomena found in casual speech are given in Table 1. Hesitations and filler sounds occur frequently in conversational speech, appearing in 30% of the speaker turns (as counted in the training transcripts). Disfluencies and/or unintelligible words are marked in 25% of the speaker turns. Moreover, the audio signal has a reduced bandwidth of 3.4 kHz and can be corrupted by noise and channel distortion.

The research reported in this paper was carried out in the context of the IARPA Babel program using the IARPA-babel304b-v1.0b corpus. The data were collected in a wide variety of environments from a broad range of speakers, with a wide distribution of gender, age and dialect. The audio was recorded in various conditions, such as on the street, in a car, in a restaurant or in an office, and with different recording devices, such as cell phones and hands-free microphones.

This study uses the same training and test resources as Gales et al. (2015) for two conditions: the full language pack (FLP) with approximately 40 h of transcribed telephone speech and the very limited language pack (VLLP) comprised of only a 3 h subset of the FLP transcribed speech. An additional 40 h set of untranscribed data was available for semi-supervised training. A 26 million word text corpus, collected from the Web (Wikipedia, subtitles and other sources) and filtered by BBN (Zhang et al., 2015), was provided. Although the harvesting process searches the Web for texts containing n-grams that are frequent in the transcribed audio, the recovered texts are for the most part quite different from conversational speech. The available resources for acoustic and language model training are very small compared to the 2000 h of transcribed audio and over a billion words of text that are available for the American English conversational telephony task (Prasad et al., 2005).

The pronunciation dictionary is an important component of the system. To generate one, a grapheme-based or phoneme-based approach can be used. The advantage of using graphemes is that pronunciations are easily derived from the orthographic form. Grapheme-based systems have been shown to work reasonably well for various languages, such as Dutch, German, Italian, and Spanish (Kanthak and Ney, 2002; Killer et al., 2003). Yet some languages, such as English, have a weak correspondence between graphemes and phonemes, and using graphemes leads to a degradation in system performance. Phoneme-based systems usually provide better results as they better represent speech production. However, designing the pronunciation dictionary (Adda-Decker and Lamel, 2000) or a set of grapheme-to-phoneme rules requires linguistic expertise, making it a costly process. The Lithuanian language has quite a strong correspondence between the orthographic and phonemic forms, making it relatively easy to write grapheme-to-phoneme conversion rules, in contrast to English, which requires numerous exceptions.
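To make this contrast concrete, the sketch below illustrates both approaches to lexicon generation. The rewrite rules shown are hypothetical simplifications chosen for illustration only, not the rule set used in this work.

    # Illustrative sketch: grapheme-based vs. rule-based phoneme lexicons.
    # The grapheme-to-phoneme rules below are hypothetical simplifications,
    # not the rules used in the paper.

    def grapheme_pronunciation(word):
        """A grapheme-based 'pronunciation' is just the letter sequence."""
        return list(word.lower())

    # Toy rewrite rules, tried in order (multi-letter rules first).
    G2P_RULES = [
        ("dž", ["dZ"]),  # affricate written with two letters -> one unit
        ("ai", ["ai"]),  # diphthongs kept as single units in this variant
        ("au", ["au"]),
        ("š", ["S"]),
        ("ž", ["Z"]),
    ]

    def phoneme_pronunciation(word):
        """Greedy left-to-right application of the rewrite rules."""
        word, phones, i = word.lower(), [], 0
        while i < len(word):
            for graph, phone in G2P_RULES:
                if word.startswith(graph, i):
                    phones += phone
                    i += len(graph)
                    break
            else:  # no rule matched: fall back to the identity mapping
                phones.append(word[i])
                i += 1
        return phones

    print(grapheme_pronunciation("džiaugsmas"))
    # ['d', 'ž', 'i', 'a', 'u', 'g', 's', 'm', 'a', 's']
    print(phoneme_pronunciation("džiaugsmas"))
    # ['dZ', 'i', 'au', 'g', 's', 'm', 'a', 's']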

This article is an extension of our previous work (Lileikyte et al., 2015). New research results are reported, providing keyword spotting results for both the FLP and VLLP conditions and an analysis of keyword spotting performance with a focus on out-of-vocabulary (OOV) keyword detection. In this study two techniques for enhancing the detection of OOV keywords are explored: using Web resources to augment the lexicon and language model, and using subword units to enhance KWS. We investigate two types of subword units: character n-grams and morpheme subwords. The use of full-word and subword units for KWS is compared. Moreover, we explore the impact of acoustic model data augmentation using semi-supervised training. To assess the benefits of using augmented texts for keyword spotting, we analyze whether OOV words become in-vocabulary (INV) or remain OOV.
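As a rough illustration of the two subword variants, an OOV keyword can be decomposed into overlapping character n-grams or into morph-like pieces. The segmentations below are hand-made examples; the morpheme units in the paper are produced automatically.

    # Illustrative subword decompositions of a keyword. The morph split is
    # a hand-made example, not the output of the automatic morphological
    # segmentation used in the paper.

    def char_ngrams(word, n=3):
        """Overlapping character n-grams with word-boundary markers."""
        padded = "#" + word + "#"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(char_ngrams("mokykla"))
    # ['#mo', 'mok', 'oky', 'kyk', 'ykl', 'kla', 'la#']

    # A morpheme-style split instead uses morph-like pieces, e.g. a stem
    # plus an inflectional ending (hypothetical segmentation):
    morphs = ["mokykl", "a"]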

This study addresses the following questions: (1) which set of phonemic units should be used? (2) is a phoneme-based system better than a grapheme-based one? (3) how much improvement can be obtained by using untranscribed audio and Web texts for model training? (4) how much do subword units improve keyword spotting?

The next section describes the phonemic inventory of the Lithuanian language. Section 3 describes the experimental conditions. An overview of the speech-to-text and keyword spotting systems is provided in Section 4. Experimental results comparing different sets of phonemes and graphemes are given in Section 5, and Section 6 investigates semi-supervised training and the use of Web texts for speech recognition. Section 7 focuses on the improvement of out-of-vocabulary keyword detection. Finally, in Section 8 this work is summarized and some conclusions are drawn.

Section snippets

Lithuanian phonemic inventory

The Lithuanian alphabet contains 32 letters. While most of them are Latin, the alphabet also includes ė, as well as letters borrowed from Czech (š, ž) and Polish (ą, ę). Lithuanian is generally described as having 56 phonemes, comprised of 11 vowels and 45 consonants (Pakerys, 2003). Consonants are classified as soft (palatalized) or hard (not palatalized), where the soft consonants always occur before certain vowels (i, į, y, e, ę, ė). There are 8 diphthongs that are composed of two vowels (ai, au, ei, ui,
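A minimal sketch of the softness regularity just described, under the assumption that a soft consonant can be detected from the following vowel; the trailing-apostrophe marking is an arbitrary convention chosen for this example, not the unit labels used in the paper.

    # Illustrative sketch: mark a consonant as soft (palatalized) when it
    # precedes one of the front vowels i, į, y, e, ę, ė.

    FRONT_VOWELS = set("iįyeęė")
    ALL_VOWELS = FRONT_VOWELS | set("aąouųū")

    def mark_soft_consonants(word):
        """Return units with soft consonants marked before front vowels."""
        units = []
        for i, ch in enumerate(word):
            if ch not in ALL_VOWELS and i + 1 < len(word) and word[i + 1] in FRONT_VOWELS:
                units.append(ch + "'")  # soft (palatalized) variant
            else:
                units.append(ch)
        return units

    print(mark_soft_consonants("nešė"))
    # ["n'", 'e', "š'", 'ė']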

Corpus and task description

This section describes the speech and text corpora, and the experimental conditions used in this work. All the experiments reported in this paper use data provided by the IARPA Babel program (Harper, 2013), more specifically the IARPA-babel304b-v1.0b data set. The data are comprised of spontaneous telephone conversations, and as mentioned earlier, two conditions are considered:

Speech-to-text and keyword spotting

The acoustic models for the speech-to-text (STT) systems are built via flat-start training, where the initial segmentation is performed without any a priori information. The acoustic models are tied-state, left-to-right 3-state HMMs with Gaussian mixture observation densities (Gauvain et al., 2002). The models are triphone-based and word position-dependent. The system uses discriminatively trained stacked bottleneck acoustic features extracted from a deep neural network that were provided by
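A minimal sketch of the left-to-right 3-state topology mentioned above; the transition probabilities are arbitrary placeholders, since the real values are estimated during training.

    import numpy as np

    # Left-to-right 3-state HMM topology: each state either loops on itself
    # or advances to the next state; no skips or backward transitions.
    # The probabilities are arbitrary placeholders, not trained values.
    A = np.array([
        [0.6, 0.4, 0.0],  # state 0 -> {0, 1}
        [0.0, 0.6, 0.4],  # state 1 -> {1, 2}
        [0.0, 0.0, 1.0],  # state 2 loops until the phone model is exited
    ])
    assert np.allclose(A.sum(axis=1), 1.0)  # each row is a distribution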

Phoneme-based and grapheme-based systems

Several phoneme-based and grapheme-based systems are evaluated, contrasting different sets of elementary units and mappings for rarely seen units. One contrast explores explicitly modeling complex sounds such as affricates and diphthongs as a single unit or splitting them into a sequence of units. Another compares explicitly modeling the soft consonants as opposed to simply allowing them to be contextual variants of their hard counterparts.
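The two contrasts can be pictured as alternative unit mappings; the symbols below are placeholders for illustration, not the actual inventory labels used in the experiments.

    # Illustrative unit mappings for the two contrasts described above
    # (placeholder symbols, not the actual inventory labels).

    # Contrast 1: complex sounds as one unit vs. a sequence of units.
    complex_as_single = {"dž": ["dZ"], "ai": ["ai"]}
    complex_as_split = {"dž": ["d", "Z"], "ai": ["a", "i"]}

    # Contrast 2: explicit soft-consonant units vs. mapping soft consonants
    # onto their hard counterparts and letting context-dependent (triphone)
    # models capture the variation.
    explicit_soft = {"n": "n", "n'": "n'"}
    folded_onto_hard = {"n": "n", "n'": "n"}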

Impact of Web data and untranscribed audio

In the above FLP experiments only the manual transcriptions were used for language modeling. To build the VLLP systems, the Web data were also used for training 3-gram language models, and the remaining 77 h of untranscribed data for semi-supervised acoustic model training. These extra resources help to reduce the performance difference between the two conditions. The following experiments aim to assess the impact of the Web data and semi-supervised training for both the FLP and VLLP
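Schematically, one round of semi-supervised training proceeds as in the sketch below. The decode, confidence and retrain callables are hypothetical placeholders for the actual recognizer, confidence scoring and training tools, and confidence-based filtering is an assumed selection criterion, not necessarily the one used in this work.

    # Schematic of one round of semi-supervised acoustic model training.

    def semi_supervised_round(model, untranscribed, decode, confidence,
                              retrain, threshold=0.8):
        """Decode untranscribed audio, keep confident hypotheses, retrain.

        decode/confidence/retrain are supplied by the surrounding toolkit;
        here they are placeholders injected as arguments.
        """
        auto_labeled = [(utt, decode(model, utt)) for utt in untranscribed]
        reliable = [(utt, hyp) for utt, hyp in auto_labeled
                    if confidence(hyp) >= threshold]   # filter by confidence
        return retrain(model, reliable)                # augment training data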

Improving keyword search

Out-of-vocabulary keywords are a challenge for keyword search as they can dramatically affect keyword spotting performance. Various methods have been proposed to address the problem of detecting OOV keywords. One common approach is to convert word lattices to phoneme (or grapheme) lattices and perform phone/grapheme-based string search (Siohan and Bacchiani, 2005; Karakos et al., 2014). As proposed in Hartmann et al. (2014), Chaudhari and Picheny (2007), He
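A rough sketch of the string-search idea: map the OOV keyword to a phone (or grapheme) sequence and match it approximately against phone sequences read off the decoder output. The fixed-length windowed edit-distance match below is a simplification of the lattice-based methods cited above.

    # Rough sketch of subword string search for an OOV keyword: slide a
    # fixed-length window over a decoded phone sequence and report
    # approximate matches (a simplification of lattice-based search).

    def edit_distance(a, b):
        """Levenshtein distance between two symbol sequences."""
        d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
             for i in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
        return d[len(a)][len(b)]

    def search_keyword(keyword_phones, decoded_phones, max_dist=1):
        """Report start positions where the keyword approximately matches."""
        k = len(keyword_phones)
        return [s for s in range(len(decoded_phones) - k + 1)
                if edit_distance(keyword_phones,
                                 decoded_phones[s:s + k]) <= max_dist]

    print(search_keyword(list("mokykla"), list("jismokyklojebuvo")))
    # [3]  (an approximate hit starting at position 3)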

Summary

This paper has reported on research carried out to develop systems for transcription and keyword search in conversational telephone speech for the low-resourced Lithuanian language. According to the linguistic literature, the phonemic inventory for Lithuanian is generally described with 56 phonemes. However, when resources are limited, some of the phonemes may not be sufficiently (or at all) observed. Experiments were carried out with different phoneme inventories to determine the best set of

Disclaimer

The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.

Acknowledgments

We would like to thank our IARPA Babel partners for sharing resources (BUT for the bottleneck features and BBN for the Web data), and Grégory Gelly for providing the voice activity detector.

This research was in part supported by the French National Agency for Research as part of the SALSA (Speech And Language technologies for Security Applications) project under grant ANR-14-CE28-0021, and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense US Army Research

References (41)

  • J.L. Gauvain et al.

    The LIMSI broadcast news transcription system

    Speech Communication

    (2002)
  • L. Lamel et al.

    Lightly supervised and unsupervised acoustic model training

    Comput. Speech Lang.

    (2002)
  • L. Mangu et al.

    Finding consensus in speech recognition: word error minimization and other applications of confusion networks

    Comput. Speech Lang.

    (2000)
  • M. Adda-Decker et al.

The use of lexica in automatic speech recognition

    Lexicon Development for Speech and Language Processing

    (2000)
  • U.V. Chaudhari et al.

    Improvements in phone based audio search via constrained match with high order confusion estimates

    Proceedings of the 2007 Automatic Speech Recognition & Understanding (ASRU)

    (2007)
  • G. Chen et al.

    Using proxies for OOV keywords in the keyword search task

    Proceedings of the 2013 Automatic Speech Recognition & Understanding (ASRU)

    (2013)
  • J. Cui et al.

    Automatic keyword selection for keyword search development and tuning

    Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    (2014)
  • M. Filipovič et al.

    Development of HMM/neural network-based medium-vocabulary isolated-word Lithuanian speech recognition system

    Informatica

    (2004)
  • J.G. Fiscus et al.

    Results of the 2006 spoken term detection evaluation

    Proceedings of the 2007 International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)

    (2007)
  • T. Fraga-Silva et al.

    Lattice-based unsupervised acoustic model training

    Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    (2011)
  • M. Gales et al.

    Unicode-based graphemic systems for limited resource languages

    Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    (2015)
  • G. Gelly et al.

    Minimum word error training of RNN-based voice activity detection

    Proceedings of the 2015 Conference of the International Speech Communication Association (INTERSPEECH)

    (2015)
  • A. Girdenis

Teoriniai Lietuvių Fonologijos Pagrindai (Theoretical Foundations of Lithuanian Phonology)

    (2003)
  • A. Gorin et al.

    On improving speech recognition and keyword spotting with automatically generated morphological units

Proceedings of the 2015 Language and Technology Conference (LTC)

    (2015)
  • F. Grézl et al.

    Combination of multilingual and semi-supervised training for under-resourced languages

    Proceedings of the 2014 Conference of the International Speech Communication Association (INTERSPEECH)

    (2014)
  • M. Harper

    The BABEL program and low resource speech technology

    Automatic Speech Recognition & Understanding...

    (2013)
  • W. Hartmann et al.

    Comparing decoding strategies for subword-based keyword spotting in low-resourced languages

    Proceedings of the 2014 Conference of the International Speech Communication Association (INTERSPEECH)

    (2014)
  • Y. He et al.

    Subword-based modeling for handling OOV words in keyword spotting

    Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    (2014)
  • S. Kanthak et al.

    Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition

    Proceedings of the 2002 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    (2002)
  • D. Karakos et al.

    Normalization of phonetic keyword search scores

    Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    (2014)

This paper has been recommended for acceptance by Roger K. Moore.
