Neurocomputing

Volume 133, 10 June 2014, Pages 46-53

Architectures of neural networks applied for LVCSR language modeling

https://doi.org/10.1016/j.neucom.2013.11.033

Abstract

The n-gram model and its derivatives are widely applied solutions for Large Vocabulary Continuous Speech Recognition (LVCSR) systems. However, Slavonic languages require a language model that treats word order less strictly than English, the language that is the subject of most linguistic research. Such a language model is a necessary module in LVCSR systems, because it increases the probability of finding the right word sequences. The aim of the presented work is to create a language module for the Polish language with the application of neural networks. Here, the capabilities of Kohonen's Self-Organized Maps will be explored to find the associations between words in spoken utterances. To this end, the application of neural networks to evaluating sequences of words will be presented first. Then, the next step of language model development, the network architectures, will be discussed. The network proposed for the construction of the considered model is inspired by the Cocke–Younger–Kasami parsing algorithm.

Introduction

Language modeling for Large Vocabulary Continuous Speech Recognition improves the quality of the recognition task. The input of the Language Model Module is a list of hypotheses – the word sequences that best match the spoken utterance. The aim of this module is to evaluate every sequence according to its language correctness and the likelihood that it could be a real utterance. The latter matters because in daily speech we simplify our utterances and, as a result, not all of them are valid sentences.
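To make the module's role concrete, here is a minimal sketch of hypothesis rescoring; the hypothesis list, scores, and the trivial lm_score stand-in are all hypothetical illustrations, not the paper's method:

```python
# Minimal sketch of the language-model module's role: rescoring a list of
# recognition hypotheses. The hypotheses, scores and the lm_score stand-in
# are hypothetical; a real module returns how plausible a word sequence is.
hypotheses = [
    ("ala ma kota", -12.3),    # (word sequence, acoustic score)
    ("ala ma kotka", -12.1),
    ("alama kota", -11.9),
]

LEXICON = {"ala", "ma", "kota", "kotka"}

def lm_score(sentence: str) -> float:
    """Toy stand-in: penalize out-of-lexicon words. A real language model
    would return, e.g., a log-probability of the word sequence."""
    return sum(0.0 if w in LEXICON else -5.0 for w in sentence.split())

# Combine acoustic and language-model scores and pick the best hypothesis.
best = max(hypotheses, key=lambda h: h[1] + lm_score(h[0]))
print(best[0])   # "ala ma kotka": the garbled "alama" hypothesis is penalized
```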

In order to perform the task of Polish language modeling we can consider the usefulness of several known language models. The popular n-gram model [1], [2] assumes the importance of word order. However, many n-tuples of words occur only a few times in the training corpus or do not occur at all, which leads to incorrect modeling of the conditional probabilities. Such a problem of data sparseness is partially solved by the class-based n-gram model [1], [2] or n-gram backoff models [3], but word order remains important for these improvements as well. The weighted Finite State Transducer [4] has a similar property. However, Slavonic languages, in which the word order is not strict, pose a different challenge. Finite-State Grammar achieves good results [5], [6] only for utterances based on a simple grammar, and only when such a simple grammar can be prepared manually.
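As a minimal illustration of the sparseness problem (a toy corpus and word choices of my own, not from the paper):

```python
# Minimal sketch of bigram maximum-likelihood estimation, illustrating the
# data-sparseness problem described above. The toy corpus is hypothetical.
from collections import Counter

corpus = "ala ma kota kot ma ale ala ma psa".split()

unigrams = Counter(corpus[:-1])                   # counts of context words
bigrams = Counter(zip(corpus[:-1], corpus[1:]))   # counts of word pairs

def p_bigram(prev, word):
    """MLE estimate P(word | prev) = count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("ma", "kota"))   # seen pair: positive probability (1/3)
print(p_bigram("kota", "ala"))  # unseen pair: probability 0, although plausible
```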

A possible solution is the Head-Driven Phrase Structure Grammar (HPSG) [7], [8], which is a constraint-based formalism. It consists of two parts: a small number of general rules (constraints) and a large number of lexical entries, which describe word-specific dependencies. However, HPSG is a very strict formalism, so creating a sufficiently complex grammar requires a large amount of work on the derivation of lexical entries. To avoid this problem, one can be interested in other approaches. The author proposed a simple shallow grammar based on HPSG rules in [9]; typical applications of shallow grammars do not consider speech recognition [10]. In that work, partial parsing of sentences was allowed in order to make the application of grammar rules possible. In this paper, neural networks for learning grammar rules are proposed. The author considers in particular their application to solving both of the presented problems: data sparseness and the lack of sufficiently free word order modeling.

The present paper is organized in the following way: to understand the requirements of modeling a language, the application of a language model in LVCSR systems is presented first. Then, Self-Organized Maps are described and the application of such networks in the language model is discussed. Considering their architectures, it is illustrated that the model can learn connections between words from examples. Lastly, the obtained results are discussed.

Section snippets

The role of language model in speech recognition systems

Fig. 1 presents the architecture of a typical Large Vocabulary Continuous Speech Recognition (LVCSR) system; examples include HTK [11] and the ESAT Speech Recognition System [12], [13]. LVCSR systems for the Polish language are presented in [6], [14], and details about domain-specific LVCSR for the Polish language can be found in [15].

The input processing module is responsible for the normalization and digitization of the input signal. The feature extraction module produces …

Related works

Neural networks have been widely used in speech recognition, for example in acoustic models [16]. Work [17] describes the application of a single-layer feedforward network with the softmax function, which enables probability normalization. The input represents the (k−1)-th word in the word sequence (the context length is 2): for the i-th word of the lexicon, the i-th input is 1, while all other inputs are 0. The k-th word in the considered word sequence, which is the word that is …
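A minimal sketch of such a single-layer softmax network follows; the vocabulary size, learning rate and training loop are illustrative assumptions, not details of [17]:

```python
# Minimal sketch of the single-layer softmax network described above: a
# one-hot encoded previous word predicts a distribution over the next word.
# Vocabulary size, learning rate and training data are assumptions.
import numpy as np

V = 5                                    # toy vocabulary size
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((V, V))   # weights: input word -> output scores

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def p_next(prev_idx):
    """P(w_k | w_{k-1}): scores for the one-hot input, normalized."""
    return softmax(W[prev_idx])

def train_step(prev_idx, next_idx, lr=0.1):
    """One gradient-ascent step on log P(next | prev) (cross-entropy)."""
    p = p_next(prev_idx)
    grad = -p
    grad[next_idx] += 1.0        # d log p[next] / d scores = one-hot - p
    W[prev_idx] += lr * grad     # only the active input row is updated

for prev, nxt in [(0, 1), (1, 2), (0, 1)]:   # toy bigram training data
    train_step(prev, nxt)

print(p_next(0))   # probability distribution over the next word
```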

Neural networks

The idea presented here is to use an Artificial Neural Network for modeling general grammar rules; a similar approach can be further utilized for the word-specific dependencies. The author proposed the application of Self-Organized Maps as a single network in [27]. In this work the language model will be extended to a complex network architecture in Section 5.2. In this case, there is a need to find associations between grammatical classes of words, which can be done by the unsupervised …

One-level parsing

All the networks consist of at least one SOM network, so we can consider the following network architectures:

SimpleSOM – a single SOM network whose inputs are the classes (POS) of two neighboring words. It can be pictured as a “moving” network that is connected, in turn, to each pair of neighboring words. The operation of such a network is presented in Fig. 4. This model is similar to the class-based bigram model (an n-gram of length 2 words), which is sensitive to word order … (a minimal sketch of this model is given below)
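A minimal SOM sketch in the spirit of SimpleSOM; the map size, learning schedule, POS coding and training data are illustrative assumptions, not the paper's settings:

```python
# Minimal sketch of a SOM trained on pairs of POS-class codes of
# neighboring words, as in the SimpleSOM model above. All settings here
# (map size, schedules, POS coding) are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_pos = 10                # number of POS classes (hypothetical)
map_h, map_w = 6, 6       # SOM grid size (hypothetical)
weights = rng.random((map_h, map_w, 2))   # each unit stores a (POS_i, POS_{i+1}) pair

grid = np.stack(np.meshgrid(np.arange(map_h), np.arange(map_w), indexing="ij"), axis=-1)

def train(pairs, epochs=20, lr0=0.5, sigma0=3.0):
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)              # decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 0.5  # shrinking neighborhood
        for pair in pairs:
            x = np.asarray(pair, dtype=float) / n_pos   # normalized POS codes
            d = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(d.argmin(), d.shape)  # best-matching unit
            # Gaussian neighborhood update around the BMU
            g = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=-1) / (2 * sigma**2))
            weights[...] += lr * g[..., None] * (x - weights)

# toy POS bigrams, e.g. (noun, verb), (adjective, noun), ...
train([(1, 2), (3, 1), (1, 2), (2, 4)])

def response(pair):
    """Quantization error of a POS pair: small for familiar orderings."""
    x = np.asarray(pair, dtype=float) / n_pos
    return np.linalg.norm(weights - x, axis=-1).min()

# a trained (familiar) pair should give a smaller error than an unseen one
print(response((1, 2)), response((4, 4)))
```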

Experiments

The evaluation of the considered networks was performed in two experiments. In the first case, the author's own software simulating an LVCSR system was used. Here, words were recognized from strings of letters in which some letters were changed randomly with probability 0.02, to represent the inexact matching to patterns that occurs in a real system; this is referred to as “adding the noise”. In the case of non-exact matching, the recognized word was the closest one from the lexicon in alphabetical order.
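A minimal sketch of this noise injection and lexicon lookup; the alphabet, the lexicon and the lexicographic nearest-neighbour rule are one reading of the description above, and the author's simulator may differ:

```python
# Minimal sketch of the "adding the noise" step and the lexicon lookup
# described above. The alphabet, lexicon and tie-breaking rule are
# illustrative assumptions; the author's simulator may differ.
import bisect
import random
import string

random.seed(0)
LEXICON = sorted(["ala", "kot", "kota", "ma", "psa"])   # hypothetical lexicon

def add_noise(word: str, p: float = 0.02) -> str:
    """Replace each letter with a random one with probability p."""
    return "".join(
        random.choice(string.ascii_lowercase) if random.random() < p else c
        for c in word
    )

def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def recognize(word: str) -> str:
    """Exact match if possible; otherwise the alphabetically neighbouring
    lexicon entry sharing the longest prefix (one reading of 'closest in
    alphabetic order')."""
    i = bisect.bisect_left(LEXICON, word)
    if i < len(LEXICON) and LEXICON[i] == word:
        return word
    candidates = LEXICON[max(i - 1, 0):i + 1]
    return max(candidates, key=lambda w: common_prefix_len(w, word))

print(recognize(add_noise("kota")))
```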

The network …

Conclusions

The present paper proposes an application of neural networks to language modeling. Different network architectures and their abilities to find connections between words were considered. The discussion of the proposed architectures led to the creation of a network, called the CKYSOM network, which has good potential for modeling grammatical relations. In contrast to some other language models, it does not need any predefined language rules. The results obtained were close …
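For reference, a minimal sketch of classic Cocke–Younger–Kasami chart filling, the parsing scheme that inspired the CKYSOM architecture; the toy grammar here is hypothetical, and CKYSOM itself replaces explicit rules with learned SOM responses:

```python
# Minimal sketch of classic CYK chart filling, the algorithm that inspired
# the CKYSOM architecture. The toy grammar is hypothetical; CKYSOM replaces
# explicit grammar rules with learned SOM responses.
from itertools import product

# Chomsky-normal-form rules: (left, right) -> head, plus lexical rules.
RULES = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
LEXICAL = {"the": "Det", "dog": "N", "cat": "N", "saw": "V"}

def cyk(words):
    n = len(words)
    # chart[i][j]: set of classes covering words[i : i + j + 1]
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][0].add(LEXICAL[w])
    for span in range(1, n):                 # span length minus 1
        for i in range(n - span):
            for split in range(span):
                lefts = chart[i][split]
                rights = chart[i + split + 1][span - split - 1]
                for l, r in product(lefts, rights):
                    if (l, r) in RULES:
                        chart[i][span].add(RULES[(l, r)])
    return "S" in chart[0][n - 1]

print(cyk("the dog saw the cat".split()))   # True: a valid toy sentence
```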

Acknowledgment

Word lattices used in this research were generated by the Laboratory of Integrated Speech and Language Processing Systems, Poznań Supercomputing and Networking Centre, Poland. The author would like to thank the Head of the Laboratory, Professor Grażyna Demenko, and Marek Lange, who prepared these lattices. This research was supported by the Polish National Center of Science (Ph.D. Grant no. N516 513439) and by the Podkarpackie Voivodship Scholarship Fund. Part of the work was performed on a computer that was funded by …

Leszek Gajecki is employed at the University of Information Technology and Management in Rzeszów, Poland. He received his M.Sc. degree from Rzeszów University of Technology in 2006 and his Ph.D. degree with distinction from AGH University of Science and Technology in Kraków in 2013. His research interests focus on speech recognition, neural networks and vision processing.

References (39)

  • C. Chelba et al., Structured language modeling, Comput. Speech Lang. (2000)
  • J. Benesty et al., Springer Handbook of Speech Processing (2007)
  • F. Jelinek, Statistical Methods for Speech Recognition (1997)
  • S.F. Chen et al., An empirical study of smoothing techniques for language modeling, Comput. Speech Lang. (1999)
  • H. Erdogan et al., Using semantic analysis to improve speech recognition performance, Comput. Speech Lang. (2005)
  • D. Koržinek et al., Grammar based automatic speech recognition system for the Polish language
  • Ł. Brocki et al., Telephony based voice portal for a university, Speech Lang. Technol. (2008)
  • C. Pollard et al., Head-Driven Phrase Structure Grammar (1994)
  • A. Przepiórkowski, A. Kupść, M. Marciniak, A. Mykowiecka, Formal Description of Polish Language – Theory and...
  • L. Gajecki et al., Modeling of Polish language for large vocabulary continuous speech recognition
  • A. Przepiórkowski, Shallow Processing of Polish Language [Powierzchniowe przetwarzanie języka polskiego], Problemy...
  • S. Young et al., HTK Book (2009)
  • J. Duchateau, HMM based acoustic modeling in large vocabulary speech recognition (Ph.D. thesis), Katholieke...
  • ESAT-PSI, Description of the ESAT speech recognition system, January 2006. URL:...
  • M. Szymański et al., First evaluation of Polish LVCSR acoustic models obtained from the JURISDIC database, Speech Lang. Technol. (2008)
  • B. Hnatkowska et al., Application of automatic speech recognition to medical reports, J. Med. Inform. Technol. (2008)
  • W. Chou, B. Juang (Eds.), Pattern Recognition in Speech and Language Processing, CRC Press, Boca Raton, ...
  • W. Xu, A. Rudnicky, Can artificial neural networks learn language models?, in: Proceedings of ICSLP 2000, Beijing, ...
  • Y. Bengio et al., Neural probabilistic language models
