Neurocomputing

Volume 133, 10 June 2014, Pages 46-53

Architectures of neural networks applied for LVCSR language modeling

https://doi.org/10.1016/j.neucom.2013.11.033

Abstract

The n-gram model and its derivatives are widely applied solutions for Large Vocabulary Continuous Speech Recognition (LVCSR) systems. However, Slavonic languages require a language model that treats word order less strictly than English, the language that is the subject of most linguistic research. Such a language model is a necessary module in LVCSR systems, because it increases the probability of finding the right word sequences. The aim of the presented work is to create a language module for the Polish language with the application of neural networks. Here, the capabilities of Kohonen's Self-Organized Maps will be explored to find the associations between words in spoken utterances. To this end, the application of neural networks to evaluating sequences of words will be presented first. Then, the next step of language model development, the network architectures, will be discussed. The network proposed for the construction of the considered model is inspired by the Cocke–Younger–Kasami parsing algorithm.

Introduction

Language modeling for Large Vocabulary Continuous Speech Recognition improves the quality of the recognition task. The input of the Language Model Module is a list of hypotheses – the word sequences that best match the spoken utterance. The aim of this module is to evaluate every sequence according to its language correctness and the likelihood that it could be a real utterance. The latter matters because in daily speech we simplify our utterances and, as a result, not all of them are valid sentences.
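To make the module's role concrete, here is a minimal sketch of hypothesis rescoring; the hypothesis list, scores, and the trivial lm_score stand-in are all hypothetical illustrations, not the paper's method:

```python
# Minimal sketch of the language-model module's role: rescoring a list of
# recognition hypotheses. The hypotheses, scores and the lm_score stand-in
# are hypothetical; a real module returns how plausible a word sequence is.
hypotheses = [
    ("ala ma kota", -12.3),    # (word sequence, acoustic score)
    ("ala ma kotka", -12.1),
    ("alama kota", -11.9),
]

LEXICON = {"ala", "ma", "kota", "kotka"}

def lm_score(sentence: str) -> float:
    """Toy stand-in: penalize out-of-lexicon words. A real language model
    would return, e.g., a log-probability of the word sequence."""
    return sum(0.0 if w in LEXICON else -5.0 for w in sentence.split())

# Combine acoustic and language-model scores and pick the best hypothesis.
best = max(hypotheses, key=lambda h: h[1] + lm_score(h[0]))
print(best[0])   # "ala ma kotka": the garbled "alama" hypothesis is penalized
```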

In order to perform the task of Polish language modeling we can consider the usefulness of several known language models. The popular n-gram model [1], [2] assumes the importance of word order. However, many n-tuples of words occur only a few times in the training corpus or do not occur at all, which leads to incorrect modeling of the conditional probabilities. Such a problem of data sparseness is partially solved by the class-based n-gram model [1], [2] or n-gram backoff models [3], but word order remains important for these improvements as well. The weighted Finite State Transducer [4] has a similar property. However, Slavonic languages, in which the word order is not strict, pose a different challenge. Finite-State Grammar achieves good results [5], [6] only for utterances based on a simple grammar, and only when such a simple grammar can be prepared manually.
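As a minimal illustration of the sparseness problem (a toy corpus and word choices of my own, not from the paper):

```python
# Minimal sketch of bigram maximum-likelihood estimation, illustrating the
# data-sparseness problem described above. The toy corpus is hypothetical.
from collections import Counter

corpus = "ala ma kota kot ma ale ala ma psa".split()

unigrams = Counter(corpus[:-1])                   # counts of context words
bigrams = Counter(zip(corpus[:-1], corpus[1:]))   # counts of word pairs

def p_bigram(prev, word):
    """MLE estimate P(word | prev) = count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("ma", "kota"))   # seen pair: positive probability (1/3)
print(p_bigram("kota", "ala"))  # unseen pair: probability 0, although plausible
```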

A possible solution is the Head-Driven Phrase Structure Grammar (HPSG) [7], [8], which is a constraint-based formalism. It consists of two parts: a small number of general rules (constraints) and a large number of lexical entries, which describe word-specific dependencies. However, HPSG is a very strict formalism, so creating a sufficiently complex grammar requires a large amount of work on the derivation of lexical entries. To avoid this problem, one can be interested in other approaches. The author proposed a simple shallow grammar based on HPSG rules in [9]; typical applications of shallow grammars do not consider speech recognition [10]. In that work, partial parsing of sentences was allowed in order to make the application of grammar rules possible. In this paper, neural networks for learning grammar rules are proposed. The author considers in particular their application to solving both of the presented problems: data sparseness and the lack of sufficiently free word order modeling.

The present paper is organized in the following way: to understand the requirements of modeling a language, the application of a language model in LVCSR systems is presented first. Then, Self-Organized Maps are described and the application of such networks in the language model is discussed. Considering their architectures, it is illustrated that the model can learn connections between words from examples. Lastly, the obtained results are discussed.

Section snippets

The role of language model in speech recognition systems

Fig. 1 presents the architecture of a typical Large Vocabulary Continuous Speech Recognition (LVCSR) system; examples include HTK [11] and the ESAT Speech Recognition System [12], [13]. LVCSR systems for the Polish language are presented in [6], [14], and details about domain-specific LVCSR for the Polish language can be found in [15].

The input processing module is responsible for the normalization and digitization of the input signal. The feature extraction module produces …

Related works

Neural networks have been widely used in speech recognition, for example in acoustic models [16]. Work [17] describes the application of a single-layer feedforward network with the softmax function, which enables probability normalization. The input represents the (k−1)-th word in the word sequence (the context length is 2): for the i-th word of the lexicon, the i-th input is 1, while all other inputs are 0. The k-th word in the considered word sequence, which is the word that is …
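A minimal sketch of such a single-layer softmax network follows; the vocabulary size, learning rate and training loop are illustrative assumptions, not details of [17]:

```python
# Minimal sketch of the single-layer softmax network described above: a
# one-hot encoded previous word predicts a distribution over the next word.
# Vocabulary size, learning rate and training data are assumptions.
import numpy as np

V = 5                                    # toy vocabulary size
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((V, V))   # weights: input word -> output scores

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def p_next(prev_idx):
    """P(w_k | w_{k-1}): scores for the one-hot input, normalized."""
    return softmax(W[prev_idx])

def train_step(prev_idx, next_idx, lr=0.1):
    """One gradient-ascent step on log P(next | prev) (cross-entropy)."""
    p = p_next(prev_idx)
    grad = -p
    grad[next_idx] += 1.0        # d log p[next] / d scores = one-hot - p
    W[prev_idx] += lr * grad     # only the active input row is updated

for prev, nxt in [(0, 1), (1, 2), (0, 1)]:   # toy bigram training data
    train_step(prev, nxt)

print(p_next(0))   # probability distribution over the next word
```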

Neural networks

The idea presented here is to use an Artificial Neural Network for modeling general grammar rules; a similar approach can be further utilized for the word-specific dependencies. The author proposed the application of Self-Organized Maps as a single network in [27]. In this work the language model will be extended to a complex network architecture in Section 5.2. In this case, there is a need to find associations between grammatical classes of words, which can be done by the unsupervised …

One-level parsing

All the networks consist of at least one SOM network, so we can consider the following network architectures:

SimpleSOM – a single SOM network whose inputs are the classes (POS) of two neighboring words. It can be pictured as a “moving” network that is connected, in turn, to each pair of neighboring words. The operation of such a network is presented in Fig. 4. This model is similar to the class-based bigram model (an n-gram of length 2 words), which is sensitive to word order … (a minimal sketch of this model is given below)
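A minimal SOM sketch in the spirit of SimpleSOM; the map size, learning schedule, POS coding and training data are illustrative assumptions, not the paper's settings:

```python
# Minimal sketch of a SOM trained on pairs of POS-class codes of
# neighboring words, as in the SimpleSOM model above. All settings here
# (map size, schedules, POS coding) are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_pos = 10                # number of POS classes (hypothetical)
map_h, map_w = 6, 6       # SOM grid size (hypothetical)
weights = rng.random((map_h, map_w, 2))   # each unit stores a (POS_i, POS_{i+1}) pair

grid = np.stack(np.meshgrid(np.arange(map_h), np.arange(map_w), indexing="ij"), axis=-1)

def train(pairs, epochs=20, lr0=0.5, sigma0=3.0):
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)              # decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 0.5  # shrinking neighborhood
        for pair in pairs:
            x = np.asarray(pair, dtype=float) / n_pos   # normalized POS codes
            d = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(d.argmin(), d.shape)  # best-matching unit
            # Gaussian neighborhood update around the BMU
            g = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=-1) / (2 * sigma**2))
            weights[...] += lr * g[..., None] * (x - weights)

# toy POS bigrams, e.g. (noun, verb), (adjective, noun), ...
train([(1, 2), (3, 1), (1, 2), (2, 4)])

def response(pair):
    """Quantization error of a POS pair: small for familiar orderings."""
    x = np.asarray(pair, dtype=float) / n_pos
    return np.linalg.norm(weights - x, axis=-1).min()

# a trained (familiar) pair should give a smaller error than an unseen one
print(response((1, 2)), response((4, 4)))
```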

Experiments

The evaluation of the considered networks was performed in two experiments. In the first case, the author's own software simulating an LVCSR system was used. Here, words were recognized from strings of letters in which some letters were changed randomly with probability 0.02, to represent the inexact matching to patterns that occurs in a real system; this is referred to as “adding the noise”. In the case of non-exact matching, the recognized word was the closest one from the lexicon in alphabetical order.
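A minimal sketch of this noise injection and lexicon lookup; the alphabet, the lexicon and the lexicographic nearest-neighbour rule are one reading of the description above, and the author's simulator may differ:

```python
# Minimal sketch of the "adding the noise" step and the lexicon lookup
# described above. The alphabet, lexicon and tie-breaking rule are
# illustrative assumptions; the author's simulator may differ.
import bisect
import random
import string

random.seed(0)
LEXICON = sorted(["ala", "kot", "kota", "ma", "psa"])   # hypothetical lexicon

def add_noise(word: str, p: float = 0.02) -> str:
    """Replace each letter with a random one with probability p."""
    return "".join(
        random.choice(string.ascii_lowercase) if random.random() < p else c
        for c in word
    )

def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def recognize(word: str) -> str:
    """Exact match if possible; otherwise the alphabetically neighbouring
    lexicon entry sharing the longest prefix (one reading of 'closest in
    alphabetic order')."""
    i = bisect.bisect_left(LEXICON, word)
    if i < len(LEXICON) and LEXICON[i] == word:
        return word
    candidates = LEXICON[max(i - 1, 0):i + 1]
    return max(candidates, key=lambda w: common_prefix_len(w, word))

print(recognize(add_noise("kota")))
```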

The network …

Conclusions

The present paper proposes an application of neural networks to language modeling. Different network architectures and their abilities to find connections between words were considered. The discussion of the proposed architectures led to the creation of a network, called the CKYSOM network, which has good potential for modeling grammatical relations. In contrast to some other language models, it does not need any predefined language rules. The results obtained were close …
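For reference, a minimal sketch of classic Cocke–Younger–Kasami chart filling, the parsing scheme that inspired the CKYSOM architecture; the toy grammar here is hypothetical, and CKYSOM itself replaces explicit rules with learned SOM responses:

```python
# Minimal sketch of classic CYK chart filling, the algorithm that inspired
# the CKYSOM architecture. The toy grammar is hypothetical; CKYSOM replaces
# explicit grammar rules with learned SOM responses.
from itertools import product

# Chomsky-normal-form rules: (left, right) -> head, plus lexical rules.
RULES = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
LEXICAL = {"the": "Det", "dog": "N", "cat": "N", "saw": "V"}

def cyk(words):
    n = len(words)
    # chart[i][j]: set of classes covering words[i : i + j + 1]
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][0].add(LEXICAL[w])
    for span in range(1, n):                 # span length minus 1
        for i in range(n - span):
            for split in range(span):
                lefts = chart[i][split]
                rights = chart[i + split + 1][span - split - 1]
                for l, r in product(lefts, rights):
                    if (l, r) in RULES:
                        chart[i][span].add(RULES[(l, r)])
    return "S" in chart[0][n - 1]

print(cyk("the dog saw the cat".split()))   # True: a valid toy sentence
```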

Acknowledgment

Word lattices used in this research were generated by the Laboratory of Integrated Speech and Language Processing Systems, Poznań Supercomputing and Networking Centre, Poland. The author would like to thank the Head of the Laboratory, Professor Grażyna Demenko, and Marek Lange, who prepared these lattices. This research was supported by the Polish National Center of Science (Ph.D. Grant no. N516 513439) and by the Podkarpackie Voivodship Scholarship Fund. Part of the work was performed on a computer that was funded by …

Leszek Gajecki is employed at the University of Information Technology and Management in Rzeszów, Poland. He received his M.Sc. degree from Rzeszów University of Technology in 2006 and his Ph.D. degree with distinction from AGH University of Science and Technology in Kraków in 2013. His research interests focus on speech recognition, neural networks and vision processing.

References (39)

  • C. Chelba et al., Structured language modeling, Comput. Speech Lang. (2000)
  • J. Benesty et al., Springer Handbook of Speech Processing (2007)
  • F. Jelinek, Statistical Methods for Speech Recognition (1997)
  • S.F. Chen et al., An empirical study of smoothing techniques for language modeling, Comput. Speech Lang. (1999)
  • H. Erdogan et al., Using semantic analysis to improve speech recognition performance, Comput. Speech Lang. (2005)
  • D. Koržinek et al., Grammar based automatic speech recognition system for the Polish language
  • Ł. Brocki et al., Telephony based voice portal for a university, Speech Lang. Technol. (2008)
  • C. Pollard et al., Head-Driven Phrase Structure Grammar (1994)
  • A. Przepiórkowski, A. Kupść, M. Marciniak, A. Mykowiecka, Formal Description of Polish Language – Theory and...
  • L. Gajecki et al., Modeling of Polish language for large vocabulary continuous speech recognition
  • A. Przepiórkowski, Shallow Processing of Polish Language [Powierzchniowe przetwarzanie języka polskiego], Problemy...
  • S. Young et al., HTK Book (2009)
  • J. Duchateau, HMM based acoustic modeling in large vocabulary speech recognition (Ph.D. thesis), Katholieke...
  • ESAT-PSI, Description of the ESAT speech recognition system, January 2006. URL:...
  • M. Szymański et al., First evaluation of Polish LVCSR acoustic models obtained from the JURISDIC database, Speech Lang. Technol. (2008)
  • B. Hnatkowska et al., Application of automatic speech recognition to medical reports, J. Med. Inform. Technol. (2008)
  • W. Chou, B. Juang (Eds.), Pattern Recognition in Speech and Language Processing, CRC Press, Boca Raton, ...
  • W. Xu, A. Rudnicky, Can artificial neural networks learn language models?, in: Proceedings of ICSLP 2000, Beijing, ...
  • Y. Bengio et al., Neural probabilistic language models
