Phonetically-based multi-layered neural networks for vowel classification

doi:10.1016/0167-6393(90)90041-7

Speech Communication

Volume 9, Issue 1, February 1990, Pages 15-29

https://doi.org/10.1016/0167-6393(90)90041-7

Abstract

This paper describes the vowel sub-component of a speaker-independent phoneme classification system. The vowel classifier consists of an ear model followed by a set of Multi-Layered Neural Networks (MLNNs). The MLNNs are trained to recognize articulatory features such as the place of articulation and the manner of articulation related to tongue position.
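The architecture described above can be illustrated with a minimal sketch: an ear-model output vector is fed to a set of small multi-layered networks, one per articulatory feature. Everything concrete here is an assumption made for illustration and is not taken from the paper: the channel count `N_CHANNELS`, the hidden-layer size, the feature labels, the softmax output layer, and the random stand-in for the ear model's output. The weights are untrained, so the scores carry no meaning; only the data flow is shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    # Subtract the maximum for numerical stability.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def make_layer(n_in, n_out, rng):
    # One row per output unit: n_in weights followed by one bias term.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def apply_layer(layer, inputs):
    # Weighted sum of the inputs plus the bias stored in the last column.
    return [sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1]
            for row in layer]


class FeatureNet:
    """One multi-layered network scoring the values of a single feature."""

    def __init__(self, n_in, n_hidden, labels, rng):
        self.labels = labels
        self.hidden = make_layer(n_in, n_hidden, rng)
        self.output = make_layer(n_hidden, len(labels), rng)

    def forward(self, inputs):
        h = [sigmoid(a) for a in apply_layer(self.hidden, inputs)]
        probs = softmax(apply_layer(self.output, h))
        return dict(zip(self.labels, probs))


rng = random.Random(0)
N_CHANNELS = 40  # hypothetical number of ear-model output channels

# Stand-in for one frame of ear-model output (random values, not real data).
ear_output = [rng.random() for _ in range(N_CHANNELS)]

# A set of networks, one per articulatory feature (labels are assumptions).
place_net = FeatureNet(N_CHANNELS, 8, ["front", "central", "back"], rng)
manner_net = FeatureNet(N_CHANNELS, 8, ["high", "mid", "low"], rng)

place_scores = place_net.forward(ear_output)
manner_scores = manner_net.forward(ear_output)
print(place_scores)
print(manner_scores)
```

Keeping one network per feature, rather than one network per vowel, is what lets the outputs be recombined later for sounds that were never presented as classes during training.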

Experiments on 10 English vowels show a recognition rate higher than 95% on new speakers. When features are used for recognition, comparable results are obtained for vowels and diphthongs not used for training and pronounced by new speakers. This suggests that MLNNs, suitably fed with data computed by an ear model, generalize well to new speakers and new sounds.
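Why feature-based recognition can extend to sounds absent from training may be easier to see in a small sketch. The feature networks only output scores for feature values; a vowel is then chosen by matching those scores against a table of feature combinations. The table below is an illustrative assumption based on conventional phonetic descriptions, not the paper's inventory, and `decode_vowel`, the ASCII vowel symbols, and the example scores are all hypothetical.

```python
# Illustrative feature table (an assumption, not the paper's inventory):
# each vowel is described by a (place, height) pair.
VOWEL_FEATURES = {
    "i": ("front", "high"),
    "e": ("front", "mid"),
    "ae": ("front", "low"),
    "u": ("back", "high"),
    "o": ("back", "mid"),
    "a": ("back", "low"),
}


def decode_vowel(place_scores, height_scores, table=VOWEL_FEATURES):
    """Return the vowel whose feature pair has the highest joint score."""

    def joint_score(item):
        place, height = item[1]
        # Treat the two feature scores as independent and multiply them.
        return place_scores.get(place, 0.0) * height_scores.get(height, 0.0)

    best_vowel, _ = max(table.items(), key=joint_score)
    return best_vowel


# Hypothetical feature-network outputs for one vowel token.
place_scores = {"front": 0.1, "central": 0.1, "back": 0.8}
height_scores = {"high": 0.7, "mid": 0.2, "low": 0.1}

print(decode_vowel(place_scores, height_scores))  # u
```

Adding a new vowel under this scheme only requires a new row in the table, provided its feature values were each seen during training on other vowels; the networks themselves are left unchanged.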

Summary (translated from the German Zusammenfassung)

A classification stage for vowels is described as part of a speaker-independent phoneme classification system. The architecture of this vowel classifier is based on an ear model followed by a set of multi-layered neural networks. These neural networks are trained to recognize articulatory features, such as the place of articulation or the manner of articulation as related to the position of the tongue.

Experiments with 10 English vowels yield a recognition rate of more than 95% for new speakers previously unknown to the system. When phonetic features are used for recognition, comparable results can be achieved for vowels and diphthongs that were not used for training or that were uttered by new speakers. This suggests that multi-layered neural networks, suitably driven by the output data of an ear model, are well suited to extending this task to new speakers or new sounds.

Summary (translated from the French Résumé)

We present a speaker-independent phoneme classification system applied to vowels. The architecture of the vowel classifier is based on an ear model followed by a set of multi-layered neural networks (MLNNs). The MLNNs learn to recognize articulatory features, for example the place and manner of articulation in relation to the position of the tongue.

Experiments were carried out on 10 English vowels and show a recognition rate above 95% on new speakers. When features are used for recognition, comparable results are obtained for vowels and diphthongs that were not used during training and that were pronounced by new speakers. This suggests that, for data computed by an ear model, MLNNs generalize well to new speakers and new sounds.
