In this paper we show how to discover non-linear features of spectrogram frames using a novel autoencoder. The autoencoder uses a neural network encoder that predicts how a set of prototypes, called templates, need to be transformed to reconstruct the data, and a decoder that is a function performing this operation of transforming the prototypes and reconstructing the input. We demonstrate this method on spectrograms from the TIMIT database. The features are used in a Deep Neural Network - Hidden Markov Model (DNN-HMM) hybrid system for automatic speech recognition. On the TIMIT monophone recognition task we were able to achieve gains of 0.5% over Mel log spectra by augmenting the traditional spectra with the predicted transformation parameters. Further, using the recently discovered "dropout" training, we were able to achieve a phone error rate (PER) of 17.9% on the dev set and 19.5% on the test set, which, to our knowledge, is the best reported number on this task using a hybrid system.
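The idea of a decoder that is a fixed transformation function rather than a learned network can be illustrated with a minimal sketch. The dimensions, the choice of a circular frequency shift as the deformation, and the hand-picked parameters below are all illustrative assumptions, not the paper's actual model; in the paper, a neural network encoder predicts the transformation parameters from each input frame, and the templates themselves are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 40 Mel bins per frame, 3 templates.
n_bins, n_templates = 40, 3

# Templates: prototype frames (learned in the actual model;
# random placeholders here).
templates = rng.standard_normal((n_templates, n_bins))

def decoder(weights, shifts):
    """Reconstruct a frame by deforming each template (here a toy
    circular shift along the frequency axis) and mixing the results
    with the predicted weights. This function is fixed, not learned."""
    out = np.zeros(n_bins)
    for w, s, t in zip(weights, shifts, templates):
        out += w * np.roll(t, s)
    return out

# In place of the neural network encoder, hand-picked parameters:
weights = np.array([0.5, 1.0, -0.2])
shifts = np.array([1, 0, -2])

recon = decoder(weights, shifts)
# The predicted parameters (weights, shifts) would serve as the
# features appended to the Mel log spectra for the DNN-HMM system.
```

Because the decoder is a known deterministic function, the encoder's outputs are forced to be interpretable transformation parameters, which is what makes them usable as features.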
Cite as: Jaitly, N., Hinton, G.E. (2013) Using an autoencoder with deformable templates to discover features for automated speech recognition. Proc. Interspeech 2013, 1737-1740, doi: 10.21437/Interspeech.2013-432
@inproceedings{jaitly13_interspeech,
  author={Navdeep Jaitly and Geoffrey E. Hinton},
  title={{Using an autoencoder with deformable templates to discover features for automated speech recognition}},
  year=2013,
  booktitle={Proc. Interspeech 2013},
  pages={1737--1740},
  doi={10.21437/Interspeech.2013-432},
  issn={2308-457X}
}