Eigenvalue Criterion-Based Feature Selection in Principal Component Analysis of Speech

This article presents an approach for selecting a limited set of the most relevant, information-rich speech data from the full set of training data. The proposed method uses Principal Component Analysis (PCA) to optimally select a lower-dimensional data subset with similar variances. Three selection algorithms based on an eigenvalue criterion are presented. The first one analyzes the data at the level of entire speech recordings. The second one additionally segments each recording into blocks of experimentally chosen size, which divides a recording into several smaller blocks that are richer or poorer in information. Finally, the third one analyzes all the speech recordings at the feature vector level. These three approaches represent three criterion-based selection techniques, from the coarsest to the finest data level. The main aim of the presented experiments is to show that PCA trained on the limited subset of data achieves comparable or even better results than PCA trained on the entire speech corpus. This approach can substantially speed up PCA training and reduce its memory and computational costs. All methods are evaluated in a Slovak phoneme-based large vocabulary continuous speech recognition task.


Introduction
Linear feature transformations are widely used techniques in high-dimensional data processing such as face recognition and automatic speech recognition (ASR). The most popular transformations in automatic speech recognition are Principal Component Analysis (PCA) [1], [2], [3] and Linear Discriminant Analysis (LDA) [4]. Our speech recognition research group follows the modern trends in ASR. Therefore, we are interested in the research and application of linear transformations in our speech recognition system.
An integral part of PCA is the computation of the covariance matrix from the training set. For a relatively small training corpus, computing the covariance matrix poses no problem. For a large corpus (thousands of recordings) and high-dimensional data, however, the processing time (≈ several hours) and memory requirements (≈ 20 GB) can become a problem. To address this, we have built upon our previous work [5], [6] and propose a procedure to train PCA from a limited amount of training data. In other words, PCA can be learned from a limited subset while the performance is maintained or even improved. We call this procedure partial-data trained PCA. It is based on an eigenvalue criterion and is applied to LMFE (Logarithmic Mel-Filter Energies) feature vectors. The performance of the method is evaluated on a Slovak speech corpus in a phoneme-based continuous speech recognition task. This paper is organized as follows. The next section gives the mathematical background of PCA. Section III describes the full-data trained PCA. Section IV presents the proposed algorithms for data selection. Section V describes the experimental setup and, finally, Section VI concludes the paper.

Principal Component Analysis
Principal component analysis (PCA) [2] is a linear feature transformation and dimensionality reduction method that maps n-dimensional input data to K-dimensional (K < n) linearly uncorrelated variables (the principal components), ordered by the variability they capture. PCA converts the data by a linear orthogonal transformation using the first few principal components, which usually represent about 80 % of the overall variance. The principal component basis minimizes the mean square error of approximating the data. This linear basis can be obtained by applying an eigendecomposition to the global covariance matrix estimated from the original data. The characteristic mathematical stages of PCA can be briefly described as follows, according to [2], [7].

First, suppose that the training data are represented by M n-dimensional feature vectors $x_1, x_2, \dots, x_M$. One of the integral parts of PCA is the centering of all vectors (subtracting the mean):

$$\Phi_i = x_i - \bar{x}, \quad i = 1, \dots, M, \tag{1}$$

where

$$\bar{x} = \frac{1}{M} \sum_{i=1}^{M} x_i \tag{2}$$

is the mean vector. From the centered vectors $\Phi_i$, the centered data matrix of dimension $n \times M$ is created as

$$A = [\Phi_1, \Phi_2, \dots, \Phi_M]. \tag{3}$$

To represent the variance of the data across different dimensions, the global covariance matrix is computed as

$$C = \frac{1}{M} A A^{T}. \tag{4}$$

An eigendecomposition (5) is applied to the covariance matrix in order to obtain its eigenvectors (spectral basis) $u_1, u_2, \dots, u_n$ and their corresponding eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$:

$$C u_i = \lambda_i u_i, \quad i = 1, \dots, n. \tag{5}$$

The principal components are represented by the eigenvectors, and the most significant ones are given by the K leading eigenvectors resulting from the decomposition. The dimensionality reduction step is performed by keeping only the eigenvectors corresponding to the K largest eigenvalues (K < n). These eigenvectors form the transformation matrix $U_K$ of dimension $n \times K$:

$$U_K = [u_1, u_2, \dots, u_K], \tag{6}$$

while $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_K$. Finally, the linear transformation

$$y_i = U_K^{T} \Phi_i \tag{7}$$

is applied, where $y_i$ represents the transformed feature vector. The value of K can be chosen as needed or according to the following comparative criterion:

$$\frac{\sum_{i=1}^{K} \lambda_i}{\sum_{i=1}^{n} \lambda_i} > T, \tag{8}$$

where the threshold $T \in \langle 0.9;\ 0.95 \rangle$. T represents the part of the global variance of the original data preserved in the new feature space.
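The PCA stages described in this section (centering, covariance estimation, eigendecomposition, choice of K and projection) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `pca_fit` and `pca_transform` are hypothetical, the covariance is normalized by M (an assumption that affects neither the eigenvectors nor the variance ratio), and K is taken as the smallest value for which the retained variance ratio reaches the threshold.

```python
import numpy as np

def pca_fit(X, threshold=0.95, K=None):
    """Fit PCA on X (n x M, one feature vector per column).

    Returns the mean vector, the n x K transformation matrix and all
    eigenvalues sorted in descending order.
    """
    n, M = X.shape
    mean = X.mean(axis=1, keepdims=True)        # mean vector
    A = X - mean                                # centered data matrix, n x M
    C = (A @ A.T) / M                           # covariance (1/M is an assumption)
    eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort to descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if K is None:
        # Smallest K whose cumulative variance ratio reaches the threshold.
        ratio = np.cumsum(eigvals) / np.sum(eigvals)
        K = min(int(np.searchsorted(ratio, threshold)) + 1, n)
    return mean, eigvecs[:, :K], eigvals

def pca_transform(X, mean, U_K):
    """Project centered vectors onto the K leading eigenvectors."""
    return U_K.T @ (X - mean)
```

Passing `K=13` fixes the output dimension directly, bypassing the threshold.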

Full-Data Trained PCA
In this section, the classical PCA training process is briefly described. At this stage, the whole amount of training data is used. Each parametrized speech signal in the corpus is represented by a separate LMFE matrix. First, the initial data preparation steps described by (1), (2) and (3) are performed. The global covariance matrix is computed according to (4) and then decomposed into a set of eigenvector-eigenvalue pairs. The eigenvectors corresponding to the K largest eigenvalues are chosen. They form the transformation matrix $U_K$ (6), which is used to transform the training and test corpus into the PCA feature space. Note that the final dimension of the feature vectors after the PCA transformation was set to K = 13 independently of criterion (8), to allow a direct comparison with MFCCs. The new PCA-based corpus was used to train the acoustic model based on full-data trained PCA. This model was created in order to compare the full-data and partial-data trained PCA models.

Proposed Method - Eigenvalue Criterion-Based Feature Selection
This section presents three algorithms proposed in order to select the most specific feature subset for PCA training. There are two major processing stages. The first one is the "fast" PCA used for feature selection and the second one is the main PCA. The selection approach is based on an eigenvalue criterion. The proportion of the first eigenvalue in the eigenspectrum decides whether the analyzed data are significant enough or not. To determine the proportion, the following comparative criterion is used:

$$\frac{\lambda_1}{\sum_{i=1}^{N} \lambda_i} > T, \tag{9}$$

where N represents the number of eigenvalues, in this case N = 26. The selected data are concatenated into one training matrix, which is the input for the main PCA. There are two modifications of the criterion. In the first one, the analyzed data are stored if the proportion is greater than T.
The second one stores the data according to the inverse comparative criterion, that is, the analyzed data are stored if the proportion is smaller than T. Data that do not fulfill the criterion are ignored. The selected data matrix is formed from the most characteristic data for optimal partial PCA training. We propose three feature selection levels based on different algorithms. The first one selects the data at the recording level, the second one analyzes the data at the data block level and the third one analyzes the data at the feature vector level. The main purpose of the proposed algorithms is to reduce the training data matrix. Each of the three algorithms was set to extract data amounting to 0.05, 0.1, 0.5, 1, 5 and 10 % of the original training set.
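The eigenvalue criterion and its two modifications can be illustrated with a short NumPy sketch. The function names `first_eigenvalue_proportion` and `is_selected` are hypothetical, as is the `inverse` flag; the data layout (one feature vector per column) and the 1/M covariance normalization are also assumptions. The proportion itself does not depend on the normalization.

```python
import numpy as np

def first_eigenvalue_proportion(data):
    """Share of the largest eigenvalue in the eigenspectrum of `data`.

    data: N x M array with one feature vector per column (assumed layout).
    """
    C = np.atleast_2d(np.cov(data, bias=True))   # rows are treated as variables
    eigvals = np.linalg.eigvalsh(C)              # ascending order
    eigvals = np.clip(eigvals, 0.0, None)        # remove tiny negative round-off
    total = eigvals.sum()
    if total == 0.0:                             # constant data: no variance at all
        return 0.0
    return float(eigvals[-1] / total)

def is_selected(data, T, inverse=False):
    """First modification: keep if proportion > T; inverse: keep if proportion < T."""
    p = first_eigenvalue_proportion(data)
    return p < T if inverse else p > T
```

Data dominated by a single direction of variation yield a proportion close to 1, while data with evenly spread variance yield a proportion close to 1/N.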

1) Recording Level Feature Selection
The recording level selection represents the coarsest method of speech data analysis. The algorithm ignores all recordings that do not fulfill the selection condition. However, the ignored recordings could still contain some information-rich parts of training data. The function of this algorithm is illustrated in Fig. 1. The parameters of the algorithm are listed in Tab. 1. In this table, Qty (quantity) denotes the size of the selected subset as a percentage.
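A minimal sketch of recording level selection is given below, assuming each recording is an LMFE matrix with one 26-dimensional vector per column. The names `select_recordings` and `proportion` are hypothetical, and the threshold value passed as `T` is an illustrative placeholder; how the threshold is tuned to reach a given subset size is not covered by this sketch.

```python
import numpy as np

def proportion(block):
    """Share of the largest eigenvalue in the covariance eigenspectrum."""
    eigvals = np.linalg.eigvalsh(np.atleast_2d(np.cov(block, bias=True)))
    eigvals = np.clip(eigvals, 0.0, None)
    total = eigvals.sum()
    return float(eigvals[-1] / total) if total > 0.0 else 0.0

def select_recordings(recordings, T, inverse=False):
    """Keep whole recordings that fulfill the criterion and concatenate them.

    recordings: list of (n_features x n_frames) arrays.
    Returns the concatenated training matrix and the indices of kept recordings.
    """
    kept, indices = [], []
    for i, rec in enumerate(recordings):
        p = proportion(rec)
        if (p < T) if inverse else (p > T):
            kept.append(rec)                 # the recording is stored as a whole
            indices.append(i)
    if not kept:                             # nothing fulfilled the criterion
        return np.empty((recordings[0].shape[0], 0)), indices
    return np.hstack(kept), indices
```

The returned matrix would then serve as the input to the main PCA.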

3) Feature Vector Level Selection
The feature vector level selection algorithm is the finest method of speech data analysis because a feature vector is the lowest available data level. The function of this algorithm is similar to Fig. 2 (only the block "Data block analysis" is changed to "Vector analysis").
The feature vector level selection algorithm operates similarly to the other two algorithms; it differs in how the eigenvalue criterion is applied. Each LMFE vector is reshaped into a matrix in order to compute its covariance matrix, which is treated as the input to the PCA analysis. The parameters of this algorithm are listed in Tab. 3.
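The vector level analysis can be sketched as follows. The shape used to reshape a 26-dimensional LMFE vector is not given in the text, so the default `shape=(2, 13)` is purely an assumption for illustration, as are the names `vector_proportion` and `select_vectors`.

```python
import numpy as np

def vector_proportion(vec, shape=(2, 13)):
    """Reshape one LMFE vector into a matrix and return the share of the
    largest eigenvalue of its covariance matrix (shape is an assumption)."""
    block = np.asarray(vec, dtype=float).reshape(shape)
    C = np.atleast_2d(np.cov(block, bias=True))    # rows are treated as variables
    eigvals = np.clip(np.linalg.eigvalsh(C), 0.0, None)
    total = eigvals.sum()
    return float(eigvals[-1] / total) if total > 0.0 else 0.0

def select_vectors(lmfe, T, inverse=False, shape=(2, 13)):
    """Keep the columns of `lmfe` (n_features x n_frames) that fulfill the criterion."""
    props = np.array([vector_proportion(lmfe[:, j], shape)
                      for j in range(lmfe.shape[1])])
    mask = props < T if inverse else props > T
    return lmfe[:, mask], mask
```

With this layout, a vector whose two halves are perfectly correlated gives a proportion of 1, and one whose halves are uncorrelated with equal variance gives 0.5.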

Experimental Setup
The speech corpus [8] contains approximately 100 hours of spontaneous parliamentary speech recorded from 120 speakers (90 % of them men). Exactly 36917 training utterances were used for acoustic modeling. Another 884 utterances were used for testing.
The speech was preemphasized and windowed using a Hamming window. The window size was set to 25 ms and the step size to 10 ms. The fast Fourier transform was applied to the windowed segments. Mel-filterbank analysis with 26 channels was followed by taking the logarithm of the linear filter outputs. The 26-dimensional LMFE features were decorrelated by DCT to obtain 13-dimensional MFCC vectors and were also used for PCA processing. After PCA, only 13 coefficients were retained. All the MFCC and PCA vectors were finally expanded by delta and acceleration coefficients to 39-dimensional feature vectors.
The acoustic modeling was performed using the HTK Toolkit [9]. The recognition system used context-independent monophones modeled by three-state left-to-right HMMs. The number of Gaussian mixtures per state was a power of 2, from 1 to 256. The phone segmentation of 45 Slovak phones was obtained from embedded training and automatic phone alignment. During testing, a word lattice created from a bigram language model was used; the language model was built from the test set. The vocabulary size was approximately 125k. Note that the accuracies in the evaluation process were computed as the ratio of the number of all word matches to the number of reference words.
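The front end described above can be approximated by the following NumPy sketch. Several details are assumptions made only for illustration: a 16 kHz sampling rate, a pre-emphasis coefficient of 0.97, a 512-point FFT, a magnitude spectrum, an unscaled DCT-II that keeps coefficients 0 to 12, and simple finite differences for the delta and acceleration coefficients. All function names are hypothetical, and the output will not match HTK numerically.

```python
import numpy as np

def frame_signal(x, fs, win_ms=25.0, step_ms=10.0):
    """Cut x into overlapping frames and apply a Hamming window."""
    win = int(round(fs * win_ms / 1000.0))
    step = int(round(fs * step_ms / 1000.0))
    n_frames = 1 + (len(x) - win) // step
    idx = np.arange(win)[None, :] + step * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(win)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with centers equally spaced on the mel scale."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = mel2hz(np.linspace(hz2mel(0.0), hz2mel(fs / 2.0), n_filters + 2))
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(n_filters):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def lmfe(x, fs=16000, n_filters=26, n_fft=512, preemph=0.97):
    """Logarithmic mel-filter energies, one row per frame."""
    x = np.append(x[0], x[1:] - preemph * x[:-1])     # pre-emphasis
    spec = np.abs(np.fft.rfft(frame_signal(x, fs), n_fft))
    energies = spec @ mel_filterbank(n_filters, n_fft, fs).T
    return np.log(np.maximum(energies, 1e-10))        # floor avoids log(0)

def mfcc_from_lmfe(L, n_ceps=13):
    """Decorrelate LMFE rows with an unscaled DCT-II."""
    n = L.shape[1]
    k = np.arange(n_ceps)[:, None]
    j = np.arange(n)[None, :]
    return L @ np.cos(np.pi * k * (2 * j + 1) / (2 * n)).T

def add_deltas(F):
    """Append delta and acceleration coefficients (finite differences)."""
    d = np.gradient(F, axis=0)
    return np.hstack([F, d, np.gradient(d, axis=0)])
```

For a one-second signal at the assumed 16 kHz rate, `add_deltas(mfcc_from_lmfe(lmfe(x)))` yields one 39-dimensional vector per 10 ms frame; the transpose of `lmfe(x)` has the one-vector-per-column layout used for PCA training.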
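The accuracy measure described above (word matches divided by reference words) can be illustrated by the sketch below. It aligns the reference and the recognized word sequence by plain edit distance with unit costs and counts the matching words; this is an assumption made for illustration, since the HTK scoring tool uses its own alignment penalties and may count matches differently in ambiguous cases. The function name `percent_correct` is hypothetical.

```python
def percent_correct(ref, hyp):
    """Percentage of reference words matched under a minimum edit-distance alignment.

    ref, hyp: lists of words. Each table cell stores (edit cost, -matches), so
    taking the tuple minimum prefers the lowest cost and then the most matches.
    """
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = [(j, 0) for j in range(len(hyp) + 1)]      # empty reference: insertions only
    for i in range(1, len(ref) + 1):
        cur = [(i, 0)]                                # empty hypothesis: deletions only
        for j in range(1, len(hyp) + 1):
            match = ref[i - 1] == hyp[j - 1]
            diag = (prev[j - 1][0] + (0 if match else 1),
                    prev[j - 1][1] - (1 if match else 0))
            deletion = (prev[j][0] + 1, prev[j][1])
            insertion = (cur[j - 1][0] + 1, cur[j - 1][1])
            cur.append(min(diag, deletion, insertion))
        prev = cur
    matches = -prev[len(hyp)][1]
    return 100.0 * matches / len(ref)
```

For example, the reference `["a", "b", "c", "d"]` and the hypothesis `["a", "x", "c"]` align with one substitution and one deletion, leaving two matches out of four reference words.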

Results and Conclusions
In this paper, we proposed three feature selection algorithms based on an eigenvalue criterion in PCA. Overall, 36 experiments were performed. The results are compared to the 39-dimensional reference MFCC model and also to the PCA model trained from the whole corpus (PCA 100 %). Models were trained for 1 to 256 Gaussian mixtures. Tab. 4 shows that partially trained PCA models achieve comparable or even better results than classical PCA. The proposed method improves on the accuracies of the MFCC model for all mixture counts except 128, and on the accuracies of "PCA 100 %" for all mixture counts (values in italics in the table). Generally, the best results were achieved with the 0.05 % part of the training corpus at 4 mixtures (values in bold). Thus, a very small amount of speech data is enough to train PCA successfully. We suppose that the selected amount contains the most homogeneous data suitable for PCA training. Note that the acoustic models are always trained from the whole corpus, so there are enough data to estimate the parameters of the Gaussian mixtures. Our proposed method achieves better results at a lower number of Gaussian mixtures (1 to 8). We expect better results at higher mixture counts when a larger amount of speech data is available. This approach can speed up PCA training for large speech corpora. In the future, we will consider different kinds of input data for this method and its application to larger speech databases.

Fig. 2: Block diagram of the selection algorithm based on data block level analysis.

Tab. 3: Parameters for the algorithm based on vector level analysis.
Tab. 4: Recognition results [%] for the reference MFCC model, PCA model trained from the whole corpus and the partial-data PCA.