Improving the Slovak LVCSR Performance by Cluster-Sensitive Acoustic Model Retraining

In this paper, we present a cluster-dependent adaptation approach for HMM-based acoustic models. The proposed approach employs clustering techniques to group the original training utterances into a predefined number of clusters. The clustered speech data are used to adapt an initially pretrained acoustic model to the specific cluster by reestimation based on the standard Baum-Welch procedure. The resulting model, adapted to the homogeneous data, may markedly improve the baseline recognition rate, while the model complexity may be reduced. In the recognition step, the test samples are scored by each adapted model and the most accurate one is chosen. The proposed approach is thoroughly evaluated in a Slovak triphone-based large vocabulary continuous speech recognition (LVCSR) system. The results show that the cluster-sensitive retraining leads to significant improvements over the baseline reference system trained according to the conventional training procedure.


Introduction
An acoustic model (AM) plays an important role in any large vocabulary continuous speech recognition (LVCSR) system because its quality strongly affects the overall performance. Several approaches have been developed in the past to improve the baseline recognition by AM refinement. One of the most effective and powerful approaches is AM adaptation. In that case, a general model is adapted to the specific domain (gender, speaker, acoustic conditions, etc.) by advanced methods. The most popular adaptation methods are MLLR (maximum likelihood linear regression), MAP (maximum a posteriori) [1] and eigenvoices [2].
Besides these common adaptation methods, other strategies, such as clustering, are also employed to improve the acoustic model performance. The authors of [3] generated triphone clusters using decision-tree-based clustering for Bengali, a zero-resourced language. The clusters were used to generate tied-state triphones. Another approach to decision tree tying was presented in [4], where the authors employed segmental clustering of acoustic model components in an LVCSR system. As shown in [5], clustering may be applied to compact an acoustic model built from bootstrap to a reasonable size; multiple distance measures for clustering with optimization were investigated there. Another approach is focused on retraining, where the parameters of the original model are reestimated using the adaptation data. This strategy is often used in cross-language modeling tasks for zero-resourced languages, where the existing model of a low-resourced language is retrained on the untranscribed audio data [6].
In this paper, we present a fusion of the mentioned cluster analysis and acoustic model retraining without using any typical adaptation method. We utilize clustering to group the training set into crisp clusters, to which the general acoustic model is adapted through the standard Baum-Welch reestimation procedure. We show that the resulting model may significantly increase the overall performance, while the model size and its complexity may be reduced. The LVCSR system evaluation shows that the proposed method is effective and that it reduces the reference word error rate.
Section 2 describes the clustering approaches. Section 3 gives a description of the proposed method. The experimental setup is given in Section 4. The results are presented in Section 5 and, finally, the paper is concluded in Section 6.

Clustering Approaches
Clustering [7], [8], also known as unsupervised classification, is an important problem in the field of pattern recognition. Clustering partitions the input space into K regions according to some similarity or dissimilarity measure, where the value of K may be known a priori. The aim of clustering is to find a partition matrix U(X) of the given dataset $X = \{x_1, \ldots, x_n\}$ such that $\sum_{k=1}^{K}\sum_{j=1}^{n} u_{kj} = n$, where $u_{kj}$ is the membership of pattern $x_j$ to cluster $C_k$. The partition matrix U(X) of size K × n may be represented as
$$U = [u_{kj}], \qquad (1)$$
where k = 1, ..., K and j = 1, ..., n. Note that in crisp partitioning $u_{kj} = 1$ if $x_j \in C_k$ and $u_{kj} = 0$ otherwise.
In this section, we discuss several well-known partitional clustering techniques used in this study. These techniques include K-means clustering, Fuzzy C-means clustering, PAM (partitioning around medoids) and, finally, EM (expectation-maximization) model-based clustering.

K-Means Clustering
The K-means algorithm [8], [9] is an iterative clustering technique that evolves K crisp, compact, and hyperspherical clusters such that the measure
$$J = \sum_{j=1}^{n}\sum_{k=1}^{K} u_{kj}\,\lVert x_j - z_k \rVert^2 \qquad (2)$$
is minimized, where
$$z_k = \frac{1}{|C_k|}\sum_{x_i \in C_k} x_i$$
is the k-th cluster centroid, $|C_k|$ is the number of points and $x_i$ are the points belonging to cluster $C_k$, respectively. Note that n is the number of all points in the data set. The algorithm may converge to values that are not optimal, depending on the choice of the initial cluster centers. K-means is also not robust to outliers.
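The iteration described above can be sketched in a few lines. This is a minimal illustration, assuming the usual two-step scheme (assign each point to its nearest centroid, then recompute each centroid as the mean of its members); the function name `kmeans`, the random initialization from the data points and the handling of empty clusters are assumptions of this sketch, not details given in the paper.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100, seed=0):
    """Plain K-means on a list of equal-length float tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, new_labels) if l == c]
            if members:  # assumption: an emptied cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
    # Total squared distance of all points to their assigned centroids.
    cost = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, cost
```

Because the result depends on the initial centers, calling the function with different `seed` values may yield different partitions on less well-separated data.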

PAM Clustering
PAM clustering, also known as K-medoid clustering [10], is an extension of the K-means algorithm in which medoids are used instead of the cluster means. Like K-means, it tries to minimize the total error over the whole data set. However, it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances, which makes it more robust to noise and outliers than K-means. The steps of the K-medoid clustering technique closely follow those of K-means.

Fuzzy C-Means Clustering
Fuzzy C-means clustering [7], [10], [11] is a widely used and powerful unsupervised method that employs the principles of fuzzy sets to find a fuzzy partition matrix. Objects on the boundaries between several clusters are not forced to fully belong to one of the clusters, but rather are assigned membership degrees between 0 and 1 indicating their partial membership. The minimizing criterion used to define good clusters for Fuzzy C-means partitions is defined as:
$$J_{\mu}(X; U, Z) = \sum_{i=1}^{C}\sum_{k=1}^{n} (u_{ik})^{\mu}\, D^{2}(z_i, x_k), \qquad (3)$$
where U is a fuzzy partition matrix, $\mu \in [1, \infty)$ is the weighting exponent on each fuzzy membership, $Z = [z_1, \ldots, z_C]$ are the C cluster centers and $D(z_i, x_k)$ is the distance of $x_k$ from the i-th cluster center. According to [12], if $D(z_i, x_k) > 0$ for all i and k, then (U, Z) may minimize $J_{\mu}$ only if $\mu > 1$ and
$$u_{ik} = \frac{1}{\sum_{j=1}^{C}\left(\frac{D(z_i, x_k)}{D(z_j, x_k)}\right)^{\frac{2}{\mu-1}}} \qquad (4)$$
for 1 ≤ i ≤ C, 1 ≤ k ≤ n, and
$$z_i = \frac{\sum_{k=1}^{n} (u_{ik})^{\mu}\, x_k}{\sum_{k=1}^{n} (u_{ik})^{\mu}}, \qquad (5)$$
where 1 ≤ i ≤ C. A common strategy for generating approximate solutions of the minimization problem in Eq. (3) is to iterate through Eq. (4) and Eq. (5) (also known as the Picard iteration technique) [12].
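The alternating iteration through Eq. (4) and Eq. (5) can be sketched as follows. The sketch assumes the standard Fuzzy C-means updates, i.e. memberships $u_{ik} = 1 / \sum_j (D(z_i, x_k)/D(z_j, x_k))^{2/(\mu-1)}$ and centers equal to the $u_{ik}^{\mu}$-weighted mean of the points, with Euclidean distance for D. The function name, the explicit `init_centers` argument and the treatment of a point that coincides with a center are assumptions made for illustration.

```python
import math


def fuzzy_c_means(points, init_centers, mu=2.0, iters=100, tol=1e-9):
    """Alternate the membership and center updates until the centers settle."""
    c, n, dim = len(init_centers), len(points), len(points[0])
    centers = [tuple(z) for z in init_centers]
    u = [[0.0] * n for _ in range(c)]  # u[i][k]: membership of point k in cluster i
    power = 2.0 / (mu - 1.0)
    for _ in range(iters):
        # Membership update, Eq. (4).
        for k, x in enumerate(points):
            d = [math.dist(z, x) for z in centers]
            zero = [i for i in range(c) if d[i] == 0.0]
            for i in range(c):
                if zero:
                    # Assumption: a point lying exactly on a center belongs fully to it.
                    u[i][k] = 1.0 / len(zero) if i in zero else 0.0
                else:
                    u[i][k] = 1.0 / sum((d[i] / d[j]) ** power for j in range(c))
        # Center update, Eq. (5).
        new_centers = []
        for i in range(c):
            w = [u[i][k] ** mu for k in range(n)]
            total = sum(w)
            new_centers.append(tuple(
                sum(wk * x[a] for wk, x in zip(w, points)) / total
                for a in range(dim)
            ))
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return u, centers
```

A crisp partition, as needed for retraining, can be obtained from the result by assigning each point to the cluster with its largest membership.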

EM Clustering
This type of clustering is based on mixture models and assumes that the clusters follow some specific probability distribution. It aims to determine the parameters of the probability distribution that maximize the likelihood of the observed data [7]. The algorithm assumes a GMM (Gaussian Mixture Model) with K mixtures, mixture weights $\pi_k$, mean vectors $\mu_k$ and covariance matrices $\Sigma_k$. Two steps are executed in each iteration. The first is the E-step (expectation), where the probability of each point belonging to each cluster is calculated. The second is the M-step (maximization), which re-estimates the parameter vector of the probability distribution of each class [13]. The cost function of the clustering algorithm is defined as
$$L = \sum_{i=1}^{n} \log f(x_i), \qquad f(x_i) = \sum_{k=1}^{K} \pi_k\, f_k(x_i), \qquad (6)$$
where $f(x_i)$ is a Gaussian mixture density and $f_k(x_i)$ is the k-th mixture component [14]. EM clustering assumes that the clusters are normally distributed. If the clusters do not follow this distribution, the EM algorithm may fail to provide an appropriate partitioning [7].
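The E-step and M-step can be illustrated with a deliberately simplified sketch. It assumes one-dimensional data, so that each covariance matrix reduces to a scalar variance, and it assumes that the quantity being tracked is the log-likelihood of the mixture. The function name, the caller-supplied initial means and the variance floor are illustrative choices, not part of the described system.

```python
import math


def em_gmm_1d(xs, init_means, iters=50, var_floor=1e-6):
    """EM for a 1-D Gaussian mixture; returns hard labels and the parameters."""
    k = len(init_means)
    weights = [1.0 / k] * k
    means = list(init_means)
    variances = [1.0] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp, loglik = [], 0.0
        for x in xs:
            dens = [
                w * math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens)  # mixture density f(x)
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue  # assumption: leave a dead component unchanged
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_floor, var)
    # Hard cluster label: the component with the largest responsibility.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return labels, weights, means, variances, loglik
```

The multivariate case used for clustering 39-dimensional vectors follows the same two steps with full or diagonal covariance matrices.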

Internal Cluster Validation
The process of evaluating the results of a clustering algorithm is called cluster validity assessment. So-called validation indices are used to measure the "goodness" of a clustering result compared to other results, which were created by other clustering algorithms or by the same algorithm with different parameter values. In our work, the Dunn index [15] was used to perform the cluster validation:
$$DI = \frac{\min_{1 \le i < j \le K} d(C_i, C_j)}{\max_{1 \le k \le K} \operatorname{diam}(C_k)}, \qquad (7)$$
where $d(C_i, C_j)$ is the distance between clusters $C_i$ and $C_j$ and $\max_k \operatorname{diam}(C_k)$ is the maximum cluster diameter.
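A small sketch of the Dunn index follows. The text does not say how the inter-cluster distance and the cluster diameter are measured, so the sketch assumes one common choice: the smallest point-to-point Euclidean distance between two clusters, and the largest pairwise Euclidean distance inside a cluster. The function name is hypothetical.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Ratio of the smallest inter-cluster distance to the largest diameter.

    clusters: list of clusters, each a list of equal-length float tuples.
    """
    # Smallest distance between points of two different clusters.
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Largest distance between two points of the same cluster.
    max_diameter = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    if max_diameter == 0.0:
        return math.inf  # all clusters are single points or exact duplicates
    return min_separation / max_diameter
```

Larger values indicate compact, well-separated clusters, so when several clustering runs are compared, the run with the highest index would be preferred.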

Standard LVCSR System
In order to incorporate the clustering-based AM retraining into the standard LVCSR system, we had to modify its baseline components. Therefore, we first describe the standard LVCSR system illustrated in Fig. 1. The acoustic front-end is responsible for the appropriate feature extraction and, if needed, transformation. The features are fed to the decoder, where the most likely hypothesis is found using the vocabulary and the statistical knowledge from the acoustic and language models. The knowledge sources have to be trained beforehand using well-known training procedures.

Cluster-Sensitive Training
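The first step of the proposed cluster-sensitive training models each training recording by a GMM, with the probability $b(o_t)$ of generating an observation $o_t$ given by
$$b(o_t) = \sum_{m=1}^{M} \pi_m\, \mathcal{N}(o_t; \mu_m, \Sigma_m), \qquad (8)$$
where M is the number of mixture components, $\pi_m$ is the weight of the m-th mixture and $\mathcal{N}(o_t; \mu, \Sigma)$ is a multivariate Gaussian with mean vector $\mu$ and covariance matrix $\Sigma$ [16]. Note that we used M = 16 mixtures in all GMMs. The parameters were computed by the EM algorithm (see Section 2.4). The GMM computation produced mean and covariance mixture matrices of dimension 16 × 39 (mixtures × dimension of MFCCs). In order to perform clustering, it is necessary to find appropriate statistical representatives (vectors) of the GMM matrices. Therefore, we propose to compute the weighted mean vector (WMV) of each GMM matrix as [17]:
$$\mathrm{WMV} = \sum_{m=1}^{M} \pi_m\, \mu_m, \qquad (9)$$
where $\pi_m$ are the weights and $\mu_m$ are the mixture means. The WMV vectors were then used as input vectors for the subsequent clustering.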
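The weighted mean vector can be computed directly from the mixture weights and means. The sketch below assumes that the WMV is the weight-averaged mixture mean, $\sum_m \pi_m \mu_m$, which collapses a 16 × 39 mean matrix into a single 39-dimensional vector; the function name and the list-based data layout are illustrative.

```python
def weighted_mean_vector(weights, means):
    """Collapse an M x D matrix of mixture means into one D-dimensional vector.

    weights: M mixture weights (expected to sum to 1)
    means:   M mean vectors, each of length D
    """
    dim = len(means[0])
    return [
        sum(w * mu[d] for w, mu in zip(weights, means))
        for d in range(dim)
    ]
```

For example, with weights 0.25 and 0.75 and means (0, 4) and (4, 0), the result is the vector (3, 1), which lies closer to the mean of the heavier mixture.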
It is apparent from the procedure that the most important aspect of our adaptation is the clustering of the WMV vectors. We focused on four different numbers of clusters for each clustering algorithm (K = 2, K = 3, K = 5 and K = 10). The maximum number of clusters was determined by the minimum number of recordings required in one cluster together with the total number of training recordings. We expect that for a larger number of clusters (K > 10), underpopulated clusters might be produced and the reestimation cannot be done effectively. As mentioned before, the same clustering algorithm may converge to different cluster configurations at each run because the result depends on the initial choice of the parameters. For that reason, the cluster analysis was carried out 10 times and the best result was selected. The selection criterion was based on internal validation with the Dunn index (Eq. (7)). The clustering may also result in incorrect clusters, i.e. outlying clusters with a very small number of elements. Therefore, we defined the minimum count of each cluster as $|C_{k\min}| = 2500$ recordings (≈ 5 % of the complete set). The outlying clusters were joined to the nearest correct one in terms of minimum Euclidean distance.
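The handling of outlying clusters can be sketched as follows. The text states only that an undersized cluster is joined to the nearest correct one by minimum Euclidean distance; the sketch assumes that the distance is measured between cluster centroids and that the centroids of the valid clusters are fixed before merging. The function name is hypothetical.

```python
import math


def merge_small_clusters(clusters, min_count):
    """Join every cluster smaller than min_count to the nearest large cluster.

    clusters: list of clusters, each a list of equal-length float tuples.
    """
    def centroid(cluster):
        return [sum(col) / len(cluster) for col in zip(*cluster)]

    large = [list(c) for c in clusters if len(c) >= min_count]
    small = [c for c in clusters if len(c) < min_count]
    if not large:
        raise ValueError("no cluster reaches the minimum count")
    # Assumption: distances are taken to the centroids of the valid clusters
    # as they were before any merging.
    anchors = [centroid(c) for c in large]
    for c in small:
        center = centroid(c)
        target = min(range(len(large)), key=lambda i: math.dist(anchors[i], center))
        large[target].extend(c)
    return large
```

In the setting described above, `min_count` would be 2500 recordings and the vectors would be the WMV representatives of the recordings.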
Regarding the phonetic balance of the resulting clusters, we carried out an extensive phonetic analysis of the whole training part (see Fig. 2). This chart shows the real statistical counts of Slovak phonemes, including the noise-specific phones [18]. The training data are clearly not phonetically balanced because they reflect the real attributes of the Slovak language. Note that the data were not manually balanced afterwards. There is a high degree of variability between the counts, caused by the occurrence of the phonemes in real speech. The highest counts (more than 1 million occurrences) are typical for vowels and noise phones, and lower counts (around 300 000 occurrences and less) are typical for consonants. If we consider this nature of the training data and the fact that each cluster contains a sufficient amount of data ($|C_{k\min}| = 2500$), we expect that the correct clusters follow the same or a very similar phonetic distribution, probably with slight count variations (depending on K). In other words, we assume that the clusters are not phonetically balanced.
It is hard to determine how the LVCSR performance is affected by the phonetic distribution in each cluster. Determining this influence would require a comprehensive performance analysis. We assume that the phonetic balance of a cluster does not affect the overall performance markedly, as long as reasonable phoneme counts are kept in each cluster.
The correct clusters are finally used for acoustic model adaptation. It should be clarified that the parameters of the original AM (probabilities and mixtures of the HMMs) are adjusted and reestimated using the adaptation data of the specific speech cluster. We employed the standard Baum-Welch reestimation procedure to compute the new parameters [16]. To sum up, we do not utilize any typical adaptation algorithm (MLLR, MAP or eigenvoices) to adapt an acoustic model.
We also focused on the effect of the quality of the initial AM on the overall adaptation performance. In the most common adaptation tasks, a general AM, trained on the complete training set, is adapted to the desired domain. In our case, the general AM is denoted as the reference AM. However, we found that adapting a weak initial AM, pretrained only on a randomly selected training subset (e.g. 50 % of the complete set), is the key to a considerably increased LVCSR performance. This interesting fact also brings some benefits to our adaptation approach: for example, an adapted AM, originally pretrained on just 25 % of the data, may achieve a markedly lower WER than the AM originally trained on the full data. In our evaluation, we analysed four partially pretrained AMs: P = 10 %, P = 25 %, P = 50 % and P = 75 %, where P defines the size of the randomly selected subset.

Modified Recognition Phase
In order to evaluate the proposed method, it was necessary to modify the standard recognition process. Compared to the standard LVCSR system, the modification concerns the decoding, because K-pass decoding is required for each test sample, where K is the number of adapted AMs. In each pass, the word-level error rate is computed using the reference transcriptions, and after all passes, the minimum WER is determined and accumulated. This procedure is repeated for each recording. Finally, the overall LVCSR performance is evaluated in the form of a global WER, computed by averaging the accumulated WER values. The training and recognition phases of our adaptation approach are depicted in detail in Fig. 3.
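The selection and averaging step can be sketched as below, assuming the per-recording WER of every adapted model has already been computed against the reference transcriptions. The function name and the nested-list layout are illustrative and not part of the original system.

```python
def global_wer(per_model_wer):
    """Pick the best adapted model per recording and average the minima.

    per_model_wer[r][k]: WER (%) of recording r decoded with adapted model k.
    Returns the global WER and the index of the chosen model per recording.
    """
    best_wer = [min(row) for row in per_model_wer]
    best_model = [row.index(min(row)) for row in per_model_wer]
    return sum(best_wer) / len(best_wer), best_model
```

Note that, as in the procedure described above, the choice of the best model relies on the reference transcriptions, so the sketch describes an evaluation protocol rather than a deployable model selector.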
Finally, it is interesting to compare the standard LVCSR system with the modified one, based on AM retraining. The training phase of the standard system (Fig. 1) is extended by clustering-related steps and AM retraining (Fig. 3). As mentioned, the recognition requires K-pass decoding with a separate WER evaluation in each pass, where the best adapted AM is chosen for each test recording. This is the main reason why the proposed method performs better than the standard one.
© 2015 ADVANCES IN ELECTRICAL AND ELECTRONIC ENGINEERING

Experimental Setup
The Slovak parliamentary corpus ParDat1 [19] used in our study contains approx. 100 hours of spontaneous parliamentary speech. The training part comprises 50876 utterances collected from 120 speakers (≈ 90 % of them men). The testing database includes another 884 phonetically balanced recordings with a total duration of up to 3 hours.
Throughout the experiments, the standard MFCC (Mel Frequency Cepstral Coefficient) features with cepstral mean normalization (CMN) were extracted, including their first and second derivatives and log energy, resulting in 39-dimensional vectors.
The LVCSR system employed cross-word, three-state, left-to-right, tied-state context-dependent triphone HMM (Hidden Markov Model) acoustic models. All acoustic models were trained in the maximum likelihood (ML) sense with GMM (Gaussian Mixture Model) density functions. At the end of the ML training process, about 12000 final triphones were produced and modeled with 32 Gaussians per state for each acoustic model, according to the reference training setup of the HTK toolkit [16]. The LVCSR decoder employed a bigram language model [20] and a vocabulary containing approximately 125000 unique, phonetically transcribed words.
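For the LVCSR system evaluation, we chose the word-level error rate (WER) computed as:
$$\mathrm{WER} = \frac{S + D + I}{K} \cdot 100\,\%, \qquad (10)$$
where S is the number of substitutions, D is the number of deletions, I is the number of insertions and K is the total number of reference words [16]. Note that K here denotes the number of reference words, not the number of clusters.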
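For a single utterance, the WER can be sketched as a word-level edit distance divided by the reference length. The sketch assumes that S + D + I is taken from the minimum-cost alignment with unit costs for substitutions, deletions and insertions; the alignment used by a particular scoring tool may weight errors differently. The function name is hypothetical.

```python
def wer(reference, hypothesis):
    """Word error rate in percent between two whitespace-tokenized strings."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[i][j]: minimum number of edits turning ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```

With a four-word reference and a hypothesis that has one wrong word and one missing word, two errors are counted and the function returns 50.0.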
Finally, we note that the computation of the weighted mean vectors, the clustering, the internal cluster validation, the handling of outlying clusters and the evaluation were carried out in the MATLAB programming environment. The feature extraction, GMM modeling, acoustic model training and retraining, and the decoding were performed using the HTK Toolkit.

Experimental Evaluation
The experimental results are given in Fig. 4, Fig. 5, Fig. 6 and Fig. 7. In the case of adaptation to K = 3 clusters (Fig. 5), the reference WER is reduced for all methods for the initial models with P = 25 % and P = 10 %. The minimum value of WER for this setup is 10.55 % for PAM; thus, $WER_{ref}$ was reduced by 1.90 % absolute.
The adaptation to 5 clusters (Fig. 6) behaves very similarly. The highest reduction in WER was measured for Fuzzy C-means and the initial model with P = 25 %, giving a WER of 9.67 %. This value is also the absolute minimum WER achieved by the proposed method in the whole evaluation. In that case, $WER_{ref}$ was reduced by 2.78 % absolute, which corresponds to a relative LVCSR performance improvement of 22.33 %.
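The relative improvement quoted above follows from the reference value $WER_{ref} = 12.45\,\%$ reported for the baseline system and the best adapted result of 9.67 %. As a worked check, assuming the relative improvement is defined as the absolute reduction divided by the reference WER:

```latex
\frac{WER_{ref} - WER_{min}}{WER_{ref}}
  = \frac{12.45 - 9.67}{12.45}
  = \frac{2.78}{12.45}
  \approx 0.2233 = 22.33\,\%
```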
Finally, the chart in Fig. 7 shows that the adaptation to 10 clusters clearly outperformed the reference system for all methods in the case of P = 10 %, 25 % and 50 %. For greater values of P, the values of WER began to rise. This adaptation yielded a minimum WER of 9.68 %, again for Fuzzy C-means and the initial model with P = 25 % ($WER_{ref}$ reduced by 2.77 % absolute).
From a global point of view, we can conclude that the lowest values of WER were achieved with EM and Fuzzy C-means clustering and most often with initial AMs with P = 25 %. We state that this type of AM is the most suitable for cluster-sensitive adaptation. We can also observe that initial AMs with P = 10 % and P = 50 % yield considerable improvements in some configurations. We found that the number of clusters does not have a crucial impact on the overall performance. It seems that the optimal values of K are K = 5 and K = 10. Note that the initial AM trained on the complete set (P = 100 %) gives the worst results after adaptation in almost all experiments, regardless of the clustering method. We have shown that for our adaptation approach it is sufficient to use a weak, imprecise AM, which yields significantly lower levels of WER than the fully trained adapted AM. Additionally, the adaptation of a less complex initial AM is also less computationally expensive, which is a much desired feature for LVCSR systems.
In order to demonstrate the effectiveness of the presented adaptation approach, we contrast the performance of our adapted LVCSR system with two related, recently published works, in which similar LVCSR systems employing conventional adaptation techniques were described. The first work [21] is focused on an MLLR-based speaker adaptation task for a Czech LVCSR system with two different clustering methods (a knowledge-based and an automated one). The authors report relative improvements in the range of 16.68 % to 20.91 %, depending on the clustering method and the number of regression trees for MLLR. The second one [22] presents an on-line adaptation using KSVD-based acoustic clustering for real-time applications, where the adaptation performance was evaluated in a UK English LVCSR task. The authors reported relative WER reductions in the range of 2.0 % to 6.1 %, although WER increases were also observed. It can be concluded that the performance of the presented adaptation approach based on model retraining is competitive with other state-of-the-art adaptation techniques.

Conclusion and Future Work
In this work, we presented a cluster-sensitive adaptation for HMM-based acoustic models. We showed that our adaptation is able to reduce the reference WER significantly. This fact suggests the suitability of this method for LVCSR systems. In future work, we intend to refine the recognition process by selecting the appropriate AM without the necessity of performing K-pass decoding.

Fig. 2: Occurrences of Slovak phonemes in the training set.

Fig. 3: Block diagram of training and recognition phase of the proposed adaptation approach.
The aim of the proposed clustering-dependent AM retraining is to partition the complete training set into K disjoint clusters, where each cluster contains recordings with similar statistical attributes.
The performance of the reference LVCSR system, trained on standard MFCCs, is depicted with a red line and its value is $WER_{ref} = 12.45\,\%$. Thus, each value of WER falling below the red line indicates an improvement over the reference. The reference acoustic model was trained on the complete set (P = 100 %). If we first compare the results for K = 2 clusters in Fig. 4, we can observe that the reference WER is decreased for all clustering methods at the same time only for P = 25 % to 75 %. In other cases, the reference WER is improved only for Fuzzy C-means and EM clustering. The minimum value of WER for this setup is 11.32 % for EM clustering; thus, $WER_{ref}$ was reduced by 1.13 % absolute.