Scaled norm-based Euclidean projection for sparse speaker adaptation

To reduce the data storage required for speaker adaptive (SA) models, we previously proposed a sparse speaker adaptation method that efficiently reduces the number of adapted parameters by using Euclidean projection onto the L1-ball (EPL1) while maintaining recognition performance comparable to maximum a posteriori (MAP) adaptation. In the EPL1-based sparse speaker adaptation framework, however, the adapted components of the Gaussian mean vectors are mostly concentrated on dimensions having large variances, because unit variance is assumed for all dimensions. To make EPL1 more flexible, in this paper we propose scaled norm-based Euclidean projection (SNEP), which can take dimension-specific variances into account. Using SNEP, we also propose a new sparse speaker adaptation method that considers the variances of a speaker-independent model. Our experiments show that, with the proposed method, the adapted components of the mean vectors are evenly distributed over all dimensions, and sparsely adapted models are obtained with no loss of phone recognition performance compared with MAP adaptation.


Introduction
Modern server-based speech recognition systems (SRSs) serve millions of users. For this reason, reducing the data storage needed for speaker adaptive (SA) acoustic models becomes an important issue when speaker adaptation is used to enhance speech recognition performance. There are various adaptation methods for Gaussian mixture model-hidden Markov model (GMM-HMM)-based SRSs [1][2][3][4][5]. Among these methods, maximum a posteriori (MAP) speaker adaptation is the most conventional and powerful when a relatively large amount of adaptation data, about 20 min to 10 h, is available [6,7]. An SA model obtained by MAP adaptation requires as much data storage as the speaker-independent (SI) model, and the SI model typically has billions of parameters. Olsen et al. showed that most of the adapted parameters obtained by MAP adaptation are not closely related to speech recognition performance [6,7]. To restrict the redundant parameter adjustments, they proposed sparse MAP (SMAP) adaptation, in which a typical MAP objective is maximized under certain sparse constraints. In the SMAP approach, two sets of optimization parameters need to be controlled. The first set is related to the parameter regularization used in typical MAP adaptation. The second set is used to restrict the redundant parameter adjustments. However, the more parameters there are, the harder they become to tune, because they are chosen empirically to give the best recognition performance.
To resolve this problem, in our previous work we first reinterpreted MAP adaptation as a constrained optimization problem with an L2 norm-based constraint [8,9]. To obtain sparsely updated SA models, we replaced the L2 norm-based constraint with an L1 norm-based constraint. From this modification, we proposed a sparse adaptation method based on Euclidean projection onto the L1-ball (EPL1) [10], which requires only a single control parameter. With that method, we showed that SA models requiring less data storage than those of the SMAP adaptation method can be obtained with almost no loss of phone recognition performance. Although the number of control parameters is dramatically reduced, EPL1-based speaker adaptation still has the limitation that variances cannot be considered. Because of this limitation, mainly the parameters having large variances are adapted during the adaptation step. However, we believe that parameters with small variances can also reflect speaker characteristics. Thus, in this paper, we propose scaled norm-based Euclidean projection (SNEP), a generalized version of EPL1 that utilizes dimension-specific variances. Based on the SNEP framework, we also propose a new sparse speaker adaptation method. Our experiments show that the proposed SNEP-based speaker adaptation method can sparsely adapt the SI model (only about 9 % of the total number of parameters) with no loss of phone recognition performance against MAP adaptation.
The rest of this paper is organized as follows. In Section 2, we introduce EPL1 and the piecewise root finding (PRF) method, which is a well-known solver for EPL1 [11,12]. In Section 3, starting from the derivation of EPL1, we describe the modified optimization problem and how to find the optimal solution of SNEP. In Section 4, we briefly review MAP- and EPL1-based speaker adaptation. In Section 5, we describe our SNEP-based sparse speaker adaptation method using the variances of the SI model. In Section 6, we analyze our experimental results on adapted mean vectors and speech recognition performance. We conclude this paper in Section 7.

Euclidean projection onto the L1-ball
Euclidean projection onto the L1-ball (EPL1) is widely used in gradient projection methods [13][14][15][16][17][18], which find the optimal sparse solution of a constrained optimization problem given by
$$\min_{x} \mathcal{L}(x) \quad \text{s.t.} \quad \|x\|_1 \le c, \tag{1}$$
where $\mathcal{L}: \mathbb{R}^D \to \mathbb{R}$ is a convex and differentiable loss function, $\|\cdot\|_1$ is the L1 norm operator enforcing the sparse solution, and $c$ is a constant controlling regularization and sparsity, that is, how many zeros are in the optimal solution vector. Gradient projection with Nesterov's method [19][20][21][22] is an optimal first-order black-box method and can find the optimal solution of (1) by generating a sequence $\{x_k\}$ obtained from
$$x_{k+1} = Q_{L_1}\big(s_k - \eta_k \nabla\mathcal{L}(s_k)\big), \tag{2}$$
where $s_k = x_k + \alpha_k (x_k - x_{k-1})$, $\alpha_k$ and $\eta_k$ are learning rates selected by certain rules [23], $\nabla\mathcal{L}(s_k)$ is the gradient of $\mathcal{L}(\cdot)$ at $s_k$, and $Q_{L_1}(y)$ is the EPL1 problem defined as
$$Q_{L_1}(y) = \arg\min_{x} \frac{1}{2}\|x - y\|_2^2 \quad \text{s.t.} \quad \|x\|_1 \le c, \tag{3}$$
where $\|\cdot\|_2^2$ is the squared L2 norm operator. In practice, (3) is modified into another constrained optimization problem given by
$$u^* = \arg\min_{u} \frac{1}{2}\|u - z\|_2^2 \quad \text{s.t.} \quad \|u\|_1 \le c, \; u \succeq 0, \tag{4}$$
where $z$ is composed of the absolute values of the components of $y$, $\succeq$ denotes component-wise inequality, and $0$ is a vector with all zero components. The optimal solution of (3) can be obtained by
$$Q_{L_1}(y) = \mathrm{sign}(y) \odot u^*, \tag{5}$$
where $\mathrm{sign}(\rho)$ returns the vector whose components are the signs of the components of $\rho$, $\odot$ is component-wise multiplication of two vectors, and $u^*$ is the optimal solution of (4), which can be found from the Lagrangian function
$$L_{\mathrm{EPL1}}(\lambda, \kappa, u) = \frac{1}{2}\|u - z\|_2^2 + \lambda\big(\|u\|_1 - c\big) - \kappa^\top u, \tag{6}$$
where $\lambda$ and $\kappa$ are the Lagrangian multipliers. We assume that the optimal value $\lambda^*$ is known and that $\|z\|_1 > c$.
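Since the components in (6) can be decoupled, the closed-form solution is as follows [10]:
$$u_i^* = \max(z_i - \lambda^*, 0), \tag{7}$$
where $i$ is the component index. With the optimal vector $u^*$, the constraints of (4) can be expressed as
$$\sum_{i=1}^{D} u_i^* = \sum_{i=1}^{D} \max(z_i - \lambda^*, 0) = c. \tag{8}$$
To find the optimal value of $\lambda$, a piecewise linear function [11,12] is used, which is given by
$$f(\lambda) = \sum_{i \in R_\lambda} (z_i - \lambda) - c = \sum_{i \in R_\lambda} z_i - |R_\lambda|\,\lambda - c, \tag{9}$$
where $R_\lambda = \{i \mid i \in \{1, \ldots, D\},\; z_i > \lambda\}$ and $|R|$ is the number of elements in the set $R$. Figure 1 illustrates $f(\lambda)$ and a first-order gradient-based iterative method called piecewise root finding (PRF) [12] for finding the optimal value of $\lambda$.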
With the PRF method, we can generate a sequence $\{\lambda_k\}$ via
$$\lambda_{k+1} = \lambda_k - \frac{f(\lambda_k)}{f'(\lambda_k)} = \lambda_k + \frac{f(\lambda_k)}{|R_{\lambda_k}|} \tag{10}$$
until $f(\lambda_k) = 0$ is satisfied. As shown in Fig. 1, each $\lambda_k$ for $k \ge 1$ is the root of a tangent line. To determine the set $R_{\lambda_k}$, every component of $z$ needs to be compared with $\lambda_k$. If the initial value of $\lambda$ is set to 0, the sequence $\{\lambda_k\}$ is non-decreasing. Owing to this property, in the kth step we can skip the comparisons for the components already found to be no larger than $\lambda_{k-1}$.
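The PRF iteration can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the standard form of the piecewise linear function, f(λ) = Σ_{i∈R_λ} (z_i − λ) − c, and the function name `project_onto_l1_ball` and the tolerance `tol` are hypothetical.

```python
import math
from typing import List


def project_onto_l1_ball(y: List[float], c: float, tol: float = 1e-12) -> List[float]:
    """Project y onto {x : ||x||_1 <= c} using piecewise root finding (PRF)."""
    if c <= 0.0:
        raise ValueError("the radius c must be positive")
    z = [abs(v) for v in y]  # magnitudes; the signs are restored at the end
    if sum(z) <= c:
        return list(y)  # y is already inside the ball

    lam = 0.0  # starting at 0 keeps the sequence of lambdas non-decreasing
    active = [i for i, zi in enumerate(z) if zi > lam]  # the set R_lambda
    # In exact arithmetic each step drops an index or reaches the root,
    # so len(z) + 1 iterations are enough.
    for _ in range(len(z) + 1):
        f = sum(z[i] - lam for i in active) - c
        if f <= tol:
            break
        lam += f / len(active)  # root of the tangent line at the current lambda
        # Indices dropped earlier stay dropped, so only survivors are compared.
        active = [i for i in active if z[i] > lam]

    return [math.copysign(max(zi - lam, 0.0), v) for v, zi in zip(y, z)]


x = project_onto_l1_ball([3.0, -0.5, 0.25], c=2.0)
print(x)
```

Because the active set only shrinks, the comparison cost per step decreases as the iteration proceeds, which is the saving the non-decreasing property provides.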

Scaled norm-based Euclidean projection
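Basically, the L2 and L1 norms in EPL1 can be interpreted as a multivariate Gaussian distribution with unit variance and a multivariate Laplace distribution with unit standard deviation, respectively [24]. Hence, every component in EPL1 is treated equally during optimization, without considering any scaling parameters such as dimension-specific variances and standard deviations. For this reason, we propose a scaled norm-based Euclidean projection (SNEP) method, which is a more generalized version of EPL1. The proposed constrained optimization problem for SNEP is given by
$$u^* = \arg\min_{u} \sum_{i=1}^{D} \frac{(u_i - z_i)^2}{2\sigma_{2,i}^2} \quad \text{s.t.} \quad \sum_{i=1}^{D} \frac{u_i}{\sigma_{1,i}} \le c, \; u \succeq 0, \tag{11}$$
where $\sigma_{2,i}$ and $\sigma_{1,i}$ denote the scaling parameters for the L2 and L1 norms, respectively. As shown in (11), any dimension-specific scaling parameters can be applied in the SNEP framework. The Lagrangian function of (11) and its derivative with respect to $u_i$ are given by
$$L_{\mathrm{SNEP}}(\lambda, u) = \sum_{i=1}^{D} \frac{(u_i - z_i)^2}{2\sigma_{2,i}^2} + \lambda\left(\sum_{i=1}^{D} \frac{u_i}{\sigma_{1,i}} - c\right), \tag{12}$$
$$\frac{dL_{\mathrm{SNEP}}(\lambda, u)}{du_i} = \frac{u_i - z_i}{\sigma_{2,i}^2} + \frac{\lambda}{\sigma_{1,i}}. \tag{13}$$
By setting $dL_{\mathrm{SNEP}}(\lambda, u)/du_i = 0$ and considering the complementary slackness KKT condition for the non-negativity constraint, the optimal value $u_i^*$ is given by
$$u_i^* = \max\left(z_i - \lambda^* \frac{\sigma_{2,i}^2}{\sigma_{1,i}},\, 0\right) \tag{14}$$
with the optimal value $\lambda^*$. By using (14), the piecewise linear function for SNEP is given by
$$f_{\mathrm{SNEP}}(\lambda) = \sum_{i \in R_\lambda} \frac{1}{\sigma_{1,i}}\left(z_i - \lambda \frac{\sigma_{2,i}^2}{\sigma_{1,i}}\right) - c, \quad R_\lambda = \left\{i \,\middle|\, z_i > \lambda \frac{\sigma_{2,i}^2}{\sigma_{1,i}}\right\}, \tag{15}$$
and the PRF method can also be used to find the optimal solution of SNEP with the initial condition $\lambda_0 = 0$.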
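As an illustration of how PRF carries over to SNEP, the following sketch solves the scaled problem for a non-negative input vector. It rests on an assumed parameterization, namely an objective Σ_i (u_i − z_i)² / (2σ²_{2,i}) with the constraint Σ_i u_i / σ_{1,i} ≤ c, so that the stationarity condition gives u_i = max(z_i − λ σ²_{2,i} / σ_{1,i}, 0). The function name `snep_project` and the helper variable `shrink` are hypothetical.

```python
from typing import List


def snep_project(z: List[float], sigma2: List[float], sigma1: List[float],
                 c: float, tol: float = 1e-12) -> List[float]:
    """Scaled projection of a non-negative vector z, solved by PRF.

    Assumed problem: min sum_i (u_i - z_i)^2 / (2 * sigma2_i^2)
    subject to sum_i u_i / sigma1_i <= c and u >= 0.
    """
    if c <= 0.0:
        raise ValueError("c must be positive")
    if sum(zi / s1 for zi, s1 in zip(z, sigma1)) <= c:
        return list(z)  # constraint inactive: z itself is optimal

    # Stationarity gives u_i = max(z_i - lam * shrink_i, 0).
    shrink = [s2 * s2 / s1 for s2, s1 in zip(sigma2, sigma1)]
    lam = 0.0
    active = [i for i, zi in enumerate(z) if zi > 0.0]
    for _ in range(len(z) + 1):
        f = sum((z[i] - lam * shrink[i]) / sigma1[i] for i in active) - c
        if f <= tol:
            break
        slope = sum(shrink[i] / sigma1[i] for i in active)  # equals -f'(lam)
        lam += f / slope  # Newton step on the piecewise linear function
        active = [i for i in active if z[i] > lam * shrink[i]]
    return [max(zi - lam * sh, 0.0) for zi, sh in zip(z, shrink)]


# With unit scales the routine reduces to the non-negative EPL1 problem.
print(snep_project([2.0, 2.0], [1.0, 2.0], [1.0, 2.0], c=1.5))
print(snep_project([3.0, 0.5, 0.25], [1.0] * 3, [1.0] * 3, c=2.0))
```

The only changes relative to EPL1 are the per-dimension shrinkage applied to λ and the weighted slope used in the update.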

Previous work for speaker adaptation
For a better understanding of our previous sparse speaker adaptation method, MAP-based speaker adaptation is described first. Let Φ = {π, A, Θ} be the whole parameter set of the HMMs, where π is the initial state distribution, A is the transition probability matrix, and Θ is the set of GMMs for every state. The GMM distribution of state s is given as follows:
$$p(x \mid s) = \sum_{g=1}^{M} w_{g,s}\, N(x; \mu_{g,s}, \Sigma_{g,s}), \tag{16}$$
where $N(\cdot)$ is a normal distribution, $M$ is the number of Gaussian components, and $w_{g,s}$, $\mu_{g,s}$, and $\Sigma_{g,s}$ are the weight, mean vector, and covariance matrix of Gaussian component $g$, respectively. In this paper, $\Sigma_{g,s}$ is a diagonal matrix whose diagonal components are $[(\sigma_{1,g,s})^2, (\sigma_{2,g,s})^2, \ldots, (\sigma_{D,g,s})^2]^\top$. Since MAP adaptation is typically performed on a single state to adjust the GMM parameters, we omit the state index s and describe every procedure in terms of the GMM framework. In addition, since it is well known that adapting mixture weights and variances is not helpful for recognition performance, we focus on how to adapt the mean vectors only.
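In order to adapt the mean vectors of the SI model, the MAP adaptation process is composed of two major stages. In the first stage, the posterior probability of Gaussian component $g$ is computed for each of the $N$ adaptation frames $x_1, \ldots, x_N$ with the SI model:
$$p(g \mid x_n) = \frac{w_g\, N(x_n; \mu_g^{SI}, \Sigma_g)}{\sum_{m=1}^{M} w_m\, N(x_n; \mu_m^{SI}, \Sigma_m)}. \tag{17}$$
With the probability of Gaussian component $g$, we then compute the ML mean vector:
$$\mu_g^{ML} = \frac{1}{n_g} \sum_{n=1}^{N} p(g \mid x_n)\, x_n, \tag{18}$$
where
$$n_g = \sum_{n=1}^{N} p(g \mid x_n) \tag{19}$$
is called the posterior sum. In the second stage, $\mu_g^{ML}$ is used to obtain the adapted mean vector from the SI model, which is given by
$$\mu_g^{MAP} = \frac{n_g\, \mu_g^{ML} + \tau\, \mu_g^{SI}}{n_g + \tau}, \tag{20}$$
where $\tau$ is called the relevance factor, which controls the balance between $\mu_g^{ML}$ and $\mu_g^{SI}$. By modifying (20), we can obtain
$$\varphi_g^{MAP} = \mu_g^{MAP} - \mu_g^{SI} = \frac{n_g}{n_g + \tau}\left(\mu_g^{ML} - \mu_g^{SI}\right). \tag{21}$$
From (21), it is noticeable that $\varphi_g^{MAP}$ is the same as the optimal solution of the following constrained optimization problem:
$$\varphi_g^{MAP} = \arg\min_{\varphi} \frac{1}{2}\left\|\varphi - \left(\mu_g^{ML} - \mu_g^{SI}\right)\right\|_2^2 \quad \text{s.t.} \quad \|\varphi\|_2 \le \frac{n_g}{n_g + \tau}\left\|\mu_g^{ML} - \mu_g^{SI}\right\|_2. \tag{22}$$
This constrained optimization problem is illustrated in Fig. 2 from a geometric perspective. The shaded region represents the constraint of (22), and the outer circle indicates the constraint when $n_g$ goes to infinity. As also shown in Fig. 2, the L2 norm-based constraint is responsible for most of the small and redundant adjustments, which are negligible in terms of speech recognition performance. By replacing the constraint of (22) with an L1 norm-based constraint, we can efficiently restrict the redundant adjustments. The modified constrained optimization problem is given by
$$\min_{\varphi} \frac{1}{2}\left\|\varphi - \left(\mu_g^{ML} - \mu_g^{SI}\right)\right\|_2^2 \quad \text{s.t.} \quad \|\varphi\|_1 \le \frac{n_g}{n_g + \tau}\left\|\mu_g^{ML} - \mu_g^{SI}\right\|_1. \tag{23}$$
The constrained optimization problem in (23) is exactly the same as EPL1 except for the constraint. As can be seen in (23), the right-hand side of the constraint is not the constant $c$ of the previous section but a quantity depending mostly on $n_g$ and $\tau$. The posterior sum $n_g$ is naturally determined by the amount of adaptation data. Also, $n_g$ accounts for the asymptotic property of adaptation, that is, the relaxation of the regularization effect, including sparsity, as the adaptation data increase. Thus, for speaker adaptation, the parameter $\tau$ takes charge of controlling the sparsity and regularization instead of the parameter $c$. Figure 3 shows how the optimal solution, indicated by the red cross, can be a sparse vector.
Fig. 2 Geometric interpretation for MAP-based speaker adaptation
Fig. 3 Geometric interpretation for EPL1-based sparse speaker adaptation
Before finding the optimal solution of (23), we first define a vector which is given by
$$\psi_g = \left|\mu_g^{ML} - \mu_g^{SI}\right|, \tag{24}$$
where $|\rho|$ returns the vector of the absolute values of the components of $\rho$. To find the optimal solution of (23), we use $\psi_g$ in the following steps. The Lagrangian form of (23) is given by
$$L_{\mathrm{SA\text{-}EPL1}}(\lambda, \kappa, \varphi) = \frac{1}{2}\|\varphi - \psi_g\|_2^2 + \lambda\left(\|\varphi\|_1 - \frac{n_g}{n_g + \tau}\|\psi_g\|_1\right) - \kappa^\top \varphi. \tag{25}$$
As described in Section 2, after being decoupled, the closed-form solution of (23) with the optimal value $\lambda^*$ and the piecewise linear function in terms of $\lambda$ are given by
$$\varphi_{i,g}^{\mathrm{SA\text{-}EPL1}} = \max(\psi_{i,g} - \lambda^*, 0), \tag{26}$$
$$f_{\mathrm{SA\text{-}EPL1}}(\lambda) = \sum_{i \in R_\lambda} (\psi_{i,g} - \lambda) - \frac{n_g}{n_g + \tau}\|\psi_g\|_1, \quad R_\lambda = \{i \mid \psi_{i,g} > \lambda\}. \tag{27}$$
As also described in Section 2, $\lambda^*$ can be obtained from the sequence $\{\lambda_k\}$ generated from $f_{\mathrm{SA\text{-}EPL1}}(\lambda)$, which is given by
$$\lambda_{k+1} = \lambda_k + \frac{f_{\mathrm{SA\text{-}EPL1}}(\lambda_k)}{|R_{\lambda_k}|} \tag{28}$$
until $f_{\mathrm{SA\text{-}EPL1}}(\lambda_k) = 0$ is satisfied. Thus, the final adapted mean vector from EPL1-based sparse speaker adaptation is given as follows:
$$\mu_g^{\mathrm{SA\text{-}EPL1}} = \mu_g^{SI} + \mathrm{sign}\left(\mu_g^{ML} - \mu_g^{SI}\right) \odot \varphi_g^{\mathrm{SA\text{-}EPL1}}. \tag{29}$$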
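To make the contrast between MAP and EPL1-based adaptation concrete, the sketch below adapts a single mean vector with both methods. It is an illustration under stated assumptions, not the authors' code: the L1 budget is assumed to be n_g / (n_g + τ) times the L1 norm of the ML offset, the relevance factor is assumed to satisfy n_g + τ > 0, and the names `shrink_to_l1_budget`, `map_mean`, and `epl1_mean` are hypothetical.

```python
import math
from typing import List


def shrink_to_l1_budget(psi: List[float], c: float, tol: float = 1e-12) -> List[float]:
    """Non-negative EPL1 by PRF: min ||phi - psi||^2 s.t. sum(phi) <= c, phi >= 0."""
    if c <= 0.0:
        return [0.0] * len(psi)  # no budget: nothing is adapted
    if sum(psi) <= c:
        return list(psi)  # budget not binding: keep the full offset
    lam = 0.0
    active = [i for i, p in enumerate(psi) if p > 0.0]
    for _ in range(len(psi) + 1):
        f = sum(psi[i] - lam for i in active) - c
        if f <= tol:
            break
        lam += f / len(active)
        active = [i for i in active if psi[i] > lam]
    return [max(p - lam, 0.0) for p in psi]


def map_mean(mu_si: List[float], mu_ml: List[float], n_g: float, tau: float) -> List[float]:
    """MAP mean: interpolation between the ML mean and the SI mean."""
    return [(n_g * ml + tau * si) / (n_g + tau) for si, ml in zip(mu_si, mu_ml)]


def epl1_mean(mu_si: List[float], mu_ml: List[float], n_g: float, tau: float) -> List[float]:
    """Sparse mean: shrink the magnitudes of the ML offset, then restore the signs."""
    diff = [ml - si for si, ml in zip(mu_si, mu_ml)]
    psi = [abs(d) for d in diff]
    budget = n_g / (n_g + tau) * sum(psi)  # assumed right-hand side of the L1 constraint
    phi = shrink_to_l1_budget(psi, budget)
    return [si + math.copysign(p, d) for si, p, d in zip(mu_si, phi, diff)]


mu_si, mu_ml = [0.0, 0.0, 0.0], [3.0, -0.5, 0.25]
print(map_mean(mu_si, mu_ml, n_g=4.0, tau=5.0))   # every dimension moves
print(epl1_mean(mu_si, mu_ml, n_g=4.0, tau=5.0))  # only the largest offset moves
```

Only the dimensions whose offsets survive the shrinkage need to be stored per speaker, which is where the storage saving comes from.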

SNEP-based sparse speaker adaptation
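In a GMM-HMM SRS, each dimension of the Gaussian components typically has a different variance, reflecting the dynamic range of that dimension. In Section 4, we described the procedure for EPL1-based speaker adaptation, which is unable to consider the dimension-specific variances. As a result, the adapted dimensions of the mean vectors are mostly concentrated on the dimensions having large variances. Without considering the variances, the mean vectors adapted by EPL1 are not able to fully represent speaker-specific variability, which may cause a loss of recognition performance. In this paper, we propose a new sparse speaker adaptation method using SNEP, which can apply the variances of the SI model. Again, the proposed method utilizes $\psi_g$ in all steps. The proposed constrained optimization problem for sparse speaker adaptation is given by
$$\varphi_g^{\mathrm{SA\text{-}SNEP}} = \arg\min_{\varphi} \sum_{i=1}^{D} \frac{(\varphi_i - \psi_{i,g})^2}{2\left(\sigma_{i,g}^{SI}\right)^2} \quad \text{s.t.} \quad \sum_{i=1}^{D} \frac{\varphi_i}{\sigma_{i,g}^{SI}} \le \frac{n_g}{n_g + \tau} \sum_{i=1}^{D} \frac{\psi_{i,g}}{\sigma_{i,g}^{SI}}, \; \varphi \succeq 0. \tag{30}$$
In (30), note that the same standard deviation $\sigma_{i,g}^{SI}$ is shared by the objective function and the sparse constraint. The Lagrangian function of (30) is given by
$$L_{\mathrm{SA\text{-}SNEP}}(\lambda, \varphi) = \sum_{i=1}^{D} \frac{(\varphi_i - \psi_{i,g})^2}{2\left(\sigma_{i,g}^{SI}\right)^2} + \lambda\left(\sum_{i=1}^{D} \frac{\varphi_i}{\sigma_{i,g}^{SI}} - \frac{n_g}{n_g + \tau} \sum_{i=1}^{D} \frac{\psi_{i,g}}{\sigma_{i,g}^{SI}}\right). \tag{31}$$
As described in Section 4, the closed-form solution of (30) is
$$\varphi_{i,g}^{\mathrm{SA\text{-}SNEP}} = \max\left(\psi_{i,g} - \lambda^* \sigma_{i,g}^{SI},\, 0\right). \tag{32}$$
Next, the piecewise linear function and the sequence $\{\lambda_k\}$ are given as follows:
$$f_{\mathrm{SA\text{-}SNEP}}(\lambda) = \sum_{i \in R_\lambda} \left(\frac{\psi_{i,g}}{\sigma_{i,g}^{SI}} - \lambda\right) - \frac{n_g}{n_g + \tau} \sum_{i=1}^{D} \frac{\psi_{i,g}}{\sigma_{i,g}^{SI}}, \quad R_\lambda = \left\{i \,\middle|\, \frac{\psi_{i,g}}{\sigma_{i,g}^{SI}} > \lambda\right\}, \tag{33}$$
$$\lambda_{k+1} = \frac{1}{|R_{\lambda_k}|}\left(\sum_{i \in R_{\lambda_k}} \frac{\psi_{i,g}}{\sigma_{i,g}^{SI}} - \frac{n_g}{n_g + \tau} \sum_{i=1}^{D} \frac{\psi_{i,g}}{\sigma_{i,g}^{SI}}\right). \tag{34}$$
Since the objective function and the constraint share the same standard deviations, (32)-(34) are slightly modified from the related equations in Section 3. For simple implementation, (32) can be changed into the following form:
$$\varphi_{i,g}^{\mathrm{SA\text{-}SNEP}} = \sigma_{i,g}^{SI} \max\left(\frac{\psi_{i,g}}{\sigma_{i,g}^{SI}} - \lambda^*,\, 0\right). \tag{35}$$
Note that the right-hand sides of (34) and (35) are composed of $\psi_{i,g}$ scaled by $\sigma_{i,g}^{SI}$. Thus, if we find the optimal solution for $\psi_{i,g}/\sigma_{i,g}^{SI}$ by EPL1, the solution is $\varphi_{i,g}^{\mathrm{SA\text{-}SNEP}}/\sigma_{i,g}^{SI}$. By multiplying this solution by $\sigma_{i,g}^{SI}$, we obtain exactly the same result as (32).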
Finally, the adapted mean vector is given as follows:
$$\mu_g^{\mathrm{SA\text{-}SNEP}} = \mu_g^{SI} + \mathrm{sign}\left(\mu_g^{ML} - \mu_g^{SI}\right) \odot \varphi_g^{\mathrm{SA\text{-}SNEP}}. \tag{36}$$
Figure 4 shows how the sparse speaker adaptation methods based on EPL1 and SNEP work. In the figure, the red arrows from the ML mean indicate the adapted mean vectors, and the region of non-sparse solutions is bounded by the two dashed lines. As can also be seen in Fig. 4, different adapted mean vectors are obtained from the same ML mean vector because of the shared standard deviations.
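The implementation shortcut described above (scale ψ by the SI standard deviations, run EPL1, then scale back) can be sketched as follows. This is a hedged illustration: the budget n_g / (n_g + τ) · Σ_i ψ_i / σ_i is an assumption, as are the function name `sparse_offset` and its arguments. Passing no `sigma` gives the plain EPL1 behaviour for comparison.

```python
from typing import List, Optional


def sparse_offset(psi: List[float], n_g: float, tau: float,
                  sigma: Optional[List[float]] = None,
                  tol: float = 1e-12) -> List[float]:
    """Sparse non-negative offset for one Gaussian mean.

    sigma=None uses unit scales (EPL1); otherwise psi is divided by the SI
    standard deviations, shrunk by EPL1, and multiplied back (SNEP).
    """
    if sigma is None:
        sigma = [1.0] * len(psi)
    v = [p / s for p, s in zip(psi, sigma)]  # scaled magnitudes
    budget = n_g / (n_g + tau) * sum(v)      # assumed L1 budget in the scaled domain
    if budget <= 0.0:
        return [0.0] * len(psi)
    lam = 0.0
    active = [i for i, vi in enumerate(v) if vi > 0.0]
    for _ in range(len(v) + 1):              # PRF on the scaled problem
        f = sum(v[i] - lam for i in active) - budget
        if f <= tol:
            break
        lam += f / len(active)
        active = [i for i in active if v[i] > lam]
    return [s * max(vi - lam, 0.0) for vi, s in zip(v, sigma)]


psi = [4.0, 1.0]      # |ML mean - SI mean| per dimension
sigma_si = [4.0, 0.5]  # SI standard deviations (toy values)
print(sparse_offset(psi, n_g=1.0, tau=3.0))                  # EPL1 keeps dimension 0
print(sparse_offset(psi, n_g=1.0, tau=3.0, sigma=sigma_si))  # SNEP keeps dimension 1
```

In this toy case the first dimension has the larger raw offset but also the larger variance, so EPL1 adapts it while SNEP adapts the second dimension, whose offset is larger relative to its standard deviation.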

Experimental results
The experiments were conducted on the ETRI Korean conversation speech database, collected at a 16 kHz sampling rate and 16-bit resolution with two types of smartphone devices in clean conditions. We used about 100 h of speech data spoken by 300 speakers to train the SI triphone-based GMM-HMM acoustic model. For adaptation and evaluation, we used 350 sentences from each of 50 speakers (300 sentences for adaptation and 50 sentences for the phone recognition test), and each sentence is roughly 4-5 s long. We used 12-dimensional Mel-frequency cepstral coefficients with log energy, concatenated with their first and second derivatives, to constitute 39-dimensional feature vectors. We applied a phone-level unigram language model over 39 Korean phonemes in our phone recognition experiments. The SI model had 11,848 tied-state triphone-based HMMs with three states per HMM and a GMM with 32 Gaussian components per state. All phone recognition tests were performed for various values of the hyperparameter τ.
To observe the effect of the variances of the SI model in SNEP compared with EPL1, we counted the number of times that each dimension of the mean vectors was adapted during the adaptation process. In Fig. 5, the x-axis indicates each dimension of the mean vector, and the normalized histogram of the counts is shown on the y-axis. For EPL1, three distinct peaks are observed, and their dimensions correspond to the log energy and its first and second derivatives. On the other hand, there is no peak with SNEP, and every dimension is evenly adapted. As mentioned earlier, we believe that speaker characteristics are not mainly concentrated in the three dimensions related to log energy. Therefore, SNEP-based sparse speaker adaptation can reflect the speaker variability better than the EPL1-based method.
Table 1 summarizes the phone error rate (PER) and sparsity of the various methods, where the sparsity is the percentage of parameters that are not adjusted after adaptation. For comparison, we also ran phone recognition tests with the SI model and with maximum likelihood linear regression (MLLR) adaptation. For MLLR, we used a full matrix and 65 regression classes, which showed the best PER. The best PER for EPL1 is 17.99 % with 91.37 % sparsity. In contrast, SNEP shows no recognition performance degradation against MAP adaptation with 91.08 % sparsity. These experimental results indicate that sparse speaker adaptation with dimension-specific variances can adapt the SI model more accurately than EPL1-based sparse speaker adaptation.
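The two statistics used in this section, the normalized per-dimension adaptation histogram and the sparsity, could be computed along the following lines. This is only a sketch with hypothetical names (`adaptation_stats`, `si_means`, `sa_means`); it assumes the SI and SA mean vectors are available as equally shaped lists and that a parameter counts as adapted when it differs from its SI value.

```python
from typing import List, Tuple


def adaptation_stats(si_means: List[List[float]],
                     sa_means: List[List[float]]) -> Tuple[List[float], float]:
    """Return (normalized histogram of adapted dimensions, sparsity in percent)."""
    dims = len(si_means[0])
    counts = [0] * dims
    for si, sa in zip(si_means, sa_means):
        for i in range(dims):
            if sa[i] != si[i]:  # this dimension of this Gaussian was adapted
                counts[i] += 1
    adapted = sum(counts)
    histogram = [c / adapted if adapted else 0.0 for c in counts]
    # Sparsity: share of mean parameters left unchanged by adaptation.
    sparsity = 100.0 * (1.0 - adapted / (len(si_means) * dims))
    return histogram, sparsity


si = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
sa = [[0.0, 0.5, 0.0], [1.0, 1.0, 2.0]]
hist, sparsity = adaptation_stats(si, sa)
print(hist, sparsity)
```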

Conclusions
In this paper, we proposed the SNEP method, a more generalized version of EPL1 in which scaling parameters can be applied within the EPL1 framework. In addition, using the SNEP method, we proposed a sparse speaker adaptation method. Our experiments showed that EPL1-based speaker adaptation mostly adapts a small number of dimensions, whereas the proposed method evenly adapts every dimension of the mean vectors by using the variances of the SI model. With the proposed method, we also showed that a sparsely adapted model can be obtained with no loss of phone recognition performance compared with MAP adaptation. Our future work is to apply the EPL1 and SNEP frameworks to deep neural network-based acoustic model adaptation [25][26][27][28] with the gradient projection method.