Simultaneous variable selection and component selection for regression density estimation with mixtures of heteroscedastic experts

This paper is concerned with the problem of flexibly estimating the conditional density of a response variable given covariates. In our approach the density is modeled as a mixture of heteroscedastic normals with the means, variances and mixing probabilities all varying smoothly as functions of the covariates. We use the variational Bayes approach and propose a novel fast algorithm for simultaneous covariate selection, component selection and parameter estimation. Our method is able to deal with the local maxima problem inherent in mixture model fitting, and is applicable to high-dimensional settings where the number of covariates can be larger than the sample size. In the special case of the classical regression model, the proposed algorithm is similar to currently used greedy algorithms while having many attractive properties and working efficiently in high-dimensional problems. The methodology is demonstrated through simulated and real examples.


Introduction
In this paper we are concerned with the problem of flexible regression density estimation. The term regression density estimation refers to the problem of flexibly estimating the conditional density function of a response variable y at all points x in the covariate space, while making relatively few assumptions about its functional form [28]. This is an important problem in applications where the response distribution is highly multimodal and would not be appropriately modeled by a simple parametric density such as a normal.
A well-established methodology for regression density estimation uses mixtures of heteroscedastic experts models [10,28,19]. This approach extends mixture of experts models [12,14] by allowing components to be heteroscedastic and allowing mixing probabilities to depend on the covariates. In this paper we consider mixtures of heteroscedastic normals. More specifically, the conditional density of a response y given a covariate vector x is modeled as

p(y|x) = Σ_{j=1}^k π_j(x) N(y; μ_j(x), σ_j^2(x)),   (1.1)

where π_j(x), μ_j(x) and σ_j^2(x) are (functions of) linear combinations of x, π_j(x) ≥ 0, π_1(x) + · · · + π_k(x) = 1, and k is the number of components. We will refer to this model as the RDE-MHN(k) (regression density estimation with mixtures of k heteroscedastic normals) model. Hereafter, the terms mean model, variance model and gating model refer to the models for the means μ_j, variances σ_j^2 and mixing probabilities π_j, respectively.
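To make (1.1) concrete, the following sketch (our illustration, not the authors' code; all coefficient values are hypothetical) evaluates p(y|x) under the linear parameterizations used later in the paper: μ_j(x) = x′β_j, σ_j^2(x) = exp(x′α_j), and multinomial-logit mixing probabilities with the first component as baseline.

```python
import numpy as np

def normal_pdf(y, mu, sigma2):
    """Density of N(mu, sigma2) at y."""
    return np.exp(-0.5 * (y - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

def rde_mhn_density(y, x, beta, alpha, gamma):
    """p(y|x) for a mixture of k heteroscedastic normals.

    beta, alpha: (k, dim x) coefficient arrays for the mean and
    log-variance models; gamma: (k, dim x) gating coefficients with
    gamma[0] = 0 for identifiability.
    """
    eta = gamma @ x                          # gating linear predictors
    pi = np.exp(eta - eta.max())
    pi /= pi.sum()                           # mixing probabilities sum to 1
    mu = beta @ x                            # component means mu_j(x)
    sigma2 = np.exp(alpha @ x)               # component variances sigma_j^2(x)
    return float(np.sum(pi * normal_pdf(y, mu, sigma2)))

# Toy two-component example with hypothetical coefficients
x = np.array([1.0, 0.5])                     # intercept plus one covariate
beta = np.array([[0.0, 1.0], [3.0, -1.0]])
alpha = np.array([[0.0, 0.0], [0.5, 0.0]])
gamma = np.array([[0.0, 0.0], [1.0, -2.0]])
density = rde_mhn_density(0.5, x, beta, alpha, gamma)
```

Because each component is a proper density and the mixing probabilities sum to one, the returned value is a valid density evaluation for any x.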
[30] and [31] carry out Bayesian analysis of mixtures of experts models with flexible terms for the covariates, although they do not consider heteroscedasticity. [10] consider model (1.1) for regression density estimation in which only the means μ_j are allowed to depend on x. [28] and [19] extend this to the heteroscedastic case in which the mixing probabilities π_j, component means μ_j and variances σ_j^2 all vary with the covariates. The heteroscedastic extension of mixture of experts models is important in applications where the conditional distribution of y is very complex and there is a need to model p(y|x) flexibly without making overly rigid assumptions about its functional form. As discussed in [19] and [28], the performance of mixtures of homoscedastic models (i.e. model (1.1) with constant σ_j^2), when used to model heteroscedastic data, deteriorates as the number of covariates increases, and cannot be improved by simply increasing the number of components. Ignoring heteroscedasticity may lead to serious problems in inference, such as misleading assessments of significance, poor predictive performance and inefficient estimation of the mean parameters. The reader is referred to [5] and [25] for a more detailed discussion of heteroscedastic modeling. [28] use Bayesian inference and Markov chain Monte Carlo (MCMC) methods to estimate the RDE-MHN model. Using MCMC, however, may be computationally demanding in high-dimensional situations with a large number of covariates, and in time series modeling where sequential updating is required. [19] develop a fast alternative estimation method based on variational approximation, which is computationally attractive in situations where complex models must be re-fitted many times, such as for sequential updating in time series analysis.
Using variational approximation for fitting mixtures of homoscedastic experts models is considered by a number of authors [29,23,2]. See also [7,17] and [32] for applications of variational approximation to fitting Gaussian mixture models. Section 3 briefly reviews the variational approximation method.
The first issue in RDE-MHN modeling is selecting the number of components k. [28] and [19] address this problem by fitting separate RDE-MHN models over a proposed range of potential k and selecting the one with the largest cross-validation log predictive density score (see the definition in Section 6). This approach has several drawbacks. First, cross-validation is not natural for ordered data such as time series or longitudinal data (see the stock return example in Section 6.3). Second, computing the cross-validation log predictive density score may be very time consuming if the sample is divided into many parts. Third, the log predictive density score method requires prior information on the maximum number of components, which may be hard to obtain.
The second important issue in RDE-MHN modeling is variable selection. Variable selection is a fundamental problem in general regression analysis in which a large number of potential covariates is often introduced at the initial stage of modeling and it is necessary to select from them a smaller subset to fit the data in order to avoid overfitting, reduce the cost of data collection and increase model interpretability. See, for example, [9,8,20] and references therein. Variable selection is not discussed in [19]. Incorporating variable selection is essential in complex models like ours, especially in high dimension where the full model fitting of [19] is almost impossible. Variable selection helps not only to improve performance by producing parsimonious models, but also reduces the dimension of the parameter space and makes the computation faster.
Another important issue in mixture model fitting in general is the local maxima problem, in which the fitting procedure converges to one of the local maxima [27,18,22]. In our experience this problem occurs very often and yields a suboptimal solution, which may have serious consequences for subsequent inference.
Our article proposes a novel algorithm that is able to deal with the aforementioned problems. We employ a Bayesian approach and use the variational approximation method for fitting. We first modify the variational approximation fitting approach in [19] to deal with the local maxima problem by using the split-and-merge idea [see, for example, 22,32] which repeatedly splits and/or merges poorly fitted components on the basis of maximizing the variational lower bound considered as a good approximation of the log marginal likelihood. This approach also automatically determines the number of components. We then propose a strategy for ranking covariates for inclusion into the mean, variance and gating models in a computationally thrifty way. This is a non-trivial extension of the ranking algorithm introduced in [20] who consider variable selection in the heteroscedastic linear regression model, i.e. RDE-MHN(1). These together result in a novel fast method for simultaneous variable selection, component selection and parameter estimation in the RDE-MHN modeling. Our method can be used in high-dimensional situations where the number of covariates can be much larger than the sample size.
The MCMC method of [28] and their novel variable selection prior provide an excellent approach for RDE-MHN modeling in low-dimensional settings where computation time is not a primary concern. A major advantage of our method over their MCMC method is speed; see Sections 5 and 6. The MCMC method is not applicable in high-dimensional problems with thousands of covariates. Another advantage is that our method performs variable selection and component selection simultaneously and automatically, rather than fitting separate RDE-MHN models over a range of potential k and then selecting one by a model selection criterion, which in turn requires a considerable amount of extra computation time.
Many of the popular models in the literature are special cases of the RDE-MHN model. Model (1.1) with k = 1 is the heteroscedastic linear regression model considered in [25], Chapter 14 and [20], in which, for observations (y_i, x_i), i = 1, . . . , n, the response is modeled as y_i ~ N(μ(x_i), σ^2(x_i)). With k = 1 and σ^2 constant, (1.1) reduces further to the classical linear regression model, which is extensively studied in the literature. A nice feature of our methodology is that it is able to reach a simple model if such a model is warranted. With k = 1 and constant variance, our ranking algorithm for variable selection is similar to widely used matching pursuit and greedy algorithms for model search [16,33,8].
The RDE-MHN modeling approach provides flexibility in exploring data in which the conditional density of interest has a complex structure. Section 6 applies our approach to the diabetes data [8]. This data set consists of observations on 442 patients: 10 baseline variables (predictors x) and a quantitative measure of disease progression (response y). Previous analyses in the literature assumed a RDE-MHN(1) model [8,20]. Our goal is to see whether relaxing the assumption of k = 1 components gives more flexibility in exploring the structure of p(y|x) and leads to a model with better predictive performance. Our algorithm selects a RDE-MHN(3) model, which has better predictive performance than the previously selected models. Figure 1 shows the clear three-component structure in the conditional distribution uncovered by our approach (see Section 6 for details).
The rest of the paper is organized as follows. Section 2 describes the RDE-MHN model in detail. Section 3 combines the variational approximation method and the split-and-merge algorithm for fitting this model. Section 4 presents our fast greedy algorithm for variable selection. Sections 5 and 6 present simulation studies and real data examples illustrating our method. Section 7 concludes. Technical derivations are placed in the Appendices.

Mixture of heteroscedastic normals model
Let (y_i, x_i), i = 1, . . . , n, be n observations with y_i univariate responses and x_i = (x_i1, . . . , x_is)′ corresponding covariate vectors. We will write y and x = (x_1, . . . , x_s)′ for a generic response and covariate vector. We are concerned with the problem of estimating the conditional distribution of y given x using mixture of experts models, and with the problems of selecting important covariates and selecting the number of experts. The distribution of y_i given x_i is modeled by a mixture of heteroscedastic normals model as follows. Let δ_i ∈ {1, . . . , k} be the latent variable indicating the component to which observation i belongs, and

y_i | x_i, δ_i = j ~ N(u_i′β_j, exp(w_i′α_j)),   j = 1, . . . , k,

where u_i = (u_i1, . . . , u_ip)′ and w_i = (w_i1, . . . , w_iq)′ are vectors of covariates in the mean and variance models (which are sub-vectors of x_i), and β_j = (β_j1, . . . , β_jp)′ and α_j = (α_j1, . . . , α_jq)′ are vectors of unknown parameters in the mean and variance models of the jth component, respectively. Write β = (β_1′, . . . , β_k′)′ and α = (α_1′, . . . , α_k′)′. The distribution of the latent variables δ_i is modeled by a multinomial logit regression model

P(δ_i = j | x_i) = exp(z_i′γ_j) / Σ_{l=1}^k exp(z_i′γ_l),   j = 1, . . . , k; i = 1, . . . , n,

where γ_j = (γ_j1, . . . , γ_jr)′ is the vector of unknown parameters in the gating model of the jth component, j = 2, . . . , k. Write γ = (γ_2′, . . . , γ_k′)′ and δ = (δ_1, . . . , δ_n)′. Note that we set γ_1 ≡ 0 for identifiability. Again, z_i = (z_i1, . . . , z_ir)′ is a sub-vector of x_i, and contains the covariates used to model the mixing probabilities. We assume here and in Section 3 that we know in advance which covariates from x are included in the mean, variance and gating models; the variable selection issue is discussed in Section 4. The above model will be referred to as the regression density estimation with mixtures of k heteroscedastic normals model, denoted by RDE-MHN(k). This article employs a Bayesian approach and uses normal priors on the parameters β, α and γ,

β_j ~ N(μ^0_{βj}, Σ^0_{βj}),   α_j ~ N(μ^0_{αj}, Σ^0_{αj}),   γ ~ N(μ^0_γ, Σ^0_γ),

where μ^0_{βj}, Σ^0_{βj}, μ^0_{αj}, Σ^0_{αj}, μ^0_γ and Σ^0_γ are hyperparameters. We assume that β, α and γ are independent a priori, i.e. p(β, α, γ) = p(β)p(α)p(γ).
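The multinomial logit gating model with γ_1 ≡ 0 can be sketched as follows (illustrative only; the coefficient values are made up):

```python
import numpy as np

def gating_probs(z, gamma_rest):
    """Mixing probabilities P(delta = j | z) under the multinomial logit
    gating model. Component 1 is the baseline with gamma_1 = 0, and
    gamma_rest holds the k-1 coefficient vectors gamma_2, ..., gamma_k."""
    eta = np.concatenate(([0.0], gamma_rest @ z))  # z'gamma_1 = 0
    eta -= eta.max()                               # numerical stability
    w = np.exp(eta)
    return w / w.sum()

z = np.array([1.0, 0.2, -0.4])                     # intercept + 2 covariates
gamma_rest = np.array([[0.5, 1.0, 0.0],
                       [-1.0, 0.0, 2.0]])
pi = gating_probs(z, gamma_rest)                   # k = 3 probabilities
```

Fixing γ_1 = 0 removes the redundancy that adding a common vector to every γ_j leaves the probabilities unchanged, which is why the model is identifiable.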
The main issues relating to the implementation of the RDE-MHN modeling are: (i) estimating the parameters β, α and γ in a fast and reliable way, (ii) selecting the important covariates and, (iii) selecting the number of components k. Another important issue when fitting mixture models is the local maxima problem. This problem happens when the fitting algorithm (by the maximum likelihood method, for example) converges to a local rather than the global maximum [27,18,22]. In later sections we present a method for dealing with these problems.

The variational approximation fitting approach
Our method for fitting and doing model selection in the RDE-MHN model is based on the variational approximation fitting approach of [19]. It is reproduced here and in the Appendices to make the paper self-contained. The reader who is not familiar with variational approximation is referred to, for example, [1] or [21].
The RDE-MHN(k) model can be written as

p(y_i | δ_i = j, x_i) = N(y_i; u_i′β_j, exp(w_i′α_j)),   j = 1, . . . , k; i = 1, . . . , n.

Write θ = (β, α, γ, δ). Our Bayesian inferences are based on the posterior distribution p(θ|y) ∝ p(θ)p(y|θ), which is difficult to handle. We proceed by approximating this posterior by a more tractable distribution q(θ). The variational approximation posterior q(θ) is selected by minimizing the Kullback-Leibler (KL) divergence

∫ q(θ) log [q(θ)/p(θ|y)] dθ

over some restricted class of distributions. From the identity

log p(y) = ∫ q(θ) log [p(y, θ)/q(θ)] dθ + ∫ q(θ) log [q(θ)/p(θ|y)] dθ,   (3.1)

we see that minimizing the KL divergence is equivalent to maximizing

L(q) = ∫ q(θ) log [p(y, θ)/q(θ)] dθ.   (3.2)

Because of the non-negativity of the KL divergence term in (3.1), (3.2) is a lower bound on the log marginal likelihood log p(y). The lower bound (3.2), when maximized with respect to q, is often used as an approximation to the log marginal likelihood log p(y). This approximation is useful, since log p(y) is a key quantity in Bayesian model selection. The accuracy of variational approximation is studied experimentally in [19,20]. Some results on the asymptotic normality of variational approximation estimators have recently been obtained in [11]. [19] develop a variational approximation approach for fitting the RDE-MHN model with the variational posterior q(θ) assumed to have the product form

q(θ) = Π_{j=1}^k q(β_j) Π_{j=1}^k q(α_j) q(γ) Π_{i=1}^n q(δ_i),

where q(β_j) is normal N(μ^q_{βj}, Σ^q_{βj}), q(α_j) is normal N(μ^q_{αj}, Σ^q_{αj}), and q(δ_i = j) = q_ij with Σ_j q_ij = 1, i = 1, . . . , n. The posterior q(γ) is assumed to be the Dirac delta distribution δ_{μ^q_γ}(·) concentrated at a point μ^q_γ, i.e. δ_{μ^q_γ}(A) = 1 if and only if μ^q_γ ∈ A, for every Borel set A in IR^{(k−1)r}. Using the Dirac delta distribution for the variational posterior of γ means that we are interested in its posterior mode, and this facilitates the computation. Note that, for simplicity, with a slight abuse of notation we have not distinguished between distributions and densities in the above.
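The decomposition (3.1)-(3.2) can be checked numerically on a toy discrete parameter space (made-up joint probabilities; θ takes three values):

```python
import numpy as np

# Toy joint p(y, theta) for one fixed y, with theta in {0, 1, 2}
joint = np.array([0.10, 0.25, 0.05])
evidence = joint.sum()                    # p(y)
posterior = joint / evidence              # p(theta | y)

q = np.array([0.3, 0.5, 0.2])             # an arbitrary approximating q

lower_bound = float(np.sum(q * np.log(joint / q)))   # E_q log[p(y,theta)/q(theta)]
kl = float(np.sum(q * np.log(q / posterior)))        # KL(q || p(theta|y)) >= 0

# Identity (3.1): log p(y) = lower_bound + kl, so the lower bound never
# exceeds the log marginal likelihood.
```

Since the KL term is nonnegative and vanishes only at q = p(θ|y), maximizing the lower bound over q drives q toward the true posterior while tightening the approximation to log p(y).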
The advantage of the variational approximation posterior above is that the lower bound (3.2) has a closed form (see Appendix A), which allows fast and easy optimization. Algorithm 1 in Appendix A summarizes a procedure for maximizing this lower bound.

The split-and-merge algorithm
Two difficult issues associated with the above mixture model concern the existence of local maxima and the selection of the number of components [27,18]. We now address these problems by adapting the split-and-merge algorithm discussed in [22]. See also [24] and [32]. The idea of the split-and-merge algorithm is to repeatedly merge two components and/or split a component until some criterion is satisfied. This algorithm has been proven useful in overcoming the local maxima problem and automatically determining the number of components [22,32]. We now adapt this idea to our RDE-MHN model.
We first initialize the number of components k using the method of [4] for selecting the number of clusters. With an initial k, after Algorithm 1 has converged, we consider merging two components or splitting a component until the lower bound is not improved any further. Let θ * and L * denote the parameter estimate and the maximized lower bound after Algorithm 1 converges.
Merge criterion Two components are considered most plausible for merging if they are close to each other in some sense. Here we use the Kullback-Leibler (KL) divergence to measure similarity. The KL distance between two distributions P and Q with densities p and q is defined as

KL(P ∥ Q) = ∫ p(x) log [p(x)/q(x)] dx.

In our context, the KL distance (averaged over the n observed points) between two components j_1 and j_2 is given by

KL(j_1, j_2) = (1/n) Σ_{i=1}^n KL( N(u_i′β_{j_1}, exp(w_i′α_{j_1})) ∥ N(u_i′β_{j_2}, exp(w_i′α_{j_2})) ).

The smaller the KL distance between two components, the more plausible they are as candidates for merging. Let C = {(j_1, j_2) : j_1 = 1, . . . , k; j_2 = 1, . . . , k; j_1 ≠ j_2} be the set of index pairs, ξ_1 = argmin{KL(j_1, j_2), (j_1, j_2) ∈ C} the index pair of the two components with the smallest KL distance, and ξ_i = argmin{KL(j_1, j_2), (j_1, j_2) ∈ C \ {ξ_1, . . . , ξ_{i−1}}}, i ≥ 2. The idea is to try merging the most plausible pairs of components until the lower bound is improved or the number of merge operations exceeds a pre-specified number C_merge^max.
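Since both components are normal at each covariate value, the averaged KL distance above has a closed form. The sketch below (our illustration; all coefficients are hypothetical) compares a pair of similar components with a clearly dissimilar pair:

```python
import numpy as np

def kl_normal(m1, v1, m2, v2):
    """Closed-form KL( N(m1, v1) || N(m2, v2) ) for univariate normals."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def component_kl(X, beta1, alpha1, beta2, alpha2):
    """KL distance between two components, averaged over the n observed
    covariate vectors (rows of X), as in the merge criterion."""
    m1, m2 = X @ beta1, X @ beta2
    v1, v2 = np.exp(X @ alpha1), np.exp(X @ alpha2)
    return float(np.mean(kl_normal(m1, v1, m2, v2)))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # intercept + 1 covariate
beta1, alpha1 = np.array([1.0, 0.5]), np.array([0.0, 0.0])
beta2, alpha2 = np.array([1.1, 0.5]), np.array([0.1, 0.0])
near = component_kl(X, beta1, alpha1, beta2, alpha2)     # similar pair
far = component_kl(X, beta1, alpha1, -beta2, alpha2)     # dissimilar means
```

A nearly duplicated component yields a small averaged KL distance and is ranked first for merging, while well-separated components are left alone.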
Merge operation To estimate the parameters of the new merging model, it is important to make use of the previous estimate to initialize the iterative scheme.
Recall that the parameters for optimization consist of μ^q_{βj}, Σ^q_{βj}, μ^q_{αj}, Σ^q_{αj}, μ^q_γ and q_ij for i = 1, . . . , n, j = 1, . . . , k. Suppose that two components j_1 and j_2 are to be merged into a new component j′. The initial values for the new merged model are assigned as follows. For the new component j′ we set

q_{ij′} = q_{ij_1} + q_{ij_2},   μ^q_{βj′} = (μ^q_{βj_1} + μ^q_{βj_2})/2,   Σ^q_{βj′} = (Σ^q_{βj_1} + Σ^q_{βj_2})/2,

and similarly for μ^q_{αj′} and Σ^q_{αj′}. The initial values for the parameters in the other components are fixed at the current estimates. Now the iterative scheme in Algorithm 1 can readily be run to estimate the parameters of the new model. Note that the number of components is reduced by 1.
Split criterion A component is considered unreliable, and hence a plausible split candidate, if it is poorly fitted, i.e. the data assigned to it have a small likelihood under it. The reliability of component j is defined as

R(j) = Σ_{i=1}^n q_ij log p̂_j(y_i | x_i) / Σ_{i=1}^n q_ij,

where p̂_j(·|x) is the fitted density function of the jth component. The smaller R(j), the less reliable component j is and the more plausible it is as a candidate for a split. Let η_1 = argmin{R(j), j = 1, . . . , k} and η_i = argmin{R(j), j = 1, . . . , k; j ≠ η_1, . . . , η_{i−1}}, i ≥ 2. Write C_split = C_split(θ*) = {η_1, η_2, . . .}. As in the merge step, we split the most plausible components until the lower bound is improved or the number of split operations exceeds a pre-specified number C_split^max.
Split operation Denote the component to be split by j′ and the new components by j_1 and j_2. We set q_{ij_1} = q_{ij_2} = q_{ij′}/2, μ^q_{αj_1} = μ^q_{αj_2} = μ^q_{αj′}, Σ^q_{αj_1} = Σ^q_{αj_2} = Σ^q_{αj′}, and keep the other q_ij, μ^q_{αj} and Σ^q_{αj} unchanged. Then the updates in Algorithm 1 can be performed for all parameters. The number of components is now increased by 1.
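The responsibility bookkeeping for the two operations can be sketched as follows. The split rule q_{ij_1} = q_{ij_2} = q_{ij′}/2 is as stated above; taking the merged responsibility to be q_{ij_1} + q_{ij_2} is our assumption of a natural counterpart, not a formula quoted from the paper.

```python
import numpy as np

def split_responsibilities(q, j):
    """After splitting component j into twins, each twin inherits half of
    q[:, j]; all other components are unchanged."""
    q_new = np.column_stack([q, q[:, j] / 2.0])  # append the twin column
    q_new[:, j] /= 2.0                           # old column keeps the other half
    return q_new

def merge_responsibilities(q, j1, j2):
    """After merging j1 and j2, the merged component takes the summed
    responsibility q[:, j1] + q[:, j2] (an assumed, natural choice)."""
    merged = q[:, j1] + q[:, j2]
    keep = [j for j in range(q.shape[1]) if j not in (j1, j2)]
    return np.column_stack([merged] + [q[:, j] for j in keep])

q = np.array([[0.2, 0.5, 0.3],
              [0.6, 0.1, 0.3]])
q_split = split_responsibilities(q, 1)
q_merged = merge_responsibilities(q, 0, 2)
```

Both operations preserve the constraint Σ_j q_ij = 1 for every observation, so Algorithm 1 can be restarted immediately from the modified responsibilities.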
We now summarize our final algorithm for estimating the RDE-MHN model, which is able to deal with the local maxima problem, while automatically determining the number of components.
Algorithm 2 (split-and-merge variational approximation)
1. Initialize the number of components k using the method of [4] and run Algorithm 1. Let L*, θ* be the resulting lower bound and parameter estimate.
2. Compute the ordered merge candidates ξ_1, ξ_2, . . . and split candidates η_1, η_2, . . . at θ*.
3. For i_merge = 1 : C_merge^max do
• Merge the two components with the index pair ξ_{i_merge} and run Algorithm 1. Let L*_merge, θ*_merge be the new lower bound and parameter estimate.
• If L*_merge > L*, set L* := L*_merge, θ* := θ*_merge and go to step 2.
4. For i_split = 1 : C_split^max do
• Split the component η_{i_split} and run Algorithm 1. Let L*_split, θ*_split be the new lower bound and parameter estimate.
• If L*_split > L*, set L* := L*_split, θ* := θ*_split and go to step 2.
5. If neither a merge nor a split has improved the lower bound, stop and return L*, θ*.

Model selection
This section considers the problem of selecting significant covariates out of the s given potential covariates x_1, . . . , x_s for inclusion in the mean, variance and gating models. Write C = {1, . . . , s} for the index set of the potential covariates. Before presenting our strategy for ranking variables for inclusion, we discuss the model prior. Let π^m_i, π^v_i, π^g_i be the prior probabilities of inclusion of covariate x_i in the mean, variance and gating models, respectively, and write π^m = (π^m_1, . . . , π^m_s)′, π^v = (π^v_1, . . . , π^v_s)′ and π^g = (π^g_1, . . . , π^g_s)′. Suppose that we have a current model M with C_m, C_v, C_g the index sets of covariates in its mean, variance and gating models, respectively. We assume that the covariates are included independently a priori, so that

p(M) = p(C_m)p(C_v)p(C_g),   with   p(C_m) = Π_{i∈C_m} π^m_i Π_{i∉C_m} (1 − π^m_i),

and similarly for p(C_v) and p(C_g). If no such detailed prior information is available on the inclusion probability of each predictor (which is the case we consider in this paper), one may assume that π^m_1 = · · · = π^m_s = π^m, π^v_1 = · · · = π^v_s = π^v and π^g_1 = · · · = π^g_s = π^g (we note a slight abuse of notation here); then

p(C_m) = (π^m)^{|C_m|} (1 − π^m)^{s−|C_m|},

where for a set A, |A| denotes its cardinality. The hyperparameters π^m, π^v, π^g ∈ [0, 1] are user-specified, and small values encourage parsimonious models. Setting π^m = π^v = π^g = 1/2 gives a uniform prior over covariate inclusions. Another option is to put uniform distributions on the π's; integrating π^m out then gives

p(C_m) = ∫_0^1 (π^m)^{|C_m|} (1 − π^m)^{s−|C_m|} dπ^m = 1 / [(s + 1) (s choose |C_m|)],

and similarly for p(C_v) and p(C_g). This prior agrees with the one used in the extended BIC proposed by [6]. It has the advantage of requiring no hyperparameters while still encouraging parsimony. We recommend using it as the default prior. We now consider adding a single variable to the mean, variance or gating model, and then a one-step update to the current variational lower bound in the proposed model, as a computationally thrifty way of ranking the predictors for possible inclusion.
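A quick numerical check of the integrated prior: p(C) = 1/[(s + 1)(s choose |C|)] depends only on the set size, sums to one over all subsets, and penalizes middling set sizes relative to very sparse (or very dense) ones.

```python
from math import comb, log

def log_model_prior(size, s):
    """Log prior of an index set of the given cardinality among s
    candidate covariates, after integrating out a Uniform(0,1) prior on
    the inclusion probability: p(C) = 1 / ((s + 1) * comb(s, |C|))."""
    return -log((s + 1) * comb(s, size))

s = 10
# The prior sums to 1 over all 2^s subsets: comb(s, a) sets of each size a.
total = sum(comb(s, a) / ((s + 1) * comb(s, a)) for a in range(s + 1))
sparse = log_model_prior(1, s)   # a set with one covariate
dense = log_model_prior(5, s)    # a set with five covariates
```

The one-covariate set receives a larger log prior than the five-covariate set, which is precisely the parsimony-encouraging behaviour described above.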
Write X_{C_m}, X_{C_v}, X_{C_g} for the corresponding design matrices, and in particular x′_{iC_m} for the ith row of X_{C_m}; β^{C_m}_j, α^{C_v}_j, γ^{C_g}_j for the current coefficient vectors in the mean, variance and gating models of the jth component; and β^l_j, α^l_j, γ^l_j for the new coefficients corresponding to a new covariate x_l in the jth component. Our method for ranking covariates for inclusion in the mean and variance models is an extension of the ranking algorithm proposed in [20], who considered the RDE-MHN(1) model only. The idea of ranking covariates for inclusion in the gating model is a non-trivial contribution of the present paper.

Ranking covariates in the mean model
We first consider the effect of adding a new covariate x_l with l ∈ C \ C_m to the mean model of the k components. (It is also possible to consider adding different covariates to different components, but this complicates the model somewhat.) We consider a variational approximation to the posterior of the same product form as before, augmented with factors for the new coefficients β^l_j. From the expression for the lower bound in Appendix A, the lower bound of the new model with x_l in the mean model (of all components) can be written as the sum of L_old, the lower bound of the current model (i.e., the model without covariate x_l in its mean model), and a correction term depending on the variational parameters of the new coefficients; this is equation (4.4). The optimal values of these new parameters are available in closed form. Substituting them back into the lower bound (4.4), and writing L^M_l(C_m, C_v, C_g) for the optimized lower bound, gives (4.5). The superscript M indicates that the lower bound is associated with the mean model. The most plausible variable for inclusion in the mean model is the one that maximizes this lower bound over l ∈ C \ C_m. Note that we use (4.5) only for ranking covariates for inclusion; the actual inclusion decision is based on the improvement of the lower bound fitted using the full variational algorithm described in the previous section.
In the full variational approximation fit, the "naive" estimates μ̂^q can be used to initialize the new parameters corresponding to the new covariate x_l in the mean model, while all the other parameters can be initialized at the current estimates. This so-called warm start (i.e., the output of a previous fit is used as initial values for a subsequent fit) is very important in variational approximation, especially in the complex context of mixture models, and makes the variational approximation procedure more stable and faster [20]. We observe that, with the warm start, Algorithm 1 converges very quickly, often after just a few iterations.

Ranking covariates in the variance model
We now consider adding a new covariate x_l with l ∈ C \ C_v to the variance model. As in the mean model, we consider a variational approximation to the posterior augmented with factors q(α^l_j), j = 1, . . . , k. If we do not assume a particular parametric form for q(α^l_j), the optimal choice is [see, for example, 21]

q_opt(α^l_j) ∝ exp( E[log p(y, θ)] ),

where the expectation is with respect to all parameters except α^l_j, j = 1, . . . , k. This optimal choice is not of standard form, so to obtain good estimates of μ^q_{α^l_j} and (σ^q_{α^l_j})^2 we make a normal approximation to q_opt(α^l_j): the mean and variance of the normal approximation are taken to be the mode of q_opt(α^l_j) and the negative inverse Hessian of log q_opt(α^l_j) at the mode, respectively. This yields closed-form starting estimates μ̂^q_{α^l_j}, j = 1, . . . , k.
To obtain a more accurate estimate of the mode, our implementation uses Newton's method initialized at μ̂^q_{α^l_j}. Newton's method is very convenient here because the second derivative is available in closed form. We found that μ̂^q_{α^l_j} is a very good approximation and the Newton iteration often stops after a few iterations. From (A.1), the new lower bound can then be written down and optimized in the same way as for the mean model.
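The Newton step for a one-dimensional mode, together with the normal-approximation variance (the negative inverse Hessian at the mode), can be sketched generically; the concave log-density below is a made-up stand-in for log q_opt(α^l_j):

```python
import math

def newton_mode(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton iteration for the mode of a smooth concave log-density.
    Returns the mode and the normal-approximation variance, i.e. the
    negative inverse Hessian of the log-density at the mode."""
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)
        x -= step                  # Newton update x := x - f'(x)/f''(x)
        if abs(step) < tol:
            break
    return x, -1.0 / hess(x)

# Hypothetical target: log f(a) = -exp(a) + 2a, maximized at a = log 2
mode, var = newton_mode(lambda a: -math.exp(a) + 2.0,   # first derivative
                        lambda a: -math.exp(a),         # second derivative
                        x0=0.0)
```

Because the second derivative is available in closed form, each iteration is cheap and convergence is quadratic near the mode, which is consistent with the observation above that a few iterations suffice.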

Ranking covariates in the gating model
Using lower bounds for ranking covariates for inclusion in the gating model is more complex. We proceed instead by ranking with a measure of association between the response and the covariates. The measure we use is the distance correlation introduced by [26]. The distance correlation measures general, not just linear, dependence between two random vectors of arbitrary dimensions, and is zero if and only if the two random vectors are independent. The reader is referred to [26] for the definition and properties of this association measure. Let x_l̃ be the covariate with the highest (sample) distance correlation with y among the x_l, l ∈ C \ C_g. If the covariate space of the gating model is completely separate from that of the mean and variance models, then this high correlation suggests that x_l̃ is the most plausible covariate to add to the gating model, and it is selected if its inclusion improves the lower bound. The situation is more complex if the covariate spaces of the three models overlap or coincide, which is the case we consider in this paper. In this situation care must be taken in considering the inclusion of x_l̃. We proceed as follows. If x_l̃ has not been included in the mean or variance model, i.e. l̃ ∉ C_m ∪ C_v, then x_l̃ is added to the gating model if its inclusion improves the lower bound. Suppose instead that x_l̃ has already been included in the mean or variance model. It is added to the gating model as well if its inclusion improves the lower bound. Otherwise, the high correlation of x_l̃ with y is likely induced through the mean or variance model rather than the gating model, so we consider for inclusion the covariate x_l′ with the second highest distance correlation with y among the covariates in C \ C_g.
This consideration is repeated until a covariate is selected or no candidate x_l′ remains.
We give below pseudo-code for the variable selection strategy. Let τ_l = dCor(y, x_l) be the sample distance correlation, i.e. the distance correlation calculated from the data, between the response y and covariate x_l, l = 1, . . . , s. Note that it is still possible to obtain a warm start in the variational approximation fit: to initialize Algorithm 1, we start with Step 6 to update μ^q_γ, while all other parameters are initialized at the current estimates.
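The sample distance correlation is straightforward to compute from doubly-centered pairwise distance matrices; the sketch below follows the definition in [26] for univariate samples (our code, with simulated data):

```python
import numpy as np

def dcor(x, y):
    """Sample distance correlation of Szekely et al. between two
    univariate samples: zero (in population) iff the variables are
    independent, and it detects nonlinear dependence."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])                  # pairwise distances
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()    # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                               # squared distance cov
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return float(np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y)))

rng = np.random.default_rng(1)
x = rng.normal(size=200)
strong = dcor(x, x ** 2)               # nonlinear but strong dependence
weak = dcor(x, rng.normal(size=200))   # independent noise
perfect = dcor(x, 2.0 * x + 1.0)       # exact linear relation: dcor = 1
```

The nonlinear x-versus-x^2 relationship, which an ordinary Pearson correlation would largely miss, produces a clearly larger distance correlation than independent noise; this is what makes the measure suitable for ranking gating covariates.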

The full algorithm for variable and component selection
We now summarize our ranking algorithm for variable selection combined with the split-and-merge variational approximation for component selection; we denote the algorithm by RSMVA. Recall that C_m, C_v, C_g denote the index sets of the current covariates in the mean, variance and gating models, respectively. We write L(C_m, C_v, C_g) for the lower bound optimized by the full variational approximation fit described in Section 3, C_m^{+l} for the set C_m ∪ {l} (and similarly for C_v and C_g), and p(C_m, C_v, C_g) for the prior of the model with index sets C_m, C_v, C_g; for simplicity of exposition the model prior was suppressed in Sections 4.1-4.3. Each iteration of RSMVA proceeds as follows. (a) Rank the covariates l ∈ C \ C_m for the mean model by (4.5); if for the best candidate l the prior-penalized lower bound improves, set C_m := C_m^{+l} and L_opt := L(C_m^{+l}, C_v, C_g). (b) Rank the covariates l ∈ C \ C_v for the variance model as in Section 4.2. (c) If for the best candidate l the prior-penalized lower bound improves, set C_v := C_v^{+l} and L_opt := L(C_m, C_v^{+l}, C_g). (d) Set τ_l = dCor(y, x_l) if l ∉ C_g, else τ_l = −∞, l = 1, . . . , s, and set stop = FALSE.
While not stop do: let l̃ = argmax{τ_l, l = 1, . . . , s}; if the inclusion of x_l̃ improves the prior-penalized lower bound, set C_g := C_g^{+l̃}, L_opt := L(C_m, C_v, C_g^{+l̃}) and stop = TRUE; otherwise set τ_l̃ = −∞ (stopping also when all τ_l = −∞). The RSMVA algorithm was implemented in R and the code is available from the authors on request.
Remarks The RSMVA algorithm does not consider models of all sizes; it stops when the important covariates have been included, so that the computations involve only low-dimensional matrices. This makes our method work in high-dimensional problems in which, amongst a large number of potential covariates, only a few are significant. Note that in such situations the full model fitting of [19] and the MCMC approach of [28] are very time demanding, if not impossible. In the simulation studies below, we consider an example with 1000 potential predictors.
When k = 1, the RSMVA reduces to the ranking algorithm for variable selection in heteroscedastic linear regression proposed in [20]. Then, as noted by [20], the ranking algorithm is similar to frequentist matching pursuit and greedy algorithms [16,33,8], while having many good properties such as not requiring any extra tuning parameter and not penalizing non-zero coefficients for doing variable selection. The reader is referred to [20], Section 3.5, for more details.

Simulation study
This section considers a simulation study of our method. The data are simulated from the regression density model

y_i | x_i ~ Σ_j π_j(x_i) N(x_i′β_j, exp(x_i′α_j)),   i = 1, . . . , n,

with π_j(x_i) ∝ exp(x_i′γ_j) and γ_1 ≡ 0, where the true parameter vectors β_j, α_j, γ_j are given in Table 1. The predictors x_i = (1, x̃_i)′ are generated by first generating x̃_i from the normal distribution N(0, Σ) (with Σ specified below) and then transforming each component into the unit interval by the cumulative distribution function Φ(·) of the standard normal. The reason for the transformation is to control the magnitude of the noise levels exp(x_i′α_j) and mixing coefficients π_j(x_i). It is natural to always include intercepts in the three models. The performance is measured by correctly-fitted rates for variable selection in the three models, for the selection of the number of components, and for overall model selection, over 50 replications. That is, the correctly-fitted rate for a model is the proportion of replications in which the true model is correctly identified.
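A data generating process of this kind can be sketched as follows (our illustration: the coefficient values are placeholders, not those of Table 1, and k = 2 for brevity):

```python
import math
import numpy as np

def simulate_rde_mhn(n, beta, alpha, gamma, rng):
    """Draw (y_i, x_i) in the style of the simulation study: raw
    covariates from N(0, Sigma) with Sigma_ij = 0.5^|i-j|, mapped to
    (0, 1) by the standard normal cdf Phi, an intercept prepended, a
    component drawn from the logit gating model (gamma[0] = 0), and y
    from the selected heteroscedastic normal."""
    k, p = beta.shape
    s = p - 1
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(s), np.arange(s)))
    x_raw = rng.multivariate_normal(np.zeros(s), Sigma, size=n)
    x_unit = 0.5 * (1.0 + np.vectorize(math.erf)(x_raw / math.sqrt(2.0)))  # Phi
    X = np.column_stack([np.ones(n), x_unit])            # prepend intercept
    eta = X @ gamma.T                                    # gating predictors
    pi = np.exp(eta - eta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    comp = np.array([rng.choice(k, p=row) for row in pi])
    mu = np.take_along_axis(X @ beta.T, comp[:, None], 1).ravel()
    sd = np.exp(0.5 * np.take_along_axis(X @ alpha.T, comp[:, None], 1).ravel())
    y = mu + sd * rng.normal(size=n)
    return y, X, comp

rng = np.random.default_rng(2)
beta = np.array([[0.0, 2.0, 0.0], [5.0, -2.0, 1.0]])    # hypothetical values
alpha = np.array([[-1.0, 0.0, 0.0], [-2.0, 1.0, 0.0]])
gamma = np.array([[0.0, 0.0, 0.0], [0.0, 3.0, -3.0]])
y, X, comp = simulate_rde_mhn(500, beta, alpha, gamma, rng)
```

Mapping the raw covariates through Φ confines them to (0, 1), which bounds x_i′α_j and x_i′γ_j and hence keeps the noise levels and mixing probabilities in a controlled range, as noted above.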
An easy example We first consider an easy problem in which the covariates have small correlations, setting Σ_ij = 0.5^|i−j|. A low-dimensional case with s = 5 and a higher-dimensional case with s = 100 are investigated. For the latter, the first five entries of β_j, α_j, γ_j are the same as in Table 1, while the rest are all zeros. Note that the full model fitting of [19] and the MCMC approach of [28] are almost impossible with s = 100, and a full model fit would in any case be very poor because all of the irrelevant covariates are included. Table 2 summarizes the simulation results for two cases, n = 500 and n = 1000. The results suggest that the RSMVA algorithm is able to correctly identify the zero coefficients in the mean, variance and gating models, the true number of components, and the true overall model. On average, the CPU time per replication is 4.5 minutes for n = 500 and 10.1 minutes for n = 1000 in the low-dimensional case, and 5.9 and 18.7 minutes for n = 500 and n = 1000, respectively, in the higher-dimensional case. The code is written in the R language and run on an Intel Core i7-2600 3.40GHz desktop.
We also consider an example in which the data are generated from a simple homoscedastic linear regression model, i.e. from a RDE-MHN(1) model with only an intercept in the variance model. The data is simulated from

y_i = x_i′β + σε_i, ε_i ~ N(0, 1), i = 1, …, n,

with β as β_1 above. The correctly-fitted rates are summarized in Table 3 for four cases, n = 500, 1000 and σ = 1, 2. The CPU time is approximately 3.1 and 6.9 minutes per replication for n = 500 and n = 1000 respectively. These results suggest that the RSMVA algorithm is able to reach a simple homoscedastic linear regression model when such a model is appropriate.
A harder example We now consider a harder example in which the covariates are highly correlated, with Σ_ij = 0.9^{|i−j|}. The simulation results are summarized in Table 4. The correctly-fitted rates for the variance and gating models are now smaller than those in Table 2, because fitting the variance and gating models is much harder than fitting the mean model. However, even when the algorithm misidentifies the model, it typically misidentifies only one or two covariates, and the mean squared errors of the coefficient estimates remain small (results not shown).
A small-n large-s example We finally consider an example in which the number of potential covariates s is much larger than the number of observations n. We set s = 1000 and n = 500 and consider two cases as before: the easy case with Σ_ij = 0.5^{|i−j|} and the harder case with Σ_ij = 0.9^{|i−j|}. The data generating process is similar to those in the previous examples. Table 5 summarizes the correctly-fitted rates and shows that the performance is very similar to the case with s = 100 and n = 500. This suggests that the large number of irrelevant covariates does not have much influence; they are easily dropped by the ranking algorithm. The CPU time taken is about 51.7 minutes for each replication.

Applications
We first describe the log predictive density score considered by [28] and [19] to measure the performance of a model. Suppose we split the data y into two parts: a future or validation set y_F and a training set y_\F. The log predictive density score is defined by

LPDS = log p(y_F | y_\F), (6.1)

where p(y_F | y_\F) = ∫ p(y_F | θ) p(θ | y_\F) dθ can be approximated by Monte Carlo samples from the posterior p(θ | y_\F), which in turn can be replaced with the variational posterior q(θ). A simpler method to estimate p(y_F | y_\F) is the plug-in method, in which p(y_F | y_\F) is estimated by p(y_F | θ̂(y_\F)), with θ̂(y_\F) a point estimate of the model parameters based on the data y_\F. We use the plug-in method in the following examples. If the observations in y are exchangeable and we randomly split y into B roughly equal parts F_1, …, F_B, then the B-fold cross-validation log predictive density score is defined as

LPDS = Σ_{i=1}^{B} log p(y_{F_i} | y_{\F_i}). (6.2)

Note that for time series data y_{1:T} the cross-validation idea in definition (6.2) is not natural; however, the log predictive density score in (6.1) can still be well-defined as follows. If y_{1:t} is the training set and y_{t+1:T} is the validation set, predictive performance is measured by

LPDS = log p(y_{t+1:T} | y_{1:t}) = Σ_{i=1}^{T−t} log p(y_{t+i} | y_{1:t+i−1}), (6.3)

with p(y_{t+i} | y_{1:t+i−1}) = ∫ p(y_{t+i} | θ, y_{1:t+i−1}) p(θ | y_{1:t+i−1}) dθ. A disadvantage of using (6.3) is that we lose some of the information contained in the validation sets during the learning process. Furthermore, using (6.2) or (6.3) as a model selection criterion may be time consuming in some cases.
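The plug-in version of the B-fold cross-validation score (6.2) can be sketched as follows; `fit` and `logpdf` are hypothetical user-supplied callables standing in for the model fitting and predictive density routines, not part of the paper's code:

```python
import numpy as np

def cv_lpds(y, X, fit, logpdf, B=10, rng=None):
    """B-fold cross-validation log predictive density score, plug-in version.

    `fit(X, y)` returns a point estimate theta_hat of the model parameters;
    `logpdf(theta_hat, X, y)` returns the pointwise log predictive density of
    each observation under that estimate.
    """
    rng = rng or np.random.default_rng()
    n = len(y)
    folds = np.array_split(rng.permutation(n), B)      # B roughly equal parts
    score = 0.0
    for F in folds:
        train = np.setdiff1d(np.arange(n), F)
        theta_hat = fit(X[train], y[train])            # point estimate on y_{\F}
        score += logpdf(theta_hat, X[F], y[F]).sum()   # adds log p(y_F | theta_hat)
    return score
```

For a homoscedastic linear model, for instance, `fit` could be an ordinary least squares fit and `logpdf` a normal log density evaluated at the fitted values.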

Diabetes data
We apply our method to analyze a benchmark data set on the progression of diabetes [8,20]. Ten baseline variables, age, sex, body mass index, average blood pressure and six blood serum measurements, were obtained for each of n = 442 diabetes patients, as well as the response of interest y, a quantitative measure of disease progression one year after baseline. Previous analysis in the literature assumed a RDE-MHN(1) model [8,20]. Our goal is to see if relaxing the assumption of k = 1 components gives us more flexibility in exploring the structure in the conditional distribution of y, and leads to a model with better predictive performance.
The RSMVA algorithm selects a RDE-MHN(3) model with homoscedastic components, only intercepts in the mean and variance models, and covariates 3 and 9 in the gating model. We call this model A. If we fix k = 1, the RSMVA algorithm (this is the VAR algorithm of Nott et al. 2011b) selects an intercept and covariates 2, 3, 7 and 9 in the mean model, and only an intercept in the variance model. We call this model B. The 10-fold cross-validation log predictive density score of model A is −236.7 and that of model B is −241.3. This suggests that model A has better predictive performance than model B. Figure 1 shows the fitted RDE-MHN model with 3 homoscedastic components. We have separated the observed responses into clusters according to the component in which each response is most likely to lie. The planes show the fitted means, which are 149.7, 72.4 and 259.7. The right column shows the fitted mixing probabilities. Figure 2 shows plots of standardized residuals versus fitted values for the two models RDE-MHN(3) and RDE-MHN(1). The RDE-MHN(3) model gives a more satisfying residual plot, because the absolute residuals of the RDE-MHN(1) model increase as the fitted values increase. These plots suggest visually that the distribution of y is better modeled by a mixture of 3 components. The CPU time taken to run the RSMVA algorithm to obtain the final model in this example is 3.9 minutes.

Rainfall runoff emulation model
This application is concerned with model emulation of a deterministic rainfall runoff model. The goal of model emulation is to replace a computationally expensive deterministic model with a computationally cheap statistical model (emulator) that may allow similar results to be achieved in a manageable computational time. The model we emulate is the Australian Water Balance Model (AWBM) of [3]. The reader is referred to [19] for a more detailed description. We are concerned with estimating the distribution of the AWBM streamflow at a time of peak rainfall (response y) as a function of three AWBM parameters/covariates: the maximum storage capacity S (x_1), the baseflow recession factor K (x_2) and the baseflow index BFI (x_3). We have available a dataset obtained for 500 different values of the parameters S, K and BFI. Following [19], we also add independent normal random noise with standard deviation 0.01 to the response y to avoid degeneracies in the variance model in regions of the space where the response tends to be identically zero.
[19] emulated the AWBM streamflow response as a function of S and K (the response was found to be insensitive to BFI). They fitted five RDE-MHN models to the data. The first four models (named A, B, C and D) have k = 2, 3, 4 and 5 components respectively, with both covariates (apart from an intercept) in the mean, variance and gating models. The fifth, model E, has k = 4 components with only an intercept in the variance model. Based on the 10-fold cross-validation log predictive density scores, [19] chose model C (among the five considered). The 10-fold cross-validation log predictive density score of model C is −57.4. This value is slightly different from the value reported in [19], because of the randomness in adding noise to the responses, the difference in the priors used and the randomness of the partition in the cross-validation. If we fix the covariates for inclusion in the mean, variance and gating models as in the four models A–D above and let the split-and-merge variational approximation algorithm (Algorithm 2) determine the number of components, then k = 4 is selected, which is consistent with the finding of [19]. The CPU time taken is 1.6 minutes. In order to find the best model C, [19] computed the cross-validation log predictive density scores for the four models. As reported in their Table 2, the CPU time taken is 16.3 minutes for variational approximation and 5.8 hours for MCMC. Our method is therefore roughly 10 and 216 times faster, respectively, than selecting the number of components by log predictive density scores under variational approximation and MCMC. It should, however, be noted that these differences may be partly due to the different CPUs used to carry out the computations in the two papers.
We now consider the problem of simultaneous variable selection and component selection. The RSMVA algorithm selects a RDE-MHN(4) model with both covariates in the mean and variance models and only covariate S in the gating model. The CPU time taken is 4.38 minutes. We call this model F; its 10-fold cross-validation log predictive density score is −52.9. This suggests that model F has better predictive performance than model C, and illustrates that variable selection can improve log predictive density scores. Figure 3 summarizes the fitted RDE-MHN(4) model and gives a visual picture of the structure of the data.

Standard & Poor's 500 index
We reanalyze the data set of returns on the Standard & Poor's 500 stock market index considered in [10], [28] and [19]; these approaches rely on future observations for model selection. The data consist of 4646 daily returns from January 1, 1990 to May 29, 2008. The response y_t is log(p_t/p_{t−1}), with p_t the closing index on day t. The goal is to flexibly estimate the distribution of y_t given the data y_{1:t−1} up to time t − 1. Modeling p(y_t|y_{1:t−1}) is challenging because it is well known in the economics literature that the distribution of y_t is non-standard and has heavy tails. The distribution of y_t would therefore not be appropriately modeled by a simple parametric density such as a normal. We can set up the density estimation problem for p(y_t|y_{1:t−1}) as a regression density estimation problem in which the covariates are functions of lagged response values. We refer the reader to [28] for a list and definition of the potential covariates. [19] estimate the density function of y_t by four RDE-MHN models in which the mean model contains only an intercept, the variance and gating models contain an intercept and the covariates RLastWeek, RLastMonth and MaxMin95, and k = 1, 2, 3 and 4 experts. Using a validation data set consisting of 199 future observations, they report that the RDE-MHN(2) model is adequate. The CPU time taken by their approach (including times for the initial fit and for computing the log predictive density scores) is 5.4 hours. Our split-and-merge variational approximation algorithm (Algorithm 2) gives the same result, with a CPU time of 9.5 minutes. Note that our method for selecting the number of experts k does not rely on future observations. We now let the RSMVA algorithm itself select the important variables as well as the number of components. As is usual in the literature on stock market return data, where a mean relation is not expected, we restrict the mean model to include only an intercept.
Our algorithm then selects a RDE-MHN model similar to the model found adequate above, except that the covariate RLastWeek drops out of the variance model. The CPU time is 2.1 hours. This model has a log predictive density score of −472.2, slightly worse than the score of −471.1 for the model suggested by [19].

Conclusions
Our paper describes a split-and-merge variational approximation strategy for fitting the RDE-MHN model. The approach automatically determines the number of components and is able to overcome the local maxima problem in fitting mixture models. We also present a fast greedy algorithm for variable selection. The full algorithm, RSMVA, provides a computationally thrifty path following strategy for doing simultaneous variable selection, parameter estimation and component selection. The RSMVA is able to reach a simple model if such a model is warranted, and in the special case of k = 1 components, reduces to well-studied algorithms in the literature. The proposed methodology applies to high-dimensional problems.
The RSMVA algorithm can be regarded as a forward greedy algorithm because it considers adding at each step another covariate to the current model. A drawback, as in many other greedy forward algorithms, is that if a predictor has been wrongly selected then it cannot be removed anymore. Adding a backward elimination process would help correct mistakes made in earlier forward selection steps. This research direction is currently in progress.
Appendix A

Maximization with respect to μ^q_{β_j}, with the other terms held fixed, leads to

μ^q_{β_j} = Σ^q_{β_j} ( (Σ^0_{β_j})^{-1} μ^0_{β_j} + V′D_j y ),

where V = (v_1, …, v_n)′ is the design matrix for the mean model and D_j is the diagonal matrix with ith entry q_ij / exp(w_i′μ^q_{α_j} − ½ w_i′Σ^q_{α_j} w_i), j = 1, …, k. Maximization with respect to Σ^q_{β_j} leads to

Σ^q_{β_j} = ( (Σ^0_{β_j})^{-1} + V′D_j V )^{-1}.

If no parametric form for the variational posterior q(α_j) is assumed, then the optimal choice for q(α_j) is [see, for example, 21]

q(α_j) ∝ exp( E_{−α_j}[ log p(y, θ) ] ),

where the expectation is with respect to all parameters except α_j. It can be shown that

log q(α_j) = log p(α_j) − ½ Σ_i q_ij ( w_i′α_j + exp(−w_i′α_j) r̃_ij ) + const, with r̃_ij = (y_i − v_i′μ^q_{β_j})² + v_i′Σ^q_{β_j} v_i,

which takes the form of the posterior (apart from a normalization constant) in a Bayesian generalized linear model (GLM) with gamma response and log link, coefficient of variation 2/q_ij, responses r̃_ij, i = 1, …, n, and the log of the mean response being w_i′α_j. The prior in this Bayesian GLM is N(μ^0_{α_j}, Σ^0_{α_j}). If we use a quadratic approximation to log q(α_j), this results in a normal approximation to q(α_j) with mean the posterior mode and covariance matrix the negative inverse Hessian of the log posterior at the mode. The computations required are standard ones in fitting a Bayesian GLM. Write Λ(α_j) (as a function of α_j) for the diagonal matrix with ith entry (q_ij/2) exp(−w_i′α_j) r̃_ij; the negative Hessian of log q(α_j) is then W′Λ(α_j)W + (Σ^0_{α_j})^{-1}, with W = (w_1, …, w_n)′ the design matrix for the variance model. Maximization with respect to q_ij is easy. Letting

S_ij = log p_ij(μ^q_γ) − ½ w_i′μ^q_{α_j} − ½ exp(−w_i′μ^q_{α_j} + ½ w_i′Σ^q_{α_j} w_i) r̃_ij,

the update is q_ij ∝ exp(S_ij), normalized so that Σ_j q_ij = 1 for each i. Finally, maximization of the lower bound L with respect to μ^q_γ is equivalent to maximization of log p(μ^q_γ) + Σ_{i,j} q_ij log p_ij(μ^q_γ). This is the log posterior in a Bayesian multinomial regression with a normal prior on the regression parameter μ^q_γ and with ith response (q_i1, …, q_ik)′. Although the response vectors are not in the typical multinomial form, the usual iterative optimization algorithms can be used to find the mode. The Newton method for fitting this Bayesian multinomial regression model is presented in Appendix B.
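The closed-form Gaussian update for q(β_j) above is a precision-weighted ridge regression. A minimal sketch in Python, assuming the quantities defined in this appendix (the function name and array-based interface are illustrative, not the paper's code):

```python
import numpy as np

def update_q_beta(V, y, q_j, mu_alpha_j, Sigma_alpha_j, W,
                  mu0_beta, Sigma0_beta):
    """Closed-form update of the Gaussian variational factor q(beta_j).

    Computes Sigma^q_{beta_j} = (Sigma0^{-1} + V' D_j V)^{-1} and
    mu^q_{beta_j} = Sigma^q_{beta_j} (Sigma0^{-1} mu0 + V' D_j y), where D_j
    is diagonal with entries q_ij * exp(-w_i' mu_alpha_j + 0.5 w_i' Sigma_alpha_j w_i).
    """
    # ith diagonal entry of D_j: q_ij * E_q[exp(-w_i' alpha_j)]
    quad = np.einsum('ij,jk,ik->i', W, Sigma_alpha_j, W)   # w_i' Sigma w_i
    d = q_j * np.exp(-W @ mu_alpha_j + 0.5 * quad)
    Sigma0_inv = np.linalg.inv(Sigma0_beta)
    Sigma_q = np.linalg.inv(Sigma0_inv + V.T @ (d[:, None] * V))
    mu_q = Sigma_q @ (Sigma0_inv @ mu0_beta + V.T @ (d * y))
    return mu_q, Sigma_q
```

Note that the same responsibility-weighted diagonal matrix D_j appears in both the mean and covariance updates, so it is computed once and reused.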
We now summarize the above optimization process. After initializing the parameters, the following iterative scheme (Algorithm 1) is performed.
1. Update Σ^q_{β_j}, j = 1, …, k.
2. Update μ^q_{β_j}, j = 1, …, k.
3. Update μ^q_{α_j}, j = 1, …, k, as the mode of log q(α_j).
4. Update Σ^q_{α_j}, j = 1, …, k, as the negative inverse Hessian of log q(α_j) at the mode.
5. Update q_ij, i = 1, …, n, j = 1, …, k.
6. Update μ^q_γ as the posterior mode in a Bayesian multinomial regression with normal prior N(μ^0_γ, Σ^0_γ) and ith response (q_i1, …, q_ik)′.
7. Repeat steps 1-6 until the increase in the variational lower bound (A.1) is less than some user-specified tolerance.
To initialize the algorithm, [19] first perform a k-means clustering algorithm to cluster the n vectors (y_i, v_i), i = 1, …, n, into k clusters, then assign q_ij = 1 if the ith observation lies in cluster j and 0 otherwise. For each cluster j, an ordinary least squares fit of y_i to the predictors v_i is performed to obtain an estimate β̂_j of β_j and squared residuals r_i = (y_i − v_i′β̂_j)²; an initial estimate for μ^q_{α_j} and Σ^q_{α_j} is then obtained by fitting log r_i to the predictors w_i. The iterative scheme in Algorithm 1 is now ready to be performed for all parameters. In our experience, the final estimate of the parameters at convergence is typically insensitive to the initial value of q_ij. However, a good initial value makes the algorithm converge quickly.
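This initialization can be sketched as follows, with a plain Lloyd-iteration k-means written out for self-containedness (the function name and the small jitter constant guarding the log of zero residuals are our own illustrative choices; in practice any k-means routine would do):

```python
import numpy as np

def initialize(y, V, W, k, n_iter=20, rng=None):
    """Initialization sketch: cluster the vectors (y_i, v_i), then per-cluster OLS."""
    rng = rng or np.random.default_rng()
    data = np.column_stack([y, V])
    centers = data[rng.choice(len(y), k, replace=False)]
    for _ in range(n_iter):                            # plain Lloyd iterations
        dists = ((data[:, None, :] - centers[None]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    q = np.zeros((len(y), k))
    q[np.arange(len(y)), labels] = 1.0                 # hard assignments q_ij
    beta_hat, alpha_hat = [], []
    for j in range(k):
        idx = labels == j
        bj, *_ = np.linalg.lstsq(V[idx], y[idx], rcond=None)   # OLS fit per cluster
        r = (y[idx] - V[idx] @ bj) ** 2                        # squared residuals
        # regress log residuals on w_i to initialize the variance model
        aj, *_ = np.linalg.lstsq(W[idx], np.log(r + 1e-8), rcond=None)
        beta_hat.append(bj)
        alpha_hat.append(aj)
    return q, np.array(beta_hat), np.array(alpha_hat)
```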

Appendix B
We now present the Newton method for fitting a Bayesian multinomial regression model with a normal prior. Let y_1, …, y_n ∈ IR^k be n multinomial-type responses with k categories and z_1, …, z_n ∈ IR^p be vectors of covariates. Note that in the application to finding the mode μ^q_γ in step 6 of Algorithm 1, the ith pseudo-response vector is (q_i1, …, q_ik)′. Write Y = (y_1, …, y_n)′, Z = (z_1, …, z_n)′ and Y* for the matrix Y without the first column. Taking the first category as the baseline (γ_1 = 0) and writing p_ij(γ) = exp(z_i′γ_j)/Σ_l exp(z_i′γ_l), the goal is to minimize the negative of the log of the posterior,

f(γ) = −Σ_i Σ_j y_ij log p_ij(γ) + ½ (γ − μ^0_γ)′(Σ^0_γ)^{-1}(γ − μ^0_γ), γ = (γ_2′, …, γ_k′)′.
Write P = P(γ) = (p_ij(γ)) for the n × (k − 1) matrix with entries p_ij(γ), i = 1, …, n, j = 2, …, k. The gradient can then be written as

∇f(γ) = vec( Z′(P − Y*) ) + (Σ^0_γ)^{-1}(γ − μ^0_γ),

and the Hessian of f is Γ + (Σ^0_γ)^{-1}, where Γ = Γ(γ) is a (k − 1) × (k − 1) block matrix with the (j, l) block of the form

Z′ diag( p_j ⊗ (δ_jl 1 − p_l) ) Z,

where p_j is the jth column of P, 1 is the vector of 1's, δ_jl is 1 if j = l and 0 otherwise, and ⊗ is the direct (elementwise) product. The Newton method for estimating the mode of p(γ|Y) is as follows: starting from an initial value γ^(0), iterate

γ^(t+1) = γ^(t) − ( Γ(γ^(t)) + (Σ^0_γ)^{-1} )^{-1} ∇f(γ^(t))

until convergence.
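A compact sketch of this Newton scheme, assuming the baseline-category parameterization γ_1 = 0 and a prior precision matrix supplied directly (the function name and interface are illustrative):

```python
import numpy as np

def newton_multinomial(Y, Z, mu0, Sigma0_inv, n_iter=50, tol=1e-8):
    """Newton's method for the mode of a Bayesian multinomial regression.

    Y: n x k (pseudo-)responses with rows summing to 1; category 1 is the
    baseline, so the parameter is gamma = (gamma_2', ..., gamma_k')'.
    """
    n, k = Y.shape
    p = Z.shape[1]
    d = p * (k - 1)
    gamma = np.zeros(d)
    Ystar = Y[:, 1:]                               # Y without the baseline column
    for _ in range(n_iter):
        G = gamma.reshape(k - 1, p).T              # p x (k-1) coefficient matrix
        eta = np.column_stack([np.zeros(n), Z @ G])
        eta -= eta.max(axis=1, keepdims=True)      # numerical stability
        P = np.exp(eta)
        P /= P.sum(axis=1, keepdims=True)
        P = P[:, 1:]                               # n x (k-1)
        grad = (Z.T @ (P - Ystar)).T.ravel() + Sigma0_inv @ (gamma - mu0)
        # block Hessian: (j, l) block is Z' diag(p_j * (delta_jl - p_l)) Z
        H = np.empty((d, d))
        for j in range(k - 1):
            for l in range(k - 1):
                w = P[:, j] * ((j == l) - P[:, l])
                H[j*p:(j+1)*p, l*p:(l+1)*p] = Z.T @ (w[:, None] * Z)
        step = np.linalg.solve(H + Sigma0_inv, grad)
        gamma -= step
        if np.linalg.norm(step) < tol:
            break
    return gamma
```

The rows of Y need not be 0/1 vectors; fractional responsibilities q_ij are handled identically, which is exactly how the routine is used in step 6 of Algorithm 1.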