Fast estimation of mixed-effects location-scale regression models

As a result of advances in data collection technology and study design, modern longitudinal datasets can be much larger than they historically have been. Such "intensive" longitudinal datasets are rich enough to allow for detailed modeling of the variance of a response as well as the mean, and a flexible class of models called mixed-effects location-scale (MELS) regression models are commonly used to do so. However, fitting MELS models can pose computational challenges related to the numerical evaluation of multi-dimensional integrals; the slow runtime of current methods is inconvenient for data analysis and makes bootstrap inference impractical. In this paper, we introduce a new fitting technique, called FastRegLS, that is considerably faster than existing techniques while still providing consistent estimators for the model parameters.


INTRODUCTION
As a result of advances in data collection technology and study design, modern longitudinal datasets can be much larger than they historically have been. In particular, studies involving smartphone-based data collection, wearable sensors, or Ecological Momentary Assessment (EMA) methodology can result in datasets with a very large number of observations (eg, 30+) per subject. [1][2][3] For example, in longitudinal psychology studies, subjects may be asked to self-report their mood via smartphone or handheld computer many times in a short observation window. 4 In studies of physical activity, researchers may have access to exercise volume data from wearable sensors at regular and frequent intervals, [5][6][7] potentially resulting in hundreds or thousands of observations per participant. Datasets resulting from studies like these are often called "intensive" longitudinal datasets. 5,6,8 Intensive longitudinal data are rich enough to allow for detailed modeling of the variance of a response as well as the mean, and there is often scientific interest in doing so. 9,10 In longitudinal studies of mood, for example, a researcher might wish to know if a certain covariate is associated with a more stable (ie, lower variance) mood. Such questions cannot be answered by the traditional models used for longitudinal data, such as the mixed-effects model, which has only a single parameter to model the residual variance. This is not a huge limitation when analyzing smaller datasets, where it would not generally be feasible to estimate additional variance parameters with meaningful precision. However, for intensive longitudinal data, more general models are needed.
The mixed-effects location-scale (MELS) regression model was developed to address this shortcoming of the traditional methods, and is now a common technique for analyzing intensive longitudinal data. [11][12][13][14] It has also been used to analyze other types of clustered data, such as survey data 15 and data from education research. 16 The model allows fixed effects and subject-level random effects to influence the mean of a response as well as the variation around that mean. The fixed effects used to model the variance allow researchers to investigate the association between covariates and response stability, while the random effects used to model the variance capture subject-level variability in response stability not captured by the fixed effects. In a mood study, say, even after controlling for covariates, some subjects might have more stable moods than others, and it is exactly this sort of difference that the random effects capture.
A limitation of the MELS model is that it is difficult to fit computationally. Current techniques compute and maximize the marginal likelihood using Gaussian quadrature to integrate out the random effects, but this can be very slow, especially as the number of random effects increases. It can therefore be inconvenient in settings where many slightly different models are fit and examined, such as exploratory data analysis and model building and checking. Moreover, due to the computational cost, using resampling methods like the bootstrap for inference on the model parameters is not practical, leaving researchers without a robust way to estimate standard errors or determine significance of model parameters. Researchers have also used Bayesian methods to fit the MELS model. [17][18][19][20] These techniques are natural given the hierarchical structure of the MELS model, but they are not meant to reduce computational burden. We will restrict ourselves to the frequentist interpretation of the model.
In this paper, we introduce a new fitting technique for the MELS model, called FastRegLS, that is a great deal faster (as much as 100× for some datasets) and more numerically stable than quadrature-based maximum likelihood, while still providing consistent parameter estimates. Because it is so much faster, FastRegLS has the added benefit of enabling bootstrap inference for the MELS model parameters, and can considerably speed up exploratory data analysis and model diagnostics. Such procedures have always been computationally intensive for MELS models, but they are feasible using FastRegLS. We note also that FastRegLS can be used in tandem with existing techniques. For example, one could check how reasonable model-based maximum likelihood Wald intervals are by computing the FastRegLS bootstrap intervals and comparing: a difference could be evidence of deviation from the specified model. In addition, FastRegLS could be used to provide starting values for a maximum likelihood algorithm.
Our method is implemented in an R package, FastRegLS, which is included in the supplemental material. However, an additional advantage of FastRegLS is that it can be easily implemented using any software that can fit linear mixed-effects models; unlike quadrature-based methods, sophisticated numerical computing is not required.
The organization of this paper is as follows. In section 2, we give additional background and a formal statement of the MELS model. In section 3, current maximum likelihood-based fitting methods are described. In section 4, we describe our new method, FastRegLS. In section 5, we present empirical verification of the theoretical properties of FastRegLS, and compare FastRegLS with maximum likelihood on both real and simulated data. We conclude in section 6 with a discussion.

THE MELS MODEL
Suppose we have longitudinal data y collected from N total subjects, with n_i observations per subject, i = 1, … , N, for a total of T = ∑_{i=1}^N n_i data points. Consider first the basic mixed-effects model commonly fit to such data:

y_ij = x_ij⊺ β + υ_i + ε_ij,    (1)
υ_i ∼ N(0, σ²_υ),    (2)
ε_ij ∼ N(0, σ²_ε).    (3)

Here y_ij is the jth observation from subject i, i = 1, … , N, which we assume is a Gaussian response variable. The index j runs from 1 to n_i. The vector x_ij ∈ R^k is a covariate vector corresponding to the single observation y_ij, a single row of a larger design matrix X ∈ R^{T×k}, and β ∈ R^k is the vector of regression coefficients. The subject-level random effect υ_i follows a N(0, σ²_υ) distribution, and the residual error ε_ij follows a N(0, σ²_ε) distribution. The random effect variance σ²_υ is also called the between-subjects (BS) variance, since it characterizes how much subjects differ from the global baseline mean determined by β. The residual variance σ²_ε is also called the within-subjects (WS) variance, since it characterizes how much the observations y_ij differ from the mean value x_ij⊺ β + υ_i. Since σ²_ε is a constant, the WS variance is not dependent on subject identity or covariates. The MELS model extends this model by replacing the residual error distribution (3) with the following:

ε_ij ∼ N(0, σ²_ε,ij),    (4)
σ²_ε,ij = exp(w_ij⊺ τ + τ_l υ_i + ω_i),    (5)
ω_i ∼ N(0, σ²_ω).    (6)

The WS variance is now the exponential (because the variance must be positive) of a linear predictor. It is dependent on covariates W ∈ R^{T×p}, of which the w_ij⊺ are rows, with a corresponding regression vector τ ∈ R^p. Throughout this paper, we will assume that the WS variance model has an intercept term, that is, that the first column of W is a vector of ones. The third term on the right-hand side of (5) is an additional subject-level random effect ω_i, which also follows a normal distribution. Just as υ_i captures subject-level variation in the mean of y above and beyond the specified covariates x_ij, so ω_i captures subject-level variation in the variance of y above and beyond the covariates w_ij.
From now on, υ_i will be referred to as the random location effect, and ω_i as the random scale effect (giving the MELS model its name). The second term in (5), τ_l υ_i, represents the association between the random location effect and the random scale effect. A similar effect can be achieved by allowing υ_i and ω_i to be correlated, but we prefer this regression-like association for its modeling flexibility. 14 Throughout this paper, the term "MELS model" will refer to the model defined by (1), (2), (4)-(6), and its parameters will be denoted by Θ = (β, τ, τ_l, σ²_υ, σ²_ω). We note that although we have chosen a longitudinal data setting to explain the MELS model, and will continue to refer to the groupings of correlated observations as "subjects," the same model can be used for clustered data more generally, in which the "subjects" might be geographical locations, schools, and so forth.
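To make the data-generating process in (1), (2), (4)-(6) concrete, the following is a minimal simulation sketch in Python/NumPy. The function name, the balanced design (every subject has n_i observations), and the covariate distributions are our own choices for illustration; var_u and var_w stand for σ²_υ and σ²_ω.

```python
import numpy as np

def simulate_mels(N, n_i, beta, tau, tau_l, var_u, var_w, seed=0):
    """Simulate one balanced dataset from the MELS model (illustrative sketch).

    Covariates are standard normal with an intercept column, matching the
    simulation designs described in the Results section.
    """
    rng = np.random.default_rng(seed)
    T = N * n_i
    k, p = len(beta), len(tau)
    X = np.column_stack([np.ones(T), rng.standard_normal((T, k - 1))])
    W = np.column_stack([np.ones(T), rng.standard_normal((T, p - 1))])
    subject = np.repeat(np.arange(N), n_i)       # subject index of each row
    u = rng.normal(0.0, np.sqrt(var_u), N)       # random location effects (upsilon)
    om = rng.normal(0.0, np.sqrt(var_w), N)      # random scale effects (omega)
    # WS variance: exponential of a linear predictor plus random effects, eq. (5)
    ws_var = np.exp(W @ tau + tau_l * u[subject] + om[subject])
    y = X @ beta + u[subject] + rng.normal(0.0, np.sqrt(ws_var))
    return y, X, W, subject
```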

MAXIMUM LIKELIHOOD ESTIMATION
Maximum likelihood (ML) estimation remains the most common way to fit the MELS model. It is implemented, for example, in software like MixRegLS, 14 Merlin, 21 and the PROC NLMIXED procedure in SAS. 22,23 The random location and scale effects are replaced by their standardized versions, and Gaussian quadrature is used to compute the necessary integrals. Specifically, if we let l_i = υ_i/σ_υ and s_i = ω_i/σ_ω, then l_i and s_i are independent N(0, 1) random variables, and

y_ij | l_i, s_i ∼ N(x_ij⊺ β + σ_υ l_i, exp(w_ij⊺ τ + τ_l σ_υ l_i + σ_ω s_i)).    (7)
Letting f(y_ij | l_i, s_i) denote the Gaussian likelihood corresponding to (7), the full likelihood conditional on l = (l_1, … , l_N)⊺ and s = (s_1, … , s_N)⊺ is given by

L(y | l, s) = ∏_{i=1}^N ∏_{j=1}^{n_i} f(y_ij | l_i, s_i).

The full unconditional likelihood L(y) is then obtained by integrating out l and s:

L(y) = ∫∫ L(y | l, s) g(l) g(s) dl ds,    (8)

where g(⋅) is the density of a standard normal random vector in R^N. No closed form expression exists for (8), so when maximizing L(y), (8) and its derivatives are approximated using Gaussian quadrature: each integral is replaced by a sum in which the integrand is evaluated at a pre-specified grid of points. Because the likelihood factorizes over subjects, the integration is carried out in two dimensions (one location and one scale effect) per subject; if there are Q points per dimension, every integral requires Q² integrand evaluations per optimization iteration. As the data examples in the Results section demonstrate, this procedure can be time intensive. In addition to being inconvenient for routine data analysis and model building, it makes resampling methods like the bootstrap infeasible. The option to bootstrap is particularly important for the MELS model and other mixed-effects models, since the variances based on the Fisher information at convergence (eg, as computed in Reference 24) do not always work well for variance components. 25 In the next section, we present FastRegLS, a new technique that provides consistent estimators at a fraction of the computation time of the current methods.
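The quadrature step can be illustrated with a small sketch: for one subject, the unconditional likelihood contribution is an expectation over the independent standard normal pair (l_i, s_i), approximated on a Q × Q Gauss-Hermite grid. The helper below is our own illustration (not part of any MELS software) and shows where the Q² integrand evaluations come from.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def gh_expectation_2d(h, Q):
    """Approximate E[h(L, S)] for independent standard normal L and S
    using a Q x Q Gauss-Hermite grid (Q**2 evaluations of h)."""
    x, w = hermgauss(Q)              # nodes/weights for weight function exp(-x^2)
    nodes = np.sqrt(2.0) * x         # change of variables to the N(0, 1) density
    weights = w / np.sqrt(np.pi)
    L, S = np.meshgrid(nodes, nodes)
    WL, WS = np.meshgrid(weights, weights)
    return float(np.sum(WL * WS * h(L, S)))
```

In the MELS likelihood, h would be the product of the Gaussian densities f(y_ij | l_i, s_i) over j for one subject; the full likelihood multiplies these per-subject approximations together.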

FASTREGLS

Illustration on a simpler model
FastRegLS is an extension of an approach of Harvey 26 for estimating the parameters of the simpler heteroskedastic regression model

y_i = x_i⊺ β + ε_i,    ε_i ∼ N(0, exp(w_i⊺ τ)),    i = 1, … , n,

where w_i1 = 1 for all i (ie, the variance model contains an intercept), in which the model is fit in two steps. First, β is estimated using least squares, β̂ = (X⊺X)^−1 X⊺ y. The residuals ε̂_i = y_i − x_i⊺ β̂ are then formed, and Harvey 26 observes that asymptotically,

log ε̂_i² ≈ w_i⊺ τ + ζ_i,

where ζ_i ∼ log χ²_1. The variance regression parameter τ is then estimated using

τ̂ = (W⊺W)^−1 W⊺ log ε̂² − (E log χ²_1) e_1,    (9)

where log ε̂² denotes the vector whose ith component is log ε̂_i², e_1 denotes the unit vector with 1 in the first position, that is, e_1 = (1, 0, … , 0)⊺, and −E log χ²_1 ≈ 1.27. It can be shown that (9) is asymptotically unbiased, and has variance

Var(τ̂) ≈ Var(log χ²_1)(W⊺W)^−1,

where Var(log χ²_1) = π²/2 ≈ 4.93. The estimator (9) is less efficient than the ML estimator, which has variance 2(W⊺W)^−1. However, the ML estimator must be obtained through an iterative optimization algorithm, whereas (9) can be computed in closed form and is much faster.
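Harvey's two-step estimator is simple enough to state in a few lines. The sketch below (our own helper, assuming NumPy) uses E log χ²_1 ≈ −1.27 to correct the intercept, which is assumed to sit in the first column of W as in the text.

```python
import numpy as np

E_LOG_CHISQ1 = -1.27  # E[log chi^2_1] = digamma(1/2) + log 2, approximately -1.27

def harvey_two_step(y, X, W):
    """Harvey's two-step estimator for y_i = x_i' beta + eps_i with
    Var(eps_i) = exp(w_i' tau); illustrative sketch."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # step 1: ordinary least squares
    resid = y - X @ beta_hat
    # step 2: regress log squared residuals on W, then shift the intercept
    # by -E[log chi^2_1] (approximately +1.27) to remove the asymptotic bias
    tau_hat, *_ = np.linalg.lstsq(W, np.log(resid**2), rcond=None)
    tau_hat[0] -= E_LOG_CHISQ1
    return beta_hat, tau_hat
```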
We now extend this approach to estimate the parameters of the MELS model. The FastRegLS algorithm consists of a parameter update and an initialization. We will describe the updates first, and then describe the initialization. See Algorithm 1 for a full summary of the algorithm. Notice the computational simplicity of the steps, which are either closed-form or can be solved using existing fast algorithms (such as lme4 27 for the random effects models). Note also the mix of EM-like updates and direct likelihood maximizations. For example, the inner-loop updates of β̂ and the random intercepts υ̂_i coincide with their EM analogues. On the other hand, the WS variance parameters and the final out-of-loop estimator of β rely on full likelihood fits of submodels chosen to guarantee good statistical properties (see the Properties section).

Parameter updates
Assume that we have estimates Θ̂_k = (β̂_k, τ̂_k, τ̂_l,k, σ̂²_υ,k, σ̂²_ω,k) from a previous iteration of the algorithm, as well as estimates of the random location and scale effects themselves, υ̂_i,k and ω̂_i,k for i = 1, … , N. Let σ̂²_ij,k+1 be the plug-in estimate of the WS variance formed from the updated parameter and random effect estimates.

Algorithm 1: FastRegLS. [The numbered steps of the algorithm did not survive extraction. The recoverable fragments show an inner loop that fits, by maximum likelihood, a mixed-effects submodel for the WS variance parameters; forms W_k+1, the diagonal weight matrix whose diagonal collects the plug-in WS variance estimates; and, after the loop ends, fits one final mixed-effects model by maximum likelihood.]
We update the estimates of the random location effects by computing their expected value conditional on the observed data y and the current parameter estimates:

υ̂_i,k+1 = E[υ_i | y, Θ̂_k].

Note that this is essentially an empirical Bayes procedure, where the hyperparameter of the prior (2) is determined using the data. Next, we update our estimate of β using weighted least squares: y_ij − υ̂_i,k+1 is regressed on X, using the σ̂²_ij,k+1 as inverse-variance weights, and β̂_k+1 is the resulting regression vector. More precisely, let V_k+1 ∈ R^T be the vector containing each υ̂_i,k+1 repeated n_i times, and let W_k+1 ∈ R^{T×T} be the diagonal matrix whose diagonal is the vector of weights 1/σ̂²_ij,k+1. The updated β is then given by

β̂_k+1 = (X⊺ W_k+1 X)^−1 X⊺ W_k+1 (y − V_k+1).

We update β in this way to avoid fitting a second (see below for the first) mixed-effects model within the loop, which would slow down the computation time. However, we do update β a final time out of the loop using a mixed-effects model, see below.
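These two in-loop updates can be sketched directly. The code below is a simplified illustration, not the package implementation: subject is an array of integer subject indices, ws_var_hat holds the plug-in WS variance estimates σ̂²_ij,k+1, and the posterior-mean formula for a heteroskedastic random-intercept model gives the conditional expectation of each υ_i.

```python
import numpy as np

def update_location_effects(y, X, beta_hat, subject, ws_var_hat, var_u_hat, N):
    """Empirical Bayes update: posterior mean of each random location effect
    given the current parameter estimates (sketch)."""
    prec = 1.0 / ws_var_hat                                   # per-observation precision
    num = np.bincount(subject, weights=prec * (y - X @ beta_hat), minlength=N)
    den = 1.0 / var_u_hat + np.bincount(subject, weights=prec, minlength=N)
    return num / den

def update_beta(y, X, u_hat, subject, ws_var_hat):
    """Weighted least squares update for beta (sketch): regress
    y_ij - u_hat_i on X with inverse-variance weights."""
    w = 1.0 / ws_var_hat
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, X.T @ (w * (y - u_hat[subject])))
```

With homoskedastic weights and zero random effects, update_beta reduces to ordinary least squares, which is one easy sanity check.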

Initialization
The initialization is similar to the in-loop updates. We begin by fitting the homoskedastic mixed-effects model (1)-(3) using ML, obtaining estimates b̂_0 of β and ĝ²_0 of σ²_υ, and set β̂_0 = b̂_0 and σ̂²_υ,0 = ĝ²_0. As in the in-loop updates, we initialize the random location effects by their expected values under (12)-(13) conditional on the observed data y and our parameter estimates:

υ̂_i,0 = E[υ_i | y, β̂_0, σ̂²_υ,0].

To initialize the WS variance parameters, letting

r_ij = y_ij − x_ij⊺ β̂_0 − υ̂_i,0,

we fit the model

log r_ij² = w_ij⊺ t + t_l υ̂_i,0 + s_i + e_ij,    s_i ∼ N(0, σ²_s),

using ML, and set τ̂_0 = t̂_0 − (E log χ²_1) e_1, τ̂_l,0 = t̂_l,0, and σ̂²_ω,0 = ŝ²_0. Finally, we initialize the random scale effects following (11), setting each ω̂_i,0 to the empirical Bayes prediction of the random intercept s_i from the fitted model above.

Properties

Theorem 1 (Consistency). The FastRegLS estimators are consistent for their respective parameters (regardless of k_max).
The estimators Θ̂ resulting from the FastRegLS algorithm are consistent for their respective parameters (see the supplementary material for a proof, and the Results section for an empirical demonstration). Importantly, the estimators remain consistent no matter how many times we execute the inner loop, that is, how large or small k_max is. The results presented in the Results section are all obtained using k_max = 1, except where noted. As k_max increases, the FastRegLS estimates tend to "converge" in the sense that the difference between successive estimates grows small, but this is not necessary for valid estimation. In practice, it is a good idea to verify that k_max is large enough that the change in the estimates from iteration to iteration is small. Large changes across iterations are possible when there are significant deviations from the specified model in the data used for fitting. For example, one of our simulations compares the performance of bootstrap to Wald confidence intervals using a dataset with some of the residual errors drawn from an exponential distribution (Table 3). To ensure that the FastRegLS estimates were stable, we used k_max = 5 for this simulation after observing large changes in the estimates over the first 2-3 iterations. We note also that increasing k_max does not necessarily produce estimates that are closer to those obtained using ML, although both converge to the true parameter values, and therefore get closer to each other, as the sample size increases.
There are two methods available to obtain standard errors for the FastRegLS estimators. The first is a Wald-type interval based on approximations to the asymptotic covariances of the estimators. It is difficult to derive exact formulas for these, but approximations that work well in practice can be obtained under the assumption that the parameter estimates used in the intermediate steps are known exactly (eg, treating the υ̂_i as fixed, known covariates when estimating the WS variance parameters). Formulas and details of the derivation are provided in the supplementary material.
Though the Wald intervals are convenient and work well in practice (see the Results section for simulations demonstrating good coverage rates), they (1) condition on variable quantities and therefore will tend to underestimate the uncertainties of the estimators, and (2) are model-dependent. The impact of (1) can be seen in Figure 2: although they still perform well, the intervals for WS variance parameters, which depend most on approximations of this type, tend to have lower coverage rates than the intervals for the other parameters. The nonparametric bootstrap is an alternative that accounts for all sources of variation, and is also more robust to model misspecification. The ability to bootstrap is a major advantage of FastRegLS over existing MELS model fitting techniques, which are not fast enough to perform a large number of iterations in a practical amount of time.
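For clustered data, the nonparametric bootstrap resamples whole subjects rather than individual observations, so that the within-subject correlation structure is preserved. The sketch below is our own illustration; fit_fn stands in for any fast fitting routine (such as a FastRegLS-style fit) that returns parameter estimates, and resampled copies of the same subject are relabeled so that each counts as a distinct subject when refitting.

```python
import numpy as np

def subject_bootstrap(fit_fn, y, X, W, subject, B=200, seed=0):
    """Nonparametric bootstrap for clustered data (sketch): resample whole
    subjects with replacement, refit, and collect the estimates."""
    rng = np.random.default_rng(seed)
    ids = np.unique(subject)
    out = []
    for _ in range(B):
        draw = rng.choice(ids, size=len(ids), replace=True)
        idx = [np.flatnonzero(subject == i) for i in draw]
        rows = np.concatenate(idx)
        # relabel so each resampled copy counts as a distinct subject
        new_subject = np.repeat(np.arange(len(draw)), [len(ix) for ix in idx])
        out.append(fit_fn(y[rows], X[rows], W[rows], new_subject))
    return np.asarray(out)
```

Percentile intervals then follow from, eg, np.percentile(estimates, [2.5, 97.5], axis=0) applied to the returned array.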

RESULTS
In this section, we present several applications of FastRegLS and compare its performance with that of ML. We also compare the performance of bootstrap and Wald intervals for FastRegLS. Unless otherwise specified, all results were obtained using k_max = 1.

Consistency
To empirically demonstrate the consistency of the FastRegLS estimators Θ̂, we simulated data from the MELS model with β ∈ R^4, τ ∈ R^4, each randomly generated with N(0, 1) entries, τ_l = 0.3, σ²_υ = 1, and σ²_ω = 1, for a variety of sample sizes. Figure 1 shows the mean squared error (MSE) of Θ̂ as a function of both the number of subjects in the sample N and the number of observations per subject n_i (each subject had the same number of observations). At each combination of N and n_i, 100 datasets were generated using the above parameter values, and the design matrices X and W had random N(0, 1) entries (except an intercept column of ones). The plotted point is the average MSE over the 100 runs when estimating the parameters using FastRegLS, and the error bars are plus/minus 2 standard deviations of this average. (Specifically, the MSE for each run was calculated as (1/11)||Θ̂ − Θ||²_2, since Θ ∈ R^11 for this simulation, and the plotted MSE is the average of this quantity over all runs.) The MSE approaches 0 as the sample size increases, but notice that we need both N → ∞ and n_i → ∞. This is because N limits how precisely we can estimate the variance components σ²_υ and σ²_ω, and thus for fixed N the MSE trajectory plateaus above 0.

Run time
Because it does not require numerical integration in two dimensions, FastRegLS is much faster than existing techniques. Table 1 compares the run times (in seconds) of FastRegLS with those of MixRegLS 14 on a sequence of increasingly large datasets. All datasets had N = 200 subjects, with the same parameter values as in Table 2, and n_i increasing from 20 to 100. FastRegLS is orders of magnitude faster, and the difference increases as the datasets increase in size.

Performance of confidence intervals
In order to test the performance of the Wald and bootstrap FastRegLS confidence intervals, we simulated 100 datasets from a MELS model with β ∈ R^4, τ ∈ R^4, each randomly generated with N(0, 1) entries, τ_l = 0.3, σ²_υ = 1, σ²_ω = 1, N = 200, n_i = 30 for i = 1, … , N, and the design matrices X and W with random N(0, 1) entries (except an intercept column of ones). For each such dataset, we fit the model using FastRegLS and constructed both types of 95% confidence intervals. Table 2 shows the proportion of times each interval contained the true parameter value, as well as the average widths of the intervals over the 100 runs. The observed coverage rates are close to the target of 0.95, and the widths of the two types of intervals are similar.

TABLE 2: For 100 datasets generated using the parameter values indicated in column 2, and 200 subjects with 30 observations per subject, we fit a MELS model using FastRegLS and constructed both types of 95% confidence intervals, Wald intervals and bootstrap intervals. The coverage columns show the proportion of times each type of interval contained the true parameter value, as well as the average widths of the intervals over the 100 runs. The observed coverage rates are close to the target of 0.95, and the widths of the two types of intervals are similar. Note that since this is simulated data and there is no model misspecification, the Wald intervals tend to slightly outperform the bootstrap intervals.

To demonstrate the need for the bootstrap when we suspect model deviations, we repeated the coverage probability simulation with a mixture of exponential and Gaussian residual error terms. With probability 0.8, the residual error term was drawn from the usual N(0, σ²_ij) distribution, but with probability 0.2, the residual error was drawn from σ_ij(Exp(1) − 1), an exponential distribution shifted to have mean 0 and scaled to have variance σ²_ij. Table 3 compares the coverage rates of the 95% Wald and bootstrap intervals in this setting. Because the model deviations produced large changes in the FastRegLS estimates for the first few iterations, we used k_max = 5 for this table. The Wald intervals perform poorly on the WS variance parameters, but we can obtain more reasonable coverage rates using the bootstrap (note, however, that both perform poorly for the WS variance intercept τ_0). The bootstrap intervals also perform better for σ²_ω. Without FastRegLS, such intervals could not be computed in a practical amount of time.
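The contaminated residual scheme is easy to reproduce. The sketch below (our own helper) draws each residual from the Gaussian distribution with probability 0.8 and from the shifted, scaled exponential with probability 0.2; since Exp(1) − 1 has mean 0 and variance 1, every residual keeps mean 0 and variance σ²_ij regardless of which component it comes from.

```python
import numpy as np

def contaminated_errors(ws_var, p_exp=0.2, seed=0):
    """Mixture residuals: N(0, ws_var) with prob 1 - p_exp, otherwise a
    shifted/scaled exponential with the same mean (0) and variance (sketch)."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(np.asarray(ws_var, dtype=float))
    use_exp = rng.random(sd.shape) < p_exp
    normal = rng.normal(0.0, sd)
    shifted_exp = sd * (rng.exponential(1.0, sd.shape) - 1.0)  # Exp(1) - 1: mean 0, var 1
    return np.where(use_exp, shifted_exp, normal)
```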
We also tested the Wald interval coverage rates on datasets of varying sizes. As in the consistency simulation, we fixed the number of subjects at N = 200, and increased the number of observations per subject from 20 to 100. For each sample size, we simulated 500 datasets using the same parameter values as in Table 2, fit a MELS model to each dataset using FastRegLS, and for each parameter, determined the proportion of times the 95% Wald confidence interval contained the true value. The results are shown in Figure 2. For all parameters and for all sample sizes, the coverage rates stay close to the target of 0.95.

Comparison with ML on simulated data

To directly compare the performance of FastRegLS with ML, we simulated data from a MELS model using the parameter values in Table 4, and then used both FastRegLS and ML (run using MixRegLS) to obtain parameter estimates and confidence intervals. The results are shown in Table 4. For each method, the point estimates are close to the true values, and both types of 95% confidence interval (bootstrap and Wald) contain the true value, but FastRegLS obtains these results at a fraction of the computation time (Table 1).

Comparison with ML on real data
We also compared the performance of FastRegLS and ML on real data from an intensive longitudinal study of first-year university students. 28 The study followed 82 first-year university students for 46 days, during which time the students took their final exams. Data was collected using daily online surveys designed to capture various aspects of lifestyle, including mood, physical activity, sleep patterns, and so forth. Only data from the first 32 days of the study are publicly available, and this is the dataset we analyze. There were 2111 total responses, with 29.3 responses per student on average (and a range of 8 to 23). We will focus on positive affect (PA), an average of the extent to which participants felt several emotions (happy, content, cheerful, sad, downhearted, frustrated) on a given day, measured using a 1-7 Likert scale. Psychology researchers are often interested in the variability of a measure in addition to its mean value, and because of this, PA data is often analyzed using MELS models. 12 Figure 3 shows the PA trajectories of several study subjects. Just by looking, we can see that these trajectories differ greatly in their mean and variance; by fitting a MELS model, we can determine whether such differences are due to covariates, or to individual-level heterogeneity captured by random location and scale effects.
We fit a MELS model with PA as response, and age and sex of the student as both mean and WS variance covariates. Specifically, the model fit was

PA_ij = β_0 + β_age age_i + β_sex sex_i + υ_i + ε_ij,
ε_ij ∼ N(0, σ²_ε,ij),
σ²_ε,ij = exp(τ_0 + τ_age age_i + τ_sex sex_i + τ_l υ_i + ω_i),

where PA_ij is the positive affect measurement of student i on day j, age_i is the age of student i in years, and sex_i is the sex of student i (0 = Female, 1 = Male). Table 5 shows the point estimates and confidence intervals from FastRegLS and ML (the latter run using MixRegLS). The point estimates of the regression coefficients are very similar for the two methods, as are both types of confidence intervals. The most interesting difference between the two is the random scale variance σ²_ω, for which the FastRegLS point estimate is larger than that of ML. However, we note that the confidence intervals do still overlap. The key point is that the interpretation of the coefficients is the same for each method. For instance, both methods tell us that males have a much smaller variability in their PA responses (τ_sex < 0), and that higher average PA responses (better moods) are associated with smaller PA variability (τ_l < 0). There is also subject-level variability beyond age and sex that is captured by the random location and scale effects (σ²_υ > 0 and σ²_ω > 0). However, the FastRegLS estimates come at a fraction of the computation time: 0.19 s, compared to 8.25 s for ML.

DISCUSSION
In this paper, we introduced a new fitting technique, called FastRegLS, for the MELS model that is considerably faster than currently available maximum likelihood-based techniques. The FastRegLS estimators are consistent, and approximations to their asymptotic variances are available. Because it is so much faster than existing techniques, FastRegLS makes bootstrap inference practical for the MELS model for the first time. Given the complexity of the MELS model and the challenges associated with performing inference on variance components, this is a useful addition to the MELS model toolkit.

FIGURE 3: Positive affect (PA) trajectories for a selection of subjects in our example intensive longitudinal study. The trajectories differ considerably in mean and variance; by fitting a MELS model, we can determine whether such differences are due to covariates, or to individual-level heterogeneity captured by random location and scale effects.

We emphasize that FastRegLS does not need to replace a full ML solution in order to be useful. Even if a researcher plans to eventually fit a model using ML, FastRegLS could still be useful by speeding up exploratory data analysis. Or if one suspects deviations from the specified model, one could check how reasonable the ML Wald confidence intervals are by computing the FastRegLS bootstrap intervals and comparing: if they are similar, then there is more reason to trust the ML results. FastRegLS could also be used to quickly provide starting values for a full ML algorithm.
Although we only considered a MELS model with a random location intercept and a random scale intercept, recent work 29 has considered additional location and scale random effects, for example random slopes. For these models, the existing approaches are even more computationally burdensome, since the complexity of the fitting problem increases exponentially in the number of random effects. The FastRegLS approach of breaking down the full model into submodels could potentially provide even more substantial speed increases in this setting, since the complexity is limited only by the size of the largest of these submodels. Future work will extend FastRegLS to allow for more complicated random effects structures. Additionally, since the models considered in this paper are only appropriate for Gaussian responses, we plan to extend FastRegLS to models for non-Gaussian response variables.