Linear mixed effects models for non‐Gaussian continuous repeated measurement data

We consider the analysis of continuous repeated measurement outcomes that are collected longitudinally. A standard framework for analysing data of this kind is a linear Gaussian mixed effects model within which the outcome variable can be decomposed into fixed effects, time invariant and time‐varying random effects, and measurement noise. We develop methodology that, for the first time, allows any combination of these stochastic components to be non‐Gaussian, using multivariate normal variance–mean mixtures. To meet the computational challenges that are presented by large data sets, i.e. in the current context, data sets with many subjects and/or many repeated measurements per subject, we propose a novel implementation of maximum likelihood estimation using a computationally efficient subsampling‐based stochastic gradient algorithm. We obtain standard error estimates by inverting the observed Fisher information matrix and obtain the predictive distributions for the random effects in both filtering (conditioning on past and current data) and smoothing (conditioning on all data) contexts. To implement these procedures, we introduce an R package: ngme. We reanalyse two data sets, from cystic fibrosis and nephrology research, that were previously analysed by using Gaussian linear mixed effects models.


Introduction
This paper is concerned with the analysis of real-valued repeated measurement data that are collected through time: also known as longitudinal data. The basic data structure is that repeated measurements of an outcome variable are made on each of a number of subjects at each of a number of follow-up times, which are not necessarily the same for all subjects, with explanatory variables or covariates of two kinds also available: baseline covariates attached to subjects, and longitudinal covariates attached to individual outcomes. We write Y_ij for the jth measurement of the outcome on the ith subject, t_ij for the corresponding follow-up time, a_i for the matrix of baseline covariates that are associated with the ith subject and x_ij for the matrix of longitudinal covariates that are attached to the jth measurement on the ith subject. Fig. 1 shows a simple example, taken from a randomized trial of drug treatments for schizophrenia, in which the outcome variable is a measure of each subject's mental state at times 0, 1, 2, 4, 6 and 8 weeks after randomization to one of two different drug therapies: placebo versus active treatment. Here, a_i is a scalar treatment indicator, whereas the general pattern of decreasing responses over time suggests a quadratic trend; hence x_ij consists of t_ij and t_ij^2. Fig. 1 shows data from three subjects in each of the two treatment arms; the complete trial included 88 subjects in the placebo group and 435 subjects distributed across five active treatment arms (Henderson et al., 2000).
This example shows several features that are typical of studies of this kind: the outcome variable, the PANSS-score (positive and negative syndrome scale) (Kay et al., 1987), is an imperfect measurement instrument for the underlying process of interest, namely each subject's state of mental health at the time of measurement; the outcome variable exhibits stochastic variation both between subjects and between follow-up times within subjects; questions of interest include estimation of parameters that define the mean response profiles of the underlying process over time and prediction of the trajectory of the process for an individual subject.
Most of the very extensive literature on statistical methods for data of this kind uses either a Gaussian model or, if the inferential goal is restricted to parameter estimation, a set of estimating equations; textbook accounts include Verbeke and Molenberghs (2001), Diggle et al. (2002) and Fitzmaurice et al. (2011). In this paper, we present methodology for handling repeated measurement data that exhibit long-tailed or skewed departure from Gaussian distributional assumptions.
In Section 2, we review the literature on existing approaches to Gaussian and non-Gaussian modelling of real-valued repeated measurement data. In Section 3, we set out our proposed class of non-Gaussian models. In Section 4, we describe a computationally fast method for likelihood-based inference. Section 5 describes a method for validating the distributional assumptions of the models considered. Section 6 describes two applications. In the first of these, the scientific focus is on estimation of mean response profiles, whereas in the second the focus is on real-time individual level prediction. Section 7 presents the results from two simulation studies and Section 8 describes our R package, ngme, that implements the new methodology. In Section 9, we discuss some potential extensions, including models for categorical or count data (Molenberghs and Verbeke, 2005) and joint modelling of repeated measurement and time-to-event data (Rizopoulos, 2012). Technical details are presented in the appendices.


Gaussian models for real-valued repeated measurement data
Laird and Ware (1982) were the first to consider modelling repeated measurements as noisy versions of underlying signals that can be decomposed into fixed effects, a_i^T α + x_ij^T β, and random effects, d_ij^T U_i, leading to the mixed effects model

Y_ij = a_i^T α + x_ij^T β + d_ij^T U_i + σ Z_ij,  j = 1, ..., n_i, i = 1, ..., m,  (1)

where n_i is the number of measurements on the ith subject, m is the number of subjects, the individual-level U_i are mutually independent, zero-mean multivariate normal, U_i ∼ N(0, Σ), and the Z_ij are mutually independent N(0, 1). A widely used special case of model (1) is the 'random-intercept and random-slope' model in which each subject's random effect is a linear function of time. This model is very useful when the data contain only a small number of repeated measurements per individual.
With longer sequences, the assumption that individual random-effect trajectories can be approximated by straight lines becomes implausible, because of non-linearities in the trajectories. Diggle (1988) proposed adding to the model a time-varying random-effect term W_i(t), specified as a stationary stochastic process. Taylor et al. (1994) and others later considered non-stationary options for W_i(t). The general specification for models of this kind is that

Y_ij = a_i^T α + x_ij^T β + d_ij^T U_i + W_i(t_ij) + σ Z_ij,  j = 1, ..., n_i, i = 1, ..., m,  (2)

where, in addition to the notation that has already been introduced, the W_i(t) are independent copies of a continuous-time zero-mean Gaussian process with covariance function γ(t, t') = cov{W_i(t), W_i(t')}. We consider the elements of both the a_i and x_ij to be prespecified constants. This implicitly assumes, in particular, that, if any time-varying covariate is not prespecified, it is stochastically independent of all other terms in the model; hence conditioning on it is innocuous. We can then drop the term a_i^T α in model (2) by allowing elements of x_ij to take identical values for all j that are associated with any fixed i. For the covariance function γ(t, t'), we use the stationary Matérn (1960) family,

γ(t, t') = ω^2 {2^(φ−1) Γ(φ)}^(−1) (κ|t − t'|)^φ K_φ(κ|t − t'|),  (3)

where Γ(φ) is the complete gamma function, ω^2 > 0 denotes the variance, φ > 0 is a shape parameter, κ > 0 is a scale parameter and K_φ is a modified Bessel function of the second kind of order φ. The corresponding Gaussian process W_i(t) is ⌈φ⌉ − 1 times mean-square differentiable, where ⌈·⌉ denotes the ceiling function. An alternative way of capturing non-linear behaviour of repeated measurements is to specify the random effects as regression splines or polynomials with stochastic coefficients (Fitzmaurice et al. (2011), chapter 19). We do not consider these approaches in this paper, since they appear to us less natural than the stochastic process approach and would require many more parameters to achieve the same flexibility in shape. That said, there are connections between an integrated random-walk process and a smoothing spline representation of the W_i(t) (Wahba, 1990; Zhu and Dunson, 2017). Likelihood-based inference for model (2) is straightforward. The likelihood function is a product of m multivariate normal densities with dimensions n_i. For typical study designs, the n_i are sufficiently small that the required matrix calculations are not computationally demanding.
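As a concrete check on this covariance family, the following Python sketch (the function name matern_cov and the parameter values are mine, for illustration only) evaluates the Matérn covariance in the parameterization above and exploits the well-known φ = 1/2 special case, for which the family reduces to the exponential covariance ω^2 exp(−κ|h|).

```python
import numpy as np
from scipy.special import gamma, kv

def matern_cov(h, omega2, phi, kappa):
    """Stationary Matern covariance: omega2 * 2^(1-phi)/Gamma(phi) * (kappa|h|)^phi * K_phi(kappa|h|)."""
    h = np.abs(np.asarray(h, dtype=float))
    out = np.full_like(h, omega2)          # value at lag 0 is the variance omega2
    nz = h > 0
    x = kappa * h[nz]
    out[nz] = omega2 * 2.0 ** (1.0 - phi) / gamma(phi) * x ** phi * kv(phi, x)
    return out

lags = np.array([0.0, 0.5, 1.0, 2.0])
cov = matern_cov(lags, omega2=1.5, phi=0.5, kappa=2.0)   # phi = 1/2: exponential case
```

For φ = 1/2, K_(1/2)(x) = sqrt(π/(2x)) exp(−x), so the expression collapses to 1.5 exp(−2|h|), which is a useful unit test for any implementation of this family.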
In the continuous time setting, it is helpful to exploit an alternative representation of a Gaussian process W(·) as the solution to a stochastic differential equation,

D W(t) = dL(t),  (4)

where D is a differential operator and dL(t) is continuous-time Gaussian white noise. For example, the integrated random-walk model used, among others, by Zhu and Dunson (2017) corresponds to D = ∂^2/∂t^2, whereas the Matérn model corresponds to

D = (κ^2 − ∂^2/∂t^2)^((2φ+1)/4).  (5)

For the stochastic differential equation representation of an integrated Ornstein-Uhlenbeck process, see Zhu et al. (2011a, b).
In applications for which only the regression parameters β are of scientific interest, estimating equations offer an alternative to likelihood-based estimation. In the current context, this approach was introduced by Liang and Zeger (1986), working in the wider setting of generalized linear models. For linear models, the approach consists of estimating β by weighted least squares; hence

β̂ = (Σ_{i=1}^m x_i^T F_i x_i)^(−1) Σ_{i=1}^m x_i^T F_i Y_i,  (6)

where, for each i, Y_i = (Y_i1, ..., Y_in_i)^T, x_i is the n_i × k matrix whose jth row is x_ij^T and the F_i are weight matrices. Rewriting equation (6) in an obvious shorthand notation as β̂ = DY, inference for β uses the result that β̂ is asymptotically multivariate Gaussian with mean β and variance DCD^T, where C = var(Y) is a block-diagonal matrix with non-zero blocks C_i = var(Y_i). If F_i = C_i^(−1), then β̂ is the maximum likelihood estimator for β. The basic idea behind equation (6) is to choose, rather than to estimate, a set of matrices F_i that reflect a reasonable working covariance structure for the C_i = var(Y_i), but not to rely on the correctness of the chosen structure. Instead, the unknown matrix C_i is replaced by a non-parametric estimate Ĉ_i. One such set of estimates is given by Ĉ_i = n_i^(−1) (Y_i − x_i β̂)(Y_i − x_i β̂)^T. Individually, each Ĉ_i is a very poor estimate of C_i, but the implicit averaging in equation (6) leads to consistent estimation of var(β̂) in the limit m → ∞ for fixed n_i (Liang and Zeger, 1986).
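The working-covariance idea can be sketched in a few lines of Python. The example below is illustrative only (the design, the random-intercept data generator and all numbers are my own choices, not from the paper): it estimates β by weighted least squares with working-independence weights F_i = I, then computes the sandwich variance DCD^T with each C_i replaced by the empirical residual outer product.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_i = 400, 5
beta_true = np.array([1.0, -0.5])          # intercept and slope of the mean response

X, Y = [], []
for i in range(m):
    x = np.column_stack([np.ones(n_i), np.linspace(0.0, 1.0, n_i)])  # x_i: n_i x k design
    u = rng.normal()                        # random intercept induces within-subject correlation
    y = x @ beta_true + u + rng.normal(scale=0.5, size=n_i)
    X.append(x)
    Y.append(y)

# weighted least squares with working-independence weights F_i = I
A = sum(x.T @ x for x in X)
b = sum(x.T @ y for x, y in zip(X, Y))
beta_hat = np.linalg.solve(A, b)

# sandwich variance: each unknown C_i replaced by the residual outer product for subject i
A_inv = np.linalg.inv(A)
meat = sum(x.T @ np.outer(y - x @ beta_hat, y - x @ beta_hat) @ x for x, y in zip(X, Y))
var_beta = A_inv @ meat @ A_inv
se = np.sqrt(np.diag(var_beta))
```

Even though the working covariance ignores the within-subject correlation entirely, the point estimate is consistent and the sandwich standard errors remain valid as m grows, which is exactly the robustness property described above.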

Non-Gaussian models for real-valued repeated measurement data
The existing literature on non-Gaussian models takes as its starting point a linear model with correlated errors,

Y_ij = a_i^T α + x_ij^T β + S_ij,  (7)

where, in the case of a common set of follow-up times, t_1, ..., t_n, for each subject, the S_i = (S_i1, ..., S_in)^T are independent copies of a zero-mean multivariate normal random vector (Jennrich and Schluchter, 1986). Most references consider the Laird-Ware approach as presented in model (1), where

S_ij = d_ij^T U_i + σ Z_ij.  (8)

Liu and Rubin (1995), Lange et al. (1989) and Pinheiro et al. (2001) replaced each S_ij in equation (7) or (8) by S*_ij = S_ij/√V_i, where the V_i are mutually independent unit-mean gamma-distributed random variables. They estimated the model parameters by maximum likelihood using an expectation-maximization algorithm (Dempster et al., 1977). Lin and Wang (2011) considered Bayesian methods of inference for the same class of models. Matos et al. (2013) extended the work of Pinheiro et al. (2001) to allow censored outcomes. Song et al. (2007) and Zhang et al. (2009) considered an extension of Lange et al. (1989) by allowing the gamma-distributed scaling factor V_i to apply to either one of the two components on the right-hand side of equation (8). Lin and Lee (2007) applied the gamma-distributed scaling factor only to the random-effects term, d_ij^T U_i, but also replaced the mutually independent Z_ij by a set of auto-regressive processes; this restricts its applicability to data with equally spaced measurement times. Rosa et al. (2003) and Tian et al. (2008) also used the formulation S*_ij = S_ij/√V_i, but without restricting the V_i to be gamma distributed. Lange and Sinsheimer (1993) called the resulting family of distributions the normal-independent family, a special case of which is a mixture of normal distributions. The R package heavy (Osorio, 2016) fits this class of models. In a series of papers, V. H. Lachos and colleagues have developed methodology for fitting non-linear mixed models by using the normal-independent family; see Lachos et al. (2009, 2010, 2011, 2012, 2013), Zeller et al. (2010) and Cabral et al. (2012), and also independent contributions by Verbeke and Lesaffre (1996), Sun et al. (2008), Ho and Lin (2010), De la Cruz (2014), Zhang et al. (2015) and Yavuz and Arslan (2016).
We have found only two references that considered the general form of model (2) with three stochastic components with the single-term formulation (8), namely Stirrup et al. (2015) and Asar et al. (2016), and none that allows the three scaling factors to be decoupled.

A flexible class of non-Gaussian models
Our aim in this section is to set out a version of the mixed effects model

Y_ij = x_ij^T β + d_ij^T U_i + W_i(t_ij) + σ Z_ij,  j = 1, ..., n_i, i = 1, ..., m,  (9)

that, we believe for the first time, allows Gaussian or non-Gaussian distributional specifications of the three stochastic components U_i, W_i(t) and Z_ij to be decoupled. Writing B and H to denote generic vector-valued random variables, we replace the Gaussian assumption for each of the components with a normal variance-mean mixture of the form

B = δ + μV + √V Σ^(1/2) H,  (10)

where δ and μ are parameter vectors, Σ is a covariance matrix, H ∼ N(0, I) with I being the identity, and V is a random variable that takes values on R^+. We need to impose some restrictions on the distribution of V for the inferential algorithms that we develop in Section 4 to be practicable. For the subject-specific random effect, U_i, and the measurement-specific noise, Z_ij, the only necessary restriction is that V has a known distribution. However, to simplify parameter estimation, we shall impose the additional restriction that V | H also has a known distribution. For the subject-specific continuous time stochastic process, W_i(t), we use a numerical discretization of the differential operator (4) to generate realizations of the process. For this reason, we need the distribution to be closed under arbitrary discretization, which we ensure by requiring that the distribution of V be closed under convolution. Our specific proposals for U_i, W_i(t) and Z_ij are described below in more detail. A flexible choice for B is the multivariate generalized hyperbolic (GH) distribution (Barndorff-Nielsen, 1977; Vilca et al., 2014). This distribution can be generated from the mixture representation (10) by specifying a generalized inverse Gaussian (GIG) distribution for V. The density function of the GIG distribution is

f(v) = (a/b)^(p/2) {2 K_p(√(ab))}^(−1) v^(p−1) exp{−(av + b/v)/2},  v > 0,  (11)

where K_p is the modified Bessel function of the third kind, of order p, whereas a and b are positive-valued parameters.
We denote this distribution by GIG(p, a, b) and refer the reader to Jørgensen (1982) for more details. An important property of this distribution is that, for any c > 0, cV ∼ GIG(p, a/c, cb). Another property that is useful for the construction of the sampling-based inferential algorithms that we introduce in Sections 4.2 and 4.4 is that the conditional distribution of V given the observed data is also GIG. The GH distribution includes several widely used distributions as special cases, e.g. the Student t, generalized asymmetric Laplace (GAL), normal-inverse Gaussian (NIG) and Cauchy distributions. Specific parameter configurations for the distributions of V that give each of these special cases are presented in Table 1. Note that, for both the NIG and the GAL distributions, the formulation is overparameterized for B in equation (10). One therefore needs to fix a or b.
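The scaling property cV ∼ GIG(p, a/c, cb) and the Bessel-ratio formula for the mean are easy to verify by simulation. A minimal Python sketch (the helper sample_gig and the parameter mapping onto SciPy's symmetric geninvgauss parameterization are my own, for the density convention above):

```python
import numpy as np
from scipy.special import kv
from scipy.stats import geninvgauss

def sample_gig(p, a, b, size, rng):
    """Draws from GIG(p, a, b) with density proportional to v^(p-1) exp{-(a*v + b/v)/2}.
    SciPy's geninvgauss(p, b_s) uses a symmetric form; rescaling by sqrt(b/a) maps onto it."""
    return geninvgauss.rvs(p, np.sqrt(a * b), scale=np.sqrt(b / a),
                           size=size, random_state=rng)

rng = np.random.default_rng(1)
p, a, b = -0.5, 4.0, 2.0                      # p = -1/2 is the inverse Gaussian special case
v = sample_gig(p, a, b, 200_000, rng)

# E[V] = sqrt(b/a) * K_{p+1}(sqrt(ab)) / K_p(sqrt(ab))
eta = np.sqrt(a * b)
mean_theory = np.sqrt(b / a) * kv(p + 1.0, eta) / kv(p, eta)
```

The Bessel-ratio mean is what one would constrain to equal 1 when a unit-mean mixing variable is required, as in Section 3.1.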

Noise and random effects
Since the measurement noise is univariate, we can write the mixture representation (10) as

Z_ij = δ + μ V^Z_ij + √(V^Z_ij) Z*_ij,  (12)

where Z*_ij ∼ N(0, 1). To maintain the interpretation of σ^2 as the variance of the noise, at least in the symmetric case, we constrain the values of the GIG parameters a, b and p so that E[V^Z_ij] = 1. An alternative to representation (12) is to attach a single random variable V^Z_i to all of the noise terms Z_ij on the ith subject, i.e.

Z_ij = δ + μ V^Z_i + √(V^Z_i) Z*_ij.
The distribution of V^Z_i can then be interpreted as a random effect for patient-specific measurement noise variance. Note, in particular, that this introduces stochastic dependence between Z_ij and Z_ij' for j ≠ j', although they are conditionally independent given V^Z_i. For the random effects, we let

U_i = δ + μ V^U_i + √(V^U_i) Σ^(1/2) U*_i,

where U*_i ∼ N(0, I) and V^U_i follows a GIG distribution.
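The practical difference between the two noise representations shows up in the squared noise terms: with a single V^Z_i per subject they are positively correlated within a subject, whereas with one mixing variable per measurement, as in representation (12), they are not. A Python sketch under an illustrative inverse Gaussian mixing assumption (unit mean, variance 1/ν; the symmetric case δ = μ = 0):

```python
import numpy as np
from scipy.stats import invgauss

rng = np.random.default_rng(2)
nu, m = 1.0, 100_000                 # nu controls the tails; m subjects, 2 noise terms each

# unit-mean inverse Gaussian mixing variables: E[V] = 1, var(V) = 1/nu
V_subject = invgauss.rvs(1.0 / nu, scale=nu, size=m, random_state=rng)
V_obs = invgauss.rvs(1.0 / nu, scale=nu, size=(m, 2), random_state=rng)

Zstar = rng.standard_normal((m, 2))
Z_shared = np.sqrt(V_subject)[:, None] * Zstar   # one V per subject: dependent noise
Z_indep = np.sqrt(V_obs) * Zstar                 # one V per measurement: independent noise

corr_shared = np.corrcoef(Z_shared[:, 0] ** 2, Z_shared[:, 1] ** 2)[0, 1]
corr_indep = np.corrcoef(Z_indep[:, 0] ** 2, Z_indep[:, 1] ** 2)[0, 1]
```

Both versions have unit-variance, heavy-tailed marginals, so the choice between them is a statement about whether noise-variance outliers cluster within patients.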

Stochastic process
A simple way to introduce a non-Gaussian stochastic process term in expression (9) would be again to include a subject-specific scaling, i.e. W_i(t) = √(V^W_i) W*_i(t), where V^W_i follows a unit-mean GIG distribution. However, this approach would not be able to capture interesting within-subject departures from Gaussian behaviour, e.g. jumps or asymmetries in the sample paths of W_i(t). To provide the required flexibility, we instead use non-Gaussian generalizations of the stochastic differential equation (4). Specifically, we propose modelling the W_i(t) as independent copies of the solution to

D W_i(t) = dL_i(t),  (13)

where the L_i are independent copies of a non-Gaussian Lévy process, i.e. a process with independent and stationary increments. In practice, we work with a discretized version of equation (13), for which Wallin and Bolin (2015) showed that a type G Lévy process for L_i(t) is a suitable candidate. The implication is that the increments of L_i have a distribution that corresponds to the specification given by equation (10). One approach would therefore be to choose the distribution of V^W as a GIG distribution, which would yield the GH processes of Eberlein (2001). However, as noted earlier, we require the distribution of V^W to be closed under convolution (Wallin and Bolin, 2015). Also, the stochastic gradient method for parameter estimation that we introduce in Section 4 requires the ability to sample from the conditional distribution of V^W given all other components in the model. Within the GH family, the NIG, GAL and Cauchy distributions are the only ones that meet these requirements (Podgórski and Wallin, 2016). In Table 2, we present the parameterization of the mixing distribution for the three cases. Using any of these distributions for the increments of L_i in equation (13) results in models with the same covariance structure as if L_i were Gaussian, but with more general marginal distributions. The NIG choice makes L_i an NIG process (Barndorff-Nielsen, 1997a).
This class of processes has been used in financial modelling; see Barndorff-Nielsen (1997b), Bibby and Sørensen (2003), Tankov (2003) and Eberlein (2001) for more details.

The choice of operator
As previously mentioned, using D as in equation (5) yields a process with a Matérn covariance function, provided that the process is defined on the entire real line. In practice, we restrict the process to a bounded temporal interval and impose boundary conditions on the operator to obtain a well-posed problem. Common choices for these artificial boundary conditions are either homogeneous Neumann or homogeneous Dirichlet conditions. The effect of these artificial boundary conditions is small at distances larger than twice the practical correlation range of the process. We therefore define the process on an extended temporal domain T̃ = [−r, t_max + r], where all measurement times lie within the interval [0, t_max] and r is a value that is larger than the practical correlation range. For φ = 1/2, the operator (5) is D_1 = (κ^2 − ∂^2/∂t^2)^(1/2), which results in a process with an exponential covariance function E[W(t) W(t + h)] = (2κ)^(−1) exp(−κ|h|). Another, perhaps more natural, choice that results in a process with exponential covariance when defined on the entire real line is D_2 = κ + ∂/∂t. This can easily be seen by computing the power spectrum, S_W(ω), of the corresponding stochastic process. Taking the Fourier transform of the stochastic differential equation gives (κ + iω) Ŵ(ω) = dL̂(ω). Using the fact that the power spectrum of dL(t) is constant, it follows that

S_W(ω) ∝ (κ^2 + ω^2)^(−1),

which we recognize as the spectral density corresponding to the exponential covariance function. This result is well known, since in the Gaussian case the process is the classical Ornstein-Uhlenbeck process. When posed on the bounded domain T̃, we only have to equip D_2 with one boundary condition.
A natural choice in this case is a stochastic Dirichlet boundary condition at either end point, W(−r) = W̄ or W(t_max + r) = W̄, where W̄ is a random variable with distribution equal to the marginal distribution of W(t) when the model is formulated on the entire real line. This results in a stationary model, and there is no need to extend the domain of interest.
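The spectral argument above can be checked numerically. The sketch below (grid limits, the lag h and all constants are arbitrary illustrative choices) inverse-Fourier-transforms the spectral density S_W(ω) = {2π(κ^2 + ω^2)}^(−1) on a fine grid and compares the result with the exponential covariance (2κ)^(−1) exp(−κ|h|):

```python
import numpy as np

kappa, h = 2.0, 0.5
w = np.linspace(-500.0, 500.0, 1_000_001)     # frequency grid; tails beyond 500 are negligible
dw = w[1] - w[0]

S = 1.0 / (2.0 * np.pi * (kappa ** 2 + w ** 2))       # spectral density of the OU model
cov_numeric = np.sum(S * np.cos(w * h)) * dw          # inverse Fourier transform at lag h
cov_exact = np.exp(-kappa * abs(h)) / (2.0 * kappa)   # exponential covariance at lag h
```

The agreement to three decimal places here is limited only by the grid truncation; the identity itself is exact.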
When the driving noise of the process is Gaussian, the models that are formulated by using D_1 and D_2 are equivalent in distribution (apart from the behaviour at the boundary of the domain). However, for non-Gaussian models the processes that are formulated by using the two choices are not equivalent: the kernel of D_1 is symmetric, whereas the kernel of D_2 is completely asymmetric. This affects the appearance of trajectories of the process. For D_1, the trajectories are symmetric in time, i.e. the process is time reversible, whereas the trajectories can be asymmetric when D_2 is used. The difference between the symmetric and asymmetric models is illustrated in Fig. 2, where the same driving NIG noise is used to simulate a trajectory using D_1 and D_2. Using D_2 allows for sharp jumps in the trajectories that are not present when D_1 is used.
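A rough feel for the asymmetric D_2 model can be obtained with a crude Euler-type simulation driven by type G increments with inverse Gaussian mixing. This is a sketch only: it is not the Galerkin discretization used in the paper, the parameter values are arbitrary, and a nonzero μ is chosen deliberately so that the driving noise is skewed and the trajectory exhibits the one-sided jumps described above.

```python
import numpy as np
from scipy.stats import invgauss

rng = np.random.default_rng(3)
kappa, nu, mu = 2.0, 0.5, 2.0          # mu != 0 gives skewed (asymmetric) driving noise
T, n = 10.0, 2000
dt = T / n

# type G increments of the Levy process: dL = mu*(V - dt) + sqrt(V)*eps,
# with V inverse Gaussian, E[V] = dt, so that E[dL] = 0
V = invgauss.rvs(1.0 / (nu * dt), scale=nu * dt ** 2, size=n, random_state=rng)
dL = mu * (V - dt) + np.sqrt(V) * rng.standard_normal(n)

# Euler scheme for the D_2 model (kappa + d/dt) W = noise: an OU-type recursion
W = np.zeros(n + 1)
for k in range(n):
    W[k + 1] = W[k] - kappa * W[k] * dt + dL[k]
```

Plotting W against time (not done here) shows the sharp upward jumps followed by exponential decay that distinguish the D_2 trajectories from the time-reversible D_1 ones.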
One way to construct a model with a general Matérn covariance that allows for asymmetries in the sample paths is to add a fractional exponent to D_2, to give D_3 = (κ + ∂/∂t)^((2φ+1)/2). This results in a process with a Matérn covariance, but with asymmetry in the trajectories depending on where boundary conditions are imposed. It should be noted that these models have Markov properties when φ = 1/2, 3/2, 5/2, ..., which simplifies simulation, as explained in the next section. Besides the Matérn models, another option that is used for longitudinal data is the integrated random-walk model. This can be seen as a special case of D_3 with κ = 0 and φ = 3/2, and can thus be handled in the same way as the Matérn models.

Discretization
We need to discretize time to use the stochastic differential equation (13). For this, we use the approximation

W_i(t) ≈ Σ_{k=1}^K W_k φ_k(t),  (14)

where W = (W_1, ..., W_K)^T is a vector of random variables and the φ_k(t) are basis functions. We use a set of piecewise linear 'hat' basis functions, with φ_k equal to 1 at the node s_k and decreasing linearly to 0 at the neighbouring nodes, where −r = s_1 < s_2 < ... < s_{K−1} < s_K = t_max + r. In the case of a Dirichlet boundary condition at t = −r, the function φ_1 is removed from the expansion, whereas, in the case of a Dirichlet boundary condition at t = t_max + r, φ_K is removed. The distribution of the stochastic weights is computed by using either a Galerkin finite element discretization or a Petrov-Galerkin method, depending on the operator; details are given in Appendix B. The result for the non-Gaussian case is that, conditionally on the variances V^W_k, the vector KW is multivariate normal of the form (10), where K is a matrix corresponding to a discretization of the differential operator and V^W_k ∼ IG(ν, h_k^2 ν), where the h_k are fixed constants depending on the basis functions.
[Fig. 2. Simulation of an NIG process with exponential covariance function using the operator (a) D_2 and (b) D_1, with the same driving noise in both cases: one can note that the trajectory is asymmetric in (a).]
Note that the parameter ν controls the tails of the marginal distribution of the process. The limiting case ν → 0 is the Cauchy process, whereas the limiting case ν → ∞ is a Gaussian process. These are exactly the properties that we need to use our likelihood-based methods to assess whether a standard, and undeniably convenient, Gaussian assumption for any or all of the stochastic components of expression (9) is adequate.
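The basis expansion (14) can be made concrete with a few lines of Python. The function name hat_basis_matrix and the node layout are mine, for illustration: the code evaluates piecewise linear hat functions at a set of observation times, i.e. it builds a matrix with (j, k)th element φ_k(t_j), which is exactly the role played by the matrices A_i introduced in the next section.

```python
import numpy as np

def hat_basis_matrix(t, s):
    """A[j, k] = phi_k(t_j) for piecewise linear 'hat' functions on nodes s_1 < ... < s_K."""
    t = np.asarray(t, dtype=float)
    s = np.asarray(s, dtype=float)
    A = np.zeros((len(t), len(s)))
    # index of the interval [s_idx, s_idx+1] containing each t_j
    idx = np.clip(np.searchsorted(s, t, side='right') - 1, 0, len(s) - 2)
    w = (t - s[idx]) / (s[idx + 1] - s[idx])     # local coordinate in [0, 1]
    A[np.arange(len(t)), idx] = 1.0 - w
    A[np.arange(len(t)), idx + 1] = w
    return A

s_nodes = np.linspace(-1.0, 11.0, 25)            # extended domain [-r, t_max + r], r = 1
t_obs = np.array([0.0, 0.37, 5.5, 10.0])
A = hat_basis_matrix(t_obs, s_nodes)
```

Two properties worth checking in any implementation: each row has at most two nonzero entries that sum to 1 (partition of unity), and the basis reproduces linear functions exactly, so A @ s_nodes returns the observation times themselves.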

Hierarchical representation
Our specification of a normal variance-mean mixture for each of the stochastic components of expression (9) makes likelihood-based inference practicable via the following hierarchical representation of the model. For subject i, let V^Z_i, V^U_i and V^W_i denote the stochastic variance factors corresponding to the noise, random-effects and stochastic process components of expression (9), and write Y_i = (Y_i1, ..., Y_in_i)^T for the corresponding set of repeated measurements. Let W_i = {W_ik : k = 1, ..., K} be the stochastic weight vector for the ith subject in the approximation of W_i(t) given by equation (14), and A_i the n_i × K matrix with (j, k)th element φ_k(t_ij). Write x_i and d_i for the matrices with jth rows x_ij^T and d_ij^T respectively. Finally, let Θ denote the complete set of model parameters. The model for the ith subject then has a hierarchical representation in which, conditionally on (U_i, W_i, V^Z_i), Y_i is multivariate normal with mean x_i β + d_i U_i + A_i W_i plus the noise mean-shift term implied by representation (12) and covariance σ^2 diag(V^Z_i); conditionally on V^U_i, U_i is multivariate normal of the form (10); and, conditionally on V^W_i, K W_i is multivariate normal of the form (10). This is coupled with a final layer in the hierarchy: the distributions of the stochastic variance factors, which are GIG with parameters that depend on the model choice. By integrating out the latent variables and variance components, these equations collectively determine the contribution of the ith subject to the marginal log-likelihood L(Θ; Y_i). As the vectors Y_i from the m subjects are independent, the overall log-likelihood is

L(Θ; Y) = Σ_{i=1}^m L(Θ; Y_i).
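Because the model is Gaussian conditionally on the variance factors, each subject's marginal likelihood can be approximated by averaging a multivariate normal density over draws of V. The toy sketch below illustrates the idea for a random-intercept model; all numbers, and the gamma mixing distribution used at the end, are my own illustrative choices, not the paper's specification. Fixing V at 1 recovers the exact Gaussian log-likelihood, which gives a built-in correctness check.

```python
import numpy as np
from scipy.stats import multivariate_normal

n_i, sigma, tau = 4, 0.5, 1.0
x = np.column_stack([np.ones(n_i), np.arange(n_i, dtype=float)])
beta = np.array([1.0, 0.2])
y = np.array([1.1, 1.6, 1.2, 1.9])

def loglik_mc(y, V_draws):
    """Monte Carlo marginal log-likelihood: average the conditional Gaussian
    density of y over draws of the mixing variable V (random-intercept model)."""
    dens = [multivariate_normal.pdf(
                y, mean=x @ beta,
                cov=tau ** 2 * v * np.ones((n_i, n_i)) + sigma ** 2 * np.eye(n_i))
            for v in V_draws]
    return float(np.log(np.mean(dens)))

exact = loglik_mc(y, [1.0])              # V degenerate at 1: the Gaussian model, computed exactly
```

In the non-Gaussian case one would instead draw V from its GIG distribution; with heavy-tailed mixing the same averaging produces the heavier-tailed marginal likelihood.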

Stochastic gradient estimation
The computations that are required for maximum likelihood estimation are cumbersome for problems that involve longitudinal data sets with large numbers of subjects and repeats, even using the computationally efficient approximation (14). Our proposed algorithm for maximum likelihood estimation therefore uses a stochastic gradient method that calculates the gradient of the objective function at each step of the maximization by subsampling. A stochastic gradient method for the general problem of minimizing an objective function f(Θ) starts with an initial guess Θ^(0), and then iteratively updates Θ according to

Θ^(n+1) = Θ^(n) − η_n Q_n(Θ^(n)),  (16)

where Q_n(Θ) is a random variable such that E[Q_n(Θ)] = ∇_Θ f(Θ) and η_n is a sequence of positive numbers such that Σ_{n=1}^∞ η_n = ∞ and Σ_{n=1}^∞ η_n^2 < ∞; an example is η_n ∝ n^(−γ) with 0.5 < γ ≤ 1. Under mild regularity conditions, the resulting sequence Θ^(n) converges to a stationary point of f(Θ) (Kushner and Yin, 2003; Andrieu et al., 2007).
For maximum likelihood estimation, f(Θ) = −L(Θ; Y). If the data set contains a large number of subjects, we use only a small, randomly sampled subset in each iteration to generate a computationally efficient stochastic gradient method. For this, −∇_Θ L(Θ; Y) can be replaced by the random variable

−∇_Θ L_s(Θ; Y) = −s Σ_{i=1}^m J_i ∇_Θ L(Θ; Y_i),

where the J_i are independent Bernoulli random variables with P(J_i = 1) = 1/s. Since the expected value of ∇_Θ L_s(Θ; Y) with respect to these random variables is equal to ∇_Θ L(Θ; Y) for any s, the resulting stochastic gradient method (16) will converge to a stationary point of the log-likelihood. Our experience, for example with the two case-studies that we describe in Section 6, has been that, for data sets containing a large number of subjects, often we need only a small proportion of the available measurement sequences Y_i at each iteration to estimate the parameters reliably. For example, we used only 688 (3%) subjects out of 22910 for the renal case-study that is presented in Section 6.2. For our non-Gaussian models, an additional complication is that the likelihood is not available in an explicit form. However, using Fisher's identity (Dempster et al., 1977), we can compute the gradient of the log-likelihood without computing the log-likelihood itself. For all versions of our model, the log-likelihood conditional on the variance components is Gaussian and thus explicit. Fisher's identity then gives

∇_Θ L_s(Θ; Y) = E_V[∇_Θ L_s(Θ; Y, V) | Y],  (17)

where L_s(Θ; Y, V) is the log-likelihood augmented with V, which is explicitly available since Y | V is Gaussian and V is GIG. The expectation with respect to V is not, in general, explicit but can be approximated by Monte Carlo (MC) sampling from the conditional distribution V | Y. We use a Gibbs sampler and iterate between sampling from the conditional distributions V | U, W, Y and U, W | V, Y; for details, see Appendix A. Convergence of algorithms of this kind was studied in Andrieu et al. (2007).
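The key requirement on the subsampled gradient is unbiasedness: inflating each retained per-subject gradient by s exactly compensates for the inclusion probability 1/s. The Python sketch below demonstrates this for a toy Gaussian log-likelihood (the data generator and all sizes are my own illustrative choices), averaging many independent subsampled gradients and comparing with the full-data gradient.

```python
import numpy as np

rng = np.random.default_rng(5)
m, k, s = 200, 2, 10                    # m subjects; each iteration keeps roughly m/s of them
X = rng.normal(size=(m, 3, k))          # 3 measurements per subject, k regression parameters
beta_true = np.array([0.5, -1.0])
Y = np.einsum('ijk,k->ij', X, beta_true) + rng.normal(size=(m, 3))

def grad_i(i, b):
    """Per-subject gradient of the Gaussian log-likelihood in beta (unit noise variance)."""
    return X[i].T @ (Y[i] - X[i] @ b)

b0 = np.zeros(k)
full_grad = sum(grad_i(i, b0) for i in range(m))

# subsampled estimator: each retained term is inflated by s, so the expectation
# over the Bernoulli inclusion indicators J_i equals the full-data gradient
R = 4000
acc = np.zeros(k)
for _ in range(R):
    J = rng.random(m) < 1.0 / s
    g = np.zeros(k)
    for i in np.flatnonzero(J):
        g += grad_i(i, b0)
    acc += s * g
mean_subsampled = acc / R
```

Each individual subsampled gradient is noisy, which is why the step sizes η_n and the preconditioning of the next subsection matter in practice, but its expectation matches the full gradient for any s.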
When using stochastic gradient optimization to maximize over many parameters, it is important to scale the gradient by a preconditioner to give a Newton-like iteration. Our proposed algorithm is therefore

Θ^(n+1) = Θ^(n) − η_n I(Θ^(n))^(−1) Q_n(Θ^(n)),  (18)

where I(Θ^(n))^(−1) is a preconditioner to be determined and Q_n(Θ^(n)) is a stochastic approximation of the gradient based on subsampling and MC integration over V using the Gibbs sampler. One option for the preconditioner is I*(Θ) = −E_V[∇^2_Θ L_s(Θ; Y, V) | Y]. Calculation of I*(Θ) is typically easy, since ∇^2_Θ L_s(Θ; Y, V) is often explicit and can be calculated simultaneously with the gradient. Lange (1995) described the connection between using I*(Θ) and the expectation-maximization algorithm. However, if the same variables are used for the MC estimates of the expectations in I*(Θ) and Q_n(Θ), the joint updating step (18) will be biased because of correlation between the two estimated expectations. One way to avoid this is to use different samples of V | Y to compute the two expectations. Instead we use a preconditioner that is similar to the complete Fisher information,

I_cFIM(Θ) = −E_{Y,V}[∇^2_Θ L_s(Θ; Y, V)],  (19)

which often can be computed explicitly. Note that, in equation (19), the expectation is taken over both Y and V. Since the expectation is not conditioned on Y, I_cFIM does not suffer from the same biasedness issues as I*(Θ). However, it can still be biased if subsampling is used, and we therefore use a preconditioner that is a weighted average of I_cFIM(Θ) over past iterations to reduce the bias. The use of I_cFIM is not ideal, because it may result in slow convergence, but it is the best available option: the standard Fisher information matrix, I_FIM(Θ) = −E_Y[∇^2_Θ L_s(Θ; Y)], is seldom explicit and thus cannot be used as a preconditioner.
However, we do need to estimate either the standard or the observed Fisher information matrix, I_oFIM(Θ) = −∇^2_Θ L_s(Θ; Y), to calculate confidence intervals for the estimated parameters. We estimate I_oFIM(Θ) by using Louis's identity (Louis, 1982),

−∇^2_Θ L_s(Θ; Y) = −E_V[∇^2_Θ L_s(Θ; Y, V) | Y] − var_V[∇_Θ L_s(Θ; Y, V) | Y].  (20)

Both terms on the right-hand side of equation (20) can be estimated by MC sampling, as proposed for ∇_Θ L_s(Θ; Y) in equation (17). We could also estimate I_FIM(Θ) by an additional sampling step, using the fact that the standard Fisher information is the expectation of the observed Fisher information over Y. For more details on the calculation of the required gradients and of I_oFIM, see Appendix A.

Multiple-chain estimation
A drawback of the estimation procedure that was described in the previous subsection is that the stochastic nature of the method can make it difficult to determine a suitable stopping criterion.
To overcome this problem, we propose running several independent estimation procedures in parallel: using algorithm (18), we compute N_r different estimates of Θ^(n), {Θ^(n)_i : i = 1, ..., N_r}, started from the same initial guess but using independent stochastic estimates of the gradient and preconditioner.
We combine these estimates by taking the mean of the N_r estimates and calculate the estimate of the corresponding MC standard deviation σ^(n). These statistics are calculated, for the jth parameter, as

Θ̄^(n)_j = N_r^(−1) Σ_{i=1}^{N_r} Θ^(n)_{i,j},
σ^(n)_j = [{N_r(N_r − 1)}^(−1) Σ_{i=1}^{N_r} (Θ^(n)_{i,j} − Θ̄^(n)_j)^2]^(1/2),

where Θ^(n)_{i,j} denotes the jth element in Θ^(n)_i. We run each chain in batches. Whenever n = kN_b, for k = 4, 5, 6, ... the number of batches and N_b a chosen batch size (we use 1000 as default), we check whether each parameter Θ_j, the jth element of Θ, has converged on the basis of two criteria. The first is that σ^(n)_j / Θ̄^(n)_j should be smaller than a threshold, e.g. 0.1. This means that the MC variance should be sufficiently small for each parameter. The second convergence criterion is that the rate of change for each parameter should be sufficiently small. To check this, we estimate the intercept and slope of a simple linear regression model fit to the last four batch estimates of Θ_j (i.e. the values Θ̄^((k−l)N_b)_j for l = 0, ..., 3) as the outcome and iteration as the input. We then check that the magnitude of the slope is not larger than some constant (such as 0.01) times the magnitude of the intercept. We conclude convergence of a parameter if both criteria are satisfied and stop the estimation procedure if all parameters converge. An example of the trajectories for the multiple-chain estimation procedure is shown in Fig. 3.
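The two stopping criteria are simple to implement. The Python sketch below (the function check_convergence is my own illustration, with the thresholds 0.1 and 0.01 quoted in the text) applies them to a single parameter, given the parallel chain values at the current batch end and the last four combined batch estimates:

```python
import numpy as np

def check_convergence(chains, batch_means, tol_mc=0.1, tol_slope=0.01):
    """Stopping rule for one parameter.
    chains: array of shape (N_r, n) of parallel estimates.
    batch_means: combined estimates at the last four batch ends.
    Criterion 1: the MC standard deviation of the combined estimate is below
    tol_mc times its magnitude.
    Criterion 2: a straight line fitted to the last four batch means has
    |slope| <= tol_slope * |intercept|."""
    last = np.asarray(chains, dtype=float)[:, -1]
    theta_bar = last.mean()
    mc_sd = last.std(ddof=1) / np.sqrt(len(last))
    crit1 = mc_sd / abs(theta_bar) < tol_mc

    y = np.asarray(batch_means[-4:], dtype=float)
    slope, intercept = np.polyfit(np.arange(4.0), y, 1)
    crit2 = abs(slope) <= tol_slope * abs(intercept)
    return bool(crit1 and crit2)

# three parallel chains that agree, with flat recent batch means: converged
chains = np.array([[2.0, 2.01, 1.99, 2.00],
                   [2.0, 1.98, 2.02, 2.01],
                   [2.0, 2.02, 2.00, 1.99]])
```

A chain whose batch means are still drifting, e.g. 3.0, 2.5, 2.2, 2.0, fails the slope criterion even when the parallel chains agree closely with one another.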
The use of multiple chains has advantages beyond providing a stopping criterion. The first is that the combined estimates based on multiple chains naturally have lower MC variances than estimates based on a single chain. The second is that the estimates of the MC variances can be used when computing confidence intervals for the parameters, which can improve the coverage of Wald-type confidence intervals. Specifically, to compute a confidence interval for Θ_j we use

Θ̄_j^(n) ± z_{1−α/2} { σ²_j + (σ_j^(n))² }^{1/2},

where σ_j^(n) is the estimate of the MC standard deviation at the final iteration n, and σ²_j denotes the jth diagonal element of the inverse observed Fisher information matrix. Details of the calculation of the Fisher information matrix are given in Appendix A.
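A minimal sketch of such a Wald-type interval, with the Fisher-based variance inflated by the MC variance (Python, illustrative; the function name is an assumption):

```python
from statistics import NormalDist

def wald_ci(theta_hat, fisher_var, mc_sd, level=0.95):
    """Wald-type interval with the Fisher-based variance inflated by the
    MC variance of the multiple-chain estimate (illustrative sketch)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half = z * (fisher_var + mc_sd ** 2) ** 0.5
    return theta_hat - half, theta_hat + half
```

Setting mc_sd = 0 recovers the usual Fisher-information interval; a positive MC standard deviation widens the interval accordingly.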

Subsampling with fixed effects: the grouped subsampler
Another issue with the subsampling method that was described in Section 4.2 is that the subsampled matrices of covariates, x_i, may not be of full rank. If so, none of the preconditioners described above can be used, whereas regular subsampling without any preconditioner may result in large MC variation in the estimated gradient. The cystic fibrosis case-study that we describe in Section 6.1 provides an example: the data are stratified into birth cohorts whose effects are important, but one of the cohorts contains only seven patients. The issue is related to subsampling for S-estimation algorithms in linear regression models (Koller and Stahel, 2016), but we could not find a satisfactory solution in the literature that could be applied in the current context. We therefore introduce the following subsampling procedure, which we call the grouped subsampler. The procedure first builds k + 1 groups of subjects, G_0, G_1, ..., G_k, in such a way that the matrices x_{G_j} = Σ_{i∈G_j} x_i x_i^T have full rank for j ≥ 1. The groups are built iteratively, starting with G_1. The first subject is added to this group and one checks whether the covariate matrix x_{G_1} = x_1 x_1^T has full rank. If so, the formation of G_1 is complete. Otherwise more subjects must be added to the group: if x_{G_1} + x_2 x_2^T has larger rank than x_{G_1}, the second subject is added and x_{G_1} is updated to x_{G_1} + x_2 x_2^T. At this stage the formation of G_1 is complete if x_{G_1} has full column rank; otherwise the procedure continues, adding subjects in order, until x_{G_1} has full rank or no further subjects are left. If the formation of G_1 terminates because of a lack of further subjects, the model cannot be estimated from the available subjects.
Otherwise, further groups G_2, G_3, ... are constructed iteratively in the same way: subjects that are not in any of the previous groups are added to the group G_j until the covariate matrix x_{G_j} has full rank. At some point the group formation terminates because no further subjects are left. If this happens during the formation of a group, that incomplete group is removed, since its covariate matrix does not have full rank. Finally, any subjects who have not been assigned to a group are placed in the group G_0. The procedure for forming the groups is described in pseudocode in algorithm 1 in Appendix A.4.
The groups are created once, before the estimation procedure begins. During the estimation procedure, a subsampling step is performed as follows. Assume that we want to sample a proportion p ∈ (0, 1) of the subjects, let n_g be the total number of subjects in the groups G_1, ..., G_k, and let n_0 be the number of subjects in G_0. To ensure that the subsampled covariate matrix has full column rank, we must sample all subjects from at least one of the groups G_1, ..., G_k. Given this restriction, we would like to sample approximately pn_g subjects from the groups G_1, ..., G_k as well as approximately pn_0 subjects from G_0. To do so, we first sample all the subjects from m_g = [pk] of the groups G_1, ..., G_k, chosen at random; we then sample m_0 = min{max([pN − m_g n_g/k], 1), n_0} subjects at random from G_0, where N = n_g + n_0 is the total number of subjects and '[·]' denotes rounding to the nearest integer.
To obtain an unbiased estimate of the gradient when using this subsampling procedure, we assign weight k/m_g to the subjects sampled from the groups G_1, ..., G_k and weight n_0/m_0 to those sampled from G_0, so that each subject receives weight 1 divided by the probability of being sampled. The fact that we can obtain unbiased estimates by using the grouped subsampler is crucial: many apparently natural subsampling solutions to the column rank problem, e.g. sampling subjects until the matrix has full column rank, will not work, because they produce samples that cannot easily be weighted to obtain unbiased estimates.
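The group-formation step can be sketched as follows in Python (illustrative; the function name and the rank test via numpy are assumptions — the actual procedure is given as algorithm 1 in Appendix A.4).

```python
import numpy as np

def build_groups(X_list):
    """Greedy formation of full-rank groups for the grouped subsampler (sketch).

    X_list[i] is the covariate matrix of subject i (rows = observations).
    Returns (groups, G0): index lists whose accumulated covariate matrices
    have full column rank, plus the leftover group G0.
    """
    p = X_list[0].shape[1]
    remaining = list(range(len(X_list)))
    groups = []
    while remaining:
        acc = np.zeros((p, p))
        group = []
        for i in remaining:
            cand = acc + X_list[i].T @ X_list[i]
            # add the subject only if it increases the rank
            if np.linalg.matrix_rank(cand) > np.linalg.matrix_rank(acc):
                acc = cand
                group.append(i)
            if np.linalg.matrix_rank(acc) == p:
                break
        if np.linalg.matrix_rank(acc) < p:
            # no further full-rank group can be formed; leftovers form G0
            return groups, remaining
        for i in group:
            remaining.remove(i)
        groups.append(group)
    return groups, []
```

In the subsampling step one would then take all subjects from m_g randomly chosen groups, with weight k/m_g, plus m_0 subjects from G_0 with weight n_0/m_0, as described above.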

Prediction
Suppose that, for a given subject i, we want to predict the value of the latent process at a given time t_ik. Different types of prediction may be defined, depending on the scientific interest in a specific application. The first is smoothing, where the quantity of interest is the value Y*_ik given all available data for that patient, Y_i. The second is filtering, where the quantity of interest is the value given all data collected before the time t_ik. The third is nowcasting, where the quantity of interest is the value given all data collected up to and including the time t_ik. In all three cases we can sample from the relevant predictive distribution by using the same Gibbs sampler as was used when estimating the gradient, simply conditioning on different sets of data. The details of this Gibbs sampler are given in Appendix A.1. Given M values from the Gibbs sampler, we approximate the quantities of interest, such as the predictive mean or the quantiles of the predictive distribution that are needed for constructing prediction intervals, by MC integration.
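Given M sampled values, the MC approximation of the predictive mean and an equal-tailed prediction interval can be sketched as follows (Python, illustrative; the function name is an assumption):

```python
import numpy as np

def predictive_summary(samples, level=0.95):
    """Predictive mean and equal-tailed prediction interval from MC samples."""
    samples = np.asarray(samples, dtype=float)
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(samples, [alpha, 1.0 - alpha])
    return samples.mean(), (lo, hi)
```

The same summaries apply whichever conditioning set (smoothing, filtering or nowcasting) generated the samples.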

Model validation and model selection
An obvious question is how to decide when a non-Gaussian model is preferable to the standard Gaussian model. A natural first step is to check the validity of a Gaussian model through its marginal residuals, Y_ij − x_ij^T β, standardized by their variances. These standardized residuals can be plotted against theoretical quantiles of the standard normal distribution to check for departures from normality. However, such a plot does not show which of the model components are the source of the non-Gaussianity. Also, the approach cannot easily be used to check the validity of a given non-Gaussian model, since we do not necessarily know the true distribution of its residuals.
We therefore check the validity of a model, Gaussian or not, by predicting each model component from the data, given the estimated parameters, and comparing the distribution of each component with the corresponding quantity for data simulated from the fitted model. Thus, to check the validity of the error model, we compare the predicted residuals Ẑ_ij for the data with the corresponding predictions Ẑ^s_ij based on the simulated values. We can then visualize the model fit by using, for example, Q-Q-plots between Ẑ_ij and Ẑ^s_ij. If these plots deviate from the line of equality, this indicates that the model does not fit the data. For a more formal assessment, we can repeat the procedure for K different simulated data sets and compute a joint simulation envelope. If the envelope does not contain the line of equality, we can reject the hypothesis that the data were generated by the model.
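The simulation-envelope check can be sketched as follows (Python, illustrative; the pointwise min-max envelope over K simulations is one simple choice of joint envelope, and the function name is an assumption):

```python
import numpy as np

def qq_envelope(resid, simulate, K=100, rng=None):
    """Simulation envelope for a Q-Q comparison (illustrative sketch).

    resid:    predicted residuals under the fitted model.
    simulate: function rng -> one simulated residual vector of the same length.
    Returns (inside, lower, upper): whether the sorted observed residuals lie
    within the pointwise min-max envelope of K simulated order statistics.
    """
    rng = rng or np.random.default_rng()
    sims = np.vstack([np.sort(simulate(rng)) for _ in range(K)])
    lower, upper = sims.min(axis=0), sims.max(axis=0)
    obs = np.sort(np.asarray(resid, dtype=float))
    inside = bool(np.all((obs >= lower) & (obs <= upper)))
    return inside, lower, upper
```

An observed order statistic that escapes the envelope flags the corresponding model component as misspecified.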
Using the same simulations, we can similarly assess the fit of the random effects U_i by comparing the quantiles of the predicted effects for the data with those for the simulations. To assess the fit of the process W_i(t), we apply a similar procedure to the innovations of the driving process dL_i. Using the innovations, rather than W_i(t) itself, avoids the correlation-induced distortion of the empirical marginal distributions of W_i(t).

Natural progression of lung function in cystic fibrosis patients
Our first application uses data on the lung function of cystic fibrosis patients, taken from the Danish cystic fibrosis register. The patients are all aged over 5 years and entered the database between 1969 and 2010. The outcome variable is %FEV1 (percentage predicted forced expiratory volume in 1 s), which is a measure of lung function that is widely used as a descriptor of severity of disease (Davies and Alton, 2009). The data, previously analysed by Taylor-Robinson et al. (2012), are stratified into birth cohorts 1948-1957, 1958-1967, ..., 1998-2007. Baseline %FEV1 values range between 10.4 and 140.3, with a mean of 78.5. Fig. 4 shows traces for six patients, chosen to illustrate a range of total follow-up times and patterns of the outcome variable %FEV1.
Fitting a model to these data serves two purposes. The first is to characterize the mean response profile of lung function in cystic fibrosis patients, adjusted for relevant covariates. The second is to quantify the extent to which a subject's early results are predictive of their long-term prognosis.
We consider %FEV1 as the outcome, Y, and specify mixed effects models that fall within the general framework of equation (9), where each x_ij contains a number of explanatory variables, as listed in Table 3. We model W_i(t) as the solution to the stochastic differential equation (κ² − d²/dt²) W_i(t) = dL_i(t), which implies that W_i(t) has a Matérn covariance function with smoothness parameter 3/2. We also considered a model with an exponential covariance function, as in Taylor-Robinson et al. (2012), but this gave a worse fit to the data.
In this example, cohort effects are substantial, reflecting general improvements in the treatments available to cystic fibrosis patients over the time period concerned. This, coupled with the small numbers of patients in some cohorts (e.g. seven patients in the 1948-1957 cohort), explains why the grouped subsampler that was described in Section 4.4 is needed. To illustrate the effect of subsampling with the proposed grouped subsampler, we first fit a Gaussian model, i.e. assuming Gaussian distributions for U_i, W_i(t) and Z_ij, with and without subsampling. In the with-subsampling case, we subsample 20% of the patients, i.e. 96 out of 476. The resulting parameter tracks of the optimizer (for six of the model parameters) can be seen in Fig. 3. In this example there are k = 7 subsampling groups, with an average group size of eight subjects, and we sampled two groups at each iteration. The running time for each iteration scales linearly with M, the number of patients who are subsampled at each iteration, so subsampling reduces the computing time per iteration by a factor of almost 5 in this case. However, since 17000 iterations were needed to meet the convergence criteria with subsampling, compared with 6000 iterations without, the total computation time was reduced by a factor of 1.75. The variances of the subsampled estimates are slightly larger, but the final parameter estimates are almost identical. For applications with data on many more subjects, e.g. the renal case-study presented in the next section, the computational gain from subsampling would be larger.
To assess the suitability of the Gaussian distributional assumption, we inspected Q-Q-plots of the standardized marginal residuals, Y_ij − x_ij^T β. Fig. 5 suggests some departure from the Gaussian distribution but, as each marginal residual is composed of U_i, W_i(·) and Z_ij, the Q-Q-plot cannot identify the source of the departure. We therefore also examine the Q-Q-plots for the individual model components, as explained in Section 5. The results for the Gaussian model can be seen in Figs 6(a)-6(c). The normality assumption seems to be valid for the random effects U_i, but both the process, W_i(·), and the errors, Z_ij, show departures from the normal distribution. On the basis of these results, we fitted a model with the NIG assumption for W_i(·) and Z_ij and a normal assumption for U_i. The corresponding Q-Q-plots for this model form Figs 6(d)-6(f). All three components now fit fairly well, although the random effects for the data seem to have a slightly skewed distribution. Thus, as a final model, we fitted a model with an NIG assumption for each of the three model components. Figs 6(g)-6(i) show the Q-Q-plots for this model. The NIG assumption for the random effects improves the fit, so that the distributional assumptions are now reasonably good for all three components. We conclude that the NIG assumption for all three stochastic components is the most appropriate for these data. The estimates of the fixed effect parameters β for the normal and NIG models are shown in Table 3. Standard error estimates, obtained by using the observed Fisher information matrix, are generally lower under the NIG than under the Gaussian assumption. With regard to statistical significance, the p-values lead to the same conclusions under the two models, except for pancreatic sufficiency and for the main effect of cohort 1958-1967 and its interaction with age.
Note that both pancreatic sufficiency and cohort 1958-1967 are highly unbalanced: only 20 of the 476 subjects are pancreatic sufficient, and only 42 subjects belong to the 1958-1967 cohort.
In Section 7.1 we report the results of a simulation study to validate these findings. Two important conclusions from the simulation study are that (a) it is important to include a random-process term W_i(t) in the model to obtain reliable inferences for the fixed effects and (b) a Gaussian assumption for W_i(t) may still deliver reliable inferences regarding the fixed effects even when the data show signs of non-Gaussianity.

[Fig. 6. Q-Q-plots for the components of the models in Section 6.1 for the cystic fibrosis example, based on 100 simulated data sets from each model.]

Progression towards end stage renal failure
Our second application uses clinical data on the kidney function of primary care patients from the northern English city of Salford who are in high-risk groups for chronic kidney disease. The outcome variable is eGFR (estimated glomerular filtration rate, in ml/min per 1.73 m² of body surface area), which is calculated from serum creatinine, SCr, measured in micromoles per litre, by using the prediction equation of Levey et al. (1999). The data, which were previously analysed by using a Gaussian linear mixed effects model, contain a total of 392870 measurements on 22910 patients, for whom the total follow-up time ranged from 0 (i.e. only baseline data are available) to 10.0 years, whereas the number of measurements of eGFR ranged from 1 to 305. Among the 22910 patients, 11833 (51.7%) were male. Baseline ages ranged between 13.7 and 102.1 years, with a mean of 65.4. Fig. 7 shows traces for eight patients, chosen to illustrate some particularly challenging features of the data. The unusually high degree of irregularity in the follow-up times reflects the fact that the data derive from routine clinical practice. In particular, some patients provided many repeated measurements over a relatively short time period, probably during episodes of intercurrent illness.
Clinical care guidelines in the UK include a recommendation that any person in primary care who appears to be losing kidney function at a relative rate of at least 5% per year should be considered for referral to specialist secondary care. Our primary objective in analysing these data is therefore to develop a method for identifying, for each subject and in real time, when this criterion is first met.
As in the earlier Gaussian analysis of these data, we use a log-transformed outcome variable, Y = log(eGFR), and specify a model, (23), of the same general form as equation (9), in which each x_ij includes sex, baseline age, follow-up time t_ij and a piecewise linear function of age with a slope change at age 56.5 years. The processes W_i(t) are integrated random walks. As in the previous example, we first fit the model under Gaussian assumptions for the U_i-, W_i(·)- and Z_ij-components. The Q-Q-plot of the standardized residuals from the earlier Gaussian analysis clearly indicates longer-than-Gaussian tails. As for the cystic fibrosis case-study, we compute the Q-Q-plots for each model component (Fig. 8), which suggest that the Gaussian assumption is not valid for any of the model components. We therefore proceed by assuming NIG distributions for each of the three stochastic components. The fit is much improved, albeit with some discrepancy between the data and the model in the lower tail of the U_i and the upper tail of the W_i(t). Fig. 9 shows, for two patients, their observed data and the concurrent ('nowcasting') probabilities of meeting the clinical guideline for referral to specialist care. Results are shown for the Gaussian and NIG models. As would be expected, for each patient the general pattern of the predictive probabilities is similar under both modelling assumptions, but there are some substantial quantitative differences and the ranking of each pair of predictive probabilities is not consistent. The two sets of model-based predictions reflect different partitionings of the intrapatient variation into signal and noise components, and the balance between the two is affected in subtle ways by the pattern of follow-up times and their associated measurements.
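On the log(eGFR) scale, a relative decline of 5% per year corresponds to a slope below log(0.95) ≈ −0.051, so the nowcast referral probability can be approximated from MC samples of the current slope of the latent trajectory. A Python sketch (illustrative; the function name is an assumption):

```python
import numpy as np

# A relative decline of 5% per year in eGFR corresponds, on the log scale,
# to a slope below log(0.95) ~ -0.0513 per year.
LOG_THRESHOLD = np.log(0.95)

def referral_probability(slope_samples):
    """P(relative decline of at least 5%/year) from MC samples of the
    current derivative of the log(eGFR) trajectory (illustrative)."""
    slopes = np.asarray(slope_samples, dtype=float)
    return float(np.mean(slopes <= LOG_THRESHOLD))
```

The samples of the slope would come from the same Gibbs sampler that is used for the nowcasting predictions.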
To put these differences in context, in Section 7.2 we report the results of a simulation study, where we find that the distributional assumptions have a strong effect on the predictive performance.

Fixed effects estimation
We conduct a simulation study to investigate the extent to which distributional assumptions affect the validity and/or efficiency of estimators for β. We focus on evaluating the bias and coverage properties of the estimators by using the models for the cystic fibrosis patients that were presented in Section 6.1, but with a reduction in size to 256 patients covering the cohorts between 1958 and 1978 to reduce computation time.
We consider two simulation models, denoted normal and NIG. In the first, all three stochastic components are Gaussian, with parameters set at the estimates that were obtained from the cystic fibrosis data; in the second, all three stochastic components are NIG distributed, again with parameters set at their estimates from the cystic fibrosis data; see Table 3 for β and Table 4 for the parameters of U_i, W_i(t) and Z_ij. We generate 250 replicate data sets from each of the two simulation models. For each data set, we fit both simulation models and three additional 'wrong' Gaussian models: a standard multiple linear regression model, a random-intercept-only model and a random-intercept and random-slope model. In each case, we evaluate the empirical bias of each parameter estimator and the coverage of nominal 90% confidence intervals over the 250 replicates. The confidence intervals are computed as explained in Section 4.3. The results are presented in Tables 5 and 6. Important findings are as follows.
(a) For both the normal and the NIG simulation models, the linear regression model and the two random-effects models give very poor coverages, indicating that inclusion of the process component W_i(t) is crucial for making correct inferences about the fixed effects. (b) Provided that the process component is included, coverages remain close to the nominal level even when the fitted distributional assumption is wrong. The overall conclusion from this small simulation study is that it is important to include a random process in the model to obtain reliable inference on the fixed effects but that, for this purpose, a Gaussian model can give reliable inferences even if the data show signs of being non-Gaussian.
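The empirical coverage computation underlying Tables 5 and 6 can be sketched as follows (Python, illustrative; the function name is an assumption):

```python
import numpy as np

def coverage(true_value, lowers, uppers):
    """Empirical coverage of confidence intervals over simulation replicates.

    lowers, uppers: interval endpoints for one parameter, one entry per
    replicate data set; true_value is the parameter used in the simulation."""
    lowers = np.asarray(lowers, dtype=float)
    uppers = np.asarray(uppers, dtype=float)
    return float(np.mean((lowers <= true_value) & (true_value <= uppers)))
```

A well-calibrated 90% interval should give a value near 0.9 over the 250 replicates.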

Prediction accuracy
To study the importance of the distributional assumptions for prediction, we perform a simulation study based on the renal failure application that was presented in Section 6.2. We simulate new data from the fitted NIG model for the two patients who are shown in Fig. 9. For each simulated sequence Y_ij = log(eGFR) at follow-up times t_ij, we then use the fitted NIG and Gaussian models that were reported in Section 6.2 to obtain the nowcasting predictions. Table 7 shows the results based on 100 simulated data sets, using four summaries of predictive performance: the mean absolute error, MAE, the root-mean-squared error, RMSE, the mean coverage of 95% prediction intervals, COV, and the average width of these prediction intervals, Width. For the two models, we also examine the model-based predictive probabilities of meeting the clinical guideline for referral to specialist care, and we compare these with the corresponding simulated values of the gradient of log-transformed eGFR. This comparison is summarized by using the receiver operating characteristic curves in Fig. 10, one for the Gaussian model and one for the NIG model. Each curve is constructed by varying the predictive probability threshold at which referral is triggered and calculating the corresponding proportions of correct and incorrect (according to the clinical guideline) referrals and non-referrals.
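Sweeping the threshold in this way can be sketched as follows (Python, illustrative; the function name is an assumption):

```python
import numpy as np

def roc_points(probs, labels):
    """ROC curve obtained by sweeping the probability threshold (sketch).

    probs:  predictive probabilities of meeting the referral criterion.
    labels: 1 if the simulated truth meets the criterion, 0 otherwise.
    Returns (fpr, tpr) ordered by decreasing threshold, starting at (0, 0).
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.r_[np.inf, np.sort(np.unique(probs))[::-1]]
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    tpr = np.array([((probs >= t) & (labels == 1)).sum() / n_pos for t in thresholds])
    fpr = np.array([((probs >= t) & (labels == 0)).sum() / n_neg for t in thresholds])
    return fpr, tpr
```

The area under the resulting curve summarizes how well the predictive probabilities rank the simulated referral decisions.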
There is a substantial increase in predictive power, according to all measures of accuracy, when the correct model, the NIG model, is used.

Software
We have implemented the methodology that is presented in this paper in the R package ngme. A development version of the package is available from https://bitbucket.org/davidbolin/ngme. The package includes functions for parameter estimation and for subject level prediction using the class of models defined by expression (9), with the following features.
(a) Any linear model can be specified for the regression term x_ij^T β and for the subject level random effect d_ij^T U_i, using the standard R model formula syntax.
(b) The random-effects distribution can be chosen as normal or NIG.
(c) The covariance structure of the W_i(t) can be specified as a stationary, exponentially correlated process (the fully asymmetric version), as a symmetric or asymmetric Matérn model with smoothness 1.5, as a non-stationary integrated random walk, or omitted altogether to give non-Gaussian versions of the Laird-Ware model. The distribution of the process can be specified as normal, NIG, GAL or Cauchy.
(d) The distribution of the measurement error terms can be specified as normal, NIG or t.
(e) Subject level predictions can be obtained through nowcasting (conditioning on a subject's past and current measurement data), smoothing (conditioning on all of a subject's data) or forecasting (conditioning on all of a subject's past data).
The generic R functions print, summary, plot, fitted and residuals are available for the estimation and prediction functions, and the renal data set is included. We plan to extend the package's functionality to a wider range of models for the stochastic process component W_i(t), including a general Matérn correlation structure. The package also has support for the estimation of non-Gaussian models for spatial data, as we discuss further in the next section.

Discussion
The Gaussian version of the linear mixed model (9) represents the standard approach to analysing real-valued repeated measurement data. Typically, the simplified version without the Gaussian process term W_i(t) suffices when the number of follow-up times per subject is small, whereas the version with the W_i(t) term, often in conjunction with a simple random intercept U_i in place of the general term d_ij^T U_i, usually gives a better fit to data with long follow-up sequences. Concerns have often been raised about the legitimacy of the Gaussian assumption, and in particular about the consequences of fitting Gaussian models when elements of the underlying process have longer-than-Gaussian tails or skewness. This has led to an extensive literature, which we reviewed in Section 2. However, to the best of our knowledge the current paper is the first to provide a flexible implementation in which departure from Gaussianity can be assessed independently for each of the three stochastic components of model (9).
In our reanalysis of the cystic fibrosis data, inferences on the fixed effects showed only small changes when non-Gaussian behaviour was taken into account. Our reanalysis of the renal data also found evidence of non-Gaussian behaviour, which in this case matters more, because it has a material effect on predictive inference, and hence on the point at which an individual patient in primary care would be considered for referral to secondary care.
We have emphasized the importance of building a computationally efficient algorithm for routine maximization of the likelihood. This is especially useful for data sets containing many subjects. Arguably, computational efficiency is of secondary importance in confirmatory analysis: once the statistical analysis protocol has been determined, it matters little whether it takes minutes, hours or days of computing time to analyse a data set that typically will have taken weeks, months or years to collect. However, during the iterative model building cycle that characterizes exploratory data analysis, an inability to fit and compare different models in real time is a severe impediment. The applications that were described in Section 6 show that the subsampling scheme that was introduced in Section 4.4 can perform very well. One topic for future research is a more thorough investigation of how to optimize the subsampling.
Generalized linear mixed models provide a framework for handling non-Gaussian sampling distributions. This form of non-Gaussian behaviour is complementary to the kind of non-Gaussian process behaviour that we have addressed in this paper. A natural extension to our proposed models would be to generalized linear mixed models for binary or count data with non-Gaussian random effects. However, non-Gaussian behaviour will naturally be more difficult to detect from count or binary data than from measurement data. Binary data in particular can be considered as a heavily censored version of measurement data. For example, a logistic regression model can be interpreted as a linear regression model for a real-valued response Y in which only the sign of Y is observed.
Clinical repeated measurement data are often coupled with time-to-event outcomes, e.g. death. So-called joint models for repeated measurement and time-to-event outcomes have been widely studied; for a recent book-length account, see Rizopoulos (2012). However, essentially all of this literature assumes that any random-effect components are Gaussian. A natural way of extending the methodology that is presented in this paper to joint modelling problems, by analogy with much of the current literature on Gaussian joint models, would be to combine the linear mixed model (9) with a log-linear Cox process model for the time-to-event outcome, in which the stochastic process W_i(t) in the repeated measurement submodel is correlated with a second stochastic process, W*_i(t) say, such that exp{W*_i(t)} constitutes a time-dependent frailty for the ith subject.
Another possible extension of the methodology that is presented in this paper would be to multivariate settings, in which more than one repeated biomarker measurement is collected for each patient, sometimes with different follow-up schedules for different biomarkers. Such models could be constructed similarly to multivariate random-field models. Although those models have not previously been considered in a longitudinal setting, they can naturally be extended to this case, which could then be viewed as an extension of the models that are considered in this work, with the temporal stochastic process replaced by a random field. The ngme package has support for such spatial and multivariate models, both for the longitudinal setting and for the classical geostatistical setting.

A.1. Gibbs sampling
In this section we derive the two conditional distributions that are required for the Gibbs sampler. The first is the distribution of X_i = (U_i, W_i) given Y_i and V_i = (V_Z_i, V_U_i, V_W_i), and the second is the distribution of V_i given X_i and Y_i. Using the specification of the hierarchical model from Section 4.1, straightforward calculations based on properties of the multivariate normal distribution give the conditional distribution of X_i given Y_i and V_i. To compute the distribution of V_i given X_i and Y_i, we use the following proposition regarding the convolution of a GIG variable and a Gaussian variable.
The proof is straightforward and hence is omitted. Applying the proposition to the conditional density of V_i shows that the three components V_Z_i, V_U_i and V_W_i are conditionally independent, and it gives their three GIG distributions.

A.2. The gradient of the likelihood
In this section we derive the gradient that is needed for estimating the parameters of the model from Section 4.1. We use the notation from Appendix A.1 and also assume that E[V_U_i] = 1, E[V_Z_i] = 1 and E[V_W_i] = h_i, which was previously assumed for parameter identifiability. Recall that the goal is to evaluate the gradient in equation (18), which can be written as a sum of the subject-level gradients ∇_Θ L_i(Θ; Y_i), each of which we approximate by MC integration. To simplify the notation, we let L_i(Θ; X_i, V_i, Y_i) denote the complete log-likelihood for the ith patient, seen as a function of the parameters Θ, and similarly let L_i(Θ|V_i, Y_i) denote the log-likelihood conditioned on the variance components. To derive the gradients, we also need the following notation from matrix calculus. The vec operator transforms a matrix into a vector by stacking its columns. The vech operator also transforms an n × n matrix into a vector, but removes the supradiagonal elements. Finally, the duplication matrix D_n is such that, for any symmetric matrix A, D_n vech(A) = vec(A).
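These matrix-calculus operators can be checked numerically; the following Python sketch (illustrative, with the standard column-major conventions assumed) constructs vec, vech and D_n and verifies the defining identity D_n vech(A) = vec(A) for a symmetric A.

```python
import numpy as np

def vec(A):
    """Stack the columns of A into a vector (column-major order)."""
    return np.asarray(A).reshape(-1, order="F")

def vech(A):
    """Stack the columns of the lower triangle of A, diagonal included."""
    n = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(n)])

def duplication_matrix(n):
    """D_n with D_n vech(A) = vec(A) for every symmetric n x n matrix A."""
    D = np.zeros((n * n, n * (n + 1) // 2))
    k = 0
    for j in range(n):
        for i in range(j, n):
            D[j * n + i, k] = 1.0   # entry (i, j) in column-major vec
            D[i * n + j, k] = 1.0   # its symmetric counterpart (j, i)
            k += 1
    return D
```

For n = 2, vech has length 3 and vec has length 4, and D_2 simply duplicates the off-diagonal element.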
We start by deriving the gradients for the fixed effects and the asymmetry parameters. Let (β, μ) = (β, μ_U, μ_Z, μ_W) and let B_i denote the corresponding design matrix.
Using the model definition from Section 4.1, we obtain the gradient with respect to (β, μ); using that the expected value of X_i given V_i and Y_i is b̂_i (see Appendix A.1), the required conditional expectation follows. The gradient of the complete log-likelihood for the noise variance σ² involves m_i, the number of observations for patient i, and the residual e_i = y_i − B_i(β, μ) − G_i X̂_i. We could compute ∇_σ log L_i(Θ; y_i, V_i) by taking the expectation with respect to X_i but, in our implementation, we simply use the values of X_i from the Gibbs sampler to approximate this expected value by MC integration: the estimation of σ is so simple compared with that of the other parameters that the analytical gradient for this parameter was not worth implementing.
To derive the gradient with respect to the covariance matrix Σ of the random effects, we use the commutation matrix K_dd and the duplication matrix D_d (Magnus and Neudecker (2007), pages 389-390). For a generic parameter θ of the differential operator that is used to define the process, we define (K_θ)_ij = dK_ij/dθ and obtain the corresponding gradient. What remains is to compute the gradient with respect to the variance mixing parameters. For these parameters the complete log-likelihood is entirely determined by the specified distribution of the variance mixing variables, so there is no generic form; instead we present the three main distributions that we have considered and the resulting gradients. The three distributions are the gamma distribution, with density f(v; p, a) = Γ(p)^{-1} a^p v^{p−1} exp(−av), the inverse Gaussian distribution, with density f(v; a, b) = b^{1/2} (2π)^{-1/2} v^{-3/2} exp{−av/2 − b/(2v) + √(ab)}, and the inverse gamma distribution, with density f(v; p, b) = Γ(p)^{-1} b^p v^{−p−1} exp(−b/v). The resulting gradients are given in Table 8 for the noise parameters, Table 9 for the random-effect parameters and Table 10 for the process parameters. In Tables 8-10, ψ is the digamma function and ψ_1 is the trigamma function. Note that we use a non-standard form of the t-distribution in which ν is half the number of degrees of freedom and the parameterization is chosen so that a symmetric version has variance 1. For ν < 2 this is not possible (since the variance is unbounded) and one can then instead use the parameterization IGam(ν/2, ν/2 + 1), which puts the mode of the IGam density at 1.

A.2.1. Joint Fisher information for mixed effects parameters
When computing Wald-type confidence intervals based on the Fisher information matrix, one should ideally compute it from the joint Hessian for all parameters. A simpler alternative, which we adopt, is to compute the joint Fisher information matrix only for the mixed effects parameters, which means that we do not take the uncertainty of the other parameters into account. This simplifies the implementation greatly and should in most scenarios have little effect, since the other parameters converge much faster than the mixed effects parameters. The reason is that the random effects vary between individuals, so for them each individual can be seen as a single observation, whereas the other parameters also receive information from all longitudinal observations for each patient. Let Θ_m = (β, μ, vech(Σ), ν) be the vector of all parameters for a model with NIG-distributed mixed effects and let L_i(Θ_m) denote the complete likelihood L_i(Θ_m; Y_i, V_i, X_i). The negative Hessian of the log-likelihood for patient i (the observed Fisher information is this negative Hessian evaluated at the mode) can be expressed in terms of inner expectations with respect to X, which we now derive; the outer expectations are computed through MCMC sampling. We shall utilize the distribution derived for the Gibbs sampler in Appendix A.1, but for simplicity drop the indices on the mean and covariance: X ∼ N(b̂ + b, Q̂⁻¹) = N(b*, Σ̂); from here on we always work with the ith patient and therefore drop that index from the notation.
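The Wald construction described above can be made concrete in a few lines. This is a generic sketch, not the ngme internals: invert the observed Fisher information, take square roots of the diagonal for standard errors, and form θ̂ ± z₀.₉₇₅ · SE.

```python
import numpy as np

Z975 = 1.959963984540054  # standard normal 97.5% quantile

def wald_intervals(theta_hat, obs_info):
    """95% Wald intervals from the observed Fisher information,
    i.e. the negative Hessian of the log-likelihood at the estimate;
    its inverse approximates the covariance matrix of theta_hat."""
    cov = np.linalg.inv(obs_info)
    se = np.sqrt(np.diag(cov))
    return theta_hat - Z975 * se, theta_hat + Z975 * se
```

In practice one would invert only the block corresponding to the mixed effects parameters, as the appendix describes, leaving the faster-converging parameters out of the joint Hessian.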
For the derivations, we shall need the following lemma.
Lemma 1 (equation (24)).
Proof. The corresponding moment identity follows from corollary 2.2.7.2 in Kollo and von Rosen (2006). To link this result with equation (24) we define two matrices, B_A and B_B, such that X_A = B_A X and X_B = B_B X. The left-hand side of equation (24) can then be rewritten in terms of vec(X Xᵀ), and joining this with equation (25) gives three terms from which we extract the final result.
Several of the required gradients are straightforward to derive and we therefore omit most details; for example, a number of simple second derivatives are obtained directly. Likewise, for the parameter ν of the mixing component, the gradient does not depend on X|V and its expected value is hence easily evaluated from the gradient; see Tables 8, 9 and 10. We now present the more difficult components. We start by evaluating the expected value of the outer product of ∇_vech(Σ) L_i(Θ_m), which we split into several parts. To derive the expectation of the term K₁ we use theorem 4.3 of Magnus and Neudecker (1979), where K_dd again denotes the commutation matrix. We continue with the outer product of ∇_[β,μ] L_i(Θ_m); the expectations of its terms are easily obtained since X ∼ N(b*, Σ̂). To compute these expectations we typically do not form Σ̂ explicitly but rather solve linear systems using Q̂. The final difficulty is the expectation of the cross outer product; for most of its terms we have already calculated the corresponding expectation, and the only new term is X̂ vec(M)ᵀ, which is given by lemma 1.

A.3. Similarity between densities
Finding the parameters of the GH distribution is in general difficult, because the full log-likelihood surface is largely flat, which makes the model parameters almost non-identifiable. A further difficulty is that the boundary of the parameter space often contains distinct limiting distributions, and hence one cannot expect the parameters to be contained in a compact region of the parameter space. This problem also arises within the subfamilies that were discussed above. For instance, an NIG distribution converges to a Cauchy distribution as a → 0, and to a Gaussian distribution if a → ∞ and b → ∞ at the same rate. Recognizing these limiting cases is important in practice, since failing to do so can lead to situations where the parameter estimates do not converge.
A remedy for this issue within the subfamilies that we are studying is to fix a compact parameter space and, if the parameters converge to the boundary, to re-estimate the parameters in the limiting distribution. We set these boundaries by examining the total variation (TV) distance between pairs of densities; for illustration, to compare a symmetric NIG distribution having fixed a with a symmetric Cauchy distribution, we calculate the TV distance between the two densities. To simplify the calculations that are needed to find the Cauchy distribution CH(0, 0, b_CH) that is closest to the NIG(0, 0, a, b_NIG) distribution, we use the following proposition, which shows that it suffices first to find the Cauchy distribution that is closest to NIG(0, 0, a, 1) and then to rescale the shape parameter by b_NIG.
Proposition 2. Let f_s(x) and g_h(x) be two densities with respect to the Lebesgue measure, with scale parameters s and h. Then TV(f_s, g_h) = TV(f_{s/c}, g_{h/c}) for any c > 0.
Proof. Write the TV distance as the integral of the absolute difference of the two densities, and then use integration by substitution with φ(x) = x/c to give the result.
Fig. 11 shows the TV distances between the NIG and Cauchy distributions, and between the NIG and Gaussian distributions, as functions of a. One can then translate the difference between the densities into comparisons with densities that one is more familiar with. For example, for a = 0.001, the TV distance between the NIG and Cauchy distributions is less than that between two Bernoulli distributions whose probabilities differ by 0.002. The same applies to the TV distance between the NIG and normal distributions when a = 250.
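Proposition 2 is easy to verify numerically. The sketch below is an illustration using Gaussian and Cauchy families as stand-ins (the NIG density itself requires a Bessel function): it computes TV distances by a simple Riemann sum and checks the invariance under a common rescaling of the two scale parameters.

```python
import numpy as np

def tv_distance(f, g, half_width=5000.0, n=1_000_000):
    """TV(f, g) = 0.5 * integral of |f - g|, by a Riemann sum on a grid."""
    x = np.linspace(-half_width, half_width, n)
    dx = x[1] - x[0]
    return 0.5 * np.sum(np.abs(f(x) - g(x))) * dx

def gaussian(s):
    """Normal density with scale s."""
    return lambda x: np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

def cauchy(h):
    """Cauchy density with scale h."""
    return lambda x: h / (np.pi * (x**2 + h**2))
```

Scaling both densities by the same factor (here s = h = 1 versus s = h = 2, i.e. c = 1/2) leaves the TV distance unchanged up to discretization error, which is the content of the proposition.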
In ngme we set the boundaries of the NIG parameter space to 0.001 ≤ a ≤ 250, and a warning is given if a boundary is hit. Of course, one needs to be a little cautious in using the limiting distribution since, although the TV distance decreases, the tails of the distributions remain different: for instance, the NIG distribution has exponential tails whereas the Cauchy distribution has polynomial tails.

A.4. Pseudocode for the grouped subsampler
Algorithm 1 (Table 11) contains pseudocode describing the group formation of the grouped subsampler.
Here the final term vanishes if Dirichlet or Neumann boundary conditions are used.
For the NIG version of the model, we approximate the distribution of L_k as above. It follows that the distribution of the stochastic weight vector W, conditional on V, can be written as in equation (15).
When D = κ + ∂/∂t is a first-order operator, we cannot use the Galerkin method. We instead use a Petrov-Galerkin method, in which the test functions above are replaced by piecewise constant functions. With this change, the distribution of L is the same as above (and in this case the discretization involves no approximation), but the elements of K are given by equation (28). If the operator is an integer power of a first- or second-order operator, the model can be rewritten as a system of equations: for example, D²W(t) = dL(t) can be formulated as the system DW(t) = U(t), DU(t) = dL(t). Both of these equations can then be discretized by using the method above, and combining the two discretizations yields the equation KC⁻¹KW = L for the coefficients, where C is the mass matrix with elements C_kk = ⟨ψ_k, φ_k⟩. If the operator is a fractional power of a first- or second-order operator, this iterative formulation cannot be used; the fractional power could, however, probably still be handled by using existing numerical methods for fractional operators.
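As a sketch of the first-order discretization, the following builds one plausible version of K on a uniform mesh (my own minimal construction; the exact entries of equation (28) and of ngme may differ, e.g. in boundary handling). Testing hat-function trial space against piecewise constants gives one row per mesh interval, coinciding with a Crank-Nicolson-type stencil.

```python
import numpy as np

def K_first_order(kappa, n_nodes, h):
    """Petrov-Galerkin matrix for D = kappa + d/dt on a uniform mesh:
    piecewise linear (hat) trial functions, piecewise constant test
    functions. Row k tests the equation on interval (s_k, s_{k+1});
    integrating over the interval gives, up to the factor h,
        (u_{k+1} - u_k) + h * kappa * (u_k + u_{k+1}) / 2.
    """
    K = np.zeros((n_nodes - 1, n_nodes))
    for k in range(n_nodes - 1):
        K[k, k]     = -1.0 + h * kappa / 2.0
        K[k, k + 1] =  1.0 + h * kappa / 2.0
    return K
```

A quick sanity check: applying K to samples of u(t) = exp(−κt), for which (κ + d/dt)u = 0, should give near-zero residuals on every interval.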

Arnošt Komárek (Charles University, Prague)
The approach of Asar, Bolin, Diggle and Wallin is motivated by a classical linear mixed effects model for repeated or longitudinal outcomes Y_ij, i = 1, …, m, j = 1, …, n_i, that can be specified as Y_ij = x_ijᵀ β + d_ijᵀ U_i + σ Z_ij (29), where x_ij and d_ij are vectors of covariates, usually involving the time t_ij of measurement. Further, β and σ are unknown parameters, U_i is the random-effects vector and Z_ij is the error term of the model. It is traditionally assumed that both the random effects and the error terms are independent and identically distributed and Gaussian. As Asar and his colleagues state, a (perhaps too) frequently used variant of model (29) is a simple 'random-intercept and random-slope' model where d_ijᵀ U_i ≡ U_i,1 + U_i,2 t_ij. This is clearly inappropriate in many practical settings. To overcome this model deficiency, they follow earlier proposals and add a stochastic process W_i(t_ij), specified through a parametric covariance function, to the model formula (29). They mention that an alternative way to capture non-linear behaviour of repeated measurements is to use, for example, regression splines instead of the stochastic process, i.e. the process W_i(t_ij) would be replaced by a term like Σ_{k=1}^K U_i,k B_k(t_ij) (30), with U_i = (U_i,1, …, U_i,K)ᵀ a vector of random effects and B_1(t), …, B_K(t) a suitable spline basis. In my opinion, the two approaches are complementary rather than competing. The stochastic process approach is certainly more parsimonious with respect to the number of unknown parameters and, as such, it would probably be the first choice if prediction is the primary aim of the analysis. However, if the main interest lies in modelling the shape of both the subject-specific and the mean trajectories, then regression splines seem more natural to me and enable an easier interpretation of the fitted model by practitioners.
Not only the assumed shape of a (subject-specific) trajectory (model formula (29)) but also the distributional assumptions make the model either only wrong or wrong but useful. As mentioned above, and also by Asar and his colleagues, all random components of the model, i.e. the distribution of the random effects U_i, the error terms Z_ij and possible additional model components, e.g. the stochastic process W_i(t), are often assumed to be Gaussian. It is clear that it is a convolution of those random elements which determines the marginal distribution of the outcome vector Y_i = (Y_i,1, …, Y_i,n_i)ᵀ, enters the likelihood and is identifiable from the data. Hence it is not possible to be fully flexible ('non-parametric') in the specification of all the random components. Asar and his colleagues propose a generalized inverse Gaussian (GIG) distributional class which is still fully parametric, yet quite flexible. An algorithm to calculate the maximum likelihood estimate is not only proposed but also implemented, along with additional supplementary routines, in the form of the contributed R package ngme.
Here, I should like to make two proposals for possible future development of the model of Asar and his colleagues. First, in all applications in Section 6, only the random intercept, i.e. a univariate random effect U_i, is considered next to the stochastic process W_i(t) to capture non-linearity in subject-specific trajectories. Suppose that someone prefers a more structured specification of the subject-specific trajectories, made by, for example, the regression splines (30) instead of the stochastic process. This would clearly require a multivariate random-effects vector U_i and it would be interesting to see how the estimation methods perform in this case. Second, the GIG distributional class is in fact specified hierarchically (equation (10)). Hence the GIG class seems to be perfectly suited for use within Bayesian models and inference based on Markov chain Monte Carlo (MCMC) and related methods. As not everybody would want to use Bayesian approaches, e.g. because priors must be specified, I appreciate the maximum likelihood estimation procedures of Asar and his colleagues. Nevertheless, MCMC-based Bayesian inference seems easier and would perhaps enable use of the GIG class within models of even higher complexity.
In summary, I congratulate the authors on an excellent paper with great potential to trigger further research and development of additional 'wrong but useful' models for the increasingly complex data sets that are continuously appearing in very different research areas. Last, but not least, the software implementation of the methodology should be applauded, as it greatly increases the chance that the method will also be used to solve real problems.

Daniel Farewell (Cardiff University)
I am grateful for the invitation to study this paper and to pose questions to its authors. The paper exhibits their collective and characteristic skill in addressing substantive clinical questions like 'is this patient losing kidney function at a relative rate of at least 5%?', which they answer by using nowcasting based on estimated random effects. My personal highlight is their demonstration that normal-inverse Gaussian modelling markedly influences this answer (Fig. 9). I remember reading Verbeke and Lesaffre (1997) and extrapolating (quite unjustifiably, as I now see) that I need never fret over whether random effects might deviate from normality. That only fixed effect estimation is reasonably unaffected by such departures is an important subtlety in discussions of robustness, and it is presented particularly clearly here.
I also want to commend the authors' graphical presentation of both data and model diagnostics. In comparison with the customary spaghetti plots that are ubiquitous in exploratory analysis of longitudinal data, I found the 'spaghetti-and-meatballs' plots that are favoured by the authors much more helpful: they clearly illustrate both the measurement range and some carefully selected individual dynamics of subject-specific trajectories, with neither obscuring the other.
A third point of interest was the authors' use of the duality between Gaussian processes expressed in terms of covariance functions and as solutions to stochastic differential equations. The latter dynamic perspective is appealing for the study of longitudinal data, but it is still reasonably rare (see Pan and Mackenzie (2003) for an exception). I was pleased that some of the authors' models were not time reversible, and I invite them to consider walking even further down this road, making time a fundamental and irreversible feature.
A dynamic approach simplifies the treatment of informative observation or dropout: as an example, the authors mention patients who have many measurements during periods of illness. By contrast, conditions like missingness at random are purely probabilistic and agnostic to temporal ordering. This makes it difficult to assess whether an observation mechanism is ignorable. The standard approach to longitudinal data analysis is in the same spirit: begin with a 'complete, underlying' longitudinal trajectory (Ỹ_t : t ∈ ℕ or ℝ₊) for each subject, which is then incompletely observed at times t_1, t_2 and so on. Yet a quantity like %FEV1 cannot be defined in continuous time: by definition, it is an average over a time period of 1 s, and, more importantly, exhalation cannot be sustained indefinitely! Any continuous time quantification of %FEV1 must therefore involve so-called cross-world counterfactuals (Richardson and Robins, 2013), and modelling the covariance of cross-world counterfactuals is best avoided wherever possible. Perhaps more importantly, the standard missing data approach struggles under the weight of the static, global perspective that it induces, whereas a dynamic approach can more easily be melded to the substantive problem at hand (Farewell et al., 2017).
One possible alternative is to define the continuous time process of interest not as an 'underlying' measurement of %FEV1 (say), but instead as local characteristics of the data themselves. The data arising from a given subject may be represented as a marked point process {(T_j, Y_j) : j = 1, 2, 3, …}, where T_j records the time at which measurement j is taken and Y_j the corresponding value of that measurement (Jacobsen, 2006). As in the Doob-Meyer decomposition, this stochastic process may be partitioned into 'signal' and 'noise', the signal being a previsible process and the noise a martingale. I would like to know whether the authors think that they could rephrase practical questions like 'are they losing function at a rate of 5%?' with reference to the intensity of a marked point process rather than an 'underlying' variable.
Contrast this decomposition with the components of variance that are described by the authors' model. Their first separation into 'signal' and 'noise' isolates U and W from the 'measurement error' Z. Again, this invites speculation about and perhaps even presupposes a hypothetical error-free 'measurement' that exists at every time point. But is this unambiguously defined? It is a little like asking whether there is (in principle) an error-free photograph of a beautiful landscape: the subject matter is massively multivariate, and defining a suitably pure projection of the phenomenon onto a two-dimensional representation is challenging. The underlying principle of the authors' modelling seems to be that this pure, low dimensional representation can be defined by removing high frequency components like Z. Do the authors agree and, if so, do they see merit in a more formal use of frequency domain decompositions?
I am also curious about the second decomposition into U and W. Would the authors contend that these can be interpreted separately, and are there prediction tasks for which they feel that such a separation might be important?
Gertrude Stein noted that '[s]ilent gratitude isn't very much use to anyone'. In seconding the vote of thanks, I therefore declare my loud admiration for this paper, and I invite the audience to join me in expressing gratitude to the authors for their beautiful and practical research.
The vote of thanks was passed by acclamation.

Alex Stringer and Patrick Brown (University of Toronto and Centre for Global Health Research, Toronto) and Jamie Stafford (University of Toronto)
We congratulate Asar, Bolin, Diggle and Wallin on an important development in an important topic and believe that their new model and software will become an important part of the statistician's toolbox. We especially appreciate the use of stochastic optimization for maximizing the complicated marginal likelihood that arises from their procedure. In principle, any non-convex optimization algorithm could be used. In practice, stochastic optimization algorithms are popular in the machine learning literature for optimizing complicated non-convex objective functions and we believe that they should be more widely considered in statistical applications when appropriate.
In our experience, stochastic optimization becomes more appealing as the size of the data set increases and the objective function becomes more complex. The authors' use of several optimizations running in parallel suggests that computation time and memory use are of lesser concern than stability in their application. Would they comment on the specific factors that motivate their use of stochastic optimization, and the effect that it had on computation time and memory use? Under what circumstances would multiple subsampling-based optimizations in parallel be preferable to running a single optimization using the full data set? Out of the box procedures are available in R (e.g. the ipoptr software) that are highly stable and efficient, and it would be interesting to assess their performance on this interesting and complex problem.
A related point is that the authors choose to develop several custom procedures, specifically for assessing convergence, preconditioning and ensuring that intermediate quantities remain computable during the optimization. Have they considered using an out of the box method for gradient-based stochastic optimization? There is a rich literature on this topic in the field of machine learning. For example, the Adam algorithm (Kingma and Ba, 2015) effectively uses a bias-corrected gradient-based preconditioner that we have found to be fast, memory efficient and stable in large-scale problems. This would eliminate the need for the grouped subsampling procedures of Section 4.4 and potentially improve the computation time of the authors' procedure.
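For readers unfamiliar with it, the Adam update mentioned above can be sketched in a few lines. This is a generic sketch of Kingma and Ba's algorithm, not tied to ngme: first- and second-moment estimates of the gradient are tracked, bias-corrected, and used to precondition the step.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma and Ba, 2015). t is the 1-based step count."""
    m = b1 * m + (1 - b1) * grad            # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad**2         # second-moment estimate
    m_hat = m / (1 - b1**t)                 # bias corrections
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The division by √v̂ is the bias-corrected preconditioner that the comment refers to: each coordinate's step is normalized by a running estimate of that coordinate's gradient magnitude.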

Neil K. Chada and Ajay Jasra (King Abdullah University of Science and Technology, Thuwal)
We thank Asar, Bolin, Diggle and Wallin for this thought-provoking paper, which is concerned with understanding and introducing non-Gaussian phenomena in mixed models of the form of their model (31). The motivation is to account for the non-linear trajectories of the stochastic effects within these models. Traditionally, much of the work in this field has assumed that these components are normally distributed. The authors tackle this by presenting the multivariate generalized hyperbolic distribution, which includes many widely used distributions such as the generalized asymmetric Laplace, Student t and Cauchy distributions. To introduce the non-Gaussian stochastic process W_i(t) of model (31) they solve a stochastic differential equation DW_i(t) = dL_i(t), where D is a particular choice of operator and dL_i(t) is a non-Gaussian Lévy process. This links well with previously used methodologies, in which solving a particular stochastic partial differential equation leads to a Gaussian random field with a Matérn covariance function, the difference being that there dL_i(t) represents Gaussian white noise. A recent trend in the field of statistical inverse problems (Stuart, 2010), which is concerned with parameter estimation, has been incorporating non-Gaussian prior information. Some of this work (Chada et al., 2019) has focused on Cauchy processes, which arise from α-stable distributions. If we assume that W_i(t) is a random field defined by an α-stable distribution, then it can be expressed in terms of four parameters, where α ∈ (0, 2] denotes the stability parameter and μ ∈ (−∞, ∞), β ∈ [−1, 1] and σ ∈ [0, ∞) denote the location, skewness and scale parameters respectively. Chada et al. (2019) provided theoretical justifications to promote the use of these fields, as well as particular forms of discretization. Other related work has considered choosing the hyperparameters of the covariance function through non-Gaussian distributions.
We believe that the work in Chada et al. (2019) can potentially be used in the authors' context. This would enable the random fields to be modelled through generalized hyperbolic distributions, which may provide closed form expressions. It could also be done through the stochastic partial differential equation formulation, where further probabilistic understanding, such as Wiener's Tauberian theory, could be addressed. Finally, connections could be made through the various approaches to constructing Cauchy processes.

Christine P. Chai (Microsoft Corporation, Redmond)
I thank Asar, Bolin, Diggle and Wallin for improving the linear mixed effects model by adding a subject-specific continuous time stochastic process W_i(t), which is important because each subject reacts differently over time. The model validation and model selection section is short but excellently written! The section successfully explains why using a normal Q-Q-plot would be insufficient to identify the source of non-Gaussianity and points out that the random-effect component U_i would interfere with the Q-Q-plot results for Y_ij. It is a better approach to compare the prediction of each model component with the simulated data, which avoids the distortion from correlation over time.
Since the repeated measurement data are assumed to be continuous, my major question would be: how are missing data handled in the model? Was any interpolation or extrapolation used? For example, a patient may have measurements from time t 1 to t k except for t 3 and t 5 . Another patient may be missing t 2 and t k−1 . It would be really impressive if neither of the two data sets in the case-studies contains any missing data.
Lastly, I mention a minor issue: the normal-inverse Gaussian model has better predictive power than the Gaussian model, but this does not imply that it is 'correct'. As Box (1976) said, 'All models are wrong, but some are useful'.
(The opinions and views expressed here are those of the author and do not necessarily state or reflect those of Microsoft.)

Kuldeep Kumar (Bond University, Gold Coast)
I congratulate Asar, Bolin, Diggle and Wallin on proposing a novel method for analysing continuous repeated measurement outcomes collected longitudinally, and for enabling any combination of the stochastic components to be non-Gaussian. This kind of data is quite often encountered not only in the medical sciences but also in the social sciences, when repeated measurements of an outcome variable are made on subjects with follow-up over time, which may vary from subject to subject. The authors consider baseline covariates as well as longitudinal covariates and propose a methodology to estimate parameters for data that depart, e.g. through skewness, from the Gaussian distributional assumption. They propose two procedures for the estimation of parameters: stochastic gradient, and multiple-chain estimation. My trifling disappointment is that they did not deal with measurement error in longitudinal covariates, which is quite often encountered in social science data. In linear mixed effects models, regularization methods using penalized profile likelihood have been developed to select and estimate fixed and random effects. These methods avoid the stochastic error of variable selection in stepwise procedures and can successfully handle the high dimensional setting in both fixed and random effects, as in Fan and Li (2012). Such penalized profile likelihood methods could also be developed for the general specification of linear mixed effects models for repeated measurement data.
The following contributions were received in writing after the meeting.

Anna L. Choi and Tze Leung Lai (Stanford University)
This paper presents an innovative method to assess each of the three stochastic components independently for non-Gaussian distributions in a linear mixed model, when one or more of the components do not satisfy the Gaussian assumption. It uses an elegant and computationally efficient subsampling-based stochastic gradient algorithm in a novel implementation that meets the computational challenges of large data sets, together with the R package ngme. Two case-studies involving reanalysis of cystic fibrosis data and renal data are used to compare the Gaussian and non-Gaussian models, with the latter data set giving different results and showing the advantages of the proposed assessment or diagnosis method. The last paragraph of the discussion section mentions possible extension of the methodology and the software package to spatial and multivariate models, for both the longitudinal and the classical geostatistical settings. We have followed up on this lead in an ongoing project to forecast Covid-19 transmissions and have found it very useful.

Emrah Gecili and Rhonda Szczesniak (University of Cincinnati College of Medicine)
We applaud Asar, Bolin, Diggle and Wallin for their outstanding presentation during the discussion session and for highlighting important clinical research ramifications of their paper. Describing the time course of cystic fibrosis (CF) lung function was a compelling application, justifying the importance of introducing and relaxing assumptions on the process W_i(t_ij) and the measurement error Z_ij. By framing the prediction problem as a target function, one can use their model to form predictive probability distributions of rapid disease progression and translate probabilities for downstream informatics (Szczesniak et al., 2019). In Fig. 12, which shows an R Shiny web application, the target function varies from −10% to 0.5% predicted/year, reflecting the lung function trajectory and rate of change in panels A and B (see the delta threshold, circled; the predictive probability over time in panel C changes according to the selected target value). Patient covariate inputs are shown at the left-hand side and in panel D (www.predictfev1.com). Further bolstering the applied case for this method is its potential utility in biomarker-informed prediction modelling. As noted by Dr Kumar during the meeting, measurement error in longitudinal covariates was not addressed; however, these errors could be ameliorated (or at least shrunk) by using regularization. Implementing different types of regularization for this model could be facilitated in the Bayesian framework by using various shrinkage priors, thereby enabling variable selection to be performed simultaneously with the model proposed. This could provide a useful tool for researchers from different areas of applied science. Such penalized regression models can avoid the stochastic error of stepwise variable-selection procedures (mentioned by Dr Kumar) and easily handle the high dimensional variable-selection problems that are commonly observed in modern science.
Of course, advantages of using Bayesian regularized methods would come with the cost of increased computational time, especially for the cases where the sample size is large. Specific to the CF case, we modelled longitudinal lung function data by using an earlier (Gaussian) iteration of work by a subset of the authors  and incorporated Bayesian shrinkage methods to estimate unknown parameters simultaneously and to select important proteomic biomarkers of rapid disease progression. We identified a set of relevant clinical or demographic predictors and a proteomic biomarker for rapid decline of lung function. These biomarkers are moving forward past preclinical testing. Although these remarks stem from CF research, they demonstrate the potential for immediate applications and extensions of the authors' contribution to the analysis of medical monitoring data.

Kristin Kirchner (Delft University of Technology)
Asar, Bolin, Diggle and Wallin are to be congratulated for an inspiring contribution in the field of statistical modelling, presenting a methodology which combines the remarkable flexibility of non-Gaussian processes with computational efficiency.
Indeed, the stochastic partial differential equation approach with differential operators as in equation (5) has well-documented computational benefits in the Gaussian case. Recent developments in numerical methods for fractional operators furthermore facilitate treating the whole admissible parameter range φ > 0 while maintaining low computational complexity (Herrmann et al., 2020). Specifically, it is a well-known result that, in the non-fractional case D_1 = κ² − ∂²/∂t², inverses of the symmetric, elliptic differential operator D_1, when discretized by means of a finite element method with continuous piecewise linear basis functions on the mesh 0 = s_1 < … < s_K = t_max as used in equation (14), can be realized numerically with multilevel preconditioned iterative solvers in a complexity which is (essentially) linear in the number K of mesh nodes; see, for example, Xu (1992). Recently, it has been shown that, for a prescribed accuracy, this complexity can also be achieved in the fractional case (Herrmann et al., 2020). The numerical inversion of the operator D_2 = κ + ∂/∂t that is introduced in Section 3.2.1 and treated in a variational framework with a Petrov-Galerkin discretization in Appendix B, however, requires additional discussion owing to its asymmetry. The variational approach described in Appendix B results from multiplying the ordinary differential equation u̇(t) + κu(t) = f(t), subject to u(0) = 0 (or u(t_max) = 0) (32), with a test function v ∈ L²(0, t_max), integrating over (0, t_max), and employing the natural choice for a conforming Petrov-Galerkin discretization: continuous piecewise linear basis functions vanishing at 0 (or at t_max) as the discrete trial space, and piecewise constants as the discrete test space. It is worth noting that this variational approach is equivalent to the Crank-Nicolson scheme for the ordinary differential equation (32).
In the non-fractional case D_2 = κ + ∂/∂t, an efficient numerical inversion of D_2 can be realized with iterative solvers after (a) symmetrizing the problem and (b) appropriate preconditioning; see Andreev (2014). The matrix K, whose inverse appears in the conditional distribution (15), results from discretizing the differential operator D_1 or D_2. For the reasons above, multiplication by K⁻¹, which is needed, for example, for likelihood evaluations or predictions, can be approximated efficiently.
To the best of my knowledge, numerical methods for fractional powers of the asymmetric operator κ + ∂/∂t are rare in the literature so far. This paper is a promising incentive to investigate statistical properties and computational algorithms for the fractional asymmetric model in future research.
Jorge Mateu (University Jaume I, Castellón)
Asar, Bolin, Diggle and Wallin are to be congratulated on a valuable contribution and thought-provoking paper, setting the pace for the extension of linear mixed effects models to non-Gaussian repeated measures responses. They open this door by allowing any combination of the stochastic components of these linear models to be non-Gaussian. I would like to focus on the role of the stochastic process W_i(t) and how the current framework can be adapted to other types of data and problems. The fact that W_i arises as the solution of a stochastic differential equation, and that known models, such as the integrated random-walk model or the Matérn model, can be obtained as particular cases within this framework, brings direct connections to some extensions. One such extension is considering responses that are not continuous random variables, e.g. Poisson or negative binomial count variables in combination with non-Gaussian random effects. Bivariate counts as responses also define a crucial extension, which is necessary in many real problems. But arguably the main extension could be to spatial processes, going from the stochastic process W_i(t) to a random field. In the same way that, some time ago, techniques from longitudinal analysis were extended and adapted to the spatial framework (and I refer particularly to some work by Professor Diggle), the proposed (longitudinal) modelling framework can be extended to the spatial context. It is easy to imagine that the individual-based stochastic process W_i could live in the plane and be defined by a particular random field. We can then formulate the model by using the stochastic partial differential equation representation of Gaussian Matérn fields. We then have room to develop spatially varying non-Gaussian repeated measures, and we can delineate some interesting details.
For example, consider a non-stationary and/or anisotropic Gaussian random-field model that can capture these characteristics in the response variable. By working with the properties of the underlying random field, we can obtain extended versions of the proposed models. Another detail here is the concept of repeated measures in the response variable, which translates in the geostatistical context into the nugget variance effect. And I finish by noting that this would be another nice example of model-based geostatistics.
Regarding Table 7, some arguments for also including the logarithm of probability as a loss function, at least in the general case, are given in Dowe (2008), footnotes 175 and 176, Dowe (2011), section 3, and Dowe (2013), section 4.1, although the authors' results already seem quite clear.
Last, with current world circumstances, might the authors relate one of their models or a close variant to modelling Covid-19 coronavirus data?
The authors replied later, in writing, as follows.
We thank all the discussants for their interest in the paper and their helpful comments. We have arranged our reply under five broad topic headings, as follows.

Model interpretation
Among his many insightful comments, Daniel Farewell questions some aspects of our continuous time model formulation, including the existence (or not) of an instantaneous true state of nature and our formal separation of U_i and W_i(t_ij) in equation (2). From a strict mathematical perspective this separation is redundant, as is our separation of the two fixed effects terms a_i^T α and d_ij^T β in the same equation. We gently disagree and, risking a charge of pedantry, mildly regret not initially writing this equation in a form that emphasizes the dual distinction between time constant and time-varying effects, and between observed (fixed) and unobserved (random) effects; the reason we did not do so is that this would have superficially excluded some widely used special cases, notably the random intercept and slope model. If we admit the existence of measured variables in continuous time, we must surely admit the existence of unmeasured variables for which W_i(t) acts as a general proxy.
However, there can be multiple interpretations of the random effects in combination with certain stochastic processes. For example, the sum of a Brownian motion W_i(t), with initial value W_i(0) = 0, and a Gaussian random intercept U_i is equivalent to the same Brownian motion with initial value W_i(0) = U_i. Interpreting U_i as a subject's mean level is then incorrect, since the time average of a subject's realized Brownian motion will not be zero. An approach for handling such issues is to orthogonalize the random effects to obtain more easily interpretable results. One could, for example, add a sum-to-zero constraint on the Brownian motion to allow U_i to be interpreted as the mean. However, one should not do the same with the fixed effects, since this would incorrectly remove uncertainty (variance) from the estimate; see Sørbye et al. (2019) for a discussion of this in a spatial setting.
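The confounding described above is easy to demonstrate by simulation. The following sketch (illustrative code, not ngme output) moves each subject's realized Brownian motion mean into the intercept while centring the process: the fitted trajectories are unchanged, but the adjusted intercept is now interpretable as the subject's mean level.

```python
# Illustration (not ngme code): a Brownian motion started at W(0) = U is
# distributionally identical to a Brownian motion started at 0 plus a
# random intercept U, so the two components are confounded unless a
# constraint such as sum-to-zero is imposed.
import numpy as np

rng = np.random.default_rng(1)
n_subj, n_time, sigma_u = 2000, 50, 1.5

U = rng.normal(0, sigma_u, size=(n_subj, 1))                 # random intercepts
dW = rng.normal(0, np.sqrt(1 / n_time), size=(n_subj, n_time))
W = np.cumsum(dW, axis=1)                                    # Brownian motion
path = U + W                                                 # intercept + BM

# Sum-to-zero style adjustment: centre each subject's process so its time
# average is 0; the subject's mean level is then attributed to U alone.
W_centred = W - W.mean(axis=1, keepdims=True)
U_tilde = U + W.mean(axis=1, keepdims=True)
# The decomposition of each trajectory is unchanged by this reparametrization.
ok = np.allclose(U_tilde + W_centred, path)
```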
In the specific case of %FEV1, any time-varying effect must strictly be an integral over time of another stochastic process, implying a restriction on the legitimate class of covariance structures for W_i(t). Whether this restriction matters in practice is less clear, provided that the focus of attention is on parametric inference rather than on prediction of unusual events, as was the case in our other application.
Similarly, the 'measurement error' term Z_ij is a misnomer, in the sense that in the model fitting process it captures both literal measurement error and (unidentifiable) stochastic variation on timescales that are shorter than the intervals between successive measurements. In a designed study, this ambiguity could be resolved by varying the follow-up intervals, including repeated measurements at effectively the same time where feasible, and conducting a frequency domain analysis of the residuals from a fixed-effect-only model. This could in turn inform the design of future studies to ensure that the follow-up protocol captures important components of the overall variation. In the absence of genuinely independent replication at a single time point, such as splitting a blood sample and running duplicate assays, the term 'measurement error' might more accurately be called 'high frequency variation'.

Other model formulations
As Jorge Mateu points out, a natural extension of the models proposed is to spatial data. In fact, the ngme package already supports models for both spatial data (as in Wallin and Bolin (2015)) and spatial multivariate data. Mateu also suggests investigating whether non-stationary Gaussian random fields could be constructed to capture the features of these spatial models. Because of the use of normal variance-mean mixtures in the construction, one could think of the non-Gaussian models in a Bayesian context as non-stationary Gaussian fields with a specific prior for spatially varying mean and variance functions; see equation (15). This gives some intuition regarding the behaviour of the models but also suggests that it might be difficult to mimic the behaviour by using non-stationary Gaussian fields with smoothly varying parameters. Thus, non-stationary Gaussian models that could mimic the proposed non-Gaussian models would probably need to have a large number of parameters that may be difficult to estimate.
A related point to the spatial extension is the comment from Neil Chada and Ajay Jasra regarding α-stable processes. These are very natural when it comes to stochastic partial differential equation representations, given that the distribution of the random field is the same as that of the noise (since the distributions are closed under linear combinations). The main inferential issue is that, to our knowledge, the only distributions for which closed form expressions are available are the Gaussian and Cauchy families. This sets limits on parameter inference, on exploring posterior distributions and on latent variable formulations. It also explains why, in earlier work, we focused rather on closure under convolution to give two further distributional options (Wallin and Bolin, 2015). For interpolation, the Cauchy distribution's heavy tails can accommodate clear jumps in the trajectories. However, for extrapolation the inferred jumps may become unrealistically large.
Arnošt Komárek suggests spline-based models as an alternative to our stochastic process formulation and argues that this will result in fewer parameters and easier interpretation for a practitioner. We respectfully disagree on both points. Regarding the parameters, a spline with K basis functions has K random effects. If these are Gaussian, then unless we make a sparse covariance structure assumption this generates K(K + 1)/2 parameters, whereas in our formulations the random process has one or two parameters. Also, the representation of a spline depends on the spline basis as well as on the number and locations of the knots. The same could of course be said for our basis representation of the stochastic process. However, as the number of basis functions grows, the basis representation converges to a limiting process; hence there is no loss of interpretability. Regarding the interpretability for practitioners, we argue to the contrary that with sufficient data we can estimate properties of our stochastic process models, such as their differentiability, that help with their interpretability, whereas for splines these properties are tied to the spline basis, and hence not determined by the data.

Measurement error in longitudinal covariates
Kuldeep Kumar raises this very relevant question. Suppose that a stochastic covariate X(t) is being used to predict a stochastic outcome Y(t). Then, it is legitimate to condition on the observed value of X(t) and to make predictions on this basis, irrespective of the presence or absence of measurement error in X(t). However, it is well known that, if X(t) = X*(t) + Z*(t), where X*(t) is the error-free version of X(t), then ignoring the measurement error term Z*(t) leads to biased estimation of the association between X*(t) and Y(t). Within our general approach, the correct way to deal with this is to use a bivariate stochastic process model for X(t) and Y(t). However, the specific model formulation should depend on the scientific status of X(t). If X(t) and Y(t) are coevolving biomarkers of equal interest, a symmetric formulation might be reasonable. In contrast, if there is a causal pathway from X to the outcome of interest, Y, a more natural formulation, using the square bracket notation to mean 'the distribution of', might be of the form [X*][X|X*][Y|X*], with the likelihood following by integration with respect to the unobserved X*. Also, the association between X* and Y might need to be time lagged. Suppose, for example, that in the renal example of Section 6.2 we had data on longitudinal measurements of blood pressure, X(t). High blood pressure damages susceptible kidneys, which in turn leads to loss of kidney function. Hence, a candidate model might measure the effect of X* on Y as a weighted integral of the past trajectory of X*.

Computation
The numerical methods presented can probably be improved in various ways. We agree with Kristin Kirchner that incorporating state of the art numerical methods to sample the processes would reduce computation times, in particular for the spatial models that are implemented in the ngme package. We also encourage further investigations of numerical methods for models with fractional powers of asymmetric differential operators, which would be of direct use in our setting.
Regarding the use of stochastic optimization, we implemented several bespoke procedures because many of the standard procedures that we tested worked poorly for our applications. We performed initial tests with methods like Adam but ended up not using them because of stability problems. This may, however, be due to suboptimal implementations on our part, and we agree with Alex Stringer, Patrick Brown and Jamie Stafford that a more systematic comparison of different stochastic optimization methods for our, and other, statistical models would be very valuable.
Running parallel optimizations enables the construction of suitable stopping criteria. Also, multiple chains could be used to construct preconditioners that do not introduce any bias. As is well known, the use of the same samples for estimating the preconditioner and the gradient can introduce bias in the estimates. By running multiple chains we can avoid this, e.g. by using the preconditioner computed in the first chain when updating the value of the second chain and vice versa. Running multiple parallel chains does not increase the computation time by much (given enough cores for the computations) but does increase memory requirements. This is, however, a smaller concern for us than for typical machine learning problems involving much larger data sets.
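The cross-chain preconditioning idea can be sketched as follows: two stochastic gradient chains run in parallel, and each chain scales its update by a preconditioner estimated from the other chain's gradient history. The objective, step sizes and preconditioner form (a diagonal running squared-gradient estimate) are illustrative assumptions, not the ngme implementation.

```python
# Schematic cross-chain preconditioned stochastic gradient descent on a
# toy quadratic objective with noisy gradients (all choices illustrative).
import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([3.0, -1.0])

def noisy_grad(theta):
    # Gradient of 0.5 * ||theta - true_theta||^2 plus sampling noise.
    return (theta - true_theta) + rng.normal(0, 0.5, size=2)

theta = [np.zeros(2), np.zeros(2)]          # two parallel chains
sq_avg = [np.ones(2), np.ones(2)]           # running squared-gradient averages
lr, decay = 0.05, 0.99

for it in range(2000):
    grads = [noisy_grad(theta[c]) for c in (0, 1)]
    for c in (0, 1):
        sq_avg[c] = decay * sq_avg[c] + (1 - decay) * grads[c] ** 2
    for c in (0, 1):
        other = 1 - c                        # use the OTHER chain's preconditioner
        precond = 1.0 / np.sqrt(sq_avg[other] + 1e-8)
        theta[c] = theta[c] - lr * precond * grads[c]

estimate = 0.5 * (theta[0] + theta[1])      # combine the two chains
```

Because the preconditioner applied to each chain is statistically independent of that chain's current gradient sample, it does not induce the bias that arises when the same samples drive both quantities.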

Model selection strategies
Giuseppe Storti and Pietro Coretto suggest the practical approach of first fitting the Gaussian model to the whole data and inspecting the distributions of the predicted values of U, W and Z. This is in contrast with our approach, where we simulate data under a fitted model (with subsampling for the renal application and without subsampling for the cystic fibrosis (CF) example) and inspect the QQ-plots of predicted values based on observed and simulated data sets. Envelopes of QQ-plots would help to understand whether apparent departure from the assumed model was due to chance.
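A simulation envelope of the kind suggested can be sketched as follows: replicate data sets are simulated under the fitted model (here, purely for illustration, a Gaussian fitted by moments), and the observed order statistics are compared with pointwise envelopes of the simulated ones.

```python
# Sketch of a simulation envelope for a QQ-plot check (illustrative only;
# in practice the replicates would be simulated from the fitted mixed model).
import numpy as np

rng = np.random.default_rng(42)
observed = rng.normal(1.0, 2.0, size=200)     # stand-in for model residuals

mu, sd = observed.mean(), observed.std()      # 'fitted' Gaussian model
n_sim = 999
sims = np.sort(rng.normal(mu, sd, size=(n_sim, observed.size)), axis=1)

lower = np.quantile(sims, 0.025, axis=0)      # pointwise 95% envelope of
upper = np.quantile(sims, 0.975, axis=0)      # the simulated order statistics

obs_sorted = np.sort(observed)
inside = np.mean((obs_sorted >= lower) & (obs_sorted <= upper))
# When the assumed model is correct, the observed order statistics should
# fall mostly inside the envelope; systematic excursions flag misfit.
```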
Zihao Wen and David Dowe argue that different distributional assumptions could generate similar QQ-plots. Model selection criteria could be used in such scenarios, although there is always a limit to the ability of empirical procedures to discriminate between models. Also, model selection criteria measure the overall fit of an assumed model, whereas QQ-plots help us to assess the fit of individual terms. We note with interest their suggestion to apply cross-validation to our QQ-plot approach by simulating data for the validation set.
Giuseppe Storti and Pietro Coretto also raise the issue that the likelihood is not guaranteed to have a unique maximum. The use of multiple chains could help to detect the existence of multiple maxima and act as a warning that standard likelihood-based inferences may be unreliable. It is difficult to choose between different local maxima since we do not have access to the likelihood values. One approach for avoiding multiple maxima is to add a regularization term. As pointed out by Kumar, this also offers an approach to high dimensional variable selection.

Marked point process formulation, handling missing data
Daniel Farewell and Christine Chai both mention the issue of missing data, i.e. unrecorded measurements at prescheduled time points. In designed studies, missing values are self-evident. In observational studies the notion of a missing datum applies only if the data record that a measurement was made but not its value. Missing values are particularly problematic if they are informative, meaning that conditional on all observed data the fact that a measurement is missing and its (consequently unobserved) value are stochastically dependent. The literature on methods for dealing with missing values in longitudinal data is extensive; see, for example, Little (1995), Little and Rubin (2002) and Daniels and Hogan (2008).
Farewell also raised the related issue of informative follow-up times. In all our analyses, we assume that the timings of the repeated measures are either fixed by design, or otherwise stochastically independent of the measurement process. For observational data sets, e.g. data from electronic health records, this assumption might not hold. The behaviour of the renal data set that we analysed in our paper is suggestive of informative follow-up. For example, we see clusters of observations taken quite frequently: in some cases daily. The reason for this is not recorded in our data, but is quite likely to be related to a patient's current health status. Recognition of serial correlation in the measurement process alleviates, but does not entirely remove, the potentially adverse effects of this on the accuracy of our inferences. A possible model-based solution is to couple a measurement model with an intensity submodel, in other words a marked point process where the points are the follow-up times and the marks are the repeated measurements. The analogous problem in a spatial setting has been termed preferential sampling (Diggle et al., 2010). Contributions in a longitudinal setting include Lipsitz et al. (2002) and Lin et al. (2004).
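A marked point process of this kind is straightforward to simulate, which can be useful for studying the consequences of informative follow-up. The sketch below (an assumed model, not one fitted in the paper) generates visit times from an intensity that increases when a latent health process is low, using Lewis-Shedler thinning; the intensity form and all parameter values are hypothetical.

```python
# Illustrative simulation of informative follow-up times: an inhomogeneous
# Poisson process whose intensity depends on a latent health process W(t),
# generated by Lewis-Shedler thinning.
import numpy as np

rng = np.random.default_rng(7)
t_max, dt = 10.0, 0.01
grid = np.arange(0, t_max, dt)
W = np.cumsum(rng.normal(0, np.sqrt(dt), size=grid.size))  # latent health (BM)

def intensity(t):
    w = W[min(int(t / dt), grid.size - 1)]
    return 0.5 + 2.0 * np.exp(-w)           # lower W (sicker) -> more visits

lam_max = 0.5 + 2.0 * np.exp(-W.min())      # upper bound for thinning
times, t = [], 0.0
while True:
    t += rng.exponential(1.0 / lam_max)     # candidate from homogeneous process
    if t >= t_max:
        break
    if rng.uniform() < intensity(t) / lam_max:
        times.append(t)                     # accept with prob lambda(t)/lam_max
```

The marks (repeated measurements) would then be generated at the accepted times from the measurement submodel, with W shared between the two submodels to induce the informativeness.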

Applications
We are pleased that Anna Choi, Tze Leung Lai and Zihao Wen are finding our methods useful in their study of Covid-19 transmission. We look forward to publication of their results in due course and would welcome their, or any other users', suggestions on possible modifications to the ngme package.
Emrah Geili and Rhonda Szczesniak point out that the concept of individual and time-specific prediction is important for the CF application. As CF is a rare disease, patients with CF are followed quite closely; hence data quality is high and a large amount of data is available for each patient. For data sets with many repeated measures per patient, a time invariant random-effects assumption is unlikely to be suitable and the option to include a stochastic process component is, in our opinion, essential. In our application to the Danish CF data, we gave priority to estimation rather than prediction. We selected a model with normal inverse Gaussian distributions for all three stochastic terms. A model of this kind could provide an interesting extension to the Cincinnati Hospital case-study, and to the functionality of Geili and Szczesniak's very nice user interface.