Explicit connections between longitudinal data analysis and kernel machines

: Two areas of research – longitudinal data analysis and kernel machines – have large, but mostly distinct, literatures. This article shows explicitly that both ﬁelds have much in common with each other. In particular, many popular longitudinal data ﬁtting proceduresare special types of kernel machines. These connections have the potential to provide fruitful cross-fertilization between longitudinal data analytic and kernel machine methodology.


Introduction
Longitudinal data analysis is concerned with regression modelling and inference for data consisting of repeated measures on units, such as humans in a medical study.Since the seminal work of Harville [12] and Laird and Ware [17], linear mixed models have been the mainstay of longitudinal data analyses.The predominant distinguishing feature of linear mixed models, when compared with linear models, is the dichotomization of effects into fixed and random types.The fitting of fixed and random effects differ in that the latter is subject to a degree of shrinkage, dependent on the values of covariance parameters in the model.The concept of best linear unbiased prediction appealingly accommodates the handling of both types of effects (e.g.[26]).Expositions on longitudinal data analysis, including the role of linear mixed models, can be found in Diggle et al.
Kernel machines is a younger field that has most of its literature outside of Statistics.Essentially, kernel machines are flexible non-linear regression-type methods that use regularization to avoid overfitting.The name 'kernel' comes from the fact that the theory and methods of kernel machines is underpinned by reproducing kernel Hilbert space (RKHS) theory (e.g.[1,16]), although other frameworks such as Gaussian process theory (e.g.[25]) can also be used.Whilst kernel machines can handle general response variables, the majority of their literature is geared towards classification in which the response is categorical.In particular, support vector machines (e.g.[4,20]) are a sub-class of kernel machines and, in the 1990s, emerged as a powerful and elegant family of classifiers.The monograph of Schölkopf and Smola [29] and the web-site maintained by these authors, www.kernel-machines.org,are two portals to the large and expanding literature on kernel machines.
The main goal of this article is to expose the commonalities shared by longitudinal data analysis and kernel machines.In particular we show, explicitly, that many popular longitudinal fitting procedures are special types of kernel machines.There are at least two potential payoffs from such links: (a) the enrichment of longitudinal models to cope with non-linear predictor effects, and (b) adaptation of kernel machine classifiers to account for within-subject correlation when applied to longitudinal data.Sections 4.1.1-4.1.3give some details on (a).Section 5.2 contains an illustration of (b).
Some recent related work is Gianola, Fernando and Stella [10] and Liu, Lin and Ghosh [18], each of whom combine linear mixed models with kernel machines to analyze very high-dimensional genetic data-sets.However, neither of these papers deal with regular longitudinal data analysis models.James and Hastie [15] and Müller [21] are examples of articles concerned with classification when the data are longitudinal.
The connections between longitudinal data analysis and kernel machines are stronger in the case of continuous responses.A concise overview of continuous response longitudinal data analysis is given in Section 2. A summary of kernel machines and their RKHS substructure is given in Section 3. Section 4 forms the main body of the paper and gives an explicit case-by-case description of kernel machine representations of popular longitudinal data analytic models, as well as explaining some non-linear (kernel-based) extensions.Generalized response models and kernel machines are treated in Section 5. Concluding discussion is given in Section 6.

Continuous response longitudinal data analysis
In this section, and for following two sections, we suppose that the response variables are continuous, and without strong departures from normality.In this case, the main vehicle for longitudinal data analysis is the linear mixed model where, for a general random vector v, the notation v ∼ (µ, Σ) is shorthand for the mean vector E(v) equal to µ and the covariance matrix Cov(v) equal to Σ.
A common special case of ( 1) is The use of (1) for longitudinal data analysis dates back to Laird and Ware [17].Good summaries of estimation and prediction within this linear mixed model structure may be found in, for example, Robinson [26], Verbeke and Molenberghs [35], McCulloch et al. [19] and Ruppert, Wand and Carroll (Chapter 4) [28].We will just present the main results here.
For given covariance matrices G and R the theory of best linear unbiased prediction (BLUP) can be used to guide choice of β and u, and results in the criterion: This is minimized by where V = Cov(y) = ZGZ T + R. Expressions (4) are known as the BLUPs of β and u.
In practice, longitudinal data are fitted via the steps: 1. Estimation of G and R. Usually, these matrices are restricted to a parametrized class of covariances matrices.Most commonly, estimation is achieved via maximum likelihood, or restricted maximum likelihood (REML), under the normality assumption (2). 2. Substitution of the estimated covariance matrices into (4).The resulting estimators, β and u, are commonly known as estimated BLUPs, or EBLUPs for short.
The EBLUP phrase can be transferred to any linear function of β and u.Thus, A β + B u is the EBLUP of Aβ + Bu for any pair of matrices A and B for which the multiplications and summation are defined.
These two steps show a division into two types of estimation targets that arise in longitudinal data analysis: the covariance parameters in the G and R matrices, and the effects β and u.The strong connections between longitudinal data analysis and kernel machines occur at the EBLUP step for estimation of the fixed and random effects.For this reason, we will not dwell on the estimation of the covariance parameters.These will be taken as given when we revisit longitudinal data analysis in Sections 4 and 5.

Kernel machines
In the past decade or so kernel machines have emerged as an important methodology for classification and regression problems.A special sub-class of kernel machines is support vector machines where classifiers are constructed with respect to constraints on the margin of the classification boundary, and may be represented as a regularized optimization problem with respect to the hinge loss function (e.g.[4]).However, through the allowance of general loss functions, kernel machines are a much broader class of methods.Kernel machines with squared error loss (e.g.[33]) include popular statistical methods such as kriging (e.g.[5,32]), smoothing splines (e.g.[11,36]) and additive models (e.g.[13]).Zhu and Hastie [40] explored the use of binomial log-likelihood loss in the kernel machine framework and coined the term kernel logistic regression.Kernel logistic regression and support vector machines are both special cases of large margin kernel machines (e.g.[39]).Another prominent class of kernel machines is that corresponding to robust loss functions (e.g.[30]).Pearce and Wand [24] delineated connections between kernel machines and penalized spline methodology.
General kernel machines can be formulated in a number of ways.Among the most common are: optimization and projection within reproducing Hilbert spaces (e.g.[16]), maximum a posterior estimation in Gaussian processes (e.g.[25]) and Tikhonov regularization of ill-posed problems (e.g.[34]).Because of its prominence in the Statistics literature (e.g.[36]) we will use the first of these formulations in the remainder of the paper.Recent summaries of RKHS theory include Wahba [37] and Evgeniou, Pontil and Poggio [7].An early reference is Aronszajn [1].Relevant background material on Hilbert space projection theory may be found in, for example, Simmons [31] and Rudin [27].
In Section 3 of Pearce and Wand [24] we summarized the theory of RKHS for the purpose of explaining connections between kernel machines and penalized splines.That same section also provides useful background material for the current paper.For convenience, we reproduce the core aspects of it here.A RKHS on R d is a Hilbert space of real-valued functions that is generated by a bivariate symmetric, positive definite function K(s, t), s, t ∈ R d , called the kernel.The steps for RKHS construction from K are: 1. Determine the eigen-decomposition of K: K(s, t) = ∞ j=0 λ j φ j (s)φ j (t).This series is assumed to be well-defined (e.g.uniformly convergent).2. Define the space of real-valued functions on R d : 4. Complete H K by including limits.
The adjective 'reproducing' arises from the result This has important implications, and gives rise to the 'kernel trick' that we discuss shortly.
•) be a loss function and λ > 0 be a regularization parameter.The fit f within H K , with respect to L and λ, is the solution to The solution to (6) can be shown to be of the form f (x) = n i=1 c i K(x i , x) for some c i , 1 ≤ i ≤ n, depending on L, λ and the data.This result is known as the representer theorem of reproducing kernel Hilbert spaces.The 'kernel trick' is that we do not need to calculate the eigenfunctions, φ 0 , φ 1 , . .., in order to find the c i .Because of (5) we only need evaluations of the inner products of the form K(s, t).Popular kernels, particularly in machine learning contexts, include the pth degree polynomial kernel, K(s, t) = (1 + s T t) p , and the radial basis kernel, K(s, t) = exp −γ s − t 2 , for some γ > 0.
The λ f 2 HK term in (6) imposes a penalty on f.However, it is often desirable that certain functions in H K are unpenalized.The simplest example of this is as follows.Let H 0 be a subspace of H K for which penalization is not desired.Mathematically, this means that fits over H 0 are found by simply minimising n i=1 L(y i , f(x i )).Let H 1 = H ⊥ 0 be the orthogonal complement of H 0 and P 1 denote the linear operator corresponding to projection onto H 1 .Then with H 0 being the null space of P 1 , i.e., With respect to the null space H 0 , smoothing parameter λ, and loss function L, we define fits f according to It can also be shown ( [1]) that H 0 and H 1 are reproducing kernel Hilbert spaces in their own right, with kernels K 0 and K 1 satisfying K 0 + K 1 = K.

Explicit connections for continuous responses
In this section we show, explicitly, how longitudinal data analyses are connected intimately with kernel machine methodology.Indeed, all longitudinal data analyses that use EBLUPs are actually just fitting a special type of kernel machine.
To make these connections clear, and accessible to readers with a longitudinal data analysis background, we first treat some special cases of (1).We build up to fuller generality in the later subsections.

Random intercept model
The simple linear random intercept model is where (x ij , y ij ) is the jth predictor/response pair for subject i, and the ε ij ∼ (0, σ 2 ε ) are independent within-subject errors.The regression coefficients β 0 and β 1 are fixed effects, while the subject-specific intercepts U i ∼ (0, σ 2 u ) are independent random effects.
Given estimates σ 2 u and σ 2 ε of the variance components, the fitted line for subject i is where β 0 , β 1 and the U i are EBLUPs, as given by ( 4), with Figure 1 shows the EBLUPs for data on longitudinally recorded weights of 48 pigs (source: [6]), with σ 2 u and σ 2 ε estimated via REML.We now explain how (9) and the fitted lines in Figure 1 can be obtained as a solution to a RKHS optimization problem -thereby making them a special case of kernel machines.In the following discussion, we assume that the estimates of σ 2 u and σ 2 ε have been obtained (either via REML, or some other means) and are equal to σ 2 u and σ 2 ε , respectively.Let n = m i=1 n i and re-subscript the (x ij , y ij ) and ε ij sequentially; i.e. according to the map: (1, 1), . . ., (1, n 1 ), (2, 1), . . ., (2, n 2 ), . . ., (m, 1), This leads to the representation where Z ij is (i, j) entry of Z as given in (10) and is an indicator of (x i , y i ) being measurements for subject j (1 Next, form the RKHS of real-valued functions on R m+1 : with kernel Note that, while H K is defined on the whole of R m+1 , its members of interest in longitudinal data analysis are actually on: be a subspace of H K . Theorem 1.Let (x i , y i , Z i1 , . . ., Z im ), 1 ≤ i ≤ n, be a sequentially subscripted longitudinal data set.Consider the RKHS H K defined by ( 12) and ( 13) and subspace H β given by ( 14).Let P u be the projection operator onto H u ≡ H ⊥ β .Then the solution to the RKHS optimization problem with λ u ≡ σ 2 ε / σ 2 u corresponds to the observed EBLUPs of (9).Explicitly, the solution to (15) is . . .and f (x, 0, 0, . . ., 1) = where x ∈ R, β 0 , β 1 and the U i are given by ( 4) with G = σ 2 u I and R = σ 2 ε I and X and Z are given by (10).
Proof.Note that the inner product for H K (induced by K) is given by where y is the n × 1 vector containing the y i s, β = (β 0 , β 1 ) T and It follows immediately that where • denotes Euclidean norm.The orthogonal complement of H β is Then P u f = m j=1 U j z j and The RKHS optimization problem ( 15) is therefore equivalent to min which corresponds to EBLUP for the random intercept model.

Kernel-based extension to general mean curves
Note that the kernel for the simple linear random intercept model can be written as K((s 1 , . . ., s m+1 ), (t 1 , . . ., t m+1 )) = K β (s, t) + K u (s, t) where K u (s, t) ≡ m j=1 s 1+j t 1+j corresponds to the random intercept structure and K β (s, t) ≡ 1 + s 1 t 1 corresponds to the population mean structure.More general population mean structures can be obtained by replacement of K β (s, t) by K 0 (s, t) + K c (s, t) where, for all s, t ∈ R m+1 , for kernels K 0,1 (s 1 , t 1 ) and K c,1 (s 1 , t 1 ) defined on R 2 .The kernel K 0,1 corresponds to unpenalized functions and, typically, K 0,1 (s 1 , t 1 ) = 1.We can take K c,1 to be any positive definite function on R 2 such that its eigen-decomposition does not include functions in the RKHS generated by K 0,1 .Usually we would want K c,1 to have a rich eigen-decomposition so that non-linear mean structure can be well-handled.Examples include: where ω > 0 is a scale factor.Each of these kernels have infinite-length eigendecompositions and result in an infinite dimensional RKHS.The 'kernel trick' implies that fitting and representation only depends on evaluations of K c,1 .Let H c be the RKHS generated by K 0 and K c , respectively.Then is the RKHS generated by By the RKHS representer theorem, the solution is where c ≡ (c 1 , . . ., c n ) T and K = [K c (x i , x i ′ )] 1≤i,i ′ ≤n .Explicit solutions are a special case of (20) below.

Extension to additional linear predictors
Our final extension of the random intercept model involves the possible inclusion of additional predictors, assumed to have a linear effect on the mean of the response variable.Corresponding to each y i , 1 ≤ i ≤ n, let x ℓ i an p × 1 vector of such predictors.Then we should replace (16) by where each of these RKHSs are now on R m+p+1 and which reduces to min where X = [1 (x ℓ i ) T ] 1≤i≤n .Vector differential calculus, combined with some algebra, lead to the solutions: We now provide illustration of fits for this most general random intercept model for longitudinal data on spinal bone mineral density for 230 girls and young women (source: [2]).The subjects are categorized as belonging to one of the four ethnicity groups: Asian, Black, Hispanic and White.With double subscript notation, the model is where the y ij are spinal bone mineral measurements (g/cm 2 ), the x ℓ i contain indicators for ethnicity and the x ij are age measurements.The function c(•) indicates a curve corresponding to the kernel K c .We used the Gaussian kernel K c (s, t) = exp{−0.05(s−t) 2}.The values of λ c and λ u were chosen using REML.
Figure 2 shows the fitted mean curves for each ethnicity group.The mean age effect is clearly non-linear and is estimated well by the Gaussian kernel.
Figure 3 shows the fitted curves for those subjects that had four spinal bone mineral density measurements.Note that the fitted random effects U i are required for this display.The fits are very good for about half of the subjects.Some lack-of-fit is apparent for the other half.

Extension to multivariate kernels
We briefly mention one last extension: the replacement of c(x i ) by c(x i ) where the x i ∈ R d .This can be achieved by making K c a d-variate kernel as opposed to the univariate kernels treated so far in this section.The relevant RKHS is now on R m+p+d , but the optimization problems (18) and (19) Models of a similar type were recently considered by Liu, Lin and Ghosh [18].

Random intercept and slope model
Close inspection of Figure 1 shows that the parallel lines restriction imposed by the random intercept model is questionable.A more realistic model is one that allows each pig to have his/her own slope.This is achieved through the random intercept and slope model where, as with (8), ε i ∼ (0, σ 2 ε ) are independent, while allow for subject specific deviations in both intercept and slope from the mean line β 0 + β 1 x. Figure 4 shows an EBLUP fit of this model to the pig weights data, with the covariance matrix parameters estimated via REML.The extension to random slopes involves replacement of Z and G by in the BLUP equations (4).We will now re-write (22) in a canonical form that is amenable to RKHS representation.It involves the singular value decomposition (or spectral decomposition) of the random effects covariance matrix: where the eigenvalues d u and d v are given by The first normalized eigenvector component α takes the form otherwise.
The matrix is orthogonal: U U UU U U T = U U U T U U U = I.Even though U U U is symmetric in this case, we will write U U U T to allow comparison with more general results.The random intercept and slope model ( 22) can be written as where, for 1 are linear transformations of the random effects and predictors based on the U U U matrix.Note that Suppose that the covariance parameters are replaced by estimates: σ u , σ v , σ ε ρ and consider the EBLUPs based on (4).Let ) and α = α( σ u , σ v , ρ) be the estimates for the eigen-parameterization of the random effects covariance matrix.
We now describe RKHS representation of these EBLUPs.A first step is to switch from the double subscripting of longitudinal data analysis to single subscripting via (11).The single subscript version of the random intercept and slope model ( 22) is where, as before, Z ij is (i, j) entry of Z as given in (10).Form the RKHS of real-valued functions on R 2m+1 : with kernel where Let H β , H u and H v be the RKHSs generated by K β , K u and K v , respectively, so that Theorem 2. Let (x i , y i , Z i1 , . . ., Z im ), 1 ≤ i ≤ n, be a sequentially subscripted longitudinal data set as in Theorem 1.For ) is based on covariance parameter estimates appropriate for the random intercept and slope model ( 22) and (23).Consider the RKHS H K defined by ( 27) and ( 28) and subspaces H β , H u and H v be generated by (29).Let P u and P v be, respectively, the projection operators onto H u and H v .Then the solution to the RKHS optimization problem with λ u = σ 2 ε / d u and λ v = σ 2 ε / d v , corresponds to the EBLUPs (26).Explicitly, the solution to (30) is where x ∈ R and β 0 , β 1 and the U i , V i are given by ( 4) with X given by ( 10), Z given by ( 24), R = σ 2 ε I and G given by ( 24) with (σ u , σ v , ρ) = ( σ u , σ v , ρ), and 0 r denotes the string 0,0,. . .,0 of length r.
Proof.The inner product for H K (induced by K) is given by be a typical member of H K .Then, using a reversal of the argument presented at (25), T and Z, given by ( 24), are the random effects vector and corresponding design matrix for the random intercept and slope model.(In the definition of Z at (24), double subscripting is being used on the xs for clarity reasons.)Finally is the estimated covariance matrix of u.The criterion in the RKHS optimization problem ( 30) is therefore equivalent to u, the EBLUP criterion for the random intercept and slope model.

Kernel-based extension to general mean curves
As in Section 4.1.1we can extend the random intercept and slope model to allow for non-linear mean structure by introducing a kernel.Here we will consider the extension of (22): In this model β 0 +c(•) is the smooth overall function.Subject specific deviations in both intercept and slope are allowed.The relevant RKHS is where H 0 and H c are as defined in Section 4.1.1,whilst H u and H v carry the definitions ascribed to them in the lead up to Theorem 2. The RKHS minimization problem is similar to that given by (30) with the addition of a penalty term λ c P c f 2 HK .The minimization problem reduces to min and G = I m ⊗ Σ for some 2 × 2 symmetric positive definite matrix Σ.Here ⊗ denotes Kronecker product.The solution is

Extension to general random effects structure
The general form of the Xβ + Zu, u ∼ (0, G), structure for parametric longitudinal data analysis has Here X F i is an n i × p matrix corresponding to the ith subject's fixed effects contribution (X F i β) , X R i is an n i ×q matrix and u i is a q ×1 random effects vector corresponding the ith subject's contribution (Z R i u i ) and Σ is an unstructured q × q covariance matrix satisfying Cov(u Let Σ = U U U diag(d)U U U T be the spectral decomposition of Σ, where are the eigenvalues of Σ and U U U is a q × q orthogonal matrix of normalized eigenvectors.Set so that the model has canonical form The BLUPs for β and u minimize where Theorems 1 and 2 can be generalized to the situation where BLUP corresponds to the solution of a RKHS optimization problem.The relevant RKHS, H K , consists of real-valued functions on R p+mq with kernel K(s, t) = K((s 1 , . . ., s p+mq ), (t 1 , . . ., t p+mq )) = 1 + s T t.
Sub-spaces of interest are those generated by We denote these by H F , H 1 , . . ., H q .Then we have . ., Z im ) be the ith row of Z and x i be the ith row of x.Then the BLUPs given by ( 32) correspond to the RKHS optimization problem

HK
where, for 1 ≤ k ≤ q, λ k ≡ σ 2 ε /d k and P k is the projection operator onto H k .

Correlated errors
Each of the longitudinal models considered so far have R = Cov(ε) = σ 2 ε I.However, in longitudinal data analysis it is common to allow more general structure in the R matrix.An example is the random intercept model with first-order autoregressive (AR(1)) errors: Longitudinal data analysis models such as these do not fit as comfortably into the RKHS framework.However, if the first term of the BLUP criterion ( 4) is written as then it is apparent that the RKHS theory with L(s, t) = (s−t) 2 applies provided we work with the transformed data vector y R ≡ R −1/2 y and corresponding transformation of the predictor structure.

Alternative loss functions
So far in this section we have only considered squared error loss L(s, t) = (s − t) 2 .However, the connections between longitudinal data analysis and kernel machines remain if other continuous response loss functions are used instead.For example, to counteract the influence of outliers in the response data it is common to work with a different loss such as absolute error L(s, t) = |s − t| or those arising in M-estimation (e.g.[14]): where δ > 0 is the 'bending' parameter.Another class robust of loss functions is that which arises from modelling the responses as having a t-distribution with ν degrees of freedom and scale parameter σ > 0: Yet another alternative loss function is L(s, t; ε) = (|s − t| − ε) + , for a fixed ε > 0. It is known as ε-insensitive loss and ignores errors of size less than ε.Finally, we mention the possibility of non-convex loss functions, such as those described in Shen, Tseng, Zhang and Wong [30].

Generalized response extension
Many longitudinal studies have a non-continuous response; such as count or binary variable.In such circumstances the linear mixed model ( 1) is not appropriate and alternative approaches are required.The most common involve generalized linear mixed models (GLMM) and generalized estimating equations (GEE).
In this section we describe explicit connections between kernel machines and the popular penalized quasi-likelihood (PQL) methodology for fitting GLMMs to generalized response longitudinal data.The main message is that the RKHSs developed in Section 4 for longitudinal data analysis all apply to the generalized response situation.Only the loss functions require modification.
To keep the notation simple, we will work with GLMMs confined to the canonical one-parameter exponential family framework: where f(y|u) denotes the conditional distribution of y given u and b and c depend upon the family member.The most common examples are Bernoulli, with b(x) = log(1 + e x ), c(x) = 0, and Poisson with b(x) = e x , c(x) = − log(x!).The matrices in the linear predictor η ≡ Xβ +Zu, as well as G, have definition and structure identical to those in the continuous response situation described in Sections 2 and 4. The simplest example is the generalized response random intercept model with U i are independent (0, σ 2 u ), 1 ≤ i ≤ m, which corresponds to (33) with X and Z as in (10) and G = σ 2 u I.A common approach to fitting GLMMs is maximum likelihood for (β, G) and best prediction for u under the normality assumption u ∼ N (0, G).However this requires numerical integration techniques and, especially if the integrals are multi-dimensional, approximations are used instead.The most common of these is PQL (e.g.[3]).However, we will not treat quasi-likelihoods here, so the label penalized likelihood (PL) is appropriate.For (33) with u ∼ N (0, G) and G known this involves maximization of the penalized likelihood to obtain the estimates β PL and u PL .We now show that the penalized likelihood (35) can be treated as an RKHS optimization problem.Hence, obtaining β PL and u PL for a given G involves a particular kernel machine.Again, with simplicity in mind, we give the full explanation for the random intercept model (34).The general case follows via the linear algebraic arguments and structures given in Sections 4.2 and 4.3.
Re-subscript the (x ij , y ij ) sequentially (as in Section 4) and, as before, let Z ij be the (i, j) entry of the matrix Z defined at (10).Then (34) is Let H K , K and H β be defined by ( 12), ( 13) and ( 14) respectively.Then penalized likelihood estimation of β and u is equivalent to the RKHS optimization problem where P u is the projection operator onto H u = H ⊥ β , λ = 1/σ 2 u and the loss function is given by L(s, t) = −2{st − b(t)}.For example, L(s, t) = −2{st + log(1 + e t )}, in the Bernoulli case, −2(st + e t ), in the Poisson case.

Kernel extension
which may be solved via Newton-Raphson iteration.

Alternative loss functions
Other loss functions, appropriate for the type of generalized response at hand, may be considered.For example, in the binary response case the Bernoulli loglikelihood loss could be replaced by hinge loss: which corresponds to support vector machine classification (e.g.[4]).Alternative large margin loss functions could also be considered, such as L(s, t) = [{1 − (2s − 1)t} + ] q , q > 1 and L(s, t) = {ρ − (2s − 1)t} + , ρ > 0.
The definition of the class of large margin loss functions, and a fuller list of examples, is given in Section 2 of Wang and Shen [39].We performed a small comparison of Bernoulli log-likelihood loss and hinge loss for a binary response longitudinal data set.The data involves longitudinal measurements on Indonesian children (source: [6]).The response variable is an indicator of presence of respiratory infection.The comparison between the two loss functions is cleaner if the response is coded as ±1, so we let 1 if child i has respiratory infection at time of measurement j, −1 otherwise and x ij denote the age, in years, corresponding to y ij .Then, with (x i , y i ), 1 ≤ i ≤ n, denoting the sequentially subscripted (x ij , y ij ) we considered two versions of min For H c we used a low-rank kernel based on a set of k canonical O'Sullivan spline basis functions {z j (•) : 1 ≤ j ≤ k} and corresponding to equation ( 6) of Wand and Ormerod [38].In the case of hinge loss, such a kernel gives rise to a low-rank quadratic programming problem.This enabled us to use the methodology described in Ormerod, Wand and Koch [23] and the corresponding R package LowRankQP ( [22]).
In the original data, there are many more values with y ij = −1 than with y ij = 1 and a weighted version of hinge loss is appropriate.So that we could use ordinary hinge loss (38) for illustration we worked with a sub-sample of the y ij = −1 data so the sample sizes for each type are equal (107 each).The regularization parameter for the random subject effect was obtained via penalized quasi-likelihood to be λ u = 3.6.We then varied the regularization parameter for the K c component over the values λ c ∈ {10, 1, 0.1, 0.01}.The fitted functions: Even though the Indonesian children respiratory study was concerned with determination of risk factors, rather than classification, it is useful to suppose that the latter is the case when viewing Figure 5.Under this scenario, the f c (•; λ c ) are discriminants for classification of having respiratory infection or not based on age.Hinge loss is seen to be less wiggly and more decisive on either side of the classification boundaries.
Both discriminants possess the rare property of taking account of the longitudinal nature of the data, through the presence of the λ u P u f 2 HK term in the fitting process.While the classification rules themselves are minimally affected by this additional regularization, precision estimates are likely to change.Since precision estimates require additional probabilistic structure we do not do such comparison here.

Discussion
In this article we have shown that two ostensibly different areas of researchlongitudinal data analysis and kernel machines -are, in fact, very similar in their underlying mathematics.It is anticipated that the explicit connections that have been established here will facilitate a more fluid exchange of ideas between the two fields.For longitudinal data analysis, there is the possibility of using kernel machines to better deal with non-linearity and to develop improved classification procedures.From the kernel machine perspective, we envisage kernel methodology that is tailored to longitudinal data models and accounts for complications such as within-unit correlation.

Fig 1 .
Fig 1.The EBLUP-fitted lines to the pig-weights data for the simple linear random intercept model.The panels are ordered according to the size of the 48 pigs.

for coefficients c 1
, . . ., c n .Via arguments similar to those used in the proof of Theorem 1, (17) reduces to the matrix algebraic problem, min β0,c,u corresponds to the fixed effects.The RKHS minimization problem is now of the form min f∈HK n i=1

Fig 2 .
Fig 2. Kernel-based fit to the spinal bone mineral density data.

Fig 3 .
Fig 3. Fitted subject-specific curves corresponding to model (21) for 25 randomly chosen subjects among those with 4 measurements each.

Fig 4 .
Fig 4. The EBLUP-fitted lines to the pig-weights data for the simple linear random intercept and slope model.The panels are ordered according to the size of the 48 pigs.

cFig 5 .
Fig 5.  Results of fitting(37) for both Bernoulli log-likelihood loss and hinge loss, with 4 different values of the regularization parameter λc corresponding to the spline kernel machine age effect.If viewed as a classification problem then the curves correspond to discriminants.The longitudinal data are jittered to enhance visualization.
the linear operator corresponding to projection onto H c and let P u be defined similarly for H u .Then a mean curve, with random intercept shifts, can be fitted via the