Appropriate covariance-specification via penalties for penalized splines in mixed models for longitudinal data

A popular approach to smooth models for longitudinal data is to express the model as a mixed model, since this often leads to immediate model fitting with standard procedures. This approach is particularly appealing when truncated polynomials are used as a basis for the smoothing, as the mixed model representation is almost immediate. We show that this approach can lead to a severely biased estimate of the overall population effect and to confidence intervals with undesirable properties. We use penalization to investigate an alternative approach with either B-spline or truncated polynomial bases and show that this new approach does not suffer from the same defects. Our models are defined in terms of B-splines or truncated polynomials with appropriate penalties, but can be expressed as mixed models; this also gives access to fitting with standard procedures. We illustrate our methods with an analysis of two data sets: (a) a balanced data set on Canadian weather and (b) an unbalanced data set on the growth of children. AMS 2000 subject classifications: Primary 62G08; secondary 62J07.


Introduction
Mixed effects models are a powerful tool for analyzing longitudinal or more generally grouped data. In its general form, a mixed model consists of expressing some linear predictor as a sum of two components: (a) the fixed effect, originally interpreted as the population/overall effect; and (b) the random effects, which result from the units drawn at random from the population. Searle et al. [20] investigate the basic concepts and theoretical aspects of mixed models, while Pinheiro & Bates [13] mainly look at computational issues. A key assumption for a mixed model is the structure of the covariance matrix of the random effects since its specification has important fitting and inferential consequences.
Smoothing methods are known for their flexibility in describing complex patterns, and their connection with mixed models has been of interest to many authors. As a result, modelling longitudinal data with smooth curves has gained much attention and become an area of intensive research. Early work in this area is based on the mixed model representation of smoothing splines (see Brumback & Rice [1], Verbyla et al. [22] among others); however, smoothing splines can be computationally intensive. Alternatively, low rank smoothing methods have been expressed as mixed models. With truncated polynomial bases, a mixed model representation is almost immediate from the form of the basis and the penalty function; Ruppert et al. [18] is a comprehensive reference for this approach. Eilers [7], in the discussion of the Verbyla et al. [22] paper, pointed out that the P-spline system (Eilers & Marx [8]) with B-spline bases could also be expressed in the mixed model framework. Wood (p191) [24] used the singular value decomposition of the penalty matrix to express a smoothing model in a Bayesian way; Currie et al. [3] also used the singular value decomposition to give a mixed model representation of P-splines.
Nonetheless, the representation of smoothing models as mixed models is not without controversy. Green [9] commented on the Verbyla et al. [22] paper as follows: "Formulating spline smoothing as a mixed model is simply a mathematical device; the suggested logical distinction between the fixed linear trend and the random smooth variation is artificial". Green's point is that the randomness in the mixed model representation of smoothers is not assigned to units in a clear way as in the mixed models described in Searle et al. (chap 1) [20]. Thus, smoothers usually have a "mixed model representation" but not a "mixed model interpretation" in the original sense. Nowadays, one motivation behind the insertion of smoothing into the mixed model framework is the availability of computer packages for mixed models; two typical examples are the functions lme and nlme in the nlme library for R (R Development Core Team [15]), which are designed to fit and compare Gaussian linear and nonlinear mixed models (Pinheiro et al. [14]).
In practice, however, the real effect of the bases and the covariance structure on the estimated effects in low rank smoothing methods for longitudinal data has received very little attention. In this paper we will consider two commonly used bases: the truncated polynomial bases (Ruppert et al. [18]) and the B-spline bases (Eilers & Marx [8]). We discuss first the unfortunate consequences that can occur when we use the standard mixed model with truncated lines for longitudinal data; and second, we discuss the resolution of these problems with appropriate penalties, whether truncated polynomial or B-spline bases are used. Our aim is to present a smooth mixed model for longitudinal data with a natural, ie, non-arbitrary, covariance structure and an immediately interpretable fixed effect; this covariance structure is derived from the penalties used to design the model.
The plan of the paper is as follows. Section 2 presents the standard mixed model approach to longitudinal data using truncated lines for balanced data, by which we mean that the same number of observations are made on each unit at the same time points. We encounter some difficulties with this approach and use this to motivate a penalty approach which we examine in section 3. Section 4 presents two ways of fitting the model: (a) with the penalized residual sum of squares, and (b) with the mixed model representation of the model. Confidence band calculation is described in section 5 and an extension to unbalanced data is presented in section 6. We close with some concluding remarks in section 7.

Standard penalized splines mixed model for longitudinal data
In this section, we describe some aspects of a standard penalized spline mixed model for longitudinal data, and motivate the need for an alternative approach. For simplicity, we start with balanced data and so assume that we have longitudinal data Y = (Y_{i,j}), 1 ≤ i ≤ n_1, 1 ≤ j ≤ n_2, stored in the form of a matrix in such a way that columns are classified by subjects (j) and rows by time (t_i); that is, the column data Y_{•,j} are repeated measurements of the response variable Y on the j-th unit at the time points t = (t_1, . . . , t_{n_1})′. A typical example is the well-known pig.weights data set available in the library SemiPar in R. This data set presents the weight measurements on n_2 = 48 pigs (subjects) over a period of n_1 = 9 weeks (time); an overview of these data is shown in the left panel in Figure 1. The global effect looks linear even though the individual subject lines are quite variable, and so it makes sense to consider models of the form

Y_{i,j} = δ_0 + δ_1 t_i + δ̃_{j,0} + δ̃_{j,1} t_i + ε_{i,j},   (2.1)

where δ_0 + δ_1 t describes the linear population/overall effect, δ̃_{j,0} + δ̃_{j,1} t measures the deviations/departures of the j-th subject/pig from the overall effect, and ε_{i,j} represents the noise. Clearly, we are not interested only in the specific 48 pigs involved in this study. The main motivation of mixed models is to enable our inference from (2.1) to apply to some population of pigs, and mixed models provide an attractive solution to this problem. We suppose that our sample of pigs is drawn at random from some population of pigs and that the impact of this randomness on model (2.1) is that the subject effects u_j = (δ̃_{j,0}, δ̃_{j,1})′ are themselves random. A common specification of this randomness is that the u_j are generated from two-dimensional normal distributions with zero means.
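As a quick illustration of this setup, the sketch below simulates balanced data from a linear model with random intercepts and slopes and arranges it as the n_1 × n_2 matrix Y described above. All numerical values, and the use of Python/numpy rather than the R tools used in the paper, are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 9, 48                            # time points (weeks) and subjects (pigs)
t = np.arange(1.0, n1 + 1)

delta = np.array([19.0, 6.2])             # population intercept and slope (illustrative)
Psi = np.array([[6.0, 0.4],
                [0.4, 0.2]])              # 2 x 2 covariance of the random effects
sigma = 1.2                               # residual standard deviation

U = rng.multivariate_normal(np.zeros(2), Psi, size=n2)   # rows are u_j = (d_j0, d_j1)
X = np.column_stack([np.ones(n1), t])                    # X = [1 : t]

# Columns of Y are the subject profiles: X*delta + X*u_j + noise
Y = X @ delta[:, None] + X @ U.T + sigma * rng.standard_normal((n1, n2))
print(Y.shape)   # (9, 48): rows indexed by time, columns by subject
```

Stacking the columns of Y recovers the vec/Kronecker form of the model used below.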
An important point is that this normal assumption solves the problem of non-identifiability of model (2.1); this same point will arise in section 3, when we will see that penalties provide an alternative solution to the identifiability problem. Under the assumption of normality and homoskedasticity, model (2.1) can be written

Y_{•,j} = Xδ + Xu_j + ε_{•,j},  u_j ∼ N(0, Ψ),  ε_{•,j} ∼ N(0, σ² I_{n_1}),

where X = [1_{n_1} : t], δ = (δ_0, δ_1)′, u_j = (δ̃_{j,0}, δ̃_{j,1})′, I_n is the n × n identity matrix, 1_n is the vector of ones of length n, and Ψ is a 2 × 2 symmetric, positive definite matrix. This leads to the standard mixed model representation

vec(Y) = (1_{n_2} ⊗ X)δ + (I_{n_2} ⊗ X)u + ε.

Here, ⊗ represents the Kronecker product and vec(·) is the operator which stacks the elements of matrices/vectors into a single vector. In the literature, δ is known as the fixed effect and u as the random effect. The right panel in Figure 1 illustrates the estimated overall line fitted with the R function lme. Sub-models of (2.1) for the pig.weights data have been investigated by many authors; Ruppert et al. [18] among others implemented the case δ̃_{j,1} = 0, meaning that the subject departures from the overall effect are parallel. Such sub-models can be tested against model (2.1) to investigate the significance of this parallelism. However, this sort of test needs to be treated with care since the null hypothesis, H_0 : Ψ_{1,2} = Ψ_{2,2} = 0, specifies that the non-negative Ψ_{2,2} is zero, and so sits on the boundary of the parameter space; see Self & Liang [21], Ruppert et al. (chap 4) [18]. We have assumed that both the overall and the subject effects can be captured linearly; this assumption suits the pigs data well. However, this assumption is not tenable in general. Consider for example the left panel in Figure 2, which shows the daily average temperature (the averages are taken over the period 1960-1994) in 35 Canadian cities/subjects; these data can be found in the list CanadianWeather available in the library fda in R.
Clearly, the linear assumption (at least for the overall effect) fails and more flexibility is required to model the observed effects. In order to account for flexibility in such situations, both linear components in (2.1), ie, the population and the subject effects, are often extended using truncated lines as follows:

Y_{i,j} = δ_0 + δ_1 t_i + Σ_{r=1}^{q} ξ_r (t_i − τ_r)_+ + δ̃_{j,0} + δ̃_{j,1} t_i + Σ_{k=1}^{q̃} ξ̃_{j,k} (t_i − τ̃_k)_+ + ε_{i,j}.   (2.2)

To be precise, let ∆ = (t_{n_1} − t_1)/(q + 1) and ∆̃ = (t_{n_1} − t_1)/(q̃ + 1) be the distances between the knots at the population and subject levels; then the τ_r and the τ̃_k are defined by τ_r = t_1 + r∆, r = 1, . . . , q, and τ̃_k = t_1 + k∆̃, k = 1, . . . , q̃.
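The equi-spaced knot placement and the resulting truncated-line basis (t − τ_r)_+ can be sketched in a few lines of numpy; the function name and the time grid below are our own illustrative choices:

```python
import numpy as np

def truncated_lines(t, q):
    """Equi-spaced interior knots tau_r = t_1 + r*Delta and basis (t - tau_r)_+."""
    t = np.asarray(t, dtype=float)
    delta = (t[-1] - t[0]) / (q + 1)               # Delta = (t_{n1} - t_1)/(q + 1)
    tau = t[0] + delta * np.arange(1, q + 1)       # knots tau_1, ..., tau_q
    T = np.maximum(t[:, None] - tau[None, :], 0.0) # column r is (t - tau_r)_+
    return tau, T

t = np.linspace(0.0, 1.0, 101)
tau, T = truncated_lines(t, q=39)
print(T.shape)   # (101, 39)
```

The same function with a different q gives the subject-level knots τ̃_k and basis.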
Ruppert et al. (sect 9.3) [18] is an early reference to such subject-specific curves, although these authors reported there that "more work needs to be done on implementation". Model (2.2) can be expressed compactly in matrix form (2.3), with the standard covariance specification

ξ ∼ N(0, σ²_P I_q),  u_j ∼ N(0, Σ),  ξ̃_j ∼ N(0, σ²_S I_{q̃}).   (2.4)

In (2.4), σ²_P is the variance parameter driving the smoothness of the overall effect, Σ is a 2 × 2 symmetric, positive definite matrix, and σ²_S is the variance parameter driving the smoothness/randomness at the subject level.
The investigation of the covariance structure (2.4) in terms of modelling effects has not been of great concern. When we are dealing with smoothing at a single level, truncated lines with a ridge penalty produce satisfactory results; with smoothing at two levels, (ie, population and subject levels), the standard covariance specification (2.4) is problematic. Here we illustrate some of its unfortunate consequences.
We use the lme function as described in Durban et al. [5] to fit this model to CanadianWeather. The output of the lme function not only gives the estimates for the population and subjects effects, but also provides estimates for the variance parameters in (2.4), which we use to compute the bias corrected confidence intervals (see Ruppert et al. (sect 6.4) [18]) for the population and subjects effects. To illustrate our point, we first consider two knot-scenarios at the subject level. Guided by Ruppert [17], we use q = 39 equi-spaced knots τ at the population level in both scenarios.
The right panel in Figure 2 shows the fitted cities (obtained by adding the estimated population effect to the city effects) for both scenarios. As we can see from this graphic, the fits from both scenarios are almost identical and they look very satisfactory with regard to the data. One may be tempted to argue that this goodness of fit at the subject level induces a satisfactory fit at the population level. However, Figure 3 shows the fitted population effect for the two scenarios; we confirm the two observations of Heckman et al.:
• the fitted population effect is severely biased; and
• the confidence bands exhibit a widening fan effect as we move from left to right.
Further, for a third scenario (not shown) with q̃ = 20, we observe upward bias, the opposite of that observed with q̃ = 21; for a fourth scenario, with q = q̃ = 39, we observe both severe bias and widening of the confidence interval. In all these scenarios, the behaviour (of the fitted population effect) is balanced by similar behaviour of the subject effects, in such a way that the global effect is recovered appropriately, as illustrated in the right panel of Figure 2. The reason for such behaviour is the mis-specification of the covariance structure. There are two main reasons for the choice in (2.4): first, the ridge penalty on a truncated lines basis works well when smoothing is at a single level and second, the simplicity of (2.4) is attractive; however, it does not appear capable of capturing appropriately the overall effect observed in the left panel in Figure 2.
We also remark that if the truncated lines run from right-to-left, ie, with slope −1, as opposed to left-to-right, ie, with slope 1 as in (2.2), then the bias and the fanning effect in Figure 3 are reversed. We refer to these bases as the forward (slope 1) and backward (slope −1) bases. One possibility is to use a full covariance matrix in (2.4) in place of σ²_S I_{q̃}. This approach is not attractive since it has no obvious interpretation; it is also computationally very intensive. Thus, we are faced with one of the common challenges in mixed models: the appropriate specification of the covariance structure of the random effect.
In the remainder of this paper, we do not rely on specifying the covariance structure directly; our plan is to work with appropriate penalties. The advantage of this approach is that we can discuss the modelling effects which we wish the penalties to have; furthermore, we show how the penalty framework can be reformulated as a mixed model, and then the covariance structure follows naturally from the penalty structure and the bases. The supplementary materials contain R-code to reproduce Figures 2 and 3.

Penalty approach
We consider the general structure

Y_{i,j} = S(t_i) + S_j(t_i) + ε_{i,j},   (3.1)

for some functions S(·) and S_j(·) which quantify the population/overall effect and the deviations/departures of the j-th unit from the population effect respectively. We view S(·) as a smooth function (assigned to the population effect) and the S_j(·) as random smooth functions (assigned to the cities). At this stage, we do not make any distributional assumptions as in (2.4); these will come naturally out of our approach. In this section, we present different approaches for modelling S(·) and S_j(·) and propose associated penalties for appropriate identification of these two components.

Penalties on B-spline bases
Here, we use B-spline bases to construct S(·) and S_j(·); we start with B-spline bases because our approach with B-splines will motivate our solution with truncated polynomials. In brief, with B-splines, we will place separate penalties directly on the B-spline coefficients at the subject level, one to bring about smoothness (a difference penalty) and another to achieve identifiability by shrinkage (a ridge penalty). With truncated polynomials, it seems more difficult to achieve identifiability by direct shrinkage of the coefficients, as we have seen with the results of (2.2) in Figure 3. Indeed, with truncated polynomial bases we shall see in the next subsection that one way to achieve both smoothness and identifiability is to introduce a second penalty, as we do with B-spline bases. Two additional reasons for starting with B-splines are: first, the degree of the B-splines and the order of the penalty can be chosen independently; this gives additional flexibility to the modeller. Second, B-splines have good numerical properties. Hence, if we denote by S(t) the element-wise action of S(·) on the time vector t = (t_1, . . . , t_{n_1})′, with a similar meaning for S_j(t), then we write

S(t) = Ba,  S_j(t) = B̃ã_j,   (3.2)

where B, n_1 × c, and B̃, n_1 × c̃, are regression matrices of B-splines evaluated along t, a is a vector of coefficients specifying the population effect, and the ã_j are random vectors of coefficients related to the subjects. We will refer to (3.1) and (3.2) as model M1 = M1(B, B̃). Note that M1 is not identifiable; indeed, if we add (for example) a constant to S(·) and subtract the same constant from the S_j(·), then the predictor S(·) + S_j(·) remains unchanged. Thus, two issues need to be clarified in M1: smoothness and identifiability.
In the context of nested curves, Brumback & Rice [1] achieved the smoothness via the smoothing spline approach (which can be very time consuming, specifically in the presence of a large data set); from the mixed model representation, they suggested using ANOVA-like identifiability constraints by requiring that the fixed effects sum to zero at each level except the topmost level. Here we address smoothness and identifiability simultaneously via penalties as follows.
Let us first consider the overall effect S(t). For this component we take a sufficiently "rich" set of B-splines as a basis and we apply a roughness penalty (Eilers & Marx, [8]) to the wiggliness of the components of a to achieve the smoothness. Thus the estimation of a will be subject to the constraint

‖∆_d a‖² < z,

where ∆_d represents the difference matrix operator of order d, and z quantifies the amount of smoothness applied to a. If d = 3 for example, ∆_d has a four-diagonal structure, each row containing the pattern (−1, 3, −3, 1). Given the above specification on the overall effect, we solve the identifiability problem by shrinking the coefficients ã_j towards 0. It seems reasonable to apply the same amount of shrinkage, z̃_2, to each of the city effects. Thus, we submit the ã_j to the constraint

‖ã_j‖² < z̃_2,  j = 1, . . . , n_2.
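The difference operator ∆_d is easy to construct numerically; the sketch below (illustrative, in numpy) shows the four-diagonal pattern for d = 3 and the corresponding roughness penalty matrix ∆′_d∆_d, including the fact that the penalty ignores polynomials of degree below d:

```python
import numpy as np

c, d = 8, 3
# Difference operator of order d acting on a coefficient vector of length c
Delta_d = np.diff(np.eye(c), n=d, axis=0)
print(Delta_d.shape)      # (5, 8), ie (c - d, c)
print(Delta_d[0])         # first row: -1, 3, -3, 1, 0, ...

# Roughness penalty matrix; the penalty on a vector a is a' K a
K = Delta_d.T @ Delta_d

a = np.arange(8.0) ** 2   # a quadratic sequence (degree 2 < d = 3)
print(float(a @ K @ a))   # 0.0: the d = 3 penalty leaves quadratics unpenalized
```

This null space of the penalty is exactly why a d-order difference penalty yields a polynomial fixed effect of degree d − 1 in the mixed model representation later on.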
The problem of smoothness of the city effects remains. Two possibilities are available: (a) work with fewer B-splines (for the city effects) and only the ridge penalty, or (b) take a rich set of B-splines as a basis (as for the overall effect) and apply a roughness penalty (together with the ridge penalty) on the ã_j; hence we further penalize the roughness of the ã_j, ie,

‖∆̃_2 ã_j‖² < z̃_1,  j = 1, . . . , n_2.

Clearly, (b) is computationally more intensive than (a) since each city has its own (large) set of coefficients, while in comparison with (b), (a) is economical; however, (a) is open to the criticism that the selection of the number of B-splines is manual and artificial. Nonetheless, both approaches produce similar results (at least for CanadianWeather), provided that a judicious choice of the number of B-splines is made at the subject level under (a). From now on, we will consider approach (b) only.
We remark that we have used a d-order penalty for smoothing at the population level since we may wish to have a specific fixed effect at this level; for instance, the CanadianWeather data in Figure 2 suggest a quadratic fixed effect at the population level, in which case we take d = 3. We have no particular form in mind for the city effects, and so we simply use a second order (d = 2) penalty to smooth these effects.
In summary for M1, (i) smoothing of the population effect is achieved by d-order penalization of the population coefficients, (ii) smoothing of the city effects is achieved by second order penalization of the city coefficients and (iii) identifiability is achieved by a ridge penalty on the city coefficients. These three points are summarized as follows:

‖∆_d a‖² < z,  ‖∆̃_2 ã_j‖² < z̃_1,  ‖ã_j‖² < z̃_2,  j = 1, . . . , n_2;   (3.3)

these constraints apply to the model M1 and we refer to (3.3) as the constraints C1.

Penalties on truncated polynomial bases
Here we express S(·) and S_j(·) in terms of a truncated polynomial and a truncated line basis respectively; ie, we set

S(t) = L_p b,  L_p = [X_p : T_p],  b = vec(δ, ξ),   (3.4)
S_j(t) = L̃_1 b̃_j,  L̃_1 = [X̃_1 : T̃_1],  b̃_j = vec(δ̃_j, ξ̃_j),   (3.5)

say, where T_r and T̃_r are matrices of truncated polynomials of degree r at the population and subject levels respectively. We will refer to (3.1), (3.4) and (3.5) as M2 = M2(T, T̃). With a roughness penalty on B-spline bases, a polynomial fixed effect of degree (d − 1) at the population level was captured by choosing a difference penalty of order d. Here, we achieve the same thing in (3.4) by choosing the corresponding degree p = d − 1 of the polynomial basis. Since the subject effects are likely (at least for CanadianWeather) to be quite different from one another, we simply capture them with truncated lines.
With a B-spline basis as in the previous section, the behaviour of B̃ã_j is very similar to that of ã_j in the sense that smoothness of ã_j implies the smoothness of B̃ã_j, and shrinkage of ã_j implies shrinkage of B̃ã_j. This is not entirely clear with truncated polynomial bases of degree p. For the latter, the coefficient vector ξ̃_j reflects the jumps in the derivatives of order p at the corresponding knots, and so the smoothness of the population and subject effects is usually obtained by applying a ridge penalty on ξ and ξ̃_j, ie,

‖ξ‖² < z and ‖ξ̃_j‖² < z̃_1,  j = 1, . . . , n_2.
With B-splines, we have two penalties at the subject level, one for smoothness and one for identifiability. With this in mind, we achieve identifiability by the introduction of a second penalty in M2 at the subject level which shrinks each subject effect S_j(t) = L̃_1 b̃_j towards 0:

‖L̃_1 b̃_j‖² < z̃_2,  j = 1, . . . , n_2.
In summary for M2, smoothness at the population and subject level is obtained by applying a ridge penalty on the truncated polynomial coefficients, and identifiability is achieved by shrinking all the subject effects towards 0; we refer to these constraints as C2.

Penalties on a mixture of B-spline and truncated polynomial bases
Here we consider a mixture of B-splines and truncated polynomials. We start with

S(t) = L_p b,  S_j(t) = B̃ã_j;   (3.9)

we refer to (3.1) and (3.9) as M4 = M4(T, B̃); smoothness and identifiability constraints on M4 are as follows:

‖ξ‖² < z,  ‖∆̃_2 ã_j‖² < z̃_1,  ‖ã_j‖² < z̃_2,  j = 1, . . . , n_2.   (3.10)

Further possibilities
Models M1, M2, M3 and M4 with the associated constraints C1, C2, C3 and C4 are the main models that we investigate here. In all of these cases, we achieve smoothness either by a roughness (difference) penalty on the B-spline coefficients or a ridge penalty on the truncated polynomial coefficients. An alternative might be to smooth by applying a roughness (difference) penalty directly on the estimates, ie, on S(t) and S_j(t) (whether a B-spline or a truncated polynomial basis is used). Note also that solving the identifiability problem by shrinking the subject effects S_j(t) is also applicable with a B-spline basis at the subject level. Furthermore, instead of applying shrinkage to S_j(t), one can solve the identifiability problem by applying the shrinkage at the knots only, ie, a ridge penalty on S_j(τ). These are topics for further research.

Inference and applications
In the previous section, we presented four formulations of model (3.1) with penalized splines. Each of these formulations has the form

Y_{•,j} = Gα + G̃α̃_j + ε_{•,j},   (4.1)

which can be expressed compactly as

vec(Y) = Ωθ + ε,   (4.2)

where θ = vec(α, α̃), α̃ = vec(α̃_1, . . . , α̃_{n_2}), is the joint vector of coefficients, Ω = [1_{n_2} ⊗ G : I_{n_2} ⊗ G̃] is the full regression matrix, and G, n_1 × c, and G̃, n_1 × c̃, are regression matrices at the population and subject levels; specifically, we have G = B, G̃ = B̃ for M1; G = L_p, G̃ = L̃_1 for M2; G = B, G̃ = L̃_1 for M3; and G = L_p, G̃ = B̃ for M4. We will present two ways of fitting model (4.2) (with reference to the data CanadianWeather) under the associated constraints C1, C2, C3 or C4. The first approach will be based on the penalized residual sum of squares, while the second approach will use the mixed model representation of the model.
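The full regression matrix Ω = [1_{n_2} ⊗ G : I_{n_2} ⊗ G̃] can be assembled directly with Kronecker products; in the sketch below G and G̃ are random placeholders standing in for the basis matrices, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 9, 4          # time points and subjects (small, for illustration)
c, c_til = 5, 6        # numbers of population- and subject-level basis functions

G = rng.standard_normal((n1, c))       # placeholder for the population basis matrix
G_til = rng.standard_normal((n1, c_til))  # placeholder for the subject basis matrix

# Omega = [1_{n2} (x) G : I_{n2} (x) G_til]
Omega = np.hstack([np.kron(np.ones((n2, 1)), G),
                   np.kron(np.eye(n2), G_til)])
print(Omega.shape)     # (n1*n2, c + n2*c_til) = (36, 29)
```

The first block repeats G down the rows (every subject shares the population effect); the second is block diagonal (each subject has its own coefficients α̃_j).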

Inference with penalized residual sum of squares
Using Lagrange arguments, the penalized residual sum of squares (PRSS) of (4.2), ie, the residual sum of squares (RSS) under constraints C1, C2, C3 or C4, can be expressed as

PRSS(θ) = ‖vec(Y) − Ωθ‖² + θ′Pθ,   (4.5)

where

P = blockdiag(P_α, I_{n_2} ⊗ P_α̃)   (4.6)

is the block diagonal penalty matrix, with P_α and P_α̃ determined by the bases and the constraints C1, C2, C3 or C4 in (4.7) and (4.8). Here, J_r is the identity matrix (of appropriate size) whose upper r diagonal elements have been replaced by 0's, while λ and λ̃_1 are the smoothing parameters at the population and subject level respectively, and λ̃_2 is the shrinkage parameter of the subject effects; (λ, λ̃_1, λ̃_2) plays (inversely) a role equivalent to that of (z, z̃_1, z̃_2) used throughout section 3. More precisely, increasing values of λ and λ̃_1 (ie, decreasing values of z and z̃_1) induce more smoothness on the population and subject effects, while increasing values of λ̃_2 (ie, decreasing values of z̃_2) correspond to heavier shrinkage of the subject effects. In the limit, ie, λ̃_2 → ∞ (or equivalently z̃_2 → 0), we have S_j(·) → 0; this reduces the linear predictor of the model to the population effect. Given values of these parameters, we obtain θ̂ = (Ω′Ω + P)⁻¹Ω′Y on minimizing the PRSS in (4.5). For illustration we choose the smoothing/shrinkage parameters by minimizing the Bayesian Information Criterion (BIC; Schwarz [19])

BIC(λ, λ̃_1, λ̃_2) = n_1 n_2 × log(RSS) + tr(H) × log(n_1 n_2);   (4.9)

in (4.9), tr(H), the trace of the hat matrix H, is the effective dimension of the fitted model, where H = Ω(Ω′Ω + P)⁻¹Ω′ maps the observations to the fitted values. Nonetheless, since the BIC mostly addresses model choice at the global level, a more formal criterion for the selection of the shrinkage parameter may be derived by some partition of the BIC into a population and a subject component.
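The PRSS fit, the hat matrix and the BIC can be sketched in a few lines of numpy. The bases below are random placeholders, the penalty corresponds to an M1-style choice (third-order differences at the population level, second-order differences plus a ridge at the subject level), and the values of (λ, λ̃_1, λ̃_2) are arbitrary illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, c, ct = 9, 4, 5, 6
G, Gt = rng.standard_normal((n1, c)), rng.standard_normal((n1, ct))
Omega = np.hstack([np.kron(np.ones((n2, 1)), G), np.kron(np.eye(n2), Gt)])
y = rng.standard_normal(n1 * n2)                 # stands in for vec(Y)

lam, lt1, lt2 = 1.0, 0.5, 2.0                    # (lambda, lambda~_1, lambda~_2)
D3 = np.diff(np.eye(c), n=3, axis=0)             # population difference operator
D2 = np.diff(np.eye(ct), n=2, axis=0)            # subject difference operator
Pa = lam * D3.T @ D3                             # population smoothness penalty
Pat = lt1 * D2.T @ D2 + lt2 * np.eye(ct)         # subject smoothness + ridge
P = np.block([[Pa, np.zeros((c, n2 * ct))],
              [np.zeros((n2 * ct, c)), np.kron(np.eye(n2), Pat)]])

A = Omega.T @ Omega + P
theta = np.linalg.solve(A, Omega.T @ y)          # minimizer of the PRSS
H = Omega @ np.linalg.solve(A, Omega.T)          # hat matrix
rss = float(np.sum((y - Omega @ theta) ** 2))
bic = n1 * n2 * np.log(rss) + np.trace(H) * np.log(n1 * n2)
```

In practice one would evaluate `bic` over a grid of (λ, λ̃_1, λ̃_2) values and keep the minimizer.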
Alternatively, one may prefer to choose a certain fixed amount of shrinkage, or to study (as a function of the shrinkage parameter) the departure of the fitted population effect from the observed mean, and then choose the amount of shrinkage that minimizes this departure.
Finally, in line with the familiar unbiased estimate of variance in linear regression (Wood (p171) [24]), we estimate σ² by

σ̂² = RSS / (n_1 n_2 − tr(H)).

We now apply this procedure to CanadianWeather. For all our applications, we will use cubic B-splines. For each of our four models, we follow Ruppert et al. [18] and so use 39 equi-spaced internal knots at both the population and subject levels. The results are summarized in Table 1. Figure 4 illustrates the fitted population effect with the associated confidence intervals for our four models; the confidence intervals have been computed using the Bayesian argument described in section 5. M1 and M4 are the best models (for CanadianWeather) both in terms of BIC and parsimony. Note that the confidence bands for models M1 and M4, the two models with B̃ as regression matrix at the subject level, are narrower than those of M2 and M3. The right panel in Figure 5 displays the city effects as estimated under model M1; these city effects are essentially identical for all four models. The supplementary materials contain R-code to reproduce Figure 4 and Table 1.
In summary, our four models all return essentially identical estimates of both the population and city effects; the width of the confidence intervals appears to depend on the basis used at the subject level. We will return to this point in our concluding remarks.
We have also considered using different knot scenarios at the subject level from that at the population level. While this generally produces consistent estimates of the population and subject effects, it may adversely affect the confidence intervals at both levels for all four models if the number of knots at the subject level is "too small" relative to that at the population level. Again, we will return to this point in our closing discussion.
Going back to the data in Figure 2, we see that the overall effect looks quadratic with some noise, mostly at the boundaries. Furthermore, our data present a mixed model structure; by this we mean that it is natural to suppose that our cities are a random selection from the population of Canadian cities. We discuss a mixed model approach for (3.1) or (4.1) in the next section.

Mixed model representation and interpretation
In the representation (4.1), it is natural to think of the coefficients α̃_j as random, since the subjects which they represent are randomly chosen from the population. The question is the following: from the smoothness and identifiability assumptions made so far, can we "naturally" derive the distributions which have generated these coefficients/subjects?

Mixed model representation for M2 and M4
Recall that under M2 or M4, we have a truncated polynomial basis at the population level, ie, G = L_p = [X_p : T_p] and α = b = vec(δ, ξ). Here, the mixed model representation is straightforward; this follows from the structure of the truncated polynomial basis at the population level. Indeed, the minimization of PRSS under M2 or M4 (with the associated constraints) is equivalent to the maximization of the log-likelihood which arises from the triplet (Y, ξ, α̃), where ξ and α̃ are treated as a pair of independent random vectors under the distributional assumptions

ξ ∼ N(0, (σ²/λ) I_q),  α̃_j ∼ N(0, σ² P_α̃⁻¹),  j = 1, . . . , n_2,   (4.10)

where P_α̃, defined in (4.8), depends on λ̃_1 and λ̃_2. We comment on this representation in subsection 4.2.3.

Mixed model representation for M1 and M3
Under both M1 and M3, we have a B-spline basis at the population level and so G = B and α = a. In this case, the minimization of PRSS (with the associated constraints) is equivalent to maximizing the log-likelihood which arises from the triplet (Y, a, α̃), where a and α̃ are treated as a pair of independent random vectors under the (improper for a) distributional assumptions

a ∼ N(0, σ² (λ ∆′_d ∆_d)⁻),  α̃_j ∼ N(0, σ² P_α̃⁻¹),  j = 1, . . . , n_2,   (4.11)

where A⁻ denotes a generalized inverse of A.

V.A.B. Djeundje and I.D. Currie/Smooth mixed models for longitudinal data 1216
The roughness matrix ∆′_d∆_d, which gives rise to the improper prior distribution for a in (4.11), is singular, symmetric and has rank c − d. Now the singular value decomposition of ∆′_d∆_d is ∆′_d∆_d = UΛU′, where Λ = diag(ρ_1, . . . , ρ_{c−d}, 0, . . . , 0) is the c × c diagonal matrix of eigenvalues arranged in non-increasing order, and U is the matrix with columns given by the eigenvectors of ∆′_d∆_d. We will denote diag(ρ_1, . . . , ρ_{c−d}) by Λ_+. With this notation, the smoother Ba can be expressed as

Ba = X_p β + R a_2,  R = B U_+ Λ_+^{−1/2}.   (4.12)

In (4.12), U_+ is the sub-matrix of U which corresponds to the positive eigenvalues of ∆′_d∆_d. The (improper) normal assumption about a in (4.11) reduces to a_2 ∼ N(0, (σ²/λ) I_{c−d}). Finally, minimizing the PRSS of M1 or M3 yields the mixed model representation (4.13); we comment on this representation in the next subsection.
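The eigen-reparameterization just described can be sketched numerically as follows; here B is a random placeholder for a B-spline basis matrix, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
c, d = 10, 3
B = rng.standard_normal((25, c))          # placeholder for an n1 x c B-spline basis

Dd = np.diff(np.eye(c), n=d, axis=0)      # difference operator of order d
K = Dd.T @ Dd                             # roughness matrix, rank c - d

# Eigendecomposition with eigenvalues in non-increasing order
rho, U = np.linalg.eigh(K)                # eigh returns ascending order
idx = np.argsort(rho)[::-1]
rho, U = rho[idx], U[:, idx]

U_plus = U[:, : c - d]                    # eigenvectors with positive eigenvalues
Lam_plus = np.diag(rho[: c - d])
R = B @ U_plus @ np.diag(rho[: c - d] ** -0.5)   # random-effect regression matrix
print(R.shape)   # (25, 7)
```

The d zero eigenvalues correspond to the unpenalized polynomial part of the smoother; the remaining c − d directions, rescaled by Λ_+^{−1/2}, become i.i.d. random effects.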

Interpretation of the components
Clearly from (4.10) and (4.13), the mixed model representation of (4.1) for our four models M1, M2, M3 and M4 with the associated constraints has the form

Y_{•,j} = X_p β + Z_p γ + G̃α̃_j + ε_{•,j},   (4.14)

for appropriate β, Z_p, and γ. Hence, the model predictor is made up of three components:
• The first component, X_p β, represents the fixed overall effect. Motivated by the overview of the data in Figure 2, we require this component to be quadratic for CanadianWeather; this justifies the use of the third order difference penalty in M1 and M3. An illustration of this first component under M1 is shown by the continuous line in the left panel of Figure 5.
• The second component, Z_p γ, which is shrunk towards 0, accounts for the flexibility of the population effect, and smoothly captures the deviation of the population effect from a simple quadratic curve. We do not view the normal constraint on this component as random behaviour, but just as a smoothing device. This component is illustrated (under M1) for CanadianWeather by the dashed line in the left panel in Figure 5.
• The third/random component, G̃α̃_j, measures the random departure of the subjects from the overall effect. The normal constraint on this component incorporates the random behaviour of the cities (controlled by λ̃_2) as well as the smoothness of the city effects (as measured by λ̃_1); these are shown for CanadianWeather (under M1) in the right panel in Figure 5.

Mixed model fitting
Since the smoothness/shrinkage constraints on the second component, Z_p γ in (4.14), are expressed in terms of a distribution, this component may also be treated as a fictive part of the random component; this allows the full insertion of the model into the mixed model framework, where we can take advantage of the estimation methodology of restricted maximum likelihood. In line with this motivation, setting X = 1_{n_2} ⊗ X_p, Z = [1_{n_2} ⊗ Z_p : I_{n_2} ⊗ G̃] and u = vec(γ, α̃) with var(u) = Φ as specified in (4.15), we obtain

vec(Y) = Xβ + Zu + ε,  u ∼ N(0, Φ).   (4.16)

Hence, model (3.1), together with the smoothness and identifiability of the model achieved via the penalties in section 3, is now expressed as a mixed model with fixed effects β, random effects u and covariance matrix Φ. The larger the smoothing/shrinkage parameters are, the flatter the random component Zu is, and the closer the predictor Xβ + Zu approaches the fixed component Xβ. With a truncated polynomial basis at the population level, the mixed model representation was straightforward, because of the form of the basis; with a B-spline basis however, additional effort was required; in this case, the structure of the resulting regression matrix R (as defined in (4.12)) is far from obvious, but it does arise naturally as a function of the positive eigenvalues and corresponding eigenvectors of ∆′_d∆_d, and the B-spline basis. A standard approach (Searle et al. [20]) for fitting (4.16) consists of maximizing the joint distribution of (Y, u) over (β, u). This leads to a set of equations that simultaneously yields an estimate for the fixed effect (β) and a predictor for the random effect (u):

β̃ = (X′V⁻¹X)⁻¹ X′V⁻¹ vec(Y),  ũ = ΦZ′V⁻¹(vec(Y) − Xβ̃).   (4.17)

These are the Best Linear Unbiased Estimator/Predictor (BLUE/BLUP) of β and u respectively (Robinson, [16]). In (4.17), V = var(Y) = ZΦZ′ + σ²I_{n_1 n_2}, where Φ = Φ_{σ², λ, λ̃_1, λ̃_2} is specified in (4.15) and so (σ², λ, λ̃_1, λ̃_2) are all viewed here as variance parameters. Clearly, β̃ and ũ are numerically known (only) upon specification of the variance parameters. There is a large literature on the estimation of these parameters.
One possibility is to maximize the unconditional likelihood of Y, Y ~ N(Xβ, V). However, one unsatisfactory property of maximum likelihood in estimating the variance components is that it discards the information on the degrees of freedom involved in the estimation of the fixed effect. As a result, the maximum likelihood estimator of the variance parameters tends to be biased. This problem with maximum likelihood is corrected by the so-called restricted likelihood (Patterson & Thompson [12]), one justification of which consists in assuming a uniform prior distribution for the fixed effects and then integrating them out of the likelihood (Laird & Ware [11], Pinheiro & Bates [13]). After substituting β by β̂, minus twice the restricted log-likelihood of model (4.16) is given (up to an additive constant) by
−2ℓ(σ, λ, λ_1, λ_2) = log|V| + log|X'V⁻¹X| + (Y − Xβ̂)'V⁻¹(Y − Xβ̂).   (4.18)
The estimation machinery then consists of estimating the variance parameters by maximizing ℓ in (4.18) and substituting them into (4.17) to derive the BLUE/BLUP. An interesting point is that the standard mixed model approach in (2.4) involves six variance parameters, while our penalty approach (4.16) involves only four such parameters; hence, besides producing satisfactory results, the penalty approach is computationally more economical than (2.4) when searching for optimal values of the smoothing/variance parameters.
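The restricted-likelihood criterion of the form (4.18) can be evaluated directly. The sketch below profiles it over a small grid of values of σ² with everything else held fixed; the model and data are synthetic stand-ins, not CanadianWeather, and a real fit would optimize all variance parameters jointly.

```python
import numpy as np

def neg2_restricted_loglik(y, X, Z, Phi, sigma2):
    """Minus twice the restricted log-likelihood, up to an additive
    constant: log|V| + log|X'V^{-1}X| + (y - X b)' V^{-1} (y - X b)."""
    V = Z @ Phi @ Z.T + sigma2 * np.eye(len(y))
    Vi = np.linalg.inv(V)
    XtViX = X.T @ Vi @ X
    b = np.linalg.solve(XtViX, X.T @ Vi @ y)        # profiled-out beta-hat
    r = y - X @ b
    return (np.linalg.slogdet(V)[1]
            + np.linalg.slogdet(XtViX)[1]
            + float(r @ Vi @ r))

rng = np.random.default_rng(2)
n = 40
t = np.linspace(0, 1, n)
X = np.column_stack([np.ones(n), t])
Z = np.maximum(t[:, None] - np.linspace(0.1, 0.9, 6)[None, :], 0.0)
y = X @ np.array([0.5, 1.5]) + 0.3 * rng.standard_normal(n)

grid = [0.01, 0.05, 0.1, 0.5, 1.0]                  # candidate sigma^2 values
crit = [neg2_restricted_loglik(y, X, Z, 0.2 * np.eye(6), s2) for s2 in grid]
best_s2 = grid[int(np.argmin(crit))]
```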

Variability bands
So far, we have presented two approaches for fitting and interpreting model (3.1). Inference in this context is a delicate issue because it depends on whether the mixed model formulation is being used or not. We first set the mixed model formulation aside and discuss inference from the Bayesian perspective, relying on the posterior distribution of the parameters. The main motivation for working in the Bayesian framework here stems from the fact that the estimator of the parameter vector θ, defined in (4.2), is usually biased (unless the parameter vector itself vanishes). As a result, the central limit theorem cannot be used directly to compute approximate confidence intervals. In addition, there is evidence that the posterior confidence bands derived from the Bayesian perspective have good sampling properties (Wood, chap. 4 [24]).
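For a generic penalized fit ŷ = Cθ̂ with penalty matrix P, the Bayesian posterior θ | Y ~ N(θ̂, σ²(C'C + P)⁻¹) yields pointwise variability bands (Wood, chap. 4 [24]). The sketch below illustrates this under an assumed ridge-type penalty and invented data; the paper's model (4.2) and penalties are more elaborate.

```python
import numpy as np

def bayesian_bands(y, C, P, sigma2, z=1.96):
    """Fit and pointwise posterior bands for y ≈ C theta with penalty P,
    using theta | y ~ N(theta_hat, sigma2 * (C'C + P)^{-1})."""
    A = C.T @ C + P
    theta_hat = np.linalg.solve(A, C.T @ y)
    fit = C @ theta_hat
    cov_fit = sigma2 * (C @ np.linalg.solve(A, C.T))  # posterior cov of the fit
    se = np.sqrt(np.diag(cov_fit))
    return fit, fit - z * se, fit + z * se

rng = np.random.default_rng(3)
n = 60
t = np.linspace(0, 1, n)
knots = np.linspace(0.1, 0.9, 10)
C = np.column_stack([np.ones(n), t,
                     np.maximum(t[:, None] - knots[None, :], 0.0)])
P = np.diag([0.0, 0.0] + [5.0] * 10)   # ridge on truncated-line coefficients only
y = np.sin(2 * np.pi * t) + 0.2 * rng.standard_normal(n)
fit, lower, upper = bayesian_bands(y, C, P, sigma2=0.04)
```

The bands widen where the design carries little information, which is exactly the behaviour one wants when the frequentist estimator is biased by shrinkage.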

Extensions
For the data CanadianWeather, we have assumed that the 35 cities are similar in the sense that model (3.1) is expressed in terms of a common mean (smooth) curve and the subject/city departures. These cities can be classified into four regions: Arctic, Atlantic, Continental and Pacific, and plotting the data by region shows that the observed means (averages) differ in shape and level from one region to another. Model (3.1) can be extended to account for the region effects. Once again, the estimated mean effect per region is very sensitive to the knot locations when fitted with the standard approach similar to (2.2)-(2.4).
In contrast, any of our models (similar to) M1, M2, M3 or M4 with the associated smoothness and identifiability constraints (similar to) C1, C2, C3 or C4 does not suffer from this defect; the results are not presented here. For simplicity, we have presented our work so far in the special case of balanced data where measurements are made on the subjects at the same times. The extension to different timings for subjects, as well as to unbalanced data, is straightforward. For illustration, we consider the simulated data based on the model in Durban et al. [5] related to the heights of 197 children suffering from acute lymphoblastic leukaemia and receiving three different treatments; these data are displayed in Figure 6. If we denote by Y_{i,j,k} the i-th measurement of the height of the j-th child receiving treatment k, then its linear predictor can be expressed as
S_k(t_{i,j,k}) + S_{j,k}(t_{i,j,k}),   (6.1)
where S_k(·) and S_{j,k}(·) are smooth functions which quantify the treatment effect and the children's deviations respectively. We can express (6.1) in matrix form as
Y_{•,j,k} = S_k(t_{•,j,k}) + S_{j,k}(t_{•,j,k}) + ε_{•,j,k},   (6.2)
where t_{•,j,k} is the time vector for the j-th child receiving treatment k. Details on the components of this model with truncated lines can be found in Coull et al. [2] and Durban et al. [5], among others; these authors referred to this model as a factor-by-curve interaction model. Even though the number of observations per child is very small (it varies from 1 to 21), the estimated treatment effects are biased if fitted with the standard approach similar to (2.4); this is shown in the left panel of Figure 7. Indeed, for the same knot locations (40 and 10 inner knots at the population and child levels respectively, as used by Durban et al. [5]), this graphic illustrates a clear difference between the estimates derived from the two equivalent bases: (a) the forward truncated-line bases and (b) the backward truncated-line bases. The fanning effect seen on the right of the left panel of Figure 7 arises partly from data thinning.
In addition, however, we observe the same fanning effects as seen in Figure 3. Alternatively, each of our four formulations M1, M2, M3 or M4 is extendable to such unbalanced situations. More precisely, we set S_k(t_{•,j,k}) = G_{t_{•,j,k}} α_k and S_{j,k}(t_{•,j,k}) = G̃_{t_{•,j,k}} α̃_{j,k}, where α_k and α̃_{j,k} are vectors of coefficients quantifying the treatment and children effects respectively; G_{t_{•,j,k}} and G̃_{t_{•,j,k}} are matrices (of B-splines or truncated polynomials as the case may be) at the treatment and child level, constructed along the time vector t_{•,j,k}. Unlike the data CanadianWeather, there is no strong motivation to use the third order penalty or a second order truncated polynomial for these growth data (see Figure 6). Model (6.3) can be expressed compactly as in (4.2) with appropriate components, and then the coefficients can be estimated either by minimizing the penalized residual sum of squares (with appropriate penalty matrix) or by re-parameterization as a mixed model, as described in section 4.2. The fitted mean effects for the three treatments are shown in the right panel of Figure 7; this approach does not suffer from the instability illustrated in the left panel. Adding the fitted child effects to these treatment effects yields the child curves presented in Figure 8. More generally, the penalty approach is easily adapted to model multilevel nested hierarchical curves. The supplementary materials contain R-code to reproduce Figure 7.
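For unbalanced data, the treatment- and child-level basis matrices are simply evaluated at each child's own time vector. A minimal sketch with truncated-line bases and hypothetical visit times follows; the knot counts and sample sizes are illustrative and are not those of Durban et al. [5].

```python
import numpy as np

rng = np.random.default_rng(4)

def trunc_lines(t, knots):
    """Truncated-line basis (t - kappa)_+ evaluated at arbitrary times t."""
    t = np.asarray(t, dtype=float)
    return np.maximum(t[:, None] - np.asarray(knots)[None, :], 0.0)

treat_knots = np.linspace(2, 16, 8)   # hypothetical treatment-level knots (years)
child_knots = np.linspace(4, 14, 3)   # sparser child-level knots

designs = []
for j in range(5):                    # 5 hypothetical children in one treatment arm
    n_j = int(rng.integers(1, 8))     # unbalanced: 1 to 7 visits per child
    t_j = np.sort(rng.uniform(1, 18, n_j))
    # G evaluated at this child's times: treatment-level design block
    G_j = np.column_stack([np.ones(n_j), t_j, trunc_lines(t_j, treat_knots)])
    # G-tilde at the same times: child-level (deviation) design block
    Gtilde_j = np.column_stack([np.ones(n_j), t_j, trunc_lines(t_j, child_knots)])
    designs.append((t_j, G_j, Gtilde_j))
```

Stacking the per-child blocks row-wise (and block-diagonally for the child deviations) gives the compact form analogous to (4.2).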

Concluding remarks
In this work we first illustrated some consequences of the mis-specification of the standard covariance structure (2.4) in a mixed model (for longitudinal data) defined using truncated-line bases. One simple way of demonstrating the problem is to fit only the population effect. Here, truncated lines with a ridge penalty and B-splines with a roughness penalty give almost identical answers, and both capture the population effect correctly with appropriate confidence intervals. However, when we add the city effects, the estimates of the population effect and its associated confidence intervals are distorted when truncated lines with the covariance structure (2.4) are used, as shown in Figure 3. No such distortion occurs with the penalty approach presented in this paper; the estimates of the population effect are identical whether city effects are included or not. For the penalty approach, we first specify the bases, and then we design the components of the model (the population, subject, and other effects). With the components in place, we use penalties to bring about the model effects we wish to achieve. One unlooked-for bonus with the penalty approach is the reduction in the number of variance parameters to be estimated, from six with (2.4) to four with (4.16). Even though the B-spline and truncated polynomial bases produced satisfactory results in the applications presented in this paper, we have a preference for B-spline bases, because of the direct connection between the regression coefficients and the penalty which is applied to these coefficients; for instance, with the B-spline basis, we can easily adjust the penalty to link the start and end of the year via a circular penalty, or to account for a periodic effect (for example, if we are interested in modelling the temperatures collected over many years) by using a harmonic penalty.
In the case of the CanadianWeather data we have used neither a circular nor a harmonic penalty since we used these data to illustrate some general points of fitting nested smooth curves.
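For reference, a circular difference penalty of the kind mentioned above can be obtained by taking the difference operator cyclically, so that the last coefficient is differenced against the first. The construction below is a generic sketch, not code from the paper.

```python
import numpy as np

def circular_diff_penalty(q, d=2):
    """Penalty matrix D'D for the d-th order *circular* difference
    operator, which wraps coefficient q-1 back around to coefficient 0."""
    D = np.eye(q)
    for _ in range(d):
        D = np.roll(D, -1, axis=1) - D   # one cyclic first difference
    return D.T @ D

P_circ = circular_diff_penalty(8, d=2)
```

Replacing the usual Δ'_d Δ_d by this cyclic analogue forces the fitted annual curve to join smoothly at the year boundary; only constant coefficient vectors escape the penalty, since the wrap-around removes the polynomial null space of the ordinary difference penalty.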
One obvious advantage that truncated polynomial bases appear to have over B-spline bases (at the population level) is that a mixed model representation is immediate; this allows simple fitting with standard methodology (restricted likelihood) upon model specification. With B-splines (at the population level) we must work a little harder to achieve a mixed model representation and so gain access to the standard methodology of restricted likelihood; the transformed bases are not intuitive but can be arrived at via a penalty argument. Of course, the B-spline approach can be expressed in terms of truncated polynomials, but the resulting covariance structure would again be all but impossible to guess without penalties; some details on transformations from B-splines to truncated polynomials can be found in de Boor [6] and Welham et al. [23].
We return to two issues which we raised at the end of section 4.1. First, our methods appear to be successful in recovering population and subject effects, and in solving the problem of the widening fan effect found with (2.4). However, the width of the associated confidence intervals arising from the use of BIC depends on whether a B-spline or a truncated-line basis is used at the subject level. Nonetheless, it is possible to "play" with the values of the smoothing/shrinkage parameters when truncated lines are used and to reproduce the confidence intervals obtained with B-splines. A second difficulty arises when selecting the smoothing/shrinkage parameters by optimizing a deviance-type criterion like BIC. We found that, for balanced data such as CanadianWeather, if the number of knots at the subject level is "too small" relative to the number at the population level, i.e., q̃ ≪ q (for instance, q = 39 and q̃ < 10 for CanadianWeather), then the optimal values of the shrinkage/smoothing parameters as selected by BIC fall on the boundary of the parameter space; this can lead to unexpectedly wide confidence intervals at the population and subject levels. Hence, in practice, attention must be given to the choice of q and q̃ (for example, by following Ruppert [17] both at the population and subject levels) as well as to the optimization criterion.
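A minimal sketch of BIC-based selection of a single smoothing parameter, using the effective degrees of freedom edf = tr(H) as the model dimension; the design, penalty, grid and data are invented for illustration, and the paper optimizes several smoothing/shrinkage parameters jointly.

```python
import numpy as np

def penalized_fit_bic(y, C, P_base, lam):
    """BIC of a penalized LS fit with penalty lam * P_base, using
    effective degrees of freedom edf = tr(C (C'C + lam P_base)^{-1} C')."""
    n = len(y)
    A = C.T @ C + lam * P_base
    H = C @ np.linalg.solve(A, C.T)          # hat (smoother) matrix
    fit = H @ y
    rss = float(np.sum((y - fit) ** 2))
    edf = float(np.trace(H))
    return n * np.log(rss / n) + np.log(n) * edf

rng = np.random.default_rng(5)
n = 80
t = np.linspace(0, 1, n)
knots = np.linspace(0.05, 0.95, 20)
C = np.column_stack([np.ones(n), t,
                     np.maximum(t[:, None] - knots[None, :], 0.0)])
P_base = np.diag([0.0, 0.0] + [1.0] * 20)    # ridge on truncated-line coefficients
y = np.cos(3 * t) + 0.1 * rng.standard_normal(n)

lams = 10.0 ** np.arange(-4, 5)              # grid of smoothing parameters
bics = [penalized_fit_bic(y, C, P_base, lam) for lam in lams]
lam_opt = float(lams[int(np.argmin(bics))])
```

If the minimum lands on the first or last grid point, the optimum is on the boundary of the search region, which is precisely the warning sign discussed above.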