P-splines with an $\ell_1$ penalty for repeated measures

P-splines are penalized B-splines, in which finite order differences in coefficients are typically penalized with an $\ell_2$ norm. P-splines can be used for semiparametric regression and can include random effects to account for within-subject variability. In addition to $\ell_2$ penalties, $\ell_1$-type penalties have been used in nonparametric and semiparametric regression to achieve greater flexibility, such as in locally adaptive regression splines, $\ell_1$ trend filtering, and the fused lasso additive model. However, there has been less focus on using $\ell_1$ penalties in P-splines, particularly for estimating conditional means. In this paper, we demonstrate the potential benefits of using an $\ell_1$ penalty in P-splines with an emphasis on fitting non-smooth functions. We propose an estimation procedure using the alternating direction method of multipliers and cross validation, and provide degrees of freedom and approximate confidence bands based on a ridge approximation to the $\ell_1$ penalized fit. We also demonstrate potential uses through simulations and an application to electrodermal activity data collected as part of a stress study.


Introduction
Many nonparametric regression methods, including smoothing splines and regression splines, obtain point estimates by minimizing a penalized negative log-likelihood function of the form $\ell_{\text{pen}} = -\ell(\beta) + P(\beta)$, where $\ell$ is a log-likelihood, $P$ is a penalty term, and $\beta$ are the coefficients to be estimated. Typically, quadratic ($\ell_2$ norm) penalties are used, which lead to straightforward computation and inference. In particular, $\ell_2$ penalties typically lead to ridge estimators, which have both closed form solutions and are linear smoothers. The $\ell_2$ penalty also has connections to mixed models, which allows the smoothing parameters to be estimated as variance components (Green, 1987; Speed, 1991; Wang, 1998; Zhang et al., 1998).
However, nonparametric regression methods that use an $\ell_1$-type penalty, such as $\ell_1$ trend filtering (Kim et al., 2009) and locally adaptive regression splines (Mammen et al., 1997), are better able to adapt to local differences in smoothness and achieve the minimax rate of convergence for weakly differentiable functions of bounded variation (Tibshirani, 2014a), whereas $\ell_2$ penalized methods do not (Donoho and Johnstone, 1988). The trade-off is that $\ell_1$ penalties generally lead to more difficult computation and inference because the objective function is convex but non-differentiable, and the fit is no longer a linear smoother.
In this article, we propose P-splines with an $\ell_1$ penalty as a framework for generalizing $\ell_1$ trend filtering within the context of repeated measures data and semiparametric (additive) models (Hastie and Tibshirani, 1986). In Section 2, we discuss the connection between P-splines and $\ell_1$ trend filtering, which motivates the methodological development. In Section 3, we present our proposed model, and in Section 4, we discuss related work. In Section 5, we propose an estimation procedure using the alternating direction method of multipliers (ADMM) (see Boyd et al., 2011) and cross validation (CV). In Section 6, we derive the degrees of freedom and propose computationally stable and fast approximations, and in Section 7, we develop approximate confidence bands based on a ridge approximation to the $\ell_1$ fit. In Section 8, we study our method through simulations and evaluate its performance in fitting non-smooth functions. In Section 9, we demonstrate our method in an application to electrodermal activity data collected as part of a stress study. We close with a discussion in Section 10.

P-splines and $\ell_1$ trend filtering
In this section, we give brief background on P-splines and $\ell_1$ trend filtering, and show the relation between them. P-splines (Eilers and Marx, 1996) are penalized B-splines (see De Boor, 2001). B-splines are flexible bases that are notable in part because they have compact support, which leads to banded design matrices and faster computation. This compact support can be seen in Figure 1, which shows eight evenly spaced first degree and third degree B-spline bases on $[0, 1]$. We can define an order $M$ (degree $M - 1$) B-spline basis with $j = 1, \ldots, p$ basis functions recursively as (De Boor, 2001)
$$\phi_j^1(x) = \begin{cases} 1 & t_j \le x < t_{j+1} \\ 0 & \text{otherwise,} \end{cases} \qquad \phi_j^M(x) = \frac{x - t_j}{t_{j+M-1} - t_j}\,\phi_j^{M-1}(x) + \frac{t_{j+M} - x}{t_{j+M} - t_{j+1}}\,\phi_{j+1}^{M-1}(x),$$
where $t_j$ are the knots, division by zero is taken to be zero, and $c$ is the number of internal knots. For order $M$ B-splines defined on the interval $[a, b]$, in order to obtain $j = 1, \ldots, p$ basis functions, we set $2M$ boundary knots ($M$ knots on each side) and $c = p - M$ interior knots. In general, one can set $t_1 \le t_2 \le \cdots \le t_M = a < t_{M+1} < \cdots < t_{M+c} < b = t_{M+c+1} \le t_{M+c+2} \le \cdots \le t_{2M+c}$. In order to ensure continuity at the boundaries, we set $t_1 < t_2 < \cdots < t_{M-1} < t_M = a$ and $b = t_{M+c+1} < t_{M+c+2} < \cdots < t_{2M+c}$. We also use equally spaced interior knots, which is important for the P-spline penalty, and we drop the superscript on $\phi$ designating order when the order does not matter or is stated in the text. B-spline bases can be used to fit nonparametric models of the form $y(x) = f(x) + \epsilon(x)$, where $y(x)$ is the outcome $y$ at point $x$, $f(x)$ is the mean response function at $x$, and $\epsilon(x)$ is the error at $x$. To that end, let $y = (y_1, \ldots, y_n)^T$ be an $n \times 1$ vector of outcomes and $x = (x_1, \ldots, x_n)^T$ be a corresponding $n \times 1$ vector of covariates. Also, let $\phi_1, \ldots, \phi_p$ be B-spline basis functions and let $F$ be an $n \times p$ design matrix such that $F_{ij} = \phi_j(x_i)$, i.e., the $j$th column of $F$ is the $j$th basis function evaluated at $x_1, \ldots, x_n$. Equivalently, the $i$th row of $F$ is the $i$th data point evaluated by $\phi_1, \ldots, \phi_p$.
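As a concrete illustration of the recursion above, the following sketch (ours, not the authors' software; the function name and interface are hypothetical) evaluates an order-$M$ B-spline basis on a given knot sequence, taking division by zero as zero:

```python
import numpy as np

def bspline_basis(x, knots, M):
    """Evaluate the order-M (degree M-1) B-spline bases at points x
    via the Cox-de Boor recursion; division by zero is taken as zero."""
    t = np.asarray(knots, dtype=float)
    x = np.asarray(x, dtype=float)
    # order 1: piecewise-constant indicator functions on [t_j, t_{j+1})
    B = np.array([(t[j] <= x) & (x < t[j + 1]) for j in range(len(t) - 1)],
                 dtype=float).T
    for m in range(2, M + 1):
        B_new = np.zeros((len(x), len(t) - m))
        for j in range(len(t) - m):
            d1, d2 = t[j + m - 1] - t[j], t[j + m] - t[j + 1]
            left = (x - t[j]) / d1 * B[:, j] if d1 > 0 else 0.0
            right = (t[j + m] - x) / d2 * B[:, j + 1] if d2 > 0 else 0.0
            B_new[:, j] = left + right
        B = B_new
    return B
```

Because each basis function is supported on only $M$ adjacent knot intervals, the resulting design matrix is banded, which is the computational advantage noted above.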
For iid normal $y$, a simple linear P-spline model with the standard $\ell_2$ penalty can be written as
$$(\hat\beta_0, \hat\beta) = \operatorname*{arg\,min}_{\beta_0 \in \mathbb{R},\, \beta \in \mathbb{R}^p} \; \frac{1}{2}\|y - \beta_0 1 - F\beta\|_2^2 + \lambda \|D^{(k+1)}\beta\|_2^2, \quad (1)$$
where $\beta_0$ is the intercept, $\beta$ is a $p \times 1$ vector of parameter estimates, $1$ is an $n \times 1$ vector with each element equal to 1, $\lambda > 0$ is a smoothing parameter, and $D^{(k+1)} \in \mathbb{R}^{(p-k-1) \times p}$ is the $k + 1$ order finite difference matrix. For example, for $k = 1$,
$$D^{(2)} = \begin{bmatrix} 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & 1 & -2 & 1 \end{bmatrix}. \quad (2)$$
In general, as described by Tibshirani (2014a), the difference matrices can be defined recursively,
$$D^{(k+1)} = D^{(1)} D^{(k)}, \quad (3)$$
where $D^{(1)}$ is the first order difference matrix of appropriate dimension. Our proposed model builds on one in which the $\ell_2$ penalty in (1) is replaced with an $\ell_1$ penalty:
$$(\hat\beta_0, \hat\beta) = \operatorname*{arg\,min}_{\beta_0 \in \mathbb{R},\, \beta \in \mathbb{R}^p} \; \frac{1}{2}\|y - \beta_0 1 - F\beta\|_2^2 + \lambda \|D^{(k+1)}\beta\|_1. \quad (4)$$
This is closely related to $\ell_1$ trend filtering,
$$\hat\beta = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^n} \; \frac{1}{2}\|y - \beta\|_2^2 + \lambda \|D^{(k+1)}\beta\|_1. \quad (5)$$
Apart from requiring unique and equally spaced observations, (5) differs from (4) in that (5) has one parameter per data point, no intercept, and the identity matrix as the design matrix; $D^{(k+1)}$ is also resized appropriately by replacing $p$ with $n$ in the dimensions of (2) and (3). However, under certain conditions noted in Observation 1, (4) and (5) are identical.
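The difference matrices in (2) and (3) can be built by repeatedly differencing the rows of an identity matrix; a minimal numpy sketch (our illustration):

```python
import numpy as np

def diff_matrix(p, order):
    """(k+1)-order finite difference matrix of shape (p - order, p),
    obtained by taking order-th differences of the rows of I_p."""
    return np.diff(np.eye(p), n=order, axis=0)

D2 = diff_matrix(5, 2)  # second differences (k = 1), rows [1, -2, 1, ...]
```

The recursive definition in (3) corresponds to composing first-difference matrices of shrinking dimension.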
Let $F$ be the design matrix in (4), where $F_{ij} = \phi_j^2(x_i)$. Then from the previous result, we have $F = I_{n \times n}$, where $I_{n \times n}$ is the $n \times n$ identity matrix. This, together with the assumption that $\beta_0 = y(0) = 0$, implies that the objective functions (4) and (5) are identical, which proves Observation 1.
We note that Tibshirani (2014a) shows that $\ell_1$ trend filtering has a continuous representation when expressed in the standard lasso form, and Observation 1 gives a continuous representation of $\ell_1$ trend filtering when expressed in generalized lasso form. Ramdas and Tibshirani (2016) developed an algorithm to extend $\ell_1$ trend filtering to irregularly spaced data. It might also be possible to extend $\ell_1$ trend filtering to repeated measures data to account for within-subject correlations. However, due to Observation 1, we think it is beneficial to view $\ell_1$ trend filtering as a special case of P-splines with an $\ell_1$ penalty. We think this approach has the potential to be a general framework, because higher order B-splines could be used in combination with different order difference matrices, just as can be done with P-splines that use the standard $\ell_2$ penalty. Furthermore, expressing $\ell_1$ trend filtering as P-splines with an $\ell_1$ penalty may facilitate the development of confidence bands (see Section 7), which could help to fill a gap in the $\ell_1$ penalized regression literature.
In addition, there are connections between P-splines with an $\ell_1$ penalty and locally adaptive regression splines. In particular, as Tibshirani (2014a) shows, the continuous analogue of $\ell_1$ trend filtering is identical to locally adaptive regression splines (Mammen et al., 1997) for $k = 0, 1$, and asymptotically equivalent for $k \ge 2$.
Proposed model: additive mixed model using P-splines with an $\ell_1$ penalty

To introduce our model, let $y_i = (y_{i1}, \ldots, y_{in_i})^T$ be an $n_i \times 1$ vector of responses for subject $i = 1, \ldots, N$, and let $y = (y_1^T, \ldots, y_N^T)^T$ be the stacked $n \times 1$ vector of responses for all $N$ subjects, where $n = \sum_{i=1}^N n_i$. Let $x_i = (x_{i1}, \ldots, x_{in_i})^T$ be a corresponding $n_i \times 1$ vector of covariates for subject $i$, and $x = (x_1^T, \ldots, x_N^T)^T$ be the $n \times 1$ stacked vector of all covariate values. In many contexts, $x$ is time. To account for the within-subject correlations of $y_i$, we can incorporate random effects into the P-spline model. To that end, let $Z_i$ be an $n_i \times q_i$ design matrix for the random effects for subject $i$ (possibly including a B-spline basis), and let $b_i = (b_{i1}, \ldots, b_{iq_i})^T$ be the corresponding $q_i \times 1$ vector of random effect coefficients for subject $i$. Also, let $Z = \text{blockdiag}(Z_1, \ldots, Z_N)$ be the $n \times q$ block diagonal random effects design matrix for all subjects, where $q = \sum_{i=1}^N q_i$, and let $b = (b_1^T, \ldots, b_N^T)^T$ be the $q \times 1$ stacked vector of random effects for all subjects. We propose an additive mixed model with $j = 1, \ldots, J$ smooths:
$$\operatorname*{minimize}_{\beta_0 \in \mathbb{R},\, b \in \mathbb{R}^q,\, \tilde\beta_j \in \mathbb{R}^{p_j}} \; \frac{1}{2}\Big\|y - \beta_0 1 - \sum_{j=1}^J \tilde F_j \tilde\beta_j - Zb\Big\|_2^2 + \sum_{j=1}^J \lambda_j \big\|\tilde D_j^{(k_j+1)} \tilde\beta_j\big\|_1 + \frac{\tau}{2} b^T S b, \quad (7)$$
where $\tilde F_j$ is an $n \times p_j$ design matrix of B-spline bases for smooth $j$, $\tilde D_j^{(k_j+1)}$ is the $k_j + 1$ order finite difference matrix, and $S$ is the penalty matrix of the random effects $b$. For example, if $b$ are random intercepts, then $S = I_{N \times N}$ and $Z$ would be an $n \times N$ matrix such that $Z_{il} = 1$ if observation $i$ belonged to subject $l$ and zero otherwise. Alternatively, to obtain random curves using smoothing splines and a B-spline basis, we could set
$$S_{j,il} = \int \phi''_{ji}(t)\, \phi''_{jl}(t)\, dt,$$
where $\phi''_{j1}, \ldots, \phi''_{jp_j}$ are the second derivatives of the B-spline basis functions for the $j$th smooth. We would then set $Z$ to be the corresponding B-splines evaluated at the input points. We note that (8) includes varying-coefficient models (Hastie and Tibshirani, 1993). For example, as pointed out by Wood (2006, p. 169), if $\tilde F_1$ are B-splines evaluated at $x$, we could have $\tilde F_2 = \text{diag}(x') \tilde F_1$, where $x'$ is another covariate vector and $\text{diag}(x')$ is a diagonal matrix with $x'_i$ in the $i$th leading diagonal position.
As Wood (2006, Section 1.8.1) shows, to impose the centering constraint $1^T \tilde F_j \tilde\beta_j = 0$ on each non-varying-coefficient smooth $j \in E$ (the varying-coefficient smooths $j \in \bar E$ are left unconstrained), a matrix $Q_j$ can be obtained by taking the QR decomposition of $\tilde F_j^T 1$ and retaining the last $p_j - 1$ columns of the left orthonormal matrix.¹ We can then re-parameterize the $p_j$ constrained parameters $\tilde\beta_j$ in terms of the $p_j - 1$ unconstrained parameters $\beta_j$, such that $\tilde\beta_j = Q_j \beta_j$. For $j \in E$, let $F_j = \tilde F_j Q_j$ and $D_j = \tilde D_j^{(k_j+1)} Q_j$. For $j \in \bar E$, let $F_j = \tilde F_j$ and $D_j = \tilde D_j^{(k_j+1)}$. If centering the random effects, then we redefine $S := Q_{J+1}^T S Q_{J+1}$ and $Z := Z Q_{J+1}$. Then we can re-write (7) in the identifiable form
$$\operatorname*{minimize}_{\beta_0 \in \mathbb{R},\, b \in \mathbb{R}^q,\, \beta_j \in \mathbb{R}^{p'_j},\, j = 1, \ldots, J} \; \frac{1}{2}\Big\|y - \beta_0 1 - \sum_{j=1}^J F_j \beta_j - Zb\Big\|_2^2 + \sum_{j=1}^J \lambda_j \|D_j \beta_j\|_1 + \frac{\tau}{2} b^T S b, \quad (8)$$
where $p'_j = p_j - 1$ for $j \in E$ and $p'_j = p_j$ for $j \in \bar E$.

¹The matrices $1^T \tilde F_j$, $j = 1, \ldots, J$ are of rank 1, so the remaining $p_j - 2$ columns are arbitrary orthonormal vectors. In R (R Core Team, 2017), when taking the QR decomposition of $\tilde F^T 1$, an appropriate matrix $Q$ can be obtained as Q <- qr.Q(qr(colSums(F_tilde)), complete = TRUE)[, -1].

We note that the penalty matrix $S$ given above for random subject-specific splines defines non-zero correlation between nearby within-subject random effect coefficients. This is in contrast to the approach of Ruppert et al. (2003) for estimating subject-specific random curves, which focuses on the case in which nearby within-subject coefficients are not correlated. To see this, let $\hat d_i(x) = \sum_{j=1}^{q_i} \hat b_{ij} \phi_{ij}(x)$ be the estimated difference between the $i$th subject-specific curve and the marginal mean at point $x$. The smoothing spline approach above constrains $\int (\hat d''_i)^2(x)\, dx = \hat b_i^T S_i \hat b_i < C$ for some constant $C > 0$, whereas the approach of Ruppert et al. (2003) constrains $\hat b_i^T \hat b_i < C$. Whereas the non-diagonal penalty matrix $S$ implies correlations between nearby coefficients, the identity matrix in the approach of Ruppert et al. (2003) implies zero correlation.
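The centering re-parameterization can be checked numerically. This small sketch (using numpy rather than the R call in the footnote, with a random stand-in for the B-spline design matrix) builds $Q$ from the QR decomposition of $\tilde F^T 1$ and verifies that every re-parameterized fit is centered:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 7
F_tilde = rng.standard_normal((n, p))        # stand-in for a B-spline design matrix
a = (F_tilde.T @ np.ones(n)).reshape(-1, 1)  # F^T 1, a p x 1 matrix
Q_full, _ = np.linalg.qr(a, mode='complete') # p x p orthonormal matrix
Q = Q_full[:, 1:]                            # drop the first column: p x (p - 1)
F = F_tilde @ Q                              # centered design matrix
```

The last $p - 1$ columns of the complete QR factor are orthogonal to $\tilde F^T 1$, so $1^T \tilde F Q \beta = 0$ for every $\beta$.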
Similar to the equivalence between Bayesian models and $\ell_2$ penalized smoothing splines (Wahba, 1990), there is an equivalence between Bayesian models and $\ell_1$ penalized splines. In particular, (8) is equivalent to the following distributional assumptions, which we can use to obtain Bayesian estimates:
$$y \mid \beta_0, \beta_1, \ldots, \beta_J, b \sim N\Big(\beta_0 1 + \sum_{j=1}^J F_j \beta_j + Zb, \; \sigma^2 I_{n \times n}\Big), \qquad b \sim N(0, \sigma_b^2 S^{-}),$$
$$(D_j \beta_j)_l \overset{iid}{\sim} \text{Laplace}(0, \sigma^2/\lambda_j), \quad j = 1, \ldots, J,$$
where $S^{-}$ is a generalized inverse of $S$. The last distributional assumption is an element-wise Laplace prior on the $k_j + 1$ order differences in coefficients.
In some cases, the random effects penalty matrix $S$ may be positive semidefinite but not invertible. For example, the smoothing spline random curves outlined above lead to a penalty matrix $S$ that is not strictly positive definite, but that is still positive semidefinite. This does not cause problems for the ADMM algorithm, but some changes are required for other algorithms as well as for Bayesian estimation. Following Wood (2006, Section 6.6.1), let $S = U \Lambda U^T$ be the eigendecomposition of a positive semidefinite matrix $S$, where $U U^T = I_{q \times q}$ and $\Lambda$ is a diagonal matrix with eigenvalues in descending order in the diagonal positions. Let $\tilde b = U^T b$ and $\tilde Z = Z U$, so that $b^T S b = \tilde b^T \Lambda \tilde b$ and $\tilde Z \tilde b = Z b$. Let $q_r$ be the number of strictly positive eigenvalues of $S$, where $0 < q_r < q$, and let $\Lambda_r$ be the $q_r \times q_r$ upper left portion of $\Lambda$. We can partition $\tilde b$ as $\tilde b = (\tilde b_r^T, \tilde b_f^T)^T$, where $\tilde b_r$ is a $q_r \times 1$ vector of penalized coefficients and $\tilde b_f$ is a $q_f \times 1$ vector of unpenalized coefficients, where $q_r + q_f = q$. Then $\tilde b^T \Lambda \tilde b = \tilde b_r^T \Lambda_r \tilde b_r$, and it follows that $\tilde b_r \sim N(0, \sigma_b^2 \Lambda_r^{-1})$ and $p(\tilde b_f) \propto 1$. However, allowing for unconstrained random effect parameters leads to identifiability issues. Therefore, in practice if $q_f > 0$, we recommend taking $\tilde b_f \sim N(0, \sigma_f^2 I)$ with a normal or Cauchy prior on $\sigma_f$ and constraints to ensure $\sigma_f > 0$, or a diffuse prior on $\log(\sigma_f)$ without constraints. The Cauchy prior may be a preferable first choice, as it provides a weaker penalty and is similar to the recommendations of Gelman et al. (2008) for logistic regression. However, in some cases, such as in Section 9, it is necessary to use a normal prior.
To further improve the computational efficiency of Monte Carlo sampling methods, we can partition $\tilde Z$ into $\tilde Z = [\tilde Z_r, \tilde Z_f]$, where $\tilde Z_r$ contains the first $q_r$ columns of $\tilde Z$ and $\tilde Z_f$ contains the remaining $q_f$ columns. We then set $\check b_r = \Lambda_r^{1/2} \tilde b_r$ and $\check Z_r = \tilde Z_r \Lambda_r^{-1/2}$, so that $\check Z_r \check b_r = \tilde Z_r \tilde b_r$ and $\check b_r \sim N(0, \sigma_b^2 I)$, which allows for more efficient sampling.
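This re-scaling can be sanity-checked numerically: the exponents must pair so that the product is unchanged while the whitened coefficients have a spherical covariance. A small numpy sketch (our illustration, with arbitrary dimensions and a random rank-deficient penalty matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
q, q_r = 6, 4
G = rng.standard_normal((q, q_r))
S = G @ G.T                              # rank-deficient PSD penalty matrix
lam, U = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]            # eigenvalues in descending order
lam, U = lam[order], U[:, order]
Z = rng.standard_normal((8, q))
Zt = Z @ U                               # Z-tilde = Z U
bt = rng.standard_normal(q)              # a draw standing in for b-tilde
# whitening: b-check_r = Lam_r^{1/2} b-tilde_r, Z-check_r = Z-tilde_r Lam_r^{-1/2},
# so the product Z-check_r b-check_r equals Z-tilde_r b-tilde_r
br_check = np.sqrt(lam[:q_r]) * bt[:q_r]
Zr_check = Zt[:, :q_r] / np.sqrt(lam[:q_r])
```

Since $\tilde b_r \sim N(0, \sigma_b^2 \Lambda_r^{-1})$, the scaled vector $\Lambda_r^{1/2} \tilde b_r$ has covariance $\sigma_b^2 I$, and the compensating factor on $\tilde Z_r$ leaves the linear predictor unchanged.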

Related work
There are many nonparametric and semiparametric methods for analyzing repeated measures data. For an overview, please see Fitzmaurice et al. (2008, Part III). However, most existing methods use an $\ell_2$ penalty (e.g. Rice and Wu, 2001; Guo, 2002; Chen and Wang, 2011; Scheipl et al., 2015). Focusing on the optimization problem, our method puts a generalized lasso penalty (Tibshirani, 1996) on the fixed effects and a quadratic penalty on the random effects. Unlike the elastic net (Zou and Hastie, 2005), we do not mix the $\ell_1$ and $\ell_2$ penalties on the same parameters, though this could be done in the future.
While not developed for analyzing repeated measures, the fused lasso additive model (FLAM) (Petersen et al., 2016) is similar to ours. FLAM solves the following problem:
$$\operatorname*{minimize}_{\theta_0 \in \mathbb{R},\, \theta_j \in \mathbb{R}^n} \; \frac{1}{2}\Big\|y - \theta_0 1 - \sum_{j=1}^J \theta_j\Big\|_2^2 + \lambda \sum_{j=1}^J \Big( \alpha \big\|D^{(1)} \theta_j\big\|_1 + (1 - \alpha) \|\theta_j\|_2 \Big), \quad (9)$$
where $0 \le \alpha \le 1$ specifies the balance between fitting piecewise constant functions ($\alpha = 1$) and inducing sparsity on the selected smooths ($\alpha = 0$). From Observation 1, we see that (9) is equivalent to our model (8) when: $\alpha = 1$, there is $J = 1$ smooth, our design matrix has $p = n$ columns, our B-spline bases have appropriately chosen knots, and our model has no random effects. As Petersen et al. (2016) show, FLAM can be a very useful method for modeling additive phenomena, and as with the fused lasso (Tibshirani et al., 2005), jumps in the piecewise constant fits have the advantage of being interpretable. We also mention the sparse additive model (SpAM) (Ravikumar et al., 2009) and the sparse partially linear additive model (SPLAM) (Lou et al., 2016). SpAM fits an additive model and uses a group lasso penalty (Yuan and Lin, 2006) to induce sparsity on the number of active smooths. SPLAM fits a partially linear additive model and uses a hierarchical group lasso penalty (Zhao et al., 2009) to induce sparsity in the selected predictors and to control the number of nonlinear features.
One notable difference between our model and FLAM, SpAM, and SPLAM is that we allow for multiple smoothing parameters. In our applied experience with additive models and standard $\ell_2$ penalties, we have found that in practice it can be important to allow for multiple smoothing parameters, particularly when the quantities of interest are the individual smooths as opposed to the overall prediction. This is equivalent to allowing each smooth to have a different variance. However, this flexibility comes at a cost: estimating multiple smoothing parameters is currently the greatest challenge in fitting our proposed model. Perhaps due in part to these computational difficulties, several other authors also assume a single smoothing parameter in high-dimensional additive models (e.g. Lin et al., 2006; Meier et al., 2009).
There are fast and stable methods for fitting multiple smoothing parameters for $\ell_2$ penalties paired with exponential family and quasilikelihood loss functions, notably the work of Wood (2004) using generalized cross validation (GCV) and Wood (2011) using restricted maximum likelihood. Furthermore, Wood et al. (2015) extend these methods to larger datasets, and Wood et al. (2016) extend these methods to likelihoods outside the exponential family and quasilikelihood form. However, similarly computationally efficient methods do not yet exist for fitting multiple smoothing parameters for $\ell_1$ penalties.
In addition to allowing for multiple smoothing parameters, we also propose approximate inferential methods, which are not typically provided for $\ell_1$ penalized models. Yuan and Lin (2006), Ravikumar et al. (2009), Lou et al. (2016), and Petersen et al. (2016) focus on prediction and provide bounds on the prediction risk and related quantities. These are important results, and we think that distributional results for individual parameters and smooths will also be useful to practitioners.
We also note that Eilers (2000) and Bollaerts et al. (2006) discuss a variant of P-splines for quantile regression, in which the $\ell_1$ norm is used in both the loss and penalty functions. However, we are not aware of existing P-spline methods that combine an $\ell_1$ penalty with an $\ell_2$ loss function.

Regression parameters and random effects
To fit (8), we use the alternating direction method of multipliers (ADMM) (see Boyd et al., 2011). ADMM has the advantage of being scalable to large datasets. To formulate (8) for ADMM, we introduce constraint terms $w_j$ and re-write the optimization problem as
$$\operatorname*{minimize}_{\beta_0, b, \beta_j, w_j} \; \frac{1}{2}\Big\|y - \beta_0 1 - \sum_{j=1}^J F_j \beta_j - Zb\Big\|_2^2 + \sum_{j=1}^J \lambda_j \|w_j\|_1 + \frac{\tau}{2} b^T S b \quad \text{subject to } D_j \beta_j = w_j, \; j = 1, \ldots, J. \quad (10)$$
The augmented Lagrangian in scaled form (using $u_j$ to denote the scaled dual variables) is
$$L_\rho = \frac{1}{2}\Big\|y - \beta_0 1 - \sum_{j=1}^J F_j \beta_j - Zb\Big\|_2^2 + \sum_{j=1}^J \lambda_j \|w_j\|_1 + \frac{\tau}{2} b^T S b + \frac{\rho}{2} \sum_{j=1}^J \Big( \|D_j \beta_j - w_j + u_j\|_2^2 - \|u_j\|_2^2 \Big).$$
ADMM is an iterative algorithm, and we re-estimate the parameters for updates $m = 1, 2, \ldots$ until convergence. It is straightforward to derive the $m + 1$ updates (see Boyd et al., 2011, Section 6.4.1):
$$(\beta_0^{m+1}, \beta_1^{m+1}, \ldots, \beta_J^{m+1}, b^{m+1}) = \operatorname*{arg\,min}_{\beta_0, \beta_j, b} \; \frac{1}{2}\Big\|y - \beta_0 1 - \sum_{j} F_j \beta_j - Zb\Big\|_2^2 + \frac{\tau}{2} b^T S b + \frac{\rho}{2} \sum_{j} \|D_j \beta_j - w_j^m + u_j^m\|_2^2,$$
$$w_j^{m+1} = \operatorname{soft}\big(D_j \beta_j^{m+1} + u_j^m, \; \lambda_j/\rho\big), \qquad u_j^{m+1} = u_j^m + D_j \beta_j^{m+1} - w_j^{m+1},$$
where $\operatorname{soft}(a, \kappa) = \operatorname{sign}(a)\max(|a| - \kappa, 0)$ is applied element-wise. For stopping criteria, we use the primal and dual residuals ($r^m$ and $s^m$, respectively),
$$r^m = \big( (D_1 \beta_1^m - w_1^m)^T, \ldots, (D_J \beta_J^m - w_J^m)^T \big)^T, \qquad s^m = \rho \big( (D_1^T (w_1^m - w_1^{m-1}))^T, \ldots, (D_J^T (w_J^m - w_J^{m-1}))^T \big)^T,$$
where $k = \sum_{j=1}^J k_j$, $p = \sum_{j=1}^J p_j - |E|$, and $|E|$ is the cardinality of $E$. Following the guidance of Boyd et al. (2011), we stop when $\|r^m\|_2 \le \epsilon^{\text{pri}}$ and $\|s^m\|_2 \le \epsilon^{\text{dual}}$, where $\epsilon^{\text{pri}}$ and $\epsilon^{\text{dual}}$ combine an absolute tolerance $\epsilon^{\text{abs}}$, scaled by the square root of the dimension of the corresponding residual, with a relative tolerance $\epsilon^{\text{rel}}$ scaled by the magnitudes of the iterates. By default, we set $\epsilon^{\text{rel}} = \epsilon^{\text{abs}} = 10^{-4}$ and the maximum number of iterations at 1,000.
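To make the updates concrete, here is a minimal ADMM sketch for a single smooth with no random effects (with an identity design matrix it reduces to the fused-lasso/trend-filtering special case of Observation 1). The function name, fixed iteration count, and defaults are ours; a production implementation would use the residual-based stopping rule above:

```python
import numpy as np

def soft(a, kappa):
    """Element-wise soft-thresholding operator."""
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

def admm_l1_pspline(y, F, D, lam, rho=1.0, n_iter=1000):
    """ADMM sketch for min_b 0.5||y - F b||^2 + lam ||D b||_1."""
    w = np.zeros(D.shape[0])
    u = np.zeros(D.shape[0])
    A = F.T @ F + rho * D.T @ D   # fixed ridge system for the beta-update
    for _ in range(n_iter):
        beta = np.linalg.solve(A, F.T @ y + rho * D.T @ (w - u))
        w = soft(D @ beta + u, lam / rho)     # w-update: soft-thresholding
        u = u + D @ beta - w                  # scaled dual update
    return beta, w, u
```

In practice the factorization of the ridge system would be cached across iterations, which is what makes the $\beta$-update cheap.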

Smoothing parameters
To estimate $\lambda_1, \ldots, \lambda_J$ and $\tau$, we compute cross validation (CV) error for a path of values one smoothing parameter at a time. In the CV, we split the sample at the subject level, as opposed to individual observations, and ensure that there are at least two subjects in each fold per unique combination of factor covariates. First, we estimate a path for $\tau$ with $\lambda_1, \ldots, \lambda_J$ set to 0. Then we fix $\tau$ at the value that minimizes CV error and compute a path for $\lambda_1$, setting it to the value that minimizes CV error, and so on. We fit a path for each $\lambda_j$ from $\lambda_j^{\max}$ to $10^{-5} \lambda_j^{\max}$, evenly spaced on the log scale, where $\lambda_j^{\max}$ is the smallest value at which $D_j \hat\beta_j = 0$. By taking the sub-differential of (8) with respect to $\beta_j$ and setting $D_j \beta_j$ to 0, we can solve for $\lambda_j^{\max}$, where for a vector $a$, $\|a\|_\infty = \max_j |a_j|$. We also use warm starts, passing starting values separately for each fold, though warm starts appear to be minimally beneficial with ADMM. We set $\rho = \min(\max(\lambda_1, \ldots, \lambda_J), c)$ at each iteration for some constant $c > 0$ (e.g. $c = 5$). When the number of smooths $J$ is small (e.g. $J \le 2$), a grid search is also feasible.
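A subject-level split (rather than an observation-level split) can be sketched as follows. This is an illustrative helper of ours, not the authors' implementation, and it ignores the factor-covariate balancing requirement described above:

```python
import numpy as np

def subject_folds(subject_ids, K, seed=0):
    """Split subjects (not observations) into K CV folds; returns a list of
    arrays of observation indices, one array per fold."""
    rng = np.random.default_rng(seed)
    subs = rng.permutation(np.unique(subject_ids))
    groups = np.array_split(subs, K)
    return [np.flatnonzero(np.isin(subject_ids, g)) for g in groups]
```

Splitting at the subject level keeps all of a subject's observations in the same fold, so the within-subject correlation cannot leak between training and test sets.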
We cannot use the training sample to estimate the random effect parameters b for the test sample, because these parameters are subject-specific and the test subjects are not included in the training sample. Instead, we use the training sample to obtain estimates for the fixed effect parameters β 0 , β j , j = 1, . . . , J and then use the test sample to estimate the random effects.
To make our approach clear, we first fix notation. Let $T_r \subseteq \{1, \ldots, n\}$ be the row indices for the observations in the test sample for both the fixed and random effect design matrices $F_j$, $j = 1, \ldots, J$, and $Z$. Also, let $T_c \subseteq \{1, \ldots, q\}$ be the column indices of $Z$ for observations in the test sample, and let $T = (T_r, T_c)$ be the tuple of row and column indices designating the test sample. Let $F_{j,T}$ and $F_{j,-T}$ be matrix $F_j$ with only the rows indexed by $T_r$ retained and removed, respectively. Similarly, let $Z_T$ and $Z_{-T}$ be matrix $Z$ with only the rows indexed by $T_r$ and columns indexed by $T_c$ retained ($Z_T$) and removed ($Z_{-T}$), respectively. Let $S_T$ and $S_{-T}$ be matrix $S$ with only the rows and columns indexed by $T_c$ retained and removed, respectively. Also, let $y_T$ and $y_{-T}$ be vector $y$ with elements indexed by $T_r$ retained and removed, respectively.
We obtain out-of-sample marginal estimates as $\hat\mu_T = \hat\beta_0 1 + \sum_{j=1}^J F_{j,T} \hat\beta_j$, where $\hat\beta_0$ and $\hat\beta_j$, $j = 1, \ldots, J$ are estimated with $y_{-T}$, $F_{j,-T}$, and $Z_{-T}$. We then estimate subject-specific random effects as
$$\hat b_T = \big(Z_T^T Z_T + \tau S_T\big)^{-1} Z_T^T (y_T - \hat\mu_T)$$
and obtain the out-of-sample prediction residuals as $r_T = y_T - \hat\mu_T - Z_T \hat b_T$. Letting $T_k$ be the tuple of indices for test sample (fold) $k = 1, \ldots, K$, we obtain the CV error as $\sum_{k=1}^K \|r_{T_k}\|_2^2$.
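Under the quadratic random-effects penalty, the test-sample random effects solve a small ridge system; a numpy sketch of this step (the ridge form is our reading of the penalty in (8), and the function name is ours):

```python
import numpy as np

def estimate_test_random_effects(y_T, mu_T, Z_T, S_T, tau):
    """Ridge estimate of test-sample random effects given fixed-effect
    predictions mu_T: solves (Z^T Z + tau S) b = Z^T (y - mu)."""
    A = Z_T.T @ Z_T + tau * S_T
    return np.linalg.solve(A, Z_T.T @ (y_T - mu_T))
```

Because the fixed effects are estimated on the training folds only, the test-sample residual $r_T$ still reflects out-of-sample fit of the smooths, even though $\hat b_T$ uses the test responses.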

Degrees of freedom
In this section, we obtain the degrees of freedom, with the primary goal of estimating variance (see Section 7.1). However, we note that degrees of freedom does not always align with a model's complexity in terms of its tendency to overfit the data (Janson et al., 2015).

Stein's method
Let $g(y) = \hat y$, where $g : \mathbb{R}^n \to \mathbb{R}^n$ is the model fitting procedure. For $y \sim N(\mu, \sigma^2 I)$, the degrees of freedom is defined as (see Efron, 1986; Hastie and Tibshirani, 1990)
$$\text{df}(g) = \frac{1}{\sigma^2} \sum_{i=1}^n \text{Cov}(g_i(y), y_i). \quad (11)$$
As Tibshirani (2014a) notes, (11) is motivated by the fact that the risk $\text{Risk}(g) = E\|g(y) - \mu\|_2^2$ can be decomposed as
$$\text{Risk}(g) = E\|y - g(y)\|_2^2 - n\sigma^2 + 2\sum_{i=1}^n \text{Cov}(g_i(y), y_i).$$
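For a linear smoother $g(y) = Hy$, the divergence is $\operatorname{tr}(H)$ for every $y$, so Stein's estimate recovers the familiar trace formula exactly. A quick numerical check (our example, with an arbitrary small ridge smoother):

```python
import numpy as np

n, lam = 30, 2.0
x = np.linspace(0, 1, n)
F = np.vander(x, 4, increasing=True)    # small polynomial design matrix
H = F @ np.linalg.solve(F.T @ F + lam * np.eye(4), F.T)

def g(y):
    return H @ y                        # a linear smoother

# divergence of g at any y: sum_i dg_i/dy_i = sum_i H_ii = tr(H)
div = sum(H[i, i] for i in range(n))
```

The interesting cases in this paper are the non-linear $\ell_1$ fits, where the divergence must be computed from the active set rather than from a fixed hat matrix.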
Therefore, the degrees of freedom (11) corresponds to the difference between risk and expected training error. Furthermore, if $g$ is continuous and weakly differentiable, then $\text{df}(g) = E[\nabla \cdot g(y)]$ (Stein, 1981), where $\nabla \cdot g = \sum_{i=1}^n \partial g_i / \partial y_i$ is the divergence of $g$. Therefore, an unbiased estimate of $\text{df}(g)$ (also used in Stein's unbiased risk estimate (Stein, 1981)) is
$$\hat{\text{df}}(g) = \nabla \cdot g(y) = \sum_{i=1}^n \frac{\partial g_i(y)}{\partial y_i}. \quad (12)$$
To obtain an estimate of degrees of freedom, we transform the generalized lasso component of our model to standard form, similar to the approach of Petersen et al. (2016). To do so, we use the following matrices described by Tibshirani (2014b). Let
$$\tilde D_j^* = \begin{bmatrix} \tilde D_{j,1}^{(0)} \\ \tilde D_{j,1}^{(1)} \\ \vdots \\ \tilde D_{j,1}^{(k_j)} \\ \tilde D_j^{(k_j+1)} \end{bmatrix},$$
where $\tilde D_{j,1}^{(i)}$ is the first row of the finite difference matrix $\tilde D_j^{(i)}$, and $\tilde D_j^{(0)} = I_{p'_j \times p'_j}$ is the identity matrix, where as before, $p'_j = p_j - 1$ if $j \in E$ (non-varying coefficient smooths) and $p'_j = p_j$ if $j \in \bar E$ (varying coefficient smooths). As shown by Tibshirani (2014b), $\tilde D_j^*$ is invertible; denote its inverse by $M_j = (\tilde D_j^*)^{-1}$. Assuming our outcome $y$ is centered, so that $\beta_0 = y(0) = 0$, and letting $V_j = F_j M_j$, $D_j^* = \tilde D_j^* Q_j$ for $j \in E$ and $D_j^* = \tilde D_j^*$ for $j \in \bar E$, and $\alpha_j = D_j^* \beta_j$, we can write the penalized log likelihood (8) as
$$\operatorname*{minimize}_{b, \alpha_1, \ldots, \alpha_J} \; \frac{1}{2}\Big\|y - \sum_{j=1}^J V_j \alpha_j - Zb\Big\|_2^2 + \sum_{j=1}^J \lambda_j \sum_{l > k_j + 1} |\alpha_{j,l}| + \frac{\tau}{2} b^T S b. \quad (13)$$
To avoid difficulties later differentiating with respect to the $\ell_1$ norm, we remove the non-active $\ell_1$ penalized coefficients from (13). We also form the concatenated design matrix $V = [V_1, \ldots, V_J]$ and will need to index the active set of $V$. To these ends, let $A_j = \{l \in \{k_j + 2, \ldots, p'_j\} : \hat\alpha_{j,l} \neq 0\}$ be the active set of the penalized coefficients for smooth $j$, and let $A_j^* = \{1, \ldots, k_j + 1\} \cup A_j$ be the active set for smooth $j$ augmented with the unpenalized coefficients. Also, for a set $A_j$ and constant $c \in \mathbb{R}$, let $A_j + c = \{i + c : i \in A_j\}$ be the set of elements in $A_j$ shifted by $c$. Now let $A^* = \bigcup_{j=1}^J (A_j^* + \sum_{l=0}^{j-1} p'_l)$ be the augmented active set of $V$, where $p'_0 = 0$ and $p'_j$, $j = 1, \ldots, J$ are the number of columns in $V_j$ (equivalently $F_j$). Finally, let $V_{A^*}$ be matrix $V$ subset to retain only those columns indexed by $A^*$.
Similarly, let $\hat\alpha = (\hat\alpha_1^T, \ldots, \hat\alpha_J^T)^T$ be the concatenated vector of estimated coefficients, and let $\hat\alpha_{A^*}$ be vector $\hat\alpha$ subset to retain only elements indexed by $A^*$. Then we can write the estimated penalized loss (13) as
$$\frac{1}{2}\|y - V_{A^*} \hat\alpha_{A^*} - Z\hat b\|_2^2 + \sum_{j=1}^J \lambda_j \sum_{l \in A_j} |\hat\alpha_{j,l}| + \frac{\tau}{2} \hat b^T S \hat b. \quad (14)$$
Taking the derivative of (14), and keeping in mind that the first $k_j + 1$ elements of each $\hat\alpha_j$ are unpenalized and $|\hat\alpha_{j,l}| > 0$ for all $l \in A_j$, we obtain the stationarity condition
$$\begin{bmatrix} V_{A^*}^T \\ Z^T \end{bmatrix} \big( y - V_{A^*} \hat\alpha_{A^*} - Z\hat b \big) = \begin{bmatrix} \eta \\ \tau S \hat b \end{bmatrix}, \quad (15)$$
where $\eta = (\eta_1^T, \ldots, \eta_J^T)^T$ with $\eta_j = \lambda_j \big( 0_{k_j+1}^T, \operatorname{sign}(\hat\alpha_{A_j})^T \big)^T$, $0_{k_j+1}$ is a $(k_j + 1) \times 1$ vector of zeros, and the sign operator is taken element-wise.
From Tibshirani and Taylor (2012, Lemmas 6 and 9), we know that within a small neighborhood of $y$, the active set $A$ and the signs of the fitted terms $\hat\alpha_{A^*}$ are constant with respect to $y$, except for $y$ in a set of measure zero. Therefore, $\partial \eta / \partial y = 0_{|A^*| \times n}$, where $0_{|A^*| \times n}$ is an $|A^*| \times n$ matrix of zeros and $|A^*|$ is the cardinality of $A^*$. Letting $A = [V_{A^*}, Z]$ and $\Omega = \text{blockdiag}(0_{|A^*| \times |A^*|}, \tau S)$, and taking the derivative of (15) with respect to $y$, we have
$$A^T A \begin{bmatrix} \partial \hat\alpha_{A^*} / \partial y \\ \partial \hat b / \partial y \end{bmatrix} = A^T - \Omega \begin{bmatrix} \partial \hat\alpha_{A^*} / \partial y \\ \partial \hat b / \partial y \end{bmatrix}.$$
Solving for the derivatives of the estimated coefficients, we have
$$\begin{bmatrix} \partial \hat\alpha_{A^*} / \partial y \\ \partial \hat b / \partial y \end{bmatrix} = (A^T A + \Omega)^{-1} A^T.$$
From Tibshirani and Taylor (2012, Lemmas 1 and 8), we know that $g(y) = \hat y$ is continuous and weakly differentiable. Also, $\nabla \cdot g = \text{tr}(\partial \hat y / \partial y)$. Therefore, we can use Stein's formula (12) to estimate the degrees of freedom as
$$\hat{\text{df}} = \text{tr}\big( A (A^T A + \Omega)^{-1} A^T \big) + 1, \quad (16)$$
where we add 1 for the intercept. We note that this result is similar to the degrees of freedom for the elastic net (see the remark on page 18 of Tibshirani and Taylor, 2012) as well as for FLAM (Petersen et al., 2016).
To obtain degrees of freedom for individual smooths $j = 1, \ldots, J$, let $E_j$ be an $(|A^*| + q) \times (|A^*| + q)$ matrix with 1s on the diagonal positions indexed by $A_j^* + \sum_{l=0}^{j-1} |A_l^*|$ and zeros elsewhere, where $|A_j^*|$ is the cardinality of $A_j^*$ and $A_0^* = \emptyset$. Also, let $\hat f_j = V_j \hat\alpha_j$ be the estimate of the $j$th smooth. Then, following Ruppert et al. (2003), we estimate the degrees of freedom for smooth $j$ as
$$\hat{\text{df}}_j = \text{tr}\big( E_j (A^T A + \Omega)^{-1} A^T A \big). \quad (17)$$
In other words, the degrees of freedom for smooth $j$ is the sum of the diagonal elements of $(A^T A + \Omega)^{-1} A^T A$ indexed by $A_j^* + \sum_{l=0}^{j-1} |A_l^*|$. We note that when using the ADMM algorithm, or most likely any proximal algorithm, the fitted $D_j \hat\beta_j$, or equivalently $\hat\alpha_j$, will typically have several very small non-zero values and will not typically be sparse. However, the vector $\hat w_j$ is sparse, where in the ADMM algorithm we constrain $w_j = D_j \beta_j$. Therefore, in practice we use $\hat w_j$ to obtain the active set $A_j$.

Stable and fast approximations
In some cases, such as the application in Section 9, the estimates based on Stein's method (16) and (17) cannot be computed due to numerical instability. In this section, we propose alternatives that are more numerically stable and also more computationally efficient.

Based on restricted derivatives
In this approach, we take derivatives of the fitted values restricted to individual smooths. In particular, from Section 6.1, we see that
$$\frac{\partial \hat f_j}{\partial y} \approx V_{j,A_j^*} \big( V_{j,A_j^*}^T V_{j,A_j^*} \big)^{-1} V_{j,A_j^*}^T \quad \text{and} \quad \frac{\partial (Z\hat b)}{\partial y} \approx Z \big( Z^T Z + \tau S \big)^{-1} Z^T.$$
We can then approximate the degrees of freedom for each individual smooth and for the random effects by
$$\hat{\text{df}}_j^{\text{res}} = \text{tr}\Big( V_{j,A_j^*} \big( V_{j,A_j^*}^T V_{j,A_j^*} \big)^{-1} V_{j,A_j^*}^T \Big) = |A_j^*| \quad \text{and} \quad \hat{\text{df}}_b^{\text{res}} = \text{tr}\big( Z (Z^T Z + \tau S)^{-1} Z^T \big). \quad (18)$$
We estimate the overall degrees of freedom as
$$\hat{\text{df}}^{\text{res}} = 1 + \sum_{j=1}^J \hat{\text{df}}_j^{\text{res}} + \hat{\text{df}}_b^{\text{res}}, \quad (19)$$
where we add 1 for the intercept. This approach is similar to one described by Ruppert et al. (2003, p. 176), though in a different context and for a different purpose. In particular, whereas we use this approach to approximate the degrees of freedom after fitting the model, Ruppert et al. (2003) use it to set the degrees of freedom before fitting the model in the context of $\ell_2$ penalized loss functions.

Based on ADMM constraint parameters
In this approach, we propose estimates of degrees of freedom specific to the ADMM algorithm. As in the previous section, this approach is based on estimates for the individual smooths. Consider the model with $J = 1$ smooth, no random effects, and centered $y$:
$$\operatorname*{minimize}_{\beta} \; \frac{1}{2}\|y - F\beta\|_2^2 + \lambda \|D\beta\|_1.$$
Suppose we make the centering constraints described in Section 3, i.e., for $j \in E$, we set $F = \tilde F Q$ and $D = \tilde D^{(k+1)} Q$ for an $n \times p$ design matrix $\tilde F$, a $k + 1$ order finite difference matrix $\tilde D^{(k+1)}$, and an orthonormal $p \times (p-1)$ matrix $Q$. Let $A = \{l \in \{1, \ldots, p - k - 1\} : (D\hat\beta)_l \neq 0\}$ be the active set, and let $|A|$ be its cardinality. In our context, we expect the design matrices $F$ to be full rank, in which case Theorem 3 of Tibshirani and Taylor (2012) (see the first Remark) states that the degrees of freedom is given by $\text{df} = E[\text{nullity}(D_{-A})]$. Here, $\text{nullity}(D)$ is the dimension of the null space of matrix $D$, and $D_{-A}$ is matrix $D$ with rows indexed by $A$ removed. Now, $D$ has dimensions $(p - k - 1) \times (p - 1)$, and we can see by inspection that for all $k < p - 1$ the rows of $D$ are linearly independent. Therefore, the rank of $D_{-A}$ is equal to its number of rows, $p - k - 1 - |A|$, and the nullity is equal to the number of columns, $p - 1$, minus the number of rows. This gives $\hat{\text{df}} = \text{nullity}(D_{-A}) = k + |A|$ for centered smooths, i.e., the number of non-zero elements of $D\hat\beta$ plus one less than the order of the difference penalty. This is similar to the result for $\ell_1$ trend filtering, but we have lost one degree of freedom due to the constraint that $1^T \tilde F \tilde\beta = 0$. For uncentered smooths, $D$ has dimensions $(p - k - 1) \times p$, which gives $\hat{\text{df}} = \text{nullity}(D_{-A}) = k + 1 + |A|$.
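The nullity count can be verified numerically: removing $|A|$ rows from a full-row-rank $(k+1)$-order difference matrix leaves nullity $k + 1 + |A|$ in the uncentered case. A small check (our example, with a hypothetical active set):

```python
import numpy as np

p, k = 10, 1
D = np.diff(np.eye(p), n=k + 1, axis=0)   # (p - k - 1) x p, full row rank
A = [3, 5]                                 # hypothetical active set indices
D_minus_A = np.delete(D, A, axis=0)        # remove the active rows
nullity = p - np.linalg.matrix_rank(D_minus_A)
```

Because each row of a difference matrix has its leading non-zero entry in a distinct column, any subset of rows remains linearly independent, which is what makes the rank count exact.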
As before, we note that in the ADMM algorithm, $D\hat\beta$ will not generally be sparse, as ADMM is a proximal algorithm. However, the corresponding $\hat w$ is sparse, where in the optimization problem we constrain $D\beta = w$. Then for smooth $j = 1, \ldots, J$, a numerically stable and fast alternative to (17) is given by
$$\hat{\text{df}}_j^{\text{ADMM}} = k_j + \big|\{l : \hat w_{j,l} \neq 0\}\big| + 1\{j \in \bar E\}, \quad (20)$$
where $E$ indexes the centered smooths and $1\{\cdot\}$ is an indicator variable. We then combine (20) with the restricted derivative approximation for the degrees of freedom of the random effects given above to obtain the overall degrees of freedom
$$\hat{\text{df}}^{\text{ADMM}} = 1 + \sum_{j=1}^J \hat{\text{df}}_j^{\text{ADMM}} + \text{tr}\big( Z (Z^T Z + \tau S)^{-1} Z^T \big), \quad (21)$$
where we add 1 for the intercept.

Ridge approximation
Let $U = [F_1, \ldots, F_J, Z]$ be the concatenated design matrix of fixed and random effects, and let
$$\Omega^{\text{ridge}} = \text{blockdiag}\big( \rho D_1^T D_1, \ldots, \rho D_J^T D_J, \tau S \big)$$
be the penalty matrix. Then the hat matrix from the linear smoother approximation (see Section 7) is given by $H = U (U^T U + \Omega^{\text{ridge}})^{-1} U^T$. Similar to before, we can get the overall degrees of freedom as
$$\hat{\text{df}}^{\text{ridge}} = 1 + \text{tr}(H), \quad (22)$$
where we add 1 for the intercept. To obtain degrees of freedom for individual smooths $j = 1, \ldots, J$, let $E_j$ be a $(p + q) \times (p + q)$ matrix with 1s on the diagonal positions indexed by the columns of $F_j$ and zeros elsewhere. Also, let $\hat f_j = F_j \hat\beta_j$ be the estimate of the $j$th smooth. Then the ridge approximation for smooth $j$ is given by
$$\hat{\text{df}}_j^{\text{ridge}} = \text{tr}\big( E_j (U^T U + \Omega^{\text{ridge}})^{-1} U^T U \big). \quad (23)$$
Similar to before, we also propose stable and fast approximations to the ridge estimate of degrees of freedom based on restricted derivatives. In particular, let
$$H_j^{\text{res}} = F_j \big( F_j^T F_j + \rho D_j^T D_j \big)^{-1} F_j^T \quad \text{and} \quad H_b^{\text{res}} = Z \big( Z^T Z + \tau S \big)^{-1} Z^T. \quad (24)$$
Then we can estimate the overall degrees of freedom as
$$\hat{\text{df}}^{\text{ridge,res}} = 1 + \sum_{j=1}^J \text{tr}(H_j^{\text{res}}) + \text{tr}(H_b^{\text{res}}), \quad (25)$$
where we add 1 for the intercept. As noted above, this approach is similar to one described by Ruppert et al. (2003, p. 176), though for a different purpose. Whereas we use this approach to obtain the degrees of freedom after fitting the model, Ruppert et al. (2003) use it to set the degrees of freedom before fitting the model.

Approximate inference
In this section, we discuss approximate inferential methods based on ridge approximations to the $\ell_1$ penalized fit, conditional on the smoothing parameters $\lambda_j$, $j = 1, \ldots, J$ and $\tau$. We use the ADMM algorithm to analyze the approximation. In particular, we note that we can write the ADMM update for $\beta_j$ as
$$\beta_j^{m+1} = \big( F_j^T F_j + \rho D_j^T D_j \big)^{-1} F_j^T y^{(j),m} + \delta_j^m, \quad (26)$$
where $y^{(j),m} = y - \beta_0^m 1 - \sum_{l \neq j} F_l \beta_l^m - Z b^m$ is the partial residual for smooth $j$ and $\delta_j^m = \rho \big( F_j^T F_j + \rho D_j^T D_j \big)^{-1} D_j^T (w_j^m - u_j^m)$. As we note in Observation 2, $\delta_j$ loosely represents the difference in the estimate of $\beta_j$ obtained with the $\ell_1$ and $\ell_2$ penalties.
Proof of Observation 2. Similar to the ridge update for $b$, if we changed $\lambda_j \|D_j \beta_j\|_1$ to $(\lambda_j/2) \|D_j \beta_j\|_2^2$ in (8), we could remove the $w_j$ term and the constraint $D_j \beta_j = w_j$ from (10) to obtain a ridge update for $\beta_j^{m+1}$. By comparison with (26), we see that $\delta_j^m = 0$.
Observation 2 motivates our approximate inferential strategy. Letting $\hat{f}_j$ be the $j$th fitted smooth, and letting $y^{(j)} = y - \hat{\beta}_0 - \sum_{l \neq j} F_l \hat{\beta}_l - Z\hat{b}$, we obtain the decomposition in (27) and (28). We obtain confidence intervals for the linear smoother (28) centered around the estimated fit (27), ignore $F_j \delta_j$ when estimating the variance, and assume $\lambda_j \approx \rho$. We also condition on the smoothing parameters $\lambda_1, \ldots, \lambda_J$ and $\tau$. Figure 2 gives a visual demonstration of the approximation for the simulation presented in Section 8 and the application shown in Section 9. As seen in Figure 2, in these examples the $\ell_1$ fit and its ridge approximation are very similar. If this holds in general, it would suggest that 1) the approximate inferential procedures we propose might have reliable coverage probabilities, and 2) there may be minimal practical advantage to using an $\ell_1$ penalty instead of the standard $\ell_2$ penalty. However, as shown in Section 8.3, the $\ell_1$ penalty appears to perform noticeably better in certain situations, including the detection of change points.
Figure 2: Linear smoother approximation to the $\ell_1$ penalized fit in (a) the simulation (see Section 8) and (b) the application (see Section 9). The solid red line is the $\ell_1$ penalized fit, the dotted green line is the linear smoother approximation, and the dashed blue line is the difference between the two.
Before presenting the confidence bands in greater detail, we discuss our approach for estimating variance in Section 7.1, which we then use to form confidence bands in Section 7.2.

Variance
Let $r = y - \sum_{j=1}^{J} F_j \hat{\beta}_j - Z\hat{b}$ be the $n \times 1$ vector of residuals. We estimate the overall variance as $\hat{\sigma}^2 = \|r\|_2^2 / \widehat{df}_{resid}$, where $\widehat{df}_{resid}$ is the residual degrees of freedom. When possible, we use the estimate based on Stein's method (16) and set $\widehat{df}_{resid} = n - \widehat{df}$. If Stein's method is not numerically stable, then we use the restricted derivatives approximation (19) and set $\widehat{df}_{resid} = n - \widehat{df}$. As another alternative, we could also use the ADMM approximation and set $\widehat{df}_{resid} = n - \widehat{df}_{ADMM}$.
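The variance estimator above is a one-liner; a sketch with a hypothetical function name:

```python
import numpy as np

def sigma2_hat(y, fitted, df_fit):
    """Overall variance estimate sigma^2 = ||r||_2^2 / df_resid, with
    df_resid = n - df for whichever degrees of freedom estimate
    (Stein, restricted derivatives, or ADMM) is available."""
    r = np.asarray(y) - np.asarray(fitted)
    return float(r @ r) / (len(r) - df_fit)
```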

Confidence bands
In this section, we obtain confidence bands for typical subjects, i.e. for subjects for whom $b_i = 0$. Since we assume a normal outcome, this is equivalent to the marginal population-level response.

Frequentist confidence bands
Ignoring the distribution on $D_j \beta_j$, $y^{(j)}$ is normal with variance $\mathrm{Var}(y^{(j)}) = \sigma^2 I + \sigma_b^2 Z S^{+} Z^T$, where $S^{+}$ is the Moore-Penrose generalized inverse of the matrix $S$ (as noted in Section 3, $S$ may not be positive definite). Therefore, $\mathrm{Var}(\hat{f}_j) \approx H_j \widehat{\mathrm{Var}}(y^{(j)}) H_j^T$, where $\widehat{\mathrm{Var}}(y^{(j)})$ is an $n \times n$ estimate of $\mathrm{Var}(y^{(j)})$ with $\hat{\sigma}^2$ and $\hat{\sigma}_b^2$ plugged in for $\sigma^2$ and $\sigma_b^2$, respectively, and $\hat{f}_j$ is approximately $N(f_j, H_j \mathrm{Var}(y^{(j)}) H_j^T)$. The estimated variance of the fit at a single point $x$, which we denote as $\widehat{\mathrm{Var}}(\hat{f}_j(x))$, is the corresponding diagonal element of $H_j \widehat{\mathrm{Var}}(y^{(j)}) H_j^T$. Therefore, asymptotic pointwise $1 - \alpha$ confidence bands take the form $\hat{f}_j(x) \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{f}_j(x))}$, where $\Phi(z_a) = a$ and $\Phi$ is the standard normal CDF, e.g. $z_{1-\alpha/2} \approx 1.96$ for $\alpha = 0.05$.
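A sketch of these pointwise bands, given a hat matrix $H_j$ and a plug-in estimate $V$ of $\mathrm{Var}(y^{(j)})$; names are hypothetical, and only the diagonal of $H V H^T$ is computed:

```python
import numpy as np
from statistics import NormalDist

def pointwise_bands(f_hat, H, V, alpha=0.05):
    """Asymptotic pointwise (1 - alpha) confidence bands:
    f_hat(x) +/- z_{1 - alpha/2} * sqrt(diag(H V H^T))."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = np.sqrt(np.einsum('ij,jk,ik->i', H, V, H))  # diag(H V H^T)
    return f_hat - z * se, f_hat + z * se
```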
For the purposes of interpretation, we include the intercept term in the confidence band for the j = 1 smooth, but not for the remaining smooths.

Bayesian credible bands
Many authors, including Wood (2006), recommend using Bayesian confidence bands for nonparametric and semiparametric models because the point estimates are themselves biased. While Bayesian credible bands do not remedy the bias, they are self-consistent.
To this end, we replace the element-wise Laplace prior with the (generally improper) joint normal prior that is equivalent to the standard $\ell_2$ penalty: $\beta_j \sim N\left(0, (\lambda_j D_j^T D_j)^{-1}\right)$. This leads to the posterior (29). We can then form simultaneous Bayesian credible bands for $f_j \mid y$ by simulating from the posterior (29) and taking quantiles from $F_j \beta_j^{b}$, $b = 1, \ldots, B$. Alternatively, for a faster approximation, we use frequentist confidence bands with $F_j W_j^{-1} F_j^T$ in place of $H_j \widehat{\mathrm{Var}}(y^{(j)}) H_j^T$. In practice, we have found the simultaneous credible bands and the faster approximation to be nearly indistinguishable.$^4$ As before, for the purposes of interpretation, we include the intercept term in the credible band for the $j = 1$ smooth, but not for the remaining smooths.

$^4$It appears that the latter (faster) method is the default in the mgcv package (Wood, 2006). As in mgcv, we only need to compute the diagonal elements of $F_j W_j^{-1} F_j^T$ as rowSums$((F_j W_j^{-1}) \circ F_j)$, where $\circ$ is the Hadamard (element-wise) product.
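The diagonal-only computation noted in the footnote, $\mathrm{diag}(F_j W_j^{-1} F_j^T)$ as row sums of the Hadamard product $(F_j W_j^{-1}) \circ F_j$, translates directly; a sketch with hypothetical names:

```python
import numpy as np

def band_diag(F, W):
    """diag(F W^{-1} F^T) via row sums of (F W^{-1}) * F (Hadamard
    product), avoiding the full n x n matrix."""
    FWinv = np.linalg.solve(W.T, F.T).T  # F @ W^{-1} without forming inv(W)
    return np.sum(FWinv * F, axis=1)
```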

Simulation
We simulated data from a piecewise linear mean curve as shown in Figure 3. Each subject had a random intercept and was observed over only a portion of the domain. There are 50 subjects, each with between 4 and 14 measurements (450 total observations). The random intercepts were normally distributed with variance $\sigma_b^2 = 1$, and the overall noise was normally distributed with variance $\sigma^2 = 0.01$. In all models, we used order 2 (degree 1) B-splines with $p = 21$ basis functions.
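A hedged sketch of this simulation design in Python; the true change points and observation windows used in the paper are not reproduced here, so the knot locations and window widths below are illustrative assumptions:

```python
import numpy as np

def simulate(n_subj=50, seed=0):
    """Piecewise linear mean curve, subject-specific random intercepts
    (sigma_b^2 = 1), and noise (sigma^2 = 0.01). Each subject is
    observed on only part of the domain, with 4 to 14 measurements."""
    rng = np.random.default_rng(seed)
    # Illustrative piecewise linear mean with interior change points
    knots_x = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
    knots_y = np.array([0.0, 1.0, 0.2, 0.9, 0.1, 0.8])
    subj, x, y = [], [], []
    for i in range(n_subj):
        m = rng.integers(4, 15)               # 4 to 14 measurements
        lo = rng.uniform(0, 0.5)              # partial observation window
        xi = np.sort(rng.uniform(lo, min(lo + 0.5, 1.0), m))
        b_i = rng.normal(0.0, 1.0)            # random intercept
        f = np.interp(xi, knots_x, knots_y)   # piecewise linear mean
        yi = f + b_i + rng.normal(0.0, 0.1, m)  # noise sd = sqrt(0.01)
        subj.append(np.full(m, i)); x.append(xi); y.append(yi)
    return np.concatenate(subj), np.concatenate(x), np.concatenate(y)
```

Unlike the paper's fixed 450 observations, the total here varies with the draw.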

Frequentist estimation
We fit models with $J = 1$ smooth term and random intercepts. To obtain estimates for the $\ell_1$ penalized model, we used ADMM and 5-fold CV to minimize the objective in (30), where $Z_{il} = 1$ if observation $i$ belongs to subject $l$ and zero otherwise. As noted above, we used order 2 (degree 1) B-splines with $p = 21$ basis functions, i.e. $F \in \mathbb{R}^{n \times (p-1)}$ where $n = 450$ and $p = 21$. We also fit an equivalent model with an $\ell_2$ penalty using the mgcv package (Wood, 2006), i.e. with $(\lambda/2)\|D^{(2)}\beta\|_2^2$ in place of $\lambda \|D^{(2)}\beta\|_1$ in (30). Figure 4 shows the marginal mean with 95% credible intervals, and Figure 5 shows the subject-specific predicted curves.
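The objective (30) is not displayed in this excerpt; under a typical formulation of this kind of model (squared-error loss, an $\ell_1$ penalty on second differences of the spline coefficients, and a ridge penalty on the random intercepts), evaluating it might look like the following sketch, with hypothetical names:

```python
import numpy as np

def objective(y, F, Z, beta0, beta, b, lam, tau, D2):
    """Hedged sketch of a penalized objective of the general form used
    in (30): squared-error loss, an l1 penalty on second differences of
    beta, and a ridge penalty on the random intercepts b. The exact form
    of (30) is not reproduced in this excerpt."""
    resid = y - beta0 - F @ beta - Z @ b
    return (0.5 * resid @ resid
            + lam * np.sum(np.abs(D2 @ beta))
            + 0.5 * tau * b @ b)
```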
Figure 4: Marginal mean and 95% credible intervals from frequentist estimation for (a) the $\ell_1$ fit with ADMM and CV and (b) the $\ell_2$ fit with mgcv (Wood, 2006). Black is the true marginal mean; red is the estimated marginal mean.

Figure 5: Subject-specific predicted curves from frequentist estimation for (a) the $\ell_1$ fit with ADMM and CV and (b) the $\ell_2$ fit with mgcv (Wood, 2006). Black is the true marginal mean, red is the estimated marginal mean, and blue shows the subject-specific curves.

As seen in Figures 4 and 5, the results from the $\ell_1$ and $\ell_2$ penalized models are very similar. However, the $\ell_1$ penalized model does slightly better at identifying the change points and the line segments. We explore this further in Section 8.3. Table 1 compares the degrees of freedom and variance estimates from the $\ell_1$ penalized fit against those from the $\ell_2$ penalized fit. From Table 1, we see that the ridge degrees of freedom $\widehat{df}_{ridge}$ appears reasonable, as it is near the estimate for the $\ell_2$ penalized model. The estimate $\widehat{df}$ also seems reasonable. Ideally, the degrees of freedom should equal six, as there are four change points and we are using a second order difference penalty (see Section 6.2). Table 2 compares the different estimates of degrees of freedom. In this simulation, the degrees of freedom based on the ridge approximation is larger than that from Stein's formula, and the approximations based on restricted derivatives are equal or near to the estimate from Stein's formula.

Bayesian estimation
We modeled the data as $y \mid b \sim N(\beta_0 1 + F\beta + Zb, \sigma^2 I)$, with a Laplace prior on $D^{(2)}\beta$ corresponding to the $\ell_1$ penalty. We also fit models with normal and diffuse priors for $D^{(2)}\beta$.
We fit all models with rstan (Stan Development Team, 2016), each with four chains of 2,000 iterations, with the first 1,000 iterations of each chain used as warmup. The MCMC chains, not shown, appeared to be reasonably well mixing and stationary, and had $\widehat{R}$ values under 1.1 (see Gelman et al., 2014). Figure 6 shows the marginal mean with 95% credible intervals, and Figure 7 shows the subject-specific predicted curves.

Figure 7: Subject-specific predicted curves from Bayesian models fit with order 2 (degree 1) B-splines. Gray is the observed data, black is the true marginal mean, red dashed is the estimated marginal mean, and blue dashed shows the subject-specific predictions.

As seen in Figures 6 and 7, all models performed well and gave fits similar to those above. As before, the Laplace prior appears to better enforce a piecewise linear fit, particularly around $x = 0.2$.

Change point detection
We simulated 1,000 datasets with the same generating mechanism used to produce the data shown in Figure 3, and measured the performance of the $\ell_1$ and $\ell_2$ penalized models on two criteria: 1) the number of inflection points found, and 2) the distance between the estimated inflection points and the closest true inflection point. To that end, let $T = \{\tau_1, \ldots, \tau_4\}$ be the set of true inflection points, and let $M = \max_{x \in X} |\hat{f}''(x)|$ be the maximum absolute second derivative of the estimated function, where $X = \{x_1, x_2, \ldots\}$ is the ordered set of unique simulated $x$ values and $\hat{f}''$ is approximated by finite differences. Then let $I = \{x \in X : |\hat{f}''(x)| \geq cM\}$ be the set of estimated inflection points, where $c \in (0, 1)$ is a cutoff value defining how large the second derivative must be for a point to be counted as an inflection point. Also, let $n_I = |I|$ be the number of estimated inflection points, and let $\bar{d} = n_I^{-1} \sum_{x \in I} \min_{\tau \in T} |x - \tau|$ be the mean absolute deviation of the estimated inflection points. Figure 8 shows the results from the 1,000 simulated datasets. The $\ell_1$ penalized model was better able to 1) find the correct number of inflection points, and 2) determine the location of the inflection points.

Application

In this section, we analyze electrodermal activity (EDA) data collected as part of a stress study. In brief, all subjects completed a written questionnaire prior to the study, which categorized the subjects as having either low vigilance or high vigilance personality types. During the study, all participants wore wristbands that measured EDA while undergoing stress-inducing activities, including giving a public speech and performing mental arithmetic in front of an audience. The scientific questions were: 1) Is EDA higher among high vigilance subjects? and 2) When did trends in stress levels change? In this section, we demonstrate how P-splines with an $\ell_1$ penalty can address both questions. The raw EDA data are shown in Figure 9.
After excluding subjects who had EDA measurements of essentially zero throughout the entire study, we were left with ten high vigilance subjects and seven low vigilance subjects. To remove the extreme second-by-second fluctuations in EDA, which we believe are artifacts of the measurement device rather than real biological signals, we smoothed each curve separately with a Nadaraya-Watson kernel estimator using the ksmooth function in R. We then thinned the data to reduce the computational burden, taking 100 evenly spaced measurements from each subject. Figure 10 shows the results of this process for a single subject, and Figure 11 shows the prepared data for all subjects. Because of the limited number of subjects, as well as issues of misalignment in the time series across individuals, the results presented here should be considered illustrative rather than of full scientific validity.

Figure 10: Raw, smoothed, and thinned electrodermal activity data for a single subject.

Figure 11: Electrodermal activity (EDA) data used in the analysis (seven low vigilance and ten high vigilance subjects). Note: subjects are not aligned in time (x-axis).

Models
In all models, we fit the structure $y_i(x) = \beta_1(x) + 1_{high}[i]\,\beta_2(x) + b_i(x) + \epsilon_i(x)$, where $x$ represents time in minutes, $1_{high}[i] = 1$ if subject $i$ has high vigilance and $1_{high}[i] = 0$ if subject $i$ has low vigilance, $b_i(x)$ are random subject-specific curves, and $\epsilon_i(x) \sim N(0, \sigma^2)$. For $\beta_1(x)$, $\beta_2(x)$, and $b_i(x)$, we used a fourth order B-spline basis with 31 basis functions each and a second order difference penalty ($k = 1$).
Written in matrix notation, the $\ell_1$ penalized model minimizes the objective in (31), where $y$ is a stacked vector for subjects $i = 1, \ldots, 17$, $F_1$ is an $n \times p$ design matrix with $n = 1{,}700$ and $p = 31$, and $F_2 = \mathrm{diag}(1_{high}[i]) F_1$, where $i$ is an $n \times 1$ vector of subject IDs. In other words, $F_2$ is equal to $F_1$, but with the rows corresponding to low vigilance subjects zeroed out. We set $Z$ and $S$ to be block diagonal, where each $Z_i$ is an $n_i \times 31$ random effects design matrix of order 4 B-splines evaluated at the input points for subject $i$, and where $S_{i,jl} = \int \phi_{ij}(t) \phi_{il}(t) \, dt$ are smoothing spline penalty matrices. We also mean-centered $F_1$ as described in Section 3, with the corresponding changes in dimensions.
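Constructing $F_2$ from $F_1$ by zeroing the low-vigilance rows is a one-line operation; a sketch with hypothetical names, where `high` is a 0/1 indicator per observation:

```python
import numpy as np

def group_design(F1, high):
    """F2 = diag(1_high) F1: keep the rows of F1 belonging to
    high-vigilance subjects and zero out the rest."""
    return high[:, None] * F1
```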
To fit a comparable $\ell_2$ penalized model, in which $\lambda_j \|D^{(2)} \beta_j\|_1$ in (31) is replaced with $(\lambda_j/2) \|D^{(2)} \beta_j\|_2^2$, we rotated the random effect design and penalty matrices $Z$ and $S$ as described in Section 3. To facilitate the use of existing software, we used a normal prior for the "unpenalized" random effect coefficients, i.e. $b_f \sim N(0, \sigma_f^2 I)$. We also fit a Bayesian model using the same rotations and equivalent penalties as above, modeling the data as $y \mid b$ normal with priors corresponding to the penalties.

Frequentist estimation
We tried to use CV to estimate the smoothing parameters for the $\ell_1$ penalized model; however, with only 17 subjects split between two groups, we could only do 3-fold CV, which did not find a visually reasonable fit, so we set the tuning parameters by hand. Figure 12 shows the estimated marginal mean and 95% credible bands for the $\ell_1$ penalized model, and Figure 13 shows the corresponding subject-specific predicted curves. As seen in Figure 12a, our model identified a few inflection points, particularly near minutes 40, 50, and 60. From Figure 12b, it appears that the difference in EDA between the low and high vigilance subjects was not statistically significant. Also, as seen in Figure 13, the subject-specific predicted curves are shrunk towards the mean, which is expected because the predicted curves are analogous to best linear unbiased predictors (BLUPs), although they are not linear smoothers.

Figure 14 shows the estimated marginal mean and 95% credible bands for the $\ell_2$ penalized model, and Figure 15 shows the corresponding subject-specific predicted curves. The estimate shown in Figure 14a is similar to that shown in Figure 12a, though the inflection points are slightly less pronounced. The results in Figure 14b are for the most part substantively the same as those in Figure 12b; the $\ell_2$ penalized model does not show a statistically significant difference between the low and high vigilance subjects, with the possible exception of minutes 45 to 66. As seen in Figure 15, the predicted subject-specific curves from the $\ell_2$ penalized model are also shrunk towards the mean.

Table 3 shows the estimated degrees of freedom for the $\ell_1$ penalized model. Stein's method $\widehat{df}$ (16) and the ridge approximation $\widehat{df}_{ridge}$ (19) were numerically unstable ($A^T A + \Omega$ and $U^T U + \Omega_{ridge}$ were computationally singular). We therefore used the restricted derivative approximation $\widehat{df}$ to estimate the variance, as described in Section 7.1.
In the $\ell_2$ penalized model, smooth $F_1$ had 14.2 degrees of freedom, and smooth $F_2$ had 6.96 degrees of freedom.

Table 3: Estimated degrees of freedom for the $\ell_1$ penalized model.

    (16) and (17):                    216
    Restricted (18):                  21.1
    (22) and (23):                    13.50
    Ridge restricted (24) and (25):   181

Bayesian estimation
We fit the model using rstan (Stan Development Team, 2016) with four chains of 5,000 iterations each, with the first 2,500 iterations of each chain used as warmup. The MCMC chains, not shown, appeared to be reasonably well mixing and stationary, with $\widehat{R}$ values under 1.1 (see Gelman et al., 2014). Figure 16 shows the marginal means with 95% credible bands, and Figure 17 shows the subject-specific curves. Similar to the $\ell_2$ penalized model, the Bayesian model found a slightly statistically significant difference between low and high vigilance between minutes 42 and 65.

For comparison, we also fit an $\ell_2$ penalized model with an alternative correlation structure similar to that recommended by Ruppert et al. (2003, p. 192). In place of the correlation structure implied by the penalty matrix $S$ described above, we augmented each $Z_i$ matrix on the left with the columns $[1, x_i]$, where $x_i$ is an $n_i \times 1$ vector of measurement times for subject $i$, and gave the corresponding coefficients covariance $\Sigma$, where $\Sigma$ is a common $2 \times 2$ unstructured positive definite matrix. To model the within-subject correlations, we used a continuous autoregressive process of order 1; in particular, $\mathrm{Cor}(y_i(x_{ij}), y_i(x_{ij'})) = \zeta^{|x_{ij} - x_{ij'}|}$ for a common parameter $\zeta > 0$. Figure 18 shows the estimated marginal mean and 95% credible bands, and Figure 19 shows the subject-specific predicted curves. The estimates shown in Figure 18 are similar to those shown in Figure 14. While the estimate of the difference between low and high vigilance subjects differs between this model and the $\ell_2$ penalized model in Section 9.3, the more notable difference is in the subject-specific predicted curves: as seen in Figure 19, they are not shrunk towards the mean as much as in Figure 15.

Discussion

As demonstrated in this article, P-splines with an $\ell_1$ penalty can be useful for analyzing repeated measures data.
Compared to related work with $\ell_1$ penalties, our model is ambitious in that we allow for multiple smoothing parameters and propose approximate inferential procedures that do not require Bayesian estimation. However, these are also the two aspects of our proposed approach that require additional future work.
Regarding estimation, our current approach of using ADMM and CV appears to work reasonably well for $J = 1$ smooth, but is not yet reliable for $J > 1$ smooths. In the future, we plan to develop more robust estimation techniques, particularly for the smoothing parameters. As one possibility, we have done preliminary work on minimizing quantities similar to GCV and AIC instead of the more computationally intensive CV, though these approaches do not seem as promising as their $\ell_2$ counterparts. For now, Bayesian estimation may be the most reliable way to fit these models. Bayesian estimation also opens the possibility of using other sparsity-inducing priors, such as spike and slab models (Ishwaran and Rao, 2005).
Regarding inference, in future work we plan to use the $\delta$ quantity to bound the difference between $\ell_1$ and $\ell_2$ penalized fits under certain assumptions on the data, and to study coverage probabilities through simulations. We also plan to investigate the use of post-selection inference methods to develop confidence bands for linear combinations of the active set, and to further investigate through simulations the performance of our proposed estimates of degrees of freedom. However, we note that our primary use of the degrees of freedom estimate $\widehat{df}$ is to obtain the residual degrees of freedom $\widehat{df}_{resid} = n - \widehat{df}$, which we then use to estimate the variance $\hat{\sigma}^2 = \|r\|_2^2 / \widehat{df}_{resid}$. Therefore, when $n \gg \widehat{df}$, $\hat{\sigma}^2$ is not very sensitive to $\widehat{df}$, in which case it is not critical for our purposes to obtain an exact estimate of the degrees of freedom.
In addition, we think it could be beneficial to investigate the addition of random effects to locally adaptive regression splines, and to implement smoothing splines with an $\ell_1$ penalty (replacing the $\|D\beta\|_1$ penalty with $\|\Psi\beta\|_1$, where $\Psi_{ij} = \int \phi_i(s) \phi_j(s) \, ds$ and $\phi$ are basis functions). We also plan to extend these results to a generalized model to allow for non-normal response distributions.
Regarding the rate of convergence, from Observation 1 and the work of Tibshirani (2014a), we know that for equally spaced data and $F = I_{n \times n}$, P-splines with an $\ell_1$ penalty achieve the minimax rate of convergence for the class of weakly differentiable functions of bounded variation. Let $\|A\|_{\max} = \max_{ij} |a_{ij}|$ denote the maximum absolute element of a matrix $A$. We speculate that for a general $n \times p$ design matrix $F$ of full rank (not necessarily of first degree B-splines), $\ell_1$ penalized models achieve the minimax rate of convergence when $\|F F^T - I_{n \times n}\|_{\max}$ is small, but not when it is large. Proving this assertion and finding a cutoff value has been challenging. The framework of Tibshirani (2014a) for comparing the fits from two lasso problems with different design matrices may be promising, but extending those results to design matrices of different dimensions, as would be required in our setting, has been difficult. It may also be possible to build on results regarding the optimal number and placement of spline knots. We leave this for future work.

Supplementary material
We have implemented our method in the R package psplinesl1, available at https://github.com/bdsegal/psplinesl1. All code for the simulations and analyses in this paper is available at https://github.com/bdsegal/code-for-psplinesl1-paper.