Smoothing $\ell_1$-penalized estimators for high-dimensional time-course data

When a series of (related) linear models has to be estimated it is often appropriate to combine the different data-sets to construct more efficient estimators. We use $\ell_1$-penalized estimators like the Lasso or the Adaptive Lasso which can simultaneously do parameter estimation and model selection. We show that for a time-course of high-dimensional linear models the convergence rates of the Lasso and of the Adaptive Lasso can be improved by combining the different time-points in a suitable way. Moreover, the Adaptive Lasso still enjoys oracle properties and consistent variable selection. The finite sample properties of the proposed methods are illustrated on simulated data and on a real problem of motif finding in DNA sequences.


Introduction
The Lasso [16] has attracted a lot of attention for prediction and variable selection in linear regression models, including high-dimensional settings where the number of covariates is much larger than sample size [6,12,3,20,13,22]. Not only has the idea of ℓ 1 -penalization shown its success in other models [17,9,14], but also many extensions of the Lasso in linear regression models have been proposed, among them are the Fused Lasso [18], the Adaptive Lasso [24] and the Relaxed Lasso [11].
Also for multivariate regression, penalization estimators have been shown to be successful [19,15]. In many problems there is a natural ordering of the response space: our new methodology and theory are exploiting this fact. If we think of time-course data where we observe a response variable over certain time-points, the relationship between "neighbouring" time-points is expected to be stronger than between more distant time-points. Instead of separately estimating a parameter vector for each time-point, it is often a better strategy to combine information across different time-points. By putting an appropriate constraint on the parameter vector, we can control certain characteristics, e.g. the smoothness over time. As an advantage, we may get a more efficient estimator: By pooling of information, we reduce the variance, typically at the cost of some bias, to achieve a lower mean squared error. For multivariate regression, [2] use the correlation between the responses to construct an estimator with lower mean squared prediction error. In the discussion of [2], the idea of relevance weighted likelihood [7] is mentioned [23]. We use this idea for ℓ 1 -penalized estimators. By using an estimator which also fits well for neighbouring time-points, we can not only get a smoother behaviour of the parameter vector over time, but also profit from more efficiency, both in estimation accuracy and in variable selection.
The rest of this article is organized as follows. In Section 2 we introduce the Smoothed Lasso estimator and show that it asymptotically reduces the bound on the mean squared error compared to the univariate Lasso estimator. In Section 3 we apply the smoothing idea to the Adaptive Lasso and variants thereof and show that it can consistently select the correct model and has a faster convergence rate than the univariate estimator. Simulations follow in Section 4 and a real data analysis for motif search in DNA sequences in Section 5. Section 6 contains the discussion. All proofs can be found in the Appendix.

Smoothed Lasso
We first start with the definition of the Smoothed Lasso estimator and then study its theoretical properties.

Definition
Assume that we observe data at N different time-points and that at each timepoint t r , r = 1, . . . , N , we have a linear regression problem of the form where X is the n×p design matrix, y(t r ) ∈ R n is the response vector, β(t r ) ∈ R p is the parameter vector corresponding to time-point t r and ε(t r ) ∈ R n is the corresponding error vector: ε(t r ), r = 1, . . . , N are assumed to be i.i.d. random vectors with i.i.d. components having mean zero and finite variance σ 2 . Note that the design matrix X does not depend on t r in our setup (but it could), and hence we consider a multivariate linear model. As commonly used for penalized estimation, we assume that the columns of X are centered and scaled to have empirical column means 0 and column variances 1.
Remark 2.1. Generalizations of the methodology and theory include that the design matrix X depends on t r , i.e. X(t r ), and that the errors have correlated components Cov(ε(t r )) = Σ or arise from a dependent, stationary process with respect to the time-points.
The idea of the Smoothed Lasso is to use an ℓ 1 -penalty and to suitably combine or smooth the information of the different time-points. It is defined aŝ , and h is a bandwidth parameter. Thus, the Smoothed Lasso isβ(t r ) = β λn,h (t r ), depending on two tuning parameters.
We can rewrite the weighted optimization problem (2.1) as an ordinary Lasso problemβ Hence, any algorithm to solve a standard Lasso problem can be used to calculate the Smoothed Lasso estimator for a given bandwidth h. By forcing the estimateβ(t r ) to fit also well for "neighbouring" time-points, a smooth (non-parametric) trend ofβ(t r ) as a function of time is usually inherited (if p < n this is always true because of strict convexity with respect to β and continuity with respect to t r of the criterion in (2.2)).

Remark 2.2.
Another approach would be to use a Fused Lasso penalty which also penalizes the absolute value of the differences between neighbouring timepoints, i.e. |β j (t r ) − β j (t r−1 )|. Such an approach has two drawbacks: First, we have to model all time-points simultaneously, i.e. fit a model with N p parameters. Moreover, the Fused Lasso problem is more difficult to solve than the Lasso problem. In our approach we fit N Lasso problems with p parameters each. Second, if we want to mimick the behaviour of a bandwidth which is locally adaptive to the underlying true parameter function β(t), we have to introduce a lot of tuning parameters for the Fused Lasso and search over a high-dimensional grid when doing cross-validation. Remark 2.3. We do not assume that the active set (the set of predictors with nonzero coefficients) stays the same over all time-points. Our methodology allows for the fact that some predictors enter or leave the active set along the timecourse.
In the next Section we first consider the special case of an orthogonal design and indeed, we show that the mean squared error is decreased asymptotically.

Orthogonal Case
We consider the situation where the number of parameters equals the number of observations and the design matrix is orthogonal, i.e. X T X = nI n , and where the errors ε(t r ) are Gaussian. We focus on a single time-point of interest and therefore omit the time-index for notational simplicity. In [5,Theorem 1] it is shown that the (univariate) soft-threshold estimatorβ ST (with threshold parameter σ 2 log(n)/n), which is equivalent to the Lasso in the orthogonal case (with penalty parameter λ = 2σ 2 log(n)n), satisfies for all β ∈ R n and that this bound is asymptotically sharp in a minimax-sense  for all β ∈ R n and some constant C.
A proof is given in Appendix A.2. For a faster convergence rate than log(n) |A n | /n (for the unsmoothed Lasso) we require N h to converge to infinity which implies that , i.e. N can be of much lower order than n for achieving a faster convergence rate for the minimax bound.

General Case
Let us now consider the general case, i.e. we do not restrict ourselves to an orthogonal design matrix. In particular, we allow for high-dimensional situations where p = p n ≫ n is increasing very fast as n → ∞.
Using the results in [13] for a fixed design, the univariate Lasso estimator satisfies under regularity conditions on the design matrix where m λn = Cn 2 /λ 2 n for some constant C. In a certain sense, this bound is tight, see [13,Remark 1]. We can choose and arrive at the optimal rate For proving such a result, assumptions on the design matrix are crucial: Various authors use different conditions, cf. [13,3,20,22]. We refer the reader to [13] for a detailed description of the regularity conditions for (2.4).
Using the notation that is introduced at the beginning of the proof of Proposition 2.1, one can derive other asymptotic properties by linking known results for the Lasso [6,3,20] with the smoothed model y = Xβ +ε and an analysis of the bias term β − β q for q ∈ {1, 2} as in (A.3).
Up to now we only considered the estimation error for β and no variable selection properties. The smoothing reduces the variance and thus it can be expected that the Smoothed Lasso selects more (noise) variables than its univariate counterpart. Empirical evidence of this property is given in Section 4. This problem can be overcome by a second stage which removes many of the coefficients whose estimates are close to zero. In fact, already the case with a univariate response often requires such a second stage for consistent variable selection [24]. We will treat a special case in the next Section.

Smoothed Adaptive Lasso
The Adaptive Lasso [24] weights the penalty for the different coefficients using an initial estimatorβ init , i.e.
whereτ j = 1/|β init,j | γ for some γ > 0 are weights based on the initial estimator β init . For simplicity we will restrict ourselves to γ = 1. In [24], the ordinary least squares (OLS) estimator is used forβ init : here, we will mainly use the Lasso and versions thereof. Through a re-scaling of the columns of the design matrix, the Adaptive Lasso estimator can be formulated as an ordinary Lasso problem, see [24]. We can also apply the smoothing technique of Section 2 to the Adaptive Lasso. In the smoothed case we again replace the residual sum of squares in the objective function with its smoothed counterpart in (2.2) In [24], an asymptotic oracle result for the Adaptive Lasso is given for fixed dimension p. We show that the Smoothed Adaptive Lasso has a faster convergence rate. Again, as we focus on a single time-point, we omit the time-index for notational simplicity. We will consider the situation where the number of variables p is kept fixed as n → ∞. As before, let A be the active set of the true parameter vector at the current time-point andÂ n be its empirical counterpart.
A proof is given in Appendix A.2. Thus, if the initial estimator is consistent, we can find a sequence λ n such that the Smoothed Adaptive Lasso has the property of consistent model selection and asymptotic normality on the active Thus, as already pointed out in Section 2.2, a relatively small value of N = N n is sufficient for achieving an improved convergence rate.
Remark 3.1. The optimal convergence rate n −2/5 N −2/5 in Theorem 3.1 can be achieved using h n ≍ n −1/5 N −1/5 . Then, the limiting normal distribution becomes This is the same distribution as when using local least squares (with kernel K). Hence, the Smoothed Adaptive Lasso has an oracle property saying that it is asymptotically as good as local least squares with the true underlying active set A known beforehand.

Choice of initial estimator
The choice of the initial estimator will influence the final estimator. In particular, the sparsity of the final estimator can be maximized by making an appropriate choice, as discussed below.
We will first focus on univariate estimators, i.e. on estimators which only use the data of the current time-point. In view of Theorem 3.1, the basic assumption for the initial estimator is consistency. The ordinary least squares (OLS) method is a possible choice for low-dimensional problems with fixed dimension p as it is √ n-consistent. The Lasso is consistent in an ℓ 2 -sense, even in the highdimensional setting, see Section 2.3. Finally, the Adaptive Lasso is √ n-consistent for fixed p [24] and consistent under suitable regularity conditions for p ≫ n [8]. For high-dimensional problems the OLS estimator is not appropriate because it is unstable or even not defined in a p > n situation. The Lasso or Adaptive Lasso are more appropriate choices.
If the initial estimator is doing variable selection, i.e. some of the coefficientŝ β init,j = 0, the smoothed estimator is at least as sparse as the initial estimator: a zero-coefficient in the initial estimator, i.e.β init,j = 0, results in an infinite penalty for that component, i.e.τ j = ∞, forcing the smoothed estimate to be zero, i.e.β j (t r ) = 0. This reduces the computational complexity for the smoothing stage since some or even many predictors can be excluded from the model.
For the case that the initial estimator has a tuning parameter, as with the Lasso and the Adaptive Lasso, one would in practice tune it to be prediction optimal. For the Lasso, this produces too large models, i.e. many noise variables are included in the selected model [12]. However, noise variables tend to have small coefficients and will therefore be heavily penalized in the second smoothing step of the Smoothed Adaptive Lasso.
It is of course also possible to use a smoothed estimator as initial estimator, e.g. the Smoothed Lasso. In terms of the number of selected variables, as we will see in Section 4, this is often worse than directly using the univariate counterpart. Due to the reduced variance, the smoothed initial estimator tends to select too many variables and not all of them will be eliminated in the second stage of the Smoothed Adaptive Lasso. In view of some empirical results in Section 4, we advocate the following: the initial estimator for the Smoothed Adaptive Lasso is the univariate Adaptive Lasso; the latter itself uses the univariate Lasso as initial estimator. This amounts to be a three-stage procedure where all of the estimations are tuned to be prediction optimal using e.g. some cross-validation scheme. There is substantial agreement by now that two or more stages are needed to achieve good regularization properties in high-dimensional settings [24,11,25,13,21,10]. As a novelty here, our third stage involves an additional smoothing operation.

Simulations
In this Section we want to evaluate the finite sample properties of the proposed estimators.

Design
We consider the following models, similar to [24]: The design matrix X is simulated from a multivariate normal distribution with mean zero and covariance matrix Σ i,j = 0.5 |i−j| . The standard deviation of the error term is chosen from σ ∈ {2, 4} which corresponds to a signal-to-noise ratio (averaged over N ) of approximately {2.7, 0.7} and {3.8, 0.9} for model 1 and model 2, respectively. We use both a "classical" setup with n = 50, p = 8 and a high-dimensional setup with n = 100, p = 1000.
The best combination of bandwidth h and penalization parameter λ is being searched on a two-dimensional grid using an independent validation set of half the size of the training set. This is done independently for each time-point which means that we allow for a locally varying bandwidth. The density of the standard normal distribution is used as kernel function K(·) for the weight function w(·, ·), see Section 2.1.
For the (Smoothed) Adaptive Lasso with (Smoothed) Lasso as initial estimator, we first determine the optimal penalization parameter for the initial estimator and keep it fixed when searching for the optimal penalization parameter and bandwidth for the final estimator.
All estimators which we compare are listed in Table 1.

Performance Measures
To measure the goodness of fit and the ability to pick the model of the correct size we define the following performance measures. For the mean squared error we use Moreover, we also report the mean squared prediction error for the regression function x T β(t r ) whereβ 0 (t r ) is the intercept term (and β 0 (t r ) = 0 for our simulations) and Σ is the covariance matrix of the covariates.
For the number of variables we define the mean model size |Â(t r )| and the mean number of false positives In applied sciences where (possibly expensive) experiments are conducted to verify the selected variables (e.g. in biology), the number of false positives is a crucial quantity one wants to minimize in order to keep the costs low.

Results
The results can be found in Table 2. For the high-dimensional setting we did not consider OLS initial estimators. Several conclusions can be made. Let us first focus on the Lasso estimator. In all simulation settings, smoothing improves the M SE β score substantially. The downside for the Smoothed Lasso estimator is that due to the decreased variance, more noise variables tend to enter the model which results in larger selected models (with more false positives) than for the univariate Lasso estimator. However, in practice one would assign a variable importance score to each coefficient and therefore concentrate first on those with the largest contributions, whereas many of the false positives have small importance scores only. Also for the Adaptive Lasso, the M SE β scores get decreased by smoothing in all simulation settings. Using a smoothed initial estimator leads to too large models. Take for example Adaptive Lasso with Smoothed Lasso as initial estimator, i.e. proposal 6 in Table 1. As we have described above, the Smoothed Lasso tends to select a too large initial model. Although the Adaptive Lasso can eliminate most noise variables in the second stage due to their large weights from small coefficients of the initial estimator, the resulting models still get a bit too large. However, the estimator is very competitive with respect to prediction performance.
Using a univariate initial estimator, i.e. our proposal 7 in Table 1, to get more reasonably sized models seems to be a good compromise. It does not only produce the sparsest models but is often also competitive with respect to M SE β and M SE P .

Real data: Motif Finding in DNA Sequences
We apply the smoothing methodology to a problem of motif regression [4]. A motif (typically a 5-15 letter word consisting of letters A, C, G, and T) is a candidate of a binding site of some functional element, e.g. a transcription factor (a protein which regulates gene expression). In [1] a collection of various gene expression time-course experiments and a set of candidate motifs for yeast is provided. Gene expression values for a total of 2587 genes are available and p = 666 motif candidates are used to build the motif scores for each gene. These measure how well the motifs are represented in the upstream regions of the genes. We focus on a time-course experiment spanning N = 12 different timepoints. In summary we have 2587 observations of a 666 dimensional predictor (the motif scores) and a one-dimensional response (the gene expression value) at each of the 12 time-points. Thus, each row of the design matrix X corresponds to a gene and each column to a motif score. The element x i,j measures how well the jth motif score is represented in the upstream region of the ith gene.
To illustrate the smoothing methods and the effect of different sizes for the training set, we use random subsets of different sizes as training set. An independent validation set is used to determine the prediction optimal tuning parameters. The size of the validation set is half the size of the training set. The remaining data is used as test-set.
The results for a training set of size 1300 is given in Table 3. In terms of prediction error, there is not much gain when smoothing the estimators for this data-set, especially for the Adaptive Lasso. A reason for this may be the large variance of the error term. Note that for a new test observation ( The error variance σ 2 is likely to be the dominating quantity since motif regression is known to be very noisy. In terms of variable selection, the smooth- ing step decreases the model size for the Adaptive Lasso estimator and is potentially reducing the number of false positives: In particular for time-points t r = 1, 3, 5, 7, 8, 9, 10 the Smoothed Adaptive Lasso yields much sparser model fits. For the Lasso estimator, the smoothing has a tendency to increase the number of selected variables resulting in rather large models. This coincides with our findings in Section 4. If we decrease the training sample size to 200 we see some small improvement with respect to the mean squared prediction error (not shown).

Discussion
We propose smoothing techniques for ℓ 1 -penalized (Lasso-type) estimators for a time-course of high-dimensional linear models. We show theoretically that for the Lasso and the Adaptive Lasso, better estimates in terms of the mean squared error can be obtained by combining the responses of different time-points in a suitably weighted way. Empirically, the Smoothed Adaptive Lasso estimator yields the sparsest models with competitive mean squared error performance when using the univariate Adaptive Lasso as initial estimator. The Smoothed Lasso estimator has very good performance with respect to the mean squared error but selects too many noise variables in general. An additional thresholding stage would be necessary if the primary interest is in variable selection.
The smoothing methodology can also be applied to generalized linear models (GLM). The main difference is that we can't rewrite the smoothed estimator as an ordinary lasso problem as in (2.2). This implies that the computational burden increases: In the worst case (depending on the support of the kernel and the bandwidth h), by stacking the response variables and design matrices of the different time-points, the total sample size is N n, which can be substantially larger than n in (2.2), while the dimensionality is still p.
Our methodology applies to more general problems than time-course settings. For example, we can directly treat the situation of different (heterogeneous) data-sets (y(t), X), t = 1, . . . , N (or (y(t), X(t)), t = 1, . . . , N ) with n(t) × 1 response vectors and n(t)×p design matrices, where t is the index for the various data-sets. All we need is a suitable pseudo-distance d(t, s) among the different data-sets indexed by t and s. The weights in (2.2) are then of the form The pseudo-distance d(·, ·) could be learned from the data, e.g. based on some pseudo-metrics for clustering different data-sets. Whether the multivariate view over different time-points (or different datasets) pays off for a particular problem is not clear a-priori. However, our methodology encompasses the univariate Lasso methods, by choosing the bandwidth h = 0. Hence, using some cross-validation scheme enables to find out whether pooling information over different time-points (or data-sets) is worthwhile and if so, the Smoothed (Adaptive) Lasso from the multivariate approach renders more accurate estimates. w (t s , t r ) ε (t s ) .
We can now use the decomposition β − β 2 2 ≤ 2 β −β 2 2 + 2 β − β 2 2 . (A.1) The first term is "classical". We can use the theory of [5] with respect to an error term with reduced variance. For the asymptotic variance we have Using a Riemann sum approximation, we arrive at i.e. the error variance is of order 1/(N h).
Let us now consider the bias term. If β i (t r ) = 0 it follows with the compactness assumption of the kernel and (RC3) that for h = h n small enough β i (t r ) = 0. If β i (t r ) = 0 we havẽ β i (t r ) = where |τ s − t r | ≤ s N . Hence, by (RC1), Since N h opt ≍ N 8/9 n |A n | −2/9 a 2/9 n → ∞, n → ∞ because N n ≫ |A n | 1/4 a −1/4 n , we see from (A.6) that a faster convergence rate o P (a n ) is achieved. withỹ = √ N hỹ,X = √ N hX andε N = √ N hε N . Note that the variance of the error termε N depends on N . As can be seen from (A.2), we have for N → ∞ Var ε N i ∼ σ 2 K 2 (x)dx. Using the rescaled model (A.7), we can now adapt the proof of [24].