Non-asymptotic approach to varying coefficient model

In the present paper we consider the varying coefficient model, a useful tool for exploring dynamic patterns in many applications. Existing methods typically provide asymptotic evaluations of the precision of estimation procedures under the assumption that the number of observations tends to infinity. In practical applications, however, only a finite number of measurements is available. In the present paper we focus on a non-asymptotic approach to the problem. We propose a novel estimation procedure based on recent developments in matrix estimation. In particular, for our estimator we obtain upper bounds on the mean squared and the pointwise estimation errors. The resulting oracle inequalities are non-asymptotic and hold for finite sample sizes.


Introduction
In the present paper we consider the varying coefficient model, which represents a useful tool for exploring dynamic patterns in economics, epidemiology, ecology, etc. This model can be viewed as a natural extension of the classical linear regression model: it allows the parameters that are constant in the regression model to evolve with certain characteristics of the system, such as time, or age in epidemiological studies.
Varying coefficient models were introduced by Cleveland, Grosse and Shyu [4] and Hastie and Tibshirani [7] and have been studied extensively in the past 15 years. Estimation procedures for the varying coefficient model are based, for example, on kernel-local polynomial smoothing (see, e.g., [28,8,5,12]), polynomial splines (see, e.g., [9,11,10]) and smoothing splines (see, e.g., [7,8,3]). More recently, Wang et al. [27] proposed a new procedure based on a local rank estimator; Kai et al. [13] introduced a semi-parametric quantile regression procedure and studied an effective variable selection procedure; Lian [20] developed a penalization-based approach for both variable selection and constant coefficient identification within a consistent framework. For more detailed discussions of the existing methods and possible applications, we refer to the very interesting survey of Fan and Zhang [6].
Existing methods typically provide asymptotic evaluations of the precision of estimation procedures under the assumption that the number of observations tends to infinity. In practical applications, however, only a finite number of measurements is available. In the present paper, we focus on a non-asymptotic approach to the problem. We propose a novel estimation procedure which is based on recent developments in matrix estimation, in particular, matrix completion. In the matrix completion problem, one observes a small set of entries of a matrix and needs to estimate the remaining entries using these data. A standard assumption that allows such completion to be successful is that the unknown matrix has low rank or is approximately low-rank. The matrix completion problem has attracted considerable attention in the past few years (see, e.g., [2,14,19,23,16]). The most popular methods for matrix completion are based on nuclear-norm minimization, which we adapt in the present paper.

Formulation of the problem
Let (W_i, t_i, Y_i), i = 1, . . . , n, be sampled independently from the varying coefficient model

Y = W^T f(t) + σ ξ.    (1)

Here, W ∈ R^p is a random vector of predictors, f(·) = (f_1(·), . . . , f_p(·))^T is an unknown vector-valued function of regression coefficients and t ∈ [0, 1] is a random variable independent of W. Let µ denote its distribution. The noise variable ξ is independent of W and t and is such that E(ξ) = 0 and E(ξ²) = 1; σ > 0 denotes the noise level.
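To fix ideas, here is a toy instance of model (1); this illustration is ours and the coefficients in it are hypothetical. With p = 2, one constant coefficient β_1 and one time-varying coefficient β_2(·),

f(t) = (β_1, β_2(t))^T,    Y_i = β_1 W_{i,1} + β_2(t_i) W_{i,2} + σ ξ_i,    i = 1, . . . , n.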
The goal is to estimate the vector function f(·) on the basis of the observations (W_i, t_i, Y_i), i = 1, . . . , n. Our estimation method is based on the approximation of the unknown functions f_k(t) using a basis expansion. This approximation generates the coordinate matrix A_0. In the above model, some of the components of the vector function f are constant. The larger the proportion of constant regression coefficients, the smaller the rank of the coordinate matrix A_0 (the rank of A_0 does not exceed the number of time-varying components of f(·) by more than one). We suppose that the first element of the basis is a constant function on [0, 1] (indeed, this is true for the vast majority of bases on a finite interval). In this case, if a component f_k(·) is constant, then it has only one non-zero coefficient in its expansion over the basis. This suggests taking the number of constant regression coefficients into account via the rank of the coordinate matrix A_0.
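As an illustration (ours, with a hypothetical configuration), let p = 3 and suppose that only f_3 varies in time while f_1 and f_2 are constant. Since the first basis function is constant, each constant component contributes a single non-zero entry in the first column of A_0:

A_0 = ( a⁰_{11}  0        · · ·  0
        a⁰_{21}  0        · · ·  0
        a⁰_{31}  a⁰_{32}  · · ·  a⁰_{3l} ),    rank(A_0) ≤ 2,

in agreement with the rule that the rank of A_0 exceeds the number of time-varying components by at most one.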
Our procedure involves estimating A_0 using nuclear-norm penalization, which is now a well-established proxy for rank penalization in the compressed sensing literature. Subsequently, the estimator of the coordinate matrix is plugged into the expansion, yielding the estimator f̂(·) = (f̂_1(·), . . . , f̂_p(·))^T of the vector function f(t). For this estimator we obtain upper bounds on the mean squared (with respect to dµ) and on the pointwise estimation error for any t ∈ supp(µ) (Corollary 1). These oracle inequalities are non-asymptotic and hold for finite values of p and n. The results of this paper concern random measurements and random noise, and so they hold with high probability.

Layout of the paper
The remainder of this paper is organized as follows. In Section 1.3 we introduce the notation used throughout the paper. In Section 2 we describe our estimation method in detail, give examples of possible choices of the basis (Section 2.1) and introduce an estimator of the coordinate matrix A_0 (Section 2.2). Section 3 presents the main results of the paper. In particular, Theorems 1 and 2 in Section 3 establish upper bounds for the estimation error of the coordinate matrix A_0 measured in the Frobenius norm. Corollary 1 provides non-asymptotic upper bounds on the mean squared and pointwise risks of the estimator of the vector function f. Section 4 considers the important particular case of an orthogonal dictionary.

Notations
We provide a brief summary of the notation used throughout this paper. Let A, B be matrices in R^{p×l}, let µ be a probability distribution on (0, 1) and let ψ(·) be a vector-valued function.
• For any vector η ∈ R^p, we denote the standard l_1 and l_2 vector norms by ‖η‖_1 and ‖η‖_2, respectively.
• Let ‖A‖_* = ∑_j σ_j(A) and ‖A‖_F = (∑_j σ_j²(A))^{1/2} be, respectively, the trace and Frobenius norms of the matrix A, and let ‖A‖ = σ_1(A) denote its operator norm. Here (σ_j(A))_j are the singular values of A ordered decreasingly (standard comparisons between these norms are recalled after this list).
• For any numbers a and b, denote a ∨ b = max(a, b) and a ∧ b = min(a, b).
• Denote the k × k identity matrix by I_k.
• In what follows, we use the symbol C for a generic positive constant, which is independent of n, p, s and l, and may take different values at different places.
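These norms satisfy, for any matrix A of rank r (a standard fact, recalled here because it is used repeatedly below),

‖A‖ ≤ ‖A‖_F ≤ ‖A‖_* ≤ √r ‖A‖_F ≤ r ‖A‖,

which follows from the Cauchy–Schwarz inequality applied to the singular values.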

Estimation method
The first step of our estimation method is the approximation of the unknown functions f_k(t) by expanding them over an appropriate basis. This approximation generates the coordinate matrix A_0. The matrix A_0 is estimated using penalized risk minimization. The estimator of the coordinate matrix is then plugged into the expansion, yielding the estimator of the vector function f.
For each k = 1, . . . , p, we expand f_k over a basis (φ_j(·))_{j=1,...,∞} of L_2((0, 1), dµ):

f_k(t) = ∑_{j=1}^{l} a⁰_{kj} φ_j(t) + ρ^{(l)}_k(t),    k = 1, . . . , p,

which generates the coordinate matrix A_0 = (a⁰_{kj}) ∈ R^{p×l}. Denote the vector of remainders by ρ^{(l)}(·) = (ρ^{(l)}_1(·), . . . , ρ^{(l)}_p(·))^T and set φ(t) = (φ_1(t), . . . , φ_l(t))^T. We assume that the basis satisfies

sup_{t ∈ [0,1]} ‖φ(t)‖_2 ≤ c_φ √l    (2)

for some constant c_φ, and that it guarantees a good approximation of f_k by ∑_{j=1}^{l} a⁰_{kj} φ_j(t). More precisely:

Assumption 1. The basis satisfies condition (2) and there exist positive constants b and γ such that, for any l ≥ 1,

sup_{t ∈ supp(µ)} ‖ρ^{(l)}(t)‖_2 ≤ b l^{−γ}.

Often approximation in the L_2-norm gives better rates of convergence. In order to get upper bounds on the mean squared error we will use the following additional assumption.

Assumption 2. There exist positive constants b and γ such that, for any l ≥ 1,

∫_0^1 ‖ρ^{(l)}(t)‖²_2 dµ(t) ≤ b² l^{−2γ}.

Let us give a few examples of possible choices of the basis.
Example 1. Assume that dµ = g(t) dt and that the function g is bounded away from zero and infinity, i.e. there exist absolute constants g_1 and g_2 such that, for any t ∈ supp(µ),

0 < g_1 ≤ g(t) ≤ g_2 < ∞.    (5)

Denote by φ̃_j(t) = e^{2iπjt}, j ∈ Z, the standard Fourier basis of L_2((0, 1)). Then it is easy to check that φ_j(t) = φ̃_j(t)/√g(t), j ∈ Z, is an orthonormal basis of L_2((0, 1), dµ). Let F̂(j) = ⟨F, φ̃_j⟩ denote the j-th Fourier coefficient of a function F ∈ L_2((0, 1)). Then, by Theorems 9.1 and 9.2 of [22], the tails of the Fourier coefficients of a function of Sobolev smoothness γ are controlled up to an absolute constant C_γ which depends on γ only. Assume that for some A < ∞ the functions f_k belong to a Sobolev ball of radius A, i.e.

max_{k=1,··· ,p} ‖f_k‖_{W^γ(0,1)} ≤ A.    (7)

Let l = 2N + 1, so that the expansion of f_k over (φ_j)_{|j|≤N} has the coefficients a⁰_{kj} = ⟨f_k √g, φ̃_j⟩. Then it follows from equations (5), (6) and (7) that the remainder ρ^{(l)} has the required decay, so that Assumptions 1 and 2 hold.

Example 2. Consider a wavelet ψ with a bounded support of length C_ψ and with γ* vanishing moments, and choose l = 2^H where H is a positive integer. Construct a periodic wavelet basis (ψ_{h,j}) of L_2((0, 1)). Then, under assumptions (5) and (7), as in Example 1, Assumption 1 holds with an absolute constant C_γ which depends on γ only, provided γ < γ*. For example, recalling that H = log_2 l and that the length of the support of ψ is bounded by C_ψ, we obtain, for any t,

‖φ(t)‖²_2 ≤ C_ψ ‖ψ‖²_∞ ∑_{h=0}^{H} 2^h ≤ 2 C_ψ ‖ψ‖²_∞ l,

so that condition (2) holds.

Example 3. Suppose that the functions f_k(t) are polynomials of degree less than k. Then, choosing l = k and an orthonormal basis of this subspace, we trivially have ρ^{(l)}(·) ≡ 0.
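As a quick sanity check (our computation, not part of the original text), condition (2) holds for the weighted Fourier basis of Example 1 with c_φ = g_1^{−1/2}: since |φ̃_j(t)| = 1 for every j and t,

‖φ(t)‖²_2 = ∑_{|j|≤N} |φ̃_j(t)|² / g(t) = l / g(t) ≤ l / g_1.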

Estimation of the coordinate matrix
Denoting X = W φ^T(t) ∈ R^{p×l}, we can rewrite (1) in the following form:

Y = tr(X^T A_0) + W^T ρ^{(l)}(t) + σ ξ.

We suppose that some of the functions f_k(·) are constant and let s − 1 denote the number of non-constant f_k(·). This parameter, s, plays an important role in what follows. Note that rank(A_0) ≤ s.
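The trace form of the first term follows from a one-line computation, spelled out here for completeness:

W^T A_0 φ(t) = tr(φ(t) W^T A_0) = tr((W φ^T(t))^T A_0) = tr(X^T A_0).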
Using the observations (Y_i, X_i), we define the following estimator of A_0:

Â = argmin_{A ∈ R^{p×l}} { (1/n) ∑_{i=1}^{n} (Y_i − tr(X_i^T A))² + λ ‖A‖_* },    (9)

where λ > 0 is a regularization parameter. This penalization, using the trace norm, is by now quite standard in the matrix completion problem and allows one to recover a matrix from under-sampled measurements. Using the estimator (9) of the coordinate matrix A_0, we recover f(t) as

f̂(t) = Â φ(t).
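For illustration, here is a minimal numerical sketch of the estimator (9), solved by proximal gradient descent with singular-value soft-thresholding. The sketch is ours, not part of the original paper; the function names, the step-size rule and the iteration count are illustrative choices.

import numpy as np

def svt(A, tau):
    """Singular-value soft-thresholding: the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def estimate_A(Y, X, lam, n_iter=500):
    """Proximal gradient descent for
       min_A (1/n) * sum_i (Y_i - <X_i, A>)^2 + lam * ||A||_*,
    where Y has shape (n,) and X has shape (n, p, l) with X_i = W_i phi(t_i)^T."""
    n, p, l = X.shape
    # Crude step size: inverse Lipschitz constant of the quadratic part.
    L = 2.0 * np.linalg.norm(X.reshape(n, -1), 2) ** 2 / n
    step = 1.0 / L
    A = np.zeros((p, l))
    for _ in range(n_iter):
        resid = Y - np.einsum('ipl,pl->i', X, A)      # residuals Y_i - <X_i, A>
        grad = (-2.0 / n) * np.einsum('i,ipl->pl', resid, X)
        A = svt(A - step * grad, step * lam)          # gradient step + prox
    return A

The soft-thresholding step is the proximal operator of the trace-norm penalty, which is what makes this simple first-order iteration applicable to the convex program (9).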

Assumptions about the dictionary and the noise
We assume that the vectors W_i are i.i.d. copies of a random vector W having distribution Π on a given set of vectors X. By rescaling, we can assume that ‖W‖_2 ≤ 1 almost surely. Let E(W W^T) = Ω and let ω_max, ω_min denote, respectively, its maximal and minimal singular values. We need the following assumption on the distribution of W.

Assumption 3. The minimal singular value of Ω is bounded away from zero: ω_min > 0.
An easy computation gives the L_2(Π⊗µ)-norm of a matrix A ∈ R^{p×l}. By definition we obtain

‖A‖²_{L_2(Π⊗µ)} = E (tr(X^T A))² = E (W^T A φ(t))² = E tr(A^T W W^T A φ(t) φ^T(t)).

Finally we compute

‖A‖²_{L_2(Π⊗µ)} = tr(A^T Ω A),  so that  ω_min ‖A‖²_F ≤ ‖A‖²_{L_2(Π⊗µ)} ≤ ω_max ‖A‖²_F,    (10)

where in the last display we used the independence of W and t and the fact that (φ_j(·))_{j=1,...,∞} is an orthonormal basis in L_2((0, 1), dµ), so that E φ(t) φ^T(t) = I_l.
We consider the case of sub-exponential noise, which satisfies the following condition.

Assumption 4. There exists a constant K > 0 such that

max_{i=1,...,n} E exp(|ξ_i| / K) < ∞.

For instance, if the ξ_i are i.i.d. standard Gaussian, we can take K = 1.
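As a simple illustration (ours), any almost surely bounded noise satisfies Assumption 4: if |ξ| ≤ 1 a.s., then

E exp(|ξ| / K) ≤ e^{1/K} < ∞  for every K > 0.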

Main Results
Let

Σ = (1/n) ∑_{i=1}^{n} (W_i^T ρ^{(l)}(t_i) + σ ξ_i) X_i   and   Σ_R = (1/n) ∑_{i=1}^{n} ε_i X_i,

where {ε_i}_{i=1}^{n} is an i.i.d. Rademacher sequence. These stochastic terms play an important role in the choice of the regularization parameter λ.
We also introduce two sample-size thresholds, n* and n**, which are specified below. The following theorem gives a general upper bound on the prediction error of the estimator Â given by (9). Its proof is given in Appendix A.
Theorem 1. Let λ ≥ 3‖Σ‖ and suppose that Assumption 3 holds. Then, with probability at least 1 − 2/d, (i) the prediction error of Â satisfies the first oracle bound; (ii) if, in addition, n ≥ n**, then the second, sharper bound holds.

In order to obtain the upper bounds of Theorem 1 in a closed form, it is necessary to obtain a suitable upper bound for ‖Σ‖. The following lemma, proved in Section E, gives such a bound.
Lemma 1. For all t > 0, with probability at least 1 − e^{−t}, the bound (11) on ‖Σ‖ holds, where d = p + l.
The optimal choice of the parameter t in Lemma 1 is t = log(d). A larger t leads to a slower rate of convergence, and a smaller t does not improve the rate but makes the concentration probability smaller. With this choice of t, the second term in the maximum in (11) is negligibly small for n ≥ n*, where n* depends on M = tr(Ω) ∨ (l ω_max).
In order to satisfy the condition λ ≥ 3‖Σ‖ in Theorem 1, we can choose λ as in (12) below. If the ξ_i are N(0, 1), then we can take c_* = 6.5 (see Lemma 4 in [15]). With these choices of λ, we obtain the following theorem.

Theorem 2. Let Assumptions 1–4 hold. Consider a regularization parameter λ satisfying (12) and let n ≥ n*. Then, with probability greater than 1 − 4/d, (i) the prediction error of Â satisfies the analogue of the first bound of Theorem 1; (ii) if, in addition, n ≥ n**, then the analogue of the second bound holds.

Using Â we define the estimator of f(t) as

f̂(t) = Â φ(t).

Theorem 2 allows us to obtain the following upper bounds on the prediction error of f̂(t).

Corollary 1. Suppose that the assumptions of Theorem 2 hold. Then, with probability greater than 1 − 4/d, the pointwise and the mean squared risks of f̂ satisfy the corresponding non-asymptotic bounds.
Proof. We shall prove the second statement of the corollary; the first one can be proved in a similar way. Let A_k denote the k-th row of a matrix A. We compute

∫_0^1 ‖(Â − A_0) φ(t)‖²_2 dµ(t) = ∑_{k=1}^{p} ∫_0^1 ⟨(Â − A_0)_k, φ(t)⟩² dµ(t) = ‖Â − A_0‖²_F,    (14)

where in the last display we used that (φ_j(·))_{j=1,...,∞} is an orthonormal basis. Using (14) and Assumption 2 we derive

∫_0^1 ‖f̂(t) − f(t)‖²_2 dµ(t) ≤ 2 ‖Â − A_0‖²_F + 2 b² l^{−2γ}.

Now Theorem 2 implies the statement of the corollary.

Orthonormal dictionary
As an important particular case, let us consider the orthonormal dictionary. Let (e_j)_j be the canonical basis of R^p. Assume that the vectors W_i are i.i.d. copies of a random vector W which has the uniform distribution Π on the set X = {e_1, . . . , e_p}. Note that this is an unfavorable case of very "sparse observations": each observation provides information on only one of the coefficients of f(t).
In this case, Ω = (1/p) I_p and ω_max = ω_min = 1/p, so that, in particular, M = tr(Ω) ∨ (l ω_max) = 1 ∨ (l/p); the parameters λ, n* and n** take the corresponding values. Plugging these values into Corollary 1, we derive the following result.
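The value of Ω follows from a one-line computation, included here for completeness:

Ω = E(W W^T) = (1/p) ∑_{j=1}^{p} e_j e_j^T = (1/p) I_p.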

Corollary 2. Let the assumptions of Theorem 2 hold and let W have the uniform distribution on {e_1, . . . , e_p}. (a) Then the pointwise risk of f̂ satisfies the bound of Corollary 1 with the above values of the parameters. (b) If, in addition, Assumption 2 holds, the same is true for the mean squared risk.
Remarks. Optimal choice of the parameter l: The upper bounds given in Corollary 2 indicate the optimal choice of the parameter l. From (15) we compute the candidate values l*_1, l*_2 and l*_3 of l, which balance the bias and the stochastic terms in the bounds.
Let γ ≥ 1/2 and consider first the case s p² log(d) ≲ n ≲ s p³ log(d) (the symbol ≲ means that the inequality holds up to a multiplicative numerical constant). Then Corollary 2 implies the risk bound (17), which involves the functions F_1(l), . . . , F_4(l) over successive ranges of l. On [1, l*_1], F_1(l) achieves its minimum at l*_1. Note that F_1(l*_1) ≤ F_2(l) for any l ∈ [l*_1, p] and F_1(l*_1) ≤ F_4(l) for any l > p. Then, for s p² log(d) ≲ n ≲ s p³ log(d), the optimal value of l minimizing (17) is l̂ = l*_1.
When n ≳ s p³ log(d), Corollary 2 implies a bound of the same form. On [p, l*_2], F_3(l) achieves its minimum at l*_2 if s p³ log(d) ≲ n ≲ p^{3+2γ} log(d), and at l*_3 if n ≳ p^{3+2γ} log(d). Note that F_3(l*_2) ≤ F_1(l) for any l ∈ [1, p] and F_3(l*_2) ≤ F_4(l) for any l > l*_2. Then, for s p³ log(d) ≲ n ≲ p^{3+2γ} log(d), the optimal value of l minimizing (17) is l̂ = l*_2, and for n ≳ p^{3+2γ} log(d) the optimal value is l̂ = l*_3.

Minimax rate of convergence: For p = 1, the optimal choice of l in (17) is l ≍ n^{1/(2γ+1)}. With this choice of l, the rate of convergence given by Corollary 2 is n^{−2γ/(2γ+1)}.
Note that for f ∈ W^γ(0, 1) we recover the minimax rate of convergence as given, e.g., in [26].
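For intuition, here is a heuristic bias–variance computation behind this choice (our sketch, assuming, as Corollary 2 suggests for p = 1, that the squared bias scales as l^{−2γ} and the stochastic term as l/n, up to logarithmic factors):

min_{l ≥ 1} { l^{−2γ} + l/n }:  l^{−2γ} ≍ l/n  ⟹  l ≍ n^{1/(2γ+1)},  rate ≍ n^{−2γ/(2γ+1)}.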

Acknowledgements
Marianna Pensky was supported in part by the National Science Foundation (NSF), grant DMS-1106564. The authors thank Alexander Tsybakov for extremely valuable discussions and suggestions.

Appendix A Proof of Theorem 1
This proof uses ideas developed in the proof of Theorem 3 in [16]. The main difference is that here we have no restriction on the sup-norm of A_0, which requires several modifications of the proof.
It follows from the definition of the estimator Â that

(1/n) ∑_{i=1}^{n} (Y_i − tr(X_i^T Â))² + λ ‖Â‖_* ≤ (1/n) ∑_{i=1}^{n} (Y_i − tr(X_i^T A_0))² + λ ‖A_0‖_*.    (18)

Then, we can write (18) in the following way:

(1/n) ∑_{i=1}^{n} (tr(X_i^T (Â − A_0)))² ≤ 2 ⟨Σ, Â − A_0⟩ + λ (‖A_0‖_* − ‖Â‖_*).    (19)

By duality between the nuclear and the operator norms, we obtain

⟨Σ, Â − A_0⟩ ≤ ‖Σ‖ ‖Â − A_0‖_*.    (20)

Let P_S denote the projector on a linear subspace S and let S^⊥ be the orthogonal complement of S. For a matrix A with row and column spaces S_1(A) and S_2(A), set P^⊥_A(B) = P_{S_1^⊥(A)} B P_{S_2^⊥(A)} and P_A(B) = B − P^⊥_A(B). By definition, for any matrix B, the singular vectors of P^⊥_{A_0}(B) are orthogonal to the space spanned by the singular vectors of A_0. This implies that

‖A_0 + P^⊥_{A_0}(B)‖_* = ‖A_0‖_* + ‖P^⊥_{A_0}(B)‖_*.    (21)

From (19), using (20), (21) and λ ≥ 3‖Σ‖, we obtain

‖P^⊥_{A_0}(Â − A_0)‖_* ≤ 5 ‖P_{A_0}(Â − A_0)‖_*.    (22)

Since P_A(B) = P_{S_1^⊥(A)} B P_{S_2(A)} + P_{S_1(A)} B and rank(P_{S_i(A)} B) ≤ rank(A), we derive that rank(P_A(B)) ≤ 2 rank(A). From (22) we compute

‖Â − A_0‖_* ≤ 6 ‖P_{A_0}(Â − A_0)‖_* ≤ 6 √(2R) ‖Â − A_0‖_F,

where we set R = rank(A_0). For 0 < r ≤ m = min(p, l) we consider the following constraint set:

C(r) = { A ∈ R^{p×l} : ‖A‖_* ≤ √r ‖A‖_F,  ‖A‖²_{L_2(Π⊗µ)} ≥ ν },

where ν is the threshold defined in Appendix B. The following lemma shows that for matrices A ∈ C(r) we have an approximate restricted isometry. Its proof is given in Appendix B.

Lemma 2. With probability close to one (quantified in Appendix B), for all A ∈ C(r),

(1/n) ∑_{i=1}^{n} (tr(X_i^T A))² ≥ (1/2) ‖A‖²_{L_2(Π⊗µ)} − E,  where  E = 44 c²_φ l r (E(‖Σ_R‖))² / ω_min.
We need the following auxiliary lemma which is proved in Appendix D.

Lemma 3 implies that Â − A_0 satisfies the trace-norm constraint defining C(2R). If ‖Â − A_0‖²_{L_2(Π⊗µ)} < ν, the bound of Theorem 1 follows immediately. Otherwise Â − A_0 ∈ C(2R), and Lemma 2 applies, yielding a bound in terms of E(‖Σ_R‖). The following lemma, proved in Section E.2, gives a suitable bound on E(‖Σ_R‖).

Lemma 4. Let {ε_i}_{i=1}^{n} be an i.i.d. Rademacher sequence. Suppose that Assumption 3 holds. Then

E(‖Σ_R‖) ≤ C { √(M log(d) / n) ∨ (c_φ √l log(d) / n) },

where d = p + l and M = tr(Ω) ∨ (l ω_max).
If, in addition, n > 2 C c²_φ l s M log(d) / ω²_min, then from (27) we obtain (29). On the other hand, for n > n**, (29) does not hold. This completes the proof of Theorem 1.

B Proof of Lemma 2
Set E = 44 c²_φ l r (E(‖Σ_R‖))² / ω_min. We will show that the probability of the following bad event is small:

B = { ∃ A ∈ C(r) such that | (1/n) ∑_{i=1}^{n} (tr(X_i^T A))² − ‖A‖²_{L_2(Π⊗µ)} | > (1/2) ‖A‖²_{L_2(Π⊗µ)} + E }.

Note that B contains the complement of the event that we are interested in. In order to estimate the probability of B we use a standard peeling argument.
Let

ν = c_φ √(64 l log(d) / (log(6/5) n))   and   α = 6/5.

For k ∈ N set

S_k = { A ∈ C(r) : α^{k−1} ν < ‖A‖²_{L_2(Π⊗µ)} ≤ α^k ν }.

If the event B holds for some matrix A ∈ C(r), then A belongs to some S_k and

| (1/n) ∑_{i=1}^{n} (tr(X_i^T A))² − ‖A‖²_{L_2(Π⊗µ)} | > (1/2) α^{k−1} ν + E.    (30)

For each T > ν consider the following set of matrices,

C(r, T) = { A ∈ C(r) : ‖A‖²_{L_2(Π⊗µ)} ≤ T },

and the following event,

B_k = { ∃ A ∈ C(r, α^k ν) such that | (1/n) ∑_{i=1}^{n} (tr(X_i^T A))² − ‖A‖²_{L_2(Π⊗µ)} | > (1/2) α^{k−1} ν + E }.

Note that A ∈ S_k implies A ∈ C(r, α^k ν). Then (30) implies that B_k holds and we obtain B ⊂ ∪_k B_k. Thus, it is enough to estimate the probability of the simpler event B_k and then to apply the union bound. Such an estimate is given by the following lemma; its proof is given in Appendix C. Let

Z_T = sup_{A ∈ C(r,T)} | (1/n) ∑_{i=1}^{n} (tr(X_i^T A))² − ‖A‖²_{L_2(Π⊗µ)} |.

Lemma 5. P( Z_T > T/(2α) + E ) ≤ exp( − c_3 n T² / (c⁴_φ l²) ), where c_3 = 1/128.
Using the union bound and Lemma 5, we obtain

P(B) ≤ ∑_{k=1}^{∞} P(B_k) ≤ ∑_{k=1}^{∞} exp( − c_3 n α^{2k} ν² / (c⁴_φ l²) ),

and the series is summable since e^x ≥ x. We finally compute the resulting bound for

ν = c_φ √(64 l log(d) / (log(6/5) n)).
This completes the proof of Lemma 2.

C Proof of Lemma 5
Our approach is standard: first we show that Z_T concentrates around its expectation and then we upper bound the expectation. By definition, for any A ∈ C(r, T),

| tr(X_i^T A) | = | W_i^T A φ(t_i) | ≤ ‖W_i‖_2 ‖A‖ ‖φ(t_i)‖_2 ≤ c_φ √l ‖A‖,

where we used ‖W‖_2 ≤ 1 and condition (2). Massart's concentration inequality (see, e.g., [1, Theorem 14.2]) implies that

P( Z_T ≥ E(Z_T) + t ) ≤ exp( − c_3 n t² / (c⁴_φ l²) ),    (31)

where c_3 = 1/128.
Next we bound the expectation E(Z_T). Using a standard symmetrization argument (see Ledoux and Talagrand [21]) we obtain

E(Z_T) ≤ 2 E( sup_{A ∈ C(r,T)} | (1/n) ∑_{i=1}^{n} ε_i (tr(X_i^T A))² | ),

where {ε_i}_{i=1}^{n} is an i.i.d. Rademacher sequence. Then, the contraction inequality (see Ledoux and Talagrand [21]) yields

E(Z_T) ≤ 8 c_φ √l √(T/ω_min) E( sup_{A ∈ C(r,T)} | (1/n) ∑_{i=1}^{n} ε_i tr(X_i^T A) | ),

where we have used (10) to bound sup_{A ∈ C(r,T)} |tr(X_i^T A)| ≤ c_φ √l √(T/ω_min).
Then, by duality between the nuclear and operator norms, we compute

E( sup_{A ∈ C(r,T)} | (1/n) ∑_{i=1}^{n} ε_i tr(X_i^T A) | ) ≤ E(‖Σ_R‖) sup_{A ∈ C(r,T)} ‖A‖_* ≤ E(‖Σ_R‖) √r √(T/ω_min),

where we used the definition of C(r) and (10). Finally, using an elementary inequality of the form 2xy ≤ ax² + y²/a and the concentration bound (31), we obtain the statement of the lemma, where c_3 = 1/128 as stated.

D Proof of Lemma 3
Using (19) we compute an upper bound on λ(‖Â‖_* − ‖A_0‖_*). The condition λ ≥ 3‖Σ‖, the triangle inequality and (21) then yield the required trace-norm bound, which implies the statement of the lemma.

E Bounds on the stochastic errors
In this section we obtain upper bounds on the stochastic errors ‖Σ‖ and ‖Σ_R‖. Recall that

Σ = (1/n) ∑_{i=1}^{n} (W_i^T ρ^{(l)}(t_i) + σ ξ_i) X_i   and   Σ_R = (1/n) ∑_{i=1}^{n} ε_i X_i,

where {ε_i}_{i=1}^{n} is an i.i.d. Rademacher sequence.
The following proposition is the matrix version of Bernstein's inequality in the bounded case (see Theorem 1.6 in [25]). Let Z_1, . . . , Z_n be independent random matrices with dimensions m_1 × m_2. Define

σ_Z = max { ‖ (1/n) ∑_{i=1}^{n} E(Z_i Z_i^T) ‖^{1/2},  ‖ (1/n) ∑_{i=1}^{n} E(Z_i^T Z_i) ‖^{1/2} }.

Proposition 1. Let Z_1, . . . , Z_n be independent random matrices with dimensions m_1 × m_2 that satisfy E(Z_i) = 0. Suppose that ‖Z_i‖ ≤ U for some constant U and all i = 1, . . . , n. Then, for all t > 0, with probability at least 1 − e^{−t} we have

‖ (1/n) ∑_{i=1}^{n} Z_i ‖ ≤ 2 max { σ_Z √((t + log d)/n),  U (t + log d)/n },

where d = m_1 + m_2. It is possible to extend this result to the sub-exponential case. Set

U_i = inf { K > 0 : E exp(‖Z_i‖/K) ≤ e }.

The following proposition is obtained by an extension of Theorem 4 in [18] to rectangular matrices via self-adjoint dilation (cf., for example, 2.6 in [25]).
Proposition 2. Let Z_1, . . . , Z_n be independent random matrices with dimensions m_1 × m_2 that satisfy E(Z_i) = 0. Suppose that U_i ≤ U for some constant U and all i = 1, . . . , n. Then there exists an absolute constant c_* such that, for all t > 0, with probability at least 1 − e^{−t} we have

‖ (1/n) ∑_{i=1}^{n} Z_i ‖ ≤ c_* max { σ_Z √((t + log d)/n),  U (log(U/σ_Z)) (t + log d)/n },

where d = m_1 + m_2.
We use Propositions 1 and 2 to prove Lemmas 1 and 4.

E.1 Proof of Lemma 1
Set Σ_1 = (1/n) ∑_{i=1}^{n} (W_i^T ρ^{(l)}(t_i)) X_i and Σ_2 = (1/n) ∑_{i=1}^{n} ξ_i X_i. Then we obtain Σ = Σ_1 + σ Σ_2. In order to derive an upper bound on ‖Σ_2‖, we apply Proposition 2 to Z_i = ξ_i X_i. We need to estimate σ_Z and U. Note that Z_i is a zero-mean random matrix such that

‖Z_i‖ = |ξ_i| ‖W_i φ^T(t_i)‖ = |ξ_i| ‖W_i‖_2 ‖φ(t_i)‖_2 ≤ c_φ √l |ξ_i|,

where we used condition (2) and ‖W‖_2 ≤ 1. Then Assumption 4 implies that there exists a constant K such that U_i ≤ K c_φ √l for all i = 1, . . . , n.
Let us estimate σ_Z for Z = ξ W φ^T(t). First we compute

(1/n) ‖ ∑_{i=1}^{n} E(Z_i Z_i^T) ‖ = ‖ E( ξ² ‖φ(t)‖²_2 W W^T ) ‖ = ‖ E(‖φ(t)‖²_2) Ω ‖ = l ω_max,

where we used E(ξ²) = 1, the independence of ξ, W and t, and the fact that E(‖φ(t)‖²_2) = l for an orthonormal basis.