Asymptotic theory of penalized splines

Abstract: The paper gives a unified study of the large sample asymptotic theory of penalized splines, including the O-splines, which use B-splines and an integrated squared derivative penalty [22], the P-splines, which use B-splines and a discrete difference penalty [13], and the T-splines, which use truncated polynomials and a ridge penalty [24]. Extending existing results for O-splines [7], it is shown that, depending on the number of knots and appropriate smoothing parameters, the L2 risk bounds of penalized spline estimators are rate-wise similar either to those of regression splines or to those of smoothing splines, and could each attain the optimal minimax rate of convergence [32]. In addition, the convergence rate of the L∞ risk bound, and the local asymptotic bias and variance, are derived for all three types of penalized splines.


Introduction
Penalized spline smoothing has become popular in the last two decades. The approach uses a flexible choice of bases and penalties and is often viewed as a bridge between regression splines and smoothing splines. Indeed, penalized splines exploit the mixed effects representation of smoothing splines but are computationally much simpler because they use low rank bases; penalized splines inherit the computational simplicity of regression splines but alleviate the overfitting of regression splines as they employ a smoothness penalty. Therefore, penalized splines have enjoyed widespread use in methodology development and applications. For example, penalized spline methods have been well developed in functional data analysis, e.g., [44,15,43,40]. Another example is penalized spline methods for generalized additive models [38]. See [24] for a comprehensive introduction to penalized splines. The paper [25] gives a review of penalized splines in the 2000s and, most recently, the paper [14] provides a review of P-splines [13], one popular type of penalized splines.
Theoretic understanding of penalized splines has largely lagged behind. The flexibility of penalized splines in terms of the usage of bases, the placement of knots and the choice of penalty dramatically increases the difficulty of their theoretic study. Indeed, the O-splines [22] use B-splines as bases and impose an integrated squared derivative penalty to control overfitting, the P-splines [13] also use B-splines but directly impose a penalty on the associated coefficients, and the T-splines [24] use truncated polynomials and impose a ridge-type penalty on the associated coefficients. While it seems to be a consensus in the literature that all three types of penalized splines have similar practical performance, existing theoretic works usually focus on one type. The paper [7] is a seminal work and shows that, for the L2 risk bound, depending on the number of knots and the penalty, O-splines behave similarly to either regression splines or smoothing splines. Specifically, when the penalty is appropriately small, the number of knots for constructing the B-spline bases determines the convergence rate, leading to asymptotics rate-wise similar to those of regression splines; but when the penalty is large, the number of knots does not matter as long as it is sufficiently large, and the penalty determines the convergence rate, leading to asymptotics rate-wise similar to those of smoothing splines. In particular, the optimal number of knots for the regression spline type asymptotics is rate-wise no bigger than, and can be much smaller than, that required for the smoothing spline type asymptotics; thus, for penalized splines, the regression spline type asymptotics is referred to as the small number of knots scenario while the smoothing spline type asymptotics is referred to as the large number of knots scenario. It is interesting to study whether such two-scenario asymptotics also hold for P-splines and T-splines.
Table 1 gives a summary of existing theoretic works on penalized splines. As mentioned earlier, the paper [7] derives the L2 risk bound of O-splines for both the small and the large number of knots. In addition, that work gives the local asymptotics of O-splines and T-splines for the small number of knots. The local asymptotics for the scenario of a large number of knots are quite challenging to derive, and equivalent kernel methods [30] have been adopted in the literature. Indeed, the papers [21], [37] and the unpublished [41] use equivalent kernel methods for P-splines, while the paper [27] studies O-splines. Note that all those works assume equally-spaced design points, which is quite stringent. Another note is that, for the large number of knots scenario, P-splines have a slower convergence rate near the boundary [37]. Similar boundary behaviors of O-splines and T-splines have not been established yet. Early theoretic works on penalized splines also include [17] and [18]. For generalized additive models, the paper [45] studies the local asymptotics of a ridge-corrected P-spline estimator. For longitudinal data, the paper [5] studies the local asymptotics and the L2 convergence rate of O-splines. Theoretic work on penalized splines in two dimensions includes [42], [20] and [39].
Many theoretic gaps, e.g., the convergence rate of the L∞ risk bound, still remain and the paper intends to fill a number of them. The checkmarks in Table 1 indicate the contributions of the paper. First, the L2 convergence rates of P-splines and T-splines shall be established for both scenarios, extending [7]. Second, the L∞ convergence rates of all three types of penalized splines shall be established, and the rates are optimal for the small number of knots scenario. Third, the local asymptotic bias and variance of P-splines and T-splines will also be given.

Table 1: Summary of theoretic works on penalized splines. The checkmarks indicate the major contributions of the paper. A ✓ marked with * means that the newly derived convergence rate for the bias might not be rate optimal. The details are provided in the theorems or remarks given in parentheses.
Now we list the three key results. These three results ensure that a unified theoretic study is attainable. Finally, the paper provides several new theoretic results regarding the approximation accuracy of B-splines, which are useful for the theoretic derivations; see Section 3. The rest of the paper is organized as follows. In Section 2, we introduce penalized splines. In Section 3, we consider the approximation accuracy of B-splines. In Section 4, we study the singular properties of penalized splines. In Section 5, we derive the L2 convergence rate of penalized splines. In Section 6, we derive the L∞ convergence rate of penalized splines. In Section 7, we derive the local asymptotic bias and variance of penalized splines. Proofs of theorems are provided in Section 8. Technical lemmas are given in Appendix A, the local variance of penalized splines is studied in Appendix B, and lower and upper risk bounds on the variance of penalized splines are derived in Appendix C.
We shall use the following notation convention. For a vector a = (a_k), ‖a‖_2 denotes its Euclidean norm and ‖a‖_max = max_k |a_k|. For a matrix A = (A_{kl}), ‖A‖_2 is its operator norm, ‖A‖_max = max_{k,l} |A_{kl}|, ‖A‖_F is its Frobenius norm, and ‖A‖_∞ = max_k Σ_l |A_{kl}| is its maximum absolute row sum. For two square matrices A and B, A ≤ B means that B − A is positive semidefinite. For a function g(x) defined over an interval T, we let ‖g‖ be its supremum norm over T, i.e., ‖g‖ = sup_{x∈T} |g(x)|, and we shall use g^{(i)}(x) to denote its i-th derivative. We also use the notation a ∼ b to denote that lim_{n→∞} a/b = c for some constant c > 0.

Penalized splines
Consider the nonparametric regression problem y_i = f(x_i) + e_i, i = 1, . . . , n, where the n design points x_i ∈ T = [0, 1] can be either deterministic or random, the y_i are the observed responses, and the e_i are random errors. Let p be a fixed positive integer. It is assumed that f ∈ C^p(T), the class of functions with continuous p-th derivatives over T. The aim is to estimate the unknown smooth function f by penalized splines. We introduce three types of penalized splines that are commonly used and then formulate a unified estimator that contains all of them.

O-splines
We first introduce splines [8]. A spline is a piecewise polynomial that is smoothly connected at its knots. More specifically, for a fixed integer m, denote by S(m, t) the set of spline functions of order m with knots t = {0 = t_0 < t_1 < . . . < t_{K_0+1} = 1}. For m = 1, S(m, t) is the set of step functions with jumps at the knots and, for m ≥ 2, S(m, t) is the set of piecewise polynomials of degree m − 1 that are m − 2 times continuously differentiable at the knots. A basis for S(m, t) can be formed by B-splines N_k(x), k = −(m − 1), . . . , K_0, so that there are K = K_0 + m basis functions in total; they can be defined recursively via the standard B-spline recursion, with the convention 0/0 = 0. Then any spline function s(x) ∈ S(m, t) can be written as s(x) = Σ_k θ_k N_k(x) with some scalars θ_k. The O-spline estimator [22] is defined to be the spline function minimizing Σ_{i=1}^n {y_i − s(x_i)}² + λ_O ∫_T {s^{(q)}(x)}² dx, where q < m is a fixed integer, s^{(q)} denotes the q-th derivative of s, and λ_O ≥ 0 is a smoothing parameter. For O-splines, it is assumed that t_{1−m} = · · · = t_{−1} = t_0 = 0 and t_K = · · · = t_{K_0+2} = t_{K_0+1} = 1. The derivative of a spline function is closely related to difference operators. Let Δ_{K,1} ∈ R^{(K−1)×K} be the first order difference operator such that, for a vector θ ∈ R^K, Δ_{K,1}θ = (θ_2 − θ_1, . . . , θ_K − θ_{K−1})ᵀ. For 1 < q < K, let Δ_{K,q} ∈ R^{(K−q)×K} be the q-th order difference operator defined recursively as Δ_{K,q} = Δ_{K−1,q−1}Δ_{K,1}. By the derivative formula for B-splines [46, equality (40)], the derivatives of a spline can be expressed through weighted differences of its coefficients: let Δ̃_{K,1,m} = W_K^{[m]}Δ_{K,1} ∈ R^{(K−1)×K} be a weighted first order difference operator, where W_K^{[m]} is a diagonal weight matrix determined by the knot spacings, and define recursively Δ̃_{K,q,m} = Δ̃_{K−1,q−1,m−1}Δ̃_{K,1,m}. To extend the definition to the case q = m for the T-splines to be introduced later, we let W_K^{[1]} = h⁻¹I. For simplicity, we suppress the dependence of the weighted operators Δ̃_{K,q,m} on t. We obtain that the penalty ∫_T {s^{(q)}(x)}² dx is a quadratic form θᵀO_qθ in the coefficient vector θ, with penalty matrix O_q. Let Y = (y_1, . . . , y_n)ᵀ ∈ R^n and N = [N(x_1), . . . , N(x_n)]ᵀ ∈ R^{n×K}. It follows that the minimizer of the O-spline criterion is θ̂_O = (NᵀN + λ_O O_q)⁻¹NᵀY.
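The difference-operator recursion above can be checked numerically. Below is a minimal numpy sketch (the dense matrices and all names are ours, for illustration only): the recursion Δ_{K,q} = Δ_{K−1,q−1}Δ_{K,1} reproduces q-fold consecutive differencing.

```python
import numpy as np

def first_diff(K):
    """First order difference operator Delta_{K,1} in R^{(K-1) x K}."""
    D = np.zeros((K - 1, K))
    for i in range(K - 1):
        D[i, i], D[i, i + 1] = -1.0, 1.0
    return D

def diff_op(K, q):
    """q-th order difference operator via the recursion
    Delta_{K,q} = Delta_{K-1,q-1} Delta_{K,1} from the text."""
    D = first_diff(K)
    for j in range(2, q + 1):
        D = first_diff(K - j + 1) @ D
    return D

K, q = 10, 3
theta = np.random.default_rng(0).normal(size=K)
# The recursion agrees with q-fold consecutive differencing:
assert np.allclose(diff_op(K, q) @ theta, np.diff(theta, n=q))
assert diff_op(K, q).shape == (K - q, K)
```

The dense construction is only for exposition; in practice these operators are applied as sparse matrices or via repeated differencing.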

P-splines
The P-splines [13] impose a penalty directly on the q-th order consecutive differences of the coefficient vector θ. Specifically, the P-spline estimator is also a spline function, obtained by minimizing Σ_i {y_i − s(x_i)}² + λ_P θᵀD_{K,q}θ, where D_{K,q} = Δᵀ_{K,q}Δ_{K,q} ∈ R^{K×K}, λ_P is also a smoothing parameter, and the set of spline functions S(m, t̃) is defined over equally-spaced knots, i.e., t̃ contains knots with t̃_i = i/(K_0 + 1), 1 − m ≤ i ≤ K. With slight abuse of notation, we still denote the corresponding B-spline basis functions by N(x). The difference in the bases is minor for the theoretic study, as we will discuss later. Then the P-spline estimator, denoted by f̂_P(x), takes the following form: f̂_P(x) = Nᵀ(x)(NᵀN + λ_P D_{K,q})⁻¹NᵀY. The difference penalty is effectively a smoothness penalty. Indeed, when θᵀD_{K,q}θ = 0, the resulting estimate reduces to a polynomial of degree q − 1.
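The P-spline estimator can be sketched in a few lines of numpy/scipy. The following is a minimal illustration on synthetic data; the sample size, knot count, noise level and smoothing parameter are arbitrary choices for the sketch, not values from the paper.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(1)
n, degree, K0, q = 200, 3, 16, 2      # cubic B-splines (m = 4), 2nd order difference penalty
x = np.sort(rng.uniform(size=n))
f = np.sin(2 * np.pi * x)
y = f + 0.1 * rng.normal(size=n)

# Equally spaced interior knots with repeated boundary knots; K = K0 + m basis functions
t = np.concatenate([np.zeros(degree), np.linspace(0, 1, K0 + 2), np.ones(degree)])
N = BSpline.design_matrix(x, t, degree).toarray()     # n x K B-spline design matrix
K = N.shape[1]

Delta = np.diff(np.eye(K), n=q, axis=0)               # q-th order difference operator
lam = 1e-4
theta = np.linalg.solve(N.T @ N + lam * Delta.T @ Delta, N.T @ y)
fhat = N @ theta
rmse = np.sqrt(np.mean((fhat - f) ** 2))
assert rmse < 0.1
```

With a small λ_P the fit is close to a regression spline; increasing λ_P shrinks the fit toward a polynomial of degree q − 1, as noted above.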

T-splines
Finally, we introduce the T-splines [24]. Let t be as in the O-splines and let F(x) = (1, x, . . . , x^{m−1}, (x − t_1)_+^{m−1}, . . . , (x − t_{K_0})_+^{m−1})ᵀ denote the truncated polynomial basis. The T-spline estimator f̂_T is obtained by penalized least squares with the ridge-type penalty λ_T θᵀĨ_{K,m}θ, where λ_T is a smoothing parameter and Ĩ_{K,m} = blockdiag(0_{m,m}, I_{K−m}), so that only the coefficients of the truncated polynomials are penalized. We derive that f̂_T(x) = Fᵀ(x)(FᵀF + λ_T Ĩ_{K,m})⁻¹FᵀY, where F = [F(x_1), . . . , F(x_n)]ᵀ. There exists an invertible transformation matrix L_{K,m} ∈ R^{K×K} that depends only on t and such that N(x) = Lᵀ_{K,m}F(x), where N(x) denotes the B-spline bases for O-splines. Thus, the T-spline estimator can be rewritten in the B-spline basis with penalty matrix D̃_{K,m} = Lᵀ_{K,m}Ĩ_{K,m}L_{K,m}. We derive an explicit expression for D̃_{K,m} (see Lemma A.8), which can be thought of as an extension of the penalty matrix O_q for O-splines to the case q = m. Indeed, O_q and the O-splines can only be defined for q < m.
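A minimal numerical sketch of the T-spline estimator with a truncated polynomial basis and a ridge penalty on the jump coefficients only; the data and tuning constants are illustrative choices, not from the paper.

```python
import numpy as np

def trunc_poly_basis(x, knots, m):
    """Truncated polynomial basis: 1, x, ..., x^{m-1}, (x - t_k)_+^{m-1}."""
    poly = np.vander(x, m, increasing=True)
    tp = np.clip(x[:, None] - knots[None, :], 0, None) ** (m - 1)
    return np.hstack([poly, tp])

rng = np.random.default_rng(2)
n, m = 200, 4
x = np.sort(rng.uniform(size=n))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

knots = np.linspace(0, 1, 18)[1:-1]       # K0 = 16 interior knots
F = trunc_poly_basis(x, knots, m)         # n x K design matrix, K = K0 + m
K = F.shape[1]
I_tilde = np.diag(np.r_[np.zeros(m), np.ones(K - m)])   # blockdiag(0_{m,m}, I_{K-m})
lam = 1e-6
theta = np.linalg.solve(F.T @ F + lam * I_tilde, F.T @ y)
fhat = F @ theta
rmse = np.sqrt(np.mean((fhat - np.sin(2 * np.pi * x)) ** 2))
assert rmse < 0.1
```

Note that the polynomial part of the basis is left unpenalized, matching the zero block in Ĩ_{K,m}; the truncated power basis is numerically ill-conditioned for large K, which is one practical reason for the B-spline reparameterization via L_{K,m}.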

A unified penalized spline estimator
Comparing the three penalized spline estimators in (2.4), (2.5) and (2.6), we see that the main difference between them is the penalty matrix. In Section 4, we shall show that the penalty matrices (after adjusting the corresponding smoothing parameters) have eigenvalues of similar decay rates, which paves the way for a unified theoretic study of all three estimators. This motivates us to consider the following estimator: f̂(x) = Nᵀ(x)(G_n + λP_q)⁻¹NᵀY/n, where G_n = NᵀN/n and P_q ∈ R^{K×K} is an arbitrary positive semi-definite matrix; two assumptions (Assumptions 3 and 4) on P_q will be made in Section 4. These two assumptions are satisfied by each type of penalized splines. We shall study the L2 convergence rate, the L∞ convergence rate, as well as the local asymptotics of the unified penalized spline estimator f̂, and then apply the theoretic results to the three types of penalized splines.
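The unified form differs from the raw penalized least squares form only by the 1/n normalization of the normal equations. A small numerical check of this algebraic equivalence, with a stand-in design matrix and a difference-based penalty (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 50, 8
N = rng.normal(size=(n, K))           # stand-in design matrix (any basis evaluations)
Y = rng.normal(size=n)
P = np.diff(np.eye(K), n=2, axis=0)
P = P.T @ P                           # a positive semi-definite penalty matrix
lam_P = 0.5

# Penalized least squares form: (N'N + lam_P P)^{-1} N'Y
theta1 = np.linalg.solve(N.T @ N + lam_P * P, N.T @ Y)
# Unified form with G_n = N'N/n and lam = lam_P / n: (G_n + lam P)^{-1} N'Y / n
G = N.T @ N / n
theta2 = np.linalg.solve(G + (lam_P / n) * P, N.T @ Y / n)
assert np.allclose(theta1, theta2)
```

Dividing the normal equations by n leaves the minimizer unchanged, so all three estimators are instances of the unified form with their respective P_q and rescaled smoothing parameters.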

Spline approximation
In this section we establish some necessary results on the approximation accuracy of a smooth function by splines. We make the following assumption.
We next specify some conditions on the placement of knots.
where M is a fixed constant. The same assumptions on the knots can be found in [46]. In many works, e.g., [3], the knots are assumed to be generated from a positive density, which will lead to (3.1). Therefore, (3.1) is slightly more general. Note that (3.1) implies that h ∼ K −1 , i.e., h and K −1 are rate-wise equivalent.
If the design points x = (x_1, . . . , x_n) are deterministic, we assume that ‖Q_n − Q‖ = o(h), where Q_n(x) is the empirical cumulative distribution function of the design points and Q(x) is a distribution with a positive and continuously differentiable density ρ(x). Assumption (3.2) is also common; see, e.g., [46].
Denote by B_k(x) the k-th Bernoulli polynomial, i.e., B_k(x) = Σ_{j=0}^{k} C(k, j) B_j x^{k−j}, where C(k, j) denotes the binomial coefficient and the B_j's are the Bernoulli numbers, satisfying B_0 = 1 and, for k ≥ 1, Σ_{j=0}^{k} C(k + 1, j) B_j = 0. We use the notation b_f^{(i)}(x), with values at the knots being the right derivatives.
Remark 3.1. The lemma is adapted from Lemma 1 of [4] and can easily be extended to prove that, if f ∈ C^p(T) with p < m, then there exists a spline s_f ∈ S(m, t) satisfying the analogous approximation bounds with m replaced by p. We denote the spline s_f(x) in Lemma 3.1 by s_f(x) = Σ_k β_k N_k(x) and let β = (β_1, . . . , β_K)ᵀ ∈ R^K. Lemma 3.2 below further characterizes the accuracy of s_f for approximating f.
For the random design, the design points x are randomly sampled from the distribution Q with density ρ.

Properties of the penalty matrices of penalized splines
In this section we establish properties of the penalty matrices of penalized splines. For smoothing splines, the eigenvalues of the penalty play a fundamental role in the study of asymptotic properties; see, e.g., [34] and [31]. Given the similarity between penalized splines and smoothing splines, it seems not unreasonable to believe that such an approach can be extended to the theoretic study of penalized splines. Indeed, a number of theoretic studies of penalized splines used results on eigenvalues for smoothing splines, albeit without a formal proof. However, smoothing splines use polynomial splines satisfying certain boundary conditions [36], while penalized splines do not satisfy those conditions (see, e.g., [37] for P-splines); hence a formal proof concerning the eigenvalues of the penalty matrices of penalized splines is needed but, as far as we are aware, does not exist. We shall fill this gap and establish that the three penalty matrices of the respective penalized spline estimators introduced in Section 2 have eigenvalues of similar decay rates. The assumptions on the knots and design points are summarized below.

Remark 4.1. (3.1) holds under Assumption 2 and
If r is a finite number, we say that A has a finite band. For a symmetric matrix A, denote by λ_k(A) its k-th smallest eigenvalue. We first derive the spectrum of the penalty matrix D_{K,q} for P-splines. Note that λ_k(D_{K,q}) = λ_k(O_q) = 0 for k ≤ q and λ_k(D̃_{K,m}) = 0 for k ≤ m. Proposition 4.1. Suppose that Assumption 1 holds. The matrix D_{K,q} is (2q)-banded and there exists a constant C_1 > 1 that depends only on q such that, for q + 1 ≤ k ≤ K,
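The eigenvalue structure of D_{K,q} asserted in Proposition 4.1 is easy to inspect numerically. The sketch below (our own illustration; the tolerance constants are arbitrary) checks that D_{K,q} has exactly q zero eigenvalues and that the remaining eigenvalues grow polynomially, compatibly with a ((k − q)/K)^{2q} rate.

```python
import numpy as np

def D(K, q):
    """Difference penalty matrix D_{K,q} = Delta' Delta."""
    d = np.diff(np.eye(K), n=q, axis=0)
    return d.T @ d

K, q = 60, 2
lam = np.sort(np.linalg.eigvalsh(D(K, q)))
# Exactly q zero eigenvalues: the null space consists of q-th order polynomial sequences
assert np.all(np.abs(lam[:q]) < 1e-8)
assert lam[q] > 0
# Growth compatible with lam_k of order ((k - q)/K)^{2q}: ratios stay bounded
k = np.arange(q + 1, K + 1)
ratio = lam[q:] / (((k - q) / K) ** (2 * q))
assert ratio.min() > 1e-3 and ratio.max() < 1e5
```

This is the same polynomial eigenvalue decay known for smoothing spline penalties, which is what enables the unified treatment.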

Proposition 4.2. Suppose that Assumptions 1 and 2 hold. The matrices O q and
There exists a constant C_2 > 1 that depends only on q and m such that, for q + 1 ≤ k ≤ K,

L2 convergence rate
The L2 convergence rate of O-splines was first proved in [7], and similar proofs can be adapted to establish the L2 convergence of the unified penalized spline estimator. Then the L2 convergence of P-splines and T-splines can be established. However, it is worth noting that the adaptation is not entirely trivial because O-splines only allow q < m, while T-splines use essentially an m-th order penalty and we also allow q = m for P-splines. Assumptions 3 and 4 below summarize the key properties of the penalty matrices of all three types of penalized splines derived in Section 4.

Assumption 3.
Suppose that P_q is a symmetric and positive semi-definite square matrix with a finite band that depends only on q and m and satisfies: λ_q(P_q) = 0 and there exists a constant C_3 > 1 that depends only on q and m such that, for q + 1 ≤ k ≤ K,

The singular values of P̃_q play an indispensable role in studying the L2 convergence of f̂(x) [7]. Lemma 5.1 can be derived from Assumption 3 and Lemma A.1.

Assumption 6.
The random errors e_i are independent of x_i and are i.i.d. with mean 0 and E|e_i|^τ < ∞ for some constant τ > 2.
We shall only give results for the fixed design. However, the extension to the random design is straightforward and the only additional assumption required is K = O(n^{δ*}) for some δ* ∈ (0, 1/2).
Remark 5.1. The first two terms correspond to the approximation bias of spline functions, the third term is the shrinkage bias due to the penalty, and the last term is the variance.

Remark 5.2. Assume in addition that
which measures the integrated variability of penalized splines, can be established; see Lemma C.1 in Appendix C.

Remark 5.3. As first observed in [7] for O-splines, depending on the number of knots K and the smoothing parameter λ, the convergence rates of penalized splines are similar to either those of regression splines or those of smoothing splines, giving rise to two-scenario asymptotics: the small number of knots scenario corresponding to regression spline asymptotics and the large number of knots scenario corresponding to smoothing spline asymptotics.
(a) (Small number of knots scenario). Suppose that the conditions in Theorem 5.1 hold. Then, with a suitable choice of K, the estimator attains the optimal rate of convergence n^{−2m/(2m+1)}. (b) (Large number of knots scenario). Suppose that the conditions in Theorem 5.1 hold and that f ∈ C^q(T). There exists a sufficiently large constant C that does not depend on K or n such that, if λK^{2q} ≥ C, then, with a suitable choice of λ, the estimator attains the optimal rate of convergence n^{−2q/(2q+1)}.
Because q ≤ m, the optimal number of knots K in (a) is rate-wise no bigger than, and can be much smaller than, that in (b), explaining why the two scenarios are referred to as the small number of knots scenario and the large number of knots scenario.
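The gap between the two scenarios can be illustrated with a quick numerical comparison; m = 4 (cubic splines) and q = 2 are illustrative choices, and the knot order shown for scenario (b) is only the order of magnitude suggested by the λK^{2q} ≥ C requirement, not a precise prescription from the paper.

```python
import numpy as np

m, q = 4, 2                              # illustrative orders: cubic splines, 2nd order penalty
n = np.array([1e3, 1e5, 1e7])
rate_a = n ** (-2 * m / (2 * m + 1))     # scenario (a) rate, requires f in C^m
rate_b = n ** (-2 * q / (2 * q + 1))     # scenario (b) rate, requires only f in C^q
K_a = n ** (1 / (2 * m + 1))             # rate-optimal knot order in (a)
K_b = n ** (1 / (2 * q + 1))             # knot order associated with (b)
# Under the stronger smoothness f in C^m, (a) achieves a faster rate with fewer knots:
assert np.all(rate_a < rate_b) and np.all(K_a < K_b)
```

For n = 10^5, for example, scenario (a) calls for roughly n^{1/9} ≈ 3.6 times fewer knots per order of magnitude than the n^{1/5} order in (b), while delivering the faster n^{−8/9} risk rate.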

L∞ convergence rate
In this section, we shall establish the L∞ convergence rate of penalized splines. Note that while bounds on the eigenvalues of P_q and P̃_q are useful for deriving the L2 convergence rate of penalized splines, they are not sufficient for studying the local and L∞ convergence of penalized splines. For example, to study the local variance, proper bounds on the diagonals of H_n⁻¹ and H_n⁻² are required, in addition to those in Section 4. Thus, we shall first derive the local asymptotic variances of the three types of penalized splines. Then, we derive a unified convergence rate that applies to all three penalized splines.

Local asymptotic variance
Then there exists a constant C_5 > 1 such that, for r = 1 and 2, Here h_e = max(h, λ^{1/(2q)}) with q = m. Proposition 6.3. Suppose that Assumptions 1, 2 and 5 hold. Let λ = λ_P h^{2q−1} and H_n = G_n + λ_P D_{K,q} with q ≤ m. Then there exists a constant C_7 > 1 such that, for r = 1 and 2,

Unified L ∞ convergence rate
The results in Section 6.1 can be summarized as follows for the unified penalized spline estimator.

Assumption 7.
There exists a constant C 8 > 1 such that, for r = 1 and 2, The results in Propositions 4.3, 4.4 and 4.5 can also be summarized as follows for the unified penalized spline estimator.
We also need the following assumption, which is common for establishing uniform convergence rates [6].
where τ is as in Assumption 6 [6]. Similar to [6], such an optimal rate can be achieved when τ ≥ 2 + 1/m because of Assumption 9. Thus, λ does not matter as long as it is small, and K, as in regression splines, serves as the smoothing parameter. Remark 6.6. If λ is sufficiently large, then h_e = λ^{1/(2q)}, the approximation bias of spline functions is negligible, and the L∞ convergence rate is determined by λ. If K ∼ (n/log n)^{1/(2q+1)} and λK^{2q} is sufficiently large, the L∞ convergence rate is {(log n)/n}^{q/(2q+1)}, the minimax optimal rate [12]. Because of Assumption 9, such an optimal rate can be achieved when τ ≥ 2 + 1/q. The derived rate for the asymptotic variance is optimal. However, the rate derived for the shrinkage bias is suboptimal, and we believe the optimal rate should be O(λ^{1/2}). Remark 6.8. To apply the unified L∞ rate to the three types of penalized splines, we just follow the specifications in Remarks 5.5, 5.6 and 5.7, respectively.

Local asymptotic bias and variance
The above results are the same as the local asymptotics of O-splines derived in [7] and also hold for P-splines and T-splines with the specifications in Remarks 5.6 and 5.7, respectively. If f ∈ C^m(T) and λ = o(K^{−(m+q)}), then the local asymptotics are the same as those for regression splines and hence are optimal.
Remark 7.2. If λK 2q is sufficiently large, then the discussion is similar to Remark 6.6 and the derived rates may be suboptimal.

Proofs
To simplify notation, we may use D for D K,q , Δ for Δ K,q and P for P q in the proofs.

Proofs for Section 3
Proof of Lemma 3.2. We first consider the case p = m.
By integration by parts, we obtain (8.2). We now consider the second right hand term in (8.1) and shall prove that (8.3) holds. By the definition of B-splines, w_k is non-zero only for finitely many k. Moreover, the bound holds uniformly with respect to k. It follows that (8.3) holds and, together with (8.1) and (8.2), the proof for the case p = m is complete. Now we assume that p < m and define the auxiliary function g accordingly. The proof is now complete.
Proof of Lemma 3.3. We first consider the fixed design. We derive that, by integration by parts, and hence, where the big O is uniform with respect to k; in the second to last equality we used the fact that N_k^{(1)}(·) = O(h⁻¹) uniformly with respect to k and that N_k^{(1)}(·) is non-zero only on an interval of length O(h) [26, Theorem 4.2]. It follows that, where we used the assumption ‖Q_n − Q‖ = o(h) (equation (3.2)).
For the random design, by Serfling [29, Theorem 2.1.4b], The proof is then similar to that for the fixed design.

Proofs for Section 4
Proof of Proposition 4.1. Let A_{K,q} = D_{K,1}^q − D_{K,q}. Note that A_{K,1} = 0. By [2, pp. 289–290], for q > 1, A_{K,q} is non-zero only in its top-left q × q and bottom-right q × q corner blocks, i.e., for elements with indices i, j ≤ q or K − i + 1, K − j + 1 ≤ q. Moreover, it is easy to verify that the non-zero elements in A_{K,q} depend only on q when K is sufficiently large. Thus, D_{K,q} and D_{K,1}^q should have similar eigenvalues (Lemma 8.2), and we can study the eigenvalues of D_{K,q} via those of D_{K,1}^q (Lemma 8.3). We first present three lemmas. Proof. The second claim is straightforward because A_{K,q} has at most 2q rows with non-zero elements. So we focus on the first claim. Let Ď_{K,1} = Δ_{K,1}Δ_{K,1}ᵀ. We first prove the claim for Ď_{K,1}, where the last term equals Δ_{K,1}ᵀA_{K−1,q−1}Δ_{K,1}. Therefore, the proof is complete by induction.
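The corner structure of A_{K,q} can be verified numerically; the following sketch (our own illustration) checks, for q = 2, that D_{K,1}^q − D_{K,q} vanishes outside the two q × q corner blocks.

```python
import numpy as np

def diff_penalty(K, q):
    """Difference penalty D_{K,q} = Delta_{K,q}' Delta_{K,q}."""
    d = np.diff(np.eye(K), n=q, axis=0)
    return d.T @ d

K, q = 12, 2
D1, Dq = diff_penalty(K, 1), diff_penalty(K, q)
A = np.linalg.matrix_power(D1, q) - Dq
# A_{K,1} = 0 and, for q > 1, A_{K,q} is non-zero only in the q x q corner blocks
mask = np.zeros((K, K), dtype=bool)
mask[:q, :q] = mask[K - q:, K - q:] = True
assert np.allclose(A[~mask], 0)
assert not np.allclose(A, 0)
```

Since the perturbation A_{K,q} has rank bounded by a constant depending only on q, Weyl's theorem transfers the eigenvalue bounds from D_{K,1}^q to D_{K,q}, as the proof exploits.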
Proof of Lemma 8.2. The proof follows from Weyl's Theorem [28, pp. 117], Lemma 8.1 and the fact that λ_j(A_{K,q}) = 0 for j ≤ K − 2q. Specifically, the right hand side of the inequality follows from D_{K,1}^q = D_{K,q} + A_{K,q} and the fact that A_{K,q} is positive semidefinite by Lemma 8.1. For the left hand side, by Weyl's Theorem,

The following lemma is adapted from Theorem 6.5.4 in [2].
It is easy to verify that (2/π)x ≤ sin x ≤ x for x ∈ [0, π/2]. Thus, the proof is complete if, for any K > q, λ_{q+1}(D_{K,q}) ≥ CK^{−2q} for some constant C > 0 that depends only on q. Note that λ_{q+1}(D_{K,q}) = λ_1(Δ_{K,q}Δ_{K,q}ᵀ). Hence, it suffices to show that (8.6) holds. We shall prove (8.6) by induction on q. By Lemma 8.3, we have Δ_{K,1}Δ_{K,1}ᵀ ≥ 4 sin²(π/(2K)) I_{K−1}. Since sin(π/(2K)) ≥ 1/K for any K > 1, (8.6) holds for q = 1. Note that, where in the last inequality (8.7) was used. Hence, by induction, (8.6) holds for any q and the proof is complete.

Proof of Proposition 4.2.
We prove the inequalities for the eigenvalues of O_q, as the proof for those of D̃_{K,m} is similar. Note that the weight matrix W_K is the same as h⁻¹I_{K−1} except for the first m − 2 diagonal elements. With slight abuse of notation, let P̄_q = Δ̃ᵀ_{K,q,m}Δ̃_{K,q,m}. Then it can be shown that h^{2q}P̄_q differs from D_{K,q} in at most the first k_{q,m} and the last k_{q,m} rows, where k_{q,m} is a finite constant that depends only on q and m. Then, with a proof similar to that of Lemma 8.1, it can be shown that there exists a constant c > 1 such that cD_{K,1}^q − h^{2q}P̄_q is positive semi-definite and that D_{K,1}^q − h^{2q}P̄_q is non-zero only in the first and last few rows. Hence a proof similar to that of Proposition 4.1 proves the same inequalities for h^{2q}P̄_q. For example, inequality (8.6) can be similarly proved. Note that by Remark A.1, there exist constants 0 < c < c̄ such that chI ≤ G ≤ c̄hI. Hence, chP̄_q ≤ O_q ≤ c̄hP̄_q, which implies that the eigenvalues of hP̄_q and O_q are rate-wise similar. Therefore, the inequalities for the eigenvalues of O_q are proved.

By definition, Δ̃_{K,q,m} is a sparse matrix with a finite band and each element is
By Lemma 3.2, we obtain that where O(h) is uniform with respect to k and hence (8.8) holds. Next, we derive that where the last equality holds by Lemma 3.1 and Remark 3.1. The proof is complete.

It follows that Note that where the big O is uniform with respect to k and the last equalities follow by Lemma 3.1 and the fact that The proof is complete.
Proof of Proposition 4.5. First note that Lemma 3.1 and Proposition 4.3 also hold when the boundary knots for O-splines are the same as those for P-splines. We first consider the case q < m. Then Δ̃_{K,q,m} = h^{−q}Δ_{K,q} and hence the two penalties coincide up to the factor h^{−2q}. The proof of Proposition 4.3 (see equation (8.8)) shows the required bound. It follows that the desired inequalities hold, and we have proved the cases with q < m.
We next derive that βᵀD̃_{K,m}β = γᵀD_{K−m+1,1}γ, and the rest of the proof is similar to that of Proposition 4.4.

Proofs for Section 5
Proof of Lemma 5.1.
Since both G_n^{−1/2} and P are symmetric positive semi-definite matrices, applying twice the inequalities (6.76) in [28, page 119], together with Assumption 3, proves the lemma.

Proof of Lemma 5.2. Let s̃_k denote the k-th smallest eigenvalue of P̃_q; then (1 + λs̃_k)⁻¹ is the k-th largest eigenvalue of (I + λP̃_q)⁻¹. Then (8.9) holds, where α = Nᵀ(f − s_f)/n. We first bound s_f − f. Consider first p = m. By Lemma 3.1, the bound follows. Therefore, for general p with p ≤ m, we obtain the corresponding bound. For the second right hand term in (8.9), note by (8.18) that ‖α‖_max = o(h^{p+1}). Thus, it follows from (8.10) that (8.13) holds. Denote the last right hand term in (8.13) by ξ. Since λP̃(I + λP̃)⁻²λP̃ ≤ (λP̃)², we derive the first bound on ξ. On the other hand, by (8.14) and the fact that λP̃(I + λP̃)⁻²λP̃ ≤ λP̃, we obtain the second bound. Thus, we obtain that ξ = O(min(λ²K^{2q}, λ)). (8.15) Combining (8.13) and (8.15), we derive the risk bound, where the last equality follows by Lemma 8.4; since inf_{x∈T} ρ(x) > 0, we obtain the desired bounds for E‖f̂ − f‖²_{L2}. Lemma 8.4. Suppose that Assumptions 1, 3 and 4 hold. Proof. Note that G_n⁻¹(Nᵀs_f/n) = β, where β is defined in Section 3. Hence the claim follows and the proof is complete.

Proofs for Section 6
Proofs of Propositions 6.1, 6.2 and 6.3. We first focus on the case where c_1 and c_2 are the constants in Lemma A.1. Similar to G_n, we may assume for the same two constants c_1 and c_2 that c_1hI ≤ G^{[m−q]} ≤ c_2hI. Let Ō_q = Δ̃ᵀ_{K,q−1,m}Δ̃_{K,q−1,m}. It follows that, for r = 1, 2, the stated relation holds by Lemma A.8. Because of Assumption 2, the matrix h^{2q}Ō_q is the same as D_{K,q} except for the first and the last few rows, and the matrix h^{2(m−1)}D̃_{K,m} is the same as D_{K,m} except for the first and the last few rows. As a result, an asymptotic study of (I + λ_P D_{K,q})⁻ʳ for r = 1, 2 is the key, and some minor technical modifications are sufficient to accommodate the differences in the first and last few rows. We first note that when λ_P = O(1), the corresponding h_e = O(1) and the diagonals of (I + λ_P D_{K,q})⁻ʳ are necessarily O(1), which proves the propositions. Therefore, we just need to focus on the case that λ_P ≥ C for some sufficiently large constant C; in this case there still exists a constant, say C̃ > 1, such that the required bounds hold. The desired results are given in Theorem B.1 in Appendix B.

Proof of Theorem 6.1. We consider the bias and variance of f̂(x) separately. By (8.9) and (8.12), we obtain the bias decomposition. As for the last right hand term in (8.19), note that γ = β + G_n⁻¹α. We now derive the orders of each of the two right hand terms in (8.21). By Assumption 8 and Lemma A.4, the first bound follows. On the other hand, matrix algebra shows that (λP)H_n⁻¹G_nH_n⁻¹(λP) ≤ λP. Thus, we have bounded the first right hand term. Now we work on the second right hand term in (8.21). Since p ≥ q, we obtain the corresponding bound. On the other hand, the proof of Lemma 8.4 derives that αᵀG_n⁻¹PG_n⁻¹α = o(1). Thus,

The above derivations prove that
where w_i(x) = Nᵀ(x)H_n⁻¹N(x_i)/n. We shall use the following results: for any constant r > 0, the equalities (8.25)–(8.27) hold; they will be derived at the end of the proof. Define also L_n = (nh_e/log n)^{1/2}.

Note that e_i1{|e_i| > L_n} ≤ L_n^{1−τ}|e_i|^τ, where τ > 2 is the constant in Assumption 9. It follows by (8.25) and the strong law of large numbers that the first bound holds. For the second term in (8.31), by Hölder's inequality and a derivation similar to that for (8.29), we obtain a bound in which the big O is uniform with respect to x and z. By equality (8.27), the last equality above follows from (8.30). Next we focus on the first term in (8.31). Note that, where C̃ is a constant, the last inequality follows from (8.26). In addition, by (8.25), there exists another constant C̄ > 0 such that the bound holds uniformly with respect to x. By Bernstein's inequality for bounded random variables, for any constant c > 0, the tail bound holds. We can choose c sufficiently large so that the above inequality is summable. By the Borel–Cantelli lemma [11, pp. 46], the supremum over x ∈ χ(r) is controlled. The above equality together with (8.31) and (8.32) leads to the desired result, where the latter inequality follows because N_k(x) ≥ 0, N(x_i) ≥ 0 and H_n⁻¹ is symmetric and positive definite. Since Σ_k N_k(x) = 1 and Σ_k N_k(x_i) = 1, we obtain that |w_i(x)| ≤ n⁻¹‖H_n⁻¹‖_max = O{(nh_e)⁻¹} by Assumption 7, and (8.25) is proved. For (8.26), we derive that

L. Xiao
where the last equality follows by Assumption 7 and Lemma A.1. Finally, for (8.27), since |N_k(x)| ≤ 1,

Proofs for Section 7
Proof of Theorem 7.1. We first establish the asymptotic bias of f̂(x). With derivations similar to those for the bias in the proof of Theorem 5.1 (specifically, equalities (8.9), (8.11) and (8.12)) and the derivation of the L∞ rate of the bias in the proof of Theorem 6.1 (specifically, equalities (8.21) and (8.23)), we obtain the stated bias expansion. Next we focus on the asymptotic variance of f̂(x). Again by Lemma A.6, the last equality holds by Assumption 7 and h_e = max{h, λ^{1/(2q)}}.
Otherwise, if λK^{2q} ≥ C for a sufficiently large C, then h_e = λ^{1/(2q)}. Thus, using an inductive proof, we obtain the desired equality and the proof is complete.

Appendix B: Local asymptotic variance of penalized splines
For η > 0, define Λ = I + ηD_{K,q}. In this appendix, we study Λ⁻¹ and derive its order of convergence in terms of η when η is large. The results can be used for studying the local asymptotic variance of P-splines. Note that essentially the same proof can be applied to I + ηD̃_{K,m} for T-splines and I + ηΔ̃ᵀ_{K,q−1,m}Δ̃_{K,q−1,m} for O-splines. With slight abuse of notation, we let h_e = η^{1/(2q)}K⁻¹. We assume that K → ∞ and that there exist two constants δ_1, δ_2 > 0 such that η > δ_1K^{δ_2}. In addition, we assume that η = O(K^{2q}) so that h_e = O(1). We also use C and C_0 to denote constants that depend only on q; to simplify notation, they are allowed to vary from place to place.
The main result is Theorem B.1.
Theorem B.1. Suppose that $\eta > C$ for a sufficiently large constant $C$. Then for $r = 1$ or $2$,

To prove Theorem B.1, we shall follow [41] and invert $\Lambda$ directly. Consider the equation

and let $\{\rho_\nu, \nu = 1, \dots, q\}$ be the $q$ roots of (B.1) such that, when $\eta$ is large, the real parts of the first $q$ roots are all positive and less than 1. Using a proof similar to that of Proposition 4.3 in [41], we derive that, when $\eta > C$ and $C$ is a sufficiently large constant,

where $\psi_1, \dots, \psi_q$ are the roots of $x^{2q} + (-1)^q = 0$ with positive real parts. Notice

Hence, we may assume that $k \le K/2$ in this section.
and $a = (a_1, \dots, a_q)^T \in \mathbb{R}^q$ is the vector of coefficients to be determined soon.
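The characteristic roots $\psi_\nu$ above can be checked numerically. The sketch below (illustrative only) confirms that $x^{2q} + (-1)^q = 0$ has exactly $q$ roots with positive real part, all lying on the unit circle:

```python
import numpy as np

# Roots of x^{2q} + (-1)^q = 0: all lie on the unit circle and,
# by symmetry, exactly q of them have positive real part.
for q in range(1, 6):
    coeffs = np.zeros(2 * q + 1)
    coeffs[0] = 1.0          # leading coefficient of x^{2q}
    coeffs[-1] = (-1) ** q   # constant term
    roots = np.roots(coeffs)
    assert np.allclose(np.abs(roots), 1.0)   # unit circle
    assert np.sum(roots.real > 1e-12) == q   # q roots with Re > 0
```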
Fix $k$ and assume that $k \ge q + 1$. For $1 \le \nu \le q$, it can be shown that $U_k(\rho_\nu)$ is orthogonal to all columns of $\Lambda$ except the first and last $q$ columns and the $j$th columns with $|k - j| < q$. Thus, for any $a$, the same holds for $S_k$. Since $S_k$ is linear in $a$, there exists an $a$ such that $S_k$ is also orthogonal to the $j$th columns with $k < j < k + q$ and $S_k^T \Lambda_k = 1$, where $\Lambda_k$ is the $k$th column of $\Lambda$. Note that it can be easily verified that $S_k$ is also orthogonal to $\Lambda_j$ with $\max(q+1, k-q) \le j < k$. Therefore, $S_k$ is orthogonal to all columns of $\Lambda$ except the first and last $q$ columns and the $k$th column. By the derivations in Section 4.2 in [41], we obtain that $a$ does not depend on $k$ and, moreover, when $\eta > C$ for a sufficiently large constant $C$,

In particular, $a$ satisfies the following equalities [41, p. 11]:

Note that both $U_1(\rho_\nu)$ and $U_K(\rho_\nu)$ are orthogonal to all columns of $\Lambda$ except the first $q$ and the last $q$ columns. Hence, there exist $a_{k\nu}$ and $\tilde{a}_{k\nu}$, $\nu = 1, \dots, q$, such that, if we define $R_k = \sum_{\nu=1}^q a_{k\nu} U_1(\rho_\nu)$ and $T_k = \sum_{\nu=1}^q \tilde{a}_{k\nu} U_K(\rho_\nu)$, then

where $e_k \in \mathbb{R}^K$ is the unit vector with $k$th element 1 and all others 0. Equation (B.5) implies that the $k$th column of $\Lambda^{-1}$ takes the form

Let $a_k = (a_{k1}, \dots, a_{kq})^T \in \mathbb{R}^q$ and $\tilde{a}_k = (\tilde{a}_{k1}, \dots, \tilde{a}_{kq})^T \in \mathbb{R}^q$. Assume now $k \le q$. There exist $a_k$ and $\tilde{a}_k$ such that

The existence of $a_k$ and $\tilde{a}_k$ follows from the fact that, for arbitrary $a_k$ and $\tilde{a}_k$, $R_k + T_k$ is orthogonal to all columns of $\Lambda$ except the first and the last $q$ columns, and that $\{U_1(\rho_1), \dots, U_1(\rho_q), U_K(\rho_1), \dots, U_K(\rho_q)\}$ are linearly independent.
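The construction above underlies a geometric off-diagonal decay of $\Lambda^{-1}$, driven by $|\rho_\nu| < 1$. The sketch below illustrates this numerically for a hypothetical second-order difference penalty; since the entries of the inverse oscillate (the $\rho_\nu$ are complex), segments of a column are compared in norm rather than entrywise:

```python
import numpy as np

def diff_matrix(K, q):
    """q-th order difference matrix of shape (K - q, K)."""
    D = np.eye(K)
    for _ in range(q):
        D = np.diff(D, axis=0)
    return D

# Hypothetical instance of Lambda = I + eta * D_{K,q} with q = 2.
K, q, eta = 200, 2, 1.0e4
Delta = diff_matrix(K, q)
Lam_inv = np.linalg.inv(np.eye(K) + eta * (Delta.T @ Delta))

k = K // 2
col = np.abs(Lam_inv[:, k])
near = np.linalg.norm(col[k + 10:k + 30])   # entries 10-29 bands away
far = np.linalg.norm(col[k + 50:k + 70])    # entries 50-69 bands away
assert far < 0.5 * near                     # geometric decay off the diagonal
```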

B.2. Proof of theorem
We first describe a few lemmas and propositions, whose proofs are given in Section B.3.

Lemma B.1. Assume that $\eta > C$ for a sufficiently large $C$. There exists a universal constant $C_0$ that depends only on $q$ and satisfies

for any $k \le K/2$ and $1 \le \nu \le q$.
and if $j \ge k$,

We first derive $a_k$ and $\tilde{a}_k$ in the following two propositions.
Proposition B.1. Suppose that $\eta > C$ for a sufficiently large constant $C$. Fix $k$ with $q + 1 \le k \le K/2$. Suppose that $a_k$ and $\tilde{a}_k$ satisfy (B.5). Then,

and there exists a universal constant $C_0$ that depends only on $q$ and satisfies

In particular, the big $O$ notation is uniform with respect to $q + 1 \le k \le K/2$.
where $\tilde{e}_q \in \mathbb{R}^q$ is the unit vector with $q$th element 1, and there exists a universal constant $C_0$ that depends only on $q$ and satisfies

Proof of Theorem B.1. We shall first establish that

Consider first that $q < k \le K/2$. By (B.6),

(B.14)

Since $|\rho_\nu| < 1$ for large $\eta$,

uniformly for $q < k \le K/2$. Now

uniformly for $q < k \le K/2$ and because $1 - |\rho_{\nu_1}\rho_{\nu_2}| \ge \tilde{C}\eta^{-\frac{1}{2q}}$ for some constant $\tilde{C}$. We can similarly bound the other terms in (B.14). A similar derivation can be done for $k \le q$ by (B.7) and Proposition B.2. We have now established (B.13).
It follows that for all $k$

and we have proved (B.15). The proof is complete by combining (B.13) and (B.15).
Similarly, we derive that

where $\Sigma_2$ is the bottom right $q \times q$ submatrix of $\Delta^T$ and $\tilde{\Sigma}_2$ is the bottom $q \times K$ submatrix of $\Delta$.

and then from (B.17) that

where the little $o$ is uniform for $k \le K/2$ and $C$ is a constant. Define $b_k = -A_2^{-1} A_1 a$. Then,

A matrix perturbation analysis shows that

where the big $O$ is uniform for $k \le K/2$. Define $c_k = -(\tilde{\Sigma}_1 B_1)^{-1} (\tilde{\Sigma}_1 B_k) a$, which gives

It can be derived that $\tilde{\Sigma}_1 B_1 = \Omega_1 L_1$, where $L_1$ is a $q \times q$ diagonal matrix with $\nu$th diagonal element $(-1)^q (1 - \rho_\nu)^q$ and $\Phi_1$ (as well as $\Phi_2$ used later) is defined in Section B.1. Similarly, by Lemma B.2, we derive that $\tilde{\Sigma}_1 B_k = \Omega_2 L_2 A_{k-1}$, where $L_2$ is a $q \times q$ diagonal matrix with $\nu$th diagonal element $(-1)^q (1 - \rho_\nu^{-1})^q$. Thus, (B.23) becomes

By row transformations on $\Omega_1$ and $\Omega_2$, we obtain from (B.24) that

where $\Phi_1$ and $\Phi_2$ are defined in Section B.1. By (B.2) and (B.3), a simple matrix perturbation analysis gives

By the proof of Proposition B.1, $\tilde{\Sigma}_1 B_1 = \Omega_1 L_1$. Let $R$ be the unique transformation matrix such that $R \Omega_1 L_1 = \Phi_1$; we derive that

where $\Xi$ is a $q \times q$ diagonal matrix with $j$th diagonal element $\eta^{-\frac{q-j+1}{2q}}$. This implies that

It is easy to verify that $\tilde{e}_q^T R (\Sigma_1)^{-1} = (-1)^{q+1} \mathbf{1}_q^T$. It follows that

In particular, $c$ can be chosen as $\frac{1}{64\pi}$.

Proof. It can be shown that $Z = e^T \Sigma e$, where $e = (e_1, \dots, e_n)^T$. Then $EZ = \sigma^2 \operatorname{tr}(\Sigma)$. Let $C = \|e_1\|_{\psi_2}$, the sub-Gaussian norm of the random variable $e_1$ (see Definition 5.7 in [35]). By the Hanson–Wright inequality [23], there exists an absolute constant $c > 0$ such that, for every $t > 0$,

Note that for Gaussian random variables, $C = \sigma\sqrt{2/\pi}$. Hence, we obtain the desired inequality. By going through the proofs in [23] and [35], we find that $c$ can be chosen as $\frac{1}{64\pi}$. The proof is now complete.
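As a quick Monte Carlo sanity check of $EZ = \sigma^2 \operatorname{tr}(\Sigma)$ (illustrative only; the matrix below is an arbitrary symmetric positive semidefinite matrix, not the $\Sigma$ from the proof):

```python
import numpy as np

# Monte Carlo check that E[e^T Sigma e] = sigma^2 * tr(Sigma) for
# i.i.d. N(0, sigma^2) errors; Sigma is an arbitrary symmetric PSD matrix.
rng = np.random.default_rng(0)
n, sigma = 30, 1.5
A = rng.standard_normal((n, n))
Sigma = (A @ A.T) / n

e = sigma * rng.standard_normal((20000, n))   # 20000 draws of the error vector
Z = np.einsum('bi,ij,bj->b', e, Sigma, e)     # quadratic forms e^T Sigma e

target = sigma**2 * np.trace(Sigma)
assert abs(Z.mean() - target) < 0.05 * target
```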
The following lemma is adapted from [19] using the volume-of-tube formula [33].