Quantile Processes for Semi- and Nonparametric Regression

A collection of quantile curves provides a complete picture of conditional distributions. Properly centered and scaled versions of the estimated curves at various quantile levels give rise to the so-called quantile regression process (QRP). In this paper, we establish weak convergence of the QRP in a general series approximation framework, which includes linear models with increasing dimension, nonparametric models and partial linear models. An interesting consequence is obtained for the last class of models, where the parametric and nonparametric estimators are shown to be asymptotically independent. Applications of our general process convergence results include the construction of non-crossing quantile curves and the estimation of conditional distribution functions. As a result of independent interest, we obtain a series of Bahadur representations with exponential bounds for the tail probabilities of all remainder terms. Bounds of this kind are potentially useful in analyzing statistical inference procedures under a divide-and-conquer setup.


Introduction
Quantile regression is widely applied in various scientific fields such as economics (Koenker and Hallock, 2001), biology (Briollais and Durrieu, 2014) and ecology (Cade and Noon, 2003). By focusing on a collection of conditional quantiles instead of a single conditional mean, quantile regression allows one to describe the impact of predictors on the entire conditional distribution of the response. A properly scaled and centered version of these estimated curves forms the underlying (conditional) quantile regression process (QRP, see Section 2 for a formal definition). The weak convergence of the QRP is useful for developing statistical inference procedures, such as hypothesis tests based on Hadamard-differentiable M- and L-estimators (Fernholz, 1983, Chapter 7), tests on conditional distributions (Bassett and Koenker, 1982) and the Wilcoxon test (van der Vaart and Wellner, 1996, Example 3.9.18). Applications in econometrics include the detection of treatment effects on the conditional distribution after an intervention (Koenker and Xiao, 2002; Qu and Yoon, 2015) and testing Gini indices (Barrett and Donald, 2009). See Remark 4.2 for more details.
The asymptotic behavior of the QRP depends on the model imposed for quantile regression. The existing literature on the QRP is either concerned with models of fixed dimension (Koenker and Xiao, 2002; Angrist et al., 2006) or with a linearly interpolated version based on kernel smoothing (Qu and Yoon, 2015). However, this excludes many important cases such as linear models with growing dimension and partial linear models. For such models, establishing weak convergence of the QRP becomes non-trivial, since classical Donsker theorems (e.g. those in van der Vaart and Wellner (1996)) may not be directly applicable. An additional challenge for partial linear models comes from the fact that their parametric and nonparametric components converge at different rates.
In this paper we consider a general model of the following (approximate) form
$$Q(x; \tau) \approx Z(x)^\top \gamma_n(\tau), \qquad (1.1)$$
where $Q(x; \tau)$ denotes the $\tau$-th quantile of the distribution of $Y$ conditional on $X = x \in \mathbb{R}^d$ and $Z(x) \in \mathbb{R}^m$ is a transformation vector of $x$. As noted by Belloni et al. (2016), this framework incorporates a variety of estimation procedures, including parametric (Koenker and Bassett, 1978), nonparametric (He and Shi, 1994) and semi-parametric (He and Shi, 1996) ones. For example, $Z(x) = x$ corresponds to a linear model (with potentially increasing dimension), while $Z(x)$ can be chosen as powers, trigonometric functions or local polynomials in a nonparametric basis expansion (where $m$ diverges at a proper rate). Partially linear and additive models are also covered by (1.1). Therefore, our weak convergence results are developed in a very broad context. Models of the form (1.1) were previously studied by Belloni et al. (2016) in a very general setting, which we also consider here. In the following, we give a detailed description of the main contributions of the present paper and compare them with Belloni et al. (2016).
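To make the series framework (1.1) concrete, the following sketch estimates a family of conditional quantile curves by quantile regression on a transformation vector $Z(x)$. The cubic-polynomial basis, the data-generating process and all dimensions are illustrative assumptions rather than part of the paper's setup.

```python
# Minimal sketch of the series quantile regression estimator in (1.1):
# approximate Q(x; tau) by Z(x)' gamma_n(tau), with Z(x) a (hypothetical)
# cubic polynomial basis. Data and tuning choices are illustrative only.
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + (0.5 + x) * rng.standard_normal(n)

# Transformation vector Z(x): polynomial basis of dimension m = 4.
Z = np.column_stack([np.ones(n), x, x**2, x**3])

taus = [0.1, 0.25, 0.5, 0.75, 0.9]
gamma_hat = {tau: QuantReg(y, Z).fit(q=tau).params for tau in taus}

# Estimated quantile curves Q_hat(x; tau) = Z(x)' gamma_hat(tau) on a grid.
xg = np.linspace(0.0, 1.0, 101)
Zg = np.column_stack([np.ones_like(xg), xg, xg**2, xg**3])
q_hat = {tau: Zg @ gamma_hat[tau] for tau in taus}
```

Centering and scaling the whole family of such estimated curves over $\tau$ is exactly what yields the QRP studied below.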

1. Partially linear models:
A key result of the present paper concerns partially linear models of the form $Q(x; \tau) = v^\top \alpha(\tau) + h(w; \tau)$, where $X = (V^\top, W^\top)^\top \in \mathbb{R}^{k+k'}$, $\alpha(\tau) \in \mathbb{R}^k$ is an unknown Euclidean parameter and $h(\cdot\,; \tau)$ is an unknown smooth function; here $k$ and $k'$ are both fixed. In the spirit of (1.1), we estimate $(\alpha(\tau), h(\cdot\,; \tau))$ based on the series approximation $h(w; \tau) \approx Z(w)^\top \beta(\tau)$, where $Z(w)$ is a transformation vector of $w$. We provide joint asymptotic results for the parametric and nonparametric parts of partially linear models [see Sections 3 and 5.3], with an $n^{1/2}$ scaling for the parametric part and a slower rate for the nonparametric part [see Theorems 3.1 and 5.4].
To the best of our knowledge, this is the first time that such joint asymptotics, as processes in $\tau$, have been established for quantile regression - in fact, even the pointwise result is new. In particular, we prove that the "joint asymptotics phenomenon" discovered by Cheng and Shang (2015) holds even for non-smooth loss functions with multivariate nonparametric covariates. This joint asymptotic result does not follow directly from the results of Belloni et al. (2016), because of the specific centering sequence they consider (defined in their equation (2.2)) and the matrix $J^{-1}(u)$ in their Theorem 2, where $J(u)$ is non-diagonal with increasing dimension. To derive our Theorem 3.1, it is necessary to choose an appropriate centering sequence (see Remark 3.3), to apply our new Bahadur representations from Section 5, and to provide a detailed analysis of the matrix $J^{-1}(u)$.

2. Centering and tail bounds on remainder terms in Bahadur representations:
Theorem 2 in Belloni et al. (2016) provides a Bahadur representation for the estimated coefficients $\hat\gamma_n$ centered at $\beta_n$, with $O_P$ bounds on the remainders, where $\beta_n$ minimizes a QR series approximation problem [see equation (2.2) in their paper]. In Section 5 we provide similar expansions, with the main difference that we allow for more general centering sequences satisfying certain approximation conditions. This is important in Section 3 of our paper, where we consider partially linear models. Moreover, we provide explicit exponential tail bounds on the corresponding remainders, which is a somewhat stronger result than the $O_P$ bounds in Belloni et al. (2016). This result is, for instance, utilized in Volgushev et al. (2017); the findings in that paper cannot be obtained from the $O_P$ bounds in Belloni et al. (2016).

3. Approximation by a sequence of Gaussian processes vs. convergence to a fixed limiting process:
All results in Belloni et al. (2016) which are uniform in the quantile index $\tau$ are stated in terms of approximating the quantile regression process, and weighted versions thereof, by a sequence of Gaussian processes which depend on $n$ [see their Theorems 5, 11 and 12]. In contrast, we show that there exists a single Gaussian process which is the weak limit of the leading term in the Bahadur representation. Showing convergence to this weak limit requires proving asymptotic tightness of the leading term, which is a major challenge in our proofs [see Section A.4] and does not follow from the approximation by a sequence of Gaussian processes as in Belloni et al. (2016). This is also a key ingredient in our Section 4, where we utilize the functional delta method together with compact differentiability of the rearrangement operator [established in Chernozhukov et al. (2010)]. Note that the application of the delta method requires convergence to a fixed limit, which does not follow directly from the results in Belloni et al. (2016). On the other hand, Belloni et al. (2016) provide approximations which are uniform in $x$ and $\tau$, while we only consider results that are pointwise in $x$.

4. New bounds for local basis functions:
Last but not least, in Sections 2.2 and 5.2 we provide results for models with a "local basis structure" (for instance, B-splines). For such basis functions, we show that the conditions on the model dimension can be relaxed from $m^4 = o(n^{1-\varepsilon})$ [required by Theorem 12 of Belloni et al. (2016)] to $m^2 (\log n)^6 = o(n)$ in the case of B-splines [see the discussion below Assumption (B1)].
Given the discussion above, we would also like to point out that Belloni et al. (2016) discuss other aspects, such as bootstrap approximations, which are not covered in our paper. In summary, both Belloni et al. (2016) and the present paper consider the same model setup but focus on different aspects of the resulting theory, and neither paper is more general than the other.
The rest of this paper is organized as follows. Section 2 presents the weak convergence of the QRP under a general series approximation framework. Section 3 discusses the QRP in quantile partial linear models. As an application of our weak convergence theory, Section 4 considers various functionals of the quantile regression process. A detailed discussion of our novel Bahadur representations is given in Section 5, and all proofs are deferred to the appendix.
Notation. Throughout, the distribution of $(X_i, Y_i)$ and the dimension $d$ can depend on $n$, i.e. we allow for triangular arrays. For brevity, let $Z = Z(X)$ and $Z_i = Z(X_i)$. Denote the empirical measure of $(Y_i, Z_i)$ by $P_n$ and the true underlying measure by $P$, with corresponding expectation $E$; the measure $P$ depends on $n$ in the triangular array case, but this dependence is omitted in the notation. Denote by $\|b\|$ the $L_2$-norm of a vector $b$, and by $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ the smallest and largest eigenvalues of a matrix $A$. Let $0_k$ denote the $k$-dimensional zero vector and $I_k$ the $k$-dimensional identity matrix, $k \in \mathbb{N}$. Define $\rho_\tau(u) := (\tau - \mathbf{1}(u \le 0))u$, where $\mathbf{1}(\cdot)$ is the indicator function. $C^\eta(\mathcal{X})$ denotes the class of $\eta$-continuously differentiable functions on a set $\mathcal{X}$, and $C(0,1)$ the class of continuous functions on $(0,1)$. For a vector $\gamma_n(\tau) \in \mathbb{R}^m$ we further define the approximation-error quantities ($c_n$, $g_n$ and $\chi_{\gamma_n}(u, Z)$ below) used throughout; here $\lfloor \eta \rfloor$ denotes the integer part of a real number $\eta$, and $|j| = j_1 + \dots + j_d$ for a $d$-tuple $j = (j_1, \dots, j_d)$. For simplicity, we sometimes write $\sup_\tau$ ($\inf_\tau$) and $\sup_x$ ($\inf_x$) instead of $\sup_{\tau \in \mathcal{T}}$ ($\inf_{\tau \in \mathcal{T}}$) and $\sup_{x \in \mathcal{X}}$ ($\inf_{x \in \mathcal{X}}$) throughout the paper.

Weak convergence results
In this section, we first present our weak convergence results for the QRP in a general series approximation framework that covers linear models with increasing dimension, nonparametric models and partial linear models. We then demonstrate that the use of polynomial splines with local support, such as B-splines, significantly weakens the sufficient conditions required in this general framework.

General series estimator
Consider a general series estimator $\hat{Q}(x; \tau) := \hat{\gamma}(\tau)^\top Z(x)$, where for each fixed $\tau$,
$$\hat{\gamma}(\tau) := \arg\min_{\gamma \in \mathbb{R}^m} \sum_{i=1}^n \rho_\tau(Y_i - Z_i^\top \gamma), \qquad (2.1)$$
and $m$ is allowed to grow as $n \to \infty$; assume conditions (A1)-(A3). In these assumptions, uniformity in $n$ is necessary since we consider triangular arrays. Assumptions (A2) and (A3) are fairly standard in the quantile regression literature, so we only comment on Assumption (A1). In linear models where $Z(X) = X$ and $m = d$, it holds that $\xi_m \asymp \sqrt{m}$ (with $\xi_m := \sup_{x \in \mathcal{X}} \|Z(x)\|$ from (A1)) if each component of $X$ is bounded almost surely. If the B-splines $B(x)$ defined in Section 4.3 of Schumaker (1981) are adopted, one needs to use the re-scaled version $\tilde{B}(x) = m^{1/2} B(x)$ as $Z(x)$ for (A1) to hold (cf. Lemma 6.2 of Zhou et al. (1998)); in this case we again have $\xi_m \asymp \sqrt{m}$. In addition, Assumptions (A1) and (A3) imply that, for any sequence of $\mathbb{R}^m$-valued (non-random) functions $\gamma_n(\tau)$, the eigenvalues of the corresponding Jacobian-type matrices are bounded away from zero uniformly in $\tau$ for all $n$.
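The rescaling just described can be checked numerically. The sketch below evaluates a cubic B-spline basis on $[0,1]$ via scipy, rescales it by $m^{1/2}$, and verifies that $\xi_m = \sup_x \|Z(x)\|$ grows like $\sqrt{m}$; the knot layout and spline order are illustrative assumptions.

```python
# Sketch: rescaled B-spline basis Z(x) = m^{1/2} B(x) and the resulting
# xi_m = sup_x ||Z(x)|| ~ sqrt(m), as discussed after (A1). Illustrative only.
import numpy as np
from scipy.interpolate import BSpline

degree = 3
interior = np.linspace(0.0, 1.0, 20)
# Clamped knot vector: repeat the boundary knots `degree` extra times.
t = np.concatenate([[0.0] * degree, interior, [1.0] * degree])
m = len(t) - degree - 1                      # number of basis functions

x = np.linspace(0.0, 1.0, 2001)
B = BSpline(t, np.eye(m), degree)(x)         # all basis functions at once
Z = np.sqrt(m) * B                           # re-scaled basis

xi_m = np.linalg.norm(Z, axis=1).max()       # empirical sup_x ||Z(x)||
print(m, xi_m, np.sqrt(m))                   # xi_m is of order sqrt(m)
```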
Define $J_m(\tau) := E\big[Z Z^\top f_{Y|X}(Q(X;\tau)|X)\big]$ and, for any $u \in \mathbb{R}^m$, the quantity $\chi_{\gamma_n}(u, Z)$ appearing in the statement below. We are now ready to state our weak convergence result for the QRP based on general series estimators.
Theorem 2.1. Suppose that (A1)-(A3) hold and that $m^3 \xi_m^2 (\log n)^3 = o(n)$. Then for any $u_n \in \mathbb{R}^m$ satisfying $\chi_{\gamma_n}(u_n, Z) = o(\|u_n\| n^{-1/2})$ and $\hat{\gamma}(\tau)$ defined in (2.1), the expansion (2.2) holds, where the remainder term is uniform in $\tau \in \mathcal{T}$. If, in addition, the limit (2.3) exists, then the weak convergence (2.4) holds, where $G(\cdot)$ is a centered Gaussian process with the covariance function $H$ defined by (2.3). In particular, there exists a version of $G$ with almost surely continuous sample paths.
The proof of Theorem 2.1 is given in Section A.1. Theorem 2.1 holds under very general conditions. For transformations Z that have a specific local structure, the assumptions on m, ξ m can be relaxed considerably. Details are provided in Section 2.2.
Finally, we illustrate Theorem 2.1 for linear quantile regression models with increasing dimension, in which $g_n$, $c_n$ and $\chi_{\gamma_n}(u, Z)$ are trivially zero. As far as we are aware, this is the first quantile process result for linear models with increasing dimension.
Corollary 2.2 (Linear models with increasing dimension). Suppose (A1)-(A3) hold with $Z(X) = X$ and $Q(x; \tau) = x^\top \gamma_n(\tau)$ for all $x$ and $\tau \in \mathcal{T}$, and assume that $m^3 \xi_m^2 (\log n)^3 = o(n)$. If, in addition, $u_n \in \mathbb{R}^m$ is such that the limit (2.5) exists for any $\tau_1, \tau_2 \in \mathcal{T}$, then (2.4) holds with the covariance function $H_1$ defined in (2.5). Moreover, setting $u_n = x_0$, we obtain the corresponding weak convergence for any fixed $x_0$, where $G(x_0; \cdot)$ is a centered Gaussian process with the covariance function in (2.5). In particular, there exists a version of $G$ with almost surely continuous sample paths.

Local basis series estimator
In this section, we assume that $Z(\cdot)$ corresponds to a basis expansion with "local" support. Our main motivation for considering this setting is that it allows us to considerably weaken the assumptions on $m$ and $\xi_m$ made in the previous section. To distinguish such basis functions from the general setting of the previous section, we use the notation $B$ instead of $Z$. Let
$$\hat{\beta}(\tau) := \arg\min_{b \in \mathbb{R}^m} \sum_{i=1}^n \rho_\tau(Y_i - B_i^\top b),$$
where $B_i = B(X_i)$. The notion of "local support" is made precise in the following sense.
(L) For each x, the basis vector B(x) has zeroes in all but at most r consecutive entries, where r is fixed.
The above assumption holds for certain choices of basis functions, e.g., univariate B-splines. Example 2.3. Let $\mathcal{X} = [0,1]$, assume that (A2)-(A3) hold and that the density of $X$ on $\mathcal{X}$ is uniformly bounded away from zero and infinity. Consider the space of polynomial splines of order $q$ with $k$ uniformly spaced knots $0 = t_1 < \dots < t_k = 1$ in the interval $[0,1]$. This space can be represented through linear combinations of basis functions $B_1, \dots, B_{k-q-1}$, each basis function $B_j$ having support contained in the interval $[t_j, t_{j+q+1})$. Let $B(x) := (B_1(x), \dots, B_{k-q-1}(x))^\top$. Then assumption (L) holds with $r = q$.
Condition (L) ensures that the matrix $J_m(\tau)$ has a band structure, which is useful for bounding the off-diagonal entries of $J_m(\tau)^{-1}$. See Lemma 6.3 in Zhou et al. (1998) for additional details.
Throughout this section, consider the specific centering
$$\beta_n(\tau) := \arg\min_{b \in \mathbb{R}^m} E\big[f_{Y|X}(Q(X; \tau)|X)\,\big(Q(X; \tau) - B^\top b\big)^2\big], \qquad (2.7)$$
where $B = B(X)$. For basis functions satisfying condition (L), the assumptions of Theorem 2.1 can be replaced by the following weaker version.
(B1) Assume that $\xi_m^4 (\log n)^6 = o(n)$ and that $c_n$ satisfies the conditions stated below. These requirements are mild in many situations. For instance, in the setting of Example 2.3, where $\xi_m \asymp \sqrt{m}$, we only require $m^2 (\log n)^6 = o(n)$, which is weaker than the condition $m^4 (\log n)^3 = o(n)$ in Theorem 2.1. This improvement is made possible by the local structure of the spline basis.
In the setting of Example 2.3, bounds on $c_n$ can be obtained provided the function $x \mapsto Q(x; \tau)$ is smooth for all $\tau \in \mathcal{T}$. For instance, recall the class of functions $\Lambda_c^\eta(\mathcal{X}, \mathcal{T})$ defined in (1.4). Assuming that $Q(\cdot\,; \cdot) \in \Lambda_c^\eta(\mathcal{X}, \mathcal{T})$ with $\mathcal{X} = [0,1]$ and a positive integer $\eta$, Remark B.1 shows that $c_n = O(m^{-\eta})$. Thus the condition $c_n^2 = o(n^{-1/2})$ holds provided that $m^{-2\eta} = o(n^{-1/2})$. Since for splines we have $\xi_m \asymp m^{1/2}$, this is compatible with the restrictions imposed in assumption (B1) provided that $\eta \geq 1$.

Theorem 2.4. Assume that the set $I$ consists of at most $L$ consecutive integers, where $L \geq 1$ is fixed. Then for any $u_n \in \mathbb{R}^m$ supported on $I$, (2.2) holds with $\hat{\gamma}(\tau)$, $\gamma_n(\tau)$ and $Z$ replaced by $\hat{\beta}(\tau)$, $\beta_n(\tau)$ and $B$. If, in addition, the limit (2.8) exists for any $\tau_1, \tau_2 \in \mathcal{T}$, then (2.4) holds with the same replacements, and the limit $G$ is a centered Gaussian process with covariance function $H$ defined by (2.8). Moreover, for any $x_0$, let $\hat{Q}(x_0; \tau) := B(x_0)^\top \hat{\beta}(\tau)$; under the corresponding conditions with $u_n = B(x_0)$, (2.9) holds, where $G(x_0; \cdot)$ is a centered Gaussian process with the covariance function in (2.8). In particular, there exists a version of $G$ with almost surely continuous sample paths.
The proof of Theorem 2.4 is given in Section A.2.
Remark 2.5. The proof of Theorem 2.4 and the related Bahadur representation in Section 5.2 rely crucially on the fact that the elements of $J_m(\tau)^{-1}$ decay exponentially fast in their distance from the diagonal, i.e. a bound of the form $|(J_m(\tau)^{-1})_{i,j}| \leq C \gamma^{|i-j|}$ for some $\gamma < 1$. Assumption (L) provides one way to guarantee such a result. We conjecture that similar results can be obtained for more general classes of basis functions as long as the entries of $J_m(\tau)^{-1}$ decay exponentially fast in their distance from suitable subsets of indices $(j, j') \in \{1, \dots, m\}^2$. Results of this kind are available for matrices $J_m(\tau)$ with specific sparsity patterns, see for instance Demko et al. (1984). In particular, we conjecture that such arguments can be applied to tensor-product B-splines, see Example 1 in Section 5 of Demko et al. (1984). A detailed investigation of this interesting topic is left to future research.
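The exponential off-diagonal decay invoked in Remark 2.5 can be visualized with a toy band matrix mimicking the structure that (L) induces in $J_m(\tau)$; the matrix below is an illustrative assumption, not the paper's $J_m(\tau)$.

```python
# Sketch: the inverse of a well-conditioned symmetric band matrix has entries
# decaying geometrically in |i - j|, as in the bound of Remark 2.5.
import numpy as np

m, r = 60, 3
J = np.zeros((m, m))
for i in range(m):
    for j in range(max(0, i - r), min(m, i + r + 1)):
        J[i, j] = 0.5 ** abs(i - j)          # band of width 2r
J += 2.0 * np.eye(m)                         # keep eigenvalues bounded below

Jinv = np.linalg.inv(J)
# Largest |entry| of the inverse at each off-diagonal distance d:
decay = [np.abs(np.diagonal(Jinv, offset=d)).max() for d in range(m)]
print(np.round(np.log(decay[:10]), 2))       # ~linear in d => geometric decay
```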
We conclude this section by discussing a special case where the limit in (2.9) can be characterized more explicitly.
Remark 2.6. The covariance function $H$ can be characterized explicitly for $u_n = B(x_0)$ and univariate B-splines $B(x)$ on $x \in [0,1]$ of order $r$ with equidistant knots $0 = t_1 < \dots < t_k = 1$. Assume, in addition to (A3), that (2.10) holds and that the density $f_X$ of $X$ is bounded above. Then, under $c_n = o(\|B(x_0)\| n^{-1/2})$, (2.9) in Theorem 2.4 can be rewritten as (2.11), where the Gaussian process $G(\cdot\,; x_0)$ is defined by the covariance function given there. Although we only treat the univariate case here, the same arguments are expected to extend to tensor-product B-splines by the same reasoning. See Section A.2 for a proof of this remark.

Joint weak convergence for partial linear models
In this section, we consider partial linear models of the form
$$Q(x; \tau) = v^\top \alpha(\tau) + h(w; \tau), \qquad (3.1)$$
where $X = (V^\top, W^\top)^\top \in \mathbb{R}^{k+k'}$ and $k, k' \in \mathbb{N}$ are fixed. An interesting joint weak convergence result is obtained for $(\hat{\alpha}(\tau), \hat{h}(w_0; \tau))$ at any fixed $w_0$. More precisely, $\hat{\alpha}(\tau)$ and $\hat{h}(w_0; \tau)$ (after proper scaling and centering) are proved to be asymptotically independent at any fixed $\tau \in \mathcal{T}$. Therefore, the "joint asymptotics phenomenon" first discovered in Cheng and Shang (2015) persists even for the non-smooth quantile loss function. Such a theoretical result is practically useful for joint inference on $\alpha(\tau)$ and $h(W; \tau)$; see Cheng and Shang (2015). Expanding $w \mapsto h(w; \tau)$ in terms of basis functions $w \mapsto Z(w)$, we can approximate (3.1) through the series expansion $Z(x)^\top \gamma_n^\dagger(\tau)$ by setting $Z(x) = (v^\top, Z(w)^\top)^\top$. In this section, $Z : \mathbb{R}^{k'} \to \mathbb{R}^m$ is a general basis expansion that need not satisfy the local support assumption of Section 2.2. Estimation takes the form $\hat{\gamma}^\dagger(\tau) := \arg\min_{\gamma} \sum_{i=1}^n \rho_\tau(Y_i - Z(X_i)^\top \gamma)$, in analogy to (2.1). For the theoretical analysis of $\hat{\gamma}^\dagger$, define the population coefficients $\gamma_n^\dagger(\tau) := (\alpha(\tau)^\top, \beta_n^\dagger(\tau)^\top)^\top$ with $\beta_n^\dagger(\tau)$ defined similarly to (2.7); see Remark 3.3 for additional explanations.
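A minimal sketch of this estimation strategy, assuming a univariate $W$ with a cubic B-spline expansion and a toy data-generating process (none of which is prescribed by the paper):

```python
# Sketch of the partial linear series estimator: Q(x; tau) = v'alpha(tau) +
# h(w; tau), fitted by quantile regression on Z(x) = (v', Z(w)')'.
import numpy as np
from scipy.interpolate import BSpline
from statsmodels.regression.quantile_regression import QuantReg

rng = np.random.default_rng(3)
n, tau = 2000, 0.5
v = rng.standard_normal(n)                      # parametric covariate V
w = rng.uniform(0.0, 1.0, n)                    # nonparametric covariate W
y = 1.5 * v + np.sin(2 * np.pi * w) + rng.standard_normal(n)

degree, interior = 3, np.linspace(0.0, 1.0, 10)
t = np.concatenate([[0.0] * degree, interior, [1.0] * degree])
m = len(t) - degree - 1
basis = BSpline(t, np.eye(m), degree)           # all B-spline basis functions

design = np.column_stack([v, basis(w)])         # Z(x) = (v', Z(w)')'
fit = QuantReg(y, design).fit(q=tau)
alpha_hat, beta_hat = fit.params[0], fit.params[1:]

w_grid = np.linspace(0.0, 1.0, 101)
h_hat = basis(w_grid) @ beta_hat                # estimate of h(w; tau)
print(alpha_hat)                                # close to 1.5 at tau = 0.5
```

Note that the B-spline basis functions sum to one, so no separate intercept is needed in the design matrix.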
To state our main result, we need a class of functions $\mathcal{U}_\tau$ and, for $V \in \mathbb{R}^k$, the functions $h_{VW,j}$, $j = 1, \dots, k$, where $V_j$ denotes the $j$-th entry of the random vector $V$. By the definition of $h_{VW,j}$, we have for all $\tau \in \mathcal{T}$ and $g \in \mathcal{U}_\tau$ the orthogonality condition (3.5). The matrix $A$ is defined as the coefficient matrix of the best series approximation of $h_{VW}(W; \tau)$ in (3.6). The following two assumptions are needed for our main results.
(C2) We have $\max_{j \leq k} |V_j| < C$ almost surely for some constant $C > 0$.
Bounds on $c_n^\dagger$ can be obtained under various assumptions on the basis expansion and the smoothness of the function $w \mapsto h(w; \tau)$. Assume, for instance, that $\mathcal{W} = [0,1]^{k'}$, that $h(\cdot\,; \cdot) \in \Lambda_c^\eta(\mathcal{W}, \mathcal{T})$, that $Z$ corresponds to a tensor-product B-spline basis of order $q$ on $\mathcal{W}$ with $m^{1/k'}$ equidistant knots in each coordinate, and that $(V, W)$ has a density $f_{V,W}$ bounded away from zero and infinity. Assumption (3.8) essentially states that $h_{VW}$ can be approximated sufficiently well by a series estimator. This assumption is necessary to ensure that $\alpha(\tau)$ is estimable at a parametric rate without under-smoothing in the estimation of $h(\cdot\,; \tau)$. In general, (3.8) is a non-trivial high-level assumption. It can be verified under smoothness conditions on the joint density of $(X, Y)$ by applying arguments similar to those in Appendix S.1 of Cheng et al. (2014).
(B1') Assume the growth condition on $m$, $\xi_m$ and $\log n$ displayed above; moreover, assume that $c_n^\dagger \lambda_n = o(n^{-1/2})$ and $m c_n^\dagger \log n = o(1)$. We are now ready to state the main result of this section.
Theorem 3.1. Under the above assumptions, (3.10) holds, and the multivariate process $(G_1(\cdot), \dots, G_k(\cdot), G_h(\cdot))$ has the covariance function $\Gamma$ defined in (3.11). In addition, at any fixed $w_0 \in \mathbb{R}^{k'}$, let $w_n = Z(w_0)$ satisfy the above conditions; then $G_h(w_0; \cdot)$ is defined through the limit in (3.9) with $w_n$ replaced by $Z(w_0)$. In particular, there exists a version of $G_h(w_0; \cdot)$ with almost surely continuous sample paths.
The proof of Theorem 3.1 is presented in Section A.3. The invertibility of the matrices $M_1(\tau)$ and $M_2(\tau)$ is discussed in Remark 5.5. In general, $\hat{\alpha}(\tau)$ is not semiparametrically efficient, as its asymptotic covariance matrix $\tau(1-\tau)\Gamma_{11}$ does not achieve the efficiency bound given in Section 5 of Lee (2003).
The joint process convergence result (in $\ell^\infty(\mathcal{T})$) presented in Theorem 3.1 is new in the quantile regression literature. The block structure of the covariance function $\Gamma$ defined in (3.11) implies that $\hat{\alpha}(\tau)$ and $\hat{h}(w_0; \tau)$ are asymptotically independent for any fixed $\tau$. This effect was recently discovered by Cheng and Shang (2015) in the case of mean regression and named the "joint asymptotics phenomenon".
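This asymptotic independence can be checked in a small simulation: across Monte Carlo replications, the sample correlation between $\hat{\alpha}(\tau)$ and $\hat{h}(w_0; \tau)$ should be close to zero. The design below ($V$ independent of $W$, cubic B-splines, 200 replications) is an illustrative assumption.

```python
# Monte Carlo sketch of the asymptotic-independence claim of Theorem 3.1:
# the correlation of alpha_hat(tau) and h_hat(w0; tau) across replications
# should be near zero. All design choices are illustrative assumptions.
import numpy as np
from scipy.interpolate import BSpline
from statsmodels.regression.quantile_regression import QuantReg

rng = np.random.default_rng(4)
tau, n, reps, w0 = 0.5, 800, 200, 0.5
degree, interior = 3, np.linspace(0.0, 1.0, 6)
t = np.concatenate([[0.0] * degree, interior, [1.0] * degree])
m = len(t) - degree - 1
basis = BSpline(t, np.eye(m), degree)
Zw0 = basis(np.array([w0]))[0]                  # basis vector Z(w0)

alphas, hs = [], []
for _ in range(reps):
    v = rng.standard_normal(n)
    w = rng.uniform(0.0, 1.0, n)
    y = 1.5 * v + np.sin(2 * np.pi * w) + rng.standard_normal(n)
    params = QuantReg(y, np.column_stack([v, basis(w)])).fit(q=tau).params
    alphas.append(params[0])
    hs.append(Zw0 @ params[1:])

print(np.corrcoef(alphas, hs)[0, 1])            # near zero, as predicted
```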
Remark 3.2. We point out that $E|w_n^\top M_2(\tau_2)^{-1} Z(W)| = o(\|w_n\|)$ is a crucial sufficient condition for the asymptotic independence between the parametric and nonparametric parts. We conjecture that this condition is also necessary. The condition holds, for example, for $w_n = Z(w_0)$ or $w_n = \partial_{w_j} Z(w_0)$ at a fixed $w_0$, $j = 1, \dots, k'$, where $Z(w)$ is a vector of B-spline basis functions. However, it may fail for other choices of $w_n$. Consider, for instance, the case $\mathcal{W} = [0,1]$, B-splines $Z$ of order zero and the vector $w_n = \int_0^\lambda Z(w)\,dw$ for some $\lambda > 0$. In this case $\|w_n\| \asymp 1$, and one can show that $E|w_n^\top M_2(\tau_2)^{-1} Z(W)| \asymp 1$ instead. A more detailed investigation of related questions is left to future research.

Remark 3.3.
A seemingly more natural choice for the centering, which was also considered in Belloni et al. (2016), is the vector $\gamma_n^*(\tau)$ minimizing the population QR series approximation problem, which gives $g_n(\gamma_n^*(\tau)) = 0$. However, a major drawback of centering with $\gamma_n^*(\tau)$ is that it is impossible to find a good bound for the difference $\alpha_n^*(\tau) - \alpha(\tau)$ without imposing restrictive assumptions. Yet such a bound is needed to show that the bias of $\alpha_n^*(\tau)$ is of order $o(n^{-1/2})$, which is required to establish (3.10) in Theorem 3.1.

Applications of weak convergence results
In this section, we consider applications of the process convergence results to the estimation of conditional distribution functions and of non-crossing quantile curves via rearrangement operators. For the former, define the functional (see Dette and Volgushev (2008), Chernozhukov et al. (2010) or Volgushev (2013) for similar ideas)
$$\Phi(f)(y) := \tau_L + \int_{\tau_L}^{\tau_U} \mathbf{1}\{f(\tau) \leq y\}\, d\tau.$$
A simple calculation shows that $\Phi(Q(x; \cdot))(y) = F_{Y|X}(y|x)$ for $y$ in the range of $\tau \mapsto Q(x; \tau)$ on $\mathcal{T}$. The latter identity motivates the estimator $\hat{F}_{Y|X}(y|x) := \Phi(\hat{Q}(x; \cdot))(y)$ of the conditional distribution function, where $\hat{Q}(x; \tau)$ denotes the estimator of the conditional quantile function in any of the three settings discussed in Sections 2 and 3. Following the arguments in Chernozhukov et al. (2010), one can show that, under suitable assumptions, the functional $\Phi$ is compactly differentiable (see Section A.5 for more details). Hence, the general process convergence results in Sections 2 and 3 make it easy to establish the asymptotic properties of $\hat{F}_{Y|X}$ - see Corollary 4.1 at the end of this section.
The second functional of interest is the monotone rearrangement operator, defined as
$$\Psi(f)(\tau) := \inf\{y \in \mathbb{R} : \Phi(f)(y) \geq \tau\}.$$
The main motivation for considering $\Psi$ is that the function $\tau \mapsto \Psi(f)(\tau)$ is non-decreasing by construction. Thus, for any initial estimator $\hat{Q}(x; \cdot)$, its rearranged version $\Psi(\hat{Q}(x; \cdot))(\tau)$ is an estimator of the conditional quantile function which avoids the issue of quantile crossing. For a more detailed discussion of rearrangement operators and their use in avoiding quantile crossing, we refer the interested reader to Dette and Volgushev (2008) and Chernozhukov et al. (2010).
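A minimal sketch of both functionals on a grid, assuming a toy (non-monotone) input curve $\tau \mapsto \hat{Q}(x_0; \tau)$: $\Phi$ integrates the indicator $\mathbf{1}\{\hat{Q}(\tau) \leq y\}$ over $\tau$ to produce a conditional-CDF estimate, and $\Psi$ inverts it, which automatically produces a non-decreasing quantile curve.

```python
# Sketch of Phi (conditional CDF construction) and Psi (monotone rearrangement)
# on a grid over [tau_L, tau_U]; the non-monotone input curve is a toy example.
import numpy as np

tau_L, tau_U = 0.1, 0.9
taus = np.linspace(tau_L, tau_U, 801)
q_hat = np.sin(3 * taus) + taus              # toy estimated curve, non-monotone

def Phi(q_curve, y):
    """F_hat(y) = tau_L + integral over tau of 1{q_curve(tau) <= y}."""
    return tau_L + (tau_U - tau_L) * np.mean(q_curve <= y)

def Psi(q_curve, tau):
    """Rearranged quantile curve: inf{y : Phi(q_curve)(y) >= tau}."""
    ys = np.sort(q_curve)
    Fs = np.array([Phi(q_curve, y) for y in ys])   # non-decreasing in y
    return ys[np.searchsorted(Fs, tau)]

rearranged = np.array([Psi(q_hat, t) for t in np.linspace(0.2, 0.8, 13)])
print(np.all(np.diff(rearranged) >= 0))      # True: no quantile crossing
```

On a uniform $\tau$-grid, $\Psi$ effectively sorts the values of the input curve, which is why the output cannot cross.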
Corollary 4.1 (Convergence of $\hat{F}(y|x)$ and $\Psi(\hat{Q}(x; \tau))$). For any fixed $x_0$ and an initial estimator $\hat{Q}(x_0; \cdot)$, the weak convergence of $a_n(\hat{F}(\cdot|x_0) - F(\cdot|x_0))$ and of $a_n(\Psi(\hat{Q}(x_0; \cdot)) - Q(x_0; \cdot))$ holds uniformly on compact sets, where $\hat{Q}(x_0; \cdot)$, the normalization $a_n$ and the process $G(x_0; \cdot)$ are taken from the corresponding setting: Corollary 2.2 for linear models with increasing dimension, Theorem 2.4 for local basis series estimators, or Theorem 3.1 for partial linear models. The proof of Corollary 4.1 is a direct consequence of the functional delta method, combined with the process convergence results established in Sections 2 and 3 and the Hadamard differentiability results for certain functionals established in Chernozhukov et al. (2010). Details can be found in Section A.5.
Remark 4.2 (More statistical applications of Corollary 4.1). In practice, many quantities of interest can be written as Hadamard-differentiable functionals of distribution functions, such as certain M- and L-estimators (Fernholz, 1983, Chapter 7), conditional distributions (Bassett and Koenker, 1982), the Wilcoxon test statistic (van der Vaart and Wellner, 1996, Example 3.9.18) and Gini indices (Barrett and Donald, 2009). By the chain rule for Hadamard derivatives, Corollary 4.1 can be applied to prove the asymptotic normality of such estimators. Moreover, the detection of treatment effects on the conditional distribution after an intervention (Koenker and Xiao, 2002; Qu and Yoon, 2015) is often based on Kolmogorov-Smirnov or Cramér-von Mises statistics, whose asymptotic distributions can also be derived by applying Corollary 4.1.

Bahadur representations
In this section, we provide Bahadur representations for the estimators discussed in Sections 2 and 3. Sections 5.1 and 5.2 state Bahadur representations for general series estimators and for the more specific choice of local basis functions, respectively; the latter representation comes with an improved remainder term. Section 5.3 contains a special case of the general theorem in Section 5.1 that is tailored to partial linear models. The remainders in all these representations are shown to have exponentially decaying tail probabilities (uniformly over $\mathcal{T}$).

A fundamental Bahadur representation
Our first result gives a Bahadur representation for $\hat{\gamma}(\tau) - \gamma_n(\tau)$ for centering functions $\gamma_n$ satisfying certain conditions. Recall the definition of $\hat{\gamma}(\tau)$ in (2.1). A representation of this kind for quantile regression with an increasing number of covariates was previously established in Theorem 2 of Belloni et al. (2016). Compared to their results, the Bahadur representation given in Theorem 5.1 has several advantages. First, we allow for a more general centering, which is helpful for the analysis of partial linear models (see Sections 3 and C.2). Second, we provide exponential tail bounds on the remainder terms, which are more explicit and sharper than those in Belloni et al. (2016).
The remainder terms $r_{n,j}$ can be bounded as follows: for any $\kappa_n \lesssim n/\xi_m^2$, all sufficiently large $n$, and a constant $C$ independent of $n$, the suprema over $\tau \in \mathcal{T}$ of the remainders exceed the stated bounds with probability at most $e^{-\kappa_n}$. The proof of Theorem 5.1 can be found in Section C.1.
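The content of such a representation can be illustrated numerically in a toy fixed-dimension linear model: $\sqrt{n}(\hat{\gamma}(\tau) - \gamma_n(\tau))$ should be close to the leading linear term $J_m(\tau)^{-1} n^{-1/2} \sum_i Z_i(\tau - \mathbf{1}\{Y_i \leq Z_i^\top \gamma_n(\tau)\})$. Gaussian errors and all design choices below are assumptions made for the sketch, not the paper's general setting.

```python
# Numerical sketch of the Bahadur representation: the estimation error of the
# QR coefficient is approximated by a normalized sum of sign-type terms.
import numpy as np
from scipy.stats import norm
from statsmodels.regression.quantile_regression import QuantReg

rng = np.random.default_rng(1)
n, tau = 20000, 0.3
X = rng.uniform(-1, 1, n)
Z = np.column_stack([np.ones(n), X])            # Z(x) = (1, x)'
gamma = np.array([1.0, 2.0])
Y = Z @ gamma + rng.standard_normal(n)          # eps ~ N(0,1), indep. of X

gamma_tau = gamma + np.array([norm.ppf(tau), 0.0])   # true QR coefficient
gamma_hat = QuantReg(Y, Z).fit(q=tau).params

# Here J_m(tau) = E[f_{Y|X}(Q(X;tau)|X) Z Z'] = phi(z_tau) * E[Z Z'].
J = norm.pdf(norm.ppf(tau)) * (Z.T @ Z) / n
leading = np.linalg.solve(J, Z.T @ (tau - (Y <= Z @ gamma_tau))) / np.sqrt(n)

print(np.sqrt(n) * (gamma_hat - gamma_tau))     # approximately equal...
print(leading)                                  # ...to the leading term
```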

Bahadur representation for local basis series estimator
In this section, we focus on basis expansions B satisfying (L) and derive a Bahadur representation for linear functionals of the form u n ( β(τ ) − β n (τ )), where the vector u n can have at most a finite number of consecutive non-zero entries. Such linear functionals are of interest since the estimator of the quantile function itself as well as estimators of derivatives can be represented in exactly this form -see Remark 5.3 for additional details. The advantage of concentrating on vectors with this particular structure is that we can substantially improve the rates of remainder terms compared to the general setting in Theorem 5.1.
Assume additionally that $m \xi_m^2 (\log n)^2 = o(n)$, that $c_n = o(1)$, and that $I \subset \{1, \dots, m\}$ consists of at most $L$ consecutive integers. Then, for $\beta_n(\tau)$ defined in (2.7) and $u_n \in S^{m-1}$ supported on $I$, the stated expansion holds, where the remainder terms $r_{n,j}$ can be bounded in terms of $\mathcal{E}(u_n, B) := \sup_\tau E|u_n^\top J_m(\tau)^{-1} B|$. Moreover, for any $\kappa_n \lesssim n/\xi_m^2$, all sufficiently large $n$, and a constant $C$ independent of $n$, exponential tail bounds for the suprema over $\tau \in \mathcal{T}$ hold, each with probability at least $1 - e^{-\kappa_n}$. Theorem 5.2 is proved in Section C.2. We note that, by Hölder's inequality and assumptions (A1)-(A3), a simple bound for $\mathcal{E}(u_n, B)$ is available.

Remark 5.3. Theorem 5.2 enables us to study several quantities associated with the quantile function $Q(x; \tau)$. For instance, consider the spline setting of Example 2.3. Setting $u_n = B(x)/\|B(x)\|$ in Theorem 5.2 yields a representation for $\hat{Q}(x; \tau)$, while setting $u_n = B'(x)/\|B'(x)\|$ yields a representation for the estimator of the derivative $\partial_x Q(x; \tau)$. Uniformity in $x$ follows once we observe that, for different values of $x$, the support of the vector $B(x)$ is always consecutive, so that there are at most $n^l$, for some $l > 0$, different sets $I$ that we need to consider.

Bahadur representation for partial linear models
In this section, we provide a joint Bahadur representation for the parametric and nonparametric parts of this model. Recall the partial linear model $Q(x; \tau) = v^\top \alpha(\tau) + h(w; \tau)$ from (3.1). Theorem 5.4 states the corresponding expansion, where the remainder terms $r_{n,j}$ satisfy the bounds given in Theorem 5.1. See Section C.3 for the proof of Theorem 5.4.

Remark 5.5. We discuss the positive definiteness of $M_1(\tau)$ and $M_2(\tau)$. Following Condition (A1) with $Z = (V^\top, Z(W)^\top)^\top$, we obtain a block representation in which $I_k$ is the $k$-dimensional identity matrix and $[A|B]$ denotes the block matrix with $A$ in the left block and $B$ in the right block; its form follows from the definition and condition (3.5) (see the proof of Theorem 5.4 for more details). Thus, for an arbitrary nonzero vector $a \in \mathbb{R}^k$, Condition (A1) yields the required lower bound. The strict positive definiteness of $M_2(\tau)$ follows directly from the observation that $b^\top M_2(\tau) b \geq \|b\|^2/M > 0$ for all nonzero $b \in \mathbb{R}^m$ and some $M > 0$.

Appendix
This appendix provides the technical details for the results in the main text. Appendix A contains the proofs of the weak convergence results in Theorems 2.1, 2.4 and 3.1. Appendix B discusses basis approximation errors in full technical detail. Appendix C contains the proofs of the Bahadur representations of Section 5, and Appendix D collects auxiliary results from empirical process theory.

Additional Notations. For a class of functions $\mathcal{G}$, let $\|P_n - P\|_{\mathcal{G}} := \sup_{f \in \mathcal{G}} |P_n f - P f|$. For any $\varepsilon > 0$, the covering number $N(\varepsilon, \mathcal{G}, L_p)$ is the minimal number of balls of radius $\varepsilon$ (under the $L_p$-norm) needed to cover $\mathcal{G}$. The bracketing number $N_{[\,]}(\varepsilon, \mathcal{G}, L_p)$ is the minimal number of $\varepsilon$-brackets needed to cover $\mathcal{G}$, where an $\varepsilon$-bracket is a pair of functions $l \leq u$ with $\|u - l\|_{L_p} < \varepsilon$. Throughout the proofs, $C, C_1, C_2$, etc. denote constants which do not depend on $n$ but may take different values in different places.

A.1. Proof of Theorem 2.1
A.1.1. Proof of (2.2)
Under conditions (A1)-(A3) and those of Theorem 2.1, it follows from Theorem 5.1, applied with $\kappa_n = c \log n$ for a suitable constant $c$ (note that the conditions $g_n = o(\xi_m^{-1})$ and $c_n = o(1)$ in Theorem 5.1 follow from the assumptions of Theorem 2.1), that the expansion holds with a remainder term that is uniform in $\mathcal{T}$. Under the assumption $\chi_{\gamma_n}(u_n, Z) = o(\|u_n\| n^{-1/2})$, it remains to bound $\|P_n - P\|_{\mathcal{G}_5}$ for a suitable class of functions $\mathcal{G}_5$. For any $f \in \mathcal{G}_5$ and a sufficiently large $C$, we obtain the stated envelope bound. By Lemma 21 of Belloni et al. (2016), the VC index of $\mathcal{G}_5$ is of order $O(m)$; therefore, we obtain the corresponding maximal inequality from (D.2). Finally, note that under the condition $m c_n \log n = o(1)$, we have $r_{n,3}(\log n) = o(n^{-1/2})$. This completes the proof of (2.2).

A.1.2. Proof of (2.4)
Throughout this subsection, assume without loss of generality that $\|u_n\| = 1$. It suffices to prove finite-dimensional convergence and asymptotic equicontinuity. Asymptotic equicontinuity follows from (A.33), and the existence of a version of the limit with continuous sample paths is a consequence of Theorem 1.5.7 and Addendum 1.5.8 in van der Vaart and Wellner (1996). Hence we only need to establish finite-dimensional convergence. Let $G$ be the Gaussian process defined in (2.4). By the Cramér-Wold theorem, the goal is to show convergence of arbitrary linear combinations over quantile levels $\tau_1, \dots, \tau_L \in \mathcal{T}$ with coefficients $\lambda_1, \dots, \lambda_L$; to this end, define a suitable triangular array $V_{n,i}(\tau)$. If $0 = \lim_{n \to \infty} \sigma^2_{n,L} = \sum_{l,l'=1}^L \lambda_l \lambda_{l'} H(\tau_l, \tau_{l'}; u_n) = \operatorname{var}\big(\sum_{l=1}^L \lambda_l G(\tau_l)\big)$, then by Markov's inequality $\sum_{i=1}^n \sum_{l=1}^L \lambda_l V_{n,i}(\tau_l) \to 0$ in probability, which coincides with the distribution of $\sum_{l=1}^L \lambda_l G(\tau_l)$, a single point mass at $0$. Next, consider the case $\sigma^2_{n,L} \to \sigma^2_L > 0$. For sufficiently large $n$ and arbitrary $v > 0$, Markov's inequality yields the Lindeberg condition, since $\xi_m^2 n^{-1} = o(1)$ by the assumption $m \xi_m^2 \log n = o(n)$. The finite-dimensional convergence then follows from the Cramér-Wold device. This completes the proof.

A.2. Proofs of Theorem 2.4, Example 2.3 and Remark 2.6
We begin by introducing some notation and useful preliminary results. For a vector $u = (u_1, \dots, u_m)^\top$ and a set $I \subset \{1, \dots, m\}$, let $u^{(I)} \in \mathbb{R}^m$ denote the vector with entries $u_i$ for $i \in I$ and zero otherwise. For a vector $a \in \mathbb{R}^m$ with $a_0$ non-zero consecutive entries, let $k_a$ denote the position of its first non-zero entry.

Lemma A.1. Under (L), for an arbitrary vector $a \in \mathbb{R}^m$ with at most $a_0$ non-zero consecutive entries, the bounds (A.4)-(A.6) hold for a constant $\gamma \in (0,1)$ independent of $n, \tau$.

Proof of Lemma A.1. Under (L), the matrix $Z(x)Z(x)^\top$ has no non-zero entries that are further than $r$ away from the diagonal. Thus $J_m(\tau)$ is a band matrix with bandwidth no larger than $2r$. Applying similar arguments as in the proof of Lemma 6.3 in Zhou et al. (1998) gives $|(J_m(\tau)^{-1})_{i,j}| \leq C_1 \gamma^{|i-j|}$ for some $\gamma \in (0,1)$ and a constant $C_1$, where both $\gamma$ and $C_1$ are independent of $n, \tau$. From this, (A.4) is established. For a proof of (A.5), note that by (A.4) and the definition of $I(a, D)$ we obtain the stated bound for a constant $C$ independent of $n$. The proof of (A.6) is similar to that of (A.5).
Proof of Theorem 2.4. By Theorem 5.2 and Condition (B1), we first obtain (A.8). Given (A.8), the process convergence of $u_n^\top(\hat{\beta}(\tau) - \beta_n(\tau))$ and the continuity of the sample paths of the limiting process follow from the process convergence of $u_n^\top \mathbb{U}_n(\tau)$, which can be shown via exactly the same steps as in Section A.1.2, replacing $Z$ by $B$, given assumptions (A1)-(A3).
Step 1. By (A.5) in Lemma A.1 with $D = c \log n$ for large enough $c > 0$, we have almost surely the displayed bound, where the third equality follows from the local support structure of $B$, and where $I_1$ and $I_1'$ denote two index sets. Taking $\kappa_n = C \log n$, the conditions $c_n^2 = o(n^{-1/2})$ and $\xi_m^4 (\log n)^6 = o(n)$ in (B1) imply that $\sup_{\tau \in \mathcal{T}} I_2(\tau) = o_P(n^{1/2})$.
Step 2: $\sup_{\tau \in \mathcal{T}} |u_n^\top(\mathbb{U}_{1,n}(\tau) - \mathbb{U}_n(\tau))| = o_P(n^{-1/2})$ for all $u_n \in S^{m-1}$. It remains to bound $\sup_{\tau \in \mathcal{T}} I_4(\tau)$. The cardinality of the set $I(u_n, c \log n)$ is of order $O(\log n)$, so the VC index of $\mathcal{G}_0(I(u_n, D))$ is of order $O(\log n)$, while the VC index of $\mathcal{G}_4$ is $2$ (see Lemma D.4). By Lemma D.2, the covering bound holds with $v_0(n) = O(\log n)$. In addition, for any $f \in \mathcal{G}_0(I(u_n, D)) \cdot \mathcal{G}_4$, $|f| \lesssim \xi_m$ and $\|f\|_{L_2(P)} = O(1)$ by (A1). Furthermore, the remaining bound follows from assumptions (A1)-(A2) and the definition of $c_n$. Taking $\kappa_n = C \log n$, an application of (B1) completes the proof.

Proof of Remark 2.6. Consider the product $B_j(x) B_{j'}(x)$ of two B-spline functions. The fact that $B_j(x)$ is locally supported on $[t_j, t_{j+r}]$ implies that $B_j(x) B_{j'}(x) = 0$ for all $x$ whenever $|j - j'| \geq r$, where $r \in \mathbb{N}$ is the degree of the spline. This also implies that $J_m(\tau)$ and $E[BB^\top]$ are band matrices, with each column having at most $L_r := 2r + 1$ non-zero elements, each at most $r$ entries away from the main diagonal. Recall also that $\max_{j \leq m} \sup_{t \in \mathbb{R}} |B_j(t)| \lesssim m^{1/2}$ (by the discussion following assumption (A1)). From assumptions (A1)-(A3), the bounds (A.14)-(A.16) hold uniformly in $\tau \in \mathcal{T}$, where the constant $M > 0$ is as in Assumption (A1). Without loss of generality, we now drop the factor $\tau_1 \wedge \tau_2 - \tau_1 \tau_2$ from the discussion and focus on the matrix part of the covariance function $H(\tau_1, \tau_2; u_n)$ defined in (2.8). From (A1) we have $\|E[BB^\top]\| < M$ for some constant $M > 0$, which gives the stated bound for any $\tau_1, \tau_2 \in \mathcal{T}$. If $u_n = B(x)$, observe that, by the local support property, we only need to consider indices $l$ satisfying $|x - t_l| \leq Cr/m$ for a universal constant $C > 0$. By assumption (2.10), the relevant quantity can be written in terms of a vector $a_{B(x)} \in \mathbb{R}^m$ with the same support as $B(x)$ (only $r < \infty$ non-zero components) and with non-zero components of order $O(m^{-1/2})$; hence $\|a_{B(x)}\| = O(m^{-1/2})$. Continuing from (A.18), for any $x \in [0,1]$, we observe that $B(x)^\top E[BB^\top]^{-1} B(x)$ does not depend on $\tau_1$ and $\tau_2$ and can therefore be treated as a scaling factor and moved out of the covariance function, as in (2.11). This finishes the proof.

A.3. Proof of Theorem 3.1
Observe that $\hat{\alpha}_j(\cdot) - \alpha_j(\cdot) = e_j^\top(\hat{\gamma}^\dagger(\cdot) - \gamma_n^\dagger(\cdot))$, where $e_j$ denotes the $j$-th unit vector in $\mathbb{R}^{m+k}$, $j = 1, \dots, m+k$. The following results will be established at the end of the proof.
From Theorem 5.4 we obtain, under Condition (B1'), the joint Bahadur expansion. Following arguments similar to those in the proof of (2.2) in Section A.1.1, (A.20) and (A.22) imply the stated representation for $j = 1, \dots, k$, uniformly in $\tau \in \mathcal{T}$; similarly, (A.21) and (A.23) give the corresponding representation for the nonparametric part. Thus, the claim will follow once we prove (A.24). We need to establish tightness and finite-dimensional convergence. By Lemma 1.4.3 of van der Vaart and Wellner (1996), it is enough to show tightness of the $G_{n,j}$'s and of $G_{n,h}$ individually. Tightness follows from asymptotic equicontinuity, which can be proved by an application of Lemma A.3: apply Lemma A.3 with $u_n = e_j$ to obtain tightness of $G_{n,j}(\cdot)$ for $j = 1, \dots, k$, and with $u_n = (0_k^\top, w_n^\top)^\top$ to obtain tightness of $G_{n,h}(w_0; \cdot)$. Continuity of the sample paths of $G_{n,h}(w_0; \cdot)$ follows by the same arguments as at the beginning of Section A.1.2. Next, we prove finite-dimensional convergence. Observe the decomposition into the terms $\varphi_i(\tau)$ and $\psi_i(\tau)$. By definition, $E[\varphi_i(\tau)] = 0$. Since $f_{Y|X}(Q(X_i; \tau)|X_i)$ is bounded away from zero uniformly, the variance bound follows by Remark 5.5. Moreover, by Lemma A.2 (proven later), $\|A(\tau) w_n\| = O(1)$ uniformly in $\tau$, and thus, since $\|w_n\| \to \infty$, $n^{-1/2} \sum_{i=1}^n \varphi_i(\tau) = o_P(1)$ for every fixed $\tau \in \mathcal{T}$. Hence it suffices to prove finite-dimensional convergence of the $\psi_i$-part.
We shall now show that $\Gamma_{12}(\tau_1, \tau_2) = o(1)$ uniformly in $\tau_1, \tau_2$. Note that, from the definition of $h_{VW}(W; \tau)$ in (3.4), a standard argument yields the stated chain of inequalities, where the third inequality applies the lower bound for $\inf_{\tau \in \mathcal{T}} \lambda_{\min}(M_1(\tau))$ from Remark 5.5, the fourth inequality follows from $\sup_{1 \leq j \leq k, \tau} (|V_j| + |h_{VW,j}|) \lesssim 1$ a.s., and the last equality follows from the assumptions of the theorem. Now we prove the finite-dimensional convergence (A.24). For arbitrary coefficients $c_j$, either the variance of $\sum_j c_j \psi_i(\tau_j)$ vanishes, in which case the sum converges to $0$ in probability by Markov's inequality, or we define the triangular array $(V_{i,J})_{i=1,\dots,n}$ and verify the Lindeberg condition. For any $v > 0$ and sufficiently large $n$, Markov's inequality gives the required bound, where $\xi_m^2 n^{-1} = o(1)$ by (B1'). Thus the Lindeberg condition holds, and the claimed convergence follows.

Finally, it remains to prove (A.20) and (A.21). Begin by observing that
where the first inequality is an application of a Taylor expansion, with $\bar{Q}^\dagger(X_i, \tau)$ lying on the line segment between $Q(X_i; \tau)$ and the series approximation; the second inequality results from the orthogonality condition (3.5) and Conditions (A2), (C1)-(C2); the third inequality follows from Conditions (A2) and (C1); and the last line follows from Condition (C1) and Hölder's inequality. For a proof of (A.20), observe that by (5.10), $e_j^\top J_m(\tau)^{-1} Z = e_j^\top M_1(\tau)^{-1}(V - A(\tau) Z(W))$ for $j = 1, \dots, k$; thus (A.20) follows from Remark 5.5. To prove (A.21), assume without loss of generality that $\|w_n\| = 1$. By (5.10), (A.26), (5.11) in Remark 5.5 and Lemma A.2, the stated bound follows, where the first inequality again uses the Taylor expansion with $\bar{Q}^\dagger(X_i, \tau)$ as above.

Proof of Lemma A.2. By the first-order condition defining $A(\tau)$ and the orthogonality condition (3.5), together with the assumption that $|V_j| \leq C$ at each fixed $j$, the uniform boundedness of the conditional density in (A2), and the hypotheses of the lemma, the proof of (A.29) is completed by noting that $\sup_{\tau \in \mathcal{T}} \|M_2(\tau)^{-1}\| = O(1)$.

A.4. Asymptotic tightness of quantile process
In this section we establish the asymptotic tightness of the process $n^{1/2} u_n^\top \mathbb{U}_n(\tau)$ in $\ell^\infty(\mathcal{T})$, where $u_n \in \mathbb{R}^m$ is an arbitrary vector and $\mathbb{U}_n(\tau)$ is defined in (A.31). Note that the results obtained in this section, in particular Lemma A.3, apply to any series expansion $Z_i = Z(X_i)$ satisfying Assumption (A1).
The following definition is used only in this subsection: for any non-decreasing, convex function $\Psi : \mathbb{R}^+ \to \mathbb{R}^+$ with $\Psi(0) = 0$, the Orlicz norm of a real-valued random variable $Z$ is defined as $\|Z\|_\Psi := \inf\{C > 0 : E\,\Psi(|Z|/C) \leq 1\}$ (see e.g. Chapter 2.2 of van der Vaart and Wellner (1996)).

Proof of Lemma A.3. Without loss of generality, we assume that $u_n$ is a sequence of vectors with $\|u_n\| = 1$, which can always be achieved by rescaling. Define $\mathbb{G}_n(\tau) := n^{1/2} u_n^\top \mathbb{U}_n(\tau)$.
To bound the remaining term in (A.40), observe that

Now by Lemma A.4 we have the stated bound for a sufficiently large constant $C > 0$. Take $\kappa_n = \log n$. Since $\bar{\omega}_n (\log n)^{1/2} = 2 \xi_m (\log n)^{1/2}/\sqrt{n} = o(1)$ and $\xi_m \log(n)/\sqrt{n} = o(1)$ by assumption, it follows that $r_n(\log n) \to 0$. Therefore, we conclude from Lemma A.4 that the remaining term is asymptotically negligible. The following result is applied in the proof of Lemma A.3.

Lemma A.4. Under (A1)-(A3), we have for any $\kappa_n > 0$ the bound stated below, where $\mathbb{U}_n(\tau)$ is defined in (A.31) and $u_n \in \mathbb{R}^m$ is arbitrary.

To prove Lemma A.4, we need to establish some preliminary results. For any fixed vector $u \in \mathbb{R}^m$ and $\delta > 0$, define the function classes $\mathcal{G}_3$, $\mathcal{G}_6$ and $\mathcal{G}_7$, and denote by $G_3$, $G_6$ and $G_7$ their respective envelope functions. The following covering number bounds will be shown in Section D.2: they hold for any probability measure $Q$, given Assumptions (A1)-(A3), where $A_7 > 0$ is a constant. Also, $\mathcal{G}_4$ has VC index $2$ according to Lemma D.4.
Note that $\sup_{\tau_1, \tau_2 \in \mathcal{T},\, |\tau_1 - \tau_2| < \delta} |I_1(\tau_1, \tau_2)| \leq \|P_n - P\|_{\mathcal{G}_6(u_n, \delta) \cdot \mathcal{G}_4}$, where Theorem 2.6.7 of van der Vaart and Wellner (1996) and Part 1 of Lemma D.4 give the corresponding covering number bound; here the envelope for $\mathcal{G}_4$ is $G_4 = 2$ and $A_4$ is a universal constant. Part 2 of Lemma D.4 and Part 2 of Lemma D.2 imply the analogous bound for a large enough constant $C$. In addition, the functions in the relevant class admit an explicit upper bound, which we can take as the envelope.

A.5. Proof of Corollary 4.1
As the argument $x_0$ in $Q(x_0; \tau)$ and $F_{Y|X}(y|x_0)$ is fixed, we simplify the notation by writing $Q(x_0; \tau) = Q(\tau)$, $\hat{Q}(x_0; \tau) = \hat{Q}(\tau)$ and $F_{Y|X}(y|x_0) = F(y)$, $\hat{F}_{Y|X}(y|x_0) = \hat{F}(y)$ as functions of the single arguments $\tau$ and $y$, respectively. From Theorem 2.4, Theorem 3.1 or Corollary 2.2, we have weak convergence of $a_n(\hat{Q} - Q)$ to $G$, where $a_n$ and $G$ depend on the model for $Q(x; \tau)$ and $G$ has continuous sample paths almost surely. Next, note that for $y \in \mathcal{Y}$, $a_n(\hat{F}(y) - F(y)) = a_n(\Phi(\hat{Q})(y) - \Phi(Q)(y))$.
Next, observe that $\Psi(f) = \Theta \circ \Phi(f)$, where $\Theta(f)(\tau) = \inf\{y : f(y) \geq \tau\}$ denotes the generalized inverse. Compact differentiability of $\Theta$ at differentiable, strictly increasing functions $f_0$, tangentially to the space of continuous functions, is established in Lemma 3.9.23 of van der Vaart and Wellner (1996), and the derivative of $\Theta$ at $f_0$ is given by $d\Theta_{f_0}(h)(y) = -h(f_0^{-1}(y))/f_0'(f_0^{-1}(y))$. By the chain rule for Hadamard derivatives, this implies compact differentiability of $\Psi$ tangentially to $C(\tau_L, \tau_U)$. Thus the second weak convergence result again follows from the functional delta method.
(2003) implies that there exists a constant $C$ independent of $n$ such that the stated bound holds for any function on $\mathcal{W}$. Recall that $\mathcal{W}$ is a compact subset of $\mathbb{R}^{k'}$ and $h(w; \tau) \in \Lambda_c^\eta(\mathcal{W}, \mathcal{T})$. Since $B_m(\mathcal{W})$ is a finite-dimensional vector space of functions, a compactness argument yields a $g^*(\cdot\,; \tau) \in B_m(\mathcal{W})$ such that $\sup_{w \in \mathcal{W}} |h(w; \tau) - g^*(w; \tau)| = \inf_{g \in B_m(\mathcal{W})} \sup_{w \in \mathcal{W}} |h(w; \tau) - g(w)|$ for each fixed $\tau$. With $m > \eta$, the inequality in the proof of Theorem 12.8 in Schumaker (1981), with their "$m_i$" being our $\eta$ and $\Delta_i \asymp m^{-1/k'}$, yields the stated rate, where $\lfloor \eta \rfloor$ is the greatest integer less than $\eta$, by the assumption that $h(w; \tau) \in \Lambda_c^\eta(\mathcal{W}, \mathcal{T})$ and fixed $k'$. An extension of Theorem 12.8 of Schumaker (1981) to Besov spaces (see Example 6.29 of Schumaker (1981)), in the manner of Theorem 6.31 of Schumaker (1981), could refine the rate to $c_n \lesssim m^{-\eta/k'}$, but we do not pursue this direction here.
Next we show the bound $c_n = o(m^{-\eta})$ in the setting of Example 2.3. Assume the density $f_X(x)$ of $X$ exists and $0 < \inf_{x \in \mathcal{X}} f_X(x) \leq \sup_{x \in \mathcal{X}} f_X(x) < \infty$. Define the measure $\nu$ by $d\nu(u) = f_{Y|X}(Q(u; \tau)|u)\, f_X(u)\, du$. Then $x \mapsto B(x)^\top \beta_{n,g}(\tau)$, with $\beta_{n,g}$ defined similarly to (B.1), can be viewed as the projection of a function $g : \mathcal{X} \to \mathbb{R}$ onto the space $B(\mathcal{X})$ with respect to the inner product $\langle g_1, g_2 \rangle = \int g_1(u) g_2(u)\, d\nu(u)$. The remaining proof is similar to the partial linear model case, with $h(w; \tau)$ replaced by $Q(x; \tau)$, and we omit the details.
The bound on r n,1 follows from results on duality theory for convex optimization, see Lemma 26 on page 66 in Belloni et al. (2016) for a proof.
To bound $r_{n,2}$ and $r_{n,3}$, define the class of functions $\mathcal{G}_1$ in (C.2), and let $s_{n,1} := \|P_n - P\|_{\mathcal{G}_1}$.
Observe that by Lemma C.2 with $t = 2$, (C.3) holds for $n$ large enough. Thus, for all $n$ for which (C.3) holds, $\Omega_{3,n} \subseteq \Omega_{2,n} \subseteq \Omega_{1,n}$. From this we obtain that, on $\Omega_{3,n}$ and for a constant $C_2$ independent of $n$, the bound (C.4) holds. The bound on $r_{n,2}$ is now a direct consequence of Lemma C.1 and the facts that $n^{-1} m \xi_m \log n = o((n^{-1} m \log n)^{1/2})$ and $\xi_m \kappa_n n^{-1} = o(\kappa_n^{1/2} n^{-1/2})$. The bound on $r_{n,4}$ follows once we observe the corresponding inequality for any $a \in S^{m-1}$; this implies that for sufficiently large $n$ we have $\sup_{\tau \in \mathcal{T}} r_{n,4}(\tau) \leq C_1 (c_n s_{n,1} + g_n)$ for a constant $C_1$ which does not depend on $n$.
Lemma C.3. Consider the classes of functions $\mathcal{G}_1$, $\mathcal{G}_2(\delta)$ defined in (C.2) and (C.6), respectively. Under assumptions (A1)-(A3), the bound (C.10) holds for some constant $C$ independent of $n$ and all $\kappa_n > 0$. For any $\delta_n$ satisfying $\xi_m \delta_n \gtrsim n^{-1}$, we have for sufficiently large $n$ and arbitrary $\kappa_n > 0$,
$$P\big(\|P_n - P\|_{\mathcal{G}_2(\delta_n)} \geq C \zeta_n(\delta_n, \kappa_n)\big) \leq e^{-\kappa_n}, \qquad \text{(C.11)}$$
where $\zeta_n(t, \kappa_n) := t^{1/2}\, \tfrac{m \xi_m}{n} \log(\xi_m \vee n) + \cdots$

Proof of Lemma C.3. Observe that for each $f \in \mathcal{G}_1$ we have $|f(x,y)| \leq \xi_m$, and the same holds for $\mathcal{G}_2(\delta)$ for any value of $\delta$. Additionally, arguments similar to those in the proof of Lemma 18 in Belloni et al. (2016), together with Theorem 2.6.7 in van der Vaart and Wellner (1996), imply the covering number bounds, almost surely, where $A$ is some constant and $v_1(m) = O(m)$, $v_2(m) = O(m)$. Finally, the second moments of the functions in $\mathcal{G}_1$ and $\mathcal{G}_2(\delta_n)$ can be bounded; under assumptions (A1)-(A3), the latter bound is $c\, \xi_m \delta_n$, where $c$ is a constant that does not depend on $n$. Thus, the bound in (D.2) implies that, for $\xi_m \delta_n \gtrsim n^{-1}$ and some constant $C$ independent of $n$, the stated expectation bound holds, and (C.10) and (C.11) follow from the bound in (D.3) by setting $t = \kappa_n$.
Proof of Lemma C.7. From standard arguments based on the optimality conditions of quantile regression (p. 35 of Koenker (2005); see also equation (2.2) on p. 224 of Knight (2008)), we know that for any $\tau \in \mathcal{T}$, the subgradient condition holds with $v_i \in [-1, 1]$ and $H_\tau = \{i : Y_i = B_i^\top \hat{\beta}(\tau)\}$. Since $a$ has at most $L$ non-zero entries, the dimension of the subspace spanned by $\{B_i : a^\top B_i \neq 0\}$ is at most $L + 2r$ [each vector $B_i$ by construction has at most $r$ non-zero entries, and all of those entries are consecutive]. Since the conditional distribution of $Y$ given the covariates has a density, the data are in general position almost surely, i.e. no more than $k$ of the points $(B_i, Y_i)$ lie in any $k$-dimensional linear space; it follows that the cardinality of the set $H_\tau \cap \{i : a^\top B_i \neq 0\}$ is bounded by $L + 2r$. The assertion follows after an elementary calculation.

C.3. Proof of Theorem 5.4
The statement follows from Theorem 5.1 once we prove that the vector $\gamma_n^\dagger(\tau)$ satisfies the required approximation conditions uniformly in $\tau$. To simplify the notation, we suppress the argument $\tau$ in the following matrix calculations. Recall the identity for the inverse of a $2 \times 2$ block matrix (see equation (6.0.8) on p. 165 of Puntanen and Styan (2005)). Identifying the blocks in the representation (C.23) with the blocks in the above representation yields the result after some simple calculations. For a proof of (C.22), observe the following. On the one hand, we have, uniformly in $\tau \in \mathcal{T}$, the stated chain of equalities. Here, the first equation follows from a Taylor expansion, taking into account that, by the definition of the conditional quantile function, $F_{Y|X}(Q(X; \tau)|X) \equiv \tau$. The fourth equality is a consequence of (3.5), and the sixth equality follows since
$$E\big[Z(W)\, f_{Y|X}(Q(X; \tau)|X)\big(h(W; \tau) - \beta_n^\dagger(\tau)^\top Z(W)\big)\big] = 0 \qquad \text{(C.24)}$$
by the definition of $\beta_n^\dagger(\tau)$ as a minimizer. The last line follows from the Cauchy-Schwarz inequality. On the other hand, by (C.24), the first term in the representation above is zero, and the norm of the second term is of order $O(\xi_m (c_n^\dagger)^2)$. This completes the proof.

D.1. Results on empirical process theory
In this section, we collect some basic results from empirical process theory needed in our proofs. Denote by $\mathcal{G}$ a class of functions satisfying $|f(x)| \leq F(x) \leq U$ for every $f \in \mathcal{G}$, and let $\sigma^2 \geq \sup_{f \in \mathcal{G}} P f^2$. Additionally, assume the covering number bound holds for some $A > 0$, $V > 0$ and all $\varepsilon > 0$. Note that if $\mathcal{G}$ is a VC class, then $V$ is the VC index of the set of subgraphs of functions in $\mathcal{G}$. In that case, the symmetrization inequality and inequality (2.2) from Koltchinskii (2006) yield the moment bound (D.2) for a universal constant $c_0 > 0$, provided that $1 \geq \sigma^2 > \text{const} \times n^{-1}$ [in fact, the inequality in Koltchinskii (2006) is stated for $\sigma^2 = \sup_{f \in \mathcal{G}} P f^2$; this is not a problem, since we can replace $\mathcal{G}$ by $\mathcal{G}\sigma/(\sup_{f \in \mathcal{G}} P f^2)^{1/2}$]. The second inequality (a refined version of Talagrand's concentration inequality) states that for any countable class of measurable functions $\mathcal{F}$ with elements mapping into $[-M, M]$,
$$P\Big(\|P_n - P\|_{\mathcal{F}} \geq 2 E\|P_n - P\|_{\mathcal{F}} + c_1 n^{-1/2} \big(\sup_{f \in \mathcal{F}} P f^2\big)^{1/2} \sqrt{t} + c_2 n^{-1} M t\Big) \leq e^{-t}, \qquad \text{(D.3)}$$
for all $t > 0$ and universal constants $c_1, c_2 > 0$. This is a special case of Theorem 3 in Massart (2000) [in the notation of that paper, set $\varepsilon = 1$].
Lemma D.1 (Lemma 7.1 of Kley et al. (2016)). Let $\{\mathbb{G}_t : t \in T\}$ be a separable stochastic process with $\|\mathbb{G}_s - \mathbb{G}_t\|_\Psi \leq C d(s,t)$ ($\|\cdot\|_\Psi$ is defined in (A.32)) for all $s, t$ satisfying $d(s,t) \geq \bar{\omega}/2 \geq 0$. Denote by $D(\varepsilon, d)$ the packing number of the metric space $(T, d)$. Then, for any $\delta > 0$ and $\omega \geq \bar{\omega}$, there exist a random variable $S_1$ and a constant $K < \infty$ such that the stated bound holds, where the set $\tilde{T}$ contains at most $D(\omega, d)$ points and $S_1$ satisfies (D.6).