Significance testing in quantile regression

We consider the problem of testing the significance of predictors in multivariate nonparametric quantile regression. A stochastic process is proposed, which is based on a comparison of the responses with a nonparametric quantile regression estimate under the null hypothesis. It is demonstrated that under the null hypothesis this process converges weakly to a centered Gaussian process, and the asymptotic properties of the test under fixed and local alternatives are also discussed. In particular we show that, in contrast to the nonparametric approach based on estimation of $L^2$-distances, the new test is able to detect local alternatives which converge to the null hypothesis with any rate $a_n \to 0$ such that $a_n \sqrt{n} \to \infty$ (here $n$ denotes the sample size). We also present a small simulation study illustrating the finite sample properties of a bootstrap version of the corresponding Kolmogorov-Smirnov test.


Introduction
Nonparametric regression methods have become very popular in the last decades because employing a mis-specified parametric model will typically result in inconsistent estimates and, as a consequence, invalid statistical inference. In recent years many authors have developed nonparametric quantile regression estimates, which provide an attractive supplement to least squares methods by focussing on the estimation of the conditional quantiles instead of the mean function [see Chaudhuri (1991), Yu and Jones (1997), Yu and Jones (1998), Dette and Volgushev (2008), Chernozhukov et al. (2010) or Bondell et al. (2010) among many others]. These references mainly discuss the case of a one-dimensional predictor, but from a theoretical point of view the methods can easily be generalized to multivariate predictors. On the other hand, it is well known that in practical applications such nonparametric methods suffer from the curse of dimensionality and therefore do not yield precise estimates of conditional quantile surfaces for realistic sample sizes. In such cases a natural and very important question is which predictor variables are significant.

The problem of testing significance has found considerable interest in multivariate mean regression models. Gozalo (1993) considered conditional moment tests, while Yatchew (1992) constructed a test based on semi-parametric least-squares residuals. Lavergne and Vuong (1996) suggested a directional testing procedure for discriminating between two sets of regressors without specifying the functional form of the mean regression, and Racine (1997) proposed a test based on nonparametric estimates of the partial derivatives of the conditional mean of the response. Lavergne and Vuong (2000) used the kernel method to develop a test for the significance of a subset of explanatory variables, and Delgado and González-Manteiga (2001) proposed a test which is based on functionals of a U-process.

Because of the well known robustness properties of the conditional quantile and the fact that conditional quantiles characterize the entire distribution, it is of particular interest to develop methods for testing significance of predictors in quantile regression models. Surprisingly, in quantile regression this problem has found much less attention. Variable selection in the framework of linear quantile regression models has recently been considered by Zou and Yuan (2008), Wu and Liu (2009) and Belloni and Chernozhukov (2011), among others. Jeong et al. (2012) proposed a test for significance in a multivariate quantile regression model. The work of these authors was motivated by Granger quantile causality [Granger (1969)], and they employed an idea of Zheng (1998), who proposed to transform quantile restrictions into mean restrictions. The corresponding test is based on a U-statistic, which estimates the distance measure
$$\Delta = E\Big[\big(F_{Y|X,Z}(q_\tau(X)\,|\,X, Z) - \tau\big)^2 f_Z(Z)\Big], \qquad (1.1)$$
where $Y$ denotes the response, $(X, Z)$ is the predictor, $f_Z$ the density of $Z$ and $q_\tau(X)$ the conditional $\tau$-quantile of $Y$ given $X$. Note that the quantity $\Delta$ vanishes if and only if the conditional quantile of $Y$ given $X$ and $Z$ does not depend on $Z$. A major drawback of this approach lies in the fact that nonparametric smoothing over both $X$ and $Z$ is needed for the construction of the estimate. This implies that the test is of very limited use when the dimension of $(X, Z)$ is larger than 3.
Moreover, this test can only detect local alternatives converging to the null hypothesis $H_0: \Delta = 0$ at a rate $n^{-1/2} h^{-(d+q)/4}$, where $d$ and $q$ are the dimensions of the predictors $X$ and $Z$, respectively, and $h$ denotes a bandwidth converging to 0 with increasing sample size $n$. The present paper is devoted to the problem of constructing a test for the hypothesis of the significance of the predictor $Z$, i.e. $\Delta = 0$, in the nonparametric quantile regression model, which can detect local alternatives converging to the null hypothesis at a parametric rate and at the same time does not depend on the dimension of the predictor $Z$, so that smoothing with respect to the covariate $Z$ can be avoided. To be precise, the test proposed in this paper can detect alternatives converging to $H_0$ at any rate $a_n \to 0$ such that $a_n \sqrt{n} \to \infty$, where $n$ denotes the sample size. Our approach is based on an empirical process, which estimates the functional
$$T(x, z) = E\big[\big(I\{Y \le q_\tau(X)\} - \tau\big)\, I\{X \le x\}\, I\{Z \le z\}\big] \qquad (1.2)$$
for all $(x, z)$ in the support of the distribution of the predictor $(X, Z)$, where the inequality $X \le x$ between the vectors $X$ and $x$ is understood as the vector of inequalities between the corresponding coordinates and $I\{A\}$ denotes the indicator function of the event $A$. The model, the necessary notation and the definition of this process are introduced in Section 2, and a stochastic expansion of the process $T_n(x, z)$ is established in Section 3. This result allows us to obtain the weak convergence of an appropriately scaled and centered version of $T_n(x, z)$ under the null hypothesis, fixed and local alternatives. As a result we obtain a Kolmogorov-Smirnov or a Cramér-von Mises type statistic for the hypothesis of the significance of the predictor $Z$ in the nonparametric quantile regression model. Moreover, we are also able to extend the result to the case where the dimension $q$ of the predictor $Z$ grows with the sample size, that is $q = q_n \to \infty$ as $n \to \infty$. The finite sample properties of a corresponding bootstrap test are investigated in Section 4. As a by-product of our theoretical analysis we also obtain new results on the uniform convergence of the conditional quantile estimator proposed by Dette and Volgushev (2008). Finally, all proofs, which are rather involved, are deferred to the Appendix in Section A.

Model, assumptions and test statistic
Let $Y$, $X$ and $Z$ denote one-, $d$- and $q$-dimensional random variables, respectively, where $Y$ corresponds to the response and $X$ and $Z$ are the covariates. We assume that the random variables $(Y_i, X_i, Z_i)$, $i = 1, \dots, n$, are independent and identically distributed with the same distribution as $(Y, X, Z)$. Let $\tau \in (0, 1)$ be fixed. Our aim is to test whether the predictor $Z$ has influence on the conditional $\tau$-quantile of $Y$ given $(X, Z)$, or whether the variable $Z$ can be omitted. Note that this problem fundamentally differs from the question whether $Y$ is independent of $Z$ given $X$. In fact, the latter is equivalent to testing whether all quantile curves do not depend on $Z$, as opposed to looking at a particular quantile. Thus for fixed $\tau \in (0, 1)$ we formulate the null hypothesis as
$$H_0:\quad P\big(Y \le q_\tau(X)\,\big|\,X, Z\big) = \tau \quad \text{a.s.}, \qquad (2.1)$$
where $q_\tau(X)$ is defined through the conditional $\tau$-quantile of $Y$ given $X$, that is
$$q_\tau(x) := F_{Y|X}^{-1}(\tau|x) = \inf\{y \in \mathbb{R} \mid F_{Y|X}(y|x) \ge \tau\}.$$
It is easy to see that the null hypothesis (2.1) is equivalent to
$$T(x, z) = 0 \qquad (2.2)$$
for all $(x, z)$ in the support of the random variable $(X, Z)$, where the functional $T$ is defined in (1.2). This functional can be estimated by the stochastic process
$$T_n(x, z) := \frac{1}{n}\sum_{i=1}^n \big(I\{Y_i \le \hat q_\tau(X_i)\} - \tau\big)\, I\{X_i \le x\}\, I\{Z_i \le z\}, \qquad (2.3)$$
where $(x, z) \in R_X \times R_Z$, $R_X$ and $R_Z$ denote the supports of the distributions of the random variables $X$ and $Z$, respectively, and $\hat q_\tau$ is an appropriate estimate of the conditional quantile of $Y$ given $X$, which will be specified below. A test for the hypothesis of significance of the variable $Z$ for the $\tau$-quantile curve of $Y$ can now easily be obtained by considering a Kolmogorov-Smirnov or Cramér-von Mises type statistic based on $T_n$ and rejecting the null hypothesis for large values of this statistic. Throughout this paper we assume that the sets $R_X$ and $R_Z$ are compact. In the literature, several nonparametric quantile regression estimators have been proposed [see e.g. Yu and Jones (1997, 1998), Takeuchi et al. (2006), Chernozhukov et al. (2010) or Bondell et al. (2010) among others]. In this paper we will use an approach proposed by Dette and Volgushev (2008), who constructed non-crossing estimates of quantile curves using a simultaneous inversion and isotonization of a preliminary estimator of the conditional distribution function $F_{Y|X}$ of $Y$ given $X$. For this estimator, say $\hat F_{Y|X}(y|x; p)$, we will use a smoothed local polynomial estimator of order $p$, see e.g. Fan and Gijbels (1996). Before defining this estimator, it is necessary to introduce some notation.
• Define $\sigma(k) := k_1 + \cdots + k_d$ for a multi-index $k = (k_1, \dots, k_d) \in \mathbb{N}_0^d$, let $N_j := \#\{k \in \mathbb{N}_0^d \mid \sigma(k) = j\}$ denote the number of distinct $d$-dimensional multi-indices of size $j$, and denote the elements of the set $\{k \in \mathbb{N}_0^d \mid \sigma(k) = m\}$ by $k_{1,m}, \dots, k_{N_m,m}$.

With these notational conventions the local polynomial estimator $\hat F_{Y|X}(y|x; p)$ of order $p$ can be represented as [see e.g. Fan and Gijbels (1996)]
$$\hat F_{Y|X}(y|x; p) = e_1^t\big(\mathbf{X}^t\mathbf{W}\mathbf{X}\big)^{-1}\mathbf{X}^t\mathbf{W}\mathbf{Y}, \qquad (2.4)$$
where $e_1$ denotes a vector of suitable dimension with first entry one and remaining entries zero, the matrices $\mathbf{X}$, $\mathbf{W}$ and the vector $\mathbf{Y}$ are given in (2.5), and
$$\Omega(t) := \int_t^1 \omega(u)\, du \qquad (2.6)$$
denotes a smoothed version of the indicator function $I\{\cdot \le 0\}$ for a given kernel $\omega$ with support $[-1, 1]$. Following Dette and Volgushev (2008) we consider a strictly increasing distribution function $G: \mathbb{R} \to (0, 1)$, a nonnegative kernel $\kappa$ with bandwidth $b_n$, and define the functional
$$H_{G,\kappa,\tau,b_n}(F) := G^{-1}\Big(\int_0^1 \frac{1}{b_n}\int_{-\infty}^{\tau} \kappa\Big(\frac{F(G^{-1}(v)) - u}{b_n}\Big)\, du\, dv\Big). \qquad (2.7)$$
If $\hat F_{Y|X}$ is the estimator of the conditional distribution function defined in (2.4), it is intuitively clear that $H_{G,\kappa,\tau,b_n}(\hat F_{Y|X}(\cdot|x))$ is a consistent estimate of $H_{G,\kappa,\tau,b_n}(F_{Y|X}(\cdot|x))$. If $b_n \to 0$, the latter quantity can be approximated as follows,
$$H_{G,\kappa,\tau,b_n}(F_{Y|X}(\cdot|x)) \approx G^{-1}\Big(\int_0^1 I\{F_{Y|X}(G^{-1}(v)|x) \le \tau\}\, dv\Big) = F_{Y|X}^{-1}(\tau|x),$$
and as a consequence an estimate of the conditional quantile function $q_\tau(x) = F_{Y|X}^{-1}(\tau|x)$ can be defined by
$$\hat q_\tau(x) := H_{G,\kappa,\tau,b_n}\big(\hat F_{Y|X}(\cdot|x)\big). \qquad (2.8)$$
Throughout this paper, we will assume that the kernels, the function $G$ and the bandwidth parameters used to build the estimator satisfy the following conditions.

(K1) The kernel $K$ has support $[-1, 1]$ and is $p + 1 \ge d + 2$ times continuously differentiable with uniformly bounded derivatives. Additionally, the first $p + 1$ derivatives of $K$ vanish at the boundary points $-1$ and $1$.
(K2) The function ω in (2.6) is a kernel of order s ≥ d + 1, has support [−1, 1] and is d times continuously differentiable. Additionally ω has uniformly bounded derivatives that vanish at the boundary points −1 and 1.
(K4) $G: \mathbb{R} \to [0, 1]$ is a strictly increasing distribution function such that $G$ and $G^{-1}$ are twice continuously differentiable.

(K5) $d_n^{2s} + h_n^{p+1} = o(1/\sqrt{n})$ and $\log n/(n h_n^{3d+2}) = o(1)$.

Remark 2.1 Dette and Volgushev (2008) demonstrate that the choice of the distribution function $G$ has a negligible impact on the quality of the resulting estimate, provided that an obvious centering and standardization is performed. Similarly, the estimate $\hat q_\tau(x)$ is robust with respect to the choice of the bandwidth $b_n$ if it is chosen sufficiently small [see Dette et al. (2006)].
Remark 2.2 Dette and Volgushev (2008) only established point-wise weak convergence of their estimator. However, for most applications, such as the construction of tests on the basis of this estimator, uniform results are needed. In the present paper, we provide general inequalities for the operator $H_{G,\kappa,\tau,b_n}$ defined in (2.7); see Lemma B.4 in the Appendix. In particular, these findings allow us to describe uniform properties of the quantile estimator $\hat q_\tau$ in terms of the properties of the underlying distribution function estimator $\hat F_{Y|X}$. For example, in Theorem A.1 in the Appendix we exploit these bounds to derive a uniform Bahadur-type representation for the estimate $\hat q_\tau$ defined in (2.8). A schematic implementation of the inversion step (2.7)-(2.8) is sketched below.
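The following minimal sketch illustrates how the functional (2.7) can be evaluated numerically for a given estimate of the conditional distribution function; it is our own illustration, not the authors' implementation. The function names, the use of the Epanechnikov kernel for $\kappa$, and the midpoint approximation of the integrals are assumptions made for concreteness.

```python
import numpy as np
from scipy.stats import norm

def H_functional(F, tau, b_n, G_inv, n_grid=400):
    """Numerical evaluation of the Dette-Volgushev functional (2.7):
    a smoothed, monotonized inversion of v -> F(G^{-1}(v)).

    F     : callable, estimated conditional distribution function y -> F(y|x)
    G_inv : quantile function of the auxiliary distribution G
    """
    v = (np.arange(n_grid) + 0.5) / n_grid       # midpoint grid on (0, 1)
    Fv = F(G_inv(v))                             # F(G^{-1}(v)) on the grid
    # inner integral (1/b_n) * int_{-inf}^{tau} kappa((Fv - u)/b_n) du:
    # for the Epanechnikov kernel kappa this is a smoothed I{Fv <= tau}
    t = np.clip((tau - Fv) / b_n, -1.0, 1.0)
    smoothed_ind = 0.5 + 0.75 * t - 0.25 * t**3  # int_{-1}^{t} kappa(u) du
    return G_inv(np.mean(smoothed_ind))          # outer integral, then G^{-1}

# sanity check with a known distribution: F = standard normal cdf and
# G = N(0, 2); the output should be close to the true tau-quantile of F
print(H_functional(norm.cdf, tau=0.25, b_n=0.05, G_inv=norm(0.0, 2.0).ppf))
# approximately -0.674
```

In practice $F$ would be the local polynomial estimate (2.4) at a fixed covariate value $x$, and $G$ would be chosen as a normal distribution matched to empirical quantiles of $Y_1, \dots, Y_n$, as described in Section 4.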
In the following discussion it turns out to be advantageous to consider a generalization of the test statistic $T_n$ defined in (2.3), where the indicator functions $I\{X_i \le x\}$ are replaced by indicators of more general sets $\Theta$. To be precise, let $\Xi$ denote a collection of subsets of $\mathbb{R}^d$ and define $D_n := \{x \in R_X \mid [x - h_n \mathbf{1}, x + h_n \mathbf{1}] \subset R_X\}$ (here $\mathbf{1}$ denotes the $d$-dimensional vector with all entries equal to 1); then all theoretical developments will be based on the statistic
$$T_n(\Theta, z) := \frac{1}{n}\sum_{i=1}^n \big(I\{Y_i \le \hat q_\tau(X_i)\} - \tau\big)\, I\{X_i \in \Theta \cap D_n\}\, I\{Z_i \le z\}. \qquad (2.9)$$
The intersection of the sets $\Theta \in \Xi$ with the set $D_n$ is needed in the theoretical developments to exclude "residuals" $I\{Y_i \le \hat q_\tau(X_i)\} - \tau$ corresponding to predictors close to the boundary of $R_X$. Note that if $\cup_{\Theta \in \Xi}\Theta$ has a positive distance to the boundary of $R_X$, the collection of sets $\Xi_n := \{\Theta \cap D_n \mid \Theta \in \Xi\}$ will equal $\Xi$ whenever $h_n$ is sufficiently small. Note also that we use the same symbol $T_n$ for the processes in (2.3) and (2.9), but the meaning is always clear from the context. In addition to its advantages from a theoretical point of view, the consideration of a collection of sets more general than rectangles will, for example, allow us to investigate the problem of testing the significance of the variable $Z$ on a certain subset, say $D \subset R_X$, that is
$$H_0^D:\quad P\big(Y \le q_\tau(X)\,\big|\,X, Z\big) = \tau \quad \text{a.s. on } \{X \in D\}. \qquad (2.10)$$
Note that $H_0^D$ means that the conditional $\tau$-quantile of $Y$ given $(X, Z)$ can be represented as a function $q_\tau(X)$ for $X \in D \subset R_X$. In this case a natural choice for the collection $\Xi$ is given by $\Xi := \{\{X \le t\} \cap D \mid t \in \mathbb{R}^d\}$, but other choices are of course possible as well. A sketch of the resulting Kolmogorov-Smirnov statistic is given below.
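As an illustration (ours, not from the paper), the statistic (2.9) with $\Xi$ the class of lower-left rectangles and the corresponding Kolmogorov-Smirnov statistic can be computed as follows; the evaluation grid, the data-driven stand-in for the boundary trimming, and all names are simplifying assumptions.

```python
import numpy as np

def ks_statistic(Y, X, Z, q_hat, tau, h_n):
    """sup_{t,z} sqrt(n) |T_n(Theta, z)| from (2.9), with Theta ranging over
    rectangles {x <= t} and (t, z) over the observed design points.

    X : (n, d) covariates kept under H0,  Z : (n, q) covariates under test,
    q_hat : callable mapping an (n, d) array to estimated tau-quantiles.
    """
    n = len(Y)
    pseudo_resid = (Y <= q_hat(X)).astype(float) - tau   # I{Y_i<=qhat}-tau
    # boundary trimming D_n: the support R_X is crudely replaced by the
    # componentwise range of the observed design (a heuristic stand-in)
    lo, hi = X.min(axis=0), X.max(axis=0)
    in_Dn = np.all((X >= lo + h_n) & (X <= hi - h_n), axis=1)
    sup = 0.0
    for t in X:                                          # Theta = {x <= t}
        in_theta = np.all(X <= t, axis=1) & in_Dn
        for z in Z:
            T_n = np.mean(pseudo_resid * in_theta * np.all(Z <= z, axis=1))
            sup = max(sup, abs(T_n))
    return np.sqrt(n) * sup
```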

Main asymptotic results
In this section we investigate the asymptotic properties of the stochastic process defined in (2.9). For this purpose we need some additional notation and technical assumptions which are collected here for convenience and for later reference. Define the 'error' variables as ε = Y − q τ (X) and ε i = Y i − q τ (X i ), i = 1, . . . , n. We assume that the conditional distribution function F ε|X (·|x) of ε given X = x has a density, say f ε|X (y|x). Note that by definition we have that F ε|X (0|X) = P (ε ≤ 0|X) = τ . In particular, this identity continues to hold even if the null hypothesis is violated. Throughout this paper we denote by F Z|X,ε (z|x, e) the conditional distribution function of Z given (X, ε) = (x, e). Define D := ∪ Θ∈Ξ Θ, then we assume that the data-generating process satisfies the following conditions.
(A1) The conditional distribution function F Y |X (y|x) is p + 1 times continuously differentiable with respect to x, y and all partial derivatives are uniformly bounded on R × R X . The joint density of (X, Y ) is uniformly bounded on R X × R. Moreover, p ≥ max(s, d + 1).
(A2) The density f X of the predictor X is d + 1 + n f times continuously differentiable with uniformly bounded partial derivatives on R X and n f > d/2. Moreover inf x∈R X f X (x) > 0.
In conditions (A1)-(A4), R X can be replaced by a set X ⊂ R X provided that D ⊂ X . Finally, the following assumptions on the collection of sets Ξ are required.
(S1) The class of functions $\mathcal{F}_1 = \{u \mapsto I\{u \in \Theta\} \mid \Theta \in \Xi\}$ satisfies $N_{[\,]}(\mathcal{F}_1, \varepsilon, L_2(P_X)) \le C\varepsilon^{-a}$ for any sufficiently small $\varepsilon > 0$ and a constant $C$, where $N_{[\,]}$ denotes the bracketing number [see van der Vaart and Wellner (1996)].

Remark 3.1 Conditions (S1) and (S2) are not strong and are, for example, satisfied for the collection of rectangles $\Xi = \{\{s \le X \le t\} \mid s, t \in \mathbb{R}^d\}$ if $X$ has a uniformly bounded density with compact support; a short calculation illustrating this in the simplest case is sketched below. For more details on bracketing numbers and their properties we refer to the monograph of van der Vaart and Wellner (1996).
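To indicate why (S1) holds for rectangles, consider the simplest case $d = 1$ with half-lines $\Theta = (-\infty, t]$; the following standard bracketing argument is our addition and not part of the original text.

```latex
% Choose points -\infty = t_0 < t_1 < \dots < t_N = \infty such that
% P_X((t_{j-1}, t_j]) \le \varepsilon^2, which is possible with
% N \le \lceil \varepsilon^{-2} \rceil.  For every t \in (t_{j-1}, t_j],
\[
  I\{u \le t_{j-1}\} \le I\{u \le t\} \le I\{u \le t_j\},
  \qquad
  \big\| I\{\cdot \le t_j\} - I\{\cdot \le t_{j-1}\} \big\|_{L_2(P_X)}
  = P_X\big((t_{j-1}, t_j]\big)^{1/2} \le \varepsilon ,
\]
\[
  \text{so that}\quad
  N_{[\,]}\big(\mathcal{F}_1, \varepsilon, L_2(P_X)\big)
  \le \big\lceil \varepsilon^{-2} \big\rceil ,
\]
% i.e. (S1) holds with a = 2; applying the construction coordinate-wise
% yields the polynomial bound a = 2d for rectangles in dimension d.
```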
The following result gives a stochastic expansion of the process T n (Θ, z) under general conditions, which is crucial for deriving the asymptotic properties of the process T n . In particular, observe that this representation continues to hold under the alternative.
Theorem 3.2 If the assumptions (K1)-(K6), (A1)-(A5) and (S1), (S2) are satisfied, the process $T_n$ can be represented as
$$T_n(\Theta, z) = \frac{1}{n}\sum_{i=1}^n \big(I\{\varepsilon_i \le 0\} - \tau\big)\, I\{X_i \in \Theta_n\}\big(I\{Z_i \le z\} - F_{Z|X,\varepsilon}(z|X_i, 0)\big) + o_P(n^{-1/2}) \qquad (3.1)$$
uniformly with respect to $\Theta \in \Xi$ and $z \in R_Z$, where $\Theta_n := \Theta \cap D_n$.

The proof of Theorem 3.2 is involved and is given in the Appendix. As an immediate consequence, we obtain that under the null hypothesis $H_0$ the rescaled process $\sqrt{n}\, T_n(\Theta, z)$ converges weakly to a centered Gaussian process.
Corollary 3.3 If the assumptions of Theorem 3.2 and the null hypothesis $H_0$ in (2.1) are satisfied, the process $\sqrt{n}\, T_n$ converges weakly in $\ell^\infty(\Xi \times R_Z)$ to a centered Gaussian process $T$ with covariance kernel
$$k(\Theta_1, y, \Theta_2, z) = E\Big[\big(I\{\varepsilon \le 0\} - \tau\big)^2 I\{X \in \Theta_1 \cap \Theta_2\}\big(I\{Z \le y\} - F_{Z|X,\varepsilon}(y|X, 0)\big)\big(I\{Z \le z\} - F_{Z|X,\varepsilon}(z|X, 0)\big)\Big]. \qquad (3.2)$$

As a consequence of this result we obtain the weak convergence of functionals such as the Kolmogorov-Smirnov statistic
$$K_n := \sup_{\Theta \in \Xi,\, z \in R_Z} \sqrt{n}\, |T_n(\Theta, z)|$$
by an application of the continuous mapping theorem. In general the asymptotic distribution of $K_n$ depends on certain features of the data generating process, and in the following section we will discuss bootstrap approximations for this distribution. However, in some special cases the situation simplifies substantially.
Remark 3.4 In the case where the pair $(X, \varepsilon)$ and the covariate $Z$ are independent, we have $F_{Z|X,\varepsilon}(z|X, 0) = F_Z(z)$, and it follows from (3.2) that
$$k(\Theta_1, y, \Theta_2, z) = \tau(1 - \tau)\, P(X \in \Theta_1 \cap \Theta_2)\big(F_Z(y \wedge z) - F_Z(y)F_Z(z)\big),$$
where $F_Z$ is the distribution function of the random variable $Z$ and $y \wedge z$ denotes the vector of minima of the corresponding coordinates of $y$ and $z$. If additionally $X, Z$ are real-valued and $\Xi = \{(-\infty, t] \mid t \in \mathbb{R}\}$, the asymptotic covariance in Corollary 3.3 reduces to
$$k(t_1, y, t_2, z) = \tau(1 - \tau)\, F_X(t_1 \wedge t_2)\big(F_Z(y \wedge z) - F_Z(y)F_Z(z)\big).$$
Hence, for univariate independent covariates $X$ and $Z$ with continuous distribution functions $F_X$ and $F_Z$, respectively, the Kolmogorov-Smirnov test is asymptotically distribution-free, because in this case the law of the limiting process does not depend on the data generating mechanism after the transformation $(t, z) \mapsto (F_X(t), F_Z(z))$.

The result obtained in Theorem 3.2 can also be used to derive the asymptotic properties of the test statistic under fixed alternatives. More precisely, the following result holds (note that under the null hypothesis, the centering term is zero, and thus this result is a generalization of Corollary 3.3).
Corollary 3.5 Under the assumptions of Theorem 3.2, the process $\sqrt{n}\big(T_n(\Theta, z) - \tilde T_n(\Theta_n, z)\big)$, where
$$\tilde T_n(\Theta_n, z) := E\big[\big(I\{\varepsilon \le 0\} - \tau\big)\, I\{X \in \Theta_n\}\big(I\{Z \le z\} - F_{Z|X,\varepsilon}(z|X, 0)\big)\big],$$
converges weakly to the limiting process $T$ defined in Corollary 3.3.
Remark 3.6 A further consequence of Corollary 3.5 is that the statistic $T_n(\Theta, z)$ converges for all $\Theta \in \Xi$ and $z \in R_Z$ in probability to the function
$$\tilde T(\Theta, z) := E\big[\big(I\{\varepsilon \le 0\} - \tau\big)\, I\{X \in \Theta\}\big(I\{Z \le z\} - F_{Z|X,\varepsilon}(z|X, 0)\big)\big].$$
Consequently, if $\Xi$ contains sufficiently many sets (for example, all rectangles $\{s \le X \le t\}$ with $s, t \in \mathbb{R}^d$), the test is consistent. In order to obtain the asymptotic distribution of the test statistic under local alternatives, a result on the asymptotic behavior of $T_n(\Theta, z)$ is required when the data are generated from triangular arrays. A closer look at the proofs in the Appendix shows that such a result does indeed hold under suitable modifications of the conditions in Theorem 3.2. The details are omitted for the sake of brevity. In particular, a test based on the Kolmogorov-Smirnov statistic will detect all local alternatives for which the quantity $\sup_{\Theta \in \Xi,\, z \in R_Z} \sqrt{n}\,|\tilde T_n^{(n)}(\Theta, z)|$ diverges to infinity (the superscript is used to indicate that the corresponding quantities depend on $n$). This is in particular the case for local alternatives under which the conditional $\tau$-quantile of $Y$ given $(X, Z)$ is of the form
$$q_\tau^{(n)}(X, Z) = q_\tau(X) + a_n\, h(X, Z) \qquad (3.3)$$
for some function $h$ that is not identically zero on $R_X \times R_Z$ and a sequence $a_n$ with $a_n\sqrt{n} \to \infty$. This means that the test can detect alternatives converging to the null hypothesis at rates which are "larger but arbitrarily close" to the parametric rate $n^{-1/2}$. Moreover, the test will have asymptotically non-trivial power against many local alternatives that converge to the null hypothesis at the exact parametric rate $n^{-1/2}$. A heuristic calculation supporting this rate statement is sketched below.
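The following heuristic (our sketch, not from the paper) indicates why the rate $a_n\sqrt{n} \to \infty$ suffices; it suppresses the $n$-dependence of nuisance quantities and assumes the local model (3.3) with a smooth conditional density.

```latex
% Under (3.3) the tau-quantile of Y given (X, Z) is q_tau(X) + a_n h(X, Z),
% so a Taylor expansion of the conditional distribution function gives
\[
  P\big(\varepsilon \le 0 \mid X, Z\big)
  = P\big(Y \le q_\tau(X) \mid X, Z\big)
  = \tau - a_n\, h(X, Z)\, f_{Y|X,Z}\big(q_\tau(X) \mid X, Z\big) + o(a_n).
\]
% Conditioning on (X, Z) in the definition of the centering term then yields
\[
  \tilde T_n^{(n)}(\Theta, z)
  = -\,a_n\, E\Big[ h(X, Z)\, f_{Y|X,Z}\big(q_\tau(X) \mid X, Z\big)\,
      I\{X \in \Theta_n\}\big(I\{Z \le z\} - F_{Z|X,\varepsilon}(z|X, 0)\big)
    \Big] + o(a_n),
\]
% which is of exact order a_n for some (Theta, z) whenever h does not vanish
% identically; hence sqrt(n) |\tilde T_n^{(n)}| ~ a_n sqrt(n) -> infinity.
```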
Remark 3.7 We now give a brief discussion of the properties of the proposed test statistic when alternatives of increasing dimension are considered, i.e. when the dimension of the predictor $Z$, say $q_n$, varies with $n$. Consider the additional assumption

(Z) The $L_2$ covering numbers of the relevant classes of functions grow at a rate controlled by a sequence $k_n$.

Note that assumption (Z) holds with $k_n = q_n$ if for each $n$ the predictor $Z$ given $(X, \varepsilon)$ has a conditional density $f_{Z|X,\varepsilon}$ that is suitably bounded, with a finite constant $C$ independent of $n$. Under assumptions (K1)-(K6), (A1)-(A3), (Z), (A5), (S1), (S2) it is possible to prove an analogue of the expansion in Theorem 3.2. Consequently, the test is able to detect local alternatives converging to the null hypothesis with any rate $a_n$ such that $\frac{a_n}{k_n}\sqrt{n} \to \infty$ when both the sample size and the dimension $k_n$ of $Z$ are increasing.
Remark 3.8 Jeong et al. (2012) investigated an alternative test for the hypothesis (2.1), based on ideas of Fan and Li (1996) in combination with a modification originally proposed by Zheng (1998). Their test is based on a kernel statistic $J_n$, where $L$ is a kernel and $g_n$ is a bandwidth converging to 0 with increasing sample size. These authors claimed that a normalized version of this test statistic converges to a normal distribution. It should be pointed out here that the proof given in that paper is not correct. The basic argument of Jeong et al. (2012) consists in the claim that a certain inequality, stated as (3.4) here, holds between $J_n$ and two auxiliary statistics $J_{nU}$ and $J_{nL}$ [see equations (3.11)-(3.13)]. A simple calculation shows that this conclusion is not correct, and in fact the inequality (3.4) does not hold. It turns out that the proof of Theorem 1 in Jeong et al. (2012) cannot be corrected easily. Even if the gap in the proof could be closed, the test of Jeong et al. (2012) would still have two major drawbacks. First, it requires nonparametric smoothing with respect to the covariate $Z$. Second, it can only detect local alternatives converging to the null hypothesis at a rate $n^{-1/2} h^{-(d+q)/4}$, which is slower than the rate $b_n n^{-1/2}$ (for any $b_n \to \infty$) detected by the test proposed in this paper, and which additionally depends on the dimension of the covariates.

Bootstrap and simulation results
In general the limit distribution derived in Corollary 3.3 depends on certain features of the data generating process which are difficult to estimate. For this reason we discuss in this section bootstrap methods that are suitable to mimic the distribution of test statistics based on $T_n$ under the null hypothesis. To be precise, let $P^*$ denote the conditional probability $P(\cdot \mid \mathcal{Y}_n)$ given the sample $\mathcal{Y}_n = \{(Y_i, X_i, Z_i) \mid i = 1, \dots, n\}$, and denote by $E^*$ and $\mathrm{Cov}^*$ the corresponding conditional expectation and covariance. Several residual wild bootstrap approximations have been proposed in the literature for quantile regression analysis [see Sun (2006) or Feng et al. (2011)]. However, the residual wild bootstrap does not yield a valid approximation of the limiting distribution in the present context, because it does not lead to an expansion of the bootstrap process analogous to the one given for $T_n$ in Theorem 3.2. As an alternative we consider the idea of a process-based wild bootstrap as considered by Delgado and González-Manteiga (2001) or He and Zhu (2003). To this end recall the definition of the "residuals" $\hat\varepsilon_i = Y_i - \hat q_\tau(X_i)$, where $\hat q_\tau$ denotes an estimator of the conditional $\tau$-quantile of $Y_i$ given $X_i$, define $\hat\tau := \frac{1}{n}\sum_{j=1}^n I\{\hat\varepsilon_j \le 0\}$ and introduce independent identically distributed Bernoulli random variables $B_1, \dots, B_n$ with success probability $\hat\tau$, which are independent of the original data. Define the bootstrap process as
$$T_n^*(\Theta, z) := \frac{1}{n}\sum_{i=1}^n \big(B_i - \hat\tau\big)\, I\{X_i \in \Theta\}\big(I\{Z_i \le z\} - \hat F_{Z|X,\varepsilon}(z|X_i, 0)\big), \qquad (4.1)$$
where $\hat F_{Z|X,\varepsilon}$ denotes a kernel estimator of the conditional distribution $F_{Z|X,\varepsilon}(\cdot|x, y)$. Here, $L$ and $N$ denote $d$- and one-dimensional kernel functions and $a$ and $e$ corresponding bandwidths converging to 0 with increasing sample size. For the sake of brevity we do not consider conditional weak convergence of the process $T_n^*$ in detail, but note that $E^*[T_n^*(\Theta, z)] = 0$ and that under the null hypothesis $H_0$ (and under suitable regularity conditions) the conditional covariance $n\,\mathrm{Cov}^*(T_n^*(\Theta_1, y), T_n^*(\Theta_2, z))$ converges in probability to the covariance $\mathrm{Cov}(T(\Theta_1, y), T(\Theta_2, z))$ defined in Corollary 3.3.

In our numerical investigations, it turned out that the asymptotic representation (3.1) for the process defined in (2.3) is not very accurate for small sample sizes. We thus considered a slightly modified version of this process, namely
$$\tilde T_n(x, z) := T_n(x, z) - \delta_{x,z}, \qquad \delta_{x,z} := \hat F_Z(z)\, \frac{1}{n}\sum_{i=1}^n \big(I\{Y_i \le \hat q_\tau(X_i)\} - \tau\big)\, I\{X_i \le x\}, \qquad (4.2)$$
where $\hat F_Z(z)$ denotes the empirical distribution function of $Z_1, \dots, Z_n$; this modification provided much better results for moderate sample sizes. As motivation for this approach, observe that under both the null hypothesis and the alternative, we have
$$\frac{1}{n}\sum_{i=1}^n \big(I\{Y_i \le \hat q_\tau(X_i)\} - \tau\big)\, I\{X_i \le x\} = o_P(n^{-1/2})$$
uniformly with respect to $x$, as can be seen by taking a closer look at the proofs of the main results in the Appendix. Thus the additional correction term vanishes asymptotically (uniformly with respect to $x$ and $z$) under both the null hypothesis and the alternative. If, on the other hand, $\delta_{x,z}$ is relatively large because the sample size is small, the correction term $\delta_{x,z}$ induces an additional centering (the factor $\hat F_Z(z)$ corresponds to the proportion of non-zero indicators $I\{Z_i \le z\}$). A schematic implementation of the bootstrap is sketched below.
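The following schematic implementation of the bootstrap process (4.1) is our own sketch: it assumes precomputed residuals $\hat\varepsilon_i$, uses Nadaraya-Watson weights with Gaussian kernels for $\hat F_{Z|X,\varepsilon}$ (as in the text), and evaluates the supremum over rectangles on the design points; all function and variable names are ours.

```python
import numpy as np

def bootstrap_statistics(X, Z, resid, tau_hat, a, e, n_boot=300, rng=None):
    """Draw n_boot replicates of sup_{t,z} sqrt(n) |T_n*(t, z)| based on (4.1),
    with Theta ranging over rectangles {x <= t} on the design points.

    resid : residuals eps_i = Y_i - qhat_tau(X_i)
    a, e  : bandwidths of the kernel estimate of F_{Z|X,eps}(. | x, 0)
    """
    rng = np.random.default_rng(rng)
    n = len(resid)
    # Nadaraya-Watson weights w[i, j] for the estimate of F_{Z|X,eps}(.|X_i, 0);
    # unnormalized Gaussian kernels suffice since the weights are normalized
    wx = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) / a) ** 2).prod(-1)
    we = np.exp(-0.5 * (resid[None, :] / e) ** 2)       # kernel in (0 - eps_j)/e
    w = wx * we
    w /= w.sum(axis=1, keepdims=True)
    lez = np.all(Z[:, None, :] <= Z[None, :, :], axis=-1)   # lez[j,k]=I{Z_j<=Z_k}
    F_cond = w @ lez                                    # Fhat(Z_k | X_i, 0)
    in_theta = np.all(X[:, None, :] <= X[None, :, :], axis=-1)  # I{X_i <= t_m}
    stats = np.empty(n_boot)
    for b in range(n_boot):
        B = rng.binomial(1, tau_hat, size=n) - tau_hat  # B_i - tau_hat
        core = B[:, None] * (lez - F_cond)              # summands of (4.1)
        T_star = in_theta.T @ core / n                  # T_n*(t_m, Z_k)
        stats[b] = np.sqrt(n) * np.abs(T_star).max()
    return stats
```

The critical value $K^*_{n,1-\alpha}$ in (4.6) is then the empirical $(1-\alpha)$ quantile of the returned array, e.g. `np.quantile(stats, 0.95)` for $\alpha = 5\%$.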
The simulation results described below confirm that this is a sensible approach. For the calculation of the test statistic based on the process $\tilde T_n$, we use local polynomial estimators of order two [see (2.4)]. The bandwidth $h_n$ of this estimator is chosen as $h_n := (\hat\sigma^2/2n)^{13/50}$, where $\hat\sigma^2$ denotes the variance estimate of Rice (1984) computed from the sample $\{(X_i, Y_i) \mid i = 1, \dots, n\}$ [see Yu and Jones (1997) for a related approach]. The bandwidths used in (2.5) and (4.1) are chosen as $d_n = a = e = h_n$, while the choice of $b_n$ in (2.7) is even less critical [see also Dette and Volgushev (2008)], and we use $b_n = h_n^3$. In fact, in the simulations it turned out that the power and size properties of the test are rather insensitive with respect to the bandwidth choice; see Table 3 and the related discussion below. The function $\omega$ in (2.6) is chosen as $\omega(x) := (15/32)(3 - 10x^2 + 7x^4)\, I\{|x| \le 1\}$, which is a kernel of order 4 [see Gasser et al. (1985)]. The function $\kappa$ in (2.7) is chosen as the Epanechnikov kernel, while all other kernels are Gaussian kernels. For the choice of the distribution function $G$ in (2.7) we follow the procedure described in Dette and Volgushev (2008), who suggested a normal distribution whose 5% and 95% quantiles coincide with the corresponding empirical quantiles of the sample $Y_1, \dots, Y_n$. A sketch of the bandwidth rule is given below.
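For completeness, here is a minimal sketch of the first-difference variance estimator of Rice (1984) and the resulting bandwidth rule; the ordering step assumes a univariate covariate, and the function names are ours.

```python
import numpy as np

def rice_variance(x, y):
    """First-difference variance estimator of Rice (1984):
    sigma2_hat = sum_i (y_(i+1) - y_(i))^2 / (2 (n - 1)),
    where y_(1), ..., y_(n) are the responses ordered by the covariate."""
    order = np.argsort(x)
    dy = np.diff(y[order])
    return np.sum(dy ** 2) / (2.0 * (len(y) - 1))

def bandwidth_h(x, y):
    """Bandwidth rule h_n = (sigma2_hat / (2 n))^{13/50} used in Section 4."""
    n = len(y)
    return (rice_variance(x, y) / (2.0 * n)) ** (13.0 / 50.0)
```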
The random variables $X$ and $Z$ are independent and uniformly distributed on the interval $[0, 1]$, while $\varepsilon$ is standard normal; the data are generated from the location-scale model
$$Y = q(X, Z) + s(X, Z)\,\varepsilon, \qquad (4.3)$$
where the location function $q$ and the scale function $s$ vary among the candidates specified in (4.4) and (4.5), respectively. We consider the cases $\tau = 0.5$ and $\tau = 0.25$. All reported results are based on 1000 simulation runs with 300 bootstrap replications. The bootstrap test (at level $\alpha$) rejects the null hypothesis that the variable $Z$ is not significant whenever
$$\tilde K_n > K^*_{n,1-\alpha}, \qquad (4.6)$$
where $\tilde K_n$ denotes the Kolmogorov-Smirnov statistic based on the modified process defined in (4.2) and $K^*_{n,1-\alpha}$ denotes the $(1 - \alpha)$ bootstrap quantile of the Kolmogorov-Smirnov statistic.

The rejection probabilities of this test under the null hypothesis are shown in Table 1 for the 50% and 25% quantiles. Note that different pairs of location and scale functions in (4.4) and (4.5) correspond to the null hypothesis for $\tau = 0.5$ and for $\tau = 0.25$; more precisely, the models defined by the pairs (1, 3), (1, 4), (2, 3) and (2, 4) correspond to the null hypothesis if $\tau = 0.5$ but to the alternative if $\tau = 0.25$. We observe from Table 1 that the level is usually approximated very well. For $\tau = 0.25$ there exist some cases where the test is slightly conservative. The corresponding results for various alternatives are displayed in Table 2, and we observe reasonable power in most cases. The power for $\tau = 0.25$ is always smaller than the power for $\tau = 0.5$. This corresponds to intuition, because the 25% quantile is more difficult to estimate than the median. The power of the test is smaller for alternatives corresponding to the location function $q_4(x, z) = \sin(2\pi(x + z))$ if the sample size is $n = 100$. However, if the sample size is larger, the test also detects these alternatives with reasonable probability. For example, if $n = 200$ and $\tau = 0.5$, the simulated rejection probabilities of the bootstrap test at level 5% for the alternatives (4, 2), (4, 3) and (4, 4) are given by 0.319, 0.795 and 0.821, respectively.

Next we study the impact of the choice of the bandwidth on the size and power of the bootstrap test. For this purpose we consider the sample size $n = 50$ and the bandwidths 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45 and 0.50. The results for the models (1, 2) and (3, 2), corresponding to the null hypothesis and alternative, respectively, are summarized in Table 3.

[Table 3: Simulated rejection probabilities of the bootstrap test (4.6) for various bandwidths. The sample size is $n = 50$ and the lower and upper parts correspond to the 50% and 25% quantile, respectively. The pair $(k, \ell)$ corresponds to the location function $q_k$ and scale function $s_\ell$ specified in (4.4) and (4.5), respectively.]

We observe that the level and power are rather stable with respect to different choices of the bandwidth. Simulations for other scenarios yield similar results and are not shown for the sake of brevity. We conclude our numerical study with a brief investigation of a two-dimensional predictor, say $Z = (Z_1, Z_2)$. Because the method proposed in this paper does not require smoothing in the $Z$-direction, the results should not be seriously affected if the dimension of $Z$ is larger. To be precise, we consider two different location functions
$$q_1(x, z_1, z_2) = x, \qquad q_2(x, z_1, z_2) = z_2 \cdot x + z_1^2 \qquad (4.7)$$
and a constant scale function $s(x, z_1, z_2) = 0.5$ in model (4.3).
Note that $q_1$ corresponds to the null hypothesis, while $q_2$ represents an alternative. The results of the bootstrap test for the median are listed in Table 4 for the sample size $n = 50$, and we observe in these examples similar satisfactory properties as in the one-dimensional setting. A data-generating sketch for this experiment is given below.

[Table 4: Simulated rejection probabilities of the bootstrap test (4.6) for the significance of a two-dimensional predictor in median regression. The models are defined in (4.7), the sample size is $n = 50$ and the upper (lower) row corresponds to the null hypothesis (alternative).]
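To make the setup concrete, here is a sketch (ours) of the data-generating mechanism for the two-dimensional experiment: model (4.3) with the location functions from (4.7) and constant scale 0.5; the combination with the earlier test and bootstrap sketches is indicated in the comments.

```python
import numpy as np

def simulate_two_dim(n, alternative=False, rng=None):
    """One sample from model (4.3) with the location functions of (4.7)
    and constant scale s(x, z1, z2) = 0.5 (two-dimensional predictor Z)."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(size=(n, 1))
    Z = rng.uniform(size=(n, 2))
    eps = rng.standard_normal(n)
    if alternative:                      # q_2(x, z1, z2) = z2 * x + z1^2
        loc = Z[:, 1] * X[:, 0] + Z[:, 0] ** 2
    else:                                # q_1(x, z1, z2) = x
        loc = X[:, 0]
    Y = loc + 0.5 * eps
    return Y, X, Z

# one replication of the level-5% median test would then combine the sketches
# above: compute the Kolmogorov-Smirnov statistic with ks_statistic(...) and
# compare it with np.quantile(bootstrap_statistics(...), 0.95)
```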

A Appendix: Proofs
Throughout this section, we use the abbreviation $\Theta_n := \Theta \cap D_n$, with $D_n$ defined in Section 2.

Lemma A.2 Under the assumptions of Theorem 3.2,
$$\int f_{\varepsilon|X}(0|s)\big(\hat q_\tau(s) - q_\tau(s)\big)\, I\{s \in \Theta_n\}\, f_X(s)\, F_{Z|X,\varepsilon}(z|s, 0)\, ds = -\frac{1}{n}\sum_{i=1}^n \big(I\{\varepsilon_i \le 0\} - \tau\big)\, I\{X_i \in \Theta_n\}\, F_{Z|X,\varepsilon}(z|X_i, 0) + o_P(n^{-1/2})$$
uniformly with respect to $\Theta \in \Xi$, $z \in R_Z$.
Proof. From Lemma A.1 we obtain a representation whose leading term equals
$$-\int f_{\varepsilon|X}(0|s)\big(\hat q_\tau(s) - q_\tau(s)\big)\, I\{s \in \Theta_n\}\, f_X(s)\, F_{Z|X,\varepsilon}(z|s, 0)\, ds.$$
We will now proceed to show that the first part in this decomposition [i.e. the part containing $I_1$] determines the asymptotic expansion, and establish at the end of the proof that the part corresponding to $I_2$ is asymptotically negligible. First, observe that every entry of $M$ is by assumption continuously differentiable with respect to $s$ and the derivative is uniformly bounded. The corresponding class of functions, where $\alpha$ is a small positive number, has covering numbers that satisfy the assumptions of part 1 of Lemma B.3 in Appendix B. This follows from Lemma B.2, together with the fact that under the assumptions (A1), (A3) the mapping $(\zeta, a) \mapsto q_\zeta(x + a)$ satisfies
$$\sup_x \big|q_{\zeta_1}(x + a_1) - q_{\zeta_2}(x + a_2)\big| \le C\big(|\zeta_1 - \zeta_2| + \|a_1 - a_2\|_\infty\big)$$
for some finite constant $C$ (this inequality is a consequence of the implicit function theorem). Moreover, it follows from the smoothness assumptions on $F_{Y|X}$ and the properties of $\Omega$ that the corresponding approximation holds with a nonrandom quantity $R_n$ of order $o(1/\sqrt{n})$. Thus the smoothness properties of $F_{Z|X,\varepsilon}$, $F_{Y|X}$ and $(\zeta, x) \mapsto q_\zeta(x)$ imply, by Lemma B.2 and Lemma B.3 in Appendix B, a corresponding expansion with a non-random quantity $R_n = O(h_n + b_n + d_n)$. This, together with an application of Lemma B.3 and the fact that $F_{\varepsilon|X}(0|X_i) = \tau$, yields an expansion of
$$\int f_{\varepsilon|X}(0|s)\big(\hat q_\tau(s) - q_\tau(s)\big)\, I\{s \in \Theta_n\}\, f_X(s)\, F_{Z|X,\varepsilon}(z|s, 0)\, ds,$$
where $\mu_k(K) := \int_{\mathbb{R}^d} K_{1,k}(u)\, du$. Now from the definition of $M$ it is easy to see that the remainder $R_M$ is a vector whose entries are uniformly bounded and Lipschitz-continuous with respect to $x$. Thus applying Lemma B.3 completes the first part of the proof.

It remains to show that the remaining term is $o_P(n^{-1/2})$ uniformly with respect to $\Theta \in \Xi$, $z \in R_Z$. To this end, consider the ($n$-dependent) class of functions $\mathcal{F}_n$ whose elements involve the factor
$$\big(K_{h_n,0}(s - x), \dots, K_{h_n,k_{N_p,p}}(s - x)\big)^t\, I\{s \in \Theta_n\}\, f_X(s)\, F_{Z|X,\varepsilon}(z|s, 0)\, ds\, dv,$$
indexed by $z \in R_Z$, $\Theta \in \Xi$; it contains uniformly bounded elements (the bound is also uniform with respect to $n$). Moreover, there exists a finite positive constant $C$ such that
$$N_{[\,]}\big(\mathcal{F}_n, \varepsilon, L_2(P_X)\big) \le N_{[\,]}\big(\mathcal{F}_{n,1}, \varepsilon/C, L_2(P_X)\big)\, N_{[\,]}\big(\mathcal{F}_{n,2}, \varepsilon/C, L_2(P_X)\big)^2, \qquad \text{(A.1)}$$
where $\mathcal{F}_{n,1} := \{s \mapsto I\{s \in \Theta_n\} \mid \Theta \in \Xi\}$ and $\mathcal{F}_{n,2} := \{s \mapsto F_{Z|X,\varepsilon}(z|s, \varepsilon) \mid z \in R_Z\}$. To see that this holds, observe the decomposition of $f_{z,\Theta_n,h_n,b_n}(x, y)$ into parts involving $g_{1,n}$ and $g_{2,n}$, which denote non-positive and non-negative, uniformly bounded functions, respectively. Moreover, the $g_{j,n}$ do not depend on $\Theta_n$ or $z$. Obviously, it suffices to bound the bracketing number of $\mathcal{F}_{j,n} := \{(x, y) \mapsto f^{(j)}_{z,\Theta_n,h_n,b_n}(x, y)\}$ for $j = 1, 2$ separately. Denote by $\{[b_{L,j}, b_{U,j}]\}$ a collection of $\varepsilon$-brackets (with respect to $L_2(P_X)$) for $\{s \mapsto I\{s \in \Theta_n\} F_{Z|X,\varepsilon}(z|s, 0)\}$; from these one obtains a collection of $\varepsilon/C$-brackets for $\mathcal{F}_{n,2}$ (with respect to $L_2(P_{X,Y})$), which can be verified directly for some finite constant $C_1$. A bound for $\mathcal{F}_{n,2}$ can be derived by similar arguments. Thus (A.1) is established. Combining the bound in (A.1) with the assumptions (S1) and (S2), the estimate $\sup_{z,\Theta}\big|E[f_{z,\Theta_n,h_n,b_n}(X_1, Y_1)]\big| = o(n^{-1/2})$, and the results from Lemma B.2 and Lemma B.3, yields the assertion after noting that by assumption $\sup_{\Theta \in \Xi} E I^2(X_i; \Theta_n, h_n) = o(1)$. ✷

Lemma A.3 Under the assumptions of Theorem 3.2 it holds that
uniformly with respect to Θ ∈ Ξ, z ∈ R Z , where F X,Z denotes the joint distribution function of X, Z.
Proof. Note that $T_n(\Theta, z) = \frac{1}{n}\sum_{i=1}^n \big(I\{\varepsilon_i \le 0\} - \tau\big)\, I\{X_i \in \Theta\}\, I\{Z_i \le z\}$, and that the assertion is equivalent to the corresponding statement for the process built from an estimator $\hat q_{\tau,L}$ computed from a sample $(Y_i, X_i, Z_i)$, $i = 1, \dots, n$, that is independent from the generic variable $(Y, X, Z)$. The proof now proceeds in two steps. First, note that by Lemma A.1 we have $\hat q_\tau - \hat q_{\tau,L} = o_P(n^{-1/2})$ uniformly on $D_n$, and thus there exists a deterministic sequence $\gamma_n = o(n^{-1/2})$ with $P\big(\sup_{x \in D_n}|\hat q_\tau(x) - \hat q_{\tau,L}(x)| \le \gamma_n\big) \to 1$. On the set $\{\sup_{x \in D_n}|\hat q_\tau(x) - \hat q_{\tau,L}(x)| \le \gamma_n\}$, the probability of which tends to one, the difference of the two processes can be bounded in terms of the indicators $I\{|\varepsilon_{i,L}| \le \gamma_n\}\, I\{X_i \in D_n\}$. Next, note that $I\{|\varepsilon_{i,L}| \le \gamma_n\} = I\{|\varepsilon_i - g(X_i)| \le \gamma_n\}$ for $g = \hat q_{\tau,L} - q_\tau$. Now the assertion follows since the corresponding ($n$-dependent) class of functions satisfies the assumptions of part 1 of Lemma B.3 whenever $n$ is sufficiently large [see the proof of Lemma A.3 in Neumeyer and Van Keilegom (2010) for a similar reasoning], and $\hat q_{\tau,L} - q_\tau \in C_1^{d+1}(D_n)$ with probability converging to one by Lemma A.1. Here $C_1^{d+1}(D_n)$ denotes the class of $d + 1$ times differentiable functions $g$ defined on $D_n$. This, together with (A.2) and an application of Lemma B.3, shows the desired negligibility; similar arguments applied to a second ($n$-dependent) class of functions complete the proof. ✷

Proof of Theorem 3.2. Starting from the stochastic expansion given in Lemma A.3, a Taylor expansion (with intermediate points $\xi_{x,s,n}$ between 0 and $\hat q_\tau(s) - q_\tau(s)$) yields a remainder of order $o_P(n^{-1/2})$, due to Lemma A.1 and the assumptions $\sup_{x \in D, y \in \mathbb{R}, z \in R_Z}|f'_{\varepsilon|X,Z}(y|x, z)| < \infty$ and $d_n^{2s} + \log n/(n h_n^d) = o(n^{-1/2})$. Note that
$$\int f_{\varepsilon|X,Z}(0|s, t)\big(\hat q_\tau(s) - q_\tau(s)\big)\, I\{s \in \Theta_n\}\, I\{t \le z\}\, dF_{X,Z}(s, t) = \int F_{Z|X,\varepsilon}(z|s, 0)\, f_{\varepsilon|X}(0|s)\, f_X(s)\big(\hat q_\tau(s) - q_\tau(s)\big)\, I\{s \in \Theta_n\}\, ds.$$
By Lemma A.2 the assertion of Theorem 3.2 thus follows. This completes the proof. ✷

Proof of Corollaries 3.3 and 3.5. Define the sequence of $n$-dependent classes of functions corresponding to the summands in the expansion (3.1), and note that it is indexed by the totally bounded metric space $(\Xi \times R_Z, \rho)$, where $\rho$ denotes the natural $L_2$-type metric based on the functions $(x, e, t) \mapsto (I\{e \le 0\} - \tau)\, I\{x \in \Theta_n\}\big(I\{t \le z\} - F_{Z|X,\varepsilon}(z|x, 0)\big)$. Moreover, this class satisfies the assumptions of part 2 of Lemma B.3. A simple calculation in combination with the assumption $\sup_{\Theta \in \Xi} P(X_i \in \Theta \setminus \Theta_n) = o(1)$ shows that all the assumptions of Theorem 2.11.23 in van der Vaart and Wellner (1996) are satisfied. In particular, the covariances $\mathrm{Cov}(W_{\Theta_n, y}, W_{\Theta'_n, z})$ converge to $k(\Theta, y, \Theta', z)$ given in Corollary 3.3. This implies that the process $\sqrt{n}\big(T_n(\Theta_n, z) - \tilde T_n(\Theta_n, z)\big)$, where
$$\tilde T_n(\Theta_n, z) := E\big[\big(I\{\varepsilon_i \le 0\} - \tau\big)\, I\{X_i \in \Theta_n\}\big(I\{Z_i \le z\} - F_{Z|X,\varepsilon}(z|X_i, 0)\big)\big],$$
converges weakly to the centered Gaussian process $T$ described in Corollary 3.3. Thus Corollaries 3.3 and 3.5 follow after a straightforward calculation of the expectation $\tilde T_n(\Theta_n, z)$. Now the proof is complete. ✷

B Technical results
Before stating the main results of this section, we discuss some basic properties of the local polynomial estimator $\hat F_{Y|X}(y|x; p)$.

Lemma B.1 Under the assumptions (K1), (K2), (K5), (A1), (A2), the estimator $\hat F_{Y|X}(y|x; p)$ admits a stochastic expansion uniformly with respect to $(x, y) \in D_n \times \mathcal{Y}$, where $\mathcal{Y}$ is any bounded subset of $\mathbb{R}$ and the $M_k$ denote matrices with uniformly bounded entries that are independent of $x$, $n$ and $y$. Moreover, the quantity $\Delta_S(y|x)$ is, with probability tending to one, $d + 1$ times continuously differentiable with respect to $x$ and $y$, and all its partial derivatives of corresponding orders are uniformly bounded on $D_n \times \mathcal{Y}$.
Proof. At the end of the proof, we will establish the representations (B.1) and (B.2), where $M_0, \dots, M_{k_{N_{n_f}, n_f}}$ denote matrices that do not depend on $n$ and $x$, $M_0 = M(K)$ is invertible, and $H$ is a diagonal matrix with entries $1, h_n, \dots, h_n, h_n^2, \dots, h_n^2, \dots, h_n^p, \dots, h_n^p$, where the term $h_n^{|k|}$ appears $N_{|k|}$ times in this vector. By definition, tedious but straightforward calculations including integration by parts and substitutions yield the estimates in (B.3). A combination of parts 1, 2 and 6 of Lemma B.2 shows that, for every $n$, the class of functions under consideration satisfies the required entropy bounds, so that
$$\hat F_{Y|X}(y|x) = F_{Y|X}(y|x) + e_1^t\big(\mathbf{X}^t\mathbf{W}\mathbf{X}\big)^{-1}\big(T_{n,0,S}(x, y), \dots, T_{n,k_{N_p,p},S}(x, y)\big)^t + o_P(n^{-1/2}),$$
and thus the proof of the first part of the lemma is complete.
For a proof of the differentiability assertions, note that the $(d + 1)$-fold differentiability of a scalar product of two vector-valued functions follows from the $(d + 1)$-fold differentiability of every entry of both vectors. This establishes that $\Delta_S(y|x)$ is $d + 1$ times continuously differentiable with respect to both components and that all partial derivatives are uniformly bounded. By the results in (B.3), the proof is thus complete once we establish (B.1) and (B.2).
Proof of (B.1). A Taylor expansion of $F_{Y|X}(y|x)$, combined with the observations that $\frac{1}{n h_n^d}\sum_i |K_{h_n, k_{N_p,p}}(x - X_i)| = O_P(1)$ and $e_1^t(\mathbf{X}^t\mathbf{W}\mathbf{X})^{-1} = (O_P(1), \dots, O_P(h_n^{-p}))$ [see the last part of the proof], yields
$$\hat F_{Y|X}(y|x) = F_{Y|X}(y|x) + e_1^t\big(\mathbf{X}^t\mathbf{W}\mathbf{X}\big)^{-1}\big(h_n^0\, T_{n,0,S}(x, y), \dots, h_n^p\, T_{n,k_{N_p,p},S}(x, y)\big)^t + O_P(h_n^{p+1}).$$

Lemma B.2

3. Consider the class of functions defined via $\Omega$, where $\Omega$ is Lipschitz-continuous and there exist constants $C_1, C_2$ such that $\Omega$ is constant on $(-\infty, C_1]$ and $[C_2, \infty)$. Assume additionally that the distribution of $(X, Y)$ has a uniformly bounded density; then the corresponding bracketing bound holds for some constants $C_5, C_6$ independent of $n$.
4. For any measure $P_{U,V}$ on the unit interval with uniformly bounded density $f$, the class of functions under consideration can be covered by $C\varepsilon^{-2}$ brackets of $L_2(P)$-length $\varepsilon$.

5. For any measure $P$ on $\mathbb{R} \times \mathbb{R}^k$ with uniformly bounded conditional density $f_{V|U}$, the corresponding class of indicator functions admits an analogous bracketing bound.

6. Assume that $f(x; a)$ is a function indexed by the parameter $a \in \mathcal{A}$ such that $\|f(\cdot\,; s) - f(\cdot\,; t)\|_\infty \le C\|s - t\|^\theta$ for some $\theta > 0$ and norm $\|\cdot\|$. Then the $\|\cdot\|_\infty$-bracketing numbers of the class of functions $\mathcal{F} = \{u \mapsto f(u; a) \mid a \in \mathcal{A}\}$ satisfy $N_{[\,]}(\mathcal{F}, \varepsilon, \|\cdot\|_\infty) \le C_1 N(\mathcal{A}, C_2\varepsilon^{1/\theta}, \|\cdot\|)$ for some finite constants $C_1, C_2$.
Proof. Part 1. The first assertion is obvious from the definition of bracketing numbers. For the second assertion, note that $\mathcal{F}\mathcal{G} = (\mathcal{F} + C)(\mathcal{G} + C) - C\mathcal{F} - C\mathcal{G} + C^2$. Moreover, all elements of the classes $\mathcal{F} + C$, $\mathcal{G} + C$ are by construction non-negative, and thus it is also possible to cover them with brackets consisting of non-negative functions, in numbers equal to the bracketing numbers of $\mathcal{F}$ and $\mathcal{G}$, respectively. Finally, observe that if $0 \le f_l \le f \le f_u$ and $0 \le g_l \le g \le g_u$, we also have $f_l g_l \le fg \le f_u g_u$ and $\|f_l g_l - f_u g_u\| \le C\|f_u - f_l\| + C\|g_u - g_l\|$. Thus the class $(\mathcal{F} + C)(\mathcal{G} + C)$ can be covered by at most $N_{[\,]}(\mathcal{F}, \varepsilon, \|\cdot\|)\, N_{[\,]}(\mathcal{G}, \varepsilon, \|\cdot\|)$ brackets of length $2C\varepsilon$. Finding brackets for the classes $C\mathcal{F}$, $C\mathcal{G}$ is trivial, and applying the first assertion of the lemma completes the proof.

Part 2. Consider two cases. A) $\varepsilon > 4h^{1/2}$: Divide $[0, 1]$ into $N := 2/\varepsilon^2$ subintervals of length $2\alpha := \varepsilon^2$ with centers $r\alpha$ for $r = 1, \dots, N$, and call the intervals $I_1, \dots, I_N$. Note that two adjacent intervals overlap by $\alpha > 2h$. This construction ensures that every set of the form $[x - h, x + h]$ with $x \in [h, 1 - h]$ is completely contained in at least one of the intervals defined above. Then a collection of $N$ brackets of $L_2$-length $D\varepsilon$, for some $D > 0$ independent of $h$, is given by $(-CI\{u \in I_j\}, CI\{u \in I_j\})$.
B) $\varepsilon \le 4h^{1/2}$: Observe that by assumption any element $g$ of $\mathcal{F}$ satisfies $|g(x) - g(y)| \le C|x - y| h^{-k}$. Consider the points $t_i := i/(N + 1)$, $i = 1, \dots, N$, with $N := 2^{2k+1} C/\varepsilon^{2k+1}$. By construction, every point of $[0, 1]$ lies within distance $1/(N + 1)$ of some $t_i$. Then $N$ $\|\cdot\|_\infty$-brackets of length $\varepsilon$ covering $\mathcal{F}$ are given by $(g(t_i) - \varepsilon/2, g(t_i) + \varepsilon/2)$, $i = 1, \dots, N$. From those one can easily construct $L_2(P_X)$-brackets.

Part 3. Without loss of generality, assume that $\Omega$ equals one on $[1, \infty)$ and zero on $(-\infty, -1]$. Moreover, the assumptions on $\Omega$ imply the existence of finite constants $C_l, C_u$ such that $C_l \le \Omega \le C_u$. Distinguish two cases. A) $\varepsilon \le d_n$: Starting with $\varepsilon^2$ supremum-norm brackets for the class $\mathcal{G}$ and using the Lipschitz condition yields the desired brackets. B) $\varepsilon > d_n$: Denote by $[g_{1,l}, g_{1,u}], \dots, [g_{N(\varepsilon),l}, g_{N(\varepsilon),u}]$ brackets for the class $\mathcal{G}$ of $\|\cdot\|_\infty$-size $\varepsilon$. For a function $g \in \mathcal{G}$, denote the bracket that contains it by $[g_{j(g),l}, g_{j(g),u}]$. Brackets of the corresponding form then contain every function in $\mathcal{F}_n$, and the $L_2$-length of each such bracket is bounded by $(C_u - C_l)(2d_n + \varepsilon)\sup f_{X,Y}(x, y) \le C\varepsilon$. This completes the proof.

Part 4 follows by standard arguments. Part 5 follows from $|I\{v \le g_1(u)\} - I\{v \le g_2(u)\}| \le I\{|v - g_1(u)| \le 2\|g_1 - g_2\|_\infty\}$. Part 6 is obvious. ✷

Lemma B.3 (Basic Lemma) Assume that the classes of functions $\mathcal{F}_n$ consist of uniformly bounded functions (by a constant not depending on $n$).
1. If for some $a < 2$ we have $N_{[\,]}(\mathcal{F}_n, \varepsilon, L_2(P)) \le C\exp(c\varepsilon^{-a})$ for every $\varepsilon \le \delta_n$, with constants $C, c$ not depending on $n$, then we have
$$\sqrt{n}\, \sup_{f \in \mathcal{F}_n,\, \|f\|_{P,2} \le \delta_n} \Big|\int f\, dP_n - \int f\, dP\Big| = o_P^*(1),$$
where the $*$ denotes outer probability; see van der Vaart and Wellner (1996) for a more detailed discussion.
2. If $N_{[\,]}(\mathcal{F}_n, \varepsilon, L_2(P)) \le C\varepsilon^{-a}$ for every $\varepsilon \le \delta_n$, some $a > 0$ and a constant $C$ not depending on $n$, then we have for any $\delta_n \sim n^{-b}$ with $b < 1/2$
$$\sqrt{n}\, \sup_{f \in \mathcal{F}_n,\, \|f\|_{P,2} \le \delta_n} \Big|\int f\, dP_n - \int f\, dP\Big| = O_P^*\big(\delta_n |\log \delta_n|\big).$$
Proof. Start by observing that the uniform boundedness of the elements of $\mathcal{F}_n$ by $D$ implies that $F \equiv D$ is a measurable envelope function with $L_2$-norm $D$. The proof of the first part follows by arguments similar to those used for the proof of the second part and is therefore omitted. For the proof of the second part, note that for $\eta_n$ sufficiently small the relevant maximal inequality holds with a constant $C$ that depends only on $\kappa$, where $R_{n,1} := C b_n^2 \sup_{|s - \tau| \le b_n} \big|(G \circ F^{-1})''(s)\big|$.
Summarizing, we have obtained the bound $|H(F_1) - H(F_2)| \le R_{n,3} + R_{n,4}$, where $C$ denotes some constant depending only on the kernel $\kappa$. Assertion (b) follows from this estimate and a Taylor expansion of $G^{-1}$.
This completes the proof. ✷