Adaptive confidence sets for kink estimation

Abstract: We consider estimation of the location and the height of the jump in the γ-th derivative - a kink of order γ - of a regression curve, which is assumed to be Hölder smooth of order s ≥ γ + 1 away from the kink. Optimal convergence rates as well as the joint asymptotic normal distribution of estimators based on the zero-crossing-time technique are established. Further, we construct joint as well as marginal asymptotic confidence sets for these parameters which are honest and adaptive with respect to the smoothness parameter s over subsets of the Hölder classes. The finite-sample performance is investigated in a simulation study, and a real data illustration is given to a series of annual global surface temperatures.


Introduction
Suppose that we observe data (x_i, Y_i), i = 1, . . . , n, from the regression model (1.1), where for a given γ ∈ N and some unknown θ_f ∈ (0, 1), the regression function f is assumed to be γ times continuously differentiable on [0, 1] \ {θ_f}, and the one-sided limits of f^{(γ)} at θ_f exist and their difference [f^{(γ)}] is non-zero. Such a jump discontinuity of the γ-th derivative is called a kink of order γ, and θ_f is the location and [f^{(γ)}] is the size of the kink. Away from the kink, f is assumed to be Hölder smooth of some order s ≥ γ + 1. Our purpose is to estimate the parameters θ_f and [f^{(γ)}] optimally, and to construct joint and marginal confidence sets which are honest and adaptive with respect to the smoothness parameter s over suitable subsets of the Hölder classes. In model (1.1), we shall assume that the x_{i,n} = x_i are fixed, equidistant design points.

Change points and other irregularities such as kinks are important features of signals, and are of interest in various areas such as economics, medicine or the physical sciences. For example, in regression discontinuity or regression kink designs (Card et al., 2015), the aim is to infer the change of the level or the slope of an outcome variable from a policy change in level or slope of an assignment variable. For further reading on change point and edge detection and estimation we refer to the monographs of Carlstein et al. (1994), Korostelev and Tsybakov (1993) and Qiu (2005). Korostelev (1988) obtained the optimal convergence rate for a change point in the white noise model over a nonparametric class of functions which are Lipschitz continuous away from the change point. In indirect estimation problems of change points, including deconvolution and kink estimation, Goldenshluger et al. (2006), Goldenshluger et al. (2008a) and Goldenshluger et al.
(2008b) comprehensively studied optimal convergence rates over Sobolev-type classes in the white noise model, while Neumann (1997) considered a density deconvolution framework. Goldenshluger et al. (2006) construct their estimator of the change point in a deconvolution setting based on the zero-crossing-time technique. Cheng and Raimondo (2008) transfer this estimator to first-order (γ = 1) kink estimation on compact intervals, and focus on the construction of appropriate kernel functions. Wishart (2009), Wishart and Kulik (2010) and Wishart (2011) studied convergence rates of zero-crossing-time estimators under long-range dependent errors together with corresponding lower bounds.
The asymptotic distribution of change point estimates and the construction of confidence intervals are discussed in Müller (1992); Loader (1996) among others for change points, and in Müller (1992); Eubank and Speckman (1994); Mallik et al. (2013) for kink estimation, in the latter paper even with dependent errors. In recent years, for nonparametric estimation problems the concepts of honest and adaptive confidence sets have been developed and intensively studied, see e.g. Li (1989); Low (1997); Cai and Low (2004); Giné and Nickl (2010). Confidence sets are called honest if they keep the level asymptotically uniformly over the function class under consideration, while they are called adaptive if the width is of the order of the minimax rate of estimation, up to logarithmic terms. For kink estimation, however, honest and adaptive confidence sets have apparently not yet been studied.
Our contributions in the present paper are as follows. We use the zero-crossing-time technique from Goldenshluger et al. (2006) and Cheng and Raimondo (2008) to construct estimates of the location θ_f as well as of the size [f^{(γ)}] of the kink in model (1.1). We derive optimal convergence rates over Hölder smoothness classes instead of the Sobolev-type smoothness classes in Goldenshluger et al. (2006). The proof techniques for the upper bounds differ somewhat from those in Goldenshluger et al. (2006), since we make more explicit use of the zero-crossing-time property of the estimate of the location. The lower bounds require the construction of new hypothesis functions which belong to the Hölder smoothness class. Further, we show joint asymptotic normality of the estimates, uniformly over the function classes, which allows for the construction of honest confidence sets for a given smoothness parameter s. In contrast to change point estimation, for kink estimation no additional bias arises in case of a discrete design. Finally, following Giné and Nickl (2010), we construct, based on Lepski's method, joint and marginal confidence sets which have an adaptive length in s.

The estimator
In this section estimators for the location and the size of the kink are introduced. Recall the motivation from Goldenshluger et al. (2006): if f^{(γ)} has a jump at θ_f, a smoothed version of f^{(γ)} will have a large slope near θ_f, so that its first and second derivatives have a local maximum and a zero, respectively, near θ_f. Following Goldenshluger et al. (2006) and Cheng and Raimondo (2008), for an appropriate kernel K : R → R, specified in Assumption 2 below, and a bandwidth parameter h > 0 we introduce the probe functional (2.1), which we estimate in (2.2) using a Priestley-Chao-type estimator for the fixed, equidistant design x_{i,n}.

Assumption 1 (Errors). The ε_i = ε_{i,n} are centered, independent and identically distributed random variables with standard deviation σ > 0, and for any u > 0, P(|ε₁| > u) ≤ 2 exp(−2u²/σ_g²) for some σ_g ≥ σ. ♦

Assumption 2 (Kernel). For parameters γ, l ∈ N, suppose that the kernel K : R → R has support supp(K) = [−1, 1], is (γ + 5)-times differentiable inside its support and satisfies the following properties:
(i) K^{(j)}(−1) = K^{(j)}(1) = 0, j = 1, . . . , γ + 3,
(ii) K^{(1)} is an odd function, in particular K^{(1)}(0) = 0,
(iii) if l ≥ γ + 2 then ∫_{−1}^{1} x^m K^{(1)}(x) dx = 0 for m = 0, . . . , l − γ − 1,
(iv) there are 0 < q_* < q_l < 1 such that K^{(1)}(x) > 0 for x ∈ [−q_l, 0) and K^{(1)} has a unique global maximum at −q_*,
(v) for some x_* ∈ (0, 1) and c₂ > 0 the derivative K^{(1)} satisfies a lower bound with the kernel constant c₂, see Lemma A.1.

Remark (Discussion of Assumption 2). Assumption 2 is similar to but more restrictive than assumption C_{1,s} in Cheng and Raimondo (2008) or assumption 2 in Goldenshluger et al. (2006). In particular, Assumption 2, (v) determines the separation rate, see Lemma A.1. In Section B.1 we provide an explicit construction of kernels satisfying Assumption 2 for γ = 1 and given l, and indicate how to extend it to the case γ ≥ 2.

For the estimation of the location of the kink we proceed in two stages.
First, an interval is constructed which contains the kink with high probability. Second, the kink is estimated by a zero of the empirical probe functional inside this interval.
In the following we shall always impose Assumption 1 and assume that the regression function f in model (1.1) satisfies f ∈ F s as specified in Definition 2.1, and that the fixed kernel K satisfies Assumption 2 with parameters γ, l = s , and q * , q l and x * in (iv) and (v).
Given h > 0 we then define the points t_* and t^*. Here h₀ can be chosen uniformly over f ∈ F_s, depending only on the kernel K as well as on the Lipschitz constant L and the set Θ of F_s. The proof is given in Section 6.1. Lemma 2.2 motivates the two stages of the estimation procedure: first estimate the parameters t_* and t^*, second estimate the kink location θ_f as a zero of the empirical probe functional in the resulting interval. Thus, let t̂_* and t̂^* be given by (2.5), and define the estimator θ̂_{h,n} for the kink location by (2.6). In the proofs we will show that {t ∈ [t̂_*, t̂^*] : ψ̂_{h,n}(t) = 0} ≠ ∅ holds with high probability uniformly over F_s, and consequently only this part of (2.6) is asymptotically relevant.
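The two-stage procedure can be sketched as follows for γ = 1. This is a minimal illustration only: a Gaussian-derivative surrogate stands in for the compactly supported kernel K of Assumption 2, a simple grid search replaces the bracketing in (2.5), and all names and normalizations are ours, not the paper's.

```python
import numpy as np

def psi_hat(t, x, y, h):
    # Empirical probe functional (illustrative normalization): third
    # derivative of a Priestley-Chao smoother of the data, so that a kink
    # of order gamma = 1 produces a sign change (zero crossing) at theta_f.
    u = (x - t) / h
    k3 = (3.0 * u - u**3) * np.exp(-u**2 / 2.0)  # phi'''(u) up to a constant
    return (k3 * y).sum() / (len(x) * h**4)

def kink_location(x, y, h, grid_size=200):
    # Stage 1: bracket the kink between the extrema of the probe
    # functional, mimicking the interval from Lemma 2.2 (the grid is kept
    # away from the boundary to limit boundary effects of the smoother).
    grid = np.linspace(2 * h, 1 - 2 * h, grid_size)
    vals = np.array([psi_hat(t, x, y, h) for t in grid])
    lo, hi = sorted((grid[vals.argmax()], grid[vals.argmin()]))
    # Stage 2: zero-crossing time -- bisect on the sign of psi_hat.
    flo = psi_hat(lo, x, y, h)
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if psi_hat(mid, x, y, h) * flo > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
n = 2000
x = np.arange(1, n + 1) / n
y = np.abs(x - 0.4) + 0.05 * rng.standard_normal(n)  # kink of order 1 at 0.4
est = kink_location(x, y, h=0.1)
```

For the toy signal |x − 0.4| the smoothed third derivative changes sign at the kink, so the bisection recovers the location up to the noise level.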

Optimal rates of convergence
To express the uniformity in the parameter f, we introduce the following notation. Let η̂_{h,n} be random, depending on some h ∈ (0, h₀) and on the data in model (1.1) for the sample size n. We write P_f to stress the dependence of the distribution on the parameter f, and we write η̂_{h,n} = O_{P,F}(n^β h^α), α, β ∈ R, for uniform boundedness in probability. For two sequences (a_n) and (b_n) we write a_n ≅ b_n if C₁ ≤ |a_n/b_n| ≤ C₂ for n ≥ n₀ and some constants 0 < C₁ < C₂. In the following theorem we obtain convergence rates for the estimates of the location and the size of the kink.

Theorem 2.3. Consider model (1.1) and suppose that Assumption 1 as well as Assumption 2 with l = s hold true. Then there exist finite constants h₀, C > 0, depending only on the kernel K, on σ and σ_g in Assumption 1 as well as on L and Θ of F_s = F_s(γ, a, Θ, L), such that if h ∈ (0, h₀) and n are such that nh^{2γ+1} ≥ C log(1/h), then the bounds (2.9) and (2.10) hold, where the constants in the O-terms depend only on K, σ_g and L and can be chosen uniformly over a bounded range of values of s. Moreover, choosing h of order n^{−1/(2s+1)} we obtain the convergence rates n^{−(s−γ+1)/(2s+1)} for the location and n^{−(s−γ)/(2s+1)} for the size of the kink. The proof is provided in Section 6.2. The next theorem shows that these rates are indeed optimal.
where w metrizes convergence in probability and the infima are taken over all possible estimators of θ_f and of [f^{(γ)}], respectively.
The proof is provided in Section 6.3.
Remark (Convergence rates). The convergence rates for the location of the kink in Theorem 2.3 correspond to those in Theorem 1 in Goldenshluger et al. (2006) for their function class in Definition 2, which instead of Hölder smoothness requires an integrability condition on the Fourier transform for some m > 1. Indeed, their m − 1 (smoothness parameter) and β (degree of ill-posedness) correspond to s − (γ + 1) resp. γ in our setting, so that m = s − γ and β = γ transforms their rate n^{−(m+1)/(2m+2β+1)} into n^{−(s−γ+1)/(2s+1)}. Goldenshluger et al. (2008a) provide minimax rates for more conventional Sobolev-type classes and attain a rate which corresponds to n^{−(s−γ+1/2)/(2s)} in our setting. Since γ ≥ 1/2, this rate is inferior to ours, which is in line with the difference between the optimal rates of pointwise estimation for Sobolev- and Hölder-smoothness classes. Moreover, similar considerations for our kink-size estimate and the jump amplitude estimator in Goldenshluger et al. (2008a) lead to related observations concerning the optimal rates. Also note that the rate of convergence for the size of the kink corresponds to the minimax rate when estimating the γ-th derivative at a given point.
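The correspondence of rates claimed in the remark can be checked by direct substitution:

```latex
\frac{m+1}{2m+2\beta+1}\Big|_{\,m=s-\gamma,\ \beta=\gamma}
  = \frac{(s-\gamma)+1}{2(s-\gamma)+2\gamma+1}
  = \frac{s-\gamma+1}{2s+1}.
```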

Asymptotic normality
The next theorem establishes joint asymptotic normality of the estimates of the location and the size of the kink, (2.6) and (2.7), around their deterministic counterparts (2.4) and (2.8), respectively.
Theorem 3.1. If the bandwidth satisfies (3.1), then for x ∈ R² we have (3.2), where Φ₂ denotes the bivariate standard normal distribution function and the asymptotic standard deviations for the estimates of location and size of the kink are given by w̃^{loc}_n(h) and w̃^{size}_n(h) in (3.3). Here, ‖·‖₂ denotes the L₂-norm of a function on the interval [0, 1]. Section 6.4 is devoted to the proof of the theorem.

Remark (Undersmoothing). Using similar arguments as in Theorem 2.3 one obtains the bounds in (3.4). Thus, using undersmoothing, that is, choosing h ≅ n^{−1/(2s+1)} log(n)^ζ for some ζ < 0, the asymptotic normality in (3.2) even holds uniformly over F_s with θ_{h,f} replaced by θ_f and [f^{(γ)}]_h by [f^{(γ)}]. Hence, Theorem 3.1 can be directly used to construct honest confidence sets for the parameters (θ_f, [f^{(γ)}]) over F_s.

Adaptive confidence sets
We briefly recall the definitions of honest and adaptive confidence intervals, tailored to our framework, following Li (1989) resp. Cai and Low (2004). A sequence of intervals is called an
• honest confidence interval over the parameter space F if (3.5) holds;
• adaptive confidence interval over the parameter space F if for every s ∈ S and ε > 0 there exists some constant C = C(α) > 0, depending only on α ∈ (0, 1), such that (3.6) holds uniformly over f ∈ F_s, where r_n(s) equals (up to a logarithmic term) the minimax rate of estimating θ_f over F_s.
It is straightforward to extend the definitions to a bivariate confidence set for the parameter pair (θ f , [f (γ) ]). Low (1997) and Cai and Low (2004) showed that honest and adaptive pointwise confidence intervals over Hölder classes do not exist. In the context of confidence bands, Giné and Nickl (2010) showed that the construction of honest and adaptive confidence bands in density estimation becomes possible by slightly reducing the function classes.
We shall follow their lead and construct honest and adaptive confidence sets for the bivariate parameter consisting of location and size of the kink. To this end, let γ ∈ N and s̲, s̄ ∈ R₊ be such that γ + 1 ≤ s̲ < s̄. Choose integers k_{min,n} and k_{max,n} as in (3.7), set K_n = [k_{min,n}, k_{max,n}] ∩ N, and define the bandwidths h_k, k ∈ K_n, in (3.8).

Definition 3.2. Let k₀ ∈ N, 0 < b₁ < b₂, a, L > 0 and let Θ ⊂ (0, 1) be a compact set. Then define the classes F_s and F in (3.9), where the kernel of the probe functional in the bias condition (3.10) satisfies Assumption 2 with parameters γ and l = s̄ + 1. Given f ∈ F, let s_f be the unique value of s for which f fulfills the bias condition in (3.10).
For wavelet density estimation over Hölder classes, Giné and Nickl (2010) show that the minimax rates of estimation remain the same when reducing the function class in a similar fashion as in (3.9), see also the discussion in Bull (2012). In our present context, while a general result eludes us, in Section A.5 we show that the minimax rates over the reduced classes correspond to those over F_s at least for some values of the smoothness parameter s ∈ [s̲, s̄].
To construct confidence sets, we divide the sample Y 1 , . . . , Y n into the two parts S 1 = {Y 1 , Y 3 . . . , Y n−1 } and S 2 = {Y 2 , Y 4 , . . . , Y n } if n is even, and similarly if n is odd. In particular, the sizes n j = |S j | satisfy n j ∼ = n for j = 1, 2. Note that over the subsamples, (1.1) still holds with an equidistant design.
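The odd/even splitting described above can be written down directly; a small sketch (function and variable names are illustrative):

```python
import numpy as np

def split_sample(x, y):
    # S1 keeps the odd-indexed observations Y_1, Y_3, ..., S2 the
    # even-indexed ones Y_2, Y_4, ...; each subsample again has a
    # fixed equidistant design, now with spacing 2/n.
    return (x[0::2], y[0::2]), (x[1::2], y[1::2])

n = 100
x = np.arange(1, n + 1) / n
y = np.sin(x)
(s1x, s1y), (s2x, s2y) = split_sample(x, y)
```

Both subsamples have size n/2 and remain equidistant, which is what allows the results for model (1.1) to be applied to each half separately.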
We shall use S₂ for the selection of the bandwidth parameter h based on Lepski's method. For a sufficiently large constant C_Lep > 0 (specified in the proof of Lemma 6.8) we define the index k̂_n in (3.11). A central technical result, Lemma 6.8, states that for a function f ∈ F, h_{k̂_n} is of order (log(n₂)/n₂)^{1/(2s_f+1)} with high probability uniformly over f ∈ F.
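In the spirit of (3.11), a Lepski-type selection rule can be sketched as follows; the paper's exact criterion is not reproduced here, and all names (`lepski_bandwidth`, `estimate`, `noise_level`) are illustrative assumptions.

```python
def lepski_bandwidth(bandwidths, estimate, noise_level, c_lep):
    # Generic Lepski-type rule: accept the largest bandwidth whose
    # estimate agrees, up to c_lep times the stochastic error level,
    # with the estimates at all smaller bandwidths.
    # `bandwidths` must be sorted in decreasing order.
    ests = {h: estimate(h) for h in bandwidths}
    for i, h in enumerate(bandwidths):
        if all(abs(ests[h] - ests[g]) <= c_lep * noise_level(g)
               for g in bandwidths[i + 1:]):
            return h
    return bandwidths[-1]

# toy illustration: a bias term h**3 kicks in for the two largest bandwidths,
# so they disagree with the smaller ones and are rejected
bws = [0.4, 0.3, 0.2, 0.1, 0.05]
selected = lepski_bandwidth(
    bws,
    estimate=lambda h: 0.5 + (h**3 if h > 0.25 else 0.0),
    noise_level=lambda h: 0.01,
    c_lep=1.0,
)
```

The rule balances bias against stochastic error without knowing s, which is what makes the resulting bandwidth adaptive.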
We then employ undersmoothing, that is, we choose the bandwidth for the estimation of θ_f as h_{k̂_n+u_n} and for the estimation of [f^{(γ)}] as h_{k̂_n+v_n}, where the sequences u_n and v_n satisfy (3.12). Furthermore, let σ̂_{n₁} be an estimate of σ based on the sample S₁ which satisfies (3.13). The estimates in Hall et al. (1990) or Dette et al. (1998) fulfill this assumption. Consider the estimate ŵ^{loc} in (3.14) of the asymptotic standard deviation of the location of the kink in (3.3), and the estimate ŵ^{size} in (3.15) of the asymptotic standard deviation of the size of the kink. Given α ∈ (0, 1) let q_α(W) denote the α-quantile of W = max{|X₁|, |X₂|} for two independent standard normal random variables X₁ and X₂, and consider the rectangular confidence region in (3.16).

Theorem 3.3. Consider model (1.1) under Assumption 1 and the function class F in (3.9), and let K be a kernel satisfying Assumption 2 with γ and l = s̄ + 1. Then for any α ∈ (0, 1), (3.17) holds. Furthermore, there exists a finite constant C > 0 such that (3.18) holds. The proof is provided in Section 6.5.

Remark. 1. Equation (3.17) shows asymptotic honesty of the confidence sets as defined in (3.5), while the adaptivity of the confidence sets as defined in (3.6) is covered by (3.18).
2. The choice of the bandwidth (3.11) is only based on the estimate of the location of the kink, but is then also used for constructing the confidence set of the size. This is possible since the optimal bandwidth resolution is the same for both estimates, see Theorem 2.3.
3. Marginal adaptive confidence intervals for either θ_f or [f^{(γ)}] can be constructed in an analogous way, see (4.1).
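The quantile q_α(W) of W = max{|X₁|, |X₂|} has a closed form, since P(W ≤ q) = (2Φ(q) − 1)² for independent standard normals. A small sketch (the function name is ours):

```python
from statistics import NormalDist

def q_max_abs2(p):
    # p-quantile of W = max(|X1|, |X2|), X1, X2 independent N(0,1):
    # P(W <= q) = (2*Phi(q) - 1)**2 = p  =>  q = Phi^{-1}((1 + sqrt(p)) / 2)
    return NormalDist().inv_cdf((1.0 + p**0.5) / 2.0)

# the joint rectangle at level 0.95 needs a larger critical value than the
# marginal two-sided one (a Sidak-type correction for two coordinates)
q_joint = q_max_abs2(0.95)
q_marginal = NormalDist().inv_cdf(0.975)
```

This makes explicit why the rectangular region (3.16) uses a slightly larger critical value than each marginal interval in (4.1).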

Simulations and real data illustration
In this section we investigate the finite-sample properties of the confidence sets in (3.16) as well as of the following marginal confidence intervals for θ_f,

Ĉ^{loc}_n(α) = [ θ̂_{h_{k̂_n+u_n},n₁} − ŵ^{loc} q_{1−α/2}(N(0, 1)), θ̂_{h_{k̂_n+u_n},n₁} + ŵ^{loc} q_{1−α/2}(N(0, 1)) ].   (4.1)

The kernel K is chosen as in Section B.1 with γ = 1 and l = 2, so that (4.2) holds. Figure 1 illustrates the first three derivatives of K. Subsection 4.1 gives detailed numerical illustrations of our methods, while Subsection 4.2 contains a real data illustration to a series of global surface temperatures. A comparison of marginal confidence intervals for the kink location with the method of Mallik et al. (2013) is presented in the appendix, Section B.2.

Numerical experiments
We consider the following two regression functions f₁ and f₂, given in (4.3). Here, f₁ has a kink at θ_n of size [f₁^{(1)}] = −2 with infinite smoothness s outside the kink. The offset 1/(3n) is chosen so that the kink is not located on a design point x_k = k/n.
The regression function f₂ in (4.3) is defined as the absolute value of the second anti-derivative of the Weierstraß function f̃_{c₁,c₂} with vanishing affine-linear part. By Hölder continuity of the Weierstraß function with exponent 0 < −log(c₁)/log(c₂) < 1 we have that s = 2 − log(c₁)/log(c₂) as well as [f₂^{(1)}] ≈ 9/4. We use skew normal errors with shape parameter −3, and choose the location parameter ζ ∈ R and the scale parameter ω > 0 so that E[ε] = 0 and E[ε²] = σ² = 0.2². Figure 2 displays the regression functions together with samples of size n = 100. For the Lepski scheme we use a grid inside the intervals [h_{min,n}, h_{max,n}] as specified in Table 1, and choose the Lepski constant in (3.11) as C_Lep = 0.08 for f₁ and as C_Lep = 0.001 for f₂. Simulations are based on m = 10000 repetitions.

First we investigate the accuracy of our estimates in terms of the root mean squared error (RMSE) for the sample sizes n ∈ {500, 1000, 2000, 4000, 8000}. The results are summarized in Table 2. As can be expected from the rates in Theorem 2.3, estimates of the location of the kink are more precise than those of its size. Further, in particular for f₂, the estimate of the size converges slowly. Next we investigate the confidence intervals for the location of the kink in (4.1) as well as the joint confidence sets for location and size in (3.16) in terms of coverage and average length. The results are displayed in Tables 3 and 4. For the location of the kink in Table 3, coverage is already satisfactory for both regression functions for a sample of size n = 500. In contrast, for the joint confidence sets, the coverage is clearly below the nominal level for sample sizes n = 500 and n = 1000, in particular for f₂. Moreover, the interval for the size of the kink is rather wide for these values of the sample size, even including zero. For larger sample sizes, the performance improves notably.
Finally, we investigate the ratio of the empirical bias and the empirical standard deviation. In contrast to change point estimation with fixed design, for kink estimation the discretization bias is asymptotically negligible, which can also be seen numerically in Table 5. Note that these findings also indicate the undersmoothing effect of the sequences u n and v n .
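The centered skew-normal errors used in the simulations above can be generated from the standard stochastic representation of the skew-normal family; the following sketch uses the first two moments of that family (function name ours) to match E[ε] = 0 and E[ε²] = σ².

```python
import numpy as np

def centered_skew_normal(n, sigma, shape, rng):
    # SN(zeta, omega, shape) errors with zeta and omega chosen via the
    # mean and variance formulas of the skew-normal family so that the
    # mean is 0 and the standard deviation is sigma.
    delta = shape / np.sqrt(1.0 + shape**2)
    omega = sigma / np.sqrt(1.0 - 2.0 * delta**2 / np.pi)
    zeta = -omega * delta * np.sqrt(2.0 / np.pi)
    # stochastic representation: delta*|Z0| + sqrt(1-delta^2)*Z1 ~ SN(0, 1, shape)
    z0, z1 = rng.standard_normal(n), rng.standard_normal(n)
    return zeta + omega * (delta * np.abs(z0) + np.sqrt(1.0 - delta**2) * z1)

eps = centered_skew_normal(100_000, sigma=0.2, shape=-3.0,
                           rng=np.random.default_rng(1))
```

With shape −3 the errors are left-skewed, matching the simulation setup with σ = 0.2.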

Illustration to series of global surface temperature
We illustrate our method in an application to a series of changes in annual global surface temperature in degrees Celsius from 1880 to 2017, relative to the average temperature for 1951–1980, see Figure 3. The series is available at https://data.giss.nasa.gov/gistemp, where further details on the data are provided. We used a grid of bandwidths inside the range [0.05, 0.3].

Discussion
In this paper we suggest a method to construct confidence intervals for the kink location and kink size over Hölder classes without requiring a shape restriction on the regression function as in Mallik et al. (2013). The rate of convergence derived in this paper is achieved over a more conventional Hölder class which allows for local higher-order smoothness, in contrast to the nonstandard smoothness class in Goldenshluger et al. (2006) based on integrability of the Fourier transform.
Recently, there has been considerable work on change point detection and segmentation methods (Frick et al., 2014; Haynes et al., 2017). Analogous results for kink detection and corresponding segmentation algorithms would be of considerable applied interest, for example in environmental or pharmacological studies, in which the function of interest can reveal changes of trend through kinks rather than jumps.

Proofs
We consider model (1.1) and impose the Assumptions 1 and 2 throughout this section. We shall say that a sequence of events A_n holds with high probability uniformly over F_s if sup_{f∈F_s} P_f(A_n^c) → 0 as n → ∞.

Properties of the probe functional and first stage estimates
The proofs of the lemmas in this section are provided in Section A.1 in the appendix. Recall the definition (2.1) of the probe functional ψ_{h,f}(t).

Proof of Lemma 2.2. We have for L_{h,0}(t) in (6.1), due to Assumption 2, (iv), where [f^{(γ)}] > 0 is as in Definition 2.1, that L_{h,0}(t_*) and L_{h,0}(t^*) are of opposite signs, since K^{(1)} is odd by Assumption 2, (ii). Therefore, from (6.1) we have for sufficiently small h₀ > 0, depending only on K, L and Θ, that min{|ψ_{h,f}(t_*)|, |ψ_{h,f}(t^*)|} > 0 and that ψ_{h,f}(t_*) and ψ_{h,f}(t^*) are of opposite signs as well. The assertion follows from the continuity of the probe functional ψ_{h,f}(t).
Next, we bound the deviation of the empirical probe functional from its population counterpart, as well as the corresponding deviations of their derivatives.
Lemma 6.2. For the probe functional (2.1) and its empirical version (2.2), there exists h₀ > 0 such that for any h ∈ (0, h₀) and n ∈ N, statements (i) and (ii) hold for j = 0, 1, 2. The constants in the O-terms and the constant h₀ depend only on the kernel K, the parameter σ_g in Assumption 1 as well as on the Lipschitz constant L of F_s as in Definition 2.1.
The discretization error contained in the remainder term in the lemma thus has the rate (nh γ+1+j ) −1 uniformly in f ∈ F s and t ∈ [0, 1]. Finally, we bound the variance of the empirical probe functional and its derivatives.
The constant in the O-term depends only on the kernel K and on the Lipschitz constant L.
Next we further investigate the first stage of the zero-crossing-time technique.
Lemma 6.4. There exist finite constants h 0 , C > 0 depending on K, σ, σ g as well as on L, Θ and s of F s , such that if h ∈ (0, h 0 ) and n are such that

Moreover, C can be chosen uniformly over a bounded range of values of s, while
h 0 is independent of s.

Rates of convergence: Proof of Theorem 2.3
In the following we shall restrict to the event that ψ̂_{h,n}(θ̂_{h,n}) = 0, which by Lemma 6.4 is fulfilled with high probability uniformly over F_s. By Taylor expansion of ψ̂_{h,n} at θ_f we have, for some θ̃ between θ_f and θ̂_{h,n}, that

0 = ψ̂_{h,n}(θ̂_{h,n}) = ψ̂_{h,n}(θ_f) + ψ̂^{(1)}_{h,n}(θ̃) (θ̂_{h,n} − θ_f),  that is,  θ̂_{h,n} − θ_f = − ψ̂_{h,n}(θ_f) / ψ̂^{(1)}_{h,n}(θ̃).   (6.3)

Asymptotics of the scale terms
The following lemma establishes the asymptotic behavior of the denominator in (6.3).

Lemma 6.5. Under the assumptions of Theorem 2.3, one has
The proof is given in Section A.2.
Proof of Theorem 2.3. Convergence rate for the location of the kink. To prove the statement for θ̂_{h,n}, consider the right-hand side of (6.3). Since K^{(2)}(0) ≠ 0, Lemma 6.5 implies that the denominator equals a nonzero constant plus a term of order o_{P,F_s}(1), uniformly over F_s. For the numerator, Lemma 6.2, (i), in combination with the representation (6.1) for j = 0 in Lemma 6.1, yields, for a suitable choice of h₀ (depending only on K, σ_g and L), the assertion for θ̂_{h,n}. Note that the constants in the O-terms depend only on K, σ_g and L as well as on s, and these constants can be chosen continuously in s, see Lemmas 6.1, 6.2 and 6.3.

Convergence rate of the size of the kink
where θ̃ is some value between θ_f and θ̂_{h,n}. It follows from (6.1) for j = 1 that the population counterpart admits an analogous representation; subtracting this from (6.4) leads to (6.6). Now, Chebyshev's inequality and Lemma 6.3 for j = 1 show that the first term on the right-hand side in (6.6) is of the stated order. Concerning the second term, we use the convergence rate (2.9) of the location of the kink. Note that the constants in the O-terms depend only on K, σ_g, L and on s, and these constants can be chosen uniformly over a bounded range of values of s, see Lemmas 6.1, 6.2 and 6.3.

Lower bounds: Proof of Theorem 2.4
Proof of Theorem 2.4. We shall use the method of two hypotheses, see Theorem 2.2 in Tsybakov (2009). Fix some θ₀ ∈ int(Θ), and introduce the function v₀. Let us start with (2.11). We set f₀ in terms of v₀; then f₀ ∈ F_s(γ, a, Θ, L) for any values of s ≥ γ + 1 and L ≥ 0.
As for the sequence of alternative hypotheses, let θ₁ = θ₀ + r_n ∈ Θ, where r_n ↓ 0 will be chosen below, and consider the function v_n in (6.10). Here, Φ is a smooth kernel of order γ with support [−1, 1] (Tsybakov, 2009, p. 5), the sequence b_n ↓ 0 remains to be selected, and in (6.10) we extend the definition of v₀ from (θ₁, 1] to (θ₁, 1 + b_n]. We check the conditions (i) and (iia). Since the integral in (6.11) ranges at most over two intervals of length at most 2, by the support of Φ, it follows that, choosing r_n ≅ b_n^{s−γ+1} in (6.14), the function v_n belongs to the required class. Concerning the Kullback-Leibler divergence, note that the distribution P_j of Y₁, . . . , Y_n with respect to f_j has a density with respect to the Lebesgue measure on R^n, where φ_σ denotes the normal density with standard deviation σ > 0. Since v₀ is a polynomial of degree ≤ γ away from θ₀ and θ₁, and since the kernel Φ of order γ reproduces polynomials of degree ≤ γ, we have v_n(x) = v₀(x) outside of b_n-neighborhoods of θ₀ and θ₁. Inside these neighborhoods, which contain of the order nb_n points, a Taylor expansion up to order γ − 1 and (6.12) bound the divergence (Tsybakov, 2009, p. 6) under the choice (6.14). Choosing b_n ≅ n^{−1/(2s+1)}, Theorem 2.2 in Tsybakov (2009) implies that the minimax lower bound over the functional class F_s is of the asserted order.

Next, we verify (2.12). Let a₁ > a be such that a₁ − a = r̃_n ↓ 0, which remains to be specified. As hypothesis functions we set f₀ and f₁, where Φ is, as above, a smooth kernel of order γ with support [−1, 1], b_n ↓ 0, and we extend the definition of R_γ(y; a₁ − a) to [θ₀, 1 + b_n]. Then, using (6.13) in the last step, f₁ ∈ F_s(γ, a, Θ, L) under the choice of r̃_n in (6.16). Outside of a b_n-neighborhood of θ₀ we have R_γ(x; a₁ − a) = ṽ_n(x).
By Taylor expansion up to order γ − 1, using the Lipschitz continuity of R_γ(x; a₁ − a)^{(γ−1)} with Lipschitz constant a₁ − a = r̃_n, we obtain the corresponding bound. Hence, the Kullback-Leibler divergence between P₁ and P₀ is of the required order under the choice (6.16). Inserting b_n ≅ n^{−1/(2s+1)} in (6.16) gives the result.

Asymptotic normality: Proof of Theorem 3.1
Subtracting (6.17) from (6.3) and dividing by w̃^{loc}_n(h) as defined in (3.3), we get the expansion (6.18) for θ̂_{h,n}, with score term (6.19) and remainder term (6.20), where θ̃ is between θ_{h,f} and θ_f. Subtracting (6.21) from (6.4) and dividing by w̃^{size}_n(h) leads to (6.22), where due to the definition of w̃^{size}_n in (3.3) we have (6.23) and (6.24). The following lemma shows the negligibility of the remainder terms in (6.20) resp. (6.24).

Lemma 6.6. Under the assumptions of Theorem 3.1, the remainder terms in (6.20) and (6.24) are asymptotically negligible, uniformly over F_s.

The next lemma shows the joint asymptotic normality of the scores in (6.19) and (6.23).
Lemma 6.7. Suppose the assumptions of Theorem 3.1 are fulfilled. Then, for any x ∈ R², the joint convergence of the scores holds uniformly over f ∈ F_s. The proofs of the lemmas are provided in Section A.2.
Proof of Theorem 3.1. Deduce from Lemma 6.5 the asymptotic behavior of the scale terms. By combining Lemma 6.7 with the uniform Slutsky theorem, Theorem D.3 in Section D, we obtain the joint limit for the normalized estimates. With this, we conclude the proof in view of (6.18) and (6.22), Lemma 6.6 and the uniform Slutsky theorem D.3.

Adaptive confidence sets: Proof of Theorem 3.3
From the first term in the expansion (6.3) and Lemma 6.5, the condition for f ∈ F_s in (3.10) can be written as a bias condition, where for appropriate constants C_{b1} > 0 and 0 < C_{b2} < aK^{(2)}(0) the bounds are set accordingly. Recall the definitions of K_n resp. h_k in (3.7) resp. (3.8), and introduce k*_n(s) for s ∈ [s̲, s̄]. The next key lemma shows that h_{k*_n(s)} is of optimal order and that, for f ∈ F, k*_n(s_f) is essentially selected by the Lepski choice k̂_n.
Lemma 6.8. We have the expansion (6.27). If C_Lep > 0 is chosen large enough, depending only on K, σ, σ_g as well as on L and Θ of F, and if n is sufficiently large such that k*_n(s) ≥ 2, then there exists a ρ ∈ N depending only on b₁, b₂, s and on C_Lep such that, with high probability uniformly over F, the Lepski choice k̂_n lies in {k*_n(s_f) − ρ, . . . , k*_n(s_f)}, where s_f is given in Definition 3.2.
The proof of the lemma is deferred to the appendix, Section A.4.
Adaptive coverage: Proof of (3.17)

Lemma 6.9. Let ρ be as in Lemma 6.8. Then the joint asymptotic normality of Theorem 3.1 holds for the bandwidths h_{k*−j+u_n} and h_{k*−j+v_n}, j ∈ {0, . . . , ρ}, where we abbreviate k* = k*_n(s_f), and w̃^{loc}_{n₁}(h) and w̃^{size}_{n₁}(h) are defined in (3.3).

Proof of Lemma 6.9. In the proof we write n for n₁, as only the subsample S₁ is involved. Given s ∈ [s̲, s̄] consider f ∈ F_s, so that k* = k*_n(s). To show that the sequences h_{k*−j+u_n} and h_{k*−j+v_n}, j ∈ {0, . . . , ρ}, satisfy the bandwidth conditions of Theorem 3.1 in (3.1), it suffices to consider j = 0. Then (6.28) holds, due to the choice of u_n in (3.12) and the expansion (6.27) of h_{k*}, both of which hold true in terms of n (that is, n₁). Therefore, since s ≥ s̲ ≥ γ + 1, the first part of (3.1) follows, and similarly one verifies the second part of (3.1) concerning nh^{2γ+1}. Passing from u_n to v_n only changes the value of ζ, which is not relevant in the above analysis. The lemma then follows from Theorem 3.1 together with the uniform continuous mapping theorem, Theorem D.1 in Section D.
We shall denote by w̃^{loc}_{k*} and w̃^{size}_{k*} the quantities defined in (6.29); compare ŵ^{loc} in (3.14) and ŵ^{size} in (3.15), where again we abbreviate k* = k*_n(s_f).
Proof of Lemma 6.10. Given s ∈ [s̲, s̄] consider f ∈ F_s, so that k* = k*_n(s). It will suffice to consider j = 0. For the second term, the definition of w̃^{loc}_{k*} in (6.29) and (3.3) together with (3.4) yields the required bound, where we insert (6.28) in the last step. For the first term, since (6.28) implies undersmoothing, the bound follows from (2.9) in Theorem 2.3, by using Theorem 2.3, (2.10), the assumption (3.13) on σ̂_{n₁} together with the triangle inequality. This proves the first part of the lemma. For the second, we note that the corresponding bandwidth is still an undersmoothing bandwidth. The argument then proceeds analogously.
Proof of (3.17). Let ρ be as in Lemma 6.8; given s ∈ [s̲, s̄] consider f ∈ F_s, and let k* = k*_n(s). Then, from the definition of the confidence sets in (3.16) and Lemma 6.8, the coverage probability can be bounded from below, where the second step follows from Lemma 6.10 and the final step from sample splitting and the independence of the subsamples. The claim then follows from Lemma 6.9. This concludes the proof of (3.17).

Adaptive length: Proof of (3.18).
From the definition of ŵ^{loc} in (3.14) and of ŵ^{size} in (3.15) and the consistency of [f^{(γ)}]_{h_{k̂_n},n₁}, it suffices to show that there exists a constant C̃ > 0 such that the two bounds in (6.31) hold. As for the first display, consider s ∈ [s̲, s̄] and f ∈ F_s, and set k* = k*_n(s). From (6.28), (n h^{2γ−1}_{k*+u_n})^{−1/2} ≤ C̃₁ (log(n)/n)^{(s−γ+1)/(2s+1)} for some constant C̃₁. Moreover, from Lemma 6.8, a corresponding bound with some C̃₂ holds for the selected bandwidth. This implies the first display in (6.31) with C̃ = C̃₂C̃₁. As for the second, from (6.30) there is a C̃₃, and a C̃₄, such that the analogous bounds hold. The second display is then clear for C̃ = C̃₃C̃₄.
Thus, by using a Taylor expansion of g_{f,γ} around t and (A.1), we also obtain, by Hölder smoothness of g_{f,γ}, the corresponding bound. Note that all the constants in the O-terms depend only on K, L as well as on s, where the constants are continuous in s, due to the remainder term in the Taylor expansion.

Proof of Lemma 6.2. (i). It holds for
where R_n(t, h) is an error term of order O_{f∈F_s, t∈[0,1]}((nh^{γ+1+j})^{−1}), due to the Riemann-sum approximation. Indeed, let B(n, h) denote the index set for which the sum in the latter display is not zero. Due to the equidistant design in model (1.1) and the support of K in Assumption 2 it holds that |B(n, h)| ≤ 2nh. Here C_{L,K} > 0 is the Lipschitz constant of the product of f and K^{(γ+2+j)} which, by definition of F_s, can be chosen uniformly in f ∈ F_s and depending only on K and L. This concludes (i).
(ii). Consider the quantity in the display. Then Lemma C.3 in Section C implies that there is a constant C > 0 depending only on K and σ g such that the stated bound holds, provided h 0 > 0 is chosen appropriately (depending only on K and σ g ). The claim follows by Markov's inequality. Note that all the constants in the O-terms depend only on K, L and σ g .
Proof of Lemma 6.3. We compute the stated expression, where the order of the discretization error is derived as in the proof of Lemma 6.2, (i).
Before turning to Lemma 6.4, we require additional technical results. The next lemma is an adaptation of Lemma 2 in Goldenshluger et al. (2006), compare also to Lemma 1 in Cheng and Raimondo (2008).
Moreover, the constants C i depend only on the kernel K as well as on the Lipschitz constant L and the smoothness parameter s of F s as in Definition 2.1, where these constants can be chosen uniformly over a bounded range of values of s.

Proof of Lemma A.1.
where c 2 is the kernel constant in Assumption 2, (v). From the assumption δ ≥ C 3 h s−γ+1 , (6.1) and the choice of C 1 it follows that

and in fact inf
for C 2 := c 2 C 3 − 2 C 1 > 0, which holds for sufficiently large C 3 . Note that all the constants depend only on K, L as well as s and are continuous in s, due to Lemma 6.1.
Lemma A.2. There are finite constants C, C 1 , C 2 , h 0 > 0 which depend only on K, σ, σ g and on the Lipschitz constant L of F s , such that if h ∈ (0, h 0 ), n ∈ N and ζ n > 0 satisfy the stated condition, then the conclusion below holds.
Proof of Lemma A.2. Choosing h 0 small enough (depending only on K and L), we obtain by Lemma 6.2, (i), for any t ∈ [0, 1] that |R n (t; h)| ≤ C/√(nh), where C > 0 depends only on K and L. Thus, for an appropriate choice of the constants, the requirements of Lemma C.2 are met and the claim follows.
Lemma A.3. There are finite constants h 0 , C 1 , C 2 > 0 depending only on K, σ, σ g as well as on the Lipschitz constant L, the set Θ and the smoothness parameter s of F s , such that if h ∈ (0, h 0 ) and n ∈ N are such that nh^{2γ+1} ≥ C 1 log(1/h), then the stated bound holds. Moreover, C 1 , C 2 can be chosen uniformly over a bounded range of values of s, while h 0 can be chosen independently of s. In particular,

Proof of Lemma A.3.
We only show the bound for P f (|t̂* − t*| > qh/2); the other inequality can be derived analogously.
(i). If [f^{(γ)}] > 0, then by (6.1) for j = 0 it holds for sufficiently small h 0 (depending on Θ) that ψ h,f (t_*) < 0, and in this case t̂_* = arg min_t ψ̂ h,n (t). From Lemma 6.1, we obtain for h 0 small enough (depending on Θ) that for constants C̃ i > 0, depending only on K, L and s, the stated chain of inequalities holds, where the second inequality follows by substitution and properties of K^{(1)}, and the last inequality is due to Lemma A.1, by choosing h 0 appropriately (depending on Θ). Using Lemma A.2, for sufficiently small h 0 (depending on K, σ, σ g ) and an appropriate choice of C 1 in the assumption, there exists a constant C 2 > 0 such that the bound holds. Note that C 1 and C 2 can be chosen depending only on K, σ, σ g , L as well as s, and continuously in s, due to Lemmas 6.1 and A.1, while the choice of h 0 is independent of s. (ii). If [f^{(γ)}] < 0, then by (6.1) for j = 0 it holds for sufficiently small h 0 (depending on Θ) that ψ h,f (t^*) > 0, and in this case t̂^* = arg max_t ψ̂ h,n (t). We conclude with similar arguments as in case (i).
Proof of Lemma 6.4. Let h 0 > 0 be so small that Lemmas 2.2, 6.1, A.2 and A.3 apply. By choosing ζ n = 1/ log(1/h) in Lemma A.2, the first term tends to zero. By (6.1) it follows that ψ h,f is Lipschitz continuous with a constant of order h −1 . Hence the second term tends to zero by Lemma A.3. Since we consider the case ψ h,f (t_*) < 0 (compare (6.1) for j = 0), the last term tends to zero as δ → 0.
Hence ψ̂ h,n (t_*) < 0 and, similarly, ψ̂ h,n (t^*) > 0 with high probability, and the continuity of ψ̂ h,n implies statement 1., since all estimates hold uniformly over f ∈ F s . Statement 2. follows since the distance between t_* and t^* is exactly of order h by definition (see (2.3)), and both t̂_* − t_* and t̂^* − t^* are of order o P,Fs (h) by Lemma A.3. Finally, since θ f is at a distance of order h from both t_* and t^*, statement 3. also follows from Lemma A.3.
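The zero-crossing idea behind the estimators can be illustrated with a toy simulation. The sketch below is not the paper's exact statistic: it smooths data with a first-order kink using the third derivative of the Gaussian kernel (an odd function), so that the smoothed process changes sign at the kink, and reads off the sign change between the two extrema; the regression function, bandwidth and noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, sigma = 2000, 0.1, 0.05
x = np.arange(1, n + 1) / n
f = np.where(x > 0.5, 2.0 * (x - 0.5), 0.0)     # kink of order gamma = 1 at theta = 0.5
y = f + sigma * rng.standard_normal(n)

def K3(u):
    # third derivative of the Gaussian density: an odd function, so the
    # smoothed process below crosses zero at the kink location
    return (3.0 * u - u**3) * np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

t_grid = np.linspace(0.3, 0.7, 801)
psi = np.array([np.mean(y * K3((x - t) / h)) for t in t_grid])

# zero-crossing time: first sign change of psi between its two extrema
i0, i1 = int(np.argmin(psi)), int(np.argmax(psi))
lo, hi = min(i0, i1), max(i0, i1)
sign_change = np.flatnonzero(np.diff(np.sign(psi[lo:hi + 1])) != 0)
theta_hat = t_grid[lo + sign_change[0]]
print(theta_hat)  # close to the true kink location 0.5
```

The smoothed process is negative to the left of the kink and positive to the right, so its zero crossing localizes θ even at this moderate noise level.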

A.2. Proofs of auxiliary results in Section 6.2 and 6.4
Notation. We extend our notation: for μ ∈ R d and a positive semi-definite matrix Σ, let N d (μ, Σ) denote the d-dimensional normal distribution with mean μ and covariance matrix Σ. In this section, || · || 2 denotes both the Euclidean norm on R d and the L 2 -norm on the space of square-integrable functions.

Consistency of the kink-location estimate
By the construction of t_* and t^*, it holds that |θ f − θ h,f | = O Fs (h); see Lemma 2.2. In addition, Lemma 6.4 implies |θ f − θ̂ h,n | = O P,Fs (h). However, to analyze the term (6.3) in the proof of Theorem 2.3, we need the following lemma, which ensures a faster rate of convergence.

Lemma A.4. It holds that
For the proof of Lemma A.4, we need the following consistency result, which is an extension of Theorem 5.9 in van der Vaart (2000). Proof of Proposition A.5. With (A.2) and the definition of ŵ h,n it follows that for any δ > 0 the stated bound holds. Given ε > 0, choose η > 0 as the left-hand side of (A.3). Note that for the zeros ŵ h,n of ψ̂ h,n over Θ̄ h and the zeros θ̂ h,n of ψ̂ h,n the stated relation holds. Then, by (6.1) for j = 0 in Lemma 6.1, and further, for any ε ∈ (0, x * ), Assumption 2, (v), yields the required bounds. Apply Proposition A.5, whose assumptions we have derived in the latter two displays by setting ĝ n = ψ̂ h,n and g f = ψ f , to obtain the claim. The assertion for θ h,f follows analogously by noting that Proposition A.5 also holds for non-random functions ĝ n and deterministic ŵ h,n .
Proof of Lemma 6.5. The following lemma immediately implies Lemma 6.5, due to Lemma A.4.
Lemma A.6. Let θ̂, θ ∈ Θ, where θ̂ is random and θ is non-random. Then there exists an h 0 > 0, depending only on K, σ, σ g , L and Θ, such that if h ∈ (0, h 0 ) and n ∈ N, it holds for j = 0, 1, 2 that the stated expansions are valid. Moreover, the constants in the O-terms depend only on the kernel K as well as on the Lipschitz constant L and the smoothness parameter s of F s as in Definition 2.1, where the constants are continuous in s.
Proof of Lemma A.6. Choosing h 0 appropriately, Lemma 6.1 implies

Now, by the mean value theorem
which yields the first assertion. Similarly, by Lemma 6.1 and equation (6.2), for a suitable choice of h 0 we obtain the next bound. Now, using a similar argument as before with the mean value theorem, it follows that the claim holds, which concludes the proof. Note that the constants in the O-terms depend only on K, σ g , L as well as s, and these constants can be chosen continuously in s; see Lemma 6.1.

Negligibility of the remainder terms: proof of Lemma 6.6
Proof of Lemma 6.6. Let us start with R 1 (n, h) as defined in (6.20). Note that the second factor in (6.20) is a constant, so we only need to investigate the first factor, using (6.2) for j = 1. Next, recall Theorem 2.3 as well as (3.4), which imply the stated bound, where θ̂ and θ̃ are as in (6.3) resp. (6.17). Therefore, by (6.1) for j = 1, we deduce the corresponding estimate, and in summary, by the triangle inequality, the numerator in R 1 (n, h) is of the stated order by Lemma 6.1 and the assumption on the asymptotics of h and n in Theorem 3.1. Finally, R 1 (n, h) is o P,Fs (1), as the denominator is asymptotically a nonzero constant by Lemma 6.5 resp. Lemma A.6 and Assumption 2, (ii).
Similarly, the terms in R 2 (n, h) are o Fs (1) resp. o P,Fs (1). To see this, we only analyze the numerators in R 2 (n, h), as the denominators are both constant. Due to Lemma A.6 and Theorem 2.3, the first term in R 2 (n, h) is of the same order as in (A.4). A similar argument shows that the second term in R 2 (n, h) is o Fs (1).
Asymptotic normality of the score vector: proof of Lemma 6.7
Proof of Lemma 6.7. By Lemma 6.2, (i), for j = 0, 1, respectively, we obtain the stated approximations, due to the assumed asymptotics of h and n. By Theorem D.2, the terms ψ̂^{(j)}_{h,n}(θ f ) and E^{(j)}_n(f) have the same asymptotic limit distribution (provided it exists and satisfies the assumption of Theorem D.2) for j = 0, 1, respectively. Hence, we show for any x ∈ R 2 the convergence (A.5), which concludes the proof. Note that E n (f) depends on f only through θ f , which, by the definition of F s , is an element of Θ, a parameter of F s . In order to prove (A.5), we use the uniform version of the Lindeberg–Feller theorem, Theorem D.4, which can be applied since Φ 2 (·) does not depend on F s and therefore (Φ 2 (·)) f ∈Fs fulfills the assumptions of that theorem. Thus, we compute the asymptotic covariance matrix of E^{(0)}_n(f) and E^{(1)}_n(f). By means of Lemma 6.3 for j = 0, 1, respectively, we deduce the stated limits. With a Riemann-sum approximation, in a similar fashion as in the proof of Lemma 6.2, the latter term equals the displayed expression, where the last equation holds since the function x → K^{(γ+2)}(x) K^{(γ+3)}(x) is odd by Assumption 2, (ii). Thus the covariance converges, uniformly over F s . Next, for any δ > 0, we verify the Lindeberg condition, where || · || 2 denotes the Euclidean norm. Further, the computation of the covariance matrix has shown the required convergence, uniformly in F s . Hence, the former two displays lead us to the conclusion.
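The oddness argument used in the covariance computation is easy to check numerically. The sketch below uses the Gaussian density and γ = 1 as illustrative choices: the product of consecutive kernel derivatives then has odd parity and integrates to zero over a symmetric range.

```python
import numpy as np

# For a symmetric kernel K, the derivative K^(j) is even for even j and odd
# for odd j, so K^(gamma+2) * K^(gamma+3) is an odd function and integrates
# to zero. Illustration with the Gaussian density and gamma = 1.
phi = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
d3 = lambda u: (3.0 * u - u**3) * phi(u)           # phi''' : odd
d4 = lambda u: (u**4 - 6.0 * u**2 + 3.0) * phi(u)  # phi'''' : even

u = np.linspace(-8.0, 8.0, 400001)                 # symmetric grid around 0
integral = np.sum(d3(u) * d4(u)) * (u[1] - u[0])   # simple Riemann sum
print(abs(integral))  # numerically zero
```

On a grid symmetric about the origin the positive and negative contributions cancel, confirming that this cross term vanishes in the asymptotic covariance matrix.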

A.3. An exponential inequality
The following exponential concentration inequality for the estimator of the location of the kink will be important for the construction of adaptive confidence sets.
Lemma A.7. Let C̃ > 0 be some finite constant and q ∈ (0, x * ), where x * is as in Assumption 2, (v). There exist finite constants C, C 1 , C 2 , h 0 > 0 which depend only on K, σ, σ g as well as on L, Θ and s of F s , such that if h ∈ (0, h 0 ), n ∈ N and λ n > 0 satisfy the stated condition and C̃λ n /2 < qh, then the stated bound holds, where τ ∈ (0, 1). Moreover, C, C 1 , C 2 can be chosen uniformly over a bounded range of values of s, while h 0 is independent of s.
Proof of Lemma A.7. Define the event Ω as follows. Then Lemma A.3 implies, for sufficiently small h 0 (depending only on K, σ, σ g , L and on Θ), the stated bound, where C̃ 1 > 0 is some finite constant uniform for F s (depending only on K, σ, σ g , L and on Θ). Let δ n = (1 − τ)C̃λ n /2; then on the event Ω the event in question can be rewritten, and hence it is contained in the displayed event. With Lemma A.1, we derive for an appropriate choice of h 0 (depending only on Θ) a lower bound on the infimum, for some constant C̃ 2 > 0 which depends only on K, L as well as s and is continuous in s. Thus, by means of Lemma A.2, for an appropriate choice of the constants in the claim, we obtain a bound for some finite constant C̃ 3 > 0 depending only on K, σ, σ g , L as well as s and continuous in s. By assumption, C̃λ n /2 < qh, so that one can find a suitable constant C 2 > 0 (depending only on K, σ, σ g , L and on Θ) such that, with (A.7) and (A.8), the first claim follows in view of (A.6). Finally, note that the choice of h 0 did not depend on s, and furthermore C, C 1 , C 2 were chosen depending only on K, σ, σ g , L as well as s and continuously in s, due to Lemmas 6.1 and A.1.

A.4. Proof of Lemma 6.8
Before turning to Lemma 6.8, we list some simple properties of B(k, s) and σ(n, k).
(ii) If C Lep is chosen large enough and depending only on K, σ, σ g as well as on L and Θ of F, and also if n is large enough such that k * n (s f ) ≥ 2, then there exist ρ ∈ N and c 2 > 0, which are both uniform in F, such that Moreover, ρ depends only on b 1 , b 2 , s and on C Lep .
Proof of Lemma A.9. For convenience we write n for n 2 , as this lemma depends only on the subsample S 2 . Using Lemma A.8, (iii), we may assume that n is so large that k ≥ k 0 for all k ∈ K n , where k 0 is the parameter in F in (3.9). From the definition of B(k, s), (6.25) can be rewritten accordingly. (i). Fix some k ∈ K n with k > k * n (s f ). From the definition of k̂ n in (3.11), we obtain the first bound. From now on let j ∈ K n with j ≥ k. Using (A.9) we estimate the corresponding term. By Lemma A.8, (ii), and the definition of k * n (s f ) we have the stated inequality. Combining the latter two displays and Lemma A.7 with τ = 1/3 and C̃ = C Lep leads, for sufficiently large n, to a bound with constants C i > 0, i = 1, 2, depending only on K, σ, σ g as well as on L, Θ and s f of F s , as in (3.10). Since the constants C 1 and C 2 can be chosen continuously in s f by Lemma A.7, and s f ∈ [s, s], we can choose these constants uniformly in F. Using Lemma A.8, (v), the deterministic terms in (A.11) vanish for n large enough. For j ≥ k we have h k−1 > h j (Lemma A.8, (i)). Hence, if C Lep > 0 is chosen large enough (depending only on K, σ, σ g , L and Θ), then using Lemma A.8, (vi), in the second and (vii) in the third step, we obtain an estimate with some finite constant C 3 > 0 depending only on K, σ, σ g as well as on L and Θ of F. Similarly we estimate the last term in (A.11).
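The bandwidth choice k̂ n in (3.11) follows the general Lepski principle: starting from the largest bandwidth, stop at the first index whose estimate is compatible, up to a multiple of the stochastic error, with all estimates at finer bandwidths. The following sketch implements a generic version of this rule; the function name, the inputs and the toy numbers are illustrative assumptions, not the paper's exact definition (3.11).

```python
def lepski_select(estimates, noise_levels, c_lep):
    """Generic Lepski-type rule (a sketch, not the paper's exact (3.11)).

    estimates[k]    : estimator computed at bandwidth h_k (h_k decreasing in k)
    noise_levels[k] : stochastic error bound for estimates[k]
    Returns the smallest k whose estimate agrees with every finer-bandwidth
    estimate j >= k up to c_lep * noise_levels[j].
    """
    K = len(estimates)
    for k in range(K):
        if all(abs(estimates[j] - estimates[k]) <= c_lep * noise_levels[j]
               for j in range(k, K)):
            return k
    return K - 1  # fall back to the finest bandwidth

# Toy numbers: large bandwidths are biased, small ones are noisy but centered.
est = [0.62, 0.54, 0.505, 0.500]
lam = [0.005, 0.01, 0.02, 0.04]
print(lepski_select(est, lam, c_lep=1.0))  # selects index 2
```

The biased large-bandwidth estimates fail the compatibility check against finer bandwidths, so the rule stops at the first index whose estimate is stable, which is the trade-off the proof of Lemma A.9 quantifies.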
(ii). Fix some k < k * n (s f ) − ρ, where ρ ∈ N will be chosen below. By the definition of k̂ n in (3.11) and since k < k * n (s f ), we obtain a term which, by the assumption on K^{(1)}, is of order h^{s−γ}, so that (A.16) is satisfied for b 1 small enough.
For the sequence of alternative hypotheses, let θ 1 = θ 0 + r n ∈ Θ, where r n = o(1) is of the same order as r n in the proof of Theorem 2.4, (i), and consider the function f 1 with ν 0 resp. ν n as in (6.9) resp. (6.10), and where c 2 , c 3 > 0 are suitable constants such that the derivatives of f 1 have the appropriate Lipschitz resp. Hölder constant L. In the spirit of (A.16) we check, for some suitable h 0 > 0, the corresponding condition. Decompose the probe functional into five parts. Without loss of generality assume that h 0 is so small that (1 − θ 0 )/h 0 ≥ 1 and −θ 0 /h 0 ≤ −1. Now, (B) = 0 as well as (C) = 0 for sufficiently small h 0 can be shown similarly as in (A.17). Since θ 1 ≠ θ 0 independently of h, we also have that (A) = 0 for sufficiently small h 0 . In addition, since T̃ s,n is a piecewise polynomial with a discontinuity in the s-th derivative at θ 1 , one can show the corresponding bound. Finally, since ν n ∈ C ∞ , it follows by integration by parts, a Taylor expansion around θ 1 and since K^{(1)} is of order s − γ (Assumption 2, (iii)) that the remaining term is of negligible order compared to O(h^{s−γ}) for any s ∈ [γ + 1, s]. All things considered, we have verified (A.19) for b 1 small enough. Concerning the Kullback–Leibler distance between f 1 and f 0 , we derive, similarly to (6.15), the corresponding bound. The first term on the right-hand side of the latter display can be dealt with as in the proof of Theorem 2.4, (i), while the second term is asymptotically negligible. Indeed, firstly, ||T̃ s,n || ∞ = |T̃ s,n ((θ 0 + θ 1 )/2)| ≤ max{c 2 , c 3 }((θ 1 − θ 0 )/2)^s = max{c 2 , c 3 }(r n /2)^s = max{c 2 , c 3 }(C̃ b_n^{s−γ+1}/2)^s, and secondly, T̃ s,n is non-zero only inside the interval [θ 0 , 2θ 1 − θ 0 ], which has Lebesgue measure 2r n , and consequently at most 2nr n summands in Σ_{i=1}^n T̃ s,n (x i )² are non-zero, due to the equidistant design. Therefore, the last term is O(n b_n^{2(2s+1)}), since s ≥ γ + 1, and consequently negligible for the order of the Kullback–Leibler distance between f 1 and f 0 .
Remark.
The function K^{(1)} = L̄ hence satisfies Assumption 2, (i)–(iii) and (v), for γ = 1 and l = l̄ as well as for l = l̄ + 1. Condition (iv) of Assumption 2 is at least numerically true, as the plots in Figure 4 suggest, though we did not provide a rigorous theoretical argument for this condition. (c). Lemma B.1, (iii), means L̄^{(j)}(±1) = 0 for j = 1, 2, 3. Further, Lemma B.1, (iv), for m = 0 implies 0 = ∫_{−1}^{1} L̄(x) dx = 2L̄(1), since L̄ is odd; hence L̄(1) = L̄(−1) = 0.
Remark. By using a sum from k to k+γ+2 in (B.5) the method can be extended, and (iii) can be satisfied for γ ≥ 2.

B.2. Comparison with Mallik et al. (2013)
We compared our proposed confidence intervals for the kink location with those of Mallik et al. (2013) by simulating observations within the same setting as in Section 5 of their paper. In particular, we considered the regression function f(x) = 2(x − 0.5) 1_{(0.5,1]}(x) and normally distributed noise variables with zero mean and standard deviation σ = 0.1. The function has a kink of first order at θ = 0.5. As in Mallik et al. (2013), we applied our method over 5000 replications, where for every scenario we used a grid K n such that the bandwidth values lie inside an interval [h min,n , h max,n ]; the values of h min,n resp. h max,n are given in Table 7. For the Lepski constant we used C Lep = 0.03. The results can be found in Table 8, where we also display the results of Mallik et al. (2013) for comparison. As expected, our method (denoted by OCI) yields confidence intervals which are narrower than those of Mallik et al. (2013) (denoted by MCI), since their method makes milder assumptions on the smoothness of the regression function. Nevertheless, f fulfills the assumptions of Mallik et al. (2013) as well as those of our setting, and therefore it seems reasonable to use our method in such cases.
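This simulation setting can be reproduced as follows. Only the regression function and the noise level are taken from the text; the sample size and seed are illustrative choices.

```python
import numpy as np

def f(x):
    # regression function from the comparison: kink of first order at theta = 0.5
    return np.where(x > 0.5, 2.0 * (x - 0.5), 0.0)

rng = np.random.default_rng(1)
n, sigma = 500, 0.1                    # n is an illustrative choice; sigma as in the text
x = np.arange(1, n + 1) / n            # equidistant design points
y = f(x) + sigma * rng.standard_normal(n)

# one-sided difference quotients confirm the jump [f'] = 2 in the first derivative
eps = 1e-6
left = (f(0.5) - f(0.5 - eps)) / eps
right = (f(0.5 + eps) - f(0.5)) / eps
print(right - left)  # approximately 2
```

The one-sided slopes at θ = 0.5 are 0 and 2, so the kink size [f'] equals 2, matching the setting of the comparison.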
N (ρ h , T, ε) ≤ C 1 λ 1 (T)(hε)^{−1} for some appropriate constant C 1 > 0 depending only on K and σ g . With this, and choosing h 0 > 0 appropriately small depending on diam_{ρ h }([0, 1]), we deduce for any h ∈ (0, h 0 ) the stated bound for some finite constant C 2 > 0 depending only on K and σ g . In view of (ii), the choice of h 0 likewise depends only on K and σ g , which concludes the proof.

Lemma C.2.
There exist constants C 1 , C 2 , h 0 > 0 depending only on K and σ g , such that for any λ > 0 and h ∈ (0, h 0 ) such that λ > C 1 − log ( the stated bound holds. To prove Lemma C.2 resp. Lemma C.3 we make use of Theorem 3.1 resp. Corollary 3.3 in Viens and Vizcarra (2007), whose requirements we derived in Lemma C.1.
Proof of Lemma C.3. As in the proof of Lemma C.2, we can assume without loss of generality that the constant c g in Lemma C.1, (i), is one. Using Corollary 3.4 in Viens and Vizcarra (2007) yields the assertion, together with the bound on the covering entropy in part 3 of Lemma C.1.
where Φ Σ is the cumulative distribution function of N (0, Σ).