Kink estimation in stochastic regression with dependent errors and predictors

In this article we study the estimation of the location of jump points in the first derivative (referred to as kinks) of a regression function µ in two random design models with different long-range dependent (LRD) structures. The method is based on the zero-crossing technique and makes use of high-order kernels. The rate of convergence of the estimator depends on the level of dependence and the smoothness of the regression function µ. In one of the models, the convergence rate matches the minimax rate for kink estimation in the fixed design scenario with i.i.d. errors, which suggests that the method is optimal in the minimax sense.


Introduction
Assume that we observe a bivariate dataset {X_i, Y_i}_{i=1}^n that follows the regression model

Y_i = µ(X_i) + σ(X_i) ε_i,    i = 1, . . . , n,    (1)

where µ is the regression function and σ is the scale function. Also, ε_i and X_i are the error and random design variables respectively (both being possibly long-range dependent) and X_i has cumulative distribution function F = F_X : R → [0, 1] that is strictly increasing. We are interested in testing for the presence of a change point in the slope of the regression function µ and, if one exists, estimating its location. We describe this jump in the first derivative of µ as a kink and denote the change point by θ. Knowledge of this change point allows us to identify changes in trend in the underlying regression function of a non-parametric model, which may explain a change in the qualitative or quantitative behaviour of the underlying process.

Existing results
Before examining kink estimation under the random design regression model (1), we first look at other non-parametric and parametric models and their link to the existing theory of kink point estimation. A change point estimation technique was pioneered by Goldenshluger, Tsybakov and Zeevi (2006) for estimating change points in the regression function itself, not the kink scenario. The underlying model assumed in their framework was the indirect model with fixed design. The indirect model assumes that the regression function is not observed directly; rather, a so-called 'blurred' version of it is observed, whereby the regression function has been transformed by a convolution operator. More specifically, the indirect model assumes that observations are realisations of the asymptotic model,

dY(x) = K ∗ µ(x) dx + ε dB(x).    (2)

In the above model the function K ∗ µ(x) = ∫ K(t − x)µ(t) dt represents the convolution of µ and K, and the noise is driven by a regular Brownian motion B(x) and controlled by ε ≍ n^{−1/2}, where the statement a_n ≍ b_n means that the ratio a_n/b_n is bounded above and below by positive constants. The fixed design implies that the design variables x_i = i/n are equally spaced points on the unit interval. The asymptotic model (2) is considered due to a result by Brown and Low (1996) showing that (2) is asymptotically equivalent to the discrete model with an i.i.d. sequence of error variables z_i. The specific estimation technique that Goldenshluger, Tsybakov and Zeevi (2006) formulated was the zero-crossing technique, which uses a particular class of kernel functions to identify the change point. Their technique will be adapted for use in this article and is pursued in further detail in Section 4.1. At this stage it will suffice to say that the main result of their paper established that the zero-crossing technique is optimal in the minimax sense under the framework given in (2).
The zero-crossing technique was applied by Cheng and Raimondo (2008) to estimate a kink instead of a jump point, in the direct model with fixed design. In this framework the observations follow a fixed design and are realisations of the asymptotic model (3). Model (3) and its asymptotic equivalent are usually appropriate in practice when a variable is observed at regular intervals indexed by time and the errors are i.i.d. homoscedastic random variables. More recently, Wishart (2009) extended the technique further to include long-range dependent (LRD) noise observations instead of independent noise. The kink estimation technique was extended to model (5), where B_H(x) is a fractional Brownian motion with self-similarity index H ∈ [1/2, 1). The noise process was normalised by ε^α where α = 2 − 2H. Wang (1996) has shown that model (5) is the asymptotic equivalent of the discrete model (6), where e_i is an LRD sequence of random variables.
In this paper we are interested in model (1), which extends the fixed design cases given in models (3) and (6) above. They are extended in the sense that the design points are no longer restricted to a uniform grid of points and the scale function σ(·) allows heteroscedasticity in the error terms of the regression model. The analysis of this random design model needs to be handled quite carefully, since the asymptotic behaviour of the estimators will depend on the behaviour of the scale function and on the level of dependence present in the design variables and errors themselves. It has been shown by Reiß (2008) that there exists an asymptotic equivalence between models (1) and (4) when σ(·) ≡ constant and the design variables are independent uniform random variables. However, this is not the case in general. As noted in Kulik and Raimondo (2009a), with LRD design variables, model (1) cannot be equivalent to any asymptotic model, in contrast to model (5) being the asymptotic equivalent of model (6) in the fixed design case.
There is an extensive treatment in the literature on both parametric and nonparametric methods for regression models with a random design framework that assume i.i.d. design and error variables. The methodologies used include, but are not limited to, kernel smoothing, wavelet decompositions and orthogonal series. The methods of change point estimation for the random design case have been considered in Gijbels, Hall and Kneip (1999); Huh and Park (2004); Korostelëv and Tsybakov (1993).
There is also literature on the fixed design scenario in the presence of long-range dependent errors; the introduction of dependence in the errors always has a detrimental effect on estimation in this scenario. In the context of function estimation some recent treatments of this topic include Cavalier (2004); Csörgő and Mielniczuk (1995); Johnstone (1999); Johnstone and Silverman (1997); Kulik and Raimondo (2009a); Wang (1996). For change point estimation, work has been done by Wang (1999); Wishart (2009).
There is an emerging literature that attempts to combine the two scenarios with random design regression models in which the design variables and/or the error variables are LRD. When the framework includes a random design and possibly LRD variables, a more subtle asymptotic theory arises, based on a delicate balance between the behaviour of the σ function and the level of dependence present. This is evident in a number of papers in the area and will be the case here as well. The interested reader is referred to work by Guo and Koul (2008); Robinson and Hidalgo (1997) for a parametric linear model approach in this context and to Csörgő and Mielniczuk (1999); Kulik and Raimondo (2009b); Mielniczuk and Wu (2004); Yang (2001) for regression estimation in a non-parametric framework. Finally, some studies estimating change points in the non-parametric context include Lin, Li and Chen (2008); Wang (2008).

Article outline
Some preliminary framework is outlined in Section 2, setting up the class of functions that are considered and specific dependence assumptions made in the random design model. The main result of the paper is described in Section 3, along with a brief discussion. The estimation method is explained in detail in Section 4, with a brief outline of the zero-crossing technique in the fixed design and its extension to the random design case. All the necessary proofs of the results are given in Section 5.

Smoothness assumptions and kernels
First we look at the smoothness of the regression function µ and the properties of the kernel function constructed by Cheng and Raimondo (2008) for the zero-crossing technique. We define a class of functions that have domain X ⊆ R, a kink at θ ∈ X and s ≥ 3 derivatives that exist in a neighbourhood of θ.
2. µ has a kink, that is, there exist a θ ∈ X and an a_µ ∈ R with a_µ ≠ 0 such that [µ^{(1)}](θ) := µ^{(1)}(θ+) − µ^{(1)}(θ−) = a_µ, where µ^{(1)}(θ+) and µ^{(1)}(θ−) are the right and left first derivatives of µ respectively.
3. The higher order derivatives µ^{(i)} exist, are finite everywhere and satisfy,
4. For all x_+ ∈ (0, sup X − θ) and x_− ∈ (inf X − θ, 0),
Condition 4. should be interpreted in the sense that µ^{(1)} has a separate Taylor expansion for points to the left and right of θ respectively. Condition 3. of Definition 1 might seem overly restrictive but is required to exploit the class of kernel functions introduced later in this section. We will also denote F_s(θ) = F_s(R, θ). For completeness and comparison purposes we also introduce another smoothness class, G_s, to denote the class of functions that do not have a kink. This class is identical to F_s(θ) except that conditions 2. and 3. of Definition 1 are relaxed, in the sense that there does not exist a θ ∈ X such that [µ^{(1)}](θ) ≠ 0.
In the fixed design setting, we can assume that the domain of the regression function is [0, 1], since any finite interval [a, b] can be mapped to [0, 1] by an affine transformation. However, this assumption is not always valid in the general random design case. In particular, if the design variables are LRD then they are required to have a domain across the whole real line.
To use the zero-crossing technique for this class of regression functions, Cheng and Raimondo (2008) constructed a class of kernel functions via Legendre polynomials; we denote this class by K_s. The full description of the zero-crossing technique and the consequent technical details required of the kernel functions are not covered here, and the reader is referred to Goldenshluger, Tsybakov and Zeevi (2006) and Cheng and Raimondo (2008) respectively for a full treatment. However, some key aspects will be given, and for our case we will say K ∈ K_s, where s = 2k + 1 and k ∈ Z_+, if the conditions below hold with the polynomial coefficients defined accordingly. This class of kernel functions is indexed by the level of smoothness s and is constructed to exploit the extra smoothness of the class F_s(θ). To save on notation we write K_i = K^{(i)} for the i-th order derivative of K. The kernels have the following properties. Property (10) of K_s ensures that the smoothness of F_s(θ) can be exploited to obtain faster rates of convergence of the estimator θ̂ of θ. For our purposes of estimation, assume that µ ∈ F_s(θ) and σ ∈ G_r where s ∧ r ≥ 3.
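The explicit coefficient formulas for K_s are not reproduced above. As a generic illustration of the high-order-kernel idea (a sketch only, not the paper's exact Legendre construction), the following solves the moment system ∫_{−1}^{1} x^m K(x) dx = 1{m = 0}, m = 0, . . . , s − 1, for a polynomial kernel on [−1, 1]; such kernels annihilate polynomial trends up to degree s − 1, which is what yields the faster rates:

```python
import numpy as np

def high_order_kernel(s):
    """Polynomial kernel of degree s - 1 on [-1, 1] with s - 1 vanishing
    moments, found by solving the linear moment system
    int_{-1}^{1} x^m K(x) dx = 1{m == 0}, m = 0, ..., s - 1.
    A generic construction, not the paper's Legendre-based class K_s."""
    M = np.empty((s, s))
    for m in range(s):
        for k in range(s):
            p = m + k
            M[m, k] = (1 + (-1) ** p) / (p + 1)  # int_{-1}^{1} x^p dx
    b = np.zeros(s)
    b[0] = 1.0
    return np.polynomial.Polynomial(np.linalg.solve(M, b))

K = high_order_kernel(5)   # kernel with four vanishing moments
```

The same moment system can equivalently be diagonalised in the Legendre basis, which is essentially why Legendre polynomials appear in constructions of this type.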

Dependence assumptions
Throughout the paper there will be a dependence assumption either among the design random variables or in the error random variables. In particular, the assumed dependence structure is a causal LRD linear process that is defined below.
Definition 2. Let {c_i} be a set of square summable constant coefficients defined in terms of a slowly varying function L : R_+ → R_+ and a parameter 0 < α ≤ 1. Then a random process {ξ_i} is said to be a causal LRD linear process if

ξ_i = µ_ξ + Σ_{j≥0} c_j η_{i−j},

where |µ_ξ| < ∞ and the η_i are i.i.d. random variables with density f_η and moments Eη_1 = 0 and Eη_1^2 < ∞. Furthermore, {ξ_i} is said to be a causal LRD Gaussian linear process if ξ_i satisfies Definition 2 and {. . . , η_{i−1}, η_i} are i.i.d. N(0, σ_η^2). The case α = 1 is to be interpreted as a short range dependent case, and by construction the random variable has Eξ_i = µ_ξ and Var ξ_i = 1. Moreover, it can be shown that {ξ_i} is a second-order stationary process with asymptotic covariance Cov(ξ_0, ξ_i) ∼ C_0^2 i^{−α} L^2(i) as i → ∞. Therefore the process exhibits long-range dependence, and a consequence of this asymptotic covariance structure is that Var(Σ_{i=1}^n ξ_i) ∼ C_1^2 n^{2−α} L^2(n), where C_1^2 := 2C_0^2/((1−α)(2−α)). For the squared process, Var(Σ_{i=1}^n ξ_i^2) ∼ C_2^2 n^{2−2α} L^4(n) when 0 < α < 1/2, where C_2^2 := 4C_0^4/((1−2α)(2−2α)), while when 1/2 < α < 1 the sequence Cov(ξ_0^2, ξ_i^2) is summable, Var(Σ_{i=1}^n ξ_i^2) ∼ C_3^2 n and C_3^2 = 1 + 2 Σ_{i=0}^∞ Cov(ξ_0^2, ξ_i^2). Also, when α = 1/2, Var(Σ_{i=1}^n ξ_i^2) is asymptotically proportional to a term of order n times another term involving slowly varying functions.

Throughout the paper, the design variables and error variables are assumed to follow one of the following dependence conditions:

(A) The design variables {X_i}_{i=1}^n are i.i.d. random variables with domain X ⊆ R and common density f such that f(x) > 0 for all x ∈ X and sup_{x∈X} |f^{(s∧r)}(x)| < ∞. The error variables {ε_i}_{i=1}^n are a causal LRD process with parameter α_ε. Furthermore, the random variables {ε_i}_{i=1}^n are assumed to be independent of {X_i}_{i=1}^n. Define the associated set of σ-fields, G_i := σ(. . . , η_{i−1}, η_i; X_1, X_2, . . . , X_i).
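A minimal simulation sketch of Definition 2. The coefficient display is not reproduced above, so the choice c_j ∝ j^{−(1+α)/2} below (with the slowly varying factor taken constant) is an assumed standard parametrisation that produces covariances of order i^{−α}:

```python
import numpy as np

def lrd_linear_process(n, alpha, n_lags=5000, seed=0):
    """Simulate a (truncated) causal LRD linear process
    xi_t = sum_{j >= 0} c_j eta_{t-j}, with mu_xi = 0.
    Assumed coefficients c_j proportional to j^{-(1+alpha)/2}, truncated
    at n_lags lags and normalised so that Var(xi_t) = 1."""
    rng = np.random.default_rng(seed)
    j = np.arange(1, n_lags + 1)
    c = j ** (-(1.0 + alpha) / 2.0)
    c /= np.sqrt(np.sum(c ** 2))            # enforce Var(xi_t) = 1
    eta = rng.standard_normal(n + n_lags)   # i.i.d. N(0, 1) innovations
    # moving average over past innovations gives the causal linear process
    return np.convolve(eta, c, mode="valid")[:n]

xi = lrd_linear_process(2000, alpha=0.4)    # strongly dependent path
```

Smaller α gives slower coefficient decay and hence stronger dependence; α = 1 corresponds to the short range dependent boundary case.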
(B) The design variables {X_i} are a causal LRD linear process with parameter α_x, where f_η^{(j)} is a Lipschitz continuous function for j = 0, 1, . . . , s + 2 and f_X(x) > 0 for all x ∈ R. The error variables {ε_i}_{i=1}^n are centred and i.i.d. with finite variance, independent of {X_i}_{i=1}^n. Similarly, define the associated set of σ-fields. In both cases, the support of the design variables will be denoted X. Let F = F_X be the cumulative distribution function of X, which is strictly increasing, and denote by F_n(x) = n^{−1} Σ_{i=1}^n 1{X_i ≤ x} the empirical distribution function of X. Also let Q = F^{−1} and Q_n = F_n^{−1} be the quantile and empirical quantile functions respectively. We require that Q is Lipschitz, that is, there exists a constant C_Q such that |Q(u) − Q(v)| ≤ C_Q |u − v| for all u, v. Finally, we need to impose some mild restrictions on σ. We assume σ is bounded away from 0 and ∞, in the sense that 0 < inf_x σ(x) ≤ sup_x σ(x) < ∞, and that σ ∈ G_r where r ≥ 3. Throughout the article we denote by C a general constant that is assumed to be positive and finite but which may change from line to line.

Main result
The main result of the paper concerns the construction and analysis of an estimator, θ̂, of the kink location θ. The analysis of the estimator is given in Theorem 1 and concerns the rate of convergence of θ̂ to the true kink location θ. The estimator θ̂ will be constructed in Section 4, along with its motivation and analysis.
Theorem 1. Suppose a bivariate sequence of observations {X_i, Y_i} following model (1) is observed such that µ ∈ F_s(θ) with s ≥ 3. Then an estimator, θ̂, of the change point θ can be constructed such that, where C is an arbitrary positive constant and s ∧ r ≥ 3.
It is worth noting at this stage that the further restriction that σ ≡ C under Assumption (A) is unnecessary for the specific estimation technique and is only required in the maximal deviation result to ensure that a kink can be detected in practice. Further detailed discussion of this matter along with the proof of Theorem 1 is given at the end of Section 4.
The minimax optimality of this result is not pursued in this paper since the lower bounds on the convergence rate of θ for the functional class F s (θ) are not determined in the framework of random design. However, it is worth making the specific point that the obtained rate of convergence under Assumption (A) is the same as the minimax rates for the fixed design case with i.i.d. errors (see Cheng and Raimondo (2008)). Consequently, it seems reasonable to conjecture that the rates of our estimator are optimal in the minimax sense.

Kink estimation method
In this section, the basis of the zero-crossing technique is studied and a brief overview given. Firstly, the zero-crossing technique pioneered by Goldenshluger, Tsybakov and Zeevi (2006) and applied by Cheng and Raimondo (2008); Wishart (2009) will be described briefly in Section 4.1 and then an adaptation for the random design case constructed in Sections 4.2-4.7.

Approximation of the third derivative for the fixed design model
In the fixed design setting (cf. model (6)) it can be assumed without loss of generality that the regression function µ has domain [0, 1]. More specifically, assume that µ ∈ F_s([0, 1], λ) and estimate the smoothed third derivative µ^{(3)}(t) by a kernel estimator, where h = h(n) is the bandwidth that depends on n. Throughout the article it will be assumed that the bandwidth satisfies, at the very least, h + (nh)^{−1} → 0 as n → ∞. This is a standard regularity condition for kernel smoothing techniques, and additional conditions on the bandwidth will be stated as needed. Using the functional class F_s(θ) and the properties of the kernel function it can be shown that

κ_h(t) = L_h(t) + O(h^{s−3}),    (12)

where L_h(t) is the localisation term. Indeed, by exploiting the conditions of K_3 we can express κ_h(t) as follows. Change the variable of integration; the resulting equality holds because the domain of K is [−1, 1] and the values of t are restricted to t ∈ (h, 1 − h). This restriction is used to avoid possible edge bias effects from the two-sided kernel function. Using integration by parts and exploiting the boundary condition (9), let D = {t : |λ − t| < h} and τ = (λ − t)/h, so that |τ| < 1 for all t ∈ D. We now split (13) into two integrals; the order bound follows by using (7) and (8) in combination with (10). Therefore, this allows us to express κ_h(t) in the form (12). Since the method is based on estimating a smoothed third derivative of µ, it is assumed that s ≥ 3. This guarantees that µ^{(3)} exists and is finite, so that the method makes sense. Then the expansion in (12) characterises the behaviour of the signal near the kink; more specifically we have the following. As seen in all three of the aforementioned papers that use the zero-crossing technique, the δ-separation rate lemma given below is the technical result that explains why the above representation is effective.
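The estimator display is not reproduced above; the following sketch assumes the kernel-smoothing form κ̂_h(t) = (n h^4)^{−1} Σ_i K_3((x_i − t)/h) Y_i with x_i = i/n. The K3 below is a stand-in odd polynomial with two vanishing moments (so constant, linear and quadratic trends give no signal), not a member of the paper's exact class K_s:

```python
import numpy as np

def kappa_hat_fixed(t, y, h, K3):
    """Probe for the smoothed third derivative at t on the fixed design
    x_i = i/n (the (n h^4)^{-1} normalisation is an assumption)."""
    n = len(y)
    x = np.arange(1, n + 1) / n
    return np.sum(K3((x - t) / h) * y) / (n * h ** 4)

# stand-in kernel derivative: odd, supported on [-1, 1], satisfying
# int K3 = int u K3 = int u^2 K3 = 0, so polynomials up to degree two
# are annihilated while a kink produces a signal
K3 = lambda u: np.where(np.abs(u) <= 1.0,
                        (u - 3.0 * u ** 3) * (1.0 - u ** 2) ** 2, 0.0)

rng = np.random.default_rng(1)
n, h = 2000, 0.1
x = np.arange(1, n + 1) / n
y = np.abs(x - 0.5) + 0.01 * rng.standard_normal(n)   # kink at 0.5
left = kappa_hat_fixed(0.45, y, h, K3)
right = kappa_hat_fixed(0.55, y, h, K3)
```

The probe changes sign across the kink, with its zero-crossing at λ; this is the structure that Lemma 1 quantifies.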
The proof of this lemma is given in Cheng and Raimondo (2008). Their proof requires a minor correction, as the extra regularity condition 3. in the smoothness class F_s(θ) is needed.
The main idea of Lemma 1 allows us to exploit the expansion given in (12) and focus on the location of the kink. The kernel function has specific properties that guarantee that a unique global maximum and minimum occur within order h of the kink point. Furthermore, the estimator was constructed so that the rate of convergence of the kink location estimate is minimax for model (4). We will seek to adapt these results to the random design setting.

Adapted random design estimator of the third derivative
Now consider µ ∈ F_s(X, θ) in model (1). An estimator is constructed to exploit the smoothed third derivative of µ and the argument built around Lemma 1 discussed in Section 4.1. The most natural extension would be to use the estimator (15), where f̂_X(t) is the estimate of the density of X_i at the point t. Unfortunately, from a brief computational investigation, the estimator given in (15) appears to suffer from poor numerical performance. Instead of using (15), another estimator with better numerical performance is constructed by rescaling the design variables by the distribution function F and defining a rescaled regression function µ_F(·) = µ(Q(·)). This new estimator of κ_h(t) in the random design setting is given in (16). Apart from the computational benefits, the estimator κ̂_h(t) has some properties that can be exploited. It reduces the general random design problem to a somewhat simpler framework. To see this, consider F(X_i) to be the new random design variables of the regression problem. These new design variables follow the theoretically easier uniform distribution on [0, 1]. The price paid for this simplification is that the regression function corresponding to the new design variables is now µ_F. This simpler framework is useful for two reasons. Firstly, (16) is an unbiased estimate of the smoothed third derivative of the rescaled regression function µ_F given in (17), so it can be exploited via Lemma 1 and the argument shown in Section 4.1. Secondly, the problem is then equivalent to estimating a kink location λ for the function µ_F in the fixed design setting.
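A sketch of the rescaling idea: the design points are replaced by their empirical ranks F_n(X_i) = rank(X_i)/n, which approximate the uniform variables F(X_i), after which the probe takes the same form as in the fixed design case. The display (16) is not reproduced above, so the normalisation and the stand-in kernel here are assumptions:

```python
import numpy as np

def kappa_hat_random(t, X, Y, h, K3):
    """Rank-rescaled random-design probe: design points are mapped to
    (approximately uniform) empirical ranks F_n(X_i) before smoothing."""
    n = len(X)
    U = (np.argsort(np.argsort(X)) + 1) / n   # empirical ranks F_n(X_i)
    return np.sum(K3((U - t) / h) * Y) / (n * h ** 4)

# stand-in odd kernel with two vanishing moments (not the class K_s)
K3 = lambda u: np.where(np.abs(u) <= 1.0,
                        (u - 3.0 * u ** 3) * (1.0 - u ** 2) ** 2, 0.0)

rng = np.random.default_rng(2)
n, h = 5000, 0.1
X = rng.standard_normal(n)
Y = np.abs(X) + 0.02 * rng.standard_normal(n)   # kink of mu at theta = 0
# on the rescaled axis the kink of mu_F sits at lambda = F(0) = 0.5
left = kappa_hat_random(0.45, X, Y, h, K3)
right = kappa_hat_random(0.55, X, Y, h, K3)
```

The sign change of the probe now occurs around λ = F(θ) on the rescaled [0, 1] axis rather than around θ itself.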
With the previous argument in mind, an estimator θ̂ of the kink location of the regression function µ in the random design setting is constructed that is approximately the same as the estimator of the kink location λ of µ_F in the fixed design setting. This is done by estimating λ by λ̂ using the established zero-crossing technique in the fixed design setting and then rescaling λ̂ back by the quantile function to obtain an estimate of θ. Thus, to assess the performance of our estimator, we need to check that the convergence of κ̂_h(t) to κ_h(t) is sufficiently fast. To do this, consider the two processes defined in (18) and (19). With these definitions, the overall accuracy of the estimator can be decomposed into a sum in which b_h(t) and Z_h(t) represent respectively the stochastic bias and stochastic error contributions to the estimator. The analysis of these terms is given in the next subsection.

Probabilistic behaviour for the adapted estimator
In this section the analysis of the stochastic bias and stochastic error terms is considered before proceeding to the next stage of the zero-crossing technique, to ensure that the stochastic contributions do not overwhelm the signal generated by the κ_h(t) term. The proofs of the claims in this section are deferred to Section 5. The first term to be considered is the stochastic bias term, which did not appear in the previous kink analyses pursued by Cheng and Raimondo (2008); Wishart (2009): it arises from the stochastic contribution incurred in adapting the fixed design estimator to the random design framework. Therefore, this term needs to be appropriately dealt with, and the next lemma is a useful tool for handling it.
Lemma 2. Consider a function µ : X → R such that µ′ exists and is bounded, and define the function as follows. If the design variables follow Assumption (A) then, If the design variables follow Assumption (B) then, Note that the two claims given in Lemma 2 are respectively a uniform law of the iterated logarithm for independent variables and a similar iterated-logarithm-type result for martingale difference sequences.
We now state some central and non-central limit theorems for the estimator κ̂_h(t). The convergence of κ̂_h(t) under both Assumption (A) and Assumption (B) is contingent on the size of the bandwidth relative to the level of dependence α. The specific details of the relationship between h, n and α will be shown in the theorems. Roughly speaking, if the bandwidth is too 'large' compared to α then the dependence of the random variables dominates and the estimator converges to a process that must be normed by a sequence that depends on α. Conversely, if the bandwidth is 'small' compared to α then the dependence of the random variables is negligible and a regular central limit theorem holds with a norming sequence that does not depend on α. In the forthcoming theorems the extra smoothness of the regression and variance functions is exploited to obtain an estimator that is not as sensitive to the level of dependence. In practice, this extra level of smoothness will most likely be unknown. Due to its common occurrence in the subsequent theorems, define the asymptotic variance term υ^2(t) := σ^2 The following theorem deals with the case of Assumption (A).
If the design variables and error random variables follow Assumption (A) and the bandwidth h = h(n) satisfies, then the following convergence result holds, Conversely, if the bandwidth h = h(n) satisfies, then, Theorem 3 and Theorem 4 deal with the case under Assumption (B) and give the central limit theorems for the 'small' and 'large' bandwidth scenarios respectively. In the 'large' bandwidth scenario a stronger assumption is used, whereby the design variables are a causal LRD Gaussian linear process.
If the design variables and error random variables follow Assumption (B) and the bandwidth h = h(n) satisfies, then the estimator obeys the following law. Assume the design variables and error random variables follow Assumption (B) and that the design variables are a causal LRD Gaussian linear process. If the bandwidth h = h(n) satisfies, and the estimator κ̂_h(t) has a Hermite rank of 1, then the estimator obeys the following law, where s_X^2 = 1 − σ_η^2, B denotes a standard Brownian motion, and φ and Φ are the standard normal density and cumulative distribution functions respectively. Remark 1. If the estimator κ̂_h(t) has Hermite rank q for some q ∈ {2, 3, . . .} then the asymptotic distribution depends on the size of the bandwidth relative to qα. Firstly, if n^{1−qα_x} h^7 L^{2q}(n) → ∞ then it can be shown, using an argument similar to that in the proof of Theorem 4 together with Theorem 2 of Avram and Taqqu (1987), that the process normed by n^{qα_x/2} times a slowly varying factor converges, where H_q(x) is the Hermite polynomial of degree q and H_q is the Hermite-Rosenblatt process. In Avram and Taqqu (1987), Appell polynomials were considered for a generalised sequence of stationary LRD random variables; in our case the LRD variables are Gaussian and consequently the Appell polynomials reduce to the Hermite polynomials. On the other hand, if the bandwidth satisfies n^{1−qα_x} h^7 L^{2q}(n) → 0 then (21) holds.
As will be seen in Section 4.5, large deviations results are needed to distinguish the signal generated by the κ_h(t) term from the stochastic bias and noise contributions. Unfortunately, only a slightly weaker large deviations result is proved under Assumption (A) in Theorem 5. In particular, we assume that the scale function σ(·) ≡ σ is constant; however, this restriction could possibly be relaxed by using a different method. The large deviations result for Assumption (B) in Theorem 6 does not carry this restriction and the scale function need not be constant.
Theorem 5. Let K ∈ K_{s∧r} and let the design and error variables satisfy Assumption (A). Further assume that the bandwidth h = h(n) satisfies condition (22). Define S_n^A(t) and the partition T_n of [0, 1] in (24), where m_n = ⌈1/(2h)⌉. Then, for all x ∈ R,
Theorem 6. Let K ∈ K_{s∧r}, σ ∈ G_r with s ∧ r ≥ 3, let the design and error variables satisfy Assumption (B), and assume that the bandwidth h = h(n) satisfies (25). Then define S_n^B(t) as below; for T_n defined in (24) with m_n = ⌈1/(2h)⌉,

lim_{n→∞} P( sup_{t∈T_n} S_n^B(t) > 2 √(log m_n) ) = 0.

Localisation step
Recall from (12) that the probe function κ_h(t) gives a signal from the localisation term L_h(t) with some approximation error, while the estimator adds a stochastic bias term and a stochastic error term; this gives the decomposition (26). Clearly h^{−2} > h^{s−3}, since s ≥ 3. So, to be able to discern the signal generated from L_h(t) = O(h^{−2}), it is required that L_h(t) dominates the stochastic terms Z_h(t) and b_h(t). By construction of the kernel function (cf. Cheng and Raimondo (2008)), K_1(·) has two unique extrema, in the form of a unique global minimum and maximum, in the interval [−1, 1]. This implies that K_1(·/h) has the same unique extrema in an interval of length O(h). Consequently, L_h(·) has two unique global extrema near t^* = λ + O(h) and t_* = λ − O(h), as in the fixed design scenario considered by Cheng and Raimondo (2008); Wishart (2009). However, in practice the locations of t_* and t^* are not known, and they are estimated using κ̂_h(t), with t̂_* the argmin and t̂^* the argmax of the probe. There are two respective bandwidth restrictions, ((A1), (A2); (B1), (B2)), for the asymptotic behaviour of the estimator under Assumption (A) and Assumption (B) respectively. Starting with (A1) and (B1), to have a well defined signal it is required that h^{−2} ≥ C n^{−1/2} h^{−7/2}, that is, h ≥ C n^{−1/3}. Furthermore, since it is assumed that s ∧ r ≥ 3, to ensure that (20) and (21) always hold it suffices to choose h such that h ≤ C n^{−1/7 + (α_x ∨ α_ε)/7 − δ} for some δ > 0 or, for some δ > 0. With this choice, the bandwidth restrictions given by (A1) and (B1) will always hold. It is worth noting that, under this choice, the order of the stochastic terms does not involve α_x or α_ε, the level of dependence; h is chosen in a very similar manner to the case where ε_i and X_i, i ≥ 1, are i.i.d. Consequently, there will be no influence of the (long range) dependence on the change point estimation.
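The localisation step can be sketched as a grid scan for the two extrema of the probe. The probe normalisation and the stand-in odd kernel with two vanishing moments below are assumptions (the paper's K_1-based construction is not reproduced above), and a fine grid is used purely for illustration; the paper scans the coarse grid T_n first and refines it in Section 4.6:

```python
import numpy as np

def probe(t, y, h, K3):
    # fixed-design probe; (n h^4)^{-1} normalisation is an assumption
    n = len(y)
    x = np.arange(1, n + 1) / n
    return np.sum(K3((x - t) / h) * y) / (n * h ** 4)

def localise(y, h, K3, step):
    """Return (t_lower, t_upper): the argmin and argmax of the probe
    over a grid, which should bracket the kink within O(h)."""
    grid = np.arange(h, 1 - h, step)
    vals = np.array([probe(t, y, h, K3) for t in grid])
    t_min, t_max = grid[np.argmin(vals)], grid[np.argmax(vals)]
    return min(t_min, t_max), max(t_min, t_max)

# stand-in odd kernel with two vanishing moments (not the class K_s)
K3 = lambda u: np.where(np.abs(u) <= 1.0,
                        (u - 3.0 * u ** 3) * (1.0 - u ** 2) ** 2, 0.0)
rng = np.random.default_rng(3)
n, h, lam = 4000, 0.05, 0.37
x = np.arange(1, n + 1) / n
y = np.abs(x - lam) + 0.01 * rng.standard_normal(n)   # kink at 0.37
lo, hi = localise(y, h, K3, step=h / 4)   # brackets lam within O(h)
```

The argmin and argmax land on opposite sides of the kink (for a positive slope jump), so the pair brackets λ in an interval of length O(h).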
The influence of the long range dependence will only affect the threshold used to test whether a signal is genuine; this is discussed in the next subsection.

Kink detection step
For simplicity of notation, assume that [µ_F^{(1)}](λ) > 0, which means t_* < t^* (a similar argument follows if [µ_F^{(1)}](λ) < 0, whence t_* > t^*). To detect a kink, first standardise the statistic κ̂_h(t) to have unit variance. This allows us to determine that a change point is present when the observed extrema of κ̂_h(t) exceed the threshold for the noise process. Define this standardised process as T_κ(t). Then by (26) the T_κ(t) process has the following expansion. As seen earlier, the information regarding a kink is generated by the L_h(t) process. A thresholding regime will be considered to distinguish the signal generated by L_h(t) from the noise generated by the Z_h(t) and b_h(t) terms. This thresholding will be split into the two scenarios for Assumptions (A) and (B).
Begin by giving a general decomposition of the estimator for both cases, using the quantities γ_i^*(t) together with (18) and (19), to obtain (30). Focus on Assumption (A) and assume σ(·) ≡ C, constant. The assumption that the scale function is constant is required in the proof of the maximal deviation result in Theorem 5. It may be possible to relax this condition and obtain the same result for σ ∈ G_r under Assumption (A), but this remains a conjecture at this stage. Nevertheless, to control the stochastic terms in (30), first apply Lemma 2 and use (10). Then consider the values of t on the initial coarse grid T_n (see (24)), where the increments are of size 2h. The grid values will be refined later in Section 4.6. From Theorem 5, it is known that sup_{t∈T_n} S_n^A(t) will diverge to infinity no faster than 2√|log 2h|. Also, if µ ∈ G_s, then from (14), κ_h(t) = O(h^{s−3}). However, if µ ∈ F_s(θ), then (27) holds and by (29),

max_{t∈(t_*, t^*)} T_κ(t) ≥ C n^{1/2} h^{3/2} > 2√|log 2h|,    (31)

and a kink is detected when (32) holds. A very similar argument holds for Assumption (B). In this case assume that the scale function σ ∈ G_r with r ≥ 3 and proceed as before. In conjunction with (30) and (10), apply Lemma 2, where the extra term D_n(t) appears. Using Lemma 3 (see the Appendix in Section 5) with the bandwidth condition (25),

sup_{t∈(h,1−h)} |D_n(t)| = o_p(√|log h|).

Also, the bandwidth restriction (28) guarantees that (25), and consequently Theorem 6, holds. Then for Assumption (B) the same argument applies that was used to show (31) for Assumption (A).
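A sketch of the detection rule under Assumption (A) with constant σ: standardise the probe on the coarse grid and flag a kink when the maximum exceeds 2√(log m_n). The plug-in standard deviation σ ||K_3||_2 n^{−1/2} h^{−7/2} is an assumed normalisation standing in for υ(t), and K3 is again a stand-in kernel rather than the class K_s:

```python
import numpy as np

def detect(y, h, K3, sigma):
    """Threshold test: flag a kink iff the maximum standardised probe on
    the coarse grid T_n (spacing 2h, m_n = ceil(1/(2h)) points)
    exceeds 2 * sqrt(log m_n)."""
    n = len(y)
    x = np.arange(1, n + 1) / n
    m = int(np.ceil(1.0 / (2.0 * h)))
    grid = h + 2.0 * h * np.arange(m)
    grid = grid[grid <= 1.0 - h]
    vals = np.array([np.sum(K3((x - t) / h) * y) for t in grid]) / (n * h ** 4)
    u = np.linspace(-1.0, 1.0, 20001)           # ||K3||_2 by Riemann sum
    l2 = np.sqrt(np.sum(K3(u) ** 2) * (u[1] - u[0]))
    sd = sigma * l2 / (np.sqrt(n) * h ** 3.5)   # plug-in probe std dev
    T_max = np.max(np.abs(vals)) / sd
    return T_max > 2.0 * np.sqrt(np.log(m)), T_max

# stand-in odd kernel with two vanishing moments (not the class K_s)
K3 = lambda u: np.where(np.abs(u) <= 1.0,
                        (u - 3.0 * u ** 3) * (1.0 - u ** 2) ** 2, 0.0)
rng = np.random.default_rng(4)
n, h, lam = 4000, 0.05, 0.37
x = np.arange(1, n + 1) / n
kinked = np.abs(x - lam) + 0.01 * rng.standard_normal(n)
smooth = 0.3 + 0.5 * x + 0.01 * rng.standard_normal(n)   # no kink
flag_kink, T_kink = detect(kinked, h, K3, sigma=0.01)
flag_null, T_null = detect(smooth, h, K3, sigma=0.01)
```

A linear (kink-free) trend produces only noise-level values of the standardised probe, while the kinked signal pushes the maximum well above the threshold.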
This thresholding technique does impose some restrictions that could possibly be removed by another technique. Recall from (28) that h > C n^{−1/3+δ} for some δ > 0 is required to be able to distinguish the signal from the stochastic terms. Also, (22) and (25) are required to apply Theorem 5 and Theorem 6 respectively and obtain a large deviation result for the process. Therefore, to ensure both conditions are satisfied, it is sufficient to consider α_x > 8/9 or α_ε > 4/9.

Zero-crossing technique
If a kink is detected (when (32) is satisfied) then the method can proceed to the zero-crossing step. This step considers the interval A_h := [t̂_*, t̂^*], which will contain λ, with t̂^* − t̂_* = O(h). The main idea behind the zero-crossing technique is that, for t ∈ A_h, κ̂_h(t) ≈ κ_h(t). Using Lemma 1 we can locate the zero-crossing-time of κ̂_h(t), which occurs at t = λ, with an accuracy of order δ, δ < h. This is done by minimising |κ̂_h(t)| within the interval A_h:

λ̂ := arg min_{t∈A_h} |κ̂_h(t)|.

By comparing (12) with the bounds in Lemma 1 we see that the minimum is well defined if both inequalities in (33) hold. We obtain the best possible accuracy by choosing δ as small as possible while both inequalities of (33) still hold. The left hand expression of (33) implies that δ ≍ h^s, and substituting this into the right hand expression of (33) we derive the order of the smallest possible bandwidth, h_* ≍ n^{−1/(2s+1)}.
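The zero-crossing step can be sketched as a one-dimensional search for the point in the bracket where the probe changes sign. The probe normalisation and the stand-in odd kernel with two vanishing moments are assumptions, and the bracket below plays the role of A_h = [t̂_*, t̂^*]:

```python
import numpy as np

def probe(t, y, h, K3):
    # fixed-design probe; (n h^4)^{-1} normalisation is an assumption
    n = len(y)
    x = np.arange(1, n + 1) / n
    return np.sum(K3((x - t) / h) * y) / (n * h ** 4)

# stand-in odd kernel with two vanishing moments (not the class K_s)
K3 = lambda u: np.where(np.abs(u) <= 1.0,
                        (u - 3.0 * u ** 3) * (1.0 - u ** 2) ** 2, 0.0)
rng = np.random.default_rng(5)
n, h, lam = 4000, 0.05, 0.37
x = np.arange(1, n + 1) / n
y = np.abs(x - lam) + 0.01 * rng.standard_normal(n)   # kink at 0.37

# bracket between the localisation extrema (taken as lam -+ h/2 here)
bracket = np.linspace(lam - h / 2, lam + h / 2, 201)
vals = np.array([probe(t, y, h, K3) for t in bracket])
lam_hat = bracket[np.argmin(np.abs(vals))]   # zero-crossing estimate
```

The sign change of the probe pins the kink down far more precisely than the O(h) bracket itself, which is why the zero-crossing step improves on the localisation step.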
We now apply Lemma 1 with δ_* = h_*^s to locate the change point λ in µ_F with an accuracy of order δ_* ≍ n^{−s/(2s+1)}. Remark 2. There are some limitations to the procedure presented thus far in terms of detection. More specifically, depending on the location of λ relative to the grid values in T_n, the detection phase may fail. Indeed, define the closest grid value λ* := arg min_{t∈T_n} |λ − t|. If λ* is too close to λ, that is, |λ − λ*| < δ, then the procedure will not detect a kink, since L_h(t_i) = O(h^{s−3}) for t_i ∈ {λ* − 2h, λ*, λ* + 2h} and consequently κ̂_h(t_i) = O(√|log h|), so (32) will not hold. However, if δ < |λ − λ*| < h, then a kink will be detected since (32) holds and the aforementioned procedures in Sections 4.5-4.6 follow. Furthermore, the limitations imposed by the coarse grid T_n affect only the kink detection step and will not influence the zero-crossing step.

Modified estimator of kink
Recall that θ = Q(λ). In practice the true distribution function F is unknown, so it is estimated in the usual manner by the empirical distribution function F_n(x) = n^{−1} Σ_{i=1}^n ½{X_i ≤ x}, and consequently an estimator of Q is obtained via the empirical quantile function Q_n(·). Estimate θ by θ̂ = Q_n(λ̂). The rate of convergence of this estimator is evaluated below. The rate of convergence in (34) is therefore determined by the maximum of the rate from the generalised quantile process for the design variables and the rate from the initial unscaled kink estimator. Under Assumption (A), the quantile process involves independent and identically distributed design variables and (35) holds for all t ∈ (0, 1) (see Csörgő (1983) and references therein for a detailed treatment). Under Assumption (B), the rate depends on α_x and (36) holds for all t ∈ (0, 1) (see Theorem 5.1 of Ho and Hsing (1996)). Therefore, using (35) and (36) in (34), where s ∧ r ≥ 3, proves Theorem 1.
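The back-transformation θ̂ = Q_n(λ̂) can be sketched directly: sort the design variables and read off the empirical quantile. The estimate λ̂ = 0.37 below is a placeholder, and the uniform design (for which Q is the identity) is chosen only so the output is easy to check.

```python
import numpy as np

def empirical_quantile(x, u):
    """Q_n(u) = inf{ s : F_n(s) >= u }, the empirical quantile function
    of the sample x, evaluated at u in (0, 1]."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    k = max(int(np.ceil(n * u)) - 1, 0)  # 0-based order-statistic index
    return xs[min(k, n - 1)]

rng = np.random.default_rng(0)
X = rng.uniform(size=10_000)        # design variables; here Q(u) = u
lam_hat = 0.37                      # hypothetical kink estimate, F-scale
theta_hat = empirical_quantile(X, lam_hat)   # theta_hat = Q_n(lam_hat)
```

Because Q_n converges to Q at the quantile-process rate, the error in θ̂ is driven by the slower of the two rates in (34), exactly as the proof of Theorem 1 records.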
Remark 3. The method can be extended to the multiple kink scenario by observing multiple instances of (32). For each instance of (32) there is a corresponding interval A_h, and the localisation and zero-crossing-time steps are executed on each of those intervals to produce an estimate of each kink location. The interested reader is referred to Cheng and Raimondo (2008) and Wishart (2009) for a more detailed treatment of the method in the multiple kink scenario with numerical examples. However, it is worth pointing out that there are some limitations to the accuracy of this method in this situation. Problems arise if the multiple change-points are not well separated, in the sense that they lie within order h of each other. To see this, let λ_1 and λ_2 be two such change-points. When t is within order h of both change points, the localisation term L_h(t) will not produce two distinct disjoint signals for the kinks. Instead, the signals generated by K_1((λ_i − t)/h) for i = 1, 2 will interact and be confounded into one overlapping signal.

Mathematical Appendix
Before giving the proofs, some notation is described. Let X denote a random variable and write the L_p-norm as ||X||_p^p = E|X|^p, with ||·|| = ||·||_2. For a function f : X → R denote the sup-norm |f|_∞ = sup_{x∈X} |f(x)|. Throughout this section a Taylor expansion of composite functions will be used to exploit the vanishing moment condition of K_3. For the Taylor expansion to be well defined, the derivatives of the composite functions need to exist. A generalised chain rule for composite functions exists (see the Faà di Bruno formula in Hernández Encinas, Martín del Rey and Muñoz Masqué (2005) and references therein) of the form (37), where K_n = {k_i ∈ Z_+ ∪ {0} : k_1 + 2k_2 + · · · + nk_n = n} and k = Σ_{i=1}^n k_i. Also, through tedious but elementary calculus it can be shown that the n-th derivative of Q = F^{−1} will exist, and the Taylor expansions of µ_F and σ_F up to order n will exist, provided f^{(n)} exists.
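For completeness, the standard statement of the Faà di Bruno formula consistent with the index set K_n and the exponent k defined above is:

```latex
(f \circ g)^{(n)}(x)
  \;=\; \sum_{\mathcal{K}_n} \frac{n!}{k_1!\,k_2!\cdots k_n!}\,
        f^{(k)}\bigl(g(x)\bigr)
        \prod_{j=1}^{n} \left( \frac{g^{(j)}(x)}{j!} \right)^{k_j},
\qquad k = \sum_{i=1}^{n} k_i .
```

The sum runs over all non-negative integer solutions of k_1 + 2k_2 + · · · + nk_n = n, which is exactly the set K_n used in (37).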
Proof of Lemma 2. Begin with the proof of the first claim under Assumption (A). Since γ*_i(t) is non-zero only if F(X_i) ∈ (t − h, t + h), there exists a τ_i ∈ (−1, 1) depending on X_i such that F(X_i) = t + τ_i h, and ξ_i depends on τ_i. The ν_i(t) terms are independent random variables, each of which has variance of order h. Therefore, by the Law of the Iterated Logarithm (see Bingham (1986)), the first claim of the Lemma follows. Now to concentrate on the claim for Assumption (B), a proof of a similar claim in Lemma 4 of Zhao and Wu (2006) is adapted to our framework. This technique bounds the martingale difference sequence γ*_i(t) − E[γ*_i(t) | F_{i−1}] above and below by two discretised martingale difference sequences and uses an exponential martingale inequality to obtain the required probabilistic bounds. To do this, again exploit the Taylor expansion of µ in Definition 1 and use the fact that Support(K_3) = [−1, 1], which means that there exists a τ_i depending on X_i with |τ_i| ≤ 1 such that F(X_i) = t + τ_i h and (38) holds, where |ξ| ≤ 1. Then split the function in (38) into its positive and negative parts, defining ξ_i := t + ξ|τ_i|h and letting superscripts + and − denote the respective positive and negative parts. Then, by the linearity of the conditional expectation operator and (39), the martingale difference sequence can be decomposed into parts as in (40). To begin with, concentrate on the first martingale difference term on the RHS of (40) and bound it above and below by a discretised version that does not depend on t directly. For this discretisation let N = ⌈(nh^{−3})^{1/2}⌉ and t_j = j/N, where 0 ≤ j ≤ N. Then for any t ∈ [0, 1] there exists a j such that t ∈ [t_j, t_{j+1}), and the distance |t_{j+1} − t_j| = O(N^{−1}).
Define two new tweaked martingale difference sequence versions of ς_i^{++}(t). It can be shown that the martingale difference sequence ς^{++} can be bounded uniformly in t, above and below, by these two versions, giving the following result, where for each fixed j, S̄_n(j) and S_n(j) are martingales with respect to the filtration F_n. These martingales will be bounded by an exponential martingale inequality. Consider first the martingale S̄_n(j): its martingale differences are bounded and, using the Lipschitz property of Q and the bounded domain of K_3, its conditional variance is controlled. Then a martingale inequality for bounded differences, given by Theorem 1.5A of de la Peña (1999), can be used to yield a tail bound with a = C_b h and y = C_cv nh^3. Furthermore, if ax/2y = o(1), then a Taylor expansion of sinh^{−1} simplifies the bound. Now consider the chance that max_{1≤j≤n} S̄_n(j) exceeds a threshold of order x = C_T √(nh^3 |log h|) for some C_T > 0, which combined with a = C_b h and y = C_cv nh^3 implies ax/2y = O(√(|log h|/nh)) = o(1), and the bound follows by (41) and (42). So, fix ε > 0 and use (43). Choosing C_T large enough ensures that the resulting bound is o(1). A similar conclusion can be reached that for any ε > 0 there exists a finite constant C such that (45) holds. Therefore, (44) and (45) give the required bound. Using a comparable argument, the same conclusion can be reached for S_n(j). Also, a similar technique can be used to bound the other martingale difference terms given in (40); the details are omitted.
Proof of Theorem 2. To prove the Theorem we appeal to similar results shown by Kulik (2008) and Wu and Mielniczuk (2002), decomposing the stochastic terms into two parts: a martingale part and a LRD part. This is done by defining the relevant terms and then decomposing the standardised estimator κ̂_h(t) into two terms as in (46). The Theorem will follow by showing that either the first or the last term on the RHS of (46) dominates, under the bandwidth conditions (A1) or (A2) respectively.
More specifically, it will be shown that the dominating term follows a CLT and the other term converges to zero in probability; Slutsky's Theorem then completes the proof. First consider the case where (A1) holds, and apply the martingale CLT of Brown (1971). Note that {χ_i(t), G_i} form a martingale difference sequence. So it remains to check that the sum of the conditional variances converges in probability to the unconditional sum and that the Lindeberg condition holds. Before proving the Lindeberg condition, note that (48) holds for t ∈ (h, 1 − h). Exploiting (10) and the assumption that σ ∈ G_r gives (49), where τ ∈ (0, 1). Therefore, using (48) and (49), and since the bandwidth satisfies h ∈ (0, 1), there exists an h_0 such that the bounds hold for all 0 < h ≤ h_0 < 1. From (48), it follows that h^{−1} Eζ_1^2(t) → σ_F^2(t) ∫_{−1}^{1} K_3^2(u) du, and a similar limit follows from (49). Also, the same argument applies to the γ_i(t) term. Now the Lindeberg condition is shown to hold. Let ε > 0 be arbitrary and set A_n = {|χ_1(t)| > ε}. The size of this set can be bounded using (50). Using the fact that nh → ∞ and h → 0 as n → ∞, we see that A_n → ∅, the empty set. Consequently, (51), (52) and nEχ_1^2(t) < ∞ imply that the Lindeberg condition holds. By a consequence of (11), for arbitrary ε > 0 both of the relevant terms are o(1). So the sum of the conditional variances converges in probability to one, and by the martingale CLT, (47) follows. Now we show that the last term on the RHS of (46) converges in probability to zero. Consider an arbitrary ε > 0; then, using (49) and (11), the probability bound follows, with the last line a consequence of the bandwidth restriction given in (A1). Thus the proof of the first claim, under the 'small' bandwidth scenario, holds.
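The martingale CLT of Brown (1971) invoked here can be stated, in a standard form, as follows: for a martingale difference array {χ_{n,i}} adapted to {F_{n,i}}, the convergence of the conditional variances together with the conditional Lindeberg condition yields asymptotic normality of the sum.

```latex
\sum_{i=1}^{n} \mathbb{E}\!\left[\chi_{n,i}^{2} \,\middle|\, \mathcal{F}_{n,i-1}\right]
   \xrightarrow{\;P\;} \sigma^{2},
\qquad
\sum_{i=1}^{n} \mathbb{E}\!\left[\chi_{n,i}^{2}\,
   \mathbb{1}\{|\chi_{n,i}| > \varepsilon\} \,\middle|\, \mathcal{F}_{n,i-1}\right]
   \xrightarrow{\;P\;} 0 \quad \text{for all } \varepsilon > 0,
```

under which Σ_{i=1}^n χ_{n,i} converges in distribution to N(0, σ²). These are exactly the two conditions verified in the argument above.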
Consider now the 'large' bandwidth scenario. Using (46), (47) and (49), together with the known result from Ho and Hsing (1997), and normalising the expression in (53), the result follows from (A2) and (54) with Slutsky's Theorem.

Proof of Theorem 3. First break the estimator into its separate martingale and LRD parts, in a similar fashion to the method employed in the proof of Theorem 2. Using (30), apply Lemma 2 and define the standardised stochastic terms. Then, in a similar fashion to the proof of Theorem 2, (56) will be shown by the martingale CLT of Brown (1971). Indeed, ∆_i(t) is a martingale difference sequence with respect to the σ-fields {F_i}. Thus we need to check that the Lindeberg condition holds and that the sum of the conditional variances converges in probability to 1. First, focus on the convergence of the conditional variances. The conditional variances can be broken into two parts as in (57). For the second term on the RHS of (57), use Lemma 1 of Zhao and Wu (2008) to obtain (58). A bound on the conditional expectation of K_3 is required to deal with the first term of (57). Define X_{i,i−1} := X_i − η_i = µ_X + Σ_{j=1}^∞ c_j η_{i−j} and Z_i := s_X^{−1}(X_{i,i−1} − µ_X), note that the conditional density satisfies f_X(x | F_{i−1}) = f_η(x − X_{i,i−1}), and let g(x) = 1/x. Then X_{i,i−1} and Z_i are F_{i−1}-measurable, and for all t ∈ (h, 1−h) the conditional expectation can be evaluated as in (59).
Use a Taylor expansion of the composite functions p(t) := (f_η ∘ Q)(t) and q(t) := (g ∘ f_X ∘ Q)(t) via the Faà di Bruno chain rule given in (37), starting with the latter: the expansion (60) holds, where |τ| < 1, and the intermediate derivatives for j = 0, 1, . . . , s ∧ r are controlled by the restrictions imposed in Assumption (B). Similarly, (61) holds, where |δ| ≤ 1. Therefore, using (61) and (60) in (59) with the vanishing moment condition (10) implies (62). Moreover, by Assumption (B), f_η^{(j)} and Q are Lipschitz continuous for j = 0, . . . , s and therefore bounded. Consequently, p^{(j)} and q^{(j)} are also bounded uniformly in t. Then Eg(X_{i,i−1}, t) = 0 and, by Jensen's Inequality, E[K_3(X_{i,i−1}, t)^2] < ∞. An application of Theorem 1 of Wu (2007) will be used, with ϑ_i measuring the physical dependence. To bound ϑ_i, let η′_0 be an i.i.d. copy of η_0 and define X*_{i,i−1} = X_{i,i−1} − c_i η_0 + c_i η′_0 with the associated sigma field F*_i = σ(η_i, η_{i−1}, . . . , η_1, η′_0, η_{−1}, . . . ; ε_1, . . . , ε_i). Then by Theorem 1 of Wu (2005) there is a bound ϑ_i ≤ sup_{t∈(h,1−h)} ||g(X_{i,i−1}, t) − g(X*_{i,i−1}, t)||. Using this, (62) and the Lipschitz property of f_η, it can be shown that ϑ_i ≤ C h^{s∧r+2} i^{−β} L(i), where the last step follows from the Lipschitz property of Q and the bounded domain of K_3. Then, by Theorem 1 of Wu (2007) and Karamata's Theorem, combined with (62), the first term on the RHS of (57) can be bounded by (63) and a similar application of Lemma 1 of Zhao and Wu (2008), giving (64). Substituting (64) and (58) into (57) yields the required convergence of the conditional variances. For the Lindeberg condition, let ε > 0 and define A_n = {|∆_1(t)| > ε}; then, similar to the procedure used in the proof of Theorem 2, it can be shown that A_n → ∅ and the Lindeberg condition holds. Thus, by the martingale CLT, (56) holds, and by using (B1) in the decomposition given in (55), the result follows by Slutsky's Theorem.
Proof of Theorem 4. Again, use the decomposition (55) from the proof of Theorem 3. Then define the standardised process.
It will be shown via a Hermite expansion of the LRD variables that (65) holds. To do this, split the LRD variable X_i into two parts. Then clearly EG(Z_i, t) = 0 and, by Jensen's inequality, E[G(Z_i, t)^2] < ∞. So by Taqqu (1975), G(Z_i, t) can be re-expressed by its Hermite expansion, where a_m is the m-th Hermite coefficient. For our case it is assumed that a_1 ≠ 0. Evaluating a_1 and exploiting the Faà di Bruno formula further, it can be shown via Taylor expansions that the asymptotic behaviour of a_1 satisfies the stated rate. From Corollary 5.1 of Taqqu (1975), the normalised partial sums converge. Therefore (65) holds by Slutsky's Theorem in the decomposition given in (55), in conjunction with (56), (65) and (B2).
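Concretely, for a standard Gaussian Z_i and EG(Z_i, t)² < ∞, the Hermite expansion from Taqqu (1975) takes the form below, with H_m the Hermite polynomials; in the rank-one case used here the leading coefficient a_1 does not vanish, so the first term drives the limit.

```latex
G(Z_i, t) \;=\; \sum_{m \ge 1} \frac{a_m(t)}{m!}\, H_m(Z_i),
\qquad
a_m(t) \;=\; \mathbb{E}\bigl[\, G(Z_i, t)\, H_m(Z_i) \,\bigr].
```

The smallest m with a_m ≠ 0 is the Hermite rank of G(·, t), and it determines both the normalisation and the nature of the limit of the partial sums in Corollary 5.1 of Taqqu (1975).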
Proof of Theorem 6. The proof of the Theorem uses moderate deviation inequalities from Grama and Haeusler (2006) similar to those used in the proof of Theorem 5; however, a slight modification is needed. Firstly, fix k ∈ N and choose distinct integers 1 ≤ j_1, j_2, . . . , j_k < m_n, then modify S_n^B(t) to obtain a martingale by adding and subtracting the conditional expectation. With this definition, (S_n^{B*}(t), F_n)_{n∈Z_+} is a martingale, and the proof of the result will follow if it can be shown that the suprema over T_n of both terms of (66) are o_p(√|log h|). Starting with the latter term, from (62), and since k is fixed, E sup_{t_{j_r}} |R_n^{B*}(t)|^2 = O(nh^{2(s∧r)+1}) = o(1), so by the Chebyshev inequality, P(sup_{t_{j_r}} |R_n^{B*}(t_{j_r})| ≥ 2√|log h|) = o(1).
Now turn attention to the first term on the RHS of (66) and apply a moderate deviation martingale result from Corollary 2 of Grama and Haeusler (2006). To use their Corollary, a bound is needed on the trace norm of the quadratic characteristic matrix of the martingale and a bound on the Euclidean norm of the martingale difference sequence; these will be investigated, starting with the former. For a symmetric k × k matrix U define the trace norm ||U||_tr := Σ_{i=1}^k |e_i|, where the e_i are the eigenvalues of U. For a sequence x = {x_1, x_2, . . . , x_n} define the usual Euclidean norm |x|_2 := (Σ_i x_i^2)^{1/2}. Now let Q be the quadratic characteristic matrix of S_{n,k}^{B*}(t), as in (68). By a similar domain argument to that presented in the proof of Theorem 5, if r ≠ r′, the first and second terms on the RHS of (68) are zero. Using this fact with (62), it follows that the corresponding entries are controlled for r ≠ r′. On the other hand, if r = r′, then (70) holds. Define the eigenvalues of Q to be e_1 ≤ e_2 ≤ · · · ≤ e_k. To evaluate the trace norm of Q, use (58) and (64) in (70). Then consider the third power of the resulting statement and expand. To bound the expectation of the expanded term, it is sufficient to look at the higher order expectation terms and apply the Lyapunov inequality. Starting with the expected value of the suprema of the kernel function, use the fact that the kernel function has support on [−1, 1] and that t_{j_r} − t_{j_{r′}} ≥ 2h for r ≠ r′.