Consistent nonparametric change point detection combining CUSUM and marked empirical processes

A weakly dependent time series regression model with multivariate covariates and univariate observations is considered, for which we develop a procedure to detect whether the nonparametric conditional mean function is stable in time against change point alternatives. Our proposal is based on a modified CUSUM-type test procedure, which uses a sequential marked empirical process of residuals. We show weak convergence of the considered process to a centered Gaussian process under the null hypothesis of no change in the mean function and a stationarity assumption. This requires some sophisticated arguments for sequential empirical processes of weakly dependent variables. As a consequence we obtain convergence of Kolmogorov-Smirnov and Cramér-von Mises type test statistics. The proposed procedure has a very simple limiting distribution and desirable consistency properties, features that related tests lack. We moreover suggest a bootstrap version of the procedure and discuss its applicability in the case of unstable variances.


Introduction
Assume a finite sequence (X_t, Y_t), t = 1, ..., n, of a weakly dependent R^d × R-valued time series has been observed. Here, we interpret X_t as a covariate (which may contain past values of the process), and it is assumed that the conditional expectation of the observation Y_t, given X_t and all past values of the time series, depends only on the covariate X_t and is thus a function m_t(X_t). We do not impose any parametric structure on the regression function. For inference on the time series it is important to know whether the regression function is time dependent or not, i.e. the hypothesis H_0: m_t(X_t) = m(X_t) a.s. for all t = 1, ..., n (for some not further specified function m) should be tested against structural changes over time such as change point alternatives.
Literature on such tests for nonparametric regression functions is rare in the time series context. Both Hidalgo (1995) and Honda (1997) suggested CUSUM tests for change points in the regression function in nonparametric time series regression models with strictly stationary and absolutely regular data. Su and Xiao (2008) extended these tests to strongly mixing and not necessarily stationary processes, allowing for heteroscedasticity, while Su and White (2010) proposed change point tests in partially linear time series models. Vogt (2015) constructed a kernel-based L 2 -test for structural change in the regression function in time-varying nonparametric regression models with locally stationary regressors.
We will combine the CUSUM approach as considered by Hidalgo (1995), Honda (1997), and Su and Xiao (2008) with a marked empirical process approach. Marked empirical processes were suggested in a seminal paper by Stute (1997) for lack-of-fit testing in nonparametric regression models with i.i.d. data. Since then they have been widely used for hypothesis testing in regression models; see Koul and Stute (1999) and Delgado and Manteiga (2001), among many others. A marked empirical process approach was applied by Burke and Bewa (2013) for change point detection in an i.i.d. setting. In contrast to our approach, they use a process of observations instead of residuals, which has a very complicated limit distribution, whereas we obtain a simple limit distribution and even asymptotically distribution-free tests in the case of one-dimensional covariates. To this end we show weak convergence of a sequential marked empirical process of residuals under the null hypothesis. We further demonstrate consistency under the fixed alternative of one structural break in the regression function at some time point ⌊ns_0⌋ for n → ∞.
Moreover we suggest a wild bootstrap version of our test that can be applied to detect changes in the mean function in the case of stable variances (as an alternative to using the asymptotic distribution, e.g. for multivariate covariates) as well as in the case of non-stable variances. Wild bootstrap was first introduced by Wu (1986) and Liu (1988) for linear regression with heteroscedasticity. It was used in the time series context by Kreiß (1997) and Hafner and Herwartz (2000), among others. The bootstrap version of our test can detect changes in the conditional mean function even when the conditional variance function is also not stable, but, as desired, the test is not sensitive to the unstable variance. If no change in the mean function is detected, a test for a change in the variance function, which assumes a stable mean function, can be applied. The latter approach will be considered in detail in a forthcoming manuscript. Most of the literature assumes stationary variances of the error terms (unconditional or conditional) when testing for changes in regression. However, as Wu (2016) pointed out, non-stationary variances can occur and will most likely result in misleading inferences when not taken into account. Although this is a legitimate concern, not many results are available that deal with non-stationary variances. The CUSUM test by Su and Xiao (2008) allows for breaks in the conditional variance function, but their procedure only seems to work for fixed breaks that do not depend on the sample size, whereas we consider changes of the variance at some time point ⌊nt_0⌋ for n → ∞. There are some approaches for testing for parameter stability in parametric time series models that consider unstable variances; see Pitarakis (2004), Perron and Zhou (2008), Kristensen (2012), Cai (2007), Xu (2015) and Wu (2016).
None of the settings considered there fits into our framework, however: they either do not allow for autoregression models (by assuming stationarity of the regressor variables under the null), or they do not cover heteroscedastic effects. More precisely, where heteroscedasticity is considered, variance instabilities are not modeled in the conditional variance function but as a time-varying constant.
The paper is organized as follows. In section 2 we present the model and the sequential marked empirical process on which the test statistics are based; further, the assumptions are listed. In section 3 we consider the limit distribution under the null hypothesis as well as consistency under the fixed alternative of one change point. The wild bootstrap version of the procedure is discussed in section 4, whereas simulations and a real data example are presented in section 5. Section 6 contains concluding remarks, and proofs are presented in the appendix. Some technical details and additional simulation results are deferred to a supplement.

The model and test statistic
Let (Y_t, X_t)_{t∈Z} be a weakly dependent stochastic process in R × R^d following the regression model

Y_t = m_t(X_t) + U_t, t ∈ Z.

The covariate X_t may include finitely many lagged values of Y_t, for instance X_t = (Y_{t−1}, ..., Y_{t−d}), such that the model includes nonparametric autoregression. The unobservable innovations (U_t)_{t∈Z} are assumed to fulfill E[U_t | F_t] = 0 almost surely for the sigma-field F_t = σ(U_{j−1}, X_j : j ≤ t). Our assumptions on the innovations are very weak; in particular heteroscedastic models are covered. Assuming (Y_1, X_1), ..., (Y_n, X_n) have been observed, our aim is to test the null hypothesis

H_0: m_t(·) = m(·), t = 1, ..., n,

for some not further specified function m: R^d → R not depending on the time of observation t. To test H_0, we define the sequential marked empirical process of residuals as

T̂_n(s, z) = n^{−1/2} ∑_{i=1}^{⌊ns⌋} (Y_i − m̂_n(X_i)) ω_n(X_i) I{X_i ≤ z}   (2.1)

for s ∈ [0, 1] and z ∈ R^d, where ω_n(·) = I{· ∈ J_n} with J_n from assumption (J) below. Throughout, x ≤ y is short for x_j ≤ y_j for all j = 1, ..., d, and we use the notations ⌊x⌋ = max{k ∈ Z : k ≤ x} for x ∈ R and x ∧ y = (min{x_1, y_1}, ..., min{x_d, y_d}). The regression function m is estimated by the Nadaraya-Watson estimator

m̂_n(x) = ∑_{j=1}^n Y_j K((x − X_j)/h_n) / ∑_{j=1}^n K((x − X_j)/h_n)   (2.2)

with kernel function K and bandwidth h_n as considered in the assumptions below. The proposed test is a modification of the CUSUM test in Su and Xiao (2008). They consider the process

n^{−1/2} ∑_{i=1}^{⌊ns⌋} (Y_i − m̂_n(X_i)) f̂_n(X_i) w(X_i),

where w: R^d → R is a weighting function and f̂_n is the kernel density estimator. While the factor f̂_n is included for technical reasons, as it avoids small random values in the denominator of m̂_n, the weighting function w plays a crucial role for the power of their test (see the remarks on Theorem 3.2 in Su and Xiao (2008)). Depending on the alternative, w needs to be chosen appropriately, and their test (in contrast to the one based on the sequential marked process) is not consistent in general.
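For the case d = 1, the sequential marked empirical process of residuals and the underlying Nadaraya-Watson estimator can be sketched numerically as follows. This is a minimal illustration, not the paper's implementation: a Gaussian kernel and ω_n ≡ 1 stand in for the higher-order kernel and trimming used later, and all function names are hypothetical.

```python
import numpy as np

def nw_estimator(x, X, Y, h):
    """Nadaraya-Watson estimator evaluated at points x (Gaussian kernel
    as an illustrative stand-in for the higher-order kernel of the paper)."""
    w = np.exp(-0.5 * ((np.asarray(x)[None, :] - X[:, None]) / h) ** 2)
    return (w * Y[:, None]).sum(axis=0) / w.sum(axis=0)

def marked_process(X, Y, h, s_grid, z_grid):
    """Sequential marked empirical process of residuals for d = 1,
    with the trimming weight omega_n taken to be identically 1."""
    n = len(Y)
    U_hat = Y - nw_estimator(X, X, Y, h)      # residuals Y_i - m_hat_n(X_i)
    T = np.empty((len(s_grid), len(z_grid)))
    for a, s in enumerate(s_grid):
        k = int(np.floor(n * s))              # sum over i = 1, ..., floor(n s)
        for b, z in enumerate(z_grid):
            T[a, b] = U_hat[:k][X[:k] <= z].sum() / np.sqrt(n)
    return T

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=n)
Y = np.sin(X) + 0.3 * rng.normal(size=n)      # stable mean function, i.e. H_0 holds
T = marked_process(X, Y, h=0.5,
                   s_grid=np.linspace(0.1, 1.0, 10),
                   z_grid=np.linspace(-2.0, 2.0, 21))
```

Evaluating T on a grid in (s, z) is all later test statistics need, since they are suprema or integrals of |T̂_n| over such grids.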
Under the null hypothesis H_0 we formulate the following assumptions in order to derive the limiting distribution of T̂_n and corresponding test statistics in the next section.
(M) For some b > 2 let E[|Y_1|^b] < ∞ and let X_1 be absolutely continuous with bounded density function f; further let the joint densities f_{1j} of (X_1, X_j) be uniformly bounded, where f_{1j} is the density function of (X_1, X_j).
(J) Let (c n ) n∈N be a positive sequence of real valued numbers satisfying c n → ∞ and c n = O((log n) 1/d ) and let J n = [−c n , c n ] d .
(F1) For some C < ∞ and c_n from assumption (J) let I_n = [−c_n − Ch_n, c_n + Ch_n]^d and let δ_n^{−1} = inf_{x∈J_n} f(x) > 0 for all n ∈ N. Further, let for some r, l ∈ N and for all n ∈ N the sequences p_n and q_n be defined as the corresponding maxima.

(F2) For q_n from assumption (F1), c_n from assumption (J) and C from assumption (K). (Here, r, l and C are from assumption (F1).)

(B1) For δ_n, p_n, q_n and r, l from assumption (F1) let

log n/(n h_n^{d+2(l+1)}) + h_n^r p_n p_n^{l+1} δ_n^{l+2} = O(1),

and for some η ∈ (0, 1) let

log n/(n h_n^{d+2(l+1)}) + h_n^r p_n p_n^{l+η} q_n δ_n^{l+1+η} = o(1).
(B2) For l, p_n, q_n, δ_n from assumption (F1) and η from assumption (B1), let h_n satisfy the following conditions.

Remark 2.1. Under the aforementioned assumptions, consistency properties hold for m̂_n uniformly on J_n from assumption (J), which will be shown in section A.1 of the appendix. The key tool here is an application of Theorem 2 in Hansen (2008). Assumption (G) implies the polynomial mixing rates of the underlying process needed in Hansen (2008). Moreover, together with the first bandwidth condition in (B2), the bandwidth constraints in Hansen (2008) are also fulfilled. Assumption (M) and parts of (K) are reproduced from the aforementioned paper. In order to satisfy the first bandwidth condition in (B2), a necessary condition on the smoothness of f and m is l + η > d, meaning that for a higher dimensional covariate X_t the existence of higher order partial derivatives of f and m is needed. In order to satisfy both the first and the third bandwidth condition in (B2) at the same time, the order of the kernel needs to be large, in particular r > (d/2)(l+η)/(l+η−d). The second bandwidth condition in (B2) is implied by the first one if the bandwidth h_n has a polynomial rate of decay in n (or slower), i.e. if there exists a k ∈ (0, ∞) such that h_n = O(n^{−k}). Note that k < 1/d − 1/(l+η) is then necessary.

Asymptotic results
To derive the asymptotic distribution of test statistics built from the sequential marked empirical process T̂_n defined in (2.1), we apply the following expansion, which uses Y_i = m(X_i) + U_i for all i = 1, ..., n under the null hypothesis:

T̂_n(s, z) = A_{n1}(s, z) + A_{n2}(s, z), where
A_{n1}(s, z) = n^{−1/2} ∑_{i=1}^{⌊ns⌋} (m(X_i) − m̂_n(X_i)) ω_n(X_i) I{X_i ≤ z},   (3.1)
A_{n2}(s, z) = n^{−1/2} ∑_{i=1}^{⌊ns⌋} U_i ω_n(X_i) I{X_i ≤ z}.   (3.2)

Lemma A.3 in the appendix shows that A_{n2}(s, z) = T_n(s, z) + o_P(1) uniformly in s ∈ [0, 1] and z ∈ R^d with the process

T_n(s, z) = n^{−1/2} ∑_{i=1}^{⌊ns⌋} U_i I{X_i ≤ z}.   (3.3)

Further, Lemma A.2 in the appendix gives an asymptotic representation of A_{n1} that holds uniformly in s ∈ [0, 1] and z ∈ R^d. Inserting the definition of m̂_n from (2.2), one obtains one term that is negligible by Lemma A.4 and one term that can be expanded further by applying Lemmata A.5 and A.3, such that one obtains

A_{n1}(s, z) = −(⌊ns⌋/n) T_n(1, z) + o_P(1)

uniformly in s ∈ [0, 1] and z ∈ R^d. From this the expansion given in the first part of Theorem 3.1 below follows. In the second part of the theorem, weak convergence of T_n from (3.3) is stated.

Theorem 3.1. (i) Suppose that the assumptions of section 2 are satisfied. Then under H_0 the expansion

T̂_n(s, z) = T_n(s, z) − (⌊ns⌋/n) T_n(1, z) + o_P(1)

holds uniformly in s ∈ [0, 1] and z ∈ R^d.
(ii) Suppose that the assumptions (G) and (U) are satisfied. Then under H_0 the process T_n converges weakly in ℓ^∞([0, 1] × R^d) to a centered Gaussian process G with

Cov(G(s_1, z_1), G(s_2, z_2)) = (s_1 ∧ s_2) E[U_1^2 I{X_1 ≤ z_1 ∧ z_2}].

The proof of the first part follows from the considerations above, applying Lemmata A.2-A.5 in the appendix, while the proof of the second part is given in section A.2 of the appendix. The proof of the second part in particular makes use of a recent result on weak convergence of sequential empirical processes indexed in function classes that can be applied to strongly mixing sequences; see Mohr (2018b). Note that Koul and Stute (1999) show a weak convergence result applicable to the non-sequential process {T_n(1, z) : z ∈ R} under less restrictive assumptions on the dependence structure and moments (see Lemma 3.1 in the aforementioned reference). From Theorem 3.1 and the continuous mapping theorem one directly obtains the limit distribution of T̂_n:

Corollary 3.2. Under the assumptions of Theorem 3.1 and H_0, the process T̂_n converges weakly in ℓ^∞([0, 1] × R^d) to the centered Gaussian process G_0 with G_0(s, z) = G(s, z) − s G(1, z).
Continuous functionals of the process T̂_n can be used as test statistics for H_0. We consider the following Kolmogorov-Smirnov and Cramér-von Mises type statistics and combinations of both,

T_{n1} = sup_{s∈[0,1]} sup_{z∈R^d} |T̂_n(s, z)|,   T_{n2} = ∫_0^1 sup_{z∈R^d} |T̂_n(s, z)|^2 ds,
T_{n3} = sup_{s∈[0,1]} ∫ |T̂_n(s, z)|^2 v(z) dz,   T_{n4} = ∫_0^1 ∫ |T̂_n(s, z)|^2 v(z) dz ds,

where v: R^d → R is some integrable weighting function. Applying Corollary 3.2 and the continuous mapping theorem gives convergence in distribution of these test statistics.
One can obtain distribution-free tests in the case of dimension d = 1 as follows. Denote by {K_0(s, t) : s ∈ [0, 1], t ∈ R} a Kiefer-Müller process, i.e. a centered Gaussian process with covariance function Cov(K_0(s_1, t_1), K_0(s_2, t_2)) = (s_1 ∧ s_2 − s_1 s_2)(t_1 ∧ t_2). Then K_0(·, Σ(·)) has the same distribution as G_0(·, ·). Let further σ(·) be continuous and consider the consistent estimator ĉ_n = n^{−1} ∑_{i=1}^n (Y_i − m̂_n(X_i))^2 ω_n(X_i) for c = ∫ σ^2(u) f(u) du. Applying a scaling property of the process K_0 in its second component and substitution in the integrals, it is easy to derive convergence in distribution, e.g.

T_{n1}/√ĉ_n → sup_{s∈[0,1]} sup_{t∈[0,1]} |K_0(s, t)|   in distribution.

For the latter two tests, however, the unknown weight function v = σ^2 f needs to be chosen to obtain the limit as stated above. To obtain feasible asymptotically distribution-free tests, T_{n3} and T_{n4} should be replaced by versions T̃_{n3} and T̃_{n4}, applying a nonparametric estimator for the variance function such as the Nadaraya-Watson type estimator based on squared residuals,

σ̂_n^2(x) = ∑_{j=1}^n (Y_j − m̂_n(X_j))^2 K((x − X_j)/h_n) / ∑_{j=1}^n K((x − X_j)/h_n).

To conclude the section, we take a closer look at the alternative of one change point. For reasons of simplicity we only consider the test based on T_{n1}. To model the alternative we assume a triangular array Y_{n,t} = m_{n,t}(X_{n,t}) + U_{n,t}, t = 1, ..., n, and validity of the alternative of one change point, i.e.

H_1: m_{n,t} = m^{(1)} for t ≤ ⌊ns_0⌋ and m_{n,t} = m^{(2)} for t > ⌊ns_0⌋, for some s_0 ∈ (0, 1),   (3.4)
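Since the limit for d = 1 is a supremum of a Kiefer-Müller process over the unit square, critical values for the rescaled statistic T_{n1}/√ĉ_n can be tabulated once by Monte Carlo. The sketch below approximates K_0 on a grid via a Brownian sheet made into a bridge in its second argument; grid size and replication number are illustrative choices, not taken from the paper.

```python
import numpy as np

def kiefer_sup_quantile(level=0.95, n_grid=100, n_rep=2000, seed=1):
    """Monte Carlo approximation of the (level)-quantile of
    sup_{s,t in [0,1]} |K_0(s, t)| for a Kiefer-Mueller process
    (Brownian motion in s, Brownian bridge in t)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(1.0 / n_grid, 1.0, n_grid)
    sups = np.empty(n_rep)
    for r in range(n_rep):
        dz = rng.normal(size=(n_grid, n_grid)) / n_grid   # sheet increments
        W = dz.cumsum(axis=0).cumsum(axis=1)              # Brownian sheet W(s, t)
        K = W - t[None, :] * W[:, [-1]]                   # K_0(s,t) = W(s,t) - t W(s,1)
        sups[r] = np.abs(K).max()
    return float(np.quantile(sups, level))

crit = kiefer_sup_quantile()
```

In an application, the observed sup_{s,z} |T̂_n(s, z)| divided by √ĉ_n would then be compared with `crit`.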
for some not further specified functions m^{(1)} ≢ m^{(2)}. Let f_{n,t} denote the density of X_{n,t} and assume that for all s ∈ (0, 1] there exists a function f̄^{(s)} such that

n^{−1} ∑_{t=1}^{⌊ns⌋} f_{n,t}(x) → s f̄^{(s)}(x).   (3.5)

Under some regularity conditions it can be shown, by applying Kristensen's (2009) uniform convergence results for kernel estimators, that m̂_n is uniformly consistent for a mixture of the regression functions before and after the change. Now for fixed z ∈ R^d and s ∈ (0, 1) with s ≤ s_0, it holds that T̂_n(s, z)/√n converges in probability to an integral involving the difference between m^{(1)} and this mixture. As under H_1 this integral is non-zero for s = s_0 and some z, convergence of T_{n1} to infinity in probability and thus consistency of the test can be deduced.
Remark 3.3. Consider the non-marked CUSUM process T̂_n(s, ∞), which is analogous to Su and Xiao's (2008) procedure. Considerations as above for the fixed alternative H_1 of one change point in ⌊ns_0⌋ lead, for s ≤ s_0, to a limit which, in case of a stationary covariate process, is proportional to an integral of the difference of the regression functions weighted with the covariate density. This integral can be zero even if m^{(1)} ≢ m^{(2)}. Then tests based on the CUSUM process will not be consistent, while tests based on the marked CUSUM process are. We will consider some examples in section 5.
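Remark 3.3 can be illustrated numerically. With a symmetric covariate and a change from m^{(1)}(x) = 0 to m^{(2)}(x) = 1.3x, the difference integrates to zero against the density, so the unmarked CUSUM statistic carries no drift while the marked one does. The sketch below (i.i.d. covariates, Gaussian kernel, hypothetical names, not the paper's exact setup) computes both statistics from the same residuals; since the unmarked statistic equals T̂_n(s, z) evaluated at the largest grid point z, the marked statistic is by construction at least as large.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=n)                         # symmetric covariate
Y = 0.3 * rng.normal(size=n)
Y[n // 2:] += 1.3 * X[n // 2:]                 # change m^(1) = 0 -> m^(2)(x) = 1.3 x
# E[m^(2)(X) - m^(1)(X)] = 0, so the unmarked CUSUM has no drift

# pooled Nadaraya-Watson fit (Gaussian kernel, illustrative bandwidth)
w = np.exp(-0.5 * ((X - X[:, None]) / 0.4) ** 2)
U_hat = Y - (w * Y[:, None]).sum(axis=0) / w.sum(axis=0)

z_grid = np.sort(X)                            # includes max(X), i.e. "z = infinity"
ind = X[:, None] <= z_grid[None, :]
partial = np.cumsum(U_hat[:, None] * ind, axis=0) / np.sqrt(n)  # T_hat_n(k/n, z)
T_n1 = np.abs(partial).max()                   # marked KS statistic
KS = np.abs(partial[:, -1]).max()              # unmarked CUSUM: z = max(X)
```

In repeated simulations of this kind the marked statistic typically exceeds the unmarked one by a clear margin, reflecting the consistency gap described in the remark.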
A bootstrap procedure and the case of non-stationary variances

As an alternative to the asymptotic test considered in section 3, in this section we suggest a wild bootstrap approach. This resampling procedure can in particular be applied in the case of multivariate covariates, where the critical values for the asymptotic tests based on Corollary 3.2 have to be estimated. Moreover, the bootstrap approach can be applied to obtain a test that detects changes in the conditional mean function even when the conditional variance function is not stable. As desired, the test is not sensitive to the unstable variance. In contrast to the bootstrap approach, the limiting distribution from section 3 cannot be applied in the case of changes in the variance. We consider the model Y_{n,t} = m_{n,t}(X_{n,t}) + U_{n,t}, t = 1, ..., n, with E[U_{n,t} | F_n^t] = 0 and E[U_{n,t}^2 | X_{n,t}] = σ_{n,t}^2(X_{n,t}) a.s. for some functions σ_{n,t}^2: R^d → R and F_n^t := σ(U_{n,j−1}, X_{n,j} : j ≤ t). We assume X_{n,t} to be absolutely continuous with density function f_{n,t}. The model considered in section 2 and the first part of section 3 is the special case where f_{n,t}(·) = f(·) and σ_{n,t}^2(·) = σ^2(·) for all t = 1, ..., n and for some f, σ^2: R^d → R not depending on t and n. Both models allow for heteroscedasticity, but the more general model also allows for possible changes in σ_{n,t}^2, which should not affect the rejection probability of the test for H_0: m_{n,t}(·) = m(·), t = 1, ..., n (for some m not depending on t and n). We again consider the process T̂_n with residuals Û_{n,i} = Y_{n,i} − m̂_n(X_{n,i}). Here m̂_n is defined as in (2.2), but replacing (X_j, Y_j) by (X_{n,j}, Y_{n,j}), j = 1, ..., n.

First define the wild bootstrap innovations as U*_{n,i} = Û_{n,i} ξ_i, i = 1, ..., n, where ξ_1, ..., ξ_n are i.i.d. random variables, independent of the original data, with E[ξ_1] = 0 and E[ξ_1^2] = 1 (for instance Rademacher variables).
Then the bootstrap data fulfilling the null hypothesis are generated by Y*_{n,t} = m̂_n(X_{n,t}) + U*_{n,t}, t = 1, ..., n. Note that if the original data follow an autoregression model, say d = 1 and X_{n,t} = Y_{n,t−1}, by the above choice the resulting bootstrap data do not follow the same structure. As was pointed out by Kreiß and Lahiri (2012), this bootstrap data generation is still a reasonable choice, in particular if the dependence structure of the underlying process does not show up in the asymptotic distribution. Another possibility might be a dependent wild bootstrap as suggested in Shao (2010). The bootstrap residuals are defined as Û*_{n,i} = Y*_{n,i} − m̂*_n(X_{n,i}), where m̂*_n denotes the Nadaraya-Watson estimator calculated from the bootstrap sample. Bootstrap versions T*_{nℓ}, ℓ = 1, ..., 4, are defined analogously to the test statistics T_{nℓ}, ℓ = 1, ..., 4, but based on T̂*_n instead of T̂_n. Then H_0 is rejected if T_{nℓ} is larger than the (1 − α)-quantile of the conditional distribution of T*_{nℓ}, given the original data. To motivate that we obtain a valid procedure (which holds the level asymptotically and is consistent) even in the case of changing variances, we will consider the limiting process G_0 of the original process T̂_n and the conditional limiting process G*_0 of the bootstrap version T̂*_n in subsections 4.1 and 4.2 below. We will see that the processes G_0 and G*_0 coincide under the null hypothesis. Note that some steps of the derivation are explained heuristically; rigorously deriving the weak convergence would require a limit theorem for sequential empirical processes indexed in function classes for weakly dependent non-stationary data. Such a result is, to the best of our knowledge, not yet available in the literature, and thus a rigorous proof is beyond the scope of the paper (see Mohr (2018b) for a related limit theorem that requires stationarity).
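The resampling scheme can be sketched as follows for d = 1. This is a minimal illustration under simplifying assumptions (Gaussian kernel instead of the paper's higher-order kernel, ω_n ≡ 1, Rademacher multipliers as one common choice of ξ_i); all function names are hypothetical.

```python
import numpy as np

def nw_fit(X, Y, h):
    """Nadaraya-Watson fit at the sample points (Gaussian kernel, illustrative)."""
    w = np.exp(-0.5 * ((X - X[:, None]) / h) ** 2)
    return (w * Y[:, None]).sum(axis=0) / w.sum(axis=0)

def marked_ks_stat(X, Y, h=0.5):
    """Marked KS statistic: sup over (s, z) of |T_hat_n(s, z)| on the sample grid."""
    n = len(Y)
    U = Y - nw_fit(X, Y, h)
    ind = X[:, None] <= np.sort(X)[None, :]
    return np.abs(np.cumsum(U[:, None] * ind, axis=0)).max() / np.sqrt(n)

def wild_bootstrap_pvalue(X, Y, h=0.5, B=199, rng=None):
    """Wild bootstrap p-value: residuals are multiplied by i.i.d. Rademacher
    signs and bootstrap data are regenerated under H_0 from the pooled fit."""
    if rng is None:
        rng = np.random.default_rng()
    m_hat = nw_fit(X, Y, h)
    U_hat = Y - m_hat
    T_obs = marked_ks_stat(X, Y, h)
    count = 0
    for _ in range(B):
        xi = rng.choice([-1.0, 1.0], size=len(Y))  # multipliers, mean 0, variance 1
        Y_star = m_hat + U_hat * xi                # bootstrap sample fulfilling H_0
        if marked_ks_stat(X, Y_star, h) >= T_obs:
            count += 1
    return (count + 1) / (B + 1)

rng_demo = np.random.default_rng(2)
X = rng_demo.normal(size=100)
Y = np.sin(X) + 0.3 * rng_demo.normal(size=100)  # data satisfying H_0
p = wild_bootstrap_pvalue(X, Y, B=99, rng=rng_demo)
```

H_0 would be rejected at level α whenever `p < α`; since the multipliers keep the conditional second moments of the residuals, a change in the variance function alone should not drive the rejection.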

Asymptotics for non-homogeneous variances
Heuristically, under H_0 one can proceed as in the proof of the first part of Theorem 3.1 in the beginning of section 3. Again one has the expansion T̂_n(s, z) = A_{n2}(s, z) + A_{n1}(s, z); however, similarly to Lemma A.2 in the appendix, one now obtains a modified representation uniformly in s ∈ [0, 1] and z ∈ R^d, under suitable regularity conditions and under the assumption that the limit f̄^{(s)} as in (3.5) exists. Inserting the definition of m̂_n analogously to Lemmata A.4 and A.5, this results in one negligible term and one term involving the weight functions ḡ^{(s)}. Further, Lemma A.3 stays valid in analogous form. Thus one obtains an expansion of T̂_n in terms of a process Γ_n. In what follows we assume that the relevant limits exist for all s ∈ (0, 1] and that the process Γ_n converges weakly to a centered Gaussian process Γ (a proof would require a weak convergence result for sequential empirical processes indexed in general function classes and with a weakly dependent and non-stationary underlying triangular array process). With the continuous mapping theorem, the weak convergence of T̂_n to a centered Gaussian process G_0 then follows. Note that this is consistent with the stationary case, as then h̄^{(s)}(·) = s σ^2(·) f(·) and ḡ^{(s)}(·) = s, and the same covariance function as in Corollary 3.2 is obtained. The convergence of the test statistics T_{nℓ}, ℓ = 1, ..., 4, in distribution follows again from the continuous mapping theorem.
Under the change point alternative H_1 from (3.4) with m^{(1)} ≢ m^{(2)}, analogously to the considerations in section 3, the test statistic T_{n1} converges to infinity in probability.

Derivations for the bootstrap process
Concerning the weak convergence of the bootstrap process T̂*_n, conditionally on the sample, we again use the expansion from the beginning of section 3 underlying the first part of the proof of Theorem 3.1. In what follows let P* denote the conditional probability and E* the conditional expectation, given the observations. Further let Z_n = o_{P*}(1) be short for P*(|Z_n| > ε) = o_P(1) for all ε > 0. Here we obtain an analogous expansion, with f̄^{(s)} as in (3.5). Inserting the definition of m̂*_n, this leads to one term (similar to Lemma A.4) that is negligible and one term (similar to Lemma A.5) involving the weight functions ḡ^{(t)} defined as in section 4.1. Thus one obtains (under suitable regularity conditions) an expansion of T̂*_n in terms of a process Γ*_n. In what follows we will assume that the process Γ*_n, conditionally on the sample, converges weakly to a centered Gaussian process, in probability. Then, by the continuous mapping theorem, T̂*_n conditionally converges weakly to a centered Gaussian process, say G*_0. We will calculate the asymptotic variances in order to show that under H_0 they coincide with the covariances of G_0 as in section 4.1. First note that E*[U*_{n,i} U*_{n,j}] = Û_{n,i}^2 I{i = j} almost surely. Under H_0 it holds that Û_{n,t} = m(X_{n,t}) − m̂_n(X_{n,t}) + U_{n,t} and m̂_n consistently estimates m; thus under H_0 the conditional covariances converge in probability to those of Γ, the weak limit of Γ_n in section 4.1. Thus, under H_0, T̂*_n indeed (presumably) converges weakly to G_0 in probability, and the test statistics T*_{nℓ} converge conditionally in distribution to the same limits as T_{nℓ}, ℓ = 1, ..., 4.
Under the alternative H_1 as in (3.4), Û_{n,i} = m_{n,i}(X_{n,i}) − m̂_n(X_{n,i}) + U_{n,i}, and thus the conditional covariance of the bootstrap process decomposes, for fixed s_1, s_2, t_1, t_2 ∈ [0, 1] and z_1, z_2 ∈ R^d, into several terms. The first term again converges in probability to E[Γ(s_1, t_1, z_1) Γ(s_2, t_2, z_2)]. It can further be shown that the mixed terms converge to zero in probability. However, terms of the form

(1/n) ∑_{i} (m_{n,i}(X_{n,i}) − m̂_n(X_{n,i}))^2 ω_n(X_{n,i}) ḡ^{(t_1)}(X_{n,i}) ḡ^{(t_2)}(X_{n,i}) I{X_{n,i} ≤ z_1 ∧ z_2}

do not vanish but converge to some limit in probability, as m̂_n converges to a mixture of the regression functions before and after the change. Thus the limiting distribution G*_0 under H_1 is not equal to G_0 and in particular depends on the change point s_0. As seen before, under H_1 the original test statistic T_{n1} converges in probability to infinity. On the other hand, the bootstrap test statistic T*_{n1} conditionally converges in distribution to some non-degenerate limit, in probability. Thus the bootstrap test is consistent.

Finite sample properties
A small Monte Carlo study is conducted in order to compare the results for T_{n1} and T_{n2} from section 3 with those of the traditional CUSUM versions, denoted by KS := sup_{s∈[0,1]} |T̂_n(s, ∞)| and CM := ∫_0^1 |T̂_n(s, ∞)|^2 ds. Note that the results for T_{n3} and T_{n4} are similar and omitted for reasons of brevity. Asymptotic tests are applied to data satisfying models 1 and 2, while the bootstrap versions are applied to model 3, explained below. Also note that simulation results for a multidimensional autoregression model can additionally be found in the supplement. All simulations are carried out with a level of 5%, 500 replications, 200 bootstrap replications, and sample sizes n ∈ {100, 300, 500}. For the nonparametric estimators we use a fourth order Epanechnikov kernel and the bandwidth is chosen by the cross-validation method. For simplicity we set ω_n ≡ 1. The data are simulated from the following models.
with ∆_0 ∈ {0, 1.3} and t_0 ∈ {0.25, 0.5, 0.75}. Model 1 is a regression model with autoregressive covariables. In model 2 we consider both a homoscedastic and a heteroscedastic autoregression model, while model 3 is a heteroscedastic autoregression with non-homogeneous variances. All models fulfill H_0 for ∆_0 = 0 and H_1 for ∆_0 ≠ 0, with the change in the regression function occurring at ⌊n/2⌋. Further, note that models 1 and 2 fulfill the stationarity and mixing assumptions under H_0, while model 3 is not stationary, as one change occurs in the conditional variance function under both H_0 and H_1. Hence, for model 3 we apply the bootstrap test from section 4. Figures 1, 2 and 3 visualize the performance of T_{n1} and T_{n2}, as well as KS and CM, in models 1 and 2. Under the null hypothesis the rejection frequencies of all tests are near the nominal level. For model 1 the CUSUM tests are not consistent against H_1, while the tests based on the marked process are. In model 2 the rejection frequencies of all tests increase with increasing break size; note, however, that the increase is much faster for T_{n1} and T_{n2} than for the CUSUM tests. Also note that the influence of the conditional variance is rather small, resulting in a similar performance in the homoscedastic and the heteroscedastic case. Table 1 shows the rejection frequencies of the bootstrap procedure using T*_{n1} and T*_{n2}, as well as the bootstrap versions of the CUSUM tests KS and CM, under both the null and the alternative hypothesis. The level simulations show that all tests perform reasonably well under H_0, approximately holding the level, which indicates that the bootstrap test is, as desired, not sensitive to changes in the conditional variance function. Furthermore, for all models and all tests the rejection frequency under H_1 exceeds the level, indicating that the change point is detected.
With increasing sample size, the number of rejections increases rapidly for T*_{n1} and T*_{n2}, while it stays approximately constant for the bootstrap versions of KS and CM. This is presumably due to the fact that the test statistics based on T̂_n(s, ∞) estimate some integral that might be small under H_1. As was pointed out in subsection 4.2, this integral not vanishing is essential for the consistency of the bootstrap tests. Finally, we apply the asymptotic test based on T_{n1} to 36 measurements of the annual flow volume of the small Czech river Ráztoka recorded between 1954 and 1989. The data set was considered by Hušková and Antoch (2003). We set X_t as the annual rainfall and Y_t as the annual flow volume. The asymptotic test clearly rejects H_0 with a p-value of 0.0006. The change point is estimated by ŝ_n from section 6 and suggests a change in 1979. Note that this is consistent with the literature: as was pointed out by Hušková and Antoch (2003), deforestation had started around that time, which is a possible explanation. Figure 4 shows on the left-hand side the scatterplot of X_t against Y_t, using dots for the observations after the estimated change and crosses for the observations before the estimated change. On the right-hand side the figure shows the cumulative sum s ↦ sup_{z∈R} |T̂_n(s, z)|, as well as the critical value (red horizontal line) and the estimated change (green vertical line).

Concluding remarks
We suggested a new test for structural breaks in the regression function in nonparametric time series (auto-)regression. Our approach combines CUSUM statistics with the marked empirical process approach from goodness-of-fit testing. The considered model is very general and requires neither independent innovations nor homoscedasticity. We show favorable asymptotic properties and demonstrate that the new testing procedures are consistent against fixed alternatives, while the traditional CUSUM tests are not. An estimator for the change point is given by ŝ_n := arg max_{s∈[0,1]} sup_{z∈R^d} |T̂_n(s, z)|. Asymptotic properties of this estimator will be considered in future research. Moreover, we have suggested a bootstrap version that can also be applied to detect changes in the regression function in the presence of changing variance functions. In a forthcoming paper we will consider testing for changes in the variance function.
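The change point estimator ŝ_n can be sketched as follows for d = 1, again with a Gaussian kernel and ω_n ≡ 1 as illustrative simplifications; the 1/√n normalization is omitted since it does not affect the argmax. Function names and the simulated model are hypothetical.

```python
import numpy as np

def estimate_changepoint(X, Y, h):
    """s_hat_n = argmax over s of sup_z |T_hat_n(s, z)|, evaluated on the
    sample grids s = k/n and z = X_(j) (order statistics)."""
    n = len(Y)
    w = np.exp(-0.5 * ((X - X[:, None]) / h) ** 2)      # Gaussian kernel, illustrative
    U_hat = Y - (w * Y[:, None]).sum(axis=0) / w.sum(axis=0)
    ind = X[:, None] <= np.sort(X)[None, :]             # I{X_i <= z} on the z-grid
    partial = np.cumsum(U_hat[:, None] * ind, axis=0)   # sum_{i <= k} U_hat_i I{X_i <= z}
    k_hat = int(np.abs(partial).max(axis=1).argmax()) + 1
    return k_hat / n

rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=n)
Y = np.sin(X) + 0.3 * rng.normal(size=n)
Y[n // 2:] += 1.3                                       # level shift at s_0 = 0.5
s_hat = estimate_changepoint(X, Y, h=0.5)
```

With a break of this size the estimate typically lands close to the true fraction s_0 = 0.5, in line with the consistency heuristics of section 3.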

A Proofs and derivations
In subsection A.1 we give some auxiliary results for the proof of Theorem 3.1. The proof of the first part of the theorem was given in the main text, while the proof of the second part can be found in subsection A.2. Some lemmata are proved in subsection A.3 while some details are deferred to the supplement. Detailed proofs can also be found in Mohr (2018a).

A.1 Auxiliary results
The following assumptions are formulated for the first lemma that gives uniform rates of convergence for the regression estimatorm n from (2.2) and its derivatives. They hold under the assumptions of Theorem 3.1.
(P) Let (Y_t, X_t)_{t∈Z} be a strictly stationary and strongly mixing process with polynomial mixing rate (as implied by assumption (G)).

The proof of Lemma A.1 is analogous to the proof of Theorem 8 of Hansen (2008) and omitted for the sake of brevity. The proofs of the following lemmata are given in subsection A.3.
Lemma A.2. Under the assumptions of Theorem 3.1 (i) and under H_0, we have the following asymptotic representation for A_{n1} from (3.1), uniformly in s ∈ [0, 1] and z ∈ R^d.
Lemma A.5. Under the assumptions of Theorem 3.1 (i) and under H 0

A.2 Proof of Theorem 3.1(ii)
For the proof of the second part of Theorem 3.1 we use a recent result on weak convergence of sequential empirical processes indexed in function classes that can be applied for strongly mixing sequences, see Mohr (2018b). It is stated in Lemma A.7 and uses the following notion of bracketing number.
Definition A.6 (Bracketing number). Let X be a measure space, F some class of functions X → R and ρ some semi-norm on F. For all ε > 0, let N = N(ε) be the smallest integer for which there exist a class of functions X → R, denoted by B and called bounding class, and a function class A ⊂ F, called approximating class, such that |B| = |A| = N, ρ(b) < ε for all b ∈ B, and for all ϕ ∈ F there exist a* ∈ A and b* ∈ B such that |ϕ − a*| ≤ b*. Then N(ε) is called the bracketing number and denoted by Ñ_{[]}(ε, F, ρ).
Note that the usual notion for bracketing number (as in Definition 2.1.6 in van der Vaart and Wellner (1996)) will be referred to as N [ ] (ε, F, ρ).
Lemma A.7 (Corollary 2.7 in Mohr (2018b)). Let {X t : t ∈ Z} be a strictly stationary sequence of random variables with values in some measure space X . Let F be a class of measurable functions X → R. Let furthermore the following assumptions hold.
Proof of Theorem 3.1(ii). First notice that due to assumption (G) and under the null hypothesis, (U_t, X_t)_{t∈Z} is a strictly stationary sequence of random variables with values in R × R^d. Denote by P the common marginal distribution of (U_1, X_1) and define the function class F below. The convergence of T_n is then implied by the weak convergence of the corresponding empirical process indexed by F. We apply Lemma A.7. Condition (A1) on the mixing coefficients of (U_t, X_t)_{t∈Z} is implied by assumption (G) on the mixing coefficients of (Y_t, X_t)_{t∈Z} and the null hypothesis, as measurable transformations maintain mixing properties. To show condition (A2) on the function class F, the choice of approximating functions and bounding functions, as in Definition A.6, will be discussed in more detail. Denote, with c̄ from assumption (U), h(x) = c̄(x)f(x) and H(x) = ∫_{(−∞,x]} h(t) dt for x ∈ R^d, and let H_i denote the corresponding marginal functions for all i = 1, ..., d and x ∈ R. Since H_i is continuous and H_i(−∞) = H(−∞) = 0 and H_i(∞) = H(∞) ≤ M for M < ∞ from assumption (U), N_i can be chosen to be smaller than 2dM ε^{−2} for all i = 1, ..., d. By using cartesian products, a partition of R^d is obtained. For simplicity the following notation will be used: for j = (j_1, ..., j_d) ∈ N^d let z_j := (z_{j_1,1}, ..., z_{j_d,d}); the bracketing property then follows due to (A.1). Furthermore, for all i = 2, ..., Q, the required moment bounds hold by Jensen's inequality and (U). Since N_i = O(ε^{−2}) for all i = 1, ..., d, it holds that Ñ_{[]}(ε, F, ‖·‖_{L_2(P)}) = O(ε^{−2d}). As Q > d(2 + γ) holds, assumption (A2) from Lemma A.7 is therefore satisfied. Assumption (A3) is also satisfied, as F̄: R × R^d → R, (u, x) ↦ u is an envelope function of F with the required integrability properties. What is left to show is the convergence of all finite dimensional distributions of T_n. To this end we apply the Cramér-Wold device. Let λ_1, ..., λ_K ∈ R \ {0} and consider the corresponding linear combination; to this, Corollary 1 in Rio (1995) can be applied, which is a central limit theorem for strongly mixing triangular arrays.
Following the notation in Rio (1995), define V_{n,l} := Var(Σ_{i=1}^l ξ_{n,i}) for all l = 1, …, n and n ∈ N. Let furthermore Q_{n,i} be the càdlàg inverse function of t ↦ P(|ξ_{n,i}| > t), i.e. Q_{n,i}(u) := sup{t > 0 : P(|ξ_{n,i}| > t) > u} for all u > 0, with the convention sup ∅ := 0. Let {α_n(t) : t ∈ N} be the sequence of mixing coefficients of {ξ_{n,i} : 1 ≤ i ≤ n, n ∈ N}. For t ∈ (0, ∞) define ᾱ_n(t) := α_n(⌊t⌋), and let its càdlàg inverse function ᾱ_n^{−1} be defined analogously. Condition (a) in Corollary 1 in Rio (1995) is easy to verify. Concerning condition (b) in the aforementioned corollary, note that by Markov's inequality, applied for all t > 0 with q as in the assumptions, a bound on Q_{n,i}(u) holds for all u > 0. By similar arguments we obtain ᾱ_n^{−1}(u) ≤ Ã − log_a(u) for all u > 0, where Ã := log_a(A). Furthermore, V_{n,n} converges to the corresponding limit variance. Putting the results together, it can be obtained that the integral in condition (b) converges to zero, and thus condition (b) is satisfied. Applying Corollary 1 in Rio (1995), it follows that Σ_{i=1}^n ξ_{n,i}/(V_{n,n})^{1/2} converges in distribution to the standard normal distribution, and thus the assertion follows.
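The generalized inverse Q_{n,i} used above can be pictured on an empirical survival function: Q(u) is the largest t at which more than a fraction u of the |ξ|-values still exceed t, with Q(u) = 0 once no such t exists (the convention sup ∅ = 0). A small sketch of this empirical analogue (the function name and sample are illustrative only):

```python
import math

def cadlag_inverse(sample, u):
    """Empirical version of Q(u) = sup{t > 0 : P(|xi| > t) > u},
    with the convention sup(emptyset) = 0.  For the empirical survival
    function S(t) = #{|x_i| > t} / n, the condition S(t) > u holds
    exactly for t below the (n - j + 1)-th smallest |x_i|, where
    j = floor(n*u) + 1 is the smallest integer strictly exceeding n*u."""
    a = sorted(abs(x) for x in sample)
    n = len(a)
    j = math.floor(n * u) + 1           # smallest integer > n*u
    return a[n - j] if j <= n else 0.0  # sup over an empty set is 0

print(cadlag_inverse([1, -2, 3, 4], 0.5))  # -> 2
print(cadlag_inverse([1, -2, 3, 4], 1.0))  # -> 0.0
```

For the sample {1, −2, 3, 4} and u = 0.5 the condition S(t) > 0.5 holds exactly for t < 2, so the supremum is 2; for u = 1 the set is empty and the convention gives 0.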

A.3 Proofs of lemmata
Proof of Lemma A.2. For an l-times differentiable function h : J_n → R define the norm ‖h‖_{l+η} via the usual Hölder increments in |x − y|^η, and the function class H := C^{l+η}_{1,n}(J_n) := {h : J_n → R : ‖h‖_{l+η} ≤ 1, sup_{x∈J_n} |h(x)| ≤ z_n √(log n)} with z_n := q_n δ_n ((log n)/(n h_n^d))^{1/2}. The third bandwidth condition in (B2) implies (log n)/(n h_n^d) + h_n^r p_n q_n δ_n = O(((log n)/(n h_n^d)) q_n δ_n), and thus Lemma A.1 implies that P(ĥ_n ∈ C^{l+η}_{1,n}(J_n)) → 1 as n → ∞ for ĥ_n(x) = (m(x) − m̂_n(x)) ω_n(x). Furthermore, let X_t ∼ P and F := {x ↦ I{x ≤ z} : z ∈ R^d}.
Then the assertion of the lemma follows if we show the corresponding uniform negligibility. To this end let ε_{n1} = ε_{n2} = n^{−1/2} and ε_{n3} = n^{−1/2}/(log n), and let 0 = s_1 < ⋯ < s_{K_n} = 1 partition [0, 1] into intervals of length 2ε_{n1} such that K_n = O(ε_{n1}^{−1}). Furthermore, let J_n := N_{[ ]}(ε_{n2}, F, ‖·‖_{L_2(P)}) and define M_n accordingly. In what follows we only consider the first line on the right-hand side, while the others can be treated similarly. We apply Theorem 2.1 of Liebscher (1996) to the random variable Z_t (for m, j, k fixed). The mixing coefficient of {Z_t : 1 ≤ t ≤ n} can be bounded by the mixing coefficient of {X_t : t ∈ Z} due to Bradley (1985), Section 2, Remark (iv). Further, the variables are centered and have a bound of order O(z_n log n). Applying Theorem 2.1 to Σ_{i=1}^n Z_i yields, for all ε > 0 and n ∈ N large enough, an exponential bound in which the first and second bandwidth constraints in (B2) were used. Details are omitted for the sake of brevity.
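The grid 0 = s_1 < ⋯ < s_{K_n} = 1 of mesh 2ε_{n1} reduces a supremum over s ∈ [0, 1] to a maximum over K_n = O(ε_{n1}^{−1}) points. The following toy sketch (synthetic data, not the process of the proof) illustrates this discretization for a partial-sum process:

```python
import random

random.seed(0)
n = 400
eps_n1 = n ** -0.5                      # grid mesh parameter from the proof
grid = [min(2 * k * eps_n1, 1.0) for k in range(int(1 / (2 * eps_n1)) + 2)]
# K_n = O(1/eps_n1) grid points covering [0, 1]

xs = [random.gauss(0, 1) for _ in range(n)]

def T(s):
    """Partial-sum process n^{-1/2} * sum_{t <= n*s} x_t."""
    return sum(xs[: int(n * s)]) / n ** 0.5

full_sup = max(abs(T(t / n)) for t in range(n + 1))  # sup over all jump points
grid_max = max(abs(T(s)) for s in grid)
print(full_sup - grid_max >= 0)  # -> True
```

Since the process is piecewise constant between jump points, the grid maximum never exceeds the full supremum, and the gap is controlled by the largest increment over a single mesh interval; this is the quantity the exponential inequality bounds.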
Proof of Lemma A.4. First, using the uniform rates of convergence from Lemma A.1 applied to (m(X_t), X_t)_{t∈Z}, together with the first and the last bandwidth condition in assumption (B2) as well as the second condition in assumption (B1), it can be shown that the relevant remainder term is negligible. Defining the function class F_{n,1} and imposing X_t ∼ P, the assertion of the lemma follows if we show (A.2) and (A.3). The proof of (A.2) uses similar techniques to those in the proof of Lemma A.2, while the proof of (A.3) follows by applying a Taylor expansion for m and f. Details are given in the supplement.
Proof of Lemma A.5. First, using the uniform rates of convergence from Lemma A.1 applied to (U_t, X_t)_{t∈Z} together with the assumptions on the bandwidth, a first uniform bound can be shown. Furthermore, a similar bound can be shown uniformly in z ∈ R^d and with q as in the assumptions. Defining the function class F_{n,2} and imposing (U_t, X_t) ∼ P, the assertion of the lemma follows if we show (A.4). The proof of (A.4) uses similar techniques to those in the proof of Lemma A.2. Details are given in the supplement.
Proof of Lemma A.3. It will be shown that (A.5) holds uniformly in s ∈ [0, 1] and z ∈ R^d. To this end define the function class F and, for s ∈ [0, 1] and ϕ ∈ F, the corresponding process, where (U_t, X_t) ∼ P and ∫ ϕ dP = 0. Similarly to the proof of Theorem 3.1(ii), Theorem 2.5 in Mohr (2018b) is applied. Note that here the assumption Q > (d + 1)(2 + γ) is needed, as Ñ_{[ ]}(ε, F, ‖·‖_{L_2(P)}) = O(ε^{−2(d+1)}). Then, for some fixed z ∈ R^d, defining ϕ_n(u, x) := u I{x ≤ z} I{x ∉ [−c_n, c_n]^d}, it holds that ϕ_n ∈ F for all n ∈ N, where the convergence holds by the dominated convergence theorem. With (A.6) the last term is o_P(1), which proves the assertion in (A.5) and therefore the assertion of the lemma.
The limiting distribution from Corollary 3.2 applies, but as the covariate is multivariate we do not easily obtain an asymptotically distribution-free test. Thus we apply the bootstrap procedure from section 4. Table 2 shows the rejection frequencies for all three models when using the tests based on T*_{n1} and T*_{n2}, as well as the bootstrap versions of KS and CM, under both H_0 and H_1. It can be seen that under H_0 the tests reject somewhat more often than in the models considered in section 5, sometimes exceeding the 5% level for finite sample sizes. Under the alternative the number of rejections increases rapidly for T*_{n1} and T*_{n2} with increasing n, while it stays relatively low for KS and CM. In summary, the bootstrap tests perform reasonably well and are therefore an acceptable alternative to tests using critical values of the limiting distribution, which are not known here due to the multidimensional covariates. Furthermore, in these models they outperform the bootstrap versions of the CUSUM tests.
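As a rough illustration of how such a bootstrap test can be organized, the following sketch computes KS- and CvM-type functionals of a sequential marked empirical process of residuals, T_n(s, z) = n^{−1/2} Σ_{t≤⌊ns⌋} Û_t I{X_t ≤ z}, and calibrates them with i.i.d. standard normal multiplier weights on the residuals. The univariate covariate, the multiplier scheme and all function names are simplifying assumptions for illustration; the actual procedure of section 4 may differ:

```python
import random
random.seed(1)

def ks_cvm(resid, X):
    """Kolmogorov-Smirnov and Cramer-von-Mises functionals of the
    sequential marked empirical process
    T_n(s, z) = n^{-1/2} * sum_{t <= n*s} resid_t * 1{X_t <= z},
    evaluated over the observed jump points (s, z) = (k/n, X_j)."""
    n = len(resid)
    ks = cvm = 0.0
    for z in X:
        cum = 0.0
        for k in range(n):
            cum += resid[k] * (X[k] <= z)
            val = cum / n ** 0.5
            ks = max(ks, abs(val))
            cvm += val ** 2
    return ks, cvm / n ** 2

def bootstrap_pvalue(Y, X, m_hat, stat=0, B=200):
    """Multiplier bootstrap: perturb residuals by i.i.d. N(0,1) weights
    and compare the chosen statistic (0 = KS, 1 = CvM) to its observed value."""
    resid = [y - m_hat(x) for y, x in zip(Y, X)]
    obs = ks_cvm(resid, X)[stat]
    count = 0
    for _ in range(B):
        boot = [r * random.gauss(0, 1) for r in resid]
        if ks_cvm(boot, X)[stat] >= obs:
            count += 1
    return (count + 1) / (B + 1)

# toy usage with a hypothetical constant-mean model under H_0
n = 60
X = [random.uniform(0, 1) for _ in range(n)]
Y = [random.gauss(0, 1) for _ in range(n)]
p = bootstrap_pvalue(Y, X, m_hat=lambda x: 0.0, stat=0, B=100)
print(0 < p <= 1)  # -> True
```

The p-value convention (count + 1)/(B + 1) keeps the bootstrap p-value strictly positive; in practice m_hat would be a nonparametric (e.g. kernel) estimate of the regression function rather than the known truth.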

C Proofs of auxiliary results
In this section we give the proofs of (A.2) and (A.3) in Lemma A.4 as well as (A.4) in Lemma A.5.
It then holds that J_n = O(ε_n^{−d}). Using these brackets of F_{n,1}, it can be shown that sup_{ϕ∈F_{n,1}} of the relevant term is negligible, where it is sufficient to discuss the proof of the first assertion, as the other one works analogously. By splitting the integral according to the sign of (m(y) − m(x))K_{h_n}(y − x), i.e. defining ϕ^u_{j,2}(x) := ∫ (m(y) − m(x)) K_{h_n}(y − x) I{(m(y) − m(x)) K_{h_n}(y − x) < 0} ω_n(y) dy and ϕ^u_{j,1} analogously with the reverse inequality, it holds that ϕ^u_j(x) = ϕ^u_{j,1}(x) + ϕ^u_{j,2}(x). Thus, again the problem is reduced to showing max_{1≤j≤J_n} |(1/√n) Σ_{i=1}^n ϕ^u_{j,1}(X_i) − ∫ ϕ^u_{j,1} dP| = o_P(1); the other assertion works analogously. Similarly to before, we apply Theorem 2.1 of Liebscher (1996) to the random variables Z_i := ϕ^u_{j,1}(X_i) − ∫ ϕ^u_{j,1} dP for fixed j. Note that the mixing properties are the same as those of the original process. Further, the variables are centered and possess a bound of order O(h_n q_n). For all ε > 0 and n ∈ N large enough it can then be shown that
P( max_{1≤j≤J_n} |(1/√n) Σ_{i=1}^n ϕ^u_{j,1}(X_i) − ∫ ϕ^u_{j,1} dP| > ε )
≤ J_n · 4 exp( −nε² / (64n(⌊(log n)²⌋ + 1)h_n²q_n² + (8/3)ε√n(⌊(log n)²⌋ + 1)h_nq_n) ) + J_n · 4 · (n/(⌊(log n)²⌋ + 1)) · α(⌊(log n)²⌋ + 1),
where eventually the last bandwidth condition in assumption (B2) was used. The assertion in (A.3) can be shown by using Taylor's expansion for both m and f up to order r − 1 and the assumptions in (F1). Thus
sup_{ϕ∈F_{n,1}} |∫ ϕ dP| = sup | ∫∫ (m(y) − m(y − th_n)) K(t) ω_n(y) f(y − th_n) dt dy | = O(h_n^r p_n q_n) = o(n^{−1/2}),
where the last equality holds by the third condition in (B2).
It then holds that J_n = O(ε_n^{−d}). Using these brackets of F_{n,2}, it can be shown that sup_{ϕ∈F_{n,2}} of the relevant term is negligible. We will only consider the first term, as the second one is treated analogously. Similarly to before, we apply Theorem 2.1 of Liebscher (1996) to the random variables Z_i := ϕ^u_{j,1}(U_i, X_i) − ∫ ϕ^u_{j,1} dP for fixed j. Note that the mixing properties are the same as those of the original process. Further, the variables are centered and possess a bound of order O(n^{1/q}). For all ε > 0 and n ∈ N large enough it can then be shown that
P( max_{1≤j≤J_n} |(1/√n) Σ_{i=1}^n ϕ^u_{j,1}(U_i, X_i) − ∫ ϕ^u_{j,1} dP| > ε )
≤ Σ_{j=1}^{J_n} P( |(1/√n) Σ_{i=1}^n ϕ^u_{j,1}(U_i, X_i) − ∫ ϕ^u_{j,1} dP| > ε )
≤ J_n · 4 exp( −nε² / (64n(⌊(log n)²⌋ + 1)h_n + (8/3)ε√n(⌊(log n)²⌋ + 1)n^{1/q}) ) + J_n · 4 · (n/(⌊(log n)²⌋ + 1)) · α(⌊(log n)²⌋ + 1) = o(1),
due to the third bandwidth condition in assumption (B2) and because q > 2.