Modified sequential change point procedures based on estimating functions

A large class of sequential change point tests is based on estimating functions, where estimation is computationally efficient because (possibly numeric) optimization is restricted to an initial estimation. This includes examples as diverse as mean changes, linear or non-linear autoregressive and binary models. While the standard cumulative-sum detector (CUSUM) has recently been considered in this general setup, we consider several modifications that have faster detection rates, in particular if changes occur late in the monitoring period. More precisely, we use three different types of detector statistics based on partial sums of a monitoring function, namely the modified moving-sum statistic (mMOSUM), Page's cumulative-sum statistic (Page-CUSUM) and the standard moving-sum statistic (MOSUM). The statistics only differ in the number of observations included in the partial sum. The mMOSUM uses a bandwidth parameter which multiplicatively scales the lower bound of the moving sum. The MOSUM uses a constant bandwidth parameter, while the Page-CUSUM chooses the maximum over all possible lower bounds for the partial sums. So far, the first two schemes have only been studied in a linear model, the MOSUM only for a mean change. We develop the asymptotics under the null hypothesis and alternatives under mild regularity conditions for each test statistic, which include the existing theory but also many new examples. In a simulation study we compare all four types of test procedures in terms of their size, power and run length. Additionally, we illustrate their behavior by applications to exchange rate data as well as the Boston homicide data.


Introduction
Despite a relatively long tradition in statistics, change point analysis is a very active field and has become increasingly popular in recent years due to its importance in many areas where data is collected over time. In addition to a-posteriori change point methods, there has been a recent interest in sequential methodology because an increasing number of data sets are collected automatically or without significant costs, such that the observations arrive steadily. Examples include financial data sets, e.g., in risk management (Andreou and Ghysels (2006)) or CAPM models (Aue et al. (2011)), as well as medical data sets, e.g., monitoring intensive care patients (Fried and Imhoff (2004)). Chu et al. (1996) introduced a new way of sequential testing, which allows one to control the asymptotic α-error if no changes occur while having asymptotic power one under alternatives. This approach has since been pursued by others, e.g. Horváth et al. (2004), Aue et al. (2006), Horváth et al. (2008), Aue et al. (2008), Aue and Horváth (2004), Fremdt (2014) and Hušková and Koubková (2005) (see Section 1.2 below for more details). The key assumption is the existence of a historic data set which is used for initial estimation before monitoring starts. This is usually given in applications, as some data collection must have taken place before one can safely build or monitor a model. This approach allows for nonparametric inference by means of asymptotics by letting the length of the historic data set grow to infinity, even if a possibly infinite observation horizon is used. Nonparametric inference is meant in the sense that e.g. the stochastic structure of the innovation process in a regression model is not completely specified.
As recently pointed out by Kirch and Kamgaing (2015), most of these statistics can be written by means of estimating and monitoring functions, which then provide a unifying framework for the derivation of the asymptotic results. Kirch and Kamgaing (2015) derive regularity conditions under which they prove limit results for the standard CUSUM monitoring scheme as originally proposed by Chu et al. (1996), which has been used in most follow-up works for different change point scenarios, e.g. Horváth et al. (2004), Aue et al. (2006), Aue and Horváth (2004) and Hušková and Koubková (2005).
The main disadvantage of the standard sequential CUSUM statistic is that the detection time can be rather long if changes occur late in the monitoring period. This is because effectively a (properly scaled) two-sample test is applied, comparing the historic data set with the data set after monitoring starts up to the most recent observation. Thus, for a late change many 'null observations' contaminate the second data set, so that more 'alternative' observations need to be collected before significance is reached. This is why, recently, alternative monitoring schemes based only on more recent observations have been proposed in the literature: the mMOSUM in a linear model setup in Chen and Tian (2010) (where their null limit distribution is not correct), the Page-CUSUM in a linear model setup in Fremdt (2014) and the standard MOSUM for the location model in Horváth et al. (2008) and Aue et al. (2008).
In this paper, we generalize the latter monitoring schemes to the general framework of estimating functions, thus greatly extending their range of applications to examples as diverse as the non-linear autoregressive or the binary time series model. Furthermore, we investigate the differences in terms of size, power and run length in a simulation study and illustrate the behavior on two data sets.
The paper is organized as follows: First, the general setup and the new statistics are introduced, namely the mMOSUM, the Page-CUSUM and the MOSUM statistic. Examples for change point problems that fall into that framework are given in Section 1.2. In Section 2 we develop the limit distributions for these statistics in the general setting under the null hypothesis and mild assumptions. We first focus on the mMOSUM and the Page-CUSUM statistic before developing the asymptotics for the MOSUM statistic under slightly different assumptions. In Subsection 2.3 we show that the different procedures have asymptotic power one under very general conditions. Finally, in Section 3 we compare the different types of statistics, including the CUSUM statistic, in an extensive simulation study by means of their empirical size, power and run lengths in different scenarios before giving two data examples.

Monitoring schemes based on estimating functions
As already mentioned, we assume the existence of a historic training data set $X_1, \ldots, X_m$ with no change points. Based on these initial observations we estimate the unknown parameter $\theta_0 \in \mathbb{R}^d$ by means of an estimating function, i.e. the estimator $\hat\theta_m$ is obtained as the zero of the following sum:
$$\sum_{t=1}^{m} G(X_t, \hat\theta_m) = 0,$$
where $X_t$, $t = 1, \ldots, m$, are the historical data and the estimating function $G$ takes values in $\mathbb{R}^d$. All vectors given in this work are column vectors.
The simplest example is the estimating function $G(x, \theta) = x - \theta$ in a location model, which leads to the mean (as estimator for the expectation), i.e. $\hat\theta_m = \bar X_m$. In a linear model with $Y_t = \theta^T Z_t + e_t$, we consider $X_t^T = (Y_t, Z_t^T)$ as well as $G((y, z), \theta) = z(y - \theta^T z)$, which yields the usual least squares estimator in this context. Similarly, in an autoregressive linear model $Z_t$ is given by the lagged observations $(Y_{t-1}, \ldots, Y_{t-p})^T$. The examples show that the data $X_t$ can be naturally or artificially multivariate.
Estimating functions are effectively a generalized method of moments (also related to M-estimators, which are a particular class of estimating functions), where it is exploited that the true parameter $\theta_0$ is the only one fulfilling $E\,G(X_1, \theta_0) = 0$. In our setting, we do not require that the model is actually true but can use the model as a tool for feature extraction. In this case $\theta_0$ is the best approximating parameter (in the above sense). This enables us to construct a change point method which will detect changes that cause a change in the best approximating parameter (see e.g. Kirch et al. (2015) as well as Kirch and Kamgaing (2012), where this idea was exploited for offline tests).
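The two examples above can be checked numerically. The following is a minimal sketch, assuming simulated Gaussian data; the function names and the simulation settings are illustrative and not taken from the paper. It verifies that the sample mean and the least squares estimator are zeros of the respective sums of estimating functions.

```python
import numpy as np

def G_location(x, theta):
    """Estimating function G(x, theta) = x - theta of the location model."""
    return x - theta

def G_linear(y, z, theta):
    """Estimating function G((y, z), theta) = z (y - theta^T z) of the linear model."""
    return z * (y - z @ theta)

def solve_location(x):
    # The zero of sum_t (x_t - theta) is the sample mean.
    return x.mean()

def solve_linear(y, Z):
    # The zero of sum_t z_t (y_t - theta^T z_t) solves the normal equations
    # (Z^T Z) theta = Z^T y, i.e. it is the least squares estimator.
    return np.linalg.solve(Z.T @ Z, Z.T @ y)

rng = np.random.default_rng(0)   # illustrative simulated historic data
m = 200

x = rng.normal(loc=1.5, size=m)
theta_loc = solve_location(x)
sum_G_loc = sum(G_location(xt, theta_loc) for xt in x)

Z = np.column_stack([np.ones(m), rng.normal(size=m)])
y = Z @ np.array([1.0, -2.0]) + rng.normal(size=m)
theta_lin = solve_linear(y, Z)
sum_G_lin = sum(G_linear(yt, zt, theta_lin) for yt, zt in zip(y, Z))
```

Both `sum_G_loc` and `sum_G_lin` vanish up to floating-point error, which is exactly the defining property of the estimator.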
Because $E\,G(X^{(1)}, \theta_0) \neq 0$ for the observations $X^{(1)}$ after the change, we consider monitoring schemes based on means of $G(X_i, \hat\theta_m)$, $i > m$. However, it is somewhat restrictive to require that the same estimating function $G$ needs to be used, hence we allow for a different function $H$, which does not necessarily need to be an estimating function itself but can, for example, be lower-dimensional in order to increase power for certain alternatives. Furthermore, this allows us to use a more precise but less robust function for estimating (e.g. if it is known that the historic data set does not contain any outliers) while using a more robust monitoring function (as there may be outliers during the observation period). See Kirch and Kamgaing (2015) for more details as well as examples for this idea.
All detector statistics are based on partial sum processes of the monitoring function. Because the monitoring function $H$ and hence the partial sum processes are often multivariate, the test decision will be based on quadratic forms $\|S\|_A^2 = S^T A S$ with a suitable matrix $A$, resulting in four statistics $\Gamma_j = \Gamma_{j,A}$, $j = 1, \ldots, 4$, where the dependence on $A$ is suppressed where possible for better readability. The notation $\lfloor x \rfloor$ indicates the lower Gauss bracket, i.e. the largest integer smaller than or equal to $x$, while $\lceil x \rceil$ indicates the upper Gauss bracket, i.e. the smallest integer larger than or equal to $x$. The bandwidth $h_2 \in (0, 1)$ (fixed) is a tuning parameter determining the rate with which early observations are discarded; $h_4 = h_{4,m} \in \mathbb{N}$ is the window size in the moving sum (MOSUM) procedure.
Statistic Γ 1 has already been dealt with in detail in Kirch and Kamgaing (2015) and will not be considered here except in the simulation study and data analysis.
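The verbal description of the four detectors (all observations since monitoring started for the CUSUM, a lower bound scaled by the bandwidth for the mMOSUM, the maximum over all lower bounds for the Page-CUSUM, and a fixed window for the MOSUM) can be sketched as follows. This is only an illustration under stated assumptions: the partial sum is taken as $S(m,k) = \sum_{j=m+1}^{m+k} H(X_j, \hat\theta_m)$, any normalisation is left to the weight function, and the function names are hypothetical.

```python
import numpy as np

def partial_sums(h_values):
    """Return S with S[k] = sum of the first k monitoring values, S[0] = 0."""
    h = np.asarray(h_values, dtype=float)
    if h.ndim == 1:                      # scalar monitoring function
        h = h[:, None]
    return np.vstack([np.zeros((1, h.shape[1])), np.cumsum(h, axis=0)])

def a_norm(v, A):
    """Quadratic-form norm sqrt(v^T A v)."""
    return float(np.sqrt(v @ A @ v))

def detectors(S, k, A, h2, h4):
    """Unnormalised CUSUM, mMOSUM, Page-CUSUM and MOSUM detectors at time k."""
    cusum = a_norm(S[k], A)                               # all k observations
    mmosum = a_norm(S[k] - S[int(np.floor(k * h2))], A)   # lower bound floor(k*h2)
    page = max(a_norm(S[k] - S[i], A) for i in range(k + 1))  # best lower bound
    mosum = a_norm(S[k] - S[max(k - h4, 0)], A)           # last h4 observations
    return cusum, mmosum, page, mosum

# Illustrative monitoring values: 50 slightly negative values, then a jump.
h_values = np.concatenate([np.full(50, -0.125), np.ones(10)])
S = partial_sums(h_values)
A = np.eye(1)
example = detectors(S, 60, A, h2=0.5, h4=5)
```

In this toy example the CUSUM value is dampened by the early observations, whereas the Page-CUSUM picks the most favourable starting point; this is the mechanism behind the faster detection of late changes discussed in the introduction.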
In order to control the asymptotic size of the detection procedure even in the presence of an infinite observation horizon, weight functions $w_j(k)$ are introduced that depend on the point in time and either the length of the historic data set $m$ (i.e. $w_j(k) = w_j(m, k)$, $j = 1, 2, 3$, for the CUSUM, mMOSUM and Page-CUSUM) or the window width $h_4$ for the MOSUM method (i.e. $w_4(k) = w_4(h_4, k)$). As before, we suppress the dependence on $m$ and $h$ in order to obtain a unified notation.
The null hypothesis of no change will be rejected at the first point in time $k$ when $w_j(k)\,\Gamma_j(m, k) > c_j$, where $c_j$ is a critical value which will be derived from the limit distribution under the null hypothesis. An alternative way of thinking about this is that $c_j / w_j(k)$ is a critical curve that needs to be crossed by the detector statistics $\Gamma_j(m, k)$. In Section 3.2 we will plot $w_j(k)\Gamma_j(m, k)/c_j$ for an easier visual comparison, so that all detector statistics need to cross the horizontal 1-line.
The stopping time of each sequential procedure is defined as the first time point $1 \le k < N(m)$ at which the corresponding detector crosses its critical curve, where we set $\min \emptyset = \infty$. We distinguish between the open-end procedure with $N(m) = \infty$ and the closed-end procedure with $N(m) = Nm + 1$, $N > 0$.
If the weight function and critical value are chosen appropriately, the type-I error is controlled and the method has asymptotic power one under alternatives, i.e. the probability of stopping in finite time converges to α under the null hypothesis and to one under alternatives.
The matrix $A$ is usually chosen such that the null limit of $\sup_{1 \le k < N(m)} w_j(k)\Gamma_j(m, k)$ is pivotal, e.g. based on the covariance matrix of $H(X_j, \theta_0)$, or the long-run covariance matrix in case of a time series.
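The rejection rule and the convention $\min \emptyset = \infty$ can be sketched in a few lines. This is a minimal illustration, assuming a strict crossing of the critical value and a search over $1 \le k < N(m)$; the function and argument names are hypothetical.

```python
import math
from itertools import count

def stopping_time(detector_values, weight, c, horizon=math.inf):
    """First k with weight(k) * Gamma(m, k) > c, or math.inf if there is none.

    detector_values: iterable of Gamma(m, k) for k = 1, 2, ...
    weight:          callable k -> w(k)
    horizon:         N(m); only time points 1 <= k < horizon are inspected
    """
    for k, gamma_k in zip(count(1), detector_values):
        if k >= horizon:
            break
        if weight(k) * gamma_k > c:
            return k
    return math.inf   # corresponds to min of the empty set

# Illustrative values only.
values = [0.5, 1.0, 3.0, 0.1]
tau_open = stopping_time(values, weight=lambda k: 1.0 / k, c=0.9)
tau_closed = stopping_time(values, weight=lambda k: 1.0 / k, c=0.9, horizon=3)
```

With a shorter observation horizon the same detector path may never raise an alarm, which is why the critical values of closed-end and open-end procedures differ.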

Examples
The mMOSUM has, to the best of our knowledge, only been considered in the linear model in Chen and Tian (2010), where (due to an error in the proof) the asymptotic limit distribution under the null hypothesis is incorrect. The Page-CUSUM has been considered in Fremdt (2014), also for the linear model. The MOSUM statistic has been analysed in the location model in Horváth et al. (2008) and Aue et al. (2008). Kirch and Kamgaing (2015) develop the theory for the CUSUM statistic in an equivalent setting to the one investigated in this paper. To unify the theory for all statistics we use the same assumptions in this paper to prove the corresponding limit results under the null hypothesis, with the exception of the MOSUM statistic with $h_4/m \to 0$, which needs somewhat different assumptions. Therefore, the validity of the procedures based on the mMOSUM, Page-CUSUM as well as MOSUM (with $h_4/m \to \beta > 0$) monitoring schemes follows for all examples considered in Kirch and Kamgaing (2015). Similarly, under alternatives we give a stricter assumption than necessary in order to unify the theory for all statistics, which can be relaxed (see Weber (2017), Chapter 4). We will now give a short overview of these examples; more details can be found in Kirch and Kamgaing (2015).

Linear regression model
The model is given by $X_t = x_t^T \beta_t + \varepsilon_t$, where $\beta_t = (\beta_{t,1}, \ldots, \beta_{t,p})^T$ are the unknown parameters and $x_t = (1, x_{t,2}, \ldots, x_{t,p})^T$ are regressors. Moreover, the errors $\{\varepsilon_t\}$ have mean zero and variance $\sigma_t^2 = \sigma_0^2$ and are independent of $\{x_t\}$. In the literature the residuals are often supposed to be i.i.d. or uncorrelated with some moment conditions, but the minimal requirement is that they fulfill a functional central limit theorem.
As before, we consider the null hypothesis of no change ($H_0$: $\beta_t = \beta_0$ for all $t$) against the alternative hypothesis $H_A$ of a change at $m + k^*$. Here, $k^*$ is the change point, and $m$ denotes the length of the historical data set. Both Chen and Tian (2010) for the mMOSUM and Fremdt (2014) for the Page-CUSUM use the least squares estimator to obtain an approximation for the unknown parameter $\beta_0$ based on the historical data set. The corresponding estimating function is given by $G((X_t, x_t^T)^T, \beta) = x_t (X_t - \beta^T x_t)$. Their monitoring function for parameter changes is given by $H((X_t, x_t^T)^T, \beta) = X_t - \beta^T x_t$ with $B(\theta_0) = (1, 0, \ldots, 0)^T$, i.e. their detector is based on the estimated residuals. However, this monitoring function cannot detect all alternatives, so that Hušková and Koubková (2005) propose the CUSUM statistic with $H = G$, resulting in a test with asymptotic power one for all alternatives.
In order to find changes in the error variance $\sigma_t^2$, we can extend the estimating function $G$ by a component for the variance, with $B(\theta_0) = (0, \ldots, 0, 1)^T$ for the corresponding monitoring function. The corresponding procedures have also been considered by Chen and Tian (2010) and Fremdt (2014), respectively.
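The remark that residual-based monitoring cannot detect all alternatives can be illustrated numerically. The sketch below assumes a centred second regressor and a change in the slope only, and evaluates both monitoring functions at the true pre-change parameter rather than at an estimate; all simulation settings are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta0 = np.array([1.0, 2.0])          # parameter before the change
delta = np.array([0.0, 0.5])          # change in the slope only

# Regressors x_t = (1, x_{t,2}) with a centred second component.
x = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(size=n)
X_after = x @ (beta0 + delta) + eps   # observations after the change

resid = X_after - x @ beta0                        # H = residual at beta0
H_resid_mean = resid.mean()                        # stays close to zero
H_full_mean = (x * resid[:, None]).mean(axis=0)    # H = G, second entry moves
```

The residuals keep mean zero after the change because the regressor is centred, so a detector built on them has no drift to pick up, while the second component of $H = G$ has mean close to $0.5$ in this setting.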

Location model
This is the simplest change point setting, where $X_t = \mu + \varepsilon_t$ before the change and $X_t = \mu + \Delta + \varepsilon_t$ after the change. Here, $k^*$ is the change point, $\mu \in \mathbb{R}$ is the mean prior to the change, $\Delta \neq 0$, and $\varepsilon_t$ is an error sequence with mean zero and variance $\sigma^2$. The multivariate mean change model is also included in the general setup, but for simplicity we concentrate on the univariate model.
In fact, this model is a special case of the previous regression model with $p = 1$. Estimating and monitoring function simplify to $G(X_t, \mu) = X_t - \mu$, resulting in $\hat\mu_m = \bar X_m$, and $H = G$. Horváth et al. (2008) and Aue et al. (2008) investigate this procedure for the MOSUM detector.
It is also possible to use more robust monitoring procedures, e.g. based on M-estimators, in this context, and one can even combine a non-robust initial estimator with a robust monitoring function by choosing the monitoring function appropriately and different from the estimating function. For more details, we refer to Kirch and Kamgaing (2015), Section 6.2.

Non-linear models
Several monitoring procedures based on the CUSUM detectors that fit into the above framework have already been investigated in the literature, including one for GARCH sequences (Berkes et al., 2004), for nonlinear regression models with $Y_t = f(x_t, \beta_0) + \varepsilon_t$ (Ciupera, 2013) as well as nonlinear autoregressive time series $X_t = g(X_{t-1}, \ldots, X_{t-p}) + \varepsilon_t$ for some function $g$, where a neural network approximation is used to construct the detectors (Kirch and Kamgaing, 2015). More details can be found in Kirch and Kamgaing (2015), Section 6.3.

Integer-valued time series
In a binary autoregressive time series, the conditional distribution of the binary observation depends on regressors $\mathbf{Z}_{t-1} = (Z_{t-1}, \ldots, Z_{t-p})$, which can be purely autoregressive, purely exogenous or a combination of both. In this case a typical estimating function is based on the partial likelihood scores, with a monitoring function either being of the same form or a projection onto one or more of the components of the full partial likelihood scores.
Similarly, in the Poisson autoregressive model the estimating function (as well as the monitoring function) is also typically obtained from the partial log-likelihood scores. For more details we refer to Kirch and Kamgaing (2015), Sections 6.4 and 6.5.

Asymptotics
We first derive the null asymptotics for the respective statistics, allowing us to control the type-I-error of the procedures, before showing that the procedures are asymptotically consistent under alternatives.

Asymptotics under the null hypothesis
In this section, we give regularity conditions under which we can prove the limit distribution for the above statistics. For the CUSUM statistic this was already done in Kirch and Kamgaing (2015). It turns out that we can derive the null asymptotics for the mMOSUM, the Page-CUSUM as well as the MOSUM with $h_4/m \to \beta > 0$ under the same set of regularity conditions. In particular, the limit results carry over to all the examples given in Kirch and Kamgaing (2015) as briefly introduced in the previous section. Unlike Chen and Tian (2010) we do not use a weight function for the mMOSUM that depends on the bandwidth, although such results can also be obtained with the methods below. Their choice of weight function seems to result in a very nice limit distribution in their paper, which is however only due to an error in the proof. More precisely, the second formula in the proof of Theorem 3.1 in Chen and Tian (2010) is not correct: An application of the invariance principle correctly gives an approximation (in our notation, where $\{\varepsilon_t\}$ are the innovations in a linear regression model with variance $\sigma^2$) in terms of standard Wiener processes $W_{1,m}(\cdot)$, whereas Chen and Tian (2010) use a different process. However, the two processes are distributionally different, as their covariance structures differ due to the independent increments of a Wiener process. As one consequence, we can no longer substitute $k - \lfloor k h_2 \rfloor$ with $s$ (as in the line after (20) in the proof of Theorem 3.2 in Chen and Tian (2010)), so that it is no longer beneficial to use the weight function $w(m, k - \lfloor k h_2 \rfloor)$. Therefore, we decided to return to the usual shape of the weight function; after all, the bandwidth in this case is merely a way of using only more recent observations for later comparisons.
For the MOSUM statistic with $h_4/m \to 0$ we get related but somewhat different regularity conditions. As an example, we prove their validity for a linear regression model, thus extending the existing results from the location to a linear regression model (see Section 2.2).
In the remainder of the paper, we will give some high-level regularity conditions under which we can derive the asymptotic behavior of the proposed monitoring schemes both under the null and under alternative hypotheses. In particular, some effort is needed to verify those regularity conditions for a given change point problem and a given set of estimating/monitoring functions. Fortunately, the regularity conditions RC.1, RC.2 and RC.3 are the same that have been used in Kirch and Kamgaing (2015) for the CUSUM monitoring scheme. In fact, these regularity conditions have already been verified by various authors in many different situations, including all examples given in Section 1.2. A detailed collection of these papers is given in Kirch and Kamgaing (2015), Section 6, where simulation results and a data analysis for various examples and the CUSUM scheme can also be found. The formulation below via regularity conditions has the important advantage that a statistician who develops a new procedure in the above spirit for a new setting will immediately have all of the above monitoring schemes at their disposal (with some additional effort needed only for the MOSUM with $h_4/m \to 0$).

Regularity Condition RC.1.
(a) The partial sum processes of $H(X_t, \theta_0)$ and $G(X_t, \theta_0)$ jointly fulfill a functional central limit theorem with limiting Wiener processes $W_1$ and $W_2$, whose covariance matrices are denoted by $\Sigma_1$ and $\Sigma_2$ and whose cross-covariance is denoted by $C$.
(b) A Hájek-Rényi-type inequality is fulfilled at the beginning of the monitoring period.
(c) For the open-end procedure, a Hájek-Rényi-type inequality is fulfilled for any $k_m > 0$ uniformly in $m$.
This set of regularity conditions ensures that the limit distribution is an appropriate functional of a Wiener process, where (b) is needed to control the behavior at the very beginning of the monitoring period, while (c) is needed to control the behavior at the infinite end of the monitoring period.
The regularity condition in (a) holds for example under mixing conditions on $\{X_t\}$. Obviously $H(X_t, \theta_0)$ and $G(X_t, \theta_0)$ will typically be highly correlated (often even being equal), but the actual form of $C$ is not important (with the exception of the MOSUM procedure with fixed $\beta$ and an overlap of the monitoring and historic data set, see Theorem 2.3 and Remark 2.1). In all other cases, the limit only depends on the joint convergence of two normalized partial sums, which are asymptotically independent (under the above condition) due to the independent increments of a Wiener process.
Because of the different structures for the CUSUM, mMOSUM and Page-CUSUM on the one hand, and MOSUM on the other hand, slightly different regularity conditions are needed for the latter case (if $h_4/m \to 0$). Thus, we first discuss the situation of the first three (where the results for the CUSUM can already be found in Kirch and Kamgaing (2015) and will not be repeated), before giving related but slightly different assumptions for the MOSUM statistic.

Modified MOSUM and Page-CUSUM
Regularity Condition RC.2.
(a) The weight function is given in terms of a function $\rho$ of $k/m$, with a case distinction at a threshold $a_m$ satisfying $a_m/m \to 0$ as $m \to \infty$. In addition we need that $\rho : [0, \infty] \to \mathbb{R}_+$ is a positive and continuous function fulfilling a growth condition near zero that involves an exponent $\gamma$.
(b) For the open-end procedure we additionally need a condition on the behavior of $\rho$ at infinity.
This set of regularity conditions is concerned with the regularity of the weight function, which ensures that we can control the type-I error even in the open-end case. The case distinction in (a) is quite useful to allow for false alarms within the very first observations after the monitoring starts (compare with Figure 1). A standard weight function often used in the literature fulfills these assumptions.
Regularity Condition RC.3. The following approximation holds under $H_0$, where $N(m)$ is the observation horizon and can be infinite: the partial sum process of the monitoring function evaluated at the estimator can be approximated by an expression in terms of some $\theta_0$ and a suitable $B(\theta_0)$, where $\gamma$ is as in RC.2.
This assumption allows us to replace the partial sum process including the estimated parameter value with the appropriate partial sum process with the true or best approximating parameter $\theta_0$. The second summand may be surprising at first, but it accounts for the uncertainty from using the estimator rather than the true value. A very simple example is the location model. For sufficiently smooth estimating and monitoring functions, this condition can be derived by a Taylor expansion under weak moment conditions, with $B(\theta_0)$ expressed through the gradients of $H$ and $G$, where $\nabla$ is the gradient for a vector-valued function $F = (F_1, \ldots, F_d)^T : \mathbb{R}^d \to \mathbb{R}^d$ defined by $\nabla F = (\nabla F_1, \ldots, \nabla F_d)$ and $\nabla F_1$ denotes the standard gradient. Details are given in Kirch and Kamgaing (2015), Section 5. In particular, in many important cases it holds that $\Sigma_1 = B(\theta_0)\Sigma_2 B(\theta_0)^T$, which is important in view of Theorem 2.2 below. This includes the two standard situations that the monitoring function is either given by the estimating function or by a projection onto one particular component of the estimating function (such as estimated residuals if e.g. a least squares estimator is used in a linear regression context). Most monitoring schemes developed in the literature are of this simpler type. Exceptions are the monitoring schemes based on estimated residuals in a non-linear setting proposed by Ciupera (2013) as well as the robust monitoring scheme combined with a more precise (but not robust) estimating function in Kirch and Kamgaing (2015), Section 6.2.
Under these assumptions we can derive the following null asymptotics for the mMOSUM $\Gamma_2$ and the Page-CUSUM $\Gamma_3$:
Theorem 2.1. Let Regularity Conditions RC.1a), RC.2a) as well as RC.3 be fulfilled. Then, we get limit distributions (for $m \to \infty$ and fixed bandwidth $h_2 > 0$) under the null hypothesis for any symmetric positive semidefinite matrix $A$, with the notation as in (2).
The Wiener processes in the above theorem are different from the ones in RC.1a). In fact, the Wiener processes in the above limit are given by $\{W_1(s - 1) : s \ge 1\}$ and $B(\theta_0) W_2(1)$ (where $W_j$ are now as in RC.1a)). The covariance structure as given in the above theorem is thus obtained from the independent increment property.
As explained beneath RC.3, in many important cases it holds that $\Sigma_1 = B(\theta_0)\Sigma_2 B(\theta_0)^T$. In this case the limit for a weight function as in (3) simplifies because it can then be expressed as a supremum over $[0, 1]$ (rather than over an unbounded set).
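For the location model with $H = G$ and $G(x, \mu) = x - \mu$, the decomposition behind RC.3 can be written out exactly. The following worked equation is a sketch under the assumption that the partial sum process is $S(m,k) = \sum_{j=m+1}^{m+k} H(X_j, \hat\mu_m)$ with $\hat\mu_m = \bar X_m$; it follows from elementary algebra and shows where the second summand comes from.

```latex
\begin{align*}
\sum_{j=m+1}^{m+k} H(X_j, \hat\mu_m)
  &= \sum_{j=m+1}^{m+k} \bigl(X_j - \bar X_m\bigr) \\
  &= \sum_{j=m+1}^{m+k} (X_j - \mu) \;-\; k\,\bigl(\bar X_m - \mu\bigr) \\
  &= \sum_{j=m+1}^{m+k} H(X_j, \mu) \;-\; \frac{k}{m} \sum_{t=1}^{m} G(X_t, \mu).
\end{align*}
```

In this special case the identity holds without any remainder term and with a scalar factor equal to one in front of the historic sum; the factor $k/m$ reflects that the estimation error of $\bar X_m$ enters each of the $k$ monitored summands.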
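To make the role of the weight function concrete, the following sketch implements a weight of the shape that is widely used in the sequential change point literature since Horváth et al. (2004), namely $w(m,k) = m^{-1/2}\rho(k/m)$ with $\rho(t) = (1+t)^{-1}\,(t/(1+t))^{-\gamma}$ and $0 \le \gamma < 1/2$. That this is the exact form intended under RC.2, the placement of the factor $m^{-1/2}$, and the cut-off at $a_m$ are assumptions made for illustration only.

```python
import math

def rho(t, gamma):
    """Assumed shape rho(t) = (1 + t)^(-1) * (t / (1 + t))^(-gamma)."""
    return (1.0 + t) ** (-1.0) * (t / (1.0 + t)) ** (-gamma)

def weight(m, k, gamma=0.0, a_m=0):
    """Assumed weight w(m, k) = m^(-1/2) * rho(k / m) for k > a_m, else 0."""
    if k <= a_m:
        return 0.0            # no alarm possible in the first a_m time points
    return rho(k / m, gamma) / math.sqrt(m)

w_plain = weight(m=100, k=100)                  # gamma = 0
w_early = weight(m=100, k=3, gamma=0.25, a_m=5)
w_sensitive = weight(m=100, k=10, gamma=0.25)   # larger weight for early k
```

A larger $\gamma$ increases the weight for small $k/m$ and hence the sensitivity to early changes, while the factor $(1+t)^{-1}$ controls the behavior for large $k$ in the open-end case.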

MOSUM
Since the MOSUM statistic has a different weight function, we need to replace Regularity Condition RC.2:

Regularity Condition RC.4. The weight function has the following form
where $\rho$ is a bounded and continuous function. Depending on the procedure, additional assumptions on the behavior of $\rho$ at infinity have to be made that will be specified in the theorems below.
For the MOSUM procedure one needs to distinguish two main cases: the case where the bandwidth $h_4$ is of the same order as the historic data set, i.e. when $h_4 = \beta m + o(m)$ for some $0 < \beta \le 1$, and the situation where it is much smaller, i.e. $h_4/m \to 0$.
In the first case, we can use the same regularity conditions on the time series as before; in the latter case, somewhat different regularity conditions are needed. In particular, the assertion of Theorem 2.3 below holds true for all the examples of Section 1.2, see Section 6 in Kirch and Kamgaing (2015) for proofs of the regularity conditions.
Theorem 2.3. Let Regularity Conditions RC.1 (a), RC.3 with $\gamma = 0$ and RC.4 hold. Then, we get the following limit theorem under the null hypothesis and for $h_4 = \beta m + o(m)$.
a) Closed-end procedure: For any symmetric positive semi-definite matrix $A$, the supremum of the weighted MOSUM detector converges in distribution as $m \to \infty$.
b) Open-end procedure: If additionally RC.1(c) holds and $\rho$ fulfills RC.2 (b), then the corresponding limit result holds for the infinite observation horizon.
Here, $\{(W_1(t), W_2(t)) : t \ge 0\}$ is a Wiener process whose covariance matrix is built from $C$, $\Sigma_j$, $j = 1, 2$, as in Assumption RC.1a) and $B(\theta_0)$ as in RC.3. The matrix $A$ in $\Gamma_4 = \Gamma_{4,A}$ can be replaced by a consistent estimator.
Remark 2.1. If $\rho(t) = 0$ for $t < 1$, i.e. if the supremum is only taken over $k \ge h_4$, then due to the independent increment property of Wiener processes the limit simplifies (with possibly $N = \infty$) to an expression in independent Wiener processes $\{W_1(\cdot)\}$ and $\{W_2(\cdot)\}$ with covariance matrices $\Sigma_1$ and $B(\theta_0)\Sigma_2 B(\theta_0)^T$, respectively. Under RC.1(b) and if $\rho$ fulfills the assumption in RC.2 (a) for $t \to 0$, one can also use the lower bound $\max(m + 1, m + k - h_4 + 1)$, resulting also in a non-overlapping MOSUM.
The condition RC.2 (b) is too strict for some weight functions that have been proposed in the literature in the case of $h_4/m \to 0$. If one wants to relax it, more moments of the process are required in order to get a proper limit distribution for these weight functions (see Horváth et al. (2008) for an extensive discussion).
In order to make this more precise, we need the following stronger assumptions:
Regularity Condition RC.5. Let $\{X_t\}$ (under $H_0$) be stationary and assume there exists (possibly after changing the probability space) a Wiener process $\{W_1(t), 0 \le t < \infty\}$ with covariance matrix $\Sigma_1$ that approximates the partial sum process of $H(X_t, \theta_0)$ at a rate determined by a parameter $\nu$ (5).
The value of $\nu$ usually corresponds to the number of moments of $H(X_t, \theta_0)$. The larger it is, the closer the partial sum process is already to a Wiener process. We will see that this allows us to use weight functions that tend to zero more slowly at infinity. Invariance principles as above have been proven for many different time series, e.g. they hold for mixing time series $\{X_t\}$ with sufficiently fast mixing rate (see also Kirch and Kamgaing (2015)).
The following regularity condition is somewhat different from the one in RC.3 and has, to the best of our knowledge, only been verified for mean changes. In Section 2.2 we prove its validity, as an example, for the linear regression case with least squares estimation.
Regularity Condition RC.6. The following approximation holds under $H_0$, where $N(m)$ is the observation horizon and can be infinite: with $\nu$ as in RC.5.
We can now also state the assertion for h 4 /m → 0.
where $\{W(t) : t \ge 0\}$ is a Wiener process with covariance matrix $\Sigma_1$. The matrix $A$ in $\Gamma_4 = \Gamma_{4,A}$ can be substituted by a consistent estimator.
Remark 2.2.If ρ fulfills RC.2 (b) we can replace RC.5 by RC.1(a) and (c) and need RC.6 with ν = 1 in addition to stationarity under H 0 with an analogous proof to the proof of Theorem 2.1.
Because the above theorem still excludes some weight functions that have been proposed in the literature, such as $\rho(t) = \max(1, \log(1 + t))^{-1/2}$, even in the closed-end procedure, Horváth et al. (2008) suggested using an $h_4$-closed-end procedure in the sense that the observation horizon ends after $N h_4$ observations for some finite $N$. In this case, we get the following limit distribution:
Corollary 2.1. Let RC.1 (a), RC.4 as well as RC.6 (with any $\nu$) hold. Then we get a limit distribution for the supremum of the weighted MOSUM detector under the null hypothesis for any symmetric positive definite matrix $A$, where $\{W(t) : t \ge 0\}$ is a Wiener process with covariance matrix $\Sigma_1$. The matrix $A$ in $\Gamma_4 = \Gamma_{4,A}$ can be substituted by a consistent estimator.

Linear regression model
In this section, we will prove the above regularity condition RC.6 for the MOSUM procedure with $h_4/m \to 0$ for the linear regression model as in Section 1.2.1, showing that this condition can in principle be extended to other situations as well. Regularity condition RC.5 on the stochastic process has already been used by Kirch and Kamgaing (2015) and shown to be valid in a large variety of situations.
Theorem 2.5. Let $h_4/m \to 0$ and the errors $\{\varepsilon_i\}$ be i.i.d. and independent of the regressors $\{x_i\}$. Furthermore, let $\{x_t\}$ be stationary and satisfy a regularity condition for a positive definite matrix $C$ and some $\tau > 0$. Then, we get for the functions $G$ and $H$ as in Section 1.
where $c_1$ is the first column of the matrix $C$ and $W$ is a Wiener process with some covariance matrix $\Sigma$. Then RC.6 also holds for the closed-end procedure with time horizon $Nm$ as well as the open-end procedure.

Consistency under alternatives
In this section, we will show that under mild conditions all monitoring schemes will stop in finite time with probability approaching one as $m \to \infty$ if a change occurs. This means that the corresponding testing procedures have asymptotic power one. Under the regularity conditions below, Kirch and Kamgaing (2015) have shown this property for the CUSUM statistic. The same holds true for the other three statistics. For notational ease, we denote the time series before the change by $\{Z_t\}$ and the time series after the change by $\{Y_t\}$.
Regularity Condition RC.7.
(a) The time series before the change fulfills an approximation as under the null hypothesis.
(b) $\{Y_t\}$ is stationary and independent of $\hat\theta_m$, and a convergence condition holds as $l \to \infty$.
The remaining conditions (c), (d) and (e) concern the detectable changes and the weight function: the change point is bounded relative to $m$ by some $\lambda$ (for statistics $\Gamma_j$ with $j = 1, 2, 3$) and $k^*/h \le \lambda$ for the MOSUM statistic $\Gamma_4$ for some $\lambda > 0$. Furthermore, there exists a ball $U(x_0)$ around $x_0$ with $x_0 > \lambda$ and $\rho(x) \ge c > 0$ for $x \in U(x_0)$ (where we set $\rho(x) = 0$ if $xm$ (resp. $xh$ in the MOSUM case) is larger than the observation horizon).
Regularity Condition RC.7 (a) follows from the results of the previous section if $\{Z_t\}$ fulfills the assumptions under the null hypothesis. Condition (b) is stronger than necessary but unifies the treatment for all schemes. If this assumption is not fulfilled, it can be replaced by a corresponding condition, where the lower and upper limit of the sum depend on the monitoring scheme at hand (see Kirch and Kamgaing (2015), Section 4 in Weber (2017) or the proofs below for details). In fact, we neither need the independence of the time series after the change $\{Y_t\}$ from the estimator $\hat\theta_m$ (or the time series before the change $\{Z_t\}$), nor do we need the stationarity of $\{Y_t\}$, allowing e.g. for starting values from $\{Z_t\}$ in an autoregressive setup. This is discussed in detail in Section 5.2 in Kirch and Kamgaing (2015), where among other things this is shown for mixing sequences in combination with some smoothness assumptions on $H$. Conditions (d) and (e) are conditions on the weight function in connection with the location of the change but are rather mild.
Condition (c) is the key condition in the sense that it tells us what changes can be detected with a given monitoring function H and a given matrix A.

Theorem 2.6. Let the alternative hypothesis hold in addition to regularity conditions RC.7 (a)-(c). For the closed-end-procedure let assumption (d) be fulfilled, for the open-end procedure either (d) or for all statistics except the MOSUM statistic (e). Then it holds (as
for all $j = 1, \ldots, 4$ for the closed-end procedures with $N(m) = Nm + 1$ as well as the open-end procedures with $N(m) = \infty$ (with the exception of the MOSUM statistic). The matrix $A$ in $\Gamma_j = \Gamma_{j,A}$ can be substituted by a consistent estimator.
Remark 2.3. Because the sums in the MOSUM statistic have fixed length, we cannot allow for an arbitrarily late change. However, if $k^* = o(h^{1+\nu/2})$ and $\liminf_{x\to\infty} x^{1/\nu}\rho(x) > 0$, then it is detectable for any MOSUM procedure with observation horizon larger than $k^* + h$.

Simulation study and data analysis
In this section, we aim at illustrating the differences between the discussed monitoring schemes by some simulations but also by a comparative application to two real data sets. For this purpose we use the well-known examples of a mean change and a linear regression model because even in this case no comparative simulation study involving all four monitoring schemes exists in the literature (to the best of our knowledge). We refer the reader who is interested in the performance of this methodology in the different statistical situations discussed in Section 1.2 to Kirch and Kamgaing (2015), Section 6, where simulation results as well as data analyses for the CUSUM statistic and many examples beyond linear regression are given in detail.

Simulation study
In this section we compare all four monitoring schemes in terms of their size, power and run length, i.e. the time until an alarm is given. Because a higher empirical size leads to a better power and shorter run length, we can only compare the power in a meaningful way when we fix the size. For the comparison we therefore use the size-corrected power and size-corrected run length, i.e. the power and run length corresponding to the true (not nominal) size $\alpha$. Furthermore, we give a kind of density estimator of the run length, which we scale so that it integrates to the size-corrected power (rather than one). One can think of this as a probability distribution with a continuous part as given by our density estimate and a discrete part at infinity (indicating that the procedure never rejected).
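The size correction can be sketched as follows. This is only an illustration and not the code used for the study: it assumes that each simulation run is summarized by the maximum of its normalized detector over the monitoring period, and the function names are hypothetical.

```python
def size_corrected_critical_value(null_max_stats, alpha):
    """Empirical critical value such that at most a fraction alpha of the
    null runs exceed it (assumption: each run is summarized by the maximum
    of its normalized detector over the monitoring period)."""
    ordered = sorted(null_max_stats)
    n = len(ordered)
    n_reject = int(alpha * n)           # number of null runs allowed to reject
    return ordered[n - n_reject - 1]    # runs strictly above this value reject


def size_corrected_power(null_max_stats, alt_max_stats, alpha):
    """Fraction of runs under the alternative that exceed the empirical
    critical value obtained from the null runs."""
    crit = size_corrected_critical_value(null_max_stats, alpha)
    rejections = sum(1 for s in alt_max_stats if s > crit)
    return rejections / len(alt_max_stats)
```

Replacing the asymptotic critical value by this empirical one puts all procedures at the same true size, so that differences in power are not merely a consequence of differences in size.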
All empirical results are based on a training period of length $m = 100$, a monitoring period of 200 and 1000, 2500 repetitions and standard normal errors. For the CUSUM, Page-CUSUM and mMOSUM we use the boundary function (3) and for the MOSUM we use the one which has been proposed by Horváth et al. (2008) and Aue et al. (2008) (with the limit as in Corollary 2.1). For the MOSUM statistic $N$ is chosen so that we match the monitoring length of 200 resp. 1000 (i.e. $Nh = 200$ resp. $Nh = 1000$), while for the other three statistics the critical value for the open-end procedure is used. Because we are mainly interested in a comparison of all four monitoring schemes, we use the location model $X_t = \mu + \Delta 1_{\{t > k^*\}} + \varepsilon_t$ with $\mu = 0$ and $\Delta = 0$. The procedures are based on $G(X_t, \mu) = H(X_t, \mu) = X_t - \mu$. We use the true error variance $\sigma^2 = 1$.
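To make this setup concrete, the following sketch generates data from the location model and evaluates the four partial-sum detectors for $H(X_t, \mu) = X_t - \mu$ at a single monitoring time $k$. It is an illustration under assumptions, not the code used for the study. The detector definitions follow the verbal description only (the CUSUM sums all monitoring observations, the Page-CUSUM maximizes over all lower bounds, the mMOSUM starts after the first $\lfloor hk \rfloor$ observations, the MOSUM uses a fixed window). The normalization and the weight function are omitted, the MOSUM window is truncated at the start of monitoring for $k < h$, and the change point `k_star` is counted from the start of monitoring.

```python
import math
import random


def simulate_location_model(m, n_monitor, k_star, delta, mu=0.0, seed=0):
    """X_t = mu + delta * 1{t > m + k_star} + eps_t with standard normal errors.
    Assumption: k_star is counted from the start of monitoring."""
    rng = random.Random(seed)
    return [mu + (delta if t > m + k_star else 0.0) + rng.gauss(0.0, 1.0)
            for t in range(1, m + n_monitor + 1)]


def detectors(x, m, k, h_mmosum=0.4, h_mosum=20):
    """Unnormalized detectors at monitoring time k (1 <= k <= len(x) - m)."""
    mu_hat = sum(x[:m]) / m                      # initial estimate from historic data
    prefix = [0.0]                               # prefix[i] = sum of first i residuals
    for value in x[m:m + k]:
        prefix.append(prefix[-1] + (value - mu_hat))
    lower_mmosum = math.floor(h_mmosum * k)      # mMOSUM discards the first floor(h*k) terms
    lower_mosum = max(k - h_mosum, 0)            # assumption: truncate window at monitoring start
    return {
        "CUSUM": abs(prefix[k]),
        "Page-CUSUM": max(abs(prefix[k] - prefix[i]) for i in range(k + 1)),
        "mMOSUM": abs(prefix[k] - prefix[lower_mmosum]),
        "MOSUM": abs(prefix[k] - prefix[lower_mosum]),
    }
```

For example, `detectors(simulate_location_model(100, 200, 50, 1.0), m=100, k=120)` returns the four values at monitoring time 120. Comparing them with a critical curve additionally requires the boundary function and the critical values, which are not reproduced here.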
Table 1 reports the empirical sizes for the different procedures. The MOSUM procedure holds the size quite nicely. All other procedures are conservative, in particular for smaller values of $\gamma$, with the exception of the mMOSUM with a large bandwidth $h_2 = 0.9$ and $\gamma \neq 0$. The problem in this case is that for the first 10 observations the detector sum only depends on one observation, namely $X_{m+k}$, which is quite volatile (and asymptotically negligible, hence not taken care of by the asymptotic considerations). For $\gamma = 0$ this is not a problem, but other values of $\gamma$ put more weight on those early detector values, resulting in too many false positives right after monitoring starts. This can clearly be seen by the false positives at the very beginning of the monitoring sequence, as shown (circled) in Figure 2 (c) and (d). A simple solution is to wait for $a_m$ observations, e.g. $a_m = \sqrt{m} + 1$, before starting the monitoring (which is permitted by Assumption RC.2). This solves the problem according to our empirical results given in the same table. Furthermore, the empirical sizes of the MOSUM statistics are close to their nominal size. The other procedures use the critical values for the open-end procedure, so they are in particular conservative for the smaller monitoring period. Furthermore, the asymptotic critical value is based on a supremum while the actual critical values (for normal errors) are based on a finite maximum, which shows that even with the correct monitoring length the asymptotic critical values are conservative. Similarly, the plot of the run lengths in Figure 2 clearly shows a tendency of the mMOSUM with $\gamma = 0.25$ and $h_2 = 0.9$ to raise a significant number of false alarms at the very beginning of the monitoring, an effect that disappears if one starts monitoring only with a delay of $a_m$ observations (see Figure 1). From these plots, one can clearly see that the MOSUM procedures also have some trouble with early (and, in contrast to the mMOSUM, not only very early) false alarms.
While for a historic data length of $m = 100$ in this simple mean change situation the asymptotic approximation yields reasonable results, for smaller historic lengths, more dependent data or more complicated situations the asymptotic distribution may not yet provide good enough approximations in practice. In these cases, bootstrap methods can be helpful but have to be tailored to the particular situation at hand. Kirch (2008) discusses several sequential bootstrap schemes for the mean change problem with i.i.d. errors, while Hušková and Kirch (2012) discuss the linear regression case. Both papers prove validity of the proposed method for the CUSUM statistics.
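The effect of a large bandwidth on the early detector values can be checked with a small calculation. Assuming, as in the verbal description, that the mMOSUM sum at monitoring time $k$ runs over the observations $m + \lfloor h_2 k \rfloor + 1, \ldots, m + k$ (the helper name below is hypothetical), the number of summands is $k - \lfloor h_2 k \rfloor$:

```python
import math


def mmosum_window_size(k, h):
    """Number of observations in the mMOSUM sum at monitoring time k,
    assuming the sum starts after the first floor(h * k) monitoring observations."""
    return k - math.floor(h * k)


# Window sizes for the large bandwidth h = 0.9 at monitoring times k = 1, ..., 12.
sizes = [mmosum_window_size(k, 0.9) for k in range(1, 13)]
```

Under this assumption the window contains a single observation for $k = 1, \ldots, 10$ and two observations from $k = 11$ on, which matches the remark that the first 10 detector values depend on $X_{m+k}$ only.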
Fig 2: Run-length-plot (level at 5%)
Table 2 reports the size-corrected power for all procedures. First of all, somewhat surprisingly, the use of $\gamma = 0$ is best in all situations, possibly because the other choices of $\gamma$ only yield an advantage for an almost immediate change point. This can best be seen by noting that all detectors reject as soon as they are larger than the corresponding critical curve $c_j/w_j(k; \gamma)$. For a given detector $j$, comparing the critical curves for different values of $\gamma$, one sees that at the very beginning the curves for $\gamma \neq 0$ are below the one for $\gamma = 0$ (which also explains the problem with the false positives at the very beginning in these cases), but very early they cross and the one with $\gamma = 0$ is beneath the others. Because of this clear superiority we only report results for $\gamma = 0$ for the longer monitoring period.
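The crossing of the critical curves can be made explicit under an assumption on the boundary function, which is not reproduced in this section. Suppose, purely as a hypothetical example, that the critical curve has the form $c_\gamma \sqrt{m}\,(1 + k/m)\,(k/(m+k))^{\gamma}$ with critical values $c_0 < c_\gamma$. The curve for $\gamma > 0$ then lies below the curve for $\gamma = 0$ exactly as long as $(k/(m+k))^{\gamma} < c_0/c_\gamma$. The critical values passed to the function below are placeholders chosen for illustration, not values from the paper.

```python
def crossing_time(m, c_zero, c_gamma, gamma):
    """Monitoring time k at which the curve for gamma > 0 crosses the curve for
    gamma = 0, assuming curves of the form c * sqrt(m) * (1 + k/m) * (k/(m+k))**gamma
    and c_zero < c_gamma (both are assumptions, see the lead-in)."""
    fraction = (c_zero / c_gamma) ** (1.0 / gamma)   # value of k / (m + k) at the crossing
    return m * fraction / (1.0 - fraction)


# Placeholder critical values, for illustration only.
k_cross = crossing_time(m=100, c_zero=1.0, c_gamma=2.0, gamma=0.5)
```

With these placeholder values the curves cross at $k = m/3$. Before that time the curve for $\gamma = 0.5$ is the lower one, afterwards the curve for $\gamma = 0$ is.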
The mMOSUM detector with $h_2 = 0.4$ and, somewhat less so, the Page-CUSUM perform very well in all situations and are only outperformed by an mMOSUM with a bandwidth $h_2$ chosen according to whether an early, medium or late change occurs. To elaborate: for later changes the CUSUM detector contains more null observations than the mMOSUM detector (which disregards the first $h_2 k$ observations). Therefore, more contaminated observations need to be included in the case of the CUSUM before they dominate the detector enough to become significant. This fact was precisely the motivation for introducing the mMOSUM as well as the Page-CUSUM detectors. In a sense, the bandwidth $h_2$ regulates how many observations (as a percentage of the current position) are discarded. Consequently, the 'best' choice of $h_2$ depends on the (unknown) position of the change point, i.e. the later the change is expected, the larger $h_2$ should be chosen. In fact, this is confirmed by the simulation results in Table 2, where the best power for the different change scenarios is given in bold. At the same time, one can see that even in cases where they do not provide the best detection rate, the mMOSUM with $h = 0.4$ (and somewhat less so the Page-CUSUM) do give satisfactory results, so that those are a good compromise if the user is uncertain about when changes are most likely to occur. This is also confirmed by the run-length plots in Figures 2 and 1, which are rescaled so that the area under the curve equals the detection rate. Furthermore, these plots clearly show that the two MOSUM procedures have a quicker detection time than the other statistics, but at the cost of a much lower detection rate (and a somewhat higher false detection rate, in particular at the early stages of the monitoring). While a smaller choice of $h_4$ gives the faster detection, for this choice the loss of detection power is dramatic. This effect also occurs in the a-posteriori usage of MOSUM procedures, which are mainly useful in the context of multiple change point estimation and much less so for testing (see also Eichinger and Kirch (2018)).
Furthermore, the MOSUM procedure has the lowest power and, more importantly, the power stays well away from one even if a longer monitoring period is used, where all other procedures have an empirical (size-corrected) power of almost 1, meaning they may take a while but eventually detect all change points. The MOSUM procedure, on the other hand, misses a significant number of change points, but those that are detected are detected quite quickly.

Data analysis
In this section, we analyze two data sets that have already been discussed in the literature to compare the performance of the four different monitoring schemes.
One data example shows a mean change and thus illustrates the performance in the location model. The second data set involves a linear regression model with a change in the error variance but no change in the regression coefficients, allowing us to get an impression of the behavior both under the null hypothesis and under alternatives.
We first analyse the Boston Homicide data set contained in the R package strucchange (see Zeileis et al. (2002)), which contains the monthly number of youth homicides in Boston. In early 1995 a policing initiative, the Boston-Gun-Project, was started in order to lower the youth homicides. The so-called 'Operation Ceasefire' began in the late spring of 1996. Zeileis (2006) analyzed this data set in an a-posteriori change point setting not too different from ours, showing that indeed a change occurred around that time. While the data is count data that could well be modelled by a Poisson model, we will not make use of this fact in this analysis, although the theory allows for this (see also Section 6.5 in Kirch and Kamgaing (2015) or Section 10.4 in Kirch and Kamgaing (2016)). Instead we simply use it as input sequence for the location model from Section 1.2.2 based on the least-squares estimating and monitoring functions that have already been used in the previous section. The asymptotic theory derived in this paper allows the use of discrete errors in that manner, with the drawback that the corresponding procedure is not optimal in this situation but the advantage of not having to specify and justify a particular count time series model. In the same spirit, it illustrates the applicability of the proposed methods to non-Gaussian data.
Figure 3 gives the time series as well as the corresponding detectors $\Gamma_j(m, k)/(c_j w_j(k))$ (for $\gamma = 0$). This normalization is chosen for easier visual comparison of the different procedures, as in all cases the null hypothesis is rejected as soon as the line at level 1 is crossed. The change should occur around observation 50 in the time series and we consider historic data lengths of 24 (late change), 36 (medium late change) and 48 (early change).
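The stopping rule behind the normalized plots can be written down directly. The sketch below assumes that the normalized detector values shown in the figure are already available as a list indexed by monitoring time; the function name is hypothetical.

```python
def first_alarm(normalized_detector):
    """Return the first monitoring time k (1-based) at which the normalized
    detector exceeds 1, or None if no alarm is raised within the monitoring period."""
    for k, value in enumerate(normalized_detector, start=1):
        if value > 1.0:
            return k
    return None
```

A return value of `None` corresponds to a procedure that does not reject within the observed monitoring period, i.e. to the discrete part at infinity of the run-length distribution.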
Only the mMOSUM (with $h_2 = 0.4$) detects the change point within the monitoring period in all cases, where the detection is quicker the earlier the change occurs within the monitoring period. The MOSUM ($h_4 = 7$), CUSUM and Page-CUSUM detect the change point within the monitoring period only for $m = 48$, where in this case the MOSUM has the quickest detection (only barely reaching significance).
As a second example we apply our methodology to a data set from finance, where the effective exchange rate regime is often analysed by a linear regression model on other currencies. This yields valuable insight into whether the currency is allowed to fluctuate freely by market forces or whether it is to all or some extent fixed to some other currency. In our case, we consider the daily log returns of the Chinese Yuan Renminbi (CNY) and use a basket of western currencies as regressors. This is motivated by the fact that the CNY used to be fixed to the US Dollar but has since been announced to be allowed to move more freely. The same data set has been used by Zeileis et al. (2010), where much more information on the model, the motivation behind it as well as the data set can be found.
More precisely, we apply a linear regression setup to the daily log returns of the Chinese Yuan Renminbi (CNY) from July 26, 2005 to July 31, 2009, where we use the daily log returns of the following currencies as regressors: USD (US Dollar), JPY (Japanese yen), EUR (Euro), GBP (British pound). In the analysis of the data set in Zeileis et al. (2010) no change in the regression coefficients was found, but a change in the residual variance was. For this reason, we use the procedures based on the functions $G$ and $H$ from Section 1.2.1 to find changes in the regression coefficients (or actually the mean of the residuals) as well as $G$ and $H$ from Section 1.2.1 to find variance changes in the residuals. The results for a historic period of length $m = 40$ can be found in the corresponding figure. The CUSUM, Page-CUSUM and mMOSUM ($h = 0.4$) give an alarm around 90, where the increase starts around 70. This coincides with a period of smaller variability between 70 and 100, as can be seen in panel (a). The mMOSUM with $h = 0.9$ gives an alarm around 120, which corresponds to a series of observations with large variability shortly before 120. The other statistics are not influenced by this because it is balanced out by the relatively small variability from before. This is also the reason why the three previously significant statistics drop under the significance line again at that point. Around 160 all monitoring schemes start to increase again and all but the CUSUM eventually become significant during the observation period. This is due to the large variability at the end of the monitoring period, where the mMOSUM ($h = 0.9$) is fastest, followed by the mMOSUM ($h = 0.4$), the Page-CUSUM, the mMOSUM ($h = 0.1$) and the MOSUM ($h = 20$). Only the CUSUM is not yet significant because all residuals from 41 to the present are used, so that it takes much longer for a late change to be detected.
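As a rough sketch of how a variance change in the residuals can be monitored, the following code builds a monitoring sequence from centered squared residuals and accumulates it in a CUSUM fashion. The specific choice $e_i^2 - \hat\sigma_m^2$, with $\hat\sigma_m^2$ the mean squared residual of the historic period, is an assumption made for illustration and need not coincide with the functions $G$ and $H$ used in the paper. The residuals are assumed to be given, and the function names are hypothetical.

```python
def variance_monitoring_sequence(residuals, m):
    """Centered squared residuals e_i^2 - sigma2_hat for i > m, where sigma2_hat is
    the mean squared residual of the historic period (an assumed, simple choice)."""
    sigma2_hat = sum(e * e for e in residuals[:m]) / m
    return [e * e - sigma2_hat for e in residuals[m:]]


def cusum_path(sequence):
    """Absolute partial sums |sum_{i<=k} sequence[i]| for k = 1, ..., len(sequence)."""
    path, total = [], 0.0
    for value in sequence:
        total += value
        path.append(abs(total))
    return path
```

Because the CUSUM path always sums from the first monitoring observation on, a stretch of small variability early in the monitoring period has to be compensated before a later increase in variability becomes visible, which is consistent with the late reaction of the CUSUM described above.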
Both examples confirm the balanced behavior of the mMOSUM with h = 0.4 and to a somewhat lesser degree the Page-CUSUM detector making them preferable in many situations.

Proofs
Because the proofs are in parts similar, we do not provide all details of all proofs, but restrict our attention to the key steps. Missing details can be found in Weber (2017), Part I.
Proof of Theorem 2.1. By RC.2 and RC.3 it holds sup Consequently, it is sufficient to consider the limit distribution of sup With the functional limit theorem in RC.1 (a)(i) it holds for any $\tau, N > 0$ For a more general weight function $\rho$, first note that by RC.1 (b) it holds for all 0 From this we conclude for $\gamma < \alpha < \frac{1}{2}$ sup as $\tau \to 0$ (uniformly in $m$), where in the last line also RC.1 (a) was used. This completes the proof of (a)(i). For (b) note first that by RC.1 (c) it holds sup From this, RC.1 (c) and RC.2 (b) we conclude as $T \to \infty$ (uniformly in $m$). Analogously, we get for the Wiener process as in RC.1 for $T \to \infty$ uniformly in $m$ by the law of the iterated logarithm A careful combination of the above results yields assertion (b)(i).
The proof of (ii) is very similar, so we concentrate on the differences. First, by an argument similar to (8) it is sufficient to consider the limit distribution of sup The main difference in the proof is the following step: A similar argument can be used to obtain analogously to (9) sup as $\tau \to 0$ uniformly in $m$.
A major difference in the proofs occurs when dealing with $k \ge Tm$ as in (10), where the key difference is the following: $H(X_j, \theta_0) = T^{-1/4} O_P(1)$, where this follows from RC.1 (b) for the first summand and RC.1 (c) for the last three summands.
The other parts of the proof are analogous to the arguments for (i).
Proof of Theorem 2.3. The proof is analogous to the proof of Theorem 2.1 and therefore omitted.
The first summand converges to $\sup_{0<t<\infty} \rho(t)\,\|W_1(t+1) - W_1(t)\|_A$, which is well-defined by Theorem 1.2.1 in Csörgö and Révész (1981). This completes the proof.
Proof of Corollary 2.1. The proof is analogous to the proof of Theorem 2.4, where the supremum is taken over a finite stretch after substituting $k$ with $k/h_4$.
Proof of Theorem 2.5. For better readability we use $h = h_4$ in this proof. Some calculations yield With $\varepsilon_i = X_i - x_i^T \beta_m$ this implies Because $\{\varepsilon_i, 1 \le i < \infty\}$ and $\{x_i, 1 \le i < \infty\}$ are independent, it holds m i=1 Furthermore, observe that by (6) sup By the stationarity of the regressors $x_i$ we obtain sup Putting this together with (13), (14) and (15) yields assertion (a).
For (b) it is sufficient to consider by Theorem 1.2.1 in Csörgö and Révész (1981), RC.4 (a) and (c).
Proof of Theorem 2.6.We start with the proof for the mMOSUM under RC.

Theorem 2.4. Let Regularity Conditions RC.4, RC.5 and RC.6 hold. Let the weight function $\rho$ fulfill $\limsup_{t\to\infty} t^{1/\nu}\rho(t) < \infty$, $\nu$ as in RC.5. Then, we get the following limit theorem under the null hypothesis and for $h_4/m \to 0$ for the closed-end (with $N(m) = Nm$) as well as the open-end procedure with $N(m) = \infty$: For any symmetric positive semi-definite matrix $A$, we get
RC.6 holds for the closed-end procedure with time horizon $Nh_4$.
(b) If additionally the regressors $\{x_i\}$ fulfill a strong invariance principle k i=1
(e) In the open-end procedures with $j = 1, 2, 3$ for an arbitrarily late change $k^*$ it holds $\liminf_{x\to\infty} x\rho(x) > 0$.
match the monitoring length of 200 resp. 1000 (i.e. $Nh = 200$ resp. $Nh = 1000$), while for the other three statistics the critical value for the open-end procedure is used).

Table 1
Empirical size (in %) for a nominal level of 5 ( ) %, $m = 100$, where the asymptotic critical values for the open-end procedure are used with the exception of the MOSUM statistic.