High-dimensional data segmentation in regression settings permitting temporal dependence and non-Gaussianity

We propose a data segmentation methodology for the high-dimensional linear regression problem where regression parameters are allowed to undergo multiple changes. The proposed methodology, MOSEG, proceeds in two stages: first, the data are scanned for multiple change points using a moving window-based procedure, which is followed by a location refinement stage. MOSEG enjoys computational efficiency thanks to the adoption of a coarse grid in the first stage, and achieves theoretical consistency in estimating both the total number and the locations of the change points, under general conditions permitting serial dependence and non-Gaussianity. We also propose MOSEG.MS, a multiscale extension of MOSEG which, while comparable to MOSEG in terms of computational complexity, achieves theoretical consistency for a broader parameter space where large parameter shifts over short intervals and small changes over long stretches of stationarity are simultaneously allowed. We demonstrate good performance of the proposed methods in comparative simulation studies and in an application to predicting the equity premium.


Introduction
Regression modelling in high dimensions has received great attention with the development of data collection and storage technologies, and numerous applications are found in the natural and social sciences, economics, finance and genomics, to name a few. There is a mature literature on high-dimensional linear regression modelling under the sparsity assumption; see Bühlmann and van de Geer (2011) and Tibshirani (2011) for an overview. When observations are collected over time in highly nonstationary environments, it is natural to allow for shifts in the regression parameters. Permitting the parameters to vary over time in a piecewise constant manner, data segmentation, a.k.a. multiple change point detection, provides a conceptually simple framework for handling nonstationarity in the data.
In this paper, we consider the problem of multiple change point detection under the following model: we observe (Y_t, x_t), t = 1, . . ., n, with x_t = (X_{1t}, . . ., X_{pt})^⊤ ∈ R^p, where

Y_t = x_t^⊤ β_j + ε_t for θ_j < t ≤ θ_{j+1}, j = 0, . . ., q, with θ_0 = 0 and θ_{q+1} = n. (1)

Here, {ε_t}_{t=1}^n denotes a sequence of errors satisfying E(ε_t) = 0 and Var(ε_t) = σ²_ε ∈ (0, ∞) for all t, which may be serially correlated. At each change point θ_j, the vector of parameters undergoes a change such that β_{j−1} ≠ β_j for all j = 1, . . ., q. Our aim is to estimate the set of change points Θ = {θ_j, 1 ≤ j ≤ q} by estimating both the total number q and the locations θ_j of the change points.
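To make the model concrete, the short R sketch below simulates one realisation from (1); the dimensions, sparsity, change point locations and standard Gaussian design are illustrative choices for exposition, not the settings used in our experiments.

```r
# A minimal simulation from model (1): piecewise constant, s-sparse regression
# parameters with q = 2 change points (all values here are illustrative).
set.seed(1)
n <- 300; p <- 50; s <- 3
theta <- c(0, 100, 200, n)              # theta_0 = 0 < theta_1 < theta_2 < theta_3 = n
q <- length(theta) - 2
beta <- sapply(0:q, function(j) {       # column j + 1 holds beta_j
  b <- numeric(p); b[1:s] <- (-1)^j; b  # sign flips at each change point
})
x <- matrix(rnorm(n * p), n, p)         # for simplicity, x_t ~ iid N_p(0, I_p)
eps <- rnorm(n)                         # for simplicity, serially independent errors
y <- numeric(n)
for (j in 0:q) {
  idx <- (theta[j + 1] + 1):theta[j + 2]
  y[idx] <- x[idx, , drop = FALSE] %*% beta[, j + 1] + eps[idx]
}
```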
The data segmentation problem under (1) is considered by Bai and Perron (1998), Qu and Perron (2007), Zhao et al. (2022) and Kirch and Reckrühm (2022), among others, when the dimension p is fixed. In high-dimensional settings, when there exists at most one change point (q = 1), Lee et al. (2016) and Kaul et al. (2019b) consider the problem of detecting and locating the change point, respectively. For the general case with unknown q, several data segmentation methods exist which adopt dynamic programming (Leonardi and Bühlmann, 2016; Rinaldo et al., 2021; Xu et al., 2022), fused Lasso (Wang et al., 2022; Bai and Safikhani, 2022) or wild binary segmentation (Wang et al., 2021) algorithms for the detection of multiple change points, and Bayesian approaches also exist (Datta et al., 2019). A related yet distinct problem of testing for the presence of a single change point under the regression model has been considered in Wang and Zhao (2022) and Liu et al. (2022). Gao and Wang (2022) consider the case where β_j − β_{j−1} is sparse without requiring the sparsity of β_j, j = 0, . . ., q, under the conditions that p < n and X_{it} ∼ iid N(0, 1) for all i and t.
Against this literature background, we make the following contributions by proposing computationally and statistically efficient data segmentation methods.
(i) Computational efficiency. For the data segmentation problem under (1), the computational bottleneck is often the local estimation of the regression parameters via penalised M-estimation such as the Lasso. We propose MOSEG, a moving window-based two-stage methodology, and its multiscale extension, which are both highly efficient computationally. In the first stage, MOSEG scans the data for multiple change points using a moving window of length G on a coarse grid of size O(nG^{−1}), which is followed by a simple location refinement step minimising the local residual sum of squares. The adoption of a coarse grid in the first stage contributes greatly to the reduction of the number of Lasso estimation steps while losing little detection power. Figure 1 demonstrates the computational competitiveness of the proposed MOSEG and MOSEG.MS, which greatly outperform the existing methodologies in their execution time for a range of n and p.

Figure 1: Execution time in seconds of MOSEG, MOSEG.MS and competing methodologies on simulated datasets (y-axis is in log scale for ease of comparison). Left: p varies while n = 450 is fixed. Right: n varies while p = 100 is fixed. For each setting, 100 realisations are generated and the average execution time is reported. See Section 4.2 for full details.
(ii) Multiscale change point detection. We propose a multiscale extension of the single-bandwidth methodology MOSEG. Referred to as MOSEG.MS, it is fully adaptive to the difficult scenarios with multiscale change points, where large frequent parameter shifts and small changes over long stretches of stationarity are simultaneously present, while still enjoying computational competitiveness. To the best of our knowledge, MOSEG.MS is the only data segmentation methodology under the model (1) for which the detection and localisation consistency is derived explicitly for the broad parameter space that permits multiscale change points. Also, while there exist several data segmentation methods that apply moving window-based procedures with multiple bandwidths, MOSEG.MS is the first such extension in high dimensions with a guaranteed rate of localisation.
(iii) Theoretical consistency in general settings. We show the consistency of MOSEG and MOSEG.MS in estimating the total number and the locations of multiple change points. Under Gaussianity, their separation and localisation rates nearly match the minimax lower bounds up to a logarithmic factor. Moreover, in our theoretical investigation, we permit temporal dependence as well as tail behaviour heavier than sub-Gaussianity.
In contrast to the existing literature, where independence and (sub-)Gaussianity assumptions are commonly made, this shows that the proposed methods remain valid in situations that are more realistic for empirical applications.
The rest of the paper is organised as follows. Section 2 introduces MOSEG, the single-bandwidth methodology, and establishes its theoretical consistency. Then in Section 3, we propose its multiscale extension, MOSEG.MS, and show that it achieves theoretical consistency in a broader parameter space. Numerical experiments in Section 4 demonstrate the competitiveness of the proposed methods in comparison with the existing data segmentation algorithms, and Section 5 provides a real data application to equity premium data. In the Appendix, we provide a comprehensive comparison between the existing methods and MOSEG and MOSEG.MS, both on their theoretical and computational properties, and present all the proofs and additional numerical results. The R software implementing MOSEG and MOSEG.MS is available from https://github.com/Dom-Owens-UoB/moseg.
Notation. For a random variable X, we write ∥X∥_ν = [E(|X|^ν)]^{1/ν} for ν > 0. For a = (a_1, . . ., a_p)^⊤ ∈ R^p, we write supp(a) = {1 ≤ i ≤ p : a_i ≠ 0}, and denote by |a|_q its ℓ_q-norm. For a square matrix A, let Λ_max(A) and Λ_min(A) denote its maximum and minimum eigenvalues, respectively. For a set A, we denote its cardinality by |A|. For sequences of positive numbers {a_n} and {b_n}, we write a_n ≲ b_n if there exists some constant C > 0 such that a_n ≤ C b_n for all n.

Single-bandwidth methodology
We introduce MOSEG, a single-bandwidth two-stage methodology for data segmentation in regression settings. We first describe its two stages in Section 2.1 and establish its theoretical consistency in Section 2.2; in Section 2.3, we verify the meta-assumptions made for the theoretical analysis for a class of linear processes exhibiting serial dependence and heavier tails than permitted under sub-Gaussianity.

Stage 1: Moving window procedure on a coarse grid
Single-bandwidth moving window procedures have successfully been adopted for univariate (Preuss et al., 2015; Yau and Zhao, 2016; Eichinger and Kirch, 2018), multivariate (Kirch and Reckrühm, 2022) and high-dimensional (Cho et al., 2023) time series segmentation. When applying a moving window-based procedure to a data segmentation problem, the key challenge is to carefully design a detector statistic which, when adopted for scanning the data for changes, has good detection power against the type of changes of interest.
For a given bandwidth G ∈ N satisfying G ≤ n/2, our proposed detector statistic is

T_k(G) = √(G/2) · |β̂_{k−G,k} − β̂_{k,k+G}|_2 for k = G, . . ., n − G. (2)

Here, β̂_{s,e} denotes an estimator of the vector of parameters obtained from (Y_t, x_t), s + 1 ≤ t ≤ e. We propose to obtain the local estimator β̂_{s,e} via the Lasso, as

β̂_{s,e} = β̂_{s,e}(λ) = arg min_{β ∈ R^p} Σ_{t=s+1}^{e} (Y_t − x_t^⊤ β)² + λ √(e − s) |β|_1 (3)

for some tuning parameter λ > 0 and any 0 ≤ s < e ≤ n. In what follows, we suppress the dependence of this estimator on λ when there is no confusion. The estimand of β̂_{s,e} is

β*_{s,e} = (e − s)^{−1} Σ_{j=L(s+1)}^{L(e)} (min(θ_{j+1}, e) − max(θ_j, s)) β_j, (4)

where L(t) = max{j, 0 ≤ j ≤ q : θ_j + 1 ≤ t} denotes the index of the change point θ_j that is the closest to t while lying strictly to its left. In short, β*_{k−G,k} is a weighted sum of the β_j with the weights corresponding to the proportion of the interval {k − G + 1, . . ., k} overlapping with {θ_j + 1, . . ., θ_{j+1}}. The statistic T_k(G) thus contrasts the local parameter estimators from two adjacent data sections over {k − G + 1, . . ., k} and {k + 1, . . ., k + G}; it is expected to form local maxima near the change points, where the local parameter estimators differ the most, and hence it is well-suited for detecting and locating the change points under the model (1).
Scanning the detector statistic T_k(G) over all k ∈ {G, . . ., n − G} requires the computation of the Lasso estimator O(n) times. This is far fewer than the O(n²) times required by dynamic programming algorithms for ℓ_0-penalised cost minimisation (Rinaldo et al., 2021; Xu et al., 2022), but it may still pose a computational bottleneck when the data sequence is very long or its dimensionality ultra high. Instead, we propose to evaluate T_k(G) on a coarser grid only, for generating pre-estimators of the change points. Let T denote the grid over which we evaluate T_k(G), given by

T = T(r, G) = {G + m⌊rG⌋ : m = 0, 1, . . ., ⌊(n − 2G)/⌊rG⌋⌋}, (5)

with some constant r ∈ [G^{−1}, 1) that controls the coarseness of the grid. When r = G^{−1}, we have the finest grid T = {G, . . ., n − G}, and the grid becomes coarser with increasing r.
Motivated by Eichinger and Kirch (2018), who considered the problem of detecting multiple shifts in the mean of a univariate time series using a moving window procedure, we propose to accept all significant local maximisers of T_k(G) over k ∈ T as the pre-estimators of the change points. That is, for some threshold D > 0 and a tuning parameter η ∈ (0, 1], we accept all θ̃ ∈ T that simultaneously satisfy

T_{θ̃}(G) > D and T_{θ̃}(G) = max{T_k(G) : k ∈ T, |k − θ̃| ≤ ηG/2}. (6)

That is, at such θ̃, the detector T_{θ̃}(G) exceeds the threshold and attains a local maximum over the grid within the interval of length ηG. We denote the set collecting all pre-estimators fulfilling (6) by Θ̃ = {θ̃_j, 1 ≤ j ≤ q̂ : θ̃_1 < . . . < θ̃_q̂}, with q̂ = |Θ̃| as the estimator of the number of change points. This grid-based approach substantially reduces the computational complexity by requiring the Lasso estimators to be computed only O(n/⌊rG⌋) times. Even so, it is sufficient for detecting the presence of all q change points, provided that r is chosen not too large (see Theorem 1 (i) below). We remark that the idea of utilising only a sub-sample of the data for detecting the presence of change points has been proposed for handling long data sequences in the context of univariate mean change point detection (Lu et al., 2017). The next section describes the location refinement step applied to the pre-estimators of the change point locations.
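As an illustration, the R sketch below implements Stage 1 on the data simulated in the Introduction, using glmnet for the local Lasso fits; note that glmnet's internal scaling of the penalty differs from (3), and the values of G, r, λ, D and η are placeholders rather than the tuned choices of Section 4.1.

```r
# A sketch of Stage 1: evaluate T_k(G) on the coarse grid T(r, G) and accept
# thresholded local maximisers as pre-estimators (placeholder tuning values).
library(glmnet)
lasso_coef <- function(x, y, s0, e0, lambda) {
  fit <- glmnet(x[(s0 + 1):e0, , drop = FALSE], y[(s0 + 1):e0],
                lambda = lambda, intercept = FALSE)
  as.numeric(coef(fit))[-1]                        # drop the intercept entry
}
G <- 60; r <- 1/5; lambda <- 0.1; D <- 1; eta <- 0.5
grid <- seq(G, n - G, by = max(1, floor(r * G)))   # the coarse grid T(r, G)
Tk <- sapply(grid, function(k) {
  bl <- lasso_coef(x, y, k - G, k, lambda)         # fit on (k - G, k]
  br <- lasso_coef(x, y, k, k + G, lambda)         # fit on (k, k + G]
  sqrt(G / 2) * sqrt(sum((bl - br)^2))             # T_k(G) as in (2)
})
# accept k in T exceeding D and maximising T over a window of length eta * G
pre <- grid[sapply(seq_along(grid), function(i) {
  nb <- abs(grid - grid[i]) <= eta * G / 2
  Tk[i] > D && Tk[i] == max(Tk[nb])
})]
```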

Stage 2: Location refinement
Once the set of pre-estimators Θ̃ is generated by the first-stage moving window procedure on a coarse grid, we further refine the location estimators. This involves the local evaluation and minimisation of the objective function

Q(k; a, b, γ^L, γ^R) = Σ_{t=a+1}^{k} (Y_t − x_t^⊤ γ^L)² + Σ_{t=k+1}^{b} (Y_t − x_t^⊤ γ^R)² (7)

for suitably chosen a, b, γ^L and γ^R.
For each j = 1, . . ., q̂, let θ̃^L_j = θ̃_j − ⌊G/2⌋ and θ̃^R_j = θ̃_j + ⌊G/2⌋, and consider the local parameter estimators

β̂^L_j = β̂_{θ̃_j−G, θ̃^L_j} and β̂^R_j = β̂_{θ̃^R_j, θ̃_j+G}, (8)

which serve as the estimators of β_{j−1} and β_j, respectively. Then in Stage 2, we propose to obtain a refined location estimator of θ_j from its pre-estimator θ̃_j, as

θ̂_j = arg min_{θ̃^L_j < k ≤ θ̃^R_j} Q(k; θ̃^L_j, θ̃^R_j, β̂^L_j, β̂^R_j), (9)

for all j = 1, . . ., q̂. A similar approach has commonly been taken in the change point literature; see e.g. Kaul et al. (2019b) and Xu et al. (2022) for the data segmentation problem under the model (1). Our proposal differs from theirs only in that the interval over which the search is performed in (9) is chosen to contain exactly one change point with high probability under Assumption 5 (a) below. Referring to the methodology combining the two stages as MOSEG, we provide its algorithmic description in Algorithm 1 of Appendix D.
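A sketch of the refinement step is given below, reusing lasso_coef from the Stage 1 sketch; the windows used for the left and right local estimators follow the reading of (8) given above, and the function assumes G < th < n − G so that all indices are valid.

```r
# Stage 2 refinement of a single pre-estimator `th`: minimise the local
# residual sum of squares Q(k; a, b, gamma_L, gamma_R) in (7).
refine <- function(x, y, th, G, lambda) {
  thL <- th - floor(G / 2); thR <- th + floor(G / 2)
  bL <- lasso_coef(x, y, th - G, thL, lambda)  # estimates beta_{j-1}
  bR <- lasso_coef(x, y, thR, th + G, lambda)  # estimates beta_j
  ks <- (thL + 1):(thR - 1)                    # search interval around th
  Q <- sapply(ks, function(k) {
    iL <- (thL + 1):k; iR <- (k + 1):thR
    sum((y[iL] - x[iL, , drop = FALSE] %*% bL)^2) +
      sum((y[iR] - x[iR, , drop = FALSE] %*% bR)^2)
  })
  ks[which.min(Q)]
}
refined <- sapply(pre, refine, x = x, y = y, G = G, lambda = lambda)
```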

Consistency of MOSEG
We make the following assumptions on (x_t, ε_t), 1 ≤ t ≤ n.

Assumption 1. We assume that E(x_t) = 0, E(ε_t) = 0 and Var(ε_t) = σ²_ε for all t = 1, . . ., n, and that Cov(x_t) = Σ_x has its eigenvalues bounded, i.e. there exist constants 0 < ω ≤ ω̄ < ∞ such that ω ≤ Λ_min(Σ_x) ≤ Λ_max(Σ_x) ≤ ω̄.

The condition on the eigenvalues of Σ_x can also be found e.g. in Rinaldo et al. (2021) and Wang et al. (2019). Where relevant, we explicitly specify the roles played by ω and ω̄ in presenting the theoretical results, which indicates how this condition may be relaxed. Assumptions 2 and 3 below extend the deviation bound and restricted eigenvalue (RE) conditions required for high-dimensional M-estimation (van de Geer and Bühlmann, 2009; Loh and Wainwright, 2012; Negahban et al., 2012) to change point settings; the deviation bound in Assumption 2 is stated in terms of a rate ρ_{n,p}. We later verify these assumptions under general conditions accommodating serial dependence as well as non-Gaussian tail behaviour. Here, we explicitly state these meta-assumptions to highlight the full generality of the consistency of MOSEG derived in this section.
Assumption 3 (Restricted eigenvalue). There exist fixed constants C_RE > 0 and τ ∈ [0, 1) with which a restricted eigenvalue condition holds uniformly over the local Gram matrices.

For each j = 0, . . ., q, we denote by S_j = supp(β_j) the support of β_j, and by s = max_{0≤j≤q} |S_j| the maximum segment-wise sparsity of the regression parameters. We make the following assumptions on the size of change δ_j = |β_j − β_{j−1}|_2 and the spacing between the neighbouring change points.
Assumption 4 is a technical condition under which we focus on the more challenging regime where the size of change is allowed to tend to zero; an analogous condition is found in Lee et al. (2016), Kaul et al. (2019b), Wang et al. (2021) and Xu et al. (2022). Without Assumption 4, an extra multiplicative factor of s is incurred in the detection boundary (see Assumption 5 (b) below) and the rate of localisation; see also Rinaldo et al. (2021).
Assumption 5. The bandwidth G fulfils the following conditions, with τ, ρ_{n,p} and ω introduced in Assumptions 1, 2 and 3.

(a) 2G ≤ min_{1≤j≤q+1}(θ_j − θ_{j−1}).

(b) There exists a fixed constant C_1 > 0 such that

G ≥ C_1 max_{1≤j≤q} δ_j^{−2} max(s ρ²_{n,p}, (s log(p))^{1/(1−τ)}).

Assumption 5 (a) relates the choice of the bandwidth G to the minimum spacing between the change points. Together, (a) and (b) specify the separation rate, imposing a lower bound on

∆^{(1)} = min_{1≤j≤q} δ_j² · min(θ_{j+1} − θ_j, θ_j − θ_{j−1}) (10)

for all the q change points to be detectable by MOSEG. Later in Section 3, we propose a multiscale extension of MOSEG which achieves consistency under a more relaxed condition than Assumption 5.
Theorem 1. Suppose that Assumptions 1-5 hold, and that λ, r, η and D are chosen appropriately (see Appendix C for the precise conditions). Then, the following holds with probability tending to one.

(i) Stage 1 of MOSEG returns Θ̃ with q̂ = q and max_{1≤j≤q} |θ̃_j − θ_j| < G.

(ii) There exists a large enough constant c_0 > 0 such that Stage 2 of MOSEG returns θ̂_j, 1 ≤ j ≤ q, satisfying

max_{1≤j≤q} δ_j² |θ̂_j − θ_j| ≤ c_0 max(s ρ²_{n,p}, (s log(p))^{1/(1−τ)}).

Theorem 1 (i) establishes that Stage 1 of MOSEG correctly estimates the number of change points as well as identifying their locations by the pre-estimators with some accuracy. There is a trade-off between computational efficiency and theoretical consistency with respect to the choice of r. On the one hand, increasing r leads to a coarser grid T with cardinality |T| = O(n/(rG)), and thus reduces the computational cost. On the other, the pre-estimators lie in the grid, so that the best approximation to each change point θ_j can be as far from θ_j as ⌊rG⌋/2, which is reflected in the localisation property of the pre-estimators.
Theorem 1 (ii) derives the rate of estimation for the second-stage estimators θ̂_j, which shows that the location estimation is more challenging when the size of change δ_j is small. Note that we always have max_{1≤j≤q} δ_j^{−2} max(s ρ²_{n,p}, (s log(p))^{1/(1−τ)}) ≲ G ≲ min_{1≤j≤q+1}(θ_j − θ_{j−1}) under Assumption 4. In Section 2.3, we consider two scenarios permitting temporal dependence and non-Gaussianity on (x_t, ε_t), and give concrete rates in place of ρ_{n,p} and τ. In particular, as shown in Corollary 3 (ii), the rate derived in Theorem 1 (ii) is near-minimax optimal under Gaussianity.

Verification of Assumptions 2 and 3
Assumptions 2 and 3 generalise the deviation bound and restricted eigenvalue conditions, often found in the high-dimensional M-estimation literature, to accommodate change points, serial dependence and heavy-tailedness. Condition 1 gives instances of {(x_t, ε_t)}_{t=1}^n that fulfil Assumptions 2 and 3 and specifies the corresponding ρ_{n,p} and τ.
Proposition 2. Suppose that Assumptions 1 and 4 and Condition 1 hold. Then, there exist some constants c_1, c_2 > 0 such that, with probability at least 1 − c_1(p ∨ n)^{−c_2}, Assumptions 2 and 3 are satisfied with τ and ρ_{n,p} chosen as below.
Remark 1. In the change point literature, Wang and Zhao (2022) propose a change point test and investigate its properties under β-mixing, and Xu et al. (2022) analyse the ℓ_0-penalised least squares estimation approach when the functional dependence of {x_t}_{t=1}^n and {ε_t}_{t=1}^n decays exponentially. Relaxing the Gaussianity, it is typically required that x_t is a sub-Weibull random vector, i.e. sup_{a∈B_p(1)} ∥a^⊤ x_t∥_{ψ_γ} < ∞ for some γ > 0 (where B_d(r) = {a ∈ R^d : |a|_2 ≤ r}), and similarly that ∥ε_t∥_{ψ_γ} < ∞. Under these assumptions, the common approach is to verify the deviation bound and restricted eigenvalue conditions analogous to those made in Assumptions 2-3, with which the uniform consistency of the local Lasso estimators is derived.
Instead, we explicitly state the meta-assumptions to highlight that the proposed MOSEG achieves consistency whenever Assumptions 2-3 are met. These in turn can be shown to hold in settings beyond those considered in Condition 1, by using the arguments adopted in the literature establishing the consistency of Lasso-type estimators (when q = 0) under a variety of characterisations of serial dependence and non-Gaussianity (Wu and Wu, 2016; Adamek et al., 2020; Han and Tsay, 2020; Wong et al., 2020).
Corollary 3 follows immediately from Theorem 1 and Proposition 2.
Corollary 3. Suppose that Assumptions 1, 4 and 5 and Condition 1 hold, and λ, r, η and D are chosen as in Theorem 1. Then, there exist constants c_0, c_1, c_2 > 0 such that the set of change point estimators returned by MOSEG satisfies the conclusions of Theorem 1, with (i) ρ_{n,p} and τ as specified in Proposition 2 under Condition 1 (a), and (ii) ρ_{n,p} and τ as specified under Condition 1 (b).

Corollary 3 (ii) shows that under Gaussianity, the rate of localisation attained by MOSEG matches the minimax lower bound up to log(p ∨ n); see Lemma 4 of Rinaldo et al. (2021). At the same time, Assumption 5 (b) translates to ∆^{(1)} ≳ s log(p ∨ n) in this setting, nearly matching the minimax lower bound on the separation rate derived in Lemma 3 of Rinaldo et al. (2021) up to the logarithmic term. We refer to Appendix A for a comprehensive comparison between MOSEG, its multiscale extension to be introduced in Section 3, and the existing methods for the change point detection problem under (1), on their theoretical and computational properties.

Multiscale methodology
The single-bandwidth methodology proposed in Section 2 enjoys theoretical consistency as well as computational efficiency, but it faces the difficulty of identifying a single bandwidth that satisfies Assumption 5 (a)-(b) simultaneously. In this section, we propose MOSEG.MS, a multiscale extension of MOSEG, and show that it achieves consistency in a parameter space broader than that allowed by Assumption 5.

MOSEG.MS: Multiscale extension of MOSEG
Similarly to MOSEG, MOSEG.MS consists of moving window-based data scanning and location refinement steps, but it takes a set of bandwidths as an input. The key innovation lies in that, for each change point, MOSEG.MS learns the bandwidth best suited for its detection and localisation from the given set of bandwidths. While there exist multiscale extensions of moving sum procedures, they are mostly developed for univariate time series segmentation (Messer et al., 2014; Cho and Kirch, 2021b) and, to the best of our knowledge, this is a first attempt at rigorously studying such an extension in a high-dimensional setting.

Below we describe MOSEG.MS step by step. An algorithmic description of MOSEG.MS is given in Algorithm 2 of Appendix D.
Step 1: Pre-estimator generation. Given a set of bandwidths G = {G_h, 1 ≤ h ≤ H} with G_1 < . . . < G_H, we generate the coarse grid associated with each G_h and the parameter r by T_h = T(r, G_h), see (5). As in Stage 1 of MOSEG, the sets of pre-estimators Θ̃(G_h) are generated for h = 1, . . ., H. For each pre-estimator θ̃ ∈ Θ̃(G) obtained with bandwidth G, we refer to I_c(θ̃) = (θ̃ − cG, θ̃ + cG], c > 0, as the detection interval associated with θ̃. For simplicity, we write I_1(θ̃) = I(θ̃). Below, we sometimes write θ̃(G) ∈ Θ̃(G) to highlight that the pre-estimator is obtained with the bandwidth G, and denote by G(θ̃) the bandwidth involved in the detection of a pre-estimator θ̃. If some θ̃ is detected with more than one bandwidth, we distinguish between them.
Step 2: Anchor estimator identification. Next, we identify anchor change point estimators among the pre-estimators: proceeding from the finest to the coarsest bandwidth, each anchor change point estimator does not have its detection interval overlap with the detection interval of any pre-estimator that is detected with a finer bandwidth. Denote the set of all such anchor change point estimators by Θ̃^A = {θ̃^A_1 < . . . < θ̃^A_q̂}, with q̂ = |Θ̃^A| as an estimator of the number of change points q.
Step 3: Pre-estimator clustering. We find subsets of the pre-estimators in Θ̃(G), denoted by C_j, j = 1, . . ., q̂, as described below. Initialised as C_j = ∅, for each j, we add to C_j the j-th anchor estimator θ̃^A_j as well as all θ̃ ∈ Θ̃(G), G ∈ G, which simultaneously fulfil the overlap conditions in (15): in essence, θ̃ is included if its detection interval overlaps with that of θ̃^A_j but with no other anchor estimator's. It is possible that some pre-estimators do not belong to any of C_j, 1 ≤ j ≤ q̂, but each cluster contains at least one estimator by construction.
Step 4: Location refinement. For each C_j, j = 1, . . ., q̂, we denote the smallest and the largest bandwidths associated with the detection of the pre-estimators in C_j by G^m_j and G^M_j, respectively, and the corresponding pre-estimators by θ̃^m_j and θ̃^M_j. Then, analogously to Stage 2 of MOSEG, we identify the local minimiser of the objective function defined in (7), with the search interval determined by θ̃^m_j and G^m_j and the local parameter estimators computed with the larger bandwidth G^M_j; this defines the refined estimator θ̂_j in (16). Repeatedly performing (16) for j = 1, . . ., q̂, we obtain the final set of estimators Θ̂ = {θ̂_j, 1 ≤ j ≤ q̂}.

Steps 2-4 of the MOSEG.MS algorithm have been devised to (i) group the change point estimators across multiple bandwidths into those detecting the identical change points, and (ii) adaptively learn the bandwidth best suited to locate each change point from the bandwidths associated with the estimators in each group. For (i), we adopt the anchor estimators in Step 2, which closely resemble the final estimators produced by the bottom-up merging proposed in Messer et al. (2014) in the context of univariate data segmentation. While the merging procedure is known to achieve detection consistency, the resultant estimators do not come with a guaranteed rate of localisation. Nonetheless, they serve as an adequate 'anchor' for clustering the pre-estimators in Step 3. Moreover, the restriction imposed in (15) ensures that the bandwidths associated with the detection of pre-estimators clustered in C_j inform us of a good choice of bandwidth for the detection of the j-th change point, with which we perform the location refinement in Step 4.

Remark 2 (Bandwidth generation). Cho and Kirch (2021b) propose to use G generated as a sequence of Fibonacci numbers, for a multiscale extension of the moving sum procedure for univariate mean change point detection (Eichinger and Kirch, 2018). For some finest bandwidth G_1, the subsequent bandwidths are generated at the Fibonacci rate until they exceed n/2, such that the thus-generated bandwidth set G satisfies |G| = O(log(n)).
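The sketch below illustrates two ingredients of MOSEG.MS: a Fibonacci-type bandwidth set in the spirit of Remark 2, and one plausible implementation of the anchor rule in Step 2 based on overlap of the detection intervals I(θ̃) = (θ̃ − G, θ̃ + G]; the data format for the pre-estimators is hypothetical.

```r
# Bandwidths growing at the Fibonacci rate from a finest bandwidth G1,
# capped at n / 2 so that each bandwidth remains admissible.
fib_bandwidths <- function(G1, n) {
  Gs <- c(G1, 2 * G1)
  while (sum(tail(Gs, 2)) <= n / 2) Gs <- c(Gs, sum(tail(Gs, 2)))
  Gs[Gs <= n / 2]
}
# `est`: a data.frame with one row per pre-estimator and columns
# (theta, G) -- a hypothetical format. Two detection intervals
# (a - Ga, a + Ga] and (b - Gb, b + Gb] overlap iff |a - b| < Ga + Gb.
anchors <- function(est) {
  keep <- sapply(seq_len(nrow(est)), function(i) {
    finer <- est$G < est$G[i]
    overlap <- abs(est$theta - est$theta[i]) < est$G + est$G[i]
    !any(finer & overlap)  # anchor: no overlap with a finer-bandwidth estimator
  })
  est[keep, ]
}
```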

Consistency of MOSEG.MS
We make the following assumption on the size of changes by placing a condition on each change point individually.

Assumption 5′. Let G denote the set of bandwidths generated as in Remark 2. Then, we assume that for each j = 1, . . ., q, there exists some bandwidth G^{(j)} ∈ G fulfilling the conditions of Assumption 5 (a)-(b) locally at θ_j, with C_1 from Assumption 5.
In essence, Assumption 5′ relaxes Assumption 5 by requiring that, for each θ_j, there exists one bandwidth G^{(j)} ∈ G fulfilling the requirements imposed on a single bandwidth in the latter, for all j = 1, . . ., q; see (a)-(b) in Appendix C.3 for further details. Compared to ∆^{(1)} defined in (10), the corresponding quantity ∆^{(2)} defined in (17) always satisfies ∆^{(1)} ≤ ∆^{(2)} and, if frequent large changes and small changes over long stretches of stationarity are simultaneously present, the former can be considerably smaller than the latter (Cho and Kirch, 2021a). To the best of our knowledge, Theorem 4 below provides a first result establishing the consistency of a data segmentation methodology for the problem in (1) over the larger parameter space defined with ∆^{(2)}. We refer to Appendix A for further discussion and a comprehensive comparison between MOSEG, MOSEG.MS and competing methodologies on their theoretical properties.
Analogously to Corollary 3, combining Theorem 4 with Proposition 2 yields the separation and localisation rates of MOSEG.MS under each of Condition 1 (a) and (b).


Numerical experiments

Choice of tuning parameters
We discuss the selection of the tuning parameters involved in MOSEG and MOSEG.MS, namely the set of bandwidths G, the grid T(r, G) in (5), η ∈ (0, 1] involved in the pre-estimation of the change points (see (6)), the penalty parameter λ and the threshold D.
Selection of G. As described in Remark 2, the set of bandwidths G is determined once the finest bandwidth G_1 is chosen. To gain insights about the minimum bandwidth required for reasonable performance of the local Lasso estimators, we conducted numerical experiments by simulating datasets under (1) with q = 0, x_t ∼ iid N_p(0, I_p), ε_t ∼ iid N(0, 1) and an s-sparse coefficient vector. Generating 100 realisations for each setting with varying (n, p, s, G), we record the estimation errors of the local Lasso estimators. Then, we obtain a simple rule to determine the finest bandwidth as G_1 = ⌊exp(c*_0 + c*_1 log log(n) + c*_2 log log(p))⌋ with pre-determined constants c*_i, i = 0, 1, 2, which are chosen as transforms of the estimated regression coefficients from regressing the 90%-percentile of the logarithm of the estimation errors over 100 realisations onto the corresponding log(G), log log(p) and log log(n) (with R² = 0.8945). Adopting the Fibonacci sequence in Remark 2 sometimes gives a sequence of bandwidths that grows too quickly when the sample size n is small. Therefore, with the finest bandwidth G_1 chosen as above, we recommend generating the remaining bandwidths to increase more gradually from G_1. Throughout the simulation studies and real data applications, we set H = 3.
Selection of D and λ. Theorems 1 and 4 provide ranges of values for λ and D that guarantee theoretical consistency, but these involve unknown parameters, as is typically the case in the change point literature. For their simultaneous selection, we adopt a cross validation (CV) method motivated by Zou et al. (2020). Let Λ = Λ(G) denote the grid of possible values for λ which, dependent on the bandwidth G, is chosen as an exponentially increasing sequence. For each λ ∈ Λ, we obtain Θ̃(G, λ), the set of pre-estimators with D = 0, i.e. we take all local maximisers of the MOSUM statistics according to (6); due to the detection rule, we always have q̃_0(G, λ) = |Θ̃(G, λ)| ≤ n/(2ηG). Sorting the elements of Θ̃(G, λ) in the decreasing order of the associated MOSUM detector values, we generate a sequence of nested change point models. Then, using the odd-indexed observations, we produce local estimators of the regression parameters and, using the even-indexed observations, we evaluate the prediction errors: for a candidate model L = {ℓ_1 < . . . < ℓ_m} with ℓ_0 = 0 and ℓ_{m+1} = n, the CV objective sums Σ_{t ∈ J_2 ∩ {ℓ_j+1, . . ., ℓ_{j+1}}} (Y_t − x_t^⊤ β̂^(1)_j(L, λ))² over j, where J_1 (J_2) collects the odd (even) indices. Here, β̂^(1)_j(L, λ) denotes the Lasso estimator obtained using (Y_t, x_t), t ∈ J_1 ∩ {ℓ_j + 1, . . ., ℓ_{j+1}}, with the penalty parameter λ. Then for each G_h ∈ G, we find the pair (λ*, m*) minimising the CV objective and obtain the set of pre-estimators using λ* and m*. This amounts to selecting the bandwidth-dependent threshold D at a value just below the m*-th largest MOSUM detector value. Such Θ̃(G_h), G_h ∈ G, serve as an input to Steps 2-4 of MOSEG.MS. In all numerical experiments reported in this paper, we set |Λ| = 5.
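For illustration, the sketch below evaluates the odd/even-split CV objective for one bandwidth and one penalty value; cand and ord, denoting the candidate change points from Stage 1 with D = 0 and their ranking by detector value, are assumed given, and looping over λ ∈ Λ and G_h ∈ G is then straightforward.

```r
# CV error of a candidate model: fit Lasso on odd-indexed observations within
# each segment and measure prediction error on the even-indexed observations.
cv_error <- function(x, y, cps, lambda) {
  bounds <- c(0, sort(cps), nrow(x))
  err <- 0
  for (j in seq_len(length(bounds) - 1)) {
    idx <- (bounds[j] + 1):bounds[j + 1]
    odd <- idx[idx %% 2 == 1]; even <- idx[idx %% 2 == 0]
    fit <- glmnet(x[odd, , drop = FALSE], y[odd],
                  lambda = lambda, intercept = FALSE)
    err <- err + sum((y[even] - predict(fit, x[even, , drop = FALSE]))^2)
  }
  err
}
# nested models: the m candidates with the largest detector values, m = 0, 1, ...
# (`cand` and `ord` are assumed to come from Stage 1 with D = 0)
errs <- sapply(0:length(cand), function(m)
  cv_error(x, y, cand[ord][seq_len(m)], lambda))
m_star <- which.min(errs) - 1   # model size minimising the CV objective
```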
Selection of other tuning parameters. For change point estimation, we recommend using η = 0.5 in (6) based on extensive simulations, which show that the performance of MOSEG and MOSEG.MS is not too sensitive to this choice. As noted in Section 4.2, MOSEG.MS is highly competitive computationally against the existing methods even without adopting a coarse grid. Therefore, we report the results obtained with r = G^{−1} (i.e. T = {G, . . ., n − G} in (5)) in the main text, and provide the results obtained with coarser grids in Appendix B.1, where we observe that adopting a coarse grid does not undermine the performance of MOSEG provided that r is sufficiently small, say r ≤ 1/5.

Computational complexity and run time
Let Lasso(p) denote the cost of solving a Lasso problem with p variables. For the coordinate descent algorithm (Friedman et al., 2010), each complete iteration of the coordinate descent has cost O(p²). Then, the combined computational cost of Stages 1 and 2 of MOSEG is O(n(rG)^{−1} Lasso(p)), and the memory cost is O(np). Similarly, with the set of bandwidths generated as described in Remark 2, the complexity of the multiscale extension MOSEG.MS is O(n(rG_1)^{−1} Lasso(p)) with G_1 denoting the finest scale, which follows from Σ_{h=1}^{H} n(rG_h)^{−1} = O(n(rG_1)^{−1}) (see Remark 2 for the notation). For the CV outlined in Section 4.1, we generate pre-estimators and evaluate the CV objective function on a sequence of nested models for each λ ∈ Λ, which brings the computational complexity of the complete MOSEG.MS methodology to O(|Λ| n(rG_1)^{−1} Lasso(p)).
We investigate the run time of change point detection methodologies for the problem in (1).
In particular, the default implementations of VPWBS and DPDU adopt grids of size 3 and 4, respectively, for the Lasso tuning parameter, while MOSEG and MOSEG.MS are applied with the grid Λ of size |Λ| = 5. We generate the data under the model (M3) described in Section 4.3 below, with δ = 1.6 and varying (n, p).

Simulation settings
We apply MOSEG.MS to datasets simulated with varying (n, p, s) and change point configurations. In each setting, we generate x_t as i.i.d. Gaussian random vectors with mean 0 and the covariance matrix Σ_x specified below, and ε_t ∼ iid N(0, σ²_ε); unless specified otherwise, we use σ_ε = 1. We report the results from non-Gaussian and serially dependent data in Appendix B.2 where, overall, the results are not sensitive to tail behaviour or temporal dependence. Additionally, we report the results when p = 1000 in Appendix B.3 which, together with the experiments reported in Section 4.2, demonstrate the scalability of MOSEG.MS.
In setting (M4), the change points are multiscale in the sense that the size of change and the spacing between the change points vary, but δ_j² · min(θ_{j+1} − θ_j, θ_j − θ_{j−1}) is kept constant for j = 1, 3, 5 and for j = 2, 4, respectively. This results in ∆^{(1)} being much smaller than ∆^{(2)}; see Equations (10) and (17) for their definitions. The setting (M5) is designed to test the performance of data segmentation methods when q = 0, where we scale the data to examine the sensitivity of the tuning parameter choices discussed in Section 4.1.

Simulation results
We apply MOSEG.MS with the tuning parameters selected as described in Section 4.1. For the purpose of illustration only, we also apply MOSEG with the bandwidth chosen with the knowledge of the minimum spacing between the change points; for (M1)-(M3), where the change points are evenly spaced, we set G = 3/4 · min_{0≤j≤q}(θ_{j+1} − θ_j). For (M4) with multiscale change points, there does not exist a single bandwidth that works well in detecting all change points, so we simply set G = 125. For (M5) with q = 0, we set G = G_1 selected as described in Section 4.1. For comparison, we apply the methods proposed by Wang et al. (2021) (referred to as VPWBS) and Xu et al. (2022) (DPDU). The VPWBS method learns the projections well-suited for the detection of the change points and applies the wild binary segmentation algorithm to the projected univariate series, and has been shown to outperform the methods proposed in Leonardi and Bühlmann (2016) and Lee et al. (2016). Based on dynamic programming, the DPDU algorithm minimises the ℓ_0-penalised cost function for multiple change point detection. Both methods have been applied with the default tuning parameters recommended by the authors. We also considered the method proposed by Kaul et al. (2019a), but omit the results due to its poor performance on the simulation models considered in this paper.
In Tables 1-4, we report the distribution of the bias in change point number estimation (q̂ − q) for each method over the 100 realisations generated under each setting. Additionally, we report the scaled Hausdorff distance between the sets of estimated (Θ̂) and true (Θ) change points, i.e.

d_H(Θ̂, Θ) = (1/n) · max{ max_{θ̂∈Θ̂} min_{θ∈Θ} |θ̂ − θ|, max_{θ∈Θ} min_{θ̂∈Θ̂} |θ̂ − θ| }, (19)

averaged over 100 realisations; by convention, we set d_H(∅, Θ) = 1. We remark that the Hausdorff distance tends to favour the cases when the change points are over-detected over those when they are under-detected. In Table 5 (considering the case q = 0), we report the proportion of realisations where any false positive is returned.
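For reference, the scaled Hausdorff distance in (19) can be computed as below, assuming the scaling is by the sample size n and adopting the convention d_H(∅, Θ) = 1 from the text.

```r
# Scaled Hausdorff distance between estimated and true change point sets.
hausdorff <- function(est, truth, n) {
  if (length(est) == 0 || length(truth) == 0) return(1)
  d1 <- max(sapply(est, function(a) min(abs(truth - a))))
  d2 <- max(sapply(truth, function(b) min(abs(est - b))))
  max(d1, d2) / n
}
hausdorff(c(98, 205), c(100, 200), n = 300)  # returns 5 / 300
```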
Generally, as expected, we observe better performance from all methods with increasing sample size in (M1) or increasing change size δ in (M3)-(M4), while varying the sparsity level s brings in less clear patterns in the performance. In the presence of homogeneous change points under (M1)-(M3), MOSEG performs as well as MOSEG.MS in terms of correctly estimating the number of change points, but it suffers from the lack of adaptivity in the presence of multiscale change points under (M4), where both large frequent shifts and small changes over long intervals are present. Here, we observe the benefit of the multiscale approach taken by MOSEG.MS particularly as δ grows, where it achieves better accuracy in detection and localisation than MOSEG. Comparing the performance of MOSEG.MS and VPWBS, we note that the former generally attains better detection power, while the latter exhibits better localisation properties under (M2) and (M3) (when δ is large). DPDU tends to show good detection power in the more challenging scenarios, such as when n is small (under (M1)), s is large (under (M2)) or the change size is small (see Table B.2). At the same time, it is observed to over-estimate the number of change points across all scenarios. Under (M5), where no changes are present, our methods are shown to control the number of false positives well; here, we do not include VPWBS or DPDU in Table 5 as they tend to detect false positives in most cases.

Real data application
There exists an extensive literature on the prediction of the equity premium, which is defined as the difference between the compounded return on the S&P 500 index and the three-month Treasury bill rate. Using 14 macroeconomic and financial variables (see Table E.1 for full descriptions), Welch and Goyal (2008) demonstrate the difficulty of this prediction problem, in part due to the time-varying nature of the data. Koo et al. (2020) note that the majority of the variables are highly persistent with strong, positive autocorrelations, and develop an ℓ_1-penalised regression method that identifies co-integration relationships among the variables.
Accordingly, we transform the data by taking the first difference of any variable labelled as persistent by Koo et al. (2020), and scale each covariate series to have unit standard deviation. With the thus-transformed variables, we propose to model the monthly equity premium observed from 1927 to 2005 as Y_t, with the 14 variables at lags 1, 2, 3 and 12 as the regressors x_t, via piecewise stationary linear regression; in total, we have n = 936 and p = 57 including the intercept. We apply MOSEG.MS with G = {72, 96, 120}, in line with the choice described in Section 4.1 but selecting the G_h to be multiples of 12 for interpretability, as the observation frequency is monthly. MOSEG.MS returns q̂ = 7 change point estimators, reported in Table 6, and takes 45 seconds in total (including CV). When applied to the same dataset, DPDU takes 25 minutes and VPWBS takes 15 minutes, and neither detects any change point. In Figure 2, we plot the local parameter estimates obtained from each of the seven estimated segments. We can relate the change detected in 1954 to the findings reported in Rapach et al. (2010), where the instability in the pairwise relationships between the equity premium and each of the 14 variables is attributed to the Treasury-Federal Reserve Accord and the transition from the wartime economy. The dividend price ratio (d/p, at lag two) is active throughout the observation period, which agrees with the observations made in Welch and Goyal (2008). They also remark that the recession from 1973 to 1975 due to the Oil Shock drives the good predictive performance of many models proposed for equity premium forecasting, and that most perform poorly over the 30-year period following the Oil Shock. The two last segments defined by the change point estimators reported in Table 6 align closely with these important periods, which supports the validity of the segmentation returned by MOSEG.MS. We note that, regardless of the choice of bandwidths, the two estimators in 1974 and 1975 defining these two periods are detected separately.

Conclusions
In this paper, we propose MOSEG, a high-dimensional data segmentation methodology for detecting multiple changes in the parameters under a linear regression model. It proceeds in two stages: first, a grid-based scanning of the data for large changes in local parameter estimators over a moving window, followed by a computationally efficient location refinement step. We further propose its multiscale extension, MOSEG.MS, which alleviates the necessity to select a single bandwidth. Both numerically and theoretically, we demonstrate the efficiency of the proposed methodologies. Computationally, they are highly competitive thanks to the careful design of the algorithms, which limits the required number of Lasso estimation steps. Theoretically, we show the consistency of MOSEG and MOSEG.MS in a general setting permitting serial dependence and heavy tails, and establish their (near-)minimax optimality under Gaussianity.

In particular, the consistency of MOSEG.MS is derived for a parameter space that simultaneously permits large changes over short intervals and small changes over long stretches of stationarity, which is much broader than that typically adopted in the literature. Comparative simulation studies and findings from the application of MOSEG.MS to the equity premium data support its efficacy. The R software implementing MOSEG and MOSEG.MS is available from https://github.com/Dom-Owens-UoB/moseg.

A.1 Under (sub-)Gaussianity
Table A.1 provides an overview of the theoretical properties of MOSEG and MOSEG.MS in comparison with the methods proposed in Wang et al. (2021), Kaul et al. (2019a) and Xu et al. (2022) for the change point problem in (1) under Gaussianity, as well as their computational complexity.
Specifically, Table A.1 reports the separation and localisation rates associated with each method, which are defined as below. For a given methodology, let K̂ denote the set of estimated change points. Then, when the magnitude of change ∆, measured by either ∆^{(1)} or ∆^{(2)} (see Table A.1), diverges faster than the separation rate s_{n,p} associated with the method, all q changes are detected by K̂ with asymptotic power one and their locations are consistently estimated with the localisation rate ℓ_{n,p}, such that max_{1≤j≤q} min_{k∈K̂} w_j |k − θ_j| ≲ ℓ_{n,p} with probability tending to one. Here, w_j refers to the relative difficulty in locating θ_j, which is related to the jump size δ_j.
Table A.1: Comparison of data segmentation methods developed for the model (1) in their theoretical properties under Gaussianity and computational complexity (for given tuning parameters). Here, s = max_{0≤j≤q} |S_j| and S = |∪_{j=0}^{q} S_j|. Refer to the text for the definitions of s_{n,p}, ℓ_{n,p}, ∆ and w_j. For Xu et al. (2022), we report the rates associated with their preliminary estimators; see the text for further explanation.
Wang et al. (2021) propose a method which learns the projection that is well-suited to reveal a change over each local segment and combines it with the wild binary segmentation algorithm (Fryzlewicz, 2014) for multiple change point detection. Kaul et al. (2019a) propose to minimise an ℓ_0-penalised cost function given a set of candidate estimators of size q. Their theoretical analysis implicitly assumes that min_j(θ_{j+1} − θ_j) scales linearly in n, and the simulated annealing adopted for minimising the penalised cost, denoted by SA(q) in Table A.1, has complexity ranging from O(q⁴) on average to exponential in the worst case. In specifying the properties of Kaul et al. (2019a), note that the global sparsity S = |∪_{j=0}^{q} S_j| can be much greater than the segment-wise sparsity s, particularly when the number of change points q is large. Xu et al. (2022) investigate the dynamic programming algorithm of Rinaldo et al. (2021) for minimising an ℓ_0-penalised cost function in a general setting permitting non-Gaussianity and temporal dependence, as is the case in the current paper. In Table A.1, we report the separation and localisation rates derived in Xu et al. (2022) for their preliminary estimators from the dynamic programming algorithm. These estimators are further refined by a procedure similar to Stage 2 of MOSEG, and the refined estimators are shown to attain the rate O_P(δ_j^{−2}) under a stronger condition on the size of changes, namely that ∆^{(1)}/(s² log³(pn)) → ∞. We note that the refined rate is derived for the individual change points, rather than for the simultaneous estimation of multiple change points, as is the case for the rates reported in Table A.1.
We also mention Zhang et al. (2015), where the data segmentation problem is treated as a high-dimensional regression problem with a group Lasso penalty, which only provides that the estimation error is o_P(n). Leonardi and Bühlmann (2016) consider both dynamic programming and binary segmentation algorithms for change point estimation, and we refer to Rinaldo et al. (2021) for a detailed discussion of their results. From Table A.1, we conclude that MOSEG.MS is highly competitive both computationally and statistically. We investigate the theoretical properties of MOSEG.MS in the broadest parameter space possible, which is formulated with ∆^{(2)} instead of ∆^{(1)} as in all the other papers; recall from the discussion following (17) that we always have ∆^{(1)} ≤ ∆^{(2)}, and the former can be much smaller than the latter when large shifts over short intervals and small changes over long stretches of stationarity are simultaneously present in the signal.

A.2 Beyond (sub-)Gaussianity and independence
The theoretical properties of MOSEG.MS reported in Table A.1 do not require independence or Gaussianity, unlike the other works with the exception of Xu et al. (2022). In the presence of serial dependence and sub-Weibull tails (through having γ > 1 as in Condition 1 (a)), Xu et al. (2022) require that ∆^{(1)} ≳ (s log(np))^{4γ+2γ′−1} for the detection of all change points, where a smaller value of γ′ ∈ (0, ∞) imposes a faster decay of the serial dependence. This is comparable to the detection boundary of MOSEG, ∆^{(1)} ≳ (s log(np))^{4γ+3}; see Proposition 2 (i) and Corollary 3 (i). In characterising temporal dependence, Condition 1 considers linear processes with algebraically decaying dependence, whereas γ′ of Xu et al. (2022) governs the rate of exponentially decaying, possibly non-linear dependence. The localisation rate in Corollary 3 (i) is also comparable to that attained by the preliminary estimators of Xu et al. (2022) produced by a dynamic programming algorithm; as noted above, they derive a further refined rate under a stronger condition on ∆^{(1)}.

B.1 Choice of the grid
We investigate the performance of MOSEG as the coarseness of the grid varies with r ∈ {1/3, 1/5, 1/10, 1/G}. Recall that when r = 1/G, we use the full grid T = {G, . . ., n − G} in Stage 1 of MOSEG, see (5). For this, we set n = 300, p = 100, s = 2 and q = 1, and generate the data under (1) with x_t ∼ iid N_p(0, I) and ε_t ∼ iid N(0, 1). For each realisation, the change point θ_1 is randomly sampled from {51, . . ., 250}. Varying δ ∈ {0.1, 0.2, 0.4, 0.8}, we generate β_0 = (β_{0,1}, . . ., β_{0,p})^⊤ with β_{0,i} = δ · (−1)^{i−1} for i ∈ {1, . . ., s}, and β_1 accordingly. Setting G = 50, we select the maximiser of the MOSUM statistic as the pre-estimator θ̃_1 in Stage 1 of MOSEG, which is then refined as in (9) in Stage 2. Table B.1 reports the average and the standard error of n^{−1}|θ̂_1 − θ_1| over 100 realisations when different grids are used; see also Figure B.1, which plots the Hausdorff distance d_H (see (19)) against r. When the size of change is very small, estimators from both Stages 1 and 2 perform equally poorly regardless of the choice of r. However, as δ increases, we quickly observe that the estimation error becomes close to zero for the estimators from both stages provided that r is not too large. Also, for δ ≥ 0.2, we observe that Stage 2 brings a small improvement in the localisation performance. From this, we conclude that the performance of MOSEG is robust to the choice of r provided that it is chosen reasonably small, say r ≤ 1/5.
We consider the following three settings for the generation of x t and ε t .
(E3) {(x_t, ε_t)}_{t=1}^n is generated as in (12), where D_1 is a diagonal matrix with 0.3 on its diagonal.

Under (E2)-(E3), the data are permitted to be heavy-tailed and serially correlated, respectively; (E1) serves as a benchmark. Table B.2 reports the average and standard error of the Hausdorff distance in (19) and q̂ − q over 100 realisations. It shows that, generally, the performance of all three methods is less sensitive to heavy-tailedness or temporal dependence than to the size of changes. VPWBS shows good localisation performance, while MOSEG.MS tends to achieve better detection accuracy when the size of change is small. DPDU performs very well in the more challenging setting with small δ. However, as noted in Section 4, it is more prone to over-estimating the number of change points as δ increases, with which the localisation performance also deteriorates.

B.3 When the dimensionality is large
We additionally examine the case where p = 1000, adopting the simulation setting (E1) from Appendix B.2. We exclude VPWBS (Wang et al., 2021) and DPDU (Xu et al., 2022) which, as shown in Section 4.2, tend to take considerably longer to run than MOSEG.MS.

C Proofs
In what follows, for any vector a ∈ R^p and a set A ⊂ {1, . . ., p}, we denote by a(A) = (a_i, i ∈ A)^⊤ the sub-vector of a supported on A. We write the population counterpart of T_k(G), with β*_{s,e} defined in (4), as

T*_k(G) = √(G/2) · |β*_{k−G,k} − β*_{k,k+G}|_2.

Further, we write S*_{s,e} = supp(β*_{s,e}).

C.1 Proof of Theorem 1
C.1.1 Supporting lemmas

Lemma C.1. We have

Let T_j = {θ_j − ⌊ηG⌋ + 1, . . ., θ_j + ⌊ηG⌋} ∩ T for 1 ≤ j ≤ q. Under Assumptions 4 and 5, we have

This ensures that any θ̃ ∈ Θ̃ satisfies min_{1≤j≤q} |θ̃ − θ_j| < G. Next, let θ^L_j and θ^R_j denote two points within T_j which are the closest to θ_j from the left and the right of θ_j, respectively, with θ^L_j = θ^R_j when r = 1/G. Then by construction of T, from this and by (C.4), at θ̃_j = arg max_{k∈T_j} T_k(G), we have the stated bound, where the second last inequality follows from Assumption 5 (b), and the last one from (11).

When η = 1, this and (C.5) indicate that such θ̃_j satisfies (6). When η < 1, note that (11) applies analogously. These arguments ensure that we detect at least one change point in T_j at t = θ̃_j for each j = 1, . . ., q. For such θ̃_j, re-arranging, we obtain the claim for large enough C_1 in Assumption 5 (b).
(i) Defining I(θ̃_j) accordingly, we show that for all k ∈ I(θ̃_j) satisfying δ_j² |k − θ_j| > v_{n,p}, with v_{n,p} = max(s ρ²_{n,p}, (s log(p))^{1/(1−τ)}), from the definition of s and the Cauchy-Schwarz inequality, and from (C.9), we have the bound. From (C.11)-(C.12), we derive it for a large enough C_1 in Assumption 5 (b). Then on R^{(1)}, we have the claim from Assumption 4 and (C.10). As for I_2, from Lemma C.2, (C.10) and (C.12), we have the bound on R^{(1)}. Turning our attention to I_3, from (C.11)-(C.12), the last inequality follows from Assumption 5 (b). Then on D^{(1)}, from (C.13), (C.14) and (C.15), we derive under Assumption 5 (b), for all k ∈ I_j satisfying δ_j² |k − θ_j| > v_{n,p} from (C.10). Analogous arguments apply when k ≤ θ_j, and the above arguments are deterministic on M. In summary, we have the assertion.

Let e_i denote a vector that contains zeros except for its i-th component, which is set to one. We denote the time-varying vector of parameters under (1) by β_t. We denote the functional dependence measure and the dependence-adjusted norm for U_t(a), as defined in Zhang and Wu (2017), respectively, with the choice of a determined within the context. Analogously, we define those for W_t(a, b). Finally, for some κ ≥ 0, we denote the dependence-adjusted sub-exponential norm. In what follows, we denote by C_Π, with Π ⊂ {γ, ν, Ξ, ς}, a constant that depends on the parameters included in Π, which may vary from one occasion to another.
Finally, we write a ∨ b = max(a, b) and a ∧ b = min(a, b).

Figure 1 reports the average execution time (in seconds) over 100 realisations for each setting, for the five methods in consideration. Both MOSEG and MOSEG.MS take only a fraction of the time taken by the competing methodologies, even when we do not use a coarser grid for Stage 1, and their run time does not vary much with increasing n or p in the ranges considered. As expected, MOSEG is faster than MOSEG.MS, but the difference in execution time is much smaller than that between MOSEG.MS and the other competitors.

Figure 2: Equity premium data: Parameter estimates from each estimated segment obtained by MOSEG.MS. Variables at different lags are coloured differently on the y-axis.

Table 1: (M1) Performance of MOSEG, MOSEG.MS, VPWBS and DPDU over 100 realisations. The best performer in each setting is denoted in bold.

Table 2: (M2) Performance of MOSEG, MOSEG.MS, VPWBS and DPDU over 100 realisations. The best performer in each setting is denoted in bold.

Table 3: (M3) Performance of MOSEG, MOSEG.MS, VPWBS and DPDU over 100 realisations. The best performer in each setting is denoted in bold.

Table 4: (M4) Performance of MOSEG, MOSEG.MS, VPWBS and DPDU over 100 realisations. The best performer in each setting is denoted in bold.

Table 6: Equity premium data: Change point estimators detected by MOSEG.MS.

Table B.1: Comparison of the Hausdorff distance d_H for Stage 1 and Stage 2 estimators from MOSEG when different grids are used. The average and the standard error of the estimation errors over 100 realisations are reported.

Figure B.1: Hausdorff distance d_H against r for Stage 1 (solid line) and Stage 2 (dashed line) estimators from MOSEG, as the size of changes varies.

Table B.2: Performance of MOSEG.MS and VPWBS under (E1)-(E3) over 100 realisations. The best performer in each setting is denoted in bold.

Table B.3 shows that, in comparison to the results under (E1) in Table B.2 obtained when p = 100, a greater sample size is required to detect smaller changes. Also, the localisation performance worsens as p increases. Nonetheless, MOSEG.MS demonstrates itself to be scalable as the dimensionality increases when the size of change is sufficiently large, which is in line with the theoretical requirements.

Table B.3: Performance of MOSEG.MS under (E1) when p = 1000 over 100 realisations.