Narrowest Significance Pursuit: inference for multiple change-points in linear models

We propose Narrowest Significance Pursuit (NSP), a general and flexible methodology for automatically detecting localised regions in data sequences, each of which must contain a change-point (understood as an abrupt change in the parameters of an underlying linear model), at a prescribed global significance level. NSP works with a wide range of distributional assumptions on the errors, and guarantees important stochastic bounds which directly yield exact desired coverage probabilities, regardless of the form or number of the regressors. In contrast to the widely studied "post-selection inference" approach, NSP paves the way for the concept of "post-inference selection". An implementation is available in the R package nsp (see https://CRAN.R-project.org/package=nsp).


Introduction
We propose a new generic methodology for determining, for a given data sequence and at a given global significance level, localised regions of the data that each must contain a change-point. We define a change-point in Y t on an interval [s, e] as an abrupt departure, on that interval, from a linear model for Y t with respect to pre-specified regressors. We now give examples of scenarios covered by the proposed methodology.
Scenario 1. Piecewise-constant signal plus noise. The model is Y_t = f_t + Z_t, t = 1, . . . , T, (1) where f_t is a piecewise-constant vector with an unknown number N and locations 0 = η_0 < η_1 < . . . < η_N < η_{N+1} = T of change-points, and Z_t is zero-centred noise.
The location η_j is a change-point if f_{η_j − 1} = f_{η_j} but f_{η_j} ≠ f_{η_j + 1}.
Scenario 2. Piecewise-polynomial (e.g. piecewise-constant or piecewise-linear) signal plus noise model. In (1), f_t is a piecewise-polynomial vector, in which the polynomial pieces have a fixed degree q ≥ 0, assumed known to the analyst. The location η_j is a change-point if f_t can be described as a polynomial vector of degree q on [η_j − q − 1, η_j], but not on [η_j − q, η_j + 1].

Scenario 3. Linear regression with piecewise-constant parameters. The model is Y_t = X_{t,·} β^{(j)} + Z_t for t = η_j + 1, . . . , η_{j+1}, j = 0, . . . , N, (2) where X is a given T × p design matrix and the parameter vectors β^{(j)} ∈ R^p satisfy β^{(j−1)} ≠ β^{(j)}, j = 1, . . . , N; the change-point locations η_j are as in Scenario 1.
Each of these scenarios is a generalisation of the preceding one. We permit a broad range of distributional assumptions for Z_t, from i.i.d. Gaussianity to autocorrelation, heavy tails and heterogeneity. We now review the existing literature on uncertainty in multiple change-point problems which seeks to make confidence statements about the existence or locations of change-points in particular regions of the data, or significance statements about their importance.
In the i.i.d. Gaussian piecewise-constant model, SMUCE (Frick et al., 2014) estimates the number N of change-points as the minimum among those candidate fits f̂_t for which the empirical residuals pass a certain test at level α. An issue for SMUCE, discussed e.g. in Chen et al. (2014), is that the smaller the significance level α, the more lenient the test on the empirical residuals, and therefore the higher the risk of underestimating N. This leads to the counter-intuitive behaviour of the coverage properties of SMUCE illustrated in Chen et al. (2014). SMUCE 2 (Chen et al., 2014) remedies this issue, but still requires that the number of estimated change-points agrees with the truth for successful coverage, which puts it at risk of being unable to cover the truth with a high nominal probability requested by the user. In the approach taken in this paper, this issue does not arise, as we shift the inferential focus away from N. SMUCE is extended to heterogeneous Gaussian noise in Pein et al. (2017) and to dependent data in Dette et al. (2020).
Some authors approach uncertainty quantification for multiple change-point problems from the point of view of post-selection inference (PSI, a.k.a. selective inference); these include Hyun et al. (2018), Hyun et al. (2021), Jewell et al. (2022) and Duy et al. (2020). To ensure valid inference, PSI conditions on many aspects of the estimation process, which tends to produce p-values with somewhat complex definitions. PSI also does not permit the selection of the tuning parameters of the inference procedure from the same data. Useful as they are in assessing the significance of previously estimated change-points, these PSI approaches share the following features: (a) they do not consider uncertainties in estimating change-point locations, (b) they do not provide regions of globally significant change in the data, (c) they define significance for each change-point separately, as opposed to globally, (d) they rely on a particular base change-point detection method with its potential strengths or weaknesses. Our approach contrasts with these features; in particular, in contrast to PSI, it can be described as enabling "post-inference selection", as we argue later on.
Some authors provide simultaneous asymptotic distributional results for the distance between the estimated change-point locations and the truth. In the linear regression context, this is done in Bai and Perron (1998, 2003), and in the piecewise-constant signal plus noise model in Eichinger and Kirch (2018). These approaches are asymptotic, conditional on the estimated change-point locations, and involve unknown quantities. In contrast, our methodology has a finite-sample nature, makes no assumptions on the signal, and is unconditional and automatic. A further discussion of the differences between our approach and that of Bai and Perron (1998, 2003) can be found in Section A of the appendix.
Inference for multiple change-points is also sometimes posed as control of the False Discovery Rate (FDR), see e.g. Li and Munk (2016), Hao et al. (2013) and Cheng et al. (2020), but this approach is focused on the number of change-points rather than on their locations.
The objective of our methodology, called "Narrowest Significance Pursuit" (NSP), is to automatically detect localised regions of the data Y_t, each of which must contain at least one change-point (in a suitable sense determined by the given scenario), at a prescribed global significance level. NSP performs unconditional inference without change-point location estimation, and proceeds as follows. A number M of intervals are drawn from the index domain [1, . . . , T], with start- and end-points chosen over an equispaced deterministic grid.
On each interval drawn, Y_t is then checked to see whether or not it locally conforms to the prescribed linear model, with any set of parameters. This check is performed through estimating the parameters of the given linear model locally by minimising a particular multiresolution sup-norm loss, and testing the residuals from this fit via the same norm; self-normalisation is involved if necessary. In the first greedy stage, the shortest interval (if one exists) is chosen on which the test is violated at a certain global significance level α. In the second greedy stage, the selected interval is searched for its shortest sub-interval on which a similar test is violated. This sub-interval is then chosen as the first region of global significance, in the sense that it must (at a global level α) contain a change-point, or otherwise the local test would not have rejected the linear model. The procedure then recursively draws M intervals to the left and to the right of the chosen region (with or without overlap), and stops when there are no further local regions of global significance.

Fang, Li and Siegmund (2020), in the piecewise-constant signal plus i.i.d. Gaussian noise model, approximate the tail probability of the maximum CUSUM statistic over all sub-intervals of the data.
They then propose an algorithm, in a few variants, for identifying short, non-overlapping segments of the data on which the local CUSUM exceeds the derived tail bound, so that the segments identified must each contain at least one change-point, at a given significance level. Fang and Siegmund (2020) present results of a similar nature for a Gaussian model with lag-one autocorrelation, linear trend, and features that are linear combinations of continuous, piecewise differentiable shapes. The most important high-level differences between NSP and these two approaches are that (a) NSP is ready for use with any user-provided design matrix X, and this requires no new calculations or coding, and yields correct coverage probabilities in finite samples of any length; (b) NSP searches for any deviations from local model linearity with respect to the regressors provided; (c) NSP is able to handle regression with autoregression practically in the same way as without, in a stable manner and on arbitrarily short intervals, and does not need accurate estimation of the unknown (nuisance) AR coefficients. We expand on these points in Section A of the appendix.
NSP has other distinctive features in comparison with the existing literature. It is specifically constructed to target the shortest possible significant intervals at every stage of the procedure, and to explore as many intervals as possible while remaining computationally efficient. NSP furnishes exact coverage statements, at a prescribed global significance level, for any finite sample size, and works in the same way regardless of the scenario and for any given regressors X. Also, thanks to the fact that the multiresolution sup-norms used in NSP can be interpreted as Hölder-like norms on certain function spaces, NSP naturally extends to the cases of unknown or heterogeneous distributions of Z_t via self-normalisation.
Finally, if simulation needs to be used to determine critical values for NSP, then this can be done in a computationally efficient manner.
Section 2 introduces the NSP methodology and provides the relevant finite-sample coverage theory. Section 3 extends this to NSP under self-normalisation and in the additional presence of autoregression. Section 4 provides finite-sample and traditional large-sample detection consistency and rate optimality results for NSP in Scenarios 1 and 2. Section 5 provides comparative simulations and extensive numerical examples under a variety of settings. Section 6 describes two real-data case studies. Complete R code implementing NSP is available in the R package nsp. There is an appendix, whose contents are mentioned at appropriate places in the paper. Proofs of our theoretical results are in the appendix.

The NSP inference framework
Throughout the section, we use the language of Scenario 3, which includes Scenarios 1 and 2 as special cases. In Scenario 1, the matrix X in (2) is of dimensions T × 1 and has all entries equal to 1. In Scenario 2, the matrix X is of dimensions T × (q + 1) and its ith column is given by (t/T)^{i−1}, t = 1, . . . , T. Scenario 4 (for NSP in the additional presence of autoregression), which generalises Scenario 3, is dealt with in Section 3.2.
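For concreteness, the design matrices just described can be generated in one line of R (a sketch with a hypothetical helper name; the nsp package builds its design matrices internally):

# T x (q+1) matrix whose ith column is (t/T)^(i-1), t = 1, ..., T
poly_design <- function(T_, q) outer((1:T_) / T_, 0:q, `^`)

In Scenario 1, poly_design(T_, 0) reduces to the T × 1 matrix of ones.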

Generic NSP algorithm
We start with a pseudocode definition of the NSP algorithm, in the form of a recursively defined function NSP(s, e, Y, X, M, λ_α, τ_L, τ_R), whose recursive calls in its final lines take the form NSP(s, s̃ + τ_L(s̃, ẽ), Y, X, M, λ_α, τ_L, τ_R) and NSP(ẽ − τ_R(s̃, ẽ), e, Y, X, M, λ_α, τ_L, τ_R), where [s̃, ẽ] is the interval of significance just detected.
On completion, the output of NSP is in the variable S. We now comment on the NSP function line by line. In lines 2-4, execution is terminated for intervals that are too short. Executing the recursive calls in lines 21 and 22 with τ_L = τ_R = 0 means that the procedure runs, in each recursive step, wholly on data sections between (and only including the end-points of) the previously detected intervals of significance. This ensures that the intervals of significance returned by NSP are non-overlapping; however, it also reduces the amount of data that the procedure is able to use at each recursive stage, which shows the importance of optionally allowing non-zero overlaps τ_L and τ_R in NSP. One possibility is e.g. the following.
τ_L(s̃, ẽ) = (s̃ + ẽ)/2 − s̃;  τ_R(s̃, ẽ) = (s̃ + ẽ)/2 + 1 − ẽ. (3)

In NSP, having p = p(T) growing with T is possible, but we must have p(T) + 1 ≤ T, or otherwise no regions of significance will be found. Section B of the appendix comments on a few other generic aspects of the NSP algorithm.
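To fix ideas, the following is a minimal schematic of the recursion in R. It is emphatically not the nsp package implementation: the interval sampler, the default zero overlaps and the two-stage search are simplified stand-ins, the helper names are ours, and dev_fn stands for the DeviationFromLinearity measure of Section 2.2, a runnable sketch of which is given there.

# Schematic of the NSP recursion; S accumulates the intervals of significance.
sample_intervals <- function(s, e, M) {
  k <- max(2, floor(sqrt(2 * M)))                 # about choose(k,2) <= M intervals
  pts <- unique(round(seq(s, e, length.out = k)))
  if (length(pts) < 2) return(matrix(numeric(0), 0, 2))
  t(combn(pts, 2))                                # rows are (start, end) pairs
}

nsp_recurse <- function(s, e, Y, X, M, lambda, dev_fn,
                        tau_L = function(st, en) 0, tau_R = function(st, en) 0,
                        S = list()) {
  if (e - s < ncol(X)) return(S)                  # interval too short to test
  ints <- sample_intervals(s, e, M)
  ints <- ints[order(ints[, 2] - ints[, 1]), , drop = FALSE]  # bottom-up: shortest first
  for (m in seq_len(nrow(ints))) {
    sm <- ints[m, 1]; em <- ints[m, 2]
    if (dev_fn(sm, em, Y, X) > lambda) {          # stage one: shortest significant interval
      sub <- sample_intervals(sm, em, M)          # stage two: its shortest significant
      sub <- sub[order(sub[, 2] - sub[, 1]), , drop = FALSE]  # sub-interval
      st <- c(sm, em)
      for (i in seq_len(nrow(sub)))
        if (dev_fn(sub[i, 1], sub[i, 2], Y, X) > lambda) { st <- c(sub[i, 1], sub[i, 2]); break }
      S <- c(S, list(st))
      S <- nsp_recurse(s, st[1] + tau_L(st[1], st[2]), Y, X, M, lambda, dev_fn, tau_L, tau_R, S)
      S <- nsp_recurse(st[2] - tau_R(st[1], st[2]), e, Y, X, M, lambda, dev_fn, tau_L, tau_R, S)
      return(S)
    }
  }
  S
}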

Measuring deviation from linearity in NSP
This section completes the definition of NSP (in the version without self-normalisation) by describing the DeviationFromLinearity function (NSP algorithm, line 12). Its basic building block is a scaled partial sum statistic, defined for an arbitrary input sequence {y_t}_{t=1}^T by U_{s,e}(y) = (e − s + 1)^{−1/2} Σ_{t=s}^{e} y_t. We define the scan statistic of an input vector y (of length T) with respect to an interval set I as ‖y‖_I = max_{[s,e] ∈ I} |U_{s,e}(y)|. Two interval sets are used: I^a, the set of all sub-intervals of [1, T], and I^d, a set of sub-intervals at dyadic scales; for a generic interval [s, e], the restrictions I^a_{[s,e]} and I^d_{[s,e]} collect the members of the respective sets contained in [s, e]. We refer to ‖·‖_{I^d}, ‖·‖_{I^a} and their restrictions as multiresolution sup-norms (see Nemirovski (1986) and Li (2016)) or, alternatively, multiscale scan statistics if they are used as operations on data. If the context requires this, the qualifier "dyadic" will be added to these terms when referring to the I^d versions. With this notation in place, DeviationFromLinearity(s_m, e_m, Y, X) is defined as follows.
Step 1. Find β_0 = arg min_β ‖Y_{s_m:e_m} − X_{s_m:e_m,·} β‖_{I^d_{[s_m,e_m]}}. This fits the postulated linear model between X and Y restricted to the interval [s_m, e_m]. However, we use the multiresolution sup-norm ‖·‖_{I^d_{[s_m,e_m]}} as the loss function, rather than the more usual L_2 loss. This has important consequences for the exactness of our significance statements, which we explain below.
Step 2. Compute the same multiresolution sup-norm of the empirical residuals from the above fit, D_{[s_m,e_m]} := ‖Y_{s_m:e_m} − X_{s_m:e_m,·} β_0‖_{I^d_{[s_m,e_m]}}.
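In R, these two steps can be sketched as follows. The fit in Step 1 is a linear programme: minimise t over (β, t) subject to |U_{s,e}(Y − Xβ)| ≤ t for every [s, e] in the interval set. The helper conventions below (in particular the exact dyadic grid) are ours and need not match the nsp package internals; dev_fn is the deviation measure assumed by the recursion sketch in Section 2.1.

U <- function(y, s, e) sum(y[s:e]) / sqrt(e - s + 1)   # scaled partial sum U_{s,e}(y)

dyadic_intervals <- function(n) {    # intervals of dyadic lengths, half-overlapping
  out <- list(); len <- 1
  while (len <= n) {
    for (s in seq(1, n - len + 1, by = max(1, len %/% 2)))
      out[[length(out) + 1]] <- c(s, s + len - 1)
    len <- 2 * len
  }
  out
}

msup_norm <- function(y, intervals)  # ||y||_I, the multiresolution sup-norm
  max(sapply(intervals, function(se) abs(U(y, se[1], se[2]))))

library(lpSolve)

msup_fit <- function(Y, X, intervals) {
  # LP variables: beta = beta_plus - beta_minus (2p non-negative vars), then t >= 0.
  p <- ncol(X)
  rows <- lapply(intervals, function(se) {
    w <- 1 / sqrt(se[2] - se[1] + 1)
    a <- w * colSums(X[se[1]:se[2], , drop = FALSE])
    b <- w * sum(Y[se[1]:se[2]])
    # |b - a' beta| <= t   <=>   a' beta + t >= b   and   -a' beta + t >= -b
    list(mat = rbind(c(a, -a, 1), c(-a, a, 1)), rhs = c(b, -b))
  })
  sol <- lp("min", c(rep(0, 2 * p), 1),
            do.call(rbind, lapply(rows, `[[`, "mat")),
            rep(">=", 2 * length(rows)),
            unlist(lapply(rows, `[[`, "rhs")))
  list(beta = sol$solution[1:p] - sol$solution[(p + 1):(2 * p)],
       D = sol$solution[2 * p + 1])  # D = min_beta ||Y - X beta||_I
}

# Deviation-from-linearity measure for the recursion sketch of Section 2.1:
dev_fn <- function(s, e, Y, X)
  msup_fit(Y[s:e], X[s:e, , drop = FALSE], dyadic_intervals(e - s + 1))$D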
Steps 1 and 2 above can be carried out in a single step as D_{[s_m,e_m]} = min_β ‖Y_{s_m:e_m} − X_{s_m:e_m,·} β‖_{I^d_{[s_m,e_m]}}; however, for comparison with other approaches, it will be convenient for us to use the two-stage process in Steps 1 and 2 for the computation of D_{[s_m,e_m]}. Computationally, the linear model fit in Step 1 can be carried out via simple linear programming, as in the sketch above; we use the R package lpSolve. The following important property lies at the heart of NSP.

Proposition 2.1 If the interval [s, e] does not contain any change-points, then D_{[s,e]} ≤ ‖Z_{s:e}‖_{I^d_{[s,e]}}.

This is a simple but valuable result, which can be read as follows: "under the local null hypothesis of no signal on [s, e], the test statistic D_{[s,e]}, defined as the multiresolution sup-norm of the empirical residuals from the same multiresolution sup-norm fit of the postulated linear model on [s, e], is bounded by the multiresolution sup-norm of the true residual process Z_t". This bound is achieved because the same norm is used in the linear model fit and in the residual check, and it is important to note that the corresponding bound would not be available if the postulated linear model were fitted with a different loss function, e.g. via OLS. Having such a bound allows us to transfer our statistical significance calculations to the domain of the unobserved true residuals Z_t, which is much easier than working with the corresponding empirical residuals. It is also critical to obtaining global coverage guarantees for NSP, as we now show.
Theorem 2.1 Let S = {S_1, . . . , S_R} be the set of intervals returned by the NSP algorithm. Then P(∀ i = 1, . . . , R : S_i contains at least one change-point) ≥ 1 − P(‖Z‖_{I^d} > λ_α) ≥ 1 − P(‖Z‖_{I^a} > λ_α).
Theorem 2.1 should be read as follows. Let α = P(‖Z‖_{I^a} > λ_α). For a set of intervals returned by NSP, we are guaranteed, with probability of at least 1 − α, that there is at least one change-point in each of these intervals. Therefore, S = {S_1, . . . , S_R} can be interpreted as an automatically chosen set of regions (intervals) of significance in the data. In the no-change-point case (N = 0), the correct reading of Theorem 2.1 is that the probability of obtaining one or more intervals of significance (R ≥ 1) is bounded from above by P(‖Z‖_{I^a} > λ_α).
NSP uses a multiresolution sup-norm fit which is then checked via the same multiresolution sup-norm. This leads to exact coverage guarantees for NSP with very simple mathematics.
In contrast to the confidence intervals in e.g. Bai and Perron (1998), the NSP regions of significance are not conditional on any particular estimator of N or of the change-point locations, and are in addition of a finite-sample nature. Still, they have a "confidence interval" interpretation in the sense that each must contain at least one change, with a certain prescribed global probability. NSP is not automatically equipped with pointwise estimators of change-point locations.
This is an important feature: it is what allows NSP to be so general and to work in the same way for any X. If it were to come with meaningful pointwise change-point location estimators, they would have to be designed for each X separately, e.g. using the maximum likelihood principle. (However, NSP can be paired up with such pointwise estimators; see immediately below for details.) We now introduce a few new concepts, to contrast this feature of NSP with the existing concept of post-selection inference.
"Post-inference selection" and "inference without selection". If it can be assumed that an hence the label of "post-inference selection". This avoids the complicated machinery of postselection inference, as we automatically know that the p-value associated with the estimated change-point must be less than α. Similarly, "inference without selection" refers to the use of NSP unaccompanied by a change-point location estimator.
"Simultaneous inference and selection" or "in-inference selection". In this construction, change-point location estimation on an interval [s,ẽ] occurs directly after adding it to S.

Gaussian Z_t
We now recall distributional results for ‖Z‖_{I^a} in the case Z_t ∼ i.i.d. N(0, σ²) with σ² assumed known, which will permit us to choose λ_α = λ_α(T) so that P{‖Z‖_{I^a} > λ_α(T)} → α as T → ∞. The resulting λ_α(T) can then be used in Theorem 2.1. As the result of Theorem 2.1 is otherwise of a finite-sample nature, some users may be uncomfortable resorting to large-sample asymptotics to approximate the distribution of ‖Z‖_{I^a}. However, (a) the asymptotic results outlined below approximate the behaviour of ‖Z‖_{I^a} well even for small samples, and (b) users not wishing to resort to asymptotics have the option of approximating the distribution of ‖Z‖_{I^a} by simulation (see Section H of the appendix), which is computationally fast. The assumption of a known σ² is common in the change-point inference literature, see e.g. Hyun et al. (2018) and Jewell et al. (2022). Section D of the appendix covers the unknown σ² case. Results on the distribution of ‖Z‖_{I^a} are given in Siegmund and Venkatraman (1995) and Kabluchko (2007). We recall the formulation from Kabluchko (2007) as it is slightly more explicit.
Theorem 2.2, a restatement of a result in Kabluchko (2007), asserts that for Z_t ∼ i.i.d. N(0, 1), lim_{T→∞} P(max_{1≤s≤e≤T} U_{s,e}(Z) ≤ a_T + b_T γ) = exp(−e^{−γ}), where the normalising sequences a_T, b_T involve an absolute constant H; we use the approximate value H = 0.82 in our numerical work. Using the asymptotic independence of the maximum and the minimum (Kabluchko and Wang, 2014), and the symmetry of Z, we get the following simple corollary.
Specifically, P(‖Z‖_{I^a} ≤ a_T + b_T γ) → exp(−2e^{−γ}) (6) as T → ∞. In light of (6), we obtain λ_α for use in Theorem 2.1 as follows: (a) equate the limiting expression in (6) to 1 − α; (b) solve for γ, obtaining γ_α = −log{−(1/2) log(1 − α)}; (c) set λ_α = a_T + b_T γ_α.

We now extend NSP to positively-dependent Gaussian innovations. Let {Z_t}_{t=1}^T be a stationary, zero-mean, non-negatively autocorrelated process with long-run standard deviation σ_LR. Let σ_{s,e} = Var^{1/2}{U_{s,e}(Z)}, and note that σ_{s,e} ≤ σ_LR. In the notation of Theorem 2.2, each U_{s,e}(Z) is therefore dominated, in terms of its standard deviation, by σ_LR, so that the i.i.d. N(0, 1) threshold multiplied by σ_LR remains valid. This demonstrates that valid coverage guarantees are obtained for a system with innovations Z by applying the NSP threshold equal to the threshold suitable for i.i.d. N(0, 1) innovations times the long-run standard deviation of Z. Long-run standard deviation estimation, especially in the presence of change-points, is a difficult problem, but several solutions have been proposed, including one in Dette et al. (2020) (in our Scenario 1). See also Section J of the appendix for a related discussion of NSP with autocorrelated innovations.
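Given the two-sided limit in (6), step (b) of the recipe above is a one-liner in R (a sketch; a_T and b_T denote the normalising sequences of Theorem 2.2, not reproduced here):

gamma_alpha <- function(alpha) -log(-log(1 - alpha) / 2)  # solves exp(-2 exp(-g)) = 1 - alpha
# lambda_alpha <- a_T + b_T * gamma_alpha(alpha); for autocorrelated Gaussian
# innovations, multiply the i.i.d. threshold by the long-run sd sigma_LR.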

Tightening the bounds: X-dependent thresholds
We now show how to obtain thresholds lower than those in Theorem 2.1 if the analyst is willing to allow their dependence on the design matrix X. This calls for the re-examination of Proposition 2.1. Consider the following alternative version.
Proposition 2.2 If the interval [s, e] does not contain any change-points, then D_{[s,e]} = min_β ‖Z_{s:e} − X_{s:e,·} β‖_{I^d_{[s,e]}} ≤ min_β ‖Z − Xβ‖_{I^d}.

This leads to a tighter version of Theorem 2.1.
Theorem 2.3 Let S = {S_1, . . . , S_R} be the set of intervals returned by the NSP algorithm. Then P(∀ i = 1, . . . , R : S_i contains at least one change-point) ≥ 1 − P(min_β ‖Z − Xβ‖_{I^d} > λ_α).
In Theorem 2.3, the probability P(min_β ‖Z − Xβ‖_{I^d} > λ_α) will be lower than P(‖Z‖_{I^d} > λ_α), the quantity used in Theorem 2.1. In addition, unlike the solution to P(‖Z‖_{I^d} > λ_α) = α, the solution λ_α to P(min_β ‖Z − Xβ‖_{I^d} > λ_α) = α (7) accounts for the number and form of the covariates X. To solve (7), the distribution of min_β ‖Z − Xβ‖_{I^d} can be obtained by simulation, separately for each set of covariates X and sample size T; see Section H of the appendix for details. The better localisation properties of the thus-obtained tighter bounds are illustrated, for Scenario 1, in Section 5.1.
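A minimal sketch of the simulation just mentioned, reusing msup_fit() and dyadic_intervals() from Section 2.2 (our conventions, not the nsp package internals):

# Empirical 100(1 - alpha)% quantile of min_beta ||Z - X beta||_{I^d}
# for a given design matrix X and i.i.d. N(0, 1) innovations Z.
sim_threshold_X <- function(X, alpha = 0.1, nsim = 1000) {
  ints <- dyadic_intervals(nrow(X))
  stats <- replicate(nsim, msup_fit(rnorm(nrow(X)), X, ints)$D)
  quantile(stats, 1 - alpha)
}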

NSP with self-normalisation and with autoregression
Self-normalised NSP for possibly heavy-tailed, heteroscedastic Z_t

Kabluchko and Wang (2014) point out that the square-root normalisation used in U_{s,e}(y) is not natural for distributions with tails heavier than Gaussian. We are interested in obtaining a universal normalisation in U_{s,e}(y) which would work across a wide range of possibly heavy-tailed distributions without requiring their explicit knowledge, including under heterogeneity. One such solution is offered by the self-normalisation framework developed in Rackauskas and Suquet (2003) and related papers. We now recall the basics and discuss the necessary adaptations to our context; the less mathematically-inclined reader is welcome to skip this description and proceed directly to formula (9), which gives the oracle self-normalised statistic computed on the true residuals Z_t.
We first discuss the relevant distributional results for the true residuals Z_t. We only cover the case of symmetric distributions of Z_t; for the non-symmetric case, which requires a slightly different normalisation, see Rackauskas and Suquet (2003). In the latter work, the modulus functions ρ_{θ,ν,c}(h) = h^θ log^ν(c/h) are considered, subject to regularity conditions; the last of these conditions, in particular, is satisfied if θ = 1/2 and ν > 1/2. The function ρ_{θ,ν,c} will play the role of a modulus of continuity. Let Z_1, Z_2, . . . be independent and symmetrically distributed with E(Z_t) = 0; note they do not need to be identically distributed. Define S_t = Z_1 + . . . + Z_t and V_t² = Z_1² + . . . + Z_t², and assume that V_T^{−2} max_{1≤t≤T} Z_t² → 0 in probability as T → ∞. Egorov (1997) shows that this last condition is equivalent to Z_t being within the domain of attraction of the normal law. Therefore, the material of this section applies to a much wider class of distributions than the heterogeneous extension of SMUCE in Pein et al. (2017), which only applies to normally distributed Z_t.
Let the random polygonal partial-sums process ζ_T be defined on [0, 1] as the linear interpolation between the knots (V_t²/V_T², S_t), t = 0, . . . , T, where S_0 = V_0 = 0, and let ζ_T^{se} = ζ_T/V_T be its self-normalised version. The Hölder-type space H⁰_{ρ_{θ,ν,c}}[0, 1] is a separable Banach space, and under the assumptions above, ζ_T^{se} converges in distribution in it, as T → ∞, to a standard Wiener process W. Fix ε > 0, set c = exp(1 + 2ε), and consider the statistic sup_{u,v ∈ [0,1]} I_{ρ_{1/2,1/2+ε,c}}(ζ_T^{se}, u, v), where I_ρ(ζ, u, v) = |ζ(v) − ζ(u)|/ρ(|v − u|); (9) this is the oracle self-normalised statistic computed on the true residuals Z_t. In the notation and under the conditions listed above, it is a direct consequence of the functional convergence that the statistic in (9) converges in distribution, as T → ∞, to sup_{u,v ∈ [0,1]} I_{ρ_{1/2,1/2+ε,c}}(W, u, v), (10) and the quantiles of the distribution in (10), which does not depend on the sample size T, can be computed (once) by simulation.
Following the narrative of Sections 2.2 and 2.3, to make these results operational in a new function DeviationFromLinearity.SN (where 'SN' stands for self-normalisation) for use in line 12 of the NSP algorithm, we need the following development. Assume initially that the global residual sum of squares V_T² is known. For a generic interval [s, e] containing no change-points, we need to be able to obtain empirical residuals Ẑ whose self-normalised statistic satisfies the bound stipulated in (11). This provides a self-normalised equivalent of Proposition 2.1 and requires that the deviation from linearity computed on an interval containing no change-points (the left-hand side of (11)) does not exceed the analogous oracle quantity computed on the true residuals (the right-hand side of (11)). Section F of the appendix describes the construction of Ẑ^{(k)} for k = 1, 2, 3 so that (11) is guaranteed, and introduces a suitable estimator of V_T² for use in (11).

NSP with autoregression
To accommodate autoregression while retaining the serial independence of Z t , we introduce the following additional scenario.
Scenario 4. Linear regression with autoregression, with piecewise-constant parameters. In this work, we treat the autoregressive order r as fixed and known to the analyst. Fang and Siegmund (2020) consider r = 1 and treat the autoregressive parameter as known, but acknowledge that in practice it is estimated from the data; however, they add that "[it] would also be possible to estimate [the autoregressive parameter] from the currently studied subset of the data, but this estimator appears to be unstable". NSP circumvents this instability issue, as explained below. For a given design matrix X, NSP for Scenario 4 proceeds as follows.
1. Supplement the design matrix X with the lagged versions of the variable Y; in other words, substitute X := [X | Y_{·−1} | · · · | Y_{·−r}], where Y_{·−k} denotes the respective backshift operation. Omit the first r rows of the thus-modified X, and the first r elements of Y.
2. Run the NSP algorithm of Section 2.1 with the new X and Y (with a suitable modification to line 12 if using the self-normalised version), with the following single difference: in lines 21 and 22, recursively call the NSP routine on the intervals to the left and to the right of the detected interval of significance, adjusted by r time points to account for the lagged regressors.
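A sketch of step 1 in R (a hypothetical helper, not the nsp package's own routine):

# Augment X with r lagged copies of Y and drop the first r time points;
# column k of 'lags' holds Y_{t-k} for the retained responses t = r+1, ..., T.
augment_with_lags <- function(X, Y, r) {
  T_ <- length(Y)
  lags <- sapply(1:r, function(k) Y[(r + 1 - k):(T_ - k)])
  list(X = cbind(X[(r + 1):T_, , drop = FALSE], lags),
       Y = Y[(r + 1):T_])
}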

Detection consistency and lengths of NSP intervals
We now study the consistency of NSP in detecting change-points, and the rates at which the lengths of the NSP intervals contract, as the sample size increases. We consider a version of the NSP algorithm that considers all sub-intervals of [1, T ], and we provide results in Scenario 1 as well as in Scenario 2 with continuous piecewise-linearity (this parallels the scenarios for which consistency is shown in Baranowski et al. (2019)).
So far in the paper, we have avoided introducing any assumptions on the signal: our coverage guarantees in Theorem 2.1 held under no conditions on the number of change-points, their spacing, or the sizes of the breaks. This is unsurprising, as they amounted to statistical size control. By contrast, the results of this section relate to detection consistency (and therefore 'power' rather than size) and, as such, require minimum signal strength assumptions.

Scenario 1 -piecewise constancy
In this section, f_t falls under Scenario 1. We start with assumptions on the strength of the change-points. For each change-point η_j, j = 1, . . . , N, define d̄_j, a measure of the difficulty of detecting η_j, which decreases as the magnitude of the jump in f_t at η_j increases. Recalling that η_0 = 0 and η_{N+1} = T, we require the following assumption (Assumption 4.1): the distances between consecutive change-points must be large enough relative to the corresponding d̄_j's. Theorem 4.1 then states that, on the set ‖Z‖_{I^a} ≤ λ_α, a version of the NSP algorithm that considers all sub-intervals, executed with no overlaps and with threshold λ_α, returns exactly N intervals of significance [s_j, e_j], such that η_j ∈ [s_j, e_j − 1] and e_j − s_j + 1 ≤ 2 d̄_j, for j = 1, . . . , N.
Theorem 4.1 leads to the following corollary.
Corollary 4.1 Let the assumptions of Theorem 4.1 hold, and in addition let Z_t ∼ i.i.d. N(0, σ²), with λ_α chosen as in Section 2.3.

Corollary 4.1 permits the smallest signal strength detectable by NSP, a quantity that characterises the difficulty of the multiple change-point detection problem, to be of order O(log^{1/2} T), which is the same as in Baranowski et al. (2019) and minimax-optimal as argued in Chan and Walther (2013). Further, the statement of Corollary 4.1 implies statistical consistency of NSP in the sense that with probability tending to one with T, NSP estimates the correct number of change-points and each NSP interval contains exactly one true change-point.
Moreover, the length of the NSP interval around each η_j is of order log T divided by the squared magnitude of the jump at η_j, which is near-optimal and the same as in Baranowski et al. (2019). Finally, this also implies that this consistency rate is inherited by any pointwise estimator of η_j that takes its value in the jth NSP interval of significance; this applies even to naive estimators constructed e.g. as the middle points of their corresponding NSP intervals [s_j, e_j], i.e. η̂_j = (s_j + e_j)/2.
More refined estimators, e.g. one based on CUSUM maximisation within each NSP interval, can also be used and will also automatically inherit the consistency and rate.
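Both kinds of estimator mentioned above take only a few lines in R (a sketch for Scenario 1, using one standard form of the CUSUM statistic):

# Naive (midpoint) and CUSUM-based change-point location estimators
# applied within a single NSP interval [s, e].
eta_mid <- function(s, e) floor((s + e) / 2)
eta_cusum <- function(Y, s, e) {
  y <- Y[s:e]; n <- length(y)
  cusum <- sapply(1:(n - 1), function(b)
    sqrt(b * (n - b) / n) * abs(mean(y[1:b]) - mean(y[(b + 1):n])))
  s + which.max(cusum) - 1    # last index of the left segment
}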

Scenario 2 -continuous piecewise linearity
In this section, f_t falls under Scenario 2 and is piecewise linear and continuous. Naturally, the definition of change-point strength has to be different from that in Section 4.1. For each change-point η_j, j = 1, . . . , N, let d̄_j = C_2 ξ_j^{−2/3} λ_α^{2/3}, (14) where ξ_j = |ξ_{j,1} − ξ_{j,2}|/2 and ξ_{j,1}, ξ_{j,2} are, respectively, the slopes of f_t immediately to the left and to the right of η_j, and C_2 is a certain universal constant (i.e. valid for all f_t), suitably large. The following theorem holds.
Theorem 4.2 Let Assumption 4.1 hold, with d̄_j defined in (14). On the set ‖Z‖_{I^a} ≤ λ_α, a version of the NSP algorithm that considers all sub-intervals, executed with no overlaps and with threshold λ_α, returns exactly N intervals of significance [s_1, e_1], . . . , [s_N, e_N], such that η_j ∈ [s_j, e_j − 1] and e_j − s_j + 1 ≤ 2 d̄_j, for j = 1, . . . , N.
We note that Assumption 4.1 is model-independent: we require it as much in the piecewise-constant Scenario 1 as in the piecewise-linear Scenario 2 (and in any other scenario), but with d̄_j defined separately for each scenario.

Table 2: Numbers of times, out of 100 simulated sample paths of each null model, that the respective method indicated no intervals of significance. Throughout the paper, all batches of 100 sample paths are simulated with the random seed initially set to 1.

Theorem 4.2 leads to the following corollary.
Corollary 4.2 Let the assumptions of Theorem 4.2 hold, and in addition let Z_t ∼ i.i.d. N(0, σ²), with λ_α chosen as in Section 2.3.
Corollary 4.2 implies that with λ_α as defined therein, and if ξ_j ∼ T^{−1} (a case in which f_t is bounded; see Baranowski et al. (2019)), the accuracy of change-point localisation via NSP (measured by e_j − s_j) is O(T^{2/3} log^{1/3} T), the same as in Baranowski et al. (2019) and within a logarithmic factor of the minimax rate of Raimondo (1998).

Table 3: Results for each model+method combination: "coverage" is the number of times, out of 100 simulated sample paths, that the respective model+method combination did not return a spurious interval of significance; "prop. gen. int." is the average (over 100 simulated sample paths) proportion of genuine intervals out of all intervals returned, if any (if none are returned, the corresponding 0/0 ratio is ignored in the average); "no. gen. int." is the average (over 100 sample paths) number of genuine intervals returned; "no. all int." is the average (over 100 sample paths) number of all intervals returned; "av. gen. int. len." is the average (over 100 sample paths) length of a genuine interval returned in the respective model+method combination. Note 1: for the Teeth 10 signal only, the corresponding averages are over 50 simulated sample paths, as the BP method crashed for the sample path indexed 52. Note 2: the BP methods were too slow to execute for the Blocks model.

Table 4: Numbers of times, out of 100 simulated sample paths of each null model, that the respective method indicated no intervals of significance. Here, the process Z_t is autocorrelated and σ is set to its true long-run standard deviation, rather than being estimated via MAD. "Noise 300 (a)" means a sample path of length 300 with marginal variance 1 and AR(1) autocorrelation structure with AR coefficient equal to a.

Table 5: Results for each model+method combination under autocorrelation: the process Z_t is autocorrelated and σ is set to its true long-run standard deviation, rather than being estimated via MAD. "Single 300 (a)" means the Single 300 signal plus a sample path of length 300 with marginal variance 1 and AR(1) autocorrelation structure with AR coefficient equal to a.

Scenario 1 -piecewise constancy
In this section, we demonstrate numerically that the guarantee offered by Theorem 2.1 holds for NSP in practice over a variety of Gaussian models with and without change-points in Scenario 1. We start by describing the competing methods. "NSP" is the NSP method executed with a deterministic grid using M = 1000 intervals, with the threshold chosen as in Section 2.3 and no interval overlaps, i.e. τ_L = τ_R = 0; σ is estimated via MAD. "NSP-SIM" is like "NSP" but uses the simulation-based thresholds of Section 2.4. "NSP-O" is like "NSP" but uses the overlap functions defined in (3). "NSP-SIM-O" is like "NSP-SIM" but uses the overlap functions as in "NSP-O". "BP" is the method of Bai and Perron (2003) as implemented in the routine breakpoints of the R package strucchange (version 1.5-3), with the minimum segment size set to 2; the number of change-points is chosen by BIC, and confidence intervals are then formed conditionally on the estimated model by using the confint.breakpointsfull routine, with the significance level Bonferroni-corrected for the estimated number of change-points. "BP-LIM" is like "BP" but with the number of change-points limited from above by the number of intervals returned by NSP (or one if NSP returns no intervals). "SMUCE" is the method of Frick et al. (2014), executed as stepR::stepFit(data, alpha, confband=TRUE); we use version 2.1-3 of stepR.
We begin with null models, by which we mean models (1) for which f t is constant throughout, i.e. N = 0. For null models, Theorem 2.1 promises that NSP at level α returns no intervals of significance with probability at least 1 − α. In this section, we use α = 0.1.
There are similar parameters in BP, BP-LIM and SMUCE, and they are also set to 0.1. All models used are listed in Table 1. We further test the NSP-* methods in the presence of noise autocorrelation as follows: we modify the Noise 300 and Single 300 signals of Table 1 by equipping them with AR(1) noise, as described in the captions of Tables 4 and 5.

Figure 2: Noisy (light grey) and true (black) wave2sect signal, with NSP_q significance intervals for q = 0 (left, misspecified model), q = 1 (middle, well-specified model), q = 2 (right, over-specified model). See Section 5.2 for more details.

Scenario 2 -piecewise linearity
We consider the continuous, piecewise-linear wave2sect signal, defined as the first 450 elements of the wave2 signal from Baranowski et al. (2019), contaminated with i.i.d. Gaussian noise with σ = 0.5. The signal and a sample path are shown in Figure 2. In this model, we run the NSP procedure, with no overlaps and with the other parameters set as in Section 5.1, (wrongly or correctly) assuming the following, where q denotes the postulated degree of the underlying piecewise polynomial: (a) q = 0, which wrongly assumes that the true signal is piecewise constant; (b) q = 1, which assumes the correct degree of the polynomial pieces making up the signal; (c) q = 2, which over-specifies the degree. We denote the resulting versions of the NSP procedure by NSP_q for q = 0, 1, 2. The intervals of significance returned by all three NSP_q methods are shown in Figure 2. Theorem 2.1 guarantees that the NSP_1 intervals each cover a true change-point with probability of at least 1 − α = 0.9, and this is the behaviour observed in this particular realisation. The same guarantee holds for the over-specified situation of NSP_2, but there is no performance guarantee for NSP_0.

Self-normalised NSP
We briefly illustrate the performance of the self-normalised NSP. We define a piecewise-constant test signal and contaminate it with heavy-tailed, heteroscedastic noise; each true change-point is covered by an NSP interval of significance, and we note that no spurious intervals get detected despite the heavy-tailed and heterogeneous character of the noise.
In addition, we run the self-normalised NSP, with the parameters as above, on heavy-tailed versions of the Noise 300 and Single 300 models from Table 1.

Data examples

The US ex-post real interest rate
We re-analyse the time series of the US ex-post real interest rate (the three-month treasury bill rate deflated by the CPI inflation rate) considered in Garcia and Perron (1996) and Bai and Perron (2003). The dataset is available at http://qed.econ.queensu.ca/jae/datasets/bai001/. The dataset Y_t, shown in the left plot of Figure 4, is quarterly and the range is 1961:1-1986:3, so t = 1, . . . , T = 103. The arguments outlined in Section K of the appendix justify the applicability of NSP in this context.
We first perform a naive analysis in which we assume our Scenario 1 (piecewise-constant mean) plus i.i.d. N(0, σ²) innovations. This is only so we can obtain a rough segmentation, which we can then use to adjust for possible heteroscedasticity of the innovations in the next stage; Ỹ_t denotes the resulting variance-adjusted series. We estimate σ² via σ̂²_MAD and run the NSP algorithm with the same parameter settings as in Section 5.1, obtaining two intervals of significance. We could stop here and agree with Garcia and Perron (1996), who also conclude that there are two change-points in this dataset, with locations within our detected intervals of significance. However, we note that the first interval, [23, 54], is relatively long, so one question is whether it could be covering another change-point to the left of η̂_1 = 47. To investigate this, we re-run NSP with the same parameters on Ỹ_{1:47} but find no intervals of significance (not even with the lower thresholds induced by the shorter sample size T_1 = 47, rather than the original T = 103). Our lack of evidence for a third change-point contrasts with Bai and Perron (2003)'s preference for a model with three change-points.
However, the fact that the first interval of significance, [23, 54], is relatively long could also be pointing to model misspecification. If the change of level over the first portion of the data were gradual rather than abrupt, we could naturally expect longer intervals of significance under the misspecified piecewise-constant model. To investigate this further, we now run NSP on Ỹ_t but in Scenario 2, initially in the piecewise-linear model (q = 1), which leads to one interval of significance. This raises the prospect of modelling the mean of Ỹ_{1:57} as linear. We produce such a fit, in which in addition the mean of Ỹ_{58:103} is modelled as piecewise-constant, with the change-point location η̂_2 = 79 found via a CUSUM fit on Ỹ_{58:103}. We also produce an alternative fit in which the mean of Ỹ_{1:79} (up to the change-point) is modelled as linear, and the mean of the remaining section as constant; the fits are shown in Figure 5. It is interesting to see that the quadratic+constant model for Y_t leads to a slightly lower residual variance than the piecewise-constant model (4.9 vs 4.94). Both models use five parameters. We conclude that more general piecewise-polynomial modelling of this dataset can be a viable alternative to the piecewise-constant modelling used in Garcia and Perron (1996) and Bai and Perron (2003). This example shows how NSP, beyond its usual role as an automatic detector of regions of significance, can also serve as a useful tool in achieving improved model selection.

House prices in London Borough of Newham
We consider the average monthly property price P_t in the London Borough of Newham, for all property types, recorded from January 2010 to November 2020 (T = 131) and accessed on 1st February 2021. The data is available at https://landregistry.data.gov.uk/. We use the logarithmic scale Q_t = log P_t and are interested in the stability over time of an autoregressive model with intercept for Q_t. Again, the arguments of Section K of the appendix justify the applicability of NSP here.
NSP, run on a deterministic equispaced interval sampling grid, with M = 1000 and α = 0.1, with the σ̂²_MOLS estimator of the residual variance (see Section D of the appendix), and both with no overlap and with an overlap as defined in formula (3), returns a single interval of significance in both cases. Table 6 shows the estimated regression coefficients (with their standard errors) over the two sections of the data separated by the detected change-point.
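The per-section fits reported in Table 6 can be reproduced along the following lines (a sketch; cp denotes a change-point location chosen within the NSP interval of significance):

# AR(1)-with-intercept fit on one section of Q; returns estimates and standard errors.
fit_section <- function(Q) {
  d <- data.frame(y = Q[-1], ylag = Q[-length(Q)])
  coef(summary(lm(y ~ ylag, data = d)))
}
# fit_section(Q[1:cp]); fit_section(Q[(cp + 1):length(Q)])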
It appears that both the intercept and the autoregressive parameter change significantly at the change-point. In particular, the change in the autoregressive parameter from 1.03 (standard error 0.02) to 0.95 (0.02) suggests a shift from a unit-root process to a stationary one. This agrees with a visual assessment of the character of the process in the right plot of Figure 3, where it appears that the process is more 'trending' before the change-point than after it.

A Comparisons with existing approaches

Bai and Perron (2003) recommend a procedure in which "the number of breaks can be decided based upon a sequential examination of the sup F(l + 1|l) statistics constructed using global minimizers for the break dates (i.e. ignore the test F(1|0) and select m such that the tests sup F(l + 1|l) are insignificant for l ≥ m). This method leads to the best results and is recommended for empirical applications."
For the purpose of this discussion, we label the process above the 'Improved Sequential Procedure' (ISP). Bai and Perron (2003) do not formulate or prove the inferential properties of the m selected by ISP. For a procedure that selects the number of change-points, the control of global significance would have to mean, in particular, a guarantee that the true number of change-points is at least as high as the estimated number, with at least 1 − α probability. NSP provides such a statement as a simple corollary of Theorem 2.1 in the main paper, but ISP is a complex sequential process put together from separate, non-independent, conditionally applied tests, and the exact guarantees for the resulting output (m) have not been shown.
The next difference is that the UDmax and WDmax tests require the provision of the maximum number of change-points; NSP does not require this, thereby eliminating the risk of the user providing too low a maximum.
Furthermore, the ISP test only concerns the number of change-points, but not their locations: inference for locations in Bai and Perron (1998) and Bai and Perron (2003) is carried out later, conditionally on the number of change-points and on their estimated locations.
Not only that, but also the obtained conditional confidence intervals are asymptotic in nature and are only valid for large sample sizes (unknown to the user). By contrast, NSP provides a single, clear, joint, finite-sample guarantee for the number of change-points and for their locations: it flags up disjoint regions in the data, each of which must contain at least one change-point with a global probability specified by the user. The NSP intervals of significance serve as "unconditional" confidence intervals (in contrast to the conditional CIs of Bai and Perron (1998) and Bai and Perron (2003), whose conditionality on the number of estimated change-points and the estimated locations means that the user cannot be sure whether they contain change-points with a certain probability). The NSP guarantees are valid for any, even small, sample sizes.
Next, we discuss in more detail the most important high-level differences between NSP and the approaches of Fang, Li and Siegmund (2020) and Fang and Siegmund (2020).
(a) While Fang, Li and Siegmund (2020) and Fang and Siegmund (2020) perform change-point location estimation as well as inference, NSP works on the principle of "inference without location estimation". This is a key property of NSP, which enables it to use an all-purpose multiscale test, whose distribution under the null is stochastically bounded by the scan statistic of the corresponding true residuals Z_t, and is therefore independent of the scenario and of the design matrix X used. This means that NSP is ready for use with any user-provided design matrix X, and this will require no new calculations or coding, and will yield correct coverage probabilities. This is in contrast to the approach taken in Fang, Li and Siegmund (2020) and Fang and Siegmund (2020), in which, because of their focus on location estimation, each new scenario not already covered would involve new and fairly complicated approximations of the null distribution. (We note that outside the change-point context, the method for constructing confidence intervals for groups of variables in sparse high-dimensional regression by Meinshausen (2015) shares with NSP the attractive property of providing valid error control without assumptions on the design matrix.) (b) While in Fang, Li and Siegmund (2020) and Fang and Siegmund (2020) the user needs to be able to specify the significant signal shapes to look for, NSP searches for any deviations from local model linearity with respect to the specific regressors provided.
(c) Out of our scenarios, Fang, Li and Siegmund (2020) and Fang and Siegmund (2020) provide results under our Scenario 1 and Scenario 2 with linearity and continuity. Their results do not cover our Scenario 3 (linear regression with arbitrary X), Scenario 2 with linearity but not necessarily continuity, or Scenario 2 with higher-than-linear polynomials.
(d) Thanks to its double use of the multiresolution sup-norm (in the local linear fit, and then in the test of this fit), NSP is able to handle regression with autoregression practically in the same way as without, in a stable manner and on arbitrarily short intervals, and does not suffer from having to estimate the unknown (nuisance) AR coefficients accurately. This is of importance, as change-point analysis under serial dependence in the data is a problem known to be difficult, and NSP offers a new approach to it, thanks to this feature.
Finally, we provide additional references on the use of scan statistics. In the literature, scaled partial sum statistics acting directly on the data are often combined into variants of scan statistics (Siegmund and Venkatraman, 1995; Arias-Castro et al., 2005; Jeng et al., 2010; Walther, 2010; Chan and Walther, 2013; Sharpnack and Arias-Castro, 2016; König et al., 2020). They are also used in estimators represented as the simplest (from the point of view of a certain regularity or smoothness functional) fit to the data for which the empirical residuals are deemed to behave like the true residuals (Frick et al., 2014; Davies and Kovac, 2001; Davies et al., 2009; Li, 2016).

B Discussion of the NSP algorithm
We now comment on a few generic aspects of the NSP algorithm as defined in the main paper.
Length check for [s, e] in line 2 Consider an interval [s, e] with e − s < p. If it is known that the matrix X s:e,· is of rank e − s + 1 (as is the case, for example, in Scenario 2, for all such s, e) then it is safe to disregard [s, e], as the response Y s:e can then be explained exactly as a linear combination of the columns of X s:e,· , so it is impossible to assess any deviations from linearity of Y s:e with respect to X s:e,· . Therefore, if this rank condition holds, the check in line 2 of NSP can be replaced with e − s < p, which (together with the corresponding modifications in lines 5-10) will reduce the computational effort if p > 1.
Having p = p(T ) growing with T is possible in NSP, but by the above discussion, we must have p(T ) + 1 ≤ T or otherwise no regions of significance will be found.
Sub-interval sampling Sub-interval sampling in lines 5-10 of the NSP algorithm is done to reduce the computational effort. In the change-point detection literature (without inference considerations), Wild Binary Segmentation (WBS, Fryzlewicz, 2014) uses a random interval sampling mechanism in which all or almost all intervals are sampled at the start of the procedure, i.e. with all or most intervals not being sampled recursively. The same style of interval sampling is used in the Narrowest-Over-Threshold change-point detection (note: not change-point inference) algorithm (Baranowski et al., 2019) and is mentioned in passing in Fang, Li and Siegmund (2020). Instead, NSP uses a different, recursive interval sampling mechanism, introduced in the change-point detection (not inference) context in Wild Binary Segmentation 2 (WBS2, Fryzlewicz, 2020). In NSP (lines 5-10), intervals are sampled separately in each recursive call of the NSP routine. As argued in Fryzlewicz (2020), this enables a more thorough exploration of the domain {1, . . . , T} and hence better feature discovery than the non-recursive sampling style. We note that NSP can equally use random or deterministic interval selection mechanisms; a specific example of a deterministic interval sampling scheme in a change-point detection context can be found in Kovács et al. (2023).
Our general preference is for NSP to be used with deterministic sampling, as it leads to reproducible results without the user having to fix the random seed.

Relationship to NOT

The NSP algorithm bears some resemblance to the Narrowest-Over-Threshold (NOT) change-point detection algorithm of Baranowski et al. (2019): both methods iteratively focus on the narrowest intervals on which a certain test (a change-point locator for NOT; a multiscale scan statistic on multiresolution sup-norm fit residuals for NSP) exceeds a threshold, but this is where similarities end: apart from this common feature, the objectives, scopes and modi operandi of both methods are different.

Focus on the smallest significant regions Some authors in the inference literature also identify the shortest intervals (or smallest regions) of significance in data. For example, Dümbgen and Walther (2008) plot minimal intervals on which a density function significantly decreases or increases. Walther (2010) plots minimal significant rectangles on which the probability of success is higher than a baseline, in a two-dimensional spatial model. Fang, Li and Siegmund (2020) mention the possibility of using the interval sampling scheme from Fryzlewicz (2014) to focus on the shortest intervals in their CUSUM-based determination of regions of significance in Scenario 1. In addition to NSP's new definition of significance involving the multiresolution sup-norm fit (whose benefits are explained in Section 2.2 of the main paper), NSP is also different from these approaches in that its pursuit of the shortest significant intervals is at its algorithmic core and is its main objective. To achieve it, NSP uses a number of solutions which, to the best of our knowledge, either are new or have not been considered in this context before. These include the two-stage search for the shortest significant sub-interval (NSP routine, line 19) and the recursive sampling (lines 5-10, proposed previously, but in a non-inferential context, by Fryzlewicz (2020)).
Lack of penalisation for fine scales Instead of using multiresolution sup-norms (multiscale scan statistics) as defined in the main paper, some authors, including Walther (2010) and Frick et al. (2014), use alternative definitions which penalise fine scales (i.e. short intervals) in order to enhance detection power at coarser scales. We do not pursue this route, as NSP aims to discover significant intervals that are as short as possible, and hence we are interested in retaining good detection power at fine scales. However, some natural penalisation of fine scales necessarily occurs in the self-normalised case; see Section 3.1 of the main paper.

Bottom-up implementation of NSP Our implementation of NSP is "bottom-up", in the sense that at each recursive stage, we consider the intervals [s_m, e_m] in non-decreasing order of their lengths, and exit the current recursive stage (if and) as soon as significance is declared, rather than moving on to longer intervals. This aligns with the objective of looking for the shortest intervals (so the examination of longer intervals is unnecessary if shorter significant intervals have been found). Any non-bottom-up implementation of NSP would therefore unnecessarily be wasting computational resources. This is in contrast to, for example, the region-based multiple testing method of Meijer et al. (2015), in which the successive p-value adjustments (which lead to power improvements) are only possible because of the top-down character of that approach.

C Proofs of results of Section 2
Proof of Proposition 2.1. As [s, e] does not contain any change-points, there exists a β* such that Y_{s:e} = X_{s:e,·} β* + Z_{s:e}. Therefore, D_{[s,e]} = min_β ‖Y_{s:e} − X_{s:e,·} β‖_{I^d_{[s,e]}} ≤ ‖Y_{s:e} − X_{s:e,·} β*‖_{I^d_{[s,e]}} = ‖Z_{s:e}‖_{I^d_{[s,e]}}, which completes the proof.
Proof of Theorem 2.1. The second inequality is implied by (5) in the main paper. We now prove the first inequality. On the set ‖Z‖_{I^d} ≤ λ_α, each interval S_i must contain a change-point: if it did not, then by Proposition 2.1 we would have to have D_{S_i} ≤ ‖Z‖_{I^d} ≤ λ_α. (15) However, the fact that S_i was returned by NSP means, by line 14 of the NSP algorithm, that D_{S_i} > λ_α, which contradicts (15). This completes the proof.
Proof of Proposition 2.2. The inequality is true because for any fixed β, the norm ‖Z − Xβ‖_{I^d} is a maximum over a larger set than the maximum in ‖Z_{s:e} − X_{s:e,·} β‖_{I^d_{[s,e]}}. We now prove the equality. As [s, e] does not contain a change-point, there is a β* such that Y_{s:e} = X_{s:e,·} β* + Z_{s:e}. Hence Y_{s:e} − X_{s:e,·} β = Z_{s:e} − X_{s:e,·}(β − β*), and minimising over β is equivalent to minimising over β̃ = β − β*, so that min_β ‖Y_{s:e} − X_{s:e,·} β‖_{I^d_{[s,e]}} = min_β̃ ‖Z_{s:e} − X_{s:e,·} β̃‖_{I^d_{[s,e]}}, which completes the proof.
Proof of Theorem 2.3. On the set min_β ‖Z − Xβ‖_{I^d} ≤ λ_α, each interval S_i must contain a change-point: if it did not, then by Proposition 2.2 we would have to have D_{S_i} ≤ min_β ‖Z − Xβ‖_{I^d} ≤ λ_α. (16) However, the fact that S_i was returned by NSP means, by line 14 of the NSP algorithm, that D_{S_i} > λ_α, which contradicts (16). This completes the proof.

D Estimated σ 2 , and other light-tailed distributions
We first show under what condition Theorem 2.2 in the main paper remains valid with an estimated variance σ², and give an estimator of σ² that satisfies this condition for certain matrices X and parameter vectors β^{(j)}. Similar considerations are possible for the light-tailed distributions from the latter part of this section, but we omit them here. With {Z_t}_{t=1}^T ∼ N(0, σ²) rather than N(0, 1), the statement of Theorem 2.2 of the main paper trivially modifies to lim_{T→∞} P(max_{1≤s≤e≤T} U_{s,e}(Z) ≤ σ(a_T + b_T γ)) = exp(−e^{−γ}). (17) From the form of the limiting distribution, it is clear that the theorem remains valid with σ replaced by a generic estimator σ̂, in the sense that P(max_{1≤s≤e≤T} U_{s,e}(Z) ≤ σ̂(a_T + b_T γ)) converges to the same limit (18), under suitable conditions on σ̂, which we now identify. In light of (17), it is enough to solve for γ_T in σ(a_T + b_T γ_T) = σ̂(a_T + b_T γ), yielding γ_T = (σ̂/σ)γ + (σ̂/σ − 1) a_T/b_T. In view of the form of a_T and b_T defined in Theorem 2.2 of the main paper, for which a_T/b_T ≍ 2 log T, we have γ_T → γ as T → ∞ on a set large enough for (18) to hold if |σ̂/σ − 1| log T → 0 in probability. (19) After Rice (1984) and Dümbgen and Spokoiny (2001), define σ̂²_R = {2(T − 1)}^{−1} Σ_{t=2}^T (Y_t − Y_{t−1})². Define the signal in model (2) of the main paper by f_t = X_{t,·} β^{(j)} for t = η_j + 1, . . . , η_{j+1}. As in Dümbgen and Spokoiny (2001), we have a bound on E|σ̂²_R − σ²| in terms of the total variation TV(f) of the signal, from which (19) follows, by the Markov inequality, if TV(f) = o(T log^{−1} T). (20) By way of a simple example, in Scenario 1, note that if f is bounded with a number of change-points that is finite in T, then TV(f) = const(T). Similar arguments apply in Scenario 2, and in Scenario 3 for some matrices X.
Without formal theoretical justification, we also mention two further estimators of σ² (or σ) which we use in our numerical work. In Scenarios 1 and 2, we use σ̂_MAD, the Median Absolute Deviation (MAD) estimator as implemented in the R routine mad, computed on the scaled differences (Y_{t+1} − Y_t)/√2, t = 1, . . . , T − 1. Empirically, σ̂_MAD is more robust than σ̂_R to the presence of change-points in f_t, but is also more sensitive to departures from the Gaussianity of Z_t.
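Both estimators are one-liners in R (assuming, as is standard, that the MAD is computed on the scaled first differences of the data):

sigma_R   <- function(Y) sqrt(sum(diff(Y)^2) / (2 * (length(Y) - 1)))  # Rice (1984)
sigma_MAD <- function(Y) mad(diff(Y) / sqrt(2))                        # MAD on differences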
In Scenario 3, in settings outside Scenarios 1 and 2, we use σ̂_MOLS, constructed as follows: in model (2), local OLS-based estimators σ̂_t of σ are computed on short sections of the data; if the sections are short enough, most of them are computed on change-point-free sections of the data, and therefore the median of these local estimators should serve as an accurate estimator of the true σ. Empirically, σ̂_MOLS is a useful alternative to σ̂_R in settings in which condition (20) is not satisfied.

Kabluchko and Wang (2014) provide a result similar to Theorem 2.2 of the main paper for distributions of Z dominated by the Gaussian in a sense specified below. These include, after scaling so that E(Z) = 0 and Var(Z) = 1, the symmetric Bernoulli, symmetric binomial and uniform distributions, amongst others. We now briefly summarise it. Consider the cumulant-generating function of Z defined by ϕ(u) = log E(e^{uZ}), and assume that for some σ_0 > 0, we have ϕ(u) < ∞ for all u ≥ −σ_0. Assume further that for all ε > 0, sup_{u≥ε} ϕ(u)/(u²/2) < 1. Under one further technical assumption, which we omit here, Kabluchko and Wang (2014) describe the asymptotic distribution of the maximum in question; after simple algebraic manipulations, this result permits a selection of λ_α for use in Theorem 2.1 of the main paper, similarly to Section 2.3 of the main paper.
E Importance of two-stage search for shortest interval of significance

We next illustrate the importance of the two-stage search for the shortest interval of significance, whose stage two is performed in line 19 of the NSP algorithm via the call [s̃, ẽ] := ShortestSignificantSubinterval(s_{m_0}, e_{m_0}, Y, X, M, λ_α).
Consider the blocks signal referred to in the main paper, but with the much smaller noise standard deviation σ = 1. A realisation Y_t is shown in the left plot of Figure 6. All N = 11 change-points are visually obvious and hence we would expect NSP to return 11 intervals [s̃_i, ẽ_i], exactly covering the true change-points, for which we would have ẽ_i − s̃_i = 1 for most if not all i. As shown in the middle plot of Figure 6, the NSP procedure with no overlap and with the same parameters as in Section 5.1 of the main paper returns 11 intervals of significance, with ẽ_i − s̃_i = 1 for i = 1, . . . , 10 and ẽ_{11} − s̃_{11} = 2. The 11 intervals of significance cover the true change-points.
However, consider now an alternative version of NSP, labelled NSP(1), in which the second-stage search of line 19 is omitted. In other words, [s_{m_0}, e_{m_0}] is not searched for its shortest sub-interval of significance, but is added to S as it is. The output of NSP(1) on Y_t is shown in the right plot of Figure 6.
The intervals of significance returned by NSP(1) are substantially longer, as a first-stage detection interval may far exceed its shortest significant sub-interval. This issue would not arise in NSP, as NSP would then search this detection interval for its shortest significant sub-interval. From the output of the NSP procedure, we can see that this second-stage search drastically reduced the length of this detection interval, which is unsurprising given how obvious the change-points are in this example. This illustrates the importance of the two-stage search in NSP.
For very long signals, it is conceivable that an analogous three-stage search may be a better option, possibly combined with a reduction in M to enhance the speed of the procedure.

Figure 6: a realisation of the blocks signal with σ = 1 (left), with the intervals of significance returned by NSP (middle) and by NSP(1) (right). See Section E for more details.

F Self-normalised NSP -further discussion
We now outline the construction of Ẑ^{(k)} for k = 1, 2, 3 so that (11) in the main paper is guaranteed, and propose a suitable estimator of V_T² for use in (11) in the main paper.
Let Ẑ^{(1)}_{(i+1):j} be the ordinary least-squares residuals from regressing Y_{(i+1):j} on X_{(i+1):j,·}; this construction guarantees the bound (21) on the corresponding residual sum of squares for a range of distributions of Z_t and design matrices X. We now briefly sketch the argument justifying this for Scenario 1; similar considerations are possible in Scenario 2 but are notationally much more involved and we omit them here. The argument relies again on self-normalisation. From standard least-squares theory (in any Scenario), the residual sum of squares of Ẑ^{(1)} does not exceed that of the true residuals Z. In light of the distributional result (10) of the main paper, the relationship between the statistic I_{ρ_{1/2,1/2+ε,c}}(W, u, v) and Rackauskas and Suquet (2004)'s statistic UI(ρ_{1/2,1/2+ε,c}), as well as their Remark 5, we are able to bound sup_{0≤i<j≤T} I²_{ρ_{1/2,1/2+ε,c}}(ζ^{se}, V_i²/V_T², V_j²/V_T²) by a term of order O(log T) on a set of probability 1 − O(T^{−1}). Making the mild assumption that sup_{0≤i<j≤T} log^{1+2ε}{cV_T²/(Z²_{i+1} + . . . + Z²_j)} =: l_T = o_P(T log^{−1} T) and continuing from (22), we obtain the bound (23) on the self-normalised statistic of the empirical residuals, which in turn guarantees the bound (11) in the main paper. The resulting minimisation is practically equivalent to the multiresolution norm minimisation solved in Step 1 of Section 2.2 of the main paper, except that it now uses a weighted version of the norm ‖·‖_{I^a_{[s,e]}}, where the weights are given in the denominator of (23). This weighted problem is solved via linear programming just as easily as Step 1 of Section 2.2 of the main paper, the only difference being that the relevant constraints are multiplied by the corresponding weights.
We now discuss further practicalities of the self-normalisation. In the exposition of the main paper, we use all intervals [i + 1, j] ⊆ [s, e], i.e. the set I^a_{[s,e]}. In practice, for computational reasons, we compute the supremum on the LHS of (11) in the main paper over the dyadic set I^d_{[s,e]}, which does not alter the validity of the bound. Our empirical experience is that the statistic on the LHS of (11) of the main paper is fairly robust to the choice of V_T², as the latter only enters through the (close to) square-root logarithmic term in the denominator. In addition, over-estimation of V_T² for use on the LHS of (11) of the main paper is permitted, as it only strengthens the bound in (11). For these reasons, we do not dwell on the accurate estimation of V_T² here, but use a rough estimate V̂_T² constructed from the σ̂_t's, the constituents of the σ̂_{MOLS} estimator from Section D. As clarified earlier, the use of (21) requires that small values of j − i do not enter the computation of the supremum on the LHS of (11) of the main paper. In practice, however, we use the set I^d_{[s,e]}; this is justified by the behaviour of the function I_{ρ_{1/2,1/2+},c}(ζ^{se}, ·, ·) for small values of j − i.
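As an illustration of the computational point just made, the following sketch evaluates a self-normalised scan statistic over a dyadic set of intervals. The particular statistic |Σ seg| / (Σ seg²)^{1/2} and the shift grid are simplifying assumptions made for this example; they stand in for, but are not identical to, the quantity on the LHS of (11).

```r
# Minimal sketch: supremum of a simplified self-normalised scan statistic
# over a dyadic interval set; 'resid' plays the role of the residuals.
dyadic_sup <- function(resid, min_len = 4) {
  T_len    <- length(resid)
  sup_stat <- 0
  len      <- min_len
  while (len <= T_len) {                                  # dyadic scales
    for (start in seq(1, T_len - len + 1, by = max(1, len %/% 2))) {
      seg      <- resid[start:(start + len - 1)]
      stat     <- abs(sum(seg)) / sqrt(sum(seg^2))        # self-normalised
      sup_stat <- max(sup_stat, stat)
    }
    len <- 2 * len
  }
  sup_stat
}
set.seed(1)
dyadic_sup(rnorm(512))
```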

G NSP with autoregression
We use the piecewise-constant signal of length T = 1000 from the first simulation setting in Dette et al. (2020), contaminated with Gaussian AR(1) noise with coefficient 0.9 and standard deviation (1 − 0.9²)^{−1/2}/5. A sample path, together with the true change-point locations, is shown in Figure 7.
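As a hedged illustration of this setting, the snippet below generates a piecewise-constant signal contaminated with AR(1) noise of the stated marginal standard deviation; the change-point locations and segment means are hypothetical placeholders, not the values used by Dette et al. (2020).

```r
# Illustrative data generation for Section G; 'cpts' and 'means' are
# hypothetical placeholders.
set.seed(42)
T_len <- 1000
cpts  <- c(200, 400, 600, 750, 900)                 # hypothetical change-points
means <- c(0, 2, -1, 1, -2, 0)                      # hypothetical segment means
f     <- rep(means, times = diff(c(0, cpts, T_len)))
# AR(1) noise with coefficient 0.9; innovation sd 1/5 yields the marginal
# standard deviation (1 - 0.9^2)^(-1/2)/5 quoted in the text
z <- arima.sim(list(ar = 0.9), n = T_len, sd = 1/5)
y <- f + as.numeric(z)
```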
We run the AR version of the NSP algorithm (as outlined in Section 3.2 of the main paper), with the following parameters: a deterministic equispaced interval sampling grid, M = 100, α = 0.1, no overlap, and the σ̂²_{MOLS} estimator of the residual variance. The resulting intervals are shown in Figure 7 (sample path with AR(1) noise with coefficient 0.9 and standard deviation (1 − 0.9²)^{−1/2}/5 in light grey, NSP intervals of significance shaded red, true change-points in blue); the NSP intervals cover four out of the five true change-points. We simulate from this model 100 times and obtain the following results. In 100% of the sample paths, each NSP interval of significance covers one true change-point (which fulfils the promise of Theorem 2.1 of the main paper). The distribution of the detected numbers of intervals is given in Table 7; we recall that NSP, with a fixed significance level, does not promise to detect a number of intervals equal to the number of true change-points in the underlying process.

H Computation of the NSP threshold by simulation
In a number of locations in the main paper, we mention the possibility of obtaining the NSP thresholds by simulation. We now clarify how this is done. For example, to obtain λ_α (see e.g. Theorem 2.1 of the main paper) by simulation, we would simulate multiple realisations of ‖Z‖_{I^d} and choose λ_α as the 100(1 − α)% empirical quantile of the sample.
We proceed similarly in Section 2.4, in which the task is to approximate the distribution of min_β ‖Z − Xβ‖_{I^d}. It is important to note that this can easily be done for any distribution of Z (assumed known), not just Gaussian. (If there is uncertainty regarding the distribution of Z and there are a few plausible candidates, the corresponding threshold can be computed for each of them and the largest one chosen for use in the NSP algorithm.) This threshold, selected as the empirical quantile of min_β ‖Z − Xβ‖_{I^d}, is implemented in the R package nsp for the Gaussian case in Scenarios 1 and 2, and can be used upon setting thresh.type = "sim" in the nsp_poly routine.
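The following sketch illustrates the idea in the Gaussian case of Scenario 1. The helper multiscale_norm below is an illustrative stand-in for ‖Z‖_{I^d}: it scans scaled partial sums over a dyadic collection of intervals, with a shift grid chosen for this example rather than taken from the main paper's exact definition of I^d.

```r
# Sketch of threshold computation by simulation: simulate a multiscale norm of
# Gaussian noise many times and take the empirical (1 - alpha)-quantile.
multiscale_norm <- function(z, min_len = 1) {
  T_len <- length(z)
  s     <- c(0, cumsum(z))                          # partial sums of z
  out   <- 0
  len   <- min_len
  while (len <= T_len) {
    starts <- seq(0, T_len - len, by = max(1, len %/% 2))
    sums   <- s[starts + len + 1] - s[starts + 1]   # sums over [start+1, start+len]
    out    <- max(out, max(abs(sums)) / sqrt(len))  # scaled partial sums
    len    <- 2 * len
  }
  out
}
set.seed(1)
T_len <- 500; alpha <- 0.1
sims  <- replicate(1000, multiscale_norm(rnorm(T_len)))
lambda_alpha <- quantile(sims, 1 - alpha)
```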
One remaining question is whether it is possible to use the standard (non-self-normalised) NSP without knowledge of the distribution of the innovations Z. Here, the following simple practical procedure for determining the threshold via simulation may help; a code sketch follows the list.
1. Pre-estimate the time-varying signal Xβ via a localised moving-window fit; then pre-estimate the innovations Ẑ.
2. Re-sample the innovations to estimate the distribution of the multiscale deviation measure ‖Ẑ‖_{I^d}.
3. Use a suitable empirical quantile of this distribution as the NSP threshold.
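A minimal sketch of this three-step procedure is given below; it reuses the illustrative multiscale_norm helper from the previous sketch, and the moving-window width is an assumption made for this example.

```r
# Hedged sketch of the resampling scheme above.
resample_threshold <- function(y, window = 30, alpha = 0.1, B = 1000) {
  # Step 1: pre-estimate the signal by a moving-average fit, then the innovations
  f_hat <- stats::filter(y, rep(1 / window, window), sides = 2)
  z_hat <- as.numeric((y - f_hat)[!is.na(f_hat)])
  # Step 2: re-sample the innovations and recompute the multiscale measure
  sims <- replicate(B, multiscale_norm(sample(z_hat, replace = TRUE)))
  # Step 3: use an empirical quantile as the NSP threshold
  quantile(sims, 1 - alpha)
}
```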

I Detection consistency and lengths of NSP intervals - proofs and discussion
Proof of Theorem 4.1 (main paper). Assume initially that f_t has a single change-point η_1. Without loss of generality, assume f_{η_1} > f_{η_1+1}. Representation (24) implies the lower bound (25). On the set ‖Z‖_{I^a} ≤ λ_α, (25) is further bounded from below by the quantity in (26). As NSP looks for shortest intervals of detection, the NSP interval of significance around η_1 will definitely be no longer than 2d = |[η_1 − d + 1, η_1 + d]|. However, from (26), it is sufficient for detection to be triggered that d > 16λ_α² / |f_{η_1+1} − f_{η_1}|². This shows that the maximum length of an NSP interval of significance will not exceed 2d̄, where d̄ = 16λ_α² / |f_{η_1+1} − f_{η_1}|² + 1. We now turn our attention to the multiple change-point case. For each change-point η_j, define its corresponding d̄_j as in formula (13) of the main paper. Recall that we are on the set ‖Z‖_{I^a} ≤ λ_α.
Note first that even though the NSP interval of significance around η_j is guaranteed to be of length at most 2d̄_j, it will not necessarily be a sub-interval of [η_j − d̄_j + 1, η_j + d̄_j] (as NSP simply looks for the shortest intervals of significance, and interval symmetry around the true change-point is not explicitly promoted). Therefore, in order that an interval detection around η_j does not interfere with detections around η_{j−1} and η_{j+1}, the distances η_j − η_{j−1} and η_{j+1} − η_j must be suitably large; but this is guaranteed by Assumption 4.1 from the main paper. This completes the proof.
As an aside, note in addition that in the Gaussian case Z_t ∼ N(0, 1), Theorem 2.2 of the main paper implies λ_α = O(log^{1/2} T); in fact, for α = 0.05 we have λ_α ≤ 1.33 √(2 log T) for T ≥ 100, and for α = 0.1 we have λ_α ≤ 1.25 √(2 log T) over the same range of T.
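For a concrete sense of the magnitudes involved, these bounds are easily evaluated numerically; the value T = 1000 below is chosen purely for illustration.

```r
# Numerical illustration of the bounds on lambda_alpha at T = 1000
T_len <- 1000
c(alpha_0.05 = 1.33 * sqrt(2 * log(T_len)),  # approximately 4.94
  alpha_0.10 = 1.25 * sqrt(2 * log(T_len)))  # approximately 4.65
```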
Proof of Corollary 4.1 (main paper). From Lemma 1 in Yao (1988), we have the required bound on λ_α as T → ∞. This, combined with the statement of Theorem 4.1 in the main paper, proves the result.
Proof of Theorem 4.2 (main paper). Assume initially that f_t has a single change-point η_1. In the same way in which the NSP procedure is "blind" to constant shifts in the data in Scenario 1, it is also invariant to the addition of linear trends in the piecewise-linear Scenario 2. Assume, therefore, that we have added a linear trend to Y_t in such a way that the true signal is symmetric around the true change-point η_1. The case that will lead to the longest interval is one in which the change-point leads to a trapezoid shape of the true signal (as in, for example, 1, 2, 3, 3, 2, 1) rather than one with a single peak or trough (e.g. 1, 2, 3, 2, 1). Therefore, we assume the former case as the "worst case" (whether or not this is assumed will only lead to O(1) differences in the lengths of the NSP intervals, so it is irrelevant from the point of view of rates). Note that for such a trapezoid signal, the location of η_1 is unambiguous (in the cartoon example above, it must be at the first 3). Consider now the quantity in (27), where the minimum is taken with respect to all linear fits on [η_1 − d + 1, η_1 + d]. Consider a single scale τ. Observing that taking moving partial sums does not change the linearity of f, and continuing from (27), we obtain a lower bound reduced by the term ‖Z‖_{I^a}.
Observe now that, since f_t is symmetric around η_1, the minimising linear fit f̃ must be constant.
In the multiple change-point case, the argument about the relevance of Assumption 4.1 from the proof of Theorem 4.1 (main paper) still applies here, and this completes the proof of the theorem.
Proof of Corollary 4.2 (main paper). The argument is identical to the proof of Corollary 4.1 from the main paper.

J NSP with autocorrelated innovations
Scenario 4 permits the use of NSP in settings in which autocorrelation is present, but this is done through the use of the lagged response as an additional covariate, rather than through allowing the innovations Z_t to be autocorrelated. We now briefly explore the case in which the Z_t's themselves are serially correlated. This presents an alternative to the discussion of Section 2.3 of the main paper.
Suppose that Z_t can be modelled as an autoregressive process as follows:

a(L) Z_t = (1 − a_1 L − . . . − a_r L^r) Z_t = U_t,

where U_t is an independent (not necessarily identically distributed) noise sequence with a distribution acceptable to NSP in Scenarios 1, 2 or 3, and L is the lag operator. We propose the following iterative scheme, which builds on the NSP procedure for independent innovations. We use the (most general) language of Scenario 3.
Applying the filter a(L) to both sides of the model yields a transformed regression problem of the form (30). Due to the smoothing action of a(L), this now only approximates a piecewise-constant parameter regression setting, as it features short "smooth transition" sections indexed t = η_j + 1, . . . , η_j + r. However, the presence of these smooth transitions does not spoil the applicability of NSP, with the intervals of significance obtained on the regression problem (30) having a similar interpretation as in the case of exactly abrupt transitions.
In practice, r or (a_1, . . . , a_r) will be unknown to the analyst. We suggest the following scheme, in which these are treated as nuisance parameters and estimated from the data (an illustrative code sketch follows the list).

1. Pre-estimate r and (a_1, . . . , a_r) from the data.
2. Transform the regression problem using the estimated operator â(L) to obtain a problem of the form (30).
3. Run NSP suitable for independent innovations on the transformed problem, to obtain a set S of the NSP intervals of significance.
4. Re-estimate r and (a 1 , . . . , a r ) on the longest stretch of data outside the NSP intervals of significance.
5. Go back to step 2 and iterate until no changes are seen in the NSP intervals of significance.
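The sketch below illustrates the iteration for Scenario 1 with a known order r. The argument run_nsp is a placeholder standing in for an NSP routine that returns a two-column matrix of intervals of significance; it is not the nsp package API, and the Yule-Walker fit used for the AR coefficients is likewise an assumption of this example.

```r
# Hedged sketch of the iterative pre-whitening scheme above (Scenario 1, known r).
longest_gap <- function(T_len, intervals) {
  # longest stretch of time points not covered by any interval of significance;
  # 'intervals' is a two-column matrix of (start, end) pairs
  covered <- rep(FALSE, T_len)
  for (k in seq_len(NROW(intervals)))
    covered[intervals[k, 1]:intervals[k, 2]] <- TRUE
  if (all(covered)) return(seq_len(T_len))                # degenerate fallback
  runs <- rle(!covered); ends <- cumsum(runs$lengths)
  i <- which(runs$values)[which.max(runs$lengths[runs$values])]
  (ends[i] - runs$lengths[i] + 1):ends[i]
}

prewhiten_nsp <- function(y, r = 1, max_iter = 10, run_nsp) {
  a_hat <- stats::ar(y, order.max = r, aic = FALSE)$ar    # step 1 (crude initial fit)
  intervals_old <- NULL
  for (iter in seq_len(max_iter)) {
    # step 2: transform y_t -> y_t - a_1 y_{t-1} - ... - a_r y_{t-r}
    y_tr <- stats::filter(y, c(1, -a_hat), method = "convolution", sides = 1)
    intervals <- run_nsp(as.numeric(y_tr[!is.na(y_tr)]))  # step 3
    if (identical(intervals, intervals_old)) break        # step 5: stop if unchanged
    intervals_old <- intervals
    idx   <- longest_gap(length(y), intervals)            # step 4: re-estimate
    a_hat <- stats::ar(y[idx], order.max = r, aic = FALSE)$ar
  }
  intervals
}
```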

K Additional arguments regarding the real-data analysis
In this section, we show that the application of NSP to the real-data examples of Section 6 of the main paper is justified, as the errors do not exhibit significant serial correlation in the interest rate case or conditional heteroscedasticity in the price series case. Figure 8 demonstrates this for the interest rate data (note that NSP was used on the scaled data shown in Figure 8, where the scaling had been performed to remove heteroscedasticity). Figure 9 shows this for the Newham house price data example (the presence of significant autocorrelation in the squared empirical residuals would have been indicative of conditional heteroscedasticity).
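For readers wishing to reproduce this style of diagnostic, a minimal sketch follows; resid is a stand-in for the empirical residuals of the fitted model.

```r
# Illustrative residual diagnostics in the spirit of Figures 8 and 9.
resid <- rnorm(200)  # stand-in for the empirical residuals (assumed available)
acf(resid)           # check for serial correlation in the residuals
acf(resid^2)         # significant autocorrelation here would suggest
                     # conditional heteroscedasticity
```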

L Discussion
We conclude with a brief discussion of a few speculative aspects of NSP.
Possible use of NSP in online monitoring for changes

NSP can in principle be used in the online setting, in which an 'alarm' should be raised as soon as Y starts deviating from linearity with respect to X. In particular, consider the following simple construction: having observed (Y_t, X_t), t = 1, . . . , T, successively run NSP on the intervals [T − 1, T], [T − 2, T], . . ., until either the first interval of significance is discovered, or [1, T] is reached.
This will provide an answer to the question of whether the most recently observed data deviates from linearity and if so, over what time interval.
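A minimal sketch of this construction follows; run_nsp is again a placeholder for an NSP call returning intervals of significance (possibly none), not the nsp package API.

```r
# Illustrative online check: grow the window [s, T] backwards from the most
# recent observation; stop at the first window containing a significant interval.
most_recent_deviation <- function(y, run_nsp) {
  T_len <- length(y)
  for (s in (T_len - 1):1) {
    intervals <- run_nsp(y[s:T_len])
    if (NROW(intervals) > 0)
      return(list(window_start = s, intervals = intervals))
  }
  NULL  # no deviation from linearity detected on [1, T]
}
```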
Using and interpreting NSP in the presence of gradual change

If NSP is used in the absence of change-points but in the presence of gradual change, obtaining a significant interval means that it must (at global significance level α) contain some of the period of gradual change. However, this does not necessarily mean that the entire period of gradual change is contained within the given interval of significance. Note that this is the situation portrayed in Section 5.2 of the main paper, in which the simulation model used is a 'gradual change' model from the point of view of the NSP_0 method, but an 'abrupt change' model from the point of view of NSP_1 and NSP_2.
Possible use of NSP in testing for time series stationarity

It is tempting to ask whether NSP can serve as a tool in the problem of testing for second-order stationarity of a time series. In this problem, the response Y_t would be the time series in question, while the covariates X_t would be the Fourier basis. The performance of NSP in this setting will be reported in future work.
Does the principle of NSP extend to other settings?

NSP is an instance of a statistical procedure which produces intervals of significance (rather than point estimators) as its output. It is an interesting open question to what extent this emphasis on "intervals of significance before point estimators" may extend to other settings, e.g. the problem of parameter inference in high-dimensional regression.