Local bandwidth selection via second derivative segmentation

Abstract: This paper studies the problem of local bandwidth selection for local linear regression. It is known that the optimal local bandwidth for estimating the unknown curve f at a design point x depends on the curve's second derivative f''(x) at x. Therefore one could select the local bandwidth h(x) at x by estimating f''(x). However, as estimating f''(x) is typically a much harder task than estimating f(x) itself, this approach for choosing h(x) tends to produce less accurate results. This paper proposes a method for choosing h(x) that bypasses the estimation of f''(x), yet at the same time exploits the useful fact that the optimal local bandwidth depends on f''(x). The main idea is to first partition the domain of f(x) into different segments on each of which the second derivative is approximately constant. The number and the lengths of the segments are assumed unknown and are estimated. Then, after such a partition is obtained, any reliable, well-studied global bandwidth selection method can be applied to choose the bandwidth for each segment. The empirical performance of the proposed local bandwidth selection method is evaluated by numerical experiments.


Introduction
Local linear regression is a popular method for nonparametric curve estimation. An important aspect of its implementation is the choice of the amount of smoothing; i.e., the selection of the so-called bandwidth. If the target curve does not possess too much spatial variation in its structure, then it is well known that it can be estimated well using one single (global) bandwidth throughout its whole domain. However, if the curve exhibits a large amount of spatial inhomogeneity, then local bandwidth smoothing, sometimes also known as variable bandwidth smoothing, should be used. That is, different bandwidths are allowed at different locations. This gives rise to the so-called bandwidth function h(x): the optimal local bandwidth for estimating the regression function at location x is a function of x. The goal of this paper is to propose a method for choosing this bandwidth function h(x).
In the literature different approaches have been proposed for choosing h(x). The so-called plug-in approach relies on the asymptotic expression for the optimal bandwidth function. In this approach h(x) is obtained by replacing the unknowns in this asymptotic expression with their estimates; e.g., see Fan and Gijbels (1992) and Gijbels and Mammen (1998). Another popular approach, sometimes known as the risk estimation approach, is to first construct an estimator of the mean squared error between the true and estimated function, and then choose h(x) to minimize such an estimator. Examples include Fan and Gijbels (1995), Ruppert (1997) and Doksum, Peterson and Samarov (2000). Most recently Gluhovsky and Gluhovsky (2007) proposed a different approach, in which h(x) is modeled as a smoothing spline and is defined as the minimizer of a novel penalty criterion.
The proposed method of this paper is motivated by the fact that the asymptotic expression for the optimal bandwidth at x depends on the second derivative of the unknown curve at x. We shall use Figure 1 to help describe the main ideas of its major steps. A set of noisy observations together with the true but unknown curve are given in Figure 1(a). The noisy observations are then partitioned into different segments with the goal that the second derivative within each segment is approximately constant. The number of segments and the locations of the break points (i.e., the points at which adjacent segments meet) are automatically estimated by the minimum description length (MDL) principle (e.g., see Rissanen, 1989, 2007). Some asymptotic properties of this segmentation procedure will be provided below. See Figure 1(b) for the true second derivative and the corresponding segmentation. The next step is to calculate a single (global) bandwidth for each segment. These bandwidths are then joined together to form a piecewise constant function h(x); see Figure 1(c). Notice that this bandwidth function is smaller near the middle of the curve, indicating that comparatively smaller bandwidths are required to recover the peak structure around x = 0.5. In order to preserve continuity, the partial local smoothing rule of Hall, Marron and Titterington (1995) is applied to this piecewise constant bandwidth function to obtain a final continuous bandwidth function, which is shown in Figure 1(d). Lastly this final bandwidth function is used to estimate the unknown curve. The resulting curve estimate is displayed in Figure 1(e). For comparative purposes, an estimate obtained by using a global bandwidth is shown in Figure 1(f). This global bandwidth was chosen by the AICc method of Hurvich, Simonoff and Tsai (1998).
Observe that this "single bandwidth estimate", although recovering the peak structure at x = 0.5 reasonably well, undersmooths the linear structures at both ends. The rest of this article is organized as follows. The proposed method is described in detail in Section 2, and some of its theoretical properties are provided in Section 3. Section 4 reports numerical simulation results and Section 5 applies the method to two real data sets, while concluding remarks are offered in Section 6. Lastly, technical details are deferred to the Appendix.

The proposed method
Suppose we observe data (x_i, y_i), i = 1, ..., n, generated from the model

y_i = f(x_i) + ε_i,

where f(x) is the unknown regression function of interest and the ε_i's are iid zero mean errors with variance σ². For the moment we assume that the design points x_i are uniformly distributed in [a, b]; nonuniform design densities will be discussed later. At any point x the local linear estimate f̂_{h(x)}(x) of f(x) is given by the fitted intercept β̂_0 of the locally weighted least squares problem

(β̂_0, β̂_1) = arg min_{β_0, β_1} Σ_{i=1}^n {y_i − β_0 − β_1(x_i − x)}² K_{h(x)}(x_i − x)

(e.g., see Fan and Gijbels, 1996, Ch. 2). In the above h(x) is the local bandwidth that controls the amount of smoothing at x, K(·) is the kernel function, and K_{h(x)}(x) = K{x/h(x)}/h(x). Note that we view a kernel as a symmetric probability density function, not necessarily of bounded support. If the goal is to minimize the expected local squared error E{f(x) − f̂_{h(x)}(x)}², then it is well known that the optimal choice of h(x) admits the following asymptotic expression (e.g., see Fan and Gijbels, 1996, Ch. 3):

h_opt(x) = [ σ²(b − a) ∫K²(t)dt / ( n {∫t²K(t)dt}² f''(x)² ) ]^{1/5}.    (1)

Observe that in this expression for h_opt(x), the only quantity that depends on x is the second derivative f''(x). Therefore one way to select h(x) is to first estimate f''(x) and then plug this estimate into (1). However, as the estimation of f''(x) is a much harder task than the estimation of f(x), this approach to choosing the local bandwidth tends to produce less satisfactory results. Our proposed method for choosing h(x) bypasses the estimation of f''(x), but at the same time utilizes the fact that h_opt(x) depends on x only through f''(x). The main idea is to first partition the domain of f(x) into different segments on each of which the second derivative is approximately constant. Then one can use any reliable, well-studied global bandwidth selection method to choose the bandwidth for each segment. In other words, the key is to estimate f''(x) with a best fitting piecewise constant function. We now return to the case when the density of the design points x_i is not uniform.
In this case the term (b − a) in the optimal bandwidth expression (1) will need to be replaced by the reciprocal of the density function at x, and an ideal segmentation of the regression function domain should take that into account. However, our numerical experience suggests that, unless the density function is highly skewed, the resulting segmentation using the uniform density assumption often leads to satisfactory empirical results. Results from simulation experiments to be reported below support this claim.
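To make the roles of the quantities in (1) concrete, here is a small numerical sketch of the plug-in map from f''(x) to h_opt(x). It assumes a Gaussian kernel, for which ∫K²(t)dt = 1/(2√π) and ∫t²K(t)dt = 1; the function name `h_opt` and the default arguments are ours, not the paper's.

```python
import numpy as np

def h_opt(fpp, sigma2, n, b_minus_a,
          RK=1.0 / (2.0 * np.sqrt(np.pi)),  # ∫ K²(t) dt for the Gaussian kernel
          mu2=1.0):                          # ∫ t² K(t) dt for the Gaussian kernel
    """Asymptotically optimal local bandwidth (1) for a uniform design on [a, b]."""
    return (sigma2 * b_minus_a * RK / (n * mu2 ** 2 * fpp ** 2)) ** 0.2

# Larger curvature |f''(x)| calls for a smaller bandwidth, and the
# bandwidth shrinks at the usual n^{-1/5} rate as the sample size grows.
```

Note that within one segment of approximately constant f''(x), this expression is approximately constant in x, which is exactly what motivates choosing a single global bandwidth per segment.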

Second differencing
Fitting a piecewise constant function to f''(x) would be a standard problem if we had direct noisy observations of f''(x); that is, if we could observe measurements

y*_i = f''(x_i) + e_i,    (2)

where the e_i's are iid zero mean errors. However, we do not observe such y*_i, and we instead suggest applying second differencing to the y_i to obtain "pseudo data" that play a similar role as the y*_i. In the sequel we write f_i = f(x_i) for all i. We first apply a differencing operator to the y_i and calculate (x'_i, y'_i) for i = 1, ..., n − 1 as

x'_i = (x_i + x_{i+1})/2,    y'_i = (y_{i+1} − y_i)/(x_{i+1} − x_i).

Now apply another differencing operation to the y'_i and obtain (x''_i, y''_i) for i = 1, ..., n − 2 as

x''_i = (x'_i + x'_{i+1})/2,    y''_i = (y'_{i+1} − y'_i)/(x'_{i+1} − x'_i).    (3)

Notice that y''_i may be viewed as a discrete but noisy approximation of f''(x''_i). To simplify notation, write z_i = y''_i, and let g_i and η_i denote the results of applying the same two differencing operations to the f_i and the ε_i respectively. By noting that g_i is in fact a discrete version of f''(x''_i), one can write (3) in the form of (2) as

z_i = g_i + η_i,    i = 1, ..., m = n − 2.

We shall treat (x''_i, z_i) as our "pseudo data" and fit a piecewise constant function to them. However, the noise terms η_i, although zero mean, are no longer independent. To derive the correlation structure of the η_i, first write d_i = x_{i+1} − x_i. Straightforward algebra then yields the conditional variances Var(η_i) and covariances Cov(η_i, η_{i+j}), j = 1, 2, with Cov(η_i, η_{i+j}) = 0 for j > 2. We denote the covariance matrix, of size m × m, specified by these quantities as σ²V. We note that these expressions are derived by conditioning on the x_i's; i.e., they are conditional variances and covariances.
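The two differencing passes are easy to implement; the sketch below uses midpoints for x'_i and x''_i (our convention). For equally spaced x_i with spacing d it reproduces the central second difference (y_i − 2y_{i+1} + y_{i+2})/d².

```python
import numpy as np

def second_difference(x, y):
    """Two divided-difference passes producing the pseudo data (x'', y'')."""
    xp = 0.5 * (x[:-1] + x[1:])          # x'_i: interval midpoints
    yp = np.diff(y) / np.diff(x)         # y'_i: first divided differences ≈ f'
    xpp = 0.5 * (xp[:-1] + xp[1:])       # x''_i
    ypp = np.diff(yp) / np.diff(xp)      # y''_i ≈ f''(x''_i)
    return xpp, ypp

# For the noiseless quadratic f(x) = x², whose second derivative is the
# constant 2, the pseudo responses recover f'' exactly.
x = np.linspace(0.0, 1.0, 21)
xpp, ypp = second_difference(x, x ** 2)
```

With noisy y the same operation yields z_i = g_i + η_i, where the η_i inherit the short-range (two-lag) dependence discussed above.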

Second derivative segmentation using minimum description length
The next task is to fit a piecewise constant function to the pseudo data (x''_i, z_i). To do so, we need to decide how many pieces are required and where the break points at which these pieces join are located. This is a model selection problem, in the sense that different candidate models (i.e., piecewise constant functions) may have different numbers of parameters. We will use the minimum description length (MDL) principle (e.g., see Rissanen, 1989, 2007) to solve this problem. The basic idea of the MDL principle can be explained as follows. Suppose a set of observed data w and a set of candidate models Θ = {θ_1, ..., θ_N} for w are given. The goal is to select a "best" model for w from Θ, where different θ_i's may have different numbers of parameters; one typical example is subset selection in the multiple linear regression context. The MDL principle defines the "best" model as the one that permits the most economical representation (or compression) of the data w. That is, the best fitted model is the one that produces the shortest codelength for storing w.
One general method for calculating the codelength of w is to decompose w into two components: a fitted model θ̂ plus the corresponding residuals r̂. We shall use the notation CL(a) to denote the codelength of an arbitrary object a. With this we have

CL(w) = CL(θ̂) + CL(r̂ | θ̂),

where we have stressed that r̂ is "conditional" on θ̂. The MDL principle defines the best θ̂ as the one that gives the smallest CL(w).
For the piecewise constant function fitting problem considered here, w corresponds to z = (z_1, ..., z_m)^T, θ̂ corresponds to any fitted candidate piecewise constant function ĝ, and r̂ = z − ĝ. In other words, the MDL principle suggests that ĝ should be chosen as the one that minimizes CL(ĝ) + CL(r̂ | ĝ). Thus to apply MDL to the current segmentation problem, we need to derive computable expressions for CL(ĝ) and CL(r̂ | ĝ), which in turn requires the calculation of ĝ. Suppose that there are B + 1 segments in the candidate piecewise constant function (i.e., there are B break points), and that the number of x''_i's in the j-th segment is m_j (so that m_1 + ... + m_{B+1} = m). Let λ_1 < ... < λ_B be the locations of the B break points relative to the sample size (essentially λ_j = n_j/m, where n_j = m_1 + ... + m_j; see Section 3 for the formal definition), and write λ = (λ_1, ..., λ_B). Also, define the ij-th element X_ij of the "model matrix" X as

X_ij = 1 if x''_i belongs to the j-th segment, and X_ij = 0 otherwise,

where i = 1, ..., m and j = 1, ..., B + 1. Deleting repeated values, we next convert ĝ into ĥ = (ĝ_{n_1}, ..., ĝ_{n_{B+1}})^T. To determine the candidate piecewise constant function, maximum likelihood or generalized least squares can be applied, yielding

ĥ = (X^T V^{−1} X)^{−1} X^T V^{−1} z,    (5)

from which ĝ can be easily computed by reintroducing the corresponding number of repetitions m_j for each coordinate ĥ_j. Using this, it is shown in Appendix A that CL(z) can be approximated by the criterion MDL(B, λ) given in (6). Notice that, for any given z, any candidate piecewise constant function is completely specified by (B, λ) when ĝ is computed with (5); this fact is reflected in the notation MDL(B, λ). We propose selecting the best fitting piecewise constant function as the minimizer of (6). Some theoretical properties of MDL(B, λ) are established in Section 3 below. We also note that the criterion MDL(B, λ) can be straightforwardly modified to handle the situation where the noise variance varies with the segments. In this case the second last term of (6) is replaced by 0.5 log(m_j + 1) and the last term is replaced by a sum of such terms over the segments. The theoretical results in Section 3 can be slightly modified to accommodate this new criterion.
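The generalized least squares fit (5) can be sketched directly. The helper below builds the indicator model matrix X from a list of break points, forms the MA(2)-type covariance V for an equally spaced design (entries 6, −4, 1 up to the common σ²/d⁴ factor, which cancels in (5)), and solves for ĥ; all function names are ours, and the explicit inverse is for illustration on small m only.

```python
import numpy as np

def ma2_cov(m):
    """Covariance of the second-differenced noise for equally spaced x_i,
    up to a common constant factor (which cancels in the GLS fit)."""
    V = 6.0 * np.eye(m)
    V += -4.0 * (np.eye(m, k=1) + np.eye(m, k=-1))
    V += 1.0 * (np.eye(m, k=2) + np.eye(m, k=-2))
    return V

def gls_piecewise_constant(z, breaks):
    """Fit a piecewise constant mean to z by generalized least squares (5)."""
    m = len(z)
    segments = np.split(np.arange(m), breaks)   # B + 1 segments
    X = np.zeros((m, len(segments)))            # indicator model matrix
    for j, idx in enumerate(segments):
        X[idx, j] = 1.0
    Vinv = np.linalg.inv(ma2_cov(m))            # fine for small m
    h = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ z)
    return h, X @ h                             # (ĥ, ĝ)

# Noiseless check: a signal with two constant levels is recovered exactly.
z = np.concatenate([np.full(10, 1.5), np.full(10, -0.5)])
h_hat, g_hat = gls_piecewise_constant(z, breaks=[10])
```

Because the model matrix spans any signal that is constant on the given segments, the fit is exact in the noiseless case regardless of V.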

Practical minimization of MDL(B, λ)
This subsection describes a practical algorithm for minimizing (6). The idea is similar to performing forward selection followed by backward elimination in the multiple linear regression setting.
At the beginning of the algorithm, we fit only one segment to (x''_i, z_i); i.e., there are no break points. We then add one break point to this initial fit, choosing its location to give the largest reduction of MDL(B, λ) among all possible break point locations. A second break point is then added to the resulting two-piece constant function, again at the location that maximizes the reduction of MDL(B, λ). This forward selection process continues until adding any further break point would increase the value of MDL(B, λ).
The second and last stage of the algorithm is backward elimination. The idea is to successively remove, one at a time, break points introduced during the forward selection process. At each step the break point to be removed is the one whose removal yields the largest reduction of MDL(B, λ). This elimination process continues until no further removal of a break point reduces MDL(B, λ).
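The forward/backward search can be sketched generically for any criterion function. Below we use a plain penalized residual sum of squares as a toy stand-in for MDL(B, λ) (the real criterion and the GLS fit are not reproduced here); the minimum-segment-size argument mirrors the five-observation constraint mentioned below.

```python
import numpy as np

def fit_cost(z, breaks):
    """Toy criterion: within-segment RSS plus a penalty per break point
    (a stand-in for MDL(B, lambda))."""
    rss = sum(np.sum((seg - seg.mean()) ** 2) for seg in np.split(z, breaks))
    return rss + 1.0 * len(breaks)

def greedy_breaks(z, crit, min_seg=5):
    """Forward addition of break points followed by backward elimination."""
    m, breaks = len(z), []

    def valid(cand):  # every segment must contain at least min_seg points
        edges = [0] + cand + [m]
        return all(b - a >= min_seg for a, b in zip(edges, edges[1:]))

    improved = True
    while improved:                                   # forward selection
        improved = False
        best = crit(z, breaks)
        for b in range(1, m):
            cand = sorted(breaks + [b])
            if b not in breaks and valid(cand) and crit(z, cand) < best:
                best, breaks, improved = crit(z, cand), cand, True
    improved = True
    while improved and breaks:                        # backward elimination
        improved = False
        best = crit(z, breaks)
        for b in list(breaks):
            cand = [c for c in breaks if c != b]
            if crit(z, cand) < best:
                best, breaks, improved = crit(z, cand), cand, True
    return breaks

z = np.concatenate([np.zeros(10), np.full(10, 5.0)])  # one true jump at 10
found = greedy_breaks(z, fit_cost)
```

On this toy signal the forward stage may pick up an intermediate break first, and the backward stage then prunes it, leaving only the true jump.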
The algorithm is akin to the knot addition and deletion idea of the highly successful smoothing method MARS (Friedman, 1991). In the context of regression spline fitting, MARS is known to perform empirically better than other knot addition/deletion strategies (Lee, 2002). It also worked exceptionally well in all our numerical work.
If the number of observations in any segment is too small, it may lead to unreliable estimates. Therefore we have imposed the constraint that each segment contains at least 5 observations.
We close this section with a remark outlining how a candidate segmentation 0 = λ_0 < λ_1 < ... < λ_B < λ_{B+1} = 1 can greatly facilitate numerical computations. First, use the candidate segmentation to decompose the m × m matrix V into B + 1 block square submatrices V_j of dimension m_j × m_j, where m_j = ⌊λ_j m⌋ − ⌊λ_{j−1} m⌋ and m_1 + ... + m_{B+1} = m. This has the effect that the dependence between the different pieces of the segmentation is suppressed and we can work with independent blocks for the asymptotics. Since the MA(2) errors in the pseudo-data model z_i = g_i + η_i are independent whenever they are more than two lags apart, the block creation does not affect the large sample properties. As a consequence, one can simplify calculations involving the generalized least squares estimator ĥ = (ĥ_1, ..., ĥ_{B+1})^T. Each of its components is now of the form

ĥ_j = e_j^T V_j^{−1} z(λ_{j−1}, λ_j) / (e_j^T V_j^{−1} e_j),

where e_j = (1, ..., 1)^T is the m_j-dimensional vector whose entries are all equal to one and z(λ_{j−1}, λ_j) = (z_{⌊λ_{j−1}m⌋+1}, ..., z_{⌊λ_j m⌋})^T. In Lemmas B.1 and B.2 below we show that both e_j^T V_j^{−1} e_j and e_j^T V_j^{−1} z(λ_{j−1}, λ_j) can be represented as certain fifth-order polynomials, so that the (ill-conditioned) inverse matrix V^{−1} never needs to be computed explicitly.
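The closed-form polynomial expressions of Lemmas B.1 and B.2 are not reproduced here, but a numerically equivalent route to each block-wise component ĥ_j is to solve one linear system per block instead of forming the ill-conditioned inverse; the helper name is ours.

```python
import numpy as np

def segment_gls_mean(Vj, zj):
    """Compute e'V_j^{-1}z / e'V_j^{-1}e without forming V_j^{-1}."""
    e = np.ones(len(zj))
    a = np.linalg.solve(Vj, e)     # a = V_j^{-1} e
    return (a @ zj) / (a @ e)      # valid since V_j is symmetric

# Sanity check on a small MA(2)-type block: a constant signal is
# reproduced exactly, whatever the symmetric positive definite V_j.
m = 12
Vj = (6.0 * np.eye(m)
      - 4.0 * (np.eye(m, k=1) + np.eye(m, k=-1))
      + (np.eye(m, k=2) + np.eye(m, k=-2)))
h_hat = segment_gls_mean(Vj, np.full(m, 3.25))
```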

Partial local smoothing
After a segmentation is obtained, the next task is to choose a (global) bandwidth for each segment. This can be achieved by applying any reliable global bandwidth selection method; in the numerical work reported in Section 4 below, we use the AICc method of Hurvich, Simonoff and Tsai (1998). Once a (global) bandwidth is obtained for each segment, these bandwidths are joined together to form a piecewise constant bandwidth function h_0(x). When the bandwidth function h_0(x) is piecewise constant, it is customary to smooth the "corners" at which adjacent pieces meet (e.g., see Fan and Gijbels, 1995), so that the resulting h(x) is continuous. We follow this custom and apply the partial local smoothing rule of Hall, Marron and Titterington (1995) to make h_0(x) continuous. This rule employs the following interpolation formula. Let τ_j and τ_{j+1} be the midpoints of the j-th and (j+1)-th pieces of the piecewise constant function h_0(x) respectively, so that h_0(τ_j) is the (global) bandwidth obtained for the j-th segment, and similarly for h_0(τ_{j+1}). For any x ∈ [τ_j, τ_{j+1}), the partial local interpolation rule (7) then defines the final bandwidth function h_1(x) by interpolating between h_0(τ_j) and h_0(τ_{j+1}). Supportive theoretical and empirical results for this partial local smoothing rule can be found in Hall, Marron and Titterington (1995).
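As a simplified stand-in for the interpolation formula (7), which we do not reproduce here, plain linear interpolation between the segment midpoints already yields a continuous bandwidth function with the right qualitative behavior; the helper name and example numbers are ours.

```python
import numpy as np

def continuous_bandwidth(x, seg_edges, seg_bandwidths):
    """Join per-segment global bandwidths h_0 into a continuous h_1(x) by
    linearly interpolating between the segment midpoints tau_j.
    (A simplified stand-in for the partial local smoothing rule (7).)"""
    seg_edges = np.asarray(seg_edges, dtype=float)
    tau = 0.5 * (seg_edges[:-1] + seg_edges[1:])     # segment midpoints
    return np.interp(x, tau, seg_bandwidths)         # constant beyond end midpoints

edges = np.array([0.0, 0.4, 0.6, 1.0])   # three segments
h0 = np.array([0.20, 0.05, 0.20])        # smaller bandwidth around a peak
x = np.linspace(0.0, 1.0, 101)
h1 = continuous_bandwidth(x, edges, h0)
```

At each midpoint h_1 agrees with the corresponding segment bandwidth, and the interpolated values never leave the range of the segment bandwidths.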

Summary
The main steps of the proposed method can be summarized as follows.
1. Apply the second differencing operation (3) to obtain (x''_i, z_i).
2. Find the "best" fitting piecewise constant function for (x''_i, z_i). This "best" fitting function is defined as the minimizer of (6), and it can be practically computed using the algorithm described in Section 2.4.
3. From the "best" fitting piecewise constant function obtained in the previous step, obtain a segmentation of (x_i, y_i). For each segment in this segmentation, apply a global bandwidth selector to choose a bandwidth, and merge the resulting global bandwidths into a piecewise constant bandwidth function h_0(x). In our implementation the AICc method of Hurvich, Simonoff and Tsai (1998) is adopted as the global bandwidth selector.
4. Apply the partial local smoothing rule (7) to h_0(x) to form a continuous bandwidth function h_1(x).
5. Compute the estimate f̂(x) of f(x) by local linear regression with bandwidth h_1(x).
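The final step above is a standard local linear fit evaluated with a spatially varying bandwidth. A minimal Gaussian-kernel sketch, with all function names our own (any kernel with the properties assumed in Section 2 would serve):

```python
import numpy as np

def local_linear(x0, h, x, y):
    """Local linear estimate of f(x0) with bandwidth h and Gaussian kernel."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2) / h      # K_h(x_i - x0)
    X = np.column_stack([np.ones_like(x), x - x0])  # local intercept + slope
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta[0]                                  # fitted intercept = f̂(x0)

def fit_curve(x_eval, h_fun, x, y):
    """Evaluate the estimate on a grid, using bandwidth h_fun(x0) at x0."""
    return np.array([local_linear(x0, h_fun(x0), x, y) for x0 in x_eval])

# Local linear regression reproduces straight lines exactly, so on noiseless
# linear data the estimate matches the truth at every evaluation point.
x = np.linspace(0.0, 1.0, 60)
y = 2.0 * x + 1.0
est = fit_curve(np.array([0.25, 0.5, 0.75]), lambda t: 0.1, x, y)
```

In the full procedure the constant `lambda t: 0.1` would be replaced by the continuous bandwidth function h_1 produced in step 4.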

Theoretical properties
In this section, we study the asymptotic behavior of the proposed second differencing segmentation procedure. To do so, we have to further specify the form of the regression function f. For our purposes, we henceforth restrict the discussion of theoretical properties to regression functions f_0 that are once continuously differentiable with a piecewise constant second derivative f_0''. This is formalized as follows. Without loss of generality, let [a, b] = [0, 1]. Set λ_0^0 = 0 and λ_{B_0+1}^0 = 1. Then, we assume that f_0''(x) = f_{0,j}'' is constant for x ∈ (λ_{j−1}^0, λ_j^0), j = 1, ..., B_0 + 1, where 0 < λ_1^0 < ... < λ_{B_0}^0 < 1 denote the B_0 break points. The second differencing procedure aims to partition f_0'' via noisy versions of the discrete approximations g_i^0, for which we then obtain

g_i^0 = f_{0,j}''    for    n_{j−1}^0 < i ≤ n_j^0.    (8)

The connection between λ_j^0 and n_j^0 is given by n_j^0 = ⌊λ_j^0 m⌋, with ⌊·⌋ denoting integer part and m = n − 2 as before. The number of g_i^0 in segment j is therefore equal to m_j^0 = n_j^0 − n_{j−1}^0. Certain edge effects in (8) have been left out; these occur when the second differencing procedure transitions from one segment into the next. As the number of these occurrences is clearly not larger than B_0, they do not affect the asymptotics.
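The function class assumed here (C¹ with piecewise constant f_0'') consists of C¹ piecewise quadratics. A quick numerical illustration, with one assumed break at x = 0.5 and curvature values ±4 chosen arbitrarily: integrating a piecewise constant second derivative twice produces such an f_0, and central second differences recover the curvature exactly within each segment.

```python
import numpy as np

# Piecewise constant second derivative: +4 on [0, 0.5), -4 on [0.5, 1].
x = np.linspace(0.0, 1.0, 101)
d = x[1] - x[0]
fpp = np.where(x < 0.5, 4.0, -4.0)

# Integrate twice with the trapezoidal rule (exact for piecewise linear
# input), giving a once continuously differentiable piecewise quadratic f0.
fp = np.concatenate([[0.0], np.cumsum(0.5 * (fpp[:-1] + fpp[1:]) * d)])
f0 = np.concatenate([[0.0], np.cumsum(0.5 * (fp[:-1] + fp[1:]) * d)])

# Central second differences recover f0'' exactly away from the break,
# mirroring (8) up to the edge effects discussed in the text.
sd = (f0[2:] - 2.0 * f0[1:-1] + f0[:-2]) / d ** 2
```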
Since the true partition is unknown, the MDL procedure is utilized as described in Section 2 and we select the best piecewise constant approximation of f_0'', determined by the parameters (B, λ). For known B_0, the MDL criterion (6) yields the estimator

λ̂ = arg min_λ (2/m) MDL(B_0, λ).
The following consistency result can be proved.

The proof of Theorem 3.1 is provided in Section B of the Appendix.
Note that the application of the differencing operator introduces dependence. For equally spaced design points with d = d_i, {η_i} is a second order moving average process given by the difference equations

η_i = d^{−2}(ε_i − 2ε_{i+1} + ε_{i+2}).

The moving average polynomial θ(z) = 1 − 2z + z² = (1 − z)² has two unit roots and imposes a special structure on the matrix V defined in Section 2.2 (see Appendix B below). Matrices of similar kind have been used in the detection of trends in time series and are discussed in depth in Anderson (1971). It should also be noted that it is critical here that the unit roots are known in advance and do not have to be estimated from the data. In the latter case, which has been dealt with for example in Anderson and Takemura (1986), certain pile-up effects cause the maximum likelihood estimator of the moving average unit roots to select an invertible set of parameters with positive probability, even asymptotically. While the unit roots complicate matters for theoretical derivations, they also induce a superconsistent procedure under the piecewise constant second derivative assumption. That is, the rate of convergence is faster than the typical parametric rate of "root n"; see Lemma B.4 for the exact rate. The reason for this lies roughly in the fact that partial sums of the {η_i} telescope, namely

η_1 + ... + η_m = d^{−2}(ε_1 − ε_2 − ε_{m+1} + ε_{m+2})

consists of exactly four terms for any m. Since the second differencing procedure utilizes the generalized least squares estimator ĥ in (5), the exact proof deals with weighted versions of the above partial sums. We discuss details in Appendix B. These findings imply, and add theoretical justification for, the excellent finite sample performance of our procedure to be reported in Section 4 below.
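The telescoping of the partial sums is easy to verify numerically for the equally spaced case (taking d = 1 for simplicity): the sum of m second-differenced errors collapses to four boundary terms, so the partial sums do not grow with m.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500
eps = rng.normal(size=m + 2)                 # iid errors ε_1, ..., ε_{m+2}
eta = eps[:-2] - 2.0 * eps[1:-1] + eps[2:]   # η_i = ε_i − 2ε_{i+1} + ε_{i+2}

# The partial sum telescopes to exactly four terms, whatever m is:
lhs = eta.sum()
rhs = eps[0] - eps[1] - eps[-2] + eps[-1]
```

Since the sum of m dependent noise terms stays bounded in probability, segment averages of the η_i vanish at a faster-than-root-m rate, which is the heuristic behind the superconsistency.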
In proving Theorem 3.1, we have assumed the number of break points, B 0 , to be known. There are, as of now, only a few estimation procedures known in the literature whose theoretical foundation covers the case of unknown B 0 . Two deal with independent random variables with common variance confounded by changes in the mean. Yao (1988) addresses the normal case and Horváth and Serbinowska (1995) the multinomial case. Recently Aue and Lee (2011) generalized the results of Yao (1988) to more complex image segmentation problems. While the theory behind the MDL-based second differencing procedure is difficult to establish, we conjecture that under a Gaussianity assumption one can retain Theorem 3.1 also for B 0 unknown. Since a formal proof of this conjecture would add unnecessary length to the paper with only marginal gains from a more practical point of view, we do not pursue this further. The simulations in this paper, however, indicate that the performance is very satisfactory also when B 0 is unknown and even when the true model is different from the one assumed in this section.

Simulation results
Two sets of numerical experiments were conducted. The first set of experiments was to evaluate the performance of the proposed method when the design points are regularly spaced, and to compare its estimation results with those obtained by the recent method proposed by Gluhovsky and Gluhovsky (2007). In the second set of experiments the proposed method is compared with other common bandwidth selection methods when the design density is non-uniform. For easy referencing, we shall call the proposed local bandwidth selection method SDS, short for Second Derivative Segmentation.

Regularly spaced data
Since we were unable to obtain the code for the method proposed in Gluhovsky and Gluhovsky (2007), we repeated their simulation experiments with identical settings and compared our numerical findings with those reported in their paper.
First, 100 sets of noisy observations were generated from the regression function (9) with n = 81 design points equally spaced in [−2, 2] and σ² = 0.5². This test function is the same as the one in Figure 1, except that its domain is linearly "stretched" from [0, 1] to [−2, 2]. For each of these noisy data sets, we applied the proposed method and the EBBS local bandwidth method of Ruppert (1997) to obtain estimates of f(x). Denote, for the I-th noisy data set, the corresponding estimates obtained by the proposed method and the EBBS method as f̂_I(x) and f̃_I(x) respectively. We calculated the mean squared error (MSE) of f̂_I(x) as

MSE(f̂_I) = (1/n) Σ_{i=1}^n {f̂_I(x_i) − f(x_i)}²,

and similarly for f̃_I(x). Following Gluhovsky and Gluhovsky (2007), we then calculated the MSE ratio

Σ_{I=1}^{100} MSE(f̂_I) / Σ_{I=1}^{100} MSE(f̃_I)

and the standard deviation of the MSE values of f̂_I(x) divided by the average of the MSE values of f̃_I(x). These two values are (0.73, 0.37), while the corresponding "best possible" pair from Table 1 of Gluhovsky and Gluhovsky (2007) is (0.74, 0.26). The reason for using the words "best possible" is as follows. The practical calculation of the local bandwidth estimate of Gluhovsky and Gluhovsky (2007) involves the choice of (i) a tuning parameter λ and (ii) a fitting method β̂^(i). However, no automatic selection procedures were provided by these authors for choosing λ and β̂^(i); instead, they reported results obtained from different combinations of λ's and β̂^(i)'s, and the pair quoted above is the best among these. Following Gluhovsky and Gluhovsky (2007), we repeated the above experiment with 15 other regression functions, namely the 15 normal mixture functions listed in Marron and Wand (1992). The number of design points is n = 181, while σ² remains the same. The resulting MSE ratios and their scaled standard errors, calculated as before, are listed in Table 1, together with the corresponding values for the proposal of Gluhovsky and Gluhovsky (2007) using fitting methods β̂^(4) and β̂^(5) with their best possible λ's. Judging from these numerical values, one can conclude that, for regularly spaced data, the proposed method SDS is to be preferred over the method of Gluhovsky and Gluhovsky (2007) and the EBBS method of Ruppert (1997).
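The comparison metrics are straightforward to compute; a small sketch with made-up estimate arrays (the helper names and toy numbers are ours and purely illustrative):

```python
import numpy as np

def mse(est, truth):
    """Mean squared error over the design points."""
    return np.mean((est - truth) ** 2)

def mse_ratio(estimates_a, estimates_b, truth):
    """Sum of MSEs of method A over sum of MSEs of method B;
    values below 1 favor method A."""
    a = sum(mse(e, truth) for e in estimates_a)
    b = sum(mse(e, truth) for e in estimates_b)
    return a / b

# Toy illustration with hypothetical 'estimates' of a zero truth signal.
truth = np.zeros(5)
method_a = [np.full(5, 0.1), np.full(5, -0.1)]   # small errors
method_b = [np.full(5, 0.2), np.full(5, -0.2)]   # twice the error
r = mse_ratio(method_a, method_b, truth)
```

Doubling the pointwise error quadruples each MSE, so the ratio here equals 0.25.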

Non-uniform design densities
Recall that the proposed second derivative segmentation procedure assumes that the design density is uniform. In this second set of experiments we tested its performance when the design density is actually non-uniform. Altogether six beta densities with different parameters were used as the design density: Beta[(s + 4)/5, (11 − s)/5] with s = 1, ..., 6. They are plotted in Figure 2. Two test regression functions were used. The first regression function is essentially the same as (9), but with its domain mapped from [−2, 2] to [0, 1]. The second regression function is displayed in Figure 3.
For each combination of design density and test function, 200 data sets were generated with n = 200 and a signal-to-noise ratio (snr) of 3, where snr is defined as snr = ‖f‖/σ with ‖·‖ the Euclidean norm. (In Table 2 below, the first rows are for the first test function, displayed in Figure 1, while the last 6 rows are for the second test function, displayed in Figure 3.) Then, for each generated data set, four regression estimates were obtained:
1. global: local linear regression using a global bandwidth selected by the AICc method of Hurvich, Simonoff and Tsai (1998);
2. plug-in: kernel regression with the local plug-in bandwidth strategy of Herrmann (1997);
3. EBBS: the local bandwidth EBBS method of Ruppert (1997); and
4. SDS: the proposed method.
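The snr definition fixes the noise level through σ = ‖f‖/snr; a small data-generating helper, with names our own and an arbitrary sine curve standing in for the test functions:

```python
import numpy as np

def noisy_sample(f_values, snr, rng):
    """Add iid Gaussian noise whose level is set by snr = ||f|| / sigma."""
    sigma = np.linalg.norm(f_values) / snr
    return f_values + rng.normal(0.0, sigma, size=f_values.shape), sigma

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 200)
f = np.sin(2.0 * np.pi * x)            # placeholder test curve
y, sigma = noisy_sample(f, snr=3.0, rng=rng)
```

Note that with this convention ‖f‖ grows with √n, so σ implicitly scales with the sample size as well.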
Finally, MSE values for all regression estimates were calculated. The averages of these MSE values, together with their standard errors, are reported in Table 2. From Table 2, one can see that, even for non-uniform design densities, SDS still performed favorably when compared with the other common methods.
We have also repeated the above experiments with n = 400 and snr = 5. Since these additional experiments provide similar empirical conclusions, their numerical results are omitted.

Real data
In this section the proposed procedure is applied to two real data sets. The first one is the motorcycle data set that has been analyzed by various previous authors (e.g., Fan and Gijbels, 1996). Here the design points x_i are the times at which the responses y_i were recorded after a simulated motorcycle impact experiment, and the responses are the head accelerations of the test object. The data are displayed in the left panel of Figure 4. The curve estimate obtained by the proposed method, together with the locations of the break points, is also displayed in the left panel of Figure 4. Displayed in the right panel of Figure 4 is the so-called LIDAR data set (e.g., Ruppert, Wand and Carroll, 2003). LIDAR is a laser-based technique for detecting chemical compounds in the atmosphere. The x-variable is the distance traveled by the laser light before it is bounced back to its origin, and the y-variable is the log of the ratio of laser light received from two different frequency sources. As with the motorcycle data set, a global constant bandwidth does not work well for this LIDAR data set. The proposed method was capable of first dividing it into three regions of approximately constant curvature and then selecting a tailored local bandwidth for each region; see the right panel of Figure 4.
Lastly, we point out that for both data sets the noise level is heteroscedastic. This violates the constant noise variance assumption made by the proposed procedure, yet the procedure still performed well.

Concluding remarks
In this article a method is proposed for choosing the bandwidth function for local linear smoothing. A major component of the method is the second derivative segmentation procedure, which aims to partition the curve domain into homogeneous regions so that a tailored bandwidth can be obtained for each region. Although this segmentation procedure is computationally expensive, it has been shown to be superconsistent when the underlying second derivative is piecewise constant. In addition, via theoretical results and numerical experiments, we have also demonstrated the superior empirical properties of the resulting local bandwidth selection method. We have further outlined how the procedure can handle heteroscedastic data. Lastly, the second derivative segmentation idea can be combined with other smoothing methods; for example, the local linear regression used in this article can be straightforwardly replaced by smoothing splines.
Section B.2. We assume throughout the proofs that the regression model y_i = f(x_i) + ε_i is canonical in order to contain the complexity of the arguments. It is expected that similar arguments also apply to the non-canonical regression case.
B.1. The banded Toeplitz matrix V. Let T be the complex unit circle and let b: T → C be the Laurent polynomial b(t) = 6 − 4(t + t^{−1}) + (t² + t^{−2}). The symbol b induces the banded Toeplitz operator T(b) that takes the values 6, −4 and 1 on the main diagonal, the first off-diagonals and the second off-diagonals, respectively. The reciprocal symbol 1/b is unbounded, as b has a zero of order four at t = 1, so that the smallest eigenvalues of the corresponding finite (m × m) Toeplitz matrices T_m(b) are of exact order m^{−4}. This, in turn, implies that the largest elements of T_m^{−1}(b) grow at rate m⁴; e.g., see Böttcher and Grudsky (2005) for details on Toeplitz matrices. It is now easy to see that the m × m variance-covariance matrix V of Section 2.2 can be rewritten in terms of T_m(b) simply as V = d^{−4} T_m(b). Most of the theory of banded Toeplitz matrices is based on symbols that are bounded away from zero and is therefore not applicable in the current setting.
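The m^{−4} order of the smallest eigenvalue of T_m(b) can be checked numerically: doubling m should shrink the smallest eigenvalue by roughly a factor of 2⁴ = 16 (the assertion below only requires the ratio to be of that rough magnitude, since finite-m boundary effects perturb the constant).

```python
import numpy as np

def toeplitz_b(m):
    """Finite section T_m(b) of the symbol b(t) = 6 - 4(t + 1/t) + (t^2 + 1/t^2)."""
    return (6.0 * np.eye(m)
            - 4.0 * (np.eye(m, k=1) + np.eye(m, k=-1))
            + (np.eye(m, k=2) + np.eye(m, k=-2)))

lam30 = np.linalg.eigvalsh(toeplitz_b(30))[0]   # smallest eigenvalue, m = 30
lam60 = np.linalg.eigvalsh(toeplitz_b(60))[0]   # smallest eigenvalue, m = 60
ratio = lam60 / lam30                           # ≈ 1/16 under m^{-4} decay
```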
We need the following two important auxiliary results. Note that they allow us to avoid computing the ill-conditioned inverse matrix V^{−1} = d⁴ T_m^{−1}(b) directly.
Lemma B.1. Let e = (1, ..., 1)^T be the m-dimensional vector whose elements are all equal to 1. Then e^T V^{−1} e admits an explicit representation as a fifth-order polynomial. To see that this representation is correct, it is most convenient to verify that d⁴ V a = e, where a denotes the vector defining the representation. Now, d^{−4} e^T V^{−1} e = a^T e = Σ_{j=1}^m a_j, and the statement of the lemma can be verified directly by elementary but lengthy calculations.
Proof. Similar to the proof of Lemma B.1.

B.2. Establishing Theorem 3.1
Recall that, since the value of B_0 is assumed known, the candidate segmentation is specified by the values 0 = λ_0 < λ_1 < ... < λ_{B_0} < λ_{B_0+1} = 1. Given such a candidate segmentation, we need to derive its large-sample behavior, in particular the bias induced relative to the true segmentation 0 = λ_0^0 < λ_1^0 < ... < λ_{B_0}^0 < λ_{B_0+1}^0 = 1. To do so, we utilize the candidate segmentation and decompose the m × m matrix V into B_0 + 1 block square submatrices V_j of dimension m_j × m_j, where m_j = ⌊λ_j m⌋ − ⌊λ_{j−1} m⌋ and m_1 + ... + m_{B_0+1} = m. This has the effect that the dependence between the different pieces of the segmentation is suppressed and we can work with independent blocks for the asymptotics. Since the MA(2) errors in the pseudo-data model z_i = g_i + η_i are independent whenever they are more than two lags apart, the block creation does not affect the large sample properties.