A general approach to the joint asymptotic analysis of statistics from sub-samples

In time series analysis, statistics based on collections of estimators computed from sub-samples play a crucial role in an increasing variety of important applications. Proving results about the joint asymptotic distribution of such statistics is challenging, since it typically involves a nontrivial verification of technical conditions and tedious case-by-case asymptotic analysis. In this paper, we provide a novel technique that makes it possible to circumvent those problems in a general setting. Our approach consists of two major steps: a probabilistic part, which is mainly concerned with weak convergence of sequential empirical processes, and an analytic part, providing general ways to extend this weak convergence to functionals of the sequential empirical process. Our theory provides a unified treatment of asymptotic distributions for a large class of statistics, including recently proposed self-normalized statistics and sub-sampling based p-values. In addition, we comment on the consistency of bootstrap procedures and obtain general results on compact differentiability of certain mappings that seem to be of independent interest.


Introduction and Motivation
In time series analysis, a large class of statistics can be expressed as smooth functions of estimators computed on consecutive portions (i.e., subsamples) of data. Since time series observations are naturally ordered by time, the use of such statistics has been a common theme in time series inference, and examples are abundant in areas such as sequential monitoring [Chu and White (1995), Aue and Reimherr (2009)], retrospective change point detection [Csörgö and Horváth (1997), Perron (2006)] and subsampling-based inference [Politis and Romano (1994), Politis et al. (1999)], among others. More recent examples include the self-normalized (SN, hereafter) statistics [Shao (2010a)], a new SN-based test statistic for change point detection [Shao and Zhang (2010)] and the p-value of the subsampling-based inference under the fixed-b asymptotics [Shao and Politis (2013)]. To obtain the asymptotic distributions of statistics of this kind, a traditional approach is to express the estimator as a sum of three parts: the parameter, an average of influence functions, and a remainder term. This is followed by assumptions that ensure asymptotic negligibility of the remainder terms and a routine analysis of the leading term, which is of linear form. For many statistics of practical interest, theoretical analysis based on this approach can be quite challenging and tedious. In particular, verifying the negligibility of remainder terms can be technically involved, since it requires a careful case-by-case study. The situation is further complicated by the fact that in time series settings, the underlying data are dependent. The aim of the present paper is to provide a general approach which allows one to obtain, without long and tedious arguments, the asymptotic distribution of statistics based upon infinite collections of subsample estimates.
In statistical applications, many important statistics can be expressed as smooth [more precisely: compactly differentiable] functionals of simple quantities such as the empirical distribution function. The analysis of the asymptotic properties of such statistics in the non-sequential setting can be elegantly performed in two distinct steps: an analytic part, which consists of establishing the smoothness of the functional, and a probabilistic part, which is concerned with the analysis of the underlying quantity. One of the many appealing features of such an approach lies in the fact that the analytic properties need to be established only once. Moreover, quantities such as the empirical distribution function are often rather well analyzed for a wide range of data types. This approach has been successfully applied to the analysis of quantiles [Doss and Gill (1992)], survival data [Gill and Johansen (1990)], copulas and scalar measures of dependence [Fermanian et al. (2004)], and to the setting of dependent data. A slightly more formal description of the situation above is as follows. Assume that we have a collection of estimators, say (x̂ n,κ ) κ∈K , of a quantity x. A classical example of such a collection is given by estimators computed from various fractions of the sample X 1 , ..., X n . For illustration purposes, assume that x is the distribution function, x̂ n,κ denotes the empirical distribution function computed from X 1 , ..., X ⌊nκ⌋+1 and K = [0, 1]. Also, assume that the parameter of interest, say θ, can be expressed as φ(x) where φ denotes some functional. For example, it is possible to express the copula as a functional of the cumulative distribution function. If the map φ is compactly differentiable, the asymptotic distribution of a suitably normalized version of φ(x̂ n,κ ) for fixed κ can be derived from a corresponding result for x̂ n,κ .
More precisely, denoting by α n a sequence diverging to infinity and by w(κ) a weight function, weak convergence of α n w(κ)(x̂ n,κ − x) in a suitable function space implies weak convergence of α n w(κ)(φ(x̂ n,κ ) − φ(x)) for a finite collection of fixed values of κ. However, in many important applications the joint weak convergence of the whole collection V n := (α n w(κ)(φ(x̂ n,κ ) − φ(x))) κ∈K in a suitable functional sense is required. For the purpose of illustration, consider the following simple example.
Example 1.1. For the sake of concreteness, assume that we observe data, say (Y i = T i ∧ C i , δ i = I{Y i = T i }) i=1,...,n , from a censored time series [here, T i denote survival times, C i censoring times and δ i censoring indicators] and want to test if there is a change in the location parameter of the marginal distribution F i of T i . A general way to quantify the location of censored observations, which is well-defined even under heavy censoring, is provided by the median. Typical test statistics for the null hypothesis of a constant median are based on comparing the medians of the Kaplan-Meier estimators which are computed from portions of the data. For simplicity, assume that the estimator m̂ κ with κ ∈ [0, 1] is based on the data (Y i , δ i ) i=1,...,⌊nκ⌋∨1 . A simple test statistic for the null hypothesis of a constant median is given by sup κ∈[0,1] w(κ)|m̂ κ − m̂ 1 | with w denoting a suitable weighting function. In order to derive the null distribution of our test statistic, we would typically establish a process convergence result for √n w(κ)(m̂ κ − m̂ 1 ), viewed as an element of the space D[0, 1], and apply the continuous mapping theorem. Classical results on compact differentiability [see Examples 2.2 and 2.3] imply that the median of the Kaplan-Meier estimator can be represented as a compactly differentiable functional of the two empirical (sub-)distribution functions Ĥ 0,⌊nκ⌋ (y) := ⌊nκ⌋ −1 Σ_{i=1}^{⌊nκ⌋} δ i I{Y i ≤ y} and F̂ Y,⌊nκ⌋ (y) := ⌊nκ⌋ −1 Σ_{i=1}^{⌊nκ⌋} I{Y i ≤ y}. If we want to apply the classical delta-method to derive the process asymptotics of √n w(κ)(m̂ κ − m̂ 1 ), we are faced with two problems: first, we need process convergence of suitably normalized versions of Ĥ 0,⌊nκ⌋ . Second, as we shall argue below, the classical delta method does not provide results on weak convergence of the quantity √n w(κ)(m̂ κ − m̂ 1 ) as a process indexed in κ.
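The test statistic of Example 1.1 can be sketched numerically as follows. This is a minimal sketch of our own: the hand-rolled Kaplan-Meier routines and all function names are ours, and w(κ) = κ is one simple choice of weight that down-weights small sub-samples.

```python
import numpy as np

def km_survival(y, d, t):
    """Kaplan-Meier estimate of S(t) = P(T > t) from censored pairs (y_i, delta_i)."""
    order = np.argsort(y)
    y, d = y[order], d[order]
    n, s = len(y), 1.0
    for i in range(n):
        if y[i] > t:
            break
        if d[i] == 1:                      # an observed event at y[i]
            s *= 1.0 - 1.0 / (n - i)       # n - i subjects still at risk
    return s

def km_median(y, d):
    """Smallest observed event time t with 1 - S(t) >= 1/2 (inf if undefined)."""
    for t in np.sort(y[d == 1]):
        if 1.0 - km_survival(y, d, t) >= 0.5:
            return float(t)
    return np.inf

def change_point_stat(y, d, min_frac=0.1, w=lambda k: k):
    """sup over kappa of w(kappa)|m_hat_kappa - m_hat_1| on the grid kappa = i/n;
    very small sub-samples are skipped for stability."""
    n = len(y)
    m1 = km_median(y, d)
    stat = 0.0
    for i in range(max(2, int(min_frac * n)), n + 1):
        mk = km_median(y[:i], d[:i])
        if np.isfinite(mk):                # KM median may be undefined under heavy censoring
            stat = max(stat, w(i / n) * abs(mk - m1))
    return stat
```

Under the null of a constant median, the process convergence of √n w(κ)(m̂ κ − m̂ 1 ) discussed above, combined with the continuous mapping theorem, would deliver the limiting law of √n times this statistic.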
Returning to a more general setting, we can say that the classical delta method and a large collection of results on the behavior of general empirical processes allow one to establish weak convergence results for a wide class of statistics as long as we consider a fixed, finite collection of values κ. Informally, we call this the 'non-sequential' case. However, the tools available to date do not allow the same conclusion when we are interested in collections of sub-samples, or, stated informally, in the 'sequential' case. The fundamental aim of the present article is thus to provide general ways of importing the tools mentioned above from the 'non-sequential' into the 'sequential' setting. For example, let us consider what we would need to apply a delta method in the 'sequential' case if only compact differentiability of the map φ in the 'non-sequential' case is available. Essentially, such an approach would require us to show compact differentiability of the map Φ that applies φ coordinate-wise to the collection (x̂ n,κ ) κ∈K , viewed as a map between suitable metric spaces, since V n is precisely a normalized version of Φ applied to this collection. Given that a large number of important maps φ are known to be compactly differentiable, we would like to make use of this information in the sequential setting. A natural question to ask thus is: given compact differentiability of φ, what can we say about compact differentiability of Φ? As we shall see in Section 2.1, such an implication does not hold in full generality, see in particular Example 2.5 and the discussion preceding it. At the same time, we obtain a positive result if we additionally assume that the map φ possesses certain boundedness properties. Additionally, even when compact differentiability of Φ fails, there still are many relevant settings where additional arguments can be applied to obtain the desired weak convergence of V n . In fact, in Section 2.1 we show that, given weak convergence of the underlying process Y n := (α n w(κ)(x̂ n,κ − x)) κ∈K , we can derive properties of V n in a very general setup.
Additionally, some general results on compact differentiability that seem to be of independent interest can be found in Section 2.3. Another fundamental question that needs to be taken care of before we can apply the functional delta method is the weak convergence of the process Y n . In fact, results on weak convergence of Y n in settings where the data X 1 , ..., X n are allowed to be dependent are limited. A summary of available results as well as new insights providing considerable extensions of those findings are collected in Section 2.2. Finally, in Section 3, we illustrate how the general results presented in Section 2 can be applied to obtain new insights regarding the properties of recently proposed methods including self-normalization and generalizations thereof [Section 3.1], fixed-b corrections for sub-sampling methods [Section 3.2], and SN-based testing procedures for change-points [Section 3.3]. Some comments on the applicability of our results to bootstrap methods are also provided.

General results
We begin by introducing some relevant notation. For arbitrary sets F 1 , ..., F J , K, define the vector space L ∞ (F 1 , ..., F J ; K) of all collections H = (H 1,t , ..., H J,t ) t∈K of bounded real-valued functions H j,t on F j . By the definition of L ∞ (F 1 , ..., F J ; K), we have sup t sup f |H j,t (f )| < ∞ for all j = 1, ..., J, so that the maps (t j , f j ) → H j,t j (f j ) are indeed bounded and thus elements of ℓ ∞ (K × F j ). In particular, if the product space ℓ ∞ (K × F 1 ) × ... × ℓ ∞ (K × F J ) is equipped with the maximum norm ‖(x 1 , ..., x J )‖ max := max j ‖x j ‖ ∞ induced by the supremum norms on its components, the identification given above is an isometry. Weak convergence in L ∞ (F 1 , ..., F J ; K) is henceforth understood as weak convergence in the Hoffmann-Jørgensen sense in the space L ∞ (F 1 , ..., F J ; K) [see Van der Vaart and Wellner (1996), Chapters 1.4 and 1.5 for more details].
Remark 2.1. In most situations, the sets F 1 , ..., F J can be viewed as subsets of R d . For example, the empirical distribution function (n −1 Σ_{i=1}^{n} I{X i ≤ y}) y∈R d of a sample of d-dimensional random variables X 1 , ..., X n is naturally indexed by the set R d . Another approach, which fits nicely into the empirical process setting and will play a central role in Section 2.2, is to consider classes of functions {x → f (x) | f ∈ F j }.
In this setting, the empirical process can be elegantly written as f ↦ n −1/2 Σ_{i=1}^{n} (f (X i ) − E[f (X i )]), f ∈ F; see Van der Vaart and Wellner (1996) for examples. For example, the empirical distribution function can also be viewed as an element of ℓ ∞ (F) with F denoting the collection of indicators of rectangles, that is, F = {x → I{x ≤ y} | y ∈ R d }. By identifying the function x → I{x ≤ y} with the point y ∈ R d we obtain a way to index F by R d and vice versa. In most of the following theoretical developments, the form of F j will be arbitrary unless explicitly specified otherwise.
As discussed previously, the asymptotic analysis of statistics based on the process V n can be performed by considering two distinct questions: the stochastic properties of Y n and the analytic properties of the map φ. Both questions will be addressed in this section in a general setting. In Section 2.1, we present our analytic considerations. An overview of existing results regarding the stochastic part, as well as their extension, will be considered in Section 2.2. Finally, some general results on compact differentiability that seem to be of independent interest are provided in Section 2.3.

Analytic considerations
This section is primarily concerned with the following questions: given a collection of estimators (ŷ n,s,t ) (s,t)∈K such that for fixed (s, t) ∈ K each ŷ n,s,t is an element of the domain of φ, and given weak convergence of the underlying process Y n [α n denotes some deterministic sequence diverging to infinity], what can we say about weak convergence of the process V n (s, t, g 1 , ..., g L ) := (t − s)α n (φ(ŷ n,s,t )(g 1 , ..., g L ) − φ(x)(g 1 , ..., g L )) as an element of L ∞ (G 1 , ..., G L ; K)? And what can we say about bootstrap validity for V n given a valid bootstrap procedure for Y n ?
For instance, consider the situation where we have a sample X 1 , ..., X n . Assume that the quantity x can be represented as ((E[f (X 1 )]) f ∈F 1 , ..., (E[f (X 1 )]) f ∈F J ) for some classes of functions F 1 , ..., F J , see the examples below. A prime example for the quantity ŷ n,s,t is given by the estimator computed from the sub-sample X ⌊ns⌋+1 , ..., X ⌊nt⌋ , that is
ŷ n,s,t := (((⌊nt⌋ − ⌊ns⌋) −1 Σ_{i=⌊ns⌋+1}^{⌊nt⌋} f (X i )) f ∈F j ) j=1,...,J , (1)
where the empty sum is defined as zero and we set '0/0 = 0' to take care of the case ⌊ns⌋ = ⌊nt⌋.
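In code, the sub-sample estimator in (1) is simply the average of f over the observations indexed by ⌊ns⌋+1, ..., ⌊nt⌋, with the '0/0 = 0' convention made explicit. The following minimal sketch (function name ours) illustrates this:

```python
import numpy as np
from math import floor

def y_hat(x, s, t, f):
    """Sub-sample estimator from (1): the average of f over X_{⌊ns⌋+1}, ..., X_{⌊nt⌋}.
    Empty sub-samples (⌊ns⌋ = ⌊nt⌋) return 0 by the convention '0/0 = 0'."""
    n = len(x)
    lo, hi = floor(n * s), floor(n * t)
    if hi <= lo:
        return 0.0
    return float(np.mean([f(v) for v in x[lo:hi]]))
```

With f an indicator u ↦ I{u ≤ y}, y_hat(x, s, t, f) is the empirical distribution function of the sub-sample evaluated at y.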
Regarding the smoothness of φ, we impose the following condition. (C) The map φ is compactly differentiable at x [tangentially to a suitable subspace], with derivative φ ′ x . In the 'classical' setting, compact differentiability is known to provide a good balance between the strength of the differentiability concept needed for establishing a general functional delta method and the number of statistically relevant functionals that can actually be shown to be compactly differentiable. See Van der Vaart and Wellner (1996), Chapter 3.9 for a more detailed discussion of this topic. Two particular examples are discussed below. Of course, there exists a vast collection of further examples [copulas, dependence measures, M- and L-estimators, to name just a few] that are equally important but not discussed here because of space considerations. For a more detailed list we refer the interested reader to Chapter 3.9 in Van der Vaart and Wellner (1996) and the recent paper by Gao and Zhao (2011).

Example 2.2. Empirical quantiles
Consider the class of functions F := {y → I{y ≤ t} | t ∈ R}. In this case, ŷ n,s,t is simply the empirical distribution function of the sub-sample X ⌊ns⌋+1 , ..., X ⌊nt⌋ . Consider the quantile map φ : F ↦ (F −1 (τ )) τ ∈S for some S ⊂ (0, 1), which now corresponds to G 1 . Applying this map to ŷ n,s,t yields collections of empirical quantiles of the sub-samples X ⌊ns⌋+1 , ..., X ⌊nt⌋ . Compact differentiability of the quantile map can be established under appropriate conditions, see Lemma 3.9.23 in Van der Vaart and Wellner (1996).
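Example 2.2 in code: applying the generalized-inverse quantile map to the sub-sample empirical distribution function. This is a self-contained sketch; the function name is ours.

```python
import numpy as np
from math import floor, ceil

def subsample_quantile(x, s, t, tau):
    """phi(y_hat_{n,s,t}) for the quantile map: the generalized inverse F^{-1}(tau)
    of the ECDF of X_{⌊ns⌋+1}, ..., X_{⌊nt⌋}, i.e. the smallest order statistic
    x_(k) of the sub-sample with k/m >= tau."""
    n = len(x)
    sub = np.sort(x[floor(n * s):floor(n * t)])
    m = len(sub)
    if m == 0:
        raise ValueError("empty sub-sample: quantile undefined")
    k = max(ceil(tau * m), 1)
    return float(sub[k - 1])
```

Evaluating this over a grid of pairs (s, t) yields exactly the collection of sub-sample quantiles whose joint asymptotics are studied in this section.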

Example 2.3. Kaplan-Meier estimator
Assume that we have right-censored observations of the form (Y i , δ i ) i=1,...,n . It is a well-known fact that the Kaplan-Meier estimator F̂ KM [Kaplan and Meier (1958)], viewed as a map into the set of distribution functions on [0, V ] for a suitable V < ∞, is a compactly differentiable functional of the two functions H 0 (y) := P(Y 1 ≤ y, δ 1 = 1) and H(y) := P(Y 1 ≤ y); see Chapter 3.9 in Van der Vaart and Wellner (1996). This suggests to consider the corresponding classes of indicator functions {(y, δ) → δ I{y ≤ t} | t ∈ [0, V ]} and {(y, δ) → I{y ≤ t} | t ∈ [0, V ]}. Combining this with the quantile mapping [see Example 2.2] easily allows one to consider quantiles of the Kaplan-Meier estimator.
Regarding the process Y n , we need the following assumption. (W) The process Y n converges weakly to a tight limit Y = (Y 1 , ..., Y J ) in L ∞ (F 1 , ..., F J ; K), where Y j , j = 1, ..., J are centered, Borel measurable processes.
Remark 2.4. A detailed discussion of condition (W) for estimators ŷ n,s,t of the form (1) is provided in the next section. However, there are interesting examples that go beyond the framework described above. For example, the classical empirical copula process C • n (s, u) [see Rüschendorf (1976)] is built from the vector F − n (u) = (F − n1 (u 1 ), ..., F − nd (u d )) of generalized inverses of the marginal empirical distribution functions F nj (y) = n −1 Σ i I{X ij ≤ y}, where C denotes the copula of the distribution of X. Note that F − n (u) depends on all the data regardless of the value of s. The process C • n (s, u) can be coerced into the general framework of this section by considering a suitable collection of estimators. Weak convergence of the process C • n under weak assumptions on the copula with possibly dependent data was established only recently.
The limit Y in assumption (W) needs to satisfy certain technical conditions that are not very restrictive as we shall demonstrate later.
(A2) Define the set U K ⊂ L ∞ (F 1 , ..., F J ; K) of elements with suitably regular [uniformly continuous] paths. Assume that the sample paths of Y are in U K with probability one.
Condition (A2) is non-restrictive in the sense that it is needed to apply the functional delta method to V n (s, t, ·) for each fixed (s, t). Assumption (A1) is needed for the application of the general compact differentiability result in Section 2.3. As we shall discuss in the next section [see Remark 2.13], assumption (A1) is typically satisfied in a wide variety of practically relevant settings. Assumptions (W), (A1), (A2) are already sufficient to derive weak convergence of V n if the set K satisfies inf (s,t)∈K |t − s| > 0. Without this condition, (W), (A1), (A2) are not sufficient, as Example 2.5 demonstrates.
The underlying problem in Example 2.5 is that, due to the weighting with t − s, weak convergence of α n (t − s)(ŷ n,s,t − x) is not informative about ŷ n,s,t for values of t − s that can be arbitrarily close to zero. Additional assumptions are needed to exclude the kind of behavior presented in that example. It turns out that for this purpose the following condition is sufficient. As we shall discuss later, there are many situations where it is easily satisfied.
(A3) For any sequence k n → 0 we have sup (s,t)∈K, |t−s|≤k n (t − s)‖φ(ŷ n,s,t )‖ = o * P (1), where the asterisk denotes outer probability.
Remark 2.6. Note that condition (A3) is automatically satisfied if sup (s,t)∈K ‖φ(ŷ n,s,t )‖ = O * P (1). This is trivially true for uniformly bounded maps φ, which includes many interesting examples such as copulas, dependence measures or the Kaplan-Meier estimator (which by definition is a distribution function). Moreover, for specific sets K, further conditions implying (A3) can be derived. See Remark 2.15 in Section 2.2 for further details. Now we are ready to state the first main result of this section.
Remark 2.8. Although assumption (A3) often holds, there are situations where verifying it can be very tedious or requires additional assumptions on the underlying data structure. For example, consider the setting where K = ∆ and φ denotes the map that takes a distribution function to its median. In that case, assumption (A3) would require that n −1 max i=1,...,n |X i | = o P (1), since the median of one observation is the observation itself. Effectively, this places moment assumptions on X that are not needed for the median from large samples to be well-behaved. A closer look at the proofs reveals that, for any γ ∈ (0, 1), a suitably modified version Ṽ n of the process V n [the modification affecting only sub-samples containing less than a fraction γ of the data] converges to the same limit V without assumption (A3) or the condition inf (s,t)∈K |t − s| ≥ a > 0. In the applications discussed in Section 3, the modification above essentially amounts to not using information from extremely small sub-samples. As the discussion above indicates, for certain sets K this can be viewed as a robustification.
Remark 2.9. A closer look at the proof of the above result shows that the special structure of ∆ does not play a crucial role. In fact, the same approach yields a more general result. Let (K, d K ) denote a general compact metric space and let w denote a bounded weight function on K. If the analogues of conditions (W) and (A1) hold and if the sample paths of Y are, with probability one, in the corresponding set from (A2), the weak convergence of (w(κ)α n (φ(ŷ n,κ ) − φ(x))) κ∈K follows as long as inf κ∈K |w(κ)| > 0. If additionally a modified version of condition (A3) holds, i.e. if for any k n → 0 we have sup κ∈K, |w(κ)|≤k n ‖w(κ)φ(ŷ n,κ )‖ = o * P (1), the weak convergence above holds without the assumption inf κ∈K |w(κ)| > 0.
Next, we discuss bootstrap procedures. In particular, consider a multiplier bootstrap version ŷ b n,s,t of the quantity ŷ n,s,t defined in (1), in which the summands are weighted by random variables M 1 , ..., M n that are independent of the original sample X 1 , ..., X n ; denote by Y b n the corresponding bootstrap version of the process G n . Under suitable assumptions on the data and the random variables M 1 , ..., M n , a conditional version of assumption (W) holds. Specifically, assume that (WB) Y b n converges weakly to Y conditionally on the data in probability. Here, weak convergence conditional on the data in probability (P M -convergence) is understood in the Hoffmann-Jørgensen sense as defined in Kosorok (2008), where BL 1 denotes the set of all Lipschitz-continuous functions f : L ∞ (F 1 , ..., F J ; K) → R that are uniformly bounded by 1 and have Lipschitz constants bounded by 1, and where the asterisks denote measurable majorants (and minorants, respectively) with respect to the joint data (X 1 , ..., X n , M 1 , ..., M n ). Also, note that the map (M 1 , ..., M n ) → Y b n is measurable conditionally on the original data X 1 , ..., X n outer almost surely [for fixed X 1 , ..., X n , this mapping is Lipschitz-continuous] and thus we do not need to consider measurable majorants. Settings where results of this kind hold are discussed in the next section.
The classical delta method for the bootstrap [see e.g. Theorem 12.1 in Kosorok (2008)] asserts that for a map φ that is compactly differentiable at x with derivative φ ′ x and additionally satisfies suitable measurability conditions, the corresponding conditional weak convergence holds for every fixed (s, t). The next theorem provides a generalization of this finding. More precisely, it states conditions under which Theorem 2.7 generalizes to conditional weak convergence. Theorem 2.10. With the notation above, assume that (WB), (A1), (A2) and (C) hold. Then the conditional weak convergence holds for any compact K ⊂ ∆ with inf (s,t)∈K |t − s| > 0. If additionally (A3) holds and sup (s,t)∈K, |t−s|≤k n (t − s)‖φ(ŷ b n,s,t )‖ = o * P (1), the convergence holds for arbitrary compact K ⊂ ∆.
Remark 2.11. Suitable modifications of the extensions discussed in Remark 2.8 and Remark 2.9 continue to hold in the bootstrap setting. More precisely, we can replace sets K ⊂ ∆ by arbitrary compact sets and the weighting t − s by arbitrary bounded weighting functions w, in which case the assumption inf (s,t)∈K |t − s| > 0 needs to be replaced by inf κ∈K |w(κ)| > 0. Also, conditional weak convergence of the correspondingly modified process (w(κ)α n (φ(ŷ b n,κ ) − φ(ŷ n,κ ))) κ∈K holds without assumption (A3) and without the condition sup (s,t)∈K, |t−s|≤k n (t − s)‖φ(ŷ b n,s,t )‖ = o * P (1) used in the above theorem.

Probabilistic considerations
In this section, we focus our attention on the setting where Y n has a specific structure that typically arises in applications. More precisely, consider the multi-parameter sequential empirical process Y n (s, t) := (t − s)α n (x̂ n,s,t − x), where x̂ n,s,t denotes an estimator for x that is computed based on the sub-sample X ⌊ns⌋+1 , ..., X ⌊nt⌋ . It turns out that conditions (W), (A1), (A2) in the previous section can be derived from simpler conditions that involve only a collection of 'classical' one-parameter sequential processes G n,j (t, f ) := tα n (x̂ n,t − x)(f ), f ∈ F j , with x̂ n,t computed from X 1 , ..., X ⌊nt⌋ . Consider the assumptions (W'), requiring weak convergence of G n := (G n,1 , ..., G n,J ) to a limit G = (G 1 , ..., G J ) in L ∞ (F 1 , ..., F J ; [0, 1]), and (A1'), a regularity condition on the paths of G, where G j , j = 1, ..., J are centered, Borel measurable processes.
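The one-parameter sequential process above can be sketched numerically as follows; the normalization α n = √n and the function name are our illustrative choices (G_n here is the empirical process, not its limit):

```python
import numpy as np
from math import floor, sqrt

def G_n(x, t, f, xf):
    """One-parameter sequential empirical process t * sqrt(n) * (x_hat_{n,t}(f) - x(f)),
    where x_hat_{n,t} averages f over the first ⌊nt⌋ observations and xf = x(f) = E f(X_1).
    The factor t keeps the process well-defined (equal to 0) at t = 0."""
    n = len(x)
    k = floor(n * t)
    if k == 0:
        return 0.0
    return t * sqrt(n) * (np.mean([f(v) for v in x[:k]]) - xf)
```

Evaluating G_n on a grid of t values for each f traces out one sample path of the sequential process whose weak convergence is assumption (W').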
The conditions above turn out to be sufficient for (W) and (A1).
Remark 2.13. For many kinds of weakly dependent data [including, of course, the independent case], the process G is a vector of centered Gaussian processes with a covariance of the form Cov(G j (s, f ), G j (t, g)) = (s ∧ t)K j (f, g) for some uniformly bounded covariance kernel K j . In this case, assumption (A1') holds. To see this, note that under this covariance structure the process G has paths that are uniformly continuous with respect to a metric of the form ρ 2 ((t, f ), (t ′ , f ′ )); see Example 1.5.10 in Van der Vaart and Wellner (1996). The discussion at the beginning of Example 1.5.10 in Van der Vaart and Wellner (1996) thus yields the desired result. The special structure of Y n implies that its sample paths have the same property.
Remark 2.14. There are interesting cases where condition (A1') holds for limiting processes that are non-Gaussian. More precisely, with F 1 = {u → I{u ≤ y} | y ∈ [−∞, ∞]}, the results in Dehling and Taqqu (1989) imply weak convergence of the process G n if the data X i exhibit long-range dependence. The limiting process, which can be non-Gaussian, is of the form G(t, y) = f (y)Z m (t) with f denoting a deterministic, uniformly bounded function and Z m a so-called m-th order Hermite process [see Dehling and Taqqu (1989) for a definition]. In particular, the sample paths of this process are Hölder-continuous [see Maejima and Tudor (2007)] and thus assumptions (A1') and (W') hold.
Remark 2.15. Consider the special case K = {0} × [0, 1]. In this case, assumption (A3) is satisfied as soon as x̂ n,t is of the form given in (1), with the data X 1 , X 2 , ... stemming from a strictly stationary sequence and x̂ n,1 → x outer almost surely. To see this, note that under the assumptions discussed above we have sup t ‖φ(x̂ n,t )‖ = max j=1,...,n ‖φ(x̂ j,1 )‖ and that, by Lemma B.1 together with the continuous mapping theorem, φ(x̂ n,1 ) → φ(x) outer almost surely. This in turn implies that (sup n≥1 ‖φ(x̂ n,1 )‖) * [the asterisk denoting a measurable majorant] is bounded in probability. For results implying almost sure convergence in a very general setting, see Adams and Nobel (2010) and the references cited therein.
For independent data, assumption (W') is known to hold as soon as the classes of functions F 1 , ..., F J are Donsker [see Van der Vaart and Wellner (1996), Chapter 2.12.1]. For dependent data, much less is known. Available results are, to the best of our knowledge, limited to classes of functions of the form F = {u → I{u ≤ y} | y ∈ R d }. Here, results for d > 1 are derived by Sen (1974) and Rüschendorf (1974) under φ-mixing and by Yoshihara (1975) and Inoue (2001) under strong mixing. Berkes et al. (2009) considered the case d = 1 under S-mixing, and derived a stronger result than weak convergence of the process. Finally, the paper by Dehling and Taqqu (1989) contains a similar result for the class of functions F 1 = {u → I{u ≤ y} | y ∈ R} and long-range dependent data. To the best of our knowledge, nothing is known for general classes of functions. Note that by Lemma 1.4.3 in Van der Vaart and Wellner (1996), asymptotic tightness of G n is equivalent to asymptotic tightness of G n,j for all j = 1, ..., J. Thus, Problem 1.5.3 in the same reference implies that, in order to obtain weak convergence of G n to G, we need to show first that G n,j is asymptotically tight for all j = 1, ..., J, and second that the following condition holds. (F) For all finite collections s ij ∈ [0, 1] and f ij ∈ F j , i = 1, ..., N, j = 1, ..., J, the collection (G n,j (s ij , f ij )) j=1,...,J, i=1,...,N converges weakly to (G j (s ij , f ij )) j=1,...,J, i=1,...,N in the usual sense of weak convergence of random vectors. There is a vast literature containing results that imply the finite-dimensional convergence (F), see Dehling et al. (2002) and the references cited therein for an overview. Criteria establishing asymptotic tightness of the processes G n,j for dependent data, on the other hand, are not as widely available, and one general result along those lines is provided below. This result is of independent interest. In particular, it can be used to verify condition (W') in a number of settings that have not been considered before.
Theorem 2.16. Assume that the process G n is of the form G n (t, ·) = tα n (x̂ n,t − x), where x̂ n,t is defined in (1) and the data X 1 , X 2 , ... come from a strictly stationary sequence. Assume that for each j = 1, ..., J there exists a semi-metric ρ j on F j which makes F j totally bounded, and that for each j = 1, ..., J we have sup f ∈F j E|f (X 1 )| q < ∞. Define F j,δ := {f − g | f, g ∈ F j , ρ j (f, g) ≤ δ}. Assume that the process G n,j (1, ·) satisfies, for some q > 2 and all j = 1, ..., J, lim δ↓0 lim sup n→∞ E * ‖G n,j (1, ·)‖ q F j,δ = 0 (4) [remember that the asterisk denotes outer expectation], that the corresponding moment bound (5) holds, and that for every j the class of functions F j has an envelope F j with finite q-th moment. Let condition (F) hold. Then G n ⇝ G in L ∞ (F 1 , ..., F J ; [0, 1]).
Condition (4) has been established by Andrews and Pollard (1994) for strongly mixing data, and inequality (3.1) in Andrews and Pollard (1994) reveals that (5) holds under the same assumption. Moreover, Hagemann (2012) established (4) for stationary sequences with geometric moment contraction properties [see Wu and Shao (2004)], and the results in his appendix show that again (5) holds under the same assumptions.
Next, consider bootstrap procedures. In the case of independent data, a mild assumption on the multipliers M i suffices. More precisely, assuming that the M i are i.i.d., independent of the data X i , and that ∫ 0 ∞ (P (|M 1 | > u)) 1/2 du is finite [which follows if M 1 has a finite moment of order 2 + ε], the classes of functions F 1 , ..., F J being Donsker [see Van der Vaart and Wellner (1996), page 81 for a definition of this property] implies (WB). To see this, note that by arguments similar to the ones given in the proof of Proposition 2.12 it suffices to derive (WB) for the set K = {0} × [0, 1]. To do so, apply Lemma B.3 in the appendix, where the approximating mappings A i and A b i,n are defined through projections on piecewise constant functions, see the arguments in the proof of Theorem 1.5.6 in Van der Vaart and Wellner (1996). Then assumption (i) of Lemma B.3 corresponds to conditional finite-dimensional convergence, which can be established by arguments similar to those given in Lemma 2.9.5 in Van der Vaart and Wellner (1996). Condition (ii) corresponds to tightness of the limit process Y. Condition (iii) follows from the unconditional asymptotic tightness of Y b n , which can be established by combining Theorems 2.12.1 and 2.9.2 in Van der Vaart and Wellner (1996).
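For independent data, the multiplier construction behind (WB) can be sketched as follows. This is our own minimal sketch: standard normal multipliers are one common choice meeting the 2 + ε moment condition, and the function name and evaluation grid are ours.

```python
import numpy as np

def multiplier_paths(x, f, n_boot=200, seed=0):
    """Multiplier-bootstrap replicates of the sequential process
    t -> n^{-1/2} * sum_{i <= ⌊nt⌋} M_i * (f(X_i) - mean_j f(X_j)),
    evaluated on the grid t = 1/n, 2/n, ..., 1, with i.i.d. N(0, 1) multipliers.
    Returns an array of shape (n_boot, n); each row is one bootstrap path."""
    rng = np.random.default_rng(seed)
    fx = np.array([f(v) for v in x], dtype=float)
    fx -= fx.mean()                       # centering at the sample mean
    n = len(x)
    m = rng.standard_normal((n_boot, n))  # multipliers, independent of the data
    return np.cumsum(m * fx, axis=1) / np.sqrt(n)
```

Empirical quantiles across rows of a functional of these paths (e.g. the supremum) then approximate the law of the corresponding functional of the limit Y.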
Under dependence, much less is known about bootstrap validity for empirical processes, even in the non-sequential setting. For an overview of available results, see Radulović (2009). In the sequential setting, some results along those lines were recently obtained by Bücher and Ruppert (2013). More precisely, those authors proposed to consider variables M 1,n , ..., M n,n from a triangular scheme that satisfy certain conditions [see assumptions A1-A3 in their paper]. In particular, the results in Bücher and Ruppert (2013) imply (WB) for K = {0} × [0, 1] under strong mixing conditions for the class of functions F = {u → I{u ≤ w} | w ∈ R d }. Moreover, using the techniques in that paper, in particular the Ottaviani-type inequality [Lemma 1 in Appendix B of the corresponding paper], it should be possible to derive (WB) for K = {0} × [0, 1] and more general classes of functions by combining arguments similar to those in the proof of Theorem 2.12.1 in Van der Vaart and Wellner (1996) with the Ottaviani-type inequality of Bücher and Ruppert (2013) and results on the validity of bootstrap procedures in the non-sequential setting. For an overview of such results, see Radulović (2009) and the references cited therein.

A general result on (quasi) Hadamard differentiability
This section contains an abstract result on compact differentiability that seems to be of independent interest. It plays a crucial role in the proofs of Theorems 2.7 and 2.10. The result in this section applies both to classical Hadamard differentiability [also known as compact differentiability] and to the more general concept of quasi-Hadamard differentiability, which was recently introduced by Beutner and Zähle (2010). The main advantage of this more general approach is that it allows one to apply a modified delta method in settings where the classical delta method fails, the simplest example being the mean. In particular, the distribution of U- and V-statistics and of value-at-risk functionals can be derived in settings where the classical delta method fails; see Beutner and Zähle (2010, 2012).
Consider the following general setting. (S) Denote by (R, d R ) a metrized topological vector space. Consider a second vector space D with subsets D φ , D 0 ⊂ D, C 0 ⊂ D 0 , and assume that (D 0 , d D ) is a metrized topological vector space. Let φ : D φ → R be quasi-Hadamard differentiable at x tangentially to C 0 ⊂ D 0 and denote the derivative by φ ′ x . Let (K, d K ) be a compact metric space. Define the sets R Φ and D 0,Φ of K-indexed collections (h t ) t∈K of elements of R and D 0 , respectively, and equip them with the corresponding uniform metrics d R,Φ and d D,Φ . For elements (h t ) t∈K , (g t ) t∈K set (h t ) t∈K + a(g t ) t∈K := (h t + ag t ) t∈K and assume that, with this definition, (D 0 , d D,Φ ) and (R, d R,Φ ) are metrized topological vector spaces. Define the map Φ as the coordinate-wise application of φ, that is, Φ((y t ) t∈K ) := (φ(y t )) t∈K .
Theorem 2.18. Under setup (S), the map Φ is quasi-Hadamard differentiable at X := (x) t∈K tangentially to a suitable set U ⊂ D 0,Φ , and the derivative is given by the coordinate-wise application of φ ′ x . Since classical Hadamard differentiability can be viewed as a special case of quasi-Hadamard differentiability, the above result continues to hold in the classical setting.
This result is of independent interest. For example, Gao and Zhao (2011) recently demonstrated that compact differentiability can be used to establish large and moderate deviation principles. The findings above allow one to carry their results over to the setting of statistics from subsamples and could, for example, be used to analyze rejection probabilities of various breakpoint tests.

Applications
In this section, we demonstrate how the results in Section 2 can be applied to various subsample-based methodologies studied in the recent literature. Throughout this section, we assume that we observe a sample $X_1, \dots, X_n$ from a strictly stationary time series. The process $\hat y_{n,s,t}$ is assumed to be based on the sub-sample $X_{\lfloor ns\rfloor+1}, \dots, X_{\lfloor nt\rfloor}$, i.e. to be of the form given in equation (1). In what follows, write $\theta = \phi(x)$ for the parameter of interest and define $\hat\theta_{n,s,t} := \phi(\hat y_{n,s,t})$. For notational convenience, we also consider the quantity $\hat\theta_{n,k,j}$ which is computed from the data $X_k, X_{k+1}, \dots, X_j$; note that $\hat\theta_{n,k,j} = \hat\theta_{n,(k-1)/n,\,j/n}$. For the sake of shorter notation, introduce the abbreviation $\mathbb V_{s,t} := \mathbb V(s,t,\cdot)$.
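As an illustration of this notation, the following minimal sketch computes the collection of subsample estimates $\hat\theta_{n,k,j}$ for a generic functional. The helper name `subsample_estimates` and the choice of the mean as the functional $\phi$ are hypothetical.

```python
import numpy as np

def subsample_estimates(x, phi):
    """Compute theta_hat_{n,k,j} = phi(X_k, ..., X_j) for all 1 <= k <= j <= n.

    Returns a dict keyed by (k, j) with 1-based inclusive indices,
    mirroring the notation theta_hat_{n,k,j} in the text."""
    n = len(x)
    est = {}
    for k in range(1, n + 1):
        for j in range(k, n + 1):
            est[(k, j)] = phi(x[k - 1:j])  # subsample X_k, ..., X_j
    return est
```

For example, `subsample_estimates(x, np.mean)[(1, n)]` recovers the full-sample estimate $\hat\theta_{n,1,n}$.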

Self-normalization
For a weakly dependent stationary time series, inference on a finite-dimensional quantity (say, the mean or median) typically involves consistent estimation of the asymptotic variance matrix of the sample estimator. The difficulty with the traditional approach lies in the bandwidth parameter(s) involved in the consistent estimation; the same difficulty occurs for other existing approaches, such as sub-sampling [Politis and Romano (1994)], the moving block bootstrap [Künsch (1989)] and block-wise empirical likelihood [Kitamura (1997)]. To avoid bandwidth selection, a general self-normalized approach to confidence interval construction and hypothesis testing for a stationary time series was developed in Shao (2010a). The basic idea is to use recursive estimates to form an inconsistent estimator of the asymptotic variance (matrix) of a statistic and to use a non-standard but pivotal limiting distribution to perform the inference. The SN approach is convenient to implement, as recursive estimates can easily be calculated with no need to develop new algorithms. Moreover, it does not involve any bandwidth parameters, and its finite sample performance is comparable to, and can be superior to, that of some existing bandwidth-dependent inference methods, as shown in Shao (2010a). Owing to these nice features, it has recently been extended to a number of important inference problems in time series; see Shao and Zhang (2010), Shao (2011, 2012) and Zhou and Shao (2013), among others. The theory for the SN approach was first developed in Shao (2010a,b) by adopting a traditional approach, based on a linearization of the statistic and assumptions on uniform negligibility of remainder terms. More precisely, Shao (2010a) assumed that
$$\hat\theta_{n,1,k} = \theta + \frac{1}{k}\sum_{i=1}^{k} L(X_i) + R_n(k/n), \qquad k = 1, \dots, n,$$
where $\{R_n(k/n)\}_{k=1}^{n}$ denote negligible remainder terms.
To describe the basic idea of Shao's approach, note that for a weakly dependent stationary time series and a smooth functional $\phi$ we generally expect that $R_n(1) = o_P(n^{-1/2})$ and
$$\frac{1}{\sqrt n}\sum_{i=1}^{n} L(X_i) \rightsquigarrow N(0, \Sigma), \qquad (8)$$
where $\Sigma = \sum_{k\in\mathbb Z} \operatorname{cov}(L(X_0), L(X_k)) > 0$ is the so-called long-run variance matrix. Further note that we implicitly assume $E[L(X_j)] = 0$, which is trivially satisfied in many cases. Inference on $\theta$ is then based on estimating the covariance matrix $\Sigma$ consistently, which can be difficult as it involves a choice of bandwidth parameters. To avoid those complications, Shao (2010a) proposed to consider the self-normalized statistic
$$G_n := n\,\big(\hat\theta_{n,1,n} - \theta\big)' V_n^{-1} \big(\hat\theta_{n,1,n} - \theta\big),$$
where $V_n = n^{-2}\sum_{j=1}^{n} j^2 (\hat\theta_{n,1,j} - \hat\theta_{n,1,n})(\hat\theta_{n,1,j} - \hat\theta_{n,1,n})'$ is the self-normalization matrix. In Shao (2010a,b), the asymptotic distribution of $G_n$ was derived under the assumption that
$$\Big\{\frac{\lfloor nt\rfloor}{\sqrt n}\big(\hat\theta_{n,1,\lfloor nt\rfloor} - \theta\big)\Big\}_{t\in[0,1]} \rightsquigarrow \Sigma^{1/2} B, \qquad (9)$$
with $B$ denoting a vector of independent Brownian motions on $[0,1]$. To verify (9), a common approach is to derive a uniform Bahadur representation for $\hat\theta_{n,1,\lfloor nt\rfloor}$ and control the order of $R_n(t)$ uniformly over $t \in [0,1]$. Such a task is in general not easy and requires a tedious case-by-case study. Under the assumptions above, Shao (2010a) proved that
$$G_n \rightsquigarrow B(1)'\Big(\int_0^1 \big(B(t) - tB(1)\big)\big(B(t) - tB(1)\big)'\,dt\Big)^{-1} B(1), \qquad (10)$$
where the limiting distribution is pivotal and does not depend on the unknown covariance matrix $\Sigma$. Using the results in Section 2, we can both considerably generalize the findings in Shao (2010a) and at the same time avoid the tedious calculations required to bound remainder terms. The key observation is that the only result required to derive (10) is weak convergence of the process $\{\sqrt n\, t(\hat\theta_{n,0,t} - \theta)\}_{t\in[0,1]}$. In the language of Section 2.1, this amounts to setting $K = \{0\} \times [0,1]$. Assuming that $\phi(x)$ is an element of $\mathbb R^p$, the quantity $\mathbb V_n(s,t,\cdot)$ can be viewed as an $\mathbb R^p$-valued vector. Abusing notation, denote this vector by $\mathbb V_n(s,t)$. Similarly, denote by $\mathbb V_{s,t}$ the vector $\mathbb V(s,t,\cdot)$.
Some straightforward calculations show that under assumptions (A1)-(A3), (C), (W) the statistic $G_n$ can be represented as
$$G_n = \mathbb V_n(0,1)^T \Big(\int_0^1 \big(\mathbb V_n(0,t) - t\,\mathbb V_n(0,1)\big)\big(\mathbb V_n(0,t) - t\,\mathbb V_n(0,1)\big)^T\,dt\Big)^{-1} \mathbb V_n(0,1) + o_P(1).$$
An application of Theorem 2.7 with the set $K = \{0\} \times [0,1]$, in combination with the discussion at the beginning of this section and the continuous mapping theorem, yields
$$G_n \rightsquigarrow \mathbb V_{0,1}^T \Big(\int_0^1 \big(\mathbb V_{0,t} - t\,\mathbb V_{0,1}\big)\big(\mathbb V_{0,t} - t\,\mathbb V_{0,1}\big)^T\,dt\Big)^{-1} \mathbb V_{0,1}.$$
Under the assumption that $\mathbb V(0,t,\cdot) = \Sigma^{1/2} B(0,t,\cdot)$, the limit of the statistic $G_n$ is pivotal; note that the limiting process will typically have this form in most settings with weakly dependent data, see Remark 2.13. With the general machinery of Section 2 at hand, several extensions of and remarks on the self-normalization approach can be made. First, observe that we can replace the self-normalization matrix $V_n$ with a more general statistic of the form
$$\bar V_n(H) := \int_\Delta \big(\mathbb V_n(s,t) - (t-s)\mathbb V_n(0,1)\big)\big(\mathbb V_n(s,t) - (t-s)\mathbb V_n(0,1)\big)^T\,dH(s,t),$$
with $H$ denoting an arbitrary probability measure on $\Delta$. By the continuous mapping theorem, we have joint convergence of $(\bar V_n(H), \mathbb V_n(0,1))$ to $(W(H), \mathbb V_{0,1})$, where
$$W(H) := \int_\Delta \big(\mathbb V_{s,t} - (t-s)\mathbb V_{0,1}\big)\big(\mathbb V_{s,t} - (t-s)\mathbb V_{0,1}\big)^T\,dH(s,t).$$
Assuming that $W(H)$ is non-singular almost surely [which happens as soon as $H$ places mass on sufficiently many different points], the asymptotic distribution of the generalized self-normalized statistic $G_n(H) := \mathbb V_n(0,1)^T \bar V_n(H)^{-1} \mathbb V_n(0,1)$ follows:
$$G_n(H) \rightsquigarrow \mathbb V_{0,1}^T W(H)^{-1} \mathbb V_{0,1}.$$
Finally, note that by the discussion in Remark 2.8 it might be advantageous to exclude estimators $\hat\theta_{n,k,l}$ that are based on too small proportions of the data. Considering a modified version of the statistic of the form $\bar G_n(H) := \mathbb V_n(0,1)^T \tilde V_n(H)^{-1} \mathbb V_n(0,1)$ with
$$\tilde V_n(H) := \frac{\int_\Delta \big(\mathbb V_n(s,t) - (t-s)\mathbb V_n(0,1)\big)\big(\mathbb V_n(s,t) - (t-s)\mathbb V_n(0,1)\big)^T\, I\{t-s > n^{-\gamma}\}\,dH(s,t)}{\int_\Delta I\{t-s > n^{-\gamma}\}\,dH(s,t)}$$
and arbitrary $\gamma \in (0,1/2)$, we would obtain the convergence $\bar G_n(H) \rightsquigarrow \mathbb V_{0,1}^T W(H)^{-1} \mathbb V_{0,1}$ without the need for assumption (A3).
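For concreteness, the following sketch computes the self-normalized statistic $G_n$ in the simplest scalar case where the functional is the mean, so that the $\hat\theta_{n,1,j}$ are recursive sample means. This is a minimal illustration of the formulas above, assuming the null value $\theta_0$ is known, not a general implementation.

```python
import numpy as np

def sn_statistic(x, theta0=0.0):
    """Self-normalized statistic of Shao (2010a) for the mean, scalar case:
        G_n = n (theta_hat_{n,1,n} - theta0)^2 / V_n,
        V_n = n^{-2} sum_j j^2 (theta_hat_{n,1,j} - theta_hat_{n,1,n})^2,
    where theta_hat_{n,1,j} is the mean of X_1, ..., X_j."""
    n = len(x)
    j = np.arange(1, n + 1)
    recursive = np.cumsum(x) / j      # recursive estimates theta_hat_{n,1,j}
    full = recursive[-1]              # full-sample estimate theta_hat_{n,1,n}
    v_n = np.sum(j**2 * (recursive - full) ** 2) / n**2
    return n * (full - theta0) ** 2 / v_n
```

Comparing $G_n$ with quantiles of the pivotal limit in (10), which can be simulated once and tabulated, yields a bandwidth-free test of $\theta = \theta_0$.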

Subsampling and fixed-b corrections
Sub-sampling [Politis and Romano (1994)] has been used in a wide range of inference problems for time series. The basic idea is that the distribution of an estimator computed from a sufficiently large subsample of the data should be close to that of the estimator from the whole data set. Confidence intervals and tests can then be constructed by approximating the unknown distribution of the estimator by its subsampling counterpart. To accommodate the time series dependence non-parametrically, the method involves the sub-sampling window width $l$, which needs to tend to infinity with the sample size, but at a slower rate, to achieve a consistent approximation. In practice, the choice of $l$ affects the sub-sampling distribution estimator and related operating characteristics, although its role does not show up in the conventional first order asymptotics. In Shao and Politis (2013), the traditional sub-sampling method was calibrated using a p-value based argument under the so-called fixed-b asymptotics [Kiefer and Vogelsang (2005)], where $b = l/n$. For simplicity, assume that $\theta$ is $\mathbb R^d$-valued. Defining $N = n - l + 1$, the sub-sampling based estimator of the distribution function of $\sqrt n\,(\hat\theta_{n,1,n} - \theta)$ evaluated at $x$ is
$$\hat F_{n,l}(x) := \frac{1}{N}\sum_{j=1}^{N} I\big\{\sqrt l\,(\hat\theta_{n,j,j+l-1} - \hat\theta_{n,1,n}) \le x\big\}.$$
The corresponding p-value of the test statistic $\sqrt n\,(\hat\theta_{n,1,n} - \theta_0)$ for the null hypothesis $\theta = \theta_0$ is
$$\hat p_n(b) := \frac{1}{N}\sum_{j=1}^{N} I\big\{\big|\sqrt l\,(\hat\theta_{n,j,j+l-1} - \hat\theta_{n,1,n})\big| \ge \big|\sqrt n\,(\hat\theta_{n,1,n} - \theta_0)\big|\big\}.$$
Note that under the conditions $l/n + 1/l = o(1)$ and additional regularity assumptions, $\hat p_n(b)$ has a uniform asymptotic distribution; see Politis et al. (1999). Under the fixed-b asymptotic framework, $l/n = b \in (0,1]$ is held fixed. Following an elementary approach, the limiting null distribution of $\hat p_n(b)$ was derived in Shao and Politis (2013) by assuming that $\hat\theta_{n,j,j+l-1} = \theta + l^{-1}\sum_{i=j}^{j+l-1} L(X_i) + R_n(j, j+l-1)$, that a similar representation with remainder $R_n(1,n)$ holds for $\hat\theta_{n,1,n}$, that (8) holds for $\{L(X_t)\}$, and that the remainder terms satisfy $\sqrt n\,|R_n(1,n)| = o_P(1)$ and $\sqrt l\,\sup_{j=1,\dots,N} |R_n(j, j+l-1)| = o_P(1)$.
Verifying the latter assumption for general functionals can be quite tedious and challenging. Now consider the general setup of Section 2 and let conditions (C), (W), (A1) and (A2) hold. We apply Theorem 2.7 with $K := \{(t, t+b) \mid t \in [0, 1-b]\} \cup \{(0,1)\}$ and assume that the map defining $\hat p_n(b)$ is continuous on a set of functions that contains the sample paths of $\mathbb V$ with probability one. In particular, this is the case if $\mathbb V(s,t) = (t-s)\Sigma^{1/2}(B(t) - B(s))$ with $\Sigma$ denoting a non-singular matrix and $B$ a vector of independent Brownian motions [see the arguments in Shao and Politis (2013)], which is typically the case for weakly dependent stationary time series; from now on, assume that this is the case. Observe that for $\theta = \theta_0$, in the setting discussed above, $\hat p_n(b)$ can be represented as a continuous functional of the process $(\mathbb V_n(s,t))_{(s,t)\in K}$ plus a remainder term, where the negligibility of the remainder follows from an application of the continuous mapping theorem. The results in Theorem 2.7 in combination with the continuous mapping theorem thus yield the weak convergence of $\hat p_n(b)$ to the corresponding functional of $\mathbb V$ as soon as assumptions (C), (W), (A1) and (A2) hold. Unless $\theta$ is real-valued, the asymptotic distribution of the statistic $\hat p_n(b)$ is in general not pivotal. Shao and Politis (2013) proposed to estimate its distribution based on further sub-sampling. An alternative is to consider block bootstrap approximations such as those discussed in Section 2.2. More precisely, consider a bootstrap version of $\hat y_{n,s,t}$ of the form given in (2) and denote it by $\hat y^B_{n,s,t}$. Define a bootstrap version of $\hat\theta_{n,s,t}$ through $\hat\theta^B_{n,s,t} := \phi(\hat y^B_{n,s,t})$ and assume that the map $\phi$ is continuous. Theorem 2.10 combined with the continuous mapping theorem for the bootstrap in probability [see Theorem 10.8 in Kosorok (2008)] then directly yields that, under condition (WB), the bootstrap version of $\hat p_n(b)$ converges to the same limit, conditionally on the data in probability. Finally, note that the reasoning above does not rely on $\theta$ being finite-dimensional, so it is also possible to handle infinite-dimensional parameters.
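The fixed-b subsampling p-value above can be sketched as follows for the scalar mean case; the function name and the use of the absolute value as the norm are illustrative assumptions.

```python
import numpy as np

def subsampling_pvalue(x, l, theta0):
    """Sub-sampling p-value in the spirit of Shao and Politis (2013) for the
    mean, scalar case: compares |sqrt(n)(theta_hat_{n,1,n} - theta0)| with
    the N = n - l + 1 subsample statistics
    |sqrt(l)(theta_hat_{n,j,j+l-1} - theta_hat_{n,1,n})|."""
    n = len(x)
    full = x.mean()                              # theta_hat_{n,1,n}
    t_full = np.sqrt(n) * abs(full - theta0)
    N = n - l + 1
    sub = np.array([x[j:j + l].mean() for j in range(N)])
    t_sub = np.sqrt(l) * np.abs(sub - full)
    return np.mean(t_sub >= t_full)              # p-value p_hat_n(b), b = l/n
```

Under fixed-b asymptotics the reference distribution of this p-value is non-uniform, which is precisely what the calibration in Shao and Politis (2013) corrects for.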

Testing for change points
Testing for change points in a time series is a well-studied topic in econometrics and statistics; see Perron (2006) for a recent review. A large class of tests in the literature is based on the so-called CUSUM (cumulative sum) process, where the test statistic is a smooth functional of the CUSUM process, with the Kolmogorov-Smirnov ($L^\infty$) and Cramér-von Mises ($L^2$) tests being two prominent examples. To accommodate the time series dependence and make the limiting null distribution pivotal, one needs a consistent estimator of the long-run variance as a studentizer. As mentioned previously, consistent estimation involves a bandwidth parameter, the choice of which is even more difficult in the change point testing problem. In particular, a fixed bandwidth (e.g., $n^{1/3}$) is not adaptive to the magnitude of dependence, and data-dependent bandwidths can lead to the so-called non-monotonic power problem [Vogelsang (1999)], i.e., the power of the test can decrease as the alternative gets farther away from the null. To overcome the non-monotonic power problem, Shao and Zhang (2010) proposed SN-based tests in a general framework. Let $\theta_t = T(D(X_t)) \in \mathbb R^q$ be the quantity of interest, which depends on the distribution of $X_t$, denoted by $D(X_t)$. The goal is to test whether there is a change point in $\{\theta_t\}_{t=1}^n$, i.e. the null hypothesis is $H_0: \theta_1 = \cdots = \theta_n$ and the alternative is $H_1: \theta_1 = \cdots = \theta_{k^*} \ne \theta_{k^*+1} = \cdots = \theta_n$ for some unknown $k^*$, $1 \le k^* < n$.
Applying Theorem 2.7 in combination with the continuous mapping theorem yields weak convergence of $G_n$ to the supremum of the corresponding functional of the limit process $\mathbb V$. Finally, note that by considering the modification $\tilde G_n := \sup_{r \in [n^{-\gamma},\, 1-n^{-\gamma}]} H_n(r)$ with arbitrary $\gamma \in (0, 1/2)$, assumption (A3) can be dropped; see Remark 2.8 for further details.
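To illustrate, the following sketch implements a scalar-mean version of an SN-based change point statistic in the spirit of Shao and Zhang (2010), maximizing over a trimmed range of candidate break points; the trimming fraction `trim` is a hypothetical tuning parameter playing the role of $n^{-\gamma}$ above, and the exact normalization in their paper differs in details.

```python
import numpy as np

def sn_changepoint_statistic(x, trim=0.1):
    """SN-based change point statistic for the mean (scalar case):
        G_n = max_k T_n(k)^2 / V_n(k),
    where T_n(k) contrasts the means before and after k, and V_n(k)
    self-normalizes using recursive estimates within each segment."""
    n = len(x)
    fwd = np.cumsum(x) / np.arange(1, n + 1)                 # mean of X_1..X_i
    bwd = np.cumsum(x[::-1])[::-1] / np.arange(n, 0, -1)     # mean of X_i..X_n
    best = -np.inf
    for k in range(int(trim * n), int((1 - trim) * n)):
        # CUSUM-type contrast between the two segments at candidate break k
        t = (k * (n - k) / n) * (fwd[k - 1] - bwd[k]) / np.sqrt(n)
        i1 = np.arange(1, k + 1)
        i2 = np.arange(k + 1, n + 1)
        # self-normalizer built from recursive estimates in both segments
        v = (np.sum(i1**2 * (fwd[:k] - fwd[k - 1]) ** 2)
             + np.sum((n - i2 + 1) ** 2 * (bwd[k:] - bwd[k]) ** 2)) / n**2
        best = max(best, t * t / v)
    return best
```

Because the normalizer is built from the same subsample estimates as the contrast, no long-run variance estimation (and hence no bandwidth) is required.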

A Proofs of main results
Proof of Theorem 2.7. The proof consists of two steps. First, we show that the convergence holds for sets $K$ with $\inf_{(s,t)\in K} (t-s) > 0$; second, we extend the result to general sets $K \subset \Delta$ under assumption (A3). The first step follows by an application of the functional delta method [see Theorem 3.9.4 in Van der Vaart and Wellner (1996)] in combination with Theorem 2.18. In particular, the space $D_0$ can be identified with $L(F_1, \dots, F_J; K)$, since a finite norm $\|H\|_L$ is equivalent to the distance $\|H - (x)_{(s,t)\in K}\|_L$ being finite. Similarly, the space $R_\Phi$ is identified with $\ell^\infty(G \times K)$, and the metric $d_{R,\Phi}$ corresponds to the supremum norm on $L^\infty(G_1, \dots, G_L; K)$. The functional delta method in combination with elementary considerations now yields the convergence of the rescaled processes; here the factor $\frac{1}{t-s}$ can be moved in front of the derivative since $\phi'_x$ is a linear map. Multiplying both sides by $t-s$, the continuous mapping theorem [Theorem 1.3.6 in Van der Vaart and Wellner (1996)] completes the first step of the proof. For the second step, define the set $K_S := \{(s,t) \in K \mid t-s \in S\}$ and consider the approximating processes obtained by restricting to $K_{[1/i,1]}$. It then suffices to verify the following three statements [cf. Theorem 4.2 in Billingsley (1968)]: (i) for every $i \in \mathbb N$, the approximating process converges weakly as $n \to \infty$; (ii) the limits of the approximating processes converge to the limit process as $i \to \infty$; (iii) the difference between the original and the approximating processes is asymptotically uniformly negligible. The first statement is the weak convergence established in the first step. For (ii), note that the relevant difference can be bounded by a quantity involving the operator norm $\|\cdot\|_{op}$ of the derivative, and the right-hand side converges to zero in probability; this is a direct consequence of assumption (A1). Finally, for a proof of (iii), note that for $\beta_n := \gamma_n \vee \alpha_n$ the relevant difference can be bounded by a sum of three terms, $R_{n,1} + R_{n,2} + R_{n,3}$.
Here the second inequality follows by an application of Lemma B.1 on the set $\{\sup_{(s,t)\in K_{[\beta_n n^{-1/2},\,1/i]}} \alpha_n \|\hat y_{n,s,t} - x\| \le \varepsilon\}$, whose probability tends to one by the definition of $\beta_n$. Condition (A3) implies that $R_{n,2} = o_P^*(1)$. To see that $R_{n,1} + R_{n,3}$ converge to zero in outer probability, define suitable closed sets $S_j(i,\varepsilon)$, $j = 1, \dots, J$. By the Portmanteau theorem [Theorem 1.3.4 in Van der Vaart and Wellner (1996)] combined with the weak convergence of $Y_{n,j}$ and assumption (A1) on $Y$, we obtain $\limsup_n P^*(Y_{n,j} \in S_j(i,\varepsilon)) \le P(Y_j \in S_j(i,\varepsilon))$ for $j = 1, \dots, J$. By condition (A1), $\lim_{i\to\infty} P(Y_j \in S_j(i,\varepsilon)) = 0$ for every $\varepsilon > 0$. This shows that $R_{n,1} = o_P^*(1)$ and $R_{n,3} = o_P^*(1)$, and thus the proof is complete. 2

Proof of Theorem 2.10. The first assertion follows by an application of the bootstrap functional delta method [see e.g. Theorem 12.1 in Kosorok (2008)]; for details on the appropriate identification of spaces, see the proof of Theorem 2.7 above. In order to prove the second part, it suffices, by Lemma B.3, to verify three statements which can be regarded as an adaptation of Theorem 4.2 in Billingsley (1968) to the present setting. Assertion (i) follows from the first part. Assertion (ii) can be established by exactly the same arguments as the corresponding statement in the proof of Theorem 2.7. For a proof of the third assertion, note that conditional weak convergence implies unconditional weak convergence of the bootstrap processes to $Y$; see e.g. the proof of Theorem 10.4, assertion (ii) ⇒ (i), in Kosorok (2008).
Thus assertion (iii) follows by exactly the same arguments as (iii) in the proof of Theorem 2.7, and hence the proof is complete. 2

Proof of Proposition 2.12. Start from the representation of $\mathbb V_n$ in terms of the process $\mathbb G_n$. Observe that $\big|\frac{\lfloor nt\rfloor}{nt} - 1\big| \le \frac{1}{\lfloor nt\rfloor \vee 1}$ and $\big|\frac{n(t-s)}{(\lfloor nt\rfloor - \lfloor ns\rfloor)\vee 1} - 1\big| \le \frac{3}{(\lfloor nt\rfloor - \lfloor ns\rfloor)\vee 1}$. Defining $\tilde Y_n(s,t,\cdot) := \mathbb G_n(t,\cdot) - \mathbb G_n(s,\cdot)$, observe that $\tilde Y_n \rightsquigarrow Y$ by the continuous mapping theorem. Moreover, $\sup_t \big\|\frac{\lfloor nt\rfloor}{nt}\mathbb G_n(t,\cdot) - \mathbb G_n(t,\cdot)\big\| = o_P^*(1)$, since for $t \ge n^{-1/4}$ the factor $\frac{\lfloor nt\rfloor}{nt}$ tends to one uniformly, and since $\sup_{t \le n^{-1/4}} \|\mathbb G_n(t,\cdot)\| = o_P^*(1)$ by arguments similar to those used to establish the negligibility of $R_{n,1}$ at the end of the proof of Theorem 2.7. Thus it remains to show that $\big|\frac{n(t-s)}{(\lfloor nt\rfloor - \lfloor ns\rfloor)\vee 1} - 1\big|\,\|\tilde Y_n(s,t,\cdot)\|$ is uniformly small. This can be done by similar arguments [distinguishing the cases $t-s \le n^{-1/4}$ and $t-s > n^{-1/4}$]. This completes the proof. 2

Proof of Theorem 2.16. Since it suffices to show asymptotic tightness of each process $\mathbb G_{n,j}$ individually, we focus on $\mathbb G_{n,1}$. To simplify notation, define $Z_n := \mathbb G_{n,1}$, $F := F_1$, $F_\delta := F_{1,\delta}$. Start by noting that under the assumptions of the theorem together with (5), a moment bound with some finite constant $C_1$ holds. To see this, fix $\delta > 0$, cover the set $F$ with $N$ balls of radius $\delta$ and centers $f_1, \dots, f_N$, and then combine the resulting bound with condition (5).
Proof of Theorem 2.18. Let $a_n = o(1)$ and let $H^{(n)} = (H^{(n)}_t)_{t\in K}$ denote a sequence with $H^{(n)} \to H \in U$ such that $X + a_n H^{(n)} \in D_\Phi$ for all $n \in \mathbb N$. We need to show that $a_n^{-1}\big(\Phi(X + a_n H^{(n)}) - \Phi(X)\big) \to \Phi'_X H$.
Assume that this does not hold. Then there exist a sequence $t_n$ in $K$ and a positive number $b$ such that
$$d_R\Big(a_n^{-1}\big\{\phi\big(x + a_n H^{(n)}_{t_n}\big) - \phi(x)\big\},\; \phi'_x(H_{t_n})\Big) \ge b \qquad (14)$$
for all $n \ge N_0$. On the other hand, the sequence $H^{(n)}_{t_n}$ has a subsequence $H^{(n_k)}_{t_{n_k}}$ which converges to $H_{t_\infty}$ for some $t_\infty \in K$. To see this, start by noting that $t_n$ is a sequence in a compact metric space, i.e. it has a convergent subsequence $t_{n_k} \to t_\infty$ with $t_\infty \in K$. The definition of the set $U$ then implies that $H_{t_{n_k}} \to H_{t_\infty}$. Together with the uniform convergence $\sup_t d_D(H^{(n)}_t, H_t) = o(1)$, this yields $H^{(n_k)}_{t_{n_k}} \to H_{t_\infty}$. Now quasi-Hadamard differentiability of $\phi$ tangentially to $C_0 \subset D_0$ implies $a_{n_k}^{-1}\big\{\phi\big(x + a_{n_k} H^{(n_k)}_{t_{n_k}}\big) - \phi(x)\big\} \to \phi'_x(H_{t_\infty})$, and together with continuity of $\phi'_x$ this contradicts (14). Thus the proof is complete. 2

B Auxiliary technical results
Lemma B.1. Denote by $(R, \|\cdot\|_R)$ a normed vector space. Consider a second vector space $D$ with subsets $D_\phi, D_0 \subset D$, $C_0 \subset D_0$, and assume that $(D_0, \|\cdot\|_D)$ is a normed vector space. Let $\phi : D_\phi \to R$ be quasi-compactly differentiable at $x$ tangentially to $C_0 \subset D_0$ and assume $0 \in C_0$. Then there exist constants $\varepsilon > 0$ and $K < \infty$ such that
$$\|\phi(x) - \phi(x+y)\|_R \le K \|y\|_D \quad \text{for all } y \in D_0 \text{ with } \|y\|_D \le \varepsilon \text{ and } x + y \in D_\phi.$$
Thus the proof is complete.

Proof. Define $K^C_a$ as the complement of $K_a$ in $K$ and let $B_n$ denote the corresponding supremum of $\|Y_n\|$ over $K^C_a$. By asymptotic equicontinuity of $Y_n$ [see the discussion in the proof of Theorem 2.7 for more details, and note that $\sup_{s=t}\sup_{f_1,\dots,f_J} |Y_n(s,t,f_1,\dots,f_J)| \equiv 0$ a.s.] we have $B_n = o_P^*(1)$. This implies that for every $\varepsilon > 0$ there exists $n_0(\varepsilon) \in \mathbb N$ such that

(∗) $P^*(B_n > \varepsilon) < \varepsilon$ for all $n \ge n_0(\varepsilon)$.

Thus (iii) yields $\lim_{i\to\infty}\limsup_{n\to\infty} P^*\big(d(A^b_{i,n}, V^b_n) > \varepsilon\big) = 0$ for every $\varepsilon > 0$.
Fix arbitrary $\varepsilon, \eta > 0$. The computations above yield the existence of an $i_1 \in \mathbb N$ such that the corresponding lim sup bound holds for all $i \ge i_1$. Moreover, by (ii) and the definition of weak convergence, there exists an $i_2 \in \mathbb N$ such that the analogous bound holds for all $i \ge i_2$. Combining all the results above establishes the desired inequality, and since $\eta, \varepsilon$ were arbitrary, this proves (a). For a proof of (b), note that (i) implies $A^b_{i,n} \rightsquigarrow A_i$, since conditional weak convergence implies unconditional weak convergence [see the proof of Theorem 10.4, assertion (ii) ⇒ (i), in Kosorok (2008)]. Thus (i)-(iii) imply that $V^b_n \rightsquigarrow V$. In particular, this implies asymptotic measurability of $V^b_n$ [see Section 1.3 in Van der Vaart and Wellner (1996)], and together with the continuity of $f \in BL_1$ this shows that $f(V^b_n) \rightsquigarrow f(V)$ by an application of the continuous mapping theorem. Thus $E_M f(V^b_n)^* - E_M f(V^b_n)_*$ converges to zero in $L^1$, hence also in probability. Now the proof is complete. 2