Testing for episodic predictability in stock returns

Standard tests based on predictive regressions estimated over the full available sample data have tended to find little evidence of predictability in stock returns. Recent approaches based on the analysis of subsamples of the data suggest in fact that predictability where it occurs might exist only within so-called "pockets of predictability" rather than across the entire sample. However, these methods are prone to the criticism that the subsample dates are endogenously determined such that the use of standard critical values appropriate for full sample tests will result in incorrectly sized tests leading to spurious findings of stock returns predictability. To avoid the problem of endogenously-determined sample splits, we propose new tests derived from sequences of predictability statistics systematically calculated over subsamples of the data. Specifically, we will base tests on the maximum of such statistics from sequences of forward and backward recursive, rolling, and double-recursive predictive subsample regressions. We develop our approach using the over-identified instrumental variable-based predictability test statistics of Breitung and Demetrescu (2015). This approach is based on partial-sum asymptotics and so, unlike many other popular approaches including, for example, those based on Bonferroni corrections, can be readily adapted to implementation over sequences of subsamples. We show that the limiting null distributions of our proposed test statistics depend in general on whether the putative predictor is strongly or weakly persistent and on any heteroskedasticity present (indeed on any time-variation present in the unconditional variance matrix of the innovations), the latter even if the subsample statistics are based on heteroskedasticity-robust standard errors. As a consequence, we develop fixed regressor wild bootstrap implementations of the tests which we demonstrate to be first-order asymptotically valid. Finite sample behaviour against a variety of temporarily predictable processes is considered. An empirical application to US stock returns illustrates the usefulness of the new predictability testing methods we propose.

© 2020 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

✩ We are grateful to the guest editor, Michael McAleer, two anonymous referees, Serena Ng and Torben Anderson for their helpful and constructive comments on earlier versions of this paper. Demetrescu gratefully acknowledges the support of the German Research Foundation (DFG) through the project DE 1617/4-2. Rodrigues gratefully acknowledges financial support from the Portuguese Science Foundation (FCT) through project PTDC/EGE-ECO/28924/2017, and (UID/ECO/00124/2013, UID/ECO/00124/2019 and Social Sciences DataLab, LISBOA-01-0145-FEDER-022209), POR Lisboa (LISBOA-01-0145-FEDER-007722, LISBOA-01-0145-FEDER-022209) and POR Norte (LISBOA-01-0145-FEDER-022209). Taylor gratefully acknowledges financial support provided by the Economic and Social Research Council of the United Kingdom under research grant ES/R00496X/1.

∗ Corresponding author. E-mail address: robert.taylor@essex.ac.uk (A.M.R. Taylor). https://doi.org/10.1016/j.jeconom.2020.01.001. ISSN 0304-4076.



Introduction
A large body of empirical research has been undertaken investigating whether stock returns can be predicted. Therein, a wide range of financial and macroeconomic variables have been considered as putative predictors for returns, including valuation ratios such as the dividend-price ratio, dividend yield, earnings-price ratio, book-to-market ratio, various interest rates and interest rate spreads, and macroeconomic variables including inflation and industrial production.
Early empirical studies, including Fama (1981), Keim and Stambaugh (1986), Campbell (1987), Campbell and Shiller (1988a,b), French (1988, 1989) and Fama (1990), often found significant evidence of in-sample predictability of U.S. stock index returns, at least over relatively long horizons. It has since been argued, however, that these findings could be spurious. Nelson and Kim (1993) and Stambaugh (1999) show that strongly persistent predictors lead to biased coefficients in predictive regressions if the innovations driving the predictors are correlated with returns, as is argued to be the case for many of the variables used as predictors; e.g., the stock price is a component of both the return and the dividend yield. Goyal and Welch (2003) show that the persistence of dividend-based valuation ratios increased significantly over the typical sample periods used in empirical studies, and argue that, as a consequence, out-of-sample predictions using these variables are no better than from a no-change strategy. Predictability tests which are asymptotically valid when the predictor is strongly persistent and driven by innovations which are correlated with returns have been proposed in Cavanagh et al. (1995), Campbell and Yogo (2006), Kostakis et al. (2015), Breitung and Demetrescu (2015), Elliott et al. (2015) and Jansson and Moreira (2006), inter alia. When such robust techniques are used, the statistical evidence of predictability is considerably weaker and often disappears completely; see, among others, Ang and Bekaert (2007), Boudoukh et al. (2007), Welch and Goyal (2008) and Breitung and Demetrescu (2015).
The foregoing approaches are based on a maintained assumption that the coefficients of the predictive regression model are constant over time. However, there are several reasons to suspect that if returns are predictable, then it is likely to be a time-varying phenomenon. The business cycle, time-varying risk aversion, rare disasters, structural breaks, speculative bubbles, investors' market sentiment, and regime changes in monetary policy have all been cited as possible reasons; see, e.g., Pesaran and Timmermann (2002). For example, significant changes in monetary policy and financial regulations could lead to shifts in the relationship between macroeconomic variables and the fundamental value of stocks, via the impact of these changes on economic growth and the growth rates of earnings and dividends. Timmermann (2008) argues that for most time periods returns are not predictable but that there are 'pockets in time' where evidence of local predictability is seen. In particular, if predictability exists as a result of market inefficiency rather than because of time-varying risk premia, then rational investors will attempt to exploit its presence to earn abnormal profits. Assuming a large-enough proportion of investors are rational, this behaviour will eventually cause the predictive power of the relevant predictor to be eliminated. If a variable begins to have predictive power for returns then a window of predictability might exist before investors learn about that relationship, but it will eventually disappear; see, in particular, Paye and Timmermann (2006), Timmermann (2008) and Farmer et al. (2018). It therefore seems reasonable to consider the possibility that the predictive relationship might change over time, so that over a long span of data one may observe some windows of time during which predictability occurs.
A growing body of empirical evidence is supportive of the view that the slope parameter in prediction models for returns varies over time. Henkel et al. (2011) find that return predictability in the stock market appears to be closely linked to economic recessions, with dividend yield and term structure variables displaying predictive power only during recessions. Similarly, Gargano et al. (2019) find that commodity returns are predictable using macroeconomic information, but again only during recessions. Lettau and Ludvigson (2001) find evidence of instability in the predictive ability of the dividend and earnings yield in the second half of the 1990s. Goyal and Welch (2003) and Ang and Bekaert (2007) find instability in prediction models for U.S. stock returns based on the dividend yield in the 1990s. Other studies which report evidence of time-varying behaviour in stock return predictability include Barberis (2000), Lettau and van Nieuwerburgh (2008), Welch and Goyal (2008), Stambaugh (2009, 2012), Pettenuzzo and Timmermann (2011), Dangl and Halling (2012), Gonzalo and Pitarakis (2012), Rapach and Wohar (2006) and Giannetti (2007), inter alia. In the context of predicting the equity premium, Kolev and Karapandza (2017) find that, for a given set of predictors, alternative data splits often lead to strongly contradictory outcomes concerning return predictability. Paye and Timmermann (2006) undertake a comprehensive analysis of prediction model instability for international stock market indices using conventional Bai-Perron structural break tests and report statistically significant evidence of structural breaks for many of the countries considered, arguing that the "[e]mpirical evidence of predictability is not uniform over time and is concentrated in certain periods" (op. cit. p. 312).
Paye and Timmermann (2006) also cite a number of applied studies which find significant evidence of in-sample (ex post) predictability in returns data but yet find very weak evidence of out-of-sample (ex ante) predictability, and argue that a possible explanation is structural instability in the predictive relations involved.
A limitation of many of the statistical techniques used in previous research on the instability of return prediction models is that they are not designed for use with highly persistent, endogenous predictors. Paye and Timmermann (2006) investigate the effects of persistence and endogeneity of the regressors on the Bai-Perron tests for structural breaks using Monte Carlo simulations. Their simulations reveal that size distortions, whereby parameter change is falsely signalled when none is present, can be substantial. They also show that some of the tests lack power in this context because of the large amount of noise typically present in predictive regression models. Moreover, because tests from predictive regression models based on the full sample of available data will have relatively low power to detect short windows of predictability, [...]

[...] special case $\zeta = 0 \in \mathbb{R}$, we recall that $\zeta_T \overset{w}{\Rightarrow}_p 0$ (equivalently, $\zeta_T = o_p^*(1)$ in $P$-probability) means that $P^*(|\zeta_T| > \epsilon) \to 0$ in $P$-probability for every $\epsilon > 0$. Finally, $\zeta_T = O_p^*(1)$ in $P$-probability signifies that for every $\epsilon > 0$ there exists a $K > 0$ such that $P\{P^*(|\zeta_T| > K) < \epsilon\} > 1 - \epsilon$ for all $T$. The $o_p$ and $O_p$ symbols retain their usual meaning.

The episodic predictive regression model
The basic predictive regression model for stock returns, $y_t$, allowing for time-variation in the slope coefficient on the predictor variable, is taken to be of the form

$$y_t = \beta_0 + \beta_{1,t}\, x_{t-1} + u_t, \qquad t = 1, \ldots, T, \tag{2.1}$$

where $x_t$, $t = 0, \ldots, T$, is observed and satisfies the data generating process [DGP]

$$x_t = \mu_x + \xi_t, \quad t = 0, \ldots, T, \qquad \xi_t = \rho\, \xi_{t-1} + v_t, \quad t = 1, \ldots, T, \tag{2.2}$$

with $\xi_0$ a mean zero $O_p(1)$ variate. The innovations $u_t$ form a martingale difference [MD] sequence, while $v_t$ is allowed to exhibit weak serial dependence. For expositional simplicity we have only allowed for a single predictive regressor, $x_{t-1}$, and an intercept in (2.1). Generalisations to the case where the predictive regression contains multiple predictors and/or a general deterministic component of the form considered in section 3.2 of Breitung and Demetrescu (2015) are detailed in section S.2 of the supplementary material.

The DGP in (2.1) generalises the constant parameter predictive regression model by allowing the slope coefficient on $x_{t-1}$ to vary over time, thereby allowing for changes over time in the predictive content of the regressor $x_{t-1}$. The constant parameter predictive regression model obtains by setting a constant slope parameter such that $\beta_{1,t} = \beta_1$, for all $t = 1, \ldots, T$. Our interest in this paper will focus on testing the usual null hypothesis that $(y_t - \beta_0)$ is a MD sequence and, hence, that $y_t$ is not predictable by $x_{t-1}$, which entails that $\beta_{1,t} = \beta_1 = 0$, for all $t = 1, \ldots, T$, in (2.1). The extant literature tests this null hypothesis against the alternative that $y_t$ is predictable by $x_{t-1}$ with a constant slope parameter holding across the whole sample, that is $\beta_1 \neq 0$, under the maintained hypothesis that $\beta_{1,t} = \beta_1$, for all $t = 1, \ldots, T$. In contrast, we will test against alternatives such that $\beta_{1,t} \neq 0$ for some $t$ but without imposing constancy on $\beta_{1,t}$. Some structure obviously needs to be placed on the class of alternative hypotheses we may consider and this will be formalised below.
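To make the episodic DGP concrete, the following is a minimal simulation sketch in Python. It assumes the model takes the form $y_t = \beta_0 + \beta_{1,t} x_{t-1} + u_t$ with $x_t = \mu_x + \xi_t$ and $\xi_t = \rho \xi_{t-1} + v_t$, and it makes several simplifying choices that are not part of the paper: Gaussian IID homoskedastic shocks, no serial correlation in $v_t$, $\xi_0 = 0$, and an illustrative correlation `delta` between $u_t$ and $v_t$. The function name and all parameter values are hypothetical.

```python
import numpy as np

def simulate_episodic_dgp(T, rho, beta0, beta1_path, delta=-0.9, mu_x=0.0, seed=0):
    """Simulate y_1..y_T and x_0..x_T from a predictive regression whose
    slope beta1_path[t-1] may differ across t (illustrative assumptions only).
    """
    rng = np.random.default_rng(seed)
    # (u_t, v_t) drawn jointly Gaussian with unit variances and correlation delta.
    cov = np.array([[1.0, delta], [delta, 1.0]])
    shocks = rng.multivariate_normal(np.zeros(2), cov, size=T)
    u, v = shocks[:, 0], shocks[:, 1]

    # Autoregressive component xi_t = rho * xi_{t-1} + v_t, started at xi_0 = 0.
    xi = np.zeros(T + 1)
    for t in range(1, T + 1):
        xi[t] = rho * xi[t - 1] + v[t - 1]
    x = mu_x + xi  # x_0, ..., x_T

    # y_t = beta0 + beta_{1,t} * x_{t-1} + u_t for t = 1, ..., T.
    y = beta0 + np.asarray(beta1_path, dtype=float) * x[:-1] + u
    return y, x

# A "pocket" of predictability: the slope is non-zero only for t = 201..300.
T = 500
beta1_path = np.zeros(T)
beta1_path[200:300] = 0.05
y, x = simulate_episodic_dgp(T, rho=1 - 5 / T, beta0=0.0, beta1_path=beta1_path)
```

Setting `beta1_path` to zeros everywhere gives data generated under the null of no predictability, while a constant non-zero path gives the conventional constant-parameter alternative.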
As discussed in the Introduction it is important to allow for the possibility of high persistence in the predictor variable $x_t$ and to allow the shocks driving the predictor, $v_t$ in (2.2), to be correlated with the unpredictable component of stock returns, $u_t$ in (2.1). As regards the latter, we will allow $u_t$ and $v_t$ to be contemporaneously correlated and heteroskedastic; exact conditions will be detailed in Assumption 3. For the former, we allow $\rho$ in (2.2) to satisfy the following assumption.
Assumption 1. Exactly one of the following two conditions holds true:
1. Weakly persistent predictors: The autoregressive parameter $\rho$ in (2.2) is fixed and bounded away from unity, $|\rho| < 1$.
2. Strongly persistent predictors: The autoregressive parameter $\rho$ in (2.2) is local-to-unity with $\rho := 1 - c/T$, where $c$ is a fixed non-negative constant.
Remark 1. Many predictors are strongly persistent, exhibiting sums of sample autoregressive coefficients which are close to unity. Near-integrated asymptotics has been found to provide better approximations for the behaviour of test statistics in such circumstances; see, inter alia, Elliott and Stock (1994). However, a large part of the literature works with models which take $x_t$ to be generated from a stable autoregressive process; see, for example, Amihud and Hurvich (2004). Assumption 1 allows for either of these possibilities to hold on $x_t$. ⋄

We will develop tests for the null hypothesis that $y_t$ is not predictable by $x_{t-1}$ in any subsample, which do not require the practitioner to know which of Assumption 1.1. or Assumption 1.2. holds in (2.2), nor indeed what the precise value of $\rho$ is in either case. Moreover, we aim to develop tests which possess non-trivial asymptotic local power against DGPs where predictability is present. Predictive regressions for stock returns typically exhibit small $R^2$ and low signal-to-noise ratios (see, inter alia, Campbell, 2008, and Phillips, 2015) so departures from the null, should predictability be present, are small. We will therefore conduct our theoretical analysis of the large sample properties of the tests we discuss under local alternatives such that the slope parameter $\beta_{1,t}$ is local-to-zero for an asymptotically non-vanishing set of the sample observations. The localisation rate (or Pitman drift) will need to be such that $\beta_{1,t}$ is specified to lie in a neighbourhood of zero which shrinks with the sample size, $T$. The appropriate Pitman drift is dictated by which of Assumption 1.1. and Assumption 1.2. holds in (2.2). Where $x_t$ is near-integrated the appropriate rate is $T^{-1}$, while for weakly dependent $x_{t-1}$, the rate is $T^{-1/2}$. The different localisation rates reflect the fact that near-integration implies a much stronger signal from the predictor $x_{t-1}$.
Moreover, tests based on the maxima from sequences of subsample predictability test statistics can only deliver non-trivial asymptotic local power in cases where an asymptotically non-vanishing fraction of the data is such that $\beta_{1,t} \neq 0$ holds on the DGP. For example, if $\beta_{1,t} \neq 0$ at one time point only, then although this would formally violate the null that $y_t$ is not predictable by $x_{t-1}$, this data point would nevertheless, as $T \to \infty$, be dominated by the remaining $T - 1$ data points where the null hypothesis holds. Formally, in our framework we specify $\beta_{1,t}$ to satisfy the following assumption.
Using the framework of Assumption 2 we can then equivalently write our null hypothesis that $\beta_{1,t} = 0$, for all $t = 1, \ldots, T$, as [...] We can now also formally specify the alternative hypothesis as [...] (2.4)

Remark 2. The alternative hypothesis specified by $H_{1,b(\cdot)}$ is very general but entails that at least one subset of the sample observations (this need not be a strict subset, so it could contain all of the sample observations) comprising contiguous observations exists for which $\beta_{1,t} \neq 0$, and where the size of this subset is proportional to the sample size $T$. Notice that under $H_{1,b(\cdot)}$ the integral of $|b(\cdot)|$ on $[0, 1]$ is non-zero and it is this property which qualifies $H_{1,b(\cdot)}$ as a genuine (local) alternative. Moreover, as we will establish later, the form that $b(\cdot)$ takes under $H_1$ determines the local power offsets obtained in the limiting distributions of the statistics we propose. Notice also that, under $H_{1,b(\cdot)}$, $b(\cdot)$ may be zero in certain parts of its domain and it may also change magnitude and/or sign over its domain; the former corresponds to data points where $\beta_{1,t} = 0$, while the latter reflects observations for which $\beta_{1,t}$ does not have a fixed magnitude and/or sign across the full sample. ⋄

We conclude this section by detailing in Assumption 3 the conditions that we will place on the disturbances $u_t$ and $v_t$ in (2.1) and (2.2).
Assumption 3. [...] where $I_k$ denotes the $k \times k$ identity matrix and:
1. $\zeta_t := (a_t, e_t)'$ is a uniformly $L_4$-bounded martingale difference sequence which is such that [...];
2. $H(\cdot)$ is a matrix of piecewise Lipschitz-continuous bounded functions on $(-\infty, 1]$, which is of full rank at all but a finite number of points;
3. $B(L)$, where $L$ denotes the usual lag operator, is an invertible lag polynomial with $b_0 = 1$ and 1-summable coefficients.

Remark 3. The structure in (2.5) imposes that the disturbances $u_t$ are uncorrelated with the increments of $x_t$ at all (positive) lags. Where $\zeta_t$ is independent and identically distributed [IID], this structure would entail that $x_{t-1}$ is weakly exogenous with respect to $u_t$, and we will continue (with an abuse of language) to use the same term as a shorthand to describe this structure irrespective of whether $\zeta_t$ is IID or not. Assumption 3.3 allows the increments to the predictor $x_{t-1}$ to be serially correlated. These dynamics are not restricted beyond a 1-summability regularity condition on the moving average representation, as is typical in this literature; see, for example, Breitung and Demetrescu (2015) and Kostakis et al. (2015). ⋄

Remark 4. Assumption 3 allows for quite general forms of heteroskedasticity in $(\tilde{u}_t, \tilde{v}_t)' := H(t/T)(a_t, e_t)'$ and hence in $u_t$ and $v_t$. In particular, Assumption 3.1 imposes a MD structure on $\zeta_t$ allowing for conditional heteroskedasticity, which is natural for the empirical applications to financial data we have in mind. Assumption 3.1 also imposes finite fourth moments; while daily returns often display very fat tails (see, for example, Nicolau and Rodrigues, 2019) such that the assumption of finite fourth order moments might not be a suitable assumption for daily data, standard predictive regression models have tended to be run on lower frequency data (monthly, quarterly or even annual data) where infinite kurtosis does not appear to be a concern.
Assumption 3.1 places uniformity conditions on the cross-product moments of the innovations which limit the degree of serial dependence allowed in the conditional variances; these conditions are satisfied, for example, by strictly stationary and ergodic MD sequences with finite variance. Assumption 3.2 allows for unconditional time heteroskedasticity in the innovations through the matrix $H(\tau)$. Where $H(\tau)$ is diagonal for all $\tau \in [0, 1]$ the innovations $(\tilde{u}_t, \tilde{v}_t)'$ can display time-varying variances but are contemporaneously uncorrelated, so in the case where $\zeta_t$ is IID with independent components, this would entail that $x_t$ is strictly exogenous with respect to $u_t$ (again we will use this terminology, with an abuse of language, whether $\zeta_t$ is IID or not). Importantly, the off-diagonal elements of $H(\tau)H(\tau)'$ (i.e., the covariance matrix of $(\tilde{u}_t, \tilde{v}_t)'$) are not imposed to be zero, thus allowing for contemporaneous and time-varying correlation among the innovations. The structure placed on $H(\tau)$ by Assumption 3.2 allows for a wide class of models for the behaviour of the variance matrix of the innovations including single or multiple (co-)variance shifts, variances which follow a broken trend, and smooth transition variance shifts. As discussed in Breitung and Demetrescu (2015, p. 360), such patterns are plausible with macro and financial data and it is therefore important to use tests which are robust to such behaviour to avoid the possibility of spurious rejection of the null because of non-constancy in the variance matrix rather than genuine predictability from $x_{t-1}$. ⋄

Under Assumption 1.1., $x_t$ is a particular case of a locally stationary process which admits a time-varying variance when the series $v_t$ displays time-varying volatility. In fact, the variance of the putative predictor is given as [...], where $\sigma^2_\xi(\rho)$ denotes the sum of the squared coefficients of the lag polynomial $(1 - \rho L)^{-1} B(L)$ (which is finite in the stable autoregression case).
This form of heteroskedasticity impacts on the inferential procedures based on subsample sequences of statistics discussed in this paper. Furthermore, time-varying volatility, where present in the regression errors, $u_t$, and in the instrumental variables used in constructing the statistics, can also affect the behaviour of the sequences of statistics.
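The kinds of unconditional variance patterns mentioned above (a single variance shift, a broken trend in the variance, and a smooth transition) can be illustrated with a short Python sketch. This is a scalar illustration only: it treats one diagonal element of the variance matrix as a function of $\tau = t/T$, and the function names, break dates and magnitudes are hypothetical choices rather than values taken from the paper.

```python
import numpy as np

def variance_single_shift(tau, v0=1.0, v1=4.0, tau_break=0.5):
    """Variance jumps from v0 to v1 after the sample fraction tau_break."""
    tau = np.asarray(tau, dtype=float)
    return np.where(tau <= tau_break, v0, v1)

def variance_broken_trend(tau, v0=1.0, slope=6.0, tau_break=0.5):
    """Variance is flat at v0, then grows linearly after tau_break."""
    tau = np.asarray(tau, dtype=float)
    return v0 + slope * np.maximum(tau - tau_break, 0.0)

def variance_smooth_transition(tau, v0=1.0, v1=4.0, tau_mid=0.5, speed=25.0):
    """Variance moves from v0 to v1 along a logistic curve centred at tau_mid."""
    tau = np.asarray(tau, dtype=float)
    weight = 1.0 / (1.0 + np.exp(-speed * (tau - tau_mid)))
    return v0 + (v1 - v0) * weight

# Heteroskedastic innovations: scale IID draws a_t by the standard deviation at t/T.
T = 400
tau = np.arange(1, T + 1) / T
rng = np.random.default_rng(1)
a = rng.standard_normal(T)
u_tilde = np.sqrt(variance_smooth_transition(tau)) * a
```

Each function is bounded and piecewise smooth on $[0, 1]$, in the spirit of the conditions the text places on the variance matrix; a full bivariate version would also need to specify the off-diagonal element.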
Heteroskedasticity has analogous effects under Assumption 1.2. (near-integration), though the transmission mechanism is somewhat different. In particular, under Assumption 3 we have that [...] where $W$ is a two-dimensional standard Wiener process (see, e.g., the invariance principle given in Lemma 1 of Boswijk et al., 2016), such that [...] on $D^2$. The processes $U(\tau)$ and $\omega V(\tau)$ are individually time-transformed Brownian motions whose correlation may also vary over time; their covariance at time $\tau$ is given by $\omega$ [...]

Full sample predictability tests
Consider the maintained hypothesis that the slope parameter $\beta_{1,t}$ in (2.1) is constant, such that $\beta_{1,t} = \beta_1$, for all $t = 1, \ldots, T$. This yields the standard constant parameter predictive regression

$$y_t = \beta_0 + \beta_1 x_{t-1} + u_t, \qquad t = 1, \ldots, T. \tag{3.1}$$

A number of procedures have been developed for testing $H_0: \beta_1 = 0$ in (3.1) against the local alternative $H_c: \beta_1 = n_T^{-1} b_1$, with $b_1$ a non-zero constant. Of these the simplest is the standard (full sample) ordinary least squares [OLS] t-test for the significance of $x_{t-1}$ in (3.1). Standard normal asymptotic theory applies to the t-statistic under Assumption 1.1. provided the errors are homoskedastic (although this can be weakened by using heteroskedasticity-robust standard errors). It does not apply under Assumption 1.2., where the limiting null distribution of the t-statistic is non-standard and depends on the local-to-unity parameter $c$ unless $x_t$ is strictly exogenous with respect to $u_t$.
Tests robust to $c$ have been developed by Elliott and Stock (1994), who propose a Bayesian mixture procedure; Cavanagh et al. (1995) and Campbell and Yogo (2006), who develop tests based on conservative bounds; and Jansson and Moreira (2006), who conduct inference on the basis of conditionally sufficient statistics. However, these procedures are all developed for the case where $x_t$ is near-integrated, i.e. such that Assumption 1.2. holds, and for the case of homoskedastic disturbances. Variable addition [VA] techniques (see Breitung and Demetrescu (2015, p. 359) for a literature review) can be used to develop predictability tests which can be validly used regardless of whether $x_t$ is local-to-unity or stationary. However, these VA-based tests have only trivial asymptotic local power against the Pitman rate, $T^{-1}$, where $x_t$ is near-integrated. Breitung and Demetrescu (2015) show that the finite sample power of the VA-based tests is indeed very low relative to the tests designed for use with near-integrated $x_t$ when the AR parameter $\rho$ in (2.2) is close to unity. They also develop modifications of the VA approach but some loss of power still remains. Gorodnichenko et al. (2012) proposed tests based on quasi-differencing but, like the original VA-based tests, these only have power in $T^{-1/2}$ neighbourhoods of the null. Breitung and Demetrescu (2015) also examine tests based on the instrumental variables [IV] approach. They show that these can be validly implemented in the presence of endogeneity and uncertain regressor persistence and heteroskedasticity of the form specified in Section 2. The basic idea underlying IV estimation of the predictive regression model is to use instruments such that the instrument has lower persistence than the regressor $x_{t-1}$ (so-called type-I instruments), or is such that the instrument is strictly exogenous with respect to $u_t$ (so-called type-II instruments).
Formal conditions which must hold on these instruments are given in Breitung and Demetrescu (2015).
A range of possible type-I instruments is given in Breitung and Demetrescu (2015, p. 361). These comprise:
(i) a short memory instrument, whereby we generate $z_{t-1} = (1 - \bar{\alpha} L)_+^{-1} \Delta x_{t-1} := \Delta x_{t-1} + \bar{\alpha} \Delta x_{t-2} + \cdots + \bar{\alpha}^{t-2} \Delta x_1$ with $|\bar{\alpha}| < 1$;
(ii) a mildly integrated instrument, generated as $z_{t-1} = (1 - \alpha_T L)_+^{-1} \Delta x_{t-1}$, for $\alpha_T := 1 - a T^{-\gamma}$ with $a > 0$, $0 < \gamma < 1$;
(iii) a fractionally integrated instrument, generated as $z_{t-1} = (1 - L)^{1-d^*} x_{t-1} \mathbb{I}(t > 0) := \Delta_+^{1-d^*} x_{t-1}$ for some $d^* \in (0, 1/2)$;
(iv) a long differences instrument, generated as $z_{t-1} = x_{t-1} - x_{t-K_T}$ for $K_T := \min\{\lfloor K T^{\upsilon} \rfloor, t - 1\}$ for some $0 < \upsilon < 1$ and positive constant $K$.
The use of the mildly integrated instrument in (ii) is an example of the so-called IVX approach of Phillips and Magdalinos (2009). In each case the generated instrument is, by design, free of a stochastic trend and hence less persistent than a near-integrated process, regardless of whether $x_{t-1}$ is near-integrated or stationary. Being filtered versions of $x_{t-1}$, these instruments are driven by the same innovations and it is therefore expected that they provide valid instruments for $x_{t-1}$; at the same time, the reduced persistence leads to standard inference.

Breitung and Demetrescu (2015, p. 362) also discuss the following type-II instruments:
(i) a generated random walk, with $w_t$ independent of $u_t$ and $v_t$;
(ii) deterministic functions of time, such as $z_{t-1} = (t - 1)$ or $z_{t-1} = \sin(\pi(t - 1)/2T)$;
(iii) Cauchy instruments, $z_{t-1} = \mathrm{sign}(x_{t-1})$.
Each of these is exogenous with respect to $u_t$ by construction. However, they do not exploit any specific information about $x_t$, other than where $x_t$ is near-integrated in which case they will be correlated with $x_t$; see Phillips (1998). Simulation evidence in Breitung and Demetrescu (2015) shows that tests based on type-II instruments are significantly more powerful than those based on type-I instruments when $x_t$ is strongly persistent.
However, these instruments will be weak, in the sense that they will be almost uncorrelated with the regressor, where $x_t$ is stationary. In such cases, Breitung and Demetrescu (2015) show that the resulting IV test for $\beta_1 = 0$ in (3.1) will have only trivial power. In order to simultaneously exploit the settings which result in superior power properties for the IV approach based on type-I and type-II instruments, Breitung and Demetrescu (2015) recommend the use of a test which combines two instruments for $x_{t-1}$, one of each type, which we denote by $z_{I,t-1}$ and $z_{II,t-1}$, collected into the vector $z_{t-1} := (z_{I,t-1}, z_{II,t-1})'$ for $t = 1, \ldots, T$. The general form of the resulting full sample IV-combination test statistic of Breitung and Demetrescu (2015), implemented with Eicker-White standard errors to account for heteroskedasticity satisfying Assumption 3, is given by [...] with $\bar{y}_t$, $\bar{x}_{t-1}$ and $\bar{z}_{t-1}$ denoting demeaned versions of $y_t$, $x_{t-1}$ and $z_{t-1}$, respectively, so that, for $w_t$ generically denoting either $y_t$, $x_{t-1}$ or $z_{t-1}$, $\bar{w}_t := w_t - \frac{1}{T} \sum_{s=1}^{T} w_s$, and where $\hat{u}_t$ denotes the regression residuals from estimating (3.1). For the reasons outlined in Remark 4 of Breitung and Demetrescu (2015), the IV-combination test must be run as two-sided and so we accordingly consider tests based on the square of $t_{\beta_1}$; that is, $t^2_{\beta_1}$. The limiting null distribution of $t^2_{\beta_1}$ is $\chi^2_1$ under either Assumption 1.1. or 1.2.; see Breitung and Demetrescu (2015) for details.
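The pairing of one type-I and one type-II instrument can be sketched in Python as follows. The sketch builds a mildly integrated (IVX-type) instrument through the recursion $z_t = \alpha_T z_{t-1} + \Delta x_t$ with $z_0 = 0$, which reproduces the filtered sum $\Delta x_{t-1} + \alpha_T \Delta x_{t-2} + \cdots$, and a sine instrument as the deterministic type-II choice. The tuning values `a=1.0` and `gamma=0.95`, the reading of the sine argument as $\pi(t-1)/(2T)$, and all function names are assumptions made for illustration; they are not prescribed by the text.

```python
import numpy as np

def ivx_instrument(x, a=1.0, gamma=0.95):
    """Type-I instrument from x_0..x_T (length T+1).

    Returns z_0..z_{T-1} (length T), where entry t-1 is the instrument
    paired with y_t.
    """
    x = np.asarray(x, dtype=float)
    T = x.size - 1
    alpha_T = 1.0 - a * T ** (-gamma)
    dx = np.diff(x)          # dx[t-1] = x_t - x_{t-1}, t = 1..T
    z = np.zeros(T)          # z[0] = z_0 = 0 (empty sum)
    for t in range(1, T):
        z[t] = alpha_T * z[t - 1] + dx[t - 1]
    return z

def sine_instrument(T):
    """Type-II instrument z_{t-1} = sin(pi * (t - 1) / (2T)), t = 1..T."""
    t = np.arange(1, T + 1)
    return np.sin(np.pi * (t - 1) / (2 * T))

def combined_instruments(x, a=1.0, gamma=0.95):
    """Stack the two instruments into a T x 2 matrix with rows z_{t-1}'."""
    T = len(x) - 1
    return np.column_stack([ivx_instrument(x, a, gamma), sine_instrument(T)])
```

Row `t-1` of the returned matrix holds $(z_{I,t-1}, z_{II,t-1})$, so it lines up with $y_t$ and $x_{t-1}$ for $t = 1, \ldots, T$.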
A variety of choices for the residuals $\hat{u}_t$ used in constructing $D_T$ is possible. A natural choice is the IV regression residuals so that $\hat{u}_t := y_t - \hat{\beta}_0^{iv} - \hat{\beta}_1^{iv} x_{t-1}$, where $\hat{\beta}_j^{iv}$ denotes the two-stage least squares [2SLS] estimator of $\beta_j$, $j = 0, 1$. However, both Breitung and Demetrescu (2015) and Kostakis et al. (2015) recommend the use of OLS residuals on the grounds that they represent the best linear projection of $y_t$ on $x_{t-1}$ regardless of the persistence of the putative predictor, and that their finite-sample behaviour appears to be more stable than that of IV residuals. Finally, one could also use residuals computed under the null; i.e., $\hat{u}_t := y_t - \frac{1}{T} \sum_{s=1}^{T} y_s$. Under the local alternatives considered in Assumption 2, these three possible choices can be shown to be asymptotically equivalent to one another in so far as the behaviour of (the suitably normalised) $D_T$ is concerned.
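As a rough illustration of how the three residual choices enter a heteroskedasticity-robust IV statistic, the sketch below computes a squared t-ratio from an over-identified 2SLS regression of demeaned $y_t$ on demeaned $x_{t-1}$, with an Eicker-White type variance. It uses the standard textbook 2SLS formulae; the exact normalisation of the statistic used by Breitung and Demetrescu (2015) may differ, and the function name and the `residuals` argument are hypothetical.

```python
import numpy as np

def iv_tstat_squared(y, x_lag, Z, residuals="ols"):
    """Squared robust t-ratio for the slope in y_t = b0 + b1 * x_{t-1} + u_t.

    y, x_lag: arrays of length T; Z: T x k instrument matrix.
    residuals: "iv", "ols" or "null", selecting which residuals feed the
    variance estimate.
    """
    y = np.asarray(y, dtype=float)
    x_lag = np.asarray(x_lag, dtype=float)
    Z = np.asarray(Z, dtype=float)

    # Demeaning removes the intercept from the regression.
    yd = y - y.mean()
    xd = x_lag - x_lag.mean()
    Zd = Z - Z.mean(axis=0)

    ZtZ_inv = np.linalg.inv(Zd.T @ Zd)
    Zx = Zd.T @ xd
    Zy = Zd.T @ yd
    denom = Zx @ ZtZ_inv @ Zx                 # x' P_Z x
    beta_iv = (Zx @ ZtZ_inv @ Zy) / denom     # 2SLS slope estimate

    if residuals == "iv":
        u = yd - beta_iv * xd
    elif residuals == "ols":
        u = yd - ((xd @ yd) / (xd @ xd)) * xd
    elif residuals == "null":
        u = yd                                # slope restricted to zero
    else:
        raise ValueError("residuals must be 'iv', 'ols' or 'null'")

    # Eicker-White "meat": sum_t z_t z_t' u_t^2.
    meat = (Zd * u[:, None] ** 2).T @ Zd
    w = ZtZ_inv @ Zx
    var_beta = (w @ meat @ w) / denom ** 2
    return beta_iv ** 2 / var_beta
```

Under the conditions discussed in the text the squared statistic would be compared with a $\chi^2_1$ critical value in the full-sample case; switching `residuals` changes only the variance estimate, not the point estimate.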
As we will subsequently see, a special case of the large sample results which will be presented in Section 4 is that the full-sample test based on $t^2_{\beta_1}$ has non-trivial asymptotic local power against $H_{1,b(\cdot)}$ for both weakly and strongly persistent regressors. This property of the full sample IV-based test statistic obtains through the limiting behaviour of the sample cross-product moment $A_T$. In particular, its two components are not of the same order of magnitude; therefore, upon normalisation, one of these terms will converge to zero and so all weight is placed on the other instrument. Which instrument gets full weight depends on the persistence of $x_{t-1}$. The type-II instrument is selected for strongly persistent predictors (i.e., those satisfying Assumption 1.2.), while the type-I instrument is selected for weakly persistent predictors (i.e., those satisfying Assumption 1.1.); see the proof of Lemma S.6 in the supplementary material for details. As a result, regardless of the degree of persistence of the regressor, the appropriate instrument is chosen in the limit.
However, as the simulation results in Section 6 demonstrate, the finite sample power of the full sample test can be quite low against such ''pocket'' alternatives. In the next section we therefore propose tests based on sequences of subsample implementations of the IV-combination test statistic. IV-based techniques are particularly useful to consider because the corresponding subsample-specific statistics may be expressed in terms of partial sums, whose behaviour may in turn be characterised in a tractable manner. This is not the case, for instance, with the test of Campbell and Yogo (2006) or those of Elliott and Müller (2006) and Elliott et al. (2015), where the analysis of the joint behaviour of subsample-specific statistics is considerably more involved.

Subsample IV-combination tests for predictability
Our aim is to develop predictability tests with good power to detect temporary periods of predictability irrespective of whether the putative predictor, $x_{t-1}$, is stable or near-integrated, and which are robust to the presence of heteroskedasticity in the data. To that end, we will base our testing approach on the IV-combination predictability statistics outlined in the previous section in the context of (3.2), computed not over the full available sample but over various sequences of subsamples of the data. For each such sequence we consider, our proposed test will be based on the maximum (in absolute value) statistic within that sequence. By taking the maximum over these sequences, we therefore base our test on the particular subsample within the given sequence of subsamples where the predictability statistic gives the strongest signal of predictability.

Choice of instruments
Before laying out our subsample IV-combination testing approach, we first need to state some regularity conditions which must hold on the type-I and type-II instruments such that we can validly use a testing strategy based on sequences of subsample IV-combination predictability statistics. We will then discuss the choice of instruments to use in practice which satisfy these conditions. Assumption 4 details the conditions which need to hold on the type-I instrument used.

Assumption 4.
Let z I,t obey the following conditions: 1. Under either Assumption 1.1. or Assumption 1.2.: E is a deterministic Hölder-continuous function of some order α > 0 and strictly increasing; is a continuous process with independent increments (and therefore, Gaussian), with G I (0) = 0 a.s., zero mean function, strictly increasing variance function [G I ] (τ ) and variance profile defined as η I (τ ) : 3. Under Assumption 1.2., is a deterministic Hölder-continuous function of some order α > 0 and strictly increasing; Remark 5. The conditions placed on z I,t−1 by Assumption 4 can differ depending on whether Assumption 1.1. or Assumption 1.2. holds. This distinction is germane in cases where z I,t−1 is constructed from x t−1 ; see the examples listed in Section 3. In such cases δ I may take different values for the same instrument, and, similarly, K z 2 I (τ ) may take different shapes under Assumption 1.1. and Assumption 1.2.. We do not, however, make this explicit to ease notation. Assumption 4.1 complements the condition in Assumption 3.1 to ensure that the innovations u t are uncorrelated with the instruments. Assumption 4.2 is new compared to Breitung and Demetrescu (2015) , and is required because we explicitly consider the behaviour of the IV-combination statistics under DGPs which can allow for either weak or strong persistence in the (putative) predictors; it requires the instruments to have stochastic properties similar to those of a stable autoregression driven by heteroskedastic innovations. Assumption 4.3 is the analogue of Assumption 3 of Breitung and Demetrescu (2015) but is considerably less restrictive: rather than the weak convergence of suitably normalised cross-product sample moments required there, we only require uniform boundedness in probability. Assumption 4 .3(b) regarding the partial sums of the squared instrument is new, but would appear fairly mild. 
Our conditions are weaker than those of Breitung and Demetrescu (2015) as we only consider the IV-combination statistic with two instruments, one of type-I and the other of type-II. ⋄ Remark 6. Although the weak convergence in Assumption 4.2 is joint, we do not specify the dependence structure between the limiting processes because our asymptotic results will hold irrespective of this structure. We note, however, that the variance profile, η I (·), which turns out to play an important role in our asymptotics under stability (Assumption 1.1.) depends on both the choice of type-I instrument and the DGP (specifically, on H(·) and the unconditional variance of u t ); see Lemma 1 for an example. Similarly, the limiting processes K z I x (·) and K z 2 I (·) also depend on both the DGP and the choice of instrument; again, see Lemma 1 for an example. ⋄ Assumption 5 details the corresponding regularity conditions on the type-II instrument.
Assumption 5. The variable z II,t is deterministic and, for some function Z (τ ), Hölder-continuous of order α > 1/2, and Remark 7. Notice that the conditions stated in Assumption 5 do not involve the persistence of the regressor because the type-II instruments are exogenous. Assumption 5 essentially coincides with Assumption 4 of Breitung and Demetrescu (2015), up to minor differences. While Assumption 4 of Breitung and Demetrescu (2015) allows for stochastic z II,t , it also requires the average cross-products of the instrument and the regression error to have a mixed Gaussian limiting distribution, such that it actually affords little additional flexibility in the choice of type-II instruments relative to Assumption 5. Indeed, under the above assumptions it holds, for example, that which is immediately seen to be a Gaussian process, given that Z is deterministic. However, the quadratic variation process of ds, is in general nonlinear in τ and depends, analogously to the case of Assumption 1.1., on both the DGP and the choice of the instrument z II,t−1 . Finally, notice that Z (·) is not permitted to be constant for any of the subsamples over which the test statistics are computed, as this would entail perfect multicollinearity in those subsamples. ⋄ We also require further regularity conditions regarding the interaction of the type-I and type-II instruments used. These are now collected in Assumption 6.
Assumption 6. For instruments z I,t and z II,t satisfying the conditions of Assumptions 4 and 5, respectively, it is also

Remark 8.
Breitung and Demetrescu (2015) do not impose such conditions explicitly as they are implied by the stricter set of assumptions under which they work. For instance, Assumption 6.1 would be implied by the weak convergence of the partial sums of $z_{I,t-1}$ in Assumption 3 of Breitung and Demetrescu (2015), but we do not require such weak convergence here because Assumption 4.1, on the uniform boundedness of the partial sums of the type-I instrument, suffices for our purposes (and is, for example, implied by Assumption 3 of Breitung and Demetrescu (2015) under near-integration). Indeed, Assumption 6.1 only differs through the weights $T^{-\delta_{II}} z_{II,t-1}$, which are deterministic; Assumption 6.2 can be seen as a randomly weighted version thereof, with weights $T^{-\delta_{II}} z_{II,t-1} u_t^2$. Notice that Assumption 6.1 entails that the (appropriately scaled) type-I and type-II instruments are mutually asymptotically orthogonal for all subsamples of the data, $t = \lfloor \tau_1 T \rfloor + 1, \ldots, \lfloor \tau_2 T \rfloor$, such that $0 \le \tau_1 < \tau_2 \le 1$. ⋄

In the context of their full-sample predictability tests, Breitung and Demetrescu (2015) consider the following choice for the type-II instrument, z II,t , where k is a positive integer chosen by the practitioner. Breitung and Demetrescu (2015) find that the best performing IV-combination test obtains for $k = 1$ in (4.1). For the type-I instrument we use the IVX approach which has become popular in predictive regressions; see, among others, Gonzalo and Pitarakis (2012), Phillips and Lee (2013) and Kostakis et al. (2015). This entails setting for some $a > 0$ and $\gamma \in (0, 1)$. In Lemma 1 we show that these two instruments satisfy the set of conditions required by Assumptions 4-6. 1

1 We will formally establish this result for only these two instruments, which will subsequently be used in both our Monte Carlo study and empirical application. We conjecture, however, that the other examples of type-I and type-II instruments considered in Breitung and Demetrescu (2015, pp. 361-362) will also satisfy Assumptions 4-6.
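As an illustration of how the two instruments might be generated in practice, the sketch below builds an IVX-type instrument by filtering the first differences of the predictor with an autoregressive root $1 - a/T^{\gamma}$, which is the usual IVX construction in the literature cited above, together with a deterministic sine-type instrument. The sine form $\sin(\pi k t/(2T))$, the default values of `a` and `gamma`, and both function names are assumptions made for this example only and should not be read as reproducing the definitions in the text.

```python
import numpy as np

def ivx_instrument(x, a=1.0, gamma=0.95):
    """IVX-type instrument built from x = [x_0, ..., x_T] (illustrative).

    Uses the recursion z_t = rho * z_{t-1} + (x_t - x_{t-1}), z_0 = 0,
    with rho = 1 - a / T**gamma. Default a and gamma are assumptions.
    """
    x = np.asarray(x, dtype=float)
    T = len(x) - 1
    rho = 1.0 - a / T ** gamma
    z = np.zeros(T + 1)
    for t in range(1, T + 1):
        z[t] = rho * z[t - 1] + (x[t] - x[t - 1])
    return z

def sine_instrument(T, k=1):
    """Deterministic sine-type instrument on t = 0, ..., T (assumed form)."""
    t = np.arange(T + 1)
    return np.sin(np.pi * k * t / (2 * T))
```

Because `ivx_instrument` depends on the data only through differences of `x`, adding a constant to the predictor leaves it unchanged. The instrument vector dated $t-1$ would then be obtained by stacking the first $T$ entries of each series.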
Lemma 1. Let Assumptions 1 and 3 hold with ζ t strictly stationary and ergodic such that, for some ϑ > 0, sup t∈Z where this weak convergence result holds jointly on D 3 with the weak convergence given in (2.6), and 2. Under Assumption 1.2., δ I = γ /2 and K z 2 Remark 9. The additional assumptions required to ensure the validity of the IVX instrument are relatively mild. Strict stationarity and ergodicity restrict the weak stationarity of ζ t required in Assumption 3 such that the asymptotic behaviour of sample averages can be accounted for, as required for example in Assumption 4.2. The additional condition on the rate of decay of E limits the amount of serial dependence allowed in the second order moments of the series. This rate is obviously satisfied when E = 0, but is much weaker than that condition and, hence, still allows for asymmetric volatility clustering. ⋄ Remark 10. Under Assumption 1.1., the processes K z I x (τ ) and K z 2 I (τ ) are both proportional to the quadratic variation of V (τ ), the limit process of the suitably normalised partial sums of ξ t under stability. This demonstrates the usefulness of the IVX instrument in that, under stability, z I,t−1 is approximately equal to the stochastic component ξ t−1 of the (putative) predictor, x t−1 , such that IVX effectively delivers the optimal instrument for x t−1 under Assumption 1.1.. For a choice of type-I instrument other than IVX this is, in general, not true and one obtains different processes K z I x (τ ) and K z 2 I (τ ) whose properties depend on the particular choice made; see also Corollary 2 of Breitung and Demetrescu (2015). Our large sample results will, however, be established under Assumptions 4 and 5 and, as such, will hold irrespective of the particular shape or properties of K z I x (τ ) and K z 2 I (τ ). Furthermore, under Assumption 1.2. 
the IVX instrument will turn out to be dominated uniformly over all subsamples by the type-II instrument, such that the precise properties of K z 2 I will not be relevant under near-integration. 2 ⋄

Subsample-based predictability tests
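For type-I and type-II instruments satisfying Assumptions 4-6, we can proceed to develop subsample implementations of the IV-combination predictability test discussed in Section 3. To provide a unified notation for such subsample statistics it will prove useful to define the subsample-specific analogues $A_T(\tau_1, \tau_2)$, $B_T(\tau_1, \tau_2)$, $C_T(\tau_1, \tau_2)$ and $D_T(\tau_1, \tau_2)$ of the full-sample quantities $A_T$, $B_T$, $C_T$ and $D_T$, respectively, used to construct the standard IV-combination statistic, $t_{\beta_1}$ of (3.2). These are defined analogously to their full-sample counterparts but for a sample consisting of observations $t = \lfloor \tau_1 T \rfloor + 1, \ldots, \lfloor \tau_2 T \rfloor$, so that, for example,

$$ A_T(\tau_1, \tau_2) := \sum_{t = \lfloor \tau_1 T \rfloor + 1}^{\lfloor \tau_2 T \rfloor} \bar z_{t-1} \bar x_{t-1}, $$

with $\bar y_t$, $\bar x_{t-1}$ and $\bar z_{t-1}$ now denoting subsample-specific demeaned versions of $y_t$, $x_{t-1}$ and $z_{t-1}$, respectively, so that, for $w_t$ generically denoting either $y_t$, $x_{t-1}$ or $z_{t-1}$, $\bar w_t := w_t - \frac{1}{\lfloor \tau_2 T \rfloor - \lfloor \tau_1 T \rfloor} \sum_{s = \lfloor \tau_1 T \rfloor + 1}^{\lfloor \tau_2 T \rfloor} w_s$. The full-sample quantity is recovered on setting $\tau_1 = 0$ and $\tau_2 = 1$.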
Precise definitions of these quantities are provided (in partial sum notation) in section S.3.1 of the supplementary material.
If it were known that a pocket of predictability might occur over the particular subsample $t = \lfloor \tau_1 T \rfloor + 1, \ldots, \lfloor \tau_2 T \rfloor$, then it would be logical to compute the subsample IV-combination statistic 3

$$ t_{\beta_1}(\tau_1, \tau_2) := \frac{A_T(\tau_1, \tau_2)' B_T(\tau_1, \tau_2)^{-1} C_T(\tau_1, \tau_2)}{\sqrt{A_T(\tau_1, \tau_2)' B_T(\tau_1, \tau_2)^{-1} D_T(\tau_1, \tau_2) B_T(\tau_1, \tau_2)^{-1} A_T(\tau_1, \tau_2)}} \qquad (4.3) $$

and a test for predictability in this specific subsample could be obtained by comparing $(t_{\beta_1}(\tau_1, \tau_2))^2$ with the $\chi^2_1$ distribution. Indeed, this would be nothing more than the approach of Breitung and Demetrescu (2015) applied to the particular subsample $t = \lfloor \tau_1 T \rfloor + 1, \ldots, \lfloor \tau_2 T \rfloor$. Such a test would be expected to have considerably more power to detect a regime of predictability over this subsample than would the full sample test based on $t_{\beta_1}$ of (3.2), because the former would be calculated only for sample points where a predictive relationship holds. In practice, however, it is unlikely that the practitioner will know which specific subsample(s) of the data might admit predictive regimes. While some previous applied studies in the literature have considered a variety of sample splits and also looked at the evolution of predictive regression statistics over a sequence of subsamples, these studies have tended to signal the presence of a predictive episode based on comparing each of these subsample statistics with the critical value that would apply when running a test for predictability on a single known subsample. As discussed in Section 1, this induces multiple testing and/or endogenously determined breakdate problems and, hence, does not deliver size-controlled tests; see, inter alia, Inoue and Rossi (2005). In order to control for these issues, the critical value of the test needs to reflect the searching element involved. This can be done by basing one's test on certain functionals of the sequence of subsample predictability statistics considered. Given we are testing the null of no predictability against the alternative of predictability in at least one subsample of the data, an approach based on the maximum of the sequence of subsample predictability statistics considered would seem appropriate. The specific sequences of statistics that we take the maximum over must also be entirely agnostic of the data to avoid any endogenous selection bias; we could not, for example, validly choose to take the maximum statistic from the sequence of subsamples where previous studies had argued predictability holds.

2 It should be noted, however, that $K_{z_I^2}(\tau)$ is also proportional to $[V](\tau)$ under near-integration, albeit with a different constant of proportionality; this is a consequence of the fact that $z_{I,t}$ is mildly integrated in this case.

3 In the context of D T (τ 1 ,
Notice that this entails that the forward recursive sequence discussed above is calculated across all possible warm-in fractions such that $\tau_L \ge \Delta\tau$, which is why this sequence is referred to as double-recursive. 4 The maximum statistic from the double-recursive sequence is then

$$ T^d := \max_{0 \le \tau_1 < \tau_2 \le 1,\ \tau_2 - \tau_1 \ge \Delta\tau} \left( t_{\beta_1}(\tau_1, \tau_2) \right)^2. \qquad (4.7) $$

Remark 11. The full sample IV-combination statistic $t^2_{\beta_1}$ of (3.2) is contained within the forward recursive sequence of statistics and obtains by setting $\tau = 1$, and similarly is contained within the backward recursive sequence for $\tau = 0$. It is also contained within the double-recursive sequence for $\tau_1 = 0$ and $\tau_2 = 1$. Notice also that if we set $\Delta\tau = 1$ in the context of the rolling sequence then this would collapse to the single full sample statistic, $t^2_{\beta_1}$. ⋄

Tests based on the maximum from each of the foregoing sequences of subsample statistics have particular patterns of local predictability that they will be well designed to detect. Tests based on the forward recursive sequence of statistics are designed to detect pockets of predictability which start at or near the start of the full sample period available to the practitioner. The longer the duration of such an episode the more powerful these tests will be, other things being equal, because they are based on a sequence of increasing subsamples all starting from the first data point. By analogy, tests based on the backward recursive sequence of subsample statistics are designed to detect end-of-sample pockets of predictability. Backward recursive based tests could therefore usefully be employed in an on-going monitoring exercise for the emergence of predictive regimes. Because both the forward and backward recursive sequences, and indeed the double-recursive sequence, contain the usual full sample predictability statistic, regardless of the choice of the trimming parameters, they also deliver tests which have power to detect predictability which holds over the whole sample, although in this particular case they would not be expected to be as powerful as the standard full sample IV-combination test which is clearly designed for that specific alternative hypothesis.

4 Notice that this double sequence also obtains by calculating the rolling sequence discussed above for all possible rolling window widths between $\Delta\tau$ and 1 inclusive.
For a given window width, tests based on a rolling sequence of statistics are designed to pick up a window of predictability, of (roughly) the same length, within the data. As discussed above, the double-recursive sequence amounts to considering all possible window width rolling sequences, subject to a minimum window width. Tests based on this sequence are therefore useful for picking up multiple predictive regimes, of potentially different lengths, within the data. However, because the double-recursive sequence considers such a large number of possible subsamples of the data, a test based on the maximum from this sequence would necessarily be expected to be less powerful than the recursive or rolling-based tests in scenarios for which the latter are designed. This is because the more statistics one considers in a sequence over which the maximum is taken, the stricter the critical value needs to be to maintain a correctly sized test. So, for example, in the case where a pocket of predictability of length, say, $m$ observations existed in the middle of the sample, a test based on the maximum from the rolling sequence using a window width of $m$ observations would be expected to be more powerful than a test based on the maximum of the double-recursive sequence, because the critical value for the latter would be considerably larger than that for the former. However, a power advantage over the double-recursive test would not necessarily be expected to hold for the corresponding rolling tests where the window width was either smaller than $m$ or greater than $m$. In the former case this would be because the maximum subsample length available over which predictability held ($m$ observations) could never be utilised because the window width is less than $m$, while in the latter case all subsamples in the sequence will contain a mix of data points where predictability holds and where it does not. It is of course very hard to predict analytically what the relative finite sample power properties of the recursive, rolling and double-recursive based tests will be in cases like these, and so we will investigate these further using Monte Carlo experimentation in Section 6.
Before establishing the asymptotic properties of the maximum subsample statistics, it is worth briefly commenting on estimation of the location of any predictive windows in cases where our proposed tests reject. Even for the simplest possible case where H 1,b(·) of (2.4) implies predictability over just a single subsample of the data, say t = ⌊τ 1 T ⌋ + 1, . . . , ⌊τ 2 T ⌋, with τ 1 < τ 2 and where either τ 1 > 0 or τ 2 < 1, consistent estimation of τ 1 and τ 2 is not possible because of the Pitman localisation to zero placed on β 1,t in this interval by Assumption 2. In practice, however, if a given maximum statistic rejects then a sensible estimate of τ 1 and τ 2 would be given by the start and end points of the subsample corresponding to the maximum value from the sequence of statistics from which a rejection was obtained. If one was looking to date possibly multiple windows of predictability then one could reapply the procedures outlined above to the data set excluding those sample points for which a first stage rejection occurred, and do so repeatedly until no rejection was obtained.
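The four maximum statistics, and the dating rule based on the maximising subsample, can be sketched as follows. This is an illustrative implementation under stated assumptions: subsamples are indexed by integer break points $\lfloor \tau T \rfloor$, subsample OLS residuals are used in the variance term (one of several possible choices), and the names `t2_sub`, `max_statistics`, `tau_U` and the default trimming values are hypothetical. The double-recursive search is cubic in the sample size as written, so it is intended for exposition rather than speed.

```python
import numpy as np

def t2_sub(y, x_lag, Z, lo, hi):
    """Squared IV-combination statistic on observations lo, ..., hi-1."""
    ys, xs, Zs = y[lo:hi], x_lag[lo:hi], Z[lo:hi]
    yd, xd, Zd = ys - ys.mean(), xs - xs.mean(), Zs - Zs.mean(axis=0)
    A, B, C = Zd.T @ xd, Zd.T @ Zd, Zd.T @ yd
    u = yd - (xd @ yd) / (xd @ xd) * xd        # subsample OLS residuals
    D = (Zd * (u ** 2)[:, None]).T @ Zd        # Eicker-White term
    BinvA = np.linalg.solve(B, A)
    return (BinvA @ C) ** 2 / (BinvA @ D @ BinvA)

def max_statistics(y, x_lag, Z, tau_L=0.25, tau_U=0.75, d_tau=0.25):
    """Maximum statistic and maximising window for each sequence."""
    y, x_lag, Z = (np.asarray(v, dtype=float) for v in (y, x_lag, Z))
    n = len(y)
    nL, nU, w = (int(np.floor(f * n)) for f in (tau_L, tau_U, d_tau))
    windows = {
        # subsamples starting at the first observation
        "forward": [(0, e) for e in range(nL, n + 1)],
        # subsamples ending at the last observation
        "backward": [(s, n) for s in range(0, nU + 1)],
        # fixed-width windows of w observations
        "rolling": [(s, s + w) for s in range(0, n - w + 1)],
        # all windows of at least w observations
        "double": [(s, e) for s in range(0, n - w + 1)
                   for e in range(s + w, n + 1)],
    }
    out = {}
    for name, wins in windows.items():
        vals = [t2_sub(y, x_lag, Z, lo, hi) for lo, hi in wins]
        j = int(np.argmax(vals))
        out[name] = (vals[j], wins[j])         # (max value, (lo, hi))
    return out
```

The second element of each returned pair gives the start and end indices of the maximising subsample, which corresponds to the informal estimate of the predictive window described above. These maxima must be compared with critical values that account for the search, not with $\chi^2_1$ critical values.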

Asymptotic distributions
In Proposition 1 we now provide representations for the asymptotic distributions of the maximum subsample statistics defined in Section 4.2 under the appropriate local alternative, H 1,b(·) . Proposition 1. Consider the model in (2.1) and (2.2) and let Assumption 2-6 hold. Then under the local alternative H 1,b(·) of (2.4): , are as defined in Assumption 4.2.

Remark 12.
Expressions for the limiting null distributions of the statistics can be obtained by omitting those terms involving the function b(·) from the representations given in Proposition 1. In what follows we will denote the resulting limiting null distributions of the T f , T b , T d and T r statistics under Assumption 1.
The impact of non-constancy in $H(\cdot)$ on these limiting null distributions also differs between the strongly and weakly persistent cases; see the discussion in Remarks 15 and 16. ⋄

Remark 13. All of the statistics in the sequences are exactly invariant to both $\mu_x$ and $\beta_0$ by virtue of being based on subsample demeaned variables. Moreover, the vector of instruments used is, by construction, invariant to $\mu_x$, because $z_{I,t}$ is based on differences of $x_t$ for the instruments mentioned in Section 3, and $z_{II,t}$ is a deterministic function of time chosen by the user without reference to $\mu_x$. Consequently the limiting representations in Proposition 1 do not depend on either $\mu_x$ or $\beta_0$. ⋄

Remark 14. Under Assumption 1.1., local power depends indirectly on the persistence of the putative predictor, as measured by $\rho$ and $B(L)$, through $K_{z_I x}(\cdot)$; see Lemma 1 for the particular example of the IVX instrument. Under Assumption 1.2., while the mean-reversion parameter $c$ does not affect the limiting null behaviour of the maximum statistics, the local power functions depend explicitly on $c$ through $J_{c,H}(\cdot)$. In each case, the rule-of-thumb that the stronger the mean reversion, the lower the local power, seems to hold; see the Monte Carlo results in Section 6. ⋄

For weakly persistent regressors, a time transformation can shed further light on the influence of heteroskedasticity. Under Assumption 4.2(c), the process $W(\cdot) := G_I(\eta_I^{-1}(\cdot)) / \sqrt{[G_I](1)}$ is continuous with stationary independent increments, $W(0) = 0$ a.s. and $\mathrm{Var}(W(\tau)) = \tau$, and therefore $W(\cdot)$ is a standard Wiener process. It follows that $G_I(\cdot) = \sqrt{[G_I](1)}\, W(\eta_I(\cdot))$ is a time-transformed Wiener process. Consequently, taking the limiting functional associated with $T^f$ as an example, we have that sup τ ∈[τ L ,1] with similar distributional identities holding for the remaining statistics.
As the maximum of a function is invariant to monotonic transformations of the argument, we may set r = η I (τ ) and therefore obtain the following alternative representations of the limiting results in part (i) of Proposition 1.
(1) is a standard Wiener process andK z I x (·) := K z I x (·)/ √ [G I ](1). Moreover, the limiting null distributions discussed in Remark 12 are such that, Remark 15. The results in Proposition 1 and Corollary 1 highlight that both the limiting null distributions and the local power functions of all of the tests depend, in general, on any unconditional heteroskedasticity present through the resulting non-constancy of H(·). This holds irrespective of the persistence of the regressor x t ; moreover, heteroskedasticity has differing effects on the limiting distributions depending on the degree of persistence of x t . At least under the null this may seem surprising, as Eicker-White standard errors are designed to robustify any of the subsample statistics, 0 ≤ τ 1 < τ 2 ≤ 1, to heteroskedasticity (conditional or unconditional). However, this asymptotic invariance only holds marginally for a given statistic in the sequence; indeed, it can be shown for each of the sequences of statistics, and regardless of which of Assumptions 1.1. and 1.2. holds, any given statistic in the sequence has a marginal χ 2 1 limiting null distribution. The representations in Corollary 1, for example, show that under Assumption 1.1. the suprema are taken over statistics computed for various intervals whose endpoints depend on the variance profile η I (·) defined in Assumption 4 .2, which depends in turn on both the DGP and the choice of type-I instrument. Moreover, under Assumption 1.2., the same phenomenon explains part (ii) of Proposition 1, with the additional complication that one cannot represent the subsample statistics more tractably using a time transformation due to the presence of the subsample-demeaned process Z . Here, too, heteroskedasticity depends on the choice of instrument (now the type-II instrument) in addition to the DGP. Under local alternatives, heteroskedasticity additionally enters by means of K z I x (·) and J c,H (·), under Assumptions 1.1. and 1.2., respectively. 
It is important to emphasise that the precise effect of non-constancy of H(·) due to unconditional heteroskedasticity on the limiting distributions of our maximum statistics depends on which of Assumption 1.1. or 1.2.
holds. ⋄ Remark 16. More generally, the impact of the DGP on the large sample behaviour of the statistics depends on the choice of instrument and on the persistence of the (putative) predictor. Consider first the results under Assumption 1.1.. Here the limiting null distributions, T s,I ∞ , s = f , b, d, r, all depend on η I (·) which in turn depends on the unconditional variance of u t . In the case where η I (s) = s, these limiting null distributions simplify to the suprema of squared standardised Wiener processes taken over the range of the subsamples. However, constancy of H(·) is not sufficient to ensure linearity of η I (·), because heteroskedasticity can still enter via the instrument z I,t−1 . Under the local alternative, the key quantity controlling power is K z I x (·) which can be deterministic under Assumption 1.1. (see, for example, Lemma 1 for the case of the IVX instrument), and (upon normalisation) characterises the strength of the instrument z I,t−1 . However, K z I x (·) also characterises the signal; other things equal, if x t has a large marginal variance relative to u t , then local power will increase. In the case of Assumption 1.2., local power depends on the process J c,H (·) in a more intricate way, due to the fact that J c,H and ∫Z dU may be dependent. Clearly, local power is influenced by all three factors c, ω and H(·). The effect of the elements of H(·) is not easy to disentangle, as can be seen from the expressions given for the quadratic variation processes of U(·) and V (·) at the end of Section 2. ⋄ In Corollary 2 we detail the limiting distributions of the full sample statistic t 2 β 1 of (3.2) under the local alternative, H 1,b(·) of (2.4).  . Consequently, t 2 β 1 is seen to possess a standard χ 2 1 limiting null distribution regardless of whether x t is stable or near-integrated. 
Moreover, the results in Corollary 2 show that the full sample IV-combination test exhibits non-trivial power against the class of local alternatives we consider in this paper; that is, it has power to detect predictive episodes. However, local power depends indirectly on heteroskedasticity which influences the stochastic properties of K z I x (·) and J c,H (·); see Remarks 15 and 16. ⋄ Remark 18. Where β 1,t = β 1 ̸ = 0, for all t = 1, . . . , T , the results in Corollary 2 specialise to the standard local power of the full sample IV-combination test based on t 2 β 1 . For type-II instruments without demeaning, one recovers the result of Breitung and Demetrescu (2015, Theorem 2.2). ⋄ To summarise, the limiting null distributions of the maximum subsample statistics all depend both on any heteroskedasticity present and on whether the putative predictor x t is a near-integrated or weakly dependent process. This poses significant problems for conducting inference not encountered with tests based on the full sample statistic, t 2 β 1 of (3.2). We next demonstrate that these issues can be solved using fixed regressor wild bootstrap implementations of the subsample tests.
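Remark 16 notes that, when $\eta_I(s) = s$, the limiting null distributions under Assumption 1.1. reduce to suprema of squared standardised Wiener processes over the range of the subsamples. The following simulation sketch illustrates this for the forward recursive case only, assuming that the relevant functional is $\sup_{r \in [\tau_L, 1]} W(r)^2 / r$; that form, the function name `simulate_forward_sup`, the grid size and the trimming value are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def simulate_forward_sup(tau_L=0.25, n_steps=1000, n_rep=2000, seed=0):
    """Draws of sup_{r in [tau_L, 1]} W(r)^2 / r on a discrete grid.

    W is approximated by scaled partial sums of IID N(0, 1) increments.
    """
    rng = np.random.default_rng(seed)
    increments = rng.standard_normal((n_rep, n_steps)) / np.sqrt(n_steps)
    W = np.cumsum(increments, axis=1)          # W(j / n_steps), j = 1..n_steps
    r = np.arange(1, n_steps + 1) / n_steps
    keep = r >= tau_L                          # trim the start of the sample
    return np.max(W[:, keep] ** 2 / r[keep], axis=1)

draws = simulate_forward_sup()
crit_95 = np.quantile(draws, 0.95)             # simulated 5% critical value
```

Since the supremum is at least as large as the value at $r = 1$, which is $\chi^2_1$ distributed, the simulated critical value exceeds the corresponding $\chi^2_1$ critical value. This is the cost of searching referred to in the text. Under heteroskedasticity $\eta_I(\cdot)$ is not the identity and a tabulation of this kind is no longer valid, which is what motivates the bootstrap.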

Bootstrap implementation
As the results in the previous section show, implementing tests based on the $T^s$, $s = f, b, d, r$, statistics from Section 4.2 will require us to address the fact that their limiting null distributions depend on any unconditional heteroskedasticity present in $u_t$ and $v_t$, and on whether the predictor $x_{t-1}$ is weakly dependent or near-integrated. To account for the former we employ a wild bootstrap resampling scheme applied to the demeaned dependent variable $\hat y_t := y_t - \frac{1}{T} \sum_{s=1}^{T} y_s$, while for the latter we use the observed outcomes on $x := [x_0, x_1, \ldots, x_T]'$ and $z := [z_0', z_1', \ldots, z_T']'$ as a fixed regressor and fixed instrument vector, respectively, when implementing the bootstrap procedure. We now outline our fixed regressor wild bootstrap approach in Algorithm 1. We will then demonstrate the asymptotic validity of this approach in Proposition 2.

Algorithm 1.
Step 1 Construct the wild bootstrap innovations $y_t^* := \hat y_t R_t$, where $\hat y_t := y_t - \frac{1}{T} \sum_{s=1}^{T} y_s$ are the demeaned sample observations on $y_t$, and $R_t$, $t = 1, \ldots, T$, is an IID $N(0, 1)$ sequence independent of the data. 5
Step 2 Using the bootstrap sample data $(y_t^*, x_{t-1}, z_{t-1}')'$, construct the bootstrap analogues of the statistics $T^s$, $s = f, b, d, r$, from Section 4.2. Denote these bootstrap statistics as $T^{s*}$, $s = f, b, d, r$.
Step 3 Compute the bootstrap p-value, $P_T^{s,*}$, of $T^s$ from the empirical distribution of $T^{s*}$ over $B$ bootstrap replications. The sequence $R_t$ generated in Step 1 must also be independent across the $B$ bootstrap replications.
Step 4 The wild bootstrap test of the null hypothesis $H_0$ of (2.3) at level $\alpha$ based on $T^s$ rejects if $P_T^{s,*} \le \alpha$, $s = f, b, d, r$.
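The steps of Algorithm 1 can be sketched generically as below. The sketch assumes the user supplies a function `stat_fn(y, x_lag, Z)` returning the chosen maximum statistic; that callable, the function name `fixed_regressor_wild_bootstrap`, the default `B` and the empirical p-value convention (the proportion of bootstrap statistics at least as large as the observed one) are assumptions for illustration only.

```python
import numpy as np

def fixed_regressor_wild_bootstrap(y, x_lag, Z, stat_fn, B=499, seed=None):
    """Fixed regressor wild bootstrap p-value for a statistic stat_fn.

    The regressor x_lag and instruments Z are passed unchanged to every
    bootstrap replication; only the dependent variable is resampled.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    y_hat = y - y.mean()                       # residuals under the null
    stat = stat_fn(y, x_lag, Z)                # statistic on original data

    boot = np.empty(B)
    for b in range(B):
        # fresh IID N(0, 1) multipliers, independent across replications
        R = rng.standard_normal(len(y))
        boot[b] = stat_fn(y_hat * R, x_lag, Z)

    p_value = float(np.mean(boot >= stat))     # empirical bootstrap p-value
    return stat, p_value
```

A test at level $\alpha$ would reject when the returned p-value is no larger than $\alpha$. Because `stat_fn` is evaluated with the same `x_lag` and `Z` in every replication, no knowledge of the persistence of the predictor is needed to run the procedure.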
Remark 19. The bootstrap statistics $T^{s*}$, $s = f, b, d, r$, are calculated treating both $x_{t-1}$ and the vector of instruments, $z_{t-1}$, as fixed; i.e., they are calculated using the same observed $x_{t-1}$ and $z_{t-1}$ as were used in the construction of $T^s$. This aspect is crucial for delivering bootstrap tests that are asymptotically valid regardless of whether $x_t$ satisfies Assumption 1.1. or 1.2. and without knowledge of which of these holds. In particular, the same instrument (either type-I or type-II, depending on the true regressor persistence) gets full asymptotic weight in both the original 2SLS and the bootstrap t-ratios (see section S.3 of the supplementary material). ⋄

Remark 20. The wild bootstrap generating $y_t^*$ in Step 1 of Algorithm 1 replicates the pattern of unconditional heteroskedasticity present in the original innovations, as, conditionally on $\hat y_t$, $y_t^*$ is independent over time with zero mean and variance $\hat y_t^2$. Moreover, any heteroskedasticity present in $x_{t-1}$ and $z_{t-1}$ is replicated through the fixed regressor/instrument aspect of the bootstrap statistics. In particular, as $u_t$ is a MD sequence, it is anticipated that the bootstrap will replicate the variance properties of either $z_{I,t-1} u_t$ or $z_{II,t-1} u_t$, depending on the degree of persistence exhibited by $x_t$. Having fixed the regressor and the instruments when bootstrapping, the analogous terms in the bootstrap test statistics are given by $z_{I,t-1} \hat y_t R_t$ and $z_{II,t-1} \hat y_t R_t$ with variances $z_{I,t-1}^2 \hat y_t^2$ and $z_{II,t-1}^2 \hat y_t^2$, respectively. Using the result detailed in the last sentence of Remark 19, it is then seen that the correct variance profile is replicated in the limit. For full details see the proof of Proposition 2. ⋄

5 The Gaussianity assumption on $R_t$ is standard in the literature and simplifies the proof of Proposition 2. This can, however, be generalised such that $R_t$ is any IID sequence with $E(R_t) = 0$, $E(R_t^2) = 1$ and $E(R_t^4) < \infty$.

Remark 21.
Step 1 of Algorithm 1 is based on residuals obtained under the null hypothesis. It is straightforward to show that the large sample properties of corresponding bootstrap tests based on either the OLS or IVX residuals from estimating the predictive regression over the full sample are unaltered from those given here. Moreover, albeit more computationally intensive, one could also use the analogous subsample implementations of any of these three full sample residuals. ⋄ In Proposition 2 we now demonstrate the large sample validity of the fixed regressor wild bootstrap implementation of the tests from Section 4.2. In particular, we show that our proposed bootstrap in Algorithm 1 correctly replicates the first order asymptotic null distributions of the statistics given in Remark 12 under both the null hypothesis and local alternatives.
Proposition 2. Let the conditions of Proposition 1 hold. Then, as $T \to \infty$, under either the null hypothesis $H_0$ of (2.3) or the local alternative $H_{1,b(\cdot)}$ of (2.4): (i) under Assumption 1.1., as $T \to \infty$, it holds that

A consequence of Proposition 2 is that we obtain asymptotically correctly sized tests when using bootstrap critical values obtained using Algorithm 1. We now formalise this result in Corollary 3.

size-adjusted test based on the corresponding original statistic $T^{s}$, $s = f, b, d, r$. ⋄

Remark 24. The full sample IV-combination statistic, $t^2_{\beta_1}$, of Breitung and Demetrescu uses Eicker-White standard errors to correct the limiting null distribution of the statistic for non-constancy in $H(\cdot)$ due to unconditional heteroskedasticity in the innovations. Because it is still necessary to implement the subsample maximum tests using a wild bootstrap, it would be feasible to replace the Eicker-White standard errors used in the computation of the subsample $(t_{\beta_1}(\tau_1, \tau_2))^2$ statistic in (4.3) and its bootstrap equivalent, computed in Step 2 of Algorithm 1, with conventional standard errors. While this would alter the limiting representations given for the maximum statistics in Proposition 1 and Corollary 1, it can be shown that the resulting wild bootstrap tests would still be asymptotically valid, with an analogous result to that in Corollary 3 holding. In this case the wild bootstrap tests would attain the same asymptotic local power functions as (infeasible) size-corrected implementations of the (non-bootstrap) maximum tests based on conventional standard errors. These asymptotic local power functions will not in general coincide with those obtained for the statistics based on Eicker-White standard errors, but they would where $H(\cdot)$ is constant. ⋄

Remark 25. The bootstrap validity results given in this section also apply to a fixed regressor wild bootstrap implementation of the full sample IV-combination test based on $t^2_{\beta_1}$. In particular, this will satisfy a result of the form given in Corollary 3 and will have the same asymptotic local power function as the test based on $t^2_{\beta_1}$ using $\chi^2_1$ critical values, discussed in Section 4.3. As with the discussion for the subsample maximum statistics in Remark 24, one could replace Eicker-White standard errors with conventional standard errors without losing asymptotic validity. ⋄

Numerical results
We use Monte Carlo simulation methods to investigate the finite sample performance of the bootstrap implementations of the subsample-based predictability tests T f , T b , T r and T d proposed in Section 4 for testing the null hypothesis of no predictability in (2.3); i.e., H 0 : β 1,t = 0, for all t = 1, . . . , T , against the alternative H 1,b(·) of (2.4) that predictability holds across some subset of the sample data. Data are generated from (2.1)-(2.2). In Section 6.1 we explore the empirical size properties of these tests comparing with the corresponding full sample IV-combination test of Breitung and Demetrescu (2015), t 2 β 1 of (3.2). In Section 6.2 we compare the finite sample local power properties of these tests against a variety of DGPs displaying temporary predictability.
Following the discussion in Section 4.1, we base both the full sample $t_{\beta_1}$ IV-combination statistic in (3.2) and the corresponding subsample $t_{\beta_1}(\tau_1, \tau_2)$ statistics in (4.3) on the instrument vector $z_{t-1} := (z_{I,t-1}, z_{II,t-1})'$, with the type-II instrument, $z_{II,t-1}$, defined as in (4.1) with $k = 1$, and the type-I instrument, $z_{I,t-1}$, given by the IVX choice of Kostakis et al. (2015) defined as in (4.2), with $a = 1$ and $\gamma = 0.95$ (see footnote 6). Excepting the IVX instrument, $z_{I,t-1}$, all variables and instruments entering the estimated predictive regressions are demeaned, as in the main text. As discussed in Kostakis et al. (2015, p. 1514), the IVX instrument, $z_{I,t-1}$, does not need to be demeaned because the slope estimator in the predictive regression is invariant to whether $z_{I,t-1}$ is demeaned or not. In order to correct for the finite sample effects of estimating the intercept term in (2.1), which are most pronounced for highly persistent regressors that are strongly correlated with the predictive model's innovations, Kostakis et al. (2015, p. 1516) recommend the use of a finite-sample correction factor. We also found that this correction factor led to significant improvements in the finite sample properties of our proposed tests and hence it is implemented in all of the numerical and empirical results we report.

All simulations are performed in MATLAB, versions R2018a and R2018b, using the Mersenne Twister random number generator. All results pertain to the nominal 5% level; qualitatively similar results were obtained for other conventional significance levels. All of the subsample tests are computed according to Algorithm 1 using 399 bootstrap replications; the bootstrap tests are denoted $T^{s*}$, $s = f, b, d, r$. Following Banerjee et al. (1992), we set $\tau_L = 1/4$ and $\tau_U = 3/4$ in the context of the forward and backward recursive statistics, respectively, and $\Delta\tau = 1/3$ for the rolling and double recursive statistics.
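To make the role of $\tau_L$, $\tau_U$ and $\Delta\tau$ concrete, the following Python sketch enumerates the subsample windows over which the four sequences of statistics could be computed. The index conventions are assumptions made for illustration: windows are half-open ranges `(start, end)` over observations `0, ..., T-1`, and the minimum window length is the integer part of the relevant fraction times `T`. The exact definitions are those of Section 4.2, and all function names here are hypothetical.

```python
from fractions import Fraction
from math import floor


def forward_windows(T, tau_L):
    """Windows starting at the first observation, end point growing to T."""
    return [(0, end) for end in range(floor(tau_L * T), T + 1)]


def backward_windows(T, tau_U):
    """Windows ending at the last observation, start point moving up to tau_U * T."""
    return [(start, T) for start in range(0, floor(tau_U * T) + 1)]


def rolling_windows(T, delta_tau):
    """Windows of fixed width delta_tau * T moved through the sample."""
    width = floor(delta_tau * T)
    return [(start, start + width) for start in range(0, T - width + 1)]


def double_recursive_windows(T, delta_tau):
    """All windows whose width is at least delta_tau * T."""
    width = floor(delta_tau * T)
    return [(start, end)
            for start in range(0, T - width + 1)
            for end in range(start + width, T + 1)]


def max_statistic(stat, windows):
    """Maximum of a user-supplied subsample statistic over a set of windows."""
    return max(stat(start, end) for start, end in windows)


T = 12
fwd = forward_windows(T, Fraction(1, 4))
bwd = backward_windows(T, Fraction(3, 4))
rol = rolling_windows(T, Fraction(1, 3))
dbl = double_recursive_windows(T, Fraction(1, 3))
```

The double recursive set grows quadratically in `T`, which is consistent with the much higher computing time reported for those tests.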
The empirical size simulations were based on 5000 Monte Carlo replications and the local power simulations on 1000 replications, with the exception of the double recursive tests, where 1000 replications were used for size and 500 for power because of the much higher computing time required. For the full sample $t^2_{\beta_1}$ test, results are reported for versions based on the asymptotic $\chi^2_1$ critical value and on a fixed regressor wild bootstrap, the latter using 399 bootstrap replications. For all of the bootstrap tests, two versions are reported. The first is based on statistics using Eicker-White standard errors while the second, following the discussion in Remark 24, uses conventional standard errors. The second variant is distinguished by an additional "NW" in its subscript. Following the discussion in Section 3 and footnote 4.2, all of the reported statistics use residuals, $\hat{u}_t$, computed under the null hypothesis.
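The bootstrap decision rule used throughout can be sketched as follows. This is a minimal Python illustration under stated assumptions: the p-value is taken to be the share of bootstrap statistics at least as large as the observed statistic, and the null is rejected when that share does not exceed the nominal level. Algorithm 1 may use a slightly different convention, and the function names are hypothetical.

```python
def bootstrap_p_value(stat, boot_stats):
    """Share of bootstrap statistics that are at least as large as `stat`."""
    exceed = sum(1 for b in boot_stats if b >= stat)
    return exceed / len(boot_stats)


def rejects(stat, boot_stats, level=0.05):
    """Reject the null when the bootstrap p-value does not exceed `level`."""
    return bootstrap_p_value(stat, boot_stats) <= level


# Toy illustration with B = 399 made-up bootstrap draws 0, 1, ..., 398.
toy_boot = [float(b) for b in range(399)]
p_large = bootstrap_p_value(1000.0, toy_boot)  # no bootstrap draw is this large
p_small = bootstrap_p_value(0.0, toy_boot)     # every bootstrap draw is this large
```

For a maximum-type test, `stat` would be the maximum over the subsample sequence and each entry of `boot_stats` the corresponding maximum in one bootstrap replication.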

Empirical size
We first investigate the finite sample size properties of our proposed tests. To that end, we consider the simulation DGP given by (2.1)-(2.2) with β 1,t = β 1 = 0 for all t = 1, . . . , T . Results are reported for T = 250 and T = 500. In generating the simulation data we set the intercepts β 0 and µ x in (2.1) and (2.2), respectively, to zero with no loss of generality. The autoregressive process characterising the dynamics of the putative predictor, x t , in (2.2) was initialised at ξ 0 = 0. Results are reported for a range of values of the autoregressive parameter ρ in (2.2) that cover both stationary and persistent predictors; in particular, for ρ := 1 − c/T we consider c ∈ {0, 2.5, 5, 10, 20, 0.5T }. Notice that c = 0.5T corresponds to ρ = 0.5, such that the autoregressive parameter is fixed and stable.
In our simulation DGP the innovation vector $(u_t, v_t)'$ is drawn independently across $t$ from a bivariate Gaussian distribution with mean zero and covariance matrix

$$\Sigma_t := \begin{pmatrix} \sigma^2_{ut} & \phi\,\sigma_{ut}\sigma_{vt} \\ \phi\,\sigma_{ut}\sigma_{vt} & \sigma^2_{vt} \end{pmatrix}.$$

Notice, therefore, that $\phi$ corresponds to the correlation between the innovations $u_t$ and $v_t$. Results are reported in Table 1 for the case where $\phi = 0$, and in Table 2 for the case where $\phi = -0.90$ (see footnote 7). We report results for the case where the innovations are homoskedastic, $\sigma^2_{ut} = \sigma^2_{vt} = 1$ (labelled DGP1 in Tables 1 and 2), and for the case where there is a contemporaneous one-time break of equal magnitude in the variances of $u_t$ and $v_t$. Following the simulation designs considered in Georgiev et al. (2019), two such heteroskedastic cases are considered: (i) an upward change in variance (labelled DGP2 in Tables 1 and 2) such that $\sigma^2_{ut} = \sigma^2_{vt} = 1 \cdot I(t \le \lfloor 0.5T \rfloor) + 4 \cdot I(t > \lfloor 0.5T \rfloor)$, and (ii) a downward change (labelled DGP3 in Tables 1 and 2) where $\sigma^2_{ut} = \sigma^2_{vt} = 1 \cdot I(t \le \lfloor 0.5T \rfloor) + \tfrac{1}{4} \cdot I(t > \lfloor 0.5T \rfloor)$, where in each case $I(\cdot)$ denotes the indicator function, taking the value one when its argument is true and zero otherwise. DGP2 and DGP3 allow us to examine the impact of unconditional heteroskedasticity, both in isolation and in its interaction with $\phi$, on the finite sample size of the tests. In each of DGP2 and DGP3 there is a fourfold change in variance, which is likely to be rather larger than we might expect to see in practice but serves to illustrate how the tests behave in the presence of a large change in unconditional volatility. We also considered further DGPs allowing for stationary GARCH(1,1) with different degrees of persistence coupled with either Gaussian or t-distributed innovations, thereby allowing for conditionally heteroskedastic and fat-tailed innovations. These results were qualitatively similar to those reported here for DGP1 and can be found in the supplementary material.
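The size design can be sketched in a few lines of Python. The sketch assumes that (2.1)-(2.2), which are not reproduced in this section, take the usual form $y_t = \beta_0 + \beta_{1,t} x_{t-1} + u_t$ and $\xi_t = \rho\,\xi_{t-1} + v_t$ with $x_t = \mu_x + \xi_t$, and sets $\beta_0 = \mu_x = 0$ and $\xi_0 = 0$ as in the text. The function names are hypothetical and the code is an illustration rather than the authors' MATLAB implementation.

```python
import math
import random


def variance_path(T, dgp):
    """Common variance of u_t and v_t for t = 1, ..., T under DGP1-DGP3."""
    break_point = T // 2  # floor(0.5 * T)
    late = {"DGP1": 1.0, "DGP2": 4.0, "DGP3": 0.25}[dgp]
    return [1.0 if t <= break_point else late for t in range(1, T + 1)]


def simulate_null(T, c, phi, dgp, rng):
    """Simulate (y_t, x_{t-1}) under the null beta_1 = 0 with rho = 1 - c / T."""
    rho = 1.0 - c / T
    sigma2 = variance_path(T, dgp)
    xi_prev = 0.0  # xi_0 = 0
    y, x_lag = [], []
    for t in range(T):
        sd = math.sqrt(sigma2[t])
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        u = sd * e1
        # v_t has the same variance as u_t and correlation phi with it
        v = sd * (phi * e1 + math.sqrt(1.0 - phi ** 2) * e2)
        x_lag.append(xi_prev)
        y.append(u)  # no predictability: y_t = u_t
        xi_prev = rho * xi_prev + v
    return y, x_lag


rng = random.Random(2024)  # seeded only for reproducibility
y_sim, x_sim = simulate_null(T=250, c=10.0, phi=-0.90, dgp="DGP2", rng=rng)
```

Setting `c = 0.5 * T` gives `rho = 0.5`, the weakly persistent case considered in the text.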
Consider first the results pertaining to the homoskedastic DGP1. A comparison of the results in Table 1 for $\phi = 0$ and Table 2 for $\phi = -0.90$ shows that, in the homoskedastic case at least, the correlation parameter $\phi$ has relatively little impact on the size properties of the tests. For the full sample tests there is relatively little difference between the tests based on the asymptotic $\chi^2_1$ critical value and the fixed regressor wild bootstrap. Similarly, as might be expected, there is little to choose between the versions of the full sample tests with Eicker-White standard errors and those with conventional standard errors. For the subsample tests, there is a general trend towards undersizing in the Eicker-White versions in cases where the putative predictor, $x_{t-1}$, displays persistence at or close to a unit root process. This is most pronounced in the rolling and double recursive statistics. However, this undersizing is not seen with the versions of the subsample tests based on conventional standard errors. It is well known that Eicker-White standard errors can be heavily downward biased in small samples leading to incorrectly sized tests; see, for example, MacKinnon and White (1985).

We next turn to the results for the two unconditionally heteroskedastic DGPs, DGP2 and DGP3. Consider first the full sample tests. As expected, the full sample test based on conventional standard errors and the asymptotic χ 2 smaller. The size distortions observed with $t^2_{\beta_1,NW}$ are significantly ameliorated by the use of Eicker-White standard errors ($t^2_{\beta_1}$) in all but the case of DGP2 with $c = 0$, where no apparent improvements are seen. The bootstrap implementations of the full sample tests do a much better job at controlling finite sample size, regardless of whether Eicker-White or conventional standard errors are used, although some over-sizing is still seen for $\phi = -0.90$ when $c = 0$. There appears to be no need to use Eicker-White standard errors with the fixed regressor bootstrap implementation of the full sample test. Consider next the subsample predictability tests. Undersizing, in many cases substantial, is again seen with the subsample bootstrap tests based on Eicker-White standard errors. As with the full sample tests, these effects tend to be larger, other things equal, for $\phi = -0.90$ vis-à-vis $\phi = 0$. As with the results for DGP1, the subsample bootstrap tests based on conventional standard errors are much less prone to this undersizing phenomenon, albeit some undersizing is seen in the case of DGP2 with $\phi = -0.90$ for small values of $c$, most notably for the rolling and double recursive tests. Moreover, under DGP3 with $\phi = -0.90$ some oversizing is seen in the persistent $x_{t-1}$ cases for the backward recursive, rolling and double recursive tests. For $\phi = 0$ all of the subsample bootstrap tests implemented with conventional standard errors appear to display good finite sample size control.

6 We also considered tests based on using the fractionally integrated instrument suggested on page 363 of Breitung and Demetrescu (2015) for $z_{I,t-1}$. We do not report these results here as the IVX choice performed better in our results, but they can be obtained from the authors on request.

7 In predictive regression models for the equity premium employing valuation ratios as predictors (e.g. the dividend-price ratio, earnings-price ratio), as we shall do in the empirical application in Section 7, the relevant innovation terms are strongly negatively correlated, hence our choice of $\phi = -0.90$.

Notes: A superscript * denotes tests run using the fixed regressor wild bootstrap outlined in Algorithm 1; $t^2_{\beta_1}$ and $t^2_{\beta_1,NW}$ denote the full sample IV-combination predictability tests of Breitung and Demetrescu (2015) based on the 5% asymptotic critical value from the $\chi^2_1$ distribution and computed with Eicker-White [EW] and conventional standard errors, respectively, and $t^{2*}_{\beta_1}$ and $t^{2*}_{\beta_1,NW}$ their bootstrap analogues; $T^{f*}$, $T^{b*}$ and $T^{f*}_{NW}$, $T^{b*}_{NW}$ denote the maximum forward and backward recursive tests computed with EW and conventional standard errors, respectively; $T^{r*}$ and $T^{r*}_{NW}$ denote the maximum rolling tests computed with EW and conventional standard errors, respectively; $T^{d*}$ and $T^{d*}_{NW}$ denote the maximum double recursive tests computed with EW and conventional standard errors, respectively.

Table 2. Empirical Rejection Frequencies under the Null Hypothesis, $H_0$. Nominal 5% significance level. DGP1-DGP3 with $\phi = -0.90$. Notes: see Table 1.

Finite sample local power
We now turn to an investigation into the relative finite sample local power properties of the tests. We again generate simulation data from DGP (2.1)-(2.2) but now for a variety of local alternatives satisfying H 1,b(·) of (2.4). To keep the set of results to a manageable level we report results only for φ = −0.90, for the homoskedastic case, σ 2 ut = σ 2 vt = 1, for a sample In all of our experiments the slope parameter β 1t in (2.1) is set to be local-to-zero. As specified by Assumption 2, for c = 0 and c = 10, where x t is strongly persistent, we parameterise the slope parameter in (2.1) as β 1t = b 1t /T , and here we consider the following values of the Pitman drift parameter, b 1t ∈ {0, 5, . . . , 80}. For the case of a weakly dependent predictor, c = 0.5T , we parameterise the slope parameter as β 1t = b 1t / √ T , and here we consider the Pitman drift values b 1t ∈ {0, 1, . . . , 21}.
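The two local-to-zero scalings above differ only in the rate at which the slope shrinks with the sample size. A minimal Python sketch, with a hypothetical function name and made-up argument values chosen purely for illustration:

```python
import math


def local_slope(b, T, strongly_persistent):
    """Slope implied by Pitman drift b: b / T if x_t is strongly persistent,
    b / sqrt(T) if x_t is weakly dependent."""
    return b / T if strongly_persistent else b / math.sqrt(T)


slope_strong = local_slope(50, 250, strongly_persistent=True)
slope_weak = local_slope(10, 400, strongly_persistent=False)
```

The faster $1/T$ rate for strongly persistent predictors reflects the larger signal carried by a near-integrated regressor, so a smaller slope is needed to produce non-trivial power.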
We report results for three distinct experimental cases, where episodes of predictability occur once in the sample either at the beginning, the end or within the sample. To that end, we consider the following three simulation DGPs, with the range of non-zero values of b 1t as outlined above, Case 1: All other aspects of the simulation design are as described previously. Figs. 1-3 graph the simulated finite sample local power curves for each of Cases 1-3, respectively. Each figure contains power curves for the fixed regressor wild bootstrap implementations of the full sample t 2 * β 1 ,NW test along with the subsample-based predictability tests T f * NW , T b * NW , T r * NW and T d * NW . To aid presentation of the graphs, we have chosen only to report the versions of the bootstrap tests implemented with conventional standard errors. Results with Eicker-White standard errors are available on request. In general the latter were less powerful (often considerably so) than the reported tests based on conventional standard errors.
Consider first the results pertaining to Case 1 in Fig. 1. Recall from Section 4.2 that the temporary predictability DGP in Case 1, with a pocket of predictability at the start of the sample, is one where we expect the forward recursive $T^{f*}_{NW}$ test to perform best. Fig. 1  test, for all of the values of c considered. The rolling test, $T^{r*}_{NW}$, displays a similar power profile to $T^{d*}_{NW}$ for $c = 0$ and $c = 0.5T$, but is significantly less powerful than $T^{d*}_{NW}$ for $c = 10$. The least powerful tests among those considered are the backward recursive test, as expected, and the full sample $t^{2*}_{\beta_1,NW}$ test. To illustrate, the empirical power of $t^{2*}_{\beta_1,NW}$ at a Pitman drift value of 50 is approximately 50% for both $c = 0$ and $c = 10$, while for $T^{f*}_{NW}$ it is around 75%. For $c = 0.5T$ and a Pitman drift value of 10, the power of $t^{2*}_{\beta_1,NW}$ is about 55% while that of $T^{f*}_{NW}$ is in excess of 95%. In the latter example both the rolling ($T^{r*}_{NW}$) and double recursive ($T^{d*}_{NW}$) tests have power of approximately 80%.

Consider next the results for Case 2, given in Fig. 2, where the pocket of predictability now occurs at the end of the sample. When $x_t$ is weakly persistent the simulation DGP is approximately time-reversible. As such, we would anticipate that the forward and backward recursive tests swap roles, while all of the other tests behave as they did in Case 1 for the weakly dependent case. This is clearly seen to be the case in Fig. 2(c), with the backward recursive test now clearly the most powerful, the forward recursive test the least powerful, and the other tests all displaying almost identical power properties in Figs. 1(c) and 2(c). These patterns are also seen, albeit not as clearly, in a comparison of Figs. 1(b) and 2(b); the main difference is that most of the tests (although not the double recursive test) tend to be slightly more powerful for $c = 10$ vis-à-vis $c = 0.5T$.
The pattern of a general increase in power of the tests as c decreases for an end-of-sample pocket of predictability is very clearly continued in Fig. 2(a) for the case where c = 0 and x t follows a pure unit root. Here, comparing with Fig. 1(a), we see that all of the tests display considerably higher local power against an end-of-sample pocket of predictability than against a pocket of predictability at the start of the sample, and, comparing with Figs. 2(b) and 2(c), that the power of the tests is considerably higher than for c = 10 and c = 0.5T . A possible explanation for this improvement in power is the shape of the non-centrality term, ∫ τ 2 τ 1Z τ 1 ,τ 2 (s)b(s)J c,H (s)ds, entering the limiting distributions of the statistics under local alternatives in the case where x t is strongly persistent. Clearly, end of sample predictability will be boosted from the larger magnitude of J c,H (τ ) when τ is close to 1, and this will be most evident when c = 0. Interestingly, the full sample t 2⋆ β 1 ,NW test displays competitive power in Fig. 2(a) although it should be recalled from Table 2 that t 2⋆ β 1 ,NW is significantly over-sized in this case while the subsample tests are not.
Finally, the results in Fig. 3 pertain to Case 3, where the simulation DGP admits a window of predictability of size $\lfloor 2T/5 \rfloor$ within the sample. Here the double recursive test, $T^{d*}_{NW}$, displays superior power to the other tests considered for both $c = 10$ and $c = 0.5T$ (Figs. 3(b) and 3(c) respectively), and is jointly most powerful along with the forward recursive $T^{f*}_{NW}$ test for $c = 0$ (Fig. 3(a)). Notice also that for a given value of $c$, $T^{d*}_{NW}$ displays considerably higher power under Case 3 than it does under both Cases 1 and 2. This is expected given that a larger window of predictive data is now present in the sample which the double recursive procedure is best able to exploit. Indeed, most of the tests considered display improved power performance compared to Figs. 1 and 2. This is particularly evident for the rolling test, $T^{r*}_{NW}$, and again is to be expected given that a greater number of the subsample predictability statistics in the rolling sequence will contain data from a predictive period relative to the DGPs in Cases 1 and 2. For Case 3, the $T^{b*}_{NW}$ test (as expected, given that the window of predictability begins early in the sample) and the full sample $t^{2*}_{\beta_1,NW}$ test display the lowest power among the tests considered.

Table 3 Application to updated (Welch and Goyal, 2008) data: bivariate regressions -(1950:01-2017

Empirical application
The data set used consists of monthly observations on the equity premium for the S&P Composite index calculated using CRSP's month-end values together with 14 different putative predictors, generically denoted $x_t$, and is taken from the updated monthly data set on Amit Goyal's website (www.hec.unil.ch/agoyal/), which is an extended version of the data set used by Welch and Goyal (2008). The data cover the period 1950:01-2017:12 ($T = 817$). We define the equity premium as in Goyal and Welch (2003) as the log return on the value-weighted CRSP stock market index minus the log return on the risk-free Treasury bill: $y_t = ep_t = \log(1 + R_{m,t}) - \log(1 + R_{f,t})$, where $R_{m,t}$ is the CRSP return and $R_{f,t}$ is the Treasury bill return. The variables are in log form (as in Goyal and Welch, 2003) and each of the predictors is lagged one period. A full list of the predictors together with graphs of the excess returns and the predictors can be found in the supplementary material.

Table 3 reports the outcome of the conventional IV-combination test from bivariate predictive regression models applied to the full sample of data. We report versions of the statistic using Eicker-White ($t^2_{\beta_1}$) and conventional ($t^2_{\beta_1,NW}$) standard errors. All of the IV-based test statistics computed in the empirical analysis follow the same specification as was used in the Monte Carlo experiments; that is, they are based on a combination of the IVX instrument, $z_{I,t-1}$ (as defined in (4.2), with $a = 1$ and $\gamma = 0.95$), and the sine instrument, $z_{II,t-1}$ (as defined in (4.1) with $k = 1$), with all of the observed variables and $z_{II,t-1}$ (but not $z_{I,t-1}$) entering the estimated predictive regressions demeaned, and with the finite-sample correction factor of Kostakis et al. (2015, p. 1516) implemented. Fixed regressor wild bootstrap p-values computed according to Algorithm 1 with 999 bootstrap replications are reported in parentheses.
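The construction of the dependent variable and the one-period lag of the predictor can be sketched in Python as follows. The function names and the numerical inputs are hypothetical and purely illustrative; they are not taken from the Welch and Goyal data files.

```python
import math


def equity_premium(r_market, r_riskfree):
    """Log equity premium: log(1 + R_m) - log(1 + R_f)."""
    return math.log1p(r_market) - math.log1p(r_riskfree)


def lagged_pairs(y, x):
    """Pair y_t with x_{t-1}, dropping the first observation of y."""
    return list(zip(y[1:], x[:-1]))


# Made-up monthly simple returns, for illustration only.
market = [0.020, -0.010, 0.015]
riskfree = [0.003, 0.003, 0.004]
ep = [equity_premium(m, f) for m, f in zip(market, riskfree)]
```

Pairing `ep` with a lagged predictor through `lagged_pairs` yields the observations that enter each bivariate predictive regression.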
For most of the putative predictors considered, the results in Table 3 yield no statistically significant evidence of predictability. Exceptions are seen for the treasury bill rate (tbl t−1 ), the long term government bond yield and rate of return series (lty t−1 and ltr t−1 , respectively), and inflation (infl t−1 ) all of which are significant at the 5% level. Rejections of the null of no predictability are also seen at the 10% level for the term spread (tms t−1 ) and the equity premium volatility (rvol t−1 ) series.
To provide an insight into how stable the full sample predictive regressions are, Table 3 also reports the tests proposed in Georgiev et al. (2018) for the stability of the slope coefficient in the bivariate predictive regression of the equity premium on each (lagged) predictor. These tests are denoted LM x and supF x . The former is designed to test for the stability of the slope coefficient against a smoothly evolving slope change model and the latter against a one-time change in the slope. Bootstrap p-values calculated as outlined in Georgiev et al. (2018) for 999 bootstrap replications are reported in parentheses. Significant rejections at the 5% level by at least one of these tests are observed for the predictive regressions involving the dividend price ratio (dp t−1 ), dividend yield (dy t−1 ), earnings price ratio (e/p t−1 ), book to market ratio (bm t−1 ), term spread (tms t−1 ) and infl t−1 . The rejections seen for dp t−1 and e/p t−1 are particularly strong. A rejection at the 10% level is also seen for the net equity expansion ratio (ntis t−1 ) predictor. Interestingly, for three of the four series (tbl t−1 , lty t−1 and ltr t−1 ) for which the full sample IV-combination tests are significant at the 5% level these stability tests provide no evidence of structural instability in the slope coefficient.
To provide some additional insight into any time-varying behaviour present in the slope coefficients, Figs. 4 and 5 plot forward recursive and rolling IV slope estimates (using the same choice of instruments as detailed above for the full sample IV-combination tests) from the predictive regression of $y_t$ on $x_{t-1}$, together with associated approximate 95% marginal confidence bands (see footnote 8). The warm-in fraction for the recursive sequence, $\tau_L$, and the rolling window fraction, $\Delta\tau$, were both set at 1/4. In each case the horizontal axis dates correspond to the end of a given subsample. Commensurate with the results of the stability tests of Georgiev et al. (2018), these graphs highlight considerable time variation in the sequences of subsample slope estimates. A general pattern evident in Fig. 4 is a decline over time in the absolute value of the estimated slope coefficient, with the recursive slope estimates generally tending to move closer to zero over time. This pattern can also be seen, albeit less clearly, in the rolling estimates in Fig. 5. This suggests that for some of these variables, any predictive ability they might have for the equity premium weakens over time. As a further heuristic device, rather than a formal statistical test, many of the graphs show some periods where the 95% marginal confidence intervals do not include zero, which is at least suggestive that pockets of predictability may be present in the data. Most of these episodes occur nearer the start of the data, such as, for example, with $dy_{t-1}$, but some are much longer lived as with, for example, the sequences of recursive estimates for $tbl_{t-1}$, $ltr_{t-1}$, $tms_{t-1}$ and $infl_{t-1}$; recall that for $tbl_{t-1}$, $ltr_{t-1}$ and $infl_{t-1}$, the full sample IV-combination tests gave significant rejections at the 5% level.
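The heuristic check described above, namely whether a marginal 95% band excludes zero at a given subsample end date, can be sketched in Python as follows. The band is the usual estimate plus or minus 1.96 standard errors; the function names and the numbers are hypothetical.

```python
def marginal_band(beta_hat, se, z=1.96):
    """Approximate marginal confidence band for a single slope estimate."""
    return beta_hat - z * se, beta_hat + z * se


def excludes_zero(beta_hat, se, z=1.96):
    """True when the marginal band lies entirely on one side of zero."""
    lower, upper = marginal_band(beta_hat, se, z)
    return lower > 0.0 or upper < 0.0


# Made-up recursive slope estimates and standard errors.
estimates = [(0.50, 0.20), (0.30, 0.20), (-0.50, 0.20)]
flags = [excludes_zero(b, s) for b, s in estimates]
```

Because each band is marginal, a run of `True` flags is only suggestive: it does not control size across the whole sequence, which is the motivation for the maximum-type tests applied next.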
To pursue these findings further using statistically rigorous size-controlled methods, we next apply our proposed subsample-based predictability statistics. We report versions of the statistics using Eicker-White (T f , T b , T r and T d ) and conventional (T f NW , T b NW , T r NW and T d NW ) standard errors. Fixed regressor wild bootstrap p-values computed according to Algorithm 1 with 999 bootstrap replications are again reported in parentheses. In the computation of the forward and backward recursive statistics we set τ L = 1/4 and τ U = 3/4, respectively, while we set ∆τ = 1/4 for the rolling and double recursive statistics. The instruments used are as described above for the full sample statistics. Focusing on the forward recursive tests we see significant rejections at the 5% level (or stricter) of the null hypothesis of no predictability for each of dp t−1 , dy t−1 , e/p t−1 , de t−1 , tbl t−1 , lty t−1 , ltr t−1 , tms t−1 and infl t−1 ; indeed, in many cases these rejections are also significant at the 1% level. While these rejections tally with those delivered by the full sample test for tbl t−1 , lty t−1 , ltr t−1 and infl t−1 , for the other series, all of which (other than de t−1 ) fail the structural stability tests of Georgiev et al. (2018), these are series for which the full sample tests delivered no significant evidence of predictability. With the exception of dp t−1 and e/p t−1 , those series for which T f NW delivers a rejection at the 5% significance level also show rejections at the 5% level for at least one of the other subsample maximum tests reported. Additional evidence of temporary predictability at the 5% level (or stricter) is provided for dfy t−1 by both T r NW and T d NW (notice that for this series the supF x test is in fact very close to giving a rejection at the 10% level). A significant rejection at the 10% level is also provided for ntis t−1 by the T r NW test.
To gain further insight, Fig. 6 graphs the forward recursive sequences of $t_{\beta_1}(\tau_1, \tau_2)$ subsample statistics for each case where a rejection at the 5% level is observed for the corresponding maximum test (see footnote 9). Also reported on these graphs are the 5% and 10% bootstrap critical values for the null distribution of the maximum statistic in the sequence, together with the 5% and 10% critical values from the $\chi^2_1$ distribution (the marginal critical values which apply for any given subsample). Consider first the graph in part (a) of Fig. 6 for the dividend price ratio, $dp_{t-1}$. Looking at the time path of the forward recursive subsample statistic, we can see that for much of the first half of the sequence (up until roughly the early 1980s) the statistic exceeds the $\chi^2_1$ 5% critical value, suggesting that running the IV-combination test on any subsample of the data selected up until this point would have delivered a significant rejection at the (marginal) 5% level. After this sample endpoint no significant evidence of predictability would have been found. We can also see that a large number of exceedances of the 10% bootstrap critical value for the maximum are seen in the early part of the data, with exceedances of the 5% bootstrap critical value also seen, most notably in the mid 1970s. These results suggest that a pocket of predictability for returns existed for the predictor $dp_{t-1}$ in the 1970s, with peak predictability seen in the middle of that decade, and that from the 1980s onwards predictability appears to have evaporated. For the dividend yield, $dy_{t-1}$, a pocket of predictability appears to be present again from the early 1970s but lasting much longer, and with apparently stronger magnitude, displaying many more contiguous exceedances of the bootstrap critical values for the maximum than were seen for $dp_{t-1}$; indeed, here predictability appears to run until the early to mid 1990s. From the mid to late 1990s onwards the evidence for predictability disappears.

Evidence for both the earnings price ratio, $e/p_{t-1}$, in part (c) and the dividend pay out ratio, $de_{t-1}$, in part (d) is less strong than for the previous two series (reflected in the considerably larger p-values for the maximum statistics for those series in Table 3), but again the period of predictability appears to be concentrated in the mid 1970s. For both the treasury bill rate, $tbl_{t-1}$, in part (e) and the long term bond yield, $lty_{t-1}$, in part (f) there appears to be evidence of predictability across a window from the early 1970s until the mid 1980s, albeit the strength of predictability appears to waver somewhat over this period, particularly so for $lty_{t-1}$. For both of these series, there is also evidence that predictability is re-emerging from around the period of the recent financial crisis onwards, most notably so for $tbl_{t-1}$, where a number of exceedances of the bootstrap critical values occur. In the case of $tbl_{t-1}$, running the IV-combination test on almost any subsample of the data would yield a rejection at the 5% level using the marginal $\chi^2_1$ critical value. This observation is also true for the long term rate, $ltr_{t-1}$, in part (g) and for inflation, $infl_{t-1}$, in part (i). Recall that these are the three series for which the full sample IV-combination tests gave significant rejections at the 5% level. Finally, for the $tms_{t-1}$ series in part (h) predictability appears evident and consistently strong up until the mid 1990s, after which the magnitude of predictability starts to tail off and then falls markedly around the time of the financial crisis onwards. In contrast, the full sample tests reveal no significant evidence (at the 5% level) of predictability from $tms_{t-1}$.

8 Denoting the IV slope estimate as $\hat{\beta}_1$, the confidence bands were computed as $\hat{\beta}_1 \pm 1.96\,se(\hat{\beta}_1)$, where $se(\hat{\beta}_1)$ are the associated IV Eicker-White standard errors. These confidence bands should, however, be treated with caution as they are not joint 95% confidence bands for the entire sequence of slope estimates, but rather represent the marginal 95% confidence band at each point in the sequences of estimated slope coefficients.

9 Where both the maximum tests based on Eicker-White and conventional standard errors reject we report the version with the smallest p-value; cf. Table 3. Corresponding graphs for the rolling sequences are available on request.
These examples highlight the advantage of considering the recursive sequence of statistics and their evolution through time rather than just full sample IV-combination tests, with much stronger evidence for predictability earlier in the sample than later for a number of the predictors considered.
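The comparison drawn in Fig. 6 between the marginal $\chi^2_1$ critical values and the bootstrap critical values for the maximum can be sketched in Python as follows. The $\chi^2_1$ values 3.841 (5%) and 2.706 (10%) are standard; the empirical-quantile rule used for the bootstrap critical value is an assumption made for illustration, and the function names and numbers are hypothetical.

```python
import math

# Standard upper-tail critical values of the chi-squared(1) distribution.
CHI2_1_CRIT = {0.05: 3.841, 0.10: 2.706}


def bootstrap_critical_value(boot_max_stats, level):
    """Empirical (1 - level) quantile of the bootstrap maximum statistics."""
    ordered = sorted(boot_max_stats)
    k = math.ceil((1.0 - level) * len(ordered)) - 1
    return ordered[k]


def exceedances(sequence, critical_value):
    """Positions in a subsample sequence where the statistic exceeds the critical value."""
    return [i for i, s in enumerate(sequence) if s > critical_value]


# Made-up sequence of subsample statistics and bootstrap maxima.
sequence = [1.0, 4.0, 2.0, 9.5]
boot_max = [float(i) for i in range(1, 11)]
marginal_hits = exceedances(sequence, CHI2_1_CRIT[0.05])
joint_cv = bootstrap_critical_value(boot_max, 0.10)
joint_hits = exceedances(sequence, joint_cv)
```

In this toy example two subsamples exceed the marginal critical value but only one exceeds the critical value for the maximum, mirroring the distinction between marginal and size-controlled evidence drawn in the discussion of Fig. 6.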

Conclusions
Recent research has suggested that should stock returns be predictable, then this is likely to be a temporary phenomenon. Our motivation has been to develop tests with good power to detect such episodes. To avoid the problem of endogenously-determined sample splits, our proposed tests are derived from sequences of predictability statistics calculated over systematic subsamples of the data. The tests are based on the maxima of the instrumental variable-based predictability statistics of Breitung and Demetrescu (2015) taken across sequences of forward and backward recursive, rolling, and double-recursive predictive regressions. The limiting distributions of these statistics were shown to depend both on any heteroskedasticity present and on whether the putative predictor follows a near-integrated or weakly dependent process. To account for these dependencies, fixed regressor wild bootstrap implementations of the tests were proposed and shown to be first-order asymptotically valid. Monte Carlo simulation demonstrated that the tests display decent finite sample size control, and can be considerably more powerful in detecting temporary predictability than full sample tests. An empirical application to a well-known US monthly stock returns data set highlighted the ability of the new tests to detect predictability within the data where full sample tests could not.
We conclude with two suggestions for further research. First, we have focussed on tests based on subsample implementations of the IV-combination statistics of Breitung and Demetrescu (2015) which use two instruments per predictor. It should be possible to apply the same approach to subsample implementations of statistics which use only one instrument, such as the statistics considered in section 2.2 of Breitung and Demetrescu (2015) or the IVX statistic of Kostakis et al. (2015). Second, our proposed tests are based on an approach which assumes a linear predictive regression model with a constant slope parameter within the given subsample window and then bases a test on the fluctuations seen in the sequence of such statistics over a range of subsamples. As such, this approach is agnostic about the true form of any time-variation present in the slope parameter and so would be expected to have reasonable power against a wide range of patterns of time-variation in the slope parameter, including those generated by threshold or other non-linear DGPs. LM-type tests could be developed based on an assumed non-linear model for the time-variation in the slope. Such tests would be expected to be more powerful than the tests developed here where the assumed model coincided with, or was at least a close approximation to, the true (unknown) DGP, but would likely have much lower power if the true DGP were not well approximated by the model.