Simple tests for stock return predictability with good size and power properties ✩

We develop easy-to-implement tests for return predictability which, relative to extant tests in the literature, display attractive finite sample size control and power across a wide range of persistence and endogeneity levels for the predictor. Our approach is based on the standard regression t-ratio and a variant where the predictor is quasi-GLS (rather than OLS) demeaned. In the strongly persistent near-unit root environment, the limiting null distributions of these statistics depend on the endogeneity and local-to-unity parameters characterising the predictor. Analysis of the asymptotic local power functions of feasible implementations of these two tests, based on asymptotically conservative critical values, motivates a switching procedure between the two, employing the quasi-GLS demeaned variant unless the magnitude of the estimated endogeneity correlation parameter is small. Additionally, if the data suggests the predictor is weakly persistent, our approach switches to the standard t-ratio test with reference to standard normal critical values.


Introduction
A large body of empirical research has been undertaken investigating whether stock returns can be predicted using publicly available data. A wide range of financial and macroeconomic variables has been considered as putative predictors for returns, including: valuation ratios such as the dividend-price ratio, dividend yield, earnings-price ratio, and book-to-market ratio; various interest rates and interest rate spreads; and macroeconomic variables including inflation and industrial production; see, for example, Fama (1981), Keim and Stambaugh (1986), Campbell (1987), Campbell and Shiller (1988a,b), French (1988, 1989) and Fama (1990).
Empirical evidence on the predictability of returns largely derives from inference obtained from predictive regressions and, as such, the size and power properties of tests from these regressions are of fundamental importance. These depend on the time series properties of the predictor used, in particular its degree of persistence and endogeneity. Data analysis presented in, among others, Campbell and Yogo (2006) [hereafter CY] and Welch and Goyal (2008), suggests that many of the variables used in predictive regressions are highly persistent with autoregressive roots close to unity, and that a strong negative correlation often exists between returns and the predictor's innovations, such that the predictive regressor is endogenous.

✩ We dedicate this paper to Pierre Perron and the enormous contributions he has made to the science of econometrics. All three of us have benefited hugely not only from Pierre's intellectual contributions to the discipline, but also from his professional generosity and kindness. We are grateful to two anonymous referees and the Editor, Serena Ng, for their helpful and constructive comments on previous drafts of this paper. Taylor gratefully acknowledges financial support provided by the Economic and Social Research Council of the United Kingdom under research grant ES/R00496X/1.
A number of likelihood-based predictability tests have been developed, designed to be asymptotically valid when the predictor is strongly persistent and endogenous; see, in particular, Cavanagh et al. (1995), Lewellen (2004), CY and Jansson and Moreira (2006). These approaches are based on a formulation where the predictor, x t−1 say, is assumed to follow a first-order autoregression with a local-to-unity coefficient φ = 1 − c/T , where c is a finite unknown constant and T is the sample size. Of these, the Q test of CY is widely viewed as the state of the art methodology in the literature for testing the predictability of stock returns with highly persistent regressors. A major drawback with these tests, however, is that they are invalid if the predictor is weakly persistent (stationary). Alternative tests based on instrumental variable [IV] estimation have also been developed; see, among others, Phillips and Magdalinos (2009), Kostakis et al. (2015) and Breitung and Demetrescu (2015). Here a stochastic instrument is constructed from the predictor which, by design, is less persistent than a local-to-unity process. The IV-based tests are asymptotically valid regardless of whether the predictor is local-to-unity or weakly persistent, but their power is not as high as the likelihood-based tests when the predictor is strongly persistent. Breitung and Demetrescu (2015) therefore also propose a combined instrument test using two instruments: the first as described above, the second a trending variable independent of the predictor. This test is designed such that, in large samples, it selects the second instrument when the predictor is local-to-unity but reverts to the first instrument otherwise. A significant drawback, however, is that it can only be implemented as a two-tailed test and so if the direction of predictability is known, it can have significantly lower power than one-sided tests.
An alternative approach, designed to retain good power regardless of whether the predictor is weakly or strongly persistent, is considered in Elliott et al. (2015) [hereafter EMW]. EMW note that as the local-to-unity parameter c → ∞, the predictive regression essentially reduces to a standard time-series regression with a weakly dependent regressor. Consequently, standard likelihood-based inference, in particular a test comparing the regression t-ratio with standard normal critical values, is an appropriate methodology. They therefore propose a hybrid test which switches between a test based on a weighted average (local asymptotic) power criterion valid when c is "small" but reverts to a standard time-series test when c is "large". In practice the choice of switching function is necessarily arbitrary; EMW propose a switching rule based on an estimate of c. The weighted average power criterion test adopted by EMW is computationally involved, and the test is also based on the assumption that the predictor cannot be locally explosive (i.e. negative values of c are not allowed), an assumption not required for the tests of CY, Kostakis et al. (2015) or Breitung and Demetrescu (2015). In this paper we explore further how one can develop an approach to predictive regression testing which retains both good size properties and strong power profiles regardless of the degree of persistence of the predictor. Our approach is focused on easy-to-implement tests using regression t-ratios. In the near-unit root case, we base our proposed testing strategy on the use of two t-statistics: the first is the standard t-ratio test discussed above, the second is one where the predictor has been demeaned using the quasi-GLS demeaning method of Elliott et al. (1996), rather than OLS demeaning as with the standard t-ratio. The limiting null distributions of these statistics depend on both the endogeneity correlation parameter and the local-to-unity parameter characterising the predictor. 
We therefore propose a feasible method for obtaining asymptotically conservative critical values and provide response surfaces for practical use. An analysis of the asymptotic local power functions of the resulting conservative tests shows that in the empirically most relevant case where a significant negative correlation exists between returns and the predictor's innovations, the test for positive predictability based on quasi-GLS demeaning is significantly more powerful than that based on OLS demeaning. This relationship reverses when testing for negative predictability. Consequently, when testing for positive predictability, our recommended procedure in the near-unit root environment is to use the conservative standard t-ratio when the estimated endogeneity correlation is either positive or "small" and negative, but to use the conservative test based on the quasi-GLS t-ratio otherwise. Further, in common with EMW, if the data suggest the predictor is weakly persistent, we propose switching into the standard t-ratio test with reference to standard normal critical values. However, in contrast to EMW, we do not base our switching function on an (inconsistent) estimate of c, but rather on the familiar augmented Dickey-Fuller [ADF] normalised bias coefficient unit root test, with MBIC lag selection as developed in Ng and Perron (2001). Our approach has the advantage of not needing to exclude the possibility of locally explosive predictors, and we show that our recommended procedure delivers effective finite sample size control and attractive power profiles across a wide range of correlation parameters and degrees of predictor persistence.
The remainder of the paper is organised as follows. Section 2 introduces the predictive regression model which we will consider in this paper together with the assumptions we place on this data generating process [DGP]. In Section 3 we present the details of our hybrid switching-based test procedure and establish its asymptotic properties. Here we also outline our method for obtaining asymptotic critical values. In Section 4 we investigate the finite sample size and power properties of our proposed hybrid test, comparing with the leading tests in the literature. These results suggest that the newly proposed hybrid test performs well and compares very favourably with extant tests, including its most obvious comparator test from EMW, offering simple yet highly effective methods for predictability testing. Section 5 contains a short empirical example using monthly U.S. stock returns data. Section 6 concludes. An on-line supplementary appendix contains a proof of Theorem 1 and additional material relating to the numerical simulation studies in Sections 3.2 and 4.

The predictive regression model
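Let $y_t$ denote the (excess) stock return in period $t$ and $x_{t-1}$ denote a scalar variable observed at time $t-1$ which is considered to be a putative predictor for $y_t$. Following Kostakis et al. (2015) and Jansson and Moreira (2006), among others, the predictive regression model we consider is specified as

$$ y_t = \alpha_y + \beta x_{t-1} + \epsilon_{yt}, \quad t = 2, \ldots, T, \tag{1} $$

where $x_t$ is an observed process, specified according to

$$ x_t = \alpha_x + s_t, \quad t = 1, \ldots, T, \tag{2} $$

$$ s_t = \phi s_{t-1} + \psi(L)\epsilon_{xt}, \quad t = 2, \ldots, T, \tag{3} $$

where $\psi(L) := 1 + \sum_{j=1}^{\infty} \psi_j L^j$ satisfies $\psi(1) \neq 0$ and $\sum_{j=1}^{\infty} j|\psi_j| = \bar{\psi} < \infty$, and where it is assumed that $s_1$ is a mean zero $O_p(1)$ random variable. The innovations $\epsilon_t := (\epsilon_{xt}, \epsilon_{yt})'$ are assumed to form a (bivariate) martingale difference sequence with respect to the natural filtration, with time-invariant conditional variances $\sigma_x^2$ and $\sigma_y^2$ and conditional covariance $\sigma_{xy}$, and to satisfy a finite-moment condition on $\|\epsilon_t\|$ that holds for some $\kappa > 0$, $\|\cdot\|$ denoting the Euclidean norm. We define the correlation between the innovations to be $\rho_{xy} := \sigma_{xy}/(\sigma_x \sigma_y)$.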
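For readers who wish to experiment with the model, the following is a minimal simulation sketch of a predictive regression with a first-order autoregressive predictor. It is an illustration rather than the authors' code, and it rests on simplifying assumptions that are not part of the text: Gaussian innovations with unit variances, no short-run dynamics in the predictor (ψ(L) = 1), a zero initial condition, and a hypothetical function name `simulate_predictive_dgp`.

```python
import numpy as np

def simulate_predictive_dgp(T, beta=0.0, c=5.0, rho_xy=-0.9,
                            alpha_y=0.0, alpha_x=0.0, seed=0):
    """Simulate (y, x) with y[t] driven by x[t-1] and phi = 1 - c/T.

    Assumptions of this sketch (not taken from the paper): Gaussian
    innovations, unit innovation variances, psi(L) = 1 and s_1 = 0.
    """
    rng = np.random.default_rng(seed)
    phi = 1.0 - c / T
    eps_x = rng.standard_normal(T)
    # Build eps_y so that corr(eps_x, eps_y) = rho_xy.
    eps_y = rho_xy * eps_x + np.sqrt(1.0 - rho_xy**2) * rng.standard_normal(T)

    s = np.zeros(T)
    for t in range(1, T):
        s[t] = phi * s[t - 1] + eps_x[t]
    x = alpha_x + s

    y = np.full(T, np.nan)          # y[0] is undefined: there is no x[-1]
    y[1:] = alpha_y + beta * x[:-1] + eps_y[1:]
    return y, x
```

Setting `c` to a negative value gives a locally explosive predictor, while a large positive `c` relative to `T` moves the predictor towards the weakly persistent case.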
Our interest in this paper centres on developing tests of the null hypothesis that $y_t$ is not predictable by $x_{t-1}$, i.e. $H_0: \beta = 0$ in (1). The alternative hypothesis is that $y_t$ is predictable by $x_{t-1}$, in which case $\beta \neq 0$. Moreover, these tests need to allow the shocks driving the predictor, $\epsilon_{xt}$ in (3), to be correlated with the unpredictable component of stock returns, $\epsilon_{yt}$ in (1), as occurs when $\rho_{xy} \neq 0$. As discussed in the Introduction it is important for practical purposes that the tests we develop are efficacious without knowledge of whether the predictor variable $x_t$ in (1) is weakly or strongly persistent. Formalising, we therefore allow $\phi$ in (3) to satisfy one of the following two assumptions:

Assumption S. Strongly persistent predictor: The autoregressive parameter $\phi$ in (3) is local-to-unity with $\phi := 1 - c/T$, where $c$ is a finite constant.

Assumption W. Weakly persistent predictor: The autoregressive parameter $\phi$ in (3) is fixed and bounded away from unity, $|\phi| < 1$.
Remark 1. Many commonly used predictors are strongly persistent, exhibiting sums of sample autoregressive coefficients which are close to or only slightly smaller than unity. Near-integrated asymptotics have been found to provide better approximations for the behaviour of test statistics in such circumstances; see, inter alia, Elliott and Stock (1994). However, not all (putative) predictors are strongly persistent and a large part of the literature works with models which take x t to be generated from a stable autoregressive process; see, for example, Amihud and Hurvich (2004). We therefore allow for either of these possibilities to hold for x t . □

Remark 2. Assumption S also allows for the case where c < 0 such that x t is locally explosive. While some predictive regression tests in the literature, including the tests proposed in EMW and Lewellen (2004), impose the condition that c ≥ 0 (equivalently, φ ≤ 1), CY, p. 54, provide a discussion on why it might not be sensible to restrict c to be nonnegative in practice. Moreover, in their empirical analysis CY find that many of the predictors they consider, most notably the dividend-price ratio, have confidence intervals for φ that include values greater than 1. This may well be a result of local explosivity in the price series, as is well documented in the literature on financial bubbles; see, among others, Phillips et al. (2011, 2015). □

Remark 3. The conditions placed on the errors above essentially coincide with Assumption INNOV(i) of Kostakis et al. (2015, p. 1512) and impose conditional homoskedasticity on ϵ t . This is done to simplify our presentation, but it would be possible to allow for conditional heteroskedasticity of the form considered in Assumption A.1 of CY without altering the large sample results which follow under Assumption W. Under Assumption S, as in Kostakis et al. (2015, p. 1516), an assumption of the form given in their INNOV(ii), op. cit., p. 1512, would be needed and the predictive regression t-ratios we discuss in Section 3 would need to be implemented using White standard errors rather than OLS standard errors. □

Regression-based predictability tests
The simplest possible regression-based test for $H_0: \beta = 0$ is based on the t-ratio associated with the OLS estimate of $\beta$ from (1). Defining $\hat{\alpha}_y := (T-1)^{-1}\sum_{t=2}^{T} y_t$ and $\hat{\alpha}_x := (T-1)^{-1}\sum_{t=2}^{T} x_{t-1}$, this is identical to the t-statistic associated with the OLS estimate of $\beta$ in the regression

$$ y_t - \hat{\alpha}_y = \beta (x_{t-1} - \hat{\alpha}_x) + v_t, \quad t = 2, \ldots, T, \tag{4} $$

which is therefore defined as

$$ T := \frac{\hat{\beta}}{\sqrt{\hat{\sigma}_v^2 \big/ \sum_{t=2}^{T} (x_{t-1} - \hat{\alpha}_x)^2}}, \qquad \hat{\beta} := \frac{\sum_{t=2}^{T} (x_{t-1} - \hat{\alpha}_x)(y_t - \hat{\alpha}_y)}{\sum_{t=2}^{T} (x_{t-1} - \hat{\alpha}_x)^2}, \tag{5} $$

where $\hat{\sigma}_v^2$ is the usual OLS residual variance estimate from (4). The representation in (4) serves to make clear two very important aspects of the basic statistic $T$. First, $T$ is based on separate OLS demeaning of both $y_t$ and $x_{t-1}$. Under weak persistence of $x_t$, $\hat{\alpha}_x$ is a consistent estimator of $\alpha_x$, while $\hat{\alpha}_x = O_p(T^{1/2})$ under strong persistence. Second, $T$ is based on estimation that takes no account of the endogeneity present between the predictor and the regression error in (1) and, as we will see in Theorem 1, has a limiting null distribution that, under Assumption S, depends on both $c$ and $\rho_{xy}$ when $\rho_{xy} \neq 0$. In contrast, under Assumption W, where $x_t$ is weakly persistent, $T$ has a standard normal limiting null distribution and is asymptotically optimal under Gaussianity; see Jansson and Moreira (2006, p. 704).
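As a concrete illustration of the OLS-demeaned t-ratio described above, the sketch below computes the statistic from two arrays in which `y[t]` is paired with `x[t-1]`. The function name `t_ratio_ols` is hypothetical, and the degrees-of-freedom correction used in the residual variance (number of observations minus two) is an assumption of this sketch, since the text refers only to the "usual" OLS residual variance estimate.

```python
import numpy as np

def t_ratio_ols(y, x):
    """t-ratio on beta after OLS demeaning of y_t and x_{t-1}.

    y[0] is ignored because it has no lagged predictor; x[-1] is ignored
    because it predicts an unobserved return.
    """
    y_dep = np.asarray(y, dtype=float)[1:]    # y_2, ..., y_T
    x_lag = np.asarray(x, dtype=float)[:-1]   # x_1, ..., x_{T-1}

    yd = y_dep - y_dep.mean()                 # OLS demeaning of y
    xd = x_lag - x_lag.mean()                 # OLS demeaning of x_{t-1}

    sxx = xd @ xd
    beta_hat = (xd @ yd) / sxx
    resid = yd - beta_hat * xd
    # Assumed convention: n - 2 degrees of freedom (intercept and slope).
    sigma2_hat = (resid @ resid) / (len(yd) - 2)
    return beta_hat / np.sqrt(sigma2_hat / sxx)
```

Because both series are demeaned, the value returned is unchanged if a constant is added to either `y` or `x`.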
Under strong persistence, the literature to date has largely focused on the endogeneity issue. As discussed in the Introduction, analogous tests to T based on instrumental variable estimation of (1) have been considered in, among others, Kostakis et al. (2015) and Breitung and Demetrescu (2015). Other approaches which are more powerful when x t is strongly persistent, including Lewellen (2004) and CY, fall within the general control variable approach outlined in Elliott (2011). Here (1) is augmented by an additional regressor used as a proxy for the current period innovation driving the predictor, ϵ xt . 1 As discussed in Jansson and Moreira (2006, p. 691), such procedures are asymptotically biased (so power can fall below the nominal level for alternatives sufficiently close to the null) as a result. The most popular example of this approach is CY's Q test. This is based around the infeasible t-statistic on β when (x t − φx t−1 ) is added as a regressor to (1). CY develop a feasible version of this test, using the approach of Cavanagh et al. (1995), based on a Bonferroni confidence interval for β formed from the sequence of such statistics across φ and a confidence interval for φ (equivalently c) formed from the well-known quasi-GLS demeaned ADF unit root statistic of Elliott et al. (1996). While Jansson and Moreira (2006) develop asymptotically uniformly most powerful tests which are asymptotically unbiased, their simulation results show that the Q test of CY has higher power in finite samples for most alternatives.
As noted above, the standard t-ratio T is based on OLS demeaning of both y t and x t−1 . It is, however, well known in the literature that where a series is strongly persistent it can be advantageous to quasi-GLS demean it, as proposed in Elliott et al. (1996), rather than use OLS demeaning. Indeed, CY adopt the quasi-GLS demeaned ADF statistic in constructing the Bonferroni-type confidence interval for c that forms the basis of their predictability test, arguing that they do so because of the superior local power properties of the quasi-GLS demeaned ADF test relative to the standard OLS demeaned ADF test. In particular, Elliott et al. (1996, p. 814) comment that "... where a deterministic mean or trend is present, power can be improved considerably over the standard Dickey-Fuller test by modifying the method employed to estimate the parameters characterising the deterministic term." Elliott et al. (1996) develop a class of feasible near-efficient unit root tests. But the asymptotic local power functions of these tests are essentially indistinguishable from the asymptotic local power function of an ad hoc quasi-GLS demeaned regression-based ADF test despite this test not being based on any formal optimality criterion; see, in particular, Figures 2 and 3 of Elliott et al. (1996, pp. 823-4). Similarly, the $MZ_\alpha^{GLS}$ test of Ng and Perron (2001), although again not based on any formal optimality criterion, also has an asymptotic local power function that is indistinguishable from those of the near-efficient tests of Elliott et al. (1996), and superior power to the corresponding $MZ_\alpha$ test based on OLS demeaning considered in Stock (1999) and Perron and Ng (1996).
One may therefore conclude that it is largely the quasi-GLS method of demeaning that brings about this power advantage over the standard OLS demeaned unit root tests. Indeed, as Elliott et al. (1996, p. 823-24) argue "Since the difficulties with the standard tests are associated with inefficient estimates of the trend parameters, it is reasonable to expect that modified estimates could improve their performance." It therefore seems worth investigating whether the same applies in the current situation. Consequently, rather than focusing on predictive tests in the strongly dependent case that are driven by a formal asymptotic optimality property, we will explore whether, and if so in what settings, using quasi-GLS demeaning of the persistent predictor can deliver tests with good power. Indeed, as will be shown in Theorem 1 below, under strong persistence the limiting distribution of T features a component which is a weighted combination of two distributions, the first of which is the local alternative limit distribution of the OLS-demeaned Dickey-Fuller statistic and the second is standard normal. The Dickey-Fuller component dominates the standard normal component when the degree of endogeneity |ρ xy | is large. Where the degree of endogeneity is small the reverse holds and so here we might not necessarily expect to see any gains from using a test based on quasi-GLS demeaning the predictor. As we will see in Section 3.2, an exploration of the asymptotic local power functions of (asymptotically) conservative implementations (needed to account for the dependence on c and ρ xy under the null) of the tests shows that quasi-GLS demeaning of the persistent predictor can indeed deliver power gains relative to T for moderate to large ρ xy in the strongly persistent case.
To define the t-ratio from the predictive regression where the predictor regressor is quasi-GLS demeaned, we first need to define the quasi-GLS estimate of $\alpha_x$. This is obtained from the OLS regression of $(x_1, x_2 - \bar{\phi}x_1, \ldots, x_T - \bar{\phi}x_{T-1})$ on $(1, 1 - \bar{\phi}, \ldots, 1 - \bar{\phi})$, where $\bar{\phi} := 1 - \bar{c}/T$ with $\bar{c} = 7$; see Elliott et al. (1996) for further details. We denote this estimator $\tilde{\alpha}_x$. Under strong persistence $\tilde{\alpha}_x = O_p(1)$ and, hence, it is not divergent, unlike its OLS counterpart $\hat{\alpha}_x$. Based on the quasi-GLS demeaned predictor, we can define the corresponding t-statistic associated with the OLS estimate of $\beta$ in the regression

$$ y_t - \hat{\alpha}_y = \beta (x_{t-1} - \tilde{\alpha}_x) + v_t, \quad t = 2, \ldots, T, \tag{6} $$

as

$$ T' := \frac{\tilde{\beta}}{\sqrt{\tilde{\sigma}_v^2 \big/ \sum_{t=2}^{T} (x_{t-1} - \tilde{\alpha}_x)^2}}, \qquad \tilde{\beta} := \frac{\sum_{t=2}^{T} (x_{t-1} - \tilde{\alpha}_x)(y_t - \hat{\alpha}_y)}{\sum_{t=2}^{T} (x_{t-1} - \tilde{\alpha}_x)^2}, \tag{7} $$

where $\tilde{\sigma}_v^2$ is the OLS residual variance estimate from (6). Notice that we retain OLS demeaning of $y_t$ because, under the null, $y_t = \alpha_y + \epsilon_{yt}$ is not strongly persistent and so GLS demeaning would not be appropriate.

1 A proxy is needed because $\epsilon_{xt}$ is unobservable as both $\alpha_x$, the unconditional mean of $x_t$ in (2), and the autoregressive parameter, $\phi$, in (3) are unknown. These parameters cannot be estimated at a sufficiently fast rate such that a proxy based on an estimate of $\epsilon_{xt}$ delivers (under Gaussianity) an asymptotically efficient test with a standard normal limiting null distribution, as would be obtained if $\alpha_x$ and $\phi$ were known.
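The quasi-GLS demeaning step and the resulting t-ratio can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names `quasi_gls_mean` and `t_ratio_gls` are hypothetical, the regression is run without an intercept on the already-demeaned series, and the residual variance uses number of observations minus two as its divisor, which is an assumed convention.

```python
import numpy as np

def quasi_gls_mean(x, c_bar=7.0):
    """Quasi-GLS estimate of the mean of x, using phi_bar = 1 - c_bar/T."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    phi_bar = 1.0 - c_bar / T
    # Quasi-differenced data: (x_1, x_2 - phi_bar*x_1, ..., x_T - phi_bar*x_{T-1})
    x_qd = np.concatenate(([x[0]], x[1:] - phi_bar * x[:-1]))
    # Quasi-differenced constant: (1, 1 - phi_bar, ..., 1 - phi_bar)
    z_qd = np.concatenate(([1.0], np.full(T - 1, 1.0 - phi_bar)))
    return (z_qd @ x_qd) / (z_qd @ z_qd)

def t_ratio_gls(y, x, c_bar=7.0):
    """t-ratio on beta with quasi-GLS demeaned x_{t-1} and OLS demeaned y_t."""
    y_dep = np.asarray(y, dtype=float)[1:]
    x_lag = np.asarray(x, dtype=float)[:-1]

    yd = y_dep - y_dep.mean()                 # y keeps OLS demeaning
    xd = x_lag - quasi_gls_mean(x, c_bar)     # x uses the quasi-GLS mean

    sxx = xd @ xd
    beta_tilde = (xd @ yd) / sxx
    resid = yd - beta_tilde * xd
    sigma2_tilde = (resid @ resid) / (len(yd) - 2)   # assumed divisor
    return beta_tilde / np.sqrt(sigma2_tilde / sxx)
```

Since the quasi-GLS mean shifts one-for-one with a level shift in `x`, the statistic returned by `t_ratio_gls` does not depend on the unknown mean of the predictor.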

Asymptotic distributions of T and T ′
In this subsection we consider the asymptotic behaviour of the $T$ and $T'$ statistics. Predictive regressions for stock returns typically exhibit a small $R^2$ and low signal-to-noise ratios (see, inter alia, Campbell, 2008, and Phillips, 2015) so that departures from the null, should predictability be present, are likely to be small. Consequently, we will establish the large sample behaviour of the tests under local alternatives such that the slope parameter $\beta$ in (1) is local-to-zero. The localisation rate (or Pitman drift) will need to be such that $\beta$ is specified to lie in a neighbourhood of zero which shrinks with the sample size, $T$. The appropriate Pitman drift is dictated by whether $x_t$ is strongly or weakly persistent. Where $x_t$ is strongly persistent, such that Assumption S holds, the appropriate local alternative is given by

$$ H_{1,S}: \beta = \beta_T := g\,\sigma_y\,\omega_x^{-1}\,T^{-1}, $$

where $g$ is a finite constant and where $\omega_x^2 := \sigma_x^2\psi(1)^2$ is the long run variance of $x_t$. For weakly persistent $x_t$, such that Assumption W holds, the appropriate local alternative is given by

$$ H_{1,W}: \beta = \beta_T := g\,\sigma_y\,\{\mathrm{Var}(x_t)\}^{-1/2}\,T^{-1/2}, $$

where $\mathrm{Var}(x_t)$ is the short run variance of $x_t$, and $g$ is again a finite constant. The different localisation rates reflect the fact that near-integration implies a much stronger signal from the predictor $x_{t-1}$.
In Theorem 1 we now report the asymptotic distributions of the T and T ′ statistics under both the null and local alternatives for the case where x t is strongly persistent. In Theorem 2, the proof of which is entirely straightforward and is therefore omitted, we subsequently present the corresponding limit for T for the case where x t is weakly persistent.

Theorem 1. Let y t and x t be generated according to the model in (1)-(3) under the conditions stated in Section 2 and let Assumption S hold. Let the regression t-statistics T and T ′ be as defined in (5) and (7), respectively. Then, as T → ∞, under H 1,S : where "⇒" denotes weak convergence and where W 1 (r) and W 2 (r) are independent standard Brownian Motions,W 1c (r) :

Remark 4. Theorem 1 highlights that for both the statistics considered the offset seen in their limiting distributions under the local alternative H 1,S , and hence their asymptotic local power, is a function of the drift parameter g and a different statistic-specific stochastic offset term. Under the null hypothesis, H 0 , the asymptotic distributions of both statistics are non-standard and depend on ρ xy and c. When ρ xy = 0, T has a N(0, 1) limiting distribution under H 0 .
Remark 5. The limiting null distribution of $T$ is seen from the representation in Theorem 1 to be a weighted average of two components: the first, $\int_0^1 \bar{W}_{1c}(r)\,dW_1(r) \big/ \big(\int_0^1 \bar{W}_{1c}(r)^2\,dr\big)^{1/2}$, is the local alternative limit of the OLS-demeaned Dickey-Fuller statistic, while the second, $\int_0^1 \bar{W}_{1c}(r)\,dW_2(r) \big/ \big(\int_0^1 \bar{W}_{1c}(r)^2\,dr\big)^{1/2}$, is, as noted in Remark 4, a standard $N(0, 1)$ distribution. The former dominates this weighted average when $|\rho_{xy}|$ is large, while the latter dominates when $|\rho_{xy}|$ is small. Consequently, and as discussed earlier, where the degree of endogeneity is small we would not anticipate the possibility of any gains from quasi-GLS, rather than OLS, demeaning of the predictor, but where significant endogeneity is present this possibility exists. We will explore this further in Section 3.2 by comparing the asymptotic local power properties of (asymptotically) size controlled tests based on $T$ and $T'$ under strong persistence, across a range of values of the endogeneity parameter $\rho_{xy}$.

Theorem 2. Let $y_t$ and $x_t$ be generated according to the model in (1)-(3) under the conditions stated in Section 2 and let Assumption W hold. Let the regression t-statistic $T$ be as defined in (5). Then, as $T \to \infty$, under $H_{1,W}$, $T \Rightarrow N(g, 1)$.
Remark 6. The result in Theorem 2 demonstrates that, under Assumption W , T has a standard normal limiting null distribution. Notice that, unlike under Assumption S, the local power offset under H 1,W is deterministic and equals the drift parameter, g. Indeed, as noted in Jansson and Moreira (2006, p. 704), under Assumption W the test based on T is asymptotically optimal under Gaussianity. We do not present the corresponding limiting distribution for T ′ under Assumption W because it can be shown to depend on the distribution of s 1 . The hybrid testing scheme, denoted T hyb , that we will subsequently develop in Section 3.3, is designed such that it never selects T ′ in large samples under Assumption W and, hence, we will not need the limiting distribution of T ′ to establish the limiting distribution of T hyb under Assumption W .

Asymptotic size and local power comparisons of T and T ′ under strong persistence
Under strong persistence, we can use the limiting representations given in Theorem 1 to compare the asymptotic sizes and asymptotic local powers of tests based on the T and T ′ statistics for a range of values of the relevant nuisance parameters on which these depend, ρ xy and c. For a given value of ρ xy , the main issue is that the asymptotic critical values of T and T ′ depend on c, which is unknown, but, unlike ρ xy , is not consistently estimable. To make asymptotic size and, subsequently, asymptotic power comparisons meaningful, we adopt a scheme for simulating critical values that will, by design, deliver asymptotically conservative tests. We will illustrate this in the context of a one-sided upper-tailed test for the alternative of β > 0, but the same approach can be used in an obvious way for lower-tailed and two-tailed tests.
The steps to obtaining asymptotically conservative critical values for tests based on $T$ and $T'$ are as follows:

1. For a given value of $\rho_{xy}$, simulate the null distributions $S(0, \rho_{xy}, c)$ and $S'(0, \rho_{xy}, c)$ for different $c$ across an interval $c \in [c_{\min}, c_{\max}]$.
2. At each value of $c$, compute the respective $\lambda$-level upper-tail critical values, $cv_\lambda(\rho_{xy}, c)$ and $cv'_\lambda(\rho_{xy}, c)$ say.
3. Set the $\lambda$-level critical values for $T$ and $T'$ equal to

$$ cv_\lambda(\rho_{xy}) := \max_{c \in [c_{\min}, c_{\max}]} cv_\lambda(\rho_{xy}, c) \quad \text{and} \quad cv'_\lambda(\rho_{xy}) := \max_{c \in [c_{\min}, c_{\max}]} cv'_\lambda(\rho_{xy}, c), $$

respectively.

Using $cv_\lambda(\rho_{xy})$ and $cv'_\lambda(\rho_{xy})$ will yield correct $\lambda$-level sized tests based on $T$ and $T'$ in the case where $c = \arg\max_{c \in [c_{\min}, c_{\max}]} cv_\lambda(\rho_{xy}, c)$ and $c = \arg\max_{c \in [c_{\min}, c_{\max}]} cv'_\lambda(\rho_{xy}, c)$, respectively, and give conservatively sized tests for all other values of $c$. We simulated critical values in this manner for a significance level $\lambda = 0.05$, approximating the Brownian motion processes in the limiting functionals using IID $N(0, 1)$ random variates, with the integrals approximated by normalised sums of 1,000 steps based on 10,000 Monte Carlo replications. This was carried out for $c_{\min} = -5$ and $c_{\max} = 50$ on the grid $c \in \{c_{\min}, c_{\min} + 1, \ldots, c_{\max} - 1, c_{\max}\}$, thereby covering the locally explosive, unit root and local to unit root cases. For $\rho_{xy}$ we consider the grid $\rho_{xy} \in \{-0.950, -0.925, -0.900, \ldots, 0.900\}$. We will refer to the two tests where $T$ and $T'$ are compared with their asymptotically conservative critical values as $T_{con}$ and $T'_{con}$, respectively.

Consider first the case where $\rho_{xy} = -0.9$. Regarding the asymptotically conservative tests, $T_{con}$ and $T'_{con}$, we observe that $T_{con}$ maximises its asymptotic size (i.e., has asymptotic size of 0.05) for $c$ just below 0. Importantly, it is also generally very undersized for positive $c$. In contrast, $T'_{con}$, which maximises its asymptotic size at $c = 0$, has a very flat size profile across $c$, never dropping much below 0.05.
The pattern of asymptotic size behaviour in T N essentially magnifies the pattern observed with T con , but with very bad oversize for small c. In terms of local power, between T con and T ′ con , it is clear that T ′ con offers substantially more power unless c is very small, with T con suffering due to its undersize outside of the small c range. The local power plot for T N is not meaningful here because of its severe oversize.
When ρ xy = −0.5, the main feature we observe is that T con is now less undersized for positive c compared to when ρ xy = −0.9 (and consequently T N is less oversized), while the size behaviour of T ′ con is little changed. On comparing the powers, we see that T con remains somewhat less powerful than T ′ con unless c is small, although the deficit is somewhat reduced.
Results for ρ xy = 0 show that T con has size independent of c and coincides exactly with that of T N at 0.05, however T ′ con tends towards undersize unless c is large. In terms of power, this behaviour translates into T con (i.e. T N ) being more powerful than T ′ con unless c is large, where the powers of T con and T ′ con are very close to each other.
Finally, when ρ xy = 0.5, both T con and T ′ con tend to be undersized for smaller c, with T ′ con slightly more so. However, the power gains of T con over T ′ con remain evident for the smaller values of c. Also, we see that T N always has slightly lower size than T con , and so as a consequence its powers are always slightly lower.
One obvious feature is the substantial asymmetry of the size and power profiles of T con (and T N ) and T ′ con between ρ xy = 0.5 and ρ xy = −0.5. For ρ xy = −0.5, T ′ con clearly possesses a better power profile than T con while the opposite is true for ρ xy = 0.5. This pattern of T ′ con displaying better overall power properties for substantially negative ρ xy and T con outperforming T ′ con for positive ρ xy extends to the expanded set of ρ xy results reported in Figure S1 of the Supplementary Appendix, and on the basis of these it appears that T ′ con has arguably the better power properties whenever ρ xy < −0.1 and that T con is better otherwise. While these results and conclusions are drawn for the single point g = 10 under the alternative, similar patterns of relative power performance arise for other values of g, as seen in Figure S2 of the Supplementary Appendix where results for g = 5 and g = 20 are reported, reinforcing the general result that T ′ con has better power when ρ xy < −0.1 and T con better otherwise. In unreported simulations, we also found that similar relative power patterns are obtained when using the 0.10 and 0.01 significance levels.
While we have focused the foregoing analysis on upper-tail testing against the alternative β > 0, lower-tail testing of the alternative β < 0 can be carried out in an analogous fashion. To test β < 0, for T con and T ′ con we simply replace cv λ (ρ xy ) and cv ′ λ (ρ xy ) with −cv λ (−ρ xy ) and −cv ′ λ (−ρ xy ), respectively. Consequently, an identical pattern of asymptotic sizes and powers obtains for lower-tailed tests but with ρ xy replaced with −ρ xy . That is, for lower-tailed tests, T ′ con has better overall power when ρ xy > 0.1 with T con superior otherwise. Two-sided λ-level tests can be conducted by taking the union of rejections of the lower-tail and upper-tail tests, each conducted at the (λ/2)-level.
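The three-step scheme for conservative critical values set out in this subsection can be sketched in code. The version below is a simplified stand-in rather than the authors' procedure: instead of simulating the limiting functionals, it simulates the OLS-demeaned t-ratio directly from a finite-sample Gaussian model with φ = 1 − c/T and ψ(L) = 1, and it uses far fewer replications and a coarser grid for c than the paper. All function names and default settings are hypothetical.

```python
import numpy as np

def simulate_null_t(rho_xy, c, T=250, reps=2000, seed=0):
    """Draws of the OLS-demeaned t-ratio under beta = 0 (Gaussian sketch)."""
    rng = np.random.default_rng(seed)
    phi = 1.0 - c / T
    e_x = rng.standard_normal((reps, T))
    e_y = rho_xy * e_x + np.sqrt(1.0 - rho_xy**2) * rng.standard_normal((reps, T))

    s = np.zeros((reps, T))
    for t in range(1, T):                     # AR(1) recursion, all reps at once
        s[:, t] = phi * s[:, t - 1] + e_x[:, t]

    x_lag, y = s[:, :-1], e_y[:, 1:]          # pair y_t with x_{t-1}
    xd = x_lag - x_lag.mean(axis=1, keepdims=True)
    yd = y - y.mean(axis=1, keepdims=True)
    sxx = (xd**2).sum(axis=1)
    beta_hat = (xd * yd).sum(axis=1) / sxx
    resid = yd - beta_hat[:, None] * xd
    sigma2 = (resid**2).sum(axis=1) / (T - 3)  # (T-1) observations minus 2
    return beta_hat / np.sqrt(sigma2 / sxx)

def conservative_cv(rho_xy, c_grid, level=0.05, **sim_kwargs):
    """Steps 1-3: upper-tail critical value maximised over the c grid."""
    cvs = [np.quantile(simulate_null_t(rho_xy, c, **sim_kwargs), 1.0 - level)
           for c in c_grid]
    i = int(np.argmax(cvs))
    return float(cvs[i]), c_grid[i]           # critical value and maximising c

def lower_tail_cv(rho_xy, c_grid, level=0.05, **sim_kwargs):
    """Lower-tail critical value via the -cv(-rho_xy) relation in the text."""
    cv, _ = conservative_cv(-rho_xy, c_grid, level, **sim_kwargs)
    return -cv
```

For example, `conservative_cv(-0.9, list(range(-5, 51)))` mimics the grid for c used in the text, at the cost of a longer run time. The same structure would apply to the quasi-GLS demeaned statistic by swapping the demeaning of the predictor.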

A hybrid testing procedure
We now propose a hybrid testing procedure that is designed to capitalise on the power optimality of the standard t-test T N under weak persistence, and the relative local power advantages of T con and T ′ con under strong persistence for different values of ρ xy . Specifically, we consider an approach that, for upper-tail testing (lower-tail testing), (i) uses T con under strong persistence if ρ xy > −0.1 (ρ xy < 0.1), (ii) uses T ′ con under strong persistence if ρ xy < −0.1 (ρ xy > 0.1), (iii) uses T N under weak persistence (for both upper- and lower-tail testing). Below we detail how to operationalise such an approach for practical implementation. We require the use of two switching mechanisms. Part (iii) involves a switching approach similar to that of EMW, whereby the standard test T N is selected when evidence of a weakly persistent predictor is present. In the absence of such evidence, a secondary switching mechanism is needed to determine whether T con or T ′ con should be applied, this time on the basis of a consistent estimate of ρ xy .
In the first switching mechanism, we determine whether the predictor variable is strongly or weakly persistent on the basis of a standard unit root test. For the unit root test we use the ADF normalised bias coefficient unit root test

$$ADF_{\pi} := \frac{T\hat{\pi}}{1 - \sum_{i=1}^{p}\hat{\gamma}_i}$$

where π̂ and γ̂ i , i = 1, . . . , p are obtained from the estimated OLS ADF regression equation

$$\Delta x_t = \hat{\mu} + \hat{\pi} x_{t-1} + \sum_{i=1}^{p}\hat{\gamma}_i \Delta x_{t-i} + \hat{\varepsilon}_{xt}. \qquad (8)$$

The lag truncation parameter, p, in (8) needs to satisfy the standard rate condition that as T → ∞, 1/p + p^3/T → 0. In practice, the lag length p can be selected by any suitable information criterion. In the numerical work which follows we will use the modified Bayes information criterion [MBIC] of Ng and Perron (2001) as we found this to deliver the best finite sample performance among popularly used lag selection rules. In the context of the MBIC rule, we used the modification suggested by Perron and Qu (2007), and the maximum permitted lag order p was set to p max = ⌊12(T/100)^{1/4}⌋ (⌊.⌋ denoting the integer part), as in Ng and Perron (2001).
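As an illustration, the normalised bias statistic and the maximum lag order can be computed along the following lines. This is a sketch under stated assumptions rather than the implementation used in the paper: the function names are our own, the lag order p is supplied by the user (the MBIC selection step is not implemented), the statistic is taken to be T π̂ divided by one minus the sum of the estimated lag coefficients, and the effective number of regression observations is used for T.

```python
import numpy as np


def max_lag(T: int) -> int:
    """Maximum lag order floor(12 * (T / 100) ** (1 / 4))."""
    return int(np.floor(12.0 * (T / 100.0) ** 0.25))


def adf_normalised_bias(x, p: int):
    """OLS ADF regression of dx_t on a constant, x_{t-1} and p lagged differences.

    Returns (statistic, pi_hat, residuals). The scaling by the effective
    sample size is an assumption of this sketch.
    """
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)                      # dx[j] = x[j + 1] - x[j]
    n = dx.size
    y = dx[p:]                           # dependent variable
    cols = [np.ones(n - p), x[p:-1]]     # constant and lagged level
    cols += [dx[p - i:n - i] for i in range(1, p + 1)]  # lagged differences
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    pi_hat, gamma_hat = coef[1], coef[2:]
    stat = y.size * pi_hat / (1.0 - gamma_hat.sum())
    return stat, pi_hat, resid
```

For a white noise series the statistic is of order −T, whereas for a random walk it remains bounded in probability, which is the contrast the first switching mechanism exploits.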
Under Assumption S, ADF π = O p (1), while under Assumption W , ADF π diverges to minus infinity. Consequently, employing any fixed critical value for ADF π , cv ADF say, would ensure that T N would be selected asymptotically under weak persistence since Pr(ADF π < cv ADF ) → 1. However, in finite samples we found such a cut-off rule can lead to T N being selected too often under strong persistence, leading to over-sizing of the resulting hybrid procedure. We therefore implement T hyb with a sample size dependent critical value, cv ADF T = −4T^{1/2}, a choice motivated from extensive Monte Carlo simulation evidence for a range of values of T , ρ xy and c. Under weak persistence, ADF π diverges to minus infinity at a rate faster than T^{1/2}, hence T N is still selected asymptotically under weak persistence since Pr(ADF π < cv ADF T ) → 1.
In the second switching mechanism, which is operational whenever weak persistence is not detected, selection between T con and T ′ con is made on the basis of the ρ xy estimator

$$\hat{\rho}_{xy} := \frac{\sum_{t}\hat{\varepsilon}_{xt}\hat{\varepsilon}_{yt}}{\sqrt{\sum_{t}\hat{\varepsilon}_{xt}^{2}\,\sum_{t}\hat{\varepsilon}_{yt}^{2}}}$$

where the ε̂ yt are the OLS residuals from regressing y t on a constant and x t−1 , and ε̂ xt are the ADF residuals from (8). This estimator is consistent for ρ xy under either Assumption S or Assumption W . In practice then, for T con and T ′ con we use the critical values cv λ (ρ̂ xy ) and cv ′ λ (ρ̂ xy ) as estimates of cv λ (ρ xy ) and cv ′ λ (ρ xy ). To automate selection of an appropriate critical value we calculated a response surface by OLS regressions of cv λ (ρ xy ) and cv ′ λ (ρ xy ) on [1, ρ xy , ρ xy^2 , . . . , ρ xy^8 ] for the grid of values ρ xy = {−0.90, −0.85, . . . , 0.90} (37 data points). The response surface coefficient estimates are given in Table 1 for the usual values of λ (the R^2 from the response surface regressions exceeded 0.999 in all cases), and the response surface critical value is obtained as the fitted value from the corresponding estimated regression.
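The correlation estimate and the response surface step can be sketched as follows. The sketch assumes that the estimator is the usual sample correlation of the two residual series, and the critical values fed into the fit are purely illustrative placeholders: the coefficients of Table 1 are not reproduced here, and the function names are our own.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def rho_xy_hat(e_y, e_x) -> float:
    """Sample correlation of two (already aligned) residual series."""
    e_y = np.asarray(e_y, dtype=float)
    e_x = np.asarray(e_x, dtype=float)
    if e_y.shape != e_x.shape:
        raise ValueError("residual series must have the same length")
    return float(e_x @ e_y / np.sqrt((e_x @ e_x) * (e_y @ e_y)))


def fit_response_surface(rho_grid, cv_grid, degree: int = 8):
    """OLS regression of critical values on [1, rho, ..., rho**degree]."""
    X = np.vander(np.asarray(rho_grid, dtype=float), degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(X, np.asarray(cv_grid, dtype=float), rcond=None)
    return coef


def response_surface_cv(rho: float, coef) -> float:
    """Fitted critical value at rho (coefficients in increasing order)."""
    return float(P.polyval(rho, coef))


# Grid -0.90, -0.85, ..., 0.90 (37 points), as in the text.
rho_grid = np.linspace(-0.90, 0.90, 37)
# Placeholder critical values, for illustration only (NOT from Table 1).
placeholder_cv = 1.645 - 0.5 * rho_grid + 0.2 * rho_grid ** 2
coef = fit_response_surface(rho_grid, placeholder_cv)
```

In an application, `coef` would be replaced by the tabulated response surface coefficients for the chosen λ, and `response_surface_cv` would be evaluated at the estimated correlation.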
To summarise, our suggested hybrid double switching-based testing procedure, which we denote by T hyb , is defined as follows:

1. If ADF π < −4T^{1/2}, perform T N (T with a standard normal critical value).

2. Otherwise:

(a) For upper-tail tests against the alternative β > 0:
if ρ̂ xy > −0.1, perform T con (T with conservative critical value cv λ (ρ̂ xy ));
if ρ̂ xy < −0.1, perform T ′ con (T ′ with conservative critical value cv ′ λ (ρ̂ xy )).

(b) For lower-tail tests against the alternative β < 0:
if ρ̂ xy < 0.1, perform T con (T with conservative critical value −cv λ (−ρ̂ xy ));
if ρ̂ xy > 0.1, perform T ′ con (T ′ with conservative critical value −cv ′ λ (−ρ̂ xy )).

In the next section we explore the efficacy of this hybrid testing approach in delivering a procedure with reliable size and attractive power, relative to existing tests in the literature, across a wide range of correlation parameters ρ xy and degrees of predictor persistence.
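The double switching rule summarised above can be expressed as a small selection function. This is a minimal sketch: the function name and return labels are our own, and the treatment of the boundary cases where the estimated correlation equals ±0.1 exactly (assigned here to T con ) is an assumption, since the rule is stated with strict inequalities.

```python
import math


def select_hybrid_test(adf_pi: float, rho_hat: float, T: int,
                       tail: str = "upper") -> str:
    """Return which constituent test the hybrid rule selects.

    Labels: "T_N" (standard normal critical value), "T_con" and
    "T_con_prime" (conservative critical values).
    """
    if tail not in ("upper", "lower"):
        raise ValueError("tail must be 'upper' or 'lower'")
    # First switch: evidence of weak persistence from the ADF statistic.
    if adf_pi < -4.0 * math.sqrt(T):
        return "T_N"
    # Second switch: sign and magnitude of the estimated correlation.
    if tail == "upper":
        return "T_con" if rho_hat >= -0.1 else "T_con_prime"
    return "T_con" if rho_hat <= 0.1 else "T_con_prime"
```

For example, with T = 456 the first cut-off is −4 × 456^{1/2} ≈ −85.4, so a predictor with an ADF statistic of −100 would be routed to T N regardless of the estimated correlation.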

Finite sample size and power
We examine the finite sample size and power properties of the T hyb procedure and compare these with the prominent tests in the predictive regression testing literature. Specifically, the tests we employ as comparators are CY's Q test; the instrumental variable test of Breitung and Demetrescu (2015) using their recommended sine and fractional instruments (denoted BD); the test of Kostakis et al. (2015) (denoted IVX ); and the test of EMW (denoted EMW ). Note that we compare with the original Q test of CY, rather than a modified variant that can control size under weak persistence, because EMW find in their supplement that the modified test has lower power than the original test for moderate values of c, and is dominated by the EMW test, and also because the original Q test is the one implemented by practitioners, hence it presents a more useful point of comparison. We do not report the test of Jansson and Moreira (2006) because, as noted earlier, the Q test has higher power than this test in finite samples for most alternatives.
We conduct a 0.05-level upper tail test for T hyb , Q and EMW ; a 0.10-level two-tailed test for BD (recall that this test can only be run as a two-tailed test) and consider two variants of IVX : a 0.05-level upper tail test and a 0.10-level two-tailed test, denoting these as IVX 1 and IVX 2 respectively. The Q tests were computed using the code provided by CY. To implement EMW , we adopt their switching function so that the standard test T based on (4) is applied if a (non-consistent) estimate of the local offset c is at least 130, while their weighted average power criterion-based test is applied otherwise, using the sample statistics and long run correlation estimator specified on p. 697 of Jansson and Moreira (2006), together with the routines provided by EMW. For the estimate of c we use the natural estimator from (8), −T π̂, and when the standard T test is used in EMW , we follow EMW's approach of setting the critical value to the usual value of 1.645 for non-negative estimates of the long run correlation parameter, but to 1.7 for negative estimates. In calculating the IVX 1 and IVX 2 tests we implemented the finite-sample correction factor outlined in Kostakis et al. (2015, p. 1516). Although the innovations in (3) are generated without serial correlation, we do not assume knowledge of this when running the tests (as would be the case in practical applications). For Q we set p max = ⌊12(T/100)^{1/4}⌋, in line with the p max setting used in T hyb , while for IVX and EMW , long run variances are calculated using a Bartlett kernel with lag truncation ⌊T^{1/3}⌋ (BD requires no serial correlation correction).
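For concreteness, a Bartlett kernel long run variance estimate with lag truncation ⌊T^{1/3}⌋ can be sketched as below. This is an illustration only and not the code of the cited authors: the function names are our own, and the weight convention 1 − j/(m + 1) and the demeaning of the input series are assumptions of this sketch.

```python
import numpy as np


def floor_cuberoot(T: int) -> int:
    """Integer part of T ** (1/3), guarding against floating point error."""
    m = int(round(T ** (1.0 / 3.0)))
    while m ** 3 > T:
        m -= 1
    while (m + 1) ** 3 <= T:
        m += 1
    return m


def bartlett_lrv(u, bandwidth=None) -> float:
    """Bartlett kernel long run variance of the series u."""
    u = np.asarray(u, dtype=float)
    u = u - u.mean()
    T = u.size
    m = floor_cuberoot(T) if bandwidth is None else int(bandwidth)
    lrv = float(u @ u) / T                      # lag-0 autocovariance
    for j in range(1, m + 1):
        weight = 1.0 - j / (m + 1.0)            # Bartlett weight (assumed convention)
        gamma_j = float(u[j:] @ u[:-j]) / T     # lag-j autocovariance
        lrv += 2.0 * weight * gamma_j
    return lrv
```

With T = 456, as in the empirical illustration, the truncation lag under this rule is 7.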
The simulation results are shown in Figs. 2-6. Considering first the sizes of the tests across the different φ and ρ xy settings, the newly proposed T hyb test displays excellent finite sample size control across the full range of persistence and correlation parameters, with very little deviation from nominal size, apart from some undersize for positive ρ xy in the more persistent cases. Of the existing competitor tests, BD and IVX 2 also demonstrate decent size behaviour, while the remaining tests can be badly size distorted. Specifically, EMW does not control size for explosive processes, e.g. size is close to one for ρ xy = −0.9 and close to zero for ρ xy ≥ 0; also EMW displays substantial oversize for moderate and small values of φ when ρ xy > 0. On the other hand, Q is severely oversized when φ = 0 and also suffers severe undersize when φ = 1.025 and ρ xy = 0.5, and IVX 1 can be badly oversized for more persistent series when ρ xy is negative. That EMW does not control size for explosive processes and Q does not control size for white noise processes is not surprising given that these tests are not designed to be valid in such circumstances. However, the severe size distortions displayed for these settings highlight the sensitivity of these tests to departures from the persistence assumptions under which they were derived, and the contrast with tests such as T hyb , which offer robustness to a much broader set of persistence parameters, is stark.
Turning attention to the power performance of the procedures, for ρ xy = −0.9 (Fig. 2), we find that for the explosive setting φ = 1.025, the correctly sized tests T hyb and IVX 2 have very similar power profiles, lying only a little below those of BD and IVX 1 which are modestly oversized in this case. The power of Q is very low in comparison to the other tests here, while comparison with EMW is not meaningful due to it having a size close to one. In the unit root case φ = 1, EMW dominates all other tests in terms of power; it appears that exclusion of robustness to the case of explosive predictors affords the EMW test the opportunity of greater power in the unit root setting. Of the other tests, T hyb is next best for small departures from the null, whereas Q offers some gains over T hyb for larger β; both of these tests offer significant power advantages over IVX 2 and BD (IVX 1 is oversized and hence cannot be compared in terms of power). As the process becomes less persistent, the power advantages of EMW are very quickly eliminated, with T hyb offering the best power profile (of the correctly sized tests) even for φ = 0.975. For ρ xy = −0.9, the overall picture is one of T hyb offering the best power profile for all values of φ except φ = 1 where EMW dominates.
When ρ xy = −0.5 (Fig. 3), similar comments apply to the explosive and unit root cases, although in the unit root case, the power gains of EMW over T hyb are not as marked. For φ = 0.975, 0.95 and 0.875, the power profiles of EMW , T hyb and Q essentially coincide, and are the best performing tests for these degrees of persistence. When φ = 0.75 and 0.5, power gains of T hyb over EMW and Q are seen, the magnitude of which can be quite substantial in the φ = 0.5 case. For the white noise setting φ = 0, T hyb and EMW again coincide while Q is badly oversized and has poor power. In the case of ρ xy = 0 (Fig. 4), with the exception of the explosive case (where EMW has very low size and power), the T hyb , EMW , Q and IVX 2 tests share very similar properties (EMW offers some small power gains for the most persistent settings), while IVX 1 and BD lag behind in terms of power performance. When ρ xy = 0.5 (Fig. 5), T hyb performs best for an explosive predictor, with EMW and Q displaying considerably lower power here, while EMW offers the best power profile for φ = 1, 0.975 and 0.95. However, as the persistence parameter decreases further, EMW first becomes oversized (as noted above) with T hyb providing the best power of the correctly sized tests, until φ = 0 when the power profiles of EMW and T hyb again coincide. Finally, for ρ xy = 0.9 (Fig. 6), a generally more exaggerated picture of the ρ xy = 0.5 results is seen, with T hyb dominant for φ = 1.025, EMW markedly best for φ = 1 and 0.975, but then EMW suffering from oversize for smaller φ with T hyb being the best performing test of those correctly sized. Across Figs. 3-6, T hyb and EMW arguably emerge as the tests with the best overall power profiles, with each test offering relative power advantages over the other in different settings. However, of these two procedures, T hyb is alone in also offering reliable size control across the full range of persistence and correlation settings.
In the Supplementary Appendix, Figures S3-S20 report results for cases where additional serial correlation is permitted in the predictor series, with s t of (2) specified as s t = φs t−1 + u t , u t = δu t−1 + ϵ xt − θϵ x,t−1 , with simulations conducted for θ = ±0.5 and δ = ±0.5. We find that T hyb retains its feature of never being subject to large upward size distortions, in contrast to Q and EMW whose sizes can vary dramatically for different combinations of φ, ρ xy , δ and θ. For example, Q can now display substantial oversize for φ = 1.025 across all values of ρ xy , as well as an increased range of less persistent φ cases, especially when ρ xy > 0, while the oversize seen for EMW in Figs. 2-6 for small and moderate φ when ρ xy > 0 can now extend to the φ = 0 case and sometimes become more pronounced. When the tests are approximately correctly sized so that meaningful power comparisons can be made, the same broad patterns emerge as in Figs. 2-6, albeit with some differences in the magnitudes of the relative power gains/losses. Based on our simulation results, we conclude that T hyb offers appealing size and power properties when compared to the leading currently available testing procedures. It would be fairly naïve to believe, a priori, that any one test procedure would have the best finite sample size and power properties across the full constellation of settings that we have examined, i.e. a wide spectrum of values of the persistence level in the predictive regressor and the correlation coefficient between the innovations in the model. However, T hyb does appear to perform consistently well in terms of both size and power across these settings, never seemingly showing a substantial weakness in either dimension, something which appears to be rather less true of its extant competitors.

Table note: The column labelled "T hyb =" states which of the constituent tests is selected in the hybrid test T hyb .
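The serially correlated predictor design used in the supplementary simulations, s t = φs t−1 + u t with u t = δu t−1 + ϵ xt − θϵ x,t−1 , can be generated recursively as in the following sketch. Several features are assumptions made for illustration and are not taken from the paper: the innovations are drawn as bivariate Gaussian with unit variances and correlation ρ xy , all pre-sample values are set to zero, and the function names are our own.

```python
import numpy as np


def draw_innovations(T: int, rho_xy: float, seed: int = 0):
    """Draw (eps_y, eps_x) with unit variances and correlation rho_xy (assumed design)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho_xy], [rho_xy, 1.0]])
    eps = rng.multivariate_normal(np.zeros(2), cov, size=T)
    return eps[:, 0], eps[:, 1]


def simulate_predictor(eps_x, phi: float, delta: float, theta: float):
    """Generate s_t = phi*s_{t-1} + u_t, u_t = delta*u_{t-1} + eps_t - theta*eps_{t-1}.

    Pre-sample values of s, u and eps are set to zero (an assumption).
    """
    eps_x = np.asarray(eps_x, dtype=float)
    T = eps_x.size
    u = np.zeros(T)
    s = np.zeros(T)
    for t in range(T):
        u_prev = u[t - 1] if t > 0 else 0.0
        e_prev = eps_x[t - 1] if t > 0 else 0.0
        s_prev = s[t - 1] if t > 0 else 0.0
        u[t] = delta * u_prev + eps_x[t] - theta * e_prev
        s[t] = phi * s_prev + u[t]
    return s
```

Setting δ = θ = 0 recovers the serially uncorrelated design, so the same function covers both the baseline and the supplementary settings.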

An empirical illustration
To illustrate the use of our proposed test in practice, we apply it, together with its competitors, to the monthly U.S. equity series analysed in Welch and Goyal (2008), using updated data for the period 1980:1-2017:12 (T = 456) which is available at http://www.hec.unil.ch/agoyal/. Our dependent variable, y t , is the S&P 500 value-weighted log excess return and for x t we consider thirteen putative predictor variables: the dividend-price ratio, earnings-price ratio, dividend-payout ratio, dividend yield, default yield spread, long-term yield, default return spread, stock variance, net equity expansion, inflation rate, Treasury bill rate, term spread and the book-to-market value ratio. Details of the construction of these predictors can be found in Welch and Goyal (2008). The test procedures are all applied with the same settings and serial correlation corrections as used in Section 4. We conduct one-sided upper-tail tests for T hyb , EMW , Q and IVX 1 (with the exception of the stock variance predictor for which we apply lower-tail tests), and two-sided tests for BD and IVX 2 ; the tests are implemented at the 0.10, 0.05 and 0.01 significance levels for T hyb , BD, IVX 1 and IVX 2 , and at the 0.05-level for EMW and Q (again using the code provided by EMW and CY, respectively). The results are presented in Table 2, along with the values of ρ̂ xy and ADF π (in this application, cv ADF T = −4T^{1/2} = −85.4). There are three cases where one or more of the tests reject at the 0.10-level or above: the dividend yield, default return spread and stock variance. For the dividend yield, only T hyb and Q reject; here ρ̂ xy is close to zero so we would expect T hyb and Q to give similar results, although interestingly none of the other tests reject, including EMW . Due to the persistence in this predictor, as evidenced by the small magnitude of ADF π , T hyb is here using T con . In the case of the default return spread, all tests but BD exhibit rejections, while for the stock variance all the tests reject. The values of ADF π for these two predictors suggest very low levels of persistence, with T hyb using T N and EMW also switching into the standard t-test. In summary, fairly limited evidence of return predictability is found across the set of predictors considered, but it is clear that T hyb uncovers at least as much evidence for predictability as any of its comparator tests.

Conclusions
We have developed new and easy-to-implement tests for predictability based on computationally simple regression t-ratios and a switching rule based on a conventional normalised bias ADF statistic implemented with the MBIC lag selection rule of Ng and Perron (2001). In particular, together with the standard t-ratio from the OLS regression of returns on a constant and a lagged predictor, we have discussed a t-ratio from a variant of the standard predictive regression where the OLS demeaned returns are regressed on the quasi-GLS demeaned lagged predictor. Where the predictor is strongly persistent, we have proposed a feasible method for obtaining (conservative) asymptotic critical values for tests based on each of these statistics and associated response surfaces have been provided. An analysis of the asymptotic local power functions of the resulting (asymptotically) conservative tests in the case where the predictor is strongly persistent showed that these vary considerably with the endogeneity correlation parameter. We consequently suggest applying either the conservative standard t-ratio or its quasi-GLS variant, according to the magnitude of the estimated endogeneity correlation parameter. Where the predictor is weakly persistent the standard t-ratio compared to standard normal critical values is optimal under Gaussianity. We therefore propose a switching testing procedure, similar in approach to that considered in Elliott et al. (2015), whereby one of the two conservative tests is performed, as outlined above, unless the normalised bias ADF statistic indicates that the predictor is weakly dependent, in which case we compare the standard t-ratio with standard normal critical values. 
Monte Carlo simulations presented suggest that our hybrid test compares very favourably with the leading tests for predictability in the literature, offering arguably the best trade-off in terms of overall finite sample size and power properties across a broad diversity of persistence and endogeneity settings.
We conclude with a suggestion for further research. Like the vast majority of the published tests in this literature we have considered the case of a single predictor. Some published papers have considered multiple predictors simultaneously, most notably the IV-based tests of Kostakis et al. (2015) and Breitung and Demetrescu (2015). Both of these, however, assume that either all of the predictors are weakly persistent, or all of the predictors are strongly persistent, thereby disallowing sets of predictors with mixed orders of persistence. The bootstrap tests of Bauer and Hamilton (2018) also allow for multiple predictors, but again make the same assumption. Amihud et al. (2009) also allow for multiple predictors but these must all be weakly dependent. Under the assumption of a common order of persistence, it may be possible to generalise the approach outlined in this paper to accommodate multiple predictors. Investigating this possibility and how well it works in practice compared to the other tests mentioned above is beyond the scope of the present paper but would constitute an interesting topic for further research.