Inference in Heavy-Tailed Nonstationary Multivariate Time Series

Abstract We study inference on the common stochastic trends in a nonstationary, N-variate time series yt, in the possible presence of heavy tails. We propose a novel methodology which does not require any knowledge or estimation of the tail index, or even knowledge as to whether certain moments (such as the variance) exist or not, and develop an estimator of the number of stochastic trends m based on the eigenvalues of the sample second moment matrix of yt. We study the rates of such eigenvalues, showing that the first m ones diverge, as the sample size T passes to infinity, at a rate faster by O(T) than the remaining N – m ones, irrespective of the tail index. We thus exploit this eigen-gap by constructing, for each eigenvalue, a test statistic which diverges to positive infinity or drifts to zero according to whether the relevant eigenvalue belongs to the set of the first m eigenvalues or not. We then construct a randomized statistic based on this, using it as part of a sequential testing procedure, ensuring consistency of the resulting estimator of m. We also discuss an estimator of the common trends based on principal components and show that, up to an invertible linear transformation, the estimator is consistent in the sense that the estimation error is of smaller order than the trend itself. Importantly, we present the case in which we relax the standard assumption of iid innovations, by allowing for heterogeneity of a very general form in the scale of the innovations. Finally, we develop an extension to the large dimensional case. A Monte Carlo study shows that the proposed estimator for m performs particularly well, even in samples of small size. We complete the article by presenting two illustrative applications covering commodity prices and interest rates data. Supplementary materials for this article are available online.


Introduction
Since the seminal works by Engle and Granger (1987), Stock and Watson (1988) and Johansen (1991), investigating the presence and number m of common stochastic trends has become an essential step in the analysis of multivariate time series which are non-stationary over time. Inference on m is of great importance on its own, as it has a "natural" interpretation in many applications: for example, it can provide the number of non-stationary factors in Nelson-Siegel type term structure models; it allows one to assess the presence of (long-run) integration among financial markets (Kasa, 1992), or, in the context of asset pricing, the number of portfolios that allow hedging of long-run risks (Alexander et al., 2002). In terms of theory, available estimators are based either on sequential testing (see e.g. Boswijk et al., 2015), or on information criteria (see Aznar and Salvador, 2002, and Qu and Perron, 2007).
Section 7 concludes. Further simulations, and all technical lemmas and proofs, are relegated to the Supplement.
Throughout the paper, we make use of the following notation. For a given matrix A ∈ R^{n×m}, we denote its element in position (i, j) as A_{i,j}; we also let λ_{(j)}(A) denote the j-th largest eigenvalue of A. We denote by c_0, c_1, ... positive, finite constants whose value can change from line to line. We denote the k-times-iterated logarithm of x (truncated at zero) as ln_k x, e.g. ln_2 x = max{ln ln x, 0}. The backshift operator is denoted as L.

Theory
Consider an N-dimensional vector y_t with MA(∞) representation

(2.1) y_t = C(L) ε_t, where C(L) = Σ_{j=0}^∞ C_j L^j.

We assume that y_0 = 0 and ε_t = 0 for t ≤ 0, for simplicity and with no loss of generality; inspecting the proofs, all the results derived here can be extended to more general assumptions concerning e.g. the initial value y_0. Standard arguments allow us to represent (2.1) as

(2.2) y_t = C Σ_{s=1}^t ε_s + C*(L) ε_t,

where C = C(1); equation (2.2) is derived from the multivariate Beveridge–Nelson decomposition of the filter C(L) (see Watson, 1994).
We assume that the N × N matrix C can have reduced rank, say m. This corresponds to assuming that the long-run behavior of the N-dimensional vector y_t is driven by m non-stationary common factors.

Assumption 1. It holds that: (i) rank(C) = m, where 0 ≤ m ≤ N; (ii) C_j = O(ρ^j) for some 0 < ρ < 1.
Part (ii) of the assumption requires that the MA coefficients C_j decline geometrically. This assumption is similar to Assumption 1 in Caner (1998), where the C_j's are assumed to decline at a rate which increases as the tail index of the innovations ε_t decreases. The case m = 0 corresponds to y_t being (asymptotically) strictly stationary. Assumption 1 is also implied by Assumption 2.1 in She and Ling (2020), where a finite-order VAR model under the classic I(1) conditions stated e.g. in Ahn and Reinsel (1990) is considered.
On account of the possible rank reduction of C, a different formulation of (2.2) may be helpful. It is always possible to write C = PQ, where P and Q are full rank matrices of dimensions N × m and m × N respectively. Defining the m-dimensional process x_t = Q Σ_{s=1}^t ε_s, and using the short-hand notation u_t = C*(L) ε_t, (2.2) can be given the following common trend representation:

(2.3) y_t = P x_t + u_t.
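To fix ideas, the common trend representation (2.3) is easy to simulate. The sketch below is our own illustration (not the paper's code): heavy-tailed innovations are drawn by inverse-transform sampling from a symmetrised Pareto-type law with tail index η, and the choices of dimensions, of u_t = ε_t, and of the generator itself are illustrative assumptions.

```python
import numpy as np

# Sketch: simulate y_t = P x_t + u_t with m common stochastic trends and
# heavy-tailed innovations. All design choices here are illustrative.
rng = np.random.default_rng(0)
N, m, T, eta = 4, 2, 500, 1.5

def heavy_tailed(size, eta, rng):
    # Symmetrised Pareto draws: P(|e| > x) ~ x^(-eta), so that the
    # variance is infinite whenever eta < 2.
    u = rng.uniform(size=size)
    signs = rng.choice([-1.0, 1.0], size=size)
    return signs * (1.0 - u) ** (-1.0 / eta)

eps = heavy_tailed((T, N), eta, rng)   # innovations eps_t
Q = rng.normal(size=(m, N))            # m x N, full row rank
P = rng.normal(size=(N, m))            # N x m loadings, full column rank
x = np.cumsum(eps @ Q.T, axis=0)       # x_t = Q * sum_{s<=t} eps_s
u = eps                                # white-noise stand-in for C*(L) eps_t
y = x @ P.T + u                        # common trend representation (2.3)
print(y.shape)
```

Here the stationary component u_t is pure white noise for brevity; any geometrically decaying MA filter would fit Assumption 1(ii).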
We now make some assumptions on the error term ε_t.
Assumption 2. It holds that: (i) {ε_t, 1 ≤ t ≤ T} is an i.i.d. sequence, with ε_t = 0 for all t < 0 and y_0 = 0; (ii) for all nonzero vectors l ∈ R^N, l′ε_t has distribution F_{lε} with strictly positive density, which is in the domain of attraction of a strictly stable law G with tail index 0 < η ≤ 2.
As is generally the case in the analysis of time series with possibly infinite variance, we assume that ε_t is i.i.d. Part (ii) of the assumption implicitly states that the vector ε_t has a multivariate distribution which belongs to the domain of attraction of a strictly stable, multivariate law (see Theorem 2.1.5(a) in Samorodnitsky and Taqqu, 1994) with common tail index η, so that linear combinations of them can be constructed. The assumption implies that, when η < 2, E|ε_{i,t}|^p < ∞ for all 0 ≤ p < η, whereas E|ε_{i,t}|^η = ∞ (see Petrov, 1974). Also, by Property 1.2.15 in Samorodnitsky and Taqqu (1994), it holds that, as x → ∞,

P(l′ε_t > x) ∼ c_{l,1} x^{−η} L(x) and P(l′ε_t ≤ −x) ∼ c_{l,2} x^{−η} L(x),

where L(x) is a slowly varying function in the sense of Karamata (see Seneta, 2006 for a review), and c_{l,1}, c_{l,2} ≥ 0 with c_{l,1} + c_{l,2} > 0. The condition that G is strictly stable entails that c_{l,1} = c_{l,2} when η = 1, thus ruling out asymmetry in that case (see Property 1.2.8 in Samorodnitsky and Taqqu, 1994). Finally, define S_{11} = Σ_{t=1}^T y_t y_t′ and S_{00} = Σ_{t=1}^T Δy_t Δy_t′.
We report a set of novel results for the eigenvalues of S_{11} and S_{00}, which we require for the construction of the test statistics.
Proposition 1 states that the first m eigenvalues of S_{11} diverge at a faster rate than the remaining ones (faster by an order of "almost" T), thus entailing that the spectrum of S_{11} has a "spiked" structure.
In order to study S_{00} and its eigenvalues, we need the following assumption, which complements Assumption 1(ii).
The integral Lipschitz condition in Assumption 3 is a technical requirement needed for Δy_t to be strong mixing with geometrically declining mixing numbers, and it is a standard requirement in this literature (see e.g. Withers, 1981, and Pham and Tran, 1985).
(2.7) holds for every ε > 0 and every integer n. Also, there exists a random variable T_0 such that (2.8) holds for all T ≥ T_0 and every ε > 0.
Similarly to Proposition 1, Proposition 2 provides bounds for the spectrum of S_{00}. The bound in (2.7) has been shown in Trapani (2014), and it is a Chover-type Law of the Iterated Logarithm (Chover, 1966). As shown in Trapani (2014), the bound in (2.7) is almost sharp. The lower bound implied in (2.8) is also almost sharp.
The spectrum of S_{11} (and, in particular, the different rates of divergence of its eigenvalues) can, in principle, be employed in order to determine m. However, S_{11} has two main problems which make it unsuitable for direct usage. First, by Proposition 1, the spectrum of S_{11} depends on the nuisance parameter η. Second, the matrix S_{11} and, consequently, its spectrum depend on the unit of measurement of the data, and thus they are not scale-free. In order to construct scale-free and nuisance-free statistics, we propose to rescale S_{11} by S_{00}. Proposition 2 ensures that this is possible: by equation (2.8), the inverse of S_{00} cannot diverge too fast, and therefore the spectrum of the matrix S_{00}^{−1} S_{11} should still have m eigenvalues that diverge at a faster rate than the others. This is shown in the next theorem.
Theorem 1 states that the spectrum of S_{00}^{−1} S_{11} has a similar structure to the spectrum of S_{11}: the first m eigenvalues are spiked, and their rate of divergence is faster than that of the remaining eigenvalues by a factor of almost T. More importantly, by normalising S_{11} by S_{00}, dependence on η is relegated to the slowly-varying (logarithmic) terms. In essence, apart from the slowly varying sequences, equations (2.9) and (2.10) imply that the rates of divergence of the eigenvalues of S_{00}^{−1} S_{11} are of order (arbitrarily close to) O(T) for the spiked eigenvalues, and (arbitrarily close to) O(1) for the other ones. This is the key property of λ_{(j)}(S_{00}^{−1} S_{11}): dividing S_{11} by S_{00} washes out the impact of the tail index η, which essentially does not play any role in determining whether λ_{(j)}(S_{00}^{−1} S_{11}) diverges or not. Although this result pertains to first order asymptotics (i.e., rates), it is possible to find an analogy between the result in Theorem 1 and approaches based on eliminating nuisance parameters using self-normalisation (see e.g. Shao, 2015).
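The eigen-gap described in Theorem 1 is easy to visualise numerically. The toy check below is our own illustration, using Gaussian innovations (the boundary case η = 2) and a system with N = 3 series driven by m = 1 trend; the leading eigenvalue of S_00^{-1} S_11 grows roughly like T while the others stay bounded.

```python
import numpy as np

# Sketch: eigenvalues of S_00^{-1} S_11 for a toy DGP with one common trend.
rng = np.random.default_rng(1)
N, m, T = 3, 1, 2000
eps = rng.normal(size=(T, N))
P = np.ones((N, m)); Q = np.ones((m, N))
y = np.cumsum(eps @ Q.T, axis=0) @ P.T + eps      # y_t = P x_t + u_t
dy = np.diff(y, axis=0, prepend=np.zeros((1, N)))  # first differences
S11 = y.T @ y
S00 = dy.T @ dy
lam = np.sort(np.linalg.eigvals(np.linalg.solve(S00, S11)).real)[::-1]
print(lam[0] / T, lam[1])  # spiked eigenvalue of order T; the rest of order 1
```

Rerunning with different T shows lam[0] scaling (up to slowly varying terms) with the sample size, which is the property the T^{-κ} rescaling in Section 3 exploits.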

Inference on the common trends
In this section we collect our main results on estimation of, and inference on, the common trends in the possible presence of heavy tails. In Section 3.1, we present a novel one-shot test on the (minimum) number of common trends. In Section 3.2, we introduce a sequential procedure for the determination of the number of common trends. Estimation of the common trends and of the associated factor loadings is presented in Section 3.3.
3.1. Testing hypotheses on the number of common trends. We consider testing hypotheses on the number of common trends m. The tests we propose herein will form the basis of our sequential procedure for the determination of the number of common trends (see Section 3.2). Specifically, we consider the null hypothesis

(3.1) H_0: m ≥ j versus H_A: m < j,

where j ∈ {1, ..., N} is a (user-chosen) lower bound on the number of common trends. For instance, a test of non-stationarity (the presence of at least one common trend) against the alternative of strict stationarity corresponds to j = 1; testing the null of no cointegration (m = N) corresponds to j = N.
Based on Theorem 1, we propose to use

(3.2) φ_T^{(j)} = T^{−κ} λ_{(j)}(S_{00}^{−1} S_{11}),

where κ ∈ (0, 1); criteria for the choice of κ in applications are discussed in Section 5. Importantly, the T^{−κ} term in (3.2) is used in order to exploit the discrepancy in the rates of divergence of λ_{(j)}(S_{00}^{−1} S_{11}) under H_0 and under H_A. In particular, it ensures that T^{−κ} λ_{(j)}(S_{00}^{−1} S_{11}) drifts to zero under H_A (i.e., whenever j > m), whereas it still passes to infinity under H_0 (i.e., when j ≤ m). According to (2.10), this only requires a very small value of κ, which also allows φ_T^{(j)} to diverge at a rate close to T under H_0. On account of Theorem 1, it holds that P(ω : lim_{T→∞} φ_T^{(j)} = ∞) = 1 for 0 ≤ j ≤ m; hence, under the null that m ≥ j, we can assume that lim_{T→∞} φ_T^{(j)} = ∞. Conversely, P(ω : lim_{T→∞} φ_T^{(j)} = 0) = 1 for j > m, so that under the alternative that j > m, φ_T^{(j)} drifts to zero. In short, φ_T^{(j)} diverges to positive infinity, or converges to zero, according to whether λ_{(j)}(S_{00}^{−1} S_{11}) is "large" or "small". Since the limiting law of φ_T^{(j)} under the null is unknown, we propose a randomised version of it. The construction of the test statistic is based on the following three-step algorithm, which requires a user-chosen weight function F(·) with support U ⊆ R.
Step 1. Generate an artificial sample {ξ_i^{(j)}, 1 ≤ i ≤ M} of i.i.d. N(0, 1) random variables, independent of the data.

Step 2. For each u ∈ U, define the Bernoulli sequence ζ_i^{(j)}(u) = I(ξ_i^{(j)} (φ_T^{(j)})^{1/2} ≤ u), 1 ≤ i ≤ M, and compute θ_{T,M}^{(j)}(u) = 2 M^{−1/2} Σ_{i=1}^M (ζ_i^{(j)}(u) − 1/2).

Step 3. Compute

(3.4) Θ_{T,M}^{(j)} = ∫_U [θ_{T,M}^{(j)}(u)]^2 dF(u),

where F(·) is the user-chosen weight function.

In Step 2, the binary variable ζ_i^{(j)}(u) is created for several values of u ∈ U, and in Step 3, the resulting statistics θ_{T,M}^{(j)}(u) are averaged across u, through the weight function F(·), thus attenuating the dependence of the test statistic on an arbitrary value of u. The following assumption characterizes F(·).
A possible choice for F (·) could be a distribution function with finite second moment; note that part (ii) of the assumption is trivially satisfied if U is bounded. Possible examples include a Rademacher distribution, based on choosing U = {−c, c} with F (c) = F (−c) = 1/2, or the standard normal distribution function.
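The three steps above can be sketched as follows. The formulas for ζ and θ are our reading of the standard randomised-testing construction in this literature (Trapani, 2016; Horváth and Trapani, 2019), not the paper's displayed equations; the Rademacher weight on U = {−1, 1} with mass 1/2 at each point follows the example just given.

```python
import numpy as np

# Sketch of Steps 1-3 of the randomised test. When phi diverges (null-like),
# Theta behaves like a chi-squared(1) draw; when phi drifts to zero
# (alternative-like), Theta diverges at rate M.
def randomised_stat(phi, M, rng, us=(-1.0, 1.0), weights=(0.5, 0.5)):
    xi = rng.normal(size=M)                      # Step 1: artificial N(0,1) sample
    theta2 = []
    for u in us:                                 # Step 2: Bernoulli sequence
        zeta = (xi * np.sqrt(phi) <= u).astype(float)
        theta = 2.0 / np.sqrt(M) * np.sum(zeta - 0.5)
        theta2.append(theta ** 2)
    return float(np.dot(weights, theta2))        # Step 3: average across u

rng = np.random.default_rng(2)
big = randomised_stat(phi=1e8, M=1000, rng=rng)    # null-like: phi very large
small = randomised_stat(phi=1e-8, M=1000, rng=rng)  # alternative-like: phi tiny
print(big, small)
```

In the alternative-like call every ζ_i(u) is degenerate, so the statistic equals M exactly, illustrating why power increases with M.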
Let P* denote the conditional probability with respect to the original sample; we use the notation "→_{D*}" to denote convergence in distribution with respect to P*.
Theorem 2 provides the limiting behaviour of Θ_{T,M}^{(j)}, also illustrating the impact of M on the size and power trade-off. According to (3.7), the larger M, the higher the power; conversely, upon inspecting the proof, it emerges that θ_{T,M}^{(j)}(u) attains its limiting null distribution only if M does not grow too fast relative to T, as stipulated in (3.5). The one-shot test developed in this section has at least three advantages compared to existing methods.
First, our approach can also be implemented to check (asymptotic) strict stationarity. Indeed, running the test for j = 1 corresponds to the null hypothesis that the data are driven by at least one common trend; rejection supports the alternative of stationarity. Second, running the test with j = N corresponds to the null hypothesis that the N variables do not cointegrate, thus offering a test for the null of no cointegration against the alternative of (at least) one cointegrating relation. Finally, we point out a further advantage over the well-known method of Johansen (1991). Johansen's likelihood ratio test allows one to test the null of rank R (i.e., of m = N − R common trends), where R is user-chosen, versus the alternative of rank greater than R (i.e., fewer than N − R common trends). However, while the limiting distribution under the null is well-known, if the true rank is lower than R, then the limiting distribution is different (see Bernstein and Nielsen, 2019). Hence, Johansen's test should be used only if the practitioner knows that the rank cannot be lower than R. In contrast, since our null hypothesis is formulated as a minimum bound on the number of common trends, our test does not have this drawback.
A final remark on the test is in order. Letting 0 < α < 1 denote the nominal level of the test, and defining c_α such that P(χ²_1 > c_α) = α, an immediate consequence of the theorem is that under H_A it holds that lim_{min(T,M)→∞} P*(Θ_{T,M}^{(j)} > c_α) = 1 for almost all realisations of {ε_t, 0 < t < ∞}, i.e. the test is consistent under the alternative. Conversely, under H_0 we have, for almost all realisations of {ε_t, 0 < t < ∞}, lim_{min(T,M)→∞} P*(Θ_{T,M}^{(j)} > c_α) = α.
As noted also in Horváth and Trapani (2019), our test is constructed using a randomisation which does not vanish asymptotically, and therefore the asymptotics of Θ_{T,M}^{(j)} is driven by the added randomness. Thus, different researchers using the same data will obtain different values of Θ_{T,M}^{(j)} and, consequently, different p-values; indeed, if an infinite number of researchers were to carry out the test, the p-values would follow a uniform distribution on [0, 1]. This is a well-known feature of randomised tests. In order to ameliorate this, Horváth and Trapani (2019) suggest computing Θ_{T,M}^{(j)} S times, each time s using an independent artificial sample ξ_{i,s}^{(j)}, 1 ≤ i ≤ M, 1 ≤ s ≤ S, thence defining the fraction of non-rejections Θ̃^{(j)} = S^{−1} Σ_{s=1}^S I(Θ_{T,M,s}^{(j)} ≤ c_α). Based on standard arguments, under H_0 the Law of the Iterated Logarithm yields that, almost surely, Θ̃^{(j)} cannot eventually fall below 1 − α by more than (2α(1 − α) ln ln S/S)^{1/2}. Hence, a "strong rule" to decide in favour of H_0 is

(3.11) Θ̃^{(j)} ≥ 1 − α − (2α(1 − α) ln ln S/S)^{1/2}.

Decisions made on the grounds of (3.11) have vanishing probabilities of Type I and Type II errors, and are the same for all researchers: having S → ∞ washes out the added randomness. No restrictions are required on S as it passes to infinity, so it can be arbitrarily large.
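The "strong rule" (3.11) is cheap to evaluate. In the sketch below we mimic the S replications under H_0 by i.i.d. χ²_1 draws (their conditional limit) instead of re-running the full randomised test; the values α = 0.05 and S = 10,000 are the ones used later in the applications.

```python
import math
import numpy as np

# Sketch: the S-replication "strong rule". Under H0 the replications of Theta
# are (conditionally, asymptotically) i.i.d. chi-squared(1), so the fraction of
# non-rejections concentrates at 1 - alpha and, by the LIL, eventually exceeds
# the threshold 1 - alpha - sqrt(2 * alpha * (1 - alpha) * lnln(S) / S).
rng = np.random.default_rng(3)
alpha, S = 0.05, 10_000
c_alpha = 3.8415                          # 95% quantile of chi-squared(1)
theta_s = rng.chisquare(df=1, size=S)     # H0-like replications of Theta
frac_nonrej = float(np.mean(theta_s <= c_alpha))
threshold = 1 - alpha - math.sqrt(2 * alpha * (1 - alpha)
                                  * math.log(math.log(S)) / S)
decide_h0 = frac_nonrej >= threshold      # strong rule in favour of H0
print(round(threshold, 4))
```

With α = 0.05 and S = 10,000 the threshold evaluates to 0.9454, the value quoted in the empirical section.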
3.2. Determining m. In order to determine the number of common trends m, we propose to cast the individual one-shot tests discussed above in a sequential procedure, where different values j = 1, 2, ... for the lower bound on m are tested sequentially (note that the individual tests must be based on artificial random samples independent across j; see below).
The estimator of m (say, m̂) is the output of the following algorithm.

Algorithm 1.

Step 1. Run the test for H_0 : m ≥ 1 based on Θ_{T,M}^{(1)}. If the null is rejected, set m̂ = 0 and stop; otherwise go to the next step.
Step 2. Starting from j = 2, run the test for H_0 : m ≥ j based on Θ_{T,M}^{(j)}, constructed using an artificial sample {ξ_i^{(j)}, 1 ≤ i ≤ M} drawn independently of those used for previous values of j. If the null is rejected, set m̂ = j − 1 and stop; otherwise, if j = N, set m̂ = N; otherwise, increase j and repeat Step 2.
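The control flow of Algorithm 1 can be sketched as below; `test_rejects` is a hypothetical interface to the one-shot test of Section 3.1, stubbed here with an oracle so that only the stopping logic is illustrated.

```python
# Sketch of Algorithm 1 (bottom-up): test H0: m >= j for j = 1, 2, ..., N and
# stop at the first rejection, returning mhat = j - 1; if no test rejects,
# return mhat = N. `test_rejects` is a hypothetical callable, not the paper's.
def estimate_m(N, test_rejects):
    for j in range(1, N + 1):
        if test_rejects(j):      # rejection of H0: m >= j means m < j
            return j - 1
    return N

# Oracle stub: with true m = 2, the test "rejects" exactly when j > m.
mhat = estimate_m(5, lambda j: j > 2)
print(mhat)
```

In practice each call would build Θ_{T,M}^{(j)} from an independent artificial sample and compare it with the chi-squared critical value, as described above.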
Consistency of the proposed procedure is presented in the next theorem.
Theorem 3. Let Assumptions 1-4 hold, and define the critical value of each individual test as c_α = c_α(M), such that c_α → ∞ as min(T, M) → ∞.
Theorem 3 states that m̂ is consistent, as long as the nominal level α of the individual tests is chosen so as to drift to zero: no specific rates are required. This can be better understood upon inspecting the proof of the theorem: letting α denote the level of each individual test, in (C.9) we show that, as min(T, M) → ∞, P*(m̂ = m) → (1 − α)^{N−m}, whence the requirement c_α → ∞, which entails α → 0. This can also be read in conjunction with Johansen's procedure (Johansen, 1991) and its bootstrap implementations (Cavaliere et al., 2012), whose outcome is an estimate of m, say m̃, such that, asymptotically, P(m̃ = m) = 1 − α for a given nominal level α of the individual tests. On account of (C.9), in our case choosing a fixed nominal level α would yield, as mentioned above, P*(m̂ = m) = (1 − α)^{N−m}, which depends on the unknown m and is, for N − m > 1, worse than Johansen's procedure. On the other hand, in our procedure the individual tests are independent (conditional on the sample). Thus, in order to match the 1 − α probability of selecting the true m, one can use a Bonferroni correction with α/N as nominal level for each test, rather than α. In this case, the same calculations as in the proof of Theorem 3 (and Bernoulli's inequality) yield P*(m̂ = m) ≥ (1 − α/N)^{N−m} ≥ 1 − α. Finally, as an alternative to the Bonferroni correction, one could consider the following top-down algorithm.
Step 1. Run the test for H_0 : m = N based on Θ_{T,M}^{(N)}. If the null is not rejected, set m̌ = N and stop; otherwise go to the next step.
Step 2. Starting from j = N − 1, run the test for H_0 : m ≥ j based on Θ_{T,M}^{(j)}, constructed using an artificial sample {ξ_i^{(j)}, 1 ≤ i ≤ M} drawn independently across j. If the null is not rejected, set m̌ = j and stop; otherwise, if j = 1, set m̌ = 0; otherwise, decrease j and repeat Step 2.
The proof of the consistency of m̌, and some Monte Carlo evidence (showing that the performance of m̌ is virtually indistinguishable from that of m̂), are in the Supplement.
3.3. Estimation of the common trends. Recall the common trend representation provided in (2.3), y_t = P x_t + u_t. After determining m, it is possible to estimate the non-stationary common stochastic trends x_t by using Principal Components (PC), in a similar fashion to Peña and Poncela (2006) and Zhang et al. (2019).
In particular, let υ̂_j denote the eigenvector corresponding to the j-th largest eigenvalue of S_{11}, under the orthonormalisation restrictions ‖υ̂_j‖ = 1 and υ̂_i′ υ̂_j = 0 for all i ≠ j. Then, defining P̂ = (υ̂_1, ..., υ̂_m), the estimator of the common trends x_t is x̂_t = P̂′ y_t. The next theorem provides the consistency (up to a rotation) of the estimators of P and x_t. Interestingly, the convergence rate of P̂ is not affected by the tail index.
Theorem 4. We assume that Assumptions 1-5 are satisfied. Then (3.13) and (3.14) hold.

Theorem 4 states that both P̂ and x̂_t are consistent estimators of P and x_t, up to an invertible linear transformation, since it is only possible to provide a consistent estimate of the eigenspace, as opposed to the individual eigenvectors. Equation (3.13) states that the estimated loadings P̂ provide a superconsistent estimator of (a linear combination of the columns of) P. This result, which is the same as in the case of finite variance, is a consequence of the fact that x_t is an "integrated" process, and it is related to the eigen-gap found in Proposition 1. Equation (3.13) could also be read in conjunction with the literature on large factor models, where, contrary to our case, it is required that N → ∞. In that context, Bai (2004) obtains the same result as in (3.13), albeit for the case of finite variance: thus, in the presence of integrated processes, the PC estimator is always superconsistent, irrespective of whether N passes to infinity or not.
According to (3.14), x̂_t also is a consistent estimator of the space spanned by x_t. The "noise" component does not drift to zero and, when η < 1, it may even diverge; however, the "signal" x_t is of order O_P(t^{1/η}), and therefore it dominates the estimation error (in fact, when η < 1, the estimation error is smaller by a factor T). This result can be compared to the estimator proposed by Gonzalo and Granger (1995), which is studied under finite second moments and requires a full specification of the VECM. Furthermore, (3.14) can also be compared with the findings in the large factor models literature (see, in particular, Lemma 2 in Bai, 2004). As far as uniform rates in t are concerned, in the proof of the theorem we also show a stronger, uniform-in-t, result. This result is unlikely to be sharp, since it is based on the (rather crude) fact that the maximum of any T-dimensional sequence with finite p-th moment is bounded by O_P(T^{1/p}).
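A minimal sketch of the PC estimator of Section 3.3 (our own toy implementation, with Gaussian innovations for simplicity): P̂ collects the m leading eigenvectors of S_11, x̂_t = P̂′y_t, and the fit is assessed only up to an invertible transformation, in line with Theorem 4.

```python
import numpy as np

# Sketch: PC estimation of the common trends. Phat stacks the m leading
# eigenvectors of S_11 (orthonormal by construction of eigh); xhat_t = Phat' y_t
# recovers x_t up to an invertible linear transformation.
rng = np.random.default_rng(4)
N, m, T = 4, 1, 2000
x = np.cumsum(rng.normal(size=(T, m)), axis=0)    # random-walk common trend
P = rng.normal(size=(N, m))                        # loadings
y = x @ P.T + rng.normal(size=(T, N))              # y_t = P x_t + u_t
S11 = y.T @ y
vals, vecs = np.linalg.eigh(S11)                   # eigenvalues in ascending order
Phat = vecs[:, -m:]                                # m leading eigenvectors
xhat = y @ Phat                                    # xhat_t = Phat' y_t
# Measure the fit up to an invertible transformation: regress x on xhat.
b = np.linalg.lstsq(xhat, x, rcond=None)[0]
rel_err = float(np.mean((x - xhat @ b) ** 2) / np.mean(x ** 2))
print(rel_err)
```

The residual is of the order of the stationary noise, while the signal grows with t, so the relative error shrinks with T, mirroring the signal-dominance argument above.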

Extensions
The framework developed in the previous sections does not allow for deterministic terms in the data, and requires ε_t to be identically distributed. We now discuss possible extensions of our set-up to accommodate heterogeneous innovations and deterministics. The main result of this section is that our procedure can be implemented even in these cases, with no modifications required.

4.1. Heterogeneous innovations.
We consider a novel framework where we allow for innovation heterogeneity of a very general form. Specifically, we assume that

(4.1) ε_t = h(t/T) u_t,

where u_t satisfies Assumption 2 and h(·) is a deterministic function. The representation in (4.1) has also been employed in order to model heteroskedasticity in the case of data with finite variance (see e.g. Cavaliere and Taylor, 2009). The only requirement on the scale function h(·) is that it has bounded variation on the interval [0, 1]. Functions of bounded variation and their properties are very well studied, and we refer to e.g. Hewitt and Stromberg (2013) for details. The design in (4.1) includes several cases of potential interest: h(r) can be piecewise constant, i.e. h(r) = Σ_{i=1}^{n+1} h_i I(c_{i−1} ≤ r < c_i), with c_0 = 0 and c_{n+1} = 1, thus allowing for jumps/regimes in the scale of ε_t; or it could be a polynomial function.
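The scale specification in (4.1) is straightforward to simulate. In the sketch below (our own illustration) h is piecewise constant with a single break, one of the bounded-variation designs mentioned above, and u_t is a heavy-tailed stand-in drawn from a Student's t with 1.5 degrees of freedom; the break location and scales are illustrative choices.

```python
import numpy as np

# Sketch of the heterogeneous-innovation design: eps_t = h(t/T) u_t with a
# piecewise-constant (bounded-variation) scale function.
rng = np.random.default_rng(5)
T = 1000
u = rng.standard_t(df=1.5, size=T)   # heavy-tailed stand-in for u_t

def h(r):
    # One-break volatility regime: scale 1 before r = 0.5, scale 3 after.
    return np.where(r < 0.5, 1.0, 3.0)

r = np.arange(1, T + 1) / T
eps = h(r) * u                       # eps_t = h(t/T) u_t
print(eps.shape)
```

A polynomial h, or one with several regimes, fits the same template by changing the function body only.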
Corollary 1. Let Assumptions 1-5 hold, with Assumption 2 modified to allow only for symmetric stable u_t.
Then, as min(T, M) → ∞ with (3.5), it holds that, for all j, (3.6) holds under H_0 with probability tending to 1; under H_A, (3.7) holds for each j, for almost all realisations of {ε_t, 0 < t < ∞}. Repeating the proof of Theorem 3, the results in Corollary 1 entail that, using Algorithm 1 in Section 3.2, P*(m̂ = m) → 1 with probability tending to 1: m̂ is still a consistent estimator of m. Similarly, it can be shown that the asymptotic properties of our estimators of the common trends and associated loadings are unaffected by heterogeneity in the scale, as in (4.1).

4.2. Deterministics. In this section we show that our procedure can also be applied to data which contain a deterministic term. Specifically, we consider the representation (4.3), where C and C*(L) are defined as before and a deterministic component is added to (2.2); as before, the representation is derived from the multivariate Beveridge–Nelson decomposition of the filter C(L). As is known, (4.3) can also be obtained from a VECM representation (4.4) (see She and Ling, 2020, and Yap and Reinsel, 1995), under the constraint μ = αρ, with ρ an m × 1 vector.
In this case, our procedure still yields the same results as without the deterministic term.

Monte Carlo evidence
In this section, we discuss the implementation/specification of our procedure and illustrate its finite sample properties through a small scale Monte Carlo exercise. To save space, we report only a limited number of results; further results are in the Supplement.
5.1. Design. Following She and Ling (2020), we simulate the N-variate VAR(1) model y_t = A y_{t−1} + ε_t, initialized at y_0 = 0, where A = I_N − P̄ P̄′, with P̄ an N × (N − m) matrix with orthonormal columns (i.e., P̄′ P̄ = I_{N−m}). The innovations ε_t are i.i.d. and coordinate-wise independent, drawn from a power law distribution with tail index η ∈ {0.5, 1, 1.5, 2}, generated following the procedure proposed by Clauset et al. (2009).

First, we note from unreported experiments that our procedure for the determination of the number of common trends is not particularly sensitive to the choice of the various specifications. In our experiments, we have used M = 100 to speed up the computations, but we note that results do not change when setting e.g. M = T, M = T/2 or M = T/4. In (3.2), we have used κ = 10^{−4}. This is a conservative choice, whose rationale follows from the fact that, in (3.2), dividing by T^κ serves the purpose of making the non-spiked eigenvalues drift to zero; the upper bound provided in (2.10) for such non-spiked eigenvalues is given by slowly varying functions, which suggests that even a very small value of κ should suffice. Indeed, altering the value of κ has virtually no consequence. In order to compute the integral in (3.4), we use Gauss–Hermite quadrature: θ_{T,M}^{(j)}(u) in Step 2 of the algorithm is computed at the points u = ±√2 z_s, where the z_s, 1 ≤ s ≤ n_S, are the zeros of the Hermite polynomial H_{n_S}(z), with corresponding quadrature weights w_s; the roots z_s and weights w_s are tabulated e.g. in Salzer et al. (1952). We have used n_S = 2, which corresponds to u = ±1 with equal weight 1/2; in unreported experiments we tried n_S = 4 with the corresponding weights, but there were no changes up to the 4-th decimal in the empirical rejection frequencies. Finally, as far as the family-wise detection procedure is concerned, the level of the individual tests is α(T) = 0.05/T, as also used in Barigozzi and Trapani (2021); this corresponds to a critical value c_α which grows logarithmically with T. All routines are based on 1,000 iterations and are written in GAUSS 21.

These differences, however, vanish as T increases. The impact of N is also very clear: as the VAR dimension increases, the performance of m̂ tends to deteriorate, as expected, while inference improves for larger values of T. Indeed, whilst results for N = 3 are good even when η = 0.5 and T = 100, when N = 5 the estimator m̂ requires at least T = 200 in order to have a frequency of correctly picking the true value of m higher than 90%. This is, as noted above, more pronounced when m = N, and less so when m = 0. As can also be expected, our procedure improves as η increases; results are anyway very good even in the (very extreme) case η = 0.5, and the impact of η is less and less important as T increases. Finally, although Tables 1-3 focus only on the i.i.d. case, unreported experiments showed that results are essentially the same when allowing for serial dependence.
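The Gauss–Hermite step described above is easy to reproduce. The sketch below uses NumPy's `hermgauss` to recover the nodes and weights; with n_S = 2 the evaluation points u = √2·z_s are ±1 with normalised weight 1/2, matching the implementation used in the simulations.

```python
import numpy as np

# Sketch: Gauss-Hermite quadrature for the integral in (3.4) when F is the
# standard normal weight. hermgauss returns the zeros of H_{n_S} and the
# corresponding (unnormalised) quadrature weights.
n_S = 2
z, w = np.polynomial.hermite.hermgauss(n_S)   # zeros of H_2 and GH weights
u = np.sqrt(2.0) * z                          # evaluation points for theta(u)
w_norm = w / np.sqrt(np.pi)                   # normalise: weights sum to one
print(u, w_norm)
```

Setting n_S = 4 returns four points and weights in the same way, which is the robustness check reported above.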
In the Supplement, we report a broader set of results which, in addition to serial dependence in the errors ε_{i,t}, also compare the proposed method with classic information criteria. Results are in Tables S.14-S.19. Broadly speaking, our procedure is very good on average at estimating m, and better than the best-performing information criterion (BIC), for all values of N and T (and η). This is true across all values of m, including the stationary case (m = 0) and the no-cointegration case (m = N). Information criteria seem to perform marginally better when m = 0, but this is more than offset by their tendency to overestimate m in general, especially so when m = N and m = N − 1. When errors are serially correlated (see Tables S.16-S.19), results are affected, albeit marginally, but the relative performance of the various methods remains as described above. Finally, in order to evaluate our procedure for medium-large systems, in Tables S.20-S.22 in the Supplement we report results from a further experiment where we set N ∈ {10, 15, 20}. We find that our estimator delivers a superior performance with respect to information criteria for m = N and m = N − 1, especially for small values of η (when BIC grossly understates m, even for large values of T); BIC is (albeit marginally) superior when m = 0.

Real data examples
We illustrate our methodology through four empirical applications to: commodity prices (Section 6.1), U.S. interest rate data (Section 6.2), long run purchasing power parity (Section 6.3) and cryptocurrency markets (Section 6.4). In these applications, N ranges from 3 to 7.
6.1. Comovements among commodity prices. We consider a set of N = 7 commodity prices: three oil prices (WTI, Brent crude, and Dubai crude) and the prices of four metals (copper, gold, nickel, and cobalt).
The presence of common trends can be anticipated due to global demand factors (e.g. growth in emerging Asian countries and especially in China; or changes in preferences towards greener energy sources, which increase demand for copper and decrease demand for oil), and also due to global supply factors (e.g. related to the effect that oil prices have on transportation costs of other commodities; or driven by technological innovations which often require the use of cobalt; see e.g. Alquist et al., 2020). Moreover, the three oil prices should exhibit strong comovements, and similarly should the prices of metals, which are often used in combination in industry (e.g. copper and nickel). In order to study the presence of such common trends, we use a dataset consisting of monthly data from January 1990 to March 2021, corresponding to a sample of T = 373 monthly observations. We use the logs of the data, which are subsequently demeaned and detrended.
We have applied our methodology using the same specifications as described in Section 5, i.e. κ = 10^{−4}, M = 100 and n_S = 2 in (5.3). In order to assess robustness to these specifications, we have also considered other values of M (including M = T) and n_S = 4. In the top part of the table we report the estimated values of the tail index using Hill's estimator; the package 'ptsuite' in R has been employed, using a number of order statistics equal to k_T = 40. In light of Hill's estimator being inconclusive, we also report the outcome of the test by Trapani (2016). As far as this is concerned, we have used the modification suggested in Section 3.2 of Horváth and Trapani (2019); whilst we refer to that paper for details, essentially this consists of running the test S times (we have used S = 10,000) and computing the fraction of non-rejections. This is then compared against the threshold 1 − α − (2α(1 − α) ln ln S/S)^{1/2}; in our case, such threshold is 0.9454, with H_0 being rejected if the fraction of non-rejections falls below the threshold, and not rejected otherwise. In the table (top right part), we also report the number of cointegration relationships found by Johansen's procedure; this has been implemented using p = 2 lags in the VAR specification, as suggested by BIC, with constant and trend included when implementing the test. In the bottom half of the table, we report results on m̂ obtained using different specifications, as indicated in the table. In particular, in each sub-panel, the columns contain different values of the nominal level of the family-wise procedure, set equal to 0.05/T, 0.05/ln(T) and 0.05/N.
4 Data have been downloaded from https://www.imf.org/en/Research/commodity-prices
We report the results for the 7-dimensional series in Table 4. Initially, we report the (Hill's) estimates of the tail indices for the seven series; the associated confidence sets are quite large, but the test by Trapani (2016) supports the hypothesis that all series have infinite variance.
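For completeness, a minimal sketch of Hill's estimator in its standard textbook form (the paper relies on the R package 'ptsuite', whose implementation details may differ):

```python
import numpy as np

def hill_tail_index(x, k):
    """Hill's estimator of the tail index based on the k largest order
    statistics of |x|: the reciprocal of the average log-excess over the
    (k+1)-th largest observation (standard textbook form)."""
    a = np.sort(np.abs(np.asarray(x, dtype=float)))[::-1]   # descending
    return 1.0 / np.mean(np.log(a[:k]) - np.log(a[k]))
```

For an exact Pareto sample with tail index 1, the estimate concentrates around 1 as the sample grows.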
Estimation of m based on Johansen's sequential procedure for the determination of the cointegration rank (using either the trace tests or the maximum eigenvalue tests) provides ambiguous results, with the estimate of m ranging between 5 and 7 (the latter corresponding to no cointegration). These results, in the presence of heavy tails, are not reliable: the simulations in Caner (1998) show that Johansen's procedure does not control size under infinite variance, thus making the estimate of m inaccurate (in particular, it tends to be over-sized, thus leading to overestimation of m). In contrast, through our test we find strong evidence of m = 4 common stochastic trends. As shown in the table, our results are broadly robust to different values of M and κ. In far fewer cases we find m = 5, which might suggest the presence of a slowly mean reverting component in the data.
In order to shed more light on these findings, we split the series into two sub-groups: one of dimension N = 3 (comprising the three crude prices -Brent, Dubai and WTI crude), and one of dimension N = 4 (containing the four metal prices). Results for the 3-dimensional series of crude prices are in Table 5.
On the one hand, Johansen's tests in this case identify (at the 5% level, and only using the trace test) two common trends (m = 2). On the other hand, our methodology provides evidence of a single common stochastic trend (m = 1), and, in some rarer cases, of m = 2. Results concerning the N = 4 metals are in Table 6; in this case, evidence of m = 3 common trends emerges from all the procedures considered.
Overall, most of the evidence points towards m = 4 common stochastic trends; there is much less evidence in support of m = 5. The m = 4 common trends can be estimated as explained in Section 3.3. In order to identify the trends, based on the results above we propose to order the series as follows: WTI, gold, cobalt, copper, Brent crude, Dubai crude, nickel. Then, we constrain the upper m × m block of the estimated loadings matrix $\widehat{P}$ to be the identity matrix. In this way the trends are identified with the first four series, i.e. WTI, gold, cobalt, and copper. In Figure 1, we plot the estimated common trends $\widehat{x}_t$, and in Table 7 we report the associated loadings $\widehat{P}$.
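The identification step just described amounts to an invertible rotation of the PCA output. A minimal sketch, where `P_hat` and `X_hat` stand for generic estimated loadings and trends (the function and variable names are ours):

```python
import numpy as np

def identify_trends(P_hat, X_hat):
    """Rotate estimated loadings (N x m) and trends (m x T) so that the
    upper m x m block of the loadings is the identity matrix.  The fitted
    common component P_hat @ X_hat is unchanged by this invertible
    rotation; only the labelling of the trends is affected."""
    m = X_hat.shape[0]
    G = P_hat[:m, :]                 # upper m x m block (assumed invertible)
    return P_hat @ np.linalg.inv(G), G @ X_hat
```

After the rotation, the j-th trend is identified with the j-th series in the chosen ordering, exactly as in the text.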
By construction, the first and second trends ($\widehat{x}_{1,t}$ and $\widehat{x}_{2,t}$) are associated with oil prices and cobalt respectively. The third one ($\widehat{x}_{3,t}$) is associated with gold (by construction) and with nickel, with a negative loading; finally, the fourth trend ($\widehat{x}_{4,t}$) is associated with copper by construction, and with nickel (with a positive loading). The trends driving metals are also common to oil prices, albeit with smaller loadings.

and February 1999 (thus corresponding to T = 301 monthly observations). 5 As in She and Ling (2020), we consider the logs of the original data, say $Y_t = (y_{1,t}, y_{2,t}, y_{3,t})'$, where $y_{1,t}$ is the log of the 3-month Treasury Bill rate, $y_{2,t}$ is the log of the 1-year Treasury Bill rate, and $y_{3,t}$ represents the log of the Fed Fund rate. She and Ling (2020) carry out their analysis using a VAR(2) in error correction format, with $\mu = \alpha\rho$, such that the data do not have a linear trend but may have a non-zero mean in the stationary directions. In the light of Corollary 2, in this case our set-up can be applied with no modifications.
Moreover, we do not need to make specific assumptions on the lag structure for ∆Y t .
As demonstrated above, the main feature of our contribution is that we do not require any prior knowledge of the tail index η. This advantage can be further appreciated upon observing the estimates of η obtained in She and Ling (2020); see, in particular, their Figure 6, where η ranges between 1 and 1.5 (thereby implying that the data have infinite variance and, possibly, infinite mean). On account of this information, She and Ling (2020) report the critical values for the Likelihood Ratio tests in Caner (1998) for η = 1 and η = 1.5, using both values to build a decision rule.
In particular, She and Ling (2020) conclude that there is some evidence in favour of m = 1 (corresponding to two cointegration relationships). Upon inspecting their results, however, it seems that these are driven by the choice of the 5% nominal level: evidence in favour of m = 1 is much weaker when considering a 1% nominal level (in particular, in the latter case, should one use the critical values computed for η = 1, there would be no evidence of cointegration at all, whereas using the critical values computed for the case η = 1.5, one would conclude that m = 1). Interestingly, assuming finite variance, Johansen's sequential procedure rejects m = 2 and m = 3 in favour of m = 1, even at the 1% nominal level.

5 The data have been downloaded from the Federal Reserve Economic Data website, https://fred.stlouisfed.org

The top part of the table summarizes the findings using the various procedures. The BCT procedure has been implemented with M = 100, nominal level for individual tests equal to 0.05/T, $\kappa = 10^{-4}$ and $n_S = 4$; see also Panel B of the table for results obtained using different specifications. In Panel A of the table, the results based on She and Ling (2020) are taken from Table 4 in their paper. The results corresponding to Johansen's procedure (Johansen, 1991) have been derived setting p = 4 in the specification of the VAR, based on BIC. We report the outcome of our test under various specifications in Panel B of the table, similarly to Table 4.
In contrast to these mixed results, we find m = 2 common stochastic trends, corresponding to one cointegration relationship. This result is robust to different values of M and κ.
6.3. Testing for the weak form of the PPP. In this application, we check the validity of the weak form of the PPP, using the same dataset as in Falk and Wang (2003). In particular, for a set of 12 countries, we individually verify the presence of cointegration in the vectors $(p_{c,t}, p_{US,t}, FX_{c,t})'$, where $p_{c,t}$ is the (log of the) CPI of country c at time t, $p_{US,t}$ is the (log of the) US CPI index, and $FX_{c,t}$ is the log of the exchange rate of country c's currency vis-à-vis the dollar. The series are monthly, spanning from January 1973 to December 1999, corresponding to T = 324 for each country.

A minimum requirement for the weak form of the PPP to hold is that there is at least one cointegrating relation between the three series, i.e. at most m = 2 common stochastic trends. Falk and Wang (2003) find evidence of heavy tails in this dataset (see their Table III), with series typically having finite mean but infinite variance. Thence, the authors test for cointegration using critical values based on allowing for heavy tails (notice that, even in this case, the authors require prior knowledge of the tail index η). We compare results from Falk and Wang (2003) and Johansen's procedure with our proposed tests. As in the previous sections we use M = T, $\kappa = 10^{-2}$; the level for the individual tests is set to 0.05/ln T. Tests are based on the deviations from the initial values. Results are reported in Table 9.
We indicate with Y cases where the tests support the necessary condition for PPP to hold, i.e. m ≤ 2. Cases with no evidence supporting PPP, corresponding to m = 3, are denoted as N. Results corresponding to the paper by Falk and Wang (2003) are taken from their Tables VII (for the Gaussian case) and VIII (for the case with heavy tails). The test statistics employed are the maximum eigenvalue (λmax column) and the trace test, as developed by Johansen (1991) for the Gaussian case. We refer to Table V in Falk and Wang (2003) for the critical values of both tests computed under the assumption of heavy-tailed data. First, we notice that Johansen's method at the 10% (5%) nominal level under the assumption of finite variance indicates that the necessary condition for the weak form of the PPP holds for 10 (8) out of 12 countries. Second, if the critical values are adjusted using the estimated tail index, as in Falk and Wang (2003), then the support for the weak form of the PPP decreases. Specifically, they find that the number of countries with m ≤ 2 ranges between 3 (when using a maximum eigenvalue test at the 5% nominal level) and 8 (when using a trace test at the 10% nominal level). Our test provides support for PPP in 5 countries out of 12, which is closely in line with Falk and Wang (2003). Our empirical evidence reinforces the view that the assumption of finite variance can lead to spurious support for the PPP.
6.4. Statistical arbitrage in cryptocurrency markets. In the context of asset pricing, since the seminal contribution by Diebold et al. (1994), common trends and cointegration have played a prominent role.
Indeed, the presence of cointegration entails the possibility of constructing portfolios which are mean reverting, so that a profit can be made when such portfolios depart from their mean: the so-called "statistical arbitrage". Although the literature usually focuses on portfolios of two assets ("pairs trading"), it is possible to generalise the notion of statistical arbitrage to the multi-asset case (see Alexander et al., 2002). In such a case, having m common stochastic trends allows one to construct N − m mean reverting portfolios, each having weights given by a user-chosen linear combination of the cointegrating vectors. Cryptocurrency data have received increasing attention (Makarov and Schoar, 2020); however, despite the evidence that returns on cryptocurrencies exhibit heavy tails (see Pele et al., 2020), this is not usually accounted for in applications.
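As a stylised illustration of the idea, in a finite-variance bivariate setting of our own choosing (not the paper's data), a portfolio built from a cointegrating vector has bounded variance while the individual price series wander:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1_000

# One common trend driving two assets (classic pairs set-up).
trend = np.cumsum(rng.standard_normal(T))
y1 = trend + rng.standard_normal(T)            # asset 1, loading 1.0
y2 = 0.8 * trend + rng.standard_normal(T)      # asset 2, loading 0.8
Y = np.vstack([y1, y2])                        # 2 x T panel in levels

# The cointegrating vector annihilates the loadings (1, 0.8):
b = np.array([-0.8, 1.0])
spread = b @ Y                                 # mean-reverting portfolio value

# The spread's variance stays bounded while the levels wander.
print(np.var(y1), np.var(spread))
```

Trading rules then buy (sell) the portfolio when the spread is far below (above) its mean; with m trends and N assets, any of the N − m independent cointegrating vectors yields such a spread.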
We consider a portfolio of N = 6 cryptocurrencies, chosen among those with the largest market capitalisation within our sample period: Cardano (ADA), Bitcoin Cash (BCH), BitCoin (BTC), EOS, Ethereum (ETH), and XRP. 6 We use daily data from October 18, 2017 until September 18, 2020, which is equivalent to a sample of T = 1,066 observations. As in the previous applications, we consider deviations from the initial value. Table 10 shows that all cryptocurrencies have heavy tails, with the estimated tail indices around 1; despite some apparent heterogeneity in the tail indices, confidence intervals for Hill's estimator show that the common tail index assumption which characterises stable distributions is satisfied.
The results in Table 10, based on our test using different configurations of the tuning parameters M and κ, indicate m = 5 common trends in all cases considered, i.e. the presence of one cointegrating relation between the 6 cryptocurrencies considered. Note that, in contrast to our findings, Johansen's approach based on a VAR(p) with p = 10 (as suggested by BIC) broadly indicates m = 4 (m = 3 at the 10% nominal level or for larger values of p). However, the results by Caner (1998) indicate a tendency to understate m in the presence of heavy tails; therefore, the additional cointegrating relations found by Johansen's method are likely to be spuriously driven by the heavy-tailedness of the cryptocurrency returns.

In the top part of the table we report the estimated values of the tail index using Hill's estimator; the package 'ptsuite' in R has been employed, using a number of order statistics equal to $k_T = 32$, which corresponds to $O(T^{1/2})$. We also report the number of cointegration relationships found by Johansen's procedure; this has been implemented using p = 3 lags in the VAR specification, as suggested by BIC. We report the outcome of our test under various specifications in the bottom half of the table, similarly to Table 8.

Conclusions

In this paper, we propose a methodology for inference on the common trends in multivariate time series with heavy-tailed, heterogeneous innovations. We develop: (i) tests for hypotheses on the number of common trends; (ii) a sequential procedure to consistently detect the number of common trends; (iii) an estimator of the common trends and of the associated loadings. A key feature of our approach is that estimation of the tail index of the innovations is not needed, and no prior knowledge as to whether the data have finite variance or not is required. Indeed, the procedure can be applied even in the case of finite second moments.
Our method is based on the eigenvalues of the sample second moment matrix of the data in levels, the largest m of which (m being the unknown number of common trends) are shown to diverge at a higher rate, as T increases, than the remaining ones. Based on such rates, we propose a randomised statistic for testing hypotheses on m; its limiting distribution is chi-squared under the null, and it diverges under the alternative. Combining these individual tests into a sequence of tests, we prove consistency of the estimator of m by simply letting the nominal level shrink to zero at a proper rate. We also show that, once m is determined, estimation of the common trends and their loadings can be done using PCA. Our simulations show that our method has good properties, even in samples of small and moderate size.

In Tables S.11-S.13, we report the frequencies of the estimates of m, obtained from the same design as in Section 5 in the main paper. As can be seen, results are as good as with m.

A.2. Further evidence on m. Here, we use the same DGP as in Section 5, viz.
(A.1) $y_t = A y_{t-1} + \varepsilon_t$, but we also allow for serial correlation in $\varepsilon_t$. Specifically, the innovations $\varepsilon_t$ in (A.1) have again been generated as coordinate-wise independent, and we consider serial dependence through an AR(1) specification, $\varepsilon_{i,t} = \theta \varepsilon_{i,t-1} + e_{i,t}$ for 1 ≤ i ≤ N; we set θ ∈ {0, 0.5, −0.5}. The innovation $e_{i,t}$ is generated as i.i.d. across t, with a power law distribution of tail index η, using η ∈ {0.5, 1, 1.5, 2}; again, as a benchmark, we consider the case $\varepsilon_t \sim$ i.i.d. $N(0, I_N)$. In order to simulate the power law distribution, we generate $e_{i,t}$ accordingly; it is subsequently centered. The first 1,000 datapoints are discarded in order to avoid dependence on initial conditions. All other specifications are the same as in the main paper.
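A minimal sketch of this design follows. The inverse-transform Pareto draw used for the power-law shocks is our assumption, standing in for the generator not reproduced above, and all function and variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_panel(N, T, A, eta, theta=0.0, burn=1_000):
    """Simulate y_t = A y_{t-1} + eps_t as in (A.1), with coordinate-wise
    independent innovations following an AR(1): eps_{i,t} = theta *
    eps_{i,t-1} + e_{i,t}.  The shocks e_{i,t} are symmetrised Pareto-type
    draws with tail index eta (inverse transform -- our assumption), then
    centred; the first `burn` observations are discarded."""
    total = T + burn
    u = rng.uniform(size=(N, total))
    e = u ** (-1.0 / eta) * rng.choice([-1.0, 1.0], size=(N, total))
    e -= e.mean(axis=1, keepdims=True)       # centre the shocks
    eps = np.zeros((N, total))
    y = np.zeros((N, total))
    for t in range(1, total):
        eps[:, t] = theta * eps[:, t - 1] + e[:, t]
        y[:, t] = A @ y[:, t - 1] + eps[:, t]
    return y[:, burn:]
```

For instance, `simulate_panel(5, 500, np.eye(5), eta=1.5, theta=0.5)` produces a 5-variate panel with N random-walk trends and serially correlated, heavy-tailed innovations.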
We use information criteria as a comparison. In particular, we consider the three most popular ones (i.e., AIC, BIC and HQ), but in the tables we only report the results corresponding to BIC, which is consistently the best of the three across all experiments. 7 We point out that the use of information criteria in the context of cointegration analysis is well studied in the finite variance case (see e.g. Aznar and Salvador, 2002, and Cavaliere, De Angelis, Rahbek and Taylor, 2015); conversely, the case of infinite variance has not been studied for cointegrated systems (see however the contribution by Knight et al., 1989).
In our experiments, we use an "infeasible" version of each information criterion, assuming that the lag structure in (A.1) is known, so that the only quantity to be determined is the number of common trends m.
For all rank estimates, we report three measures of performance. Denoting the estimate of m at Monte Carlo iteration 1 ≤ mc ≤ 1,000 as $\widehat{m}_{mc}$, we define the average estimation error $ME = \frac{1}{1000}\sum_{mc=1}^{1000}\left(\widehat{m}_{mc} - m\right)$; the other two indicators are denoted as PCW and STD. As far as the PCW and ME indicators are concerned, the tables show that our procedure is very good on average at estimating m (and better than the best information criterion, BIC) for all values of N and T, even for the (rather extreme) case η = 0.5. This is true across all values of m, including the case of stationarity (m = 0) and of lack of cointegration (m = N). Such evidence is further corroborated by the STD indicator, which shows that our procedure is systematically better than BIC in this respect. Indeed, BIC performs (very) marginally better when m = 0, but this is more than offset when considering that BIC tends to overstate m in general (as the values of ME indicate), and especially when m = N and m = N − 1; in particular, BIC seems to perform puzzlingly badly when errors are Gaussian. This is even more remarkable as the information criteria are based on the (infeasible) assumption that the lag structure is known. In line with the theory, results are better for larger T; also, results do improve as η increases, although this becomes less evident as T increases. The impact of N is less clear, although generally speaking larger values of N worsen the performance of both BIC and our procedure when m = N, whereas N has virtually no impact for other values of m.
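The three performance measures can be computed as follows. Since only ME is defined explicitly above, the STD and PCW formulas below (standard deviation of the estimates, and percentage of exactly correct estimates) are our reading, flagged as an assumption:

```python
import numpy as np

def performance(m_hat, m_true):
    """Summarise Monte Carlo rank estimates: ME is the average estimation
    error (as defined in the text); STD and PCW -- standard deviation of
    the estimates and percentage of exactly correct estimates -- are our
    reading of the two remaining indicators."""
    m_hat = np.asarray(m_hat, dtype=float)
    me = np.mean(m_hat - m_true)            # signed average error
    std = np.std(m_hat)                     # dispersion of the estimates
    pcw = 100.0 * np.mean(m_hat == m_true)  # % of correct rank estimates
    return me, std, pcw
```

A positive ME then signals a tendency to overstate m, which is how the tables are read in the discussion above.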
Finally, and as also mentioned in the main paper, in Tables S.20-S.22 we report results using N ∈ {10, 15, 20}. An overview of the results is also given in Section 5 of the main paper.

B. Technical lemmas
Henceforth, we denote the $L_p$-norm of a scalar random variable X as $|X|_p = (E|X|^p)^{1/p}$; also, the subscript "⊥" denotes the orthogonal complement of a matrix. Further, we often use the notation R = N − m.
Lemma B.1. Consider a sequence $U_T$ for which $E|U_T|^{\delta} \leq a_T$, where $a_T$ is a positive, monotonically non-decreasing sequence, and let $d_{\delta} = 1$ if δ ≤ 2 and zero otherwise. Then there exists a $C_0 < \infty$ such that, for every ε > 0,

Proof. It holds that
having used Theorem B in Serfling (1970) when δ > 2, and the first theorem in the same paper when δ ≤ 2.
having used Markov's inequality. By the proof of Corollary 2.4 in Cai (2006) (see also Lemma A.5 in Horváth and Trapani, 2019), it ultimately follows that the sequence $U_T$ converges completely (Hsu and Robbins, 1947). The desired result now follows immediately from the Borel-Cantelli lemma.

Lemma B.2. We assume that Assumption 2(ii) is satisfied. Then there exists a random variable T 0 and a positive definite matrix D such that, for all
Proof. The lemma follows immediately upon noting that, by Assumption 2(ii) and Theorem 1.3 in Jain (1982), it holds a.s. that, for all $b \in \mathbb{R}^m$ with $b \neq 0$,

Proof. The lemma is an immediate consequence of Lemma B.2. Indeed, by the multiplicative Weyl inequality (see Theorem 7 in Merikoski and Kumar, 2004), it holds that, and Lemma B.2 readily implies that, for every j, Assumption 1 entails that $\lambda_{\min}(P'P) > 0$, whence the desired result.
Proof. The proof does not require any distinction between p < 2 and p = 2. We have

It holds that
where we have used the Cauchy-Schwarz inequality and the fact that $C^*_{m,ik} = O(\exp(-c_0 m))$, which follows from Assumption 1. Using Lemma B.1, the desired result follows immediately.
Proof. Recall that $C_{h,k}$ indicates the element of C in position (h, k). Using the definition of the Frobenius norm and the $L_p$-norm inequality, it follows that

Thus, using the $c_r$-inequality, after some algebra it holds that

where the constant $c_0 < \infty$ depends only on p, N and m. It is convenient to consider separately the cases 0 ≤ p < η when η ≤ 2 with $E\|\varepsilon_t\|^{\eta} = \infty$, and p = 2 when η = 2 with $E\|\varepsilon_t\|^{2} < \infty$. In the former case, let p < 2. In order to lighten the notation, we omit the subscripts j, h and k, and study

It holds that
with the convention that $\sum_{j=1}^{0} = 0$. Consider I; it holds that

by the summability of $C^*_m$. Also,

Similar steps as above yield $|\varepsilon_t|_p^p \leq c_0 T$, so that Lemma B.1 entails that $I_b = o_{a.s.}\left(T^{2/p}(\ln T)^{2(1+\varepsilon)/p}\right)$ for all ε > 0. Turning to $I_a$, when p ≤ 1,

When 1 < p ≤ 2, the von Bahr-Esseen inequality (von Bahr and Esseen, 1965) can be applied, giving

By Lemma B.1, it follows that, for all 0 < p ≤ 2, $I_a = o_{a.s.}\left(T^{2/p}(\ln T)^{(1+\varepsilon)/p}\right)$. Turning to II, note that

so that $II = o_{a.s.}\left(T^{2/p}(\ln T)^{2(1+\varepsilon)/p}\right)$. Finally, for III we apply the same logic as above to get

which yields $III = o_{a.s.}\left(T^{2/p}(\ln T)^{2(1+\varepsilon)/p}\right)$ as above. Finally, note that, again by the exponential summability of $C^*_m$,

this entails that $IV = o_{a.s.}\left((\ln T)^{2(1+\varepsilon)/p}\right)$. The lemma follows by putting everything together; the case η = 2 with finite variance follows from exactly the same calculations, with p = 2.
Lemma B.6. We assume that Assumptions 1-3 are satisfied. Then it holds that $z_t = l'\Delta y_t$ is a strong mixing sequence with mixing numbers $\alpha_m = O(\rho^m)$, for all l and some 0 < ρ < 1.
Proof. In order to prove the lemma, recall that, by Assumption 1, we have the following equivalent representations: where C * (L) is an invertible filter, and there exists an N × R matrix b of full rank R such that b ′ C = 0.
We prove the lemma by considering separately the cases m = 0, m = N and 0 < m < N. When m = N, we start by showing that $C^*(L)\varepsilon_t$ is strong mixing with geometrically declining mixing numbers, by verifying the conditions of Lemma 2.2 and Theorem 2.1 in Pham and Tran (1985). Firstly, Assumption 3 ensures that the error term $\varepsilon_t$ satisfies condition (i) in Lemma 2.2 in Pham and Tran (1985). Also, $C^*(L)$ is invertible, 8 and, using Assumption 1, it is easy to see that $\sum_{j=0}^{\infty}\|C^*_j\| < \infty$. Hence, Theorem 2.1 in Pham and Tran (1985) can be applied, yielding that $C^*(L)\varepsilon_t$ is a strongly mixing sequence with geometrically declining mixing numbers. Using Theorem 14.1 in Davidson (1994), it follows that both $\Delta y_t$ and $l'\Delta y_t$ are also strong mixing sequences with geometrically declining mixing numbers. When m = 0, the same arguments can be applied to (B.1), noting that in this case C(L) is invertible, and that the condition $\sum_{j=0}^{\infty}\|C_j\| < \infty$ follows from Assumption 1, which entails that $C_j$ declines geometrically. Finally, consider the case 0 < m < N. Since the N × N matrix $(b, b_{\perp})$ has full rank, any $l \in \mathbb{R}^N$ can be expressed as a linear combination of the columns of b and $b_{\perp}$, having used (B.1). We already know from above that $l'\left[C^*(L)\varepsilon_t - C^*(L)\varepsilon_{t-1}\right]$ is strong mixing with geometrically declining mixing numbers; the same (obviously) applies to the i.i.d. sequence $v_2' b_{\perp}' C\varepsilon_t$. The desired result follows again from Theorem 14.1 in Davidson (1994).

C. Proofs
Henceforth, we let $E^*$ and $V^*$ denote the expected value and the variance with respect to $P^*$, respectively.
Proof of Proposition 1. The result follows from Weyl's inequality (see e.g. Horn and Johnson, 2012) and Lemmas B.3, B.4 and B.5. Indeed, it holds that

Further, it holds that

In light of (C.3), Lemmas B.4 and B.5 entail that for all 0 ≤ p < η when η ≤ 2 with $E\|\varepsilon_t\|^{\eta} = \infty$, and p = 2 when η = 2 with $E\|\varepsilon_t\|^{2} < \infty$. Consider (C.1); when 0 ≤ j ≤ m, the term that dominates is $\lambda^{(j)}\left(P\sum_{t=1}^{T} x_t x_t' P'\right)$, and (2.5) follows immediately from Lemma B.3. When j > m, $\lambda^{(j)}\left(P\sum_{t=1}^{T} x_t x_t' P'\right) = 0$ by construction, and therefore (2.6) follows from (C.2).

8 We point out that in the paper by Pham and Tran (1985), condition (ii) in their Lemma 2.2 contains a typo, as it requires (using our notation) that $\sum_{j=0}^{\infty} C^*_j z^j \neq 0$ for all |z| ≤ 1; upon inspecting the proof, this is required in order to invert the polynomial $C^*(L)$, and thus it should instead read $\det\left(\sum_{j=0}^{\infty} C^*_j z^j\right) \neq 0$ for all |z| ≤ 1.
Proof of Proposition 2. Lemma B.6 entails that, for all $l \neq 0$, $z_t = l'\Delta y_t$ is strong mixing with geometrically declining mixing numbers. Thence, (2.7) follows immediately from Theorem 2.1 in Trapani (2014). Consider now (2.8), and define the sequence

Define the array $\omega_{T,t} = I(|z_t| > \varphi_T)$. We define the sigma-field $\mathcal{F}^{\omega}_{T,t}$

This entails that sup
where $\alpha\left(\mathcal{F}^{\omega,t}_{T,0}, \mathcal{F}^{\omega,\infty}_{T,t+k}\right)$ is the mixing number of order k, and $\alpha_k$ is the mixing number of order k associated to the sequence $z_t$. Since $\alpha_k = O(\rho^k)$, 0 < ρ < 1, it also follows that the sequence $\omega_{T,t}$ is strong mixing with geometrically declining mixing numbers. Note also that $\omega_{T,t}$ has finite moments up to any order; also, by Assumption 2(i), it holds that $E\omega_{T,t} = (c_{l,1} + c_{l,2})\,\varphi_T^{-\eta} L(\varphi_T)(1 + o(1))$. Letting $\bar{\omega}_{T,t} = \omega_{T,t} - E\omega_{T,t}$, equation (5.1) in Rio (1995) ensures that

and $Q_{\omega}(u)$ is the quantile function of $\varphi_T^{\eta}\bar{\omega}_{T,t}$. Consider now the function $f(x) = x\ln(1+x)$, and let $f^*$ be its Young dual function (see Appendix A in Rio, 1995); as y → ∞, $f^*(y) = \exp(y)$. Then it holds that

where U is a random variable with a uniform distribution on [0, 1] and $\|X\|_f$ is the Luxemburg norm (Luxemburg, 1955) of a random variable X with respect to the function f, i.e.

Note that we have
can be shown to correspond to $Ef\left(c^{-1}\left|\varphi_T^{\eta}\omega_{T,t}\right|^2\right) < 1$ after some algebra.
Noting that, by definition, $x_T = T$, equation (C.4) yields

By Lemma B.1, this entails that

Putting everything together, (2.8) obtains.
Proof of Theorem 2. The proof follows exactly the same lines as in Horváth and Trapani (2019), and we therefore omit most passages. Under $H_0$, it follows from Theorem 1 that

for every j and any ε > 0. Let Φ(·) denote the standard normal distribution, and note that

We can write

It holds that

on account of the independence of the $\xi^{(j)}_i$'s across i. Given that

we have

Also, it follows immediately that

Using Assumption 4(ii), (C.5), (3.5) and Markov's inequality, it is easy to see that

and therefore the desired result follows from the Central Limit Theorem. Under $H_A$, Theorem 1 entails that (C.8) holds. Noting that

and

we have

The desired result now follows from (C.8).
Proof of Theorem 3. The proof is the same as that of Theorem 3 in Trapani (2018), and therefore we only report the main passages. Consider the events $\{\widehat{m} = N - j\}$, 0 ≤ j ≤ N, and recall that the test statistics $\Theta^{(j)}_{T,M}$ are independent across j conditional on the sample. Thus, for all j ≤ N − m − 1, we have $P^*(\widehat{m} = j) = \alpha(1-\alpha)^j$, where we omit the dependence of the nominal level α on N and T for the sake of notation. Letting π be the power of the test, it is easy to see that

Proof of Corollary 2. The result follows from the theory developed in Hansen (2005). Indeed, using the results in Section 3 of his paper, it follows that, under (4.3), equation (2.2) still holds with

Note now that
and therefore the conclusions of Theorem 1 remain unaltered. Similarly, (2.1) also holds, and therefore Theorem 2 also holds. The final result now follows immediately.
Proof of Corollary 1. The proof follows from exactly the same passages (with one minor modification, which we describe at the end of this proof) as the proof of Theorem 2. In turn, this follows immediately if (2.5)-(2.8) hold. Equation (2.7) has already been shown in Trapani (2014). Similarly, (2.6) and (2.8) can be shown by exactly the same passages as in the proofs of Lemma B.4 and Proposition 2, with minor if tedious complications arising from the presence of $h_{tT}$. Proving (2.5) requires more work; an a.s. result is predicated on extending the paper by Jain (1982) to the case of a heterogeneous, independent sequence. This is entirely feasible (the key requirement is independence, not homogeneity), but in order for the proof to be transparent, the whole original paper would have to be replicated allowing for the presence of $h_{tT}$. In order to present a more transparent and self-contained argument, we instead begin by showing a "weak", in probability, result, which replaces Lemma B.2: we show that (C.10) holds for all $b \in \mathbb{R}^N$. Note first that Assumption 2 entails that, for r ∈ [0, 1],

(C.11)

where, as usual, "$\xrightarrow{w}_{J_1}$" denotes weak convergence in the space D[0, 1] equipped with the $J_1$-topology, and $Z_{\eta}(r)$ is a symmetric stable process of index η. Consider

(C.12) $\frac{1}{T^{1/\eta}} \sum_{t=1}^{\lfloor Tr \rfloor} b'\varepsilon_t$

and note that the norming sequence in (C.11) and (C.12) is exactly the same. This, and Assumption 5, entail that Assumptions B and C on p. 165 in Kasahara and Maejima (1986) are satisfied. Thus, (C.13) holds, where we note that Assumption A in Kasahara and Maejima (1986) is immediately satisfied because the norming sequence in (C.11) and (C.13) is exactly the same. By construction, $Y_{\eta,b}(r)$ is a (well-defined) symmetric stable process of index η; the details are in Chapter 3 of Samorodnitsky and Taqqu (1994), see in particular Property 3.2.1.
By the Continuous Mapping Theorem,

(C.14)

As mentioned above, the process $Y_{\eta,b}(r)$ is symmetric η-stable; thus, using Lemma 2.2 in Shi (1999), it is immediate to see that

(C.15)

Combining (C.14) and (C.15) proves (C.10). Now, by the same passages as in Lemma B.3, it holds that, for j ≤ m,

Combining this with (2.7), and following the same logic as in the proof of (2.9), it ultimately follows that, for j ≤ m,

(C.16)

where ε, ε′ and p are defined in (2.9). Now we are ready to prove the theorem. Equation (3.7) for this case follows immediately from (2.10), which is implied by (2.6)-(2.8), using exactly the same passages as in the original proof. As far as (4.2) is concerned, note that, by continuity, (C.16) entails that, for every ε > 0,

The desired result now follows by applying the proof of Theorem 2; see also the proof of Theorem C1 in Horváth and Trapani (2019).
Proof of Theorem 4. We present the proof for the homoskedastic case. In the heteroskedastic case, the proof follows from exactly the same arguments, and therefore we omit them to save space.
In light of the consistency of $\widehat{m}$, we prove the result assuming that m is known. Consider the N × T matrix

where $\widehat{V}$ is an m × m diagonal matrix whose entries are the m largest eigenvalues of YY′. Then it holds that

Let $H = XX'P'\widehat{P}\widehat{V}^{-1}$; it is easy to see that $H = O_P(1)$. We have

$\widehat{P} - PH = PXU'\widehat{P}\widehat{V}^{-1} + UX'P'\widehat{P}\widehat{V}^{-1} + UU'\widehat{P}\widehat{V}^{-1} = I + II + III.$

Consider I. We have $\|I\| = \left\|PXU'\widehat{P}\widehat{V}^{-1}\right\| \leq \|P\| \left\|XU'\right\| \left\|\widehat{P}\right\| \left\|\widehat{V}^{-1}\right\|$; clearly, $\|P\|$ and $\|\widehat{P}\|$ are both bounded. Also, by the FCLT (Paulauskas and Rachev, 1998), $\|\widehat{V}^{-1}\| = O_P(T^{-1-2/\eta})$. Finally, using the calculations in the proof of Lemma B.5, it is easy to see that $\|XU'\| = O_P(T^{2/\eta})$. Thus $\|I\| = O_P(T^{-1})$, and the same holds for II. Considering III, essentially the same arguments and the passages in the proof of Lemma B.4 yield $\|III\| = O_P(T^{-1})$. Putting everything together, (3.13) follows.
Proof of Theorem A.1. The proof is fairly similar to that of Theorem 3, and we therefore report only the main passages. Consider the case 0 < m < N; it is easy to see that

for all m + 1 ≤ p ≤ N. Also, it can be shown, again by the conditional independence of the test statistics across p, that

for 1 ≤ p ≤ m − 1, and

Hence we have

which entails that, as α → 0, $P^*(\widehat{m} = m)$ tends to one for almost all realisations of $\{\varepsilon_t, 1 \leq t \leq T\}$.

Each entry in the table contains the average estimated rank across replications; the standard deviation (in round brackets); and the percentage of times the rank is estimated incorrectly (in square brackets). We report our estimator of m, denoted as BCT, and the number of common trends (implicitly) estimated by BIC, which is always the best information criterion in all cases.