Simultaneous Decorrelation of Matrix Time Series

We propose a contemporaneous bilinear transformation for a $p\times q$ matrix time series to alleviate the difficulties in modelling and forecasting matrix time series when $p$ and/or $q$ are large. The resulting transformed matrix assumes a block structure consisting of several small matrices, and those small matrix series are uncorrelated across all time lags. Hence an overall parsimonious model is achieved by modelling each of those small matrix series separately without any loss of information on the linear dynamics. Such a parsimonious model often has better forecasting performance, even when the underlying true dynamics deviates from the assumed uncorrelated block structure after transformation. The uniform convergence rates of the estimated transformation are derived, which vindicate an important virtue of the proposed bilinear transformation: it is technically equivalent to the decorrelation of a vector time series of dimension $\max(p,q)$ instead of $pq$. The proposed method is illustrated numerically via both simulated and real data examples.


Introduction
Let $X_t = (X_{t,i,j})$ be a $p\times q$ matrix time series, i.e. at each time there are $p\times q$ recorded values, for example, from $p$ individuals on $q$ indices or variables. Data recorded in this form are increasingly common in this information age, due to the demand to solve practical problems arising from, among others, signal processing, medical imaging, social networks, IT communication, genetic linkages, industrial production and distribution, economic activities and financial markets. Extensive developments on statistical inference for matrix data under i.i.d. settings can be found in Negahban and Wainwright (2011), Rohde and Tsybakov (2011), Xia and Yuan (2019), and the references therein. Many matrix sequences are recorded over time, exhibiting significant serial dependence which is valuable for modelling and future prediction.
A common feature of the aforementioned approaches is dimension-reduction, as the so-called 'curse of dimensionality' is more pronounced in modelling time series than i.i.d. observations, which is exemplified by the limited practical usefulness of vector ARMA (VARMA) models. Note that an unregularized VAR(1) model of dimension $p$ involves at least $p^2$ parameters. Hence finding an effective way to reduce the number of parameters is of fundamental importance in modelling and forecasting high dimensional time series. In the context of vector time series, most available approaches may be divided into three categories: (i) regularized VAR or VARMA methods incorporating LASSO or alternative penalties (Basu and Michailidis, 2015, Gao et al., 2019, Ghosh et al., 2019, Guo et al., 2016, Han et al., 2020a, Han and Tsay, 2020, Han et al., 2023, Lin and Michailidis, 2017, Zhou and Raskutti, 2018); (ii) factor models of various forms (Bai and Ng, 2002, Forni et al., 2005, Lam and Yao, 2012, Pena and Box, 1987); and (iii) various versions of independent component analysis (Back and Weigend, 1997, Chang et al., 2018, Huang and Tsay, 2014, Matteson and Tsay, 2011, Tiao and Tsay, 1989). As the literature in each of the three categories is large, it is impossible to list all the relevant references here.
In this paper we propose a new parsimonious approach for analyzing matrix time series, which is in the spirit of independent component analysis. It transforms a $p \times q$ matrix series into a new matrix series of the same size by a contemporaneous bilinear transformation (i.e. the values at different time lags are not mixed together). The new matrix series is divided into several submatrix series which are uncorrelated across all time lags. Hence an overall parsimonious model can be developed, as those submatrix series can be modelled separately without loss of information on the overall linear dynamics. This is particularly advantageous for forecasting future values. In general, cross-serial correlations among different component series are valuable for future prediction. However the gain from incorporating those cross-serial correlations directly in a moderately high dimensional model is typically not enough to offset the error resulting from estimating the additional large number of parameters. This is why forecasting a large number of time series jointly based on a single time series model can often be worse than forecasting each component series separately while ignoring cross-serial correlations completely. The proposed transformation alleviates the problem by channeling all cross-serial correlations into the transformed submatrix series, which are uncorrelated with each other across all time lags. Hence the relevant information can be used more effectively in forecasting within each small model. Note that the transformation is one-to-one, therefore the good forecasting performance for the transformed series can easily be carried back to forecasts for the original matrix series. Our empirical study in Section 4 also demonstrates that, even when the underlying model does not exactly follow the assumed uncorrelated block structure, the decorrelation transformation often produces superior out-of-sample prediction, as the transformation enhances the within-block autocorrelations, while the cross-correlations between the blocks are weaker or too weak to be practically useful.
The basic idea of the proposed transformation is similar to the so-called principal component analysis for time series (TS-PCA) of Chang et al. (2018). Actually one might be tempted to stack a $p \times q$ matrix series into a $(pq) \times 1$ vector time series and apply TS-PCA directly. This requires searching for a $(pq) \times (pq)$ transformation matrix. In contrast, the proposed bilinear transformation is facilitated by two matrices of sizes, respectively, $p \times p$ and $q \times q$. Indeed our asymptotic analysis indicates that the proposed bilinear transformation is technically equivalent to a TS-PCA transformation of dimension $\max(p, q)$ instead of $pq$; see Remark 6 in Section 3 below. Furthermore the bilinear transformation does not mix rows and columns together, as they may represent radically different features in the data. See, e.g., the example in Section 4.2, for which the bilinear transformation also outperforms the vectorized TS-PCA approach in post-sample forecasting.
The rest of the paper is organized as follows. The targeted bilinear decorrelation transformation is characterized in Section 2. We introduce a new normalization under which the required transformation consists of two orthogonal transformations. The orthogonality allows us to accumulate the information from different time lags without cancellation. The estimation of the bilinear transformation boils down to the eigenanalysis of two positive-definite matrices of sizes $p \times p$ and $q \times q$ respectively. Hence the computation can be carried out on a laptop/PC with $p$ and $q$ in the thousands. Theoretical properties of the proposed estimation are presented in Section 3. The non-asymptotic error bounds for the estimated transformation are derived based on some concentration inequalities (e.g., Merlevède et al. (2011), Rio (2000)), leading further to the uniform convergence rates of the estimation. Numerical illustration with both simulated and real data is reported in Section 4, which shows superior forecasting performance over the methods without transformation and over TS-PCA. Once again this reinforces the fact that cross-serial correlations carry important and useful information for the prediction of a large number of time series; however, it is necessary to adopt the proposed decorrelation transformation (or other dimension-reduction techniques) in order to use this information effectively. All technical proofs are relegated to the supplementary material.

Decorrelation transformations
Again, let $X_t = (X_{t,i,j})$ be a $p\times q$ matrix time series. We assume that $X_t$ is weakly stationary in the sense that all its first two moments are finite and time-invariant. Our goal is to seek a bilinear transformation such that the transformed matrix series admits the following segmentation structure:
$$X_t = B\,U_t\,A', \qquad (1)$$
where $A$ and $B$ are unknown, respectively, $q \times q$ and $p \times p$ invertible constant matrices, and the random matrix $U_t = (U_{t,i,j})$ is unobservable; $U_{t,i,j}$ is a $p_i \times q_j$ submatrix with unknown $p_i$ and $q_j$, $\sum_{i=1}^{n_r} p_i = p$, and $\sum_{j=1}^{n_c} q_j = q$. Furthermore, $\mathrm{Cov}\{\mathrm{vec}(U_{t+\tau,i_1,j_1}), \mathrm{vec}(U_{t,i_2,j_2})\} = 0$ for any $(i_1, j_1) \ne (i_2, j_2)$ and any integer $\tau$, i.e. all those submatrix series are uncorrelated with each other across all time lags.
Remark 1. (i) The decorrelation bilinear transformation (1) is in the same spirit as the TS-PCA of Chang et al. (2018), which transforms linearly a vector time series into a new vector time series of the same dimension segmented into several subvector series, with those subvectors uncorrelated across all time lags. Thus, as far as the linear dynamics is concerned, one can model each of those subvector time series separately. This leads to appreciable improvement in forecasting, as the transformed process encapsulates all the cross-serial correlations into the auto-correlations of those uncorrelated subvector processes.
(ii) In the matrix time series setting, one might be tempted to stack all the elements of $X_t$ into a long vector and apply the TS-PCA of Chang et al. (2018) directly, though this destroys the original matrix structure. It requires searching for a $(pq)\times(pq)$ decorrelation transformation matrix $\Phi$ such that $\mathrm{vec}(X_t) = \Phi\,\mathrm{vec}(U_t)$. The proposed model (1) imposes the low-dimensional structure $\Phi = A \otimes B$, where $\otimes$ denotes the matrix Kronecker product, so that the technical difficulty is reduced to estimating two transformation matrices of sizes $p \times p$ and $q \times q$ respectively, with a significantly faster convergence rate; see Remark 6 in Section 3 below (a small numerical check of this Kronecker identity is sketched after this remark). The empirical results in Section 4 also demonstrate the benefit of maintaining the matrix structure in out-of-sample prediction.
(iii) The segmentation structure in (1) may be too rigid, and the division into submatrices may not be that regular. In practice a small number of row blocks and/or column blocks may result from applying the estimation method proposed in Section 2.3 below. Then applying the same method again to each submatrix series $U_{t,i,j}$ may lead to a finer segmentation with irregularly sized small blocks.
(iv) Similar to TS-PCA, the desired segmentation structure (1) may not exist in real applications. Then the estimated bilinear transformation, obtained by the method in Section 2.3, leads to an approximate segmentation which encapsulates the most significant correlations across different component series into the segmented submatrices while ignoring some small correlations. With enhanced auto-correlations, those submatrix processes become more predictable, while the ignored small correlations are typically negligible in practice.
Consequently the improvement in forecasting still prevails in spite of the lack of an exact segmentation structure. See Chang et al. (2018), and also the real data example in Section 4.2 below, in which all three versions of segmentation (at least two of which are therefore approximations) uniformly outperform the various methods without the transformation. A simulation study in Section 4.1 also shows that when the underlying true model deviates from the desired segmentation structure by a moderate amount, the approximate segmentation model still produces better out-of-sample prediction performance than the methods without the transformation.
(v) For dealing with time series with structural changes, it is possible to allow $A$ and $B$ to be time-varying, which is technically more demanding and will be explored elsewhere.
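As a quick numerical illustration of the Kronecker structure mentioned in Remark 1(ii), the following NumPy sketch (with made-up dimensions) checks that the bilinear map $X_t = BU_tA'$ corresponds to $\mathrm{vec}(X_t) = (A \otimes B)\,\mathrm{vec}(U_t)$ under column-major vectorization.

```python
import numpy as np

# Illustrative dimensions only; not taken from the paper.
rng = np.random.default_rng(0)
p, q = 4, 3
B = rng.normal(size=(p, p))      # p x p row transformation
A = rng.normal(size=(q, q))      # q x q column transformation
U = rng.normal(size=(p, q))      # latent matrix at a single time point

X = B @ U @ A.T                  # bilinear transformation X_t = B U_t A'

# vec() is column-major stacking; then vec(B U A') = (A kron B) vec(U)
vecX = X.flatten(order="F")
vecU = U.flatten(order="F")
print(np.allclose(vecX, np.kron(A, B) @ vecU))   # True
```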

Normalization and identification
To simplify the presentation, let $E(X_t) = 0$. This amounts to centring the data by the sample mean first, which does not affect the asymptotic properties of the estimation under the stationarity assumption. Thus $E(U_t) = 0$.
The terms $A$, $B$ and $U_t$ on the RHS of (1) are not uniquely defined, and any $A$, $B$ and $U_t$ satisfying the required condition will serve the decorrelation purpose. Therefore we can take advantage of this lack of uniqueness to identify an 'ideal' $A$ and $B$ that improve the estimation effectiveness. More precisely, by applying a new normalization, we identify $A$ and $B$ to be orthogonal, which facilitates the accumulation of the information from different time lags without cancellation. To this end, we first list some basic facts as follows.
(i) $(B, U_t, A')$ in (1) can be replaced by $(BD_1, D_1^{-1}U_tD_2^{-1}, D_2A')$ for any invertible block diagonal matrices $D_1$ and $D_2$ with block sizes, respectively, $(p_1, \dots, p_{n_r})$ and $(q_1, \dots, q_{n_c})$. In particular, $D_1$ and $D_2$ can be permutation matrices which permute the columns and rows within each block. In addition, $D_1$ and $D_2$ can be block matrices that permute the $n_r$ row blocks and the $n_c$ column blocks respectively. Note that both $(p_1, \dots, p_{n_r})$ and $(q_1, \dots, q_{n_c})$ are defined by (1) only up to permutations.
(ii) The $q$ columns of $BU_t$ can be divided into $n_c$ uncorrelated blocks of sizes $(q_1, \dots, q_{n_c})$, i.e. the columns of $BU_t$ exhibit the same pattern of uncorrelated blocks as those of $U_t$. Thus $\Sigma^{(u)}_1 \equiv E(U_t'B'BU_t)/p$ is a block diagonal matrix with block sizes $(q_1, \dots, q_{n_c})$. In the same vein, $\Sigma^{(u)}_2 \equiv E(U_tA'AU_t')/q$ is a block diagonal matrix with block sizes $(p_1, \dots, p_{n_r})$.
(iii) Write $B = (B_1, \dots, B_{n_r})$ and $A = (A_1, \dots, A_{n_c})$, where $B_i$ is $p \times p_i$ and $A_j$ is $q \times q_j$. Then the linear spaces $\mathcal{M}(B_1), \dots, \mathcal{M}(B_{n_r})$ and $\mathcal{M}(A_1), \dots, \mathcal{M}(A_{n_c})$ are uniquely defined by (1) up to permutations, though $B$ and $A$ themselves are not, where $\mathcal{M}(G)$ denotes the linear space spanned by the columns of matrix $G$. Note that, by normalizing with appropriately chosen $q \times q$ and $p \times p$ positive definite matrices $\Sigma_1$ and $\Sigma_2$, we can treat both $A$ and $B$ in (1) as orthogonal matrices. This orthogonality plays a key role in combining together the information from different time lags in our estimation (see the proof of Proposition 2 in the appendix).
Proposition 1. Let both $\Sigma^{(x)}_1$ and $\Sigma^{(x)}_2$ be invertible. Then $X_t$ admits the representation (2), in which $U^*_t$ has the same segmentation structure as $U_t$ in (1), and $A^*$ and $B^*$ are, respectively, $q \times q$ and $p \times p$ orthogonal matrices; their explicit forms are given in (3).
The proof of Proposition 1 is almost trivial. First note that both $\Sigma^{(x)}_1$ and $\Sigma^{(x)}_2$ are invertible. Then (2) follows from (1) and (3) directly, and the orthogonality of, e.g., $B^*$ follows from (2) and (3). Also note that $\{\Sigma^{(u)}_1\}^{-1/2}$ and $\{\Sigma^{(u)}_2\}^{-1/2}$ are block diagonal; this implies that $U^*_t$ admits the same segmentation structure as $U_t$. Write $A^* = (A^*_1, \dots, A^*_{n_c})$ and $B^* = (B^*_1, \dots, B^*_{n_r})$, where $A^*_j$ has $q_j$ columns and $B^*_i$ has $p_i$ columns. The matrices $A^*$ and $B^*$ in (2) are (still) not uniquely defined, similar to property (i) above. In fact, only the linear spaces $\mathcal{M}(B^*_1), \dots, \mathcal{M}(B^*_{n_r})$ and $\mathcal{M}(A^*_1), \dots, \mathcal{M}(A^*_{n_c})$ are uniquely defined. Proposition 2 below shows that we can take the orthonormal eigenvectors of two properly defined positive definite matrices as the columns of $A^*$ and $B^*$. With $A^*$ and $B^*$ specified, the segmented $U^*_t$ can be solved from (2) directly.
Let $V^{(1)}_{\tau,i,j}$ and $V^{(2)}_{\tau,i,j}$ be defined as in (4), where $E_{i,j}$ is the unit matrix with 1 at position $(i,j)$ and 0 elsewhere, and $E_{i,j}$ is $p \times p$ in the first equation and $q \times q$ in the second.
Proposition 2. For a prespecified integer $\tau_0 \ge 1$, let $W^{(1)}$ and $W^{(2)}$, defined through (4), be invertible, and let all the eigenvalues of $W^{(i)}$ be distinct, $i = 1, 2$. Then the $q$ orthonormal eigenvectors of $W^{(1)}$ can be taken as the columns of $A^*$, and the $p$ orthonormal eigenvectors of $W^{(2)}$ can be taken as the columns of $B^*$.
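For later reference, one normalization consistent with the discussion above (the exact scaling below is our assumption, with $\Sigma^{(x)}_1 = E(X_t'X_t)/p$ and $\Sigma^{(x)}_2 = E(X_tX_t')/q$) takes the form
$$\{\Sigma^{(x)}_2\}^{-1/2}\,X_t\,\{\Sigma^{(x)}_1\}^{-1/2} = B^*\,U^*_t\,A^{*\prime},$$
with
$$A^* = \{\Sigma^{(x)}_1\}^{-1/2}A\{\Sigma^{(u)}_1\}^{1/2}, \qquad B^* = \{\Sigma^{(x)}_2\}^{-1/2}B\{\Sigma^{(u)}_2\}^{1/2}, \qquad U^*_t = \{\Sigma^{(u)}_2\}^{-1/2}U_t\{\Sigma^{(u)}_1\}^{-1/2},$$
which makes $A^*$ and $B^*$ orthogonal and preserves the block structure of $U_t$, in the manner of (2) and (3).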
The proposition above does not give any indication on how to arrange the columns of A * and B * , which should be ordered according to the latent uncorrelated block structure (1).
We address this issue in Section 2.3 below.
Remark 2. The representation (1) suffers from indeterminacy due to the fact that the column blocks and row blocks can be permuted, and the columns and rows within each block can be rotated; see property (i) above. However this indeterminacy does not have a material impact on identifying the desired structure (1), as any one of such representations will serve the purpose. The developed asymptotic theory guarantees that the proposed estimator converges to a representation which fulfills the conditions imposed on (1).
Remark 3. In case $W^{(1)}$ and $W^{(2)}$ have tied eigenvalues with the corresponding eigenvectors spread across different blocks, Proposition 2 no longer holds: the blocks sharing the tied eigenvalues cannot be separated. To avoid tied eigenvalues, we may use different values of $\tau_0$, or a different form of $W^{(i)}$. For example, we may replace $W^{(1)}$ by a transformed matrix defined through the eigen-decomposition: writing $V = \Gamma D \Gamma'$ for the eigen-decomposition of $V$, one uses $\Gamma f(D)\Gamma'$, where $f(D)$ is an element-wise transformation of the diagonal matrix $D$ of the eigenvalues (see the sketch after this remark).
In fact, the condition that all eigenvalues of $W^{(i)}$ are distinct can be relaxed. For example, we only require any two eigenvalues of $W^{(i)}$ corresponding to two different blocks to be different, while the eigenvalues corresponding to the same block can be equal. See also the similar condition on the eigen-gap $\Delta$ in (14) in Section 3 below.
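The eigenvalue transformation mentioned in Remark 3 is an ordinary matrix function computed through the eigen-decomposition. A minimal sketch is given below; the particular choice of $f$, and whether it is applied to $W^{(1)}$ itself or to the matrices accumulated into it, are illustrative assumptions here.

```python
import numpy as np

def matrix_function(V, f):
    """Return Gamma f(D) Gamma' for a symmetric matrix V = Gamma D Gamma'.

    Applying a scalar function f element-wise to the eigenvalues keeps the
    eigenvectors unchanged while reshaping the spectrum, which is the type of
    transformation invoked in Remark 3.
    """
    evals, evecs = np.linalg.eigh(V)
    return evecs @ np.diag(f(evals)) @ evecs.T

# Example usage with an arbitrary increasing transformation of the eigenvalues:
# W1_new = matrix_function(W1, np.log1p)
```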
Remark 4. In practice, we need to specify $\tau_0$ in (4). In principle any $\tau_0 \ge 1$ can be used for the purpose of segmentation. A larger $\tau_0$ may capture more lag-dependence, but may also risk adding more 'noise' when the dependence decays fast. Since autocorrelation is typically strongest at small lags, a relatively small $\tau_0$, such as $\tau_0 \le 5$, is often sufficient (Chang et al., 2018, Chen et al., 2022, Lam et al., 2011, Wang et al., 2019a).

Estimation
The estimation of $A^*$ and $B^*$, defined in (3), is based on the eigenanalysis of the sample versions of the matrices defined in (4). To this end, let $\widehat W^{(1)}$ and $\widehat W^{(2)}$ denote the sample analogues given in (5)-(7). Performing the eigenanalysis for $\widehat W^{(1)}$, and arranging the order of the resulting $q$ orthonormal eigenvectors $\widehat\gamma_1, \dots, \widehat\gamma_q$ by the algorithm below, we take the re-ordered eigenvectors as the columns of $\widehat A^*$. The estimator $\widehat B^*$ is obtained in the same manner via the eigenanalysis for $\widehat W^{(2)}$. Then by (2), we obtain the transformed matrix series $\widehat U^*_t$ as in (8). Note that the estimation of $A^*$ and that of $B^*$ are carried out separately; they do not interfere with each other.
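Since the explicit forms of (5)-(8) are given elsewhere, the following is only a hedged sketch of the estimation step under our own assumptions: the series is normalized column-wise, lagged row-indexed cross-covariances are accumulated over lags $1, \dots, \tau_0$ into a candidate $\widehat W^{(1)}$, and its orthonormal eigenvectors are taken as the (as yet unordered) columns of $\widehat A^*$; $\widehat B^*$ would be computed analogously with rows and columns interchanged.

```python
import numpy as np

def estimate_A_star(X, tau0=5):
    """Candidate (unordered) columns of A* from the eigenanalysis of a W1-type matrix.

    X : array of shape (T, p, q), a centred matrix time series.
    The exact definitions in (4)-(7) are not reproduced here, so the W1 built
    below (sums of outer products of lagged cross-covariances of the
    column-normalized series) is an assumed, illustrative form.
    """
    T, p, q = X.shape
    Sigma1 = np.einsum('tia,tib->ab', X, X) / (T * p)     # sample analogue of E(X_t' X_t)/p
    evals, evecs = np.linalg.eigh(Sigma1)
    Sigma1_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = X @ Sigma1_inv_half                               # column-normalized series

    W1 = np.eye(q)
    for tau in range(1, tau0 + 1):
        for i in range(p):
            for j in range(p):
                # lag-tau cross-covariance between rows i and j (a q x q block)
                V = Z[tau:, i, :].T @ Z[:T - tau, j, :] / (T - tau)
                W1 += V @ V.T                             # accumulate without cancellation
    _, A_hat = np.linalg.eigh(W1)
    return A_hat[:, ::-1]        # eigenvectors, in descending eigenvalue order
```

With $\widehat A^*$ and $\widehat B^*$ in hand, the transformed series corresponding to (8) can be formed and its columns reordered by the grouping algorithm described next.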
Now we present an algorithm to determine the order of the columns of $\widehat A^*$. By (2), the columns of $\widehat Z_t$, defined in (9), can be divided into uncorrelated blocks. We divide the columns of $\widehat Z_t$ into uncorrelated blocks according to the pairwise maximum cross correlations between the columns. Specifically, the maximum cross correlation between the $k$-th and $\ell$-th columns, $\widehat\rho_{k,\ell}$, is defined in (10); the second equality there follows from (6), (7) and (9). In that expression, $\tau_1 \ge 1$ is a user-specified tuning parameter (see Remark 5 below). To determine all significantly correlated pairs of columns, rearrange $\widehat\rho_{k,\ell}$, $1 \le k < \ell \le q$, in descending order, $\widehat\rho_{(1)} \ge \widehat\rho_{(2)} \ge \cdots$, and determine $\widehat r$ by the ratio estimator in (11), where $\delta_T > 0$ is a small constant. We take the $\widehat r$ pairs of columns corresponding to $\widehat\rho_{(1)}, \dots, \widehat\rho_{(\widehat r)}$ as correlated pairs, and treat the rest as uncorrelated pairs. The intuition is that the ratio $\widehat\rho_{(j)}/\widehat\rho_{(j+1)}$ is close to 1 when both $\widehat\rho_{(j)}$ and $\widehat\rho_{(j+1)}$ are small. Similar ideas have also been used in determining the number of factors in Lam and Yao (2012), Ahn and Horenstein (2013) and Han et al. (2022).
With the $\widehat r$ identified significantly correlated pairs, a connection graph is built, with the column indices as the vertices and edges between the identified correlated column pairs. The number of unconnected sub-graphs is the estimated number of blocks $\widehat n_c$, and each maximal connected sub-graph forms an estimated group. Algorithmically, start with $q$ groups with one column in each group; then iteratively check all pairs of groups and merge two groups if there exists at least one pair of columns (one from each group) that is significantly correlated; stop when no more groups can be merged.
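A sketch of this grouping step follows. Two reading assumptions are made: the maximum cross correlation in (10) is taken as the largest absolute cross-correlation over all element pairs and lags up to $\tau_1$, and the ratio estimator in (11) is taken as the largest ratio of consecutive ordered correlations (with $\delta_T$ as a stabilizer); neither is guaranteed to match the exact displays.

```python
import numpy as np
from itertools import combinations

def max_cross_corr(z1, z2, tau1=15):
    """Maximum absolute cross-correlation between two column series of Z_t.

    z1, z2 : arrays of shape (T, p), the k-th and l-th columns over time.
    The maximum is taken over all element pairs and lags 0..tau1 in both
    directions (an assumed reading of the definition in (10)).
    """
    T = z1.shape[0]
    best = 0.0
    for a in range(z1.shape[1]):
        for b in range(z2.shape[1]):
            x, y = z1[:, a], z2[:, b]
            for tau in range(tau1 + 1):
                best = max(best,
                           abs(np.corrcoef(x[tau:], y[:T - tau])[0, 1]),
                           abs(np.corrcoef(y[tau:], x[:T - tau])[0, 1]))
    return best

def group_columns(Z, tau1=15, delta=1e-4):
    """Group the q columns of Z (shape (T, p, q)) into uncorrelated blocks."""
    T, p, q = Z.shape
    pairs = list(combinations(range(q), 2))
    rho = np.array([max_cross_corr(Z[:, :, k], Z[:, :, l], tau1) for k, l in pairs])
    order = np.argsort(rho)[::-1]                  # descending: rho_(1) >= rho_(2) >= ...
    r_sorted = rho[order]
    # assumed ratio estimator: largest drop among consecutive ordered correlations
    ratios = (r_sorted[:-1] + delta) / (r_sorted[1:] + delta)
    r_hat = int(np.argmax(ratios)) + 1 if len(ratios) else 1
    # connected components: start from singletons and merge the r_hat correlated pairs
    parent = list(range(q))
    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k
    for idx in order[:r_hat]:
        k, l = pairs[idx]
        parent[find(k)] = find(l)
    blocks = {}
    for col in range(q):
        blocks.setdefault(find(col), []).append(col)
    return list(blocks.values())
```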
The finite sample performance of the above algorithm can be improved by prewhitening each column time series of $\widehat Z_t$, which makes $\widehat\rho_{k,\ell}$, for different $(k,\ell)$, more comparable; see Remark 2(iii) of Chang et al. (2018). In practice the prewhitening can be carried out by fitting to each column time series a VAR model with order between 0 and 5 determined by AIC; the resulting residual series is taken as the prewhitened series (a simple sketch is given after this remark).
Remark 5. (i) The ratio estimator in (11) picks the $\widehat r$ most correlated pairs of columns, and ignores the other small correlations in constructing the segmentation structure. Theorem 3 in Section 3 below shows that the partition of $\widehat A^*$ into $\{\widehat A^*_1, \dots, \widehat A^*_{\widehat n_c}\}$, determined by the above algorithm, provides a consistent estimation for the column segmentation of $A^*$.
(ii) There are two tuning parameters used in the procedure: the maximum lag $\tau_0$ used in constructing $W^{(1)}$ and $W^{(2)}$ in (4), and the maximum lag $\tau_1$ used in measuring the dependence between two columns (rows) in (10). Lag $\tau_1$ is usually a sufficiently large integer, for example between 10 and 20, in the spirit of the rule of thumb of Box and Jenkins (1970). Different values of $\tau_0$ and $\tau_1$ may, or may not, lead to different segmentations. However the impact on, for example, future prediction is minimal, as most of the information on the linear dynamics is encapsulated in the most correlated pairs, as indicated by numerical examples in the Appendix.
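A simple OLS-based sketch of the prewhitening step: each univariate component series is replaced by the residuals of an AR(k) fit with 0 ≤ k ≤ 5 chosen by AIC. The text mentions a VAR fit per column; a univariate AR per component is used here purely for brevity.

```python
import numpy as np

def prewhiten(x, max_order=5):
    """Prewhiten a univariate series by an AR(k) fit, 0 <= k <= max_order, chosen by AIC."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    T = len(x)
    best_aic, best_resid = np.inf, x.copy()
    for k in range(0, max_order + 1):
        if k == 0:
            resid, sigma2 = x, x.var()
        else:
            Y = x[k:]
            # lag matrix with columns x_{t-1}, ..., x_{t-k}
            L = np.column_stack([x[k - j: T - j] for j in range(1, k + 1)])
            coef, *_ = np.linalg.lstsq(L, Y, rcond=None)
            resid = Y - L @ coef
            sigma2 = resid.var()
        aic = (T - k) * np.log(sigma2) + 2 * k
        if aic < best_aic:
            best_aic, best_resid = aic, resid
    return best_resid
```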
The ordering of the columns of $\widehat B^*$ is arranged in the same manner as above, by examining the pairwise correlations among the columns of the series obtained by applying the same construction to $X_t'$ (i.e. with the roles of rows and columns interchanged), where $\widehat\gamma^{(2)}_1, \dots, \widehat\gamma^{(2)}_p$ are now the $p$ orthonormal eigenvectors of $\widehat W^{(2)}$.

Theoretical properties
To gain more appreciation of the methodology, we show the consistency of the proposed detection of uncorrelated components. We mainly focus on the estimation error and the ordering of $A^*$, as those for $B^*$ are similar. Denote by $X_{t,i,\cdot}$ the $i$-th row of $X_t$, and by $X_{t,i,j}$ the $(i,j)$-th element of $X_t$. For a matrix $V = (V_{ij})$, let $\|V\|_{op}$ denote the spectral norm and $\|V\|_{\max} = \max_{i,j}|V_{ij}|$. We introduce some regularity conditions first.
Assumption 1. Assume exponentially decaying strong $\alpha$-mixing coefficients: $\alpha(k) \le \exp(-c_0 k^{r_1})$ for some constants $c_0 > 0$ and $0 < r_1 \le 1$, where $\alpha(k) = \sup_s \sup\{|P(A \cap B) - P(A)P(B)| : A \in \sigma(X_t, t \le s),\ B \in \sigma(X_t, t \ge s + k)\}$. Here, for any random variable/vector/matrix $X$, $\sigma(X)$ is understood to be the $\sigma$-field generated by $X$. Assumption 1 allows a very general class of time series models, including causal ARMA processes with continuously distributed innovations; see Bradley (2005), and also Section 2.6 of Fan and Yao (2003). The restriction $r_1 \le 1$ is introduced only for simplicity of presentation.
Assumption 2. Assume $E(X_t) = 0$, and that there exist finite constants $c_*$ and $c^*$ such that the eigenvalues concerned are bounded between $c_*$ and $c^*$. Such an assumption on the eigenvalues is common in the high dimensional setting; see, for instance, Bickel and Levina (2008a,b) and Xia et al. (2018).
Assumption 3 requires that the tail probability of each individual series of $X_t$ decays exponentially fast. In particular, when $r_2 = 2$, each $X_{t,i,j}$ is sub-Gaussian.
Remark 6. (i) Let $P_G$ be the projection operator onto the column space of a matrix $G$, which can be written as $P_G = G(G'G)^{-1}G'$. Under the conditions of Theorem 1, we can derive the convergence rate of $\|P_{\widehat A^*_j \otimes \widehat B^*_i} - P_{A^*_j \otimes B^*_i}\|_{op}$, where the columns of $\widehat A^*$ and $\widehat B^*$ are a permutation of the orthonormal eigenvectors of $\widehat W^{(1)}$ and $\widehat W^{(2)}$ respectively. If we stack all the elements of $X_t$ into a long vector such that $\mathrm{vec}(X_t) = \Phi\,\mathrm{vec}(U_t)$ and apply the TS-PCA of Chang et al. (2018) directly, the estimation of the decorrelation transformation matrix would satisfy $\|P_{\widehat\Phi_k} - P_{\Phi_k}\|_{op} = O_P\big(pq\big[(\log(pq)/T)^{1/2} + \log(Tpq)^{1/\beta_1}/T\big]\big)$ for $1 \le k \le n_r n_c$, where $\Phi = (\Phi_1, \dots, \Phi_{n_r n_c})$. Obviously, $\|P_{\widehat A^*_j \otimes \widehat B^*_i} - P_{A^*_j \otimes B^*_i}\|_{op}$ has a much sharper rate, which is equivalent to that for a TS-PCA estimation of dimension $\max(p, q)$.
(ii) The consistency of $\widehat W^{(1)}$ requires $\max(p, q) = o(\sqrt{T})$. When $A$ and $B$, or equivalently $W^{(1)}$ and $W^{(2)}$, have certain sparsity structures, this condition can be further relaxed. In TS-PCA, the threshold estimator of Bickel and Levina (2008a) is employed to construct a sparse estimator, and the convergence rates are then improved to allow the dimension to be much larger than $T$ (Chang et al., 2018). A similar extension can be established in our setting.
When the decay of the $\alpha$-mixing coefficients and of the tail probabilities of $X_t$ is slower than in Assumptions 1 and 3, we impose the following Assumptions 4 and 5 instead. These conditions ensure Fuk-Nagaev-type inequalities for $\alpha$-mixing processes; see also Rio (2000), Liu et al. (2013), Wu and Wu (2016) and Zhang and Wu (2017).
Theorem 2 below presents the uniform convergence rates for $\widehat V^{(1)}_{\tau,i,j}$ and $\widehat W^{(1)}$, $1 \le i, j \le q$. It is intuitively clear that these rates are much slower than those in Theorem 1.
Theorem 3 below implies that the partition of $\widehat A^*$ into $\{\widehat A^*_1, \dots, \widehat A^*_{\widehat n_c}\}$ provides a consistent estimation for the column segmentation of $A^*$; see also (10). By (2), the columns of $Z_t \equiv X_t\{\Sigma^{(x)}_1\}^{-1/2}A^*$ are divided into the $n_c$ uncorrelated blocks. We assume that those $n_c$ blocks can also be obtained by the algorithm in Section 2.3 with $\rho_{k,\ell}$ replaced by $\rho_{k,\ell}\cdot I(\rho_{k,\ell} \ge \bar\rho)$ for some constant $\bar\rho > 0$. Denote by $G_1, \dots, G_{n_c}$, respectively, the indices of the components of $Z_t$ in each of those $n_c$ blocks, and by $\widehat G_1, \dots, \widehat G_{\widehat n_c}$, respectively, the indices of the components of $\widehat Z_t$ in each of the $\widehat n_c$ uncorrelated blocks identified by the algorithm in Section 2.3. Re-arrange the order of $\widehat G_1, \dots, \widehat G_{\widehat n_c}$ if necessary. Then the theorem below implies that $P(\widehat n_c = n_c) \to 1$ and that, after this re-arrangement, $\widehat G_1, \dots, \widehat G_{\widehat n_c}$ coincide with $G_1, \dots, G_{n_c}$ with probability tending to one.
Theorem 3. Suppose the conditions of Theorem 1 or 2 hold and $\tau_1$ is a finite constant. Suppose further that the two inequalities in (21) hold, where $\kappa_1, \kappa_2$ are certain positive constants depending on $c_*$, $c^*$, $r_1$, $r_2$ only, and $\eta_{T,p,q}$ is defined in (17) or (19) and depends on $T$. Then on an event $\Omega_T$ with probability at least $1 - o(1)$, we have $\widehat n_c = n_c$ and, after re-ordering, $\widehat G_j = G_j$ for $j = 1, \dots, n_c$.
Remark 7. The first inequality in (21) requires that the minimum eigen-gap $\Delta$ between different uncorrelated groups is sufficiently larger than the estimation error $\eta_{T,p,q}$, so that all the groups are identifiable. The second inequality in (21) ensures that no cross-group pairs are falsely identified as correlated. The constants $\kappa_1$ and $\kappa_2$ are specified in the proof of the theorem.
4 Numerical properties

Simulation
We illustrate the proposed decorrelation method with simulated examples. We always set $\tau_0 = 5$ in (7) and $\tau_1 = 15$ in (10). The prewhitening, as described at the end of Section 2.3, is always applied in determining the groups of the columns of $\widehat A^*$ and $\widehat B^*$. As the true transformation matrices $A^*$ and $B^*$ defined in (3) cannot be computed easily even for simulated examples, we use proxies instead, in which $\widehat\Sigma^{(x)}_1$ and $\widehat\Sigma^{(x)}_2$ given in (5) replace their population counterparts. Abusing notation, we still use $A^*$ and $B^*$ to denote their proxies in this section, since the true $A^*$ and $B^*$ never occur in our computation. For each setting, we replicate the simulation 1000 times.
Example 1. We consider model (1) in which all the component series of $U_t$ are independent ARMA(1,2) processes of the form (22), where $b$ is drawn independently from the uniform distribution on $(-0.98, -0.5) \cup (0.5, 0.98)$, $a_1$ and $a_2$ are drawn independently from the uniform distribution on $(-0.98, -0.3) \cup (0.3, 0.98)$, and the innovations $\varepsilon_t$ are independent $N(0, 1)$. The elements of $A$ and $B$ are drawn independently from $U(-1, 1)$. For this example, we assume that we know $n_r = p$ and $n_c = q$ in (1). The focus is to investigate the impact of $p$ and $q$ on the estimation of $B$ and $A$.
Performing the eigenanalysis for $\widehat W^{(1)}$ and $\widehat W^{(2)}$ defined in (7), we take the resulting two sets of orthonormal eigenvectors as the columns of $\widehat A^*$ and $\widehat B^*$ respectively. To measure the estimation error, we use a distance $D(\widehat A^*, A^*)$ defined through $d_{i,j}$, the $(i,j)$-th element of the matrix $A^{*\prime}\widehat A^*$. Note that $D(\widehat A^*, A^*)$ always lies between 0 and 1, and $D(\widehat A^*, A^*) = 0$ if $\widehat A^*$ is a column permutation of $A^*$.
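For concreteness, one distance with the stated properties (values in $[0,1]$, and 0 when $\widehat A^*$ is a column permutation of $A^*$, both orthogonal) is sketched below; it is an illustrative choice and not necessarily the exact definition of $D$ used here.

```python
import numpy as np

def column_permutation_distance(A_hat, A_star):
    """An illustrative distance: lies in [0, 1] and equals 0 when A_hat is a
    column permutation of A_star (both assumed orthogonal). It need not
    coincide with the exact definition of D used in the paper."""
    D = A_star.T @ A_hat                      # the matrix with entries d_{i,j}
    return 1.0 - np.mean(np.max(np.abs(D), axis=0))
```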
We set $T = 100, 500, 1000, 5000$ and $p, q = 4, 8, 16, 32, 100$. The means and standard errors of $D(\widehat A^*, A^*)$ and $D(\widehat B^*, B^*)$ over the 1000 replications are reported in Table 1. As expected, the estimation errors decrease as $T$ increases. Furthermore the error in estimating $A^*$ increases as $q$ increases, and that in estimating $B^*$ increases as $p$ increases. Also noticeable is the fact that the quality of the estimation of $A^*$ depends on $q$ only and that of $B^*$ depends on $p$ only, as $D(\widehat A^*, A^*)$ is about the same for $(q, p) = (4, 4)$ and $(q, p) = (4, 8)$, and $D(\widehat B^*, B^*)$ is about the same for $(q, p) = (4, 8)$ and $(q, p) = (8, 8)$. When $(q, p) = (32, 32)$, each of $A$ and $B$ contains 1024 unknown parameters. The quality of their estimation with $T = 100$ is not ideal, but not substantially worse than that for $(q, p) = (16, 16)$, and the estimation is very accurate with $T = 5000$. For $(q, p) = (100, 16)$, the matrix time series $X_t$ contains 1600 component series. The estimation of $A^*$ is not accurate enough as $q = 100$, while the estimation of $B^*$ remains as good as that for $(q, p) = (16, 16)$, and is clearly better than that for $(q, p) = (32, 32)$.
Example 2. To examine the performance of the proposed method in identifying uncorrelated blocks, we now consider a column-segmentation-only model $X_t = U_tA'$, where $X_t$ and $U_t$ are $p \times q$ matrix time series. Note that adding the row transformation matrix $B$ to the model entails applying the same method twice without adding any new insight on the performance of the method, as the estimation of $A^*$ and that of $B^*$ are performed separately; see Section 2.3.
The transformation matrix $A$ in the above model is generated in the same way as in Example 1. All the element series of $U_t$, except those in columns 2, 3 and 5, are simulated independently from the ARMA(1,2) model (22). Denoting by $U_{t,j}$ the $j$-th column of $U_t$, we construct columns 2, 3 and 5 from the other columns in such a way that the first three columns of $U_t$ form one block of size 3, columns 4 and 5 form a block of size 2, and the rest of the columns are uncorrelated with each other and with the first two blocks.
We set $T = 100, 500, 1000, 5000$. For each sample size, we set $(p, q) = (3, 6), (6, 6), (6, 10)$ and $(10, 10)$. Hence the number of groups is $n_c = 3, 3, 7$ and $7$, respectively. For each setting, we perform the eigenanalysis for $\widehat W^{(1)}$ defined in (7). We then apply the algorithm in Section 2.3 to arrange the order of the $q$ resulting orthonormal eigenvectors to form the estimator $\widehat A^*$, for which the number of correlated pairs of columns is determined by (11). Table 2 reports the relative frequencies of the correct specification of all the $n_c$ uncorrelated blocks.
We also report the relative frequencies of two types of wrong segmentation: (a) merging, with $\widehat n_c = n_c - 1$, i.e. $n_c - 2$ blocks are correctly specified and the remaining two blocks are combined into one; and (b) splitting, with $\widehat n_c = n_c + 1$, i.e. $n_c - 1$ blocks are correctly specified and the remaining block is incorrectly split into two. Table 2 indicates that the relative frequency of the correct specification increases as the sample size $T$ increases, and it decreases as $q$ increases, while the performance is much less sensitive to increases in $p$. It is noticeable that the probability of the event $\{\widehat n_c = n_c + 1\}$ is very small, especially when $T$ is large. With $q = 6$, the probability of the correct specification is no smaller than 64% with $T = 500$, and is greater than 70% with $T = 1000$. Furthermore the probability of $\widehat n_c = n_c$ or $n_c - 1$ is over 92% for $T \ge 500$. Note that when $\widehat n_c = n_c - 1$, an effective dimension reduction is still achieved in spite of missing one split.
For the instances with correct specification, we also calculate the estimation errors for the $n_c$ subspaces. More precisely, write $\widehat A^* = (\widehat A_1, \dots, \widehat A_{n_c})$ and $A^* = (A_1, \dots, A_{n_c})$ in accordance with the identified blocks, and measure the estimation error by $D_1(\widehat A^*, A^*)$, which is based on the quantities $\{\mathrm{rank}(A_j)\}^{-1}\mathrm{trace}(A_j'\widehat A_j\widehat A_j'A_j)$. Each such quantity is always between 0 and 1; it equals 1 if $\mathcal{M}(A_j) = \mathcal{M}(\widehat A_j)$, and 0 if the two spaces are perpendicular to each other; see, for example, Stewart and Sun (1990) and Pan and Yao (2008). The boxplots of $D_1(\widehat A^*, A^*)$ presented in Figure 1 indicate that the estimation errors decrease as $T$ increases, and the errors with large $q$ are greater than those with small $q$.
Example 3. In this example we examine the gain in post-sample forecasting due to the decorrelation transformation. Now in model (1) all the elements of $A$ and $B$ are drawn independently from $U(-1, 1)$, with $(p, q)$ equal to $(6, 6)$, $(6, 8)$, $(8, 8)$ and $(10, 10)$. For each $(p, q)$ combination, the first three rows (or columns) form one block, the next two rows (or columns) form a second block, and each remaining row (or column) is independent of all the others. Table 3 shows the configuration of $U_t$ for $(p, q) = (8, 8)$.
Each univariate block of $U_t$ (e.g. $U_{t,6,6}$) is an AR(1) process with the autoregressive coefficient drawn from $U[0.1, 0.3]$ and with independent $N(0, 1)$ innovations.
The multivariate blocks of $U_t$ are generated with coefficient matrices $\Phi_{1,i,j}$ and $\Phi_{2,i,j}$ of unit spectral norm, and all elements of $E_{t,i,j}$ are independent $N(0, 1)$. The matrices $\Phi_{1,i,j}$ and $\Phi_{2,i,j}$ are constructed such that their singular values are all one and their left and right singular matrices are orthonormal, and the scalar coefficient $\lambda_{i,j}$ controls the level of autocorrelation; its values for the four blocks are marked in Table 3.
For each setting, we simulate a matrix time series of length $T + M$ with $M = 10$. To perform one-step ahead post-sample forecasting, for each $i = 0, 1, \dots, M - 1$, we use the first $T$ observations for identifying the uncorrelated blocks and for computing $\widehat A^*$, $\widehat B^*$ and $\widehat U^*_t$ defined in (8). Depending on its shape/size, we fit each identified block in $\widehat U^*_t$ with an MAR(1), VAR(1) or AR(1) model. The fitted models are used to predict $\widehat U^*_{T+i+1}$. The predicted value for $X_{T+i+1}$ is then obtained by transforming the predicted $\widehat U^*_{T+i+1}$ back through (2). We then compute the mean squared error (MSE) of $\widehat X_{s,i,j} - \mu_{s,i,j}$ over all elements and forecast steps, where $\widehat X_{s,i,j}$ denotes the one-step ahead predicted value for the $(i,j)$-th element $X_{s,i,j}$ of $X_s$, and $\mu_{s,i,j} = E(X_{s,i,j} \mid X_{s-1}, X_{s-2}, \dots)$. We compare $\widehat X_{s,i,j}$ with $\mu_{s,i,j}$, instead of $X_{s,i,j}$, to remove the impact of the noise in the observed $X_{s,i,j}$. We report the overall mean and standard deviation of the MSE over 1000 replications.
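Under the assumption that the transformed series in (8) is $\widehat U^*_t = \widehat B^{*\prime}\{\widehat\Sigma^{(x)}_2\}^{-1/2}X_t\{\widehat\Sigma^{(x)}_1\}^{-1/2}\widehat A^*$ (our reading, not a quotation of the display), the back-transformation that turns a predicted $\widehat U^*_{T+i+1}$ into a forecast of $X_{T+i+1}$ can be sketched as:

```python
import numpy as np

def forecast_X(U_star_pred, A_hat, B_hat, Sigma1_half, Sigma2_half):
    """Map a block-wise prediction of U*_{T+1} back to the scale of X_{T+1}.

    Assumes U*_t = B_hat' Sigma2^{-1/2} X_t Sigma1^{-1/2} A_hat (an assumed
    reading of (8)); inverting this orthogonal bilinear map gives the forecast
    below. Sigma1_half and Sigma2_half are the symmetric square roots of the
    sample covariance matrices in (5).
    """
    return Sigma2_half @ B_hat @ U_star_pred @ A_hat.T @ Sigma1_half
```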
For comparison purposes, we also include several other models in the simulation: (i) the VAR(1) model, i.e. $X_t$ is stacked into a vector of length $pq$ and a VAR(1) is fitted to the vector series directly; (ii) the MAR(1) model, i.e. the MAR(1) of Chen et al. (2021) is fitted directly to $X_t$; and (iii) TS-PCA, i.e. all the elements of $X_t$ are stacked into a long vector and the segmentation method of Chang et al. (2018) is applied to the vector series. In addition, we also include two oracle models: (O1) the true segmentation blocks are used to model $\widehat U^*_t$; and (O2) the true $\Sigma^{(x)}_1$, $\Sigma^{(x)}_2$, $A^*$ and $B^*$ are used and the true segmentation blocks are used to model $U^*_t$. The means and standard deviations (in brackets) of the MSEs are listed in Table 4. It is clear that the forecasting based on the proposed decorrelation transformation is more accurate than those based on MAR(1), VAR(1) and TS-PCA directly. The improvement is often substantial. The only exception is the case of $(p, q) = (6, 6)$ and $T = 5000$: the 36-dimensional VAR(1) performs marginally better (i.e. a decrease of 0.002 in MSE) than the transformation based method. On the other hand, the improvement of the transformation based forecasts over VAR(1) for sample sizes $T \le 2000$ is more significant. The relatively poor performance of MAR(1) is due to the substantial discrepancy between the MAR(1) model and the data generating mechanism used in this example. Note that there are $(pq)^2$ free parameters in the VAR(1) while there are merely $p^2 + q^2$ parameters in the MAR(1). With large $T$, VAR(1) is more flexible than MAR(1). TS-PCA performs poorly, due to the large error in estimating the large transformation matrix of size $pq \times pq$; see also Remark 6. In addition, TS-PCA also requires fitting VAR models for several large blocks of sizes 9, 6, 6, etc. Table 4 indicates that the oracle model O1 performs only slightly better than the proposed segmentation when the sample size $T = 500$ or 1000. Oracle model O2 only deviates from the true model in terms of the estimation error of the coefficients in the AR models, and hence performs the best. The MSE of O2 is very close to 0, since the only source of error is the estimation of the AR models for each block of $U_t$, which vanishes quickly as $T$ increases because these models are of small dimensions. When $T$ is large, the VAR(1), the proposed segmentation and O1 all perform about the same, which is worse than O2.
Example 4. Next, we assess the sensitivity of the proposed matrix segmentation method with respect to the Kronecker product structure $\mathrm{vec}(X_t) = (A \otimes B)\mathrm{vec}(U_t)$. We use the same model setting as in Example 3 and choose $(p, q) = (8, 8)$ and $T = 500$. Instead of model (1), we consider $\mathrm{vec}(X_t) = (A \otimes B + \Psi)\mathrm{vec}(U_t)$, where $\Psi$ is a perturbation matrix with $N$ non-zero elements. We set each non-zero element of $\Psi$ to be $c\,\|A \otimes B\|_F/(pq) \times \omega$, where the $\omega$ are i.i.d. Rademacher variables. Table 5 shows the MSEs and standard deviations of the VAR(1) and the proposed matrix segmentation method over 1000 replications, for several combinations of $N$ and $c$. Again, $N$ controls the number of non-zero perturbations and $c$ controls the level of the perturbations relative to the average size of the elements of $A \otimes B$. When $c = 0$, the setting reduces to the original case in Table 4. From Table 5, it is seen that the segmentation remains useful in prediction when the model deviates from the assumed block structure within a certain range of perturbations. The prediction performance of the VAR model is not affected by the perturbation since it does not assume any block structure. The segmentation outperforms the VAR model for $c < 0.3$ when $N = 10$ and for $c < 0.2$ when $N = 100$. This demonstrates the benefit of dimension reduction through segmentation, even when the underlying model does not have an exact Kronecker product structure.
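The perturbed transformation in Example 4 can be generated as follows; the only assumption beyond the text is the uniformly random placement of the $N$ non-zero entries of $\Psi$.

```python
import numpy as np

def perturbed_kronecker(A, B, N=10, c=0.2, rng=None):
    """Construct A (x) B + Psi as in Example 4.

    Psi has N non-zero entries, each equal to c * ||A (x) B||_F / (pq) times
    an independent Rademacher sign; the uniformly random placement of the
    non-zero entries is our own assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = np.kron(A, B)                                   # (pq) x (pq)
    pq = K.shape[0]
    Psi = np.zeros_like(K)
    idx = rng.choice(pq * pq, size=N, replace=False)    # positions of the perturbations
    signs = rng.choice([-1.0, 1.0], size=N)             # Rademacher draws
    Psi.flat[idx] = c * np.linalg.norm(K) / pq * signs
    return K + Psi

# Usage: vec(X_t) = perturbed_kronecker(A, B, N, c) @ vec(U_t)
```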

Real data analysis
We now illustrate the proposed decorrelation method with a real data example. The data concerned are a $17 \times 6$ matrix time series consisting of the logarithmic daily averages of 6 air pollutant concentration readings (i.e. PM$_{2.5}$, PM$_{10}$, SO$_2$, NO$_2$, CO, O$_3$) from 17 monitoring stations in Beijing and its surrounding areas (i.e. Tianjin and Hebei Province) in China. The sampling period is January 1, 2015 to December 31, 2016, a total of $T = 731$ days. The readings of different pollutants at different locations are cross-correlated.
We first remove the seasonal mean of the $17 \times 6$ matrix time series. Setting $\tau_0 = 5$ in (7), we apply the proposed bilinear transformation to discover the uncorrelated block structure.
The finding is stable: with $5 \le \tau_1 \le 30$ in (10), the transformed matrix series admits $\widehat n_c = 4$ uncorrelated column blocks with sizes 3, 1, 1, 1 respectively, and $\widehat n_r = 15$ uncorrelated row blocks with two blocks of size 2 and the rest of size 1. Overall there are $15 \times 4$ blocks, among which there are 2 blocks of size $2 \times 3$, 13 blocks of size $1 \times 3$, 6 blocks of size $2 \times 1$ and 39 blocks of size $1 \times 1$.
To check the post-sample forecasting performance, we calculate the rolling one-step and two-step ahead out-of-sample forecasts for each of the daily readings in the last three months of 2016 (92 days in total). The methods included in the comparison are: (i) the forecasting based on the segmentation derived from the proposed decorrelation transformation, fitting an MAR(1), VAR(1) or AR(1) to each identified block according to its size and shape; (ii) MAR(1) for the original $17 \times 6$ matrix series; (iii) VAR(1) for the vectorized original series; (iv) univariate AR(1) for each of the $17 \times 6$ original series; and (v) TS-PCA (Chang et al., 2018) for the vectorized original series. The TS-PCA leads to a segmentation consisting of one group of size 3, three groups of size 2, and 93 groups of size 1. For each group, a VAR(1) model is used for prediction.
For the two segmentation approaches, we fix the needed transformation obtained using data up to September 30, 2016, and the corresponding segmentation structure.The time series model for each individual block is updated throughout the rolling forecasting period.
In addition to the identified segmentation with $15 \times 4$ blocks, we also compute forecasts based on two other segmentations: one with $14 \times 3$ blocks, and another with $16 \times 5$ blocks. The former is obtained by merging the two most correlated single row blocks in the discovered segmentation into one block and the two most correlated single column blocks into one block. The latter is obtained by splitting one of the two blocks with two rows in the discovered segmentation into two single row blocks and splitting the block with 3 columns into two blocks. This reveals the impact of slightly 'wrong' segmentations on the forecasting performance. Let the mean squared forecast error for the $s$-th observation be $\mathrm{MSPE}_s = \frac{1}{17 \times 6}\sum_{i=1}^{17}\sum_{j=1}^{6}(\widehat X_{s,i,j} - X_{s,i,j})^2$, where $\widehat X_{s,i,j}$ denotes the one-step ahead predicted value for the $(i,j)$-th element $X_{s,i,j}$ of $X_s$.
The two-step ahead $\mathrm{MSPE}_s$ is defined similarly. The means and standard deviations of $\mathrm{MSPE}_s$ for the one-step and two-step ahead post-sample forecasts of the pollution readings in the last 92 days are listed in Table 6. The prediction based on the bilinear decorrelation transformation (even using the 'wrong' segmentations) is clearly more accurate than those without transformation, as well as that of TS-PCA. Using the 'wrong' segmentations deteriorates the performance only slightly, indicating that a partial (instead of total) decorrelation still leads to a significant gain in prediction. Applying TS-PCA to the vectorized series requires estimating a $102 \times 102$ transformation matrix (instead of the $17 \times 17$ and $6 \times 6$ matrices obtained by utilizing the matrix time series structure), which leads to larger estimation errors (see Remark 6 in Section 3). Nevertheless it still provides more accurate predictions than the methods without transformation. Also note that fitting each of the original $17 \times 6$ series with a univariate AR(1) separately leads to better performance than fitting MAR(1) or VAR(1) to the original matrix series jointly. This indicates that cross-serial correlation is useful and important information for forecasting. However, to make efficient use of this information, it is necessary to adopt some effective independent component analysis or dimension-reduction technique, such as the proposed decorrelation transformation, which pushes correlations across different series into the autocorrelations of the transformed series.
Also included in the table are the 'regret', defined as the difference between the MSPE and the in-sample residual variance of the fitted VAR(1) model for the vectorized data (i.e. a vector time series of dimension $17 \times 6 = 102$), and the 'adjusted ratio', defined as the ratio of the regret to the regret of the best prediction model (i.e. the segmentation with $15 \times 4$ blocks). The regret aims to measure the forecasting error on the predictable signal, removing the impact of the unpredictable noise in the model. Since the fitted VAR(1) model uses more than 10,000 parameters, its in-sample residual variance underestimates the noise variance in the model, and is taken as a proxy for the latter. The adjusted ratios show that the regret of TS-PCA is 14.4% worse than that of the $15 \times 4$ segmentation for one-step ahead forecasts, and 6.4% worse for two-step ahead forecasts.
To further evaluate the forecasting stability, Figure 2 shows the weekly averages of the one-step rolling forecast MSPE within the forecasting period under various models. It is seen that the proposed segmentation outperforms the other three methods in most of the periods in terms of prediction.
In this example, the identified segmentation with $(\widehat n_r, \widehat n_c) = (15, 4)$ for the original $17 \times 6$ matrix series is more likely an approximation of the underlying dependence structure than a true model. The fact that the two 'wrong' segmentation models (though quite close to the discovered one), as well as the aggressive segmentation via the vectorized TS-PCA transformation, provide comparable though inferior post-sample forecasting performance lends further support to the claim that the proposed decorrelation method makes the transformed series more predictable than the original ones, regardless of whether model (1) holds or not. See Remark 1(iv) in Section 2.1 above.

C Proofs
Proof of Proposition 2. In the following we provide the construction $A^* = \Gamma^{(1)}\Pi$, where the columns of $\Gamma^{(1)}$ are the $q$ orthonormal eigenvectors of $W^{(1)}$ and $\Pi$ is a column permutation matrix. By construction, the columns of $Z_t$ have the same uncorrelated (segmentation) structure as those of $U_t$, and $A^{*\prime}A^* = I_q$, where $I_q$ is the $q \times q$ identity matrix; hence $A^*$ is orthonormal. Note that $B^*$ can be defined similarly.
Consider the auto-cross-covariance matrices between the $i$-th and the $j$-th rows of $\vec X_t$, and write $\Gamma^{(1)} = A^*\Gamma^{(z^*)}$. Since $\Gamma^{(1)}$ is an orthogonal matrix and $D^*$ is a diagonal matrix with descending diagonal elements, it is clear that $D^*$ contains the eigenvalues of $W^{(1)}$ and $\Gamma^{(1)}$ contains the corresponding eigenvectors.
Now it follows from (27) that the columns of $Z_t\Gamma^{(z)} = \vec X_t\Gamma^{(1)}$ have the same uncorrelated structure as those of $Z_t$; hence $\Gamma^{(1)}$ is the desired orthogonal transformation $A^*$ we are looking for.
Proof of Theorem 1. As $\|V\|_{op} \le q\|V\|_{\max}$ for all matrices $V \in \mathbb{R}^{q \times q}$, it suffices to bound the estimation error under the $\|\cdot\|_{\max}$ norm. In the proof, we consider the more general case where $E(X_t)$ is not necessarily 0, and let $\widehat\Sigma_{\tau,i,j} = \frac{1}{T-\tau}\sum_{t=1}^{T-\tau} X_{t+\tau}'\,E_{i,j}\,X_t$.
are of the same block diagonal structure as, respectively, $\Sigma^{(u)}_1$ and $\Sigma^{(u)}_2$.


Table 1: Example 1 - Means and standard errors (SE) of $D(\widehat A^*, A^*)$ and $D(\widehat B^*, B^*)$ in a simulation with 1000 replications.

Table 2: Example 2 - Relative frequencies of correct and incorrect segmentation in a simulation with 1000 replications.

Table 4: Example 3 - One-step ahead post-sample forecasting: means and standard deviations (in brackets) of MSEs in a simulation with 1000 replications.

Table 5: Sensitivity to the Kronecker product structure for different $N$ and $c$: means and standard deviations (in brackets) of MSEs of one-step ahead post-sample forecasts based on 1000 replications.

Table 6: One-step and two-step ahead post-sample forecasting for the air pollution data: means and standard deviations (in brackets) of $\mathrm{MSPE}_s$, regrets and adjusted ratios for the various methods.