Fast same-step forecast in the SUTSE model and its theoretical properties

We consider the problem of forecasting multivariate time series by a Seemingly Unrelated Time Series Equations (SUTSE) model. The SUTSE model usually assumes that error variables are correlated. A crucial issue is that the model estimation requires a heavy computational load because of large matrix computations, especially for high-dimensional data. To alleviate this computational issue, we propose a two-stage procedure for forecasting. First, we perform the Kalman filter as if the error variables were uncorrelated; that is, univariate time-series analyses are conducted separately to avoid large matrix computations. Next, the forecast value is computed by using the distribution of the forecast error. The proposed algorithm is much faster than the ordinary SUTSE model because we do not require large matrix computations. Some theoretical properties of our proposed estimator are presented. A Monte Carlo simulation is performed to investigate the effectiveness of our proposed method. The usefulness of our proposed procedure is illustrated through an application to bus congestion data.


Introduction
Multivariate time series data analysis has recently been developed in various fields to achieve high-quality forecasts and to investigate the correlations among time series; examples include forecasts of energy consumption and crowd flow [Gong et al. 2020]. In particular, the state-space model is widely applied for forecasting time series, and a number of model estimation procedures have been proposed. One of the most famous methods is the Kalman filter [Kalman 1960b] and its extensions; for example, the adaptive Kalman filter [Mohamed and Schwarz 1999] and the robust Kalman filter [Koch and Yang 1998]. Recently, a deep Kalman filtering network that fuses a deep neural network with the Kalman filter has also been proposed [Lu et al. 2018]. The Kalman filter and its extensions have been used in various fields of research, such as tracking [Jondhale and Deshpande 2018], photovoltaic forecasting [Pelland et al. 2013], and traffic volume forecasting [Xie et al. 2007].
In this study, we employ the SUTSE model [Fernández and Harvey 1990, Antoniou and Yannis 2013]. In the SUTSE model, multiple univariate time series equations are combined to express a single multivariate state-space model. This study assumes that the components of the noise vector are correlated; in this case, the components of the observation vector are also correlated. As an example of an analysis using such a correlation structure, Moauro and Savio [2005] proposed applying the SUTSE model to temporal disaggregation. Temporal disaggregation transforms low-frequency time series into high-frequency time series by interpolation; the interpolation is achieved by combining multivariate time series with different frequencies. Moauro and Savio [2005] apply the SUTSE model and impute missing values by the Kalman filter. The SUTSE model can be applied not only to interpolation but also to forecasting. Specifically, we consider the following issue: suppose that we have n observations of d-dimensional time series data, {y_1, . . . , y_n} with y_i = (y_{1i}, . . . , y_{di})^T, and some elements of y_{n+1}, say y_{A,n+1}. Here, A ⊂ {1, . . . , d} and y_{A,n+1} denotes the subvector of y_{n+1} whose indices belong to A. We consider the problem of forecasting the value of y_{k,n+1} by using {y_1, . . . , y_n} and y_{A,n+1}, where k ∉ A. We call this forecast the "same-step forecast". Figure 1.1 shows one example of an application of the same-step forecast. In this example, y_{i,n} indicates the congestion of the ith bus on the nth day. Typically, we forecast the congestion using past congestion data including the most recent observations; in other words, A = {1, . . . , j} when we forecast y_{j+1,n+1}. The same-step forecast value is computed based not only on the information up to day n, but also on the correlation between y_{j+1,n+1} and y_{A,n+1}.
In addition to bus congestion, the same-step forecast can also be used for electricity demand forecasting with multiple data sources, such as weather and event information.
Since the SUTSE model is a special case of the state-space model, the Kalman filter is usually used to estimate the state vectors. However, the estimation with the Kalman filter requires a heavy computational load due to large matrix computations, especially for high-dimensional data; the computational complexity is about O(max{d^3, p^3}) [Willner et al. 1976]. To alleviate this issue, we propose a fast two-stage procedure for the same-step forecast. Following the arguments of Chui et al. [2017], we prove that the mean vector and the covariance matrix of the forecast error converge under a misspecified model.
The rest of this paper is structured as follows. In Section 2, forecast methods using the SUTSE model are presented; we also review the Kalman filter in this section. In Section 3, we propose the fast same-step forecasting method. We give the estimator of the covariance matrix of the forecast error, which is necessary for implementing our fast method. In Section 4, we prove the convergence of the mean vector and the covariance matrix of the forecast error. Using this convergence, in Section 5, we prove the consistency of the estimator for the fast method. Finally, in Section 6, we conduct a Monte Carlo simulation to investigate the computational time of our proposed algorithm, and the results show that our proposed algorithm is dozens of times faster than the ordinary Kalman filter, depending on the dimension of the observations, d.
2 Forecasts using a SUTSE model

SUTSE model
Let y_1, y_2, . . . , y_n be d × 1 observation vectors with y_t := (y_{1,t}, . . . , y_{d,t})^T. Assume that each y_{j,t} follows a univariate linear-Gaussian state-space model:

y_{j,t} = Z_t^{(j)} α_t^{(j)} + ε_{j,t},   α_{t+1}^{(j)} = T_t^{(j)} α_t^{(j)} + η_t^{(j)},   (2.1)

where α_t^{(j)} is a p^{(j)} × 1 state vector, and p := p^{(1)} + · · · + p^{(d)}. Then model (2.1) can be rewritten in the form of a multivariate linear-Gaussian state-space model:

y_t = Z_t α_t + ε_t,   α_{t+1} = T_t α_t + η_t,   (2.2)

where α_t is a p × 1 state vector, ε_t is a d × 1 observational noise vector, η_t is a p × 1 state noise vector, Z_t is a d × p design matrix, and T_t is a p × p transition matrix. We assume ε_t ∼ N(0, Σ_ε) and η_t ∼ N(0, Σ_η).
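To make the construction concrete, the stacking of the univariate components into the matrices Z and T of (2.2) can be sketched as follows (a minimal Python/NumPy illustration of ours; the function name stack_sutse and its list-based interface are not from the paper):

```python
import numpy as np

def stack_sutse(Z_list, T_list):
    """Combine d univariate state-space components into one SUTSE model.

    Z_list[j]: length-p_j design vector of series j (the jth row of Z);
    T_list[j]: (p_j, p_j) transition matrix of series j.
    Returns the d x p design matrix Z and the p x p block-diagonal
    transition matrix T, where p = p_1 + ... + p_d.
    """
    d = len(Z_list)
    sizes = [T.shape[0] for T in T_list]
    p = sum(sizes)
    Z = np.zeros((d, p))
    T = np.zeros((p, p))
    offset = 0
    for j in range(d):
        Z[j, offset:offset + sizes[j]] = Z_list[j]
        T[offset:offset + sizes[j], offset:offset + sizes[j]] = T_list[j]
        offset += sizes[j]
    return Z, T

# two local-level series (p_j = 1, Z^(j) = 1, T^(j) = 1) give Z = T = I_2
Z, T = stack_sutse([np.ones(1), np.ones(1)], [np.eye(1), np.eye(1)])
```

The cross-series correlation enters only through Σ_ε and Σ_η, which is why the design and transition matrices are block-structured.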

One-step-ahead forecast
We now consider the forecast of y_{n+1} when y_1, y_2, . . . , y_n are given. It is well known that the Kalman filter can be used for this forecast. Let a_t := E(α_t | y_1, . . . , y_{t−1}), P_t := V(α_t | y_1, . . . , y_{t−1}), v_t := y_t − Z_t a_t, and F_t := V(v_t | y_1, . . . , y_{t−1}); then these conditional expectations and covariance matrices can be computed by the Kalman filter as follows:

v_t = y_t − Z_t a_t,   F_t = Z_t P_t Z_t^T + Σ_ε,   K_t = P_t Z_t^T F_t^{−1},
a_{t+1} = T_t (a_t + K_t v_t),   P_{t+1} = T_t (P_t − K_t Z_t P_t) T_t^T + Σ_η.   (2.3)

Note that a_1 and P_1 are the mean vector and the covariance matrix of the initial state vector α_1, respectively. Then we define the "one-step-ahead forecast" of y_t as

ȳ_t := Z_t a_t,   (2.4)

and call v_t the "one-step-ahead forecast error" in this paper. In particular, we can obtain the one-step-ahead forecast of y_{n+1} as ȳ_{n+1} = Z_{n+1} a_{n+1}. Note that by the definition of a_t, we have ȳ_t = E(Z_t α_t | y_1, . . . , y_{t−1}) = E(y_t | y_1, . . . , y_{t−1}) and E(v_t | y_1, . . . , y_{t−1}) = E(y_t − Z_t a_t | y_1, . . . , y_{t−1}) = E(y_t − ȳ_t | y_1, . . . , y_{t−1}) = 0.
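The recursions above can be sketched in Python/NumPy as follows (an illustrative implementation of the standard filtering recursions, not the authors' code; the function name kalman_filter and its argument names are ours):

```python
import numpy as np

def kalman_filter(y, Z, T, Sigma_eps, Sigma_eta, a1, P1):
    """Kalman filter for y_t = Z a_t + eps_t, alpha_{t+1} = T alpha_t + eta_t.

    Returns the one-step-ahead forecast errors v_t, their covariances F_t,
    and the predicted state a_{n+1}, so that ybar_{n+1} = Z @ a_{n+1}.
    """
    a, P = a1.astype(float).copy(), P1.astype(float).copy()
    vs, Fs = [], []
    for y_t in y:
        v = y_t - Z @ a                      # one-step-ahead forecast error
        F = Z @ P @ Z.T + Sigma_eps          # its covariance
        K = P @ Z.T @ np.linalg.inv(F)       # Kalman gain
        a = T @ (a + K @ v)                  # predicted state a_{t+1}
        P = T @ (P - K @ Z @ P) @ T.T + Sigma_eta
        vs.append(v)
        Fs.append(F)
    return np.array(vs), np.array(Fs), a
```

The d × d inverse of F_t at every step is the source of the O(d^3) cost discussed later.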

Same-step forecast
We now consider the forecast of y_{k,n+1} when y_1, y_2, . . . , y_n and also y_{1,n+1}, . . . , y_{j,n+1} are given, with k > j. The one-step-ahead forecast of y_{k,n+1} can be modified by using the conditional expectation of the one-step-ahead forecast error v_{k,n+1} as follows:

ŷ_{k,n+1} := [Z_{n+1} a_{n+1}](k) + E(v_{k,n+1} | v_{1,n+1}, . . . , v_{j,n+1}),   (2.5)

where [Z_{n+1} a_{n+1}](k) denotes the kth component of Z_{n+1} a_{n+1}. We call this forecast the "same-step forecast" in this paper. To compute the modification term (the second term) on the right-hand side of (2.5), we use the formula for the conditional expectation of a multivariate normal distribution (see [Eaton 1983], for example):

E(x_1 | x_2) = μ_1 + Σ_{12} Σ_{22}^{−1} (x_2 − μ_2),   (2.6)

where (x_1^T, x_2^T)^T follows a multivariate normal distribution with mean (μ_1^T, μ_2^T)^T and covariance blocks Σ_{11}, Σ_{12}, Σ_{21}, Σ_{22}.
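The conditional-mean correction can be sketched as follows (Python/NumPy, our own illustration; the function name same_step_modification is not from the paper). With the exact filter, the error mean is 0 and its covariance is F_{n+1}:

```python
import numpy as np

def same_step_modification(v_obs, mean, cov, A, k):
    """Conditional-mean correction E(v_k | v_A) for jointly Gaussian errors.

    mean, cov: mean vector and covariance matrix of the d-dimensional
    forecast error; A: list of indices observed in the same step;
    v_obs: their realized values; k: index of the target series.
    """
    S_kA = cov[np.ix_([k], A)]               # Cov(v_k, v_A)
    S_AA = cov[np.ix_(A, A)]                 # V(v_A)
    return mean[k] + (S_kA @ np.linalg.solve(S_AA, v_obs - mean[A]))[0]

# with corr(v_1, v_2) = 0.8 and observed v_1 = 1, the correction for v_2 is 0.8
corr = same_step_modification(np.array([1.0]), np.zeros(2),
                              np.array([[1.0, 0.8], [0.8, 1.0]]), [0], 1)
```

The same-step forecast is then the kth component of Z_{n+1} a_{n+1} plus this correction.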
3 Fast method

3.1 Kalman filter disregarding correlation

The above forecasts are computationally expensive, especially in high dimensions (i.e., when d or p is large), because of the iterative large matrix multiplications in the Kalman filter. To address this problem, we run the Kalman filter separately for each dimension, applying the univariate recursions (3.1) obtained by treating the series as uncorrelated. This method reduces the cost of the matrix calculations compared to (2.3); for example, the cost of deriving F_t^{−1} is reduced from that of inverting a d × d matrix to that of d scalar divisions. This method also reduces the time-consuming parameter estimation: only the estimation of Σ_ε(j, j) and Σ_η^{(j)} for each j = 1, . . . , d is required, instead of the full Σ_ε and Σ_η. In addition, it is possible to run (3.1) in parallel for each j = 1, 2, . . . , d.
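The separate univariate filters can be sketched as follows (Python/NumPy, our own illustration; the function name fast_filter and its interface are assumptions, not the paper's code):

```python
import numpy as np

def fast_filter(y, Zs, Ts, sig_eps, sig_eta, a1s, P1s):
    """Run d univariate Kalman filters, ignoring cross-series correlation.

    y: (n, d) observations; Zs[j] (length p_j) and Ts[j] ((p_j, p_j)) are
    the design vector and transition matrix of series j; sig_eps[j] is the
    observational noise variance and sig_eta[j] the state noise covariance.
    Returns the (n, d) matrix of approximate forecast errors vbar_{j,t}.
    """
    n, d = y.shape
    vbar = np.empty((n, d))
    for j in range(d):                  # independent loops: parallelizable
        a, P = a1s[j].astype(float).copy(), P1s[j].astype(float).copy()
        for t in range(n):
            v = y[t, j] - Zs[j] @ a                 # scalar forecast error
            F = Zs[j] @ P @ Zs[j] + sig_eps[j]      # scalar variance
            K = P @ Zs[j] / F
            a = Ts[j] @ (a + K * v)
            P = Ts[j] @ (P - np.outer(K, Zs[j] @ P)) @ Ts[j].T + sig_eta[j]
            vbar[t, j] = v
    return vbar
```

No d × d or p × p matrix ever appears; each pass only touches p_j × p_j blocks.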
Note that if such correlations did not exist, the one-step-ahead forecast errors v_{1,t}, . . . , v_{d,t} would be mutually uncorrelated. The modification term of the same-step forecast in (2.10) would then become 0, so the one-step-ahead forecast and the same-step forecast would coincide. In our setting, where the correlations do exist, the values derived by (3.1) are of course not equal to the values derived by (2.3).

Fast one-step-ahead forecast
When y_1, y_2, . . . , y_n are given, similarly to (2.4), we consider the one-step-ahead forecast in the fast method as ȳ_{n+1} := Z_{n+1} ā_{n+1}, where ā_{n+1} is computed by the separate filters (3.1) under the misspecified (uncorrelated) covariance matrices. Thus, it does not always hold that E(v̄_{n+1} | y_1, . . . , y_n) = 0, in contrast to the existing one-step-ahead forecast (2.4). However, we show the convergence E(v̄_t) → 0 as t → ∞. The assumptions and the proof of this convergence are given in Section 4.

Fast same-step forecast
When y_1, y_2, . . . , y_n and also y_{1,n+1}, . . . , y_{j,n+1} are given, similarly to (2.10), we consider the same-step forecast in the fast method as

ŷ_{k,n+1} := [Z_{n+1} ā_{n+1}](k) + E(v̄_{k,n+1} | v̄_{1,n+1}, . . . , v̄_{j,n+1}).   (3.5)

Using the formula (2.6) for the conditional expectation of a multivariate normal distribution, the modification term (the second term) of (3.5) can be computed from the mean vector and the covariance matrix of v̄_{n+1}. This suggests that if E(v̄_{n+1}) and V(v̄_{n+1}) are given, the same-step forecast (3.5) can be obtained. However, it is difficult to derive E(v̄_{n+1}) and V(v̄_{n+1}) unless a_{n+1}, F_{n+1}, etc. are given.
Note that a_{n+1}, F_{n+1}, etc. are computed only in the exact Kalman filter (2.3), so these values are not available in the fast method. Hence, we now consider estimating E(v̄_{n+1}) and V(v̄_{n+1}).
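The estimation step can be sketched as follows (Python/NumPy, our own illustration; the function names are assumptions): the mean of v̄_{n+1} is estimated by 0, its covariance by the sample second moment of the residuals after discarding a burn-in period n0, and the correction term then follows the conditional-Gaussian formula (2.6):

```python
import numpy as np

def estimate_Vv(vbar, n0):
    """Estimate V_v by the sample second moment of the fast-filter errors.

    vbar: (n, d) forecast errors; the first n0 rows are discarded to remove
    the effect of the initial values (centering at 0 uses E(vbar_t) -> 0).
    """
    tail = vbar[n0:]
    return tail.T @ tail / tail.shape[0]

def fast_same_step(ybar_next, vbar_obs, Vhat, A, k):
    """Fast same-step forecast of y_{k,n+1} from observed errors vbar_{A,n+1}."""
    S_kA = Vhat[np.ix_([k], A)]
    S_AA = Vhat[np.ix_(A, A)]
    return ybar_next[k] + (S_kA @ np.linalg.solve(S_AA, vbar_obs))[0]
```

Here vbar_obs is y_{A,n+1} minus the fast one-step-ahead forecast of those components.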
First, as stated above, E(v̄_{n+1}) converges to 0. In addition, we show the convergence of V(v̄_t) to a limit V_v as t → ∞. The assumptions and the proof of this convergence are given in Section 4. Therefore, for a sufficiently large n, we have V(v̄_{n+1}) ≈ V_v. Then, we choose n_0 large enough to eliminate the effect around the initial value and consider the following estimator of V_v:

V̂_v := (1 / (n − n_0)) Σ_{t=n_0+1}^{n} v̄_t v̄_t^T.   (3.7)

Here, the estimator V̂_v is a consistent estimator of V_v; the proof of this consistency is given in Section 5. From the above, the estimators of E(v̄_{n+1}) and V(v̄_{n+1}) are 0 and V̂_v, respectively. Substituting these into (3.5), the same-step forecast in the fast method we propose is obtained.

4 Convergence of the mean vector and the covariance matrix of the forecast error

In this paper, define the Frobenius norm of a matrix A = (a_{ij}) as ||A||_F := (Σ_{i,j} a_{ij}^2)^{1/2}. Assume the true model is (2.2), and consider the situation where the Kalman filter is run under the misspecified parameters as in (3.2) and the values v̄_t, F̄_t, ā_t, and P̄_t are obtained. Also assume that Z_t and T_t are time-independent, so that they can be written Z_t = Z and T_t = T. In fact, time-independence of T_t and Z_t is not necessary to prove the convergence of E(v̄_t); that is, the assumptions can be milder. The milder assumption is detailed in Appendix C. We will prove that E(v̄_t) and V(v̄_t) converge under the following assumptions. Assumption 4.1.
Similarly, decompose Σ_η = R Q R^T, where R is a p × r matrix and Q is an r × r positive definite matrix. Then rank(R, TR, . . . , T^{p−1} R) = p.
Assumption 1 implies that the observational noise vector follows a non-degenerate distribution.
If Assumptions 2 and 3 are satisfied, the linear system (2.2) is said to be observable and controllable, respectively. The notions of observability and controllability were introduced by Kalman [1960a]. According to Gilbert [1963], these notions are used in the study of control theory.
Assumption 4 is specific to our analysis. The product of the matrices T L̄_t in Assumption 4 appears in the computation of E(v̄_t) and V(v̄_t); this product must be bounded to ensure the convergence of E(v̄_t) and V(v̄_t). Under Assumptions 1, 2, and 3, the following lemmas hold.

Lemma 4.2.
Under Assumptions 1, 2, and 3, all eigenvalues of T L̄ have absolute value less than 1.

And thus there exist M > 0 and 0 < r < 1 such that ||(T L̄)^n||_F ≤ M r^n for all n ∈ N (see Appendix B).
These lemmas were proved in Section 6 of [Chui et al. 2017]. Now, the following convergences of E(v̄_t) and V(v̄_t) hold.
Theorem. Under Assumptions 1, 2, 3, and 4, E(v̄_t) → 0 and V(v̄_t) → V_v as t → ∞.

For the proof of this theorem, we prepare some more lemmas.

Consistency
In Section 3, we claimed that V̂_v is a consistent estimator of V_v. We prove this claim in this section.

It follows that
The first term of ( 5.1 ) can be written as in the following lemma:

The proof is given in Appendix A.
As a result, the first term of ( 5.1 ) can be written as follows. Next, consider the second term of ( 5.1 ). We use the following lemma: The proof is given in Appendix A. If s = 1, then Cov(v̄_t, v̄_{t+s}) can be written as follows. Thus, noting that the Frobenius norm is sub-multiplicative, we have the bound below. We also prepare the following lemma: The proof is given in Appendix A. Using Lemma 5.3, it follows that this covariance is bounded. If s ≥ 2, then Cov(v̄_t, v̄_{t+s}) can be written as

Now it holds (see the proof of Lemma 4.3 in Appendix A) that
E(ā_{t+s} − a_{t+s} | y_1, . . . , y_{t+s−2}) = T L̄_{t+s−1} (ā_{t+s−1} − a_{t+s−1}). Therefore, using this transformation repeatedly, we obtain the expression below. Additionally, by Lemma 5.3 and Lemma 4.5, there exist M > 0 and 0 < r < 1 such that the bound below holds; note that this inequality also holds if s = 1. Thus, from Lemma 5.2, Cov(v̄_{i,t} v̄_{j,t}, v̄_{i,t+s} v̄_{j,t+s}) can be bounded as follows. Then, for the second term of ( 5.1 ), noting that 0 < r < 1, we obtain the bound below. From the above results, V(V̂_v(i, j)) can be bounded as follows. Combining this with the argument mentioned at the beginning of this section, it holds that V̂_v is a consistent estimator of V_v.

Simulation results
We compare the performance of the existing method in Section 2 and the fast method in Section 3 by Monte Carlo simulation. The simulation model is a combination of an AR model and a local level model, with t = 1, 2, . . . , 2000, j = 1, 2, . . . , d, and initial states α_1^{(j)} = 0 for j = 1, . . . , d. We designed this model with applications such as bus congestion forecasting in mind. In bus congestion forecasting, the dimension of the observation vector, d, corresponds to the number of buses per day, so d is usually about a few dozen. The simulation also examines the effect of changes in d on the results.
The following procedure is used in the simulation:
1. Generate the data y_1, . . . , y_2000 following the above model and split them into the training data Y_train := y_1, . . . , y_1000 and the test data Y_test := y_1001, . . . , y_2000.
2. Estimate the parameters using only Y_train:
• Apply the existing method in Section 2 and derive the maximum likelihood estimates of θ_1, . . . , θ_{d+1}.
• Apply the fast method in Section 3 and derive the maximum likelihood estimates of θ_1, . . . , θ_d and also V̂_v in ( 3.7 ).
• For both methods, the maximum likelihood estimates are computed numerically by the quasi-Newton method, and n_0 in V̂_v is selected as follows:
3. Run the one-step-ahead forecast of y_{d,t} and the same-step forecast of y_{d,t} given y_{1,t}, . . . , y_{d−1,t} for t = 1001, . . . , 2000 by the existing method and the fast method; then compute the squared forecast errors and measure the computation time taken to complete the parameter estimation and forecasts.
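A crude timing harness in the spirit of this experiment can be sketched as follows (Python/NumPy, our own illustration with a local-level model in every dimension; this is not the simulation code used in the paper):

```python
import time
import numpy as np

def full_filter_step(a, P, y_t, Z, T, Se, Sn):
    """One multivariate Kalman step (d x d and p x p matrix algebra)."""
    v = y_t - Z @ a
    F = Z @ P @ Z.T + Se
    K = P @ Z.T @ np.linalg.inv(F)
    return T @ (a + K @ v), T @ (P - K @ Z @ P) @ T.T + Sn

def time_filters(d, n=200, reps=3):
    """Compare one d-dimensional filter against d scalar filters on the
    same data (local-level model, Z = T = I)."""
    rng = np.random.default_rng(1)
    y = rng.normal(size=(n, d))
    Z = np.eye(d); T = np.eye(d); Se = np.eye(d); Sn = 0.1 * np.eye(d)

    t0 = time.perf_counter()
    for _ in range(reps):
        a, P = np.zeros(d), np.eye(d)
        for t in range(n):
            a, P = full_filter_step(a, P, y[t], Z, T, Se, Sn)
    t_full = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(reps):
        for j in range(d):              # scalar recursions, no matrix algebra
            aj, Pj = 0.0, 1.0
            for t in range(n):
                F = Pj + 1.0
                K = Pj / F
                aj = aj + K * (y[t, j] - aj)
                Pj = Pj - K * Pj + 0.1
    t_fast = time.perf_counter() - t0
    return t_full, t_fast
```

The gap between the two timings widens with d, mirroring the O(d^3) versus O(d) cost per step.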
We repeated this simulation 100 times, computed the mean-squared forecast error and the mean computation time, and compared them. The dimension is varied over d = 4, 6, 8, . . . , 16. Figure 6.1 shows the mean-squared forecast error for the test data with 1000 observations over 100 runs. The result shows that the same-step forecast of the fast method can modify the one-step-ahead forecast similarly to the existing method. The fast method is a little less accurate, but not significantly so.

Conclusion

In this paper, we considered the same-step forecast and proposed a faster method. In this method, we estimated the mean vector and the covariance matrix of the one-step-ahead forecast error by establishing their convergence, and we discussed the consistency of their estimators. A Monte Carlo simulation was conducted to investigate the effectiveness of our proposed method. The results showed that our proposed method is much faster than the existing method. As future directions, we consider studying the convergence of the covariance matrix of the one-step-ahead forecast error when Z_t or T_t is time-varying. We also consider studying a method to select the best n_0 in ( 3.7 ), i.e., the one that minimizes the mean-squared same-step forecast error.

A Proofs of lemmas

Lemma 4.7
Under Assumptions 1, 2, and 3, there exist M > 0 and 0 < r < 1 such that the bound below holds.

Proof. Noting that F̄_t and K_t − K̄_t are finite, we have the claimed bound.

Proof of Lemma 5.1. V(v̄_{i,t} v̄_{j,t}) can be written as follows. Consider the covariance of the third term in the last equation; the mean of its right-hand side is given below. In the last equality, we use the fact that the third-order moment of a normal random variable with mean 0 is 0. Therefore, the covariance vanishes. Moreover, using Isserlis' theorem [Isserlis 1918], we obtain the expression below, and therefore the claim follows.

Proof of Lemma 5.2. Similarly to the proof of Lemma 5.1, we have the claim.

Lemma 5.3
Under Assumptions 1, 2, 3, and 4, there exists M > 0 such that ||Cov(v̄_t, ā_{t+1} − a_{t+1})||_F < M for all t.

Proof. From the law of total variance, and from the fact that a_t and ā_t are constant when y_1, . . . , y_{t−1} are given, we obtain the decomposition below. Now, since F̄_t, K_t, and K̄_t converge, F̄_t (T(K_t − K̄_t))^T also converges. Therefore, there exists M_1 such that ||F̄_t (T(K_t − K̄_t))^T||_F < M_1 for all t. Similarly, since L̄_t converges, there exists M_2 such that ||(T L̄_t)^T||_F < M_2 for all t, and since E((ā_t − a_t)(ā_t − a_t)^T) and E(ā_t − a_t) converge (see Section 4), there exist M_3 and M_4 such that ||E((ā_t − a_t)(ā_t − a_t)^T)||_F < M_3 and ||E(ā_t − a_t) E(ā_t − a_t)^T||_F < M_4 for all t. From the above, letting M := M_1 + ||Z||_F × M_2 (M_3 + M_4), for any t it holds that ||Cov(v̄_t, ā_{t+1} − a_{t+1})||_F = ||Cov(v_t + Z(a_t − ā_t), T(K_t − K̄_t) v_t + T L̄_t (ā_t − a_t))||_F < M.

B Supplement of Lemma 4.2
We prove that for any matrix A whose eigenvalues all have absolute value less than 1, there exist M > 0 and 0 < r < 1 such that ||A^n||_F ≤ M r^n. First, A^n → 0 follows from the fact that all eigenvalues of A have absolute value less than 1. Hence there exists a positive integer c large enough that r_1 := ||A^c||_F < 1.
Write n = mc + q, where q is the remainder when n is divided by c. By the sub-multiplicativity of the Frobenius norm, ||A^n||_F ≤ ||A^c||_F^m ||A^q||_F = r_1^m ||A^q||_F. Setting r := r_1^{1/c} and M := r^{−(c−1)} max_{0 ≤ q ≤ c−1} ||A^q||_F, we obtain r_1^m ||A^q||_F = r^{n−q} ||A^q||_F ≤ M r^n for all n ∈ N. Note that r < 1 since r_1 < 1, which gives the desired inequality.
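This construction can be checked numerically. The following Python/NumPy sketch (our own illustration) builds r and M exactly as in the argument above and uses a non-normal sample matrix, for which ||A^n||_F first grows before decaying:

```python
import numpy as np

# numerical check of Appendix B: if every eigenvalue of A has absolute value
# less than 1, then ||A^n||_F <= M r^n with M, r constructed as in the proof
A = np.array([[0.9, 0.5],
              [0.0, 0.8]])   # non-normal: ||A||_F > 1 although rho(A) < 1

# find c with r1 := ||A^c||_F < 1 (such c exists because A^n -> 0)
c = 1
while np.linalg.norm(np.linalg.matrix_power(A, c), 'fro') >= 1:
    c += 1
r1 = np.linalg.norm(np.linalg.matrix_power(A, c), 'fro')
r = r1 ** (1.0 / c)
M = max(np.linalg.norm(np.linalg.matrix_power(A, q), 'fro')
        for q in range(c)) / r ** (c - 1)
```

The bound then holds for every n by the derivation above, even though ||A||_F itself exceeds 1.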

C Mild assumption for the convergence of E(v t )
In Section 4, we proved the convergence of E(v̄_t) and V(v̄_t) under Assumption 4.1 and the condition that Z_t = Z and T_t = T. However, the convergence of E(v̄_t) holds even when Z_t and T_t are time-varying, and the assumptions can be relaxed as follows. Assumption for the convergence of E(v̄_t): Note that this assumption is milder than Assumption 4.1 by Lemma 4.6. We prove the convergence of E(v̄_t) under this assumption.

D Comparison of V(v_t) and V(v̄_t)
We compare the covariance matrices V(v_t) of the existing method and V(v̄_t) of the fast method.
From ( 4.2 ), we obtain the expression below. From Section 4, under Assumption 4.1, this difference of the covariance matrices converges as follows. From this equation, we conjecture that as the difference between K and K̄ becomes small, the difference of the covariance matrices also becomes small.