ARIMA Forecasting Chinese Macroeconomic Variables Based on Factor and Principal Component Backdating

Abstract: In this paper, backdating methods based on factors and principal components are applied for the first time to emulate historical macroeconomic variables in China. The numerical results show that these procedures are useful for backdating historical data that are missing or unavailable. ARIMA forecasting experiments based on the backdated historical data are conducted and compared with forecasting procedures that use factors and principal components directly. Our results suggest that some key variables, such as GDP, can indeed be forecast more precisely with data backdated by principal components.


INTRODUCTION
With the implementation of the five-year plan in China, the behavior of macroeconomic variables has received more and more attention from economists and policy makers. Researchers are especially interested in tracking and predicting the tendencies of important macroeconomic variables. To guarantee the accuracy and effectiveness of these analyses, sufficiently long time series are usually required. Time series analysis is a fundamental statistical approach to handling longitudinal data. A popular model is the autoregressive integrated moving average (ARIMA) model, which was first proposed by Box and Jenkins (1968) [1] and further studied in a series of works by Box and Pierce (1970) [2], Box et al. (1974) [3] and Bartholomew (1976) [4]. Saboia (1977) [5] used this model to forecast female birth data for Norway for the years 1976-2000 from the time series of the years 1919-1974. Ledolter (1979) [6] first investigated the sensitivity of ARIMA models to non-normality of the distribution of the shock price data. Very recently, Garg et al. (2015) [7] analyzed traffic noise levels by modeling the day-night average sound level with an ARIMA process. Time series methods have been applied frequently to complex and dynamic economic phenomena; for an overview, readers are referred to the book of Tsay (2010) [8]. Recently, Corte et al. (2010) [9] and Bianco et al. (2012) [10] showed that fundamentals-based econometric models obtain statistically significant improvements upon the random walk model when modeling certain macroeconomic variables such as the short-term exchange rate. Xiao et al. (2014) [11] studied financial market volatility by establishing a multiscale ensemble forecasting model which combines ARIMA with a feed-forward neural network (FNN). The ARIMA model is used to generate a linear forecast, and the FNN is developed as a tool for nonlinear pattern recognition to correct the estimation error in the ARIMA forecast. These studies motivate us to construct effective time series models to forecast some important macroeconomic variables of China.
*Address correspondence to this author at the School of Mathematics and Statistics, Wuhan University, Wuhan, Hubei 430072, P.R. China; Tel: (86)27-68752957; Fax: (86)27-68752256; E-mail: yanliu@whu.edu.cn
It should be mentioned that in practice some historical data may be insufficient for a variety of reasons, such as missing or incomplete statistics. To solve this problem, Angelini et al. (2006) [12] first put forward a factor-backdating method to construct historical data. Factors from some macroeconomic variables relative to interest were extracted and then a linear model between interest and the factors was established to backdate interest data before Germany was unified in 1991. The factor-backdating method is well suited to situations in which time series data are missing in some of the cross-section units, and this advantage makes the procedure more effective and useful. Based on this method, Angelini et al. (2011) [13] and Anderson et al. (2011) [14] separately analyzed historical financial data for the Euro area. Brüggemann and Zeng (2015) [15] further investigated the effectiveness of the backdating method in forecasting a number of macroeconomic Euro-area variables by contrasting the forecasts obtained by this method with those from autoregressive (AR) models and logistic smooth transition autoregressive (LSTAR) models. To the best of our knowledge, no study in the literature has applied backdating methods to the macroeconomic variables of China.
This paper aims to analyze the behavior of macroeconomic variables of China by using backdating methods combined with classical simulation models. We take GDP, a key variable, as an example and study it in depth. Methods based on factors and principal components are used separately to simulate the historical GDP data. Short-term GDP data are also forecast by embedding the backdating method in traditional ARIMA time series models. All the estimated data are compared with the real ones. Throughout the paper, the set of indicators includes Electricity production (EP), Railway freight traffic (RFT), Index of raw materials supply (RMI) and Retail sales of consumer goods (CGR), as discussed in Fernald, Hsu, and Spiegel (2015) [16]. In addition to the most frequently used variable, GDP, we also analyze Financial institutions deposits (FID), Financial institutions loans (FIL) and National financial expenditure (NFE). All the procedures are implemented in the R statistical software.
The paper is organized as follows. Section 2 introduces the backdating models based on factors and principal components, and the backdated GDP data before 1999 are compared with the real ones. ARIMA forecasting experiments based on factor and principal component backdated data are conducted in Section 3, where predicted GDP data after 2010 are obtained and compared with the real ones. Section 4 summarizes the major results of the paper.

Factor-Based Backdating
Exploratory factor analysis (EFA) is used frequently to gain insights into latent structure underlying obtained data. In this section the factor-based procedures to backdate GDP historical data for China are introduced and a detailed description is given as follows.
As in Stock and Watson (2002a, b) [17,18], we suppose that each original variable $X_{it}$ is represented by unobserved common factors $F_t$ and an idiosyncratic component $e_{it}$, where $i = 1, 2, \ldots, p$. Then the vector of time series is written as
$$X_t = \Lambda F_t + e_t,$$
where $X_t$ is a $p \times 1$ vector, $\Lambda$ is a $p \times k$ matrix, $F_t$ is the $k \times 1$ $(k \le p)$ vector of common factors and $e_t$ is a $p \times 1$ vector of idiosyncratic components independent of $F_t$.
Here $\Lambda$ is called the factor loading matrix. Denote $X = (X_1, X_2, \ldots, X_N)'$, $F = (F_1, F_2, \ldots, F_N)'$ and $e = (e_1, e_2, \ldots, e_N)'$ when the number of samples is $N$. The common factors are first extracted from the time series data before the backdating procedure. In the factor model approach, we estimate the factor model using maximum likelihood estimation (MLE). Suppose that there are $k$ common factors from $p$ original variables; we then obtain the estimated factor time series $\hat{F}_{jt}$, $j = 1, 2, \ldots, k$, $t = 1, 2, \ldots, N$.
In our application, we split the entire sample period (from 1995 to 2015) into a backdating period (from $T_0$ to $T_1$), a real period (from $T_1 + 1$ to $T_2$), and a forecasting evaluation period (from $T_2 + 1$ to $T_3$). In the following, we set $T_0$ to 1995, $T_1$ to 1998, $T_2$ to 2010 and $T_3$ to 2015, which means that we backcast the GDP data from 1995 to 1998 by backdating methods.
In the first step, we make the initial indicators stationary by second-order differencing and then extract factors from the data of the four variables EP, RFT, RMI and CGR before 2011.
In the second step, we relate the factor time series to the macroeconomic series of GDP from $T_1 + 1$ to $T_2$ by a regression model with the GDP series, denoted by $y_t$, as the dependent variable and the estimated factor time series $\hat{F}_{jt}$ as explanatory variables. That is, we use the model
$$y_t = \beta_0 + \sum_{j=1}^{k} \beta_j \hat{F}_{jt} + \varepsilon_t, \quad t = T_1 + 1, \ldots, T_2, \tag{2.1}$$
and estimate the parameters $\beta_0, \beta_1, \ldots, \beta_k$ by ordinary least squares (OLS). Here the GDP series is also stationary after second-order differencing.
In the third step of our procedure, we backdate the GDP time series for the period before 1999 by
$$\hat{y}_t = \hat{\beta}_0 + \sum_{j=1}^{k} \hat{\beta}_j \hat{F}_{jt}, \quad t = T_0, \ldots, T_1. \tag{2.2}$$
The backdated results $\hat{y}_t$ and the related comparisons are provided in Section 2.3.
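The regression and backdating steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the R code used for this paper: it assumes a single factor (k = 1) that has already been extracted and differenced, the numbers are made up, and the helper names `fit_ols` and `backdate` are hypothetical.

```python
def fit_ols(x, y):
    """Fit y = b0 + b1 * x by ordinary least squares (single regressor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def backdate(factor_by_year, gdp_by_year, real_years, backdating_years):
    """Regress GDP on the factor over the real period, then apply the
    fitted line to the factor values of the backdating period."""
    x = [factor_by_year[t] for t in real_years]
    y = [gdp_by_year[t] for t in real_years]
    b0, b1 = fit_ols(x, y)
    return {t: b0 + b1 * factor_by_year[t] for t in backdating_years}


# Made-up numbers purely for illustration (not the data of the paper).
real_years = list(range(1999, 2011))        # T1 + 1 .. T2
backdating_years = list(range(1995, 1999))  # T0 .. T1
factor = {t: 0.1 * (t - 1995) for t in range(1995, 2011)}
gdp = {t: 2.0 + 3.0 * factor[t] for t in real_years}

backdated = backdate(factor, gdp, real_years, backdating_years)
```

With more than one factor the same idea applies, but the single-regressor formulas would have to be replaced by a multiple regression.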

Principal Component-Based Backdating
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables which are called principal components. The number of principal components is less than or equal to the number of original variables. PCA was first introduced by Pearson (1901) [19] and has been mostly used as a tool in exploratory data analysis and for making predictive models.
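In this subsection, prior to the backdating procedure we first use PCA to extract $k$ principal components from the data set of the four indicators EP, RFT, RMI and CGR. Suppose that at time $t$ the total number of economic indicators is $p$, and denote these indicators as $p$ random variables $X_{1t}, X_{2t}, \ldots, X_{pt}$. The principal components $G_{jt}$, $j = 1, 2, \ldots, p$, are linear combinations $G_{jt} = a_{j1} X_{1t} + a_{j2} X_{2t} + \cdots + a_{jp} X_{pt}$ of these indicators that satisfy the following conditions.
1. For each principal component the sum of the squares of the coefficients is equal to 1: $a_{j1}^2 + a_{j2}^2 + \cdots + a_{jp}^2 = 1$.
2. The principal components are uncorrelated with each other: $\mathrm{Cov}(G_{it}, G_{jt}) = 0$ for $i \ne j$.
3. The variances of the principal components satisfy $\mathrm{Var}(G_{1t}) \ge \mathrm{Var}(G_{2t}) \ge \cdots \ge \mathrm{Var}(G_{pt})$.
Denote the normalized sample matrix as $X_t$ and $\mathrm{COV}(X_t)$ as $\Sigma$. Then we derive an orthogonal matrix $U$ such that
$$U' \Sigma U = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p), \quad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0,$$
and choose $k$ satisfying
$$C_k = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{p} \lambda_j} \ge 85\%,$$
that is, these $k$ principal components account for at least 85% of the total sample variance. Therefore, we obtain the time series set $G_{jt}$, $j = 1, 2, \ldots, k$, $t = 1, 2, \ldots, N$.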
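The choice of the number of principal components by cumulative contribution can be illustrated with a short Python sketch. It assumes that the eigenvalues of the covariance matrix have already been computed; the function names and the example eigenvalues are hypothetical and are not taken from the data of this paper.

```python
def cumulative_contribution(eigenvalues):
    """Return the cumulative shares C_1, ..., C_p of total variance,
    with the eigenvalues sorted in decreasing order."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    shares, running = [], 0.0
    for lam in ordered:
        running += lam
        shares.append(running / total)
    return shares


def choose_k(eigenvalues, threshold=0.85):
    """Smallest k whose cumulative contribution reaches the threshold."""
    for k, share in enumerate(cumulative_contribution(eigenvalues), start=1):
        if share >= threshold:
            return k
    return len(eigenvalues)


# Made-up eigenvalues of a 4 x 4 covariance matrix (illustration only).
example_eigenvalues = [3.5, 0.3, 0.15, 0.05]
k = choose_k(example_eigenvalues)
```

In this made-up example the first eigenvalue alone explains 87.5% of the total variance, so a single component would be retained.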
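Relating the principal component time series to the macroeconomic series of GDP from $T_1 + 1$ to $T_2$ by a regression using the GDP series of China, denoted by $y_t$, as the dependent variable and the estimated principal component time series $\tilde{G}_{jt}$ as explanatory variables, we get the model
$$y_t = \beta_0 + \sum_{j=1}^{k} \beta_j \tilde{G}_{jt} + \varepsilon_t, \quad t = T_1 + 1, \ldots, T_2, \tag{2.3}$$
and estimate the parameters $\beta_0, \beta_1, \ldots, \beta_k$ by ordinary least squares (OLS). Finally, we backdate the GDP time series $\tilde{y}_t$ by
$$\tilde{y}_t = \hat{\beta}_0 + \sum_{j=1}^{k} \hat{\beta}_j \tilde{G}_{jt}, \quad t = T_0, \ldots, T_1. \tag{2.4}$$
Polynomial and exponential fitting methods for the principal component time series $\tilde{G}_{jt}$ are also investigated in the remainder of this subsection.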
In the polynomial fitting model, we rewrite the estimated principal component time series $\tilde{G}_{jt}$ as $\tilde{G}_{jt}^{(1)}$ and represent the model as
$$\tilde{G}_{jt}^{(1)} = \alpha_{j0}^{(1)} + \alpha_{j1}^{(1)} (t - t_0) + \cdots + \alpha_{jn}^{(1)} (t - t_0)^n, \tag{2.5}$$
where $\alpha_{j0}^{(1)}, \alpha_{j1}^{(1)}, \ldots, \alpha_{jn}^{(1)}$, $j = 1, 2, \ldots, k$, are parameters to be estimated. Here $t_0$ is set to 1994 and $n$ is set to 2. Then, we run a regression like equation (2.3) in which the GDP time series is the dependent variable and the estimated principal components $\tilde{G}_{jt}^{(1)}$ are explanatory variables.
Finally, we backdate following equation (2.4) and obtain the backdated GDP, written as $\tilde{y}_t^{(1)}$.
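A quadratic trend in $t - t_0$ can be fitted by least squares as in the following Python sketch. It is an illustration only, assuming $t_0 = 1994$ and $n = 2$ as in the text; the normal-equation solver, the function names and the example series are hypothetical and do not reproduce the R procedure used for this paper.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            ratio = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= ratio * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_polynomial(years, values, t0=1994, degree=2):
    """Least-squares coefficients of a polynomial in (t - t0)."""
    powers = [[(t - t0) ** p for p in range(degree + 1)] for t in years]
    size = degree + 1
    # Normal equations: (P'P) alpha = P'g
    ptp = [[sum(row[i] * row[j] for row in powers) for j in range(size)]
           for i in range(size)]
    ptg = [sum(row[i] * g for row, g in zip(powers, values))
           for i in range(size)]
    return solve_linear_system(ptp, ptg)


def evaluate_polynomial(alpha, year, t0=1994):
    """Evaluate the fitted polynomial at a given year."""
    return sum(a * (year - t0) ** p for p, a in enumerate(alpha))


# Made-up component values that follow an exact quadratic (illustration only).
years = list(range(1995, 2011))
component = [1.0 + 0.5 * (t - 1994) - 0.02 * (t - 1994) ** 2 for t in years]
alpha = fit_polynomial(years, component)
```

The fitted values `evaluate_polynomial(alpha, t)` would then take the place of the principal component as the explanatory variable in the regression step.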
In parallel, we denote the estimated principal component time series $\tilde{G}_{jt}$ as $\tilde{G}_{jt}^{(2)}$ and introduce the exponential regression model
$$\tilde{G}_{jt}^{(2)} = \alpha_{j1}^{(2)} \exp\{\alpha_{j2}^{(2)} (t - t_0)\}, \tag{2.6}$$
where $\alpha_{j1}^{(2)}, \alpha_{j2}^{(2)}$, $j = 1, 2, \ldots, k$, are parameters to be estimated. Then we rerun the backdating procedures of (2.3) and (2.4) using $\tilde{G}_{jt}^{(2)}$ as explanatory variables.
The backdated GDP data are denoted accordingly as $\tilde{y}_t^{(2)}$.
The backdated results $\tilde{y}_t$, $\tilde{y}_t^{(1)}$, $\tilde{y}_t^{(2)}$ and the related comparisons are provided in Section 2.3.

Backdating Comparisons
In this subsection, the numerical results for $\hat{y}_t$, $\tilde{y}_t$, $\tilde{y}_t^{(1)}$ and $\tilde{y}_t^{(2)}$ obtained in Section 2.1 and Section 2.2 and the corresponding comparisons with the real data are provided. Recall that we split the entire sample period (from 1995 to 2015) into a backdating period (from 1995 to 1998), a real period (from 1999 to 2010), and a forecasting evaluation period (from 2011 to 2015). We derive the GDP data from 1995 to 1998 by the backdating methods introduced in Section 2.1 and Section 2.2 and then compare them with the real ones.
In the EFA model we first extract one factor and then obtain the backdated GDP $\hat{y}_t$ by (2.2). In PCA-based backdating we also extract one principal component from the data set of the four indicators. We then compute the backdated GDP $\tilde{y}_t$ by the PCA model, $\tilde{y}_t^{(1)}$ by polynomial fitting and $\tilde{y}_t^{(2)}$ by exponential fitting. All of the backdated data are given in Table 1.
In addition, the variables FID, FIL and NFE can be backdated by the same procedures. For these variables we report only the root mean square error (RMSE) in Table 2.
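For reference, the RMSE between real and backdated values can be computed as in the following Python sketch. The function name and the example numbers are hypothetical, and the sketch assumes that both series are expressed in the same units over the same years.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for four years (illustration only).
example = rmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```

A smaller RMSE means that the backdated series lies closer to the real series on average.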
Tables 1 and 2 suggest that the backdating methods work well in emulating historical macroeconomic variables for China. Generally speaking, backdating based on classical PCA provides the most precise data, followed closely by backdating based on factors. Meanwhile, the traditional curve fitting approach to describing the principal components leads to higher errors relative to the real data.
Note: The unit of the data in Table 1 is billion yuan.

FORECASTING METHODS AND EVALUATION
The effectiveness of forecasting based on factors and PCA is investigated in this section. Because GDP shows a rising tendency, we first use the traditional ARIMA time series model to obtain predictions before turning to these two forecasting procedures; this is described in detail in Section 3.1. Forecasting methods based on factors and PCA are introduced separately in Section 3.2 and Section 3.3. The results and the corresponding comparisons are given in Section 3.4.

Autoregressive Integrated Moving Average (ARIMA)
It is well known that the ARIMA time series model has the form
$$y_{t+h} = Z_t' \theta_{ht} + \varepsilon_{t+h},$$
where $Z_t$ is the vector of explanatory variables, $\theta_{ht}$ is a vector of possibly time-varying parameters and $\varepsilon_{t+h}$ is the error of this model. We focus on forecasting GDP in China $h$ periods ahead, denoted as $y_{t+h}$, where $h$ represents the forecasting horizon. The $h$-step ahead forecast is given by
$$\hat{y}_{t+h|t} = Z_t' \hat{\theta}_{ht},$$
where the unknown parameter vector $\theta_{ht}$ is estimated by $\hat{\theta}_{ht}$, and the $h$-step forecast error is
$$e_{t+h|t} = y_{t+h} - \hat{y}_{t+h|t},$$
where the forecasting horizon is chosen as $h = 1, 2, 3, 4$ and $5$.
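The widely used ARIMA($p$, $d$, $q$) model is defined as
$$\Phi(B) \nabla^d y_t = \Theta(B) \varepsilon_t,$$
where $\Phi(B)$ and $\Theta(B)$ are the autoregressive and moving average polynomials in $B$ of orders $p$ and $q$, and $B$ is the backward shift operator satisfying $B y_t = y_{t-1}$, $B^j y_t = y_{t-j}$ and $\nabla^d = (1 - B)^d$. An example of an ARIMA model is the ARIMA(1,0,0) model, the first-order autoregressive model, which was used recently by Brüggemann and Zeng (2015) [15] to forecast the EMU aggregate of some variables of interest $h$ periods ahead.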
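As a concrete instance, an ARIMA(1,0,0) model with a constant can be fitted and used for forecasts at horizons 1 to 5 as in the following Python sketch. It is a simplified illustration, not the estimation routine used for this paper: it assumes OLS estimation and an iterated one-step recursion, the function names are hypothetical, and the toy series is noise-free.

```python
def fit_ar1(series):
    """OLS estimates (c, phi) for y_t = c + phi * y_{t-1} + error."""
    x = series[:-1]   # lagged values y_{t-1}
    y = series[1:]    # current values y_t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi


def forecast_ar1(series, horizons=5):
    """Iterate the fitted recursion to get forecasts for h = 1..horizons."""
    c, phi = fit_ar1(series)
    forecasts = []
    last = series[-1]
    for _ in range(horizons):
        last = c + phi * last
        forecasts.append(last)
    return forecasts


# Noise-free toy series generated by y_t = 1 + 0.5 * y_{t-1} (illustration only).
series = [0.0]
for _ in range(10):
    series.append(1.0 + 0.5 * series[-1])

forecasts = forecast_ar1(series, horizons=5)
```

For a differenced series, forecasts obtained in this way would still have to be cumulated back to levels before being compared with the real data.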
In this subsection we study the problem of minimum variance prediction for stationary time series. In the first step, the stationarity of the GDP time series $y_t$, $t = 1, 2, \ldots$, is checked and, if the series is not stationary, a stationary sequence is obtained by applying difference operators. It turns out that in our case the GDP time series becomes stationary after differencing twice. In the second step, a white noise test is applied to the stationary sequence, and an ARMA model is fitted to it if it is not white noise.
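The two steps can be illustrated with the following Python sketch. The text does not name the white noise test that was used, so the Box-Pierce statistic is an assumption made here for illustration; the function names and the toy series are hypothetical as well.

```python
def difference(series, d=1):
    """Apply the difference operator (1 - B) to the series d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[i] - mean) * (series[i - lag] - mean)
              for i in range(lag, n))
    return num / denom


def box_pierce(series, max_lag):
    """Box-Pierce statistic Q = n * sum of squared autocorrelations."""
    n = len(series)
    return n * sum(autocorrelation(series, k) ** 2
                   for k in range(1, max_lag + 1))


# A quadratic trend becomes constant after differencing twice (toy example).
levels = [float(t * t) for t in range(10)]
twice_differenced = difference(levels, d=2)

# Strongly alternating toy series, clearly not white noise.
q_statistic = box_pierce([1.0, -1.0, 1.0, -1.0, 1.0, -1.0], max_lag=1)
```

Under the white noise hypothesis the statistic is compared with a chi-square critical value, and a large value indicates remaining autocorrelation that an ARMA model could capture.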
In this ARIMA($p$, $d$, $q$) model, we first use the real historical GDP data from 1995 to 2010 to predict the data from 2011 to 2015, and the results are denoted by $\hat{\tilde{y}}_t$. Then we use the backdated data from 1995 to 1998 obtained in Section 2 together with the real data from 1999 to 2010 to forecast the future data.
The forecasts based on ARIMA with real data, factor backdated data, PCA backdated data, PCA with polynomial fitting backdated data and PCA with exponential fitting backdated data are denoted respectively by $\hat{\tilde{y}}_t$, $\hat{y}_t$, $\tilde{y}_t$, $\tilde{y}_t^{(1)}$ and $\tilde{y}_t^{(2)}$, and the related comparisons are provided in Table 3.

Factor-Based Forecasting
In this subsection, we discuss the forecasting method based on factors. As in the backdating procedure, we first extract $k$ factors from the four original variables covering the entire period from $T_0$ to $T_3$. We then relate the factor time series to the macroeconomic series of GDP from $T_0$ to $T_2$, that is, we construct a regression model like (2.1) using the GDP series as the dependent variable and the estimated factor time series as explanatory variables for $t \in [T_0, T_2]$. Finally, we forecast the GDP data following
$$\hat{y}_t = \hat{\beta}_0 + \sum_{j=1}^{k} \hat{\beta}_j \hat{F}_{jt}, \quad t \in [T_2 + 1, T_3].$$
The forecasts $\hat{y}_t$ and the related comparisons are provided in Table 4.

Principal Component-Based Forecasting
In this subsection, we discuss the forecasting method based on PCA. Firstly, we extract $k$ principal components $G_{jt}$ from the four original variables, as in the backdating procedure. We also fit $G_{jt}$ by polynomial and exponential functions and denote the emulated values as $\tilde{G}_{jt}^{(1)}$ and $\tilde{G}_{jt}^{(2)}$, which have the same form as (2.5) and (2.6) respectively for $t \in [T_0, T_3]$. Linear regression models are then established as in Section 3.2, in which $y_t$ is the dependent variable and $\tilde{G}_{jt}$, $\tilde{G}_{jt}^{(1)}$ and $\tilde{G}_{jt}^{(2)}$ are the independent variables.
The forecasts $\tilde{y}_t$, $\tilde{y}_t^{(1)}$, $\tilde{y}_t^{(2)}$ and the related comparisons are provided in Table 4.

Forecasting Comparisons
This subsection investigates the efficiency of the forecasting methods discussed in Sections 3.1-3.3. Recall that in the ARIMA models, we first use the real values of GDP from 1995 to 2010 to make a prediction. Then we use the backdated GDP data from 1995 to 1998, obtained respectively by the factor model and by the three principal component models introduced in Section 2, together with the real GDP data from 1999 to 2010 to forecast GDP after 2010. The numerical results are provided in Table 3.
Note: The unit of the data in Table 3 is billion yuan.
Table 3 shows that ARIMA models with backdated GDP data (from 1995 to 2010) based on PCA perform better than those based on EFA. Meanwhile, FIL, FID and NFE are also forecast in the same way, and the corresponding RMSE values are shown in Table 5. We conclude that historical data based on PCA provide more precise forecasts than those based on EFA, and that the forecasts perform especially well when the principal component time series are described with polynomial and exponential functions. Interestingly, forecasts of FIL and NFE with historical data based on PCA combined with polynomial and exponential function fitting are even more precise than those based on real historical data.
The numerical results for GDP forecast by factors and PCA are given in Table 4, and the corresponding RMSE values for forecasting GDP, FIL, FID and NFE are reported in Table 6. For a clearer view, we also plot the prediction curves produced by the forecasting methods discussed in Section 3.2 and Section 3.3; see Figure 1. It follows that forecasting methods based on factors and PCA are practicable for emulating future macroeconomic data. Compared with the real data from 2011 to 2015, the data forecast by PCA are more accurate than those obtained by factors. In particular, principal components described with exponential functions provide the best forecasts.
It follows from Tables 3 and 4 that the real data are usually higher than the forecasted data. Practice has shown that the implementation of the five-year plan is effective in promoting economic growth in China. Comparisons in Tables 5 and 6 suggest that forecasts based on the ARIMA model and on PCA with principal components described with exponential functions perform better than those obtained by other methods.

CONCLUSION
In this paper, backdating procedures based on factors and PCA are introduced for the first time to emulate historical macroeconomic time series data for China. Our numerical results for GDP, FID, FIL and NFE data illustrate that they are effective methods for handling situations in which some historical data are missing or not available in the desired quality. ARIMA forecasting experiments based on the factor-backdated and PCA-backdated data are conducted and compared with forecasting procedures based directly on factors and PCA. Our results suggest that some key variables, such as real GDP, can indeed be forecast more precisely with PCA-backdated data. Overall, our results indicate that for some important macroeconomic variables the backdating procedure based on factors or PCA is a valuable method for constructing time series data for China.