Identification of Common Factors in Multivariate Time Series Modeling

For multivariate time series modelling, it is essential to know the number of common factors that define the behaviour of the data. The traditional approach to this problem is to investigate the number of cointegration relations among the data, applying the trace and maximum eigenvalue tests to obtain the number of stationary long-run relations. Alternatively, the problem can be analyzed using dynamic factor models, which involves estimating the number of common factors, both stationary and non-stationary, that describe the behaviour of the data. In this context, we empirically analyze the power of these alternative approaches by applying them to time series simulated from known factorial models and to financial market data. The results show that when there are stationary common factors, when the number of observations is small and/or when the variables enter more than one cointegration relation, the common factors test is more powerful than the usually applied cointegration tests. These results, together with the greater flexibility in identifying the loading matrix of the data-generating process, render dynamic factor models more suitable for multivariate time series analysis.


Introduction
The identification of the common factors of a set of variables and their reduced representation is an open research area within the social sciences (Zhang 2009). The usual multivariate techniques used for this include Common Factor Analysis (CFA) and Principal Component Analysis (PCA). CFA is used to obtain a reduced-dimensional representation of the variance shared among a set of observed variables, whereas PCA is used to obtain a reduced-dimensional representation of the total variance of the observed variables. As Widaman (1993) noted, the choice between CFA and PCA as the model for representing the dependence among a set of variables involves different loading matrices. Furthermore, Independent Component Analysis (ICA) complements PCA and extracts independent factors from the non-correlated factors of PCA for non-Gaussian variables (González & Nave 2010).
An important issue in CFA and PCA is the number of factors necessary to represent the set of observed variables. Lorenzo-Seva, Timmerman & Kiers (2011) reviewed the procedures for selecting this number of factors in a cross-sectional framework and proposed the use of the Hull method, but this heuristic methodology does not inform us about the number of possible factors, its results depend on the function used as a measure of error, and it cannot be used to obtain the loading matrix. Furthermore, this method has not been extended to time series analysis.
Additionally, when some of the observed variables are not stationary, PCA can extract factors with similar loadings for all variables. This lack of sparseness makes the interpretation of the results more difficult. Lansangan & Barrios (2009) suggest using Sparse Principal Component Analysis, but it requires finding a sparse approximation of the loading vectors. Thus, when the set of observed variables contains non-stationary variables, two approaches are usually used in dynamic multivariate analysis: cointegration analysis (CA) and dynamic factorial analysis (DFA).
CA identifies the long-run relationships among variables, while DFA determines the number of uncorrelated and unobservable common dynamic factors that describe the behaviour of the variables. In the long run, the solutions of both problems are complementary because, for a set of variables, the number of common factors is the difference between the number of variables and the number of cointegration relationships. However, in economics papers, CA is more widely used in the dynamic multivariate context, perhaps because it is more popular and is included in standard econometrics software packages (PcGive, Gretl and JMulTi, among others). Beyond its focus on the long run, other shortcomings may arise when we apply CA, especially when the model errors are serially dependent and/or non-Gaussian (Gonzalo 1994, Gonzalo & Lee 1998). Thus, alternative approaches become necessary (Li, Pan & Yao 2009).
Most of the research on cointegration has focused on tests to estimate the cointegration relationships, covering different issues such as the sample size (Banerjee, Dolado & Mestre 1998, Pesavento 2007, Bayer & Hanck 2013), structural breaks (Cavaliere & Taylor 2006, Westerlund & Edgerton 2006, Jing & Junsoo 2010), the decomposition matrix (Doornik & O'Brien 2002), common cycles (Cubadda 2007) and fractional cointegration (Dittmann 2000, Trenkler, Saikkonen & Lütkepohl 2007, Davidson & Monticini 2007). However, none of these studies are concerned with estimating the parameters that define the cointegration relationships, and in most cases, they only consider a single relationship between two variables, where the relationship is well defined. In this case, normalizing on one of the variables yields the parameters. In other cases, such as that described by Park, Ahn & Cho (2011), an alternative estimation method is proposed. When the number of variables involved increases, to achieve an identified system we must use a set of restrictions based on relationships among the variables known a priori, which are not always available. Finally, when we analyze finite samples, the distribution of the cointegration rank test is not well approximated by its limiting distribution (Ahlgren & Antell 2008).
In this context, DFA becomes increasingly useful. Peña & Box (1987) showed how to identify common factors for time series and how to build a simple transformation to recover the factors as linear combinations of the original series. Stock & Watson (1988) were the first to connect CA and CFA. They showed that multiple cointegrated time series must have at least one common trend or factor. Then, Escribano & Peña (1994) proposed a new normalization technique, which builds common trend representations with moving-average polynomials and, under certain circumstances, with uncorrelated shocks, allowing us to see the connection between CA and CFA. Hu & Chou (2003) studied DFA to simplify multiple time series, and Hu & Chou (2004) derived a procedure to identify the Peña & Box (1987) model. Subsequently, Kapetanios & Marcellino (2009) showed, in Monte Carlo experiments, the advantages of using a parametric estimation of dynamic factor models. Correal & Peña (2008) introduced a threshold dynamic factor model for the analysis of vector time series, which includes non-linear threshold-type behaviour. More recently, Lopes, Gamerman & Salazar (2011) proposed generalized spatial dynamic factor models. Some papers on CA, such as that by Miller (2010), suggest using DFA to estimate the parameters when there are more than two time series because the cointegration representation is not unique in this case.
The main objective of this paper is to analyze these two (a priori) alternative approaches to show whether they can yield the true number of hidden common factors in the data-generating processes of a set of variables. To do so, we apply the test developed in the DFA framework by Peña & Poncela (2006) and a traditional cointegration test, i.e., the trace test, to simulated and real data series. We choose the Peña & Poncela (2006) test as the representative test among a set of similar tests (Forni, Hallin, Lippi & Reichlin 2005, Hallin & Liska 2007, Chen, Huang & Tu 2010) developed in the DFA framework due to its superior robustness (Park, Mammen, Hardle & Borak 2009) and lower computational cost compared with other tests, such as that described in Pan & Yao (2010).
The rest of the paper is structured as follows. In section 2, we formulate the problem and describe the Peña & Poncela (2006) methodology. In section 3, we perform a comparative analysis of the two methodologies, CA and DFA, using simulated time series. In section 4, we apply the methods to real data from financial markets. Finally, we summarize the main conclusions.

Cointegration vs. Peña and Poncela Factorial Model
In multivariate cases, when the observed variables are non-stationary, by using a stationary transformation of the series we may not be using all of the information about the relationships among the variables (Peña & Sanchez 2007). To compensate for this, we can use two complementary perspectives:

• Considering the long-run stable and stationary relationships among variables and using different techniques related to cointegration. Thus, if X is the vector of N non-stationary variables and ∆X is its stationary transformation, the Vector Error Correction Model or VECM (Engle & Granger 1987) is usually used in the literature:

∆X_t = Π X_{t−1} + Σ_{p=1}^{P} Γ_p ∆X_{t−p} + ε_t   (1)

where t = 1, . . ., M; P is the number of lags required to correct the autocorrelation and other characteristics of the data series (∆X), usually determined using the Akaike Information Criterion (AIC); ε_t is an (N × 1) vector of stationary errors; Γ_p are (N × N) short-run coefficient matrices; and Π denotes the (N × N) matrix that captures the cointegration relationships. Through the well-known cointegration tests, i.e., the maximum eigenvalue and the trace test, we can estimate the cointegration rank, i.e., the rank m of the matrix Π in (1). Thus, if there are N variables and m cointegration relationships, we obtain r = N − m non-stationary common factors.
• Directly obtaining the number of common factors (Escribano & Peña 1994, Gonzalo & Granger 1995). In this case, the formulation depends on the factors (f) and their loading matrix (λ):

X_t = λ f_t + ε_t,   f_t = ρ f_{t−1} + u_t   (2)

where λ denotes an (N × r) loading matrix; f_t is an (r × 1) vector of factors; ω shows the structure of the covariance matrix of the innovations; and ε_t and u_t are stationary residuals. We note that if |ρ| is less than 1, the factor is I(0), or stationary; it is non-stationary only if |ρ| is equal to 1.
The relationship between these approaches can be explained using the decomposition of the matrix Π = αβ′, where β is the weight matrix of the cointegration relationships for each variable and α is the coefficient matrix of the cointegration relations.
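As a rough numerical illustration of the trace-test machinery, the following minimal Johansen-style sketch (one lagged difference, no deterministic terms, no critical values; the simulated series and loadings are illustrative assumptions) computes the reduced-rank eigenvalues associated with Π and the corresponding trace statistics. Real applications should rely on a tested econometrics package rather than this sketch:

```python
import numpy as np

def trace_statistics(X):
    """Trace statistics for H0: cointegration rank <= r, r = 0, ..., N-1.

    Minimal Johansen-style sketch: one lagged difference, no deterministic
    terms, and no critical values.  X is a (T x N) array of levels."""
    dX = np.diff(X, axis=0)
    Y0 = dX[1:]        # Delta X_t
    Y1 = X[1:-1]       # X_{t-1}
    Z = dX[:-1]        # Delta X_{t-1}, the short-run regressor
    T, N = Y0.shape

    def resid(Y):
        # Partial out the short-run dynamics by OLS
        beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
        return Y - Z @ beta

    R0, R1 = resid(Y0), resid(Y1)
    S00, S01, S11 = R0.T @ R0 / T, R0.T @ R1 / T, R1.T @ R1 / T
    # Eigenvalues of Johansen's reduced-rank regression problem
    M = np.linalg.solve(S11, S01.T) @ np.linalg.solve(S00, S01)
    lam = np.sort(np.real(np.linalg.eigvals(M)))[::-1]
    return np.array([-T * np.log(1 - lam[r:]).sum() for r in range(N)])

# Two series sharing one I(1) common factor -> one cointegration relation
rng = np.random.default_rng(0)
f = np.cumsum(rng.normal(size=1000))
X = np.outer(f, [1.0, 0.5]) + rng.normal(size=(1000, 2))
stats = trace_statistics(X)   # stats[0] large (reject r = 0), stats[1] small
```

With one common trend between the two series, the statistic for r = 0 is large while the statistic for r = 1 stays small, pointing to a single cointegration relationship.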
We can solve the multivariate problem using CA with the relevant restrictions, which, as noted above, becomes intractable as the number of variables increases; alternatively, we can analyze the dynamic common factors involved. This is where the Peña & Poncela (2006) test can be used to further investigate the problem. Peña & Poncela (2006) express the factorial model (2) as follows:

X_t = μ + λ_1 f_{1,t} + λ_2 f_{2,t} + ε_t

That is, there are N observed variables with the corresponding mean vector μ. The common factors may belong to the subset f_1 of non-stationary factors or to the subset f_2 of stationary common factors, and λ_1 and λ_2 are their respective loading matrices. Thus, if the first group is made up of r_1 factors and the second of r_2, then there must be N − (r_1 + r_2) cointegration relationships.
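A minimal simulation of this two-block factor structure, with one non-stationary and one stationary factor among four variables, can be sketched as follows (the mean vector, loadings and seed are illustrative assumptions, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 1000, 4
mu = np.array([1.0, -0.5, 0.2, 0.0])            # mean vector (illustrative)

f1 = np.cumsum(rng.normal(size=M))              # non-stationary factor, rho = 1
f2 = np.zeros(M)
for t in range(1, M):                           # stationary AR(1) factor, |rho| < 1
    f2[t] = 0.5 * f2[t - 1] + rng.normal()

lam1 = np.array([1.0, 0.8, -0.6, 0.4])          # loadings on f1 (illustrative)
lam2 = np.array([0.3, -0.7, 0.5, 0.9])          # loadings on f2 (illustrative)
X = mu + np.outer(f1, lam1) + np.outer(f2, lam2) + rng.normal(size=(M, N))

# With r1 = 1 non-stationary and r2 = 1 stationary factor among N = 4
# variables, there are N - (r1 + r2) = 2 cointegration relationships.
n_coint = N - (1 + 1)
```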
The advantage of this approach is that, regardless of whether the observed series are stationary, we can identify stationary and non-stationary common factors. It also allows us to test whether the cointegration test correctly discriminates between cointegration relations and stationary factors. The test is implemented as follows. According to Theorem 3 in Peña & Poncela (2006), the matrix Ĉ_k has N − (r_1 + r_2) eigenvalues that converge in probability to zero as the sample size (M) tends to infinity and the number of lags, k, increases from 0 to K, such that K/M tends to zero. In this way, the test sorts the eigenvalues (h_j) of the matrix Ĉ_k and obtains their sum, which tends asymptotically to a χ² distribution with (N − r)² degrees of freedom. The cumulative explanatory power (cep) of the factors with k lags can then be computed. Once we have estimated the number of common factors, following Peña & Poncela (2006), we can use an EM algorithm to determine the loading matrix of the expression (5), using as initial values the eigenvectors (V_j) of the matrix in (9). As shown there, (9) operates with one lag; d is the order of integration of the observed series, and d* is equal to 0 if d is greater than zero or 1 if d is equal to zero. Thus, once we order the eigenvalues in decreasing order, the initial value of the loading matrix is given by the eigenvectors associated with the first r non-zero eigenvalues. Similarly, the initial values of the factors can be obtained from these eigenvectors and the observed variables as f̂_{0,t} = V′X_t. Once we recover the initial values of the factors (f̂_0), Peña & Poncela (2006) use a stationarity test (e.g., the Augmented Dickey-Fuller test, ADF) to determine which factors are stationary.
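The eigenvalue mechanics behind the test can be illustrated schematically. The pooled lag-covariance matrix below is a simplified stand-in for Ĉ_k (the exact construction and the χ² limit are those of Peña & Poncela 2006, not reproduced here), and the loadings and seed are illustrative assumptions:

```python
import numpy as np

def pooled_lag_cov_eigs(X, k=1):
    """Sorted eigenvalues of a pooled lag-covariance matrix.

    Schematic stand-in for the matrix C_k: with r common factors driving
    N series, the N - r smallest eigenvalues should be near zero relative
    to the r largest."""
    Xc = X - X.mean(axis=0)
    M = Xc.shape[0]
    C = sum(Xc[j:].T @ Xc[:M - j] / M for j in range(k + 1))
    C = (C + C.T) / 2.0                  # symmetrize before eigen-analysis
    return np.sort(np.linalg.eigvalsh(C))[::-1]

# One random-walk factor loading on four observed series (illustrative)
rng = np.random.default_rng(3)
f = np.cumsum(rng.normal(size=1500))
X = np.outer(f, [1.0, 0.7, -0.5, 0.3]) + rng.normal(size=(1500, 4))
h = pooled_lag_cov_eigs(X, k=1)          # h[0] dominates h[1:] -> one factor

# Initial factor estimate from the leading eigenvector, as in the text
Cov = np.cov(X, rowvar=False)
V = np.linalg.eigh(Cov)[1][:, -1]
f0 = X @ V                               # spans the true factor up to sign/scale
```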
From the initial values, the final results are estimated by applying the Kalman filter to the following state space representation because, as Bauer & Wagner (2009) show, this method yields better cointegration parameters:

Observation Equation: X_t = β f_t + ε_t, with ε_t ∼ N(0, diag(σ²_1, . . ., σ²_N))
Transition Equation: f_t = ρ f_{t−1} + ω_t

where β_{i,j} is the load of factor j on variable i; σ²_i is the residual variance of variable i; and the parameter ρ is 1 when the stationarity test shows that the factors are non-stationary.
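A minimal Kalman filter for a one-factor version of this state space representation might look as follows (the parameters are taken as known for illustration, whereas the paper estimates them via the EM algorithm):

```python
import numpy as np

def kalman_filter_one_factor(X, beta, rho, sigma2, q=1.0):
    """Minimal Kalman filter for a one-factor state space model:
        observation: X_t = beta * f_t + eps_t,  eps_t ~ N(0, diag(sigma2))
        transition:  f_t  = rho * f_{t-1} + w_t,  w_t ~ N(0, q)
    Returns the filtered factor estimates (illustrative sketch only)."""
    M, N = X.shape
    f, P = 0.0, 1e6                       # vague initialization of the state
    R = np.diag(sigma2)
    out = np.empty(M)
    for t in range(M):
        f_pred = rho * f                  # predict
        P_pred = rho * rho * P + q
        S = P_pred * np.outer(beta, beta) + R
        K = P_pred * np.linalg.solve(S, beta)     # Kalman gain
        f = f_pred + K @ (X[t] - beta * f_pred)   # update
        P = P_pred * (1.0 - K @ beta)
        out[t] = f
    return out

# Simulated non-stationary factor (rho = 1) observed through four series
rng = np.random.default_rng(4)
beta = np.array([1.0, 0.8, -0.6, 0.4])
f_true = np.cumsum(rng.normal(size=500))
X = np.outer(f_true, beta) + rng.normal(size=(500, 4))
f_hat = kalman_filter_one_factor(X, beta, rho=1.0, sigma2=np.ones(4))
corr = np.corrcoef(f_hat, f_true)[0, 1]   # close to 1
```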

Comparative Analysis of the Cointegration and Factorial Dynamic Models: Experimental Study

Design of the Experimental Study
To measure the power of the Peña & Poncela (2006) test relative to CA, we apply the tests in four different scenarios with a maximum of 100 time series (max(N) = 100), each with a maximum of 5,000 data points (max(M) = 5,000), computed from simulated factors and known loading matrices. Then, we compare the test results with the actual values used in the simulated data-generating processes. We use the trace test from CA because, as Cheung & Lai (1993) showed, it is more robust to moving-average and non-Gaussian innovations than the maximum eigenvalue test, owing to its greater robustness to the skewness and excess kurtosis of the innovations. We use Ox packages to simulate factors and build non-stationary variables as follows:
• Scenario I: one common factor I(1), integrated of order one, for each set of two variables, i.e., there are 50 common factors in 100 time series.
• Scenario II: one common factor I(1) for each set of four variables, i.e., there are 25 common factors in 100 time series.
• Scenario III: one factor I(1) and another I(0), or stationary, for each set of four variables, i.e., there are 50 common factors in 100 time series.
• Scenario IV: two factors I(1) for each set of four variables, i.e., there are 50 common factors in 100 time series.
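The factor structure of these four scenarios can be sketched as follows, with random illustrative loadings rather than the paper's exact known loading matrices:

```python
import numpy as np

def make_group(M, n_series, rhos, rng):
    """One group of series driven by common AR(1) factors; rho = 1 gives an
    I(1) factor, |rho| < 1 a stationary one.  Loadings are drawn at random
    here for illustration; the paper uses fixed known loading matrices."""
    F = np.zeros((M, len(rhos)))
    for j, rho in enumerate(rhos):
        e = rng.normal(size=M)
        F[0, j] = e[0]
        for t in range(1, M):
            F[t, j] = rho * F[t - 1, j] + e[t]
    lam = rng.normal(size=(n_series, len(rhos)))
    return F @ lam.T + rng.normal(size=(M, n_series))

rng = np.random.default_rng(5)
M = 1000
scenario_I   = np.hstack([make_group(M, 2, [1.0], rng) for _ in range(50)])
scenario_II  = np.hstack([make_group(M, 4, [1.0], rng) for _ in range(25)])
scenario_III = np.hstack([make_group(M, 4, [1.0, 0.5], rng) for _ in range(25)])
scenario_IV  = np.hstack([make_group(M, 4, [1.0, 1.0], rng) for _ in range(25)])
```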
The processes are simulated from the loading matrices, i.e., the variables are grouped by common factors in each scenario, with innovations (ε_{1,t}, ε_{2,t}, ε_{3,t}, ε_{4,t}, u_{1,t}, u_{2,t}) ∼ N(0, 1) i.i.d.
Revista Colombiana de Estadística 38 (2015) 219-238
In Table 1, we show a statistical summary of the simulated values for each of the four scenarios. As we can see, the variables and the factors are neither normal nor stationary, except the I(0) factor of Scenario III. The table also shows that some variables have ADF statistics close to the threshold for accepting stationarity; that is, though the generating factors are I(1), in some cases the variables constructed from them may seem I(0), at least when 95 percent confidence intervals are considered.
From the results shown in Table 2, when we use the trace test, we can see that if there is one common factor and sufficient degrees of freedom, the trace test determines the correct number of factors without errors. However, when each group of series has two cointegration relationships, as in Scenario IV, the performance of this test declines. This result suggests that if the variables have more than one common factor, the test does not discriminate among them correctly. Moreover, if one of the factors is I(0), as in Scenario III, the test does not detect it. Another drawback is that this technique does not permit estimation of the loading matrix associated with each common factor.
In contrast, from the results shown in Table 2, when we use the Peña & Poncela (2006) test, we can see that when the number of observations per variable (M) is higher than the number of variables (N), the test converges quickly to the correct number of factors. Therefore, the test is valid for samples where the number of observations per variable is greater than the number of variables. Additionally, in Scenario III, there is one common factor I(0) that is not detected by the trace test, but the Peña & Poncela (2006) test detects it.

Results
We apply the trace test and the Peña & Poncela (2006) test to different samples, i.e., we analyze the results of the tests in samples with different numbers of variables (N) and different numbers of observations per variable (M) in each of the four scenarios described. The sample data range from 8 observations of 25 variables to 5,000 observations of 100 variables. In Table 2, we show the errors of these tests, i.e., the difference between the number of real factors and the number of factors estimated. Next, we analyze the impact on the results of the two tests of the level of confidence required and the number of lags used. We apply the tests to the same samples used in Table 2 but vary the lag number from 2 to 100 in increments of one, and the confidence level from 0.5 to 0.9999 (in increments of 0.1 up to 0.9, then in increments of 0.01 up to 0.99, and so on). The results, which confirm in all cases those in Table 2, can be summarized as follows.
Note: ndf denotes that there were not sufficient degrees of freedom.
For the trace test, the results primarily depend on the degrees of freedom: by increasing the number of lags, the cointegration rank converges to the right value, provided that the numbers of observations and variables are large enough. In contrast, when the number of observations decreases, the best results are achieved with fewer lags, since the degrees of freedom are then higher. Therefore, the trace test results depend on the lag number used, which in turn depends on the number of observations. For the Peña & Poncela (2006) test, the results show that as M approaches N, we can reduce the error by increasing the number of lags. Nevertheless, if N is greater than M, we achieve the correct value only when N is not much higher than M. Again, this is due to the number of degrees of freedom. However, the Peña & Poncela (2006) test is less sensitive to the degrees of freedom than the trace test, as the errors are smaller.
Finally, we analyze whether the methodology proposed by Peña & Poncela (2006) allows us to recover the true common factors, by determining whether they are stationary and by estimating the corresponding loading matrices. To do so, we apply the two tests to six variables (Z): two from Scenario I and four from Scenario IV. In Table 3, we present the results of the trace test, which, as expected based on the previous results, correctly identify the number of common factors but do not inform us about the two groups of variables. Because the cointegration rank is three, we could (incorrectly) assume that there are three sets of two cointegrated variables. Similarly, the Peña & Poncela (2006) test identifies three factors with one lag, a statistic of 11.62 and a p-value of 0.236. In addition, the ADF test applied to the three factors, recovered as the product of the transpose of the eigenvectors associated with the three largest eigenvalues and the observed variables, indicates that all three factors are I(1), as we can see in Table 4. Thus, both methods yield the number of factors, but it is interesting to note how the methodology proposed by Peña & Poncela (2006) recovers and groups the factors.
The parameters σ and β of the loading matrix are shown in Table 5.
The Start values, the EM algorithm estimates and the True factors are shown, for each factor, in Figures 1, 2 and 3. As we can see, the factors estimated after the maximization are quite similar to the true values.
Finally, for the Start factors and the estimated factors, we show the mean absolute error (MAE) and the root mean square error (RMSE) in Table 6, as suggested by Peña & Poncela (2006), to measure the efficiency of the EM algorithm. In summary, the Peña & Poncela (2006) test is a perfect complement to CA because it is more consistent when the variables are involved in more than one cointegration relationship and/or when there are common stationary factors. Additionally, the Peña & Poncela (2006) test is less sensitive to a low number of degrees of freedom and allows us to estimate the loading matrix associated with each common factor.

Financial Market Data Analysis
Several works, such as those by Baillie & Bollerslev (1994) and Diebold, Gardeazabal & Yilmaz (1994), have tested the market efficiency hypothesis for the spot exchange rate through the triangular arbitrage relation using CA. The cross exchange rate, or exchange rate of triangular arbitrage, is defined as follows:

c_{i,j,t} = s_{i,h,t} · s_{h,j,t}   (16)

In (16), for day t and three currencies (i, j and h), c is the cross exchange rate and s is the spot exchange market rate. In addition, each exchange market rate has a bid price and an ask price. The difference between these two prices (the spread) reflects the transaction costs. Taking logarithms in (16) to obtain a linear triangular arbitrage relationship and including the bid-ask spread, we obtain equation (17), where a is the ask price, b is the bid price and δ_i is the transaction cost for currency i. When we operate with three currencies and their respective bid and ask prices, there are six possible triangular relations among them: three relationships of the ask type and three of the bid type. Thus, the problem is to determine the true number of long-term equilibrium relations among the three currencies. For this, the general expression for arbitrage strategies (17) becomes an econometric model in which δ_{i,ψ} is the long-run result of triangular arbitrage for currency i with price ψ and the parameters δ_{i,j,ψ} and δ_{i,h,ψ} are the weights in an arbitrage portfolio of currencies j and h, respectively, whose expected values according to (17) are |δ_{i,·,ψ}| = 1. In this context, we could estimate the long-run relationships with CA or DFA, but from a financial perspective, only two relationships are possible.
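The bid-ask triangular relations can be illustrated with synthetic quotes; the numbers below are invented for illustration and are not the market data analyzed in this section:

```python
import numpy as np

# Synthetic quotes, JPY per unit of base currency (invented for illustration)
usd_jpy = {"bid": 110.40, "ask": 110.44}
eur_jpy = {"bid": 129.80, "ask": 129.86}

# Implied EUR/USD cross rates through the JPY legs:
# sell EUR for JPY at the bid, then buy USD with JPY at the ask (and vice versa)
cross_bid = eur_jpy["bid"] / usd_jpy["ask"]
cross_ask = eur_jpy["ask"] / usd_jpy["bid"]

# In logarithms the triangular relation becomes linear, as in the text:
# ln c = ln s_EUR/JPY - ln s_USD/JPY (plus transaction-cost terms)
log_mid_cross = (np.log(eur_jpy["bid"]) + np.log(eur_jpy["ask"])) / 2 \
              - (np.log(usd_jpy["bid"]) + np.log(usd_jpy["ask"])) / 2
```

Absence of triangular arbitrage requires the directly quoted EUR/USD spread to overlap the implied [cross_bid, cross_ask] band; the log form is what makes the relation estimable as a linear long-run equation.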
For our analysis, we use daily data on the bid and ask close prices of the exchange rates of the JPY (Japanese yen) against the USD (US dollar) and the EUR (euro), from 1 July 2002 to 30 December 2011: 2,478 observations from the Reuters inter-dealer platform. We first test for univariate evidence of unit roots in the sample (logarithms of the exchange rates and first differences of the log-exchange rates) and reject the stationarity hypothesis for all exchange rates in levels. Table 7 shows that the logarithms of the exchange rates are non-stationary, while the first differences of the logs (returns) are stationary; therefore, all the series, in logarithms, have a unit root.
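A simplified sketch of the ADF regression behind these unit-root checks is given below (a minimal implementation on simulated series, not a replacement for a tested routine or for the paper's results):

```python
import numpy as np

def adf_tstat(y, lags=1):
    """t-statistic on gamma in the ADF regression with a constant:
        dy_t = alpha + gamma * y_{t-1} + sum_i delta_i * dy_{t-i} + e_t
    Minimal sketch only; compare with critical values such as -2.86 (5%)."""
    dy = np.diff(y)
    T = len(dy) - lags
    cols = [np.ones(T), y[lags:-1]]          # constant and lagged level
    for i in range(1, lags + 1):
        cols.append(dy[lags - i:-i])         # lagged differences
    Z = np.column_stack(cols)
    target = dy[lags:]
    beta, *_ = np.linalg.lstsq(Z, target, rcond=None)
    resid = target - Z @ beta
    s2 = resid @ resid / (T - Z.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[1, 1])
    return beta[1] / se

rng = np.random.default_rng(6)
y_rw = np.cumsum(rng.normal(size=800))       # unit root: do not reject
y_st = np.zeros(800)
for t in range(1, 800):                      # stationary AR(1): reject
    y_st[t] = 0.5 * y_st[t - 1] + rng.normal()
t_rw, t_st = adf_tstat(y_rw), adf_tstat(y_st)
```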
Then, we estimate the number of long-run relationships using the trace test and the Peña & Poncela (2006) test. The results are presented in Table 8. The CA results show four long-run relationships among the six spot exchange market rates (bid and ask prices of the JPY against the USD and EUR), while the DFA indicates only two cointegration relations (or, equivalently, four common factors). From a financial standpoint, only two market cointegration relations are expected, one for the bid price and the other for the ask price; hence, when testing whether triangular arbitrage opportunities exist, the factorial analysis result is correct, while the trace test is inconsistent, probably because the selection criteria perform poorly in choosing the proper lag, as Cheung & Lai (1993) note. Note: Columns 2-5 show the results (using the AIC to select lags) of the traditional cointegration analysis (trace test). Columns 6-8 show the results of the dynamic factorial test; that is, the number of variables (six exchange rates) minus the number of common dynamic factors gives the cointegration rank.

Conclusions
When, in a univariate time series framework, the variable involved turns out to be non-stationary, we transform it. However, when the problem is multivariate, doing so may involve a loss of information relevant to modelling the data. In this case, cointegration analysis and the determination of the cointegration relations allow us to model the long-run equilibrium relationships among the variables involved.
However, when variables are involved in more than one cointegration relationship, estimating the parameters of the cointegration relations is conditioned upon known constraints that define the problem. Furthermore, estimating the correct number of non-correlated factors when one of them is stationary is also difficult because the tests that are commonly used cannot distinguish between cointegration relationships and stationary factors. In this paper, we have analyzed, using simulated series with known data-generating processes, the power of the trace cointegration test and of the common factors test proposed by Peña & Poncela (2006).
The results for both methodologies are similar when the number of degrees of freedom is high, except when there are stationary common factors, in which case the cointegration tests do not identify the correct number of factors. Moreover, when the sample size decreases and/or the cointegration relationships involve more than two variables, the Peña & Poncela (2006) methodology outperforms the cointegration results. An added advantage of the Peña & Poncela (2006) methodology compared with cointegration analysis is that it allows for a better approximation to the data-generating process. In short, whereas the trace test presented drawbacks with stationary factors, a low number of observations and high-order autocorrelation of the series, the Peña & Poncela (2006) test results showed more consistency. These conclusions were also confirmed when real market data on spot exchange rates were used.

Table 1 :
Statistical summary of simulated data. Normality is tested using the Jarque-Bera test with a χ² distribution. The ADF is the Augmented Dickey-Fuller test on unit roots with a constant and without a trend. Significance at the 95 and 99 per cent levels is indicated by (*) and (**), respectively.

Table 2 :
Errors in the number of factors extracted by cointegration and factorial tests.

Table 3 :
Cointegration test in the subset Z.

Table 4 :
Testing ADF on common factors. Note: The Augmented Dickey-Fuller model is estimated with a constant. The null hypothesis is that the variable is integrated of order 1; it is rejected at values of -2.86 and -3.43 at the 5 and 1 per cent significance levels, respectively.

Table 5 :
Estimated parameters and statistics. Note: To facilitate the comparison, we present the initial weights (resulting from the eigenvectors of the autocovariance matrix with one lag), or START, and the estimates after the EM algorithm, or Loading, together with the TRUE weights that generated the series. (**) and (*) show significance at the 99 and 95 percent confidence levels, respectively. The average computation time for the test was 1:03 (min:sec), while the average EM estimation time was 6:17 (min:sec). We used a DELL Precision M6500 mobile workstation with 32 GB of RAM, an Intel Core i7 processor and two HDDs of 465 GB each.

Table 6 :
Mean Absolute Error and Root Mean Square Error of the estimates.

Table 7 :
Stationarity test of the spot exchange rate. Note: The ADF test is estimated with a constant, and the [number of lags] is selected using the AIC. Critical values for the rejection of non-stationarity are -2.86 and -3.44 at significance levels of 5 and 1 per cent, respectively.

Table 8 :
Cointegration rank with ask and bid prices.