A Comparative Study of Multivariate Analysis Techniques for Highly Correlated Variable Identification and Management

In this work we attempt to locate and analyze, via multivariate analysis techniques, highly correlated covariates (factors) which are interrelated with the Gross Domestic Product and therefore affect its shaping in the short or the long term. For the analysis, feature selection techniques and model selection criteria are used. The case study focuses on annual data for Greece for the period 1980-2018.

Keywords: Multicollinearity, Correlation feature selection, Model selection criteria, Multivariate analysis, Principal component analysis.


Introduction
The purpose of this work is to identify an optimal model for the Gross Domestic Product (GDP). The Organization for Economic Co-operation and Development (OECD) states that "Gross Domestic Product (GDP) is the standard measure of the value added created through the production of goods and services in a country during a certain period. As such, it also measures the income earned from that production, or the total amount spent on final goods and services (less imports). While GDP is the single most important indicator to capture economic activity, it falls short of providing a suitable measure of people's material well-being for which alternative indicators may be more appropriate. This indicator is based on nominal GDP (also called GDP at current prices or GDP in value) and is available in different measures" (OECD, 2019). Based on well-established and proven studies it is known that GDP can be expressed by

GDP = C + I + G + (Ex - Im),  (1)

where C represents the Private Consumption Expenditures, I the Private Domestic Investments, G the Government Consumption Expenditures, Ex the Total Exports and Im the Total Imports.
The goal of this work is to locate and analyze the interrelationships between GDP and various factors/variables which are interdependent and often characterized by a high degree of multicollinearity. The GDP is frequently used by central banks, public entities and private businesses as a standard measurement of the economic health of a country (Callen, 2008). For predictive purposes, researchers often rely on economic or financial indices and model identification procedures. den Reijer (2005) and Schumacher (2007) studied the forecasting of Dutch and German GDP, respectively, through factor modelling. Later, Akhter et al. (2012) used Principal Component Analysis in order to obtain a model for the GDP of Bangladesh. Bai et al. (2015) showed the accuracy of factor analysis in the evaluation of the economy of a country, including variables such as Unemployment Rate, Investments, Population and General Government Total Expenditures, which are part of the current model analysis. Because of its unstable economy, Greece is the focus of many economic analyses from organizations such as the OECD, Eurostat and the International Monetary Fund, and there is sufficient material and data on their websites one can refer to.
The explanatory variables/factors (see Table 1) that were chosen are highly correlated and result in severe multicollinearity in the primary model, a frequent problem in financial and economic big data analytics (Wang and Alexander, 2019; Kondo et al., 2018). For the reduction or even elimination of the multicollinearity, a number of dimension reduction techniques were used in order to identify an optimal model with a set of new uncorrelated variables/factors. For comparative purposes and for measuring the quality of each model, three information criteria were used, namely the Akaike Information Criterion (Akaike, 1974), the Bayesian Information Criterion (Schwarz, 1978) and the Modified Divergence Information Criterion (Mantalos et al., 2010). In this work we rely on multivariate analysis and, in particular, on Dimension Reduction Techniques (see e.g. Li, 2018) and Multivariate Linear Regression (see e.g. Anderson, 2003) for the modelling of the Gross Domestic Product, by identifying an appropriate set of factors from a long list of possible explanatory interdependent variables which likely interact with and affect the GDP. The choice of GDP is obvious since it is a quantity of great interest for micro- as well as macroeconomics. The case of Greece is chosen due to the extreme economic events of recent years that greatly affected all aspects of economic activity.
The rest of the paper is organized as follows. Section 2 provides the information and characteristics of the dataset used in this work. Section 3 discusses the dimension reduction techniques that were used, including Principal Component Analysis (PCA) (Jolliffe, 1972; Artemiou and Li, 2009; Artemiou and Li, 2013; Wang, 2018), the Beale et al. (1967) technique and Hall's (1999) Correlation-based Feature Selection (CFS) technique, for the identification of the appropriate set of factors that affect the GDP. Section 4 deals with the modelling of GDP, while the assessment and comparison are based on model identification procedures, more specifically on the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC) and the Modified Divergence Information Criterion (MDIC). Section 5 provides the results of the optimal model. Finally, in the conclusion-discussion section, the techniques and their results are discussed along with possible extensions.

The Data
The Gross Domestic Product is interrelated, according to the relevant theory, with a variety of explanatory variables which possibly affect GDP.
This work is based on Greece's economy with annual data collected through Knoema, OECD and Eurostat for the eight (8) explanatory variables X1-X8 presented in Table 1 for the period 1980-2018 (39 annual observations). Three (3) missing values have been replaced by the average value of the preceding and the following year.
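The imputation rule described above (a missing annual value replaced by the average of the preceding and the following year) can be sketched as follows; the function name and the numbers are illustrative, not taken from the actual series:

```python
import numpy as np

def impute_interior_gaps(series):
    """Replace each interior NaN with the mean of the values
    of the preceding and the following year."""
    filled = series.copy()
    for i in range(1, len(filled) - 1):
        if np.isnan(filled[i]) and not np.isnan(filled[i - 1]) and not np.isnan(filled[i + 1]):
            filled[i] = 0.5 * (filled[i - 1] + filled[i + 1])
    return filled

# Hypothetical annual series with one gap
x = np.array([100.0, np.nan, 120.0, 130.0])
print(impute_interior_gaps(x))  # the gap becomes (100 + 120) / 2 = 110
```

Note that this simple rule only handles gaps with observed neighbours, which suffices for the three isolated missing values in the dataset.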

Dimension Reduction Techniques

Discarding Variables Technique
In order to discard variables of limited information, the simple three-step procedure proposed by Beale et al. (1967) for discarding variables in multivariate data analysis was used. The technique can be summed up as follows:
(i) Locate the minimum eigenvalue and the corresponding eigenvector of the variance-covariance or correlation matrix of the covariates involved.
(ii) Locate the element of the eigenvector with the highest absolute value. This value corresponds to one of the original variables (covariates) which is removed from the analysis.
(iii) Repeat steps (i) and (ii) until p - k variables have been removed, where p represents the number of covariates and k represents the number of eigenvalues which are larger than one.
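The three steps above can be sketched as follows, assuming the correlation matrix is used; the function name and the synthetic data in the usage example are illustrative, not part of the original analysis:

```python
import numpy as np

def beale_discard(X, names):
    """Iteratively discard variables until only k remain, where k is
    the number of eigenvalues > 1 of the original correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    k = int(np.sum(np.linalg.eigvalsh(R) > 1.0))  # eigenvalues larger than one
    keep = list(names)
    Xk = X.copy()
    while Xk.shape[1] > k:
        R = np.corrcoef(Xk, rowvar=False)
        vals, vecs = np.linalg.eigh(R)        # eigenvalues in ascending order
        v_min = vecs[:, 0]                    # eigenvector of the minimum eigenvalue
        drop = int(np.argmax(np.abs(v_min)))  # element with highest absolute value
        keep.pop(drop)                        # remove that variable from the analysis
        Xk = np.delete(Xk, drop, axis=1)
    return keep
```

Applied to a matrix containing, say, a pair of nearly collinear columns, the loop removes one member of the pair first, since that pair dominates the eigenvector of the smallest eigenvalue.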
The implementation of this procedure, results in the removal of Exports of Goods and Services, General Government Total Expenditures, Household Consumption Expenditures, Imports of Goods and Services, Investments and Population from the model. Thus, the technique suggests the use of the Total Labor Force and the Unemployment Rate as the only variables interrelated with GDP and thus affecting its modelling.

Principal Component Analysis
We proceed now with the implementation of Principal Component Analysis as an alternative dimension reduction technique and obtain the complete set of the 8 principal components, with the corresponding eigenvalues ranging from around 6.5 to nearly zero. This technique was proposed independently by Pearson (1901) and Hotelling (1933; 1936). The idea behind PCA is the conversion of a dataset with interdependent variables into a new one with uncorrelated variables (principal components), which are arranged in such a way that the first few components maintain the greater part of the variability that exists among the original variables. Under this procedure the dimensionality of the original data set can be reduced while leaving the variation as unchanged as possible (Jolliffe, 2002).
The components constitute a set of uncorrelated vectors which have been created by the following methodology. Let us denote by Zj the j-th component, by λj the corresponding eigenvalue and by aj = (a1j, ..., amj)' the corresponding eigenvector, j = 1, ..., m, i = 1, ..., n, where n is the number of observations and m is the total number of original variables/covariates. Hence, Zj is defined as:

Zj = a1j X1 + a2j X2 + ... + amj Xm.  (2)

Then, the Zij element of the vector Zj = (Z1j, ..., Znj)' is:

Zij = a1j Xi1 + a2j Xi2 + ... + amj Xim.  (3)

Based on the overall results and the fact that it is preferable to avoid the loss of important information, we conclude that the first two components (Z1 and Z2) should be kept (see Table 2), regardless of the eigenvalues, because they retain a considerable amount of the total information/variability (more than 95% of the original variability of the data). The described variability played a key role in this decision, since the intention was to keep as many components as needed for a considerable proportion of the original variability to be described by the chosen components.
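A minimal sketch of this construction, assuming standardized variables, correlation-based PCA and a 95% cumulative-variance threshold; the function name and the demonstration data are hypothetical:

```python
import numpy as np

def pca_components(X, threshold=0.95):
    """Return the eigenvalues, the component scores and the number of
    leading components whose cumulative variance share reaches `threshold`."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize -> correlation PCA
    R = np.corrcoef(Z, rowvar=False)
    vals, vecs = np.linalg.eigh(R)
    order = np.argsort(vals)[::-1]                    # sort eigenvalues descending
    vals, vecs = vals[order], vecs[:, order]
    share = np.cumsum(vals) / vals.sum()              # cumulative variance shares
    n_keep = int(np.searchsorted(share, threshold) + 1)
    scores = Z @ vecs[:, :n_keep]                     # Zij = a1j*Xi1 + ... + amj*Xim
    return vals, scores, n_keep
```

By construction the retained score columns are uncorrelated, which is exactly the property the new variables Z1 and Z2 are required to have.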

Remark:
To determine which variables are significant in each component, the following empirical rule was followed. For the two chosen components, the variables for which the absolute value of the associated coefficient is at least equal to 0.95 are kept as significant. A threshold of around 0.95, although there is no specific rule, is considered satisfactory for retaining a sufficient amount of information.
For the problem at hand, the first component, denoted by Z1, holds more than 80% of the total variation of the dataset, while the second one, denoted by Z2, holds roughly 15% of it. The rest of the components contain the remaining percentage of variation. By construction, the first component is considered to be the most important one, on which the analysis is primarily based. In the above analysis, 6 of the 8 variables emerge as important according to the associated coefficients given in parentheses (see Table 2).
Remark: For modelling purposes both PCA significant components (Z1 and Z2) are used in their full form, which contains not only the significant variables (with coefficients at least equal to 0.95 in absolute value) presented in Table 2, but all m = 8 original Xi's in (3).

Correlation-Based Feature Selection
Variable selection, also known as feature selection (Guyon and Elisseeff, 2003), is the procedure of evaluating all possible subsets of a dataset and finding the one that minimizes the error rate. Through this process, the best subset of relevant variables will emerge for a better model construction. Furthermore, all insignificant variables will be removed without incurring much loss of information. In this work a Correlation-based Feature Selection, denoted by CFS (Hall, 1999), will be used as the third approach for Dimension Reduction. CFS is a measure that evaluates subsets of features on the basis of the following Hall's hypothesis: an optimal feature subset includes uncorrelated independent covariates (features) and, simultaneously, high correlations between each covariate and the dependent variable. If such correlations are available, then the merit of a feature subset S consisting of N features is defined as:

Merit_S = N r_cf / sqrt(N + N(N - 1) r_ff),  (4)

where N is the number of variables in S, r_cf is the average of the correlations between the independent variables and the dependent variable, and r_ff is the average inter-correlation between the independent variables. Hall presented a backward elimination procedure with the use of (4) in order to choose a subset. The full set of variables is evaluated with (4), which, in fact, is the Pearson's correlation coefficient between the summed standardized independent variables and the dependent variable. Then, a variable is temporarily removed and the reduced set of variables is evaluated with the same equation. If the subset scores higher than the set before, the variable is permanently removed; otherwise, it is reinstated. The process continues until each variable has been removed once and the effect of its removal measured, and stops when no subset scores higher than the current set.
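Hall's merit and the backward elimination loop described above can be sketched as follows; the function names and the synthetic usage data are an illustrative reading of the procedure, not Hall's reference implementation:

```python
import numpy as np

def merit(X, y, subset):
    """Hall's merit of a feature subset: N*r_cf / sqrt(N + N(N-1)*r_ff)."""
    N = len(subset)
    if N == 0:
        return 0.0
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if N == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return N * r_cf / np.sqrt(N + N * (N - 1) * r_ff)

def cfs_backward(X, y):
    """Backward elimination: permanently drop a feature whenever
    its removal raises the merit; stop when no removal helps."""
    subset = list(range(X.shape[1]))
    improved = True
    while improved and len(subset) > 1:
        improved = False
        for j in list(subset):
            trial = [f for f in subset if f != j]
            if merit(X, y, trial) > merit(X, y, subset):
                subset = trial          # removal improved the score: keep it
                improved = True
    return subset
```

By construction the merit of the returned subset is at least that of the full set, mirroring the stopping rule described above.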
The implementation of this procedure in the examined dataset results in the withdrawal of 5 out of the 8 original variables. The remaining variables, namely General Government Total Expenditures, Household Consumption Expenditures and Imports of Goods and Services, are considered the important ones in the modelling of GDP. It must be noted that the same variables, together with the Total Labor Force, compose the first and most important component (Z1) of PCA.

Techniques Review
In the previous sections three dimension reduction techniques were implemented for the identification of interrelationships between a number of potentially significant factors and the GDP. While in some cases similarities between the techniques were revealed, all three highlight different variables as important, as can be seen in Table 3.

Model Selection Criteria
Model identification procedures play a pivotal role in statistics by identifying the best model among an available class of models. These procedures can be viewed as estimators of a quantity of interest: for given data, for example, the probability of the proposed model can serve as such an estimator, which is essential in the pursuit of the optimal underlying structure of the phenomenon under investigation.
Model identification procedures have been heuristically recommended for time varying processes. Kullback and Leibler (1951) developed such a measure that minimizes the loss of information. A direct connection between the Kullback-Leibler (KL) measure and the maximum likelihood estimation (MLE) method gave rise to the well-known Akaike Information Criterion (AIC, Akaike, 1974). A related procedure, also associated with the likelihood function, is the Bayesian Information Criterion (BIC, Schwarz, 1978). These criteria are the most popular ones, among others. In this work, in addition to AIC and BIC, a recently developed information criterion known as the Modified Divergence Information Criterion (MDIC), proposed by Mantalos et al. (2010), will be used for comparative purposes.

Akaike Information Criterion
The AIC can be considered as the relative amount of information lost by the candidate model: the less information lost, the higher the model's quality. In other words, AIC approximates the quality of a candidate model relative to each of the other candidate models for the data. As mentioned above, the task is accomplished by combining a criterion that minimizes the loss of information with a maximum likelihood estimation method. More specifically, AIC is based on the log-likelihood function and is defined as:

AIC = -2 log L(θ̂) + 2p,  (5)

where L(θ̂) is the maximized likelihood and p represents the dimension of the vector-parameter θ. The optimal model is the one with the lowest AIC value.
From the results in Table 4, it appears that the optimal model based on Akaike Information Criterion is the one formulated by Hall's Correlation-based Feature Selection technique and contains the General Government Total Expenditures, the Household Consumption Expenditures and the Imports of Goods and Services as the independent variables.

Bayesian Information Criterion
The Bayesian information criterion is a model identification procedure based on information theory but set within a Bayesian context. It is an evaluation criterion for models estimated by using the maximum likelihood method. BIC can be considered as an estimate of a function of the posterior probability of a model being true, under a certain Bayesian setup, so that a lower BIC means that a model is considered to be more likely to be the true model. BIC is given by

BIC = -2 log L(θ̂) + p log n,  (6)

where p represents the dimension of the vector-parameter θ and n is the number of observations.
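As a hedged sketch, both criteria can be computed for a Gaussian linear model fitted by ordinary least squares; the helper below counts the error variance as an extra parameter, a common but not universal convention, and the data in the test are synthetic:

```python
import numpy as np

def aic_bic_ols(X, y):
    """AIC = -2 log L + 2p and BIC = -2 log L + p log n for a Gaussian
    linear model fit by least squares (p = coefficients + error variance)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])        # add an intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = np.mean(resid ** 2)                 # ML estimate of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    p = Xd.shape[1] + 1                          # coefficients plus sigma^2
    return -2 * loglik + 2 * p, -2 * loglik + p * np.log(n)
```

Since both criteria share the -2 log L term and differ only in the penalty, BIC penalizes extra variables more heavily than AIC whenever log n > 2, i.e. for n > 7, which holds for the 39 annual observations used here.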
The Laplace method for integrals has been used for obtaining the marginal likelihood associated with the posterior probability in (6). The results of the implementation of BIC can be seen in Table 4, where we observe that BIC, like AIC, chooses Hall's CFS model as the best one.

Modified Divergence Information Criterion
The Divergence Information Criterion (DIC) proposed by Mattheou et al. (2009) constitutes a modelling generalization of AIC, based on the Basu, Harris, Hjort, and Jones (BHHJ) divergence measure (Basu et al., 1998). The DIC family of procedures, like AIC, relies on an asymptotic approximation as the sample size increases and offers an alternative based on the so-called divergence measures.
In this work, we consider the Modified Divergence Information Criterion (MDIC), a modification of DIC proposed by Mantalos et al. (2010). MDIC can be viewed as an approximation of the expected overall discrepancy which, based on the BHHJ measure, evaluates the distance between the true and the fitted models. If the model with the smallest estimator of the expected overall discrepancy is chosen, then it is possible to end up with a model with an unnecessarily large number of variables; thus, the Modified Divergence Information Criterion is a criterion comparable to AIC. MDIC is defined in terms of the estimated expected overall discrepancy, where:
- p is the order of the model or the number of variables involved;
- θ̂ is a consistent and asymptotically normal estimator of the parameter vector θ.
The results based on MDIC together with those based on AIC and BIC are provided in Table 4.

Conclusion and Future Research
In conclusion, in this paper we attempted, via dimension reduction techniques, to identify interrelationships between the Gross Domestic Product of Greece and a number of factors which are highly correlated. The Beale et al. (1967), Principal Component Analysis and Hall's Correlation-based Feature Selection techniques were implemented and suggested different models with different variables (see Table 3).
More specifically, Beale et al. (1967) proposed a model with the Total Labor Force (X7) and the Unemployment Rate (X8) as independent variables. This technique clearly focuses solely on the workforce point of view in order to achieve the optimal model. PCA, on the other hand, instead of using the original variables, created new uncorrelated ones. In fact, PCA promotes a model with two uncorrelated variables (Z1 and Z2). Through them, 6 out of a total of 8 variables emerge as important, namely X2, X3, X4, X5, X7 and X8 (see Table 2). It should be noted that the variables selected as significant have also been chosen either by Beale's or Hall's models. The third technique, CFS, proposed a model with the General Government Total Expenditures (X2), the Household Consumption Expenditures (X3) and the Imports of Goods and Services (X4) as significant variables affecting GDP.
Based on the theoretical background (see Equation (1)), it appears that the CFS model covers most of GDP's formula and seems able to identify and select the "right" subset of variables from the original ones. Indeed, although CFS does not select the Investments and the Exports of Goods and Services, which are both part of the variables involved in (1), it does identify the Imports of Goods and Services (which is part of the Imports), the Government Expenditures and the Household Consumption Expenditures. Note though that the CFS model also ignores demographic variables, which affect the modelling of GDP indirectly rather than directly, through their interrelationships with all variables involved in (1).
The theoretical interpretation of the results is confirmed by two out of the three model selection criteria that were used and their results are provided in Table 4. Both AIC and BIC select Hall's CFS model, while MDIC selects Beale et al. (1967) model.
From the analysis, we see that the PCA model is not the optimal one in any of the cases examined. When it comes to CFS and Beale et al. (1967), we observe that both AIC and BIC clearly choose the former, leaving the latter far behind. On the other hand, although MDIC is in favor of Beale et al. (1967), the difference observed as compared to CFS could not be considered significant.
The main obstacle that we had to overcome in this work was the problem of multicollinearity, which is very common especially in modelling that involves big data on various financial characteristics and/or economic indicators. The case of the GDP of Greece was an ideal example to explore the capabilities of various multivariate analysis techniques in handling the multicollinearity problem and identifying a set of influential factors. Taking that into consideration, it is possible, in a future work, to explore how different model selection criteria react, and whether they are able to make the right variable/model selection, when multicollinearity is of different magnitude (non-existent, low, medium, high, nearly perfect or even perfect). Through this process one could identify the criterion which adjusts best and ultimately succeeds in choosing the optimal model when the variables involved are highly correlated.