PROPOSED TWO VARIABLE SELECTION METHODS FOR BIG DATA: SIMULATION AND APPLICATION TO AIR QUALITY DATA IN ITALY

In this era of big data, considerable amounts of information are produced daily with the rapid development of technology. In various fields, such as engineering, computer science, and finance, several statistical and machine learning methods are used to uncover useful information and patterns in these enormous datasets. Neural networks (NN) and random forests (RF) are common model selection (variable selection) methods in machine learning, while the least absolute shrinkage and selection operator (LASSO) and principal component analysis (PCA) are common statistical methods. In this study, we propose two methods: a combination of NN and LASSO and a combination of NN and RF. We use a Monte Carlo simulation and a real data application (air quality data in Italy) to compare the performance of the classical methods (ordinary least squares and the feed-forward NN) with the two proposed methods using goodness-of-fit criteria. The results show that the proposed methods perform better than the classical methods.


INTRODUCTION
Variable selection methods hold statistical value in regression analysis, especially for models with many independent variables, and model selection methods have recently been developed to extract useful information from large databases (big data) in all fields. However, traditional statistical methods cannot manage big data, and extracting useful information from such complex datasets has become a major challenge. The commonly used machine learning methods are neural networks (NN) and random forests (RF), and the common statistical tools include the least absolute shrinkage and selection operator (LASSO) and principal component analysis (PCA). This study proposes two selection methods by combining NN with LASSO and with RF. We compare the performance of classical selection methods (ordinary least squares (OLS) and the feed-forward NN) with the proposed methods through a Monte Carlo simulation study and a real data application. We conclude that the proposed methods perform better than the classical methods, with smaller errors.
Wang et al. [1] used the ideas of quantile regression and random LASSO in the case of highly correlated variables. Mansoor et al. [2] applied a feed-forward neural network (FFNN) to a dataset on commercial buildings for a possible demand response program application.
The machine learning method that deserves more attention, however, is the RF method, which often dominates other methods. Combining machine learning methods, i.e., RF with NN and LASSO with NN, produces new and powerful methods.
The remainder of the paper is organized as follows: the classical variable selection methods are introduced in Section 2. Section 3 presents the proposed methods. In Section 4, we present the Monte Carlo simulation. In Section 5, we discuss the application of the proposed methods. Finally, Section 6 presents the conclusion.

Feed-Forward Neural Networks
A great deal of hyperbole accompanied NNs in their first wave around 1960 [3,4] and in their renaissance around 1985 (inspired by [5]). However, the biological analogies have detracted from the essence of what is being discussed and are irrelevant to practical applications in pattern recognition. Because NNs have become a popular subject, they have accumulated numerous loosely related methods that are not biologically motivated. A formal definition of a feed-forward network is given in the glossary. Such a network basically contains units with one-way connections to other units; the units can be numbered from inputs (low numbers) to outputs (high numbers) so that each unit connects only to units with higher numbers. The units can always be arranged in layers so that connections go from one layer to a later one. This is best observed graphically (Fig. 1). Each unit $j$ sums its inputs and adds a constant (the "bias" $\alpha_j$) to form its total input $x_j$, and applies a function $f_j$ to $x_j$ to produce its output $y_j$. The links have weights $w_{ij}$, which multiply the signals traveling along them by that factor. The input units merely distribute the inputs, so for them $f \equiv 1$. Thus, a network such as that given in Fig. 1 represents the function

$$y_k = f_k\!\left(\alpha_k + \sum_{j \to k} w_{jk}\, f_j\!\left(\alpha_j + \sum_{i \to j} w_{ij}\, x_i\right)\right). \tag{1}$$

Logistic units are equivalent to threshold units apart from a linear transformation that can be absorbed into the weights, except at the output units. Only threshold units give a genuine multilayer extension of the perceptron, and such networks were considered by [4,6]. The general definition allows more than one hidden layer, and it also allows "skip-layer" connections from input to output. If all hidden units have the same function $f_h$ and all output units have the same function $f_0$, we have

$$y_k = f_0\!\left(\alpha_k + \sum_{i \to k} w_{ik}\, x_i + \sum_{j \to k} w_{jk}\, f_h\!\left(\alpha_j + \sum_{i \to j} w_{ij}\, x_i\right)\right). \tag{2}$$

The bias terms can be eliminated by introducing a new unit 0 (the bias unit), which permanently outputs +1 and is connected to all other units, setting $w_{0j} = \alpha_j$. This is the same concept as incorporating the constant term in the design matrix of a regression by including a column of 1's, and it is shown in Fig. 1. The general form is then given as

$$y_k = f_0\!\left(\sum_{i \to k} w_{ik}\, x_i + \sum_{j \to k} w_{jk}\, f_h\!\left(\sum_{i \to j} w_{ij}\, x_i\right)\right), \tag{3}$$

where the sums now include the bias unit. Notably, if the hidden layer contains logistic units, adding skip-layer connections is not more general, because we can add another unit per output to the hidden layer with input weights $w_{jk}/G$ and output weight $G$ to only unit $k$; then, for large $G$, only the central, linear part of the range of the logistic function is used. However, skip-layer connections can be easier to implement and interpret. An NN with a single logistic output unit is a nonlinear extension of logistic regression; with several logistic output units, it corresponds to linked logistic regressions of each class vs. the others.

The terminology of NNs can be very confusing. Fig. 1 is sometimes considered to have three layers (which seems visually correct), two layers (as the input layer does nothing), or one hidden layer (as the states of the units in the central layer cannot be inspected from outside the "black box"). Following Che et al. [7], we refer to the inputs, outputs, and hidden layer, because we will always have only one hidden layer. We extend our notation to allow every unit $j$ to have an input $x_j$ and output $y_j$. The inputs to the entire network are the inputs to the input units, and the outputs from the entire network are those of the output units. The signal paths through the network are determined using the following equations:

$$x_j = \sum_{i \to j} w_{ij}\, y_i, \qquad y_j = f_j(x_j), \tag{4}$$

where the conditions on the sums can be neglected by defining $w_{ij}$ to be zero for all nonexistent links. When programming, it is necessary to number the units by layer, so that all input units precede those in the first hidden layer; then $w_{ij} = 0$ unless $i < j$. Che et al. [7] briefly consider how such functions were suggested and the theory showing that they form large and flexible classes of functions. In practice, the main issues are how the parameters (weights) should be estimated and how the architecture (the number of layers, the number of units in each, and which connections to include) should be selected [6].
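To make the preceding description concrete, the following is a minimal R sketch of fitting a single-hidden-layer feed-forward network with the nnet package (which also supports the skip-layer connections discussed above via skip = TRUE). The simulated data frame, the number of hidden units, and the weight-decay value are illustrative assumptions, not settings taken from this paper.

# Minimal sketch: single-hidden-layer FFNN with the 'nnet' package.
# The data 'dat', size = 5, and decay = 0.01 are illustrative assumptions.
library(nnet)

set.seed(123)
dat <- data.frame(matrix(rnorm(200 * 11), ncol = 11))
names(dat) <- c("y", paste0("x", 1:10))

# linout = TRUE gives a linear output unit f_0 (regression);
# the hidden units use the logistic activation f_h.
fit <- nnet(y ~ ., data = dat, size = 5, decay = 0.01,
            linout = TRUE, maxit = 500, trace = FALSE)

pred <- predict(fit, dat)          # network outputs y_k
mse  <- mean((dat$y - pred)^2)     # goodness of fit on the training data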

Principal Component Analysis with Neural Network (PCANN)
This method is based on spectral analysis of the second-order moment matrix (the correlation matrix) that statistically characterizes a random vector; it was used by [8] and introduced by [6]. The single-neuron model of [9] was extended to a feed-forward network model to extract the first $J$ PCs. Fig. 2 shows the architecture of the PCA network. The output of the network is given by

$$\mathbf{y} = W^{\prime}\mathbf{x}, \tag{5}$$

where $\mathbf{y} = (y_1, y_2, \cdots, y_J)^{\prime}$, $\mathbf{x} = (x_1, x_2, \cdots, x_m)^{\prime}$, $W = (\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_J)$, and $\mathbf{w}_p = (w_{1p}, w_{2p}, \cdots, w_{mp})^{\prime}$; $w_{jp}$ is the weight from the $j$th input to the $p$th neuron [10].
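As a rough illustration of the PCANN idea under our own simplifying assumption, the sketch below computes the projection $\mathbf{y} = W^{\prime}\mathbf{x}$ with ordinary PCA (prcomp) rather than with a trained PCA network, and then feeds the leading components to an FFNN. The choice of three retained components is arbitrary.

# Rough PCANN sketch: obtain the scores W'x by ordinary PCA, then fit an
# FFNN on the leading components. Retaining 3 components is an assumption.
library(nnet)

set.seed(123)
X <- matrix(rnorm(200 * 10), ncol = 10)
y <- X %*% runif(10) + rnorm(200)

pca    <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:3]             # first PCs: the network outputs y = W'x

pcann <- nnet(scores, y, size = 4, linout = TRUE, maxit = 500, trace = FALSE)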

Least Absolute Shrinkage and Selection Operator
For a given collection of $N$ predictor-response pairs $\{(x_i, y_i)\}_{i=1}^{N}$, LASSO obtains the solution $(\hat{\beta}_0, \hat{\beta})$ to the following optimization problem:

$$\min_{\beta_0,\,\beta}\ \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\lvert\beta_j\rvert \le t. \tag{6}$$

The constraint $\sum_{j=1}^{p}\lvert\beta_j\rvert \le t$ can be written more compactly as the $\ell_1$-norm constraint $\lVert\beta\rVert_1 \le t$.
Furthermore, (6) is usually represented in matrix-vector notation. Letting $\mathbf{y} = (y_1, \ldots, y_N)^{\prime}$ be the $N$-vector of responses, $X$ the $N \times p$ matrix with $x_i \in \mathbb{R}^p$ in its $i$th row, and $\beta$ the vector of the $\beta_j$, the optimization problem (6) can be re-expressed as follows:

$$\min_{\beta_0,\,\beta}\ \frac{1}{2N}\left\lVert \mathbf{y} - \beta_0 \mathbf{1} - X\beta \right\rVert_2^2 \quad \text{subject to} \quad \lVert\beta\rVert_1 \le t, \tag{7}$$

where $\mathbf{1}$ is the vector of $N$ ones and $\lVert\cdot\rVert_2$ is the usual Euclidean norm on vectors. The bound $t$ is a type of "budget": it limits the sum of the absolute values of the parameter estimates. Because a shrunken parameter estimate corresponds to a more heavily constrained model, this budget limits how well we can fit the data [11,12].
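In R, the LASSO is available through the glmnet package, which solves the penalized (Lagrangian) form equivalent to (7) over a grid of penalty values; the sketch below chooses the penalty by cross-validation. The simulated data are illustrative.

# LASSO sketch with 'glmnet' (alpha = 1 gives the l1 penalty); the penalty
# level is chosen by 10-fold cross-validation rather than fixing a budget t.
library(glmnet)

set.seed(123)
X <- matrix(rnorm(200 * 10), ncol = 10)
y <- 2 * X[, 1] - X[, 2] + rnorm(200)

cvfit <- cv.glmnet(X, y, alpha = 1)
coef(cvfit, s = "lambda.min")      # nonzero rows = selected variables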

Random Forests
Similar to bagging, a large number of classification or regression trees are grown on bootstrap samples from the training data. However, as each tree is grown, a random sample of predictors is taken before each node is split. For example, if there are 20 predictors, a random five are selected as candidates for defining the split; the best split is then constructed as usual but chosen only from those five. This process is repeated for each prospective split, without pruning. Thus, each tree is produced from a random sample of cases and a random sample of predictors at each split. The mean or proportion in each tree's terminal nodes is determined as in bagging. Finally, for each case, the predictions are averaged over trees, as in bagging, but only over the trees for which that case is out-of-bag (OOB). Breiman [13] called such a procedure a "random forest". This method was used by Liu et al. [14] to identify spatial poverty determinants in rural China, and Ludwig et al. [15] used big data analytics for feature selection to forecast electricity prices using the LASSO and RF.
The RF algorithm is similar to the bagging algorithm. Assume $N$ is the number of observations in the training data and that the response variable is binary. The RF algorithm proceeds through the following steps [16] (an R sketch follows the steps):

Algorithm: Random Forests Method
Step 1. Taking a random sample of size N with replacement from the data.
Step 2. Taking a random sample without replacement of the predictors.
Step 3. Constructing the first recursive partition of the data as usual.
Step 4. Repeating Step 2 for each subsequent split until the tree is as large as desired; this usually leads to one observation in each terminal node. No pruning is done. Computing each terminal node's proportion as usual.
Step 5. Dropping the OOB data down the tree. Storing the class assigned to each observation along with each observation's predictor values.

Step 6. Repeating Steps 1-5 a large number of times to grow a forest of trees.
Step 7. Using only the class assigned to each observation when the observation is OOB, counting the number of times over trees that the observation is classified into one category and the number of times it is classified into the other category.
Step 8. Assigning each case to a category by a majority vote over the set of trees when the case is OOB. Thus, if a given case is classified as 1 in 51% of the trees for which it is OOB, 1 becomes its estimated classification [16].
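The following is a minimal R sketch of this procedure using the randomForest package. The simulated binary-response data are illustrative, as are ntree = 500 and mtry = 5 (five candidate predictors per split out of 20, echoing the example above).

# Minimal RF sketch; ntree and mtry are illustrative choices.
library(randomForest)

set.seed(123)
X <- data.frame(matrix(rnorm(300 * 20), ncol = 20))
y <- factor(rbinom(300, 1, plogis(X[, 1] - X[, 2])))   # binary response

rf <- randomForest(x = X, y = y, ntree = 500, mtry = 5)
rf$err.rate[500, "OOB"]   # OOB error from the majority vote (Step 8)
head(rf$votes)            # per-case OOB vote proportions (Step 7)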

PROPOSED METHODS
In this section, we introduce combinations of RF, LASSO, and NN to obtain the two proposed methods (RFNN and LASSONN). We aim to obtain a better goodness of fit compared with the classical methods (OLS and PCANN).

LASSONN
In this method, we combine NN and LASSO to obtain a new estimator with a better goodness of fit.
This algorithm can be simplified into the following steps (an R sketch follows the algorithm):

Algorithm 1: LASSONN
Step 1. Starting with the LASSO model.
Step 2. Choosing the selected variables from the LASSO model.
Step 3. Entering the selected variables into the NN.
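A minimal R sketch of Algorithm 1, assuming glmnet for Step 1 and nnet for Step 3; the simulated data and tuning values (penalty chosen by cross-validation, four hidden units) are illustrative assumptions.

# Sketch of Algorithm 1 (LASSONN): LASSO selection, then an FFNN on the
# selected variables only. Data and tuning values are illustrative.
library(glmnet)
library(nnet)

set.seed(123)
X <- matrix(rnorm(200 * 15), ncol = 15)
y <- 2 * X[, 1] - 1.5 * X[, 3] + rnorm(200)

# Step 1: fit the LASSO model
cvfit <- cv.glmnet(X, y, alpha = 1)

# Step 2: keep the variables with nonzero coefficients
b        <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]   # drop intercept
selected <- which(b != 0)

# Step 3: enter the selected variables into the NN
fit <- nnet(X[, selected, drop = FALSE], y, size = 4,
            linout = TRUE, maxit = 500, trace = FALSE)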

RFNN
In this method, we combine NN and RF to obtain a new estimator with a better goodness of fit. For each tree $t$ of the forest, consider its OOB sample and calculate $\mathrm{errOOB}_t$, the error of tree $t$ on this sample. Then, we randomly permute the values of a variable $X^j$ in the OOB sample to obtain a perturbed sample and calculate $\widetilde{\mathrm{errOOB}}{}_t^{\,j}$, the error of tree $t$ on the perturbed sample. The variable importance of $X^j$ is expressed as follows:

$$VI(X^j) = \frac{1}{ntree}\sum_{t}\left(\widetilde{\mathrm{errOOB}}{}_t^{\,j} - \mathrm{errOOB}_t\right), \tag{8}$$

where the sum is over all trees $t$ of the RF and $ntree$ denotes the number of RF trees. Notably, we use this definition of importance and not the normalized one. Instead of considering the raw VI values as independent replicates, normalizing them, and assuming the normality of these scores, we prefer a fully data-driven solution. This is a key point of our strategy: we prefer directly estimating the variability of importance across repetitions of forests instead of relying on normality when $ntree$ is sufficiently large, which is valid only under specific conditions. We propose the following two-step procedure; the first step is common to both objectives, whereas the second depends on the objective.
Step 1. Preliminary elimination and ranking: Computing the RF scores of importance, cancelling the variables of small importance, and arranging the remaining variables in decreasing order of importance.
Step 2. Variable selection: For interpretation: constructing the nested collection of RF models involving the first k variables, for k = 1 to m, and selecting the variables involved in the model leading to the smallest OOB error. For prediction: starting from the ordered variables retained for interpretation, constructing an ascending sequence of RF models by invoking and testing the variables stepwise; the variables of the last model are selected. This is a sketch of the procedure, and more details are required for its effectiveness [17]. Then, we can use the variables selected by RF as input variables in the NN. This algorithm can be simplified into the following steps (an R sketch follows the algorithm):

Algorithm 2: RFNN
Step 1. Starting with the RF method.
Step 2. Choosing the selected variables using the RF method.
Step 3. Entering the selected variables into the NN.
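A minimal R sketch of Algorithm 2, using the raw (unnormalized, scale = FALSE) permutation importance of randomForest for Steps 1-2, matching the raw VI definition in (8), and nnet for Step 3. Keeping the top five variables is an illustrative cutoff; the full two-step procedure of [17] is implemented in that paper's authors' VSURF package.

# Sketch of Algorithm 2 (RFNN): raw permutation importance from RF, then an
# FFNN on the retained variables. The top-five cutoff is an assumption.
library(randomForest)
library(nnet)

set.seed(123)
X <- data.frame(matrix(rnorm(200 * 15), ncol = 15))
y <- 2 * X[, 1] - 1.5 * X[, 4] + rnorm(200)

# Steps 1-2: RF importance scores; scale = FALSE gives the raw VI of (8)
rf <- randomForest(x = X, y = y, ntree = 500, importance = TRUE)
vi <- importance(rf, type = 1, scale = FALSE)[, 1]
selected <- order(vi, decreasing = TRUE)[1:5]

# Step 3: enter the selected variables into the NN
fit <- nnet(as.matrix(X[, selected]), y, size = 4,
            linout = TRUE, maxit = 500, trace = FALSE)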

MONTE CARLO SIMULATION STUDY
In this section, we conduct a comparative study between the classical estimator (PCANN) and the two proposed estimators (LASSONN and RFNN) via Monte Carlo simulation. In our simulation study, we use different simulation factors (see Table 1) to investigate the performance of these estimators in different situations. R software version 4 was used to perform the simulation study; see [18,19].
We generate the independent variables, as in [18,20,21], from a multivariate normal distribution $N(0, \Sigma)$, where $\mathrm{diag}(\Sigma) = 1$ and the off-diagonal elements of $\Sigma$ equal $\rho$, with $\rho = 0.90$ and $0.95$ [22,23]; here $\rho$ is the correlation coefficient between the independent variables. We also generate the errors from a standard normal distribution with two outlier rates (OR) of 10% and 15% [19,24,25]. We use different sample sizes, N = 75, 150, and 300, and different numbers of independent variables, K = 10, 20, 30, 40, 60, and 70. The true regression parameters are set to 0.5 and 0.001 [26]. Then, we construct the LASSO, RF, PCANN, and proposed estimators for comparison. Fig. 3 shows the simulation design flowchart.
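The sketch below illustrates, under our own assumptions, how one replicate of such a design can be generated in R: MASS::mvrnorm draws the correlated predictors, and the outliers are created here by inflating a randomly chosen OR fraction of the errors (the paper's exact contamination scheme may differ).

# One simulation replicate: correlated normal predictors with off-diagonal
# correlation rho, standard-normal errors, and a fixed outlier rate OR.
# The inflated-error outlier mechanism is our illustrative assumption.
library(MASS)

set.seed(123)
N <- 75; K <- 10; rho <- 0.90; OR <- 0.10
beta <- rep(0.5, K)                      # true regression parameters

Sigma <- matrix(rho, K, K); diag(Sigma) <- 1
X <- mvrnorm(N, mu = rep(0, K), Sigma = Sigma)

e <- rnorm(N)
out <- sample(N, size = round(OR * N))   # contaminated cases
e[out] <- 10 * e[out]                    # inflate their errors

y <- X %*% beta + e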

APPLICATION: AIR QUALITY
In this section, we present the application to the air quality dataset and compare the variable selection methods (FFNN and the proposed methods, LASSONN and RFNN). The response variable and the independent variables of the dataset are described in [27] (see Table 14).
The aim of our analysis is to identify the independent variables that are most relevant for predicting the response variable. Thus, we use model selection (variable selection) methods.

However, we first analyze the dataset for a better understanding of the data. As presented in Table 15, some correlations between variables are stronger than others. The correlation coefficients in Table 15 indicate strong relationships (greater than 0.8) between some of the independent variables. We then obtained variance inflation factors (VIF) to confirm the existence of multicollinearity; the VIF values of some variables exceed 10, which means that a multicollinearity problem exists between the independent variables.
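A sketch of these diagnostics in R follows; the data frame airq and its variables are hypothetical stand-ins for the air quality variables of Table 14, with collinearity induced deliberately so that the checks fire.

# Diagnostic sketch: correlation matrix and VIFs. 'airq' and its columns
# are placeholders, not the paper's actual air quality variables.
library(car)

set.seed(123)
airq <- data.frame(matrix(rnorm(100 * 5), ncol = 5))
names(airq) <- c("y", paste0("x", 1:4))
airq$x4 <- airq$x1 + rnorm(100, sd = 0.1)   # induce strong collinearity

round(cor(airq[, -1]), 2)                   # pairwise correlations (> 0.8?)
vif(lm(y ~ ., data = airq))                 # VIF > 10 flags multicollinearity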

CONCLUSIONS
We investigated the efficiency of model selection (variable selection) methods under high multicollinearity and outlier effects. The proposed methods are LASSONN and RFNN. We applied the classical variable selection methods (OLS and PCANN) and the proposed methods (LASSONN and RFNN) to a real dataset (air quality). We computed the correlation matrix and found high correlations between the independent variables, indicating multicollinearity, and then computed the variance inflation factors.
To confirm the multicollinearity, we noted that some variables have VIF values greater than 10. Finally, we applied all methods to the dataset; as in the simulation, the proposed methods outperformed the classical methods (OLS and PCANN). In future work, we can study other variable selection methods for handling multicollinearity and outlier problems in different regression models. We can also study other estimation methods for handling multicollinearity and outliers together without selection, such as [28,29], and extend such estimation to the case of high-dimensional data (where the number of independent variables is greater than the sample size).