Optimized Multivariate Adaptive Regression Splines for Predicting Crude Oil Demand in Saudi Arabia

This paper presents optimized linear regression with multivariate adaptive regression splines (LR-MARS) for predicting crude oil demand in Saudi Arabia based on the social spider optimization (SSO) algorithm. The SSO algorithm is applied to optimize LR-MARS performance by fine-tuning its hyperparameters. The proposed prediction model was trained and tested using historical oil data gathered from different sources. The results suggest that the demand for crude oil in Saudi Arabia will continue to increase during the forecast period (1980–2015). A number of prediction accuracy metrics, including Mean Absolute Error (MAE), Median Absolute Error (MedAE), Mean Square Error (MSE), Root Mean Square Error (RMSE), and coefficient of determination (R²), were used to examine and verify the predictive performance of the various models. Analysis of variance (ANOVA) was also applied to predict the crude oil demand in Saudi Arabia and to compare the actual test data with the results predicted by the different models. The experimental results show that the optimized LR-MARS model performs better than the other models in predicting crude oil demand.


Introduction
The development of prediction techniques and machine learning models is a critical task for crude oil demand [1]. Prediction techniques can estimate different oil-related quantities [2], including oil price, oil demand, and oil viscosity. Prediction models and techniques offer many advantages in the energy sector, such as support for energy planning, strategy formulation, and energy advancement. The design of prediction models and techniques is a complex task that has a large impact on the economic trajectories of countries, energy companies, and other industrial sectors [3]. According to the International Energy Agency (IEA), the global demand for crude oil accounted for about 41% of the total fuel share in 2016. According to the Organization of the Petroleum Exporting Countries (OPEC), Saudi Arabia is one of the world's largest oil consumers, ranking fifth after Russia with a 3.4% share of global oil consumption in 2016.
There are numerous models that support crude oil demand prediction, including the autoregressive conditional heteroscedasticity (ARCH) model [4], other time series models, artificial neural networks [5], and fuzzy theory predictions [6,7].
Machine learning models play an important role in evaluation and prediction tasks. The features included in the dataset can be used to perform predictions. Machine learning models can also perform future predictions based on the data available in the dataset [8]. Regression analysis is a statistical process used to assess the relationship between variables. In the field of machine learning, regression analysis models are widely used for predictions. The idea of regression analysis is to show how the dependent variable (predicted variable) changes when one of the independent variables changes and the other independent variables remain constant [9]. When the independent variables are held fixed, regression analysis is used to obtain the average value of the dependent variable.
There are three main uses of regression analysis: (1) determining the strength of the predictors, (2) predicting an effect, and (3) trend prediction [10]. Many techniques have been presented in the field of regression analysis, and they can be divided into parametric and nonparametric methods. In a parametric method, a fixed set of parameters contains all of the information required to predict the value of future data from the model. For example, in linear regression with a single variable, two parameters (intercept and coefficient) must be known in order to predict a new value. In a nonparametric method, the number of parameters is not fixed in advance (the parameter space is effectively infinite-dimensional), so the model can draw on more information from the data and can predict new values more flexibly than a parametric model. The purpose of this paper is to propose the LR-MARS model for predicting the demand for crude oil in Saudi Arabia. To improve the accuracy of the MARS model, social spider optimization is applied to tune the hyperparameters of the MARS model.

Related Work
This section outlines relevant studies on artificial intelligence (AI) models for predicting the demand for crude oil. In [11], the authors proposed a wavelet method to predict the oil price in the long term. The proposed model can forecast the Brent oil price one year ahead. Several time series prediction approaches, such as ARIMA, GARCH, and Holt-Winters, were compared with the model in [11]. The results showed that the model in [11] provides better predictions than the other models. China's crude oil demand was predicted using soft and hard computing [12]. In [13], three estimation models for the price of petroleum, called the theories model, the simulation model, and the informal model, were used. The informal estimation model performed better than the other two models. The authors in [14] made use of eight artificial neural networks (ANN) and fuzzy regression (FR) for oil price prediction. Analysis of variance (ANOVA) and Duncan's multiple range test (DMRT) were then used to test the forecasts produced by ANN and FR. The mean absolute percentage error (MAPE) was calculated for the ANN models, and the results showed that the ANN models outperform the FR models. For verification and validation purposes, the authors applied the Spearman correlation test. The authors in [15] studied the factors that affect the demand for oil in thirty developed countries using a cointegration functions model. The variables used in the study were energy prices and national income. The results showed a strong relationship between income and the demand for energy and oil. In [16], nine distinct oil models were studied and compared. Oil price, gross domestic product (GDP), and a time trend for improvement were considered among the most influential factors in the models. A method that estimates coefficients was used to compare the econometric response of these models [16]. Another study focused on the global crude oil and natural gas markets in the period 1918-1999 [17]. This study predicted price and income elasticities for crude oil, demand models, and natural gas supply [17].
Panel quantification analysis techniques were used to estimate long-term income and price elasticities of crude oil demand in the Middle East [18]. The data employed in the study covered the period 1971-2002. The results showed high price inelasticity and slight income elasticity [18]. A prediction model for crude oil demand based on cointegration and a vector error correction (VEC) model was introduced in [19]. Four main factors that affect crude oil demand were considered: GDP, population growth, oil price, and the share of the industrial sector in GDP. Both the error correction model (ECM) and the Johansen cointegration test were applied to estimate the elasticities.
In [20], the International Energy Agency (IEA) proposed scenarios for China's future oil demand in the 2006 World Energy Outlook. The study concluded that the statistical minimum (lower bound) of annual oil consumption in developed countries is 11 barrels per capita [21]. Another study in [21] developed crude oil demand models that combine variance analysis and a flexible fuzzy regression model. The results demonstrated the superiority of fuzzy regression over the conventional model. The data used covered the period 1990-2005 for four countries: Japan, Canada, Australia, and the United States [21]. In [22], the authors used data covering the period 1981-2005. The input variables included population, GDP, oil imports, and oil exports. The study demonstrated the benefits of particle swarm optimization (PSO) over the genetic algorithm (GA) in estimating and predicting Iran's crude oil demand. In the domain of energy consumption prediction, another study [23] compared the performance of conventional econometric models and artificial intelligence-based models. The results indicated that AI-based models are robust and scalable for prediction. The results also showed that, at the national level, conventional models are preferred for predicting yearly energy consumption. Moreover, nonlinear regression models obtained the lowest average MAPE (1.79%) for long-term prediction.
SSO has been successfully used to solve continuous optimization problems [24]. In [24], the researchers adopted SSO and support vector regression as a short-term electric load forecasting model. The results showed that SSO helps to achieve good results [24]. Another study in [25] used the SSO algorithm to search for optimal cluster centers in the fuzzy c-means clustering algorithm. The results showed that SSO improved the performance of fuzzy c-means clustering more than the other optimization algorithms considered [25]. Another study in [26] used the SSO algorithm to solve discrete optimization problems, applying it to the traveling salesman problem (TSP) [26]. SSO was compared with eighteen algorithms, and the experimental results revealed that SSO was effective in solving discrete problems for both low- and middle-scale TSP datasets [26].

Linear Regression Model.
On real-world data, the linear regression model often works well. There are numerous advantages to using linear regression; for example, training a linear regression model is faster than training many other predictive models [27]. Linear regression is used to compute the strength of the relationship between the dependent variable and the independent variables, as well as to determine which independent variables have no relationship with the dependent variable and which independent variables contain redundant information about the dependent variable. Furthermore, linear regression models are simple to implement and use a small amount of memory [28]. If there is only one independent variable in a linear regression model, the regression function is a straight line; if there are two independent variables, the regression function is a plane; and if there are n independent variables, the regression function is an n-dimensional hyperplane [10]. If the model fits the data well, the predicted values will be close to the actual values. Any difference between the actual and predicted values is referred to as a cost, loss, or error. The regression function y, which depends on n independent (predictor) variables $x_1, x_2, \ldots, x_n$, is given by equation (1) as a weighted sum of the predictors plus a bias term. Equation (1) represents how the value of y varies with the independent variables $x_1, x_2, \ldots, x_n$, where $w_0, w_1, \ldots, w_n$ are known as feature weights (model coefficients) and b is called a constant bias term (intercept).
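As a small illustration of the single-variable case described above, the following Python sketch fits a straight line by ordinary least squares. It is a minimal sketch rather than the implementation used in this paper: the function name `fit_simple_linear_regression` and the toy data are hypothetical, and only one predictor is handled.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y ≈ w * x + b for a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Unnormalized covariance of x and y, and unnormalized variance of x
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    w = s_xy / s_xx          # feature weight (slope)
    b = y_bar - w * x_bar    # bias term (intercept)
    return w, b


# Hypothetical toy data lying exactly on y = 2x + 1 (not the paper's data)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_simple_linear_regression(xs, ys)
prediction = w * 4.0 + b  # predicted value for a new input x = 4
```

With more than one predictor, the same least-squares criterion leads to a linear system in the weights, which is usually solved with a numerical library.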

Ridge Regression Model.
Ridge regression is a multiple regression model used for data analysis when the independent variables are highly correlated. The ridge regression model is used to avoid overfitting and to reduce the complexity of the model. New values predicted by the ridge regression model are more reliable when the predictor variables are correlated [10]. The ridge regression model learns the two parameters w and b using the same least-squares criterion as linear regression, with an added penalty term that restricts the variation of the parameter w. This penalty is known as regularization: it constrains the model and reduces overfitting, and by controlling the regression coefficients it reduces the sampling error and minimizes the variance [29]. Ridge regression uses L2 regularization, which adds the sum of the squared coefficients to the residual sum of squares (RSS) [29]. The RSS for ridge regression over N training samples can be expressed as

$$\mathrm{RSS}_{\mathrm{ridge}}(w, b) = \sum_{i=1}^{N} \left( y_i - (w \cdot x_i + b) \right)^2 + \alpha \sum_{j=1}^{n} w_j^2$$

where α is the penalty term. A high value of α means stronger regularization and a simpler model. The penalty term α shrinks the parameters when their values are high, so ridge regression keeps the parameters small to make the model simpler and reduce its complexity.
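The shrinking effect of the penalty term can be seen in a one-predictor sketch. The Python code below is illustrative only: it assumes a single predictor and an unpenalized intercept, in which case the penalized least-squares problem has the closed-form solution used here. The name `fit_ridge_1d` and the toy data are hypothetical.

```python
from statistics import mean


def fit_ridge_1d(xs, ys, alpha):
    """Minimize sum((y - (w*x + b))**2) + alpha * w**2 for one predictor.

    Assumption of this sketch: the intercept b is not penalized.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    w = s_xy / (s_xx + alpha)  # the penalty inflates the denominator, shrinking w
    b = y_bar - w * x_bar
    return w, b


# Hypothetical toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, b0 = fit_ridge_1d(xs, ys, alpha=0.0)  # no penalty: ordinary least squares
w5, b5 = fit_ridge_1d(xs, ys, alpha=5.0)  # with penalty: smaller slope
```

Setting alpha to zero recovers the ordinary least-squares line, while larger values pull the slope toward zero.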

Multivariate Adaptive Regression Splines Model.
The MARS model is a nonlinear and nonparametric regression approach that uses piecewise linear splines to simulate the nonlinear relationship between the dependent and independent variables [30]. The MARS model is built as a linear combination of basis functions (BFs), as shown in the following equation:

$$\hat{y} = \sum_{i=1}^{m} \beta_i \, BF_i(x)$$

where $\beta_i$, $i = 1, 2, \ldots, m$, are unknown coefficients that can be estimated using the least squares method and m is the number of terms found in the final model using a forward-backward stepwise process. $BF_i$ is the i-th basis function, defined from piecewise linear basis functions and based on a knot $C_i$. $BF_i$ is taken from the following set of functions:

$$BF_i(x) \in \left\{ |x - C_i|_+ ,\; |C_i - x|_+ \right\}$$

where $|x - C_i|_+$ and $|C_i - x|_+$ are given by

$$|x - C_i|_+ = \max(0,\, x - C_i), \qquad |C_i - x|_+ = \max(0,\, C_i - x)$$

Finally, the predicted model is built with the m BFs that provide the lowest generalized cross validation (GCV) value, which is computed from the sum of squared errors (SSE) and a smoothing parameter v.
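To make the role of the knot concrete, the following Python sketch evaluates a MARS-style prediction as a weighted sum of hinge basis functions. It only evaluates a model with given coefficients; the forward-backward term selection and the GCV criterion are not implemented. The helper names, the knot value, the coefficients, and the constant basis function used as an intercept are all assumptions made for illustration.

```python
def hinge_pos(x, knot):
    """|x - C|_+ : zero to the left of the knot, linear to the right."""
    return max(0.0, x - knot)


def hinge_neg(x, knot):
    """|C - x|_+ : linear to the left of the knot, zero to the right."""
    return max(0.0, knot - x)


def mars_predict(x, terms):
    """Evaluate a linear combination of basis functions at x.

    `terms` is a list of (beta, basis_function) pairs.
    """
    return sum(beta * bf(x) for beta, bf in terms)


KNOT = 3.0  # hypothetical knot location
terms = [
    (1.0, lambda x: 1.0),                  # constant basis function (intercept)
    (2.0, lambda x: hinge_pos(x, KNOT)),   # slope 2 to the right of the knot
    (-0.5, lambda x: hinge_neg(x, KNOT)),  # slope 0.5 to the left of the knot
]

left = mars_predict(1.0, terms)
at_knot = mars_predict(3.0, terms)
right = mars_predict(5.0, terms)
```

The resulting function is continuous and piecewise linear, with its slope changing at the knot, which is how MARS captures nonlinear relationships.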

Analysis of Variance (ANOVA).
ANOVA is a statistical analysis technique that was developed by R.A. Fisher in the 1920s. ANOVA can be used for many purposes, such as comparing group means. Two hypotheses are applied to determine the output of the comparison, namely, the null hypothesis and the alternative hypothesis. The technique is called analysis of variance because it compares two variance estimates, namely, the variation within groups and the variation between groups. In this paper we perform a one-way ANOVA. The purpose of a one-way between-groups ANOVA is to show whether there are any differences among the means of two or more groups/models. When at least two of the groups/models have means that are significantly different from each other, the ANOVA test is significant. However, it does not reveal which of the groups/models are different.
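The comparison of between-group and within-group variation can be sketched in a few lines of Python. The code below computes the one-way ANOVA F statistic from first principles; the function name `one_way_anova` and the three toy groups are hypothetical and are not taken from the experiments in this paper.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within, df_between, df_within


# Hypothetical toy groups; the third group has a clearly higher mean
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova(groups)
```

The null hypothesis of equal means is rejected when the F statistic exceeds the critical value of the F distribution for the two degrees of freedom at the chosen significance level.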

Social Spider Optimization Algorithm.
The social spider optimization (SSO) algorithm is a swarm intelligence-based metaheuristic algorithm [31]. SSO is chosen in this study because it is a recent heuristic algorithm that solves difficult optimization problems. It searches for the global optimum by simulating the behavior of social spiders. Spiders identify the position of prey via the vibrations that occur on the spider web. Any unusual vibration is a sign for the social spider to search for food and move toward the source of the vibration. The search area of SSO uses a chain-like social spider structure. The direction of the food is determined from the signals that insects generate as vibrations on the spider web. Equations (8) and (9) define the SSO operation. The vibration intensity [32] at position x is calculated by the following equation:

$$I(x) = \log\left( \frac{1}{F(x) - C} + 1 \right) \tag{8}$$

where F(x) denotes the cost function and C denotes a constant number. The attenuation of the vibration is given by the following equation:

$$I(x_1, x_2) = I(x_1) \, \exp\left( -\frac{D(x_1, x_2)}{\sigma \, r_a} \right) \tag{9}$$

where $D(x_1, x_2)$ indicates the distance between $x_1$ and $x_2$. The standard deviation of all members along one searched dimension is indicated by σ. The free parameter is $r_a$.
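The two quantities described above can be illustrated with a short Python sketch. It assumes the logarithmic vibration intensity and the exponential attenuation with distance that are commonly used in the social spider algorithm literature; the function names and the numeric values are hypothetical, and the rest of the search procedure (movement, population update) is not shown.

```python
import math


def vibration_intensity(cost, c):
    """Intensity at a position with cost F(x) = cost; requires cost > c."""
    return math.log(1.0 / (cost - c) + 1.0)


def attenuated_intensity(source_intensity, distance, sigma, r_a):
    """Intensity received at a given distance from the vibration source."""
    return source_intensity * math.exp(-distance / (sigma * r_a))


# Hypothetical values: cost F(x) = 1 with constant C = 0
source = vibration_intensity(cost=1.0, c=0.0)
near = attenuated_intensity(source, distance=1.0, sigma=1.0, r_a=1.0)
far = attenuated_intensity(source, distance=5.0, sigma=1.0, r_a=1.0)
```

Lower cost values give stronger vibrations, and a larger value of the free parameter weakens the attenuation, so distant spiders are influenced more by good positions.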

The Proposed Prediction Model.
This paper combines the LR model and the MARS model, tuned by SSO, to develop an optimized LR-MARS prediction model that predicts crude oil demand. As demonstrated in Figure 1, the proposed LR-MARS model is developed in five main stages: (1) data collection and data preprocessing, (2) determining the training and testing sets, (3) the LR model and the MARS model, (4) using SSO, and (5) performance evaluation.

Data Collection and Preprocessing.
The process of data collection starts with collecting different features for crude oil demand from different sources. The data are tracked and checked for any irregularity or inconsistency. For example, the gross domestic product (GDP) feature is gathered from the following sources: OPEC, IEA, International Monetary Fund (IMF), Saudi Statistics Authority, and World Bank. The data used in this article come from various sources and cover the period 1980-2015 [3]. Features such as year, oil demand, GDP, population, Brent prices, Light-Duty Vehicles (LDV), and Heavy-Duty Vehicles (HDV) are shown in Table 1. Table 1 reports a number of statistical metrics, such as mean, standard error, median, and standard deviation, for the following features of the dataset: oil demand, GDP, population, LDV, and HDV. For instance, the maximum value of the oil demand is 3318.656317, the minimum value is 602, and the standard deviation is 774.0563839.
In statistics, the correlation matrix shows the correlation coefficients between variables. The correlation matrix of the features of the Saudi Arabia oil demand dataset is shown in Table 2. Each cell represents the correlation value between two variables. As can be seen in Table 2, the correlation coefficients of the features are close to 1, which means that there is a strong positive correlation between every pair of features in the dataset.
Data preprocessing is an essential step in machine learning [33]. The quality of the data can directly affect the ability of the models to learn; thus, it is critical to preprocess the data before using them as inputs to the proposed model. In this paper, preprocessing is done using normalization. If the data contain input values with varying scales, normalization can be used to bring these values to a common scale. Normalization scales each input feature separately by subtracting the mean (centering) and dividing by the standard deviation, so that the distribution has a mean of zero and a standard deviation of one [33]. Normalization is calculated using the following equation:

$$z = \frac{x - \mu}{\sigma}$$

where x is the input value, μ is the mean value, and σ is the standard deviation value. For a feature with N values $x_1, \ldots, x_N$, the mean value (μ) is calculated using the following equation:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

The standard deviation (σ) is calculated using the following equation:

$$\sigma = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 }$$
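The normalization step can be sketched in Python as follows. This is an illustrative example under the assumption that the population standard deviation (dividing by N) is used; the function name `standardize` and the toy values are hypothetical and are not taken from the dataset.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center a feature on its mean and scale it by its standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (divides by N)
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values with mean 5 and standard deviation 2
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(raw)
```

In practice, the mean and standard deviation would be computed on the training data only and then reused to transform the validation and test data, so that no information leaks from the test set.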

Training and Testing Sets.
The crude oil demand dataset is split into train data (90%) and test data (10%). Following that, the train data are split further into a training set (50% of the train data) and a validation set (50% of the train data).
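A minimal Python sketch of this two-level split is given below. The text does not state whether the records are shuffled or split chronologically, so the shuffling with a fixed seed used here is an assumption, as are the function name `split_dataset` and its parameters. The 36 yearly records correspond to the 1980-2015 span of the dataset.

```python
import random


def split_dataset(records, train_frac=0.9, val_frac=0.5, seed=0):
    """Split records into (training, validation, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # assumption: random split, fixed seed
    n_train = round(train_frac * len(shuffled))
    train, test = shuffled[:n_train], shuffled[n_train:]
    n_val = round(val_frac * len(train))
    validation, training = train[:n_val], train[n_val:]
    return training, validation, test


years = list(range(1980, 2016))  # 36 yearly records, one per year
training, validation, test = split_dataset(years)
```

With 36 records this gives 16 training records, 16 validation records, and 4 test records.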

LR Model and MARS Model.
The training set (50% of the train data) is used to fit the LR model, and the validation set (50% of the train data) is passed through the fitted LR model to obtain predictions. The LR model thus provides two sets of predictions: the validation prediction set and the test prediction set. The MARS model is then trained on the validation prediction set to create the LR-MARS model. Finally, the LR-MARS model makes the final predictions on the test prediction set, and this output is compared with the actual test data.
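The data flow of the two-stage procedure described above can be sketched in Python as follows. This sketch only illustrates how the predictions pass from the first stage to the second. As a loudly stated assumption, a simple one-variable line fit stands in for both the LR model and the MARS model, because no MARS implementation is included here; all function names and the toy data are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y ≈ w*x + b by least squares and return the model as a callable."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    w = s_xy / s_xx
    b = y_bar - w * x_bar
    return lambda x: w * x + b


def two_stage_predict(train, validation, test_x, fit_stage1, fit_stage2):
    """Stage 1 is fitted on the training set; stage 2 is fitted on the
    stage-1 predictions for the validation set and applied to the
    stage-1 predictions for the test set."""
    stage1 = fit_stage1(*train)
    val_pred = [stage1(x) for x in validation[0]]   # validation prediction set
    test_pred = [stage1(x) for x in test_x]         # test prediction set
    stage2 = fit_stage2(val_pred, validation[1])
    return [stage2(p) for p in test_pred]


# Hypothetical toy data lying exactly on y = 2x + 1
train = ([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
validation = ([3.0, 4.0, 5.0], [7.0, 9.0, 11.0])
final = two_stage_predict(train, validation, [10.0], fit_line, fit_line)
```

Because the two stages are passed in as functions, a spline-based fitting routine could be substituted for the second stage without changing the data flow.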

Performance Metrics.
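To validate the performance and effectiveness of the proposed prediction models, five error analysis criteria are introduced, as shown in equations (13)-(17), where $y_i^{real}$ is the actual value, $y_i^{pred}$ is the predicted value, $\bar{y}$ is the mean of the actual values, and N is the number of samples [24,34]. For each model, the performance is evaluated using the Mean Absolute Error (MAE), Median Absolute Error (MedAE), Mean Square Error (MSE), Root Mean Square Error (RMSE), and R-squared ($R^2$).

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i^{real} - y_i^{pred} \right| \tag{13}$$

$$\mathrm{MedAE} = \mathrm{median}\left( \left| y_1^{real} - y_1^{pred} \right|, \ldots, \left| y_N^{real} - y_N^{pred} \right| \right) \tag{14}$$

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i^{real} - y_i^{pred} \right)^2 \tag{15}$$

$$\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( y_i^{real} - y_i^{pred} \right)^2 } \tag{16}$$

$$R^2 = 1 - \frac{ \sum_{i=1}^{N} \left( y_i^{real} - y_i^{pred} \right)^2 }{ \sum_{i=1}^{N} \left( y_i^{real} - \bar{y} \right)^2 } \tag{17}$$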
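The five criteria can be computed with a short Python function. The sketch below uses the standard definitions of these metrics; the function name `regression_metrics` and the toy values are hypothetical and do not reproduce any result reported in this paper.

```python
import math
from statistics import mean, median


def regression_metrics(y_real, y_pred):
    """Return MAE, MedAE, MSE, RMSE, and R2 for paired actual/predicted values."""
    abs_err = [abs(r - p) for r, p in zip(y_real, y_pred)]
    sq_err = [(r - p) ** 2 for r, p in zip(y_real, y_pred)]
    y_bar = mean(y_real)
    ss_tot = sum((r - y_bar) ** 2 for r in y_real)  # total sum of squares
    mse = mean(sq_err)
    return {
        "MAE": mean(abs_err),
        "MedAE": median(abs_err),
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "R2": 1.0 - sum(sq_err) / ss_tot,
    }


# Hypothetical actual and predicted values
metrics = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
```

Since RMSE is the square root of MSE and penalizes large errors more heavily than MAE, the RMSE of a model is never smaller than its MAE on the same data.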

Results and Discussion
The models are implemented in a Google Colab notebook, which allows Python code to be written and executed in the browser and is widely used for implementing machine learning algorithms such as regression, classification, and clustering. To evaluate the performance of the optimized LR-MARS model in crude oil demand prediction more effectively, other models commonly used in machine learning are chosen for comparison. SSO is used to tune two hyperparameters: the penalty term and the maximum number of basis functions (BFs). The population of the SSO metaheuristic algorithm consists of 30 members. The results are shown in Table 3. Among all the experimental models in Table 3, the ridge regression model has the largest error; its MAE, MedAE, MSE, RMSE, and R² are 0.055, 0.054, 0.0036, 0.06, and 99.4%, respectively. The MAE, MedAE, MSE, RMSE, and R² of the LR model are 0.042, 0.047, 0.0026, 0.05, and 99.6%, respectively. The error of the LR-MARS model with both hyperparameters (penalty term and maximum number of BFs) optimized by the SSO algorithm is the smallest; its MAE, MedAE, MSE, RMSE, and R² are 0.024, 0.023, 0.0007, 0.02, and 99.9%, respectively, which is considerably lower than the errors of the other two models. It can be seen from Table 3 that the LR-MARS model with both hyperparameters optimized by the SSO algorithm has a high accuracy in predicting crude oil demand and is more effective than the other models. Table 4 compares the LR-MARS model in three cases: case 1, optimizing the two hyperparameters (penalty term and maximum number of BFs) using the SSO algorithm; case 2, optimizing one hyperparameter (penalty term) using the SSO algorithm; and case 3, without optimizing any hyperparameter.
Figures 2-4 show a cross-plot of the actual and predicted crude oil demand using LR-MARS model, LR model and ridge regression model, respectively.

Analysis of Variance (ANOVA).
In this section, we use ANOVA for two purposes. The first purpose is to predict the crude oil demand in Saudi Arabia. The second purpose is to compare the actual test data with the results predicted by the LR-MARS, LR, and ridge regression models. R², also known as the coefficient of determination, measures how close the data are to the fitted regression line. The value R² = 0.898 indicates a good fit for the model, as shown in Figure 5.

ANOVA Predicting Result.
ANOVA is used as a prediction model. Table 5 provides a comparison of the ANOVA prediction model and the proposed optimized LR-MARS model. The results show that the optimized LR-MARS model achieves higher performance than the ANOVA model.
In Table 6, the analysis of the source of variation is carried out in two ways: between groups and within groups. Between-groups analysis determines the source of variance among the LR-MARS, LR, and ridge regression models. Within-groups analysis identifies the experimental error within each group. From the ANOVA results in Table 6, SS = 2.582510317, while the mean square MS = 0.215209193. The null hypothesis is rejected because F = 10.00246025 exceeds F critical = 3.490294819. Moreover, the P value is less than 0.05 (0.00982 < 0.05), which is further evidence of significant differences in the attribute (crude oil demand) between the LR-MARS, LR, and ridge regression models and therefore further grounds for rejecting the null hypothesis.

Conclusion
In this paper, a hybrid model called LR-MARS is developed for predicting the crude oil demand in Saudi Arabia. This paper used historical data from one of the world's largest oil producers (Saudi Arabia) to demonstrate the applicability and effectiveness of the proposed LR-MARS model. The dataset used for LR-MARS consists of seven features: time, oil demand, GDP, population, Brent crude prices, LDV, and HDV. The LR-MARS model is a combination of the linear regression model and the multivariate adaptive regression splines (MARS) model. We also used the SSO algorithm to optimize two hyperparameters of the MARS model, namely, the penalty term and the maximum number of basis functions (BFs). To evaluate the performance of the optimized LR-MARS model, we used MAE, MedAE, MSE, RMSE, and R², whose values for the LR-MARS model are 0.024, 0.023, 0.0007, 0.02, and 99.9%, respectively. We have also compared the optimized LR-MARS model with other machine learning prediction models. The optimized LR-MARS model is more accurate in predicting crude oil demand in Saudi Arabia than the other models. Moreover, we have used ANOVA as a prediction model to predict the crude oil demand in Saudi Arabia and to compare the actual test set with the results predicted by the LR-MARS, LR, and ridge regression models. This paper will be useful for oil demand planning, strategy setting, and future oil investments. The difficulty of obtaining some features and the inconsistent scaling of some data introduce a certain range of error into the data processing and prediction steps. Therefore, other possible influencing features can be considered as input variables. As a direction for future work, the splines can be modelled with more knots, which will increase the flexibility of the model. Moreover, cubic spline and natural cubic spline models can be used to enhance the results.

Data Availability
The data used in this paper were obtained from different sources (OPEC, IEA, International Monetary Fund (IMF), Saudi Statistics Authority, and World Bank) and cover the period 1980 to 2015 [3].

Conflicts of Interest
The authors declare that they have no conflicts of interest.