Abstract

This paper presents an optimized linear regression with multivariate adaptive regression splines (LR-MARS) model for predicting crude oil demand in Saudi Arabia, based on the social spider optimization (SSO) algorithm. The SSO algorithm is applied to improve LR-MARS performance by fine-tuning its hyperparameters. The proposed prediction model was trained and tested using historical oil data gathered from different sources. The results suggest that the demand for crude oil in Saudi Arabia will continue to increase during the forecast period (1980–2015). Several prediction accuracy metrics, including Mean Absolute Error (MAE), Median Absolute Error (MedAE), Mean Square Error (MSE), Root Mean Square Error (RMSE), and the coefficient of determination ($R^2$), were used to examine and verify the predictive performance of the various models. Analysis of variance (ANOVA) was also applied to predict crude oil demand in Saudi Arabia and to compare the actual test data with the predictions of the different models. The experimental results show that the optimized LR-MARS model performs better than the other models in predicting crude oil demand.

1. Introduction

The development of prediction techniques and machine learning models is a critical task for crude oil demand forecasting [1]. Prediction techniques can estimate different oil-related quantities [2], including oil price, oil demand, and oil viscosity. Prediction models offer many advantages in the energy sector, such as support for energy planning, strategy formulation, and energy advancement. The design of prediction models is a complex task that has a large impact on the economic trajectories of countries, energy companies, and other industrial sectors [3]. According to the International Energy Agency (IEA), the global demand for crude oil accounted for about 41% of the total fuel share in 2016. According to the Organization of the Petroleum Exporting Countries (OPEC), Saudi Arabia is one of the world’s largest oil consumers, ranking fifth after Russia with a 3.4% share of global oil consumption in 2016.

There are numerous models that support the crude oil demand prediction, including autoregressive conditional heteroscedasticity (ARCH) model [4], other time series models, artificial neural networks [5], and fuzzy theory predictions [6, 7].

Machine learning models play an important role in evaluation and prediction tasks. The features included in a dataset can be used to perform predictions, and machine learning models can also make future predictions based on the data available in the dataset [8]. Regression analysis is a statistical process used to assess the relationship between variables. In the field of machine learning, regression analysis models are widely used for predictions. The idea of regression analysis is to show how the dependent variable (predicted variable) changes when one of the independent variables changes and the other independent variables remain constant [9]. When the independent variables are held fixed, regression analysis is used to obtain the average value of the dependent variable. Regression analysis serves three main purposes: (1) determining the strength of the predictors, (2) predicting an effect, and (3) trend prediction [10]. The many techniques proposed for regression analysis can be divided into parametric and nonparametric methods. In a parametric method, a fixed set of parameters contains all of the information required to predict the value of future data from the model. For example, in linear regression with a single variable, two parameters (intercept and coefficient) must be known in order to predict a new value. In a nonparametric method, the number of parameters is not fixed in advance and can grow with the data (the parameter space is effectively infinite-dimensional), so the model can capture the characteristics of the data more flexibly than a parametric model.

The purpose of this paper is to propose the LR-MARS model for predicting the demand for crude oil in Saudi Arabia. To improve the accuracy of the MARS model, social spider optimization is applied to tune its hyperparameters.

2. Related Work

This section outlines relevant studies on Artificial Intelligence (AI) models for predicting the demand for crude oil. In [11], the authors proposed a wavelet method to predict the oil price in the long term. The proposed model can forecast the Brent oil price one year ahead. Several time series prediction approaches, such as ARIMA, GARCH, and Holt-Winters, were compared with the model in [11]. The results showed that the model in [11] provides better predictions than the other models. China’s crude oil demand was predicted using soft and hard computing [12]. In [13], three estimation models for the price of petroleum, called the theories model, the simulation model, and the informal model, were used. The informal estimation model performed better than the other two models. The authors in [14] used eight artificial neural networks (ANN) and fuzzy regression (FR) for oil price prediction. The analysis of variance (ANOVA) and Duncan’s multiple range test (DMRT) were then used to test the forecasts produced by ANN and FR. The mean absolute percentage error (MAPE) was calculated for the ANN models, and the results showed that the ANN models outperform the FR models. For verification and validation purposes, the authors applied the Spearman correlation test. The authors in [15] studied the factors that affect the demand for oil in thirty developed countries using a cointegration functions model. The variables used in the study were energy prices and national income. The results showed a strong relationship between income and the demand for energy and oil. In [16], nine distinct oil models were studied and compared. Oil price, gross domestic product (GDP), and a time trend for improvement were considered among the most influential factors in the models. A method that estimates coefficients was used to compare the econometric response of these models [16]. Another study focused on the global crude oil and natural gas markets in the period 1918–1999 [17]. This study predicted price and income elasticities for crude oil, demand models, and natural gas supply [17].

Panel quantification analysis techniques were used to estimate long-term income and price elasticities of crude oil demand in the Middle East [18]. The data employed in the study covered the period 1971–2002. The results showed high price inelasticity and slight income elasticity [18]. A prediction model for crude oil demand based on cointegration and a vector error correction model (VEC) was introduced in [19]. Four main factors that affect crude oil demand were considered: GDP, population growth, oil price, and the share of the industrial sector in GDP. Both the error correction model (ECM) and the Johansen cointegration test were applied to estimate the elasticities.

In [20], the International Energy Agency (IEA) proposed scenarios for China's future oil demand in the 2006 World Energy Outlook. The study concluded that the statistical minimum (lower bound) of annual oil consumption in developed countries is 11 barrels per capita [21]. Another study [21] developed crude oil demand models that combine variance analysis and a flexible fuzzy regression model. The results demonstrated the superiority of fuzzy regression over the conventional model. The data used covered the period 1990–2005 for Japan, Canada, Australia, and the United States [21]. In [22], the authors used data covering the period 1981–2005. The input variables included population, GDP, oil imports, and oil exports. The study demonstrated the benefits of particle swarm optimization (PSO) over the genetic algorithm (GA) in estimating and predicting Iran's crude oil demand. In the domain of energy consumption prediction, another study [23] compared the performance of conventional econometric models and artificial intelligence-based models. The results indicated that AI-based models are robust and scalable for prediction. The results also showed that, at the national level, conventional models are preferred for predicting yearly energy consumption. Moreover, nonlinear regression models obtained the lowest average MAPE (1.79%) for long-term prediction.

SSO has been successfully used to solve continuous optimization problems [24]. In [24], the researchers combined SSO and support vector regression into a short-term electric load forecasting model. The results showed that SSO helps to achieve good results [24]. Another study [25] used the SSO algorithm to search for optimal cluster centers in the fuzzy c-means clustering algorithm. The results showed that SSO improved the performance of fuzzy c-means clustering more than other optimization algorithms did [25]. Another study [26] used the SSO algorithm to solve discrete optimization problems, specifically the traveling salesman problem (TSP) [26]. SSO was compared with eighteen algorithms, and the experimental results revealed that SSO performed well on both low- and middle-scale TSP datasets [26].

3. Materials and Methods

3.1. Linear Regression Model

Linear regression often performs well on real-world data. It has several advantages; for example, a linear regression model trains faster than many other predictive models [27]. Linear regression is used to compute the strength of the relationship between the dependent variable and the independent variables, as well as to determine which independent variables have no relationship with the dependent variable and which contain redundant information about it. Furthermore, linear regression models are simple to implement and use a small amount of memory [28]. If there is only one independent variable, the regression function is a straight line; if there are two independent variables, it is a plane; and if there are $n$ independent variables, it is a hyperplane of $n$ dimensions [10]. If the model fits the data well, the predicted values will be close to the actual values; any difference between the actual and predicted values is referred to as a cost, loss, or error. The regression function $\hat{y}$ of the independent (predictor) variables $x_1, x_2, \ldots, x_n$ is calculated using the following equation:

$$\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b \tag{1}$$

Equation (1) represents how the value of $\hat{y}$ varies with the independent variables $x_1, x_2, \ldots, x_n$, where $w_1, w_2, \ldots, w_n$ are known as feature weights (model coefficients) and $b$ is called a constant bias term (intercept).
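As a minimal sketch of equation (1), the following Python snippet evaluates a weighted sum of features plus an intercept and fits the single-predictor case by ordinary least squares. The function names `fit_simple_ols` and `predict`, the symbols `w` and `b`, and the toy data are illustrative assumptions and are not taken from the paper's implementation.

```python
from statistics import fmean


def fit_simple_ols(x, y):
    """Least-squares fit of y ~ w * x + b for a single predictor."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    w = sxy / sxx            # slope (feature weight)
    b = y_bar - w * x_bar    # intercept (bias term)
    return w, b


def predict(weights, bias, features):
    """Evaluate y_hat = w_1*x_1 + ... + w_n*x_n + b for one sample."""
    return sum(w * x for w, x in zip(weights, features)) + bias


# Toy data generated from y = 2x + 1 (illustrative only, not the paper's data)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = fit_simple_ols(xs, ys)
y_hat = predict([w], b, [4.0])
print(w, b, y_hat)
```

With more than one predictor, the same `predict` function applies unchanged; only the fitting step would need a multivariate least-squares solver.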

3.2. Ridge Regression Model

Ridge regression is a multiple regression model suited to data in which the independent variables are highly correlated. It is used to avoid overfitting and to reduce the complexity of the model, and it gives better predictions for new values when the predictor variables are correlated [10]. The ridge regression model learns two parameters, $w$ and $b$, using the same least squares criterion as linear regression, with an added penalty term that constrains the parameter $w$. This penalty is a form of regularization: it restricts the model, reduces overfitting, and controls the regression coefficients, which reduces the sampling error and the variance [29]. Ridge regression uses L2 regularization, that is, it adds the sum of the squared coefficients to the residual sum of squares (RSS) being minimized [29]. The RSS for ridge regression can be expressed as in the following equation:

$$RSS_{\text{ridge}}(w, b) = \sum_{i=1}^{N} \left( y_i - (w \cdot x_i + b) \right)^2 + \alpha \sum_{j=1}^{n} w_j^2 \tag{2}$$

where $\alpha$ is the penalty term, $N$ is the number of samples, and $n$ is the number of features. A high value of $\alpha$ means more regularization and a simpler model. The penalty term penalizes large parameter values, so ridge regression shrinks the parameters, which simplifies the model and reduces its complexity.
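To make the effect of the penalty term concrete, the sketch below solves the single-predictor ridge problem in closed form, assuming the objective is the squared error plus `alpha * w**2` with an unpenalized intercept. The name `fit_ridge_1d`, the symbol `alpha`, and the toy data are assumptions made for illustration only.

```python
from statistics import fmean


def fit_ridge_1d(x, y, alpha):
    """Minimize sum((y - (w*x + b))**2) + alpha * w**2 for one predictor.

    The intercept b is left unpenalized, which gives the closed form
    w = Sxy / (Sxx + alpha) and b = mean(y) - w * mean(x).
    """
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    w = sxy / (sxx + alpha)
    b = y_bar - w * x_bar
    return w, b


# Toy data generated from y = 2x + 1 (illustrative only)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w0, b0 = fit_ridge_1d(xs, ys, alpha=0.0)  # no penalty: ordinary least squares
w5, b5 = fit_ridge_1d(xs, ys, alpha=5.0)  # larger penalty shrinks the slope
print(w0, b0)
print(w5, b5)
```

Setting `alpha` to zero recovers the ordinary least-squares slope, while a larger `alpha` pulls the slope toward zero, which is the shrinkage behaviour described in the text.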

3.3. Multivariate Adaptive Regression Splines Model

The MARS model is a nonlinear, nonparametric regression approach that uses piecewise linear splines to model the nonlinear relationship between the dependent and independent variables [30]. The MARS model is built as a linear combination of basis functions (BFs), as shown in the following equation:

$$\hat{f}(x) = \beta_0 + \sum_{m=1}^{M} \beta_m B_m(x) \tag{3}$$

where $\beta_0$ and $\beta_m$ are unknown coefficients that can be estimated using the least squares method and $M$ is the number of terms in the final model, found using a forward-backward stepwise process. $B_m(x)$ is a basis function built from piecewise linear functions with knot $t$. It is drawn from the set of functions shown in the following equation:

$$\{ (x - t)_+,\ (t - x)_+ \} \tag{4}$$

where $(x - t)_+$ and $(t - x)_+$ are given by

$$(x - t)_+ = \max(0,\ x - t) \tag{5}$$

$$(t - x)_+ = \max(0,\ t - x) \tag{6}$$

Finally, the prediction model is built with the number of BFs that gives the lowest generalized cross validation (GCV) value, which is calculated by the following equation:

$$GCV = \frac{SSE / N}{\left( 1 - C(M)/N \right)^2} \tag{7}$$

where $SSE$ is the sum of squared errors over the $N$ observations, $C(M) = (M + 1) + dM$ is the effective number of parameters, and $d$ is the smoothing parameter.
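The following sketch illustrates the building blocks of a MARS model under two stated assumptions: the basis functions are the standard hinge pair max(0, x - t) and max(0, t - x), and the GCV score follows Friedman's usual form with effective parameter count C(M) = (M + 1) + d*M. The function names, the default `d = 3.0`, and the example numbers are hypothetical and are not taken from the paper.

```python
def hinge_pair(x, knot):
    """Return the mirrored pair ((x - t)+, (t - x)+) for a knot t."""
    return max(0.0, x - knot), max(0.0, knot - x)


def mars_predict(x, intercept, terms):
    """Evaluate intercept + sum(coef * hinge) for a single input x.

    terms is a list of (coef, knot, direction) tuples, where
    direction = +1 selects (x - t)+ and direction = -1 selects (t - x)+.
    """
    total = intercept
    for coef, knot, direction in terms:
        total += coef * max(0.0, direction * (x - knot))
    return total


def gcv(sse, n_samples, n_terms, d=3.0):
    """GCV score assuming C(M) = (M + 1) + d * M (d = 3.0 is an assumed default)."""
    c_m = (n_terms + 1) + d * n_terms
    return (sse / n_samples) / (1.0 - c_m / n_samples) ** 2


right, left = hinge_pair(5.0, 3.0)   # x lies to the right of the knot
y_hat = mars_predict(5.0, 1.0, [(2.0, 3.0, +1), (0.5, 3.0, -1)])
score = gcv(sse=2.0, n_samples=20, n_terms=2)
print(right, left, y_hat, score)
```

Because the denominator of the GCV score shrinks as terms are added, a larger model must reduce the squared error enough to offset its extra complexity.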

3.4. Analysis of Variance (ANOVA)

ANOVA is a statistical analysis technique developed by R.A. Fisher in the 1920s. It can be used for many purposes, such as comparing group means. Two hypotheses are used to determine the outcome of the comparison, namely, the null hypothesis and the alternative hypothesis. The technique is called analysis of variance because it compares two variance estimates, namely, the variation within groups and the variation between groups. In this paper, we perform a one-way ANOVA. The purpose of a one-way between-groups ANOVA is to show whether there are any differences among the means of two or more groups/models. The ANOVA test is significant when at least two of the groups/models have means that differ significantly from each other. However, it does not reveal which of the groups/models are different.
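The comparison of between-groups and within-groups variation can be sketched as follows. This is a plain one-way ANOVA F statistic computed from first principles; the function name `one_way_anova` and the three toy groups are illustrative assumptions and do not reproduce the paper's data.

```python
from statistics import fmean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    grand_mean = fmean(all_values)

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)

    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within, df_between, df_within


# Three toy groups (illustrative only)
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova(groups)
print(f_stat, df_b, df_w)
```

The null hypothesis of equal means is rejected when the F statistic exceeds the critical value of the F distribution for the given degrees of freedom at the chosen significance level.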

3.5. Social Spider Optimization Algorithm

Social spider optimization (SSO) is a swarm intelligence-based metaheuristic algorithm [31]. SSO is chosen in this study because it is a recent heuristic algorithm that can solve difficult optimization problems. It searches for the global optimum by simulating the behavior of social spiders. Spiders identify the position of prey via the vibrations that occur on the spider web. Any unusual vibration is a sign for the social spider to search for food and move toward the source of the vibration. The search area of SSO uses a chain-like social spider structure. The direction of the food is determined by the spiders through the signals carried by vibrations of the web. Equations (8) and (9) define the SSO operation.

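The vibration intensity [32] at position $x$ is calculated by the following equation:

$$I(x, x, t) = \log\left( \frac{1}{F(x) - C} + 1 \right) \tag{8}$$

where $F(x)$ denotes the cost function and $C$ denotes a constant number.

The attenuation of the vibration over distance is given by the following equation:

$$I(x_a, x_b, t) = I(x_a, x_a, t) \times \exp\left( -\frac{D(x_a, x_b)}{\sigma \times r_a} \right) \tag{9}$$

where $D(x_a, x_b)$ indicates the distance between $x_a$ and $x_b$. The standard deviation of all members along one searched dimension is indicated by $\sigma$. The free parameter is $r_a$.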
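A small sketch of these two quantities is given below. It assumes the commonly cited social spider formulation, in which the intensity at the source is log(1 / (F(x) - C) + 1) and the intensity decays with distance as exp(-D / (sigma * r_a)). Whether the paper uses exactly this form is an assumption, and the function and argument names are hypothetical.

```python
import math


def vibration_intensity(cost, c):
    """Source intensity log(1 / (F(x) - C) + 1); requires cost > c."""
    if cost <= c:
        raise ValueError("the constant c must be smaller than every cost value")
    return math.log(1.0 / (cost - c) + 1.0)


def attenuated_intensity(source_intensity, distance, sigma, r_a):
    """Intensity received at a given distance from the source.

    sigma is the standard deviation of the population positions and
    r_a is the free attenuation parameter (both assumed positive).
    """
    return source_intensity * math.exp(-distance / (sigma * r_a))


good = vibration_intensity(cost=0.5, c=0.0)   # lower cost, stronger vibration
poor = vibration_intensity(cost=1.0, c=0.0)
near = attenuated_intensity(poor, distance=0.0, sigma=1.0, r_a=1.0)
far = attenuated_intensity(poor, distance=2.0, sigma=1.0, r_a=1.0)
print(good, poor, near, far)
```

In this form, a better (lower-cost) position emits a stronger vibration, and a larger `r_a` weakens the attenuation so that distant spiders still sense the signal.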

3.6. The Proposed Prediction Model

This paper combines the LR model and the MARS model, tuned with SSO, to develop an optimized LR-MARS model that predicts crude oil demand. As shown in Figure 1, the proposed LR-MARS model is developed in five main stages: (1) data collection and data preprocessing, (2) determining training and testing sets, (3) LR model and MARS model, (4) using SSO, and (5) performance evaluation.

3.6.1. Data Collection and Preprocessing

Data collection starts with gathering different features related to crude oil demand from different sources. The data are checked for any anomaly or inconsistency. For example, the gross domestic product (GDP) feature is gathered from the following sources: OPEC, IEA, International Monetary Fund (IMF), Saudi Statistics Authority, and World Bank. The data used in this article come from various sources and cover the period 1980–2015 [3]. Features such as year, oil demand, GDP, population, Brent prices, Light-Duty Vehicles (LDV), and Heavy-Duty Vehicles (HDV) are shown in Table 1.

Table 1 describes a number of statistical metrics such as mean, standard error, median, standard deviation, etc., of the features of the dataset which are oil demand, GDP, population, LDV, and HDV. For instance, the maximum value of the oil demand is 3318.656317, the minimum value is 602, and the standard deviation is 774.0563839.

In statistics, the correlation matrix shows the correlation coefficients between variables. The correlation matrix of the features of the Saudi Arabia oil demand dataset is shown in Table 2. Each cell represents the correlation value between two variables. As can be seen in Table 2, the correlation coefficients are close to 1, which indicates a strong positive correlation between each pair of features in the dataset.

Data preprocessing is an essential step in machine learning [33]. The quality of the data directly affects the ability of the models to learn; thus, it is critical to preprocess the data before using them as inputs to the proposed model. In this paper, preprocessing is done using normalization. If the data contain input values with varying scales, normalization can be used to bring these values to a common scale. Normalization scales each input feature separately by subtracting the mean (centering) and dividing by the standard deviation, so that the distribution has a mean of zero and a standard deviation of one [33]. Normalization is calculated using the following equation:

$$z = \frac{x - \mu}{\sigma} \tag{10}$$

where $x$ is the input value, $\mu$ is the mean value, and $\sigma$ is the standard deviation value. The mean value ($\mu$) is calculated using the following equation:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{11}$$

The standard deviation ($\sigma$) is calculated using the following equation:

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \tag{12}$$
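The normalization step described in this subsection (subtract the mean, divide by the standard deviation) can be sketched as below. The use of the population standard deviation (dividing by N) is an assumption, as are the function name `standardize` and the toy values.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale one feature to zero mean and unit standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)   # population standard deviation (divides by N)
    return [(v - mu) / sigma for v in values]


# Toy feature column (illustrative only, not taken from Table 1)
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(feature)
print(z)
```

In practice, the mean and standard deviation would typically be estimated on the training portion only and then reused to scale the validation and test portions.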

3.6.2. Training and Testing Sets

The crude oil demand dataset is split into train data (90%) and test data (10%). Following that, the train data is split further into training set (50% of train data) and validation set (50% of train data).
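The two-level split can be sketched as follows. The paper does not state whether rows are shuffled before splitting, so this sketch assumes an order-preserving (chronological) split and rounds the test size to the nearest integer; the function name `split_dataset` is hypothetical, and the rows are simply the years 1980 to 2015.

```python
def split_dataset(rows, test_fraction=0.10, validation_fraction=0.50):
    """Split rows into (training, validation, test) without shuffling."""
    n_test = round(len(rows) * test_fraction)
    n_train_total = len(rows) - n_test
    train_data, test = rows[:n_train_total], rows[n_train_total:]

    # Split the 90% train data again into training and validation halves
    n_validation = round(len(train_data) * validation_fraction)
    n_training = len(train_data) - n_validation
    training, validation = train_data[:n_training], train_data[n_training:]
    return training, validation, test


years = list(range(1980, 2016))   # 36 yearly observations
training, validation, test = split_dataset(years)
print(len(training), len(validation), len(test))
```

With 36 yearly observations, this convention yields 16 training rows, 16 validation rows, and 4 test rows; a different rounding rule or a shuffled split would change the exact membership.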

3.6.3. LR Model and MARS Model

The LR model is fitted on the training set (50% of the train data). The fitted LR model then produces two sets of predictions: one for the validation set (the validation prediction set) and one for the test data (the test prediction set). Next, the MARS model is trained on the validation prediction set to create the LR-MARS model. Finally, LR-MARS is applied to the test prediction set to obtain the final predicted output, which is in turn compared with the actual test data.
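The data flow of this two-stage procedure can be sketched as follows. To keep the example self-contained, the second stage uses a simple linear fit as a stand-in for MARS, so the snippet illustrates only how the predictions are passed between stages, not the MARS model itself. All names and the toy numbers are hypothetical.

```python
from statistics import fmean


def fit_line(x, y):
    """Least-squares fit of y ~ w * x + b; returns the pair (w, b)."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    w = sxy / sxx
    return w, y_bar - w * x_bar


def predict_line(model, x):
    """Apply a fitted (w, b) model to a list of inputs."""
    w, b = model
    return [w * xi + b for xi in x]


# Toy data for the three partitions (illustrative only)
x_train, y_train = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
x_val, y_val = [5.0, 6.0, 7.0, 8.0], [10.3, 12.1, 14.4, 16.0]
x_test, y_test = [9.0, 10.0], [18.2, 20.1]

# Stage 1: fit LR on the training set, then predict validation and test
lr_model = fit_line(x_train, y_train)
val_pred = predict_line(lr_model, x_val)      # validation prediction set
test_pred = predict_line(lr_model, x_test)    # test prediction set

# Stage 2 (stand-in for MARS): learn to map stage-1 predictions to actual values
stage2_model = fit_line(val_pred, y_val)
final_pred = predict_line(stage2_model, test_pred)

stage1_mae = fmean(abs(a - p) for a, p in zip(y_test, test_pred))
final_mae = fmean(abs(a - p) for a, p in zip(y_test, final_pred))
print(stage1_mae, final_mae)
```

The second stage sees only the first-stage predictions and the actual validation values, so it can correct systematic errors of the first stage without being fitted on the test data.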

3.7. Performance Metrics

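To validate the performance and effectiveness of the proposed prediction models, five error analysis criteria are used, as shown in equations (13)–(17), where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the actual values, and $N$ is the number of observations [24, 34]. For each model, the performance is evaluated using the Mean Absolute Error (MAE), Median Absolute Error (MedAE), Mean Square Error (MSE), Root Mean Square Error (RMSE), and R-squared ($R^2$):

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \tag{13}$$

$$MedAE = \operatorname{median}\left( \left| y_1 - \hat{y}_1 \right|, \ldots, \left| y_N - \hat{y}_N \right| \right) \tag{14}$$

$$MSE = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 \tag{15}$$

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2} \tag{16}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2} \tag{17}$$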
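The five metrics can be computed with a few lines of Python, as sketched below using their standard definitions. The function name `regression_metrics` and the toy vectors are illustrative assumptions; the values are not the paper's results.

```python
import math
from statistics import fmean, median


def regression_metrics(y_true, y_pred):
    """Return MAE, MedAE, MSE, RMSE, and R2 for paired actual/predicted values."""
    abs_err = [abs(a - p) for a, p in zip(y_true, y_pred)]
    sq_err = [(a - p) ** 2 for a, p in zip(y_true, y_pred)]
    y_bar = fmean(y_true)
    ss_tot = sum((a - y_bar) ** 2 for a in y_true)   # total sum of squares
    mse = fmean(sq_err)
    return {
        "MAE": fmean(abs_err),
        "MedAE": median(abs_err),
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "R2": 1.0 - sum(sq_err) / ss_tot,
    }


# Toy actual and predicted values (illustrative only)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 5.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE and MedAE are on the scale of the target, MSE is on its squared scale, and RMSE returns to the original scale, which is why RMSE should equal the square root of the reported MSE.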

4. Results and Discussion

The models are implemented in a Google Colab notebook. Google Colab makes it possible to write and execute Python in the browser, and it is widely used for implementing machine learning algorithms such as regression, classification, and clustering. To evaluate the performance of the optimized LR-MARS model in crude oil demand prediction, two models commonly used in machine learning, the LR model and the ridge regression model, are chosen for comparison. SSO is used to tune two hyperparameters: the penalty term and the maximum number of basis functions (BFs). The SSO population consists of 30 members, and the maximum number of generations is 100. The optimization process yields a maximum number of BFs of 42 and a penalty term of 1.46. The proposed prediction model, which combines the linear regression model with the multivariate adaptive regression splines model, shows high prediction accuracy when predicting crude oil demand in Saudi Arabia. To objectively evaluate and describe the performance of the three prediction models, the prediction error values of each model are calculated according to equations (13)–(17). The experimental results for MAE, MedAE, MSE, RMSE, and $R^2$ on the test data are shown in Table 3.

Among all the experimental models in Table 3, the ridge regression model has the largest error: its MAE, MedAE, MSE, RMSE, and $R^2$ are 0.055, 0.054, 0.0036, 0.06, and 99.4%, respectively. The MAE, MedAE, MSE, RMSE, and $R^2$ of the LR model are 0.042, 0.047, 0.0026, 0.05, and 99.6%, respectively. The LR-MARS model with both hyperparameters (penalty term and maximum number of BFs) optimized using the SSO algorithm has the smallest error: its MAE, MedAE, MSE, RMSE, and $R^2$ are 0.024, 0.023, 0.0007, 0.02, and 99.9%, respectively, so its errors are lower than those of the other two models. It can be seen from Table 3 that the SSO-optimized LR-MARS model predicts crude oil demand with high accuracy and is more effective than the other models. Table 4 compares the LR-MARS model in three cases: case 1, optimizing both hyperparameters (penalty term and maximum number of BFs) using the SSO algorithm; case 2, optimizing one hyperparameter (penalty term) using the SSO algorithm; and case 3, without optimizing any hyperparameter.

Figures 2–4 show cross-plots of the actual and predicted crude oil demand using the LR-MARS model, the LR model, and the ridge regression model, respectively.

4.1. Analysis of Variance (ANOVA)

In this section, we use ANOVA for two purposes. The first is to predict the crude oil demand in Saudi Arabia. The second is to compare the actual test data with the predictions of the LR-MARS, LR, and ridge regression models. $R^2$, also known as the coefficient of determination, measures how close the data are to the fitted regression line. The value $R^2$ = 0.898 indicates a good fit for the model, as shown in Figure 5.

4.2. ANOVA Predicting Result

ANOVA is used as a prediction model. Table 5 provides a comparison of the ANOVA prediction model and the proposed optimized LR-MARS model. The results show that the optimized LR-MARS model performs better than the ANOVA model.

In Table 6, the source of variation is analyzed in two ways: between groups and within groups. The between-groups analysis captures the variance among the LR-MARS, LR, and ridge regression models. The within-groups analysis captures the experimental error within each group. From the ANOVA results in Table 6, the sum of squares is SS = 2.582510317 and the mean square is MS = 0.215209193. The null hypothesis is rejected because the computed statistic, F = 10.00246025, exceeds the critical value ($F_{\text{crit}} < F$). Moreover, the $p$ value is less than 0.05 (i.e., 0.00982 < 0.05), which is further evidence of significant differences in the attribute (crude oil demand) among the LR-MARS, LR, and ridge regression models and therefore further evidence for rejecting the null hypothesis.

5. Conclusion

In this paper, a hybrid model called LR-MARS is developed for predicting the crude oil demand in Saudi Arabia. This paper used historical data of one of the world’s largest oil producers (Saudi Arabia) to demonstrate the applicability and effectiveness of the proposed LR-MARS model. The dataset used in LR-MARS consists of seven features: time, oil demand, GDP, population, Brent crude prices, LDV, and HDV. The LR-MARS model is a combination of the linear regression model and the multivariate adaptive regression splines (MARS) model. We also used the SSO algorithm to optimize two hyperparameters of the MARS model, namely, the penalty term and the maximum number of basis functions (BFs). To evaluate the predictive performance of the optimized LR-MARS model, we used MAE, MedAE, MSE, RMSE, and $R^2$, which are 0.024, 0.023, 0.0007, 0.02, and 99.9%, respectively. We also compared the optimized LR-MARS model with other machine learning prediction models. The optimized LR-MARS model is more accurate in predicting crude oil demand in Saudi Arabia than the other models. Moreover, we used ANOVA as a prediction model to predict the crude oil demand in Saudi Arabia and also to compare the actual test set with the predicted results of the LR-MARS, LR, and ridge regression models. This paper will be useful for oil demand planning, setting strategies, and future oil investments. The difficulty of obtaining some features and the inconsistent scaling of some data introduce a certain range of error into the data processing and prediction steps. Therefore, other possible influencing features can be considered as input variables. As a direction for future work, more knots can be added to the splines, which would increase the flexibility of the model. Moreover, cubic spline and natural cubic spline models can be used to enhance the results.

Data Availability

The data used in this paper were obtained from different sources (OPEC, IEA, International Monetary Fund (IMF), Saudi Statistics Authority, and World Bank) and cover the period 1980 to 2015 [3].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to acknowledge Taif University Researchers Supporting Project number (TURSP-2020/292), Taif University, Taif, Saudi Arabia, for funding this research.