Article

Forecasting East and West Coast Gasoline Prices with Tree-Based Machine Learning Algorithms

by Emmanouil Sofianos 1, Emmanouil Zaganidis 2, Theophilos Papadimitriou 2 and Periklis Gogas 2,*

1 Bureau d’Economie Théorique et Appliquée (BETA), University of Strasbourg, 67085 Strasbourg, France
2 Department of Economics, Democritus University of Thrace, 69100 Komotini, Greece
* Author to whom correspondence should be addressed.
Energies 2024, 17(6), 1296; https://doi.org/10.3390/en17061296
Submission received: 29 January 2024 / Revised: 29 February 2024 / Accepted: 4 March 2024 / Published: 8 March 2024
(This article belongs to the Special Issue Emerging Trends in Energy Economics II)

Abstract

This study aims to forecast New York and Los Angeles gasoline spot prices at a daily frequency. The dataset includes gasoline prices and a set of 128 other relevant variables spanning the period from 17 February 2004 to 26 October 2022. These variables were fed to three tree-based machine learning algorithms: decision trees, random forest, and XGBoost. Furthermore, a variable importance measure (VIM) technique was applied to identify and rank the most important explanatory variables. The optimal model, a trained random forest, achieves an out-of-sample mean absolute percentage error (MAPE) of 3.23% for the New York and 3.78% for the Los Angeles gasoline spot prices. The first lag, AR(1), of gasoline is the most important variable in both markets; the top five variables are all energy-related. This paper can strengthen the understanding of price determinants and has the potential to inform strategic decisions and policy directions within the energy sector, making it a valuable asset for both industry practitioners and policymakers.

1. Introduction

Accurate forecasting of gasoline spot prices is critical in many areas. It supports key supply chain management decisions, enabling companies to optimize logistics and reduce costs. It also plays a key role in risk management by helping companies hedge against market volatility. Gasoline price forecasting drives consumer behavior analysis, allowing companies to adapt their strategies to changing consumer spending habits. In addition, energy policy and investment decisions are heavily influenced by these forecasts, guiding governments and investors in shaping the energy landscape. They are critical to economic analysis, informing macroeconomic indicators and policies, while also helping to assess the environmental impact of hydrocarbons. Ultimately, accurate price forecasts enable stakeholders and investors to maximize profits, adapt to market fluctuations, and make informed decisions in their operations, positively impacting a wide range of industries and decision-making processes.
Machine learning was developed in the 1950s to provide artificial intelligence (AI) systems with the ability to learn. The concept that computers can learn from data, identify patterns, and make decisions with minimal human input is the foundation of automated analytical model construction, which is a crucial aspect of machine learning.
Large datasets have always been essential for machine learning [1]. In economics, high-frequency data are most readily available in finance, and this abundance of data has led to machine learning being applied mostly to financial data.
In this paper, we test the efficiency of three machine learning methodologies in forecasting the day-ahead gasoline spot prices for both the New York Harbor Conventional Gasoline Regular and the Los Angeles Reformulated RBOB Regular Gasoline.

2. Literature Review

The relevant literature in the field includes extensive research that employs various traditional and emerging methodologies to accurately predict energy prices. The introduction of new approaches such as machine learning combined with the availability of large datasets has resulted in a substantial body of related literature.
Econometric and statistical methods have been used extensively to forecast gasoline prices using the Michigan Surveys of Consumers (MSC) [2,3]. Ref. [4] uses a univariate moving average (MA) model and a vector autoregression (VAR) model to forecast gasoline prices. The VAR model makes use of the MSC data, which include consumer sentiment and consumer-expected inflation. According to the findings, the VAR forecasts for the period 2003–2014 encompass the MA forecasts, indicating that the MSC data provide valuable predictive information for gasoline prices. Ref. [5] tests whether the inflation expectations of US consumers can be used to forecast changes in energy prices. To do so, he builds a random walk model to generate forecasts at various horizons for three oil-based product prices. The paper concludes that consumer expectations hold information that can be used for forecasting oil-based product prices.
In his work, ref. [6] examines how oil prices affect ethanol and gasoline price estimates over the long and short terms. The oil price forecast was used as an exogenous variable in a vector error correction with exogenous variables (VECX) model, which allowed him to obtain simultaneous forecasts for ethanol and gasoline prices for the period from January 2019 to December 2022. The findings show that the price predictions for gasoline and ethanol were more susceptible to fluctuations in future oil prices in the short term. Machine learning techniques have been widely used in the literature for energy commodity forecasting (e.g., [7] forecast oil prices and [8] forecast natural gas prices). Ref. [9] present a new hybrid forecasting model called SW-GRU (Stochastic Time Effective Weights with Gated Recurrent Unit), which they use to forecast global energy prices with empirical mode decomposition (EMD). The study concludes that the proposed model can accurately forecast energy prices. Ref. [10] couple machine learning with swarm-based algorithms to forecast the price of gasoline. More specifically, they incorporate a Bee Colony Algorithm to optimize the classic Least-Squares SVM model. The simulations reveal the forecasting superiority of the proposed model against the competition. Ref. [11] predict the monthly price of five energy products using both linear and nonlinear techniques. The models use both spot prices and data on a monthly basis. To test the models, the authors used data from subsequent years and discovered asymmetries in the forecasting performance of their models, even though the forecasted series were highly correlated. Ref. [12] presents a blended forecasting system based on a multilevel clustering method for pattern recognition in energy spot prices. The author decomposes the series into subseries and then divides them into components of varying frequencies. A prediction is made for each component, and the component forecasts are then combined into a forecast of the initial series.
The importance of futures prices in energy markets has been studied in the literature by [13,14]. Using ARMA, ARIMA, and the AIC, the authors of the latter study explore energy commodities and, more specifically, the potential connections between spot and futures prices. The findings demonstrate that, in most cases, futures prices are an objective indicator of future spot prices. Although they slightly outperform time series models, futures do not seem to be good forecasters of forthcoming shifts in the pricing of energy commodities.
XGBoost has been used to forecast energy-related series, including crude oil prices [15] and gasoline consumption [16]. However, to the best of our knowledge, XGBoost has not been applied to forecasting gasoline spot prices. Random forests (RFs) have been used extensively for gasoline consumption forecasting [17] and for spot price forecasting at a weekly frequency. Ref. [18] propose a new mixed RF modeling strategy to discover explanatory variables with nonlinear influences, threshold values, and the closest parametric approximation. The methodology is applied to weekly gasoline price estimates, which are cointegrated with global oil prices and exchange rates. Specifically, in comparison to a logistic Error Correction Model (ECM), the mixed RF exhibits the best performance in weekly gasoline price forecasting.
In this paper, we employed three machine learning methodologies, decision trees, random forest, and XGBoost, to forecast the day-ahead gasoline spot prices both for the New York Harbor Conventional Gasoline Regular and the Los Angeles Reformulated RBOB Regular Gasoline, using 128 explanatory variables. The innovations of our approach include focusing on the day-ahead forecast instead of the week-ahead forecast, using machine learning techniques to forecast gasoline prices, and using the variable importance measure (VIM) technique to identify the most influential predictors used in the best model.
The paper is structured as follows: in Section 3, we briefly discuss the methodology and data, while in Section 4, we describe our empirical results. Finally, Section 5 concludes the paper.

3. Methodology and Data

3.1. Machine Learning

In the 1950s, machine learning was developed to provide the “Learning” element of artificial intelligence (AI) systems. Automated analytical model building, the foundation of machine learning, is based on the idea that computers (a) can learn from past information (data) and (b) can identify patterns in order to (c) make decisions with as little human intervention as possible.
Machine learning has historically been dependent on big datasets [1]. Thus, machine learning in economics was first applied to the subfield that provides high-frequency time series, namely finance.
In machine learning, a training–testing split is used to assess the performance of a model both on the specific data at hand and on new data unknown to the optimally trained model. First, we split the dataset into two parts: (a) the in-sample part, where we trained the algorithm using alternative hyperparameter specifications and selected the optimal model, and (b) the out-of-sample part, which was used only to assess the generalization ability of the selected optimal model on new and unknown (to the model) data.
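To make the split concrete, the following is a minimal Python sketch of a chronological in-sample/out-of-sample split; the DataFrame, column name, and dates are illustrative placeholders, not the authors' actual data or code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature matrix; the dates and values are illustrative only.
dates = pd.date_range("2004-02-17", periods=1000, freq="B")
df = pd.DataFrame({"NYGSP": np.random.default_rng(0).normal(2.0, 0.3, 1000)}, index=dates)

def sequential_split(data: pd.DataFrame, train_frac: float = 0.9):
    """Chronological split: the first part is in-sample, the rest out-of-sample."""
    cutoff = int(len(data) * train_frac)
    return data.iloc[:cutoff], data.iloc[cutoff:]

in_sample, out_of_sample = sequential_split(df)
print(len(in_sample), len(out_of_sample))  # 900 100
```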

3.2. Decision Trees

Decision trees are considered one of the most important machine learning models for both classification and regression. They resemble inverted trees, with a root, branches, and leaves (Figure 1). The model works by recursively splitting the data based on the most informative feature [19]. In regression tasks, the decision tree predicts the target variable by averaging the target values of the training data points assigned to the same leaf node. Each node represents a splitting condition, and each branch represents the corresponding outcome. The top node is the root node, which represents the entire dataset; the remaining internal nodes are referred to as decision nodes. The final outcomes of the decision-making process (a value, in the regression setting used here) are represented by the nodes that stop splitting, also referred to as leaves (or terminal and leaf nodes).
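As an illustration of how such a regression tree can be fitted in practice, here is a short scikit-learn sketch on synthetic data; the hyperparameter values are placeholders rather than the tuned settings reported in the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                        # stand-in for the explanatory variables
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)   # stand-in for the gasoline spot price

# max_depth and min_samples_leaf limit how far the recursive splitting goes.
tree = DecisionTreeRegressor(max_depth=6, min_samples_leaf=5, random_state=0)
tree.fit(X[:450], y[:450])

# Each prediction is the mean target value of the training points that end up
# in the same leaf as the new observation.
print(tree.predict(X[450:455]))
```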

3.3. Random Forest

Decision trees have the advantage of being simple to understand, and they typically perform well on the training data. However, their primary flaw, known as overfitting, is a limited ability to generalize: they perform poorly when dealing with out-of-sample data. They typically have a high variance and a low bias. One technique to overcome overfitting is ensemble methods. Ensembling combines predictions from multiple distinct machine learning algorithms. It aims to create an optimized model that outperforms the individual models (weak learners) by leveraging their different strengths and weaknesses. The process aggregates the predictions from each algorithm and generates a final prediction based on the combined output, typically resulting in better accuracy and robustness. The two main ensemble methods are BAGGing (Bootstrapped AGGregating, [20]) and boosting [21].
Bagging trains the weak learners independently and in parallel and averages their outputs (Figure 2). In regression tasks, predictions are made using the mean value of the target variable over the data points in each leaf node. The most common example of bagging is the random forest algorithm [22]. For each tree, a separate subsample of the same size as the initial dataset is randomly drawn with replacement (bootstrapping). In addition, a random subset of the original independent variables is chosen for each tree's training.
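A minimal random forest sketch in scikit-learn on synthetic data follows; the settings shown (number of trees, max_features) are illustrative and not the paper's tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(
    n_estimators=300,      # number of bagged trees
    bootstrap=True,        # each tree is trained on a bootstrap resample of the rows
    max_features="sqrt",   # each split considers a random subset of the predictors
    random_state=0,
)
forest.fit(X[:450], y[:450])

# The forest's prediction is the average of the individual trees' predictions.
print(forest.predict(X[450:455]))
```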

3.4. XGBoost

XGBoost is a variation of the classification and regression tree (CART) algorithm [24] that uses the boosting technique (Figure 2) to construct decision trees serially, yielding numerous weak regressors. Each tree is adjusted using the error generated by the previous one. Combining the decisions from all these weak regressors results in a more accurate model. XGBoost can address overfitting (low generalization ability) with several additional techniques. For example, overfitting can be controlled with the early-stopping parameter, which halts the training phase before the model starts learning and predicting the noise in the training data (pre-pruning). Another method to address overfitting is regularization, which evaluates the influence of each individual feature on the predicted outcomes and eliminates or downweights features that do not contribute significantly to predictive performance. The regularization information is added to the objective function of XGBoost, which is the squared error of the difference between the predicted and true values:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - p_i\right)^2$$
where $y_i$ denotes the observations, $p_i$ the prediction of $y_i$, and $n$ is the cardinality of our set.
There are two regularization parameters in XGBoost: (a) γ (gamma), the minimum loss reduction required for a split to occur:
$$\text{Loss function} = \mathrm{MSE} + \gamma \times (w_i)^2$$
where γ is the penalty and $w_i$ is the slope of the curve of ridge regression. Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function; it tends to shrink coefficients toward zero but never exactly to zero; and (b) α (alpha), the regularization on leaf weights:
$$\text{Loss function} = \mathrm{MSE} + \alpha \times |w_i|$$
where α is the penalty and $w_i$ is the slope of the curve of lasso regression (L1 regularization). Lasso regression adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function; because it can shrink some coefficients exactly to zero, it can also be used for feature selection, and the variables associated with zeroed coefficients can be dropped. L2 regularization (ridge regression), in contrast, tends to shrink coefficients evenly and is useful when collinear/codependent features are present. Regularization was used to add bias into our model to prevent it from overfitting the training data.
XGBoost is considered a state-of-the-art algorithm for enhancing the accuracy and precision of the results [25].
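The following sketch shows how the γ and α parameters and early stopping discussed above map onto the xgboost Python API; the data are synthetic and the parameter values are illustrative. Depending on the installed xgboost version, early_stopping_rounds may instead need to be passed to fit().

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=600)
X_tr, y_tr, X_val, y_val = X[:500], y[:500], X[500:], y[500:]

model = XGBRegressor(
    n_estimators=1000,            # upper bound on the number of boosted trees
    learning_rate=0.05,
    gamma=1.0,                    # minimum loss reduction required to make a further split
    reg_alpha=0.1,                # alpha: L1 (lasso-style) penalty on leaf weights
    reg_lambda=1.0,               # L2 (ridge-style) penalty on leaf weights
    objective="reg:squarederror",
    early_stopping_rounds=50,     # pre-pruning: stop once validation error stops improving
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print(model.best_iteration)       # number of boosting rounds actually kept
```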

3.5. Variable Importance Measure

In tree-based models, we can compute the variable importance measure (VIM) to partially open the black box behind the algorithm, as it ranks the variables according to their contribution to the model. The VIM computes the performance improvement from each attribute's split at each node, weighted by the number of observations at that node. VIMs are then averaged across all the decision trees within the model. The VIM thus sorts the features according to their importance in generating a forecast.
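As a sketch of how such a VIM ranking can be extracted from a fitted tree-based model (here an XGBoost regressor on synthetic data; the variable names are placeholders):

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
cols = [f"var_{i}" for i in range(10)]
X = pd.DataFrame(rng.normal(size=(500, 10)), columns=cols)
y = 2.0 * X["var_0"] + rng.normal(scale=0.1, size=500)

model = XGBRegressor(n_estimators=200, objective="reg:squarederror").fit(X, y)

# Importances are averaged over all trees and normalized to sum to 1,
# then sorted from the most to the least influential feature.
vim = pd.Series(model.feature_importances_, index=cols).sort_values(ascending=False)
print(vim.head(5))
```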

3.6. Overfit

Overfitting is a concept of data science applicable to machine learning methodologies as well. Overfitting occurs when the trained model fits the training data very closely but is unable to describe new and unknown data; in this case, the model is said to have low generalization ability. In order to test the generalization ability of the models, we first split our dataset into two subsamples: (a) the training sample (in-sample), on which the algorithms are trained, and (b) the test sample, where the generalization ability of our trained models is tested on new, out-of-sample data. To address the issue of overfitting and, furthermore, test the generalization ability of our models, we employed the cross-validation technique. The cross-validation procedure is implemented in the training sample (Figure 3).
Cross-validation is used in the training process, where the model's optimal hyperparameters are determined, to prevent overfitting. In cross-validation, the training part of the dataset (in-sample) is divided into k equal-sized folds (subsets). The training/testing process is therefore carried out k times for each unique set of hyperparameters that is tested. In every iteration, a different fold is used for model testing, and the remaining k − 1 folds are used for model training. The average performance across all k test folds determines the overall performance of each set of evaluated hyperparameters (Figure 3). We employed a 5-fold setup in our tests.
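A minimal sketch of a 5-fold cross-validated hyperparameter search with scikit-learn is shown below; the grid is purely illustrative and far smaller than the 1200/320/320 combinations evaluated in the paper, and scikit-learn's MAPE scorer returns a fraction rather than a percentage.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

param_grid = {                     # illustrative grid, not the paper's actual search space
    "n_estimators": [100, 300],
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,                                          # 5 folds, as in the paper
    scoring="neg_mean_absolute_percentage_error",  # scikit-learn maximizes, hence the negation
)
search.fit(X, y)    # in practice, this is run only on the in-sample (training) part
print(search.best_params_, -search.best_score_)
```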
There are several forecasting metrics that can be calculated to measure and compare the forecasting performance of our models. The metric that we used in this paper is the mean absolute percentage error (MAPE). It is the one most commonly used in the relevant literature due to its simplicity and ease of interpretation, and it is defined as:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100$$
where $y_i$ and $\hat{y}_i$ are the target variable's actual and predicted values, respectively, and $n$ is the cardinality of our sample. The MAPE metric measures the mean absolute difference between the forecasted and true values in percentage terms.
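The MAPE defined above can be written as a small helper function; the example numbers are made up for illustration.

```python
import numpy as np

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error (in %), as defined in the equation above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Forecasts of 2.10 and 2.40 against actual prices of 2.00 and 2.50:
print(mape([2.00, 2.50], [2.10, 2.40]))   # 4.5 (percent)
```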

3.7. The Dataset

The dataset includes the target variables, the New York Harbor Conventional Gasoline Regular Spot Price FOB (dollars per gallon) and the Los Angeles Reformulated RBOB Regular Gasoline Spot Price (dollars per gallon), and 128 explanatory variables including energy commodities [27,28], macroeconomic variables [29], fixed-income securities, exchange rates [30], financial indices, derivatives, weather [31], technical, and stock market variables [30]. The list of all the variables is available in Appendix A Table A1. For the WTISP, BRENTSP, NYGSP, GCGSP, LAGSP, NYHOSP, LADSP, GCKRP, MBPSP, and NGSP, we added up to 5 additional lags [32], reaching a total of 178 explanatory variables.
The sampling frequency was daily, covering the period from 17 February 2004 to 26 October 2022 and resulting in 4680 observations. We split our dataset into two sequential parts, the in-sample (90%) and the out-of-sample (10%) data. Thus, the out-of-sample (OOS) part spanned the period from 15 December 2020 to 26 October 2022, i.e., 468 observations. The data were obtained from the US Energy Information Administration (https://www.eia.gov/, accessed on 10 January 2023), Yahoo Finance (https://finance.yahoo.com, accessed on 10 January 2023), the FRED database of the St. Louis Fed (https://fred.stlouisfed.org/, accessed on 10 January 2023), and Meteostat (https://meteostat.net/en/, accessed on 10 January 2023).
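A sketch of how lagged values of selected price series could be appended as extra predictors, assuming the data sit in a pandas DataFrame indexed by date; the helper name and toy columns are assumptions for illustration, not the authors' code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2004-02-17", periods=30, freq="B")
df = pd.DataFrame({"NYGSP": rng.normal(2.0, 0.2, 30),
                   "BRENTSP": rng.normal(70.0, 5.0, 30)}, index=idx)

def add_lags(data: pd.DataFrame, columns, n_lags: int = 5) -> pd.DataFrame:
    """Append columns such as NYGSP_lag_1 ... NYGSP_lag_5 as additional predictors."""
    out = data.copy()
    for col in columns:
        for lag in range(1, n_lags + 1):
            out[f"{col}_lag_{lag}"] = out[col].shift(lag)
    return out.dropna()   # the first n_lags rows lack a complete set of lags

df_lagged = add_lags(df, ["NYGSP", "BRENTSP"])
print(df_lagged.columns.tolist())
```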

4. Empirical Results

In the first step, we selected the best-tuned decision tree, random forest, and XGBoost models out of 1200, 320, and 320 combinations of parameters, respectively, in the cross-validation process. The MAPEs of the best models for each algorithm are summarized in Table 1. The in-sample performance was calculated as the average test MAPE from the cross-validation process, and the out-of-sample performance was calculated on the part that was set aside from training.
The algorithm with the best results in-sample and out-of-sample for both variables is the random forest with an in-sample MAPE of 2.74% and 3.35% for the NYGSP and the LAGSP, respectively, and with 3.23% and 3.78% MAPE in the out-of-sample part. The difference between the in-sample MAPE and the out-of-sample MAPE for all the models is very small and less than 0.7%. This is evidence that these models have significant generalization ability to new, previously unseen data or, in other words, that they have a comparable bias and variance.
Figure 4 and Figure 5 show the actual time series of the two target variables in the out-of-sample data along with the predicted prices from the optimal random forest model. We can observe that the forecasted and actual prices are very close. Notably, there are three specific time horizons in which the models underperformed (grey areas), possibly due to the unexpected market fluctuations following the start of the war between Russia and Ukraine on 21 February 2022 and the NATO summit in Madrid on 28 June of the same year. The MAPE of the LAGSP increased after the G7 economies agreed to impose a price cap on Russian petroleum exports on 2 September 2022.
The scatter plots of true vs. predicted values in Figure 6 show how well the model performs in the out-of-sample part of the dataset. For an ideal model (zero error and R² = 1), all the points would lie on the diagonal (green) line. All models have high R² values of approximately 93%.
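For reference, the R² reported in Figure 6 can be computed directly from the true and predicted out-of-sample series; the numbers below are synthetic and chosen only to mimic a fit of roughly 93%, not the paper's actual predictions.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y_true = rng.normal(2.5, 0.5, 468)             # stand-in for the 468 out-of-sample prices
y_pred = y_true + rng.normal(0.0, 0.13, 468)   # stand-in for the model's predictions

# An R^2 close to 1 means the points cluster tightly around the 45-degree line
# of the true-vs-predicted scatter plot.
print(round(r2_score(y_true, y_pred), 3))
```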
The methodologies produced different results for the VIM. The variable with the highest predictive power according to both the decision tree and the random forest algorithms is the last available price of the target variable, with more than 95% importance. In order to better understand the contribution of other factors, we focused the interpretation of the VIM results on the XGBoost algorithm, since its forecasting results are similar.
For the VIM results, we comment on the top five variables, because these account for 88.66% and 95.88% of the total importance for the NYGSP and the LAGSP, respectively (Table 2). According to these results, the most important explanatory variable is the first lag of the target variable, as in the decision tree and random forest VIMs. The explanatory variable that ranks second in forecasting both dependent variables is lag 3 of the Brent Spot Price (BRENTSP). Another common factor is the first lag of the Gulf Coast Conventional Gasoline Regular Spot Price FOB (dollars per gallon) (GCGSP). What is interesting from the comparison of the VIMs is that the two target variables are codependent: the price of one moves the other, since each appears in the feature importance ranking of the other. In addition, one explanatory variable affects each target variable individually: the first lag of the U.S. Gulf Coast Kerosene-Type Jet Fuel Spot Price FOB (GCKRP) for the NYGSP and the second lag of the Los Angeles, CA Ultra-Low Sulfur CARB Diesel Spot Price (LADSP) for the LAGSP.
Another interesting observation from the VIM ranking is that the optimal models rely mostly on energy-related variables. Thus, another key finding is that macroeconomic indicators, fixed-income securities, exchange rates, stock indices, derivative prices, and weather, technical, and stock market variables do not significantly affect the gasoline price one day ahead. The aggregate importance of all 173 such variables is less than 11.33% and 4.12% for the NYGSP and the LAGSP, respectively.

5. Conclusions

In this study, we attempted to forecast the New York and Los Angeles gasoline spot prices using three tree-based machine learning methodologies, namely decision trees, random forest, and XGBoost. To the best of our knowledge, this is the first time the gasoline spot price has been forecasted using these methodologies, especially XGBoost, for both the New York and the Los Angeles RBOB Regular Gasoline. For training the models, we used 128 different explanatory variables spanning the period from 17 February 2004 to 26 October 2022. For the training process, 90% of the data were used, and the remaining 10% were used as the out-of-sample data to detect possible overfitting. The out-of-sample data spanned the period from 15 December 2020 to 26 October 2022.
Random forest models achieved the lowest MAPE for both the New York and Los Angeles gasoline spot prices. The best model for the New York gasoline price achieved an in-sample MAPE of 2.74% and an out-of-sample MAPE of 3.23%, while for the Los Angeles gasoline spot price, the best model achieved a 3.35% and a 3.78% MAPE, respectively. According to the XGBoost VIM, the two target variables are codependent, meaning that the price direction of one moves the other, since each appears in the VIM results of the other. Moreover, the first lag of the target variable is the most important explanatory variable for both the New York and the Los Angeles spot gasoline prices. Finally, the optimal models rely mostly on energy-related variables. Thus, we find evidence that macroeconomic indicators, fixed-income securities, exchange rates, stock indices, derivative prices, and weather and technical variables do not provide significant information for gasoline price forecasting; this is also the finding of [6,14].

Author Contributions

Conceptualization, E.S., E.Z., T.P. and P.G.; methodology, E.S., E.Z., T.P. and P.G.; software, E.S., E.Z., T.P. and P.G.; validation, E.S., E.Z., T.P. and P.G.; formal analysis, E.S., E.Z., T.P. and P.G.; investigation, E.S., E.Z., T.P. and P.G.; resources, E.S., E.Z., T.P. and P.G.; data curation, E.S., E.Z., T.P. and P.G.; writing—original draft preparation, E.S., E.Z., T.P. and P.G.; writing—review and editing, E.S., E.Z., T.P. and P.G.; visualization, E.S., E.Z., T.P. and P.G.; supervision, E.S., E.Z., T.P. and P.G.; project administration, E.S., E.Z., T.P. and P.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. List of explanatory variables.

Name | Description
Date | the date that we ran the model for our prediction (t0)
DAAA | Moody’s Seasoned Aaa Corporate Bond Yield
DBAA | Moody’s Seasoned Baa Corporate Bond Yield
DFII5 | Market Yield on U.S. Treasury Securities at 5-Year Constant Maturity, Quoted on an Investment Basis, Inflation-Indexed
DFII7 | Market Yield on U.S. Treasury Securities at 7-Year Constant Maturity, Quoted on an Investment Basis, Inflation-Indexed
DFII10 | Market Yield on U.S. Treasury Securities at 10-Year Constant Maturity, Quoted on an Investment Basis, Inflation-Indexed
DGS1 | Market Yield on U.S. Treasury Securities at 1-Year Constant Maturity, Quoted on an Investment Basis
DGS1MO | Market Yield on U.S. Treasury Securities at 1-Month Constant Maturity, Quoted on an Investment Basis
DGS2 | Market Yield on U.S. Treasury Securities at 2-Year Constant Maturity, Quoted on an Investment Basis
DGS3 | Market Yield on U.S. Treasury Securities at 3-Year Constant Maturity, Quoted on an Investment Basis
DGS3MO | Market Yield on U.S. Treasury Securities at 3-Month Constant Maturity, Quoted on an Investment Basis
DGS5 | Market Yield on U.S. Treasury Securities at 5-Year Constant Maturity, Quoted on an Investment Basis
DGS6MO | Market Yield on U.S. Treasury Securities at 6-Month Constant Maturity, Quoted on an Investment Basis
DGS7 | Market Yield on U.S. Treasury Securities at 7-Year Constant Maturity, Quoted on an Investment Basis
DGS10 | Market Yield on U.S. Treasury Securities at 10-Year Constant Maturity, Quoted on an Investment Basis
DGS20 | Market Yield on U.S. Treasury Securities at 20-Year Constant Maturity, Quoted on an Investment Basis
DGS30 | Market Yield on U.S. Treasury Securities at 30-Year Constant Maturity, Quoted on an Investment Basis
DLTIIT | Treasury Long-Term Average (Over 10 Years), Inflation-Indexed
DTB3 | 3-Month Treasury Bill Secondary Market Rate, Discount Basis
DTB4WK | 4-Week Treasury Bill Secondary Market Rate, Discount Basis
DTB6 | 6-Month Treasury Bill Secondary Market Rate, Discount Basis
NGSP | Henry Hub Natural Gas Spot Price (Dollars per Million Btu)
NGFP1 | Natural Gas Futures Contract 1 (Dollars per Million Btu)
NGFP2 | Natural Gas Futures Contract 2 (Dollars per Million Btu)
NGFP3 | Natural Gas Futures Contract 3 (Dollars per Million Btu)
NGFP4 | Natural Gas Futures Contract 4 (Dollars per Million Btu)
COFC1 | Cushing, OK Crude Oil Future Contract 1 (Dollars per Barrel)
COFC2 | Cushing, OK Crude Oil Future Contract 2 (Dollars per Barrel)
COFC3 | Cushing, OK Crude Oil Future Contract 3 (Dollars per Barrel)
COFC4 | Cushing, OK Crude Oil Future Contract 4 (Dollars per Barrel)
HOFC1 | New York Harbor No. 2 Heating Oil Future Contract 1 (Dollars per Gallon)
HOFC2 | New York Harbor No. 2 Heating Oil Future Contract 2 (Dollars per Gallon)
HOFC3 | New York Harbor No. 2 Heating Oil Future Contract 3 (Dollars per Gallon)
HOFC4 | New York Harbor No. 2 Heating Oil Future Contract 4 (Dollars per Gallon)
WTISP | Cushing, OK WTI Spot Price FOB (Dollars per Barrel)
BRENTSP | Europe Brent Spot Price FOB (Dollars per Barrel)
NYGSP | New York Harbor Conventional Gasoline Regular Spot Price FOB (Dollars per Gallon)
GCGSP | U.S. Gulf Coast Conventional Gasoline Regular Spot Price FOB (Dollars per Gallon)
LAGSP | Los Angeles Reformulated RBOB Regular Gasoline Spot Price (Dollars per Gallon)
NYHOSP | New York Harbor No. 2 Heating Oil Spot Price FOB (Dollars per Gallon)
LADSP | Los Angeles, CA Ultra-Low Sulfur CARB Diesel Spot Price (Dollars per Gallon)
GCKRP | U.S. Gulf Coast Kerosene-Type Jet Fuel Spot Price FOB (Dollars per Gallon)
MBPSP | Mont Belvieu, TX Propane Spot Price FOB (Dollars per Gallon)
AEDUSD=X | AED/USD
BP | BP p.l.c.
CC=F | Cocoa, March 23
CNY=X | USD/CNY
COP | ConocoPhillips
CT=F | Cotton, March 23
CVX | Chevron Corporation
ENB | Enbridge Inc.
EQNR | Equinor ASA
ES=F | E-Mini S&P 500, December 22
EURCAD=X | EUR/CAD
EURUSD=X | EUR/USD
GBPUSD=X | GBP/USD
GC=F | Gold
GF=F | Feeder Cattle Futures, January 2023
HE=F | Lean Hogs Futures, December 2022
HG=F | Copper, December 22
HKD=X | USD/HKD
INR=X | USD/INR
JPY=X | USD/JPY
KC=F | Coffee, March 23
KE=F | KC HRW Wheat Futures, March 2023
LBS=F | Lumber, January 23
LE=F | Live Cattle Futures, December 2022
NQ=F | Nasdaq 100, December 22
NZDUSD=X | NZD/USD
PA=F | Palladium, March 23
PBR | Petróleo Brasileiro S.A.—Petrobras
PBR-A | Petróleo Brasileiro S.A.—Petrobras
PL=F | Platinum, January 23
QAR=X | USD/QAR
RUB=X | USD/RUB
SARUSD=X | SAR/USD
SB=F | Sugar #11, March 23
SGD=X | USD/SGD
SHEL | Shell plc
SI=F | Silver
TTE | TotalEnergies SE
XOM | Exxon Mobil Corporation
YM=F | Mini Dow Jones Indus.-USD 5, December 22
ZB=F | U.S. Treasury Bond Futures, December
ZC=F | Corn Futures, March 2023
ZF=F | Five-Year US Treasury Note Futures
ZL=F | Soybean Oil Futures, January 2023
ZM=F | Soybean Meal Futures, January 2023
ZN=F | 10-Year T-Note Futures, December 2022
ZO=F | Oat Futures, March 2023
ZR=F | Rough Rice Futures, January 2023
ZS=F | Soybean Futures, January 2023
ZT=F | 2-Year T-Note Futures, December 2022
^AORD | ALL ORDINARIES
^AXJO | S&P/ASX 200
^BFX | BEL 20
^BSESN | S&P BSE SENSEX
^DJI | Dow 30
^FCHI | CAC 40
^FTSE | FTSE 100
^GDAXI | DAX PERFORMANCE-INDEX
^GSPC | S&P 500
^GSPTSE | S&P/TSX Composite Index
^HSI | Hang Seng Index
^IXIC | Nasdaq
^JKSE | Jakarta Composite Index
^KLSE | FTSE Bursa Malaysia KLCI
^KS11 | KOSPI Composite Index
^MERV | MERVAL
^MXX | IPC MEXICO
^N100 | Euronext 100 Index
^N225 | Nikkei 225
^NYA | NYSE COMPOSITE (DJ)
^NZ50 | S&P/NZX 50 INDEX GROSS
^RUT | Russell 2000
^STI | STI Index
^TWII | TSEC Weighted Index
^VIX | CBOE Volatility Index
^XAX | NYSE AMEX COMPOSITE INDEX
Washington | Average temperature in Washington
London | Average temperature in London
Hong Kong | Average temperature in Hong Kong
Saudi Arabia | Average temperature in Saudi Arabia
Moscow | Average temperature in Moscow
São Paulo | Average temperature in São Paulo
Tokyo | Average temperature in Tokyo
day | day of the month number
month | month number
day_name | day number

References

  1. Gogas, P.; Papadimitriou, T. Machine Learning in Economics and Finance. Comput. Econ. 2021, 57, 1–4. [Google Scholar] [CrossRef]
  2. Baumeister, C.; Kilian, L.; Lee, T.K. Inside the Crystal Ball: New Approaches to Predicting the Gasoline Price at the Pump. J. Appl. Econ. 2016, 32, 275–295. [Google Scholar] [CrossRef]
  3. Anderson, S.T.; Kellogg, R.; Sallee, J.M.; Curtin, R.T. Forecasting Gasoline Prices Using Consumer Surveys. Am. Econ. Rev. 2011, 101, 110–114. [Google Scholar] [CrossRef]
  4. Baghestani, H. Predicting gasoline prices using Michigan survey data. Energy Econ. 2015, 50, 27–32. [Google Scholar] [CrossRef]
  5. Baghestani, H. Inflation expectations and energy price forecasting. OPEC Energy Rev. 2014, 38, 21–35. [Google Scholar] [CrossRef]
  6. Carpio, L.G.T. The effects of oil price volatility on ethanol, gasoline, and sugar price forecasts. Energy 2019, 181, 1012–1022. [Google Scholar] [CrossRef]
  7. Dimitriadou, A.; Gogas, P.; Papadimitriou, T.; Plakandaras, V. Oil Market Efficiency under a Machine Learning Perspective. Forecasting 2018, 1, 157–168. [Google Scholar] [CrossRef]
  8. Mouchtaris, D.; Sofianos, E.; Gogas, P.; Papadimitriou, T. Forecasting Natural Gas Spot Prices with Machine Learning. Energies 2021, 14, 5782. [Google Scholar] [CrossRef]
  9. Wang, B.; Wang, J. Energy futures and spots prices forecasting by hybrid SW-GRU with EMD and error evaluation. Energy Econ. 2020, 90, 104827. [Google Scholar] [CrossRef]
  10. Mustaffa, Z.; Yusof, Y.; Kamaruddin, S.S. Gasoline Price Forecasting: An Application of LSSVM with Improved ABC. Procedia Soc. Behav. Sci. 2014, 129, 601–609. [Google Scholar] [CrossRef]
  11. Malliaris, M.E.; Malliaris, S.G. Forecasting inter-related energy product prices. Eur. J. Financ. 2008, 14, 453–468. [Google Scholar] [CrossRef]
  12. Li, R. Forecasting energy spot prices: A multiscale clustering recognition approach. Resour. Policy 2023, 81, 103320. [Google Scholar] [CrossRef]
  13. Ma, C.W. Forecasting efficiency of energy futures prices. J. Futur. Mark. 1989, 9, 393–419. [Google Scholar] [CrossRef]
  14. Chinn, M.; LeBlanc, M.; Coibion, O. The Predictive Content of Energy Futures: An Update on Petroleum, Natural Gas, Heating Oil and Gasoline; National Bureau of Economic Research: Cambridge, MA, USA, 2005. [Google Scholar] [CrossRef]
  15. Gumus, M.; Kiran, M.S. Crude oil price forecasting using XGBoost. In Proceedings of the 2017 International Conference on Computer Science and Engineering (UBMK), Antalya, Turkey, 5–8 October 2017; pp. 1100–1103. [Google Scholar] [CrossRef]
  16. Yu, L.; Ma, Y.; Ma, M. An effective rolling decomposition-ensemble model for gasoline consumption forecasting. Energy 2021, 222, 119869. [Google Scholar] [CrossRef]
  17. Ceylan, Z.; Akbulut, D.; Baytürk, E. Forecasting gasoline consumption using machine learning algorithms during COVID-19 pandemic. Energy Sources Part A Recover. Util. Environ. Eff. 2022, 1–19. [Google Scholar] [CrossRef]
  18. Escribano, Á.; Wang, D. Mixed random forest, cointegration, and forecasting gasoline prices. Int. J. Forecast. 2021, 37, 1442–1462. [Google Scholar] [CrossRef]
  19. Gogas, P.; Papadimitriou, T.; Sofianos, E. Forecasting unemployment in the euro area with machine learning. J. Forecast. 2021, 41, 551–566. [Google Scholar] [CrossRef]
  20. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
  21. Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef]
  22. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  23. González, S.; García, S.; Del Ser, J.; Rokach, L.; Herrera, F. A practical tutorial on bagging and boosting based ensembles for machine learning: Algorithms, software tools, performance study, practical perspectives and opportunities. Inf. Fusion 2020, 64, 205–237. [Google Scholar] [CrossRef]
  24. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Wadsworth Inc.: Monterey, CA, USA, 1984. [Google Scholar]
  25. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794. [Google Scholar] [CrossRef]
  26. Gogas, P.; Papadimitriou, T.; Sofianos, E. Money Neutrality, Monetary Aggregates and Machine Learning. Algorithms 2019, 12, 137. [Google Scholar] [CrossRef]
  27. Drachal, K. Forecasting selected energy commodities prices with Bayesian dynamic finite mixtures. Energy Econ. 2021, 99, 105283. [Google Scholar] [CrossRef]
  28. Herrera, G.P.; Constantino, M.; Tabak, B.M.; Pistori, H.; Su, J.-J.; Naranpanawa, A. Long-term forecast of energy commodities price using machine learning. Energy 2019, 179, 214–221. [Google Scholar] [CrossRef]
  29. Idilbi-Bayaa, Y.; Qadan, M. Forecasting Commodity Prices Using the Term Structure. J. Risk Financ. Manag. 2021, 14, 585. [Google Scholar] [CrossRef]
  30. Huang, S.-C.; Wu, C.-F. Energy Commodity Price Forecasting with Deep Multiple Kernel Learning. Energies 2018, 11, 3029. [Google Scholar] [CrossRef]
  31. Timmer, R.P.; Lamb, P.J. Relations between Temperature and Residential Natural Gas Consumption in the Central and Eastern United States. J. Appl. Meteorol. Clim. 2007, 46, 1993–2013. [Google Scholar] [CrossRef]
  32. Yadav, M.P.; Sehgal, V.; Ratra, D.; Wajid, A. Forecasting the Energy Commodities: An evidence of ARIMA and Intervention Analysis. Int. J. Monet. Econ. Financ. 2023, 16, 443–457. [Google Scholar] [CrossRef]
Figure 1. Example of a decision tree.
Figure 2. Visual representation of the bagging and boosting methodologies [23]. The graph represents the workflows of bagging- and boosting-based algorithms. In this example, the colors represent the two classes of the target variable (classification problem), and the size of each point represents the importance of that observation in each tree. In bagging, each individual data point can be used to train a new decision tree, and the importance of each observation is the same because the observations are independently sampled. In boosting, decision trees are created iteratively, and the weight/importance of each observation can vary at each iteration, with more emphasis given to misclassified or difficult-to-predict data points as the algorithm seeks to correct its errors and improve its overall performance.
Figure 3. A graphical representation of a 3-fold cross-validation process. For every set of hyperparameters tested, each fold serves in turn as a test sample, while the remaining folds are used to train the model. The average performance for each set of parameters over the k test folds was used to assess the model (in-sample) and test it on the out-of-sample part of the dataset that includes data unknown to the model [26].
Figure 4. Actual and predicted NYGSP for the out-of-sample part of the dataset for the optimal random forest model. The grey area corresponds to a time horizon in which the model underperformed.
Figure 5. Actual and predicted LAGSP for the out-of-sample part of the dataset for the optimal random forest model. The grey areas correspond to time horizons in which the model underperformed.
Figure 6. Scatter plot of the true vs. predicted values for the out-of-sample part of the dataset for both the NYGSP and the LAGSP with the optimal random forest model. The fitted linear regression between the true and predicted values is displayed together with its equation and R².
Table 1. MAPE for the in-sample and out-of-sample parts of the dataset for both the NYGSP and the LAGSP with the best models from decision trees, random forests, and XGBoost.

Model, Sample | NYGSP | LAGSP
Decision Tree, in-sample | 2.93% | 3.60%
Decision Tree, out-of-sample | 3.34% | 4.04%
Random Forest, in-sample | 2.74% | 3.35%
Random Forest, out-of-sample | 3.23% | 3.78%
XGBoost, in-sample | 3.30% | 4.00%
XGBoost, out-of-sample | 3.92% | 3.87%
Table 2. Variable importance measure for the NYGSP and LAGSP, top 5 variables.

NYGSP: Variable | VIM Score
NYGSP_lag_1 | 68.59%
BRENTSP_lag_3 | 14.92%
GCKRP_lag_3 | 2.53%
GCGSP_lag_1 | 1.45%
LAGSP_lag_2 | 1.18%

LAGSP: Variable | VIM Score
LAGSP_lag_1 | 84.83%
BRENTSP_lag_3 | 6.62%
LADSP_lag_2 | 2.45%
GCGSP_lag_1 | 1.15%
NYGSP_lag_1 | 0.83%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
