1 Introduction

Prediction is an essential tool in business, where it supports decisions about operational (short-term), tactical (medium-term), and strategic (long-term) planning (Hyndman and Athanasopoulos 2018). In a world that demands increasingly efficient and dynamic information processing and knowledge extraction, the combination of forecasting and machine learning techniques has gained notable relevance for modeling real-world phenomena, which are characterized by complexity and uncertainty. Several authors have developed models that seek the optimal point in the trade-off between the quality of decisions and response time.

In recent years, a new family of methods emerged in the context of online machine learning and became prominent in nonstationary time series prediction, stimulated by the increasing availability of data and the need to predict their behavior in real time to support decision-making. Evolving fuzzy algorithms are an important segment of evolving systems because they allow a model to react to data changes by creating, merging, updating, and deleting fuzzy rules (Kasabov and Filev 2006). A wide variety of fuzzy algorithms and models can be found in the literature. Lughofer (2016) discussed recent advances that have improved stability, reliability, and usability, as well as aspects related to interpretability, for real-world applications such as online condition monitoring, visual inspection, human-machine interaction, smart sensors, production systems, and others. Another important contribution is presented by Škrjanc et al. (2019): a systematic survey on evolving intelligent systems, with a focus on fuzzy and neuro-fuzzy methods for clustering, classification, regression, and system identification in real-world applications such as online trading, financial analysis, e-commerce and business, smart homes, health care, transportation systems, global supply and logistics chains, smart grids, industrial control, cyber-security, and many other areas.

The evolving Takagi–Sugeno (eTS) model is considered the precursor (Angelov and Filev 2004). This model updates the system recursively, adding new rules or updating existing ones. Antecedent terms of the TS rules are determined through a subtractive clustering approach (Chiu 1994) based on the notion of a potential function. Consequent parameters of the TS rules are updated through Recursive Least Squares (Young 2012). Other models have modified the eTS, such as eTS+ (Angelov et al. 2010), which uses concepts such as age, utility, local density, and zone of influence to update the rule base; and eTS-KRLS (Shafieezadeh-Abadeh and Kalhor 2016), whose consequent TS parameters are updated through Kernel Recursive Least Squares (Engel et al. 2004).

Lima et al. (2010) proposed the evolving Participatory Learning (ePL) algorithm, which uses an unsupervised dynamic fuzzy clustering algorithm that mimics human learning (Yager 2004). In this approach, the rule base is updated as a function of the existing rules and the new information received. A similar process occurs in human learning, which is amplified when there is prior knowledge about a given subject. Maciel et al. (2017) extended the initial ideas with the development of ePL+, which, similarly to Angelov et al. (2010), uses criteria such as utility, age, area of influence, and local density to update the rule base. Both models, ePL and ePL+, use the Recursive Least Squares (RLS) method to estimate consequent parameters.

Recently, Vieira et al. (2018) proposed a new method called ePL-KRLS, inspired by Shafieezadeh-Abadeh and Kalhor (2016). The main difference between ePL-KRLS and ePL (or ePL+) is the way the consequent parameters are estimated: the former uses Kernel Recursive Least Squares (KRLS). The advantage of this approach is that kernel-based methods are more sensitive to variations in the input data and are able to approximate nonlinear systems accurately and efficiently with moderate computational cost (Vieira et al. 2018). Alves et al. (2020) introduced the methods called SM-ePL-KRLS and ESM-ePL-KRLS. These models adjust parameters recursively according to the error, based on Set-Membership (SM) (de Aguiar et al. 2017) and Enhanced Set-Membership (ESM) (Alves et al. 2020) filtering.

The present paper proposes a forecasting model called Variable Step-Size evolving Participatory Learning with Kernel Recursive Least Squares (VS-ePL-KRLS). It extends the ePL-KRLS model with an error-based variable step-size (VS) parameter-updating method. The VS algorithm, discussed in Harris et al. (1986), is an extension of earlier ideas in stochastic approximation that use a changeable step size in gradient methods. According to Evans et al. (1993), variable step-size methods attempt to improve the convergence of the Least Mean Squares (LMS) algorithm while preserving its steady-state performance, with a relatively small increase in implementation complexity. Compared to the SM-ePL-KRLS and ESM-ePL-KRLS models, it is easier to implement and, consequently, further computational savings are expected, as noted by de Aguiar et al. (2017).

The main contributions of this work are summarized as follows:

  • We introduce a new forecasting model based on the Variable Step-Size (Harris et al. 1986) to adjust the parameter that controls the rate of change of the arousal index.

  • We approach an unexplored and important class of time series, which is the price of fuels, more specifically Brazilian S500 and S10 diesel oil, with weekly data periodicity for biweekly and monthly horizons.

  • We evaluate the performance of the proposed model in terms of the average number of rules, error metrics, and percentage of system-wide CPU utilization. Additionally, we compare its performance with benchmark evolving models such as ePL-KRLS (Vieira et al. 2018), SM-ePL-KRLS, and ESM-ePL-KRLS (Alves et al. 2020).

Our major conclusions are as follows:

  • The proposed model obtained the lowest errors (with no significant outliers) in all simulations, suggesting that it can predict complex behaviors with high accuracy.

  • The proposed model obtained a low percentage of system-wide CPU utilization in all simulations, suggesting that it is appropriate to deal with Big Data.

  • Fuel price forecasting has direct applicability in logistics: it can be used, for example, to optimize the delivery of a given cargo, defining the best days and places to fill up trucks in order to decrease delivery costs and thereby maximize profits.

This paper is organized as follows. Section 2 analyzes quantitatively and qualitatively historical S500 and S10 diesel oil prices and highlights their relevance in logistics. Section 3 introduces the variable step-size evolving Participatory Learning with Kernel Recursive Least Squares algorithm. Section 4 evaluates the proposed model. Section 5 concludes the research.

2 Problem formulation

Transport management is one of the main logistical processes and is responsible for integrating the links of a supply chain: the supply of inputs and raw materials from suppliers to production, and the distribution of final products from production to customers. Such management must be done efficiently and effectively to minimize costs and maximize the service level.

According to Fleury et al. (2000), transportation costs correspond to approximately 60% of a company’s logistical costs. Of these expenses, 40% are fixed costs, such as depreciation and salaries for drivers, helpers, and mechanics. The remaining 60% are variable costs, particularly maintenance, tolls, and fuel, in the case of a heavy truck used to transport cargo (Vilardaga 2007). The same study estimates that fuel accounts for about 58% of the variable costs and, consequently, about 35% of the total transportation expenses. In a country of continental dimensions such as Brazil, which primarily uses road networks to transport cargo and people, a computational tool that provides reliable forecasts of S500 and S10 diesel oil prices becomes a competitive differential for the logistics operator.

The fuel price dataset used to evaluate the model was obtained from the website of the National Agency of Petroleum, Natural Gas, and Biofuels (ANP) (Agência Nacional do Petróleo 2020). Weekly resale and distribution prices of S500 and S10 diesel oil from December 31st, 2012 to May 30th, 2020 were considered. The main difference between the two diesel oils is their sulfur concentration: S500 contains 500 parts per million and S10 contains 10 parts per million. This directly affects the engine and the environment: S10 diesel oil has better ignition and combustion, causes less corrosion of parts, and is less polluting.

The diesel oil market in Brazil is regulated by the National Petroleum Agency (ANP) and by Federal Law 9.478/97, known as the Petroleum Law. The price of diesel oil is composed of 40% pure diesel, 11% biodiesel, 11% state taxes, 16% national taxes, and 20% profit from distribution and resale (the final product must contain a mixture of 88% pure diesel and 12% biodiesel), according to Petrobras studies between May 24th, 2020 and May 30th, 2020 (Petrobras 2020).

A preliminary statistical analysis is presented in Table 1, and the behavior of historical prices can be observed in Fig. 1; some additional information helps in understanding it. The first factor is the fuel price policy in Brazil. Until 2016, the price was controlled by Petrobras, which delayed the transfer of international variations as a tool to control inflation. After several financial losses, the state-owned company changed this policy and the price started to follow the barrel of oil in the international market. Since those prices respond to several political and economic factors on a global scale, which have dynamic characteristics, the oscillation has been more noticeable since 2017. The second factor is the successive increases in fuel prices, especially of S500 and S10 diesel oil, which rose 8.383% and 7.755%, respectively, in the first three weeks of May 2018. This resulted in the truckers' strike (also called the diesel crisis), which lasted from May 21st, 2018 to May 30th, 2018 throughout the Brazilian territory and had several demands, including a reduction in the price of diesel. The third factor is the COVID-19 pandemic: the international price of a barrel of oil plummeted, reaching -US$ 37.63 on April 20th, 2020, an event unprecedented in history. These dynamic characteristics justify the use of evolving fuzzy models.

Table 1 Statistical analysis
Fig. 1 Historical diesel oil prices

3 Proposed model

The learning structure of the Variable Step-Size evolving Participatory Learning with Kernel Recursive Least Squares model is shown in Fig. 2.

Fig. 2 VS-ePL-KRLS model

It is an evolving fuzzy model based on Takagi–Sugeno–Kang rules (Takagi and Sugeno 1985; Sugeno and Kang 1988). The rule structure is “IF <antecedent>, THEN <consequent>”, as shown in Eq. 1. The antecedent is expressed by a linguistic variable and the consequent is expressed by a mathematical function. The main advantage of fuzzy systems is their universal approximation property (Afravi and Kreinovich 2020): they are capable of approximating, with pre-specified precision, a general nonlinear continuous function on a compact set through the combination of local functions.

$$\begin{aligned} {\mathcal {R}}_{i}: \mathtt {IF} \ {x} \ \mathtt {is} \ {\mathcal {A}}_{i} \ \mathtt {THEN} \ {\hat{y}}_{i} = f_{i} \left( {x}, \theta _{i} \right) \end{aligned}$$
(1)

where \({\mathcal {R}}_{i}\) is the ith rule, \({\mathcal {A}}_{i}\) is the antecedent fuzzy set of the ith rule, \({x} = \left[ x_{1},\ldots ,x_{m} \right] ^{T} \in {\mathbb {R}}^{m}\) is the input vector, \(\theta _{i} = \left[ \theta _{i1},\ldots ,\theta _{in_{i}} \right] ^{T} \in {\mathbb {R}}^{n_{i}}\) is the consequent parameter vector of the ith rule, and \({\hat{y}}_{i}: {\mathbb {R}}^{m} \times {\mathbb {R}}^{n_{i}} \rightarrow {\mathbb {R}}\) is the local output of the ith rule.
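
To make the notation concrete, a hypothetical in-memory representation of one such rule could look like the sketch below; the field names are illustrative and are not taken from the original implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np


@dataclass
class FuzzyRule:
    """Hypothetical container for one Takagi-Sugeno-Kang rule (Eq. 1)."""
    center: np.ndarray                                            # antecedent center v_i
    arousal: float = 0.0                                          # arousal index a_i
    dictionary: List[np.ndarray] = field(default_factory=list)    # local dictionary D_i (KRLS support samples)
    theta: Optional[np.ndarray] = None                            # consequent parameters theta_i
    kernel_sizes: Optional[np.ndarray] = None                     # one kernel width nu_ij per dictionary element
```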

The first stage of the model is the unsupervised fuzzy clustering algorithm proposed by Silva et al. (2005). It has proved to be as efficient as Gustafson–Kessel (GK) (Gustafson and Kessel 1979) and Modified Fuzzy K-Means (MFKM) (Gath et al. 1997), two major fuzzy clustering algorithms. The objective of this stage is to adjust the number of fuzzy rules and their respective centers. The algorithm is initialized with a single rule \(\left( R = 1 \right)\) whose center is the first input \(\left( v_{1}^{1} = x^{1} \right)\). During the computational procedure, new rules can be added and existing rules can be updated or merged.

The second stage of the model is the evolving participatory learning proposed by Lima et al. (2010). Its conception is aligned with human learning (Yager 2004): in humans, the greater the prior knowledge of a given subject, the more the learning is amplified. The model learns in the same way, relating each new piece of information to the existing ones; the greater the similarity, the greater the learning. This stage is divided into two steps.

In the first step, the compatibility index \(\left( \rho _{i}^{k} \right)\) and the arousal index \(\left( a_{i}^{k} \right)\) are calculated.

The compatibility index measures how compatible the data and the rules are, and it is represented by Eq. 2:

$$\begin{aligned} \rho _{i}^{k} = 1 - \frac{\left\| x^{k} - v_{i}^{k}\right\| }{m}. \end{aligned}$$
(2)

where \(\rho _{i}^{k} \in \left[ 0,1 \right]\) is the compatibility index of the ith rule at kth step, \(x^{k} \in {\mathbb {R}}^{m}\) is the m-dimensional input vector at kth step, and \(v_{i}^{k} \in \left[ 0,1 \right] ^{m}\) is the center of the ith rule at kth step.

The arousal index evaluates whether a new rule must be added or an existing rule must be updated and it is represented by Eq. 3:

$$\begin{aligned} a_{i}^{k} = a_{i}^{k-1} + \beta ^{k-1} \left( 1 - \rho _{i}^{k} - a_{i}^{k-1} \right) \end{aligned}$$
(3)

where \(a_{i}^{k} \in \left[ 0,1 \right]\) is the arousal index of the ith rule at the kth step and \(\beta ^{k} \in \left[ 0,1 \right]\) is the parameter that controls the rate of change of arousal. The initialization values are \(a_{1}^{0} = 0\) and \(\beta ^{0} = 0.18\).

The arousal index is compared with a parameter \(\tau ^{k} \in \left[ 0,1 \right]\). If \(\left| a_{i}^{k} \right| > \tau ^{k}\), a new rule is added. Otherwise, the most compatible existing rule is updated. At each step, the parameter is updated as \(\tau ^{k} = \beta ^{k-1}\).

The center of the added rule \(\left( v_{R}^{k+1} \right)\) is represented by Eq. 4:

$$\begin{aligned} v_{R}^{k+1} = x^{k} \end{aligned}$$
(4)

The center of the updated rule \(\left( v_{i}^{k+1} \right)\) is represented by Eq. 5:

$$\begin{aligned} v_{i}^{k+1} = v_{i}^{k} + \alpha \left( \rho _{i}^{k} \right) ^{1 - a_{i}^{k+1}} \left( x^{k} - v_{i}^{k} \right) \end{aligned}$$
(5)

where \(\alpha \in \left[ 0,1 \right]\) is the learning rate parameter.
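
As a minimal sketch, this first step (Eqs. 2–5) could be implemented as follows; it assumes inputs normalized to [0, 1], tests the arousal of the most compatible rule (the text does not pin down which rule's arousal is compared with the threshold), and uses illustrative names.

```python
import numpy as np


def participatory_step(x, centers, arousals, beta, alpha=0.01):
    """Compute rho and a for every rule, then add a rule or update the best one (Eqs. 2-5)."""
    m = x.shape[0]
    rho = 1.0 - np.array([np.linalg.norm(x - v) for v in centers]) / m      # Eq. 2
    arousals = arousals + beta * (1.0 - rho - arousals)                      # Eq. 3
    tau = beta                                                               # tau^k = beta^{k-1}
    i = int(np.argmax(rho))                                                  # most compatible rule
    if abs(arousals[i]) > tau:
        centers.append(x.copy())                                             # Eq. 4: new rule centered at x
        arousals = np.append(arousals, 0.0)
    else:
        centers[i] = centers[i] + alpha * rho[i] ** (1.0 - arousals[i]) * (x - centers[i])  # Eq. 5
    return centers, arousals
```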

In the second step, the compatibility index between rules \(\left( \rho ^{k}_{ij} \right)\) is calculated. It measures how redundant the rules are and evaluates whether an existing rule must be removed. It is represented by Eq. 6:

$$\begin{aligned} \rho ^{k}_{ij} = 1 - \frac{1}{m} \sum _{l=1}^{m} \left| v^{k}_{il} - v^{k}_{jl}\right| \end{aligned}$$
(6)

where \(\rho ^{k}_{ij} \in \left[ 0,1 \right]\) is the compatibility index between ith and jth rules at kth step for \(i \ne j\).

The compatibility index between rules is compared with a parameter \(\gamma ^{k} \in \left[ 0,1 \right]\). If \(\rho ^{k}_{ij} > \gamma ^{k}\), the less compatible rule is removed. At each step, the parameter is updated as \(\gamma ^{k} = 1 - \beta ^{k-1}\). The center of the removed rule \(\left( v_{j}^{k} \right)\) is merged with the center of the most compatible rule \(\left( v_{i}^{k} \right)\), as represented by Eq. 7:

$$\begin{aligned} v_{i}^{k} = \frac{v_{i}^{k} + v_{j}^{k}}{2} \end{aligned}$$
(7)
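
A minimal sketch of this redundancy check (Eqs. 6–7), with illustrative names and at most one merge per call (a simplifying assumption), is:

```python
import numpy as np


def merge_redundant_rules(centers, beta):
    """Remove one redundant rule and merge its center with the most similar one (Eqs. 6-7)."""
    gamma = 1.0 - beta                                                  # gamma^k = 1 - beta^{k-1}
    m = centers[0].shape[0]
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            rho_ij = 1.0 - np.abs(centers[i] - centers[j]).sum() / m   # Eq. 6
            if rho_ij > gamma:
                centers[i] = (centers[i] + centers[j]) / 2.0           # Eq. 7: merge the two centers
                del centers[j]
                return centers                                          # at most one merge per call
    return centers
```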

The third stage of the model presents the main innovation with respect to classic approaches. The \(\beta\) parameter, which controls the rate of change of arousal, is updated with the Variable Step-Size concept proposed by Harris et al. (1986). The normalized error is compared with a parameter \({\bar{\gamma }} \in \left[ 0,1 \right]\). If \(\left| {\tilde{e}}^{k}\right| > {\bar{\gamma }}\), \(\beta\) increases and so does the arousal index. Otherwise, \(\beta\) decreases and so does the arousal index. Equation 8 represents this procedure; the values \(\alpha _{VS1}\), \(\alpha _{VS2}\), and \({\bar{\gamma }}\) are determined heuristically.

$$\begin{aligned} \beta ^{k+1} = {\left\{ \begin{array}{ll} \frac{\beta ^{k}}{\alpha _{VS1}}, \ \ if \ \ \left| {\tilde{e}}^{k}\right| > {\bar{\gamma }} \\ \beta ^{k} \ \alpha _{VS2} ,\ \ otherwise \end{array}\right. } \end{aligned}$$
(8)

where \(\alpha _{VS1} \in \left] 0,1 \right[\) and \(\alpha _{VS2} \in \left] 0,1 \right[\) are the factors that increase or decrease the parameter \(\beta\), respectively.
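
A sketch of this update is shown below; clipping \(\beta\) to [0, 1] after the increase is our assumption, made so the parameter stays in its stated range.

```python
def update_beta(beta, e_norm, gamma_bar, alpha_vs1, alpha_vs2):
    """Variable step-size update of the arousal-rate parameter beta (Eq. 8)."""
    if abs(e_norm) > gamma_bar:
        return min(1.0, beta / alpha_vs1)   # large error: raise beta so rules react faster (clipping is an assumption)
    return beta * alpha_vs2                 # small error: lower beta so the rule base stabilizes
```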

The fourth stage of the model is the estimation of the \(\theta\) parameters, as proposed by Vieira et al. (2018). As shown in Eq. 1, each rule produces a local output \(\left( {\hat{y}}_{i} \right)\), represented by Eq. 9:

$$\begin{aligned} \begin{aligned} {\hat{y}}_{i}&= \sum _{j=1}^{n_{i}} \theta _{ij}^{k-1} \ \kappa \left( d_{ij}^{k}, x^{k} \right) \\ {\hat{y}}_{i}&= \sum _{j=1}^{n_{i}} \theta _{ij}^{k-1} \exp \left( - \left( \frac{\left\| d_{ij}^{k} - x^{k} \right\| }{\sqrt{2} \ \nu _{ij}^{k}} \right) ^{2} \right) \end{aligned} \end{aligned}$$
(9)

where \(\theta _{ij}^{k-1}\) is the jth consequent parameter of the ith rule at the kth step, \(\kappa \left( \cdot \right)\) is the Gaussian kernel function, \(d_{ij}^{k}\) is the jth element of the local dictionary, and \(\nu _{ij}^{k}\) is the kernel size.

The consequent parameters are estimated using an adaptive method known as Kernel Recursive Least Squares (KRLS). It is an extension of Recursive Least Squares (RLS) that maps the input data into a higher-dimensional space through a nonlinear function, where simpler (linear) solutions can be obtained. These values are calculated by determining the weight vector \(\left( \omega \right)\) that minimizes the cost function \(\left( L\left( \omega \right) \right)\). The optimization is represented by Eq. 10:

$$\begin{aligned} \min _{\omega } L \left( \omega \right) = \sum _{j=1}^{k} \left| y^{j} - \omega ^{T} \phi ^{j} \right| ^{2} + \lambda \left| \left| \omega \right| \right| ^{2} \end{aligned}$$
(10)

where \(\lambda \in \left[ 10^{-5}, 10^{-2} \right]\) is the regularization parameter used to avoid numerical instabilities in the matrix inversion (Haykin et al. 2009), \(\phi ^{j}\) is the high-dimensional mapping of the input at the jth step, and \(y^{j}\) is the output at the jth step.

The solution to the optimization of Eq. 10 is represented by Eq. 11:

$$\begin{aligned} \begin{aligned} \omega ^{k}&= \left[ \lambda \mathbf{I } + \Phi ^{k} \left( \Phi ^{k} \right) ^{T} \right] ^{-1} \Phi ^{k} Y^{k} \\ \omega ^{k}&= \Phi ^{k} \left[ \lambda \mathbf{I } + \left( \Phi ^{k} \right) ^{T} \Phi ^{k} \right] ^{-1} Y^{k} \\ \omega ^{k}&= \Phi ^{k} \left[ \lambda \mathbf{I } + \mathbf{K }^{k} \right] ^{-1} Y^{k} \\ \omega ^{k}&= \Phi ^{k} Q^{k} Y^{k}\\ \omega ^{k}&= \Phi ^{k} \theta ^{k} \end{aligned} \end{aligned}$$
(11)

where \(\mathbf{I }\) is the identity matrix, \(\Phi = \left[ \phi ^{1},\ldots ,\phi ^{k} \right]\) is the matrix whose columns are the high-dimensional mappings, and \(Y = \left[ y^{1},\ldots ,y^{k} \right] ^{T}\) is the output vector.

The kernel function allows the generalization of linear methods written in terms of inner products, represented by Eq. 12, in a Reproducing Kernel Hilbert Space (RKHS) (Scholkopf and Smola 2001), reducing the dimension of the problem and making it possible to solve Eq. 11.

$$\begin{aligned} \begin{aligned} \mathbf{K }^{k}&= \left( \Phi ^{k} \right) ^{T} \Phi ^{k} \\ \mathbf{K }^{k}&= \begin{bmatrix} \kappa \left( x^{1}, x^{1} \right) &{} \ldots &{} \kappa \left( x^{1}, x^{k} \right) \\ \vdots &{} \ddots &{} \vdots \\ \kappa \left( x^{k}, x^{1} \right) &{} \ldots &{} \kappa \left( x^{k}, x^{k} \right) \\ \end{bmatrix} \\ \mathbf{K }^{k}&= \begin{bmatrix} \mathbf{K }^{k-1} &{} g^{k} \\ \left( g^{k} \right) ^{T} &{} 1 \\ \end{bmatrix} \\ \end{aligned} \end{aligned}$$
(12)

where \(\mathbf{K }^{k}\) is the kernel matrix and \(g^{k} = \left[ \Phi ^{k-1} \right] ^{T} \phi ^{k} = \left[ \kappa \left( x^{1},x^{k} \right) , \ldots , \kappa \left( x^{k-1},x^{k} \right) \right] ^{T}\).

The inversion of the matrix \(\left[ \lambda \mathbf{I } + \mathbf{K }^{k} \right]\) is computationally expensive. Thus, the variable \(Q^{k}\) is defined to compute it recursively, as expressed by Eq. 13:

$$\begin{aligned} \begin{aligned} Q^{k}&= \left[ \lambda \mathbf{I } + \mathbf{K }^{k} \right] ^{-1} \\ Q^{k}&= \left( r^{k} \right) ^{-1} \begin{bmatrix} Q^{k-1}r^{k} + z^{k}\left( z^{k}\right) ^{T} &{} -z^{k} \\ -\left( z^{k} \right) ^{T} &{} 1 \\ \end{bmatrix} \\ \end{aligned} \end{aligned}$$
(13)

where \(z^{k} = Q^{k-1} g^{k}\), \(r^{k} = \lambda + \left[ \phi ^{k} \right] ^{T} \phi ^{k} - \left[ z^{k} \right] ^{T} g^{k}\), and \({\tilde{e}}^{k}\) is the error, that is, the difference between real value (\(y^{k}\)) and predicted value (\({{\hat{y}}}^{k}\)).

The updated solution of consequent parameters is expressed by Eq. 14:

$$\begin{aligned} \begin{aligned} \theta ^{k}&= Q^{k} Y^{k}\\ \theta ^{k}&= \begin{bmatrix} \theta ^{k-1} - z^{k}[r^{k}]^{-1}{\tilde{e}}^{k} \\ [r^{k}]^{-1}{\tilde{e}}^{k} \end{bmatrix} \\ \end{aligned} \end{aligned}$$
(14)

where \(\theta ^{k}\) is the consequent parameter vector at the kth step.
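
A compact sketch of one growing KRLS step for a single rule (Eqs. 12–14) follows. It uses a single kernel width per rule for simplicity (the model uses one width per dictionary element) and assumes the incoming sample is appended to the dictionary; the sparsification procedure discussed next decides when that actually happens.

```python
import numpy as np


def gaussian_kernel(a, b, nu):
    return np.exp(-np.linalg.norm(a - b) ** 2 / (2.0 * nu ** 2))


def krls_update(dictionary, theta, Q, x, y, nu=0.5, lam=1e-4):
    """One growing KRLS step: update Q (Eq. 13) and theta (Eq. 14) with the new sample."""
    g = np.array([gaussian_kernel(d, x, nu) for d in dictionary])   # g^k from Eq. 12
    z = Q @ g
    r = lam + gaussian_kernel(x, x, nu) - z @ g
    e = y - theta @ g                                               # error ~e^k (local output as in Eq. 9)
    n = Q.shape[0]
    Q_new = np.empty((n + 1, n + 1))
    Q_new[:n, :n] = Q * r + np.outer(z, z)
    Q_new[:n, n] = -z
    Q_new[n, :n] = -z
    Q_new[n, n] = 1.0
    Q_new /= r                                                      # Eq. 13
    theta_new = np.concatenate([theta - z * e / r, [e / r]])        # Eq. 14
    dictionary.append(x.copy())
    return dictionary, theta_new, Q_new, e
```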

It is important to note that the size of the kernel matrix increases quadratically with the amount of data and, consequently, \(Q^{k}\) grows in the same way, implying increasingly expensive operations over time. To get around this, a sparsification procedure is used as in Vieira et al. (2018), because it significantly decreases the required simulation time and memory and also increases the generalization ability (Scholkopf and Smola 2001). A local dictionary \(\left( {\mathcal {D}}^{k} \right)\) is defined for each rule, and only the dictionary belonging to the rule with the highest compatibility is altered, which means that only the subset of the most relevant samples is considered when updating the kernel matrix and the parameter vector.

In this model, the novelty criterion (Richard et al. 2008) is used because it compactly represents the knowledge in each local dictionary while having a low computational cost. The technique consists of calculating the minimum distance \(\left( \psi \right)\) between the input and each element of the dictionary with Eq. 15, together with two error scenarios: one with the new sample in the local dictionary \(\left( \varepsilon \right)\) and the other without the new sample in the local dictionary \(\left( \epsilon \right)\). If \(\psi < \delta\), the sample is coherent with the elements of the local dictionary and is not added to it. Otherwise, the sample is not coherent with the elements of the local dictionary and is added to it if \(\varepsilon < \epsilon\). The value \(\delta = \frac{\nu _{ij}^{k}}{10}\) was determined heuristically (Richard et al. 2008), where \(\nu _{ij}^{k}\) is the kernel size of the element \(d_{ij}^{k}\) nearest to \(x^{k}\).

$$\begin{aligned} \psi = \min _{(\forall d_{ij} \in {\mathcal {D}}^{k}_{i})} \left\| x^{k} - d^{k}_{ij}\right\| \end{aligned}$$
(15)

where \(\psi\) is the minimum distance between input and each element in the dictionary.
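
A sketch of this coherence test, where err_with and err_without stand for the two error scenarios \(\varepsilon\) and \(\epsilon\), is:

```python
import numpy as np


def should_add_to_dictionary(x, dictionary, kernel_sizes, err_with, err_without):
    """Novelty criterion (Eq. 15): add x only if it is not coherent with the stored
    samples and keeping it lowers the prediction error."""
    dists = np.array([np.linalg.norm(x - d) for d in dictionary])
    psi = dists.min()                          # Eq. 15
    nearest = int(dists.argmin())
    delta = kernel_sizes[nearest] / 10.0       # heuristic threshold delta = nu_ij / 10
    if psi < delta:
        return False                           # coherent with the dictionary: do not add
    return err_with < err_without              # not coherent: add only if the error improves
```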

The process of choosing an appropriate kernel size is complex: a value that is too large can make all data look similar, while a value that is too small can make all data look distinct. In this model, the recursive Levenberg–Marquardt algorithm (Ngia et al. 1998) is used. It is a nonlinear optimization method, intermediate between the gradient and Newton methods, with the best convergence properties among the three. Each element \(\nu _{ij}^{k}\) in the vector of kernel parameters \(\left( \nu _{i}^{k} = \left[ \nu _{i1}^{k}, ..., \nu _{in_{i}}^{k} \right] \right)\) is associated with an element \(d_{ij}^{k}\) in the local dictionary \({\mathcal {D}}_{i}^{k} = \left[ d_{i1}^{k}, ..., d_{in_{i}}^{k} \right]\), with the final objective of minimizing the local error function. This value is initialized with 0.5 for a new rule and with Eq. 16 for an existing rule.

$$\begin{aligned} {\nu }^{k}_{i} = {\nu }^{k-1}_{i} + {P}^{k}_{i}{\nabla }^{k}_{i}{\tilde{e}}^{k}_{i} \end{aligned}$$
(16)

where \({P}^{k}_{i} = \left[ {P}^{k-1}_{i} - \frac{{P}^{k-1}_{i}{\nabla }^{k}_{i}[{\nabla }^{k}_{i}]^{T}{P}^{k-1}_{i}}{1 + [{\nabla }^{k}_{i}]^{T}{P}^{k-1}_{i}{\nabla }^{k}_{i}}\right]\),

\({\nabla }^{k}_{i} = \Lambda _{i}^{k} \begin{bmatrix} \theta ^{k-1}_{i1} \frac{\left\| x^{k} - d^{k}_{i1}\right\| ^2}{(\nu ^{k-1}_{i1})^{3}} \kappa \left( x^{k},d^{k}_{i1}\right) \\ \vdots \\ \theta ^{k-1}_{in_{i}} \frac{\left\| x^{k} - d^{k}_{in_{i}}\right\| ^2}{(\nu ^{k-1}_{in_{i}})^{3}} \kappa \left( x^{k},d^{k}_{in_{i}}\right) \end{bmatrix}\) and

\(\Lambda _{i}^{k} \in \left[ 0,1 \right]\) is the normalized activation degree of the ith fuzzy rule.
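
A minimal sketch of this recursive Levenberg–Marquardt step, assuming the gradient vector \(\nabla_{i}^{k}\) has already been computed from the expression above, is:

```python
import numpy as np


def update_kernel_sizes(nu, P, grad, e):
    """Recursive Levenberg-Marquardt step for the kernel sizes of one rule (Eq. 16)."""
    grad = grad.reshape(-1, 1)                           # column vector nabla_i^k
    denom = 1.0 + (grad.T @ P @ grad).item()
    P = P - (P @ grad @ grad.T @ P) / denom              # update of the matrix P_i^k
    nu = nu + (P @ grad).ravel() * e                     # Eq. 16: nu_i^k = nu_i^{k-1} + P_i^k nabla_i^k e_i^k
    return nu, P
```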


The fifth stage of the model computes the global output: the forecast value is the weighted average of the local outputs, considering the contribution of each rule, which gives the model its nonlinear nature. This variable is represented by Eq. 17:

$$\begin{aligned} {\hat{y}} = \sum _{i=1}^{R} {\hat{y}}_{i} \Lambda _{i} \end{aligned}$$
(17)
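
As a sketch, the weighted combination of Eq. 17, with the activation degrees normalized to sum to one, can be written as:

```python
import numpy as np


def global_output(local_outputs, activation_degrees):
    """Global forecast: activation-weighted average of the local rule outputs (Eq. 17)."""
    lam = np.asarray(activation_degrees, dtype=float)
    lam = lam / lam.sum()                                  # normalized activation degrees Lambda_i
    return float(np.dot(np.asarray(local_outputs, dtype=float), lam))


# Example: two rules, the first twice as active as the second.
print(global_output([4.10, 4.30], [0.6, 0.3]))             # -> 4.166...
```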

4 Experimental results

To evaluate the effectiveness of the proposed method, the models presented in the previous sections are applied to the estimation of S500 and S10 diesel oil prices, at the Brazilian national level, for biweekly and monthly horizons. The data were divided into training (January 2013 to December 2018) and test (January 2019 to May 2020) sets, corresponding to 80% and 20% of the samples, respectively.
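
As a sketch, and assuming the weekly series has been exported to a hypothetical CSV file (the real data come from the ANP website), the chronological split could be done as follows:

```python
import pandas as pd

# Hypothetical file with one national-average weekly price per row.
prices = pd.read_csv("diesel_s500_weekly.csv", parse_dates=["week"]).set_index("week")

train = prices.loc["2013-01-01":"2018-12-31"]   # roughly 80% of the sample
test = prices.loc["2019-01-01":"2020-05-30"]    # roughly 20% of the sample
```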

The parameters of ePL-KRLS, SM-ePL-KRLS, ESM-ePL-KRLS, and VS-ePL-KRLS are defined as in Vieira et al. (2018) and Alves et al. (2020): \(\alpha = 0.01\), \(\beta ^0 = 0.18\), \(\gamma ^0 = 0.18\), \(\tau ^0 = 0.82\), \(\sigma = 0.05\), \(\lambda = 10^{-4}\), and \(v_{0} = 0.5\). The results are shown in Table 5.

Forecasting performance was measured with three error metrics: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Non-Dimensional Index Error (NDEI). The main difference among them is that RMSE penalizes large errors more heavily and attenuates small errors (between 0 and 1). These metrics are expressed by Eqs. 18, 19, and 20, respectively.

$$\begin{aligned} RMSE = \sqrt{ \frac{1}{n} \sum _{k=1}^{n} \left( {y}^{k} - {\hat{y}}^{k} \right) ^{2} } \end{aligned}$$
(18)

where \({y}^{k}\) is the real value and \({\hat{y}}^{k}\) is the predicted value

$$\begin{aligned} MAE = \frac{1}{n} \sum _{k=1}^{n} {\left| {y}^{k} - {\hat{y}}^{k} \right| } \end{aligned}$$
(19)

where \({y}^{k}\) is the real value and \({\hat{y}}^{k}\) is the predicted value

$$\begin{aligned} NDEI = \frac{1}{n} \sum _{k=1}^{n} {\left| \frac{{y}^{k} - {\hat{y}}^{k}}{{y}^{k}} \right| } \end{aligned}$$
(20)

where \({y}^{k}\) is the real value and \({\hat{y}}^{k}\) is the predicted value.
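
The three metrics can be computed directly from the target and forecast vectors, as in the following sketch (the NDEI line mirrors Eq. 20 as written above):

```python
import numpy as np


def forecast_errors(y_true, y_pred):
    """RMSE, MAE, and NDEI as defined in Eqs. 18-20."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))       # Eq. 18
    mae = np.mean(np.abs(y_true - y_pred))                # Eq. 19
    ndei = np.mean(np.abs((y_true - y_pred) / y_true))    # Eq. 20, as written in the text
    return rmse, mae, ndei
```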

Computational performance was measured as a floating-point number representing the current system-wide CPU utilization, expressed as a percentage. The tests were run on a computer with an Intel(R) Core(TM) i3-7020U CPU @ 2.30 GHz, 4.00 GB of installed memory (RAM, 3.87 GB usable), and a 64-bit operating system.
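
The text does not name the tool used to read CPU utilization; in Python this figure can be obtained, for example, with the psutil library, as in the sketch below (the measurement protocol shown here is an assumption).

```python
import psutil


def average_cpu_utilization(run_simulation, repetitions=50):
    """Average system-wide CPU utilization (in percent) over repeated simulation runs."""
    readings = []
    for _ in range(repetitions):
        psutil.cpu_percent(interval=None)                    # reset the internal counter
        run_simulation()                                     # one full forecasting run
        readings.append(psutil.cpu_percent(interval=None))   # utilization accumulated since the reset
    return sum(readings) / len(readings)
```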

The approach was statistically validated with the Morgan–Granger–Newbold (MGN) test (Granger and Newbold 2014) to evaluate whether the accuracies of the models are equivalent. The MGN statistic is represented by Eq. 21 and follows a Student's t distribution with \(n-1\) degrees of freedom.

$$\begin{aligned} MGN = \frac{{\hat{\varrho }}}{\sqrt{\frac{1 - {\hat{\varrho }}^{2}}{n-1}}} \end{aligned}$$
(21)

where \({\hat{\varrho }}\) is the correlation coefficient (Pearson coefficient) between the sum and the difference of the errors for a time series of size n. The null hypothesis is that the two forecast error variances are equivalent.
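
A sketch of the test for two competing forecast error series, with a two-sided p-value from the Student's t distribution (the use of scipy is an assumption), is:

```python
import numpy as np
from scipy import stats


def mgn_test(errors_a, errors_b):
    """Morgan-Granger-Newbold test on two forecast error series (Eq. 21)."""
    e1 = np.asarray(errors_a, dtype=float)
    e2 = np.asarray(errors_b, dtype=float)
    s, d = e1 + e2, e1 - e2                                  # sum and difference of the errors
    rho = np.corrcoef(s, d)[0, 1]                            # Pearson correlation coefficient
    n = len(e1)
    mgn = rho / np.sqrt((1.0 - rho ** 2) / (n - 1))          # Eq. 21
    p_value = 2.0 * (1.0 - stats.t.cdf(abs(mgn), df=n - 1))  # two-sided p-value
    return mgn, p_value
```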

Table 2 Comparison between VS-ePL-KRLS and ePL-KRLS
Table 3 Comparison between VS-ePL-KRLS and SM-ePL-KRLS
Table 4 Comparison between VS-ePL-KRLS and ESM-ePL-KRLS
Fig. 3 S500 diesel oil forecast for the biweekly horizon

Fig. 4 S500 diesel oil forecast for the monthly horizon

Fig. 5 S10 diesel oil forecast for the biweekly horizon

As can be seen in Table 5, the proposed model showed the best prediction accuracy among the evolving models for all error metrics. This result was statistically validated in Tables 2, 3, and 4 through the rejection of the null hypothesis for all comparisons between VS-ePL-KRLS and the other models, indicating that the two forecast error variances are not equivalent at a significance level of 5%. The computational performance, calculated as the average over 50 measurements of the percentage of current system-wide CPU utilization, is better for the VS-ePL-KRLS and ePL-KRLS models. The number of final rules of the proposed model was satisfactory. Figures 3, 4, 5, and 6 show the predictions and Figs. 7, 8, 9, and 10 show the rules evolution.

Fig. 6 S10 diesel oil forecast for the monthly horizon

Fig. 7 S500 diesel oil rules evolution for the biweekly horizon

Fig. 8 S500 diesel oil rules evolution for the monthly horizon

Table 5 Results

5 Conclusions

Fig. 9 S10 diesel oil rules evolution for the biweekly horizon

Fig. 10 S10 diesel oil rules evolution for the monthly horizon

In this paper, a new forecasting model is proposed to deal with the problem of fuel price forecasting: the Variable Step-Size evolving Participatory Learning with Kernel Recursive Least Squares (VS-ePL-KRLS). The model was tested on the prediction of weekly S500 and S10 diesel oil prices in Brazil, for biweekly and monthly horizons, and it appears to be a good choice to integrate into a decision support tool assisting operational, tactical, and strategic logistics planning.

The evaluation of the model considered error metrics, the average number of rules, and computational complexity. The proposed model has better accuracy (supported by the MGN test) and lower computational cost when compared to benchmark evolving models from the literature. Another benefit of the introduced model is that its structure makes the knowledge process continuous and more adaptable to data changes than its evolving counterparts.