Prediction Error and Forecasting Interval Analysis of Decision Trees with an Application in Renewable Energy Supply Forecasting

Renewable energy has become popular compared with traditional energy like coal. The relative demand for renewable energy compared to traditional energy is an important index for determining the energy supply structure, so forecasting this relative demand index has become essential. Data mining methods like decision trees are quite effective in such time series forecasting, but the theory behind them is rarely discussed in research. In this paper, some theory about decision trees is explored, including the behavior of bias, variance, and squared prediction error when using trees, together with prediction interval analysis. After that, real UK grid data are used in an interval forecasting application. In this renewable energy ratio forecasting application, the ratio of renewable energy supply over that of traditional energy can be dynamically forecasted with an interval coverage accuracy higher than 80% and a small width of around 22, which is similar to its standard deviation.


Introduction
Renewable energy such as solar and wind has been playing an integral role in sustaining power supply and relieving environmental pollution and the global warming crisis. With the increasing penetration of renewable energy, determining the amounts of renewable energy generation is critical to maintaining the energy balance and the stability and reliability of power networks. Forecasting the mixing shares of energy generation offers guidance for setting up the power generation of each energy source and ensures that the load demand of power networks is satisfied [1,2]. Data-based prediction methods, in particular machine learning methods, provide a promising solution for inferring the required ratios of energy generation, among which the decision tree is a well-recognized approach due to its satisfactory accuracy and interpretability [3][4][5][6].
Although the decision tree provides an effective forecasting method, the theory explaining when and how it performs well is rarely discussed. The required ratios of renewable energy generation can be seen as a linear time series. In this context, we explore how the tree model performs in terms of bias, variance, and prediction error. In addition, point prediction is not sufficient in time series prediction, so we also provide prediction interval choices, namely Gaussian and quantile intervals, in theory, with an application in renewable energy ratio forecasting.
The decision tree [7] is a nonparametric supervised learning method used for discovery- and prediction-oriented classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Compared to other data mining methods, the decision tree has its own advantages. (1) For causal relationships, it can deal with nonlinear models. In most cases, economics pays more attention to linear models, and a nonlinear model is often transformed into a linear one. In problems like consumer behavior analysis, the number of variables can reach tens or even hundreds, which inevitably leads to high correlation among variables; in that case, coefficients may take on misleading real-world meanings. The decision tree, however, provides variable importance ranking criteria, which helps a lot. (2) In terms of comprehensibility, it tends to be better than "black-box" models like neural networks, which means it can interpret the data structure more clearly and help readers understand the information involved. These advantages undoubtedly bring convenience to decision making in medical treatment [8][9][10], e-commerce [11][12][13], and so on.
We now explore the performance of trees when fitted to data generated from a linear model. The corresponding bias, variance, and prediction error between the fitted simplified tree and the true simple linear model will be calculated. Then, we explore how those errors vary when the linear data distribution changes. The motivation is to understand how trees perform under different distributions. Afterwards, prediction intervals are proposed using Gaussian and quantile intervals, which explains why the quantile interval is chosen in the study by Zhao et al. [14]. The simple linear model in use is

Y = f(X) + ε = α + βX + ε, (1)

where f(X) = α + βX is the true model. It is supposed throughout this paper that X ∼ U(a, b) independently and ε ∼ N(0, σ²). The uniform distribution guarantees that if the tree has k terminal nodes, the sample size in each node will be equal, which is convenient in theory and simulation analysis. Decision tree analysis under the uniform distribution assumption includes the work by Hancock [15], Jackson and Servedio [16], and White and Liu [17]. Other distributions can also be considered, but the analysis will be much more complex, as the sample size of each terminal node then depends on many parameters. The expected squared prediction error (SPE) is one of the important metrics measuring how well the trained model applies to further unseen data. As shown in Hastie et al. [18], the SPE of a regression fit f̂(X) at an input point X = x₀ is

SPE(x₀) = E[(Y − f̂(x₀))² | X = x₀] = σ² + [E f̂(x₀) − f(x₀)]² + E[f̂(x₀) − E f̂(x₀)]² = σ² + Bias²(f̂(x₀)) + Var(f̂(x₀)). (2)

In (2), the first term is the variance of the target around its true mean f(x₀) and cannot be avoided no matter how well f(x₀) is estimated, unless σ² = 0. The second term is the squared bias, the amount by which the average of the estimate differs from the true mean; the last term is the variance, the expected squared deviation of f̂(x₀) around its mean. Typically, the more complex the model f̂ is, the lower the (squared) bias but the higher the variance [18].
In Section 2, the performance of regression trees is analyzed when fitted to data which simply follow a uniform distribution, with additive Gaussian noise. When we predict this time series using simplified trees, the prediction error is calculated and decomposed into variance and other errors.
When the Gaussian or uniform effect is strong, these errors behave differently. Further exploration is conducted in Section 3, including the best tree depth with minimum prediction error and the performance of Gaussian and quantile prediction intervals under different conditions. A real interval forecasting application is conducted in Section 4, and conclusions are drawn in Section 5. All calculations were done using R [19]; 'waveslim' [20] was used for wavelet decomposition and 'ctree' [21] for CTree.

Bias-Variance Exploration
For the model in (1), the expectation of each observation is

E(Y_i) = α + β(a + b)/2,

and the variance is

Var(Y_i) = β²(b − a)²/12 + σ².

They both have no relationship to i. In that case, E(Y) = E(Y_i) and Var(Y) = Var(Y_i). Accordingly, for N observations, the expectation and variance of the average Ȳ are

E(Ȳ) = α + β(a + b)/2, Var(Ȳ) = (β²(b − a)²/12 + σ²)/N.

Decomposition in the Context of Decision Trees.
In the context of decision trees, the fitted model f̂(X) in a simplified form is

f̂(X) = Σ_{i=1}^{k} Ȳ_i · 1(X ∈ R_i),

where k is the number of terminal nodes in the tree f̂(X) and Ȳ_i is the mean of y in terminal node i. In a tree with only one terminal node (k = 1), the prediction is Ȳ, so the SPE at point x₀ is

SPE(x₀) = E(y₀ − Ȳ)² = σ² + β²((a + b)/2 − x₀)² + (β²(b − a)²/12 + σ²)/N.

Then, the mean squared prediction error (MSPE), averaging over x₀ ∈ [a, b], is

MSPE = σ² + β²(b − a)²/12 + (β²(b − a)²/12 + σ²)/N,

comprising variance

E(Var) = (β²(b − a)²/12 + σ²)/N,

and squared bias

E(Bias²) = β²(b − a)²/12.

Now the number of terminal nodes in the decision tree is extended from k = 1 to a general k; then, the MSPE, Bias², and Var for x ∈ [a, b] are equal to those for x ∈ [a, a + (b − a)/k], since the decision tree is assumed to make k equal terminal nodes with the same number of observations in each. In that case, for x ∈ [a, b] and a general k, the MSPE is

MSPE = σ² + β²(b − a)²/(12k²) + (k/N)(σ² + β²(b − a)²/(12k²)), (16)

with variance

E(Var) = (k/N)(σ² + β²(b − a)²/(12k²)),

and squared bias

E(Bias²) = β²(b − a)²/(12k²).

It is easy to see that with lower |β|, b − a, and σ² and higher N, the variance, squared bias, and MSPE will all decrease.
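The k-node decomposition above can be checked numerically. The following sketch (an illustrative Python translation, not the paper's original R code; all parameter values are arbitrary) fits the simplified equal-width tree on many independent training sets and verifies that the Monte Carlo SPE at a point matches σ² + Bias² + Var:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, a, b, sigma, N, k = 2.0, 1.0, 0.0, 10.0, 1.0, 400, 4

def fit_tree(x, y, k, a, b):
    """Simplified tree: k equal-width terminal nodes over [a, b];
    each node predicts the mean response of its training samples."""
    edges = np.linspace(a, b, k + 1)
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, k - 1)
    means = np.array([y[idx == i].mean() for i in range(k)])
    return edges, means

def predict(edges, means, x0):
    i = int(np.clip(np.searchsorted(edges, x0, side="right") - 1,
                    0, len(means) - 1))
    return means[i]

x0 = 1.3                          # fixed query point inside the first node
f0 = alpha + beta * x0            # true mean f(x0)
reps = 2000
preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(a, b, N)
    y = alpha + beta * x + rng.normal(0.0, sigma, N)
    edges, means = fit_tree(x, y, k, a, b)
    preds[r] = predict(edges, means, x0)

bias2 = (preds.mean() - f0) ** 2  # squared bias of the node prediction
var = preds.var()                 # variance of the node prediction
# Fresh noisy targets at x0: SPE averages to sigma^2 + Bias^2 + Var
y0 = f0 + rng.normal(0.0, sigma, reps)
mspe = np.mean((preds - y0) ** 2)
```

Up to Monte Carlo noise, `mspe` agrees with `sigma**2 + bias2 + var`, mirroring decomposition (2) in the tree setting.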

Optimal k to Minimize MSPE.
The ideal number of terminal nodes can be found by minimizing the MSPE in (16) with respect to k. Here k is a discrete integer, so the target k will be the integer nearest to the root of the first-order condition. Writing B = β²(b − a)²/12 and calculating the first derivative of MSPE, we get

dMSPE/dk = −2B/k³ + σ²/N − B/(k²N), (17)

and the second derivative of MSPE, 6B/k⁴ + 2B/(k³N), is always positive. Therefore, we only need to solve

σ²k³ − Bk − 2BN = 0. (18)

The real root of (18) follows from the cubic formula but is cumbersome. Having N large, so that the term 2BN dominates Bk, we can approximate k_min by

k_min ≈ (2BN/σ²)^{1/3} = (Nβ²(b − a)²/(6σ²))^{1/3}. (19)

In addition, the constraint for the root is k_min ∈ [1, N]. If k_min is not in [1, N], the MSPE might always decrease over the feasible range.
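The search for the best k can also be done purely numerically, which avoids the cubic altogether. The sketch below (illustrative Python with arbitrary parameter values; the paper's simulations were done in R) estimates the MSPE by Monte Carlo over a grid of k and takes the argmin; with N = 500, β = 1, b − a = 10, and σ = 1, the approximation (Nβ²(b − a)²/(6σ²))^{1/3} ≈ 20 suggests the minimizer lies well above k = 1:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, a, b, sigma, N = 2.0, 1.0, 0.0, 10.0, 1.0, 500

def mspe_for_k(k, reps=50, m=500):
    """Monte Carlo estimate of the MSPE of a k-node equal-width tree."""
    edges = np.linspace(a, b, k + 1)
    total = 0.0
    for _ in range(reps):
        x = rng.uniform(a, b, N)
        y = alpha + beta * x + rng.normal(0.0, sigma, N)
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, k - 1)
        means = np.array([y[idx == i].mean() if (idx == i).any() else y.mean()
                          for i in range(k)])
        x0 = rng.uniform(a, b, m)               # fresh test inputs
        y0 = alpha + beta * x0 + rng.normal(0.0, sigma, m)
        i0 = np.clip(np.searchsorted(edges, x0, side="right") - 1, 0, k - 1)
        total += np.mean((means[i0] - y0) ** 2)
    return total / reps

ks = list(range(1, 31))
errors = [mspe_for_k(k) for k in ks]
k_min = ks[int(np.argmin(errors))]              # empirical best k
```

Because the MSPE curve is flat near its minimum, the empirical `k_min` fluctuates by a few units from run to run, but it stays far from the extremes, and the k = 1 error is an order of magnitude worse.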
By substituting k_min from (19) back into (16), we get

E(Bias²)|_{k=k_min} = B/k_min² = (β²(b − a)²/12)^{1/3} (σ²/(2N))^{2/3},

and it is easy to see that, with the increase of σ and β(b − a) when N is fixed, E(Bias²) will increase. The others will be shown as figures. Accordingly, how will the ratios E(Var)/MSPE, E(Bias²)/MSPE, and σ²/MSPE vary when the parameters change? Since a, b, and β appear together, they are regarded as one parameter; for b and a, only their difference matters, so we use a = 0 and only change b. Here, k is set to the k_min calculated from (19) for the given parameters, and if k_min does not exist, the results are not shown. The results are shown in Figures 1 and 2. In Figure 1, when β²(b − a)² gets bigger, the uniform component dominates and k_min increases, as y is more accurately described by a uniform distribution; besides, the ratios of Var and Bias² over MSPE get larger while that of σ² decreases. In Figure 2, when σ² gets bigger, the Gaussian distribution plays a bigger role in the data generation and k_min decreases. The speed of this decrease slows with bigger b and β, as expected.

Simulation.
In this simulation, a simplified tree model is designed to confirm the theoretical results using simulated data. That is, when the parameters of the simulated data change, the distributions of X and y will also change. The question is, how will the statistics of Var, Bias², MSPE, and k_min change accordingly?
In the simplified tree, X is evenly split into k intervals R_i, i = 1, 2, . . . , k. For specific k, a, b, N, α, and β, we calculate the statistics of MSPE, Var, and Bias² for the i-th interval from simulated data. Thus, for the i-th interval, the x range is

R_i = [a + (i − 1)(b − a)/k, a + i(b − a)/k],

with n_i observations falling in interval i, defining n_0 = 0 and n_k = N − Σ_{j=0}^{k−1} n_j.

(i) Step 1: for the data (x, y) in R_i, we train a model whose prediction is ȳ, the averaged value of y in R_i.
(ii) Step 2: repeat Step 1 over independently simulated training sets to obtain the distribution of the fitted node means.
(iii) Step 3: simulate one x₀ uniformly from the x range R_i. We calculate SPE(x₀), Var(x₀), and Bias²(x₀) for this specific x₀.
(iv) Step 4: simulate s values of y_j using x₀.
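These steps can be sketched for a single interval as follows (illustrative Python with arbitrary parameter values; the paper's simulations were done in R, and the interval index is 0-based here):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta, a, b, sigma, N, k, s = 2.0, 1.5, 0.0, 6.0, 1.0, 600, 3, 1000

i = 1                                   # pick one interval (0-based index)
lo = a + i * (b - a) / k                # R_i = [lo, hi]
hi = a + (i + 1) * (b - a) / k
n_i = N // k                            # equal node sizes under uniform X

# Steps 1-2: repeatedly train the node model (the mean of y in R_i)
preds = np.empty(s)
for r in range(s):
    x = rng.uniform(lo, hi, n_i)
    y = alpha + beta * x + rng.normal(0.0, sigma, n_i)
    preds[r] = y.mean()

# Step 3: one x0 drawn uniformly from R_i
x0 = rng.uniform(lo, hi)
f0 = alpha + beta * x0

# Step 4: s simulated responses at x0 give SPE, Var, and Bias^2
y0 = f0 + rng.normal(0.0, sigma, s)
spe = np.mean((preds - y0) ** 2)
var = preds.var()
bias2 = (preds.mean() - f0) ** 2
```

As in the theory, the resulting `spe` agrees with `sigma**2 + var + bias2` up to simulation noise, and the node-mean variance is small because each node holds N/k observations.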

Prediction Interval
Instead of a point prediction alone, a prediction interval is also desirable, especially for time series with high variance. If both the point prediction and the prediction interval can be provided, we can be more confident in the prediction. This study also helps us decide the proper prediction interval method for decision-tree-based regression problems. The Gaussian-based prediction interval and the quantile interval are compared under different parameter distributions.

Probability Function of Y.
Since our linear model Y = α + βX + ε is the sum of uniform and Gaussian components, the probability density of Y is

P_Y(y) = ∫_a^b (1/((b − a)σ)) φ((y − α − βx)/σ) dx. (28)

By letting t = (α + βx − y)/σ, we obtain

P_Y(y) = (1/(β(b − a))) [Φ((α + βb − y)/σ) − Φ((α + βa − y)/σ)]. (29)

Now we have the density of Y as (29). However, P_Y(y) is in a complex form, meaning that the parameters are not easily solvable in theory from a given value of P_Y(y).
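The uniform-Gaussian convolution density can be evaluated directly through the normal CDF. A minimal sketch (illustrative Python, standard library only, assuming β > 0) that also checks the density integrates to one and is symmetric about E(Y):

```python
import math

def phi_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def density_Y(y, alpha, beta, a, b, sigma):
    """p_Y(y) for Y = alpha + beta*X + eps with X ~ U(a, b) and
    eps ~ N(0, sigma^2): the uniform-Gaussian convolution,
    expressed through the normal CDF Phi (beta > 0 assumed)."""
    return (phi_cdf((alpha + beta * b - y) / sigma)
            - phi_cdf((alpha + beta * a - y) / sigma)) / (beta * (b - a))

# Sanity checks: unit mass and symmetry about alpha + beta*(a+b)/2 = 4
params = dict(alpha=2.0, beta=1.0, a=0.0, b=4.0, sigma=1.0)
grid = [-6.0 + 0.001 * j for j in range(20000)]   # covers [-6, 14)
total = sum(density_Y(y, **params) for y in grid) * 0.001
```

The Riemann sum `total` is 1 to within the grid error, confirming (29) is a proper density.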

Prediction Interval as a Gaussian Distribution.
If we want the prediction interval, say [y₁, y₂], for Y at the (1 − p) level, the theoretical way is to obtain y₁ and y₂ from the equations

∫_{−∞}^{y₁} P_Y(y) dy = p/2, ∫_{y₂}^{∞} P_Y(y) dy = p/2.

However, the integral of Φ is not analytically solvable without approximating Φ by other suitable expressions, and the results would be quite complex. If we know the parameter values, then y₁ and y₂ can easily be found numerically.
From Figure 5, if the uniform (Gaussian) distribution plays the main role, then Y can be approximately described by a uniform (Gaussian) distribution. Under the conditions that β is not too large, σ is not too small, and k is 1 (only one interval), we approximate the distribution of Y by a Gaussian distribution N(μ_Y, σ_Y²) with

μ_Y = α + β(a + b)/2, σ_Y² = β²(b − a)²/12 + σ².

Then, the prediction interval under the 95% criterion for this Gaussian distribution is around

[μ_Y − 1.96σ_Y, μ_Y + 1.96σ_Y].

Then, for a general k, the prediction interval becomes a typical Gaussian prediction interval built within each terminal node.
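This Gaussian approximation is easy to check by simulation. In the sketch below (illustrative Python; parameters are chosen arbitrarily so that σ dominates the uniform spread), the interval μ_Y ± 1.96σ_Y covers close to 95% of simulated responses:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta, a, b, sigma = 2.0, 0.5, 0.0, 2.0, 2.0   # Gaussian effect dominates

mu_Y = alpha + beta * (a + b) / 2.0                      # E(Y)
sd_Y = np.sqrt(beta**2 * (b - a)**2 / 12.0 + sigma**2)   # sd(Y)
lo, hi = mu_Y - 1.96 * sd_Y, mu_Y + 1.96 * sd_Y          # 95% Gaussian interval

x = rng.uniform(a, b, 100_000)
y = alpha + beta * x + rng.normal(0.0, sigma, 100_000)
coverage = np.mean((y >= lo) & (y <= hi))
```

Repeating this with a large β(b − a) instead pushes the coverage away from the nominal level, which is exactly the regime where the approximation is warned against above.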

Prediction Simulation Using Gaussian Prediction Interval and Quantile Interval.
In this simulation, we explore the performance of Gaussian prediction intervals and quantile intervals under different parameter combinations. The parameters include σ, b − a, β, and k. When the other parameters are fixed, a higher σ means a stronger Gaussian effect, in which case the Gaussian prediction interval may work well. When β²(b − a)² is large, the uniform distribution plays the bigger role, and the Gaussian prediction interval may not work so well. Both the Gaussian prediction interval and the quantile interval are influenced by the observation size of the terminal node: when the sample size is large, they both perform stably, but when it is small, their performance differs. The Gaussian prediction interval in use is

[L, U] = [ŷ − c · RMSPE, ŷ + c · RMSPE],

where c is 1.96 and RMSPE is the root mean squared prediction error estimated from the training data in each terminal node. The quantile interval [L, U] comes from the 0.025 and 0.975 quantiles of each terminal node of the training data.
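The two constructions can be compared directly. In the sketch below (illustrative Python, a single terminal node, parameters chosen arbitrarily so the uniform part dominates), the quantile interval keeps roughly 95% coverage while the Gaussian interval over-covers at the cost of extra width, matching the discussion above:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta, a, b, sigma, n = 2.0, 3.0, 0.0, 4.0, 0.5, 2000  # uniform dominates

# One terminal node covering all of [a, b] (k = 1)
x = rng.uniform(a, b, n)
y = alpha + beta * x + rng.normal(0.0, sigma, n)
pred = y.mean()                               # the node's point prediction

# Gaussian interval: pred +/- 1.96 * RMSPE on the training data
rmspe = np.sqrt(np.mean((y - pred) ** 2))
g_lo, g_hi = pred - 1.96 * rmspe, pred + 1.96 * rmspe

# Quantile interval: empirical 2.5% and 97.5% quantiles of the node
q_lo, q_hi = np.quantile(y, [0.025, 0.975])

# Coverage on a large independent test sample
xt = rng.uniform(a, b, 50_000)
yt = alpha + beta * xt + rng.normal(0.0, sigma, 50_000)
g_cov = np.mean((yt >= g_lo) & (yt <= g_hi))
q_cov = np.mean((yt >= q_lo) & (yt <= q_hi))
```

Here `q_cov` lands near 0.95 while `g_cov` is close to 1 with a wider interval, illustrating why quantile intervals are preferred when the data are far from Gaussian.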
(i) Step 1: training data generation. Using the given parameters α, β, a, b, σ, and N, data are generated according to the model

y = α + βx + ε, x ∼ U(a, b), ε ∼ N(0, σ²).

Therefore, we also get the true values for Y.
(ii) Step 2: RMSPE and quantiles from training data. From this training data, the trained model, RMSPE, and quantiles are calculated as in the following steps.

For the training data A (the rest of the data, B, is the test data), we sort the data x in ascending order, so y is rearranged following x; A is then divided into k successive folds of roughly equal size, making a total of N observations. The number of observations in fold i (i = 1, 2, . . . , k) is n_i, defining n_0 = 0.

For the i-th fold in A, given x_i and y_i, the predicted value in the tree context is ȳ_i, since the predicted value of a tree model is the averaged response value of each terminal node. Samples split into a terminal node take that node's averaged value as their predicted value.
When the model for each i is trained as model A_i, the predicted values for y in A are ŷ. Then, the RMSPE for the training data is

RMSPE = sqrt((1/N) Σ_{j=1}^{N} (y_j − ŷ_j)²).

(iii) Step 3: test data generation and model testing.
Using the same parameters α, β, a, b, σ, and N as in Step 1, test data are generated according to the same model. Then, the test data B are put into model A, and the coverage is computed as

coverage = #{y ∈ B : L ≤ y ≤ U} / |B|.

(iv) Step 4: repeating Steps 1 to 3.
Repeat Steps 1 to 3 s times to get an averaged coverage.
Using parameters a = 0, α = 2, and s = 200, the results are shown in Figure 6. The results show that the quantile interval coverages are closer to the 0.95 reference line for fixed σ, b, and β. The Gaussian prediction interval is only close to 0.95 coverage when σ is large; otherwise, its coverage exceeds 0.95 at the cost of a wider interval. When k is chosen as the best k_min, the coverages get closer to the 0.95 reference line as σ increases for both quantile and Gaussian prediction intervals. However, when the uniform distribution effect gets stronger, the coverages all move away from 0.95. Accordingly, when the number of observations in each terminal node is large and the data distribution is not obviously Gaussian, quantile intervals are suggested; when the data follow an obviously Gaussian distribution, Gaussian prediction intervals are recommended.

Real Application
We have explored the performance of decision trees under different circumstances. A real application is conducted in this section. The data come from UK Gridwatch (http://www.gridwatch.templar.co.uk/): the demand data of the grid and the supply data of each energy source. The time series runs from 2011 to 2020, making a total of 953824 observations with a record every 5 minutes. The details are shown in Figure 7.
From the figure, we can see that grid demand changes periodically, as expected, with daily and seasonal peaks and valleys, while its general trend changes little. Some kinds of energy, like wind and biomass, have increased a lot in supply over these years; they will be used more frequently in the future than traditional energy like coal, as they are more environmentally friendly. We construct a metric, ratio, to measure the ratio of the other energy supply over that of coal. After deleting observations with missing or zero coal values, 847922 observations are left, as shown in Figure 8.
We average the time series ratio from a frequency of 5 minutes to a daily basis, leaving 2954 observations. A forecasting method is then applied to ratio to help us know how much renewable energy will be needed in the near future.
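The cleaning and 5-minute-to-daily aggregation can be sketched with pandas. The frame below is synthetic, and the column names ('coal', 'other') are assumptions for illustration, not the actual Gridwatch schema:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
# Synthetic frame mimicking a 5-minute supply export; 'coal' and 'other'
# are assumed column names, not the real Gridwatch field names.
idx = pd.date_range("2011-01-01", periods=2000, freq="5min")
df = pd.DataFrame({"coal": rng.uniform(5.0, 15.0, 2000),
                   "other": rng.uniform(50.0, 250.0, 2000)}, index=idx)

df = df[df["coal"] > 0]              # drop records with zero/missing coal
ratio = df["other"] / df["coal"]     # supply of other sources over coal
daily = ratio.resample("D").mean()   # average the 5-minute ratio to daily
```

The 2000 five-minute records span parts of seven calendar days, so `daily` holds one averaged ratio per day; on the real data this step takes the series from 847922 records down to the daily series used for forecasting.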
The interval forecasting method we use is our previously designed method ctreeone (Zhao et al. [14]), which applies the tree method ctree in a dynamic interval forecasting context. The only parameter we change is the time gap, set to 7 for weekly dynamic forecasting, leaving the other parameters unchanged. The resulting intervals give a range that, in most cases, the real future value will fall in, similar to the 95% confidence interval within which the fitted value is most likely to be covered.
Interval forecasting provides not only the point forecasting results but also the prediction interval the predicted point belongs to. Small changes of the ratio happen often and have little influence on the energy supply and demand system, so no action is needed in that circumstance. When the predicted ratio changes a lot, beyond a preset limit, an alarm may be raised to help the system adapt to the new circumstance, for example, by producing more renewable energy in advance to meet the instant demand. The interval forecasting model provides such an alerting system for adjusting energy production.
The results are shown in Table 1 and Figure 9. The coverage and width strike a good balance; that is, a higher coverage costs a relatively wider interval. We end with a coverage of 80.31% and a suitable width of 22.95, which is close to the standard deviation of ratio, 19.78.

Conclusion
In this paper, the data are constructed using a simple model that combines Gaussian and uniform distributions. We explore the squared prediction error in the context of trees and decompose that error into bias, variance, and irreducible error. The bias decreases as the tree gets bigger; however, for the squared prediction error and the variance, the relationship is not monotonic. We also calculate the best tree depth with minimum mean squared prediction error. When the Gaussian effect dominates, the best tree depth decreases; when the uniform effect dominates, it increases. In both circumstances, as the corresponding effect strengthens, the mean squared prediction error, variance, and bias all increase.
After that, two options are given for the prediction interval: the Gaussian prediction interval and the quantile interval. When the Gaussian distribution is obviously dominant, Gaussian prediction intervals are suggested. Otherwise, quantile intervals are suggested, which is also why quantile intervals are chosen as the prediction intervals in our regression application, although both perform poorly when the uniform distribution is very strong. When the number of observations in the terminal node is small, both interval constructions perform poorly in terms of coverage.
In the real data application, we applied our method to the UK grid energy supply and demand data to forecast the ratio of renewable energy supply over that of coal, obtaining good forecasting results: 80.31% interval coverage with an interval width of 22.95. The method can be extended to models other than decision trees. We use the decision tree model for interval forecasting, but in practice other models can also be considered. For example, Hall et al. [22] used multiple nonlinear regression to forecast and analyze changes in climate and weather dynamics and proposed a simple model averaging approach to reduce model and prediction uncertainty. Besides decision trees, other dynamic regression models can also be considered: Gu et al. [23] used a dynamic regression model to predict the dynamics of a specific space weather index and proposed a new approach for prediction uncertainty analysis using point-cloud model parameters, and dynamic regression models have also been applied to social dynamic behavior modeling and analysis [24].
In future research, the model can be applied to more kinds of datasets to test its generalization ability. In the simulation, nonlinear models can also be considered in place of the linear model to test tree performance.
Data Availability

The source code and simulation data in the theory exploration are available from the corresponding author upon request. The real data in the application can be openly accessed from the Elexon Portal (cited June 2020) [25].

Conflicts of Interest
The authors declare that they have no conflicts of interest.