Prediction of gross calorific value from coal analysis using decision tree-based bagging and boosting techniques

The calorific value of a fuel is one of the crucial parameters for grading its burning capability. The bomb calorimeter has historically been used to measure coal's gross calorific value (GCV). However, for many years, engineers and scientists have tried to estimate coal's GCV without a bomb calorimeter, using only laboratory-derived ultimate and/or proximate analyses, to eliminate tedious and time-consuming laboratory work. In this study, Extra trees, Bagging, Decision tree, and Adaptive boosting models are developed for the first time for coal's GCV modeling. In addition, the prediction and computational efficiency of previously applied decision tree-based algorithms, such as Random forest, Gradient boosting, and XGBoost, are investigated. Well-established empirical models, namely the Schuster, Mazumdar, Channiwala and Parikh, Parikh et al., and Central Fuel Research Institute of India formulae, are examined to compare their efficiency with the newly developed algorithms. Proximate and ultimate analysis parameters are ranked based on their significance in GCV modeling. The studied models are tuned using an exhaustive grid search technique. Statistical indexes, such as explained variance (EV), mean absolute error (MAE), coefficient of determination (R²), mean squared error (MSE), maximum error, minimum error, and mean absolute percentage error (MAPE), are used to critique these models. To accomplish these goals, 7430 data points containing ten coal features (ash, moisture, fixed carbon, volatile matter, hydrogen, carbon, sulfur, nitrogen, oxygen, and GCV) are selected from the U.S. Geological Survey Coal Quality (COALQUAL) database. It is found that, due to their simplicity and location-specific constraints, the empirical models could not correlate proximate and/or ultimate analyses with GCV. The bagging and boosting techniques tested here performed well, with coefficients of determination (R²) above 0.97. The XGBoost model outperforms the other tree-based algorithms with the highest coefficient of determination (R² of 0.9974) and the lowest error values (MSE of 14703.3, max_error of 1027.2, MAE of 89.2, MAPE of 0.009). The ranking of the studied models (highest to lowest) based on their performance is XGBoost, Extra trees, Random forest, Bagging, Gradient boosting, Decision tree, and Adaptive boosting. The correlation heatmap and scatterplots used here clearly indicate that oxygen and carbon are the most significant parameters for GCV modeling, whereas volatile matter and sulfur are the least essential. The strategy suggested in this research can aid engineers/operators in obtaining a rapid and accurate determination of the GCV from a few coal features, thus lessening complicated, tedious, expensive, and time-consuming laboratory efforts.


Introduction
Coal is an incredibly heterogeneous natural material encountered on Earth. It may contain as many as 76 of the 98 naturally occurring chemical elements, most of which appear only in trace amounts [1]. Assessments such as calorific value, ultimate analysis, and proximate analysis determine the energy quality of coal. The weight percentages of ash (C_A), moisture (C_M), fixed carbon (F_C), and volatile matter (V_m) are measured by proximate analysis [2]. Together, these four components account for 100% of coal's chemical constituents. The elements reported in an ultimate analysis are hydrogen (C_H), carbon (C_C), sulfur (C_S), nitrogen (C_N), and oxygen (C_O) [3,4]. These five components together determine the quantity of air necessary for the complete combustion of coal [5]. The other usual laboratory assessment is the gross calorific value (GCV), a fundamental property when coal is to be employed as a fuel. GCV is coal's most influential rank property and depends on the composition of its minerals and macerals [6]. The most exact GCV measurement is obtained from a coal sample in a laboratory using an adiabatic bomb calorimeter, but the process is expensive, time-consuming, and tedious [4,7]. These limitations have motivated the formulation of different empirical relationships to estimate GCV from the ultimate and/or proximate analyses. Table 1 lists the most widely used empirical models for estimating the GCV from ultimate and/or proximate analyses. These linear algebraic correlations have several advantages: (i) they are simple and fast methods for estimating the GCV, thus saving the cost and labor required by its laboratory measurement; (ii) they can be used to model the performance of coal-based combustion, pyrolysis, and gasification processes; and (iii) they give algebraic equations linking GCV with the ultimate and/or proximate analyses, which can be used to investigate how these analyses affect process performance [7]. However, the prime disadvantage of these non-universal algebraic formulations is their poor accuracy in estimating GCV from proximate and/or ultimate analyses [8].
Limited research has been conducted over the past few decades to determine the applicability of data-driven modeling in forecasting GCV from proximate and/or ultimate analyses (Table 2). Several multiple variable regression (MVR) models have been developed to increase the accuracy of GCV prediction [4,24,25]. The application of tree-based intelligent methods to GCV prediction remains limited. Tree-based intelligent methods, namely the regression tree, random forest, gradient boosting tree, and XGBoost, were applied by Bui et al. [8], Matin and Chelgani [26], Ahmed et al. [27], and Chelgani [28], respectively (Table 2). They applied these techniques to different datasets with different sample sizes. However, comprehensive fine-tuning is absent in these studies.
Bagging and boosting are helpful techniques for improving any regressor or classifier. The bagging mechanism makes a predictor more robust and balanced, generalizes exceptionally well on the testing dataset, and can address overfitting issues [39]. Boosting, on the other hand, can convert a mediocre algorithm into a strong learner via iterative weight transfer. XGBoost popularized the boosting technique by achieving computational efficiency on large-scale datasets and through other optimizations, namely a sparsity-aware split-finding algorithm, a weighted quantile sketch, and a cache-aware block structure [40]. There is an increasing body of literature on modeling complex input-output relationships using bagging and boosting techniques [41-47]. However, a comprehensive examination and fine-tuning of Extra trees, Bagging, Decision tree, and Adaptive boosting are absent in GCV prediction from coal analysis.
The above research review suggests that more trustworthy soft computing methods are still needed to improve the estimation accuracy of GCV modeling. The present investigation is conducted to fill this research gap by achieving several novelties, which can be listed as follows:
• Intelligent statistical models, namely Extra trees, Bagging, Decision tree, and Adaptive boosting, are developed for the first time in the field of coal's GCV modeling.
• A thorough comparison of the studied empirical and intelligent tree-based models is performed in terms of prediction and computational efficiency.
• The ultimate and proximate analyses parameters are ranked based on their importance in GCV modeling.
The rest of the document is segmented into the following sections: In Section 2, the foundations of the empirical and smart tree-based techniques used in this study are critically addressed. Along with a summary of the tools and procedures used here, Section 2 also includes information on data collection and preprocessing. Section 3 presents findings and discussions based on critical information. A summary of the conclusions and suggestions is provided in Section 4.

Data collection and treatment
The nine proximate and ultimate analyses parameters, i.e., C_A, C_M, F_C, V_m, C_H, C_C, C_N, C_O, and C_S, taken from the U.S. Geological Survey Coal Quality (COALQUAL) database, open file report 97-134 [48], are the input variables for this research. Initially, 7430 data points are considered to examine the effectiveness of the suggested strategies in predicting the output variable, GCV. The sampling techniques and laboratory experiments conducted to obtain the GCV, proximate analysis, and ultimate analysis data are documented at http://energy.er.usgs.gov/products/databases/CoalQual/index.htm. The analysis results presented in the database are on an as-received basis. In the data preprocessing stage, 848 samples having zero values in one or more features are removed from the initially considered 7430 data points. The remaining 6582 samples are used to meet the research objectives. Fig. 1 and Table 3 present the results of the statistical analysis of the data samples.
The proposed empirical and smart tree-based modeling is performed in a Python environment. For smart tree-based modeling, Python's scikit-learn package is utilized. The major steps of this research are shown in Fig. 2.
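As an illustration, a minimal preprocessing sketch in Python/scikit-learn is given below. The file name coalqual.csv and the column names are hypothetical (the COALQUAL export must be mapped to them first), and the 80/20 train-test split is an assumption, since the split ratio is not reported here.

```python
# A minimal preprocessing sketch; file name and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

FEATURES = ["ash", "moisture", "fixed_carbon", "volatile_matter",
            "hydrogen", "carbon", "nitrogen", "oxygen", "sulfur"]
TARGET = "gcv_btu_lb"

df = pd.read_csv("coalqual.csv")                      # 7430 raw samples
df = df[(df[FEATURES + [TARGET]] != 0).all(axis=1)]   # drops the 848 zero-valued rows
# df should now hold the 6582 samples used in the study.

# The split ratio is not reported in the paper; 80/20 is assumed here.
X_train, X_test, y_train, y_test = train_test_split(
    df[FEATURES], df[TARGET], test_size=0.2, random_state=42)
```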

Statistical indexes
The performance of the investigated models is evaluated using different metrics, i.e., explained variance (EV), coefficient of determination (R²), mean squared error (MSE), maximum error (max_error), minimum error (min_error), mean absolute percentage error (MAPE), and mean absolute error (MAE). The definitions of these performance indicators are as follows (Equations (1)-(7)):

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \quad (1)$$

$$EV = 1 - \frac{Var(y - \hat{y})}{Var(y)} \quad (2)$$

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \quad (3)$$

$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \quad (4)$$

$$MAPE = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|y_i - \hat{y}_i\right|}{\max(\epsilon, \left|y_i\right|)} \quad (5)$$

$$max\_error = \max_i \left|y_i - \hat{y}_i\right| \quad (6)$$

$$min\_error = \min_i \left|y_i - \hat{y}_i\right| \quad (7)$$

where ŷ_i refers to the predicted value of the i-th sample; y_i denotes the corresponding actual value of the i-th sample; ȳ is the mean of the actual values; ŷ stands for the estimated target output; y represents the corresponding (correct) target output; and ϵ is an arbitrarily small positive number used to prevent undefined outcomes when y is zero.
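A sketch of how these indexes can be computed with scikit-learn follows; min_error has no built-in helper in scikit-learn, so it is computed directly, and the helper name score_model is hypothetical.

```python
# A minimal sketch of the seven statistical indexes used to critique the models,
# assuming y_test and y_pred are array-likes of actual and predicted GCVs.
import numpy as np
from sklearn.metrics import (explained_variance_score, r2_score,
                             mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, max_error)

def score_model(y_test, y_pred):
    """Return the performance indexes defined in Equations (1)-(7)."""
    return {
        "EV": explained_variance_score(y_test, y_pred),
        "R2": r2_score(y_test, y_pred),
        "MSE": mean_squared_error(y_test, y_pred),
        "MAE": mean_absolute_error(y_test, y_pred),
        "MAPE": mean_absolute_percentage_error(y_test, y_pred),
        "max_error": max_error(y_test, y_pred),
        # scikit-learn has no min_error helper, so compute it directly
        "min_error": np.min(np.abs(np.asarray(y_test) - np.asarray(y_pred))),
    }
```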

Empirical modeling for GCV estimation
Two types of empirical correlations are available for coal's GCV evaluation. The first type of relationship is dedicated solely to coal; the other type deals with various fuels, including solid, liquid, and gaseous ones. Goutal [9] was the first researcher to propose a coal-specific empirical relationship associating GCV with F_C and V_m. Schuster [10] correlated the GCV with only V_m, while Spooner [11] linked the GCV with the V_m and C_O of coal. Mazumdar [12] developed a relationship among GCV, V_m, and C_M. Mazumdar [13] later formulated a more rigorous empirical correlation in which the mineral matter of coal, the percentages of C_C and C_H, and the theoretical C_O requirement for the complete combustion of coal are needed to estimate GCV. The Central Fuel Research Institute of India (CFRII) proposed a correlation among the useful heating value (UHV), C_A, and C_M. Chaudhury and Biswas (2002) modified the CFRII formulae since the grading and pricing of coal globally rely on GCV, not UHV. The modified version of the CFRII model incorporates two new constants (a and b) whose values differ based on the geographical location of the coal. Based on an extensive review of the empirical correlations available for estimating the GCV of different types of fuels, Channiwala and Parikh [14] developed an algebraic relationship showing that the GCV is associated with the content of C_C, C_H, C_S, and C_N. The most recent relationship was formulated by Parikh et al. [15], linking GCV with F_C, V_m, and C_A. The seven widely used empirical correlations presented in Table 1, namely Schuster [10], Spooner [11], Mazumdar [12], Mazumdar [13], Channiwala and Parikh [14], Parikh et al. [15], and the CFRII formulae, are utilized in this study to calculate coal's GCV. After calculating Q (GCV) from the empirical equations, the unit of Q is converted from MJ/kg to BTU/lb.
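To illustrate, a sketch of two of the Table 1 correlations is given below, using the coefficients as published by Channiwala and Parikh [14] and Parikh et al. [15] (weight-percentage inputs, results in MJ/kg converted to BTU/lb); the function names are hypothetical, and the remaining formulae follow the same pattern.

```python
# A sketch of two of the Table 1 correlations; inputs are weight percentages.
MJ_PER_KG_TO_BTU_PER_LB = 429.923  # 1 MJ/kg = 429.923 BTU/lb

def gcv_channiwala_parikh(C, H, S, O, N, A):
    """GCV (BTU/lb) from ultimate analysis plus ash, per [14]."""
    q_mj_kg = 0.3491*C + 1.1783*H + 0.1005*S - 0.1034*O - 0.0151*N - 0.0211*A
    return q_mj_kg * MJ_PER_KG_TO_BTU_PER_LB

def gcv_parikh(FC, VM, A):
    """GCV (BTU/lb) from proximate analysis, per [15]."""
    q_mj_kg = 0.3536*FC + 0.1559*VM - 0.0078*A
    return q_mj_kg * MJ_PER_KG_TO_BTU_PER_LB
```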

Smart tree-based modeling for GCV prediction
A brief description of the theoretical concept and structure of the smart tree-based statistical models is presented in this section. In addition, the methodology explaining the major steps involved in tree-based modeling (Fig. 3) is described.

Decision tree
A decision tree is a binary tree that recursively splits a dataset until pure leaf nodes are left. It has three types of nodes: a parent (root) node, internal nodes, and leaf nodes. A parent node has no incoming link, internal nodes have both incoming and outgoing links, and leaf nodes have only an incoming link. Various decision tree algorithms, including C4.5, C5.0, ID3, and CART, are employed for various applications. In the CART operation, the dataset is split into two sections based on an impurity analysis. The splitting procedure continues until no impurity is left in the dataset. For classification, Gini impurity and entropy analyses are performed to measure the impurity, whereas for regression, variance is used as the impurity measure [49,50]. Finding the globally optimal binary partition that minimizes the sum of squares is computationally infeasible; hence, a greedy algorithm is used to perform the binary partitioning task [51]. For prediction, the predictor space is divided into high-dimensional boxes of different sizes, and the values in each box are averaged to obtain the prediction for any observation that falls into that region.
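A minimal sketch of fitting a CART regressor for GCV with scikit-learn follows; it assumes the X_train/y_train split from the preprocessing sketch, and the max_depth value is a placeholder rather than a tuned value from Table 6.

```python
# A minimal CART regression sketch for GCV prediction.
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(
    criterion="squared_error",  # variance-based impurity for regression
    max_depth=12,               # placeholder; tuned by grid search in practice
    random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
```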

Bagging, random forest, and extra trees
Bagging is an ensemble learning method. It trains the same base algorithm multiple times, each time on a bootstrap sample of the training data, i.e., a randomly picked subset of the same size sampled with replacement. An aggregation function (the mode for classification or the average for regression) then combines all the predictors into one. Aggregation lowers the model's variance, which is a bottleneck of the decision tree method [51-53].
Another ensemble approach that makes use of the bagging technique is the random forest. However, unlike generic Bagging, it uses only decision trees as its base estimator. Another significant difference is that, when splitting a node, it searches for the best feature within a random subset of features rather than over the whole feature space. Choosing the best feature from an arbitrary subset introduces more randomness into the model, resulting in reduced variance in exchange for a slight increase in bias [51,54,55].
The random forest is further optimized by introducing even more randomness into the bagging ensemble. When splitting a node, this variant chooses a random threshold for each candidate feature rather than searching for the best threshold. As the threshold is randomly picked, the computational complexity is reduced, which is a strength of this model compared with other bagging ensembles. It also uses the whole training set instead of bootstrap samples. Due to this extreme randomness, the model is called extremely randomized tree ensembles, or extra trees for short [56].
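The three bagging-style ensembles can be instantiated as sketched below (scikit-learn ≥ 1.2 is assumed for the estimator keyword); n_estimators=100 is a placeholder, not a tuned value.

```python
# A minimal sketch of the three bagging-style ensembles, reusing X_train/y_train.
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              ExtraTreesRegressor)
from sklearn.tree import DecisionTreeRegressor

models = {
    # Generic bagging: bootstrap samples of any base estimator
    "bagging": BaggingRegressor(estimator=DecisionTreeRegressor(),
                                n_estimators=100, random_state=42),
    # Bagged trees + best split searched over a random feature subset
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    # Whole training set + random split thresholds instead of the best ones
    "extra_trees": ExtraTreesRegressor(n_estimators=100, bootstrap=False,
                                       random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
```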

Boosting techniques -adaptive boosting, gradient boosting, and XGBoost
Boosting is a popular ensemble technique that turns several weak learners into one strong learner. Unlike Bagging, where the trees/predictors are trained in parallel and then combined to form a single predictor, Boosting trains predictors sequentially in an iterative process, each using information from the previous iteration [52,57]. Numerous boosting algorithms are available, the most popular of which are gradient boosting, adaptive boosting, and extreme gradient boosting (XGBoost).
In adaptive boosting, all samples are initially given the same weight, set to 1/m (where m is the total number of samples), making each sample equally valuable. The first predictor is trained on the training set, and its weighted error rate is determined. The mathematical expressions for the weighted error rate (for the j-th predictor), the predictor weight (for the j-th predictor), and the weight update rule (for i = 1, 2, …, m) are presented in Equations (8), (9), and (10), respectively [55]. The weights of all instances are then normalized. Finally, a new predictor is trained using the revised weights, and the process is repeated.
$$r_j = \frac{\sum_{i=1,\ \hat{y}_j^{(i)} \neq y^{(i)}}^{m} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}} \quad (8)$$

$$\alpha_j = \eta \log\frac{1 - r_j}{r_j} \quad (9)$$

$$w^{(i)} \leftarrow \begin{cases} w^{(i)}, & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\ w^{(i)}\exp(\alpha_j), & \text{if } \hat{y}_j^{(i)} \neq y^{(i)} \end{cases} \quad (10)$$

where ŷ_j^(i) refers to the j-th predictor's prediction for the i-th instance, η denotes the learning rate, w^(i) stands for the weight of the i-th instance, r_j represents the weighted error rate of the j-th predictor, and α_j is the predictor's weight.
In contrast to adaptive boosting, which updates the instance weights at every iteration, the gradient boosting technique fits each new predictor to the residual errors of the prior predictor. After the first predictor makes its prediction, the pseudo-residual is calculated, and that residual error is used as the label/target for the next iteration using the same inputs/features. The residual error is reduced iteratively by passing it on to the next predictor. For a regression task, the first predictor is simply the mean value of the target, which is used in the first iteration to calculate the residual error [52,58].
The XGBoost model was first created by Tianqi Chen and Carlos Guestrin of the University of Washington. They built a scalable, end-to-end tree-boosting system that incorporates several core innovations. Due to its system and algorithmic optimizations, this model can process data much faster than existing solutions and scales to billions of instances in memory-constrained situations. Chen and Guestrin developed a novel sparsity-aware split-finding algorithm that can handle sparse data, a theoretically justified weighted quantile sketch for handling weighted data, and a cache-aware block structure that alleviates the slowdown of split finding. These algorithmic and data structure improvements make this model a go-to algorithm for a vast and diverse set of datasets [59].
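A sketch instantiating the three boosting models follows; it assumes the xgboost package is installed, and all hyperparameter values shown are placeholders to be tuned by the grid search described later.

```python
# A minimal sketch of the three boosting models, reusing X_train/y_train.
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor  # requires: pip install xgboost

boosters = {
    # Reweights training instances after each iteration (Equations 8-10)
    "adaboost": AdaBoostRegressor(n_estimators=200, learning_rate=0.5,
                                  random_state=42),
    # Fits each new tree to the previous predictor's pseudo-residuals
    "gradient_boosting": GradientBoostingRegressor(n_estimators=200,
                                                   learning_rate=0.1,
                                                   random_state=42),
    # Regularized gradient boosting with system-level optimizations [59]
    "xgboost": XGBRegressor(n_estimators=200, learning_rate=0.1,
                            max_depth=6, random_state=42),
}
for name, model in boosters.items():
    model.fit(X_train, y_train)
```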

Ranking of proximate and ultimate analyses features
To assess the significance of the proximate and ultimate analysis features in predicting GCV, a correlation heatmap of the input features and GCV (Fig. 4) and scatter plots of the ultimate and proximate analyses components versus the target (GCV) (Fig. 5) are developed.
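A sketch of how such visuals can be produced with pandas, seaborn, and matplotlib is given below; it assumes the df, FEATURES, and TARGET objects from the preprocessing sketch.

```python
# A minimal sketch of the feature-significance visuals described above.
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap of all features and the target (Fig. 4 style)
corr = df[FEATURES + [TARGET]].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.tight_layout()
plt.show()

# Scatter plots of each feature versus GCV (Fig. 5 style)
fig, axes = plt.subplots(3, 3, figsize=(12, 10))
for ax, feature in zip(axes.ravel(), FEATURES):
    ax.scatter(df[feature], df[TARGET], s=2, alpha=0.3)
    ax.set(xlabel=feature, ylabel="GCV (BTU/lb)")
plt.tight_layout()
plt.show()
```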

GCV estimation using empirical models
All the mathematical relationships reviewed in Section 2.3 presume a linear correlation between GCV and the ultimate and/or proximate analyses constituents. Scatter plots produced by mapping the ultimate and proximate analyses components against the GCVs (Fig. 5) are used to test this inherent assumption. Fig. 5 shows a clear linear association between GCV and the variables C_M, F_C, C_C, C_N, and C_O. However, the considerable scatter between GCV and V_m, C_A, C_H, and C_S suggests a non-linear dependence. These findings imply that the inherent assumption of the empirical relationships is incorrect and that the GCVs derived from the empirical correlations are therefore erroneous. This can also be seen in Table 4: the GCV data estimated using the empirical models match the actual GCV poorly. The coefficient of determination is negative, while the error values are substantial for all the evaluated empirical correlation models.

GCV estimation using smart tree-based models
The values of R² on the testing sets of all the smart tree-based models, i.e., random forest, extra trees, decision tree, bagging, gradient boosting, adaptive boosting, and XGBoost, are very close to 1, as shown in Table 5. The variance of R² across these models is also quite low. Other performance parameters, such as MSE, MAE, and max_error, show a moderate degree of variance across the models, which helps to rank them. For adaptive boosting, the values of MAE, MSE, min_error, max_error, and MAPE are higher than those of the other smart tree-based models. Considering all the performance parameters, XGBoost and extra trees take precedence over all the other models. The hierarchy of the tree-based models based on R², EV, MSE, MAE, MAPE, max_error, and min_error is as follows: XGBoost, extra trees, random forest, bagging, gradient boosting, decision tree, and adaptive boosting. This ranking is obtained by plurality voting across the statistical indexes.
The optimum hyperparameter values used in the smart tree-based models are shown in Table 6, along with their grid-search run times. Although XGBoost has more combinations of tuning parameters than random forest and extra trees, it takes less time to run all those combinations. Hence, its algorithmic optimization of data handling is quite visible here.
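The exhaustive grid search can be sketched as follows; the parameter grid shown is illustrative and is not the grid reported in Table 6.

```python
# A minimal sketch of exhaustive grid-search tuning, shown for XGBoost.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1, 0.2],
}
search = GridSearchCV(XGBRegressor(random_state=42), param_grid,
                      scoring="r2", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```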

Discussions
The present study develops intelligent statistical models, namely Extra trees, Bagging, Decision tree, and Adaptive boosting, for the first time in coal's GCV modeling to improve the estimation accuracy. A comprehensive comparison of prediction and computational efficiency between well-established empirical models and the newly developed bagging and boosting algorithms is also a novelty of this research. Another scientific contribution of this study is the ranking of the ultimate and proximate analyses parameters based on their significance in GCV modeling.
Well-established empirical models, such as the Schuster [10], Mazumdar [12], Mazumdar [13], Channiwala and Parikh [14], Parikh et al. [15], and CFRII formulae, yield negative values of R². According to the definition of R² given in Equation (1), when the ratio of SS_res to SS_tot is greater than 1, the value of R² becomes negative. This happens when the model fits the original data very poorly. Hence, the negative R² values obtained from the studied empirical models indicate that they perform worse than simply predicting the mean of the data. The erroneous GCVs obtained suggest that the studied coal samples are incompatible with the inherent assumptions and foundations of these empirical models. In general, empirical models cannot capture the non-linear correlations between GCV and the ultimate and/or proximate constituents. Hence, empirical modeling is inappropriate for predicting GCV. The smart tree-based models employed here all perform efficiently, with R² values above 0.97. However, the XGBoost model outperforms the other smart tree-based models with the highest coefficient of determination (R² of 0.9974) and the lowest error values (MSE of 14703.3, max_error of 1027.2, MAE of 89.2, MAPE of 0.009). As the XGBoost- and extra trees-predicted GCVs fit the actual GCVs almost perfectly (Fig. 6), they can be proposed as suitable GCV prediction strategies.
All the studied empirical models provide very low R² values and very high error measures. Hence, the well-established empirical algorithms are inappropriate for GCV modeling. The smart tree-based models are highly capable of predicting GCV, as their coefficients of determination are high and their errors low.
The feature ranking reveals that C_O, C_C, and C_M are the most significant input parameters for GCV prediction, whereas C_A, V_m, and C_S are the least important variables. This variable ranking can assist operators in selecting a few features rather than conducting a comprehensive coal analysis to predict GCV, consequently saving tedious and time-consuming laboratory effort.
The strategy suggested in this research can aid engineers/operators in obtaining a rapid and accurate determination of the GCV from a few coal features, thus lessening laboratory efforts and significantly lowering experimental costs. In addition, the developed decision tree-based models can assist geochemists/engineers in precisely evaluating coal's practical energy content and accurately characterizing coal mines. This study will likely open the door to introducing decision tree-based bagging and boosting techniques into fuel engineering studies, where innovative connectionist models can be applied instead of well-established empirical algorithms to recognize inherently complex links in input-output data, find optimum patterns, and forecast target parameters.

Conclusions
i) The existing empirical models applied to derive coal's GCV from ultimate and proximate analyses are linear in nature. A clear linear relationship is observed between GCV and C_M, F_C, C_C, C_N, and C_O. However, a considerable non-linear dependence of GCV on V_m, C_A, C_H, and C_S is also observed. The prediction performance of all the empirical models is abysmal and erroneous; hence, empirical modeling is inappropriate for predicting GCV.
ii) The studied tree-based models are not very distinct from each other. The R² values for these models are very high, suggesting an excellent match between the actual and predicted GCVs.
iii) The comprehensive comparison among the studied methods demonstrates that the hierarchy of the applied models, based on statistical measures and computational efficiency, is as follows: XGBoost, Extra trees, Random forest, Bagging, Gradient boosting, Decision tree, Adaptive boosting, Schuster [10], Spooner [11], Mazumdar [12], Mazumdar [13], Channiwala and Parikh [14], Parikh et al. [15], and the CFRII formulae.
iv) C_O, C_C, and C_M are the most crucial input variables for GCV prediction, whereas C_A, V_m, and C_S are the least important.


Fig. 2. Major steps involved in this research to find an efficient GCV prediction approach.

Table 1. Most widely used empirical models applied to estimate the GCV from proximate and/or ultimate analyses.

Table 2. Studies conducted to predict GCV from proximate and/or ultimate analyses using soft computing and intelligent statistical modeling.


Table 3. Statistical analysis of the input and output data (as-received basis).

Table 4. Statistical performance indexes for the empirical models.

Table 5. Statistical indexes of training and prediction efficiency for the smart tree-based models.

Table 6. Hyperparameter tuning results of the smart tree-based models.