Comprehensive Modeling in Predicting Biodiesel Density Using Gaussian Process Regression Approach

In this study, four Gaussian process regression (GPR) models with different kernel functions are proposed for estimating biodiesel density as a function of pressure, temperature, molecular weight, and the normal melting point of fatty acid esters. Comparing the actual values with the GPR outputs shows that all four models are accurate, with the rational quadratic GPR model performing best: RMSE = 0.47, MSE = 0.22, MRE = 0.04, $R^2$ = 1, and STD = 0.3. In addition, for the first time, this study analyzes how the input parameters affect the biodiesel density. According to this analysis, pressure has the greatest effect on the target values among the input parameters, with a relevancy factor of 0.59. This study can serve as a valuable tool for chemical and petroleum engineers working on environmental protection and recovery improvement.


Introduction
Recently, energy has become a deep and complex issue in many countries [1][2][3]. Rising fossil fuel consumption has increased the concentration of greenhouse gases in the atmosphere and, with it, the earth's temperature [4,5]. To address these problems, several agreements have been signed to reduce greenhouse gas emissions [6]. Among the many reasons for the growth of renewable energy resources, the main ones are the aim of controlling pollutant emissions and the implementation and credibility of these agreements [7,8]. Biodiesel is mainly produced from vegetable oils and animal fats [9]. It is a mixture of fatty acid alkyl esters [10], obtained by catalytic transesterification of vegetable oils and animal fats with alcohols such as ethanol and methanol [10]. Biodiesel is a clean-burning fuel, and it is not toxic because of its low sulfur concentration. As a result, it contributes minimally to greenhouse gas emissions and, unlike fossil fuels, has a less negative effect on the environment [11,12]. Additionally, biodiesel can be used in diesel engines alone or blended with fossil diesel because it improves engine life [13], although it costs more than petroleum-based diesel and has a higher viscosity. Other disadvantages are lower oxidation stability, a higher cloud point, and a lower energy content in comparison with petroleum-based diesel [14]. Experimental and modeling studies to establish these properties have become important as interest in biofuels and their properties has grown [15,16]. For example, the density of biodiesel is an important property that plays a significant role in thermophysical processes.
The investigation of density is an important topic for diesel fuels because density bears on many technical and economic aspects of fuel usage as well as on environmental effects [17][18][19].
Many investigations in the literature have addressed the estimation of fossil fuel properties, and different advances have been made. For example, Queimada and his colleagues proposed a new method for estimating several fuel properties, including surface tension and viscosity [20]. Barati-Harooni and his colleagues estimated the interfacial tension between oil and brine in terms of pressure, salinity, and temperature with a least-squares support vector machine [21]. Rostami and his colleagues correlated the interfacial tension of hydrocarbon and water using genetic programming (GP) with an R-squared of 0.910 [22]. Pratas and his colleagues estimated the density of biodiesels using Kay's model [17]. Gahk and his colleagues predicted alkane density with an average absolute relative deviation of 60% [23]. Miraboutalebi and his coworkers estimated cetane numbers with a root mean squared error (RMSE) of 2.530 and an R-squared of 0.950 by implementing an artificial neural network (ANN) [24]. In addition, Mostafaei predicted the cetane number in terms of fatty acid methyl ester (FAME) by developing an adaptive neuro-fuzzy inference system [25].
For biodiesel properties, there are some experimental studies in the literature. For example, Pratas et al. measured the density of ten biodiesel samples at atmospheric pressure and temperatures between 278.150 and 373.170 K [17]. Li and his colleagues obtained the viscosities and densities of three mixtures of methylcyclohexane and fatty acid methyl esters at atmospheric pressure and temperatures between 293.150 and 324.150 K [26]. Aitbelale and his colleagues experimentally determined the density of soybean oil biodiesel at temperatures between 298.150 and 393.150 K and pressures up to 140.0 MPa [27]. Aitbelale et al. also measured surface tension values for three different biodiesels at a temperature and pressure of 473.0 K and 7.0 MPa [27].
The approaches in the aforementioned literature suffer from insufficient accuracy and computational difficulty, which has drawn further attention to this problem. At the same time, artificial intelligence methods have received much attention as a means of modeling different processes precisely [28][29][30][31][32][33]. Accurate thermophysical properties are crucial for process design and operation. Developing a precise and low-cost approach for estimating biodiesel properties is therefore worthwhile, most importantly because these properties are needed in clean energy applications. For this purpose, the density of biodiesel fuels has been investigated thoroughly. The main purpose of this research is to develop a GPR model for estimating biodiesel density. This algorithm is better suited to estimating individual properties than other models because it is insensitive to outliers. Because the reliability and accuracy of the collected dataset are important in developing this model, a thorough analysis has been carried out, for the first time, to identify suspected data points of biodiesel density. In addition, the effects of the input variables on the output have been investigated statistically as an important part of this work.

Gaussian Process Regression.
In recent years, neural networks, a branch of artificial intelligence, have made significant progress in solving engineering problems. The main disadvantage of this method is overfitting, which can be mitigated by adjusting the weights [34,35].
Setting these parameters is, however, complex and difficult, and a conventional mathematical method called the Bayesian network (BN) is used to solve this problem. This method is probabilistic and uses Bayesian inference to calculate probabilities [36]. The network represents each variable graphically, with variables connected by arcs, and each variable expresses its knowledge content as a probability distribution. The probabilistic nature of BNs is of great importance for assessing uncertainty: a wider distribution means more uncertainty. In the Bayesian approach to neural networks, the prior placed over the weights induces a complex prior distribution over functions [37].
Gaussian process regression was developed to address the abovementioned problem. This method is nonparametric. The advantages of the GPR algorithm include quantifying the uncertainty of predictions and working well on small datasets. The GPR method also has important advantages over Bayesian networks, including simplicity, nonlinearity, easy generalization, and applicability to multiple dimensions. In the GPR method, the model parameters are determined from the training samples.
The GP model is obtained by combining prior knowledge with the current modeling process and integrating real laboratory data.
One salient difference between older machine learning methods and GPR is that GPR does not search for a single best approximation to the experimental points; instead, it is a fully Bayesian kernel method. GPR works by obtaining posterior distributions over models. In the following, we explain how GP regression is constructed [38].
Randomly selected points $L = \{x_{L,i}, y_{L,i}\}$ and $T = \{x_{T,i}, y_{T,i}\}$, $i = 1, 2, 3, \dots, n$, which are the learning and test data drawn from a particular distribution, are assumed [39]. As an important point, the model parameters are adjusted based on the learning data [40].
The input and target data, $x$ and $y$, respectively, are assumed to be affected by noise.
The general formula of GPR is described as follows [41]:
In the abovementioned equation, $X_L$ and $Y_L$ denote the independent and objective variables of the training data, and $\varepsilon \sim N(0, \sigma^2_{noise} I_n)$ denotes the observation noise with an independent Gaussian distribution, where $\sigma^2_{noise}$ is the noise variance and $I_n$ is the identity matrix [42].
Note that GP treats the output $f(x)$ as random, so the following equation is obtained:
In the above equation, $y_T$ and $x_T$ represent the targets and independent variables, and $f(x)$ represents a Gaussian process with covariance function $k(x, x')$ and mean function $m(x)$.
In practice, the exact determination of $m(x)$ can be complicated, so $m(x)$ is taken to be zero to simplify the calculations, which gives the following [43]:
From Eqs. (2) and (5), we can conclude the following equation:
A better representation of the variables mentioned above can be provided as follows [44]:

BioMed Research International
The following joint Gaussian distribution is obtained by combining Eqs. (7) and (8):
Therefore, the conditional distribution of $Y_T$ is obtained by Gaussian conditioning as below:
The values of $\Sigma_T$ and $\mu_T$ are as follows:
In GPR modeling, this result allows the output for the test data to be predicted from the independent variables and the training data. From the above equations, it can be concluded that the covariance function and the mean function, together with the Gaussian distribution, define a GP.
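For reference, the steps outlined above correspond to the standard zero-mean GPR result, which can be summarized as follows. This is a sketch in common textbook notation, not a reproduction of the numbered equations of this paper. The kernel matrices $K(\cdot, \cdot)$ are notation assumed here, obtained by evaluating $k(x, x')$ on the indicated inputs, and $Y_T$ is taken as the noise-free function value at $X_T$ (for noisy test targets, $\sigma^2_{noise} I$ would be added to $K(X_T, X_T)$).

```latex
\begin{align*}
Y_L &= f(X_L) + \varepsilon, \qquad \varepsilon \sim N\!\left(0, \sigma^2_{noise} I_n\right), \\
f(x) &\sim GP\!\left(0, k(x, x')\right), \\
Y_L &\sim N\!\left(0, K(X_L, X_L) + \sigma^2_{noise} I_n\right), \\
\begin{bmatrix} Y_L \\ Y_T \end{bmatrix}
  &\sim N\!\left(0,
  \begin{bmatrix}
    K(X_L, X_L) + \sigma^2_{noise} I_n & K(X_L, X_T) \\
    K(X_T, X_L) & K(X_T, X_T)
  \end{bmatrix}\right), \\
Y_T \mid X_T, X_L, Y_L &\sim N\!\left(\mu_T, \Sigma_T\right), \\
\mu_T &= K(X_T, X_L)\left[K(X_L, X_L) + \sigma^2_{noise} I_n\right]^{-1} Y_L, \\
\Sigma_T &= K(X_T, X_T) - K(X_T, X_L)\left[K(X_L, X_L) + \sigma^2_{noise} I_n\right]^{-1} K(X_L, X_T).
\end{align*}
```

The mean $\mu_T$ serves as the point prediction, and the diagonal of $\Sigma_T$ gives the predictive variance that GPR offers as an uncertainty measure.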
To better predict the targets with the developed GPR model, the selection of the kernel function in the training phase is of great importance. Therefore, in this research, four different conventional kernel functions, with modifications, are compared to find the best one.
These functions are described as follows [45]:
(i) Rational quadratic covariance function
In the rational quadratic covariance function, $\sigma^2$, $l$, and $\alpha > 0$ represent the variance, the length scale, and the scale-mixture weight, respectively.
(ii) Squared exponential covariance function
(iii) Exponential covariance function
(iv) Matern covariance function
In the Matern covariance function, $\Gamma$ represents the gamma function, $\nu$ and $l$ are positive parameters, and $K_\nu$ is the modified Bessel function [46].
The exponential and squared exponential covariance functions are important special cases of the Matern function. The Matern function reduces to the exponential function when $\nu = 0.5$ and to the squared exponential function as $\nu$ tends to infinity. Since the Matern function has a greater degree of freedom, it can also perform better than the other two [47].
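The four kernel families named above can be sketched in plain Python as functions of the distance $r = |x - x'|$. The forms below are common textbook parameterizations and are an assumption here, since the paper's own equations are not reproduced in this text. The function and argument names (`sigma`, `length`, `alpha`, `nu`) are illustrative, and the Matern function is implemented only for the half-integer orders that have closed forms.

```python
import math


def rational_quadratic(r, sigma=1.0, length=1.0, alpha=1.0):
    """sigma^2 * (1 + r^2 / (2 * alpha * length^2)) ** (-alpha)."""
    return sigma**2 * (1.0 + r**2 / (2.0 * alpha * length**2)) ** (-alpha)


def squared_exponential(r, sigma=1.0, length=1.0):
    """sigma^2 * exp(-r^2 / (2 * length^2))."""
    return sigma**2 * math.exp(-(r**2) / (2.0 * length**2))


def exponential(r, sigma=1.0, length=1.0):
    """sigma^2 * exp(-|r| / length)."""
    return sigma**2 * math.exp(-abs(r) / length)


def matern(r, nu, sigma=1.0, length=1.0):
    """Matern kernel, closed forms for nu in {0.5, 1.5, 2.5} only."""
    s = abs(r) / length
    if nu == 0.5:
        # Identical to the exponential kernel.
        return sigma**2 * math.exp(-s)
    if nu == 1.5:
        a = math.sqrt(3.0) * s
        return sigma**2 * (1.0 + a) * math.exp(-a)
    if nu == 2.5:
        a = math.sqrt(5.0) * s
        return sigma**2 * (1.0 + a + a**2 / 3.0) * math.exp(-a)
    raise ValueError("this sketch only implements nu in {0.5, 1.5, 2.5}")
```

With these definitions, `matern(r, 0.5)` coincides with `exponential(r)`, and increasing `nu` moves the Matern values toward `squared_exponential(r)`, which illustrates the limiting behavior described in the text. In this parameterization, the rational quadratic kernel also approaches the squared exponential kernel as `alpha` becomes large.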
Although the GPR method is nonparametric, the learning stage tunes the parameters of the above kernel functions.

Data Collection.
A dataset containing 2117 real density points has been collected from different sources. The references for these data have been reported elsewhere [48].

Results and Discussion
As mentioned before, four GPR algorithms with Matern, rational quadratic, exponential, and squared exponential kernel functions are used here to estimate the density of biodiesels. To evaluate the precision of these algorithms, the following statistical parameters are determined: the coefficient of determination ($R^2$), mean squared error (MSE), root mean squared error (RMSE), mean relative error (MRE), and standard deviation (STD). As shown in Table 1, the $R^2$ values of the rational quadratic, Matern, exponential, and squared exponential forms are all equal to 1. According to the other statistical parameters, the rational quadratic form performs better than the other kernel functions, with RMSE, MSE, MRE, and STD of 0.47, 0.22, 0.04, and 0.30, respectively. These values demonstrate the ability of the rational quadratic form to forecast density values.
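A minimal sketch of how such statistics can be computed is given below. The definitions used are common ones and are assumptions, because the paper's formulas are not reproduced in this text: MRE is taken as the mean absolute relative error in percent, and STD as the sample standard deviation of the prediction errors. The function name `regression_metrics` and the sample values are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MRE (%), R^2, and STD for paired value lists."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]

    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # Assumed definition: mean absolute relative error, in percent.
    mre = 100.0 / n * sum(abs(e) / abs(a) for a, e in zip(actual, errors))

    # Coefficient of determination: 1 - SS_res / SS_tot.
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot

    # Assumed definition: sample standard deviation of the errors.
    mean_error = sum(errors) / n
    std = math.sqrt(sum((e - mean_error) ** 2 for e in errors) / (n - 1))

    return {"MSE": mse, "RMSE": rmse, "MRE": mre, "R2": r2, "STD": std}


# Hypothetical density values (not taken from the paper's dataset).
measured = [880.0, 870.0, 860.0, 850.0]
estimated = [881.0, 869.0, 861.0, 849.0]
metrics = regression_metrics(measured, estimated)
```

Because RMSE is the square root of MSE, the reported values of 0.47 and 0.22 are consistent with each other (0.47 squared is about 0.22).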
Comparing the $R^2$ values in this table with the results published by Abooali et al., who used SGB and GP models to predict biodiesel density, shows that the models presented here have a higher ability to predict the target values, because the $R^2$ values of those two models were 0.99988 and 0.99635, respectively [48].
To allow a better comparison of all these models, Figure 1 shows the experimental and estimated density values together. The figure shows good agreement between the real density values and the GPR outputs. The cross (regression) plot of predicted versus real density is shown in Figure 2.
The quality of the GPR outputs is indicated by how closely the density data lie along the bisector line in this plot. Moreover, Figure 3 shows the relative deviation between the real density and the GPR outputs.
The accuracy of the density data affects the validity of the models. In this study, a large number of data points have been used, and some of them may contain errors from laboratory measurements. Such suspected data should be identified and separated from the other data points, which requires a well-established strategy to improve the model accuracy. Here, the suspected data are identified by the Leverage method. In this method, after the residual values are determined, a Hat matrix is created as follows [49]:
where $A$ is an $i \times j$ dimensional matrix and $i$ and $j$ are defined for the model parameters and training points, respectively. Then, the critical leverage limit is computed from $i$ and $j$ as follows:
In the Williams plot shown in Figure 4, the residuals are plotted against the Hat values to separate the suspected data. Here, the green line marks the leverage limit and the two red lines mark the residual limits, so the data points located outside these lines are considered suspected data. For the exponential, Matern, squared exponential, and rational quadratic forms, 14, 1, 1, and 1 points are considered suspected data, respectively, among 2217 points.
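For orientation, the Leverage method is commonly written as below. This is a sketch of the usual formulation and not necessarily the exact expressions of this paper. The notation is an assumption: $A$ is taken as an $n \times p$ matrix with $n$ training points and $p$ model parameters, which differs from the $i$ and $j$ labels used in the text.

```latex
\begin{align*}
H &= A \left(A^{\top} A\right)^{-1} A^{\top}, \\
H^{*} &= \frac{3\,(p + 1)}{n}.
\end{align*}
```

The diagonal entries of $H$ are the Hat values plotted on the horizontal axis of a Williams plot. A point is usually flagged as suspected when its Hat value exceeds $H^{*}$ or when its standardized residual falls outside a cut-off such as $\pm 3$; whether the red lines in Figure 4 use this cut-off is an assumption.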
In this study, the suggested GPR algorithms establish a relationship between density and the inputs, and sensitivity analysis is used to show how each input affects the output. To determine the most influential variable for density, the relevancy factor, $r$, which ranges from -1 to 1, is used. The larger the absolute value of $r$, the stronger the effect of that input on the density. Negative and positive $r$ values indicate an inverse and a direct relation to density, respectively. The $r$ is calculated as follows [50]:
$$r = \frac{\sum_{i} (X_{k,i} - \bar{X}_k)(Y_i - \bar{Y})}{\sqrt{\sum_{i} (X_{k,i} - \bar{X}_k)^2 \sum_{i} (Y_i - \bar{Y})^2}}$$
where $X_{k,i}$ and $Y_i$ are the inputs and outputs, respectively, and $\bar{X}_k$ and $\bar{Y}$ are the averages of the inputs and outputs, respectively. Figure 5 shows that higher temperature and melting point lead to lower density. Also, the temperature is considered the most influential parameter for density. In addition, pressure and molecular weight have a direct relationship with this target.
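The relevancy factor can be illustrated with a short Python sketch. It assumes that $r$ is the Pearson correlation coefficient between one input $X_k$ and the output $Y$, which matches the symbols defined above. The function name `relevancy_factor` and the sample values are hypothetical and are not taken from the paper's dataset.

```python
import math


def relevancy_factor(x, y):
    """Pearson correlation between one input x and the output y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the squared-deviation sums.
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical example: density falling linearly with temperature.
temperature = [280.0, 300.0, 320.0, 340.0]
density = [890.0, 875.0, 860.0, 845.0]
r_temperature = relevancy_factor(temperature, density)
```

For this made-up, perfectly linear data the factor is negative with magnitude one, the strongest possible inverse relation. Real inputs give values between -1 and 1, and the inputs can then be ranked by the absolute value of $r$.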

Conclusion
In this work, four kernel functions, namely the Matern, rational quadratic, exponential, and squared exponential functions, have been used in GPR algorithms to compute the density of biodiesels. To prepare and validate these algorithms, a large database containing 2217 actual data points was gathered. It is concluded that the proposed models are highly precise in predicting the real data. The rational quadratic GPR model has shown better performance than the other models. For this model, the calculations show that RMSE = 0.47, MSE = 0.22, MRE = 0.04, $R^2$ = 1, and STD = 0.3. Other analyses also confirmed the accuracy of this model, which indicates that this attractive and simple model can be used in biodiesel-related industries.

Data Availability
The data are stated in the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.