Modeling of H2S solubility in ionic liquids: comparison of white-box machine learning, deep learning and ensemble learning approaches

In the context of gas processing and carbon sequestration, an adequate understanding of the solubility of acid gases in ionic liquids (ILs) under various thermodynamic conditions is crucial. Hydrogen sulfide (H2S) is a poisonous, combustible, and acidic gas that can cause environmental damage, and ILs are good candidate solvents for gas separation processes. In this work, a variety of machine learning techniques, including white-box machine learning, deep learning, and ensemble learning, were established to determine the solubility of H2S in ILs. The white-box models are the group method of data handling (GMDH) and genetic programming (GP), the deep learning approach is the deep belief network (DBN), and extreme gradient boosting (XGBoost) was selected as the ensemble approach. The models were established utilizing an extensive database with 1516 data points on H2S solubility in 37 ILs over wide pressure and temperature ranges. Seven input variables were used in these models: temperature (T), pressure (P), critical temperature (Tc), critical pressure (Pc), acentric factor (ω), boiling temperature (Tb), and molecular weight (Mw); the output was the solubility of H2S. The findings show that the XGBoost model, with statistical parameters such as an average absolute percent relative error (AAPRE) of 1.14%, root mean square error (RMSE) of 0.002, standard deviation (SD) of 0.01, and a determination coefficient (R2) of 0.99, provides the most precise calculations of H2S solubility in ILs. The sensitivity assessment demonstrated that temperature and pressure have the strongest negative and strongest positive effects on H2S solubility in ILs, respectively. The Taylor diagram, cumulative frequency plot, cross-plot, and error bar chart all demonstrated the high effectiveness, accuracy, and reliability of the XGBoost approach for predicting H2S solubility in various ILs. The leverage analysis shows that the majority of the data points are experimentally reliable and only a small number of data points fall beyond the applicability domain of the XGBoost paradigm. Beyond these statistical results, some chemical structure effects were evaluated. First, it was shown that lengthening the cation alkyl chain enhances H2S solubility in ILs. Second, it was shown that higher fluorine content in the anion leads to higher solubility. These phenomena were confirmed by both the experimental data and the model results. By connecting solubility data to the chemical structure of ILs, the results of this study can further assist in identifying appropriate ILs as solvents for H2S in specialized processes (based on the process conditions).

One study designed a Gaussian process regression (GPR) model to investigate the relationship between temperature, nanoparticle volume concentration, and the specific heat capacities of nanofluids and base liquids. The results demonstrated high estimation precision, as reflected in the mean absolute error (MAE) and root mean square error (RMSE), which were 0.05% and 0.06% of the average experimental Cp,nf, respectively. Mosavi et al.66 implemented two bagging models and two boosting models for groundwater potential estimation with 339 groundwater resources, where the bagging models outperformed the boosting models. Mosavi et al.67 compared four approaches; according to the findings, the XGBoost approach, with statistical parameters of RMSE = 0.00079 and MAE = 0.00062 in the training step and RMSE = 0.00069 and MAE = 0.00054 in the testing phase, was affirmed to be the most precise of the four suggested approaches. Ullah et al.73 suggested three methods, including K-means clustering, XGBoost, and t-SNE, to forecast short-term rock-burst risk utilizing 93 rock-burst patterns with six authoritative characteristics from micro-seismic monitoring events. The outcomes of the suggested models provide an excellent benchmark for accurate prediction of future short-term rock-burst levels. Shahani et al.74 applied machine learning techniques, namely the random forest algorithm (RFA), long short-term memory (LSTM), and a simple recurrent neural network (RNN), to estimate the drilling rate index (DRI) of rocks; the LSTM method showed the best estimation of DRI, with the lowest RMSE and highest R2 in the training (0.13416, 0.999) and testing (0.19479, 0.998) steps, respectively. Most intelligent models are black boxes, meaning the underlying mathematical relationship is not clear and explicit, and they generally require special software such as Matlab or Python for their application. White-box models, on the other hand, generate explicit mathematical formulas that can easily be applied with a simple calculator and need no special software[75][76][77]. In the present research, two white-box methods, genetic programming (GP) and the group method of data handling (GMDH), have been used; they are explained in detail below.
The main objective of this work is to establish exact models for computing H2S solubility in ILs using wider datasets and powerful techniques. To construct the genetic programming (GP), group method of data handling (GMDH), deep belief network (DBN), and extreme gradient boosting (XGBoost) models, a large collection of 1516 experimental data points from 37 ILs is employed. Seven thermodynamic features of ILs are chosen as inputs: T, P, Tc, Pc, ω, Mw, and boiling temperature (Tb). Because this research aims to compare the proficiency of intelligent methods with equations of state (EOSs), the selected parameters were chosen to match the features required by the EOSs, so that an accurate comparison among the different methods is possible. A sensitivity assessment is employed to determine the relative effect of the inputs on H2S/ILs solubility. Finally, the validity of the models is evaluated utilizing Williams' plot. Figure 1 shows the steps of this study. One limitation of this research, like all studies based on artificial intelligence techniques, is its data-driven nature, which makes predictions reliable only within the range of the experimental data, although the model may be applied to new data. Therefore, the bigger the database, the more comprehensive and practical the model can be.

Data collection and preparation
An extensive databank is essential for developing a predictive system with a high degree of trustworthiness and comprehensiveness. To create the models, 1516 experimental data points for the H2S solubility of 37 ILs were acquired from the literature over broad temperature (from 293.15 K) and pressure ranges. The training and test subsets comprise 80% and 20% of the whole data set, respectively. The test data were used to see how well the generated system predicts the output for new input parameters, while the training data were used to construct the original system. Features including T, P, Tc, Pc, ω, Tb, and Mw were used as inputs, while H2S solubility, expressed as a mole fraction, was the output. Table 1 highlights key details about the ILs utilized to create the systems. Table 2 also shows the statistical evaluation information, such as the maximum, mode, minimum, and mean of the seven inputs and the H2S solubility employed in the current research. The chemical structures of the diverse anions and cations used in the current work are shown in Fig. 2.
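The split described above can be reproduced with a short script. This is a minimal sketch, assuming the collected literature data sit in a CSV file; the file name "h2s_ils.csv" and the column names are hypothetical placeholders, not artifacts of the original study.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

FEATURES = ["T", "P", "Tc", "Pc", "omega", "Tb", "Mw"]  # the seven model inputs
TARGET = "x_H2S"                                        # H2S solubility (mole fraction)

data = pd.read_csv("h2s_ils.csv")                       # hypothetical file of collected data
X, y = data[FEATURES], data[TARGET]

# Random 80/20 split: 1212 training points and 304 test points out of 1516.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,
                                                    random_state=42)
```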

Modeling
Genetic programming (GP). Koza91,92 presented a novel evolutionary technique known as genetic programming (GP), used in many engineering fields to address problems automatically by evolving an expression that characterizes the examined phenomenon. Symbolic optimization is used to identify candidate solutions, utilizing the well-known tree representation91,92. Terminals and functions are used in this sort of representation, with functions being mathematical algebraic operators (e.g., +, −, log, exp) and terminals being solution factors and input parameters. A potential solution, which incorporates functions and terminals, may therefore be shown as an expression tree91,92. Several genetic operators, such as mutation, crossover, and selection, are used to regularly improve the population of potential solutions. This iterative procedure ends when the system's maximum number of iterations is reached or an acceptable solution is found. A fitness function is created to verify solution quality, typically the mean square error (MSE)93, which must be minimized. The first stage in computing the fitness function is to train the model, after which it may be used for forecasting93. Figure 3 depicts the schematic of the GP model.
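As an illustration of the GP workflow described above, the following sketch uses the third-party gplearn library; this is an assumption for illustration only, since the authors do not name their GP implementation, and the function set and hyperparameters shown are illustrative.

```python
# Symbolic regression by genetic programming: expression trees are evolved
# from the function set (operators) and terminals (the seven inputs).
from gplearn.genetic import SymbolicRegressor

gp = SymbolicRegressor(
    population_size=1000,         # candidate expression trees per generation
    generations=50,               # iteration budget for the evolutionary loop
    function_set=("add", "sub", "mul", "div", "log", "sqrt"),
    metric="mse",                 # fitness function to be minimized, as in the text
    p_crossover=0.7,              # genetic operators applied each generation
    p_subtree_mutation=0.1,
    parsimony_coefficient=0.001,  # penalizes overly complex trees
    random_state=42,
)
gp.fit(X_train, y_train)          # X_train, y_train from the data split above
print(gp._program)                # the evolved explicit (white-box) expression
```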
Group method of data handling (GMDH). Ivakhnenko94 initially proposed the GMDH technique, which consists of mathematical methodologies and a black-box nonlinear system characterization idea. He added a polynomial transfer function to the neuron, making it a more complicated unit. An automated technique for the design procedure and weight modification was devised, and the links between layers of neurons were reduced. This approach may be thought of as a form of supervised Artificial Neural Network (ANN) that uses natural selection to govern the system's size, intricacy, and precision. Simulation of complicated networks, function estimation, nonlinear regression, and pattern identification are the key applications of GMDH. The original model is created using the Volterra-Kolmogorov-Gabor (VKG) polynomial function as follows94:

$$y = a_0 + \sum_{i} a_i x_i + \sum_{i}\sum_{j} a_{ij} x_i x_j + \sum_{i}\sum_{j}\sum_{k} a_{ijk} x_i x_j x_k + \cdots$$

In the above equation, X = (x1, x2, ...), y, A = (a1, a2, ...), and a0 refer to the inputs vector, the output, the weights vector, and the bias parameter, respectively. The initial population of partial representations is created at the first stage of the system utilizing an algorithm that uses all conceivable combinations of two inputs. The number of such pairs, $l = \binom{m}{2} = \frac{m(m-1)}{2}$ for m inputs, determines the total number of neurons produced in the first layer. At each stage, layers are generated depending on error indicators, and performance is assessed at each stage of the layer production process. The next layer starts with the largest number of neurons feasible, determines weights, and then freezes95,96. This differs from the back-propagation approach, which allows all layers to engage in the training procedure simultaneously. In the s-th repetition, the training of the GMDH system to create second-order partial polynomials may be written in the following form97:

$$\hat{y} = w_0 + w_1 x_i + w_2 x_j + w_3 x_i x_j + w_4 x_i^2 + w_5 x_j^2$$

Here w, X, and m signify the vector of factors (weights) in each neuron, the vector of system inputs, and the count of iterations of the system training step, respectively. The training and testing data are two subgroups of the sample dataset.
To determine the number of neurons in each layer, a coefficient named "selection pressure" should be specified as an appropriate threshold in this technique. After computing the parameters for all of the neurons, those that generate the worst results based on a previously set selection criterion are deleted from the layer; RMSE is used as the threshold. The selection pressure can range from 0 to 1. All of the neurons have a similar configuration, with the best neurons being identified and progressing to the next step based on an external criterion. The precision of the obtained results may be impacted by the choice of this critical threshold97.
It is worth noting that just one neuron is chosen in the final layer. Figure 4 demonstrates the schematic structure of the GMDH model.
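A minimal sketch of a single GMDH layer follows, under the description above; it is illustrative, not the authors' implementation. A quadratic polynomial neuron is fitted for every pair of inputs, each neuron is scored on held-out data with RMSE as the external criterion, and only the best neurons survive to feed the next layer.

```python
import itertools
import numpy as np

def quadratic_features(xi, xj):
    # Terms of the second-order partial polynomial: 1, xi, xj, xi*xj, xi^2, xj^2
    return np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])

def gmdh_layer(X_tr, y_tr, X_va, y_va, keep=4):
    neurons = []
    for i, j in itertools.combinations(range(X_tr.shape[1]), 2):
        A = quadratic_features(X_tr[:, i], X_tr[:, j])
        w, *_ = np.linalg.lstsq(A, y_tr, rcond=None)      # fit neuron weights
        pred = quadratic_features(X_va[:, i], X_va[:, j]) @ w
        rmse = np.sqrt(np.mean((y_va - pred) ** 2))       # external criterion
        neurons.append((rmse, (i, j), w))
    neurons.sort(key=lambda n: n[0])                      # selection pressure:
    return neurons[:keep]                                 # drop the worst neurons
```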

Deep belief network (DBN). A DBN is a stack of RBMs that is trained layer by layer from the bottom to the top98. The RBM in the bottom layer trains on the initial inputs and extracts features; the extracted features, in turn, become the input of the next RBM above it. By continuing this process, further layers of RBM can be produced99. Figure 5 depicts a common DBN system configuration. A DBN system's training procedure may be broken down into two parts100: First stage: unsupervised training of each RBM layer individually, so that feature information is preserved as much as feasible when feature vectors are mapped to the various feature regions.
Second stage: a back-propagation (BP) network is added on top of the final RBM layer; it receives the final RBM layer's output feature vectors as input and is trained in a supervised manner. A minimal sketch of this two-stage procedure is given below.
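The sketch below is a simplified illustration of the two-stage procedure under stated assumptions: a CD-1 pre-training loop with visible-layer bias updates omitted for brevity, and X_train_scaled as an assumed, pre-scaled input matrix.

```python
# Stage 1: greedy, unsupervised layer-wise pre-training of stacked RBMs.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(V, n_hidden, lr=0.01, epochs=10, rng=np.random.default_rng(0)):
    """One-step contrastive divergence (CD-1) for a single RBM layer."""
    W = rng.normal(scale=0.01, size=(V.shape[1], n_hidden))
    b_h = np.zeros(n_hidden)
    for _ in range(epochs):
        p_h = sigmoid(V @ W + b_h)                   # P(h=1 | v): positive phase
        h = (rng.random(p_h.shape) < p_h).astype(float)
        v_recon = sigmoid(h @ W.T)                   # reconstruction of the visible layer
        p_h_recon = sigmoid(v_recon @ W + b_h)       # negative phase
        W += lr * (V.T @ p_h - v_recon.T @ p_h_recon) / len(V)
        b_h += lr * (p_h - p_h_recon).mean(axis=0)
    return W, b_h

layers, activations = [], X_train_scaled             # assumed pre-scaled inputs
for n_hidden in (64, 32):                            # illustrative layer sizes
    W, b_h = train_rbm(activations, n_hidden)
    layers.append((W, b_h))
    activations = sigmoid(activations @ W + b_h)     # extracted features feed the next RBM

# Stage 2 (not shown): initialize a BP network with `layers` and fine-tune
# the whole stack on the training targets in a supervised manner.
```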
Since each RBM layer can only ensure that its own weights are well fitted to its own input vectors, and the vector representation of the entire DBN system does not yet achieve the desired result, the entire DBN system must be fine-tuned top-down by the BP network established on top. The procedure of fine-tuning the DBN may also be compared to the initialization of weights in a deep BP neural network. This strategy helps the training process avoid flaws such as local optima and the considerable time spent in multi-level training. Pre-training is the first phase of the aforementioned training model, and fine-tuning is the next stage. Because each RBM can be quickly trained by evaluating divergence, the DBN simplifies the complex deep neural network training phase by combining the training of several RBMs, making it more efficient than other deep learning neural network systems. A significant number of studies indicate that the DBN can overcome the challenges faced by the classic Restricted Boltzmann Machine (RBM). Figure 6 depicts the structural layout of the RBM. When the status of the visible (input) layer is specified, the activation of each hidden unit is independent; conversely, when the status of the hidden layer is provided, the activation of each visible unit is independent101,102. The RBM's system energy may be calculated as103:

$$E(v, h) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i}\sum_{j} v_i w_{ij} h_j$$

The activation probability of the j-th hidden unit, according to this formula, is103:

$$P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_{i} v_i w_{ij}\Big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$

When training data are provided, an RBM's variables are adjusted to suit the provided training datasets. In mathematical terms, this may be expressed as maximizing the log-likelihood function101:

$$\max_{W,\,a,\,b} \; \sum_{t=1}^{n} \log P\big(v^{(t)}\big)$$

Extreme gradient boosting (XGBoost). The primary idea of a tree-based ensemble approach is to utilize a set of classification and regression trees (CARTs) to fit training sets via minimization of a regularized objective function. XGBoost is a tree-based algorithm belonging to the gradient-boosted decision tree family. To better illustrate the CART structure, each CART consists of root, internal, and leaf nodes, as demonstrated in Fig. 7. The root node, which represents the complete data set, is split into internal nodes, while the leaf nodes indicate the final categories, according to a binary decision procedure104. For a dataset with n instances and m features, an ensemble of K trees is required to train and forecast the output y104:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \qquad (1)$$

$$\mathcal{F} = \big\{ f(x) = w_{q(x)} \big\} \qquad (2)$$
In the above equations, the decision rule q(x) maps the sample x to a leaf index, $\mathcal{F}$ denotes the space of regression trees, f_k the k-th independent tree, T the count of leaves on the tree, and w the leaf weights in Eqs. (1) and (2). The regularized objective function is:

$$\mathcal{L} = \sum_{i} l\big(\hat{y}_i, y_i\big) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$

Here Ω is the regularization term that aims to minimize overfitting by restricting system intricacy; l refers to a differentiable convex loss function; γ represents the smallest loss reduction required to split a new leaf; and λ is the regularization factor. The objective function for each particular leaf is reduced in the gradient boosting strategy, and further leaves are created repeatedly105.
In the t-th repetition of the given training process, the XGBoost technique greedily grows the regression trees to significantly improve the ensemble system, which is referred to as a "greedy algorithm". Consequently, the result is refined repeatedly by minimizing the objective function:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\Big(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\Big) + \Omega(f_t)$$

XGBoost also employs a shrinkage method in which newly added leaf weights are scaled by a learning rate after each boosting step. This reduces the danger of overfitting by limiting the influence of each newly added tree on the existing ensemble106.
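A brief sketch of how such a model might be configured with the xgboost Python package follows; the hyperparameter values are illustrative assumptions, not the authors' tuned settings.

```python
# Gradient-boosted CART ensemble; gamma and reg_lambda correspond to the
# regularization terms (gamma, lambda) in the objective above.
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=500,             # number of boosted trees added greedily
    learning_rate=0.05,           # shrinkage: scales each new tree's contribution
    max_depth=6,                  # limits individual tree complexity
    gamma=0.0,                    # minimum loss reduction required to split a leaf
    reg_lambda=1.0,               # L2 penalty on leaf weights
    objective="reg:squarederror",
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```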

Results and discussion
Computational procedure. To properly estimate the H2S/ILs solubility over a wide diversity of thermodynamic conditions, 1516 data points for 37 ILs were gathered in this work. In this regard, four smart strategies were developed after selecting the inputs: T, P, Tc, Pc, ω, Tb, and Mw. Figure 8 illustrates the pairwise correlation plot of the inputs. The training and testing categories were created at random from the collected data: 80% of the data (1212 data points) were chosen at random as the training set, used to develop the methods, find the best model configurations, and learn the physical rules present in the systems; the remaining 20% (304 data points) were used to assess the suggested models' accuracy and validity. As previously stated, the suggested GMDH strategy is the simplest established method for estimating H2S solubility in ILs, as it is expressed as simple explicit correlations (not reproduced here). In these relationships, T, Tb, and Tc are in K; P and Pc are in bar; ω is dimensionless; molecular weight is in g/mol; and H2S solubility is in mole fraction. The developed GP smart model for computing H2S solubility in ILs is likewise defined by explicit correlations.

Statistical assessment of the developed methods. Statistical criteria were employed to assess the precision and robustness of the suggested techniques in forecasting H2S/ILs solubility. The statistical variables used in the analysis were the average absolute percent relative error (AAPRE%), root mean square error (RMSE), coefficient of determination (R2), standard deviation (SD), average percent relative error (APRE%), and a-20 index, defined as follows:

$$\mathrm{AAPRE\%} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{S_{i,\mathrm{exp}} - S_{i,\mathrm{pred}}}{S_{i,\mathrm{exp}}}\right|$$

$$\mathrm{APRE\%} = \frac{100}{n}\sum_{i=1}^{n}\frac{S_{i,\mathrm{exp}} - S_{i,\mathrm{pred}}}{S_{i,\mathrm{exp}}}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(S_{i,\mathrm{exp}} - S_{i,\mathrm{pred}}\big)^2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\big(S_{i,\mathrm{exp}} - S_{i,\mathrm{pred}}\big)^2}{\sum_{i=1}^{n}\big(S_{i,\mathrm{exp}} - \bar{S}_{\mathrm{exp}}\big)^2}$$

$$\mathrm{SD} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{S_{i,\mathrm{exp}} - S_{i,\mathrm{pred}}}{S_{i,\mathrm{exp}}}\right)^2}$$

$$\text{a-20 index} = \frac{m20}{n}$$

where n denotes the count of data points in the original data set; S_{i,pred} and S_{i,exp} state the estimated and measured values of the i-th data point; $\bar{S}_{\mathrm{exp}}$ is the mean value of the experimental data; and m20 refers to the number of samples with an exp./pred. ratio between 0.8 and 1.2. Table 3 indicates the computed values of these variables for the testing, training, and whole datasets of all paradigms. The findings show that XGBoost gives the most accurate predictions of all the evaluated models. Indeed, the XGBoost paradigm has the most precise prediction of H2S/ILs solubility, with AAPRE values of 1.14% for all data points, 1.30% for the testing set, and 1.09% for the training set.
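The criteria above translate directly into code; the following sketch computes them for arrays of measured and estimated solubilities.

```python
import numpy as np

def evaluate(s_exp, s_pred):
    """Statistical criteria for measured (s_exp) and estimated (s_pred) solubilities."""
    s_exp, s_pred = np.asarray(s_exp, float), np.asarray(s_pred, float)
    n = len(s_exp)
    rel = (s_exp - s_pred) / s_exp
    apre = 100.0 * rel.mean()                       # APRE%
    aapre = 100.0 * np.abs(rel).mean()              # AAPRE%
    rmse = np.sqrt(np.mean((s_exp - s_pred) ** 2))  # RMSE
    r2 = 1.0 - np.sum((s_exp - s_pred) ** 2) / np.sum((s_exp - s_exp.mean()) ** 2)
    sd = np.sqrt(np.sum(rel ** 2) / (n - 1))        # SD of relative errors
    ratio = s_exp / s_pred
    a20 = np.mean((ratio >= 0.8) & (ratio <= 1.2))  # a-20 index (m20 / n)
    return dict(APRE=apre, AAPRE=aapre, RMSE=rmse, R2=r2, SD=sd, a20=a20)
```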
Graphical analysis of the models. In addition to the presented error evaluation criteria, four graphical analyses were provided in this research to assess the effectiveness of the recommended paradigms. A succinct description of these charts is as follows:

1. Cumulative frequency: the cumulative frequency is shown versus the absolute relative deviation, presenting the percentage of predictions falling within various error intervals.
2. Error bar chart: the deviation of each prediction from the corresponding experimental value is displayed, exposing the error distribution of each model.
3. Cross-plot: the estimated value is displayed against the experimental value. In such a plot, a tight collection of data points along the Y = X line indicates the most correct model.
4. Taylor diagram: it demonstrates which of numerous plausible representations of a system or phenomenon is the most realistic. The degree of agreement between the estimated and real behavior is measured using three statistical criteria: RMSE, SD, and R2.

Figure 9 shows cross plots of actual values against predicted values for the four techniques utilized in this study, namely DBN, GMDH, GP, and XGBoost. Around the unit-slope line, there is considerable dispersion of data points, both testing and training, indicating that three of the smart models, GP, DBN, and GMDH, have an improper match between predicted and experimental data points. As can be seen from this chart, a tight aggregation of the entire dataset is placed around the Y = X line for both training and testing sets for the developed XGBoost model. Indeed, the XGBoost approach outperforms the other models in terms of accuracy and performance, and it can estimate the hydrogen sulfide/ILs solubility with less deviation from the real data (the same result as reported in Table 3). Cross plots of the estimated H2S solubility in ILs versus experimental data are displayed in Fig. 10 to compare the forecasting ability of the XGBoost model, as the best model, against the methods available in Mousavi et al.53. Figure 11 shows a comparison of the present study's results and the methods introduced by Mousavi et al.53. As this figure demonstrates, about 96% of XGBoost, 86% of CNN53, 75% of DBN and RNN models53, 70% of DJINN53, 57% of DBN, 26% of GMDH, and 17% of GP predictions have errors of less than 5%. This implies that the XGBoost smart method is the most practical method for calculating H2S solubility in ILs among all previously investigated techniques. Figure 12 compares the AAPRE% of the literature models to the smart methodologies employed in this work to analyze the models' capabilities; examples of the available techniques include ELM1 (1318 measured data for 28 ILs)34. Figure 13 introduces the Taylor diagram, which combines a variety of evaluation criteria into a more intelligible design. This graphic illustrates how well all of the intelligent models account for H2S solubility in ILs. The RMSE, SD, and R2 of the systems, namely XGBoost, DBN, GP, and GMDH, are used to quantify the difference between the calculated and actual data. The centered RMSE is calculated as the distance from the reference point. The consummate predictive system, indicated by a point with R2 equal to 1, is the next criterion in the Taylor diagram. The RMSE and R2 values for the developed XGBoost smart method in this study were 0.0024 and 0.999, respectively. The statistical parameters for the smart methods in the current study are shown in Fig. 13 after the models have been placed in the Taylor diagram.
Each paradigm's overall performance is evaluated based on its distance from the reference point, labeled "measured" in this chart. The Taylor diagram further emphasizes XGBoost's dominance, because it lies closest to the actual measurements when performance is evaluated globally.
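For illustration, two of these graphical analyses, the cross-plot and the cumulative frequency curve, can be generated as follows, where s_exp and s_pred are the experimental and predicted solubility arrays from the earlier sketches.

```python
import matplotlib.pyplot as plt
import numpy as np

ard = 100.0 * np.abs((s_exp - s_pred) / s_exp)      # absolute relative deviation, %

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(s_exp, s_pred, s=8, alpha=0.5)
lims = [s_exp.min(), s_exp.max()]
ax1.plot(lims, lims, "k--", label="Y = X")          # tight clustering here = accurate model
ax1.set(xlabel="Experimental solubility", ylabel="Predicted solubility")
ax1.legend()

ax2.plot(np.sort(ard), np.arange(1, len(ard) + 1) / len(ard))
ax2.set(xlabel="Absolute relative deviation (%)", ylabel="Cumulative frequency")
plt.tight_layout()
plt.show()
```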

Sensitivity analysis.
In the current survey, a sensitivity assessment was accomplished to detect the influence of each input variable (i.e., P, T, Pc, Tc, ω, Mw, and Tb) on the output prediction using the XGBoost model as the best paradigm. The relevancy factor (r) between each input x and the output y may be written as:

$$r = \frac{\sum_{i=1}^{n}\big(x_i - \bar{x}\big)\big(y_i - \bar{y}\big)}{\sqrt{\sum_{i=1}^{n}\big(x_i - \bar{x}\big)^2 \sum_{i=1}^{n}\big(y_i - \bar{y}\big)^2}}$$

The sign of r is interpreted as follows (a computational sketch is given after this list):

1. If r > 0, the input variable has an increasing effect on the predicted parameter; in other words, the output value increases when the desired parameter is increased. The larger the impact, the closer the relevancy factor is to +1.
2. If r = 0, the output data and the input parameter have no relationship, or the effect is increasing in one region and decreasing in another.
3. If r < 0, the input variable's effect on the output is decreasing: enhancing the targeted variable decreases the output parameter's value. The larger the effect, the closer the relevancy factor is to −1.
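The relevancy factor is the Pearson correlation between each input and the output; a minimal sketch, reusing FEATURES, TARGET, and data from the data-preparation sketch:

```python
import numpy as np

def relevancy_factor(x, y):
    """Pearson correlation between one input variable and the output."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    num = np.sum((x - x.mean()) * (y - y.mean()))
    den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
    return num / den

for name in FEATURES:                               # FEATURES from the data sketch
    r = relevancy_factor(data[name], data[TARGET])
    print(f"{name}: r = {r:+.3f}")
```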
The relevancy factor values for the XGBoost model's inputs, as the best paradigm, are indicated in Fig. 14. According to this figure, the input parameters P, Mw, Tb, and Tc have direct effects on H2S solubility, whereas T, Pc, and ω have inverse effects. This means that changing the value of each input changes the value of H2S solubility. Among the positive input parameters, pressure, with a relevancy factor of +0.537, has the greatest impact. As a result, the solubility of H2S increases as pressure increases. This phenomenon is explained by Henry's law, which states that the solubility of a gas in a liquid is proportional to the pressure of that gas above the solution's surface. Gas molecules are driven into the solution as pressure increases, so more gas molecules dissolve in the solution58. On the other hand, temperature, with a relevancy value of −0.230, has the strongest influence among the negative input variables. This implies that when the temperature is raised, the solubility of H2S decreases. Heating the solution supplies thermal energy, which overcomes the attraction between gas and solvent molecules, decreasing the gas solubility53.

Applicability domain of the XGBoost method. To evaluate the validity region of the suggested XGBoost system and to distinguish any dubious data, the Leverage approach was utilized[108][109][110]. The Leverage evaluation, presented graphically through Williams' plot, is one of the most substantial methods of outlier detection. This method involves computing the residuals and the Hat matrix for the input data. The deviation between the actual data and the method predictions is represented by the residual values (R), while the Hat matrix is determined as follows111:

$$H = X\big(X^{\top}X\big)^{-1}X^{\top}$$

where X is an (a × b) matrix, a is the number of data points, b is the number of model variables, and X^T is the transpose of the matrix X. Furthermore, H* (the Leverage limit) is calculated as 3(b + 1)/a. In this study, the H* value for the XGBoost model was 0.015. The R values are then plotted against the Hat values to produce the Williams plot, which shows the suspicious data as well as the usable region of the model. The developed paradigm is regarded as authentic, and its forecasts are performed within the validity scope, if the majority of the data points fall within the ranges 0 ≤ H ≤ H* and −3 ≤ R ≤ 3. Figure 15 shows the Williams plot of the XGBoost method; virtually all data points appear to lie within 0 ≤ H ≤ 0.015 and −3 ≤ R ≤ 3. According to the outlier identification findings, only 56 data points lie beyond the aforementioned ranges: 24 data points (1.58% of the databank) are suspected data, and the other 32 data points (2.11% of all data) fall outside the allowed leverage (H ≥ 0.015). The findings in Fig. 15 show that the XGBoost model used to predict H2S solubility in ILs is statistically valid, and the datasets used to create the model are of sufficient quality.
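A minimal sketch of the leverage computation behind the Williams plot follows, assuming X_all holds the full (a × b) input matrix and y_all, y_pred_all hold the measured and predicted solubilities; these variable names are hypothetical.

```python
import numpy as np

X = np.asarray(X_all, float)                 # (a x b) matrix of the 7 inputs
a, b = X.shape
H = X @ np.linalg.inv(X.T @ X) @ X.T         # Hat matrix
h = np.diag(H)                               # leverage of each data point
h_star = 3 * (b + 1) / a                     # warning limit; ~0.015 for a=1516, b=7

resid = np.asarray(y_all) - np.asarray(y_pred_all)
R = (resid - resid.mean()) / resid.std()     # standardized residuals
valid = (h <= h_star) & (np.abs(R) <= 3)     # inside the applicability domain
print(f"{(~valid).sum()} points outside the applicability domain")
```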

Qualitative investigation of H2S solubility in ILs.
Hydrogen sulfide/ILs solubility increases with rising pressure and decreasing temperature, and a similar pattern is also evident for CO2 solubility in ionic liquids112. This gas dissolves because its molecules interact with those of the liquid; since heat is released when these new attractive interactions form, dissolution is an exothermic process. Conversely, heating the solution supplies thermal energy that overcomes the gas/liquid molecular attractions, decreasing the gas solubility. Compared to the anion, the cation has a minor effect on H2S solubility. Figure 16 shows the influence of cation alkyl chain length on H2S solubility at 313.15 K with an identical anion. In comparison to ILs with shorter chains, those with longer alkyl chains have greater H2S solubility ([C8MIm]+ > [C6MIm]+ > [C2MIm]+). This effect is explained by the fact that ILs with longer alkyl chains have larger free volumes, which raises van der Waals interactions while also accommodating more H2S molecules5,6,62. Figure 17 shows the influence of the fluorine content of the anion on H2S solubility at 313.15 K: anions with higher fluorine content exhibit higher H2S solubility. Figure 18 illustrates the pressure effect58: gas molecules are compelled into the solution as pressure rises, leading to a higher quantity of gas molecules incorporated in the solution. As a result, based on thermodynamic characteristics, temperature, and pressure, the proposed technique produces exact and reliable predictions of the H2S solubility in various ILs. As previously stated, Fig. 14 depicts the sensitivity assessment of the inputs and output of the XGBoost model; pressure has a favorable influence on H2S solubility in various ILs, which is consistent with the findings shown in Fig. 18.

Conclusions
In the current study, 1516 real data points were used to forecast the H2S solubility in ILs. A comprehensive databank was prepared from the literature, including the variables T, P, Tc, Pc, ω, Tb, and Mw. Four intelligent models, DBN, GMDH, GP, and XGBoost, were developed to compute the H2S solubility in ILs. Statistical indices (namely AAPRE, APRE, RMSE, SD, and R2) and graphical analyses (cross-plot, Taylor diagram, error chart, and cumulative frequency) were applied to assess the achievements of the suggested methods. The following findings emerge from this research:

1. Of the four models used to estimate the H2S solubility in ILs, XGBoost was determined to be the most precise and valid.
2. The newly established XGBoost model for computing the H2S solubility in IL systems is smart enough to make very accurate predictions of the H2S solubility values (total AAPRE: 1.14%, train AAPRE: 1.09%, and test AAPRE: 1.30%).
3. Furthermore, the XGBoost model is more accurate and trustworthy when its results are compared to those of other well-known earlier models.
4. The relevancy factor was applied to examine how each input parameter affects the H2S solubility. For the XGBoost method, pressure (per Henry's law) has the strongest positive impact on the model results among the input variables, while the solubility of H2S is inversely related to temperature.
5. According to the leverage technique, the majority of the data are empirically valid, and only a small number of data points are found outside the XGBoost model's applicability domain.
6. In terms of pressure, the XGBoost model's output trends are physically logical.
7. The developed XGBoost model is not only quick and inexpensive, but it also accurately predicts the solubility of H2S.

Data availability
All the data have been collected from the literature, and all data sources are cited in the manuscript. The data are also available from the corresponding author on reasonable request.