Modeling CO2 solubility in water using gradient boosting and light gradient boosting machine

The growing application of carbon dioxide (CO2) in various environmental and energy fields, including carbon capture and storage (CCS) and several CO2-based enhanced oil recovery (EOR) techniques, highlights the importance of studying the phase equilibria of this gas with water. Accurate prediction of CO2 solubility in water, an important thermodynamic property, is therefore essential. This study focused on developing two powerful intelligent models, namely gradient boosting (GBoost) and light gradient boosting machine (LightGBM), that predict CO2 solubility in water with high accuracy. The results revealed that the GBoost model performed best, with a root mean square error (RMSE) and determination coefficient (R2) of 0.137 mol/kg and 0.9976, respectively. The trend analysis demonstrated that the developed models were highly capable of detecting the physical trend of CO2 solubility in water across various pressure and temperature ranges. Moreover, the Leverage technique was employed to identify suspected data points as well as the applicability domain of the proposed models. The results showed that less than 5% of the data points were detected as outliers, indicating the large applicability domain of the intelligent models. The outcome of this research provided insight into the potential of intelligent models in predicting the solubility of CO2 in pure water.

experimentally measured is 350 MPa, which was reported in the studies of Todheide and Franck 22 and Takenouchi and Kennedy 23. In 1999, Dhima et al. measured the CO2 solubility in pure water at 344.25 K and pressures up to 100 MPa 24. Ahmadi and Chapoy 25 utilized a high-pressure setup to examine the CO2 solubility in several salt solutions at temperatures between 300 and 424 K and pressures up to 41 MPa. They conducted experiments with deionized water under the same conditions to verify the validity of their method. Wang et al. 26 conducted experiments to determine the solubility of CO2 in water under high pressure and temperature conditions. They investigated the effect of the vapor phase of H2O. A summary of experimental research in the literature is presented in Table 1, comprising the respective year of study, applied pressure and temperature ranges, and the employed equipment.
CO2 solubility in water can also be estimated by thermodynamic modeling, which is more cost-effective than experimental measurements. In recent years, the potential of theoretical models to effectively represent influential characteristics such as pressure, temperature, and electrolyte concentrations has attracted the interest of researchers attempting to develop models for different systems 27. Duan and Sun 28 proposed a model to estimate CO2 solubility in water and aqueous NaCl solutions using an extensive databank containing about 1500 data points at temperatures ranging from 273 to 533 K and pressures between 0 and 2000 bar. Spycher et al. 29 utilized a calculation approach based on determining new Redlich-Kwong Equation of State (EoS) parameters 19 for mutual solubility between CO2 and water, using CO2 solubility data obtained at 285-373 K and up to 60 MPa. Yan et al. 30 measured CO2 solubility in water and NaCl brine at three temperatures of 323, 373, and 413 K and a pressure range of 5 to 40 MPa. By comparing experimental data with the Søreide and Whitson EoS model 31, they indicated that modifying the model by refitting the interaction parameter between water and CO2 in the aqueous phase could lead to an acceptable solubility prediction. Recently, statistical associating fluid theory (SAFT) EoSs have also been applied to CO2-electrolyte-water solutions 27,32,33. Yan and Chen 34 created a model for CO2 solubility in NaCl solution employing a Perturbed-Chain SAFT (PC-SAFT) EoS with a maximum temperature and pressure of 473.15 K and 150 MPa. They determined Henry's constant by fitting the experimental CO2 solubility in pure water over the same pressure and temperature range. Moreover, multiple thermodynamic models of CO2 solubility have been constructed based on various pressures, temperatures, and water salinity conditions 25,35,36. Table 2 provides a summary of theoretical models found in the literature. It includes the year of study, the pressure and temperature ranges that were examined, the techniques used for prediction, and the error.
In recent years, artificial intelligence (AI) has garnered significant attention from researchers and has become a potent tool for prediction tasks in a variety of fields 37-41. AI has been widely applied in the oil industry owing to its advantages over costly and time-consuming experimental procedures and sophisticated thermodynamic models 42. AI has been utilized in assessing a water-alternating-CO2 process 37, determining the minimum miscibility pressure (MMP) of CO2 38, predicting oil recovery in CO2-EOR approaches 39,40, and assessing the features of coal as an approach to carbon sequestration 42,43. Ghasemian et al. 44 developed an artificial neural network (ANN) to estimate the solubility of CO2 in water based on 105 data points from their experiments conducted at pressure and temperature ranges of 0.1-1 MPa and 278.15-348.15 K, respectively. They noticed that the ANN model outperformed the EoSs in CO2 solubility prediction with an absolute average deviation (AAD) of 4%. Hemmati-Sarapardeh et al. 45 investigated the efficiency of four machine learning (ML) techniques, namely, radial basis function (RBF), multilayer perceptron (MLP), gene expression programming (GEP), and least-squares support vector machine (LSSVM) models, to estimate CO2 solubility in pure water at high pressures and temperatures. The results showed that the highest level of accuracy was achieved with the LSSVM model optimized by the firefly optimization algorithm (FFA), with a root mean square error (RMSE) of 0.3261. Khoshraftar and Ghaemi utilized ANN and response surface methodology (RSM) to predict the solubility of CO2 in water. Their model development was based on 240 measurements that were conducted within a pressure range of 0.5-200 MPa and a temperature range of 313.15-473.15 K. The efficiency of the MLP and RBF models was compared using an ANN technique. All the developed models accurately predicted solubility, although the RBF and MLP models exhibited the highest performance 46.
Jeon and Lee 47 developed an ANN based on 2406 data points to predict CO2 solubility in pure water and single-salt aqueous solutions. In order to train the model, 80% of the data bank was used, and the remaining 20% of the data points of solubility in the complex or multi-component solutions were utilized for validation and testing. They stated that the developed ANN model could be extrapolated to predict CO2 solubility in multicomponent salt solutions. Over the past years, several investigations have been conducted on the prediction of pure and impure CO2 solubility 48 in brine solutions containing different components with the aid of AI 49-51. Furthermore, research has been carried out to predict the solubility of CO2 in the non-aqueous phase. Certain EOR techniques employ the injection of miscible CO2 into the oil in order to enhance oil mobility by lowering the oil viscosity and the interfacial tension (IFT) and by swelling the oil. Rostami et al. developed a gene expression programming (GEP) model for predicting the solubility of CO2 in both dead and live oil, utilizing 106 and 74 data points, respectively. The error analysis revealed that the GEP-based model predicts the solubility of CO2 in both dead oil and live oil accurately, with R2 values of 0.9860 and 0.9844, respectively 52. Prior studies that have explored the application of AI to predict the solubility of gases in different types of aqueous solutions and non-aqueous phases are summarized in Table 3. This table presents the year of the study, the applied AI technique, the kind of gas, the solution type, and the temperature and pressure ranges.
The purpose of this study is to develop intelligent models that accurately predict CO2 solubility in pure water. For this, a large data bank with 785 data points containing the values of pressure, temperature, and CO2 solubility in water is gathered. Then, two powerful intelligent models, namely gradient boosting (GBoost) and light gradient boosting machine (LightGBM), are implemented to predict the CO2 solubility as a function of temperature and pressure. Various statistical and graphical methods are employed to assess the validity and precision of the developed intelligent models. In addition, a trend analysis is undertaken to verify the developed models' ability to detect physical trends. Lastly, the validity of the data bank and the applicability domain of both models are examined by the Leverage method. As Table 3 shows, although valuable AI models exist for forecasting CO2 solubility in aqueous solutions, most of them were built on data banks that blend CO2 solubility data for pure water, single-salt solutions, and diverse brines. This study comprises an extensive database covering a broad range of temperatures and pressures for CO2 solubility in water. This vast data collection resulted in well-trained AI models with high accuracy. Both models developed in this study demonstrate a high level of accuracy, with a minimum determination coefficient of 0.995, indicating the reliability of the models. When employing EOR techniques or carbon storage technologies, we encounter a complex scenario involving the combination of gases in a water solution with varying salt levels and components. These validated and precise AI models can be employed in future research to assess complex systems, such as gas mixtures and aqueous solutions of varying salinity levels. This not only improves the reliability of these models but also creates new opportunities for enhancing the effectiveness of these operations.

Table 3. Literature AI models on the prediction of gas solubility in aqueous and non-aqueous phases. Columns: year of study, temperature (K), pressure (MPa), type of gas.

Methodology

Data gathering
In order to develop intelligent models, a trustworthy and broad data set was collected from various sources 17,19,20,22,24. Each of the 785 data points in the database contained values for temperature, pressure, and CO2 solubility in pure water. Temperature and pressure were regarded as the models' inputs, while solubility was specified as the models' output parameter. These data points have already been used by Hemmati-Sarapardeh et al. 45. The data set was randomly partitioned into 80% and 20% subsets for model training and testing, respectively. Table 4 describes the statistical features of the data bank.
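The random 80/20 partition described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_indices` and the seed value are assumptions, and only indices are shuffled, so no data values are implied.

```python
import random

def split_indices(n_points, train_percent=80, seed=0):
    """Shuffle indices 0..n_points-1 and split them into train/test lists."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (seed is arbitrary)
    # Integer arithmetic avoids floating-point rounding in the split size.
    n_train = (n_points * train_percent) // 100
    return indices[:n_train], indices[n_train:]

# 785 points, as in the data bank described above.
train_idx, test_idx = split_indices(785)
print(len(train_idx), len(test_idx))  # 628 157
```

The resulting subset sizes match the 628 training and 157 testing points reported for the models.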

Modeling techniques
Gradient boosting (GBoost)

Gradient boosting (GB) is an ensemble, supervised, tree-based machine learning (ML) approach that can be utilized for both regression and classification problems 88-90. It is called an ensemble because the final model's prediction is produced from the predictions of various single models (decision trees) 88. GB, which is portrayed in Fig. 1, is an iterative accumulation of sequentially organized tree-based weak learners, or predictors, that are converted into a powerful learner 91,92. Commonly, boosting techniques combine weak predictors into a powerful one along an iterative path to minimize the loss function 89. This loss function is minimized similarly to an ANN in which weights are tuned 93.
To achieve this purpose, it is recommended to choose a function $h(x, \theta_t)$ that is most parallel to the negative gradient $\{-g_t(x_i)\}_{i=1}^{N}$. An iterative approach overcomes the challenge posed by the prediction variables. The function $g_t(x)$ is calculated for every experimental data point as below 94:

$$g_t(x) = E_y\left[\frac{\partial \Psi(y, f(x))}{\partial f(x)} \,\middle|\, x\right]_{f(x) = f_{t-1}(x)}$$

To replace a hard optimization problem, one can simply choose the new function increment to be the best match with $-g_t(x)$ using classic least-squares optimization as follows 89:

$$(\rho_t, \theta_t) = \arg\min_{\rho, \theta} \sum_{i=1}^{N} \left[-g_t(x_i) + \rho\, h(x_i, \theta)\right]^2$$

The following stages show a general optimization procedure of GBoost 94: initializing $f_0$ as a constant; calculating the negative gradient $-g_t(x)$; fitting a new base-learner function $h(x, \theta_t)$; and finding the optimal gradient descent step size $\rho_t$ as:

$$\rho_t = \arg\min_{\rho} \sum_{i=1}^{N} \Psi\left[y_i, f_{t-1}(x_i) + \rho\, h(x_i, \theta_t)\right]$$

where $\Psi$ denotes the loss function and $f_{t-1}$ the model from the previous iteration. This approach involves the base-learner phase, which consists of a single neuron, and the loss function employed is the standard squared error. The optimum structure is obtained through the training of the model.
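The four stages listed above can be illustrated with a small from-scratch sketch. It is written under several assumptions that are not taken from the study: the base learner is a one-split regression stump, the loss is the squared error (so the negative gradient is simply the residual and the line search has a closed form), a shrinkage factor `learning_rate` is applied, and the data are synthetic.

```python
def fit_stump(X, r):
    """Least-squares regression stump fitted to targets r (here: residuals)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = 0.5 * (lo + hi)
            left = [ri for row, ri in zip(X, r) if row[j] <= thr]
            right = [ri for row, ri in zip(X, r) if row[j] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    if best is None:  # all rows identical: fall back to a constant learner
        mean = sum(r) / len(r)
        return lambda row: mean
    _, bj, bthr, bml, bmr = best
    return lambda row: bml if row[bj] <= bthr else bmr

def gradient_boost(X, y, n_rounds=50, learning_rate=0.1):
    f0 = sum(y) / len(y)                       # stage 1: constant initial model
    pred = [f0] * len(y)
    learners = []
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]  # stage 2: negative gradient
        h = fit_stump(X, resid)                       # stage 3: fit base learner
        hx = [h(row) for row in X]
        denom = sum(v * v for v in hx)
        # Stage 4: step size; closed-form minimizer for the squared-error loss.
        rho = sum(r * v for r, v in zip(resid, hx)) / denom if denom else 0.0
        step = learning_rate * rho
        learners.append((step, h))
        pred = [p + step * v for p, v in zip(pred, hx)]
    return lambda row: f0 + sum(s * h(row) for s, h in learners)

def train_rmse(predict, X, y):
    return (sum((predict(row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)) ** 0.5

# Synthetic (temperature, pressure) grid and a made-up smooth target, for illustration only.
X_toy = [(300.0 + 20.0 * i, 10.0 + 10.0 * j) for i in range(5) for j in range(5)]
y_toy = [0.03 * p - 0.002 * (t - 300.0) for t, p in X_toy]
model = gradient_boost(X_toy, y_toy)
baseline = sum(y_toy) / len(y_toy)
```

Comparing `train_rmse(model, X_toy, y_toy)` with the error of the constant `baseline` prediction shows how the accumulated weak learners reduce the training loss.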

Light gradient boosting machine (LightGBM)
LightGBM is a supervised learning technique based on decision trees and the idea of gradient boosting 95. The LightGBM technique, which combines several decision trees, is applicable to various ML tasks such as regression, classification, and ranking 96-98. LightGBM employs a powerful learning framework to produce prediction values 99. Its principal differences from other tree-based models are that it accelerates the training stage and decreases memory consumption by applying histogram-based techniques, and that it uses a leaf-wise growth strategy with depth constraints 95. Figure 2 illustrates a schematic of LightGBM. During training, the predictor $f(x)$ is obtained by minimizing the loss function $L$ 95. Every regression tree can be expressed as $W_{q(x)}$, $q \in \{1, 2, 3, \ldots, N\}$, where $W$ denotes the weight term of every leaf node, $q$ represents the decision rules used in each tree, and $N$ indicates the number of leaves in a tree 100. The objective function at every training stage is then approximated using Newton's method.
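For reference, the additive model and its second-order (Newton) approximation, as commonly written for LightGBM-style boosting, are sketched below. This is the standard textbook formulation rather than a transcription of the authors' equations; the symbols $F_{t-1}$, $g_i$, $h_i$, the leaf index set $I_j$, and the regularization constant $\lambda$ are assumptions introduced here.

```latex
\begin{align}
F_T(x) &= \sum_{t=1}^{T} f_t(x), \\
\Gamma_t &= \sum_{i=1}^{n} L\bigl(y_i,\, F_{t-1}(x_i) + f_t(x_i)\bigr)
  \approx \sum_{i=1}^{n} \Bigl[ L\bigl(y_i, F_{t-1}(x_i)\bigr)
  + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(x_i) \Bigr], \\
g_i &= \left.\frac{\partial L(y_i, F)}{\partial F}\right|_{F = F_{t-1}(x_i)}, \qquad
h_i = \left.\frac{\partial^{2} L(y_i, F)}{\partial F^{2}}\right|_{F = F_{t-1}(x_i)}, \\
W_j^{*} &= -\,\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.
\end{align}
```

For the squared-error loss $L = \tfrac{1}{2}(y - F)^2$, this gives $g_i = F_{t-1}(x_i) - y_i$ and $h_i = 1$, so with $\lambda = 0$ the optimal weight of a leaf is simply the mean residual of the points that fall into it.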

Model's development
This research focused on proposing two intelligent models that accurately predict the solubility of CO2 in pure water. The models were designed to predict the target variable as a function of temperature and pressure. The established models were trained with 628 data points (80% of the data bank) and tested with 157 data points (20% of the data bank). The accuracy of the intelligent models in this study compares favorably with that of the EoSs, as all statistical and graphical error analyses confirm. Furthermore, smart techniques require less input information than EoSs. In the current research, only pressure and temperature were utilized to predict the solubility of CO2 in pure water. In contrast, many EoSs necessitate additional properties. Moreover, the application of EoSs typically consumes a significant amount of time.
For the tuning process, various ranges of hyperparameters were tested to find the optimal value of each hyperparameter. Table 5 shows the optimum values of the GBoost and LightGBM hyperparameters separately. Max depth refers to the maximum depth of each tree. The learning rate determines the step size at each sequential iteration.

Performance evaluation
To analyze the performance of the developed models, multiple statistical and graphical evaluations were utilized. This research implemented the root mean square error (RMSE) and the determination coefficient (R2) for statistical evaluation. Equations (8) and (9) provide the mathematical formulations of these criteria.

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(y_i^{\mathrm{exp}} - y_i^{\mathrm{cal}}\right)^2} \tag{8}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left(y_i^{\mathrm{exp}} - y_i^{\mathrm{cal}}\right)^2}{\sum_{i=1}^{N} \left(y_i^{\mathrm{exp}} - \bar{y}\right)^2} \tag{9}$$

where $y_i^{\mathrm{exp}}$ is the experimentally determined solubility of CO2 in pure water, $y_i^{\mathrm{cal}}$ is the value of solubility predicted by the models, and $N$ is the number of data points. Also, $\bar{y}$ represents the mean value of the measured data points. As Table 6 reflects, the two developed models showed strong agreement with experimental data. Although the error indices indicate the great accuracy of both models, the GBoost model slightly outperformed the LightGBM algorithm.
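As a small worked example of these two criteria, the sketch below computes RMSE and R2 in plain Python. The function names and the four sample values are illustrative assumptions, not data from the study.

```python
import math

def rmse(y_exp, y_cal):
    """Root mean square error between measured and predicted values."""
    n = len(y_exp)
    return math.sqrt(sum((e - c) ** 2 for e, c in zip(y_exp, y_cal)) / n)

def r_squared(y_exp, y_cal):
    """Determination coefficient, using the mean of the measured values."""
    mean_exp = sum(y_exp) / len(y_exp)
    ss_res = sum((e - c) ** 2 for e, c in zip(y_exp, y_cal))
    ss_tot = sum((e - mean_exp) ** 2 for e in y_exp)
    return 1.0 - ss_res / ss_tot

# Made-up values purely for illustration (units would be mol/kg).
y_exp = [1.0, 2.0, 3.0, 4.0]
y_cal = [1.1, 1.9, 3.2, 3.9]
print(round(rmse(y_exp, y_cal), 4))       # 0.1323
print(round(r_squared(y_exp, y_cal), 4))  # 0.986
```

Here the squared residuals sum to 0.07 and the total sum of squares is 5, which gives RMSE = sqrt(0.07/4) and R2 = 1 - 0.07/5.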
Graphical assessments were also used to compare the reliability of the models. For this purpose, various kinds of plots, comprising cross plots, error distribution plots, and cumulative frequency plots, were drawn. Cross plots are visual assessments that compare the predicted data with experimental measurements. The concentration of data near the Y = X line indicates the accuracy of the developed model. Figure 3 shows the cross plots of the GBoost and LightGBM models. As shown, the train and test data points were densely clustered around the unit slope line. While the points were somewhat more scattered around the Y = X line in the cross plot of the LightGBM model, the GBoost approach demonstrated a stronger concentration around this line.
Figure 4 demonstrates the error distribution diagrams of the established models, where the error is defined as the deviation of the predicted solubility from the experimental values. In this diagram, more aggregation around the zero line implies a more accurate model. The plots revealed that the majority of points were located close to the zero-error line, confirming the accurate predictions. While the GBoost model performed accurately over the whole CO2 solubility range, a slight deviation was observed in the LightGBM model at a few points, especially at high solubility values, indicating a minor relative error at these values.

Group error analysis
Group error diagrams were used to assess the efficiency of the established models in various pressure and temperature ranges. Figure 5 depicts group error plots of the GBoost and LightGBM models in five equal temperature and pressure ranges. As Fig. 5a shows, both models demonstrated higher precision at temperatures below 454.13 K. It should be noted that the GBoost model demonstrated higher reliability than LightGBM throughout all temperature ranges. The effect of pressure on the efficiency of the models is shown in Fig. 5b. As shown, both models provided more accurate predictions at pressures lower than 700 bar. While GBoost demonstrated relatively consistent accuracy at all ranges, increasing the pressure reduced the efficiency of LightGBM.
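The grouping behind such a plot can be sketched as below: the range of one input is cut into five equal-width intervals and an error measure is averaged within each. The choice of mean absolute error as the per-group measure, the function name, and the sample values are assumptions made for illustration.

```python
def group_errors(x, abs_errors, n_groups=5):
    """Mean absolute error inside n_groups equal-width intervals of x."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_groups
    sums = [0.0] * n_groups
    counts = [0] * n_groups
    for xi, ei in zip(x, abs_errors):
        # The maximum value falls on the upper edge, so clamp it into the last group.
        k = min(int((xi - lo) / width), n_groups - 1) if width > 0 else 0
        sums[k] += ei
        counts[k] += 1
    return [
        (lo + k * width, lo + (k + 1) * width,
         sums[k] / counts[k] if counts[k] else None)
        for k in range(n_groups)
    ]

# Made-up pressures (bar) and absolute errors (mol/kg), for illustration only.
pressure = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
abs_err = [0.1, 0.3, 0.2, 0.2, 0.1, 0.1, 0.4, 0.2, 0.3, 0.3, 0.6]
groups = group_errors(pressure, abs_err)
for low, high, err in groups:
    print(f"{low:6.1f} - {high:6.1f}: {err}")
```

Each tuple holds the lower bound, the upper bound, and the mean error of one interval; an empty interval is reported as `None` rather than zero so that it is not mistaken for a perfect fit.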

Cumulative frequency
The cumulative frequency plot is a visualization technique for evaluating the reliability of intelligent models. In these charts, the higher a model is positioned, the more accurate its predictions are. The cumulative frequency curves of GBoost and LightGBM are illustrated in Fig. 6. The closeness of the two curves to the vertical axis signifies that these models predicted the majority of data points with a low error. Considering that the error is defined as the absolute difference between measured and predicted data points, 95% of the data predicted by the GBoost model have an error of less than 0.29 mol/kg. By contrast, LightGBM predicted 93% of the data within this error value. Both developed models demonstrated an acceptable level of certainty in predicting the solubility of CO2, whilst GBoost showed relatively higher accuracy.
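The two readings taken from a cumulative frequency curve (the share of points below a given error, and the error below which a given share of points falls) can be sketched as follows. The function names and the ten error values are illustrative assumptions.

```python
def cumulative_frequency(abs_errors, threshold):
    """Percentage of points whose absolute error is at most `threshold`."""
    hits = sum(1 for e in abs_errors if e <= threshold)
    return 100.0 * hits / len(abs_errors)

def error_at_percent(abs_errors, percent):
    """Smallest observed error that covers at least `percent` % of the points.

    `percent` is an integer so that the rank can be computed exactly.
    """
    ordered = sorted(abs_errors)
    rank = (percent * len(ordered) + 99) // 100  # integer ceiling
    return ordered[max(rank, 1) - 1]

# Made-up absolute errors (mol/kg), for illustration only.
errors = [0.02, 0.05, 0.31, 0.08, 0.12, 0.01, 0.27, 0.04, 0.09, 0.15]
print(cumulative_frequency(errors, 0.10))  # 60.0
print(error_at_percent(errors, 90))        # 0.27
```

Plotting `cumulative_frequency` against a sweep of thresholds reproduces the curve itself; a curve that rises quickly near zero error corresponds to a more accurate model.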

Sensitivity analysis
One of the sensitivity approaches employed to measure the effect of input parameters on the output is the relevancy factor. This factor is employed to determine the correlation between dependent and independent variables, and is defined below 101:

$$r(I_k, O) = \frac{\sum_{i=1}^{N} \left(I_{k,i} - I_{k,\mathrm{ave}}\right)\left(O_i - O_{\mathrm{ave}}\right)}{\sqrt{\sum_{i=1}^{N} \left(I_{k,i} - I_{k,\mathrm{ave}}\right)^2 \sum_{i=1}^{N} \left(O_i - O_{\mathrm{ave}}\right)^2}}$$

where $I$ and $O$ symbolize the input and output variables, respectively. As stated earlier, temperature and pressure were considered the input parameters of the developed models for predicting CO2 solubility in water. $I_{k,i}$ represents the $i$th value of the $k$th input parameter, $I_{k,\mathrm{ave}}$ stands for the average value of the $k$th input, $O_i$ is the $i$th output value, $O_{\mathrm{ave}}$ indicates the average output, and $N$ is the number of data points. The relevancy factor takes a value between −1 and 1. A positive value implies a direct correlation between the input and output variables, while a negative value shows an inverse one. The degree of influence is expressed by the absolute value of $r$: the greater $|r|$, the stronger the influence of the input. Figure 7 demonstrates the impact of pressure and temperature on CO2 solubility for the GBoost and LightGBM models. As can be seen, the two models yielded nearly the same relevancy factors for the input parameters. While a rise in either temperature or pressure would result in an increase in solubility, pressure, with a relevancy factor of 0.87, had a stronger impact than temperature, with a relevancy factor of almost 0.65.
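The relevancy factor has the form of a Pearson correlation between one input column and the output, which can be computed as in the sketch below. The function name and the sample values are assumptions for illustration and do not come from the data bank.

```python
import math

def relevancy_factor(inputs, outputs):
    """Correlation-type relevancy factor between one input and the output."""
    i_ave = sum(inputs) / len(inputs)
    o_ave = sum(outputs) / len(outputs)
    num = sum((i - i_ave) * (o - o_ave) for i, o in zip(inputs, outputs))
    den = math.sqrt(
        sum((i - i_ave) ** 2 for i in inputs)
        * sum((o - o_ave) ** 2 for o in outputs)
    )
    return num / den

# Made-up pressures and solubilities, for illustration only.
pressure = [10.0, 20.0, 30.0, 40.0, 50.0]
solubility = [0.5, 0.9, 1.6, 2.0, 2.4]
r_pressure = relevancy_factor(pressure, solubility)
print(round(r_pressure, 3))
```

A perfectly linear increasing relation gives a value of 1 and a perfectly linear decreasing relation gives −1, which is why the sign is read as the direction of the effect and the magnitude as its strength.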

Model trend analysis
As noted in previous sections, the GBoost approach provided more accurate CO2 solubility predictions. To investigate the impact of pressure and temperature on the CO2 solubility in pure water and to verify the model's adherence to the physically expected trend, the experimentally measured points were compared to the GBoost model's predictions with respect to the trends of solubility at various temperatures and pressures, keeping one of them at a constant value. The temperature's impact on solubility at three constant pressures is depicted in Fig. 8. As demonstrated, the predicted points were consistent with the experimental data and illustrated that increasing temperature increases solubility. The solubility line at 3500 bar is positioned above the solubility lines at 2000 bar and 1200 bar in the figure, which illustrates the influence of pressure. In addition, the impact of pressure on solubility is represented in Fig. 9. As illustrated in the figure, a rise in pressure enhanced gas solubility at both high and low temperatures. A pressure rise compresses the gas phase, which frees more space for additional gas molecules to be dissolved in the water. In Fig. 9a, the solubility at 278 K was higher than that at 304.19 K, although in high-pressure regions the solubility increased as the temperature rose. In other words, at low and medium pressures, heating the liquid raises the molecules' kinetic energy, leading them to move faster and letting more of them escape the solvent. This indicates that thermal energy overcomes the intermolecular attraction force between carbon dioxide and pure water. However, when the gas reaches the supercritical condition, which for CO2 is above 304.13 K and 73.97 bar, it exhibits a more complicated behavior. As Figs. 8 and 9b demonstrate, CO2 tends to behave more like a liquid, showing that both pressure and temperature positively affect solubility 102.

Implementation of the Leverage method
Outlier data in a data bank might adversely impact the prediction efficiency and applicability of a model. Therefore, the detection of measured points that differ from the bulk of the data is a key step in model development. The Leverage method was employed to detect outliers in this study 103,104. This technique identifies outlier points from both a statistical perspective and a visual evaluation. The method sketches the Williams plot based on the H values and standardized residuals. The H value refers to the elements of the Hat matrix, which is computed as below:

$$H = X \left(X^{T} X\right)^{-1} X^{T}$$

where $X$ is an $N \times k$ matrix, $k$ and $N$ indicate the number of input parameters and the number of data points, respectively, and $T$ refers to the transpose. The standardized residual (SR) is measured based on $e_i$, which is the deviation between predicted and experimental data, the root mean square error (RMSE), and the H values. In the Williams plot, valid data points are situated in the area bounded by the Leverage limit ($H^{*}$) and the suspected limits. This valid zone represents the applicability domain of the model. The mathematical definition of the Leverage limit is presented in Eq. (13), which corresponds to 0.0115 for the gathered databank. Suspected limits are defined as a standardized residual higher than 3 or less than −3. In addition, the area with $H > H^{*}$ is classified into two areas based on its SR: good high leverage and bad high leverage. The good high leverage area, which represents data points that were predicted well but lie outside the scope of the model's applicability, relates to data with $-3 \le SR \le 3$. The bad high leverage area belongs to points with $SR > 3$ or $SR < -3$ 105-107.

$$H^{*} = \frac{3(k+1)}{N} \tag{13}$$

where $k$ represents the number of inputs (here two) and $N$ is the total number of data points (here 785). The Williams plot is one of the outlier detection methods for examining the performance of developed models and is widely used in AI application studies 45,108-111. The Williams plots of the proposed models for predicting the CO2 solubility in water are shown in Fig. 10. As the plots demonstrate, the majority of data points were placed in the valid region: 95.92% of the points for GBoost and 95.67% for LightGBM. For both models, only 2.42% of the data exceeded the leverage limit. Overall, as a negligible percentage of data were placed outside the valid zone, the Leverage approach confirmed the applicability of the proposed models and the validity of the experimental data points.
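The Leverage limit quoted above can be checked by direct arithmetic, and the diagonal of the Hat matrix can be computed without a linear algebra library when there are only two inputs. The sketch below assumes a two-column design matrix (temperature, pressure) without an intercept column; the function names, the region labels returned by `classify_point`, and the four sample rows are illustrative assumptions.

```python
def leverage_limit(k, n):
    """Leverage limit H* = 3(k + 1) / N."""
    return 3 * (k + 1) / n

def hat_diagonal(X):
    """Diagonal of H = X (X^T X)^-1 X^T for a two-column matrix X."""
    a = sum(x1 * x1 for x1, _ in X)
    b = sum(x1 * x2 for x1, x2 in X)
    d = sum(x2 * x2 for _, x2 in X)
    det = a * d - b * b
    # (X^T X)^-1 = 1/det * [[d, -b], [-b, a]], so h_ii = x_i^T (X^T X)^-1 x_i.
    return [(d * x1 * x1 - 2 * b * x1 * x2 + a * x2 * x2) / det for x1, x2 in X]

def classify_point(h, sr, h_star):
    """Region of the Williams plot for one point (labels are illustrative)."""
    if h <= h_star:
        return "valid" if abs(sr) <= 3 else "suspected"
    return "good high leverage" if abs(sr) <= 3 else "bad high leverage"

h_star = leverage_limit(2, 785)
print(round(h_star, 4))  # 0.0115

# Made-up (temperature in K, pressure in MPa) rows, for illustration only.
X_toy = [(300.0, 10.0), (320.0, 50.0), (350.0, 20.0), (400.0, 80.0)]
h_values = hat_diagonal(X_toy)
```

With k = 2 and N = 785 the limit is 9/785, which rounds to the 0.0115 reported for the databank. A useful sanity check on any leverage computation is that the leverages sum to the number of columns of X.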

Conclusions
In this study, two tree-based models, GBoost and LightGBM, were developed to predict CO2 solubility in pure water based on an extensive data bank including 785 experimental data points collected from diverse sources. Pressure and temperature were considered as the input parameters, while solubility was defined as the output.
• In order to validate the models, statistical and graphical evaluations were implemented. Multiple validation procedures confirmed the close agreement between experimental and predicted solubility values. The findings indicated that the GBoost model performed best, with R2 and RMSE values of 0.9976 and 0.137 mol/kg, respectively.
• The trend analysis was employed to assess the pressure and temperature effects on the solubility in comparison to the predictions of the model. The trend analysis revealed that the proposed models captured the physical trend of the problem with high accuracy.
• For outlier detection, the Leverage approach was implemented, which demonstrated the validity and reliability of the models on a large portion of the data; only a few points were identified as suspected data.
• The findings of this study demonstrated that both developed models could be considered potent and trustworthy tools for predicting the solubility of CO2 in water.

Figure 7. The relative impact of the temperature and pressure on the predicted CO2 solubility.

Figure 8. The GBoost model's predictions and experimentally measured data points of CO2 solubility in water at three fixed pressures with temperature variation.

Figure 9. The GBoost model's predictions and measured data points of CO2 solubility in water at fixed temperatures with pressure variation: (a) for low to medium temperature and pressure; (b) for high temperature and pressure.

Figure 10. Williams plots of the proposed models for predicting CO2 solubility in water.

Table 1. Literature experimental studies on the determination of CO2 solubility in water.

Table 2. Literature theoretical models on the calculation of CO2 solubility in water. These studies have focused on predicting the efficiency of CO2 storage.

Table 4. Statistical overview of the dataset in this study.

Table 5. Optimum values of hyperparameters of the models.

Table 6. Statistical error factors of established models.