Modeling solubility of CO2–N2 gas mixtures in aqueous electrolyte systems using artificial intelligence techniques and equations of state

Determining the solubility of non-hydrocarbon gases such as carbon dioxide (CO2) and nitrogen (N2) in water and brine is one of the most debated challenges in the oil and chemical industries. Although many studies have addressed the solubility of gases in brine and water, very few have investigated the solubility of power plant flue gases (CO2–N2 mixtures) in aqueous solutions. In this study, six intelligent models, namely Random Forest, Decision Tree (DT), Gradient Boosting-Decision Tree (GB-DT), Adaptive Boosting-Decision Tree (AdaBoost-DT), Adaptive Boosting-Support Vector Regression (AdaBoost-SVR), and Gradient Boosting-Support Vector Regression (GB-SVR), were used to predict the solubility of CO2–N2 mixtures in water and brine solutions, and the results were compared with four equations of state (EOSs): Peng–Robinson (PR), Soave–Redlich–Kwong (SRK), Valderrama–Patel–Teja (VPT), and Perturbed-Chain Statistical Associating Fluid Theory (PC-SAFT). The results indicate that the Random Forest model, with an average absolute percent relative error (AAPRE) of 2.8%, gives the best predictions. The GB-SVR and DT models also show good precision, with AAPRE values of 6.43% and 7.41%, respectively. Among the EOSs, PC-SAFT gave the best results for the solubility of CO2 from gaseous mixtures in aqueous systems, and the VPT EOS gave the best results for the solubility of N2. The sensitivity analysis of the input parameters showed that increasing the mole percent of CO2 in the gaseous phase, the temperature, and the pressure, and decreasing the ionic strength, all increase the solubility of the CO2–N2 mixture in water and brine solutions. Likewise, increasing the salinity of the brine reduces the solubility of the CO2–N2 mixture. Finally, the Leverage method showed that the experimental data are of excellent quality and that the Random Forest approach is reliable for determining the solubility of CO2–N2 gas mixtures in aqueous systems.

In the last decade, one of the most important challenges in the petroleum and chemical industries has been evaluating the solubility of different gases in liquids, including hydrocarbon and non-hydrocarbon gases [1][2][3] . The solubility of gases in liquids can be vital in the petroleum and chemical industries for a variety of reasons, including transport operations and the production of hydrates 1,4 . CO 2 as a greenhouse gas has been considered a serious problem in recent decades [5][6][7] . Carbon capture and storage (CCS) 8,9 is a technique that involves capturing CO 2 from major point sources and storing it in formations 10,11 . Flue gas storage in saline aquifers, as well as CO 2 extraction and storage using gas hydrates, are considered as potential CCS methods. As a result, information gaps about these methods, such as the solubility of gas mixtures in water and brine, must be filled before commercialization. Due to the high cost of traditional CCS technologies, considerable efforts have been made to improve the efficiency of CCS operations by creating cost-effective and practical CCS approaches; however, there are still a lot of technological and financial roadblocks to overcome 10,[12][13][14] .
Injecting flue gas, or a mixture of CO2 and N2, into gas hydrate reservoirs has been suggested as a potential alternative for underground CO2 storage. However, the thermodynamic mechanism by which CO2 in flue gas or in a CO2–N2 mixture is captured as hydrate is not well understood 15. Despite all of the stated benefits, CO2 storage in hydrate reservoirs faces cost obstacles that limit its widespread use. The primary expense in this scenario is CO2 capture before storage 15,16. Injecting a CO2–N2 mixture into gas hydrate reservoirs rather than pure CO2 might considerably cut CO2 separation expenses. Furthermore, an industrial-scale CO2 substitution experiment on the North Slope of Alaska found that injecting a gas mixture with a 77/23 ratio of N2/CO2 into a hydrate reservoir while recovering methane successfully avoided CO2 hydrate formation around the injection well. Although prior studies show that injecting CO2–N2 gas mixtures into gas hydrate reservoirs might be a cost-effective technique for CCS, a primary concern remains: how do the reservoir conditions following the injection of CO2–N2 mixtures or flue gas into a gas hydrate reservoir affect the formation of CO2 and CO2-mixed hydrates 15? Since different thermodynamic conditions affect the injection of the CO2–N2 mixture and make it difficult, the first important step is to evaluate the solubility of the CO2–N2 mixture under different thermodynamic conditions. These difficulties have also limited the laboratory data available on the solubility of CO2–N2 mixtures in liquids. Therefore, finding a way to determine the solubility of the CO2–N2 mixture is of great importance. As a result of these considerations, assessing the solubility of gases in liquids has become a contentious issue. CO2 and N2 have been extensively considered as two frequently used non-hydrocarbon gases in recent studies 17,18.
The injection of CO 2 into the aquifer and the injection of a mixture of CO 2 and N 2 into oil and gas reservoirs are two examples of these situations, where knowing the degree of solubility of the gas is critical 10,19 . As a result, a thorough understanding of the physical and chemical interactions between CO 2 , N 2 , and water is required. For instance, solubility trapping and mineral trapping are the two significant mechanisms that influence the injection of CO 2 into the aquifer. To accurately determine the effect of these mechanisms, it is necessary to conduct a sufficient number of theoretical and experimental studies, which can be time-consuming and costly 10,20,21 .
In addition to the laboratory experiments, another technique for determining the solubility of CO2 and N2 in water is to utilize equations of state (EOSs); however, it should be noted that EOSs are more appropriate for pure fluids but have limitations for pure compounds. Some of these limitations are as follows 22,23:
• To determine the solubility using these types of equations, critical properties of the pure substances are necessary. Many of the chemicals studied, particularly those with complicated chemical structures, decompose before reaching critical conditions. As a result, measuring the relevant properties does not appear to be feasible.
• To adjust the thermodynamic coefficients of the equation for a more precise estimation of the physical properties of the system, several physicochemical aspects of the system should be evaluated, such as the hydrogen-bond donor and acceptor characteristics of the molecule.
• Setting the interaction factors against solubility data for each model is a time-consuming process.
• Numerical methods often diverge when solving some equations for pure materials that have low solubility in water.
• The solubility estimations are heavily influenced by the optimization techniques used to get the best values for the thermodynamic model parameters.
As a result, choosing the appropriate optimization technique is another issue to consider. Despite these flaws, thermodynamic techniques have been extensively used to forecast the solubility in water of CO2, N2, and other gases often found in the oil and gas industry, under a variety of thermodynamic conditions. In the literature, CO2 solubility in water and in aqueous solutions [24][25][26] of salts such as NaCl, KCl, and CaCl2 has been thoroughly documented. The solubility of N2 and of CO2–N2 mixtures in water and brine has also been studied 22,[27][28][29]. Tomoya et al. 30 measured CO2 solubility in aqueous solutions and then correlated the experimental data with the Peng-Robinson-Stryjek-Vera EOS. Yiteng et al. 31 needed the solubility of CO2 in brine to estimate the CO2 capturing potential of deep saline aquifers. For this purpose, they utilized the Peng-Robinson Cubic-Plus-Association (PR-CPA) EOS to calculate the solubility of CO2 in brine, and they reported good agreement with laboratory data.
The second group of methods for estimating solubility involves creating correlations, particularly mathematical methods that employ the physical characteristics of the chemicals in a manner that makes these approaches broad and thorough. These techniques may represent or predict the solubility in water of substances from diverse chemical categories under any condition 22. Abraham et al. 32 suggested a linear solvation energy relationship (LSER) approach. Although this relationship can predict the solubility of ordinary organic substances, the model's parameters are difficult to determine from the compounds' chemical structures. Other researchers have taken the same approach 33,34.
In previous studies, a number of experimental data have been reported for the solubility of non-hydrocarbon gases, including CO2 and N2, in liquids, especially in water 18,35,36. Experimental results for non-hydrocarbon gas solubility remain scarce because measuring natural gas data, including gas equilibrium data, is difficult and complex. As a result, the use of laboratory data in new modeling approaches such as artificial neural networks has attracted much attention 1. Machine learning techniques have recently found widespread use in forecasting problems such as hydrate formation 37, ammonia solubility in liquids 38, simulating asphaltene behavior 39, and hydrocarbon-CO2 interfacial tension 40. They have received much interest as a result of their strong performance 41. Samani et al. 42 proposed different intelligent techniques for estimating the solubility of various gases in aqueous electrolyte systems. Regarding the solubility of non-hydrocarbon gases (i.e., N2 and CO2) in aqueous electrolyte systems, their database includes 774 data points, of which only 81 relate to the N2–CO2 gas mixture; the rest relate to the solubility of pure N2 and pure CO2. Their model was based on Coupled Simulated Annealing (CSA) linked to the Least-Squares Support Vector Machine (LSSVM) method. The average absolute relative error and root mean square error (RMSE) of their proposed CSA-LSSVM model were 10.71% and 0.0011, respectively. Hemmati-Sarapardeh et al. 43 investigated the solubility of CO2 in water at high pressures and temperatures using four powerful machine learning techniques. In that study, Multilayer Perceptron (MLP), Radial Basis Function (RBF), Least-Squares Support Vector Machine (LSSVM), and Gene Expression Programming (GEP) models were developed using temperature and pressure as input data to estimate the solubility of CO2 in water.
The results showed that the LSSVM-FFA model, with an RMSE value of 0.3261, had the best performance compared to the other models. Nabipour et al. 1 investigated the solubility of CO2 and N2 in aqueous solutions using Extreme Learning Machine (ELM) and LSSVM approaches. Their solubility database was similar to that of Samani et al. 42, including 774 data points with fewer than 90 related to CO2–N2 mixture solubility. The results showed that the LSSVM technique, with an RMSE value of 0.001, was more accurate than the ELM approach in estimating the solubility of CO2 and N2 in aqueous solutions. Temperature, pressure, and composition were the most critical input parameters to the models. Saghafi et al. 44 investigated the solubility of CO2 in Monoethanolamine (MEA), Diethanolamine (DEA), Triethanolamine (TEA), and N-Methyldiethanolamine (MDEA) aqueous solutions using the AdaBoost-Decision Tree method and artificial neural networks. The results showed that the AdaBoost-Decision Tree models, with RMSE values of 0.005-0.022, gave the best results for the different aqueous solutions. Gharagheizi et al. 22 estimated the solubility of pure compounds such as CO2 in water using an Artificial Neural Network-Group Contribution (ANN-GC) technique. The results showed that this model, with an RMSE value of 0.4, performed well in estimating the solubility of pure materials in water.
As outlined above, determining the solubility of CO2 and N2 in liquids, especially water, has received sustained attention through various techniques, including laboratory methods 45, EOSs, mathematical methods, and intelligent neural networks 46,47, and it remains of interest to researchers. Although many studies have been done on pure CO2 and N2, few have investigated the solubility of CO2–N2 mixtures in water and brine. Only two papers 1,42 applied intelligent models to CO2–N2 mixtures; however, they used fewer than 90 data points covering limited ranges of operating parameters.
In this study, to estimate the solubility of CO2–N2 mixtures in water and aqueous brine solutions, an extensive database containing 289 laboratory data points is collected from the literature. This paper uses six machine learning approaches, including Random Forest, Decision Tree (DT), Gradient Boosting-Decision Tree (GB-DT), Adaptive Boosting-Decision Tree (AdaBoost-DT), Adaptive Boosting-Support Vector Machine for Regression (AdaBoost-SVR), and Gradient Boosting-Support Vector Machine for Regression (GB-SVR), to determine CO2–N2 mixture solubility in aqueous solutions in terms of temperature, pressure, ionic strength of the aqueous brine solution, CO2 mole percent in the gaseous mixture, and the index of the non-hydrocarbon gas (i.e., N2 or CO2) whose solubility is to be estimated. In addition, four well-established equations of state, namely Peng-Robinson (PR), Soave-Redlich-Kwong (SRK), Valderrama-Patel-Teja (VPT), and Perturbed-Chain Statistical Associating Fluid Theory (PC-SAFT), are used for comparison with the artificial intelligence models. Moreover, a sensitivity analysis of the input parameters based on the relevancy factor is performed to check their impact on the solubility of CO2–N2 gas mixtures in aqueous electrolyte systems. Lastly, the Leverage method is applied to investigate the quality of the experimental data and the reliability of the best proposed approach for determining the solubility of CO2–N2 gas mixtures in aqueous systems.

Data gathering
In this study, to estimate the solubility of CO2–N2 mixtures in water and aqueous brine solutions, an extensive database containing 289 laboratory data points has been collected from the literature 10,18, which is presented in the Supplementary file. Although two studies 1,42 have estimated the solubility of CO2, N2, and CO2–N2 mixtures in aqueous electrolyte systems using artificial intelligence models, the number of data points in those studies related to the solubility of CO2–N2 mixtures in water is much smaller than that for the two pure substances (i.e., CO2, N2). As noted in the introduction, these studies 1,42 used fewer than 90 CO2–N2 mixture solubility data points. What distinguishes this study from previous ones is therefore the use of a large data bank containing many data points on CO2–N2 mixture solubility in aqueous brine solutions. Consequently, the results of the developed models can be more comprehensive and reliable for use in the cases mentioned at the beginning of the introduction. To develop the models, temperature, pressure, ionic strength of the aqueous solution, CO2 mole percent in the gaseous mixture, and the index of the non-hydrocarbon gas (IDX: 1 = N2 and 2 = CO2) whose solubility is to be estimated have been used as input parameters. The statistical parameters of the input and output data are summarized in Table 1.

Models' implementation
Support vector machine for regression (SVR). The Support Vector Machine (SVM) is a supervised machine learning method that can be employed for both regression (SVR) and classification (SVC) problems 48. SVM has been widely used in various research areas, notably because the kernel trick, which maps the input space into a higher-dimensional space, allows it to solve non-linear problems. For the sake of conciseness, this article only briefly explains the concept of SVR; it has been presented extensively in the literature 49. Let the given dataset be a set of $n$ independent samples, $[(x_1, y_1), \ldots, (x_n, y_n)]$, where $x \in R^d$ has $d$ dimensions and $y \in R$. The objective of SVR is to identify the regression function below:
$$ f(x) = w^{T} \varphi(x) + b \tag{1} $$
Here $w$, $b$, and $\varphi(x)$ denote the weight, the bias, and the non-linear mapping associated with the kernel function, respectively. To get the appropriate values of the weight and bias, Vapnik et al. 50 suggested the following optimization procedure:
$$ \min_{w,\, b,\, \zeta^{+},\, \zeta^{-}} \; \frac{1}{2} w^{T} w + C \sum_{j=1}^{n} \left( \zeta_j^{+} + \zeta_j^{-} \right) \quad \text{subject to} \quad \begin{cases} y_j - w^{T} \varphi(x_j) - b \le \varepsilon + \zeta_j^{+} \\ w^{T} \varphi(x_j) + b - y_j \le \varepsilon + \zeta_j^{-} \\ \zeta_j^{+},\, \zeta_j^{-} \ge 0, \quad j = 1, \ldots, n \end{cases} \tag{2} $$
Here $w^{T}$ indicates the transpose of $w$, $\varepsilon$ is the error tolerance, $\zeta_j^{+}$ and $\zeta_j^{-}$ are positive slack variables measuring deviations beyond the $\varepsilon$ tolerance on either side, and $C$ is a positive regularization factor that penalizes deviations larger than $\varepsilon$. By employing Lagrange multipliers, Eq. (2) can be converted into the following dual optimization problem, which is easier to solve 48:
$$ \max_{a,\, a^{*}} \; -\frac{1}{2} \sum_{k=1}^{n} \sum_{l=1}^{n} (a_k - a_k^{*})(a_l - a_l^{*}) K(x_k, x_l) - \varepsilon \sum_{k=1}^{n} (a_k + a_k^{*}) + \sum_{k=1}^{n} y_k (a_k - a_k^{*}) \quad \text{subject to} \quad \sum_{k=1}^{n} (a_k - a_k^{*}) = 0, \quad 0 \le a_k,\, a_k^{*} \le C \tag{3} $$
where $K(x_k, x_l)$ is the kernel function, and $a_k$ and $a_k^{*}$ represent the Lagrange multipliers. In the present study, the polynomial kernel function was used in the SVR model; it was selected by grid search for the best performance. The weight and bias in Eq. (1) are the trainable variables of the SVR model.
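As a minimal sketch of the ingredients named above, the following Python snippet shows a polynomial kernel, the ε-insensitive loss that underlies the SVR objective, and the standard dual-form prediction. The function names and the default values of `degree`, `gamma`, `coef0`, and `epsilon` are illustrative assumptions; the hyperparameters actually selected by the grid search in this study are not stated in the text.

```python
def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel K(x, z) = (gamma * <x, z> + coef0) ** degree.

    degree, gamma, and coef0 are hypothetical defaults, not the study's values.
    """
    dot = sum(a * b for a, b in zip(x, z))
    return (gamma * dot + coef0) ** degree


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube, growing linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_predict(x, support_vectors, dual_coefs, bias, **kernel_params):
    """Dual-form prediction f(x) = sum_k (a_k - a_k*) K(x_k, x) + b.

    dual_coefs[k] holds the difference a_k - a_k* for support vector k.
    """
    return bias + sum(
        coef * polynomial_kernel(sv, x, **kernel_params)
        for sv, coef in zip(support_vectors, dual_coefs)
    )


# Toy usage with made-up numbers (not data from this study).
k_value = polynomial_kernel([1.0, 2.0], [3.0, 4.0], degree=2)
toy_prediction = svr_predict(
    [1.0], [[1.0], [2.0]], [0.5, -0.5], 0.1, degree=1, gamma=1.0, coef0=0.0
)
```

Errors smaller than `epsilon` contribute nothing to the loss, which is why only samples on or outside the tube end up as support vectors with non-zero dual coefficients.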

Random forest (RF).
Decision Trees, which have a tree-like structure, are easy to interpret and perform well, notably when the dataset is large. However, the model has two problems. First, Decision Trees usually show low prediction bias and high variance, that is, they over-fit, which means the model picks up even small perturbations and random noise in the training dataset. Second, although the best decision is determined at each step, this greedy model does not consider the global optimum; therefore, the overall decision tree might not be optimal. These issues can be mitigated by ensemble methods, which integrate the results of multiple trees (weak learners) into the final result (strong learner) 51,52. An ensemble learning algorithm in which the trees are trained in parallel forms a Decision Tree ensemble, which is referred to as a Random Forest. The greedy strategy in RF determines the importance of each tree at each stage 53. Moreover, RF can measure feature importance and retain the most informative input features 54. To improve the variable selection and the diversity of the trees, the RF algorithm employs a technique called bagging, or bootstrap aggregation. The model splits the input data into multiple sub-datasets according to the given number of trees. Bagging, a random sampling technique with replacement, draws a bootstrap sample to train each tree; roughly two-thirds of the distinct observations end up in that sample, and the remaining third, which are left out of training, are referred to as out-of-bag (OOB) samples. Additionally, a separate cross-validation step is often considered unnecessary with the RF algorithm, as the repeated bagging in the training process limits over-fitting 55. The framework of RF construction is illustrated in Fig. 1.
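The bootstrap sampling behind bagging can be sketched in a few lines of Python. This is an illustrative sketch only: the helper name `bootstrap_split` and the seed are assumptions, and the only number taken from the text is the database size of 289. When n indices are drawn with replacement, each observation is left out with probability (1 - 1/n)^n, which is about 0.37 for large n, so roughly one third of the data is out-of-bag for any given tree.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement.

    Returns (in_bag, out_of_bag): the sampled indices (with repeats) and the
    indices that were never drawn for this tree.
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    seen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in seen]
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 289                 # size of the database described in this study
in_bag, oob = bootstrap_split(n, rng)
oob_fraction = len(oob) / n
print(f"out-of-bag fraction for this tree: {oob_fraction:.2f}")
```

Each tree in the forest would repeat this draw independently, so every observation is out-of-bag for some subset of trees and can be used to estimate the error of those trees.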
Suppose $D$ is the training dataset with $n$ observations, $D = [(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)]$, and $D_t$ is the training dataset for the tree $h_t$. In its regression form, the out-of-bag prediction for a sample $x$ averages the outputs of the trees that were not trained on $x$ 56:
$$ H^{oob}(x) = \frac{\sum_{t=1}^{B} h_t(x)\, \mathbb{I}(x \notin D_t)}{\sum_{t=1}^{B} \mathbb{I}(x \notin D_t)} $$
The learning error of the OOB can be obtained by:
$$ \varepsilon^{oob} = \frac{1}{n} \sum_{(x,\, y) \in D} \left( H^{oob}(x) - y \right)^{2} $$
The RF procedure relies on randomness, and its degree is controlled by a parameter denoted $k$ 55. The importance of a variable $X_i$ can be obtained as follows:
$$ I(X_i) = \frac{1}{B} \sum_{t=1}^{B} \left( OOBerr_t^{\,i} - OOBerr_t \right) $$
where $X_i$ is the $i$th parameter in the vector $X$, $B$ indicates the number of trees in the RF, $OOBerr_t^{\,i}$ denotes the prediction error of tree $t$ on its OOB samples after the values of feature $X_i$ are permuted, and $OOBerr_t$ is the error of tree $t$ on the original OOB samples 56.
Decision tree (DT). Decision Tree, a nature-inspired supervised learning algorithm, has been widely utilized in the literature and can be used for classification and regression 57. This algorithm consists of four elements: the root node, which is the topmost node in the tree and carries the input data; leaf nodes, which are the final section of the flowchart and denote the output of the system; internal nodes, which are placed between the root and leaf nodes; and branches, which are the connections between nodes. The tree-building process in a decision tree algorithm includes three techniques: splitting, pruning, and stopping 58. Starting from the root node, the input data is split into branches and decision nodes. The splitting process carries on until a stopping criterion is satisfied. The pruning technique removes the low-importance branches 59. A simple architecture of a DT model is illustrated in Fig. 2.
Gradient boosting (GB). Gradient Boosting (GB) is an effective machine learning technique that can be used in both regression and classification to reduce bias error or overfitting.
Gradient boosting, viewed as functional gradient descent, takes the residual errors generated by the previous learner and adds a new learner to minimize the loss function of the model at each stage of gradient descent. This technique aims to combine a group of weak learners in a stage-wise manner to build a strong learner and, in turn, a more robust model that fits the response variable more accurately. In other words, each new base-learner is built to be correlated with the negative gradient of the loss function associated with the whole ensemble. As the idea behind gradient boosting is to minimize the loss function, a range of loss functions can be used. Assume $h(x, \theta)$ is a custom base-learner and $\Psi(y, f)$ a loss function. Estimating the parameters directly can be difficult; therefore, it is proposed to choose, at iteration $t$, a new function $h(x, \theta_t)$ whose increment is directed by 60,61:
$$ (\rho_t, \theta_t) = \arg\min_{\rho,\, \theta} \sum_{i=1}^{N} \left[ -g_t(x_i) + \rho\, h(x_i, \theta) \right]^{2} $$
where $g_t(x_i)$ is the gradient of the loss function at the training point $x_i$ and $N$ is the number of training points. This converts a potentially difficult optimization problem into a classic least-squares minimization 60,62.
The following are the steps in the GBDT technique process 63:
• Suppose that $f_0$ is a constant.
• Evaluate $g_i(x)$ and train the function $h(x_i, \theta)$.
• Obtain the parameter $\rho_i$ and update the function: $f_i = f_{i-1} + \rho_i\, h(x, \theta_i)$.
The method starts with a single leaf and optimizes the training algorithm for each node and record. Figure 3 shows a schematic example of a conventional GBDT.
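The steps above can be sketched in plain Python for the squared-error loss, where the negative gradient is simply the residual y - f(x). This is a simplified illustration under stated assumptions, not the implementation used in this study: the base learner is a one-split regression stump on a single feature, the step size is a fixed shrinkage factor (`learning_rate`, a hypothetical name) instead of a line-searched ρ, and the data are synthetic.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump: one threshold, two leaf means.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Stage-wise boosting of stumps under squared-error loss."""
    f0 = sum(ys) / len(ys)              # step 1: constant initial model
    learners = []
    preds = [f0] * len(ys)
    for _ in range(n_rounds):
        # step 2: negative gradient of 0.5 * (y - f)^2 is the residual
        neg_grad = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, neg_grad)
        learners.append(stump)
        # step 3: update the ensemble with a fixed shrinkage factor
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def predict(x):
        return f0 + learning_rate * sum(h(x) for h in learners)

    return predict


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Synthetic, monotonically increasing toy data (not measurements).
pressures = [float(p) for p in range(1, 11)]
toy_solubility = [p ** 0.5 for p in pressures]

model = gradient_boost(pressures, toy_solubility)
mean_value = sum(toy_solubility) / len(toy_solubility)
baseline_mse = mse(lambda x: mean_value, pressures, toy_solubility)
boosted_mse = mse(model, pressures, toy_solubility)
```

Because each stump is a least-squares fit to the current residuals, every round lowers the training error, which is the stage-wise behaviour described in the text.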
Adaptive boosting (AdaBoost). The adaptive boosting algorithm presented by Freund and Schapire 64 aims to combine weak classifiers and learn from their mistakes to create a strong classifier. In other words, it selects the training dataset iteratively to combine multiple classifiers and assigns an appropriate weight to each classifier based on its accuracy, while higher weights are assigned to the misclassified or mislabeled samples 65. The following are the general stages of the AdaBoost technique 66,67:
• Determine the weights for the predictors for each i.
• Update the sample weights for each i to N (where N is the number of learners).
• Apply the resulting combination of weak learners to the test data (x).
Support vector regressors (SVR) and Decision Trees (DT) have been used as weak learners in the AdaBoost systems in this study.
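The weight formulas for these stages are not reproduced in the text. As a hedged illustration, the sketch below performs one boosting round under the assumption that the regression variant known as AdaBoost.R2 with a linear loss is used; the text does not confirm which variant the authors applied, and the function name `adaboost_r2_round` is hypothetical.

```python
import math


def adaboost_r2_round(sample_weights, abs_errors):
    """One AdaBoost.R2 round with linear loss (an assumed variant).

    sample_weights: current normalized weights of the training samples.
    abs_errors: absolute prediction errors of the current weak learner.
    Returns (updated_weights, learner_weight).
    """
    max_err = max(abs_errors)
    if max_err == 0.0:
        raise ValueError("weak learner is already exact; boosting can stop")
    losses = [e / max_err for e in abs_errors]      # scaled to [0, 1]
    avg_loss = sum(w * l for w, l in zip(sample_weights, losses))
    if avg_loss >= 0.5:
        raise ValueError("weak learner too inaccurate; boosting should stop")
    beta = avg_loss / (1.0 - avg_loss)              # small beta = good learner
    learner_weight = math.log(1.0 / beta)           # weight of this predictor
    # Well-predicted samples (low loss) are shrunk the most, so poorly
    # predicted samples gain relative weight after normalization.
    updated = [w * beta ** (1.0 - l) for w, l in zip(sample_weights, losses)]
    total = sum(updated)
    return [w / total for w in updated], learner_weight


# Toy usage: four samples, the last one badly predicted.
new_weights, alpha = adaboost_r2_round([0.25] * 4, [0.0, 0.1, 0.2, 1.0])
```

In this variant the final prediction is usually taken as the weighted median of the weak learners' outputs, using `learner_weight` as the weight of each learner.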
In this paper, we have applied ensemble models such as AdaBoost-DT, AdaBoost-SVR, and GB-SVR. To explore the capabilities of different regression methods, AdaBoost and Gradient Boosting, two kinds of ensemble methods, have been used to enhance conventional weak regressors. They combine the outputs of the weak regressors into a weighted combination that determines the output of the resulting strong regressor, and each new weak regressor is automatically directed toward the samples that were estimated incorrectly.
More details are as follows. SVR handles data that cannot be treated linearly by using a non-linear mapping to convert features from a low-dimensional input space into a high-dimensional feature space. This allows the non-linear features of the samples to be analyzed with a linear algorithm in the high-dimensional feature space. However, the choice of kernel function and parameters has a significant impact on its performance 68. The AdaBoost method trains many base learners, and generalization can be further improved by combining them to produce the final strong learner. Furthermore, the decision tree is widely used as a base learning method, but it is inadequate in dealing with non-linear problems, and its prediction accuracy varies substantially 69. The AdaBoost method, on the other hand, is sensitive to anomalous data: anomalous samples may obtain greater weights over the iterations, affecting the prediction accuracy of the strong learner.
When using SVR for sample learning, the model's performance is determined by the kernel function and kernel parameters. Using SVR as the AdaBoost base learner, on the other hand, lowers the influence of the SVR kernel function and parameters. It also overcomes the inability of the standard AdaBoost algorithm to address non-linear problems. This makes the AdaBoost-SVR method appropriate for predicting data with non-linear features while also ensuring the model's generalizability 70. We also combined the GB and SVR algorithms into a single predictive model 71, another meta-algorithm applied in this paper to enhance the overall performance. Gradient Boosting, as an ensemble technique, attempts to create a strong regressor from several weak regressors.
The formulation for the contributions from the dispersion and ideal gas are similar to those of Gross Tables 2 and 3. The critical properties and acentric coefficients of the materials utilized in this study are summarized in Table 4.

Performance analysis of models
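The mathematical descriptions of the statistical parameters employed in this study are summarized below 72,83:
• Average absolute percent relative error (AAPRE): $$ AAPRE = \frac{100}{N} \sum_{i=1}^{N} \left| \frac{S_{iEXP} - S_{iPRED}}{S_{iEXP}} \right| $$
• Standard deviation (SD): $$ SD = \sqrt{ \frac{1}{N-1} \sum_{i=1}^{N} \left( \frac{S_{iEXP} - S_{iPRED}}{S_{iEXP}} \right)^{2} } $$
• Coefficient of determination ($R^2$): $$ R^{2} = 1 - \frac{ \sum_{i=1}^{N} \left( S_{iEXP} - S_{iPRED} \right)^{2} }{ \sum_{i=1}^{N} \left( S_{iEXP} - \bar{S}_{EXP} \right)^{2} } $$
• Root mean square error (RMSE): $$ RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( S_{iEXP} - S_{iPRED} \right)^{2} } $$
In the above equations, $S_{iEXP}$, $S_{iPRED}$, $\bar{S}_{EXP}$, and $N$ refer to the experimental solubility, the predicted solubility, the mean experimental solubility, and the total number of data points, respectively.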
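For illustration, the four statistical parameters can be computed as in the Python sketch below. It assumes the commonly used definitions: AAPRE and SD are based on the relative error (experimental minus predicted, divided by experimental), SD uses N - 1 in the denominator, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares about the experimental mean. The function names and the toy numbers are illustrative and are not results from this study.

```python
import math


def aapre(exp, pred):
    """Average absolute percent relative error, in percent."""
    return 100.0 / len(exp) * sum(abs((e - p) / e) for e, p in zip(exp, pred))


def sd(exp, pred):
    """Standard deviation of the relative errors (N - 1 in the denominator)."""
    squared = sum(((e - p) / e) ** 2 for e, p in zip(exp, pred))
    return math.sqrt(squared / (len(exp) - 1))


def r2(exp, pred):
    """Coefficient of determination about the experimental mean."""
    mean_exp = sum(exp) / len(exp)
    ss_res = sum((e - p) ** 2 for e, p in zip(exp, pred))
    ss_tot = sum((e - mean_exp) ** 2 for e in exp)
    return 1.0 - ss_res / ss_tot


def rmse(exp, pred):
    """Root mean square error, in the units of the solubility data."""
    return math.sqrt(sum((e - p) ** 2 for e, p in zip(exp, pred)) / len(exp))


# Toy values only, chosen so the arithmetic is easy to follow.
s_exp = [1.0, 2.0, 4.0]
s_pred = [1.1, 1.8, 4.0]
print(aapre(s_exp, s_pred), sd(s_exp, s_pred), r2(s_exp, s_pred), rmse(s_exp, s_pred))
```

Note that AAPRE and SD are scale-free because they use relative errors, whereas RMSE depends on the magnitude of the solubility values, so the two kinds of metric can rank models differently.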
Also, several graphical analyses, namely, cross-plot, relative error distribution diagram, cumulative frequency plot, and trend plot were utilized to visually evaluate the developed models. Descriptions of these analyses can be found elsewhere 72 .

Results and discussion
Statistical evaluation of models. The models discussed in the previous sections have been developed to predict the solubility of CO2–N2 mixtures in water utilizing 289 laboratory data points. In this study, we have employed six algorithms that have rarely been used for this purpose to estimate CO2–N2 gas mixture solubility in water. The structure of the models was modified, and the grid search algorithm was used to optimize the hyperparameters of the models to avoid overfitting in this particular problem. The hyperparameters obtained by the grid search are different for each model; they were chosen according to their importance from theoretical and practical standpoints. The total data set was divided randomly in an 80/20 ratio for the training and testing phases. The experimental data and the predictions of the different models are presented in the Supplementary file. The calculated statistical parameters of the models are summarized in Table 5, which reports RMSE, R2, SD, and AAPRE. GB-SVR outperforms the other models except for Random Forest because SVR is more like a soft fabric that can bend and fold in whatever way is needed to better fit the data. This gives more degrees of freedom and flexibility, so that a more accurate model can be achieved. Moreover, SVR can capture the non-linear relationships between variables. The performance of the model is further improved by tuning the hyperparameters. These are the main reasons that GB-SVR has shown a higher accuracy. Random Forest achieved the highest accuracy in this study, even higher than GB-SVR. Random Forest is built for multiclass problems, whereas SVM is intended for two-class problems. In SVM, a multiclass problem must be broken down into numerous binary classification tasks. Random Forest performs well with a combination of numerical and categorical variables. Also, in classification problems, normalization or scaling is not necessary with Random Forest.
SVM seeks to maximize the "margin," relying on the idea of "distance" between points, and it is up to the user to decide whether this "distance" is meaningful. As a consequence, one-hot encoding of categorical features is required, and min-max or other scaling is highly recommended in the preprocessing step. SVMs are good for a specific set of problem types when given a specific set of data, but they do not perform well for many others. Random forests, in contrast, are unexpectedly effective for a wide range of problems; because they are built on trees, the variables do not need to be scaled. A tree inherently captures any monotonic transformation of a single variable, and feature selection is built into random forests and performed automatically 84. According to Table 5, the Random Forest model, with an AAPRE value of 2.84%, gives the most accurate prediction of the solubility of CO2–N2 mixtures in water. The GB-SVR and DT models, with AAPRE values of 6.43% and 7.41%, respectively, come closest to the Random Forest model. However, it should be noted that the other models also give relatively good results. Another noteworthy point is that the high accuracy of a model in predicting outputs may sometimes be due to overtraining. To ensure that this has not happened, the results for the training and test data should be compared with each other. If the difference between the statistical parameters of the training and test data is significant, the model may be over-trained. If the results for the training and test data are close to each other, it can be stated that over-training has not happened. As the results show, the statistical parameters for the training and test data are very close.
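A random 80/20 split and a simple train-versus-test comparison can be sketched as follows. The helper names, the seed, and the use of a relative gap as the comparison measure are assumptions made for illustration; the text does not specify how the split was seeded or how "close" was quantified.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle the indices 0..n_samples-1 and split them into train and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # seed is an arbitrary choice
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def relative_gap(train_metric, test_metric):
    """Relative difference between a test-set and a training-set metric.

    A large gap (for example a test AAPRE far above the training AAPRE)
    is a warning sign of over-training.
    """
    return abs(test_metric - train_metric) / abs(train_metric)


# 289 is the number of data points reported in this study.
train_idx, test_idx = train_test_split_indices(289)
print(len(train_idx), len(test_idx))
```

With 289 data points this split assigns 231 points to training and 58 to testing; the same error metric would then be computed on both subsets and compared.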
To evaluate the performance of the artificial intelligence methods in comparison with mathematical methods, four equations of state, namely SRK, PR, VPT, and PC-SAFT, have been used. For this purpose, the solubility of CO2 and N2 in different CO2 + N2 + H2O (brine) systems was calculated using 24 laboratory data points extracted from the literature 10, and the results are reported in Table 6 and Table 7. As shown in Tables 6 and 7, the AAPRE values obtained for the SRK and PR equations of state are much higher than those of the VPT and PC-SAFT equations of state and the intelligent models. For the solubility of CO2 in aqueous solutions, the Random Forest approach outperforms the other intelligent techniques with an AAPRE value of 1.16%, and the PC-SAFT model has the best results among the EOSs with an AAPRE value of 3.35%. For the solubility of N2 in aqueous solutions, the Random Forest technique has the best results among the intelligent approaches with an AAPRE value of 4.13%, and the VPT model has the best results among the EOSs with an AAPRE value of 5.71%. Figure 4 shows the cross-plot diagrams for the six models presented in this study. In this graph, where the predicted results are plotted against the actual values, a tighter clustering of the data around the Y = X line indicates that the estimated values are closer to the actual values and, therefore, that the model is more accurate. In addition, the R2 value for such a dataset will be close to 1. As shown in Fig. 4, the Random Forest model is in a better position than the other models, which also confirms the results reported in Table 5. Figure 5 shows the error distribution diagram for the developed models. This diagram shows the relative error on the Y-axis and the experimental data on the X-axis. The more tightly the points cluster around the zero line, the lower the error of the predicted data.
This diagram also makes the range of the relative error across the experimental data visible; for example, it shows how the relative error changes as the experimental value increases. As shown in Fig. 5, the Random Forest model performs better and shows lower relative errors than the other models.
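The two quantities behind the cross-plot and the error distribution diagram can be computed as in the sketch below. It assumes the standard coefficient of determination and a signed percent relative error of the form (experimental − predicted)/experimental; the sign convention, the function names, and the sample values are assumptions made for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def percent_relative_errors(actual, predicted):
    """Signed percent relative error per point (assumed sign convention)."""
    return [100.0 * (a - p) / a for a, p in zip(actual, predicted)]


# Made-up values for illustration only (not data from the study).
experimental = [1.0, 2.0, 3.0, 4.0]
estimated = [1.1, 1.9, 3.2, 3.9]

r2 = r_squared(experimental, estimated)
errors = percent_relative_errors(experimental, estimated)
print(f"R^2 = {r2:.3f}")
print("relative errors (%):", [round(e, 2) for e in errors])
```

Plotting `estimated` against `experimental` gives the cross-plot, and plotting `errors` against `experimental` gives the error distribution diagram.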

Graphical analysis of models.
A cumulative frequency graph is one of the most useful diagrams for comparing the performance of several models simultaneously. Figure 6 shows the cumulative frequency diagram for the different models. This diagram plots the cumulative fraction of data points against the absolute relative error, so the higher a model's curve lies relative to the curves of the other models, the higher its accuracy. In other words, if one model's curve lies above another's at a given absolute relative error, a larger percentage of the data is predicted by that model with an error at or below that value. The higher the curve of one model at small absolute relative errors (close to 1), the larger the share of data predicted with low absolute relative error and the more accurate the model. Therefore, according to Fig. 6, the Random Forest model performs better than the other models and has a higher accuracy, which also confirms the results presented in Table 5.
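The cumulative frequency curve described above can be sketched as follows. The function name, the error thresholds, and the two sets of hypothetical predictions ("model A" and "model B") are illustrative assumptions, not values from the study.

```python
def cumulative_frequency(actual, predicted, thresholds):
    """Fraction of points whose absolute percent relative error
    is at or below each threshold (thresholds given in percent)."""
    abs_errors = [100.0 * abs((a - p) / a) for a, p in zip(actual, predicted)]
    n = len(abs_errors)
    return [sum(e <= t for e in abs_errors) / n for t in thresholds]


# Made-up values for illustration only (not data from the study).
experimental = [1.0, 2.0, 4.0, 5.0]
model_a = [1.01, 2.02, 4.2, 5.5]   # errors of about 1, 1, 5, 10 percent
model_b = [1.05, 2.3, 4.4, 5.1]    # errors of about 5, 15, 10, 2 percent
thresholds = [3.0, 7.0, 12.0]

curve_a = cumulative_frequency(experimental, model_a, thresholds)
curve_b = cumulative_frequency(experimental, model_b, thresholds)
print("model A:", curve_a)
print("model B:", curve_b)
```

In this toy case the curve for model A lies on or above the curve for model B at every threshold, which is the pattern the text associates with the more accurate model.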
Trend analysis. Investigating how solubility changes with different parameters gives a better understanding of the solubility of CO2–N2 mixtures in water and brine solutions. In addition, the validity of the developed models can be examined by comparing the trends in the laboratory data, the equations of state, and the predicted data. For example, when an input parameter produces an increasing trend in the experimental data, the developed models should show the same trend. Agreement of this kind strengthens the validity of the developed model. The trend analysis of the various parameters is examined below. Figure 7 shows the effect of pressure on the solubility of CO2 and N2 for a gas mixture of 39% N2 and 61% CO2 in an aqueous system at 283 K. In this figure, the change in solubility with pressure is examined using laboratory data, the predictions of the Random Forest model (the best model), and the equations of state. According to Fig. 7a and b, all methods show an increasing trend. What differs among the methods is the degree to which they overestimate or underestimate the solubility. Another noteworthy point is the excellent agreement of the Random Forest predictions with the experimental data, which confirms the efficiency of the intelligent models. As shown in Fig. 7a, the curves of the equations of state generally lie above the curve of the experimental data, which indicates that these equations overestimate the solubility of CO2 in this system. Figure 7b also shows that the curve predicted by the Random Forest model conforms to the experimental data; here, however, the PR EOS overestimates the solubility of N2 in this system and the other equations of state underestimate it, although the VPT EOS agrees closely with the experimental data.
Again, for the solubility of CO2 present in gaseous mixtures in aqueous systems, the PC-SAFT model, and for the solubility of N2, the VPT model had the best results among the EOSs. Figure 8 shows the effect of the CO2 content of the gas mixture on the solubility of CO2 and N2 in an aqueous system at a temperature of 308 K and a pressure of 8 MPa, as experimentally investigated in the literature 18. As expected, at constant temperature and pressure, increasing the amount of CO2 in the gas mixture reduces the solubility of N2 in the system and increases the solubility of CO2. As is clear, the solubility of N2 in water is lower than that of CO2. Figure 9 shows the effect of pressure on the solubility of CO2 and N2 in a system containing 85.4% N2 and 14.6% CO2 in water at 303 K for the Random Forest model and laboratory data 10. As shown in Fig. 9, increasing the pressure increases the solubility of both CO2 and N2 in the system, although the effect is more significant for CO2. Figure 10 shows the effect of pressure on the solubility of CO2 and N2 in aqueous systems of different salinity (pure water, 5% NaCl brine, and 15% NaCl brine). Both Fig. 10a and b show the effect of salinity. For both CO2 and N2, increasing the pressure increases the solubility, whereas increasing the salinity decreases it. Therefore, increasing the concentration of NaCl in water, in other words increasing the ionic strength of the solution, reduces the solubility of CO2 and N2. This reduction is caused by the salting-out phenomenon, in which the dissolved electrolytes reduce the capacity of water to dissolve gas.
As salinity increases, more water molecules are attracted to the salt ions, reducing the amount of H⁺ and O²⁻ ions available to gather and separate the gas molecules and thus lowering the solubility of CO2 and N2 in the water 85.
Input parameters impact analysis. To study the influence of the input parameters on the output of the model, a parameter called the relevancy factor was used. The relevancy factor is calculated as given in ref. 86 and is interpreted as follows:
(a) If the relevancy factor < 0, the impact of the input parameter on the output is decreasing. In other words, increasing the parameter decreases the value of the output. The closer the relevancy factor is to −1, the stronger the influence.
(b) If the relevancy factor = 0, there is no relation between the input parameter and the output, or the relation is not monotonic.
(c) If the relevancy factor > 0, the impact of the input parameter on the output is increasing. In other words, increasing the parameter also increases the value of the output. The closer the relevancy factor is to 1, the stronger the influence.
Figure 11 shows the relevancy factor values for the input parameters of the Random Forest model, the best model. According to this figure, temperature, pressure, and the mole percent of CO2 in the gaseous phase have an increasing impact on the solubility of the CO2–N2 mixture in aqueous solutions, and ionic strength has a decreasing impact. Among the parameters with positive relevancy factors, the mole percent of CO2 in the gaseous phase, with a relevancy factor of 0.61, has the most significant impact. Therefore, the solubility of the CO2–N2 mixture in water and brine solutions increases with increasing temperature, pressure, and mole percent of CO2 in the gaseous phase, and decreases with increasing ionic strength.
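The sign-based reading of the relevancy factor can be illustrated with a short sketch. The defining equation does not appear in this section, so the code assumes the commonly used form, the Pearson correlation coefficient between one input and the output, $r = \sum_i (I_i - \bar{I})(O_i - \bar{O}) / \sqrt{\sum_i (I_i - \bar{I})^2 \sum_i (O_i - \bar{O})^2}$. That form, the function name, and the sample values are assumptions and may differ from the definition in the cited reference.

```python
import math


def relevancy_factor(inputs, outputs):
    """Pearson correlation between one input variable and the output
    (assumed form of the relevancy factor)."""
    n = len(inputs)
    mean_in = sum(inputs) / n
    mean_out = sum(outputs) / n
    cov = sum((x - mean_in) * (y - mean_out) for x, y in zip(inputs, outputs))
    var_in = sum((x - mean_in) ** 2 for x in inputs)
    var_out = sum((y - mean_out) ** 2 for y in outputs)
    if var_in == 0 or var_out == 0:
        raise ValueError("relevancy factor is undefined for a constant series")
    return cov / math.sqrt(var_in * var_out)


# Made-up values for illustration only (not data from the study).
pressure = [2.0, 4.0, 6.0, 8.0]
solubility_vs_pressure = [0.005, 0.010, 0.015, 0.020]
ionic_strength = [0.0, 1.0, 2.0, 3.0]
solubility_vs_ionic = [0.020, 0.016, 0.013, 0.011]

r_pressure = relevancy_factor(pressure, solubility_vs_pressure)
r_ionic = relevancy_factor(ionic_strength, solubility_vs_ionic)
print(f"pressure: {r_pressure:+.3f}, ionic strength: {r_ionic:+.3f}")
```

A positive value corresponds to case (c) and a negative value to case (a); the magnitude indicates how strong the linear association is.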
Implementation of Leverage method. The Leverage method 88–90 was used to determine the applicability domain of the constructed Random Forest model and to identify any suspect data. The Leverage method, which is applied both analytically and graphically through the Williams plot, is one of the most important approaches to outlier diagnosis. In this method, standardized residuals (R), which reflect the deviation of the model's outputs from the experimental observations, and leverage values, which are the diagonal elements of the hat matrix, are determined. The hat matrix is defined as follows 83:
$$H = X\left(X^{T}X\right)^{-1}X^{T}$$
Here, X is an (m × n) matrix, $X^{T}$ denotes its transpose, and m and n denote the number of data points and model input variables, respectively. In addition, the critical leverage (H*) is set to 3(n + 1)/m.
The proposed model's applicability domain is then evaluated graphically by plotting the standardized residuals against the leverage values. If most of the data points lie within the limits 0 ≤ H ≤ H* and −3 ≤ R ≤ 3, the model is considered trustworthy and its estimations fall within the applicability domain 91.
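The leverage calculation and the Williams plot limits can be sketched as below. The sketch uses the standard hat matrix H = X(XᵀX)⁻¹Xᵀ and the critical leverage 3(n + 1)/m; the helper names, the small design matrix, and the choice n = 5 in the final example are assumptions made for illustration (the number of inputs is not stated in this section, and 5 is used only because, with m = 289 data points, it gives a threshold of about 0.062).

```python
def solve(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def leverages(x):
    """Diagonal of H = X (X^T X)^-1 X^T for an m-by-n matrix (list of rows)."""
    n = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(n)] for i in range(n)]
    return [sum(xi * zi for xi, zi in zip(row, solve(xtx, row))) for row in x]


def critical_leverage(m, n):
    """Critical leverage H* = 3(n + 1)/m."""
    return 3.0 * (n + 1) / m


def classify(h, r, h_star):
    """Label a point by its position in the Williams plot."""
    if abs(r) > 3:
        return "suspected"
    if h > h_star:
        return "out of leverage"
    return "valid"


# Made-up design matrix for illustration only (not data from the study).
x = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
h = leverages(x)
h_star = critical_leverage(289, 5)  # n = 5 is an assumption, see lead-in
print([round(v, 3) for v in h], round(h_star, 3))
print(classify(0.03, 1.2, h_star), classify(0.03, -3.4, h_star), classify(0.09, 0.5, h_star))
```

Solving (XᵀX)z = xᵢ for each row avoids forming the explicit inverse, which is a design choice of this sketch rather than part of the method itself.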
The Williams plot in Fig. 12 is then used to determine the applicability domain of the Random Forest model and to identify outliers. As shown in Fig. 12, the majority of the data fall within 0 ≤ H ≤ 0.062 and −3 ≤ R ≤ 3, indicating that the experimental results are of excellent quality and the Random Forest model is quite reliable. Suspected data are data points with R > 3 or R < −3, which carry a high level of doubt. Out-of-leverage data are data points with H > 0.062 and −3 ≤ R ≤ 3; they lie beyond the Random Forest model's applicability range. Only nine data points were identified as suspected data, and one outlier exists in the solubility databank, which supports the high validity of the experimental databank used for modelling. Eight suspected data points along with one

Conclusions
In this study, 289 laboratory data points and six intelligent models (DT, GB-DT, AdaBoost-DT, AdaBoost-SVR, GB-SVR, and Random Forest) were used to predict the solubility of CO2 and N2 in systems of CO2–N2 mixtures and aqueous solutions. Comparing their results with thermodynamic models (SRK, PR, VPT, and PC-SAFT) led to the following conclusions:
1. Among the presented models, the Random Forest model, with an AAPRE value of 2.84%, has the best results. The GB-SVR and DT models have the next closest predictions, with AAPRE values of 6.43% and 7.41%, respectively. These are followed, in order of accuracy, by AdaBoost-SVR, GB-DT, and AdaBoost-DT. Therefore, intelligent models are very efficient and reliable compared to equations of state.
2. Generally, the equations of state used in this work overestimate the solubility of CO2 in the aqueous system as pressure increases; for N2 the opposite holds, with all equations of state except PR underestimating the solubility.
5. Increasing the water salinity reduces the solubility of CO2 and N2 in water.
6. The mole percent of CO2 in the gaseous phase, temperature, and pressure have an increasing impact on the solubility of CO2 and N2 in water, and ionic strength has a decreasing impact.