Introduction

Interfacial tension (IFT) is a parameter of interest in petroleum and chemical science and engineering1,2,3. It plays a vital role in multiphase flow, separation processes, formation and stability of emulsions, fluid transportation, and reservoir engineering processes like fluid contacts, fluid saturation distribution, recovery mechanisms, and enhanced oil recovery (EOR) processes4,5,6,7,8.

Capillary pressure plays a critical role in oil recovery at all stages of production from oil reservoirs. The capillary number (Nc) concept and equation are used to investigate the effect of capillary pressure on oil recovery from the reservoir. The general form of the capillary number is defined as follows9, 10:

$$N_{c}=\frac{\text{viscous force}}{\text{capillary force}}=\frac{\mu v}{\sigma \cos \theta }$$
(1)

where θ is the contact angle, σ is the IFT between the wetting and non-wetting phases, μ is the viscosity of the displacing phase, and v is the Darcy velocity. The oil saturation remaining in the porous medium correlates strongly with the capillary number9,10,11,12,13,14,15. Researchers have concluded that increasing the capillary number increases oil recovery16. The capillary number has also been treated as the primary variable in several modeling and simulation studies related to IFT and wettability alteration15, 17, 18. According to Eq. (1), decreasing the IFT increases the capillary number, and EOR techniques mobilize residual oil by optimizing the capillary number18,19,20,21. Surfactants are amphiphilic molecules that are soluble in both organic solvents and water22. A surfactant reduces the IFT between oil and water by adsorbing at the liquid–liquid interface23, 24. It has been found that an oil droplet on the meniscus can be attracted to the wall when surfactant is added to water, while bubbles always move towards the walls; IFT and gravity play significant roles in these cases25, 26. Researchers have conducted numerous laboratory studies on the ability of surfactants to reduce the IFT between aqueous solutions and oil for use in EOR techniques21, 27,28,29. Measuring the IFT of a water–hydrocarbon interface in the presence of surfactants is therefore of great interest for surfactant flooding. Several parameters affect the IFT between a surfactant-containing solution and hydrocarbons and must be considered: experimental studies have shown that the surfactant type28, 30 and concentration31, the temperature of the aqueous solution32, and the hydrocarbon composition33, 34 can all affect the IFT. Surfactants are divided into two general categories, ionic and nonionic. Ionic surfactants carry a positive charge, a negative charge, or both, and are classified as cationic, anionic, and amphoteric surfactants, respectively35; nonionic surfactants carry no electric charge. The interfacial behavior of surfactants varies with their structure2, 35, 36. Therefore, many researchers have investigated the role of surfactant structure in reducing the IFT between hydrocarbons and aqueous solutions. Strey37 showed that the IFT of a surfactant system passes through a minimum as temperature increases. Below this minimum, increasing the temperature increases the number of surfactant molecules at the hydrocarbon–aqueous interface and thereby reduces the IFT37. Also, increasing the surfactant concentration up to the critical micelle concentration (CMC) reduces the IFT35, 38.
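To put Eq. (1) in perspective, a purely illustrative calculation with assumed, representative waterflood values (not data from the present work) of μ = 10−3 Pa·s, v = 10−5 m/s, σ = 30 mN/m, and cos θ = 1 gives

$$N_{c}=\frac{\left(10^{-3}\ \text{Pa·s}\right)\left(10^{-5}\ \text{m/s}\right)}{30\times 10^{-3}\ \text{N/m}}\approx 3\times 10^{-7}$$

whereas lowering σ to 10−3 mN/m with a suitable surfactant would raise Nc to about 10−2, which illustrates why IFT reduction is central to surfactant-based EOR.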

The most reliable way to determine the IFT between surfactant solutions and hydrocarbons is laboratory measurement. Laboratory methods for measuring IFT include the drop weight method39, 40, the pendant drop method41,42,43, the spinning drop method44,45,46, etc. These laboratory methods are time-consuming and, considering the price of the chemicals used for the IFT test and the cost of running the tests, they are also expensive. Therefore, developing a model for predicting the IFT between surfactants and hydrocarbons is attractive and practical. Previous studies have described the effect of surfactants on the interface of two fluids with the surface equation of state47, 48. The surface equation of state is a relationship between the surface concentration of surfactant and the surface pressure48. The difference between the IFT without surfactant and the IFT after adding a surfactant to the solution equals the surface pressure48. Also, the concentration of surfactant molecules at the surface is defined by surface adsorption49. Different approaches to obtaining such equations of state have been discussed in the literature. The Szyszkowski equation50, the Frumkin equation51, and the Langmuir model52 are three examples of semi-empirical surface equations of state. The Langmuir model has been used to describe the adsorption of nonionic surfactants at the interface between hydrocarbons and aqueous solutions; however, it cannot predict the effect of a mixed-surfactant solution on IFT, and it is not suitable for describing the interfacial behavior of surfactants in the presence of inorganic ions53. Mulqueen and Blankschtein (2002)54 developed a different molecular-thermodynamic approach to predict surfactant adsorption at the oil/water and air/water interfaces. They evaluated the validity of this model on only a limited number of laboratory data, and it was valid only for the decane/water interface54. Bahramian and Zarbakhsh (2015)48 estimated the IFT of ionic surfactants using a model that takes the size of the surfactant molecule and the CMC of the surfactant in the aqueous solution as independent variables; a laboratory test set is required to obtain the CMC48. In a recent study, Nikseresht et al. (2019)55 used the Butler equation to estimate the IFT between ionic surfactants and normal alkanes as the oil phase. Their model is based on the surface equation of state, and two fitting parameters must be set for each case; in other words, a model fitted to one case is not suitable for other conditions. They also examined Bahramian and Zarbakhsh’s48 equation for various surfactants and concluded that it could not be used satisfactorily to predict IFT in the presence of C10TAB and C12TAB55. In summary, thermodynamic models for estimating IFT have the following limitations: (1) they require laboratory tests to calculate the input parameters; (2) a model fitted to one surfactant does not apply to other cases and conditions; (3) these models were evaluated on limited experimental data; and (4) not all parameters affecting IFT were considered. On the other hand, machine learning methods can model and solve complex numerical problems in industry56,57,58. Predicting the IFT between two immiscible fluids using intelligent methods has been considered by many researchers59,60,61,62. In previous studies, the ability of intelligent methods to estimate the IFT between hydrocarbons and aqueous solutions was evaluated63,64,65,66,67.
It was found that intelligent methods are appropriate for this purpose. As the literature review shows, IFT studies in recent years have focused mostly on the aqueous phase and hydrocarbons, and the role of surfactants has received less attention. To the best of the authors’ knowledge, no reliable and comprehensive model has been presented for this case. Therefore, there is an opportunity to develop a reliable model for predicting the IFT of surfactant–hydrocarbon systems.

This study aims to develop accurate and reliable models to estimate the IFT between ionic surfactants and normal alkanes. Decision tree (DT), extra trees (ET), and gradient boosted regression trees (GBRT) models are implemented for this purpose. Temperature, normal alkane molecular weight, surfactant concentration, hydrophilic–lipophilic balance (HLB), and phase inversion temperature (PIT) are selected as the inputs (independent variables), and the IFT between the surfactant solution and normal alkanes is the output (dependent variable). Sample data are collected from the literature to train and evaluate the models, and the trial-and-error method is used to optimize the implemented models. The models are evaluated using statistical analysis and graphical error analysis. Furthermore, a sensitivity analysis is performed to show how changes in the model inputs affect the IFT values. Finally, the leverage method is applied to assess the credibility of the gathered IFT databank and the accuracy and dependability of the best model for estimating the IFT between ionic surfactants and normal alkanes. Hence, the main contributions of this research are as follows:

  • Collecting an extensive database of the IFT of surfactant–hydrocarbon systems, including important parameters such as HLB and PIT, which allow a better characterization of the surfactants.

  • Development of accurate models with low error using robust tree-based machine learning algorithms.

  • Performing sensitivity analysis to determine the relative effect of temperature, normal alkane molecular weight, surfactant concentration, HLB, and PIT on the IFT of surfactant–hydrocarbon systems.

  • Implementation of the leverage method to identify suspicious and outlier data related to the IFT of surfactant–hydrocarbon systems reported in the literature.

Data gathering

In order to develop the models, 390 sample data points were collected from previous studies68,69,70,71,72,73,74,75,76. The dataset contains temperature, normal alkane molecular weight, surfactant concentration, HLB, and PIT. Five different surfactants were used in this paper, and their specifications are presented in Table 1. Also, n-hexane, n-heptane, n-octane, n-nonane, n-decane, n-undecane, n-dodecane, n-tetradecane, and n-heptadecane were used as normal alkanes. HLB and PIT were used to represent the type of surfactant. The HLB value quantifies the hydrophilicity and lipophilicity of a surfactant77, 78. Researchers have developed various methods for calculating the HLB79, 80; in this study, the method introduced by Davies (1957)80 was used. Davies’ method calculates the HLB from group numbers as follows:

Table 1 Characteristics of surfactants used in this study.
$$\text{HLB}_{\text{Davies}} = 7 + \sum \left(\text{hydrophilic group numbers}\right) - \sum \left(\text{lipophilic group numbers}\right)$$
(2)
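As a hypothetical worked example (the group numbers quoted here are the commonly tabulated Davies values, not data from the present work): for SDS, a single sulfate sodium-salt head group (group number ≈ 38.7) and twelve CH2/CH3 groups (≈ 0.475 each) give

$$\text{HLB}_{\text{SDS}} \approx 7 + 38.7 - 12\times 0.475 = 40.0$$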

The hydrophilic and lipophilic group numbers are obtained from the tables provided by Davies (1957)80. The HLB values of the surfactants used in this study are presented in Table 1. Another parameter used to characterize the surfactants is the PIT, which depends on the structure of the surfactant76. For better visualization, the structures of the surfactants utilized in this work are depicted in Fig. 1. The PIT values of the five surfactants used in this study are presented in Table 1. Moreover, the statistical parameters of the databank used in this work are presented in Table 2. The collected data were divided into training and testing subsets for model development. In all modeling techniques, 80% of the sample dataset was used to train the algorithms, and the remaining 20% was used to evaluate the models’ performance. The data were divided randomly at this stage.
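A minimal sketch of the data handling described above is given below; the file name ift_databank.csv and the column names are hypothetical placeholders rather than artifacts of the original work.

```python
# Minimal sketch of the random 80/20 train/test split described above.
# File name and column names are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("ift_databank.csv")                       # 390 collected samples
X = data[["temperature", "alkane_mw", "surfactant_conc", "HLB", "PIT"]]
y = data["IFT"]                                              # target: IFT (mN/m)

# 80% of the samples train the algorithms, 20% evaluate them; the split is random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, shuffle=True
)
```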

Figure 1

Structures of surfactants used in this study.

Table 2 Statistical parameters of the databank.

Model development

Decision tree

The DT is used for regression and classification problems81. A simple structure of the DT is depicted in Fig. 2. A regression DT can predict numerical responses corresponding to independent variables, and such algorithms can handle complex datasets. Decision trees are intuitive and interpretable82, 83. Decision tree regression (DTR) repeatedly divides the initial input space into smaller subsets and incrementally builds the final DT from decision and leaf nodes. ID3, C4.5, C5, and Classification and Regression Tree (CART) are standard DT algorithms. C4.5 is an improved version of ID3 with the following advantages: (1) it can work with incomplete data, (2) it can use pruning to prevent over-fitting, and (3) it accepts both discrete and continuous features. CART is very similar to C4.5; the difference is that CART does not generate rule sets and can also solve regression problems84. In this study, an optimized version of the CART algorithm was used.

Figure 2

A simple structure of the DT.

Dividing nodes during training is one of the most critical parts of implementing a DT algorithm. CART uses the Gini coefficient to divide the nodes85. The DT implemented in this study is built in four stages85. In the first stage, the DT grows by splitting nodes: after the training data are divided into two parts, the same logic divides these subsets again, and so on. The DT greedily searches for an optimal split; it repeats the data segmentation at each step without checking whether the split leads to lower impurity in later steps. Each node is assigned a predicted target based on the distribution of the target over the sample data in that node. This process continues until no partition that reduces the impurity can be found, or until the tree reaches its maximum size, at which point the DT’s growth stops; the maximum depth must therefore be optimized for this algorithm. In the second stage, after the tree has grown to its maximum size, the tree-building process stops; at this stage, the DT may not accurately predict the target value on the test data. The third stage involves pruning and simplifying the tree, which improves its predictions. In the fourth stage, the best tree is selected from the pruned trees.

The DT thus continuously divides the data into smaller partitions so that the data within each partition become homogeneous. The splitting rules may be set to optimize a criterion related to the target’s predictive value, or to minimize the local node impurity of the dependent variable over the training set.

Choosing the number of training data points for the DT is a significant issue because too few samples cause over-fitting. Adding a level to the tree may increase the number of samples required to learn the DT, so the size of the tree should be controlled to prevent over-fitting86. The main parameters for DT optimization are the tree depth, the minimum number of samples required to split a node, and the minimum number of samples per leaf. Ensemble methods can prevent over-fitting in the DT algorithm87, 88.
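As a sketch only, the regression CART described above, with the three control parameters just mentioned, corresponds to a tree regressor of the kind provided by scikit-learn. The settings shown are placeholders to be tuned (max_depth = 7 matches the best value later reported in Table 3), and X_train/y_train refer to the hypothetical split sketched in the data-gathering section.

```python
# Sketch of a CART-style regression tree with the three control parameters
# discussed above (depth, minimum samples to split, minimum samples per leaf).
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(
    max_depth=7,            # tree depth; must be tuned to avoid over/under-fitting
    min_samples_split=2,    # minimum samples required to split an internal node
    min_samples_leaf=1,     # minimum samples required at a leaf node
    random_state=42,
)
dt.fit(X_train, y_train)    # X_train, y_train from the 80/20 split sketched earlier
ift_pred_dt = dt.predict(X_test)
```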

Extra trees

The ET creates a stronger model by combining several decision trees and is one of the ensemble methods. In the ET method, nodes are split by choosing cut points completely at random, each DT grows independently, and all learning samples are used to grow the trees. The predicted target values are then aggregated, and the final prediction is the arithmetic mean of the values predicted by the base models89. The ET algorithm thus builds an ensemble model based on explicit randomization of cut points and feature combinations, combined with averaging. Using all learning data to build the base models also minimizes the bias of the final model90. Assuming the trees are balanced, the complexity of tree growth in the ET algorithm is similar to that of other tree-growing methods89. The algorithm has three parameters: Nmin, the minimum sample size required to split a node; K, the number of attributes selected at random at each node; and M, the number of trees used as base models. These parameters must be optimized to develop a more robust model, and each has a different effect. The value of Nmin governs how the model averages out output noise: the larger Nmin, the smaller the trees, so the variance decreases and the bias increases. The minimum sample size for node splitting should therefore be optimized according to the amount of output noise; in regression problems, high noise levels lead to over-fitting. Geurts et al.89 suggested using a larger value of Nmin to build a stronger model when the data are noisier; in other words, to optimize the ET under high-noise conditions, Nmin should be increased. The number of selected attributes K determines the strength of the attribute-selection process, and its maximum value is the number of input features of the model. A low value of K increases the randomness of the trees and makes their structure less dependent on the target values of the learning samples; if K is set to 1, the splits are selected completely independently of the target variable. Conversely, if K equals the number of features in the learning data, the features are no longer selected randomly, and randomization is applied only through the choice of cut points89. A schematic structure of the ET algorithm is illustrated in Fig. 3, and a code sketch follows the figure.

Figure 3

Schematic structure of the ET.
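A minimal sketch of the ET scheme described above, mapping M, K, and Nmin onto the corresponding parameters of a generic extra-trees regressor (scikit-learn naming is assumed; the values are illustrative, not the tuned ones of this work):

```python
# Sketch of the ET algorithm: M randomized trees grown on the full learning set,
# with K candidate features per node and Nmin samples required to split a node.
from sklearn.ensemble import ExtraTreesRegressor

et = ExtraTreesRegressor(
    n_estimators=100,        # M: number of base trees (illustrative value)
    max_features=5,          # K: attributes drawn at random at each node (<= number of inputs)
    min_samples_split=2,     # Nmin: larger values smooth noisier outputs
    bootstrap=False,         # each tree uses all learning samples, as in Geurts et al.
    random_state=42,
)
et.fit(X_train, y_train)
ift_pred_et = et.predict(X_test)   # final prediction = mean over the M trees
```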

Gradient boosted regression trees (GBRT)

Boosting is another ensemble method that combines several weak learners to create a stronger model for target prediction85. It is used to solve both regression and classification problems. The weak learners are trained one after the other, each focusing on correcting the errors of the previous one91. In this study, the GBRT was used, in which the DT serves as the base learner.

In a modeling problem, one has a system consisting of a set of random “explanatory” or “input” variables x = {x1, …, xn} and a random “response” or “output” variable y. The goal is to construct a function F*(x) that relates y to x:

$$F^{*}\left(x\right)=\underset{F\left(x\right)}{\arg \min }\;E_{y,x}\,\Psi \left(y,F\left(x\right)\right)$$
(3)
$$F\left(x\right)=\sum_{m=0}^{M}\beta_{m}h\left(x;a_{m}\right)$$
(4)

where the expected value of a specified loss function \(\Psi \left(y,F\left(x\right)\right)\) is minimized. Here, \(h\left(x;{a}_{m}\right)\) is a simple function of \(x\) defined as the base learner. The expansion coefficients \(\left\{{\beta }_{m}\right\}_{0}^{M}\) and the parameters \(\left\{{a}_{m}\right\}_{0}^{M}\) are jointly fit to the training data in a forward “stage-wise” manner:

$$\left(\beta_{m},a_{m}\right)=\underset{\beta ,a}{\arg \min }\sum_{i=1}^{N}\Psi \left(y_{i},F_{m-1}\left(x_{i}\right)+\beta h\left(x_{i};a\right)\right)$$
(5)
$$F_{m}\left(x\right)=F_{m-1}\left(x\right)+\beta_{m}h\left(x;a_{m}\right)$$
(6)

Equation (5) can be solved in two steps for a given cost function by the Gradient Boosting method92, 93. First, the function \(h\left(x; a\right)\) is fit by least squares:

$$a_{m}=\underset{a,\rho }{\arg \min }\sum_{i=1}^{N}\left[\widetilde{y}_{im}-\rho h\left(x_{i};a\right)\right]^{2}$$
(7)
$$\widetilde{y}_{im}=-\left[\frac{\partial \Psi \left(y_{i},F\left(x_{i}\right)\right)}{\partial F\left(x_{i}\right)}\right]_{F\left(x\right)=F_{m-1}\left(x\right)}$$
(8)

In the next step, given \(h\left(x;{a}_{m}\right)\), the optimal value of \({\beta }_{m}\) is determined. GBRT specializes this strategy to the case where the weak learner \(h\left(x;a\right)\) is an L-terminal-node regression tree. At each iteration m, a regression tree divides the x space into L disjoint regions \({\left\{{R}_{lm}\right\}}_{l=1}^{L}\):

$$h\left(x;\left\{R_{lm}\right\}_{1}^{L}\right)=\sum_{l=1}^{L}\bar{y}_{lm}\,1\left(x\in R_{lm}\right)$$
(9)
$$\bar{y}_{lm}=\mathrm{mean}_{x_{i}\in R_{lm}}\left(\widetilde{y}_{im}\right)$$
(10)

The solution to Eq. (8) then reduces to a simple “location” estimate based on the criterion \(\Psi\):

$$y_{lm}=\underset{y}{\arg \min }\sum_{x_{i}\in R_{lm}}\Psi \left(y_{i},F_{m-1}\left(x_{i}\right)+y\right)$$
(11)

The model is then updated separately in each corresponding region:

$$F_{m}\left(x\right)=F_{m-1}\left(x\right)+\nu \cdot y_{lm}\,1\left(x\in R_{lm}\right)$$
(12)

The learning rate is controlled by the “shrinkage” parameter \(0<\nu \le 1\); small values of this parameter \(\left(\nu \le 0.1\right)\) lead to much better generalization error. Friedman (1999)92 presented specific algorithms based on this template for several loss criteria, including least squares, least absolute deviation, Huber loss, and, for classification, the K-class multinomial negative log-likelihood. It should be noted that hyper-parameters must be tuned to optimize the implemented model. These parameters, such as the number of base estimators, the subsample fraction, the loss function, the maximum depth, the minimum number of samples per leaf node, the maximum number of features, and the minimum number of samples required to split a node, define the structure of the model. A simple architecture of the GBRT algorithm is depicted in Fig. 4, and a code sketch follows the figure.

Figure 4

A simple architecture of the GBRT.
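A minimal sketch of the boosting scheme of Eqs. (3)–(12) with regression-tree base learners and shrinkage ν, written against scikit-learn’s gradient boosting regressor (the hyperparameter values are illustrative placeholders, not the tuned values of Table 3):

```python
# Sketch of GBRT: stage-wise boosting of regression trees with shrinkage (learning rate).
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(
    loss="huber",            # loss criterion Psi (illustrative choice)
    learning_rate=0.1,       # shrinkage parameter nu; small values generalize better
    n_estimators=300,        # M: number of boosting stages / base trees
    subsample=0.8,           # fraction of samples used to fit each tree
    max_depth=3,             # depth of each L-terminal-node base tree
    alpha=0.9,               # quantile used by the Huber loss
    random_state=42,
)
gbrt.fit(X_train, y_train)
ift_pred_gbrt = gbrt.predict(X_test)
```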

Results and discussion

Description of the models' development

In the present work, three data-driven techniques, namely DT, ET, and GBRT, were developed to establish accurate models for estimating the IFT between ionic surfactants and normal alkanes. To create a more robust and faster model, the specific hyperparameters of each algorithm must be adjusted and optimized; as mentioned earlier, the trial-and-error method was employed for this purpose. The maximum depth of the DT strongly affects the speed and accuracy of the model and should be adjusted carefully so as not to cause over-fitting or under-fitting. As reported in Table 3, the best value for this parameter is 7. The proposed control parameters for implementing the DT algorithm on the sample data used in this study are reported in Table 3. Two ensemble algorithms based on the DT were used to increase the stability and accuracy of the DT model; ensemble models can also prevent over-fitting and create a robust, stable model based on the base estimator94. Given the nature of ensemble models, the number of estimators is the most important parameter to optimize. To build an accurate model based on the GBRT algorithm, the loss function, learning rate, number of estimators, subsample fraction, maximum depth, and alpha must be considered and adjusted. The adjusted parameters for the models implemented in this study are reported in Table 3.
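A minimal sketch of this kind of trial-and-error tuning, shown here only for the DT maximum depth; the candidate depths and the use of test-set RMSE as the selection criterion are illustrative assumptions rather than the exact procedure of the original work:

```python
# Illustrative trial-and-error loop over the DT maximum depth.
# Candidate depths and the RMSE criterion are assumptions for illustration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

best_depth, best_rmse = None, np.inf
for depth in range(2, 16):
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    if rmse < best_rmse:
        best_depth, best_rmse = depth, rmse

print(f"best max_depth = {best_depth}, test RMSE = {best_rmse:.3f}")
```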

Table 3 Internal parameters of the developed models.

Statistical evaluation

In this study, statistical criteria were used to evaluate the accuracy and ability of the developed models in predicting IFT. For this purpose, the determination coefficient (R2), average percent relative error (APRE, %), root mean square error (RMSE), standard deviation (SD), and average absolute percent relative error (AAPRE, %) were used. The formulas of these statistical parameters are as follows95:

$${R}^{2}=1-\frac{\sum_{i=1}^{n}{\left({IFT}_{{exp}_{i}}-{IFT}_{{pred}_{i}}\right)}^{2}}{\sum_{i=1}^{n}{\left({IFT}_{{exp}_{i}}-\overline{IFT }\right)}^{2}}$$
(13)
$$APRE=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{{IFT}_{{exp}_{i}}-{IFT}_{{pred}_{i}}}{{IFT}_{{exp}_{i}}}\right)\times 100$$
(14)
$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({IFT}_{{exp}_{i}}-{IFT}_{{pred}_{i}}\right)}^{2}}$$
(15)
$$AAPRE=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{{IFT}_{{exp}_{i}}-{IFT}_{{pred}_{i}}}{{IFT}_{{exp}_{i}}}\right|\times 100$$
(16)
$$SD=\sqrt{ \frac{1}{n} \sum_{i=1}^{n}{\left(\frac{{IFT}_{{exp}_{i}}-{IFT}_{{pred}_{i}}}{{IFT}_{{exp}_{i}}}\right)}^{2}}$$
(17)
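A minimal sketch of how the criteria of Eqs. (13)–(17) can be computed from arrays of experimental and predicted IFT values (function and variable names are illustrative):

```python
# Sketch: the statistical criteria of Eqs. (13)-(17) for arrays of
# experimental and predicted IFT values (names are illustrative).
import numpy as np

def evaluate(ift_exp, ift_pred):
    ift_exp, ift_pred = np.asarray(ift_exp, float), np.asarray(ift_pred, float)
    rel = (ift_exp - ift_pred) / ift_exp                          # relative deviations
    r2 = 1 - np.sum((ift_exp - ift_pred) ** 2) / np.sum((ift_exp - ift_exp.mean()) ** 2)
    apre = 100 * rel.mean()                                       # Eq. (14)
    rmse = np.sqrt(np.mean((ift_exp - ift_pred) ** 2))            # Eq. (15)
    aapre = 100 * np.abs(rel).mean()                              # Eq. (16)
    sd = np.sqrt(np.mean(rel ** 2))                               # Eq. (17)
    return {"R2": r2, "APRE%": apre, "RMSE": rmse, "AAPRE%": aapre, "SD": sd}
```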

A high R2 together with low AAPRE, APRE, RMSE, and SD values indicates that the model predicts the IFT with higher accuracy; the maximum value of R2 is one, and the minimum value of AAPRE is zero. The statistical parameters for the models implemented in this study at the different development stages are presented in Table 4. According to the RMSE values in Table 4, the accuracy of the models implemented in this study ranks as follows:

$$\text{GBRT} > \text{ET} > \text{DT}\quad \text{(for both the training and testing phases)}.$$
Table 4 Statistical assessment of the developed models.

Graphical error analysis

In this section, graphical error analyses are used to demonstrate the models’ validity and accuracy. Four types of graphs were investigated: bar plots, cross-plots, relative error distributions, and cumulative frequency diagrams. Figure 5 shows the cross-plots of the DT, ET, and GBRT models, in which the values predicted in the training and testing phases are plotted against the experimental values. In this type of diagram, train and test points lying close to the unit-slope line (X = Y) indicate that the model predicts with high accuracy. As Fig. 5 shows, some of the test points of the DT and ET models lie above or below the X = Y line, indicating the lower accuracy of these models, whereas the points of the GBRT model are scattered tightly around the unit-slope line. The GBRT model therefore estimates the IFT close to the experimental values and with higher accuracy than the DT and ET models.

Figure 5

Cross plots of the developed models; (a) GBRT, (b) ET, (c) DT.

Furthermore, the relative error diagram is a practical tool for showing the deviation of the values predicted by the model from the experimental values. The relative error equals the difference between the experimental and predicted values divided by the experimental value, and is calculated as follows:

$$\text{Relative Error} \, = \text{ } \frac{{\text{IFT}}_{\text{exp}}-{\text{IFT}}_{\text{pred}}}{{\text{IFT}}_{\text{exp}}}$$
(18)

In Fig. 6, the zero line indicates prediction without error; therefore, training and test points close to the zero line indicate that a robust model has been developed. Figure 6 shows that from an IFT value of about 20 mN/m onwards, the relative error of all models implemented in this study is low. Given the definition of the relative error in Eq. (18), the points in the negative range of Fig. 6 correspond to overestimation by the model. As explained in the modeling section, ensemble methods increase accuracy and reduce over-fitting87, 96. According to the results presented in Figs. 6 and 7, the ensemble models implemented in this study, ET and GBRT, show reduced variance and controlled over-fitting in the test phase.

Figure 6

The relative deviation of the experimental results and predicted data using (a) GBRT, (b) ET, (c) DT models.

Figure 7

Statistical evaluation of the developed models.

Next, Fig. 8 illustrates the cumulative frequency plot, which displays the proportion of predicted data points whose error is less than or equal to a particular value. This graph displays the absolute relative errors (%), calculated using the following equation, for different proportions of the data predicted by the models.

Figure 8

Cumulative frequency diagram of three proposed models for estimating the IFT of ionic surfactants and normal alkanes.

$$\text{Absolute }\text{Relative Error}\text{ (\%) }= \text{ } \left|\frac{{\text{IFT}}_{\text{exp}}-{\text{IFT}}_{\text{pred}}}{{\text{IFT}}_{\text{exp}}}\right|\times 100$$
(19)

The closer a model’s curve lies to the vertical axis, the larger the share of data it predicts with low error and, consequently, the more precise the model. As Fig. 8 shows, the GBRT model estimates 70% of the IFT values with less than 3.2% error, whereas the ET and DT models predict the same proportion of the data with errors of 3.6% and 4.2%, respectively. In addition, about 90% of the data estimated by the GBRT model has an error lower than 6.2%, compared with 8.2% and 10.8% for the ET and DT models, respectively. As a result, the superiority of the GBRT model over the ET and DT models is confirmed again.

Based on the results presented in this section, the GBRT model is proposed for estimating the IFT between the surfactant solution and the normal alkane with high accuracy. A portion of the IFT values predicted by the GBRT model in the test phase is reported in Table 5, and no significant differences from the experimental data are observed. The results in Fig. 7 show that the AAPRE and RMSE of the GBRT model in the test phase are 3.63% and 1.628, respectively, which indicates the high reliability of this model in predicting the IFT of surfactant–hydrocarbon systems.

Table 5 The IFT data predicted by GBRT models in the test phase.

Sensitivity analysis

In this study, a sensitivity analysis was performed to determine the magnitude and direction of the relationship between the independent variables presented in Table 2 and the IFT (output). Different methods of sensitivity analysis have been introduced for regression models97, 98. In this section, the Pearson coefficient was used to calculate the relevancy factor99:

$$\mathrm{RF}=\frac{\sum_{i=1}^{n}\left(\bar{x}_{k}-x_{k,i}\right)\left(\bar{y}-y_{i}\right)}{\sqrt{\sum_{i=1}^{n}{\left(\bar{x}_{k}-x_{k,i}\right)}^{2}\sum_{i=1}^{n}{\left(\bar{y}-y_{i}\right)}^{2}}}$$
(20)

where n and k represent the number of data points and the index of the input variable, respectively. The symbols \({y}_{i}\), \(\bar{y}\), \({x}_{k,i}\), and \(\bar{x}_{k}\) denote the target value, the average target value, the ith value of the kth input, and the average of the kth input. Figure 9 shows the absolute values of the sensitivity analysis results for the proposed GBRT model. In this dataset, the PIT, the surfactant concentration, and the HLB had the greatest effect on the IFT, in that order. At concentrations lower than the CMC of the surfactants, the IFT decreases with increasing surfactant concentration100; the surfactant concentration was therefore expected to have a large effect on the IFT, along with the type of surfactant. The molecular weight of the alkanes has the lowest Pearson coefficient compared with the other input parameters.
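A minimal sketch of the relevancy-factor calculation of Eq. (20), applied column by column to the hypothetical input table used in the earlier sketches:

```python
# Sketch: Pearson relevancy factor of Eq. (20) for each input variable.
import numpy as np

def relevancy_factor(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    num = np.sum((x.mean() - x) * (y.mean() - y))
    den = np.sqrt(np.sum((x.mean() - x) ** 2) * np.sum((y.mean() - y) ** 2))
    return num / den

for col in X.columns:                     # X, y from the data-gathering sketch
    print(col, round(relevancy_factor(X[col], y), 3))
```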

Figure 9

The absolute effect of each input parameter on the IFT of surfactant–hydrocarbon systems based on Pearson correlation coefficient.

Trend analysis

In this section, the ability of the proposed GBRT model to predict the IFT behavior under different conditions was investigated. Figure 10 shows the IFT values predicted by the GBRT model together with the experimental data68, 69,72. In this figure, the IFT of anionic (SDS) and cationic (C10TAB) surfactants with n-hexane as the hydrocarbon phase is plotted. The temperature of the surfactant solution was held constant (298.2 K), and the IFT data were plotted as a function of surfactant concentration. Based on the results presented in Fig. 10, the GBRT model accurately predicts the IFT of the surfactant and n-hexane systems. The IFT between the surfactant solution and the hydrocarbon depends on the surfactant concentration101,102,103: at concentrations below the CMC, increasing the surfactant concentration reduces the IFT38. Surfactant molecules adsorb at the liquid–liquid interface and reduce the IFT104; therefore, increasing the number of surfactant molecules adsorbed at the interfacial boundary further reduces the IFT.

Figure 10

Comparison of predicted and experimental IFT trends for the systems of n-hexane and surfactants C10TAB and SDS at 298.2 K.

In the next step, Fig. 11 shows the IFT of C10TAB and C12TAB solutions against n-octane and n-nonane hydrocarbon phases69, along with the predictions of the GBRT model. For this analysis, the temperature was held constant at 298.2 K. As can be seen, the heavier the hydrocarbon phase, the greater the surfactant–hydrocarbon IFT. Again, the GBRT model provides accurate predictions of the IFT for the surfactants and hydrocarbons considered in this figure.

Figure 11

Comparison of predicted and experimental IFT trends for pure hydrocarbons at 298.2 K; (a) C10TAB and (b) C12TAB.

Another parameter affecting the IFT of the surfactant considered in this study is temperature. The IFT values were predicted for different concentrations of SDS over a temperature range. Figure 12 compares the trend of IFT changes over this temperature range predicted by the GBRT model with the experimental data70. The predicted trend agrees well with the experimental data and with the real trend of IFT variation; as mentioned earlier, the IFT decreases with increasing temperature. This shows that the proposed GBRT model predicts the interfacial behavior of the surfactant well under the defined conditions.

Figure 12

Comparison between experimental and predicted IFT values by the GBRT model for SDS surfactant in n-hexane mixture at different temperatures and concentrations.

Outlier detection using Leverage approach

The Leverage approach105,106,107 is a reliable technique for discovering outliers that may exist in a databank due to a variety of circumstances, including experimental errors. Such points lie at an abnormal distance from the majority of the data, so identifying them is critical for preventing model inaccuracy and unreliability. In the Leverage method, the standardized residuals (R) and a matrix named the Hat matrix, which is built from the explanatory variables and the values predicted by the model, are needed to conduct the analysis. The Leverage or Hat indexes (H) are determined using the following formula108,109,110:

$$H=X{\left({X}^{T}X\right)}^{-1}{X}^{T}$$
(21)

Here, X represents the matrix of explanatory variables, and T represents the transpose matrix operator. Moreover, the critical Leverage (H*) is calculated according to the following formula111:

$${H}^{*}=\frac{3(number\,of\,inputs+1)}{number\,of\,data\,points}$$
(22)

In this work, the databank includes five inputs and 390 data points, leading to H* = 0.0461. On the other hand, considering MSE as the mean square error, ej as the error of the jth data point, and Hj as the jth Leverage value, the R values can be determined as follows90, 112:

$${R}_{j}=\frac{{e}_{j}}{{[MSE\left(1-{H}_{j}\right)]}^{0.5}}$$
(23)

If most of the data points lie within the ranges −3 ≤ R ≤ 3 and 0 ≤ Hj ≤ H*, both the experimental data and the model's estimates are statistically trustworthy and precise113. The Williams plot shows the R values versus the H values to identify outliers. Figure 13 displays the outcomes of the Leverage approach applied to the GBRT model's results. In this case, only 6 points are recognized as suspected data lying outside the applicability domain of the model, and only 8 points are identified as outliers. This confirms that the experimental database of IFT between surfactants and hydrocarbons is highly reliable and that the GBRT model is statistically dependable and valid.
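A minimal sketch of the Leverage quantities of Eqs. (21)–(23) and the counting of suspected and outlier points, assuming the errors come from the GBRT model of the earlier sketches applied to the whole databank:

```python
# Sketch: Leverage/Williams-plot quantities of Eqs. (21)-(23).
import numpy as np

Xmat = np.asarray(X, dtype=float)                  # 390 x 5 matrix of inputs
e = np.asarray(y, dtype=float) - gbrt.predict(X)   # errors of the (assumed) GBRT model
H = Xmat @ np.linalg.inv(Xmat.T @ Xmat) @ Xmat.T   # Hat matrix, Eq. (21)
h = np.diag(H)                                     # leverage of each point
h_star = 3 * (Xmat.shape[1] + 1) / Xmat.shape[0]   # critical leverage, Eq. (22): 3(5+1)/390
mse = np.mean(e ** 2)
R = e / np.sqrt(mse * (1 - h))                     # standardized residuals, Eq. (23)

suspected = np.sum(h > h_star)                     # outside the model's applicability domain
outliers = np.sum(np.abs(R) > 3)                   # beyond -3 <= R <= 3
print(f"H* = {h_star:.4f}, suspected = {suspected}, outliers = {outliers}")
```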

Figure 13

Williams plot of the developed GBRT model.

Conclusions

The aim of this study was to develop accurate and reliable models to estimate the IFT of ionic surfactants and normal alkanes. Three intelligent algorithms, namely DT, ET, and GBRT, were implemented for this purpose. A databank containing 390 experimental IFT data points reported in the literature was used to develop these models. The models considered the following inputs: temperature, normal alkane molecular weight, surfactant concentration, HLB, and PIT. Based on this work, the following conclusions are drawn:

  1. The ensemble methods implemented in this study, GBRT and ET, were able to reduce the variance of the DT model.

  2. Among all the models developed in this study, the GBRT was the best model for predicting the IFT between the surfactant and normal alkanes.

  3. Statistical evaluation in the test phase showed that the AAPRE% and RMSE of the GBRT model are 3.63% and 1.628, respectively.

  4. The trend analysis demonstrated that the predictions of the GBRT model follow the expected variations in terms of the independent variables.

  5. The cumulative error distribution of the GBRT model was very satisfactory, with approximately 90% of the predicted data having a relative error of less than 6.2%.

  6. According to the results of the sensitivity analysis, the effect of the input parameters on the IFT is as follows: PIT > surfactant concentration > HLB > temperature > molecular weight of normal alkane.

  7. The Leverage method demonstrated that the majority of the data points (almost 96.5%) are valid, and both the IFT databank and the GBRT model seem to be highly trustworthy.

As a suggestion for future studies, simultaneously considering the surfactant-containing aqueous phase, the organic hydrocarbon phase, and a solid phase (reservoir rock or its minerals) could connect this modeling framework with the governing equations of surface wetting.