Optimized artificial neural network application for estimating oil recovery factor of solution gas drive sandstone reservoirs

The most crucial aspect in determining field development plans is the oil recovery factor (RF). However, RF has a complex relationship with the reservoir rock and fluid properties. Artificial neural networks (ANNs) can capture complex correlations between the reservoir parameters that affect the recovery factor. This research provides a new approach to improving the accuracy of the ANN model through a sequence of steps: removing outlier data, selecting input parameters, selecting the transfer function, selecting the number of neurons, and determining the number of hidden layers. By applying these steps, an ANN model was selected with nine input parameters: oil viscosity, water saturation, initial oil formation volume factor, formation thickness, initial pressure, permeability, specific gravity of oil, porosity, and original oil in place. Furthermore, based on the correlation coefficient, a tangent-sigmoid transfer function, 30 neurons, and two hidden layers were selected. The proposed ANN correlation gives the best accuracy among the correlations compared, with the highest correlation coefficient of 0.91657.


Introduction
The main concern of company management when a new oil reservoir is discovered with an exploratory well is the estimation of the oil that can be obtained by natural mechanisms from the reservoir. The recovery factor (RF) for an oil reservoir is a very important parameter for companies to plan and optimize field development, manage ongoing production, and identify profitable investments, among other technical and commercial decisions. This parameter defines the oil volume that can be obtained from the initial oil volume in the reservoir [1][2][3][4].
In oil and gas projects, determining the recovery factor is one of the biggest uncertainties. Predicting the recovery factor is challenging because many variables affect the oil production from the reservoir. These include variables that are uncertain and beyond the control of oil and gas operators, such as fluid flow in pores and reservoir drive mechanisms, and engineering design-based variables, such as well spacing, well completion, and secondary and tertiary recovery mechanisms. In the early life of a field, inadequate production data coupled with subsurface uncertainty make RF predictions uncertain [3]. In general, the factors that affect oil recovery are the physical properties of the reservoir rock and fluid, reservoir drives, and the type of reservoir development [5][6][7]. The physical properties of reservoir rocks and fluids can be changed to obtain a higher recovery factor by using enhanced oil recovery (EOR) methods. EOR is divided into four categories: thermal, chemical, microbial, and miscible methods. These methods are used to remove residual oil that has not moved in the reservoir after the primary and secondary recovery stages [8,9].
Primary recovery factors are known to vary over a wide range. Under the most unfavorable conditions, recovery efficiencies as low as 10 percent of the oil in place have been experienced. On the other hand, recovery levels as high as 85 percent to 90 percent of the oil in place have been reported. The main reason for the difference is in the transport mechanism itself, specifically whether oil is displaced by invading water from the aquifer (water drive), by expanding gas from solution (solution gas drive), by an expanding gas cap (gas cap drive), or by gravity segregation (segregation drive) [10][11][12].
Empirical methods can be used at the stage before or after field development. They involve RF estimation through the application of regression equations using parameters that have an impact on oil recovery from multiple reservoirs with similar characteristics. In the early stages of field development, data obtained from direct measurements are not always available. In addition, there may be little or no reservoir performance data available. In this case, the empirical method may provide a quick and easy technique to give an initial estimate of RF with limited data within a given tolerance to provide a framework for investment [13].
In the last decade, various artificial intelligence (AI) techniques have been used to estimate the recovery factor for both water drive reservoirs and solution gas drive reservoirs. AI is able to model complex reservoirs in a relatively short computation time [14,15]. The following is a summary of previous studies.
In 2014, Okpere and Njoku [16] applied artificial neural networks to predict the recovery factor for oil reservoirs in the Niger Delta. They divided the data from 94 reservoirs into three groups: 60 % of the data was used for training the ANN model, 20 % was used for validating the model, and the rest was used for testing the trained model. They used the connate water viscosity, connate water saturation, oil viscosity, oil formation volume factor, initial and abandonment reservoir pressures, permeability, and porosity as input parameters to predict the oil RF. Their model was built for water drive reservoirs using a backpropagation network.
Noureldien and El-Banbi (2015) proposed two ANN models, namely Simple ANN and Sophisticated ANN, to predict oil recovery factors for both solution gas drive and water drive reservoirs. The input parameters used to generate the first model were asset area, stock tank original oil in place (STOOIP), net pay thickness, porosity, Lorenz coefficient, initial water saturation, permeability, API, oil viscosity, and reservoir pressure, while additional operational and technological parameters were used for the second model. The first and the second models predicted the RF with an absolute average percentage error (AAPE) of 9.5 % and 8.0 %, respectively, for the testing dataset [17]. Ahmed et al. (2017) and Mahmoud et al. (2019) applied four artificial intelligence techniques, namely artificial neural networks (ANN), radial basis neuron networks (RNN), adaptive neuro-fuzzy inference systems (ANFIS) with subtractive clustering, and support vector machines (SVM). They used 10 parameters from 130 sandstone water drive reservoirs to determine the oil recovery factor. The ten parameters used in this study were the same as those used by Noureldien and El-Banbi (2015). Furthermore, the extracted weights and biases of the optimized ANNs were used to develop an empirical equation that could be programmed and used for estimating the RF of water drive sandstone reservoirs. ANN was the best model of the four artificial intelligence techniques used because this model had the lowest AAPE, namely 7.92 %, and the highest determination coefficient, namely 0.94, for RF prediction on the testing dataset [4,17].
Roustazadeh et al. (2024) used the Atlas, GasIS, Commercial, and TORIS databases to construct machine learning models for estimating the recovery factors of oil and gas reservoirs. Combinations of the databases and three algorithms, i.e., stepwise multiple linear regression (MLR), extreme gradient boosting (XGBoost), and support vector machine (SVM), were applied in the modeling. They found that the XGBoost model predicted both oil and gas RFs more accurately than the SVM and MLR models for the training and testing datasets [18]. In addition, other machine learning methods such as multilayer feedforward neural network (MLFNN), function networks (FNs), and random forest have been used for reservoir characterization [19]. Gomes et al. (2018) used two datasets containing 769 oil-bearing reservoirs in Middle Eastern fields to discuss the reservoir/field data aggregation, verification, and validation processes, and to extract some key determinant parameters for use in comparison of oil recovery factor (RF) against global analogs. Four data analytics methodologies were used in the study: fuzzy logic with backpropagation ANN, symbolic regression with a genetic algorithm, feed-forward backpropagation neural network, and the Boruta algorithm with a random forest classification method. By using symbolic regression with a genetic algorithm, the recovery factor can be modeled and predicted. The main inputs to this model were six independent variables: oil viscosity, the ratio between oil and water viscosity, the product of average permeability and reservoir thickness divided by oil viscosity, maximum capillary pressure at reservoir top, well density, and average water saturation. This model gave a training determination coefficient of 0.86 and a testing determination coefficient of 0.62 [20].
Al-Tashi et al. (2021) applied ANN for the classification of reservoir recovery factors. They used data obtained from 367 sandstone and carbonate reservoirs with water and solution gas drives. They developed an ANN model with 10 input parameters, namely permeability, oil viscosity at bubble point, connate water saturation, initial reservoir pressure, pressure at the end of primary recovery, oil viscosity at initial pressure, solution gas ratio at abandonment pressure, oil formation volume factor at bubble point pressure, oil formation volume factor at initial pressure, and original oil in place at initial pressure. In the study, the ANN method was equipped with several algorithms such as Non-dominated Sorting Genetic Algorithm II, Multi-Objective Gray Wolf Optimizer, and Multi-Objective Particle Swarm Optimization [21]. Makhotin et al. (2022) used tree-based machine learning models to forecast the oil recovery factor for flooding. Two cases from more than 2000 reservoirs were investigated. The first case used parameters related to geometry, storage, geology, transport, and fluid properties, while the second case applied additional parameters including production and development data. The best model had a mean absolute error of 4.91 and a determination coefficient of 0.8 for the testing datasets [22].

M.T. Fathaddin et al.
The aim of this study was to increase the accuracy of the ANN model in predicting oil recovery factors by selecting the transfer function, number of neurons, and number of hidden layers. Data processing was carried out before modeling, including removing outliers and selecting input parameters. After being validated with data, the proposed correlation was compared with previous regression and ANN methods in terms of accuracy.

Methodology
The methodology applied for predicting the oil recovery factor for solution gas drive reservoirs in this study covered four steps, namely data acquisition and screening, parameter selection, development of the ANN model, and statistical evaluation. MATLAB and SPSS software were used as the research tools.

Previous correlations
In 1945, Craze and Buckley collected a large amount of data on the performance of about 103 oil reservoirs in the United States. Twenty-seven of these reservoirs were produced by solution gas drive. The oil recovery factor (RF) equation for reservoir depletion in bbl/acre-ft is as follows [23].
where φ is porosity, S_w is water saturation, S_g is gas saturation, B_oi is oil formation volume factor at initial pressure, and B_oa is oil formation volume factor at abandonment pressure.
To obtain the oil recovery factor as a fraction, the result obtained by calculating Eq. (1) is divided by the initial oil volume. Therefore, the oil recovery (RO) equation is obtained as follows:
In 1956, API proposed an empirical equation for oil recovery based on data from eighty solution gas drive reservoirs. Of these, 67 were sandstone reservoirs and the rest were carbonate reservoirs. The published correlations are as follows [2,11]:
where S_wi is initial water saturation, B_ob is oil formation volume factor at bubble point pressure, k is permeability, μ_ob is oil viscosity at bubble point pressure, p_b is bubble point pressure, and p_a is abandonment pressure.
In 1967, API proposed another oil recovery equation based on a larger amount of data, namely data from 116 solution gas drive sandstone reservoirs. The published regression equation for the prediction of oil recovery (RO) is [2]:
Arps (1968) used API data to derive the empirical oil recovery factor equation for a solution gas drive reservoir as follows [11,24]:
or
Gulstad (1995) proposed an empirical equation of recovery factor using multiple linear regression on the same API data used in previous publications for solution gas drive sandstone reservoirs. The oil recovery factor equation proposed is as follows [2]:
RO = −264.59 + 0.34(OOIP) + 29.37 ln(R_si) − 0.06
where OOIP is original oil in place, R_si is initial solution gas-oil ratio, and h is formation thickness.
Onolemhemhen et al. (2016) introduced an empirical equation to estimate the oil recovery factor of dissolved gas drive reservoirs. The equation was derived using data from 128 oil reservoirs in the Niger Delta. The equation for the oil recovery factor is given as follows [25]:
or
where μ_o is oil viscosity, API is specific gravity of oil, p_i is initial pressure, S_or is residual oil saturation, C_f is a conversion factor of 1.0001, a = 0.127, b = 0.0218, c = 0.0341, and d = 0.1924.

Data acquisition and screening
Datasets from 159 solution gas drive reservoirs were collected from Refs. [2,25]. The database was statistically processed. The data were grouped into several classes to generate a frequency distribution for each parameter. The green histograms in Fig. 1 show the frequency distributions for all parameters. Based on these frequency distributions, cumulative frequency curves ranging from 0 % to 100 % can be determined. The cumulative frequency curves are shown as the red curves in Fig. 1. In this research, a dataset was designated as an outlier if it had at least one parameter located in a cumulative frequency interval of less than 1 % (P1) or greater than 99 % (P99). Therefore, P1 and P99 were taken as the lower and upper limits of the database. The values of P1 and P99 for each parameter are given in Table 1.
Removing outlier datasets from the analyzed database aims to obtain a strong relationship between input parameters and output parameters for most of the data. By following this procedure, 27 datasets were determined to be outliers. Therefore, the 159 datasets in the database were reduced to 132 datasets. Although the outlier datasets were removed, the remaining datasets were believed to still represent all reservoirs.
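The screening rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' MATLAB/SPSS procedure: the paper derives P1 and P99 from classed cumulative frequency curves, whereas the hypothetical `percentile` helper below assumes linear interpolation on the sorted values, and the toy records are invented purely for demonstration.

```python
def percentile(values, q):
    """Percentile (q in 0..100) by linear interpolation on sorted values.

    Assumption: the paper reads P1/P99 from cumulative frequency curves;
    interpolation is used here only as a simple stand-in.
    """
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def remove_outliers(records, low_q=1.0, high_q=99.0):
    """Drop every record with at least one parameter outside [P1, P99]."""
    names = list(records[0].keys())
    limits = {}
    for name in names:
        column = [r[name] for r in records]
        limits[name] = (percentile(column, low_q), percentile(column, high_q))
    kept = [
        r for r in records
        if all(limits[n][0] <= r[n] <= limits[n][1] for n in names)
    ]
    return kept, limits


# Toy data (not from the paper): 100 regular records plus one extreme value.
records = [{"porosity": 0.10 + 0.001 * i, "permeability": 50.0 + i}
           for i in range(100)]
records.append({"porosity": 0.15, "permeability": 1.0e5})

kept, limits = remove_outliers(records)
print(len(records), "->", len(kept))
```

Because a record is dropped when any single parameter falls outside its limits, the number of removed records can exceed 2 % of the database, which is consistent with 27 of 159 datasets being removed in the study.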
Tables 2 and 3 present descriptive statistics before and after removing outliers, respectively. The mean, median, and standard error tend to decrease after the removal. This indicates that the data located in the cumulative frequency interval greater than 99 % (P99) have very large values.
The distribution of data for each parameter with respect to the recovery factor after the removal process is shown in Fig. 2. The cumulative frequency error was obtained by comparing the cumulative frequency before and after removing outliers. The cumulative frequency error for each parameter is shown in Fig. 3.

The selection of input parameters
The parameters selected as independent variables should be available at the beginning of field development, since the correlation is intended to predict the recovery factor as early as possible. In addition, they should have a strong influence on the recovery factor. In this study, the selected parameters were limited by the available data. The candidate input parameters (independent variables) used to generate the ANN model were formation thickness (h), porosity (φ), permeability (k), water saturation (S_w), specific gravity of oil (API), initial pressure (p_i), oil viscosity (μ_o), solution gas-oil ratio at initial reservoir conditions (R_si), oil formation volume factor at initial reservoir conditions (B_oi), and original oil in place (OOIP). The recovery factor (RF) was the output parameter (dependent variable).
The feedforward backpropagation algorithm was selected for the ANN model. The algorithm describes the flow of information in an ANN system. It consists of two phases, namely forward propagation and backward propagation. During forward propagation, the input is fed into the neural network and the network computes the output. During backward propagation, the weights and biases of each neuron are modified to minimize the error between the expected output and the actual output. The Levenberg-Marquardt algorithm and the gradient descent with momentum weight and bias learning function were chosen as the training function and adaption learning function, respectively.
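The two phases can be illustrated with a small, self-contained sketch. It is not the configuration used in the study: for brevity it trains with plain batch gradient descent with classical momentum instead of Levenberg-Marquardt, uses a single input and one hidden layer, and the function name `train_tanh_network`, the learning rate, and the toy data are all assumptions made for illustration.

```python
import math
import random


def train_tanh_network(xs, ys, hidden=3, lr=0.05, momentum=0.8,
                       epochs=300, seed=0):
    """Train a 1-input, one-hidden-layer tanh network with a linear output.

    Parameters are stored flat as [w1..., b1..., w2..., b2].
    Returns the final parameters and the per-epoch mean squared error.
    """
    rng = random.Random(seed)
    params = ([rng.uniform(-0.5, 0.5) for _ in range(hidden)]    # w1
              + [0.0] * hidden                                   # b1
              + [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # w2
              + [0.0])                                           # b2
    velocity = [0.0] * len(params)
    history = []
    n = len(xs)
    for _ in range(epochs):
        w1, b1 = params[:hidden], params[hidden:2 * hidden]
        w2, b2 = params[2 * hidden:3 * hidden], params[3 * hidden]
        grads = [0.0] * len(params)
        loss = 0.0
        for x, y in zip(xs, ys):
            # Forward propagation: hidden activations, then linear output.
            h = [math.tanh(w1[i] * x + b1[i]) for i in range(hidden)]
            y_hat = sum(w2[i] * h[i] for i in range(hidden)) + b2
            err = y_hat - y
            loss += err * err / n
            # Backward propagation: gradient of the mean squared error.
            for i in range(hidden):
                dh = err * w2[i] * (1.0 - h[i] ** 2)
                grads[i] += 2.0 * dh * x / n
                grads[hidden + i] += 2.0 * dh / n
                grads[2 * hidden + i] += 2.0 * err * h[i] / n
            grads[3 * hidden] += 2.0 * err / n
        history.append(loss)
        # Momentum update: blend the previous step with the new gradient step.
        velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
        params = [p + v for p, v in zip(params, velocity)]
    return params, history


# Toy regression target (not from the paper): y = 0.5 * x on [-1, 1].
xs = [i / 5.0 - 1.0 for i in range(11)]
ys = [0.5 * x for x in xs]
params, history = train_tanh_network(xs, ys)
print("first-epoch MSE:", history[0], "last-epoch MSE:", history[-1])
```

Levenberg-Marquardt replaces the simple gradient step with a damped Gauss-Newton step, which typically converges in far fewer epochs on small networks such as the one considered here.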
In the process of building a model with ANN, the input parameters involved need to be normalized, while the output needs to be denormalized. The normalization equation used is as follows [26].
$$x_j^* = \frac{x_j - x_{\min}}{x_{\max} - x_{\min}} \qquad (10)$$
where x_j and x_j* are the measured and normalized values at data point j, and x_min and x_max are the minimum and maximum values of the data, respectively. The denormalization equation is as follows.
$$x_j = x_j^* \left(x_{\max} - x_{\min}\right) + x_{\min} \qquad (11)$$
Various independent variables were combined to be correlated with the dependent variable (RF). The statistical method used to see whether there was a simultaneous influence of the independent variables on the dependent variable was the F test and its significance. Based on the F value and significance, the ten best correlations were selected for various numbers of independent variables, as shown in Table 4. The next step was to select the best of the ten combinations of independent variables given in the table. The best combination was the one that gave the highest correlation coefficient. Based on the last column in Table 4, the highest correlation coefficient (R), namely 0.84074, resulted from a combination of nine independent variables. Therefore, nine independent variables (input parameters) were chosen to form the best ANN model, namely μ_o, S_w, B_oi, h, p_i, k, API, φ, and OOIP.
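The min-max scaling of Eqs. (10) and (11) can be sketched as follows. The function names `normalize` and `denormalize` and the sample values are illustrative assumptions, not part of the original workflow.

```python
def normalize(values):
    """Eq. (10): scale values to [0, 1]; also return x_min and x_max."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        raise ValueError("all values are identical; cannot normalize")
    return [(x - x_min) / span for x in values], x_min, x_max


def denormalize(scaled, x_min, x_max):
    """Eq. (11): map normalized values back to the original scale."""
    return [s * (x_max - x_min) + x_min for s in scaled]


# Illustrative porosity values (not taken from the paper's database).
porosity = [0.12, 0.18, 0.25, 0.31]
scaled, lo, hi = normalize(porosity)
restored = denormalize(scaled, lo, hi)
print(scaled)
print(restored)
```

The same x_min and x_max obtained from the training data have to be reused when new inputs are scaled, otherwise the network sees values on a different scale from the one it was trained on.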

Optimization of the ANN model
Optimization of the ANN model was carried out by selecting the appropriate transfer function, the number of neurons, and the number of hidden layers. The transfer function and the number of neurons per layer were tested based on the correlation coefficient, as shown in Fig. 4. Three transfer functions were tested, namely tangent-sigmoid, purelin, and log-sigmoid. The correlation coefficient was used to select the most appropriate transfer function, i.e., the one that provided the strongest relationship between input parameters and output parameters.
Based on the curves in the figure, the tangent-sigmoid (tansig) transfer function produced relatively higher correlation coefficient values for 10-60 neurons compared to the other transfer functions. The correlation coefficient values for the tangent-sigmoid transfer function varied between 0.6991 and 0.8645. Therefore, this transfer function was chosen because it produced the most representative model.
Next, the number of neurons that produced the highest correlation coefficient for the tangent-sigmoid function was selected. Based on Fig. 4, the highest correlation coefficient of 0.8645 was obtained with 30 neurons. Therefore, the ANN model was built using 30 neurons.
The next step was to select the number of hidden layers. The selection criterion was also based on the correlation coefficient value. The correlation coefficient was obtained using Pearson correlation by comparing the recovery factor (RF) predicted by the ANN model with the RF obtained from the data [27]. Testing was carried out for one to seven hidden layers. Each test used an ANN model with a tangent-sigmoid transfer function and 30 neurons. Fig. 5 shows the correlation coefficient for various numbers of hidden layers. The figure shows that the correlation coefficients range from 0.7175 to 0.9207. The best correlation coefficient was obtained with an ANN model that used 2 hidden layers. Therefore, the optimum ANN model was a model with the tangent-sigmoid transfer function, 30 neurons, and 2 hidden layers.
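Since every selection step relies on the Pearson correlation coefficient between predicted and measured RF, a minimal implementation is sketched below. The function name `pearson_r` and the sample values are illustrative only.

```python
import math


def pearson_r(predicted, actual):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / len(predicted)
    mean_a = sum(actual) / len(actual)
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_p * var_a)


# Illustrative values only (not from the paper).
rf_predicted = [18.0, 22.5, 30.1, 26.4]
rf_actual = [17.2, 24.0, 29.5, 27.8]
print(pearson_r(rf_predicted, rf_actual))
```

Note that R measures linear association only: predictions that are consistently biased high or low can still give R close to one, which is why cross-plots against the diagonal line are examined alongside R.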
Neuron counts in multiples of five were used in Fig. 4 to reduce the time needed to determine the appropriate number of neurons. In addition, the same number of neurons was assigned to each hidden layer in the proposed approach to limit the variation in the number of neurons per hidden layer. Another problem was that the time complexity increased rapidly with an increasing number of hidden layers. Therefore, the number of hidden layers tested in this study was limited to seven. These were the limitations of applying the proposed steps for finding the optimum ANN model.
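The staged search described in this section can be sketched as follows. The sketch makes several assumptions: the first stage is taken to use a single hidden layer (the text does not state this), `score_fn` stands for any routine that trains a network for a given configuration and returns its correlation coefficient, and `toy_score` is an invented stand-in whose values are not results from the paper.

```python
from itertools import product


def best_by_score(candidates, score_fn):
    """Return the (candidate, score) pair with the highest score."""
    scored = [(c, score_fn(c)) for c in candidates]
    return max(scored, key=lambda item: item[1])


def two_stage_search(score_fn, transfer_fns, neuron_counts, layer_counts):
    """Stage 1 picks transfer function and neurons; stage 2 picks layers."""
    # Stage 1 (assumed: one hidden layer): vary transfer function and neurons.
    stage1 = [(t, n, 1) for t, n in product(transfer_fns, neuron_counts)]
    (transfer, neurons, _), _ = best_by_score(stage1, score_fn)
    # Stage 2: keep those two fixed and vary the number of hidden layers.
    stage2 = [(transfer, neurons, layers) for layers in layer_counts]
    return best_by_score(stage2, score_fn)


def toy_score(config):
    """Hypothetical scoring function used only to exercise the search."""
    transfer, neurons, layers = config
    base = {"tansig": 0.8, "logsig": 0.7, "purelin": 0.5}[transfer]
    return base - abs(neurons - 30) / 1000.0 - abs(layers - 2) / 100.0


config, score = two_stage_search(
    toy_score,
    transfer_fns=["tansig", "purelin", "logsig"],
    neuron_counts=range(10, 65, 5),   # multiples of five, as in Fig. 4
    layer_counts=range(1, 8),         # one to seven hidden layers
)
print(config, score)
```

The staged search evaluates 33 + 7 = 40 configurations here, compared with 231 for a full grid over the same ranges. That saving is the motivation for the restrictions listed above, at the cost of possibly missing combinations that only perform well jointly.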

Correlation validation
To ensure its feasibility, the proposed correlation had to be validated with actual data. The study used 132 datasets, which were divided into three groups for training, validation, and testing. The percentages of data used for training, validation, and testing were 70 %, 15 %, and 15 %, respectively. Fig. 6 shows the results after the training, validation, and testing processes. The figure indicates that the correlation coefficients for the training, validation, and testing processes are 0.9188, 0.8963, and 0.9693, respectively. Because the correlation coefficient at each stage is close to one, the ANN model can represent the data very well.
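A random 70/15/15 partition of the 132 datasets can be sketched as below. The rounding rule and the fixed seed are assumptions made for reproducibility of the illustration; the exact group sizes produced by the software used in the study may differ by one or two datasets.

```python
import random


def split_indices(n, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle indices 0..n-1 and split them into train/validation/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]   # remainder goes to testing
    return train, val, test


train_idx, val_idx, test_idx = split_indices(132)
print(len(train_idx), len(val_idx), len(test_idx))
```

With only about twenty datasets in each of the validation and testing groups, the correlation coefficients reported for those groups are sensitive to which datasets happen to fall into them.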
Equation (12) represents the relationship obtained from applying the ANN model. This equation was developed on the same basis as that followed by Mahmoud et al. [4].
$$RF = \sum_{i=1}^{N} w_{2,i}\,\operatorname{tansig}\!\left(\sum_{j=1}^{J} w_{1,i,j}\,x_j^* + b_{1,i}\right) + b_2 \qquad (12)$$
where RF represents the recovery factor. N and J denote the total number of neurons in the hidden layer and the total number of input parameters, respectively. w_1 and w_2 are the hidden layer weights and the output layer weights, respectively. b_1 and b_2 represent the hidden layer biases and the output layer bias, respectively. x* represents the normalized input parameters. The values of the parameters for the model are given in Table 5.
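Using the symbols defined for Equation (12), the evaluation of a weight-and-bias correlation of this kind can be sketched as below. The sketch assumes a single hidden layer with a tangent-sigmoid activation followed by a linear output, which is the form used by Mahmoud et al. [4]; the weights shown are placeholders, not the values of Table 5.

```python
import math


def tansig(z):
    """Tangent-sigmoid transfer function, 2 / (1 + exp(-2z)) - 1 = tanh(z)."""
    return math.tanh(z)


def predict_rf(x_norm, w1, b1, w2, b2):
    """Evaluate RF = sum_i w2[i] * tansig(sum_j w1[i][j] * x*[j] + b1[i]) + b2.

    x_norm : J normalized inputs
    w1     : N rows of J hidden-layer weights
    b1     : N hidden-layer biases
    w2     : N output-layer weights
    b2     : output-layer bias
    """
    hidden = [
        tansig(sum(w * x for w, x in zip(row, x_norm)) + bias)
        for row, bias in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


# Placeholder weights for N = 2 neurons and J = 3 inputs (not from Table 5).
w1 = [[0.4, -0.2, 0.1], [-0.3, 0.5, 0.2]]
b1 = [0.05, -0.10]
w2 = [0.8, -0.6]
b2 = 0.3
print(predict_rf([0.2, 0.7, 0.5], w1, b1, w2, b2))
```

If the network output is itself normalized, the value returned here would still need to be denormalized with Eq. (11) before being read as a recovery factor.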

Comparison to previous correlations
The performance of the proposed correlation was then tested by comparing it with previous correlations in predicting recovery factors. All correlations compared were intended to calculate the recovery factor of solution gas drive reservoirs. The previous correlations involved were the API 1956, API 1967, Arps, Gulstad, Onolemhemhen et al., Noureldien and El-Banbi, and Al-Tashi et al. correlations [2,11,17,21,24,25]. The first five correlations used the regression method, while the last two used the ANN method. The input parameters used by the correlations are given in Table 6.
For comparison purposes, 50 % of the 132 datasets were randomly selected. The prediction results were then compared with actual data to analyze the accuracy of these methods. The comparison is shown in Fig. 7 and Table 6.
Table 6 shows that the method of Onolemhemhen et al. gave a low negative correlation coefficient (R). This shows that there was a very weak correlation between the input parameters and RF. Additionally, this method gave predictions that deviated greatly from the data. Onolemhemhen et al. (2016) did not provide any information regarding the type of reservoir formation for their equation [25]. This is believed to drastically affect the accuracy of the model when applied to different environments.
The API 1956 method produced recovery factor predictions that were too optimistic. This is shown by the data plot that lies far above the diagonal line in Fig. 7. Therefore, this method provided a low R, as shown in Table 6. Adjustment of the coefficients and constants of the API 1956 equation led to the API 1967 and Arps (1968) methods, which generated better recovery factor predictions than the previous method. This is indicated by an increase in the correlation coefficient. Gulstad (1995) used the parameters of original oil in place, initial solution gas-oil ratio, and formation thickness to replace the parameters of oil formation volume factor, water saturation, porosity, and reservoir pressure, resulting in increased accuracy for predicting RF. The Gulstad method was the best regression method, with R of 0.55003.
All the ANN methods involved generated higher accuracy than the regression methods. The Noureldien and El-Banbi (2015) method used an ANN model with ten input parameters, as described in the introduction. In this study, Lorenz coefficient data were not available, so the Lorenz coefficient was generated randomly with a uniform distribution from 0.25 to 0.6, as specified in reference [17]. As shown in Table 6, this method generated a predicted recovery factor with R of 0.69155.
Al-Tashi et al. (2021) developed an ANN model using 10 input parameters in a combination different from that of the Noureldien and El-Banbi (2015) correlation, as given in Table 6. In addition, the correlation of Al-Tashi et al. applied the multi-objective gray wolf optimizer (MOGWO) algorithm introduced by Mirjalili et al. (2016) [28]. The application of this algorithm produced RF predictions closer to the actual conditions than the Noureldien and El-Banbi (2015) method, with a fairly high R, namely 0.85246.
The proposed correlation used an ANN model with 9 input parameters in a combination different from those of the Noureldien and El-Banbi (2015) and Al-Tashi et al. (2021) correlations, as described in Table 6. In addition, the proposed correlation applied the process of selecting the transfer function and optimizing the number of neurons and hidden layers, as shown in Figs. 4 and 5. Table 6 indicates that the proposed method had the best performance, since it generated the highest correlation coefficient (0.91657). In addition, almost all of the predictions of the proposed ANN lie closer to the diagonal line in Fig. 7 than those of the other models. This indicates the importance of carrying out the optimization steps discussed above to improve the accuracy of the ANN model in predicting recovery factors.

Sensitivity analysis of input parameters
Table 5
The proposed ANN-based weights and biases for calculating RF using Equation (12).

There are many parameters that influence oil recovery, from both rock properties and reservoir fluids. The Monte Carlo method was used to analyze the influence of the input parameters. All these parameters were varied 300 times randomly using a uniform distribution. The maximum and minimum limits for each parameter were based on the data given in Table 3. After that, the ANN was used to predict the output parameter. The relationship between variations in each input parameter and the output parameter is presented in Fig. 8. The figure shows the range of RF values for each parameter. The length of the range shows the magnitude of the influence of the input parameters on RF changes. Based on Fig. 8, the input parameters, ordered from the largest to the smallest influence on the recovery factor, are original oil in place (OOIP), formation thickness (h), oil viscosity (μ_o), porosity (φ), specific gravity of oil (API), oil formation volume factor at initial reservoir conditions (B_oi), permeability (k), water saturation (S_w), and initial pressure (p_i).
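A Monte Carlo sensitivity study of this kind can be sketched as follows. Several details are assumptions: each parameter is varied on its own while the others are held at the midpoint of their ranges (the paper does not state whether parameters were varied jointly or one at a time), the function `sensitivity_ranges` is hypothetical, and `toy_model` with its bounds merely stands in for the trained ANN and the limits of Table 3.

```python
import random


def sensitivity_ranges(model, bounds, n_samples=300, seed=0):
    """Rank parameters by the spread of model output when each is varied.

    model  : callable taking a dict {name: value} and returning a number
    bounds : dict {name: (minimum, maximum)}
    Returns a list of (name, output_range) sorted from largest to smallest.
    """
    rng = random.Random(seed)
    # Assumption: the other parameters stay at the midpoint of their range.
    base = {name: 0.5 * (lo + hi) for name, (lo, hi) in bounds.items()}
    ranges = {}
    for name, (lo, hi) in bounds.items():
        outputs = []
        for _ in range(n_samples):
            sample = dict(base)
            sample[name] = rng.uniform(lo, hi)   # uniform random variation
            outputs.append(model(sample))
        ranges[name] = max(outputs) - min(outputs)
    return sorted(ranges.items(), key=lambda item: item[1], reverse=True)


def toy_model(p):
    """Hypothetical stand-in for the trained ANN (not the paper's model)."""
    return 20.0 + 3.0 * p["a"] - 1.0 * p["b"] + 0.1 * p["c"]


toy_bounds = {"a": (0.0, 1.0), "b": (0.0, 1.0), "c": (0.0, 1.0)}
for name, spread in sensitivity_ranges(toy_model, toy_bounds):
    print(name, round(spread, 3))
```

The length of each output range plays the role of the bars in Fig. 8: a longer range means the parameter moves the predicted RF more over its observed limits.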
The level of influence of input parameters (independent variables) on the recovery factor (dependent variable) is unique to each case. It depends on the depositional environment, drive mechanism, and rock and fluid properties [13,29]. Previous studies showed different levels of influence of input parameters on the recovery factor, as shown in Table 7. Okpere and Ndibueze (2013) analyzed the influence of input parameters on recovery factors for 40 strong water-driven reservoirs with sandstone lithology in the Niger Delta. The results of their study showed that the pressure drop (p_i/p_a) was the most sensitive input parameter affecting the recovery factor. The analysis results are shown in the third column of Table 7 [13]. Babayeva (2019) analyzed parameters that were sensitive to recovery factors in the Guneshli offshore field. The field was divided into 10 isolated blocks. In addition, the Guneshli field consisted of eight layers. Sensitivity analysis indicated that several layers have different sequences of parameters that influence the recovery factor. The complete results are shown in the fourth to sixth columns of Table 7 [30].

Conclusions
Based on the results and analysis discussed above, several conclusions can be drawn as follows. An artificial neural network (ANN) can be applied to predict recovery factors for solution gas drive sandstone reservoirs. Correlation accuracy can be compared using the correlation coefficient (R). The Gulstad method is the best regression model, with R of 0.55003. All correlations using the ANN model show better accuracy than correlations using the regression model. Removing outliers, selecting input parameters, selecting the number of neurons, selecting the transfer function, and selecting the number of hidden layers are necessary to optimize the accuracy of the ANN model. The optimized ANN model provides better accuracy than previous ANN models. The proposed correlation has the highest R of 0.91657. In addition, according to the sensitivity analysis, the original oil in place (OOIP) has the greatest influence on the recovery factor. The limitation of the proposed approach, which will become a challenge in the future, is the calculation time. Calculation time problems are caused by variations in the number of neurons and by model complexity due to the number of hidden layers.

Data availability statement
Data can be obtained from Refs. [2,25].

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 4. Selection of the transfer function and the number of neurons.

Fig. 5. Selection of the number of hidden layers using the tangent-sigmoid function and 30 neurons.

Fig. 6. Results of the training, validation, and testing of the ANN model.

Fig. 7. Cross-plot of predicted and actual RF using the proposed ANN method and previous methods.

Fig. 8. Influence of input parameters on the recovery factor.

Table 1
Determination of upper and lower limits of parameters.

Table 2
Descriptive statistics before removing outliers.

Table 3
Descriptive statistics after removing outliers.

Table 4
Best correlations for various numbers of independent variables.

Table 6
Comparison of correlation coefficient.
ob, S_wi, B_ob, k, p_b, p_a ob, S_w, B_ob, k, p_b, p_a Regression 0.45446 Arps, 1968 φ, μ_ob, S_wi, B_ob, k, p_b, p_a, OOIP ob, S_w, p_i, p_ep, μ_oi, R_sa, B_ob, B_oi, OOIP

Table 7
Influence level of input parameters on recovery factor from largest to smallest.