Rapid evaluation and optimization of carbon dioxide‐enhanced oil recovery using reduced‐physics proxy models

Carbon dioxide (CO2) injection into oil reservoirs is a promising enhanced oil recovery method that also reduces carbon emissions into the atmosphere. However, CO2 injection simulation and optimization require substantial computation time, especially in real large-scale oil reservoirs. This paper integrates experimental design and machine learning to construct a reduced-physics surrogate model as an alternative to the complex reservoir simulator. This surrogate model was used to evaluate and optimize the CO2-assisted gravity drainage (CO2-GAGD) process, applied to a clastic reservoir in the Rumaila oil field. In the GAGD process optimization, five operational decision parameters controlling the production and injection activities were manipulated to attain optimal future oil production. Hundreds of simulation runs were created by Latin Hypercube Sampling to build the proxy-based optimization. The optimal scenario increased the cumulative oil production by 416 MMSTB at the end of the 10-year prediction period. Finally, four machine learning (ML) algorithms were built as proxy surrogates alternative to the complex compositional reservoir simulations: Quadratic Equation (QM), FUzzy logic-GEnetic algorithm (FUzzy-GEnetic), Multivariate Additive Regression Splines (MARS), and Generalized Boosted Modeling (GBM). To demonstrate the robustness of the ML approaches for fast CO2-GAGD process optimization, random subsampling cross-validation was adopted to identify the optimum proxy metamodel, the one with the lowest mismatch between the proxy- and simulator-based cumulative oil production. The GBM algorithm achieved the highest adjusted $R^2$ (0.9973) and the lowest root mean square prediction error ($3.704\times 10^{6}$), producing the most accurate metamodel. The resulting GBM proxy metamodel can therefore be used for fast optimization of the CO2-GAGD process.
Specifically, the GBM metamodel should achieve higher optimal reservoir flow responses, especially when a large number of simulation runs must be evaluated. Consequently, this trained GBM-ML proxy metamodel could be used for gas injection optimization and uncertainty assessment with far less computational effort than conventional approaches that rely on a complex flow simulator.


| INTRODUCTION
Worldwide energy consumption is increasing rapidly every year due to the technology revolution and significant population growth. Discoveries of new oil fields have become rare in the last decade. Therefore, boosting oil production by utilizing all available technological methods, including enhanced oil recovery (EOR), is inevitable. Specifically, EOR is becoming more important than ever, as finding new sources is becoming more costly and time-consuming, especially when oil prices face large reductions. Gas flooding is considered one of the most promising technologies used in EOR projects, and it is implemented through continuous gas injection (CGI) or water alternating gas (WAG) methods. More recently, the gas-assisted gravity drainage (GAGD) technique was introduced to increase the volume of recovered oil in both secondary and tertiary stages for immiscible/miscible gas injection methods. 1 Consequently, it was proposed to boost the volume of recovered oil in the main pay of the Zubair formation in the Rumaila oil field, located in southern Iraq. The field was selected because it is a mature oil field that has been producing for more than 60 years, and waterflooding is no longer effective at increasing oil recovery.
Elevating oil recovery by implementing gas flooding is crucial in field development through EOR projects, especially when conducted in real oil fields. Many factors impact reservoir production through CO2-EOR processes. Some of these parameters are operational and controllable by field operators, such as well production and injection constraints; others are uncontrollable, such as the geological properties. The operational factors need to be optimized to identify the optimal reservoir flow response. It is indispensable to define the ideal limits of the operational parameters that govern the effectiveness of the EOR method. These factors primarily comprise operational constraints on the injection and production wells in a given reservoir. More specifically, how the production constraints are defined regulates the produced and injected volumes in the reservoir and consequently has a major influence on the reservoir flow response. Therefore, optimizing these parameters can achieve ideal reservoir performance over time with respect to cumulative oil production as well as net present value (NPV). 2 Proxy modeling and the design of experiments (DoE), also known as designed experiments, are statistical techniques that are integrated to build a response surface methodology (RSM) as an uncomplicated substitute, or metamodel, for the convoluted models, so that the many constructed experiments in the optimization approach can be assessed without repeatedly evaluating the simulator itself. 3 Proxy-model optimization has not been reported in the literature for the CO2-GAGD technique. However, proxy models have been implemented in different reservoir studies and EOR modeling, such as oil production optimization, 4,5 waterflooding processes, [6][7][8] gas flooding processes, 9 steam injection, 10-13 chemical flooding, 14 and history matching. 
[15][16][17] The DoE concept generates various computer trials (realizations) by combining the levels of every parameter. These experiments are evaluated to compute the response factor. The created experiments and response factors are then used in statistical modeling to generate a relationship that characterizes the proxy or surrogate model. Many DoE approaches have been used in various reservoir simulation studies to build proxy models. The most common are fractional factorial design, 12 central composite (CC) designs, 18,19 D-optimality, also known as D-optimal design, 14 and Latin Hypercube Design. 15 There are many successful examples of proxy models in the reservoir studies literature, for instance, the second-degree polynomial equation, 2,11,20,21 kriging algorithms, 11,15,22 and artificial neural network algorithms. 5,15,23 Constructing a large and complex compositional reservoir flow simulation for the evaluation and optimization of CO2-EOR, especially in a real oil field, is an expensive and time-consuming step, particularly given the need to run hundreds of simulation scenarios through the optimization process. Therefore, it is essential to construct a simplified model (metamodel or surrogate) as an alternative to the complex flow simulation models. This simplified metamodel can provide very fast, high-quality analysis of complex reservoir engineering problems. Specifically, metamodeling or surrogate modeling serves as an alternative modeling technique that solves complex problems quickly by reducing the computation time while preserving the accuracy of the detailed complex solutions. The possible disadvantage of these metamodels relates to the accuracy of the algorithms used to build them. However, this point has been overcome by using advanced machine learning approaches that provide very accurate modeling and prediction in comparison to the outcomes of the complex models. 
Consequently, this paper combines the DoE and proxy metamodeling for the fast and accurate evaluation and optimization of the CO 2 -assisted gravity drainage (CO 2 -GAGD) process in heterogeneous clastic reservoirs of the south Rumaila oil field.
To the best of our knowledge, this study's new contributions in comparison to previous studies include several points: (i) the fuzzy logic-genetic algorithm and generalized boosted modeling approaches are used for the first time to construct proxy metamodels for fast optimization of gas injection processes based on a large reservoir flow simulation of a real oil field; (ii) the power of integrating experimental design with the fuzzy logic-genetic algorithm and generalized boosted modeling has never been adopted in the optimization of the CO2-GAGD process; (iii) the developed proxy metamodels can be adapted for fast evaluation and optimization of other CO2-EOR projects with similar reservoir characterization; (iv) the resulting accurate proxy metamodels can provide full optimization of CO2-EOR in real oil fields by running thousands or millions of designed simulations within a few seconds; (v) these accurate surrogate models can efficiently replace the computationally expensive compositional simulation models of other real oil fields when the data required to build the simulation models are lacking.
The entire evaluation and optimization workflow can be summarized by the following steps, as depicted in Figure 1: (i) Build a fully detailed compositional reservoir flow simulation to assess reservoir production by CO2-GAGD flooding during 10 years of future reservoir production. (ii) Then, construct the proxy models by manipulating the operational decision parameters that impact CO2 flooding through the DoE-Latin hypercube sampling (DoE-LHS) approach. (iii) More specifically, the DoE-LHS approach was used to create multiple simulation runs by generating multiple levels of each controlling gas injection parameter. (iv) The levels of the controlling parameters are manipulated in a way that ensures the minimum number of descriptive simulation runs that lead to achieving the optimum conditions of the CO2 injection implementation. (v) Later, after running hundreds of simulation cases, the DoE-LHS approach was incorporated to generate a simplified surrogate (metamodel) alternative to the compositional reservoir model. (vi) To construct the surrogate metamodels, four machine learning (ML) and artificial intelligence (AI) algorithms were adopted: polynomial proxy model, fuzzy logic-genetic algorithm, multivariate additive regression splines, and generalized boosted modeling. (vii) The random sub-sampling cross-validation with variance calculations was then used to validate the accuracy of the four proxy models.

FIGURE 1: General sketch of rapid GAGD process evaluation and optimization problems. GAGD, gas-assisted gravity drainage.
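The workflow steps above can be sketched end to end. The following is a minimal illustration, not the paper's actual implementation (which used R and a CMG-GEM compositional simulator): the `simulator` function is a hypothetical stand-in for the expensive flow model, and the factor count and run count are illustrative.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# (ii)-(iii) DoE-LHS: sample levels of 5 operational parameters in [0, 1]
design = qmc.LatinHypercube(d=5, seed=0).random(n=200)

# Hypothetical stand-in for the compositional simulator: any expensive
# black box mapping an operational scenario to cumulative oil production
def simulator(x):
    return 100 + 40 * x[0] - 25 * x[2] ** 2 + 10 * x[1] * x[3] + rng.normal(0, 1)

response = np.array([simulator(x) for x in design])

# (vi)-(vii) Fit a proxy and validate on a 30% random hold-out subset
X_tr, X_te, y_tr, y_te = train_test_split(design, response,
                                          test_size=0.3, random_state=0)
proxy = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, proxy.predict(X_te)) ** 0.5
print(f"hold-out RMSE: {rmse:.2f}")
```

Once validated, the cheap proxy replaces the simulator inside the optimization loop.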

| GAGD PROCESS
Natural separation of reservoir fluids is imperative to boost the recovery of bypassed oil by the CO2-GAGD technique, which supplies a gravity-stable oil sweep. The GAGD technique was patented to improve oil recovery in different production stages, including secondary and tertiary, for both immiscible and miscible gas injection practices. 1 The GAGD process is achieved by placing horizontal production wells at the lower part of the target reservoir. Then, immiscible or miscible gas injection in a gravity-stable state is initiated by the vertical injectors at the top of the reservoir. 24 Because of the gravity segregation arising from the different fluid densities at reservoir conditions, the injected gas accumulates at the top of the reservoir to generate a gas cap, as illustrated in Figure 2. 25 This process supplies a gravity-stable oil sweep that moves the hydrocarbon to the bottom of the reservoir towards the horizontal production wells, which helps achieve improved sweep performance and optimum oil recovery. CO2 is mostly favored for the GAGD process as it provides optimum volumetric sweep performance with ideal microscopic displacement efficiency, particularly when used in miscible injection processes. Furthermore, the ideal volumetric sweep performance helps retard CO2 arrival at the production wells. 1 Slowing down or diminishing the arrival of the injected gas minimizes concurrent gas-liquid flow and thereby allows continued gas injection to sustain the reservoir pressure.

| FIELD DESCRIPTION
The subject oil field was first discovered at the end of 1953 in the south of Iraq. The field is approximately 50 km to the west of Basrah city and 30 km to the west of the Zubair field. 26 The field is 100 km long and 12-14 km wide, and the reservoir lies about 3 km below sea level. The reservoir flanks dip at angles that do not exceed 3°, while the crest dips at only 1°. The South Rumaila oil field consists of many oil-bearing reservoir intervals.
The main prolific oil reservoir is the Zubair, characterized by the Late Berriasian-Albian cycle, with sediments dating to the Lower Cretaceous age. The thickness of the Zubair reservoir ranges from 280 to 400 m, and the sand-to-shale ratio shows that the Zubair formation comprises five members that are rich in organic content: upper shale, upper sandstone (main producing), middle shale, lower sand, and lower shale. 27,28 According to Al-Ansari, 26 the Zubair formation is a conventional reservoir that does not contain any complicated geological structures such as faults or fractures. 26 Four key sectors are defined in the South Rumaila field: Janubia, Rumaila, Shamiya, and Qurainat. The Rumaila sector and only minor regions of the Shamiya and Janubia sectors are considered in this study. The selection of these parts was based on the gathered reservoir data and the capability to characterize the major parts of the reservoir, where the wells are producing and water injection activities are performed, as shown in the structural contour map of Figure 3A.
The Zubair formation has two types of boundary conditions: a no-flow boundary and an aquifer. The no-flow boundary was proposed to cover the reservoir's northern and southern areas. This assumption is acceptable, as it mimics reality since the reservoir adopted balanced production and injection rates. Moreover, the streamlines in the north and south of the reservoir are crossed by the isobaric lines, which are perpendicular to the reservoir boundaries; therefore, the flow direction is parallel to these boundaries, as shown in Figure 3B. The east and west flanks, in contrast, are characterized by flow boundaries that represent the natural water drive, mimicking the effects of the existing infinite aquifer. 29 The South Rumaila field was first developed for primary production in 1954, while water injection commenced in the 1980s to maintain the reservoir pressure and to sustain the west flank aquifer support, which is about 20 times stronger than the east flank aquifer. 29,30 Throughout field development, 40 production wells were drilled in the regions under investigation in this study. The minimum well spacing between producers and injectors is 400 m. 28 The development of the South Rumaila field included shutting in some of the reservoir layers, as they were responsible for high water cut values that reached about 98%. Until 2004, the total volume of injected water reached approximately 1.1 billion barrels at various injection rates; the maximum injection rate was 426,000 BPD for 2 months in 1988. Since the water cut hit 80% in some wells, artificial lift (electrical submersible pumps, ESPs) has been used to sustain production. The main pay in South Rumaila has an original oil in place (OOIP) of 19.5 billion barrels, and the estimated recovery factor is around 55%. The sector studied here has an OOIP of 6.13 billion barrels.

| COMPOSITIONAL SIMULATION OF GAGD DESIGN
A comprehensive compositional reservoir simulation was constructed to optimize the recovery of oil by the GAGD process in the main pay reservoir through five main steps.

| The geological modeling
The geological description of the main pay reservoir in South Rumaila shows that it has three rock types: sand, shale, and shaly sand, distributed through the reservoir with different permeability ranges. To model the lithofacies and petrophysical properties, a fully detailed geostatistical model with 1,908,900 grid cells was employed using Multiple-Point Geostatistics and Sequential Gaussian Simulation. More specifically, multiple-point geostatistics was first used for 3D lithofacies modeling. Then, sequential Gaussian simulation was adopted for the 3D horizontal permeability and porosity distribution given each lithotype. 31

| Relative permeability
Since the oil relative permeability is a key factor in the gravity-drainage process, 33 three different relative permeability and capillary pressure curve sets were incorporated into the reservoir flow simulation model as a function of the three lithotypes, based on their permeability ranges: sand, shaly sand, and shale, 28 as depicted in Figure 6.

| History matching
The built geostatistical model, with 210 × 202 × 45 cells in the I, J, and K directions, was then upscaled to 69 × 66 × 12 cells to simulate the GAGD process. To preserve the reservoir heterogeneity, the arithmetic and harmonic mean formulas were adopted for the porosity and permeability upscaling, respectively. The upscaled model was the base for building the compositional reservoir flow simulation using the CMG-GEM package. 34

FIGURE 6: Relative permeability and capillary curves for the three lithofacies types: sand (top), shaly sand (middle), and shale (bottom).

The reservoir model should then be history matched before being used for future field development and planning. In history matching, the trial-and-error technique repeatedly adjusts the reservoir model parameters and calculates the flow responses to check the match with the observed measurements. The procedure continues until the minimum (satisfactory) mismatch error between the calculated and observed flow responses is reached. 35 In this paper, excellent curve matches of the cumulative oil production, field oil rate, cumulative water injection, and field injection rate were achieved in the early steps after adjusting the aquifer permeability and radius parameters and adopting the Carter-Tracy approach to simulate the infinite-acting aquifer. The acquired match is an excellent indicator of the model's performance, as it replicates water cuts and saturation distribution. The production and injection history of this field covers 56 years of production activities until the first quarter of 2010. Consequently, history matching was performed between 1954 and 2010, as shown in Figures 7 and 8.
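The upscaling rule mentioned above, arithmetic averaging for porosity and harmonic averaging for permeability, can be illustrated with a few hypothetical fine-grid values for one coarse block:

```python
import numpy as np

# Fine-grid cell properties within one coarse block (hypothetical values)
porosity = np.array([0.18, 0.21, 0.15, 0.24])
perm_md = np.array([120.0, 300.0, 45.0, 800.0])  # permeabilities in series, mD

# Arithmetic mean preserves total pore volume for porosity upscaling
phi_coarse = porosity.mean()

# Harmonic mean honors flow through layers in series (e.g., vertical flow)
k_coarse = len(perm_md) / np.sum(1.0 / perm_md)

print(phi_coarse, k_coarse)
```

The harmonic mean is always at or below the arithmetic mean, so low-permeability streaks correctly dominate the upscaled vertical permeability.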

| PVT fluid behavior
A detailed pressure-volume-temperature (PVT) model was built with the WinProp package to track the interaction of the primary and secondary fluids in terms of their compositions. WinProp is a Computer Modelling Group (CMG) compositional equation-of-state package used for phase behavior and property calculations. 34 In WinProp, the fluids are characterized by their components to capture the compositional interactions through the reservoir. The following calculation steps were conducted to build the full PVT model for simulating the GAGD process in the South Rumaila oil field. The compositions of the primary fluid (reservoir oil) and the secondary fluid (injected gas) are depicted in Figure 9. The general PVT data of solution gas-oil ratio (GOR), oil and gas formation volume factor (FVF), and oil and gas viscosity, all as functions of pressure, are illustrated in Table 1.

| The GAGD process simulation setting
After achieving a satisfactory history match, the reservoir model was used for the GAGD process evaluation over a 10-year future production period. The development strategy focuses on placing vertical and horizontal wells for simultaneous gas injection and oil production, respectively. The wells were placed at the early stage of development, all at the same time, to provide fast and efficient injection and production. A total of 33 wells were used to implement the key concept of the GAGD process: 22 vertical injection wells and 11 horizontal producers with 3000 m lateral length, placed in the reservoir layers where the lithology is sand and shaly sand. First, CO2 injection is commenced by the vertical injection wells at the shallower two layers. Simultaneously, the following three layers are utilized as a transition region to provide vertical space for gas gravity drainage. The next step involves setting up the horizontal producers through layers 6-8, which contain the highest oil saturation in the reservoir. Eventually, the remaining four layers did not involve injection or production since the water saturation in these layers is 100% from the infinite water aquifer. Figure 10 illustrates the 3-D reservoir model of facies distribution with the positions of the injection and production wells used for CO2 injection in the GAGD method. In Figure 10, the reservoir body is represented by the red color, which refers to the shale zones.

AL-MUDHAFAR ET AL. | 4119

FIGURE 7: History matching of entire field production of the South Rumaila oil field.

FIGURE 8: History matching of entire field injection of the South Rumaila oil field.
Sand zones and shaly-sand zones (labeled 1 and 2, respectively, in Figure 10) were perforated in the production and injection wells since they are considered high-permeability zones.

| OPTIMIZATION APPROACHES
DoE is a methodical numerical technique that generates a suitable group of experiments to serve as the basis of the simulation runs. DoE is mainly utilized to determine the most critical parameters that impact the response during sensitivity analysis. The DoE tool provides a way to identify the most likely case that accomplishes the optimum response from a certain procedure. 36

Abbreviations: FVF, formation volume factor; GAGD, gas-assisted gravity drainage; GOR, gas-oil ratio.
FIGURE 10: Production and injection well locations in sand and shaly-sand zones.

DoE is also used for robust design. 37 It is essential to accomplish the most precise experimental design model that imitates the physical model or process for faster, cheaper, and more flexible implementation. To accomplish this task, the necessary group of factors and interactions has to be investigated so that the analysis and implementation of results are accurate and trustworthy. 2 The key terms of DoE are the response parameter, which represents the result of a specific experiment, and the factor. A factor is a variable that impacts the response parameter and can be classified as primary or secondary based on its level of sensitivity. The overall count of designed experiments is defined by an exponential relationship: with k variables and 4 levels, the number of experiments equals $4^{k}$. Latin Hypercube Sampling was implemented in this study along with proxy modeling to identify the optimal values of the operational production decision factors for the GAGD method optimization.
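The exponential growth of the full-factorial count motivates LHS. A small sketch of that arithmetic, using the study's five factors as an example:

```python
from itertools import product

k, levels = 5, 4
# Full-factorial design: every combination of 4 levels across 5 factors
full_factorial = list(product(range(levels), repeat=k))
print(len(full_factorial))  # 4**5 = 1024 combinations

# An LHS design, by contrast, needs only as many runs as the analyst
# chooses (this study used 643), with each factor stratified over its range.
assert len(full_factorial) == levels ** k
```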
Latin Hypercube Sampling (LHS) is a numerical sampling method employed to produce random samples of the given input factors, creating several computer experiments from a distribution with several levels. 38 To represent various stages of variation for every single factor with the lowest number of trials, the sampling procedure delivers a restricted set of data points in the design space with a uniform distribution via a space-filling strategy. 39 LHS is an efficient design that generates uniform, low-discrepancy observations. 38 Figure 11 illustrates the space-filling design by LHS for two variables. In this figure, the sampled data are allocated randomly within the entire space, but all points are uniformly distributed to capture the full variation of the process being studied.
LHS creates more effective experiments for K parameters compared with the basic Monte Carlo sampling technique. More explicitly, LHS delivers a stable point design since it retains the highest spacing between each design point and the other points. 40 In LHS, sampling of K variables is conducted by splitting every parameter into several equal intervals. Furthermore, LHS is a thorough practice that randomly produces a new group of trials or experiments in case the given data set does not characterize the problem. However, there is no explicit approach to define the required number of trials or experiments that should be generated. 41 The computer experiments generated for optimization were produced in the R statistical language with the lhs package. 42
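The paper generated its designs with R's lhs package; a comparable sketch using SciPy's LHS sampler, with hypothetical operational ranges for the five factors, shows the stratification property described above (one sample per interval of each factor):

```python
import numpy as np
from scipy.stats import qmc

# Five operational factors, 50 space-filling runs
sampler = qmc.LatinHypercube(d=5, seed=42)
unit_design = sampler.random(n=50)          # samples in [0, 1)^5

# Scale each column to hypothetical operational ranges (rate, BHP, etc.)
lower = np.array([1000.0, 500.0, 0.30, 1.0e6, 2000.0])
upper = np.array([9000.0, 2500.0, 0.75, 9.0e6, 5000.0])
design = qmc.scale(unit_design, lower, upper)

# LHS property: each factor has exactly one sample per 1/n stratum
strata = np.floor(unit_design * 50).astype(int)
assert all(len(set(strata[:, j])) == 50 for j in range(5))
```

Each row of `design` is one simulation scenario to be evaluated by the flow model.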

| PROXY MODELING
The proxy approach deals with building a simplified model (metamodel) alternative to the complex model. This proxy model empowers users to achieve results equivalent to those acquired by the sophisticated models with significantly lower computational time, possibly a few seconds for millions of simulation runs, whereas the complex model consumes several days to obtain the results of hundreds of runs. The proxy model is built by fitting the operational-parameter training data to the response factor, represented by the following relationship: $y = f(X_1, X_2, \ldots, X_k) + \epsilon$, where $X_1, X_2, \ldots, X_k$ represent the input variables, $y$ refers to the anticipated response factor, and $\epsilon$ refers to the regression error.

FIGURE 11: Latin hypercube sampling design given two variables (R output).
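Fitting the relationship $y = f(X_1, \ldots, X_k) + \epsilon$ can be sketched with a quadratic proxy in scikit-learn; the data here are synthetic stand-ins, not the study's simulation runs, and the true response is deliberately quadratic so the fit is exact:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(200, 3))            # three input factors
# Hypothetical response with linear, quadratic, and interaction terms
y = 2.0 + 3.0 * X[:, 0] - 1.5 * X[:, 1] ** 2 + 0.8 * X[:, 0] * X[:, 2]

# degree=2 expands to intercept, linear, squared, and pairwise terms,
# i.e., a second-degree response surface
proxy = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
proxy.fit(X, y)
print(round(proxy.score(X, y), 4))  # noiseless quadratic data: R^2 ~ 1.0
```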
Cross-validation is necessary to maximize the chance of achieving the global optimum and to enhance the forecast precision of the proxy model. Random subsampling cross-validation was implemented on the entire experiment by sampling and dividing the given data set into two groups: a 30% testing set for forecasting and prediction, and a 70% training set for building the model. 43 More explicitly, the training subset was used as the basis for modeling the cumulative oil production, obtained from the reservoir simulator, as a function of the operational parameters. Predictions on the testing subset were then made with both the simulator and the proxy model.
In this paper, the proxy models were compared on the mismatch between the cumulative produced oil estimated by the reservoir simulation model and by the proxy model on the testing subset, not the training data set. More specifically, this cross-validation approach indicates the model's generalization ability and confirms that exterior forecasts can be made on new data unseen during training (the results can be trusted when applied to external datasets). The mismatch was computed through the root mean square prediction error (RMSE) and the adjusted $R^2_{adj}$. RMSE quantifies the expected squared deviation between the reservoir simulator and proxy model response factors, for instance, the cumulative produced oil, while $R^2_{adj}$ is a modified version of $R^2$, which shows how much variance a model can explain. $R^2_{adj}$ is adjusted for the number of predictors in the model and grows only if a new term enhances the model.
These metrics take their standard forms, $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ and $R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-k-1}$, where n is the count of experiments, k is the count of predictors (operational decision factors), $y_i$ is the simulator response, and $\hat{y}_i$ is the proxy prediction. Figure 12 illustrates the comprehensive flowchart for the DoE, proxy modeling, cross-validation, and application domain of the optimal proxy model. From the aforementioned DoE and proxy modeling, 643 simulation jobs (experiments of the operational parameters) were generated for the training and validation runs. These runs were then adopted for the comparison of four proxy models: Polynomial (Quadratic) Regression (QM), Multivariate Additive Regression Splines (MARS), FUzzy logic-GEnetic Algorithm (FUzzy-GEnetic), and Generalized Boosted Regression Model (GBM).
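The two validation metrics are straightforward to compute; the values below are illustrative, not the study's data:

```python
import numpy as np

def rmse(y_obs, y_pred):
    # Root mean square prediction error between simulator and proxy responses
    y_obs, y_pred = np.asarray(y_obs), np.asarray(y_pred)
    return np.sqrt(np.mean((y_obs - y_pred) ** 2))

def adjusted_r2(y_obs, y_pred, k):
    # R^2 penalized for the number of predictors k (operational factors)
    y_obs, y_pred = np.asarray(y_obs), np.asarray(y_pred)
    n = len(y_obs)
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

y_obs = [10.0, 12.0, 14.0, 13.0, 16.0, 18.0]   # hypothetical simulator output
y_pred = [10.2, 11.8, 14.3, 12.9, 15.8, 18.1]  # hypothetical proxy output
print(rmse(y_obs, y_pred), adjusted_r2(y_obs, y_pred, k=2))
```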

| Polynomial regression
The second-degree polynomial regression was employed to generate the RSM by constructing a nonlinear function that relates the input parameters to the response factor. 10,15 The common model for the polynomial RSM is the second-degree quadratic model. Models with more than two parameters are characterized by the following formula: $y = \alpha_0 + \sum_{j} \alpha_j X_j + \sum_{j} \alpha_{jj} X_j^2 + \sum_{i<j} \alpha_{ij} X_i X_j + \epsilon$, where $\alpha_j$ is the linear coefficient, $\alpha_{jj}$ is the coefficient of the quadratic term, and $\alpha_{ij}$ is the interaction coefficient of every two factors. The RSM was implemented entirely in the R statistical programming language through the rsm package. 44

| Multivariate additive regression splines

MARS is an improved form of linear regression that represents a nonparametric regression approach; it spontaneously formulates a function between variables, capturing nonlinearity by utilizing piecewise linear segments known as splines. 45 The MARS technique employs a group of coefficients and basis functions that define a formula relating the response parameters and the predictor variables. The MARS technique is suitable for multidimensional predictors since the basis functions divide the given input data into sections, where every section has its own group of coefficients, limiting the influence of potential outliers in the data set. 46 Modeling data by MARS proceeds in two key stages: a forward phase that explores possible knots to advance the modeling performance, and a backward phase that removes unimportant predictors. 47 Non-influential predictors are eliminated in the backward step based on the generalized cross-validation (GCV) criterion, which adaptively handles the diverse trends of the data. 45 The entire implementation of the MARS approach was performed through the earth package in the R statistical programming language. 49
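The piecewise-linear "hinge" basis at the heart of MARS can be illustrated without the earth package. The sketch below builds one MARS-style hinge pair by hand on a kinked toy target (the knot location and data are hypothetical; real MARS searches for knots automatically in its forward phase):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 300)
y = np.where(x < 4, 2 * x, 8 + 0.5 * (x - 4))      # piecewise-linear target

def hinge(x, knot):
    # MARS-style basis pair: max(0, x - knot) and max(0, knot - x)
    return np.column_stack([np.maximum(0, x - knot), np.maximum(0, knot - x)])

# With the knot placed at the true kink, two hinge terms fit the data exactly
basis = hinge(x, knot=4.0)
model = LinearRegression().fit(basis, y)
print(round(model.score(basis, y), 4))  # ~1.0 for this noiseless example
```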

| Fuzzy logic-genetic algorithm
Fuzzy logic is a method of knowledge representation suitable for concepts that cannot be defined precisely but are governed by their contexts. Fuzzy logic is a convenient way to build a model of fuzzy input and output data. The fuzzy logic system comprises three phases: fuzzifier, fuzzy inference system, and defuzzifier. 50 In particular, the mechanism of the fuzzy logic system can be described as follows: in the fuzzifier stage, the raw inputs to the system are converted into fuzzy inputs. These fuzzy inputs are then fed into the inference system, in which the actual calculations are performed. The next step incorporates the rule base with the fuzzy inputs and the inference engine to generate a fuzzy result for every rule. The rule base is a collection of expert knowledge. The fuzzy results form a fuzzy set, and this set is converted into a crisp value in the defuzzifier stage. 51 A Genetic Algorithm, in turn, is a random search tool that generates potential solutions that compete with each other to determine the most suitable one, applying operators of recombination, mutation, and selection that mimic genetic reproduction in a biological environment, comparable to the Natural Selection theory proposed by Darwin. 52 The FUzzy logic-GEnetic algorithm (FUzzy-GEnetic) is an evolutionary algorithm in which a population of fuzzy systems, randomly generated by the Genetic Algorithm, is used as a prediction model by fitting the given training data as labels. The full procedure of FUzzy-GEnetic proxy modeling was implemented through the fugeR R package. 53 In fugeR, the given data are used to evaluate each fuzzy system. The algorithm's forecast is compared with the labels, and each system receives a quantified performance. The population of chromosomes is then utilized to create 20% of the population for the subsequent generation based on crossover and mutation. The final generation identifies the fuzzy system that achieves the optimum performance.
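The genetic operators described above (selection, crossover, mutation across generations) can be shown on a deliberately simple toy fitness; this is a generic GA sketch, not the fugeR fuzzy-system encoding:

```python
import random

random.seed(0)

def fitness(bits):
    # Toy objective ("one-max"): maximize the number of 1-bits
    return sum(bits)

def evolve(pop_size=30, n_bits=20, generations=40, p_mut=0.02):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population (elitism)
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_bits)           # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each gene with small probability
            child = [1 - g if random.random() < p_mut else g for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(fitness(best))
```

In fugeR the "chromosome" instead encodes fuzzy membership functions and rules, and fitness is the mismatch against the training labels.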

| Generalized boosted regression model
GBM is an influential technique with various implementations in machine learning. GBM was developed 54,55 to capture the complicated dependencies of a nonlinear relationship. Specifically, GBM can be described as an extension of Freund and Schapire's AdaBoost algorithm and J. Friedman's gradient boosting machine. 56 In the literature, GBM has seen wide use within machine learning and data science and achieves high accuracy in forecasting and modeling the response parameters. In GBM modeling, repeatedly fitting new models leads to precise modeling, as it minimizes the difference between the observed and forecasted responses. A key concept of GBM is to train each new model on the negative gradient of the loss function. 57 The idea behind the GBM loss function is to penalize large deviations from the target results and to ignore insignificant residuals. 57 The GBM technique starts by selecting a differentiable loss function and a base model $F_0$. Afterward, iterations compute the next negative gradient, and the process finishes when convergence is obtained.
At each iteration, a regression tree h is fitted to the negative gradients −g(x_i). The full GBM-based proxy modeling practice was applied via the gbm package in the R language. 58
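For squared-error loss, the negative gradient reduces to the ordinary residuals, so the boosting loop above can be sketched as follows. This is a minimal illustrative Python version using one-dimensional regression stumps as the weak learners, not the gbm R package used in the study; the toy target function is an assumption for demonstration only.

```python
def fit_stump(x, residuals):
    # One-dimensional regression stump: pick the threshold split
    # that minimizes the sum of squared errors on the residuals.
    best = None
    order = sorted(range(len(x)), key=lambda i: x[i])
    for j in range(1, len(x)):
        thr = (x[order[j - 1]] + x[order[j]]) / 2
        left = [residuals[i] for i in range(len(x)) if x[i] <= thr]
        right = [residuals[i] for i in range(len(x)) if x[i] > thr]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - ml) ** 2 for r in left)
               + sum((r - mr) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    _, thr, ml, mr = best
    return lambda v: ml if v <= thr else mr

def gbm_fit(x, y, n_trees=50, lr=0.1):
    f0 = sum(y) / len(y)          # base model F0: the mean response
    trees = []
    pred = [f0] * len(y)
    for _ in range(n_trees):
        # Negative gradient of squared loss = residuals y - F(x).
        resid = [yi - pi for yi, pi in zip(y, pred)]
        h = fit_stump(x, resid)   # fit weak learner h to the residuals
        trees.append(h)
        pred = [pi + lr * h(xi) for pi, xi in zip(pred, x)]
    return lambda v: f0 + lr * sum(h(v) for h in trees)

x = [i / 20 for i in range(21)]
y = [xi ** 2 for xi in x]         # simple nonlinear target
model = gbm_fit(x, y)
```

The shrinkage factor lr plays the role of the learning rate in gradient boosting: smaller values require more trees but typically generalize better.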

| GAGD PRODUCTION OPTIMIZATION
The five operational decision parameters investigated for the immiscible GAGD production optimization are the maximum oil production rate (MAX_STO), minimum bottom hole pressure (MIN_BHP), and maximum water cut (MAX_WCUT) in the producers, along with the maximum gas injection rate (MAX_BHG) and maximum bottom hole injection pressure (MAX_BHP) in the injectors. These five parameters represent the main controls on the production and injection activities in the field. Specifically, they govern the amount of oil produced from the production wells by specifying the maximum limits of oil production rate and water cut and the minimum limit of bottom hole pressure, while the maximum injection rate and maximum injection pressure control the total amount of gas injected into the reservoir, which supports reservoir pressure and oil production. Table 2 lists the default parameters of the base simulation scenario of the GAGD technique along with the range (minimum and maximum levels) of each parameter within the optimization procedure.
In Table 2, the main predefined constraint is the maximum water cut value, which was specified according to the capabilities of the surface facilities that separate water from oil in the field. The minimum bottom hole pressure in the production wells is the second constraint; it governs the amount of oil production and keeps the pressure difference reasonable to improve oil production. Over-prediction of field performance occurs when the minimum bottom hole pressure is set to the lowest levels, which causes a large pressure difference and over-prediction of oil production. The levels of each factor in Table 2 were combined by Latin Hypercube Sampling to produce hundreds of simulation jobs (experiments). The designed experiments were then evaluated with the compositional flow model to estimate the cumulative produced oil over the 10-year prediction period (through January 1, 2026).
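The stratified sampling step can be sketched as follows. This is a minimal Python version of Latin Hypercube Sampling; the parameter bounds shown are illustrative placeholders only (the actual levels are those in Table 2).

```python
import random

random.seed(7)

def latin_hypercube(n_samples, bounds):
    """Latin Hypercube Sampling: each parameter's range is cut into
    n_samples equal strata, and each stratum is sampled exactly once."""
    design = []
    for lo, hi in bounds:
        width = (hi - lo) / n_samples
        # One random point per stratum, then shuffle the strata order
        # so strata are paired randomly across parameters.
        column = [lo + (k + random.random()) * width
                  for k in range(n_samples)]
        random.shuffle(column)
        design.append(column)
    # Transpose: one row = one simulation job (parameter combination).
    return list(zip(*design))

# Hypothetical ranges for the five GAGD decision parameters
# (MAX_STO, MIN_BHP, MAX_WCUT, MAX_BHG, MAX_BHP) -- illustrative only.
bounds = [(5000, 20000), (1500, 2500), (0.3, 0.9), (1e6, 5e6), (3000, 4500)]
jobs = latin_hypercube(625, bounds)
```

Unlike plain random sampling, this guarantees that every parameter's range is covered evenly even with a modest number of runs, which is why LHS is a popular low-discrepancy design for expensive reservoir simulations.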
The optimal solution refers to the simulation job that achieved the maximum cumulative oil production, as depicted in Figure 13, which also shows the field cumulative produced oil for the base simulation scenario of the GAGD technique, in addition to the general solutions that represent the nonoptimal cases. The total number of generated simulation jobs, including the optimal solution, was approximately 625 runs.
The optimal solution was identified and visualized relative to the field cumulative produced oil, as outlined in Figure 14. The optimum solution represents the maximum field cumulative produced oil over the 10-year prediction time. The general solutions in Figure 14, represented by the green curves, correspond to runs that combine low and/or poor factor levels; hence, they led to low levels of field cumulative oil production.
The cumulative produced oil during the prediction time through the base simulation scenario of the GAGD technique was 4.39 bn STB. Nonetheless, the optimal solution, obtained from the LHS-based proxy optimization (Optimal Case), increased the cumulative oil production to 4.6 bn STB. The incremental oil recovery is 215.2 million STB, as illustrated in Figure 15, which compares the base simulation case and the optimal GAGD process performance along with the primary production case of no injection. The optimum cumulative oil production was acquired by achieving the optimal levels of the production control factors, which are illustrated in Table 3.

TABLE 2 The operational decision parameters of the GAGD production optimization. Abbreviation: GAGD, gas-assisted gravity drainage.

FIGURE 13 Histogram of the base case, general, and optimal cumulative oil production through the Latin Hypercube Sampling-based GAGD process optimization. GAGD, gas-assisted gravity drainage.
In Figure 15, a substantial increment in oil recovery was obtained from the optimal solution in comparison with the base simulation scenario of the original operational parameter settings. The field cumulative oil produced under the base scenario in 10 years can be produced after only 18 months of the prediction period under the optimal case. Four proxy modeling approaches were then compared, aiming to identify the optimum proxy model that can be adopted as an ideal metamodel of the nonlinear CO2 injection through the GAGD technique.
After sampling and subdividing the data set of 643 designed experiments into 75% training and 25% testing subsets, the modeling was implemented based on 450 simulation jobs (training subset) through the four aforementioned proxy approaches. The prediction from each of the four proxy models was then evaluated on 193 runs (testing subset). The RMSE along with the adjusted coefficient of determination (R²adj) were computed to compare the cumulative produced oil estimated by the simulation model with that forecasted by the proxy models. Figures 16, 18, 20, and 22 illustrate the match between the cumulative produced oil from the simulation model (Observed) and that forecasted by the proxy models (Predicted) for the test subset jobs for the QM, MARS, FUzzy-GEnetic, and GBM, respectively. Figures 17, 19, 21, and 23 depict the same comparison for the full data set jobs for the QM, MARS, FUzzy-GEnetic, and GBM, respectively. In the (Left) figures, the blue balls refer to the observed response values: the cumulative oil production estimated from the compositional reservoir simulator. The red balls represent the predicted cumulative oil production from the four proxy models. In addition, the (Right) figures show scatter plots matching the simulation- and proxy-model cumulative produced oil (Figures 16-23).

FIGURE 18 Comparison of proxy-predicted and simulator-calculated FUzzy-GEnetic given the test subset.

FIGURE 19 Comparison of proxy-predicted and simulator-calculated FUzzy-GEnetic given the full data set.
The comparison between the four proxy metamodels (QM, FUzzy-GEnetic, GBM, and MARS) was conducted based on the RMSE and R²adj with respect to the test subset and the full data set, as illustrated in Table 4.
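The two comparison metrics can be computed as follows. This is a minimal Python sketch of RMSE and adjusted R²; the observed/predicted values are toy numbers for illustration, not the study's data.

```python
import math

def rmse(observed, predicted):
    # Root mean square error between observed and predicted responses.
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted))
                     / len(observed))

def adjusted_r2(observed, predicted, n_params):
    # R^2 penalized for the number of fitted model parameters,
    # so adding useless parameters cannot inflate the score.
    n = len(observed)
    mean = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_params - 1)

# Toy example: a near-perfect proxy fit.
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.1, 1.9, 3.2, 3.8, 5.1]
```

Because R²adj subtracts a penalty proportional to the parameter count, it is the fairer criterion when comparing proxy models of different complexity, as done in Table 4.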
The GBM proxy model outperformed the QM, FUzzy-GEnetic, and MARS models, as GBM had the lowest RMSE and the highest R²adj. Moreover, the scatter match between the simulation- and proxy-based cumulative produced oil was closer for the GBM model than for the QM, FUzzy-GEnetic, and MARS models, as most of the GBM points fall on the 45° line.

FIGURE 20 Comparison of proxy-predicted and simulator-calculated GBM given the test subset.

FIGURE 21 Comparison of proxy-predicted and simulator-calculated GBM given the full data set.

| APPLICABLE DOMAIN OF GBM MODEL
To conclude, a statistical outlier-identification analysis was carried out to assess whether the GBM model and the data set upon which it was built are statistically valid. In this study, outliers were identified using the leverage strategy and visualized with a Williams plot, owing to its high accuracy in detecting outliers. The standardized residual (R) values resulting from the model predictions are plotted against the hat (H) values, which are the diagonal elements of the hat matrix, defined as follows 59,60: H = X(XᵀX)⁻¹Xᵀ, where X represents a matrix of size n × D, n stands for the number of data points, D represents the number of parameters, and Xᵀ denotes the transpose of X. The applicability domain is bounded by a leverage limit value H* calculated as 3(D + 1)/n in the Williams plot. Importantly, eligible data points must have R values in the range of −3 to 3. 61,62 Consequently, acceptable data points fall within the following ranges: 0 ≤ H ≤ H* and −3 ≤ R ≤ 3. A high proportion of valid data points implies a model that is highly consistent in its predictions. Figure 24 depicts the Williams plot of cumulative oil production by the GBM model. As can be observed, 96.11% of the cumulative oil production data points fell within the valid ranges 0 ≤ H ≤ 0.032 and −3 ≤ R ≤ 3. These findings provide further evidence of the statistical validity of the GBM model and the data set employed in this investigation.
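The leverage-based screening can be sketched as follows. This is a minimal Python version that, for simplicity, uses the closed-form leverages of a single-predictor regression (the study's hat matrix is built from the full n × D design matrix X) and internally studentized residuals as R; the demo data are invented for illustration.

```python
def leverages(x):
    """Leverage (hat-matrix diagonal) for simple linear regression:
    h_i = 1/n + (x_i - mean)^2 / sum_j (x_j - mean)^2."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((v - mean) ** 2 for v in x)
    return [1 / n + (v - mean) ** 2 / sxx for v in x]

def williams_flags(x, residuals, n_params):
    """Flag each point as inside (True) or outside (False) the
    applicability domain: 0 <= H <= H* and -3 <= R <= 3."""
    n = len(x)
    h_star = 3 * (n_params + 1) / n   # leverage cutoff H* = 3(D+1)/n
    sd = (sum(r * r for r in residuals) / (n - n_params - 1)) ** 0.5
    flags = []
    for h, r in zip(leverages(x), residuals):
        r_std = r / (sd * (1 - h) ** 0.5)   # standardized residual
        flags.append(h <= h_star and -3 <= r_std <= 3)
    return flags

# Demo: nine small residuals plus one large one (a likely outlier).
x_demo = list(range(10))
res_demo = [0.1] * 9 + [5.0]
flags = williams_flags(x_demo, res_demo, n_params=1)
```

Points failing the R bound are response outliers, while points failing the H* bound lie outside the model's applicability domain even if their residuals are small.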

| SUMMARY AND CONCLUSIONS
The summary and conclusions of this study are outlined as follows:

1. The compositional reservoir simulator was used for the GAGD process evaluation within a nonhomogeneous main pay reservoir in the South Rumaila field.
2. After achieving the history matching, 10 years of future production was set as a prediction period to acquire the optimum recovery through changing the operational decision parameters, which govern the injection and production processes.
3. Latin Hypercube Sampling was implemented as a low-discrepancy Experimental Design technique to generate several hundred experiments to be assessed by the simulation model, both to identify the optimum recovery and to construct the proxy model. This DoE and proxy modeling approach includes determining an ideal group of operational decision parameters for the immiscible GAGD technique.
4. The parameters are the CO2 injection rate and maximum BHP in the injectors, along with the maximum oil production rate, minimum BHP, and maximum water cut in the horizontal producers. The optimal case obtained 4.6039 billion STB, an increment of 212.5 million STB of oil over the base GAGD case (360 million STB over the primary production case).
5. The proxy modeling workflow includes generating simulation jobs as training runs to build the proxy model, which was iteratively validated through four sets of validation tests (verification runs).
6. To create an accurate proxy model that truly emulates the compositional reservoir simulator (a metamodel), polynomial regression and three further approaches were adopted. The four constructed models are Polynomial Regression (QM), FUzzy logic-GEnetic algorithm (FUzzy-GEnetic), Generalized Boosted Modeling (GBM), and Multivariate Additive Regression Splines (MARS). The accuracy comparison between the four proxy models was conducted with respect to R²adj and RMSE for the prediction of the test subset and the full data set.
7. The GBM model was the most accurate metamodel for the GAGD process, as it attains the minimum RMSE and the maximum R²adj, both reflecting the lowest mismatch between the cumulative produced oil estimated by the simulation model and that forecasted by the GBM proxy model.
8. The MARS proxy model was the second-best matching model, followed by the polynomial (QM) and FUzzy-GEnetic proxy models. Additionally, the cumulative produced oil of the GBM proxy shows better scatter-point matching with the simulation model than the MARS, QM, and FUzzy-GEnetic models.
9. Consequently, the GBM model can be implemented as a simplified substitute metamodel instead of the high-resolution compositional reservoir model for GAGD technique assessment and forecasting.
10. The Williams plot of estimated cumulative oil production corroborated the validity of the database and the findings provided by the GBM model, with fewer than 3.88% of data points highlighted as probable outliers.
11. Finally, one of the most valuable outcomes of this study is the implementation of an effective GBM-based proxy model for forecasting and evaluating the viability of CO2-EOR initiatives. This robust proxy model may be utilized as a reliable template for predicting EOR performance in mature fields in real-world settings, owing to the extensive use of actual field data in its development. The machine learning modeling approach may be readily adapted to other oil fields with similar reservoir characteristics.