Optimization of tight gas reservoir fracturing parameters via gradient boosting regression modeling

In China, the exploitation of most unconventional oil and gas reservoirs depends on hydraulic fracturing, which is a key method for developing tight gas formations. Numerous scholars and field engineers, both domestically and internationally, have conducted extensive numerical simulations and physical experiments to study fracture propagation and predict post-fracturing productivity. Although some progress has been reported, it remains difficult to accurately predict well productivity using mechanistic models owing to the vertical multilayered development of tight gas reservoirs. In this study, vertical fractured wells in a block of the Sulige gas field were examined. The block relies on hydraulic fracturing to produce tight gas. However, as development progressed, the available reservoir environment deteriorated, large differences emerged between wells after fracturing, and the fracturing results did not meet expectations. Geological, construction, and production data collected for this block since 2007 were analyzed. After applying multiple machine-learning methods to filter outliers and fill in missing values, k-means clustering was combined with the CatBoost, XGBoost, and LightGBM algorithms to establish a regression model. The results revealed that the regression accuracy on the test set after clustering was as high as 70% and that the LightGBM model had the best regression effect for the 227 stripper wells in the block. After optimizing the fracturing construction parameters (fracturing fluid volume, proppant volume, liquid-nitrogen volume, and pumping rate), the average fracturing fluid and liquid-nitrogen volumes per well decreased, whereas the proppant and liquid-nitrogen volumes per unit reservoir thickness increased. In addition, 182 wells showed an improved initial production capacity after fracturing, and the average gas production index per meter increased by 22.04%. This approach enables rapid and efficient production forecasting and construction optimization, and it represents a novel fracture design method applicable to onsite engineers in tight gas production fields in the Ordos region.


Introduction
In the oil and gas production process, rapid and accurate prediction of preconstruction productivity is critical in the oilfield department, and subsequent measures should be taken to enable stripper wells to achieve enhanced oil recovery. With the accumulation of relevant data and advancements in machine learning in recent decades, the applications of machine learning in oil and gas production have increased. Although several relevant studies have been conducted recently, owing to the shortage of algorithms and uncertainty in the oil and gas industry, a complete theoretical system and sample database have not yet been developed in the field of oil and gas intelligence [1]. Selecting a reasonable method to analyze the geological conditions and construction method of a block is critical for ensuring effective analysis.
Currently, there are two main approaches for optimizing the design of hydraulic fracturing parameters: crack inversion and production simulation methods. The crack inversion method primarily focuses on fracturing parameters, reservoir characteristics, and economic factors. By simulating and analyzing the expansion of hydraulic fractures, the formed fractures could be described, enabling the establishment of a matching relationship between the fracturing fractures and reservoir properties. This provides guidance and suggestions for optimizing fracturing parameters. Jiang et al. [2] proposed a new method that combines DPVS model, reservoir numerical simulation, and reservoir classification to optimize fracture parameters of heterogeneous horizontal gas wells. Jiang et al. [3] provided theoretical support for horizontal wells in the BZ oilfield, establishing physical fracturing models and productivity prediction mathematical models. Specifically, they focused on the impact of fracture parameters and injection wells on comprehensive production. Salah et al. [4] improved the production of horizontal multi-stage fracturing wells, reduced costs, and increased profits by integrating rock physics, geomechanics, and production data. However, the effectiveness of parameter optimization is significantly dependent on the accuracy and reliability of the hydraulic fracture model. Additionally, hydraulic fracture simulations are computationally intensive, and performing full-wellbore crack-inversion simulations along the horizontal section of a well is time-consuming and labor-intensive.
The production simulation method utilizes numerical reservoir simulation techniques. Based on the accuracy of fitting historical production data with numerical models, different fracture parameter schemes were established and executed. Specifically, production was considered a constraint for parameter optimization. Yu et al. [5] used response surface methodology, combined with hydraulic fracturing numerical simulation and economic analysis, to maximize NPV and optimize the production efficiency of unconventional gas reservoirs by considering key parameters. Rammy et al. [6] optimized the hydraulic fracturing parameters and horizontal well length of shale gas reservoirs via differential evolution to improve economic efficiency. Li et al. [7] significantly improved the gas production of coalbed methane wells via optimization of construction parameters. Xu et al. [8] revealed the relationship between hydraulic fracturing parameters under different geological conditions and uniform crack propagation, SRV, and NPV. This production simulation method offers better real-time performance and flexibility. It allows the visualization and presentation of results, facilitating understanding and analysis. However, it has higher input data requirements, and significant computational resources and time are required when dealing with large-scale and complex reservoir systems.
With the advancement of computer performance and urgent need for data processing, data mining methods, such as machine learning and deep learning, have provided new approaches for optimizing fracturing parameters. Many scholars are currently attempting to leverage these methods, breaking through previous assumptions and limitations and using data-driven models in conjunction with field production data for fracturing parameter optimization. Researchers, such as Koroteev et al. [9], analyzed the application of artificial intelligence in the upstream field of oil and gas, emphasizing risk reduction, process acceleration, and data, personnel, and collaboration challenges. Sircar et al. [10] reviewed the latest advances in machine learning and artificial intelligence technologies in data processing and interpretation to improve performance as well as reduce risks and costs. Aung et al. [11] summarized the applications of artificial neural networks and support vector machines in geological data interpretation, price prediction, and flow regime prediction to improve exploration and production efficiency. Choubey et al. [12] reviewed the applications of artificial intelligence and machine learning technologies in the oil and gas industry, spanning from exploration to distribution. They highlighted their crucial role in big data utilization and decision-making. Wang et al. [13] and Zhou et al. [14] established data mining models to understand the relationship between parameters and production capacity and optimize fracturing parameters based on the optimization model. Moreover, other studies by Al Mudhafar [15] and Pankaj et al. [16] utilized surrogate models to optimize fracturing parameters.
Based on local and global research, data mining technology has been shown to be applicable in the field of oil and gas development. Its applications mainly include reservoir parameter prediction, fracturing effect prediction, well and layer selection for fracturing, and fracturing decision-making. However, there remain some challenges in the application of data mining techniques to optimize fracturing processes in oil and gas field development.
First, predictive models based on data mining methods primarily focus on production as the target variable; however, production varies significantly under different production systems.
Second, multiple factors influence the effectiveness of fracturing, and previous studies focused on different factors. Some researchers considered fewer factors, leading to a less comprehensive analysis or an unequal treatment of different factors.
Fracturing is the primary stimulation method, and the production after fracturing is one of the main indicators for evaluating the fracturing effect. Establishing a prediction model for production after fracturing and optimizing the fracturing parameters based on the prediction results can improve production and increase natural gas recovery. However, owing to the harsh construction environment, data are difficult to record and are often lost, which complicates the forecasting of production after fracturing. Therefore, after consulting with field engineers, we addressed the issue of missing critical fracturing parameters in the field. Outliers were removed, and data imputation models using random forest, KNN, and miceforest [17] were developed to populate the missing data. The optimal dataset was then chosen for production capacity prediction. When the prediction accuracy was insufficient, a clustering process was applied prior to prediction, and the best prediction model was subsequently selected for each production level. For the stripper wells, regression models were used to determine the relationship between each influencing factor and the meter recovery index, and after evaluating various models, the LightGBM stripper well productivity prediction model was chosen as the most accurate. An optimization model was then built using a genetic algorithm, focusing on refining four factors: the fracturing fluid volume, proppant volume, liquid nitrogen volume, and construction displacement. This resulted in a 22.04% increase in the gas production index per meter. Additionally, it reduced fracturing costs and enhanced the efficiency of fracturing agents. Compared with the conventional numerical simulation methods in the industry, the proposed approach exhibits greater accuracy and speed, aiding in identifying and addressing field challenges. Theoretically, it augments the output of stripper wells while bolstering fracturing efficiency and economic benefits.

Geological overview of the target block
The operation block is located in the northern part of the Sulige gas field (Fig. 1), with an area of 1162 km² and natural gas geological reserves of 177.716 billion m³. Since fracturing operations began in 2007, the size of the enriched area has decreased annually with the development of interlayers. In thin interbedded reservoirs, hydraulic fractures are restrained by the interlayers and are difficult to extend, so it is difficult to ensure that the fractures extend through the reservoir and connect to an effective sand body. Currently, the focus is on the sub-enrichment regions of natural gas and the extraction of thin interbedded reservoirs. Owing to the considerable variability in construction parameters during the fracturing process, issues such as proppant plugging and thin interbedded channeling can arise. These issues can lead to a less effective fracturing stimulation than anticipated, thus impacting the block's recovery efficiency. Concurrently, the porosity and permeability of the succession area have diminished, and the fracturing fluid can potentially damage the reservoir. With low gas saturation and inadequate natural energy, these characteristics are indicative of a low-gas reservoir [18][19][20].

Data processing
Prior to data analysis, block reservoir and production data were collected and processed. The selection of evaluation indicators corresponds to the first step in developing a fracturing database that establishes prediction models based on evaluation indicators. The raw data used in this study were obtained from the production construction system database of the fracturing unit in the Su A Block. The acquired raw data were sorted, and factors related to the production capacity were screened and classified into reservoir physical, rock mechanics, and fracturing construction parameters. These classifications pertain to 538 production wells. The details are presented in Table 1 and Fig. 2.
Production is primarily influenced by reservoir geology and post-fracturing construction. Throughout the production process, the factors impacting productivity are multifaceted and dynamic. Generally, the open flow rate and daily gas production rate can indicate the production status of a gas well. However, in the target block, most gas wells are not evaluated during this process. Based on a summary and analysis of production experiences, the wellhead casing pressure was integrated with cumulative gas production. This combination was then used to define the capacity index based on the output per unit of production pressure difference (Eq. (1)).
Based on this, the average pressure-drop yield observed over the 90 days after fracturing construction was used in place of the conventional productivity index to represent the post-fracturing production capacity. Specifically, the daily average gas production per unit pressure drop and per unit reservoir thickness within the 90 days following construction was employed as the meter recovery index. The calculation is shown in Eq. (2).
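The meter recovery index described above can be illustrated with a short calculation. Because the exact form of Eq. (2) is not reproduced in this text, the Python sketch below assumes one plausible reading based on the symbol definitions: the mean, over the first 90 days, of daily gas production divided by the daily casing-pressure drop, normalized by the effective reservoir thickness. The function name, argument names, and the rule for skipping days without a pressure drop are illustrative assumptions rather than the authors' implementation.

```python
from statistics import mean

def meter_recovery_index(daily_gas, p_max, p_min, thickness_m, days=90):
    """Assumed form of the index: mean(q_g / (p_cf1 - p_cf2)) / h.

    daily_gas   : daily gas production q_g (10^4 m^3/d), one value per day
    p_max, p_min: daily maximum and minimum casing pressure (MPa)
    thickness_m : effective reservoir thickness h (m)
    """
    if thickness_m <= 0:
        raise ValueError("reservoir thickness must be positive")
    ratios = []
    for q, hi, lo in list(zip(daily_gas, p_max, p_min))[:days]:
        drop = hi - lo
        if drop <= 0:        # assumption: ignore days with no measurable drop
            continue
        ratios.append(q / drop)
    if not ratios:
        raise ValueError("no day with a positive pressure drop")
    return mean(ratios) / thickness_m

# Hypothetical two-day record for a well with 4 m of effective thickness.
example_index = meter_recovery_index([2.0, 4.0], [10.0, 10.0], [9.0, 8.0], 4.0)
```

Normalizing by thickness makes wells with different pay thicknesses comparable, which is the stated purpose of a per-meter index.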

Feature engineering
Owing to the limitations of the field equipment, a significant amount of data was missing in the collected set. Typically, missing data are handled by filling features that have a low missing rate with their mean value and discarding features that have a high missing rate and no discernible pattern. However, such methods can distort the original data distribution and discard valuable information. A more effective approach is to impute as much of the data as can reasonably be recovered. In this research, outliers were first detected and removed (Fig. 3 (a)-(o)), and the missing values were then filled using random forest, KNN, and multiple imputation. The random forest method is adept at managing high-dimensional data and remains accurate even when many features are missing. KNN is less sensitive to outliers and offers high accuracy in filling data. Imputation techniques can be single or multiple in nature. Although single imputation is straightforward, it often falls short in addressing data uncertainty. Multiple imputation, on the other hand, can mitigate these shortcomings through various functions and models. Given the lack of a well-established theoretical foundation for devising a construction plan in the field, and the high variability among construction parameters, this study employed three distinct methods to address missing values and compared their effects. Following a final selection and reduction process, 444 data entries were preserved for further analysis. The outcomes of the data-filling procedures of the three methods are detailed in Table 2.
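As an illustration of the KNN filling step, the following minimal Python sketch fills each missing entry with the mean of that feature over the k nearest wells, where distance is measured only on the features that both wells have recorded. It is a simplified stand-in under stated assumptions (Euclidean distance averaged over shared features, an unweighted mean, and the hypothetical function name `knn_impute`); the settings actually used in the study are not given in the text.

```python
import math

def knn_impute(rows, k=3):
    """Fill None entries with the mean of the k nearest rows that have a value.

    rows: list of equal-length lists; missing values are None.
    Entries with no usable neighbour are left as None.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        # Root mean squared difference over the features both rows contain.
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = sorted(d for d in donors if d[0] < math.inf)[:k]
            if donors:
                filled[i][j] = sum(v for _, v in donors) / len(donors)
    return filled

# Hypothetical wells: [proppant volume, fluid volume]; the last record is incomplete.
wells = [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [1.5, None]]
completed = knn_impute(wells, k=2)
```

In practice the features would be normalized before distances are computed, so that a feature with large numeric values does not dominate the neighbour search.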
By comparing the filling errors of the different algorithms listed in Table 2, the results of multiple imputation were selected as the samples for subsequent research. The effects of filling in the data are shown in Fig. 4 (a)-(o). For the processed data, the differences between variables should be preserved as far as possible, whereas the influence of different orders of magnitude should be eliminated. From the perspective of field demand, the higher the production efficiency, the better the production effect. Therefore, when the data have m features and each feature has n samples, Eq. (3) can be used to process the data and eliminate dimensional effects.
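The form of Eq. (3) is not reproduced in this text. A common way to remove dimensional effects for "larger is better" indicators is min-max scaling, and the sketch below assumes that choice purely for illustration; the function name and the treatment of constant columns are likewise assumptions.

```python
def min_max_normalize(columns):
    """Scale each of the m feature columns (n samples each) to [0, 1].

    A constant column carries no contrast between wells and is mapped to 0.0.
    """
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled.append([(v - lo) / span if span else 0.0 for v in col])
    return scaled

# Hypothetical example: two features on very different scales.
normalized = min_max_normalize([[10, 20, 30], [5, 5, 5]])
```

After scaling, every feature lies on the same range, so differences in order of magnitude no longer influence distance-based steps such as clustering.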

Productivity regression model
Evaluating productivity is of great significance for oilfield development and production. It can be used to evaluate and improve the preliminary exploration results and provide a reference for the design of construction methods. After data pre-processing, we introduced a weight analysis step. Weight analysis reveals the degree to which different factors contribute to productivity and supports further optimization of the model. In this process, the grey correlation, entropy weight, and maximal information coefficient methods were selected, and their results were consolidated to avoid the contingency of relying on a single method. The correlation results are presented in Fig. 5 (a)-(d) and Table 3.
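Of the three weighting methods, the entropy weight method is the simplest to state compactly. The sketch below follows its textbook form: each feature column is converted to proportions, its Shannon entropy is normalized by ln(n), and features with lower entropy (more contrast between wells) receive larger weights. The function name and the sample values are hypothetical, and the inputs are assumed to be non-negative and already normalized.

```python
import math

def entropy_weights(columns):
    """Entropy weights for m feature columns of n > 1 non-negative samples each."""
    n = len(columns[0])
    divergences = []
    for col in columns:
        total = sum(col)
        probs = [v / total for v in col] if total else [1.0 / n] * n
        # Terms with p = 0 contribute nothing, by the convention 0 * ln(0) = 0.
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        divergences.append(1.0 - entropy)
    norm = sum(divergences)
    if norm == 0:                 # every feature is equally uninformative
        return [1.0 / len(columns)] * len(columns)
    return [d / norm for d in divergences]

# Hypothetical example: a flat feature and a highly concentrated one.
weights = entropy_weights([[1, 1, 1, 1], [1, 0, 0, 0]])
```

A feature that takes the same value in every well has maximal entropy and therefore a weight near zero, which matches the intuition that it cannot explain differences in productivity.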
After studying and comparing various methods, CatBoost [21][22][23], XGBoost [24][25][26][27], and LightGBM [28][29][30][31] were selected, and the consolidated (weighted-sum) weight results were used to establish a model for predicting the gas meter recovery index. In this study, 30% of the data were randomly selected as the test set. The model results are presented in Table 4 and Fig. 6.
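CatBoost, XGBoost, and LightGBM are all gradient boosting methods built on decision trees. The sketch below is not any of those libraries; it is a deliberately simplified gradient boosting regressor for squared error that uses one-split trees (stumps), shown only to make the workflow concrete: hold out 30% of the samples at random, fit on the remainder, and report R² on the held-out set. The synthetic data, function names, and hyperparameters are all hypothetical.

```python
import random

def fit_stump(X, residuals):
    """Return (feature, threshold, left_mean, right_mean) with the lowest squared error."""
    best = None
    for f in range(len(X[0])):
        # The largest value is skipped because it would leave the right side empty.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [r for row, r in zip(X, residuals) if row[f] <= t]
            right = [r for row, r in zip(X, residuals) if row[f] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    return None if best is None else best[1:]

def fit_boosting(X, y, rounds=60, learning_rate=0.3):
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        # For squared error, the negative gradient is the current residual.
        residuals = [t - p for t, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        f, t, lm, rm = stump
        stumps.append(stump)
        preds = [p + learning_rate * (lm if row[f] <= t else rm)
                 for row, p in zip(X, preds)]
    return base, learning_rate, stumps

def predict(model, row):
    base, lr, stumps = model
    return base + lr * sum(lm if row[f] <= t else rm for f, t, lm, rm in stumps)

def r2_score(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

def train_test_split(X, y, test_fraction=0.3, seed=0):
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(round(len(idx) * test_fraction))
    test, train = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

# Synthetic stand-in for well data: two features and a known additive response.
rng = random.Random(1)
X = [[rng.uniform(0, 10), rng.uniform(0, 5)] for _ in range(120)]
y = [2.0 * a + (3.0 if b > 2.5 else 0.0) for a, b in X]
X_tr, y_tr, X_te, y_te = train_test_split(X, y)
model = fit_boosting(X_tr, y_tr)
test_r2 = r2_score(y_te, [predict(model, row) for row in X_te])
```

Production libraries add deeper trees, regularization, and histogram-based split search, but the residual-fitting loop is the same idea.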
As shown in Table 4, the LightGBM exhibited the best regression effect; however, the regression errors of the three models remained unacceptable for field applications. To solve this problem, a clustering algorithm was used to further process the sample dataset. Given the influence of several factors on reservoir productivity, it is difficult to obtain quantitative predictions. The classification of the productivity levels of the producing wells not only constrains and guides quantitative predictions but also improves the accuracy of the quantitative prediction based on the classification.

Data cluster
Cluster analysis is an unsupervised machine learning algorithm and is an important technique for mining data distributions and hidden patterns [32]. Based on the principle of minimizing the distance within a group and maximizing the distance outside the group, samples can be grouped according to data similarity without a given classification. The algorithm uses the Euclidean distance to measure the distance from the sample to the cluster center and uses the error sum of squares, SSE, as an objective function to measure the effect of clustering. The classification result with the smallest SSE is selected as the final result.
The Euclidean distance is given by Eq. (4):
\[ d(x_i, x_j) = \sqrt{\sum_{k=1}^{n} \left( x_{ik} - x_{jk} \right)^2} \tag{4} \]
where $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ and $x_j = (x_{j1}, x_{j2}, \ldots, x_{jn})$ are the feature vectors of samples $i$ and $j$. The error sum of squares is given by Eq. (5):
\[ \mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in E_i} \lVert x - e_i \rVert^2 \tag{5} \]
where $K$ is the number of clusters, $E_i$ denotes the ith cluster, $e_i$ denotes the center of the ith cluster, and $x$ denotes a sample in that cluster. The optimal number of categories was determined to be three by testing three to five cluster centers on the samples and using the silhouette coefficient to compare clustering effectiveness over iterative calculations. Furthermore, the classification ranges of stripper, middle production, and prolific wells were determined based on the classification results (Table 5). The yield distributions of the wells are shown in Fig. 7.
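The procedure of trying three to five cluster centers and comparing silhouette coefficients can be sketched as follows. This is a minimal k-means with Euclidean distance and SSE as the objective; to keep the example reproducible it uses deterministic farthest-point seeding, which is an assumption of this sketch and not necessarily how the study initialized its clusters. The sample values are hypothetical one-dimensional index values.

```python
import math

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then add the farthest one."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def assign(points, centers):
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

def kmeans(points, k, iters=100):
    centers = farthest_point_init(points, k)
    for _ in range(iters):
        labels = assign(points, centers)
        updated = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centers[c])
        if updated == centers:      # assignments are stable
            break
        centers = updated
    labels = assign(points, centers)
    sse = sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))
    return labels, centers, sse

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two non-empty clusters."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.get(labels[i])
        if not own:                 # singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for c, d in by_cluster.items() if c != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)

# Hypothetical index values forming low, middle, and high production groups.
points = [[v] for v in (0.1, 0.2, 0.3, 0.25, 2.0, 2.1, 2.2, 1.9, 6.0, 6.3, 5.8, 6.1)]
best_k = max((3, 4, 5), key=lambda k: silhouette(points, kmeans(points, k)[0]))
labels, centers, sse = kmeans(points, best_k)
```

SSE alone always decreases as more clusters are added, which is why a separate criterion such as the silhouette coefficient is needed to choose the number of clusters.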

Regression of wells with different productivities
After grouping the data using the clustering algorithm, regression models were fitted to each group. Compared with the results of the overall regression before clustering (Table 4), the accuracy of the regression models improved substantially after clustering, as shown in Fig. 8 (a)-(c) to Fig. 10 (a)-(c).
The best models for the various well types are determined by comparing the errors of the different models, as listed in Table 6. The regression results of the three models were similar for the middle and prolific production wells. Moreover, the LightGBM regression yielded the most favorable results for the stripper wells, achieving an R² value of 0.66.

Algorithm principle
According to the results of previous studies, it is nearly impossible to solve the problem directly based on the exploration of oil and gas combined with the field production demand. Therefore, a machine learning model was introduced for engineering design optimization. The genetic algorithm guides a pseudorandom population to reach a local minimum position. Initially, each member of the pseudorandom population is a potential solution to the problem. Subsequently, following a few iterations, the genetic algorithm guides the population to the best-fit position [33][34][35][36]. Under fixed values of the physical parameters of the reservoir and mechanical parameters of the rock, the maximum number of iterations was set to 100, the initial population number was 50, and the fracturing parameters were optimized with the boundary value of the fracturing parameters as the constraint condition. Thus, the optimization model can be expressed as shown in Eq. (6).

Table 5
Cluster well meter gas recovery index distribution.

Because no explicit objective function is available for this problem, the trained LightGBM regression model for the low-producing wells serves as the objective function during optimization, and the maximum value of each construction parameter column is set as the constraint condition for the subsequent analysis.
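The optimization loop can be sketched as a small real-coded genetic algorithm that maximizes a surrogate objective within box constraints, using the population size of 50 and the limit of 100 iterations stated above. Everything else is an assumption made for illustration: the selection, crossover, and mutation operators, the parameter bounds, and above all the `surrogate` function, which is a hypothetical stand-in for the trained regression model and has no physical meaning.

```python
import random

def genetic_maximize(objective, bounds, pop_size=50, generations=100,
                     mutation_rate=0.2, seed=0):
    """Maximize objective(x) subject to lo <= x[i] <= hi for each (lo, hi) in bounds."""
    rng = random.Random(seed)

    def clip(x):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=objective, reverse=True)
        parents = ranked[: pop_size // 2]      # truncation selection
        children = [ranked[0]]                 # elitism: carry the best forward
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            w = rng.random()
            child = [w * x + (1 - w) * y for x, y in zip(a, b)]   # blend crossover
            child = [v + rng.gauss(0, 0.1 * (hi - lo)) if rng.random() < mutation_rate else v
                     for v, (lo, hi) in zip(child, bounds)]       # Gaussian mutation
            children.append(clip(child))
        population = children
    best = max(population, key=objective)
    return best, objective(best)

def surrogate(x):
    """Hypothetical stand-in for the trained productivity model (illustration only)."""
    target = (300.0, 40.0, 8.0, 3.0)
    scale = (100.0, 10.0, 2.0, 1.0)
    return -sum(((v - t) / s) ** 2 for v, t, s in zip(x, target, scale))

# Assumed bounds for fluid volume, proppant volume, liquid nitrogen volume, pumping rate.
bounds = [(100.0, 500.0), (10.0, 60.0), (2.0, 12.0), (1.0, 5.0)]
best_x, best_score = genetic_maximize(surrogate, bounds)
```

In the setting described here, `surrogate` would be replaced by a call to the fitted regression model with the reservoir and rock parameters of a given well held fixed, so that only the four construction parameters vary.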

Fracturing parameter optimization application
An optimization model based on the genetic algorithm and the LightGBM regression model for low-producing wells was established and applied to the 227 stripper well samples. The optimization results demonstrate that all the wells converged before the maximum number of iterations was reached. Productivity improved in 182 of the wells. The improvements obtained are listed in Table 7 and Fig. 12.
In terms of the construction parameters, the overall proppant volume, flowback fluid volume, and construction displacement did not change significantly after model optimization. However, the average fracturing fluid and liquid nitrogen usage per well decreased by 8.6% and 6.3%, respectively, which can be used as a reference for field cost control. Because many uncontrollable factors affect flowback, and because formation water is included when the flowback liquid volume is calculated, the statistical results deviate seriously from the actual ones; therefore, the model was not suitable for flowback optimization. Additionally, the vertical wells involve reservoir thicknesses ranging from 3 to 19 m. Therefore, the optimization results were further converted to values per unit reservoir thickness to better reflect the optimized parameters. On this basis, the proposed approach increased the proppant and liquid nitrogen usage per unit reservoir thickness by 7.3% and 7.1%, respectively, compared with the current protocol. The changes in the construction parameters before and after optimization are shown in Table 8 and Figs. 13-19.

Conclusions
Several machine learning models were used to process and analyze oil and gas data with missing values. The main conclusions of this study are as follows.
(1) In the dataset analyzed, the multiple imputation method outperforms the other techniques. It preserves the relationships between variables and accurately simulates the distribution of the missing data. However, owing to the constraints of the data source, certain important variables identified from field experience were excluded during statistical analysis because of their lack of diversity. Subsequent studies should emphasize broadening the sample collection and enhancing the quality of the source data to ensure the study's adaptability.
(2) Theoretically, the block achieved an initial stimulation gain of 22.04% after the fracturing parameters of the stripper wells were optimized. Reasonable fracturing design parameters can be developed according to the static parameters of each well, and the capacity can be predicted before construction. Hence, unreasonable development schemes can be adjusted in a timely manner. This workflow can improve economic efficiency while reducing risks in the field production process, which makes its application valuable.
(3) The prediction accuracy of the model for middle- and high-production wells was as high as 85%, and the prediction accuracy for stripper wells reached 70%, which could be further enhanced. In addition, the unavoidable presence of formation water in onsite flowback analysis suggests that the results could be further refined with improved analytical techniques.

Symbol comment
$J$ is the oil well productivity index (m³/(MPa⋅m)), $Q_f$ is the oil well production (t/d), $\Delta p_f$ is the oil well production differential pressure (MPa), $q_g$ is the daily gas production of the well (10⁴ m³/d), $p_{cf1}$ is the maximum casing pressure of the day (MPa), $p_{cf2}$ is the minimum casing pressure produced in the same day (MPa), $h$ is the effective reservoir thickness (m), and $C(X)$ indicates the construction parameters.

Fig. 1. Location of the research area (from Google Maps).

Fig. 2. Missing dataset for the Su A area.

Fig. 12. Comparison of productivity before and after optimization.

Fig. 13. Comparison of the fracturing fluid usage before and after optimization.

Fig. 14. Comparison of the fracturing fluid usage before and after optimization (unit reservoir).

Fig. 15. Comparison of the proppant usage before and after optimization.

Fig. 16. Comparison of the proppant usage before and after optimization (unit reservoir).

Fig. 17. Comparison of the liquid nitrogen usage before and after optimization.

Fig. 18. Comparison of the liquid nitrogen usage before and after optimization (unit reservoir).

Fig. 19. Comparison of the construction displacement before and after optimization.

H. Yang et al.

Table 1
Factors affecting the productivity of the vertical wells.

Table 2
Results obtained with different data-filling methods.

Table 3
Weight values and rankings of each method.

Table 6
Comparison of the model error statistics after clustering.

Table 7
Optimization model effect statistics.

Table 8
Comparison of the average values of the construction parameters before and after optimization.