Hybrid physics-machine learning models for predicting rate of penetration in the Halahatang oil field, Tarim Basin

Jiao, Shengjie; Li, Wei; Li, Zhuolun; Gai, Jingming; Zou, Linhao; Su, Yinao

doi:10.1038/s41598-024-56640-y

Download PDF

Article
Open access
Published: 12 March 2024

Hybrid physics-machine learning models for predicting rate of penetration in the Halahatang oil field, Tarim Basin

Shengjie Jiao¹,
Wei Li^1,2,
Zhuolun Li¹,
Jingming Gai¹,
Linhao Zou¹ &
…
Yinao Su^1,2

Scientific Reports volume 14, Article number: 5957 (2024) Cite this article

532 Accesses
1 Altmetric
Metrics details

Subjects

Abstract

Rate of penetration (ROP) is a key factor in drilling optimization, cost reduction and drilling cycle shortening. Due to the systematicity, complexity and uncertainty of drilling operations, however, it has always been a problem to establish a highly accurate and interpretable ROP prediction model to guide and optimize drilling operations. To solve this problem in the Tarim Basin, this study proposes four categories of hybrid physics-machine learning (ML) methods for modeling. One of which is residual modeling, in which an ML model learns to predict errors or residuals, via a physical model; the second is integrated coupling, in which the output of the physical model is used as an input to the ML model; the third is simple average, in which predictions from both the physical model and the ML model are combined; and the last is bootstrap aggregating (bagging), which follows the idea of ensemble learning to combine different physical models’ advantages. A total of 5655 real data points from the Halahatang oil field were used to test the performance of the various models. The results showed that the residual modeling model, with an R² of 0.9936, had the best performance, followed by the simple average model and bagging with R² values of 0.9394 and 0.5998, respectively. From the view of prediction accuracy, and model interpretability, the hybrid physics-ML model with residual modeling is the optimal method for ROP prediction.

Global prediction of extreme floods in ungauged watersheds

Article Open access 20 March 2024

Accurate medium-range global weather forecasting with 3D neural networks

Article Open access 05 July 2023

Synthetic Lagrangian turbulence by generative diffusion models

Article Open access 17 April 2024

Introduction

Drilling is a costly, high-impact, mission-critical operation, and the ROP is a key indicator of drilling efficiency. A higher ROP indicates faster drilling, increased rig productivity, and better rig performance¹. To obtain a higher ROP, it is necessary to construct a model that can evaluate how drilling variables affect ROP, and a large number of scholars and researchers have comprehensively analyzed actual drilling operations and found many factors that affect ROP, such as drill bits, weight on bit (WOB), revolutions per minute (RPM). Combined with laboratory experiments and theoretical analysis, under certain assumptions, a variety of ROP models suitable for different working conditions have been proposed to describe and predict ROP. These models can be divided into two types: classical physical models and ML models. In general, obtaining an ROP model can be seen as a regression problem. The basic research method is to use the drilled drilling curve to fit a model, and the parameters to be fitted include: empirical coefficients from the physical model, and internal hyperparameters from the ML model².

Classical physical models are rigorous mathematical equations established by theoretical analysis or experimentation between drilling variables. Some classical physical models that are often mentioned in research or widely used in actual drilling including Bingham³, Eckel⁴, Young⁵, Bourgoyne and Young⁶, Soares⁷ and other classic physical models. These physical models can be broadly understood as any knowledge that expresses the effective relationship between the properties or elements of drilling objects⁸, including physical knowledge, geometric constraints, stratigraphic laws, etc. Physical models follow objective laws and establish explicit associations between inputs and outputs to help people recognize and understand the physical world in which they live⁹. The physical model can clearly describe the internal characteristics of the system, and its outstanding advantages include its rigorous theory, (relatively) stable model, and interpretable results. However, physical models also have insurmountable shortcomings:

(1)
Limitation of the understanding of drilling downhole physics. A drilling system is a complex system¹⁰ with mixed elements, multiscale coupling and multiple process intertwined; however, accurately depicting all drilling processes is still difficult to in the existing physical model, and some physical processes are still unknown. For example, in the actual drilling process, the influence of drilling fluid on ROP is complex, and the influence of drilling fluid displacement, viscosity and density on ROP is not clearly understood, resulting in the assumption and simplification of physical model modeling and thereby triggering uncertainty in ROP prediction.
(2)
Underdetermined system problems. Even if most physicals in drilling processes are clear, some parameter inversions are often underdetermined systems; that is, the number of observation equations is less than the number of unknown parameters, resulting in unstable effects of the drilling speed model. For example, in the actual drilling process, the heterogeneity of rocks and the wear of drill bits cannot be directly observed, and simple assumptions and some deductions will affect the effectiveness of the physical model.
(3)
Insufficient accuracy. Despite making a lot of efforts (theoretical and experimental), modeling the ROP as a mathematical function of some variables is not trivial because this is a highly non-linear problem². The systematic, complex and uncertain drilling conditions of the downhole drilling process and the limitations of traditional ROP modeling result in a lack of sufficiently high accuracy.

Due to the limitations of physical models and the development of ML algorithms and the improvement in computer computing power, data-driven ML methods have been widely applied to ROP prediction and have achieved good results. One of the earliest applications of ML methods for ROP prediction dates back to 1997. Bilgesu creatively designed a one-hidden layer feed-forward back-propagation network to predict ROP and demonstrated the applicability of this neural network approach¹¹. Mendes¹² also presented a methodology based on a neural network model for ROP and a neuro-genetic adaptive controller to address the problem that relationships between operational variables affecting ROP are complex and not easily modeled. In addition, with the boom in ML algorithms approximately 2010, more and more ML methods are being used for ROP prediction, including Moran¹³, Arabjamaloei¹⁴, Esmaeili¹⁵, Ning¹⁶, Zare¹⁷, Bodaghi¹⁸, Hegde¹⁹, Mantha²⁰, Hegde²¹, Anemangely²², Soares⁷, Sabah²³, Felipe², Korhan²⁴, Li²⁵, Mohammad²⁶, Gan²⁷, Hazbeh²⁸, Salaheldin²⁹, Zhang³⁰, Ren³¹, Zhang³², Brenjkar³³, Riazi³⁴, Song³⁵, Wang³⁶, Mohammad³⁷, Kaveh³⁸ and so on. Judging from the increasing number of articles published each year in recent years on the use of machine learning for ROP prediction, it can be amply demonstrated that ML methods are well suited for application in the field of ROP prediction.

The ML methods used in the more than thirty articles mentioned above are all different, and to provide a clearer picture of the specifics of ML used for ROP prediction, the above articles are categorized according to the five ML algorithm types: artificial neural network (ANN), support vector machines (SVM), random forests (RF), hybrid model and deep learning (DL) methods. For each method, an additional information is given in the field of detail, and in the “Detail” column, the models used for comparison in the articles are in parentheses, as shown in Table 1.

Table 1 Classification of ML methods employed in ROP prediction.

Full size table

Table 1 shows a history of the application of ML models in predicting ROP, and the more commonly used methods include ANN, SVM, RF and hybrid models. Moreover, recent years have also seen the emergence of approaches involving hybrid models and DL for ROP prediction. For example, in 2016, B. Mantha et al. realized the use of Step-Wise regression, neural networks (NN, KNN), support vector regression (SVR), categorical regression trees (CART) and other models for prediction²⁰. Ensemble methods such as RF and Boosting help improve accuracy and reduce errors.

In 2022, based on data-driven thinking, Zhang proposed solving the real-time optimization problem of ROP by combining attention-based Gated Recurrent Unit networks and fully connected neural networks, which are accurate and robust and can make predictions after training on the first few data streams³². Compared with the traditional data-driven model, the proposed model shows great superiority due to its subnetwork structure, gated loop unit network and attention physical.

In 2022, Gan used a mobile window strategy, extreme learning machines, and tenfold cross-validation to build an ROP model²⁷. Alternately, two steps (modeling and forecasting) are performed within a moving drilling depth window to more accurately predict the ROP.

In 2023, Zhang used RF, ANN and SVM combined with real-time workflow to predict drilling speed in real time³⁵ and optimized drilling parameters through the NSG- III algorithm through an objective function for ROP and MSE to obtain a better real-time prediction effect during drilling.

Considering the above for the application of ML methods in ROP prediction, ML models simulate the “inductive” and “speculative” processes of humans through “training” and “prediction” respectively, to model and solve typical problems. Unlike the explicit expression of mechanistic models, ML models establish implicit associations between different variables through training on data; this is often a typical “black box” model. One of the key advantages of ML models is that when physical activity is unknown, they can skip the understanding of the physical process and go directly to data-driven modeling, especially when the training data are sufficient, and high modeling accuracy can often be obtained. In addition, although ML is time-consuming in the training stage, it generally has high computational efficiency in the test application stage, which has become one of its important advantages. Despite this, ML still has some limitations, especially in the drilling process of complex processes, which often have the following problems:

(1)
Insufficient generalization. The lack of sufficient training samples is the most common problem of ML in drilling applications, and the use of limited samples to learn complex drilling processes is prone to overfitting, even if the training samples show high modeling accuracy, the test application accuracy will be greatly reduced. In particular, when the actual numerical range, variable relationship, etc. are not covered by the training sample, the prediction result is more likely to be extremely biased, that is, the typical generalization ability is insufficient.
(2)
Insufficient migration. Regionality is the essential characteristic of rocks in drilling, and the differences between different regions manifest not only as differences in different geological elements but also as differences in the relationships between various elements. As a result, it is often difficult for ML models trained in one region to migrate to other regions for application. Second, the Earth’s surface elements and their interrelationships are undergoing constant change, and human activities make them more drastic; therefore, models of different time spans in the same region are often difficult to apply. In addition, insufficient scale migration is another dilemma in drilling applications.
(3)
Insufficient interpretability. The goal of scientific research is not only to develop a usable model, but also to discover the intrinsic causal relationships and driving patterns between different variables, and use them to explain theories and hypotheses, thereby contributing to the advancement of scientific knowledge⁹. One of the outstanding problems of ML is the lack of interpretability; although it can also obtain relatively high accuracy under specific conditions, lacks the ability to explain the internal physical process.

According to the result of the analysis of the application of ROP prediction models in different articles, the physical models and the ML models have different characteristics for ROP prediction and no single ROP model is suitable for all drilling site conditions. More importantly, most of the above studies have updated physical models or fused ML models to improve accuracy, but few studies have combined physical models and ML models to improve both accuracy and interpretability. On the one hand, the construction methods of the hybrid models are various, and it is uncertain which kind of construction method has more accurate prediction results. On the other hand, there are kinds of physical and ML models in the hybrid models, and it is also unknown which combination of physical and ML models is the best. Therefore, it is hard to satisfy the high-accuracy and great interpretability requirement for ROP prediction by the previous method, and it is necessary to deepen the studies on hybrid models of ROP prediction.

To construct an ROP prediction model for the Halahatang oil field that achieves high prediction accuracy and maintains a certain degree of interpretability, this paper combines the advantages of different physical models and ML models to propose four novel hybrid physics-ML models because there is a growing consensus that solutions to complex science and engineering problems require novel methodologies that are able to integrate traditional physics-based modeling approaches with state-of-the art ML techniques. This paper starts from the operation parameters that can be conveniently controlled by a field driller: WOB, RPM and Q. On the basis of several classical physical models, combined with commonly used ML methods for data training and experiments, the two parameters are combined to construct an ROP prediction model. There are four novel kinds of hybrid modeling approaches designed to reach this goal. The results obtained from this study can be applied in drilling parameter optimization and ROP management of drilling wells to add technical and economic benefits in the future³¹.

Methodology

Procedure of the hybrid physics-ML ROP modeling

In general, no ROP prediction model can be adapted for all drilling fields. From the previous demonstration, physical models and ML models have different advantages; perhaps the hybrid physics-ML ROP prediction model which combines physical models and ML models can have better accuracy and interpretability. The hybrid ROP prediction model may be more suitable as long as it is rebuilt before it is used in new drilling fields via the hybrid ROP modeling procedure. The procedure is shown in Fig. 1.

On the top of Fig. 1, four stages, which include collecting data from drilled wells, data denoising, feature selection, division of dataset are used for data processing before modeling. In the middle of Fig. 1, the goal of this stage is to select suitable physical and ML models from among the different models through three model performance evaluation metrics. On the bottom of Fig. 1, this study puts forward four novel approaches for hybrid physics-ML modeling, which include residual modeling, integrated coupling, simple average and bagging, maybe in the future, there will be more approaches for hybrid modeling. At the end of the procedure, a suitable model for this field will be selected by comparing the RMSE, MAPE and R². However, perhaps the best hybrid model is not unitary because of the characteristics of different regions despite having the same modeling procedure.

Tarim Basin (Xinjiang) dataset

In this study, comprehensive logging data from an ultra-deep well in the Halahatang oil field of the Tarim Basin in southern Xinjiang were used. The Halahatang area is the main oil-bearing zone in the Tarim Basin, and the Ordovician-rich carbonate-rich oil and gas resources in this area, which are deep unconventional oil and gas resources, are buried at a depth of more than 7000 m (Fig. 2a–c). Due to the drilling depth generally above 7000 m, the lithology is carbonate, and the drilling period is long (Fig. 2d).

The basic data of this well is shown in Table 2, the measured data relevant to ROP are composed of formation parameters including dc exponent (DC), normal dc exponent (DCN), formation pressure gradient (${g}_{p}$), formation fracture pressure gradient (${G}_{f}$), Poisson’s ratio ($\mu $), drilling parameters containing WOB, RPM, drill fluid flow rate (Q), torque (TORQUE), drilling time (DT), hook load (HL), stand pipe pressure (SPP), well trajectory parameters including measured depth (MD), true vertical depth (TVD), downhole parameters consisting of drill-bit diameter (${d}_{b}$), drill-bit footage (${d}_{f}$), and drilling fluid parameters containing drilling fluid equivalent density (ECD).

Table 2 Dataset basics.

Full size table

Ten PDC drill bits were used in the well, two sizes, 406.4 mm and 241.3 mm. The entire well spans 10 different strata including the Quaternary, Neogene, Paleogene, Cretaceous, Jurassic, Triassic, Carboniferous, Devonian, Silurian and Ordovician strata. Due to the serious lack of data for the first drill bit and the tenth drill bit in the logging data, only the second to eighth drill bits were selected for experimentation, and the different drill bits drilled into the well section are shown in Table 3:

Table 3 Different drill bits are drilled into the well section.

Full size table

Post drilling data analysis and preprocessing from the preliminary description of the data in Table 2 reveal that due to sensor errors or measurement errors, several zero value points or mutation values need to be processed, to improve the data quality and lay a good foundation for subsequent model fitting.

Data denoising

According to the basic statistics of the dataset in Table 2, the data quality of individual variables is poor, such as the variance in Q is large, there are obvious abnormal peaks and valleys in the distribution of Q, so data noise reduction is needed to improve the quality of data. Different data denoising methods may be suitable for different datasets and three data denoising methods are selected to improve the quality of data including standard deviation, quartile deviation and Savitzky–Golay (SG).

The preprocessing of data was divided into four situations: one was the raw data (as a control group), one was to determine the abnormal data according to a distance of 3 times the standard deviation, one was to preprocess the data according to 1.5 times the quartile difference, and the other was to smooth the data according to the SG algorithm. The SG smoothing filter⁴⁰ is used to remove some of the noise in the original dataset. This method applies a polynomial function to reduce noise in data variables, replacing values identified as noise in the data records with values generated by the SG function²³. Based on the identified least squares error, the nth order polynomial function is derived from the selected drill encounter stratigraphic points.

The number of points chosen should be odd and greater than the order of the derived polynomial function. If a higher polynomial order is applied, or if the number of data records fitted within that interval decreases, the derived polynomial function preserves the data trend of the variable. However, a decrease in the order of the polynomials or an increase in the number of data records used to define a particular interval can disrupt the data trend of the variable, resulting in excessive smoothing of the data.

To obtain a good SG filtering effect, a sensitivity analysis of the SG algorithm is carried out, the smoothing effect of the SG algorithm is performed on the data, and the filtering effects of the window length and polynomial order are compared; these two aspects are considered. One is to try to maintain a smoothing effect, that is, to use the standard deviation of the field data and the smoothed data to ensure that the dispersion degree of the data is better, and the other is to compare R² to ensure that the smoothed data are as good as possible and that the field data are consistent. The window length was set from 3 to 29 for experimental comparison:

As shown in Fig. 3, the best filtering parameters are 17 and 1 because R² is close to 1 and the data standard deviation is relatively small, resulting in better data noise reduction. Preprocessing via the SG algorithm, clearly revealed that the data quality is better, and some abnormal points are removed more smoothly, which is more in line with the actual drilling situation.

From Figs. 4, 5, 6, and 7, it can be seen that denoising is necessary to ensure that the impact of various abnormal situations can be removed and the data more in line with the real situation can be obtained. For example, in Fig. 6, Q has a dozen exceptions that were filtered by the SG algorithm so that the overall trend of the Q data is more consistent with the actual drilling conditions.

Data spilt

According to the drilling field, different drilling bits are suitable for different well sections and formations. To avoid the influence of different kinds of bits, the drilling data from the drilled well were divided into nine datasets, namely, the whole well dataset, first bit dataset, second bit dataset, third bit dataset, fourth bit dataset, fifth bit dataset, sixth bit dataset, seventh bit dataset and eighth bit dataset according to the type of bit used; the specific segments are shown in Table 3.

Whether through the regression fitting of physical models or during the training process of ML models, how to divide the training dataset and test dataset has a great impact on the prediction effect. Moreover, during the drilling process in the drilling field, the logging data are acquired segment by segment, and the random spilt method perhaps can’t be meet. Therefore, in this study, nine datasets all were spilt with two kinds of method including randomly divided and divided according to the ration before and after.

Feature selection

Table 2 shows that there are approximately 18 kinds of drilling variables, which may be related to the ROP. However, some drilling variables need not be taken into account because they may not be measured directly but may be calculated twice by some of the measured variables, such as DC and DCN. In addition, some variables, such as MD, TVD and Bit Depth, have the same meaning and effect and simply consider MD.

According to the mechanistic analysis, 14 variables were retained, namely, MD, HL, RPM, DT, ROP, WOB, TORQUE, SPP, Q, ECD, ${d}_{b}$, $\mu $, ${G}_{f}$ and ${g}_{p}$. However, to quantitatively analyze the relationship between each variable and the ROP, correlation analysis was carried out. Common methods for correlation analysis include Pearson, Spearman and Kendall correlation analyses, which quantify correlations by correlation coefficient. However, in statistics, the Pearson correlation coefficient (PCC) is a correlation coefficient that measures linear correlation between two sets of data⁴¹; Spearman’s rank correlation coefficient is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables)⁴²; and the Kendall rank correlation coefficient is a statistic used to measure the ordinal association between two measured quantities⁴³. The three correlation analysis methods generally need to follow a linear or normal distribution or order, and are not suitable for variable correlation related to ROP.

The maximal information coefficient (MIC)⁴⁴, which was proposed in 2011, has two excellent properties: generality and equitability. In other words, the MIC can detect various relationships including linear, non-linear, functional and non-functional relationships. The MIC values of the different types of relationships are similar at the same noise level⁴⁵. Therefore, in this study, the MIC was selected as the correlation analysis method. The correlation heat map obtained from the MIC is shown in Fig. 8. Variables with high correlation can be used as references for feature selection and modeling.

Physical models

In 1965, Bingham proposed an early R–W–N form of an ROP prediction model considering the influence of drill bit diameter. An empirical coefficient b is added to the ratio of the drilling pressure to the diameter of the drill bit, and a comprehensive drilling coefficient a is also included, which is calculated and fitted using the field data of each formation³. The specific model is as follows:

$${\text{ROP}}={\text{a}}{\left(\frac{WOB}{{d}_{b}}\right)}^{b}RPM$$

(1)

where a and b are dimensionless coefficients of experience corresponding to each formation, WOB is the weight on the drilling bits (KN), d_b is the diameter of the drill bit (mm), and RPM is the bit revolution per minute (rev/min). Although both model coefficients (a and b) were determined for the entire rock formation, they could not account for changes in the physical behavior of the well within different operating parameter areas⁷.

In 1967, Eckel studied the influence of the drilling fluid flow rate and hydraulic parameters on drilling speed through the indoor laboratory of micro drill bits, and proposed a drilling speed prediction equation that includes the drilling fluid flow rate on the basis of the drilling speed equation proposed by traditional experiments⁴.

$${\text{ROP}}={\text{K}}*{WOB}^{a}{RPM}^{b}{\left(\frac{kQ\rho }{d\mu }\right)}^{c}$$

(2)

$${\text{ROP}}={\text{K}}*{WOB}^{a}{RPM}^{b}$$

(3)

where K is the comprehensive dimensionless coefficient of drilling; a, b and c are dimensionless coefficients; WOB is the drilling pressure; RPM is the turntable speed; Q is the drilling fluid flow rate (L/s); $\uprho $ is the specific gravity of the drilling fluid (dimensionless density); d is the nozzle diameter; $\upmu $ is the drilling fluid viscosity; and k is the dimensionless coefficient fitted by the microdrill laboratory and field data.

Although Eq. (2) describes the relationship between the properties of the drilling fluid and the ROP to a certain extent, in actual drilling operations, the density and viscosity of the drilling fluid are designed in advance and generally do not change during the actual drilling process. The operating parameters that are convenient for driller adjustment and control on site are the WOB, RPM and Q. Considering the actual working conditions, Eckel’s drilling speed equation is simplified, and a simplified Eckel drilling speed prediction equation is proposed, as shown in Eq. (3).

In 1974, Bourgoyne and Young⁶ proposed a broad ROP model that included all common drilling parameters. Subsequently, in 1986, the newly proposed modified Bourgoyne and Young formula eliminated normalization factors and constant parameters⁴⁶, simplifying the model and making it dependent on the measurement of core variables in real-time drilling optimization:

$$\frac{dD}{dt}=Exp\left({a}_{1}+{\sum }_{j=2}^{8}{a}_{j}{x}_{j}\right)$$

(4)

$${\text{ROP}}=\left({f}_{1}\right)\left({f}_{2}\right)\left({f}_{3}\right)\left({f}_{4}\right)\left({f}_{5}\right)\left({f}_{6}\right)\left({f}_{7}\right)\left({f}_{8}\right)$$

(5)

$${f}_{1}={e}^{2.303{a}_{1}}=K$$

$${f}_{2}={e}^{2.303{a}_{2}(10000-D)}$$

$${f}_{3}={e}^{2.303{a}_{3}{D}^{0.69}\left({g}_{p}-9.0\right)}$$

$${f}_{4}={e}^{2.303{a}_{4}D({g}_{p}-{\rho }_{c})}$$

$${f}_{5}={\left(\frac{\frac{WOB}{{d}_{b}}-{(\frac{WOB}{{d}_{b}})}_{t}}{4.0-{(\frac{WOB}{{d}_{b}})}_{t}}\right)}^{{a}_{5}}$$

$${f}_{6}={(\frac{RPM}{100})}^{{a}_{6}}$$

$${f}_{7}={e}^{-{a}_{7}h}$$

$${f}_{8}={(\frac{{F}_{j}}{1000})}^{{a}_{8}}$$

(6)

Although Bourgoyne and Young’s mechanical ROP model provides a comprehensive description of the drilling process, many of the parameters used in the model are difficult or impossible to measure in real time with prior art and must be approximated, such as the pore pressure gradient and drill bit wear. In addition, the model relies on normalization constants for drill bit depth and the WOB, RPM and Q terms, first derived from the 1970s. Nascimento⁴⁷ and Kutas⁴⁸ exposed this problem by reporting Bourgoyne and Young model applications by different authors with different normalization factors and proposing new values.

In 2019, Cesar Soares proposed a new drilling speed model⁷ based on real-time drilling speed prediction and the problems of Bourgoyne and Young. The validity of the coefficients can be guaranteed only if the model predicts the exact same bit, drilling fluid, formation and mechanical drilling speed under similar operating conditions. Therefore, in a certain range of drill bits or mud, the same coefficient values are not applicable, and one model coefficient can absorb the influence of all constant parameters. The newly proposed modified Bourgoyne and Young formula eliminates the normalization factor and constant parameters, simplifying the model so that it relies on the measurement of core variables in real-time drilling optimization.

$${\text{ROP}}={a}_{1}{D}^{{a}_{2}}{WOB}^{{a}^{5}}{RPM}^{{a}^{6}}{q}^{{a}_{8}}$$

(7)

ML models

Artificial neural network

In 1997, Bilgesu introduced neural networks to train a new ROP prediction model, which was well applied in real situations¹¹. The structure of ANN is shown in Fig. 9.

Support vector machine

In 2015, Bodaghi proposed an “optimized support vector machine regression” method for ROP prediction¹⁸. Optimizing the parameters of the support vector machine by optimizing the parameters of the cuckoo search algorithm has a higher ROP prediction accuracy than the genetic algorithm and the pattern search and grid search algorithms. The largest edge hyperplane is based entirely on the data point located at the edge. These two points, equidistant from the hyperplane, are called support vectors. Figure 10 shows an ideal example of detecting drilling speed anomalies. Given the additional drilling parameters and actual site conditions, it is unreasonable to expect linearly separable data to distinguish between efficient drilling and high vibration, inadequate hole clearing, or drill bit ladle operation areas. Support vector machines overcome this problem by allowing some data points to violate margins.

Random forest

In 2015, Hegde used limited ground measurements to predict ROP using Trees, Bagging, and Random Forest, respectively, and random forests provided the best accuracy for the data used; therefore, real-time, closed-loop applications were used¹⁹. Different regression algorithms, including Least Squares Regression, Ridge and Lasso Regression, and Principal Components Regression, have also been used to predict ROP incidence. The random forest algorithm is a Bagging ensemble algorithm based on decision trees⁴⁹. The structure of the random forest algorithm is shown in Fig. 11. The calculation relies mainly on the construction of multiple decision trees, and the average value of each decision tree is taken as the final prediction result. The original sample sampling of each decision tree is random, and the sampling processes are independent of each other. The specific training process of the random forest regression algorithm is shown in Fig. 11.

(1)
Randomly put back the sampling to construct the original samples X and Y of the decision tree, and construct the root node of the decision tree.
(2)
Calculate the feature number m of the original sample X and the size n of the training set.
(3)
If m < 1, the training is over. If m ≥ 1 iterates through all the values of m features on the training set, each value is used as the segmentation point, the impurity of different features of each segmentation point is calculated, and the feature with the smallest impurity is the segmentation feature of the segmentation point. then M-1; Repeat this step until the end of training. The prediction output is the average of all sample values for the current node sample set.
(4)
Repeat (1) (2) (3) until all decision trees have been trained. Output averages for all decision trees.

Model performance evaluation metrics

Whether it is a classical physical model or an ML model, it is essentially a fitting of the real drilling law, and the effect of its fitting needs to be evaluated and measured by certain indicators. Common model performance evaluation indicators are selected: the root mean square error (RMSE), average absolute percentage error (MAPE) and coefficient of determination (R²). The RMSE is the square root of the square of the deviation between the predicted value and the true value and the ratio of n observations; this metric measures the deviation between the predicted value and the true value, and is sensitive to outliers in the data.

$${\text{RMSE}}=\sqrt{\frac{1}{{\text{N}}}\sum_{{\text{i}}=1}^{{\text{n}}}{\left({{\text{ROP}}}_{{\text{Field}},\mathrm{ i}}-{{\text{ROP}}}_{{\text{Model}},\mathrm{ i}}\right)}^{2}}$$

(8)

where N is the number of samples, ${{\text{ROP}}}_{{\text{Field}},\mathrm{ i}}$ represents the actual drilling speed at the site and ${{\text{ROP}}}_{{\text{Model}},\mathrm{ i}}$ represents the drilling speed predicted by the drilling speed model.

The MAPE, which measures the relative error of a predicted value to a true value, is one of the most popular metrics for evaluating forecast performance, as defined below:

$${\text{MAPE}}=\frac{1}{{\text{N}}}\sum_{{\text{i}}=1}^{{\text{n}}}\frac{|{{\text{ROP}}}_{{\text{Field}},\mathrm{ i}}-{{\text{ROP}}}_{{\text{Model}},\mathrm{ i}}|}{{{\text{ROP}}}_{{\text{Field}},\mathrm{ i}}}\times 100\mathrm{\%}$$

(9)

The R² evaluates the fitting performance of the regression model. When R² is close to 1, the model has good prediction accuracy.

$${R}^{2}=1-\frac{\sum_{{\text{i}}=1}^{{\text{n}}}{\left({{\text{ROP}}}_{{\text{Field}},\mathrm{ i}}-{{\text{ROP}}}_{{\text{Model}},\mathrm{ i}}\right)}^{2}}{\sum_{{\text{i}}=1}^{{\text{n}}}{\left({{\text{ROP}}}_{{\text{Model}},\mathrm{ i}}-\frac{1}{{\text{N}}}\sum_{{\text{i}}=1}^{{\text{n}}}{{\text{ROP}}}_{{\text{Field}},\mathrm{ i}}\right)}^{2}}$$

(10)

Hyperparameter selection in ML models

The hyperparameters control the structure of ML models and determine their performances, which are parameters that are not directly learnt within training process. There are no definite rules for hyperparameter selection, as optimal model structure varies by application. Each ML model has multiple different hyperparameters. For instance, the hyperparameters of the RF include the number of trees in the forest (n_estimators), the function to measure the quality of a split, the number of features to consider when looking for the best split (max_features) and the minimum number of samples required to be at a leaf node (min_samples_leaf). The number of trees in the forest and the minimum number of samples required to be at a leaf node are closely related to the training accuracy of the RF model. The more n_estimators and min_samples_leaf, the more accurate the prediction performance is; however, the training costs of the RF increase. Therefore, it is necessary to balance the relationship between training accuracy and training costs. Researchers typically define a grid and search for the best hyperparameter combinations with cross-validation. The same methodology was applied in this study, and it was implemented using the sklearn.model_selection.GridSearchCV function of Python’s scikit-learn⁵⁰.

After referring to the hyperparameters in Sepideh⁵¹ and several cross-validations, the optimal hyperparameters of those ML models selected are shown in Table 4. To provide a more specific explanation of the process of hyperparameter selection, the optimization process of two RF hyperparameters, n_estimators and max_features, is illustrated in Fig. 12. As shown in Fig. 12, the mean_test_score of max_features = None(1.0) is the highest among the three max_features, and the mean_test_score is the average test core obtained from fivefold cross validation so that overfitting can be avoid in the process of hyperparameter selection. Among the n_estimators values, n_estimators = 14 had the highest mean_test_score, and max_features = None (1.0) and n_estimators = 140 were ultimately selected as the optimized hyperparameters. In this way, two hyperparameters are selected, and the hyperparameters of the other ML models are also determined in accordance with this process.

Table 4 Hyperparameters of ML models.

Full size table

Hybrid physics-ML models

The purpose of ROP prediction is to better guide the optimization of drilling parameters, and three physical data models including WOB, RPM and Q are selected from the actual application situation on site, namely, the physical model proposed by Bingham in 1965 (Eq. 1), the modified physical model proposed by Eckel in 1967 (Eq. 3) and the physical model proposed by Soares in 2019 (Eq. 7). With a certain amount of logging data, the optimize method in Python’s SciPy library is used for multivariate nonlinear fitting to obtain the empirical coefficients of the three physical models.

Three common ML algorithms were selected—ANN, SVM, and RF as the learning model. This study proposes four approaches of hybrid physics-ML modeling.

Residual modeling

The residual approach may be the oldest and most common approach for directly addressing the imperfection of physics-based models in the scientific community; an ML model (usually linear regression) learns to predict the errors or residuals made by a physics-based model⁵². The structure of the residual model is shown in Fig. 13. First, drilling data are used to train the physics model with a regression method to obtain the empirical coefficients of the physics model. Then, the physics model can obtain the ${ROP}_{phy}$ when the drilling data are the input of the physics model. Finaly, the drilling data and ${ROP}_{phy}$ are used as the inputs of the ML model which is subsequently trained on a hybrid residual model.

Integrated coupling

Different physical ROP models have different emphases, and the ML model can combine their different advantages when the output of the physical model is used as input to the ML model. A detailed description of this approach of hybrid modeling is shown in Fig. 14. First, the drilling data are used for three physical model regression fitting. Then, the output from Bingham model, Eckel model and Soares model are recognized as inputs to the ML model to train the ML model, which may yield better prediction than individual physics models.

Simple average

Physical models have better generalization performance and interpretability than ML models, but most of ML models have better accuracy when enough good data are available for training ML models. When predictions from both physics and the ML model are combined, the prediction of the ROP may improve. In this study, the simple average is the combined approach, which is shown in Fig. 15. First, the physics model and ML model are regression fit and trained alone to obtain predictions from both the physics model and the ML model. Then, the average of ${ROP}_{phy}$ and ${ROP}_{ML}$ is recognized as the final prediction of the ROP from the hybrid model.

Bagging

Ensemble learning works by building and combining multiple learners to accomplish learning tasks; this approach is sometimes referred to as multi-classifier system or as committee-based learning⁵³. By combining multiple learners, ensemble learning often achieves significantly better generalization performance than does single learner learning by creating and selecting good and different learners. According to the methods used to generate individual learners, the current ensemble learning methods can be roughly divided into two categories: serialization methods with strong dependency between individual learners, which must be generated serially; and parallel methods with no strong dependency between individual learners, which can be generated simultaneously. The former is represented by Boosting and the latter is represented by Bagging and Random Forest. Because the physics prediction ROP models are not strongly dependent, this study selects Bagging as the ensemble learning method and different physics models are recognized as individual learners.

Bagging is a type of parallel ensemble learning⁵⁴, that is directly based on bootstrap sampling⁵⁵. The specific process is shown in Fig. 16. First, the drilling data are divided into three training sets for physics models through bootstrap sampling and every training set accounts for approximately 63.6% of the drilling data without repetition. Then, physics models can obtain the first prediction including ${ROP}_{Bingham}$, ${ROP}_{Eckel}$ and ${ROP}_{Soares}$, which are used for simple average to improve ${ROP}_{pred}$.

Results and discussion

In this study, the Bingham, Eckel, and Soares models were selected as physical models for ROP prediction. Among the ML models, ANN, SVM and RF were also selected for training and testing. Three evaluation metrics, RMSE, MAPE and R², were used to comprehensively compare the performances of these models, and the optimal physical and ML models were selected to establish hybrid models. Four categories of hybrid models, namely, residual modeling, integrated coupling, simple average and Bagging, were established and compared with the optimal physical and ML models. To further investigate the influence of different models on hybrid models, physical models and ML models were used to establish different hybrid models. In total, 34 kinds of drilling speed prediction models were used for a total of 2448 groups of experiments.

Comparison of different physical models

In this section, the performances of three physical models, namely, the Bingham, Eckel and Soares models, are compared in detail, and the optimal physical model is selected. First, the coefficients in these physical models must be obtained. The Trust Region Reflective (TRF) algorithm, is motivated by the process of solving a system of equations that constitute the first-order optimality condition for a bound-constrained minimization problem, as formulated in STIR⁵⁶. This algorithm is particularly suitable for large sparse problems with bounds and works quite robustly in unbounded and bounded problems; thus, it is chosen as the selected algorithm for solving the non-linear least squares problem to obtain the empirical coefficients contained in the physical models.

The performances of these three ROP prediction models are compared, and the results are shown in Figs. 17 through 18. To better reflect that the empirical coefficients obtained by the TRF algorithm can accurately predict ROP; the training set and test set are randomly divided at a ratio of 4:1, and the comparison between the ROP predicted by the three physical models and the original ROP is shown in Fig. 17. Figure 17 shows that the prediction results of the named Soares model fit the actual ROP curve more closely, confirming that this model is more suitable for this field. The R² of the amended Soares model (0.7900) is larger than that of the other physical models, and its RMSE (9.0874) and MAPE (0.4190) are both smaller than those of the other physical models, indicating that the prediction accuracy of the amended Soares model is significantly better (Fig. 18 and Table 5). Therefore, this model is selected as the physical part of the hybrid part of the hybrid model for ROP prediction.

Table 5 Evaluation indices of the physical model results.

Full size table

Comparison of different ML models

In practice, the ROP performances of different ML models are quite different for different regions, and there is no a unified model suitable for all situations. In this section, the performances of three ML algorithms, ANN, SVM and RF, were compared, and the optimal ML algorithm for the Tarim Basin was selected.

The training results of the ML models are shown in Figs. 19 and 20. Due to its powerful learning capability, the prediction result of the RF model matches the original value well as shown in Fig. 19. Moreover, the training curve of the SVM also achieved a good fit. The R² of the SVM prediction result is 0.7929, and the R² of the RF prediction result is 0.9938 (Fig. 20 and Table 6).

Table 6 Evaluation indexes of ML model results.

Full size table

What’s more, the RMSE and MAPE values of RF are also better than those of ANN and SVM; thus, the RF model outperforms than the other two models, and it is selected as the ML part of the hybrid model.

As shown in Fig. 19 and Table 6, the RF model has a high level of ROP prediction accuracy. However, it is well known that ML models are prone to overfitting, and to avoid overfitting, this study adopts the k-fold cross-validation (CV) method to avoid overfitting and ensure generalization. With respect to the tenfold CV used in this study, the entire dataset was split into 10 smaller sets. The following procedure is followed for each of the k “folds”, as shown in Fig. 21:

A model is trained using k − 1 of the folds as training data;

The result model is validated on the remaining part of the data (it is used as a test to compute a performance measure such as accuracy).

The performance measure reported by k-fold CV is then the average of the values computed in the loop. It was repeated ten times until each fold was used as the validation set. A tenfold CV was performed for all three ML models, and the specific results are shown in Fig. 22.

The performance of the ML models during ten-fold CV was assessed by calculating the train score and test score, and the score represents the R². According to the similar score of each fold in Fig. 22, these models can obtain similar results on different validation sets. These models are verified to have good generalization.

Furthermore, as shown in Fig. 22, there was not much difference between the train score and the test score for each cross-validation, especially for RF. It was verified that these models were not overfitted. The fluctuations in the train score and test score in the SVM and ANN models may be caused by the two models being more sensitive to fluctuations in the data. There may be different distributions between the test data and the training data, and the model may be better adapted to the distribution of the test data, resulting in higher test accuracy.

Comparison of physical, ML, and hybrid models

The physical model with the highest prediction accuracy (Soares model) and the ML model with the best performance (RF model) were selected to construct two categories of hybrid models, including the hybrid model with residual modeling and the hybrid model with simple average. The hybrid model with integrated coupling is composed of the Bingham, Eckel, Soares and RF models. In the end, the hybrid model with Bagging included the Bingham, Eckel and Soares models.

The prediction results of the physical, ML, and hybrid models are shown in Figs. 23 and 24. In Fig. 23, the sign of Origin represents the ROP obtained from the drilling field, the sign of Soares represents the Soares model’s prediction, the sign of First represents the residual modeling’s prediction, the sign of Second represents the integrated coupling model’s prediction, the sign of Third represents the simple average model’s prediction, and the sign of Fourth represents the Bagging model’s prediction.

As shown in Fig. 23, Soares, Second and Fourth cannot obtain very good accurate results compared with the other models. However, when RF error compensation was introduced, the prediction results of the first hybrid model fit the actual ROP curve more closely in most well sections. According to Fig. 24 and Table 7, the prediction accuracy of the first hybrid model (R² = 0.9936) is better than that of the RF model (R² = 0.9934). Moreover, the prediction accuracy is also greatly improved compared with that of the Soares model. The RMSE and MAPE also exhibited better performance than did the other five models.

Table 7 Evaluation indices of physical, ML, and hybrid models.

Full size table

Comparison of different hybrid models

To further study the influence of different physical and ML models on the performance of the first hybrid model, the two best physical models (the Soares and Eckel model) and the two best ML models (SVM and RF) were selected for cross-mixing. The first model was established with the Eckel model as the ROP prediction model and the SVM as the error compensation model, and the second model was established with the Eckel model as the prediction model and the RF as the error compensation model. The third model takes the Soares model as prediction model combined with the SVM as the error compensation model, and the last model takes the Soares model as the prediction model and uses the RF as the error compensation model.

The prediction results are shown in Figs. 25 and 26. As shown in Fig. 25, all four models achieved outstanding performances, thus verifying the effectiveness of the first hybrid model. The best ROP prediction model was selected by comparing the performance indices of all the models. The R² values of the different models are close, but the Fourth model is slightly better than the other models. Moreover, the RMSE and MAPE values of the Fourth model are significantly lower than those of the other three models (Fig. 25 and Table 8), indicating that the prediction accuracies of the Fourth model are higher.

Table 8 Evaluation indexed of different hybrid models.

Full size table

Based on the above analyses, it can be concluded that the best hybrid model with high prediction accuracy is obtained only when the optimal physical and ML models are selected, which can provide important guidance for future ROP prediction modeling.

Comparison of the effects of different dataset sizes on models

In the drilling field, the ROP will change with rounding trip operations or when connections are made; consequently, the ROP prediction model may not reach enough accuracy when the dataset falls short of more than 5000 records in the above experiment. To study the effects of different dataset sizes on ROP models, an additional eight sets of tests were added to compare the full interval data according to the drilling depth of each bit (Table 3). The prediction results are shown in Figs. 26 and 27.

As shown in Fig. 27, four different kinds of ROP models including the Soares model, the RF model, the First hybrid model and the Second hybrid model, were selected for comparison. According to Figs. 27, 28, and 29, the RF model and the First hybrid model, which are suitable for ROP prediction in the field, are less sensitive to changes in dataset size, but the accuracy of the Soares and the Second model is strongly affected by the dataset size. It is also proven that the First hybrid model has better generalization performance and accuracy and is more suitable for field application.

Comparison of the effects of different data denoising methods on the model

In the experiments in the above section, most of them used the whole well data after noise reduction by the SG algorithm to conduct the experiments. To compare the advantages and disadvantages of the SG algorithm and the effects of data denoising methods on the model, common standard deviation and quartile deviation were also used to conduct control experiments with the original data. The prediction results are shown in Fig. 30.

As shown in Fig. 30, from the overall trend, in the case of the same model prediction, the R2 is the highest, and the RMSE and MAPE are the lowest after SG noise reduction, except that the Second hybrid model is not suitable for the SG algorithm. In other words, the ROP prediction model should be retrained before it is used for new drilling areas, not only for physical models, ML models and hybrid models but also the data denoising methods and dataset sizes.

Comparison of the effects of different data splitting methods on the model

Generally, in the process of regression fitting or ML model training, randomly selected samples are used as training sets to ensure the representativeness of the samples and to ensure that all sample spaces can be included in the training set as much as possible to improve the accuracy and generalizability of the model. As shown in Fig. 31, both the training set and the test set were divided at 4:1, but one was divided according to the front-to-back ratio, and the other was divided randomly. All four models achieved higher accuracy in the case of random division than in the case of front-to-back ratio division.

In this study, a logging dataset was collected after real drilling, and the data of all the well segments were obtained. After the data noise reduction processing, the training set and the test set are randomly divided to obtain a better prediction effect. Moreover, in the real-time drilling process, the ROP prediction model can be trained only by drilled data, which include tens, hundreds or even thousands of meters of data that will be drilled in the next time. Perhaps this situation is similar to the division according to the front-to-back ratio, and cannot obtain a better drilling rate prediction effect like random division. How to ensure that the ROP prediction model can accurately predict the ROP of the adjacent drilling interval on the basis of the drilled interval data may be the focus of further research.

Further discussion on hybrid models for ROP prediction

Judging from the above accuracy of the ROP predictions of the different models, the hybrid model has advantages in terms of prediction accuracy and interpretability, but it also has several limitations, such as the computational complexity of the hybrid models being greater than that of the single model. The computational complexity of hybrid models can be divided into p steps⁵⁷, each corresponding to the training of a model (M₀, M₁, …, M_p). The evaluated hybrid models are trained sequentially, and their computational complexity can be described as M₀ + M₁ + ··· + M_p, where M_p is the computational complexity in ROP prediction. In this way, the four hybrid models are approximately two times more expensive than the single model, because the hybrid models use at least two kinds of models, including physical models and ML models. This work applied six combinations of hybrid models, Bingham, Eckel, Soares or ANN, SVM and RF models.

The three physical models were trained and fitted using the curve_fit function in scipy.optimize⁵⁸, whose core method is non-linear least squares method to fit the ROM model. The computational complexity of it depends on the size and complexity problem, as well as the number of iterations required. Therefore, in this study, it is approximately assumed that the computational complexity is O(m²n), where n are the training examples and m is the total number of features/variables. The ANN training process has a complexity of O(emnk), where e is the number of epochs and k is the number of neurons⁵⁹. The SVM training process has a complexity of O(nm), where n is the size of the dataset, and m represents the number of input features⁶⁰. The RF training process has a complexity of O(Mmnlog(n)), where M is the number of trees, m is the number of features and n is the number of data samples in the training set⁶¹.

In the previous section, through ten-fold CV experiment, it was proven that the generalization of the model is not problematic in the Halahatang oil field. However, further in-depth research needs to be conducted in additional regions to further optimize the model and study generalizability.

Conclusions

Through the above research and analysis, the following conclusions can be drawn:

1.
The Soares model considering MD and Q was proven to be the most accurate for the Tarim Basin field among all physical models such as the Bingham and Eckel models. The R² score of the Soares model is approximately 3 times greater than that of the Bingham model and 11.9% greater than that of the Eckel model.
2.
Three ML algorithms were used to create machine learning models. Compared with these machine learning models, the RF model yielded the best results, with the highest correlation coefficient (R² = 0.9934) and lowest prediction error (MAPE = 0.0501 and RMSE = 1.6072).
3.
Four hybrid models, including residual modeling, integrate couple, simple average and bagging, were established. Most hybrid models showed higher accuracy than physical models and greater interpretability than conventional ML models.
4.
The performance of the hybrid model with error compensation by ML was optimal among all the ROP prediction models. The R² score improved greatly, specifically by approximately 25.77% in comparison with that of the Soares model. Moreover, the best hybrid model can be achieved when both physical and ML models with the best performances were selected for hybrid modeling.
5.
The idea of a hybrid model with residual modeling and combining predictions from both physical and ML models are recommended for different drilling operations due to its clear physical meaning, awesome generalization capability, low modeling difficulty, and good interpretability. The RF algorithm is also recommended when using artificial intelligence because of its outstanding predictive accuracy.
6.
The RF model and First hybrid model are relatively insensitive to dataset size compared with the other ROP models.
7.
Most ROP models achieve better performance after denoising with the SG algorithm than after denoising with Stand deviation or Quartile deviation, a suitable denoising reduction method should be selected through experimentation and analysis.
8.
Dividing the training set and test set according to the ration before and back, the accuracy of the trained model is far less than that of the random division model. To employ the ROP model in real-time drilling, how to make the ROP prediction model accurately predict the ROP of the adjacent drilling interval on the basis of the drilled interval data may be the focus of further research.
9.
Further studies can focus on the optimizing drilling parameters to increase the ROP. In addition, in this study, the selected ML algorithms were mostly common ML methods, that can be combined with physical models by deep learning, transfer learning, reinforcement learning, large model and other algorithms.

Data availability

The utilized data in this study is available upon reasonable request from the corresponding author.

Abbreviations

ROP:: Rate of penetration
ML:: Machine learning
Bagging:: Bootstrap aggregating
WOB:: Weight on bit
RPM:: Revolutions per minute
ANN:: Artificial neural network
SVM:: Support vector machine
RF:: Random forests
ELM:: Extreme learning machines
DL:: Deep learning
MLP:: Multilayer perceptron
BP:: Back-propagation
IGA:: Improved genetic algorithm
DC:: Dc exponent
DCN:: Normal dc exponent
${g}_{p}$ :: Formation pressure gradient
${G}_{f}$ :: Formation fracture pressure gradient
$\mu $ :: Poisson’s ratio
Q:: Drill fluid flow rate
DT:: Drilling time
HL:: Hook load
SPP:: Stand pipe pressure
MD:: Measured depth
TVD:: True vertical depth
${d}_{b}$ :: Drill-bit diameter
${d}_{f}$ :: Drill-bit footage
ECD:: Drilling fluid equivalent density
SG:: Savitzky–Golay
PCC:: Pearson correlation coefficient
MIC:: Maximal information coefficient
Bingham:: Bingham’s model
Eckel:: Eckel’s simplified model
Soares:: Cesar Soares’s model
SVR:: Support vector regression
CART:: Categorical regression trees
RMSE:: Root mean square error
MAPE:: Mean absolute percentage error
R² :: Coefficient of determination
TRF:: Trust region reflective algorithm

References

Hegde, C., Daigle, H., Millwater, H. & Gray, K. Analysis of rate of penetration (ROP) prediction in drilling using physics-based and data-driven Models. J. Pet. Sci. Eng. 159, 295–306. https://doi.org/10.1016/j.petrol.2017.09.020 (2017).
Article CAS Google Scholar
Barbosa, L. F. F. M., Nascimento, A., Mathias, M. H. & de Carvalho Jr, J. A. Machine learning methods applied to drilling rate of penetration prediction and optimization: A review. J. Pet. Sci. Eng. 183, 106332. https://doi.org/10.1016/j.petrol.2019.106332 (2019).
Article CAS Google Scholar
Bingham, M. G. How rock properties are related to drilling. Oil Gas J. 94, 101 (1965).
Google Scholar
Eckel, J. R. Microbit studies of the effect of fluid properties and hydraulics on drilling rate. J. Pet. Technol. 19(04), 541–546. https://doi.org/10.2118/1520-PA (1967).
Article CAS Google Scholar
Young, F. S. Jr. Computerized drilling control. J. Pet. Technol. 21(4), 483–496. https://doi.org/10.2118/2241-PA (1969).
Article Google Scholar
Bourgoyne, A. T. Jr. & Young, F. S. Jr. A multiple regression approach to optimal drilling and abnormal pressure detection. Soc. Pet. Eng. J. 14, 371–384. https://doi.org/10.2118/4238-PA (1974).
Article Google Scholar
Soares, C. & Gray, K. Real-time predictive capabilities of analytical and machine learning rate of penetration (ROP) models. J. Pet. Sci. Eng. 172, 934–959. https://doi.org/10.1016/j.petrol.2018.08.083 (2019).
Article CAS Google Scholar
von Rueden, L. et al. Informed machine learning: A taxonomy and survey of integrating prior knowledge into learning systems. IEE-E Trans. Knowl. Data Eng. 35(1), 614–633. https://doi.org/10.1109/TKDE.2021.3079836 (2021).
Article Google Scholar
Daw, A., Karpatne, A., Watkins, W. et al. Physics-guided neural networks (PGNN): An application in lake temperature modeling. arXiv. https://doi.org/10.48550/arXiv.1710.11431 (2017).
Shen, H. & Zhang, L. Mechanism-learning coupling paradigms for parameter inversion and simulation in earth surface systems. Sci. China Earth Sci. 66(3), 568–582. https://doi.org/10.1007/s11430-022-9999-9 (2023).
Article ADS Google Scholar
Bilgesu, H. I. et al. A new approach for the prediction of rate of penetration (ROP) values. Soc. Pet. Eng. https://doi.org/10.2118/39231-MS (1997).
Article Google Scholar
Mendes, J. R. P. et al. Applying a genetic neuro-model reference adaptive controller in drilling optimization. World Oil 228(10), 29–36 (2007).
Google Scholar
David, M., Hani, L. et al. Sophisticated ROP prediction technologies based on neural network delivers accurate drill time results. In Asia Pacific Drilling Technology Conference and Exhibition. https://doi.org/10.2118/132010-MS (2010).
Arabjamaloei, R. & Shadizadeh, S. R. Modeling and optimizing rate of penetration using intelligent systems in an Iranian Southern oil field (Ahwaz oil field). Pet. Sci. Technol. 29(16), 1637–1648. https://doi.org/10.1080/10916460902882818 (2011).
Article CAS Google Scholar
Abdolali, E., Behzad, E. et al. ROP modeling using neural network and drill string vibration data. In SPE Kuwait International Petroleum Conference and Exhibition. https://doi.org/10.2118/163330-MS (2012).
Jing, N. et al. A new model of ROP prediction for drilling engineering with data source data output desired output. Int. J. Adv. Inf. Sci. Serv. Sci. 5, 597–604. https://doi.org/10.4156/AISS.VOL5.ISSUE7.70 (2013).
Article Google Scholar
Zare, J. & Shadizadeh, S. R. Managed pressure drilling to increase rate of penetration and reduce formation damage. Pet. Sci. Technol. 32(15), 1833–1842. https://doi.org/10.1080/10916466.2010.540618 (2014).
Article CAS Google Scholar
Asadollah, B., Reza, A. H. & Mahsa, G. Optimized support vector regression for drilling rate of penetration estimation. Open Geosci. 7(1), 20150054. https://doi.org/10.1515/geo-2015-0054 (2015).
Article Google Scholar
Chiranth, H., Scott, W. & Ken, G. Using trees, bagging, and random forests to predict rate of penetration during drilling. In SPE Middle East Intelligent Oil and Gas Conference and Exhibition. https://doi.org/10.2118/176792-MS (2015).
Mantha, B. & Samuel, R. ROP optimization using artificial intelligence techniques with statistical regression coupling. In SPE Annual Technical Conference and Exhibition. https://doi.org/10.2118/181382-MS (2016).
Hegde, C. et al. Analysis of rate of penetration (ROP) prediction in drilling using physics-based and data-driven models. J. Pet. Sci. Eng. 159, 295–306. https://doi.org/10.1016/j.petrol.2017.09.020 (2017).
Article CAS Google Scholar
Mohammad, A. et al. Drilling rate prediction from petrophysical logs and mud logging data using an optimized multilayer perceptron neural network. J. Geophys. Eng. 15(14), 1146–1159. https://doi.org/10.1088/1742-2140/aaac5d (2018).
Article Google Scholar
Sabah, M. et al. A machine learning approach to predict drilling rate using petrophysical and mud logging data. Earth Sci Inform. 12, 319–339. https://doi.org/10.1007/s12145-019-00381-4 (2019).
Article Google Scholar
Kor, K. & Altun, G. Is support vector regression method suitable for predicting rate of penetration?. J. Pet. Sci. Eng. 194, 107542. https://doi.org/10.1016/j.petrol.2020.107542 (2020).
Article CAS Google Scholar
Li, C. & Cheng, C. Prediction and optimization of rate of penetration using a hybrid artificial intelligence method based on an improved genetic algorithm and artificial neural network. In Abu Dhabi International Petroleum Exhibition & Conference. https://doi.org/10.2118/203229-MS (2020).
Mehrad, M. et al. Developing a new rigorous drilling rate prediction model using a machine learning technique. J. Pet. Sci. Eng. 192, 107338. https://doi.org/10.1016/j.petrol.2020.107338 (2020).
Article CAS Google Scholar
Gan, C. et al. A novel dynamic model for the online prediction of rate of penetration and its industrial application to a drilling process. J. Process Control 109, 83–92. https://doi.org/10.1016/j.jprocont.2021.12.002 (2021).
Article CAS Google Scholar
Hazbeh, O. et al. Comparison of accuracy and computational performance between the machine learning algorithms for rate of penetration in directional drilling well. Pet. Res. 6(3), 271–282. https://doi.org/10.1016/j.ptlrs.2021.02.004 (2021).
Article Google Scholar
Elkatatny, S. Real-time prediction of rate of penetration while drilling complex lithologies using artificial intelligence techniques. Ain Shams Eng. J. 12(1), 917–926. https://doi.org/10.1016/j.asej.2020.05.014 (2021).
Article Google Scholar
Zhang, H., Lu, B. et al. Combining machine learning and classic drilling theories to improve rate of penetration prediction. In SPE/IADC Middle East Drilling Technology Conference and Exhibition. https://doi.org/10.2118/202202-MS (2021).
Ren, C., Huang, W. & Gao, D. Predicting rate of penetration of horizontal drilling by combining physical model with machine learning method in the China Jimusar oil field. SPE J. https://doi.org/10.2118/212294-PA (2022).
Article Google Scholar
Zhang, C., Song, X., Yinao, Su. & Li, G. Real-time prediction of rate of penetration by combining attention-based gated recurrent unit network and fully connected neural networks. J. Pet. Sci. Eng. 213, 110396. https://doi.org/10.1016/j.petrol.2022.110396 (2022).
Article CAS Google Scholar
Brenjkar, E. & Delijani, E. B. Computational prediction of the drilling rate of penetration (ROP): A comparison of various machine learning approaches and traditional models. J. Pet. Sci. Eng. 210, 110033. https://doi.org/10.1016/j.petrol.2021.110033 (2022).
Article CAS Google Scholar
Riazi, M. et al. Modelling rate of penetration in drilling operations using RBF, MLP, LSSVM, and DT models. Sci. Rep. 12, 11650. https://doi.org/10.1038/s41598-022-14710-z (2022).
Article PubMed PubMed Central ADS CAS Google Scholar
Zhang, C. et al. Real-time and multi-objective optimization of rate-of-penetration using machine learning method. Geoenergy Sci. Eng. 223, 211568. https://doi.org/10.1016/j.geoen.2023.211568 (2023).
Article CAS Google Scholar
Wan, Y. et al. Intelligent prediction of drilling rate of penetration based on method-data dual validity analysis. SPE J. https://doi.org/10.2118/217977-PA (2023).
Article Google Scholar
Sharifinasab, M. H., Niri, M. E. & Masroor, M. Developing GAN-boosted artificial neural networks to model the rate of drilling bit penetration. Appl. Soft Comput. 136, 110067. https://doi.org/10.1016/j.asoc.2023.110067 (2023).
Article Google Scholar
Shaygan, K. & Jamshidi, S. Prediction of rate of penetration in directional drilling using data mining techniques. Geoenergy Sci. Eng. 221, 111293. https://doi.org/10.1016/j.petrol.2022.111293 (2023).
Article Google Scholar
Li, J. et al. The origin and accumulation of ultra-deep oil in Halahatang area, northern Tarim Basin. J. Pet. Sci. Eng. 195, 107898. https://doi.org/10.1016/j.petrol.2020.107898 (2020).
Article CAS Google Scholar
Savitzky, A. & Golay, M. J. Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 36(8), 1627–1639. https://doi.org/10.1021/ac60214a047 (1964).
Article ADS CAS Google Scholar
Karl, P. Notes on regression and inheritance in the case of two parents. Proc. R. Soc. Lond. 58, 240–242 (1895).
Article Google Scholar
Spearman, C. The proof and measurement of association between two things. Am. J. Psychol. 15(1), 72–101 (1904).
Article Google Scholar
Kendall, M. A new measure of rank correlation. Biometrika 30(1–2), 81–89 (1938).
Article Google Scholar
Reshef, D. et al. Detecting novel associations in large data sets. Science 334(6062), 1518–1524 (2011).
Article PubMed PubMed Central ADS CAS Google Scholar
Shao, F. & Liu, H. The theoretical and experimental analysis of the maximal information coefficient approximate algorithm. J. Syst. Sci. Inf. 9(01), 95–104 (2021).
Google Scholar
Bourgoyne, A. T. Jr., Millheim, K. K., Chenevert, M. E. & Young, F. S. Jr. Applied Drilling Engineering (Society of Petroleum Engineers, 1991).
Google Scholar
Nascimentao, A. et al. Mathematical modeling applied to drilling engineering: An application of Bourgoyne and Young ROP model to a presalt case study. Math. Probl. Eng. 20, 1–9. https://doi.org/10.1155/2015/631290 (2015).
Article Google Scholar
Kutas D. T., Nascimento A., Elmgerbi A. M. et al. A study of the applicability of Bourgoyne & Young ROP model and fitting reliability through regression. In International Petroleum Technology Conference. https://doi.org/10.2523/IPTC-18521-MS (2015).
Song, X. et al. Intelligent prediction of mechanical drilling speed based on support vector machine regression. Xinjiang Oil Gas 18(1), 14–20 (2022).
Google Scholar
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
MathSciNet Google Scholar
Karimi, S. et al. Generalizability of gene expression programming and random forest methodologies in estimating cropland and grassland leaf area index. Comput. Electron. Agric. 144, 232–240. https://doi.org/10.1016/j.compag.2017.12.007 (2018).
Article Google Scholar
Willard, J. et al. Integrating scientific knowledge with machine learning for engineering and environmental systems. ACM Comput. Surv. 55, 1–37. https://doi.org/10.1145/3514228 (2022).
Article Google Scholar
Zhihua, Z. Machine Learning 171–178 (Beijing, 2016).
Google Scholar
Breiman, L. Bagging predictors. Mach. Learn. 24(2), 123–140 (1996).
Article Google Scholar
Efron, B. & Tibshirani, R. An Introduction to the Bootstrap (Chapman & Hall, 1993).
Book Google Scholar
Branch, M. A., Coleman, T. F., Li, Y. & Subspace, A. Interior, and conjugate gradient method for large-scale bound-constrained minimization problems. SIAM J. Sci. Comput. 21(1), 1–23 (1999).
Article MathSciNet Google Scholar
de Mattos Neto, P. S. G. et al. Hybrid systems using residual modeling for sea surface temperature forecasting. Sci. Rep. 12, 487. https://doi.org/10.1038/s41598-021-04238-z (2022).
Article PubMed PubMed Central ADS CAS Google Scholar
Virtanen, P. et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 17(3), 261–272 (2020).
Article PubMed PubMed Central CAS Google Scholar
Buczak, A. L. & Guven, E. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun. Surv. Tutor. 18(2), 1153–1176. https://doi.org/10.1109/comst.2015.2494502 (2016).
Article Google Scholar
de Oliveira, J. F. L. & Ludermir, T. B. A hybrid evolutionary decomposition system for time series forecasting. Neurocomputing 180, 27–34. https://doi.org/10.1016/j.neucom.2015.07.113 (2016).
Article Google Scholar
Hassine, K., Erbad, A. & Hamila, R. Important complexity reduction of random forest in multi-classification problem. In 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC) 226–331. https://doi.org/10.1109/iwcmc.2019.8766544 (2019).

Download references

Acknowledgements

This research was financially supported by the National Natural Science Foundation of China under contract Nos. 51774093 and 52274005, and the Heilongjiang Provincial Government and Daqing Oilfield unveiled the first batch of key scientific and technological research projects under contract No. DQYT-2022-JS-750.

Author information

Authors and Affiliations

School of Petroleum Engineering, Northeast Petroleum University, Daqing, China
Shengjie Jiao, Wei Li, Zhuolun Li, Jingming Gai, Linhao Zou & Yinao Su
National Engineering Research Center of Oil & Gas Drilling and Completion Technology, Beijing, China
Wei Li & Yinao Su

Authors

Shengjie Jiao
View author publications
You can also search for this author in PubMed Google Scholar
Wei Li
View author publications
You can also search for this author in PubMed Google Scholar
Zhuolun Li
View author publications
You can also search for this author in PubMed Google Scholar
Jingming Gai
View author publications
You can also search for this author in PubMed Google Scholar
Linhao Zou
View author publications
You can also search for this author in PubMed Google Scholar
Yinao Su
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

All authors reviewed the manuscript.

Corresponding authors

Correspondence to Wei Li or Yinao Su.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Jiao, S., Li, W., Li, Z. et al. Hybrid physics-machine learning models for predicting rate of penetration in the Halahatang oil field, Tarim Basin. Sci Rep 14, 5957 (2024). https://doi.org/10.1038/s41598-024-56640-y

Download citation

Received: 29 November 2023
Accepted: 08 March 2024
Published: 12 March 2024
DOI: https://doi.org/10.1038/s41598-024-56640-y

Keywords

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Global prediction of extreme floods in ungauged watersheds

Accurate medium-range global weather forecasting with 3D neural networks

Synthetic Lagrangian turbulence by generative diffusion models

Introduction

Methodology

Procedure of the hybrid physics-ML ROP modeling

Tarim Basin (Xinjiang) dataset

Data denoising

Data spilt

Feature selection

Physical models

ML models

Artificial neural network

Support vector machine

Random forest

Model performance evaluation metrics

Hyperparameter selection in ML models

Hybrid physics-ML models

Residual modeling

Integrated coupling

Simple average

Bagging

Results and discussion

Comparison of different physical models

Comparison of different ML models

Comparison of physical, ML, and hybrid models

Comparison of different hybrid models

Comparison of the effects of different dataset sizes on models

Comparison of the effects of different data denoising methods on the model

Comparison of the effects of different data splitting methods on the model

Further discussion on hybrid models for ROP prediction

Conclusions

Data availability

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Additional information

Publisher's note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Comments

Search

Quick links