Spatial Distribution Prediction of Oil and Gas Based on Bayesian Network with Case Study

College of Computer Science and Technology, Jilin University, Changchun 130012, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; Chengdu Kestrel Artificial Intelligence Institute, Chengdu 610000, China; Beijing Jingdong Century Trading Co., Ltd., Beijing 100083, China; Research Institute of Petroleum Exploration & Development, PetroChina, Beijing 100083, China


Introduction
Petroleum exploration is an economic activity that involves many decisions made under risk and uncertainty [1]. From 1980 to 2016, China National Petroleum Corporation drilled 40730 exploratory wells in its exploration area and obtained 17371 industrial oil and gas wells, an average success rate of 42.6% [2], which indicates that there is still much room to improve the drilling success rate. Therefore, accurate prediction of the spatial distribution of oil and gas is an important task in petroleum exploration. It can not only quantify the probability of the spatial distribution of undiscovered oil and gas reservoirs, and thereby visualize exploration risk, but also be combined with the results of traditional resource evaluation methods to integrate resource evaluation, exploration risk evaluation, and target optimization [3].
Over the past decades, large petroleum corporations and national governments have paid great attention to this topic, and many experts and scholars have proposed methods to describe the spatial distribution of hydrocarbons. These methods fall into two categories: knowledge-driven methods and data-driven methods. Knowledge-driven methods are represented by classic geological risk probability assessment [4][5][6][7][8][9] and fuzzy comprehensive assessment [1,10,11]. They rely on expert knowledge and experience to synthesize the main geological elements necessary for oil and gas accumulation in a region into an overall evaluation. This type of method is mostly used in the early and middle periods of exploration and in areas with relatively simple geological structure. However, it has two drawbacks. First, the subjective judgments employed in risk evaluation may be inconsistent or nonrepeatable. For example, if independent assessments are conducted by different people, the probability values of the risk can vary significantly, whereas a consistent and repeatable geological evaluation method is more desirable for consistent exploration strategy analysis. Second, this type of method does not explicitly consider spatial correlations among prospects in the analysis of geological risk. Since hydrocarbon occurrence at a specific prospect is part of the result of similar geological processes, rather than an isolated event, a good understanding of the spatial variation of the exploration objects and of their spatial relationship with the geological factors is of great importance in geological risk prediction.
Data-driven methods mainly use data mining, artificial intelligence, and related techniques to quantitatively calculate the spatial distribution of oil and gas from actual well exploration data and geoscience data, thereby visualizing exploration risk. For example, Chen et al. [12][13][14], Gao et al. [15], and Chen and Osadetz [16] proposed stochastic simulation methods for modeling the locations of undiscovered petroleum accumulations. Hu et al. [17] and Xie et al. [18] proposed a multivariate-Bayesian approach for risk reduction in exploration. Chen et al. [19] proposed an SVM method for geological risk evaluation. Amiri et al. [20] proposed a method for evaluating hydrocarbon resource potential based on the combination of evidence belief functions and GIS. Song et al. [21] proposed a multiple regression method for quantitative assessment of trap oil-bearing property. Zhu et al. [22] proposed a method of geological risk evaluation based on logistic regression. Jie et al. [23] used a fuzzy AHP-based grey relation analysis method to assess hydrocarbon potential. This second type of method is based on statistical theory, involves less human intervention, and is relatively objective. It can better reflect the relationship between hydrocarbon occurrence and the geological factors of oil and gas accumulation. However, it usually requires a large amount of actual exploration data and is therefore mainly suitable for areas with a medium or high degree of exploration. At present, data-driven methods are widely applied, but there is still considerable room to improve the accuracy of model prediction [17,18,24,25].
This paper proposes a method for predicting the spatial distribution of oil and gas based on the K-dependence Bayesian network (KDB). First, the data set of exploration wells and the key geologic factors are obtained. Second, the data set is gridded and discretized to form a labeled exploration well data set and an unlabeled grid data set. Third, a Bayesian structure learning algorithm is used to model the labeled well data set, and the effectiveness of the model is evaluated. Finally, the established Bayesian network model and the unlabeled grid data are used to calculate the posterior probability of oil and gas in the untested areas, and an interpolation method is used to form the probability map of oil and gas distribution. The flow chart of spatial distribution prediction of oil and gas resources based on Bayesian network is shown in Figure 1. The proposed method is then applied to predict the spatial distribution of oil and gas in the Huanglong Formation of the Carboniferous in the east Sichuan Basin of China. In the case study, the topological structure of the Bayesian network is used to determine the qualitative relationship between oil and gas occurrence and key factors based on the results of previous exploration drilling. The probability of hydrocarbon occurrence is calculated with the resulting Bayesian network model, and the spatial variation of hydrocarbon occurrence is visualized as a map, which provides a theoretical basis for drilling decision-making.

Geological Problem and Mathematical Model.
The petroleum exploration process is highly coupled with geological models that explain the occurrence of hydrocarbon accumulations. Geologists, geophysicists, and seismologists apply high levels of expertise to answer questions such as the following: what is the chance of finding an accumulation in the prospect? What is the risk of using the information in order to decide whether to develop the field? Which methods should be used to assess the oil in this field? Capturing this knowledge and representing it in a formal model is a permanent aim for knowledge management in petroleum companies [1,11]. In the present study, the problem of calculating the chance of finding an accumulation in the prospect is investigated. The goal of finding a hydrocarbon accumulation, that is, determining the spatial distribution of hydrocarbon resources, is to determine with what degree of certainty a potential drilling target may be productive or nonproductive, on the basis of our understanding of the domain knowledge and the available data [17,26]. This geological problem can be transformed into a mathematical model, namely a binary classification problem with uncertainty in a multivariate space. Suppose that an area in a petroleum play can be classified as either productive or nonproductive and that the play was penetrated by n exploratory wells that provided samples from the play. The actual exploration results show that the same geological factors have different characteristics in the productive area and the nonproductive area. Therefore, the basic assumption of classification is that the same geological factor differs significantly, in the statistical sense, between groups but not within a group. Based on data from the n exploration wells combined with other geoscience information, we want to estimate the probability that each untested position belongs to one of the two defined categories.
This estimation can take the form of a conditional probability. The result of an exploration well is expressed by the random variable C, where $c = 0$ stands for a nonproducer well in a nonproductive area and $c = 1$ stands for a producer well in a productive area. Suppose that whether each exploration well is a producer well is related to m geological variables and that the vector $X = (X_1, X_2, \ldots, X_m)$ represents the m geological variables containing information on the classification. The conditional probability that, for given observations $x = (x_1, x_2, \ldots, x_m)$ at location r, the area belongs to class c can then be written as
$$P(C = c \mid X_1 = x_1, X_2 = x_2, \ldots, X_m = x_m), \quad (1)$$
where C stands for the binary class variable, $c \in \{0, 1\}$. According to the Bayes formula and the chain rule of the joint probability distribution, equation (1) can be calculated as follows:
$$P(c \mid x_1, \ldots, x_m) = \frac{P(c, x_1, \ldots, x_m)}{\sum_{c' \in \{0, 1\}} P(c', x_1, \ldots, x_m)}, \quad (2)$$
$$P(c, x_1, \ldots, x_m) = P(c)\, P(x_1 \mid c)\, P(x_2 \mid x_1, c) \cdots P(x_m \mid x_1, \ldots, x_{m-1}, c). \quad (3)$$
In practice, solving the joint probability in formula (3) for large m requires a large amount of accurate and complete data, which are often difficult to obtain. To avoid the complexity of a multivariate-Bayesian formulation, the Bayesian network is used in the present study to integrate all available geological variables and to calculate the probability of hydrocarbon occurrence.
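To make the data requirement concrete, the following Python sketch counts the free parameters of an unrestricted joint table over C and m discrete variables and compares it with a K-dependence Bayesian network over the same variables. The setting of eight variables and five bins is borrowed from the case study later in this paper; the function names are hypothetical, and the KDB count assumes that the i-th attribute added to the network has min(i, K) attribute parents plus the class node.

```python
def full_joint_parameters(num_vars: int, bins: int, classes: int = 2) -> int:
    """Free parameters of an unrestricted joint table over C and num_vars variables."""
    return classes * bins ** num_vars - 1


def kdb_parameters(num_vars: int, bins: int, k: int, classes: int = 2) -> int:
    """Free parameters of a K-dependence network (assumed parent counts, see lead-in)."""
    total = classes - 1  # P(C)
    for i in range(num_vars):
        attribute_parents = min(i, k)
        # One distribution over `bins` values per class and parent configuration.
        total += classes * bins ** attribute_parents * (bins - 1)
    return total


full = full_joint_parameters(num_vars=8, bins=5)   # 781249
sparse = kdb_parameters(num_vars=8, bins=5, k=2)   # 1249
print(full, sparse)
```

With only a few hundred labeled wells, a table with hundreds of thousands of entries cannot be estimated reliably, whereas a sparse network needs roughly a thousand.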

Bayesian Network.
A Bayesian network is a pair $BN = \langle G, \Theta \rangle$. The first component, G, is a directed acyclic graph whose vertices (nodes) correspond to the random variables $X_1, X_2, \ldots, X_m$ and whose edges represent direct dependencies between the variables. The second component of the pair, namely, $\Theta$, represents the set of parameters that quantify the network [27][28][29]. In a Bayesian network, Y is called a parent of X if there is an edge from Y to X, and pa(X) is the parent set of X. If all nodes in a BN consist of variables in equation (3), equation (3) can be converted as follows:
$$P(c, x_1, \ldots, x_m) = P(c) \prod_{i=1}^{m} P(x_i \mid pa(x_i)). \quad (4)$$
When each node in formula (4) has at most K parent nodes from other variables except the class node, formula (4) can be simplified with the KDB framework. KDB was proposed by Sahami [30] with the aim of achieving a tradeoff between classification accuracy and structure complexity. The arc or dependency between the variables C and $X_i$ is measured by the mutual information (MI); the arc or conditional dependency between the variables $X_i$ and $X_j$ is measured by the conditional mutual information (CMI) given the class variable C. The definitions of MI and CMI are shown in the following two equations, respectively,
$$MI(X; C) = \sum_{x} \sum_{c} P(x, c) \log \frac{P(x, c)}{P(x)\, P(c)}, \quad (5)$$
$$CMI(X_i; X_j \mid C) = \sum_{x_i} \sum_{x_j} \sum_{c} P(x_i, x_j, c) \log \frac{P(x_i, x_j \mid c)}{P(x_i \mid c)\, P(x_j \mid c)}, \quad (6)$$
where X and C are discrete random variables. In addition, the KDB algorithm is suitable for sparse network structure learning. When there are few attribute nodes, it can maintain high learning efficiency even if the sample size is very small. Therefore, this paper uses it to predict the spatial distribution of oil and gas. Assuming that there are four geological variables $X = (X_1, X_2, X_3, X_4)$ and $K = 2$, the procedure of structure learning for the proposed model is as follows [30][31][32]:
(1) For each feature $X_i$, compute the mutual information $MI(X_i; C)$, where C is the class variable.
(2) Compute the class conditional mutual information $CMI(X_i; X_j \mid C)$ for each pair of attributes $X_i$ and $X_j$, where $i \neq j$.
(3) Let the used variable list S be empty.
(4) Let the KDB being constructed begin with a single class node C.
(5) Repeat until S includes all features:
(5.1) Select the feature $X_{max}$ that is not in S and has the largest value $MI(X_i; C)$.
(5.2) Add a node to KDB representing $X_{max}$.
(5.3) Add an arc from C to $X_{max}$ in KDB.
(5.4) Add $m = \min(|S|, 2)$ arcs from the m distinct attributes $X_j$ in S with the highest values of $CMI(X_{max}; X_j \mid C)$.
(5.5) Add $X_{max}$ to S.
(6) Compute the conditional probability tables implied by the structure of KDB by using counts from the database, and output KDB.
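The structure-learning steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: probabilities are plain relative frequencies, the function names and the toy data are hypothetical, and the class node is treated as an implicit parent of every attribute, so only attribute parents are returned.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, cs):
    """MI(X; C) estimated from relative frequencies."""
    n = len(xs)
    n_xc, n_x, n_c = Counter(zip(xs, cs)), Counter(xs), Counter(cs)
    return sum((v / n) * math.log(v * n / (n_x[x] * n_c[c]))
               for (x, c), v in n_xc.items())


def conditional_mutual_information(xs, ys, cs):
    """CMI(X; Y | C) estimated from relative frequencies."""
    n = len(xs)
    n_xyc = Counter(zip(xs, ys, cs))
    n_xc, n_yc, n_c = Counter(zip(xs, cs)), Counter(zip(ys, cs)), Counter(cs)
    return sum((v / n) * math.log(v * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)]))
               for (x, y, c), v in n_xyc.items())


def learn_kdb_structure(columns, labels, k=2):
    """Return {attribute: [attribute parents]}; C is an implicit parent of all."""
    mi = {a: mutual_information(col, labels) for a, col in columns.items()}
    cmi = {}
    for a, b in combinations(columns, 2):
        value = conditional_mutual_information(columns[a], columns[b], labels)
        cmi[(a, b)] = cmi[(b, a)] = value
    used = []      # the list S of steps (3) and (5)
    parents = {}
    # Visiting attributes by decreasing MI equals repeatedly picking X_max.
    for attr in sorted(columns, key=lambda a: mi[a], reverse=True):
        ranked = sorted(used, key=lambda p: cmi[(attr, p)], reverse=True)
        parents[attr] = ranked[:min(len(used), k)]
        used.append(attr)
    return parents


# Hypothetical toy data: eight wells, three already discretized attributes.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy = {
    "x1": [0, 0, 0, 0, 1, 1, 1, 1],
    "x2": [0, 0, 0, 1, 1, 1, 1, 0],
    "x3": [0, 1, 0, 1, 0, 1, 0, 1],
}
structure = learn_kdb_structure(toy, labels, k=2)
print(structure)
```

On the toy data, x1 has the largest mutual information with the class and enters first with no attribute parent, x2 then takes x1 as its parent, and x3 takes both.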
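After the KDB model is established, as shown in Figure 2, the following formula is derived from formula (4):
$$P(c \mid x_1, x_2, x_3, x_4) \propto P(c) \prod_{i=1}^{4} P(x_i \mid pa(x_i)), \quad (7)$$
where each parent set $pa(x_i)$ is read from the structure in Figure 2 and the right-hand side is normalized over $c \in \{0, 1\}$. The conditional probability $P(c = 1 \mid x_1, x_2, x_3, x_4)$ can be calculated by formula (7) for given observations $(x_1, x_2, x_3, x_4)$. When this value is high, the probability of untapped petroleum in the area is considered high. The greater the value of $P(c = 1 \mid x)$, the smaller the geological risk and the higher the reliability of the layout of exploration wells.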
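As an illustration of how such a posterior could be evaluated once a structure is fixed, the Python sketch below multiplies the class prior by one conditional probability per attribute and normalizes over the two classes. Everything specific here is an assumption: the function name, the toy data, the hand-written parent sets (not those of Figure 2), and the use of Laplace smoothing with `alpha = 1`, which the paper does not mention.

```python
def kdb_posterior(columns, labels, parents, query, alpha=1.0):
    """Return {class: P(class | query)} from counts with Laplace smoothing.

    parents maps each attribute to its attribute parents; the class node is
    an implicit parent of every attribute.
    """
    classes = sorted(set(labels))
    n = len(labels)
    scores = {}
    for c in classes:
        rows = [i for i in range(n) if labels[i] == c]
        score = (len(rows) + alpha) / (n + alpha * len(classes))  # P(c)
        for attr, attr_parents in parents.items():
            # Rows of class c whose parent values agree with the query.
            match = [i for i in rows
                     if all(columns[p][i] == query[p] for p in attr_parents)]
            hits = sum(1 for i in match if columns[attr][i] == query[attr])
            n_values = len(set(columns[attr]))
            score *= (hits + alpha) / (len(match) + alpha * n_values)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical toy data and a hand-written two-attribute structure.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "x1": [0, 0, 0, 0, 1, 1, 1, 1],
    "x2": [0, 0, 0, 1, 1, 1, 1, 0],
}
parents = {"x1": [], "x2": ["x1"]}
posterior = kdb_posterior(columns, labels, parents, {"x1": 1, "x2": 1})
print(posterior[1])  # 20/23, about 0.87
```

Smoothing keeps the estimate defined when a parent configuration never occurs in one class, which is likely with only a few hundred wells.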

Geological Setting.
The study area is located in the east of Sichuan Basin, covering an area of about $5.5 \times 10^4$ km$^2$. The structure is a typical barrier fold [33]. In the study area, there are 10 rows of high and steep structural belts (mainly trending NNE) distributed from west to east. The main gas-producing layer in the high-steep anticlines is the Carboniferous. After 40 years of development, Datianchi, Liangshuijing, Gaofengchang, and other gas reservoirs have been found in the Huanglong Formation of the Carboniferous (Figure 3). The proved reserves of natural gas are about $2400 \times 10^8$ m$^3$, and the controlled reserves of natural gas are about $200 \times 10^8$ m$^3$ [34].

Source Rock Conditions.
The Carboniferous gas source mainly comes from the underlying Silurian strata. The Silurian in eastern Sichuan is mainly a set of marine shale deposits, and the thickness of the Silurian source rocks is mostly predicted to be 300-700 m (Figure 4(a)). The organic carbon content of the source rock is 0.8-1.6% (Figure 4(d)), and the thermal evolution degree is generally high, with $R_o$ of 2.4-4.0% (Figure 4(c)). The gas generation intensity of the source rock is mainly $20$-$80 \times 10^8$ m$^3$/km$^2$, and the expulsion intensity is $15$-$70 \times 10^8$ m$^3$/km$^2$ (Figure 4(b)). The Silurian source rocks have good organic matter types and strong hydrocarbon-generating ability. In the early stage, they mainly generated liquid hydrocarbons, and, in the late stage, they mainly generated gaseous hydrocarbons, which provided a sufficient gas source for the formation of the gas reservoirs [35].

Reservoir Conditions.
The average thickness of the Huanglong Formation reservoir in the Carboniferous is 36.7 m (Figure 4(e)), and the lithology is mainly dolomite. Core analysis shows that the porosity of the samples is generally 1-19% and the permeability is generally 0.001-10 mD. Although the permeability of the rock matrix of the Carboniferous gas reservoir in east Sichuan is very low, the permeability of the rock fractures is very high. This is because the Carboniferous gas reservoir contains a large number of open structural fractures, whose permeability is much greater than that of the rock matrix [36].

Caprock Conditions.
The direct caprock of the Carboniferous gas reservoir is the shale of the Liangshan Formation of the Lower Permian, and the indirect caprock extends from the Qixia Formation of the Permian to the Jialingjiang Formation of the Triassic and consists mainly of dense limestone and gypsum. The thickness of the gypsum cover is between 100 and 300 m (Figure 4(h)).

Trap and Migration and Accumulation Conditions.
The Carboniferous traps in East Sichuan can be divided into three types: structural traps, stratigraphic structural composite traps, and lithologic structural composite traps. The hydrocarbon migration and accumulation of the Carboniferous system comprised primary migration and secondary migration. That is, hydrocarbons first migrated vertically (primary migration) from the Silurian source rock to the Carboniferous reservoir and then migrated laterally (secondary migration) within the Carboniferous reservoir; the whole process includes accumulation in the ancient traps from the Indosinian to the Yanshanian and accumulation in various traps during the Himalayan.
Figure 2: Structure for the KDB model (K = 2).
[Map of the study area. Labels: Chongqing, Huayingshan, Jiufengsi, Tieshan, Yunanchang, Nanmenchang, Huangcaoxia, Huangnitang, Xiangguosi, Mingyuexia, Qilixia, Datianchi, Fangdoushan, Dachigan, Wenquanjing. Legend: producer well.]
The data used in KDB modeling are mainly from 248 exploration wells drilled to the Huanglong Formation. Among the 248 wells drilled to the target formation, 126 are producer wells and the other 122 are nonproducer wells. In addition to well data, the results of seismic data and basin modeling were also available for the present study.
In the third National Petroleum Resource assessment, PetroChina established oil and gas exploration risk assessment parameters, including 22 types of geological information in 5 categories [17,18]. Theoretically, the more types and the greater the quantity of geological information available, the higher the accuracy of model prediction. In practice, it is difficult to obtain all 22 types of geological information, because the degree of exploration varies and the information is held by different departments. The data used in this paper to predict the spatial distribution of oil and gas resources mainly include eight types of geological information, which are (1) source rock thickness (STH) of $S_1l$ (Figure 4(a)), (2) organic matter maturity ($R_o$) of source rock (Figure 4(b)), (3) total organic carbon content (TOC) of source rock (Figure 4(c)), (4) expulsion gas strength (EGS) of source rock (Figure 4(d)), (5) reservoir rock thickness (RTH) of $C_2hl$ (Figure 4(e)), (6) top structure (ST) burial depth of $C_2hl$ (Figure 4(f)), (7) fluid potential (FP) (Figure 4(g)), and (8) thickness of cap rock (CRT) of $T_1j$ (Figure 4(h)).

Gridding and Discretization.
We divide the data into two groups: a labeled exploration well data set and an unlabeled grid data set. The former is used to build the Bayesian network model, and the latter is used to predict the spatial distribution of oil and gas. We first established 11,000 equally spaced virtual grid points in the study area, as shown in Figure 5, and then used interpolation to assign the isomap data of the 8 types of geological factors to the virtual grid points, creating the unlabeled grid data set. The same method is used to build the labeled well data set.
In this study, discrete random variables are used for Bayesian network learning; the unsupervised 5-bin method on the Weka platform was employed to discretize the text data of the isograms and the data of the 248 wells.
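The discretization step can be illustrated with a short Python sketch of unsupervised equal-width binning into five bins. Two points are assumptions rather than statements from the paper: that the 5-bin method is equal-width (as opposed to equal-frequency), and that cut points fitted on the well data are reused for the grid data. The function names and the sample thickness values are hypothetical.

```python
def equal_width_edges(values, bins=5):
    """Interior cut points that split [min, max] into `bins` equal intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def discretize(value, edges):
    """Bin index in 0..len(edges); a value equal to a cut point goes to the lower bin."""
    return sum(1 for edge in edges if value > edge)


# Hypothetical source rock thickness values (m) at six wells.
well_thickness = [300, 380, 460, 540, 620, 700]
edges = equal_width_edges(well_thickness, bins=5)        # [380.0, 460.0, 540.0, 620.0]
well_bins = [discretize(v, edges) for v in well_thickness]
print(well_bins)                                         # [0, 0, 1, 2, 3, 4]

# Grid values outside the fitted range fall into the first or last bin.
grid_bins = [discretize(v, edges) for v in (250, 500, 900)]
print(grid_bins)                                         # [0, 2, 4]
```

Reusing one set of cut points keeps a bin index such as 3 referring to the same physical range in both the labeled and the unlabeled data set.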

Model Development.
In the model development process, the proposed KDB algorithm was used to perform structure learning and parameter learning on the data set of 248 samples, which includes the results of the exploration wells and eight sets of related geological data. To reduce the computational complexity, the K value was set to 2. The topological structure of the obtained Bayesian network model is shown in Figure 6. With this model, prediction was performed on the 248 exploration wells. Compared with the actual exploration results, 222 wells were predicted correctly, an accuracy rate of 89.5% (222/248). At the same time, to examine the accuracy of the KDB method proposed in this paper, we compared it with the multivariate-Bayesian method, the support vector machine (SVM) method, the logistic regression method, and the TAN method. The experimental comparison results are shown in Table 1. Table 1 shows that the KDB method has the highest accuracy rate, 89.5%; the logistic regression method has the lowest accuracy rate, 73.3%; and the multivariate-Bayesian, SVM, and TAN methods have accuracy rates of 79.1%, 86.3%, and 87.9%, respectively. Compared with the logistic regression, multivariate-Bayesian, SVM, and TAN methods, the accuracy of the KDB method is higher by 16.2, 10.4, 3.2, and 1.6 percentage points, respectively. These results indicate that the prediction accuracy of this method can meet the requirements of predicting the spatial distribution of hydrocarbon resources.
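The arithmetic behind these figures can be checked with a few lines of Python. The accuracy values are the ones quoted in the text; the variable names are illustrative only.

```python
correct, total = 222, 248
kdb_accuracy = round(100 * correct / total, 1)   # 89.5

# Accuracy rates (%) of the compared methods, as quoted in the text.
reported = {
    "logistic regression": 73.3,
    "multivariate-Bayesian": 79.1,
    "SVM": 86.3,
    "TAN": 87.9,
}
# Differences are in percentage points, not relative improvements.
gains = {name: round(kdb_accuracy - acc, 1) for name, acc in reported.items()}
print(kdb_accuracy, gains)
```

The differences come out as 16.2, 10.4, 3.2, and 1.6 percentage points, matching the order logistic regression, multivariate-Bayesian, SVM, TAN.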

Results and Analysis.
The KDB model (K = 2) was used to calculate the hydrocarbon-bearing probability values at all discrete points in Figure 5, and the interpolation method was used to form a hydrocarbon-bearing probability map (Figure 7) of the Huanglong Formation reservoir to predict the spatial distribution of hydrocarbon occurrence. The map uses different colors to characterize the likelihood of oil and gas at any site in the Huanglong Formation in eastern Sichuan Basin. The color change from blue to red indicates that the probability of hydrocarbon occurrence gradually changes from low to high.
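The paper does not state which interpolation method turns the grid-point probabilities into a continuous map. Purely as an illustration, the Python sketch below uses inverse distance weighting (IDW), one common choice; the function name, the power parameter, and the sample points are all assumptions.

```python
import math


def idw(points, target, power=2.0):
    """Inverse-distance-weighted estimate at `target`.

    points: list of ((x, y), probability) pairs at grid points.
    """
    numerator = denominator = 0.0
    for (x, y), probability in points:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            return probability          # exactly on a grid point
        weight = 1.0 / distance ** power
        numerator += weight * probability
        denominator += weight
    return numerator / denominator


# Hypothetical posterior probabilities at two neighbouring grid points.
grid = [((0.0, 0.0), 0.2), ((2.0, 0.0), 0.8)]
print(idw(grid, (1.0, 0.0)))   # midpoint, equal weights: 0.5
print(idw(grid, (0.5, 0.0)))   # closer to the low-probability point
```

Because the weights are positive and normalized, every interpolated value stays between the smallest and largest grid probability, so the map never leaves the interval [0, 1].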
It can be seen from Figure 7 that the gas fields that have been discovered generally show a high probability of hydrocarbon occurrence. For example, gas fields such as Shapingchang, Wubaiti, and Shaguanping are in the red area in the figure, showing a high probability (>80%) of oil and gas. At the same time, the drilled producer wells mostly fall in the red area with a high probability, and the drilled nonproducer wells fall in the blue area with a low probability, indicating that the prediction results agree well with the actual drilling. The hydrocarbon-bearing probability map can be used not only to check the accuracy of the prediction for the discovered gas fields but also to predict the risk at the locations of undiscovered oil and gas fields and thereby increase the success rate of oil and gas exploration. The potential structure of Shiyanchang in the east wing of the Mingyuexia structural belt shows a high probability (>75%) of oil and gas. The risk well Mingyue 1 there has a gas yield of $2.22 \times 10^4$ m$^3$/d. This shows that the denuded marginal belt in the east of Sichuan Basin, where the degree of exploration is very low, still has great exploration potential. Within it, the undrilled stratigraphic structural composite trap area is about 850 km$^2$ and can serve as a future exploration direction for the Huanglong Formation, as shown by the area outlined with a pink polygon in Figure 7. However, areas with a high probability of oil and gas do not necessarily contain oil and gas, and relative structural lows should be avoided as much as possible when deploying exploration wells.

Conclusion
A Bayesian network-based method for predicting the spatial distribution of oil and gas is proposed. Treating hydrocarbon occurrence as a two-group classification with uncertainty, this method tries to identify the topological relationship between hydrocarbon occurrence and the major geologic factors, which is then illustrated by a case study of the spatial distribution of oil and gas in the Huanglong Formation in eastern Sichuan Basin in China. The application shows that the Bayesian network is an effective model for enabling a repeatable and consistent evaluation that is not affected by the bias of the assessor. The probability map of hydrocarbon occurrence can effectively represent the spatial variation of oil and gas, help optimize exploration targets, and increase the success rate of exploration.

Data Availability
The data used to support the results of this study can be obtained from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding publishing this paper.