Integrating Support Vector Regression with Genetic Algorithm for Hydrate Formation Condition Prediction

To predict the natural gas hydrate formation conditions quickly and accurately, a novel hybrid genetic algorithm–support vector machine (GA-SVM) model was developed. The input variables of the model are the relative molecular weight of the natural gas (M) and the hydrate formation pressure (P). The output variable is the hydrate formation temperature (T). Among 10 gas samples, 457 of 688 data points were used for training to identify the optimal support vector machine (SVM) model structure. The remaining 231 data points were used to evaluate the generalisation capability of the best trained SVM model. Comparisons with nine other models and analysis of the outlier detection revealed that the GA-SVM model had the smallest average absolute relative deviation (0.04%). Additionally, the proposed GA-SVM model had the smallest amount of outlier data and the best stability in predicting the gas hydrate formation conditions in the gas relative molecular weight range of 15.64–28.97 g/mol and the natural gas pressure range of 367.65–33,948.90 kPa. The present study provides a new approach for accurately predicting the gas hydrate formation conditions.


Introduction
Natural gas hydrate is a dry ice-like crystalline inclusion compound formed when small gas molecules in natural gas (e.g., light hydrocarbons, CO₂, H₂S, or N₂) combine with water molecules under certain pressure and temperature conditions. In industrial production, hydrates are widely used for gas storage and refrigeration. In oil and gas field development, however, hydrate formation can lead to engineering accidents. Hydrates can easily block wellbores and the reservoir near gas wells, which reduces or even shuts down gas-well production. They can also block measurement instruments, surface separation equipment, and gas pipelines, which can cause pipeline rupture and even explosions. Therefore, it is important to accurately predict the formation conditions of natural gas hydrates.
Currently, the main methods for determining natural gas hydrate formation conditions are experimental methods [1-3], formula methods [4], thermodynamic models [5,6], and intelligent models [7,8]. Experiments yield the most accurate results, but they are time-consuming, laborious, and expensive. Formula methods are widely used because they are simple and easy to program. The widely used correlations include those proposed by Ponomarev [9], Bahadori and Vuthaluru [10], Ghayyem [11], and Amin [12]. Ghayyem established a correlation based on the Katz gravity chart using 545 data points, which exhibited the highest prediction accuracy among these formulas [13]. However, each formula has a limited range of validity, beyond which the calculation error is relatively large. Thermodynamic models are usually derived from the vdW-P model proposed by van der Waals and Platteeuw in 1959. The equilibrium model equations are based on thermodynamic theory and are solved numerically. Currently, the main models include those proposed by Parrish and Prausnitz [14], Chen and Guo [15], and Klauda and Sandler [16]. However, assumptions are made in establishing these models, and their errors in practical applications are relatively large. Moreover, the models are complex and difficult to generalise [17,18]. Research on predicting the conditions for natural gas hydrate formation using intelligent algorithms, such as the backpropagation (BP) neural network and the support vector machine (SVM), has developed rapidly in recent years and yielded good results. Wu Xiaoqiang and Han Bing [19] established an SVM-based prediction model for natural gas hydrate formation conditions in a high-H₂S and high-CO₂ environment by introducing a factor reflecting the contribution of acid gas (H₂S + CO₂) to the formation of CH₄.
To address the large-scale quadratic programming problem and the low calculation speed of the SVM, Mesbah et al. [20] used the least-squares SVM method to predict the desulfurisation and temperature of acid gas-containing hydrates. In the same year, Mohammad Akbari, Afshin, et al. [21] used the temperature, pressure, and natural-gas composition as input variables and combined radial basis function (RBF) neural networks with genetic algorithms (GAs) to predict the conditions for natural gas hydrate formation at low temperatures. However, this method requires the number of hidden-layer nodes and the weights and thresholds to be determined, and it is difficult to identify the optimal network structure parameters [22-24].
In this study, a GA was used to optimise the process parameters of an SVM, and a new hybrid model (GA-SVM) was developed for predicting the conditions of natural gas hydrate formation. The model takes the relative molecular weight of the natural gas and the hydrate formation pressure as input variables and outputs the hydrate formation temperature. Through example calculations, the GA-SVM model was compared with nine commonly used models. Using the leverage method, outlier detection was performed for 688 data points. This study provides a comprehensive method for the accurate prediction of the gas hydrate formation conditions.

Theoretical Basis
The support vector machine (SVM) was originally a binary classification model whose purpose is to find a hyperplane that separates the samples. The separation principle is to maximise the margin, which ultimately reduces to solving a convex quadratic programming problem [25]. Rather than relying on the general rule of empirical risk minimisation, the SVM employs structural risk minimisation, which significantly improves its performance on multivariate nonlinear regression problems. Combined with advanced optimisation algorithms (e.g., the genetic algorithm, particle swarm algorithm, or grey wolf algorithm), it avoids the local-optimum and non-convergence problems that readily occur in other machine-learning algorithms, such as the classic backpropagation neural network, and it copes with conditions that other algorithms handle poorly, e.g., small sample sizes, rough data, and high data volatility [26].
The basic idea of the SVM nonlinear regression model is to transform a nonlinear regression problem in a low-dimensional space into a linear regression problem in a high-dimensional feature space through nonlinear mapping [27]. In this way, the linear regression can be better fitted through a reasonable feature surface in high-dimensional space. Accordingly, the algorithm of the SVM dual optimisation problem is as follows.
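Given a dataset {(x_i, y_i), i = 1, 2, ..., N}, its basic regression function is
$$f(x) = w \cdot \varphi(x) + b \tag{1}$$
The structural risk function used in SVM regression is
$$R(w) = \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N} L_{\varepsilon}\bigl(y_i, f(x_i)\bigr) \tag{2}$$
The loss function in Equation (2) is
$$L_{\varepsilon}\bigl(y, f(x)\bigr) = \begin{cases} 0, & |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon, & |y - f(x)| > \varepsilon \end{cases} \tag{3}$$
By substituting Equation (3) into Equation (2) and introducing the relaxation variables ξ and ξ*, the objective function can be obtained:
$$\min_{w,\,b,\,\xi,\,\xi^{*}} \ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N} \bigl(\xi_i + \xi_i^{*}\bigr) \tag{4}$$
$$\text{s.t.}\quad y_i - w \cdot \varphi(x_i) - b \le \varepsilon + \xi_i, \qquad w \cdot \varphi(x_i) + b - y_i \le \varepsilon + \xi_i^{*}, \qquad \xi_i,\ \xi_i^{*} \ge 0 \tag{5}$$
The Lagrange function is introduced for Equation (4), and dual processing is performed, which yields the following.
$$\max_{\alpha,\,\alpha^{*}} \ -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i - \alpha_i^{*})(\alpha_j - \alpha_j^{*})\, \varphi(x_i) \cdot \varphi(x_j) - \varepsilon \sum_{i=1}^{N} (\alpha_i + \alpha_i^{*}) + \sum_{i=1}^{N} y_i (\alpha_i - \alpha_i^{*}) \tag{6}$$
$$\text{s.t.}\quad \sum_{i=1}^{N} (\alpha_i - \alpha_i^{*}) = 0, \qquad 0 \le \alpha_i,\ \alpha_i^{*} \le C \tag{7}$$
Using Equations (6) and (7), the SVM regression model can be obtained.
$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^{*})\, \varphi(x_i) \cdot \varphi(x) + b \tag{8}$$
The inner product φ(x_i)·φ(x) is replaced by a kernel function K(x_i, x). There are four common types of kernel function: the linear kernel, the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel. Among them, the RBF kernel generally performs better and is the most widely applied. Therefore, in this paper, the inner product is replaced by the RBF kernel, whose expression is as follows:
$$K(x_i, x) = \exp\bigl(-\gamma \lVert x_i - x \rVert^{2}\bigr) \tag{9}$$
Eventually, the SVM regression model becomes the following.
$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^{*}) \exp\bigl(-\gamma \lVert x_i - x \rVert^{2}\bigr) + b \tag{10}$$
In Equations (1)-(10), w and b represent the weight vector and the bias, φ(·) represents the nonlinear mapping, ε represents the loss factor, ξ and ξ* represent the relaxation variables, C represents the penalty factor, α_i and α_i* represent the Lagrange multiplier pairs for each sample, and γ represents the width parameter of the kernel function.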
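As a minimal sketch of the pieces described above, the ε-insensitive loss, the RBF kernel, and the final kernel-expansion form of the regression model can be written in a few lines of Python. The function names, the support vectors, and the coefficient values are illustrative assumptions only; they are not fitted to the data of this study.

```python
import math


def eps_insensitive_loss(y_true, y_pred, eps):
    """Zero inside the eps-tube around the target, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)


def rbf_kernel(x_i, x, gamma):
    """RBF kernel: exp(-gamma * ||x_i - x||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, support_vectors, dual_coefs, b, gamma):
    """Kernel expansion: sum_i (alpha_i - alpha_i*) * K(x_i, x) + b.

    dual_coefs[i] holds the difference alpha_i - alpha_i* for support vector i.
    """
    expansion = sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )
    return expansion + b


# Hypothetical, hand-picked values purely for illustration.
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
dual_coefs = [0.5, -0.25]
print(svr_predict((0.0, 0.0), support_vectors, dual_coefs, b=0.1, gamma=0.5))
```

In practice the multipliers and the bias come from solving the dual problem; here they are simply assumed so that the prediction step can be followed by hand.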
The GA is an advanced parameter-optimisation algorithm. The basic idea is to simulate the evolutionary law of biological competition, i.e., survival of the fittest, in the biological world [28,29]. The specific steps are as follows.
(1) The individuals of the initial population are randomly encoded as binary chromosomes, which forms genotype strings that mimic the chromosomes of biological genes; each string is one individual in the biological population.
(2) On the basis of step (1), N initial genotype strings are randomly generated to form an initial population containing N individuals.
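The encoding and initialisation steps above can be sketched together with the remaining stages of a conventional GA loop. Only steps (1) and (2) come from the text; the tournament selection, single-point crossover, bit-flip mutation, elitism, and all parameter values below are assumptions chosen to give a small runnable illustration, not the configuration used in this study.

```python
import random


def init_population(pop_size, n_bits, rng):
    """Steps (1)-(2): random binary strings form the initial population."""
    return [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]


def tournament_select(population, scores, rng, k=3):
    """Assumed selection rule: the fittest of k randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: scores[i])]


def crossover(parent_a, parent_b, rng):
    """Assumed operator: single-point crossover."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]


def mutate(chromosome, rate, rng):
    """Assumed operator: flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]


def run_ga(fitness, n_bits=16, pop_size=20, generations=40,
           mutation_rate=0.02, seed=0):
    """Maximise `fitness` over binary chromosomes; returns the best one found."""
    rng = random.Random(seed)
    population = init_population(pop_size, n_bits, rng)
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        children = []
        for _ in range(pop_size - 1):
            child = crossover(tournament_select(population, scores, rng),
                              tournament_select(population, scores, rng), rng)
            children.append(mutate(child, mutation_rate, rng))
        population = children + [best]  # elitism: carry over the best so far
        best = max(population, key=fitness)
    return best


# Toy fitness (number of 1-bits) standing in for a model-accuracy score.
best = run_ga(fitness=sum)
print(sum(best), "of", len(best), "bits set")
```

Because the best individual is always carried into the next generation, the best fitness can never decrease from one generation to the next.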

Dataset and Variable Selection
In developing the GA-SVM model, dataset diversity, variable selection, and parameter optimisation were the main considerations. To this end, 10 representative natural-gas samples were selected, and 688 hydrate formation data points for these gases in pure water were used as experimental data [30,31]. Some of the data come from our own experiments and cannot be made public for confidentiality reasons. The dataset was randomly divided into two groups: 457 training data points and 231 prediction data points (a ratio of approximately 2:1). To keep the model inputs simple, the gas-sample composition was not selected as an input variable. Instead, each gas sample was considered as a whole and characterised by its relative molecular weight. The input variables of the model were therefore the relative molecular weight of the natural gas and the hydrate formation pressure, and the output variable was the hydrate formation temperature. The variable ranges are presented in Table 1. As indicated by Table 1, the variables differ greatly in magnitude. To eliminate the effect of these scale differences on the model results, the 'mapminmax' function in MATLAB was used to normalise and de-normalise the data. The 'mapminmax' function is shown below.
where x represents the variable, x̄ represents the normalised variable value, x_max represents the maximum value of the variable, and x_min represents the minimum value of the variable.
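A minimal Python sketch of min-max normalisation and its inverse is given here for illustration. It assumes the default output range of MATLAB's 'mapminmax', which is [-1, 1]; the range actually used in this study may differ. The function names are hypothetical, and the two end values are the pressure limits quoted for the dataset.

```python
def fit_minmax(values, y_min=-1.0, y_max=1.0):
    """Store the scaling settings; [-1, 1] is the assumed output range."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("cannot scale a constant variable")
    return (x_min, x_max, y_min, y_max)


def normalise(x, settings):
    """Map x from [x_min, x_max] onto [y_min, y_max]."""
    x_min, x_max, y_min, y_max = settings
    return (y_max - y_min) * (x - x_min) / (x_max - x_min) + y_min


def denormalise(x_bar, settings):
    """Inverse mapping from the normalised value back to physical units."""
    x_min, x_max, y_min, y_max = settings
    return (x_bar - y_min) * (x_max - x_min) / (y_max - y_min) + x_min


# Pressure limits of the dataset in kPa; the middle value is made up.
pressures = [367.65, 5000.0, 33948.90]
settings = fit_minmax(pressures)
scaled = [normalise(p, settings) for p in pressures]
restored = [denormalise(s, settings) for s in scaled]
print(scaled)
print(restored)
```

The settings fitted on the training data would normally be reused unchanged for the prediction data and for de-normalising the model output.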

Model Parameter Optimisation
Following the GA parameter-optimisation process shown in the flowchart in Figure 1, combined with the 'trial calculation-control error' method, three main parameters were optimised: the penalty factor C, the width parameter γ of the kernel function, and the loss factor ε. Among them, the penalty factor C has the greatest influence on the accuracy of the ε-SVM regression model. If C is too large, the fitting accuracy for the training samples is very high, but the generalisation ability of the ε-SVM regression model is very poor, i.e., over-learning (overfitting) occurs. Conversely, if C is too small, the optimisation process takes a long time, the search is incomplete, the training samples are fitted very poorly, and the generalisation ability of the ε-SVM regression model is very low, i.e., serious under-learning (underfitting) occurs. Therefore, provided that the prediction accuracy is not affected, C should be as small as possible to ensure that the SVM regression model has both good generalisation ability and good prediction accuracy. After several optimisation calculations, the optimal GA-SVM model parameters were determined, as shown in Table 2.
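One way a binary chromosome could carry the three parameters is sketched below: the string is split into three genes, and each gene is mapped onto a log-scaled search interval. The bounds, the 10 bits per parameter, and the function names are hypothetical choices for illustration; the text does not state the search ranges or the encoding length used.

```python
# Hypothetical search intervals (lower, upper); not taken from the study.
HYPOTHETICAL_BOUNDS = {
    "C": (0.1, 1000.0),
    "gamma": (0.001, 10.0),
    "epsilon": (0.0001, 0.1),
}


def bits_to_unit(bits):
    """Map a binary gene to a fraction in [0, 1]."""
    value = int("".join(str(b) for b in bits), 2)
    return value / (2 ** len(bits) - 1)


def decode(chromosome, bounds, bits_per_param=10):
    """Split the chromosome into genes and scale each one logarithmically."""
    if len(chromosome) != bits_per_param * len(bounds):
        raise ValueError("chromosome length does not match the parameter count")
    params = {}
    for idx, (name, (low, high)) in enumerate(bounds.items()):
        gene = chromosome[idx * bits_per_param:(idx + 1) * bits_per_param]
        fraction = bits_to_unit(gene)
        # Log-scale interpolation: equal steps in the gene change the
        # parameter by equal ratios, which suits C, gamma, and epsilon.
        params[name] = low * (high / low) ** fraction
    return params


# All-zero genes give the lower bounds, all-one genes the upper bounds.
print(decode([0] * 30, HYPOTHETICAL_BOUNDS))
print(decode([1] * 30, HYPOTHETICAL_BOUNDS))
```

A fitness function for the GA would then train an SVM regression model with the decoded parameters and score it, for example by its error on held-out data; that step is omitted here.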

Analysis of Results
The results of the GA-SVM model training and prediction are presented in Figures 2 and 3, respectively.
As shown in Figures 2 and 3, the GA-SVM model exhibited good training and prediction results. The predicted and experimental values are evenly distributed around the 45° line, and the correlation coefficient (R²) is greater than 0.99. Thus, the model has good calculation accuracy and generalisation ability for predicting the natural gas hydrate formation conditions. The average absolute relative deviation (AARD) is defined as follows.
$$\mathrm{AARD} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i^{\mathrm{exp}} - y_i^{\mathrm{pred}}}{y_i^{\mathrm{exp}}} \right|$$
where y_i^exp and y_i^pred represent the experimental value of the i-th data point and the value predicted by the SVM model, respectively, and n represents the total number of data points.
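The AARD can be computed directly from paired experimental and predicted values. The short Python sketch below uses a hypothetical function name and two made-up temperature pairs purely to show the arithmetic.

```python
def aard_percent(y_exp, y_pred):
    """Average absolute relative deviation, in percent."""
    if not y_exp or len(y_exp) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((e - p) / e) for e, p in zip(y_exp, y_pred))
    return 100.0 * total / len(y_exp)


# Made-up temperatures in K: each prediction deviates by 0.1% of the
# experimental value, so the AARD is 0.1%.
y_exp = [280.0, 290.0]
y_pred = [280.28, 289.71]
print(aard_percent(y_exp, y_pred))
```

Because each deviation is divided by the experimental value, the AARD for temperature depends on the temperature unit, so values are only comparable between models evaluated in the same unit.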
As shown in Figure 4, among the 10 models for calculating the gas hydrate formation conditions, the GA-SVM model had the highest prediction accuracy, with an AARD of only 0.04%. For the Ghayyem, Ann2015, Zahedi (3), Zahedi (4), and Towler models, the AARDs were 0.14%, 0.46%, 0.56%, 0.56%, and 0.83%, respectively, and the calculation accuracy was relatively high. For the Berg model, the AARD was as high as 2.36%, and the calculation accuracy was poor.

Outlier Detection
Artificial-intelligence algorithms have seen rapid development in recent years. Mohammadi et al. [38] proposed a technique that efficiently detects the abnormal points of a model by combining the Rousseeuw outlier detection theory with the Williams plot. The basic idea is to compute the H value (the diagonal of the hat matrix) and the standardised residual between the experimental data and the values calculated by the model, and then to plot the H value as the abscissa against the standardised residual as the ordinate. After a leverage limit is set, the points that fall outside the valid region are identified as abnormal points. The specific steps of the algorithm are as follows.
First, the hat matrix is defined below.
$$H = X \left( X^{T} X \right)^{-1} X^{T}$$
where X represents an n × 2 matrix, which is composed of an SVM model predictor column vector and an input variable number vector. The leverage limit H* of the H value is set as follows.
$$H^{*} = \frac{3(m + 1)}{n}$$
where m represents the number of input variables, and n represents the number of experimental data points. The effective range of the H value is 0 ≤ H ≤ H*. The standardised residual (SR) of the SVM model is defined as follows.
where H_ii represents the diagonal element of the hat matrix for the i-th data point. For the standardised residual (SR), the effective range is generally selected as −3 ≤ SR ≤ 3.
The H value and SR for the 10 models were calculated via the foregoing method. A Williams plot was constructed with the H value as the abscissa and the standardised residual (SR) as the ordinate, as shown in Figure 5. The number of abnormal points for each model is shown in Figure 6.
As shown in Figures 5 and 6, the GA-SVM model had the smallest number of abnormal points among the 688 data points (only two), which indicates that this model had good stability for predicting the natural gas hydrate formation conditions. The Ghayyem model had 20 abnormal points. A larger number of abnormal points indicates greater instability of the model in calculating the natural gas hydrate formation conditions.
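The leverage-based screening can be sketched in plain Python as follows. Two points are assumptions rather than details confirmed by the text: each row of X is taken to hold the two (normalised) input variables of one data point, and the standardised residual is taken in the common form e_i / sqrt(MSE · (1 − H_ii)). All names and the sample values are hypothetical.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def leverages(X):
    """Diagonal elements H_ii of the hat matrix H = X (X^T X)^-1 X^T."""
    xtx_inv = invert(matmul(transpose(X), X))
    h = []
    for row in X:
        tmp = [sum(row[k] * xtx_inv[k][j] for k in range(len(row)))
               for j in range(len(row))]
        h.append(sum(t * r for t, r in zip(tmp, row)))
    return h


def standardised_residuals(y_exp, y_pred, h):
    """Assumed form: e_i / sqrt(MSE * (1 - H_ii)); needs a non-zero MSE."""
    residuals = [e - p for e, p in zip(y_exp, y_pred)]
    mse = sum(r * r for r in residuals) / len(residuals)
    return [r / math.sqrt(mse * (1.0 - hi)) for r, hi in zip(residuals, h)]


def flag_outliers(X, y_exp, y_pred, m):
    """Indices with H_ii above H* = 3(m + 1)/n or |SR| above 3."""
    n = len(X)
    h_star = 3.0 * (m + 1) / n
    h = leverages(X)
    sr = standardised_residuals(y_exp, y_pred, h)
    return [i for i in range(n) if h[i] > h_star or abs(sr[i]) > 3.0]


# Made-up normalised inputs (M, P) and temperatures in K, for illustration.
X = [[-1.0, -0.8], [-0.5, 0.1], [0.0, -0.3], [0.3, 0.9], [0.7, 0.2], [1.0, -0.6]]
y_exp = [275.0, 280.0, 283.0, 288.0, 291.0, 295.0]
y_pred = [275.1, 279.8, 283.2, 287.9, 291.1, 294.7]
print(flag_outliers(X, y_exp, y_pred, m=2))
```

A useful sanity check on the leverages is that they sum to the number of columns of X when X has full column rank.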

Conclusions
By combining a GA, the outlier detection method, and an SVM, a new model for predicting the conditions of natural gas hydrate formation was developed. Comparisons with nine other models and analysis of the outlier detection revealed that the GA-SVM model had the smallest average absolute relative deviation (0.04%) in the gas relative molecular weight range of 15.64-28.97 g/mol and the natural gas pressure range of 367.65-33,948.90 kPa. In addition, abnormal-point detection based on the leverage method was carried out on the data of all 10 models. Among the 688 data points, the GA-SVM model had the fewest abnormal points (only two), which further shows that the GA-SVM model is stable in calculating hydrate formation conditions.