Fire Prediction Based on CatBoost Algorithm

In recent years, increasingly severe wildfires have posed a significant threat to the safe and stable operation of transmission lines, and wildfire risk assessment and early warning have become an important research topic in power grid risk assessment. This study proposes a fire prediction model based on the CatBoost algorithm to predict fire points. Five groups of wildfire risk factors (vegetation factors, meteorological factors, human factors, terrain factors, and land surface temperature) are combined. Feature selection based on the gradient boosting decision tree model, followed by principal component analysis, reduces the dimensionality of redundant data before the fire prediction model is built. The MODIS fire point product is used as the model evaluation data, and the AUC value is used as the evaluation metric. The accuracy of the model is 0.82, and the AUC value is 0.83. The predicted fire points are in good agreement with the actual fire points. The results show that this model can effectively predict the risk of wildfires.


Introduction
Mountain fires are a major factor that damages forest ecosystems and affects the safe and stable operation of the power grid [1,2]. In recent years, mountain fires have accounted for 60% of all the emergencies that disrupted the stable operation of the power grid [2]. Statistical analysis over the years shows that automatic reclosing fails for most transmission line trips caused by wildfires, which seriously affects the quality of life in the area and causes substantial economic losses to the relevant departments.
Most regions in southern China are located in forest areas with dense forests, complex terrain, and a dry climate. These conditions provide the material basis for mountain fires, which occur frequently and pose a considerable threat to the safe and stable operation of the power grid [3,4]. Mountain fires have become an important factor affecting the safe and stable operation of the power supply system. Therefore, effectively predicting the future fire risk of woodland, grassland, and cultivated land and issuing corresponding warnings are of considerable significance for maintaining the stable operation of the power grid [5].
Research on wildfire risk at home and abroad follows two main directions: assessing wildfire risk purely from meteorological data, and classifying wildfire risk levels by combining the tripping mechanism of transmission lines with vegetation factors and human factors. At present, meteorological and forestry departments mainly assess the risk of wildfires from a meteorological perspective [6]. In 1995, Wang et al. [7] proposed a forest fire risk assessment technique based on meteorological elements such as temperature, humidity, precipitation, and wind speed, but it is only suitable for large-area forest fire risk forecasting. Literature [8] built a graph-model-based wildfire risk prediction model for overhead transmission lines from meteorological factors combined with surface combustion factors and historical fire factors. This method has been effectively applied to a southern power grid. Literature [9] uses forest fire danger meteorological grades to assess the probability of wildfires and establishes a risk assessment model for transmission lines with temporal and spatial distribution characteristics. Literature [10] established a risk assessment model from two aspects: the risk of wildfire disasters and the vulnerability of transmission lines. Literature [11] related the normalized differential vegetation index (NDVI), satellite remote sensing fire points, rainfall, and other factors to the occurrence of wildfires on transmission lines and proposed a wildfire risk assessment model for transmission lines, but it supports only monthly risk assessment. In fact, fires are closely related to human activities. Literature [11] proposed a fire prediction model that combines meteorological data and human activities. The model has been applied in areas with severe fire disasters and shows good prediction accuracy.
Literature [12], based on historical meteorological data, vegetation data, and terrain data, used the partial least squares (PLS) method to select the main wildfire forecasting factors and established an optimized power grid wildfire risk early warning model. Literature [13] designed a forest fire early warning model based on mobile edge computing (MEC) that acquires ground surface parameters and can be used to predict wildfires effectively.
To combine meteorological data and human factors more fully, this study uses the MODIS fire point data of a southern province from 2015 to 2019, together with meteorological data, terrain data, land surface temperature (LST), human factors, and vegetation factors, to analyze the influencing factors of mountain fire disasters and to establish a CatBoost model that predicts fire points. Effective prediction and early warning of fire points are important for reducing the losses caused by wildfire disasters.

Analysis and Data Acquisition of Influencing Factors of Mountain Fire Disasters
The occurrence of mountain fire disasters is affected by a variety of factors. According to the relevant literature and research on the principles of mountain fires [14], the occurrence of mountain fire disasters is not random but follows identifiable patterns. This article divides the factors that affect the occurrence of wildfires into five aspects: vegetation factors, human factors, surface temperature, terrain factors, and meteorological factors. This research aims to realize large-scale wildfire assessment through multisource remote sensing data combined with meteorological data. The specific factors within each of the five aspects are as follows.

Vegetation Factors.
Vegetation is the material basis for the occurrence of wildfire disasters. In this study, the influence of vegetation on wildfire disasters is represented by the normalized difference infrared index 7 (NDII7) and the normalized differential vegetation index (NDVI). The NDII7 is a critical wildfire risk assessment factor. Qin [15] showed that the NDII7 can characterize the fuel moisture content of vegetation and can therefore be used to evaluate mountain fire risk. The NDVI is used as a criterion for judging surface vegetation and for estimating the growth status and density of vegetation, both of which are closely related to the occurrence of mountain fire disasters. Wang et al. [14] detected wildfire events and estimated the burned area from the change in vegetation NDVI between adjacent time points. The NDVI comes from the MOD13A1 vegetation information product of MODIS provided by the NASA website (https://ladsweb.modaps.eosdis.nasa.gov/), with a spatial resolution of 1000 m. The global NDVI information is updated every 16 days. The NDII7 is derived from the MOD09A1 product provided on the same website. The temporal resolution of this product is 8 days. After the product is obtained, NDII7 is calculated according to formula (1) given by Qin [15].
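As a minimal sketch of how the two vegetation indices can be computed from surface reflectance, the snippet below assumes the commonly used band-ratio definitions, NDVI = (NIR - Red)/(NIR + Red) and NDII7 = (NIR - SWIR7)/(NIR + SWIR7), with MODIS band 1 as red, band 2 as near infrared, and band 7 as shortwave infrared. Formula (1) of Qin [15] is not reproduced in this text, so the band assignment, the function names, and the reflectance values are assumptions for illustration only.

```python
import numpy as np

def normalized_difference(a, b):
    """Return (a - b) / (a + b), with NaN where the denominator is zero."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = a + b
    out = np.full(denom.shape, np.nan)
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out

def ndvi(red, nir):
    # Assumed definition: (NIR - Red) / (NIR + Red)
    return normalized_difference(nir, red)

def ndii7(nir, swir7):
    # Assumed definition: (NIR - SWIR band 7) / (NIR + SWIR band 7)
    return normalized_difference(nir, swir7)

# Hypothetical reflectance values for three pixels (not from the study).
red = np.array([0.05, 0.10, 0.0])
nir = np.array([0.45, 0.30, 0.0])
swir7 = np.array([0.15, 0.20, 0.0])
print(ndvi(red, nir))
print(ndii7(nir, swir7))
```

The third pixel has zero reflectance in every band, so both indices are undefined there and are reported as NaN rather than raising a division error.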

Human Factors.
The occurrence of wildfire disasters is highly correlated with the times at which people are most active. Statistics show that, from January to April, the occurrence of wildfire disasters rises significantly on Fridays and between 13:00 and 16:00 each day [2]. Human factors carry considerable uncertainty. This study uses land type, distance from roads, and distance from cultivated land as the human-related influencing factors of wildfire disasters. These data directly reflect the presence of human activities and therefore indirectly capture the impact of humans on fire. Land types are classified into cultivated land, forest land, grassland, water area, residential land, and unused land according to the 30 m global land cover classification data.

LST.
Surface temperature affects the occurrence of forest fires because it indirectly affects the moisture content of vegetation fuels. In areas with relatively dense vegetation, the surface temperature is low and surface evaporation is relatively small, so the moisture content of the fuels is high and mountain fire disasters are less likely to occur [16]. By contrast, a high surface temperature makes mountain fire disasters more likely. The LST data come from the MOD11A1 product, with a spatial resolution of 1000 m and a temporal resolution of 1 day.

Terrain Factors.
Elevation, slope, and aspect are static variables, and many researchers classify them as fundamental factors leading to wildfire disasters. Terrain relief leads to differences in vegetation coverage and meteorological conditions, including rainfall, water content, vegetation density, vegetation type, and growth conditions; thus, the probability of wildfire disasters varies accordingly. The spatial resolution of the terrain data is 30 m. Currently, the NASA website (https://ladsweb.modaps.eosdis.nasa.gov/) provides SRTM digital elevation data at 30 m resolution for download.

Meteorological Data.
The probability of mountain fire disasters is highly correlated with meteorological factors. Meteorological factors, such as rainfall, average relative humidity, maximum temperature, average temperature, minimum temperature, maximum wind speed, and the direction of the maximum wind [15], have a significant influence on the occurrence of wildfire disasters. The meteorological data come from the China Meteorological Data Network (http://data.cma.cn/), specifically the cumulative annual value data set of China for 2015-2019.

Fire Point Information Extraction.
The fire point data come from the MODIS C6 fire point product (2015-2019) provided by https://firms.modaps.eosdis.nasa.gov/, with a spatial resolution of 500 m. This study extracts the fire point data according to the fire point collection time and the confidence level provided by the product. Detailed MODIS C6 product information is shown in Table 1. To improve the quality of the extracted fire information, this study uses only high-confidence fire data, with a confidence of more than 90%, as the input fire information.

Nonfire Point Information Extraction.
On the basis of the fire point data, this study first uses the semivariogram function [17] to determine a buffer radius of 35 pixels (17,500 m) around each fire point, which eliminates the influence of time, and then extracts nonfire points from the ring buffer (17,500-18,000 m). In this way, all the nonfire point data in a month are obtained. Finally, the daily nonfire point data that correspond to each day's fire point data are extracted from the monthly nonfire point data.
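The ring-buffer extraction can be sketched as follows. This is only an illustration under stated assumptions: coordinates are taken to be in a projected system with units of meters, candidate points are drawn uniformly by area within the ring, and candidates that fall inside the buffer of any other fire point are discarded. The text does not say how points inside the ring are chosen, so the sampling rule, the function names, and the fire coordinates are hypothetical.

```python
import math
import random

INNER_M = 17_500.0   # buffer radius (35 pixels) stated in the text
OUTER_M = 18_000.0   # outer edge of the ring buffer stated in the text

def sample_ring_point(fire_x, fire_y, rng, inner=INNER_M, outer=OUTER_M):
    """Draw one point uniformly (by area) from the ring around a fire point."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    # Sampling r^2 uniformly gives a uniform density over the ring's area.
    r = math.sqrt(rng.uniform(inner ** 2, outer ** 2))
    return fire_x + r * math.cos(theta), fire_y + r * math.sin(theta)

def in_any_buffer(x, y, fires, radius=INNER_M):
    """True if (x, y) lies within the buffer radius of any fire point."""
    return any(math.hypot(x - fx, y - fy) < radius for fx, fy in fires)

rng = random.Random(0)
fires = [(0.0, 0.0), (20_000.0, 0.0)]  # hypothetical projected coordinates (m)
candidates = [sample_ring_point(fx, fy, rng) for fx, fy in fires for _ in range(50)]
# Keep only candidates that are outside the buffer of every fire point.
nonfire = [p for p in candidates if not in_any_buffer(p[0], p[1], fires)]
print(len(candidates), len(nonfire))
```

Filtering against all fire points matters when buffers overlap: a point in the ring of one fire can still lie well inside the buffer of a neighboring fire.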

Spatial Interpolation of Meteorological Data.
The meteorological data downloaded from the China Meteorological Data Network are recorded at individual meteorological stations and are therefore spatially discrete. They need to be spatially interpolated to obtain continuous meteorological data over the study area.
This study uses the Anusplin software to interpolate the meteorological data, with good results. Qian et al. [18] compared the interpolation accuracy of Anusplin with that of Ordinary Kriging and inverse distance weighting and found that the interpolation error of Anusplin is the smallest. The interpolation is based mainly on ordinary and local thin plate spline functions. The main advantage of this method is that it allows multiple influencing factors to be introduced as covariates. This study introduces elevation data as a covariate to reduce the influence of elevation on the interpolated temperature data.
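Anusplin is a standalone package and is not reproduced here. For orientation only, the sketch below implements inverse distance weighting, one of the baseline methods in the comparison by Qian et al. [18], to show what interpolating discrete station values to an arbitrary location involves. It is not the thin plate spline method used in this study and it ignores the elevation covariate; the station coordinates, values, and power parameter are hypothetical.

```python
import math

def idw(stations, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    stations: list of (sx, sy, value) tuples in a projected coordinate system.
    """
    num = 0.0
    den = 0.0
    for sx, sy, value in stations:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return value  # target coincides with a station
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Hypothetical stations: (x, y, daily maximum temperature in degrees C).
stations = [(0.0, 0.0, 20.0), (10.0, 0.0, 24.0), (0.0, 10.0, 18.0)]
print(idw(stations, 5.0, 0.0))
```

Because the estimate is a weighted average, it always lies between the smallest and largest station values, which is one reason covariate-aware spline methods can do better in mountainous terrain.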

Data Undersampling.
This study uses the ensemble resampling algorithm [15] to undersample the data so that the training samples are balanced, that is, the proportions of fire point samples and nonfire point samples are the same. This algorithm mitigates the loss of data that ordinary undersampling causes: it builds an ensemble of several models, each of which is trained on its own undersampled subset. The undersampling results of the multiple models are then integrated, so the overall data distribution is not changed. The sampling effect is better than that of many current oversampling and undersampling techniques.
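The idea of ensemble undersampling can be sketched as below. The details of the algorithm cited as [15] are not given in this text, so the sketch follows a generic scheme in which every subset keeps all fire samples and draws an equally sized random sample of nonfire samples. It assumes nonfire is the majority class; the function name and sample identifiers are hypothetical.

```python
import random

def balanced_subsets(fire_ids, nonfire_ids, n_subsets, seed=0):
    """Return n_subsets pairs (fire_ids, sampled_nonfire_ids) of equal size.

    Each subset keeps every fire sample and draws the same number of
    nonfire samples without replacement.
    """
    fire_ids = list(fire_ids)
    nonfire_ids = list(nonfire_ids)
    if len(nonfire_ids) < len(fire_ids):
        raise ValueError("this sketch assumes nonfire is the majority class")
    rng = random.Random(seed)
    return [(fire_ids, rng.sample(nonfire_ids, len(fire_ids)))
            for _ in range(n_subsets)]

fire = list(range(5))              # hypothetical fire sample ids
nonfire = list(range(100, 140))    # hypothetical nonfire sample ids
subsets = balanced_subsets(fire, nonfire, n_subsets=4)
covered = set().union(*(set(nf) for _, nf in subsets))
print(len(covered), "distinct nonfire samples used across the ensemble")
```

Training one model per subset and combining their outputs lets the ensemble see more of the majority class than any single balanced subset contains.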

Normalization of Real Factor Data.
Among the influencing factors of mountain fire disasters, some variables are real-valued. Before the CatBoost model is trained, such input data must be normalized so that they are dimensionless. These variables are distance from the road ($x_1$, m), distance from cultivated land ($x_2$, m), land surface temperature ($x_3$), NDVI ($x_4$), NDII7 ($x_5$), DEM ($x_6$, m), precipitation ($x_7$, mm/day), maximum temperature ($x_8$, °C), average relative humidity ($x_9$, %), average temperature ($x_{10}$, °C), lowest temperature ($x_{11}$, °C), and maximum wind speed ($x_{12}$, m/s).
These input variables are normalized to zero mean. The advantage of this method is that a small number of outliers do not significantly affect the mean, so their effect on the variance is also small. Z-score normalization is also called standardization. This method maps the data to a distribution with a mean of zero and a standard deviation of one. For the above $x_i$, formula (2) is used to standardize the data, and the resulting variable $x_i'$ is used as the input data of the model:

$$x_i' = \frac{x_i - \text{mean}}{\sigma} \qquad (2)$$

where $x_i$ ($i = 1, 2, 3, \ldots, 12$) is the original wildfire disaster impact factor, mean and $\sigma$ are the average value and standard deviation of that factor, and $x_i'$ is the standardized wildfire disaster impact factor.
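A minimal sketch of the z-score standardization described above is given here. The text does not say whether the population or the sample standard deviation is used; the population form is assumed, and the distances are made-up values.

```python
import statistics

def z_score(values):
    """Standardize a list of numbers to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        raise ValueError("a constant feature cannot be standardized")
    return [(v - mean) / sigma for v in values]

distance_to_road_m = [120.0, 450.0, 900.0, 1530.0]  # hypothetical values
scaled = z_score(distance_to_road_m)
print([round(v, 3) for v in scaled])
```

In a train/test setting, the mean and standard deviation would normally be computed on the training data only and then reused for the test data.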

Discrete Factor One-Hot Encoding.
The numerical codes of discrete variables, such as land type, slope, and aspect, carry no quantitative meaning. This study therefore applies one-hot encoding to eliminate any spurious influence of the numerical values. The main advantage of this method is that it handles noncontinuous values easily, and it also expands the model input data to a certain extent.
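One-hot encoding of a discrete factor can be sketched as follows, using land type as the example. The category order and the helper name are illustrative choices, not taken from the study.

```python
def one_hot(values, categories=None):
    """Return (categories, rows) where each row is a 0/1 indicator vector."""
    if categories is None:
        categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

land_types = ["forest land", "cultivated land", "grassland", "forest land"]
cats, encoded = one_hot(land_types)
print(cats)
for row in encoded:
    print(row)
```

Passing an explicit `categories` list keeps the column layout stable when a category happens to be absent from a particular batch of samples.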

Feature Selection Method Based on the Gradient Boosting Decision Tree (GBDT) Model.
Features must be selected because this study involves a large number of variables, some of which have little effect on the occurrence of wildfires. Feature selection is the process of choosing the factors that are highly correlated with the occurrence of fires. The feature selection method based on the GBDT model is a commonly used tree-based feature selection method. Its principle is to use the node impurities in each decision tree to calculate the importance of the features. The final feature importance is the average of the feature importance over all decision trees. In this study, cross-validation is used to select the factors whose feature importance is greater than 0.3. Then, dimensionality reduction is performed by principal component analysis (PCA). The ranking of the importance index of the wildfire impact factors is shown in Table 2.
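The averaging and thresholding step can be sketched as below. The per-tree importance values are invented for illustration, and the scaling of the importance index used in this study is not stated, so only the mechanics (average over trees, keep features above 0.3) follow the text.

```python
def mean_importance(per_tree_importance):
    """Average feature importances over trees (one row per tree)."""
    n_trees = len(per_tree_importance)
    n_features = len(per_tree_importance[0])
    return [sum(tree[j] for tree in per_tree_importance) / n_trees
            for j in range(n_features)]

def select_features(names, importance, threshold=0.3):
    """Keep the features whose averaged importance exceeds the threshold."""
    return [n for n, imp in zip(names, importance) if imp > threshold]

names = ["NDVI", "NDII7", "LST"]
per_tree = [            # hypothetical impurity-based importances per tree
    [0.5, 0.2, 0.3],
    [0.7, 0.1, 0.2],
]
importance = mean_importance(per_tree)
print(select_features(names, importance))
```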

PCA: Principal Component Analysis.
Among the influencing factors of mountain fire disasters, correlations exist between elevation, slope, aspect, maximum temperature, average temperature, minimum temperature, and surface temperature. This study uses PCA, a widely used linear dimensionality reduction algorithm, to reduce the dimensionality of the influencing factors of wildfire disasters and eliminate redundant data. The advantage of this algorithm is that it preserves most of the information in the original samples while compressing the model training data as much as possible and identifying the principal components used for model training. The mathematical model of the PCA algorithm in this study is as follows.
$X = \{x_1, x_2, \ldots, x_m\}$ is the set of wildfire disaster impact factors, where $m$, the dimension of $X$, is the number of impact factors. The projection of $x_i$ onto the hyperplane in the new space is $W^{T} x_i$. The principle is to maximize the variance of the projected sample points so that the projections of the samples are separated as much as possible, which leads to an objective in terms of $XX^{T}$:

$$\max_{W} \ \operatorname{tr}\left(W^{T} X X^{T} W\right) \quad \text{s.t.} \quad W^{T} W = I.$$

After the matrix $XX^{T}$ is eigendecomposed, the eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_m$ are obtained, and the eigenvectors corresponding to the first $i$ eigenvalues, $W = (W_1, W_2, \ldots, W_i)$, are the required principal components of the wildfire disaster impact factors. This paper retains 99% of the main information of the original features. The dimension of the principal components is 18. Compared with the feature selection based on the GBDT model, the feature dimension is reduced by 13.
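A minimal NumPy sketch of PCA with a 99% retained-variance criterion follows. It assumes that "retains 99% of the main information" means the cumulative explained-variance ratio, and it stores samples in rows (the transpose of the $XX^{T}$ convention in the text). The synthetic data and the function name are illustrative.

```python
import numpy as np

def pca_fit(X, retained=0.99):
    """Fit PCA on X (n_samples, n_features).

    Returns (mean, W), where the columns of W are the leading eigenvectors
    needed to reach the requested cumulative variance ratio.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]             # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(ratio, retained)) + 1, len(eigvals))
    return mean, eigvecs[:, :k]

# Synthetic example: the third feature is almost the sum of the first two,
# so two components should carry nearly all of the variance.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))
X = np.column_stack([z[:, 0], z[:, 1],
                     z[:, 0] + z[:, 1] + 1e-3 * rng.normal(size=200)])
mean, W = pca_fit(X, retained=0.99)
Z = (X - mean) @ W   # projected data
print(W.shape, Z.shape)
```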

CatBoost Model.
CatBoost is an algorithm that combines GBDT with the handling of categorical features; its name joins "categorical" and "boosting". It is an improved implementation under the framework of the GBDT algorithm: a GBDT framework based on oblivious trees that has few parameters, supports categorical variables, and achieves high accuracy. Its main purpose is to deal with categorical features efficiently and reasonably. It also addresses the gradient bias and prediction shift problems, thereby improving the generalization ability and robustness of the algorithm [19,20]. This study considers many categorical features, such as rainfall, wind direction, slope direction, and land type, and CatBoost can process such nonnumerical features quickly. When the CatBoost algorithm processes categorical features, it takes the whole sample set, randomly permutes it, and, for each feature, identifies the samples that share the same category. When the category of a sample is converted to a number, the average target value of the same-category samples that precede it in the permutation is computed first, and then a prior value and its weight are added [21,22]. The specific formula is as follows:

$$\hat{x}_{k}^{i} = \frac{\sum_{j=1}^{p-1} \left[x_{\sigma_j,k} = x_{\sigma_p,k}\right] Y_{\sigma_j} + a \cdot P}{\sum_{j=1}^{p-1} \left[x_{\sigma_j,k} = x_{\sigma_p,k}\right] + a}$$

where $\sigma$ is the random permutation, $x_{\sigma_j,k}$ is the value of the $k$-th categorical feature for the $j$-th sample in the permutation, $Y_{\sigma_j}$ is the target value of that sample, $[\cdot]$ equals 1 when the condition holds and 0 otherwise, $P$ is the added prior value, and $a$ is a weight coefficient greater than zero. Adding the prior value reduces the noise caused by low-frequency categories, which helps limit overfitting of the model and improves its generalization ability.
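The ordered encoding of a categorical feature can be sketched in a few lines. This is a simplified, single-permutation illustration of the idea, not the CatBoost library's implementation; the parameter names (`prior`, `a`, `order`) and the toy data are assumptions.

```python
import random

def ordered_target_stats(categories, targets, prior, a=1.0, order=None, seed=0):
    """Encode one categorical column with ordered target statistics.

    For each sample, only same-category samples that precede it in the
    permutation contribute, so a sample's own label never leaks into its
    encoding.
    """
    n = len(categories)
    if order is None:
        order = list(range(n))
        random.Random(seed).shuffle(order)
    sums = {}    # running sum of targets per category
    counts = {}  # running number of samples per category
    encoded = [0.0] * n
    for idx in order:
        c = categories[idx]
        encoded[idx] = (sums.get(c, 0.0) + a * prior) / (counts.get(c, 0) + a)
        sums[c] = sums.get(c, 0.0) + targets[idx]
        counts[c] = counts.get(c, 0) + 1
    return encoded

land = ["forest", "forest", "grass", "forest", "grass"]   # toy categories
fire = [1, 0, 0, 1, 1]                                    # toy labels
prior = sum(fire) / len(fire)                             # mean label as prior
print(ordered_target_stats(land, fire, prior, order=[0, 1, 2, 3, 4]))
```

The first sample of each category receives exactly the prior, which shows how the prior stabilizes the encoding of rare categories.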

Model Training.
The MODIS fire point data of a southern province for the five years from 2015 to 2019 and the nonfire point data extracted by the method described in this study are selected as the sample set. Fire point data with a confidence level of less than 90% are eliminated to improve the quality of the fire point samples. After undersampling, normalization, one-hot encoding, feature selection, and PCA dimensionality reduction, the sample data are fed into the CatBoost model for training. Approximately 70% of the data are randomly selected for model training and 30% for model testing. The temporal resolutions of NDII7, NDVI, and land surface temperature among the input feature variables are 8 days, 16 days, and 1 day, respectively. For NDII7, NDVI, and land surface temperature, the data of the time phase immediately before the fire are used as input; otherwise, the vegetation data and land surface temperature would already be affected by the fire and could not serve the purpose of fire prediction. The human factors and terrain factors do not change over time, while the time phase of the weather data is consistent with that of the fire data. Figure 1 shows the time phase relationship of the input feature variables of the CatBoost model, and Figure 2 shows the basic flow chart of model training.
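Choosing the "previous time phase" for each product can be sketched as a date lookup. The sketch assumes each composite is identified by its start date and that a composite qualifies only if its whole period ended before the fire date; this interpretation, the helper name, and the dates are illustrative.

```python
from bisect import bisect_right
from datetime import date, timedelta

def previous_composite(start_dates, period_days, fire_date):
    """Latest composite whose whole period ended before fire_date, or None.

    start_dates must be sorted in ascending order.
    """
    cutoff = fire_date - timedelta(days=period_days)
    i = bisect_right(start_dates, cutoff)
    return start_dates[i - 1] if i > 0 else None

start = date(2019, 1, 1)
ndii7_dates = [start + timedelta(days=8 * k) for k in range(46)]    # 8-day product
ndvi_dates = [start + timedelta(days=16 * k) for k in range(23)]    # 16-day product
lst_dates = [start + timedelta(days=k) for k in range(365)]         # daily product

fire_day = date(2019, 3, 15)  # hypothetical fire date
print(previous_composite(ndii7_dates, 8, fire_day))
print(previous_composite(ndvi_dates, 16, fire_day))
print(previous_composite(lst_dates, 1, fire_day))
```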

Model Optimization.
This study uses grid search combined with tenfold cross-validation to optimize the primary hyperparameters of the CatBoost model, including iterations, learning_rate, max_depth, criterion, and feature importance, to improve the accuracy of fire prediction. Tenfold cross-validation divides the sample data into ten mutually exclusive subsets. Each time, nine subsets are used as training data and the remaining subset is used as test data. Training is repeated so that each subset serves as the test set once, which yields ten test results, and the average of the ten test results is taken as the accuracy of the model. The hyperparameters obtained through grid search can effectively improve the prediction performance of the model [23].
After model optimization, the best hyperparameters of the fire point prediction model are shown in Table 3.
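The combination of grid search and k-fold cross-validation can be sketched independently of any particular model. To keep the example self-contained, a one-parameter threshold classifier stands in for the model; in the setting of this study a CatBoost classifier and its hyperparameter grid would take its place. All names and data here are hypothetical.

```python
import itertools
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k mutually exclusive, shuffled folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(fit_predict, X, y, params, k=10):
    """Mean accuracy over k folds; each fold serves once as the test set."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx], **params)
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

def grid_search(fit_predict, X, y, grid, k=10):
    """Return (best_params, best_score) over the Cartesian product of grid."""
    best_params, best_score = None, -1.0
    keys = sorted(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        score = cross_val_accuracy(fit_predict, X, y, params, k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def threshold_model(X_train, y_train, X_test, threshold):
    """Stand-in model: predicts 1 when the single feature exceeds threshold."""
    return [1 if x > threshold else 0 for x in X_test]

X = [i / 100 for i in range(100)]
y = [1 if x > 0.6 else 0 for x in X]
best_params, best_score = grid_search(
    threshold_model, X, y, {"threshold": [0.2, 0.4, 0.6, 0.8]})
print(best_params, best_score)
```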

Model Evaluation.
This study uses accuracy, precision, recall, F1-score, and the AUC value to evaluate the prediction accuracy of the model comprehensively and to account for the class imbalance in fire point prediction.
The confusion matrix of the fire point and nonfire point data sets in this article is shown in Table 4. The evaluation indices of the fire point prediction model are obtained from the confusion matrix:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\text{precision} = \frac{TP}{TP + FP},$$

$$\text{recall} = \frac{TP}{TP + FN},$$

$$\text{F1-score} = \frac{2\,(\text{precision} \times \text{recall})}{\text{precision} + \text{recall}}.$$

TN: the actual value is the fire point, and it is also predicted as the fire point. FN: the actual value is a nonfire point, but it is predicted to be a fire point. FP: the actual value is a fire point, but it is predicted to be a nonfire point. TP: the actual value is a nonfire point, and it is also predicted as a nonfire point.
AUC value: the AUC value is the area under the ROC curve, which quantitatively reflects the model performance measured on the basis of the ROC curve. The abscissa of the ROC curve is the false positive rate, FPR = FP/(FP + TN), and the ordinate is the true positive rate, TPR = TP/(TP + FN).
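The evaluation indices can be computed directly from the four confusion-matrix counts, and the AUC can be computed as the fraction of positive/negative pairs that the scores rank correctly. The counts, labels, and scores below are invented for illustration and are not the values in Tables 4 and 5.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

metrics = confusion_metrics(tp=80, fp=20, fn=10, tn=90)  # hypothetical counts
print({k: round(v, 3) for k, v in metrics.items()})
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))           # hypothetical scores
```

The pairwise form of the AUC is equivalent to the area under the ROC curve and is convenient for small checks, although it scales quadratically with the number of samples.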
This study uses the best hyperparameters obtained from the model optimization in Section 4.2 to predict the fire points of the sample data set. The final five model evaluation indicators are shown in Table 5. The results in Table 5 show that, after model optimization, the CatBoost fire point prediction model has a nonfire point precision of 0.83, recall of 0.87, and F1-score of 0.78 and a fire point precision of 0.81, recall of 0.82, and F1-score of 0.83. The final accuracy is 0.79, the overall precision is 0.82, the recall is 0.84, the F1-score is 0.80, and the AUC value is 0.83. These results indicate that the model predicts fire points well and that the risk of wildfires can be effectively predicted.
To show the effect of the model in predicting the risk of wildfires more intuitively, this article compares the wildfire risk prediction maps with the real fire spots in Yunnan Province on March 15, 2020, April 15, 2020, and May 15, 2020. The resolution of the wildfire risk prediction maps is 500 m, as shown in Figures 3-5. More than 80% of the real fire points fall in the high-risk areas of the prediction maps, which further verifies the effectiveness of the model.

Conclusions and Prospects
This study uses MODIS fire data, combined with vegetation factors, human factors, meteorological factors, surface temperature, and terrain factors, and applies feature selection and PCA dimensionality reduction to identify the influencing factors that are highly correlated with the occurrence of wildfires. On this basis, it proposes a fire prediction model based on the CatBoost algorithm. This model can effectively predict fire points, helps prevent wildfire risks, and can guide the electric power department in avoiding fire risks and making appropriate early warning arrangements in advance.
Although this article has achieved positive research results, it still has some deficiencies and areas worthy of in-depth study. The research is based only on the first-level classification of land types and does not make precise fire forecasts for individual land types. Establishing separate fire prediction models for each specific land type under the secondary classification, to achieve more precise fire prediction, is a direction for further in-depth research.

Data Availability
This article contains data to support the results of this research. Some data cannot be provided because they involve the coordinates of power grid poles.