Two-stage incident prediction in heat supply systems using gradient boosting

Information technologies play an important role in addressing accidents in heat networks, which are a source of accidents (incidents) in the heat supply sector. Predicting accident situations in heat supply systems makes it possible to allocate incident-prevention funds rationally. Artificial intelligence (AI) is the tool used for predicting accidents: modern AI methods and algorithms can solve many problems in technical systems. The research area has a data bank that allows heat network accidents to be forecast. Gradient boosting and production rules were chosen as the basic methods.


Introduction
The state of heat supply systems is a determining factor in the social and economic well-being of any state, region, or town. Whether heat supply systems are in an acceptable state depends on solving the accident problem in this infrastructure. By their nature, heat supply systems are subject to component wear and tear, and therefore the problem of incidents remains relevant.
One way to address the accident rate in heat supply systems, specifically in heat networks, is prediction using artificial intelligence tools.

Description of the heating network accident prediction model
Nowadays artificial intelligence is one of the major trends in science. One of the main directions in artificial intelligence is the expert system. An expert system emulates the decisions of an expert in a specific area. Unlike other artificial intelligence approaches (neural networks, genetic algorithms), expert systems are applicable to problems in a narrowly focused area.
The use of expert systems is complicated by the problem of transferring expert knowledge. In addition, forming the expert group correctly plays an important role in determining the adequacy of the system. This means that the reliability of the expert system's results is directly determined by the sufficiency and reliability of the experts' knowledge [7][8]. An alternative is to use another artificial intelligence tool, specifically machine learning. For classification and regression, the following algorithms are suitable: linear regression, logistic regression, random forest, and gradient boosting.
Accident prediction is treated here as a classification problem, so the algorithm outputs one of two classes: «accident» and «non-accident». For a full-fledged prediction, however, we also need to know how correct the prediction is. To do this, the following metrics are considered: precision, recall (completeness), and F-measure. These give a numerical estimate of prediction quality.
Among the proposed algorithms, gradient boosting can provide more accurate predictions, but it cannot provide information about which parameters (input data) affected the prediction. This problem can be solved by implementing a second model that identifies incidents based on production rules. The production rules are formed from expert opinion and from attribute importances extracted from a machine learning algorithm. The first three algorithms above (linear regression, logistic regression, random forest) can give an idea of attribute importance. Therefore, the system consists of two models: a prediction model (1) and a prediction analysis model (2).
where MP is the set of input parameters; SL is the base algorithm used in gradient boosting (random forest); OV is the training sample; VI is the output information.
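The quality metrics mentioned above can be computed directly from the counts of true positives, false positives, and false negatives. The following is a minimal sketch, assuming that the metrics in question correspond to the standard precision, recall, and F-measure for the «accident» class; the function name and the example labels are hypothetical and are not taken from the study's data.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f_measure) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Hypothetical labels: 1 = "accident", 0 = "non-accident".
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall, f_measure = classification_metrics(y_true, y_pred)
print(precision, recall, f_measure)
```

In this toy example there are three true positives, one false positive, and one false negative, so all three metrics are equal.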

Analysis of model input parameters
A model that predicts the probability of pipe failure must be built from the available attributes. The list of attributes is presented in Figure 1.

Figure 1. Attributes list
Unnecessary attributes must be removed before data analysis. After removing all string-type attributes, keeping and encoding only "Type of heating network laying", we obtain the following list of attributes.

Figure 2. Numeric attributes

Figure 3 shows a matrix of correlation coefficients between all the attributes. The correlation matrix shows that some attributes have a weak effect on the failure rate; they are not included in the final list of attributes. In the correlation matrix, the attribute "Roughness of the supply pipeline, mm" stands out clearly. Figure 4 shows that almost all pipes with roughness values from 0.5 to 2 mm had an accident.
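The attribute selection described above can be sketched as a filter on the correlation between each attribute and the failure label, followed by a simple production rule derived from the roughness observation. This is only an illustration under stated assumptions: the threshold of 0.3, the attribute "hypothetical_attr", and all data values are invented for the example, and the rule merely restates the 0.5 to 2 mm range reported for Figure 4.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / sqrt(var_x * var_y)


def select_attributes(columns, target, threshold):
    """Keep attributes whose |correlation| with the target reaches the threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


def roughness_rule(roughness_mm):
    """Production rule (illustrative): flag pipes in the 0.5 to 2 mm range."""
    return 0.5 <= roughness_mm <= 2.0


# Toy data, not from the study. failure: 1 = accident, 0 = no accident.
columns = {
    "Roughness of the supply pipeline, mm": [0.1, 0.6, 1.5, 0.2, 1.8, 0.05],
    "hypothetical_attr": [10, 20, 10, 20, 15, 15],
}
failure = [0, 1, 1, 0, 1, 0]

selected = select_attributes(columns, failure, threshold=0.3)
print(selected)
print(roughness_rule(1.0), roughness_rule(0.1))
```

A correlation filter only captures linear relationships with the label, which is one reason a rule such as `roughness_rule` can be useful alongside it.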

Analysis of gradient boosting parameters
The first version of the model uses gradient boosting. Machine learning algorithms require not only a correctly formed training sample but also the selection of the algorithm's internal parameter values. For gradient boosting, the following internal parameters are considered: max_features, the number of attributes taken into account when searching for the best split; max_depth, the maximum tree depth; learning_rate, which reduces the impact of each individual tree; and n_estimators, the number of trees [1][2][3][4][5][6]. Based on the training sample we formed from the previously selected attributes, the internal parameters of gradient boosting were analysed (Figure 7). The analysis showed that increasing the number of trees in the random forest used as the base of the algorithm beyond 200 leads to a certain linear correlation. As for tree depth, with 200 trees the optimal depth is 9.
Thus, the following parameter values were selected for our training sample, and the resulting prediction score is high.
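A parameter analysis of this kind can be sketched as a grid search. The parameter names in the text coincide with those of scikit-learn's `GradientBoostingClassifier`, so the example assumes that library, although the source does not name it. Note that this scikit-learn estimator boosts individual decision trees, which may differ from the base learner described in the text. The data set is synthetic and the grid values are illustrative, not the ones selected in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the heat network data (assumption): 1 = "accident".
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative grid over the four parameters discussed in the text.
param_grid = {
    "max_features": ["sqrt"],
    "max_depth": [3, 9],
    "learning_rate": [0.1],
    "n_estimators": [50, 200],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="f1",  # F-measure for the positive ("accident") class
    cv=3,
)
search.fit(X_train, y_train)

# Score the best configuration on data that was not used for the search.
test_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_, round(test_f1, 3))
```

Evaluating on a held-out split, as in the last step, gives a first indication of whether a high cross-validated score carries over to unseen data.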

Conclusion
During the research, heat network parameters were studied in order to form a sample for the machine learning algorithm. The results of the analysis are useful not only for the training sample but also for the further formation of the production rule base [10][11][12][13].
Prediction using machine learning requires not only analysis of the data for the training sample but also competent selection of the algorithm's parameters. The following parameters were investigated for gradient boosting: max_features, max_depth, learning_rate, and n_estimators. The prediction score is 0.9, which is considered high. The next research task is to verify that this score reflects an adequate assessment of the prediction rather than overfitting.