Prediction of mechanical properties of hot-rolled strip steel based on PCA-GBDT method

Improving the quality of rolled steel products is a primary task of the steel industry. As the main step in rolled steel production, hot rolling has received extensive attention. In the hot rolling process, the chemical composition of the steel and the related process parameters are the most direct factors affecting the quality of hot-rolled steel sheets. Based on Principal Component Analysis (PCA) and the Gradient Boosting Decision Tree (GBDT) method, this paper takes tensile strength as the research object and constructs a prediction model for the mechanical properties of hot-rolled strip steel. Through principal component analysis of the feature data, 28 variables were reduced to 8 new indicators, which were used for GBDT regression analysis. The 2489 samples were divided into a training set and a test set at a ratio of 7:3. The regression model built on the training set gave a root mean square error of 16.7393, and the prediction of tensile strength on the test set gave a root mean square error of 18.2650.


Introduction
The development of strip steel products follows the country's shift toward an intensive economic growth mode, the essence of which is to improve product quality and economic benefits. As one of the main types of steel products, hot-rolled strip steel accounts for the largest proportion of total steel output and occupies a dominant position in the production of rolled steel.
With the rapid development of the global economy, hot-rolled strip producers are increasingly concerned about the quality stability of their products and the flexibility of their production lines. Systems for predicting the mechanical properties of steel have become one of the research directions that metallurgists around the world pay the most attention to. However, because steel rolling production data are complex, such prediction systems have difficulty adapting to the actual hot rolling production process. In recent years, the development of big data, the improvement of computer performance, the emergence of artificial intelligence, and the proposal and development of neural networks have provided effective technical support for the study of the mechanical properties of hot-rolled products, and neural networks have become the mainstream of research [1][2][3][4][5][6]. Nevertheless, because the rolling production process is complex and rolling data are severely nonlinear, time-varying, subject to large lags, strongly coupled and multi-parameter, neural networks have certain shortcomings. Based on this, we propose a prediction model for the mechanical properties of rolled steel based on principal component analysis and the gradient boosting decision tree (PCA-GBDT). The GBDT algorithm is one of the best-performing methods in machine learning and has been applied in other fields. For example, a 2017 study used measured vehicle travel time data to propose a PCA-GBDT method for urban road travel time prediction. The results show that, compared with the traditional KNN method, the time series ARIMA method, and the SVM method, the PCA-GBDT method has higher prediction accuracy and algorithm stability [7].

Principal component analysis
Given $P$ observation variables $x_1, x_2, \ldots, x_P$ measured on $N$ observation objects, the $N \times P$ observation matrix $X$ is obtained. Assuming that the observation matrix is standardized, a linear combination of the observation variables is called a comprehensive variable. The core of PCA is to take such a linear combination:

$$F = Xa, \qquad F_i = \sum_{j=1}^{P} a_j x_{ij}, \quad i = 1, 2, \ldots, N \tag{1}$$

where $F = (F_1, F_2, \ldots, F_N)'$, $a = (a_1, a_2, \ldots, a_P)'$, $a_j$ is the weight coefficient of the $j$-th variable, $x_{ij}$ is the observed value of the $j$-th variable for the $i$-th observation object, and $F_i$ is the principal component score of the $i$-th observation object. In this paper, $N = 2489$ (the number of hot-rolled steel strips) and $P = 28$ (chemical composition and process parameters). Because different weight coefficients give different comprehensive variables, there can be multiple comprehensive variables for the same set of observed variables. The variance of a comprehensive variable is a measure that reflects how much of the variation of the observed variables it captures. Therefore, the principal components can be defined in terms of variance:

$$\operatorname{Var}(F) = a' S a \tag{2}$$

The first principal component satisfies

$$a^{(1)} = \arg\max_{a'a = 1} \; a' S a \tag{3}$$

where $S$ is the sample covariance matrix. The $k$-th principal component satisfies

$$a^{(k)} = \arg\max_{\substack{a'a = 1 \\ a'a^{(j)} = 0,\; j < k}} \; a' S a \tag{4}$$

By determining the normalized weight coefficients, the variance of each principal component is maximized. The combination with the largest variance is the first principal component, the one with the second largest variance is the second principal component, and so on. The principal components obtained by PCA are linear combinations of the original variables and are uncorrelated with each other, so they can represent the most important characteristic components of the original data.
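The variance-maximizing construction described above can be sketched with an eigendecomposition of the sample covariance matrix. This is only an illustrative sketch: the function name `pca_scores` and the synthetic data are assumptions made for the example, not part of the original study, and NumPy is assumed to be available.

```python
import numpy as np

def pca_scores(X, n_components):
    """Return (scores, weights, variances) for the leading principal components."""
    # Standardize each column (zero mean, unit variance), as the text assumes.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Sample covariance matrix S of the standardized data (P x P).
    S = np.cov(Z, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1][:n_components]
    variances = eigvals[order]      # variance of each principal component
    weights = eigvecs[:, order]     # columns are unit-norm weight vectors
    scores = Z @ weights            # principal component scores
    return scores, weights, variances

# Synthetic data for illustration only (not the mill data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)  # make two columns correlated
scores, weights, variances = pca_scores(X, n_components=3)
```

The returned weight vectors are orthonormal and the score columns are mutually uncorrelated, which matches the properties of principal components stated in the text.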

Gradient boosting decision tree
The Gradient Boosting Decision Tree (GBDT) algorithm was proposed by Jerome Friedman in 2001 and can be used for both classification and regression. It combines the gradient boosting algorithm with decision trees.
Assume the training set is $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$. In the optimization process, GBDT adopts forward stagewise regression: it continuously adds new decision trees to reduce the value of the error function without changing the parameters of the existing decision trees. The loss function is calculated as follows:

$$L = \sum_{i=1}^{N} L(y_i, f(x_i)) \tag{5}$$

For continuous variables, the classic form of the loss function is the sum of squared errors:

$$L = \sum_{i=1}^{N} \left(y_i - f(x_i)\right)^2 \tag{6}$$

Assume that in the $m$-th round of learning the strong learner is $f_m(x)$ and the loss function is $L(y, f_m(x))$. The direction in which the loss of the $i$-th sample decreases fastest in round $m+1$ is its negative gradient direction:

$$r_{i,m+1} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_m(x)} \tag{7}$$

Using $(x_i, r_{i,m+1})$, $i = 1, 2, \ldots, N$, the $(m+1)$-th regression tree can be fitted. Assume that the number of leaf nodes is $J$ and that the corresponding leaf node regions are $R_{m+1,j}$ $(j = 1, 2, \ldots, J)$. For the samples in each leaf node, the output value that minimizes the loss function is found, which is the optimal output value of that leaf node:

$$c_{m+1,j} = \arg\min_{c} \sum_{x_i \in R_{m+1,j}} L\left(y_i, f_m(x_i) + c\right) \tag{8}$$

The decision tree of round $m+1$ is thus obtained, and its fitting function is:

$$h_{m+1}(x) = \sum_{j=1}^{J} c_{m+1,j} \, I(x \in R_{m+1,j}) \tag{9}$$

Therefore, the final expression of the strong learner in round $m+1$ is as follows:

$$f_{m+1}(x) = f_m(x) + \sum_{j=1}^{J} c_{m+1,j} \, I(x \in R_{m+1,j}) \tag{10}$$

This paper combines these two methods. Considering the nonlinear, time-varying and large-lag characteristics of rolled steel production data, it proposes a mechanical property prediction method based on PCA-GBDT and verifies the forecast reliability and algorithm stability of the model on measured production data.
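The boosting procedure described above can be illustrated with a deliberately small sketch. Several choices here are assumptions made for brevity and do not come from the paper: each tree is a two-leaf stump on a single feature, the loss is squared error (so the negative gradient is proportional to the residual and the optimal leaf value is the mean residual in the leaf), and a shrinkage factor `learning_rate` is applied to each tree. The function names are hypothetical.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a two-leaf regression tree to targets r on a 1-D feature x."""
    best = None
    for t in np.unique(x)[:-1]:  # candidate thresholds; the right leaf is never empty
        left = x <= t
        c_left, c_right = r[left].mean(), r[~left].mean()  # optimal leaf outputs
        sse = ((r[left] - c_left) ** 2).sum() + ((r[~left] - c_right) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, c_left, c_right)
    return best[1:]

def predict_stump(stump, x):
    t, c_left, c_right = stump
    return np.where(x <= t, c_left, c_right)

def fit_gbdt(x, y, n_rounds=50, learning_rate=0.1):
    """Forward stagewise boosting; earlier trees are never modified."""
    f0 = float(y.mean())                  # initial constant learner
    pred = np.full(y.shape, f0)
    stumps = []
    history = [float(np.sqrt(np.mean((y - pred) ** 2)))]  # training RMSE per round
    for _ in range(n_rounds):
        residual = y - pred               # proportional to the negative gradient
        stump = fit_stump(x, residual)
        pred = pred + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        history.append(float(np.sqrt(np.mean((y - pred) ** 2))))
    return f0, stumps, history

def predict_gbdt(f0, stumps, x, learning_rate=0.1):
    pred = np.full(x.shape, f0)
    for stump in stumps:
        pred = pred + learning_rate * predict_stump(stump, x)
    return pred

# Synthetic one-dimensional example (illustration only).
x = np.linspace(0.0, 6.0, 80)
y = np.sin(x)
f0, stumps, history = fit_gbdt(x, y)
```

In this sketch the training error recorded in `history` does not increase from one round to the next, because each stump is fitted to the current residuals and added with a shrinkage factor between 0 and 1.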

Data sources
In this paper, production data for 2,489 hot-rolled steel strips produced by large domestic hot-rolling mills are used to discuss the influence of the chemical composition of steel and the related process parameters on the mechanical properties of hot-rolled steel strips.

PCA-based principal component selection
Using Python, the original data are input to form a sample set, and dimensionality reduction is performed on them. However, the features obtained after principal component dimensionality reduction are not easy to interpret, so we first sort the original features by importance. The process parameters are external factors that control the quality of strip steel products; they can all be controlled artificially.
As mentioned above, the internal chemical factors are also extremely important to the mechanical properties of the product. Among them, C, Mn, and Si play a major role in the mechanical properties of the product and can improve the strength, hardness and wear resistance of the steel. S and P are harmful elements in steel; the lower their content, the better. Nb can refine the grain of steel, reduce its overheating sensitivity and temper brittleness, improve its welding performance, and increase the strength and corrosion resistance of heat-resistant steel. Ni can improve the strength, toughness, heat resistance, corrosion resistance, acid resistance, and magnetic permeability of steel. Increase the hardenability and hardness of steel.
The main idea of PCA is to reduce dimensionality and to explain as much of the variation as possible with as few factors as possible. The SPSS scree plot shows that 8 factors have eigenvalues greater than 1. According to the contribution rates of the factors to the explained variance, these top 8 factors reach a cumulative contribution rate of 75.508%, so they represent most of the sample information. Therefore, the number of principal components is set to 8, and the eigenvectors corresponding to the first 8 principal components are output to form a feature matrix. The resulting 8-dimensional features are used as the input for predicting the performance index.
To evaluate the prediction performance of the mechanical property prediction model, the experimental simulation uses the root mean square error (RMSE) to measure the accuracy of the prediction model. The smaller the RMSE value, the more accurate the prediction.
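The rule used in this section for choosing the number of principal components (keep the components whose eigenvalue exceeds 1, then report their cumulative contribution rate) can be sketched as follows. The helper name `select_components` and the eigenvalues in the example are purely illustrative assumptions; they are not the values obtained from the mill data.

```python
import numpy as np

def select_components(eigenvalues):
    """Count eigenvalues above 1 and compute cumulative contribution rates."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    contribution = eigenvalues / eigenvalues.sum()   # share of total variance
    cumulative = np.cumsum(contribution)
    k = int((eigenvalues > 1.0).sum())               # eigenvalue-greater-than-1 rule
    return k, cumulative

# Hypothetical eigenvalues of a 6-variable correlation matrix (they sum to 6).
example = [2.5, 1.4, 1.1, 0.6, 0.3, 0.1]
k, cumulative = select_components(example)
print(k, round(float(cumulative[k - 1]), 4))
```

For standardized variables the eigenvalues sum to the number of variables, so an eigenvalue above 1 means that the component explains more variance than a single original variable does.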

Model rating indicators
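The root mean square error (RMSE) is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2} \tag{11}$$

where $n$ is the number of samples, $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the true value of the $i$-th sample.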
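As a small worked example of this indicator, the following sketch computes the RMSE for three made-up tensile strength values. The numbers are hypothetical and serve only to show the arithmetic.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical true and predicted values (errors of 10, -5 and -10).
value = rmse([500.0, 520.0, 480.0], [510.0, 515.0, 470.0])
print(value)
```

Here the squared errors are 100, 25 and 100, so the RMSE is the square root of 75.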

Modeling process and result analysis (based on PCA-GBDT to establish regression prediction model)
To ensure the generalization of the algorithm and the portability of the experiment, the data set is divided into a training set and a test set at a ratio of 7:3. The training set is used to train the regression model, and the root mean square error of the regression model on the training data is 16.7393. The test set is then used to predict the tensile strength; compared with the true values in the test set, the root mean square error is 18.2650. The test error is close to the training error, which indicates that the predicted tensile strength is reasonably accurate.
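The modeling process described above could be arranged roughly as in the following sketch, assuming scikit-learn is used. The paper does not state which library, hyperparameters or random seed were used, so everything here is an assumption: the data are synthetic stand-ins generated with `make_regression`, the GBDT settings are scikit-learn defaults, and the resulting errors are not comparable to the values reported in the paper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the study data: 2489 samples, 28 features.
X, y = make_regression(n_samples=2489, n_features=28, n_informative=10,
                       noise=10.0, random_state=0)

# 7:3 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Standardize, reduce the 28 variables to 8 principal components, then fit GBDT.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=8),
    GradientBoostingRegressor(random_state=0),
)
model.fit(X_train, y_train)

rmse_train = float(np.sqrt(mean_squared_error(y_train, model.predict(X_train))))
rmse_test = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
print(f"train RMSE: {rmse_train:.4f}, test RMSE: {rmse_test:.4f}")
```

Wrapping the scaler and PCA in the same pipeline as the regressor means they are fitted on the training set only, so no information from the test set leaks into the dimensionality reduction.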

Conclusions
This paper proposes a mechanical property prediction model based on PCA-GBDT and empirically verifies the effectiveness of the method. To address the high dimensionality of hot-rolled strip data, the presence of data noise, and the redundancy between attributes, PCA is used to obtain low-dimensional features, which reduces the dimensionality of the model input, shortens training time, and improves model stability. Comparison with other machine learning models shows that the GBDT model has superior prediction accuracy and model interpretability, can identify complex nonlinear relationships well, and is suitable for predicting the mechanical properties of hot-rolled strip steel.