Lithium-Ion Battery SOH Estimation Based on XGBoost Algorithm with Accuracy Correction

SOH (state of health) estimation is important for battery management. Because the electrochemical reactions inside a lithium-ion battery system (LIBS) are extremely complex and the external working environment is uncertain, it is difficult to determine SOH accurately. To improve the accuracy of SOH estimation, we propose a SOH estimation method for lithium-ion batteries based on the XGBoost algorithm with accuracy correction. We extract several features, including average voltage, voltage difference, current difference, and temperature difference, to describe the aging process of batteries. Because ensemble learning algorithms offer high prediction accuracy and generalization ability, an XGBoost model is established to estimate the SOH of lithium-ion batteries. The estimated values are then corrected by a Markov chain. Compared with methods based on XGBoost, random forest, k-nearest neighbor (KNN), support vector machine (SVM), and linear regression, our proposed method shows an accuracy improvement of 10% to 20%. Additionally, our method is superior to the others in terms of mean absolute error, root mean square error, and maximum error.


Introduction
Lithium-ion batteries are widely used in electric vehicles because of their high energy density and low self-discharge rate. They are also widely used in high-tech products such as mobile phones and various portable information processing terminals. However, the limited service life of lithium-ion batteries has hindered their further adoption and the development of electric vehicles. As the remaining life decreases, the overall performance of the battery degrades, which is characterized by capacity fade and an increase in internal resistance [1]. Therefore, accurate SOH estimation is of great importance [2][3][4].
SOH indicates the capacity of a battery to store electric energy, and it is generally expressed as a percentage that indicates the health status of the battery. Here, SOH is defined as [5]:
$$SOH = \frac{Capacity_{aged}}{Capacity_{rated}} \times 100\%$$
where $Capacity_{aged}$ represents the current capacity of the battery, and $Capacity_{rated}$ represents the rated capacity of the battery.
At present, many SOH estimation methods have been presented in the literature. These methods can be divided into three categories: direct measurement methods, physical failure methods, and data-driven methods. Direct measurement methods include coulomb counting-based methods [6] and internal resistance-based methods [7]. This kind of method calculates the state of charge (SOC) of the

Feature Selection
Feature selection is of great importance in the SOH estimation of lithium-ion batteries. Figure 2 shows how the candidate characteristics change with time under different health conditions.
The average voltage (U_ave) is defined over one battery cycle in terms of U_0, the starting voltage of the cycle, U_e, the ending voltage of the cycle, and Time, the duration of the cycle. Figure 2a shows the voltage curve of the lithium-ion battery under different health conditions. It can be seen from the figure that the average voltage (U_ave) differs between SOH levels. This means that the health status of the battery affects the average voltage (U_ave); that is, the average voltage (U_ave) reflects the health status of the battery to a certain extent. Therefore, the average voltage (U_ave) in the discharge process is used as an aging characteristic of the battery.
The voltage difference (∆U) represents the voltage change over a short, fixed time interval. Figure 2a also shows that the voltage difference (∆U) in a fixed time increases as SOH decreases, because the internal resistance of the battery increases as the battery ages. Therefore, the voltage difference (∆U) can also be used as an aging characteristic of batteries.
The temperature difference (∆T) represents the temperature change over a short, fixed time interval. Figure 2b shows the temperature curve of the lithium-ion battery under different health conditions. It can be seen from the figure that the temperature increment in a fixed time increases as SOH decreases: as the battery ages, its internal resistance increases continuously, and more heat is generated in the chemical reaction process. Therefore, the temperature difference (∆T) can also be used as an aging characteristic of batteries.
Figure 2c shows that, over a fixed period of time, the SOC decreases as the SOH decreases. The drop in SOC inevitably affects the discharge capacity of the battery; that is, SOC reflects the health state of the battery to a certain extent. However, SOC is not generally used as a feature to predict SOH; rather, it is often corrected according to the SOH values. Figure 2d shows the current curve of the lithium-ion battery under different health conditions. It can be seen from the figure that the current basically does not change between SOH levels, so the current difference carries little information about aging. Consequently, the voltage difference, temperature difference, and average voltage over the same time interval are used as the aging characteristics of batteries.
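The three aging characteristics can be illustrated with a minimal sketch. The function name `extract_features`, the sampling layout, and the synthetic cycle are hypothetical; in particular, taking U_ave as the plain mean of the sampled voltages and taking ∆U as a positive voltage drop are assumptions made for illustration, not the paper's exact definitions.

```python
from statistics import fmean


def extract_features(time_s, voltage_v, temp_c, t_start, t_end):
    """Return (delta_u, delta_t, u_ave) for one discharge cycle.

    Assumptions (not from the paper): samples are sorted by time, the
    window [t_start, t_end] is given in seconds, delta_u is reported as a
    positive drop, and u_ave is the plain mean of all voltage samples.
    """
    # Indices of the samples closest to the window boundaries.
    i0 = min(range(len(time_s)), key=lambda i: abs(time_s[i] - t_start))
    i1 = min(range(len(time_s)), key=lambda i: abs(time_s[i] - t_end))
    delta_u = voltage_v[i0] - voltage_v[i1]  # voltage drop over the window
    delta_t = temp_c[i1] - temp_c[i0]        # temperature rise over the window
    u_ave = fmean(voltage_v)                 # average voltage over the cycle
    return delta_u, delta_t, u_ave


# Synthetic discharge cycle, for illustration only.
time_s = [0, 10, 20, 30, 40]
voltage_v = [4.2, 4.0, 3.9, 3.8, 3.6]
temp_c = [24.0, 24.5, 25.1, 25.8, 26.4]
delta_u, delta_t, u_ave = extract_features(time_s, voltage_v, temp_c, 10, 30)
```

Using the same window for every cycle keeps ∆U and ∆T comparable across SOH levels, which is what the comparison in Figure 2 relies on.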

The XGBoost Model of SOH Estimation
XGBoost grows a tree by repeatedly splitting on features. Each new tree learns a function f(x) that fits the residual of the previous prediction. After k trees have been trained, the score of a sample is predicted as follows: according to its features, the sample falls on one leaf node in each tree, each leaf node corresponds to a score, and the scores from all trees are added up as the predicted value of the sample, as shown in Figure 3.
XGBoost modeling proceeds as follows: define a general objective function, then in each iteration find a suitable regression tree to fit the residual of the previous prediction, so that the objective function is further minimized and the predicted value moves closer to the real value.
For example, consider the data $\{(\Delta U_i, \Delta T_i, U_{ave,i}, SOH_i)\}$, $i = 1, \ldots, n$, where $\Delta U_i$, $\Delta T_i$, $U_{ave,i}$, and $SOH_i$ respectively represent the voltage difference, temperature difference, average voltage, and health status of the $i$-th group of data. We define the tree $f_t(x_i)$ as follows:
$$f_t(x_i) = w_{q(x_i)}, \quad w \in \mathbb{R}^{T}, \quad q: \mathbb{R}^{3} \to \{1, 2, \ldots, T\}$$
where $x_i = (\Delta U_i, \Delta T_i, U_{ave,i})$, $q$ represents the structure of the tree, which maps each sample to the corresponding leaf node, and $T$ is the number of leaf nodes in the tree. Each $f_t$ corresponds to an independent tree structure $q$ and leaf weight $w$.
To regularize the objective function, we define the tree complexity $\Omega(f_t)$:
$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
where $w_j$ is the weight of the $j$-th leaf node, and $\gamma$ and $\lambda$ are regularization coefficients. The objective function is defined as:
$$Obj^{(t)} = \sum_{i=1}^{n} l\left(SOH_i, \widehat{SOH}_i^{(t-1)} + f_t(\Delta U_i, \Delta T_i, U_{ave,i})\right) + \sum_{k=1}^{K} \Omega(f_k) + c$$
Here, $\sum_{i=1}^{n} l(SOH_i, \widehat{SOH}_i)$ measures the error between the predicted values and the actual values of the model, $f_t(\Delta U_i, \Delta T_i, U_{ave,i})$ represents the model added in round $t$, $\sum_{k=1}^{K} \Omega(f_k)$ is a penalty term for the complexity of each model, $f_k$ is the model function of the $k$-th tree, and $c$ is a constant.
Expanding the loss with the second-order Taylor formula gives:
$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(SOH_i, \widehat{SOH}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \sum_{k=1}^{K} \Omega(f_k) + c$$
where $g_i$ and $h_i$ are the first and second derivatives of $l(SOH_i, \widehat{SOH}_i^{(t-1)})$ with respect to $\widehat{SOH}_i^{(t-1)}$. Removing the terms that do not depend on $f_t$ (the loss at round $t-1$, the complexity of the earlier trees, and $c$), the new objective function can be defined as:
$$Obj^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T$$
Here $I_j = \{ i \mid q(x_i) = j \}$ is the set of samples assigned to leaf node $j$, $G_j = \sum_{i \in I_j} g_i$, and $H_j = \sum_{i \in I_j} h_i$. The optimal weight $w_j^*$ of leaf node $j$ and the optimal objective value $Obj^*$ are:
$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad Obj^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
$Obj^*$ scores a tree structure $q$ and measures its quality: the smaller the value of $Obj^*$, the better. In this way, the optimal split point and the optimal decision tree structure can be obtained to minimize the objective function.
XGBoost is an improved and optimized version of GBDT [19]. A regularization term is added to the objective function of each iteration to control the complexity of the model, and column sampling is supported, which reduces both over-fitting and computation. Whereas GBDT iterates heuristically, the optimization criterion of XGBoost is based on minimizing the objective function, and the second-order Taylor expansion allows custom loss functions. XGBoost stores each feature in sorted blocks, which makes parallel computation possible when searching for the best split point. With a reasonable block size, it also makes full use of the CPU cache to speed up data reading. When feature values are missing, XGBoost can automatically learn the direction of node splitting.
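The optimal leaf weight and the structure score used in this section can be checked numerically with a short sketch. It assumes the squared loss l = 0.5 (SOH − ŜOH)², for which the first derivative is ŜOH − SOH and the second derivative is 1; the function names and the sample values are hypothetical and are not taken from the paper's experiments.

```python
def leaf_stats(y_true, y_pred):
    """Return (G, H) for one leaf under the squared loss 0.5 * (y - y_hat)**2."""
    g = [p - y for y, p in zip(y_true, y_pred)]  # first derivative w.r.t. y_hat
    h = [1.0] * len(y_true)                      # second derivative is constant
    return sum(g), sum(h)


def optimal_weight(G, H, lam):
    """Optimal leaf weight: -G / (H + lambda)."""
    return -G / (H + lam)


def structure_score(leaves, lam, gamma):
    """Structure score: -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T.

    `leaves` is a list of (G_j, H_j) pairs, one per leaf, so T = len(leaves).
    """
    total = sum(G * G / (H + lam) for G, H in leaves)
    return -0.5 * total + gamma * len(leaves)


# Hypothetical SOH values that fall into one leaf; the previous round
# predicted 0.90 for all of them.
y_true = [0.95, 0.93, 0.94]
y_pred = [0.90, 0.90, 0.90]
G, H = leaf_stats(y_true, y_pred)
weight = optimal_weight(G, H, lam=1.0)
score = structure_score([(G, H)], lam=1.0, gamma=0.0)
```

The weight is positive because the previous prediction underestimates every sample in the leaf, and the regularization coefficient shrinks it below the plain mean residual.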

Markov Chain Correction
A Markov chain, also known as a discrete-time Markov chain, is a stochastic process that transitions from one state to another within a state space. The process is required to be "memoryless": the probability distribution of the next state is determined only by the current state and is independent of the earlier events in the time series. Assuming that the state sequence is $\ldots, X_{t-2}, X_{t-1}, X_t, X_{t+1}, \ldots$, the conditional probability of state $X_{t+1}$ depends only on state $X_t$, i.e.,
$$P(X_{t+1} \mid \ldots, X_{t-2}, X_{t-1}, X_t) = P(X_{t+1} \mid X_t)$$
(1) Partition interval
The difference between the real data and the predicted data is called the relative estimation. The relative estimation of the SOH of the lithium-ion battery is calculated, with $x$ denoting the relative estimation, and the relative estimations are normalized:
$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$
According to the golden section method, the normalized values of the relative estimation are divided into three states, $E_1$, $E_2$, and $E_3$, and the three states are mapped back to the relative estimation sequence by inverting the above formula. The relative estimation can thus be divided into three intervals ($S_1$, $S_2$, $S_3$).
(2) The computation of the state transition matrix
The transition probability $P_{ij}$ is the one-step probability of moving from state $i$ to state $j$:
$$P_{ij} = \frac{c_{ij}}{C_i}$$
where $c_{ij}$ represents the number of transitions from state $i$ to state $j$ between two adjacent cycles, and $C_i$ represents the total number of transitions out of state $i$ between two adjacent cycles. The state transition matrix is:
$$P = \begin{bmatrix} P_{11} & P_{12} & P_{13} \\ P_{21} & P_{22} & P_{23} \\ P_{31} & P_{32} & P_{33} \end{bmatrix}$$
Therefore, the $n$-step transition matrix is $P(n) = P^n$.

(3) Estimation correction
To further improve the estimation accuracy, a Markov chain is used to correct the estimation results obtained by the XGBoost model when the relative error between the predicted value and the true value is large. The correction is performed by Equation (13), in which $\Delta_a$ and $\Delta_b$ are the upper and lower limits of the interval corresponding to the state of the relative estimation, $SOH_b$ indicates the corrected value, and $SOH_p$ indicates the value estimated by the XGBoost model.
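The partition and transition-matrix steps can be sketched in a few lines. This is an illustration under stated assumptions: the cut points 0.382 and 0.618 are one common reading of the golden section method and are not confirmed by the text, the function names are hypothetical, and the relative estimations are synthetic.

```python
def normalize(values):
    """Min-max normalize the relative estimations to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all relative estimations are identical")
    return [(v - lo) / (hi - lo) for v in values]


def assign_states(normalized, cuts=(0.382, 0.618)):
    """Map each normalized value to state 1, 2, or 3 (E1, E2, E3).

    The default cut points are an assumption, not the paper's values.
    """
    states = []
    for v in normalized:
        if v < cuts[0]:
            states.append(1)
        elif v < cuts[1]:
            states.append(2)
        else:
            states.append(3)
    return states


def transition_matrix(states, n_states=3):
    """Estimate P_ij = c_ij / C_i from consecutive pairs of states."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(states, states[1:]):
        counts[a - 1][b - 1] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state with no outgoing transition has no estimate; keep zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def n_step(p, n):
    """Return the n-step transition matrix P^n by repeated multiplication."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = [[sum(result[i][k] * p[k][j] for k in range(size))
                   for j in range(size)] for i in range(size)]
    return result


# Synthetic relative estimations over eight consecutive cycles.
rel = [-0.004, 0.001, 0.006, 0.002, -0.001, 0.001, 0.0015, 0.006]
states = assign_states(normalize(rel))
P = transition_matrix(states)
P2 = n_step(P, 2)
```

The row of P that belongs to the current state gives the distribution of the next state, and the most probable next state selects the interval limits used in the correction step.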

The Implementation of the MC-XGBoost Method
As shown in Algorithm 1, the implementation of the MC-XGBoost method has three main steps: data setting, the XGBoost estimation process, and the Markov correction process. The three steps are described as follows:
Step 1: Preprocess the data and set the parameters, such as eta, max_depth, min_child_weight, seed, colsample_bytree, and gamma.
Step 2: Set up the training data and test data, and use the XGBoost algorithm to train the prediction model for SOH estimation. If the parameters are not optimal, so that the error between the predicted value and the real value is large, reset the parameters and repeat the above operation until the optimal parameters are obtained.
Step 3: Calculate the relative residuals by comparing the values predicted by the XGBoost model with the real ones, partition the normalized relative estimations according to the golden section method, and compute the state transition matrix to determine the next state. Then, the corrected values are obtained by using Equation (13).
Algorithm 1
Input: data $\{(\Delta U_i, \Delta T_i, U_{ave,i}, SOH_i) \mid i = 1, \ldots, n\}$; the loss function $l(SOH_i, \widehat{SOH}_i)$; the total number of sub-trees $T$
Output: predicted health status of lithium-ion batteries, $SOH_b$
Repeat
    Initialize the $t$-th tree $f_t(x_i)$
    Compute $g_i$, the first derivative of $l(SOH_i, \widehat{SOH}_i)$ with respect to $\widehat{SOH}_i$
    Compute $h_i$, the second derivative of $l(SOH_i, \widehat{SOH}_i)$ with respect to $\widehat{SOH}_i$
    Use the statistics to greedily grow a new tree $f_t(x_i)$: $Obj^* = -\frac{1}{2} \sum_{j=1}^{T} G_j^2 / (H_j + \lambda) + \gamma T$, where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$
    Add the best tree $f_t(x_i)$ into the current model
Until all $T$ sub-trees are processed
Obtain a strong regression tree based on all weak regression sub-trees
Compute $SOH_p$ based on the strong regression tree
Repeat
    Normalize the relative estimations: $x' = (x - x_{min}) / (x_{max} - x_{min})$; determine the states $E_1$, $E_2$, $E_3$; calculate the transition matrix $P$; correct the prediction results by the Markov chain
Output $SOH_b$ based on the Markov chain
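The "greedily grow a new tree" step of Algorithm 1 can be sketched for a single feature. The split gain used here, 0.5 [G_L²/(H_L+λ) + G_R²/(H_R+λ) − G²/(H+λ)] − γ, is the reduction in the structure score when one leaf is replaced by two; it is the usual XGBoost criterion and is stated here as an assumption, since the text does not write it out. The function name and the sample values are hypothetical.

```python
def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Return (gain, threshold) of the best split on one feature.

    gain = 0.5 * (score_left + score_right - score_parent) - gamma,
    where score = G^2 / (H + lambda). Returns (0.0, None) when no
    split has a positive gain.
    """
    def score(G, H):
        return G * G / (H + lam)

    order = sorted(range(len(x)), key=lambda i: x[i])
    G_total, H_total = sum(g), sum(h)
    G_left = H_left = 0.0
    best_gain, best_threshold = 0.0, None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_left += g[i]
        H_left += h[i]
        if x[i] == x[nxt]:
            continue  # no threshold separates equal feature values
        gain = 0.5 * (score(G_left, H_left)
                      + score(G_total - G_left, H_total - H_left)
                      - score(G_total, H_total)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = (x[i] + x[nxt]) / 2
    return best_gain, best_threshold


# Hypothetical average-voltage values and squared-loss gradients
# (previous prediction minus true SOH); h = 1 for the squared loss.
u_ave = [3.5, 3.6, 3.9, 4.0]
g = [0.085, 0.065, -0.065, -0.085]
h = [1.0, 1.0, 1.0, 1.0]
gain, threshold = best_split(u_ave, g, h, lam=1.0, gamma=0.0)
```

A full tree would apply the same search to every feature at every node and stop when no split has a positive gain.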

Experimental Results
The data used in our experiments come from the NASA Ames Research Center [20]. In the stable discharge experiments, data for four lithium-ion batteries (#5, #6, #7, and #18) were obtained through charging, discharging, and impedance measurements, and two groups of experiments were carried out. One group takes the discharge data of #5, #7, and #18 as the training set of the model, and the discharge data of #6 are used to evaluate the accuracy of the model. The other group uses the discharge data of #5, #6, and #18 as the training set of the model, and the discharge data of #7 are used to evaluate the accuracy of the model.
To verify the SOH estimation accuracy of the XGBoost method, we compare it with the random forest-based method, the linear regression-based method, the KNN-based method, and the SVM-based method on data sets #6 and #7. Figure 4 shows the prediction results and prediction errors of SOH on data sets #6 and #7.

The predicted results in Figure 4a show that, on data set #6, the predictions of linear regression, support vector machine, and KNN deviate obviously from the real values, while XGBoost and random forest have higher estimation accuracy in the later stage, and XGBoost is significantly closer to the real values than KNN in the early stage. Figure 4c shows that, on data set #7, there are obvious errors between the real values and the predictions of linear regression, support vector machine, random forest, and KNN, while the predictions of XGBoost are close to the real values.
As shown in Figure 4b, the residual curves of XGBoost and random forest fluctuate around 0 in the second half of the curve, but in the first half the errors of random forest fluctuate far from 0. Therefore, the estimation of random forest is not accurate in the early stage. Table 1 shows that the XGBoost method outperforms the other four methods on data sets #6 and #7 in terms of mean absolute error (MAE), root mean square error (RMSE), and maximum error: the values of XGBoost for all three indexes are lower than those of the other four algorithms on both data sets.
In conclusion, on both data sets #6 and #7, as shown in Figure 5, the predictions of XGBoost are close to the real values and have higher estimation accuracy.
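The three indexes reported in Table 1 can be computed as in the following sketch. The function names and the SOH values are hypothetical and serve only to show the definitions; they are not the experimental data.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def max_error(y_true, y_pred):
    """Largest absolute error over all samples."""
    return max(abs(a - b) for a, b in zip(y_true, y_pred))


# Hypothetical true and predicted SOH values.
soh_true = [0.90, 0.88, 0.85]
soh_pred = [0.91, 0.88, 0.83]
metrics = (mae(soh_true, soh_pred), rmse(soh_true, soh_pred), max_error(soh_true, soh_pred))
```

RMSE weights large deviations more heavily than MAE, and the maximum error captures the worst single cycle, so the three indexes together describe both average accuracy and stability.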
Energies 2020, 13, x FOR PEER REVIEW
The discharge process of a lithium-ion battery is time-varying, and the change of its state of health is a long-term process from the beginning of use to end-of-life replacement, during which the state of health decreases with time. Because a Markov chain is effective at predicting long-term system states, we use it to correct the prediction errors and further improve the estimation accuracy.
It can be seen from Figure 6 that XGBoost and MC-XGBoost both predict well in the later stage, with predicted values close to the real ones. However, in the early stage, on both data sets #6 and #7, the predicted values of MC-XGBoost are closer to the real values than those of XGBoost.
The MAE of MC-XGBoost and XGBoost is 0.001364 and 0.001615, respectively, on data set #6, and 0.001035 and 0.001336, respectively, on data set #7. The RMSE of MC-XGBoost and XGBoost is 0.001878 and 0.002132, respectively, on data set #6, and 0.001355 and 0.001712, respectively, on data set #7. The maximum error of MC-XGBoost and XGBoost is 0.005355 and 0.005972, respectively, on data set #6, and 0.003263 and 0.004244, respectively, on data set #7, as shown in Figure 8.
Compared with the XGBoost method, the MC-XGBoost method reduces the maximum error on both data set #6 and data set #7, as shown in Figure 8. Therefore, compared with the other five algorithms, MC-XGBoost has a smaller error fluctuation range and provides more stable SOH prediction.
Compared with the XGBoost-based method, the MC-XGBoost method reduces the MAE by 15.5% and 22.5%, and the RMSE by 11.9% and 20.8%, on data sets #6 and #7, respectively. That is to say, the performance ranking of the MC-XGBoost method and the other five methods is: the MC-XGBoost method, the XGBoost-based method, the RF-based method, the KNN-based method, the LR-based method, and the SVM-based method. Therefore, the Markov correction is effective in improving the accuracy of the predicted values. Figure 9 analyzes the contributions of the features to the SOH prediction. The average voltage has the greatest influence on the accuracy of the SOH prediction, followed by the voltage difference, and the temperature difference has the smallest influence.
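The reported percentage reductions can be reproduced from the MAE and RMSE values given for the two methods. The sketch below computes the relative reduction (XGBoost − MC-XGBoost) / XGBoost; the helper name `relative_reduction` is hypothetical, and the inputs are the figures quoted in this section.

```python
def relative_reduction(baseline, corrected):
    """Percentage by which `corrected` is lower than `baseline`."""
    return (baseline - corrected) / baseline * 100.0


# (XGBoost, MC-XGBoost) values quoted in the text.
reported = {
    ("MAE", "#6"): (0.001615, 0.001364),
    ("MAE", "#7"): (0.001336, 0.001035),
    ("RMSE", "#6"): (0.002132, 0.001878),
    ("RMSE", "#7"): (0.001712, 0.001355),
}
reductions = {key: relative_reduction(xgb, mc) for key, (xgb, mc) in reported.items()}
```

The four results agree with the stated 15.5%, 22.5%, 11.9%, and 20.8% to within rounding.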

Conclusions
This paper presented a method for estimating the SOH using the XGBoost algorithm with accuracy correction. Three battery features, namely the average voltage, voltage difference, and temperature difference, are used to predict the SOH of lithium-ion batteries. SOH is first predicted by the XGBoost algorithm, and then a Markov chain is established to correct the predicted values. To verify the effectiveness of XGBoost, we compare XGBoost with four other regression algorithms. The XGBoost algorithm has high prediction accuracy and strong generalization ability. The comparison between MC-XGBoost and XGBoost shows that the Markov correction improves the prediction accuracy of XGBoost by 10%-20%. In addition, experiments on the #6 and #7 battery data sets verify the generality of this method. Overall, the experimental results show that the SOH prediction accuracy of the MC-XGBoost method is superior to that of the other five methods on three technical indicators.
Funding: This work is supported by the science and technology major project of Guangxi, under grant number AA18118009.

Conflicts of Interest: The authors declare no conflict of interest.