A Machine Learning Method for Predicting Driving Range of Battery Electric Vehicles



Introduction
With the rapid development of the automobile industry and the continuous improvement of living standards, car ownership and sales continue to rise, which brings a series of energy and environmental problems. In response, the development of new energy vehicles has become a new trend in the automobile industry [1], and the battery electric vehicle (BEV) is the main force among new energy vehicles. However, BEVs have several disadvantages compared with conventional fuel vehicles: charging stations are sparsely distributed, charging takes a long time, and electrochemical batteries store less energy per unit of mass than fossil fuels [2]. In addition, BEV users suffer from range anxiety, that is, the worry that the remaining battery power will not be enough to reach the destination. All of these factors restrict the promotion and development of BEVs. Among them, range anxiety is easier to address than the other issues in real-world application of BEVs [3]. Therefore, it is of great significance to increase the practicability and reliability of BEVs by improving the accuracy of driving range prediction so that users receive reliable information [4].
Various mathematical methods have therefore been applied to driving range prediction to improve its accuracy and credibility. The driving mode was incorporated into the study of driving range, indicating that a stable driving habit plays an important role in saving battery power and extending the driving range [5]. Fuzzy Transform, a model-free method, was adapted to online use for predicting the remaining range of an electric vehicle [6]. A simple feature-based linear regression framework modeling the distribution parameters proved to be an efficient approach to computing probabilistic attainability maps and modeling a driver's route preferences for electric vehicles [4]. Driving range prediction was formulated as a multiobjective problem, with maximized electric motor efficiency and minimized energy consumption, and solved to obtain the optimal speeds along with the total trip time corresponding to a predicted driving range [7]. In another study, the energy consumption was analyzed, and it was found that the electric vehicle has lower energy consumption at lower speeds and with more frequent stops [1]. In addition, based on linear regression (LR), support vector regression (SVR), and neural networks, genetic algorithm and fuzzy logic intelligent optimization methods were fused into an energy-consumption-based driving range prediction model to improve the prediction accuracy [8].
A real-time method was proposed to estimate the continuous driving range, considering both the driving behavior and the steepness of the driving route [9]. In other research, the relationship between the energy consumption and the load (air conditioning, heating, etc.) was studied to put forward prediction models for different loads and driving modes [10,11]. Many studies take few factors into account when establishing the driving range prediction model, which may lead to poor applicability and prediction accuracy. A series of energy equations based on linear models and Dijkstra's graph search algorithm were derived to calculate the driving range and the energy-minimizing route available to EVs based on real-world traffic conditions and road topology. However, weight, temperature, and many other parameters were not included in that work [12]. Prediction of the remaining discharge energy of the battery was studied with an energy prediction method based on the coupled prediction of future energy-related variables, but the future temperature variation was not considered [13].
In summary, there are many methods for predicting the driving range and many factors that affect the prediction, but current studies do not achieve both accuracy and comprehensiveness [14].
To address the problems in previous studies, a new driving range prediction method based on the gradient boosting decision tree (GBDT) is presented. The machine learning method is novel in including a large number of feature variables [15], and it aims to improve both applicability and accuracy by considering the real-world working condition, the battery status, and the traffic environment. The study is organized as follows. The collection and processing of data are briefly introduced in Section 2. The conventional multiple linear regression model of driving range is established and verified in Section 3. The machine learning method for the prediction problem is presented in Section 4. To investigate the prediction performance of the proposed method, a comparative study is conducted in Section 5. Finally, conclusions are drawn in Section 6. The symbols and abbreviations used in this study are listed in the Nomenclature.

Data Collecting.
We begin with a brief introduction to the data collection. Real-time data on travel status and battery status, collected by vehicle-mounted information collection equipment, were sent to the remote data monitoring center every 5∼10 seconds over the GPRS wireless network and stored there. The research in this paper is based on the historical operation data of BEVs of type E150EV, produced by Baic New Energy Automobile Co., Ltd. and rented and managed by a car-sharing company in Beijing. Of all the rented BEVs, No. 25 BEV, which has the longest running time and the largest data volume, is selected as the main research object. The discharge data, comprising 596 discharge processes and 523,678 original records from March 1, 2015, to March 1, 2016, are extracted from the database, filtered, and processed.

Data Processing.
The information about vehicle state and battery status is transmitted through the wireless network. The transmission can be affected by many factors, such as weather, building density, channel conflict, and data stability. Therefore, the collected data contain losses and errors.
For subsequent analysis and modeling, repeated and erroneous records have been deleted for each attribute (SOC, current, voltage, speed, etc.). Table 1, for example, shows the frequency count and the relative frequency of SOC before and after the deletion operation.
It can be seen from Table 1 that the frequency count is reduced after the deletion operation, while the relative frequency remains basically the same as the original. This result indicates that the errors are generated by random influences of the driving environment and the data acquisition device rather than being introduced deliberately. As shown in Table 1, records with SOC between 0 and 20 are relatively scarce. Besides, previous studies find that battery performance is unstable when the SOC is less than 10%, which can easily cause irreversible damage to the physical properties of the battery [11]. Battery performance is relatively stable only when the SOC is above 15%. Therefore, only records with SOC greater than or equal to 20% are used for driving range prediction in this study.
To facilitate subsequent analysis and modeling, the Lagrange interpolation method has been used to fill the data gaps so that each discharge process is complete. To assess the interpolation quality, the root mean square error and the relative root mean square error are calculated, as shown in Table 2.
In addition, the processed data have been averaged to meet the accuracy and standardization requirements of modeling.
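The gap-filling step described above can be illustrated with a minimal sketch. The function names `lagrange_interpolate` and `rmse`, the timestamps, and the SOC values are hypothetical and are not taken from the study's data; the sketch only shows how a missing sample might be filled by Lagrange interpolation and how the root mean square error could be computed.

```python
import math


def lagrange_interpolate(xs, ys, x):
    """Evaluate the Lagrange polynomial through the points (xs[i], ys[i]) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical SOC samples (%) at timestamps in seconds; the sample at t = 20 is missing.
times = [0, 10, 30, 40]
soc = [80.0, 79.5, 78.5, 78.0]
filled = lagrange_interpolate(times, soc, 20)
print(filled)
```

In practice only a few neighboring samples would normally be used for each gap, because high-degree interpolating polynomials can oscillate between the nodes.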

Multiple Linear Regression Modeling
3.1. Correlation Analysis.
Generally, many factors affect the driving range of a BEV under actual working conditions, including the driver's characteristics, the vehicle's parameters, and the road environment. However, only a few items of data can be collected and used. Therefore, the performance parameters of the battery (SOC, voltage, current, and temperature) and the state parameter of the vehicle (speed) are chosen for study [5].
Considering that the data used in this paper are distance-dependent, Pearson's simple correlation coefficient is used to measure the strength of the correlation between the driving range and each of the other variables. Pearson's simple correlation coefficient is defined as follows:
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}},$$
where $n$ is the number of samples, $x_i$ and $y_i$ are the values of the two variables, and $\bar{x}$ and $\bar{y}$ are the corresponding mean values. When $|r| > 0.8$, there is a strong linear correlation between the two variables; when $|r| < 0.3$, the linear correlation between the two variables is weak. The test statistic corresponding to Pearson's simple correlation coefficient is the $t$ statistic, defined as follows:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}.$$
When the probability value $p$ of the $t$-test statistic is less than the significance level $\alpha$, the two variables are generally considered to have a significant linear correlation. Otherwise, there is no significant linear correlation between the two variables.
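As a hedged illustration of the two definitions above, the following sketch computes Pearson's coefficient and the associated $t$ statistic. The function names and the SOC and distance values are hypothetical and serve only to show the arithmetic; the $p$ value would additionally require the $t$ distribution with $n-2$ degrees of freedom, which is not implemented here.

```python
import math


def pearson_r(x, y):
    """Pearson's simple correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero with n samples (requires |r| < 1)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Hypothetical samples: SOC (%) and distance driven (km) during one discharge.
soc = [100, 90, 80, 70, 60]
distance = [0, 12, 25, 37, 52]
r = pearson_r(soc, distance)
print(r, t_statistic(r, len(soc)))
```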
No. 25 BEV discharge process data, from 09:45 to 14:30 on September 1, 2015, No. 15 BEV discharge process data, from 10:05 to 15:30 on August 11, 2015, and No. 12 BEV discharge process data, from 09:23 to 13:45 on July 20, 2015, are selected to calculate Pearson's simple correlation coefficient and the probability value $p$, which quantify the correlation between each parameter and the driving range, as shown in Table 3.

Partial Correlation Analysis.
In multivariate correlation analysis, however, Pearson's simple correlation coefficient generally cannot truly reflect the correlation between variables, because the relationship between two variables may be affected by one or more other variables. In this case, the partial correlation coefficient is a better choice, since it reflects the degree of net correlation between variables.
When analyzing the partial correlation between the variables $x_1$ and $y$ while controlling for the linear effect of $x_2$, the first-order partial correlation coefficient between $x_1$ and $y$ is defined as follows:
$$r_{y1,2} = \frac{r_{y1} - r_{y2}\,r_{12}}{\sqrt{(1-r_{y2}^2)(1-r_{12}^2)}},$$
where $r_{y1}$ is the simple correlation coefficient of $y$ and $x_1$, $r_{y2}$ is the simple correlation coefficient of $y$ and $x_2$, and $r_{12}$ is the simple correlation coefficient of $x_1$ and $x_2$. The basic steps of partial correlation analysis are as follows. Firstly, the null hypothesis is proposed; that is, the partial correlation coefficient between two populations is not significantly different from zero. Secondly, the test statistic of partial correlation analysis is the $t$ statistic, whose mathematical definition is shown in the following:
$$t = r\sqrt{\frac{n-q-2}{1-r^2}},$$
where $r$ is the partial correlation coefficient, $n$ is the number of samples, $q$ is the order, and $n-q-2$ is the number of degrees of freedom. Thirdly, the observed value of the $t$-test statistic and the corresponding probability value $p$ are calculated. Lastly, if the probability value $p$ of the $t$-test statistic is less than the given significance level $\alpha$, the null hypothesis is rejected and the partial correlation coefficient of the two populations is significantly different from zero. Otherwise, the partial correlation coefficient of the two populations is considered not significantly different from zero.
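The first-order partial correlation and its test statistic can be sketched as follows. The function names and the numerical inputs are hypothetical; the sketch assumes the three simple correlation coefficients have already been computed.

```python
import math


def partial_corr(r_y1, r_y2, r_12):
    """First-order partial correlation of y and x1 with x2 held fixed."""
    return (r_y1 - r_y2 * r_12) / math.sqrt((1 - r_y2 ** 2) * (1 - r_12 ** 2))


def partial_t(r, n, q=1):
    """t statistic of a partial correlation r of order q computed from n samples."""
    return r * math.sqrt((n - q - 2) / (1 - r * r))


# Hypothetical simple correlations: y with x1, y with x2, and x1 with x2.
r = partial_corr(0.8, 0.5, 0.5)
print(r, partial_t(r, n=30))
```

When the control variable is uncorrelated with both of the other variables, the partial correlation reduces to the simple correlation, which is a quick sanity check on the formula.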
No. 25 BEV discharge process data, from 09:45 to 14:30 on September 1, 2015, No. 15 BEV discharge process data, from 10:05 to 15:30 on August 11, 2015, and No. 12 BEV discharge process data, from 09:23 to 13:45 on July 20, 2015, are selected to calculate the partial correlation coefficient and the probability value $p$, which determine whether the correlation between each parameter and the driving distance is affected by other parameters.
From Table 3, SOC has the highest absolute value of the simple correlation coefficient, while total current, speed, extremum voltage difference, and extreme temperature difference each have no significant linear relationship with the driving range. Therefore, SOC is selected as the control variable, and the partial correlation coefficients of total voltage, maximum cell voltage, minimum cell voltage, maximum cell temperature, and minimum cell temperature are calculated.
As can be seen from Table 4, the linear relationship between the driving distance and total voltage, maximum cell voltage, and minimum cell voltage is affected by SOC. After controlling for SOC, total voltage, maximum cell voltage, and minimum cell voltage have no significant linear effect on the driving range ($|R| < 0.5$ and $p > 0.05$ for all three variables on BEVs No. 25, No. 15, and No. 12). In contrast, there is a significant linear correlation between the driving distance and both maximum cell temperature and minimum cell temperature ($|R| > 0.4$ and $p \ll 0.01$ on all three vehicles). According to the above correlation analysis and partial correlation analysis, minimum cell temperature has the second highest correlation with the driving range, so it is selected as the control variable for the partial correlation test of maximum cell temperature and the driving range. The partial correlation test results show that the relationship between the driving distance and maximum cell temperature is affected by minimum cell temperature, and there is no significant linear correlation between them (No. 25: $R = -0.066$, $p = 0.519 > 0.05$; No. 15: $R = -0.129$, $p = 0.351 > 0.05$; No. 12: $R = -0.218$, $p = 0.229 > 0.05$).

Variable Selection and Modeling.
In multivariate linear regression analysis, it is very important to choose the right independent variables so that the regression model has better generalization ability and higher prediction accuracy. Only the independent variables that play a major role should be retained, so that the average variation of the dependent variable is described with fewer independent variables. This avoids the overfitting and reduced generalization ability caused by entering all relevant variables into the model. Therefore, based on the results of the correlation analysis and partial correlation analysis, parameters that have a greater impact on the dependent variable are selected as independent variables, while parameters that have little influence on the dependent variable are ignored. In view of the above results, SOC and minimum cell temperature have been selected as the variables of the model. The multiple linear regression model is as follows:
$$y = \beta_0 + \beta_1 s + \beta_2 T + \varepsilon, \tag{3}$$
where $y$ represents the driving range in km; $s$ represents SOC, with values ranging from 20 to 100; $T$ represents the minimum cell temperature; $\beta_0$, $\beta_1$, and $\beta_2$ are the parameters to be estimated; and $\varepsilon$ is the residual error.

Parameter Identification and Statistical Test.
Once the form of the regression model is determined, the collected data are used to identify the unknown parameters in the model according to an estimation criterion. The least squares method is widely used for parameter identification because of its excellent properties. No. 25 BEV discharge process data, from 09:45 to 14:30 on September 1, 2015, are used as input, and least squares parameter identification has been performed. The parameter identification results ($\beta_0 = 126.960$, $\beta_1 = -1.719$, $\beta_2 = 1.627$) are substituted into (3), and the driving range prediction model is as follows:
$$\hat{y} = 126.960 - 1.719\,s + 1.627\,T,$$
where $s$ is the SOC and $T$ is the minimum cell temperature. A variety of statistical tests are conducted to ensure that the model has good stability and generalization ability, and the results are shown in Figure 1.
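The least squares identification step can be sketched with the normal equations. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, and the data are synthetic, generated from the coefficients reported above so that the fit can be checked against known values.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_mlr(soc, min_temp, distance):
    """Least squares estimate of (beta0, beta1, beta2) in y = b0 + b1*s + b2*T."""
    rows = [[1.0, s, t] for s, t in zip(soc, min_temp)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, distance)) for i in range(3)]
    return solve_linear(xtx, xty)


# Synthetic, noise-free data built from the coefficients reported in the text.
soc = [100, 90, 80, 70, 60, 50]
min_temp = [20, 25, 22, 27, 21, 26]
distance = [126.960 - 1.719 * s + 1.627 * t for s, t in zip(soc, min_temp)]
print(fit_mlr(soc, min_temp, distance))
```

With real measurements the residuals would not vanish, and a numerically more robust solver (for example a QR-based routine from a linear algebra library) would usually be preferred over explicit normal equations.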
As can be seen from Figure 1(a), the residual sequence of the model is approximately normally distributed, with a mean value of $4.19\times10^{-14}$, which is approximately 0, and a standard deviation of 0.98. Figure 1(b) compares the distribution of the standardized residuals with the normal distribution; the scatter points lie very close to the straight line, so the standardized residuals follow a normal distribution with zero mean. According to the statistical test results, the goodness of fit is high ($R^2 = 0.996$); the overall linear relationship between the driving range and the explanatory variables is significant ($F = 14095.605$, $p < 0.01$); the linear relationship between the driving range and each of the explanatory variables (SOC, MinT) is significant ($t_1 = 12.328$, $t_2 = -76.532$, $t_3 = 5.525$, $p_{1,2,3} < 0.01$); and there is no autocorrelation between residuals, that is, the residual sequence is independent (DW = 2.23).
In summary, the multiple regression model satisfies the requirements of the statistical tests and can be used for prediction and analysis.

Model Establishment and Verification.
Data from a total of 10 discharge processes of No. 25 BEV, from September 2 to September 22, 2015, have been preprocessed and used for least squares parameter identification, giving the model higher prediction precision and applicability. The final model parameters were obtained as $\beta_0 = 126.527$, $\beta_1 = -1.579$, and $\beta_2 = 1.564$. The final driving range prediction model is as follows:
$$\hat{y} = 126.527 - 1.579\,s + 1.564\,T,$$
where $s$ is the SOC, with value range [20, 100], and $T$ is the minimum cell temperature.
No. 15 BEV discharge process data on September 11, September 19, and September 28, 2015, have been selected to further verify the reliability and practicability of the model. The residual error sequence is shown in Figure 2, and the residual error statistics are shown in Table 5.
It can be seen from Table 5 that the residual error is between -3.6975 km and 3.3865 km, the mean absolute error is about 1.5 km, the root-mean-square error is less than 2 km, and the root-mean-square relative error is less than 0.5 km. Although it is feasible to predict the driving range with the multiple linear regression model, the residual errors are relatively large for real-world driving conditions.
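The error statistics discussed above can be computed as in the following sketch. The coefficients are the final model parameters reported in the text; the function names and the sample inputs are hypothetical, and the observed distances are invented for illustration only.

```python
import math


def predict_range(soc, min_temp):
    """Final multiple linear regression model reported in the text (km)."""
    return 126.527 - 1.579 * soc + 1.564 * min_temp


def error_summary(actual, predicted):
    """Minimum and maximum residual, mean absolute error, and RMSE."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    return {
        "min": min(residuals),
        "max": max(residuals),
        "mae": sum(abs(e) for e in residuals) / n,
        "rmse": math.sqrt(sum(e * e for e in residuals) / n),
    }


# Hypothetical verification samples: (SOC %, minimum cell temperature, observed km).
samples = [(90, 24, 23.0), (70, 25, 54.0), (50, 25, 88.0)]
observed = [km for _, _, km in samples]
predicted = [predict_range(s, t) for s, t, _ in samples]
print(error_summary(observed, predicted))
```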

Classification and Regression Tree.
A decision tree is a classification and regression method. The decision tree method generally includes three processes: feature selection, tree creation, and tree pruning (to remove overfitting). It can summarize well-performing classification rules from the training set, which not only fit the training data well but also make good predictions on unknown data [16]. In the classification and regression tree (CART), the generation process constructs a binary decision tree recursively from the training set, and the generated tree is then pruned by using the loss function and a validation set.
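Given the training set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $y$ is a continuous variable, the CART model generated under $D$ is defined as follows:
$$f(x) = \sum_{m=1}^{M} c_m\, I(x \in R_m). \tag{9}$$
From (9), CART divides its feature space into $M$ units $R_1, R_2, \ldots, R_M$, and each unit corresponds to a fixed output value $c_m$. The generation process of CART can be expressed as follows.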

Begin
In the feature space of the training set $D$, each region is recursively divided into two subregions, the optimal output value of each subregion is calculated, and the binary decision tree is constructed.
(1) Solve (10) to select the optimal cut variable $j$ and the optimal cut point $s$:
$$\min_{j,s}\left[\min_{c_1}\sum_{x_i\in R_1(j,s)}(y_i-c_1)^2+\min_{c_2}\sum_{x_i\in R_2(j,s)}(y_i-c_2)^2\right]. \tag{10}$$
Equation (11) defines the regions $R_1$ and $R_2$:
$$R_1(j,s)=\{x \mid x^{(j)}\le s\},\qquad R_2(j,s)=\{x \mid x^{(j)}>s\}. \tag{11}$$
Iterate through the variables $j$, scan the cut points $s$ in order for the specified cut variable $j$, and select the pair $(j, s)$ that makes (8) minimum.
(2) Figure out the corresponding optimal output values:
$$\hat{c}_m=\frac{1}{N_m}\sum_{x_i\in R_m(j,s)}y_i,\qquad m=1,2,$$
where $N_m$ is the number of samples in $R_m$.
(3) Continue to apply steps (1) and (2) to the two subregions until the stop condition is satisfied.
(4) The input space is divided into $M$ regions $R_1, R_2, \ldots, R_M$, and the CART in (9) is generated.
End
In the CART generation algorithm, at each recursive step the optimal output value of each division unit is obtained using the least square error criterion; that is, the optimal output value is the mean of all labels in the unit. A heuristic algorithm is used to solve for the optimal cut variables and optimal cut points. The decision tree constructed by the above generation algorithm is called the least squares CART.
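Step (1) of the generation algorithm, the exhaustive search for the cut variable and cut point, can be sketched as follows. The function name `best_split` and the toy data are hypothetical; the sketch handles one node only and does not build the full tree.

```python
def best_split(xs, ys):
    """Return (j, s, loss): the cut variable, cut point, and summed squared error
    of the best least-squares split of the samples (xs[i], ys[i])."""

    def sse(values):
        # Squared error around the mean, which is the optimal constant output.
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    for j in range(len(xs[0])):
        # Candidate cut points: observed values, excluding the largest so that
        # the right-hand region is never empty.
        for s in sorted({x[j] for x in xs})[:-1]:
            left = [y for x, y in zip(xs, ys) if x[j] <= s]
            right = [y for x, y in zip(xs, ys) if x[j] > s]
            loss = sse(left) + sse(right)
            if best is None or loss < best[2]:
                best = (j, s, loss)
    return best


# Toy data: the label depends only on the first feature.
xs = [[1, 5], [2, 3], [3, 8], [4, 1]]
ys = [1.0, 1.0, 5.0, 5.0]
print(best_split(xs, ys))  # (0, 2, 0.0)
```

Applying the same search recursively to the two resulting subregions, with a stop condition such as a maximum depth, would yield the least squares CART described above.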

CART Pruning.
In view of the overfitting problem of the CART generated above, a pruning operation is necessary. CART pruning cuts subtrees from the bottom of the decision tree to make it simpler, so that the tree has better generalization ability and higher prediction accuracy on unknown data.
In the pruning process, the loss function of a subtree is calculated by the following:
$$C_\alpha(T) = C(T) + \alpha\,|T|,$$
where $T$ represents any subtree, $C(T)$ represents the square error on the training data, $|T|$ is the number of leaf nodes of $T$, $\alpha$ ($> 0$) weighs the fit to the training data against the complexity of the model, and $C_\alpha(T)$ represents the overall loss of the subtree under $\alpha$. For a fixed $\alpha$, there exists a unique optimal subtree. The CART pruning algorithm is given as follows.

Algorithm Framework 2: CART Pruning
Input: $T_0$ constructed from CART generation;
Output: the optimal CART $T_\alpha$;
Begin
(1) Suppose $k = 0$, $T = T_0$;
(2) Suppose $\alpha = +\infty$;
(3) Calculate from the top down, for each internal node $t$, $C(T_t)$, $|T_t|$, and
$$g(t) = \frac{C(t) - C(T_t)}{|T_t| - 1},\qquad \alpha = \min(\alpha, g(t)),$$
where $T_t$ is the subtree rooted at $t$ and $C(t)$ is the square error of node $t$ taken as a single leaf;
(4) The internal node $t$ with $g(t) = \alpha$ is pruned, the output value of the resulting leaf node $t$ is calculated by averaging, and the tree $T$ is obtained;
(5) $k = k + 1$, $\alpha_k = \alpha$, $T_k = T$;
(6) Determine whether $T_k$ is composed of the root node and two leaf nodes; if it is, $T_k = T_n$; if not, go back to step (2);
(7) Based on the independent verification data set, the cross-validation method is used to select the optimal subtree $T_\alpha$ from the subtree sequence $\{T_k\}$ ($k = 1, 2, \ldots, n$) according to the square error.
End
In the above algorithm, $g(t)$ represents the degree to which the total loss function decreases after pruning. It follows that (1) the size of the optimal subtree $T_\alpha$ decreases as $\alpha$ increases; (2) the subtrees in the sequence $\{T_k\}$ ($k = 1, 2, \ldots, n$), obtained as $\alpha$ increases in small steps, are nested; (3) in the optimal subtree sequence, each subtree $T_k$ corresponds to one $\alpha_k$, so when the optimal subtree $T_k$ is determined, the corresponding $\alpha_k$ is determined. When the pruning operation is completed, the new base learner can be integrated into the existing GBDT model.
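The quantity $g(t)$ used in the pruning algorithm can be illustrated with a small sketch. The dictionary-based tree layout, the key names `sse` and `children`, and the numbers are hypothetical; `sse` stands for the square error $C(t)$ that a node would have if it were collapsed into a single leaf.

```python
def leaves_and_error(node):
    """Return (number of leaves, summed square error of the leaves) of a subtree."""
    if "children" not in node:
        return 1, node["sse"]
    parts = [leaves_and_error(child) for child in node["children"]]
    return sum(p[0] for p in parts), sum(p[1] for p in parts)


def g(node):
    """g(t) = (C(t) - C(T_t)) / (|T_t| - 1) for an internal node t."""
    n_leaves, subtree_sse = leaves_and_error(node)
    return (node["sse"] - subtree_sse) / (n_leaves - 1)


def weakest_link(node):
    """Return (alpha, node) for the internal node with the smallest g(t)."""
    if "children" not in node:
        return None
    best = (g(node), node)
    for child in node["children"]:
        candidate = weakest_link(child)
        if candidate is not None and candidate[0] < best[0]:
            best = candidate
    return best


# Hypothetical tree: the root has an internal child with two leaves, plus one leaf.
inner = {"sse": 30.0, "children": [{"sse": 10.0}, {"sse": 12.0}]}
tree = {"sse": 100.0, "children": [inner, {"sse": 20.0}]}
alpha, node = weakest_link(tree)
print(g(tree), alpha)  # 29.0 8.0
```

Collapsing the returned node into a leaf and repeating the search would produce the nested subtree sequence from which the final tree is chosen by validation.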

Gradient Boosting Decision Tree.
The CART is used as the base learner in the gradient boosting decision tree (GBDT) [17].For its excellent performance, GBDT is widely used in various fields of real life.

Estimation Function.
The purpose of the GBDT algorithm is to estimate an unknown function [18]. Since it is a kind of supervised learning, the prerequisite for learning is a sufficiently large labeled data set $\{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the size of the sample set, $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^{T}$ is the feature vector, and $y_i$ is the sample label. The purpose of supervised learning is to find an estimation function $\hat{f}(x)$ of the real function $f: x \to y$ that minimizes the loss function $L(y, f(x))$ to improve the accuracy of the prediction, as shown in the following:
$$\hat{f}(x) = \arg\min_{f(x)} L(y, f(x)). \tag{16}$$
Equation (16) can also be written in the form of minimized expected loss, as shown in the following:
$$\hat{f}(x) = \arg\min_{f(x)} E_{x,y}\big[L(y, f(x))\big].$$
To make the target problem concrete, the search space is restricted to a parametric family with parameters $\theta$, as shown in the following:
$$\hat{f}(x) = f(x, \hat{\theta}),\qquad \hat{\theta} = \arg\min_{\theta} E_{x,y}\big[L(y, f(x, \theta))\big].$$
So far, no specific assumptions have been made about the form of the estimation function or the real function. Moreover, in most cases, the problem described above does not have a closed-form solution, so it is usually solved by an iterative numerical procedure.

Optimization Method.
Under normal circumstances, the loss function adopted in optimization is the square loss function or the exponential loss function, and a general Boosting algorithm (such as AdaBoost) can achieve the goal of optimization. However, for a general loss function, it is difficult to adopt common optimization methods. In response to this problem, Friedman proposed the GBDT algorithm, which uses the value of the negative gradient of the loss function, as shown in (19), to approximate the residuals and fit regression trees, improving the performance of the prediction model.

$$-\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)} \tag{19}$$
GBDT is an algorithm that solves for the prediction model recursively. At each stage $m$ of the solution, an imperfect model $f_m(x)$ is available (at the beginning, a very weak model that only predicts the average of the training set); a better model is then obtained by adding an estimator $h(x)$ to $f_m(x)$, as shown in the following:
$$f_{m+1}(x) = f_m(x) + h(x).$$
According to the empirical risk minimization principle,
$$f_0(x) = \arg\min_{\gamma}\sum_{i=1}^{N} L(y_i, \gamma),\qquad f_m(x) = f_{m-1}(x) + \arg\min_{h}\sum_{i=1}^{N} L\big(y_i, f_{m-1}(x_i) + h(x_i)\big).$$
Then, the gradient descent method is used to minimize the loss function, and the model is updated according to the following:
$$f_m(x) = f_{m-1}(x) - \gamma_m \sum_{i=1}^{N} \nabla_{f_{m-1}} L\big(y_i, f_{m-1}(x_i)\big).$$
To sum up, the algorithm framework of GBDT is as follows:
(1) Initialize the model with a constant value: $f_0(x) = \arg\min_{\gamma}\sum_{i=1}^{N} L(y_i, \gamma)$.
(2) For $m = 1$ to $M$, do
Calculate the pseudo residuals:
$$r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)},\qquad i = 1, \ldots, N.$$
Obtain $h_m(x)$ by using CART to fit the pseudo residuals, and calculate the weighting coefficient $\gamma_m$:
$$\gamma_m = \arg\min_{\gamma}\sum_{i=1}^{N} L\big(y_i, f_{m-1}(x_i) + \gamma\, h_m(x_i)\big).$$
Update the model:
$$f_m(x) = f_{m-1}(x) + \gamma_m h_m(x).$$
(3) Get the final prediction model $f_M(x)$.
End
In some cases, overfitting and biased prediction errors may occur in the above algorithm. In general, regularization can be used to reduce overfitting by controlling the fitting process, so the updating rule of the above algorithm is modified as follows:
$$f_m(x) = f_{m-1}(x) + \nu\,\gamma_m h_m(x),$$
where $\nu$ is called the "learning rate", which is the weight reduction coefficient of the base learner. It has been found that a small learning rate ($\nu < 0.1$) can significantly improve the generalization ability of the model, at the cost of an increased number of iterations. Overall, a regularized GBDT algorithm framework is adopted in the following modeling process.
Considering the influence of the external environment on BEVs [19], weather information for the Beijing urban area, which comes from the national meteorological science data sharing service platform, needs to be integrated into the training and test sets. Because the speed changes frequently and has a nonlinear effect on the driving range, the speed variable is replaced by its average: the speed associated with driving range $k$ is the average speed over the driving range from 0 to $k$. In this way, the effect of average speed on the driving range is incorporated into the prediction model.
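The regularized boosting procedure described in this section can be sketched for the square loss as follows. This is a simplified illustration under stated assumptions, not the model used in the study: the base learner is a depth-one regression tree (a stump) on a single feature rather than a full CART, the function names and toy data are hypothetical, and the default values of `n_iter` and `learning_rate` are placeholders. For the square loss with mean-valued leaves, the weighting coefficient found by line search equals 1, so it is omitted.

```python
def fit_stump(x, r):
    """Least-squares stump on one feature: returns (cut point, left value, right value)."""
    best = None
    for s in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        c1, c2 = sum(left) / len(left), sum(right) / len(right)
        loss = sum((ri - c1) ** 2 for ri in left) + sum((ri - c2) ** 2 for ri in right)
        if best is None or loss < best[0]:
            best = (loss, s, c1, c2)
    return best[1:]


def fit_gbdt(x, y, n_iter=300, learning_rate=0.05):
    """Gradient boosting with square loss; returns (f0, learning_rate, stumps)."""
    f0 = sum(y) / len(y)  # the initial model predicts the training mean
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_iter):
        # For L = (y - f)^2 / 2 the negative gradient is the ordinary residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        s, c1, c2 = fit_stump(x, residuals)
        stumps.append((s, c1, c2))
        pred = [pi + learning_rate * (c1 if xi <= s else c2) for xi, pi in zip(x, pred)]
    return f0, learning_rate, stumps


def predict(model, xi):
    """Evaluate the boosted model at a single feature value."""
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(c1 if xi <= s else c2 for s, c1, c2 in stumps)


# Toy one-dimensional data with a step between x = 3 and x = 4.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]
model = fit_gbdt(x, y)
print(predict(model, 1.0), predict(model, 6.0))
```

A smaller learning rate makes each stump contribute less, which is why more iterations are needed to reach the same training error.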

GBDT Modeling
The purpose of the GBDT algorithm is to extract the structure and essence of the target problem from the original data set. For the selected features to explain the problem well, feature selection should allow the prediction model to be constructed with high efficiency and low cost while improving prediction accuracy. In fact, feature extraction selects the optimal feature set for model training from the original feature set, and good features often improve the prediction accuracy of the GBDT algorithm. According to the previous analysis and research, SOC, MaxT, MinT, MaxV, MinV, TotalV, EDT, EDV, AveSpeed, TotalMile, Temper, Visibility, and Precip are extracted to train and test the model.

Parameter Setting and Relative Importance Calculation.
The GBDT algorithm requires some key parameters to be set, including the step length of each iteration, $\nu$; the loss function, $L$; the maximum depth of the tree, MaxDepth; and the number of iterations, $M$.
The specific steps of the parameter adjustment of the GBDT algorithm are as follows.
(1) According to experience, the maximum depth of the tree is set to 10 (reference range: 6 to 20). Considering the accuracy requirement, the step length is set to 0.1, and the loss function is set to the mean square error. An appropriate number of iterations is searched for within the range of 100 to 400.
(2) Then, the maximum depth of the tree and the step length $\nu$ are examined and adjusted until the optimal parameters are found.
In practice, the input features rarely contribute equally. To understand the contribution of each feature to driving range prediction, the relative importance of the input variables needs to be calculated. The global relative importance of feature $j$ is calculated as shown in the following:
$$\hat{I}_j^2 = \frac{1}{M}\sum_{m=1}^{M}\hat{I}_j^2(T_m),$$
where $M$ is the number of base learners and $T_m$ is the $m$th tree. The importance of feature $j$ in a single tree $T$ is as shown in the following:
$$\hat{I}_j^2(T) = \sum_{t=1}^{J-1}\hat{i}_t^2\,\mathbf{1}(v_t = j),$$
where $J$ is the number of leaf nodes, $J-1$ is the number of non-leaf nodes, $v_t$ is the feature associated with node $t$, and $\hat{i}_t^2$ is the reduction in square loss after the division at node $t$. In short, the importance of a feature is the mean of its importance over all the base learners.
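The importance calculation can be sketched as follows, assuming each tree is summarized by a list of (feature index, square-loss reduction) pairs, one per non-leaf node. This representation, the function names, and the numbers are hypothetical.

```python
def tree_importance(splits, n_features):
    """Sum the square-loss reductions of the non-leaf nodes that split on each feature."""
    scores = [0.0] * n_features
    for feature, gain in splits:
        scores[feature] += gain
    return scores


def global_importance(trees, n_features):
    """Average the per-tree importance of every feature over all base learners."""
    totals = [0.0] * n_features
    for splits in trees:
        for j, value in enumerate(tree_importance(splits, n_features)):
            totals[j] += value
    return [total / len(trees) for total in totals]


# Two hypothetical trees over two features (0 and 1).
trees = [[(0, 4.0), (1, 1.0)], [(0, 2.0), (0, 1.0)]]
print(global_importance(trees, n_features=2))  # [3.5, 0.5]
```

The resulting scores are often rescaled so that the most important feature has a value of 100, which makes charts of relative importance easier to read.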

Model Establishment.
According to the parameter setting method above, the statistical error results of the initial iteration are shown in Figure 3. As shown in Figure 3, when the number of iterations is in [100, 300], the mean absolute error rises and the root mean square error decreases; when the number of iterations is greater than 300, the mean absolute error shows a downward trend, and the root mean square error first increases and then decreases. Since the maximum value of the mean absolute error differs from the minimum by only 0.00466, and considering the stability of the prediction model, the optimal number of iterations is taken as the one with the minimum root mean square error, 300. The optimal maximum depth of the tree is then sought, and its statistical error is shown in Figure 4. From Figure 4, as the maximum depth of the tree increases, the mean absolute error fluctuates. However, the difference between the maximum and minimum values of the mean absolute error is only 0.01212. To make the model more robust, the optimal maximum depth should be chosen according to the root mean square error. The minimum mean square error of 0.1733 corresponds to a maximum depth of 11, which is the optimal maximum depth of the tree. Then, the other optimal parameters are found to be $\nu = 0.05$ and $M = 300$.
The GBDT model is trained with the optimal parameters, and the results are shown in Figure 5. Figure 5 shows that, at the beginning of the iteration, both the training set error and the test set error are large; the two errors decrease as the number of iterations increases; when the number of iterations reaches about 300, the two error curves basically coincide and stop changing. The error statistics of the GBDT model are RMSE = 0.278, MAE = 0.813, maximum error = 1.61, and minimum error = -1.58.
According to (28), the relative importance of each feature is shown in Figure 6. It can be seen that SOC and TotalV are the key features of the GBDT model for driving range.

Model Verification.
To verify the reliability of the GBDT prediction model, No. 12 BEV discharge process data on March 10, 2015, August 21, 2015, and January 9, 2016, are used for verification; the result is shown in Figure 7.
The results of the minimum error, the maximum error, and the mean absolute error obtained from the verification are shown in Table 6.
Table 6 shows that the maximum prediction error is 1.58 km, the minimum prediction error is -1.41 km, and the average prediction error is about 0.7 km.
When the data have many features and the relationships between them are complex, building a global model is difficult. One approach is to use conventional linear regression analysis, in which some variables are excluded from the model because of multicollinearity. However, that does not mean the model ignores the global impact of the excluded variables; they still affect the model through the variables that remain. The model established by the traditional regression method can be used for prediction, and the results are reliable. Another approach is to use a decision tree, of which CART is a widely used example. In regression with CART, each node has a predicted value equal to the average value of all samples belonging to the node. When branching, every threshold of every attribute is examined to find the best segmentation point, and the criterion is to minimize the mean squared error. Branching continues until a node is indivisible, the tree reaches a certain height, the attributes are used up, or the mean square error no longer decreases; the value of each node is set to the average of the training samples that fall on it. Test samples are routed through the tree according to the segmentation points learned during training until they fall into a leaf, and the resulting predictions can better meet the requirements of real-world driving conditions.

Conclusions
In recent years, the number of BEVs has been increasing gradually, but the inaccurate display of the remaining power has been restricting the promotion and use of BEVs. The purpose of this study is to address the "range anxiety" caused by battery performance and other factors by predicting the BEV driving range. Many studies take few factors into account when establishing the driving range prediction model, which may lead to poor applicability and prediction accuracy. In this study, a prediction model for BEV driving range based on machine learning has been established. The study is innovative in its application of a machine learning method, the GBDT algorithm, which includes a very large number of feature variables that cannot be considered by conventional regression methods. Moreover, the study is novel in the accuracy and reliability of its prediction model for BEV driving range.
As the GBDT model is a black box algorithm, it can only give the importance distribution of the feature variables but cannot specify the interconnection and interaction between them. Future studies could investigate the correlation of variables within the model. The prediction model proposed in this study can meet the requirements of actual working conditions, but it needs to be further optimized to improve the prediction accuracy. For instance, cloud computing can be applied to the modeling task, retraining the model from time to time to obtain a more accurate prediction model.

Figure 1: The statistical tests of standardized residuals.

Figure 2: The residual error sequence chart of No. 15 BEV.

5.4. Discussion.
No. 15 BEV discharge process data on September 11, 2015, September 19, 2015, and September 28, 2015, have been selected as a data sample. For comparison, three methods, namely GBDT, CART, and multiple linear regression (MLR), are applied to the same data sample. The comparison results are shown in Table 7.

Table 1: Statistical comparison of SOC before and after deletion operations.

Table 2: Analysis of the performance index of interpolation.

Table 3: Correlation test result between parameters and driving range.

Table 4: Partial correlation test results with SOC controlled.
R represents the partial correlation coefficient between the parameters and the driving distance; Sig. represents the probability value $p$ of the $t$-test statistic, and $\alpha = 0.05$.

Table 5: Statistical error results of prediction.

Table 6: Prediction error of the GBDT model.

Classification and Regression Tree (CART) was put forward by Breiman et al. in 1984. Different from the ID3 and C4.5 classification trees, whose non-leaf nodes have multiple branches, CART's non-leaf nodes have only two branches, and the output value of a leaf node is the mean of the sample labels.

5.1. Data Integration and Feature Extraction.
First, No. 25 BEV discharge data from March 1, 2015, to March 1, 2016, are selected as the training set, and No. 15 BEV discharge process data from January, March, and August 2015 are selected as the test set.