Battery health prediction under generalized conditions using a Gaussian process transition model
arXiv:1807.06350v1 [stat.AP] 17 Jul 2018

Accurately predicting the future health of batteries is necessary to ensure reliable operation, minimise maintenance costs, and calculate the value of energy storage investments. The complex nature of degradation renders data-driven approaches a promising alternative to mechanistic modelling. This study predicts the changes in battery capacity over time using a Bayesian non-parametric approach based on Gaussian process regression. These changes can be integrated against an arbitrary input sequence to predict capacity fade in a variety of usage scenarios, forming a generalised health model. The approach naturally incorporates varying current, voltage and temperature inputs, crucial for enabling real-world application. A key innovation is the feature selection step, where arbitrary-length current, voltage and temperature measurement vectors are mapped to fixed-size feature vectors, enabling them to be efficiently used as exogenous variables. The approach is demonstrated on the open-source NASA Randomised Battery Usage Dataset, with data from 26 cells aged under randomized operational conditions. Using half of the cells for training, and half for validation, the method is shown to accurately predict non-linear capacity fade, with a best-case normalised root mean square error of 4.3%, including accurate estimation of prediction uncertainty.


Introduction
Electrochemical batteries, such as lithium-ion and lead-acid cells, experience degradation over time and during usage, leading to decreased energy storage capacity and increased internal resistance. Being able to predict the rate of degradation and the remaining useful life (RUL) of a battery is important for performance and economic reasons. For example, in an electric vehicle, the driveable range is directly related to the battery capacity. For energy storage asset valuation, depreciation, warranty, insurance and preventative maintenance purposes, predicting RUL at design stage and during operation is crucial, and the investment case is strongly dependent on the degradation behaviour [1]. To accurately estimate the second-hand value of assets such as EVs and grid batteries, credible predictions of RUL are required.
Unfortunately, battery degradation is caused by many complex interacting chemical and mechanical processes [2,3], and physical modelling from first principles is very challenging. To mitigate uncertainty in lifetime, batteries are often over-sized and under-used, which results in increased system costs and sub-optimal performance. Hence, new approaches for accurate health prognostics are required, and form an important component of a modern battery management system or energy management system.
Since the performance of a battery in an application is largely dependent on its nominal capacity and internal resistance, the state of health (SoH) is typically defined by one or both of these parameters. In the present case we consider just cell capacity as the SoH metric, but the methods outlined in this paper could be applied to any other SoH metric, such as internal resistance, or capacity at some nominal C-rate. A variety of techniques may be applied for SoH measurement and estimation [4], but in this paper we simply assume that SoH metrics are available, for example from a battery management system.
The conventional approach to battery SoH forecasting is to fit a parametric function to a broad set of ageing data measured under controlled laboratory conditions. Careful judgement is required to decide on the exact form of parametric model to use. For example, Schimpe [5] investigated both calendar and cycle ageing of lithium iron phosphate (LFP) batteries with respect to temperature and state of charge (SoC) and found that capacity evolved with time according to an empirical expression, Eq. (1), in which $Q_\mathrm{loss}$ is the capacity fade at some point in time, $k_{1 \ldots 4}$ are empirically fitted stress factors that are a function of temperature, charging current, time and SoC, $Q_\mathrm{tot}$ is the total charge throughput to time $t$, and $Q_\mathrm{ch}$ is the charge throughput only during charging, to time $t$. The stress factors $k_{1 \ldots 4}$ typically fit an Arrhenius equation in which $\alpha$ is a fitted constant, $u$ is some input such as current or the reciprocal of temperature, and $u_\mathrm{ref}$ is a reference value for that input. Very similar approaches have been developed by others for LFP batteries [6], and for a variety of other chemistries including NMC lithium-ion [7,8] and lead-acid [9]. These empirical degradation models are essentially parametric curve fitting using specified underlying functions such as exponentials, square roots etc. For some kinds of battery degradation data, such as [10], these approaches may give a reasonable fit to the measured behaviour, although there is very little information in the literature about their long-term predictive accuracy. These approaches also require the form of the model to be specified a priori; for example, Eq. (1) assumes decoupling of inputs, and this may not be the case. Additionally, many degradation datasets exhibit an accelerated capacity fade regime in later life (see [11]), and this approach is not able to model such a regime change. Also, accuracy may be limited when environmental and load conditions differ from the training dataset.
As an alternative approach to empirical parametric functions fitted to laboratory test data, others have developed 'first principles' electrochemical models of battery ageing. These propose and model a set of underlying physical ageing mechanisms. For example a popular ageing mechanism is growth of the anode solid electrolyte interphase (SEI) through reduction of the ethylene carbonate in the electrolyte, modelled as a diffusion-limited single step charge transfer reaction [12,13]. This can be augmented to include additional physics related to lithium plating [14], particle cracking [15] and other mechanisms. Although reasonable results are demonstrated for calendar ageing, huge challenges remain with respect to parametrisation and validation of such models, and what physics to include to capture all the relevant ageing mechanisms and their interactions.
In contrast to these approaches, so-called data-driven battery ageing models are beginning to be investigated. These have some similarities with the empirically fitted functions previously discussed, but new techniques from machine learning allow much greater flexibility in these models than can be obtained using pre-specified parametric functions. The simplest formulation of this is direct fitting of capacity data with respect to time, or cycle count, which allows RUL estimation by extrapolation to future values. A variety of data-driven techniques have been explored in this context, including non-parametric approaches such as support vector machines [16,17,18,19], and Bayesian non-parametric approaches such as Gaussian process (GP) regression [20,21,22]. A non-parametric model is one whose expressivity (as would increase with the degree of a polynomial, for instance) naturally adapts to the complexity of data. Rather than having no parameters, a non-parametric model is perhaps better thought of as one with a number of parameters that can scale with the data and could become arbitrarily large. Bayesian approaches naturally incorporate estimates of uncertainty into predictions, allowing a model to acknowledge the varying probabilities of a range of possible future health values, rather than just giving a single predicted value.
These approaches have been demonstrated to work well when a battery health dataset is available for batteries that have all been cycled in a similar way. For example, our previous work [23] on RUL prediction applied a multiple-output Gaussian process model to incorporate data from multiple batteries, all cycled in the same way, demonstrating a large improvement in accuracy of RUL estimation over existing methods. However, for real world RUL prediction at design stage, or for preventative maintenance, a much more flexible approach is needed that allows health predictions to be made as a function of the changing stress factors such as time, charge throughput and temperature etc. The previously discussed parametric models can incorporate dependence on external inputs, but are limited to pre-specified functions. In other words, they assume that the shape of the degradation trajectory is known a priori, which limits their applicability.
To address this, we introduce the idea of a Bayesian non-parametric transition model for battery health. Rather than fitting the SoH data directly as a function of time or cycle count, the model predicts the changes in SoH from one point to the next as the battery is used, as a function of the usage. This is explained in detail in the next section.

Transition model
The approach in this paper formulates a transition model to predict the capacity changes between periods of usage that we term 'load patterns'. A load pattern differs from a standard battery charge-discharge cycle: it is the time series of current, voltage and temperature data between any two capacity measurements or estimates, $Q_i$ and $Q_{i+1}$. Load patterns do not need to be uniformly spaced, i.e. they could be short or long periods of usage, and might include multiple charge-discharge events.
The goal of a regression problem is to learn the mapping from input vectors $x$ to outputs $y$, given a labelled training set of input-output pairs $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N_D}$, where $N_D$ is the number of training examples. In the present case, the inputs $x_i \in \mathbb{R}^n$ are vectors of selected features (see section 2.2) for load pattern $u_i$, and the outputs $y_i \in \mathbb{R}$ are the corresponding differences in measured capacity between load pattern $u_i$ and $u_{i+1}$. The underlying model takes the form $y = f(x) + \varepsilon$, where $f(x)$ represents a latent function and $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ is an independent and identically distributed noise contribution.
The learned model can then be used to make predictions on a set of test inputs $x^* = \{x_i^*\}_{i=1}^{N_T}$ (i.e. load patterns where we wish to estimate the capacity), producing outputs $y^* = \{y_i^*\}_{i=1}^{N_T}$, where $N_T$ is the number of test indices. In our case we are interested in predicting the capacity changes in a new (previously unseen) battery cell, which has been exposed to a known test regime. This is called the validation or test dataset.
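To make the link between predicted capacity changes and the capacity trajectory concrete, the following minimal sketch accumulates the predicted changes from a known starting capacity. It assumes the sign convention that each output is the capacity after a load pattern minus the capacity before it, and that the predictions are jointly Gaussian with a known covariance matrix. The function name `integrate_capacity` and all numbers are illustrative, not taken from the paper.

```python
import math


def integrate_capacity(q_start, delta_mean, delta_cov):
    """Accumulate predicted capacity changes into a capacity trajectory.

    q_start    : measured capacity before the first load pattern (Ah)
    delta_mean : predicted mean capacity change for each load pattern (Ah)
    delta_cov  : covariance matrix of those changes (list of lists, Ah^2)
    Returns (means, stds) of the capacity after each load pattern.
    """
    means, stds = [], []
    q = q_start
    for k, dq in enumerate(delta_mean):
        q += dq
        # Variance of a sum of correlated Gaussians: sum every entry of the
        # leading (k+1) x (k+1) block of the covariance matrix.
        var = sum(delta_cov[i][j] for i in range(k + 1) for j in range(k + 1))
        means.append(q)
        stds.append(math.sqrt(max(var, 0.0)))
    return means, stds


# Illustrative numbers only (not taken from the dataset).
delta_mean = [-0.02, -0.03, 0.01]
delta_cov = [[1e-4, 0.0, 0.0],
             [0.0, 1e-4, 0.0],
             [0.0, 0.0, 1e-4]]
means, stds = integrate_capacity(2.0, delta_mean, delta_cov)
for k, (m, s) in enumerate(zip(means, stds), start=2):
    print(f"Q_{k}: {m:.3f} Ah +/- {2 * s:.3f} Ah")
```

Because the variances accumulate, the credible interval on capacity widens with each load pattern, even when the uncertainty on each individual change is constant.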

Input feature extraction
Each load pattern, $u_i$, may contain within it an arbitrary number of time steps, $N_i$. However, in order to use the inputs in our model, since the capacity measurements are only known per load pattern, we must first map time-series data to a fixed-size input vector. In other words, assuming there are $N_i$ time steps within a load pattern $u_i$, then the measurements $I \in \mathbb{R}^{N_i}$, $V \in \mathbb{R}^{N_i}$, $T \in \mathbb{R}^{N_i}$ are mapped to a single $n$-dimensional input vector, $x$, where $n$ is the number of features of interest. Irrespective of the number of time steps in a load pattern, the size of the input vector $x$ is the same.
For each load pattern, $u_i$, the features to be extracted are defined by prior assumptions about what causes a battery to age. As discussed in the preceding section, there are many possible stress factors that affect battery ageing, depending on the dataset and model. However, in the dataset used it was found that accurate results could be obtained with only a small number of factors (see Table 4), as follows.

The first component of the input vector, for the $i$th load pattern, is the total time elapsed during the load pattern, given by

$$x_1 = t_{i+1} - t_i,$$

where $t_i$ and $t_{i+1}$ are the times at the start and end of the load pattern respectively. The second component is the charge throughput, $Q_\mathrm{thru}$, during the load pattern, i.e. the time integral of the absolute current through the cell during the load pattern, given by

$$x_2 = Q_\mathrm{thru} = \int_{t_i}^{t_{i+1}} |I(t)| \, dt.$$

The third component, $x_3$, is the absolute time value, in seconds, since the beginning of the whole dataset.

As discussed later, for the dataset considered here, it was found that the choice of model and number of overlapping load patterns were generally more important for determining predictive accuracy than the inclusion of additional input features. However, with a larger dataset, additional features could improve predictive accuracy. These might include the following. Firstly, the present cell capacity,

$$x_4 = Q_i.$$

Secondly, the time elapsed during which certain conditions are met. This is achieved by defining a selection of current, voltage and temperature ranges, and evaluating the time spent by the battery within these ranges:

$$x_j = \int_{t_i}^{t_{i+1}} \mathbb{1}\left[P_l \le P(t) < P_u\right] dt$$

for $j \in \{5, 6, \ldots\}$, where $P$, $P_l$ and $P_u$ are the parameter of interest and its lower and upper bounds respectively, and $\mathbb{1}[\cdot]$ equals 1 when its condition holds and 0 otherwise. For example, a battery's ageing behaviour is expected to be affected by high or low temperatures [5]. Hence, one might define the duration of time the battery spends (1) below 0 °C, (2) between 0 and 40 °C, and (3) above 40 °C as three distinct inputs:

$$x_5 = \int_{t_i}^{t_{i+1}} \mathbb{1}\left[T(t) < 0\,^\circ\mathrm{C}\right] dt, \quad x_6 = \int_{t_i}^{t_{i+1}} \mathbb{1}\left[0\,^\circ\mathrm{C} \le T(t) \le 40\,^\circ\mathrm{C}\right] dt, \quad x_7 = \int_{t_i}^{t_{i+1}} \mathbb{1}\left[T(t) > 40\,^\circ\mathrm{C}\right] dt.$$

An example of an input vector for a single load pattern is given in Table 1.
In this case, inputs were defined for ranges of temperature and current. Of course, additional inputs could also be defined by voltage ranges, but these have been omitted here for clarity of presentation. Note that the sum of all the times spent in each parameter range (e.g. in each temperature or current range) must equal the total time elapsed within that load pattern. Fig. 1 shows an illustrative schematic of the first 4 load patterns for a single cell. There is one capacity measurement ($Q_1$) at the very start of the cell's life and then 4 subsequent measurements ($Q_2$ to $Q_5$) at later times. The load patterns consist of everything that occurs between each capacity measurement; each load pattern is translated into an equal-sized input vector, $x_i$. Fig. 2 gives examples of real measurement data for two different cells, including the capacity values $Q$, some of the extracted inputs for each load pattern, and example time series measurements corresponding to a portion of a load pattern. The dataset used for this work is explained in section 3.
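The mapping from a variable-length time series to a fixed-size input vector can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `extract_features`, the rectangle-rule integration, the use of the load pattern's start time as the absolute time, and the 0 °C and 40 °C bin edges are all choices made here for the example.

```python
def extract_features(t, current, temperature, temp_edges=(0.0, 40.0)):
    """Map one load pattern (a time series) to a fixed-size feature vector.

    t           : sample times in seconds (increasing), length N_i
    current     : cell current in A at each sample
    temperature : cell temperature in deg C at each sample
    Returns [elapsed time, charge throughput (A s), absolute start time,
             time below temp_edges[0], time between the edges,
             time above temp_edges[1]].
    """
    lo, hi = temp_edges
    throughput = 0.0
    time_in_range = [0.0, 0.0, 0.0]
    for k in range(len(t) - 1):
        dt = t[k + 1] - t[k]
        # Rectangle rule: hold each sample until the next one (an assumption).
        throughput += abs(current[k]) * dt
        if temperature[k] < lo:
            time_in_range[0] += dt
        elif temperature[k] <= hi:
            time_in_range[1] += dt
        else:
            time_in_range[2] += dt
    elapsed = t[-1] - t[0]
    # Absolute time is taken here as the start of the load pattern (assumed).
    return [elapsed, throughput, t[0]] + time_in_range


# Two load patterns of different length give vectors of the same size.
short = extract_features([100, 160, 220, 280],
                         [1.0, -2.0, 0.5, 0.0],
                         [25.0, 45.0, -5.0, 25.0])
long = extract_features(list(range(0, 700, 100)),
                        [2.0] * 7,
                        [30.0] * 7)
print(len(short), len(long))
```

The three time-in-range entries always sum to the elapsed time, which matches the consistency condition noted above.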

Evaluation
The model predictions are evaluated using three different metrics, which reflect the quantities of interest in a practical application. The first is the root-mean-squared error (RMSE) in the mean output of the model (i.e. the capacity differences), defined as

$$\mathrm{RMSE}_{\Delta Q} = \sqrt{\frac{1}{N_T} \sum_{i=1}^{N_T} \left(y_i^* - \hat{y}_i^*\right)^2},$$

where $N_T$ is the number of points to be evaluated (i.e. all points in the test dataset), $y_i^*$ is the measured capacity difference using the test dataset and $\hat{y}_i^*$ is the estimated mean capacity difference predicted by the model, each between load pattern $u_i$ and $u_{i+1}$. The second is the RMSE in actual capacity, defined as

$$\mathrm{RMSE}_{Q} = \sqrt{\frac{1}{N_T} \sum_{i=1}^{N_T} \left(Q_i^* - \hat{Q}_i^*\right)^2},$$

where $Q_i^*$ is the measured capacity (using the test dataset) and $\hat{Q}_i^*$ is the estimated mean capacity, each at load pattern $i$. This may also be expressed as a normalised value, $\mathrm{RMSE}_\mathrm{norm}$ (Eq. 5), to facilitate comparison with other studies in which the absolute capacities may be of different magnitudes.

Note that it is possible for a model to perform well in one of these metrics but poorly in the other. For instance, if a model over-predicts $\Delta Q$ every second load pattern but under-predicts on alternate load patterns, the overall capacity evolution may be accurate (implying good $\mathrm{RMSE}_Q$), but the individual predictions might not be (implying poor $\mathrm{RMSE}_{\Delta Q}$). Hence, a good model should have low values of both these metrics.
Thirdly, since the approach used here is probabilistic, the accuracy of the uncertainty estimates can also be quantified using the calibration score (CS). This is defined as the frequency of measured results in the test dataset that are within a predicted credible interval. For a ±2σ interval, corresponding to a 95.4% probability for a Gaussian distribution, the score $\mathrm{CS}_{2\sigma}$ is the fraction of the $N_T$ test points whose measured value lies within two predicted standard deviations of the predicted mean. Therefore, $\mathrm{CS}_{2\sigma}$ should be approximately 0.954 if the uncertainty predictions are accurate. Higher or lower scores indicate under- or over-confidence, respectively.
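The three metrics can be illustrated with a short sketch. The helper names (`rmse`, `calibration_score`, `cumulative`) and the numbers are hypothetical, and the calibration score is applied here to capacity values, which is one possible reading of the definition. The example reproduces the situation in which alternating over- and under-prediction of the capacity change gives a smaller error in capacity than in capacity difference.

```python
import math


def rmse(measured, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


def calibration_score(measured, pred_mean, pred_std, n_sigma=2.0):
    """Fraction of measurements inside mean +/- n_sigma * std."""
    inside = sum(1 for m, mu, s in zip(measured, pred_mean, pred_std)
                 if abs(m - mu) <= n_sigma * s)
    return inside / len(measured)


def cumulative(q_start, deltas):
    """Capacity after each load pattern, given the per-pattern changes."""
    out, q = [], q_start
    for d in deltas:
        q += d
        out.append(q)
    return out


# Illustrative values: the model alternately over- and under-predicts the fade.
dq_true = [-0.02, -0.02, -0.02, -0.02]
dq_pred = [-0.03, -0.01, -0.03, -0.01]
q_true = cumulative(2.0, dq_true)
q_pred = cumulative(2.0, dq_pred)

rmse_dq = rmse(dq_true, dq_pred)   # error in capacity differences
rmse_q = rmse(q_true, q_pred)      # error in capacity itself
cs_wide = calibration_score(q_true, q_pred, [0.006] * 4)
cs_narrow = calibration_score(q_true, q_pred, [0.004] * 4)
print(rmse_dq, rmse_q, cs_wide, cs_narrow)
```

With the narrower predicted standard deviation, only half of the measurements fall inside the ±2σ band, which is the signature of an over-confident model.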

Gaussian process regression
This section gives a brief overview of Gaussian process regression, the main approach chosen in this paper for modelling the transition in health from one load pattern to the next. A Gaussian process (GP) [24] defines a probability distribution over functions, and is denoted as

$$f(x) \sim \mathcal{GP}\left(m(x), \kappa(x, x')\right),$$

where $m(x)$ and $\kappa(x, x')$ are the mean and covariance functions respectively, denoted by

$$m(x) = \mathbb{E}\left[f(x)\right], \qquad \kappa(x, x') = \mathbb{E}\left[\left(f(x) - m(x)\right)\left(f(x') - m(x')\right)\right].$$

For any finite collection of input points, say $X = \{x_1, \ldots, x_{N_D}\}$, this process defines a probability distribution $p(f(x_1), \ldots, f(x_{N_D}))$ that is jointly Gaussian, with some mean $m(X)$ and covariance $K(X)$ given by $K_{ij} = \kappa(x_i, x_j)$. Gaussian process regression is a way to undertake non-parametric regression with Gaussian processes. Rather than suggesting a parametric form for the function $f(x, \phi)$ and estimating the parameters $\phi$ (as in parametric regression), we instead assume that the function $f(x)$ is a sample from a Gaussian process as defined above.
In this work, we use the Matérn covariance function

$$\kappa(x, x') = \sigma_f^2 \, \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, \lVert x - x' \rVert}{\ell}\right)^{\nu} R_\nu\!\left(\frac{\sqrt{2\nu}\, \lVert x - x' \rVert}{\ell}\right),$$

with output scale $\sigma_f$, length scale $\ell$ and smoothness hyperparameter $\nu = 5/2$ (larger $\nu$ implies smoother functions), and where $R_\nu$ is the modified Bessel function of the second kind. This kernel was chosen because it is suitable for functions with varying degrees of smoothness, although similar performance was observed using other common kernels, including the squared exponential [24]. A fuller discussion of the kernels that may be used for GP regression in the context of battery health prediction is given in [23]. Finally, we also compare performance against a linear kernel, since this is equivalent to Bayesian linear regression [24]; this kernel includes a constant $c$ defining the offset of the linear function. The mean function of the GP is commonly defined as $m(x) = 0$, and we follow this convention here.

Now, if one observes a labelled training set of input-output pairs $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N_D}$, predictions can be made at test indices $X^*$ by computing the conditional distribution $p(y^* \mid X^*, X, y)$. This can be obtained analytically by the standard rules for conditioning Gaussians [25], and (assuming a zero mean for notational simplicity) results in a Gaussian distribution given by

$$p(y^* \mid X^*, X, y) = \mathcal{N}(y^* \mid m^*, \sigma^*),$$

where

$$m^* = K(X, X^*)^T K(X, X)^{-1} y,$$

$$\sigma^* = K(X^*, X^*) - K(X, X^*)^T K(X, X)^{-1} K(X, X^*).$$
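For the half-integer smoothness used here, $\nu = 5/2$, the Matérn kernel is known to reduce to a closed form without Bessel functions. Writing $r = \lVert x - x' \rVert$ and using $\ell$ for a characteristic length scale (a symbol introduced here for illustration), the standard result is

$$\kappa_{5/2}(x, x') = \sigma_f^2 \left(1 + \frac{\sqrt{5}\, r}{\ell} + \frac{5 r^2}{3 \ell^2}\right) \exp\!\left(-\frac{\sqrt{5}\, r}{\ell}\right).$$

At $r = 0$ this gives $\sigma_f^2$, the prior variance, and it decays towards zero as $r$ grows, so that load patterns with very different feature vectors are treated as nearly uncorrelated.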
The values of the covariance hyperparameters $\theta$ may be optimised by minimising the negative log marginal likelihood, defined as $\mathrm{NLML} = -\log p(y \mid X, \theta)$. Minimising the NLML automatically performs a trade-off between bias and variance, and hence ameliorates over-fitting to the data [26]. Given an expression for the NLML and its derivative with respect to $\theta$ (both of which can be obtained in closed form), $\theta$ can be estimated using gradient-based optimization. The Python GPy library was used to implement these algorithms.
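The conditioning step and the NLML can be sketched in a few lines of dependency-free Python. This is not the GPy implementation used in the paper: it is a minimal illustration that assumes a Matérn kernel with $\nu = 5/2$ in closed form, a single shared length scale, fixed hyperparameters (no optimisation), and a noise variance added to the diagonal of the training covariance. All function names and data are hypothetical.

```python
import math


def matern52(a, b, sigma_f=1.0, length=1.0):
    """Matern kernel with nu = 5/2 between two feature vectors."""
    r = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    s = math.sqrt(5.0) * r / length
    return sigma_f ** 2 * (1.0 + s + s * s / 3.0) * math.exp(-s)


def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def chol_solve(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(L)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_predict(X, y, X_star, noise=1e-3, sigma_f=1.0, length=1.0):
    """Posterior mean, covariance and NLML of a zero-mean GP."""
    n, m = len(X), len(X_star)
    kern = lambda a, b: matern52(a, b, sigma_f, length)
    # Training covariance with the noise variance on the diagonal (assumed).
    K = [[kern(X[i], X[j]) + (noise ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = chol_solve(L, y)                               # K^{-1} y
    K_s = [[kern(x, xs) for xs in X_star] for x in X]      # K(X, X*)
    mean = [sum(K_s[i][j] * alpha[i] for i in range(n)) for j in range(m)]
    cov = []
    for a in range(m):
        v = chol_solve(L, [K_s[i][a] for i in range(n)])   # K^{-1} K(X, x*_a)
        cov.append([kern(X_star[a], X_star[b])
                    - sum(K_s[i][b] * v[i] for i in range(n))
                    for b in range(m)])
    # NLML = 0.5 y^T K^{-1} y + 0.5 log|K| + (n/2) log(2 pi)
    nlml = (0.5 * sum(yi * ai for yi, ai in zip(y, alpha))
            + sum(math.log(L[i][i]) for i in range(n))
            + 0.5 * n * math.log(2.0 * math.pi))
    return mean, cov, nlml


# Toy one-dimensional inputs (illustrative, not battery data).
X = [[0.0], [1.0], [2.0]]
y = [0.0, -0.5, -0.8]
X_star = [[1.0], [5.0]]
mean, cov, nlml = gp_predict(X, y, X_star)
print(mean, [cov[0][0], cov[1][1]], nlml)
```

The first test point coincides with a training input, so its posterior mean is close to the observed value and its variance is small; the second lies far from the data, so the prediction reverts towards the zero prior mean with variance close to $\sigma_f^2$.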

Gradient boosting
As a state-of-the-art comparison to Gaussian process regression, we also investigated predictive performance with an alternative technique, gradient boosting. This is a popular data-driven time series modelling approach based on combining an ensemble of weak prediction models into a stronger model [25]. While this approach is not inherently probabilistic, and does not output a full covariance matrix for the predictions, it can be trained using quantile regression (QR) to approximately predict a probability distribution. Quantile regression deliberately introduces a bias in the prediction in order to estimate statistics. The loss function is modified such that instead of identifying the mean of the variable to be predicted, QR seeks the median and any other desired quantiles. To identify the upper and lower bounds of a prediction interval, QR is repeated at several different quantiles. One advantage of this method is that asymmetric intervals can be predicted. On the other hand, it is not clear how the confidence intervals for Q should be calculated from the values for ∆Q, since the full covariance matrix is unavailable. In this case, we simply centred the intervals around the mean, and fitted a Gaussian distribution in order to achieve this.
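The modified loss used in quantile regression is commonly called the pinball loss. The sketch below is not a gradient boosting implementation: it only shows, on a hypothetical list of capacity changes, that the constant prediction minimising the pinball loss at level $q$ is the corresponding sample quantile, and that the resulting interval need not be symmetric about the median. The function names and data are illustrative.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean quantile (pinball) loss for a quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        err = y - p
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * err if err >= 0 else (q - 1.0) * err
    return total / len(y_true)


def best_constant(y_true, q):
    """Observed value that minimises the pinball loss as a constant prediction."""
    n = len(y_true)
    return min(y_true, key=lambda c: pinball_loss(y_true, [c] * n, q))


# Hypothetical capacity changes (Ah) with a longer upper tail.
samples = [-0.05, -0.04, -0.03, -0.02, -0.02, -0.01,
           -0.01, 0.0, 0.01, 0.04, 0.06]
lower = best_constant(samples, 0.1)
median = best_constant(samples, 0.5)
upper = best_constant(samples, 0.9)
print(lower, median, upper)
```

In a boosted model the same loss is minimised by an ensemble of weak learners rather than by a constant, with one model trained per quantile of interest.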

Dataset
The battery dataset used here was obtained from the NASA Ames Prognostics Center of Excellence Randomized Battery Usage Repository [27]. The data in this repository were first used in Ref. [28] for an investigation into capacity fade under randomized load profiles. The data are randomised in order to better represent practical battery usage. This is ideal for training a data-driven model. Fig. 3 gives smoothed histograms computed from the cell data showing the ranges of times, charge throughput, currents, voltages and temperatures that are explored by this dataset.
An overview of the battery dataset is given in Table 2. The cells used have a relatively high energy density, but short lifetime. The remainder of this subsection describes the cycling and characterisation procedure, based on [27]. For this study we used data from 26 of the 28 total battery cells available in the repository (cells 16 and 17 were omitted, since these were found to contain spurious data resulting in certain cycles having negative duration). The cells were grouped into 7 groups of 4, with each group undergoing a different randomized cycling procedure as described in Table 3.
The first 5 groups were cycled at room temperature throughout the duration of the experiments, whilst groups 6-7 were cycled at 40 °C. In all cases a characterisation test was periodically carried out, whereby a 2 A charge-discharge cycle was applied (i.e. approximately 1C) between the cell voltage limits; these discharge curves were used to evaluate the capacity as an indicator of state of health. There were a total of 950 discharge curves available across all cells (i.e. ∼34 curves per cell).
The cell capacity was calculated by integrating the current from each of the 2 A charge curves. Calculated capacities for the cells in each group are plotted against time in Fig. 4. The evolution of the capacity is quite different for each group of cells.

Table 3: Randomized cycling procedure applied to each group of cells.

Group 1
Repeatedly charged to 4.2 V using a randomly selected duration between 0.5 hours and 3 hours, then discharged to 3.2 V using a randomized sequence of discharging currents between 0.5 A and 4 A. Reference characterisation every 50 cycles.

Group 2 (Cells 3-6)
Same as group 1 except charging cycle not randomized.

Group 3 (Cells 9-12)
Operated using a sequence of charging/discharging currents between -4.5 A and 4.5 A. Each loading period lasted 5 minutes. Reference characterisation carried out after 1500 periods (about 5 days).

Group 4 (Cells 13-15)
Repeatedly charged to 4.2 V and then discharged to 3.2 V using a randomized sequence of discharging currents between 0.5 A and 5 A. A customized probability distribution skewed towards selecting higher currents was used to select a new load setpoint every 1 minute during discharging.

Group 5 (Cells 18-20)
Same as Group 4 except the probability distribution was designed to be skewed towards selecting lower currents.

Results
We considered 6 different configurations of data-driven transition model, as defined in Table 4, in order to show a range of comparisons in predictive accuracy. In each case, the model was trained on the data from even numbered cells (i.e. all the mappings between inputs and capacity drops across all of those cells), and subsequently tested on the odd numbered cells. Models 1 and 2 use a GP with a Matérn kernel. The difference between these two models is the way in which long-term trends are captured. For model 1, data from the preceding 6 load patterns were all used as inputs for the mapping, and the total time elapsed was not included as an input. For model 2, only data from the current load pattern was used, but to capture long-term trends it was necessary to also include the total time elapsed as an additional input. Models 3 and 4 are analogous to models 1 and 2, except a linear kernel was used in the GP rather than a Matérn kernel. This gives a simple base case for comparison. Using a linear kernel is equivalent to implementing Bayesian linear regression, and the key point to note in this context is that it provides far less flexibility for the model predictions compared with a Matérn kernel.
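One way to expose several preceding load patterns to the model, as in model 1, is to concatenate their feature vectors into a single input. The sketch below assumes a window of 6 consecutive load patterns ending with the current one, and simple concatenation; the paper does not spell out these details, so both are assumptions, and the function name `stack_load_patterns` is hypothetical.

```python
def stack_load_patterns(features, deltas, window=6):
    """Build training pairs that expose `window` consecutive load patterns.

    features : per-load-pattern feature vectors for one cell, in time order
    deltas   : capacity change recorded over each load pattern
    Each input is the concatenation of `window` consecutive feature vectors
    ending at load pattern i; the target is the capacity change over pattern i.
    """
    inputs, targets = [], []
    for i in range(window - 1, len(features)):
        stacked = []
        for past in features[i - window + 1:i + 1]:
            stacked.extend(past)
        inputs.append(stacked)
        targets.append(deltas[i])
    return inputs, targets


# Eight hypothetical load patterns with two features each.
features = [[float(i), 10.0 * i] for i in range(8)]
deltas = [-0.01 * i for i in range(8)]
inputs, targets = stack_load_patterns(features, deltas)
print(len(inputs), len(inputs[0]))
```

With this construction the first few load patterns of each cell cannot be used as targets, since they do not have a full window of history.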
Models 5 and 6 are also analogous to models 1 and 2, except that, rather than using a GP, they use a different regression technique called gradient boosting, as was introduced in section 2.6.
The predicted versus actual ∆Q for each approach is shown in Fig. 5. Model 1 was the best performing of the 6 cases tested, with RMSE ∆Q and RMSE Q of 0.0201 Ah and 0.07 Ah respectively. Normalised capacity prediction error RMSE norm for model 1 was 4.3%.
Finally we present in more detail in Fig. 6 the evolution of the capacity for each of the cells in the test dataset, using the best performing approach (model 1).

Discussion
The results given in section 4 show that model 1 accurately predicts the capacity trajectory, and provides reasonable, if slightly over-cautious, estimates of the uncertainty, indicated by the calibration score being close to 0.954. The true capacity generally lies within the ±2σ interval denoted by the blue shaded region in Fig. 6.
The model is also seen to be capable of predicting both positive and negative capacity differences. For instance, it is apparent in Fig. 6 that, although the capacities experience a long-term downward trend, they also experience occasional step increases. The model correctly predicts the timing of a number of these instances, e.g. for cell 7 at day ∼ 140. As an aside, the physical explanation for these increases is not clear; they may in fact be an artefact of the measurement process, possibly arising when reference tests are performed, after the cell is unused for some time. However, regardless of their cause, accounting for these effects is essential since the capacity measurement provided in a real application could also manifest similar behaviour.
Regarding feature selection, the fact that model 1 performs better than model 2 in the case of the dataset used here suggests that valuable information is being extracted from the inputs over the previous load patterns, which is not available from using just the total time elapsed as an additional input. Models 3 and 4, based on a linear kernel as noted earlier, perform considerably more poorly than the other approaches in terms of capacity prediction error, indicating that the simple linear combination of the inputs is insufficient to predict battery health for the dataset considered here, and the nonlinearities captured by the Matérn kernel are significant in this case. Their calibration scores also indicate over-confidence.
The models based on gradient boosting are slightly less accurate in terms of mean predictions than models 1 and 2 and it is also noteworthy that they are erroneously overconfident, as indicated by their low calibration scores.
Finally, we note that the train/test split used in this paper (whereby the even numbered cells are used for training and the odd numbered cells for testing) ensures that there is at least one training cell in each of the 7 groups of differently cycled cells (Table 3). Inferior results might be obtained if this were not the case, e.g. if the first N cells were used for training and the remaining 26-N used for testing, since in the latter case the model would be extrapolating beyond the region of the input space used for training. In practice, the performance of these methods will rely on a sufficiently large training set being available, such that a large range of input conditions is covered.

Conclusions
This paper has developed a new technique for battery health prediction based on a Bayesian non-parametric model that estimates the change in capacity over a particular period of time as a function of how the battery was used during that period. A simple histogram-based feature selection approach was presented and models were trained using data from NASA [27]. It was found that the best performing approach used Gaussian process regression with a Matérn kernel function, and that time elapsed and charge throughput were the most important features to incorporate within the model, given the dataset used in this paper. It was also found that more accurate results could be achieved by considering the preceding 6 load patterns to capture longer range trends, rather than using absolute time as an input feature. Automated feature selection would be worth future investigation.
The best case results presented have a relative accuracy on mean capacity predictions that is within 5% of the actual values. To our knowledge this is one of the first papers to actually quantify battery health predictive accuracy comprehensively, and this is one of the most accurate long range predictions of future capacity seen to date.
The approaches explored in this paper offer an interesting insight into how the stress factors that drive degradation actually influence the capacity trajectory. It is noteworthy that, despite having a dataset that includes a wide range of temperatures and currents, in this case it was found that time elapsed and charge throughput were the dominant inputs. However, a naive modelling approach that uses a simple linear combination of inputs results in very inaccurate predictions, as shown by the GP regression results using linear kernels.
There are a number of interesting next steps to explore. First, it would be useful to test these ideas against a much larger dataset to show their general validity and explore in more detail the sensitivity of the approach to additional inputs. Second, prior knowledge about expected degradation behaviour could be included as an extension to this work by including a parametric mean function within the GP framework. Third, in the present work, when the model is used predictively, it assumes perfect knowledge about the inputs, i.e. that the future current, voltage and temperature time series are known in advance. In practice this will not be the case, since depending on the application these variables depend on driving style or market conditions, ambient weather conditions etc. Predicting these inputs is a separate but important issue.