Probabilistic modelling of wind turbine power curves with application of heteroscedastic Gaussian Process regression

There exists continued interest in building accurate models of wind turbine power curves for better understanding of performance or assessment of the condition of the turbine or both. Better predictions of the power curve allow increased insight into the operation of the turbine, aid operational decision making, and can be a key feature of online monitoring and fault detection strategies. This work proposes the use of a heteroscedastic Gaussian Process model for this task. The model has a number of attractive properties when modelling power curves. These include, removing the need to specify a parametric functional form for the power curve and automatic quanti ﬁ cation of the variance in the prediction. The model exists within a Bayesian framework which exhibits built-in protection against over- ﬁ tting and robustness to noisy measurements. The model is shown to be effective on data collected from an operational wind turbine, returning accurate mean predictions ( < 1% normalised mean-squared error) and higher likelihoods than a corresponding homoscedastic model.


Introduction
The power curve of a wind turbine is one of its key performance indicators. As the popularity of wind-based power generation continues to grow, the characterisation of the performance of turbines is an important step in the justification for, and management of, this renewable energy source. Being able to accurately predict the power output of a turbine has a number of beneficial use cases for the operator. The prediction of power output allows more accurate prediction of expected income from the turbine (and by extension, farm) allowing for more forward-thinking business planning. Alternatively, the power curve of the turbine has been shown to be an effective indicator of degradation in performance of the system, for example see Papatheou et al. [1]. That work sits within a wider body of work on monitoring wind turbines, usually via SCADA (supervisor control and data acquisition) systems in order to infer the structural condition of the turbine d this being one example of structural health monitoring [2].
Typical power curve data collected from a wind turbine SCADA system is shown in Fig. 1. Visually, it can be seen that the power curve exhibits a number of interesting features from the modelling perspective. The relationship between the wind speed and the power output is nonlinear. The data has a stochastic element or there is noise in the measurement of the data and that this noise is not constant across the input domain. This input-dependent noise variance is referred to as heteroscedasticity, as opposed to homoscedastic noise where the variance of the noise is independent of the input. Finally, there are a number of data points which could be considered outlying from the bulk of the data distribution. The combination of these factors makes modelling the behaviour and variance of the power curve robustly an interesting and challenging prospect.
This paper presents a methodology for building probabilistic models of wind turbine power curves based on a heteroscedastic Gaussian Process method. This allows predictions to be made of the mean and variance which approximate the distribution of power output from a turbine given the measured wind speed. The question remains: why might this be useful to the end user?
Considering applications in Structural Health Monitoring (SHM) [2,3], the value of a probabilistic model is made apparent. Here, probabilistic methods provide a natural framework in which decisions can be made based on quantifiable risk. Most simply, there is a distinction between saying there is or is not damage, as compared to considering the probability of damage on a structure. Alternatively, if predicting fatigue damage accrual, providing a distribution over expected crack lengths from a model returns information, not only about the most likely result, but also confers a measure of confidence which can guide the decision-making process. Finally, it is worth considering that a deterministic approach to these problems does not remove the inherent uncertainty from the process but it does fail to account for it. By failing to acknowledge uncertainty when attempting to understand a structure's condition, the engineer implies perfection in models and processes which are inherently imperfect. In safety critical applications, this can lead to failure in planning for unlikely events, greatly magnifying the consequences of these.
Beyond the realm of SHM, the quantification of uncertainty continues to add value. For example, one key use of power curve models is for investors to calculate expected returns from a turbine or farm. Models which can account not only for the mean trend but also quantify uncertainty in predictions allow financial planning to be done on the basis of more information. Clearly there exists a distribution over wind speeds that a turbine will be subject to. An uncertain model of the power output of the turbine allows combination of these distributions. The possession of models which quantify and handle uncertainty allows for robust uncertainty propagation. In this, all the uncertainty present throughout the power generation process can be combined to give a distribution over a variable of interest d e.g. monthly income. Again, being in possession of this distribution, allows better confidence in the models and enables long-term risk-based financial planning.
It is hopefully clear that the accurate modelling of uncertainty offers tangible cost benefit across a range of situations. This includes day to day benefits in operation such as health monitoring applications or longer term benefits in assistance with high-level financial planning. The final reason for building probabilistic models of uncertainty, for systems such as wind turbines, is that it is possible. To not do so fails to make full use of the data which has been collected. The process of sensing and data acquisition remains expensive and difficult in comparison to building and learning data-based models. By reducing this data to a single deterministic line, users fail to make full use of this valuable resource. If operators are willing to spend money to acquire data, it is only sensible to build the most expressive model possible.

Related work
The task of modelling wind turbine power curves has been explored in the literature previously with review papers being published in 2011 [4], 2013 [5], and 2014 [6]. Broadly speaking, approaches to solving this task have been separated into those which aim to build models based on physical/engineering understanding of the behaviour of the turbine and those which rely solely on learning from data.
A number of models have been built upon polynomial regression equations; the most common being those based on the cubic relationship between wind speed and maximum available power, e.g. Carrillo et al. [5]. However, attempts have been made to fit higher-order polynomial models, a 6 th order in Ref. [7] and a 9 th order in Ref. [8]. However, these works routinely fail to use any form of regularisation or cross validation, which is required to ensure that the models will generalise and fit well to unseen data. Mar ciukaitis et al. [7] discuss the use of a cross-validation technique for model assessment; while this demonstrates the consistency of the model, it does not provide any protection against over-fitting during the training stage. The problem of over-fitting occurs most frequently in over-parameterised models, the classic examples being high-order polynomials; further discussion of this problem and techniques to alleviate it can be found in (for example) Bishop [9] or Barber [10]. Taslimi-Renani et al. [11 ] also discuss the problem of overfitting in their work where a modified hyperbolic tangent function is proposed as a parametric model of the power curve. In these references and in this work, distinction is made between two subsets of data; training data which is used for learning the model, and (independent/unseen) testing data which is unused in learning the model but is used to assess the expected performance of the model in operation. Results are presented on both the training data and this unseen test data to demonstrate the ability of the model to generalise (i.e. continue to make valid predictions for the turbine into the future).
Other parametric methods have explored fitting functions which, heuristically, match the shape of the power curve. These have included variations on logistic and hyperbolic tangent functions, for example see Lydia et al. [6], Mar ciukaitis et al. [7], Seo et al. [12]. In a similar manner Villanueva and Feij oo [13] and Lydia et al. [14] propose parametric models of the power curve based upon a logistic function. These functions have the benefit of possessing many of the properties that appear inherent to the data in a wind turbine power curve, i.e. boundedness at high and low wind speeds and nonlinear transition between these bounds.
For modelling power curves, the use of (artificial) neural networks has been explored e.g. Refs. [15,16] and more recently this trend has continued [17,18] in line with the continued popularity of neural networks across many fields. The use of a support vector machine (SVM) was also discussed in Ouyang et al. [19], where the SVM is created based on using the centroids of a k-means algorithm as training data. Yan et al. [20] consider the combination of a number of deterministic models with approaches that also attempt to capture the uncertainty in the power curve, a comparison is made between these approaches in terms of the error and an "expectation variance ratio". Wang et al. [21] propose a probabilistic approach to modelling the wind turbine power curve based spline regression models which are used to generate inputs to a neural network for power forecasting. The use of Gaussian Process (GP) regression models for modelling the wind turbine power curve has, also, previously been discussed. In Papatheou et al. [1], Antoniadou et al. [22] and Papatheou et al. [23] the use of the standard GP formulation allows detection of damage in the turbine. In Manobel et al. [24] the GP is used as a pre-processing step for filtering data before it is passed to a neural network model. This work appears to overcomplicate the problem of modelling the power curve. The addition of the neural network seems superfluous since there exist proofs that the GP is a universal approximator [25]. In addition to this, it can be shown that, as the number of neurons in a single layer MLP (multi-layer perceptron) tends to infinity, a GP is recovered [26] d the extension of this work to deeper networks is discussed in Ref. [27]. Additionally, the use of the GP for filtering forces the data into an approximately Gaussian distribution, which is homoscedastic. This leads to the exclusion of potentially valid points as outliers and can distort the true distribution of the data. Pandit et al. [28] consider the use of a standard GP where the air density is included as a second input variable along with the wind speed, this shows improvement in accuracy. In Pandit et al. [29]a model similar to this one is compared to a support vector machine and a random forest model, results reveal the GP to score better in the performance metrics shown. It can be seen that there have been numerous approaches to modelling the wind turbine power curve and that investigation into this problem continues to be an active research area. However, a number of the models used make no attempt to model the uncertainty in the power output of the turbine and of those that do a heteroscedastic approach is very rarely taken. The work contained in this paper aims to provide a methodology that forms and an accurate model of the power output of the wind turbine while also quantifying this varying uncertainty across different wind speeds in a manner which is efficient for large datasets and statistically rigorous.
The layout of the paper is as follows; in section 2 the necessary theory for Gaussian Process regression is introduced, section 2.1 discusses the efficient modelling of large datasets with GPs; section 2.2 extends the GP model to the heteroscedastic noise case; section 2.3 combines these approaches to form a sparse heteroscedastic model, and finally section 2.4 presents a methodology for distributed computation of these models and combination via a robust Bayesian committee machine. The use of the model is shown in section 3 where it is applied to data measured from an operational wind turbine. The benefit of moving to a heteroscedastic model is demonstrated by comparison with a homoscedastic GP model, both quantitatively and qualitatively. Finally, conclusions are made in section 4 with discussion of possible directions for future work.

Gaussian Process regression
Gaussian Process (GP) models provide a flexible Bayesian machine learning method for solving regression problems [10,30e32]. They exhibit a number of desirable properties for this application: they are nonparametric, automatically quantify uncertainty in predictions, require little a priori input, and are capable of modelling signals even in the presence of high noise levels on the measured data. The GP allows a prior distribution to be placed over an entire function for inference rather than merely learning the parameters of a model. The GP is developed for modelling functions of the form, i.e. it models data as the output of some function f ðxÞ, operating on a D-dimensional input x. This function is corrupted by some additive Gaussian noise ε with zero mean and a fixed variance s 2 n . The most common d and most intuitive d introduction to the GP is as a distribution over functions, where a single draw from the GP is a potential realisation of a function generated by that GP. In this way the GP can be seen as the prior over f ðxÞ in equation (1).A GP is defined as in equation (2), where x and x' are a pair of inputs to the function of interest, f ðxÞ$GP ðmðxÞ; kðx; x ' ÞÞ (2) It follows that the GP is characterised completely by its mean, mðxÞ, and covariance, kðx; x'Þ, functions. To make predictions, the joint Gaussian distribution between the training and testing data is assessed, Here, the notation X is used to denote a set of N, D dimensional, training inputs where X2R NÂD and y is the corresponding set of N measured training outputs with y2R NÂ1 . When wanting to predict with the model, a new input x + can be considered (trivially this could also be X + if predicting at multiple points). This is used to make a prediction at a new potentially unknown output y + . By the properties of a multivariate Gaussian, every conditional distribution is also Gaussian. Using this standard result, it is possible to write down the predictive distributions over y + , pðy + jx + ; X; yÞ¼N ðE½y + ; V½y + Þ It is possible to assess new test points since the covariance of the process is fully described by the covariance function. This together with the mean function mð,Þ allows the GP to be used when predicting at any new x + . The mean function can be chosen to be any parametric function of the inputs, although it is commonly set to zero when the GP is presented in machine learning literature [31].
In order to fully specify the GP model, a covariance (kernel) function must be chosen which defines the similarity of any two sets of input points giving rise to the covariance matrix K. A popular choice for the covariance function is the squared-exponential (SE), which is defined for two input points x and x' as, The use of the squared-exponential kernel embeds the belief that the function being modelled is infinitely differentiable. 1 It should be noted that by choosing this covariance function the user is restricting the functions which can be modelled to those which conform to these properties.
It can be seen that there exist a small number of hyperparameters in the kernel which must be determined in order to make use of the GP. In the case of the squared-exponential covariance these are the signal variance s 2 f and the length scale [. These two hyperparameters control the behaviour of the covariance function; s 2 f can be interpreted as the prior variance of the signal being modelled and [ mediates the region of influence of the kernel. In other words, the length-scale controls how smooth the function being modelled is, where increasing the length-scale increases the smoothness of the function.
In order to learn these hyperparameters, a Type-II maximum likelihood approach is taken as in Ref. [31]. The marginal likelihood of the model, also referred to as the model evidence, is maximised. This optimisation makes use of the Bayesian Occam's razor [33e35] to find the minimally complex model given the observed data in the training set D ¼ fX;yg. This optimisation is normally performed as a minimisation over the negative log marginal likelihood for convenience and numerical stability. Thus, an estimate of the hyper- In this way the hyperparameters of the kernel (and if necessary the parameters of the mean function) can be learnt and the GP is fully specified by equations (4) and (7).

Handling large datasets
In order to either learn the hyperparameters of the GP or to make predictions, it is necessary to assess the inverse of the covariance matrix with noise, ðK XX þ s 2 n IÞ À1 . This operation is O ðN 3 Þ in both computation and memory storage. Practically, this means that for datasets larger than roughly ten thousand data points it is not feasible to learn a GP model. This is the case in many datasets collected from SCADA systems where the number of datapoints regularly exceeds tens, if not hundreds, of thousands. A number of methods have been considered to address this problem. This class of models is referred to as sparse Gaussian Processes. The most common methodology is to introduce a number of inducing points, although other methods have been explored [36,37]. Introducing these inducing points reduces the complexity of the process from O ðN 3 Þ, for N datapoints, to O ðNM 2 Þ, for M inducing points, giving advantage when M≪N. Broadly speaking, inducing point methods can be separated into two classes, model approximations and posterior approximations. Model approximations modify the prior of the model to achieve sparsity whereas posterior approximations approximate the posterior directly. In Quiñonero-Candela and Rasmussen [38], or more recently in Bui et al. [39], the use of inducing points (also referred to as pseudo-points) is brought under unifying frameworks. For the sake of brevity, the content of these papers is not duplicated here. In general, posterior approximations of the GP will result in more robust models than model approximations [40]. It is known that a posterior approximation is not able to overfit the data unlike, for instance, a Fully Independent Training Conditional (FITC) approach [40,41]. Therefore, a posterior approximation approach is adopted in this work, namely the Variational Free Energy (VFE) approach [42].
A very brief review of the VFE sparse GP is given here, for a full introduction the reader is referred to Ref. [42]. The inducing points of the model fZ; ug (where Z contains the locations of the inducing points and u the values of the latent function at those points) are used to form a variational approximation of the full posterior of the model. The model can then be learnt by minimising the Kullback-Leibler (KL) divergence between this approximate joint posterior and the full joint GP posterior. The joint variational (approximate) posterior qðf; uÞ is formed such that, qðf; uÞ¼pðf; uÞ4ðuÞ (8) where 4ðuÞ is known as the 'free' variational distribution with f being the latent function values at the measured inputs and u dependent upon the set of 'free' inputs Z. This allows the joint posterior of the GP pðf; f + Þ to be approximated directly as, It is possible to find the optimal choice of 4ðuÞ analytically from which a lower bound on the marginal likelihood, FðZÞ, can be established as, where, trð,Þ is the trace operator and the approximate covariance Q XX is defined as, 2 Now in learning the hyperparameters of the GP the bound in equation (10) is used in place of the marginal likelihood pðyjX; qÞ in equation (6). Predictions can then be made through this approximate posterior in a similar manner to the standard GP. The predictive distribution of the VFE model is given by, qðy + jx + ; X; y; uÞ ¼ N ðE½y + ; V½y + Þ By making use of this sparse approximation, the computational requirements for a dataset with N datapoints is reduced from O ðN 3 Þ to O ðNM 2 Þ for M inducing points. This now makes it feasible to handle large engineering datasets such as those returned by a data acquisition system installed on a wind turbine.

Heteroscedastic noise models
Considering the data shown in Fig. 1, it can be seen that one of the key assumptions in the GP does not hold when modelling power curve data. That is the assumption of homoscedastic noise, this is that the noise on the function f ðxÞ is an additive Gaussian noise with fixed variance. In fact, it can be seen that the noise variance changes across the input space, i.e. with changing wind speed there is a change in noise variance. In a heteroscedastic noise model it is assumed that the noise model is a function of the inputs to the system. The regression model introduced in equation (1) would 2 Here notation is established for a general matrix Q ab such that Q ab ¼ It can be seen that the variance of the noise process is now considered to be a function of the inputs to the model. In the case of the power curve, this expresses the fact that noise variance is dependent on wind speed.
In the same way that a GP prior can be used to infer the unknown function f ðxÞ, such that f ðxÞ$GP ðmðxÞ; k f ðx; x ' ÞÞ.I ti s possible to model the function over the noise variance using a GP. This was first presented in L azaro-Gredilla and Titsias [43], where, since the variance of the noise is strictly positive the function rðxÞ is modelled as the exponential of a Gaussian Process regression. That is, rðxÞ¼expfgðxÞg (14) where, The GP which is used to model gðxÞ is assigned its own covariance function k g ðx; x'Þ and is considered to have a constant mean m 0 . The addition of the second GP over the log noise variance increases the expressive power of the model, but with that, the difficultly in learning and inference. The second GP increases the number of hyperparameters that must be learnt by the number required to express the constant mean and kernel of the second GP.
The introduction of this heteroscedastic noise model also means that the marginal likelihood and predictive equations of the model are no longer available in closed form. To handle this, a variational approximation is used. Similarly to the VFE sparse method the approximate distribution over the posterior is used to form a lower bound on the marginal likelihood of the model which can be found to be dependent on two sets of parameters m and S. The bound is found in Ref. [43] to be, Here, K f and K g are used to denote the covariance matrices of the two Gaussian Processes over f ðxÞ and gðxÞ respectively. KLðpðaÞjjpðbÞÞ is the Kullback-Leibler divergence between distribution pðaÞ pofa and pðbÞ; 1 is a vector of ones; m and S are variational parameters to be determined, and R is a diagonal matrix whose diagonal elements are given by, It can be seen that, in m and S, there exist N þ NðN þ1Þ= 2 unknown free variational parameters which must be learnt. Following the approach in L azaro-Gredilla and Titsias [43], it is possible to reparameterise m and S in terms of L d a diagonal semi-positivedefinite matrix d reducing the number of parameters to be learnt to N. This allows m and S to be expressed in the following form, This being the case, the bound on the marginal likelihood can be computed and the hyperparameters of the model can be learnt. The overall increase in computational load for the heteroscedastic GP model means that learning takes roughly twice as long as a homoscedastic GP [43].
One final complication in the heteroscedastic GP model is that the full predictive distribution is also unavailable in closed form. To obtain it would require evaluating the integral, where, Although equation (19) cannot be computed in closed form, it is possible to calculate the first two moments of the predictive distribution qðy + Þ; that is, the mean and the variance of this distribution. 3 These moments can be found to be, Therefore, it is possible to make predictions using a GP under a heteroscedastic noise assumption. By calculating only the first two moments of the predictive distribution, the distribution over an unknown output y + given a test input x + can be approximated. An approximation of this distribution by its first two moments is to assume that the distribution over the test output at this point is well represented by its first two moments; the true distribution may not be Gaussian but by using only the first two moments is assumed to be close to this. This allows a probabilistic prediction of the function of interest to be made while also predicting the variance of the function at any given input. This additional information regarding the uncertainty on the process is invaluable if the predictions are to be carried forward into further analysis.

Sparse heteroscedastic Gaussian Process regression
In possession of both a sparse and a heteroscedastic Gaussian Process model it is natural to explore the combination of these into a sparse heteroscedastic GP. This combination has also been shown in the literature by Liu et al. [44]. In that work, the authors establish a new variational bound when making a variational approximation of the posterior under both the heteroscedastic model and a sparse model akin to the VFE approach, which they term the Variational Sparse Heteroscedastic Gaussian Process (VSHGP). Taking the heteroscedastic model shown in Ref. [43] and presented in section 2.2, it is possible to form a sparse heteroscedastic GP. Separate sets of inducing points are introduced to both the function GP modelling f ðxÞ and the log noise variance GP modelling gðxÞ. The same 3 Please note, the explicit conditioning of the posteriors on the training data, test input, inducing points, and hyperparameters is dropped for simplicity of notation.

T.J. Rogers et al. / Renewable Energy xxx (xxxx) xxx
strategy for achieving sparsity is followed as in section 2.1 (using the technique of Titsias [42]).
The variational approximation of this complete model again reduces to determining a lower bound on the marginal likelihood. This is found to be, where, and R is a diagonal matrix with elements, At this point it is worth clarifying the notation used in this section. The introduction of a sparse approximation to the two Gaussian Processes in the heteroscedastic model adds an additional number of hyperparameters corresponding to the inducing points used in f ðxÞ and gðxÞ. It should be noted that the inducing points for f ðxÞ and gðxÞ are two separate sets that can be different sizes. In light of this, notationally a covariance matrix is indexed by a superscript ðf Þ or ðgÞ to denote which function d and therefore hyperparameters d are being considered. The subscript is used to denote which sets of points the covariance is taken between, with X being the full measured set of inputs and u being the set of inducing points for that function. For example, K ðgÞ uX indicates the matrix of covariances between the inducing points of the process for gðxÞ and the training data X given those learnt inducing points and the hyperparameters of the kernel for the log noise process. Although there is a non-trivial amount of algebra to arrive at this point, the bound developed in equation (22) can be used in place of the marginal likelihood of the standard GP pðyjX; qÞ to learn the set of hyperparameters of the model q. However, the number of hyperparameters which must be learnt has now increased to include the hyperparameters for the kernels k f ðx; x'Þ and k g ðx; x'Þ, the constant mean for the log noise variance m 0 , the set of M f inducing points for f ðxÞ, the set of M g inducing points for gðxÞ, and the N variational parameters which form the diagonal matrix L.
Turning attention to making predictions with the VSHGP model, it is necessary to compute the approximate posterior distribution over y + d qðy + Þ. As with the non-sparse heteroscedastic GP the computation of this approximate posterior requires the computation of an intractable integral, . Each of these can be computed in closed form [44], defining K R as, (27) m ðf Þ In an analogous manner to (21), the first two moments of qðy + Þ under the VSHGP model can we written down as, These first two moments can then be assumed to well represent the full predictive distribution, i.e. it is assumed that this distribution is approximately Gaussian.

Distributed computation
In Ref. [44] one further extension is made to this model. The Distributed Variational Sparse Heteroscedastic Gaussian Process, is presented where the data are divided into a number of subsets. Each of these subsets is learnt via a separate VSHGP in the manner described above using the bound established in equation (22). This creates a mixture of experts type model where the experts each represent a local approximation of the function. These experts can then be combined using a variety of tools. The one presented in Ref. [44] is the Robust Bayesian Committee Machine (RBCM), developed in Ref. [45] and shown to be effective when used with Gaussian Process models in Ref. [46].
When modelling the wind turbine power curve this approach also can be beneficial. It will reduce, further, the computational complexity of the model which in turn reduces computation time, also helps account for the first of two types of heteroscedasticity present in the data. That is, the power curve exhibits three distinct regimes that it smoothly transitions between; the first is the behaviour before cut-in, the second as the power output rises with increasing wind speed, and the third when the turbine is limited to its maximum output. This allows the data to be separated into a three component mixture. Unlike the work of Liu et al. [44], the data relating to each of these components can be and is defined based on physical prior knowledge. Additionally, around the transitions between two components data are included in each of the components to ensure smooth transitions between the GPs.
When making predictions using this model, the predictions from each expert must be combined d here by means of the RBCM. The means and variances which are predicted by each of the experts are aggregated as a weighted sum over the experts. The experts are combined separately for the Gaussian Processes over f ðxÞ and gðxÞ. For a committee model with C experts, each expert has a calculated predictive mean and variance for f ðxÞ and gðxÞ which can be indexed according to which expert made that prediction. For example s 2 ðgÞ +i indicates the predictive variance of the i th expert for the GP over gðxÞ, this has an analogous precision s À2 ðgÞ +i ¼ 1= s 2 ðgÞ +i . The aggregated predictive distribution for f + has a mean given by, and precision, where s À2 ðf Þ ++ is the prior precision of the GP over f ðxÞ. Similarly for the GP over the log noise variance, the aggregated mean is given by, and the precision by, Since the GP model automatically returns a measure of uncertainty in the prediction it makes (the VSHGP included), this can be used as some measure of confidence in the prediction being made at any point. It is therefore possible to use the variance of the prediction to weight the experts. The variance for each GP is bounded by its prior variance, therefore it is possible to establish the weighting of each expert by comparing its predictive variance to its prior variance. Given this the weighting function for f ðxÞ is given as, Despite the somewhat circuitous route, the similarity between these equations and those shown in equation (21) make clear that this model is merely an extension of the non-sparse heteroscedastic model. In addition to this, these equations approximate the probability distribution over the mean and variance of each output given the previously observed data, D ¼fX;yg, in a manner analogous to the standard GP. As such, without needing to pre-specify a functional form for the data, the input-output relationship d with a heteroscedastic noise model d can be learnt in a Bayesian manner, returning a probabilistic output.

Modelling wind turbine power curves
With a mathematical framework for learning nonlinear functions with heteroscedastic noise models in place, attention can be directed towards prediction of wind turbine power curves. This section will explore the use of the techniques presented previously for modelling. To demonstrate the usage of these techniques a sample dataset taken from an operational wind turbine is used. For confidentiality reasons the measured values of wind speed and power have been obscured by normalisation of the data. Additionally, the values stated for the cut-in and nominal speeds of the turbine are selected to be representative in the normalised space and bear no relation to the stated values on the data sheet for the turbine being considered. However, the data collected are 10-min averages from a functional SCADA system over a period of 125 weeks, and as such, this dataset represents a realistic set of measurement data.
Following their normalisation, the data are separated into three distinct sets, one for training, one for validation, and one for testing of any models which are learnt. In the results shown here comparisons are made between predictions made on the training data d that data used to learn the model d and the test data d data that remains unseen by the model until predictions are made, the validation data is unused in this case. Both the training and testing datasets consist of 16359 pairs of data points where the input is the measured 10-min average wind speed and the target is the measured 10-min average power. One key modelling assumption is that the function is stationary, i.e. the relationship between the wind speed and power output does not change over time.
Considering the training and test data (collected several months apart) which are shown overlaid in Fig. 2, this assumption appears to hold across this dataset.
Although, as has been shown, the addition of a mean function to the standard GP formulation is trivial, for the heteroscedastic formulations, only the zero-mean versions have been shown. Observing the characteristic shape of the wind turbine power curve, it is clear that a constant zero mean assumption is not valid. In view of this, it is prudent to learn a parametric mean function which can be removed from the data before learning the GP model of choice. Two potential mean functions are considered, the first a piecewise-linear function that could be specified from the known cut-in and nominal speeds of the turbine, the second a hyperbolic tangent function the parameters of which must be learnt from the data. The piecewise-linear function is defined as a threecomponent curve, with a constant power output of zero, before the cut-in speed, and a constant of the rated power output above the nominal speed. A line then connects these two values between the cut-in and nominal speed. Fig. 3 shows a comparison between the piecewise-linear fit and the hyperbolic tangent fit. The parameters of the hyperbolic tangent have been learnt by a minimisation of the sum of squares errors on the training data d i.e. a standard least squares fit d using a quantum particle swarm population based optimiser [47]. The requirement for a very robust fit is relaxed since the GP is expressive enough to compensate for any bias introduced by learning 'sub-optimal' parameters of this model. The models learnt by these parametric fits have been removed from the data in order to transform it into a zero-mean space that can be used with the GP model. In Fig. 4, the data in this transformed space is shown. The piecewise linear model is seen to create a hard corner as it transitions between sections, this leaves the function to be learnt by the GP as non-smooth and discontinuous.
Remembering that the choice of kernel encodes smoothness beliefs about the functions, to make use of the piecewise linear mean would require finding a covariance function that allows for nonsmooth functions. However, using the hyperbolic tangent as a mean functions leads to a smoother function in the transformed space. The data, following removal of the hyperbolic tangent mean, would be more readily learnt using more common covariance functions encountered when using GP models, e.g. the squared exponential in equation (5).
Having removed this hyperbolic tangent mean, the task of fitting a GP model can begin. For comparison, it would be preferential to have fit a full homoscedastic GP, the VFE sparse GP, the full heteroscedastic, and the sparse heteroscedastic. However, due to the size of the data set, over 16000 training points, it is not possible to fita non-sparse GP with any reasonable amount of resources. This will often be the case when modelling wind turbine power curves as data acquisition systems typically run for extended periods of time accumulating very large datasets. It is also beneficial to use computationally efficient methods when inference needs to be conducted online, this may include the retraining of these models to make comparisons as the turbine ages. Therefore, given the necessity to use a sparse method, comparison is made between the VFE (a homoscedastic model) and the distributed sparse variational heteroscedastic GP (DSVHGP). Results are shown for fits to both the training and testing data with and without the hyperbolic tangent mean added.
Two quantities are used to assess the model fit, the first is a normalised mean squared error (NMSE) shown in equation (37); this measures the goodness-of-fit of point predictions of a model d either the output of a deterministic model or the mean of a probabilistic one. The NMSE will return a score of zero in the case of a perfect fit, a score of 100 corresponds to a prediction which has the same error as simply taking the mean of the data.

NMSE ¼
100 Here, N is the number of datapoints being assessed, s 2 y the variance of the measured data, y the measured data, and b y the predicted outputs. In addition to this, the joint likelihood of the probabilistic models is used to assess model quality. The NMSE metric fails to capture any quantification of uncertainty in the model. Since one of the main benefits of the GP approach is this automatic quantification of uncertainty, it is sensible to include this as a measure of goodness-of-fit. The joint likelihood is calculated as the product over the likelihood of each prediction, given that at each prediction a univariate Gaussian distribution is returned d in the case of the heteroscedastic model this is an approximate distribution given the first two moments of the prediction equation (21).
The predictions made by the VFE model are shown in Fig. 5 and Fig. 6, in the transformed space and on the original power curve data respectively. Before stating the quantitative assessment of this model, it can be seen that the model has failed to capture the uncertainty present in the data well. There exist many points in the rising part of the power curve which lie outside of the three sigma intervals, however, at low and high wind speeds (where the function is flat) the variance is overestimated. It would appear that, as expected, the noise variance has been learnt to be an average between the high variance section as the function rises and the low variance sections towards the edges. On first inspection it may appear that the variance has been captured well in the middle

T.J. Rogers et al. / Renewable Energy xxx (xxxx) xxx
section of the function and poorly at the end, however, on closer inspection it is seen that in addition to the overestimation of variance at high and low wind speeds the variance is actually underestimated in the middle section of the curve. With reference to the prediction in the transformed space in Fig. 5, the NMSE of the process is 56.4 for the training data and 56.2 for the test data. This is expected as the variance related to noise on the data is large in this space. Transforming back to the full power curve in Fig. 6, by adding back the hyperbolic tangent mean function, it can be seen that the fit that may look unimpressive in the transformed space actually represents the mean behaviour of the power curve well (despite poor modelling of the variance), this yields a NMSE of 0.81 for both the training and testing data. For comparison it is worth stating the NMSE scores of the two parametric functions considered. The piecewise-linear function scores 3.93 and 3.94 on the training and testing data, whereas, the hyperbolic tangent scores 1.49 and 1.50. From this it can be seen that the use of the GP with the hyperbolic tangent as a mean function leads to a significant decrease in the point-wise error. Considering the likelihoods of the models (stated as log likelihoods), in the transformed space these score 2:26Â 10 4 for both training and testing data. With the mean function added these scores remain the same, for both training and testing data, this allows more consistent comparison of the models as it both incorporates the uncertainty of the prediction and is insensitive to removing the parametric mean function.
A heteroscedastic model was also learnt and tested on the same datasets. The distributed sparse variational heteroscedastic GP is chosen for this task, to allow heteroscedastic inference over this large dataset. The training data are separated in to three overlapping datasets, which are in turn, used to train three experts in the robust Bayesian committee model framework. Fig. 7 shows the predictions made by these experts when predicting the test dataset. In the upper three plots the data which have been used to train each expert is also shown. It can be seen that each expert has been trained on a subset of data which overlaps in the input space, this ensures a smooth transition between the experts in the committee model. The locations of these splits and the amount of overlap were chosen a priori to divide the data into the three broad regions seen in the power curve: 1. Before and through cut-in speed; 2. Transition from cut-in speed to the upper bound on nominal speed; 3. Lower bound on the nominal speed to cut-out speed.
The split locations can be chosen based upon the known cut-in and nominal speeds of the turbine and the amount of overlap is a matter of user choice, for this example the normalised splits are listed in Table 1. It was the experience of the authors that a small overlap region ensured a smooth transition between the experts, in this case the ranges of wind speed were: As expected, each expert is most capable of making predictions close to data which has been used to train that expert and is most confident of the predictions in those regions. This confidence in the predictions is the measure used to weight the contribution of that expert as calculated in equations (34) and (35). The aggregated predictions of the model (equation (36)) for the test data are shown below the contribution of each expert in Fig. 7 with the measured test data superimposed.
The aggregated predictions made by the DSVHGP in the transformed space are shown in Fig. 8 for the training and testing data. The NMSE scores for these models are 56.5 and 56.2 for the training and test data respectively. The scores in the NMSE match very closely with the homoscedastic model fitted. Moving to the full space, shown in Fig. 9, the NMSE scores are found to be 0.81 for both the training and testing data. Scores which are identical to the homoscedastic model d up to this level of accuracy. This is, perhaps, expected considering that the predictive mean of the DSVHGP model is given by the mean of the GP over the function f ðxÞ which is a very similar formulation to the predictive mean equation of the homoscedastic sparse model. The main difference between the models (in a predictive sense) enters through the calculation of the variance of the predictive density.
Since the NMSE score does not depend on the predictive variance of the model, it is unsurprising that this score is largely unaffected by the changes in the model. The likelihood score of the model, however, reveals the improved quantification of the uncertainty in the prediction. As before, the joint log likelihoods in the transformed space and over the full power curve are identical up to the accuracy stated. These are 3:47 Â 10 4 for the training data and 3:39 Â 10 4 for the testing data. This represents a marked improvement over the scores calculated for the homoscedastic model. The increase in likelihood of the predictions would indicate that the heteroscedastic formulation has been able to better capture the variance in the data and is capable of making predictions which better represent that variance.
To visualise the difference in the fits of the two models, they are shown overlaid in Fig. 10; here, the similarity in predictive mean can be easily seen and the difference in the predictive variance is also apparent. The improved modelling of the noise on the data as a  result of the heteroscedastic GP is most apparent at the tails of the power curve where the homoscedastic model overestimates the variance. Although in certain situations this may not be of major concern, this overestimation of variance will lead to reduced sensitivity if the model is used in a damage detection setting such as in Papatheou et al. [23].
One concern that could be raised with both models, or indeed any GP fit of the power curve, would be that there is likelihood that the turbine would exceed its stated maximum power output d this would not be observed due to the limits of the turbine. This is most apparent when considering the output of the homoscedastic model since the variance of the heteroscedastic model reduces as the wind speed increases and the turbine consistently produces its maximum rated power. However, around the nominal speed of the turbine there is variance in the power output which is captured in both models. This region is focussed on in Fig. 11, where it can be seen that the heteroscedastic model captures well the variance in power output around the nominal speed below the maximum output but has variance extending above the maximum rated output. This is an artefact of the approximation of the posterior distribution as a Gaussian based on its first two moments, although it is likely that the full distribution would also have probability mass above this maximum output. Because of the Gaussian nature of this approximate posterior, the distribution over the outputs must be symmetric about the mean. Around the nominal speed of the turbine the distribution over the power output is heavily skewed, because only the mean and variance of the distribution of the output are modelled it is not possible to model is asymmetric distribution. One solution to this is to apply prior physical knowledge to the system and to recognise that it is extremely unlikely that the turbine would exceed its rated output, therefore, in any further analysis the predictions could be limited at the maximum value. Alternative approaches to handling this issue will be discussed as future work.    Finally, it is natural to consider the importance of each part of the modelling process. The use of the mixture of experts model to separate each region of the prediction allows for some heteroscedasticity by assigning a different noise variance to each expert. By separating the components of the prediction from the DSVHGP it is possible to inspect the role of the heteroscedastic model. In Fig. 12 the GP which accounts for the mean prediction of the model and the GP which models the log noise variance are visualised separately as well as showing the full prediction. In the top left frame of the figure it can be seen that the GP over the latent function of the mean for the power curve has a very low variance. This indicates a confident prediction of the mean behaviour of the curve, which is seen to fit well with the data. Considering the top right frame in the figure the prediction of the GP over the log noise variance is shown. Here, the role of the heteroscedastic formulation in the GP is clearly seen. If each member of the mixture of experts were to exhibit homoscedastic behaviour within its region this plot would show three horizontal lines, with quick transitions between them. Instead it can be seen that the log noise variance is itself a nonlinear function which is evolving with wind speed. This curve also shows the expected behaviour that the variance of the noise on the power curve is lower towards the upper and lower bounds on the wind speed. It can also be seen that the increase in variance close to the nominal speed of the turbine has been modelled by this GP. Finally, due to the relatively low number of datapoints seen close to a nominal wind speed of one, the prediction of the variance in the region becomes less confident (seen by an increase in variance) and the predicted variance increases to accommodate this. However, since this is modelling the log noise variance the actual uncertainty seen in the model remains very low even toward this uncertain region since the mean remains low. Through the modelling of this collected data it has been possible to demonstrate how the proposed methodology based on the DSVHGP can accurately capture the behaviour of a power cure both in terms of its mean behaviour and through modelling the uncertainty. The use of the mixture of experts is shown to be a pragmatic approach to capturing the form of the power curve across three key regimes. Finally, the use of the heteroscedastic GP model is shown to allow the change in noise variance, across the wind speed, to be modelled; giving more reliable predictions of the uncertainty associated with the prediction across the full power curve.

Conclusions
The work contained in this paper has laid down a methodology for rigorous probabilistic modelling of wind turbine power curves d extending the work of [1,22,23] to the heteroscedastic case with sparsity, improving the quantification of uncertainty significantly. The Gaussian Process has been introduced as a flexible Bayesian machine learning technique for modelling of nonlinear functions. The difficultly in use of a GP model with large datasets has been discussed with the use of a sparse approximation suggested. The variational free energy approach of Titsias [42] has been presented as a powerful and robust method in which large numbers of training data points can be incorporated into the GP framework. In view of the heteroscedastic noise behaviour seen in power curve data, the extension of a GP model to include the modelling of this input dependent noise has been discussed. The method of L azaro-Gredilla and Titsias [43] has been to shown to achieve this by modelling the log noise variance of a process as an additional Gaussian Process which is learnt in a variational manner. Combining the theory developed for the VFE sparse approximation and the heteroscedastic approach led to a sparse variational heteroscedastic GP model. This model, introduced in Liu et al. [44] allows the learning of a heteroscedastic GP over large datasets, such as those collected by SCADA systems installed on wind turbines. Finally, the approach of building a mixture of experts model based on partitioning the input space of the data is shown to lead to further computational gains and better robustness when modelling   functions with multiple behavioural regimes. As discussed in Ref. [44] the robust Bayesian committee machine is a useful tool for doing this within a Bayesian framework.
A measured set of wind speed and power output data collected via a SCADA system has been used to demonstrate the usage of these approaches for modelling wind turbine power curves. The wind turbine power curve is seen to exhibit nonlinear behaviour and heteroscedastic noise processes. It is shown how the power curve can be transformed to approach a zero-mean space via a prelearnt mean function. The hyperbolic tangent function is used in this work since it ensures smoothness of the function to be learnt by the GP in the transformed space. Fitting the data in this space via either the homoscedastic or heteroscedastic GP leads to nearly identical NMSE scores indicating that the mean fits of the models are very similar. The models of the full power curves for both the homoscedastic and heteroscedastic GPs and for both training and testing data are found to be 0.81 d heuristically this represents a "very good" fit as the NMSE can be thought of as similar to a percentage error.
The move to heteroscedastic modelling of power curves, although having little effect on the mean prediction quality of the model and leads to far better quantification of the variance in the data. This is reflected in the likelihood scores of the predictions. The joint log likelihood of the predictions for both the training and testing data increase by over 50% when moving the heteroscedastic model. It is also seen that visually, the variance in the data is captured far better (Fig. 10).
In this work, it has been shown that the wind turbine power curve is well suited to being modelled via a heteroscedastic GP regression and the distributed sparse variational heteroscedastic GP is a powerful and expressive model with which to do this. The use of this model naturally handles the heteroscedastic noise present in the data, automatically returns predictions as the (approximate) distribution over possible outputs, and avoids the risk of overfitting d present in high-order polynomial models. Therefore, the use of this model represents a good choice should a user wish to accurately model the power curve of a wind turbine (with quantification of the uncertainty) and becomes more valuable as the probabilistic outputs are carried into further calculations.
As previously discussed, in possession of this probabilistic model, it is now possible to refine further analyses. This includes better quantification of uncertainty in SHM applications leading to reductions in false alarms and increased sensitivity to damage. It also provides important information for making macro-level decisions about the turbine or farm. This process could include the propagation of distributions over wind speed to give a distribution over expected power which can be used for better financial planning or for grid-level power management.

Future work
While the approach adopted in this paper has been seen to be effective at modelling the behaviour of a wind turbine power curve, it opens up a number of avenues of further investigation. It has been observed that, despite the mean predictions being consistent with the physical behaviour of the wind turbine, the predictive variance of the model places probability mass above the maximum power output of the turbine. This is a consequence of considering the variance of the prediction to be symmetric around the mean. While it would be possible to truncate this distribution above the maximum rated output, future work into modifications of the likelihood function could lead to more statistically robust solutions. It is also beneficial to consider if the parameters of the mean function would be better learnt inside the GP framework rather than applying the transformation into the zero-mean space before learning the (hyper)-parameters of the GP. Finally, it will be valuable in future to demonstrate the propagation of the predictive variance of the model into further analyses which may benefit from a Bayesian treatment. For example, to predict distributions over expected income from a particular turbine, or to enhance previously presented damage detection strategies.
In conclusion, the move towards nonparametric modelling of wind turbine power curves, allows the use of probabilistic models which offer robust and accurate mean predictions, as well as automatic quantification of uncertainty. The use of these models opens up better understanding of the uncertainty of the power output, of the turbine and avoids issues in overfitting that may occur in parametric models. For this reason, the use of heteroscedastic Gaussian Process models is a powerful and sensible approach moving forward.