Diversity and degrees of freedom in regression ensembles

Ensemble methods are a cornerstone of modern machine learning. The performance of an ensemble depends crucially upon the level of diversity between its constituent learners. This paper establishes a connection between diversity and degrees of freedom (i.e. the capacity of the model), showing that diversity may be viewed as a form of inverse regularisation. This is achieved by focusing on a previously published algorithm, Negative Correlation Learning (NCL), in which model diversity is explicitly encouraged through a diversity penalty term in the loss function. We provide an exact formula for the effective degrees of freedom in an NCL ensemble with fixed basis functions, showing that it is a continuous, convex and monotonically increasing function of the diversity parameter. We demonstrate a connection to Tikhonov regularisation and show that, with an appropriately chosen diversity parameter, an NCL ensemble can always outperform the unregularised ensemble in the presence of noise. We demonstrate the practical utility of our approach by deriving a method to efficiently tune the diversity parameter. Finally, we use a Monte-Carlo estimator to extend the connection between diversity and degrees of freedom to ensembles of deep neural networks.


Introduction
Ensemble methods are a cornerstone of modern machine learning. Numerous applications have shown that by combining a multiplicity of models we are able to train powerful estimators from large data sets in a tractable way. Successful ensemble performance emanates from a fruitful trade-off between the individual accuracy of the models and their diversity [10]. Typically diversity is introduced implicitly, by sub-sampling the data or varying the architecture of the models. In this paper we consider Negative Correlation Learning (NCL) [32], a powerful approach to learning ensembles of neural networks, in which diversity is encouraged explicitly by appending a diversity penalty term to the loss function. In the context of the recent breakthroughs in deep neural networks, ensembles of neural networks are likely to play an increasingly prominent role in machine learning applications. Thus, it is crucial that we obtain a deeper understanding of the dynamics of ensemble methods well suited to neural networks such as NCL. The statistical properties of NCL have already been studied in some detail [32,10,11]. Nonetheless, important questions remain surrounding the diversity parameter, the central hyperparameter in NCL which controls the level of emphasis placed upon the diversity penalty term. We shall address the following:
• How does the complexity of the ensemble estimator vary as a function of the diversity parameter?
• How can we efficiently optimise the diversity parameter on large data sets?
• Is the optimal value of the diversity parameter always strictly less than one?
The core of our investigation lies in a degrees of freedom analysis of NCL ensembles. Our contributions are as follows:
• We derive a formula for the degrees of freedom under the assumption of fixed basis functions (Section 3).
• We show analytically that the degrees of freedom is monotonically increasing as a function of the diversity parameter (Section 3).
• We present the surprising result that, in the presence of noise, the optimal value of the diversity parameter is always strictly less than one (Section 4).
• We develop an intriguing connection between NCL and Tikhonov regularisation (Section 5).
• We present an empirical verification of the theoretical results (Section 6).
• We give a fast and effective procedure for tuning the diversity parameter based upon the degrees of freedom (Section 7).
• We investigate ensembles of deep neural networks, demonstrating empirically that the degrees of freedom also behave monotonically with respect to the diversity parameter in this setting (Section 8).
The present paper extends a previously published conference paper [40]. The conference paper introduced the analytic formula for the degrees of freedom and demonstrated a computationally efficient approach to tuning the diversity parameter based on the formula. In the present paper we extend this work in two ways. Firstly, we present additional technical results: a connection between NCL and Tikhonov regularisation, and a result implying that the diversity parameter should never be set to precisely one in the presence of noise. Secondly, we use a Monte-Carlo estimator to conduct a detailed empirical investigation into the relationship between the diversity parameter and the degrees of freedom in ensembles of deep neural networks.
We shall begin by introducing the background on ensemble learning and degrees of freedom in Section 2.

Background
In this section we shall introduce the relevant background on Negative Correlation Learning (NCL) and degrees of freedom. We begin by setting the scene. Throughout this paper we consider the regression problem: we are given a data set D = {(x_n, y_n)}_{n=1}^N with (x_n, y_n) ∈ X × R. We shall assume that there is an underlying function μ : X → R such that for each n, y_n = μ(x_n) + ε_n, where (ε_n)_{n=1}^N is a mean-zero, independent and identically distributed random process. Our goal is to use the data D to provide an estimator μ̂ : X → R of the underlying function μ.
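The data-generating model above can be sketched in a few lines (the particular μ, noise level and sample size below are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 500
x = rng.uniform(-1.0, 1.0, size=N)        # design points x_n
sigma = 0.3                               # noise standard deviation

def mu(x):
    """Illustrative underlying function mu."""
    return 1.0 + 2.0 * x - x ** 2

eps = rng.normal(0.0, sigma, size=N)      # i.i.d. mean-zero noise eps_n
y = mu(x) + eps                           # targets y_n = mu(x_n) + eps_n
```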

Ensembles, diversity and the ambiguity decomposition
Ensemble methods aggregate the predictions of a multiplicity of constituent models in order to provide a more powerful model with lower generalisation error. In order for an ensemble to outperform a single model it is essential for its constituent models to be diverse [8]. In the classification setting, there is no straightforward relationship between the performance of an ensemble and its diversity [17]; the ensemble error can even exceed the average error of its constituent learners. In the regression setting, however, the squared error of the ensemble may be decomposed into the average squared error of its constituents minus the variance over the ensemble's predictions. To be precise, suppose we have an ensemble F = {f_m}_{m=1}^M consisting of M functions f_m : X → R. We let F := (1/M) · Σ_{m=1}^M f_m denote the ensemble function. For each (x, y) ∈ X × R we have the ambiguity decomposition,

(F(x) − y)² = (1/M) · Σ_{m=1}^M (f_m(x) − y)² − (1/M) · Σ_{m=1}^M (f_m(x) − F(x))².

Negative Correlation Learning exploits this decomposition by training the ensemble to minimise the loss

L_λ(F, (x, y)) = (1/M) · Σ_{m=1}^M [(f_m(x) − y)² − λ · (f_m(x) − F(x))²],   (3)

where λ ∈ [0, 1] is the diversity parameter. Hence, when λ = 0 each function f_m is trained individually and when λ = 1, L_λ is the squared error for the average F. Figure 1 displays the typical behaviour of the mean squared error as a function of the diversity parameter λ. As the diversity parameter moves from zero to one, the training error declines. This is to be expected given that L_λ with λ = 1 corresponds to the squared error for the ensemble F. The test error typically declines initially before rising sharply as λ approaches one. Whilst the test error does not always rise sharply as λ → 1 (see the Kinematics data set in Figure 1), we do consistently observe an increase in the gap between test and train error (we shall discuss the Kinematics data set in more detail in Section 4). It is this phenomenon that we intend to explain. Note that this increase cannot be explained by the upper bounds in [10] and [39], since these relate to behaviour on the training set D for λ > 1.
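The decomposition of the ensemble squared error into average member error minus prediction variance is an algebraic identity; the following sketch checks it numerically on arbitrary synthetic predictions:

```python
import numpy as np

rng = np.random.default_rng(1)

M, N = 5, 100
preds = rng.normal(size=(M, N))   # f_m(x_n): predictions of the M members
y = rng.normal(size=N)            # targets
F = preds.mean(axis=0)            # ensemble prediction F(x_n)

ensemble_err = (F - y) ** 2                        # squared error of the ensemble
avg_member_err = ((preds - y) ** 2).mean(axis=0)   # average member squared error
ambiguity = ((preds - F) ** 2).mean(axis=0)        # variance over the predictions

# Ambiguity decomposition: ensemble error = average error - diversity
assert np.allclose(ensemble_err, avg_member_err - ambiguity)
```

Since the ambiguity term is non-negative, the ensemble can never be worse than the average of its members in squared error, which is the regression-specific fact exploited by NCL.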

Open questions in Negative Correlation Learning
Taking λ < 1 appears to act as a regulariser, reducing the discrepancy between test and train error. This is intuitively plausible, given equation (3): when λ = 0 we are independently training a collection of M simple models and aggregating the result, an approach which is likely to under-fit. When λ = 1 we are minimising the training error for a single complex model F, which is likely to overfit. Choosing λ between zero and one blends smoothly between these extremes, providing an effective balance between underfitting and overfitting. Nonetheless, the underlying hypothesis class does not change with λ. Hence, the usual formalisms of VC dimension or Rademacher complexity do not apply. This raises several questions. Firstly, how can we provide a rigorous explanation for the apparent regularisation effect of taking the diversity parameter λ < 1, in spite of the fact that the hypothesis class remains unchanged? Secondly, under what circumstances does taking λ < 1 improve performance? Finally, how can we optimise the diversity parameter on large data sets without resorting to an expensive cross-validation procedure? To address these questions we shall use the concept of degrees of freedom.

Degrees of freedom and Stein's unbiased risk estimate
In this section we shall introduce the relevant background on degrees of freedom. We shall operate within the fixed design setting, frequently utilised in statistics [14,50,44]. To be precise, suppose we have a fixed design matrix X = [x_1, · · · , x_N] consisting of feature vectors x_n ∈ X and assume the existence of an underlying function μ : X → R. Our data set D = {(x_n, y_n)}_{n=1}^N consists of pairs (x_n, y_n) ∈ X × R generated by y_n = μ(x_n) + ε_n. Here (ε_n)_{n=1}^N is a mean-zero, independent and identically distributed (i.i.d.) random process with variance σ² = E[ε_n²]. Given an estimator μ̂ for μ we define the true error R_true and the empirical error R_emp, respectively, by

R_true(μ̂) := (1/N) · Σ_{n=1}^N E[(μ̂(x_n) − μ(x_n))²],   R_emp(μ̂, D) := (1/N) · Σ_{n=1}^N (μ̂(x_n) − y_n)².

The key tool in our analysis of NCL will be the degrees of freedom.

Definition 1 (Degrees of freedom). Given an estimator μ̂ and a data set D, the degrees of freedom is defined by df(μ̂, D) := Σ_{n=1}^N ∂μ̂(x_n)/∂y_n.
In the ordinary least squares setting df(μ̂, D) is simply the number of non-trivial parameters in the model, hence the nomenclature. However, in general the degrees of freedom is not determined by the number of model parameters [48]. The degrees of freedom quantifies the complexity of an estimation procedure via the sensitivity of the model outputs to the data. A fundamental result of Stein implies that the degrees of freedom determines the relationship between the empirical error and the true error of an estimator.
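For a linear smoother, with fitted values ŷ = S·y for a matrix S that does not depend on y, the definition reduces to df = trace(S). The sketch below (synthetic data, NumPy only) checks that ordinary least squares recovers its parameter count while a ridge smoother is charged fewer degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 200, 8
X = rng.normal(size=(N, p))

# OLS: y_hat = S y with S = X (X^T X)^{-1} X^T, so df = trace(S) = p
S_ols = X @ np.linalg.inv(X.T @ X) @ X.T
df_ols = np.trace(S_ols)

# Ridge: S = X (X^T X + alpha I)^{-1} X^T shrinks, so df = trace(S) < p
alpha = 10.0
S_ridge = X @ np.linalg.inv(X.T @ X + alpha * np.eye(p)) @ X.T
df_ridge = np.trace(S_ridge)
```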
Definition 2 (Stein's unbiased risk estimate). Given an estimator μ̂, a data set D, and a positive real σ̂ > 0, Stein's unbiased risk estimate is defined by

SURE(μ̂, D, σ̂) := R_emp(μ̂, D) + (2σ̂²/N) · df(μ̂, D) − σ̂².

Theorem 3 (Stein [43]). Suppose that the noise process (ε_n)_{n=1}^N is i.i.d. with ε_n ∼ N(0, σ²). Given any estimator μ̂ such that for each n = 1, · · · , N, μ̂(x_n) is differentiable with respect to y_n, we have

E[R_true(μ̂)] = E[SURE(μ̂, D, σ)].

Theorem 3 implies that the true error may be broken down, in expectation, into two concrete factors: the empirical error and the degrees of freedom. In Section 3 we shall use this decomposition to give a theoretical explanation for the empirical behaviour of NCL ensembles described in Section 2. Theorem 3 also has important practical implications for model selection. Suppose we have a family of estimators A. We would like to choose the estimator μ̂ ∈ A which optimises performance by minimising R_true(μ̂) over A. Unlike the true error R_true(μ̂), the empirical error R_emp(μ̂, D) may be computed directly from the data set. However, R_emp(μ̂, D) is a biased estimate of the true error R_true(μ̂), with the bias depending upon the sensitivity of the estimator. Hence, if we minimise R_emp(μ̂, D) over μ̂ ∈ A we are likely to obtain a highly complex estimator which overfits the training data. Theorem 3 states that SURE(μ̂, D, σ) gives an unbiased estimate of the true error R_true(μ̂). This motivates a strategy in which we select μ̂ ∈ A by minimising SURE(μ̂, D, σ̂), provided we are able to compute df(μ̂, D) and obtain an estimate σ̂ of the noise standard deviation σ. In Section 7 we use this approach to give a highly efficient procedure for tuning the diversity parameter in NCL ensembles.
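The following sketch illustrates Stein's identity for a ridge smoother, taking R_emp to be the training mean squared error, R_true the mean squared estimation error at the design points, and SURE = R_emp + (2σ²/N)·df − σ² (the standard fixed-design forms); averaged over simulated noise draws, SURE tracks the true error:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, sigma, alpha = 100, 10, 0.5, 5.0

X = rng.normal(size=(N, p))
mu = X @ rng.normal(size=p) * 0.3                         # underlying signal mu(x_n)
S = X @ np.linalg.inv(X.T @ X + alpha * np.eye(p)) @ X.T  # ridge smoother
df = np.trace(S)

sure_vals, true_vals = [], []
for _ in range(500):
    y = mu + rng.normal(0.0, sigma, size=N)
    y_hat = S @ y
    r_emp = np.mean((y_hat - y) ** 2)                  # empirical error
    r_true = np.mean((y_hat - mu) ** 2)                # true (estimation) error
    sure = r_emp + 2 * sigma ** 2 * df / N - sigma ** 2
    sure_vals.append(sure)
    true_vals.append(r_true)

gap = abs(np.mean(sure_vals) - np.mean(true_vals))   # should be close to zero
```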

A theoretical investigation of the degrees of freedom in Negative Correlation Learning
In this section we give a theoretical analysis of the degrees of freedom of NCL ensembles. We consider a special class of NCL ensembles F in which each function f_m consists of a linear map applied to a fixed basis function φ_m. More precisely, for each m = 1, · · · , M we take a basis function φ_m : X → R^{H×1} which does not depend upon the targets {y_n} but may depend upon the design matrix X. We let H_m = {x ↦ ⟨w_m, φ_m(x)⟩ : w_m ∈ R^{H×1}}, where ⟨·, ·⟩ denotes the dot product, and assume that f_m ∈ H_m (see Figure 2). This encompasses a broad range of highly expressive mappings depending upon the choice of basis function. Examples include Nyström features [47,12] and Random Fourier Features [36,37] (see Section 6 for details). We consider ensembles F = {f_m}_{m=1}^M with f_m ∈ H_m for each m. The assumption of fixed basis functions allows us to analytically derive several properties of the ensemble that we would otherwise only be able to observe empirically.
The assumption of fixed basis functions allows us to derive a closed form solution for the minimiser of L_λ. We first introduce some notation. Let Q = H · M and let φ : X → R^{Q×1} denote the stacked basis function φ(x) := [φ_1(x)^T, · · · , φ_M(x)^T]^T. Given vector-valued functions a and b on X we write ⟨a, b⟩_D := (1/N) · Σ_{n=1}^N a(x_n) · b(x_n)^T for the empirical second-moment matrix. Similarly, for each m = 1, · · · , M, we define ⟨φ_m, φ_m⟩_D ∈ R^{H×H}, and we let ⟨φ, φ⟩_diag,D := diag(⟨φ_1, φ_1⟩_D, · · · , ⟨φ_M, φ_M⟩_D). We shall assume that ⟨φ_m, φ_m⟩_D is of full rank H. This assumption is natural when H ≪ N, as is the case in typical applications of NCL [10,32]. The assumption only fails to hold when one of the H coordinates of a basis function φ_m is expressible as a linear combination of the other H − 1 coordinates of φ_m, over all N examples. This is a strong form of redundancy which we are unlikely to encounter in practice.
Given a matrix M we let M⁺ denote the Moore-Penrose pseudo-inverse of M. Given λ ∈ [0, 1], we let F_λ denote the ensemble function which minimises the NCL loss L_λ averaged over the data set, i.e. F_λ := argmin_F (1/N) · Σ_{n=1}^N L_λ(F, (x_n, y_n)).
Theorem 4. Suppose that each function f_m consists of a linear map applied to a fixed basis function φ_m. For each λ ∈ [0, 1], the ensemble function F_λ : X → R is given by F_λ(x) = ⟨β_λ, φ(x)⟩, where

β_λ = (M · (1 − λ) · ⟨φ, φ⟩_diag,D + λ · ⟨φ, φ⟩_D)⁺ ⟨φ, y⟩_D.

The closed form solution allows us to efficiently locate the minimiser of the NCL loss. Moreover, Theorem 4 leads to the following theorem, which gives an explicit formula for the degrees of freedom of F_λ and describes the behaviour of df(F_λ, D) as a function of λ.
Theorem 5. Suppose that each function f_m is a linear map of a fixed basis function. Given any λ ∈ [0, 1], the degrees of freedom for F_λ is given by

df(F_λ, D) = Σ_{q=1}^Q ρ_q/(M · (1 − λ) + λ · ρ_q),

where (ρ_q)_{q=1}^Q denote the eigenvalues of the matrix P constructed in Appendix B. In particular, we have the following behaviour:
• The function λ → df(F_λ, D) is continuous, increasing and convex;
• The function λ → R_emp(F_λ, D) is continuous and decreasing.
Full proofs for all theorems presented in this paper may be found in the appendix. Theorem 5 is the core result in our analysis. It gives a rigorous account of the empirical behaviour of NCL ensembles observed in Figure 1: As the diversity parameter increases, the degrees of freedom increase monotonically, and at an increasing rate, leading to an increasing discrepancy between the test and train error. This behaviour leads to our view of diversity as a form of inverse regularisation. Typically increasing a regularisation parameter leads to more stable models with lower degrees of freedom [50]. Conversely, increasing the diversity parameter produces a less stable model with more degrees of freedom. We shall confirm these results empirically in Section 6.
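Under the fixed-basis-function assumption these results are easy to probe numerically. The sketch below uses synthetic random features and solves the stationarity system (M·(1 − λ)·⟨φ, φ⟩_diag,D + λ·⟨φ, φ⟩_D)·β_λ = ⟨φ, y⟩_D, our own rearrangement of the NCL optimality conditions, so treat it as an illustration rather than the paper's code; it reproduces the monotone degrees of freedom and training error described in Theorem 5:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, H = 200, 4, 5
Q = M * H

Phi = rng.normal(size=(Q, N))      # stacked fixed basis functions phi(x_n)
y = rng.normal(size=N)

G = Phi @ Phi.T / N                # <phi, phi>_D
Gdiag = np.zeros_like(G)           # block-diagonal part <phi, phi>_diag,D
for m in range(M):
    s = slice(m * H, (m + 1) * H)
    Gdiag[s, s] = G[s, s]

def df_and_mse(lam):
    A = M * (1 - lam) * Gdiag + lam * G        # system defining beta_lambda
    S = Phi.T @ np.linalg.pinv(A) @ Phi / N    # smoothing matrix S_lambda(X)
    resid = y - S @ y
    return np.trace(S), float(np.mean(resid ** 2))

lams = np.linspace(0.0, 1.0, 11)
dfs, mses = zip(*(df_and_mse(l) for l in lams))
# dfs runs from H (at lambda = 0) up to Q (at lambda = 1), increasing in lambda,
# while the training error mses is decreasing in lambda
```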

Too much diversity?
In this section we answer a fundamental question raised by Brown et al. [10]: under what conditions does taking the diversity parameter λ < 1 improve the performance of the NCL ensemble? We shall return to the fixed basis function setting introduced in Section 3, in which each function f_m consists of a linear map applied to a fixed basis function. In this setting, we provide the following surprising answer: whenever the noise variance σ² is strictly positive, taking λ = 1 is always sub-optimal. That is, whenever σ² > 0, there exists a value λ_opt < 1 such that R_true(F_{λ_opt}) < R_true(F_1).

Theorem 6. Suppose that each function f_m is a linear map of a fixed basis function and let σ² > 0. Then there exists λ_opt ∈ [0, 1) such that R_true(F_{λ_opt}) < R_true(F_1).
In light of equation (3) this result is perhaps surprising: taking λ < 1 moves the NCL loss L_λ away from directly targeting the squared error of the ensemble. However, by Theorem 5 we see that reducing the value of the diversity parameter also reduces the degrees of freedom, making the ensemble more robust to noise in the data. Theorem 6 goes further, showing that whenever the noise is positive there exists a λ < 1 such that the reduction in error due to the fall in degrees of freedom more than compensates for the increase in the training error. Let us return briefly to the results displayed in Figure 1, Section 2. For the first five data sets we observed a sharp rise in test error when λ = 1, which conforms neatly to the conclusions of Theorem 6. However, for the Kinematics data set we see that the optimal value of λ ∈ {0.00, 0.05, · · · , 0.95, 1.00} is actually λ = 1.00. This seems in tension with Theorem 6. There are two possibilities: either we have a data set where σ = 0, which seems highly unlikely, or the optimal value of λ lies strictly between 0.95 and 1.00. Hence, we repeat the experiment with λ close to one: λ ∈ {1 − 10^{−l} : l ∈ {2, · · · , 20}}. The results are displayed in Figure 3. As we can see, the optimal value of the diversity parameter is λ ≈ 1 − 10^{−12} < 1, which resolves the apparent tension with Theorem 6.
Theorem 6 is curiously reminiscent of a classical result of Hoerl and Kennard on Tikhonov regularisation [23]. Hoerl and Kennard's result states that in the presence of noise there always exists a strictly positive value of the regularisation parameter which outperforms the unregularised least squares estimator. This strengthens our view of diversity as a form of inverse regularisation, in which taking the diversity parameter to be strictly less than one bears an interesting parallel to positive Tikhonov regularisation. In the next section we shall provide a more detailed exploration of the connections between NCL and Tikhonov regularisation.
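Hoerl and Kennard's result is easy to reproduce numerically: with noisy targets, some strictly positive ridge penalty beats unregularised least squares in true error (a synthetic simulation with illustrative parameter values):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, sigma = 60, 30, 1.0
X = rng.normal(size=(N, p))
w_true = rng.normal(size=p) * 0.1     # weak underlying signal
mu = X @ w_true

def true_error(alpha, reps=300):
    """Monte-Carlo estimate of the true error of the ridge smoother."""
    S = X @ np.linalg.inv(X.T @ X + alpha * np.eye(p)) @ X.T
    errs = []
    for _ in range(reps):
        y = mu + rng.normal(0.0, sigma, size=N)
        errs.append(np.mean((S @ y - mu) ** 2))
    return float(np.mean(errs))

err_ols = true_error(0.0)     # unregularised least squares
err_ridge = true_error(5.0)   # a strictly positive penalty
```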

Negative Correlation Learning and Tikhonov regularisation
In this section we explore the relationship between NCL ensembles F λ and Tikhonov regularisation. Tikhonov regularisation, also known as weight decay or L2 regularisation, is perhaps the most widely used form of regularisation in machine learning -one simply adds a constant multiple of the L2 norm of the weights into the cost function. This encourages the weights to be close to the origin with respect to the Euclidean distance, increasing the robustness of the resultant model.
We shall consider the fixed basis function setting in which we have a function of the form G = (1/M) · Σ_{m=1}^M g_m, where each function g_m consists of a linear weight composed with a fixed basis function, g_m(x) = ⟨w_m, φ_m(x)⟩. We define the Tikhonov regularised loss function L^Tik_γ by adding γ times the squared L2 norm of the weights to the squared error of G, and we let G^Tik_γ denote the minimiser of the Tikhonov regularised loss function L^Tik_γ averaged over the data set D. Recall from Section 3 that F_λ denotes the function obtained by minimising the NCL loss L_λ introduced in equation (3). We have the following relationship between G^Tik_γ and F_λ.
Theorem 7 gives a close relationship between NCL and Tikhonov regularisation. This connection is especially interesting in light of the range of perspectives on Tikhonov regularisation present in the literature, from Bayesian perspectives [6] through to viewing Tikhonov regularisation as training in the presence of noise [5]. Note that the level of Tikhonov regularisation decays monotonically in relation to the corresponding value of the diversity parameter in Theorem 7. This conforms to the view of diversity as a form of inverse regularisation.

An empirical investigation of the degrees of freedom curve
In this section we empirically verify the key results in Theorem 5 concerning the behaviour of both the degrees of freedom df(F_λ, D) and the training error R_emp(F_λ, D) as a function of the diversity parameter λ. We consider six data sets described in Appendix E. In each case the data was preprocessed by renormalising the targets to mean zero and standard deviation one. For each data set we consider three architectures with five, ten and twenty hidden nodes per module, respectively, i.e. H ∈ {5, 10, 20}. In each case the total number of hidden nodes is set at one thousand (H · M = 1000), so the number of modules is M ∈ {200, 100, 50}.
For each m = 1, · · · , M we take as our basis function φ_m : X → R^H an H-tuple of Random Fourier Features. Random Fourier Features were introduced by Rahimi and Recht [36] and build upon the longstanding tradition of generating the lower layers of a neural network by randomisation, rather than optimisation [7]. These basis functions are of the form φ_m(x) = cos(ζ_m x + b_m), for x ∈ R^{d×1} in the original feature space, ζ_m ∈ R^{H×d}, b_m ∈ R^{H×1} and cos applied elementwise. Rahimi and Recht proved that, given any stationary Mercer kernel k(x, y) = k(x − y) on R^d, if we sample the rows of ζ_m randomly from the normalised Fourier transform of k, and sample the elements of b_m uniformly from [0, 2π), then (2/H) · ⟨φ_m(x), φ_m(x')⟩ ≈ k(x − x'), with an approximation error decaying exponentially with H. Thus, we may view Random Fourier Features φ_m constructed in this way as the composition of two mappings: a mapping to the reproducing kernel Hilbert space, composed with a low-dimensional projection which approximately preserves the inner-product structure. In our experiments we use the Scikit-Learn implementation [35], with the Gaussian kernel, which is its own Fourier transform (up to rescaling), so the rows of ζ_m are sampled from a d-dimensional Gaussian. We use these basis functions as they are highly expressive and have been successfully used in a range of applications [36,37].
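The guarantee of Rahimi and Recht can be checked directly for the Gaussian kernel, whose normalised Fourier transform is itself Gaussian. The following minimal sketch (not the Scikit-Learn implementation used in our experiments) approximates k(x, x') = exp(−||x − x'||²/2):

```python
import numpy as np

rng = np.random.default_rng(6)
d, H = 3, 20000

# For k(x, x') = exp(-||x - x'||^2 / 2), frequencies are sampled from N(0, I)
Z = rng.normal(size=(H, d))               # rows play the role of zeta
b = rng.uniform(0.0, 2 * np.pi, size=H)   # phases drawn uniformly from [0, 2*pi)

def rff(x):
    """Random Fourier Feature map, normalised so rff(x) @ rff(x') ~ k(x, x')."""
    return np.sqrt(2.0 / H) * np.cos(Z @ x + b)

x1, x2 = rng.normal(size=d), rng.normal(size=d)
approx = rff(x1) @ rff(x2)
exact = np.exp(-np.sum((x1 - x2) ** 2) / 2.0)
```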
We tune the frequency parameter γ for the Random Fourier Features using the heuristic method of Kwok and Tsang [29]. For each λ ∈ {0, 0.05, · · · , 0.95, 1.0} we train an NCL estimator F_λ. The weights w_m are selected using the analytic solution given in Theorem 4, rather than via stochastic gradient descent. Note that this has the benefit of removing the need to select a learning rate. Figure 4 shows the degrees of freedom df(F_λ, D) as a function of the diversity parameter λ, computed via the formula in Theorem 5. Each graph clearly demonstrates the behaviour predicted by Theorem 5 concerning the degrees of freedom. In particular, the degrees of freedom function λ → df(F_λ, D) is continuous, strictly increasing and strictly convex. Figure 5 shows the training error R_emp(F_λ, D) as a function of the diversity parameter λ. As predicted by Theorem 5, the function λ → R_emp(F_λ, D) is strictly decreasing.

Using the degrees of freedom formula to tune the diversity parameter

In this section we give a practical application for the explicit formula for degrees of freedom presented in Theorem 5. We shall use the degrees of freedom formula, combined with Stein's unbiased risk estimate (see Definition 2), to tune the diversity parameter. We consider a range of data sets on NCL ensembles consisting of M = 100 networks, each with H = 10 hidden nodes. As in Section 6 we use basis functions φ_m : X → R^{H×1} formed by taking an H-tuple of Random Fourier Features [36], with the frequency parameter set via the method of Kwok and Tsang [29]. We compare two approaches to obtaining a proxy for the true out-of-sample error. First, we have the baseline method, in which we perform 5-fold cross-validation within the training data. Second, we have the method based on Stein's unbiased risk estimate (SURE), in which we compute the degrees of freedom using Theorem 5.
In the latter approach we use the variance estimate σ̂² := Σ_{n=1}^N (F_0(x_n) − y_n)²/(N − H), following the approach of Ye [48, equation (20)] applied to F_0. In each case the diversity parameter is selected by applying the Brent minimisation routine [1] to the corresponding criterion (5-fold cross-validation error or SURE). We evaluate the mean squared error on the test set, with the selected diversity parameter, along with the total optimisation time, for each method. This entire procedure is repeated five times on distinct test-train splits of the data obtained via cross-validation. Note that this means that in the case of the baseline method there are two nested cross-validation loops: an outer loop, which is also applied to the SURE method, to assess the efficacy of the procedure, and an inner loop in which the diversity parameter is selected based solely on the training data from the outer loop. The data sets used are described in Appendix E. Several data sets correspond to multi-output regression problems. For these data sets we opted for a simple approach in which an entirely separate NCL model was trained for each of the outputs. More nuanced approaches to multi-output regression, which take account of the dependencies between the target variables, are available [25,42]. We leave the question of how well NCL performs in combination with these more nuanced approaches for future work. The displayed test error is obtained by averaging the test errors over the different outputs. Table 1 shows the results. Whilst the 5-fold cross-validation method and the SURE method perform similarly in terms of test error, the SURE method leads to a significant reduction in training time.
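The tuning loop can be sketched as follows, with random Gaussian features standing in for Random Fourier Features, a grid search standing in for Brent's method, and the smoothing matrix obtained from our own rearrangement of the NCL optimality conditions; the variance estimate follows the F_0-residual rule described above:

```python
import numpy as np

rng = np.random.default_rng(7)
N, M, H = 100, 8, 5
Q = M * H

Phi = rng.normal(size=(Q, N))   # stand-in for the fixed basis functions
y = rng.normal(size=N)          # pure-noise targets: regularisation should win

G = Phi @ Phi.T / N
Gdiag = np.zeros_like(G)
for m in range(M):
    s = slice(m * H, (m + 1) * H)
    Gdiag[s, s] = G[s, s]

def smoother(lam):
    A = M * (1 - lam) * Gdiag + lam * G
    return Phi.T @ np.linalg.pinv(A) @ Phi / N

# Noise variance estimate from the lambda = 0 ensemble residuals
resid0 = y - smoother(0.0) @ y
sigma2_hat = np.sum(resid0 ** 2) / (N - H)

def sure(lam):
    S = smoother(lam)
    r_emp = np.mean((y - S @ y) ** 2)
    return r_emp + 2 * sigma2_hat * np.trace(S) / N - sigma2_hat

grid = np.linspace(0.0, 1.0, 21)
best_lam = min(grid, key=sure)   # SURE-selected diversity parameter
```

On pure-noise targets the SURE criterion correctly favours heavy regularisation, i.e. a small diversity parameter.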

The degrees of freedom for deep neural network ensembles
Deep neural networks play an increasingly central role in a wide variety of state-of-the-art machine learning applications [27,30,41,18]. Deep neural networks utilise multiple layers of trainable parameters to construct hierarchical representations [2,4]. Our analytic formula for the degrees of freedom of an NCL ensemble requires that each constituent model in the ensemble consists of a linear map composed with a fixed basis function. Thus, our formula does not apply directly to ensembles of deep neural networks. In order to investigate the relationship between degrees of freedom and diversity for ensembles of deep neural networks we construct a Monte-Carlo estimator for the degrees of freedom. By using this estimator we can extend the empirical conclusions of Section 6. In particular, we shall show that the degrees of freedom increases monotonically with the level of diversity, whilst the training error decreases monotonically with the level of diversity. Our approach is based on the estimators of Ramani [38] and Ye [48]. Ramani's approach begins with the following theorem.
Theorem 8 (Ramani [38]). Suppose that g : R^N → R^N is a twice differentiable function. Let b be an independent and identically distributed random vector in R^N with standard Gaussian entries. Then, for all y = (y_n)_{n=1}^N ∈ R^N, we have

Σ_{n=1}^N ∂g(y)_n/∂y_n = lim_{δ→0} E_b[⟨b, (g(y + δ · b) − g(y))/δ⟩].
Now suppose we have an estimator μ̂ : X → R. The estimator μ̂ implicitly depends upon the random vector of targets y = [y_1, · · · , y_N] ∈ R^N, where D = {(x_n, y_n)}_{n=1}^N. In this section we shall consider estimators trained with different data sets, and so make this dependence explicit with the superscript μ̂^y. Given y ∈ R^N we let μ̂^y also denote the vector [μ̂^y(x_1), · · · , μ̂^y(x_N)]. Theorem 8 has the following immediate corollary.
Corollary 9. Suppose that y ↦ μ̂^y is a twice differentiable function. Let b ∈ R^N with i.i.d. entries b_n ∼ N(0, 1). Then,

df(μ̂, D) = lim_{δ→0} E_b[⟨b, (μ̂^{y+δ·b} − μ̂^y)/δ⟩].

Replacing the limit and expectation with a fixed small δ and an average over random draws of b yields the Monte-Carlo estimator df_Ramani(μ̂, D). This is the direct analogue of the Monte-Carlo estimate of Ramani [38] in the supervised regression setting. Recently, Gao and Jojic conducted a detailed investigation into the degrees of freedom of deep neural networks for classification, with the cross-entropy loss [16], based on a similar estimator to df_Ramani(μ̂, D). We shall apply the estimator df_Ramani(μ̂, D) to empirically study the degrees of freedom of NCL ensembles of deep neural networks.
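For a linear smoother the divergence in Corollary 9 is exactly trace(S), which gives a convenient check of a minimal Monte-Carlo version of the estimator (a synthetic ridge example; the step size δ and the number of probe vectors are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
N, p, alpha = 50, 10, 1.0

X = rng.normal(size=(N, p))
y = rng.normal(size=N)
S = X @ np.linalg.inv(X.T @ X + alpha * np.eye(p)) @ X.T  # ridge smoother

def mu_hat(targets):
    """The estimator as a map y -> (mu_hat(x_1), ..., mu_hat(x_N))."""
    return S @ targets

def df_ramani(estimator, y, delta=1e-3, probes=3000):
    """Monte-Carlo divergence: average of b^T (g(y + delta b) - g(y)) / delta."""
    base = estimator(y)
    total = 0.0
    for _ in range(probes):
        b = rng.normal(size=y.shape)
        total += b @ (estimator(y + delta * b) - base) / delta
    return total / probes

df_true = np.trace(S)            # exact degrees of freedom of the smoother
df_est = df_ramani(mu_hat, y)    # Monte-Carlo estimate, close to df_true
```

For a nonlinear estimator, such as a deep ensemble, the same routine applies unchanged: one simply retrains (or re-evaluates) the estimator on the perturbed targets.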
We consider deep NCL ensembles F_λ, trained to minimise the NCL loss L_λ (see Figure 6). In each case our architecture consisted of an ensemble of M = 30 networks with H = 5 hidden nodes per layer. The number of layers was varied between 3 and 5. Each layer was fully connected with tanh activation functions. To optimise L_λ we employ Adam, a recent variant of stochastic gradient descent which efficiently trains deep networks by scaling weight updates by an estimate of the variance [26]. In addition we apply batch normalisation [24] to avoid internal covariate shift. Without these tools optimisation of F_λ was extremely slow. Following Bengio [3], the initial learning rate was selected from {10^{−6}, · · · , 10^{−1}, 1}, and the learning rate was halved after 100 epochs without a reduction in the cost L_λ. The weights were initialised using the method of Mishkin [33]. Figure 7 shows the degrees of freedom estimate df_Ramani(F_λ, D), computed via Algorithm 1, as a function of the diversity parameter λ. The mean and standard error are displayed, based on a five-fold cross-validation procedure. For each data set and each number of layers we observe that the degrees of freedom estimate df_Ramani(F_λ, D) is monotonically increasing as a function of the diversity parameter. This conforms to the pattern observed in the case of fixed basis functions in Figure 4. Figure 8 shows the training error R_emp(F_λ, D) as a function of the diversity parameter. In each case the training error is monotonically decreasing as a function of the diversity parameter. Again, this matches the pattern observed in the case of fixed basis functions in Figure 5. Thus, whilst the assumptions of Theorem 5 do not extend to ensembles of deep neural networks, our empirical results give strong evidence that the monotonicity of both the degrees of freedom and the training error holds in this more general setting.

Discussion
In this paper we have addressed several previously unanswered questions concerning Negative Correlation Learning, a highly effective method for training neural network ensembles. The core of our approach lies in a degrees of freedom analysis of NCL. We gave an explicit formula for the degrees of freedom of NCL ensembles with fixed basis functions and showed that the degrees of freedom is a monotonically increasing and convex function of the diversity parameter. We then confirmed these results empirically on a range of data sets. We demonstrated that our degrees of freedom formula may be utilised to tune the diversity parameter in a way that is both fast and effective. We then extended our empirical analysis to ensembles of deep neural networks by using a Monte-Carlo estimator for the degrees of freedom due to Ramani [38]. We also presented the surprising result that the optimal value of the diversity parameter is always strictly below one whenever there is positive noise in the data. Finally, we presented an interesting connection to Tikhonov regularisation. Overall, we developed a deeper understanding of how the statistical behaviour of NCL ensembles depends upon the level of diversity, through a degrees of freedom analysis, showing that diversity acts as a form of inverse regularisation.

Appendix A. Proof of Theorem 4
In this section we focus on a special class of NCL ensembles F in which each function f_m consists of a linear map applied to a fixed basis function. That is, for each m = 1, · · · , M there exists a fixed function φ_m : X → R^{H×1} with f_m(x) = ⟨w_m, φ_m(x)⟩, where w_m ∈ R^{H×1} is a trainable weight vector. In this situation the ensemble F which minimises L_λ averaged over the data has a simple closed form solution. We first introduce some notation. Let Q = H · M and let φ : X → R^{Q×1} denote the stacked basis function φ(x) := [φ_1(x)^T, · · · , φ_M(x)^T]^T. Similarly, for each m = 1, · · · , M, we define ⟨φ_m, φ_m⟩_D := (1/N) · Σ_{n=1}^N φ_m(x_n) · φ_m(x_n)^T, and we let ⟨φ, φ⟩_diag,D := diag(⟨φ_1, φ_1⟩_D, · · · , ⟨φ_M, φ_M⟩_D). We shall assume that ⟨φ_m, φ_m⟩_D is of full rank H. Under these assumptions the NCL estimator has a closed form solution. Note that given a matrix A, A⁺ denotes the Moore-Penrose pseudo-inverse of A.
Theorem 4. Suppose that each function f_m consists of a linear map applied to a fixed basis function φ_m. For each λ ∈ [0, 1], the ensemble function F_λ : X → R is given by F_λ(x) = ⟨β_λ, φ(x)⟩, where β_λ = (M · (1 − λ) · ⟨φ, φ⟩_diag,D + λ · ⟨φ, φ⟩_D)⁺ ⟨φ, y⟩_D.

Proof of Theorem 4. Recall that the NCL loss is given by

L_λ(F, (x, y)) = (1/M) · Σ_{m=1}^M [(f_m(x) − y)² − λ · (f_m(x) − F(x))²].

Differentiating with respect to w_m and using the fact that Σ_{l=1}^M (f_l(x) − F(x)) = 0, we have

∂L_λ/∂w_m = (2/M) · [(1 − λ) · f_m(x) + λ · F(x) − y] · φ_m(x).

Here, w denotes w := [w_1^T, · · · , w_M^T]^T ∈ R^{Q×1}. Hence, to minimise the NCL loss L_λ averaged over the data D, for each m we must have

⟨φ_m, (1 − λ) · f_m + λ · F − y⟩_D = 0.

This is equivalent to

(1 − λ) · ⟨φ_m, φ_m⟩_D · w_m + (λ/M) · ⟨φ_m, φ⟩_D · w = ⟨φ_m, y⟩_D.

Thus, at the minimum we have

((1 − λ) · ⟨φ, φ⟩_diag,D + (λ/M) · ⟨φ, φ⟩_D) · w = ⟨φ, y⟩_D.

Taking β_λ = (1/M) · w proves the theorem.
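As a numerical sanity check on the stationarity conditions above, one can verify that the weights solving the stacked linear system attain the minimum of the averaged NCL loss (synthetic features; an illustration of the derivation rather than the paper's code):

```python
import numpy as np

rng = np.random.default_rng(9)
N, M, H, lam = 120, 3, 4, 0.7
Q = M * H

Phi = rng.normal(size=(Q, N))   # stacked fixed basis functions phi(x_n)
y = rng.normal(size=N)

def ncl_loss(w):
    """Averaged NCL loss (1/(NM)) sum_n sum_m [(f_m - y)^2 - lam (f_m - F)^2]."""
    preds = (Phi * w[:, None]).reshape(M, H, N).sum(axis=1)  # f_m(x_n)
    F = preds.mean(axis=0)                                   # ensemble prediction
    per_module = (preds - y) ** 2 - lam * (preds - F) ** 2
    return float(per_module.mean())

G = Phi @ Phi.T / N
Gdiag = np.zeros_like(G)
for m in range(M):
    s = slice(m * H, (m + 1) * H)
    Gdiag[s, s] = G[s, s]

# Stacked stationarity conditions: ((1 - lam) Gdiag + (lam / M) G) w = <phi, y>_D
w_star = np.linalg.solve((1 - lam) * Gdiag + (lam / M) * G, Phi @ y / N)
```

Since the averaged loss is a convex quadratic in w for λ ∈ [0, 1], the stationary point w_star is a global minimiser: perturbing it in any direction can only increase the loss.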
For each m ∈ {1, · · · , M} let Φ_m := [φ_m(x_1), · · · , φ_m(x_N)] ∈ R^{H×N} and let Φ := [Φ_1^T, · · · , Φ_M^T]^T ∈ R^{Q×N}.

Corollary 11. Suppose that each function f_m consists of a linear map applied to a fixed basis function φ_m. For each λ ∈ [0, 1], the ensemble function F_λ is a linear smoother with smoothing matrix

S_λ(X) = (1/N) · Φ^T (M · (1 − λ) · ⟨φ, φ⟩_diag,D + λ · ⟨φ, φ⟩_D)⁺ Φ.

Proof. Corollary 11 follows immediately from Theorem 4.

Appendix B. Proof of Theorem 5
In order to prove Theorem 5 we require some supporting results.
Lemma 13. We have $\|P\|_{\mathrm{spectral}} \le M$.
Proof. For each $m = 1, \dots, M$ we take a singular value decomposition $\Phi_m = U_m \Sigma_m V_m^T$, where $U_m$ and $V_m$ are orthonormal and $\Sigma_m$ is diagonal. By construction we have
$$P = \begin{pmatrix} P_{1,1} & P_{1,2} & \cdots & P_{1,M} \\ P_{2,1} & P_{2,2} & \cdots & P_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ P_{M,1} & P_{M,2} & \cdots & P_{M,M} \end{pmatrix}.$$
Since each block $P_{l,m}$ is the composition of a projection and an orthogonal matrix, given any pair $l, m \in \{1, \dots, M\}$ we have $\|P_{l,m}\|_{\mathrm{spectral}} \le 1$. Note also that $P_{m,m} = I_{H \times H}$ for each $m$. In order to bound $\|P\|_{\mathrm{spectral}}$ we first note that $P$ is a real symmetric matrix, so the spectral norm is equal to the modulus of the greatest eigenvalue. Take an eigenvector $v = (v_1^T, \dots, v_M^T)^T$ of $P$ with eigenvalue $\rho$, so that for each $m$ we have $\rho \cdot v_m = \sum_{q=1}^M P_{m,q} v_q$. Hence, using the fact that $\|P_{m,q}\|_{\mathrm{spectral}} \le 1$, we have $|\rho| \cdot \|v_m\| \le \sum_{q=1}^M \|v_q\|$. Summing over $m$ gives $|\rho| \le M$. This holds for all eigenvalues $\rho$. Hence $\|P\|_{\mathrm{spectral}} \le M$.
Lemma 14. We have $\mathrm{trace}(P) = H \cdot M$ and $\mathrm{rank}(P) = \mathrm{rank} \langle \phi, \phi \rangle_{\mathcal{D}}$.
Proof. We begin by showing that $\mathrm{trace}(P) = HM$. As noted in the proof of Lemma 13 we have $P_{m,m} = I_{H \times H}$ for each $m$, so $\mathrm{trace}(P) = \sum_{m=1}^M \mathrm{trace}(I_{H \times H}) = HM$. Secondly, we show that $\mathrm{rank}(P) = \mathrm{rank} \langle \phi, \phi \rangle_{\mathcal{D}}$. For each $m \in \{1, \dots, M\}$, $\langle \phi_m, \phi_m \rangle_{\mathcal{D}}$ is of full rank $H$. Hence, $\langle \phi, \phi \rangle^{\mathrm{diag}}_{\mathcal{D}} = \mathrm{diag} \left( \langle \phi_1, \phi_1 \rangle_{\mathcal{D}}, \dots, \langle \phi_M, \phi_M \rangle_{\mathcal{D}} \right)$ is of full rank $HM$. Thus, by the definition of $P$ we see that $\mathrm{rank}(P) = \mathrm{rank} \langle \phi, \phi \rangle_{\mathcal{D}}$.
Given a matrix $A$ we let $\mathrm{spec}(A)$ denote its spectrum, i.e. its eigenvalues.
Proposition 15. The non-zero eigenvalues of $S_\lambda(X)$ are precisely the values $\rho / \left( M \cdot (1-\lambda) + \lambda \cdot \rho \right)$ for non-zero $\rho \in \mathrm{spec}(P)$.
Proof. Taking a singular value decomposition $U \Sigma V^T$ of the suitably normalised stacked feature matrix, it follows from the above that $P = U \Sigma^2 U^T$. Hence, the non-zero eigenvalues of $S_\lambda(X)$ are all of the form $s^2 / \left( M \cdot (1-\lambda) + \lambda \cdot s^2 \right)$, where $s$ is a diagonal element of $\Sigma$. Moreover, since $P = U \Sigma^2 U^T$, the squared diagonal elements of $\Sigma$ are precisely the eigenvalues of $P$, i.e. $\rho = s^2$. Hence, the proposition holds.
Theorem 5. Suppose that each function $f_m$ is a linear map of a fixed basis function. Given any $\lambda \in [0, 1]$, the degrees of freedom for $F_\lambda$ is given by
$$\mathrm{df}(F_\lambda, \mathcal{D}) = \sum_{q=1}^Q \frac{\rho_q}{M(1-\lambda) + \lambda \rho_q},$$
where $\{\rho_q\}_{q=1}^Q$ are the eigenvalues of $P$. In particular, we have the following behaviour:
• The function $\lambda \mapsto \mathrm{df}(F_\lambda, \mathcal{D})$ is continuous, increasing and convex;
• The function $\lambda \mapsto R_{\mathrm{emp}}(F_\lambda, \mathcal{D})$ is continuous and decreasing.
Proof of Theorem 5. The equation for $\mathrm{df}(F_\lambda, \mathcal{D})$ follows immediately from Corollary 11 combined with Proposition 12, along with the fact that the trace is invariant to cyclic permutations. Moreover, letting $\{\rho_q\}_{q=1}^Q$ denote the eigenvalues of $P$, by Proposition 15 the eigenvalues of $S_\lambda(X)$ are given by $\rho_q / \left( M(1-\lambda) + \lambda \rho_q \right)$. Hence, we have
$$\mathrm{df}(F_\lambda, \mathcal{D}) = \sum_{q=1}^Q \frac{\rho_q}{M(1-\lambda) + \lambda \rho_q}.$$
The continuity of $\mathrm{df}(F_\lambda, \mathcal{D})$ is immediate. Taking first and second derivatives with respect to $\lambda$ we have
$$\frac{\partial}{\partial \lambda} \mathrm{df}(F_\lambda, \mathcal{D}) = \sum_{q=1}^Q \frac{\rho_q (M - \rho_q)}{\left( M(1-\lambda) + \lambda \rho_q \right)^2}, \qquad \frac{\partial^2}{\partial \lambda^2} \mathrm{df}(F_\lambda, \mathcal{D}) = \sum_{q=1}^Q \frac{2 \rho_q (M - \rho_q)^2}{\left( M(1-\lambda) + \lambda \rho_q \right)^3}.$$
By Lemma 13, $\rho_q \le M$ for each $q$, so the first and second derivatives are non-negative, i.e. $\lambda \mapsto \mathrm{df}(F_\lambda, \mathcal{D})$ is monotonically increasing and convex. Moreover, by Lemma 14, we have $\sum_{q=1}^Q \rho_q = \mathrm{trace}(P) = H \cdot M$. Hence, if $H < \mathrm{rank} \langle \phi, \phi \rangle_{\mathcal{D}}$ then the number of positive eigenvalues $\rho_q$ must exceed $H$, so there must exist at least one $q$ with $\rho_q \in (0, M)$. Hence the first and second derivatives of $\lambda \mapsto \mathrm{df}(F_\lambda, \mathcal{D})$ must be strictly positive, which implies that the function is strictly increasing and strictly convex.
To prove the results regarding $R_{\mathrm{emp}}(\mu, \mathcal{D})$ we apply Lemma 12 and write
$$R_{\mathrm{emp}}(\mu, \mathcal{D}) = \frac{1}{N} \left\| y - S_\lambda(X) \, y \right\|^2.$$
Applying Proposition 15 we let $\{v_q\}_{q=1}^Q$ denote the eigenvectors corresponding to the eigenvalues $\left\{ \rho_q / \left( M(1-\lambda) + \lambda \rho_q \right) \right\}_{q=1}^Q$ of $S_\lambda(X)$. We may write $y$ as $y = \sum_{q=1}^Q \langle y, v_q \rangle v_q + y^\perp$, where $y^\perp$ is orthogonal to each $v_q$. Hence,
$$y - S_\lambda(X) \, y = \sum_{q=1}^Q \left( 1 - \frac{\rho_q}{M(1-\lambda) + \lambda \rho_q} \right) \langle y, v_q \rangle v_q + y^\perp.$$
Hence, we have
$$R_{\mathrm{emp}}(\mu, \mathcal{D}) = \frac{1}{N} \left[ \sum_{q=1}^Q \left( \frac{M(1-\lambda)}{M(1-\lambda) + \lambda \rho_q} \right)^2 \langle y, v_q \rangle^2 + \left\| y^\perp \right\|^2 \right].$$
Each summand is non-increasing in $\lambda$. Thus, $R_{\mathrm{emp}}(\mu, \mathcal{D})$ is monotonically decreasing with $\lambda$.
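The degrees-of-freedom curve of Theorem 5 is easy to evaluate numerically. The sketch below builds the normalised matrix $P$ for randomly generated bases, evaluates $\mathrm{df}(F_\lambda, \mathcal{D}) = \sum_q \rho_q / (M(1-\lambda) + \lambda \rho_q)$ on a grid, and checks monotonicity and convexity by finite differences; the endpoint values $\mathrm{df}(F_0) = H$ and $\mathrm{df}(F_1) = \mathrm{rank}(P)$ follow from Lemma 14. Function names are illustrative.

```python
import numpy as np

def ncl_df(lam, rho, M):
    """Theorem 5: df(F_lam) = sum_q rho_q / (M * (1 - lam) + lam * rho_q)."""
    rho = np.asarray(rho, dtype=float)
    return float(np.sum(rho / (M * (1 - lam) + lam * rho)))

def inv_sqrt(A):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(A)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

rng = np.random.default_rng(0)
N, H, M = 30, 4, 3
Phis = [rng.standard_normal((H, N)) for _ in range(M)]
# P = B B^T for the row-whitened stacked bases B, so that each diagonal
# block of P is the H x H identity, as in the proof of Lemma 13.
B = np.vstack([inv_sqrt(Pm @ Pm.T) @ Pm for Pm in Phis])
rho = np.linalg.eigvalsh(B @ B.T)            # eigenvalues of P

lams = np.linspace(0.0, 1.0, 101)
df = np.array([ncl_df(l, rho, M) for l in lams])
# df(0) = trace(P)/M = H; df(1) = rank(P); increasing and convex in between.
```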
Appendix C. Proof of Theorem 6
Theorem 6. Suppose that each function $f_m$ is a linear map of a fixed basis function and let $\sigma^2 = \mathbb{E}[\epsilon_n^2]$ denote the noise variance. Then, whenever $\sigma > 0$, the expected risk $\mathbb{E}[R_{\mathrm{true}}(\mu)]$ does not attain its minimum at $\lambda = 1$.
Proof of Theorem 6. In the proof of Theorem 5 we saw that
$$\frac{\partial R_{\mathrm{emp}}(\mu, \mathcal{D})}{\partial \lambda} = -\frac{1}{N} \sum_{q=1}^Q \frac{2 M^2 (1-\lambda) \rho_q}{\left( M(1-\lambda) + \lambda \rho_q \right)^3} \langle y, v_q \rangle^2.$$
Thus, for all data sets $\mathcal{D}$, at $\lambda = 1$ we have $\partial R_{\mathrm{emp}}(\mu, \mathcal{D}) / \partial \lambda = 0$, so at $\lambda = 1$ we have $\mathbb{E}\left[ \partial R_{\mathrm{emp}}(\mu, \mathcal{D}) / \partial \lambda \right] = 0$. On the other hand, we also saw in the proof of Theorem 5 that
$$\frac{\partial}{\partial \lambda} \mathrm{df}(F_\lambda, \mathcal{D}) = \sum_{q=1}^Q \frac{\rho_q (M - \rho_q)}{\left( M(1-\lambda) + \lambda \rho_q \right)^2}.$$
Combining these facts with the identity $\mathbb{E}\left[ R_{\mathrm{true}}(\mu) \right] = \mathbb{E}\left[ R_{\mathrm{emp}}(\mu, \mathcal{D}) \right] + (2\sigma^2 / N) \cdot \mathrm{df}(F_\lambda, \mathcal{D})$, at $\lambda = 1$ we have
$$\frac{\partial}{\partial \lambda} \mathbb{E}\left[ R_{\mathrm{true}}(\mu) \right] = \frac{2\sigma^2}{N} \sum_{\rho_q > 0} \frac{M - \rho_q}{\rho_q}.$$
Thus, whenever $\sigma > 0$, at $\lambda = 1$ the derivative is positive, so $\mathbb{E}\left[ R_{\mathrm{true}}(\mu) \right]$ does not attain its minimum at $\lambda = 1$.

Appendix D. Connection to Tikhonov regularisation
Proof. First write $\psi(x) = (1/M) \cdot \left( \psi_1(x)^T, \dots, \psi_M(x)^T \right)^T$ and let $w = \left( w_1^T, \dots, w_M^T \right)^T$. Hence, we may write $G^{\mathrm{Tik}}_\gamma(x) = w^T \psi(x)$. Moreover, the Tikhonov loss may be written as a squared error plus a squared-norm penalty on $w$. Thus, minimising $L^{\mathrm{Tik}}_\gamma(G, x, y)$ corresponds to ridge regression on the features $\psi(x)$. Hence, if we define $\psi(\mathcal{D}) = \left( \psi(x_1), \dots, \psi(x_N) \right)$, then we can minimise $\frac{1}{N} \sum_{n=1}^N L^{\mathrm{Tik}}_\gamma(G, x_n, y_n)$ by taking the standard ridge regression solution for $w$. Now by the definition of $\psi_m$ and $\psi$, we may write $\psi$ in terms of the basis functions $\phi_m$. Substituting these formulas into the above expression for $G^{\mathrm{Tik}}_\gamma(x)$ and rearranging, we find by Theorem 4 that if we take $\gamma = (1-\lambda)/(M^2 \cdot \lambda)$ then $G^{\mathrm{Tik}}_\gamma(x) = \lambda \cdot F_\lambda(x)$. This completes the proof of the theorem.
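The Tikhonov connection can be verified numerically. The sketch below uses whitened bases (so that each member's empirical Gram matrix is the identity, mirroring the features $\psi_m$ in the proof) and checks that $\lambda \cdot S_\lambda$ coincides with a ridge hat matrix on the stacked features. The penalty constant $M(1-\lambda)/\lambda$ used here corresponds to $\gamma = (1-\lambda)/(M^2 \lambda)$ under my assumed normalisation of the loss, which may differ from the paper's conventions by a constant factor.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H, M, lam = 30, 4, 3, 0.4
Q = H * M

def whiten(Pm):
    """Rescale a basis so its empirical Gram matrix Pm Pm^T is the identity."""
    vals, vecs = np.linalg.eigh(Pm @ Pm.T)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T @ Pm

Phis = [whiten(rng.standard_normal((H, N))) for _ in range(M)]
Phi = np.vstack(Phis)   # Q x N stacked (whitened) basis evaluations

# NCL smoother from the stationarity conditions; with whitened bases the
# block-diagonal Gram term reduces to the identity matrix.
S_ncl = Phi.T @ np.linalg.solve(
    (1 - lam) * np.eye(Q) + (lam / M) * (Phi @ Phi.T), Phi) / M

# Tikhonov (ridge) hat matrix on the same stacked features.
c = M * (1 - lam) / lam
S_ridge = Phi.T @ np.linalg.solve(Phi @ Phi.T + c * np.eye(Q), Phi)

# lam * S_ncl equals the ridge smoother: diversity acts as inverse regularisation.
match = np.allclose(lam * S_ncl, S_ridge)
```

Note that as $\lambda \to 1$ the effective penalty $c \to 0$, recovering the unregularised fit, while $\lambda \to 0$ corresponds to ever-stronger regularisation.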

Appendix E. Data sets
In this section we give a brief description of all the data sets used in our experiments. The features and target variables of each data set are preprocessed by subtracting the mean and dividing by the standard deviation. Table E.2 gives the dimensions of our data sets.
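The preprocessing step described above amounts to column-wise standardisation; a minimal sketch follows (the paper does not specify its exact implementation, so this is illustrative).

```python
import numpy as np

def standardise(A):
    """Subtract each column's mean and divide by its standard deviation."""
    A = np.asarray(A, dtype=float)
    return (A - A.mean(axis=0)) / A.std(axis=0)

# Example: standardise a synthetic feature matrix with non-zero mean and scale.
X = standardise(np.random.default_rng(0).standard_normal((100, 8)) * 5 + 2)
```

In practice the means and standard deviations would be computed on the training split only and then applied to the test split.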

Kinematics
This data set corresponds to a realistic simulation of the forward dynamics of a robot arm. The task is to predict the distance of the end-effector from a target based on a collection of continuous features describing the robot arm such as joint positions and twist angles. There are eight thousand one hundred and ninety two examples and eight features. The data set may be found at [45].

Machine
The task in this data set is to predict relative CPU performance based upon features of the CPU: machine cycle time, minimum main memory, maximum main memory, cache memory, minimum channels and maximum channels. The data set is originally from the UCI Machine Learning Repository [31]. We use the version from [45]. There are two hundred and nine examples and six features.

California
The task in this data set is to predict house prices based on features such as the median income of occupants, house age, number of rooms and number of bedrooms. The data set consists of twenty thousand, six hundred and forty examples and eight features. It may be found at [45] and is originally from [34].

Wisconsin
For this regression data set the task is to predict the time until recurrence of breast cancer based on features of the tumor. The data set consists of one hundred and ninety four examples and thirty two features and may be found at [45].

Housing
The task in this data set is to predict house values in various areas of Boston based upon several relevant features, such as the local crime rate and an index of accessibility. There are five hundred and six examples with thirteen features. For more details see [45].

Triazines
In this data set the task is to predict the activity of triazines based on a large number of descriptive structural attributes. There are one hundred and eighty six examples and sixty features. For more details see [45].

Ailerons
In this data set the task is to predict the control action on the ailerons of an F16 aircraft required to achieve a certain status. There are thirteen thousand, seven hundred and fifty examples and forty features. For more details see [45].

Bank
This data set is based upon a simulation of customers' behaviour at banks. The task is to predict the proportion of customers who are turned away from the bank because all the open tellers have full queues. There are eight thousand, one hundred and ninety two examples and nine features. The full data set may be found at [45, bank-8fm].

Elevators
The 'Elevators' data set is closely related to 'Ailerons' and also concerns the control of an F16 aircraft. However, in this case the task is to predict the required position of the aircraft's elevators based upon a different set of features. There are sixteen thousand, five hundred and fifty nine examples and eighteen features. The data set is available at [45].

Abalone
In this data set the task is to predict the age of abalone based upon a set of accessible physical features. There are four thousand, one hundred and seventy seven examples and eight features. The data set is available at [45].

Energy
The task in this data set is to estimate the heating and cooling load of a building based upon building parameters such as glazing area, roof area and overall height. The data set consists of seven hundred and sixty eight examples, eight features and two output variables. For further details see the original paper [46] and [42]. The data set may be downloaded from [20, Enb].

Andromeda
In this data set the task is to predict six water quality variables (temperature, pH, conductivity, salinity, oxygen, turbidity) six days ahead, based on the values of these variables over the previous five days. The data set consists of forty nine examples, thirty features and six output variables. For further details and the data set see [22], [42] and [20, Andro].

Water quality
In the water quality data set (Water) the task is to predict the representation of various plant and animal species in Slovenian rivers based upon several physical and chemical water quality parameters. For details see [13] and [42]. The data set consists of one thousand and sixty examples, sixteen features and fourteen target variables. For further details and the data set see [20,Wq].

Occupational Employment Survey
The Occupational Employment Survey data sets (OES97 & OES10) are based upon survey data from the 1997 and 2010 Occupational Employment Surveys, respectively, compiled by the US Bureau of Labor Statistics. OES97 and OES10 are multi-target regression problems in which the task is to estimate the number of full-time employees across many employment types for a given city, based upon the estimated number of full-time equivalent employees in a wide range of other employment types for that city. Both data sets were first used for the purpose of multi-target regression in [42]. OES97 consists of three hundred and sixty four examples, two hundred and sixty three features and sixteen target variables. OES10 consists of four hundred and three examples, two hundred and ninety eight features and sixteen target variables. Note that missing values in both the input and the target variables were replaced by sample means for these results. For further details and the data sets see [42] and [20, Oes97 & Oes10].