Uncertainty Quantification in Extreme Learning Machine: Analytical Developments, Variance Estimates and Confidence Intervals

Uncertainty quantification is crucial to assess the prediction quality of a machine learning model. In the case of Extreme Learning Machines (ELM), most methods proposed in the literature make strong assumptions on the data, ignore the randomness of the input weights, or neglect the bias contribution in confidence interval estimations. This paper presents novel estimations that overcome these constraints and improve the understanding of ELM variability. Analytical derivations are provided under general assumptions, supporting the identification and interpretation of the contributions of different variability sources. Under both homoskedasticity and heteroskedasticity, several variance estimates are proposed, investigated, and numerically tested, showing their effectiveness in replicating the expected variance behaviours. Finally, the feasibility of confidence interval estimation is discussed by adopting a critical approach, hence raising the awareness of ELM users concerning some of its pitfalls. The paper is accompanied by a scikit-learn compatible Python library enabling efficient computation of all estimates discussed herein.


Introduction
Statistical accuracy measures such as variance, standard error and Confidence Intervals (CI) are crucial to assess the quality of a prediction. Model uncertainty quantification is needed to build CI and has a direct impact on the prediction interval, especially when dealing with small datasets [1]. Uncertainty quantities for Feed-forward Neural Networks (FNN) solving regression tasks can be obtained by means of different methods [2,3]. Here, these quantities are investigated in relation to the use of the Extreme Learning Machine (ELM) model [4]. ELM is a single-layer FNN with random input weights and biases, which allows the output weights to be optimized through the Least Squares (LS) procedure. One can think of ELM as a projection of the inputs into a random feature space where a Multiple Linear Regression (MLR) with a null intercept is performed.
Three main uncertainty sources can be distinguished [5]. A first one comes from the data, and in particular from sampling variation and unexplained fluctuations, or noise. A second uncertainty source is related to the estimation of the model parameters, which in the case of an FNN correspond to the weights and biases [3]. In ELM, the input weights and biases are randomly chosen, which clearly generates uncertainty. Moreover, despite being optimized through a procedure with a unique solution, the estimation of the output weights depends on the random input weights and on the data, which therefore induces additional fluctuations. Finally, a third type of uncertainty source is due to the model structure. This source, generally referred to as structural uncertainty, is not considered in this paper.
A number of methods have been proposed to obtain confidence or prediction intervals with ELM. A Bayesian formulation was introduced to integrate prior knowledge and directly produce CI [6,7]. In the frequentist paradigm, bootstrap methods were investigated in the context of time series [8]. Akusok et al. proposed a method to estimate prediction intervals using a covariance matrix estimate coming from MLR [9].
Most of these methods make a Gaussian assumption on the output distribution or do not consider the bias in interval estimation, which may lead to misleading conclusions. Moreover, resampling methods imply an important computational burden when the number of data points is high. Finally, it is often argued that the randomness of the input weights and biases is negligible provided the training set is large enough. However, it is not always clear how much data is needed in practice. Indeed, while stochastic input layer initialization can have a weak impact in low dimension, it is still unclear what could happen when the number of features and/or neurons is large. Because of the curse of dimensionality, the random drawing of the input weights and biases could have a higher impact than suspected. To further investigate such impact, the development of ELM variance estimation methods taking its stochastic nature into account is therefore extremely relevant. Additionally, ELM is also used efficiently with small training datasets [10] - in which case a precise variance estimate is crucial - where the randomness of the input weights and biases should not be ignored.
The contribution of this paper is threefold. First, analytical developments are proposed to derive the ELM variance, taking into account also the contribution induced by the random input weights and biases. This is done without any other assumption on the noise distribution than that it is centered and has a finite variance. In particular, the presented theoretical results hold for dependent and non-identically distributed data. Second, homoskedastic and heteroskedastic variance estimates are provided, and some of their properties are investigated. While it may be argued that the homoskedastic case could be unrealistic, its study is of great interest as it provides insightful propaedeutic value and develops the intuition for more advanced situations. Moreover, in applications with a small number of data points, the homoskedasticity assumption may yield better results. Third, the paper proposes empirical bases to move towards CI estimations which include the variability induced by the random input weights and biases. Their discussion will also raise the awareness of ELM users about some pitfalls of confidence interval estimation. Overall, the results presented in this paper are expected to clarify the impact of input weight variability and noise, hence increasing the understanding of ELM variability.
The remainder of the paper is organized as follows. Starting from general assumptions on the noise covariance matrix, probabilistic formulas are derived in section 2 for the predicted output variance knowing the training inputs, for a single (regularized) ELM and for ensembles of them. Based on these formulas, section 3 provides variance estimates when the noise is independent with constant variance (homoskedastic case) and with non-constant variance (heteroskedastic case), for which a Python implementation is available on GitHub; see the software availability at the end of the paper. The effectiveness of the proposed estimates is demonstrated through numerical experiments in section 4, where the estimation of CI is also discussed. Finally, section 5 concludes the paper.

Analytical developments
This section begins by recalling ELM theory and fixing notations. Subsequently, the bias and variance for a single ELM are derived. The results are then generalized to ELM ensemble. Finally, correlation between two ELMs is investigated.

Background and notations
Assume that an output variable y depends on d input variables x_1, ..., x_d through the relationship

y = f(x) + ε(x),

where x = (x_1, ..., x_d)^T ∈ R^d is the vector composed of the input variables, f is a function of x whose value represents the deterministic part of y, and ε(x) is a random noise depending on the input, representing the stochastic part of y. It is assumed that, whatever the value of x, the noise is centered and has a finite variance.
Let the training set D = {(x_i, y_i) : x_i ∈ R^d, y_i ∈ R}, i = 1, ..., n, be a sample from the joint distribution of (x, y). Given a new input point x_0 ∈ R^d, one wants to predict the corresponding output y_0. The value f(x_0) seems to be a good guess. However, the function f is unknown and needs to be approximated based on the sample D in order to provide an estimate of the prediction f(x_0).

Extreme Learning Machine
ELM is a single-layer FNN with a random initialization of the input weights w_j ∈ R^d and biases b_j ∈ R, for j = 1, ..., N, where N denotes the number of neurons of the hidden layer. All input weights and biases are independent and identically distributed (i.i.d.), and are generally sampled from a Gaussian or uniform distribution. They map the input space into a random feature space in a non-linear fashion by way of the non-linear feature mapping [11]

h(x) = (h_1(x), ..., h_N(x))^T, with h_j(x) = g(w_j^T x + b_j),

where h_j(x) is the output of the j-th hidden node, for j = 1, ..., N, and g is any infinitely differentiable activation function [12]. An N-hidden-neuron ELM can generate output functions of the form

f̂(x) = Σ_{j=1}^{N} β_j h_j(x) = h(x)^T β,

where β ∈ R^N is the vector of output weights that relate the hidden layer to the output node. The output weights are trained using the sample D and optimized with respect to the L_2 criterion through the following procedure. The n × N hidden layer matrix, denoted H and defined element-wise by H_ij = h_j(x_i), i = 1, ..., n, and j = 1, ..., N, is computed. Then the cost function

E(β) = ||y - Hβ||²_2

is minimized, where ||·||²_2 denotes the squared Euclidean norm and y = (y_1, ..., y_n)^T. This is exactly the LS procedure for a design matrix H [13,14]. If the matrix H^T H is of full rank, and hence invertible, the output weights are estimated as in classical linear regression, with the analytical solution of the minimization of E given by

β̂ = (H^T H)^{-1} H^T y,

where the matrix (H^T H)^{-1} H^T is the Moore-Penrose generalized inverse of the matrix H and will be denoted H† in the following. Thus, ELM can be thought of as an MLR with a null intercept, performed on regressors obtained by a random non-linear transformation of the input variables.
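As an illustration, the fit just described can be sketched in a few lines of NumPy. The helpers elm_fit and elm_predict below are hypothetical names, not the API of the paper's accompanying library; a sigmoid activation and uniform input weights are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, y, n_neurons=20, rng=rng):
    """Basic ELM: random input layer, least-squares output layer (sketch)."""
    d = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(d, n_neurons))   # random input weights
    b = rng.uniform(-1.0, 1.0, size=n_neurons)        # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))            # sigmoid feature map, n x N
    beta = np.linalg.pinv(H) @ y                      # Moore-Penrose LS solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Noise-free sine data: the LS fit should track f closely.
X = np.linspace(0, np.pi, 60).reshape(-1, 1)
y = np.sin(X).ravel()
W, b, beta = elm_fit(X, y)
y_hat = elm_predict(X, W, b, beta)
```

Using np.linalg.pinv rather than explicitly forming (H^T H)^{-1} also covers the rank-deficient case, in which the minimum-norm LS solution is returned.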
At a new point, the prediction is given by f̂(x_0) = h(x_0)^T β̂. In the remainder of the paper, all dependencies on x_0 will be dropped for convenience and the prediction will be noted f̂(x_0) = h^T β̂ = h^T H† y. The vector of model predictions at the training points will be noted f̂, defined element-wise by f̂_i = f̂(x_i). The (random) matrix of all input weights and biases will be denoted W.

Regularized Extreme Learning Machine
To avoid overfitting and reduce outlier effects, a regularized version of ELM was proposed [15]. Highly variable output weights due to multicollinearity among neurons can be stabilized with regularization too. As mentioned in [16], this model is basically a - potentially weighted - Tikhonov regularization, also known as ridge regression [17]. The output weights are optimized with respect to the cost function

E_α(β) = ||y - Hβ||²_2 + α ||β||²_2,

for some real number α > 0, sometimes called the Tikhonov factor, which penalizes large output weights. Noting I the identity matrix, the analytical solution of this optimization problem for a fixed α is given by

β̂ = (H^T H + αI)^{-1} H^T y,

thanks to the fact that the matrix (H^T H + αI) is always invertible, see [18]. To lighten the notation, the matrix H^α = (H^T H + αI)^{-1} H^T is defined. Remark that as α goes to zero, H^α goes to H† and the classical ELM is recovered [15]. In the remainder of the paper, most results are presented with H^α, but they remain valid for the non-regularized case, unless the contrary is clearly specified.

Extreme Learning Machine Ensemble
Another way to avoid overfitting is to combine several ELM models. This also reduces the randomness induced by the input weight initialization, which could be beneficial - especially for small datasets. Several ensemble techniques have been developed for ELM [19,16]. In this paper, each model of a given ensemble will have the same activation function and number of neurons, and all models will be averaged after training. This corresponds to retraining the model M times and averaging the results, where M is the number of ELM networks in the ensemble. The hidden layer matrix and the matrix of input weights and biases of the m-th retraining will be noted H_m and W_m respectively, for m = 1, ..., M. If the m-th prediction is noted f̂_m(x_0), the final prediction is

f̂(x_0) = (1/M) Σ_{m=1}^{M} f̂_m(x_0) = (1/M) Σ_{m=1}^{M} h_m^T H^α_m y,

where h_m and H^α_m are the analogous quantities defined previously for the m-th model. Note that the weights have the same joint distribution across all the models, which allows us to drop the m index in most calculations in the remainder of this paper.
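The averaging scheme can be sketched as follows, with each member drawing its own input weights but sharing the training sample; fit_member and predict_member are illustrative helper names.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_member(X, y, n_neurons, rng):
    """One ensemble member: fresh random input layer, LS output layer."""
    W = rng.uniform(-1, 1, size=(X.shape[1], n_neurons))
    b = rng.uniform(-1, 1, size=n_neurons)
    H = sigmoid(X @ W + b)
    return W, b, np.linalg.pinv(H) @ y

def predict_member(X, member):
    W, b, beta = member
    return sigmoid(X @ W + b) @ beta

X = np.linspace(-1, 1, 40).reshape(-1, 1)
y = np.cos(2 * X).ravel() + rng.normal(0, 0.1, size=40)   # noisy sample

M = 10
members = [fit_member(X, y, 15, rng) for _ in range(M)]
preds = np.stack([predict_member(X, m) for m in members])  # shape (M, n)
ensemble_pred = preds.mean(axis=0)                         # averaged predictor
```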

Bias and variance for a single ELM
In this section, the uncertainty for (regularized) ELM is explored. For all derivations, the hyper-parameters N and α - when applicable - are considered as fixed and non-stochastic. Also, all formulas are derived knowing X, but this conditioning is dropped to avoid cumbersome notations. Note that f̂(x_0) is a random variable depending on the noise ε at the training points, but also on the input weights and biases W used in the construction of h and H^α. As the noise is centered,

E[f̂(x_0) | W] = h^T H^α f. (1)

Using the law of total expectation, one can compute the bias of the model at x_0,

E[f̂(x_0)] - f(x_0) = E[h^T H^α f] - f(x_0).

Let us now compute the variance of the model at a new point. First, one has

Var[f̂(x_0) | W] = h^T H^α Σ H^{αT} h, (2)

where Σ denotes the covariance matrix of the noise at the training points; this is the typical variance expression for MLR. With equations (1) and (2), the variance of the model at x_0 can be computed by using the law of total variance,

Var[f̂(x_0)] = E[h^T H^α Σ H^{αT} h] + Var[h^T H^α f]. (3)

The first term of the right-hand side (RHS) is the variance of the LS step averaged over all possible random feature spaces generated by the input weights and biases, while the second term is the variation of the LS step bias across all random feature spaces. Note that the second term appears if and only if the random input weights and biases are considered. In the non-regularized case with independent homoskedastic noise, if W and X are deterministic, the classical MLR formula for the variance at a prediction point is recovered, see [14].
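The law of total variance underlying this decomposition can be checked by simulation. The sketch below, assuming homoskedastic Gaussian noise and a sigmoid feature map, compares the direct Monte Carlo variance of f̂(x_0) with the sum of the average conditional variance and the variance of the conditional mean; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

n, N, sigma = 40, 10, 0.3
X = np.linspace(0, 1, n).reshape(-1, 1)
f_train = np.sin(2 * np.pi * X).ravel()       # true function at training points
x0 = np.array([[0.5]])                        # new prediction point

K_W, K_e = 300, 200                           # draws of W, draws of the noise
cond_means, cond_vars, samples = [], [], []
for _ in range(K_W):
    W = rng.uniform(-1, 1, size=(1, N))
    b = rng.uniform(-1, 1, size=N)
    H = sigmoid(X @ W + b)
    h0 = sigmoid(x0 @ W + b)
    a = (h0 @ np.linalg.pinv(H)).ravel()      # prediction weights h^T H^dagger
    cond_means.append(a @ f_train)            # E[f_hat(x0) | W]
    cond_vars.append(sigma**2 * a @ a)        # Var[f_hat(x0) | W]
    eps = rng.normal(0, sigma, size=(K_e, n))
    samples.append((f_train + eps) @ a)       # direct draws of f_hat(x0)

samples = np.stack(samples)                   # (K_W, K_e)
total_var = np.var(samples)                   # total variance of f_hat(x0)
within = np.var(samples, axis=1).mean()       # ~ E[ Var(f_hat | W) ]
between = np.var(samples.mean(axis=1))        # ~ Var( E[f_hat | W] )
```

The within/between split of the simulated draws matches the two RHS terms of the decomposition, up to Monte Carlo error.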

Bias and variance for ELM ensemble
As mentioned before, the training could be done several times and averaged. A direct calculation - which can be found in the appendix - gives the bias and variance of the averaged predictor. Basically, it uses the law of total variance and elementary probability calculus, from which one gets

Var[f̂(x_0)] = (1/M) E[h^T H^α Σ H^{αT} h] + ((M-1)/M) E[h^T H^α] Σ E[H^{αT} h] + (1/M) Var[h^T H^α f], (4)

while the bias is unchanged. The RHS first and third terms are the single ELM variance terms divided by the number of models. The bias variation of the LS step is reduced by a 1/M factor. Although the average variance of the LS step represented by the RHS first term seems to decrease by a 1/M factor, the models are pairwise dependent, which yields the RHS second term. Notice that if M = 1, equation (3) is recovered. As M grows, the RHS second term tends to dominate the model variance. Remark also that, using the law of total covariance, it is easily checked by an analogous computation that the covariance between two members of an ELM ensemble corresponds to E[h^T H^α] Σ E[H^{αT} h].

Use of random variable quadratic forms
The formulation of the variance in equations (3) and (4) is convenient for the interpretation of ELM as an MLR on random features. However, quadratic forms in random variables appear in these formulas, which allows the calculations to be pursued further. By Corollary 3.2b.1 of [20], the expectation of a quadratic form in random variables can be computed as the quadratic form in the expected values plus the trace of the covariance matrix times the matrix of the quadratic form, i.e. E[z^T A z] = E[z]^T A E[z] + Tr[A Cov(z)]. This Corollary will be used extensively in this paper, each time the expectation of a quadratic form in random variables appears.
Setting z = H^{αT} h and assuming the existence of its expectation µ and covariance matrix C, equation (3) becomes

Var[f̂(x_0)] = µ^T Σ µ + Tr[ΣC] + f^T C f, (5)

where Tr[·] denotes the trace of a square matrix. Although the notation does not specify it, the quantities z, µ and C depend on the Tikhonov factor α in the regularized case. Similarly, z_m = H^{αT}_m h_m is set for ELM ensembles and the variance (4) becomes

Var[f̂(x_0)] = µ^T Σ µ + (1/M) (Tr[ΣC] + f^T C f). (6)

Correlation between two ELMs
As the covariance between two single ELMs is µ^T Σ µ, their linear correlation at x_0 is given by

µ^T Σ µ / (µ^T Σ µ + Tr[ΣC] + f^T C f).

Remark that considering the input weights and biases as fixed is equivalent to ignoring Tr[ΣC] + f^T C f and to having a correlation of 1 between the two models. An interesting insight is provided by the case of independent and homoskedastic noise, i.e. Σ = σ²_ε I, which yields

σ²_ε µ^T µ / (σ²_ε (µ^T µ + Tr[C]) + f^T C f).

Notice that in this particular case, when σ²_ε is small the linear correlation between two ELMs vanishes. Contrariwise, when σ²_ε is large the linear correlation between two ELMs tends to b = µ^T µ / (µ^T µ + Tr[C]). Therefore, the amount of noise has a direct impact on the linear correlation, which takes its value between 0 and b ≤ 1. The trace of C can be interpreted as a variability measure of z, sometimes called the total variation or the total dispersion of z [21]. In our case, it controls the linear correlation bound b, in the sense that the more variable z is, the farther from 1 is the maximal value that the linear correlation can take, regardless of the noise in the data.
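The bound b and the monotone effect of the noise level can be illustrated numerically by sampling z = H†^T h over input-weight draws (non-regularized case, with µ and C replaced by their empirical counterparts; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

n, N = 40, 10
X = np.linspace(0, 1, n).reshape(-1, 1)
f_train = np.sin(2 * np.pi * X).ravel()
x0 = np.array([[0.5]])

# Sample z = H^{dagger T} h over many input-weight draws.
Z = []
for _ in range(500):
    W = rng.uniform(-1, 1, size=(1, N))
    b = rng.uniform(-1, 1, size=N)
    H = sigmoid(X @ W + b)
    h0 = sigmoid(x0 @ W + b).ravel()
    Z.append(np.linalg.pinv(H).T @ h0)
Z = np.stack(Z)

mu = Z.mean(axis=0)                    # empirical mu
C = np.cov(Z, rowvar=False)            # empirical C
A, trC = mu @ mu, np.trace(C)
D = f_train @ C @ f_train              # empirical f^T C f

def corr(sig2):
    """Correlation between two independent ELMs, homoskedastic noise sig2."""
    return sig2 * A / (sig2 * (A + trC) + D)

b_bound = A / (A + trC)                # upper bound b on the correlation
```

As derived above, corr is increasing in the noise level and saturates at b, which never reaches 1 when z is variable.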

Variance estimation of ELM ensemble
This section introduces novel estimates of the ELM variance. Although several ELMs are necessary to allow the estimation of quantities related to z - which motivates the use of ELM ensembles - reliable results are also obtained with very small ensembles. First, the variation of the LS step bias over all random feature spaces is estimated. Then, assuming noise independence, the variance of the LS step averaged over all possible random feature spaces is estimated under homoskedasticity and heteroskedasticity, for non-regularized and regularized ELM ensembles.

Estimation of the least squares bias variation
The quantity f^T C f = Var[h^T H^α f] does not depend on the noise. It is the variance induced by the randomness of W, knowing the true function f at the training points. Tentatively assume that the output weights are not regularized. As f is unknown, one approximates it, for each model of the ensemble, by the model prediction at the training points, f̂ = H H† y. This motivates the following estimate for f^T C f; dividing by M² shows that the estimate given in (7) has a bias equal to (1/M)Tr[C], which comes from the expected values of the M quadratic terms in z_m.
To remove this bias, one can estimate it by (8), which is an unbiased estimate of (1/M)Tr[C]; this immediately follows from the fact that Q is an unbiased estimate of C. Therefore, subtracting (8) from (7) yields an unbiased estimate of µ^T µ. Hence, the unbiased estimate of µ^T µ that was just developed results in (9). Substituting equation (7) into equation (9), the computation can be carried further, showing that the estimate (10) removes the quadratic terms m = l from which the bias of the naive estimate (7) originated. However, note that formulation (9) is more convenient to compute than (10) from an algorithmic perspective.

Noise estimation
The estimation of σ²_ε is separated into two cases, the non-regularized and the regularized one. A couple of notations are needed to make the equations readable. The residuals for the m-th model are r_m = y - P_m y, where P_m = H_m H^α_m is the hat matrix of the m-th model. Let us first concentrate on the non-regularized case. Then, P_m is a projection matrix. A natural way to obtain an estimate σ̂²_ε is to start with the expectation of the residual sum of squares (RSS) based on the averaged ensemble. However, mainly due to the fact that the expectation of a projection matrix is not a projection matrix, it is preferred here to work with the RSS of each model. Using this, the expectation of the RSS for the m-th ELM knowing the input weights and biases can be computed, and taking the expectation over the input weights and biases yields the average of all MLR estimates of σ²_ε. Its bias is directly obtained from the previous calculation, where the non-negativity results from the facts that a projection matrix is positive semidefinite and that the expectation of a positive semidefinite matrix is positive semidefinite. If regularized ELMs are used, P_m is no longer a projection matrix, and the expected RSS for each ELM of the ensemble knowing W_m becomes equation (14). Analogously to what is done in [22], the effective degrees of freedom for error can be defined as n - γ, with γ = 2 E[Tr[P]] - E[Tr[P²]]. Taking the expectation of equation (14) over the input weights and biases gives equation (15), which makes the squared bias and the total variation of the conditional bias appear. This motivates the corresponding estimate in the regularized case, whose bias is easy to check. Computationally, γ̂ can be efficiently calculated using the singular value decomposition of H_m. Indeed, it can be shown [17,23] that the traces of P_m and P²_m are given by equation (17), where λ_{m,i}, i = 1, ..., N, are the eigenvalues of H^T_m H_m. In particular, the substitution of (17) in (15) and elementary manipulations allow one to write σ̂²_ε in terms of the singular values of H_m.
Note that this latter expression also shows the drop in degrees of freedom due to regularization, compared to the non-regularized case.
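The trace identities in (17) and the resulting effective degrees of freedom are easy to verify numerically. The sketch below uses a generic matrix H as a stand-in for the hidden layer matrix, and gamma is the plug-in version of γ with the expectations over W dropped; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, N, alpha = 60, 15, 0.1
H = rng.normal(size=(n, N))                  # stand-in hidden layer matrix
y = H @ rng.normal(size=N) + rng.normal(size=n)

P = H @ np.linalg.solve(H.T @ H + alpha * np.eye(N), H.T)   # ridge "hat" matrix
lam = np.linalg.eigvalsh(H.T @ H)            # eigenvalues of H^T H

tr_P = np.sum(lam / (lam + alpha))           # equals Tr[P]
tr_P2 = np.sum((lam / (lam + alpha))**2)     # equals Tr[P^2]
gamma = 2 * tr_P - tr_P2                     # plug-in effective dof of the model

rss = np.sum((y - P @ y)**2)
sigma2_hat = rss / (n - gamma)               # noise variance estimate, one model
```

Since each eigenvalue ratio lies in (0, 1), gamma is strictly smaller than N, which is the drop in degrees of freedom induced by the regularization.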

Estimations of ELM ensemble variance
In [24], the authors proposed - only for the non-regularized case - a naive homoskedastic estimate σ̂²_NHo of the variance of f̂. This naive estimate directly uses equation (7) to approximate µ^T µ, without considering its bias. However, a bias-reduced estimate σ̂²_BR is obtained by estimating µ^T µ with equation (10). Using the covariance definition, it is easy to see that the bias of σ̂²_BR decomposes as in (18), where the bias of σ̂²_ε is given by (13) or (16). Note that in both cases, while the first term of the RHS of equation (18) is always non-negative, the estimates of σ²_ε and µ^T µ could be correlated, introducing the second term. However, this supplementary bias could be negative, potentially compensating the first term. Its magnitude is bounded in the non-regularized case - see the appendix - showing that this covariance term vanishes for large M. Remark also that the bias of σ̂²_NHo has an additional non-negative term, (1/M)Tr[C] E[σ̂²_ε], which disappears in (18) thanks to the unbiasedness of the corrected estimate of µ^T µ, obtained by subtracting (1/M)Tr[Q].

Estimation under independence and heteroskedastic assumptions
Suppose the noise is independent but has a variance that depends on x in an unknown form. Then S = µ^T Σ µ has to be estimated, considering the noise covariance matrix Σ as diagonal. To this aim, it could be possible to reuse estimates from MLR. However, several estimates are based on the evaluation of the covariance matrix H^α_m Σ_m H^{αT}_m of the output weights β_m. In this paper, the modified heteroskedasticity-consistent covariance matrix estimator (HCCME) obtained from the (ordinary) Jackknife [25] - noted HC3 and extended to the ridge regression case [26] - is used, where r̃_m is the vector defined element-wise by r̃_{m,i} = r_{m,i}/(1 - p_{m,i}), p_{m,i} is the i-th diagonal element of P_m, and Ω_m is the diagonal matrix whose i-th diagonal element equals r̃²_{m,i}. This estimate is still valid in the non-regularized case, for which - under some technical assumptions - it is consistent [27,25], while Σ̂_m is an inconsistent estimator of Σ. Nevertheless, other HCCME estimates could be used, such as HC0 [27], HC1 obtained from the weighted Jackknife [28], HC2 proposed in [29], or HC4 proposed more recently in [30]. The HC notation follows what can be found in [13], which provides useful insight into this kind of estimator. Note that for sufficiently large n, HC3 is close to H^α_m Ω_m H^{αT}_m, which corresponds to the estimate used in [9] to build prediction intervals for large amounts of data, assuming fixed input weights.
If a single ELM model is used, the use of the HCCME is straightforward. Nonetheless, as one attempts to take into account the input weight variability through ELM replications, the HCCME is applied in three different ways. Suppose first that Σ is known and write the corresponding estimate. Finally, the proposed heteroskedastic estimates of the ELM ensemble variance, noted σ̂²_S1, σ̂²_S2, σ̂²_S3 and σ̂²_NHe, are obtained by respectively adding (1/M)σ̂²_f to S_1, S_2, S_3 and S_NHe. To increase computation speed, approximated versions of S_1, S_2, S_3 and S_NHe can be obtained by replacing Σ̂_m by Ω_m in the above reasoning. As a matter of fact, these two matrices are very close for sufficiently large n, but Ω_m is a diagonal matrix while Σ̂_m is a full matrix. Remark that the approximated version of σ̂²_NHe is exactly the heteroskedastic estimate proposed in [24].
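The approximated HCCME just mentioned (Ω_m in place of Σ̂_m) can be sketched as follows for a single model; the heteroskedastic noise law and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, N, alpha = 80, 12, 0.05
H = rng.normal(size=(n, N))                    # stand-in hidden layer matrix
beta_true = rng.normal(size=N)
# independent, heteroskedastic noise (illustrative law)
y = H @ beta_true + rng.normal(size=n) * (1 + np.abs(H[:, 0]))

H_alpha = np.linalg.solve(H.T @ H + alpha * np.eye(N), H.T)   # (N, n)
P = H @ H_alpha                                # ridge hat matrix
r = y - P @ y                                  # residuals
p = np.diag(P)                                 # leverages, all < 1 for ridge
r_tilde = r / (1 - p)                          # Jackknife-deflated residuals
Omega = np.diag(r_tilde**2)

cov_beta = H_alpha @ Omega @ H_alpha.T         # HCCME of the output weights
```

Being a sandwich of a non-negative diagonal matrix, the resulting covariance estimate is symmetric and positive semidefinite by construction.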

Synthetic experiments
This section discusses the results obtained over different experimental settings. First, a simple non-regularized homoskedastic one-dimensional experiment is conducted. The variance estimate is thoroughly examined and assessed with quantitative measures and visualizations. Subsequently, the results are generalized to multi-dimensional settings with homoskedastic or heteroskedastic noise, both for the regularized and non-regularized cases. Finally, CI estimation is discussed. All the experiments presented adopt the sigmoid as activation function, while the input weights and biases are always drawn uniformly between -1 and 1. All computations are done with the provided Python library; see the software availability at the end of the paper for more details.

One-dimensional case
To assess operationally the estimates proposed in section 3, a simple one-dimensional simulated case study with n = 60 training points is first proposed, where the training inputs are drawn from a trapeze-shaped probability density function, see Figure 1. In particular, the empirical variance of the 10 000 ensembles will provide a reliable baseline for the variance estimation assessment and will be referred to as the ground truth variance. Figure 1 (right) shows the mean of σ̂²_BR across the 1 000 experiments with a ±1.96 standard-error band. Compared with the ground truth variance, the proposed estimate effectively recovers - on average - the variance from the 10 000 simulations baseline. The increasing variance at the borders, due to the side effect of the modelling, is fairly replicated. The uncertainty due to the trapezoidal shape of the input data distribution is also captured. Qualitatively, all aspects of the expected variance behaviour are globally reproduced. The naive estimate σ̂²_NHo gives very similar results in one dimension and is not shown. The improvement due to considering the bias of µ^T µ will be more relevant in the multi-dimensional case. However, one can still observe a residual bias for σ̂²_BR, partially due to the bias of σ̂²_ε and to the dependence between σ̂²_ε and the estimate of µ^T µ - see equations (13) and (18). To assess quantitatively each estimation, a measure is needed between the true standard error of f̂(x) and its estimations provided by each repetition of the experiment. Following [2], three measures over the training set are considered, for k = 1, ..., 1 000: the median se_k of the k-th standard error estimate, the absolute error e_k of the k-th standard error estimate, and the relative error re_k of the k-th standard error estimate. Similar measures are defined on a random testing set of 1 000 points.
In order to compute these quantities, σ(x_i) is replaced by the ground truth standard deviation. The means and standard deviations of se_k, e_k and re_k over the 1 000 experiment repetitions are presented in Table 1. For the training set, the median of the ground truth standard error is recovered by the median of σ̂_BR, judging by the se_k measure. Moreover, the mean and standard deviation of e_k appear quite small. The relative errors allow a better interpretation by comparing point-wise the absolute error with the true standard error. For instance, for M = 10, the mean of re_k shows that - on average - the median error at training points represents 6.2% of the true standard error. The results on the 1 000 testing points are similar, which shows that the estimation is good both at testing and training points. Even for M = 5, all error measures are quite satisfactory, as are their standard deviations. These results are also compared with σ̂²_S3. As expected for a homoskedastic dataset, the estimate σ̂²_BR, based on the homoskedasticity assumption, always performs better than σ̂²_S3, which is based on the heteroskedasticity assumption. The same experiment is done with n = 100 and N = 5. The results, reported in Table 1, show that all results improve when increasing the number of data points, as expected.
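The ground-truth baseline used above can be emulated at a small scale: repeatedly redraw the noise and the input weights, refit the ensemble, and take the empirical variance of its predictions. The sketch below uses uniform inputs instead of the trapezoidal density and far fewer replications than the paper's 10 000; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def ensemble_predict(X_tr, y_tr, X_te, M, N, rng):
    """Average the predictions of M independently initialized ELMs."""
    preds = []
    for _ in range(M):
        W = rng.uniform(-1, 1, size=(X_tr.shape[1], N))
        b = rng.uniform(-1, 1, size=N)
        beta = np.linalg.pinv(sigmoid(X_tr @ W + b)) @ y_tr
        preds.append(sigmoid(X_te @ W + b) @ beta)
    return np.mean(preds, axis=0)

n, N, M, sigma = 60, 8, 5, 0.2
X_tr = np.sort(rng.uniform(0, np.pi, n)).reshape(-1, 1)
f_tr = np.sin(X_tr).ravel()
X_te = np.linspace(0, np.pi, 50).reshape(-1, 1)

# Empirical ("ground truth") variance: redraw noise and weights each time.
reps = np.stack([
    ensemble_predict(X_tr, f_tr + rng.normal(0, sigma, n), X_te, M, N, rng)
    for _ in range(200)
])
ground_truth_var = reps.var(axis=0)
```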

Multi-dimensional case
An example on a multi-dimensional case study is now investigated. Specifically, the synthetic dataset described by Friedman in [31] is considered, with inputs x = (x_1, x_2, x_3, x_4, x_5) drawn independently from the uniform distribution on the interval [0, 1] and outputs generated with an independent homoskedastic Gaussian noise according to

y = 10 sin(π x_1 x_2) + 20 (x_3 - 0.5)² + 10 x_4 + 5 x_5 + ε,

with noise variance σ²_ε = 0.5. A number of n = 500 training points are drawn. The number of neurons is chosen by a cross-validation process similar to the one described above for the one-dimensional case, and fixed to N = 91. Ensembles are fitted and the homoskedastic estimates σ̂²_BR and σ̂²_NHo are computed with M = 5, 10, 20, 100. This is repeated 1 000 times, while the ground truth mean and variance are computed based on 10 000 ensembles, as in the previous experiment. The same experiment is conducted with ensembles of regularized ELMs. The Tikhonov factor is selected with the help of generalized cross-validation [23,17], repeated on 1 000 dataset generations, and set to α = 6 · 10^{-6}. Regularized versions of σ̂²_BR and σ̂²_NHo are computed. Results of the regularized and non-regularized versions of the experiment are reported in Table 2. For both, the training se_k tends to slightly overestimate the true standard deviation median over the training points. The testing se_k has an analogous behaviour. Although the testing se_k tends globally to be greater than the training se_k, the testing e_k and re_k are similar to the training e_k and re_k. This suggests that, regardless of the fact that the prediction is more uncertain at testing points, the variance estimation works as well at testing points as at the training points, as in the one-dimensional experiment. Note also that the true standard deviation median decreases as M increases, as suggested by equation (5). For the non-regularized case, the bias-reduced estimate σ̂²_BR is systematically better than the σ̂²_NHo estimate.
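The Friedman benchmark can be generated directly; the sketch below assumes the standard form of the function from [31] (also available as make_friedman1 in scikit-learn) and uses illustrative names.

```python
import numpy as np

rng = np.random.default_rng(8)

def friedman(X):
    """Friedman #1 regression function (5 inputs on [0, 1])."""
    return (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20 * (X[:, 2] - 0.5)**2 + 10 * X[:, 3] + 5 * X[:, 4])

n = 500
X = rng.uniform(0, 1, size=(n, 5))
y = friedman(X) + rng.normal(0, np.sqrt(0.5), size=n)   # sigma_eps^2 = 0.5
```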
Recalling that σ̂²_BR reduces the bias by a quantity inversely proportional to M - see section 3.2 - one observes that, for e_k and re_k, the improvement over σ̂²_NHo decreases with M. The regularization mechanism increases the bias of the model while decreasing its variance, which explains the decrease of the true standard deviation median for a given M from the non-regularized to the regularized case. Moreover, for the regularized case, as the bias of the variance estimation depends directly on the conditional bias of the model, this could explain why the regularized experiment yields slightly weaker results in terms of e_k and re_k. However, observe that σ̂²_BR is still better than σ̂²_NHo. To illustrate the heteroskedastic case, the same experiment is conducted with a non-constant noise variance. The Gaussian noise ε(x) now depends on the input variables through its variance, defined in terms of the maximum norm ||·||_∞. The variance estimates σ̂²_S3, σ̂²_S2, σ̂²_NHe, σ̂²_S1 and σ̂²_BR are computed in their approximated versions 1 000 times, with n = 1000, M = 5, 10, 20, 100, and N = 109.
Although the results on the 5-dimensional hypercube input cannot be visualized, a small subset such as its diagonal can be plotted, see Figure 2; the diagonal distance from the origin on the x-axis is measured in the maximum norm. On the left, the prediction with ±1.96 σ̂_S2 is displayed for one experiment, for M = 5. The true noise variance σ²_ε(x) is also reported. On the right, results for σ̂²_S2, σ̂²_S1 and σ̂²_BR over the 1 000 experiments are shown. Results for σ̂²_S3 and σ̂²_NHe are visually close to σ̂²_S2 and are not reported. The heteroskedastic estimates reproduce fairly well the behaviour of the true variance, but σ̂²_S2 shows a smaller bias than σ̂²_S1 along the input diagonal. The homoskedastic estimate σ̂²_BR clearly fails to reproduce a coherent behaviour of the true variance, underestimating or overestimating it depending on the location.
Quantitative results are shown in Table 3. Again, the true standard deviation median decreases when M increases, and the training (testing) se_k tends to somewhat overestimate the true training (testing) standard deviation median. The homoskedastic estimate σ̂²_BR no longer gives the best results because of the heteroskedastic nature of the data, which justifies the use of heteroskedastic estimates. The estimate σ̂²_S1 - which reuses the HCCME in its original form - gives the worst results, as suspected in section 3.3. The naive heteroskedastic estimate σ̂²_NHe gives in general reasonable results. However, the heteroskedastic estimate σ̂²_S2 - which was developed based on the insight given by the homoskedastic case in section 3.2 - improves the results by a quantity decreasing with M, as expected. Finally, the results from σ̂²_S3 are very close to those of σ̂²_S2. Regularized versions of the heteroskedastic estimates were also investigated, and σ̂²_S2 and σ̂²_S3 gave quite reasonable results too.

Towards confidence intervals
Although it is visually tempting to say so, there is so far no guarantee that f̂(x) ± 1.96 σ̂_BR(x) in Figure 1 (left) or f̂(x) ± 1.96 σ̂_S2(x) in Figure 2 (left) define a 95% CI for f(x). Let us investigate the possibility of building (approximate) CI in particular cases. The distribution of g(x) = (f̂(x) − f(x)) / sd[f̂(x)] is unknown. However, for our simulated case studies, it is remarkably close to a Gaussian distribution. Kernel density estimates, based on the 10,000 replications done in the previous sections, are shown in Figure 3 at some example points at which f̂(x) exhibits significant bias. Figure 3 (left) displays the distribution of g(x_0) at x_0 = π/4 for the one-dimensional case. Figure 3 (middle) visualizes the distribution of g(x_0) at x_0 = 1/2 · (1, 1, 1, 1, 1) for the Friedman dataset with homoskedastic noise, which corresponds to the center of the 5-dimensional hypercube and the middle point of the input diagonal. The distribution of g(x_0) behaves similarly for the Friedman dataset with heteroskedastic noise (not shown), and other experiments conducted with Student-distributed noise showed the same behaviour. This suggests that for ELM ensembles, g(x) and f̂(x) may asymptotically follow a Gaussian distribution. However, dependencies exist between the components of z_m, and also between the members of the ELM ensemble. Therefore, the classical central limit theorem is not directly applicable, and it seems hard to straightforwardly conclude Gaussianity for large sample size n or large ensemble size M.

Figure 2 (right): the mean with ±1.96 standard-error bands for σ̂²_S2 (green), σ̂²_S1 (red) and σ̂²_BR (purple), based on 1,000 replications of the experiment; the black dashed line is the variance computed from the 10,000 ensembles, taken as ground truth; the diagonal distance from the origin on the x-axis is measured in maximum norm.
Regardless of whether this conclusion can be proved or disproved, let us assume that the distribution of g(x) is (asymptotically) Gaussian in the remainder of this section. By construction, g(x) has unit variance but it is not centred, due to the bias of f̂(x). Its mean, given by E[g(x)] = (E[f̂(x)] − f(x)) / sd[f̂(x)], is reported in Figure 3 as a vertical dashed black line. Obviously, this quantity is unknown in practice but necessary to build a reliable CI for f(x). However, if the bias of f̂(x) is negligible relatively to its variance, then g(x) is close to centred, and approximate point-wise CI can be derived based solely on an estimation of the variance of f̂(x). That is, if g(x) is close to centred, the estimated ±1.96 standard-error band around f̂(x) defines an approximate point-wise 95% CI for f(x). In regions where the bias is low relatively to the variance (see Figure 1, left), the proportion of replications for which the theoretical CI f̂(x) ± 1.96 sd[f̂(x)] contains f(x) comes closer to the true coverage probability of 0.95 defined by the confidence level. Conversely, for instance at x_0 = π/4, which is near the point with the smallest variance (see Figure 1, right), the bias is high relatively to the variance; g(x_0) is then far from centred, which invalidates the construction of the CI and leads to a poor result. Obviously, sd[f̂(x)] is not available in practice, and looking at the actual coverage probability of the estimated CI f̂(x) ± 1.96 σ̂_BR(x) is more interesting; note that it reproduces quite faithfully the same behaviour, as expected.
The non-regularized multi-dimensional experiment for M = 5 is also displayed in Figure 4, for the homoskedastic case estimated with f̂(x) ± 1.96 σ̂_BR(x) (middle), and for the heteroskedastic case estimated with f̂(x) ± 1.96 σ̂_S2(x) (right). For both, the actual coverage probability is globally greater than the proportion based on the theoretical CI built with the true sd[f̂(x)]. This is partially explained by the overestimation of the f̂(x) standard deviation (see Section 4.2). In some low-bias regions, this results in slightly conservative CI, i.e. the actual coverage probability is greater than the nominal coverage probability of 95%. Observe that around x_0 = 1/2 · (1, 1, 1, 1, 1) the actual coverage probability is especially bad in the homoskedastic case, while it is quite reasonable in the heteroskedastic case. This is explained by the fact that the heteroskedastic noise variance around the center of the hypercube is up to five times larger than the homoskedastic variance, which implies an increase of the variance of f̂(x) and a decrease of E[g(x)] around x_0. Finally, the actual coverage probability is quite satisfying in the heteroskedastic case.
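The way a non-negligible bias degrades coverage can be reproduced with a toy simulation under the Gaussianity assumption of this section: if f̂(x_0) ~ N(f(x_0) + bias, sd²), the interval f̂(x_0) ± 1.96 sd covers f(x_0) with probability 0.95 only when the bias is negligible. The bias and sd values below are arbitrary, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def coverage(bias, sd, n_rep=100_000, z=1.96):
    """Empirical coverage of f_hat +/- z*sd when f_hat ~ N(f + bias, sd^2)."""
    f_true = 0.0
    f_hat = f_true + bias + sd * rng.normal(size=n_rep)
    lower, upper = f_hat - z * sd, f_hat + z * sd
    return np.mean((lower <= f_true) & (f_true <= upper))

# With negligible bias the interval reaches the nominal 95% level...
cov_unbiased = coverage(bias=0.0, sd=1.0)
# ...while a bias comparable to twice the standard deviation roughly
# halves the actual coverage (about 0.48 instead of 0.95).
cov_biased = coverage(bias=2.0, sd=1.0)
```

This mirrors the behaviour reported around x_0 = π/4 and x_0 = 1/2 · (1, 1, 1, 1, 1): the same interval construction is reliable or misleading depending on the local bias-to-variance ratio.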
Clearly, the effectiveness of the CI estimation for f(x) is highly dependent on the dataset at hand, and a significant bias of f̂(x) relatively to its variance can lead to highly permissive, unreliable CI for f(x). However, several potential paths to overcome this problem can be identified. Firstly, even if the bias of f̂(x) is too important to be ignored, the estimated ±1.96 standard-error bands around f̂(x) still provide a reliable CI for E[f̂(x)]. Secondly, the bias could be estimated. Thirdly, the bias can be reduced by smoothing f̂(x) slightly less than what would otherwise be appropriate [32], for instance through regularization. For the latter, an example is provided here for the multi-dimensional case with homoskedastic noise.
An ensemble of M = 5 regularized ELMs is trained with N = 300. Deliberately selecting a larger number of neurons increases the model complexity and thus reduces the model bias. But increased complexity also increases the model variability, putting the model in an overfitting regime that the regularization mechanism controls at the expense of an additional bias. The generalized cross-validation estimate yields a Tikhonov factor of 10^-4, which introduces too much bias. Hence, α = 10^-6 is set empirically to decrease the amount of smoothing and thereby alleviate the bias, at the expense of a higher variance. To measure the predictive performance of the model, the mean squared error (MSE) and the relative mean squared error (RE) are defined on the training set, and similar measures are defined on the testing set. Generally speaking, lower values of MSE and RE are better; a value of RE higher than 1 indicates that the model performs worse than the mean [33]. Note also that RE can be interpreted as an estimate of the ratio between the residual variance and the data variance. Table 4 shows the quantitative results of the regularized model with N = 300 compared to the non-regularized model trained previously with N = 91. On average over the 1,000 experiments, the testing MSE is slightly better for N = 91. However, the testing RE shows that in both cases it represents 2.8% of the data variance, and no significant difference is identifiable. As expected, the se_k of the true variance, and of its estimation, is greater for N = 300. The variance is better estimated, as shown by e_k and re_k. This is likely due to the reduction of the model bias, which probably implies a decrease of the bias of the noise estimation, and therefore of the bias of the variance estimates (see Section 3.2). Figure 3 … especially at x_0 = 1/2 · (1, 1, 1, 1, 1). Summarizing, while the CI are then globally correctly estimated, the predictive performance is almost the same.
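The two performance measures above can be written compactly; the sketch below assumes, consistently with the interpretation given in the text, that RE normalizes the model MSE by the MSE of the constant mean predictor (function names are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error on a given set."""
    return np.mean((y_true - y_pred) ** 2)

def relative_error(y_true, y_pred):
    """Relative MSE: model MSE divided by the MSE of the mean predictor.
    Values above 1 indicate the model does worse than predicting the mean;
    RE estimates the ratio of residual variance to data variance."""
    return mse(y_true, y_pred) / np.mean((y_true - np.mean(y_true)) ** 2)
```

For a perfect fit RE is 0, and for the mean predictor itself RE is exactly 1, matching the threshold interpretation used in Table 4.
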

Conclusion
This paper discussed the variance of (regularized) ELM under general hypotheses and its estimation through small ensembles of retrained ELMs under homoskedastic and heteroskedastic hypotheses. As ELM is nothing more than a linear regression in a random feature space, analytical results can be derived by conditioning on the random input weights and biases. In particular, the variance of f̂(x_0) given the input data has been decomposed into additive terms, supporting the identification and interpretation of the contributions of the different variability sources. Based on these formulas, several variance estimates independent of the noise distribution were provided for the homoskedastic and heteroskedastic cases, together with a Python implementation. The formulas and estimate-related theoretical results are supported by numerical simulations and empirical findings. The bias-reduced estimate σ̂²_BR is likely uniformly better than σ̂²_NHo in the homoskedastic case and should be preferred. In the heteroskedastic case, σ̂²_S2 and σ̂²_S3 are empirically shown to be better than the other proposed estimates; although these estimates are close to each other, σ̂²_S2 is computationally more efficient than σ̂²_S3. The paper also showed the possibility of constructing accurate CI for f(x_0) and E[f̂(x_0)] despite the nonparametric, non-linear, and random nature of ELM. It provided a detailed explanation of the bias/variance contributions in CI estimation and highlighted that the bias must be carefully considered to achieve satisfactory performance, especially in the regularized case, which introduces significant bias. In particular, bias was traded against variance, which can be estimated while preserving the predictive performance of the model, leading to credible uncertainty estimation. Also, as the variance estimates are distribution-free, it is reasonable to think that CI could be built under non-Gaussian noise distributional assumptions.
Several aspects of ELM uncertainty quantification still need to be investigated. From a theoretical perspective, the (asymptotic) normality of ELM (ensembles) should be proved or disproved. More generally, an analytical expression for the distribution of H could be very useful to develop estimations based on a single ELM. Additionally, random matrix theory, which has already provided theoretical results for ELM [34], should be investigated in the uncertainty quantification context.
Practically, prediction variance estimation is straightforward by adding σ̂²_ε to the variance estimate in the homoskedastic case, while the noise variance could be estimated in the heteroskedastic case, e.g. with a second model [8,9]. Prediction intervals can also be constructed, assuming a convenient noise distribution. Future studies could also involve dependent data, e.g. by adapting heteroskedasticity and autocorrelation consistent (HAC) estimations of the full noise covariance matrix in the temporal or spatial cases [13,35,36].
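The homoskedastic construction just described can be sketched in a few lines; Gaussian noise is assumed for the interval, and the function name and inputs are illustrative placeholders for a model-variance estimate (e.g. σ̂²_BR) and a noise-variance estimate σ̂²_ε:

```python
import numpy as np

def prediction_interval(f_hat, var_f, sigma2_eps, z=1.96):
    """Approximate Gaussian prediction interval at a point (or array of points).
    The prediction variance is the model variance plus the estimated noise
    variance, as the new observation's noise is independent of f_hat."""
    pred_sd = np.sqrt(var_f + sigma2_eps)
    return f_hat - z * pred_sd, f_hat + z * pred_sd

# e.g. with a model-variance estimate of 0.04 and a noise-variance
# estimate of 0.05, the prediction sd is sqrt(0.09) = 0.3.
lower, upper = prediction_interval(0.0, 0.04, 0.05)
```

In the heteroskedastic case, `sigma2_eps` would be replaced by a location-dependent estimate σ̂²_ε(x), obtained for instance from a second model as suggested above.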

Software Availability
UncELMe (Uncertainty quantification of Extreme Learning Machine ensemble) is a Python package available on PyPI and GitHub (https://github.com/fguignard/UncELMe). It allows interested users to compute all variance estimates for Extreme Learning Machine ensembles discussed in the present paper. It is built within the scikit-learn estimator framework, which enables the use of all convenient scikit-learn functionalities [37]. Noise estimates are also returned to enable the construction of prediction intervals.

Author Contributions
F.G. conceived the main conceptual ideas, conducted the investigations, developed the theoretical formalism and the methodology, performed the calculations, interpreted the computational results, wrote the original draft, and developed the Python software. M.K. carried out the supervision, project administration and funding acquisition. F.G., F.A. and M.K. discussed the results, provided critical feedback, commented on, reviewed and edited the original manuscript, corrected the final version of the paper, and gave final approval for publication.
The original idea of this article stems from a conference paper [24] presented at the 28th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2020) in Bruges, Belgium, from 2 to 4 October 2020. More precisely, [24] presented equations (3) and (4) and the naive estimates for the non-regularized case only, with very few justifications and without details. All other work presented in this paper is the result of original research.