Distributed Gaussian Granular Neural Networks Ensemble for Prediction Intervals Construction

To overcome the weakness of generic neural networks (NNs) ensemble for prediction intervals (PIs) construction, a novel MapReduce framework-based distributed NN ensemble consisting of several local Gaussian granular NN (GGNNs) is proposed in this study. Each local network is weighted according to its contribution to the ensemble model. The weighted coefficient is estimated by evaluating the performance of the constructed PIs from each local network. A new evaluation principle is reported with the consideration of the predicting indices. To estimate the modelling uncertainty and the data noise simultaneously, the Gaussian granular is introduced to the numeric NNs. The constructed PIs can then be calculated by the variance of output distribution of each local NN, i.e., the summation of the model uncertainty variance and the data noise variance. To verify the effectiveness of the proposed model, a series of prediction experiments, including two classical time series with additive noise and two industrial time series, are carried out here. The results indicate that the proposed distributed GGNNs ensemble exhibits a good performance for PIs construction.


Introduction
It is important for a predictor to be able to estimate the reliability of the prediction results.However, the pointoriented prediction can only provide the estimated values without any indication of their reliability [1].Compared to the point-oriented methods, PIs construction can be employed to quantify the modelling uncertainty and the data noise.It becomes more popular in the field of data-based prediction [1,2].The models adopted for prediction mainly include fuzzy model [3], support vector machine (SVM) [4,5], and neural networks (NNs) [6][7][8], in which NNs-based models are the most commonly used for PIs construction.
The variance-estimation based numeric NNs, a class of typical PIs construction methods, have been proposed in the literatures, such as the delta [7], mean variance estimation (MVE) [9], and Bayesian techniques [10].The delta is applied to the NN modelling for PIs construction [11] with the assumption that the target variance is a constant for all the samples.However, such assumption deviates from real-world conditions, while the MVE method estimates the target variance by using a dedicated NN on an assumption that prediction errors are normally distributed around the true mean of the targets.As such, the PIs can be constructed if the parameters of the distribution are known [1,9].For the Bayesian technique based NNs, Gaussian probability distribution is considered over the weight values of NN, and the corresponding outputs can be approximated by a Gaussian distribution [12,13].The essence of the Bayesian NN can be regarded as a kind of Gaussian granular NN, in which the Gaussian probability density function (PDF) represents an information granularity.In such a way, the variance of the Gaussian PDF is the summation of the variances of the modelling uncertainty and the data noise, where the PIs can be built up from the variance of the output Gaussian PDF [14].
In addition, the interval-valued NNs modelling is another class of PIs construction method, in which the bounds of the intervals are considered.A multilayer perceptron (MLP) is proposed in [15] to learn the relationship between the interval-valued inputs and outputs, and the trained MLP 2 Complexity comes with interval-valued weights and biases.Another interval-valued network is also reported in [16], where the neurons connections are denoted by a series of interval values, while the modelling inputs are numerical values.Besides, considering how to utilize the interval information directly, a lower upper bound estimation (LUBE) based NN with two outputs is proposed in [17] for estimating the PI bounds, in which the two outputs are the upper and lower bounds of the PIs, respectively.The training objective of NN in [17] is a function of two evaluation indices of PIs, i.e., the PIs coverage probability and the mean width of PIs which can be calculated based on the PI bounds.
It is apparent that both classes of PIs construction mentioned above are based on the NN individual.When coping with large-scale or strongly nonlinear problems, such methods may exhibit weak modelling capacities due to their limited learning ability.Besides NN individual, modular NNs [18] and NNs ensemble [19] are also applied for classification, recognition, and prediction.For instance, a kind of modular NNs is proposed in [18] and applied for human recognition in [20][21][22], which improves the recognition rates compared with other existing methods.An NNs ensemble is reported in [23] with Type-1 and Type-2 fuzzy integration for time series prediction, the parameters of which are optimized by using the particle swarm optimization (PSO) algorithm.In addition, a network ensemble structure is designed in [24,25] for PIs construction compared to the network individualbased methodology.It is still a challenging task to organize an effective NN ensemble despite their popularity, because of three basic limitations of the ensemble structure.(1) The performance of an NN ensemble depends on each local network.If some local networks are biased, the accuracy of the ensemble modelling will be severely deteriorated [26,27].Moreover, it is rather hard for all local networks to guarantee their unbiased features.(2) The modelling uncertainty variance and the data noise variance are estimated for PIs construction [1,2].Another independent NN is required for the noise variance estimation.It may be more effective if these variances are estimated simultaneously.(3) Although an NNs ensemble consists of a number of NN individuals, it is not suitable for distributed computing.In such a way, the traditional ensemble approaches demand a high computational load and is troublesome for the real-time applications [28].
Granular computing is originally proposed by L. Zadeh [30,31].A granule may be interpreted as an interval, a set, a probability density function, or a distribution.Combining with granular computing, the shortcomings of the NNs ensemble can be properly overcome.The NNs ensemble can achieve better performance in many fields, such as time series prediction, human recognition, and pattern recognition.However, at present, there is still a lack of some relative research on the granular NNs ensemble-based PIs.
In this study, a Map-Reduce based distributed computing structure composed of a number of local Gaussian granular NNs (GGNNs) is proposed in this study for PIs construction.To avoid the ensemble being performed badly due to biased local networks, each local network of the proposed ensemble is specified by a penalty coefficient to weight their contributions to the ensemble.The coefficient can be estimated by evaluating the performance of PIs constructed by the local networks, where a new evaluation principle is designed here with the consideration of the interval width, the coverage probability, and the accuracy.To simultaneously estimate the uncertainties due to modelling and data noises, GGNNs are constructed as the local networks, and the neurons connections and the outputs are represented by Gaussian distributions.The prediction variance of the local GGNN has two components that are used to quantify the variance of modelling uncertainty and the data noise, respectively.In addition, a Map-Reduce programming framework is developed here in order to enhance the computing efficiency.To verify the effectiveness of the proposed model, a number of prediction tasks, including the noisy classical prediction problems and industrial data, are considered here.The experimental results indicate that the proposed distributed GGNNs exhibit a good performance for PIs construction.
The rest of this paper is organized as follows.First, the Gaussian PDF based granularity and the Gaussian granular based reasoning are introduced.The distributed GGNN is also proposed.Second, the parameter estimation of the distributed GGNNs is described, including the intrinsic parameters of local networks and the contribution coefficients of local networks.A Map-Reduce framework is developed for the distributed processing as well.Third, the quality of the proposed ensemble is experimentally verified.Finally, the conclusions are given.

Structure of the Distributed GGNNs
As for the drawbacks of the conventional ensemble structure analysed in the introduction, the contribution of this study lies in a distributed NNs ensemble, which consists of several local GGNNs whose connections are described by Gaussian PDFs.In the proposed ensemble, it does not depend on the bootstrap method to estimate the variance of the predicted targets.Here, the PIs are constructed based on a set of variances of the Gaussian probability densities of the outputs of the local NNs.

Gaussian PDF Based
Granularity.Let X and Y be two independent Gaussian random variables,  ∼ N(  ,  2  ) and  ∼ N(  ,  2  ).A linear combination  +  is also Gaussian [23].That is, where  and  are linear coefficients for random variables.
The connecting weights of one network can then be described by Gaussian PDF.We consider two neural networks (echo state network (ESN) [24] and extreme learning machine (ELM) [25]), whose mapping relationships between the unknown weights and the outputs are linear.If the output weights are represented by the Gaussian PDF    ∼ N(   ,  2   ), the network outputs can be represented by a linear combination of several Gaussian PDF.There is no doubt that the network outputs should follow a Gaussian distribution, denoted by  ∼ N(  ,  2  ), and where   are some linear coefficients.
In the perspective of information granularity, the Gaussian PDF can be viewed as a form of granularity for granular NNs construction.Nevertheless, the input and internal connections are usually unknown for most of the neural networks, and the mapping relationships of most networks are unlikely linear all the time.If the network weights are represented by Gaussian granules, it is hard to guarantee the output distributions to follow Gaussian distribution.Fortunately, this problem can be addressed with the help of Bayesian and Laplace approximate reasoning.

Gaussian-Granularity Based Approximate Reasoning.
Without loss of generality, one can assume a neural network with the input u and output t, and the relationship between input and output can be described as where  is the white Gaussian noise with mean zero, denoted by () = (; 0,  2  ), and m and h are the numbers of the input and hidden units, respectively.
Since the distribution information of the network weights w is unknown, the posterior of w needs to be first estimated by using the training dataset D. Considering the Bayesian rules, the posterior distribution (w | ) can be formulated by where () is a normalizing constant, (w) is the prior distribution, and ( | w) is the likelihood function used to quantify the similar degree between the network outputs and the observed outputs given the weights w.
where   () = (2/) /2 is a normalizing constant.Here,  is a hyperparameter related to the variance of the data noise, denoted by  = 1/ 2  , and n is the number of training samples.
When the distribution information of w is unknown, the prior distribution of w can be assumed as a Gaussian distribution with zero mean [26]; i.e., where  is a hyperparameter related to the prior distribution of w.   () = (2/) /2 is a normalizing constant.W is the dimension of the weights w.
Substituting the prior and the likelihood into (4), the posterior distribution of w can be written by using the Laplace approximation.
Using the Bayesian rule and the conditional independence between t and u, the output distribution can be written as (8) given a new input u.
Now considering the distribution (|u, w), given that the noise  obeys a Gaussian distribution with zero mean, ( | u, w) can be rewritten as Introducing (7) and ( 9) into (8), the output can also be approximated by a Gaussian distribution based on the Laplace approximation.
where w  is the optimal weights and  2  is the prediction variance.To construct the PIs, it is necessary to estimate the prediction variance related to the two hyperparameters  and  as well as the optimal weights w  .

Distributed GGNNs Based PIs Construction.
In this study, a novel distributed GGNNs ensemble composes of a number of nonrelated local neural networks is proposed.Its architecture is illustrated in Figure 1.Once the distribution information of the local output is obtained, the local network can be used to construct the local PIs.As such, the performance of the PIs constructed by the local network can be evaluated, and the contribution coefficient of each local network can also be computed.One can then obtain the output distribution of the distributed networks by reconciling those of the local networks.

Gaussian-granular NN 1
Gaussian-granular NN 2 Gaussian-granular NN B u Given the numeric input u, if the weights of the bth local network obey the Gaussian distribution, i.e.,   , ∼ N(0,  2  ), the corresponding output can be approximated by a Gaussian distribution   ∼ N(  ,  2 , ).The distribution of   can be written as The distribution information of the output   is obtained, and 100(1 − )% PIs can be constructed.
where  1−(/2) is the 1 − (/2)-quantile of a standard normal distribution function with zero mean and unit variance.
Introducing the contribution coefficient   of the bth local network, the output PDF of the ensemble can be written as where 0 <   < 1 and The architecture of the distributed GGNNs has the following advantages.(1) The proposed architecture has a stronger learning ability than the neural network individual.Sometimes, the network individual might achieve a good performance of PIs, yet the performance is not very stable.However, the contribution of each local network in the proposed architecture is weighted through evaluating the individual performances.A more stable model can be achieved based on the proposed architecture.(2) Unlike the generic NNs ensemble, the relationship among the local networks of the proposed architecture is relatively independent.Therefore, the proposed architecture is suitable for distributed computing, and the computational efficiency of the ensemble can be significantly improved.

Parameters Estimation of the Distributed GGNNs
The parameters estimation process of the proposed NN ensemble is briefly illustrated in Figure 2.For a training dataset S with   data samples, this study first adopts a bootstrap method to resample the training dataset S, and then a bootstrap dataset S b with   data samples are obtained, where   ≪   .Meanwhile, a corresponding dataset R  is also generated by excluding the bootstrap samples from the original dataset.Repeating the same process B times, B bootstrap datasets can be obtained.In this study, the parameters estimation process of the proposed model consists of two parts.First, B local GGNNs are trained on the basis of the B bootstrap datasets, respectively.Second, the contribution coefficients are estimated by employing the B remaining datasets.

Parameters Estimation of Local GGNN.
Given that the optimal weights are used for point-oriented prediction and the output variance are used for PIs construction, they have to be estimated for each of local GGNN.Since the optimal weights can be determined when the maximum of the posterior PDF occurs, one can optimize (7) to find the optimal weights of the bth local GGNN.Also, given that   is a normalizing constant in (7), the optimal problem is equivalent to minimize the function (w  ) according to the feature of the exponential function [10].Then, we have To minimize (w  ), (15) can be linearly approximated by Taylor expanding with respect to w   .That is, where A  =   ∇∇   +   I is the Hessian of (w  ).Gaussian-granular NN B As such, the prediction variance in ( 11) can be written as  2  , = 1/  + g   A  −1 g  and g  = ∇ w  | w  .Since the variance is greatly related to   and   , it is necessary to simultaneously optimize the values of the hyperparameters and the network weights.According to [26], there exist relationships between w  and the hyperparameters   and   , respectively.
where   = ∑  =1   /(  +   ), and   are the eigenvalues of the Hessian matrix.Particularly, if the Hessian matrix A  in ( 16) is singular, the eigenvalues should be approximated by using the singular value decomposition (SVD) algorithm.Since the prior knowledge is introduced for the Bayesian method, the dominator of   is the sum of   and   .Even   might be equal to zero, the value of   is still effective.Based on some experiments, the iterative strategy for the update of the hyperparameters is indeed feasible.
Based on the inference process from (3) to (11),   and   are two hyperparameters related to the distributions of the weights and the output, respectively.To optimize the cost function of (15), the extended Kalman filter (EKF) [27] is adopted here to optimize the weights of local networks, in which   and   can be viewed as the process noise and the measurement noise, respectively.We summarize the steps of the weights estimation based on the EKF as follows.
Step 1. Choose the initial values for   and   .Initialize the weights w  0 and the corresponding variance P w 0 by using the values drawn from the prior Gaussian distribution.Define the external loop number N loop .
Step 2. For  ∈ {1, 2, ⋅ ⋅ ⋅ , ∞}, the time update equations of the EKF are where A −1 is the identity matrix with the same dimensionality as the network weights and the variance of the process noise is defined as (R  k ) −1 = (  /2) ⋅ A −1 .
Step 3. The measurement update equations are where , and B  is the identity matrix with the same dimensionality as the training samples.
Step 5. Iterate to Step 2 until the external loop N loop reaches.

Coefficient of Local
Gaussian-Granular NNs.Considering one bootstrap dataset   with   data samples, i.e., at least  0 −   samples in   ,   will be used to train the bth local network, while   will be used to determine its contribution to the ensemble by evaluating the performance of PIs constructed by it.Generally, the performance of the PIs can be quantified by the PIs coverage probability (PICP) and the mean PIs width (MPIW).
where N is the number of the testing data samples.  is a binary-state variable.If the inequality  −  ≤   ≤  +  is valid, then   = 1, else   = 0. Particularly, a coverage width-based criterion (CWC) for PIs evaluation that is a function of PICP and NMPIW is proposed in [1].However, the CWC may not be suitable according to some experiments conducted in [17].A very large value of CWC will be obtained when a not very large PICP is counted with a relatively low MPIW due to the amplification effect of the exponential function.On the contrary, the index CWC cannot achieve a large value, when the constructed PIs are much wider and the PICP is large.
The evaluation of the quality of the constructed PIs is a multiple objective optimization problem.It is difficult to define a universal and effective criterion for all prediction problems.Even for the same prediction problem, the expectations of PICP, MPIW, and MAE may vary under different application backgrounds.A more concise and effective evaluated strategy is designed as follows: where the constants , , and  are scale factors that subject to  +  +  = 1., , and  serve to qualify the amount of attention paid to MPIW, MAE, and PICP, respectively.The three parameters can be determined according to the application requirements.It is noteworthy that the evaluation index of accuracy named Mean Absolute Error (MAE) is also considered here since the point-oriented prediction is the median of the constructed PIs.The definition of MAE is given by where   is the predicted value and   is the observed value.
The detailed analysis of ( 25) is provided here.First, the values of the MPIW and the MAE are usually expected to be small; meanwhile, the PICP is expected to be large.According to (24) and (26), the measure indices MPIW and MAE with the same measurement dimension can be merged under the premise that the practical significance is not considered.
Second, the PICP is related to MPIW and MAE that is difficult to be quantified by any learning method.It is true that there is an inverse function relationship between PICP and RMPIW.Notably, PICP is in percentage.In order to establish the link between the two indices, it is only required to multiply (1-PICP) by RMPIW.
According to section (11), the output distribution of a trained GGNN can be written as follows given an input u   of the remaining dataset   : where    is the network output defined as    = Ŵ  (u   , x   ).Then, the PIs can be constructed by the network output and the prediction variance in (28), formulated by As such, the value of   and   can be calculated, and the value of the synthesized index PPI b can also be obtained.To determine the contribution of each local network, the performance of the distributed networks can be presumably defined by the sum performance of the local networks.
where   is the PIs coverage probability of the bth local network and   is the sum of the corresponding mean PIs width and root mean square error.
To make sure the contribution of the optimal local network can be enlarged and the contribution of the poor one can be minimized, an intermediate variable   = (  /)  is defined, where  is a scale factor.So the contribution of the bth local network to the distributed networks can be defined as Here,  can be set from 5 to 20.The value of  controls the amount of attention paid to the optimal local network.The larger the value of  is, the larger the contribution coefficient  of the optimal network is.

Map-Reduce Based Distributed
Model.The advantage of the proposed NN ensemble is that its distributed structure allows an improved learning ability.If the performance of one local network is poor, that of the ensemble can still be guaranteed.However, there is a disadvantage of the proposed model that the computational efficiency is relatively low.Map-Reduce programming framework provides an efficient model that can be used to handle large datasets [28].Map-Reduce can greatly improve the operation time [30].In this section, the proposed model is organized into a Map-Reduce programming architecture.The Map task is responsible for three important works, as shown in Figure 3. First, the local neural network is built and learned in each Map task based on the bootstrap dataset; second, the performance of each local network is evaluated based on the remaining dataset; and third, the prediction of each network is also completed given the testing input.The input of Map task is a pair of key-value, where the key labels the serial number of the local network and the value denotes the corresponding dataset for the local network including the training dataset and the testing one.The output of Map task is also a pair of key-value, where the output value is the evaluated performance and the predicted output of the local network.
The Reduce task is responsible for two parts, as illustrated in Figure 4.The first one computes the contribution coefficients.The second one generates the integrated output [ −  ,  +  ], 1 ≤  ≤  of the distributed model with [ −  ,  +  ], 1 ≤  ≤   of each local network.The input of the Reduce task is a pair of key-value from the output of the Map function, and its output is the final predicted output.

Experimental Results and Analysis
To verify the effectiveness of the proposed distributed model, four prediction tasks are considered.The first two ones are two classical prediction problems, involving the differential defined Rossler chaotic time series and the Mackey Glass chaotic time series.In order to exhibit the ability of practical application, we also consider two kinds of industrial validation, coming from a traffic time series provided by the website http://www.neural-forecasting-competition.com/datasets.htm and a real-world problem that involves real-time flow prediction for blast furnace gas in steel industry.The experiments are designed as follows that involves the parameters estimation, the PIs construction, and the performances evaluation.Since the generic NNs ensemble will cost a large computational load, the experiments based on MATLAB tools are carried out by using the ESNs ensemble whose parameters estimation is relatively efficient.And the generic NNs ensemble and the ESNs ensemble are employed to do a set of comparative experiments for a complete comparative study.

Rossler Chaotic Time
Series.Rossler chaotic system, a kind of chaos phenomena of the lowest dimension equation, is proposed by Rossler in 1976 [32], which is described by three coupled first-order differential equations.
Time series of all variables (x, y, z) are obtained from solving the above differential equations via 4 th -order Runge-Kutta method.The integral step of 4 th -order Runge-Kutta Take the x-variable for example, and the processes of constructing PIs for y-variable and z-variable are so similar that they are not given here.The desired noisy Rossler chaotic time series is obtained by adding a Gaussian white noise with the variance 0.01 to the original values of the time series.This study selects 2000 data samples from the training dataset by using the bootstrap resampling to build the bth local networks, whereas 500 data samples are randomly selected from the training dataset excluding the bootstrap samples to evaluate the bth local network.60 data samples are selected from the testing dataset to verify the proposed model.In this study, the parameters are set for convenient calculation as follows.The number of the input units is set as 60 and the dimension of the dynamic reservoir is set as 200.
Table 1 lists the bootstrap validation results with respect to the different number of local networks for noisy Rossler prediction problem.In Table 1, three indices for evaluating the performance of the model are considered, where PICP, MPIW, and MAE pay attention to the coverage probability, the interval width, and the prediction accuracy, respectively.The penultimate column gives the validation results of a comprehensive index PPI that consider the PICP, MPIW, and MAE with weights 2/9, 1/9, and 2/3, respectively.The weights of PICP, MPIW, and MAE for PPI can be set by using a trial and error method.From Table 1, when the number of local networks equals 8, the performance of the ensemble is obviously better than that of the ensemble with less than 8 local networks.When the number of local networks equals 11, the performance of the ensemble is the best for the noisy Rossler problem.
Figure 5 shows the contribution coefficients of 11 local networks, where the scale factor  of ( 31) is set as 20.From this figure, the performance of each local network is significantly good and the contribution to the distributed networks shows no significant difference.The prediction results for the noisy Rossler problem based on the proposed distributed model with 11 local networks are shown in Figure 6.From   [33] is a time-delay differential system formulated as  ()  =  ( − ) 1 +  ( − ) 10 −  () (33) where the parameters are typically set as  = 0.2,  = −0.1, and  = 17.To sample the time series, we numerically integrate (33) by using the fourth-order Runge-Kutta method with a sampling period of 2s and an initial condition (0) = 1.2.The desired noisy Mackey Glass time series is obtained by adding a Gaussian white noise with the variance 0.01 to the original Mackey Glass time series.The original dataset sampled from the noisy time series consists of 2000 data samples, where 1500 data samples are used for training and the others for testing.We select 1000 data samples from the training dataset by using the bootstrap resampling method to build the bth local network, whereas 500 data samples are randomly selected from the training dataset excluding the bootstrap samples to evaluate the bth local network.60 data samples are selected from the testing dataset to verify the proposed model.In addition, the parameters of the local network are estimated based on the training data.The number of the input units is set as 60 and the dimension of the dynamic reservoir is set as 200.

Mackey Glass Chaotic Time Series. Mackey Glass system
Mackey Glass time series is also generated from a chaotic system, yet the range of values is narrower than that of the Rossler time series.Compared to Table 1, the forecasting performance of the proposed ensemble becomes much better; especially, the values of PICP are significantly improved.The penultimate column in Table 2 gives the validation results of PPI with the consideration of the PICP, MPIW, and MAE whose weights are 2/9, 1/9, and 2/3, respectively.The weights of PICP, MPIW, and MAE for PPI can be set by using a trial and error method.From this table, the values of PPI of different ensembles are not greatly varied with the increase of the number of the local networks.However, it is apparently that when the number of local networks is larger than 9, the performance of the ensemble is significantly improved.And when the number of local networks equals 16, the performance of the distributed ensemble is the best for the noisy Mackey Glass problem.
Figure 7 shows the contribution coefficients of 8 local networks where the scale factor  of ( 31) is set as 20.The prediction results for the noisy Mackey Glass problem based on the proposed distributed model with 8 local networks are shown in Figure 8. From Figure 8(a), the constructed PIs with the confidence level 95%, in which the noisy value can almost be covered by the PIs, and the predicted error in Figure 8(b) illustrate the high accuracy of the proposed ensemble for the noisy Mackey Glass problem.

Application to Prediction of Traffic Time Series.
In this subsection, the data are provided by the NN GC1 Forecasting Competition that can be found on website.The samples are constructed on the basis of the data of Tournament 1, 1.F.The original dataset consists of 1500 data samples, where 1000 data samples are used for training and the others for testing.We select 500 data samples from the training dataset by using the bootstrap resampling method to build the bth local network, whereas 100 data samples are randomly selected from the training dataset excluding the bootstrap samples to evaluate the bth local network.60 data samples are selected from the testing dataset to verify the proposed model.In addition, the parameters of the local network are estimated based on the training data.In this study, the number of the input units is  set as 50 and the dimension of the dynamic reservoir is set as 20.
The bootstrap validation results of the number of local networks are shown in Table 3. First, although the PIs are the widest when there is only one local network, the corresponding PICP is not the highest.Due to the increase of the prediction difficulty, there exists a large difference in the prediction performance between two networks.Sometimes, the prediction performance of one local network is bad for the traffic time series.When the number of local networks increases, the performance of the proposed distributed networks can be improved because the good performance is amplified and meanwhile the poor performance is punished in this study.The above inference is verified to a certain extent in Figure 9.In modeling the distributed networks, there exist some local neural networks with poor performances as the 1th, 3th, 4th, 5th, 7th, 9th, 10th, and 11th local networks shown in Figure 9.Only the 2th, 6th, 8th, and 12th local networks make a great contribution to the output of the distributed networks, especially the second local network.The penultimate column in Table 3 gives the validation results of PPI with the consideration of PICP, MPIW, and MAE whose weights are 1/3, 1/3, and 1/3, respectively.The weights of PICP, MPIW, and MAE for PPI can be set by using a trial and error method.From Table 3, when the number of local networks equals 12, the distributed networks perform best.
The prediction results based on the proposed distributed model with 12 local networks is shown in Figure 10.From

Application to the Consumption Flow of Blast Furnace Gas.
In this section, we present a real-world industrial application problem by using the proposed distributed model.Steel industry is usually accompanied with high energy consumption and environmental pollution, in which the byproduct gas is one of the useful energy resources.There is a need to predict the gas consumption flow for energy scheduling in a steel plant.The experimental data, with a sample interval of 1 min, come from a steel plant in China and cover the consumption flow of BFG by the #1, #2 Coke Oven in September 2015.The  original dataset consists of 4000 data samples, where 3000 data samples are used for training and the others for testing.We select 1000 data samples from the training dataset by using the bootstrap resampling method to build the bth local network, whereas 100 data samples are randomly selected from the training dataset excluding the bootstrap samples to evaluate the bth local network.60 data samples are selected from the testing dataset to verify the proposed model.Again, the number of the input units is set as 50 and the dimension of the dynamic reservoir is set as 100.Table 4 shows the bootstrap validation results of the number of local networks for the consumption flow of BFG.From Table 4, when the number of local networks equals to 9, the performance of the distributed networks is the best for the consumption flow.Figure 11 shows the contribution coefficients of 9 local networks, where the scale factor  of (31) is set as 20.The prediction results for the consumption flow prediction problem based on the proposed distributed model with 9 local networks are shown in Figure 12(a).The constructed PIs with a confidence level of 95%, in which the observed value can almost be covered by the PIs, and the predicted error in Figure 12(b) illustrate the high accuracy of the proposed model for the consumption flow prediction problem.

Statistical Analysis and Distributed Computation.
To further evaluate the performance, a series of statistical experimental results are reported in Table 5, where NN-based conformal prediction [29], the MVE-based NN individual [1], the Bayesian NN individual [14], and the networks ensemble in [22] are conducted for comparison.To guarantee the full indication of the statistical experiments, the PICP, MPIW, MAE, PPI, and computing time are employed here for statistics.The mean values of the PICP, MPIW, MAE, PPI, and computing time for 50 times are given in Table 5.In addition, to verify the performance of the proposed method, the ESN individuals-based ensemble and the MLP individuals-based ensemble that is the generic NNs ensemble are considered for the comparative experiments.
From Table 5, we observe that the proposed distributed NNs ensemble shows the best performance with the overall consideration of PICP, MPIW, and MAE.Although conformal prediction has a stable and good performance for the coverage probability of PIs, it comes with the tradeoff in MPIW.Although ESN individual has a lower computational  load and a higher accuracy, it is often accompanied with the instable performance since the coefficient matrix of unknown parameters might be ambiguous, which makes the distributed ESNs ensemble do not show the performance as well as the generic NNs ensemble.To improve the accuracy of the ESNs ensemble, an effective parameters estimation strategy is required, but with a large computational load.By contraries, MLP individual has a stable and excellent prediction performance but with a higher computational cost.Thus, the proposed model inhibits a better prediction performance in total if the computational efficiency can be improved.
It is well known that the distributed NNs ensemble implemented in serial mode costs more computing time than the NN individual.In Table 5, the computational time is monitored on the MATLAB platform.If the algorithm is implemented on another platform in series mode, it might cost a higher computational load.Fortunately, the proposed method can be implemented in parallel mode in the JAVA environment with the Map-Reduce programming model that can effectively improve the computational efficiency.The experiments are conducted based on the Baidu open cloud platform.The master node consists of Intel Xeon 2.00-GHz 4 Core, 32-GB memory, and 400G Hard disk, and the slave nodes are the same as the master one.The experimental results are shown in Figure 13 and Table 6.From Figures 13(a) and 13(b), when the distributed model consists of more local networks, the computational efficiency can be improved with an increase in the number of the local networks.The measurement unit of the value in Table 6 is minute.The parameters of the distributed networks are set as the above statements.From Table 6, we can see that the computing time of the distributed model can be greatly reduced with the increase of the number in the slave nodes.When the number of the local GGNNs equals 8, the computational efficiency can be obviously improved when using one master and four slaves.As for the distributed model with 15 local GGNNs, the computational efficiency can be improved with one master and eight slaves.For the two groups of experiments, the shortest computing time of each experiment only requires 8 minutes.

Conclusions
In this paper, a novel distributed GGNNs ensemble is proposed for PIs construction.The distributed model has B cascaded GGNNs and the output of the distributed model is the weighted sum of the output of the local neural networks.The advantages of the proposed method involve three important respects.(1) The distributed model has a stronger learning ability than the neural network individual.
(2) The performance of the proposed model is more stable.The effect due to local network with poor performance can be reduced when computing the coefficient of the local one contributed to the distributed model, and on the contrary, the good performance of the local networks can be amplified.The proposed distributed model is different from the generic NNs ensemble by taking the contribution of the network individual into consideration.(3) The relationship among the local networks of the proposed architecture is relatively independent.Therefore, the proposed architecture is suitable for implementing on a distributed programming platform named Map-Reduce architecture to improve the computational efficiency.Four prediction problems are used in our experiments to illustrate the effectiveness of the proposed method for PIs construction.Compared to three network Complexity future, we will pay more attention to the research on the cooperative learning of the individuals in an ensemble to achieve a stronger learning ability.For instance, the fuzzy integration or some constructive algorithms will be considered in the future.Meanwhile, we are expected to focus on the distributed implementation of the cooperative learning method in the future and obtain some valuable results.In addition, model-free-based PIs will be considered, which can be viewed as another future work.If the PIs are applied for industrial process, the computational load is one significant index to be solved.For the model-based PIs, especially the ensemble or modular NNs-based PIs, it costs much more computational load despite high accuracy, which affects the application of the PIs.

Figure 2 :
Figure 2: Training process of the proposed distributed networks.

Figure 3 :Figure 4 :
Figure 3: The processing procedure of the Map task.

Figure 5 :
Figure 5: The histogram of the contribution coefficient of each local network.

Figure 6 :
Figure 6: Prediction results for the noisy MSO problem: (a) the constructed PIs; (b) the corresponding predicted error.

Figure 6 (
Figure 6(a), the constructed PIs cover the observed value with a lower width, and the smaller predicted error in Figure 6(b) illustrates the prediction accuracy of the proposed model for the noisy Rossler time series.

Figure 7 :
Figure 7: The histogram of the contribution coefficient of each local network.

Figure 8 :
Figure 8: Prediction results for noisy Mackey Glass problem: (a) the constructed PIs; (b) the corresponding predicted error.

Figure 10 (
Figure 10(a), the constructed PIs with the confidence level 95%, in which the noisy value can almost be covered by the PIs, and the predicted error in Figure 10(b) conforms the accuracy of the proposed model.The prediction results in Figure10are different from those of the first two prediction problems.The numerical range of the value is large and the dynamic characteristics of the data are more complex.Since the uncertainties of the data and model are larger, the PIs are much wider.Figure10illustrates that the proposed model is effective for constructing the PIs of the traffic time series.

Figure 9 :
Figure 9: The histogram of the contribution coefficient of each local network.

Figure 10 :
Figure 10: Prediction results for the data of Tournament 1: (a) the constructed PIs; (b) the corresponding prediction error.

Figure 11 :
Figure 11: The histogram of the contribution coefficient of each local network.

Figure 12 :Figure 13 :
Figure 12: Prediction results for the consumption flow of BFG: (a) the constructed PIs; (b) the corresponding prediction error.

Table 1 :
The validation result of the number of local networks.
method is 0.15, the previous 1000-step values are abandoned, and the following 9000-step values of all variables are adopted as the sample data where 1-5000 values are used for training and the remaining 4000-step values are taken to test and validate the performance of PIs.In this study, [a, b, c] are set as [0.398, 2, 4], and the initial values of [x, y, z] are set as [-1, 0, 1].

Table 2 :
The validation result of the number of local networks.

Table 3 :
The validation result of the number of local networks.

Table 4 :
The validation result of the number of local networks.