Deep belief improved bidirectional LSTM for multivariate time series forecasting

Abstract: Multivariate time series (MTS) play an essential role in daily life because most real-world time series datasets are multivariate and rich in time-dependent information. Traditional forecasting methods for MTS are time-consuming and subject to complicated limitations. One efficient method explored within dynamical systems is the long short-term memory network (LSTM). However, existing MTS models only partially exploit the hidden spatial relationships among variables. Shallow LSTMs are inadequate for extracting features from high-dimensional MTS, whereas a multilayer bidirectional LSTM (BiLSTM) can learn more MTS features in both directions. This study proposes a novel improved BiLSTM network (DBI-BiLSTM) based on a deep belief network (DBN), a bidirectional propagation technique, and a chained structure. The deep structure is composed of a DBN layer and multiple stacked BiLSTM layers, which increases the feature-representation capacity of DBI-BiLSTM and allows the model to further learn the extracted features in two directions. First, the input is processed by the DBN to obtain comprehensive features. Then, the obtained features, divided into clusters based on


Introduction
Presently, real-life multivariate time series (MTS) datasets have drawn considerable interest from diverse fields, including time series of sensor events [1], the Internet of Things [2], compartmental epidemiological models [3], and human behavior prediction [4]. Classical linear models include the autoregressive (AR) model [5], moving average (MA) model [6], autoregressive moving average (ARMA) model [7], and autoregressive integrated moving average (ARIMA) model [8]. Nonlinear models such as the threshold AR [9] and bilinear models [10] have also been applied to MTS prediction. However, the classical MTS models only partially exploit the hidden spatial relationships between variable pairs and are limited by complex constraints. Support vector regression (SVR) [11], artificial neural networks (ANN) [12], gradient-boosted regression trees (GBRT) [13], and gradient-boosting decision trees (GBDT) [14] are prominent machine learning (ML) algorithms recognized for their strong prediction performance. Conventional ML algorithms and feed-forward ANNs (FFANN) are restricted in MTS forecasting because they cannot adequately model the correlation between the multi-dimensional input variables over time [15]. Recurrent neural networks (RNNs), which are natural techniques for modeling sequence-based systems with memory, are more appropriate than FFANNs [16] and have been recognized as among the most effective models for predicting MTS [17]. Nevertheless, traditional RNNs still struggle to handle long-range time-dependent MTS datasets, which decreases prediction accuracy [18].
To overcome this shortcoming of conventional RNNs, researchers [19][20][21][22] have employed the long short-term memory (LSTM) model, which can be treated as an extended RNN with memory gates. Subsequent studies employed the bidirectional LSTM (BiLSTM) to overcome the limitation that LSTM trains on signals in only one direction; however, the feature-learning characteristics of BiLSTM remain unclear [23]. Shallow BiLSTMs are still ineffective at representing MTS features that exhibit high nonlinearity and long-term dependencies. Ewees et al. [24] employed a novel LSTM based on a heap-based optimizer to address complex optimization and engineering problems. Liu et al. [25] introduced an integrated hybrid convolutional neural LSTM network based on error correction and variational mode decomposition for hourly stepwise solar irradiance forecasting. Neshat et al. [26] proposed a deep hybrid LSTM prediction model that combines a convolutional neural network, a BiLSTM, and an adaptive decomposition-based algorithm for wave power prediction. Li et al. [27] proposed an evolutionary attention-based multi-layer LSTM network (EA-LSTM), trained through learning with competitive random search, for MTS forecasting. In general, deep LSTM or BiLSTM architectures can acquire good representations of high-dimensional, complex, and strongly nonlinear MTS signals.
Notably, the deep belief network (DBN) [28] proposed by Hinton et al. can be considered a layered feature-extraction method that consists of multiple restricted Boltzmann machines (RBMs).

Deep belief network
A DBN is a probabilistic generative network that consists of several RBM layers. A DBN extracts data features using an unsupervised layer-by-layer training method [32], as shown in Figure 1. An RBM is an unsupervised nonlinear feature extractor based on a Markov random field, comprising two essential layers: a visible layer and a hidden layer. Each hidden unit of an RBM is fully connected to the units of the next RBM by symmetric undirected weights, and there are no connections within a layer; these properties lead to conditional independence between the visible and hidden cells. The joint probability distribution defined by the RBM's weights uses an energy function of $\{v, h\}$:

$$E(v, h; \theta) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j v_i w_{ij} h_j, \quad (1)$$

where $w_{ij}$ represents the weight from visible cell $i$ to hidden cell $j$, and $a_i$ and $b_j$ are the biases of units $i$ and $j$, respectively. The joint probability distribution of the RBM over the visible and hidden cells is computed from Eq (1) as

$$P(v, h; \theta) = \frac{\exp(-E(v, h; \theta))}{Z(\theta)}, \quad (2)$$

$$Z(\theta) = \sum_{v, h} \exp(-E(v, h; \theta)), \quad (3)$$

where $Z(\theta)$ is a normalizing constant (the partition function) obtained by summing over all possible configurations of the visible and hidden cells.
The RBM can evaluate the probability of the input data through the energy function. From the joint probability distribution, the conditional probabilities of hidden cell $j$ and visible cell $i$ are

$$P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i w_{ij}\Big), \quad (4)$$

$$P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j w_{ij} h_j\Big), \quad (5)$$

where $\sigma(\cdot)$ is the logistic sigmoid function. The input state is reconstructed by setting every $v_i$ to 1 with the probability given by Eq (5); the states of the hidden cells are then gradually renewed to represent the reconstruction features. Training the RBM maximizes the log-likelihood of the training data with respect to the model parameters:

$$\underset{\{W, a, b\}}{\text{maximize}} \;\; \sum_{t=1}^{m} \log P\big(v^{(t)}; \theta\big), \quad (6)$$

where $m$ is the length of the training dataset. The objective is therefore a log-likelihood term that should be solved by gradient-based optimization. However, the gradient of the log-likelihood is difficult to compute exactly because of $Z(\theta)$; sampling methods such as contrastive divergence [33] and persistent contrastive divergence [34] can be used in the gradient calculation instead.
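To make the CD-1 training step above concrete, the following is a minimal NumPy sketch of a Bernoulli RBM, not the paper's implementation; the class name, learning rate, and batch layout are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal Bernoulli RBM trained with one-step contrastive divergence (CD-1)."""
    def __init__(self, n_visible, n_hidden, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))  # weights w_ij
        self.a = np.zeros(n_visible)  # visible biases a_i
        self.b = np.zeros(n_hidden)   # hidden biases b_j
        self.lr = lr
        self.rng = rng

    def hidden_probs(self, v):
        # P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij)
        return sigmoid(v @ self.W + self.b)

    def visible_probs(self, h):
        # P(v_i = 1 | h) = sigmoid(a_i + sum_j w_ij h_j)
        return sigmoid(h @ self.W.T + self.a)

    def cd1_step(self, v0):
        """One CD-1 update on a batch of visible vectors (shape: batch x n_visible)."""
        ph0 = self.hidden_probs(v0)
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
        v1 = self.visible_probs(h0)                            # reconstruction
        ph1 = self.hidden_probs(v1)
        batch = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - v1.T @ ph1) / batch
        self.a += self.lr * (v0 - v1).mean(axis=0)
        self.b += self.lr * (ph0 - ph1).mean(axis=0)
        return float(np.mean((v0 - v1) ** 2))  # reconstruction error
```

Stacking several such RBMs, each trained on the hidden activations of the previous one, yields the layer-by-layer DBN pre-training described above.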

Bidirectional LSTM
LSTM is an improved version of the conventional RNN that uses specially designed memory units to efficiently express the long-term dependencies of MTS datasets. In contrast to classical RNNs, the design of LSTM offers an effective answer to the vanishing-gradient problem. The LSTM cell learns the present hidden state from the current input and the previous hidden state, but it replaces the architecture of the hidden units with a memory cell that can represent the long-term dependence of the MTS signals. As shown in Figure 2, the LSTM network introduces an input gate, a forget gate, an output gate, and a self-loop memory cell to manipulate the interactions of the information streams between memory neurons. In the hidden LSTM layers, the forget gate determines which information from the previous step should be preserved or discarded, the input gate decides whether input signals should be injected into the memory units, and the output gate controls whether the states of the memory units are exposed to the output. Given the MTS input $x_t$ and the dynamic output state $h_t$, the gate states, hidden-layer outputs, and cell states are computed as follows:

$$i_t = \sigma(U_i x_t + W_i h_{t-1} + b_i), \quad (8)$$
$$f_t = \sigma(U_f x_t + W_f h_{t-1} + b_f), \quad (9)$$
$$o_t = \sigma(U_o x_t + W_o h_{t-1} + b_o), \quad (10)$$
$$\tilde{c}_t = \tanh(U_c x_t + W_c h_{t-1} + b_c), \quad (11)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad (12)$$
$$h_t = o_t \odot \tanh(c_t), \quad (13)$$

where the recurrent weight matrices are $W_i$, $W_f$, $W_o$, and $W_c$; the input weight matrices of the input, forget, output, and memory-cell gates are $U_i$, $U_f$, $U_o$, and $U_c$; the gate biases are $b_i$, $b_f$, $b_o$, and $b_c$; and $\odot$ denotes element-wise multiplication. Because the traditional LSTM processes input signals in only one direction, it may inadvertently discard important sequential information during training, so the time series data cannot be analyzed thoroughly. To address this limitation, the BiLSTM was developed with a bidirectional structure that captures representations of MTS data in both the forward and backward directions. This procedure is illustrated in Figure 3. The BiLSTM consists of two LSTM layers that run in parallel in opposite directions. In the forward propagation direction, the hidden state $h_f(t)$ retains information from past sequence values; in the reverse propagation direction, the hidden state $h_b(t)$ carries information from future MTS values. $h_f(t)$ and $h_b(t)$ are concatenated to create the final outputs of the BiLSTM. The $t$-th forward and backward hidden states of the BiLSTM are calculated as

$$h_f(t) = \varphi\big(W_{fh} x_t + W_{fhh} h_f(t-1) + b_f\big), \quad (14)$$
$$h_b(t) = \varphi\big(W_{bh} x_t + W_{bhh} h_b(t+1) + b_b\big), \quad (15)$$

where $W_{fh}$ and $W_{bh}$ represent the forward and backward weights from the input to the internal units, $W_{fhh}$ and $W_{bhh}$ represent the forward and backward recurrent feedback weights, and $b_f$ and $b_b$ correspond to the bias terms of the two directions.
This study takes the activation function $\varphi$ of the hidden layer as tanh. Using these components, the output $y_t$ of the BiLSTM is described by

$$y_t = \sigma\big(W_{fhy} h_f(t) + W_{bhy} h_b(t) + b_y\big), \quad (16)$$

where $W_{fhy}$ and $W_{bhy}$ denote the forward and backward weights of the output layer, the activation function $\sigma$ of the output layer is generally a sigmoidal or linear function, and $b_y$ denotes the output bias.
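The bidirectional pass described above can be sketched as two independent LSTM runs over the same sequence, one from past to future and one from future to past, whose hidden states are concatenated per time step. This is a minimal NumPy sketch with randomly initialized weights; the helper names and initialization scale are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: input, forget, and output gates plus the candidate cell state."""
    i = sigmoid(x @ p["Ui"] + h_prev @ p["Wi"] + p["bi"])        # input gate
    f = sigmoid(x @ p["Uf"] + h_prev @ p["Wf"] + p["bf"])        # forget gate
    o = sigmoid(x @ p["Uo"] + h_prev @ p["Wo"] + p["bo"])        # output gate
    c_tilde = np.tanh(x @ p["Uc"] + h_prev @ p["Wc"] + p["bc"])  # candidate state
    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)
    return h, c

def init_params(n_in, n_hid, seed):
    rng = np.random.default_rng(seed)
    p = {}
    for g in "ifoc":
        p["U" + g] = rng.normal(0, 0.1, (n_in, n_hid))
        p["W" + g] = rng.normal(0, 0.1, (n_hid, n_hid))
        p["b" + g] = np.zeros(n_hid)
    return p

def bilstm(xs, n_hid, seed=0):
    """Run two LSTMs over xs (T x n_in) in opposite directions and concatenate h_f(t), h_b(t)."""
    T, n_in = xs.shape
    pf, pb = init_params(n_in, n_hid, seed), init_params(n_in, n_hid, seed + 1)
    hf, hb = np.zeros((T, n_hid)), np.zeros((T, n_hid))
    h, c = np.zeros(n_hid), np.zeros(n_hid)
    for t in range(T):                       # forward direction: past -> future
        h, c = lstm_step(xs[t], h, c, pf)
        hf[t] = h
    h, c = np.zeros(n_hid), np.zeros(n_hid)
    for t in reversed(range(T)):             # backward direction: future -> past
        h, c = lstm_step(xs[t], h, c, pb)
        hb[t] = h
    return np.concatenate([hf, hb], axis=1)  # T x 2*n_hid
```

Note how, at the first time step, the forward half depends only on the first input, while the backward half has already seen the entire series; this is exactly the property that lets a BiLSTM exploit future context.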

Variance based global sensitivity analysis
Due to the high dimensionality and numerous features of the MTS data, the input of the DI-BiLSTM, i.e., the output of the DBN, becomes more complex after DBN processing. Therefore, a sensitivity analysis of the DBN output is necessary to reduce the input complexity of the multi-layer BiLSTM. Among the available algorithms for network output sensitivity analysis (SA), variance-based SA algorithms, which apply Monte Carlo sampling and are based on model decomposition, have demonstrated their versatility and effectiveness. Saltelli et al. [35] proposed an improved variance-based GSA algorithm to compute a model's total sensitivity indices (SI). GSA has been applied to parametric sensitivity analysis for various models with excellent stability and minimal computational cost. Consider a nonlinear model $Y = f(x_1, x_2, \ldots, x_k)$, where $Y$ is the output and $x_1, \ldots, x_k$ are the input factors. The output variance can be decomposed as

$$V(Y) = \sum_i V_i + \sum_i \sum_{j>i} V_{ij} + \cdots + V_{1,2,\ldots,k}, \quad (20)$$

where $V_i$ is the variance contribution of parameter $x_i$, and $V_{ij}$ is the variance of the interaction between parameters $x_i$ and $x_j$. The first-order sensitivity $S_i$ and total sensitivity $S_{Ti}$ of parameter $x_i$ are expressed as

$$S_i = \frac{V_i}{V(Y)}, \quad (21)$$

$$S_{Ti} = 1 - \frac{V_{\sim i}}{V(Y)}, \quad (22)$$

where $\sim i$ stands for all factors except $x_i$.
To calculate the total sensitivity value of factor $x_i$, we must create two independent sampling matrices, $P$ and $Q$, each of size $(N, k)$, where $N$ is the sample length and $k$ is the number of model variables. Every row in the matrices corresponds to an input parameter vector $X$ for the model. Monte Carlo estimates of $S_i$ and $S_{Ti}$ are then computed from the model evaluations on $P$, $Q$, and the resampled matrices in which single columns are exchanged between $P$ and $Q$. Because interaction effects make the sum of the $k$ total indices greater than 1, the $S_{Ti}$ values are normalized:

$$\bar{S}_{Ti} = \frac{S_{Ti}}{\sum_{j=1}^{k} S_{Tj}}. \quad (23)$$

In this study, the total sensitivity of each parameter in the model was evaluated according to Eq (23).
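The sampling scheme above can be sketched as follows. Since the paper's exact estimator equations are garbled in this copy, the sketch uses the standard Saltelli (2010) first-order estimator and the Jansen total-effect estimator, which are the usual choices for this scheme; the function name and uniform [0, 1] sampling are illustrative assumptions.

```python
import numpy as np

def sensitivity_indices(model, k, N=20000, seed=0):
    """Monte Carlo estimates of the first-order (S_i) and normalized total (S_Ti)
    indices from two independent sampling matrices P and Q of size (N, k).
    PQi below is P with its i-th column replaced by the i-th column of Q,
    giving N*(k+2) model evaluations in total."""
    rng = np.random.default_rng(seed)
    P = rng.random((N, k))
    Q = rng.random((N, k))
    fP, fQ = model(P), model(Q)
    V = np.var(np.concatenate([fP, fQ]))  # total output variance V(Y)
    S, ST = np.zeros(k), np.zeros(k)
    for i in range(k):
        PQi = P.copy()
        PQi[:, i] = Q[:, i]
        fPQi = model(PQi)
        S[i] = np.mean(fQ * (fPQi - fP)) / V          # Saltelli (2010) first-order estimator
        ST[i] = 0.5 * np.mean((fP - fPQi) ** 2) / V   # Jansen total-effect estimator
    return S, ST / ST.sum()  # total indices normalized as in Eq (23)
```

For the additive model f(x) = 2*x1 + x2 with uniform inputs, the analytic indices are S1 = ST1 = 0.8 and S2 = ST2 = 0.2, which the estimates recover up to Monte Carlo error.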

Deep belief improved BiLSTM
Comprehensive research has shown that adding depth to ANNs can efficiently increase the overall performance of ANN models [36]. Similarly, the learning capabilities of deep RNNs in MTS forecasting have impressed the academic community [32,37]. Inspired by the learning capacity of multi-layer RNNs and the feature-extraction ability of the DBN, we develop a DBI-BiLSTM model for MTS prediction, consisting of multiple BiLSTM modules connected through a chained structure, as shown in Figure 4. The DBI-BiLSTM model has three main parts: a DBN-based feature-extraction component, a sensitivity-analysis component, and an improved BiLSTM-based prediction model. In the DBI-BiLSTM network, the DBN output vectors are categorized into major and minor features based on their sensitivity (SI) to the model's output: the major features are those that are highly sensitive to the model's output, while the minor features are those that are less sensitive. The major features, with high sensitivity, are injected into the first layer of the DI-BiLSTM, whereas the supplementary features, with low sensitivity, are fed into the subsequent layers of the DI-BiLSTM.
The significant benefits of the proposed DBI-BiLSTM can be outlined in three aspects. 1) The DBN's powerful feature-extraction capability ensures the feature-representation ability of DBI-BiLSTM for MTS datasets, which further reduces the prediction error of DBI-BiLSTM in MTS prediction.
2) The DBI-BiLSTM model incorporates both multi-layer and bidirectional RNN structures. The multi-layer structure enhances the training and generalization capabilities of DBI-BiLSTM, while the propagation in different directions enables DBI-BiLSTM to efficiently train on the MTS signals in two directions. This characteristic makes DBI-BiLSTM better suited to the natural features of MTS datasets.
3) The multi-layer architecture of DBI-BiLSTM differs from the traditional deep LSTM structure; it is an improved structure that includes multiple BiLSTM modules. Each BiLSTM module can dynamically map the feature vectors learned by the DBN layer. The forecasting results of DBI-BiLSTM benefit from continuously training on the output of the preceding BiLSTM neurons and from injecting additional input features into the hidden states of the different BiLSTM modules. This makes the proposed DBI-BiLSTM more reliable and stable for MTS forecasting than traditional deep LSTM structures.
A structural diagram of the DI-BiLSTM with three layers is shown in Figure 5, in which each layer is represented by a distinguishing color.
In the experiments of this paper, the adaptive moment estimation (Adam) method [38] is used as the optimizer to update the weights of the DBI-BiLSTM model, with the initial learning rate set to 0.001. For MTS prediction, the loss function of the training and testing procedure is defined as the mean squared error (MSE):

$$\text{MSE} = \frac{1}{n} \sum_{t=1}^{n} (\hat{y}_t - y_t)^2, \quad (27)$$

where $\hat{y}_t$ represents the forecasted signal, $y_t$ denotes the desired output signal, and $n$ denotes the length of $y_t$.
Figure 4 shows that the model begins by using multiple DBN layers to map the inputs to their feature representations. The clustered feature vectors based on GSA are then fed into the DI-BiLSTM layers, which process them in different directions. The outputs of the DI-BiLSTM layers are passed through a fully connected layer, which serves as the regression neurons and uses a rectified linear activation function. A suitable dropout probability (set to 0.2) is applied to the proposed forecasting model to prevent overfitting on the MTS datasets. Table 1 summarizes the proposed DBI-BiLSTM parameters.
The learning mechanism of DBI-BiLSTM is as follows:

Algorithm: the employed DBI-BiLSTM model
Initialization: divide the datasets into training, validation, and testing sets, and normalize them before training;
for each sample in the training and testing data do
    Extract the features X of the dataset with the DBN layer and update the weights of the DBN layer through Eqs (4)-(7);
end for
Calculate the total SI S_Ti of each input feature vector X according to Eqs (20)-(23) and rank the features by SI;
Divide X into major input features (high SI) and minor input features (low SI) based on the ranked SI, and initialize the weights and layers of the BiLSTM;
for t = 1 to n do
    Calculate the gate outputs through Eqs (8)-(13);
    Update the forward and backward hidden-layer states through Eqs (24) and (25);
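The SI-based feature division in the algorithm above can be sketched as a small helper that ranks the DBN output features by total sensitivity and splits them into the major group (fed to the first BiLSTM layer) and the minor group (fed to subsequent layers). The function name and the `major_ratio` knob are illustrative assumptions; the paper ranks by SI but the split proportion is not stated in this excerpt.

```python
import numpy as np

def split_by_sensitivity(X, si, major_ratio=0.5):
    """Split feature matrix X (samples x features) into major (high SI) and
    minor (low SI) feature groups according to the total sensitivity indices si."""
    si = np.asarray(si)
    order = np.argsort(-si)                               # feature indices, descending SI
    n_major = max(1, int(round(len(si) * major_ratio)))   # hypothetical split point
    major = np.sort(order[:n_major])                      # keep original column order
    minor = np.sort(order[n_major:])
    return X[:, major], X[:, minor]
```

The major block then feeds the first BiLSTM module, while each subsequent module receives the previous module's output concatenated with further minor features, which is the chained structure of Figure 4.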

Results
In this section, we simulate four real-world MTS datasets to evaluate the performance of the employed DBI-BiLSTM network: the heat exchangers (HX) system [39], the small-medium-large (SML) system [40], the bike sharing (BS) dataset [41], and the metro interstate traffic volume (MITV) dataset from the UCI machine learning repository. For each MTS simulation experiment, the data were divided into three sections: a training set, a validation set, and a testing set. The number of neurons and layers in the different DBN and BiLSTM modules is jointly determined by the length and dimensionality of the MTS data. Using the validation data, we identified the hyperparameters of the DBI-BiLSTM model, namely the number of DBN layers (Md), the number of neurons per RBM (Nd), the number of BiLSTM layers (Mh), and the number of neurons per BiLSTM recurrent layer (Nh), by grid search or greedy search until the validation MSE of the DBI-BiLSTM was minimized. All experiments were performed on an Intel Core i5-8265U (1.60 GHz CPU with 8 GB RAM). Table 1 summarizes the hyperparameters used in the simulation experiments.
The DBI-BiLSTM model is assessed through the following loss functions: percentage improvement (IM%), normalized mean squared error (NMSE), mean absolute error (MAE), and symmetric mean absolute percentage error (SMAPE). To demonstrate the performance and efficacy of the DBI-BiLSTM model on different MTS, five ablation models were employed: a single-layer LSTM, a shallow (single-layer) BiLSTM, a multi-layer LSTM, a multi-layer BiLSTM, and a DI-BiLSTM. These ablation models were used to further illustrate the DBI-BiLSTM model's superiority. Furthermore, we evaluated and compared the performance of the proposed DBI-BiLSTM model with several classical and state-of-the-art models, including SVR, GBRT, DBN-ANN, Elman [42], gated recurrent units (GRU) [43], ESN, attention-LSTM [44], stacked bidirectional and unidirectional LSTM (SBU-LSTM) [45], EA-LSTM [27], and LSTM-FCN [46]. Here, SVR and GBRT are typical ML models; DBN-ANN and Elman are classical ANN and single-layer RNN models; GRU is a gate-based single-layer RNN similar to LSTM; and ESN is an RNN with a different training algorithm than LSTM. Similar to DBI-BiLSTM, the attention-LSTM, SBU-LSTM, EA-LSTM, and LSTM-FCN models are recently proposed structural improvements to the original LSTM: multi-layer LSTM models that acquire features from the original data by adding feature-representation layers before the multi-layer LSTM layers. Overall, this study employed various models to demonstrate the effectiveness of the proposed DBI-BiLSTM model, which was compared with classical and state-of-the-art models to establish its relative performance.
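Since the metric formulas are garbled in this copy, the following sketch uses the common definitions of these four measures; if the paper's exact variants differ (e.g., in the SMAPE denominator convention), the code should be adjusted accordingly.

```python
import numpy as np

def nmse(y, yhat):
    """Normalized MSE: MSE divided by the variance of the target series."""
    y, yhat = np.asarray(y), np.asarray(yhat)
    return float(np.mean((y - yhat) ** 2) / np.var(y))

def mae(y, yhat):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(yhat))))

def smape(y, yhat):
    """Symmetric mean absolute percentage error, in percent."""
    y, yhat = np.asarray(y), np.asarray(yhat)
    return float(100.0 * np.mean(2.0 * np.abs(yhat - y) / (np.abs(y) + np.abs(yhat))))

def improvement(err_base, err_model):
    """IM%: percentage error reduction relative to the baseline
    (a single-layer BiLSTM in this paper)."""
    return 100.0 * (err_base - err_model) / err_base
```

For example, a model whose NMSE drops from 0.5 (baseline) to 0.4 achieves IM% = 20.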

Parameter settings
In the following experiments, several parameters were identified as significantly impacting the model's performance, including Md, Mh, Nd, and Nh. The parameter values that yielded the minimum validation NMSE were chosen as the final values for the model. Figures 6-9 illustrate the validation NMSE values obtained by varying the Md, Mh, Nd, and Nh parameters for the four MTS tasks, and Table 1 summarizes the final parameter values for the employed DBI-BiLSTM models. To ensure a fair comparison in testing, the parameters of the LSTM-based comparison models, including the number of hidden-layer neurons, the learning method, the activation function, the initial learning rate, and the dropout probability, were set equal to the corresponding values in the DBI-BiLSTM model.

Figure 9. Validation NMSE varying with different N2.

Heat exchangers system
The HX task used in this paper is taken from the literature [39]. HX is a complex nonlinear MTS task involving the effective heat exchange between two streams utilizing their temperature difference. The task presents significant difficulties, including flow turbulence, fluid-flow geometry, and complex thermal behavior. Figure 10 illustrates the whole sequential HX dataset, which comprises 4,000 samples. First, the HX data are normalized to [−1, 1]; 4,000 data steps are taken for modeling and testing the DBI-BiLSTM model, of which the first 2,000 steps are used for training, the next 1,000 steps are used as the validation set for parameter selection, and the last 1,000 steps are used for performance testing of the DBI-BiLSTM. The grid search method is used to select the optimal parameters (Md, Mh, Nd, and Nh), as shown in Table 1. The DBI-BiLSTM model is then constructed based on the parameters in Table 1; the initial learning rate is set to 0.001 and the bias of each BiLSTM is set to 1. Meanwhile, a dropout probability of 0.2 is applied to the BiLSTM layers to ensure that the DBI-BiLSTM does not overfit the time series data.
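The preprocessing steps above (min-max scaling to [−1, 1] followed by a chronological train/validation/test split) can be sketched as follows; the helper names are illustrative assumptions, and the split sizes match the HX setup (2,000/1,000/1,000).

```python
import numpy as np

def normalize_minmax(x, lo=-1.0, hi=1.0):
    """Scale each column of x linearly into [lo, hi]."""
    x = np.asarray(x, dtype=float)
    xmin, xmax = x.min(axis=0), x.max(axis=0)
    return lo + (hi - lo) * (x - xmin) / (xmax - xmin)

def split_series(x, n_train, n_val):
    """Chronological split: the first n_train steps for training, the next
    n_val for validation, and the remainder for testing."""
    return x[:n_train], x[n_train:n_train + n_val], x[n_train + n_val:]
```

Note that scaling the series chronologically before splitting, as described in the text, uses the full-series minimum and maximum; a stricter protocol would fit the scaler on the training portion only.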

Small medium large system
The SML dataset is an open dataset from the UCI machine learning repository collected by a monitoring system mounted in an intelligent house. SML data are sampled at one-minute intervals and smoothed by averaging over 15 minutes (open-source download link: https://archive.ics.uci.edu/dataset/274/sml2010). The dataset includes sensor readings such as weather forecast temperature, relative humidity, lighting, rain, sun dusk, wind speed, sunlight on the west, east, and south facades, sun irradiance, outdoor temperature, outdoor relative humidity, and room temperature, the last of which is the target output for this experiment. Figure 12 illustrates the input and output sequential data of the SML benchmark. In this experiment, the SML data are normalized to [−1, 1]. The total length of the SML dataset used in this paper is 4,137: the first 3,000 items are used for training, the next 537 steps are used as the validation set for parameter selection, and the last 600 steps are used for performance testing of the DBI-BiLSTM. The grid search method is used to select the optimal parameters (Md, Mh, Nd, and Nh), as shown in Table 1. The DBI-BiLSTM model is then constructed based on the parameters in Table 1; the initial learning rate, the bias of each BiLSTM, and the dropout probability are set as in the HX experiment.
To evaluate the effectiveness of the proposed DBI-BiLSTM model, we compare its performance with the several ablation LSTM-based models described previously. The ablation models' parameters are chosen according to the values of DBI-BiLSTM (see Table 1) to ensure a fair comparison. The prediction results acquired by DBI-BiLSTM and a shallow BiLSTM over a selected length of 200 testing signals for the SML benchmark are presented in Figure 13. Table 2 summarizes the NMSE, MAE, SMAPE, and improvement IM% of the shallow BiLSTM, multi-layer LSTM, multi-layer BiLSTM, DI-BiLSTM, and DBI-BiLSTM models for the SML benchmark.

Bike sharing dataset
The BS dataset is an open dataset from the UCI machine learning repository that represents a new generation of traditional bike rental systems (open-source download link: https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset). In this automated system, membership, bike rental, and return are completed entirely through automatic processes. The dataset consists of a two-year usage record (2011-2012) of the Capital Bikeshare system and the corresponding weather and seasonal information. The sensor data include working day, weather, temperature, feeling temperature, humidity, wind speed, and the total number of bicycle rentals per hour; the total number of bicycle rentals per hour is the target value to be predicted in this experiment. Figure 14 displays the sequential trend (input and output) of the BS benchmark. In this experiment, the BS data are normalized to [−1, 1], as described previously. The total length of the BS dataset used in this paper is 17,389, but only 5,000 samples were used for the experiments because of the periodic pattern in the data. The first 3,000 items are used for training, the next 1,000 steps are used as the validation set for parameter selection, and the last 1,000 steps are used for performance testing of the DBI-BiLSTM. The grid search method is used to select the optimal parameters (Md, Mh, Nd, and Nh), as shown in Table 1. The DBI-BiLSTM model is then constructed based on the parameters in Table 1; the initial learning rate, the bias of each BiLSTM, and the dropout probability are set as in the HX and SML experiments.
The performance of the DBI-BiLSTM model was simulated using the same ablation LSTM-based models as described earlier, with the ablation model parameters selected according to the values of the DBI-BiLSTM model (see Table 1). The forecasting performance of the shallow BiLSTM and DBI-BiLSTM models for the BS task over a 200-length testing dataset is shown in Figure 15. The performance of the shallow BiLSTM, multi-layer LSTM, multi-layer BiLSTM, DI-BiLSTM, and DBI-BiLSTM models for the BS task in terms of NMSE, MAE, SMAPE, and IM% is reported in Table 2.

Metro interstate traffic volume dataset
The MITV dataset is an open dataset from the UCI machine learning repository that represents an MTS regression situation in which the employed network aims to forecast continuous variables (open download link: https://archive.ics.uci.edu/dataset/492/metro+interstate+traffic+volume). The dataset includes hourly traffic volume for MN DoT ATR station 301, located approximately halfway between Minneapolis and St. Paul, MN. The sensor data include holidays, temperature, rainfall, snowfall, percentage of cloud cover, weather descriptions, and hourly traffic volume, which serves as the target variable in this experiment. The input and output sequential data for the MITV task are displayed in Figure 16. In this experiment, the MITV data are normalized to [−1, 1], as described previously. The total length of the MITV dataset used in this paper is 48,204, but only 10,000 samples were used for the experiments because of a strong periodic pattern. The first 6,000 items are used for training, the next 2,000 steps are used as the validation set for parameter selection, and the last 2,000 steps are used for performance testing of the DBI-BiLSTM. The grid search method is used to select the optimal parameters (Md, Mh, Nd, and Nh), as shown in Table 1. The DBI-BiLSTM model is then constructed based on the parameters in Table 1; the initial learning rate, the bias of each BiLSTM, and the dropout probability are set as in the HX, SML, and BS experiments.
As in the previous experiments, we simulated the experimental results of the DBI-BiLSTM model using the same ablation LSTM-based models, with the ablation model parameters set to the same values as the DBI-BiLSTM model, as shown in Table 1. Figure 17 displays the prediction performance over a 200-length testing dataset for both the DBI-BiLSTM and shallow BiLSTM models on the MITV task. Table 2 shows the NMSE, MAE, SMAPE, and improvement IM% of the shallow BiLSTM, multi-layer LSTM, multi-layer BiLSTM, DI-BiLSTM, and DBI-BiLSTM models for the MITV task.

Comparison of DBI-BiLSTM and various other MTS models
To analyze and validate the impact of the DBN layers and the deep chained structure on the performance of DBI-BiLSTM, comparative ablation models, including single-layer and multi-layer LSTM-based models, are tested on the selected time series datasets. The forecasting results comparing DBI-BiLSTM and the ablation models are shown in Table 2. The running time (s) reported in Table 2 is the total time of the training and testing process. From Table 2, we can see that the running time of the bidirectional ablation models is significantly longer than that of the unidirectional LSTM models, and the running time of the multi-layer ablation models is longer than that of the single-layer ablation models. The computational complexity of our proposed DBI-BiLSTM model is comparable to that of the multi-layer BiLSTM, which suggests that the DBN module and the stacked chained structure in the DBI-BiLSTM do not significantly increase the computational burden over the multi-layer BiLSTM.

Discussion and statistical analysis
To comprehensively assess the performance of the DBI-BiLSTM model in MTS forecasting, we conducted a comparative test with several fundamental models: conventional SVR, GBRT, DBN-ANN, the traditional Elman RNN, the classical GRU variant of the RNN, and the ESN. Additionally, we compared the DBI-BiLSTM model with several recently proposed LSTM models, including the EA-LSTM model, which can be adjusted by evolutionary computation, the SBU-LSTM model with a deep stacked structure, and the attention-LSTM model. To ensure a fair comparison, all parameters used in the DBN-based and LSTM-based models were set to the same values as those used in the DBI-BiLSTM model (as given in Table 1). The performance comparison between DBI-BiLSTM and the other MTS forecasting models is presented in Table 3; the running time (s) in Table 3 is the total time of the training and testing process. Furthermore, we conducted a 10-fold cross-validation experiment on the four selected MTS datasets and obtained NMSE performance results, shown as a box plot in Figure 18. As can be seen from the running times in Table 3, the computational complexity of the DBI-BiLSTM model is significantly higher than that of the statistical models and the single-layer RNN models. However, the computational complexity of the DBI-BiLSTM model is still acceptable compared with the recently proposed multi-layer LSTM models based on feature learning (attention-LSTM, SBU-LSTM, EA-LSTM, and LSTM-FCN).

Heteroscedasticity, in contrast to homoscedasticity, occurs when the variance of the stochastic error terms of the fitted model is not constant. Specifically, if the variance of the error term changes with the independent variable, we have non-constant variance, i.e., heteroscedasticity. Figure 19 displays the results of the heteroscedasticity test for each MTS dataset. From the change of the residuals with the fitted values in Figure 19, it can be concluded that the residuals are typically spread around 0 and that their variance remains essentially steady as the fitted values increase. Thus, Figure 19 shows little heteroscedasticity among the DBI-BiLSTM models, and homoscedasticity can be maintained.
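One simple way to quantify the visual check described above, not necessarily the paper's exact test, is to bin the residuals by fitted value and compare the per-bin variances: roughly equal variances across bins indicate homoscedasticity, while a trend indicates heteroscedasticity. The function name and bin count are illustrative assumptions.

```python
import numpy as np

def residual_variance_profile(fitted, residuals, n_bins=5):
    """Order the residuals by fitted value, split them into n_bins equal-sized
    bins, and return the residual variance within each bin."""
    fitted = np.asarray(fitted)
    residuals = np.asarray(residuals)
    order = np.argsort(fitted)                       # sort residuals by fitted value
    bins = np.array_split(residuals[order], n_bins)  # equal-sized contiguous bins
    return np.array([np.var(b) for b in bins])
```

A formal alternative would be a Breusch-Pagan or White test on the residuals; the binned profile is merely the numeric counterpart of the residual-vs-fitted plots in Figure 19.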
Given that the feature vectors output by the DBN are typically more complex and of higher dimension, it becomes necessary to incorporate the GSA process, because GSA provides a framework for attributing the uncertainty in the model's output to the various sources of uncertainty in the model's input factors. Figure 20 illustrates the total SI acquired by the GSA algorithm for the four MTS benchmarks. The SI in Figure 20 enables us to identify the DBN outputs with high and low sensitivity to DBI-BiLSTM, which helps us classify the input data of the DI-BiLSTM into major and minor features. To demonstrate the different characteristics of the one-direction prediction model (DBI-LSTM) and the bidirectional forecasting model (DBI-BiLSTM), as well as the differences in weight distribution resulting from two-direction transmission, partial output-weight heat maps of DBI-LSTM (unidirectional) and DBI-BiLSTM (bidirectional forward and bidirectional backward) are plotted. Figure 21 displays these heatmaps, where the red rectangle on the left represents the weight heatmap of DBI-LSTM, and the green boxes represent the forward and backward heatmaps of DBI-BiLSTM. As indicated in Table 2 and Figure 21, bidirectional propagation outperforms one-direction propagation for shallow and multi-layer LSTM structures. Additionally, Figure 21 shows that the weights from the recurrent layer to the output layer of the one-direction LSTM are mostly uniformly distributed within a symmetric interval around 0, indicating that the features in the one-direction LSTM are transmitted in a single direction. Conversely, the forward and backward output weights of the bidirectional LSTM models are clearly distinguishable in Figure 21.

Figure 1. The architecture of a DBN consisting of multiple RBMs.
At each time step, $h_t$ refers to the state of the hidden layer and $o_t$ is the output. The candidate cell state $\tilde{c}_t$ is used to update the memory cell state $c_t$. The hyperbolic tangent function is represented by tanh, the logistic sigmoid activation function is denoted by $\sigma$, and element-wise multiplication is denoted by $\odot$.
Monte Carlo methods can be used to approximate $f_0$, the average value of the model's output. $Q_P^{(i)}$ represents the new matrix obtained by replacing the $i$-th column of matrix $P$ with the $i$-th column of matrix $Q$. To calculate $S_i$ and $S_{Ti}$, the number of model simulations is set to $N(k+2)$. Because of interaction effects, the sum of the $S_{Ti}$ of the $k$ parameters can be greater than 1; therefore, the $S_{Ti}$ values were normalized.

Figure 5. Structure diagram and principle of the DI-BiLSTM with three layers.
    Obtain the actual output Y_t of the DI-BiLSTM through Eq (26);
end for
Calculate the training error MSE through Eq (27);
Update all the weights of the BiLSTM layers by the Adam optimizer;
Repeat until the training MSE converges;
Test the DBI-BiLSTM on the testing datasets.

Figure 6. Validation NMSE varying with the number of BiLSTM hidden neurons.

Figure 7. Validation NMSE varying with the number of BiLSTM layers.

Figure 8. Validation NMSE varying with different N1.

Figure 10. Sequential trend of the HX task.

Figure 11. Fitting and test absolute error of DBI-BiLSTM and shallow BiLSTM for the HX experiment.

Figure 12. Sequential trend of the SML task.

Figure 13. Fitting and test absolute error of DBI-BiLSTM and shallow BiLSTM for the SML benchmark.

Figure 14. Sequential trend of the BS benchmark.

Figure 15. Fitting and test absolute error of DBI-BiLSTM and shallow BiLSTM for the BS benchmark.

Figure 17. Fitting and test absolute error of DBI-BiLSTM and shallow BiLSTM for the MITV task.

Figure 18. Box plot of the test NMSE of DBI-BiLSTM and the other MTS forecasting models under 10-fold cross-validation.

Figure 19. Performance and results of the heteroscedasticity test for the DBI-BiLSTM prediction model.

Figure 21. Heatmap analysis of the partial output weights (the left part is unidirectional).
where $\hat{y}_t$ represents the forecasted signal, $y_t$ denotes the desired output signal, $\sigma^2$ is the variance of $y_t$, and $n$ denotes the length of $y_t$. The IM% value denotes the percentage improvement in performance achieved by the different prediction models relative to a single-layer BiLSTM model. In the following experiments, the mean NMSE, MAE, and SMAPE testing errors are obtained by averaging over ten experimental runs on the MTS datasets.

Table 1. Parameter values for the DBI-BiLSTM models of the different MTS simulations.

Table 2. The testing NMSE, MAE, SMAPE, IM%, and running time of DBI-BiLSTM and the ablation models on the four MTS forecasting benchmarks.

Table 3. The performance comparison of DBI-BiLSTM and various MTS forecasting models.