Progressive Neural Network for Multi-Horizon Time Series Forecasting

In this paper, we introduce ProNet, a novel deep learning approach for multi-horizon time series forecasting that adaptively blends autoregressive (AR) and non-autoregressive (NAR) strategies. Our method divides the forecasting horizon into segments, predicts the most crucial steps in each segment non-autoregressively, and predicts the remaining steps autoregressively. The segmentation relies on latent variables, trained by variational inference, that capture the significance of individual time steps. Compared to AR models, ProNet requires fewer AR iterations, which yields faster prediction and mitigates error accumulation. Compared to NAR models, ProNet takes into account the interdependency of predictions in the output space, which improves forecasting accuracy. A comprehensive evaluation on four large datasets, together with an ablation study, demonstrates the effectiveness of ProNet: it outperforms state-of-the-art AR and NAR forecasting models in both accuracy and prediction speed.


INTRODUCTION
Time series forecasting has had a wide range of applications in industrial domains for decades, including predicting electricity load, renewable energy generation, stock prices, traffic flow and air quality [1]. Many methods have been developed for this task, and they can be classified into two broad categories. In the early years, statistical models such as Auto-Regressive Integrated Moving Average (ARIMA) and State Space Models (SSM) [2] were widely used by industry forecasters. However, they fit each time series independently and are not able to infer shared patterns from related time series [3]. On the other hand, machine learning methods have been developed for modelling the non-linearity in time series data. Early methods include random forest [4], Support Vector Machine (SVM) [5] and Bayesian methods [6].
Moreover, recent research has widely acknowledged the effectiveness of time series decomposition and ensemble learning methods in refining forecasting models [7], [8], [6], [9]. Ensemble learning methods have gained recognition for their ability to combine individual models and enhance overall predictive performance while minimizing overfitting. Du et al. [6] developed an ensemble strategy that draws on a highly diverse pool of statistical, machine learning and deep learning methods and assigns time-varying weights to the candidate models with Bayesian optimization, in order to avoid the shortcomings of model choice and alleviate the risk of underfitting. Similarly, Gao et al. [10] introduced an online dynamic ensemble of deep random vector functional link with three stages for improved performance.
Decomposition-based methods have also shown promise in time series forecasting by breaking down the data into underlying components, leading to more accurate and manageable predictions. Different decomposition approaches, such as classical decomposition, moving averages, and state space models, have been explored. For instance, Li et al. [11] proposed a convolutional neural network ensemble method that leverages decomposed time series and batch normalization layers to reduce subject variability. Wang et al. [12] proposed a fuzzy cognitive map to produce interpretable results by forecasting the decomposed components: trend, fluctuation range, and trend persistence. Lin et al. [13] developed SSDNet, employing the Transformer architecture to estimate state space model parameters and provide time series decomposition components: trend and seasonality. Tong et al. [14], [15] introduced the Probabilistic Decomposition Transformer with hierarchical mechanisms to mitigate cumulative errors and a conditional generative approach for time series decomposition. Furthermore, Wang et al. [9] introduced the ternary interval decomposition ensemble learning method, addressing limitations of point and interval forecasting models. The combination of machine learning models, time series decomposition, and ensemble learning has shown great promise for advancing forecasting performance. Notably, the philosophy of decomposition and ensemble can be integrated with most major machine learning models, further enhancing their effectiveness in various applications.
AR forecasting models suffer from slow inference and error accumulation because they use a recursive method that feeds previously predicted values back into the model to make future forecasts. AR models are usually trained with the teacher-forcing mechanism, which feeds the ground truth in place of previous predictions during training. This causes a discrepancy between training and prediction, and can lead to unsatisfactory accuracy for long forecasting horizons [27], [25].
In contrast, NAR forecasting models overcome the aforementioned problems since they generate all predictions within the forecasting horizon simultaneously. However, NAR models ignore interdependencies in the output space, and this assumption violates the real data distribution for sequence generation tasks [28], [29]. This may result in unrelated forecasts over the prediction horizon and accuracy degradation [30], [25]. Empirically, AR methods were found to be better for shorter horizons but were outperformed by NAR methods for longer horizons due to error accumulation [30]. Thus, AR and NAR models have complementary strengths and limitations for multi-horizon forecasting, which stem from their prediction strategies. Recently, NAR models have been proposed specifically for translation tasks that alleviate the accuracy degradation by reducing dependencies in the output space and thereby reduce the difficulty of training [28], [31], [32], [29]. However, such studies are scarce for time series forecasting tasks.
A balance must be struck between AR and NAR forecasting models to tackle the error accumulation and slow inference of AR models, alongside the inability of NAR models to adequately capture interdependencies within the output space. Recent work has shown the advantages of incorporating dependency and positional information within the prediction horizon across a range of sequence modeling tasks. For instance, Ran et al. [33] integrated future predictions to overcome the multi-modality problem in neural machine translation. Similarly, Fei [34] and Zhou et al. [35] combined information from future time steps to generate past predictions in the context of caption generation. Furthermore, Han et al. [36] introduced a diffusion-based language model with bidirectional context updates.
To address these challenges and capitalize on the strengths of both AR and NAR modeling, we introduce Progressive Neural Network (ProNet), a novel deep learning approach designed for time series forecasting. ProNet navigates the AR-NAR trade-off, leveraging their respective strengths to mitigate error accumulation and slow prediction while effectively modeling dependencies within the target sequence. Specifically, ProNet adopts a partially AR prediction strategy by segmenting the forecasting horizon. It predicts a subset of steps within each segment using a non-autoregressive approach, while maintaining an autoregressive decoding process for the remaining steps.
Fig. 1 illustrates the decoding mechanisms of AR, ProNet's partially AR, and NAR models. For example, whereas the AR decoder treats step t + 4 as dependent on steps t to t + 3, the NAR decoder assumes no dependency. In contrast, ProNet's partially AR decoder takes into account dependencies on past steps t, t + 1, t + 3, as well as on future step t + 5. The starting positions of the horizon segments are determined by latent variables, which are trained through variational inference to capture the significance of each step. Consequently, in comparison to AR models, ProNet's predictions require fewer iterations, which mitigates error accumulation and speeds up inference. Moreover, compared to NAR models, ProNet captures dependencies within the target space.
The main contributions of our work are as follows:
1) We propose ProNet, a partially AR time series forecasting approach that generates predictions of multiple steps in parallel to leverage the strengths of AR and NAR models. ProNet assumes an alternative dependency in the target space and incorporates information from the further future to generate forecasts.
2) We evaluate the performance of ProNet on four time series forecasting tasks and show the advantages of our model over most state-of-the-art AR and NAR methods, with fast and accurate forecasts. An ablation study confirms the effectiveness of the proposed horizon dividing strategy.

RELATED WORK
Recent advancements in forecasting methodologies have led to the emergence of NAR forecasting models [23], [24], [25], [26]. These models seek to address the limitations of AR models by eschewing the use of previously generated predictions and instead making all forecasts in a single step. However, the effectiveness of NAR forecasting models is hindered by their assumption of non-interdependency within the target space. This assumption arises from the removal of AR connections from the decoder side, leading to the estimation of separate conditional distributions for each prediction independently [28], [31], [32], [29]. While both AR and NAR models have proven successful in forecasting applications, AR methods tend to excel for shorter horizons, while NAR methods outperform AR for longer horizons due to error accumulation [30]. Unlike AR models, NAR models offer the advantage of parallelizable training and inference processes. However, their output may present challenges due to the potential generation of unrelated forecasts across the forecast horizon. This phenomenon could lead to discontinuous and unrealistic forecasts [30], as the incorrect assumption of independence prevents NAR models from effectively capturing interdependencies between predictions.
Several studies [28], [31], [32], [29] have sought to enhance NAR models, although most of these efforts have focused on Neural Machine Translation (NMT) tasks. Gu et al. [28] introduced the NAR Transformer model, which reduces output dependencies by incorporating fertilities and leveraging sequence-level knowledge distillation techniques [37], [38]. Recent developments have seen the adaptation of NAR models for translation tasks, mitigating accuracy degradation by tackling output space dependencies. This approach aims to capture and manage dependencies, thereby alleviating training challenges [28], [31], [32]. Notably, knowledge distillation [37], [38] has emerged as a highly effective technique to enhance NAR model performance.
The trade-off between AR and NAR [39], [33], [34], [35] has been a subject of exploration, particularly in the context of NMT and other sentence generation tasks. Notable instances include the works of [39], [35], which retain AR properties while enabling parallel prediction of multiple successive words. Similarly, [33], [34] employ a strategy that generates translation segments concurrently, each being generated autoregressively. However, prior approaches have relied on dividing the target sequence into evenly distributed segments, assuming fixed dependencies among time steps. This assumption, while applicable in some contexts, proves unsuitable for time series forecasting due to the dynamic and evolving nature of real-world time series data. For instance, in Fig. 2, we visualize the partial correlation of two distinct days (comprising 20 steps each) from the Sanyo dataset. Evidently, the two plots exhibit varying dependency patterns, signifying that the most influential time steps differ between the two cases. Additionally, it becomes apparent that future steps can exert substantial influence on preceding steps. Take Fig. 2(a) as an example, where step 5 exhibits strong partial correlation with steps 17 and 18. This correlation suggests that incorporating information from steps 17 and 18 while predicting step 5 could be highly beneficial.
In this work, we present ProNet, which navigates the balance between AR and NAR models. We extend previous work with several key enhancements: 1) assuming a non-fixed dependency pattern, identifying via latent factors the time steps that need to be predicted first, and then predicting further groups of steps autoregressively; 2) assuming an alternative time-varying dependency and incorporating future information into forecasting; 3) introducing a carefully designed masking mechanism to train the model non-autoregressively.

PROBLEM FORMULATION
Given are: 1) a set of $N$ univariate time series (solar or electricity series) $\{Y_{i,1:T_l}\}_{i=1}^{N}$, where $Y_{i,1:T_l} = [y_{i,1}, y_{i,2}, \ldots, y_{i,T_l}]$, $T_l$ is the input sequence length, and $y_{i,t} \in \Re$ is the value of the $i$-th time series at time $t$; 2) a set of associated time-based multi-dimensional covariate vectors $\{X_{i,1:T_l+T_h}\}_{i=1}^{N}$, where $T_h$ is the forecasting horizon length and $T_l + T_h = T$. Our goal is to predict the future values of the time series $\{Y_{i,T_l+1:T_l+T_h}\}_{i=1}^{N}$, i.e. the PV power or electricity usage for the next $T_h$ time steps after $T_l$.
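AR forecasting models produce the conditional probability of the future values:

$$p_\theta(Y_{i,T_l+1:T_l+T_h} \mid Y_{i,1:T_l}, X_{i,1:T_l+T_h}) = \prod_{t=T_l+1}^{T_l+T_h} p_\theta(y_{i,t} \mid Y_{i,1:t-1}, X_{i,1:t}),$$

where the input of the model at step $t$ is the concatenation of $y_{i,t-1}$ and $x_{i,t}$, and $\theta$ denotes the model parameters.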
For NAR forecasting models, the conditional probability can be modelled as:

$$p_\theta(Y_{i,T_l+1:T_l+T_h} \mid Y_{i,1:T_l}, X_{i,1:T_l+T_h}) = \prod_{t=T_l+1}^{T_l+T_h} p_\theta(y_{i,t} \mid Y_{i,1:T_l}, X_{i,1:T_l+T_h}).$$

Table I presents a comparison of the information available for predicting step $t + 1$ using AR and NAR forecasting methods. Both AR and NAR methods have access to covariates and ground truth from the past. However, there is a distinction in the scope of information they can utilize. The AR method can only make use of covariates and ground truth up to time step $t$, whereas NAR methods can utilize all covariates within the forecasting horizon but do not have access to ground truth. ProNet likewise produces the conditional probability distribution of the future values given the past history, $p_\theta(Y_{i,T_l+1:T_l+T_h} \mid Y_{i,1:T_l}, X_{i,1:T_l+T_h})$, where the input of the model at step $t$ is the concatenation of $y_{i,t-1}$ and $x_{i,t}$, and $\theta$ denotes the model parameters.
The models are applicable to all time series, so the subscript i will be omitted in the rest of the paper for simplicity.

PROGRESSIVE NEURAL NETWORK
In this section, we first present the architecture of ProNet and then explain its details in four sections: 1) the partially AR forecasting mechanism, which overcomes the limitations of AR and NAR decoders, 2) progressive forecasting, which corrects inaccurate predictions made at early stages, 3) the progressive mask, which implements the previous two mechanisms for the Transformer model, and 4) variational inference, which generates the latent variables with dependency information that serve the partially AR forecasting mechanism.

Architecture
During each training iteration, the feedforward process unfolds through four stages:
1) Encoder for Pattern Extraction: The encoder extracts patterns from the preceding time steps and provides them to all decoders.
2) Significance Assessment by Posterior Model: The posterior model $q_\phi(z \mid y, x)$ takes both ground truth and covariates as input and assesses the significance of the time steps within the forecasting horizon. This assessment identifies the pivotal steps, which are subsequently used to segment the forecasting horizon.
3) Significance Assessment by Prior Model: A separate prior model $p_\theta(z \mid x)$ uses only the covariates to predict the importance of the time steps within the horizon. The prior model is trained so that its outputs closely approximate those of the posterior model.
4) Decoding and Forecast Generation: The decoder $p(y \mid x, z)$ uses the ground truth, the covariates, and the output of the posterior model $q_\phi(z \mid y, x)$ to divide the forecasting horizon into segments and generate the forecasts.
During the inference phase, the posterior model is omitted, and the prior model seamlessly takes on its role, facilitating accurate predictions.Notably, in the absence of ground truth during prediction, the decoder employs past predictions to generate forecasts.
As the architectural backbone of ProNet, we adopt the Informer architecture [26]; however, it is pertinent to highlight that alternative Transformer-based architectures can be seamlessly integrated into the ProNet framework.Impressively, ProNet's efficacy remains pronounced when employing a vanilla Transformer as its architectural backbone.
In summary, the prior and posterior models are trained with variational inference to identify the pivotal steps for the decoder. The decoder uses progressive masks to realize the partially AR and progressive forecasting strategies. The implementation details of these components are described in the following sections.

Partial Autoregressive Forecasting
Our ProNet makes predictions by combining AR and NAR decoding mechanisms. To predict efficiently, we introduce a multi-step prediction strategy organized into segments. Specifically, we divide the forecasting horizon into $n_g$ segments and first predict the starting positions of all segments, denoted by $S_1 = [s_1, s_2, \ldots, s_{n_g}]$, in parallel. Subsequently, we employ an autoregressive approach to forecast the next step of each segment, $S_2 = [s_1 + 1, s_2 + 1, \ldots, s_{n_g} + 1]$, again in parallel. This process continues iteratively until all steps within the horizon are generated. The initial position of the first segment is set to the first step ($s_1 = 1$). The length of each segment is the difference between the starting positions of two consecutive segments, $T_i = s_{i+1} - s_i$. In line with NAR forecasting models, we set the initial input of the decoder ($y$) for the first predictions to 0, since prior predictions have not yet been established. To predict all steps within the horizon, ProNet performs at most $n_{step} = \max(T_{1:n_g})$ AR iterations, where $n_{step}$ is the maximum segment length. Each iteration extends the predictions while conditioning on the relevant historical context.
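The segment-wise decoding order described above can be illustrated with a minimal sketch. The function name `partial_ar_schedule` and the convention that the last segment runs to the end of the horizon are assumptions made for illustration; the sketch only enumerates which steps are predicted in parallel at each iteration and does not include the re-forecasting of already predicted steps.

```python
def partial_ar_schedule(starts, horizon):
    """Return, for each AR iteration, the 1-indexed steps predicted in parallel.

    starts  -- segment starting positions S_1 (hypothetical input format)
    horizon -- forecasting horizon length T_h
    """
    starts = sorted(starts)
    # Assumption: a segment ends right before the next one starts,
    # and the last segment ends at the horizon.
    ends = [s - 1 for s in starts[1:]] + [horizon]
    n_step = max(e - s + 1 for s, e in zip(starts, ends))  # longest segment
    schedule = []
    for k in range(n_step):
        # Iteration k predicts the k-th step of every segment that still has one.
        schedule.append([s + k for s, e in zip(starts, ends) if s + k <= e])
    return schedule


schedule = partial_ar_schedule([1, 3, 5], horizon=7)
print(schedule)       # [[1, 3, 5], [2, 4, 6], [7]]
print(len(schedule))  # 3
```

With a single segment the schedule degenerates to a fully AR decoder (one step per iteration), and with one segment per step it degenerates to a NAR decoder (all steps in one iteration).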
Unlike traditional AR and NAR models, our method formulates the probability distribution over per-segment predictions, where $y^{j}_{i,t}$ is the prediction at the $t$-th step of the $j$-th segment and $Y^{j}_{i,T_l+1:T_l+t}$ denotes the prediction history up to step $t$ of the $j$-th segment.

Progressive Prediction
In ProNet, the forecasting horizon is divided into segments of varying lengths.However, the number of AR steps is determined by the maximum segment length, leading to situations where certain segments complete their predictions before the AR iteration concludes.To capitalize on the additional dependency information available, completed segments are tasked with re-forecasting steps within their subsequent segments that have already been predicted.This progressive prediction strategy acknowledges that early steps in each segment may involve limited or no dependency information and therefore benefit from iterative refinement as more context becomes available.

Progressive Mask
The architecture of the AR Transformer decoder [40] employs a lower triangular attention mask to prevent future information leakage.Conversely, NAR Transformer decoders (e.g., Informer [26]) use unmasked attention.However, these standard masking mechanisms are inadequate for ProNet, as it operates with a partially autoregressive framework that integrates future information for predictions.In response, we introduce a progressive masking mechanism to facilitate access to the first t steps of all segments during the t-th step prediction.
Given the sample size $N$, the forecasting horizon length $T_h$ and the number of segments $n_g$, the progressive mask $M$ is created by Algorithm 1. Initially, we take the indices of the top $n_g$ elements of the latent variable $z$, which encodes the importance of steps for forecasting, and store them as ind, which is also the starting position $S_1$. Then we set the elements of the zero vector row located at ind to one. We iterate from 1 to the maximum AR step $n_{step}$ to create the mask $M$: firstly, we set the rows of $M$ located at ind to the variable row; secondly, we increment all elements of ind by one and limit their values by the upper bound of the forecasting horizon $T_h$, as shown in lines 5 and 6 respectively; thirdly, we set the elements of row located at ind to one.
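The steps described above can be sketched in plain Python for a single sample (the batch dimension $N$ is dropped). This is a literal reading of the description, not the authors' implementation: the function name `progressive_mask`, the use of given starting positions instead of a top-$n_g$ selection over $z$, and the interpretation that entry (i, j) equal to 1 lets step i attend to position j are all assumptions. The horizon of 5 and the starting positions [1, 3] are illustrative values only.

```python
def progressive_mask(starts, horizon):
    """Build a horizon x horizon 0/1 mask from 1-indexed segment starts."""
    starts = sorted(starts)
    bounds = starts[1:] + [horizon + 1]
    n_step = max(b - s for s, b in zip(starts, bounds))  # longest segment

    mask = [[0] * horizon for _ in range(horizon)]
    row = [0] * horizon
    ind = list(starts)
    for i in ind:                        # mark the starting positions as visible
        row[i - 1] = 1
    for _ in range(n_step):
        for i in ind:                    # rows being predicted see the current row
            mask[i - 1] = list(row)
        # advance every segment by one step, clipped to the horizon
        ind = [min(i + 1, horizon) for i in ind]
        for i in ind:                    # newly predicted steps become visible
            row[i - 1] = 1
    return mask


for r in progressive_mask([1, 3], horizon=5):
    print(r)
# [1, 0, 1, 0, 0]
# [1, 1, 1, 1, 0]
# [1, 1, 1, 1, 1]
# [1, 1, 1, 1, 0]
# [1, 1, 1, 1, 1]
```

In this toy case the first segment finishes after two iterations, so in the third iteration its index moves on to step 3, whose row is overwritten with the fully visible row. This corresponds to re-forecasting an already predicted step with more context.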
For instance, Fig. 4 illustrates how the elements change in Algorithm 1 from initialization to the final settings. We first initialize the mask $M$ as a 7 × 7 zero matrix. For the first iteration, the starting position or the index is ind = $S_1$ = [1, 3, 5], which means ProNet predicts the 1st, 3rd and 5th steps simultaneously. Then, we update the temporary variable row → [1 0 1 0 1 0 0] (line 2 of Algorithm 1) and use it to fill the 1st, 3rd and 5th rows of $M$ (line 4 of Algorithm 1), as shown in the upper right of Fig. 4. Afterwards, we increment the elements of ind by one, giving ind → [2, 4, 6], and update the temporary variable row → [1 1 1 1 1 1 0]. The second iteration proceeds like the first, while the final iteration implements progressive prediction: we now have the variable row → [1 1 1 1 1 1 1] and index ind = [3, 6, 7]. We fill the 3rd, 6th and 7th rows of $M$ with row, which means we use all previous predictions to forecast the 7th step and re-forecast the 3rd and 6th steps.

Variational Inference
The ProNet algorithm addresses the challenge of segmenting sequences and prioritizing forecasted steps to achieve optimal performance. It is crucial to initiate forecasting with steps that carry the most significance and intricate dependencies on subsequent time points. However, obtaining this vital information is not straightforward. Drawing inspiration from the methodology introduced in [41], we tackle this issue by forecasting the importance of all steps in parallel and representing it as latent variables denoted as $z$. These latent variables are derived through conditional variational inference, an approach rooted in conditional Variational Autoencoders (cVAEs) [42]. These cVAEs bridge the gap between observed and latent variables, facilitating a deeper understanding of data patterns.

Algorithm 1 Creation of Progressive Mask
Input: progressive mask $M \in \Re^{N \times T_h \times T_h} = 0$, latent factor $z \in \Re^{N \times T_h}$, temporary variable $row \in \Re^{T_h} = 0$.
Output: progressive mask $M$.
1: ind = indices of the top $n_g$ elements of z;
2: row[ind] = 1;
3: for i = 1 to $n_{step}$ do
4: M[ind] = row;
5: ind = ind + 1;
6: ind = min(ind, $T_h$);
7: row[ind] = 1;
8: end for
The concept of cVAEs extends the classical Variational Autoencoder (VAE) framework [43], enhancing it by integrating conditioning variables into the data generation process.This augmentation empowers cVAEs to learn a more nuanced and contextually aware latent space representation of data.In a standard VAE, data is mapped to a lower-dimensional latent space using an encoder, and subsequently, a decoder reconstructs this data from points in the latent space.cVAEs introduce conditional variables that encode additional context or prior knowledge into the generative model.This enables cVAEs not only to learn conditional latent representations but also to incorporate provided contextual cues effectively.Particularly, cVAEs are advantageous in scenarios where supplementary information is available, mirroring the case of ProNet, which requires generating initial time steps for predictions based on past ground truth and covariates.
In the context of ProNet, the latent variables $z$ correspond to individual output steps and depend on the entire temporal sequence. Consequently, the conditional probability is articulated as:

$$p_\theta(y \mid x) = \int p_\theta(y \mid z, x)\, p_\theta(z \mid x)\, dz.$$

Here $y$ denotes the ground truth in the forecasting horizon, and the conditioning variable $x$ comprises the historical data and covariates, allowing the model to capture the relevance of different time steps as the latent variable $z$. However, direct optimization of this objective is infeasible. Instead, the Evidence Lower Bound (ELBO) [42] is employed as the optimization target, resulting in the following formulation:

$$\log p_\theta(y \mid x) \geq \mathbb{E}_{q_\phi(z \mid y, x)}\left[\log p_\theta(y \mid z, x)\right] - \mathrm{KL}\left(q_\phi(z \mid y, x) \,\|\, p_\theta(z \mid x)\right).$$

Here, KL denotes the Kullback-Leibler (KL) divergence, $p_\theta(z \mid x)$ is the prior distribution, $q_\phi(z \mid y, x)$ is an approximate posterior, and $p_\theta(y \mid z, x)$ is the decoder. Given the ground truth $y$ within the horizon and the condition $x$, $q_\phi(z \mid y, x)$ models the significance of the time steps, represented by $z$. During prediction, $y$ is not available, so $p_\theta(z \mid x)$ must be trained to approximate $q_\phi(z \mid y, x)$, which is achieved by minimizing the KL divergence.
Both the prior and the approximate posterior are modelled as Gaussian distributions characterized by their mean and standard deviation. The mean $\mu$ is obtained via a linear layer, while the standard deviation $\sigma$ is derived through another linear layer followed by a SoftPlus activation function, which keeps it positive. To allow gradients to flow through the random nodes, the reparameterization trick [42] is used. The latent variable is sampled as $z = g(\epsilon, \mu, \sigma) = \mu + \sigma\epsilon$, where $\epsilon$ follows a standard normal distribution $\mathcal{N}(0, 1)$ and serves as white noise. The value of $z$ encodes the significance of each time step within the forecasting horizon and guides the selection of the steps from which predictions are initiated: the $n_g$ indices with the highest $z$ values are chosen to initiate forecasting.
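The sampling and selection step can be sketched with the standard library alone. In this sketch the values of `mu` and `raw_sigma` stand in for the outputs of the two linear layers and are hypothetical; the function names are likewise chosen for illustration. The raw scale is set to a large negative number so that SoftPlus yields a near-zero $\sigma$ and the sampled $z$ stays close to `mu`, which makes the selected indices easy to follow.

```python
import math
import random


def softplus(a):
    """Numerically stable SoftPlus, log(1 + exp(a)), always positive."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def sample_importance(mu, raw_sigma, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + softplus(r) * rng.gauss(0.0, 1.0) for m, r in zip(mu, raw_sigma)]


def top_k_steps(z, k):
    """1-indexed positions of the k largest entries of z, in ascending order."""
    order = sorted(range(len(z)), key=lambda i: z[i], reverse=True)
    return sorted(i + 1 for i in order[:k])


rng = random.Random(0)
mu = [0.9, 0.1, 0.7, 0.0, 0.8, 0.2, 0.1]  # hypothetical mean-layer output
raw_sigma = [-20.0] * 7                    # softplus(-20) is about 2e-9
z = sample_importance(mu, raw_sigma, rng)
print(top_k_steps(z, k=3))                 # [1, 3, 5]
```

Because the noise enters only through the term `sigma * eps`, gradients with respect to the mean and scale remain well defined in an automatic differentiation framework, which is the purpose of the reparameterization.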
During the training process, $z$ is sampled from $q_\phi(z \mid y, x)$, and the prior $p_\theta(z \mid x)$ is trained to approximate the posterior $q_\phi(z \mid y, x)$. This framework enables ProNet to identify and leverage the most crucial time steps for forecasting.
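Since both distributions are Gaussian, the KL term between posterior and prior has a closed form. The following sketch assumes factorized (diagonal) Gaussians parameterized by mean and standard deviation per time step, and sums the divergence over steps; the function name and the example numbers are illustrative.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for factorized Gaussians, summed over dimensions.

    Per dimension: log(sigma_p / sigma_q)
                   + (sigma_q^2 + (mu_q - mu_p)^2) / (2 * sigma_p^2) - 1/2
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


# Identical distributions have zero divergence.
print(kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0]))  # 0.0
# Shifting the posterior mean by one unit (unit variances) costs 0.5 nats.
print(kl_diag_gaussians([1.0], [1.0], [0.0], [1.0]))                      # 0.5
```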
Empirically, we find that both the prior and posterior models often assign elevated importance to a run of consecutive steps, which substantially reduces decoding speed during testing. To balance accuracy and speed, we re-weight the latent factor $z$ with a scaling factor given by a weight vector $W \in \Re^{T_h - 1}$. This re-weighting modifies $z$ to achieve a better trade-off between forecasting accuracy and computational speed. Subsequently, we determine the starting positions $S_1$: the first step is always included, and the indices of the largest $n_g - 1$ elements of $z[2:]$ give the remaining starting positions. For example, Fig. 5 shows the latent variable $z$ before and after the re-weighting process. With $n_g = 3$, the original $z$ yields the starting position $S_1 = [1, 5, 6]$, which requires 4 AR iterations to complete the forecasting process. In contrast, the re-weighted $z$ yields $S_1 = [1, 3, 6]$, which requires only 3 AR iterations. In this scenario, the re-weighting reduces the number of AR iterations by 25%, improving computational efficiency while preserving forecast accuracy.
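The iteration counts in this example can be checked with a short calculation. The horizon length is not stated for this example, so the sketch assumes $T_h = 7$, which is one value consistent with the reported counts; the function name `ar_iterations` is illustrative.

```python
def ar_iterations(starts, horizon):
    """Number of AR iterations = length of the longest segment."""
    starts = sorted(starts)
    # Assumption: the last segment runs to the end of the horizon.
    bounds = starts[1:] + [horizon + 1]
    return max(b - s for s, b in zip(starts, bounds))


horizon = 7                                   # assumed, not given in the text
before = ar_iterations([1, 5, 6], horizon)    # segments of length 4, 1, 2
after = ar_iterations([1, 3, 6], horizon)     # segments of length 2, 3, 2
print(before, after, 1 - after / before)      # 4 3 0.25
```

Starting positions that are spread more evenly across the horizon shorten the longest segment, which is what bounds the number of sequential decoder calls.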

Data Sets
We conducted experiments using publicly available time series datasets, namely Sanyo [44], Hanergy [45], Solar [46], and Electricity [47]. Specifically, the datasets consist of:
Sanyo and Hanergy: These datasets contain solar power generation data obtained from two distinct Australian PV plants, covering periods of 6 and 7 years, respectively. We focused our analysis on the time range between 7 am and 5 pm, aggregating the data at half-hourly intervals. In addition to the power generation information, we incorporated covariate time series data related to weather conditions and weather forecasts. Detailed descriptions of the data collection process can be found in [48]. For these datasets, we incorporated calendar features, specifically month, hour-of-the-day, and minute-of-the-hour.
Solar: This dataset comprises solar power data originating from 137 PV plants across the United States.It covers an 8-month span, and the power data is aggregated at hourly intervals.Similarly to the previous datasets, calendar features are integrated, including month, hour-of-the-day, and age.
Electricity: This dataset involves electricity consumption data gathered from 370 households over a duration of approximately 4 years.The electricity consumption data is aggregated into 1-hour intervals.For this dataset, we incorporated calendar features, including month, day-of-the-week, hour-of-the-day, and age.
Following prior research [3], [48], all datasets were preprocessed by normalizing the data to have zero mean and unit variance.In Table II, we provide an overview of the statistics associated with each dataset.
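The normalization step can be sketched as follows. The sketch assumes that the mean and standard deviation are computed on the training portion and reused for the other splits and for mapping forecasts back to the original scale; the text does not state which portion the statistics come from, and the function names and numbers are illustrative.

```python
from statistics import fmean, pstdev


def standardize(series):
    """Scale a series to zero mean and unit variance; return the statistics too."""
    mean = fmean(series)
    std = pstdev(series)  # population standard deviation
    return [(v - mean) / std for v in series], mean, std


def destandardize(values, mean, std):
    """Map normalized values (e.g. forecasts) back to the original scale."""
    return [v * std + mean for v in values]


train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy series, mean 5, std 2
scaled, mean, std = standardize(train)
print(mean, std)   # 5.0 2.0
print(scaled[0])   # -1.5
```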

Experimental Details
We compare the performance of ProNet with seven methods: five state-of-the-art deep learning models (DeepAR, DeepSSM, LogSparse Transformer, N-BEATS and Informer), a statistical model (SARIMAX) and a persistence model:
• Persistence is a typical baseline in forecasting and considers the time series of the previous day as the prediction for the next day.
• SARIMAX [49] is an extension of ARIMA that can handle seasonality with exogenous factors.
• DeepAR [16] is a widely used sequence-to-sequence probabilistic forecasting model.
• DeepSSM [17] fuses SSM with RNNs to incorporate structural assumptions and learn complex patterns from the time series. It is the state-of-the-art deep forecasting model that employs SSM.
• N-BEATS [24] consists of blocks of fully-connected neural networks, organised into stacks using residual links. We introduced covariates at the input of each block to facilitate multivariate series forecasting.
• LogSparse Transformer [3] is a recently proposed variation of the Transformer architecture for time series forecasting with convolutional attention and sparse attention; it is denoted as "LogTrans" in Table IV.
• Informer [26] is a Transformer-based forecasting model based on ProbSparse self-attention and self-attention distilling. We modified it for probabilistic forecasts to generate the mean value and variance.
Note that Persistence, N-BEATS and Informer are NAR models while the others are AR models.
All models were implemented using PyTorch 1.6 and evaluated on a Tesla V100 16GB GPU. The deep learning models were optimised by mini-batch gradient descent with the Adam optimiser for a maximum of 200 epochs.
In line with the experimental setup from [48] and [3], we carefully partitioned the data to prevent future leakage during our evaluations.Specifically, for Sanyo and Hanergy datasets, we designated the data from the last year as the test set, the second last year as the validation set for early stopping, and the remaining data (5 years for Sanyo and 4 years for Hanergy) as the training set.For the Solar and Electricity datasets, we utilized the data from the last week (starting from 25/08/2006 for Solar and 01/09/2014 for Electricity) as the test set, and the week preceding it as the validation set.To ensure consistency, the data preceding the validation set was further divided into three subsets, and the corresponding validation set was employed to select the best hyperparameters.Throughout the process, our hyperparameter selection was based on achieving the minimum loss on the validation set, enabling us to finetune the model for optimal performance.
We used Bayesian optimization for the hyperparameter search of all deep learning models, with a maximum of 20 iterations. The models used for comparison were tuned based on the recommendations in the original papers. We selected the hyperparameters with the minimum loss on the validation set. Probabilistic forecasting models use the negative log-likelihood (NLL) loss, while the point forecasting model (N-BEATS) uses the mean squared error loss.
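As a rough illustration of the two training objectives, the sketch below computes an NLL and a mean squared error in plain Python. It assumes a Gaussian output distribution parameterised by a mean and a variance; the text only states that the probabilistic models output a mean and a variance and are trained with NLL, so the Gaussian form and the function names are assumptions.

```python
import math


def gaussian_nll(y, mean, var):
    """Negative log-likelihood of y under N(mean, var) (Gaussian assumed)."""
    if var <= 0:
        raise ValueError("variance must be positive")
    return 0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)


def mean_gaussian_nll(targets, means, variances):
    """Average Gaussian NLL over all forecast steps."""
    total = sum(gaussian_nll(y, m, v) for y, m, v in zip(targets, means, variances))
    return total / len(targets)


def mse(targets, preds):
    """Mean squared error, as used for a point forecasting model."""
    return sum((y - p) ** 2 for y, p in zip(targets, preds)) / len(targets)
```

Unlike the squared error, the NLL also penalises a poorly calibrated variance: an overconfident (too small) variance inflates the loss whenever the mean is off.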
For the Transformer-based models, we used learnable position and ID (for Solar and Electricity sets) embeddings. Following [26], [50], we restrict the number of decoder layers to be less than the number of encoder layers for fast decoding. The best hyperparameters selected for ProNet are listed in Table III and were used for the evaluation on the test set.

Accuracy Analysis
The performance of our proposed ProNet model, along with several benchmark methods, is summarized in Table IV. This table presents the ρ0.5 and ρ0.9 loss metrics for all models. Notably, since N-BEATS and Persistence generate point forecasts, we report only the ρ0.5 loss for these models.
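For readers unfamiliar with the metric, the sketch below shows one common definition of the ρ-loss: the normalized quantile loss used in the DeepAR and LogSparse Transformer literature. This passage does not restate the formula, so the definition and the function name `rho_loss` are assumptions for illustration.

```python
def rho_loss(targets, preds, rho):
    """Normalized rho-quantile loss (assumed definition).

    R_rho = 2 * sum_t (rho - 1[y_t <= yhat_t]) * (y_t - yhat_t) / sum_t |y_t|
    where yhat_t is the predicted rho-quantile at step t.
    """
    denom = sum(abs(y) for y in targets)
    if denom == 0:
        raise ValueError("targets must not sum to zero in absolute value")
    num = 0.0
    for y, p in zip(targets, preds):
        indicator = 1.0 if y <= p else 0.0
        num += (rho - indicator) * (y - p)
    return 2.0 * num / denom
```

With ρ = 0.5 the loss weights over- and under-prediction equally, which is why it can also be reported for point forecasts; with ρ = 0.9 under-predicting the target is penalised nine times more heavily than over-predicting it.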
ProNet is the most accurate method: it outperforms the other methods on all data sets except for ρ0.9 on Solar and ρ0.5 on Electricity, where LogSparse Transformer performs better. A possible explanation is that the ProNet backbone, Informer, has subpar performance in these two cases. As a NAR forecasting model, Informer ignores dependency in the target space, while ProNet assumes the alternative dependency and therefore achieves better accuracy than Informer. Comparing the performance of AR and NAR models, ProNet is the most successful overall: it achieves a trade-off between AR and NAR forecasting models by assuming an alternative dependency and by accessing both past and future information through latent variables.

Visualization Analysis
Fig. 6 shows example forecasts produced by ProNet on three datasets: Sanyo, Hanergy, and Solar. These examples illustrate ProNet's forecasting accuracy and its ability to capture intricate and diverse patterns, including complex temporal dependencies, within the forecasting horizon.
Fig. 7 shows ProNet's forecasts on the Electricity dataset for a consecutive 8-day period from the test set. ProNet uses a 7-day history to generate each 1-day forecast. The plot illustrates ProNet's ability to leverage related time series and an extensive historical context to generate accurate predictions.

Error Accumulation
To investigate the ability of ProNet to handle error accumulation and model the output distribution, we compare ProNet with an AR model (DeepAR) and a NAR model (Informer) on the Sanyo and Hanergy datasets as a case study.
Fig. 8 shows the ρ0.5-loss of the models for forecasting horizons ranging from 1 day (20 steps) to 10 days (200 steps). The ρ0.5-loss of all models increases with the forecasting horizon, but the performance of DeepAR drops more significantly because of its AR decoding mechanism and the resulting error accumulation. ProNet consistently outperforms Informer for short horizons and is competitive with Informer for long horizons, indicating the effectiveness of seeking a trade-off between AR and NAR models. ProNet assumes the dependency in the target space without fully discarding AR decoding and can improve the forecasting accuracy over all horizons. The results show that error accumulation degrades the performance of AR models, but ProNet can overcome this by assuming the alternative dependency and fusing future information into predictions with a shorter AR decoding path.

Inference Speed
We evaluate the prediction time of ProNet with a varying number of segments n_g and compare it with an AR model (LogTrans) and a NAR model (Informer). Table V shows the average elapsed time and the standard deviation over 10 runs; all models were run on the same computer configuration.
As expected, ProNet has a faster inference speed than the AR LogTrans because of its shorter AR decoding path. The inference speed of ProNet increases with the number of segments n_g up to 10, because the number of AR steps decreases with n_g. ProNet with n_g = 10 and n_g = 15 has similar speed, as both are expected to require the same number of AR steps (2). With n_g = 10 and n_g = 15, ProNet's inference speed is competitive with that of Informer. The results confirm that ProNet retains the fast decoding advantage of NAR models, in addition to being the most accurate.
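The relation between n_g and the number of AR steps can be sketched with a short calculation. It assumes a 20-step horizon (one day for Sanyo and Hanergy) and that the longest segment determines the number of AR iterations; with evenly sized segments this gives the ceiling of T_h / n_g. Since ProNet's segments are learned and may be uneven, this is a lower bound rather than the exact count, and the function name is hypothetical.

```python
import math


def min_ar_steps(horizon, n_segments):
    """Lower bound on AR iterations: length of the longest segment when
    the horizon is split as evenly as possible (an assumption)."""
    if not 1 <= n_segments <= horizon:
        raise ValueError("n_segments must be between 1 and horizon")
    return math.ceil(horizon / n_segments)


# Assumed 20-step horizon; n_g values taken from the range studied in the text.
for n_g in (2, 5, 10, 15):
    print(n_g, min_ar_steps(20, n_g))
```

Under these assumptions, n_g = 10 and n_g = 15 both yield 2 AR steps, which is consistent with their similar prediction times.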

Ablation and Hyperparameter Sensitivity Analysis
To evaluate the effectiveness of the proposed methods, we conducted an ablation study on the Sanyo and Hanergy sets. Table VI shows the performance of: 1) Trans, the AR Transformer; 2) PAR-Trans, the partially AR Transformer implemented by simply dividing the horizon evenly [34]; 3) ProNet-Trans, the ProNet that uses Transformer as backbone instead of Informer; 4) Informer; 5) PAR-Informer, the partially AR Informer [34]; 6) our ProNet.
PAR-Trans outperforms Trans, while PAR-Informer performs worse than Informer, which indicates that the partially AR decoding mechanism can improve Trans but degrades the performance of Informer. A possible explanation is that simply dividing the forecasting horizon into even segments, together with the fixed dependency assumption, violates the real data distribution, which has time-varying dependency relationships (see Fig. 2). Both ProNet-Trans and ProNet consistently outperform Trans and Informer as well as their partially AR versions, showing the effectiveness of our progressive decoding mechanism and confirming its advantage over the partially AR decoding mechanism.
We perform a sensitivity analysis of the proposed ProNet on the Sanyo and Hanergy sets. Table VII shows the ρ0.5/ρ0.9-loss of ProNet with the number of segments n_g ranging from 2 to 15. ProNet achieves the optimal trade-off with 5 and 10 segments, in which cases the performance is the best. A possible explanation is that when n_g is low, more AR decoding steps are required and error accumulates; when n_g is high, most steps of ProNet are predicted non-autoregressively without the dependency in the target space. In summary, considering the ProNet inference speed provided in Table V, dividing the forecasting horizon by half is the best choice, allowing ProNet to achieve the best accuracy and speed.
Tables VIII and IX present the ρ0.5/ρ0.9-loss performance and prediction speed of ProNet without the re-weighting mechanism, across varying numbers of segments (n_g). Comparing these results with the performance metrics of ProNet in Tables VII and V, it becomes evident that ProNet exhibits significantly higher prediction speeds when the re-weighting mechanism is absent. Furthermore, ProNet outperforms its counterpart without the re-weighting mechanism in 10 out of the 16 cases examined.
This highlights the important role played by the re-weighting mechanism in enhancing ProNet's prediction speed while preserving its prediction accuracy. The mechanism prevents the assignment of undue importance to specific sequences of steps, thus contributing to the optimization of prediction speed without compromising the overall accuracy of ProNet's forecasting.

CONCLUSIONS
We introduced ProNet, a novel deep learning model tailored for multi-horizon time series forecasting. ProNet strikes a balance between autoregressive (AR) and non-autoregressive (NAR) models, avoiding error accumulation and slow prediction while maintaining the ability to model target step dependencies. The key innovation of ProNet lies in its partially AR decoding mechanism, achieved through segmenting the forecasting horizon. It predicts a group of steps non-autoregressively within each segment while locally employing AR decoding, resulting in enhanced forecasting accuracy. The segmentation process relies on latent variables, which capture the significance of steps in the horizon and are optimized through variational inference. By embracing alternative dependency assumptions and fusing both past and future information, ProNet demonstrates its versatility and effectiveness in forecasting. Extensive experiments validate the superiority of our partially AR method, showing ProNet's accuracy and prediction speed compared to state-of-the-art AR and NAR forecasting models.

Fig. 2: Partial correlation of the Sanyo set for two different days (20 time steps for each day).

Fig. 3: Structure of the four components in ProNet: encoder, decoder, prior model p_θ and posterior model q_ϕ.

Fig. 3 illustrates the architecture of ProNet, a partially AR time series forecasting model by using latent variables to model the uncertainty in target space. ProNet comprises four core components: an encoder, a forecasting decoder, a prior model denoted as p_θ(z | x), and a posterior model denoted as q_ϕ(z | y, x). During each training iteration, the feedforward process unfolds through four stages:
1) Encoder for Pattern Extraction: The encoder analyzes patterns from preceding time steps, contributing valuable insights to all decoders.
2) Significance Assessment by Posterior Model: The posterior model q_ϕ(z | y, x) integrates both ground truth and covariates, effectively discerning the significance of time steps within the forecasting horizon. This assessment identifies pivotal steps, subsequently used to segment the forecasting horizon.

Fig. 4: Creation process of the progressive mask M: initial M (upper left), M after the 1st (upper right) and 2nd (lower left) iteration, and the final M (lower right) when the forecasting horizon T_h = 7, the segment size n_g = 3 and the starting positions of the segments S_1 = [1, 3, 5]. Changes are marked in bold.
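The setting in Fig. 4 can be read as a decoding schedule, sketched below. This is one possible interpretation rather than the authors' mask-construction code: it assumes that each segment runs from its starting position to just before the next one, and that every iteration extends all unfinished segments by one step. The function name `decoding_schedule` is hypothetical.

```python
def decoding_schedule(horizon, starts):
    """Return, per iteration, the 1-indexed steps predicted in parallel.

    Assumption: segment k covers [starts[k], starts[k+1] - 1], the last
    segment ends at `horizon`, and each iteration advances every
    unfinished segment by one step.
    """
    starts = sorted(starts)
    ends = starts[1:] + [horizon + 1]  # exclusive end of each segment
    schedule = []
    offset = 0
    while True:
        steps = [s + offset for s, e in zip(starts, ends) if s + offset < e]
        if not steps:
            break
        schedule.append(steps)
        offset += 1
    return schedule


# Setting from the Fig. 4 caption: T_h = 7, starting positions [1, 3, 5].
print(decoding_schedule(7, [1, 3, 5]))
```

Under these assumptions the seven steps are covered in three iterations, matching the three mask updates described in the caption, whereas a fully AR decoder would need seven.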

Fig. 5: Visualization of the latent variable z: (a) original z, (b) re-weighted z. Higher brightness indicates a higher value of the corresponding element of z.

TABLE I: Available information for predicting step t + 1 by AR and NAR forecasting methods.

TABLE II: Dataset statistics. L_d: number of steps per day; N: number of series; n_T: number of time-based features; n_C: number of calendar features; T_l: length of input series; T_h: length of forecasting horizon.

TABLE III: Hyperparameters for ProNet.

For ProNet, the constant sampling factor for the Informer backbone was set to 2, and the length of the start token T_de was fixed at half of the forecasting horizon. The learning rate λ was fixed; the number of segments n_g was fixed at 10 for the Sanyo and Hanergy data sets, and 12 for the Solar and Electricity sets; the dropout rate δ was chosen from {0, 0.1, 0.2}. The hidden layer dimension size d_hid was chosen from {8, 12, 16, 24, 32, 48, 96}; the Informer backbone position-wise FFN dimension size d_f and number of heads n_h were chosen from {8, 12, 16, 24, 32, 48, 96} and {4, 8, 16, 24, 32}, respectively; the numbers of hidden layers of the encoder n_e and decoder n_d were chosen from {2, 3, 4}.

TABLE V: Prediction time (ms): mean and standard deviation.

TABLE VII: ρ0.5/ρ0.9-loss of data sets with various numbers of segments n_g for ProNet.

TABLE VIII: ρ0.5/ρ0.9-loss of data sets with various numbers of segments n_g for ProNet without the re-weighting mechanism.

TABLE IX: Prediction time (ms) of ProNet without the re-weighting mechanism: mean and standard deviation.