Improving Forecasts for Heterogeneous Time Series by "Averaging", with Application to Food Demand Forecast

A common forecasting setting in real-world applications considers a set of possibly heterogeneous time series from the same domain. Due to differing properties of the individual time series, such as length, obtaining forecasts for each time series in a straightforward way is challenging. This paper proposes a general framework that uses Dynamic Time Warping as a similarity measure to find similar time series, builds neighborhoods in a k-nearest-neighbor fashion, and improves the forecasts of possibly simple models by averaging. Several ways of performing the averaging are suggested, and theoretical arguments underline the usefulness of averaging for forecasting. Additionally, diagnostic tools are proposed that allow a deep understanding of the procedure.


Introduction
In many forecasting settings, one encounters a heterogeneous set of time series for which individual forecasts are required. A heterogeneous set of time series may contain time series with different seasonalities or trends. Hence, it may be difficult to model all of them jointly, or to choose an approach yielding reasonable results for the entire set of time series. Modelling each time series on its own is an alternative way to handle this challenge; however, it does not make use of the shared properties of the domain. Depending on this domain, each time series might be very difficult to model and forecast. Additionally, a practitioner may even want an automatic procedure to produce forecasts.
To this end, some work has been done comparing local and global forecasting approaches, where local means modelling each time series on its own and global refers to modelling the set of time series simultaneously. Montero-Manso and Hyndman [2021] argue that global and local models can achieve the same forecasts without needing additional assumptions. The global approach does, however, require a rather large total number of observations, additional tuning of hyperparameters, and a suitable model might be very difficult to find in the first place. Especially the total number of observations can, in practice, be insufficiently large. On the other hand, local models are especially hard to use for short time series. We therefore want to fill the gap of forecasting a set of heterogeneous time series that contains short time series and a still rather low total number of observations. Hewamalage et al. [2022] further investigate the notion of global models, looking closely at the relatedness of the time series. The authors focus on simulation experiments controlling the data-generating process of each time series to allow for arbitrary relatedness within the set of time series. They apply various machine learning methods and conclude that the performance and complexity of a global model heavily depend on the heterogeneity of the data as well as the actual amount of available data. A set of very heterogeneous time series may require a very complex global model, which in turn requires a lot of overall data. This leads us again to the challenge mentioned above.
The shortcomings of a fully global model are addressed by Godahewa et al. [2021]. The authors argue that with an increasing amount of data, a global model may no longer be localised enough, leading to worse forecasts. Thus, they propose a localisation technique where the time series are clustered and each cluster is then individually modelled by a global model. The clusters are feature- or distance-based, or even randomly assigned. While this sounds similar to this paper's work, the approach differs from ours: we do not model each neighborhood of time series, but rather just use the neighbors' existing models. Similarly, Bandara et al. [2020] discuss the use of neural networks on feature-based time series clusters, but require a lot of data. Another approach combining local and global modelling is taken by Smyl [2020]. The author combines neural networks and statistical models such that the neural network is modelled across all time series, while local behavior is modelled by exponential smoothing models. This approach won the M4 forecast competition [Makridakis et al., 2020].
For short time series, there is little literature available to tackle this challenge. Cruz-Nájera et al. [2022] compare simple and machine-learning-based models on a set of short time series (14 to 21 observations each) regarding crimes in Mexico. The authors conclude that simple models like the simple moving average or ARIMA perform better than more complex models such as neural networks.
Considering all mentioned aspects, our contributions are as follows. This paper proposes a meta framework for forecasting a set of possibly very heterogeneous time series by utilizing a range of local models and aggregating them in an appropriate way, exploiting similarities between those time series. The similarities can be seen as homogeneous parts of the time series; hence we do require that some homogeneity exists in the heterogeneous set of time series.
In fact, every time series in a dataset is modelled by a local model. For each time series, we obtain a set of forecasts using the estimated models of its neighboring time series, and perform a type of model averaging to improve its forecasts. This differs from regular model averaging, where one time series is modelled by a variety of models. Benefits of this approach are that we allow for simple models in the case of short time series, and still do not require a large total number of observations, as is the case for complex global models. Therefore, our methodology is a mixture approach: using local models while still taking the whole set of time series into account in the forecast procedure. The procedure remains very flexible since we do not fix any family of possible models; even the use of more complex models such as neural networks is still possible.
The rest of the paper is structured as follows. In Section 2.1 we introduce the measure of (dis-)similarity used in our methodology, namely Dynamic Time Warping (DTW); more details are available in A.1. Section 2.2 introduces the notion of k-nearest neighbors in the context of time series, while in Section 2.3 we propose several averaging methods to improve forecasts. In A.2 we briefly outline how DTW can be used to obtain an average time series, and a simple theoretical motivation is given in A.3, where we validate the notion of model averaging under certain assumptions. Section 3 describes the evaluation techniques we apply, followed by experiments in Section 4, where 5 different datasets are modelled and in-depth diagnostics are performed for one of the datasets. Concluding remarks are given in Section 5.

Methodology
In this paper we propose the following methodology. Given a finite set of time series $\mathcal{T}$ of possibly different properties, we first propose an appropriate dissimilarity measure on $\mathcal{T}$ by using asymmetric open-begin open-end Dynamic Time Warping, see Section 2.1. This dissimilarity measure is used to construct neighborhoods for each time series in $\mathcal{T}$.
Next, for a fixed time series, its neighbors are aggregated in a new way to obtain improved forecasts. This aggregation can be done in multiple ways based on k-nearest neighbors (Section 2.2). The actual model averaging is described in Section 2.3. Since it might not be clear how DTW corresponds to actual forecasts, we give a theoretical motivation in A.3, where the relationship between DTW and forecast distributions is studied for simple state-space models.
A general overview of the methodology, in the form of a flowchart, is shown in Figure 1.

Dynamic Time Warping
A common technique for measuring the similarity of two time series is Dynamic Time Warping (DTW), introduced by Sakoe and Chiba [1978]. Originating from speech recognition, the goal is to align two sequences using a time-warping function in a cost-minimizing manner. In detail, we consider the (possibly multivariate) sequences $X = (X_1, \ldots, X_n)' \in \mathbb{R}^{n \times d}$ and $Y = (Y_1, \ldots, Y_m)' \in \mathbb{R}^{m \times d}$ with possibly $n \neq m$, together with an appropriate dissimilarity function on the rows of $X$ and $Y$, $d(i,j) := d(X_i, Y_j)$. Denote the entries of $X_i$ and $Y_j$ by $X_{ik}$ and $Y_{jk}$, $k = 1, \ldots, d$, respectively. A common choice, also used in this paper, is the Euclidean distance, i.e. $d(i,j) = \sqrt{\sum_{k=1}^{d} (X_{ik} - Y_{jk})^2}$. The DTW distance is now defined as
$$\mathrm{DTW}(X, Y) := \min_{\varphi} \sum_{k=1}^{K_\varphi} w_\varphi(k)\, d(\varphi_X(k), \varphi_Y(k)), \qquad (1)$$
where $\varphi = (\varphi_X, \varphi_Y) : \{1, \ldots, K_\varphi\} \to \{1, \ldots, n\} \times \{1, \ldots, m\}$ denotes the warping function and $w_\varphi$ are corresponding weights. The warping function $\varphi$ links the two sequences in a cost-minimizing way, e.g. $\varphi(k) = (\varphi_X(k), \varphi_Y(k))$ implies that $X_{\varphi_X(k)}$ and $Y_{\varphi_Y(k)}$ are linked together. The length $K_\varphi$ of the warping function depends on the optimal warping function and is therefore chosen alongside the minimization problem in (1).
A more in-depth introduction of DTW is given in A.1, containing details about the warping functions as well as computing a representative time series using DTW.
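As an illustration, a minimal dynamic-programming sketch of the classic DTW distance is given below. This is the plain symmetric variant with fixed boundary conditions and absolute difference as local cost; the asymmetric open-begin open-end variant used in this paper adds step-pattern weights and relaxed boundaries that are not reproduced here.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic DTW distance between two univariate sequences.

    Uses absolute difference as local cost and the standard symmetric
    step pattern. A simplified sketch, not the asymmetric
    open-begin/open-end variant used in the paper.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)  # cumulative cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # each cell extends the cheapest of the three admissible steps
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Note that the warping allows one observation to be matched to several observations of the other series, so sequences of different lengths can still have distance zero.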

Nearest Neighbors
For many classification and regression problems, k-nearest neighbors (k-NN) is an easy yet well-performing method for improving predictions. The basic idea is as follows. Consider a metric space $(\mathcal{X}, \delta)$ and $x \in \mathcal{X}$. A neighborhood around $x$ can be formed by computing $\delta(x, z)$ for each $z \in \mathcal{X}$, $z \neq x$. The $k$ elements with minimal distance to $x$ are then the neighbors of $x$. Eq. (2) gives a formal definition:
$$N(x) := \{z_1, \ldots, z_k : \delta(x, z_1) \le \cdots \le \delta(x, z_k) \le \delta(x, z) \text{ for all } z \neq z_i,\ i = 1, \ldots, k\}. \qquad (2)$$
In case of ties, one could increase the neighborhood's size or choose one of the equally distant neighbors at random. For the neighborhood including $x$ itself we write $\bar{N}(x) := N(x) \cup \{x\}$.
From a statistical point of view, we consider observations $(x_1, y_1), \ldots, (x_N, y_N) \in \mathcal{X} \times \mathcal{Y}$, where $\mathcal{Y}$ denotes the set of possible labels and $\mathcal{X}$ the feature space. We write $y(x)$ if $y$ is the label corresponding to $x$. In an unsupervised framework there would not be any labels and we would only be able to form neighborhoods. In a classification setting, $\mathcal{Y}$ is a discrete set; the easiest example is a binary classifier with $\mathcal{Y} = \{0, 1\}$. A new observation $x_0$ may be classified as the majority class of its neighborhood. In a regression setting, the labels are continuous, e.g. $\mathcal{Y} = \mathbb{R}$, and the predicted label is set to $\hat{y}(x_0) = f(\bar{N}(x_0))$, where $f$ is an aggregation function. In the simplest case we have $\hat{y}(x_0) = \frac{1}{k} \sum_{x \in N(x_0)} y(x)$, i.e. the predicted label is just the arithmetic mean of the neighborhood's labels. The choice of $k$ is vital; however, it is still not completely clear how to choose it. There are many heuristics, such as choosing $k \approx \sqrt{N}$, which do not seem reasonable in many applications. In contrast, the optimal $k$ can be chosen based on an optimization criterion. In supervised settings, one usually splits the data into training and test sets and chooses $k$ to minimize a loss function on the training set. This $k$ is then fixed and used to predict on the test set. For more advanced approaches where $k$ is determined adaptively, see Anava and Levy [2017].
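The neighborhood construction and the simple mean aggregation can be sketched as follows; the scalar features and the user-supplied dissimilarity `dist` are hypothetical placeholders, not the paper's setting.

```python
def knn_neighborhood(x, candidates, dist, k):
    """Return the k candidates with smallest dissimilarity to x.

    Ties are broken by (stable) sort order, in line with the remark
    that tied neighbors may be chosen arbitrarily.
    """
    return sorted((z for z in candidates if z != x), key=lambda z: dist(x, z))[:k]

def knn_regress(x0, data, dist, k):
    """Predict the label of x0 as the arithmetic mean of its neighbors' labels.

    `data` is a list of (feature, label) pairs -- the simplest
    aggregation function f described in the text.
    """
    features = [f for f, _ in data]
    neighbors = knn_neighborhood(x0, features, dist, k)
    labels = [lab for f, lab in data if f in neighbors]
    return sum(labels) / len(labels)
```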

Time Series Nearest Neighbors
In the context of time series, k-NN has also been widely used. In time series classification, 1-NN is often used in conjunction with DTW [Jeong et al., 2011]. Many approaches also consider k-NN in a feature space, which could consist of time series features like length, trend, auto-correlation properties, and many more; in such cases, DTW is not used at all. However, this approach assumes that the features can be extracted in a reasonable way, which might not always be the case. In terms of k-NN for regression or forecasting, Martínez et al. [2017] use a simple approach where a single time series is forecasted by the mean value of neighboring labels based on Euclidean distances of lagged values. Naturally, this method can also be applied to many time series in a pooled manner, but this would not use the idea of employing similar time series for improving forecasts, where similarity is based on DTW.

Our Setting
We let $\mathcal{X}$ be a set of heterogeneous and differently sized time series, equipped with the asymmetric open-begin open-end DTW distance measure. Note that this space is not metric; however, k-NN can still be applied in such a setting. The label set $\mathcal{Y}$ is not uniquely defined. For the first model, then average approach (see Section 2.3), this set consists of one-step ahead forecasts for the time series of interest. In the case of first average, then model, the label set and the aggregation function are more complicated and are described in the corresponding section.

Model Averaging
For the upcoming sections, let $\delta(x, y) = \mathrm{DTW}_{\mathrm{asym,OBE}}(x, y)$. Further, let $\hat{\cdot} : (x, M) \mapsto \hat{x}_{t+1|t,\ldots,s}(M)$ with $x = (x_s, \ldots, x_t)$ a time series and $M$ any model used to forecast $x_{t+1}$, i.e. $\hat{x}_{t+1}(M) := \hat{x}_{t+1|t,\ldots,s}(M)$ is the one-step ahead forecast of $x$ obtained by using model $M$. We refer to $M$ as the baseline model which we aim to improve.
We understand a model $M$ as a mapping $M : \mathcal{T} \to \mathcal{M}$, taking a time series $x \in \mathcal{T}$ and outputting $M_x \in \mathcal{M}$, where $\mathcal{M}$ denotes the set of all possible models.
Given a model $M = M_x$ and a time series $z$, we denote by $\tilde{\cdot} : (z, M_x) \mapsto \tilde{M}_x(z)$ the refitting function, i.e. $\tilde{M}_x(z)$ is the model $M_x$, estimated on $x$ but fitted on $z$. More specifically, $M_x$ is first trained on $x$, yielding estimated parameters. Fixing those estimated parameters, we then plug in $z$ to obtain fits of the model on $z$. No data of $z$ is used to estimate parameters. Thus, forecasts of $\tilde{M}_x(z)$ are always with respect to $z$.

First Average, Then Model Approach
The first approach to improving a single forecast yielded by a baseline model is as follows. Given a time series $y = (y_s, \ldots, y_t)$, we find a neighborhood $N_y = \{z_1, \ldots, z_{k_y}\}$ based on k-NN. Since the DTW distance also depends on the difference in level of the time series, we center each time series first to avoid this effect. That way we keep other properties of the time series, such as trends or seasonalities, since these might still be helpful to improve forecasts. Denote by $z^c := (z_{s_z} - \bar{z}, \ldots, z_{t_z} - \bar{z})$ the centered time series for $z = (z_{s_z}, \ldots, z_{t_z})$ with $\bar{z} = \sum_{i=s_z}^{t_z} z_i / (t_z - s_z + 1)$. Then a neighborhood around $y$ is constructed such that $\delta(y^c, z_1^c) \le \delta(y^c, z_2^c) \le \cdots \le \delta(y^c, z_{k_y}^c)$ and $z_i = (z_{s_i}, \ldots, z_{t_i})$ with $t \ge t_i$ and $t_i - s_i > t - s$. That means we look for neighboring time series which have more historical data. This is especially motivated by possibly having shorter time series which will be difficult to model, as is often the case in many on-line forecasting settings where data for a subset of the time series are rare. In such a case it is not reasonable to allow even shorter time series as neighbors. Naturally, we also allow neighbors that might not have recent data. A matching with such time series can still be reasonable due to seasonality effects. Also, not taking such time series into consideration might remove important information from the dataset.
Next, as mentioned in A.2, we compute the neighborhood's average time series on the centered sequences $z^c$, $z \in N_y$, given by $\mathrm{avg}(y) := \mathrm{aDBA}(\{z^c : z \in N_y\})$. As the name already suggests, we now take the average time series $\mathrm{avg}(y)$ and map it to $M_{\mathrm{avg}(y)}$.
After estimating model $M_{\mathrm{avg}(y)}$ and its parameters, we take this model and use it to forecast on $y$, i.e. we obtain $\hat{y}_{t+1}(\tilde{M}_{\mathrm{avg}(y)}(y))$. Note that since $M_{\mathrm{avg}(y)}$ is based on centered data, we allow the initial parameters of $M_{\mathrm{avg}(y)}$ to be re-estimated using data of $y$. This procedure automatically handles the re-centering. This averaging will later be referred to as G-AVG.
In terms of k-NN, we can write the aggregation function as
$$\hat{y}_{t+1} = f(\bar{N}(y)) = \hat{\cdot}(y, \cdot) \circ \tilde{\cdot}(y, \cdot) \circ M \circ \mathrm{aDBA} \circ \mathrm{center}(\bar{N}(y)), \qquad (3)$$
where, with abuse of notation, we first center the time series of the neighborhood, then average them, model the resulting averaged time series, and finish by refitting and performing the actual forecast with respect to $y$.
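The pipeline just described can be sketched as plain function composition. In the sketch below, `average` stands in for the aDBA averaging of A.2 (a simple pointwise mean is only valid for equal-length series), while `fit` and `refit_forecast` are hypothetical placeholders for model estimation and the refitting function; unlike the paper, re-centering is handled manually rather than by re-estimated initial parameters.

```python
import numpy as np

def g_avg_forecast(y, neighbors, fit, refit_forecast, average):
    """Sketch of the first-average-then-model (G-AVG) pipeline:
    center the neighborhood, average it, fit a model on the averaged
    series, refit on the centered target, and re-center the one-step
    ahead forecast."""
    y = np.asarray(y, dtype=float)
    centered = [np.asarray(z, float) - np.mean(z) for z in neighbors]
    model = fit(average(centered))            # model the averaged series
    return refit_forecast(model, y - y.mean()) + y.mean()
```

For instance, plugging in a naive `refit_forecast` that returns the last centered observation reduces the pipeline to a random walk forecast for `y`.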

First Model, Then Average Approach
As noted previously, for a time series $y = (y_s, \ldots, y_t)$ we obtain a neighborhood $N_y$ of size $k_y$. However, instead of averaging the neighboring time series in terms of DTW, we consider a different approach. As mentioned, each time series $z \in N_y$ has been modelled, i.e. there exists $M_z$ for any $z \in N_y$.
In terms of k-NN, the averaging now is of the form
$$\hat{y}_{t+1} = f(\bar{N}(y)) = g_w \circ \hat{\cdot}(y, \cdot) \circ \tilde{\cdot}(y, \cdot) \circ M(\bar{N}(y)), \qquad (4)$$
where $g_w$ is a weighted averaging function, defined as $g_w(x_1, \ldots, x_n) := \sum_{i=1}^{n} w_i x_i$, with $n$ the number of forecasts to be averaged. We have either $n = k_y + 1$ or $n = k_y$. Compared to the G-AVG approach, we clearly see from Eq. (3) and Eq. (4) that the approaches basically differ in when the averaging is performed.
Next, we propose various ways to choose the weights since it is not clear how to choose them in an optimal way.

Simple Average
The simplest and most straightforward way to combine the forecasts is by taking the simple average, i.e. $w_z = w_y = \frac{1}{k_y + 1}$. This assumes that all forecasts are equally important. We will later refer to this type of averaging as S-AVG. Alternatively, we can opt not to use $M_y$ but only utilize the models $\tilde{M}_z(y)$ of the neighbors, and set $w_z = 1/k_y$, $w_y = 0$ (S-AVG-N).

Distance-weighted Average
For the distance-based model average we consider two ways. First, we can take the DTW distances of the neighborhood and set
$$w_z = \frac{1/\delta(y^c, z^c)}{\sum_{z' \in N_y} 1/\delta(y^c, z'^c)}, \quad z \neq y, \qquad (5)$$
implying that closer neighbors in terms of DTW are assigned higher weights in the model averaging. Since $\delta(y, y) = 0$, we have to set $w_y = 0$ in Eq. (5), implying that this type of averaging only regards the neighbors' models; the forecast $\hat{y}_{t+1}(M_y)$ is not taken into account because $w_y = 0$. We denote this type of averaging as D-AVG-N.
In contrast, we can consider the neighborhood's average $\mathrm{avg}(y)$ and the corresponding distances $\delta(z^c, \mathrm{avg}(y))$ for $z \in \bar{N}_y$. Due to the averaging algorithm (see A.2) and a minimum neighborhood size of 1, we have $\delta(z^c, \mathrm{avg}(y)) > 0$ for all $z$. That way we can set up the weights as
$$w_z = \frac{1/\delta(z^c, \mathrm{avg}(y))}{\sum_{z' \in \bar{N}_y} 1/\delta(z'^c, \mathrm{avg}(y))}, \quad z \in \bar{N}_y. \qquad (6)$$
We denote this type of averaging by D-AVG.

Error-weighted Average
The error-based or, equivalently, performance-based weights are computed from the models' historical performance. They can be calculated in two ways. First, for each time series $z \in N_y$ we obtain residuals $r_z$ based on the one-step ahead forecasts $\hat{z}(M_z)$, i.e. $r_{z,i} := z_i - \hat{z}_i(M_z)$ for $i = s_z + 1, \ldots, t_z$.
Next, we calculate an error measure on the one-step ahead errors, denoted by $E(z, \hat{z}(M_z))$. The corresponding weights for the model averaging are set to
$$w_z = \frac{1/E(z, \hat{z}(M_z))}{\sum_{z' \in N_y} 1/E(z', \hat{z}'(M_{z'}))}. \qquad (7)$$
This method will be referred to as P-AVG. In contrast to Eq. (7), we can instead consider the refitted residuals $r_y(\tilde{M}_z(y))$ with $r(\tilde{M}_z(y))_i := y_i - \hat{y}_i(\tilde{M}_z(y))$ for $i = s + 1, \ldots, t$ and $z \in N_y$. Consequently, we obtain errors $E(y, \hat{y}(\tilde{M}_z(y)))$ and corresponding weights
$$w_z = \frac{1/E(y, \hat{y}(\tilde{M}_z(y)))}{\sum_{z' \in N_y} 1/E(y, \hat{y}(\tilde{M}_{z'}(y)))}. \qquad (8)$$
We denote this method by P-AVG-R. In practice, we utilize the root mean squared scaled error measure (RMSSE) as introduced in Section 3.
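The weighting schemes above share a common shape: uniform weights, or normalized inverse weights over either DTW distances or error measures. A minimal sketch (scheme names as in the text):

```python
import numpy as np

def simple_weights(k_y, include_self=True):
    """S-AVG weights: uniform over the neighbors and, optionally, y itself.
    With include_self=False this gives the S-AVG-N weights 1/k_y."""
    n = k_y + 1 if include_self else k_y
    return np.full(n, 1.0 / n)

def inverse_weights(values):
    """Normalized inverse weights: a smaller DTW distance (D-AVG-N) or a
    smaller historical error measure (P-AVG) yields a larger weight.
    `values` must be strictly positive."""
    inv = 1.0 / np.asarray(values, dtype=float)
    return inv / inv.sum()
```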

No-Model Approaches
In the spirit of the non-parametric neighbor search of k-NN, we also consider non-parametric forecasts as follows.
Following Martínez et al. [2017], where forecasts are computed based on the next available observed value, we apply a similar methodology. For the given time series $y = (y_s, \ldots, y_t)$ and neighbors $z = (z_{s_z}, \ldots, z_{t_z}) \in N_y$, we obtain matchings (on which the neighborhood is based) such that $\varphi^{(z)}(k) = (\varphi_y(k), \varphi_z(k))$ for $k = 1, \ldots, K_z$. Since every index of $y$ must be matched, there exist a value $\bar{k}$ and $\bar{t}_z \in \{s_z, \ldots, t_z\}$ such that $\varphi^{(z)}(\bar{k}) = (t, \bar{t}_z)$. Next, there are two cases to distinguish.
• Case $\bar{t}_z < t_z$: there exists a successor $u_z = \bar{t}_z + 1$ with corresponding value $z_{u_z}$, which is used to forecast $y_{t+1}$.
• Case $\bar{t}_z = t_z$: no successor exists, thus the time series $z$ cannot be used to forecast $y_{t+1}$.
To this end, we obtain the set of possible successors, given by
$$S_y := \{z_{u_z} : z \in N_y,\ \bar{t}_z < t_z\}. \qquad (9)$$
Clearly, Eq. (9) implies that $|S_y| \le |N_y|$. This type of averaging also only considers the neighbors' information because $y$ does not have a successor itself.
Similarly to before, we can forecast $y_{t+1}$ by $\hat{y}_{t+1} = \sum_{s \in S_y} w_s s$ with appropriate weights. We choose simple weights $w_s = 1/|S_y|$ (S-NM-AVG) and distance-based weights as in Eq. (5) (D-NM-AVG). Note that the matchings are based on centered time series, hence the forecasts obtained here are actually forecasts for $y^c$. Since we do not have a model taking care of re-centering, we have to do it manually by setting $\hat{y}_{t+1} = \hat{y}^c_{t+1} + \bar{y}$. The methodology's notation is briefly summarised in Table 1.
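The successor construction can be sketched as follows, with `match_ends[j]` denoting the (0-based) index that the warping path matches to the last index of y for neighbor j; the inputs are hypothetical stand-ins for the DTW matchings, not the paper's code.

```python
def no_model_forecast(neighbors, match_ends, weights=None):
    """S-NM-AVG sketch: collect the successor value of each matched
    neighbor (if one exists) and average the successors with the given
    weights (simple weights 1/|S_y| by default)."""
    successors = [z[tbar + 1]
                  for z, tbar in zip(neighbors, match_ends)
                  if tbar < len(z) - 1]       # a successor exists
    if not successors:
        raise ValueError("no neighbor has a successor")
    if weights is None:
        weights = [1.0 / len(successors)] * len(successors)
    return sum(w * s for w, s in zip(weights, successors))
```

Note that neighbors whose matched index is their final observation simply drop out of the average, which is why |S_y| ≤ |N_y|.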

Evaluation Methods
For the evaluation of our methods we use scaled one-step ahead forecast errors as proposed by Hyndman and Koehler [2006], adapted to our problem setting of on-line forecasting. The authors define the mean absolute scaled error (MASE) as
$$\mathrm{MASE}(y, \hat{y}) = \frac{\frac{1}{t - s}\sum_{u=s+1}^{t} |y_u - \hat{y}_{u|u-1}|}{\frac{1}{t - s}\sum_{u=s+1}^{t} |y_u - y_{u-1}|}, \qquad (10)$$
where $y = (y_s, \ldots, y_t)$ is a time series with one-step ahead forecasts $\hat{y} = (\hat{y}_{s+1|s}, \ldots, \hat{y}_{t|t-1})$. This means we scale each error by the average in-sample error obtained when using the last available observation as the one-step ahead forecast, also known as the random walk forecast.
As mentioned by Hyndman and Koehler [2006], one advantage of this measure is its independence of scale, making it easier to compare results across different time series. Since the measure compares the actual forecast error with the mean forecast error of a random walk forecast, we can say that if MASE < 1, the forecast method used to obtain $\hat{y}$ works better on average than the naive approach of using the last available value. Similarly, if MASE > 1, the method performs worse than the random walk forecast.
However, in an on-line forecasting setting, scaling by the whole in-sample error may not be reasonable, hence we use a different scaling. We adapt Eq. (10) to a root mean squared scaled error (RMSSE) given by
$$\mathrm{RMSSE}^{t'}_{s'}(y, \hat{y}, \hat{y}^b) := \sqrt{\frac{1}{t' - s' + 1} \sum_{u=s'}^{t'} q_u^2}, \quad q_u = \frac{y_u - \hat{y}_u}{\sqrt{\frac{1}{u - s} \sum_{v=s+1}^{u} (y_v - \hat{y}^b_v)^2}}, \qquad (11)$$
where $s \le s' \le t' \le t$. This means we obtain scaled errors $q_u$ by scaling the error $y_u - \hat{y}_u$ by the average benchmark forecast error up to time $u$. Altogether, $\mathrm{RMSSE}^{t'}_{s'}$ in Eq. (11) gives the average error over the window $\{s', \ldots, t'\}$, scaled by the average historical benchmark forecast error up to time $t'$. A typical benchmark method is the random walk forecast given by $\hat{y}^b_v = y_{v-1}$. As the data is usually split into training and test sets, we also need to distinguish the corresponding evaluation techniques. On the training set we compute the RMSSE as given above, yielding a final RMSSE value for the entire training set.
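Under the interpretation above (each error scaled by the random-walk benchmark error observed up to that point), the on-line RMSSE might be computed as follows; the sketch uses 0-based indexing and a random-walk benchmark, and reflects our reading of the measure rather than the paper's exact implementation.

```python
import numpy as np

def online_rmsse(y, y_hat, s1, t1):
    """Root mean squared scaled error over the window s1..t1 (inclusive).

    Each error y[u] - y_hat[u] is divided by the root mean squared
    random-walk (benchmark) error of y observed up to time u.
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    q = []
    for u in range(s1, t1 + 1):
        bench = np.sqrt(np.mean((y[1:u + 1] - y[:u]) ** 2))  # RW errors up to u
        q.append((y[u] - y_hat[u]) / bench)
    return float(np.sqrt(np.mean(np.square(q))))
```

Values below 1 indicate forecasts beating the random-walk benchmark on average over the window.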
For evaluation on the test set, we follow the notation of Hyndman and Koehler [2006] and its use in the R package forecast [Hyndman and Khandakar, 2008], where the test errors are scaled by the training error of the random walk. Consider a time series $y = (y_s, \ldots, y_t, y_{t+1}, \ldots, y_T)$ split into training and test set at time step $t$. Then each test set residual is scaled by the root mean squared error of the random walk forecast, yielding scaled test set residuals
$$q_u = \frac{y_u - \hat{y}_u}{\sqrt{\frac{1}{t - s} \sum_{v=s+1}^{t} (y_v - y_{v-1})^2}}, \quad u = t+1, \ldots, T. \qquad (12)$$
In addition to the RMSSE, we also consider more common forecasting measures, namely the mean absolute error (MAE), the root mean squared error (RMSE), and the symmetric mean absolute percentage error (sMAPE); see Hyndman and Koehler [2006] for a review:
$$\mathrm{MAE} = \frac{1}{h} \sum_{u=t+1}^{t+h} |y_u - \hat{y}_u|, \qquad (13)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{h} \sum_{u=t+1}^{t+h} (y_u - \hat{y}_u)^2}, \qquad (14)$$
$$\mathrm{sMAPE} = \frac{2}{h} \sum_{u=t+1}^{t+h} \frac{|y_u - \hat{y}_u|}{|y_u| + |\hat{y}_u|}. \qquad (15)$$
For each dataset, the mean and median error for each measure is computed.

Experiments
To demonstrate the methodology, we use 5 publicly available datasets. Table 2 provides summary information on these datasets. All datasets are on a monthly basis except for the Food Demand data, which is weekly. A short summary is as follows.
• Food Demand. Weekly data from smart fridges with the goal to forecast the demand of each fridge for the upcoming week in a one-step ahead fashion. The dataset is available in this paper's R package.
• M3 Dataset [Makridakis and Hibon, 2000]. Monthly data of the M3 competition, split into 5 subcategories, namely micro, macro, industry, demographic and finance. The 6-th subcategory, other, was removed from the analysis. We use this prior domain knowledge and model each subcategory individually. The data was obtained from the R package Mcomp [Hyndman, 2018].
• Tourism [Athanasopoulos et al., 2011]. Monthly dataset from the tourism forecast competition. The data is available from the R package Tcomp [Ellis, 2018].
• CIF 2016 [Stepnicka and Burda, 2017]. Monthly data from the CIF 2016 forecasting competition. This data contains 24 real time series from the banking domain and 48 artificially created time series. The dataset is available from Zenodo [Godahewa et al., 2020] and is included in this paper's corresponding R package for convenience.
• Hospital. Monthly time series counting patients for different medical products and related medical problems. The dataset is available in the R package expsmooth [Hyndman, 2015].
All datasets in Table 2 have a forecast horizon greater than 1, although the Food Demand data is a special case since we are only interested in one-step ahead forecasts for that dataset. Hence we run two experiments on these datasets: multi-step ahead forecasts as required by the datasets, and cumulative one-step ahead forecasts to compare with the Food Demand results. We do not compute multi-step forecasts for the Food Demand data.
We denote by TSAVG the trained averaging method. For all datasets, the same training procedure was performed.

Baseline Model
For the base family of models, any common time series model family can be used. We here use the ETS family of models [Hyndman et al., 2002, Taylor, 2003]. These models consist of 3 components: Error, Trend and Seasonality.
The simplest ETS model is of the form ANN and is also known as simple exponential smoothing. It does not have any trend or seasonality component and is given by the recursion
$$l_t = \alpha y_t + (1 - \alpha) l_{t-1}, \qquad \hat{y}_{t+h|t} = l_t, \quad h > 0, \qquad (16)$$
where $l$ denotes the level component of the model, which is also equal to the flat forecast $\hat{y}_{t+1|t}$. The model parameter $\alpha$ is usually found by minimizing the sum of squared one-step ahead forecast errors.
The next more complex model is of the form AAN and corresponds to Holt's linear trend method. It is given by
$$l_t = \alpha y_t + (1 - \alpha)(l_{t-1} + b_{t-1}), \qquad b_t = \beta (l_t - l_{t-1}) + (1 - \beta) b_{t-1}, \qquad \hat{y}_{t+h|t} = l_t + h b_t, \quad h > 0, \qquad (17)$$
where $l$ again denotes the level component and $b$ the newly introduced trend component. In this model the forecasts are no longer flat, but linear in the forecast horizon $h$. All smoothing parameters are usually constrained between 0 and 1. For a more detailed review of these and many more models with equations like Eq. (16) and Eq. (17), see Gardner Jr. [1985].
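The recursions (16) and (17) are straightforward to implement. A minimal sketch with fixed smoothing parameters and a simple heuristic state initialization (in practice, initial states are estimated alongside the smoothing parameters):

```python
def ets_ann_forecast(y, alpha):
    """ETS(A,N,N) / simple exponential smoothing, Eq. (16):
    the forecast is flat and equals the final level."""
    level = y[0]                      # heuristic initial level
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level

def ets_aan_forecast(y, alpha, beta, h=1):
    """ETS(A,A,N) trend model, Eq. (17):
    the forecast is linear in the horizon h."""
    level, trend = y[0], y[1] - y[0]  # heuristic initial states
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + h * trend
```

On a perfectly linear series, the trended recursion reproduces the line exactly, while the flat ANN forecast lags behind it.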
With that, this family provides a very flexible way of modelling and forecasting, and is therefore often used in business forecasting applications. The forecast package [Hyndman and Khandakar, 2008] in R also provides an automatic forecasting framework for ETS and other models (such as ARIMA), where the most appropriate model among all possible models of the family is chosen automatically based on some optimization criterion. We use the corrected Akaike Information Criterion, given by
$$\mathrm{AIC}_c = -2 \log L + 2k + \frac{2k(k+1)}{T - k - 1}, \qquad (18)$$
where $L$ is the model's likelihood, $k$ denotes the total number of parameters in the model, and $T$ is the sample size.
The correction is needed because in small samples the regular AIC tends to overfit and select models with too many parameters [Hurvich and Tsai, 1989].
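As a quick sketch, the corrected criterion is the usual AIC plus a penalty term that vanishes as the sample grows:

```python
def aicc(log_lik, k, T):
    """Corrected Akaike Information Criterion: AIC (-2 log L + 2k)
    plus the small-sample penalty 2k(k+1)/(T-k-1)."""
    return -2.0 * log_lik + 2.0 * k + (2.0 * k * (k + 1)) / (T - k - 1)
```

For large T the correction term is negligible, so AICc-based and AIC-based selection agree; the difference matters exactly in the small-sample regime described above.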
Each time series is modelled by ETS and then forecasted in a one-step ahead fashion, regardless of its length.
Additionally, another common family of time series models, ARIMA [Shumway and Stoffer, 2000], is used to model the datasets, again utilizing the forecast package. As for the ETS family, the best model is chosen by the corrected AIC as defined in Eq. (18).

Global Benchmark Model
As a second benchmark model we use a pooled auto-regressive (AR) model [Baltagi, 2021], which can be seen as a global model since it uses all individuals' information. This linear AR(p) model is given by
$$y_{i,t} = \alpha + \sum_{j=1}^{p} \beta_j y_{i,t-j} + \varepsilon_{i,t}, \qquad (19)$$
where $i = 1, \ldots, N$ denotes the $i$-th individual and $t = s_i, \ldots, t_i$ denotes the time component of the panel. The order of the model is given by $p \ge 1$. This means that for the entire panel we only need to estimate $p + 1$ parameters in Eq. (19): the global intercept $\alpha$ and the global slope parameters $\beta_1, \ldots, \beta_p$. In practice, the observations of all individuals are stacked on top of each other to obtain a simple linear model which can be estimated by ordinary least squares.
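The stacking-and-OLS estimation described above can be sketched as follows; this is a toy alternative to the plm package used in the paper.

```python
import numpy as np

def pooled_ar(panel, p):
    """Pooled AR(p): stack every individual's lagged observations into
    one design matrix and estimate a single intercept plus p slopes by
    ordinary least squares. Returns [alpha, beta_1, ..., beta_p]."""
    rows, targets = [], []
    for series in panel:
        y = np.asarray(series, dtype=float)
        for t in range(p, len(y)):
            # regressors: 1, y_{t-1}, ..., y_{t-p}
            rows.append(np.concatenate(([1.0], y[t - p:t][::-1])))
            targets.append(y[t])
    coef, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(targets), rcond=None)
    return coef
```

Because all individuals share the same coefficients, even very short series contribute usable rows, which is the appeal of such a global benchmark.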
Such models, also called panel models, can take different forms as well. We also modelled the data using variable intercepts, i.e. each individual has its own intercept. However, this model turned out to be worse than the pooled model on our datasets.
Another variant is the variable-coefficient model with individual intercepts and slopes. However, such a model is usually estimated by fitting each individual's model on its own. Thus, this approach cannot really be seen as a global model, and we opted not to use it either.
For more details about the models and their statistical properties, see the book by Baltagi [2021]. All global panel models have been fitted using the plm package [Croissant and Millo, 2008] in R. Lag selection for $p = 1, \ldots, L_{\max}$ was performed using time series cross-validation as in Section 4.3. For each dataset, an individual maximum lag was selected based on the rule of thumb of Bandara et al. [2020], namely input size $= 1.25 \cdot \max(\text{output size}, \text{seasonal period})$. The specific values of $L_{\max}$ are given in Table 3, while 5 equidistant values of $p$ are chosen for each dataset.

Choice of Optimal Parameter
In every experiment we perform time series cross-validation (TSCV) based on a rolling window approach to choose the optimal hyperparameter $k$ in k-NN. We cannot apply regular cross-validation since the assumption of independent data is not valid in a time series context.
The initial window contains observations from time step 1 until time step $T_0$. In each iteration, the window is expanded by one time step, yielding a new fold. In each fold we obtain an RMSSE value for each individual and hyperparameter. In detail, assume there exists a hyperparameter $k \in \Theta$, where $\Theta$ is the search grid. Let a fold be $f = \{1, \ldots, T_0, \ldots, t_f\}$, $T_0 \le t_f \le T_{\mathrm{train}}$, and consider an individual time series $y^{(i)} = (y_{s_i}, \ldots, y_{t_i})$. Then we compute $\mathrm{RMSSE}^{t_f}_{\max(T_0, s_i)}(y^{(i)}, \hat{y}^{(i)}(m); k)$ with random walk benchmark forecasts for each parameter $k$, time step $t_f$, individual $i$ and method $m$, and the mean cross-validation error over the $F_i$ folds in which individual $i$ is available,
$$\mu(y^{(i)}, \hat{y}^{(i)}(m); k) = \frac{1}{F_i} \sum_{f} \mathrm{RMSSE}^{t_f}_{\max(T_0, s_i)}(y^{(i)}, \hat{y}^{(i)}(m); k). \qquad (20)$$
Furthermore, the cross-validation yields standard errors as well by aggregating over the folds, i.e.
$$\mathrm{SE}(y^{(i)}, \hat{y}^{(i)}(m); k) = \frac{1}{\sqrt{F_i}} \sqrt{\frac{1}{F_i - 1} \sum_{f} \left( \mathrm{RMSSE}^{t_f}_{\max(T_0, s_i)}(y^{(i)}, \hat{y}^{(i)}(m); k) - \mu(y^{(i)}, \hat{y}^{(i)}(m); k) \right)^2}. \qquad (21)$$
Note that each individual $i$ might be available in a different number of folds since we perform cross-validation based on the time aspect of the panel data. The optimal hyperparameter is now chosen using the one-standard-error rule, i.e. we choose the most parsimonious model lying within the one-standard-error band of the global minimum of the cross-validation errors. This means that based on Eq. (20) and Eq. (21) we obtain the optimal parameter for $y^{(i)}$ as the most parsimonious element of
$$\left\{ k \in \Theta : \mu(y^{(i)}, \hat{y}^{(i)}(m); k) \le \mu(y^{(i)}, \hat{y}^{(i)}(m); k^{(m)}_i) + \mathrm{SE}(y^{(i)}, \hat{y}^{(i)}(m); k^{(m)}_i) \right\}, \qquad (22)$$
where $k^{(m)}_i = \mathrm{argmin}_k\, \mu(y^{(i)}, \hat{y}^{(i)}(m); k)$ is the parameter realizing the minimal mean cross-validation error.
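The one-standard-error rule itself can be sketched in a few lines. Which direction of k counts as "more parsimonious" is our assumption here and is left configurable, since the text does not fix it.

```python
import numpy as np

def one_se_rule(ks, cv_means, cv_ses, smaller_is_parsimonious=True):
    """Pick the most parsimonious k whose mean CV error lies within one
    standard error of the minimal mean CV error."""
    i_best = int(np.argmin(cv_means))
    threshold = cv_means[i_best] + cv_ses[i_best]
    eligible = [k for k, m in zip(ks, cv_means) if m <= threshold]
    return min(eligible) if smaller_is_parsimonious else max(eligible)
```

By construction the minimizer itself is always eligible, so the rule only ever moves toward a more parsimonious choice, never toward a worse-than-one-SE one.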
Training parameters such as $T_0$ are listed in Table 3 for each dataset. A mean value is given for $T_{\mathrm{train}}$ because of the possibly different lengths of the time series.

Results
Tables 4, 5, 6, 7 show the corresponding mean and median errors for the multi-step horizon case for each dataset.
First, we note that the pooled AR model seems to perform well on the M3 and Tourism datasets. Those datasets are characterized by long forecast horizons as well as long time series, which could be a reason for the good performance.
When comparing the averaging methodology to the local benchmarks ETS and ARIMA, no clear picture emerges. Nevertheless, TSAVG does keep up with the benchmark methods and is able to improve the performance of the ETS model, which is its purpose. For testing statistical significance, non-parametric Friedman tests are performed on the mean values and the corresponding p-values are given in the tables' captions. No statistically significant difference at an α-level of 5% is found for any error measure.
The corresponding one-step ahead errors are shown in Tables 8, 9, 10, 11. We observe a similar picture: the pooled model performs well for the M3 and Tourism datasets. When comparing TSAVG to ETS and ARIMA, the numbers are very close and no clear best method emerges. In the case of one-step ahead errors, TSAVG is only able to improve the performance of the ETS model in a few cases. For the Food Demand dataset, which contains just a small number of rather short time series, we do observe improvements over the local base models as well as over all other competing models.
As before, Friedman tests are performed on the mean values. The null hypothesis of no differences can again not be rejected for any error measure.
To summarize the results, the proposed method is competitive on all datasets. Especially on the Food Demand dataset we observe good performance due to its characteristics. However, TSAVG also performs comparably on datasets which do not share the same properties as the Food Demand dataset. For example, the Hospital dataset only contains equally sized time series, whereas the Tourism dataset contains very long time series. This makes TSAVG a very flexible method to improve simple local models for a wide range of datasets.
Looking at all results with respect to the datasets' sizes, dataset size does not appear to have a large impact. We do, however, observe that the bigger the dataset, the better TSAVG seems to perform compared to the baseline ETS model. This is reasonable since the number of possible neighbors increases with the dataset size.

Runtime
Figure 2 shows training runtimes for a selection of averaging methods. We clearly see the effect of increasing neighborhood size across all datasets and averaging methods. The runtime also increases significantly with the datasets' size due to the pairwise search for DTW neighbors. While the calculation times of P-AVG and S-AVG are similar because the corresponding weights are both quickly computed, the distance-based averaging method has increasing computation time because it needs to compute the neighborhood's averaged time series.

The transparency of TSAVG allows for advanced diagnostics. These are demonstrated on the Food Demand dataset, which is about forecasting weekly demand of smart fridges. Its rather small size as well as its short time series lead to understandable diagnostics on the level of individual time series. This practical problem is also what initially motivated the idea of using pattern matching to improve forecasts. A selection of example time series of the dataset is shown in Figure 3 and illustrates its heterogeneity. Still, some homogeneity is present: individual 29, for example, shows a very similar pattern to the first part of individual 3. This is the similarity we aim to extract using DTW.

TSCV Diagnostics
In each fold we forecast the time series to obtain one-step ahead forecasts as explained in Section 4.3. The ETS models themselves are not tuned across the folds; the best ETS model is chosen in each fold solely by AIC. That way, the only tuning parameter is the size of each individual's neighborhood.
We take a closer look at the TSCV procedure and the choice of the optimal tuning parameters. Figure 4 shows the corresponding TSCV scores and standard errors as defined above for the distance-based averaging methods. Each plot panel shows the results for a selected smart fridge. For demonstration purposes we choose a grid of $\Theta = \{1, 3, 5, 10, 20\}$ neighbors and select the optimal number by applying the one-standard-error rule. We do this for each forecast averaging methodology. The vertical lines indicate the choice of the tuning parameters. We observe a different behavior of the global averaging approach compared to the other two (D-AVG, D-AVG-N), which behave alike; this can be explained by the similarity of those two methods. We also see that the scores for individual 3 reach a constant level: this individual is one of the longer time series, and hence it cannot even have more than 3 neighbors.
The actually selected numbers of neighbors are given in Table 12. See Table 1 for a description of all averaging approaches.

Model Evaluation
First, we evaluate the experiment on a global level. This is where the scaled errors come in handy since they are comparable across individuals. Figure 5 shows boxplots of the RMSSE values as in Eq. (11), split into training and test errors. Since the errors are skewed, we display them on a log scale. We clearly observe that the simple, non-model-based averaging approaches yield worse results than the ETS benchmark model. However, we do see minor improvements over the ETS forecasts by using distance- and performance-based averaging methods. Another interesting fact is that approaches not considering an individual's own information (S-AVG-N, D-AVG-N) also tend to yield worse results, indicating that this information should be included in the forecast procedure, even though these time series might be very short and difficult to forecast. For completeness, Figure 15 in the Appendix shows the original RMSSE values, but as mentioned before, this makes comparisons to the ETS benchmark model quite difficult.

Next, we want to dig deeper into the performance of selected individuals. For this, we selected 4 of the 43 individuals to analyze in more detail. Figure 6 shows the RMSSE with respect to ETS forecasts for these selected fridges. We also show standard errors of the RMSSE. Due to the different behaviors of the individuals, we also observe different performances of the averaging approaches. While for individual 3 we can improve forecasts using model-based averaging methods, this is not the case for individual 27. We also observe that the panel benchmark model PLM-POOL does not always yield better forecasts and is often even surpassed by one of the averaging approaches.
We can dig even deeper into evaluating the errors by looking at the evolution of the RMSSE. For reasons of clarity, Figure 7 shows the running RMSSE with respect to the RW forecasts for the distance-based averaging methods and the two benchmark methods ETS and PLM-POOL. The dashed vertical line indicates the split between training and test periods. These plots illustrate the possibility of improving the benchmark forecasts by smartly taking averages of neighbors' forecasts. Similar plots for the remaining averaging methods can be found in the Appendix.

Further Diagnostics
To better understand the differences in performance on the individual level, we perform further diagnostics. For a fixed neighborhood size of at most 5 neighbors, we first consider the median, minimum and maximum normalized DTW distance and their evolution over time. This analysis might already give hints on the homogeneity of the neighborhoods (Figure 8). We observe that for individuals 3, 15 and 29 the distances, and hence the neighborhoods, seem to stabilize after the first few steps, while for individual 27 the normalized distance increases until the end. This means that this individual becomes harder and harder to match. Overall, individual 27 is too heterogeneous and lacks the homogeneity needed to improve forecasts, as also indicated by the high mean 2-Wasserstein distance in Figure 10.
Another way to evaluate the neighborhoods is as follows. We consider the final training neighborhoods as the ground truth and compare each neighborhood yielded in the training process to this ground truth using the adjusted Rand index (Rand [1971], Hubert and Arabie [1985]). Figure 9 shows the adjusted Rand index for each time step. We see that individual 3 is perfectly matched from the beginning, which is not surprising since this individual only has very few possible neighbors.

Finally, we take a look at the 2-Wasserstein distances as discussed in A.3. Since we only introduced the Wasserstein distance for a basic ANN model, we want to go into more detail here. The one-step ahead point forecasts which we obtain and use for averaging are the means of the forecast distributions, hence we can compute the Wasserstein distances in a straightforward way. Namely, we have that
$$W_2^2\big(\hat{X}, \hat{Y}\big) = (\hat{x} - \hat{y})^2 + (\hat{\sigma}_X - \hat{\sigma}_Y)^2, \quad (23)$$
where $X, Y$ are two arbitrary ETS models with realized point forecasts $\hat{x}, \hat{y}$ as well as estimated standard deviations $\hat{\sigma}_X, \hat{\sigma}_Y$, respectively.
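For reference, the adjusted Rand index compares two partitions via pair counts and can be computed in a few lines; this is a generic pure-Python sketch (function name ours), not the implementation used for Figure 9:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # expectation under random labelings
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:  # degenerate partitions, e.g. all singletons
        return 1.0
    return (pairs_both - expected) / (max_index - expected)
```

An index of 1 corresponds to a perfect match with the ground-truth neighborhood, as observed for individual 3, while values near 0 indicate agreement no better than chance.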
Figure 10 shows the mean 2-Wasserstein distances as in Eq. (23) per selected individual. The neighborhoods considered here are the same as in Figure 8. The dashed lines indicate the overall mean distance. We clearly observe large distances for individual 27, which is consistent with its rather bad results. Since the DTW distances in Figure 8 are increasing for this individual, we cannot expect the Wasserstein distances to be small, and thus also cannot guarantee a reduction in forecast error. Indeed, the overall mean distance is 12.5, which is more than twice as large as the mean values for the remaining individuals selected in the analysis (3.2 for individuals 3 and 15, and around 5.2 for individual 29). This result empirically supports the extension of our motivating theoretical results.
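Since the one-step ahead forecast distributions are univariate Gaussians, the squared 2-Wasserstein distance of Eq. (23) reduces to a one-line computation; a sketch (function name ours):

```python
def wasserstein2_sq_gaussian(mean_x, sd_x, mean_y, sd_y):
    """Squared 2-Wasserstein distance between N(mean_x, sd_x^2) and
    N(mean_y, sd_y^2): (mean_x - mean_y)^2 + (sd_x - sd_y)^2."""
    return (mean_x - mean_y) ** 2 + (sd_x - sd_y) ** 2
```

The distance is zero exactly when the two forecast distributions agree in both mean and standard deviation.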

The Best Model
By minimizing the mean RMSSE values, we can choose the optimal averaging technique on the dataset. With a mean RMSSE value of 1.08 with respect to the ETS base forecast, the optimal averaging method is P-AVG-R. This rather high value is due to bad performance on a small number of individuals, skewing the mean value. The median value amounts to 0.85. Table 13 shows the results in terms of a number of error measures. While for individual 3 the averaging yields uniformly better results, this is not the case for individual 15, where the base ETS model and the pooled global model outperform the averaging methodology.
To test statistical significance, we follow the suggestions by Demsar [2006]. A non-parametric Friedman test is performed at an α-level of 5% for each error measure using the results for all individuals. This yields significant differences for all error measures ($p < 0.01$). Post-hoc pairwise Wilcoxon tests with Holm's α-level correction are then performed. However, these tests do not yield any significant pairwise differences between the methods for any error measure.
Additionally, we take a look at the distributional properties of the RMSSE values obtained on the test set and compare the methods based on how many individuals could be improved by averaging. First, Figure 11 shows the ordered RMSSE test set values with respect to the ETS forecast for each averaging method. The vertical lines indicate the percentile of individuals with smaller error compared to the ETS benchmark. For an overall useful methodology we want the vertical lines to be greater than 0.5, meaning we can improve the forecasts for more than 50% of the individuals. This is the case for the global averaging method G-AVG, the performance-based methods P-AVG and P-AVG-R, as well as the simple S-AVG, whereby the best improvement is obtained by the weighted average using errors of the refitted models (P-AVG-R). Furthermore, we see that all methods not including the individual itself, namely D-AVG-N and S-AVG-N, yield worse results. Also, the non-model-based methods which only regard the DTW matchings do not seem to be very useful at all, since they only decrease the forecast error compared to the ETS benchmark for around 36% and 38% of the individuals, respectively.

Conclusions
We present a general framework for the improvement of individual forecasts in a set of possibly heterogeneous time series. It is based on a theoretical motivation regarding simple state-space models in terms of exponential smoothing; however, this motivation may be extended to more general models. In the theoretical part of this work, new approaches are proposed by computing the DTW distance explicitly on random processes as well as by using the Wasserstein distance to compare the forecast distributions of state-space models. Additionally, dynamic time warping is introduced and described in a more rigorous way.
We apply a sequence matching procedure based on dynamic time warping which allows us to find similar time series in terms of their shape and which works for any two time series. To this end, we emphasize the use of asymmetric matching and the corresponding extension of the global averaging methodology to this type of matching. The transparency of the algorithm also enables us to perform diagnostics and to understand when this procedure is appropriate and yields reasonable results. This point also differentiates our work from machine learning approaches, which tend to be opaque and require a large total number of observations to work efficiently.
The models we use for our analysis are ETS models, yet any family of models can be used in our procedure, making it very flexible in practice. It also naturally extends automatic forecasting frameworks such as ARIMA or ETS. While the methodology may not be able to improve the performance of the base model all the time, it still yields competitive results. Overall, this framework fits many real-world applications where one encounters a set of heterogeneous and possibly short time series.
Some aspects are still to be considered. An extension of this work could use an adaptive, data-driven number of neighbors for each time series instead of a fixed one. Further, a robustification of the procedure could be beneficial since dynamic time warping in particular is quite sensitive to outliers in the time series.

A Methodology
A.1 Dynamic Time Warping
The set of all allowed warping functions $\Phi$ depends on the type of time warping, which is usually parametrized by a step pattern. This pattern defines the allowed values of $\varphi(k)$ given $\varphi(1), \ldots, \varphi(k-1)$ for each $k$.
Basic properties of a warping function are as follows.
• Monotonicity. We have $\varphi_X(k) \le \varphi_X(k+1)$, $\varphi_Y(k) \le \varphi_Y(k+1)$ for all $k$. This allows only reasonable matchings where the order of the sequences is unchanged.
• Slope. The possible (local) slope of $\varphi$ is given by $Q_{ks} = \big(\sum_{i=1}^{T_{ks}} q_i^{(ks)}\big) \big/ \big(\sum_{i=1}^{T_{ks}} p_i^{(ks)}\big)$, where $s$ denotes the step pattern of the warping function, $T_{ks}$ gives the number of allowed steps in the $k$-th step pattern path, and $p_i^{(ks)}, q_i^{(ks)}$ give the actual size of the steps in $X$ and $Y$ direction, respectively. Thus, we have that $p_i^{(ks)} = \varphi_X(i+1) - \varphi_X(i)$ for any $i$, and $q_i^{(ks)}$ analogously.
The slope is therefore constrained by $\min_{ks} Q_{ks} \le Q \le \max_{ks} Q_{ks}$.
More properties arise when moving to more concrete warping functions.
This distance is naturally symmetric and positive definite. However, it is not a metric since the triangle inequality is usually not fulfilled.
This pattern type can be useful when considering sequences of similar lengths. However, for differently sized sequences it might not work reasonably anymore due to the above constraints.

A.1.2 Asymmetric DTW
We are particularly interested in subsequence matching, where we relax the endpoint constraints. This is also denoted as open-begin open-end (OBE) matching [Giorgino, 2009]. The corresponding optimization problem reads
$$\mathrm{DTW}_{\mathrm{OBE}}(X, Y) := \min_{1 \le p \le q \le m} \mathrm{DTW}(X, Y^{(p,q)}),$$
where $Y^{(p,q)} = (Y_p, \ldots, Y_q)' \in \mathbb{R}^{(q-p+1) \times d}$ is a subsequence of $Y$, and $n \le m$. This type of matching requires an asymmetric step pattern. The most basic one is given by
$$g(i, j) = \min\big( g(i-1, j) + d(i, j),\; g(i-1, j-1) + d(i, j),\; g(i-1, j-2) + d(i, j) \big),$$
implying that continuity is no longer imposed; in fact, information in $Y$ can be skipped. The slope is constrained by $\min Q = 0$, $\max Q = 2$. Further, the distance is only normalizable by $n$ since we consider subsequences of $Y$ of different lengths. Because of the subsequence matching, we have initial values $g(1, p) = d(1, p)$ and set $g(n, q) =: \mathrm{DTW}_{\text{asym, OBE}}(X, Y)$.
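This recursion translates directly into a short dynamic program. The sketch below is illustrative (univariate sequences, squared local distance, our own function name), with the open begin realized by initializing the whole first row and the open end by minimizing over the last row, normalized by $n$:

```python
import math

def dtw_asym_obe(x, y):
    """Asymmetric open-begin/open-end DTW with local cost d(i, j) = (x_i - y_j)^2.

    Recursion: g(i, j) = min(g(i-1, j), g(i-1, j-1), g(i-1, j-2)) + d(i, j),
    i.e. the query index always advances while the reference index may stay,
    advance by one, or advance by two (skipping one reference value).
    """
    n, m = len(x), len(y)
    INF = math.inf
    g = [[INF] * m for _ in range(n)]
    for j in range(m):  # open begin: the match may start anywhere in y
        g[0][j] = (x[0] - y[j]) ** 2
    for i in range(1, n):
        for j in range(m):
            best = g[i - 1][j]
            if j >= 1:
                best = min(best, g[i - 1][j - 1])
            if j >= 2:
                best = min(best, g[i - 1][j - 2])
            if best < INF:
                g[i][j] = best + (x[i] - y[j]) ** 2
    return min(g[-1]) / n  # open end, normalized by the query length
```

A query that appears verbatim inside a longer reference attains distance zero, e.g. matching (1, 2, 3) inside (0, 1, 2, 3, 4).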

A.1.3 Other Step Patterns
Many more step patterns exist with different types of slope constraints and other properties. Unfortunately, it is often not very clear which step pattern is the most appropriate one to use. A general family of step patterns is defined in Sakoe and Chiba [1978] by the notion of symmetricP$x$ and asymmetricP$x$, respectively, where $x$ controls the slope parameter. Usual values are 0, 0.5, 1 and 2. The larger the slope can possibly be, the more complicated the dynamic programming equations tend to get. As a limiting step pattern for $x \to \infty$ one obtains the rigid step pattern. As suggested in Giorgino [2009], this step pattern is only reasonable when considering an OBE matching, since it finds the most appropriate subsequence without any gaps and, in fact, does not perform any time warping.
To summarize, DTW is a very flexible sequence matching method which yields the optimal matching as well as an associated distance between any two sequences.

A.2 DTW Averaging
Like any dissimilarity measure, DTW can be used to find a barycenter of a set of sequences $S$. A barycenter is usually defined to minimize the sum of distances to the set. Generally, $c \in X$ for some metric space $(X, \delta)$ is called a barycenter of $Y \subset X$ if
$$\sum_{z \in Y} \delta(c, z) \le \sum_{z \in Y} \delta(y, z) \quad (26)$$
for any $y \in X$.
In terms of time series and DTW we have $Y = \mathcal{T}$ as the finite set of time series of interest, $X \supset Y$ as the set of all possible time series up to some maximum length, and $\delta = \mathrm{DTW}$. However, the space of possible solutions is very large, which makes the search for the average difficult. Therefore, approximate solutions have been developed.
In the context of DTW, it is not straightforward to provide a definition for an average. For the symmetric case of DTW, Gupta et al. [1996] have considered pairwise, coordinate-wise averaging, where two sequences at a time are averaged into one sequence until only one sequence is left. This method is easy to implement, but a big downside is that it heavily depends on the order of the sequences. A different approach, introduced in Petitjean et al. [2011], is global averaging.
Starting from an initial sequence, the average is updated in each iteration based on all matchings of all sequences in the set. As mentioned in their paper, this heuristic naturally reduces the sum of the warping distances to the average in Eq. (26).
One aspect to consider is the length of the averaging sequence. In pairwise averaging, the average can grow to twice its length in each step. In global averaging, the length of the resulting average is fixed to the length of the initial sequence.
We adapt the global averaging methodology to the asymmetric DTW use case as follows. Denote by $\mathcal{T}$ a set of time series of different lengths. Our aim is to have a barycenter longer than the time series of interest, hence as the initial average time series we take the longest one available, i.e. $C_0 = \arg\max_{X \in \mathcal{T}} |X|$. Then we iteratively compute $C_i$, $i = 1, 2, \ldots$:
• Compute $\mathrm{DTW}_{\text{asym, OBE}}(X, C_i)$ for all $X \in \mathcal{T}$ to obtain the warping paths $\varphi^{(X)}(k) := (\varphi_X(k), \varphi_{C_i}(k))$ for $k = 1, \ldots, K_\varphi$.
• For each time step $t = 1, \ldots, |C_0|$ of $C_i$, denoted by $C_{i,t}$, let $\mathcal{C}_{i,t} := \{ X_j : X \in \mathcal{T},\, \varphi^{(X)}(k) = (j, t) \text{ for some } k \}$ be the set of all associated time steps.
• Update each time step of the averaging sequence by the mean of its associated values, i.e. $C_{i+1,t} = \operatorname{mean}(\mathcal{C}_{i,t})$ if $\mathcal{C}_{i,t} \ne \emptyset$, and $C_{i+1,t} = C_{i,t}$ otherwise.
We perform this iteration a fixed number of $I$ times or until the sum in Eq. (26) does not decrease anymore. We denote the resulting averaging time series by $\mathrm{aDBA}(\mathcal{T}) := C_{\min(S, I)}$, where $S > S^*$ is the first time of no further reduction after a start-up period $S^*$, and the averaging function $\mathrm{aDBA}: \mathcal{T} \to \mathcal{Z}$, where $\mathcal{Z}$ denotes the set of all possible time series.
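For intuition, the update step of global averaging can be sketched with the classic symmetric DTW step pattern (rather than the asymmetric OBE variant adapted above); function names are ours and the sketch is purely illustrative:

```python
def dtw_path(x, y):
    """Alignment path of classic DTW (symmetric steps, squared local cost)."""
    n, m = len(x), len(y)
    INF = float("inf")
    g = [[INF] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g[i][j] = (x[i - 1] - y[j - 1]) ** 2 + min(
                g[i - 1][j], g[i][j - 1], g[i - 1][j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:  # backtrack along locally optimal predecessors
        path.append((i - 1, j - 1))
        _, (i, j) = min((g[i - 1][j - 1], (i - 1, j - 1)),
                        (g[i - 1][j], (i - 1, j)),
                        (g[i][j - 1], (i, j - 1)))
    return path[::-1]

def dba_update(series, center):
    """One barycenter-averaging iteration: each center position is replaced
    by the mean of all values matched to it across the set of series."""
    buckets = [[] for _ in center]
    for x in series:
        for i, j in dtw_path(x, center):
            buckets[j].append(x[i])
    return [sum(b) / len(b) if b else c for b, c in zip(buckets, center)]
```

Repeating `dba_update` until the sum of warping distances stops decreasing mirrors the iteration described above.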

A.3 Theoretical Motivation
We want to motivate our approach of using the DTW distance. For that we consider just a simple case. Let $X$ be an ANN model [Hyndman et al., 2008]. An ANN model is also known as (simple) Exponential Smoothing. It does not have any trend or seasonality component, and for a time series $(X_t,\, t = 1, \ldots, n)$ it is given by the recursion
$$l^X_t = \alpha X_t + (1 - \alpha)\, l^X_{t-1}, \quad (27)$$
where $l^X$ denotes the level component of the model, which is also equal to the flat forecast $\hat{X}_{t+h|t}$ for any $h > 0$. The model parameter $\alpha$ is usually found by minimizing the sum of squared forecast errors or by maximum likelihood.
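The level recursion is straightforward to implement; a minimal sketch (function name ours):

```python
def ses_level(x, alpha, level0=0.0):
    """Level recursion of an ANN / simple exponential smoothing model:
    l_t = alpha * x_t + (1 - alpha) * l_{t-1}.  The flat h-step ahead
    forecast equals the final level."""
    level = level0
    for value in x:
        level = alpha * value + (1 - alpha) * level
    return level
```

For example, `ses_level([1.0, 1.0, 1.0], 0.5)` yields 0.875, converging towards the constant observations.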

A.3.1 Theoretical DTW Computation
Let $X, Y$ be two independent ANN models. Without loss of generality, we assume both initial values are equal to 0, i.e. $l^X_0 = l^Y_0 = 0$ (otherwise we could just look at $X - l^X_0$). We write $X = (X_1, \ldots, X_n)' \sim \mathrm{ANN}(\alpha_X, \sigma^2_X)$ and $Y = (Y_1, \ldots, Y_n)' \sim \mathrm{ANN}(\alpha_Y, \sigma^2_Y)$. For both we consider an equal length of $n = |X| = |Y|$. Considering Eq. (27), we can rewrite the recursive equation using a state-space representation as
$$X_t = l^X_{t-1} + \epsilon_t, \qquad l^X_t = l^X_{t-1} + \alpha_X \epsilon_t,$$
and analogously for $Y$ with innovations $\eta_t$, where the innovations $\epsilon_t \overset{\text{iid}}{\sim} N(0, \sigma^2_X)$, $\eta_t \overset{\text{iid}}{\sim} N(0, \sigma^2_Y)$ are assumed to also be pairwise independent for $t = 1, \ldots, n$. The innovations are the one-step ahead forecast errors given by $\epsilon_t = X_t - l^X_{t-1}$. The states $l^X$ are latent and only $X_t$ itself is observable.
Next, we give an explicit expression for the asymmetric DTW distance between $X$ and $Y$ when considering the squared $L_2$ cross-distance. We say $X$ is independent of $Y$ if and only if $X_i$ is independent of $Y_j$ for all $i, j$. This is a simple assumption; however, the following results can easily be extended to a more general setup.
Theorem 1. Let $X \sim \mathrm{ANN}(\alpha_X, \sigma^2_X)$, $Y \sim \mathrm{ANN}(\alpha_Y, \sigma^2_Y)$ be two independent and centered Exponential Smoothing processes of length $n$. Let $d(i, j) := E[(X_i - Y_j)^2]$ denote the cross-distance between $X$ and $Y$. Then the asymmetric DTW distance is given by
$$\mathrm{DTW}_{\text{asym}}(X, Y) = n(\sigma^2_X + \sigma^2_Y) + \sigma^2_X \alpha^2_X \binom{n}{2} + \sigma^2_Y \alpha^2_Y \left\lfloor \frac{n^2}{4} \right\rfloor. \quad (28)$$
Proof. First, note that the cross-distance is given by $d(i, j) = \sigma^2_X(1 + (i-1)\alpha^2_X) + \sigma^2_Y(1 + (j-1)\alpha^2_Y)$ due to the independence of $X_i, Y_j$ for every $i, j$. Next, we need to recursively compute the warping matrix $G$ as in Eq. (25). We have $g(1, 1) = d(1, 1) = \sigma^2_X + \sigma^2_Y$. Due to the cross-distance being increasing in both $i$ and $j$, we can easily solve Eq. (25) by
$$g(i, j) = \min\big( g(i-1, j),\, g(i-1, j-1),\, g(i-1, j-2) \big) + d(i, j), \quad i > \tilde{i},\; j > \tilde{j},$$
where $\tilde{i}, \tilde{j}$ denote the indices where the recursion needs more specific computation. We need this distinction because for small $i, j$ the minimum differs from the expression above due to some of the values not being assigned.
In detail, the index $i$ advances in every step, contributing $\sum_{i=1}^{n} (i-1) = \binom{n}{2}$, while the optimal path keeps $j$ as small as possible for as long as possible and then advances in steps of two, contributing $\sum_{i=1}^{n} (j_i - 1) = \lfloor n^2/4 \rfloor$. Altogether, we obtain, by induction, the closed form for $g(i, j)$; setting $i = j = n$ finishes the proof.
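The closed form can be checked numerically by applying the asymmetric recursion directly to the expected cross-distances $d(i, j) = \sigma^2_X(1 + (i-1)\alpha^2_X) + \sigma^2_Y(1 + (j-1)\alpha^2_Y)$; the sketch below (function names ours) computes both quantities for comparison:

```python
import math

def expected_dtw_closed_form(n, ax, sx2, ay, sy2):
    """Closed-form asymmetric DTW between the expected cross-distances of
    two independent centered ANN(alpha, sigma^2) processes of length n."""
    return (n * (sx2 + sy2)
            + sx2 * ax ** 2 * math.comb(n, 2)
            + sy2 * ay ** 2 * (n * n // 4))

def expected_dtw_dp(n, ax, sx2, ay, sy2):
    """Same quantity via the asymmetric dynamic-programming recursion
    applied to d(i, j) = sx2*(1+(i-1)*ax^2) + sy2*(1+(j-1)*ay^2)."""
    def d(i, j):  # expected squared cross-distance, 1-based indices
        return sx2 * (1 + (i - 1) * ax ** 2) + sy2 * (1 + (j - 1) * ay ** 2)
    INF = math.inf
    g = [[INF] * (n + 1) for _ in range(n + 1)]
    g[1][1] = d(1, 1)  # closed begin: the path starts at (1, 1)
    for i in range(2, n + 1):
        for j in range(1, n + 1):
            prev = min(g[i - 1][j],
                       g[i - 1][j - 1] if j >= 2 else INF,
                       g[i - 1][j - 2] if j >= 3 else INF)
            if prev < INF:
                g[i][j] = prev + d(i, j)
    return g[n][n]  # closed end at (n, n)
```

Both functions agree for arbitrary parameter combinations, which numerically confirms the combinatorial terms $\binom{n}{2}$ and $\lfloor n^2/4 \rfloor$.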

A.3.2 Relation to Wasserstein Distance
Since we are interested in the one-step ahead forecast, we may look at the corresponding forecast distributions. These are given by the conditional distributions $\hat{X}_{n+1|n} \sim N(l^X_n, \sigma^2_X)$ and $\hat{Y}_{n+1|n} \sim N(l^Y_n, \sigma^2_Y)$, respectively. Now we want to measure the distance between those two distributions. For that we use the 2-Wasserstein distance, which is defined as follows [Villani, 2009]. Let $\mu, \nu$ be two measures and $\pi(\mu, \nu)$ be the set of all couplings of $\mu$ and $\nu$. Then
$$W_2(\mu, \nu) := \left( \inf_{\gamma \in \pi(\mu, \nu)} \int \| x - y \|^2 \, \mathrm{d}\gamma(x, y) \right)^{1/2}.$$
In the case of random variables we can also write $W_2(X, Y) = \inf \big\{ E[\|X' - Y'\|^2]^{1/2} : X' \sim X,\; Y' \sim Y \big\}$ for any two random variables $X, Y$. If both are Gaussian, i.e. $X \sim N(\mu_1, \Sigma_1)$ and $Y \sim N(\mu_2, \Sigma_2)$, then the squared 2-Wasserstein distance is readily computed by
$$W_2^2(X, Y) = \| \mu_1 - \mu_2 \|_2^2 + \operatorname{tr}\Big( \Sigma_1 + \Sigma_2 - 2 \big( \Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} \big)^{1/2} \Big). \quad (30)$$
Details are available in the work of Olkin and Pukelsheim [1982]. Applying Eq. (30) to the forecast distributions of the ANN models, we can calculate the 2-Wasserstein distance between $\hat{X}_{n+1|n}$ and $\hat{Y}_{n+1|n}$, yielding
$$W_2^2(\hat{X}_{n+1|n}, \hat{Y}_{n+1|n}) = (l^X_n - l^Y_n)^2 + (\sigma_X - \sigma_Y)^2. \quad (31)$$
Thus, both Eq. (28) and Eq. (31) are quadratic in the model parameters of ANN, allowing us to state the following theorem.
Theorem 2. Let $\mathcal{X}$ be the space of independent ANN processes of length $n > n(p)$ for some $p \in (0, 1)$, equipped with the asymmetric DTW distance, and $\mathcal{Y}$ be the space of corresponding Gaussian forecast distributions equipped with the squared 2-Wasserstein distance.
Then the map $\mathcal{X} \to \mathcal{Y}: X \mapsto \hat{X}_{n+1|n}$ is Lipschitz-continuous with Lipschitz constant $L < 1$ with probability at least $p$.
Proof. Let $X, Y \in \mathcal{X}$ be two arbitrary ANN processes. We have by Eq. (31) that $W_2^2(\hat{X}_{n+1|n}, \hat{Y}_{n+1|n}) = (l^X_n - l^Y_n)^2 + (\sigma_X - \sigma_Y)^2$ for fixed states $l^X_n, l^Y_n$. Since $l^X_n, l^Y_n$ are both realizations of independent Gaussian random variables with $l^X_n \sim N(0, n\sigma^2_X\alpha^2_X)$ and $l^Y_n \sim N(0, n\sigma^2_Y\alpha^2_Y)$, we obtain
$$(l^X_n - l^Y_n)^2 \le q_p \, n (\sigma^2_X \alpha^2_X + \sigma^2_Y \alpha^2_Y) \quad (32)$$
with probability at least $p$, where $q_p$ is the $p$-quantile of a $\chi^2(1)$ distribution. Then using Eq. (32) yields
$$\frac{W_2^2(\hat{X}_{n+1|n}, \hat{Y}_{n+1|n})}{\mathrm{DTW}_{\text{asym}}(X, Y)} < 1,$$
using that $\binom{n}{2}, \lfloor n^2/4 \rfloor > n$ for $n > n(p)$. Thus, the map $X \mapsto \hat{X}_{n+1|n}$ is Lipschitz-continuous with constant $L < 1$ with probability at least $p$.
Remark. If we want Lipschitz-continuity with at least 95% probability, then $q_{0.95} < 4$ and Theorem 2 holds with $n > 16$.
Remark. This result also holds when looking at the normalized DTW measure, but then with Lipschitz constant $L > 1$, so the map is not a contraction anymore.
Remark. The theorem also tells us that time series which are close in terms of DTW are also close in their corresponding forecast distributions, both in mean and variance. Further, it assures us that small changes in the time series only affect the difference in forecast distributions by a small amount.
More detailed results are obtained when considering the mean forecasts given by $E[X_{n+1|n}] := E[X_{n+1} \mid l^X_n] = l^X_n \sim N(0, n\sigma^2_X\alpha^2_X)$ and $E[Y_{n+1|n}] = l^Y_n \sim N(0, n\sigma^2_Y\alpha^2_Y)$, respectively. The corresponding 2-Wasserstein distance is computed to be
$$W_2^2\big( E[X_{n+1|n}],\, E[Y_{n+1|n}] \big) = n\, (\sigma_X \alpha_X - \sigma_Y \alpha_Y)^2.$$
Theorem 3. Let $\mathcal{X}$ be the space of independent ANN processes of length $n > 5$ equipped with the asymmetric DTW distance, and $\mathcal{Y}$ be the space of corresponding Gaussian forecast distributions equipped with the squared 2-Wasserstein distance.
Then the map $\mathcal{X} \to \mathcal{Y}: X \mapsto E[X_{n+1|n}]$ is Lipschitz-continuous with Lipschitz constant $L < 1$.
Proof. Let $X, Y \in \mathcal{X}$ be two arbitrary ANN processes. We have that
$$W_2^2\big( E[X_{n+1|n}],\, E[Y_{n+1|n}] \big) = n (\sigma_X \alpha_X - \sigma_Y \alpha_Y)^2 \le n (\sigma^2_X \alpha^2_X + \sigma^2_Y \alpha^2_Y) < \mathrm{DTW}_{\text{asym}}(X, Y),$$
using that $\binom{n}{2}, \lfloor n^2/4 \rfloor > n$ for $n > 5$. Thus, the map $X \mapsto E[X_{n+1|n}]$ is Lipschitz-continuous with constant $L < 1$.

A.3.3 Reduction of Mean Squared Error
Another result concerns the relation of DTW and the mean squared error of a convex combination of the mean forecasts. Let $\hat{Z}_{n+1|n}(w) := w E[X_{n+1|n}] + (1-w) E[Y_{n+1|n}]$ for $w \in [0, 1]$. We have $\hat{Z}_{n+1|n}(w) = w l^X_n + (1-w) l^Y_n$. However, in practice, the states $l_n$ are not known and need to be estimated by actually estimating the smoothing parameter $\alpha$. To this end, assume there exist unbiased estimators $\hat{\alpha}_X, \hat{\alpha}_Y$ for $\alpha_X, \alpha_Y$, respectively. We further assume they have finite second moments. The corresponding estimated forecast is denoted by $\hat{z}(w)$. Then, under certain conditions on the estimation errors made for $\alpha_X, \alpha_Y$, we have the following result.
Theorem 4. Let $X \sim \mathrm{ANN}(\alpha_X, \sigma^2_X)$, $Y \sim \mathrm{ANN}(\alpha_Y, \sigma^2_Y)$ be two independent and centered Exponential Smoothing processes of length $n$ with known variances $\sigma^2_X, \sigma^2_Y$. Then a convex combination $\hat{Z}$ of the forecasts reduces the mean squared error, i.e. $E[(X_{n+1} - \hat{z}(w))^2] \le E[(X_{n+1} - \hat{l}^X_n)^2]$, provided that $\mathrm{MSE}(\hat{\alpha}_Y) \le \frac{\sigma^2_X}{2\sigma^2_Y} \big( (1 - w^2)\,\mathrm{MSE}(\hat{\alpha}_X) - \mathrm{DTW}_{\text{asym}}(X, Y)/(n\sigma^2_X) \big)$.
Proof. We have that
$$E[(X_{n+1} - \hat{z}(w))^2] = E\big[ \big(l^X_n + \epsilon_{n+1} - (w \hat{l}^X_n + (1-w) \hat{l}^Y_n)\big)^2 \big] = \sigma^2_X + E\big[ \big(l^X_n - (w \hat{l}^X_n + (1-w) \hat{l}^Y_n)\big)^2 \big],$$
using the independence of the error term $\epsilon_{n+1}$. Further, we obtain
$$E[(X_{n+1} - \hat{z}(w))^2] \le \sigma^2_X + E[(l^X_n - l^Y_n)^2] + w^2 E[(\hat{l}^X_n - \hat{l}^Y_n)^2] + E[(l^Y_n - \hat{l}^Y_n)^2] = \sigma^2_X + n(\sigma^2_X \alpha^2_X + \sigma^2_Y \alpha^2_Y) + w^2 \big( E[(\hat{l}^X_n)^2] + E[(\hat{l}^Y_n)^2] \big) + E[(l^Y_n - \hat{l}^Y_n)^2].$$
We can quickly compute the last terms by $E[(l^Y_n - \hat{l}^Y_n)^2] = n\sigma^2_Y\,\mathrm{MSE}(\hat{\alpha}_Y)$ and $E[(\hat{l}^X_n)^2] = n\sigma^2_X E[\hat{\alpha}^2_X] = n\sigma^2_X \big(\mathrm{MSE}(\hat{\alpha}_X) + \alpha^2_X\big)$, and analogously for $\hat{l}^Y_n$. In total we get
$$E[(X_{n+1} - \hat{z}(w))^2] \le \sigma^2_X + n\sigma^2_X \big( (1 + w^2)\alpha^2_X + w^2\,\mathrm{MSE}(\hat{\alpha}_X) \big) + n\sigma^2_Y \big( (1 + w^2)\alpha^2_Y + 2\,\mathrm{MSE}(\hat{\alpha}_Y) \big) \overset{!}{\le} E[(X_{n+1} - \hat{l}^X_n)^2] = \sigma^2_X + n\sigma^2_X\,\mathrm{MSE}(\hat{\alpha}_X).$$
Using that $n\sigma^2_X (1 + w^2)\alpha^2_X + n\sigma^2_Y (1 + w^2)\alpha^2_Y \le \mathrm{DTW}_{\text{asym}}(X, Y)$, this finally yields
$$E[(X_{n+1} - \hat{z}(w))^2] \le \sigma^2_X + \mathrm{DTW}_{\text{asym}}(X, Y) + n\big( w^2 \sigma^2_X\,\mathrm{MSE}(\hat{\alpha}_X) + 2\sigma^2_Y\,\mathrm{MSE}(\hat{\alpha}_Y) \big) \le \sigma^2_X + n\sigma^2_X\,\mathrm{MSE}(\hat{\alpha}_X),$$
if $\mathrm{MSE}(\hat{\alpha}_Y) \le \frac{\sigma^2_X}{2\sigma^2_Y} \big( (1 - w^2)\,\mathrm{MSE}(\hat{\alpha}_X) - \mathrm{DTW}_{\text{asym}}(X, Y)/(n\sigma^2_X) \big)$.
Remark. In practice, the previous theorems tell us that if $X, Y$ are close in terms of DTW and, additionally, the estimation error made when estimating $\alpha_Y$ is smaller than the error made for $\alpha_X$, then the convex combination forecast improves the point forecast for $X_{n+1}$.
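The practical takeaway can be illustrated with a toy Monte Carlo experiment: combining a noisy forecast with a second, better-estimated forecast of a nearby level reduces the one-step squared error. This is an illustrative simulation under simplified assumptions (a shared latent level, Gaussian estimation noise), not the paper's setup:

```python
import random

def mse_single_vs_combined(n_rep=20000, w=0.5, noise_x=1.0, noise_y=0.6, seed=7):
    """Toy Monte Carlo: one-step MSE of a single noisy forecast versus a
    convex combination with a second, better-estimated forecast.

    Both forecasts target the same latent level (a simplifying assumption);
    noise_x and noise_y play the role of the estimation errors of the
    smoothing parameters in Theorem 4.
    """
    rng = random.Random(seed)
    level = 2.0  # shared latent level (assumption of the toy setup)
    se_single = se_comb = 0.0
    for _ in range(n_rep):
        x_next = level + rng.gauss(0.0, 1.0)   # next observation of X
        fx = level + rng.gauss(0.0, noise_x)   # noisy forecast from X itself
        fy = level + rng.gauss(0.0, noise_y)   # better-estimated neighbor forecast
        comb = w * fx + (1 - w) * fy
        se_single += (x_next - fx) ** 2
        se_comb += (x_next - comb) ** 2
    return se_single / n_rep, se_comb / n_rep
```

With these settings, the theoretical MSEs are $1 + 1 = 2$ for the single forecast and $1 + 0.25 \cdot 1 + 0.25 \cdot 0.36 \approx 1.34$ for the combination, so the simulated combined error comes out clearly smaller.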
The condition $\mathrm{MSE}(\hat{\alpha}_Y) \ll \mathrm{MSE}(\hat{\alpha}_X)$ might arise for various reasons. In an application, the fit of the model might be better for $Y$ than for $X$, resulting in a better estimation of $\alpha$.

A.3.4 Conclusions
These theoretical results provide arguments for our methodology in the most simple cases of models. However, the arguments might be extended to a broader family of models as given in Hyndman et al. [2008]. In practice, we also need to use open-begin open-end matching since the asymmetric DTW measure cannot be computed once the reference time series is too long. Still, similar arguments should hold and motivate our approach of using DTW neighborhoods and performing model averaging. The theory does not give us hints on how to choose the optimal weights, hence we propose the weights of Section 2.3.

Figure 1: Overall workflow of the proposed averaging methodology.

Figure 2: Runtime analysis for each dataset and selected averaging methods. The shown numbers are averaged runtimes of the TSCV procedure.
Table 12: Optimal neighborhood sizes based on TSCV.

Figure 3: Example time series of the Food Demand dataset.
Figure 4: TSCV scores for distance-based averaging. The vertical lines indicate the choice of the tuning parameters.

Figure 5: Global RMSSE with respect to ETS forecasts.

Figure 6: Individual RMSSE with respect to ETS forecasts.
Figure 7: Running RMSSE with respect to random-walk forecasts for distance-based averaging and the benchmark methods.

Figure 10: Wasserstein distances of the forecast distributions.

Figure 11: Percentile plots for each averaging method with respect to the ETS forecast. The dashed lines show the performance of the global panel benchmark model. The dotted lines indicate the 50% percentile.

Figure 12: TSCV scores for simple no-model averaging. The dashed vertical lines indicate the optimal number of neighbors.
Figure 13: TSCV scores for simple averaging. The dashed vertical lines indicate the optimal number of neighbors.

Figure 14: TSCV scores for performance-based averaging. The dashed vertical lines indicate the optimal number of neighbors.
Figure 15: Original RMSSE values.

Figure 17: Running RMSSE with respect to random-walk forecasts for simple no-model averaging.

Figure 18: Running RMSSE with respect to random-walk forecasts for simple averaging.
D-AVG: Model Avg. using distance-based weights
D-AVG-N: Model Avg. using distance-based weights and only neighbors
P-AVG: Model Avg. using performance-based weights
P-AVG-R: Model Avg. using refitted performance-based weights
S-NM-AVG: No-Model Avg. using simple weights
D-NM-AVG: No-Model Avg. using distance-based weights

Table 2 :
Summary of datasets.

Table 3 :
Summary of training parameters. $T_0$ denotes the initial training time step, $T_{\text{train}}$ the final training time step, and $L_{\max}$ the maximum lag in the pooled model.

Table 4 :
The best results are in bold.

Table 5 :
The best results are in bold.

Table 6 :
The best results are in bold.

Table 7 :
The best results are in bold.

Table 8 :
Cumulative one-step ahead MAE (p-value 0.564). The best results are in bold.

Table 9 :
Cumulative one-step ahead RMSE (p-value 0.564). The best results are in bold.

Table 11 :
Cumulative one-step ahead SMAPE (p-value 0.948). The best results are in bold.

Table 13 :
Test errors for benchmark models and the best averaging method (P-AVG-R). The best models are in bold.