An ensemble learning strategy for panel time series forecasting of excess mortality during the COVID-19 pandemic

Quantifying and analyzing excess mortality in crises such as the ongoing COVID-19 pandemic is crucial for policymakers. Traditional measures fail to take into account differences in the level, long-term secular trends, and seasonal patterns in all-cause mortality across countries and regions. This paper develops and empirically investigates the forecasting performance of a novel, flexible, and dynamic ensemble learning with model selection strategy (DELMS) for the seasonal time series forecasting of monthly respiratory disease death data across a pool of 61 heterogeneous countries. The strategy is based on a Bayesian model averaging (BMA) of heterogeneous time series methods involving the selection of the subset of best forecasters (model confidence set), the identification of the best holdout period for each contributing model, and the determination of optimal weights using out-of-sample predictive accuracy. A model selection strategy is also developed to remove the outlier models and to combine the models with reasonable accuracy in the ensemble. The empirical outcomes of this large set of experiments show that the accuracy of the BMA approach is significantly improved with DELMS when selecting a flexible and dynamic holdout period and removing the outlier models. Additionally, the forecasts of respiratory disease deaths for each country are highly accurate and exhibit a high correlation (94%) with COVID-19 deaths in 2020.



Introduction
In the time series forecasting literature, two techniques compete: forecast model selection [1] and forecast model combination [2]. The traditional approach to forecasting seasonal and non-seasonal time series is to select a single best model from a pool of candidate models based on certain criteria or by employing a given technique [3], potentially neglecting model risk. The ensemble prediction method is widely considered a promising strategy, and it has been used with considerable success in research and industry thanks to the availability of a wide variety of individual models. Since Bates and Granger's [4] seminal study, a growing number of linear and non-linear univariate and multivariate time series methods [5,6] and statistical machine learning techniques [7][8][9] have been proposed to increase short- and long-term predictive accuracy in relation to a wide range of problems, including stochastic population (mortality, fertility, and net migration) forecasting [10], epidemiological and excess mortality forecasting [11], meteorology [5], and finance [12,13], among others. Indeed, several comprehensive theoretical and empirical studies have confirmed the superior predictive performance of ensemble methods exploiting a variety of approaches, including stacking and blending to improve predictions, bagging to decrease variance, boosting to decrease bias [14,15], and Bayesian model averaging (BMA) [16][17][18]. When adopting this empirical strategy, choices have to be made as to which models to include in the combined pool and as regards the contribution (weight) of each model to the final prediction.
Here, a significant body of literature has examined optimal model combination weights [19,20], focusing on the selection of optimal combination schemes and weights [21,22], assigning equal weights to the set of superior models [23], or selecting a subset of best models from among the set of candidates (the model confidence set) using a dynamic trimming scheme that considers each model's out-of-sample forecasting performance in the validation period [24,25]. Similarly, to cope with concept drift, memory, change detection, learning, and loss estimation, adaptive algorithms have been proposed [26].
However, experiments are usually conducted with a single holdout set so as to pick, manually, pools of models that perform best for a given series type [27]. This is often motivated by a lack of computational power, as well as by time limitations when checking and allocating one specific holdout from a holdout set to each individual model; researchers therefore prefer to consider a fixed holdout for all models, thereby facilitating their combination in an ensemble model. In fact, different holdouts for different models result in training samples of different lengths, so that combining models fitted on series of different lengths becomes a challenge. However, ignoring the different holdouts for each model reduces adaptability and undermines generalization. This paper proposes a dynamic ensemble learning strategy (DELMS) organized in different layers for panel time series that not only overcomes the limitations of single-model methods, but also addresses those of recent ensemble models with fixed holdout sets and fixed thresholds for model selection. The strategy combines twelve models, including the models suggested by the M4 Competition [28] as benchmarks and standards for comparative purposes. We also consider model candidates that ensure sufficient diversification of statistical models, specifically SARIMA, and DNN models, including the multi-layer perceptron (MLP), to yield more robustly accurate forecasts. We then consider different holdouts and different time series lengths in one layer, and use the other layers to select the best models for each series. In this way, DELMS is able to generate effective and robust forecasts, separate the pattern from the noise, and overcome overfitting problems. The strategy is based on a Bayesian model averaging (BMA) that combines the heterogeneous models with the lowest error measure to generate an ensemble.
It applies the selection of the subset of best forecasters (model confidence set) to be included in the forecast combination, the identification of the best holdout period for each contributing model, and the determination of optimal weights using out-of-sample predictive accuracy. A model selection strategy is also developed to remove the outlier models and to combine the models with reasonable accuracy in the ensemble. In short, the proposed ensemble learning procedure (DELMS) involves: (i) setting the different holdouts to be checked for each contributing model; (ii) choosing the best holdout for each model based on out-of-sample forecasting accuracy; (iii) selecting the subset of best forecasters (model confidence set), using a variable trimming scheme in which a multiple of the forecasting accuracy metric range obtained across all candidate models is used as the threshold for model exclusion; (iv) determining the posterior probabilities (weights) of each model, using the normalized exponential (Softmax) function; and, finally, (v) obtaining ensemble forecasts based on the law of total probability, considering the model confidence set and the corresponding model weights. Unlike previous approaches that have focused on either selecting optimal combination schemes and weights or equally weighting a subset of best forecasters, our novel ensemble procedure involves identifying the best holdout period for each model, selecting the best forecasting models, and determining the optimal weights based on the out-of-sample forecasting performance for each dataset.
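Steps (i)-(iv) above can be made concrete with a short, purely illustrative Python sketch (the paper's actual implementation is in R and runs per country over twelve models; the function names and toy interface below are ours). Each candidate model is represented as any function mapping a training series and a horizon to a forecast; the threshold form for step (iii) is assumed to be the mean SMAPE plus θ times the SMAPE range.

```python
# Illustrative sketch of DELMS steps (i)-(iv); hypothetical names, not the
# paper's R code. Step (v), the weighted combination, is a weighted average
# of the surviving models' forecasts.
from math import exp

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in [0, 2]."""
    return sum(abs(f - a) / ((abs(a) + abs(f)) / 2)
               for a, f in zip(actual, forecast)) / len(actual)

def delms_weights(series, models, holdouts=(3, 5, 7), theta=0.5):
    # (i)-(ii): pick the best holdout per model by out-of-sample SMAPE
    scores = {}
    for name, fit_forecast in models.items():
        scores[name] = min(
            smape(series[-h:], fit_forecast(series[:-h], h)) for h in holdouts
        )
    # (iii): model confidence set -- drop models beyond the trimming threshold
    lo, hi = min(scores.values()), max(scores.values())
    gamma = sum(scores.values()) / len(scores) + theta * (hi - lo)
    kept = {n: s for n, s in scores.items() if s <= gamma}
    # (iv): Softmax weights -- smaller error, larger weight
    xi = {n: s / max(kept.values()) for n, s in kept.items()}
    z = sum(exp(-x) for x in xi.values())
    return {n: exp(-x) / z for n, x in xi.items()}
```

Here the holdouts are expressed in observations rather than years to keep the sketch self-contained.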
To demonstrate empirically the robustness of our approach, we use monthly respiratory disease death data for 61 heterogeneous countries to estimate excess mortality during the COVID-19 pandemic. Excess mortality is the number of deaths attributable to all causes above and beyond mortality predictions under normal (baseline) circumstances for a given period in a population. Clearly, quantifying and analyzing excess mortality attributable to the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic is of great relevance for policymakers, public health officials, and epidemiologists [29,30] and, in this sense, any improvement in such forecasts is to be welcomed. Excess mortality is typically measured by national or supranational statistical agencies using the absolute, relative (P-score) or standardized (Z-score) number of ''excess'' deaths, where the benchmark is often computed quite naïvely, by using, for instance, the simple average of the previous years' deaths. The EuroMOMO project (https://www.euromomo.eu) is a notable example of this, with baseline mortality modeled using a generalized linear model corrected for overdispersion, assuming that the number of deaths follows a Poisson distribution. However, this approach does not account for differences in the level, long-term secular trends, and seasonal patterns in all-cause mortality across countries and regions. Additionally, empirical studies show that it is hard to find a single, widely accepted forecasting method (if, indeed, one exists) that performs consistently well across all datasets and time horizons [31]. Besides, data quality is another concern, being responsible for biased and inconsistent parameter estimates and leading to flawed conclusions [32]. This is a matter of untold concern for forecasters of ensemble learning predictive models when seeking to predict, for example, numbers of deaths or when an economy should be re-opened [33,34], among others.
Moreover, it makes excess mortality a highly appropriate case for comparing the experiences of different countries or regions, where either the degree of misdiagnosis/underreporting or the problems of data quality may differ [35]. Our hypothesis is that the approach proposed herein decreases the individual error of the ensemble members, compared to standard model selection with equal holdouts for the selected models, without overly decreasing the diversity between them. We examine the run times, accuracy, level of contribution, and error metrics of the proposed ensemble techniques and compare them with those of the well-known ensemble model without dynamic holdouts and model selection, as well as with the individual forecasting models. This article presents a suitable ensemble time series strategy with improved predictive accuracy and, we believe, helps demonstrate which time series techniques contribute most to ensembles.
The remaining sections of the paper are organized as follows. In Section 2, we describe the materials, methods, and related works considered in undertaking this study. Section 3 outlines an extensive set of experiments on respiratory disease deaths in 61 countries and the results. The main discussion and conclusions are reported and discussed in Section 4. Finally, future research proposals are presented in Section 5.

Materials and methods
Here, we propose a meta-learning approach for adapting the ensemble to the best combination of forecasting models. The candidate models are extracted from different layers with the best holdout for each contributed model and each panel member.
Figs. 1 and 2 provide graphical overviews of the materials and methods employed to develop our proposed strategy. We use multiple learning processes to improve the predictive performance of the ensemble, which is built using a learning approach for the candidates addressed in the last layer. In this section, we discuss these techniques in brief and highlight their contributions.

Layered learning and the ensemble learning strategy
The layered learning approach as applied to time series data consists of breaking a forecasting problem down into simpler subtasks that occupy different layers. Each layer addresses a different predictive task, and the output of one layer can be used as input for the next layer [36]. In this study, the first task is to obtain a direct mapping of the time series for the different countries, combining the individual time series algorithms and producing the ensemble model as the final output. This means the first-layer task is to find the best holdout for each panel member and for each time series algorithm. This facilitates the second-layer task of model selection, which in turn facilitates the identification of the model confidence set of best forecasters in the last layer [37].
This makes it possible to maximize forecasting accuracy in panel time series, a target that is achieved dynamically, and to adapt the model's learning process to possible unexpected shocks.
Along with the layered learning approach, our ensemble method runs multiple learning algorithms employing adaptive heuristics that combine forecasters. As a result, we obtain a better predictive performance than might be obtained from any of the constituent learning algorithms. Our strategy comprises several selected models (see Table 1, line 10), with the best performance being based on minimum error measures. Each model considers different holdouts for the problem at hand and selects the best holdout in each case. This leads to a more robust overall performance of the ensemble, as it increases the diversity of the holdouts; however, the time series length then differs according to the holdout chosen, which would constitute a problem for the ensemble layer were we to merge models of different lengths. Thus, we force all the selected models to be of equal length, so that eventually the length of the ensemble equals the minimum time series length in our set. Although this windowing strategy provides each forecaster's best prediction and, therefore, the ensemble's best performance, it is clear that to obtain the best results the length of all the time series should be sufficiently large and almost the same.
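The equal-length (windowing) requirement just described can be illustrated with a small helper that trims every model's series to the length of the shortest one, keeping the most recent observations (a hypothetical sketch; the paper's R code handles this internally):

```python
def align_to_shortest(model_outputs):
    """Trim each model's series to the length of the shortest one, keeping
    the most recent observations, so the series can be combined."""
    n = min(len(s) for s in model_outputs.values())
    return {name: s[-n:] for name, s in model_outputs.items()}
```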
Let each candidate model be denoted by M_l, l = 1, ..., L, representing a set of probability distributions in which the ''true'' data-generating process is assumed to be included, comprising the likelihood function L(y|θ_l, M_l) of the observed data y in terms of model-specific parameters θ_l and a set of prior probability densities for these parameters, p(θ_l|M_l). Consider a quantity of interest ∆ present in all models, such as a future observation of y. The marginal posterior distribution across all models is

p(∆|y) = ∑_{l=1}^{L} p(∆|y, M_l) p(M_l|y),  (1)

where p(∆|y, M_l) denotes the forecast PDF based on model M_l, and p(M_l|y) is the posterior probability of model M_l given the observed data, with ∑_{l=1}^{L} p(M_l|y) = 1. The weight assigned to each model M_l is given by its posterior probability

p(M_l|y) = p(y|M_l) p(M_l) / ∑_{m=1}^{L} p(y|M_m) p(M_m).  (2)

The workflow of our proposed method is presented in Fig. 2. To identify the model confidence set and compute the model weights, we first set the different holdouts to be checked for each contributing model in each dataset. Let H = {h_1, h_2, ..., h_k} represent the set of holdout periods to be considered in the estimation procedure (see Fig. 2, Layer L1). The second step involves selecting the best holdout for each candidate model based on the out-of-sample forecasting accuracy measure (see Fig. 2, Layer L2). We use the symmetric mean absolute percentage error (SMAPE) as our measure of forecasting accuracy (see Table 1, line 8).¹ To select the best holdout for each model, we tested different holdout values from three to ten years, considering the holdout set H = {3, 5, 7} years² (see Table 1, line 7) as representative of the short, medium, and long term, and compared the SMAPE values at each iteration, retaining the model with the lowest SMAPE as the candidate for the model confidence set selection step. This provides the strategy with an opportunity to cover different parts of the data space and to handle different dynamic regimes in the different candidate time series.
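For point forecasts, the BMA mixture in (1) reduces to a posterior-probability-weighted average of the individual model forecasts. A minimal sketch (hypothetical names; combining full forecast densities would require the model PDFs rather than point values):

```python
def bma_point_forecast(model_forecasts, posterior):
    """Posterior-weighted average of model point forecasts, as implied by
    the law of total probability for point predictions."""
    horizon = len(next(iter(model_forecasts.values())))
    return [sum(posterior[m] * model_forecasts[m][t] for m in model_forecasts)
            for t in range(horizon)]
```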
Additionally, it ensures the final ensemble model is able to manage the limitations of each in the others.
Third (see Fig. 2, Layer L3), the subset of best forecasters is selected using the best holdout period (see Table 1, lines 21-34) and a variable trimming scheme in which a multiple θ (pre-set at 0.5) of the distance between the maximum and minimum values of the forecasting error metric is used as the threshold for model exclusion, i.e., using

Γ_g = (1/L) ∑_{l=1}^{L} SMAPE_{g,l} + θ (max_l SMAPE_{g,l} − min_l SMAPE_{g,l}),  (3)

where SMAPE_{g,l} is the SMAPE value for model l in panel member (country) g (see Table 1, lines 35-36). For each panel member, if the error of a candidate model is greater than the Γ_g indicator (i.e., SMAPE_{g,l} > Γ_g), the model is excluded from the model confidence set and from the ensemble forecast computation (Table 1, lines 37-43), i.e., it is assigned zero weight in (1).

¹ We avoid using the AIC and the BIC because the candidate models belong to different model classes, so the likelihood is computed differently. For selected models in the same class, the BIC is useful and is used automatically by the algorithm to select, for instance, a SARIMA model from among the candidate SARIMA models. Another problem associated with the error term in ensemble modeling could in principle be avoided by using accuracy measures whose formula contains a logarithm, such as MSLE, RMSLE, and SLE; however, in the work conducted here, the program would be interrupted because some of the algorithms potentially produce negative values in these measures.
² The results for the other holdout periods are consistent with those reported in this paper.
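As an illustration, the trimming rule can be written as a short function. This is a sketch under the assumption that the exclusion threshold is the mean SMAPE across candidates plus θ times the SMAPE range:

```python
def model_confidence_set(smape_by_model, theta=0.5):
    """Keep only models whose SMAPE does not exceed the trimming threshold
    (assumed here: mean SMAPE + theta * (max - min) across candidates)."""
    vals = smape_by_model.values()
    gamma = sum(vals) / len(smape_by_model) + theta * (max(vals) - min(vals))
    return {m: s for m, s in smape_by_model.items() if s <= gamma}
```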
Depending on the distribution of the SMAPE values, the number of models excluded from the model confidence set will vary. From a frequentist point of view, building up a model confidence set is a way of summarizing the relative forecasting performances of the entire set of candidate models and identifying the set of statistically best forecasters. The advantages of the statistic defined in (3) are its simplicity, ease of application, and interpretability. Moreover, its value falls between the errors of the models close to the rest and those of the extremely distant ones. The distant forecasting models are removed from the ensemble, which is ideal for avoiding overfitting and controlling redundancy in the output of the ensemble model. Our intuition is that the models with a minimum error are closest to the actual data-generating process. Yet, comparing the error measure with the mean of the errors removes only those models that are extremely distant from the other candidate models. This upholds the diversity of the selected models and avoids the overfitting problem.
Fourth, the posterior probabilities of the best forecaster models (model weights) are computed using the normalized exponential (Softmax) function,

p(M_l|y) = exp(−ξ_l) / ∑_{m=1}^{L} exp(−ξ_m),  (4)

with ξ_l = S_l / max{S_l}_{l=1,...,L} and S_l := SMAPE_{g,l}. The Softmax function is a generalization of the logistic function that is often employed in classification and forecasting exercises using traditional machine learning and deep learning methods as a combiner or activation function [38]. The function assigns larger weights to models with smaller forecasting errors, with the weights decaying exponentially as the error grows (see Table 1, lines 11-13). Fifth, the BMA forecasts are obtained based on the law of total probability (1), considering the model confidence set and the corresponding model weights (4). The sampling distribution of the ensemble forecast of the quantity of interest is a mixture of the individual model sampling distributions (see Table 1, lines 44-53). The pseudocode of the proposed methodology is listed in Table 1.
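The Softmax weighting can be sketched in a few lines, assuming weights proportional to exp(−ξ_l) so that larger scaled errors decay exponentially:

```python
from math import exp

def softmax_weights(smape_by_model):
    """Normalized-exponential (Softmax) weights: smaller SMAPE yields a
    larger weight; errors are scaled by the maximum SMAPE before weighting."""
    top = max(smape_by_model.values())
    xi = {m: s / top for m, s in smape_by_model.items()}
    z = sum(exp(-x) for x in xi.values())
    return {m: exp(-x) / z for m, x in xi.items()}
```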
In the interest of reproducible science, the dataset and all methods are publicly available [39].

The learning algorithms
This section summarizes the characteristics of the individual candidate learning algorithms (time series methods) used in this study (see Table 1, line 10). We selected our models by reviewing the six top-performing hybrid or combination models in the M4 Competition, while taking into consideration our research limitations derived from the length of the time series and the computational power required to build the ensemble model for a 61-member panel of time series from 2000 to 2016 with 12 individual models and three holdouts.
The seasonal trend decomposition using Loess (STL) allows us to decompose a time series into its trend and seasonal components. Based on the Loess smoother, it offers a simple, versatile, and robust method for decomposing a time series and estimating nonlinear relationships [41]. The models need to be robust to the outliers detected in the multiple panel members' (countries) datasets. In specifying the STL, we use a robust decomposition so that sporadic abnormal observations do not affect the estimates of the trend-cycle and seasonal components. The time series are tested for autocorrelation using the Ljung-Box test, considering the null hypothesis that the model exhibits appropriate goodness-of-fit. The method does not handle the calendar variation automatically, and it only provides facilities for additive decompositions, which could be considered a limitation of this approach. We use the two parameters t.window and s.window to control the speed at which the trend-cycle and seasonal components can change. Smaller values allow for more rapid changes, which we need especially for some time series with strong turning points. As a result, the number six was chosen for s.window and t.window based on the results of the residual checks and the Ljung-Box test statistics.
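To illustrate the idea of splitting a monthly series into trend and seasonal parts, the toy decomposition below uses a centered 2×12 moving average for the trend and monthly means for the seasonal component. STL proper replaces both steps with iterated Loess smoothing and adds robustness weights, so this is a sketch of the concept, not of STL itself:

```python
def classical_decompose(x, period=12):
    """Toy additive decomposition: centered moving-average trend plus
    centered monthly-mean seasonal component (illustrative only)."""
    n = len(x)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = x[t - half: t + half + 1]
        # 2x12 MA for an even period: halve the two end points
        trend[t] = (window[0] / 2 + sum(window[1:-1]) + window[-1] / 2) / period
    detrended = [x[t] - trend[t] for t in range(n) if trend[t] is not None]
    months = [t % period for t in range(n) if trend[t] is not None]
    seasonal = []
    for m in range(period):
        vals = [d for d, mm in zip(detrended, months) if mm == m]
        seasonal.append(sum(vals) / len(vals) if vals else 0.0)
    mean_s = sum(seasonal) / period
    return trend, [s - mean_s for s in seasonal]
```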
The seasonal naive (SNAIVE) method sets the forecast to be equal to the last observed value from the same season of the year (i.e., the same month of the previous year) [42]. It is a useful benchmark for other forecasting methods and, here, it was found to be helpful in showing the recent time series trend and for adjusting the ensemble model for the trend component.
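SNAIVE is simple enough to state in one line; a sketch for monthly data:

```python
def snaive(series, horizon, period=12):
    """Seasonal naive: each forecast equals the last observed value from
    the same month of the previous year."""
    return [series[len(series) - period + (t % period)] for t in range(horizon)]
```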
Similarly, the SARIMA and random walk forecasts (RWF) -as a SARIMA(0,0,0)(0,1,0)m model, in which m is the seasonal period -were used as state-of-the-art methods to memorize repeating monthly patterns. However, many SARIMA models have no exponential smoothing counterparts [43], and robust univariate forecasting models, such as Holt-Winters' multiplicative method (HWM) and the exponential smoothing state space model (ETS), can be considered a good complement to SARIMA models in our final ensemble. All ETS models are non-stationary, while some SARIMA models are stationary [44]. ETS follows the last trend of the time series, making it appropriate for empowering the trend parameter in the ensemble's final predictions. ETS point forecasts are equal to the medians of the forecast distributions. For models with only additive components, the forecast distributions are normal, so the medians and means are equal. For multiplicative errors, or multiplicative seasonality, which perform similarly in most of the time series analyzed in this study, the point ETS forecasts are not equal to the means of the forecast distributions. In these cases, SARIMA is a better choice. On the other hand, ETS is a non-linear exponential smoothing model with no equivalent SARIMA counterpart. Therefore, we propose that the ETS model be selected automatically and that the type of trend and seasonal component be additive, with the restriction of finite variance. The bootstrapping method for resampled errors was employed rather than distributed errors, and simulation was used rather than algebraic formulas for calculating prediction intervals. The other options for the ETS model are shown in Table 2. The TBATS model -that is, (T)rigonometric terms for seasonality, (B)ox-Cox transformations for heterogeneity, (A)RMA errors for short-term dynamics, (T)rend, and (S)easonal components -is also used to provide the ensemble model with the multiple seasonality of some time series.
In the case of the neural network time series algorithms, extreme learning machines (ELM) were used with the lasso penalty. ELM theory assumes that the randomness in the determination of the coefficients of the neural network predictors (input weights) can feed the learning models with no iterative tuning for a given distribution, as is the case in gradient-based learning algorithms. The model entails randomly defined hidden nodes and input weights without any optimization, so that only the output weights need to be calibrated during the training of the ELM [45]. In the hyperparameter calibration of the ELM, we consider a maximum of 500 hidden nodes for each of the 200 networks to be trained and summarized in the ELM's final ensemble forecast model.
The neural network autoregression (NNAR) refers to single-hidden-layer networks using the lagged values of the time series as inputs, with automatic selection of parameters and lags according to the Akaike information criterion (AIC) [46]. In the NNAR model specification, we considered the last observed values from the same season as inputs to capture the seasonality patterns, and a size equal to one because we have one attribute without a regressor; by way of improvement, we used 100 networks fitted with different random starting weights and then averaged them to produce the forecasts. Additionally, we considered the multilayer perceptron (MLP) as a kind of NNAR model. It is more complicated and advanced than the NNAR, having three components in the form NNAR(p,P,k), in which p denotes the number of lagged values used as inputs, usually chosen based on an information criterion such as the AIC, P denotes the number of seasonal lags, and k denotes the number of hidden nodes.
Finally, singular spectrum analysis (SSA) was used as one of the high-quality modeling approaches. The calibration of the SSA is an important, but not easy task, in a standalone modeling approach [47]. It depends upon two basic parameters: the window length and the number of eigentriples used for reconstruction. The choice of improper values for these parameters yields incomplete reconstruction, and the forecasting results might be misleading. In this study, we set window length equal to 12 and eigentriples equal to NULL. Table 2 summarizes the hyperparameters of the algorithms used in this study.

Data selection and cleansing
We use cause-of-death data from the World Health Organization's (WHO) mortality database [50] to empirically demonstrate the forecasting capacity of the methodology proposed. The database collects cause-of-death statistics from country civil registration systems and estimates from the United Nations Population Division for countries that do not regularly report population data. We use an Excel file of this database to evaluate the data quality of each country and a CSV file that includes the death time series of each country by gender. The first of these files identifies the quality of data for each country, using five color categories: green, dark yellow, light yellow, dark red, and light red. Countries classified as green have multiple years of national death registration data with high completeness and quality of cause-of-death assignment. Estimates for these countries may be compared, and the time series may be used for priority setting and policy evaluation. However, this dataset only includes data for 2000, 2010, 2015, and 2016 and is not complete as a time series. As a result, we used this dataset only to identify the countries reporting high-quality data to the WHO and ranked them according to their data quality. In line with the metadata of the dataset, the criteria used to rank the countries by data quality are shown in Table 3, coinciding, that is, with the WHO descriptors. We considered only those countries with a data quality corresponding to the first three categories and eliminated various islands due to a lack of data (e.g., Åland Islands). We also cleaned the dataset by removing the total column and various rows with unknown month data and/or zero deaths. Some countries reported total deaths for three months in a row during certain years.
In such instances, we assumed a uniform distribution of deaths across the quarter and allocated the corresponding value to each month. We filtered the datasets for respiratory diseases and considered the death variable as a univariate time series with monthly sampling frequency. Table 4 shows the WHO codes classified as respiratory infections. (Regarding otitis media in Table 4: acute otitis media (AOM) is a common complication of upper respiratory tract infection whose pathogenesis involves both viruses and bacteria.) To compute the number of deaths attributable to respiratory diseases, we aggregated codes 380 and 410 or, equivalently, codes 390, 400, and 410. We also corrected the names of some of the countries (Appendix A). In this way we were able to calculate the proportion of deaths attributable to respiratory diseases. To estimate the number of monthly deaths caused by respiratory diseases, we multiplied the annual proportion by the total number of forecasted deaths each month; that is, we used the fraction of annual deaths from respiratory diseases over the total number of deaths as the proportion applied to each month. This procedure provided us with a dataset of more than twelve thousand observations in a pool of a 61-member panel time series (countries) from 2000 to 2016 [39] (see Table 1, lines 3-4). These panel time series cover possible situations of stationarity, non-stationarity, increasing trends, seasonality, and structural breaks, allowing a comprehensive evaluation of the improvement in accuracy of the candidate and ensemble models in different scenarios.
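The two allocation steps described above (uniform quarterly splitting, and scaling monthly totals by the annual respiratory share) can be sketched as follows (hypothetical helper names):

```python
def split_quarter(quarter_total):
    """Uniform allocation of a quarterly-reported death count to its months."""
    return [quarter_total / 3.0] * 3

def monthly_respiratory_deaths(total_monthly, annual_resp, annual_total):
    """Scale each month's (forecasted) total deaths by the annual share of
    respiratory deaths to estimate monthly respiratory deaths."""
    share = annual_resp / annual_total
    return [m * share for m in total_monthly]
```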
Given the varying data quality of countries/territories/areas as regards case detection, definitions, testing strategies, reporting practices, and lag times, missing values are expected in the time series dataset. To deal with this problem, we tested the Kalman, seasplit, and seadec algorithms to impute the missing values. Of the three, the seasplit algorithm performed best in preserving both the trend and the seasonality of our dataset (see Table 1, line 19). We only imputed missing values within each time series, not at the beginning of a series whose start date falls after 2000. As a result, rather than changing the first year of such a time series to our base year 2000, we used the latest available starting year (see Table 1, lines 17-18). To avoid the error caused by combining time series of different lengths in an ensemble model, we adapted the R code to handle different start years. The same problem arises as a result of the procedure adopted to select the best holdout for each model, which may ultimately lead to model combinations considering forecasts based on different holdouts, i.e., different time series lengths.
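The idea behind the seasplit algorithm (na_seasplit in the R package imputeTS) is to slice the series into one subseries per month, impute within each subseries, and interleave the results back, so that imputed values respect the seasonal pattern. A simplified linear-interpolation sketch of that idea:

```python
def impute_seasplit(x, period=12):
    """Seasonal-split imputation sketch: linearly interpolate missing values
    (None) within each same-month subseries, then recombine."""
    y = list(x)
    for m in range(period):
        idx = list(range(m, len(x), period))
        known = [(i, y[t]) for i, t in enumerate(idx) if y[t] is not None]
        for i, t in enumerate(idx):
            if y[t] is not None:
                continue
            left = max((k for k in known if k[0] < i), key=lambda k: k[0], default=None)
            right = min((k for k in known if k[0] > i), key=lambda k: k[0], default=None)
            if left is not None and right is not None:
                frac = (i - left[0]) / (right[0] - left[0])
                y[t] = left[1] + frac * (right[1] - left[1])
            elif left is not None:
                y[t] = left[1]  # extrapolate forward with last known value
            elif right is not None:
                y[t] = right[1]  # extrapolate backward with first known value
    return y
```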
Finally, to assess the superiority of the proposed DELMS model, 7137 time series models were explored, obtained from 12 individual time series models plus the ensemble, under 3 scenarios and 3 holdouts for the 61 countries (13 × 3 × 3 × 61 = 7137).

Forecasting accuracy comparison
The predictive accuracy metrics obtained for the three alternative holdout periods under investigation, using three alternative backtesting procedures, are reported in Table 5. In the case of the first approach -the ''Fixed holdout'' -we used a fixed holdout period equal to 3, 5, and 7 years to derive the composite (ensemble) model.
The results in the first column show that some individual models exhibit better performance than the ensemble models with fixed holdouts. For instance, the average error of the TBATS model across two of the holdout periods is smaller than that of the BMA (see Table 5, Column (1)).
The second approach, the ''Fixed holdout with model selection'', uses a multiple of the SMAPE values across all methods to evaluate the distance of each model to the others, as detailed above in the pseudocode (Table 1). The models with SMAPE values higher than the introduced indicator are considered poor forecasters and eliminated from the ensemble forecast [39]. The results in Table 5 show that the accuracy of the BMA approach improves in robustness when pursuing the selection approach for each holdout, with the composite model now ranking first among all the methods tested. With a fixed holdout of 3 for all models, which is the classical approach, the BMA has a SMAPE value of 0.112. For the same holdout, but with model selection, this SMAPE value improves (0.103). In the third approach, ''model selection plus dynamic holdouts'', which combines approaches one and two, the percentage error improves again (0.102). This approach combines the best forecasting models fitted using each model's optimal holdout selection. As a result, the accuracy of the ensemble is improved, leaving the individual learning algorithms at a reasonable distance. Fig. 3 summarizes the above empirical results. It is apparent that the ensemble model with the new layered learning approach (DELMS) exhibits greater predictive accuracy than any of the single forecasting methods used and than either of the ensemble strategies (with fixed holdouts, and with fixed holdouts and model selection). It shows that the approach proposed improves the predictive performance at each step of the learning process illustrated in Fig. 2.
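The exclusion rule can be illustrated as follows; the paper's actual indicator is given in Formula (3), so the threshold used here (theta times the median SMAPE across models) is an assumption for the sake of the sketch:

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric mean absolute percentage error in the 0-2 range."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(2.0 * np.abs(forecast - actual)
                         / (np.abs(actual) + np.abs(forecast))))

def select_models(smape_by_model, theta=2.0):
    """Keep the models whose SMAPE stays below theta times the median
    SMAPE; models beyond that distance are treated as outlier forecasters."""
    threshold = theta * np.median(list(smape_by_model.values()))
    return [m for m, s in smape_by_model.items() if s <= threshold]
```

A model whose out-of-sample SMAPE is far above the rest of the pool is dropped before the weights are computed, rather than being down-weighted.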
Finally, the Wilcoxon signed-rank test was performed to assess the significance of the superiority of the proposed model (DELMS). This test determines the significance of the differences between the forecasting errors of the central trend produced by two forecasting models on the same number of observations [51]. Let e_i be the difference between the forecasting errors for the ith forecast value (i.e., country) generated by the two forecasting models, DELMS and BMA(holdout = 3) (Appendix C).

Let r+ denote the sum of the ranks of the positive differences e_i and r- the sum of the ranks of the negative differences. For e_i = 0, we eliminate the comparison. The statistic W is defined as in Eq. (6):

W = min(r+, r-)    (6)

The proposed DELMS model significantly outperforms the BMA(holdout = 3), which is the ensemble model with the best performance among all ensembles with fixed holdouts (Table 5). The proposed DELMS model is significantly superior to the other BMAs with respect to forecasting accuracy (P-value = 0.015). Table 7 reports the distribution of the models excluded from the selection procedure and ranks them according to their contribution to the composite model. A vertical comparison of the results offers insights into the contribution of each model to the ensemble, while a horizontal comparison enables us to assess the rate of contribution across different holdout periods.
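The test can be reproduced with SciPy, whose wilcoxon drops zero differences under zero_method="wilcox" (matching the elimination of e_i = 0) and, for a two-sided alternative, returns W = min(r+, r-). The per-country errors below are made up for illustration, not taken from Table 5:

```python
from scipy.stats import wilcoxon

# Hypothetical per-country SMAPE values for the two models being compared.
e_delms = [0.10, 0.09, 0.12, 0.08, 0.11, 0.10, 0.09]
e_bma3 = [0.11, 0.10, 0.13, 0.10, 0.11, 0.12, 0.10]  # BMA(holdout = 3)

# Paired test on the differences e_i; ties at zero are discarded.
w, p = wilcoxon(e_delms, e_bma3, zero_method="wilcox", alternative="two-sided")
# Here every non-zero difference favors DELMS, so r+ = 0 and W = 0.
```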

Models excluded from the selection procedure
The results show, first, that all models are excluded several times from the BMA model space as a result of the procedure to select the model confidence set, highlighting that the set of best-performing forecasters differs across the countries, i.e., their predictive accuracy is population- and period-specific. This is unsurprising and can be explained by the differential patterns observed in the respiratory disease data. The variability in the models' out-of-sample forecasting accuracy also reveals their ability to capture diverse features of mortality data. Second, the results suggest that combining models is a way to leverage their respective strengths, improving the composite model by selecting both the best holdout for each model and the best forecasters in the model confidence set finally used to make the forecast. Table 8 presents the contribution ranks, the exclusion frequency, and the proportion of the selected models with the best holdout for the DELMS. The results show that the contribution of single learners to the ensemble changes when compared with that obtained with model selection only (Table 7), highlighting again the importance of combining model selection with holdout period calibration. Fig. 4 reports the BMA model confidence set (vertical axis) and corresponding posterior probability (horizontal axis) for selected countries.

Table 5: Ranking of the models and ensembles according to the accuracy measure. Source: Authors' own. Notes: (1) The Fixed holdout column for the first row (BMA) shows the SMAPE for the Bayesian model averaging approach with a fixed holdout; the remaining rows show individual models. Columns (2) and (3) represent the methods proposed herein.
As we used the SMAPE criterion to select the set of models and their respective weights, a weight of zero indicates that the corresponding individual model is excluded from the BMA forecast combination. We can observe that each model's contribution to the ensemble varies across countries, while the ensemble model consistently performs well in all countries.
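In the paper the combination weights are BMA posterior probabilities; purely to illustrate how excluded models end up with weight zero, the sketch below substitutes a simpler normalized inverse-SMAPE weighting (an assumption, not the authors' scheme):

```python
def ensemble_weights(smape_by_model, selected):
    """Zero weight for excluded models; inverse-SMAPE weights, normalized
    to sum to one, for the models kept in the confidence set."""
    inverse = {m: (1.0 / s if m in selected else 0.0)
               for m, s in smape_by_model.items()}
    total = sum(inverse.values())
    return {m: v / total for m, v in inverse.items()}
```

Under either scheme, the key property is the same: an excluded model contributes nothing, and the remaining weights reflect out-of-sample predictive accuracy.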

Algorithmic efficiency analysis
We analyze the algorithmic efficiency of each method, i.e., the amount of computational resources used by the algorithm, by measuring the time spent fitting the ensemble model with each approach and using it to predict the likely maximum run-time for a new time series (Table 9). The CPU used here is the Intel Core i7-7500U processor @ 2.70-2.90 GHz with 16.0 GB of RAM. The modeling, training, tuning, and testing are programmed in R 4.1.2. The proposed method fits the models considering three holdout periods in order to select the best holdout for each model.
Our expectation was therefore that the proposed method would increase the run-time by a factor of at least three relative to the two other approaches, given that the underlying model is a multi-step forecasting method. However, if we consider the average run-times and the mean confidence intervals for the three approaches, we see that they do not differ greatly, which indicates that our proposed method is efficient in terms of computation time.
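Run-time figures of this kind can be collected with a simple timing harness; the helper below is illustrative rather than the authors' benchmarking code, and the 1.96 multiplier assumes a normal approximation for the confidence interval of the mean:

```python
import time
import statistics

def time_fit(fit_fn, reps=5):
    """Return the average wall-clock time of fit_fn over several runs,
    together with an approximate 95% confidence interval for the mean."""
    runs = []
    for _ in range(reps):
        start = time.perf_counter()
        fit_fn()                                  # fit the model under test
        runs.append(time.perf_counter() - start)
    mean = statistics.mean(runs)
    half_width = 1.96 * statistics.stdev(runs) / len(runs) ** 0.5
    return mean, (mean - half_width, mean + half_width)
```

Comparing the interval for the dynamic-holdout approach against the fixed-holdout baselines is exactly the comparison summarized in Table 9.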

Excess mortality analysis
The proposed ensemble learning for panel time series with strategy selection and dynamic holdouts (as discussed in Section 2 and above in Section 3) was used to forecast the number of deaths caused by different kinds of respiratory disease for a subset of 61 countries in 2020 (see Table 1, line 5). Additionally, COVID-19 deaths for the same year were extracted from the COVID-19 Weekly Epidemiological Update published by the World Health Organization (WHO), with data as received from national authorities as of 3 January 2021, which provides full coverage of 2020 [52]. Table B.1 (in Appendix B) presents forecasts of the total number of deaths attributable to respiratory diseases (RD TD), calculated by aggregating the monthly death forecasts for each country. The last two columns show the standardized values of the total number of deaths attributable to respiratory diseases and to COVID-19, respectively, used in calculating the correlation. The Pearson correlation for all 61 countries is 0.34, which is statistically significant (P-value = 0.007). To calculate the correlation for a more limited set, we considered the European countries (including the United Kingdom) together with Canada and the United States of America, as shown in Table 10.
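The correlation step can be reproduced as below. Note that the Pearson coefficient is invariant to standardization, so the standardized columns of Table B.1 mainly ease visual comparison; the country totals here are invented for illustration:

```python
import numpy as np
from scipy.stats import pearsonr

def standardize(x):
    """Center to zero mean and scale to unit (sample) standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

# Hypothetical RD TD (forecast respiratory deaths) and COVID TD per country.
rd_td = [1200.0, 800.0, 4300.0, 950.0, 2100.0, 640.0]
covid_td = [1500.0, 700.0, 5000.0, 1100.0, 1900.0, 500.0]

r, p = pearsonr(standardize(rd_td), standardize(covid_td))
```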
The selection criteria used were the maturity standards of their official statistics (SDDS+, SDDS, GDDS), the outcomes of official statistics corruption models [53], and the quality of death data according to the WHO ranking discussed in Section 3.1. For this subset, the correlation coefficient increases dramatically to 94% (P-value = 0.000), which can be attributed to the higher quality of the official statistics in these countries. Ashofteh and Bravo (2020) have shown there to be significant variation in the quality of the COVID-19 datasets reported worldwide, although a recent study suggests that data science and new technologies can be expected to play a significant role in improving the data quality of national statistical offices in the future [54].
The comparison of death forecasts attributable to respiratory diseases and COVID-19 deaths is shown in Figs. 5 and 6. For most countries, COVID-19 deaths can be said to have ''replaced'' the respiratory deaths that would have occurred based on extrapolations of past respiratory disease trends.

Notes to Table 9: ART: Average run-time; STD: Standard deviation; LCL: Lower confidence limit; UCL: Upper confidence limit.

Table 10
Comparison of number of deaths forecast for respiratory diseases and actual COVID-19 deaths. Source: Authors' own.

(Columns: Row; Country (1); Alpha-3; Country No; Population (2); RD TD (3); COVID TD (4); Standardized RD TD (5); ...)

A study of the factors affecting COVID-19 mortality shows a high correlation between respiratory deaths and COVID-19 deaths, a finding that is consistent with clinical manifestations and epidemiological studies. For example, countries with a high expected number of respiratory disease deaths presented higher excess mortality, that is, at the macro (country) level. At the individual level, a higher number of deaths from respiratory diseases could be taken as an indication of a population's greater susceptibility to COVID-19 symptoms and a greater risk of death. This comparative study highlights the fact that evaluating the policy effectiveness of different countries without considering their past experience with respiratory diseases could result in an evaluation bias. Fig. 5 shows that the countries of Europe and North America were sensitive to respiratory diseases and that this boosted the excess mortality attributable to the COVID-19 pandemic; however, Fig. 6 shows that in 2020 some countries dealt better with COVID-19 than others as regards their vulnerability to respiratory diseases. Thus, this last figure highlights that in countries in which the forecast of respiratory disease deaths significantly exceeds the confirmed COVID-19 deaths (e.g., Japan and the Philippines), the management of the pandemic crisis succeeded in reducing excess mortality. The results shown in these two figures are very much in line with a recent study indicating a much lower overall excess-mortality burden due to COVID-19 in Japan than in Europe and the USA [55]. Here, Yorifuji et al. [56] suggest that in Japan, the public health regulations aimed at preventing COVID-19 may have incidentally reduced mortality related to respiratory diseases, such as influenza, and so decreased net excess mortality.
Additionally, in addressing vulnerability to respiratory diseases, Japan and the Philippines appear to have set a good example for the rest of the world in terms of controlling the effect of respiratory death numbers on the number of COVID-19 deaths. The similarity of the situations in these two countries seems to testify to the importance of the agreements struck on their so-called COVID-19 Response Support. As reported on the website of the Department of Foreign Affairs in the Philippines, the Japanese Government has been unstinting in its commitment to the Philippines' recovery efforts, previously pledging over JPY 100 billion in assistance in emergency and standby loans and donating 1 million Japan-manufactured AstraZeneca vaccines.⁵ Fig. 6 shows a similar situation for the Republic of Korea, which lies in the same geographical vicinity as these two countries. The outcome of the comparison of death forecasts attributable to respiratory diseases and actual COVID-19 deaths for the Republic of Korea is in line with a recent study estimating mortality in Korea undertaken by Shin et al. [57], which finds that mortality in 2020 was similar to the historical trend. The similarity of the outcomes reported by these neighboring countries seems to highlight the importance of international cooperation and the sharing of resources for the successful control of the effects of pandemics. Moreover, as these countries are geographically close to each other, meteorological factors might also have been influential in their respective outcomes. Clearly, more research is required.
Finally, in addition to the effect of respiratory deaths on deaths attributable to the pandemic, international cooperation, optimal scheduling and the utilization of medical resources, large-scale virus testing, protecting and managing the healthcare of the elderly, lockdowns, vaccination, and controlling the borders are examples of other factors that might result in different outcomes by country. However, accurate and timely estimations of respiratory deaths also seem to be an important factor when undertaking comparisons of multiple countries.

Conclusions
We have tested a new ensemble learning technique (DELMS) for the panel time series forecasting of respiratory disease deaths and have summarized the empirical results obtained when using individual models, a simple ensemble model, an ensemble with model selection, and an ensemble with model selection and dynamic holdouts (DELMS). Our goal in so doing was to obtain a benchmark for evaluating the excess mortality related to COVID-19 that might serve as a common framework for all countries.
Based on the performance outcomes of the models (Table 5) and the results of the Wilcoxon signed-rank test (Table 6), on average, the ensemble with model selection and dynamic holdouts (DELMS) performs significantly better than the other methods. Our results provide clear evidence of the competitiveness of this method in terms of its predictive performance when compared to the state-of-the-art approaches and even to the ensemble model without the dynamic holdout and model selection layer.
Our analysis of the contribution of each of the candidate models to the ensemble (Tables 7 and 8) highlights the positive effect on overall prediction accuracy of selecting the best holdout for each model and excluding the outlier models from the ensemble. Moreover, it was evident that some of the state-of-the-art approaches outperformed the neural network time series models. A possible explanation for the underperformance of the complex neural network approaches might lie in the non-stationary elements of the series (for example, the trend component) and in their pre-set hyperparameters. However, neural network time series models have been shown to perform much better when the time series data are nonlinear and stationary and present sudden changes in their layering hierarchy [58]. For this reason, they can be expected to add value to the ensemble in the case of mostly detrended time series. Additionally, recurrent neural networks, such as LSTM and GRU, have the potential to outperform time series models, and their use could be usefully explored for the ensemble in future studies with sufficient computational resources or fewer panel members.
The variation in the performance of each model stresses the need to improve each of them individually by selecting the best holdout and, moreover, to determine the best models to contribute to the ensemble without overfitting. The indicator proposed here in Formula (3) removes only those models that are very distant from the other models; by so doing, we are able to avoid significant bias in the set of candidate forecasters. The final ensemble model shows a significant improvement in overall accuracy when compared with the other ensembles and with each individual state-of-the-art approach. The superiority of the proposed DELMS model was established by comparing 7137 time series models obtained from 61 countries, 12 time series models plus an ensemble, 3 scenarios, and 3 holdouts.
Here, we have used the new ensemble strategy to forecast the number of deaths from respiratory diseases in 2020 for a sample of 61 countries. The correlation between the standardized values of deaths from respiratory diseases and those from COVID-19 was positive and statistically significant. Based on this outcome, it is apparent that we should consider death forecasts from respiratory diseases as a covariate for evaluating the management strategies employed by different countries, be they lockdown rules or the relaxation of border control regulations. On the basis of our study, Japan and the Philippines are candidates for further investigation in this regard; indeed, they are more eligible than other countries that only record a low death toll. It may well be that the experience of these countries with high mortality attributable to respiratory diseases played a more than relevant role in their management of the pandemic.
Indeed, in the case of the COVID-19 pandemic it might be more relevant to focus on the death toll rather than on the cumulative number of patients. Given the nature of pandemics, the challenge usually lies in being able to control its spread; however, here the primary concern might be said to have been controlling the severe cases and caring for the patients facing the greatest likelihood of death. Those countries presenting a high number of cases of respiratory disease and which successfully managed the pandemic, therefore, could be better targets for further studies that compare their health policies and strategies with those implemented by countries presenting only a low rate of mortality.
In short, the study described here represents an initial attempt at developing a new approach to ensemble forecasting tasks. The main motivation for this paper was the observation that the performance of the ensemble model might potentially be enhanced by selecting the best holdout for each candidate model and by choosing the best outcomes based on the dynamics of the observed values of the main series. In experiments using the 61-member panel time series of respiratory disease deaths recorded between 2000 and 2016, the aggregation of selected forecasting models employing our approach provides a consistent advantage in terms of accuracy and leads to better predictive performance. Moreover, our study provides a correction of the total number of positive cases of COVID-19, in accordance with the expected number of deaths attributable to respiratory diseases as identified by our ensemble model.
Finally, this study has highlighted the pandemic experiences of Japan and the Philippines, identifying them as candidates for further exploration. The two countries present a high degree of vulnerability to the COVID-19 pandemic; yet, despite this, they succeeded in managing it well. Thus, even though their death tolls were higher than those recorded in some other countries, their policy response should be examined to extract best practices. Of particular interest is the fact that in most countries COVID-19 deaths seem to have ''replaced'' the deaths attributable to respiratory disease that would likely have occurred in the absence of the pandemic, based on an extrapolation of past trends of such deaths.

Future research
Future studies could usefully seek the optimization of θ in Formula (3), that is, investigate the dynamic selection of the optimum θ to ensure better performance. Additionally, as conventional neural networks fail to model time series adequately, especially in the case of incomplete/limited data during the onset of an epidemic [59], the study of recurrent neural networks, such as LSTM and GRU, would constitute an interesting future step if the necessary computational power is available. This research should examine their impact on predictive accuracy, computation time, and other resources, given the potential of these mechanisms to outperform ensemble time series models with no more than a reasonable increase in the computation power requirement. Indeed, the consideration of a non-linear meta-learning approach, as opposed to a linear approach, and of prediction intervals, as opposed to point forecasts, could constitute a fruitful next step. Moreover, the use of classification techniques to analyze heterogeneous and homogeneous countries could be considered as another layer following the application of the forecasting methods. As such, a clustering analysis might usefully be implemented based on the notion of excess mortality. Finally, Japan and the Philippines stand out, and their policy response should be subject to an epidemiological examination to determine what lessons might be learned.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.