Comparing trained and untrained probabilistic ensemble forecasts of COVID-19 cases and deaths in the United States

The U.S. COVID-19 Forecast Hub aggregates forecasts of the short-term burden of COVID-19 in the United States from many contributing teams. We study methods for building an ensemble that combines forecasts from these teams. These experiments have informed the ensemble methods used by the Hub. To be most useful to policymakers, ensemble forecasts must have stable performance in the presence of two key characteristics of the component forecasts: (1) occasional misalignment with the reported data, and (2) instability in the relative performance of component forecasters over time. Our results indicate that in the presence of these challenges, an untrained and robust approach to ensembling using an equally weighted median of all component forecasts is a good choice to support public health decision-makers. In settings where some contributing forecasters have a stable record of good performance, trained ensembles that give those forecasters higher weight can also be helpful.


Introduction
Accurate short-term forecasts of infectious disease indicators (i.e., disease surveillance signals) can inform public health decision-making and outbreak response activities such as nonpharmaceutical interventions, site selection for clinical trials of pharmaceutical treatments, or the distribution of limited health care resources (Wallinga et al., 2010;Lipsitch et al., 2011;Dean et al., 2020). Epidemic forecasts have been incorporated into public health decision-making in a wide variety of situations, including outbreaks of dengue fever in Brazil, Vietnam, and Thailand (Lowe et al., 2016;Colón-González et al., 2021;Reich et al., 2016) and influenza in the U.S.
These efforts frequently use ensemble forecasts that combine predictions from many models. In a wide array of fields, ensemble approaches have provided consistent improvements in accuracy and robustness relative to stand-alone forecasts (Gneiting & Raftery, 2005;Polikar, 2006). The usefulness of ensemble forecasts has also been demonstrated repeatedly in multiple infectious disease settings, including influenza, Ebola, dengue, RSV, and others (Yamana et al., 2016;Viboud et al., 2018;McGowan et al., 2019;Johansson et al., 2019;Reis et al., 2019;Reich et al., 2019). In light of this record of strong performance, ensembles are natural candidates for forecasts used as an input to high-stakes public health decision-making processes.
This paper describes ensemble modeling efforts at the U.S. COVID-19 Forecast Hub (https://covid19forecasthub.org/, hereafter the "U.S. Hub"), from spring 2020 through spring 2022. Starting in April 2020, the U.S. Hub created ensemble forecasts of reported incident deaths one through four weeks ahead in the 50 states, Washington, D.C., and 6 territories, as well as at the national level, by combining forecasts submitted by a large and variable number of contributing teams using different modeling techniques and data sources. In July 2020, forecasts of incident reported COVID-19 cases were added. Of note, the U.S. Hub produces probabilistic forecasts in which uncertainty about future disease incidence is quantified through the specification of a predictive distribution that is represented by a collection of predictive quantiles. Since the inception of the U.S. Hub, these ensemble forecasts have been provided to the U.S. Centers for Disease Control and Prevention (CDC) and have been the basis of official CDC forecasting communications (CDC, 2021).

Related literature
A wide variety of standalone methodological approaches have been shown to be able to make forecasts of short-term outbreak activity that are more accurate than naive baseline forecasts in various epidemiological settings. Some approaches have used existing statistical frameworks to model associations between outcomes of interest and known or hypothesized drivers of outbreaks, such as recent trends in transmission or environmental factors. To cite just a few examples, methods used include multiscale probabilistic Bayesian random walk models (Osthus & Moran, 2021), Gaussian processes (Johnson et al., 2018), kernel conditional density estimation (Ray et al., 2017;Brooks et al., 2018), and generalized additive models (Lauer et al., 2018). Other models have an implicit or explicit representation of a disease transmission process, such as variations on the susceptible-infectious-recovered (SIR) compartmental model (Shaman & Karspeck, 2012;Lega & Brown, 2016;Osthus et al., 2017;Pei et al., 2018;Turtle et al., 2021). Aspects of these modeling frameworks can also be combined, for instance using time series methods to build models that have a compartmental structure or incorporate key epidemiological parameters such as the effective reproduction number $R_t$, or models that use a time series process to capture systematic deviations from a compartmental core (Bartolucci et al., 2021;Agosto et al., 2021;Osthus et al., 2019).
There is a large literature on ensemble forecasting, but of particular relevance to the present work is the research on combining, calibrating and evaluating distributional forecasts (Ranjan & Gneiting, 2010;Claeskens et al., 2016). We note that prior work on forecast combination has mostly focused on combining forecasts represented as probability densities or probability mass functions rather than forecasts parameterized by a set of discrete quantile levels, which is the format of the forecasts in the present study. However, in psychological studies there is a long history of combining quantiles from multiple distributions as a mechanism for summarizing distributions of response times, error rates, and similar quantities across many subjects (Vincent, 1912;Ratcliff, 1979). More recently, this approach has also been used to combine probabilistic assessments from multiple subject matter experts or statistical models in fields such as security threat detection and economic forecasting (Hora et al., 2013;Lichtendahl Jr et al., 2013;Gaba et al., 2017;Busetti, 2017). In the context of infectious disease forecasting, Bracher et al. (2021b) conducted a similar but less extensive analysis to the one presented here using data from a related forecast hub focusing on Germany and Poland. Taylor & Taylor (2021) recently explored several approaches to constructing quantile-based ensemble forecasts of cumulative deaths due to COVID-19 using data from the U.S. Hub, although they did not generate ensemble forecasts in real time, nor did they appear to use the specific versions of ground truth data that were available for constructing ensembles in real time.
As was mentioned earlier, ensemble forecasts have also been used in a variety of other applications in real-time forecasting of infectious diseases, often with seasonal transmission dynamics where many years of training data are available (Yamana et al., 2016;Reich et al., 2019;Reis et al., 2019;Colón-González et al., 2021). In such applications, simple combination approaches have generally been favored over complex ones, with equal-weighted approaches often performing similarly to trained approaches that assign weights to different models based on past performance (Bracher et al., 2021b). These results align with theory suggesting that the uncertainty in weight estimation can pose a challenge in applications with a low signal-to-noise ratio (Claeskens et al., 2016).

Contributions of this article
This paper is focused on explaining the careful considerations that have gone into building a relatively simple "production" ensemble model for a difficult, high-stakes, real-time prediction problem: forecasting COVID-19 cases and deaths in the United States, to support public health decision-making. We do not empirically investigate the performance of complex forecast combination strategies from the online prediction literature, which generally require richer and larger training data sets.
The goal of the U.S. Hub in developing an operational ensemble was to produce forecasts of the short-term trajectory of COVID-19 that had good performance on average and stable performance across time and different locations. Real-time forecasting for an emerging pathogen in an open, collaborative setting introduces important challenges that an ensemble combination method must be able to handle. First, teams occasionally submitted outlying component forecasts due to software errors, incorrect model assumptions, or a lack of robustness to input data anomalies (Figure 1 (a), Supplemental Figures 1 and 2). Second, some component models were generally better than others, but the relative performance of different models was somewhat unstable across time (Figure 1 (b), Supplemental Figures 3 and 4). In particular, some forecasters alternated between being among the best-performing models and among the worst-performing models within a span of a few weeks, which introduced a challenge for ensemble methods that attempted to weight component forecasters based on their past performance. In this manuscript, we explore and compare variations on ensemble methods designed to address these challenges and produce real-time forecasts that are as accurate as possible to support public health decision-makers.
We give detailed results from experiments that were run concurrently with the weekly releases of ensemble forecasts from the start of the U.S. Hub in 2020 through the spring of 2022, as documented in preliminary reports (Brooks et al., 2020 [Online]; Ray et al., 2021 [Online]). These experiments provided the evidence for decisions (a) to move to a median-based ensemble from one based on means in July 2020; (b) to switch to a trained ensemble for forecasts of deaths in November 2021; and (c) to implement a weight regularization strategy for that trained ensemble starting in January 2022. In a secondary analysis, we also consider the prospective performance of these methods in the closely related setting of forecasting cases and deaths in Europe, to examine the generalizability of the results from our experiments using data from the U.S.

Figure 1: (a) Predictive medians and 95% prediction intervals for incident deaths in Ohio generated on February 15, 2021 by two example component forecasters. The vertical axis scale is different in each facet, reflecting differences across several orders of magnitude in forecasts from different forecasters; the reference data are the same in each plot. The data that were available as of Monday, February 15, 2021 included a large spike in reported deaths that had been redistributed into the history of the time series in the version of the data available as of Monday, February 22, 2021. In this panel, forecaster names are anonymized to avoid calling undue attention to individual teams; similar behavior has been exhibited by many forecasters. (b) Illustration of the relative weighted interval score (WIS, defined in Section 2.5) of component forecasters over time; lower scores indicate better performance. Each point summarizes the skill of forecasts made at a given date for the one through four week ahead forecasts of incident cases across all state-level locations.
The following sections document the format and general characteristics of COVID-19 forecasts under consideration, the ensemble approaches studied, and the results of comparing different approaches both during model development and during a prospective evaluation of selected methods.

Methods
We give an overview of the U.S. and European Forecast Hubs and the high-level structure of our experiments in Sections 2.1 through 2.5, and then describe the ensemble methods that we consider in Section 2.6.

Problem context: forecasting short-term COVID-19 burden
Starting in April 2020, the U.S. Hub collected probabilistic forecasts of the short-term burden of COVID-19 in the U.S. at the national, state/territory, and county levels (Cramer et al., 2021a); a similar effort began in February 2021 for forecasts of disease burden in 32 European countries (European Covid-19 Forecast Hub, 2021). In this manuscript, we focus on constructing probabilistic ensemble forecasts of weekly counts of reported cases and deaths due to COVID-19 at forecast horizons of one to four weeks for states and territories in the U.S. and for countries in Europe. A maximum horizon of four weeks was set by collaborators at CDC as a horizon at which forecasts would be useful to public health practitioners while maintaining reasonable expectations of a minimum standard of forecast accuracy and reliability. Probabilistic forecasts were contributed to the Hubs in a quantile-based format by teams in academia, government, and industry. The Hubs produced ensemble forecasts each week on Monday using forecasts from teams contributing that week. In the U.S. Hub, seven quantile levels were used for forecasts of cases and 23 quantile levels were used for forecasts of deaths; in the European Hub, 23 quantile levels were used for both target variables.
Weekly reported cases and deaths were calculated as the difference in cumulative counts on consecutive Saturdays, using data assembled by the Johns Hopkins University Center for Systems Science and Engineering as the ground truth (Dong et al., 2020). Due to changes in the definitions of reportable cases and deaths, as well as errors in reporting and backlogs in data collection, there were some instances in which the ground truth data included outlying values, or were revised. Most outliers and revisions were inconsequential, but some were quite substantial in the U.S. as well as in Europe (Figure 2). When fitting retrospective ensembles, we fit to the data that would have been available in real time. This is critical because the relative performance of different component forecasters may shift dramatically depending on whether originally-reported or subsequently-revised data were used to measure forecast skill. An ensemble trained using revised data can therefore have a substantial advantage over one trained using only data that were available in real time, and its performance is not a reliable gauge of how that ensemble method might have done in real time.
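As a small illustration of this data construction (a Python sketch with hypothetical numbers; the Hubs' actual processing code is in R), weekly incidence is the first difference of the cumulative series, and downward data corrections can produce negative weekly values:

```python
import numpy as np

def weekly_incidence(cumulative):
    """Weekly reported counts as differences of cumulative counts on
    consecutive Saturdays. Negative values can arise when earlier
    reports are corrected downward in a later data version."""
    return np.diff(np.asarray(cumulative))

# Hypothetical cumulative Saturday snapshots for one location.
cum = [100, 180, 250, 240, 400]
inc = weekly_incidence(cum)  # the third week is negative due to a correction
```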
The U.S. Hub conducted extensive ensemble model development in real time from late July 2020 through the end of April 2021, with smaller focused experiments ongoing thereafter. We present results for the model development phase as well as a prospective evaluation of a subset of ensemble methods in the U.S. starting with forecasts created on May 3, 2021 and continuing through March 14, 2022. We note that we continued examining a wider range of methods to inform weekly operational forecasting tasks, but the methods that we chose to evaluate prospectively were selected by May 3, 2021, the beginning of the prospective evaluation period, with no alterations thereafter. Real-time submissions of the relative WIS weighted median ensemble described below are on record in the U.S. Hub for the duration of the prospective evaluation period. In one section of the results below, we present a small post hoc exploration of the effects of regularizing the component forecaster weights; these results should be interpreted with caution as they do not constitute a prospective evaluation. To examine how well our findings generalize, we also evaluated the performance of a subset of ensemble methods for prospective forecasts of cases and deaths at the national level for countries in Europe from May 3, 2021 to March 14, 2022.

Eligibility criteria
In the Forecast Hubs, not all forecasts from contributing models are available for all weeks. For example, forecasters may have started submitting forecasts in different weeks, and some forecasters submitted forecasts for only a subset of locations in one or more weeks.
The ensemble forecast for a particular location and forecast date included all component forecasts with a complete set of predictive quantiles (i.e., 7 predictive quantiles for incident cases, 23 for deaths) for all 4 forecast horizons. Teams were not required to submit forecasts for all locations to be included in the ensemble. Some ensemble methods that we considered require historical forecasts to inform component model selection or weighting; for these methods, at least one prior submission was required. The Forecast Hubs enforced other validation criteria, including that predictions of incident cases and deaths were non-negative and predictive quantiles were properly ordered across quantile levels.

Notation
We denote the reported number of cases or deaths for location $l$ and week $t$ by $y_{l,t}$. A single predictive quantile from component forecaster $m$ is denoted by $q^{m}_{l,s,t,k}$, where $s$ indexes the week the forecast was created, $t$ indexes the target week of the forecast, and $k$ indexes the quantile level. The forecast horizon is the difference between the target week $t$ and the forecast date $s$. There are a total of $K = 7$ quantile levels for forecasts of cases in the U.S., and $K = 23$ quantile levels otherwise. The quantile levels are denoted by $\tau_k$ (e.g., if $\tau_k = 0.5$ then $q^{m}_{l,s,t,k}$ is a predictive median). We collect the full set of predictive quantiles for a single model, location, forecast date, and target week in the vector $q^{m}_{l,s,t,1:K}$. We denote the total number of available forecasters by $M$; this number changes across locations and weeks, but we suppress that dependence in the notation.

Baseline forecaster
In the results below, many comparisons are made with reference to an epidemiologically naive baseline forecaster that projects forward the most recent observed value with growing uncertainty at larger horizons. This baseline forecaster was a random walk model on weekly counts of cases or deaths, with $Y_{l,t} = Y_{l,t-1} + \varepsilon_{l,t}$. The model used a non-parametric estimate of the distribution of the innovations $\varepsilon_{l,t}$ based on the observed differences in weekly counts $d_{l,s} = y_{l,s} - y_{l,s-1}$ over all past weeks $s$ for the specified location $l$. Predictive quantiles were based on the quantiles of the collection of these differences and their negations, using the default method for calculating quantiles in R. The inclusion of negative differences ensured that the predictive distributions were symmetric and the predictive median was equal to the most recent observed value. Forecasts at horizons greater than one were obtained by iterating one-step-ahead forecasts. Any resulting predictive quantiles that were less than zero were truncated to equal zero.
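A minimal Python sketch of this baseline, assuming Monte Carlo iteration of one-step-ahead draws to reach longer horizons (the paper computes one-week-ahead quantiles directly from the empirical distribution of the pooled differences using R's default quantile method; the simulation here is just one simple way to approximate multi-week-ahead distributions):

```python
import numpy as np

def baseline_quantiles(y, tau, horizon=1, n_sims=100_000, seed=0):
    """Baseline random-walk forecast for one location: pool the observed
    weekly differences with their negations (making the innovation
    distribution symmetric, so the predictive median sits at the last
    observation), iterate one-step-ahead draws out to `horizon`, and
    truncate negative quantiles to zero."""
    y = np.asarray(y, dtype=float)
    d = np.diff(y)
    innovations = np.concatenate([d, -d])  # symmetric around zero
    rng = np.random.default_rng(seed)
    draws = y[-1] + rng.choice(innovations, size=(n_sims, horizon)).sum(axis=1)
    return np.maximum(np.quantile(draws, tau), 0.0)
```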

Evaluation metrics
To evaluate forecasts, we adopted the weighted interval score (WIS) (Bracher et al., 2021a). Let $q_{1:K}$ be predictive quantiles at levels $\tau_{1:K}$ for the observed quantity $y$. The WIS is calculated as
$$\mathrm{WIS}(q_{1:K}, y) = \frac{1}{K} \sum_{k=1}^{K} 2 \left[ \mathbb{1}_{(-\infty, q_k]}(y) - \tau_k \right] (q_k - y),$$
where $\mathbb{1}_{(-\infty, q_k]}(y)$ is the indicator function that takes the value 1 when $y \in (-\infty, q_k]$ and 0 otherwise. This is a negatively oriented proper score, meaning that lower scores are better and that its expected value under a given data-generating process is minimized by reporting the predictive quantiles from that process. WIS was designed as a discrete approximation to the continuous ranked probability score, and is equivalent to pinball loss, which is commonly used in quantile regression (Bracher et al., 2021a). We note that some other commonly used scores, such as the logarithmic score and the continuous ranked probability score, are not suitable for use with predictive distributions that are specified in terms of a set of predictive quantiles, since a full predictive density or distribution function is not directly available (see Supplemental Section 3 for further discussion).
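As a concrete illustration, the pinball-loss form of WIS can be sketched in a few lines (Python here, purely illustrative; the Hub's evaluation code is in R):

```python
import numpy as np

def wis(quantiles, tau, y):
    """Quantile-score (pinball-loss) form of the WIS:
    (1/K) * sum_k 2 * (1{y <= q_k} - tau_k) * (q_k - y).
    Negatively oriented: lower values indicate better forecasts."""
    q = np.asarray(quantiles, dtype=float)
    tau = np.asarray(tau, dtype=float)
    return float(np.mean(2.0 * ((y <= q) - tau) * (q - y)))

tau = [0.25, 0.5, 0.75]
well_centered = wis([8, 10, 12], tau, y=10.0)   # quantiles bracket the observation
badly_shifted = wis([18, 20, 22], tau, y=10.0)  # all quantiles far above it
```

A forecast whose quantiles bracket the observation scores much lower (better) than one shifted well away from it.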
To compare the skill of forecasters that submitted different subsets of forecasts, we used relative WIS, as done in Cramer et al. (2022). The ensemble forecasters developed and evaluated in this manuscript provided all relevant forecasts; missingness pertains only to the component forecasters, and in the present work the relative WIS is primarily used to summarize component forecaster skill as an input to some of the trained ensemble methods described below. Let $\mathcal{I}$ denote a set of combinations of location $l$ and forecast creation date $s$ over which we wish to summarize model performance, and $\mathcal{I}_{m,m'} \subseteq \mathcal{I}$ be the subset of those locations and dates for which both models $m$ and $m'$ provided forecasts through a forecast horizon of at least four weeks. The pairwise ratio of mean WIS for models $m$ and $m'$ over the set $\mathcal{I}$ is calculated as
$$\theta_{m,m'} = \frac{(4 \cdot |\mathcal{I}_{m,m'}|)^{-1} \sum_{(l,s) \in \mathcal{I}_{m,m'}} \sum_{t=s+1}^{s+4} \mathrm{WIS}(q^{m}_{l,s,t,1:K}, y_{l,t})}{(4 \cdot |\mathcal{I}_{m,m'}|)^{-1} \sum_{(l,s) \in \mathcal{I}_{m,m'}} \sum_{t=s+1}^{s+4} \mathrm{WIS}(q^{m'}_{l,s,t,1:K}, y_{l,t})}.$$
In words, we computed the ratio of the mean WIS scores for model $m$ and each other model $m'$, averaging across the subset of forecasts shared by both models. The summary $\theta_m$ was calculated as the geometric mean of these pairwise ratios of matched mean scores, and summarized how model $m$ did relative to all other models on the forecasts they had in common. These geometric means were then scaled such that the baseline forecaster had a relative WIS of 1; a relative WIS less than 1 indicated forecast skill that was better than the baseline model. We note that if no forecasts were missing, $\mathcal{I}_{m,m'}$ would be the same for all model pairs, so that the denominators of each $\theta_m$ and of $\theta_{\mathrm{baseline}}$ would cancel when normalizing relative to the baseline, and the relative WIS for model $m$ would reduce to the mean WIS for model $m$ divided by the mean WIS for the baseline model. We used the geometric mean to aggregate across model pairs to match the convention set in Cramer et al. (2022), but this detail is not critical: Supplemental Figure 5 illustrates that the relative WIS changes very little if an arithmetic mean is used instead.

We also assessed probabilistic calibration of the models with the one-sided coverage rates of predictive quantiles, calculated as the proportion of observed values that were less than or equal to the predicted quantile value. For a well-calibrated model, the empirical one-sided coverage rate is equal to the nominal quantile level. A method that generates conservative two-sided intervals would have an empirical coverage rate that is less than the nominal rate for quantile levels less than 0.5 and empirical coverage greater than the nominal rate for quantile levels greater than 0.5.
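To make the pairwise construction concrete, here is a small Python sketch (illustrative only; names are ours) that computes relative WIS from per-model mean scores keyed by (location, forecast date). Including each model's comparison with itself (a ratio of exactly 1) makes the no-missingness case reduce to the simple ratio of mean WIS against the baseline:

```python
import numpy as np

def relative_wis(scores, baseline="baseline"):
    """scores: {model: {(location, date): mean WIS across the four horizons}}.
    For each pair of models, take the ratio of mean WIS over the forecasts
    both models share; aggregate the ratios by geometric mean; rescale so
    the baseline model's relative WIS equals 1."""
    models = list(scores)
    theta = {}
    for m in models:
        log_ratios = []
        for mp in models:  # includes mp == m, whose ratio is exactly 1
            shared = scores[m].keys() & scores[mp].keys()
            if not shared:
                continue
            num = np.mean([scores[m][i] for i in shared])
            den = np.mean([scores[mp][i] for i in shared])
            log_ratios.append(np.log(num / den))
        theta[m] = np.exp(np.mean(log_ratios))
    return {m: theta[m] / theta[baseline] for m in models}
```

With no missingness, a model whose WIS is everywhere half the baseline's gets a relative WIS of 0.5.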

Ensemble model formulations
All of the ensemble formulations that we considered obtain a predictive quantile at level $k$ by combining the component forecaster predictions at that quantile level: $q^{\mathrm{ens}}_{l,s,t,k} = f(q^{1}_{l,s,t,k}, \ldots, q^{M}_{l,s,t,k})$.
We conceptually organize the ensemble methods considered according to two factors. First, trained ensemble methods use the past performance of the component forecasters to select a subset of components for inclusion in the ensemble and/or assign the components different weights, whereas untrained methods assign all component forecasters equal weight. Second, we varied the robustness of the combination function $f$ to outlying component forecasts. Specifically, we considered methods based on either a (weighted) mean, which can be sensitive to outlying forecasts, or a (weighted) median, which may be more robust to these outliers. The weighted mean calculates the ensemble quantiles as
$$q^{\mathrm{ens}}_{l,s,t,k} = \sum_{m=1}^{M} w^{m}_{s} \, q^{m}_{l,s,t,k}.$$
The weighted median is defined to be the smallest value $q$ for which the combined weight of all component forecasters with predictions less than or equal to $q$ is at least 0.5; the ensemble forecast quantiles are calculated as
$$q^{\mathrm{ens}}_{l,s,t,k} = \inf\left\{ q \in \mathbb{R} : \sum_{m=1}^{M} w^{m}_{s} \, \mathbb{1}_{(-\infty, q]}(q^{m}_{l,s,t,k}) \geq 0.5 \right\}.$$
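A minimal Python sketch of this quantile-level combination (illustrative only; it implements the inf-based weighted median without the interpolation used in the production ensemble):

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value at which the cumulative (normalized) weight of the
    components at or below it reaches 0.5 -- the inf-based definition."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cum = np.cumsum(w / w.sum())
    return v[np.searchsorted(cum, 0.5)]  # first index with cum weight >= 0.5

def ensemble_quantile(component_quantiles, weights, robust=True):
    """Combine component predictions at a single quantile level with a
    weighted median (robust) or a weighted mean (non-robust)."""
    if robust:
        return weighted_median(component_quantiles, weights)
    return float(np.average(component_quantiles, weights=weights))
```

For example, with equal weights the median of component quantiles (10, 20, 1000) is 20, while the mean is pulled to about 343 by the outlying component, illustrating the robustness motivation.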
In practice, we used the implementation of the weighted median in the matrixStats package for R, which linearly interpolates between the central weighted sample quantiles (Bengtsson, 2020). Graphically, these ensembles can be interpreted as computing a horizontal mean or median of the CDFs of the component forecasters (Supplemental Figure 7). In trained ensemble methods that weight the component forecasters, the weights were calculated as a sigmoidal transformation of the forecasters' relative WIS (see Section 2.5) over a rolling window of weeks leading up to the ensemble forecast date $s$, denoted by $\mathrm{rWIS}^{m}_{s}$:
$$w^{m}_{s} = \frac{\exp(-\theta_s \cdot \mathrm{rWIS}^{m}_{s})}{\sum_{m'=1}^{M} \exp(-\theta_s \cdot \mathrm{rWIS}^{m'}_{s})}.$$
This formulation requires estimating the nonnegative parameter $\theta_s$, which was updated each week. If $\theta_s = 0$, the procedure reduces to an equal weighting scheme; if $\theta_s$ is large, better-performing component forecasters (with low relative WIS scores) are assigned higher weight. We selected $\theta_s$ by using a grid search to optimize the weighted interval score of the ensemble forecast over the training window, summing across all locations and relevant target weeks on or before time $s$:
$$\hat{\theta}_s = \arg\min_{\theta} \sum_{l} \sum_{r=s-a}^{s-1} \sum_{\substack{t = r+1 \\ t \le s}}^{r+4} \mathrm{WIS}(q^{\mathrm{ens},\theta}_{l,r,t,1:K}, y_{l,t}).$$
The size of the training window, a, is a tuning parameter that must be selected; we considered several possible values during model development, discussed further below. In a post hoc analysis, we considered regularizing the weights by setting a limit on the weight that could be assigned to any one model. We implemented this regularization strategy by restricting the grid of values for θ s to those values for which the largest component forecaster weight was less than the maximum weight limit. In this parameterization, the component forecaster weights are by construction nonnegative and sum to 1. When forecasts were missing for one or more component forecasters in a particular location and forecast date, we set the weights for those forecasters to 0 and renormalized the weights for the remaining forecasters so that they summed to 1.
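The grid search with the maximum-weight restriction can be sketched as follows (Python, illustrative; `softmax_weights` is our hypothetical stand-in for the sigmoidal transformation, and `ensemble_loss` stands in for the summed training-window WIS of the resulting ensemble forecasts):

```python
import numpy as np

def softmax_weights(rel_wis, theta):
    """Weights as a normalized exponential of negative relative WIS:
    theta = 0 gives equal weights; large theta concentrates weight on
    components with low relative WIS. (A sketch of the sigmoidal scheme,
    not necessarily the exact transformation used in production.)"""
    z = np.exp(-theta * np.asarray(rel_wis, dtype=float))
    return z / z.sum()

def select_theta(rel_wis, ensemble_loss, theta_grid, max_weight=None):
    """Choose theta by grid search on the training-window ensemble loss,
    skipping grid points where any single component's weight would exceed
    max_weight (the post hoc regularization strategy)."""
    best_theta, best_loss = None, np.inf
    for theta in theta_grid:
        w = softmax_weights(rel_wis, theta)
        if max_weight is not None and w.max() > max_weight:
            continue  # regularization: discard over-concentrated weightings
        loss = ensemble_loss(w)
        if loss < best_loss:
            best_theta, best_loss = theta, loss
    return best_theta
```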
Some trained ensembles that we considered used a preliminary component selection step, where the top few individual forecasters were selected for inclusion in the ensemble based on their relative WIS during the training window. The number of component forecasters selected is a tuning parameter that we explored during model development. This component selection step may be used either in combination with the continuous weighting scheme described above, or with an equally-weighted combination of selected forecasters. Throughout the text below, we use the term "trained" ensemble to refer generically to a method that uses component selection and/or weighting based on historical component forecaster performance.
There are many other weighted ensembling schemes that could be formulated. For example, separate weights could be estimated for different forecast horizons, for different quantile levels, or for subsets of locations. As another example, the weights could be estimated by directly minimizing the WIS associated with look-ahead ensemble forecasts (Taylor & Taylor, 2021). We explored these and other ideas during model development, but our analyses did not show them to lead to substantial gains, and thus we settled on the simpler weighting schemes presented above. Further discussion of alternative schemes is deferred to the supplement.

Data and code accessibility
All component model forecasts and code used for fitting ensemble models and conducting the analyses presented in this manuscript are available in public GitHub repositories (Cramer et al., 2021b; Ray, 2020, 2021).

Results
We discuss the decisions that we made during model development in Section 3.1 before turning to a more focused discussion of the impact on ensemble forecast skill of using robust or nonrobust combination mechanisms in Section 3.2, and trained or untrained methods in Section 3.3. Section 3.4 presents a post hoc evaluation of a variation on ensemble methods that regularizes the component forecaster weights. Results for the evaluation using forecasts in Europe are presented in Section 3.5.
Throughout this section, scores were calculated using the ground truth data that were available as of May 16, 2022 unless otherwise noted. This allowed five weeks of revisions to accrue between the last target end date that was evaluated and the date of the data used for evaluation. When reporting measures of forecast skill, we dropped forecasts for which the corresponding reported value of weekly cases or deaths was negative. This could occur when an error in data collection was identified and corrected, or when the definition of a reportable case or death was changed. We included scores for all other outlying and revised data in the primary analysis because it was difficult to define objective standards for what should be omitted. However, a supplemental analysis indicated that the results about the relative performance of different ensemble methods were not sensitive to these reporting anomalies (Supplemental Section 5.4, Supplemental Figures 14 through 16).

Model development
During model development, we evaluated many variations on trained ensemble methods. In these comparisons we take the equally weighted median ensemble as a reference approach because this is the method used for the production ensemble produced by the U.S. Hub during most of the time that we were running these experiments. As measured by mean WIS over the model development phase, the equally weighted median ensemble was better than the equally weighted mean ensemble, but both were outperformed by the trained ensemble variations using component forecaster selection and/or weighting (Figure 3). The weighted approaches had stable performance no matter how many component forecasters were included. Approaches using an equally weighted combination of selected component forecasters were generally better only when top-performing component forecasters were included.
We also considered varying other tuning parameters such as the length of the training window and whether component forecaster weights were shared across different quantile levels or across forecast horizons. However, we did not find strong and consistent gains in performance when varying these other factors (Supplemental Figures 17 through 22). Finally, we evaluated other possible formulations of weighted ensembles, with weights that were not directly dependent on the relative WIS of the component forecasters but were instead estimated by optimizing the look-ahead ensemble WIS over the training set. As measured by mean WIS, the best versions of these other variations on weighted ensembles had similar performance to the best versions of the relative WIS weighted median considered in the primary analysis. However, they were more sensitive to settings like the number of component forecasters included and the training set window size (Supplemental Figures 17 and 18).
Based on these results, on May 3, 2021 we selected the relative WIS weighted ensemble variations for use in the prospective evaluation, as these methods achieved mean WIS similar to the best of the other variations considered but were more consistent across different training set window sizes and numbers of component forecasters included. We used intermediate values for these tuning parameters, including 10 component forecasters and a training set window size of 12 weeks. We also included the equally weighted mean and median of all models in the prospective evaluation as reference methods. The following sections give a more detailed evaluation of these selected methods, describing how they performed during both the model development phase and the prospective evaluation phase.

Comparing robust and non-robust ensemble methods
We found that for equally weighted ensemble approaches, robust combination methods were helpful for limiting the effects of outlying component forecasts. For most combinations of evaluation phase (model development or prospective evaluation) and target variable (cases or deaths), the equally weighted median had better mean and worst-case WIS than the equally weighted mean, often by a large margin (Figure 3, Supplemental Figure 8). Results broken down by forecast date show that the methods achieved similar scores most of the time, but the equally weighted mean ensemble occasionally had serious failures (Supplemental Figure 10). These failures were generally associated with instances where a component forecaster issued extreme, outlying forecasts, e.g., forecasts of deaths issued the week of February 15th in Ohio (Figure 1). There were fewer consistent differences between the trained mean and trained median ensemble approaches. This suggests that both trained approaches that we considered had similar robustness to outlying forecasts (if the outliers were produced by component forecasters that were down-weighted or not selected for inclusion due to poor historical performance) or sensitivity to outlying forecasts (if they were produced by component forecasters that were selected and given high weight).

Figure 3: In panel (a) the vertical axis is the difference in mean WIS for the given ensemble method and the equally weighted median ensemble. Boxes show the 25th percentile, 50th percentile, and 75th percentile of these differences, averaging across all locations for each combination of forecast date and horizon. For legibility, outliers are suppressed here; Supplemental Figure 8 shows the full distribution. A cross is displayed at the difference in overall mean scores for the specified combination method and the equally weighted median, averaging across all locations, forecast dates, and horizons. Large mean score differences of approximately 2,005 and 2,387 are suppressed for the Rel. WIS Weighted Mean and Rel. WIS Weighted Median ensembles, respectively, in the prospective phase forecasts of cases. A negative value indicates that the given method outperformed the equally weighted median. The vertical axis of panel (b) shows the probabilistic calibration of the ensemble forecasts through the one-sided empirical coverage rates of the predictive quantiles. A well-calibrated forecaster has a difference of 0 between the empirical and nominal coverage rates, while a forecaster with conservative (wide) two-sided intervals has negative differences for nominal quantile levels less than 0.5 and positive differences for quantile levels greater than 0.5.
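As a concrete sketch of the quantile-wise combination schemes being compared here (illustrative numbers only, not Hub forecasts), the mean and median ensembles operate level by level across component forecasters, and a single outlying component affects the two very differently:

```python
# Quantile-wise ensemble combination: each row holds one component
# forecaster's predictive quantiles at a common set of probability levels.
component_quantiles = [
    [90.0, 100.0, 110.0],    # component A: 0.25, 0.5, 0.75 quantiles
    [95.0, 105.0, 115.0],    # component B
    [85.0, 95.0, 105.0],     # component C
    [900.0, 1000.0, 1100.0], # component D: an extreme, outlying forecast
]

def mean_ensemble(rows):
    # Equally weighted mean at each quantile level.
    return [sum(col) / len(col) for col in zip(*rows)]

def median_ensemble(rows):
    # Equally weighted median at each quantile level.
    out = []
    for col in zip(*rows):
        s = sorted(col)
        n = len(s)
        mid = n // 2
        out.append(s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2)
    return out

print(mean_ensemble(component_quantiles))    # pulled far upward by component D
print(median_ensemble(component_quantiles))  # stays near the consensus
```

With the outlying component D included, the mean ensemble's median quantile is 325, while the median ensemble's is 102.5, close to the consensus of the other three components.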
Panel (b) of Figure 3 summarizes probabilistic calibration of the ensemble forecasts with one-sided quantile coverage rates. The median-based ensemble approaches generally had lower one-sided quantile coverage rates than the mean-based approaches, indicating a downward shift of the forecast distributions. This was associated with poorer probabilistic calibration for forecasts of cases, where the ensemble forecast distributions tended to be too low. For forecasts of deaths, which were better centered but tended to be too narrow, the calibration of the median-based methods was not consistently better or worse than the calibration of the corresponding mean-based methods.

Comparing trained and untrained ensemble methods
Averaging across all forecasts for incident cases and deaths in the model development phase, the weighted median was better than the equally weighted median, and the weighted mean was better than the equally weighted mean (Figure 3). However, in the prospective evaluation, the trained methods showed improved mean WIS relative to untrained methods when forecasting deaths but were worse when forecasting cases. In general, the trained ensembles also came closer to matching the performance of a post hoc weighted mean ensemble for deaths than for cases (Figures 4 and 5). This post hoc weighted mean ensemble estimated the optimal weights for each week after the forecasted data were observed; it would not be possible to use this method in real time, but it gives a bound on the ensemble forecast skill that can be achieved using quantile averaging.
We believe that this difference in the relative performance of trained and untrained ensemble methods for cases and deaths is primarily due to differences in component model behavior for forecasting cases and deaths. A fundamental difference between these outcomes is that cases are a leading indicator relative to deaths, so that trends in cases in the recent past may be a helpful input for forecasting deaths, but there are no clear candidates for a similar leading indicator for cases (e.g., see McDonald et al. (2021) for an investigation of some possibilities that were found to yield only modest and inconsistent improvements in forecast skill). Indeed, the best models for forecasting mortality generally do use previously reported cases as an input to forecasting (Cramer et al., 2022), and it has previously been noted that deaths are an easier target to forecast than cases (Reich et al., 2021 [Online]; Bracher et al., 2021b). This is reflected in the performance of trained ensembles, which were often able to identify a future change in direction of trends when forecasting deaths, but generally tended to predict a continuation of recent trends when forecasting cases (Supplemental Section 7, Supplemental Figures 25 and 26). An interpretation of this is that the component forecasters with the best record of performance for forecasting deaths during the training window were able to capture changes in trend, but the best component forecasters for forecasting cases were often simply extrapolating recent trends. While all ensemble methods tended to "overshoot" at local peaks in weekly incidence, this tendency was more pronounced for forecasts of cases than for forecasts of deaths, and training tended to exacerbate the tendency to overshoot when forecasting cases but to mitigate this tendency when forecasting deaths (Supplemental Figure 25).
Figures 4 and 5 explore the relationships between component forecaster performance and the relative performance of trained and untrained ensemble methods in more detail. For deaths, the trained ensemble was able to identify and upweight a few component forecasters that had consistently good performance (e.g., Karlen-pypm and UMass-MechBayes). This led to consistently strong performance of the trained ensemble; it was always among the best models contributing to the U.S. Hub, and it was better than the equally weighted median ensemble in nearly every week.

Another difference in component behavior when forecasting cases and deaths is illustrated in Figures 4 and 5.
For cases, the trained ensemble also had strong performance for many months when the LNQ-ens1 forecaster was contributing to the U.S. Hub. However, when LNQ-ens1 stopped contributing forecasts in June 2021, the trained ensemble shifted to weighting Karlen-pypm, which had less stable performance for forecasting cases. During July 2021, Karlen-pypm was the only forecaster in the U.S. Hub that predicted rapid growth at the start of the Delta wave, and it achieved the best relative WIS by a substantial margin at that time. However, that forecaster predicted continued growth as the Delta wave started to wane, and it had the worst relative WIS a few weeks later. A similar situation occurred during the Omicron wave in January 2022, when the JHUAPL-Bucky model was one of a small number of forecasters that captured the rise at the beginning of the wave, but it then overshot near the peak. In both of these instances, the post hoc weighting would have assigned a large amount of weight to the forecaster in question at the start of the wave, when it was uniquely successful at identifying rising trends in cases, but then shifted away from that forecaster as the peak neared. Trained ensembles that estimated weights based on past performance suffered, as they started to upweight those component forecasters just as their performance dropped. This recurring pattern highlights the challenge that nonstationary component forecaster performance presents for trained ensembles. Reinforcing this point, we note that in the post hoc weighted mean ensemble, the component forecaster weights are only weakly autocorrelated (Figures 4 and 5, Supplemental Figure 27), again suggesting that an optimal weighting may require frequently changing component weights to adapt to nonstationary performance.
During the model development phase, the trained ensembles had better probabilistic calibration than their equally weighted counterparts ( Figure 3 panel (b)). During prospective evaluation, the trained median ensemble had generally higher one-sided coverage rates, corresponding to better calibration in the upper tail but slightly worse calibration in the lower tail. The trained mean ensemble had slightly better calibration than the equally weighted mean when forecasting deaths in the prospective evaluation phase, but inconsistent gains across different quantile levels when forecasting cases. Supplemental Figures 12 and 13 show that the widths of 95% prediction intervals from both the equally weighted median ensemble and the relative WIS weighted median ensemble tended to rank near the middle of the widths of 95% prediction intervals from the component forecasters. This can be interpreted as an advantage if we are concerned about the possible influence of component forecasters with very narrow or very wide prediction intervals. However, it can also be viewed as a disadvantage, particularly if improved calibration could have been realized if the prediction intervals were wider. We return to this point in the discussion.

Post hoc evaluation of weight regularization
Motivated by the assignment of large weights to some component forecasters in the trained ensembles for cases (Figure 4), in January 2022 we conducted a post hoc evaluation of trained ensembles that were regularized by imposing a limit on the weight that could be assigned to any one component forecaster (see Section 2.6). In this evaluation, we constructed relative WIS weighted median ensemble forecasts for all historical forecast dates up through the week of January 3, 2022. These ensemble fits included the top 10 component forecasters and were trained on a rolling window of the 12 most recent forecast dates, matching the settings that were selected for the prospective analysis. We considered six values for the maximum weight limit: 0.1, 0.2, 0.3, 0.4, 0.5, and 1.0. A weight limit of 1.0 corresponds to the unregularized method considered in the prospective evaluation, and a weight limit of 0.1 corresponds to an equally weighted median of the top ten forecasters, which was previously considered during the model development phase.
For both cases and deaths, the results of this analysis indicate that a weight limit as low as 0.1 was unhelpful (Figure 6). When forecasting deaths, this regularization strategy had limited impact on the trained ensemble performance as long as the maximum weight limit was about 0.3 or higher, which is consistent with the fact that the trained ensembles for deaths rarely assigned a large weight to one model (Figure 5). However, when forecasting cases, the regularization resulted in large improvements in mean WIS, with the best WIS at limits near 0.2 or 0.3. These improvements were concentrated in short periods near local peaks in the epidemic waves (Supplemental Figure 28). For both cases and deaths, smaller limits on the maximum weight were associated with a slight reduction in the empirical coverage rates of 95% prediction intervals. Based on these results, the U.S. Hub used a weight limit of 0.3 in trained ensemble forecasts starting in January 2022.
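One simple way such a maximum weight limit could be enforced is to cap the offending weights and renormalize the rest iteratively until no weight exceeds the limit. This is a sketch of the general technique, not necessarily the exact implementation described in Section 2.6:

```python
def cap_weights(weights, limit):
    """Cap each weight at `limit`, redistributing the excess mass
    proportionally among the uncapped weights. Assumes the input weights
    sum to 1 and that len(weights) * limit >= 1, so a solution exists."""
    w = list(weights)
    capped = [False] * len(w)
    while True:
        over = [i for i, wi in enumerate(w) if not capped[i] and wi > limit]
        if not over:
            return w
        for i in over:
            w[i] = limit
            capped[i] = True
        free = [i for i in range(len(w)) if not capped[i]]
        free_mass = 1.0 - limit * sum(capped)   # mass left for uncapped weights
        free_total = sum(w[i] for i in free)
        if not free or free_total == 0.0:
            return w
        for i in free:
            # rescale uncapped weights proportionally to absorb free_mass
            w[i] = w[i] / free_total * free_mass

print(cap_weights([0.6, 0.25, 0.1, 0.05], 0.3))  # no weight exceeds 0.3
```

The iteration is needed because renormalizing after one capping step can push a previously acceptable weight over the limit.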

Results in the European application
Figure 7 summarizes weighted interval scores and calibration for the four selected ensemble methods when applied prospectively to forecast data collected in the European Forecast Hub. Consistent with what we observed for the U.S. above, the equally weighted median ensemble was generally better than the equally weighted mean. However, in the European evaluation, the trained methods had worse performance than the equally weighted median for forecasting both cases and deaths.
In a post hoc exploratory analysis, we noted that patterns of missingness in forecast submissions are quite different in the U.S. and in Europe (Figure 8, Supplemental Figures 29 through 36). In the U.S. Hub, nearly all models submit forecasts for all of the 50 states, and many additionally submit forecasts for the District of Columbia or U.S. territories. However, in the European Hub, roughly half of the contributing models submit forecasts for only a small number of locations. Because the trained ensembles selected for prospective evaluation select the top 10 individual forecasters by relative WIS, this means that in practice the trained ensembles only included a few component forecasters for many locations in Europe.
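The effective-weight calculation used in the figures, which sets the weights of components that do not forecast a given location to 0 and rescales the remaining weights proportionally to sum to 1, can be sketched as follows; the model names and weight values are hypothetical:

```python
def effective_weights(global_weights, available):
    """Zero out weights for components that did not forecast this location,
    then rescale the remaining weights proportionally to sum to 1.
    `global_weights` maps component name -> trained weight;
    `available` is the set of components submitting for this location."""
    kept = {m: w for m, w in global_weights.items() if m in available}
    total = sum(kept.values())
    if total == 0:
        # No weighted component available; fall back to equal weights.
        return {m: 1 / len(available) for m in available}
    return {m: w / total for m, w in kept.items()}

# Hypothetical trained weights and per-location availability.
weights = {"model_A": 0.5, "model_B": 0.3, "model_C": 0.2}
print(effective_weights(weights, {"model_A", "model_C"}))
# model_A and model_C share the mass in proportion 0.5 : 0.2
```

When many components are missing, as in some European locations, most of the mass concentrates on the few components that remain, which helps explain why the trained ensembles behaved differently there.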

Discussion
In this work, we have documented the analyses that have informed the selection of methods employed by the official U.S. Hub ensemble that is used by the CDC for communication with public health decision makers and the public more generally. In this context, our preference is for methods that have stable performance across different locations and different points in time, and good performance on average.
Our most consistent finding is that robust ensemble methods (i.e., based on a median) are helpful because they are more stable in the presence of outlying forecasts than methods using a mean. Ensemble methods based on means have repeatedly produced extreme forecasts that are dramatically misaligned with the observed data, but median-based approaches have not suffered from this problem as much. This stability is of particular importance in the context of forecasts that will be used by public health decision makers. These observations informed our decision to use an equally weighted median ensemble for the official U.S. Hub ensemble early on. We have seen more mixed success for trained ensemble methods. Overall, trained ensemble methods did well when they were able to identify and upweight component forecasters with stable good performance, but struggled when component forecaster skill varied over time. In the U.S., trained ensembles have a long record of good performance when forecasting deaths, and the U.S. Hub adopted the relative WIS weighted median ensemble as its official method for forecasting deaths in November 2021. However, trained methods have been less successful at forecasting cases in the U.S., both near peaks in weekly incidence (when they tend to overshoot) and at points where the performance of the component forecasters is inconsistent. Additionally, the trained methods we adopted did not translate well to a setting with a large number of missing component forecasts, as in the European Hub.

Figure 6: Mean WIS and 95% prediction interval coverage rates for relative WIS weighted median trained ensemble variations with varying sizes of a limit on the weight that could be assigned to any one model. In panel (a), the baseline forecaster is included as a reference. Results are for a post hoc analysis including forecast dates up to January 3, 2022.

Figure 7: In panel (a) the vertical axis is the difference in mean WIS for the given ensemble method and the equally weighted median ensemble. Boxes show the 25th percentile, 50th percentile, and 75th percentile of these differences, averaging across all locations for each combination of forecast date and horizon. For legibility, outliers are suppressed here; Supplemental Figure 9 shows the full distribution. A cross is displayed at the difference in overall mean scores for the specified combination method and the equally weighted median of all models, averaging across all locations, forecast dates, and horizons. A large mean score difference of approximately 666 is suppressed for the Equal Weighted Mean ensemble forecasts of deaths. A negative value indicates that the given method had better forecast skill than the equally weighted median. Panel (b) shows the probabilistic calibration of the forecasts through the one-sided empirical coverage rates of the predictive quantiles. A well-calibrated forecaster has a difference of 0 between the empirical and nominal coverage rates, while a forecaster with conservative (wide) two-sided intervals has negative differences for nominal quantile levels less than 0.5 and positive differences for quantile levels greater than 0.5.

Figure 8: The plot shows, at left, the estimated weights that would be used if all of the top 10 models (each represented by a different color) were available for a given location, and the effective weights used in each location after setting the weights for models that did not provide location-specific forecasts to 0 and rescaling the other weights proportionally to sum to 1.
To preserve the prospective nature of our analyses, we have not examined additional ensemble variations in the European application, but we hypothesize that these problems might be mitigated by including all component forecasters rather than the top 10, or by performing weight estimation separately in clusters of locations where the same component forecasters are contributing. Allowing for different weights in different locations may also be an effective strategy for addressing the impacts of differences in data availability and quality across different locations.

In this manuscript, we have focused on relatively simple approaches to building ensemble forecasts. There are several opportunities for other directions that we have not considered here, and the gap in performance between the ensemble methods we have considered and an ensemble using post hoc optimal weights indicates that there may still be room for improvement in ensemble methods. In our view, the most central challenge for trained ensembles is the inconsistency of the relative performance of many component forecasters, which may in turn be responsible for the lack of strong short-term temporal correlation in the component forecaster weights that were estimated by the post hoc weighted mean ensemble. For models with a relatively long history of performance over multiple epidemic waves, we believe that the most promising approach to addressing this is to use weights that depend on covariates such as recent trends in incidence. This might allow the ensemble to learn the conditions in which component forecasters have been more or less reliable, and to upweight models locally during phases similar to those in which they have done well in the past. Similar approaches have been used for other infectious disease systems such as influenza in the past (e.g., ), but they used a substantial amount of training data over multiple years.

There are several other possible directions for further exploration. We have addressed the challenge posed by outlying component forecasts by using median-based combination mechanisms, but another approach would be to pre-screen the component forecasts and remove outlying forecasts. This is a difficult task because there are times when weekly cases and deaths grow exponentially, and occasionally only one or two models have captured this growth accurately (Supplemental Figures 1 and 2).
A component screening method would have to be careful to avoid screening out methods that looked extreme relative to the data or other component forecasts, but in fact accurately captured exponential growth (see Supplemental Section 1 for more discussion).
Another challenge is that the ensemble forecasts have not always been well calibrated. We are actively developing approaches to address this by post hoc recalibration of the ensemble forecasts. Another possible route forward would be to use a different method for ensemble construction. As we discussed earlier, the ensemble methods that we have considered work by combining the predictions from component forecasters at each quantile level, and therefore tend to have a dispersion that ranks in the middle of the dispersions of the component forecasters. In contrast, an ensemble forecast obtained as a distributional mixture of component forecasts would exhibit greater uncertainty at times when the component forecasts disagreed with each other. However, such an approach would be impacted by extreme component forecasts, and would likely require the development of strategies for screening outlying forecasts as discussed above.
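To illustrate the contrast drawn above, the following sketch (with hypothetical, disagreeing normal components) compares the 95% interval width from quantile-level averaging with the width from an equal-weight distributional mixture:

```python
from statistics import NormalDist

# Two disagreeing component forecasts (hypothetical).
f1, f2 = NormalDist(100, 10), NormalDist(200, 10)

def quantile_average(tau):
    # Quantile-level (Vincentized) combination, as in the Hub ensembles.
    return 0.5 * (f1.inv_cdf(tau) + f2.inv_cdf(tau))

def mixture_quantile(tau, lo=0.0, hi=400.0):
    # Invert the mixture CDF 0.5*F1 + 0.5*F2 by bisection.
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if 0.5 * (f1.cdf(mid) + f2.cdf(mid)) < tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

qa_width = quantile_average(0.975) - quantile_average(0.025)
mix_width = mixture_quantile(0.975) - mixture_quantile(0.025)
print(qa_width < mix_width)  # the mixture interval is wider: True
```

Here the quantile-average interval is no wider than each component's own interval, while the mixture interval spans the disagreement between the two components, reflecting the greater uncertainty described in the text.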
Additionally, our methods for constructing ensemble forecasts do not directly account for the fact that some component forecasters are quite similar to each other and may provide redundant information about the future of the pandemic. Ensembles generally benefit from combining diverse component forecasters, and it could be helpful to encourage this -for example, by clustering the forecasters and including a representative summary of the forecasts within each cluster as the ensemble components. There are also related questions about the importance of different component forecasters to ensemble skill; we plan to explore this direction in future work by using tools such as the Shapley value to describe the contribution of individual components to the full ensemble.
We have used the WIS and probabilistic calibration to measure the extent to which forecasts are consistent with the eventually observed data. These summaries of performance are commonly used and provide useful insights into forecast performance, but it is worth noting that they do not necessarily reflect the utility of the forecasts for every particular decision-making context. Aggregated summaries of performance, such as overall quantile coverage rates, could obscure finer-scale details; for instance, a method with good coverage rates on average could have high coverage at times that are relatively unimportant and low coverage when it matters. Additionally, for some public health decision-making purposes, one or another aspect of a forecast may be more important; for example, some users may prioritize accurate assessments about when a new wave may begin, but other users may find accurate forecasts of peak intensity to be more important. Our evaluation metrics do not necessarily reflect the particular needs of those specific end users, and it is possible that different ensemble methods would be more or less appropriate to generate forecasts that serve different purposes.
Careful consideration and rigorous evaluation are required to support decisions about what ensemble methods should be used for infectious disease forecasting. As we discussed earlier, to obtain an accurate measure of a forecaster's performance, it is critical that the versions of ground truth data that would have been available in real time are used for parameter estimation. This applies as much to ensemble forecasters as it does to individual models. Additionally, it is important to be clear about what methods development and evaluation were done retrospectively and what forecasts were generated prospectively in real time. We believe that to avoid disruptions to public health end users, a solid evidence base of stable performance in prospective forecasts should be assembled to support a change in ensemble methods. We have followed these principles in this work, and have followed the EPIFORGE guidelines in describing our analysis (Pollett et al. (2021); Supplemental Section 11).
The COVID-19 pandemic has presented a unique challenge for infectious disease forecasting. The U.S. and European Forecast Hubs have collected a wealth of forecasts from many contributing teams -- far more than have been collected in previous collaborative forecasting efforts for infectious diseases such as influenza, dengue, and Ebola. These forecasts have been produced in real time to respond to an emerging pathogen that has been one of the most serious public health crises in the last century. This setting has introduced a myriad of modeling difficulties, from data anomalies due to new reporting systems being brought online and changing case definitions, to uncertainty about the fundamental epidemiological parameters of disease transmission, to rapidly changing social factors such as implementation and uptake of non-pharmaceutical interventions. The behavior of individual models in the face of these difficulties has in turn affected the methods that were suitable for producing ensemble forecasts. We are hopeful that the lessons learned about infectious disease forecasting will help to inform effective responses from the forecasting community in future infectious disease crises.

We also note that for both cases and deaths, there are many outlying forecasts. These can take the form of forecasts that bear little relation to the observed data, such as forecasts of nearly zero cases near peaks as shown in all facets of Supplemental Figure 1, or forecasts that are uniformly too high, such as the outlying forecast of deaths in Florida that is visible in the second panel of Supplemental Figure 2.

Another type of outlying forecast is one that predicts exponential growth that does not materialize, as illustrated at several points in forecasts of cases in Florida and Texas (Supplemental Figure 1). We note that these forecasts were unsuccessful, but closely match the trajectories of the few successful forecasts that were made during the Omicron wave in January 2022. This illustrates a potential challenge with automated outlier detection schemes in the context of a process where exponential growth is possible.

Relative WIS of component forecasters
Supplemental Figures 3 and 4 show the relative WIS of each component forecaster as a function of the forecast week for the weekly cases and weekly deaths targets, respectively. For each week, we calculated a standardized rank for each model based on where that model's relative WIS fell relative to all other models that had submissions that week. In these rankings, 0 indicates the model with the best performance and 1 indicates the model with the worst performance as measured by relative WIS. We see that nonstationarity of relative performance is very common, with many models alternating between weeks with top-ranking performance and bottom-ranking performance. Supplemental Figure 5 compares two methods for aggregating across forecasters. Recall that as defined in the main text, the relative WIS uses a geometric mean to aggregate WIS ratios across pairs of forecasters. To refresh the notation, we let I denote a set of combinations of location l and forecast creation date s over which we desire to summarize model performance, and I_{m,m′} ⊆ I be the subset of those locations and dates for which both models m and m′ provided forecasts. The relative WIS of model m over the set I is calculated as

rWIS_{m,geom} = [ ∏_{m′=1}^{M} ( Σ_{(l,s) ∈ I_{m,m′}} WIS_m(l,s) / Σ_{(l,s) ∈ I_{m,m′}} WIS_{m′}(l,s) ) ]^{1/M},

where the product is over all M models.
There is no substantive difference between the relative WIS values obtained using these aggregation strategies.

Supplemental Figures 3 and 4: A rank of 0 indicates that the model has the best performance in a given week, and a rank of 1 indicates that it has the worst performance. There is a facet for each component forecaster, and the colored line shows the standardized rank of that forecaster.

Supplemental Figure 5: Comparison of aggregation methods used to summarize across models when calculating the relative WIS. Each point corresponds to one combination of component forecaster and forecast date, and shows the relative WIS based on a trailing window of 12 weeks using the arithmetic mean to summarize across models (horizontal axis) or the geometric mean to summarize across models (vertical axis). The figure shows these results for all component model forecasts of incident cases at the state level in the United States made on dates between July 27, 2020 and March 14, 2022, using the truth data available as of the forecast date; the relative WIS scores shown are those that would have been used as input to the relative WIS weighted ensemble methods for real-time forecasts.
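Under this definition, the relative WIS can be computed from per-model score lists. The sketch below uses made-up WIS values and assumes, for simplicity, that every model forecast the same set of locations and dates (so every pairwise subset I_{m,m′} coincides with I):

```python
import math

def relative_wis(scores, model):
    """Geometric mean over all models m' of the ratio
    (summed WIS of `model`) / (summed WIS of m'), assuming for simplicity
    that every model forecast the same locations and dates."""
    ratios = [sum(scores[model]) / sum(other) for other in scores.values()]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical WIS values for three models over the same forecast tasks.
scores = {"m1": [10.0, 12.0], "m2": [20.0, 24.0], "m3": [40.0, 48.0]}
print(relative_wis(scores, "m1"))  # below 1: better than the typical model
```

In the real computation, each ratio would be restricted to the locations and dates both models actually forecast, which is what makes the relative WIS robust to differing submission patterns.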
3 Comparison of logarithmic score and weighted interval score

In the Forecast Hubs, all forecasts are represented by a set of predictive quantiles at specified probability levels. This has motivated our decision to use the weighted interval score (WIS) for forecast evaluation, as the WIS is a proper score for quantile forecasts. In this section, we discuss challenges with using the logarithmic score when only predictive quantiles are available, and give a qualitative illustration of the difference between the logarithmic score and the WIS. We give a general discussion in Section 3.1, and defer some technical details to Section 3.2.

Illustrative example comparing the logarithmic score and WIS
Suppose that we have a forecast submission consisting of quantiles of a predictive distribution at the 23 probability levels used in the Forecast Hub, and our goal is to calculate a score that measures the accuracy of this forecast for the observed value y_obs. One common choice of score for probabilistic forecasts is the logarithmic score; here we consider the negative logarithmic score so that its orientation matches that of the WIS. If a predictive density f_Y(y) is available, the negative logarithmic score is defined as −log f_Y(y_obs). In a setting where we do not have the predictive density f_Y, but only predictive quantiles, the logarithmic score cannot be calculated. One possible route forward is to attempt to reconstruct the predictive density from the provided predictive quantiles. We illustrate here that within the limits of the predictive quantiles, it is possible to obtain a reasonable approximation of the predictive density. Here, we do so by fitting a monotonic spline to the provided quantiles to approximate the cumulative distribution function, and then differentiating to approximate the probability density function. However, extrapolating beyond the predictive quantiles is more of a challenge, and requires some assumption to be made about the behavior of the tails of the predictive distribution. This assumption can have a large impact on the log score when observations fall in the tails.
Supplemental Figure 6 illustrates this using an example of a hypothetical forecast submission consisting of quantiles of the predictive distribution Y ∼ log normal(4, 0.5). We compare this predictive distribution with two approximations of it that are derived from the quantiles: one assuming a normal distribution for the tails, and the second assuming a Cauchy distribution for the tails. Although there are apparently only minor differences in the CDFs and PDFs of these distributions, the log scores diverge substantially in the tails. By construction, all three distributions have the same quantiles at the 23 points that were included in the submission, and so the WIS is identical for all three distributions. To summarize, when only a set of predictive quantiles is provided, there is not enough information to characterize the behavior of the forecast distribution in the tails, and so a log score cannot be calculated. On the other hand, the WIS is defined only in terms of the specified predictive quantiles, and so it does not suffer from this problem.
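A simplified stand-in for this reconstruction, using piecewise-linear interpolation of the CDF rather than a monotonic spline, and a smaller hypothetical quantile set than the Hub's 23 levels, illustrates that the interior of the distribution is well identified by the submitted quantiles:

```python
import bisect
from statistics import NormalDist

# Quantiles of a hypothetical predictive distribution at a few levels
# (a simplified stand-in for the Hub's 23-level format).
taus = [0.05, 0.25, 0.5, 0.75, 0.95]
dist = NormalDist(mu=100, sigma=10)
qs = [dist.inv_cdf(t) for t in taus]

def cdf_interior(y):
    """Piecewise-linear interpolation of the CDF between the provided
    quantiles; a crude stand-in for the monotone spline in the text.
    Only valid for q_1 <= y <= q_K."""
    i = bisect.bisect_right(qs, y) - 1
    i = max(0, min(i, len(qs) - 2))
    frac = (y - qs[i]) / (qs[i + 1] - qs[i])
    return taus[i] + frac * (taus[i + 1] - taus[i])

# Inside the quantile range, the reconstruction tracks the true CDF closely.
print(round(cdf_interior(100.0), 3))  # true value is 0.5
```

Outside the interval [q_1, q_K], this construction says nothing at all, which is exactly the tail-behavior gap discussed above.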
Qualitatively, the figure shows that both the negative log score and the WIS are minimized when the observed value falls near the center of the predictive distribution, and are larger when the observation falls in the tails. More formally, the negative log score is optimized when the observed value falls at a predictive mode, while the weighted interval score is optimized when the observed value falls at the predictive median.
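For reference, the WIS of a quantile forecast can be computed as twice the average pinball (quantile) loss across the submitted levels; this is equivalent to the interval-score formulation with the standard weights w_k = α_k/2 when the levels are symmetric around 0.5 and include the median. A minimal sketch with a hypothetical three-quantile forecast:

```python
def pinball_loss(tau, q, y):
    # Pinball (quantile) loss for predictive quantile q at level tau.
    return tau * (y - q) if y >= q else (1 - tau) * (q - y)

def wis(taus, quantiles, y):
    """Weighted interval score computed as twice the mean pinball loss
    across the submitted quantile levels."""
    losses = [pinball_loss(t, q, y) for t, q in zip(taus, quantiles)]
    return 2 * sum(losses) / len(losses)

# Hypothetical forecast: 0.25, 0.5, 0.75 quantiles, observation above them all.
print(wis([0.25, 0.5, 0.75], [90.0, 100.0, 110.0], 120.0))
```

Consistent with the qualitative description above, this score grows as the observation moves farther into the tails of the submitted quantiles and is smallest when the observation falls at the predictive median.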

See Bracher et al. (2021) for additional discussion of these scores and the continuous ranked probability score.

Methods for approximating a predictive density based on predictive quantiles
Here we describe the methods used in the previous section for obtaining an approximate predictive density f̂_Y based on a set of predictive quantiles q_1, . . . , q_K at probability levels τ_1, . . . , τ_K. The method works in two phases: (1) we estimate the density on the interior of the predictive quantiles as the derivative of a monotonic spline that estimates the CDF; and (2) we approximate the tails with a distribution in a specified location-scale family. For the interior points, we fit a monotonic spline that interpolates the set of "observations" {(q_1, τ_1), . . . , (q_K, τ_K)}.

This spline is an estimate of the predictive CDF, and its derivative estimates the predictive PDF. Because the spline passes through the points (q_1, τ_1) and (q_K, τ_K), the integral of its derivative over the interval [q_1, q_K] is equal to τ_K − τ_1. We estimate the density for the left and right tails separately, assuming that they come from a specified location-scale family. Setting notation, suppose that Y = a + b · Z where the random variable Z has a specified distribution. Recall that at the probability level τ, a quantile of Y can be calculated in terms of the corresponding quantile of Z via q_Y(τ) = a + b · q_Z(τ). Using the quantiles at two probability levels τ_i and τ_j, we can calculate the value of b using

b = (q_Y(τ_j) − q_Y(τ_i)) / (q_Z(τ_j) − q_Z(τ_i)).

Similarly, we can calculate the value of a as

a = q_Y(τ_i) − b · q_Z(τ_i).

In the above expressions, we use the two smallest quantiles when estimating the lower tail and the two largest quantiles when estimating the upper tail. With these choices, by construction the lower tail integrates to τ_1 on the interval (−∞, q_1] and the upper tail integrates to 1 − τ_K on the interval [q_K, ∞).

Supplemental Figure 6: CDFs, PDFs, log scores, and WIS for a log normal(4, 0.5) forecast and two approximations to it that match 23 specified quantiles but have different behavior in the tails. The predictive quantiles are shown with black points in the top panel.
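The tail-fitting calculation of (a, b) from two quantiles can be sketched directly; `z_quantile` below is the quantile function of the reference variable Z (here the standard normal, via the Python standard library):

```python
from statistics import NormalDist

def fit_location_scale(tau_i, q_i, tau_j, q_j, z_quantile):
    """Solve q_Y(tau) = a + b * q_Z(tau) for (a, b) using the quantiles of Y
    at two probability levels. `z_quantile` is the quantile function of Z."""
    b = (q_j - q_i) / (z_quantile(tau_j) - z_quantile(tau_i))
    a = q_i - b * z_quantile(tau_i)
    return a, b

# Sanity check: quantiles generated from Y = 100 + 10 * Z with Z standard
# normal should recover a = 100, b = 10.
z = NormalDist().inv_cdf
a, b = fit_location_scale(0.01, 100 + 10 * z(0.01), 0.025, 100 + 10 * z(0.025), z)
print(a, b)  # approximately 100.0 and 10.0
```

Swapping in a heavier-tailed reference distribution for Z (e.g., a Cauchy quantile function) changes only `z_quantile`, which is how the two tail assumptions compared in Supplemental Figure 6 can be generated.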

Quantile ensembles as horizontal combinations of predictive CDFs
Supplemental Figure 7 illustrates the ensemble methods considered in this manuscript as horizontal combinations of the cumulative distribution functions of predictive distributions from component forecasters, computing a weighted or unweighted mean or median at each quantile probability level along the vertical axis.
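As a minimal sketch of this level-by-level combination (the helper name combine_quantiles is hypothetical, and only the unweighted mean and median are shown):

```python
import statistics

def combine_quantiles(component_quantiles, method="median"):
    """Combine component predictive quantiles one probability level at a time.

    component_quantiles[m][k] is the quantile of component m at level k."""
    n_levels = len(component_quantiles[0])
    combined = []
    for k in range(n_levels):
        vals = [q[k] for q in component_quantiles]  # one value per component
        if method == "median":
            combined.append(statistics.median(vals))
        else:  # equally weighted mean
            combined.append(statistics.fmean(vals))
    return combined

# Three components, each providing quantiles at three probability levels:
components = [[5, 10, 15], [7, 12, 20], [6, 11, 16]]
combine_quantiles(components)          # equally weighted median ensemble
combine_quantiles(components, "mean")  # equally weighted mean ensemble
```

Because each output level is a (weighted) mean or median of the component quantiles at that level, the combined quantiles inherit monotonicity from the components.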

Expanded results from primary analysis
This section includes figures giving additional views of the primary results from Figures 3, 4, 5, and 7 in the article.

Full distributions of Weighted Interval Score differences
For legibility, Figures 3 and 7 in the main text displayed the central tendency (central quantiles and means) of differences in weighted interval scores between the different methods, but suppressed outliers corresponding to individual combinations of forecast dates and horizons with large differences between the equally weighted median ensemble and another ensemble method. Supplemental Figures 8 and 9 display the full distributions of these differences.

Supplemental Figure 9: Performance measures for ensemble forecasts of weekly cases and deaths in Europe. The vertical axis is the difference in mean WIS for the given ensemble method and the equally weighted median ensemble. Boxplots summarize the distribution of these differences in means, averaging across all locations for each combination of forecast date and horizon. A cross is displayed at the difference in overall mean scores for the specified combination method and the equally weighted median, averaging across all locations, forecast dates, and horizons. A negative value indicates that the given method outperformed the equally weighted median.

Scores by forecast creation date
Supplemental Figures 10 and 11 show relative WIS for the ensemble methods over time for forecasts in the US and in Europe respectively. The included ensemble methods are 1) an equally weighted mean ensemble, 2) an equally weighted median ensemble, 3) a weighted mean ensemble, and 4) a weighted median ensemble.

Both of the weighted ensembles combine the ten component forecasters with the best individual performance as measured by the relative WIS, and are trained on a sliding 12-week window. The component forecasters included in the trained ensembles are updated each week based on performance during the training window.

95% prediction interval widths
Supplemental Figures 12 and 13 illustrate that the widths of 95% prediction intervals for the ensemble forecasters generally fall in the middle of the widths of the intervals from the component forecasters. This is expected because the ensemble forecasts are a (weighted) mean or median of predictive quantiles of the component forecasters. In particular, the equally weighted median forecast typically has a 95% interval width that ranks very close to the middle of the other forecasters' interval widths. We note that interval widths are not in themselves a measure of forecast skill, and it is important to also consider whether the interval contains the forecasted quantity at the nominal coverage rate (i.e., whether the forecasts are well calibrated). The figures in the main text include displays of calibration as well as proper scores that measure both calibration and sharpness together.

Supplemental Figure 12: Standardized ranks of 95% prediction intervals for weekly cases from component forecasters, the equally weighted median ensemble, and the relative WIS weighted median ensemble. For each combination of location, forecast date, and forecast horizon, we rank the widths of 95% prediction intervals from all forecasters that submitted the relevant forecast on a scale from 0 to 1, where the forecaster with the narrowest interval has rank 0 and the forecaster with the widest interval has rank 1. Density plots summarize the distribution of these ranks for each forecaster; forecasters are sorted by their median rank.

Supplemental Figure 13: Standardized ranks of 95% prediction intervals for weekly deaths from component forecasters, the equally weighted median ensemble, and the relative WIS weighted median ensemble. For each combination of location, forecast date, and forecast horizon, we rank the widths of 95% prediction intervals from all forecasters that submitted the relevant forecast on a scale from 0 to 1, where the forecaster with the narrowest interval has rank 0 and the forecaster with the widest interval has rank 1. Density plots summarize the distribution of these ranks for each forecaster; forecasters are sorted by their median rank.

Impact of reporting anomalies
We conducted a supplemental analysis in which we removed forecasts that were affected by reporting anomalies before calculating summaries of forecast performance. We catalogued two types of reporting anomalies, as illustrated in Supplemental Figure 14:

1. Outliers were identified manually by examining plots of the data. Negative weekly counts and other observations that did not appear to match local trends were recorded as outliers.

2. Revisions were identified automatically. The value for a particular week was identified as a large revision if the difference between the originally reported value and the final reported value was at least 20, and that difference was at least 40% of the initial reported value or the final reported value.
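One reading of the automatic revision rule can be expressed compactly (the function name is ours, and the handling of negative counts via absolute values is an assumption):

```python
def is_large_revision(initial, final):
    """Flag a week as a large revision: the change must be at least 20
    and at least 40% of either the initial or the final reported value."""
    diff = abs(final - initial)
    return diff >= 20 and (diff >= 0.4 * abs(initial) or diff >= 0.4 * abs(final))

is_large_revision(100, 150)    # True: change of 50 is >= 20 and >= 40% of 100
is_large_revision(1000, 1015)  # False: change of 15 is below the absolute floor
is_large_revision(1000, 1100)  # False: change of 100 is under 40% of either value
```

The absolute floor of 20 prevents small counts from being flagged by tiny revisions that are nonetheless large in percentage terms.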
For the purpose of this analysis, forecasts with a target end date coinciding with an observation that was identified as an outlier were excluded. This is because we would prefer forecasting methods to focus on capturing the epidemiological process rather than aspects of the reporting process that lead to outliers.

Forecasts with a forecast date on the date of a value that was later revised, or within the following three weeks (unless the revision had already been made by the time of the forecast), were also excluded. These forecasts were excluded because the input data used for the forecast were not a reliable indicator of the state of the epidemic at the time of the forecast. One might reasonably expect forecasters to account for the possibility of such data revisions, but this analysis represents a conservative examination of whether these revisions affected our conclusions.

Supplemental Figures 15 and 16 summarize forecast skill after removing scores affected by reporting anomalies. Although the forecasts affected by reporting anomalies generally had higher WIS values than other forecasts, they did not have unusually large differences in WIS between forecasting methods. The results about the relative performance of ensemble methods hold whether or not the forecasts affected by data anomalies are removed.

Supplemental Figure 15: Summaries of forecast performance after removing forecasts affected by reporting anomalies. Panel (a) shows the 25th percentile, median, and 75th percentile of differences in mean WIS between specified ensemble methods and the equally weighted median ensemble, where the means average across locations for each combination of forecast date and forecast horizon. Crosses show the difference in overall mean WIS averaging across all locations, forecast dates, and forecast horizons. Panel (b) shows the calibration of predictive quantiles, with the difference between the empirical coverage rate and the nominal coverage rate on the vertical axis. A well-calibrated model will have a difference between the empirical coverage rate and the nominal quantile level that is approximately zero. A method that generates conservative two-sided intervals would have a difference that is negative for nominal quantile levels less than 0.5 and positive for nominal quantile levels greater than 0.5.
Supplemental Figure 16: Summaries of forecast performance after removing forecasts affected by reporting anomalies. Boxplots summarize the distribution of differences in mean WIS between specified ensemble methods and the equally weighted median ensemble, where the means average across locations for each combination of forecast date and forecast horizon. Crosses show the overall difference in mean WIS averaging across all locations, forecast dates, and forecast horizons.

Model variations considered during development phase
Here we present some results for model variations considered during the model development phase. All figures here represent only scores for forecast dates before May 3, 2021.

Additional combination methods and training window sizes
The manuscript gives results for equally weighted mean and median ensembles, and relative WIS weighted mean and median ensembles, using a fixed training set window of the 12 weeks prior to the forecast date. Here we show results on the development set for forecasts using a range of training set window sizes: 4 weeks, 8 weeks, 12 weeks, and all available forecast history. We also consider two additional combination methods along with those presented in the main text. The first of these new combination methods is a weighted mean. As a reminder, q^m_{l,s,t,k} denotes the predictive quantile at probability level τ_k from component model m at location l, forecast date s, and target end date t. With this notation, the weighted mean ensemble forecast quantiles are calculated as

q^{ens}_{l,s,t,k} = Σ_{m=1}^{M} w^m_s · q^m_{l,s,t,k}.

The model weights w^m_s are constrained to be non-negative and sum to one; in case of missing forecasts, the weights for any missing models are set to zero and the remaining weights are rescaled to sum to one. As indicated by the subscript s, the weights w^m_s are updated each week by optimizing the ensemble WIS over the training window of the specified number of weeks before the forecast date s.
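The missing-forecast rescaling can be sketched as follows (the helper name effective_weights is ours):

```python
def effective_weights(weights, available):
    """Set the weights of missing components to zero and rescale the rest
    so that the effective weights sum to one."""
    masked = [w if is_avail else 0.0 for w, is_avail in zip(weights, available)]
    total = sum(masked)
    return [w / total for w in masked]

# Component 2 did not submit a forecast for this location and date:
w = effective_weights([0.5, 0.3, 0.2], [True, False, True])

# The ensemble quantile at one probability level is then the weighted mean
# of the available component quantiles:
q_ens = sum(wi * qi for wi, qi in zip(w, [100.0, 120.0, 110.0]))
```

The rescaling preserves the relative ordering of the weights among the available components.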
The second of the new combination methods is a weighted median ensemble that uses the weights estimated for the weighted mean ensemble. This offers comparable flexibility to the weighted mean ensemble, but has the disadvantage that the weights are not obtained by optimizing the forecast skill of the method that is actually used for forecast combination. Direct estimation of the weights for a weighted median by optimizing ensemble WIS is challenging because the objective function is not differentiable in the weights; the optimization problem is a mixed integer linear program, which is computationally demanding. In other experiments, we also considered a method for computing an approximate weighted median by smoothing the weighted distribution of predictive quantiles from component forecasters. However, this method's performance was not substantively different from that of the other methods considered here, and we omit those results for brevity.
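For reference, a weighted median of component quantiles at one probability level can be computed as below. This is one common convention (smallest value at which the cumulative weight reaches half the total); the tie-breaking rule used in the actual analysis may differ.

```python
def weighted_median(values, weights):
    """Weighted median: the smallest value at which the cumulative weight
    reaches half of the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    cum = 0.0
    for v, w in pairs:
        cum += w
        if cum >= half:
            return v

# Component quantiles at one level, with estimated component weights:
weighted_median([100, 120, 110], [0.4, 0.25, 0.35])  # returns 110
```

With equal weights this reduces to the ordinary median, so the equally weighted median ensemble is the special case w^m_s = 1/M.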
Supplemental Figures 17 and 18 display the results of this expanded comparison, including all combinations of the training set window size, the six combination methods, and three variations on the number of top-performing component forecasters included in the ensemble. Across all combinations of training set window size, number of component forecasters included, and target variable (cases or deaths), the relative WIS weighted median ensemble had the most stable performance. For deaths, it had the best mean WIS for all training set window sizes, though it had similar performance to the equally weighted median of the top 5 models. For cases, it was more often matched by other methods, though the performance of the other methods was less consistent across different settings for other tuning parameters. Across most settings, using a relative WIS weighted mean or median offered an improvement in mean WIS over taking an equally weighted mean or median of top performing models.
There is perhaps a slight indication that an intermediate training set size of 8 to 12 weeks is better than training on 4 weeks or the full available history, but this signal is not strong. Some of the combination methods were better when fewer top models were included, but the relative WIS weighted median method was not sensitive to this setting.
We selected the relative WIS weighted mean and median ensembles for the prospective evaluation because they were consistently better than both more flexibly weighted methods and equally weighted combinations of top-performing components when forecasting cases, and were comparable to the best of the other approaches when forecasting deaths. We selected an intermediate training set window size of 12 weeks because both the

Separate weights at different forecast horizons
We considered a variation on the relative WIS weighted median ensemble that estimated separate weights for each of the one through four week ahead forecast horizons. In this approach, the relative WIS of component forecasters was calculated separately at each forecast horizon, and the estimation of the weighting parameter θ was performed separately to optimize the forecast skill of the ensemble at each horizon. Supplemental Figure 19 shows the results of this per-horizon weighting scheme for the relative WIS weighted median ensemble combining the top 10 component forecasters. We found that using separate model weights at each forecast horizon led to small improvements in mean WIS at short forecast horizons of one to two weeks ahead, but slightly worse mean WIS at longer forecast horizons of three to four weeks ahead.
We see two possible contributing factors to these results. First, forecasts at long horizons have scores that are larger in magnitude than forecasts at short horizons, and so tend to dominate the overall score when averaging across horizons. This may result in weights that favor performance at longer horizons when weights are shared across horizons, thereby harming performance of forecasts at short horizons. Second, sharing weights across horizons may be particularly helpful for longer term forecasts because of the gain in the training set sample size that comes with weight sharing. For example, with a training set size of four weeks and weights estimated separately by horizon, only one week's worth of forecasts are actually included in the training set for weight estimation at a horizon of four weeks, because the target data for four-week ahead forecasts made within the past three weeks have not yet been observed at the time of weight estimation. Sharing weights across horizons means that more information about model performance is available for weight estimation at these longer horizons. In support of this explanation, note that the magnitude of relative losses in forecast skill from estimating per-horizon weights at longer forecast horizons decreases as the training set size increases.
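The sample-size arithmetic above can be made explicit. With a rolling window of W weeks, the h-week-ahead targets for forecasts made in the most recent h − 1 weeks are not yet observed, leaving W − (h − 1) usable weeks (the helper name is ours, for illustration):

```python
def observed_training_weeks(window, horizon):
    """Weeks in a rolling training window whose horizon-week-ahead targets
    have already been observed at weight-estimation time."""
    return max(window - (horizon - 1), 0)

# With a 4-week window, horizon 4 leaves only one usable week of forecasts:
[observed_training_weeks(4, h) for h in range(1, 5)]   # [4, 3, 2, 1]

# A 12-week window leaves proportionally much more data at every horizon:
[observed_training_weeks(12, h) for h in range(1, 5)]  # [12, 11, 10, 9]
```

This is consistent with the observation that the penalty for per-horizon weighting at long horizons shrinks as the training window grows.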
These results suggest the possibility of a blended strategy, where per-horizon estimation is used to obtain the weights for short horizons while the weights for longer horizons are estimated by sharing information across all horizons. In light of the small magnitude of the gains from per-horizon weighting at short horizons, we decided to pursue a unified approach of using shared weights across all horizons to reduce methodological complexity.

Supplemental Figure 19: Boxplots summarizing forecast skill for forecasts of weekly cases, varying whether model weights are shared across all forecast horizons or are estimated separately for each forecast horizon. The vertical axis is the difference in mean skill between the given ensemble specification with component weights shared across all horizons and the same specification with separate component weights for each forecast horizon. The boxplots summarize the distribution of these differences for each combination of forecast date and horizon, averaging across all locations. A cross is displayed at the difference in overall mean scores. A negative value indicates that the method with separate component weights for each forecast horizon outperformed the corresponding specification with weights shared across forecast horizons. For this analysis, only results for relative WIS weighted ensembles combining the ten best individual component forecasters are presented.

Separate weights at different quantile levels
We considered strategies for estimation of separate model weights at each quantile level rather than sharing model weights across all quantile levels. We considered this possibility for both the relative WIS weighted median ensemble and the convex mean ensemble that directly optimizes the component weights rather than setting them to be a sigmoid function of the relative WIS. In both variations, weights were estimated by optimizing the contribution to the WIS from each quantile level separately, i.e., the pinball loss. Similarly, for the relative WIS weighted median, the relative WIS for component models that is used as an input for calculating model weights was obtained separately based on the contribution to WIS at each quantile level.
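For reference, the per-quantile-level contribution to the WIS is the pinball (quantile) loss, sketched below up to a scaling constant (the function name is ours):

```python
def pinball_loss(q, y, tau):
    """Pinball loss of predictive quantile q at probability level tau,
    given observation y; proportional to that level's contribution to WIS."""
    return (1 - tau) * (q - y) if y < q else tau * (y - q)

# At an upper quantile (tau = 0.9), overprediction is penalized lightly
# while underprediction is penalized heavily:
pinball_loss(110, 100, 0.9)  # observation below the 0.9 quantile
pinball_loss(110, 150, 0.9)  # observation above the 0.9 quantile
```

Optimizing this loss separately at each level yields the per-quantile weights; summing it over all levels recovers the (scaled) WIS used for shared weights.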
Supplemental Figures 20 and 21 summarize the WIS of these methods on the model development set, comparing these approaches to the corresponding methods with a single weight per model that is shared across all quantile levels. Supplemental Figure 22 displays the probabilistic calibration of these model variations in terms of one-sided coverage rates for predictive quantiles. In this experiment, all methods combine the top ten component forecasters; we consider varying training set window sizes. Allowing for separate parameters per quantile level led to worse mean WIS. Additionally, for both cases and deaths, the per-quantile weighting schemes led to generally narrower predictive distributions with worse calibration in the tails. This effect was much stronger for forecasts of cases than for forecasts of deaths, and it was stronger for the relative WIS weighted median ensemble than for the convex weighted mean ensemble.

Supplemental Figure 20: Boxplots summarizing forecast skill for forecasts of weekly cases from a relative WIS weighted median ensemble, varying whether model weights are shared across all quantile levels ("Per Model") or are estimated separately for each quantile level ("Per Quantile"). The vertical axis is the difference in mean skill between the given ensemble specification with component weights shared across all quantile levels and the same specification with separate component weights for each quantile level. The boxplots summarize the distribution of these differences for each combination of forecast date and horizon, averaging across all locations. A cross is displayed at the difference in overall mean scores. A negative value indicates that the method with separate component weights for each quantile level outperformed the corresponding specification with weights shared across quantile levels.
Supplemental Figure 21: Boxplots summarizing forecast skill for forecasts of weekly cases from a convex weighted mean ensemble with directly estimated weights, varying whether model weights are shared across all quantile levels ("Per Model") or are estimated separately for each quantile level ("Per Quantile"). The vertical axis is the difference in mean skill between the given ensemble specification with component weights shared across all quantile levels and the same specification with separate component weights for each quantile level. The boxplots summarize the distribution of these differences for each combination of forecast date and horizon, averaging across all locations. A cross is displayed at the difference in overall mean scores. A negative value indicates that the method with separate component weights for each quantile level outperformed the corresponding specification with weights shared across quantile levels.

Supplemental Figure 22: Quantile coverage rates for the convex weighted mean ensemble and the relative WIS weighted median ensemble, varying whether weights are estimated separately per quantile level ("Per Quantile") or shared across all quantile levels ("Per Model"). All methods combine the top 10 component forecasters, with the training set window size specified in facet rows. The vertical axis is the difference between the empirical coverage rate and the nominal coverage rate. A well-calibrated method would have a difference of 0 between the empirical and nominal coverage rates, and a method that generates conservative (wide) interval forecasts would have a negative difference for quantile levels less than 0.5 and a positive difference for quantile levels greater than 0.5.

Forecast performance near local peaks
To evaluate performance near turning points, we identified local peaks as weeks that had the largest reported weekly counts within a centered rolling window of 11 weeks (i.e., weeks with counts at least as large as those in the preceding five and following five weeks).
By visual inspection, we made manual adjustments to the weeks identified using this rule to remove outliers and weeks that did not correspond to a visually distinct peak. The resulting weeks identified as local peaks are shown in Supplemental Figures 23 and 24. Within our evaluation time frame, there were 159 local peaks for weekly cases and 146 local peaks for weekly deaths across all locations.
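The automatic step of the peak-identification rule can be sketched as follows (the function name is ours; the treatment of ties and of weeks near the start and end of the series is an assumption, and as noted above the automatic output was then adjusted manually):

```python
def local_peaks(counts, half_window=5):
    """Indices of weeks whose count is the maximum within a centered window
    spanning the preceding and following half_window weeks."""
    peaks = []
    for i in range(len(counts)):
        lo = max(0, i - half_window)                  # window truncated at edges
        hi = min(len(counts), i + half_window + 1)
        if counts[i] == max(counts[lo:hi]):
            peaks.append(i)
    return peaks

# A single distinct peak at index 2; the smaller bump at index 7 falls
# inside the 11-week window of the larger peak and is not flagged:
local_peaks([1, 3, 9, 4, 2, 2, 5, 8, 6, 1])  # returns [2]
```

Note that ties within a window would flag multiple weeks; the manual review described above resolves such cases.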
Supplemental Figure 25 summarizes the central tendency of the errors for predictive medians across all forecasts at the state level in the U.S. and for just those forecasts issued in the week before a local peak. The errors are calculated as the value of the predictive median minus the observed count of cases or deaths in the target week, so that errors close to zero are preferred. We show results for three ensemble specifications: the equally weighted median of all component forecasters, the equally weighted median of the top 10 forecasters, and the relative WIS weighted median of the top 10 forecasters. Note that the equally weighted median of all components is untrained, the relative WIS weighted median of the top 10 forecasters is trained, and the equally weighted median of the top 10 forecasters represents an intermediate strategy: it is also trained, but it is not as strongly adaptive to component performance as the weighted median. Both trained ensembles use a training set window size of 12 weeks. We summarize our observations about these errors as follows:

1. Across all forecasts of cases, the median error is similar for all three strategies, but the magnitude of the average error from the relative WIS weighted median is slightly larger than the magnitude of the average error from the equally weighted median of all components.
2. Across all forecasts of deaths, the median error is similar for all three strategies, but the magnitude of the average error from the relative WIS weighted median is slightly smaller than the magnitude of the average error from the equally weighted median of all components.
3. For forecasts of cases near peaks, the trained methods were better than the untrained approach in the peak week, but had much larger errors at longer horizons. This indicates that forecasts from the trained methods "overshot" and missed the turning points by a larger margin than the untrained method.
4. For forecasts of deaths near peaks, the trained methods had comparable performance to the untrained method. In contrast to forecasts of cases, training did not exacerbate the tendency to overshoot near local peaks when forecasting deaths.
Supplemental Figure 26 shows illustrative examples of forecasts made the week before a local peak from the equally weighted median of all components and the relative WIS weighted median of the top 10 components. We can see a systematic tendency for the predictions from the trained method near local peaks to predict a continuation of rising trends for cases, whereas the forecasts of deaths more often capture the coming downturn.

Supplemental Figure 25: Errors of the predictive median across all forecasts, and for forecasts issued the week before a local peak in weekly cases or deaths. For forecasts issued before a peak, the one week ahead forecasts are for cases in the week of the local peak and forecasts at longer horizons are for cases in weeks after the local peak. The vertical axis is the difference between the predictive median and the observed value. A positive value indicates that the predictive median was larger than the eventually observed value, and a negative value indicates that the predictive median was less than the eventually observed value; a difference of zero is best. Boxes summarize the 25th percentile, median, and 75th percentile of these errors, and crosses show the mean error.

Autocorrelation of component weights
The component weights changed substantially from week to week in the directly estimated weighted mean ensemble; this can be seen in Figures 4 and 5 in the main text. In contrast, the component forecaster weights were more strongly autocorrelated in the relative WIS weighted mean ensemble. In part, this is due to the use of a 12 week rolling window for estimating component weights; much of the training data for weight estimation is shared in consecutive weeks.

Post hoc evaluation of component weight regularization
Section 3.4 of the main text describes a post hoc evaluation of regularizing the component forecaster weights in trained ensembles by imposing a limit on the maximum weight that can be assigned to any component.
Supplemental Figure 28 illustrates that for most forecast dates, imposing this limit on the maximum component weight has negligible effects on the WIS of the ensemble forecasts, though there are a few forecast dates where the effect is larger.

Supplemental Figure 28: Differences in mean WIS by forecast date, where averages for each forecast date are calculated across all locations and forecast horizons. Results are shown for six variations on relative WIS weighted median ensembles that combine the top 10 component forecasters based on a rolling 12 week training set, using different limits for the maximum weight that can be assigned to any component forecaster. A maximum weight limit of 1.0 corresponds to the unregularized approach considered in the prospective analysis in the main text, while a maximum weight limit of 0.1 corresponds to an equal weighting of the ten selected forecasters.
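One plausible way to impose such a limit is to cap each weight and redistribute the excess among the remaining components; this is a sketch of that idea under our own assumptions (proportional redistribution, iterated until stable), not necessarily the scheme used in the analysis:

```python
def cap_weights(weights, max_weight):
    """Cap each component weight at max_weight and redistribute the excess
    among the remaining components until no weight exceeds the cap.
    Assumes the weights sum to 1 and max_weight * len(weights) >= 1."""
    assert max_weight * len(weights) >= 1.0
    w = list(weights)
    while True:
        over = [i for i, wi in enumerate(w) if wi > max_weight + 1e-12]
        if not over:
            return w
        excess = sum(w[i] - max_weight for i in over)
        for i in over:
            w[i] = max_weight
        under = [i for i, wi in enumerate(w) if wi < max_weight - 1e-12]
        total_under = sum(w[i] for i in under)
        for i in under:
            # Redistribute proportionally; split evenly if all are zero.
            w[i] += excess * (w[i] / total_under if total_under > 0 else 1 / len(under))

cap_weights([0.6, 0.3, 0.1], 0.5)  # approximately [0.5, 0.375, 0.125]
```

At a cap of 1/M for M retained components, every feasible weight vector collapses to equal weighting, matching the 0.1 limit described for the ten-forecaster ensembles above.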

Component forecast availability and effective weights
Supplemental Figures 29 through 32 show histograms of the number of locations forecasted by the component models contributing to the U.S. COVID-19 Forecast Hub and the European COVID-19 Forecast Hub.

Patterns of submission are starkly different for the U.S. and Europe. In the U.S., nearly all models submit forecasts for at least the 50 states, and many additionally submit forecasts for the District of Columbia and U.S. territories. In Europe, roughly half of forecasters provide forecasts for all or most European countries, while the other half provide forecasts for only a few countries.

Supplemental Figures 33 through 36 show the effective weights used in each location after accounting for forecast missingness by rescaling the weights assigned to available models so that they sum to one. In the U.S., the effective weights closely match the nominal estimated weights in nearly all states, differing only slightly in the territories. In Europe, missingness is more prevalent, and it is common for only a few of the selected top 10 component forecasters to provide forecasts for many countries.

Supplemental Figure 33: Component weights for forecasts of weekly cases in the U.S., faceted by forecast date. Only the weights for the first week in each month are shown due to space constraints. The estimated weights that would be used if all models were available for a particular location are shown at left within each facet. The weights actually used for each location are obtained by setting the weight for components that are missing forecasts for that location to 0 and rescaling the others proportionally so that they sum to 1.

Supplemental Figure 35: Component weights for forecasts of weekly cases in Europe, faceted by forecast date. Only the weights for the first week in each month are shown due to space constraints. The estimated weights that would be used if all models were available for a particular location are shown at left within each facet. The weights actually used for each location are obtained by setting the weight for components that are missing forecasts for that location to 0 and rescaling the others proportionally so that they sum to 1.
Supplemental Figure 36: Component weights for forecasts of weekly deaths in Europe, faceted by forecast date. Only the weights for the first week in each month are shown due to space constraints. The estimated weights that would be used if all models were available for a particular location are shown at left within each facet. The weights actually used for each location are obtained by setting the weight for components that are missing forecasts for that location to 0 and rescaling the others proportionally so that they sum to 1.