Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the United States

Significance This paper compares the probabilistic accuracy of short-term forecasts of reported deaths due to COVID-19 during the first year and a half of the pandemic in the United States. Results show high variation in accuracy between and within stand-alone models and more consistent accuracy from an ensemble model that combined forecasts from all eligible models. This demonstrates that an ensemble model provided a reliable and comparatively accurate means of forecasting deaths during the COVID-19 pandemic that exceeded the performance of all of the models that contributed to it. This work strengthens the evidence base for synthesizing multiple models to support public-health action.

(https://covid19forecasthub.org/) (4). Launched in early April 2020, the Forecast Hub facilitated the collection, archiving, evaluation, and synthesis of forecasts. Teams were explicitly asked to submit "unconditional" forecasts of the future, in other words, predictions that integrate across all possible changes in future dynamics. In practice, most individual models made predictions that were conditional on explicit or implicit assumptions of how policies, behaviors, and pathogens would evolve in the coming weeks. From these forecasts, a multimodel ensemble was developed, published weekly in real time, and used by the CDC in official public communications about the pandemic (https://www.cdc.gov/coronavirus/2019-ncov/covid-data/mathematical-modeling.html) (5). Forecasts were generated for the outcomes of reported cases, hospitalizations, and deaths due to COVID-19. This paper focuses on evaluating forecasts of reported deaths.
Ensemble models incorporate the information and uncertainties from multiple forecasts, each with their own perspectives, strengths, and limitations, to create accurate predictions with well-calibrated uncertainty (6–11). Synthesizing multiple models removes the risk of overreliance on any single approach for accuracy or stability. It is challenging for individual models to make calibrated predictions of the future when the behavior of the system being studied is nonstationary due to continually changing policies and behaviors. Ensemble approaches have previously demonstrated superior performance compared with single models in forecasting influenza (12–14), Ebola (15), and dengue fever outbreaks (16). Preliminary research suggested that COVID-19 ensemble forecasts were also more accurate and precise than individual models in the early phases of the pandemic (17, 18).
Predicting the trajectory of a novel pathogen outbreak such as COVID-19 is subject to many challenges. These include the role of human behavior and decision-making in outbreak trajectories and the fact that epidemic forecasts may play a role in a "feedback loop" when and if the forecasts themselves have the ability to impact future societal or individual decision-making (19). There is also a host of data irregularities, especially in the early stages of the pandemic.
It is important to systematically and rigorously evaluate forecasts designed to predict real-time changes to the outbreak in order to identify strengths and weaknesses of different approaches and to understand the extent to which the forecasts are a reliable input to public-health decisions. Knowledge of what leads to more or less accurate and well-calibrated forecasts can inform their development and their use within outbreak science and public policy. In this analysis, we sought to evaluate the accuracy of individual and ensemble probabilistic forecasts submitted to the Forecast Hub, focusing on forecasts of reported weekly incident deaths.

Results
Summary of Models. Forecasts evaluated in this analysis are based on submissions in a continuous 79-wk period starting in late April 2020 and ending in late October 2021 (Fig. 1 and Methods). Forecasts were evaluated at 55 locations including all 50 states, four jurisdictions and territories (Guam, US Virgin Islands, Puerto Rico, and the District of Columbia), and the US national level. The evaluation period captured the decline of the spring 2020 wave, a late summer 2020 increase in several locations, a large late-fall/early-winter surge in 2020/2021, and the rise and fall of the Delta variant in the summer and fall of 2021 (Fig. 1B).
The number of models that submitted forecasts of incident deaths to the Forecast Hub and were screened for inclusion in this analysis increased from four models at the beginning of the evaluation period to an average of 41.2 models per week during the first 10 months of 2021 (Fig. 1C and SI Appendix, Fig. S1). A total of 28 models met inclusion criteria, yielding 1,791 submission files with 556,050 specific predictions for unique combinations of forecast dates, targets (horizons forecasted), and locations.
The evaluated forecasts used different data sources and made varying assumptions about future transmission patterns (SI Appendix, Table S1). All evaluated models, other than CEID-Walk, the COVIDhub-baseline, the COVIDhub-ensemble, and PSI-Draft, used case data as inputs to their forecast models. A total of 10 models included data on COVID-19 hospitalizations, 10 models incorporated demographic data, and 9 models used mobility data. Of the 28 evaluated models, 7 made explicit assumptions that social distancing and other behavioral patterns would change over the prediction period. Two naïve models were included. The COVIDhub-baseline is a neutral model built with median predicted incidence equal to the number of reported deaths in the most recent week, with uncertainty around the median based on weekly differences in previous observations (see Methods). CEID-Walk is a random walk model with simple outlier handling (SI Appendix, Table S1).
Overall Model Accuracy. To evaluate probabilistic accuracy, the primary metric used was the weighted interval score (WIS), a nonnegative metric, which measures how consistent a collection of prediction intervals is with an observed value (20). For WIS, a lower value represents smaller error (see Methods and SI Appendix, SI Text).
Led by the ensemble model, a majority of the evaluated models achieved better accuracy than the baseline model in forecasting incident deaths (Table 1). The COVIDhub-ensemble achieved a relative WIS of 0.61, which can be interpreted as achieving, on average, 39% less probabilistic error than the baseline forecast in the evaluation period, adjusting for the difficulty of the specific predictions made. An additional seven models achieved a relative WIS of less than or equal to 0.75. In total, 18 models had a relative WIS of less than 1, indicating lower probabilistic forecast error than the baseline model, and 10 models (including the baseline) had a relative WIS of 1 or greater (Table 1). Patterns in relative point forecast error were similar, with 18 models having equal or lower mean absolute error (MAE) than the baseline (Table 1). Values of relative WIS and rankings of models were robust to changing thresholds for submission inclusion criteria and to the inclusion or exclusion of individual outlying or revised observations (SI Appendix, Tables S3 and S4). When stratified by phase of the pandemic, different models showed the highest overall accuracy in different phases (SI Appendix, Fig. S5).
The degree to which individual models provided calibrated predictions varied (Table 1). We measured the probabilistic calibration of model forecasts using the empirical coverage rates of prediction intervals (PIs). Across 1- through 4-wk-ahead horizons, 79 wk, and 50 states, only the ensemble model achieved near-nominal coverage rates for both the 50% and 95% PIs. Eight models achieved coverage rates within 5% of the desired coverage level for the 50% PI, and only the COVIDhub-ensemble and UMass-MechBayes achieved coverage rates within 5% for the 95% PI. Typically, observed coverage rates were lower than the nominal rate (Table 1 and SI Appendix, Fig. S2). Three models had very low coverage rates (less than 50% for the 95% PI or less than 15% for the 50% PI). In general, models were penalized more for underpredicting the eventually observed values than for overpredicting (SI Appendix, Fig. S7).
Among the top-performing models, there was variation in data sources used, indicating that the inclusion of additional data sources was not a sufficient condition for high accuracy. Of the seven top individual models with a relative WIS less than or equal to 0.75 (Table 1), four used data beyond the epidemiological hospitalization, case, and death surveillance data from the Center for Systems Science and Engineering (CSSE) (SI Appendix, Table S1). A total of 10 of the 18 individual models that performed better than the baseline used data other than epidemiological surveillance data (e.g., demographics or mobility). The top performers included both models with mechanistic components and models that were largely phenomenological.
Model Accuracy Rankings Are Highly Variable. The COVIDhub-ensemble was the only model that ranked in the top half of all models (standardized rank > 0.5) for more than 75% of the observations it forecasted, although it made the single best forecast less frequently than any other model (Fig. 2). We ranked models based on relative WIS for each combination of 1- through 4-wk-ahead horizons, 79 wk, and 55 locations, contributing to 17,006 possible predicted observations for each model (Fig. 2). All models showed large variability in relative skill, with each model having observations for which it had the best (lowest) WIS and thereby a standardized rank of 1. Some models, such as JHUAPL-BUCKY and PSI-DRAFT, showed a bimodal distribution of standardized rank, with one mode in the top quartile of models and another in the bottom quartile. In these cases, the models frequently made overconfident predictions (SI Appendix, Fig. S6), resulting in either lower scores (indicating better performance) in instances in which their predictions were very close to the truth or higher scores (indicating worse performance) when their predictions were far from the truth. Similar patterns in ranking and relative model performance were seen when stratifying ranks by pandemic phase (SI Appendix, Fig. S3).
Observations on Accuracy in Specific Weeks. Forecasts from individual models showed variation in accuracy by forecast week and horizon (Fig. 3). The COVIDhub-ensemble model showed better average WIS than both the baseline model and the average error of all models across the entire evaluation period, except for 3 wk during which the baseline had lower 1-wk-ahead error than the ensemble. The COVIDhub-ensemble 1-wk-ahead forecast for EW02-2021 yielded its highest average WIS across all weeks (average WIS = 72.7), and 9 out of 26 other models that submitted for the same locations outperformed it that week. The 4-wk-ahead COVIDhub-ensemble forecasts were worse in EW49-2020 than in any other week during the evaluation period (average WIS = 111.7), and 15 out of the 26 models outperformed the ensemble that week at a forecast horizon of 4 wk.
There was high variation among the individual models in their forecast accuracy during periods of increasing deaths and near peaks (i.e., forecast dates in July through early August of 2020, November 2020 through March 2021, and August through October of 2021; Fig. 3). High errors in the baseline model tended to be associated with large outliers in observed data for a particular week (e.g., times when a state reported a large backfill of deaths in the most recent week) (SI Appendix, SI Text). In general, other models did not show unusual errors in their forecasts originating from these anomalous data, suggesting that their approaches, including possible adjustments to recent observations, were robust to anomalies in how data were reported.
Model Performance in Specific Pandemic Waves. In addition to evaluating performance in aggregate across the entire evaluation period and separate phases, we evaluated model performance during important moments during the pandemic. To assess the impact of rapidly changing trends on incident death forecasting accuracy, we ran an analysis restricted to specific locations and time periods that experienced high rates of change during four different waves of the pandemic (Fig. 4). Forecast performance varied substantially in these examples. Models in general systematically underpredicted the mortality curve as trends were rising and overpredicted as trends were falling. In some of the selected waves (e.g., North Dakota and Florida), the ensemble forecast showed inappropriate levels of uncertainty, with the 95% PIs covering the eventual observations less than 80% of the time. However, during other waves (e.g., Louisiana and Michigan), the ensemble forecast, while systematically biased first below and then above the eventually reported counts of deaths, did cover the observations at or above 95% of the time, although PIs were very wide. In general, lower-than-expected coverage rates and bias were more pronounced at a 4-wk horizon than at a 1-wk horizon. These four examples appeared to be representative of trends observed when looking across a larger number of waves (Dataset S2).

Individual Model Forecast Performance Varies Substantially by Location. Forecasts from individual models showed large variation in accuracy by location when aggregated across all weeks and targets (Fig. 5). Only the ensemble model showed superior accuracy when compared to the baseline in all locations. Ensemble forecasts of incident deaths showed the largest relative accuracy improvements in New York, New Jersey, Indiana (relative WIS = 0.4), California, Massachusetts, and at the national level (relative WIS = 0.5) and the lowest relative accuracy in Vermont, Guam, and the US Virgin Islands (relative WIS = 0.9). The COVIDhub-ensemble was the only model to outperform the baseline in every location when eligible in a specific pandemic phase (SI Appendix, Fig. S6).

Table 1. Summary accuracy metrics for all submitted forecasts from 28 models meeting inclusion criteria, aggregated across locations (50 states only), submission week, and 1- through 4-wk forecast horizons. The "No. forecasts" column refers to the number of individual location/target/week combinations. Empirical prediction interval (PI) coverage rates give the fraction of times the 50% or 95% PIs covered the eventually observed value; values within 5% of the nominal rates are highlighted in boldface text. The "Relative WIS" and "Relative MAE" columns show the relative mean WIS and relative MAE, which compare each model to the baseline model while adjusting for the difficulty of the state-level forecasts the given model made (see Methods). The baseline model is defined to have a relative score of 1; models with relative WIS or MAE values lower than 1 had better accuracy relative to the baseline model (best score in bold).
Forecast Performance Degrades with Increasing Horizons. Averaging across all states and weeks in the evaluation period, forecasts from all models showed lower accuracy and higher variance at a forecast horizon of 4 wk ahead compared to a horizon of 1 wk ahead; however, models generally showed improved performance relative to the naïve baseline model at larger horizons (SI Appendix, Fig. S4). A total of 11 models showed a lower average WIS (range: 24.9 to 34.3) than the baseline at a 1-wk horizon (average WIS = 35.8). At a 4-wk-ahead horizon, 19 models had a lower average WIS (range: 39.9 to 65.2) than the baseline (average WIS = 70.1). Across all models except one, the average WIS was higher than the median WIS, indicative of outlying forecasts impacting the mean value.
When averaging across locations and stratifying by phase of the pandemic, there was variation in the top-performing models (SI Appendix, Fig. S5). Four models had a lower mean WIS than the baseline for both 1- and 4-wk-ahead targets in at least three out of four phases (COVIDhub-ensemble, GT-DeepCOVID, Karlen-pypm, and UMass-MechBayes). Additionally, UMass-MechBayes and COVIDhub-ensemble were the only models to appear in the top three models in three of the four phases analyzed (SI Appendix, Fig. S5). In contrast to average WIS, PI coverage rates did not change substantially across the 1- to 4-wk horizons for most models (SI Appendix, Fig. S2).
While many teams submitted only short-term (1- to 4-wk-ahead) forecasts, a smaller number of teams consistently submitted longer-term predictions with up to a 20-wk horizon for all 50 states (SI Appendix, Fig. S8). Across all teams submitting forecasts for the 50 states, 4-wk-ahead forecasts had around 76% more error (based on relative WIS) than 1-wk-ahead forecasts, a relationship that was consistent across the entire evaluation period. Longer-term forecasts showed less accuracy on average than 1- and 4-wk-ahead forecasts. There were no clear overall differences in probabilistic model accuracy between 8- and 20-wk horizons, although in early summer 2020, late spring 2021, and fall of 2021, average WIS at 8-wk horizons was slightly lower than at longer horizons (SI Appendix, Fig. S8B). For the two teams that made 20-wk-ahead forecasts for all 50 states, the average WIS was 2.9 to 4 times higher at a 20-wk horizon than it was at a 1-wk horizon. The increased WIS at longer prediction horizons for these models was due to larger dispersion (i.e., wider predictive distributions representing increased uncertainty) as well as larger penalties for underprediction and overprediction (SI Appendix, Fig. S9). The biggest increases in WIS were from increased penalties for underprediction, suggesting that the model forecasts did not accurately capture the possibility of increases in incidence at long horizons. Coverage rates for 95% PIs tended to be stable or decline as the horizon increased (SI Appendix, Fig. S8C).

Discussion
Given the highly visible role that forecasting has played in the response to the COVID-19 pandemic, it is critical that model consumers, such as decision-makers, the general public, and modelers themselves, understand how reliable models are. This paper provides a comprehensive and comparative look at the probabilistic accuracy of different modeling approaches for forecasting COVID-19-related deaths during the COVID-19 pandemic in the United States from April 2020 through October 2021. This work illustrates the tension between the desire for long-term forecasts, which would be helpful for public-health practitioners, and the decline in forecast accuracy at longer horizons shown by all forecasting methods.
As seen in prior epidemic forecasting projects, ensemble forecasts simplify the information provided to model consumers and can provide a stable, accurate, and low-variance forecast (3, 14–16). The results presented here show high variation in accuracy between and within stand-alone models but more consistent accuracy from an ensemble forecast. This supports prior results and confirms that an ensemble model can provide a reliable and comparatively accurate forecast that exceeds the performance of most, if not all, of the models that contribute to it. The ensemble approach was the only model that 1) outperformed the baseline forecast in every location, 2) had better overall 4-wk-ahead accuracy than the baseline forecast in every week, and 3) ranked in the top half of forecasts for more than 75% of the forecasts it made. It achieved the best overall measures of point and probabilistic forecast accuracy for forecasting deaths. However, during key moments in the pandemic, while the ensemble outperformed many models, it often showed lower than desired accuracy (Figs. 3 and 4). These results strengthen the evidence base for synthesizing multiple models for public-health decision support. We summarize the key findings of the work as follows.
• The performance of all individual models forecasting COVID-19 mortality was highly variable even for short-term targets (Figs. 2 and 3). One source of variation was data inputs. Further investigation is needed to determine in what settings additional data can yield measurable improvements in forecast accuracy or add valuable diversity to a collection of models that are combined.
• A simple ensemble forecast that combined all submitted models each week was consistently the most accurate model when aggregating performance across forecast targets, weeks, or locations (Figs. 3 and 5 and SI Appendix, Fig. S4). Although rarely the "most accurate" model for individual predictions, the ensemble was consistently one of the top few models for any single prediction (Fig. 2). For public-health agencies concerned with using a model that shows dependably accurate performance, this is a desirable feature of a model.
• The high variation in ranks of models for each location/target/week suggests that all models, even those that are not as accurate on average, have observations for which they are the most accurate (Fig. 2).
• For point forecasts at 1 and 4 wk ahead, the ensemble tended to underpredict during periods of increasing incident deaths and to overpredict during periods of decreasing incident deaths. PI coverage during these times varied (Fig. 4).
• Forecast accuracy and calibration were substantially degraded as forecast horizons increased, largely due to underestimating the possibility of increases in incidence at long horizons (SI Appendix, Figs. S8 and S9).
Model performance should be assessed both in aggregate (to compare models that showed the best overall performance) and in specific important moments during the pandemic. It is of public-health interest to evaluate how well models are able to predict points at which the observed trends change. However, we note that a post hoc evaluation that focuses only on times at which a specific type of trend was observed raises conceptual challenges. Extreme turning points in the pandemic are relatively rare compared with the many weeks during which trends continue or only slightly change from previous weeks. A post hoc evaluation that focuses exclusively on these "changepoints" may reward models that regularly predict extreme changes, even though such predictions are inaccurate at other times (21). Adapting proper scoring rules to weigh good performance in both kinds of situations is difficult.
Rigorous evaluation of forecast accuracy faces many limitations in practice. The large variation and correlation in forecast errors across targets, submission weeks, and locations make it difficult to create rigorous comparisons of models. Forecast comparison is also challenging because teams have submitted forecasts for different lengths of time, different locations, and different numbers of horizons (SI Appendix, Figs. S1 and S8). Some teams have also changed their models over time (SI Appendix, Tables S1 and S2 and Fig. S1). To account for some of this variability, we implemented specific inclusion criteria. However, those criteria may exclude valuable approaches that were not applied to a large fraction of locations or weeks (see Methods). Forecast performance may be affected by the choice of ground-truth data and forecast target. Ground-truth data are not static; they can be revised later as more data become available (Dataset S1). There are also instances in which data are not revised but rather left with large peaks or dips due to reporting effects, especially around holidays. Different sources of ground-truth data can also have substantial differences that impact model performance. Lastly, because this evaluation focuses on incident death forecasts, it cannot speak to model performance for incident cases or hospitalizations. Deaths may serve as a lagging indicator of COVID-19, making them more predictable than hospitalization and case targets (22).
While the Hub has provided many insights into what has and has not been predictable in the COVID-19 pandemic, it also has left many important questions unanswered. Due to the operational, real-time orientation of the project, the Hub did not collect data on experimental modeling studies for which certain features can be included or left out to explicitly test what features of a model increase predictive accuracy. An observational study could be conducted with forecasts collected by the Hub, but any such analysis would be confounded by other factors about how the model was built and validated. Other research in this area has shown small but measurable improvements in predictive accuracy from including other data streams available in real time (23). Continued research is needed to evaluate how behavioral, mobility, variant prevalence, or other data streams might enhance predictive modeling.

Short-term forecasts of COVID-19 mortality have informed public-health response and risk communication for the pandemic. The number of teams and forecasts contributing to the COVID-19 ensemble forecast model has exceeded forecasting activity for any prior epidemic or pandemic. These forecasts are only one component of a comprehensive public-health data and modeling system needed to help inform outbreak response. Preparedness for future pandemics could be facilitated by creating and maintaining resources such as standardized model submission formats. This project underscores the role that collaboration and active coordination between governmental public-health agencies, academic modeling teams, and industry partners can play in developing modeling capabilities to support local, state, and federal response to outbreaks.

Methods
Surveillance Data. Early in the COVID-19 pandemic, the Johns Hopkins CSSE developed a publicly available data-tracking system and dashboard that was widely used (24). CSSE collected daily data on cumulative reported deaths due to COVID-19 at the county, state, territorial, and national levels and made these data available in a standardized format beginning in March 2020.
Incident deaths were inferred from this time series as the difference in reported cumulative deaths on successive days. Throughout the real-time forecasting exercise described in this paper, the Forecast Hub stated that forecasts of deaths would be evaluated using the CSSE data as the ground truth and encouraged teams to train their models on CSSE data. Like data from other public-health systems, the CSSE data occasionally exhibited irregularities due to reporting anomalies. CSSE made attempts to redistribute large "backlogs" of data to previous dates in instances in which the true dates of deaths, or dates when the deaths would have been reported, were known. However, in some cases, these anomalous observations were left in the final dataset (SI Appendix, SI Text). All updates were made available in a public GitHub repository (https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data#data-modification-records). Weekly incidence values were defined and aggregated based on daily totals from Sunday through Saturday, according to the standard definition of epidemiological weeks (EW) used by the CDC (25).
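To make the weekly aggregation concrete, the following is a minimal sketch in R (the language used for the analyses) of deriving weekly incident deaths from a cumulative daily series and summing them over Sunday-through-Saturday epidemiological weeks. The data frame and column names are illustrative assumptions, not the Forecast Hub's actual data schema.

```r
# Toy cumulative daily series for one location (dates and counts are made up).
set.seed(1)
cumulative_deaths <- data.frame(
  date = as.Date("2020-07-05") + 0:20,
  cum_deaths = cumsum(rpois(21, lambda = 5))
)

# Daily incidence as successive differences of the cumulative series.
daily <- cumulative_deaths[order(cumulative_deaths$date), ]
daily$inc_deaths <- c(daily$cum_deaths[1], diff(daily$cum_deaths))

# Epidemiological weeks run Sunday through Saturday; label each day by the
# Saturday that closes its week (format "%w": 0 = Sunday, ..., 6 = Saturday).
days_to_saturday <- (6 - as.integer(format(daily$date, "%w"))) %% 7
daily$week_ending <- daily$date + days_to_saturday

# Weekly incident deaths for each epidemiological week.
weekly <- aggregate(inc_deaths ~ week_ending, data = daily, FUN = sum)
weekly
```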
Forecast Format. Research teams from around the world developed forecasting models and submitted their predictions to the COVID-19 Forecast Hub, a central repository that collected forecasts of the COVID-19 pandemic in the United States beginning in April 2020. The Forecast Hub submission process has been described in detail elsewhere (26). Incident death forecasts, the focus of this evaluation, could be submitted with predictions for horizons of 1 to 20 wk after the week in which a forecast was submitted.
A prediction for a given target (e.g., "1-week ahead incident death") and location (e.g., "California") was specified by one or both of a point forecast (a single number representing the prediction of the eventual outcome) and a probabilistic forecast. Probabilistic forecasts were represented by a set of 23 quantiles at probability levels 0.01, 0.025, 0.05, 0.10, 0.15, … , 0.95, 0.975, 0.99.
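For concreteness, the 23 probability levels and a toy forecast in this quantile representation can be written as below. This is a sketch in R; the "long" column layout shown is illustrative and not the exact Hub submission schema.

```r
# The 23 probability levels at which each probabilistic forecast was reported.
quantile_levels <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
length(quantile_levels)  # 23

# A toy forecast for one target and location in "long" quantile format.
example_forecast <- data.frame(
  target   = "1 wk ahead inc death",
  location = "California",
  quantile = quantile_levels,
  value    = round(qnorm(quantile_levels, mean = 500, sd = 80))  # toy values
)
head(example_forecast)
```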
Forecast Model Eligibility and Evaluation Period. To create a set of standardized comparisons between forecasts, only models that met specific inclusion criteria were included in the analysis. For the 79 wk beginning in EW17-2020 and ending with EW42-2021, a model's weekly submission was determined to be eligible for evaluation if the forecast
1. was designated as the "primary" forecast model from a team (groups who submitted multiple parameterizations of similar models were asked to designate prospectively a single model as their scored forecast);
2. contained predictions for at least 25 out of 51 focal locations (national level and states);
3. contained predictions for each of the 1- through 4-wk-ahead targets for incident deaths; and
4. contained a complete set of quantiles for all predictions.
A model was included in the evaluation if it had submitted an eligible forecast for at least 60% (n = 47) of the submission weeks during the continuous 79-wk period (SI Appendix, Fig. S1). Based on the eligibility criteria, we compared 28 models that had at least 47 eligible weeks during this time period.
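As a sketch of how these screens could be applied in R, suppose each submission file has been summarized into one row of a data frame with indicator columns; all object and column names here are hypothetical and the toy data are for illustration only.

```r
# Hypothetical per-(model, week) summary of submission files.
submissions <- data.frame(
  model               = c("modelA", "modelA", "modelB", "modelB"),
  week                = c("EW17-2020", "EW18-2020", "EW17-2020", "EW18-2020"),
  is_primary          = c(TRUE, TRUE, TRUE, FALSE),
  n_locations         = c(51, 30, 20, 51),
  has_horizons_1_to_4 = c(TRUE, TRUE, TRUE, TRUE),
  has_all_quantiles   = c(TRUE, TRUE, FALSE, TRUE)
)

# Weekly eligibility: primary model, >= 25 of 51 locations, all 1- to 4-wk-ahead
# targets, and a complete set of quantiles.
eligible_weeks <- subset(
  submissions,
  is_primary & n_locations >= 25 & has_horizons_1_to_4 & has_all_quantiles
)

# Model-level inclusion: eligible submissions in at least 47 of the 79 wk.
weeks_per_model <- table(eligible_weeks$model)
included_models <- names(weeks_per_model)[weeks_per_model >= 47]
```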
Aggregated Forecast Evaluation of Pandemic Phases. In a secondary analysis, forecasts were evaluated based on model submissions during four different phases of the pandemic. A model was eligible for inclusion in a given phase if it met the eligibility criteria listed in the Methods section, Forecast Model Eligibility and Evaluation Period, and had forecast submissions for at least 60% of the weeks during that phase. For the spring phase, models had to submit eligible forecasts for at least 6 out of 10 wk starting EW16-2020 and ending EW26-2020. For summer eligibility, a model required submissions for at least 12 out of 20 submission weeks between EW27-2020 and EW46-2020. For winter eligibility, a model required submissions for at least 14 out of 23 submission weeks between EW47-2020 and EW16-2021. For delta phase eligibility, a model required submissions for at least 16 out of 26 submission weeks between EW17-2021 and EW42-2021. These phases were determined based on the waves of deaths at the national level during the pandemic (Fig. 1B). Each phase includes a period of increasing and decreasing incident deaths, although forecasts for the spring phase did not begin early enough to capture the increase in many locations.
Forecasts were scored using CSSE data available as of November 16, 2021. We did not evaluate forecasts on data first published in the 2 wk prior to this date due to possible revisions to the data.
Disaggregated Forecast Evaluation by Pandemic Wave. In a post hoc secondary analysis, we evaluated forecasts made in selected locations during selected pandemic waves. We used the following criteria to select locations and waves for this analysis (Fig. 4 and Dataset S2).
• We selected states that had unusually severe waves or whose waves "led" the overall wave. Locations for which data for weekly deaths during the wave had been substantially revised after the initial report were excluded from consideration.
• We picked an initial date near the start of the first increase of the wave and a last date at the end of the steep decline of the wave.
To compare the forecasts during the waves, we plotted 1- and 4-wk-ahead forecasts and calculated 95% PI coverage rates of forecasts made for the given location both during the wave of interest and over all weeks. Coverage rates were computed for models that were included in the overall analysis (see eligibility in the Methods section, Forecast Model Eligibility and Evaluation Period), and, for inclusion in the coverage calculations for each wave, the model additionally had to have made forecasts for at least 3 wk in the selected wave.
Forecast Locations. Forecasts were submitted for 57 locations including all 50 states, six jurisdictions and territories (American Samoa, Guam, the Northern Mariana Islands, US Virgin Islands, Puerto Rico, and the District of Columbia), and a US national-level forecast. Because American Samoa and the Northern Mariana Islands had no reported COVID-19 deaths and one reported COVID-19 death, respectively, during the evaluation period, we excluded these locations from our analysis.
In analyses for which measures of forecast skill were aggregated across locations, we typically included only the 50 states. Including the territories in raw score aggregations would favor models that had forecasted for these regions, because models were often accurate in predicting low or zero deaths each week, thereby reducing their average error. The national-level forecasts were not included in the aggregated scores because the large magnitude of scores at the national level strongly influences the averages. However, in analyses for which scores were stratified by location, we included forecasts for all locations: the 50 states, the included territories and jurisdictions, and the national level.
This evaluation used the CSSE COVID-19 surveillance data as ground truth when assessing forecast performance. We did not score observations when ground-truth data showed negative values for weekly incident deaths (due to changes in reporting practices from state/local health agencies [e.g., removing "probable" COVID-19 deaths from cumulative counts]). This occurred 11 times.
Forecast Models. For the primary evaluation, we compared 28 models that submitted eligible forecasts for at least 47 of the 79 wk considered in the overall model eligibility period (Fig. 1). Teams that submitted to the COVID-19 Forecast Hub used a wide variety of modeling approaches and input data (SI Appendix, Tables S1 and S2). Two of the evaluated models are from the COVID-19 Forecast Hub itself: a baseline model and an ensemble model.
The COVIDhub-baseline model was designed to be a neutral model to provide a simple reference point of comparison for all models. This baseline model forecasted a predictive median incidence equal to the number of reported deaths in the most recent week ($y_t$), with uncertainty around the median based on changes in weekly incidence that were observed in the past of the time series (details in SI Appendix, SI Text).
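A minimal sketch of a baseline forecast in this spirit is given below, assuming a numeric vector of past weekly incident deaths for one location. This is a simplified illustration of the idea (flat median with uncertainty from symmetrized past differences), not the exact COVIDhub-baseline implementation described in SI Appendix, SI Text.

```r
# Sketch: flat baseline forecast with median equal to the most recent weekly
# count and uncertainty derived from past week-to-week changes.
baseline_forecast <- function(weekly_deaths,
                              quantile_levels = c(0.01, 0.025,
                                                  seq(0.05, 0.95, by = 0.05),
                                                  0.975, 0.99)) {
  y_t     <- tail(weekly_deaths, 1)   # most recent observed week
  diffs   <- diff(weekly_deaths)      # historical weekly changes
  changes <- c(diffs, -diffs)         # symmetrize so the median stays at y_t
  q <- y_t + quantile(changes, probs = quantile_levels)
  pmax(0, sort(q))                    # death counts cannot be negative
}

baseline_forecast(c(120, 150, 140, 160, 155))
```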
The COVIDhub-ensemble model combined forecasts from all models that submitted a full set of 23 quantiles for 1- through 4-wk-ahead forecasts for incident deaths. The ensemble for incident weekly deaths was first submitted in the week ending June 6, 2020 (EW23). For submissions from EW23 through EW29 (week ending July 18, 2020), the ensemble took an equally weighted average of forecasts from all models at each quantile level. For submissions starting in EW30 (week ending July 25, 2020), the ensemble computed the median across forecasts from all models at each quantile level (27). We evaluated more complex ensemble methods, and while they did show modest improvements in accuracy, they also displayed undesirable increases in variability in performance during this evaluation period (28, 29).
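A minimal sketch of the quantile-wise combination is shown below; the toy matrix of component-model quantiles is an assumption for illustration.

```r
# Toy quantile values (25%, 50%, 75%) from three component models for a single
# location and target; rows are models, columns are quantile levels.
component_quantiles <- rbind(
  modelA = c(100, 150, 200),
  modelB = c(120, 160, 260),
  modelC = c( 90, 140, 220)
)

# Ensemble used from EW30 onward: median across models at each quantile level.
ensemble_median <- apply(component_quantiles, 2, median)  # 100 150 220

# Ensemble used from EW23 through EW29: equally weighted mean at each level.
ensemble_mean <- apply(component_quantiles, 2, mean)
```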
Forecast Submission Timing. Of the 3,555 forecast submissions we included in the evaluation, 230 (6%) were either originally submitted or updated more than 24 h after the submission deadline. In all of these situations, modeling teams attested (via annotation on the public data repository) to the fact that they were correcting inadvertent errors in the code that produced the forecast and that the forecast used as input only data that would have been available before the original submission due date. In these limited instances, we evaluated the most recently submitted forecasts.
Evaluation Methodology. We evaluated aggregate forecast skill using a range of scores that assessed both point and probabilistic accuracy. These scores were aggregated over time and locations for near-term forecasts (4 wk or less into the future) and, in a single analysis, for longer-term projections (5 to 20 wk into the future).
Point forecast error was assessed using the MAE, defined for a set of observations $y_{1:N}$ and each model's designated point predictions $\hat{y}_{1:N}$ as
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|.$$
To assess probabilistic forecast accuracy, we used two scores that are easily computable from the quantile representation for forecasts described in the Methods section, Forecast Format. Briefly, the WIS is a proper score that combines a set of interval scores for probabilistic forecasts that provide quantiles of the predictive forecast distribution (20):
$$\mathrm{IS}_\alpha(F, y) = (u - l) + \frac{2}{\alpha}\,(l - y)\,\mathbf{1}(y < l) + \frac{2}{\alpha}\,(y - u)\,\mathbf{1}(y > u),$$
$$\mathrm{WIS}_{\alpha_{0:K}}(F, y) = \frac{1}{K + 1/2}\left( w_0\,|y - m| + \sum_{k=1}^{K} w_k\,\mathrm{IS}_{\alpha_k}(F, y) \right),$$
where $m$ is the predictive median. An individual interval score for a single prediction and uncertainty level can be broken into three additive components. These components (dispersion, underprediction, and overprediction, as they appear, respectively, in the preceding IS equation) represent contributions to the score. As an example, suppose a 50% PI ($\alpha = 0.5$) is (40, 60) and the observation is 30. Then $\mathrm{IS}_{0.5}(\{40, 60\}, 30) = 20 + 40 + 0 = 60$, where the dispersion is 20, the penalty for underprediction is 40, and there is no penalty for overprediction. Similarly, the WIS, which is computed as a weighted sum of interval scores across all available uncertainty levels as shown in the preceding equation, can be split into contributions from each of these components. These can then be used to summarize the average performance of a model in terms of the width of its intervals and the average penalties it receives for intervals missing below or above the observation. Proper scores promote "honest" forecasting by not providing forecasters with incentives to report forecasts that differ from their true beliefs about the future (30).
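As a check of the worked example above, here is a minimal sketch in R of the interval score and its additive decomposition. It is a simplified illustration of the scoring formula, not the Forecast Hub's evaluation code.

```r
# Interval score for a single central (1 - alpha) prediction interval, returned
# along with its dispersion, underprediction, and overprediction components.
interval_score <- function(lower, upper, y, alpha) {
  dispersion      <- upper - lower
  underprediction <- (2 / alpha) * (lower - y) * (y < lower)
  overprediction  <- (2 / alpha) * (y - upper) * (y > upper)
  c(total = dispersion + underprediction + overprediction,
    dispersion = dispersion,
    underprediction = underprediction,
    overprediction = overprediction)
}

# The worked example from the text: a 50% PI of (40, 60) and an observation of 30.
interval_score(lower = 40, upper = 60, y = 30, alpha = 0.5)
# total = 60, dispersion = 20, underprediction = 40, overprediction = 0
```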
We also evaluated PI coverage, the proportion of times a PI of a certain level covered the observed value, to assess the degree to which forecasts accurately characterized uncertainty about future observations. Computational details for PI coverage are provided in SI Appendix, SI Text.
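A minimal sketch of the empirical coverage calculation, assuming equal-length vectors of interval endpoints and observations (names are hypothetical; the Hub's exact computation is described in SI Appendix, SI Text):

```r
# Empirical PI coverage: the share of observations falling inside their intervals.
pi_coverage <- function(lower, upper, observed) {
  mean(observed >= lower & observed <= upper)
}

# Toy example: 95% PIs that cover 3 of 4 observations give a coverage of 0.75.
pi_coverage(lower    = c(10, 20, 30, 40),
            upper    = c(50, 60, 70, 80),
            observed = c(30, 65, 55, 45))
```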
Forecast Comparisons. Comparative evaluation of the considered models 1, … , M is hampered by the fact that not all of them provide forecasts for the same set of locations and time points. To adjust for the level of difficulty of each model's set of forecasts, we computed 1) a standardized rank between 0 and 1 for every forecasted observation relative to other models that made the same forecast and 2) an adjusted relative WIS and MAE.
To compute the WIS standardized rank score for model $m$ and observation $i$ ($sr_{m,i}$), we computed the number of models that forecasted that observation ($n_i$) and the rank of model $m$ among those $n_i$ models ($r_{m,i}$). The model with the best (i.e., lowest) WIS received a rank of 1 and the worst received a rank of $n_i$. The standardized rank then rescaled the ranks to between 0 and 1, where 0 corresponded to the worst rank and 1 to the best (31–33), as follows:
$$sr_{m,i} = 1 - \frac{r_{m,i} - 1}{n_i - 1}.$$
This metric is not dependent on the scale of the observed data. If all models were equally accurate, distributions of standardized ranks would be approximately uniform. A procedure to compute a measure of relative WIS, which evaluates the aggregate performance of one model against the baseline model, is described in SI Appendix, SI Text. To adjust for the relative difficulty of beating the baseline model on the covered set of forecast targets, the chosen measure also takes into account the performance of all other available models. The same procedure was used to compute a relative MAE.
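A minimal sketch of the standardized rank for a single forecasted observation, given the WIS values of all models that forecasted it (toy values; handling ties by average ranks is an illustrative choice, not necessarily the procedure used in the analysis):

```r
# Standardized rank: the model with the lowest (best) WIS maps to 1 and the
# model with the highest (worst) WIS maps to 0.
standardized_rank <- function(wis_values) {
  r <- rank(wis_values, ties.method = "average")  # 1 = best (lowest WIS)
  n <- length(wis_values)
  1 - (r - 1) / (n - 1)
}

standardized_rank(c(modelA = 12.1, modelB = 30.5, modelC = 18.0))
# modelA = 1.0, modelB = 0.0, modelC = 0.5
```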
Data Availability. The forecasts from models used in this paper are available from the COVID-19 Forecast Hub GitHub repository (https://github.com/reichlab/covid19-forecast-hub) (4, 34) and the Zoltar forecast archive (https://zoltardata.com/project/44) (35). These are both publicly accessible. The code used to generate all figures and tables in the manuscript is available in a public repository (https://github.com/reichlab/covid19-forecast-evals). All analyses were conducted using the R language for statistical computing (version 4.0.2) (36). We followed the EPIFORGE 2020 guidelines for reporting results from epidemiological forecasting studies (SI Appendix, Table S5) (37).