Epidemic outcomes following government responses to COVID-19: Insights from nearly 100,000 models

Government responses to COVID-19 are among the most globally impactful events of the 21st century. The extent to which responses, such as school closures, were associated with changes in COVID-19 outcomes remains unsettled. Multiverse analyses offer a systematic approach to testing a large range of models. We used daily data on 16 government responses in 181 countries in 2020–2021, and 4 outcomes (cases, infections, COVID-19 deaths, and all-cause excess deaths) to construct 99,736 analytic models. Among those, 42% suggest outcomes improved following more stringent responses ("helpful"). No subanalysis (e.g., limited to cases as outcome) demonstrated a preponderance of helpful or unhelpful associations. Among the 14 associations with P values < 1 × 10⁻³⁰, 5 were helpful and 9 unhelpful. In summary, we find no pattern in the overall set of models that suggests a clear relationship between COVID-19 government responses and outcomes. Strong claims about government responses' impacts on COVID-19 may lack empirical support.


INTRODUCTION
COVID-19 was, and to a large extent remains, the most meaningful health event in recent global history (1). Unlike the 2003 Severe Acute Respiratory Syndrome (SARS) epidemic, it spread globally; unlike Zika, everyone is at risk of infection with COVID-19; and unlike recent swine flu pandemics, the disease severity and mortality from COVID-19 were so high it led to life expectancy reversals in many countries (2,3).
If COVID-19 was a defining health event, the global responses to COVID-19 were a defining health policy experience (4,5). The swiftness of global responses, their extensiveness, and direct implications for billions of people's lives were historically unique: The responses to the 1918 influenza pandemic, in comparison, were largely localized, while the global response to the HIV pandemic was slower and smaller in extent than the response to COVID-19 (6,7). Government responses to COVID-19 intended to limit the virus' spread and disease burden, using encouragements or mandates on schools, travel, and masks, among others, as well as income support or debt relief to enable social distancing.
The rapid spread of the virus in early 2020 meant that many COVID-19 responses were implemented swiftly, based on partial information, often from simulation models, about transmission mechanisms and about anticipated benefits (8,9). The swiftness of spread afforded effectively no time for careful studies of policy effects, and favored emergency measures implemented with relatively little information about the trade-offs of alternative policy options (10).
Many approaches are needed to understand the impacts of government responses to the pandemic. Qualitative approaches may help with understanding why different governments used different policy responses (for example, why Norway implemented shelter-in-place while Sweden did not). Observational epidemiologic studies may help characterize the relationships between different policy responses and COVID-19 outcomes, while meta-analyses and systematic reviews can summarize the observational evidence on government response impacts (11,12). Experimental evidence is not available for understanding the impact of policies: No government studied its responses directly with trials or experiments. As a result, one common thread is that the data available for studying policy responses are messy and complex, resulting in analyses that may also be complex (13). For example, most policy responses were implemented concurrently or in close sequence, posing challenges to identifying the unique impacts of individual policies (14). Existing studies of COVID-19 response impacts range from unrealistically positive to dismissively negative, further complicating balanced assessments (15-17).
Despite the complexity of the endeavor, its importance is undiminished. Definitive studies of government response impacts on the virus' spread and disease burden would be enormously helpful for present decisions and future pandemic planning. The dearth of prospective and randomized studies means that, likely, no single study may settle this question.
In this analysis, we attempt to advance the science of government responses to COVID-19 by taking a multiverse approach to this topic (18-21). Multiverse analyses elevate epistemic humility by reducing the influence of subjective choices in the research design process. Multiverse approaches also prompt analysts to comprehensively probe the space of plausible models and the results of varying assumptions. By expanding the number of analyses, they provide information about the stability of relationships' magnitude and direction across study design parameters and choices. We take this approach because (i) the data available for analysis are complex and rich, making possible a large number of plausible analyses, and (ii) we aspire to limit the role of data and model choices, or "researcher degrees of freedom," in driving the results (22). The emerging distributions of possible relationships can be considered an update to the strength of hypotheses about the effectiveness of COVID-19 responses, an update that contrasts with much of the highly cited literature. At the very least, the breadth of possible findings provides an understanding of what can or cannot be answered with limited observational data.

RESULTS
The daily analytic dataset includes 128,662 observations, an average of 711 observations in each of 181 countries in 2020 and 2021. The weekly and monthly datasets contain 18,795 and 4198 observations (an average of 103 and 23 observations per country, respectively). The average number of observations in each analysis was 18,930 and varied with the period of aggregation (monthly, weekly, or daily), the partial availability of outcomes (especially for excess deaths), and outcome counts of zero, which prevent growth calculations. The earliest date with nonzero COVID-19 outcomes data is 22 January 2020.
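The exclusion of zero counts can be made concrete with a minimal sketch of a log growth calculation (the function name and lag handling are illustrative assumptions, not the paper's code):

```python
import math

def log_growth(y_later, y_earlier):
    """Log growth of an outcome count over some lag.

    Returns None when either count is zero or negative, since log
    growth is undefined there -- such observations drop out of the
    growth-based analyses.
    """
    if y_later <= 0 or y_earlier <= 0:
        return None
    return math.log(y_later / y_earlier)

# A country-period with zero reported cases yields no growth observation.
print(log_growth(120, 80))  # positive growth
print(log_growth(0, 80))    # None: excluded
```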
Table 1 summarizes government response data. Across all countries, the containment measures that comprise the stringency index peaked, on average, in April or May 2020. Health system responses, especially access to testing and vaccine availability, peaked later and remained at their peak through the end of the study period in December 2021. The country with the highest stringency index for the entire study period was Honduras (average 73.8; China and the United States, by comparison, had average stringency of 69.7 and 56.3, respectively). Figure S1 provides an illustrative example of the time trends of four government responses in four countries: stringency index, school closure, workplace closure, and vaccine availability in the United States, South Korea, Sweden, and India.
Figure 1 provides a visual representation of the variations used to generate all 99,736 models. Of all models, 41,748 (42%) had a point estimate in the "helpful" direction, and 57,988 (58%) in the "unhelpful" direction (we use "helpful" as shorthand for a negative β1 and "unhelpful" for a positive β1, without implying a causal link). The number of significant associations was similar between the helpful and unhelpful models: 3692 (8.8%) of the helpful and 3811 (6.7%) of the unhelpful associations were statistically significant using a false discovery rate criterion (23).
Figure 2 illustrates the direction of the evidence among the analyses. Each panel demonstrates the portion of the models that are helpful and unhelpful, by subgroup. Figure 2A demonstrates that about half of all models suggest government responses were helpful, and half unhelpful, when examining any of the three indices (stringency, government response, and economic support). A similar picture emerges when looking at subgroups by time period (about 50:50 for early 2020, all of 2020, or the entire time period), outcome measure (infections, cases, COVID-19 deaths, or excess deaths), dataset aggregation period (daily, weekly, or monthly), fixed effects (with or without), and covariates (with or without). The only subgroup level where the balance shifts away from 50:50 is the model type. Models 1, 3, and 5 range from 60 to 80% in the unhelpful direction, while models 2, 4, and 6 range from 55 to 70% in the helpful direction. When removing model 1, the simplest and least balanced model (80% of results in the unhelpful direction), then, among the remaining 82,864 estimates, 46% have a point estimate in the helpful direction and 54% in the unhelpful direction.

Figure 3 shows that the distributions of the standardized effect sizes (proxied using the t-statistic from each model) for the overall set of 99,736 models and for five subsets are evenly and narrowly centered around zero (with small deviations from zero among the most skewed models in Fig. 2). The five response-outcome pairs with the most consistent associations in the helpful and unhelpful directions are shown in Table 2. The number of infections makes up three of the five most consistently helpful associations, while excess all-cause deaths and COVID-19 deaths make up the outcome in the majority of unhelpful associations. Among the 14 most extremely strong associations, with P < 1 × 10⁻³⁰, 5 were in the helpful direction and 9 were in the unhelpful direction.

Table 1. Descriptive features of COVID-19 government response data. The top eight responses are categorized by OxCGRT as "containment measures," the next two as "economic measures," and the next six as "health system measures." The composite indices are shown at the bottom. The mean averages the response across all countries from 22 January 2020 to 21 December 2021. Max month indicates the month in which the indicator was highest across all countries. Max country refers to the country in which the indicator was highest over the entire period.

Last, a total of 12 models could be applied to the simulated measles dataset (six models, each with and without fixed effects). All results had effect sizes in the helpful direction, with all P < 3.1 × 10⁻⁶ and t-statistics ≤ −5.3.

DISCUSSION
In this study, we perform a multiverse analysis of nearly 100,000 ways of probing the relationship between COVID-19 government responses and outcomes in 181 countries. The goal is to create a multiverse of plausible analyses and assess the sensitivity of the results to these choices. Exploring the multiverse for a question of high importance may be useful where there is no consensus. In this study, we found no clear pattern in the overall set of analyses or in any subset of analyses. We are left to conclude that strong claims about the impact of government responses on the COVID-19 burden lack empirical support.
Inferences from this analysis deserve careful consideration, including a clear understanding of what this study cannot illuminate. First, none of the models tested can tell the extent to which any government response could have improved COVID-19 outcomes. Perhaps with another virus, other implementation strategies, or different populations, school closures could have extinguished transmission. Nor can we learn from this study what COVID-19 outcomes would have been like in the absence of these responses. Second, our analysis is global in scope and examines government responses and COVID-19 outcomes at the level of countries. This is suitable for inferring global patterns and trends but cannot exclude patterns at state, district, community, or even neighborhood levels.
Third, and perhaps most importantly, we cannot conclude that there is compelling evidence that government responses improved the COVID-19 burden, nor can we conclude that there is compelling evidence that government responses worsened it. The concentration of estimates around a zero effect weakly suggests that government responses did little to nothing to change the COVID-19 burden.
This conclusion departs meaningfully from many scientific studies of government responses. For example, a highly cited study on this topic notes that "Our results show that major non-pharmaceutical interventions—and lockdowns in particular—have had a large effect on reducing transmission" (9). Such conclusions are common in the scientific literature (table S1), but our analysis, extensive in scope and outcomes, suggests that such strong claims lack empirical justification.
The contribution of any study can be thought of as an update to the reader's Bayesian prior. Most scientific studies aim to strengthen the reader's posterior belief in a hypothesis, while this study explores the opposite: We argue that strong beliefs about the impact of COVID-19 government responses, as reflected in the studies in table S1, may deserve weakening.
Improving future evidence starts with prospective, well-measured data collection platforms. The benefits of such platforms are demonstrated in the invaluable understanding of COVID-19 vaccine impacts from large registries in Israel and Qatar (24,25). Large national prospective data platforms have been a long-lasting challenge in many countries, including the United States; local platforms, which may be easier to implement, could facilitate understanding at smaller scales. In the context of assessing government responses, improved measurements of the responses themselves would be important. The OxCGRT is a critical resource, but a better understanding of implementation, enforcement, and compliance could further illuminate effect heterogeneity. Mitigating uncertainty due to flexible analytic design includes registering hypotheses in a public repository before analysis, like the process that precedes many randomized clinical trials (26).
The issue of subjecting government responses to experimentation is complicated. Trials of public health programs would yield extremely valuable information, but such trials may be thorny on practical or ethical grounds. This deserves further consideration, however, given the enormous stakes and inevitable trade-offs involved in responses such as mandatory school or business closures. A final important consideration that could improve the quality of evidence is keeping the issue at hand away from special and financial interests. The polarization that beset the scientific community has made asking and probing some questions difficult (27,28). Keeping scientific questions separate from public decisions could enable a spirit of greater collaboration around such issues.
The limitations of this study fall into four broad groups. First, it may be that a consistent signal, either of government response helpfulness or unhelpfulness, is contained in the data or models but not identified in this analysis. We encourage further probing of the results in our Shiny app. Second, the models are limited in their causal strength in the sense that a counterfactual to the policies implemented cannot be inferred. Two features temper this limitation: the use of leads such that the outcome is measured 2 weeks or a month following the policy response, and the use of fixed effects that assess "within country" associations and control for all time-invariant effects. While covariates with time-varying information on, say, health care capacity may provide additional nuance, this information would be useful only if it were available and comparable for all (or many) countries and time-varying at a daily or monthly level. We note that all our models examined short-term epidemic outcomes following policy responses (2 or 4 weeks), but long-term outcomes remain an important and largely unexamined area of study. Third, country-level data hide more nuanced patterns that may be discernible in analyses of more granular data. Last, despite efforts to limit investigator choices, we made choices in the design of the study, and those may limit inferences. The data and models used in this analysis are open for other investigators to use, modify, or reassemble.

In sum, this comprehensive analysis of government responses and COVID-19 outcomes fails to yield clear inferences about government response impacts. This suggests that strong notions about the effectiveness or ineffectiveness of government responses are not backed by existing country-level data, and scientific modesty is warranted when learning from the responses to the COVID-19 pandemic.

METHODS
This analysis tests the relationships between government responses and COVID-19 outcomes. We test the extent to which COVID-19 outcomes improved or worsened following government responses. We recognize that the links between government responses and outcomes are mediated, for example, by the power of the government to enforce a response such as mask mandates. We also recognize that this approach cannot fully assess counterfactual realities such as "what would outcomes have been had the government kept schools closed for longer?" Rather, we assess the observed relationship, implicitly assuming that if responses and outcomes go in opposite directions (for example, cases increase after easing mask requirements), this is generally consistent with the success of the government response. Conversely, if responses and outcomes go in the same direction (for example, cases increase after increasing mask requirements), this is evidence generally inconsistent with success.
The current paper presents the results of nearly 100,000 reasonable ways of assessing the relationship between government responses and COVID-19 outcomes. Government responses are represented as individual policies, such as school closures, or as indices that aggregate the intensity and type of several policies. The rest of this section details the dimensions used in this analysis.

Government responses
The primary data source for government responses is the Oxford COVID-19 Government Response Tracker (OxCGRT). The OxCGRT recorded government responses daily in more than 180 countries using a standardized approach from publicly available sources such as news articles or government briefings. The OxCGRT recorded the official responses at the national level, not their implementation or enforcement. The complete details of the OxCGRT data-generating processes are publicly available (4,5).
Government responses fall into three primary domains: containment and closure (such as school closures and restrictions on gatherings; eight ordinal variables), health system responses (such as contact tracing and mask mandates; six ordinal variables), and economic relief policies (income support and debt relief; two ordinal variables). Because government responses may work in concert or synergistically, the OxCGRT constructs composite indices that aggregate the individual responses. We use three composite indices: the "government response index," which pools all 16 government responses; the "stringency index," which pools the eight containment and closure variables and one health system response; and the "economic support index," which pools the two economic relief policies. We use a version of the indices that combines government responses for vaccinated and unvaccinated populations using a weighted average based on the portion of the population that is vaccinated. All 19 variables (16 individual variables and 3 indices) are available daily for the 181 countries for which we have outcomes data from 22 January 2020 to 31 December 2021.
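The blending of vaccinated and unvaccinated response levels can be sketched as a simple population-weighted average (a minimal illustration of the description above; the function name is ours, and OxCGRT's published formula may differ in detail):

```python
def blended_index(index_vaccinated, index_unvaccinated, share_vaccinated):
    """Population-weighted average of an index recorded separately for
    vaccinated and unvaccinated groups. Illustrative sketch, not
    OxCGRT's exact formula.
    """
    if not 0.0 <= share_vaccinated <= 1.0:
        raise ValueError("share_vaccinated must be in [0, 1]")
    return (share_vaccinated * index_vaccinated
            + (1.0 - share_vaccinated) * index_unvaccinated)

# With 30% vaccinated, a laxer policy for the vaccinated pulls the index down.
print(blended_index(40.0, 70.0, 0.3))
```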

COVID-19 outcomes
We use nine different outcome measures. We extracted two from the Johns Hopkins COVID-19 dashboard: daily confirmed COVID-19 cases and deaths (29). We extracted three outcomes from the Institute for Health Metrics and Evaluation (IHME): other estimates of daily COVID-19 cases and deaths, and daily estimates of infections (30,31). The estimates of infections are modeled on the basis of age-specific infection fatality rates and the age distribution of deaths. Last, we took weekly or monthly excess all-cause mortality from the New York Times (35 countries), the Financial Times (99 countries), the World Mortality Dataset (102 countries), and The Economist (181 countries) (32-35). A comparison of data sources for excess mortality is available elsewhere (36).

Statistical models
We use six statistical models. The models represent several patterns of relationships between government responses and COVID-19 outcomes. We chose models that broadly represent a stated expected impact of government response policies (such as, for example, models that assess a "flattening of the curve"), and models that capture historical patterns of public health efforts that succeeded in reducing infectious disease burden, such as measles vaccination or polio elimination (8,37,38). (We test the models on a dataset of measles cases in the United States; see Plausibility analysis below.) The formal models are presented below. Each model was estimated such that the coefficient on the Policy variable (β1 below) would be negative if the government response was associated with reduced COVID-19 burden (we use "helpful" as shorthand for this relationship, without causal implication). In each model, Y is the outcome of interest, c indexes a country, t indexes the observation time, and n represents the duration between the government response and outcome observation (2 or 4 weeks/1 month). μ_k X^k_ct represents a matrix of k covariates, δ_c represents country fixed effects, and λ_t are time fixed effects. Country fixed effects remove all time-invariant differences between countries, such that the effects are estimated within the country. Time fixed effects control for temporal trends shared among all countries. All models were estimated using ordinary least squares.
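The printed equations did not survive extraction; the following is a hedged reconstruction of plausible forms for the six models, consistent with the notation above and the verbal descriptions that follow (the published equations may differ in detail). Here G denotes the growth rate of the outcome, and L (levels), G (growth), and D (difference) follow the indexing used in the figure captions.

```latex
% Hedged reconstruction of Models 1-6; the published equations may differ.
\begin{align*}
\text{Model 1 (L):}  \quad & Y_{c,t+n} = \beta_1 \mathrm{Policy}_{c,t} + \mu_k X^k_{c,t} + \delta_c + \lambda_t + \epsilon_{c,t}\\
\text{Model 2 (LD):} \quad & Y_{c,t+n} - Y_{c,t} = \beta_1 \mathrm{Policy}_{c,t} + \mu_k X^k_{c,t} + \delta_c + \lambda_t + \epsilon_{c,t}\\
\text{Model 3 (LD):} \quad & Y_{c,t+n} - Y_{c,t} = \beta_1 (\mathrm{Policy}_{c,t} - \mathrm{Policy}_{c,t-n}) + \mu_k X^k_{c,t} + \delta_c + \lambda_t + \epsilon_{c,t}\\
\text{Model 4 (G):}  \quad & G_{c,t+n} = \beta_1 \mathrm{Policy}_{c,t} + \mu_k X^k_{c,t} + \delta_c + \lambda_t + \epsilon_{c,t}\\
\text{Model 5 (GD):} \quad & G_{c,t+n} - G_{c,t} = \beta_1 \mathrm{Policy}_{c,t} + \mu_k X^k_{c,t} + \delta_c + \lambda_t + \epsilon_{c,t}\\
\text{Model 6 (GD):} \quad & G_{c,t+n} - G_{c,t} = \beta_1 (\mathrm{Policy}_{c,t} - \mathrm{Policy}_{c,t-n}) + \mu_k X^k_{c,t} + \delta_c + \lambda_t + \epsilon_{c,t}
\end{align*}
```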
To facilitate an intuitive understanding of the model, we use a concrete example with COVID-19 deaths as the outcome and stringency index as the policy.Model 1 tests the extent to which higher stringency is associated with fewer COVID-19 deaths 2 or 4 weeks later.Model 2 tests the extent to which higher stringency is associated with fewer COVID-19 deaths 2 or 4 weeks later, compared with the day of observation.Model 3 tests the extent to which increasing stringency is associated with fewer COVID-19 deaths 2 or 4 weeks later, compared with the day of the increase (or, conversely, decreasing stringency associated with more COVID-19 deaths).Model 4 tests the extent to which higher stringency is associated with a lower growth rate of COVID-19 deaths 2 or 4 weeks later.Model 5 tests the extent to which higher stringency is associated with a growth rate of COVID-19 deaths 2 or 4 weeks later that is lower than the day of observation.Model 6 tests the extent to which increasing stringency is associated with a growth rate of COVID-19 deaths 2 or 4 weeks later that is lower than the day of the increase.

Analytic and data variations
We analyzed a total of 99,736 models.The data were analyzed at the daily, weekly, or monthly level of aggregation, allowing for smoothing of idiosyncratic variation in daily data.Covariates included the number of borders ("island effect"), the portion of the population over age 60, the total fertility rate (to capture age structure), and the daily mobility (percent of baseline) from Google, obtained from IHME.Covariates were either included or excluded as a bloc.Country and time fixed effects are commonly used in econometric models to control for time-invariant between-country differences and shared time patterns.Including country fixed effects, in particular, yields a pooled within-country association.For example, changes in the U.S. stringency index are assessed in relation to COVID-19 growth rates in the United States.Models 1 to 3 are analyzed using totals or per-capita outcomes (per-capita outcomes are identical to totals with growth models 4 to 6).To prevent population differences from overwhelming estimates, models with total outcomes include fixed effects.Last, we analyze all models over three time periods: the early pandemic (January 2020 to June 2020); the first year (all of 2020); and the first 2 years (2020 to 2021).
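To make concrete how these dimensions multiply into tens of thousands of specifications, the space can be enumerated programmatically. The dimension labels below follow the text, but the crossing is a rough sketch: the analysis drops some combinations (for example, per-capita variants apply only to models 1 to 3), so the full cross here does not equal the paper's 99,736.

```python
from itertools import product

# Illustrative specification space; labels are ours, counts follow the text:
# 16 individual responses + 3 indices, 9 outcome measures, 6 models,
# 2 lags, 3 aggregations, fixed effects on/off, covariates on/off, 3 periods.
responses     = [f"response_{i}" for i in range(16)] + [
                 "stringency", "gov_response", "econ_support"]
outcomes      = ["jhu_cases", "jhu_deaths", "ihme_cases", "ihme_deaths",
                 "ihme_infections", "excess_nyt", "excess_ft",
                 "excess_wmd", "excess_econ"]
models        = [1, 2, 3, 4, 5, 6]
lags          = ["2_weeks", "4_weeks"]
aggregation   = ["daily", "weekly", "monthly"]
fixed_effects = [True, False]
covariates    = [True, False]
periods       = ["early_2020", "all_2020", "2020_2021"]

specs = list(product(responses, outcomes, models, lags,
                     aggregation, fixed_effects, covariates, periods))
print(len(specs))  # full cross, before any exclusions
```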
Standard errors are clustered by country in all analyses.Statistical significance is assessed using a false discovery rate of 0.05 (23).
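For concreteness, the false discovery rate criterion cited above can be sketched with a minimal Benjamini-Hochberg step-up procedure (our own illustrative implementation, not the paper's code):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean list marking which hypotheses are rejected
    while controlling the false discovery rate at level q.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k with p_(k) <= k * q / m.
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            max_k = rank
    # Reject the max_k smallest p-values.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            rejected[i] = True
    return rejected

print(benjamini_hochberg([0.001, 0.04, 0.03, 0.8]))  # [True, False, False, False]
```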
The analytic code is provided along with the analysis.In addition, the entire set of 99,736 model results can be explored using a Shiny app at https://eranbendavid.shinyapps.io/CovidGovPolicies/.

Plausibility analysis
With such a large number of models, we conducted a separate analysis to test whether the models would identify effects, should ones exist within the data. Specifically, we use our approach to study the introduction of measles vaccination in the United States, widely regarded as a public health success (39,40). Following the licensing of the measles vaccine in 1963/1964, the number of reported measles cases dropped from approximately 400,000 annually in the 5 years before the licensing to 30,000 annually in the 5 years after (41,42). We created a synthetic dataset with measles cases proportional to the state population between 1954 and 1990 and assigned a vaccination adoption year to each state between 1964 and 1967, with an effect size of around 93% (equivalent to the case rate decline in the entire United States). We thus construct a dataset with different units (states), an efficacious policy intervention (vaccination), and different policy onsets (1964 to 1967). We then applied our models to this dataset, with cases as the only outcome, scheduled vaccination onset as the main policy predictor, and fixed effects as in the main analysis.
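The construction of such a synthetic dataset can be sketched as follows. The state names, base case rate, and noise model here are illustrative assumptions; only the period (1954 to 1990), staggered onset years (1964 to 1967), and ~93% effect size come from the text.

```python
import random

def simulate_measles(populations, onset_years, start=1954, end=1990,
                     base_rate=0.002, effect=0.93, seed=0):
    """Synthetic state-year measles cases proportional to population,
    with a ~93% decline from each state's vaccination-onset year.
    Illustrative sketch; the paper's dataset may differ in detail.
    """
    rng = random.Random(seed)
    rows = []
    for state, pop in populations.items():
        onset = onset_years[state]
        for year in range(start, end + 1):
            rate = base_rate * ((1.0 - effect) if year >= onset else 1.0)
            cases = round(pop * rate * rng.uniform(0.8, 1.2))  # mild noise
            rows.append({"state": state, "year": year, "cases": cases})
    return rows

data = simulate_measles({"A": 5_000_000, "B": 1_000_000},
                        {"A": 1964, "B": 1967})
```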

Supplementary Materials
This PDF file includes:
Fig. S1
Table S1
Legend for data file S1
References

Other Supplementary Material for this manuscript includes the following:
Data file S1

Fig. 2. Portion of models with point estimates in the helpful or unhelpful direction, by group. Each panel shows the portion of models with helpful and unhelpful associations within a domain, including the overall distribution of effects (bottom bar) and the models that were significant (solid colors). Panel (A) shows the distribution for each of the three government response indices, panel (B) for the four outcome types (cases, infections, COVID-19 deaths, and all-cause excess deaths), panel (C) for the three study periods, panel (D) for models with and without fixed effects, panel (E) for the three time aggregations, and panel (F) for the six models (1 bottom, 6 top). The models are indexed as L (levels) or G (growth), with D indicating a difference.

Table 2. Five most consistent associations in the helpful and unhelpful directions. IHME, Institute for Health Metrics and Evaluation; NYT, the New York Times; Econ, The Economist; CSSE, Center for Systems Science and Engineering, the Johns Hopkins COVID-19 dashboard.