Early warning of vulnerable counties in a pandemic using socio-economic variables

In the U.S. in early 2020, heterogenous and incomplete county-scale data on COVID-19 hindered effective interventions in the pandemic. While numbers of deaths can be used to estimate actual number of infections after a time lag, counties with low death counts early on have considerable uncertainty about true numbers of cases in the future. Here we show that supplementing county-scale mortality statistics with socioeconomic data helps estimate true numbers of COVID-19 infections in low-data counties, and hence provide an early warning of future concern. We fit a LASSO negative binomial regression to select a parsimonious set of five predictive variables from thirty-one county-level covariates. Of these, population density, public transportation use, voting patterns and % African-American population are most predictive of higher COVID-19 death rates. To test the model, we show that counties identified as under-estimating COVID-19 on an early date (April 17) have relatively higher deaths later (July 1) in the pandemic.


Introduction
Coronavirus disease 2019 , the pandemic that emerged in Wuhan, China in 2019 (World Health Organization, 2020), increased rapidly across the United States in early 2020, challenging the capacity for a coordinated response. In the absence of a vaccine, two strategies for containing the virus have been social distancing and widespread testing (Bedford et al., 2020;Maharaj and Kleczkowski, 2012). Widespread testing reduces selection bias in estimating the numbers of undocumented infections, a crucial variable in the dynamics of spread (Munster et al., 2020;Li et al., 2020). In early 2020, however, there were not enough COVID-19 testing data in the U.S. to predict infections and health care demand (Munster et al., 2020), given substantial heterogeneity in testing rates in both geographic and socioeconomic terms (Chowell and Mizumoto, 2020). Under-reporting of COVID-19 infections was likely substantial, perhaps by orders of magnitude in the U.S., both overall and at the county level (Bendavid et al., 2020;Lau et al., 2020).
During the urgent early phase of such a pandemic, decisions at the level of both individual behavior and public health response are not only crucial (e.g., Dehning et al., 2020), but "have to be made using the scarce data available" (Zhang and Qian, 2020). Rapid, approximate estimates of infection rates, using online data, are valuable in this phase (Bentley and Ormerod, 2010;McIver and Brownstein, 2014;Bancroft and Lee, 2014;Chunara et al., 2013). Here we use U.S. COVID-19 data, from early (17 April 2020) and later (1 July 2020) in the progression of the pandemic, to test a means of county-scale estimation of pandemic virus infections when testing data are still incomplete and heterogeneous. This then offers a means of identifying the most vulnerable counties that have not yet reported significant statistics.
At the scale of U.S. counties, we assume that the numbers of recorded deaths by a given day are the most complete data on the extent of the virus (Baud et al., 2020;Flaxman et al., 2020;Marchant et al., 2020). Even though there will naturally be a distribution of times between infection and death, for purposes of statistical prediction, we assume the number of infections will be proportional to the number of deaths a certain number of days afterward. With the clinical literature as a guide (Huang et al., 2020), we follow Diebner and Timmesfeld (2020) in using regressions to determine the most predictive number of days lag between cases and deaths. We expect the optimal lag, in terms of predictive value to be between one and two weeks. In two clinical studies of patients with confirmed COVID-19 cases in Wuhan, China, in early 2020, the median time from onset of symptoms to ICU admissionwas 10.5 days (n ¼ 41 patients) in one study (Huang et al., 2020) and 10 days (n ¼ 138 patients) in another study (Wilson et al., 2020). Comparing the Pearson r correlations between the logarithm of cumulative COVID-19 cases versus the logarithm of cumulative COVID-19 deaths in Germany at different time lags, Diebner and Timmesfeld (2020) find that 13 days was the optimal lag. The same 13-day optimal lag was found when comparing log-transformed new (daily) deaths versus cases (Diebner and Timmesfeld, 2020). Other studies determine or use a time delay of 13 days from illness onset to death (Linton et al., 2019;Wang et al., 2020;Yang et al., 2020), so while our regression estimate of ten days (below) delay for the U.S. seems reasonable, we find only limited effect on our regression results by using a 14day delay. Similarly, an epidemiological model found little impact between using lags of 13, 15 or 18 days (Flaxman et al., 2020).
Although COVID-19 mortality rates are age-dependent Verity et al., 2020), we use a generalized fatality rate in our regressions, as this may be the only variable available from early county-aggregated data, as well as include two age variables in our regressions (mean age and % over 65 years old).
In using county-scale death statistics, a source of statistical uncertainty is the relatively low numbers of deaths early in the pandemic. On April 17, 2020, for example, a large portion of U.S. counties were not reporting any deaths, whereas by 1 July 2020 many more counties were reporting nonzero deaths ( Fig. 1a and b). Since the early increase of infections was exponential (Fig. S2), as expected in compartmental models (Wu et al., 2020b;Kucharski et al., 2020), the level of under-reporting can outpace increases in testing rates. As the early numbers of COVID-19 cases in states can differ by orders of magnitude (e.g., New York vs. Great Plains states; Figs. 1a and S2), their differences in under-reporting could be substantial. This raises the concern as to which counties reporting few deaths and cases early in the pandemic might be most vulnerable to relatively higher cases and deaths as the pandemic spreads. To address this, we can make use of the rich co-variate data available at the U.S. county level to improve estimates of under-reporting. Although these county-level estimates are our main objective, the multivariate regression also identifies factors most predictive of COVID-19 deaths. As our objective is prediction, we note that the most predictive variables may or may not be the most causal. Given, for example, the importance of age, physical clustering and pre-existing conditions to COVID-19 Verity et al., 2020;Centers for Disease Control, 2020;Thompson, 2020;Lu et al., 2020;Yusef et al., 2020), certain county-level variables may have direct correlation; others will have indirect correlation. For example, a likely direct effect on the COVID-19 death to case ratio is the number of hospital ICU beds in the county (Schulte et al., 2020;King, 2020). An indirect effect may be median household income: high income counties may have more jobs that can be performed remotely (del Rio-Chanona et al., 2020). Some variables likely subsume the effects of other variables. Nonwhite populations may have higher rates of COVID-19 infections and/or deaths, for reasons that could include distrust in healthcare (Armstrong et al., 2008), reliance on public transportation (Anderson, 2016), exposure to greater air pollution (Ard and Bullock, 2020) or higher incidence of chronic health conditions Lighter et al., 2020).
For our 'early warning' estimates, we do not use data on protective behaviors such as social distancing and mask wearing. While these behaviors affect (reduce) the number of infections, they are delayed, dynamic responses to the number of cases (see Supplementary Fig. S1) and unlikely to be available early in the pandemic. In the case of COVID-19, for example, survey data on mask wearing were collected and reported in mid-July 2020 (Katz et al., 2020), months after an early warning system could be implemented.
In essence, we use the county-level death statistics to estimate how many people in the county were infected by COVID-19 on a given date. Even if the death counts were 100% accurate, however, due to all the counties with low numbers, we need to impute the underlying likelihood of death in those counties. The county-scale likelihood estimates provide non-zero values for predictive purposes on all counties, including sparsely-populated counties or counties that the epidemic has not yet fully reached. Comparing observations of cases and deaths from April and July 2020, we can test whether the method identified in April the most vulnerable counties to subsequent infections in July.
Due to the exponential growth in numbers plus the sparseness of count data from many counties, here we will be using a negative binomial regression with the Least Absolute Shrinkage and Selection Operator (LASSO) method to estimate deaths across the country at a given date (a new regression must be carried for each selected date). When these death estimates are divided by a generalized fatality rate, the result is an estimation of the true numbers of infected people in each county. Comparing the estimated numbers of infected people to the observed number of confirmed cases gives us a measure of case under-estimation in each county.

Methods
To improve estimations from heterogeneous COVID-19 death data at the U.S. county level, here we introduce thirty-one covariates (Table 1) into a negative-binomial regression. These covariates cover nine broad categories: clustering of population, voting behavior, health-care access, preexisting health conditions, age, ethnicity, links to COVID-19 hotspots, employment and education. Included among these covariates are three scalar variables to capture some of the international spread of COVID-19 from early hotspots (Callaway et al., 2020), specifically China, Italy, and Iran (Table 1). These fixed scalar effects for each U.S. county are derived from the Facebook social connectedness index (Bailey et al., 2018).
All variables in Table 1 represent fixed, county-level effects. While dynamic effects of social networks and inter-county travel were surely a factor in coronavirus spread (Maier and Brockmann, 2020), we do not employ such dynamic data here for two reasons. The first is that we aim for early estimation using fixed countylevel covariates that would be already available at the onset of a future pandemic. Secondly, there is not currently an established method for adding network covariates to the method we employ here, a LASSO Negative Binomial regression (Hays et al., 2010;Silk et al., 2017;Leifeld and Cranmer, 2019). We reserve the challenge of adding dynamic network data for future work.
Since case numbers grow exponentially, our regressions use the logarithm of county-level death counts on day t, logðD t ! Þ, where the vector D t ! contains entries for each of 3088 counties. Even when log-transformed, however, the count data are likely to be overdispersed and subsequent standard errors underestimated. For this reason, we use a negative binomial distribution of errors in the regressions, which allows us to use number of COVID-19 deaths as count data for predictions. To estimate the numbers of deaths, D t ! , in all counties on day t, the regression relationship is: whereb is the vector of weights for the covariates X (see Table 1).
The errors e follow a negative binomial distribution and have a variance for a given mean, m, of mð1 þ m=rÞ, where r is a dispersion parameter. In the regression, we use sandwich corrected estimates of standard errors (Luque-Fern et al., 2016).
Using thirty-one related covariates, while comprehensive, will likely result in both overfitting and colinearity in the Negative Binomial regression. To mitigate these risks, we employ a LASSO penalization in the loss function of the Negative Binomial regression. LASSO selects the most predictive variables by regularization, forcing many of the estimated effect sizes to zero; the most predictive covariates are those left with non-zero effects. Importantly, as LASSO is a method of dimension reduction that focuses on prediction, covariates that are set to zero may actually be causal in the real world. Conversely, highly predictive covariates may actually not be causal, they may have just subsumed the variance of many other truly causal covariates.
In the LASSO process, the choice of the regularization parameter, l, is important, as higher l results in fewer non-zero parameters. We optimize l using 2-fold cross-validation over a range of possible values. The LASSO regularization minimizes the following function, which plays-off the log likelihood, 'ðD t jbÞ versus the sum of the individual coefficients inb multiplied by l (Lehman and Archer, 2019): where in this case, i indexes the 3088 counties in the sample. For a given likelihood ', the higher l is, the smaller P i jb i j must be, and the fewer non-zero parameters are allowed. We apply this negative binomial regression to COVID-19 deaths data on both 17 April and 1 July, 2020. To maximize the number of non-zero predictors, we chose the largest l that did not markedly reduce the out-of-sample predictive power of the LASSO regression. This turned out to be l ¼ 0:1; higher l values reduced the cross-validated log likelihood of the LASSO regression (see Fig. S3). Next, as our estimate of the number actually infected in county i on date t, I i;t , we divide D i;t by the fatality rate, a. We then define the under-estimation in the county, U i;t , as the ratio of confirmed cases on record, C i;t , to I i;t . Measuring under-estimation as a ratio-the difference of the log-transformed values (logI i;t À logC i;t )-rather than a simple difference helps account for the highly skewed distribution of cases across counties. Using vector notation to represent all 3088 counties in the sample, we determine U t ! : Using data from the two dates allows us to determine whether U t ! determined for low-data counties in April is predictive of unexpectedly high numbers of COVID-19 deaths in July.

Results
For estimating COVID-19 infection rates from death rates, we first determined an optimum lag between COVID-19 infection and death statistics (log-transformed) in 3088 counties. We ran forty regressions of the logged number of new deaths on April 13 against the logged number of new cases, at lag times ranging between 1 and 40 days. The largest R 2 value occurred with a lag of t ¼ 10 (see Fig. 2), which is in line with findings from other studies (Backer et al., 2020;Huang et al., 2020;Russell et al., 2020;Wilson et al., 2020).
We confirm that our results are robust to variation in lag time by showing our under-reporting statistic still predicts more future Fig. 2. Left: COVID-19 cases on day t À t versus deaths on day t (both log scale) for all U.S. counties with more than 1,000,000 people, from t ¼ 1 to t ¼ 14 days' delay. All are comparing versus a case date of 04/13/2020. Right: Daily deaths versus daily cases in the U.S., with logarithmic y-axis.  (Supplementary Tables S3-S6), indicating that our under-reporting statistic is predictive of a higher number of future COVID-19 cases. We stress that the value of t is strictly a parameter for use in prediction, not a definitive statement about duration of infection for individuals. Indeed the parameter t-which we estimated for COVID-19 in the U.S. in 2020-would ideally be estimated independently for each case study. The lag would likely differ, for example, in a future pandemic and/or within health/hospital systems of the world (e.g., Wood, 2020).

Five predictive covariates
Of the thirty-one variables (Table 1) entered into the LASSO negative binomial regression, five were retained by the LASSO analysis at the optimal level of l ¼ 0:1 as most predictive of COVID-19 deaths by county (Table 2). Consistent with existing literature (discussed below), the five predictive variables are: population size, population density, public transport, voteshare, and % African-American population. Table 2 shows their regression coefficients via Eq. (1). These five variables explain a significant proportion of the variance in county death data: comparing predicted versus confirmed death counts (both logtransformed) yields R 2 ¼ 0:47 for 17 April, and R 2 ¼ 0:60 for 1 July (Table 2). Using these coefficients, Fig. 3 compares logI t vs logC t in all counties for two dates in 2020, April 17 and July 1. Using a similar LASSO regression (see Supplement), we confirm that deaths are a much better predictor of U t than the full 31 covariates of Table 1. By themselves, these fixed socioeconomic variables cannot predict which counties are likely to under-report COVID-19 cases.
Since LASSO is a factor-reduction technique, rather than a definitive statement of causality, we discuss first the predictive value of these five covariates as an 'early warning system' to identify US counties of particular concern. The predictions are mapped at county level in Fig. 4, which shows county-level predicted COVID-19 cases and deaths for April 17, and July 1, 2020. The maps fill in the data gaps across the less populated counties in the middle of the country, including the Great Plains (cf. Fig. 1). Similar to the raw numbers (Fig. 1), the predicted numbers are highest on the coasts, the northeast, Florida, and major urban areas. Fig. 5 shows county-level maps of the under-estimation measure, U t ! , for April 17 and July 1. These maps of U t ! have less obvious patterning than maps of observed infections or deaths (Figs. 1 and 4). Many of the highest values of U t ! in July are found in counties in the northern states (Maine, Idaho, Montana and Michigan), as well as parts of northern California and Oregon.
Using a fatality rate of 1%, our estimations of actual infections, I t ! , are one to two orders of magnitude larger than confirmed case numbers, C t ! , as shown in Fig. 6a, which shows both C t ! and I t ! on April 17 and July 1. This is broadly consistent with localized estimates, such as a California county where the COVID-19 infection rate was "50-85-fold more than the number of confirmed cases" in early April (Bendavid et al., 2020). In China, by comparison, 6 out of 7 COVID-19 infections (before 23 January) were potentially not reported . Fig. 6a shows that the gap between the confirmed cases and the predicted infections (red band) increases with number of deaths (i.e., population size). We test how well U t ! estimated for lowinformation counties on 17 April identifies those counties with high death counts by July. For low-data counties, our estimate of U t ! can offer an indicator of risk at the county level, as shown in Fig. 6b. Determining U t ! for 17 April, we predict which counties were at most risk for relatively high levels of COVID-19 by July 1. Fig. 6b shows that, for counties with little data in April 2020, a high underreporting estimate was an effective early warning signal of excess deaths by July 2020. Fig. 6b shows this for the 1171 counties with 0 (n ¼ 609), 1 (n ¼ 341) or 2 (n ¼ 221) confirmed cases in April. The fits show that counties with larger under report scores had higher COVID-19 deaths, ranging between just over 0 and 5 on average. This may seem like a small number, but with our assumed fatality rate of 1%, this represents an outbreak stretching into the hundreds in counties with fewer than 3 reported cases in April. There is also a substantial range on these predictions, with 13 out of the 1171 counties reporting 15 or more deaths, indicating an outbreak of more than 1500 cases.

Discussion and conclusion
In this prediction exercise, certain variables identified by LASSO explain more of the variation in the outcome than others. We should avoid assigning actual causation with the value these variables have in predicting true number of infections. Nevertheless, the five covariates in Table 2 are among the prominent risk factors listed by the CDC, and we may speculate on how they relate to the twenty-two variables not retained by the LASSO at the optimized l ¼ 0:1 value (see Supplement for results using other values of l).
Equally notable are the twenty-two variables from Table 1 not returned in the LASSO results (Table 2). This does not mean these variables are not important in the real world, but that for the purposes of predicting case numbers, the information in the five are sufficient to supersede, or encompass, the information residing in the other twenty-six. Notably, none of the four pre-existing health conditions-obesity, diabetes, hypertension or pollutionwas selected by LASSO as predictive of COVID-19 deaths or cases. Given the studies showing these to be genuine risk factors for individuals (Lighter et al., 2020), one of the five variables in Table 2 must be subsuming their predictive effects. The importance of African-American proportion of the population is consistent with the literature on socioeconomic correlates with COVID-19. In different ZIP codes of New York City, for example, Lieberman-Cribbin et al. (2020) found that, as the proportion of white residents increased, the number of COVID-19 tests increased and fraction of those testing positive decreased. Again, as LASSO is a factor-reduction method, we suspect that the proportion African American has such predictive significance that it subsumed the predictive effects of other covariates such as diabetes, hypertension, pollution and income. This may explain why income was not an important predictive variable in the LASSO results, despite the relative lack of testing resources in poorer U.S. counties (Schulte et al., 2020;van Dorn et al., 2020) and the overall disparity in COVID-19 testing efforts and resources attributable to Table 2 Coefficients for the five variables retained by the LASSO regression (l ¼ 0:1) for prediction of COVID-19 deaths and cases, on 17 April 2020 and 1 July 2020.   socioeconomic and racial disparities in healthcare access (Karaye and Horney, 2020;Lieberman-Cribbin et al., 2020). As COVID-19 mortality rates depend on an individual's age Verity et al., 2020), it may seem surprising to see both % over 65 years old and mean adult age not among the most important factors emerging from the LASSO results (Table 2). Connections with Italy, China or Iran were also not among the five most predictive factors, in April or in July. This despite evidence that the U.S. received confirmed "index" cases from Europe and from China (Spiteri et al., 2020).
Voteshare in the last election (% Democratic À % Republican) was an important variable in predicting deaths, both in April 17 and on July 1 ( Table 2). The use of two dates helps us rule out the effect of counties in New York state acting as outliers. In April 2020, New York state had the highest number of reported COVID-19 cases, and was also among the most Democratic-leaning. The voteshare effect remains, however, by July 1, when both cases and deaths were higher in many other parts of the country, including counties in the U.S. South, where voteshare is much different.
It could be that democratic voteshare captures aspects left unrepresented by the other crowding variables. Various surveys in the U.S. in 2020 (FiveThirtyEight, 2020) have shown Democratic voters to have higher levels of concern about COVID-19 than Republican voters. Given that Democratic voteshare predicts more deaths, this might be due to higher levels of liberal behavior among states from the West coast and North-East driven by differences in culture (Harrington and Gelfand, 2014). Democratic voteshare may act as a proxy for more openness, tolerance (Ruck et al., 2020) and looser norms (Harrington and Gelfand, 2014). By contrast, collectivist cultures may be better equipped to mitigate a pandemic through a tendency to obey authority (Kemmelmeier et al., 2003), conform with peers (Murray et al., 2011) and engage in less physical contact (Wu et al., 2019). Further consideration of this hypothesis would need to account for the demographics of strongly Democratic counties.
Overall, we find that the use of socioeconomic determinants allows-as a supplement to existing SIR models-for rapid, early warning estimations for vulnerable counties at times when county-scale reporting data is heterogeneous and incomplete. In using death counts with a ten-day time lag, we were able to predict which counties were under-reporting COVID-19 cases in April and validate these predictions against fatality rates in July.
In future pandemics this 'early warning system' could be used to identify vulnerable counties where disease outbreaks have not yet occurred. False positives will be produced but this may be improved by expanding our set of 31 covariates. While the results present a parsimonious set of socioeconomic risk factors for COVID-19 prevalence, additional covariate data will inevitably become available for early warning tools in future events. With further research, the methodology we have laid out here can be adapted to incorporate dynamic and/or network data, such as seasonality and ultraviolet light exposure (Carleton et al., 2020;Merow and Urban, 2020), inter-county migration or the Facebook connectedness Index. The incorporation of the more complex data into these early warning tools is a goal for future work.