Identification of county-level health factors associated with COVID-19 mortality in the United States

Many studies have investigated causes of COVID-19 and explored safety measures for preventing COVID-19 infections. Unfortunately, these studies fell short of addressing disparities in health status and resources among decentralized communities in the United States. In this study, we utilized an advanced modeling technique to examine complex associations of county-level health factors with COVID-19 mortality for all 3141 counties in the United States. Our results indicated that counties with more uninsured people, more housing problems, more urbanized areas, and longer commutes are more likely to have higher COVID-19 mortality. Based on nationwide population-based data, this study also echoed prior research that used local data, and confirmed that county-level sociodemographic factors, such as larger Black, Hispanic, and older subpopulations, are associated with a high risk of COVID-19 mortality. We hope that these findings will help set priorities for high-risk communities and subpopulations in the future fight against the novel virus.


Introduction
COVID-19 is a global pandemic. As of the date we began writing this manuscript (30 June, 2020), there were over 10.4 million total confirmed cases and 511 000 total deaths among 188 affected countries/regions; the United States (US) alone had reached 2.6 million total confirmed cases and 127 000 total deaths [1]. These numbers are continuing to grow. This public health crisis has produced disproportionate COVID-19 health outcomes across racial/ethnic groups with unequal socioeconomic status at different geographical locations in the US.
According to the County Health Rankings Model [2], the contributing health factors associated with inequalities in COVID-19 outcomes are multidimensional: health behaviors, clinical care, socioeconomics, and physical environment (Fig. 1). These contributing health factors can also exist at multiple levels within the US: individuals, communities/counties, and states. Many studies have investigated individual demographic and clinical factors of COVID-19 outcomes since the inception of COVID-19 outbreaks in January 2020 in the US. For example, a review of 14 studies by Tian and colleagues found that individuals with older age, male sex, and certain co-morbidities (e.g., hypertension, diabetes, and other chronic diseases) were at a high risk of dying from COVID-19 [3].
The aggregation of these contributing factors at higher levels (e.g., community/county-level and state-level) is also significantly associated with COVID-19 outcomes. For instance, community-level social determinants, such as severe overcrowding, lower educational status, higher unemployment rates, less access to healthcare, and more chronic diseases, were associated with a higher rate of COVID-19 infection among districts in New York City [4] and more COVID-19 deaths among counties in Colorado [5]. Cyrus and colleagues [6] and Yancy [7] also found that communities and counties with more underrepresented minorities, especially African Americans, had a higher prevalence of COVID-19. These findings were confirmed by many other county-level studies on the same topics [8][9][10]. At the next level up, the state level, the conclusion is the same. For example, Oronce and colleagues found that state-level income inequality was positively related to COVID-19 cases and mortality [11]. After adjusting for the racial composition of 322 counties, state-level racial/ethnic disparities still resulted in COVID-19 mortality that was 80% higher among African Americans than among Whites [12].
These consistent findings from studies at all levels, especially at the county- and state-levels, are not unexpected. COVID-19 responses and decision making in the US remain decentralized [10]. As shown in the conceptual model (Fig. 1), variations in health policies and programs at the county- and state-levels may influence health factors and, in turn, general health at those levels, which will ultimately affect populations' risks and capability to respond to COVID-19. As a matter of fact, local governments and their demographic characteristics, especially at the county-level, had been widely examined before the COVID-19 outbreaks for other health-related outcomes, such as infectious disease [13], vaccines [14], obesity [15], opioids [16], and hospital performance [17].
Because COVID-19 is a novel virus, contextual factors beyond demographic and socioeconomic ones, such as health behaviors, clinical care, and physical environment (Fig. 1), may provide new insight into, or a better understanding of, the places and people that could be most affected by the COVID-19 pandemic, and so inform policy makers' actions to improve community conditions [2]. Taking clinical care as an example, access to care (e.g., ventilator availability) would make a difference in COVID-19 outcomes [18]. Because treatments and vaccines for COVID-19 will not be ready or available for a certain period of time [19], exploring other means of preventing and controlling the spread of COVID-19 among communities becomes ever more important. Therefore, a comprehensive examination of various county-level health factors associated with COVID-19 outcomes is warranted and imperative.
Although some prior studies investigated the associations between county- or state-level characteristics and COVID-19 outcomes, they either

Fig. 1 A conceptual model adopted from the County Health Rankings Model [2]. The relationships indicated by dotted lines were not examined due to the scope of this study.

Data source
The data for this study included measures of county-level health factors and general health as well as demographic characteristics at both the county- and state-levels. Those health factors, general health measures, and demographic characteristics were extracted from the County Health Rankings [2]. In addition, to measure the rurality of counties, a key sociodemographic variable, more accurately, we supplemented the proportion of rural areas in the County Health Rankings with the Rural-Urban Continuum Codes from the US Department of Agriculture Economic Research Service [20].
The study data also included the total numbers of confirmed COVID-19 cases and deaths from the onset date to 31 May, 2020 for each county. These COVID-19 outcomes were computed from the New York Times' COVID-19 Data on GitHub [21] and cross-validated against Johns Hopkins University's COVID-19 Case Tracker [1] and the Centers for Disease Control and Prevention's COVID-19 Cases in the US [22], as well as Google searches where necessary. In late May 2020, some states had just started lifting their stay-at-home orders and reopening their economies in phases. Thus, our data reveal a general picture of the COVID-19 pandemic before the US economy was reopened.

COVID-19 outcomes
The COVID-19 outcomes were the total numbers of confirmed COVID-19 cases and deaths from the onset date to 31 May, 2020 for each county. Because the total number of COVID-19 cases tends to be underestimated for various reasons, such as limited availability of COVID-19 testing and asymptomatic cases, COVID-19 mortality, defined as the total number of deaths from COVID-19 per 100 000 population, was analyzed as the outcome in the study.
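As a minimal illustration of this outcome definition, the sketch below computes deaths per 100 000 population. The function name and the example figures are hypothetical and are not taken from the study data.

```python
def mortality_per_100k(deaths: int, population: int) -> float:
    """Return deaths per 100 000 population, following the outcome definition above."""
    if population <= 0:
        raise ValueError("population must be positive")
    if deaths < 0:
        raise ValueError("deaths cannot be negative")
    return deaths / population * 100_000


# Hypothetical county: 25 deaths among 200 000 residents.
example_rate = mortality_per_100k(25, 200_000)
print(f"{example_rate:.1f} deaths per 100 000 population")
```

Scaling by population in this way is what allows a county of a few hundred residents to be compared with one of several million.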

County- and state-level demographics
In this study, the demographic variables at the county- and state-levels were the same. They were age composition (proportion below 18 years of age and proportion 65 and older), gender (proportion female), racial/ethnic composition (proportions of non-Hispanic White, non-Hispanic Black, Hispanic, Asian, American Indian & Alaska Native, and Native Hawaiian/Other Pacific Islander; those six proportions add up to 100%), and an index of rurality (proportion rural).
In addition, county-level demographics were supplemented with the Rural-Urban Continuum Codes [20], which capture not only population size but also proximity to large cities. The higher the rural-urban continuum code, the more rural the county.

County-level health factors and general health
As indicated in the conceptual model (Fig. 1), the county-level health factors considered in this study were health behaviors, including tobacco use (1 item), diet and exercise (4 items), alcohol and drug use (2 items), and sexual activity (2 items); clinical care, including access to care (3 items) and quality of care (3 items); socioeconomic factors, including education (2 items), employment (1 item), income (2 items), family and social support (2 items), and community safety (2 items); and physical environment, including air and water quality (2 items), and housing and transit (3 items). The general health measures analyzed in the study were length of life (4 items) and quality of life (4 items). See the County Health Rankings [2] for more information about the item-level measures.

Statistical analysis
Descriptive statistics were first used to summarize the demographic characteristics of all 3141 counties across all 50 states plus Washington, DC in the US. Then, we utilized hierarchical generalized linear models (HGLM) [23] to identify county-level health factors associated with COVID-19 mortality for the 3141 counties, accounting for county-level general health and both county- and state-level demographic characteristics.
The use of HGLM was based on two considerations. First, the data had a nested structure (i.e., counties were nested within states), and thus observations on counties were unlikely to be independent. For data of this kind, hierarchical linear models (HLM) [23] (also known as multilevel or mixed models) are appropriate. Second, the outcome variable in this study, COVID-19 mortality, was directly derived from a count variable, the number of deaths, which takes non-negative integer values. A Poisson distribution is appropriate for analyzing this type of outcome variable. Therefore, HGLM with a Poisson distribution was chosen for this study.
It is worth noting that COVID-19 mortality also exhibited more variation than a regular Poisson distribution, whose variance equals its mean, could capture. To address this issue, an extra parameter, overdispersion, needed to be included to accommodate the inflated variability. Therefore, HGLM with a Poisson distribution and overdispersion was finally chosen to analyze the data. For estimation, the default option, penalized restricted quasi-likelihood, was selected in the HLM/HGLM software [24].
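For readers who prefer a formal statement, one common way to write a two-level Poisson-type model with overdispersion is sketched below. This is an assumed random-intercept specification in conventional HGLM notation, not a specification reported in this manuscript. The symbols $\lambda_{ij}$, $\beta$, $\gamma$, $X$, $W$, and $u_{0j}$ are introduced here for illustration only; $\sigma^2$ and $\tau$ play the roles of the overdispersion parameter and the between-state variance described in the text.

```latex
% County i nested in state j; Y_{ij} is the COVID-19 mortality outcome.
% X_{pij}: county-level predictors; W_{qj}: state-level predictors (illustrative).
\begin{align}
  \mathrm{E}(Y_{ij} \mid \lambda_{ij}) &= \lambda_{ij}, &
  \operatorname{Var}(Y_{ij} \mid \lambda_{ij}) &= \sigma^{2}\lambda_{ij}, \\
  \log \lambda_{ij} &= \beta_{0j} + \sum_{p} \beta_{p} X_{pij}, \\
  \beta_{0j} &= \gamma_{00} + \sum_{q} \gamma_{0q} W_{qj} + u_{0j}, &
  u_{0j} &\sim N(0, \tau).
\end{align}
```

Setting $\sigma^{2} = 1$ recovers the regular Poisson case in which the variance equals the mean, so values of $\sigma^{2}$ well above 1 signal overdispersion.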

County- and state-level demographic characteristics
The descriptive statistics of demographic characteristics at the county- and state-levels are provided in Table 1. Table 1 shows that the average population for counties was 104 159, ranging from as few as 88 people to more than 10 million. At the state-level, the average population was more than 6.4 million, ranging from about 0.6 million to almost 40 million. Of those populations, on average, a typical county had a larger younger (≤18 years) population (22%) than older (≥65 years) population (19%), as did a typical state (22% vs. 16%); both counties and states had an equal gender distribution. In terms of racial/ethnic composition, non-Hispanic White was the majority group for both counties (76%) and states (68%), followed by Hispanic (10% for counties and 12% for states) and then non-Hispanic Black (9% and 11%, respectively). Other racial/ethnic groups (i.e., Asian, American Indian & Alaska Native, and Native Hawaiian/Other Pacific Islander) made up the rest (<5% for counties and <8% for states). It is worth noting that in the HGLM analysis, we included the proportions of non-Hispanic Black and of Hispanic as the groups of interest and the combined remaining four groups as the reference category, the majority (~90%) of which was non-Hispanic White.
As for rurality, a typical county had more than half of its area (59%) as rural, whereas a typical state had only a little more than a quarter of its area (26%) as rural. The average rural-urban continuum code was 4.01 for counties, which corresponds to the category "nonmetro counties with an urban population of 20 000 or more, adjacent to a metro area". The rural-urban continuum codes were not available at the state-level.
Table 1 also displays the summary statistics of COVID-19 outcomes. As of 31 May, 2020, among all 3141 counties, the total number of COVID-19 deaths was 104 183, and the average number of COVID-19 deaths was 33.17, ranging from 0 to 6732; for all 50 states plus Washington, DC, the average number of COVID-19 deaths was 2043.75, ranging from 8 to 29 699. To make the numbers of COVID-19 deaths more comparable across counties and across states, which vary widely in population size, a standardized measure, COVID-19 mortality, defined as the number of COVID-19 deaths per 100 000 population, was computed. Among the counties, the average COVID-19 mortality was 12.85 per 100 000 population, while the average COVID-19 mortality for states was more than double that, at 26.20 per 100 000 population. Fig. 2 also shows that COVID-19 mortality was widely distributed, with several areas of high mortality in various regions across the country.

Associations between county-level demographics and COVID-19 mortality
To examine the raw associations between county-level demographics and COVID-19 mortality, we first established an HGLM base model that included demographic variables only. The base model set the stage for later identifying health factors uniquely associated with COVID-19 mortality. The results of the base model are presented in Table 2. As expected, the overdispersion parameter estimate (i.e., county-level residual variance σ²) was 18.90, meaning that the estimated within-state variance of COVID-19 mortality given the demographics was 18.90 times larger than the variance expected from a regular Poisson model, which justified the use of the overdispersion parameter. Further, the between-state variance (i.e., state-level residual variance τ) was 0.75, which was statistically significant (P<0.001), meaning that COVID-19 mortality varied significantly across states.
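To make the interpretation of the overdispersion parameter concrete, the sketch below computes a simple moment-based dispersion estimate: the Pearson chi-square statistic divided by the residual degrees of freedom. This is a generic check under assumed inputs, not the quasi-likelihood estimate produced by the HLM/HGLM software; the function name and the counts are hypothetical.

```python
def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square / residual df; values near 1 are consistent with Poisson."""
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    if any(mu <= 0 for mu in fitted):
        raise ValueError("fitted means must be positive")
    resid_df = len(observed) - n_params
    if resid_df <= 0:
        raise ValueError("need more observations than parameters")
    # Under a Poisson model each squared residual is scaled by the fitted mean,
    # because the variance is assumed to equal the mean.
    chi_square = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi_square / resid_df


# Hypothetical death counts and model-fitted means for six counties.
observed_deaths = [0, 3, 40, 2, 15, 0]
fitted_means = [5.0, 5.0, 10.0, 5.0, 10.0, 5.0]
phi = pearson_dispersion(observed_deaths, fitted_means, n_params=1)
print(f"estimated dispersion: {phi:.2f}")
```

An estimate far above 1, as in this toy example, indicates that the data vary more than a regular Poisson model allows, which is the situation the reported σ² of 18.90 describes.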
To compare the substantive importance of the demographic variables, the estimated regression coefficients were converted to half-standardized event rate ratios (ERR) [25]. The half-standardized ERR is a measure of effect size, and its interpretation is similar to that of an odds ratio. That is, if a half-standardized ERR is greater than 1, the expected death rate is higher in the group of interest than in the reference group; otherwise, the expected death rate is equal to or lower than that in the reference group. The absolute percent (%) change was then ranked in the last column of Table 2 to indicate the substantive importance of each demographic variable.
The rank ordering by absolute % change indicated that the proportion of non-Hispanic Black had the highest rank (#1) with a half-standardized ERR of 1.35. This means that for two counties that differed by one standard deviation in the proportion of non-Hispanic Black (SD=0.14, Table 1), on average, the county with the higher proportion of non-Hispanic Black had 35.0% more deaths than the county with the lower proportion, controlling for the other demographic variables in the model. An example in the reverse direction is that if a county's proportion of rural went up by one standard deviation (SD=0.31, Table 1), the COVID-19 mortality of that county would go down by 31.9% on average. It is also interesting to see that race/ethnicity, measured as the proportions of non-Hispanic Black and Hispanic, occupied the top two ranks, indicating that racial/ethnic minority composition was the strongest risk factor for COVID-19 mortality. On the other hand, rurality, measured as the proportion of rural areas, worked as a protective factor.
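The arithmetic behind these interpretations can be sketched as follows. The sketch assumes the usual half-standardization, in which only the predictor is standardized so that ERR = exp(β × SD); the manuscript cites [25] for the definition but does not print the formula, so this form and the function names are assumptions. The coefficient is backed out from the reported ERR and SD rather than taken from Table 2.

```python
import math


def half_standardized_err(beta: float, sd_x: float) -> float:
    """Event rate ratio for a one-SD increase in the predictor (assumed form)."""
    return math.exp(beta * sd_x)


def percent_change(err: float) -> float:
    """Percent change in the expected event rate implied by an ERR."""
    return (err - 1.0) * 100.0


# Reported in the text: ERR = 1.35 for proportion non-Hispanic Black, SD = 0.14.
implied_beta = math.log(1.35) / 0.14
err_black = half_standardized_err(implied_beta, 0.14)
print(f"implied raw coefficient: {implied_beta:.2f}")
print(f"percent change: {percent_change(err_black):.1f}%")

# Reported in the text: a 31.9% decrease per SD of proportion rural,
# which corresponds to an ERR below 1.
err_rural = 1.0 - 31.9 / 100.0
print(f"ERR for proportion rural: {err_rural:.3f}")
```

Because the ERR is expressed per standard deviation of each predictor, effect sizes become comparable across variables measured on different scales, which is what makes the ranking meaningful.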

Identifying county-level health factors associated with COVID-19 mortality
To make a fair comparison of the mean event rate of COVID-19 mortality, we controlled for demographic variables and general health by adding them to the HGLM model as covariates when identifying county-level health factors uniquely associated with COVID-19 mortality. To avoid potential model overfitting due to the large number of item-level health factors and general health measures (77 items in total) in the County Health Rankings [2], we took a two-step approach to investigating how each health factor was uniquely related to COVID-19 mortality. The first step was a preliminary round to select potentially highly influential factors. To do that, we added one item-level factor at a time to the HGLM Poisson regression in addition to the demographic variables mentioned earlier. Using the criterion of half-standardized ERR≥0.30, which is considered to be a substantively moderate effect size [26], we initially selected 12 item-level health factors and general health measures from the 77 items. The selected items are listed in Supplementary Table 1 (available online).
Furthermore, some items were highly correlated, such as food environment index and limited access to healthy foods within the factor category of Diet & Exercise (r=−0.77), and proportions of uninsured, uninsured adults, and uninsured children in the category of Access to Care (all rs>0.73). For these, we created a composite score within each category by computing a z-score for each item, summing the z-scores, and then re-standardizing the sum into a single overall z-score representing the category, in order to increase the interpretability of the results. For the other items, we also created a z-score for each item so that we could compare the relative influence of each factor on COVID-19 mortality.
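The composite-score procedure described above (z-score each item, sum, re-standardize) can be sketched as follows. The data values, function names, and the optional `signs` argument are hypothetical. In particular, the manuscript does not state whether negatively correlated items were reverse-coded before summing, so `signs` is included only as an assumed way to align item directions.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardize a list of numbers to mean 0 and sample SD 1."""
    center, spread = mean(values), stdev(values)
    if spread == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - center) / spread for v in values]


def composite_z(items, signs=None):
    """Z-score each item, sum across items per county, then re-standardize.

    items: list of equal-length lists, one list per item (columns of a table).
    signs: optional list of +1/-1 used to flip items that run in the
           opposite direction (an assumption, not stated in the manuscript).
    """
    signs = signs if signs is not None else [1] * len(items)
    if len(signs) != len(items):
        raise ValueError("signs must have one entry per item")
    z_items = [[sign * z for z in zscores(col)] for col, sign in zip(items, signs)]
    summed = [sum(per_county) for per_county in zip(*z_items)]
    return zscores(summed)


# Hypothetical percentages for four counties.
uninsured_all = [8.0, 12.0, 15.0, 20.0]
uninsured_adults = [10.0, 14.0, 18.0, 25.0]
uninsured_children = [4.0, 5.0, 7.0, 9.0]
access_to_care_z = composite_z([uninsured_all, uninsured_adults, uninsured_children])
print([round(z, 2) for z in access_to_care_z])
```

Re-standardizing the sum puts the composite on the same one-SD scale as the single-item z-scores, so its coefficient can be compared directly with theirs.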
At the second step of analysis, the outcome, COVID-19 mortality, was regressed on these nine z-scores in addition to the seven demographic variables described in the previous section using HGLM Poisson regression with overdispersion allowed. The results of this analysis are displayed in Table 3. Table 3 shows that long commute driving alone, severe housing problems, and juvenile arrests rate were statistically significant health factors of COVID-19 mortality (P<0.05), and the rates of suicide and uninsured were marginally statistically significant (P<0.10). However, in terms of effect size measured by ERR, suicide rate was most influential (−28.1%), followed by uninsured rate (20.7%), long commute driving alone (19.9%), juvenile arrests rate (−15.0%), severe housing problems (13.6%), access to healthy food (11.2%), and social associations number rate (−1.5%). It should be noted that the two general health items (child mortality number rate and average life expectancy) were not statistically significant and, as control covariates, need not be interpreted.
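The effect-size ordering above can be reproduced from the reported percent changes by sorting on absolute value, as the brief sketch below shows. The percent changes are those quoted in the text; the dictionary keys are shortened labels chosen for illustration, and the conversion ERR = 1 + %change/100 simply inverts the percent-change definition.

```python
# Percent change in expected COVID-19 mortality per one-SD increase, as quoted above.
percent_changes = {
    "suicide rate": -28.1,
    "uninsured rate": 20.7,
    "long commute driving alone": 19.9,
    "juvenile arrests rate": -15.0,
    "severe housing problems": 13.6,
    "access to healthy food": 11.2,
    "social associations rate": -1.5,
}

# Rank by magnitude so that protective (negative) and risk (positive)
# factors are compared on the same footing.
ranked = sorted(percent_changes.items(), key=lambda item: abs(item[1]), reverse=True)

for rank, (factor, pct) in enumerate(ranked, start=1):
    err = 1.0 + pct / 100.0
    print(f"{rank}. {factor}: {pct:+.1f}% (ERR = {err:.3f})")
```

Ranking by magnitude rather than by P value explains why suicide rate tops the list even though it was only marginally statistically significant.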

Discussion
Since the first confirmed case of COVID-19 in the US was reported on 20 January, 2020 in Snohomish County, Washington [27], public health research on causes of COVID-19 and safety measures for preventing COVID-19 infections, as well as biomedical and pharmacological research on COVID-19 vaccines and treatments, has exploded, with about 2000 publications a week [28][29]. The unit of analysis in most public health research was states or the country as a whole, which failed to address the disparities in health resources and services among counties and the problems of decentralized health administration in the US. This study investigated county-level health factors, frequently used in community public health research, that may impact or be associated with COVID-19 cases and deaths so that the results can inform local governments in making more targeted health policies to prevent and reduce future pandemics.
This study first confirmed prior research findings that most of the key county-level demographics are associated with COVID-19 mortality, using the most recent available information about all counties in the US. Specifically, counties with a larger Black population, a larger Hispanic population, and a larger older (≥65) population can have significantly higher COVID-19 mortality, whereas counties with more rural areas are significantly less likely to have higher COVID-19 mortality. Although these demographic characteristics are non-modifiable, local health officials and community health workers might need targeted health prevention programs for those specific high-risk subpopulations and/or those living in urbanized areas to reduce mortality from future outbreaks of novel viruses. After controlling for these non-modifiable demographics as well as general health, this study further identified some county-level health factors that are significantly associated with COVID-19 mortality. For example, counties with a higher rate of uninsured population, more housing problems such as overcrowding, and longer commutes driving alone are more likely to have significantly higher COVID-19 mortality, whereas counties with higher rates of suicides and juvenile arrests may have lower COVID-19 mortality. These health factors are modifiable; thus, more effective policies on health insurance and housing are needed to reduce future mortality from novel viruses. Because of the comprehensive nature of the data, these results can be treated as hard evidence to support claims that have been frequently mentioned in the mass media.
It is interesting to find that the relationship between long commute driving alone and COVID-19 mortality is significantly positive. The finding seems counterintuitive but is not surprising, because the farther people commute by vehicle, the higher their blood pressure and body mass index and the less physical activity they tend to participate in [30]. This may result in a higher likelihood of obesity [31] and poorer mental health [32][33]. Also, commutes by private motorized vehicle are associated with higher health risks such as cardiovascular diseases [34]. All these health risks may eventually lead to higher COVID-19 mortality.
The findings that the rates of suicides and juvenile arrests have a significantly negative association with COVID-19 mortality also seem counterintuitive. One possible explanation is that deaths from suicide, especially among those who had high health risks, would contribute to a possible undercount of mortality from COVID-19. As for juvenile arrests, those juveniles would have less chance of contracting COVID-19 due to confinement. In addition, prior research found that the proportion of non-Hispanic White in the community is significantly associated with a lower rate of juvenile arrests [35]. On the other hand, the present study found that, compared with the reference group, which was 90% non-Hispanic White, counties with a higher proportion of non-Hispanic Black had higher COVID-19 mortality. Thus, it is possible that juvenile arrests appear to be negatively correlated with COVID-19 mortality after controlling for the racial/ethnic composition as we did in the analysis. Yet, these are only possible explanations. Because COVID-19 is novel, there is no established theoretical model or protocol to follow; thus, these counterintuitive findings are subject to future investigation.
Because there is no established theory to follow for studying this novel COVID-19, this study was meant to be exploratory, and the findings are preliminary and open to discussion. Nevertheless, to our knowledge, this study was the first to examine such complex associations of a comprehensive set of health factors with COVID-19 outcomes for all US counties, controlling for variations across all US states, in a single modeling technique of HGLM Poisson regression with overdispersion. Such an advanced technique allows county- and state-level effects on COVID-19 outcomes to be disentangled.
This study also has a few limitations that future research should address. For example, the COVID-19 outcomes might be underestimated due to undertesting and asymptomatic cases. Also, some time-varying events, such as availability of ventilators, state-to-state travel restrictions, social distancing practice, face mask requirements, stay-at-home orders, economic reopening, school reopening, subsequent spikes, and so on, would be desirable factors, along with state-level characteristics, to measure and integrate into the analysis in future research.
The study data were collected up to 31 May, 2020, and the results reflected the state of the country before it began gradually reopening its economy in late May 2020. Unfortunately, after the country reopened its economy, the number of COVID-19 cases skyrocketed. It is shocking to see that, as of the date we finished writing this manuscript (31 July, 2020), the number of COVID-19 cases had increased from 2.6 million, only a month earlier, to more than 4.5 million, with 25 000 more deaths nationally. Such an alarming increase makes public health research on COVID-19 prevention, along with biomedical and pharmacological research on COVID-19 vaccines and drugs, ever more important. It is hoped that the findings of this study can contribute to more informed future research and help local health administrators create more targeted health policies and programs, so that people in the communities will benefit from the collective efforts of all sectors to fight the COVID-19 pandemic as well as potential future virus outbreaks.