Dimensions of uncertainty: a spatiotemporal review of five COVID-19 datasets

ABSTRACT COVID-19 surveillance across the United States is essential to tracking and mitigating the pandemic, but data representing cases and deaths may be impacted by attribute, spatial, and temporal uncertainties. COVID-19 case and death data are essential to understanding the pandemic and serve as key inputs for prediction models that inform policy decisions; consistent information across datasets is critical to ensuring coherent findings. We implement an exploratory data analytic approach to characterize, synthesize, and visualize spatial-temporal dimensions of uncertainty across commonly used datasets for case and death metrics (Johns Hopkins University, the New York Times, USAFacts, and 1Point3Acres). We scrutinize data consistency to assess where and when disagreements occur, potentially indicating underlying uncertainty. We observe differences in cumulative case and death rates to highlight discrepancies and identify spatial patterns. Data are assessed using pairwise agreement (Cohen’s kappa) and agreement across all datasets (Fleiss’ kappa) to summarize changes over time. Findings suggest the highest agreement among the CDC, JHU, and NYT datasets. We find nine discrete type-components of information uncertainty for COVID-19 datasets reflecting various complex processes. Understanding processes and indicators of uncertainty in COVID-19 data reporting is especially relevant to public health professionals and policymakers seeking to accurately understand and communicate information about the pandemic.


Introduction
As Boggs (1949) wrote in his call for an "Atlas of Ignorance": "[i]n using well-made maps there is often a delusion of adequacy and completeness." Uncertainty in geographic information science is a complex topic, as it impacts both spatial information and the production of knowledge (Couclelis, 2003) as well as geographic definitions, the explanation of geographic phenomena, the complexity of spatial systems, geosimulation, the representation of spatial knowledge, subjectivity in spatial phenomena, and planning (Fusco et al., 2017). Multiple frameworks have emerged to frame or categorize dimensions of uncertainty in this respect (Buttenfield, 1993; Gahegan & Ehlers, 2000; Thomson et al., 2005) and highlight key challenges in visualizing uncertainty for decision making and analysis (see review by MacEachren, 2015; MacEachren et al., 2005). While much focus has been placed on the technical aspects of uncertainty (i.e., data quality or validity; visual techniques of communicating uncertainty), scholars have also emphasized that the unknown and uncertain aspects of data-generating processes are central to forging our knowledge of any given phenomenon. With the COVID-19 pandemic of 2020, uncertainty in information and knowledge proved critical not just to understanding the transmission processes of this novel virus, but also to the public health management and epidemiological evaluation of COVID-19 at multiple scales. Spatial data were central to these efforts, fueling thousands of decision-making dashboards and modeling efforts. The existing research on COVID-19 is a conglomeration of different pandemic models and maps that rely on different surveillance data as well as different methodological assumptions.
From the start of the pandemic through December 2020, the Centers for Disease Control and Prevention (CDC) only provided state-level data for the number of COVID-19 cases and associated deaths across the United States (CDC, 2020a). Because most states in the US are geographically vast with wide population variations and nuanced epidemiological landscapes, finer resolution data are essential. After March 2020, datasets containing national county-level COVID-19 case and death counts aggregated from jurisdictional public health authorities and media reports emerged from multiple third party organizations including 1Point3Acres, the New York Times, Johns Hopkins University, and USAFacts. In this context, consistently reporting more cases and more deaths compared with other three since March 2020, and the reasons behind this requires further investigations. Wang et al. (2020) compared the same four major datasets and observed a 7-day cyclical pattern at the state level in both case and death reporting. This finding was integrated into the anomaly detection and repair package the group developed. Yet these approaches do not make the uncertainty of COVID-19 data explicit, nor do they take into account spatial or temporal aspects of the data as a fundamental aspect of uncertainty, not only in COVID-19 spatial information but also in our understanding of the data generating processes that produce it.
When considering how uncertainty could be addressed, Tobler's First Law of Geography (that near things are more similar than distal things) may be applicable and useful as a systematic, fundamental approach. Errors tend to have strong, positive autocorrelation in reality, and the discrepancies we observe tend to highlight state boundaries and result from statewide reporting practice differences. Thus, when considering multiple COVID-19 datasets, it is critical to examine not just similarity between datasets as a whole, but also where and when the similarities or differences persist. The datasets involved share aspects of provenance: county level health departments are critical data providers, but errors are shared across datasets, and statewide discrepancies related to testing and reporting practices persist. Thus, the uncertainty errors for COVID-19 data reflect both spatial and temporal autocorrelation attributes, and the datasets observed interrelate in their propagation of these errors as datasets reference each other in an effort to construct more comprehensive data. MacEachren et al. (2005) highlight these dimensions as the attribute, positional, and temporal categories of uncertainty, with most uncertainty research only considering the first and occasionally second components (Kinkeldey et al., 2014). Our research aims to operationalize these dimensions and identify spatiotemporal uncertainty in the included COVID-19 data. We define uncertainty as a complex concept with attribute, spatial, and temporal dimensions that generate a lack of certainty over information. Rather than determine data accuracy, which is not possible with current publicly available COVID data, we are most interested in discrepancy and disagreement between study datasets. We offer concrete methods for researchers and analysts to identify dimensions of uncertainty across a range of possible applications, and propose solutions to better contextualize COVID data in light of this uncertainty.

Methods
To disentangle some aspects of uncertainty in the spatial information and data generating processes producing spatial data, we take an Exploratory Spatial Data Analysis approach to investigate attribute, spatial, and temporal discrepancies among county-level datasets of COVID-19. To isolate the varied dimensions of uncertainty, we begin by assessing spatial variation in cumulative case and death rate differences across dataset pairs. Next, we explore the county-level agreement of datasets through the course of the pandemic by utilizing Cohen's kappa and Fleiss' kappa classifier agreement statistics. Finally, we plot the changes in dataset agreement over time to highlight which conditions of the pandemic may have created data conditions for higher or lower agreement. We conclude by providing recommendations for researchers and public health officials on best practices in interpretation of COVID-19 related research.
Our study evaluates the agreement and differences between five major COVID-19 case and death datasets: CDC (2020b); USAFacts (USAF) (US COVID-19 Cases and Deaths by State, 2021); 1Point3Acres (1P3A) (1Point3Acres COVID 19 Data, 2021); New York Times (NYT) ("New York Times COVID 19 Data," 2021); and Johns Hopkins University (JHU) (Dong et al., 2020). These datasets are widely utilized across academic, professional, and technology practices. JHU's Lancet article, the official citation for the dataset, has over 4500 citations at time of writing (Google, 2021), and numerous geospatial and temporal analyses utilize these datasets in the US context (Desjardins et al., 2020; Hohl et al., 2020; Paul et al., 2020) and globally (Martines et al., 2021). Google Maps includes NYT, JHU, and additional datasets as an available map overlay (Navigate Safely with New COVID Data in Google Maps, 2020).
While no final, canonical dataset has been established for data regarding the COVID-19 pandemic in the United States, we assume the CDC data to be reasonably authoritative. When the COVID Tracking Project, a popular aggregator, ceased new data collection in March of 2021, it directed users to the CDC's resources for case and death data (Gilmour, 2021), further suggesting the CDC's data resources are becoming a tentative standard. We use CDC data as the reference dataset against which the remaining "third party" datasets will be compared. USAFacts, NYT, and JHU data were obtained from open source and open data repositories, CDC data were queried from the public CDC Covid Data Tracker website data endpoint API, and 1P3A data were provided through a closed-access API. Data access URLs are available in Appendix A.
Each of these datasets is reported at both state and county level. The included datasets have certain geography modifications: NYT, JHU, and 1P3A data aggregate New York City area counties (e.g., Kings, Queens, etc.) as a single area; 1P3A and JHU data in Utah are reported with non-county geographies; Puerto Rico and outlying territories are not consistently included; and certain areas of Alaska are not included in NYT data. All datasets have been modified to aggregate NYC counties, and only geographies present in all datasets are included in this study (n = 3105); Puerto Rico, Utah, and certain areas of Alaska have been excluded for reporting consistency. Data from the CDC, NYT, JHU, and USAFacts start time-series reporting on January 22, 2020, and 1P3A reports time-series starting on January 21, 2020. At time of writing, all datasets are updated daily with new data. Very early data present issues of missing data (e.g., many counties possess null values, days are missing from time-series, etc.) and are highly sensitive due to low counts of observed cases and deaths (a small numbers problem).

Exploratory data analysis
An initial exploratory data analysis was conducted on the available data for four pairwise comparisons between CDC and each of JHU, NYT, USAFacts, and 1P3A data. For this preliminary analysis, we explored the percentage differences between daily 7-day rolling averages of new cases between third party datasets and CDC data as a baseline to identify potential patterns. The percentage difference is calculated as:
$$\Omega = \frac{D' - D}{D} \times 100$$
where Ω is the percent difference between the CDC and a third party rolling average for a particular day in a particular county, D is the CDC nationwide daily 7-day rolling average, and D' is the third party nationwide daily 7-day rolling average. Figure 1 compares the percent differences for available data.
Beyond initial data scarcity through mid-March 2020, dataset differences appear to arrive in surges and present larger differences starting around November 2020. 1P3A and USAFacts appear to contain more frequent positive and negative surges, although NYT also shows variations. Johns Hopkins University data, after stabilizing in March/April 2020, appear to have relatively fewer deviations from the CDC data baseline. At some points in time, datasets in Figure 1 appear to experience similar temporal variations from CDC data taken as a baseline (e.g., in November 2020 three datasets report lower figures than CDC). Figure 2 transforms the data to absolute values of difference, rather than signed differences, and calculates the mean of percent differences for the four datasets, calculated as:
$$\bar{\Omega} = \frac{1}{4} \sum_{i=1}^{4} \left| \Omega_i \right|$$
where Ω_i is the percent difference from CDC data for the i-th third party dataset. Figure 2's heatmap visualizes the number of counties on each day by mean absolute percent difference from CDC rounded to the nearest ¼ of a percent, with log-scaled colors based on the number of counties in that range.
Figure 2 highlights the complexity in diagnosing and summarizing differences in geospatial time-series data across multiple datasets. Many counties appear in the lowest 10-15% of difference toward the bottom of the graph. Horizontal stripes represent repeated difference intervals of case reporting, likely due in part to a combination of small numbers problems and rounding errors. For instance, CDC data round 7-day case averages to the nearest tenth, so a 1 case increase in other data represents a ~43% difference: CDC would report 0.1 as the 7-day case average vs 0.142857... from all others. This preliminary analysis highlights concerns with using percentage based comparisons, where small number problems and rounding issues may produce ostensibly larger differences, and the need to utilize statistics that better summarize consistency across multiple datasets. Despite these artifacts in the data, we can observe apparent increases in absolute percent difference from November 2020 through January 2021 that suggest the "stable" period of data collection still faced many issues. For simplicity and to focus our assessment on the relatively stable periods of pandemic data collection, our study examines data from March 8, 2020 through April 15, 2021, providing a range of daily 7-day averages from March 15, 2020 to April 15, 2021. The following sections outline our data processing and approach to comparison methods.
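The rounding artifact described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `percent_difference` is hypothetical, and it assumes the signed percent difference takes the CDC value as the denominator, so that positive values mean the third party dataset reports more than CDC.

```python
def percent_difference(cdc_avg: float, third_party_avg: float) -> float:
    """Signed percent difference of a third party value relative to CDC.

    Assumption: CDC is the baseline (denominator); the sign convention
    is chosen for illustration and is not confirmed by the source.
    """
    if cdc_avg == 0:
        raise ValueError("CDC baseline is zero; percent difference is undefined")
    return (third_party_avg - cdc_avg) / cdc_avg * 100.0


# One new case inside a 7-day window.
exact_avg = 1 / 7                      # 0.142857... as an unrounded average
cdc_style_avg = round(exact_avg, 1)    # 0.1 after rounding to the nearest tenth

diff = percent_difference(cdc_style_avg, exact_avg)
print(f"{diff:.1f}%")  # roughly 43%, matching the example in the text
```

A single case in a small county is therefore enough to produce a difference of over 40%, which is why percentage comparisons are fragile at low counts.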

Data preprocessing
CDC data are provided as daily 7-day rolling averages. For example, the 7-day rolling average reported on 4/15/2021 is the sum of cases reported from 4/9/2021 to 4/15/2021 inclusive, divided by seven. Each new day of data shifts the start and end dates by one day. All third party data were prepared into time-series tables with rows representing the 3105 US counties included in this study and columns representing cumulative cases and deaths for dates from March 8, 2020 through April 15, 2021; this date range provides coverage for CDC reporting windows of daily 7-day rolling averages from March 15, 2020 to April 15, 2021. Data, pre-processed into time-series tables, were accessed through the US Covid Atlas' GitHub repository (Li et al., 2020). 1P3A data were accumulated from daily new values into cumulative sums. Separate tables were generated for COVID-19 cases and reported COVID-19 deaths. For all third party datasets, daily 7-day rolling averages of new cases and deaths were calculated from the cumulative counts. The sums of these daily 7-day rolling averages for the 3105 counties included in this analysis for each dataset are presented in Table 1. As CDC data are only available as averages and not integral, rounding or edge errors may occur, and so the averaging process is replicated across all study datasets.
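The conversion from cumulative counts to daily 7-day rolling averages can be sketched as follows. This is a minimal illustration in plain Python, not the study's actual pipeline; the function names and the toy series are hypothetical, and negative daily values from retroactive corrections are not handled.

```python
def daily_new(cumulative):
    """Daily new counts from a cumulative series.

    The first cumulative value has no predecessor, so the result is one
    element shorter than the input.
    """
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def rolling_average(daily, window=7):
    """Trailing rolling mean; one value per day that has a full window."""
    return [
        sum(daily[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(daily))
    ]


# Toy cumulative case counts for one county over nine consecutive days.
cumulative_cases = [0, 1, 1, 3, 6, 10, 10, 14, 21]

new_cases = daily_new(cumulative_cases)   # eight daily values
averages = rolling_average(new_cases)     # two 7-day averages
print(new_cases)
print(averages)
```

Eight cumulative values yield seven daily differences and hence one 7-day average, which is consistent with cumulative data beginning March 8, 2020 supporting a first rolling average on March 15, 2020.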

Comparison methods
To better understand the differences between the five selected datasets, this study compares: (1) differences in cumulative case and death rates normalized to county population; (2) time-series agreement of pairwise data and all datasets together for daily rolling averages of new cases and new deaths; and (3) county-level agreement of weekly rolling averages of new cases and new deaths. Respectively, these methods aim to explore spatial heterogeneity in reporting discrepancies across datasets, elucidate dataset-wide changes in agreement over time, and identify spatial patterns in county-specific data agreement.
To synthesize the differences between these datasets into a more legible metric, agreement between dataset pairs is calculated using a kappa coefficient (Cohen, 1960), which is a measure often used in social sciences to evaluate agreement between two classifiers. In short, Cohen's kappa helps to highlight the difference in agreement between two classifiers based on how much agreement we would expect to see, contrasted with what agreement is observed. Cohen's kappa is calculated as follows:
$$\kappa_c = \frac{P_o - P_e}{1 - P_e}$$
where κ_c represents the agreement between two classifiers, P_o represents the observed agreement between the two classifiers, and P_e represents the expected agreement between the two classifiers if they were to assign values at random given their observed distribution. Our programmatic implementation of Cohen's kappa utilizes the metrics module of the Python package scikit-learn (Scikit-Learn/Scikit-Learn, 2011). In our application of Cohen's kappa, we use this measure to compare the daily 7-day rolling averages of new cases and new deaths normalized to the county population. This measure focuses on the state of the pandemic at each daily snapshot, smoothed only as much as needed for data integrity. Calculating Cohen's kappa relies on classification of data into categories or bins to assess agreement. To do this, we calculate decile bins based on CDC case and death rates for each day included in the study, then classify all dataset values based on those bins.
We can then evaluate agreement between each dataset pair and across all datasets using these decile bin classifications. As we assume the CDC dataset to be the authoritative one, we use the CDC dataset as the reference group in our comparison.
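The binning and pairwise agreement steps can be illustrated with a from-scratch sketch. The study itself relies on scikit-learn for Cohen's kappa; the pure-Python version below is a stand-in written for illustration, and the helper names, the decile cut-point rule, and the toy data are all assumptions rather than the authors' implementation.

```python
from collections import Counter


def decile_edges(reference):
    """Nine cut points splitting the reference values into ten roughly
    equal-count bins (a simple rule chosen for illustration)."""
    ordered = sorted(reference)
    n = len(ordered)
    return [ordered[(n * k) // 10] for k in range(1, 10)]


def to_bin(value, edges):
    """Bin index 0..9: the number of cut points less than or equal to value."""
    return sum(value >= edge for edge in edges)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (P_o - P_e) / (1 - P_e).

    Undefined (division by zero) when both classifiers place every
    item in one and the same category.
    """
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Toy rates for 100 hypothetical counties on one day: a reference series
# (standing in for CDC) and a second series shifted upward by 3.
reference = [float(v) for v in range(100)]
shifted = [v + 3 for v in reference]

edges = decile_edges(reference)                 # bins come from the reference only
bins_ref = [to_bin(v, edges) for v in reference]
bins_shifted = [to_bin(v, edges) for v in shifted]

kappa = cohens_kappa(bins_ref, bins_shifted)
print(round(kappa, 3))  # about 0.7 for this toy data
```

Deriving the cut points from the reference series alone mirrors the design described above, in which every dataset is classified against CDC-derived decile bins.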
To evaluate the five datasets together, we employ Fleiss' kappa (Fleiss, 1971), which allows for multiple classifiers to be evaluated simultaneously. Similar to Cohen's kappa, this measure estimates a coefficient that reflects how much agreement is observed between classifiers contrasted with the expected classification agreement. Fleiss' kappa is calculated as:
$$\kappa_f = \frac{\bar{P}_o - \bar{P}_e}{1 - \bar{P}_e}$$
where κ_f is the agreement between all classifiers, \bar{P}_o is the observed agreement across all classifiers, and \bar{P}_e is the expected agreement of classification if all classifiers chose randomly, based on their distribution of assigned classifications. Our implementation of Fleiss' kappa utilizes the statsmodels inter-rater module implementation (Appendix F).
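For readers who want to see the mechanics, Fleiss' kappa can be written out directly. The study uses the statsmodels implementation; the sketch below is an independent pure-Python version for illustration, with a hypothetical function name and toy ratings in which each inner list holds the bin assigned to one county by each of five datasets.

```python
def fleiss_kappa(ratings, n_categories):
    """Fleiss' kappa for a list of subjects, each rated by the same
    number of raters; ratings[i] lists the category index (0-based)
    assigned to subject i by every rater."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][j] = number of raters placing subject i in category j
    counts = [[row.count(j) for j in range(n_categories)] for row in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over subjects
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_o = sum(per_subject) / n_subjects

    # Expected agreement from the overall category proportions
    proportions = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in proportions)

    return (p_o - p_e) / (1 - p_e)


# Four hypothetical counties, five datasets each, bins 0..4.
toy_ratings = [
    [0, 0, 0, 0, 0],   # all five datasets agree
    [1, 1, 1, 1, 2],   # one dataset differs
    [2, 2, 2, 2, 2],   # all agree
    [3, 3, 3, 4, 4],   # a 3/2 split
]

kappa_f = fleiss_kappa(toy_ratings, n_categories=5)
print(round(kappa_f, 3))
```

In the daily snapshot analysis the subjects are counties on a given day; in the county-level analysis the subjects are the time steps for a single county.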
While these two statistics provide a concise metric for evaluating agreement, they are not without limitations. As any binning methodology uses thresholds to categorize data, counties on the edge between bins may be sensitive to small fluctuations. Additionally, kappa statistics are applied to categorical variables; consequently, a mismatch between bins 1 and 10 would be viewed the same as a disagreement between bins 5 and 6, for example. To account for this potential shortcoming, Appendix D demonstrates the same measures but using an average Euclidean distance between binning categories for both time series and spatial variations.
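The categorical limitation is easy to demonstrate. The sketch below uses a mean absolute difference between bin indices, which for one-dimensional bins is the Euclidean distance; it is only a loose illustration of the idea behind the Appendix D measure, whose exact definition is not given here, and the function names are hypothetical.

```python
def mean_bin_distance(bins_a, bins_b):
    """Average absolute distance between paired bin indices."""
    return sum(abs(a - b) for a, b in zip(bins_a, bins_b)) / len(bins_a)


def exact_match_share(bins_a, bins_b):
    """Share of pairs placed in the same bin (what kappa's P_o counts)."""
    return sum(a == b for a, b in zip(bins_a, bins_b)) / len(bins_a)


# Two comparisons with one mismatch each: adjacent bins vs opposite extremes.
near_a, near_b = [5, 5, 5], [6, 5, 5]
far_a, far_b = [1, 5, 5], [10, 5, 5]

print(exact_match_share(near_a, near_b), exact_match_share(far_a, far_b))
print(mean_bin_distance(near_a, near_b), mean_bin_distance(far_a, far_b))
```

Both comparisons have the same share of exact matches, yet the distance measure separates a near miss from a large disagreement.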
Once classified into decile bins, the agreement between the five datasets on each day was calculated. Since the direction of comparison (e.g., CDC-NYT vs NYT-CDC) produces the same output, 10 unique pairwise comparisons are generated for each day. The median values over time suggest how closely these datasets tended to agree during the course of the pandemic. Additionally, to identify potential spatial heterogeneity in agreement, we calculated Fleiss' kappa for each county across the time-series data available. As with the first agreement analysis, we classified values from all datasets based on equal count decile breaks generated from CDC data. Then, each county's decile bins over time were compared using Fleiss' kappa for the duration of the study date range, 3/15/2020 to 4/15/2021. Comparing how the datasets agreed over time for each county highlights where datasets consistently classified the intensity of the pandemic. Initial exploration also examined kappa values for total counts rather than per population rates; however, because total count decile bins tended to be much broader, agreement was less sensitive to changes and certain geographies remained in the same bin for the duration of the study dates. For instance, many cities sat in the top decile for the duration of the pandemic, as their total counts were always much higher. CDC data are also distributed as per 100,000 population rates, and so we use this metric for analysis.

Sensitivity analysis
We explored two dimensions in our sensitivity analysis to validate data methods. To evaluate how sensitive the results are to the classification needed for kappa coefficients, we examined the data using Pearson's correlation and Spearman's correlation coefficients, comparing the results generated against the Fleiss' kappa agreement methodologies. Specifically, we compared spatial and temporal trends for basic pairwise Pearson's correlation and Spearman's correlation coefficients of daily snapshot and county time-series data. While these statistics cannot compare the five datasets simultaneously, the pairwise comparisons can be compared over time, and an average of correlations mapped to illustrate spatial heterogeneity. The temporal and spatial variation observed in visual analysis of the figures and maps was sufficiently similar to proceed with methods utilizing Cohen's kappa and Fleiss' kappa as described above; further results from this sensitivity analysis are available in Appendix B.
Additionally, daily 14-day rolling averages were computed and compared to the 7-day averages to observe whether significant differences emerged between datasets given a larger reporting window. As a larger window of data is included for a daily average, fidelity and detail to emerging trends may be smoothed and lost; our findings from this analysis suggest minimal differences in agreement between daily 7-day and 14-day averages of new cases. Further results from these findings are available in Appendix C.

Cumulative cases and deaths reported
The first analysis compares pairwise results between datasets for cumulative COVID-19 case and death rates reported. Figure 3 shows the cumulative COVID-19 case rate as a sum of daily 7-day rolling averages per 100,000 people in each county. Maps in these matrix figures use a diverging color scale to identify spatial variation in where different datasets report higher or lower rates. In this scale, green represents the target (x-axis) dataset reporting higher rates whereas purple represents the source (y-axis) dataset reporting higher rates. Lighter colors symbolize smaller differences, and gray colored counties indicate areas within ±1 case or death per 100,000 people, accounting for small variations due to rolling average calculation.
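The map classification logic can be sketched in a few lines. The function names, the returned labels, and the example figures below are hypothetical; only the per 100,000 normalization, the ±1 tolerance, and the green/purple/gray convention come from the description above.

```python
def rate_per_100k(count, population):
    """Normalize a count to a rate per 100,000 residents."""
    return count / population * 100_000


def classify_difference(source_rate, target_rate, tolerance=1.0):
    """Label one county in a pairwise map cell.

    tolerance is the +/-1 per 100,000 band shown in gray.
    """
    diff = target_rate - source_rate
    if abs(diff) <= tolerance:
        return "gray (within tolerance)"
    return "green (target higher)" if diff > 0 else "purple (source higher)"


# Hypothetical county of 50,000 residents with two cumulative case totals.
population = 50_000
source = rate_per_100k(4_000, population)   # 8000 per 100,000
target = rate_per_100k(4_100, population)   # 8200 per 100,000

print(classify_difference(source, target))
```

Normalizing before differencing keeps large counties from dominating the map simply because their raw counts are larger.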
A few consistent patterns across datasets become apparent when examining rows and columns together: CDC appears to report higher numbers in South Carolina and Alabama; 1P3A appears to report higher numbers in Wisconsin, Missouri, and New Jersey than other datasets, but reports lower rates in Iowa; USAFacts reports lower rates in Georgia than other datasets. Between the third party datasets, many states appear in gray, suggesting consistent cumulative case rates for the period observed. For each pairwise comparison, with the exception of the JHU-NYT data pair, state-wide differences stand out in the most extreme bins (beyond ±150), suggesting that data from those entire states are treated differently by different datasets. Of the comparisons with CDC data, JHU and NYT data appear to have the fewest obvious state-wide differences, with CDC reporting higher numbers in Alabama and South Carolina and JHU and NYT data both reporting higher cases in New Jersey. This apparent agreement will be observed as well in the kappa coefficients used below, and time-series analysis will suggest that these three datasets have, over time, converged to express the closest agreement and smallest discrepancies of the five datasets explored in this analysis.
While CDC data comparisons also highlight statewide differences, we also observe less stark and more spatially heterogeneous patterns, particularly in a vertical band from the western side of Texas, continuing north through Kansas, Nebraska, and the Dakotas (Figure 4). Counties in this particular region appear to have case numbers both above and below CDC estimates, often in close geographic proximity. One possible explanation for these discrepancies is an error in reporting based on county of residence vs county of occurrence, where cases may be attributed to the county where the case was detected, rather than the home county of the patient. This may manifest on the map as some counties with better testing resources reporting more cases and less-resourced counties "sending" cases to them, but further research on testing infrastructures in this region is needed to confirm this possible explanation. Provisional CDC and National Center for Health Statistics (NCHS) datasets begin to disambiguate county of residence and occurrence data (National Center for Health Statistics [NCHS], 2021b, 2021a); however, not all sources clearly define this characteristic of the data.
Figure 3. Matrix of cumulative case differences, calculated as the sum of daily 7-day rolling averages of new cases from 3/15/2020 to 4/15/2021. The color bins for these maps are fixed across the maps and approximate quintile breaks in either diverging direction. Geographic boundaries used are a modified version of Topojson's US Atlas county boundaries (Topojson, 2021).
Figure 5 illustrates the cumulative COVID-19 deaths as a sum of daily 7-day rolling averages per 100,000 people in each county using the same method as for cases in Figure 3. The spatial variation observed in these maps suggests some similarities between the patterns of differences for COVID-19 deaths and those we observe in cases between CDC and third party datasets: CDC data have broad, heterogeneous discrepancies with third party datasets across a majority of the country, whereas third party data tend to agree with one another apart from notable state-level differences, and the same vertical band from western Texas through the Dakotas emerges as a region of data uncertainty. Visual analysis by columns and rows suggests additional patterns: third party datasets report higher death rates in most of Ohio than reported in CDC data; 1P3A data are lower than other datasets in Wyoming; NYT reports lower death rates in most of Kansas than others.
In the first column comparing differences to CDC data, a large portion of the map suggests that CDC data report fewer deaths than the third party datasets. This is summarized in the death rates to date shown in Table 1, where all third party datasets report roughly 15,000 deaths more than CDC data; for comparison, the third party datasets have a range of roughly 8600 deaths between the highest and lowest estimates. Additionally, none of these estimates consider excess deaths, which may have estimates far greater than the figures reported here; provisional data are emerging (NCHS, 2021c, 2021d).
To synthesize these geographic findings into a single metric and compare agreement across the five datasets included, in the next section we will analyze dataset agreement using Cohen's kappa and Fleiss' kappa statistics as described in the methods section.

Dataset agreement
We measure data agreement using kappa coefficients to compare decile bin classifications of counties for each daily 7-day rolling average rate of new cases and deaths. For each day of available data, a matrix of kappa coefficients comparing each of the five datasets was generated alongside a Fleiss' kappa to reflect agreement of all datasets on that day. By assessing each daily rolling average as a snapshot, we can explore time-series agreement and then summarize the minimum, maximum, and median days of agreement. Additionally, by extracting decile binnings for the five datasets for a single county, we can assess how closely the time-series data for a specific county have been in agreement and analyze spatial variation. Kappa statistics range from −1 (perfect disagreement) to +1 (perfect agreement) and can generally be interpreted as indicating slight agreement with scores from 0 to 0.2, fair agreement from 0.2 to 0.4, moderate agreement from 0.4 to 0.6, substantial agreement from 0.6 to 0.8, and almost perfect agreement above 0.8 (Landis & Koch, 1977). To summarize Cohen's kappa agreements, Tables 2 and 3 present the minimum, maximum, and median values of dataset agreement across all days observed. Overall, agreement on decile binning is generally quite high, with the Fleiss' kappa agreement across all datasets ranging from .751 (cases)/.784 (deaths) to .908 (cases)/.923 (deaths), with a median of .861 (cases)/.877 (deaths). This means that each daily kappa agreement remained at or above a substantial level of agreement, with the median day of data observed well within the almost perfect category. Pairwise agreements never dropped below a 0.6 Cohen's kappa value, again suggesting that agreement was always at least substantial.
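The interpretation bands can be expressed as a small lookup, sketched here for convenience. The function name is hypothetical, and the handling of values that fall exactly on a boundary (assigned to the lower band) is an assumption, since the ranges quoted above share their endpoints; values below zero are labeled "poor" following Landis and Koch (1977).

```python
def interpret_kappa(kappa):
    """Map a kappa value to its Landis and Koch (1977) agreement label.

    Assumption: a value exactly on a boundary (e.g. 0.8) is assigned
    to the lower band.
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0:
        return "poor"
    bands = [
        (0.2, "slight"),
        (0.4, "fair"),
        (0.6, "moderate"),
        (0.8, "substantial"),
        (1.0, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label


# Fleiss' kappa summary values for case data reported in the text.
for value in (0.751, 0.861, 0.908):
    print(value, interpret_kappa(value))
```

Applied to the reported case values, the minimum (.751) falls in the substantial band while the median (.861) and maximum (.908) fall in the almost perfect band.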
For agreement on case data, we observe the highest kappa values between NYT/JHU datasets (min .776/median .914/max .969) and JHU/CDC datasets (.713/.893/.953). Agreements between NYT/USAFacts and JHU/USAFacts are also high, in addition to 1P3A/JHU. Median agreements between CDC and USAFacts (.819), NYT (.835), and 1P3A (.822) are among the lowest median kappa values observed; USAFacts has the lowest minimum bounds across all dataset agreements, ranging from .608 (CDC) to .655 (JHU). Overall, these agreements are quite high, and the lowest bounds still remain in the substantial category of agreement. More detailed box plot matrices for case and death Cohen's kappa agreements are available in Appendix E.
Agreement on death data is similar, and pairwise Cohen's kappa values are generally the same or higher with a few exceptions. As compared to case data agreement, CDC data pairs with JHU, NYT, and 1P3A data are lower, but slightly higher with USAFacts. The agreement on death data between the third party datasets is higher across minimum, maximum, and median relationships in most cases. Aside from the perfect agreement (kappa value of 1) achieved between CDC and USAFacts early in the observed data, the highest observed median agreement is between USAFacts and NYT (.924), followed by NYT/JHU (.912) and NYT/1P3A (.910). As with case data, overall agreement remains high, and even the lower bounds between CDC and third party datasets still represent substantial agreement.

Agreement over time
Figures 6 and 7 present findings from the time-series Cohen's kappa and Fleiss' kappa coefficients for cases and deaths, respectively. While certain pairwise agreements appear to increase over time, surprisingly, agreements between datasets for the most part did not improve over time. Starting in June and progressing over the remainder of 2020 into early 2021, we observe Fleiss' kappa agreement declining between datasets, falling below a threshold of 0.8 from almost perfect agreement to substantial agreement. A large dip in agreement during mid-November 2020 may reflect the massive surge in cases in the upper Midwest of the United States. With the exception of two data pairs, NYT/JHU and JHU/CDC, we observed high temporal variability of pairwise dataset agreement, particularly beginning in mid-November 2020.
Temporal trends in death data are significantly less variable and tend to be more consistent over time. A similar trend of declining Fleiss' kappa agreement across all datasets is observed beginning late June 2020 and continuing through November 2020. Again, this decline may be related to increasing pandemic impact across the country at that time. All four CDC data pairs remain below third party data agreement and below Fleiss' kappa agreement for a majority of the study period. Agreement between NYT and JHU datasets again emerges as the most closely aligned pair observed beginning in March of 2021.

Spatial variation in agreement
Variations in these pairwise comparisons and Fleiss' kappa coefficient illustrate periods of the pandemic that experienced the highest and lowest levels of temporal agreement. In a similar manner, to explore the spatial dimension of agreement we calculate decile bins for each county included in the study using the CDC data, classify each day's values in those bins, and then evaluate agreement via a single Fleiss' kappa value at the county level (see Figures 8 and 9). In contrast with the daily 7-day rolling averages in the time-series plots (Figures 6 and 7), in which each new data snapshot shifts the window by one day, these maps evaluate data by shifting the window by one week, thus avoiding dependent values in the series. As observed with spatial variation for cumulative cases and deaths, some patterns along state lines appear to characterize Fleiss' kappa agreement: overall Fleiss' kappa agreement is lower in Iowa, Missouri, Kansas, Nebraska, Massachusetts, and Alabama; agreements in Pennsylvania, Florida, Connecticut, and Vermont are generally higher. We observe a few apparent regional trends, but spatial variation exists strongly along and within state lines. Iowa, Missouri, Nebraska, and Kansas, all having relatively low rates of agreement observed in their counties, are bordered by states with county-level kappa values tending toward middle or upper quintiles of observed agreement; Illinois, Minnesota, and Oklahoma all have relatively high case data agreement across many counties. Similarly, a number of southeastern states (Alabama, Georgia, and the Carolinas) have low agreement, but are bordered by Florida, Mississippi, Tennessee, and Virginia, all states characterized by higher agreement of case data.
Spatial differences in death data agreement are similarly apparent along state borders, but states that tended to have lower agreement on case data do not necessarily have low death data agreement as well: Massachusetts has relatively low case data agreement but high death data agreement; several states with moderate or higher case agreement have lower death data agreement, as observed in Ohio, New York, and Illinois; several states present high agreement across both metrics, such as Pennsylvania, Florida, Connecticut, and much of Tennessee; and several states present low agreement across both metrics, such as Missouri, Kansas, and much of Washington. As observed in mapping case data agreement, spatial heterogeneity within states does not present clear patterns, and most states have a mix of county-level values across multiple quintile bins.
Compared with the first analysis of cumulative data, there appear to be fewer regional trends observed in the Fleiss' kappa agreement maps, but state differences are clearly present along boundaries. Reasons for state-bounded differences in agreement may reside in issues of data reporting frequency, testing programs, data pipeline and communication issues, and state-specific data vetting and cleaning procedures on the part of the data aggregators. Future research may be able to test and validate these hypotheses, as well as work to typologize county data uncertainty with considerations of pandemic response robustness and access, public health landscape, and intersectional race, age, gender, and class considerations that may contribute to delayed counting or undercounting.
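The county-level procedure described in this section (decile bins derived from the CDC series, each weekly value classified into those bins, then a single Fleiss' kappa per county) can be sketched as follows. This is an illustrative reconstruction under stated assumptions rather than the study's implementation: the function names, the inclusive quantile method, the handling of values that fall exactly on a cut point, and the example series are all hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles


def decile_edges(reference_values):
    """Nine cut points splitting the reference series into ten bins."""
    return quantiles(reference_values, n=10, method="inclusive")


def to_bin(value, edges):
    """Index (0-9) of the decile bin a value falls into."""
    return bisect_right(edges, value)


def fleiss_kappa(ratings, n_categories):
    """Fleiss' kappa; ratings[i] lists the category each rater gave subject i."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    totals = [0] * n_categories
    agreement_sum = 0.0
    for row in ratings:
        counts = [0] * n_categories
        for category in row:
            counts[category] += 1
        for k, c in enumerate(counts):
            totals[k] += c
        # Share of rater pairs that agree on this subject.
        agreement_sum += (sum(c * c for c in counts) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar = agreement_sum / n_subjects
    p_exp = sum((t / (n_subjects * n_raters)) ** 2 for t in totals)
    if p_exp == 1.0:
        # Assumed convention: every rating is identical, report full agreement.
        return 1.0
    return (p_bar - p_exp) / (1.0 - p_exp)


def county_agreement(series_by_dataset, reference="CDC"):
    """One Fleiss' kappa for a county: weeks are subjects, datasets are raters."""
    edges = decile_edges(series_by_dataset[reference])
    names = sorted(series_by_dataset)
    ratings = [
        [to_bin(value, edges) for value in week_values]
        for week_values in zip(*(series_by_dataset[name] for name in names))
    ]
    return fleiss_kappa(ratings, n_categories=10)


# Hypothetical weekly 7-day averages for one county; one dataset is offset.
cdc = [0.0, 1.5, 3.0, 4.5, 6.0, 7.5, 9.0, 10.5, 12.0, 13.5, 15.0, 16.5]
county = {
    "CDC": cdc,
    "JHU": cdc,
    "NYT": cdc,
    "USAFacts": cdc,
    "1Point3Acres": [v + 4.0 for v in cdc],
}
county_kappa = county_agreement(county)
```

Treating each week as a subject and each dataset as a rater is the design choice that lets a single coefficient summarize a county over the whole study period.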

Summary of findings
We used three main approaches to identify dimensions of uncertainty: cumulative data to explore geographic differences of reported cases and deaths, time-series data snapshots to uncover changes in data agreement over time, and county-specific time-series data to identify areas where datasets have agreed or disagreed most. Trends were evaluated across each approach for consistency, denoting unique dimensions of uncertainty. Our analysis of cumulative case discrepancies compared pairwise data normalized to population and found discrepancies appearing strongly along state boundaries; discrepancies between datasets were observed in South Carolina, Georgia, and Alabama, and in an apparent vertical band of inconsistent counties from western Texas to the Dakotas. CDC data tended to have more widespread disagreement with third party data (Johns Hopkins University, New York Times, USAFacts, and 1Point3Acres), which tended to agree within ±1 case or death across large areas of the country. Of all data mapped, NYT and JHU data appear to have the fewest discrepancies. Cumulative death data analysis showed variations in other areas, with Wyoming, Ohio, and parts of Illinois and New York highlighted as apparent discrepancies; the vertical band of uncertain data in the middle of the country remains. Overall, CDC data reported fewer deaths than third party datasets. The second analysis examined trends and changes in how datasets tended to agree, using Cohen's kappa to compare pairwise data agreement and Fleiss' kappa to compare all five datasets together. For both case and death datasets, agreement has tended to decline over time, particularly after November 2020; this observation may be due in part to larger case volumes and the intense period of the pandemic in late 2020 in the United States. Case data agreement is particularly volatile, although in later months the NYT/JHU and JHU/CDC data pairs have started to emerge as closer in agreement. Death data show growing agreement for the NYT/JHU data pair, and agreement for CDC data pairs tended to be lower than for all other relationships. The final analysis extracted weekly data for each county from the five datasets included and compared the agreement for that county over the duration of the pandemic. Trends similar to cumulative case and death data appear along state boundaries, as well as variation within states. For cases, Georgia, Alabama, Missouri, Kansas, Nebraska, and much of Washington are observed with low agreement; for deaths, New York, Wyoming, Kansas, and much of Washington are observed with low agreement. Florida, Pennsylvania, and Connecticut are among the states with reasonably consistent data agreement.
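The cumulative comparison summarized above can be illustrated with a short sketch that sums 7-day rolling averages of new counts for two datasets and expresses the gap relative to population. The trailing window, the per-100,000 scaling, and the function names are assumptions made for this example; the exact normalization used for the maps is not reproduced here.

```python
def rolling_mean(values, window=7):
    """Trailing rolling mean; positions without a full window are skipped."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def cumulative_difference_per_100k(daily_a, daily_b, population, window=7):
    """Gap between two datasets' summed rolling averages, per 100,000 residents."""
    total_a = sum(rolling_mean(daily_a, window))
    total_b = sum(rolling_mean(daily_b, window))
    return (total_a - total_b) / population * 100_000


# Hypothetical daily new cases for one county of 50,000 residents.
dataset_a = [4, 6, 5, 7, 8, 6, 6, 9, 10, 8, 7, 9, 11, 10]
dataset_b = [4, 6, 5, 7, 8, 6, 6, 9, 10, 8, 7, 9, 11, 3]
gap = cumulative_difference_per_100k(dataset_a, dataset_b, population=50_000)
```

A positive value indicates that the first dataset reports more cases over the period than the second, and a value near zero indicates close cumulative agreement.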
These findings highlight different dimensions through which uncertainty can be observed in spatiotemporal datasets, and our study highlights the extent of volatility, variation, and complexity that is often present. Across our analysis of spatial, temporal, and cumulative measures, marked differences were observed, but considered as a whole and in the context of kappa interpretation scales, much of the data in our study generally tended to be in substantial agreement. To make the dimensions of uncertainty in COVID-19 information more explicit, we categorize the key findings across the conditions of uncertainty of MacEachren et al. (2005), which pair three components of information (space, time, and attribute) with nine uncertainty types (accuracy/error, precision, completeness, consistency, lineage, currency/timing, credibility, subjectivity, and interrelatedness). Previous work demonstrated that incorporating these typologies to characterize uncertainty is useful to users and development work (Roth, 2009). In total, nine component-types of uncertainty were uncovered with an ESDA analytic approach: (1) accuracy/error in the time dimension; (2) completeness in the spatial dimension (e.g., Utah); (3) consistency in the spatial dimension (e.g., NYC boroughs); (4, 5) lineage in the spatial and temporal dimensions (e.g., provenance of data and how sources may overlap); (6) credibility in the attribute dimension (e.g., journalists and volunteers versus the CDC); and (7-9) interrelatedness across spatial, temporal, and attribute dimensions.

Understanding processes of uncertainty
The results above diagnose and highlight spatial and temporal patterns of uncertainty observed across the five included datasets. For practical applications of these data, these results suggest where and when data uncertainty may be elevated, and supply methods to analyze future datasets. However, it is important to understand and break down the processes that produced this uncertainty in the first place, to better expose root causes of uncertainty and how these datasets differ. In so doing, future data practices from both data aggregators and official data providers may better manage uncertainty before data publication and use. To understand the data pipeline and where these processes diverge temporally, we examine the documentation of each dataset and outline typical data pipelines in Figure 10, highlighting potential divergence in each step of the data pipeline across datasets. If COVID-19 cases occur on May 2nd, data reporting supplies case data to county- and state-level datasets, which are aggregated into national county-level files through data scraping, manual updates, and various automated processes, usually on the next day. Aggregation is followed by quality assurance and quality improvement (QA/QI) processes, correcting for geographic boundary issues, redistribution of unassigned cases, and handling special cases (e.g., cruise ship data). Each jurisdiction may have its own reporting issues related to timeliness and accuracy: for detailed data history from JHU, see the readme file under the csse_covid_19_data folder of the GitHub data repository; for more on the history of the NYT COVID-19 data, see the GitHub commit history on their data repository (see links in Appendix A). For New York Times data, additional on-site reporting and confirmation with public health officials ("Nytimes/COVID-19-Data," 2020/2021) provides ongoing validation; for 1Point3Acres data, county data are supplemented with media and social media case reports, which were particularly important in the early days of the pandemic (Yang et al., 2020). JHU data aggregate and reconcile a number of different datasets, although the precise details of this process are unclear. For USAFacts and JHU, this data pipeline outputs daily updates, and the New York Times releases periodic updates multiple times per day ("Nytimes/COVID-19-Data," 2020/2021). These data reflect responsive aggregation, which emphasizes publishing data in near-real time. This data pipeline differs from CDC data releases, which we label here as retrospective aggregation because county-level data only became available from CDC sources starting in late December. After that point, CDC updated county-level data 2-3 times per week, generally 2-5 days behind real-time data. CDC data on the Covid Data Tracker website use a combination of sources and data validation processes that are not fully known (CDC & CDC, 2021). Additionally, it is important to note that official provisional datasets from the National Center for Health Statistics (NCHS) use different validation methods, particularly for death data based on death certificates rather than county or state death reports (CDC & CDC, 2021).
The differences in these data pipelines manifest as part of the uncertainty observed across temporal and spatial dimensions. Additional dataset characteristics also reveal the differences in these processes. For instance, temporal decreases in cumulative values suggest some form of correction or reassignment of cases, and are particularly common in JHU and NYT data; however, CDC data show relatively few instances of decreasing cumulative values, suggesting that reassignments and corrections are likely incorporated into data prior to publication. Differences in data sourcing, case and death assignment, validation, corrections, and timeliness are summarized below in Table 4. The last row of the table identifies some of the possible conditions of uncertainty (MacEachren et al., 2005) that may be introduced in each stage of the data pipeline.
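Decreases in cumulative values of the kind noted above are straightforward to flag. The following is a minimal sketch, assuming a county's cumulative counts are available as a simple list ordered by day; the function name and the example series are hypothetical.

```python
def cumulative_decreases(cumulative):
    """Return (day_index, change) for each day the cumulative count drops."""
    return [
        (i, curr - prev)
        for i, (prev, curr) in enumerate(zip(cumulative, cumulative[1:]), start=1)
        if curr < prev
    ]


# Hypothetical cumulative case counts for one county over eight days.
county_cases = [10, 12, 15, 14, 14, 20, 18, 25]
drops = cumulative_decreases(county_cases)
```

Counting such drops per dataset gives a coarse indicator of how often corrections or reassignments appear after publication rather than before it.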
These dimensions of uncertainty do not explicitly characterize the quality of data, but rather help us to understand where greater uncertainty in our understanding of the pandemic's impact exists and where data pipelines can be improved for future public health efforts. As Goodchild (2020) urges, exploring the phenomena driving uncertainty in data or geographic understanding is crucial. Efforts to document the provenance behind these COVID-19 datasets and their similarities or differences are central to this exploration. All crowdsourced datasets pulled from local health department websites in some form or another, whereas the CDC used a different reporting strategy unique to their internal process; this may provide some insight into the differences, but also poses new questions about how variation in reporting impacts outcomes data. Spatially, some regions are more   (Smith, 2020), as well as identifying the same 7-day cyclical patterns at the state level (Wang et al., 2020). Extending an analysis of uncertainty past attribute levels of data agreement, we also found distinct spatial and temporal patterning that is integral to understanding the nuances behind varying but complementary data generating processes. This is in line with calls to highlight the multiple dimensions of uncertainty in data by MacEachren et al. (2005) and again by Goodchild (2020). This work also highlights the complexities of communicating uncertainty from incomplete or disputed knowledge, a core challenge when planning for uncertain futures (Spiegelhalter et al., 2011). Temporal aspects of uncertainty are difficult to integrate in mapping (Boggs, 1949), especially in a way that is useful for planning and mitigation efforts. This work advances the field in that respect by integrating a visual analytic approach to first understanding dimensions of uncertainty in data (that may reflect complex underlying processes) and subsequently discerning a distinct, meaningful, and consistent form of communicating the uncertainty as a dimension of the data. The final kappa index can be communicated and explored as its own variable across time in spatiotemporal web applications like the US Covid Atlas, and/or it can be included as an additional metric or indicator to provide insight at a county-level scale of analysis.

Small numbers problem
As observed in our preliminary exploratory analysis, small numbers of cases can produce large variations in assessed consistency, particularly when taken as ratios or percentages. Relatedly, where 7-day case or death averages remain low, rounding errors present in CDC data potentially amplify assessed inconsistency. Excluding early data from the time series examined and using the quantile classification employed in our analysis helps to mitigate these issues. The state of the pandemic in the US continues to evolve, as do data publication practices and standards. Changing case and death rates may introduce periods or areas of low counts and introduce new small numbers problems in assessing data consistency across dataset providers. Future researchers and practitioners dealing with low count data should carefully consider how datasets may introduce uncertainty related to small numbers problems or rounding errors.
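A short worked example makes the small numbers problem concrete. The sketch below uses hypothetical counts and a hypothetical helper function; the rounding step is shown only to illustrate the mechanism, since the exact rounding rule applied in CDC data is not specified here.

```python
def percent_difference(value, reference):
    """Percent difference of value relative to a nonzero reference."""
    return 100.0 * (value - reference) / reference


# The same absolute gap of one case gives very different relative gaps.
small_county = percent_difference(3, 2)      # 3 vs 2 cases
large_county = percent_difference(201, 200)  # 201 vs 200 cases

# Three cases in a week average below one per day; rounding to a whole
# number reports zero, so the dataset appears to disagree with an
# unrounded source even though the underlying counts match.
unrounded_average = 3 / 7
rounded_average = round(unrounded_average)
```

Here the one-case gap is a 50 percent difference in the small county and a 0.5 percent difference in the large one, which is why binned comparisons tend to be more stable than ratios when counts are low.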

Limitations and implications for future research
Looking to future data products, needs, and uses, what can we do better to manage uncertainty in data observation, aggregation, and communication pipelines? Gardner et al. (2021) call for better open data standards and systems to avoid the discrepancies and uncertainty we face in available COVID-19 data, specifically identifying issues and potential mitigation strategies for: (1) variable and metric definition and reporting timeliness; (2) data source/set disagreement; (3) differences in reporting format or medium; (4) changes in data over time, such as corrections; and (5) concerns of privacy and personally identifiable information. Standards that aim to address modern data needs, such as the Mobility Data Standard (Open Mobility Foundation, 2020), integrate data across infrastructures to support interoperability between companies, jurisdictions, and individuals. But data fragmentation and discrepancies are likely to persist, and the need for methods to diagnose and manage uncertainty remains. The dimensions of uncertainty we highlight across the datasets and processes align with some criteria used to assess reliability and validity of data, such as time, space, measurement errors, bias, and representativeness (National Academies of Sciences, Engineering and Medicine, 2020). The need for decision-making in the face of uncertain and imperfect data will persist in the future. Our analysis contributes three main strategies for uncovering dimensions of uncertainty, and future research may expand this toolbox and operationalize these metrics for different uses. An array of methods may be appropriate depending on the depth, scale, and needs of research, policy, and data demands. For public health uses, real-time, if coarse, metrics can highlight areas with discrepancy or disagreement and trigger further review. Finer metrics employing big data techniques, in-depth modeling, and multidimensional correlation or dimension reduction may yield insights for longer-term research processes. Between these two extremes, numerous approaches exist to how we understand uncertainty and what we do with that knowledge. These practical methods build on our conceptual understanding of uncertainty to provide applied ways to diagnose and understand dimensions of uncertainty across datasets.
There are limitations to this study that hinder its ability to be conclusive regarding data uncertainty in the pandemic. For one, the dust has not yet settled: at time of writing, we are still in the midst of the pandemic. This means many datasets are still preliminary or provisional data releases, particularly the CDC data (NCHS, 2021a, 2021b). Because of reporting differences in geographies, we were not able to include all geographies of the United States (Utah, parts of Alaska), and Puerto Rico was not included in some datasets altogether. Additionally, this study only examines the COVID-19 data ecosystem within the United States, but data uncertainty is a global issue; further, we selected five prominent or widely incorporated datasets, but many more exist. Lastly, the kappa metrics employed may be sensitive for counties whose values fall near the edges of decile bins, and they may unfairly convert observed agreement into a binary agreement or disagreement rather than a continuous scale. Despite these limitations, the five datasets, present across over 3100 counties, provide a robust data universe to explore and assess. Each of the datasets has its own editorial and methodological approach to producing data, and the kappa statistics observed in this study provide an understandable metric to chart disagreement and uncertainty across time-series geospatial data. For these reasons, and despite these limitations, we believe this study can help identify recommendations for assessing uncertainty more actively as data emerge, communicating that uncertainty, and working toward data that more accurately represent the communities described.
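The bin-edge sensitivity noted above can be shown with a toy case. The cut points and values below are hypothetical and chosen only to illustrate the mechanism: two nearly identical values that straddle a cut point are scored as a disagreement, while two quite different values inside one bin are scored as an agreement.

```python
from bisect import bisect_right

# Hypothetical bin cut points for one county.
edges = [5.0, 10.0, 15.0]


def bin_index(value):
    """Index of the bin a value falls into, given the cut points above."""
    return bisect_right(edges, value)


near_miss = (bin_index(4.99), bin_index(5.01))  # values differ by 0.02
far_apart = (bin_index(5.01), bin_index(9.90))  # values differ by 4.89

near_miss_agrees = near_miss[0] == near_miss[1]
far_apart_agrees = far_apart[0] == far_apart[1]
```

A continuous measure of difference, reported alongside kappa, would be one way to soften this effect.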
County data remain the finest resolution available country-wide, but even this scale presents issues. As with any areal data aggregated into larger geographies, COVID-19 case and death data are limited by the Modifiable Areal Unit Problem (MAUP) and ecological fallacy (Piantadosi et al., 1988; Wong, 2009), exacerbated by the aggregation of counties around New York City into a single geography. The MAUP presents two issues: the "scale problem," where different aggregation sizes produce different findings, and the "zoning problem," where similarly scaled but differently located geographic units again impact research findings (Fotheringham & Wong, 1991). Finer data, such as at the tract or zip code level, could provide useful resources to better understand the sensitivity of case and death data to the geographic unit.

Conclusion
As the kappa and cumulative difference measures can be easily computed in real time and output to simple, comprehensible metrics, we encourage regional and national data product services, news organizations, and public administrations to consider utilizing metrics of agreement and uncertainty to more transparently report on the state of pandemic and future public health data. While it will always be important to explore the most recent data available, even if flawed or incomplete, identifying and communicating uncertainty in data may help to temper expectations and mitigate backlash as understanding of situations evolves with new information, better data, and a clearer narrative of the current situation.
Areas with high discrepancies in cumulative data and low case and death rate agreement may warrant further scrutiny when used for research purposes, as well as further review of the causes behind these data inconsistencies. The relative consistency of the CDC data, particularly for cases, emphasizes the need for federal agencies to serve as clearinghouses for data in real time. At the same time, the disagreement raised by journalistically researched datasets, including the New York Times, and crowdsourced datasets, such as 1Point3Acres, emphasizes the potential importance of multiple perspectives that may help identify where official data fall short and fill gaps to ensure communities on the margins are not left unaccounted for. In short: at time of writing, NYT and JHU appear to be the closest in agreement for cases, and NYT data, updated several times a day, tend to be the most timely. CDC case data appear likely to represent the best choice for historical analysis, so long as daily 7-day average values are adequate for the given analysis. CDC data also provide coverage based on unmodified US county geographies (as opposed to merged New York City counties and alternate Utah geographies). For COVID-19 related death data, USAFacts, JHU, and NYT data appear similar in agreement. Forthcoming provisional datasets on deaths from the National Center for Health Statistics (NCHS) may possess greater accuracy, potentially due in part to the use of death certificates rather than public health reporting, and these data may be useful for historical analysis but possess limited timeliness. 1Point3Acres, with crowdsourced information, may possess unique advantages in early reporting of data, but for the most part it has lower agreement coefficients with other datasets to date.
In spite of imperfect COVID-19 data agreement, policy-makers, communities, and individuals still require and deserve informed decision-making. CDC's data are the most institutionally grounded of the data examined and are ostensibly becoming a de facto reference dataset, but uncertainty remains. Data are representations of reality, subject to the same processes of obfuscation and discernment. Each dataset may have its own distinctions in time and space, and where we find general divergence in agreement, additional care regarding uncertainty is needed. In uncovering and documenting the differences between datasets, it becomes even more clear that we "need more GIScience thinking in COVID-19 research," as suggested by Rosenkrantz et al. (2021). Maps or analytic models on their own do not tell the full picture, but of course GIScientists (should) have always known that to be the case. Even with existing limitations of coarse resolution health outcomes (county scale) and discrepancies between datasets, we can use the knowledge produced from these endeavors to drive new questions about underlying mechanisms to better support pandemic planning and mitigation.

Figure 1. Percent difference in daily 7-day rolling average of new cases reported vs CDC.

Figure 2. Heatmap of mean absolute percent difference of daily 7-day rolling averages compared to CDC data and by-county, by-day differences.

Table 2. Case data agreement matrix of minimum, median, and maximum daily values of pairwise data comparisons for daily 7-day rolling averages from 3/15/2020 to 4/15/2021.
Table 3. Death data agreement matrix of minimum, median, and maximum daily values of pairwise data comparisons for daily 7-day rolling averages from 3/15/2020 to 4/15/2021.

Figure 4. Spatial variation observed across all third party datasets comparisons with CDC data, consistently appearing across counties in a north-south band between Texas and the Dakotas.

Figure 5. Matrix of cumulative death differences, calculated as the sum of daily 7-day rolling averages of new deaths from 3/15/2020 to 4/15/2021.

Figure 6. Time-series plot of pairwise Cohen's kappa agreements between datasets and Fleiss' kappa agreement across all datasets for COVID-19 case data.

Figure 7. Time-series plot of pairwise agreements between datasets and Fleiss' agreement across all datasets for COVID-19 death data.

Figure 10. Diagram of data pipeline for COVID-19 responsive aggregation and retrospective aggregation datasets.

Table 4. Summary of dataset sources, locality assignment, validation, and correction processes, and timeliness.