Protecting Privacy and Transforming COVID-19 Case Surveillance Datasets for Public Use

Objectives: Federal open-data initiatives that promote increased sharing of federally collected data are important for transparency, data quality, trust, and relationships with the public and state, tribal, local, and territorial partners. These initiatives advance understanding of health conditions and diseases by providing data to researchers, scientists, and policymakers for analysis, collaboration, and use outside the Centers for Disease Control and Prevention (CDC), particularly for emerging conditions such as COVID-19, for which data needs are constantly evolving. Since the beginning of the pandemic, CDC has collected person-level, de-identified data from jurisdictions and currently has more than 8 million records. We describe how CDC designed and produces 2 de-identified public datasets from these collected data.

Methods: We included data elements based on usefulness, public request, and privacy implications; we suppressed some field values to reduce the risk of re-identification and exposure of confidential information. We created datasets and verified them for privacy and confidentiality by using data management platform analytic tools and R scripts.

Results: Unrestricted data are available to the public through Data.CDC.gov, and restricted data, with additional fields, are available with a data-use agreement through a private repository on GitHub.com.

Practice Implications: Enriched understanding of the available public data, the methods used to create these data, and the algorithms used to protect the privacy of de-identified people allow for improved data use. Automating data-generation procedures improves the volume and timeliness of sharing data.


Introduction
Federal open data initiatives that promote increased sharing of federally collected data [1,2,3] are important for transparency, data quality, trust, and relationships with the public and with state, tribal, local, and territorial (STLT) partners [4]. These initiatives advance understanding of health conditions or diseases by making data available to more researchers, scientists, and policy makers for analyses and other valuable uses. Data sharing initiatives are particularly important during the COVID-19 pandemic, when data needs are constantly evolving and there is much to learn about the disease.
As part of the COVID-19 coordinated response, jurisdictions share de-identified, patient-level data for each case with CDC. These data are sent daily in a combination of three formats - comma-separated values (CSV) files, direct data entry of case forms, and National Notifiable Diseases Surveillance System (NNDSS) electronic case notifications - to CDC's Data Collation and Integration for Public Health Event Response (DCIPHER) system. DCIPHER is a data management and analysis system built on Palantir Foundry [5] software that allows analysis via R [6], Python [7], and an analytic tool called Contour. The data are managed with a DCIPHER Case Surveillance Pipeline, a series of linked programs that cleans, collates, deduplicates, and transforms data to produce an analysis-ready epidemiological dataset used across the response. Data do not include direct identifiers but do include demographic characteristics, exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and comorbidities (Figure 1). CDC's Case Surveillance Section, the response group established to conduct surveillance activities and serve as the data steward over case data, created a new process that transforms the epidemiological dataset with privacy protection algorithms to systematically create anonymized subset data. This process uses automated workflows and R statistical software (version 4.0.3; The R Foundation) to implement and validate field-level suppression for k-anonymity [8] and l-diversity [9] levels to release microdata, monthly, in two public datasets:

• COVID-19 Case Surveillance Public Use Data - a "public use" dataset with 11 data fields, accessible via Data.CDC.gov [10], with an interactive visualization that allows the public to filter, sort, and perform exploratory analysis.

• COVID-19 Case Surveillance Restricted Access Detailed Data - a "scientific use" dataset with 31 fields and more stringent privacy protections that provides more detailed information for scientists, statisticians, journalists, and researchers; users must sign a Registration Information and Data Use Restriction Agreement (RIDURA) to access it through a private GitHub repository [11].
To increase usability and foster transparency, this paper describes the dataset definitions, the design of the pipeline that creates them, and the rationale for the privacy protections.

Materials and Methods
Multiple groups within CDC's emergency response organization worked together to design the public case datasets. The Surveillance Review and Response Group (SRRG), a group established to improve data use within the response, worked with the Case Surveillance Section to use privacy heuristics, available guidance, and codes of practice [12,13] to design the Data Sharing Privacy Review Procedures, a seven-step process (Figure 2) that implemented CDC's data release policies [14] to protect privacy and publish useful and accessible data. This privacy review process was used to derive two datasets from the epidemiological dataset (Figure 1). Data elements were selected for inclusion in both datasets based on usefulness, public request, and privacy implications. Specific field values (e.g., age_group, race_ethnicity_combined) were suppressed to reduce the risk of re-identification and exposure of confidential information. Datasets were created and verified for privacy and confidentiality standards using Palantir Contour and R scripts [15] with the sdcMicro package [16].
The privacy procedures reduce the risk of re-identifying patients while providing useful information. To meet these privacy protection needs, not all variables can be released. Because the public use dataset is widely accessible, its data are the most restricted; the scientific use dataset is released only to approved researchers and includes more variables.

Re-identification risk cannot be reduced to zero, but this systematic process is designed to keep the risk low [17] to protect the individuals whose data contribute to these public datasets.
Step 1: Classify Variables. All variables from the epidemiological dataset were reviewed and classified according to their sensitivity into one of four categories: direct identifiers, quasi-identifiers, confidential attributes, and non-confidential attributes. Direct identifiers are variables that would unambiguously identify an individual (e.g., name, address); although CDC does not receive these types of data, each field is checked to confirm that no identifying information is contained in an open-ended or free-text response. Quasi-identifiers are fields that may identify an individual if they occur rarely enough in a dataset or could be combined with other fields or data (e.g., age group, sex, county). Confidential attributes are sensitive information that would not commonly be known about an individual (e.g., first positive specimen date). Non-confidential attributes are general information that cannot be used to identify individuals but still may potentially be combined with other data (e.g., case status). Fields are reviewed individually and as a combined set of fields within the dataset. From this review, all potential fields were either included, excluded, or transformed to reduce sensitivity. For example, date_of_birth is excluded; instead, we created a generalized age_group field using ten-year bins with a top-coded bin for ages 80 and older.
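The generalization just described, replacing date_of_birth with a ten-year age_group bin top-coded at 80 and older, can be sketched as follows. This is an illustration only; the reference date and bin labels are assumptions, not the pipeline's actual values:

```python
from datetime import date

def to_age_group(date_of_birth, reference=date(2020, 11, 19)):
    """Generalize a date of birth into a ten-year age bin, top-coded at 80+.

    The reference date and label format are illustrative assumptions.
    """
    if date_of_birth is None:
        return "Unknown"
    # Age in completed years as of the reference date
    age = reference.year - date_of_birth.year - (
        (reference.month, reference.day) < (date_of_birth.month, date_of_birth.day)
    )
    if age >= 80:
        return "80+ Years"
    low = (age // 10) * 10
    return f"{low} - {low + 9} Years"
```

A 1985 date of birth, for example, generalizes to the 30 - 39 bin, and every age of 80 or older collapses into the single top-coded bin, so rare very old ages cannot single out an individual.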
We finalized the design of the datasets by identifying the specific fields included in the public use (Supplement 1) and scientific use (Supplement 2) datasets. Fields were identified by weighing their analytical usefulness against their re-identification risk. Fields included in the datasets were adjusted over time to incorporate feedback. For example, additional geographic fields for county were added to the scientific use dataset, and race/ethnicity was added to the public use dataset. Geographic fields are included only in the scientific use dataset, which is available solely to researchers who sign a data use agreement.
For the public use dataset, the most current dataset contains 11 fields, with three quasi-identifier fields - sex, age_group, race_ethnicity_combined - and one confidential attribute - pos_spec_date. The scientific use dataset includes 31 fields, with six quasi-identifiers - sex, age_group, race_ethnicity_combined, res_county, res_state, hc_work_yn - and one confidential attribute - pos_spec_date. These fields are used in subsequent steps to establish and check cell suppression levels (Table 1).
Step 2: Review for Personally Identifiable Information (PII). We reviewed all data fields to confirm that no PII was present. All data fields were limited to categorical, date, and numeric values and were reviewed to confirm that they could not contain PII. All free-text data fields that had the potential to contain PII were excluded.
Step 3: Set Privacy Levels. We established privacy thresholds by defining the minimum acceptable number of records in the dataset that share quasi-identifiers. Although there is no universal threshold [18], a minimum level is suggested, with 5 a common recommendation and values above 5 uncommon [19]. We set this level at 5 to be conservative and consistent with previous approaches used in public health [20]. This means that no fewer than 5 records are allowed to share values from a single quasi-identifier field, or from any combination of quasi-identifier fields. Workflows called "Contour boards" were created in DCIPHER to automatically detect any combination of quasi-identifiers meeting our criteria for small-cell suppression and set those fields to "NA". Only field values were suppressed; records remained in the dataset so researchers can identify when suppression criteria were applied. When suppressing fields, data managers made every effort to suppress as few fields as possible while meeting the privacy level (Table 1).
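The small-cell detection performed by the Contour boards can be approximated in a few lines. This is a hedged sketch, not the DCIPHER implementation; the field names are the public use dataset's quasi-identifiers:

```python
from collections import Counter

QUASI_IDENTIFIERS = ("sex", "age_group", "race_ethnicity_combined")

def find_small_cells(records, quasi=QUASI_IDENTIFIERS, threshold=5):
    """Return the quasi-identifier combinations shared by fewer than
    `threshold` records; these cells are candidates for suppression."""
    counts = Counter(tuple(r[q] for q in quasi) for r in records)
    return {combo for combo, n in counts.items() if n < threshold}
```

Any combination this function returns would be set to "NA" in the affected records rather than removing those records from the dataset.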
Step 4: Re-code Variables. We used common variable coding techniques within the pipeline to clean and ensure uniformity of the responses within each field. Questions that were left unanswered in the case report form were re-classified to "Missing", with the following exceptions: age_group was recoded to "Unknown"; res_state was recoded to the reporting jurisdiction; and res_county and county_fips_code were left unchanged. Logic checks were performed on dates to detect illogical responses - for example, dates reported in the future, or dates reported prior to the onset of COVID-19 - and set them to "Null" until the jurisdiction provides an update. Additionally, the initial COVID-19 report date was examined, and when the value was blank upon receipt from the reporting jurisdiction, it was set to the date the data file was first submitted to CDC. The primary goal was to ensure consistency in applying suppression and to simplify the dataset for ease of use and analysis.
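The re-coding rules above can be sketched as below. The exception list follows the text; the earliest plausible date used in the logic check is an assumption for illustration:

```python
from datetime import date

def recode_missing(field, value, reporting_jurisdiction):
    """Recode an unanswered field per the exceptions described above."""
    if value not in (None, ""):
        return value
    if field == "age_group":
        return "Unknown"
    if field == "res_state":
        return reporting_jurisdiction
    if field in ("res_county", "county_fips_code"):
        return value  # left unchanged
    return "Missing"

def check_date(reported, received=date(2020, 12, 4), earliest=date(2019, 12, 1)):
    """Null out illogical dates (in the future, or before the pandemic)
    until the jurisdiction provides an update. `earliest` is an assumed cutoff."""
    if reported is not None and (reported > received or reported < earliest):
        return None
    return reported
```

Uniform coding like this matters downstream: suppression counts in steps 5 and 6 treat "Missing" and "Unknown" as ordinary category values, so inconsistent blanks would fragment groups and distort the frequency counts.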
Case-based surveillance data are dynamic, and jurisdictions can modify and re-submit records when new information becomes available; therefore, records may change between releases as deduplicated updates are applied. Data are only included in the public datasets after a 14-day delay based on the cdc_report_dt field. This allows data managers time to review responses and work with jurisdictions to correct data quality issues. The original release on May 18 used a 30-day window, but this was shortened in subsequent updates after improved data quality reviews showed minimal changes after 14 days.
Step 5: Review k-Anonymity. Each time datasets are generated, we review them for k-anonymity. K-anonymity is a technique used to reduce the risk of re-identifying a person or linking person-specific data to other information based on a rare combination of quasi-identifiers. We use this technique to suppress quasi-identifier values so that each person contained in the released dataset cannot be distinguished from at least k-1 other persons who share the same quasi-identifiers [8]. This technique uses the privacy thresholds established in step 3 across all quasi-identifiers classified in step 1.
Figure 3 shows an example of how k-anonymity is used to suppress record quasi-identifier values, using only 10 records to illustrate how k-anonymity applies to the entirety of both datasets. Fields on the left are the raw data before suppression. The frequency field indicates the number of records in the example dataset that have the same combination of quasi-identifiers; for example, the first record has frequency = 1, meaning that its combination of sex, age_group, and race_ethnicity_combined quasi-identifiers occurs only once within the data. Since we require 5-anonymity, we suppress fields so that each record's quasi-identifiers occur at least 5 times. After suppression, the frequency field shows that each record's quasi-identifier combination occurs 5 times. Note that records are never removed; in this example, since we suppress the fewest fields possible to create a cell with 5 members, only sex and race_ethnicity_combined were suppressed and we were able to leave age_group unchanged. This example includes the three quasi-identifiers within the public use dataset but functions the same for the scientific use dataset using its six quasi-identifiers.
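The suppression step can be sketched as follows. Note that this simplified version suppresses every quasi-identifier of a rare record, whereas the production pipeline (and tools such as sdcMicro's local suppression) suppresses as few fields as possible, as in the example where age_group survived:

```python
from collections import Counter

def k_anonymize(records, quasi, k=5):
    """Suppress (set to "NA") the quasi-identifier values of any record
    whose combination of quasi-identifiers occurs fewer than k times.
    Simplified sketch: suppresses all quasi-identifiers of a rare record."""
    counts = Counter(tuple(r[q] for q in quasi) for r in records)
    suppressed = []
    for r in records:
        r = dict(r)  # records are never removed, only field values suppressed
        if counts[tuple(r[q] for q in quasi)] < k:
            for q in quasi:
                r[q] = "NA"
        suppressed.append(r)
    return suppressed
```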
Each time datasets are regenerated through the pipeline, data managers use R programs [15] to verify that each generated dataset meets the levels established in step 3. If any errors are detected, the pipeline is revised to correct the bug, and the datafile is regenerated and retested until both checks are satisfied. At the end of this step, each dataset is verified to be 5-anonymous.
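An independent verification pass, analogous in spirit to the R check described above (the actual verification uses the sdcMicro package), might look like this:

```python
from collections import Counter

def verify_k_anonymity(records, quasi, k=5):
    """Confirm that every fully observed quasi-identifier combination
    occurs at least k times; suppressed ("NA") values are skipped."""
    combos = [tuple(r[q] for q in quasi) for r in records
              if all(r[q] != "NA" for q in quasi)]
    return all(n >= k for n in Counter(combos).values())
```

Running the generation and verification as separate programs, as the pipeline does, means a bug in one is unlikely to be silently replicated in the other.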
The number of times fields are suppressed within each dataset varies with each monthly release and between datasets because suppression depends on the total number of rows in the dataset and on the number of included fields (Supplement 3, Supplement 4). Users should consider the amount of suppression within fields as they design and create analyses.
Step 6: Review l-Diversity. As an extension of the k-anonymity check, step 6 involves checking for l-diversity to reduce the risk of exposing confidential information about an individual. L-diversity, another technique to protect confidential information, checks that, for a group of individuals who share the same quasi-identifiers, at least l distinct values exist for each confidential attribute [9]. These datasets require 2-diversity so that confidential variables cannot be determined in situations where records share the same quasi-identifier values.

Figure 4 shows an example of how l-diversity is used to suppress specific confidential values within records to meet the privacy levels. Again, the fields on the left are raw data. The distinct field indicates the number of unique pos_spec_dt confidential field values shared by all records with the same quasi-identifiers; notice that some records have a distinct value of 1 because they all share the same sex field value of "Female," age_group field value of "0-9," and race_ethnicity_combined field value of "Asian, Non-Hispanic," and all share the same pos_spec_dt value of "2020-03-31." Since our requirement for the dataset is 2-diversity, the confidential field is suppressed and set to "NA" so as not to reveal the pos_spec_dt value. The distinct value remains 1, but now the value is "NA" and cannot be known. This prevents someone from knowing the specific specimen date just because they know the person's sex, age group, race, and ethnicity. Records are never removed; only field values are suppressed.
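The 2-diversity rule can be sketched in the same style; this is a simplified illustration, not the pipeline's implementation:

```python
from collections import defaultdict

def enforce_l_diversity(records, quasi, confidential, l=2):
    """Suppress a confidential value whenever the group of records sharing
    its quasi-identifier combination holds fewer than l distinct values
    for that attribute."""
    distinct = defaultdict(set)
    for r in records:
        distinct[tuple(r[q] for q in quasi)].add(r[confidential])
    out = []
    for r in records:
        r = dict(r)  # records are never removed, only field values suppressed
        if len(distinct[tuple(r[q] for q in quasi)]) < l:
            r[confidential] = "NA"
        out.append(r)
    return out
```

As in the figure, a group whose members all share one pos_spec_dt has its date suppressed, while a group with two or more distinct dates is released unchanged.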
Step 7: Research Links. Finally, in step 7, to reduce the risk of the "mosaic effect" [1], we researched other publicly available datasets that could be linked by quasi-identifiers to identify individuals. The mosaic effect is a risk whereby information within an individual dataset may not identify an individual but, when combined with other available datasets, may allow individuals to be identified. This risk is reduced through the use of k-anonymity levels, which limit the number of rare combinations of quasi-identifiers that could be linked to other datasets; however, it is challenging to completely eliminate this risk. We reviewed the 13 COVID-19-related datasets published on Data.CDC.gov at that time [20]. We were not able to exhaustively search all available datasets, but we did review quasi-identifiers against the other 543 datasets published by CDC at that time with machine-readable metadata available through the Data.CDC.gov public data catalog [22].

Results
The public use dataset, updated monthly, was published to https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf on May 18 and originally contained 339,301 records with 9 fields. On May 29 we added onset_dt, and on June 27 we added race_ethnicity_combined. As of December 4, it contains 8,405,079 records, representing every case received through November 19 (Supplement 3). To support the most users, CDC releases these data following the FAIR Guiding Principles of findability, accessibility, interoperability, and reusability [24], using machine-readable CSV formats and an open, standards-compliant application programming interface. The dataset has been viewed over 438,000 times and downloaded over 24,000 times (Supplement 5).

Discussion
Public datasets are needed for open government and transparency, promotion of research, and efficiency. Specifically, COVID-19 case data transparency is important for fostering and maintaining trust and relationships with the public and STLT public health partners [26]. To balance the need to create and share public use datasets with the protection of patients' privacy and confidential information, we created a seven-step data sharing privacy review to protect privacy and publish useful data.
There are a large number and variety of repositories for public datasets [27]. Given the number of repositories and the number of datasets contained within each, we were unable to develop a practical, systematic process to review all public datasets and ensure with complete certainty that the risk of re-identifying patients in our datasets through the use of quasi-identifiers is eliminated. For example, a single popular repository for public research data, figshare.com, returned 803 results for "COVID", illustrating the large number of datasets that exist.
We compensated for this by reducing the number of variables, generalizing variables, and establishing conservative k-anonymity levels. As methods improve for comparing data with other released datasets to rule out security concerns, we could include additional fields or apply more precise privacy levels, making the data more useful for analysis.

Practice Implications
Systematic privacy review procedures are important for data engineering purposes, enabling collaboration on and validation of data design across systems, locations, and teams. Privacy review is complex, and its requirements must be understood by epidemiologists, statisticians, data product owners, informaticians, analysts, health communicators, and data custodians so that they are implemented, tested, and applied reliably each time a dataset is updated. Automated computational privacy controls are important for meeting the volume and schedule of data updates while reliably meeting privacy requirements; this is not possible with manual processes.
Release of these datasets has led to improved data quality by incorporating user feedback into continual improvements of the data pipeline for public and non-public data, such as consistently coding missing values, adding county coding, and more accurately identifying state and county of residence. Public data are part of the data feedback loop throughout the data lifecycle, in which more users of the data are able to identify and prioritize data features and bug fixes.
Through the creation of these datasets and implementation of computational privacy protections, CDC contributed to a knowledge base of COVID-19 data practices that will be used for the design and publication of additional datasets beyond case surveillance. As of November 18, 2020, CDC publishes 40 different COVID-19 public datasets on Data.CDC.gov [28]. Currently, two datasets use these computational privacy protections; additional datasets will be published based on feedback and public health program priority.
These case datasets are now available to the public for review, use in research, and to improve data transparency with partners.The practices and tools developed to design and release these data are available to other programs within CDC's COVID-19 response through the shared data pipeline, privacy review procedures maintained by SRRG, and computational privacy review software.With increased, systematic releases of these public datasets and more training and information available, we expect increased use and greater public health benefit.

Figure 1. Case Surveillance Data Flow Process Includes Specific Pipelines to Implement Unique Privacy Protections for each Public Dataset

Figure 2. Privacy Review Steps Used in Designing Public Datasets (PII = Personally Identifiable Information)

Figure 3. Example of k-anonymity Field Suppression for Quasi-Identifiers

Figure 4. Example of l-diversity Field Suppression for Confidential Attributes [23]

The scientific use dataset, updated monthly, was published to a private GitHub repository on May 18, containing 315,593 records with 29 fields. On June 27 we updated the dataset to 31 fields: we combined race and ethnicity into race_ethnicity_combined and added res_state, res_county, and county_fips_code. As of December 4, it contained 8,405,079 records, representing every case received by CDC through November 19 (Supplement 4). GitHub is a third-party website that CDC uses to make it easier for researchers to download datafiles as industry-standard, zip-compressed CSV files. Dataset descriptions and RIDURA instructions are available at https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Restricted-Access-Detai/mbd7-r32t. The dataset had been accessed by 94 researchers as of December 11, 2020, and Google Scholar shows two papers referencing these data [23].