A structured course of disease dataset with contact tracing information in Taiwan for COVID-19 modelling

The COVID-19 pandemic has flooded open databases with population-level data. However, individual-level structured data, such as the course of disease and contact tracing information, is almost non-existent in open databases. Publish a structured and cleaned COVID-19 dataset with the course of disease and contact tracing information for easy benchmarking of COVID-19 models. We gathered data from Taiwanese open databases and daily news reports. The outcome is a structured quantitative dataset encompassing the course of the disease of Taiwanese individuals, alongside their contact tracing information. Our dataset comprises 579 confirmed cases covering the period from January 21, to November 9, 2020, when the original SARS-CoV-2 virus was most prevalent in Taiwan. The data include features such as travel history, age, gender, symptoms, contact types between cases, date of symptoms onset, confirmed, critically ill, recovered, and dead. We also include the daily summary data at population-level from January 21, 2020, to May 23, 2022. Our data can help enhance epidemiological modelling.


Background & Summary
Current epidemiological models rely on population-level data, demanding precise measurements of infected individuals and deaths.However, the accuracy of reported infection numbers can be compromised due to limited testing.In Stockholm, Sweden, Roxhed et al. found a seroprevalence of 12.5%, implying a significant discrepancy between the potential 150,000 infections and the 10,000 PCR-confirmed cases reported during the study period.Estimating mortality rates is also challenging.The WHO reported approximately 15 million excess deaths by December 2021, compared to 5.4 million confirmed COVID-19 deaths.The Economist's estimates varied from 15 million to 25.2 million excess deaths, far exceeding reported COVID-19 deaths.Additionally, calculating the basic reproduction number (R 0 ) is challenging due to reporting errors in infection numbers.For COVID-19, R 0 estimates ranged widely from 0.17 to 4.5 with broad confidence intervals 1 .These issues highlight the unreliability of commonly used population-level measurements.
To mitigate the inaccuracies inherent in population-level data, utilizing more informative individual-level data is advisable.For example, Lee et al. published a case report of the course of disease of a 46-year-old woman diagnosed based on positive real-time RT-PCR tests from oropharyngeal swab samples 2 .The days of fever, cough, dyspnea, dysuria, etc are listed.With three consecutive negative results of real-time RT-PCR tests for SARS-CoV-2, the woman was discharged after 24 days of hospitalization.In another instance, the report on France's first three cases 3 meticulously documented essential details, including travel history, symptom onset, medical consultations, hospitalization, confirmation, critical condition, and contact tracing.
Individual-level data finds valuable application in predicting specific risks.Garibaldi et al. developed the COVID-19 Inpatient Risk Calculator (CIRC), forecasting severe illness or death within 7 days for COVID-19 patients based on data from 832 consecutive admissions 4 .Wongvibulsin et al. created a system predicting progression from moderate to severe illness or death within 14 days using demographics, admission sources, comorbidities, vital signs, lab measurements, and clinical severity data from 3494 COVID-19 patients 5 .In the pandemic's early stages, when individual-level data accessibility was limited, Lichtner et al. trained a model on 718 non-COVID viral pneumonia patients' data, effectively predicting adverse outcomes in critically ill COVID-19 patients 6 .Unfortunately, none of these datasets are publicly available.
In this study, data is collected from diverse open-source databases, resulting in a comprehensive structured dataset encompassing demographics, epidemiological details, disease progression information, and subject contact tracing.We highlighted that this dataset has potential for use in model development and epidemiological analysis.Previously, we successfully forecasted the end of the third local outbreak in Taiwan based on the ratio of daily infected to suspected cases utilizing the daily summary data we included in this dataset 7 .Furthermore, we have trained an agent-based model using the individual data in this dataset.We have conducted an analysis on the detection rate and its impact on crucial epidemiological estimates, such as R 0 1 .

Methods
We collected individual-level COVID-19 data from public Taiwanese databases and converted the data into a unified format., and Taiwan Centers for Disease Control press conference (CDC press conference, https://www.youtube.com/@taiwancdc) 12.All cases are assigned a unique identifier across data sources, ensuring consistency and accuracy when combining datasets.The flow chart of the data collection process is shown in Fig. 1.

Data collection.
Each of these sources provides data under open licenses that permit free academic use, transformation, and sharing within the scope of copyright, provided that proper attribution is made as specified in their respective terms of use.
Individual-level data.We collected individual-level epidemiological data, course of disease data, and contact tracing data from January 21st, 2020 to November 9th, 2020 covering the first outbreak in Taiwan.The individual-level data in our dataset primarily originates from daily reports issued by the Taiwan CDC.These reports contained detailed information about confirmed COVID-19 cases.While this source was informative, the data was presented in an unstructured text format, necessitating manual conversion into a structured and quantifiable format.We also obtained valuable data from the COVID-19 Visualization, which provided visualizations of the contact networks among locally confirmed cases.Furthermore, the COVID-19 Dashboard served as a source for the course of disease data, including report dates and confirmed dates.We augmented this information with metadata and conducted validation processes based on CDC press conference.
The confirmed cases data contained 579 samples with 64 features including travel history, age, gender, nationality, the onset of symptom, confirmed date, symptoms, way of discovery, and contact types between cases.
We categorized the contact type into groups such as couple, parents, grandparents, siblings, family, friends, live together, flight, flight (nearby seat), travelling, school, car, coworker, hospital, hotel, Panshi combat ship (military ship), Coral Princess (cruise ship), and others.Some cases also included ICU admission, recovery, and death dates, as available from our sources.When accessible, Taiwan CDC provided details like the count of close contacts, contact dates, and case indexes for contacts.Crucial date information, such as the infection date, is not Fig. 1 Flowchart depicting the collection process for Taiwanese COVID-19 data, including daily summary data, the course of the disease for confirmed cases, and contact information.
available in the original data sources.Therefore, in instances where there are multiple contact dates, we designate the first contact date as 'earliest infection date' to maintain consistency throughout our dataset.It's important to note that this is a oversimplified definition of the infection date, adopted primarily to facilitate easier visualization in subsequent sections.For detailed analysis, users are advised to refer directly to the original data attribute 'Date of contact with infected case' , where all recorded contacts are preserved.
Daily summary data.The daily summary data, sourced from Taiwan CDC press releases 8 , provides essential population-level information.However, inconsistencies in reporting intervals and changes in counting methods, notably on 2020-03-06, impact data accuracy.To mitigate this, we accessed the 'Daily Number of Cases Suspected SARS-CoV-2 Infection Tested' dataset from the Taiwan CDC Open Data portal 11 .This dataset comprises daily counts of suspected SARS-CoV-2 cases derived from three sources: notifications of infectious diseases, home quarantine and inspection, and expanded monitoring.The data spans from January 21, 2020, to May 23, 2022, providing a comprehensive overview of the pandemic's progression.To ensure data consistency and reliability, we manually transformed this data into a structured format.This dataset includes daily statistics on various COVID-19 aspects, such as the number of suspected cases, excluded cases, abroad positive cases, local positive cases, positive cases with unknown sources, deaths, recovery cases, and hospital quarantine cases.

Data preprocessing.
When structuring the data, we identified a few anomalous cases, which we investigated as part of our data preprocessing.For instance, Case ID 19 was confirmed on February 15th, 2020.However, due to a mistaken diagnosis of pneumonia, ID 19 was placed in the ICU on February 3rd, 2020, which was before the confirmed date.Since there was no hospital transmission, it's likely that the case was isolated after being admitted to the ICU.Therefore, the confirmed date (monitor isolation) should be readjusted to February 3rd, 2020.Case ID 530 was later identified as a false positive and subsequently removed from the dataset by the Taiwan CDC.Case ID 555 had contact with infected individuals on September 13, 2020, and was confirmed as a COVID-19 case on   confirmed on October 30, 2020, which resulted in a long period of susceptibility to infection.To rectify this, we removed case ID 555's earliest infection date.

Data Records
The data is publicly available in figshare repository 13 .A detailed explanation of each variable and its data type can be found in the codebook on the repository.We provide an overview and examples showing the types of data available in our dataset in the following sections.

Data description of individual-level data.
Our dataset encompasses a total of 578 cases, including epidemiological data, course of disease data, and contact tracing data.In this section, we offer a comprehensive overview of the dataset's characteristics and composition.Table 1 provides an overview of epidemiological data, categorizing it based on case origin, travel date, travel history, gender, and age.Each category lists the number of cases alongside their corresponding percentages.Additionally, Table 2 explores the distribution of the course of disease data, indicating data availability through the count of cases with available data, along with the respective percentages.Specific aspects, such as asymptomatic date, symptoms, symptomatic date, confirmed date, critically ill date, recovery date, death date, and report to CDC date, are highlighted.For simplicity, we refer to the 'asymptomatic date' as the 'earliest infection date.' Again, this is an oversimplification of the infection date.Among 578 subjects, 452 had reported symptoms recorded by Taiwan CDC, and 442 of them provided their symptomatic dates.The number of recovered and deceased cases is reported in CDC's daily summary data.The number of critically ill cases is not directly provided by Taiwan CDC and is thus estimated as the sum of critically ill and deceased cases.
Kaplan-Meier plot.The Kaplan-Meier plot is a statistical tool used to estimate and visually represent the probability of an event occurring over time, commonly applied in medical studies to illustrate survival or state transition dynamics.In a subject's course of disease, a subject could go to asymptomatic (infection with no symptoms, I A ), symptomatic (infection with symptoms, I S ), confirmed (C) critically ill (ICU admission, I C ), recovered (two consecutive negative PCR tests, R), and death (D) as shown in the bottom right plot in Fig. 2. Infection and immune status can be determined by an antigen test, PCR test, and antibody test.The nonpharmaceutical interventions (NPI) inhibit the transition from S to I, and the intensive care units (Care) activate the transition from I C to R, and the vaccination activates the transition from S to R. In Fig. 2, the Kaplan-Meier plot illustrates state transitions from one state to another without passing through intermediate states.For example, the samples from symptomatic to recovered do not include individuals who transitioned from symptomatic to critically ill to recovered.

Contact network.
The contact network comprises 8,842 nodes, with 578 nodes representing infected individuals, constituting approximately 7% of the total nodes.In terms of edges, the network encompasses a total of 46,008 connections, of which 1,183 (2.6% of the edges) are infection-related.Among these edges, 37 have identifiable infection paths, resulting in directed edges, while the remaining 1,146 are forming undirected edges.The proportion of directed edges to total edges is 0.08%, rising to 3.1% for infection-related edges.Additionally, the network displays 152 clusters, ranging in size from 2 to 1,898 nodes, with a median cluster size of 16.5 and a mean cluster size of 58.2.
The left panel of Fig. 3 presents a mixed graph representation of a contact network, where infection paths are denoted by arrows.The contact network is visualized with Cytoscape.In this instance, the initial case, I277, transmits the infection to individuals I269 and I288, and subsequently, I269 spreads the infection to I299.Distinct contact types such as 'family' (highlighted in red), 'same flight' (depicted in blue), and 'other/ unknown' connections (in gray) are visually differentiated.Smaller nodes represent uninfected contacts within the network.
The right panel of Fig. 3 provides an overview of different contact types within the network.These contact types are categorized to ensure mutual exclusivity and comprehensive coverage.For simplicity, the plot shows larger contact groups.For example, diverse ship-related contacts, such as 'Ship' , 'Panshi fast combat support ship' , and 'Coral Princess' , were consolidated under the category 'Ship' .
The size of the contact network exceeds the number of infected individuals because it includes contact information with suspected cases, thereby providing a comprehensive view of interactions within the dataset.

technical Validation
We conducted a thorough validation of our dataset through a crossover check using diverse datasets from various sources.Specifically, each case's course of disease was verified through a double-check process involving CDC press releases and press conferences, while the accuracy of the contact tracing data was confirmed by cross-referencing CDC press releases, CDC press conference, and COVID-19 Visualization.Additionally, the integrity of the final dataset underwent a double-check process involving other lab members.
To enhance the validation of our dataset, we performed an in-depth analysis and compared our results with other research focused on COVID-19 in Taiwan.Liu et al. 14 conducted a study akin to ours, where they examined 321 imported cases sourced from Taiwan CDC press releases.Their analysis primarily focused on the interaction between demographic information, import sources, symptoms, and travel history.It's essential to note that our dataset includes 579 cases, encompassing their 321 cases offering a more comprehensive view.Our data spans from January 21, 2020, to November 9, 2020, and has been meticulously transformed into a structured, accessible format, ensuring ease of access and utilization.
In our validation process, we replicated a portion of Liu et al. 's statistical analysis using our dataset.We isolated the 321 imported cases (case ID  2, illustrating the relationship between the date of arrival and the infection source.The overall trend closely aligns with the original article, with minor variations potentially attributed to differences in defining continents or regions.When multiple travel countries were involved, and Taiwan CDC didn't specify the infection source country, we defined the last country as the infection source.Figure 4C corresponds to their Fig. 3, depicting arrival dates vs onset-of-symptom and case numbers.Our reproduction maintains the original pattern, although slight discrepancies, like those on days 1 and 2. Figure 4D replicates Liu et al. 's Fig. 4, with minor differences due to missing information on the source of detection for some cases, especially in contact tracing.

Usage Notes
We introduced a Taiwanese dataset that encompasses both population-level and individual-level data related to COVID-19.This meticulous approach to collecting and structuring data has resulted in a comprehensive dataset that empowers researchers to delve into individual-level COVID-19 data with confidence.To the best of our knowledge, this dataset represents the first comprehensive public structured dataset for COVID-19, providing individual-level information.However, it's important to acknowledge the limitations of this dataset, primarily stemming from its reliance on publicly available sources.As explained by Taiwan CDC, detailed information on recovered cases has not been reported since March 12, 2020, due to privacy concerns, with only the daily count of recoveries provided.Similar limitations apply to critically ill cases.Furthermore, obtaining the precise infection date is inherently challenging and would necessitate more extensive individual contact history data.These factors collectively impact the depth of information available in our dataset.Concerning the contact information for subjects, our dataset depends on the information available in the CDC's reports.Importantly, contact tracing poses inherent challenges, and despite the CDC's best efforts, it is likely that some contacts may be missed.As a result, the contact numbers we provide should be considered lower-bound estimates of the true values.Moreover, the 'identifiable infection paths' are reported by the Taiwan CDC.However, the detailed methodology for these infection paths is not disclosed.Users should be cautious when interpreting the results.
In conclusion, we provided a structured dataset for COVID-19 in Taiwan in the early phase of the pandemic, covering both individual-level and population-level data.Despite the limitations due to the scope of data collection and privacy concerns, this dataset offers more details for epidemiological modelling compared to the existing population-level data.The dataset includes a wide range of variables, such as epidemiological data, course of disease data and contact types between infected and uninfected cases.We hope that this dataset will empower researchers to develop more accurate and predictive applications for combatting future pandemics.Future work could focus on expanding the dataset with more individual-level data and refining the data processing techniques to extract more valuable information.

Fig. 2
Fig.2 Daily state transition Kaplan-Meier plot of Taiwanese data.The 95% confidence interval was calculated by Greenwood's Exponential method.The different states are denoted as asymptomatic (I A ), symptomatic (I S ), confirmed (C) critically ill (I C ), recovered (R), and death (D).The figure depicts the state transition process.

Fig. 3
Fig. 3 Visualization of the contact network of case I269 cluster and bar chart of each contact type.The left figure shows the visualization of the contact network of case I269 cluster.Infected cases are highlighted in pink and depicted as larger nodes, while uninfected cases are illustrated in white and represented by smaller nodes.The network is presented as a mixed graph, where infection paths are denoted by arrows and other connections are indicated by solid lines.Contact types are differentiated by color: contacts are shown in red, 'Flight' contacts are marked in blue, and 'Municipality' connections are depicted in gray.The right figure shows the bar chart depicting the distribution of each contact type.The chart displays the number of edges for infected cases in blue and all edges in green.A total of 14 different contact types are represented along the x-axis.

Fig. 4
Fig. 4 Visualization of Figs. 1 to 4 from Liu et al. using our dataset.The overall trend aligns with their study.There are only a few minor differences.Figure B includes an additional 'unknown' category, indicating cases where the exact source was not identified.In Figure C, we observed more cases on day 2 than on day 1, contrary to Liu et al. 's findings.Notably, we observed cases beyond day 13, a deviation from Liu et al. 's observations.In Figure D, we lack some data in the contact tracing and testing group, especially in the asymptomatic-on-arrival bar.
The data was collected from the following open-source databases: Taiwan Centers for

Table 1 .
Description of epidemiological data.This table outlines the distribution of epidemiological data within the Taiwanese dataset, categorizing it by case origin, Date of traveling, Travel country, gender, and age.The dataset comprises a total of 578 cases, 486 of these are cases from abroad.The numbers for both 'Date of Traveling' and 'Travel Country' should add up to 486, accounting for known and unknown values.The 'None' category represents non-abroad cases, which naturally do not contain travel-related information.For each category, both the number of cases and their respective percentages are provided.
September 18, 2020, in the Philippines.Subsequently, the case was discharged from the hospital on October 3, 2020, tested negative on October 9, 2020, and travelled to Taiwan on October 15, 2020.The case was tested and

Table 2 .
Description of course of disease data.The table outlines the distribution of course of disease data within the Taiwanese dataset.Data availability is indicated by the number of cases with available data and the corresponding percentage.The asymptomatic date refers to the first contact date with the source case, i.e., the earliest infection date.
1 to 373), excluding local positive cases, among which 53% were female.Notably, 37.1% fell within the 20-29 age group, and 23.7% were in the 30-39 age group, closely mirroring Liu et al. 's findings with only a slight difference in the 20-29 age group (37.4% in their report).Our Fig. 4A reproduces Liu et al. 's Fig. 1, depicting the age-gender distribution.Figure 4B replicates Liu et al. 's Fig.