SARS-CoV-2 Genomic Contextual Data Harmonization: Recommendations from a Mixed Methods Analysis of COVID-19 Case Report Forms Across Canada

doi:10.21203/rs.3.rs-1871614/v1


Research Article


This work is licensed under a CC BY 4.0 License

Version 1


Background

The timely sharing of public health information is critical during a pandemic, yet it remains an obstacle that Canada has not fully addressed. During the current Coronavirus Disease 2019 (COVID-19) pandemic, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) viral genome sequencing has provided a deeper understanding of transmission patterns, enabled the identification of variants of concern, and facilitated the development and evaluation of diagnostic tests and vaccines. The Canadian national response faces challenges in aggregating genomic contextual data and carrying out integrated analysis across regions, partly due to disparities in the case report forms used to capture epidemiological and clinical data. Such variations delay data integration and make consistent analysis difficult or impossible. The objective of this work is to understand what information is being collected by the SARS-CoV-2 case report forms used across Canada and to identify potential genome sequence data harmonization issues and solutions.

Methods

Provincial/territorial/national Canadian COVID-19 case report forms were subjected to field-by-field comparisons to identify variations in data categorization, structures, formats, types, granularity, ambiguity, and questions asked. Federal epidemiologists were consulted to substantiate the results.

Results

Data harmonization issues and common data elements were identified. We make recommendations for better national coordination, integrated databases, and data harmonization tools.

Conclusion

This report compares data elements of the various case report forms used across Canada to identify overlaps and differences in the collection method of COVID-19 case information, while also highlighting data harmonization complications and potential solutions. Knowing which data elements are available to researchers and health officials will better inform the development of Coronavirus Disease 2019 surveillance and research questions.

COVID-19

SARS-CoV-2

Metadata

Data Collection

Data Curation

Public Health

Correlation of Data

Canada

Canada faces challenges in data comparison and integration across regions due to disparities in how questions and data are structured across the case report forms used to capture genomic contextual data. Case report forms are questionnaires often used in public health investigations and surveillance activities to capture epidemiological information regarding an ill individual. Contextual data is information that allows us to better understand the environment and circumstances surrounding sequence data, e.g. clinical case information, epidemiological data, laboratory conditions, methods, and genomic annotations. Variations make consistent analysis difficult or impossible as they limit the ability of epidemiologists and data analysts to perform crucial data discovery and aggregation tasks at scales beyond an individual collection agency [1]. A crucial element of the Coronavirus Disease 2019 (COVID-19) response is acquiring harmonized case data in order to construct a deeper understanding of the spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and the efficacy of public health interventions.

Canada’s health care system is decentralized, meaning that the ten provinces and three territories independently administer separate health care systems within their jurisdictions to provide care to their residents [2, 3]. Together these systems interlock to create a universal, single-payer health care system. While this structure offers advantages, such as allowing provinces/territories to develop methods of delivering health care that are appropriate for their population and geographical region, a salient vulnerability is the lack of a single, overarching authority to coordinate health care data management practices. Provinces and territories are not legally obligated to follow federal recommendations pertaining to health care or health data sharing [4], and they maintain their autonomy when it comes to regulating the collection of health care information [5]. Within a province or territory, there may be regional health authorities, or other front-line public health organizations, that have their own processes for health information data management [3]. Genomic sequencing of SARS-CoV-2 around the world has enabled tracking of the virus, identification of variants, and development of diagnostic assays, vaccines, and therapeutics [6–8]. However, the lack of coordinated data sharing practices across the numerous independent public health authorities in Canada has resulted in delayed access to and exchange of COVID-19 genomic and epidemiological information, and in reduced data quality due to variability in data streams.

As a consequence of a lack of data standards, Canadian COVID-19 case report forms are designed independently by provincial/territorial health authorities based on the perceived needs of each jurisdiction. While a national case report form was made available, there are many reasons why a province/territory may have chosen not to use it: a lack of data elements deemed necessary for a province/territory/agency-specific objective, a lack of capacity to integrate a new form into their data systems or to change forms in mid-use, a lack of means to disseminate it amongst their data collectors, and many more. While the content of provincial/territorial forms is similar, the information is encoded differently. There can be differences in how the information is structured, the kinds of questions being asked, and in the terminology being used that may cause discrepancies in downstream data (Fig. 1). These differences render data comparison and integration more burdensome and error-prone. Consequently, when data needs to be integrated for inter-jurisdictional analyses (e.g. inter-provincial outbreak investigations and surveillance), the data must be restructured and cleaned - a process which is time-consuming and labor-intensive. These data integration issues are a challenge for genomic surveillance, i.e. the use of viral genome sequences to track and control infectious disease, as the contextual data (largely collected via case report forms) are needed to interpret the sequence data. Thus, it is important that contextual data is shared in a timely manner, but these variations slow down efforts to perform large-scale, consistent analyses, and the intra-provincial/territorial/agency nature of how health data is collected within Canada makes it difficult to apply solutions at the case report form stage of data collection.

A viable, short-term alternative for addressing the inconsistencies of data sharing in Canada, specific to COVID-19, is to investigate the current methods of data collection in order to implement data harmonization solutions. Data harmonization is the process of reconciling data from different data streams in a manner which allows for meaningful comparison; i.e., data harmonization ensures all contextual datasets use the same fields, terms, and formats. Such an investigation into the variability of COVID-19 genomic contextual data would identify data sharing gaps which prevent more robust epidemiological, biomedical, and genomic analyses. Employing data harmonization tools would help address these gaps and help provide the best available evidence for governments across the country to guide public health action.

This study investigates the data harmonization delays and challenges currently faced in Canada, which stem partly from the use of different data collection instruments, and outlines how the Canadian COVID Genomics Network (CanCOGeN) VirusSeq Project [9] is working to solve these problems moving forward. The CanCOGeN initiative is a pan-Canadian partnership among academia; private sector; and regional, provincial/territorial, and federal governments to obtain and coordinate SARS-CoV-2 virus and patient host genomic sequence data as well as clinical/epidemiological contextual information. This analysis compares data elements of the various case report forms used across Canadian jurisdictions to understand what kinds of information are collected and how they are encoded, and to observe which elements are consistently available and thus should be prioritized to facilitate the harmonization of SARS-CoV-2 contextual information. While developing the CanCOGeN viral genomic contextual data specification for national surveillance, we found that jurisdictions were often unaware of what one another was doing. These discrepancies motivated our investigation and analysis of publicly available Canadian case report forms in order to propose new data standards and harmonization tools that improve the ease, quality, and capabilities of genomic health data management and collaboration.

This work utilizes comparative epidemiology methodology, focusing on understanding differences and commonalities across case report forms that collect epidemiological information. Comparative epidemiology is a synoptic methodology that aims to understand and analyze the principal factors of epidemics/pandemics [10]. The data elements of this study are primarily collected for applications in epidemiology and healthcare, but they can also be layered and combined with genomics results for use in public health intervention and surveillance (e.g., phylogenetic analyses, clinical manifestations of variants of concern, surveillance, etc.). Our goal is to obtain a comprehensive overview of common contextual data elements and the harmonization issues that may impede the sharing and aggregation of said data.

Within Canada, there is no universal data collection form for SARS-CoV-2 infected individuals. Some provinces and territories use their own forms, while others use a national form provided by the Public Health Agency of Canada (PHAC), created for the reporting of confirmed and probable COVID-19 cases and to facilitate the identification of COVID-19 outbreaks. For our analysis, Canadian federal, provincial, and territorial case report forms that target confirmed or presumptive SARS-CoV-2 infection cases were obtained electronically between 2020-03-03 and 2020-04-28 via open-access public health websites (Table 1). The most up-to-date versions of case report forms were obtained during the first few months of the COVID-19 pandemic and thus may not reflect changes to provincial, territorial, or national forms after June 1st, 2020. Provinces and territories that required the use of multiple forms are referenced when one or more of said forms utilized the data element/value of concern. Provinces and territories are abbreviated as follows: Alberta (AB), British Columbia (BC), Manitoba (MB), New Brunswick (NB), Newfoundland and Labrador (NL), Nova Scotia (NS), Nunavut (NU), Northwest Territories (NWT), Ontario (ON), Prince Edward Island (PEI), Québec (QC), Saskatchewan (SK), and Yukon (YK). Provincial and territorial forms were not observed in jurisdictions that reported to be using the PHAC national case report form [11]; namely, AB, NL, NS, NU, PEI, SK, and YK. Throughout this analysis, the national form is given significantly greater weight since seven out of thirteen provinces and territories were utilizing it at the time of this analysis (Table 1).

Table 1

Canadian provinces/territories and their associated COVID-19 case report forms and version information.
Province/Territory	Form	Version Number	Version Date (YYYY-MM-DD)
Alberta (AB) Newfoundland and Labrador (NL) Nova Scotia (NS) Nunavut (NU) Prince Edward Island (PEI) Saskatchewan (SK) Yukon (YK)	National - Public Health Agency of Canada (PHAC) Coronavirus Disease (COVID-19) Case Report Form	2	2020-03-03
British Columbia (BC)	BC COVID-19 Case Report Form		2020-04-20
Ontario (ON)	ON’s Severe Acute Respiratory Infection Case Report Form	7.0	2020-04-15
Québec (QC)	QC Coronavirus COVID-19 Déclaration Des Cas Confirmés Et Des Cas Cliniques De Covid-19		2020-04-28
Québec (QC)	QC Coronavirus COVID-19 Questionnaire D’enquête Des Cas		2020-04-02
Manitoba (MB)	MB Coronavirus Disease 2019 (COVID-19) Investigation Case Form		2020-05-05
Northwest Territories (NWT)	NWT COVID-19 Report Form (Suspect Case/Person Under Investigation) - Part A		2020-04-27
Northwest Territories (NWT)	NWT COVID-19 Report Form (For All Cases) - Part B		2020-04-27
New Brunswick (NB)	NB COVID-19 Combined Referral and Lab Requisition Form	5	2020-04-09
Table adapted from “Comparison and analysis of Canadian public health SARS-CoV-2 case report forms” [14]. Copies of the case report forms are available in the “Canadian COVID-19 Case Report Form Analysis Files” dataset [29].

Researchers were unable to locate official English translations of the French QC forms. Fields were directly translated by a research member with over 9 years’ experience in studying written and oral French (eight of which were immersion schooling). Fields with ambiguous meaning were initially paraphrased via Google Translate™ [12] before cross-checking against other non-COVID-19 case report forms or regional health documents that were available in both English and French. Any data elements that remained ambiguous were then confirmed by consulting with English-French bilingual medical doctors with working histories in both QC and BC.

All provincial/territorial case report form data fields were qualitatively mapped to the national form before a secondary review across all forms was performed to verify field mappings/counts. Field-by-field comparisons were performed manually, occasionally requiring an inference of meaning from surrounding information due to the lack of a formalized, common schema or accessible data dictionaries. Imperfect matches were later analyzed for how their variations impede data harmonization. A within-stage (simultaneous) mixed model combination of qualitative and quantitative analysis was then performed on data fields and their corresponding categories, terminology, structure/format, and level of granularity. The qualitative analysis assessed similarity of meaning under the aforementioned criteria and contexts, while the quantitative analysis determined occurrence frequencies.

Categories and terms were evaluated to be exact matches (words deemed identical, including those with alternate spelling), synonyms (exact, narrow, or broad), or completely different terms. Granular terms that could be classified under a broader umbrella synonym were permitted for counts of said broad synonyms, e.g., allowing “productive cough” to be classified as a “cough” for comparison with forms for which that was the highest level of granularity. Data values that contained more than one term (e.g., “Irritability/Confusion”) were analyzed in two different methodologies: 1) permitted as counts for the narrow use of the original terms (e.g., “Irritability” as well as “Confusion”) independent of one another unless case report form, and 2) considered in the broadest use (e.g., “Irritability” counting towards “Irritability/Confusion” but not vice versa). All data elements were then evaluated for potential syntactic and/or semantic ambiguity. From this information, we were able to highlight data harmonization and integration challenges that arise from the usage of distinct data collection instruments and then reached out to national epidemiologists via email to confirm that these challenges are factual.

Our analysis identified what COVID-19 case-related information was available, the frequency at which it occurred, how the data was structured, and how data values needed to be carefully defined to capture data of varying granularity. It also reviewed and highlighted specific harmonization challenges that can and have emerged from the use of different collection forms for the purpose of generating interoperable and comparable datasets [13]. Commonly collected data elements were identified with the intent of informing researchers and epidemiologists of what is and is not available to them for the design of surveillance and/or research questions. This information was critical in rapidly forming a pan-Canadian framework for public health emergency surveillance, enabling more efficient and accurate data sharing that can be leveraged for the surveillance and analysis of SARS-CoV-2 and other pathogens. Countries around the world are evaluating their genomic contextual data standards and looking internationally for standards as guidance. This analysis has resulted in the creation of a data standard (CanCOGeN VirusSeq) that is now being implemented internationally by other institutions and entities. Our investigation focuses on the critical moment of the early pandemic when SARS-CoV-2 data standards were not available, i.e., the period before the distribution of the initial standard and the publication of the case report form analysis CanCOGeN report [14].

Common Data Elements:

Data categories, elements, and types that appeared in the majority or all Canadian case report forms were identified. The focus of these results is on data that is explicit within a form, i.e., presented clearly within the text of the observed case report form. The most common fields and field categories used across all observed case report forms focused on the Name, Date of Birth (DOB), Phone Number, Gender, Symptom Onset Date, Symptoms (often used synonymously with Signs), and Pre-existing Conditions and Risk Factors of the individual under observation (Table 2). Information that could facilitate the linkage of virus sequence contextual data with other datasets (e.g. additional host sequence contextual data) includes Patient, Case, and Other Identifiers; Gender Field Values; Host Health State/Outcome; Host Health Status Details; and Host Resident Information (Table 3). Along with assisting in general COVID-19 public health surveillance, this information permits the study of relationships between disease outcomes and host demographic information when appropriately linked. Categories collected to help determine COVID-19 manifestations and severity were determined to be Signs and Symptoms, Pre-existing Conditions and Risk Factors, and Complications. Clinical diagnoses found within these categories and deemed present in all case report forms can be found in Table 4. The data element Symptom Onset Date was also found to be present in all case report forms (Table 4). This information is vital for epidemiological inferences - such as quantifying incubation period (the window of time between initial infection and signs of illness) - and for determining appropriate public health interventions.

Table 2

Overview of the data fields and field categories commonly found in the Canadian case report forms.
Case Information	Case Report Form
Case Information	National^a	BC	MB	NB	NWT	ON	QC
Name (First & Last)	✔	✓	✓	✓	✓	✓	✓
Date of Birth^b	✔	✓	✓	✓	✓	✓	✓
Phone Number	✔	✓	✓	✓	✓	✓	✓
Gender	✔	✓	✓	✓	✓	✓	✓
Symptom Onset Date^b	✔	✓	✓	✓	✓	✓	✓
Symptoms	✔	✓	✓	✓	✓	✓	✓
Pre-existing Conditions and Risk Factors	✔	✓	✓	✓	✓	✓	✓
“Case information” includes data elements associated with the host/patient being observed/diagnosed/tested. Table adapted from “Comparison and analysis of Canadian public health SARS-CoV-2 case report forms” [14]. ^a Applicable Provinces: AB, NL, NS, PEI, SK, YK. ^b Date formats are not consistent across all forms.

Table 3

Overview of “General Case Information” data fields commonly found in the Canadian case report forms.
General Case Information	Case Report Form
General Case Information	National^a	BC	MB	NB	NWT	ON	QC
Patient, Case, and other Identifiers
Personal Health Number		✓	✓	✓	✓		✓
Case and/or Other Identifiers	✔	✓	✓			✓	✓
Gender Field Values
Female, Male	✔	✓	✓	✓	✓	✓	✓
Unknown	✔	✓	✓			✓
Host Health State / Outcome
Symptomatic, Deceased	✔	✓	✓	✓	✓	✓	✓
Asymptomatic	✔	✓	✓		✓	✓	✓
Host Health Status Details
Hospitalized	✔	✓	✓	✓	✓	✓	✓
ICU, ICU Start Date	✔	✓	✓		✓	✓	✓
Date of Death / Disposition Date	✔	✓	✓		✓	✓	✓
Host Resident Information
City	✔	✓	✓	✓	✓	✓	✓
Address, Postal Code	✔	✓	✓	✓		✓	✓
^a Applicable Provinces: AB, NL, NS, PEI, SK, YK.
Table adapted from “Comparison and analysis of Canadian public health SARS-CoV-2 case report forms” [14].

Table 4

Overview of “Clinical Diagnoses” data fields commonly found in the Canadian case report forms.
Clinical Diagnoses	Case Report Form
Clinical Diagnoses	National^a	BC	MB	NB	NWT	ON	QC
Symptom Onset Date^b	✔	✓	✓	✓	✓	✓	✓
Signs and Symptoms
Cough	✔	✓	✓	✓	✓	✓	✓
Fever^c	✔	✓	✓	✓	✓	✓	✓
Headache	✔	✓	✓	✓	✓	✓	✓
Sore Throat	✔	✓	✓	✓	✓	✓	✓
Pre-Existing Conditions and Risk Factors
Cardiac Disease	✔	✓	✓	✓	✓	✓	✓
Diabetes	✔	✓	✓	✓	✓	✓	✓
Pregnancy	✔	✓	✓	✓	✓	✓	✓
Respiratory Disease	✔	✓	✓	✓	✓	✓	✓
Complications
Altered Mental Status	✔	✓	✓	✓	✓	✓	✓
Encephalitis	✔	✓	✓	✓	✓	✓	✓
^a Applicable Provinces: AB, NL, NS, PEI, SK, YK.
^b Significant variation in the recommended date format across case report forms.
^c Minimum temperature that defines a fever has some variation between forms or is not defined.
Table adapted from “Comparison and analysis of Canadian public health SARS-CoV-2 case report forms” [14].

The similarities informed researchers which data elements were useful across jurisdictions and thus should be included in the CanCOGeN VirusSeq data standard. This work highlights which data elements can cause downstream data harmonization issues for the national analysis of SARS-CoV-2 for public health surveillance and intervention. It informed CanCOGeN of which data elements currently being collected are common, how data was structured and the impact of that structure on ease of comparison, and how data values needed to be carefully defined to capture data of varying granularity.

Data Harmonization Challenges:

The following section discusses theoretical data harmonization issues that emerge as a consequence of using different Canadian case collection forms. Data harmonization issues in categorization, structure/format, values, granularity, semantics, and the use of disparate questions, were identified in this analysis (Table 5).

Table 5

Examples of harmonization issues identified in the case report form analysis.
Issue	Example
Data Categorization	“Risk Factors” could be presented as “Pre-Existing Conditions”, “Exposures”, both, or neither.
Data Structure/Format	“03/04/2021” date; unclear whether “3rd of April” or “4th of March”.
Data Type	Fever = “TRUE” or “FALSE” (i.e., ☐ ); Fever = ≥ 38°C; Fever = 102.5°F
Data Granularity	The terms “cough”, “dry cough”, “productive cough”, or “new onset cough” are used in different forms. When combining data, treating all these terms as synonyms can result in the loss of pathological information.
Semantic Ambiguity	Does "Isolation" mean "Self-Isolation", "Home Isolation", and/or "Hospital Isolation"? Is "Negative Pressure" applicable?
Disparate Questions	Not all forms request Indigenous identification data. Engagement with First Nations health authorities is inconsistent.
Table adapted from “Comparison and analysis of Canadian public health SARS-CoV-2 case report forms” [14].

1. Data Categorization:

Case report forms vary in the overarching categories they use to house their data fields, sometimes making the underlying data fields difficult to correlate and consequently integrate. For example, “pre-existing conditions” are a patient’s medical conditions prior to the infection of interest while “risk factors” are variables associated with increased risk of infection and can encompass internal (e.g. “pre-existing conditions”), external (e.g. “travel exposure”), or a combination of both (e.g. the behavioral risk of “smoking”). Since “risk factors” can encompass both “pre-existing conditions” and “exposures”, forms vary in their implementation - making it more difficult to collect, curate, and correlate underlying risk assessment data, potentially confounding analyses of risk. An overarching category may also change the field’s interpretation. For example, “hypotension” marked under “Signs & Symptoms” is not equivalent to “hypotension” under “Pre-Existing Conditions & Risk Factors”. The former implies a new symptom onset that correlates with the diagnosis, while the latter is something the patient experienced prior to diagnosis and thus may have nothing to do with the disease of concern. While it may seem easy enough to differentiate this information within a single case report form, a data curator cannot be certain that “hypotension” under “clinical information” in one data set can reasonably be matched to “hypotension” as a “sign & symptom” in a data set that used a different collection device. Moreover, as data passes from one partnering agency to another, the original context and usage of the data elements may be lost when the data are transcoded.
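One way to reduce this loss of context, sketched below in Python, is to carry the category alongside each term so that two records match only when both agree. This is an illustrative sketch rather than part of the study's method; the `Observation` type, the `CATEGORY_MAP` lookup, and the harmonized category names are hypothetical.

```python
from typing import NamedTuple, Optional


class Observation(NamedTuple):
    category: str  # harmonized category (hypothetical labels)
    term: str      # normalized term, e.g. "hypotension"


# Hypothetical mapping from form section headings to shared categories.
# A heading too broad to interpret maps to None and is left for review.
CATEGORY_MAP = {
    "signs & symptoms": "sign_or_symptom",
    "pre-existing conditions & risk factors": "pre_existing_condition",
    "clinical information": None,
}


def to_observation(section: str, term: str) -> Optional[Observation]:
    """Pair a term with its harmonized category, or return None if unclear."""
    category = CATEGORY_MAP.get(section.strip().lower())
    if category is None:
        return None  # context is missing or ambiguous; do not guess
    return Observation(category, term.strip().lower())


new_onset = to_observation("Signs & Symptoms", "Hypotension")
prior = to_observation("Pre-Existing Conditions & Risk Factors", "Hypotension")
unclear = to_observation("Clinical Information", "Hypotension")
```

Under this design the two “hypotension” records remain distinct, and the record filed under the broad “clinical information” heading is flagged instead of being silently merged with either.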

2. Data Structure/Format:

Data structures encompass a collection of values, their specialized intra-data relationships, organization, and how these values can be altered and operated on. They are usually designed for a specific purpose such that the intended interpretation can be appropriately inferred from the results. Date formats are an example of data structure; to represent a date, we structure it as three values, a day, a month, and a year, with a specific temporal hierarchy. A date structure is formatted such that it informs what the data values represent (e.g., “01” within the month position is inferred as “January”) and their relationship to one another (e.g., a day belongs within a month within a year). By applying a uniquely formatted representation to data, we avoid ambiguity in its interpretation.

However, not all case collection forms are consistent in how they structure date formats, resulting in an issue known as structural or syntactic ambiguity. While many were very clear in their intended structure, the national form used more than one date format within the same document, while NB specified no format at all. This can lead to ambiguity and misinterpretations between day, month, and even year (Table 6). For example, it is not clear whether the date “03/04/21” refers to March 4th, April 3rd, or even the 21st day of April/March in the year 2003/2004. Inconsistency within a single form puts greater reliance on data entry personnel to catch these discrepancies and - in the case of unclear formatting - can lead to incomplete data, cross-referencing investigations, or literal guesswork. At this time the Government of Canada has declared the national standard to be the YYYY-MM-DD or YYYY-MM ISO 8601 international standard [15, 16]. However, provinces/territories are not required to conform to this standard, and Canada still accepts dates in alternate formats. The misinterpretation of date formats on collection forms has the potential to cause significant problems in downstream data analysis, especially during the COVID-19 pandemic when getting epidemiological data analyzed is time-sensitive and misrepresentations of sampling dates have serious implications.

Table 6

Examples of variations in date formats and symptom granularity used in Canadian case report forms.
Case Report Form	Date Format	Data Granularity
National^a	DD/MM/YYYY; MM/DD/YYYY	Cough
BC	YYYY/MM/DD	Cough
MB	YYYY-MM-DD	Cough, Dry; Cough, Productive
NB	Free Text	New onset/exacerbation of chronic cough
NWT	YYYY/MMM/DD	Cough
ON	DD/MM/YYYY	Cough
QC	YYYY/MM/DD	Cough
^a The following provinces/territories were utilizing the Interim National Case Report Form at the time of analysis: AB, NL, NS, NU, PEI, SK, and YK.
Date Format values: day (D), month (M), and year (Y). Provinces/Territories: Alberta (AB), British Columbia (BC), Ontario (ON), Québec (QC), Manitoba (MB), New Brunswick (NB), Newfoundland and Labrador (NL), Nova Scotia (NS), Nunavut (NU), Northwest Territories (NWT), Prince Edward Island (PEI), Saskatchewan (SK), Yukon (YK). Table adapted from “Comparison and analysis of Canadian public health SARS-CoV-2 case report forms” [14].
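To illustrate how the date formats in Table 6 might be reconciled to the ISO 8601 YYYY-MM-DD layout, the following Python sketch tries each layout listed for a form and refuses to guess when two layouts give different dates. It is a minimal sketch under stated assumptions, not a tool used in this study: the `FORM_DATE_PATTERNS` table and the `to_iso_8601` function are hypothetical, and the NWT “MMM” component is assumed to be an English month abbreviation.

```python
from __future__ import annotations

from datetime import datetime

# strptime patterns for the date layouts listed in Table 6.
# NB is omitted because its form specifies no format (free text).
FORM_DATE_PATTERNS = {
    "National": ["%d/%m/%Y", "%m/%d/%Y"],  # both layouts appear in one form
    "BC": ["%Y/%m/%d"],
    "MB": ["%Y-%m-%d"],
    "NWT": ["%Y/%b/%d"],  # assumes MMM is an English month abbreviation
    "ON": ["%d/%m/%Y"],
    "QC": ["%Y/%m/%d"],
}


def to_iso_8601(raw: str, form: str) -> str | None:
    """Return the date as YYYY-MM-DD, or None if it cannot be resolved.

    A value is unresolved when no layout for the form parses it, or when
    different layouts yield different calendar dates (an ambiguous entry).
    """
    candidates = set()
    for pattern in FORM_DATE_PATTERNS.get(form, []):
        try:
            candidates.add(datetime.strptime(raw.strip(), pattern).date())
        except ValueError:
            continue  # this layout does not fit the value
    if len(candidates) != 1:
        return None  # flag for manual review instead of guessing
    return candidates.pop().isoformat()


on_date = to_iso_8601("15/04/2020", "ON")              # single layout
ambiguous = to_iso_8601("03/04/2021", "National")      # two valid readings
day_over_12 = to_iso_8601("25/04/2021", "National")    # only one layout fits
```

Returning `None` for ambiguous values keeps the uncertainty visible to a curator, which matters when sampling or onset dates feed downstream analyses.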

3. Data Types:

Another issue that can add to data processing time is when the same or similar data fields have different value types between forms, resulting in data string variations that may not be easily compared and that require different levels of processing. For example, where one form may offer a Boolean (True/False) value in response to whether a case has a “fever” (i.e., “Yes/No”), another form may ask for the highest temperature recorded (Table 7). The latter may have no declared data structure informing the user whether temperature should be written as a string of characters or a number and whether it should be in Celsius or Fahrenheit. And if a data curator, who was not the data recorder, is presented with a checkbox, will an “x” (☒) be interpreted as TRUE like a checkmark (☑), or will the data curator infer a negative context and input FALSE? Comparison of dissimilar data types presents problems for computer-based analysis when the information recorded differs from what the software is written to handle, causing data corruption, system crashes, or unintentional transformations. For example, if “Yes” is entered into a field expecting a number, the software may return “False” because no number was received, and a downstream user may assume that value was intentionally entered to convey “No”.

Table 7

Examples of data type variations when collecting “Fever” information via Canadian case report forms.
Case Report Form	Question	Input	Data Type / Information
National^a	Fever (≥ 38°C)	☐ Yes ☐ No ☐ Unknown ☐ Not asked/assessed	TRUE/FALSE for fevers greater than or equal to 38 Celsius, missing value options
BC	Fever	☐ Yes ☐ No ☐ Asked but Unknown ☐ Declined to Answer ☐ Not Assessed	TRUE/FALSE or missing value options.
BC	If yes, specify the highest temperature recorded:	____ °C	Free text; may be words or numbers.
MB	Fever (> 38°C)	☐	TRUE/FALSE only for fevers greater than 38 Celsius
NB	Fever/chills	☐	TRUE/FALSE for Fever and/or chills. Unless “Fever” is circled, data is unspecified as to whether a fever occurred
NWT	Fever	☐	TRUE/FALSE
NWT	Temperature if known:		Free text; may be words or numbers, Celsius or Fahrenheit not specified
ON	Fever (≥ 38°C)	☐	TRUE/FALSE for fevers greater than or equal to 38 Celsius
QC	Fever (≥ 38°C)	☐ Yes ☐ No ☐ Unknown	TRUE/FALSE for fevers greater than or equal to 38 Celsius, missing value option
^a The following provinces/territories were utilizing the Interim National Case Report Form at the time of analysis: AB, NL, NS, NU, PEI, SK, and YK.
This table demonstrates the varying data types and information that can be collected across case report forms, many of which are similar but not exact. Temperature recordings may have additional context (e.g., for BC this would be the highest recording if multiple measurements were taken), be a specific number when known (BC and NWT), or be taken in different temperature scales (NWT could be recorded in Fahrenheit or Celsius while all others are in Celsius). For some forms the definition of “Fever” varies (National, ON, and QC would consider “38°C” a fever while MB would not).
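The type mismatches in Table 7 can be made explicit by converting each input into a small typed record instead of comparing raw strings. The Python sketch below is an assumption-laden illustration, not an implementation from the study: `FeverRecord`, `normalize_checkbox`, and `parse_temperature_c` are hypothetical names, and the choice to treat “☒” as unknown is one possible policy for the checkbox ambiguity described above.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class FeverRecord:
    has_fever: bool | None       # None means unknown, not asked, or unclear
    temperature_c: float | None  # None when not recorded or unit unknown


def normalize_checkbox(raw: str) -> bool | None:
    """Map a yes/no style answer to a Boolean, keeping unknowns as None."""
    value = raw.strip().lower()
    if value in {"yes", "true", "☑"}:
        return True
    if value in {"no", "false"}:
        return False
    return None  # "☒", blanks, "unknown", "not assessed" stay unresolved


def parse_temperature_c(raw: str, assume_unit: str | None = None) -> float | None:
    """Parse free-text temperature into Celsius; None if the unit is unknown."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*°?\s*(?:([CcFf])\b)?", raw)
    if match is None:
        return None  # words only, e.g. "high"
    value = float(match.group(1))
    unit = (match.group(2) or assume_unit or "").upper()
    if unit == "C":
        return value
    if unit == "F":
        return round((value - 32) * 5 / 9, 1)
    return None  # number without a unit: do not guess the scale


record = FeverRecord(
    has_fever=normalize_checkbox("Yes"),
    temperature_c=parse_temperature_c("102.5°F"),
)
```

A form whose layout fixes the unit (such as a printed “°C” after the blank) could pass `assume_unit="C"`, whereas a form that does not specify the scale would leave the value unresolved. The sketch deliberately does not derive `has_fever` from the temperature, since the threshold that defines a fever differs between forms.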

4. Data Granularity:

A recurring complication in comparing data across case report forms is variation in granularity. In this context, granularity refers to the level of detail of a data element and how it is subdivided. Depth of analysis becomes limited when data collection sources differentiate descriptors such that it can be difficult to match them to a common term. For example, some forms record “cough” while others record “dry cough”, “productive cough”, or “new onset/exacerbation of chronic cough”, and this differentiation in descriptors can result in inappropriate mappings and/or a loss of pathology information (Table 6). The inability of a pathologist to differentiate between dry and productive coughs can impact how respiratory diseases are defined and differentiated. Additionally, terms are sometimes grouped together without clear instruction or demarcation. Hypothetically, the data collector may indicate it to be “True” that a case experienced “Nausea/Vomiting” because the patient had been nauseated. Downstream data entry/analysis personnel could interpret “Nausea/Vomiting” as a data point towards “Vomiting” when no vomiting had ever occurred, associating a false sign or symptom with a disease while also losing the intended “Nausea” data point. Multiple concepts in the same field create uncertainty (does “Nausea/Vomiting” indicate “Nausea”, “Vomiting”, or both?) while also making it hard to fit data with other datasets where the concepts are in separate fields.
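A harmonization step can limit this loss by keeping the term as reported next to the broader term it is mapped to, and by flagging grouped values instead of splitting them. The following Python sketch shows one way this could look; the `BROADER_TERM` lookup and the `harmonize_symptom` function are hypothetical and are not drawn from the CanCOGeN specification.

```python
from __future__ import annotations

# Hypothetical lookup from granular terms to the broadest shared term.
BROADER_TERM = {
    "cough": "cough",
    "dry cough": "cough",
    "productive cough": "cough",
    "new onset/exacerbation of chronic cough": "cough",
}


def harmonize_symptom(reported: str) -> dict:
    """Map a reported term to a broad term while preserving the original."""
    key = reported.strip().lower()
    if key in BROADER_TERM:
        return {
            "reported": reported,          # granular detail is kept
            "harmonized": BROADER_TERM[key],
            "candidates": [],
            "ambiguous": False,
        }
    if "/" in key:
        # Grouped value such as "Nausea/Vomiting": either or both may apply,
        # so list the possibilities without asserting any of them.
        return {
            "reported": reported,
            "harmonized": None,
            "candidates": [part.strip() for part in key.split("/")],
            "ambiguous": True,
        }
    return {
        "reported": reported,
        "harmonized": key,
        "candidates": [],
        "ambiguous": False,
    }


dry = harmonize_symptom("Dry cough")
grouped = harmonize_symptom("Nausea/Vomiting")
```

Because the reported value is retained, a later analysis that needs to separate dry from productive coughs can still do so for the forms that recorded that distinction.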

5. Semantic Ambiguity:

A non-trivial issue across case report forms is that the meaning of words can differ between them. Semantic ambiguity arises when a term has more than one possible meaning, so that a data value can be interpreted differently from what was intended. An example of an ambiguous term that appeared on case report forms is “Isolation”. Without explicit explanation, it is unclear to the data user whether this corresponds to “Self-Isolation”, “Home Isolation”, or “Hospital Isolation”, all of which are terms that appear on other case report forms. And if a form does indicate “Hospital Isolation”, does this mean that the patient was put into a private room, away from other patients, or put under “Negative Pressure” conditions (i.e., where there is a minimum number of air exchanges per hour)? Being unable to distinguish between “Home Isolation” and “Hospital Isolation” may have consequences for epidemiologists when modeling the spread of the disease, as transmission in these scenarios is significantly different. Analysts and decision makers must form their own assumptions about the meaning of terms in order to parse data. Should these assumptions not correspond to those made by the data recorder, research conclusions and policy implementations may not reflect the ground truth. One way to mitigate this risk is to provide case report form users and downstream data entry personnel with a controlled vocabulary that clearly conveys the intended meaning.
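
A controlled vocabulary of this kind could be enforced at data entry along the following lines. This is only a sketch: the three permitted values are the isolation terms quoted above, while the class and function names are hypothetical and do not come from any existing specification.

```python
from enum import Enum


class IsolationType(Enum):
    """Illustrative controlled vocabulary built from the terms quoted above."""
    SELF_ISOLATION = "Self-Isolation"
    HOME_ISOLATION = "Home Isolation"
    HOSPITAL_ISOLATION = "Hospital Isolation"


def parse_isolation(value: str) -> IsolationType:
    """Return the matching vocabulary term, or reject an ambiguous value."""
    cleaned = value.strip().casefold()
    for member in IsolationType:
        if member.value.casefold() == cleaned:
            return member
    allowed = ", ".join(member.value for member in IsolationType)
    raise ValueError(
        f"{value!r} is not in the controlled vocabulary; expected one of: {allowed}"
    )


print(parse_isolation("home isolation"))
```

Rejecting the bare term “Isolation” forces the ambiguity to be resolved by the person who knows the answer, rather than by a downstream analyst.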

6. Disparate Questions:

The presence of partially aligned but non-identical questions presents another barrier to data normalization. Increasing the homogeneity of questions increases the capacity of investigators to perform detailed, large-scale analyses. For example, question disparity presents issues in the collection and analysis of demographic information. Forms may inquire whether a patient identifies as “First Nations”, “Inuit”, or “Métis”, and/or whether a patient resides on a reserve, or the form may not request any patient Indigenous identification data at all (Table 8). Because of this disparity, questions may be removed or severely limited when analyzing large combined datasets where the data values have partial but not complete overlap of meaning; for example, “lives on reserve” (whether the individual resides in a location with “reserve status” [17]) and “identifies as Indigenous” (self-determined Indigenous identification) are not equivalent.

Table 8

Indigenous Identification Data fields across Canadian case report forms.
Case Report Form	Identify as Indigenous	First Nations Status	First Nations	Métis	Inuit	Combination^a
National^b	✓		✓	✓	✓	
BC	✓	✓	✓	✓	✓	✓
MB	✓	✓	✓	✓	✓	
NB						
NWT						
ON	✓		✓	✓	✓	
QC	✓		✓		✓	
^a Options for “First Nations and Inuit”, “First Nations and Métis”, “First Nations, Inuit and Métis”, or “Inuit and Métis”.
^b The following provinces/territories were utilizing the Interim National Case Report Form at the time of analysis: AB, NL, NS, NU, PEI, SK, and YK.
Provinces/Territories: Alberta (AB), British Columbia (BC), Ontario (ON), Québec (QC), Manitoba (MB), New Brunswick (NB), Newfoundland and Labrador (NL), Nova Scotia (NS), Nunavut (NU), Northwest Territories (NWT), Prince Edward Island (PEI), Saskatchewan (SK), Yukon (YK). Table adapted from “Comparison and analysis of Canadian public health SARS-CoV-2 case report forms” [14].

We also identified questions with no overlap between case report forms. QC was the only province/territory form to inquire whether a patient experienced “pregnancy complications” or whether the patient was a worker exposed to direct customer contact. Similarly, NB was the only province/territory to list “coryza” (acute inflammation of the nasal passage) under the assessment of symptoms. This does not imply that these questions are not important to ask, but rather that their value is lessened since they appear infrequently during the data collection process. One could argue that these questions are unique to the region and jurisdiction collecting them; however, we could not identify any instances where this appeared to be the case. It is also reasonable to assume that other jurisdictions chose not to include these fields/values to limit the size of their case report form. There is no strict limit on case report form length, but too many fields increase the burden of data entry on health care workers and patients, increasing the likelihood of some portions being missed or skipped. Case report form designers recognize that requesting too much of the form users may result in diminishing or negative returns on data quality and quantity. Some coordination across the nation could significantly reduce provincial/territorial inconsistencies, especially among high-priority descriptors.
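
When records from forms with non-overlapping questions are combined, it helps to distinguish a question that was never asked from one that was asked and answered “no”. The sketch below shows one way to do this. The field lists are hypothetical and heavily simplified: they reflect only the two examples above (“pregnancy complications” on the QC form, “coryza” on the NB form), and the shared “cough” field is an assumption added for illustration.

```python
NOT_COLLECTED = "Not Collected"  # the form never asked this question
MISSING = "Missing"              # the form asked, but no value was recorded

# Hypothetical, simplified field lists per form.
FORM_FIELDS = {
    "QC": {"cough", "pregnancy complications"},
    "NB": {"cough", "coryza"},
}

# Union of all fields across forms, used as the combined dataset's columns.
ALL_FIELDS = sorted(set().union(*FORM_FIELDS.values()))


def align_record(form: str, record: dict) -> dict:
    """Place a record on the combined columns without inventing answers."""
    collected = FORM_FIELDS[form]
    aligned = {}
    for field in ALL_FIELDS:
        if field not in collected:
            aligned[field] = NOT_COLLECTED
        else:
            aligned[field] = record.get(field, MISSING)
    return aligned


print(align_record("NB", {"cough": True, "coryza": False}))
```

Using an explicit “Not Collected” value prevents an absent question from being read as a negative finding when the combined dataset is analyzed.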

Indigenous Identification Data:

Eleven of the thirteen Canadian case report forms were found to collect up to four categories of identification data pertaining to Indigenous peoples in Canada. These categories include First Nations Status, Identify as Indigenous, Indigenous Heritage, and Reservation/Community information. Indigenous identification (regardless of community designation) data collection on case report forms is represented in Table 8. Collecting this information is important as it provides a means to highlight systemic inequalities impacting Indigenous populations, supporting positive interventions and policy change.

First Nations Status is a distinct legal status available to Indigenous peoples in Canada who meet the eligibility criteria [18]. The process of being legally recognized as having First Nations Status can be laborious and difficult, often resulting in many First Nations peoples not being granted this status [18]. Data regarding First Nations Status was collected only on the BC and MB forms. Both provinces included separate options to Identify as Indigenous, an important addition for acknowledging and acquiring data on First Nations people who were ineligible for status. Capturing differences in status information is pertinent as it allows for the analysis of how status may impact health outcomes (e.g., via access to health and government services).

All case report forms that included the option to Identify as Indigenous also included some capacity to indicate Indigenous Heritage information. The Indigenous Heritage options were First Nations, Métis, and Inuit. However, the QC case report form did not include an option for Métis, and the BC case report form provided additional explicit options for any combination of these categories; other forms did not restrict the selection of more than one option. Collecting this level of disaggregated data allows for a more diverse inequality analysis of potentially intersecting demographics [19]. The BC Office of the Human Rights Commissioner recommends the immediate collection of disaggregated demographic data in the area of health care [19]. To ensure that race-based data is viewed through the lens of reducing oppression and systemic racism, and not that of measuring race, custodianship of this data should be placed in the hands of Indigenous organizations [19]. However, this cannot be done if the appropriate Indigenous organization associated with the data cannot be identified.

Beyond the utility of Indigenous community demographic data for public health analysis, collecting Indigenous demographic data is important for identifying the Indigenous nation and organization responsible for data custodianship under Indigenous data governance initiatives [20, 21]. The national, ON, and QC case report forms collected whether the patient resides on a reserve, while Indigenous community is collected on the MB and NB forms, with the former collecting this information only if the patient is symptomatic. There is an important distinction between these terms: while a reserve is an Indigenous community, reserves are designated a specific reserve status that other Indigenous communities may not qualify for [17]. BC was the only province to collect Indigenous organization information (e.g., “Nazko First Nation”).

It is important for us to acknowledge that Indigenous identification data is not covered by the CanCOGeN VirusSeq specification. This is primarily due to the lack of appropriate and culturally sensitive data standards. The CanCOGeN metadata harmonization team is working towards identifying language that is appropriate for data capture with the assistance of the CanCOGeN Ethics and Governance Working Group, and consultation with Indigenous organizations will be a key part of further development.

Recommendations

This work identified common Canadian COVID-19 case report form data elements and used them to build the foundation for the CanCOGeN VirusSeq data standard. During this process we identified data harmonization challenges in data categorization, structure, format, type, granularity, ambiguity, and questions asked. To address some of these challenges, we recommend pan-Canadian agency coordination to use an agreed-upon standard, meaningful engagement with Indigenous peoples' data governance boards, and the use of data harmonization tools.

Different institutions may have distinct form questions and data structures due to the unique circumstances and needs within their jurisdiction, potentially resulting in inconsistent and ambiguous information when merged with other datasets. Coordination between agencies across the nation to use an agreed-upon standard when creating forms would make datasets more harmonizable from the start, significantly reducing inconsistencies at the point of data collection, data entry, and the linkage of contextual data with virus sequence data. In response to this need, CanCOGeN [9] developed the CanCOGeN-VirusSeq contextual data specification to facilitate the formation of well-structured, consistent contextual datasets from disparate sources across Canada. Continuous data standard development also provides flexibility to meet provincial/territorial needs as they arise; otherwise, agencies are incentivized to create their own contextual data parameters when their needs are not met. We also recommend that case report form developers meaningfully engage with regional and national Indigenous peoples' governance boards to determine which disaggregated data elements should be collected and at what granularity. At minimum, we recommend that the following Indigenous demographic data elements be brought to the discussion: Indigenous heritage information, First Nations status (separated from heritage information), and reservation/community/organization information.

In lieu of asking provinces/territories to change their current case report form(s), as changing internal procedures can be difficult and time consuming, we recommend addressing the national data sharing inconsistencies by encouraging provinces and territories to use database-integrated or stand-alone data harmonization tools to improve data comparability and interoperability. One such tool, developed by CanCOGeN based on the aforementioned standard, is the DataHarmonizer [22]. The DataHarmonizer utilizes the flexible standardization of ontologies [23]; offers controlled vocabularies and minimal data standards, such as the Public Health Alliance for Genomic Epidemiology (PHA4GE) COVID-19 specification [24]; and minimizes data transformation by allowing customizable template imports while facilitating export to multiple genomic databases, including but not limited to CNPHI (Canadian Network for Public Health Intelligence) [25], GISAID (Global Initiative on Sharing Avian Influenza Data) [26], and the NCBI (National Center for Biotechnology Information) BioSample [27].
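
The general idea behind template-driven harmonization can be sketched as a two-step mapping: local form fields are first mapped to a shared standard, and standard fields are then renamed for each target repository. The code below is not the DataHarmonizer and does not use its API. Every form label, field name, and repository name in it is a hypothetical placeholder.

```python
# Hypothetical mappings from local form headers to shared standard fields.
LOCAL_TO_STANDARD = {
    "form_a": {"Date of specimen collection": "sample_collection_date",
               "Province": "province"},
    "form_b": {"Collection date": "sample_collection_date",
               "Prov/Terr": "province"},
}

# Hypothetical export templates: standard field -> repository field name.
EXPORT_TEMPLATES = {
    "repository_x": {"sample_collection_date": "collection_date",
                     "province": "location"},
    "repository_y": {"sample_collection_date": "date_collected",
                     "province": "region"},
}


def to_standard(form: str, row: dict) -> dict:
    """Rename a local row's fields to the shared standard, dropping unmapped ones."""
    mapping = LOCAL_TO_STANDARD[form]
    return {mapping[key]: value for key, value in row.items() if key in mapping}


def export_row(standard_row: dict, target: str) -> dict:
    """Rename standard fields to the names expected by a target repository."""
    template = EXPORT_TEMPLATES[target]
    return {template[key]: value for key, value in standard_row.items()
            if key in template}


row = {"Collection date": "2020-07-01", "Prov/Terr": "NB", "Notes": "n/a"}
standard = to_standard("form_b", row)
print(export_row(standard, "repository_y"))
```

Routing every form through one shared standard means each jurisdiction maintains a single mapping, instead of one mapping per destination database.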

Limitations

While many health regions used the form(s) agreed upon throughout an individual province/territory, some regions had agency- or location-specific case report forms that did not correspond to the provincial/territorial forms used in this report. This analysis is also limited to case report forms that were publicly accessible online, and therefore excludes private forms and forms not published electronically. Consequently, the results are skewed towards publicly accessible, electronic copies of case report forms that were deemed most likely to be in use, and this analysis is not inclusive of all case report forms used across Canada. It also did not examine earlier versions of case report forms that may have been used during the pandemic, potentially missing data harmonization issues that could have impacted downstream SARS-CoV-2 datasets. Additionally, because qualitative analysis depends on the researchers' interpretation of mappings, researchers outside this analysis may disagree with the mappings and harmonization issue classifications; such disagreement further highlights the difficulty of data element interpretation and the potential for data harmonization complications.

Conclusions

This COVID-19 case report form analysis helped structure the CanCOGeN data standard by identifying which genomic data parameters are commonly being collected, informing partner agencies of what is and is not available to them for the design of surveillance and/or research questions. The analysis also informed whether a data field should be required, recommended, or optional; how data was structured; and how data fields and values needed to be carefully defined to capture data values of varying granularity. Understanding where data harmonization challenges occur on a provincial/territorial level helps in the development of solutions that can be offered to all stakeholders without overstepping jurisdictional boundaries that can result from trying to resolve these issues at the data collection level. While this work was completed to facilitate inter-provincial/territorial data sharing under the SARS-CoV-2 national emergency, the lessons we have learned can be leveraged for the surveillance and analysis of other human pathogens.

Abbreviations

AB: Alberta
BC: British Columbia
CanCOGeN: Canadian COVID-19 Genomics Network
CNPHI: Canadian Network for Public Health Intelligence
COVID-19: Coronavirus Disease 2019
GISAID: Global Initiative on Sharing Avian Influenza Data
QC: Québec
MB: Manitoba
NB: New Brunswick
NCBI: National Center for Biotechnology Information
NL: Newfoundland and Labrador
NS: Nova Scotia
NU: Nunavut
NWT: Northwest Territories
ON: Ontario
PEI: Prince Edward Island
PHAC: Public Health Agency of Canada
SARS-CoV-2: Severe acute respiratory syndrome coronavirus 2
SK: Saskatchewan
YK: Yukon

Declarations

Ethics Approval: Not applicable

Consent for publication: Not applicable

Availability of Data and Material: The datasets supporting the conclusions of this article are available in the “Canadian COVID-19 Case Report Form Analysis Files” Open Science Framework repository, https://doi.org/10.17605/OSF.IO/4UA8P [28].

Competing Interests: The authors declare that they have no competing interests.

Funding: This study was funded by Canadian COVID Genomics Network (CanCOGeN) VirusSeq Project (Genome Canada grant number E09CMA) and by Genome Canada and Genome BC Computational Biology and Bioinformatics Grant (project number 286GET) to William Hsiao.

Author Contributions: Data collection and material preparation were executed by RC, EJG, DD, ASr, and LT. EJG, DD, WH, and RC contributed to the study conception and design. Funding acquisition was performed by WH. RC and ASe identified and counted common data elements. RC reviewed data mappings/counts, performed French translation, performed the analysis, and wrote the first draft of the manuscript. EJG and RC prepared Figure 1. RC prepared Tables 1-8 and supplementary Tables S1-S13. RC, EJG, SSK, and WH reviewed, edited, and commented on previous versions of the manuscript. The final manuscript was read and approved by all authors.

Acknowledgements: Material was derived from the “CanCOGeN VirusSeq - Comparison and Analysis of Canadian Public Health SARS-CoV-2 Case Report Forms” report published December 10th, 2020 by some of the authors of this paper [14]. All major contributors of the original report were contacted and agreed to this publication. We would also like to acknowledge Dr. Cathy Flanagan and Dr. Gerald Simkus for assisting in the French-English translation of the Québec case report form(s).

References

1. Biobanks in Europe. Prospects for Harmonisation and Networking. Publications Office; 2010:115–7. https://doi.org/10.2791/41701
2. Government of Canada. Canada’s Health Care System. Canada.ca; 2019. https://www.canada.ca/en/health-canada/services/health-care-system/reports-publications/health-care-system/canada.html. Accessed 11 Oct 2021.
3. Marchildon, GP. Canada: Health system review. Health Systems in Transition. 2013;15:1:1-179. https://apps.who.int/iris/handle/10665/330307. Accessed 23 Apr 2022.
4. Attaran A, Houston A. Pandemic Data Sharing: How the Canadian Constitution Turned Into a Suicide Pact. In: Flood CM, MacDonnell V, Philpott J, Theriault S, Venkapuram S, editors. Vulnerable: The Policy, Law and Ethics of COVID-19. Ottawa: University of Ottawa Press; 2020. http://dx.doi.org/10.2139/ssrn.3612825.
5. Office of the Privacy Commissioner of Canada: Provincial and territorial privacy laws and oversight. https://www.priv.gc.ca/en/about-the-opc/what-we-do/provincial-and-territorial-collaboration/provincial-and-territorial-privacy-laws-and-oversight/ (2020). Accessed 23 Apr 2022.
6. Aggarwal D, Myers R, Hamilton WL, Bharucha T, Tumelty NM, Brown CS, et al. The role of viral genomics in understanding COVID-19 outbreaks in long-term care facilities. Lancet Microbe. 2021. https://doi.org/10.1016/S2666-5247(21)00208-1.
7. Rasmussen SA, Khoury MJ, Del Rio C. Precision Public Health as a Key Tool in the COVID-19 Response. JAMA. 2020;324:933–4. https://doi.org/10.1001/jama.2020.14992.
8. Seemann T, Lane CR, Sherry NL, Duchene S, Gonçalves da Silva A, Caly L, et al. Tracking the COVID-19 pandemic in Australia using genomics. Nat Commun. 2020;11:4376. https://doi.org/10.1038/s41467-020-18314-x.
9. CanCOGeN Canadian COVID Genomics Network: Generating accessible and usable genomics data to inform policy and public health decisions. Genome Canada. 2020. https://genomecanada.ca/challenge-areas/cancogen/. Accessed 11 Oct 2021.
10. Kranz J. The Methodology of Comparative Epidemiology. In: Kranz J, Rotem J, editors. Experimental Techniques in Plant Disease Epidemiology. Berlin: Springer Berlin Heidelberg; 1988:279–89. https://doi.org/10.1007/978-3-642-95534-1_21.
11. Coronavirus Disease (COVID-19) Case Report Form. Public Health Agency of Canada; 2021.
12. Google Translate (Version: Canada). https://translate.google.ca: Alphabet Inc; n.d. Accessed July 2020.
13. Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. https://doi.org/10.1038/sdata.2016.18.
14. Cameron R, Savić-Kallesøe S, Griffiths EJ, Hsiao WWL. Comparison and analysis of Canadian public health SARS-CoV-2 case report forms. CanCOGeN VirusSeq; 2020. https://genomecanada.ca/wp-content/uploads/2022/01/2020-12-10_crf_report_.pdf
15. Standards Council of Canada. CAN/CSA-Z234.4-89 (R2007): All-Numeric Dates and Times. https://www.scc.ca/en/standardsdb/standards/4449 (1989). Accessed 23 Apr 2022.
16. Treasury Board of Canada Secretariat. TBITS 36: All-numeric representation of dates and times - implementation criteria. https://www.tbs-sct.gc.ca/pol/doc-eng.aspx?id=17284 (1997). Accessed 11 Oct 2021.
17. McCue HA. Reserves. In: The Canadian Encyclopedia. Historica Canada. 2011. https://www.thecanadianencyclopedia.ca/en/article/aboriginal-reserves. Accessed 5 Oct 2021.
18. Crey K, Hanson E. Indian Status. In: Indigenous Foundations. First Nations & Indigenous Studies The University of British Columbia. 2009. https://indigenousfoundations.arts.ubc.ca/indian_status/. Accessed 4 Oct 2021.
19. Disaggregated Demographic Data Collection in British Columbia: The Grandmother Perspective. British Columbia’s Office of the Human Rights Commissioner; 2020. https://bchumanrights.ca/wp-content/uploads/BCOHRC_Sept2020_Disaggregated-Data-Report_FINAL.pdf. Accessed 4 Oct 2021.
20. The First Nations Information Governance Centre. https://fnigc.ca/ (2020). Accessed 5 Oct 2021.
21. BC First Nations Data Governance Initiative. https://www.bcfndgi.com/ (n.d.). Accessed 5 Oct 2021.
22. Gill IS, Griffiths EJ, Dooley D, Cameron R, Gosal G, Sehar A, Tindale L, Croxen M, Alexander D, Hsiao WWL. The Dataharmonizer: a Tool for Faster Data Harmonization, Validation, Aggregation, and Analysis of Pathogen Genomics Contextual Information. Preprint at https://doi.org/10.20944/preprints202206.0335.v1 (2022).
23. Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol. 2007;25:1251–5. https://doi.org/10.1038/nbt1346.
24. Griffiths EJ, Timme RE, Mendes CI, Page AJ, Alikhan N-F, Fornika D, et al. Future-proofing and maximizing the utility of metadata: The PHA4GE SARS-CoV-2 contextual data specification package. Gigascience. 2022;11. https://doi.org/10.1093/gigascience/giac003.
25. Government of Canada: Canada Network for Public Health Intelligence. https://www.cnphi-rcrsp.ca/cnphi/index.jsp (n.d.). Accessed 11 Oct 2021.
26. Shu Y, McCauley J. GISAID: Global initiative on sharing all influenza data - from vision to reality. Euro Surveill. 2017;22. https://doi.org/10.2807/1560-7917.ES.2017.22.13.30494.
27. Barrett T, Clark K, Gevorgyan R, Gorelenkov V, Gribov E, Karsch-Mizrachi I, et al. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res. 2012;40:D57–63. https://doi.org/10.1093/nar/gkr1163.
28. Cameron R, Savić-Kallesøe S, Griffiths EJ, Dooley D, Sridhar A, Sehar A, et al. Canadian COVID-19 Case Report Form Analysis Files. Open Science Framework. 2022. https://doi.org/10.17605/OSF.IO/4UA8P.
