Our analysis identified what COVID-19 case-related information was available, how frequently each data element occurred, how the data was structured, and how data values needed to be carefully defined to capture data of varying granularity. It also reviewed and highlighted specific harmonization challenges that can emerge, and have emerged, from the use of different collection forms when generating interoperable and comparable datasets [13]. Commonly collected data elements were identified with the intent of informing researchers and epidemiologists of what is and is not available to them for the design of surveillance and/or research questions. This information was critical in rapidly forming a pan-Canadian framework for public health emergency surveillance, enabling more efficient and accurate data sharing that can be leveraged for the surveillance and analysis of SARS-CoV-2 and other pathogens. Countries around the world are evaluating their genomic contextual data standards and looking internationally for standards as guidance. This analysis has resulted in the creation of a data standard (CanCOGeN VirusSeq) that is now being implemented internationally by other institutions and entities. Our investigation focuses on the critical moment of the early pandemic when SARS-CoV-2 data standards were not available, i.e., the period before the distribution of the initial standard and the publication of the CanCOGeN case report form analysis report [14].
Common Data Elements:
Data categories, elements and types that appeared in the majority or all Canadian case report forms were identified. The focus of these results is on data that is explicit within a form, i.e., presented clearly within the text of the observed case report form. The most common fields and field categories across all observed case report forms were the Name, Date of Birth (DOB), Phone Number, Gender, Symptom Onset Date, Symptoms (often used synonymously with Signs), and Pre-existing Conditions and Risk Factors of the individual under observation (Table 2). Information that could facilitate the linkage of virus sequence contextual data with other datasets (e.g., additional host sequence contextual data) includes Patient, Case, and Other Identifiers; Gender Field Values; Host Health State/Outcome; Host Health Status Details; and Host Resident Information (Table 3). Along with assisting in general COVID-19 public health surveillance, this information permits the study of relationships between disease outcomes and host demographic information when appropriately linked. Categories collected to help determine COVID-19 manifestations and severity were determined to be Signs and Symptoms, Pre-existing Conditions and Risk Factors, and Complications. Clinical diagnoses found within these categories and deemed present in all case report forms can be found in Table 4. The data element Symptom Onset Date was also found to be present in all case report forms (Table 4). This information is vital for epidemiological inferences - such as quantifying the incubation period (the window of time between initial infection and signs of illness) - and for determining appropriate public health interventions.
Table 2
Overview of the data fields and field categories commonly found in the Canadian case report forms.
Case Information | Case Report Form |
 | National^a | BC | MB | NB | NWT | ON | QC |
Name (First & Last) | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Date of Birth^b | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Phone Number | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Gender | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Symptom Onset Date^b | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Symptoms | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Pre-existing Conditions and Risk Factors | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
“Case information” includes data elements associated with the host/patient being observed/diagnosed/tested. Table adapted from “Comparison and analysis of Canadian public health SARS-CoV-2 case report forms” [14]. |
a Applicable Provinces: AB, NL, NS, PEI, SK, YK. |
b Date formats are not consistent across all forms. |
Table 3
Overview of “General Case Information” data fields commonly found in the Canadian case report forms.
General Case Information | Case Report Form |
 | National^a | BC | MB | NB | NWT | ON | QC |
Patient, Case, and other Identifiers |
Personal Health Number | | ✓ | ✓ | ✓ | ✓ | | ✓ |
Case and/or Other Identifiers | ✔ | ✓ | ✓ | | | ✓ | ✓ |
Gender Field Values |
Female, Male | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Unknown | ✔ | ✓ | ✓ | | | ✓ | |
Host Health State / Outcome |
Symptomatic, Deceased | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Asymptomatic | ✔ | ✓ | ✓ | | ✓ | ✓ | ✓ |
Host Health Status Details |
Hospitalized | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
ICU, ICU Start Date | ✔ | ✓ | ✓ | | ✓ | ✓ | ✓ |
Date of Death / Disposition Date | ✔ | ✓ | ✓ | | ✓ | ✓ | ✓ |
Host Resident Information |
City | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Address, Postal Code | ✔ | ✓ | ✓ | ✓ | | ✓ | ✓ |
a Applicable Provinces: AB, NL, NS, PEI, SK, YK. |
Table adapted from “Comparison and analysis of Canadian public health SARS-CoV-2 case report forms” [14]. |
Table 4
Overview of “Clinical Diagnoses” data fields commonly found in the Canadian case report forms.
Clinical Diagnoses | Case Report Form |
 | National^a | BC | MB | NB | NWT | ON | QC |
Symptom Onset Date^b | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Signs and Symptoms |
Cough | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Fever^c | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Headache | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Sore Throat | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Pre-Existing Conditions and Risk Factors |
Cardiac Disease | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Diabetes | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Pregnancy | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Respiratory Disease | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Complications |
Altered Mental Status | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Encephalitis | ✔ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
a Applicable Provinces: AB, NL, NS, PEI, SK, YK. |
b Significant variation in the recommended date format across case report forms. |
c Minimum temperature that defines a fever has some variation between forms or is not defined. |
Table adapted from “Comparison and analysis of Canadian public health SARS-CoV-2 case report forms” [14]. |
These similarities showed researchers which data elements were useful across jurisdictions and thus should be included in the CanCOGeN VirusSeq data standard. This work highlights which data elements can cause downstream data harmonization issues for the national analysis of SARS-CoV-2 for public health surveillance and intervention. It informed CanCOGeN of which data elements were commonly collected, how data was structured and how that structure affected ease of comparison, and how data values needed to be carefully defined to capture data of varying granularity.
Data Harmonization Challenges:
This section discusses theoretical data harmonization issues that emerge as a consequence of using different Canadian case collection forms. Issues in categorization, structure/format, value types, granularity, semantics, and the use of disparate questions were identified in this analysis (Table 5).
Table 5
Examples of harmonization issues identified in the case report form analysis.
Issue | Example |
Data Categorization | “Risk Factors” could be presented as “Pre-Existing Conditions”, “Exposures”, both, or neither. |
Data Structure/Format | “03/04/2021” date; unclear whether “3rd of April” or “4th of March”. |
Data Type | Fever = “TRUE” or “FALSE” (i.e., ☐); Fever = ≥ 38°C; Fever = 102.5°F |
Data Granularity | The terms “cough”, “dry cough”, “productive cough”, or “new onset cough” are used in different forms. When combining data, treating all these terms as synonyms can result in the loss of pathological information. |
Semantic Ambiguity | Does “Isolation” mean “Self-Isolation”, “Home Isolation”, and/or “Hospital Isolation”? Is “Negative Pressure” applicable? |
Disparate Questions | Not all forms request Indigenous identification data. Engagement with First Nations health authorities is inconsistent. |
Table adapted from “Comparison and analysis of Canadian public health SARS-CoV-2 case report forms” [14]. |
1. Data Categorization:
Case report forms vary in the overarching categories they use to house their data fields, sometimes making the underlying data fields difficult to correlate and consequently integrate. For example, “pre-existing conditions” are a patient’s medical conditions prior to the infection of interest, while “risk factors” are variables associated with increased risk of infection and can encompass internal (e.g. “pre-existing conditions”), external (e.g. “travel exposure”), or a combination of both (e.g. the behavioral risk of “smoking”). Since “risk factors” can encompass both “pre-existing conditions” and “exposures”, forms vary in their implementation - making it more difficult to collect, curate, and correlate underlying risk assessment data, potentially confounding analyses of risk. An overarching category may also change the field’s interpretation. For example, “hypotension” marked under “Signs & Symptoms” is not equivalent to “hypotension” under “Pre-Existing Conditions & Risk Factors”: the former implies a new symptom onset that correlates with the diagnosis, while the latter is something the patient experienced prior to diagnosis and thus may have nothing to do with the disease of concern. While it may seem easy enough to differentiate this information within a single case report form, a data curator cannot be certain that “hypotension” under “clinical information” in one data set can reasonably be matched to “hypotension” as a “sign & symptom” in a data set that used a different collection device. Moreover, as data passes from one partnering agency to another, the original context and usage of the data elements may be lost when the data are transcoded.
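One way to keep the category context from being lost during transcoding is to carry it alongside each recorded value. The following is a minimal sketch, not part of the CanCOGeN VirusSeq specification; the `Observation` type, the `safe_to_merge` function, and the form labels are hypothetical names introduced for illustration. The rule is deliberately conservative: records whose category headings differ are held back for curator review rather than merged automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    term: str      # value as recorded, e.g. "hypotension"
    category: str  # heading the field appeared under on the source form
    form: str      # case report form the record came from (hypothetical label)


def safe_to_merge(a: Observation, b: Observation) -> bool:
    """Merge automatically only when both the term and its category agree."""
    same_term = a.term.casefold() == b.term.casefold()
    same_category = a.category.casefold() == b.category.casefold()
    return same_term and same_category


symptom = Observation("Hypotension", "Signs & Symptoms", "Form A")
history = Observation("hypotension", "Pre-Existing Conditions & Risk Factors", "Form B")
print(safe_to_merge(symptom, history))  # same word, different clinical meaning
```

Because category names themselves differ between forms, a real pipeline would also need a curated mapping between headings; this sketch only shows why the heading has to travel with the value.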
2. Data Structure/Format:
Data structures encompass a collection of values, their specialized intra-data relationships, organization, and how these values can be altered and operated on. They are usually designed for a specific purpose such that the intended interpretation can be appropriately inferred from the results. Date formats are an example of data structure; to represent a date, we structure it as three values, a day, a month, and a year, with a specific temporal hierarchy. A date structure is formatted such that it informs what the data values represent (e.g., “01” within the month position is inferred as “January”) and their relationship to one another (e.g., a day belongs within a month within a year). By applying a uniquely formatted representation to data, we avoid ambiguity in its interpretation.
However, not all case collection forms are consistent in how they structure date formats, resulting in an issue known as structural or syntactic ambiguity. While many forms were clear in their intended structure, the national form used more than one date format within the same document, and NB specified no format at all. This can lead to ambiguity between day, month, and even year (Table 6). For example, it is not clear whether the date “03/04/21” refers to March 4th, April 3rd, or even the 21st of April 2003 or the 21st of March 2004. Not being consistent within a single form puts greater reliance on data entry personnel to catch these inconsistencies and - in the case of unclear formatting - leads to incomplete data, cross-referencing investigations, or literal guesswork. At this time the Government of Canada has declared the national standard to be the ISO 8601 international standard, written as YYYY-MM-DD or YYYY-MM [15, 16]. Provinces and territories are not required to conform to it, and Canada still accepts dates in alternate formats. The misinterpretation of date formats on collection forms has the potential to cause significant problems in downstream data analysis, especially during the COVID-19 pandemic when getting epidemiological data analyzed is time-sensitive and misrepresentations of sampling dates have serious implications.
Table 6
Examples of variation in date formats and symptom granularity used in Canadian case report forms.
Case Report Form | Date Format | Data Granularity |
National^a | DD/MM/YYYY; MM/DD/YYYY | Cough |
BC | YYYY/MM/DD | Cough |
MB | YYYY-MM-DD | Cough, Dry; Cough, Productive |
NB | Free Text | New onset/exacerbation of chronic cough |
NWT | YYYY/MMM/DD | Cough |
ON | DD/MM/YYYY | Cough |
QC | YYYY/MM/DD | Cough |
a The following provinces/territories were utilizing the Interim National Case Report Form at the time of analysis: AB, NL, NS, NU, PEI, SK, and YK. |
Date Format values: day (D), month (M), and year (Y). Provinces/Territories: Alberta (AB), British Columbia (BC), Ontario (ON), Québec (QC), Manitoba (MB), New Brunswick (NB), Newfoundland and Labrador (NL), Nova Scotia (NS), Nunavut (NU), Northwest Territories (NWT), Prince Edward Island (PEI), Saskatchewan (SK), Yukon (YK). Table adapted from “Comparison and analysis of Canadian public health SARS-CoV-2 case report forms” [14]. |
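The ambiguity of a date such as “03/04/21” can be made concrete by parsing it under several layouts and comparing the results. The sketch below is illustrative only: the function names are hypothetical, and the two-digit-year layouts are an assumption made to match the example in the text (the forms in Table 6 specify four-digit years where they specify anything). It uses Python's standard `datetime.strptime`, which interprets two-digit years 00 to 68 as 2000 to 2068.

```python
from datetime import datetime

# Candidate layouts for a slash-separated date with a two-digit year.
# These labels and layouts are assumptions chosen for illustration.
CANDIDATE_FORMATS = {
    "DD/MM/YY": "%d/%m/%y",
    "MM/DD/YY": "%m/%d/%y",
    "YY/MM/DD": "%y/%m/%d",
}


def possible_dates(raw: str) -> dict:
    """Return every ISO 8601 date that `raw` could represent, keyed by layout."""
    readings = {}
    for label, fmt in CANDIDATE_FORMATS.items():
        try:
            readings[label] = datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue  # `raw` is not a valid date under this layout
    return readings


def is_ambiguous(raw: str) -> bool:
    """True when the candidate layouts disagree about the calendar date."""
    return len(set(possible_dates(raw).values())) > 1


for layout, iso_date in possible_dates("03/04/21").items():
    print(layout, "->", iso_date)
```

A value written directly in the ISO 8601 form YYYY-MM-DD has a single reading, which is the practical argument for adopting it on collection forms.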
3. Data Types:
Another issue that can add to data processing time arises when the same or similar data fields have different value types between forms, resulting in data string variations that may not be easily compared and that require different levels of processing. For example, where one form may offer a Boolean (True/False) value in response to whether a case has a “fever” (i.e., “Yes/No”), another form may ask for the highest temperature recorded (Table 7). The latter may have no declared data structure informing the user whether temperature should be written as a string of characters or a number and whether it should be in Celsius or Fahrenheit. And if a data curator, who was not the data recorder, is presented with a checkbox, will an “x” (☒) be interpreted as TRUE like a checkmark (☑), or will the data curator infer a negative context and input FALSE? Comparing dissimilar data types presents problems for computer-based analysis when the information recorded differs from what the software is written to handle, causing data corruption, system crashes, or unintentional transformations. For example, if “Yes” is entered into a field expecting a number, the software may return “False” because no number was received, and a downstream user may assume that “False” was intentionally entered to convey “No”.
Table 7
Examples of data type variations when collecting “Fever” information via Canadian case report forms.
Case Report Form | Question | Input | Data Type / Information |
National^a | Fever (≥ 38°C) | ☐ Yes ☐ No ☐ Unknown ☐ Not asked/assessed | TRUE/FALSE for fevers greater than or equal to 38 Celsius, missing value options |
BC | Fever | ☐ Yes ☐ No ☐ Asked but Unknown ☐ Declined to Answer ☐ Not Assessed | TRUE/FALSE or missing value options. |
 | If yes, specify the highest temperature recorded: | ____ °C | Free text; may be words or numbers. |
MB | Fever (> 38°C) | ☐ | TRUE/FALSE only for fevers greater than 38 Celsius |
NB | Fever/chills | ☐ | TRUE/FALSE for Fever and/or chills. Unless “Fever” is circled, data is unspecified as to whether a fever occurred |
NWT | Fever | ☐ | TRUE/FALSE |
 | Temperature if known: | | Free text; may be words or numbers, Celsius or Fahrenheit not specified |
ON | Fever (≥ 38°C) | ☐ | TRUE/FALSE for fevers greater than or equal to 38 Celsius |
QC | Fever (≥ 38°C) | ☐ Yes ☐ No ☐ Unknown | TRUE/FALSE for fevers greater than or equal to 38 Celsius, missing value option |
a The following provinces/territories were utilizing the Interim National Case Report Form at the time of analysis: AB, NL, NS, NU, PEI, SK, and YK. |
Demonstrates the varying data types and information that can be collected across case report forms, many of which are similar but not identical. Temperature recordings may carry additional context (e.g., in BC this would be the highest recording if multiple measurements were taken), be a specific number when known (BC and NWT), or be taken in different temperature scales (NWT could be recorded in Fahrenheit or Celsius, while all others are in Celsius). For some forms the definition of “Fever” also varies (National, ON, and QC would consider “38°C” a fever while MB would not). |
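The fever variations in Table 7 can be reconciled, at least in sketch form, by mapping each form's answer onto a shared record that keeps three things apart: whether a fever was reported, which definition the source form applied, and any measured temperature. The code below is a minimal illustration under stated assumptions, not the approach taken by CanCOGeN; `FeverRecord`, `normalize_fever`, and the threshold strings are hypothetical. Missing answers are stored as `None` rather than `False`, and unrecognized input raises an error instead of being silently coerced.

```python
from dataclasses import dataclass
from typing import Optional

# Fever definitions per form, transcribed from Table 7; forms that state
# no threshold (BC, NB, NWT) map to None.
FEVER_DEFINITION = {
    "National": ">=38C", "ON": ">=38C", "QC": ">=38C", "MB": ">38C",
    "BC": None, "NB": None, "NWT": None,
}

# Answers that mean "no information", kept distinct from "No".
MISSING_ANSWERS = {
    "unknown", "not asked/assessed", "asked but unknown",
    "declined to answer", "not assessed",
}


@dataclass(frozen=True)
class FeverRecord:
    present: Optional[bool]        # None = missing, never coerced to False
    definition: Optional[str]      # threshold the source form applied, if any
    highest_temp_c: Optional[float] = None


def normalize_fever(form: str, answer: str, temp_c: Optional[str] = None) -> FeverRecord:
    """Map one form's fever answer onto a shared three-state record."""
    key = answer.strip().lower()
    if key == "yes":
        present = True
    elif key == "no":
        present = False
    elif key in MISSING_ANSWERS:
        present = None
    else:
        # Refuse unexpected input rather than silently turning it into False.
        raise ValueError(f"unrecognized fever answer: {answer!r}")
    temperature = float(temp_c) if temp_c not in (None, "") else None
    return FeverRecord(present, FEVER_DEFINITION[form], temperature)


print(normalize_fever("BC", "Yes", "38.6"))
print(normalize_fever("QC", "Unknown"))
```

The temperature argument is assumed to be in Celsius, as stated on the BC form; a free-text value with no stated scale, as on the NWT form, cannot be converted safely and would need to be kept as raw text.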
4. Data Granularity:
A recurring complication in comparing data across case report forms is variation in granularity. In this context, granularity refers to the level of detail of a data element and how it is subdivided. The depth of analysis becomes limited when data collection sources differentiate descriptors in ways that make them difficult to match to a common term. For example, one form may record “cough” while others record “dry cough”, “productive cough”, or “new onset/exacerbation of chronic cough”; this differentiation can result in inappropriate mappings and/or a loss of pathology information (Table 6). The inability of a pathologist to differentiate between dry and productive coughs can impact how respiratory diseases are defined and differentiated. Additionally, terms are sometimes grouped together without clear instruction or demarcation. Hypothetically, a data collector may mark “Nausea/Vomiting” as “True” because the patient had been nauseated. Downstream data entry/analysis personnel could interpret “Nausea/Vomiting” as a data point towards “Vomiting” when no vomiting had ever occurred, associating a false sign or symptom with a disease while also losing the intended “Nausea” data point. Multiple concepts in the same field create uncertainty (does “Nausea/Vomiting” indicate “Nausea”, “Vomiting”, or both?) while also making it hard to fit data with other datasets where the concepts are in separate fields.
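Both granularity problems can be illustrated with a short sketch. The first function maps a specific term to a broader comparable term while keeping the recorded value, so the detail is not lost when datasets are combined. The second expands a combined checkbox into its parts without asserting more than the form recorded. The lookup table and function names are hypothetical, and treating an unchecked box as "No" for every part is an assumption that may not hold for forms where unchecked can also mean "not assessed".

```python
from typing import Optional

# Hypothetical broader-term lookup; a real project would draw these
# relationships from a curated vocabulary rather than a hand-written dict.
BROADER_TERM = {
    "dry cough": "cough",
    "productive cough": "cough",
    "new onset/exacerbation of chronic cough": "cough",
}


def harmonize(term: str) -> dict:
    """Return a comparable (broad) term while keeping the recorded detail."""
    recorded = term.strip().lower()
    return {"recorded": recorded, "comparable": BROADER_TERM.get(recorded, recorded)}


def split_combined(field: str, checked: bool) -> dict:
    """Expand a combined checkbox such as "Nausea/Vomiting" into its parts.

    A checked combined box only shows that at least one part applies, so each
    part is marked None (unknown). An unchecked box is assumed to mean "No"
    for every part.
    """
    parts = [part.strip().lower() for part in field.split("/")]
    value: Optional[bool] = None if (checked and len(parts) > 1) else checked
    return {part: value for part in parts}


print(harmonize("Dry Cough"))
print(split_combined("Nausea/Vomiting", True))
```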
5. Semantic Ambiguity:
A non-trivial issue across case report forms is that the meaning of words can differ between them. Semantic ambiguity results when a term has more than one meaning, so that a data value can be read with a meaning other than the one intended. An example of an ambiguous term that appeared on case collection forms is “Isolation”. Without explicit explanation, it is unclear to the data user whether this corresponds to “Self-Isolation”, “Home Isolation”, or “Hospital Isolation”, all of which are terms that appear on other case report forms. And if a form does indicate “Hospital Isolation”, does this mean that the patient was put into a private room, away from other patients, or put under “Negative Pressure” conditions (i.e., where there is a minimum number of air exchanges per hour)? Being unable to distinguish between “Home Isolation” and “Hospital Isolation” may have consequences for epidemiologists modeling the spread of the disease, as transmission in these scenarios is significantly different. Analysts and decision makers must form their own assumptions about the meaning of terms in order to parse data; should these assumptions not correspond to those made by the data recorder, research conclusions and policy implementations may not reflect the ground truth. One way to mitigate this risk is to provide case report form users and downstream data entry personnel with a controlled vocabulary that clearly conveys the intended meaning.
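A controlled vocabulary can be enforced at data entry with something as simple as an enumeration. The sketch below is a hedged illustration: the three labels are the isolation terms quoted above, while the `Isolation` class and `code_isolation` function are hypothetical and not drawn from any published standard. The bare term “Isolation” is rejected so that the ambiguity is resolved by the recorder rather than guessed downstream.

```python
from enum import Enum


class Isolation(Enum):
    # Labels taken from the case report form terms discussed in the text.
    SELF = "Self-Isolation"
    HOME = "Home Isolation"
    HOSPITAL = "Hospital Isolation"


def code_isolation(raw: str) -> Isolation:
    """Return the vocabulary term matching `raw`, or raise if none matches."""
    cleaned = raw.strip().casefold()
    for term in Isolation:
        if term.value.casefold() == cleaned:
            return term
    raise ValueError(
        f"{raw!r} is not in the controlled vocabulary; "
        "ask the recorder which kind of isolation was meant"
    )


print(code_isolation("home isolation"))
```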
6. Disparate Questions:
The presence of partially aligned but non-identical questions presents another barrier to data normalization. Increasing the homogeneity of questions increases the capacity of investigators to perform detailed, large-scale analyses. For example, question disparity presents issues in the collection and analysis of demographic information. Forms may inquire whether a patient identifies as “First Nations”, “Inuit”, or “Métis”, and/or whether a patient resides on a reserve, or the form may not request any patient Indigenous identification data at all (Table 8). Because of this disparity, questions may be removed or severely limited when analyzing large combined datasets where the data values have partial but not complete overlap of meaning; for example, “lives on reserve” (whether the individual resides in a location with “reserve status” [17]) and “identifies as Indigenous” (self-determined Indigenous identification) are not equivalent.
Table 8
Indigenous Identification Data fields across Canadian case report forms.
Case Report Form | Identify as Indigenous | First Nations Status | First Nations | Métis | Inuit | Combination^a |
National^b | ✔ | | ✔ | ✔ | ✔ | |
BC | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
MB | ✓ | ✓ | ✓ | ✓ | ✓ | |
NB | | | | | | |
NWT | | | | | | |
ON | ✓ | | ✓ | ✓ | ✓ | |
QC | ✓ | | ✓ | | ✓ | |
a Options for “First Nations and Inuit”, “First Nations and Métis”, “First Nations, Inuit and Métis”, or “Inuit and Métis”. |
b The following provinces/territories were utilizing the Interim National Case Report Form at the time of analysis: AB, NL, NS, NU, PEI, SK, and YK. |
Provinces/Territories: Alberta (AB), British Columbia (BC), Ontario (ON), Québec (QC), Manitoba (MB), New Brunswick (NB), Newfoundland and Labrador (NL), Nova Scotia (NS), Nunavut (NU), Northwest Territories (NWT), Prince Edward Island (PEI), Saskatchewan (SK), Yukon (YK). Table adapted from “Comparison and analysis of Canadian public health SARS-CoV-2 case report forms” [14]. |
We also identified questions with no overlap between case report forms. QC was the only province/territory form to inquire whether a patient experienced “pregnancy complications” or whether the patient was a worker exposed to direct customer contact. Similarly, NB was the only province/territory to list “coryza” (acute inflammation of the nasal passage) under the assessment of symptoms. This does not imply that these questions are not important to ask, but rather that their value is lessened since they appear infrequently during the data collection process. One could argue that these questions are unique to the region and jurisdiction collecting them; however, we could not identify any instances where this appeared to be the case. It is also reasonable to assume that other jurisdictions chose not to include these fields/values to limit the size of their case report form. There is no strict limit on case report form length, but too many fields increase the burden of data entry on health care workers and patients – increasing the likelihood of some portions being missed or skipped. Case report form designers recognize that requesting too much of the form users may result in diminishing or negative returns on data quality and quantity. Some coordination across the nation could significantly reduce provincial/territorial inconsistencies, especially among high-priority descriptors.
Indigenous Identification Data:
Eleven of the thirteen Canadian case report forms were found to collect up to four categories of identification data pertaining to Indigenous peoples in Canada. These categories include First Nations Status, Identify as Indigenous, Indigenous Heritage, and Reservation/Community information. Indigenous identification (regardless of community designation) data collection on case report forms is represented in Table 8. Collecting this information is important as it provides a means to highlight systemic inequalities impacting Indigenous populations, supporting positive interventions and policy change.
First Nations Status is a distinct legal status available to Indigenous peoples in Canada who meet the eligibility criteria [18]. The process of being legally recognized as having First Nations Status can be laborious and difficult, often resulting in many First Nations peoples not being granted this status [18]. Data regarding First Nations Status was only collected on the BC and MB forms. Both provinces included separate options to Identify as Indigenous, an important addition for acknowledging and acquiring data on First Nations people who were ineligible for status. Capturing differences in status information is pertinent as it allows for the analysis of how status may impact health outcomes (e.g. via access to health and government services).
All case report forms that included the option to Identify as Indigenous also included some capacity to indicate Indigenous Heritage information. The Indigenous Heritage options were First Nations, Métis, and Inuit. However, the QC case report form did not include an option for Métis, and the BC case report form provided additional explicit options for inputs of any combination of the aforementioned options; other forms did not restrict the selection of more than one option. Collecting this level of disaggregated data allows for a more diverse inequality analysis of potentially intersecting demographics [19]. The BC Office of the Human Rights Commissioner recommends the immediate collection of disaggregated demographic data in the area of health care [19]. In order to ensure that race-based data is being observed through the lens of reducing oppression and systemic racism, and not that of measuring race, custodianship of this data should be put within the hands of Indigenous organizations [19]; however, this cannot be done if the appropriate Indigenous organization associated with the data cannot be identified.
Outside of the utility of Indigenous community demographic data for public health analysis, collecting Indigenous demographic data is important for the identification of the Indigenous nation and organization that are responsible for data custodianship under Indigenous data governance initiatives [20, 21]. The national, ON, and QC case report forms collected whether the patient resides on a reserve, while Indigenous community is collected on MB and NB - with the former only collecting this information if the patient is symptomatic. There is an important distinction to recognize between these terms; while a reserve is an Indigenous community, reserves are designated a specific reserve status that other Indigenous communities may not qualify for [17]. BC was the only province to implement the collection of Indigenous organization information (e.g. “Nazko First Nation”).
It is important for us to acknowledge that Indigenous identification data is not covered by the CanCOGeN VirusSeq specification. This is primarily due to the lack of appropriate and culturally sensitive data standards. The CanCOGeN metadata harmonization team is working towards identifying language that is appropriate for data capture with the assistance of the CanCOGeN Ethics and Governance Working Group, and consultation with Indigenous organizations will be a key part of further development.