Inaccurate recording of routinely collected data items influences identification of COVID-19 patients

Background During the Coronavirus disease 2019 (COVID-19) pandemic it became apparent that it is difficult to extract standardized Electronic Health Record (EHR) data for secondary purposes such as public health decision-making. Accurate recording of, for example, standardized diagnosis codes and test results is required to identify all COVID-19 patients. This study aimed to investigate whether specific combinations of routinely collected data items for COVID-19 can be used to identify an accurate set of intensive care unit (ICU)-admitted COVID-19 patients.
Methods The following routinely collected EHR data items for identifying COVID-19 patients were evaluated: positive reverse transcription polymerase chain reaction (RT-PCR) test results, problem list codes for COVID-19 registered by healthcare professionals, and COVID-19 infection labels. COVID-19 codes registered retrospectively by clinical coders after discharge were also evaluated. A gold standard dataset was created by evaluating two datasets of suspected and confirmed COVID-19 patients admitted to the ICU of a Dutch university hospital between February 2020 and December 2020, of which one set was manually maintained by intensivists and one set was extracted from the EHR by a research data management department. Patients were labeled 'COVID-19' if their EHR record showed a COVID-19 diagnosis during or right before an ICU admission. Patients were labeled 'non-COVID-19' if the record indicated no COVID-19, exclusion, or only suspicion during or right before an ICU admission, or if COVID-19 was diagnosed and cured during non-ICU episodes of the hospitalization in which an ICU admission took place. Performance was determined for 37 queries including real-time and retrospective data items. We used the F1 score, which is the harmonic mean of precision and recall.
The gold standard dataset was split into one subset including admissions between February and April and one subset including admissions between May and December to determine accuracy differences.
Results The total dataset consisted of 402 patients: 196 'COVID-19' and 206 'non-COVID-19' patients. F1 scores of search queries including EHR data items that can be extracted in real-time ranged between 0.68 and 0.97; for search queries including the data item that was retrospectively registered by clinical coders, F1 scores ranged between 0.73 and 0.99. F1 scores showed no clear pattern in variability between the two time periods.
Conclusions Our study showed that one cannot rely on individual routinely collected data items, such as coded COVID-19 on problem lists, to identify all COVID-19 patients. If information is not required in real-time, medical coding by clinical coders is most reliable. Researchers should be transparent about the methods used to extract data. To maximize the ability to identify all COVID-19 cases, alerts for inconsistent data and policies for standardized data capture could enable reliable data reuse.


Introduction
During pandemics such as the Coronavirus disease 2019 (COVID-19) pandemic, information sharing on patient characteristics, treatment and outcomes is crucial [1][2][3][4][5]. Public health decision-making and forecasting of required resources (e.g., ICU beds, ventilators, or protective gear) depend heavily on the number of patients in medical centers [3,[5][6][7][8]. The hypothesis is that these public health information needs could be fulfilled by reusing Electronic Health Record (EHR) data, under the assumption that healthcare professionals keep information on, in this case, COVID-19 patients complete and up-to-date for care purposes, e.g., adjusting records when a diagnosis changes from uncertain to confirmed, cured, or ruled out, or when the patient is discharged or deceased. To be able to extract or exchange these data, it is required that they are stored in a structured and standardized format. Problem lists can help physicians track a patient's status and progress, and organize clinical reasoning and documentation in a structured and standardized way using, for instance, International Classification of Diseases, Tenth Revision (ICD-10) coding [9][10][11]. Unfortunately, data in EHRs are highly heterogeneous [12][13][14] due to variations in unstructured (e.g., free-text) data and incomplete structured data (i.e., current problem lists are not always kept up-to-date) [8,[15][16][17][18]. Most healthcare professionals believe that free text should always be an option to indicate problems that are hard to code or to indicate uncertainty in diagnoses [19,20]. This suggests that if data are not extracted from appropriate locations in the EHR, or if data are regularly recorded in a free-text field and structured fields are not kept up-to-date, real-time (automatic) extraction will likely produce incomplete or inconsistent information [21][22][23].
As a result, in the Netherlands, secondary registers for COVID-19 intensive care unit (ICU) admissions were put in place, where data were entered manually by healthcare professionals [24,25]. Manually collected data are considered time-intensive but also error-prone [15,[26][27][28], especially since ICUs were under high pressure [29], which can adversely affect analyses leading to potential erroneous conclusions [27].
While ideally data can be extracted automatically and in real-time to support, e.g., public health decision-making, this currently may result in under- or overestimation of the prevalence of patients, which could be a significant hindrance for high-quality research, capacity planning and resource management [3,[30][31][32][33], as governments take measures based on the numbers reported. To our knowledge, no previous research has systematically investigated the accuracy of routinely collected data for COVID-19 case finding. Hence, the aim of this study is to investigate whether specific combinations of routinely collected data items for COVID-19 can be used to identify an accurate set of ICU-admitted COVID-19 patients. We propose recommendations on how to improve data accuracy such that in the future we are better prepared for situations similar to the COVID-19 pandemic that require data collection and processing in real-time, thereby also reducing the unnecessary administrative workload of recording COVID-19 patients twice [6].

Definition of a COVID-19 patient
To better understand what data items are required to accurately identify COVID-19 patients, we need to understand the concept of a 'COVID-19' patient. The concept 'COVID-19 patient' has been internationally defined as a patient having a positive test result [28], which is indicated by reverse-transcription polymerase chain reaction (RT-PCR) testing or by chest computed tomography (CT) scans showing a COVID-19 Reporting and Data System (CO-RADS) score above four [34][35][36]. However, these tests are not always available, and a patient could have a negative test result but still be considered a COVID-19 patient due to obvious symptoms and contact with infected cases. The World Health Organization (WHO) has provided specific codes for patients with positive test results irrespective of severity of clinical signs or symptoms (ICD-10 code U07.1) and for patients diagnosed clinically or epidemiologically where laboratory testing is inconclusive or not available (ICD-10 code U07.2) [37,38]. In the Netherlands, the Diagnosis Thesaurus (DT) that underlies problem lists in EHRs includes ICD-10 coded clinical concepts such as U07.1 and U07.2, which are also labeled with synonyms or 'preference' terms. These preference terms for COVID-19 are, for instance, 'disease caused by sars-cov-2' (corresponding ICD-10 code: U07.1) or 'disease potentially caused by sars-cov-2' (corresponding ICD-10 code: U07.2) [39,40]. As described in WHO and Dutch guidelines, U07.2 can therefore be used for (highly) suspected cases of COVID-19 and for cases of COVID-19 that are certain but not confirmed by laboratory testing. In the Netherlands, diagnoses for which the patient was admitted to the hospital are also separately ICD-10-coded by clinical coders (often months) after discharge. Hence, this also applies to COVID-19 patients, who were coded retrospectively with U07.1 or U07.2.
However, the specific codes for COVID-19 were added and changed over the course of two months, which required healthcare professionals or clinical coders to retrospectively adjust codes for some patients, such as 'other viral pneumonia' (corresponding ICD-10 code: J12.89), which was initially advised [38,40,41]. Additionally, in our hospital, a specialized infection prevention department assigns and updates confirmed and suspected COVID-19 infection labels, indicating a potential need for isolation, twice a day. Healthcare professionals can also add infection labels. In conclusion, for this study we used four data items to identify a COVID-19 patient: positive RT-PCR test results, COVID-19 coding by healthcare professionals, COVID-19 coding by clinical coders, and infection labels.

Data collection
We performed a retrospective analysis on routinely collected data from two sources including suspected and confirmed COVID-19 patients admitted to the Amsterdam University Medical Center between 1 February 2020 and 31 December 2020:
• The ICU dataset: this dataset included clinically confirmed COVID-19 patients with their unique patient identifiers (provided by the hospital) and ICU admission and discharge dates. This list was prospectively and manually maintained outside of the EHR system by intensivists and retrieved by researchers as a single Excel file.
• The EHR extract dataset: the data research department of this Dutch university hospital queried the EHR system (Epic) for all confirmed and suspected COVID-19 patients and stored the results in a data warehouse, from which the researchers could retrieve them via a secure server as a single Excel file. The criteria used by the data research department are based on RT-PCR test results, COVID-19 coding from healthcare professionals and infection labels, shown in Appendix A. As a result, the dataset included unique patient identifiers; hospital admission and discharge dates; the previous, current and next wards that indicate departments such as the ICU where patients have been admitted within one hospital admission; (sub)specialties; RT-PCR test results; all ICD-10 diagnoses recorded on the problem list by healthcare professionals; and infection labels.
For each patient in the ICU and EHR extract dataset the data research department enriched the data with the ICD-10 diagnoses retrospectively registered by clinical coders from our hospital.

Data processing
We created one dataset that included all adult patients from the EHR extract dataset who were labeled suspected and/or confirmed COVID-19 at any point during their hospital admission and who had also been admitted to the ICU department before 31 December 2020, by selecting patient records that had 'Intensive care volwassenen' (English: intensive care for adults) as location. We also added patients from the ICU dataset who were admitted to the ICU department before 31 December 2020, and removed duplicate patients.
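The merging and deduplication steps above can be sketched as follows. This is a minimal illustration in Python/pandas, not the authors' actual extraction code (which was not published); the column names, ward label matching and example records are all hypothetical.

```python
import pandas as pd

# Hypothetical column names and toy records; the real extract uses other fields.
ehr_extract = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "ward": ["Intensive care volwassenen", "Cardiology",
             "Intensive care volwassenen", "Intensive care volwassenen"],
    "admission_date": pd.to_datetime(
        ["2020-03-01", "2020-03-05", "2020-11-20", "2021-01-10"]),
})
icu_dataset = pd.DataFrame({
    "patient_id": [3, 5],
    "admission_date": pd.to_datetime(["2020-11-20", "2020-04-02"]),
})

cutoff = pd.Timestamp("2020-12-31")

# Keep adult-ICU admissions from the EHR extract before the cutoff date.
from_ehr = ehr_extract[
    (ehr_extract["ward"] == "Intensive care volwassenen")
    & (ehr_extract["admission_date"] <= cutoff)
]

# Add ICU-dataset patients admitted before the cutoff, then deduplicate.
from_icu = icu_dataset[icu_dataset["admission_date"] <= cutoff]
combined = (
    pd.concat([from_ehr[["patient_id"]], from_icu[["patient_id"]]])
    .drop_duplicates()
    .reset_index(drop=True)
)
```

In this toy example, one patient is dropped for a non-ICU ward, one for an admission after the cutoff, and one overlapping patient is deduplicated.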

Gold standard annotation by labeling (non-)COVID-19 patients
We annotated each patient in our dataset with a COVID-19 or non-COVID-19 label based on typical EHR data items that could describe the presence or exclusion of a COVID-19 diagnosis (Fig. 1). Patients that were included in both the EHR extract dataset and the ICU dataset were labeled 'COVID-19' if their admission carried an ICD-10 code for confirmed COVID-19 (U07.1) registered retrospectively by clinical coders. If these codes were not (yet) available, or if patients only occurred in one of the datasets, author ESK manually checked patients in

Performance of routinely collected data items to identify COVID-19 patients
Some standardized routinely collected data items are theoretically suitable to identify all COVID-19 patients, as they have a value that is necessary and sufficient to discriminate between COVID-19 and non-COVID-19 patients. Table 1 shows the search queries including routinely collected data items that we applied to the gold standard dataset to determine the percentage of COVID-19 and non-COVID-19 patients retrieved per individual item and per specific combination of two or three data items (e.g., the percentage of patients retrieved with a positive RT-PCR test result and a confirmed infection label). A total of 37 search queries including the (combinations of) data items were applied to the dataset. As shown, four search queries included data items that can be extracted from the EHR in real-time, and one search query, shown in italic, included a data item that is retrospectively registered by clinical coders and cannot be extracted in real-time. It is important to mention that we only included two search strings with regard to COVID-19-specific ICD-10 coding: "U07.1 and/or U07.2" (the WHO definition) and "U07.1" (the Dutch definition). That is because in the Netherlands the ICD-10 code U07.2 is also used for suspected cases, which makes it difficult to determine whether a patient with only U07.2 is an actual COVID-19 patient. Confusion matrices were used to determine the performance of each search query. An example of a confusion matrix is shown in Appendix E. Note that positive RT-PCR test results were used to annotate a patient as a 'COVID-19' patient (Fig. 1), thus automatically leading to zero false positives in the corresponding confusion matrices. Performance was defined in terms of recall, specificity and precision. Recall is a measure of how many of the COVID-19 patients were correctly identified by the data item indicating COVID-19, over all COVID-19 cases in our dataset.
Specificity is defined as the proportion of patients that were correctly identified as not having the data item indicating COVID-19 (i.e., true negatives). Precision is a measure of how many retrieved patients were correctly identified as having COVID-19 (i.e., true positives). We also reported the F1 score, which is the harmonic mean of precision and recall. An F1 score lies between zero and one, where one indicates perfect precision and recall. RStudio statistical software (v 1.2.1335) for Windows was used for data analysis. Exact binomial 95% confidence intervals (CI) were calculated for recall, specificity and precision using the 'epi.tests' function from the 'epiR' package. The formulas for recall, specificity, precision and the F1 score are shown in Appendix E. We split the final gold standard dataset into two equally sized subsets to determine whether data accuracy differed between earlier months (admission dates between 1 February and 30 April) and later months (admission dates between 1 May and 31 December) of the pandemic. Fig. 2 shows that the gold standard dataset included 402 suspected and confirmed COVID-19 patients who had been at the ICU at some point during an admission between 1-2-2020 and 31-12-2020, of which 196 patients were labeled COVID-19 and 206 patients were labeled non-COVID-19. As shown, sixteen patients were actual COVID-19 patients, but they were excluded because they had not been at the ICU while being diagnosed with COVID-19; instead they went through COVID-19 during other non-ICU episodes of the same hospital admission in which an ICU admission took place. Table 2 in Appendix B shows the recall, specificity, precision and F1 scores for the complete set and the two subsets. The numbers of patients retrieved by applying the search queries to the complete gold standard dataset and the corresponding F1 scores are shown in Fig. 3, with the legend shown below. In Appendix C similar figures are shown for both subsets.
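The four performance measures can be computed directly from the confusion-matrix counts. The study used R ('epi.tests' from the 'epiR' package); the sketch below is an equivalent illustration in Python, with purely illustrative counts that are not figures from the study.

```python
def query_performance(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute recall, specificity, precision and F1 from confusion-matrix counts."""
    recall = tp / (tp + fn)          # COVID-19 patients retrieved by the query
    specificity = tn / (tn + fp)     # non-COVID-19 patients correctly excluded
    precision = tp / (tp + fp)       # retrieved patients who truly had COVID-19
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "f1": f1}

# Illustrative counts only (not taken from the study's confusion matrices).
metrics = query_performance(tp=160, fp=20, fn=36, tn=186)
```

Note that for queries on positive RT-PCR results, fp is zero by construction, since positive RT-PCR results were part of the gold standard annotation.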
In the complete gold standard dataset, search queries including data items that can be extracted real-time from the EHR had F1 scores ranging from 0.68 to 0.97 and returned total numbers of patients ranging from 111 to 327. Search queries including the data item that was retrospectively registered by clinical coders after discharge (ICD-10 code U07.1) had F1 scores ranging from 0.73 to 0.99 and returned total numbers of patients ranging from 112 to 327. Our results show varying F1 scores over the 37 search queries and over the two time periods, without a clear pattern. Table 3 in Appendix D shows more specific details per data item for the COVID-19 and non-COVID-19 labeled patients, showing, e.g., the number of patients coded with U07.2. Confusion matrices to determine the performance per search query are shown in Table 5.

Table 1 Search queries including routinely collected data items to identify an accurate set of COVID-19 patients.
A: Search queries including EHR data items that could be extracted real-time from the EHR:
• Positive RT-PCR test result
• The ICD-10 code for COVID-19 (U07.1 and/or U07.2) by healthcare professionals *
• The ICD-10 code for COVID-19 (U07.1) by healthcare professionals **
• An infection label for COVID-19 (confirmed)
B: Search queries including the data item that cannot be extracted real-time as it is retrospectively registered:
• The ICD-10 code for COVID-19 (U07.1) by clinical coders

Principal findings
In this study, we investigated whether we could use specific combinations of routinely collected data items to identify an accurate set of ICU-admitted COVID-19 patients. Our results showed that if information is not required to be available in real-time, e.g., for retrospective research questions, extracting patients with queries including U07.1 codes registered by clinical coders returns a more accurate set than queries including only real-time data items. Earlier studies also showed high reliability of codes by clinical coders [43,44]. However, real-time data are required for monitoring and forecasting the (national) need for ICU beds, ventilators or protective gear [6]. One of the main findings of this study is that, depending on the search query used to identify COVID-19 patients in real-time, patients would be missed or wrongly included, which might have negative consequences for, e.g., bed capacity planning and research. While one might use a search query that coincidentally returns the correct number of COVID-19 patients, which may hence result in correct bed capacity planning, the combination of false positives and true positives may still impact research findings due to the inclusion of wrong patient characteristics.
The outcomes of this study also showed that including infection labels in a search query mostly resulted in higher performance, which can be explained by the fact that infection labels are maintained daily by the infection prevention department in our hospital. Including ICD-10 coding from problem lists resulted overall in relatively low performance. We hypothesize that healthcare providers used U07.2 to indicate both confirmed and suspected COVID-19 patients, which explains why performance is lower when U07.2 is included in search queries. This can be explained by the fact that in the Netherlands the synonym or preference terms from the DT linked to U07.2 are described by 'suspected' and 'probable', whereas according to the WHO U07.2 can also be used for confirmed COVID-19 patients, albeit not proven by laboratory tests. Our study also showed variability in the accuracy of U07.1 coding and of U07.1 and/or U07.2 coding by healthcare providers over time, without a clear pattern. This could be partially explained by the fact that concepts for COVID-19 such as U07.2 were added over the course of two months and local implementation rules changed, which required healthcare providers to manually adjust codes for some patients [38,40]. Additionally, the variability can be explained by the fact that COVID-19 cases might have been overestimated at the beginning of the pandemic, shown by the higher number of false positives, indicating that more (suspected) cases were registered with U07.1. The use of these codes might therefore not be consistent across hospitals and countries, which is also supported by findings from a study that investigated the accuracy of COVID-19-specific ICD-10 coding using data from the Mass General Brigham health system (Boston, United States) [45]. That study showed lower recall (49.2%) and precision (90%) for the use of U07.1 compared to the recall (82%) and precision (99%) we found for U07.1 in our study.
Furthermore, some financial incentives may promote accurate COVID-19 coding [46]. Researchers showed that problem list accuracy increased when, among other measures, salary bonuses raised the willingness of healthcare providers to change their workflows [47]. In the Netherlands, hospitals received additional budget for COVID-19 care based on the number of patients treated; this might have influenced the accuracy of the problem list. Our study also showed that not all patients had positive RT-PCR test results. This could be explained by the fact that RT-PCR tests were not always available, especially at the beginning of the COVID-19 pandemic, which may account for the lower recall of positive RT-PCR test results in the first time period, or by the fact that some COVID-19 patients transferred from other hospitals were not tested again in the current hospital while data from the former hospital were not exchanged [18].

Relation to other literature
Former studies often used RT-PCR test results to include COVID-19 patients for research [48], sometimes even by including only patients with two positive RT-PCR test results for SARS-CoV-2 [45,49]. Research also shows that chest CT results are considered highly accurate for diagnosing COVID-19 [35] because of good sensitivity [50]. During analysis of patients in the original EHR, chest CT results were recorded as free text, which made the analysis time-intensive and the results not interpretable by machines. Considering that one might need all COVID-19 patients for surveillance or bed capacity planning, patients who did not have positive RT-PCR test results but did have positive chest CT results might be missed due to variations in detail and the free-text format.
It should be noted that for our COVID-19 use case, identifying patients based on testing is possible because disease-specific tests exist. For other diseases, such tests or other markers might be lacking, which makes researchers, governments and other parties more dependent on (standardized) diagnoses on problem lists. Recent studies show that researchers strongly rely on (other) coding systems (ICD-10 and SNOMED CT) to select cohorts for research, for instance to determine whether patients diagnosed with substance use disorders were at increased risk for COVID-19 [51]. Another retrospective study included 513,284 confirmed COVID-19 cases based on "a cohort of all patients who had a confirmed diagnosis of COVID-19 (ICD-10 code U07.1)" [52]. However, this requires that problem list codes are maintained when new evidence becomes available that proves the existence or absence of the disease. Our current study showed that when using ICD-10 coding from problem lists we would have both wrongly included and missed COVID-19 patients, which indicates that problem lists are not kept up-to-date, e.g., old problems are not removed or resolved. This is also in line with previous research [9,19,20,47,[53][54][55][56][57][58][59][60][61][62]. Research further shows that problem list use varies between specialties [47,63], diseases [64] and providers [65]. Providers are more likely to update problem lists for first-time patients than for patients they have seen before [65]. We further hypothesize that the accuracy of ICD-10 coding may vary between patients who have died or survived, as the reliability of ICD-10 coded causes of death mentioned on death certificates is variable [66][67][68]. Although this study does not take into account the impact of specific demographics on coding accuracy, we believe that this should be further investigated for COVID-19 and other diseases.

Recommendations to maximize the ability to identify an accurate set of COVID-19 patients
Firstly, considering that different discrepancies might occur when using different search queries, we recommend that researchers be transparent about their methods of data extraction, which is also supported by recent literature [69]. This also implies that when identifying data items, a clarification of the scope is needed [70], i.e., the specific use case for which the data items are required. For instance, for bed capacity planning a complete cohort of patients is required, but researchers might want to differentiate between patients who have been admitted with COVID-19 (a different primary diagnosis) and those admitted due to COVID-19 (COVID-19 being the primary diagnosis). Secondly, it is important to make users aware of the benefits (and potential harm to patient care if incorrect) of structured and standardized data capture and to encourage better documentation [20,71]. Thirdly, one should be careful using certain (combinations of) data items, particularly when including coded problem list data. Still, evidence suggests that patients with complete problem lists may receive higher quality care than patients with gaps in their problem list [10]. Hence, we believe that a specific policy on keeping a problem list up-to-date, including when to change a working diagnosis (suspected COVID-19) to the primary diagnosis (confirmed COVID-19) and when to close or remove a problem, is essential to reliably reuse problem list data, which is also supported by other studies [9,58,59,62,72]. Fourthly, in a problem-oriented medical record, ordering of RT-PCR tests could require ICD-10 code U07.2 on the problem list. Afterwards, alerts could be implemented in the EHR system to make users aware of this working diagnosis, e.g., a trigger alert for when U07.2 has been on the problem list for more than 24 h. Fifthly, validation rules implemented in the EHR system can be used to identify and solve inconsistencies during care and registration processes.
When a patient receives a positive RT-PCR test result, the system could prompt the healthcare provider to put ICD-10 code U07.1 on the problem list or to update U07.2 to the ICD-10 code for clinically confirmed COVID-19 (U07.1).
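Such a validation rule could be sketched as below. This is a hypothetical illustration of the proposed logic only, not an Epic or other vendor API; the class and function names are invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class ProblemList:
    # Set of ICD-10 codes currently on the patient's problem list.
    codes: set = field(default_factory=set)

def on_positive_rt_pcr(problems: ProblemList) -> str:
    """Hypothetical EHR validation rule: after a positive RT-PCR result,
    propose adding U07.1 or upgrading a suspected U07.2 entry to U07.1."""
    if "U07.1" in problems.codes:
        return "no action: confirmed COVID-19 already coded"
    if "U07.2" in problems.codes:
        # Working diagnosis present: upgrade suspected to confirmed.
        problems.codes.discard("U07.2")
        problems.codes.add("U07.1")
        return "proposed: upgrade U07.2 to U07.1"
    # No COVID-19 code yet: propose the confirmed code.
    problems.codes.add("U07.1")
    return "proposed: add U07.1"
```

In practice such a rule would only propose the change for the provider to confirm, rather than editing the problem list automatically.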

Strengths and limitations
Although many COVID-19 studies have been performed based on EHR data, to the best of our knowledge, this was the first study to unravel different routinely collected data items to identify COVID-19 patients. A limitation that should be mentioned is that data were obtained from a single site in the Netherlands, but we believe that the registration patterns observed in this system resemble those in other hospitals in the Netherlands as well as in other western countries with similar systems. Hence, we believe that hospitals in other countries could learn and benefit from the results as well. Additionally, our study only focused on the accuracy of routinely collected data items for ICU-admitted patients, which could differ from the accuracy of routinely collected data items for patients admitted to the general wards but not to the ICU [47,63]. Furthermore, in theory we could have missed COVID-19 patients in our gold standard, but we consider this highly unlikely because of the specific attention to COVID-19 in the ICU and the research data management department. A potential bias that hampers generalizability of our findings for case finding of other types of patients based on routinely collected EHR data is that for COVID-19, patient records and specifically problem lists might be kept more accurate than for other diseases, because of the higher perceived importance of correctly registering COVID-19 cases. Nonetheless, even for COVID-19, we showed that it is difficult to extract a complete cohort of patients, which is an important finding for future research using the EHR system for data extraction.

Conclusions
Our study showed that identifying COVID-19 patients using routinely collected data items can lead to missing or falsely including patients, and thus to an inaccurate set and incorrect numbers of COVID-19 patients. Researchers should therefore be transparent about their data extraction methods and related limitations. If the reuse purpose does not require real-time data, one should consider including clinical coding by clinical coders after discharge to maximize the ability to completely identify COVID-19 patients. Recommendations to further optimize EHR data quality include the implementation of a problem-oriented structure in the EHR, a policy on problem list use, and alerts for inconsistent data. The effectiveness of these recommendations should be evaluated in future research.

Author's contributions
ESK did the analysis of the data and wrote the drafts of the article. RC and NFdK supervised the process and commented on the drafts of the article as presented by ESK. DAD assisted in the final decision on the inclusion or exclusion of some patients for the gold standard.

Funding and declaration of interest
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. This study was funded by Amsterdam UMC 2019-AMC-JK-7. Amsterdam UMC did not have any role in the study design, collection, analysis, interpretation of the data, writing the report and the decision to submit the report for publication.

Statement on conflicts of interest
The authors declare that they have no competing interests.

Statement on author agreement
All authors have read and accepted the final manuscript.

Ethics in publishing
This is a quality improvement project carried out to improve data quality for operational reporting. The project was approved by the data protection officer in the hospital. The Medical Ethics Committee of the Amsterdam UMC judged that this study was not subject to the Dutch Medical Research Involving Human Subjects Act (W20_344 # 20.382) thus the need for ethical approval and patient consent was waived.

Summary table
What was already known on the topic.
• Data in EHRs are highly heterogeneous, which makes it difficult to extract data in real-time to guide public health decision-making, as was required during the COVID-19 pandemic for, e.g., surveillance, bed capacity planning and research.
What this study added to our knowledge.
• The study highlighted that at this point we cannot rely on potentially sufficient EHR data items for complete case finding.
• Researchers should be transparent about the methods they used to extract data, and consider using data encoded by clinical coders for more complete case finding.
• Implementation of a problem-oriented structure in the EHR, policies regarding standardized data capture, and alerts for inconsistent data need to be considered to improve data quality in the EHR and to maximize the ability to identify a complete set of COVID-19 patients.

Table 2 Performance of search queries including (combinations of) routinely collected data items to identify an accurate set of COVID-19 patients. The performance is determined using the gold standard dataset including the (non-)COVID-19 labels and two subsets. In white, the search queries including data items that could be extracted real-time from the EHR system are shown. In italic, the search queries including ICD-10 coding retrospectively registered by clinical coders are shown. Abbreviation: CI, confidence interval.

Appendix E
Table 4 shows an example of a confusion matrix. Confusion matrices for the search queries are shown in Table 5.
* Patients who did not have one positive and/or one negative test, but other test results (antibodies, invalid tests, cancelled tests), were considered 'only other test results'. Not-confirmed indicates that patients did not have any positive RT-PCR test result.
** Problem list codes are considered 'active' or 'closed'. Problems are closed when the episode is over but the problem should still be visible in the problem list (i.e., it remains relevant for medical history). When problems are corrected, they should be removed from the problem list, according to the problem list policy in our hospital.
*** For patients with both confirmed and suspected entries in either infection labels or problem lists, the dates in 'infection start moment' and 'date of observation' were checked to determine whether the confirmed or the suspected entry was older for infection labels and problem lists, respectively.
**** The infection note is a free-text field indicating more details about the infection status; this displays the number of records with contradictory information in the infection note compared to the standardized infection label.

Research data for this article
Due to legal regulations we are not allowed to make the datasets for this study publicly available. The authors can be contacted for more information on the datasets.