Natural Language Processing for Clinical Laboratory Data Repository Systems: Implementation and Evaluation for Respiratory Viruses

Background With the growing volume and complexity of laboratory repositories, it has become tedious to parse unstructured data into structured and tabulated formats for secondary uses such as decision support, quality assurance, and outcome analysis. However, advances in natural language processing (NLP) approaches have enabled efficient and automated extraction of clinically meaningful medical concepts from unstructured reports. Objective In this study, we aimed to determine the feasibility of using an NLP model for information extraction as an alternative approach to a time-consuming and operationally resource-intensive handcrafted rule-based tool. Therefore, we sought to develop and evaluate a deep learning–based NLP model to derive knowledge and extract information from text-based laboratory reports sourced from a provincial laboratory repository system. Methods The NLP model, a hierarchical multilabel classifier, was trained on a corpus of laboratory reports covering testing for 14 different respiratory viruses and viral subtypes. The corpus includes 87,500 unique laboratory reports annotated by 8 subject matter experts (SMEs). The classification task involved assigning the laboratory reports to labels at 2 levels: 24 fine-grained labels in level 1 and 6 coarse-grained labels in level 2. A “label” refers to the status of a specific virus or strain as being tested or detected (eg, influenza A is detected). The model’s performance stability and variation were analyzed across all labels in the classification task. Additionally, the model's generalizability was evaluated internally and externally on various test sets. Results Overall, the NLP model performed well on the internal, out-of-time (pre–COVID-19), and external (different laboratories) test sets, with microaveraged F1-scores >94% across all classes. Higher precision and recall scores with less variability were observed for the internal and pre–COVID-19 test sets.
As expected, the model’s performance varied across categories and virus types due to the imbalanced nature of the corpus and sample sizes per class. There were intrinsically fewer classes of viruses being detected than those tested; therefore, the model's performance (lowest F1-score of 57%) was noticeably lower in the detected cases. Conclusions We demonstrated that deep learning–based NLP models are promising solutions for information extraction from text-based laboratory reports. These approaches enable scalable, timely, and practical access to high-quality and encoded laboratory data if integrated into laboratory information system repositories.


Introduction
Clinical laboratory data account for a large proportion of the data stored in electronic health record systems worldwide and present a wealth of information vital for evidence-based decision-making and public health improvement 1,2. Laboratory information systems record, manage, and store laboratory test data to facilitate reporting to clinicians and jurisdictional laboratory information repositories 3. These repositories often include test orders and results from various laboratory service providers, such as hospitals, public health agencies, and private companies, and are populated as part of clinical care.
Several factors limit the secondary use of laboratory data for other purposes. The most important are concerns about the quality of the data, lack of standardization, and difficulty in extracting the needed information 4,5. Laboratory data vary over time due to evolving standards of care and changing population demographics. Furthermore, specific categories of laboratory data are reported as free text in an unstructured format with no standard vocabulary, which adds complexity to their secondary use 1. Therefore, efforts are needed to eliminate redundancies, extract the necessary information, and derive accurate interpretations from laboratory data.
Our institute has developed a specific information extraction workflow to manage the interpretation of a large volume of provincial clinical laboratory results, as shown in Figure 1. The workflow, called the semi-rule-based workflow, relies on time-consuming and operationally resource-intensive approaches, including a library of rule-based, handcrafted tools. These tools are explicitly programmed for various laboratory result categories and must be refined continually. To address the challenges with our existing semi-rule-based workflow and to automate the exhaustive information retrieval task, we built a deep learning-based natural language processing (NLP) tool. The objective of this study was to assess the feasibility of our deep learning-based NLP model and evaluate its performance relative to the semi-rule-based workflow.
The development of NLP methods is essential to automatically transform laboratory reports into a structured representation that scales data usability for research, quality improvement, and clinical purposes [6][7][8][9][10][11][12]. The semi-rule-based workflow is a multistep procedure in which all the unique reports were grouped by LOINC, year, and location in the first step. In the second step, SMEs created a list of dictionaries for terms related to the different viruses and strains and a set of if-then-else rules to generate interpretations and extract information from each laboratory report. The dictionaries and if-then-else rules were packaged as a Python library called the rule-based text parser. Finally, the parser was improved iteratively based on inputs from three SMEs (more details in the Supplementary Materials).
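As a toy illustration of this dictionary-plus-rules design, the sketch below flags viruses in a report string as tested or detected. The term list, negation cues, and function name are hypothetical simplifications for illustration, not the institute's actual parser.

```python
# Illustrative sketch of a dictionary-plus-if-then-else-rules parser.
# Terms and negation cues are hypothetical, not the production library.
VIRUS_TERMS = {
    "influenza a": "Influenza A",
    "influenza b": "Influenza B",
    "rsv": "RSV",
    "adenovirus": "Adenovirus",
}

NEGATIONS = ("not detected", "negative", "none detected")

def parse_report(text):
    """Return {virus: {"tested": bool, "detected": bool}} for one report."""
    results = {}
    # Treat each semicolon-separated clause as one virus result.
    for clause in text.lower().split(";"):
        for term, virus in VIRUS_TERMS.items():
            if term in clause:
                # Rule: a mentioned virus with a result is "tested";
                # it is "detected" only when no negation cue is present.
                detected = "detected" in clause and not any(
                    neg in clause for neg in NEGATIONS
                )
                results[virus] = {"tested": True, "detected": detected}
    return results
```

Real reports vary far more in wording, which is why the actual parser needed iterative refinement by SMEs.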
In this study, we focused on automating the retrieval of information related to respiratory viruses from the laboratory repository of Canada's most populous province. Respiratory viruses account for a substantial burden of disease globally.

Study Design
The dataset used in this study is a collection of laboratory reports covering testing for 14 different respiratory viruses and viral subtypes (Table 1). The reports are text-based and require cleaning, parsing, and encoding.
The dataset was derived from the Ontario Laboratories Information System (OLIS).
OLIS has over 100 contributors, which comprise hospital, commercial, and public health laboratories, adding to the complexity and variability of the clinical data.
These data were analyzed at ICES (formerly the Institute for Clinical Evaluative Sciences).
The automated encoding of laboratory testing reports into respiratory viruses is framed as a multi-label classification task according to a hierarchy. There are two levels in the classification hierarchy; at each level, the classification is multi-label, and each input text sequence can be assigned to a non-empty subset of labels, as shown in Figure 2. In the first level of the hierarchy, the classifier assigns outputs to mutually non-exclusive fine-grained labels. The fine-grained labels are reassigned to a coarse-grained set of labels in the second level of the classification hierarchy. In this work, "sequence" refers to an input laboratory report to the NLP model, which may be a single sentence or several sentences. A "label" refers to the status of a specific virus or strain as being tested or detected (e.g., influenza A is detected). Tested and Detected are not mutually exclusive; it must first be determined whether the specimen is tested for any virus, and it is then flagged as detected if the result is positive, so Detected is a subset of Tested. Mention Counts (#) in Table 1 represents the counts of specific virus terms across all distinct laboratory reports (unique sequences); it does not provide any clinical information regarding the prevalence of these viruses in our region.
* Tested and Detected represent the proportion of mentions flagged as tested or positively detected by our parser, respectively. Note: Tested and Detected are not mutually exclusive; we first determine whether the virus was tested for (i.e., has a result) and then flag it as detected if the result is positive. Detected is a subset of Tested. † The subtypes of Influenza A and RSV were analyzed only for detection, not testing, as the planned analyses of the respiratory virus data were primarily focused on the larger virus categories.
To summarize, the information extraction for an input text sequence involves retrieving virus types and identifying their status as being tested and/or detected. Figure 2. The fully automated deep learning-based NLP approach is a hierarchical multilabel classification task that retrieves virus (or strain) types and identifies their status as being tested and/or detected. Each input sequence was assigned to 24 mutually non-exclusive fine-grained labels in the first level of the hierarchy. In the second level of the classification hierarchy, the fine-grained labels were reassigned to a coarse-grained set of labels (k=6). An example laboratory report processed through the deep learning-based NLP approach for automated extraction and encoding of information is also shown.
Notes: A sequence refers to an input laboratory report to the NLP approach, which may be a single sentence or several sentences. A label refers to the status of a specific virus or strain (tested or detected). "Influenza is tested" implies the specimen was tested for any influenza type; however, the total number of "Influenza is tested" is greater than the total of "Influenza A tested" plus "Influenza B tested", since not all influenza mentions specify a type. The same applies to "Influenza is detected" and "RSV is tested".
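The second-level rollup from fine-grained to coarse-grained labels can be sketched as a simple mapping. The label names below are illustrative, following the "Any Influenza", "Any RSV", and "Any Virus" categories reported in the results; the actual label set and naming may differ.

```python
# Sketch of the second-level rollup: fine-grained (virus, status) labels
# are reassigned to coarse-grained "Any ..." labels. Names are illustrative.
def to_coarse(fine_labels):
    """fine_labels: set of (virus, status) pairs,
    with status in {"tested", "detected"}."""
    coarse = set()
    for virus, status in fine_labels:
        if virus.startswith("Influenza"):
            coarse.add(("Any Influenza", status))
        if virus.startswith("RSV"):
            coarse.add(("Any RSV", status))
        # Every fine-grained label also contributes to "Any Virus".
        coarse.add(("Any Virus", status))
    return coarse
```

Because the mapping is deterministic, only the first-level (fine-grained) classification has to be learned; the coarse level can be derived from it.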

Corpus Development Description
The model was evaluated on several test sets drawn from the Ontario Laboratories Information System (OLIS; shown in Table 2). The NLP model was implemented in TensorFlow on an NVIDIA Tesla GPU, and Adam was used as the optimization algorithm. The maximum sequence length was fixed at 400 words. The model was trained several times with random initialization on the development corpus, and the results of the top ten best-performing models on the test sets are presented here.
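A fixed maximum sequence length implies that tokenized reports must be truncated or padded before being fed to the model. A minimal sketch of that preprocessing step (the pad token and helper name are assumptions, not the actual implementation):

```python
MAX_LEN = 400  # maximum sequence length used during training

def pad_or_truncate(tokens, max_len=MAX_LEN, pad_token="<pad>"):
    """Clip a token list to max_len, or right-pad it, so that every
    input sequence fed to the model has the same fixed length."""
    if len(tokens) >= max_len:
        return tokens[:max_len]
    return tokens + [pad_token] * (max_len - len(tokens))
```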
The results for the fine-grained classification in the first level of the hierarchy are aggregated by micro-averaging across the 24 fine-grained labels and are shown in Table 3. The detailed performance for each label is shown in Supplementary Table 1. The F1-score performance of the model in the second level of the hierarchy, the coarse-grained multilabel classification, for 'Any Influenza', 'Any RSV', and 'Any Virus' is also shown in Table 3. In addition, the variation of the model's Precision and Recall scores, using barplots with 95% confidence intervals, is shown in Figure 3. In general, the model's estimates on all test sets varied across classes with varying degrees of uncertainty.
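The paper does not state how the confidence intervals were computed; one standard choice for per-class Precision or Recall, which are proportions, is the normal approximation, sketched below as an assumption rather than the authors' method.

```python
import math

def proportion_ci(successes, total, z=1.96):
    """Normal-approximation 95% CI for a proportion such as Precision
    (TP / (TP + FP)) or Recall (TP / (TP + FN)). One standard choice;
    the paper does not specify its CI method."""
    p = successes / total
    half = z * math.sqrt(p * (1 - p) / total)
    # Clamp to [0, 1] since a proportion cannot leave that range.
    return max(0.0, p - half), min(1.0, p + half)
```

For small, imbalanced classes the interval widens quickly, which is consistent with the larger error bars reported for the rarer labels.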
The averaged F1 scores of the estimates for both the "fine-grained (micro-averaged)" and "coarse-grained Any Virus" classes were above 90% on the internal test set. The F1 score for the coarse-grained 'Any Influenza detected' on all test sets was above 91%. Overall, the performance for "coarse-grained detected" classes was lower than for "coarse-grained tested" classes. Among the detected classes, the performance for 'Any Influenza' was evidently higher than for 'Any RSV'.

Models' Stability and Performance Variation between Classes
In general, the model performances on all test sets were variable across classes and virus types due to the imbalanced nature of the corpus and the sample sizes per class. There were intrinsically fewer classes of viruses detected compared with those tested; therefore, the model's performance was noticeably lower for the "detected" cases. Among detected cases, the lowest performance was observed for RSV A, and the highest performance among the tested cases was observed for Influenza. As a next step, we plan a silent period in which to prospectively explore the model's performance. During the silent period, our model will be integrated into the data quality and management workflow for the laboratory data repository, and the outputs will be validated internally in a fashion that avoids exposure to data users. We also plan rigorous evaluation and continuous refinement of the model in the silent period to better assess its performance before it enters production.
Another significant limitation of this study is that the model was trained only on respiratory virus laboratory reports. Even within that collection, some categories were naturally underrepresented, which affected the model's generalizability. During the silent period, more records from a diverse set of laboratory report categories will therefore be annotated and made available to the model, and the model will be updated accordingly. Although building deep learning-based NLP models is computationally and memory intensive, the benefit-to-cost ratio of these models in clinical settings will continue to increase.

Conclusions
The health industry is rapidly becoming digitized, and information extraction is a promising method for researchers and clinicians seeking quick retrieval of information embedded in texts. This study described the development and validation of a deep learning-based NLP approach to extract respiratory virus testing information from laboratory reports. We demonstrated that our system can classify and encode large volumes of text-based laboratory reports with high performance, without the time-consuming handcrafted feature engineering of previous approaches. Taken together, the findings of this study provide encouraging support that NLP-based information extraction should become an important component of laboratory information repositories, assisting researchers, clinicians, and healthcare providers with their information and knowledge management tasks.
Data access might be granted to those who meet prespecified criteria for confidential access, available at www.ices.on.ca/DAS (email das@ices.on.ca).

Figure 1 .
Figure 1. Semi-rule-based workflow vs. fully automated deep learning NLP approach. The semi-rule-based workflow relies on time-consuming and operationally resource-intensive approaches for the information extraction task. The corpus was derived from the Ontario Laboratories Information System (OLIS). OLIS has over 100 contributors, including hospital, commercial, and public health laboratories. Following basic text-cleaning steps, around 87k unique laboratory reports were collected and included in our corpus to be used in parallel by both approaches: semi-rule-based and deep learning NLP. The semi-rule-based workflow is a multistep procedure built on SME-curated dictionaries and if-then-else rules.

Figure 2
Figure 2 illustrates a running example of the input and output of the deep learning-based NLP model.
seq (#) represents the counts of unique sequences; a sequence refers to an input laboratory report to the NLP model, which may be a single sentence or several sentences. * Detected (%) and Tested (%) represent the aggregation of the proportion of any mentions of the virus terms from the total unique sequences in the dataset. + The out-of-time (post-COVID-19) set is a very small and imbalanced sample, including only 100 sequences with no mentions of any virus detected.

As expected, the performance on the internal test set was better than on the out-of-time (pre-COVID-19) and external test sets. In this regard, the F1 score results of the test sets were compared, and noticeable differences were observed between the pairs of internal and out-of-time (pre-COVID-19) test sets, internal and out-of-time (post-COVID-19) test sets, and internal and external test sets. The out-of-time (post-COVID-19) test set was a small and imbalanced sample, including 100 sequences with <6 mentions of any virus as being detected. The sample included 12 sequences labelled as being tested for coronavirus, and our model correctly classified them with an F1 score of 0.67. Regarding the degree of uncertainty in the estimates, fewer variations in Precision and Recall scores were observed for the internal and out-of-time (pre-COVID-19) test sets. On the contrary, the estimates on the out-of-time (post-COVID-19) and external test sets had larger confidence intervals.

Figure 3 .
Figure 3. The Precision and Recall scores of the predictions of the top 10 best-performing models with 95% confidence intervals. The fine-grained results are aggregated by micro-averaging across the 24 fine-grained labels.

Deep learning-based NLP approaches have shown their efficiency in many clinical NLP tasks and have thoroughly permeated the informatics community. The existing body of literature has mainly focused on using deep learning models to extract and interpret cancer-related clinical concepts 17,27,28 from free text, or other clinically meaningful entities from radiology reports or hospital notes 10,15. Only one study at the time of writing explored the use of an NLP system, Topaz, for the automated extraction and classification of influenza-related terms from text-based emergency reports 29-31. To our knowledge, our study is the first to explore using deep learning models for efficient processing and extraction of clinically meaningful knowledge pertaining to respiratory viruses from a laboratory repository. One strength of the NLP approach used in this study is its scalability to various text-based laboratory scenarios. As the size and complexity of laboratory data grow, so does the need for scalable and reusable tools for automated extraction of knowledge from vast amounts of clinical notes and quick generalization from one task to another. Manual processing of laboratory reports severely limits the utilization of the rich information embedded in data repositories and makes data cleaning and quality improvement prohibitively expensive. Deep learning-based NLP algorithms, on the other hand, are well poised to scale the information extraction process.
NLP enables automated extraction of information, and its use in the clinical domain is growing with increasing uptake. Recurrent neural networks (RNN) and RNN variants such as bidirectional Long Short-Term Memory (Bi-LSTM) have been successfully applied to clinical NLP tasks 10,13-16 and are now considered the baseline techniques for various information extraction tasks 11,12,17-20.

Table 1 :
Details of the respiratory viruses embedded in text-based laboratory reports derived from the Ontario Laboratories Information System (OLIS). Specimens may be tested for one or more of the following viruses: influenza, respiratory syncytial virus (RSV), adenoviruses, seasonal coronaviruses, enterovirus/rhinoviruses, parainfluenza viruses, human metapneumovirus (HMV), and bocavirus. The testing modalities employed include single and multiplex polymerase chain reaction, direct fluorescent antibody, viral culture, and enzyme immunoassay rapid antigen tests. Repeated testing may involve multiple laboratories and testing modalities.
At the time of writing, the OLIS data holding at ICES consists of >9,000 unique LOINCs and >5 billion laboratory observations across 150 laboratory test centers in Ontario.As such, the clinical laboratory data has considerable complexity and variability.
To create the corpus for this study, over a million observations corresponding to 99 unique Logical Observation Identifiers Names and Codes (LOINCs) were pulled from OLIS, and the text-based laboratory results were extracted from the observations. OLIS was created and is managed by Ontario Health, from whom ICES receives an ongoing data feed. At our institute, LOINCs are mainly used to filter OLIS observations into relevant groupings (e.g., respiratory viruses) and not for encoding and interpretation, since they are not always used appropriately by those entering the data into OLIS. Consequently, the SMEs identified a list of 99 LOINCs related to respiratory viruses, and all the laboratory reports in OLIS corresponding to these LOINC codes were retrieved. The workflow consists of the following tasks: 1. The data analyst and data scientist first scanned the text strings and, after performing basic text cleaning (e.g., removing punctuation and stop words, case normalization, lemmatization, and stemming) and removing duplicates, created a meaningful list of 87k unique laboratory reports.

The model's generalizability was evaluated internally and externally on various test sets, as in Table 2. The internal test set was a randomly sampled subset representing 10% of the laboratory reports from OLIS from 2007 to 2018, the period used for model training. The performance of the model was also evaluated on two out-of-time test sets, including samples from an entirely different time period: (1) a large pre-COVID-19 (2019) sample and (2) a small post-COVID-19 (2020) sample. A separate test set, denoted the external test set, included samples up to 2019 from two separate laboratories (testing sites not included in the development of the model) and was used to assess the external generalizability of the model. F1, Precision, and Recall scores were calculated for the model's predictions. The paired t-test was used to determine whether a statistically significant difference in the F1 scores between classes and test sets existed. In addition, 95% confidence intervals were calculated for the Precision and Recall scores to quantify the uncertainty of the model's estimates.
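Micro-averaging pools true positives, false positives, and false negatives over all labels before computing the scores, which is how the fine-grained results across the 24 labels are aggregated here. A minimal sketch (the function name and input layout are assumptions):

```python
def micro_f1(per_label_counts):
    """per_label_counts: list of (tp, fp, fn) tuples, one per label.
    Micro-averaging sums the counts over all labels first, then
    computes Precision, Recall, and F1 from the pooled totals."""
    tp = sum(c[0] for c in per_label_counts)
    fp = sum(c[1] for c in per_label_counts)
    fn = sum(c[2] for c in per_label_counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because pooled counts weight every prediction equally, micro-averaged scores are dominated by the frequent labels, which is one reason the rare "detected" classes can score much lower individually than the aggregate suggests.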

Table 2
Dataset statistics for laboratory descriptions of the development and test sets.

Table 3
The prediction results (F1-score) of the top 10 best-performing models on the in-time, out-of-time, and external test sets. The fine-grained results are aggregated by micro-averaging across the 24 fine-grained labels. The fine-grained results with the best-performing model are shown in Supplementary Table 1.
The same result was observed between 'Any Influenza virus' and 'Any RSV virus'. Comparatively larger confidence intervals are evident for the 'Coarse-grained Any RSV detected' estimates.