Applying Natural Language Processing to Textual Data From Clinical Data Warehouses: Systematic Review

Background
In recent years, health data collected during the clinical care process have often been repurposed for secondary use through clinical data warehouses (CDWs), which interconnect disparate data from different sources. A large amount of information of high clinical value is stored in unstructured text format. Natural language processing (NLP), which implements algorithms that can operate on massive unstructured textual data, has the potential to structure the data and make clinical information more accessible.

Objective
The aim of this review was to provide an overview of studies applying NLP to textual data from CDWs. It focuses on identifying the (1) NLP tasks applied to data from CDWs and (2) NLP methods used to tackle these tasks.

Methods
This review was performed according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. We searched for relevant articles in 3 bibliographic databases: PubMed, Google Scholar, and ACL Anthology. We reviewed the titles and abstracts and included articles according to the following inclusion criteria: (1) focus on NLP applied to textual data from CDWs, (2) articles published between 1995 and 2021, and (3) written in English.

Results
We identified 1353 articles, of which 194 (14.34%) met the inclusion criteria. Among all identified NLP tasks in the included papers, information extraction from clinical text (112/194, 57.7%) and the identification of patients (51/194, 26.3%) were the most frequent tasks. To address the various tasks, symbolic methods were the most common NLP methods (124/232, 53.4%), showing that some tasks can be partially achieved with classical NLP techniques, such as regular expressions or pattern matching that exploit specialized lexica, such as drug lists and terminologies. Machine learning (70/232, 30.2%) and deep learning (38/232, 16.4%) have been increasingly used in recent years, including the most recent approaches based on transformers. NLP methods were mostly applied to English language data (153/194, 78.9%).

Conclusions
CDWs are central to the secondary use of clinical texts for research purposes. Although the use of NLP on data from CDWs is growing, there remain challenges in this field, especially with regard to languages other than English. Clinical NLP is an effective strategy for accessing, extracting, and transforming data from CDWs. Information retrieved with NLP can assist in clinical research and have an impact on clinical practice.
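
The symbolic approach mentioned in the abstract (regular expressions combined with a specialized lexicon such as a drug list) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three-entry `DRUG_LEXICON`, the dose units, the function name `extract_drug_mentions`, and the sample note are hypothetical and are not taken from any of the reviewed studies.

```python
import re

# Hypothetical mini-lexicon (assumption); a real system would load a curated
# drug terminology rather than a hard-coded list.
DRUG_LEXICON = ["aspirin", "metformin", "atorvastatin"]

# One alternation over the lexicon, optionally followed by a dose such as "500 mg".
DRUG_PATTERN = re.compile(
    r"\b(?P<drug>" + "|".join(re.escape(d) for d in DRUG_LEXICON) + r")\b"
    r"(?:\s+(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg|g|mcg)\b)?",
    flags=re.IGNORECASE,
)


def extract_drug_mentions(text):
    """Return one dict per lexicon match, with dose and unit when present."""
    mentions = []
    for m in DRUG_PATTERN.finditer(text):
        mentions.append({
            "drug": m.group("drug").lower(),
            "dose": float(m.group("dose")) if m.group("dose") else None,
            "unit": m.group("unit").lower() if m.group("unit") else None,
            "span": m.span("drug"),  # character offsets of the drug name
        })
    return mentions


# Invented example sentence, not real clinical text.
note = "Patient started on Metformin 500 mg twice daily; continues aspirin."
for mention in extract_drug_mentions(note):
    print(mention)
```

A pattern of this kind is transparent and needs no training data, but it only finds what the lexicon and the dose pattern anticipate: misspellings, abbreviations, and negated mentions would need additional rules or a learned model.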


Information sources
6 Specify all databases, registers, websites, organisations, reference lists and other sources searched or consulted to identify studies. Specify the date when each source was last searched or consulted. Location: line 127-135

Search strategy
7 Present the full search strategies for all databases, registers and websites, including any filters and limits used. Location: line

Selection process
8 Specify the methods used to decide whether a study met the inclusion criteria of the review, including how many reviewers screened each record and each report retrieved, whether they worked independently, and if applicable, details of automation tools used in the process.

Data collection process
9 Specify the methods used to collect data from reports, including how many reviewers collected data from each report, whether they worked independently, any processes for obtaining or confirming data from study investigators, and if applicable, details of automation tools used in the process. Location: line 178-184

Data items
10a List and define all outcomes for which data were sought. Specify whether all results that were compatible with each outcome domain in each study were sought (e.g. for all measures, time points, analyses), and if not, the methods used to decide which results to collect. Location: -
10b List and define all other variables for which data were sought (e.g. participant and intervention characteristics, funding sources). Describe any assumptions made about any missing or unclear information. Location: -

Study risk of bias assessment
11 Specify the methods used to assess risk of bias in the included studies, including details of the tool(s) used, how many reviewers assessed each study and whether they worked independently, and if applicable, details of automation tools used in the process. Location: -

Effect measures
12 Specify for each outcome the effect measure(s) (e.g. risk ratio, mean difference) used in the synthesis or presentation of results. Location: -

Synthesis methods
13a Describe the processes used to decide which studies were eligible for each synthesis (e.g. tabulating the study intervention characteristics and comparing against the planned groups for each synthesis (item #5)). Location: -
13b Describe any methods required to prepare the data for presentation or synthesis, such as handling of missing summary statistics, or data conversions. Location: -
13c Describe any methods used to tabulate or visually display results of individual studies and syntheses. Location: -
13d Describe any methods used to synthesize results and provide a rationale for the choice(s). If meta-analysis was performed, describe the model(s), method(s) to identify the presence and extent of statistical heterogeneity, and software package(s) used. Location: -
13e Describe any methods used to explore possible causes of heterogeneity among study results (e.g. subgroup analysis, meta-regression). Location: -
13f Describe any sensitivity analyses conducted to assess robustness of the synthesized results. Location: -

Reporting bias assessment
14 Describe any methods used to assess risk of bias due to missing results in a synthesis (arising from reporting biases).

Certainty assessment
15 Describe any methods used to assess certainty (or confidence) in the body of evidence for an outcome. Location: -

RESULTS

Study selection
16a Describe the results of the search and selection process, from the number of records identified in the search to the number of studies included in the review, ideally using a flow diagram. Location: line 186-211
16b Cite studies that might appear to meet the inclusion criteria, but which were excluded, and explain why they were excluded. Location: -

Study characteristics
17 Cite each included study and present its characteristics. Location: line 213-461

Risk of bias in studies
18 Present assessments of risk of bias for each included study. Location: -

Results of individual studies
19 For all outcomes, present, for each study: (a) summary statistics for each group (where appropriate) and (b) an effect estimate and its precision (e.g. confidence/credible interval), ideally using structured tables or plots. Location: -

Results of syntheses
20a For each synthesis, briefly summarise the characteristics and risk of bias among contributing studies. Location: -
20b Present results of all statistical syntheses conducted. If meta-analysis was done, present for each the summary estimate and its precision (e.g. confidence/credible interval) and measures of statistical heterogeneity. If comparing groups, describe the direction of the effect. Location: -
20c Present results of all investigations of possible causes of heterogeneity among study results. Location: -
20d Present results of all sensitivity analyses conducted to assess the robustness of the synthesized results. Location: -

Reporting biases
21 Present assessments of risk of bias due to missing results (arising from reporting biases) for each synthesis assessed. Location: -

Certainty of evidence
22 Present assessments of certainty (or confidence) in the body of evidence for each outcome assessed. Location: -

DISCUSSION

Discussion
23a Provide a general interpretation of the results in the context of other evidence. Location: line 463-490
23b Discuss any limitations of the evidence included in the review.

Registration and protocol
24a Provide registration information for the review, including register name and registration number, or state that the review was not registered. Location: -
24b Indicate where the review protocol can be accessed, or state that a protocol was not prepared. Location: -
24c Describe and explain any amendments to information provided at registration or in the protocol. Location: -

Support
25 Describe sources of financial or non-financial support for the review, and the role of the funders or sponsors in the review. Location: -

Competing interests