Electronic Medical Record Search Engine (EMERSE): An Information Retrieval Tool for Supporting Cancer Research

PURPOSE The Electronic Medical Record Search Engine (EMERSE) is a software tool built to aid research spanning cohort discovery, population health, and data abstraction for clinical trials. EMERSE is now live at three academic medical centers, with additional sites currently working on implementation. In this report, we describe how EMERSE has been used to support cancer research based on a variety of metrics.
METHODS We identified peer-reviewed publications that used EMERSE through online searches as well as through direct e-mails to users based on audit logs. These logs were also used to summarize use at each of the three sites. Search terms for two of the sites were characterized using the natural language processing tool MetaMap to determine to which semantic types the terms could be mapped.
RESULTS We identified a total of 326 peer-reviewed publications that used EMERSE through August 2019, although this is likely an underestimation of the true total based on the use log analysis. Oncology-related research comprised nearly one third (n = 105; 32.2%) of all research output. The use logs showed that EMERSE had been used by multiple people at each site (nearly 3,500 across all three) who had collectively logged into the system > 100,000 times. Many user-entered search queries could not be mapped to a semantic type, but the most common semantic type for terms that did match was “disease or syndrome,” followed by “pharmacologic substance.”
CONCLUSION EMERSE has been shown to be a valuable tool for supporting cancer research. It has been successfully deployed at other sites, despite some implementation challenges unique to each deployment environment.


INTRODUCTION
The vast volume of clinical data captured within electronic health records (EHRs) has the potential to catalyze biomedical research. However, for all the benefits of EHRs, persistent challenges remain in leveraging EHR data for cancer research. This is because a substantial number (up to 80% by some estimates) 1 of the clinical details are captured in unstructured free-text notes and are therefore difficult to extract and convert to a computable form. 2 Ignoring the free text in EHRs can be problematic. 3 For example, symptomatic data are often recorded exclusively in the free text. 4 One study found that free text from EHRs was required for resolving nearly 60% of eligibility criteria for a chronic lymphocytic leukemia clinical trial and almost 80% of eligibility criteria for a prostate cancer trial. 5 Another such study listed 10 data elements derived from the free text related to bone marrow biopsy findings, including biopsy blast counts, biopsy cellularity, fibrosis grade, and aspirate cellularity. 6 A study about engraftment syndrome after allogeneic hematopoietic cell transplantation used concepts found in the free text, such as engraftment failure, stool output, lymphocyte recovery, cytokine storm, disorientation, capillary leak, effusions, fevers, and rashes. 7 Furthermore, the accuracy of the readily accessible structured data from EHRs may be low in some cases. 8 For example, one study found that up to 20% of patients at one medical center had a medication listed in their unstructured data that was not in the structured medication list. 9 Another study of cancer staging found that nearly 84% of patients had conflicting statements about staging in their records, necessitating an algorithm to infer the most likely staging for each patient. 10 To help the research community use the free text in EHRs, substantial resources have been devoted to developing natural language processing (NLP) tools.
NLP remains promising for oncology research, 11 but widespread use remains limited. The quality of NLP results has been mixed, with some acknowledging the complexity and "inherent difficulty of natural language processing in this domain." 6(p330-331) This complexity results from a variety of factors, including understanding temporal relationships, ambiguous abbreviations, and anaphoric references. Other challenges include issues of replicability across algorithms and institutions 12 and the need for large manually annotated data sets for new use cases, 11 especially because these systems perform best when tailored to a specific task or domain. 13 The lack of available experts to architect and deploy NLP systems is also a limiting factor.
Accepted on February 27, 2020 and published at ascopubs.org/journal/cci on May 15, 2020: DOI https://doi.org/10.1200/CCI.19.00134
To address the immediate needs of the cancer research community, members of which often lack the resources, time, and access to NLP experts, we developed a simpler approach using information retrieval for concept identification in free text. The Electronic Medical Record Search Engine (EMERSE) is a general-purpose term-searching system tailored to the needs of the medical research community to help researchers quickly find information buried in EHR free text. In general, information retrieval resembles search engines such as Google, which help people find information quickly; it does not attempt to code the data, a task that falls within the domain of NLP. General familiarity with tools such as Google is thus an advantage. EMERSE uses an index of terms coupled with the capacity for query expansion using locally customized or standardized terminologies.
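The idea of a term index combined with query expansion can be illustrated with a minimal sketch. It is not the EMERSE implementation: the synonym table, note texts, and function names are all hypothetical, and a production system would rely on a search engine rather than an in-memory dictionary.

```python
import re
from collections import defaultdict

# Hypothetical synonym table standing in for a locally customized terminology.
SYNONYMS = {
    "tamoxifen": ["nolvadex"],
    "rash": ["rashes", "erythema"],
}


def tokenize(text):
    """Lowercase a note and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(notes):
    """Map each token to the set of note IDs that contain it."""
    index = defaultdict(set)
    for note_id, text in notes.items():
        for token in tokenize(text):
            index[token].add(note_id)
    return index


def expand(term):
    """Return the term itself plus any synonyms listed for it."""
    term = term.lower()
    return [term] + SYNONYMS.get(term, [])


def search(index, term):
    """Return the sorted IDs of notes matching the term or any expansion."""
    hits = set()
    for variant in expand(term):
        hits |= index.get(variant, set())
    return sorted(hits)


# Invented example notes, for illustration only.
notes = {
    "note-1": "Patient started Nolvadex last month.",
    "note-2": "Diffuse erythema noted on the trunk.",
    "note-3": "No fevers or rashes reported.",
}
index = build_index(notes)
print(search(index, "tamoxifen"))
print(search(index, "rash"))
```

In this sketch the search returns the matching notes for a person to read; no attempt is made to interpret negation or context (for example, "No fevers or rashes"), which is the part of the task left to the human reviewer.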
Rather than an example of an artificial intelligence system, EMERSE is more like an augmented intelligence system, wherein the software helps a person perform his or her work more efficiently but does not completely remove that person from the workflow. With EMERSE, the person is needed for the complex task of making sense of nuanced prose, a task that remains formidable for machines. 14 EMERSE has been in use at the University of Michigan for 15 years and has supported a wide variety of clinical research, including oncology research. EMERSE is being implemented at other academic medical centers. Our report covers details about the system, including metrics based on use logs and publications, an analysis of search terms entered, and ongoing development work supported by the National Cancer Institute Informatics Technology for Cancer Research program.

System Description
EMERSE is a Web-based application that provides an easy-to-use interface for either (1) identifying a cohort among all patients in the EHR or (2) identifying concepts within the unstructured clinical notes of an existing defined patient cohort. EMERSE indexes free-text data from EHR notes, with additional metadata related to the notes (eg, date, clinical service, note type). The software is based on Apache Solr (an open-source search engine), but a substantial user interface has been built to provide study management features, visualization of results, and a query expansion feature.
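To make the role of note metadata concrete, the following sketch builds query-string parameters for Solr's standard select handler, using its `q`, `fq`, and `rows` parameters. The field names (`note_text`, `note_type`, `note_date`) are assumptions made for illustration and are not taken from the EMERSE schema.

```python
from urllib.parse import parse_qs, urlencode


def build_note_query(term, note_type=None, start=None, end=None, rows=10):
    """Build query-string parameters for a Solr select request.

    The field names used here are hypothetical placeholders.
    """
    # Main query: a phrase search against the (assumed) free-text field.
    params = [("q", f'note_text:"{term}"'), ("rows", str(rows))]
    # Filter queries narrow the results by metadata.
    if note_type:
        params.append(("fq", f'note_type:"{note_type}"'))
    if start and end:
        params.append(("fq", f"note_date:[{start} TO {end}]"))
    return urlencode(params)


query_string = build_note_query(
    "engraftment syndrome",
    note_type="Progress Note",
    start="2019-01-01T00:00:00Z",
    end="2019-08-31T23:59:59Z",
)
# Decode the string again to show the individual parameters.
print(parse_qs(query_string))
```

Keeping the metadata restrictions in filter queries rather than in the main query is a common Solr pattern, because filters do not influence relevance scoring.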
Technical details about EMERSE can be found in a prior publication. 15 EMERSE maintains detailed audit logs for all user sessions. Figure 1 contains several screens from EMERSE showing various general functions of the system. A recently added feature visualizes trends over time based on the search terms of interest (Fig 2). Although EMERSE is intended to be a self-service tool, system support is expected to be managed centrally by groups such as operational informatics teams. EMERSE is available at no cost, including source code, but sites are required to contact the University of Michigan to obtain the software. Additional details about EMERSE, including documentation and explainer videos, can be found on the EMERSE project Web site. 16 EMERSE is currently in use at three academic medical centers: University of Michigan, University of North Carolina at Chapel Hill, and University of Cincinnati (Table 1). Other sites are currently at various stages in their implementation, including Case Western Reserve University (CWRU)/University Hospitals of Cleveland, Columbia University, University of Kentucky, University of Utah, and University of California San Francisco. CWRU has implemented a version of EMERSE using data extracted from the MIMIC-III project 17 and plans to use EMERSE in a pilot program for training medical students about research software and as part of its health informatics training program.

CONTEXT
Key Objective
To demonstrate the utility of an information retrieval system, the Electronic Medical Record Search Engine (EMERSE), in the context of supporting cancer research.
Knowledge Generated
An analysis of audit logs and peer-reviewed publications demonstrated that EMERSE is being used to support cancer research for a broad array of research projects and tasks, ranging from cohort identification to data abstraction for elements that may not be found in a structured form. Users are searching for a wide variety of concepts, including "pharmacologic substance," "neoplastic process," and "sign or symptom."
Relevance
Information retrieval systems such as EMERSE have the potential to be powerful and easy-to-use software tools for supporting cancer research. EMERSE is available at no cost and has been successfully implemented at multiple medical centers, so it is a viable option for sites seeking to provide additional software tools for supporting cancer research.

Publication Data
Peer-reviewed publications using EMERSE were identified via manual searches for "EMERSE" or "electronic medical record search engine" in both PubMed and Google Scholar. Searches were conducted between August and September 2019. Each article identified was reviewed to confirm EMERSE use. To identify additional peer-reviewed publications without mention or citation of EMERSE, all principal investigators at the University of Michigan who had used EMERSE for research within the prior 5-year period (n = 600) were sent an e-mail in July/August 2019 to inquire about the use of EMERSE for their work and what publications arose from that use. The e-mail contained personalized audit logs to remind them about the use. A follow-up e-mail to nonresponders was sent in early September 2019. For all articles identified, the titles and abstracts were read to determine if they were cancer related.
To characterize how EMERSE was used to support various research initiatives, 47 recent cancer-related peer-reviewed publications published within the last 2 years were reviewed. Among these, 11 were summarized with respect to their descriptions of how EMERSE was used. These 11 articles were selected to showcase a diversity of use cases, were from a variety of research teams from different disciplines, and had enough details described in their methods sections to understand the contribution of EMERSE.

Audit Log Analysis
Use logs were extracted to characterize the total number of users and the number of EMERSE logins over the past 5 years (September 2014 through August 2019; shorter timeframes for the two sites that recently adopted the system). The search terms (ie, search queries) entered within this timeframe were also extracted. The NLP tool MetaMap 18 was used to process the search terms from two of the sites (University of Michigan and University of Cincinnati; University of North Carolina did not provide its terms). For this analysis, the "-a -N" flags were used. The "-a" flag enables the use of variants of acronyms and abbreviations, and the "-N" flag modifies how the output is displayed. Prior studies have shown that MetaMap can perform comparably to other NLP tools, such as cTAKES. 19 MetaMap processed each search term to determine whether the query could be mapped to a concept unique identifier (CUI) within the Unified Medical Language System (UMLS) 20 and, if so, to which semantic type the concept belonged. Because MetaMap outputs a list of potential CUI candidates, only the top-scoring candidate was selected. For ties among top-scoring candidates, only the first was selected. The results across the two sites were merged, and the relative frequencies of the 20 most common UMLS semantic types were visualized using RAWGraphs. 21
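The candidate-selection and counting steps described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used in the study: it assumes MetaMap output has already been parsed into per-term candidate lists, the `Candidate` structure and CUI values are placeholders, and frequencies are computed relative to the terms that mapped.

```python
from collections import Counter
from typing import NamedTuple


class Candidate(NamedTuple):
    """One parsed candidate mapping for a search term (hypothetical structure)."""
    cui: str
    score: float
    semantic_type: str


def top_candidate(candidates):
    """Return the highest-scoring candidate, or None if the term did not map."""
    if not candidates:
        return None
    # max() returns the first maximal item, so ties go to the earliest candidate.
    return max(candidates, key=lambda c: c.score)


def semantic_type_frequencies(mapped_terms, top_n=20):
    """Relative frequency of the most common semantic types among mapped terms."""
    counts = Counter()
    for candidates in mapped_terms.values():
        best = top_candidate(candidates)
        if best is not None:
            counts[best.semantic_type] += 1
    total = sum(counts.values())
    return [(sem_type, n / total) for sem_type, n in counts.most_common(top_n)]


# Invented example data; the CUIs are placeholders, not real UMLS identifiers.
example = {
    "tamoxifen": [
        Candidate("C0000001", 1000.0, "Pharmacologic Substance"),
        Candidate("C0000002", 1000.0, "Organic Chemical"),
    ],
    "leukemia": [
        Candidate("C0000003", 861.0, "Neoplastic Process"),
        Candidate("C0000004", 900.0, "Disease or Syndrome"),
    ],
    "aspirin": [Candidate("C0000005", 1000.0, "Pharmacologic Substance")],
    "fludaribine": [],  # misspelling with no candidate mapping
}
for sem_type, share in semantic_type_frequencies(example):
    print(f"{sem_type}: {share:.1%}")
```

Because only the first of several tied candidates is kept, the reported semantic type for an ambiguous term depends on the order in which candidates are listed.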

RESULTS
A total of 222 peer-reviewed publications were identified through manual searches using PubMed and Google Scholar through September 19, 2019. For the e-mail survey that was conducted to gain additional data about publications, 337 (56.2%) of the 600 principal investigators responded, revealing an additional 105 peer-reviewed publications that did not cite or mention EMERSE, bringing the total number of publications to 326. Of the 326 publications, 105 (32.2%) were oncology related. An additional 285 studies were still in progress, with potential publications coming at a later date. The current list of known peer-reviewed publications can be found on the EMERSE project Web site. 16 Summaries of how EMERSE was used for 11 selected oncology-related  articles are provided in Table 2. The use of EMERSE varied from cohort identification to various types of data abstraction.
The audit logs revealed substantial use of EMERSE for cancer-related work that did not acknowledge EMERSE use within publications. This included multisite clinical trials where EMERSE was used at a single site (University of Michigan). These publications could be identified via unique data, such as National Clinical Trial numbers, which were sometimes mentioned in the publications. Examples include one study that used EMERSE for 31 sessions, with a total session time of 13 hours (ClinicalTrials.gov identifier: NCT01865747), 22 another that used EMERSE for 58 sessions and 26 hours (ClinicalTrials.gov identifier: NCT01576172), 23 and a third that used EMERSE for 398 user sessions and 166 hours (ClinicalTrials.gov identifier: NCT01633372). 24 Other oncology-related research initiatives have used EMERSE, even though it is not possible to link the use back to specific studies. For example, the Michigan Medicine Oncology Clinical Trials Support Unit has an umbrella institutional review board application for which it accesses EMERSE but does not link use to a specific study. That unit logged into EMERSE 917 times for 388 hours of use on the system between December 2014 and July 2019. Additionally, the Bone Marrow Transplant research group uses EMERSE for tracking long-term outcomes and used EMERSE for 2,452 sessions and 1,106 hours between July 2014 and July 2019. The high number of logins per study is common for research that involves frequent patient monitoring or identification of adverse events. Additional use statistics are listed in Table 1.
Details about the analysis of search terms using MetaMap are listed in Table 3. A large number of terms (University of Michigan, 34.1%; University of Cincinnati, 55.9%) did not map to any CUI using MetaMap. Many of these nonmapping terms were misspellings (eg, "fludaribine," "ifosphomide," "pegasparaganase," "tamoxafen"). However, of the terms that did not map from the University of Michigan data set, 2,342 (9.0%) were numbers in various forms representing medical record numbers, dates, international classification of disease (ICD) codes, and even pathology slide identifiers. In the University of Cincinnati data set 1,975 (68.6%) of the terms that did not map were numbers. The relative frequency of the 20 most common semantic types for the search terms is shown in Figure 3. "Disease or syndrome" was the most frequent semantic type (11.5%), followed by "pharmacologic substance" (10.0%).
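One simple way to estimate how many nonmapping terms are purely numeric is a pattern match, sketched below. The pattern is an assumption chosen for illustration and is not the rule used in the study; it recognizes digits with common separators and would miss alphanumeric identifiers such as ICD-10 codes.

```python
import re

# Digits optionally mixed with whitespace, periods, slashes, or hyphens.
NUMERIC_TERM = re.compile(r"^[\d\s./\-]+$")


def is_numeric_term(term):
    """Return True when a term contains only digits and common separators."""
    term = term.strip()
    has_digit = any(ch.isdigit() for ch in term)
    return has_digit and NUMERIC_TERM.match(term) is not None


def numeric_share(terms):
    """Return (count, fraction) of terms classified as numeric."""
    if not terms:
        return 0, 0.0
    numeric = [t for t in terms if is_numeric_term(t)]
    return len(numeric), len(numeric) / len(terms)


# Invented examples of terms that did not map to a CUI.
unmapped = ["fludaribine", "12345678", "01/15/2019", "tamoxafen", "250.00"]
count, share = numeric_share(unmapped)
print(f"{count} of {len(unmapped)} unmapped terms are numeric ({share:.1%})")
```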

DISCUSSION
As shown by the audit logs, and as evidenced by numerous peer-reviewed publications (> 100 oncology related), EMERSE has proven to be a useful tool for supporting cancer research. Furthermore, EMERSE has been successfully deployed at three academic medical centers to date, including the University of North Carolina, with additional centers in process, leading to multiple peer-reviewed publications. 25 Through several rounds of implementation work with other sites (several are still under way), we have learned a great deal about the complexities of enterprise-wide software implementation. We describe a few of the most important insights, provided as guidance for others who might be interested in implementing EMERSE or other centrally managed research tools.
Environments at each site are highly variable, including servers, storage, access to EHR documents, formats of these documents, and regulatory requirements. Although there is no cost per se for the software, the resources needed for implementation are not free. Competing priorities, institutional review board requirements, small teams, security reviews, and the need to obtain buy-in from leadership can delay implementation for months. There is no single solution to overcoming these challenges, but we have made efforts to reduce the burden on implementing sites, including providing installation and setup documentation, training materials for end users, and a messaging forum for technical teams.
Because EMERSE is meant to be user facing, preserving the original document formatting helps users understand the data in the notes. Modern EHRs, such as Epic, allow for documentation using rich text formatting, in which notes can be made with tables, line breaks, and other formatting (eg, bold-face text). However, the Epic analytics database, Clarity, almost universally stores a version of the notes stripped of all formatting. The University of North Carolina at Chapel Hill has avoided using Clarity and is using the live production database, Chronicles, instead.
[...] was performed on a unique list of terms in the search logs, and there may be far fewer signs and symptoms than there are disease or drug names.
Additional work under way involves securely networking sites for obfuscated counts. This feature will be similar to other cohort discovery networks currently based on structured data, such as i2b2 ACT 28 and PCORnet, 29 but the novelty with the EMERSE-based network is the focus on free-text notes. This should be useful for finding rare cancer cases where structured data are not specific enough. For example, there is no specific code in the ICD (version 10) for endometrial stromal sarcoma, because the parent code C54.1 represents multiple types of endometrial neoplasms.
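The text does not describe how counts will be obfuscated. As one hedged illustration of the general idea, the sketch below masks small counts and rounds larger ones; the threshold, bucket size, and function name are assumptions, and cohort discovery networks may instead use other schemes, such as adding random noise.

```python
def obfuscate_count(count, threshold=10, bucket=5):
    """Return a display string that hides the exact patient count.

    Counts below the threshold are masked; larger counts are rounded to the
    nearest multiple of the bucket size. Both defaults are illustrative.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    if count < threshold:
        return f"<{threshold}"
    # Integer arithmetic avoids surprises from floating-point rounding.
    rounded = ((count + bucket // 2) // bucket) * bucket
    return str(rounded)


# Hypothetical per-site counts for a rare diagnosis found in free text.
site_counts = {"site_a": 3, "site_b": 12, "site_c": 248}
for site, n in site_counts.items():
    print(site, obfuscate_count(n))
```

Masking small counts matters most for the rare-cancer use case described above, because exact small numbers are the ones most likely to identify individual patients.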
It is important to point out that EMERSE is not meant to be a replacement for NLP systems, and NLP will be a preferable option in certain use cases. For relatively small numbers of patients (eg, thousands) and where accuracy is important enough to warrant human review, EMERSE may be the tool of choice. In other situations, such as automatically coding data across hundreds of thousands or millions of patients, NLP may be a preferable option. There is no one-size-fits-all solution, and multiple tools can benefit the research enterprise.
In conclusion, EMERSE can be a valuable tool to support cancer research as well as other clinical domains. This is a simple-to-operate, self-service tool that is powerful, scalable, and generalizable across use cases, allowing for teams from various fields to increase their productivity and gain access to accurate patient data that normally would have required a manual approach for identification. In addition, it has many data security features. Successful implementation at other locations has demonstrated that EMERSE can be deployed and used outside its original site. Groups interested in adopting EMERSE can contact the EMERSE team at the University of Michigan for a working virtual machine for testing, demonstrations, advice, and other details.