Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges

Text mining in the biomedical sciences is rapidly transitioning from small-scale evaluation to large-scale application. In this article, we argue that text-mining technologies have become essential tools in real-world biomedical research. We describe four large scale applications of text mining, as showcased during a recent panel discussion at the BioCreative V Challenge Workshop. We draw on these applications as case studies to characterize common requirements for successfully applying text-mining techniques to practical biocuration needs. We note that system ‘accuracy’ remains a challenge and identify several additional common difficulties and potential research directions including (i) the ‘scalability’ issue due to the increasing need of mining information from millions of full-text articles, (ii) the ‘interoperability’ issue of integrating various text-mining systems into existing curation workflows and (iii) the ‘reusability’ issue on the difficulty of applying trained systems to text genres that are not seen previously during development. We then describe related efforts within the text-mining community, with a special focus on the BioCreative series of challenge workshops. We believe that focusing on the near-term challenges identified in this work will amplify the opportunities afforded by the continued adoption of text-mining tools. Finally, in order to sustain the curation ecosystem and have text-mining systems adopted for practical benefits, we call for increased collaboration between text-mining researchers and various stakeholders, including researchers, publishers and biocurators.


Introduction
The unprecedented advances in high-throughput technology and tools to support bioscience have led to a boom in biological and biomedical science research and an accompanying growth of the scientific literature. Access to the wealth of knowledge embedded in the literature is critical for enabling continued scientific advancements and breakthroughs. For this reason, several efforts over the last decade have focused on improving knowledge reusability through improved storage, representation and curation. These efforts include both public literature resources (e.g. PubMed and PubMed Central/Europe PMC) and biological knowledge bases (e.g. UniProt (1) and NCBI Database Resources (2)). Figure 1 illustrates the interconnection between literature services and biological databases, and their importance in biological research. As can be seen, researchers rely on literature services to keep up with the state of the art on topics of their interest, to generate novel hypotheses, and as a reference for developing research strategies. In addition, today's curated databases are critical to biomedical research as a firsthand tool for researchers to investigate their hypotheses or research results (3).
Biological knowledge bases rely heavily on expert curation, however, and scaling to accommodate the growth of the scientific literature has been a continued challenge. Automatically annotating biological entities such as genes/proteins and diseases (4,5) and other scientific artifacts in biomedical literature, such as investigation techniques or the dataset used (6), is useful for improving the scalability of biocuration services. Surveys regarding the role of text mining for assisting literature curation were performed during the International Biocuration Conference and Workshop (Berlin, 2009) and the BioCreative 2012 Workshop (Washington, DC) (7,8). The 2012 report indicates that more databases had adopted text mining into their curation workflows in some form than in 2009. A number of studies have indicated improved curation productivity with the assistance of text mining. In Table 1, we present a subset of studies benchmarking the quantitative significance of text-mining systems in database curation (9)(10)(11)(12)(13). We also refer the reader to the Interactive Annotation Task (IAT) at BioCreatives III-V (14)(15)(16)(17), which investigated some aspects of usability and productivity of the text-mining systems for biocuration.
Given the earlier successes and increasing cost/limited resources in manual curation, we argue that computational approaches such as text mining are essential in the future to provide researchers and medical professionals efficient, comprehensive and up-to-date literature services to manage this growth according to customizable criteria such as clinical relevancy or specific genes or species. Since most of the discoveries and breakthroughs are first made available to the public through scholarly publications, the emphasis of this position article is with regard to text-mining applications in literature search and curation. Specifically, the four real-world applications discussed are (i) Literature search (Europe PMC), (ii) Data search (SourceData), (iii) BEL database curation and (iv) VIROME database curation.
In this article, we first discuss the two applications related to literature services, Europe PMC and SourceData, explaining both their value to the bioscience community and how text mining is essential for their continued progress. We next discuss two recent efforts supporting biological databases, BEL and VIROME, which curate information related to biological cause-effect relationships and microbiomes, respectively. Finally, we summarize the opportunities for text mining in such applications and the multiple challenges that hamper its immediate adoption in these applications. We also provide our understanding of a few strategies to facilitate an increased adoption of text mining in such applications.

Real world large scale applications
Europe PMC (Johanna McEntyre, EMBL-EBI)
Europe PMC (https://europepmc.org/) is a database of abstracts and full text articles (5). Partnering with PMC from the National Library of Medicine USA and PMC Canada as a PMC International node, it contains 30 million abstracts (including PubMed) and over 3.6 million full text articles from the life sciences. In addition to serving general life science researchers who use Europe PMC to search the literature and access full text articles, Europe PMC also seeks to serve the specialized subset of users who are database curators. Database curators are professional literature readers, filterers, evaluators and extractors who work with the purpose of adding scientific value and context to public data resources.
Manual literature curation has resulted in many bioinformatics resources of excellent quality. It is clear, however, that some supportive computational approaches will be required in order for curation to scale to the accelerating pace of the biomedical literature while maintaining scientific quality. Since curators often require a wide variety of highly specific information, providing text-mining tools to fill each need may be a complex and never-ending task.
However, text-mined outputs useful for curators are also likely to be useful for others in the broader scientific community; integrating text mining into Europe PMC therefore also opens the possibility for occasional users to contribute to community curation efforts and provide feedback on text-mining results.
Europe PMC is committed to enabling text mining. Currently, it provides features such as 'Highlight terms' to identify core biological entities such as genes/proteins and organisms within the article's abstract view; the entities are also linked to relevant databases. A similar feature is provided for full-text article views. Europe PMC is also being developed as a platform for third-party text-mining algorithms, allowing the output of these algorithms to be displayed in full text articles shown on the Europe PMC website.
In the future, outputs from the text-mining community could further semantically enrich Europe PMC content by including the annotation of additional entities, such as mutations, and/or relationships between various entities, such as genes/proteins and diseases. Search and browse features built on top of these annotations (e.g. references to other articles studying the same relationships or, perhaps, contradictory relationships) will help readers to better judge the article in light of related publications.

SourceData (Thomas Lemberger, EMBO and Ioannis Xenarios, SIB)
Hypothesis-driven research in molecular and cell biology primarily generates data from small-scale experiments. In scientific publications, such data are visually depicted in figures or tables. However, the original data behind the figures (the 'source data') are almost never available in a structured format that would make them findable and reusable.
SourceData (http://sourcedata.embo.org/) (18) is building tools to allow researchers and publishers to generate machine-readable descriptions of data during the publication process and also to make this data searchable. To facilitate generating structured experimental descriptions, SourceData has developed an online tool for computer-assisted manual curation of figures and figure legends by data editors. The intention is to integrate a curation step into the publishing workflow to annotate figures before article publication. Authors then verify and approve the curated information through a validation interface. The result is a machine-readable representation of the data (descriptive metadata) based on the information routinely provided by authors in the text of figure legends, thus respecting the traditional workflow adopted by scientists.
SourceData has also developed a search interface that allows users to search for specific experimental evidence and the articles where these data have been reported. This search function is incorporated into the 'SourceData SmartFigure' viewer, which can easily be embedded in online publications. The SmartFigure application allows a specific figure panel to be linked with figures presenting similar data published elsewhere and therefore makes it possible for users to traverse the web of connected data by following these links across articles. Finally, programmatic access to the SourceData database is provided to the research community through a public API.
Integration of text mining with manual curation in the context of publishing seems to be a promising direction, as it will improve the efficiency and speed of the metadata extraction process and it will allow supervision of the automated results by both data editors and authors. In this context, text-mining methods will be useful for the automated semantic enrichment of figure legends or of the corresponding referring statements in the full text and also for identifying entity relationships that represent tested experimental hypotheses. Text mining is also envisioned to play a complementary role by linking curated figures with interpretative statements made in the article or with reagents listed in the 'Materials and Methods' section. Finally, text-mining techniques developed for computer science publications (19,20) might be useful to automatically prioritize a pool of candidate publications for further extraction of detailed experimental data and metadata.

Table 1. Studies benchmarking text-mining systems in database curation (entries for refs 11-13)
Ref.; database; curation target; text-mining tool; reported benefit
(11); TAIR; genes; PubTator; 45% increase in productivity
(12); PIR; PPI involving protein phosphorylation; eFIP; 2.5-fold increase in curation efficiency
(13); FlyBase; genes; Tagtog; 2-fold decrease in curation time
OpenBEL: computable knowledge bases of cause-effect relationships (Natalie Catlett, Selventa)
Biological Expression Language (BEL) is a knowledge representation developed by Selventa to capture biological cause-and-effect relationships from the scientific literature in a format suitable for computation. BEL and its associated software platform are an open source project (www.openbel.org). BEL knowledge bases have been used to support inference from molecular profiling data (21)(22)(23) and to construct network models representing specific biological processes (24). These approaches support precision medicine by illuminating the molecular mechanisms of disease and drug mechanisms of action, and by supporting patient stratification. BEL is designed to represent experimental observations in molecular biology, providing specific representations of various biological measurements including RNAs, proteins, post-translationally modified proteins, and protein activities, as well as biological processes and pathologies. This granular representation facilitates mapping of biological measurements to BEL networks to drive interpretation of molecular profiling data. BEL also represents the context for these experimental observations, such as the cell line or tissue used for the experiment, as well as a literature citation, allowing the creation of BEL networks that accurately represent the experiment and its context.
Over the last decade, Selventa has built a knowledge base comprised of >500 000 BEL statements primarily through manual curation. Many of these statements resulted from targeted curation efforts to support projects in various disease areas. This approach requires a significant effort from trained scientists to build a comprehensive knowledge base and keep it current.
Text mining promises to greatly improve the efficiency of building BEL knowledge bases. Accurate entity identification from the literature is critical to generating BEL knowledge bases useful for inference or building models. Another computational aspect important for automation is relation identification. Recently, Fluck and colleagues developed BELIEF, a text-mining workflow to improve the efficiency of BEL curation (25). BELIEF includes a UIMA-based text-mining workflow (with several state-of-the-art natural language processing, named entity recognition (NER) and relationship extraction tools) to facilitate a semi-automatic curation pipeline. Use of BELIEF was shown to significantly reduce human curation effort.

VIROME and building a knowledge base for microbiomes (Shawn Polson, University of Delaware)
Microbial communities and viral assemblages have been found to be both numerous and important drivers of biological processes globally. Recent research has linked microbiomes, microbes co-existing with a host, to many normal and pathological processes such as co-metabolism of food sources, exclusion of pathogens, fostering of host immune response, obesity, susceptibility to cancer and even mental disorders (26)(27)(28)(29)(30)(31). Research aimed at unraveling the complex community-scale dynamics and functions of microbial communities, and the even more numerous viruses which play important roles in regulating and driving genetic diversity among them, is of paramount importance. Our ability to examine these systems was once limited by factors including the inability to cultivate the vast majority in a laboratory setting, but the advent of increasingly cost-effective platforms for deep sequencing of marker genes (e.g. 16S rRNA) and metagenomes in the past several years has finally opened the door for widespread research in this field.
These methods involve generation of raw sequence data elucidating the taxonomic or functional composition of the community at a specific geographic location, time, and environmental condition. Typically a study will include multiple samples varying across some spatial, temporal, or environmental variable, allowing for testing of one or more specific hypotheses. The global nature of such data, however, means that its utility could extend far beyond the specific hypotheses it was collected to address. The results of such studies are typically published in peer-reviewed journals with deposition of only the raw data to public repositories such as the NCBI Sequence Read Archive. Other fields have seen the utility of publishing the analysed results of sequence-based studies (e.g. GEO for gene expression data). Some online tools such as VIROME (32) and MG-RAST (33) do provide a route for the analysis results themselves to be made public. Recent work by VIROME (http://virome.dbi.udel.edu/) and others aims to ensure that such results are accompanied by standardized metadata to make them useful when considered in alternate contexts, but the ability to look for trends across projects remains very limited. Leveraging microbial ecology results garnered from disparate projects could prove transformative for the field. Agreements are lacking, however, to populate centralized repositories with analysed data in a manner that would enable the creation of comprehensive microbial ecology resources, similar to what UniProt (1) and the Protein Information Resource (34), among others, provide for proteins. Development of such resources would enable large-scale observations and hypothesis testing, such as assessing the range of conditions under which a given microbe (or microbial protein) has been observed, thus providing key insights into its role, or assessing synergistic relationships by determining the consistency of co-occurrence for two or more microbes.
Text mining should play a key role in future microbiome studies by providing standalone tools to search for specific microbial relationships in the literature and by populating databases designed to provide comprehensive views of such global data. Microbes almost always live in mixed communities, and thus cooperation and competition are key features; however, detecting such microbial dependencies is difficult and time-consuming. Similarly, defining the environmental parameters under which certain microbes or guilds of microbes exist can be very informative in understanding their roles. Single studies are rarely comprehensive enough to elucidate such trends, however. Text-mining tools may enable a comprehensive understanding of microbiomes by focusing on the NER of specific microbial entities, the extraction of biological conclusions (e.g. organism x can do y, but only in the presence of z), metadata extraction (description of time, place, and conditions of samples at collection), and methodological details of the original sample. The ENVIRONMENTS and EXTRACT tools presented at BioCreative V (35,36) are examples of such tools, with emerging capability to extract environmental context and microbial taxonomy from published articles and map them to ontological standards such as Environmental Ontology (37).

Text-mining needs in large scale applications
The text-mining needs in the aforementioned applications can be grouped into three primary tasks: NER, relation extraction (RE) and information visualization. NER involves automatically labeling bio-entities such as dataset names (SourceData), diseases, genes, proteins (BEL, Europe PMC) or microbial proteins (VIROME). Since NER is foundational to most text-mining applications, the availability of accurate application-specific NER tools is critical (38,39). RE introduces the next higher level of knowledge discovery by automatically extracting relationships between the entities identified by NER. Such relationships may describe cause-effect relations (BEL) or microbe-environment relations (VIROME), and relationships may also involve metadata (such as spatio-temporal variables) to curate complex higher order relations. The final task is visualization of the text-mined results. Some applications require visuals, such as summaries (or visual tags), links to other online databases (Europe PMC) and metadata highlights within text (SourceData), to enhance knowledge representation. Text mining can help in selecting the most relevant outputs from large-scale text-mined results, as not all text-mined outputs need be displayed even if they are correctly extracted. Although text-mining roles may be classified broadly into three tasks, the specific entities, relations and representation required for each application may be highly specific.
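To make the first two tasks concrete, the following is a minimal sketch of dictionary-based NER followed by sentence-level co-occurrence RE. It is not the method of any system discussed here: the lexicons `GENES` and `DISEASES`, the function names and the example text are all hypothetical, and production systems typically rely on curated terminologies and trained models rather than exact string matching.

```python
import re
from itertools import product

# Hypothetical toy lexicons; a real system would draw on curated resources.
GENES = {"BRCA1", "TP53"}
DISEASES = {"breast cancer", "ovarian cancer"}


def find_entities(sentence, lexicon, label):
    """Return sorted (start, end, text, label) tuples for lexicon matches."""
    hits = []
    for term in lexicon:
        # Whole-word, case-insensitive match of the lexicon term.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), match.group(0), label))
    return sorted(hits)


def extract_relations(text):
    """Pair every gene with every disease mentioned in the same sentence."""
    relations = []
    # Naive sentence splitting on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        genes = find_entities(sentence, GENES, "Gene")
        diseases = find_entities(sentence, DISEASES, "Disease")
        for gene, disease in product(genes, diseases):
            relations.append((gene[2], disease[2]))
    return relations


text = "Mutations in BRCA1 are linked to breast cancer. TP53 is a tumour suppressor."
print(extract_relations(text))
```

Co-occurrence is only a baseline for RE: it proposes candidate pairs but says nothing about the type or direction of the relation, which is what applications such as BEL curation ultimately need.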

Challenges and opportunities in text mining
The applications above suggest several areas that remain challenging, namely 'accuracy', 'scalability', 'interoperability' and 'reusability'. These areas represent future opportunities for text mining to address the real-world needs of large-scale applications.

Accuracy
Although text-mining systems are rapidly transitioning to real world use, imperfect accuracy remains a limiting factor. Workflows incorporating text-mining systems must design processes that compensate for imperfect output. Although the importance of these considerations tapers as the output quality approaches that of human annotators, there are several limitations with the evaluations typically performed in the text-mining community. First, the evaluation most commonly performed is intrinsic, that is, it compares the output of the system to gold standard annotations performed by human annotators. Although such an evaluation provides several desirable properties, such as being quantifiable and providing a high degree of objectivity, it does miss some important considerations. Notably, it provides no feedback on whether the quality of the output is sufficient to support processes downstream in the workflow. Thus, while intrinsic evaluation of the system is important, the system must also be evaluated extrinsically, i.e. in place in the workflow.
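Intrinsic evaluation of this kind is commonly reported as precision, recall and F1 over the set of annotations. The sketch below assumes exact-match scoring, where an annotation is a (document, start, end, type) tuple; the tuple layout, the function name and the example annotations are illustrative assumptions and do not correspond to any particular challenge's scoring script.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scores for predicted annotations against a gold standard."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (document id, start offset, end offset, entity type).
gold = {
    ("doc1", 13, 18, "Gene"),
    ("doc1", 33, 46, "Disease"),
    ("doc2", 0, 4, "Gene"),
}
predicted = {
    ("doc1", 13, 18, "Gene"),   # correct
    ("doc2", 0, 4, "Gene"),     # correct
    ("doc2", 10, 14, "Gene"),   # spurious
}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Such scores quantify agreement with the gold standard but, as noted above, do not show whether that level of agreement is good enough for the downstream curation step, which is what an extrinsic evaluation measures.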

Interoperability
Because system accuracy is critical and must be evaluated extrinsically in the workflow, each system evaluated must be fully integrated into the workflow. Thus, the difficulty in integrating the system must be kept to a minimum. Unfortunately, many factors reduce system interoperability, such as operating system dependencies and incompatibilities between input and output formats (40). Interoperability could be addressed in several different ways. For instance, UIMA (41) is a software architecture created by IBM in 2003 to provide uniform data formatting standards for different teams working on NLP projects. Although it uses a Common Analysis Structure (CAS), the ability to use different semantic tag sets creates an interoperability solution (42). Tools written in a system-independent language such as Java or Python do not require a specific operating system. Format incompatibilities can be addressed by creating a standard data format. The recent BioC project is such an example, which has created an interoperable data format that is both straightforward and sufficiently expressive to represent a wide variety of text-mining tasks (43,44). Another solution may be web services, which hide all configuration and deployment details from the user by providing an API that can be accessed over the Internet, requiring no system installation or maintenance (45,46). Despite these attempts, integrating text mining into mature database workflows remains difficult due to the complexities of curation workflows and existing infrastructure.
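The idea of a shared interchange format can be illustrated with a small sketch in which documents, passages and annotations with character offsets are serialized to JSON, so that one tool can write results another tool can read. The class and field names below are hypothetical and are deliberately simplified; this is not the BioC schema, only an illustration of the general approach of stand-off annotation in a common format.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Annotation:
    offset: int   # character offset within the passage text
    length: int
    text: str
    type: str


@dataclass
class Passage:
    offset: int   # character offset of the passage within the document
    text: str
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Document:
    id: str
    passages: List[Passage] = field(default_factory=list)


def to_json(document):
    """Serialize a document and its stand-off annotations to a JSON string."""
    return json.dumps(asdict(document), sort_keys=True)


def from_json(payload):
    """Rebuild a Document from the JSON produced by to_json."""
    raw = json.loads(payload)
    passages = [
        Passage(p["offset"], p["text"], [Annotation(**a) for a in p["annotations"]])
        for p in raw["passages"]
    ]
    return Document(raw["id"], passages)


# Hypothetical document produced by an upstream NER tool.
doc = Document(
    id="example-1",
    passages=[
        Passage(
            offset=0,
            text="BRCA1 is associated with breast cancer.",
            annotations=[
                Annotation(offset=0, length=5, text="BRCA1", type="Gene"),
                Annotation(offset=25, length=13, text="breast cancer", type="Disease"),
            ],
        )
    ],
)

restored = from_json(to_json(doc))
print(restored == doc)
```

Keeping annotations as offsets into unmodified text, rather than as inline markup, lets several tools annotate the same passage independently without disturbing each other's output.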

Scalability
A defining characteristic of large-scale text-mining applications is the requirement to scale to millions of documents. PubMed, for example, contains over 25 million abstracts; even at the relatively high rate of 100 abstracts per second, processing them requires nearly 3 days of computational time, and processing an equivalent number of full text articles takes an order of magnitude longer. Text-mining implementations are therefore frequently paired with a database, allowing the text to be preprocessed and the results cached and indexed. Although this allows the text-mining results to be provided on demand for text available beforehand, this approach is insufficient for text that must be processed in real time. Moreover, this approach is also inconvenient for updates to the text-mining system, as all the cached results must be reprocessed. One approach to address scalability is the application of cluster computing: processing multiple documents in parallel on multiple hardware systems. Returning to our PubMed example, a cluster of 10 systems, each processing at the rate of 100 abstracts per second, is sufficient to reduce the processing time to under 7 h, a job which can be completed overnight.
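The arithmetic behind these estimates can be checked with a short calculation. The sketch below assumes ideal linear scaling across machines, with no scheduling, I/O or network overhead, which is an optimistic simplification; the function name is our own.

```python
def processing_hours(n_documents, docs_per_second, n_machines=1):
    """Wall-clock hours to process a corpus, assuming ideal linear scaling."""
    seconds = n_documents / (docs_per_second * n_machines)
    return seconds / 3600


PUBMED_ABSTRACTS = 25_000_000   # "over 25 million abstracts"
RATE = 100                      # abstracts per second per machine

single = processing_hours(PUBMED_ABSTRACTS, RATE)
cluster = processing_hours(PUBMED_ABSTRACTS, RATE, n_machines=10)

print(f"single machine: {single:.1f} h ({single / 24:.2f} days)")
print(f"10-machine cluster: {cluster:.1f} h")
```

Under these assumptions a single machine needs about 69 h (just under 3 days) and a 10-machine cluster just under 7 h; real deployments would take somewhat longer because perfect scaling is rarely achieved.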

Reusability
Text-mining systems are commonly applied to text somewhat different from the text used to train and evaluate them, making generalization (the ability to handle previously unseen text) very important. As an example, abstracts describing rare genetic diseases will contain significantly different information than those describing treatments for tropical infectious diseases, even though both will contain disease entities. A particular concern is the ability of the system to handle not only abstracts, but also full text documents (47)(48)(49)(50)(51)(52). However, systems for dealing with many of the various nuances of full text (such as figure captions, data in tables, information in supplementary materials, and various text cleaning issues) are still not fully in place. Thus, a large improvement in the robustness of a system against shifts in the textual domain may be significantly more useful for real-world applications than incremental improvements in system accuracy.

Future roles of researchers, publishers and curators
Bridging the gap between text-mining research and its application in real world databases requires a collaborative effort from the various stakeholders involved in advancing biomedical sciences. In this section, we provide a few perspectives which researchers, publishers and curators can use to advance biomedical sciences through text mining.

Research community
Community-run challenges in biomedical text mining such as BioCreative can play a major role in realizing the potential of large-scale text-mining applications, both by assessing the state of the art and by helping advance the field (53). The aim of conducting these challenges, in general, is to promote interdisciplinary collaboration and to evaluate and advance NLP techniques that facilitate biological research. Thus, these challenges are conducted as shared tasks where research teams from across the globe participate in fulfilling the goals of specified text-mining tasks. A myriad of such challenges have been organized over the years following the success of CASP in 1994 (54, 55) on protein structure prediction; Huang et al. (2016) (53) provide a comprehensive overview of several challenges conducted within the last decade.
In recent years, the community has introduced challenges that focus on bridging the gap between biomedical text-mining research and new application domains. For example, since 2010, BioCreative has organized workshops at the annual meetings of the International Society for Biocuration (http://biocuration.org/) with a focus on better understanding biocuration workflows (8) and promoting the development and deployment of biomedical text-mining tools into production curation pipelines. Several of these have been successfully integrated into existing curation workflows (e.g. 4,13).
Nevertheless, there are several difficulties which must be resolved before community challenges can realize the potential of large scale text-mining applications. The foremost of these difficulties is that challenge tasks are often simplified or abstracted versions of the real-world problems. For example, although biocurators routinely use the full text of an article (56,57), challenge tasks often only utilize the abstract due to difficulties in accessing full text articles and processing full text. A consequence of this simplification of the real-world problem is that even systems that perform well on challenge tasks yield significantly lower results when evaluated in practical real-world settings. For example, previous BioCreative Gene Normalization challenges have shown that the task performance dropped significantly when tested on full texts (58) instead of abstracts (59). These difficulties can be addressed by designing challenge tasks that focus on the unique problems presented by real world applications.
The BioCreative Collaborative Biocurator Assistant Task (BioC) and the BioCreative Interactive Text-Mining Task (IAT) serve as examples of such focused efforts. The BioC task centered on creating a text-mining system to support BioGRID curators by developing BioC-compatible text-mining modules complementing each other and integrated into one system. The IAT task involved biocurators in testing text-mining systems. In a similar vein, we describe below a few ideas that can be realized as challenge tasks in BioCreative workshops in the near term to help realize the opportunities of text-mining research in real-world applications more directly.
i. Creating a wide variety of manually curated benchmark datasets for various text-mining problems. These benchmark datasets are critical for text-mining researchers to train, test and compare their algorithms and also for organizations to determine the best fit for their large-scale applications. These benchmarks should come from various sources including biomedical literature (both abstract and full text), clinical trials, clinical notes and Electronic Medical Records.
ii. Identifying metrics to measure critical system qualities in addition to accuracy. As application needs differ, so do their evaluation criteria for selecting text-mining tools. Identifying or creating metrics addressing performance aspects beyond accuracy, such as scalability, usability, and cost-of-adoption (such as database management and front-end design), will greatly help both researchers and application developers to identify text-mining tools that best fit their performance dimensions. In this direction, the BioCreative IAT task has included both performance and usability metrics in the evaluation of the text-mining systems by curators, which were also adopted in the BioC task. These metrics should be extended to include scalability and cost-of-adoption.
iii. Designing challenge tasks that focus on individual large-scale applications such as SourceData, BEL and VIROME, like BioC's focus on BioGRID. Involving the data indexers and curators in the task design step will enrich the utility of the challenge task for real-world use. Parameters such as evaluation criteria can be designed specifically for the individual application. Moreover, data bottlenecks such as full text access and processing can be addressed with the help of literature services such as Europe PMC.

Publishers' role
The SourceData project provides a good example of how publishers could actively encourage innovative knowledge curation and representation. As described in the SourceData section, the publishers collaborate with researchers to generate machine-readable descriptions of datasets during the publication process and also to make this data searchable. In addition to the role of text mining expressed earlier, as the databases grow, text-mining systems can be employed in the future to provide automatic recommendations of machine-readable tags or descriptions for the datasets. Similar to the SourceData project's initiative to enrich articles during the in-publication or pre-publication phase, publishers can enrich articles in the pre-publication phase by employing text-mining systems.
In the future, the curation step may not wait until after publication, as is the current practice. A possibility is to move the curation step 'upstream' i.e. capturing knowledge at the time of peer review and prior to publication. Such an initiative would require development of very high quality and sustainable text-mining systems, and possibly require a greater involvement of the article authors in validating some of the text-mined results.

Curators' role
It is essential to keep human curators/experts in the loop in any newly proposed text-mining-based curation ecosystem. Curators are critical for defining text-mining requirements, providing annotation guidelines and standards, and providing training data for the initial system development and evaluation. Curators should be involved in evaluating the text-mined results and deciding their fitness for curation. Curators should help system developers iteratively improve the text-mining algorithms and make any necessary system customizations for their specific database curation needs. This would be the ideal way to incorporate text mining into curation workflows.

Conclusions
In this work, we presented four large scale applications of text mining in the biological and life sciences, as showcased during a recent panel at BioCreative V. We used these applications as case studies in the challenges encountered in adopting text-mining solutions into realistic tasks and discussed several areas of opportunity for text mining to support real world services in the near term. Finally, we presented a few actionable steps that the BioCreative community can take to bridge the gap between text-mining research and real world biomedical services.