DISEASES: Text mining and data integration of disease–gene associations

Text mining is a flexible technology that can be applied to numerous different tasks in biology and medicine. We present a system for extracting disease–gene associations from biomedical abstracts. The system consists of a highly efficient dictionary-based tagger for named entity recognition of human genes and diseases, which we combine with a scoring scheme that takes into account co-occurrences both within and between sentences. We show that this approach is able to extract half of all manually curated associations with a false positive rate of only 0.16%. Nonetheless, text mining should not stand alone, but be combined with other types of evidence. For this reason, we have developed the DISEASES resource, which integrates the results from text mining with manually curated disease–gene associations, cancer mutation data, and genome-wide association studies from existing databases. The DISEASES resource is accessible through a user-friendly web interface at http://diseases.jensenlab.org/, where the text-mining software and all associations are also freely available for download.


Named entity recognition (NER)
Recognizing named entities and concepts, such as genes and diseases, in text is the basis for most biomedical applications of text mining [1]. NER is sometimes divided into two subtasks, namely recognition and normalization (also known as identification or grounding), the former being to recognize the words of interest and the latter being to map them to the correct identifiers in databases or ontologies. However, as recognition without normalization has very limited practical use, the normalization step is now often implicitly considered part of the NER task.
The main challenges in NER are the poor standardization of names and the fact that a name of, for example, a gene or disease may have other meanings [2]. To recognize names in text, many systems thus make use of rules that look at features of names themselves, such as capitalization and word endings, as well as contextual information from nearby words. In early methods the rules were hand-crafted [3], whereas newer methods make use of machine learning [4,5], relying on the availability of manually annotated text corpora.
Dictionary-based methods instead rely, as the name suggests, on matching a dictionary of names against text. For this purpose the quality of the dictionary is obviously very important; the best-performing methods for NER according to blind assessments rely on carefully curated dictionaries to eliminate synonyms that give rise to many false positives [6,7]. Moreover, dictionary-based methods have the crucial advantage of being able to normalize names. Whether or not one makes use of machine learning, a high-quality, comprehensive dictionary of gene and disease names is thus a prerequisite for mining disease-gene associations from the biomedical literature.

Controlled vocabularies of diseases
It is fairly straightforward to find a good starting point for a dictionary of human gene names due to efforts such as the Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC) [8] and UniProt Knowledgebase (UniProtKB) [9]. It is less obvious where to find a good dictionary of disease names, as there are several competing classifications and ontologies, which are designed for different purposes, mutually inconsistent, and thus poorly integrated with each other.
In a clinical setting, various versions of the International Classification of Diseases (ICD; http://www.who.int/classifications/icd/) are almost ubiquitously used for coding diagnoses in electronic health records (EHRs) and derived health registries [10]. European countries, Canada, and Australia use revision 10 (ICD-10), whereas the United States still uses revision 9 (ICD-9). ICD-10 is not just an update to ICD-9; it is a restructured diagnosis classification, and no official mapping exists between the two revisions. Because ICD is designed for clinical coding and billing purposes, its structure and disease names are poorly suited for biomedical literature mining. It is, however, useful for text mining of clinical narrative in EHRs, especially because it has been translated into many languages [11].
A newer alternative is the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT; http://www.ihtsdo.org/snomed-ct/). It cross maps to several revisions of ICD and has a considerably broader scope than just diseases. SNOMED CT is one of many terminologies combined in the even broader Unified Medical Language System (UMLS) Metathesaurus; another is Medical Subject Headings (MeSH; http://www.ncbi.nlm.nih.gov/mesh/). Dictionaries based on subsets of UMLS have been used for recognition of disease names with varying success in text-mining tools, such as MetaMap [20442139], Medical Language Extraction and Encoding (MedLEE) [12], and the Mayo Clinical Text Analysis and Knowledge Extraction System (cTAKES) [13]. However, because UMLS contains many distinct concepts that are very close in meaning, even human annotation of UMLS concepts in text is problematic [14]. Licenses for SNOMED CT and other terminologies in UMLS further restrict their use in resources intended for redistribution.
In contrast to these, the Disease Ontology [15] is part of the Open Biomedical Ontologies (OBO) Foundry initiative [16]. It cross maps to UMLS and has extensive annotation of synonyms. Consequently, Disease Ontology works well for recognition of diseases in Gene Reference Into Function (GeneRIF; http://www.ncbi.nlm.nih.gov/gene/about-generif) entries [17].

Information extraction (IE)
Having addressed the NER task using appropriate dictionaries of gene and disease names, the next task is to extract information on associations between genes and diseases. There are two fundamentally different approaches to IE: natural language processing (NLP), using a grammar to parse the syntax of each sentence, and statistical co-occurrence methods [1]. We focus on the latter approach, which is highly flexible and generally gives better recall, but worse precision, than NLP [18-20]. Other disadvantages of co-occurrence methods are that they are unable to extract the direction of an association and have difficulty distinguishing between direct and indirect associations [1]. However, neither of these disadvantages is important with respect to extracting disease-gene associations. Almost all co-occurrence methods implement a frequency-based scoring scheme to account for the fact that a pair of entities or concepts may co-occur a few times without being in any way related [19,21,22]. These scoring schemes have traditionally counted either the number of sentences or the number of abstracts in which the pair co-occurred, and both sizes of text units have merit [18]. We have therefore recently introduced a scoring scheme that simultaneously takes into account both sentence-level and abstract-level co-occurrences [23].
Disease-gene associations extracted from Medline abstracts can already be searched through generalized co-occurrence tools such as CoPub [20,24] and FACTA+ [22,25]. However, as these resources are technology-centric, focusing on text mining, they do not take into account any other types of evidence. This limitation is aggravated by the fact that neither resource allows bulk download of all associations, making it difficult for others to integrate additional evidence.

Disease-gene association databases
Several existing databases focus on or contain disease-gene associations, mainly obtained through manual curation of the biomedical literature. Unfortunately, most of these use an in-house controlled vocabulary of diseases and are subject to restrictive licenses, which makes it difficult to integrate them both from a technical and from a legal standpoint. The oldest and most famous of these databases is Online Mendelian Inheritance in Man (OMIM; http://omim.org). More recent efforts include the Human Gene Mutation Database (HGMD) [26], the Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) [27,28], and Genetics Home Reference (GHR; http://ghr.nlm.nih.gov). In addition to these dedicated disease-gene association databases, UniProtKB also annotates diseases associated with each gene [9].
Databases also exist that deal with specific diseases or types of diseases, most notably cancer. The Catalog of Somatic Mutations In Cancer (COSMIC) is the most comprehensive source of information on somatic mutations and their frequencies in human cancers [29]. Mutation data are manually curated from the primary literature and annotated according to a histology and tissue ontology.
Over the last decade, genome-wide association studies (GWAS) have produced data on thousands of single nucleotide polymorphisms (SNPs) associated with the risk of hundreds of diseases. GWAS data are, however, non-trivial to work with for the non-expert, because they identify marker SNPs that are often not the actual causal SNPs [30,31]. For this reason GWAS results must be analyzed in the context of linkage disequilibrium (LD), which is defined as the non-random association of variants at two or more loci [31,32]. GWAS Central (http://www.gwascentral.org/) is a centralized database that collects the results from genetic association studies [33]. Unfortunately it provides data only for small- to medium-scale investigations and explicitly forbids using the data to create similar public resources. By contrast, the National Human Genome Research Institute (NHGRI) GWAS Catalog (http://www.genome.gov/gwastudies/) is public domain [34]. The latter is thus the basis for the derived databases DistiLD [35] and GWASdb [36], which show disease-associated SNPs and genes in their chromosomal context.
Here we describe the DISEASES resource, which aims to be the most comprehensive freely available database of disease-gene associations. To this end, we have developed open-source text-mining software that performs NER of diseases and human genes as well as IE of disease-gene associations. We integrate the associations extracted through automatic text mining with evidence from databases with permissive licenses, namely manually curated associations from GHR and UniProtKB, GWAS results from DistiLD, and mutation data from COSMIC. To make the data easy to use for large-scale analyses, we map all sources of evidence to common identifiers, assign them comparable quality scores, and make them available for bulk download. We also make the information available as a user-friendly web resource (http://diseases.jensenlab.org) aimed at end users interested in individual diseases or genes.

Dictionary construction
For human gene and protein names, we used the alias file from STRING v9.1 [23], which integrates names from Ensembl [37], UniProtKB [9], and HGNC [8]. We orthographically expanded the gene symbols with the prefix 'h', which means human and is commonly used in the literature to disambiguate a human gene from its identically named orthologs in model organisms.
To construct a dictionary of diseases for use in NER, we extracted all names and synonyms from the Disease Ontology [15]. Comparing these to the dictionary of human gene names revealed that the HGNC gene symbol of a disease gene was in many cases listed in Disease Ontology as a synonym for the disease in which the gene is implicated. For example, BRCA1 and BRCA2 were listed as exact synonyms for hereditary breast ovarian cancer. As this would be a major source of ambiguity in the combined dictionary, we explicitly filtered out disease names that are identical to HGNC gene symbols.
To improve recall, we next automatically generated variants of the disease names. Although the terms disease, disorder, and syndrome have separate definitions, we found that they are used inconsistently in the literature when part of disease names; for example, Alzheimer's disease is occasionally referred to as Alzheimer's disorder or Alzheimer's syndrome. To address this, we automatically generated the two other variants whenever any one of the three was in the dictionary. Similarly, the adjectives hereditary and familial are used interchangeably, and we thus automatically replaced one with the other. We also removed words in parentheses and brackets occurring at the end of disease names, unless this would cause ambiguity.
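
The variant generation described above could be sketched in Python as follows. This is a minimal illustration and not the code used to build the DISEASES dictionary: the function names, the whitespace-based word splitting, and the rule of dropping every variant that maps to more than one identifier are assumptions made for the example.

```python
import re
from collections import defaultdict

# Interchangeable word groups named in the text.
NOUNS = ("disease", "disorder", "syndrome")
ADJECTIVES = ("hereditary", "familial")
# Trailing "(...)" or "[...]" qualifier at the end of a name.
TRAILING = re.compile(r"\s*[\(\[][^\)\]]*[\)\]]\s*$")


def swap_words(name, group):
    """Return copies of name with each word from group replaced by the others."""
    words = name.split()
    out = set()
    for i, word in enumerate(words):
        if word.lower() in group:
            for alt in group:
                if alt != word.lower():
                    out.add(" ".join(words[:i] + [alt] + words[i + 1:]))
    return out


def name_variants(name):
    """Return the name together with its automatically generated variants."""
    variants = {name}
    stripped = TRAILING.sub("", name)
    if stripped:
        variants.add(stripped)
    for group in (NOUNS, ADJECTIVES):
        for variant in list(variants):
            variants |= swap_words(variant, group)
    return variants


def expand_dictionary(names):
    """Map lower-cased variants to identifiers, dropping ambiguous variants.

    names maps each original name to its identifier. Dropping every variant
    that points to more than one identifier is an assumption of this sketch.
    """
    owners = defaultdict(set)
    for name, identifier in names.items():
        for variant in name_variants(name):
            owners[variant.lower()].add(identifier)
    return {v: next(iter(ids)) for v, ids in owners.items() if len(ids) == 1}
```

For instance, `name_variants("Alzheimer's disease")` also yields "Alzheimer's disorder" and "Alzheimer's syndrome".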

Recognition of gene and disease names in text
To match a document against the dictionary, we have developed a highly efficient tagging algorithm, which is implemented in C++. The algorithm is described in full detail elsewhere [38], but is summarized here for completeness. Tests of the tagging speed and memory efficiency of the implementation compared to another popular tagger are also provided in our earlier publication [38].
We first tokenize the text on white space characters and special characters, such as hyphen and slash, and identify the leftmost longest matches by looking up all substrings consisting of up to 15 consecutive tokens. To make these lookups fast while handling character case variation as well as spacing and hyphenation of multi-word names, we use a custom hash table to store the dictionary. The hash table is case insensitive, disregards white space characters and hyphens within names, and trims off other punctuation characters, such as quotes and parentheses, at the beginning and end of names. To match also acronyms that are not in the dictionary, we use a regular expression to search for definitions of acronyms within the text and look up their long forms in the dictionary. Crucially, we manually inspected the tagging results of all names that occur more than 2000 times in Medline and globally blocked those that would otherwise give rise to many false positives. Many of the blocked names are acronyms; for example, the acronym for disseminated intravascular coagulation is DIC, which can also mean deviance information criterion, differential interference contrast, and dissolved inorganic carbon. By keeping track of all names that we have inspected, whether they were blocked or not, we are able to efficiently update the list of blocked names as both Medline and the dictionary grow. For each name recognized in the text we normalize it to the corresponding unique identifier and, in case of diseases, backtrack the term to the root of the ontology through is_a relationships to assign also the identifiers of all parent terms.
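
To make the matching step concrete, the following Python sketch mimics the leftmost-longest lookup over up to 15 tokens with a normalized dictionary key and a block list. It is only an illustration under simplifying assumptions: the actual tagger is written in C++ with a custom hash table [38], whereas this sketch uses a plain Python dict, hypothetical function names and identifiers, and omits the acronym detection entirely.

```python
import re

MAX_TOKENS = 15  # longest candidate name, in tokens


def normalize(text):
    """Build a lookup key: lower case, no white space or hyphens, outer punctuation trimmed."""
    key = re.sub(r"[\s\-]+", "", text.lower())
    return key.strip("\"'()[]{}.,;:")


def tokenize(text):
    """Return (start, end) offsets of tokens split on white space, hyphen, and slash."""
    return [m.span() for m in re.finditer(r"[^\s\-/]+", text)]


def tag(text, dictionary, blocked=frozenset()):
    """Return (start, end, identifier) for the leftmost longest dictionary matches.

    dictionary maps normalize(name) to an identifier; blocked holds normalized
    names that must never be tagged.
    """
    spans = tokenize(text)
    matches = []
    i = 0
    while i < len(spans):
        found = False
        # Try the longest candidate first, then shrink it token by token.
        for j in range(min(len(spans), i + MAX_TOKENS), i, -1):
            start, end = spans[i][0], spans[j - 1][1]
            key = normalize(text[start:end])
            if key in dictionary and key not in blocked:
                matches.append((start, end, dictionary[key]))
                i = j  # continue after the matched name
                found = True
                break
        if not found:
            i += 1
    return matches
```

With a toy dictionary containing both "cancer" and "breast cancer", the text "Breast-cancer patients" is tagged once with the longer name, because the hyphen is ignored by the key and longer candidates are tried first. Note that the reported offsets cover whole tokens, so surrounding punctuation such as parentheses is included in the span.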

Extraction and scoring of disease-gene associations
We score associations between proteins and diseases using the scoring scheme previously described [39], which is also the basis for the co-occurrence-based text-mining scores in STRING v9.1 [23] and COMPARTMENTS [40]. For completeness we reiterate the scoring scheme here.
An important feature of the scoring scheme is that it simultaneously takes into account co-occurrences at the level of abstracts as well as individual sentences. To this end, we first calculate a weighted count $C(G, D)$ for each pair of a gene ($G$) and a disease ($D$) over the $n$ abstracts in the text corpus:

$$C(G, D) = \sum_{k=1}^{n} \left[ w_a \, \delta_{ak}(G, D) + w_s \, \delta_{sk}(G, D) \right]$$

where $w_a = 3$ and $w_s = 0.2$ are the weights for co-occurrence within the same abstract and the same sentence, respectively, and the delta functions $\delta_{ak}(G, D)$ and $\delta_{sk}(G, D)$ signify whether or not $G$ and $D$ co-occur in abstract $k$ or a sentence within it. A co-occurrence score $S(G, D)$ is calculated from the weighted counts as:

$$S(G, D) = C(G, D)^{\alpha} \left( \frac{C(G, D) \, C(\cdot, \cdot)}{C(G, \cdot) \, C(\cdot, D)} \right)^{1 - \alpha}$$

where $C(G, \cdot)$ is the sum over all diseases paired with gene $G$, $C(\cdot, D)$ is the sum over all genes paired with disease $D$, the normalizing factor $C(\cdot, \cdot)$ is the sum over all pairs of genes and diseases, and the weighting factor $\alpha = 0.6$. All parameters ($w_a$, $w_s$, and $\alpha$) have in earlier work been optimized to give the best possible performance on finding functionally associated genes [23]. An important property of this function is that it not only rewards the gene and disease for being mentioned together, but also penalizes them for being frequently mentioned together with other diseases or genes, respectively.
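
The two steps of the scoring scheme can be illustrated with a short Python sketch. It assumes that the weighted count adds $w_a = 3$ once for every abstract in which the pair co-occurs plus $w_s = 0.2$ if the pair also shares a sentence in that abstract, and that the score has the form $S = C^{\alpha} \, (C \cdot C(\cdot,\cdot) / (C(G,\cdot) \, C(\cdot,D)))^{1-\alpha}$ with $\alpha = 0.6$. The input layout (each abstract as a list of per-sentence gene and disease sets) and the function names are hypothetical.

```python
from collections import defaultdict

W_ABSTRACT = 3.0  # w_a: weight for co-occurrence within an abstract
W_SENTENCE = 0.2  # w_s: weight for co-occurrence within a sentence
ALPHA = 0.6       # weighting factor alpha


def weighted_counts(abstracts):
    """Compute C(G, D) from tagged abstracts.

    abstracts is a list of abstracts; each abstract is a list of sentences;
    each sentence is a (genes, diseases) pair of sets of identifiers.
    """
    counts = defaultdict(float)
    for sentences in abstracts:
        genes = set().union(*(g for g, _ in sentences))
        diseases = set().union(*(d for _, d in sentences))
        same_sentence = {(g, d) for gs, ds in sentences for g in gs for d in ds}
        for g in genes:
            for d in diseases:
                counts[(g, d)] += W_ABSTRACT
                if (g, d) in same_sentence:
                    counts[(g, d)] += W_SENTENCE
    return dict(counts)


def cooccurrence_scores(counts):
    """Compute S(G, D) from the weighted counts."""
    gene_total = defaultdict(float)     # C(G, .)
    disease_total = defaultdict(float)  # C(., D)
    total = 0.0                         # C(., .)
    for (g, d), c in counts.items():
        gene_total[g] += c
        disease_total[d] += c
        total += c
    return {
        (g, d): c ** ALPHA
        * (c * total / (gene_total[g] * disease_total[d])) ** (1 - ALPHA)
        for (g, d), c in counts.items()
    }
```

In this sketch a pair mentioned in the same sentence of one abstract receives a count of 3.2, whereas a pair that only shares the abstract receives 3.0, so sentence-level evidence refines rather than replaces abstract-level evidence.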
We next convert the co-occurrence scores $S(G, D)$ to z-scores $Z(G, D)$, which are easier to interpret and are robust to changes in the size of the text corpus. We assume that the empirically observed score distribution is a mixture of the true signal and a lower-scoring random background, which we model as a Gaussian distribution. The full details of this score conversion have been published elsewhere [39]. Finally, we calculate the confidence score (stars) as $Z(G, D)/2$, limited to a maximum of four stars to account for automatic text mining never being as reliable as manually curated annotations.
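
The final conversion to stars is simple enough to state as code. The sketch below assumes that the confidence score is half the z-score, capped at four stars; the function name is hypothetical, and the conversion from co-occurrence scores to z-scores [39] is not reproduced here.

```python
def text_mining_stars(z_score, max_stars=4.0):
    """Convert a z-score to a confidence score: Z / 2, capped at max_stars."""
    return min(max_stars, z_score / 2.0)
```

Under this reading, a z-score of 5 corresponds to 2.5 stars, and any z-score of 8 or more receives the maximum of four stars.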

Integration of curated knowledge
The GHR database does not provide download files for use in large-scale analyses. We thus used an automated crawler to download the web page for each disease and store the disease name, which is part of the uniform resource locator (URL), along with any gene symbols listed on the web page. We were able to map the names of 390 diseases to Disease Ontology using the dictionary we developed for text mining. The pages are regularly recrawled to update with new associations; the numbers used in the manuscript are based on what was downloaded on May 31, 2013.
In case of UniProtKB, associations to diseases can be found in the KW lines through the use of 149 keywords from the UniProtKB controlled vocabulary of keywords. We were able to manually map 132 of the 149 disease keywords to corresponding concepts in the Disease Ontology. Most of the keywords that we could not map, such as Disease mutation, were not disease names.
We mapped HGNC gene symbols from GHR and identifiers from UniProtKB to their identifiers in STRING v9.1 using the alias file [23]. We subsequently used the explicitly annotated disease-gene associations from GHR and UniProtKB to infer broader Disease Ontology concepts via the is_a relationships in the ontology. As all disease-gene annotations imported and inferred from the two databases are based on manual curation, we assigned them a confidence score of five stars.
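
Inferring broader concepts via is_a relationships amounts to adding, for every annotated pair, the same gene paired with each ancestor of the disease term. The following Python sketch shows one way to do this; the representation of the ontology as a mapping from each term to its direct parents, the function names, and the toy term names are assumptions for illustration.

```python
def ancestors(term, parents):
    """Return all terms reachable from term by following is_a links upwards.

    parents maps each term to a list of its direct is_a parents.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(associations, parents):
    """Add (gene, ancestor) pairs for every explicitly annotated (gene, disease) pair."""
    inferred = set(associations)
    for gene, disease in associations:
        inferred.update((gene, term) for term in ancestors(disease, parents))
    return inferred
```

Because a term can have more than one parent, the traversal keeps a set of visited terms so that shared ancestors are added only once.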

Benchmark of text-mining results
To assess the quality of the text-mining results, we constructed a reference set based on the manually curated annotations imported from GHR and UniProtKB. Due to the hierarchical nature of the Disease Ontology, it is necessary to select a subset of terms to be used as the basis for the assessment. To this end, we chose to use the subset of terms that were explicitly annotated in the two databases (as opposed to inferred through is_a relationships). In case one term was a child term of another, we selected the broader parent term. This resulted in a positive reference set of 2780 associations between 2001 genes and 173 diseases. We defined the negative set as all other 343393 possible pairings of the same genes and diseases.
We next sorted the text-mined associations descending by score and compared them to the reference set. We present the results as receiver operating characteristic (ROC) curves by plotting the true positive rate (TPR) as a function of the false positive rate (FPR), considering either all disease-gene associations or only the best-scoring association per gene (Figure 1). We compare these results to two random backgrounds. One is simple random shuffling of the disease-gene pairs, which ignores that some diseases are associated with many more genes than others. To correct for this, the second random background is calculated by sorting the disease-gene pairs descending by prior probability of the disease. Because the prior of each disease is estimated based on the reference set itself, this likely overestimates the performance that can be attained by random guessing.
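
The ROC calculation can be sketched in Python as follows. The sketch assumes that the negative set consists of all pairings of the benchmark genes and diseases that are not in the positive set, which reproduces the reported count (2001 × 173 − 2780 = 343393), and that scored pairs involving other genes or diseases are ignored. The function name and input layout are hypothetical, ties in score are not treated specially, and at least one negative pairing is assumed to exist.

```python
def roc_points(scored_pairs, positives):
    """Return (FPR, TPR) points for associations ranked by descending score.

    scored_pairs is a list of ((gene, disease), score); positives is the set
    of curated (gene, disease) pairs that defines the benchmark.
    """
    genes = {g for g, _ in positives}
    diseases = {d for _, d in positives}
    # All other pairings of the same genes and diseases count as negatives.
    n_negatives = len(genes) * len(diseases) - len(positives)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for (g, d), _ in sorted(scored_pairs, key=lambda item: item[1], reverse=True):
        if g not in genes or d not in diseases:
            continue  # outside the benchmark universe
        if (g, d) in positives:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_negatives, tp / len(positives)))
    return points
```

Pairs that never co-occur in the corpus have no score and therefore never appear in `scored_pairs`, so the resulting curve ends before reaching TPR = 1 and FPR = 1.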

Integration of mutation and GWAS data
To integrate cancer mutation data from COSMIC [29], we manually created mappings between terms listed in the fields "Site primary" and "Histology" and Disease Ontology concepts classified under "organ system cancer" and "cell type cancer", respectively. We mapped the genes to STRING v9.1 identifiers via the Ensembl transcript identifiers provided by COSMIC. For each pair of a gene ($G$) and a disease ($D$) we counted the number of disease samples carrying at least one somatic missense or nonsense mutation within the gene ($N_{G,D}$). We discarded pairs with a count less than 10 and derived confidence scores (stars) as $\log_{10} N_{G,D} - 0.5$, limiting it to at most four stars.
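
As a worked example of the COSMIC scoring, the Python sketch below assumes that the logarithm is base 10 and that the score is the log of the sample count minus 0.5, capped at four stars; the function name and the use of `None` for discarded pairs are choices made for this illustration.

```python
import math


def cosmic_stars(n_samples, min_count=10, max_stars=4.0):
    """Return the confidence score for a gene-disease pair, or None if discarded.

    n_samples is the number of disease samples with at least one somatic
    missense or nonsense mutation in the gene.
    """
    if n_samples < min_count:
        return None  # pairs with fewer than 10 samples are discarded
    return min(max_stars, math.log10(n_samples) - 0.5)
```

Under these assumptions a pair supported by exactly 10 samples receives 0.5 stars, and each tenfold increase in the sample count adds one star until the cap is reached.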
To include also GWAS data, we integrated information from the DistiLD database [35], which maps genes and disease-associated SNPs onto so-called LD blocks defined based on data from the HapMap Project [41]. We assigned each SNP with a p-value less than $10^{-5}$ to the nearest gene within the same LD block. The "Disease/Trait" descriptors from the NHGRI GWAS Catalog were mapped to the corresponding Disease Ontology concepts through the ICD-10 annotations from DistiLD, the Disease Ontology Lite annotations from GWASdb [36], and manual inspection of conflicts. The resulting disease-gene associations were assigned a confidence score (stars) using the formula $3 - \log_{10} \max(P, P_{\mathrm{GWS}})$, where $P$ is the p-value and $P_{\mathrm{GWS}}$ is the genome-wide GWAS significance threshold ($5 \cdot 10^{-8}$).

Dictionary-based tagger software
We have developed a highly efficient NER method for diseases and human genes, which are normalized to identifiers from Disease Ontology [15] and STRING v9.1 [23], respectively. On a server with two Intel E5520 processors and 24 GB of random access memory (RAM), starting the tagger and loading the dictionary took only 4.2 seconds. Once started, the tagger used 260 MB of RAM and was able to process 360 Medline abstracts per second on a single processor core (measured on a corpus of 100,000 Medline abstracts). The tagger software bundled with a dictionary of disease and human gene names is available for download under the BSD license.

Co-occurrence-based disease-gene associations
Because the NER task is for us only a step on the way towards the goal of extracting disease-gene associations, we chose to focus our benchmarking effort on assessing the quality of the end result. We therefore compared the text-mined associations to the manually curated associations imported from GHR and UniProtKB in two ways: 1) considering all disease-gene associations, and 2) considering only the highest scoring disease for each gene. The results of these comparisons (Figure 1) show that our text-mining system is able to extract a large fraction of the known disease-gene associations with high specificity (low FPR). If a user were to simply trust the highest scoring disease association for each gene, 50% of all manually curated disease-gene associations in the benchmark set would be found at a FPR of only 0.16%.
The high quality of text-mining results is reflected by the fact that they are already being used extensively. The text-mined associations from DISEASES are included in the widely used GeneCards database [42]. They have also been used as a basis for inference of disease associations for miRNAs from their predicted target genes [39] and for enrichment analysis of autism-related genes [43].

Table 1: Overview of disease-gene association evidence. Each row shows the number of genes, diseases and associations between them that are supported by a given type, confidence level (in case of Text mining), or source (in case of Knowledge and Experiments). The numbers in parentheses specify the counts prior to backtracking of Disease Ontology terms through is_a relationships.

Contents of the database
Although we have in this paper placed most emphasis on the text-mining aspects, the DISEASES database integrates disease-gene associations from several sources. This is advantageous, because every source of associations has its shortcomings. Table 1 provides an overview of the total evidence landscape of the database, showing that the text-mining pipeline is indeed the largest single contributor of associations. However, it is important to note that this number depends strongly on the confidence cutoff; indeed the number of associations obtained from the manually curated databases rivals the number of text-mined associations with at least 3 confidence stars. Mutation data from COSMIC and GWAS data from DistiLD also both contribute a sizeable number of associations; however, the former data source only relates genes to cancers.
All disease-gene associations from all evidence sources are available for bulk download in tab-delimited format under the Creative Commons Attribution (CC-BY) license.

The DISEASES web interface
Whereas tab-delimited files are convenient for bioinformaticians wanting to perform large-scale analyses or create derived resources, a user-friendly web interface better caters to researchers interested in individual genes or diseases. We have thus developed a web interface for the DISEASES resource that allows users to either query for a gene to find associated diseases or query for a disease to find associated genes (Figure 2). In either case, the user will be presented with three tables called Knowledge, Experiments, and Text mining. These show the manually curated associations from GHR and UniProtKB, the mutation and association data from COSMIC and DistiLD, and the text-mined associations, respectively. Besides summarizing the imported information, the Knowledge and Experiments tables provide direct hyperlinks to the source entries in the external databases.
The table summarizing the text-mined evidence deserves special attention. As the text-mining method correctly takes into account information from the narrower child terms of each disease, the text-mined disease associations for a gene have inherent redundancy. When showing the list of diseases associated with a gene of interest, the web interface thus dynamically filters out redundant Disease Ontology terms for which better alternatives are present. The web interface also gives the user the possibility to inspect the text-mining evidence behind any disease-gene association by viewing the underlying abstracts with the gene and disease names highlighted.

Generality of the approach
The approach to text mining described in this paper is readily applicable to recognize other types of named entities in text and extract associations among them. Using the same tagger with a dictionary constructed from the NCBI Taxonomy [44], we were able to accurately identify taxonomic names in the biomedical literature [38]. We are currently extending that work to identify environments from the Environment Ontology [45] in text, for example, from the Encyclopedia of Life [46]. We have even used a slightly modified version of the tagger as part of a method for recognition of adverse drug events in Danish clinical narratives [47]. This illustrates the flexibility of a simple dictionary-based NER approach in terms of applicability to new knowledge domains.
Combining the tagger with the co-occurrence scoring scheme for the purpose of IE is equally flexible. As previously mentioned, the scoring scheme was originally developed to extract functional associations between proteins for use in the STRING database based on co-occurrence of gene names within biomedical literature [23]. In addition to using it for disease-gene associations as described here, we have since applied the same scoring scheme to extract information on protein-small molecule associations in the STITCH database [48], protein subcellular localization in the COMPARTMENTS database [40], and tissue distribution of proteins in the TISSUES database (http://tissues.jensenlab.org).
Besides using the same methods for NER and IE, DISEASES and the other resources mentioned above have in common that they integrate heterogeneous evidence from many sources. This sets them apart from the many resources that use text mining to extract associations between a wide variety of named entities and concepts. For tool developers, it is easiest and most efficient to be technology-centric and apply a single technology, such as text mining, to a wide range of topics. However, from a user's perspective, a resource that integrates many sources of information pertaining to a single topic of interest is usually what is sought after. We attempt to find a compromise by creating a general framework, which allows us to set up resources that each integrate information on a different topic but are maintainable, because they share software infrastructure.

Conclusions
We have developed a dictionary-based NER tool for Disease Ontology concepts and combined it with a co-occurrence scoring scheme to efficiently and accurately extract disease-gene associations from Medline. We have integrated these with manually curated associations from the GHR and UniProtKB databases as well as somatic mutation and GWAS data from COSMIC and DistiLD, respectively. We make the resulting database available as a searchable user-friendly web resource at http://diseases.jensenlab.org, where bulk datasets and the NER software are also available for download.

Figure 1: Benchmark of disease-gene associations obtained through text mining. The receiver operating characteristic (ROC) curves show the true positive rate (TPR) as a function of the false positive rate (FPR) when considering all associations (black) and when considering only the highest scoring association for each gene (red). The dashed and dotted curves show the random expectations according to simple shuffling and prior-based ranking, respectively. The curves do not reach TPR = 1 and FPR = 1, because some disease-gene pairs in the benchmark set are never mentioned together in Medline and therefore have no text-mining score.

Figure 2: The DISEASES web resource. The figure shows how the disease-gene associations are presented in the web interface, exemplified by the LRRK2 gene. The three tables provide the user with an overview of the evidence from text mining, curated knowledge, and experimental data. Clicking on an association, e.g. to Parkinson's disease, in the Text mining table gives access to the underlying abstracts with the co-occurring gene and disease highlighted. The two other tables provide hyperlinks to the relevant entries in the source databases.