CoIN: a network analysis for document triage

In recent years, the number of medical articles has increased rapidly. The number of articles in PubMed has grown exponentially, and the workload for biocurators has grown with it. Under these circumstances, a system that can automatically determine in advance which articles have a higher priority for curation can effectively reduce the workload of biocurators. Determining how to effectively find the articles required by biocurators has become an important task. In the triage task of BioCreative 2012, we proposed the Co-occurrence Interaction Nexus (CoIN) for learning and exploring relations in articles. We constructed a co-occurrence analysis system, which is applicable to PubMed articles and suitable for gene, chemical and disease queries. CoIN uses co-occurrence features and their network centralities to assess the influence of curatable articles from the Comparative Toxicogenomics Database. The experimental results show that our network-based approach combined with co-occurrence features can effectively classify curatable and non-curatable articles. CoIN also allows biocurators to survey the ranking lists for specific queries without reviewing meaningless information. At BioCreative 2012, CoIN achieved a mean average precision of 0.778 in the triage task, finishing in second place among all participants. Database URL: http://ikmbio.csie.ncku.edu.tw/coin/home.php


Introduction
With the increasing number of research studies in the medical community, the work of biocuration has become time-consuming and inefficient. The curation of these abundant biomedical studies leads biocurators into a series of unanticipated problems: the related literature is usually large in number and unstructured in nature. To obtain high-quality information, databases usually require manual curation. Even with their best efforts, biocurators can only annotate a limited number of articles. Constraints, such as the limited number of biocurators, the rapid growth of biomedical literature and the compatibility of the resulting data formats, create barriers to the biocuration process (1).
When manual curation databases, such as PharmGKB (2) and the Comparative Toxicogenomics Database (CTD) (3), were developed, the biocurators used PubMed and Medical Subject Headings (MeSH) to understand and annotate medical terms. PubMed and MeSH are developed by the National Library of Medicine and contain a huge amount of literature, experimental results and ontology information. Determining how to effectively find the articles required by biocurators has therefore become an important task. For example, the Article Classification Task (ACT) in the BioCreative (Critical Assessment of Information Extraction Systems in Biology) competition aims to classify articles that are relevant to protein-protein interaction (PPI) curation (4). Thus, the ACT is helpful for building annotated PPI databases. However, there are many problems in the workflow of biocuration. The workflow of biocuration is a heuristic learning procedure, and the heuristic rules are based on the experience of the biocurators (5); that is, different biocurators might annotate differently. During the biocuration process, the biocurators annotate the chemicals based on their domain knowledge. To understand the behavior of biocurators, the BioCreative 2012 committee examined the decisions made by biocurators when they were determining whether articles should be curated. The decision is easily understood by biocurators, but the task is difficult for computers. Therefore, more studies must be conducted to ascertain the effects of text-mining approaches in classifying important articles.
Up to now, most researchers have focused on the issue of relation extraction in the ACT, and only a few works have considered the issue of multiple relation extraction. Therefore, we constructed an entity co-occurrence analysis system, which is applicable to PubMed articles and suitable for gene, chemical and disease queries. The relationship between two terms is often correlated to their co-occurrence in a sentence. The co-occurrence of two terms means that these two terms both occur in the same article. This co-occurrence suggests a semantic relation between the two terms, although it does not guarantee that the two terms are indeed related. For example, 'dementia' and 'atopic dermatitis' might co-occur in the same article, but the reasons for the co-occurrence of the two terms are unclear. Co-occurrence features are still useful for providing possible candidates for relation extraction.
In this work, we propose a text-mining platform, known as the Co-occurrence Interaction Nexus (CoIN), to distill the entity co-occurrence information from literature and to measure the relationships between entities using a networking approach. CoIN integrated several named entity recognition tools and parsed single sentences in articles. The system is able to obtain co-occurrence pairs and create a ranking list. We assumed that network centralities represent the importance of co-occurrence pairs and would be useful in building an automatic curation system. Therefore, we calculated the mean average precision (MAP) from co-occurrence features and network centralities. The advantage of CoIN is that biocurators can survey the ranking lists of specific queries without reviewing meaningless information; thus, CoIN allows biocurators to focus on more useful tasks.
Network analysis concerns the relationships between processing entities. For example, the nodes in a social network are people, and the links are the friendships between the nodes. If we apply these concepts to the ACT, PubMed articles are the nodes, while the co-occurrences of gene-disease, gene-chemical and chemical-disease relationships are the links. Network analysis provides a visual map and a graph-based technique for determining co-occurrence relationships. These graphical properties, such as size, degree, centralities and similar features, are important. By examining the graphical properties, we can gain a global understanding of the likely behavior of the network. For this purpose, this work focuses on two themes concerning the applications of biocuration: using the co-occurrence-based approach to obtain a normalized co-occurrence score and using the network-based approach to measure network properties, e.g. betweenness and PageRank. CoIN integrates co-occurrence features and network centralities when curating articles. The proposed method combines the co-occurrence frequency with the network construction from text. The co-occurrence networks are further analyzed to obtain the linking and shortest path features of the network centralities.
We have organized the rest of this work as follows. The second section is a review of the literature and addresses related works in biocuration. The third section describes the system and methodology of CoIN. The fourth section is an evaluation and discussion. We conclude the article with conclusions and future work.

Related work
Several studies investigated methods for extracting relations from biomedical literature. The intuitive methods use predefined phrase patterns or the co-occurrences of two queries from the text. However, these methods are limited by predefined knowledge patterns and are incapable of discovering new patterns. Therefore, machine learning (ML) techniques provide a better understanding when trying to discover new patterns. Hence, ML approaches have been widely used and have gained popularity in recent years (6,7). To date, support vector machines (SVMs), k-nearest neighbor, Naive Bayes, decision trees and neural networks have been used to extract knowledge patterns (6,8). Recently, attention has shifted from ML methods to natural language processing (NLP). NLP emphasizes linguistic features that are obtained from the text and can also distill knowledge patterns (9)(10)(11)(12). Although the above strategies considered pattern extraction, a few works have been published on article classification.
The co-occurrence frequency can be considered a measurement that describes the overlapping characteristics between concepts. Therefore, the co-occurrence-based approach evaluates the curated relatedness of articles by exploring the co-occurrence frequency of different concept pairs. Curatable articles correlate with a high frequency of concept pairs, whereas non-curatable articles correlate with a low frequency of concept pairs. The shortest paths on the graphs are examined for various forms (13). Problems arise when many concepts share the same information content, which leads to redundant concept pairs. Several studies applied concepts with different information content to approximate the curated relatedness (14,15). Wilbur and Yang retrieved articles from PubMed and transformed them into a matrix; the matrix was used to correlate the documents with their term frequency (16). Moreover, the researchers evaluated the curated relatedness of terms using the co-occurrences of terms. Patwardhan and Pedersen used a context vector to estimate the value of curated relatedness (17) and constructed the context vector from the literature by word sense discrimination (18) and latent semantic indexing (19). However, the co-occurrences of two concepts do not necessarily correspond to the actual curated relatedness. Bollegala et al. proposed a framework for the curated relatedness between concepts; this framework combines the lexical patterns from short text snippets with four characteristics of page counts. Then, the authors clustered the lexical patterns to improve the performance (20).
Recent computational biology research has suggested that network-based approaches may indeed facilitate processes that are beneficial to the understanding of molecular biology. These benefits include integration of heterogeneous databases, prediction of disease genes and increased quality of modules of cellular machinery. Eronen and Toivonen developed a system with protein interaction prediction and a disease gene prioritization task as instances of link predictions, and the predictions were based on a proximity measure of the integrated graph (21). Then, Winter et al. reported that identifying prognostic genes connects gene expression measurements to a network of known relationships. The researchers ranked the genes using both expression and network information in a manner similar to Google's PageRank (22). At the same time, Atias and Sharan described a comparative analysis of networks from multiple species that detected significant biological patterns and provided more interpretation (23).
Several studies use network centralities to make tentative predictions of important vertices in compound networks. In this case, network centralities are able to measure the global influence of individual proteins. Jeong et al. reported that the proteins with a high degree in PPI networks may be important proteins (24). Yu et al. demonstrated that proteins with high betweenness centrality are important proteins in yeast PPI networks (25). More recently, the increase in network approaches for studying heterogeneous networks has decreased the accuracy due to incomplete networks and noise.
In recent years, interest in the issue of assisting manual curation has dramatically increased (26). Manual curation plays an important role in supporting basic analyses for advanced research, and BioCreative 2012 focused on the integration of biocuration. The BioCreative 2012 subcommittee identified three areas, or tracks, that comprised independent but complementary aspects of data curation (27). The three areas are literature triage (Track I); curation workflow (Track II); and text mining/NLP systems (Track III). Track I participants developed systems that would effectively triage and prioritize articles for curation. CoIN produced notable results in BioCreative 2012 Track I.

System description
CoIN is a web-based system that assists biocurators in assessing articles according to the correlations of their terms among sentences. For searching relevant documents, CoIN adopts the co-occurrence features and co-occurrence networks from PubMed articles, and then CoIN applies network analyses to distinguish between curatable and non-curatable articles. Although co-occurrence features are the basic components of searching for the patterns of domain knowledge, there are still many restrictions on named entity recognition; training set problems, such as imbalance and shortage, are of particular concern.
Network analysis has been applied to different research topics, such as phylogenetics, function predictions, human diseases and drug development (28,29). At the same time, the pre-tagging results of CoIN were developed based on the state-of-the-art named entity recognition tools in BioCreative III (30,31). Furthermore, we collected dictionary corpora to recognize disease and chemical names during sentence-level processing. Therefore, the idea of CoIN is basically generated from the co-occurrence of gene, disease and chemical names in a specific article.

Curation workflow
For the convenience of biocurators, CoIN allows users to query genes, diseases and chemicals. As shown in Figure 1, CoIN uses AIIAGMT (32) to identify gene names and separate articles into sentences. Next, we train conditional random fields to predict chemical names in the articles, and the training patterns are extracted from the CTD. This statistical modeling method is frequently applied in pattern recognition. To tag disease names, CoIN uses a dictionary-based method to identify diseases, and the dictionary is also extracted from the CTD. After collecting the tagged names, CoIN calculates the co-occurrences of the tagged names for each sentence. Then, the co-occurrence network is constructed using the co-occurrences of gene-disease, gene-chemical and chemical-disease relationships, as shown in Figure 2. In the last stage of CoIN, the system provides the normalized co-occurrence score, the betweenness and the PageRank value for prioritizing PubMed articles. In the 'Methods' section, we introduce the normalized co-occurrence score, betweenness and PageRank.
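The sentence-level counting step described above can be sketched as follows. This is a minimal illustration, not the CoIN implementation: it assumes the tagging has already been done and that each sentence arrives as a list of (entity type, name) pairs. The function name, the input layout and the placeholder entity names (other than phenacetin) are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Map an unordered pair of entity types to a short label.
# Pairs of the same type (e.g. chemical-chemical) are deliberately ignored.
PAIR_LABELS = {
    frozenset(["gene", "disease"]): "gd",
    frozenset(["gene", "chemical"]): "gc",
    frozenset(["chemical", "disease"]): "cd",
}

def count_cooccurrences(sentences):
    """Count gene-disease, gene-chemical and chemical-disease pairs per sentence."""
    counts = Counter({"gd": 0, "gc": 0, "cd": 0})
    for entities in sentences:
        unique = sorted(set(entities))  # one mention per entity per sentence
        for (type_a, _), (type_b, _) in combinations(unique, 2):
            label = PAIR_LABELS.get(frozenset([type_a, type_b]))
            if label is not None:
                counts[label] += 1
    return dict(counts)

# Hypothetical pre-tagged article: three sentences with placeholder names.
article = [
    [("chemical", "phenacetin"), ("disease", "DISEASE_X"), ("gene", "GENE_A")],
    [("chemical", "phenacetin"), ("chemical", "CHEM_B")],
    [("gene", "GENE_A"), ("chemical", "phenacetin")],
]
print(count_cooccurrences(article))
```

The three counts per article are the raw frequencies that the later scoring steps operate on.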
For example, we use the PubMed articles for phenacetin as an input list; alternatively, the user can enter a gene, disease or chemical name, as shown in Figure 3. After the computation is finished, we obtain a ranking list, as shown in Figure 4. The name recognition process is usually the time-consuming step in the system schema of CoIN. Once name recognition is finished, CoIN provides a quick sorting result to biocurators: the system spends less time on training complex features and immediately returns the ranking result from the network centralities of the co-occurrence networks.
Many interaction data are accompanied by significant noise, and an overestimation is caused by the overlapping interactions. The noise stems from the related problem of named entity recognition. For example, we processed gene, chemical and disease names as single entities. However, various gene, chemical and disease names consisting of multiple words are not separable during parsing.
Consequently, the named entity recognition is restricted by the entity anonymization. Furthermore, there are many chemicals describing the curative effect in the same sentence, but these chemicals are usually synonyms. In this case, the combination relationships of these chemicals also become noise.

Normalized co-occurrence score
In analyzing the training data set, we found that the disease entity recognition rate was low. Therefore, we designed the normalized co-occurrence score to adjust the influence of the imbalanced recognition rate. Using a normalized co-occurrence score avoids many frequent patterns for describing overlapped combinations. Removing such patterns may result in a performance decrease, and these patterns might be insignificant for the ACT. However, we found that the number of true positives for the normalized co-occurrence score increased when the disease names were not treated as a single entity, but this finding was not applied to the official runs. The frequencies of co-occurrence of gene-disease, gene-chemical and chemical-disease are normalized by the standard score z as follows:

$$z = \frac{x - \bar{x}}{S}$$

where x is the co-occurrence frequency of either the gene-disease, gene-chemical or chemical-disease relationship; $\bar{x}$ is the mean value of a set of x; and S is the standard deviation of a set of x. After calculating the standard score z for the gene-disease, gene-chemical and chemical-disease frequency in an article, we define the sum of the above three standard scores of z as the normalized co-occurrence score of the article.
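A minimal sketch of the normalized co-occurrence score follows. It makes two assumptions that the text does not settle: the "set of x" is taken to be the counts of one pair type across all articles in the collection, and S is taken to be the sample standard deviation. The function names and the input layout are hypothetical.

```python
from statistics import mean, stdev

PAIR_TYPES = ("gd", "gc", "cd")

def z_scores(values):
    """Standard scores z = (x - mean) / S for a list of counts."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation
    if s == 0:
        return [0.0] * len(values)  # all counts equal, so no article stands out
    return [(x - m) / s for x in values]

def normalized_cooccurrence_scores(articles):
    """Sum the three per-type z scores for every article.

    `articles` is a list of dicts holding raw 'gd', 'gc' and 'cd' counts.
    """
    per_type = {k: z_scores([a[k] for a in articles]) for k in PAIR_TYPES}
    return [sum(per_type[k][i] for k in PAIR_TYPES) for i in range(len(articles))]

# Made-up counts for three articles, for illustration only.
articles = [
    {"gd": 0, "gc": 1, "cd": 0},
    {"gd": 2, "gc": 3, "cd": 1},
    {"gd": 4, "gc": 5, "cd": 2},
]
print(normalized_cooccurrence_scores(articles))
```

Because each pair type is standardized separately, a pair type with a low recognition rate contributes on the same scale as the others.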
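
Betweenness
The betweenness centrality of a vertex v is calculated from the shortest paths that pass through it:

$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}$ is the total number of shortest paths from node s to node t; and $\sigma_{st}(v)$ is the number of those paths that pass through the vertex v.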

PageRank
PageRank is a well-known link analysis algorithm, which was developed by the founders of Google (35). To rank each web page, the in-links (inward-directed edges) and out-links (outward-directed edges) are calculated. An in-link is a hyperlink that other web pages use to direct people to the linked web page, and an out-link allows people to access other web pages. The PageRank algorithm ranks the page by the linking structure of networks via a random walk model. For any vertex V_i in a network, the PageRank value is calculated as follows:

$$PR(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{PR(V_j)}{|Out(V_j)|}$$

where d is a damping factor and is set to 0.85; PR(V_i) is the PageRank value of V_i; In(V_i) are the in-links of V_i; and Out(V_j) are the out-links of V_j. After computing the PR value of the vertices in networks, we can consider that the vertices with a higher PR value have more influence than the vertices with a lower PR value. The co-occurrence network is constructed using the information of gene-disease, gene-chemical and chemical-disease co-occurrences, as shown in Figure 5. The co-occurrence network is derived from the linking structure of web pages, but the co-occurrence relationships are essentially different from the in- and out-links. In Figure 5, we use an undirected edge to represent the co-occurrence relationship between entities, and this edge also represents a bidirectional edge when counting the in- and out-degree of a node in the undirected network. Thus, the co-occurrence network is displayed as an undirected graph, where vertices represent PubMed articles and edges represent co-occurrence interactions between PubMed articles. Note that both the in- and out-links of vertex i and vertex j are increased by 1 if there is an edge between vertex i and vertex j. For example, P2 and P4 have the highest betweenness value [C_B(P2) = C_B(P4) = 3], while C_B(P1) = C_B(P3) = C_B(P5) = 0. At the same time, P2 and P4 also have the highest PageRank value [PR(P2) = PR(P4) = 0.29], while PR(P3) = 0.19 and PR(P1) = PR(P5) = 0.11. The results of this toy example show that P2 and P4 are more important than the other nodes in the co-occurrence network.
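The toy example can be checked with a short sketch. Two assumptions are made here. First, the edge set below is inferred, not taken from Figure 5: it is one five-vertex topology that reproduces the reported values. Second, the PageRank variant used in the code puts (1 - d)/N in the teleport term so that the values sum to 1, which matches the scale of the reported numbers; with a teleport term of (1 - d) the values are simply N times larger. Betweenness counts each unordered pair (s, t) once. All function names are hypothetical.

```python
from collections import deque

# Assumed topology (Figure 5 itself is not available in this text).
EDGES = [("P1", "P2"), ("P2", "P3"), ("P2", "P4"), ("P3", "P4"), ("P4", "P5")]

def build_adjacency(edges):
    """Undirected adjacency sets: every edge counts as an in- and an out-link."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def shortest_path_counts(adj, source):
    """Breadth-first search: distances and numbers of shortest paths from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj):
    """Sum over unordered pairs (s, t) of sigma_st(v) / sigma_st."""
    nodes = sorted(adj)
    info = {s: shortest_path_counts(adj, s) for s in nodes}
    cb = {v: 0.0 for v in nodes}
    for i, s in enumerate(nodes):
        dist_s, sigma_s = info[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue  # s and t are not connected
            dist_t, sigma_t = info[t]
            for v in nodes:
                if v in (s, t) or v not in dist_s or v not in dist_t:
                    continue
                if dist_s[v] + dist_t[v] == dist_s[t]:  # v lies on a shortest path
                    cb[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]
    return cb

def pagerank(adj, d=0.85, iterations=100):
    """Power iteration with teleport term (1 - d) / N (assumed normalization)."""
    n = len(adj)
    pr = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        pr = {v: (1 - d) / n + d * sum(pr[u] / len(adj[u]) for u in adj[v])
              for v in adj}
    return pr

adj = build_adjacency(EDGES)
print(betweenness(adj))
print({v: round(score, 2) for v, score in sorted(pagerank(adj).items())})
```

Under these assumptions the sketch gives a betweenness of 3 for P2 and P4 and 0 elsewhere, and PageRank values that round to 0.29, 0.19 and 0.11, in agreement with the values quoted in the text.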
When the co-occurrence features and network centralities of PubMed articles are evaluated, we define the CoIN index to estimate the relevance scores of PubMed articles. A concept pair means that two named entities occur within a single sentence in a document. Note that a neighbor of a concept pair (C_i) is another concept pair that co-occurs with C_i in the same document. The CoIN index calculates the concept pairs in data sets, and the neighbors of concept pairs are collected. After retrieving the neighbors of concept pairs, we construct co-occurrence networks, and these networks are the linking structures of concept pairs. Furthermore, we use a linear combination to adjust the co-occurrence model and the network-based model. The co-occurrence model outputs the two scores of co-occurrence for each article. In contrast to the co-occurrence model, the network-based model includes the PageRank and the betweenness in the same manner. After calculating the scores from the co-occurrence model and the network-based model, we combine these two scores into the CoIN index for classifying curated articles. However, we consider a constant α, which is a damping factor, to determine the weight of concept pairs. We compute their CoIN index using the following equation:
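As an illustration of such a linear combination, the following sketch assumes a convex form: α times the co-occurrence score plus (1 - α) times the network score. This exact form, the side to which α is attached, and all function and argument names are assumptions made for illustration, not the definition used in the official runs. The sketch also assumes that both scores have already been brought to a comparable scale.

```python
def coin_index(cooccurrence_score, network_score, alpha):
    """Hypothetical convex combination of the two model scores."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * cooccurrence_score + (1.0 - alpha) * network_score

def rank_articles(scores, alpha):
    """Sort article ids by descending combined score.

    `scores` maps an article id to (cooccurrence_score, network_score).
    """
    return sorted(scores,
                  key=lambda article: coin_index(*scores[article], alpha),
                  reverse=True)

# Made-up scores for two articles, for illustration only.
scores = {"A": (0.9, 0.1), "B": (0.2, 0.8)}
print(rank_articles(scores, alpha=0.1))  # network score dominates
print(rank_articles(scores, alpha=0.9))  # co-occurrence score dominates
```

Sweeping α over a grid and keeping the value with the best ranking quality on held-out data is one simple way to tune a weight of this kind.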

We did not analyze or optimize on the test set for the submission. According to our training set experiments, classification performance is sensitive to the combinations of co-occurrence pairs. For the submitted run, we applied the normalized co-occurrence score because we found that the rate of recognizing disease names was underestimated; that is, we assumed that the normalized co-occurrence score provided robustness against the low rate of entity recognition. The MAP of the submitted run is 0.778. We received second place in the triage task of BioCreative 2012, and the best MAP score was 0.803 (36). The scores of test target chemicals and approaches are illustrated in Figure 7. The ranking lists of network-based approaches are better than those of the co-occurrence-based methods, and the MAP of PageRank is 0.796, which is the best result among the four approaches. Furthermore, we applied the CoIN index to curate the test and training sets. When the co-occurrence model and the network-based model used the frequency and PageRank, respectively, the best performance was 0.819 for the test set (α = 0.1) and 0.698 for the training set (α = 0.3). Note that the network-based approaches apply the linking structure between the frequency of concept pairs and the co-occurrence networks. However, the linking structure of co-occurrence does not affect the performance. Hence, the average MAP of PageRank is slightly superior to that of the co-occurrence frequency.

Discussion
In the discussion, we further investigated the relations between co-occurrence features and biocuration. The work was not fully optimized at the time of the competition due to time constraints. After analyzing a series of BioCreative data sets, we believe further improvement is possible based on ML techniques and also on the recently released gold standard test set. However, ML techniques led to an overfitting problem and decreased the classification performance in the test set, particularly for the triage task. Therefore, our tuning strategy for the CoIN index was focused on different data and feature combinations, not on the tuning parameters and heuristic rules.
Our system shows that the strategy of using both co-occurrence and network features in our classification framework is a good combination for the triage task.
To understand the importance of co-occurrence features in the triage task, we used the training set to compute P@10, P@20, P@50 and P@100, as shown in Figure 8. The precision can be evaluated at a given cutoff of ranking, and we can consider only the top k results returned by the system, known as P@k. We used the number of chemical-disease (cd), chemical-gene (cg), gene-disease (gd) relations and the sum of three co-occurrence features (gcd) to measure the precision of CoIN at different values of k. In Figure 8, we can see that the cd and gd pairs decrease the precision of gcd, especially at P@20 and P@100. The exact reason for the low recognition rate of disease names is unclear, but the major problem in identifying disease names is that researchers tend to use general English terms, not MeSH terms. We found that the overall frequency of different concept pairs was effective for our ranking-based search system. However, our current performance is also limited by the recognition capability. We recommend that co-occurrence features be regarded as an important feature in biocuration. In addition, entity recognition for chemicals and diseases remains problematic and has much room for improvement.
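The ranking measures used in this evaluation can be sketched as follows. The sketch assumes that each ranked list is given as a sequence of binary labels (1 for a curatable article, 0 otherwise), ordered from the top of the ranking, and that MAP averages the average precision over queries such as target chemicals. The function names and the sample labels are hypothetical.

```python
def precision_at_k(ranked_labels, k):
    """Fraction of curatable articles among the top k ranked results."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_labels[:k]) / k

def average_precision(ranked_labels):
    """Mean of the precision values at each rank that holds a curatable article."""
    hits = 0
    precisions = []
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    """Average of the per-query average precision values."""
    return sum(average_precision(labels) for labels in ranked_lists) / len(ranked_lists)

# Made-up ranking for one query: 1 = curatable, 0 = non-curatable.
labels = [1, 0, 1, 1, 0]
print(precision_at_k(labels, 2))
print(average_precision(labels))
```

P@k looks only at the head of the list, whereas average precision rewards placing every curatable article as high as possible.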
The observed differences between curated information led us to further research the influence of specific factors of the co-occurrence structure of curated articles. An independent sample t test was conducted to evaluate the hypothesis that there are more co-occurrence pairs in curatable articles than in non-curatable articles. The difference was significant (P < 0.05). As shown in Table 1, for the different characteristics that were retrieved using our approaches, on average, there were more co-occurrence pairs in curatable articles than in non-curatable articles. It is reasonable to believe that there is a significant difference between the two groups; that is, the statistical evidence suggests that co-occurrence relations differ between curatable and non-curatable articles.
As discussed above, the statistics of co-occurrence raised the possibility that biocurators might annotate the articles in the same manner. In our system, the development of the network-based model is closely tied to the effective use of the co-occurrence model. Hence, we used co-occurrence and network features to train SVM models, and then the SVM classifiers applied different features to the training and test sets. Note that we used an SVM with a linear kernel and trained SVM models in a 5-fold cross-validation. Table 2 presents the effect of applying an SVM using co-occurrence and network features on the BioCreative 2012 Triage task. After training the classifiers with the training set, we used the training and test sets to make predictions. As shown in Table 2, adding network features to the co-occurrence features boosts the performance in the test set, and the overall co-occurrence frequency also enhances the performance. For SVM classifiers, there is less improvement; however, it demonstrates that network features provide a positive effect for the ACT. For the BioCreative 2012 Triage task, possible feature candidates, including both co-occurrence and network features, were examined and explored. As a result, the overall co-occurrence frequency and PageRank were further selected for better triage. In addition, betweenness is helpful to increase the precision and recall in the test set but decreases the recall in the training set.

Figure 8. P@k of the training set. cd: chemical-disease relations co-occurring in a sentence; cg: chemical-gene relations co-occurring in a sentence; gd: gene-disease relations co-occurring in a sentence; gcd: the total occurrence of gd, cd and cg relations.

Conclusion and future work
In this study, we used co-occurrence- and network-based approaches to develop a system (CoIN) and evaluated the ranking of the CTD data sets. CoIN applies the co-occurrences of sentence structures and the linking activities between biomedical terms, such as genes, chemicals and diseases, to prioritize the importance of articles. Note that our approach is different from traditional supervised learning methods. CoIN begins with automatic named entity recognition and the identification of connections among neighbors. Then, we constructed heterogeneous co-occurrence networks from the combinations of different concept pairs. Next, we computed the co-occurrence frequency and network centralities for concept pairs. If an article has more concept pairs, this article has a higher priority to be curated. Finally, we tested CoIN with the test data, and CoIN demonstrated its ability to curate articles. The experiments with the test data showed that the network-based approaches perform better than the co-occurrence-based approaches. The proposed system is also helpful for biocurators to customize their own curation patterns.
In the future, we hope to clarify the influence of concept pairs in CoIN. Although we have investigated gene-disease, gene-chemical and chemical-disease concept pairs, we believe that more detailed research focusing on the relations of concept pairs will improve the performance of CoIN.