Methods Inf Med 2016; 55(04): 340-346
DOI: 10.3414/ME15-01-0108
Original Articles
Schattauer GmbH

Link Prediction on a Network of Co-occurring MeSH Terms: Towards Literature-based Discovery

Andrej Kastrin (1), Thomas C. Rindflesch (2), Dimitar Hristovski (3)
1   Faculty of Information Studies, Novo Mesto, Slovenia
2   Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, Maryland, USA
3   Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
Funding: This work was supported in part by the Slovenian Research Agency and by the Intramural Research Program of the U.S. National Institutes of Health, National Library of Medicine.

Publication History

received: 17 August 2015

accepted in revised form: 19 May 2016

Publication Date: 08 January 2018 (online)

Summary

Objectives: Literature-based discovery (LBD) is a text mining methodology for automatically generating research hypotheses from existing knowledge. We cast the LBD process as a classification problem on a graph of MeSH terms. We employ unsupervised and supervised link prediction methods to predict previously unknown connections between biomedical concepts.

Methods: We evaluated the effectiveness of link prediction through a series of experiments using a MeSH network that contains the history of link formation between biomedical concepts. We performed link prediction using proximity measures: common neighbors (CN), Jaccard coefficient (JC), Adamic/Adar index (AA), and preferential attachment (PA). Our approach relies on the assumption that similar nodes are more likely to establish a link in the future.
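
As an illustration of the four proximity measures named above, the following minimal sketch computes CN, JC, AA, and PA for every unlinked node pair in a small graph. It uses the standard textbook definitions of these measures. The toy edge list and all function names are assumptions made for this example, not the MeSH network or the code used in the study.

```python
import math
from itertools import combinations

# Hypothetical toy co-occurrence graph (illustrative only, not MeSH data).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]


def build_adjacency(edge_list):
    """Return a dict mapping each node to the set of its neighbors."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbors(adj, u, v):
    """CN: number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """JC: shared neighbors divided by the size of the neighborhood union."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    """AA: sum of 1 / log(degree) over shared neighbors.

    A shared neighbor is adjacent to both u and v, so its degree is at
    least 2 and the logarithm is strictly positive.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])


def preferential_attachment(adj, u, v):
    """PA: product of the degrees of u and v."""
    return len(adj[u]) * len(adj[v])


def score_non_edges(adj):
    """Score every currently unlinked pair with all four measures."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            scores[(u, v)] = {
                "CN": common_neighbors(adj, u, v),
                "JC": jaccard(adj, u, v),
                "AA": adamic_adar(adj, u, v),
                "PA": preferential_attachment(adj, u, v),
            }
    return scores


adjacency = build_adjacency(EDGES)
scores = score_non_edges(adjacency)
# Rank candidate links by one measure, here Adamic/Adar, highest first.
ranked_by_aa = sorted(scores, key=lambda pair: scores[pair]["AA"], reverse=True)
```

In an unsupervised setting, each measure yields its own ranking of candidate links, with the top-ranked pairs treated as the most likely future connections.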

Results: In the unsupervised approach, the AA measure achieved the best performance in terms of area under the ROC curve (AUC = 0.76), followed by CN, JC, and PA. In the supervised approach, we evaluated whether the proximity measures can be combined to define a model of link formation across all four predictors. We applied various classifiers, including decision trees, k-nearest neighbors, logistic regression, multilayer perceptron, naïve Bayes, and random forests. The random forest classifier achieved the best performance (AUC = 0.87).
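
The AUC figures above can be read as the probability that a randomly chosen pair that later formed a link receives a higher score than a randomly chosen pair that did not. The sketch below computes AUC directly from that pairwise definition, counting ties as one half. The scores and labels are made-up values for illustration only and do not come from the study, and the function name `auc_from_scores` is an assumption of this example.

```python
def auc_from_scores(scores, labels):
    """Compute AUC as the fraction of (positive, negative) pairs in which
    the positive example is scored higher; ties count as one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical link scores (e.g. from one proximity measure) and outcomes:
# 1 = the pair became linked later, 0 = it did not.
example_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
example_labels = [1, 0, 1, 0, 0]
example_auc = auc_from_scores(example_scores, example_labels)
```

The quadratic pairwise loop is chosen for clarity. For a network of realistic size, a rank-based computation would be the usual choice.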

Conclusions: The link prediction approach proved to be effective for LBD processing. Supervised statistical learning approaches clearly outperform an unsupervised approach to link prediction.