Search and Graph Database Technologies for Biomedical Semantic Indexing: Experimental Analysis

Background: Biomedical semantic indexing is a very useful support tool for human curators in their efforts for indexing and cataloging the biomedical literature. Objective: The aim of this study was to describe a system to automatically assign Medical Subject Headings (MeSH) to biomedical articles from MEDLINE. Methods: Our approach relies on the assumption that similar documents should be classified by similar MeSH terms. Although previous work has already exploited the document similarity by using a k-nearest neighbors algorithm, we represent documents as document vectors by search engine indexing and then compute the similarity between documents using cosine similarity. Once the most similar documents for a given input document are retrieved, we rank their MeSH terms to choose the most suitable set for the input document. To do this, we define a scoring function that takes into account the frequency of the term into the set of retrieved documents and the similarity between the input document and each retrieved document. In addition, we implement guidelines proposed by human curators to annotate MEDLINE articles; in particular, the heuristic that says if 3 MeSH terms are proposed to classify an article and they share the same ancestor, they should be replaced by this ancestor. The representation of the MeSH thesaurus as a graph database allows us to employ graph search algorithms to quickly and easily capture hierarchical relationships such as the lowest common ancestor between terms. Results: Our experiments show promising results with an F1 of 69% on the test dataset. Conclusions: To the best of our knowledge, this is the first work that combines search and graph database technologies for the task of biomedical semantic indexing. Due to its horizontal scalability, ElasticSearch becomes a real solution to index large collections of documents (such as the bibliographic database MEDLINE). Moreover, the use of graph search algorithms for accessing MeSH information could provide a support tool for cataloging MEDLINE abstracts in real time. (JMIR Med


Biomedical Semantic Indexing
The last two decades have witnessed tremendous advances in our knowledge of life sciences and medicine, leading to an exponential growth of the biomedical literature.There are several biomedical bibliographic databases such as EMBASE, OVID, Ebsco Host Research databases, Scielo, Cochrane, and the largest one, with 5600 journals and over 26 million articles, MEDLINE.In 2015, more than 806,000 citations were added to MEDLINE with a load of 2000 to 4000 documents per day.This quickly growing volume of articles is an overwhelming challenge that requires a very specialized knowledge for organizing this bibliographic database.
To support the classification and indexing of the content of the MEDLINE database, the US National Library of Medicine (NLM) produces and maintains a thesaurus of medical concepts, MeSH (Medical Subject Headings), which is reviewed and updated continually (eg, 310 new headings were added to MeSH in 2015).Each document in MEDLINE is represented with a set of MeSH terms that describe its subject topic.This task, which is generally known as biomedical semantic indexing, is a crucial task to facilitate literature search because MeSH terms can be used in search queries to retrieve references that were annotated with these terms or with their hierarchically related terms in MeSH (ie, their synonyms, hypernyms, or hyponyms).The task of identifying the MeSH terms that best represent a MEDLINE article is manually performed by human experts (so-called curators).NLM also provides some basic principles [1] to assign MeSH terms that curators should follow when they catalog articles.Biomedical semantic indexing is usually a costly, time-consuming, and laborious task [2].Therefore, there is an urgent need to explore semiautomatic methods to support semantic indexing.
Several challenges such as Critical Assessment of Information Extraction in Biology (BioCreative) [3], Workshop on Biomedical Natural Language Processing (BioNLP) shared tasks [4,5], Informatics for Integrating Biology & the Bedside (i2b2) [6], and DDIExtraction [7,8] have significantly contributed to improve and advance the state of the art in Natural Language Processing for biomedicine, especially in the information extraction task.Similarly, the biomedical semantic indexing and question answering challenge (BioASQ) is being organized since 2013 to encourage and promote research in these fields and provide a common framework for assessment.The objective of the task is to tag an article with a set of terms (also known as headings or descriptors) from the MeSH thesaurus.In this task, the training data consist of a vast collection of MEDLINE abstracts.Each article includes the MeSH terms that the curators used to classify it.It also contains additional metadata such as its unique identifier number (PubMed unique identifier, PMID) used in PubMed (a free search engine for the MEDLINE database), title, journal name, and publication year (see Figure 1).The test data consist of recently published articles that have not been labeled by the curators yet.The participating systems have to find the best MeSH terms and report their answers for the test data.
Biomedical semantic indexing can be defined as a multilabel hierarchical classification problem because each document has to be classified with one or more concepts from a taxonomy.If the taxonomy has a significant number of concepts (more than hundreds), the main challenge is to work with this large number of classes in the classification problem.In the case of the BioASQ challenge, MeSH has a hierarchy with 16 main branches and contains more than 27,000 terms.Some works restrict the scope of MeSH hierarchy using only a particular branch in the MeSH tree (eg, heart diseases) [9] or a subset of terms (generally those appearing in the training collection) [10] to reduce the difficulty of the multilabel classification problem.

General Architecture
The general architecture of the most state-of-the-art systems comprises 2 differentiated phases: a first phase in which an initial set of MeSH terms is obtained and a second phase that ranks these terms to select the top K that better fit the input document.Several machine-learning techniques have been used such as Support Vector Machines (SVM) [11,12], logistic regression [13], k-nearest neighbors (k-NN) [11,13,14], or a combination of them.
Most previous systems employ either flat classifiers or cascades of classifiers [15].Flat classifiers [11,[16][17][18] do not take into account the hierarchical relations between the MeSH terms, whereas cascades approaches [19,20] apply a separate classifier top-down for each term.In each term, the method must decide whether to assign the current term to the article being classified or continue descending by the taxonomy and selecting which branches (children) to continue exploring.However, both approaches, flat and cascades, use the BoW (bag-of-words) model to represent the documents.One of the notorious disadvantages of BoW models is that they generate a large number of features (as many as the vocabulary size of the training set), which usually requires prohibitive computation time for practical applications.A possible solution could be the use of feature selection techniques to reduce the number of BoW features.However, these techniques have proved to be inefficient because of the large number of classes (as many as existing terms in MeSH) that must be represented.In other words, as mentioned above, this multilabel classification problem implies more than 20,000 classes (which are the terms stored in MeSH), and it would need to keep at least a few features to represent each class for the classification.Indeed, classifiers used in this problem usually obtain better performance without feature selection [15].More recently, some works [21,22] use word embedding techniques as an attractive alternative of BoW-based approaches, leading to very large dimensionality reduction and promising results.Some previous works have implemented different strategies based on the guidelines proposed by human curators to select the most appropriate set of MeSH terms for a given document.However, it is difficult to assess their real utility because human curators, paradoxically, do not always follow their own rules [23].
Table 1 summarizes some of the main systems for the task of biomedical semantic indexing.The underlying characteristics (such as the type of approach: flat vs hierarchical, if the system is based on a search engine, and a brief description of the main techniques used) of these works are presented.This study is an extension of our earlier work [24] that described our participation on the BioASQ 2016 biomedical semantic indexing (Task 4a).Our main hypothesis is that similar documents should be classified by similar MeSH terms.Although this hypothesis is not new, and whereas most previous works [11,21,22,25] use document similarity by clustering methods such as k-NN algorithm, our approach exploits document similarity computed by an open source search engine, the ElasticSearch tool [26], one of the most efficient document store databases [27].To the best of our knowledge, very few works have exploited search engines [14,18].In particular, the work by Ribadas et al [18] used the search engine tool Indri XSL • FO RenderX [28], with the drawback of the high computational time needed for its searches.
Although some works [29,30] have applied the semantic similarity between concepts to the biomedical semantic indexing task, very few works have exploited the curators' guidelines defined by NLM to assign MeSH terms.Our work proposes the implementation of one of the most important annotation rules [1], named "Specific Headings vs Broader Headings," which had not been considered by any of the previous automatic systems.This rule claims that if 3 MeSH terms are proposed to classify an article and share the same ancestor, then the curator should replace these terms by their lowest common ancestor.To do this, the MeSH thesaurus is represented as a graph database.This model based on graph theory leads to query the thesaurus much faster than using a relation database.It enables to swiftly and effortlessly capture hierarchical relationships such as the shortest path between 2 terms or their lowest common ancestor, which are features very useful to decrease the unnecessary overlapping of MeSH terms when an abstract is classified.
The rest of the paper is organized as follows: first, in the Methods section, we give a description of the datasets used in this study and explain our approach.Then, we report and discuss the results of our method in the Results section.Finally, conclusions and future work are presented.

Objective
The goal of the task was to automatically predict the most descriptive MeSH terms for a given article.The predictions should be compared with MeSH terms that were assigned by human curators.This section describes the MeSH resource, the data, and approach used in this study.

MeSH
MeSH is a thesaurus of medical concepts, which was created to assist human curators in the task of cataloging the articles in the MEDLINE database.Thus, each MEDLINE document should be represented with a set of MeSH terms that describe its subject topic.MeSH is an annually updated document (eg, 310 new headings were added to MeSH in 2015).The MeSH 2016 version contains a total of 27,883 main terms (also known as headings or descriptors), 82 qualifiers (subheadings), and more than 232,000 supplementary concept records, which represent specific examples of chemicals, diseases, and drug protocols.
In MeSH, most terms contain a short definition, links to related descriptors, a list of synonyms or very similar terms, and a unique alphanumerical ID. Figure 2 shows the content for the term "Lymphoma."The terms are organized in a hierarchy in which each child can have more than one parent.Therefore, any MeSH term can appear at different branches of the hierarchical structure of MeSH.For example, the term "Lymphoma" belongs to 3 different branches: "Neoplasms

Data
The training data for the BioASQ Task 4a consisted of MEDLINE articles that were manually annotated with MeSH terms by human curators.During the BioASQ 2016 challenge, a test dataset was published each week for the assessment of the participating systems.A total of 15 test datasets were published, which were grouped into 3 different periods (batches).Although the BioASQ challenge ended last May 15, 2016, the test datasets with gold annotations were not released because many articles have not been manually annotated yet.Training v.2016b with almost 5 million documents from the pool of journals that the BioASQ organizers used to select the articles for the test data.In both datasets, the average number of MeSH terms assigned to an article was 12 to 13.
In our previous work [24], we performed several experiments using each of the 2 training datasets, which led to the conclusion that they did not make a significant difference on the performance of our system.For this reason, we decided to only use the largest dataset (Training v.2016a ) to perform all of the experiments described in this new work (see the Results section).Moreover, to optimize the best setting of our approach, we randomly chose 1099 documents from the training dataset and separated them for development set.
As mentioned before, no test datasets with gold standard annotations were released.However, to perform a transparent and consistent evaluation of our work, we developed a script that obtains the MeSH terms for all abstracts in the test batches of the 2016 BioASQ.For each test document, the script obtains its PMID and then generates a query for searching it in PubMed.If the PMID exists in MEDLINE, PubMed returns a structured document containing the metadata for this abstract, among them its MeSH labels (see Figure 3), collected by the script using a regular expression.Finally, the labels are also searched in the MeSH resource to obtain their corresponding MeSH identifiers.In this way, we obtained the same 15 test datasets used in the 2016 BioASQ edition.Table 2 shows the size of the different datasets used in this study.

Indexing Documents and Query (Test Document) Using ElasticSearch
Our approach relies on the assumption that similar documents should be classified by similar MeSH terms.Previous research has generally used document clustering techniques, such as the k-NN algorithm, to obtain the similar documents for a given test document.Instead of using k-NN, we proposed the use of an open source search engine, ElasticSearch, to retrieve a set of similar documents for each test document.
Figure 4 shows the main steps of our approach.ElasticSearch was used to index all the documents of the training dataset (Training v.2016a).Each training document was stored along with its corresponding MeSH terms.Each test document was also represented as a query, which was fired against the index built from the training dataset.Then, ElasticSearch should return the most relevant (similar) documents to the query (the test document).Finally, our system initially assigns it all the MeSH terms of the similar documents retrieved by ElasticSearch for this document.
Below we explain in detail how the index was constructed and how a query (a test document) could be compared against this index to recover the most relevant (similar) documents.
The core of ElasticSearch is Apache Lucene, a free, open-source, and de facto standard retrieval software library (by The Apache Software Foundation).The efficiency of Lucene is because it searches on index instead of searching the text directly.Moreover, the index is stored in the main memory.
Lucene is based on the well-known and commonly used vector space model (VSM) for information retrieval.This model allows us to represent documents as vectors, where each position in the vector represents a specific term (typically terms are single words), and the value at that position denotes the weight of that term.There are several different ways of computing these values, being the most known term frequency-inverse document frequency (tf-idf) weighting.In this model, a given document d is represented as a vector where w i,d represents the frequency of the term i in the document d, D is the set of all documents, and |{d'Є D|I Є d'}| is the number of documents containing the term I (see Figure 5).
In short, VSM represents documents and queries as weighted vectors, where each dimension refers to an index term and its value is its tf-idf value.To assess the relevance of a document d for a given query q, VSM calculates the cosine similarity of their vectors (see Figure 6).Therefore, the basic idea behind VMS is that the more frequent a term is in a document relative to its frequency in the whole collection of documents, the more relevant that document is to the query.
Another important advantage of ElasticSearch is its capacity to create distributed and scalable systems by specifying only the configuration of the hierarchy of nodes.Thus, ElasticSearch is self-managed to maintain better fault tolerance and load distribution.In 2014, an empirical evaluation study about the effectiveness of the current databases demonstrated that ElasticSearch achieved the best performance compared with other document store databases [27].This is because ElasticSearch uses the main memory and compresses documents, thereby improving retrieval time.Moreover, another main challenge of the task is to manage the great amount of documents that have to be indexed.Thanks to its horizontal scalability (ie, the possibility of adding more storage and processing power), ElasticSearch is able to index large collections of documents such as the MEDLINE database.
In this study, ElasticSearch (version 5.0) was installed on an Ubuntu 16.04 server with 24 GB of RAM and 500 GB of disk space.It took 10,264.07seconds to index all the training dataset (ie, an average of 1.02 milliseconds per document).The training dataset (Training v.2016a) consists of a total of 10,100,380 documents, with an average size of 2.1 KB per document.

MTI Processing
The Medical Text Indexer (MTI) [14] is a tool developed by NLM and is considered as a baseline system for the task, which provides a preliminary annotation of the articles.MTI is based on a combination of MetaMap- [31] and PubMed-related citations to recognize MeSH terms that are then clustered and ranked by a k-NN algorithm.Given a document, MTI uses MetaMap to find its concepts.The UMLS (Unified Medical Language System) concepts found by MetaMap are restricted to MeSH by a combination of synonym and interconcept relations, and mappings.MTI also obtains a second list of MeSH terms by obtaining similar documents for the input document.To do this, MTI uses the list of PubMed-related citations provided by the PubMed system.Then, the MeSH terms of these similar documents are also extracted.Finally, MTI clusters both lists of MeSH terms into a single list.Terms are clustered by a k-NN algorithm and ranked according to the product of the frequency and the MeSH tree depth of each term.MTI also includes a postprocessing phase that implements a set of filtering rules from the NLM guidelines.For instance, it contains a list of triggers that activate one or more MeSH tags and that comes mainly from the NLM guidelines, in the way of rules such as "if XXXX appears in the text then you should tag as AAAA." As it was mentioned before, our system initially considered the set of MeSH terms from the relevant documents retrieved by ElasticSearch for a given test document.Then, that set was further extended with those terms provided by the MTI tool.

MeSH Labels Scoring Process
In the previous two sections, we described how an initial set of MeSH terms is proposed by ElasticSearch and later extended by the MTI tool, for a given test document.In this section, we introduce a new scoring function to rank the MeSH terms for a given test document (represented as a query q).The basic idea behind this scoring function is the more number of times a MeSH term appears in the set of more relevant documents for a given test document (query), the more significant that term is to this test document.The scoring function (see Figure 7) for a MeSH term l and a test document q considers the following parameters: tf(l): the frequency of the MeSH term l in the set of retrieved documents by ElasticSearch for the document q (query).Σ d:lЄd _score (d, q) is the sum of all scores of the relevant documents to the query q, which also contain the MeSH term l.As mentioned before, ElasticSearch uses the cosine similarity function to obtain the score between a document and a query.We normalized the sum of all scores because some documents may present a large number of MeSH terms, whereas others very few.To do that, we divided it by the maximum score of the relevant documents containing the term l.
T is a real positive value that represents the minimum threshold for the scores of the MeSH terms.That is, only the MeSH terms whose scores are greater than T finally will be selected for cataloging the test document q.

Selecting MeSH Terms by Exploiting a Graph Database
In this point, we already have a set of ranked MeSH terms for a given test document.
In the last phase, we implemented a heuristic based on the guidelines of human annotators [1] to classify MEDLINE articles.In particular, the implemented rule claimed that if an abstract had 3 or more MeSH terms sharing some ancestor, then the curators should replace these 3 terms by their lowest common ancestor.
Our hypothesis here was that representing the MeSH thesaurus as a graph would let to query the MeSH thesaurus much faster than when using its original format.By using well-known graph search algorithms such as depth-first search, the model graph enabled to rapidly and easily capture hierarchical relationships such as the shortest path between 2 terms or their lowest common ancestor.Knowing these hierarchical relationships allowed us to find the most appropriate MeSH terms for a given abstract, decreasing the possible overlapping among them, as the NLM recommends.
BlazeGraph [32]  We also developed an algorithm that, given an input document, traverses each of the MeSH terms proposed in the previous step and searches its ancestors by querying the graph database of MeSH with the depth-first search algorithm.Finally, when our algorithm finds out that 3 or more of its MeSH terms share the same ancestor, it replaces them by their lowest common ancestor.
Initially, we restricted the search to a given depth of ancestors, that is, pruning the search subtree below to a given height.However, because the maximum depth is relatively small (consisting only of 9 levels, with an average depth of approximately 4.5 levels), we decided to explore the complete tree of ancestors for each term.Figure 8 shows the query used by ElasticSearch to retrieve all ancestors of the term "Lymphoma."The output of this query is shown in Figure 9 where the term "Lymphoma" is in 3 different branches of the MeSH thesaurus: C04-Neoplasms, C15-Hemic and Lymphatic Diseases, and C20-Immune System Diseases.M Table 3 shows the list of MeSH terms proposed by our system for the article with PMID=25676421.The first column contains the MeSH terms after applying our script to replace the terms (3 or more) sharing the same ancestor, whereas the second one contains the MeSH terms proposed by using only ElasticSearch and the score function.For example, the terms "Lymphoma, B-Cell," "Ataxia Telangiectasia," and "Lymphoma" were substituted by their lowest common ancestor "Immune System Diseases."Table 4 shows the comparison of search times for 3 different MeSH terms.The reader can see that the 3 searches on the MeSH thesaurus stored into a graph database are significantly faster than the same searches on the RDF format.

Design of the Experiments
This section conducts an exhaustive set of experiments, where different parameters and options are evaluated on the development dataset to determine the best setting for our system, which will finally be evaluated on the test datasets.
In BioASQ, the performance of the participating systems is evaluated using standard IR measures (eg, precision, recall, and F1), as well as hierarchical variants of them, such as the lowest common ancestor Precision (LCA_P), Recall (LCA_R) and F-measure (LCA-F).The reader can find a detailed explication of these measures in the article [33].The HEMKit software [34], a tool that implements these measures and lets to easily evaluate the results of different experiments, was used to provide the scores.
Our experiments aimed to answer the following questions: What is the effect of the number of relevant documents retrieved by ElasticSearch?It is expected that the more documents the search engine obtains, the higher the recall and the lower the precision of our system.We experimented with different number of relevant documents to obtain the best balance between precision and recall, that is, the best F1.In particular, we tried with 10, 20, 30, 40, and 50 documents.
What is the best threshold T that we should consider in our scoring function?Higher values of this threshold should provide a high precision but with a significant decrease of recall.Our objective was to determine the optimum value of this parameter T, that is, that value that obtains the highest F1.Does the use of the hierarchical structure of MeSH improve the performance of our system?In particular, we assess whether the strategy of replacing terms sharing the same ancestor by their lowest common ancestor helped to improve the performance.

Experiment With/Without exploiting MeSH Hierarchical Structure
Tables 5 and 6 show the results exploiting the hierarchical structure of MeSH and without it, respectively.Each experiment is represented with the label Elastic-X-T, where X refers to the number of relevant documents retrieved by ElasticSearch and T to the threshold for our scoring function.
We tried with different number of retrieved relevant documents; in particular, the parameter X could take the following values: 10, 20, 30, 40, and 50.Although increasing the number of retrieved relevant documents achieves to improve the recall, it has a very negative effect on the precision of our system.Indeed, the best F1 (if we do not use the structure of MeSH, we obtain F1=0.70) is obtained with the lowest number of retrieved relevant documents regardless the value of the threshold T (see Tables 5 and 6).Therefore, we can conclude that the best value of X is 10.For values less than 10, the recall decreases significantly.In other words, the system achieves better performance if the search engine is set up to return at least 10 documents.
To assess the effect of the threshold T on the performance of our system, we tried with different values.Tables 5 and 6 show the results for values of T in range (0,9).The reader can see that, in general, the greater the value of the parameter T, the higher the precision, and also the maximum F1.However, the recall decreases when increasing the value of T. Any value lower than 1 achieves a very high recall but very low precision because the system would return all MeSH terms obtained by ElasticSearch along with those provided by the MTI tool, without applying any filter.That is, if the value of T is lower than 1, the scoring function does not rule out any term from the initial set of MeSH terms proposed by ElasticSearch and MTI.On the other hand, for values of T up to 5, the performance begins to drop.In general, best results are obtained for T equal to 5. The exploitation of the hierarchical structure of MeSH does not improve the results; on the contrary, the recall is dropped almost by 5% (see Tables 5 and 6).Therefore, we can conclude that the strategy of replacing terms sharing the same ancestor by their lowest common ancestor does not increase the results.A possible explication for this fact could be that human curators do not to follow the annotation guidelines.
The pattern of the hierarchical scores (LCA-P, LCA-R, and LCA-F1) according to the different parameters is very similar to the behavior of the flat scores.That is, the best hierarchical scores are usually obtained using the lowest number of retrieved relevant documents and the threshold of the score function equal to 8. Likewise in the flat setting, the rule of replacing 3 or more MeSH terms by their lowest common ancestor does not seem to improve the results.

Experiments on BioASQ 2016 Test Dataset
Finally, we ran the best setting (X=10, T=5) on the test datasets.Tables 7 and 8 show the results of this setting exploiting the structure of MeSH and those without it, respectively.As in the development dataset, the performance is better if we do not use the structure of MeSH.
As mentioned above, the MTI system is considered the baseline for the task.Table 9 shows the results achieved by MTI on each test set published in the 2016 BioASQ.The top F1 is 0.5196 and top LCA-F is 0.4807.
Table 10 shows the temporary scores of the best systems in BioASQ Task 4a.The reader can see that the best F1 rates are between 58% and 65%, the best recall between 54% and 60%, and the best precision between 60% and 72%, depending on the batch.Our approach that does not exploit the hierarchical structure of MeSH seems to obtain better performance than the top systems (see Table 8).Our best F1 is 0.70 (batch 1, week 1).On the other hand, if our system uses the hierarchical relations of MeSH to select the best set of terms to label a given article, this obtains an F1 of 0.67, also better than the top F1 (0.61) of the best systems.Therefore, we can conclude that our approach achieves to overcome the top participating systems at the BioASQ 2016.

Principal Findings
Our approach relies on the assumption that similar documents should be classified by similar MeSH terms.Previous works have already applied a k-NN approach for obtaining the set of similar document for a given test document.Our previous work [24] and this study are the first efforts to explore the document similarity using the search engine ElasticSearch instead of k-NN.ElasticSearch is one of the most efficient document-based database.Given a test document, this is represented as a query, which is executed in the search engine, returning the documents more relevant (similar) to the query.Then, our system proposes the MeSH of all these documents as the initial set of MeSH terms for the test document and extends this set with the MeSH terms proposed by the MTI tool.Finally, the system uses a scoring function to determine the best set of MeSH terms for a given article.Those MeSH terms that achieve a higher score than a given threshold are finally selected.The experiments show that the best results are obtained when the number of retrieved relevant documents by ElasticSearch is small (10) and the threshold for the scoring function is equal to 5.

Comparison With Prior Work
Our approach seems to provide better results than the top systems in BioASQ 2016.We note that our results are not immediately comparable with those reported by the BioASQ challenge because we have used a different test dataset.However, we think that it is a reasonable evaluation while no official test datasets are available.Moreover, our development test datasets are available at our webpage [35] to facilitate reproducible research, objective assessment, and further analysis.
In addition, we implement one of the guidelines established by human curators to classify MEDLINE abstracts.To do this, we store the MeSH thesaurus into a graph-based database by using the BlazeGraph tool.The main advantage of using a graph structure is the possibility to use algorithms well known in graph theory (such as depth-first search) to extract subgraphs satisfying a given query.In particular, the graph is visited with the objective to determine whether 3 or more MeSH terms assigned to a given article share the same ancestor.In this case, this lowest common ancestor should substitute them.Contrary to expectations, the system produces worse results if this rule is applied.This may be because human curators do not always follow the recommendations to catalog MEDLINE abstracts.

Limitations
Although the results are better when we do no exploit the hierarchy of MeSH, we think that the graph database version of MeSH is a promising resource that will allow us to implement other guidelines or strategies to select the most appropriate MeSH terms for representing a given article.

Conclusions
Semantic indexing of MEDLINE articles is a manual, laborious task, which could be helped by information technology.
As future steps, we also plan to determine semantic similarity between documents using word embeddings [36] instead of the well-known and commonly used VSM for information retrieval.This approach has already been exploited by Liu et al [21] and Kosmopoulos et al [22].Unlike these works, based on the use of k-NN for obtaining the set of similar documents, our approach will continue using ElasticSearch as search engine and our graph database format of MeSH.We also plan to explore deep learning methods (such as Convolutional Neural Networks) for supporting the automatic classification of MEDLINE abstracts.

Figure 1 .
Figure 1.JSON-based format for the training data in the biomedical semantic indexing and question answering challenge BioASQ task 4a.

Figure 2 .
Figure 2. Medical Subject Headings (MeSH) descriptor data for the term "Lymphoma".

Figure 3 .
Figure 3. MeSH terms for the abstract with Pubmed unique identifier (PMID)=26852276.

Figure 4 .
Figure 4. Architecture of our system.

Figure 5 .
Figure 5.The element wi,d is the frequency of the term i in the document d.

Figure 6 .
Figure 6.Cosine similarity between a document d and a query w, where V(q).V(d) is the dot product of their vectors, and |V(q)| and |V(d)| are their Euclidean norms.

Figure 7 .
Figure 7. Scoring function to rank Medical Subject Headings (MeSH) term.
is a graph database with support for Java APIs (Application Program Interface) and standardized query languages for graphs, such as SPARQL (Protocol and RDF Query Language).An important advantage of BlazeGraph is that it processes large graphs in near-real time by its GPU (Graphical Processor Unit) acceleration achieving better processing time than CPU (Central Processing Unit) technologies or other graph databases based on key values.NLM provides a beta version of the MeSH thesaurus in RDF (Resource Description Framework), a standard format for linked open data.This RDF version of MeSH can be loaded into BlazeGraph using the dotNetRDF API, a free and open-source project for working with RDF, SPARQL, and the Semantic Web.

Figure 8 .
Figure 8. BlazeGraph query to obtain the ancestors of the term "Lymphoma".

Figure 9 .Table 3 .
Figure 9. List of ancestors for the term "Lymphoma" provided by BlazeGraph.

Table 1 .
Main works for biomedical semantic indexing.

Table 2 .
Size of datasets (number of documents).

Table 4 .
Comparison of search times on the Resource Description Framework (RDF) format and the graph database of the MeSH (Medical Subject Headings) thesaurus.

Table 7 .
Results on the biomedical semantic indexing and question answering 2016 test datasets (exploiting the Medical Subject Headings hierarchy.

Table 8 .
Results on the biomedical semantic indexing and question answering 2016 test datasets (without exploiting the Medical Subject Headings hierarchy).

Table 9 .
Baseline results provided by the Medical Text Indexer (MTI) tool.These results were taken from the biomedical semantic indexing and question answering website.
a LCA-P: lowest common ancestor Precision.bLCA-R: lowest common ancestor Recall.c LCA-F: lowest common ancestor F-measure.

Table 10 .
Results of the top systems in biomedical semantic indexing and question answering (BioASQ) task 4a.These scores were taken on December 5 from the BioASQ website.