Boosting biomedical document classification through the use of domain entity recognizers and semantic ontologies for document representation: the case of

The number of published scientific research documents keeps growing at an unprecedented rate, making it increasingly difficult to access practical information within a target domain. This situation is motivating a growing interest in applying text mining techniques to automatically process text resources and structure their content, helping researchers to find information of interest and infer knowledge of practical use. However, the automatic processing of research documents requires large, manually annotated text corpora to develop robust and accurate text mining methods and machine learning models. In this context, semi-automatic extraction techniques based on structured data and state-of-the-art biomedical tools appear to have significant potential to enhance curator productivity and reduce the costs of document curation. Along these lines, this work proposes a semi-automatic machine learning workflow and a NER+Ontology boosting technique for the automatic classification of biomedical literature. The practical relevance of the proposed approach was demonstrated in the curation of 4,115 gluten-related documents extracted from PubMed, where it was contrasted against the word embedding alternative. The experimental results indicate that the proposed NER+Ontology technique is an effective alternative to other state-of-the-art document representation techniques for processing the existing biomedical literature.


Introduction
Recent technological improvements and the reduced cost of applying new scientific techniques are generating a vast amount of information associated with the area of Biomedicine [1]. This increase is accompanied by a corresponding growth in textual knowledge published in the form of technical studies, posts and books (also known as the bibliome), which keeps growing at an unprecedented rate and exceeds the ability of researchers to digest it [2,3]. At the same time, it is becoming increasingly difficult for the general population to counter media misinformation and find reliable sources of information based on the empirical evidence presented in the bibliome [4,5]. In this sense, current bioinformatics challenges involve combining vast amounts of structured, semi-structured, weakly structured and unstructured information to build new sources of knowledge that can be explored by the general public and help researchers to discover knowledge of practical use [6]. In this context, text mining (TM) techniques and machine learning (ML) approaches are being explored as procedures to recognize the relevant parts of the bibliome, enabling the effective search of information, the discovery of hidden interactions between biomedical entities, and assistance in obtaining new knowledge and inferring hypotheses for further biomedical research [7,8]. However, the automatic processing of the bibliome requires large, manually annotated text corpora, or structured biomedical databases, to develop robust and accurate workflows that use TM and ML algorithms to process all data automatically. The relevance of annotated corpora has been widely discussed in diverse manual curation tasks set up to construct gold standards for developing and evaluating derived computational algorithms [9][10][11][12][13].
These efforts have a high impact on delivering new and more robust computational algorithms and biomedical databases, but they also entail high human costs in both time and money [14,15]. The time-consuming nature of manual curation, along with the exponential growth of biomedical literature, strongly limits the number of publications that database curators can review [16,17]. In addition, the limitations of keyword-based search techniques for ranking biomedical articles in a specific problem domain call for robust text mining workflows that further consider the role of the biomedical entities discussed in the bibliome. The objective is to create novel computational methods that enable the discovery of important scientific publications based on the relevance of the biochemical interactions reported. This computational support is of utmost importance for discovering and analyzing the health-related and pharmacological interactions supported by the literature [18]. Consequently, several tasks have promoted the development of novel computational methods to assist in automatic literature classification and reduce manual curation efforts [19,20]. Similarly, different studies propose hybrid curation processes or semi-automatic tasks in which experts revise documents that have been automatically processed [21,22]. These semi-automatic approaches combine predictive knowledge extraction methods with manual expert annotation to reduce the required curation work [23].
In this context, this work proposes a biomedical document description (i.e. vector-space) integrated into a semi-automatic curation workflow to enhance the classification efficiency in a real curation task. To this end, the current approach combines unsupervised and biomedical knowledge extraction methods with lexical normalization procedures to boost different state-of-the-art classifiers and assist in the manual curation of the bibliome. To demonstrate the merit of the proposed approach, several experiments were carried out comparing the proposed document description technique against the well-known word embedding alternative using state-of-the-art classifiers.

Related work
The limitations of keyword-based search techniques for ranking relevant biomedical articles in a specific problem domain call for robust text mining techniques that further consider the biomedical knowledge contained. Traditional approaches explored unsupervised methods, named entity recognition techniques or domain ontologies to recognize the relevance of a document in a specific domain. For example, García et al. [24], Chen et al. [25], and Matos [26] applied unsupervised semantic similarities as bag-of-concepts to improve the automatic classification performance of biomedical studies. On the other hand, Jorge et al. [27] and Luo et al. [28] considered the integration of named entity recognizers to support the classification of the literature. In terms of semantic normalization to enhance vocabulary unification and to classify documents with similar content, Kulmanov et al. [29] and Ding et al. [30] reviewed different approaches that incorporated ontology-based techniques into ML methods to compute word similarity. Along these lines, the authors highlighted the additional inference and reasoning capacity that domain ontologies contribute to the ML area. For their part, Sanchez-Pi et al. [31] demonstrated how an experimental ontology-based classification algorithm obtained better performance in the medical area than a state-of-the-art ML method. Compared to these works, the main contribution of this study lies in proposing a novel document representation vector-space that combines different techniques (i.e. unsupervised text mining algorithms, named entity recognizers and the lexical capacities of a domain ontology) to boost several state-of-the-art classifiers.
Regarding the implementation of semi-automatic workflows to assist in the manual curation of the literature, different works explored the combination of automatic text mining methods with the manual work of experts to improve curation accuracy and efficiency. For example, Kwon et al. [32] described the advantages of these workflows in a real curation scenario and discussed how these approaches reduced the annotation time for a beginner-intermediate level annotator. Similarly, Szostak et al. [33] compared a semi-automated workflow against a manual curation counterpart and showed that semi-automatic approaches reached similar results while reducing curation effort. This idea is also supported by Rinaldi et al. [34], who exemplified how text mining technologies could enhance the productivity of curators. Finally, Winnenburg et al. [35] discussed how text mining methods could be tightly integrated with the manual annotation process to scale up high-quality manual curation. Following the same idea, the current work considers the curation and classification of the biomedical bibliome as a whole and, in contrast to previous approaches, integrates the proposed document vector-space into a semi-automatic curation workflow to improve both computational and manual performance. The proposed semi-automatic workflow therefore guides the manual curators with the extracted knowledge while incrementally reducing the manual work by applying past curator decisions to filter irrelevant information automatically. On the whole, the proposed approach combines an accurate document representation vector-space with a semi-automatic workflow to reach a better performance through the continuous improvement of the applied knowledge inference techniques.
In this sense, the implemented workflow was applied in a real curation task to improve the classification performance of gluten-related documents that could contain relevant biochemical interactions.

Case study
Concerning alimentary proteins, a growing number of studies and health awareness campaigns highlight the association between nutrition and the increase of chronic diseases among the population. In this sense, the number of exploratory research studies testing the elimination of some alimentary proteins in specific diets to treat patients with (or without) an apparent nutritional association has increased considerably in recent years. However, the implications or suggestions of these exploratory experiments may be misunderstood or misused by bad actors on social media platforms, causing misinformation and high monetary and human costs [36][37][38]. One of the diets being tested as an experimental therapy for the treatment of different diseases, not only for handling gluten-related disorders, is the gluten-free diet (GFD) [39,40]. The difficulty in digesting the growing scientific information, the inconclusive scientific evidence in these experimental studies and the influence of social media platforms have caused an increased spread of gluten-related misinformation in recent years [41,42]. This situation induces many people to follow the GFD as a self-prescribed lifestyle, although most of them have not been previously diagnosed with a related disease [43,44]. In relation to the gluten protein, Figure 1 shows the increase in the number of scientific documents discussing this topic, together with the projected publication trend up to the year 2030. Therefore, computational approaches that support the classification and analysis of publications containing relevant biomedical interactions become increasingly important, especially to structure the recognized role of the different biochemical compounds in body processes and diseases.
In this line, recent studies explored the manual curation of the literature in different knowledge areas to generate new databases with relevant health-related interactions that provide practical and structured scientific information to the general population and researchers [45][46][47]. Accordingly, the proposed semi-automatic workflow was applied to the gluten bibliome with the goal of identifying studies that contain relevant health-related knowledge (i.e. documents that support meaningful biochemical interactions) to create a future database that assists researchers to make appropriate decisions and develop new hypotheses supported by the available bibliome.

Materials and methods
This section describes the proposed document description technique, as well as the integrated, iterative, and semi-automatic data curation workflow applied to the gluten-related bibliome. The proposed workflow comprises successive rounds that apply past experiences to assist the document curation incrementally. In other words, manually curated portions of the dataset were incrementally applied to improve and fine-tune the different predictive methods that supported the automatic classification and annotation of the remaining unclassified documents. This scalable and iterative approach established a semi-automatic workflow in which unprocessed documents were automatically filtered and annotated, considering the previous curator decisions. Therefore, the manually inferred knowledge of past iterations was automatically propagated to the subsequent curation rounds in order to enhance the baseline classification performance, improve the knowledge inference methods and reduce human effort.
In this sense, the implemented workflow consists of the following fundamental phases: (i) knowledge retrieval; (ii) document processing; and (iii) document classification. Figure 2 summarizes the different tasks comprising each phase, while the following subsections give details about the strategies applied in every case.

Figure 2: Schema of the semi-automatic curation workflow using the proposed document representation vector-space. The current approach presents a document description technique that combines (i) unsupervised text mining techniques, (ii) named entity recognizers and (iii) domain ontologies to create a document representation (i.e. vector-space) that boosts the automatic curation of the biomedical bibliome. This approach takes advantage of the integrated semi-automatic workflow using past experiences to automatically filter irrelevant documents and to improve the performance of the different processes.

Knowledge retrieval
As illustrated in Figure 2, the objective of this phase was to retrieve gluten-related scientific documents from the PubMed repository and to identify domain ontologies that were most suited to the scope of this work. The output of this phase provides an initial corpus of gluten-related documents to be further curated, plus a lexicon database to be used to annotate the retrieved documents and normalize the domain identified terms.

Data gathering, lexicon and domain word normalization
The National Center for Biotechnology Information (NCBI) Entrez Utilities Web services were used to access the PubMed library, search for potentially relevant documents, and download associated publication details, including the abstracts [48]. For the current study, the most relevant 4,115 abstracts (out of a total of 12,047 documents) were initially retrieved from the PubMed repository to be further processed.
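As an illustration of this retrieval step, the following minimal sketch builds a request URL for the ESearch endpoint of the NCBI Entrez Utilities. The query term "gluten", the page size and the helper name `build_esearch_url` are assumptions made for the example, since the exact query used in this work is not stated here; no network request is issued.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, retmax: int = 100, retstart: int = 0) -> str:
    """Build an ESearch URL that lists PubMed IDs sorted by relevance.

    Hypothetical helper: the query term and paging values are illustrative.
    """
    params = {
        "db": "pubmed",        # search the PubMed database
        "term": term,          # free-text query
        "retmax": retmax,      # number of IDs returned per request
        "retstart": retstart,  # offset of the first returned ID
        "sort": "relevance",   # most relevant records first
        "retmode": "json",     # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example: URL for the first 50 most relevant records matching "gluten".
url = build_esearch_url("gluten", retmax=50)
```

The returned identifiers could then be passed to the EFetch service to download the abstracts and publication details.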
The following domain-related ontologies and dictionaries were initially selected to recognize, extract and normalize the semantic domain concepts present in the selected documents: FoodOn ontology [49], Symptom (SYMP) ontology [50], Medical Subject Headings (MeSH) [51], Chemical Entities of Biological Interest (ChEBI) lexicon [52], Foundational Model of Anatomy (FMA) ontology [53], National Cancer Institute Thesaurus (NCIt) [54], Disease Ontology [55], DrugBank lexicon [56], KEGG [57], PharmGKB [58], the protein catalogue of UniProt [59] and an expert manually curated list of food diets. Overall, a lexicon of 1,000,450 entries was generated in this step to support the later entity recognition task as well as the normalization of different terms with the same domain meaning (e.g. normalizing distinct representations of a concept such as "Colonic hamartomatous polyp" or "Peutz Jeghers polyp" to a single central concept, "Peutz-Jeghers syndrome").
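The term normalization described above can be sketched as a lookup from surface forms to a single canonical concept. This is a minimal illustration, assuming a small in-memory dictionary; the names `LEXICON`, `normalize_key` and `normalize_term` are hypothetical, and the real lexicon is built from the ontologies listed above.

```python
def normalize_key(text: str) -> str:
    """Lowercase, treat hyphens as spaces and collapse repeated whitespace."""
    return " ".join(text.lower().replace("-", " ").split())


# Toy lexicon: every surface form points to one canonical concept.
LEXICON = {
    normalize_key("Colonic hamartomatous polyp"): "Peutz-Jeghers syndrome",
    normalize_key("Peutz Jeghers polyp"): "Peutz-Jeghers syndrome",
    normalize_key("Peutz-Jeghers syndrome"): "Peutz-Jeghers syndrome",
}


def normalize_term(mention: str, lexicon: dict = LEXICON):
    """Return the canonical concept for a mention, or None if it is unknown."""
    return lexicon.get(normalize_key(mention))


concept = normalize_term("Peutz Jeghers polyp")
```

Keying the dictionary on a normalized form lets spelling variants such as hyphenation or capitalization resolve to the same entry without listing each one explicitly.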

Document processing
Once the initial set of documents was retrieved, and the required resources were correctly identified in the previous phase, documents were processed to identify and normalize the different domain-relevant concepts discussed. The output of this phase produces an automatically annotated corpus to be further revised by the experts but also contributes with valuable information suitable for use in the design of the supervised document classifiers.

Initial text pre-processing
Initially, different text pre-processing operations were applied to prepare documents for further exploration. In detail, the following operations were carried out: (i) tokenization (i.e. splitting a text into minimal meaningful elements); (ii) English stop word removal (i.e. elimination of frequent English words like "the" or "by"); (iii) n-gram computation (i.e. considering a contiguous sequence of n tokens as a concept); (iv) part-of-speech (POS) tagging (i.e. identification of the lexical category of each token); (v) small token removal (i.e. those having fewer than three characters); (vi) lowercase conversion of tokens; and (vii) lemmatization (i.e. obtaining the lexeme form of the tokens). This initial document pre-processing was implemented using the well-known Stanford CoreNLP pipeline [60].
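Some of the operations listed above can be sketched in plain Python. The sketch below is a simplified stand-in and not the Stanford CoreNLP pipeline used in this work: POS tagging and lemmatization are omitted, the stop word list is a tiny illustrative sample, and filtering is applied before n-gram computation for brevity.

```python
import re

# Tiny illustrative stop word sample (assumption, not the list used in the study).
STOP_WORDS = {"the", "by", "of", "and", "in", "a", "an", "to", "is", "with"}


def preprocess(text: str, max_n: int = 3) -> list:
    """Tokenize, lowercase, filter tokens and return all 1..max_n grams."""
    # Tokenization: alphanumeric runs, keeping internal hyphens.
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    # Lowercase conversion.
    tokens = [t.lower() for t in tokens]
    # Stop word removal and small token removal (fewer than three characters).
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) >= 3]
    # N-gram computation over the remaining tokens.
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


concepts = preprocess("The gluten-free diet is followed by patients")
```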

Named entity recognition
After the previous procedure, different but complementary NER methods were used to correctly identify mentions of critical entities in the target domain, notably anatomy terms (e.g. duodenal), cell types (e.g. T-cell), compounds (e.g. vitamin D), variety of diets (e.g. vegan), diseases (e.g. osteoporosis), food or food products (e.g. rice), genes (e.g. HLA-DQB1), organisms (e.g. Lactobacillus), proteins (e.g. IgA) and symptoms (e.g. ataxia). These automatically generated annotations were used to index document contents and to help to reduce the cost of their manual annotation. To carry out this operation, an ensemble of the following six state-of-the-art NER taggers was used:
LINNAEUS [61], an open-source stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy. The software can be freely downloaded from http://linnaeus.sourceforge.net/.
ABNER [62], a statistical ML system using linear-chain conditional random fields (CRFs) for automatically tagging genes, proteins, and other entity names in a text. The software is freely available at http://pages.cs.wisc.edu/~bsettles/abner/.
OSCAR4 [63], the open-source chemistry analysis routines (OSCAR) developed since 2002 to recognize chemical names, reaction names, ontology terms, enzymes, chemical prefixes, and adjectives. The software can be freely downloaded from https://bitbucket.org/wwmm/oscar4/wiki/Home.
TMCHEM [64], another open-source alternative for identifying chemical names in biomedical literature, including chemical identifiers, drug brand and trade names, and systematic formats. TmChem achieved the highest performance in the BioCreative IV CHEMDNER task (over 87% F-measure), being accessible at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmchem/.
DNORM [65], software that uses ML to recognize and normalize disease names in a biomedical text. DNorm achieved the best performance in the 2013 ShARe/CLEF shared task on disease normalization in clinical notes, being accessible at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/dnorm/.
In addition, with the goal of complementing the functionality offered by the previous state-of-the-art taggers, an in-house ontology-based NER able to perform dictionary lookups as well as pattern and rule-based recognition was also developed. It is based on an inverted recognition strategy that uses words as patterns to be matched against an ontology-based lexicon [66]. The proposed approach is suitable for the type of texts analyzed in the current work due to their short length compared to the size of the lexicon. Moreover, recognition preference was given to the longest possible n-gram (e.g. "wheat gluten protein" instead of "gluten"), while concepts that may be associated with more than one semantic category were ignored. Additionally, the implemented recognizer accepts perfect matches as well as lexical variations of the terms (i.e. lemmatized entries, abbreviations, and synonym normalization), being updated with the expert recommendations at the end of each curation round to improve its annotation performance (i.e. semantic type of the annotations and false-positive identified concepts).
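The preference for the longest possible n-gram and the exclusion of ambiguous concepts can be illustrated with a greedy dictionary lookup. This is only a sketch under simplifying assumptions: it is not the inverted recognition strategy of [66], the lexicon entries and semantic types are toy examples, and falling back to shorter n-grams when a longer one is ambiguous is a design choice of the example.

```python
def annotate(tokens: list, lexicon: dict, max_n: int = 3) -> list:
    """Greedy longest-match lookup over a token list.

    `lexicon` maps a lowercase phrase to the set of its semantic types.
    Returns (start, end, phrase, semantic_type) tuples with `end` exclusive.
    """
    annotations = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            types = lexicon.get(phrase)
            # Entries with more than one semantic category are ignored.
            if types is not None and len(types) == 1:
                annotations.append((i, i + n, phrase, next(iter(types))))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return annotations


# Toy lexicon (illustrative entries only).
toy_lexicon = {
    "wheat gluten protein": {"protein"},
    "gluten": {"protein"},
    "rice": {"food"},
    "ige": {"protein", "gene"},  # ambiguous, so never annotated
}
found = annotate(["wheat", "gluten", "protein", "and", "rice"], toy_lexicon)
```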

Automatic text annotation
To integrate the previously described taggers and improve the global accuracy of annotations by taking their semantic context into consideration, the following strategy was applied. Initially, all documents were annotated separately with each tagger, selecting the annotations containing more grams. Inconsistencies (i.e. different annotations of the same token with incompatible semantic types) were then solved by prioritizing the expert-derived knowledge incorporated into the in-house ontology-based NER over the successive annotation rounds. Alternatively, if there was no match that could be solved by the ontology-based NER (e.g. a new concept not previously incorporated into the ontology lexicon), the confidence of the different taggers was used. This ontology-based normalization enables additional inference and reasoning steps, such as standardizing different terms with similar meaning (i.e. synonyms) or inferring semantically related terms (i.e. deducing families of concepts). After that, a post-processing operation was executed to enhance the annotation performance based on the semantic context of the identified terms and their recognized semantic category. To carry out this process, a rule-based annotation strategy was applied following the criteria of the expert curators (see Supplementary material 1). As an example, if a given tagger identified the anatomic part "small intestinal" and another tagger identified the symptom "intraepithelial lymphocytosis" in a nearby semantic context, then those annotations were joined into the more complex domain concept "small intestinal intraepithelial lymphocytosis". In addition, the capacity to identify semantic patterns in the context of the annotations allowed the recognition of more complex concepts. For example, if the name of a food or a protein is identified near the word "sensitivity" or "intolerance", then the corresponding annotation is expanded to the associated symptom semantic category (e.g. "egg intolerance").
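The two example rules above can be sketched as a post-processing pass over token-level annotations. This is a hedged illustration only: the actual rules are those of Supplementary material 1, the function name `apply_context_rules` is hypothetical, and "near" is simplified here to strict adjacency.

```python
TRIGGERS = {"sensitivity", "intolerance"}


def apply_context_rules(tokens: list, annotations: list) -> list:
    """Apply two illustrative context rules.

    `annotations` holds (start, end, semantic_type) token spans sorted by
    start, with `end` exclusive. Returns the rewritten list of spans.
    """
    result = []
    for start, end, sem_type in annotations:
        # Rule 1: an anatomy span directly followed by a symptom span is
        # merged into a single, more complex symptom concept.
        if (result and result[-1][2] == "anatomy"
                and sem_type == "symptom" and result[-1][1] == start):
            prev_start = result.pop()[0]
            result.append((prev_start, end, "symptom"))
            continue
        # Rule 2: a food or protein directly followed by a trigger word is
        # expanded to include the trigger and retyped as a symptom.
        if (sem_type in {"food", "protein"} and end < len(tokens)
                and tokens[end].lower() in TRIGGERS):
            result.append((start, end + 1, "symptom"))
            continue
        result.append((start, end, sem_type))
    return result


tokens = ["small", "intestinal", "intraepithelial", "lymphocytosis",
          "and", "egg", "intolerance"]
spans = [(0, 2, "anatomy"), (2, 4, "symptom"), (5, 6, "food")]
merged = apply_context_rules(tokens, spans)
merged_text = [" ".join(tokens[s:e]) for s, e, _ in merged]
```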

Document classification
Following the proposed workflow illustrated in Figure 2, the semi-automatic annotation process concludes with the curation of the documents by experts. In this phase, experts revise the integrated NER annotations automatically generated in the previous phase and classify the content of the documents as relevant or irrelevant. The first round of this iterative task generates an initial gold standard (updated in subsequent annotation rounds), that is used to retrain and update the overall performance of the different methods used in the implemented workflow. To facilitate a better understanding of the actual integration of the manual curation step in the proposed workflow, Figure 3 provides details about the iterative curation process and classification strategy applied in the current work.

Expert manual curation
In order to provide specific support to experts in the initial manual curation of the automatically generated annotations, and also for the later document classification phase, the Markyt annotation tool was used. In detail, the Markyt framework contributed useful information concerning the following aspects of the proposed workflow: (i) it produced valuable insights to update both the ontology-based NER algorithm and the automatic text annotation of successive rounds; and (ii) it made available relevant information about the manual classification of documents to improve the subsequent training and testing of the automated classifier included in the workflow. Figure 4 shows the Markyt framework in action during the annotation of two given documents. As previously commented, in order to obtain an initial gold standard for feeding the proposed workflow and generate a primary document classifier, a first round with 1,000 automatically annotated but unclassified documents was carried out (see Figure 3, top). The expert revision of this initial set of documents (i) enabled the development of the first classifier to assess the relevance of the follow-up documents belonging to the next rounds and (ii) established the basis for the automatic annotation rules. The application of this semi-automatic curation strategy helped to save manual effort by identifying relevant biomedical entities and relevant documents based on previous experiences [67]. Following this iterative approach, inconsistencies, glitches, misses, and interpretation issues were fixed and documented by experts to enhance the global workflow performance (i.e. improving the vocabulary and matching rules supporting both the automatic annotation and the priority given to each available tagger).

Initial document description
With the goal of accurately representing each document for serving as input to the automatic classifier of the proposed workflow, a document attribute matrix was initially formed considering three different but complementary groups of attributes. The first set of attributes is given by a word vector containing the most meaningful concepts of each document when considering the whole dataset (first set of descriptive columns in Table 1). For its generation, the term frequency-inverse document frequency (TF-IDF) measure of unigrams, bigrams, and trigrams was computed. TF-IDF measures the significance of a given word in a dataset regarding the total number of times that it appears in a particular document compared to the overall dataset (Equation 1).
$$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) \quad (1)$$

where $t$ is the evaluated term, $d$ stands for any given document from the dataset, $D$, and $\text{TF}(t, d)$ expresses the ratio of occurrences corresponding to the term, $t$, in a document, $d$, described as follows (Equation 2):

$$\text{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \quad (2)$$

where $f_{t,d}$ is the number of occurrences of the term, $t$, in a document, $d$, and the denominator is the total number of term occurrences in $d$.

The second group of attributes comprises the normalized label of all automatic annotations of each document (second set of descriptive columns in Table 1). In this way, the combination of the output of the different taggers, in conjunction with the distinct domain ontologies of the lexicon, allows annotations with a similar meaning to be computed as the same entry in this attribute group. Finally, the third group of attributes includes the different semantic types annotated in each document (third set of descriptive columns in Table 1).
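The TF-IDF weighting of the first attribute group can be sketched as follows. The term frequency follows the ratio described for Equation 2, whereas the IDF form used here, the logarithm of the number of documents divided by the number of documents containing the term, is the common textbook definition and is an assumption of this example. The toy corpus is purely illustrative.

```python
import math
from collections import Counter


def tf(term: str, doc_tokens: list) -> float:
    """Occurrences of the term divided by the total term count of the document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term: str, corpus: list) -> float:
    """Textbook IDF: log(N / number of documents containing the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing) if n_containing else 0.0


def tf_idf(term: str, doc_tokens: list, corpus: list) -> float:
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Toy corpus of three tokenized documents.
corpus = [["gluten", "diet", "gluten"], ["diet", "rice"], ["rice", "protein"]]
weight = tf_idf("gluten", corpus[0], corpus)
```

In this toy corpus, "gluten" appears twice in a three-token document and in only one of three documents, so it receives a higher weight than "diet", which is spread over two documents.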

Final document representation
Regardless of the specific strategy used to extract attributes from a document to generate its vector representation (such as the one proposed in the previous section), the resulting vector usually comprises thousands of entries. This scenario requires a precise feature selection procedure to identify the most informative features. In this sense, a combination of Information Gain (IG), Chi-square ($\chi^2$), and the stability-correlation measure was applied in this work. In detail, the IG of any feature, $f$, describing a class, $c$, represents the reduction in uncertainty about $c$ when the value of $f$ is known, and can be calculated as follows (Equation 4):

$$IG(f, c) = \sum_{c' \in \{c, \bar{c}\}} \sum_{f' \in \{f, \bar{f}\}} P(f', c') \log \frac{P(f', c')}{P(f')\,P(c')} \quad (4)$$

where $P(c)$ is the fraction of documents belonging to class, $c$, over the total number of documents, $P(f, c)$ is the fraction of documents belonging to class, $c$, that contain a feature, $f$, over the total number of documents, and $P(f)$ represents the fraction of documents that contain a feature, $f$, over the total number of documents.

For its part, the $\chi^2$ measure is commonly used in mathematical statistics to evaluate the independence of any two given variables. In the proposed approach, the independence of a feature, $f$, with respect to a category, $c$, is measured by Equation 5, in which the greater the value of $\chi^2$, the more information the feature, $f$, provides:

$$\chi^2(f, c) = \frac{N \, (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)} \quad (5)$$

where $N$ stands for the total number of documents, $A$ is the frequency of feature, $f$, in the category, $c$, $B$ is the frequency of feature, $f$, in all the existing categories except $c$, $C$ is the frequency with which category, $c$, occurs without containing feature, $f$, and $D$ is the number of times neither $c$ nor $f$ occur.

Finally, the stability-correlation statistic evaluates the importance of any given variable based on its stability and correlation concerning a given class (i.e. variables with a high correlation and high stability achieve an importance nearest to 1). In the current study, the stability measure of a feature, $f$, over a class, $c$, corresponds to the percentage of …

In the proposed approach, a combination of the three feature selection techniques was devised to select the top 300 features achieving the most significant average weight, normalized between 0 and 1.
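The Chi-square statistic and the combination of several feature rankings can be sketched as follows. The contingency counts follow the definitions given for Equation 5. The combination step is an assumption of this example: each measure is min-max normalized to the range 0 to 1 and the normalized weights are averaged, which is one plausible reading of the averaging described above. The function names are hypothetical.

```python
def chi_square(A: int, B: int, C: int, D: int) -> float:
    """Chi-square of a feature and a category from a 2x2 contingency table.

    A: feature present in the category, B: feature present elsewhere,
    C: category without the feature, D: neither feature nor category.
    """
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0


def min_max(scores: dict) -> dict:
    """Rescale the weights of one measure to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {f: 0.0 for f in scores}
    return {f: (s - lo) / (hi - lo) for f, s in scores.items()}


def select_top(score_tables: list, k: int = 300) -> list:
    """Average the normalized weights of several measures and keep the top k."""
    normalized = [min_max(table) for table in score_tables]
    features = normalized[0].keys()
    average = {f: sum(t[f] for t in normalized) / len(normalized)
               for f in features}
    return sorted(average, key=average.get, reverse=True)[:k]


# Toy weights from two measures over three features (illustrative values).
measure_1 = {"a": 1.0, "b": 3.0, "c": 2.0}
measure_2 = {"a": 10.0, "b": 30.0, "c": 5.0}
top_two = select_top([measure_1, measure_2], k=2)
```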

Results and discussion
This section introduces the final gold standard dataset created by applying the suggested semi-automatic workflow to the gluten bibliome case study, giving relevant details about the corpus in terms of both relevance in classification and representativeness for the selected domain. After that, and with the goal of assessing the adequacy of the proposed document representation method (discussed in Sections 3.3.2 and 3.3.3), a well-known baseline (i.e. word embedding) is briefly described together with the experimental setup and the definition of the selected performance measures. The results from six state-of-the-art classifiers are presented and analyzed in detail, evidencing the importance of having an accurate document representation for obtaining positive outcomes in the classification task. The section ends with a lessons-learned discussion that summarizes the key findings resulting from applying the current workflow and the interaction with experts in the studied field.

Gluten-related gold standard
As previously commented, in order to create a comprehensive curated corpus following the proposed semi-automatic workflow, a total of 4,115 PubMed documents related to the gluten bibliome (out of a total of 12,047 entries) were iteratively annotated and manually classified by experts with the help of the Markyt platform. Table 2 describes the distribution of the final gold standard dataset regarding the relevance of documents. The curation process (i.e. semi-automatic annotation and classification) of the documents reflected in Table 2 was carried out by experts over nine rounds, with a variable number of documents to revise in each iteration. This flexibility helped to adapt the revision process to the agenda of the curators and enabled the successive improvement of the different workflow algorithms. In order to provide meaningful insights describing the existing knowledge in the newly generated gold standard dataset, a correlation analysis of the semantic categories was carried out using an association coefficient calculated as follows (Equation 8):

From the annotated categories shown in Table 3 and Table 4, it can be observed that relevant documents related to the gluten protein were mainly focused on the study of proteins, compounds, and foods that produce a body change in terms of diseases and symptoms. In contrast, irrelevant documents were essentially focused on the analysis of foods and organisms relating to compounds, showing less correlation with diseases and symptoms. In addition, with the goal of analyzing the incidence of some representative terms, Figure 5 presents the most frequently mentioned terms along with their associated semantic category.

Experimental setup
In order to evaluate the proposed document representation method (explained in Sections 3.3.2 and 3.3.3) as part of the developed workflow (Figure 2 and Figure 3), we compared its performance against the use of word embedding [68], one of the most popular representation techniques for capturing the context of words in a given set of documents.
To this end, we make use of several state-of-the-art classifiers, including Support Vector Machines (SVM) [69], Random Forest (RF) [70], Generalized Linear Model (GLM) [71], K-nearest neighbor (KNN) [72], Fast Large Margin (FLM) [73], and a Deep Learning (DL) multi-layer feed-forward artificial neural network (ANN) with a 2-2-2 layer configuration [73].

As previously commented, word embedding is a well-reputed unsupervised technique able to generate clusters of words based on their context in a given set of documents [74,75]. This method is generally used in computational linguistics to improve the performance of different ML algorithms [76][77][78] by taking advantage of the normalization of words with a similar meaning [79,80]. Prominent among other alternatives, Mikolov et al. developed word2vec [81] based on the hypothesis that words that occur in similar contexts tend to have similar meanings [82]. Accordingly, word2vec uses a simple neural network to embed words into a continuous vector space. In the particular case of this study, although different word2vec models are available for the biomedical area [83], an in-house word2vec model was trained using the overall gluten-related dataset (i.e. the 12,047 documents initially obtained from PubMed, as commented in Section 3.1.1) with the goal of fairly comparing its results against the proposed approach, called NER+Ontology.

In order to obtain accurate results and a well-founded discussion, the six initially selected classifiers were evaluated following a 10-fold cross-validation strategy [84]. Standard measures of precision, recall, and F-score were calculated to assess the performance of the different classifiers, being computed as follows (Equations 9 to 11):

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (9)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (10)$$

$$\text{F-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (11)$$

where $TP$ is the number of true positives (i.e. relevant documents classified as relevant), $FP$ is the number of false positives (i.e. non-relevant documents classified as relevant), $FN$ is the number of false negatives (i.e. relevant documents classified as irrelevant), and $TN$ stands for the number of true negatives (i.e. non-relevant documents classified as irrelevant).
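As an illustration of the evaluation protocol described above, the following minimal Python sketch computes precision, recall and F-score for the relevant class and generates the index partitions of a k-fold cross-validation. It is not the implementation used in this study: the function names `precision_recall_f1` and `k_fold_indices` are hypothetical, labels are assumed to be encoded as 1 (relevant) and 0 (irrelevant), and shuffling or stratification of the folds is omitted for brevity.

```python
from typing import Iterator, List, Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int],
                        y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F-score for the positive class (1 = relevant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant, predicted relevant
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # irrelevant, predicted relevant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant, predicted irrelevant
    # Guard against empty denominators (assumption: report 0.0 in that case).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


def k_fold_indices(n_samples: int,
                   k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k contiguous folds of near-equal size."""
    # The first (n_samples % k) folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

In a complete experiment, a classifier would be trained on each `train` partition and scored on the corresponding `test` partition, with the per-fold measures averaged afterwards.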

Assessing the importance of document representation: word embedding vs. NER+Ontology
This section analyzes the performance obtained by both alternatives as document representation techniques when used to train different state-of-the-art classifiers to be further used in the proposed semi-automatic workflow.
In detail, the first analysis involved the establishment of a baseline to discover which alternative obtains good performance results at the beginning of the process. To carry out this experiment, a 10-fold cross-validation analysis was executed using the first set of 1,000 manually curated documents, which were the output of the first iteration round of the proposed workflow (see Figure 3, top). Table 5 summarizes the results obtained in terms of precision, recall and F-score. Regarding the baseline performance comparison shown in Table 5, RF has proven to be the best approach to establish a first recommended classifier as the starting point for the semi-automatic curation workflow (F-score = 0.697 with word embedding and F-score = 0.820 with NER+Ontology). From another perspective, comparing both representation techniques, the proposed NER+Ontology algorithm obtained an average F-score of 0.791, whereas the word embedding alternative achieved an average F-score of 0.653. Considering the differences between the F-score values reported in Table 5, the GLM and KNN classifiers obtained the largest advantage from the proposed document representation technique (F-score gains of +0.173 and +0.156, respectively). As an initial conclusion from this first experiment, the NER+Ontology approach improved the performance of all the analyzed classifiers, regardless of their specific type. In order to obtain conclusive results, Table 6 presents the final performance achieved in a subsequent experiment using the final gold standard dataset generated by the proposed workflow (see Figure 2). To properly compare the evolution and stability of the different classifiers, they were trained and tested using the same parameters as those adopted in the baseline evaluation.
Regarding the results summarized in Table 6, RF has proven to be the best classifier when using the word embedding technique for document representation (F-score = 0.838), whereas FLM obtained the best position through the use of the proposed NER+Ontology technique (F-score=0.860). As in the previous experiment, comparing the average F-score of both alternatives, the proposed NER+Ontology algorithm reached a better value (average F-score = 0.851) compared to the word embedding approach (average F-score = 0.825). This experiment makes it possible to conclude that the improvement demonstrated by the proposed NER+Ontology technique was stable, regardless of the size of the corpus or the specific classifier used.
In addition, to complement the study carried out, a grid search optimization was executed to evaluate the best performance that the selected classifiers could reach with the two document representation techniques plus a third combination of both. In this regard, Table 7 summarizes the performance measures obtained under a 10-fold cross-validation scenario over the final gold standard dataset. From the results shown in Table 7 related to the performance of the two initial alternatives, it can be seen that the proposed NER+Ontology algorithm obtained a better average F-score value (0.860) than the one achieved by the word embedding counterpart (0.836). Considering the overall F-score values attained by the different classifiers, the GLM and RF algorithms reached the best classification performance using the proposed document representation technique (F-score = 0.864 and F-score = 0.849, respectively). Furthermore, the NER+Ontology approach always exceeded the performance obtained by the word embedding alternative, as shown by the positive values in the Gain F-score column, with the GLM and DL classifiers benefiting most. From another interesting perspective, Table 7 also shows that the combination of the two document representation techniques (i.e. Word embedding & NER+Ontology) achieved a barely noticeable improvement, and only in some specific cases. This behavior occurs because the unsupervised word embedding technique does not enhance the semantic normalization obtained using a domain ontology or specific domain NERs. In this sense, most domain concepts were already covered by the NER+Ontology domain normalization, and only a marginal set of remaining non-stop words was additionally covered by the standardization provided by the word embedding technique.
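The Gain F-score comparison discussed above can be sketched as the per-classifier difference between the F-score obtained with the proposed representation and the one obtained with the baseline. The short Python example below is only illustrative: the function name `f_score_gain`, the classifier labels and the numeric values are hypothetical and do not reproduce the figures reported in Table 7.

```python
from statistics import mean
from typing import Dict


def f_score_gain(baseline: Dict[str, float],
                 proposed: Dict[str, float]) -> Dict[str, float]:
    """Per-classifier F-score gain of the proposed representation over the baseline."""
    # A positive value means the proposed representation performed better.
    return {name: round(proposed[name] - baseline[name], 3) for name in baseline}


# Hypothetical F-scores, for illustration only (NOT the values of Table 7).
word_embedding = {"clf_a": 0.80, "clf_b": 0.82}
ner_ontology = {"clf_a": 0.85, "clf_b": 0.83}

gains = f_score_gain(word_embedding, ner_ontology)
average_gain = mean(list(gains.values()))
```

Rounding to three decimals mirrors the precision with which the F-scores are reported in this section.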

Learned lessons
With the goal of complementing the study carried out with useful insights, this section discusses certain expertise and some lessons learned from implementing the proposed workflow in terms of different design strategies and the manual curation of biomedical information.
In the first place, although the proposed semi-automatic workflow required more computational time and human effort to process (i.e. manually annotate and classify) all the documents comprising the final gold standard in several iterative rounds, human-in-the-loop (HITL) approaches provide better trade-offs, improving accuracy in the majority of datasets while also improving safety and precision. In this sense, even though all the automatically classified documents (i.e. relevant and irrelevant) were manually revised in order to correctly evaluate the proposed NER+Ontology technique, the following iterations of the implemented workflow will obtain more benefits, since only the manual annotation of relevant documents and of a part of those automatically classified as irrelevant will be required. This strategy saves manual classification effort because it reduces the number of documents to be revised. In this way, in successive iterations of the proposed workflow, only documents automatically labeled as relevant and a random subset of documents classified as irrelevant (e.g. 20%) are going to be curated in order to recalibrate the internal classifier.
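The review strategy described above for successive iterations can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the implemented workflow: the function name `select_for_review`, the label strings "relevant" and "irrelevant", and the fixed random seed are all hypothetical choices made for the example.

```python
import random
from typing import Dict, List


def select_for_review(predictions: Dict[str, str],
                      irrelevant_fraction: float = 0.2,
                      seed: int = 0) -> List[str]:
    """Return the document ids to be manually curated in the next round.

    All documents predicted as relevant are kept, plus a random subset
    (20% by default) of those predicted as irrelevant.
    """
    relevant = sorted(d for d, label in predictions.items() if label == "relevant")
    irrelevant = sorted(d for d, label in predictions.items() if label == "irrelevant")
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    n_sample = round(len(irrelevant) * irrelevant_fraction)
    return relevant + rng.sample(irrelevant, n_sample)
```

Curating the sampled irrelevant documents provides the corrected labels needed to recalibrate the classifier without revising every automatically discarded document.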
In terms of global performance, the proposed semi-automatic workflow achieves a more significant advantage by supporting the overall annotation process. From another perspective, mainly related to the analyzed case study and considering the most common annotated terms per semantic category identified in Figure 6, the most discussed concepts related to the topic of "anatomical parts" had a pre-existing relationship with blood components and gastrointestinal organs due to the nature of the disease. In contrast, the term "bone mineral density" (BMD) stood out due to the high number of documents that relate untreated gluten-related diseases to a greater tendency to suffer from fractures, and a gluten-free diet (GFD) to an improvement in bone density [85]. Consequently, associated with BMD, the terms "Osteoporosis" and "Osteopenia" (both diseases) were also widely mentioned and related to celiac disease (CD) and GFD, in the same way as BMD [86,87].
With regard to cell types, the most discussed concepts were related to T-cells with inflammatory and immune roles, namely, "CD4+", "T-lymphocytes", and "Intraepithelial lymphocytes", derived from the autoimmune nature of gluten-related diseases [88,89]. Similarly, the most mentioned proteins, besides the different protein fractions that constitute gluten, were related to antibodies closely associated with the development of distinct health issues. In this sense, several scientific documents referring to these proteins discussed their role in diagnosing different illnesses or the benefits of the GFD. An example of this case was the relationship of the "IgA" and "IgG" autoantibodies against "tissue transglutaminase" [90,91]. Another protein that deserved attention due to its substantial presence in the bibliome was casein, the collective term for a family of milk proteins [92]. This protein is positively associated with non-gastrointestinal diseases, notably autism spectrum disorder (ASD), and the elimination of casein and gluten from the diet is encouraged to improve ASD behaviors in children who reported some gastrointestinal symptoms [93,94]. Considering the diet semantic category, most of the identified concepts were related to discussing the advantages of a gluten-free diet in treating different diseases and their relation with the most frequently annotated symptoms, like diarrhea, malabsorption, inflammation of the small intestine, and abdominal pain. In this sense, the curated studies evaluated the effect that GFD and other counterpart diets could produce in humans with different health issues [95,96]. The most mentioned compounds were iron, calcium, and other general nutrients, being discussed in a large number of documents signaling the alimentary imbalances produced by GFD [97], how GFD places compounds within the normal range [98], and also the difficulty of their absorption into the digestive system in related diseases [99,100].
Concerning diseases, numerous documents discussed the harmful effect that gluten can have on health issues related to the digestive system, like "Irritable bowel syndrome" [101] and "Type 1 diabetes" [102], but also on other diseases with a less apparent relationship, such as skin diseases [103] and psychological disorders, like "Autism" and "Schizophrenia" [104][105][106].
With respect to the food or food product category, the relevant discussed terms were related to cereals and, above all, oats. In this sense, recent studies have questioned the suitability of oats in the diet of gluten-related patients, with some authors claiming that oats pose no risk to celiac patients [107] and others arguing that a subgroup of celiac patients may be intolerant to oats [108]. As with diet therapy, the organisms mentioned were mainly bacteria applied to reduce or degrade toxic gluten peptides [109,110]. Finally, concerning the genes category, the most identified terms were related to genes responsible for CD development, namely "HLA-DQA1" or "HLA-DQB1". These genes are present in 30-40% of the general population, but only a small percentage of carriers develop gluten-related diseases [111,112]. Similarly, the "myosin IXB" gene was discussed as a potential risk factor in inflammatory conditions, having a role in intestinal barrier functions with evidence of its association with CD, dermatitis herpetiformis, inflammatory bowel disease, systemic lupus erythematosus, and rheumatoid arthritis risk [113]. It is noteworthy that, apart from these genes, few studies related other genes to this subject, which may be because the genes that predispose individuals to gluten-related disorders are very well defined.

Conclusions and future work
This work presents a semi-automatic ML workflow able to reduce the manual curation cost (i.e. annotation and classification) of thousands of documents downloaded from PubMed with the goal of generating a gold standard corpus. As a fundamental part of the proposed approach, the NER+Ontology document description technique is introduced for the automatic classification of the bibliome. The practical relevance of the implemented workflow was demonstrated in the manual curation of 4,115 gluten-related documents, while the proposed NER+Ontology technique showed satisfactory results compared to other state-of-the-art document representation techniques in three different scenarios using well-known classifiers. Future work will focus on applying the best-ranked classifier for the automatic classification of the remaining bibliome, adopting the proposed NER+Ontology description technique as the current baseline for identifying novel relation patterns. Although the experimental results have demonstrated the proper operation of the proposed approaches, it would be interesting to consider the exploitation of the remaining ontology capabilities to obtain better classification results, for example, the semantic hierarchy inference provided by the different ontologies. Finally, a parallel objective is related to structuring and making available the curated knowledge through an online database to assist researchers in making decisions and developing new hypotheses based on the bibliome.

Declaration of Interest Statement
On behalf of all the authors, Dr. Florentino Fdez-Riverola (corresponding author) declares no conflict of interest.

Table 4: Correlation between different annotation types and the number of annotations per semantic category in irrelevant documents. Explicit mentions of general terms in the analyzed domain (e.g. "celiac", "gluten" or "protein") were not considered for the generation of the