Semantic Publication of Agricultural Scientific Literature Using Property Graphs

Abstract: During the last decades, science has undergone significant changes that have caused a large increase in the number of articles published every year. This growth poses a new difficulty for scientists, who must make an extra effort to select the literature relevant to their activity. In this work, we present a pipeline for the generation of scientific literature knowledge graphs in the agriculture domain. The pipeline combines Semantic Web and natural language processing technologies, which make data understandable by computer agents, empowering the development of end-user applications for literature search. This workflow consists of (1) RDF generation, including metadata and contents; (2) semantic annotation of the content; and (3) property graph population, adding domain knowledge from ontologies to the previously generated RDF data describing the articles. This pipeline was applied to a set of 127 agriculture articles, generating a knowledge graph implemented in Neo4j and publicly available on Docker. The potential of our model is illustrated through a series of queries and use cases, which not only cover authors or references but also deal with article similarity and clustering based on semantic annotation, which is facilitated by the inclusion of domain ontologies in the graph.


Introduction
The importance of science in society has significantly increased in the last century. Science has become a big, global, complex, and collaborative effort due to the growth of the resources assigned to science worldwide [1,2]. This has led to an exponential increase in scientific production, with a growth rate between 8% and 9% since the Second World War [3]. According to Scimago Journal & Country Rank (https://www.scimagojr.com/journalrank.php), 2,761,624 articles were published in 2018, and 8,185,591 articles were published from 2015 to 2017. The agricultural and biological sciences constitute around 8% of such production, with 226,310 articles published in 2018 and 659,111 from 2015 to 2017. The availability of a larger base of scientific papers is mostly beneficial for scientists and science, but it comes at the cost of making it harder to find relevant related research. Since a human being cannot usually review all these documents, this task should be delegated to software tools, which can quickly process large amounts of data. By improving the methods used for literature search, scientists could quickly find the most interesting articles, augmenting the efficiency of their research work.
Scientific publications are usually distributed in human-friendly formats such as PDF or HTML, whose content is hardly understood by machines. Therefore, the first requirement for using automatic processes to find relevant articles is that these articles are represented in machine-readable formats. The Semantic Web [4] proposes different technologies which can contribute to this goal, namely, OWL ontologies [5] for representing domain knowledge, RDF [6] for describing resources, and SPARQL [7] as a query language. These technologies allow representing domain knowledge in formal languages that can be understood by machines, enabling the development of knowledge-based automated tasks. Due to these benefits, Semantic Web technologies are being adopted by publishers. David Shotton defines semantic publishing as anything that enhances the meaning of a published journal article, facilitates its automated discovery, enables its linking to semantically related articles, provides access to data within the article in actionable form, or facilitates integration of data between papers [8]. This involves enriching articles with machine-readable metadata, allowing the verification of published information and providing the capacity for automated discovery and summarization. Thus, there are three key problems in semantic publishing: (1) what metadata should be included, (2) how this metadata is modelled, and (3) how to store and access this metadata.
Our previous studies [9,10] addressed problems 1 and 2 by proposing an RDF model for scientific literature description that reuses terms from well-known domain ontologies and vocabularies, such as the Bibliographic Ontology (bibo) (http://bibliontology.com/), Dublin Core Terms (dcterms) [11] and the Friend Of A Friend Ontology (foaf) (www.foaf-project.org/), among others, following the Biotea model. This model covers most publishing information, including authors, references, journals, publishers, the structure and content of the articles, and semantic annotations that enrich these contents. Moreover, we provided tools for RDF generation from literature represented in the Journal Article Tag Suite (JATS) format [12].
In this paper, we address problem 3 by proposing the use of graph databases for storing and querying RDF data generated according to the Biotea model. Thus, we provide a workflow to populate property graphs with not only scientific literature data but also semantic annotations and domain knowledge, which facilitate a better understanding of the data by automatic agents. We exemplify the advantages of our model by means of different use cases, including article similarity, literature search and literature metrics, among others.
The remainder of the paper is structured as follows. The Related Work section describes how the three key problems of semantic publishing have been approached by the scientific community. In the Materials and Methods section, we describe a pipeline which uses Biotea for generating an RDF knowledge graph of scientific literature, together with Neosemantics (https://github.com/Neo4jlabs/neosemantics) for translating RDF into a property graph model implemented in Neo4j. The Results section illustrates the knowledge graph generated and the use of this pipeline on a set of 127 articles in the agriculture domain. The comparison with state-of-the-art solutions and future work is considered in the Discussion. Finally, some conclusions are put forward.

Related Work
State-of-the-art solutions propose different methods to address the three key problems of semantic publishing: (1) what metadata should be included, (2) how this metadata is modelled, and (3) how to store and access this metadata. For instance, SciGraph (https://scigraph.springernature.com), created by Springer Nature, addresses problem 1 by providing the most common metadata for scientific literature, including authors, affiliations, references and publishers, among others. The abstract of each article is the only textual content taken into account. In connection with problem 2, this project uses an RDF model based on well-known vocabularies, such as dcterms, foaf and schema.org [13], in addition to its own ontologies. Although no semantic enrichment is performed, each article is classified according to Field of Research (FoR) codes [14]. Regarding problem 3, the knowledge graph itself is not directly accessible through any query language. Nonetheless, raw data is provided through JSON-LD [15] files. There is also a REST API available, which provides literature search functionality supporting metadata-based filters, including DOI, FoR code, year of publication and journal, among others.
Another related work is The Semantic Lancet project [16]. This work proposes a model compliant with the Semantic Publishing And Referencing Ontologies (SPAR ontologies) [17], which include common bibliographic metadata. In this case, the content of the articles is taken into account for semantic enrichment purposes; in particular, the abstracts are processed by means of the FRED tool [18] to obtain an RDF representation that captures the meaning of the free text. This project proposes a triplestore in order to publish the knowledge graph, making it accessible through a SPARQL endpoint. Moreover, they provide several applications built on top of the knowledge graph, which offer an easy-to-use user interface.
Another recent project is the Open Research Knowledge Graph (ORKG) [19], whose model represents not only bibliographic metadata but also semantic descriptions of scholarly knowledge. The semantic enrichment in this model consists of describing each article by means of four generic concepts that summarize its contribution: process, method, material and data. Concrete annotations found in the text are then linked to these generic concepts. Regarding storage, this project makes use of a labelled property graph, a triplestore and a relational database, where each database engine is used for specific purposes in order to provide different services.
OpenBiodiv [20] is another knowledge graph focused on biodiversity in the scientific literature. It is defined by the OpenBiodiv-O ontology [21], which integrates bibliographic and biodiversity domain ontologies, such as the SPAR ontologies or the Treatment Ontology (https://github.com/plazi/TreatmentOntologies). This model is defined in RDF, and it includes information about article metadata, authors, institutions, geographical locations and taxon names, among others. Although the content of the articles is also included in this model, it is not subject to any semantic enrichment process. This knowledge graph is implemented in a triplestore and is accessible through a SPARQL endpoint.

Materials and Methods
This section describes the pipeline developed for generating a knowledge graph from a set of articles (see Figure 1). In summary, this process consists of three main steps: (1) RDF generation, in which the files with the content of the scientific publications are translated into RDF, including both article metadata and contents; (2) Annotation, where the content of each article is annotated with ontology concepts, generating new RDF triples describing these annotations; and (3) Knowledge graph population, in which the generated RDF files, together with the ontologies used for annotation, are loaded into a knowledge graph.

RDF Generation
This step converts scientific publications into RDF by using well-known bibliographic domain ontologies and vocabularies. Our current method takes as input scientific publications in JATS format, an XML format that contains all the information about an article, including metadata (authors, references, identifiers and journal data) and contents. This translation is performed by applying the Biotea RDFization process (https://github.com/biotea/biotea-rdfization). Two RDF files are generated for each JATS input file: a metadata file, which includes the bibliographic information, and a sections file, which includes the structure and contents.
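To illustrate the kind of information the RDFization step reads from a JATS file, the following sketch extracts the title and author names from a minimal JATS fragment using only the Python standard library. This is not the Biotea code, and the article title and author name are hypothetical; element names follow the JATS tag suite.

```python
import xml.etree.ElementTree as ET

# Minimal JATS fragment (hypothetical content); real JATS files
# contain many more elements (references, identifiers, body text...).
JATS = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>Drought tolerance in rice</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>"""

def extract_metadata(jats_xml):
    """Return the article title and author names from a JATS document."""
    meta = ET.fromstring(jats_xml).find("./front/article-meta")
    title = meta.findtext("./title-group/article-title")
    authors = [
        (n.findtext("given-names"), n.findtext("surname"))
        for n in meta.findall("./contrib-group/contrib[@contrib-type='author']/name")
    ]
    return {"title": title, "authors": authors}

metadata = extract_metadata(JATS)
```

In the actual pipeline, fields like these end up as bibo, dcterms and foaf statements in the generated metadata RDF file.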
The RDF model used for describing the metadata resulting from the conversion is depicted in Figure 2. In this model, the rdf:seeAlso property is used to link articles and journals to their web pages in PubMed and the National Library of Medicine (NLM) catalog (https://www.ncbi.nlm.nih.gov/nlmcatalog/), respectively. Besides, this model reuses properties and terms from:
• Bibliographic Ontology (bibo), to represent bibliographic data such as author lists, literature identifiers, cites, references or abstracts;
• Dublin Core Terms (dcterms) [11], to describe predicates like hasPart, title or publisher;
• Friend Of A Friend Ontology (foaf), to represent information about authors;
• Provenance Ontology (prov) [22], using the predicate generatedAtTime to record the date on which the RDF was generated;
• Wikidata [23], to store the authors' ORCIDs, which were retrieved from the ORCID search web service endpoint (https://pub.orcid.org/v2.0/search) from the article DOI, author given name and author family name, when possible;
• Semanticscience Integrated Ontology (sio) [24], to relate an article to a PMC RDF dataset through the predicate is_data_item_in.
The RDF model for the sections is shown in Figure 3. In this model, the structural elements, namely sections and paragraphs, are described by using the Document Components Ontology (doco) [25], bibo and dcterms. In summary, an article has a list of sections, and each section has a title and a list of paragraphs, and optionally a list of subsections. Moreover, each paragraph has the property rdf:value, which contains its textual content in plain text. Consequently, the structure of the article is represented in RDF, but no semantic processing of its content is performed in this step. For those articles whose full text cannot be exposed, only the structure is represented in RDF.

Annotation
The annotation process consists of enriching the contents of each article by identifying ontological elements in its text. The input is the previously generated RDF sections file. The output is a new RDF file with the annotations found in the content of that article, which are obtained by applying the Biotea annotation process (https://github.com/biotea/biotea-annotation). This process allows selecting an annotator among the different external annotators that have been wrapped in Biotea.
The resulting annotations are described in RDF by means of the Annotation Ontology (AO) [26], the Provenance, Authoring and Versioning ontology (PAV) [27] and dcterms, according to the model shown in Figure 4. In this model, an annotation has a body, which is the annotated text, and a topic, which is the ontology entity associated with the annotated text. Moreover, the annotation has a context that locates the section, the paragraph and the article of the annotation. In addition to this, annotation and article are directly linked through the ao:annotatesResource predicate. Occurrences of the annotation in the article are also described by using the custom predicate tf.
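The shape of the data this step produces can be sketched with a toy lexicon matcher. This is not the AgroPortal annotator (which resolves terms against real ontologies); the lexicon, URIs and paragraph below are illustrative, and each record carries the body, topic and occurrence count (tf) described above.

```python
import re
from collections import Counter

# Hypothetical lexicon mapping surface terms to ontology entities.
LEXICON = {
    "drought": "http://example.org/agro/Drought",
    "rice": "http://example.org/agro/Rice",
}

def annotate_paragraph(text):
    """Return one annotation per matched lexicon term, with its body
    (the annotated text), topic (the ontology entity) and tf (the
    number of occurrences of the term in the paragraph)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [
        {"body": term, "topic": uri, "tf": counts[term]}
        for term, uri in LEXICON.items() if counts[term] > 0
    ]

annotations = annotate_paragraph(
    "Drought stress reduces rice yield; drought tolerance varies among cultivars."
)
```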

Knowledge Graph Population
This step generates a graph database by loading the previously created RDF files, which include article metadata, sections and annotations, together with the ontologies used for annotation, which add domain knowledge to the graph. This has been implemented using the Python script available at https://github.com/biotea/biotea-graph-population, and the graph database used was Neo4j [28], whose model differs from RDF because it implements a property graph model. RDF [6] is a graph model used for describing resources, where resources are identified by a URI and represent entities. These resources are described by a set of triples of the form <subject, predicate, object>, where the subject is linked to the object through the predicate. The subject is always a resource, and the object can be another resource or a literal value. Thus, a list of triples can be seen as a graph where resources are nodes and properties are edges.
In [29], the property graph model is described as follows:

1. A property graph is made up of nodes, relationships and properties.
2. Nodes contain properties [...] in the form of arbitrary key-value pairs. The keys are strings and the values are arbitrary data types.
3. A relationship always has a direction, a label, and a start node and an end node.
4. Like nodes, relationships can also have properties.
In Neo4j property graphs, nodes are also labelled with their categories in order to establish their types. Taking these differences into account, we need to apply a model transformation in order to store RDF data in a property graph database. This transformation was carried out by the Neosemantics plugin for Neo4j. This plugin is able to load RDF data into a Neo4j graph according to two kinds of conversion, depending on the desired level of detail, namely importRDF and importOntology.
On the one hand, the importRDF function was used for uploading the metadata, sections and annotation RDF files. This function ingests all triples in an RDF dataset and uploads them into the graph database by applying the following rules [30]:

Rule 1: Every RDF resource becomes a property graph node. Since the subject in RDF is always a resource, there is at least one property graph node created or merged into an existing node.
Rule 2: If the object is a literal, the predicate and object become, respectively, a property name and value. The property is added to the created or existing node corresponding to the resource.
Rule 3: If the object is a resource, then both the subject and the object are transformed into nodes. A relationship between them is created, holding the predicate's name as relationship type.
Rule 4: If the predicate is rdf:type, the subject becomes a node having a label set to the object's name (which is actually the resource type).

Figure 5 shows the RDF representation of two resources and the property graph model resulting from applying the rules described above. In this case, res:article1 and res:article2 are the only subjects of the triples in the RDF model. Thus, according to Rule 1, they are mapped onto nodes in the property graph model, labelled as Resource. Also, the URIs in the RDF model are stored as a property in the property graph node. Then, the RDF property dct:title is translated to a property inside the nodes of the property graph model, as Rule 2 states. The RDF property bibo:cites is converted into a relationship according to Rule 3. Finally, rdf:type statements are translated to category labels in the property graph model, following Rule 4.
On the other hand, the importOntology function was used for uploading the ontologies used by the annotator. Although ontologies can also be expressed in RDF, they usually contain complex axioms involving anonymous nodes that would add complexity to the graph model if they were uploaded through the importRDF function. The importOntology function only imports the following RDF statements:

1. Named class (category) declarations with both rdfs:Class and owl:Class. The nodes representing named classes are labelled as Class.
2. Explicit class hierarchies defined with rdfs:subClassOf statements. The edges representing rdfs:subClassOf relations are labelled as subClassOf.
3. Property definitions with owl:ObjectProperty, owl:DatatypeProperty and rdf:Property, whose corresponding nodes are labelled as Relation, Property and Property, respectively.
4. Explicit property hierarchies defined with rdfs:subPropertyOf statements. The edges representing this relation are labelled as subPropertyOf.
5. Domain and range information for properties, described with rdfs:domain and rdfs:range statements, whose corresponding edges in the graph are labelled as domain and range.
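The importRDF mapping rules (Rules 1-4) can be sketched in Python as follows. This is an illustrative simplification, not the Neosemantics implementation: literals are wrapped in one-element tuples as an ad hoc convention to distinguish them from resources, and the sample triples mirror the Figure 5 example.

```python
RDF_TYPE = "rdf:type"

def is_resource(term):
    # Ad hoc convention for this sketch: resources are plain strings,
    # literals are wrapped in a one-element tuple.
    return isinstance(term, str)

def triples_to_property_graph(triples):
    """Simplified version of the importRDF rules; returns (nodes, rels)."""
    nodes, rels = {}, []

    def node(uri):
        # Rule 1: every resource becomes a node labelled Resource.
        return nodes.setdefault(uri, {"uri": uri, "labels": {"Resource"}, "props": {}})

    for s, p, o in triples:
        subject = node(s)
        if p == RDF_TYPE:
            subject["labels"].add(o)       # Rule 4: rdf:type object -> node label
        elif is_resource(o):
            node(o)                        # Rule 3: resource object -> node
            rels.append((s, p, o))         #          plus a typed relationship
        else:
            subject["props"][p] = o[0]     # Rule 2: literal -> node property
    return nodes, rels

# Triples corresponding to the Figure 5 example (title text is illustrative)
triples = [
    ("res:article1", "rdf:type", "bibo:AcademicArticle"),
    ("res:article1", "dct:title", ("Article 1 title",)),
    ("res:article2", "rdf:type", "bibo:AcademicArticle"),
    ("res:article1", "bibo:cites", "res:article2"),
]
nodes, rels = triples_to_property_graph(triples)
```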

Results
In this section, we present our main results. First, we explain the experiment carried out. Then, we provide a description of the resulting knowledge graph as well as possible use cases for information retrieval. All the queries associated with the use cases can be found in Appendix B.

Description of the Experiment
We have applied the pipeline described in Section 3 to a set of 127 articles randomly selected from the PubMed Central repository (https://www.ncbi.nlm.nih.gov/pmc/), thus creating a Neo4j-based knowledge graph about agriculture. The 127 JATS files were downloaded through the NCBI OA web service (https://www.ncbi.nlm.nih.gov/pmc/tools/oa-service/). Some of these articles belong to special agriculture collections, while others were found by manually searching for agriculture-related articles (see Appendix A for a complete list). We used the annotator provided by AgroPortal [31] for the annotation step. This annotator uses more than 100 ontologies covering domains like agricultural research, animal science, ecology, nutrition or farming, among others, including ontologies such as the Global Agricultural Concept Scheme (gacs) [32], the Thesaurus for Animal Physiology and Livestock systems (TriPhase) [33], the Environment Ontology (envo) [34], FoodOn [35] and AgroRDF [36]. These ontologies were obtained through the AgroPortal REST API and, when possible, converted to RDF/XML format with the ROBOT tool [37] in order to insert them into the knowledge graph.

The Knowledge Graph
The resulting Neo4j-based knowledge graph has been released as a Docker image, available at https://hub.docker.com/r/fanavarro/neo4j-agro-dataset. This knowledge graph contains information about 127 agriculture-related articles, including metadata, contents and the annotations provided by the AgroPortal annotator, which uses the AgroPortal ontologies to provide a semantic context for the annotations. The graph comprises 2,061,133 nodes and 3,176,199 edges in total (see Query 1).

Basic Metrics of the Graph
The knowledge graph supports queries to extract relevant information about article metadata. For example, in connection with article authorship, the article with the largest number of authors is Food systems for sustainable development: proposals for a profound four-part transformation, with 27 authors. In contrast, the article The economic value of mussel farming for uncertain nutrient removal in the Baltic Sea has the smallest number of authors, with only one (see Query 2). The average number of authors per article in the data set is 6.98 (see Query 3).
Another metric that can be retrieved from the knowledge graph is the number of articles per author (see Query 4). In this case, In-Jung Lee is the most prolific author in the data set, with 4 articles. The average number of publications per author is 1.04 (see Query 5). With respect to bibliographic references, Plant Defense against Insect Herbivores is the article with the largest number of references (394), while Which factor contribute most to empower farmers through e-Agriculture in Bangladesh? has the smallest (10) (see Query 6). The articles in the data set contain on average 80.4 references (see Query 7). Moreover, the most cited reference has the PubMed ID 18444910 (Mechanisms of salinity tolerance), with 9 cites by the articles in the data set (see Query 8). Each reference is cited on average once (see Query 9).
Journals and publishers are also covered by the knowledge graph. The articles in the data set have been published in 29 journals, with PLoS ONE being the journal with the largest number of articles (see Figure 6) (see Query 10). On the other hand, MDPI and Public Library of Science are the publishers with the largest number of articles in the data set, with 34 and 25 articles, respectively (see Figure 7) (see Query 11).

Analysis of the Annotations Set
305,853 annotations were produced for the 127 articles (see Query 12). We can extract the most common annotations produced in the whole data set by querying the knowledge graph (see Query 13). These annotations include general domain terms such as PLANTS, STRESS, RICE, or SOIL, among others (see Table 1). On the other hand, the query for the least frequent annotations (see Query 14) returns that 5742 annotations appear only once in the data set (see Query 15). Table 2 shows an example set of ten such annotations, most of them referring to concrete concepts such as 1-METHYLCYCLOPROPENE or ABSCISIC ACID METABOLISM.

Table 1. The ten most common annotations in the whole data set, their number of occurrences and the number of articles annotated with them, sorted by the number of occurrences.

The specificity of a concept can be measured through its depth in the corresponding ontology. Hence, for each annotation, we can obtain (see Query 16) the number of classes supporting the annotation and the average depth of such classes in the hierarchy of their source ontologies. The most frequent annotations are associated with general concepts, whereas the least frequent ones are more specific. Correspondingly, the most frequent annotations are supported, in general, by ontology classes with a smaller average depth (see Table 3).
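In the populated graph, this depth corresponds to the number of subClassOf hops from a class up to its ontology root. A minimal sketch over an adjacency map (the class names and hierarchy fragment below are hypothetical, not taken from a specific ontology):

```python
def class_depth(cls, parents):
    """Number of subClassOf steps from cls to a root class (a class with
    no superclass). Assumes an acyclic hierarchy; when a class has
    several direct superclasses, the shortest path to a root is used."""
    if not parents.get(cls):
        return 0
    return 1 + min(class_depth(p, parents) for p in parents[cls])

# Hypothetical hierarchy fragment: child -> list of direct superclasses
parents = {
    "material entity": [],
    "chemical substance": ["material entity"],
    "polymer": ["chemical substance"],
    "polyacrylamide": ["polymer"],
}
depth = class_depth("polyacrylamide", parents)
```

A deeper class (such as a specific compound) thus scores as more specific than a class close to the root, which is the intuition behind Table 3.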
We can also extract data about which ontologies are most represented in the annotation set. For example, Table 4 shows the ten largest ontologies together with the number of ontology entities related with annotations, the total number of entities in each ontology, and the percentage of ontology usage in annotation, which results from dividing the two previous values (see Query 17). For this purpose, a Neo4j plugin for URI management was developed (https://github.com/fanavarro/URIManager) to distinguish between the ontology base URI and the local identifier of an entity; for instance, the URI http://purl.obolibrary.org/obo/GO_0003904 is split into http://purl.obolibrary.org/obo/GO (ontology base URI) and 0003904 (entity identifier).

Table 3. Common and unique annotations together with the number of ontology classes supporting them and the average depth of these classes in their ontology. Annotations marked with * are linked with individuals and not with classes; thus, the average class depth could not be calculated.

Table 4. The ten largest ontologies indicating, for each one, how many elements were used for annotating articles, the size of the ontology, and the percentage of the ontology used in annotation.

An advantage of having a knowledge graph supported by ontologies is that the hierarchy of the ontology, defined through subClassOf edges in the graph, can be exploited to improve the quality of query results. For instance, if we are interested in articles about polymers, not only articles annotated with POLYMERS would be retrieved, but also those annotated with types of polymers. This is exemplified in Query 18, where we use the ontology PO2_DG to obtain all the subclasses of the polymers class, and then search for all articles annotated with these classes. The results of this query include articles annotated with POLYMERS, POLYACRYLAMIDE or TEFLON, among others.
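The effect of Query 18, following subClassOf edges transitively and collecting the articles annotated with any descendant class, can be sketched as follows. The hierarchy fragment, class names and article identifiers are illustrative, not taken from the actual data set.

```python
def descendants(cls, children):
    """All classes reachable from cls through inverse subClassOf edges,
    including cls itself."""
    seen, stack = set(), [cls]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(children.get(c, []))
    return seen

# Hypothetical hierarchy fragment and annotation index (class -> articles)
children = {"polymer": ["polyacrylamide", "teflon"]}
annotated_articles = {
    "polymer": {"article1"},
    "polyacrylamide": {"article2"},
    "teflon": {"article2", "article3"},
}

def articles_about(cls):
    """Articles annotated with cls or with any of its subclasses."""
    return set().union(*(annotated_articles.get(c, set())
                         for c in descendants(cls, children)))

result = articles_about("polymer")
```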

Use Case 1: Similarity between Articles
An important feature of scientific literature repositories is the ability to search for articles similar to a given one. We can use the annotations of the articles to compare them, so that a higher number of common annotations between two articles entails a higher degree of similarity. In this work, the similarity between two articles is calculated as shown in Equation 1, where A and B are the sets of annotations associated with the two articles. The implementation of this algorithm is available in the Neo4j Graph Algorithms library (algo) (https://Neo4j.com/docs/graph-algorithms).
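Among the set-similarity functions offered by the Graph Algorithms library is the Jaccard index. Assuming Equation 1 is the Jaccard index over the two annotation sets (an assumption, since the equation is not reproduced here), the computation reduces to the following sketch; the annotation sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical annotation sets of two articles
annotations_a = {"DROUGHT", "STRESS", "TOMATO", "GENES"}
annotations_b = {"DROUGHT", "STRESS", "POTATO", "GENES", "LIGHT"}
similarity = jaccard(annotations_a, annotations_b)  # 3 shared / 6 distinct
```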
The most similar articles are Potato Annexin STANN1 Promotes Drought Tolerance and Mitigates Light Stress in Transgenic Solanum tuberosum L. Plants (pmid = 26172952) and Expression of miR159 Is Altered in Tomato Plants Undergoing Drought Stress (pmid = 31269704), with a similarity of 0.69. In contrast, the lowest similarity (0.14) is reached by the articles Tolerance of Transplastomic Tobacco Plants Overexpressing a Theta Class Glutathione Transferase to Abiotic and Oxidative Stresses (pmid = 30687339) and Folk use of medicinal plants in Karst and Gorjanci, Slovenia (pmid = 28231793) (see Query 19). The clustered heatmap in Figure 8, obtained with R [38] and gplots [39], shows the similarity between all articles and a histogram indicating the similarity distribution (note that not all article labels are included in the figure for readability reasons). We analyzed the underlying clusters resulting from the heatmap calculation. We used the dendextend R package [40] to estimate the optimal number of clusters k by applying the average silhouette width method [41]. Figure 9 shows that the optimal number of clusters is 2, which is in concordance with using the similarity score for the clustering. The largest cluster includes 94 articles that present similarity between them. The second cluster contains 31 articles with low similarity.
Finally, we extracted the most significant annotations for related articles from the clusters. To do so, we generated a data set per cluster, which included the annotations of the articles in that cluster and their frequency in the cluster. The annotations appearing in both data sets were removed to keep only cluster-specific annotations. Finally, the ranking of annotations representing a cluster can be obtained by sorting them by frequency. Figure 10 shows a word cloud, obtained with the wordcloud R package [42], including the forty most frequent annotations in the cluster of similar articles.

Use Case 2: Annotations in the Same Context
The graph model used for annotating the scientific articles provides paragraph-level granularity, that is, annotations are associated with a concrete paragraph of an article. Thus, it is possible to retrieve annotations that usually appear in the same context (paragraph). This is important because it enables interesting features such as identifying related concepts, which could be used for disambiguation.
For instance, the annotation SIGMA FACTORS appears in 3 paragraphs. Two of these paragraphs are part of the article with PubMed ID 28948393, while the other one appears in the article 25244327 (see Query 20). All these paragraphs have the following annotations in common: FACTORS, GENES, TRANSCRIPTION, SIGMA FACTORS and EXPRESSION (see Query 21). In this case, we have a context for SIGMA FACTORS formed by genetics-related concepts. Thus, all these annotations can be classified in the same domain. This could help to disambiguate the annotation EXPRESSION, for example, which refers to gene expression in this context but could act as a synonym of formula or equation in other contexts. This is exemplified in a paragraph taken from the article with PubMed ID 30486880 (see Query 22).
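The common-annotation computation behind Query 21 amounts to intersecting the annotation sets of the paragraphs in which the target annotation occurs. A sketch with illustrative data (the paragraph contents are hypothetical, loosely based on the SIGMA FACTORS example):

```python
def shared_annotations(paragraph_annotation_sets):
    """Return the annotations present in every given paragraph."""
    sets = iter(paragraph_annotation_sets)
    common = set(next(sets))  # start from the first paragraph's annotations
    for s in sets:
        common &= s           # keep only annotations seen so far
    return common

# Hypothetical annotation sets of the paragraphs mentioning SIGMA FACTORS
contexts = [
    {"SIGMA FACTORS", "FACTORS", "GENES", "TRANSCRIPTION", "EXPRESSION", "RNA"},
    {"SIGMA FACTORS", "FACTORS", "GENES", "TRANSCRIPTION", "EXPRESSION", "STRESS"},
    {"SIGMA FACTORS", "FACTORS", "GENES", "TRANSCRIPTION", "EXPRESSION"},
]
common = shared_annotations(contexts)
```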

Discussion
In this paper, we have presented a workflow to generate knowledge graphs about scientific literature, providing an example in the agronomy domain. Our model includes information about articles (authors, cites, sections, contents, etc.) and the semantic annotations on the contents, together with the ontologies from which these annotations come. Next, we discuss the most relevant aspects of our work.

Progress beyond the State-of-the-Art
The inclusion of the contents of the articles is a differentiating point between our model and some of the state-of-the-art solutions discussed in the Related Work section. The models in Nature SciGraph and The Semantic Lancet project [16] only represent and process the abstracts of the articles. Although abstracts should summarize the whole article, other sections usually contain relevant information to classify the article or extract features from it. For this reason, our model is in line with OpenBiodiv, which provides the full text content organized by sections. Our model also divides the content of a section into paragraphs, providing a finer granularity. Moreover, having a model containing the whole structure of the articles allows for the development of visualization applications, since all the information is available.
Another differentiating factor in our model is the semantic enrichment. Nature SciGraph and OpenBiodiv only provide keywords for each article. In the case of Nature SciGraph, these keywords are taken from FoR, a standard classification of fields of research, while OpenBiodiv does not use any controlled vocabulary. For its part, The Semantic Lancet project generates an RDF representation of the abstracts, which involves the generation of RDF triples describing the meaning of this textual content. This translation from natural language to RDF is complex, and it probably requires human supervision to avoid errors in the process. This could make the resulting RDF graphs difficult to compare because of a lack of homogeneity between them. This problem is not present in the model provided by the Open Research Knowledge Graph (ORKG) [19]. In this case, they train a neural network for named entity recognition, which classifies entities as process, method, material and data, thus providing a homogeneous model that enables comparison between articles. Nonetheless, classifying instances directly under these four generic classes results in a lack of domain knowledge that prevents the semantic understanding of the meaning of each instance. In our case, the semantic enrichment process consists of annotating the articles with entities from domain ontologies. Thus, these ontologies, which are also included in the knowledge graph, provide the semantic context for each annotation.

New Possibilities for Literature Management
This semantic context enables interesting features for literature management. For instance, we can find articles annotated with a set of related concepts rather than with a single concept. This was exemplified in Section 4.2, where we showed a query for retrieving all articles annotated with "polymer", together with any concept taxonomically linked to polymer. Also, since the ontologies are part of the knowledge graph, we can measure the specificity of the annotations by checking the depth of the corresponding concepts in their ontology (see Section 4.2). This metric could help to determine which ontology concepts are specific enough to find relevant articles in a domain, and which are too general; the latter could be a sign of heterogeneity in the set of articles. Furthermore, we also measured ontology usage in the annotation process by checking how many terms of each ontology were used in the annotations. This serves to calculate the degree of usage of each ontology. This kind of metric could help to (1) identify ontologies that do not represent the data set, or (2) improve ontologies that represent the data set to a certain degree but could be complemented by adding new elements.
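Both of these taxonomy-aware operations can be sketched in Cypher. The subclass and annotation relationship names below are assumptions for illustration, not necessarily the exact names of our schema:

```cypher
// Sketch: articles annotated with "polymer" or any of its subconcepts.
MATCH (c:owl__Class {rdfs__label: "polymer"})<-[:rdfs__subClassOf*0..]-(sub:owl__Class)
MATCH (sub)<-[:refersTo]-(ann:aot__ExactQualifier)-[:annotates]->(a:Article)
RETURN DISTINCT a.title AS article

// Sketch: the depth of a concept, i.e., the length of its subclass path
// to a root class, as a proxy for annotation specificity.
MATCH path = (c:owl__Class {rdfs__label: "polymer"})-[:rdfs__subClassOf*]->(root:owl__Class)
WHERE NOT (root)-[:rdfs__subClassOf]->()
RETURN length(path) AS depth
```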
Several use cases are shown in Section 4.
In these examples, we showed that our model contains enough information to compute similarity between articles (see Section 4.3) and to extract annotation contexts (see Section 4.4). Regarding similarity, we were able to implement, using only Neo4j features, a similarity measure between all articles based on their common annotations. We then used these similarity values as features to compute a clustering in R, which resulted in two clusters: related and unrelated articles. Nonetheless, this is only a usage example; we could instead have computed the clustering using the frequency of each annotation in each article as a feature, which would have produced different clusters of articles, each representing a set of topics. In addition, our model can retrieve all annotations in a paragraph, which helps to determine which annotations usually appear in the same context. This could be useful for the disambiguation of annotations, as shown in Section 4.4.
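For illustration, an annotation-overlap similarity of this kind, here a Jaccard index, can be expressed in plain Cypher; the annotates relationship name is an assumption:

```cypher
// Sketch: Jaccard similarity between article pairs based on shared annotations.
MATCH (a1:Article)<-[:annotates]-(ann)-[:annotates]->(a2:Article)
WHERE id(a1) < id(a2)
WITH a1, a2, count(DISTINCT ann) AS common
MATCH (a1)<-[:annotates]-(x)
WITH a1, a2, common, count(DISTINCT x) AS n1
MATCH (a2)<-[:annotates]-(y)
WITH a1, a2, common, n1, count(DISTINCT y) AS n2
RETURN a1.title AS article1, a2.title AS article2,
       toFloat(common) / (n1 + n2 - common) AS jaccard
ORDER BY jaccard DESC
```

The resulting pairwise scores can then be exported and used as a feature matrix for clustering, as we did in R.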

Property Graphs for RDF Representation
In this work we have explored the use of Neo4j for storing and querying our data, as opposed to native RDF triplestores such as Virtuoso [43] or GraphDB (https://www.ontotext.com/products/graphdb/). The main difficulty was how to describe RDF blank nodes and OWL anonymous classes, which are used for defining axioms in domain ontologies. There is no direct way to represent such axioms in a property graph; consequently, we avoided them when loading the ontologies into the graph, as described in Section 3.3. Other important limitations of Neo4j with respect to triplestores are the impossibility of having more than one graph in a single instance and the lack of mechanisms to query several databases at once, a feature offered by RDF triplestores implementing SPARQL 1.1 federated queries. On the other hand, Cypher, the Neo4j query language, is easier to use than SPARQL, and its functionality can be easily extended by means of plugins. Furthermore, there is an emerging community supporting Neo4j with new functions and visualization tools.
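As an example of this plugin-based extensibility, the queries in Appendix B already rely on the APOC plugin. For instance, apoc.coll.intersection computes the distinct elements shared by two lists, a function that plain Cypher does not provide:

```cypher
// apoc.coll.intersection (from the APOC plugin) returns the distinct
// elements common to both input lists (here, 2 and 3).
RETURN apoc.coll.intersection([1, 2, 3], [2, 3, 4]) AS common
```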

Limitations and Further Work
Finally, this work has some limitations that we plan to address in the future. The current graph contains 127 articles, which is a small number in comparison to the number of articles published. This is due to the fact that the input to our method consists of files in JATS format, which is not currently provided by many journals. PubMed is one of the few resources providing content in JATS format, so we used it as our content source. However, PubMed is oriented mainly towards biomedicine, so the number of papers related to agronomy is low. Therefore, we will extend our tools to accept input data in other formats, which will enable the inclusion of data from more sources. Another aspect that could be improved is the annotation context. At the moment, our model relates annotations at the paragraph level; however, these paragraphs are sometimes long and may contain annotations that are semantically distant from each other. We could improve this by relating annotations at the sentence level, which would provide a more accurate context. Finally, the current Neo4j knowledge graph can be difficult to query and exploit for non-experienced users. Thus, we plan to fill the gap between the knowledge graph and the final user by developing a web query interface.

Conclusions
In this paper, we have presented a workflow to generate scientific literature knowledge graphs in the agriculture domain by using an emerging technology, Neo4j. Our method takes advantage of the possibilities offered by Semantic Web and natural language processing technologies to generate semantically rich graphs that contain the data, their semantic annotations, and domain knowledge. We have shown how this combination facilitates the execution of knowledge-based complex data exploitation tasks. Hence, we believe that our work provides an example of how Semantic Web technologies can be effectively applied to the management of agricultural data and knowledge.

Funding: This research was partially funded by the Ministerio de Economía, Industria y Competitividad, Gobierno de España and the European Regional Development Fund through grant number TIN2017-85949-C2-1-R, and by the Fundación Séneca, Agencia Regional de Ciencia y Tecnología through grant number 20963/PI/18.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Articles in the Data Set
The data set used in this paper contains the following articles (pmcid):

Appendix B. Cypher Queries
This appendix contains all the Cypher queries used in the article for extracting information from the knowledge graph. These are referenced from the corresponding sections.

WITH annotation.aoc__body AS text,
     sum(annotation.biotea__tf) AS `annotation count`,
     count(article) AS `article count`
WHERE `annotation count` = 1
RETURN text, `annotation count`, `article count`

Query 15: Cypher query to retrieve the number of the annotations that only appear once in the data set

MATCH (annotation:aot__ExactQualifier)
WITH annotation.aoc__body AS text,
     sum(annotation.biotea__tf) AS `annotation count`
WHERE `annotation count` = 1
RETURN count(text)

Query 16: Cypher query to retrieve annotations, the number of classes that are supporting each annotation, and the average depth at which these classes are located in the ontology hierarchy

... annotation.aoc__body) AS annotationsPerParagraph
WITH collect(annotationsPerParagraph) AS annotations
WITH reduce(commonAnnotations = head(annotations),
            annotation IN tail(annotations) |
            apoc.coll.intersection(commonAnnotations, annotation)) AS commonAnnotations
RETURN commonAnnotations