A Unified approach to publish semantic annotations of agricultural documents as knowledge graphs

Abstract

alert bulletins published in France and written in French. The named entities to be recognised are crop names and phenological development stages. Crop names are defined in the French Crop Usage (FCU) thesaurus, while development stages are formalized using the BBCH-based Plant Phenological Description Ontology (PPDO). For the three corpora, named entities were automatically extracted using natural language processing tools. We present an approach that relies on the formalization of a semantic data model based on common and well-adopted linked open vocabularies such as the Web Annotation Ontology (OA) and the Provenance Ontology (PROV). The model describes the named entities and their links to vocabularies. It was slightly adapted to the annotations of each corpus. The model was populated using a mapping-based transformation pipeline implemented with the Morph-xR2RML tool, which takes CSV files as input. The development of the proposed model was initiated by the formulation of motivating scenarios by domain experts for each corpus, which led to a set of competency questions. Those questions provided requirements on the semantic model. The relevance of the semantic model was validated by implementing the competency questions as SPARQL queries over the constructed RDF knowledge graphs.

Introduction
Knowledge Graphs (KGs) are multi-relational graphs of relations between well-defined and uniquely identifiable entities created from heterogeneous data sources. They make it possible to develop data management platforms compliant with the FAIR (Findability, Accessibility, Interoperability and Reuse) principles [WDA+16], which refer to best practice guidelines: resources must be accessible, understood, exchanged and reused by machines. A typical approach towards publishing FAIR knowledge graphs is to rely on Linked Data (LD) principles and Semantic Web technologies (SWT). Indeed, RDF and other Semantic Web standards are designed to promote interoperability and linking between datasets. Additionally, to ensure that RDF datasets are truly interoperable and reusable within a specific field, they must rely on domain-specific and open vocabularies, models, and data category registries capturing the shared theoretical foundations and terminology used by a community of domain experts [KCD+22]. Constructing knowledge graphs from unstructured data makes it possible to bridge the gap between the huge amount of heterogeneous data and the ability to easily explore and query it to address various use cases. This paper focuses on building knowledge graphs from textual data sources by extracting relevant domain-specific entities and organizing them in a structured and meaningful annotation. This approach can be beneficial in making sense of large, complex and heterogeneous datasets, linking related information and knowledge, and providing intuitive ways to access and explore domain data and knowledge, adhering to the FAIR principles. We present a methodology for constructing domain-specific knowledge graphs using SWT, which involves the re-use of shared RDF-based vocabularies and models. Although the proposed methodology can be applied to various domains and can support a wide range of use cases, in this paper we focus on building knowledge graphs representing semantic annotations of textual documents in the field of agriculture. We consider three different text corpora and we demonstrate how we leverage Natural Language Processing (NLP) techniques to first extract different types of named entities and then structure and integrate them into KGs using the same data model. Two corpora are collections of scientific publications on rice and wheat functional genomics, retrieved from the PubMed repository. These publications investigate the gene-phenotype link for varietal selection, or more precisely the identification of gene markers involved in the expression of a given phenotype, for selection assistance [NBV+14]. A third corpus gathers technical documents called Plant Health Bulletins (PHBs), which are agricultural alert bulletins published in France. Although the KGs that we built from these corpora are intended to serve different needs, we have adopted the same methodology and core data model. The first step of our methodology leverages NLP pipelines to perform the tasks of Named Entity Recognition (NER) and Linking (NEL) [MRHLA20]. A semantic data model has been defined to capture how NE annotations produced by NLP pipelines should be structured and described in each KG. We initiated the data model definition with a set of competency questions, together with motivating examples.
We were inspired by the agile SAMOD methodology [Per16], which in turn is based on the early work of Uschold & Gruninger [GF95,UG96]. The SAMOD process is initiated by a motivating scenario which leads to a set of competency questions (CQs) that provide requirements on the knowledge graphs to be created. All CQs show that the KGs should be uniformly queried by experts to highlight the context of NE co-occurrence in the original texts in order to reveal hidden interactions between NEs. In the continuity of earlier works [MGA+20], we propose to rely on the Open Annotation Ontology (OA) [SCY17] to describe, structure and integrate NE annotations and their occurrence contexts in texts. Domain-specific vocabularies are also reused to describe bibliographic information of PubMed publications and provenance information for the PHB corpus. The resulting model was automatically populated using a mapping-based transformation pipeline implemented with the Morph-xR2RML tool [MDFM15].
The paper is structured as follows. In Section 2, we present the materials of our research work, which consist of three text corpora and the semantic resources used to annotate those corpora. In Section 3.1, we present the CQs of each case study. We describe the proposed semantic model in further detail in Section 3.2. In Section 4, we present the validation results of the case studies. Finally, in Section 5, we discuss the results and synthesize the lessons learned before concluding in Section 6.

Materials
In this section, we present the different materials used in this research work to build our KGs. In Section 2.1, we first present the text corpora that were processed using different NLP pipelines in order to generate semantic annotations of domain named entities. For the linking task, NLP processes rely on a set of semantic resources presented in Section 2.2.
In this corpus, the PubMed identifier, title and abstract of each publication are provided. In several cases, the abstract of a publication is organised in different sub-sections. We used the AlvisNLP pipeline to identify gene, trait, phenotype, taxon and variety entities mentioned in the title and the abstract of publications, and the relationships between wheat varieties and phenotypes. In total, 88,880 mentions of 4,318 distinct named entities were recognized and linked to existing entities of semantic resources (presented in Section 2.2). Figure 4 illustrates an example of a PubMed publication where three types of NEs are recognised: we distinguish between NE mentions that refer to genes (e.g., Sr2, Lr27, Lr34), phenotypes (e.g., leaf rust resistance, resistance to stem rust, powdery mildew resistance) and taxa (e.g., wheat).
The phenotype mentions are linked to classes in the Wheat Trait Ontology (WTO) and taxon mentions are linked to NCBI taxonomy classes.

PubMed Corpus on Rice Genomics
The DIADE research group collected 17,058 scientific articles from the Oryzabase database [KY06], which provides manually checked PubMed entries related to rice genomics.

Semantic Resources
In the agriculture domain, an increasing number of semantic resources (ontologies, thesauri) have been developed and published using Semantic Web technologies [DFMd19].

French Crop Usage Thesaurus
The French Crop Usage (FCU) thesaurus organises plants based on their roles in agriculture, or in other words, agricultural plant uses. The thesaurus hierarchy has two main branches, as shown in Figure 3. The branch named Multiusages contains all the cultivated plants that have several uses in agriculture. For example, carotte (carrot) may be used as a vegetable or as fodder.
The branch Usages_plantes_cultivees organises cultivated plants according to their uses and represents crop categories. FCU stores only the French vernacular names of plants. The FCU thesaurus is formalized using SKOS and is used in this work to extract and link crop names in the PHB corpus.

Resources for other mentions
Resources for genes, markers, and wheat and rice varieties published as LOD datasets are very limited and, in most cases, they are either incomplete, do not come from authoritative organizations, or do not provide unique identifiers. Among the available semantic resources for genes and markers, the UniProt Knowledge base [The22] (UniProtKB) is a central hub for a collection of functional information on proteins, with accurate, consistent and rich annotation, which is accessible through a SPARQL endpoint. However, multiple UniProtKB identifiers can be retrieved for the same genomic entity, which makes it impossible to link named entities using this resource. Therefore, in order to recognize and normalize genes and markers from texts, AlvisNLP and HunFLAIR both rely on curated domain lexicons or dictionaries combined with patterns. For wheat genes and markers, a curated list of gene names from the GrainGenes [YBC+22] database was created. For rice genes, the Oryzabase [KY06] database was used and integrated into the AgroLD Knowledge Graph [VTNH+18], which capitalises on genomic data about plant species of high interest for the plant science community (among which rice and wheat) to provide functional information on genes and their relationships across species.
AgroLD is available through a SPARQL endpoint.
For wheat variety recognition and normalization, a curated list was created by combining two sources: (1) the Plant variety catalogues, databases & information systems and (2) the Official Catalogue of Species and Varieties of Cultivated Crops. To be compliant with LOD principles, we created a URI to identify each distinct entry in the different created lexicons.

Competency Questions
In this section, we present a set of Competency Questions (CQs) stemming from requirements expressed by experts and collected in the context of the D2KAB project. CQs are natural language questions illustrating the typical knowledge that scientists would require a data source to provide. A common way of validating a KG is to provide the formalisation of CQs, for a given case study, as SPARQL queries using the KG model. In the following, we present the CQs for our case studies. Their formalisation in SPARQL is presented in Section 4.

Rice functional genomics
Among the most commonly investigated research questions in functional genomics are those related to genotype-phenotype relationships. However, these relationships are not always straightforward to identify. The rice and wheat genomes differ considerably in terms of size and complexity. The rice genome is relatively small compared to the wheat genome, comprising around 430 million base pairs in its haploid form. The wheat genome is much larger and more complex, with a hexaploid genome made up of three sets of chromosomes and comprising around 17 billion base pairs. Research in both rice and wheat functional genomics has already led to several important advances, such as the development of varieties with enhanced disease resistance and improved nutritional content.
Hence, exploiting the ever-growing scientific literature (Figure 1) could help scientists to discover hidden interactions between entities of interest for functional genomics by examining their co-occurrence in scientific publications. Thus, structuring and integrating genomic NEs extracted from scientific publications and annotated based on relevant knowledge from external [...] Section 2, we present here a subset of CQs that ultimately consists of a set of research questions.
CQ1: Which genes are mentioned proximal to a specific trait (e.g., resistance to Fusarium head blight, resistance to leaf rust)?
CQ1 expresses the importance of supporting experts in identifying genetic entities recognized proximal to a particular trait in order to establish possible links between gene expressions and traits.For instance, CQ1 addresses the need of scientists to discover genes involved in the resistance to biotic or abiotic factors in both wheat and rice species based on scientific literature.
As illustrated in Figure 4, several gene names proximal to a given wheat trait are recognized in PubMed scientific publications.Thus, genes that are involved in resistance to a specific disease can be discovered on the basis of their presence next to specific disease-resistance traits within scientific literature.Taking the example of the resistance to leaf rust, there are several genes that have been identified as being associated with resistance to rust in wheat crops.The Lr34 gene is a major gene for resistance to leaf rust.This type of knowledge is valuable in wheat breeding programs to develop varieties that are resistant to a specific disease.
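The co-occurrence counting that underlies CQ1 can be illustrated with a minimal Python sketch. It assumes hypothetical annotation records of the form (text part identifier, entity type, entity label); the identifiers and sample mentions below are illustrative and are not taken from the corpora. It only shows the counting logic and is not the implementation used to answer the CQ, which is expressed in SPARQL in Section 4.

```python
from collections import Counter

# Hypothetical annotation records: (text part id, entity type, entity label).
# These sample values are assumptions made for illustration only.
ANNOTATIONS = [
    ("doc1#abstract", "gene", "Lr34"),
    ("doc1#abstract", "trait", "resistance to leaf rust"),
    ("doc1#abstract", "gene", "Sr2"),
    ("doc2#title", "gene", "Lr34"),
    ("doc2#title", "trait", "resistance to leaf rust"),
    ("doc3#abstract", "gene", "Lr10"),
    ("doc3#abstract", "trait", "powdery mildew resistance"),
]


def genes_near_trait(annotations, trait):
    """Count, per gene, the text parts in which it co-occurs with `trait`."""
    # Text parts that mention the trait at least once.
    parts_with_trait = {
        part for part, kind, label in annotations
        if kind == "trait" and label == trait
    }
    # Distinct (part, gene) pairs, so a repeated mention counts once per part.
    pairs = {
        (part, label) for part, kind, label in annotations
        if kind == "gene" and part in parts_with_trait
    }
    return Counter(gene for _, gene in pairs)


counts = genes_near_trait(ANNOTATIONS, "resistance to leaf rust")
for gene, n in counts.most_common():
    print(gene, n)
```

Here "proximal" is approximated by "mentioned in the same text part" (title or abstract); a stricter notion, such as the same sentence, would only change how the part identifier is defined.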
CQ2: Which genetic markers appear proximal to a specific gene, and which genes are mentioned proximal to a particular phenotype in publications dating from after 2010?
CQ2 is designed for the PubMed corpus on wheat genomics, since genetic marker NEs are recognized only in this corpus. A genetic marker discriminates the different alleles of a gene based on the polymorphism of the DNA sequence. Thus, genetic markers are used to select wheat varieties with a trait or phenotype of agronomic interest [NBV+14]. For instance, in the case of resistance to the stripe rust disease in wheat, the gene Yr65 is often mentioned in the literature along with this phenotype. Furthermore, markers such as Xgdm33, Xgwm11, Xgwm18, and Xgwm413 are mentioned in the same context as this gene. As the techniques for genetic marker selection have evolved over time and some of them have become obsolete, the expert can also refine the query to select only publications which appeared after 2010. The knowledge graph should therefore contain publication metadata such as the publication year, the list of authors, or the number of incoming citations.
CQ2-bis: Which chemical compounds are cited in scientific publications proximal to gene names, and which genes are in turn mentioned proximal to a particular phenotype?
Chemical compounds are often involved in metabolic processes which are controlled by genes. In scientific literature, associations between chemical compounds and genes can reveal interesting phenotypes. CQ2-bis emphasizes that biologists can search for rice genes that co-occur with a specific phenotype and a chemical compound.
CQ3: Which scientific publications mention gene names that appear proximal to a specific wheat or rice variety name and a trait from a specific given class of traits (e.g., all traits related to fungal pathogen resistance)?
CQ3 reflects the need for experts to conduct a systematic literature review of publications that mention specific genes cited in the literature proximal to certain traits (from a specific family of traits) as well as wheat or rice varieties. The results of this query should include a list of articles mentioning, in their abstracts or titles, gene names, a wheat or rice variety and a set of traits known, for instance, to be involved in pathogen resistance. For instance, a scientist may be interested in resistance to fungal pathogens, which cause massive and destructive losses to crops. Thus, the study of resistance mechanisms is essential to fully understand the interactions between pathogens across crop varieties. Based on the WTO structure, which classifies traits in different taxonomies, it is possible to conduct this study for all traits belonging to the sub-hierarchy of the fungal pathogen resistance class. This CQ highlights the importance of incorporating domain knowledge formally represented in ontological and terminological resources (e.g., WTO).
CQ4: Which gene names are cited in the literature proximal to a specific taxon (and optionally to one or more of its descendants)?
CQ4 reflects the need to search for gene mentions cited proximal to different taxon mentions. We may initiate the query by focusing on a single taxon mention and expand it dynamically by including each descendant taxon. The query thus first shows the results of a search on a single specific taxon mention, and then generates a more comprehensive set of results.
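The expansion from a taxon to its descendants described for CQ4 can be sketched as follows in Python. The child-to-parent links and the (gene, taxon) pairs are hypothetical sample data, not an extract of the NCBI taxonomy or of the corpora. In SPARQL, the same expansion is commonly written with a transitive property path, but the sketch below makes the traversal explicit.

```python
# Hypothetical child -> parent links standing in for a taxonomy fragment.
PARENT = {
    "Triticum aestivum": "Triticum",
    "Triticum": "Poaceae",
    "Oryza sativa": "Oryza",
    "Oryza": "Poaceae",
}

# Hypothetical (gene, taxon) co-occurrences, for illustration only.
MENTIONS = [
    ("Lr34", "Triticum aestivum"),
    ("Sr2", "Triticum"),
    ("Xa21", "Oryza sativa"),
]


def descendants(taxon, parent=PARENT):
    """Return `taxon` together with all its direct and indirect descendants."""
    # Invert the child -> parent map into parent -> children.
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    # Depth-first traversal starting from the requested taxon.
    found, stack = {taxon}, [taxon]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def genes_for(taxon, with_descendants=False):
    """Genes mentioned with `taxon`, optionally including descendant taxa."""
    taxa = descendants(taxon) if with_descendants else {taxon}
    return sorted(gene for gene, t in MENTIONS if t in taxa)


print(genes_for("Triticum"))
print(genes_for("Triticum", with_descendants=True))
```

The first call corresponds to the initial search on a single taxon, and the second to the more comprehensive result set obtained once descendants are included.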
CQ5: What are orthologous genes in rice and wheat genomes?
It has been demonstrated that some fungal and bacterial disease pathogens affect both rice and wheat. Wheat and rice disease resistance has been studied for a large panel of pathogens, including rusts, smuts, Fusarium head blight, Septoria leaf blotch, tan spot, and powdery mildew, that cause the most serious losses. The goal is to search for wheat and rice genes co-occurring in the literature with the same pathogen taxon (or a more specific taxon). This makes it possible to identify orthologous genes in wheat and rice.

Competency Questions for Agronomic Studies
Climate change has an impact on agricultural practices. Agronomists would like to study PHBs in order to analyse the distribution of crops on the French territory and provide answers to several questions such as: have farmers changed the crops they produce over time?
In addition, agronomists would like to study how climate change has affected crop growth. Indeed, due to variable weather conditions, crop development can differ from year to year. One of the mid-term objectives of the D2KAB research is to create a timeline of the development stages of crops in specific regions of France. As each PHB is related to a unique region of France and has a publication date, by extracting the crop names from the PHB text, it is possible to identify the crops to which the PHB relates, and thus determine which crops were grown in that region at a given time. Extracting crop development stages from the PHB text also allows experts to understand the development stage that the crop had reached at the time of publication in that region. Considering the PHB corpus presented in Section 2, we present here a subset of CQs that ultimately consists of a set of research questions.
CQ6: Which crop names are mentioned in the title of a specific PHB?
This CQ aims to identify the topic of the PHB, i.e., the main crop or crop category mentioned in one of the titles of the PHB. A PHB title may mention one or several crop names. In the example of Figure 2, the term Viticulture is mentioned in the title, thus the PHB is about a single crop, which is cultivated grapevine.
CQ6-bis: How many times is a crop name mentioned in a specific PHB?
The goal is also to identify the main crops or crop categories that represent the topic of a PHB, thus reinforcing the previous CQ. One way to identify the main crop topic of a PHB is to count how many times a crop name appears in the text of a PHB. The crop mention may appear in any type of section (e.g., footer).
CQ7: What are the most cultivated crops in a given French region and do they change over time?
The scientific objective is to find which crops are cultivated in a specific region of France. Based on the PHB corpus, it is possible to retrieve the subset of bulletins concerning a French region for a specific period of time. Then, we can compute the main crop topics of these bulletins.
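The counting behind CQ6-bis and the aggregation behind CQ7 can be sketched in a few lines of Python. The bulletin identifiers, region, years and crop mentions below are hypothetical sample values, and taking the most frequent crop as the main topic is a simplifying assumption made for this illustration.

```python
from collections import Counter, defaultdict

# Hypothetical bulletins: id -> (region, year, crop mentions found in the text).
BULLETINS = {
    "phb-1": ("Bourgogne", 2019, ["vigne", "vigne", "colza"]),
    "phb-2": ("Bourgogne", 2019, ["colza", "vigne", "vigne"]),
    "phb-3": ("Bourgogne", 2020, ["tournesol", "tournesol", "vigne"]),
}


def main_crop(mentions):
    """Most frequently mentioned crop in one bulletin (CQ6-bis), or None."""
    counts = Counter(mentions)
    return counts.most_common(1)[0][0] if counts else None


def main_crops_by_year(bulletins, region):
    """Per year, count how often each crop is the main topic in a region (CQ7)."""
    per_year = defaultdict(Counter)
    for reg, year, mentions in bulletins.values():
        crop = main_crop(mentions)
        if reg == region and crop is not None:
            per_year[year][crop] += 1
    return dict(per_year)


RESULT = main_crops_by_year(BULLETINS, "Bourgogne")
for year in sorted(RESULT):
    print(year, RESULT[year].most_common())
```

Comparing the per-year counters then gives a first indication of whether the main crops of a region change over time.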
CQ8: Which development stages are mentioned proximal to a crop name in a specific PHB?
The goal is to identify when a development stage of a specific crop is observed in a specific region, given that the crop name and the development stage may be in separate paragraphs of a PHB.
In the example of Figure 2, the crop name is mentioned in the title and several development stages are mentioned in the first paragraph of the middle column.
CQ9: What is the scientific literature available on the crop to which a PHB bulletin relates?
The goal is to identify which new research publications are related to a crop cultivated in the French territory, for example to search for new crop varieties that are resistant to drought or high temperatures.

Proposed Semantic Model
We reuse a set of state-of-the-art vocabularies to design a unified semantic model that captures [...] 1 shows the main vocabularies used to describe named entity annotations as well as documents in both corpora.

OA-Based Model for Text Annotations with Named Entities
The Web Annotation Data Model is an ontology for structuring and sharing any type of annotation in an interoperable format. According to the OA documentation 17, "an annotation is considered to be a set of connected resources (each identified by a URI), typically comprising a body and a target where the body is somehow about the target". In the core OA data model, an annotation a_i is an instance of the oa:Annotation class such that:
• The oa:hasTarget property identifies the part of the document that is being annotated with annotation a_i. The target is a resource selection with a selector, i.e., [...]
[...] in Figure 5). One annotation has as body a class from the NCBI taxonomy. One annotation has as body a gene entity URI that we have created locally in our graph (green area in Figure 5). All mentions are identified by two selectors:
• an instance of oa:TextQuoteSelector is used to specify the text of the mention.
• an instance of oa:TextPositionSelector is used to specify the start and end offset positions of the mention.
[...]
• an instance of oa:XPathSelector is used to express that the mention is found in the first section of the HTML element of type H1, which is a first-level title.
• an instance of oa:TextQuoteSelector is used to specify the text of the mention, its prefix and suffix.
The oa:motivatedBy property identifies the motivation of the annotation creation. Since all annotations a_i aim to identify an entity e in the text of the document, the object of this property is oa:identifying, which is an instance of class oa:Motivation.
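To make the structure of such an annotation concrete, the following Python sketch builds one annotation using the JSON-LD serialization of the W3C Web Annotation Data Model, with a text quote selector and a text position selector. The example sentence, the URIs under http://example.org/ and the helper name make_annotation are assumptions introduced for illustration; they do not reproduce the URIs or the tooling of the knowledge graphs described here.

```python
import json


def make_annotation(anno_uri, text, mention, source_uri, body_uri, context=10):
    """Build a Web Annotation (JSON-LD) for the first occurrence of `mention`."""
    start = text.find(mention)
    if start < 0:
        raise ValueError(f"mention {mention!r} not found in text")
    end = start + len(mention)  # end offset is exclusive, as in the OA model
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_uri,
        "type": "Annotation",
        "motivation": "identifying",
        "body": body_uri,  # the entity e, defined in a domain vocabulary
        "target": {
            "source": source_uri,  # the document or one of its sub-parts
            "selector": [
                {
                    "type": "TextQuoteSelector",
                    "exact": text[start:end],  # surface form of the mention
                    "prefix": text[max(0, start - context):start],
                    "suffix": text[end:end + context],
                },
                {"type": "TextPositionSelector", "start": start, "end": end},
            ],
        },
    }


# Hypothetical sentence and URIs, used only to exercise the sketch.
TEXT = "Lr34 confers resistance to leaf rust in wheat."
ANNOTATION = make_annotation(
    "http://example.org/annotation/1",
    TEXT,
    "leaf rust",
    "http://example.org/doc/1#abstract",
    "http://example.org/trait/leaf-rust-resistance",
)
print(json.dumps(ANNOTATION, indent=2))
```

An XPath selector for HTML documents would be added to the same selector list in an analogous way.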

Bibliographic Metadata
To describe bibliographic metadata of documents in the corpora, we have reused the following vocabularies: Dublin Core 21, FRBR-aligned Bibliographic Ontology (FaBiO) [PS12], Bibliographic Ontology (BIBO) 22, Dolce Ultra Light (DUL) [PG08], PROV Ontology (PROV) [...]. An information object represents the generic information about a document such as its publication date, the corpus to which it belongs, its associated French region, and its description. An information object has several realizations, represented by using property dul:isRealizedBy.
21 https://www.dublincore.org/specifications/dublin-core/dcmi-terms/
22 https://github.com/structureddynamics/Bibliographic-Ontology-BIBO
Figure 8 presents the graph annotating the bulletin of Figure 2.
The bulletin is an instance of class d2kab_inrae:Bulletin which specializes dul:InformationObject.
The Dublin Core properties dct:date, dct:description and dct:spatial link the bulletin to its publication date, its description accessible on the download page, and its French region extracted from the download web site and identified by its Wikidata URI. Each sub-corpus is represented by an instance of prov:Collection. A bulletin belongs to at least one sub-corpus, which is represented by using property prov:hasMember. The files are instances of classes schema:DigitalDocument and dct:Text. The Dublin Core properties dce:language and dce:format link a file to its language and format. The OA property oa:textDirection links a file to its text direction. The property schema:url links a file to the URL where it is actually accessible. The property schema:isBasedOn links a file to the URL from which it was previously downloaded. The property frac:total links the HTML file to its total number of tokens.

Provenance Metadata
The FCU thesaurus has evolved over time and several versions of it exist. Moreover, different NLP processes based on different versions of FCU were tested on the PHB corpus to generate annotations. The PROV ontology is used to store the provenance information of the annotations. An example of provenance metadata of a PHB annotation is shown in Figure 9.

Data Transformation Pipeline
To create the three KG, we adopted a materialization approach in which mapping rules are defined to transform raw annotations generated by NLP pipelines into RDF.We relied on the xR2RML mapping language [MDFM15] to define the mapping rules that formally de-

KG Pipeline for Scientific Literature on Wheat and Rice Genomics
The xR2RML mapping rules defined to materialize the knowledge graph describing the scientific literature on wheat genomics, WheatGenomicsSLKG, are available in the project's GitHub

KG Pipeline for Plant Health Bulletins
Since the beginning of their publication, PHBs have been made freely available in PDF format on the websites of the Regional Chambers of Agriculture or the websites of the regional agency of the French Ministry of Food and Agriculture (DRAAF). Therefore, PHBs are disseminated on different websites (one per region).
A web crawler is periodically run over the DRAAF websites to look for new PHBs, which are downloaded while some information is extracted (download date, download URL, local filename and web path) [RDA+21]. These data are transformed into RDF using Python scripts. The downloaded PDF files are transformed into HTML using the pdf2blocks conversion tool. AlvisNLP pipelines are used to extract NEs from these HTML files. Finally, the CSV output files are transformed using specific xR2RML mapping rules. All the elements of this workflow are available in the project GitLab repository. The resulting RDF KG can be queried at http://ontology.inrae.fr/bsv/sparql. A list of SPARQL queries which implement CQs (CQ6 to CQ10) is presented in Section 4.2.

Implementation of Competency Questions
In order to demonstrate how the three KGs serve several expert needs, we implemented the competency questions presented in Section 3.1 in SPARQL. All the presented CQs could be translated into SPARQL queries and their results were judged valid, which shows that our semantic model fulfills the requirements. It is worth noticing that, although different classes of NEs are recognized in the different corpora, the structure of the SPARQL queries is quite similar.

SPARQL Queries Implementing CQs on the Wheat and Rice PubMed Corpora
The SPARQL queries implementing CQs on the PubMed corpora of wheat and rice functional genomics can be executed at the SPARQL endpoint. The queries and excerpts of the obtained results are provided as part of the supplementary materials. A Jupyter Notebook of these SPARQL queries is available in our GitHub repository.
CQ1: The SPARQL query presented in Listing 1 implements CQ1 and allows scientists to retrieve genes that are mentioned proximal to the resistance to leaf rust trait in the WheatGenomicsSLKG graph. The query returns all genes mentioned proximal to the WTO concept (wto:0000483) that corresponds to the aforementioned trait and counts the number of times that a gene and the trait are recognized in the same context. The results of this query confirm that Lr34 is the most cited gene in the literature. The Lr10, Lr26 and Lr24 genes also appear as the most frequent genes.
[...] ORDER BY DESC(?NbOcc)
Listing 1: SPARQL query implementing CQ1 and retrieving the most cited genes mentioned proximal to the "resistance to Leaf rust" trait in WheatGenomicsSLKG.
SELECT (GROUP_CONCAT(distinct ?GeneName; SEPARATOR="-") as ?genes) (GROUP_CONCAT(distinct ?marker; SEPARATOR="-") as ?markers) ?paper ?year ?WTOtrait [...]
[...] question. Starting with the Magnaporthe oryzae URI or its upper parent, we retrieve wheat and rice genes co-occurring with these taxa. This may be the indication of orthologous genes in [...] via SERVICE clauses. This illustrates the fact that publishing KGs according to FAIR design [...] the source resource.
To represent the annotations of the PubMed and PHB corpora, we used three different selectors proposed by OA to precisely locate an entity mention in a text: oa:TextQuoteSelector, oa:TextPositionSelector, and oa:XPathSelector. The first two selectors are used in both corpora. In particular, the oa:TextPositionSelector selector locates the entity mention within this part of the document. oa:XPathSelector is used for the PHB corpus to identify the HTML element in which an entity mention was recognized. Note that the document structures are not represented in the same way in the two corpora. In the PubMed corpus, each document is identified by a URI which corresponds to its entry in the PubMed repository. Entity mentions may be extracted from the title, abstract or abstract's sub-parts, each having its own URI and being linked by the frbr:partOf property (Figure 7). In the PHB corpus, an HTML document is identified by a URI that is the source of the annotation.
This work illustrates that OA selectors are sufficiently broad to locate entity mentions in a variety of situations.
The coverage scope of our model could be extended by identifying complementary vocabularies.In particular, information such as frequency, lexical and morphological characteristics that can be drawn from text corpora and semantic resources can be added to our model by

Conclusions and Future Works
In this paper, we presented the results of a research work to support scientists with methodologies to standardize and share domain knowledge extracted from texts according to FAIR [...] As future work, we want to investigate the extraction of relations between recognized NEs. Several competency questions involve retrieving entities that appear in the same context within texts. They could be refined by specifying the relationship between the entities. As relation extraction strongly depends on the accuracy of the entity recognition task, an important first step for the PHB knowledge graph will be to improve the accuracy of NE annotations, which is not always satisfactory so far.
On another note, we plan to construct and publish richly annotated gold standard datasets based on the three corpora.This will require considerable efforts of domain experts to define guidelines and samples for NE annotations.Gold-standard datasets can be used to train and evaluate natural language processing (NLP) approaches, such as specialized NE recognition, relation extraction and entity linking.

Figure 1: Evolution of the number of research works on wheat and rice genomics from 1951 to 2022

Figure 3: An extract from the FCU thesaurus. Visualisation generated by the SKOS Play tool.

Figure 4: Example of NE recognition and linking in a PubMed publication.

[...] i.e., a resource that identifies the part of text m_e that mentions a recognized entity e. In this work, we use different types of selectors: oa:TextQuoteSelector, oa:TextPositionSelector and oa:XPathSelector, to indicate respectively the NE's mention m_e (i.e., surface form), the start and end offset positions of m_e in the text, and/or the XPath expression to retrieve m_e in the HTML structure of the text. The oa:hasSource property is used to specify the URI of the source where the selector is applied, the source being either the URI of the document or one of its sub-parts.
• The oa:hasBody property identifies the entity e defined in a domain vocabulary such as WTO, the NCBI taxonomy, PPDO or the FCU thesaurus.

Figure 5 illustrates an example RDF graph that captures five instances of NE annotations recognised in the title and the abstract of a publication in the PubMed corpus 18. The title and the abstract of the publication are identified by a URI and become the source of the target selector. Three annotations have as body a SKOS concept in the WTO resource (yellow area [...]
17 https://www.w3.org/TR/annotation-model/
18 https://pubmed.ncbi.nlm.nih.gov/21573954/

Figure 5: Example of NE annotations identified in a PubMed Publication's section (title and abstract) and represented in WheatGenomicsSLKG based on OA ontology.

Figure 6 represents three annotations extracted from the PHB 19 presented in Figure 2. One annotation 20 identifies the mention Viticulture located in the main title of the PHB. Three types of selectors are used:

Figure 6: Example of annotations for a PHB. One annotation identifies a crop, and the other one identifies two development stages.

Figure 7: Example RDF graph describing bibliographic metadata of a scientific publication in the PubMed corpus

Figure 8: Example RDF graph describing metadata of a bulletin in the PHB corpus.
Each instance of oa:Annotation is linked to the instance of prov:Activity that generated it. Properties prov:startedAtTime and prov:endedAtTime link the activity to the dates when the NLP pipeline was applied to the sub-corpus. Property prov:used indicates the version of the FCU thesaurus. Regarding the activity, property prov:qualifiedAssociation indicates the NLP pipeline plan and the NLP software used to run the plan. Regarding the plan, properties prov:wasAttributedTo, prov:generatedAtTime and schema:url indicate its author, its creation date and its git repository.
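
As a rough sketch of the provenance pattern described above, the following Python function emits the corresponding statements as N-Triples lines. It rests on several assumptions: the text does not name the property linking an annotation to its activity, so prov:wasGeneratedBy is assumed here; the function name, the dates and all `example.org` URIs are hypothetical; and the qualified association to the plan and software is left out for brevity.

```python
PROV = "http://www.w3.org/ns/prov#"
XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def provenance_ntriples(annotation, activity, started, ended, thesaurus):
    """Return N-Triples lines describing how one annotation was produced.

    `started` and `ended` are ISO 8601 date-time strings; the other
    arguments are URIs given as plain strings.
    """
    date_time = f"^^<{XSD}dateTime>"
    triples = [
        # Assumption: prov:wasGeneratedBy links the annotation to its activity.
        (f"<{annotation}>", f"<{PROV}wasGeneratedBy>", f"<{activity}>"),
        (f"<{activity}>", RDF_TYPE, f"<{PROV}Activity>"),
        (f"<{activity}>", f"<{PROV}startedAtTime>", f'"{started}"{date_time}'),
        (f"<{activity}>", f"<{PROV}endedAtTime>", f'"{ended}"{date_time}'),
        # prov:used points to the thesaurus version used by the pipeline.
        (f"<{activity}>", f"<{PROV}used>", f"<{thesaurus}>"),
    ]
    return [" ".join(triple) + " ." for triple in triples]


# Hypothetical URIs and dates, for illustration only.
lines = provenance_ntriples(
    annotation="http://example.org/annotation/1",
    activity="http://example.org/activity/ner-run-1",
    started="2022-01-10T09:00:00Z",
    ended="2022-01-10T09:30:00Z",
    thesaurus="http://example.org/fcu/version/1",
)
print("\n".join(lines))
```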

Figure 9: Example RDF graph describing provenance metadata of a bulletin in the PHB corpus.
describe the relationship between raw annotations, initially stored in CSV files, and classes and properties from the semantic model. The translation was carried out by Morph-xR2RML 23, an implementation of xR2RML for MongoDB databases. Each mapping rule defines a Triples Map (rr:TriplesMap), which expresses a generic pattern for generating RDF triples in accordance with the model proposed in Section 3.2.
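
The idea behind such a mapping rule, filling URI templates from the columns of each input row, can be imitated in plain Python. This sketch does not use xR2RML or Morph-xR2RML and does not reproduce the project's actual mappings: the CSV layout, the column names, the templates and the `example.org` URIs are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical CSV layout: one raw annotation per row.
RAW = """doc_id,ann_id,concept_uri,mention
21573954,1,http://example.org/wto/0000001,stripe rust
21573954,2,http://example.org/wto/0000002,yellow rust
"""

# Hypothetical URI templates; placeholders are column names of the CSV file.
ANNOTATION_TEMPLATE = "http://example.org/annotation/{doc_id}_{ann_id}"
TARGET_TEMPLATE = "http://example.org/annotation/{doc_id}_{ann_id}/target"
SOURCE_TEMPLATE = "http://example.org/document/{doc_id}#abstract"


def rows_to_triples(csv_text):
    """Yield (subject, predicate, object) triples for every CSV row."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        annotation = ANNOTATION_TEMPLATE.format(**row)
        target = TARGET_TEMPLATE.format(**row)
        yield (annotation, "rdf:type", "oa:Annotation")
        yield (annotation, "oa:hasBody", row["concept_uri"])
        yield (annotation, "oa:hasTarget", target)
        yield (target, "oa:hasSource", SOURCE_TEMPLATE.format(**row))


triples = list(rows_to_triples(RAW))
for triple in triples:
    print(triple)
```

Each row yields the same pattern of four triples, which is the sense in which a mapping rule expresses a generic pattern rather than individual statements.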

CQ2 and CQ2-bis: The SPARQL query presented in Listing 2 implements CQ2 and identifies genetic markers and genes mentioned proximal to a specific wheat trait in scientific publications. This query returns a list of scientific publications from the WheatGenomicsSLKG graph that mention several genetic marker and gene entities proximal to the resistance to Stripe Rust trait. In turn, the SPARQL query presented in Listing 3 implements CQ2-bis and allows scientists to retrieve gene names that are mentioned proximal to the GDP chemical component in the scientific literature on rice genomics.
CQ3: The SPARQL query presented in Listing 4 implements CQ3 and allows scientists to retrieve publications in which genes are mentioned proximal to wheat varieties and traits from a specific class, e.g., all wheat traits related to resistance to fungal pathogens. Based on the WTO structure, which classifies traits in different taxonomies, the query retrieves all traits belonging to the sub-hierarchy of the fungal pathogen resistance class (lines 20-45).

CQ4: A first implementation of this CQ is presented in Listing 5, which performs a search of gene mentions cited proximal to a specific taxon identified by a class in the NCBITaxon

reusing terms from the FrAC vocabulary, an OntoLex module for Frequency, Attestation and Corpus information (FrAC) [CIdD + 20]. FrAC allows modelling the absolute frequency of a given lexical entity (how many times an element of a semantic resource is recognized in the text, e.g., as shown in CQ5 bis), which is a recurrent need. Figure 10 presents an RDF graph that links six instances of class oa:Annotation to one instance of the frac:CorpusFrequency class using the prov:wasDerivedFrom property. Those annotations are about the FCU concept "grapevine" recognized in the PHB example of Figure 2. Considering that a document is a corpus of one element, the instance of class frac:CorpusFrequency represents the number of times that the given concept (grapevine) is recognized in the PHB document.
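
The frequency computation described above amounts to grouping annotations by document and concept. The sketch below illustrates that step only; it does not emit FrAC triples, and the function name and all `example.org` URIs are hypothetical. The annotation list kept for each group corresponds to what the prov:wasDerivedFrom links would point to.

```python
from collections import defaultdict


def corpus_frequencies(annotations):
    """Group annotations by (document, concept) and count them.

    `annotations` is an iterable of (annotation_uri, document_uri,
    concept_uri) tuples. Each document is treated as a corpus of one
    element, so the count is the frequency of the concept in that document.
    """
    groups = defaultdict(list)
    for annotation_uri, document_uri, concept_uri in annotations:
        groups[(document_uri, concept_uri)].append(annotation_uri)
    return {
        key: {"count": len(uris), "derived_from": uris}
        for key, uris in groups.items()
    }


# Hypothetical data: six annotations of one concept in one bulletin.
doc = "http://example.org/phb/1"
grapevine = "http://example.org/fcu/grapevine"
annotations = [
    (f"http://example.org/annotation/{i}", doc, grapevine) for i in range(1, 7)
]
frequencies = corpus_frequencies(annotations)
print(frequencies[(doc, grapevine)]["count"])
```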

Figure 10: Frequency modelling in the PHB graph based on the FrAC, OA and PROV vocabularies.
Most of the articles in the corpus date from the past two decades, which is consistent with the sharp increase in research interest in rice genetics as depicted in Figure 1.
We used the HunFLAIR NER tagger [WSM + 20], which we combined with NLTK, spaCy and other Python libraries, to extract four types of named entities in the title or the abstract of the articles: we distinguish between NE mentions that refer to genes (e.g., OsMAPK2 or MOC1), species (e.g., Oryza sativa or Magnaporthe oryzae), chemicals (e.g., gibberellic acid or nitrogen) and diseases or phenotypes (e.g., diseases blast or Sheath blight disease). In total, 351,003 mentions of 63,591 distinct NEs were identified from PubMed abstracts and titles. When possible, these NEs were linked with existing semantic resources as explained in Section 2.2.

2.1.3 PlantHealthBulletin Corpus

In France, the Grenelle Environment and Ecophyto 2018 program strengthened national surveillance networks of crops and agricultural practices. Plant Health Bulletins are one of the modalities established by these surveillance networks in all regions and French overseas departments. A Plant Health Bulletin (PHB) is an agricultural alert document, both technical and regulatory in nature, written in French under the responsibility of a regional epidemiological surveillance committee. A PHB gathers information about the health status of crops. It reports observations of crop development and pest attacks, and analyses pest risk in the whole area.
the context of occurrence of several types of NE annotations in documents.The core part of this model leverages and extends the model previously proposed in [MGA + 20].It is based on the W3C Web Annotation Ontology (OA) [SCY17] to structure, describe and integrate NEs extracted from both corpora, and eight complementary vocabularies to describe documents and NEs.Table

Table 1: Reused vocabularies

Bibliographic Metadata for PHB Technical Documents

In the PHB corpus, a bulletin has two digital realizations: a PDF file and an HTML file. Therefore, to model the bibliographic information, we have reused an ontology design pattern from the DUL ontology called dul:InformationObject [PG08]

Table 2: Templates of URIs for the resources in WheatGenomicsSLKG and RiceGenomicsSLKG

directory 24. Similar mapping rules have been defined to materialize the knowledge graph describing the scientific literature on rice genomics, RiceGenomicsSLKG; they are available in the project's GitHub directory 25. Table 2 illustrates the templates used to generate significant URIs for different types of resources in WheatGenomicsSLKG and RiceGenomicsSLKG.
In addition, in order to enrich the scientific publications with bibliographic metadata, we developed a SPARQL micro-service [MFZCG19] to query the PubMed Central API and retrieve publication metadata 26. For each publication, the micro-service transforms the PubMed API's results into an RDF graph that we insert in the KG being constructed. Finally, we also inserted as a subgraph of WheatGenomicsSLKG the SKOS version of the WTO semantic resources used to annotate phenotype entities. WheatGenomicsSLKG and RiceGenomicsSLKG can be queried at http://d2kab.i3s.unice.fr/sparql. A list of SPARQL queries that implement the CQs (CQ1 to CQ5) is presented in Section 4.2.
de la Loire, in descending order. It estimates the most cultivated crops in this region during the whole time period. This query could also include a specific time period to observe the evolution of cultivated crop in the region. The results show that grape, cabbage, leek and carrot are the most cultivated crops in Pays de la Loire. These are more precise results than those from our previous work [RBP + 17] in 2017 indicating that this region grows field crops,

Listing 9: SPARQL query implementing CQ7 and retrieving the number of times that each crop name is mentioned in PHB bulletins related to the 'Pays de la Loire' French region.

the same HTML element of a PHB. This query then estimates that the development stage is applicable to the crop. Thus one can deduce that, at the publication date of the PHB, the crop has reached the development stage in the region the PHB relates to. The query retrieves 1304 HTML elements that contain both an annotation of FCU crop concepts and of PPDO development stages from 190 distinct PHBs. Note that the execution of this query takes some time (50s) due to the call of two service templates.

distinct HTML elements. The crop name is mentioned before the development stage. Another implementation consists in retrieving the crop name and the development stage in a window of a fixed number of characters. The following query retrieves the couples of annotations, one for the crop name and one for the development stage, that appear in a window of 1000 characters in the PHB example. This alternative implementation of CQ8 is presented in

    SERVICE <http://ontology.inrae.fr/frenchcropusage/sparql> {
      ?body_fcu a skos:Concept ;
                skos:prefLabel ?cropName .
      FILTER (LANG(?cropName) = 'fr')
    }

Using federated queries, scientists can jointly exploit several KGs. In the following, we present an example of combined exploitation of WheatGenomicsSLKG and RiceGenomicsSLKG and an example of combined exploitation of WheatGenomicsSLKG and PHB KG. 33

Combined Exploitation of WheatGenomicsSLKG and RiceGenomicsSLKG

CQ5 requires using both wheat and rice KGs to search for a correlation between gene expression and disease resistance. The aim is to search for wheat and rice genes co-occurring with the same taxon of a pathogen (or a more specific taxon) to identify candidate orthologous genes in wheat and rice genomes. The SPARQL query presented in Listing 12 implements this competency

      ?paper a fabio:ResearchPaper ;
             dct:title ?source3 ;
             dct:issued ?year .
      FILTER (?year >= "2010"^^xsd:gYear)
    }
    GRAPH <http://ns.inria.fr/d2kab/ontology/wto/v3> {
      ?WTOtraitURI skos:prefLabel ?WTOtrait .
    }

Listing 3: SPARQL query implementing CQ2-bis and retrieving gene names that are mentioned proximal to the GDP chemical component in RiceGenomicsSLKG