Semantic representation of scientific literature: Bringing claims, contributions and named entities onto the Linked Open Data cloud

Motivation: Finding relevant scientific literature is one of the essential tasks researchers are facing on a daily basis. Digital libraries and web information retrieval techniques provide rapid access to a vast amount of scientific literature. However, no further automated support is available that would enable fine-grained access to the knowledge 'stored' in these documents. The emerging domain of Semantic Publishing aims at making scientific knowledge accessible to both humans and machines, by adding semantic annotations to content, such as a publication's contributions, methods, or application domains. However, despite the promises of better knowledge access, the manual annotation of existing research literature is prohibitively expensive for wide-spread adoption. We argue that a novel combination of three distinct methods can significantly advance this vision in a fully-automated way: (i) Natural Language Processing (NLP) for Rhetorical Entity (RE) detection; (ii) Named Entity (NE) recognition based on the Linked Open Data (LOD) cloud; and (iii) automatic knowledge base construction for both NEs and REs using semantic web ontologies that interconnect entities in documents with the machine-readable LOD cloud.

Results: We present a complete workflow to transform scientific literature into a semantic knowledge base, based on the W3C standards RDF and RDFS. A text mining pipeline, implemented based on the GATE framework, automatically extracts rhetorical entities of type Claims and Contributions from full-text scientific literature. These REs are further enriched with named entities, represented as URIs to the linked open data cloud, by integrating the DBpedia Spotlight tool into our workflow. Text mining results are stored in a knowledge base through a flexible export process that provides for a dynamic mapping of semantic annotations to LOD vocabularies through rules stored in the knowledge base. We created a gold standard corpus from computer science conference proceedings and journal articles, where Claim and Contribution sentences are manually annotated with their respective types using LOD URIs. The performance of the RE detection phase is evaluated against this corpus, where it achieves an average F-measure of 0.73. We further demonstrate a number of semantic queries that show how the generated knowledge base can provide support for numerous use cases in managing scientific literature.

Availability: All software presented in this paper is available under open source licenses at http://www.semanticsoftware.info/semantic-scientific-literature-peerj-2015-supplements. Development releases of individual components are additionally available on our GitHub page at https://github.com/SemanticSoftwareLab.

This diagram shows our envisioned workflow to extract the knowledge contained in scientific literature by means of natural language processing (NLP), so that researchers can interact with a semantic knowledge base instead of isolated documents.

INTRODUCTION
In a commentary for the Nature journal, Berners-Lee and Hendler (2001) predicted that the new semantic web technologies "may change the way scientific knowledge is produced and shared". They envisioned the concept of "machine-understandable documents", where machine-readable metadata is added to articles in order to explicitly mark up the data, experiments and rhetorical elements in their raw text. More than a decade later, not only is the wealth of existing publications still without annotations, but nearly all new research papers still lack semantic metadata as well. Manual efforts for adding machine-readable metadata to existing publications are simply too costly for wide-spread adoption. Hence, we investigate what kind of semantic markup can be automatically generated for research publications, in order to realize some of the envisioned benefits of semantically annotated research literature.

As part of this work, we first need to identify semantic markup that can actually help to improve specific tasks for the scientific community. A survey by Naak et al. (2008) revealed that when locating papers, researchers consider two factors when assessing the relevance of a document to their information need, namely, the content and quality of the paper. They argue that a single rating value cannot represent the overall quality of a given research paper, since such a criterion can be relative to the objective of the researcher. For example, a researcher who is looking for implementation details of a specific approach is interested mostly in the Implementation section of an article and will give a higher ranking to documents with detailed technical information, rather than related documents with modest implementation details and more theoretical contributions. Therefore, a lower ranking score does not necessarily mean that the document has an overall lower (scientific) quality, but rather that its content does not satisfy the user's current information need.

Consequently, to support users in their concrete tasks involving scientific literature, we need to go beyond standard information retrieval methods, such as keyword-based search.

Manuscript to be reviewed
Our goal is to offer support for semantically rich queries that users can ask from a knowledge base of scientific literature, including specific questions about the contributions of a publication or the discussion of specific entities, like an algorithm. For example, a user might want to ask the question "Show me all full papers from the SePublica workshops, which contain a contribution involving 'linked data'."

We argue that this can be achieved with a novel combination of three approaches: Natural Language Processing (NLP), Linked Open Data (LOD)-based entity detection, and semantic vocabularies for automated knowledge base construction (we discuss these methods in our Background section below). By applying NLP techniques for rhetorical entity (RE) recognition to scientific documents, we can detect which text fragments form a rhetorical entity, like a contribution or claim. By themselves, these REs provide support for use cases such as summarization (Teufel and Moens, 2002), but cannot answer what precisely a contribution is about. We hypothesize that the named entities (NEs) present in a document (e.g., algorithms, methods, technologies) can help locate relevant publications for a user's task. However, manually curating and updating all these possible entities for an automated NLP detection system is not a scalable solution either. Instead, we aim to leverage the Linked Open Data cloud (Heath and Bizer, 2011), which already provides a continually updated source of a wealth of knowledge across nearly every domain, with explicit and machine-readable semantics.

If we can link entities detected in research papers to LOD URIs (Universal Resource Identifiers), we can semantically query a knowledge base for all papers on a specific topic (i.e., a URI), even when that topic is not mentioned literally in a text: For example, we could find a paper for the topic "linked data," even when it only mentions "linked open data," or even "LOD", since they are semantically related in the DBpedia ontology. But linked NEs alone again do not help in precisely identifying literature for a specific task: Did the paper actually make a new contribution about "linked data," or just mention it as an application example? Our idea is that by combining the REs with the LOD NEs, we can answer questions like these in a more precise fashion than either technique alone.

To test these hypotheses, we developed a fully-automated approach that transforms publications and their NLP analysis results into a knowledge base in RDF format, based on a shared vocabulary, so that they can take part in semantically rich queries and ontology-based reasoning. We evaluate the performance of this approach against a manually annotated gold standard corpus.

BACKGROUND

Rhetorical entities capture the conventions researchers follow when writing scientific articles. Indeed, according to a recent survey (Naak et al., 2008), researchers stated that they are interested in specific parts of an article when searching for literature, depending on their task at hand. Verbatim extraction of REs from text helps to efficiently allocate the attention of humans when reading a paper, as well as improving retrieval mechanisms by finding documents based on their REs (e.g., "Give me all papers with implementation details"). They can also help to narrow down the scope of subsequent knowledge extraction tasks by determining zones of text where further analysis is needed.

Existing works in automatic RE extraction are mostly based on the Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) that characterizes fragments of text and the relations that hold between them, such as contrast or circumstance. Marcu (1999) developed a rhetorical parser that derives the discourse structure from unrestricted text and uses a decision tree to extract Elementary Discourse Units (EDUs) from text.

The work by Teufel (2010) identifies so-called Argumentative Zones (AZ) from scientific text as a group of sentences with the same rhetorical role. She uses statistical machine learning models and sentential features to extract AZs from a document. Teufel's approach achieves a raw agreement of 71% with human annotations as the upper bound, using a Naïve Bayes classifier. Applications of AZs include document management and automatic summarization tasks.

In recent years, work on RE recognition has been largely limited to the biomedical domain. The JISC-funded ART project aimed at creating an "intelligent digital library," where the explicit semantics of scientific papers is extracted and stored using an ontology-based annotation scheme.

Prior to the analysis of scientific literature for their latent knowledge, we first need to provide the foundation for a common representation of documents, so that (i) the variations of their formats (e.g., HTML, PDF, LaTeX) and publisher-specific markup can be converted to one unified structure; and (ii) various segments of a document required for further processing are explicitly marked up, e.g., by separating References from the document's main matter. A notable example is SciXML (Rupp et al., 2006), an XML-based markup language for domain-independent research papers. In stand-off annotation style, the original text and its annotations are separated into two different parts and connected using text offsets. Entity linking is a highly active research area in the Semantic Web community.

A high-level overview of our workflow design: a document is fed into an NLP pipeline that performs semantic analysis on its content and stores the extracted entities in a knowledge base, inter-linked with resources on the LOD cloud.
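The stand-off annotation style mentioned above can be illustrated with a minimal, self-contained sketch. The text and offsets below are hypothetical example data; the annotation types only mirror the kinds of entities discussed in this paper:

```python
# Minimal stand-off annotation sketch: the text is stored exactly once, and
# each annotation references it only via character offsets and a type label.
from dataclasses import dataclass

@dataclass
class Annotation:
    start: int   # inclusive character offset into the text
    end: int     # exclusive character offset
    type: str    # annotation type, e.g. "Contribution" or "NamedEntity"

text = "We present a novel workflow for semantic publishing."
annotations = [
    Annotation(0, 52, "Contribution"),   # the whole sentence is a Contribution RE
    Annotation(32, 51, "NamedEntity"),   # "semantic publishing" as a named entity
]

def covered_text(text: str, ann: Annotation) -> str:
    """Resolve an annotation back to its surface text via its offsets."""
    return text[ann.start:ann.end]

print(covered_text(text, annotations[1]))  # -> semantic publishing
```

Because annotations never modify the text itself, multiple overlapping annotation layers (rhetorical zones, named entities, document structure) can coexist over the same document.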

We designed a text mining pipeline to automatically detect rhetorical entities in scientific literature. Our RE detection pipeline extracts such statements on a sentential level, meaning that we look at individual sentences to classify them into one of three categories: Claim, Contribution, or neither. If a chunk of text (e.g., a paragraph or section) describes a Claim or Contribution, it will be extracted as multiple, separate sentences. In our approach, we classify a document's sentences based on the existence of several discourse elements and so-called trigger words. We adopted a rule-based approach, in which several rules are applied sequentially on a given sentence to match against its contained lexical and discourse elements. When a match is found, the rule then assigns a type, in the form of an LOD URI, to the sentence.
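The sequential rule-matching idea can be sketched as follows. Note that the actual pipeline uses JAPE grammar rules within GATE; the trigger-word patterns below are simplified, illustrative stand-ins rather than the real rule set:

```python
import re

# Simplified sketch of rule-based RE classification. The real system uses
# JAPE grammars in GATE; these trigger-word patterns are illustrative only.
CONTRIBUTION_TRIGGERS = re.compile(
    r"\b(we (present|propose|describe|introduce|developed?))\b", re.IGNORECASE)
CLAIM_TRIGGERS = re.compile(
    r"\b(our (approach|method|system) (is|outperforms|achieves))\b", re.IGNORECASE)

def classify_sentence(sentence: str) -> str:
    """Apply rules sequentially; the first matching rule assigns the type."""
    if CONTRIBUTION_TRIGGERS.search(sentence):
        return "Contribution"
    if CLAIM_TRIGGERS.search(sentence):
        return "Claim"
    return "neither"

print(classify_sentence("In this paper, we present a novel text mining pipeline."))
# -> Contribution
```

Applying the rules in a fixed order means each sentence receives at most one rhetorical type, matching the three-way classification described above.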

Using the rules described above, we can now find and classify REs in a scientific document. However, by using REs alone, a system is still not able to understand the topics being discussed in a document; for example, to generate a topic-focused summary.

Therefore, the next step towards constructing a knowledge base of scientific literature is detecting the named entities that appear in a document. Our hypothesis here is that these entities can characterize what a publication's rhetorical entities are about.

We discard any tagged entity that does not fall within a noun phrase chunk. This way, adverbs or adjectives like "here" or "successful" are filtered out and phrases like "service-oriented architecture" can be extracted as a single entity. The detected entities are represented as URIs linked to their LOD resources. Figure 7 shows example RDF triples using our publication model and other shared semantic web vocabularies.
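The noun-phrase filtering step can be sketched with plain offset arithmetic. This assumes the NP chunks (from a chunker) and tagged entities (from DBpedia Spotlight) have already been computed; the offsets and surface forms below are hypothetical example data:

```python
# Keep only tagged entities whose character span falls entirely inside some
# noun phrase (NP) chunk; everything else (stray adverbs, adjectives) is dropped.
def filter_entities(entities, np_chunks):
    """entities, np_chunks: lists of (start, end, surface_form) tuples."""
    kept = []
    for (e_start, e_end, form) in entities:
        if any(c_start <= e_start and e_end <= c_end
               for (c_start, c_end, _) in np_chunks):
            kept.append((e_start, e_end, form))
    return kept

np_chunks = [(10, 39, "a service-oriented architecture")]
entities = [
    (12, 39, "service-oriented architecture"),  # inside an NP chunk: kept
    (0, 4, "here"),                             # adverb, in no chunk: dropped
]
print(filter_entities(entities, np_chunks))
# -> [(12, 39, 'service-oriented architecture')]
```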

The most similar vocabulary to our PUBO vocabulary would have been the Open Annotation model. In our export, each annotation is the subject of a triple, with its attributes, such as its features, as the object. Table 1 summarizes the shared vocabularies that we use in the annotation export process. The pre-processed text is then passed onto the downstream processing resources.
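A minimal sketch of what the annotation export could produce, in N-Triples syntax, is shown below. The pubo: namespace URI and property names here are illustrative placeholders, not necessarily the exact terms of the PUBO vocabulary, and the RE class URI is likewise hypothetical:

```python
# Sketch: export a detected RE as RDF triples in N-Triples syntax.
# The pubo: namespace and property names are illustrative placeholders.
PUBO = "http://example.org/pubo#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def re_to_ntriples(doc_uri, re_uri, re_class, content):
    """The annotation is the subject; its type and content are the objects."""
    escaped = content.replace('\\', '\\\\').replace('"', '\\"')
    return [
        f"<{re_uri}> <{RDF_TYPE}> <{re_class}> .",
        f'<{re_uri}> <{PUBO}hasContent> "{escaped}" .',
        f"<{doc_uri}> <{PUBO}hasAnnotation> <{re_uri}> .",
    ]

triples = re_to_ntriples(
    "http://example.org/paper-05",
    "http://example.org/paper-05#re1",
    "http://example.org/vocab#Contribution",  # hypothetical RE class URI
    "We present a complete workflow ...")
print("\n".join(triples))
```

Because each annotation becomes a first-class resource, further attributes (offsets, confidence scores, linked NEs) can be attached as additional triples without changing the model.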

{ "Resources": [{
    "@URI": "http://dbpedia.org/resource/Software_prototyping",
    "@support": "3235",
    "@types": "",
    "@surfaceForm": "prototype",
    "@offset": "1103",
    "@similarityScore": "0.9999996520535356",
    "@percentageOfSecondRank": "0.

The documents in these corpora are in PDF or XML formats, and range from 3-43 pages in various formats (ACM, LNCS, and PeerJ). We scraped the text from all files, analyzed them with our text mining pipeline described in the Implementation section, and stored the extracted knowledge in a TDB-based triplestore. The generated knowledge base is also available for download on our supplements page, http://www.semanticsoftware.info/semantic-scientific-literature-peerj-2015-supplements. The table is automatically generated through a number of SPARQL queries on the knowledge base; the source code to reproduce it can also be found on our supplementary materials page.

For example, if both "Linked Data" and "linked data" appear in a document, the total occurrence would be two, but since they are both grounded to the same URI (i.e., <dbpedia:Linked data>), the total distinct number of NEs is one.
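The distinction between total occurrences and distinct, URI-grounded NEs can be computed directly from a DBpedia Spotlight-style response. The response below mimics the structure shown above, with illustrative values:

```python
import json
from collections import defaultdict

# Count total NE occurrences vs. distinct NEs grounded to the same URI,
# using a DBpedia Spotlight-style JSON response (values are illustrative).
response = json.loads("""
{"Resources": [
  {"@URI": "http://dbpedia.org/resource/Linked_data",
   "@surfaceForm": "Linked Data", "@offset": "120"},
  {"@URI": "http://dbpedia.org/resource/Linked_data",
   "@surfaceForm": "linked data", "@offset": "845"}
]}
""")

by_uri = defaultdict(list)          # URI -> all surface forms grounded to it
for res in response["Resources"]:
    by_uri[res["@URI"]].append(res["@surfaceForm"])

total = sum(len(forms) for forms in by_uri.values())
print(f"total occurrences: {total}, distinct NEs: {len(by_uri)}")
# -> total occurrences: 2, distinct NEs: 1
```

Grouping by URI rather than by surface form is exactly what lets the knowledge base treat "Linked Data", "linked data", and "LOD" as one entity.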

This is particularly interesting in relation to their distribution within the documents' rhetorical zones (column 'Distinct DBpedia NE/RE'). As can be seen in Table 2, the number of NEs within REs is an order of magnitude smaller than the total number of distinct named entities throughout the whole papers. This holds across the three distinct corpora we evaluated.

This experiment shows that NEs are not evenly distributed in scientific literature.

Overall, this is encouraging for our hypothesis that the combination of NEs with REs brings added value, compared to either technique alone: As mentioned in the example above, a paper could mention a topic, such as "Linked Data", but only as part of its motivation, literature review, or future work. In this case, while the topic appears in the document, the paper does not actually contain a contribution involving linked data.

Relying on standard information retrieval techniques hence results in a large amount of false positives for such queries.

We assessed the performance of our text mining pipeline by conducting an intrinsic evaluation, i.e., comparing its precision and recall with respect to a gold standard corpus.

In an intrinsic evaluation scenario, the output of an NLP pipeline is directly compared with a gold standard (also known as the ground truth) to assess its performance in a given task.

Intrinsic Evaluation Results and Discussion
Table 4 shows the results of our evaluation. On average, the Rhetector pipeline obtained a 0.73 F-measure on the evaluation dataset. We gained some additional insights into the performance of Rhetector.

The system will then show the query's results in a suitable format, like the one shown in Table 5, which dramatically reduces the amount of information that the user is exposed to, compared to a manual triage approach.
596 Table 5. Three example Contributions from papers obtained through a SPARQL query.
The rows of the table show the paper ID and the Contribution sentence extracted from the user's corpus.

SePublica2011/paper-05.xml: "This position paper discusses how research publication would benefit of an infrastructure for evaluation entities that could be used to support documenting research efforts (e.g., in papers or blogs), analysing these efforts, and building upon them."

SePublica2012/paper-03.xml: "In this paper, we describe our attempts to take a commodity publication environment, and modify it to bring in some of the formality required from academic publishing."

SePublica2013/paper-05.xml: "We address the problem of identifying relations between semantic annotations and their relevance for the connectivity between related manuscripts."

Retrieving document sentences by their rhetorical type still returns REs that may concern entities that are irrelevant or less interesting for our user in her literature review task. Ideally, the system should return only those REs that mention user-specified topics.

Since we model both the REs and NEs that appear within their boundaries, the system can allow the user to further stipulate her request. Consider the following scenario: the user asks for contributions that mention a specific topic, like 'linked data'. The results returned by the system are partially shown in Table 6.

Table 6. Two example Contributions about 'linked data'. The results shown in the table are Contribution sentences that contain an entity described by <dbpedia:Linked data>.

SePublica2012/paper-07.xml: "We present two real-life use cases in the fields of chemistry and biology and outline a general methodology for transforming research data into Linked Data."

SePublica2014/paper-01.xml: "In this paper we present a vision for having such data available as Linked Open Data (LOD), and we argue that this is only possible and for the mutual benefit in cooperation between researchers and publishers."

The system not only retrieved articles the user would be interested in reading, but it also inferred that "Linked Open Data", "Linked Data" and "LOD" refer to the same semantic concept.

So far, we showed how we can make use of the LOD-linked entities to retrieve articles of interest for a user. Note that this query returns only those articles with REs that contain an NE with a URI exactly matching that of dbpedia:Linked data. However, by virtue of traversing the LOD cloud using an NE's URI, we can expand the query to ask for contributions that involve dbpedia:Linked data or any of its related subjects.

In our experiment, we interpret relatedness as being under the same category in the DBpedia knowledge base (see Fig. 8). Consider the scenario below.

Table 7. The results from the extended query, showing Contribution sentences that mention a named entity semantically related to <dbpedia:Linked data>.

SePublica2012/paper-01.xml: "In this paper, we propose a model to specify workflow-centric research objects, and show how the model can be grounded using semantic technologies and existing vocabularies, in particular the Object Reuse and Exchange (ORE) model and the Annotation Ontology (AO)."

SePublica2014/paper-01.xml: "In this paper we present a vision for having such data available as Linked Open Data (LOD), and we argue that this is only possible and for the mutual benefit in cooperation between researchers and publishers."

SePublica2014/paper-05.xml: "In this paper we present two ontologies, i.e., BiRO and C4O, that allow users to describe bibliographic references in an accurate way, and we introduce REnhancer, a proof-of-concept implementation of a converter that takes as input a raw-text list of references and produces an RDF dataset according to the BiRO and C4O ontologies."

SePublica2014/paper-07.xml: "We propose to use the CiTO ontology for describing the rhetoric of the citations (in this way we can establish a network with other works)."

The system can respond to the user's request in three steps: (i) First, through a federated query to the DBpedia knowledge base, we find the category that dbpedia:Linked data has been assigned to; in this case, the DBpedia knowledge base returns "Semantic web", "Data management", and "World wide web" as the categories; (ii) Then, we retrieve all other subjects which are under the same identified categories (cf. Fig. 8); (iii) Finally, we query our own knowledge base for Contribution sentences that mention any of these related subjects, ordered by paper. The system will return the results, shown in Table 7, to the user. This way, the user receives more results from the knowledge base that cover a wider range of topics semantically related to linked data, without having to explicitly define their semantic relatedness to the system. This simple example demonstrates how we can exploit the LOD cloud to retrieve documents based on their REs.
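The semantics of this category-based query expansion can be illustrated with a small, self-contained sketch. The hard-coded category data below is a hypothetical stand-in for the federated SPARQL queries sent to DBpedia in the actual system, and the paper/entity pairs are illustrative:

```python
# Sketch of category-based query expansion: find Contributions mentioning the
# topic URI or any subject sharing a DBpedia category with it. The category
# data is a hypothetical stand-in for federated SPARQL queries to DBpedia.
categories = {  # subject URI -> set of DBpedia category names
    "dbpedia:Linked_data": {"Semantic_web", "Data_management"},
    "dbpedia:RDF": {"Semantic_web"},
    "dbpedia:SQL": {"Data_management"},
}
contributions = [  # (paper ID, entity URI mentioned inside a Contribution RE)
    ("SePublica2012/paper-07.xml", "dbpedia:Linked_data"),
    ("SePublica2014/paper-05.xml", "dbpedia:RDF"),
    ("SePublica2013/paper-02.xml", "dbpedia:OWL"),
]

def expand(topic):
    """All subjects that share at least one category with the topic."""
    topic_cats = categories.get(topic, set())
    return {uri for uri, cats in categories.items() if cats & topic_cats}

related = expand("dbpedia:Linked_data")
matches = sorted(paper for paper, uri in contributions if uri in related)
print(matches)
# -> ['SePublica2012/paper-07.xml', 'SePublica2014/paper-05.xml']
```

The exact-match query corresponds to testing `uri == topic` only; the expanded query replaces that test with membership in the category-derived set, which is why it surfaces additional, semantically related papers.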
To demonstrate the feasibility of these ideas, we developed an NLP pipeline to fully automate the transformation of scientific documents from free-form content, read in isolation, into a queryable, semantic knowledge base. In future work, we plan to further improve both the NLP analysis and the LOD linking part of our approach. As our experiments showed, general-domain NE linking tools, like DBpedia Spotlight, are biased toward popular terms, rather than scientific entities. Here, we plan to investigate how we can adapt existing or develop new entity linking methods specifically for scientific literature. Finally, to support end users not familiar with semantic query languages, we plan to explore user interfaces and interaction patterns, e.g., based on our Zeeva semantic wiki (Sateli and Witte, 2014) system.