SCOPE - A Scientific Compound Object Publishing and Editing System

This paper presents the SCOPE (Scientific Compound Object Publishing and Editing) system, which is designed to enable scientists to easily author, publish and edit scientific compound objects. Scientific compound objects enable scientists to encapsulate the various datasets and resources generated or utilized during a scientific experiment or discovery process within a single compound object, for publishing and exchange. The adoption of “named graphs” to represent these compound objects enables provenance information to be captured via the typed relationships between the components. This approach is also endorsed by the OAI-ORE initiative and hence ensures that we generate OAI-ORE-compliant scientific compound objects. The SCOPE system is an extension of the Provenance Explorer tool, which enables access-controlled viewing of scientific provenance trails. Provenance Explorer provides dynamic rendering of RDF graphs of scientific discovery processes, showing the lineage from raw data to publication. Views of different granularity can be inferred automatically using SWRL (Semantic Web Rule Language) rules and an inferencing engine. SCOPE extends the Provenance Explorer tool and GUI by: 1) adding an embedded web browser that can be used for incorporating objects discoverable via the Web; 2) representing compound objects as Named Graphs that can be saved in RDF, TriX, TriG or as an Atom syndication feed; 3) enabling scientists to attach Creative Commons Licenses to the compound objects to specify how they may be re-used; 4) enabling compound objects to be published as Fedora Object XML (FOXML) files within a Fedora digital library.


Introduction and Objectives
In recent years it has been increasingly acknowledged that traditional scientific publications are the outcome of the final phase of the scientific discovery process; on the whole, they inadequately represent the earlier phases that involve the capture, analysis, modeling and interpretation of primary scientific data. A record of the complete scientific discovery process, through detailed provenance information, enables verification and repeatability of results and peer review of the methodology. Lack of access to high quality scientific data with the associated provenance information is an obstacle to interdisciplinary and international research. As a result, many organizations and funding bodies are actively encouraging or even mandating the publication of scientific data along with traditional publications across many domains [1][2][3][4]. But there are a number of barriers that need to be overcome to get scientists to publish their datasets. These include: a lack of simple tools for publishing data with provenance information; lack of motivation for scientists to spend time and effort preparing their data for publication; concern with intellectual property rights; a lack of standards for publishing data sets and provenance; and discipline-specific tools that prohibit cross-disciplinary sharing and exchange.
A number of different approaches have been implemented that link raw scientific data to scientific publications. Some online publishers, including Acta Crystallographica Section E - Structure Reports Online [5], Nature [6], and the ePIC Earth System Science Data and Methods [7] journal, enable the association of supplementary datasets with scholarly papers. Murray-Rust and Rzepa proposed the concept of datuments: XML documents that combine the data and the document using formal markup to allow processing and rendering in different ways via XSLT [8]. A number of initiatives and projects (e.g., Protein Data Bank (PDB) [9], CombeChem [10], eBank [11], NCBI [12], Virtual Observatory [13] and GBIF [14]) have spearheaded the development of infrastructures that facilitate the online publication of electronic scientific datasets. Although these existing approaches have advanced the publication of scientific data, they also have a number of limitations including:
• The relationships between the datasets and publications are one-to-one, relatively fixed and involve hyperlinks with little or no support for semantics or provenance information.
• Difficulty discovering and retrieving components that are deeply embedded or hidden within HTML pages or the deep web [15].
The full potential of compound objects cannot be realized unless the component information is both human-understandable and machine-interpretable. The Open Archives Initiative [16] (OAI) is proposing a new standard named Object Reuse and Exchange (OAI-ORE) [17] that aims to make the information within compound/complex digital objects [18] discoverable, machine-readable, interoperable and reusable. The objective of the OAI-ORE Initiative is to develop an interoperability layer across cooperating digital repositories, registries and services for the reuse and exchange of compound digital objects, based on the Web architecture [19]. The OAI-ORE white paper [20] recommends Named Graphs [21] as a means to publish compound digital objects in order to clearly state their logical boundaries and the typed relationships between their components. Named Graphs consist of nodes and arcs. When applied to compound objects, the nodes correspond to component resources and the arcs correspond to typed relationships. Named Graphs, their nodes and their arcs are all web resources, so they can be identified and referenced unambiguously by HTTP URIs. This provides the basis for the reuse and exchange of the OAI-ORE compound objects and their components because URIs provide the handles.
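The node-and-arc structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SCOPE's internal model: the class names `Arc` and `NamedGraph`, the `http://example.org/` base URI and the relationship names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Arc:
    """A typed relationship between two component resources."""
    subject: str    # URI of the source node
    predicate: str  # URI naming the relationship type
    obj: str        # URI of the target node


@dataclass
class NamedGraph:
    """A set of arcs that is itself identified by a URI."""
    uri: str
    arcs: set = field(default_factory=set)

    def add(self, subject, predicate, obj):
        self.arcs.add(Arc(subject, predicate, obj))

    def nodes(self):
        """Every resource that appears at either end of an arc."""
        return {a.subject for a in self.arcs} | {a.obj for a in self.arcs}

    def quads(self):
        """Each arc qualified by the graph URI, sorted for stable output."""
        return sorted((a.subject, a.predicate, a.obj, self.uri) for a in self.arcs)


EX = "http://example.org/"  # hypothetical base URI, for illustration only
graph = NamedGraph(EX + "compound/1")
graph.add(EX + "paper", EX + "rel/derived_from", EX + "xrd_pattern")
graph.add(EX + "xrd_pattern", EX + "rel/output_of", EX + "xray_diffraction")
print(len(graph.nodes()), "nodes,", len(graph.arcs), "arcs")
```

Because the graph, its nodes and its relationship types are all plain URIs here, any of them can be referenced from outside the graph, which is the property the text relies on for reuse and exchange.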
Our hypothesis is that OAI-ORE Named Graphs provide the ideal mechanism for representing scientific compound objects that encapsulate the raw data, its derivative products and outputs/publications as well as the intervening processing steps. They do this in a way that is discipline-independent but provides hooks to include rich semantics, metadata and discipline-specific vocabularies, ontologies and rules.
Hence our primary objective was to develop an intuitive, simple, easy-to-use system that enables scientists to quickly and easily author scientific compound objects with built-in provenance and to publish them to a repository with associated metadata and a Creative Commons license: the SCOPE (Scientific Compound Object Publishing and Editing) system. If SCOPE can deliver on these objectives, then the system will overcome some of the current barriers to scientific data publication, which include: a lack of incentive; a lack of tools; difficulty preparing data for publication; difficulty providing an appropriate level of provenance data; and concern with intellectual property rights.
The Provenance Explorer tool [22] provides dynamic rendering of provenance trails (captured from workflow engines) via the visualization of RDF graphs. In addition, users can expand or collapse arcs to generate fine or coarse-grained views. Because Provenance Explorer provides a subset of the functionality required for SCOPE, we decided to exploit this existing work and extend it. More specifically, SCOPE extends Provenance Explorer through the addition of the following functionality:
• Adding a Web browser which enables the discovery and importation of digital objects on the Web as new components of compound objects;
• Enabling the compound objects (named graphs) to be saved in a variety of serializations (RDF/XML, TriX [23] and TriG [24], as well as the Atom syndication feed [25]) in order to strengthen the dissemination of the compound objects over the Web;
• Converting the compound objects with their components into Fedora [26] Object XML (FOXML) files, and ingesting them into Fedora digital libraries;
• Enabling scientists to attach Creative Commons Licenses to the compound objects to specify how the compound object may be re-used;
• Enabling existing compound objects to be reloaded and edited by authenticated users with appropriate access permissions via the SCOPE GUI.
The remainder of this paper is structured as follows: Section 2 describes related work; Section 3 describes the system architecture and components; Section 4 describes the case study we used for evaluation and testing; Section 5 describes the implementation and user interface and Section 6 concludes with an evaluation, discussion and future work plans.

Related Work
There are two primary methods by which research data is linked to scholarly publications:
1. The first method involves either including a reference from the paper to an accession number in a database or adding a hyperlink from the paper to a dataset or data held within a database via a unique identifier (e.g., many publishers use Digital Object Identifiers (DOIs)).
2. The second approach involves embedding the data within the scholarly publication via a formal markup language.
Examples of publishers who support the first approach include Nature and the American Chemical Society, which require that papers about proteins, DNA sequences or molecular structures associate them with accession numbers assigned by designated publicly accessible databases such as Genbank [27], the Protein Data Bank (PDB) [9] and SWISS-PROT [28]. This approach depends on the long term availability and accessibility of large-scale online databases of scientific data. The Protein Data Bank (PDB) [9] is just one example of a public database that is built from user submissions. Other similar large-scale online aggregated databases have been established and are maintained by organizations such as NASA, NIST, NCBI, STD-DOI, GBIF and NOAA, for research domains including global atmospheric and climatic research, computational chemistry, genomics, earth sciences and analytical physics. Typically these organizations collect data in their own database schema, and if others want to upload their data, it must first be converted to the organizational database schema and specified formats. Problems associated with this first approach include:
• The link from the paper to the data is usually uni-directional and does not include any semantics or provenance information. Discovering the data via web crawlers is not possible because it is part of the deep web.
• The procedures for submitting papers and/or data to online publishers and publicly accessible databases are database-specific and rigid. Understanding those procedures can frustrate scientists or deter them from publishing their data.
The second approach to publishing raw data linked to publications involves using some form of XML to mark up a scientific publication structurally and semantically, to distinguish between and interpret the publication text and different types of embedded or related data. Examples of this approach include Murray-Rust's datuments [8]: XML documents that are machine readable and can be rendered in different ways using XSLT. The eCrystals Crystal Structure Report Archive [29], a subproject under CombeChem and eBank, publishes first-hand but non-peer reviewed crystallographic data online. All information about a single crystal is dynamically generated as a highly structured web page with detailed provenance information and links to related citations. Acta Crystallographica Section E - Structure Reports Online [5] also binds hyperlinks to the paper and supplementary material under the one title. The German Scientific Drilling Database (SDDB) project [30] is also investigating the use of XHTML to integrate geological sample information with derived data and published studies in which the data is interpreted.
The major drawback associated with the second approach is that many Web spiders/crawlers cannot determine the semantic relationships between raw data and HTML text bound within a single web page. Explicitly typed relationships, as defined within Named Graphs, are required to raise the relationships between components to first class objects which can have their own provenance information.
However, some of the limitations of the XHTML approach to scientific publishing may be overcome through the adoption of emerging technologies such as Microformats [31], RDFa [32] and GRDDL [33]. Microformats and RDFa enable semantic tags to be embedded within XHTML documents to tag the content or link to related documents or data (via the rel attribute) without affecting the display of the HTML text. The inclusion of these light-weight semantic tags enables machine understanding, interpretation and processing of the publication. GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a mechanism for deriving formal metadata in RDF by using XSLT to process XHTML documents and extract the embedded semantics (e.g., RDFa). The future may well see scientific publication authoring systems that use RDFa to embed tags or annotations in (X)HTML files and use RDFa-aware browsers or GRDDL to extract this, convert it to RDF, store it in an RDF triple store and search it using SPARQL.
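To make the idea of gleaning typed links from markup concrete, the following minimal sketch pulls `rel`/`href` pairs out of an XHTML fragment and turns them into triples. It is deliberately much simpler than a real RDFa or GRDDL processor: it ignores prefix resolution, `about`, `property` and XSLT entirely. The class name `RelLinkExtractor`, the sample page and all URIs are assumptions made for illustration.

```python
from html.parser import HTMLParser


class RelLinkExtractor(HTMLParser):
    """Collect (page, relation, target) triples from rel-annotated links."""

    def __init__(self, page_uri):
        super().__init__()
        self.page_uri = page_uri
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel, href = attributes.get("rel"), attributes.get("href")
        if tag in ("a", "link") and rel and href:
            # A rel attribute may carry several space-separated relation names.
            for relation in rel.split():
                self.triples.append((self.page_uri, relation, href))


# Hypothetical publication fragment with two semantically tagged links.
XHTML = """
<html><body>
  <p>Results are based on the
     <a rel="dc:source" href="http://example.org/data/xrd.csv">XRD dataset</a>
     and released under a
     <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY licence</a>.
  </p>
</body></html>
"""

extractor = RelLinkExtractor("http://example.org/paper.html")
extractor.feed(XHTML)
for triple in extractor.triples:
    print(triple)
```

The extracted triples could then be loaded into a triple store and queried, which is the pipeline the paragraph above anticipates.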
System Architecture and Overview
• The Algernon rule-inference engine, for inferring new indirect relationships based on SWRL rules.
The Provenance Explorer paper [22] describes in detail the interactions between the knowledge base and the provenance view (via JGraph and Jena) and between the knowledge base and Algernon (via Protégé-OWL). The development of SCOPE has required the development of a number of additional functionalities: importing digital objects via the Web browser; the transformation of compound objects into various formats; and the ingestion into a Fedora digital library.

Fig. 1. Overview of the SCOPE Architecture
In the top RHS of the Authoring and Publishing Platform in Figure 1, there is a web browser implemented using Java's JDesktop Integration Components (JDIC) [34]. In many situations, scientists may want to incorporate links to external resources accessible via the web within their compound objects. An "IMPORT" button has been provided at the bottom of the Web browser window. This adds a new node, representing the object, to the Publishing Interface. The red arrow between the Web Browser and the Publishing Interface indicates the import route, and the red nodes on the interface are the representations of imported web information.
The nodes and arcs displayed in the Publishing Interface represent the current components of the Named Graph/compound object that is being constructed.Users can save this object in a range of different formats including: RDF/XML, TriX and TriG (using the Named Graphs API for Jena [35] (NG4J)); and Atom syndication feed (using Java's ROME, the RSS/Atom syndication and publishing tools [36]).
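SCOPE relies on NG4J for these serializations. Purely to show what a Named Graph looks like once written out, the sketch below hand-rolls a TriG-style block for a graph whose nodes are all URIs. The function name `to_trig` and the example URIs are hypothetical, and the writer handles neither literals, prefixes nor character escaping.

```python
def to_trig(graph_uri, triples):
    """Serialize (subject, predicate, object) URI triples as one TriG graph block."""
    lines = [f"<{graph_uri}> {{"]
    for s, p, o in triples:
        lines.append(f"    <{s}> <{p}> <{o}> .")
    lines.append("}")
    return "\n".join(lines)


EX = "http://example.org/"  # hypothetical base URI, for illustration only
trig_text = to_trig(
    EX + "compound/1",
    [
        (EX + "paper", EX + "rel/derived_from", EX + "xrd_pattern"),
        (EX + "paper", EX + "rel/contradicts", EX + "earlier_paper"),
    ],
)
print(trig_text)
```

The graph URI on the first line is what distinguishes this from a plain triple listing: it names the compound object as a whole, so statements about the compound object can be made elsewhere.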
Finally, the compound object can also be converted into a FOXML file, and ingested into a Fedora repository using JGraph2FOXML, a Java API developed by UQ's eResearch Group, using Fedora Access and Management Web Services.
At the time of publishing to Fedora, users are also required to enter metadata (creator, date, title, description) associated with the compound object. This is entered via the metadata input interface (bottom RHS). Users may also choose a Creative Commons license and attach it to the FOXML file at this time.
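As a rough illustration of the metadata captured at this step, the sketch below builds a small Dublin Core record with Python's standard XML library. It is not the FOXML that SCOPE generates; the field values are invented, and recording the chosen Creative Commons license as a `dc:rights` URI is an assumption rather than something the system is known to do.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
ET.register_namespace("dc", DC)
ET.register_namespace("oai_dc", OAI_DC)


def dublin_core_record(fields):
    """Build an oai_dc:dc element with one dc:* child per metadata field."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for name, value in fields.items():
        ET.SubElement(root, f"{{{DC}}}{name}").text = value
    return root


# All values below are hypothetical placeholders.
record = dublin_core_record({
    "creator": "A. Scientist",
    "date": "2008-01-01",
    "title": "Characterization of a novel oxide conductor",
    "description": "Compound object linking XRD data, graphs and the paper.",
    "rights": "http://creativecommons.org/licenses/by/3.0/",  # assumed license slot
})
print(ET.tostring(record, encoding="unicode"))
```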

Case Study
Although this system has been designed to be used within any scientific discipline, at the University of Queensland we have been evaluating it within the materials science domain. In particular, we have been collaborating with a group of fuel cell scientists within the Australian Institute for Bioengineering and Nanotechnology (AIBN), who are investigating novel metal oxides for Solid Oxide Fuel Cell components which exhibit high conductivity within relatively low temperature ranges.
The composite powder that is output from a complex processing and sintering procedure is characterized using X-ray diffraction techniques. In addition, thermal expansion coefficients and electronic conductivity are measured over a range of operating temperatures. During the characterization process, significant amounts of data are generated in a range of formats including images, numerical data and graphs. Figure 2 provides a simplified view of the powder manufacture and characterization process. It also shows a subset of the datasets generated from X-ray diffraction and characterization, which are processed to generate graphs that are included in the final publication. For illustrative purposes, we also assume that the publication contradicts results reported in an earlier publication.
The challenge is to provide a system that enables the fuel cell scientist to quickly and easily package up the relevant datasets, images, graphs and papers into a publishable compound object that also contains an explanation of the relationships between the components, the method of derivation, and allows easy fine-grained discovery of components.

Implementation and User Interface
In this section, we discuss the implementation of the SCOPE system in the context of the above case study. We assume that the experimental steps and digital objects generated during the Synthesis Process and X-Ray Diffraction are captured and stored as RDF using one of the existing scientific workflow systems that generate RDF, such as Kepler [37], Taverna [38] and Triana [39], or one of the e-Lab notebook systems such as the Collaborative Electronic Research Framework (CERF) [40], SmartTea [41] or MyTea [42]. The RDF graph corresponding to the scientific experimental workflow can be used as the starting point for the scientific compound object that is to be published. This ensures that available provenance information will be leveraged.

Authoring
After users log on to the system and are authenticated, they are presented with a simple SPARQL search interface which enables search and retrieval of existing RDF experiments. For example, a user can search for and retrieve a particular experiment via a unique ID, e.g. EXP280818. Initially users are presented with the basic default view of the experiment provenance in the top LHS of the application window. The blue nodes indicate the characterization processes that can be expanded to reveal further fine-grained information, represented by the light gray nodes shown on the RHS of Figure 3.

Fig. 3. A coarse-grained view and a fine-grained view of a simplified scientific process

One of the advanced functionalities within the Publishing Interface is automatic inferencing of direct relationships between indirectly-related non-expandable nodes (light gray or yellow) via an inferencing engine. Users can drag and drop any two non-expandable nodes from the Provenance View panel down to the bottom Publishing panel, draw a link between them, and infer their relationship. This enables the streamlined publishing of coarse-grained views of the scientific method. In Figure 4 the inferencing result, circled in blue, is shown along the linkage in the bottom pane, while the inferred path is highlighted in blue in the top pane. The inferencing rule is as follows:

IF (processed_powder input_to X-ray_diffraction) AND (X-ray_diffraction outputs XRD_pattern) THEN (processed_powder characterized_by XRD_pattern)
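The rule above can be read as a two-step chain over triples. The sketch below applies it in plain Python rather than through SWRL and the Algernon engine that SCOPE actually uses; generalizing the rule to variables (any x, process and y) is an assumption, as are the function name `infer_characterized_by` and the extra `graph_plotting` fact.

```python
def infer_characterized_by(triples):
    """Apply: (x input_to p) AND (p outputs y) => (x characterized_by y)."""
    inferred = set()
    for x, rel1, process in triples:
        if rel1 != "input_to":
            continue
        # Look for anything the same process outputs.
        for process2, rel2, y in triples:
            if rel2 == "outputs" and process2 == process:
                inferred.add((x, "characterized_by", y))
    return inferred


facts = {
    ("processed_powder", "input_to", "X-ray_diffraction"),
    ("X-ray_diffraction", "outputs", "XRD_pattern"),
    ("XRD_pattern", "input_to", "graph_plotting"),  # hypothetical extra step with no output
}
print(infer_characterized_by(facts))
```

Only the powder-to-pattern relationship is derived here, because the hypothetical `graph_plotting` step has no recorded output to complete the chain.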

Fig. 4. An inferencing result
Users can also discover external objects on the Web via the embedded browser, and import these external objects as components of publishable compound objects. Figure 5 demonstrates the three-step process. The Web browser (top RHS) displays the resource to be imported. 1) The Import Digital Object button is clicked to create a new red node in the Publishing Interface; 2) the user draws a line from the Scholarly Paper node to the imported node, and then labels the relationship, e.g. contradicts; and 3) the Metadata view displays metadata extracted from the file's header (if available).

Fig. 5. Importing external digital objects from the Web
The nodes and arcs displayed in the Publishing Interface represent a compound object. Figure 6 demonstrates the three-step process of creating/editing and attaching metadata to a compound object: 1) when the user clicks in the Metadata editing/input window, the background of the Publishing Interface turns grey; 2) the Metadata view displays fields for the entry of Dublin Core metadata; 3) after the user has attached the new metadata, the background returns to the original color. Users may also edit the metadata associated with component objects if it is stored locally.

Fig. 6. Creating/editing and attaching metadata to a compound object

Publishing
The system enables a compound object to be saved in a variety of web formats, including RDF/XML, TriX, TriG and the Atom syndication feed. Figures 7 and 8 illustrate the conversion to TriX and the Atom syndication feed.

User Feedback
Feedback from the fuel cell scientists with whom we have been collaborating has been very positive. They particularly liked the ability to graphically link internally generated provenance trails to external resources discoverable via a Web browser. This allows authors to include other relevant research outcomes to strengthen the claims of their findings, thereby making their research outcomes more comprehensive but still self-contained, which facilitates the peer-review process. The ability to interactively generate coarse-grained views of scientific experimental processes via automatic inferencing was also very popular, as was being able to attach attribution and a Creative Commons license. Further collaboration with the fuel cell scientists is required to develop better, discipline-specific inferencing rules and relationship ontologies, grounded in solid scientific research and knowledge.

Limitations and Future Work
The system developed to date is a working prototype that demonstrates the benefits of named graphs and OAI-ORE for scientific data publishing. However, further effort is required to improve the system's usability and robustness and to overcome existing limitations that include:
• Currently, only Dublin Core metadata is supported for the FOXML files at the time of publication. Support for other metadata schemas should be an option;
• New typed relationships defined through the Publishing Interface can currently only be labeled with free text labels. We are working on an ontology that defines a class hierarchy for relationships between information objects within the scientific domain. We are also focusing more effort on the inferencing rules that apply to these relationships;
• The system currently only supports uni-directional relationships. We would like the ability to define bi-directional relationships, as well as symmetric, transitive and reflexive relationships, within the relationship ontology;
• Some of the script behaviours on web pages could prove frustrating to users. For example, clicking on a hyperlink within the Publishing page may trigger the launch of a new browser window outside the system;
• At this stage, the system does not support searching, reloading and editing of published OAI-ORE scientific compound objects. This capability is currently under development.
A number of weaknesses were also identified in the different serializations of the named graphs: 1) TriX and TriG are new, still in an early stage of their development, and have not yet been widely adopted; 2) RDF/XML represents inferencing rules as triples, thereby confusing the rules with other triples; 3) the Atom syndication feed is less expressive and cannot indicate the relationships between entries/components; 4) the FOXML document relies on RDF/XML to represent the composite and component-to-component relationships, so inferencing rules cannot be represented within Fedora repositories effectively. Future plans also include discussing possible deployment of SCOPE within the Materials Science domain, in conjunction with the NSDL Materials Digital Library, to enable publishing of compound scientific publications or e-learning resources on materials science research. We also plan to evaluate the system more thoroughly using case studies and user groups from other disciplines such as Bioinformatics, Earth Sciences, Crystallography and the Humanities and Social Sciences.
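The Atom weakness noted among the serializations above can be seen in a small sketch. SCOPE produces its feeds with ROME; the Python below is only an approximation using the standard library, with a hypothetical function name `atom_feed` and invented URIs. Each component becomes one flat `entry`, and a typed relationship between two entries has no obvious place in this structure.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)


def atom_feed(feed_id, title, updated, component_uris):
    """Build an Atom feed with one flat entry per component resource."""
    feed = ET.Element(f"{{{ATOM}}}feed")
    ET.SubElement(feed, f"{{{ATOM}}}id").text = feed_id
    ET.SubElement(feed, f"{{{ATOM}}}title").text = title
    ET.SubElement(feed, f"{{{ATOM}}}updated").text = updated
    for uri in component_uris:
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}id").text = uri
        # Use the last path segment as a stand-in title.
        ET.SubElement(entry, f"{{{ATOM}}}title").text = uri.rsplit("/", 1)[-1]
        ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated
        ET.SubElement(entry, f"{{{ATOM}}}link", href=uri)
    return feed


EX = "http://example.org/"  # hypothetical base URI, for illustration only
feed = atom_feed(
    EX + "compound/1",
    "Example compound object",
    "2008-01-01T00:00:00Z",
    [EX + "paper", EX + "xrd_pattern"],
)
print(ET.tostring(feed, encoding="unicode"))
```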

Conclusions
In this paper, we have described a tool for authoring and publishing OAI-ORE compliant scientific compound objects. SCOPE extends our existing Provenance Explorer tool and GUI by: 1) adding an embedded web browser that can be used for incorporating objects discoverable via the Web; 2) representing compound objects as Named Graphs that can be saved in RDF, TriX, TriG or as an Atom syndication feed; 3) enabling scientists to attach Creative Commons Licenses to the compound objects to specify how they may be re-used; 4) enabling compound objects to be published as Fedora Object XML (FOXML) files within a Fedora digital library. In delivering these capabilities, the SCOPE system provides solutions to some of the current barriers to scientific data publishing. It provides a simple tool by which scientists can author and publish scientific compound publications that encapsulate raw data, derived data, provenance and publications in a single package. Authors can also attach metadata to the individual components and the compound object and save the package in a variety of formats, in order to maximize the discovery, dissemination and re-use of the publication or its components. With the worldwide efforts for open access to publicly funded research, scientists are under increasing pressure from funding agencies to publish the experimental and evidential data with the related traditional scholarly publication(s). SCOPE can help facilitate this.

Figure 1 illustrates the overall system architecture. It comprises:
• The SCOPE Java Application and GUI, which has four components:
  o The provenance viewer;
  o The Web browser;
  o The publishing interface;
  o The metadata input and editing interface.
• The knowledge base, which consists of SWRL.OWL files that contain the provenance instance data, the metadata and the inference rules.
• The provenance visualization tool: JGraph and Jena are used to convert an RDF graph into an image consisting of nodes (objects/classes) and arcs (relationships/properties).

Fig. 2. Simplification of the Scientific Discovery Process for Novel Oxide Conductors

In Figures 7 and 8, which illustrate the conversion to TriX and the Atom syndication feed, the coloured highlights on the RHS correspond to the coloured nodes within the RDF graph in the Publishing Interface on the LHS.

Fig. 9. Generation of a Fedora Compound Digital Object

Discussion
• A lack of flexibility or extensibility: scientists require simple, interactive GUIs that enable them to define a set of resources generated from an experiment or investigation, relate them to each other and publish the lot as a package.
• Lack of support for multi-level access to data or information. Existing systems seem to support open access only.
• Lack of rule-based or template-based systems for rendering different presentations dynamically, based on the context, user's needs or access rights.