Using a suite of ontologies for preserving workflow-centric research objects

Scientific workflows are a popular mechanism for specifying and automating data-driven in silico experiments. A significant aspect of their value lies in their potential to be reused. Once shared, workflows become useful building blocks that can be combined or modified for developing new experiments. However, previous studies have shown that storing workflow specifications alone is not sufficient to ensure that they can be successfully reused, without being able to understand what the workflows aim to achieve or to re-enact them. To gain an understanding of the workflow, and how it may be used and repurposed for their needs, scientists require access to additional resources such as annotations describing the workflow, datasets used and produced by the workflow, and provenance traces recording workflow executions. In this article, we present a novel approach to the preservation of scientific workflows through the application of research objects—aggregations of data and metadata that enrich the workflow specifications. Our approach is realised as a suite of ontologies that support the creation of workflow-centric research objects. Their design was guided by requirements elicited from previous empirical analyses of workflow decay and repair. The ontologies developed make use of and extend existing well known ontologies, namely the Object Reuse and Exchange (ORE) vocabulary, the Annotation Ontology (AO) and the W3C PROV ontology (PROV-O). We illustrate the application of the ontologies for building Workflow Research Objects with a case-study that investigates Huntington's disease, performed in collaboration with a team from the Leiden University Medical Centre (HG-LUMC). Finally we present a number of tools developed for creating and managing workflow-centric research objects. © 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
As science becomes increasingly data driven, many scientists have adopted workflows as a means to specify and automate repetitive experiments that retrieve, integrate, and analyse datasets using distributed resources [1]. Using a workflow, an experiment can be defined as a graph where the nodes represent analysis operations, which can be supplied locally or accessible remotely, and edges specify dependencies between the operations.
The value of a workflow definition is not limited to its original author, or indeed to the original study for which it was created. Once specified, a workflow can be re-used or repurposed by other scientists. This reuse can be as a means of understanding an experimental process, replicating a previous experimental result, or even using the workflow as a building-block in the design of new workflow-based experiments. To support this potential for reuse, public repositories such as myExperiment [2] and CrowdLabs [3] can be used by scientists to publish workflow definitions and share them over the web.
However, sharing just the workflow specifications is not always sufficient to guarantee successful reuse. A previous empirical analysis of 92 workflows from myExperiment [4] demonstrated that nearly 80% of the workflows suffered from decay, in the sense that they could not be understood or executed when downloaded. These failures were shown to be a result of one or more of the following issues: (i) Insufficient documentation. The user was unable to grasp the analysis or experiment implemented by the workflow due to the lack of descriptions of its inputs, intermediate steps, and outputs. (ii) Missing example data. Even in situations where the users were able to understand the overall analysis implemented by the workflow, it was difficult to determine what kind of data values to use as inputs to successfully execute that workflow. (iii) Volatile third-party resources. Many workflows could not be run because the third-party resources they rely on were no longer available (e.g., web services implementing their steps). For example, the SOAP web services provided by KEGG 1 to query its databases have been replaced by REST web services. As a result, a large number of the workflows in myExperiment that use the SOAP services could not be run. (iv) Execution environment. In certain cases, the execution of the workflow required specific software infrastructure to be installed locally, e.g., the R statistical tool.
It is clear that in order to ensure the successful preservation of workflows, there is a need to change how we publish and share them. Specifically, we understand successful workflow preservation to be the immediate and continued ability to understand, run, and reuse the experimental process described by a workflow.
Issues 1, 2, and 4 above are all introduced at the point of the workflow's publication, through the omission of necessary supporting data or metadata. Issue 3 is instead a consequence of using third-party services as part of a workflow, and is a relevant issue in workflow decay [4]. Whilst the loss of third-party services is out of the control of the original authors, there are a number of approaches to remedy this type of workflow decay by making use of metadata, such as additional semantic descriptions about the services used [5] or provenance information [6-8], all of which can be either provided by the author of the workflow or automatically tracked and computed.
In light of this we propose a novel approach to workflow preservation where workflow specifications are not published in isolation, but are instead accompanied by auxiliary resources and additional metadata. Specifically we have chosen to adopt and extend the Research Object approach proposed in [9].
The Research Object approach defines an extendable model of data aggregation, and semantic annotation. At its core, the model allows us to describe aggregations of data and enrich that aggregation with supporting metadata. This aggregation can then be published and exchanged as a single artifact. Using this approach we have built a unit of publication that combines the workflow specification along with the supporting data and metadata required to improve preservation and the potential for reproducibility. Our implementation of workflow-centric research objects is realised as a series of ontologies that support both a core model of aggregation and the domain specific workflow preservation requirements.
In this paper we make the following contributions:
- We present a series of requirements for the data and metadata needed to accompany workflow specifications to support workflow preservation.
- We outline four ontologies that we have developed in response to those requirements, which can be used to describe Workflow-Centric Research Objects.
- We present a collection of tools that make use of those ontologies in the support and management of Workflow Research Objects.
- Finally, we present a series of competency queries that demonstrate how Workflow Research Objects support workflow preservation.
The remainder of this paper is organised as follows. We present the main requirements that guided the ontology development in Section 2. We present a case study from a Huntington's disease investigation for illustrating how the ontologies can be used (in Section 3). We present the ontologies in Section 4. We go on to present the tools we developed around them, and competency queries that can be answered using Workflow Research Objects 2 (in Section 5). We present and compare related work with ours in Section 6. Finally, we present our conclusions and future work in Section 7. The resources used in the paper are available online, 3 and the ontologies are documented online [10].

Requirements
Our previous work [4] identified a need to preserve more than just the workflow specifications in order to preserve their understandability, reusability and reproducibility. Related literature on supporting the preservation of software [11,12] and best practice recommendations on scientific reproducibility and computing [13-15] have further confirmed the need to preserve software, data and methods in aggregate. We present five requirements in detail that serve to establish the type of data and metadata needed to support workflow preservation.
R1. Example data inputs should be provided. Of the 92 workflows analysed in [4], 15% could no longer be run because they were not accompanied by any example data. Even when inputs were textually described, it was difficult to establish input data values to be used for their execution. Without input data, both experiment reproducibility and the ability to understand the function of the workflow are inhibited.

R2. Workflows should be preserved together with provenance traces of their data results. Provenance traces of executions allow users to track how results were produced by the workflow, and to repair broken workflows [6]. Past studies have shown the usefulness of provenance information in supporting workflow reproducibility [6,16-18]. The issues described in Section 1 could all benefit from the availability of detailed provenance information: issue 1, by replaying how the workflow functions [16] using the complete trace of all the computational tasks taking place in the workflow; issue 2, by finding example input data used by the workflow; issue 3, by retrieving the intermediate results produced in the original runs to resume workflow runs from the failure point; and finally issue 4, by retrieving information about the original computational environment, such as the OS and library dependencies as well as their versions. Extensive provenance tracking is the focus of many reproducibility efforts, such as VisTrails [16] and CDE [19]. This is also in line with the recommendations of several reproducibility best practice guidelines [13,14], which highlight the need to make all computational steps and parameter settings available.
A caveat to provenance is that the complexity of the traces can make it a challenge to quickly identify all the information needed to address the above questions, or to track as much provenance information as needed [20]. A well described workflow with good documentation provides a complementary means of understanding how a workflow should work, just like documentation for software tools or code.

R3. Workflows should be well described and annotatable. Insufficient documentation impairs the runnability and understandability of workflows [4]. In the software world, imprecise documentation has similarly been identified as a critical barrier [21] to code reproducibility.
A number of works have approached the issue of describing experimental processes and investigations, driven by different needs. Related approaches include: (1) capturing all the experimental steps and entities involved using a common vocabulary, so as to facilitate an interoperable understanding across investigations [22,23]; (2) capturing extensive scientific discourse information around investigations (including hypotheses, claims, evidence, etc.) in order to achieve automated knowledge discovery and hypothesis generation [24,25]; and (3) modelling scientists, publications, grants, and other entities associated with investigations in order to enhance the discovery of collaborators across disciplines and organisations [26].
To support the documentation of workflows we see the need for: (1) A structured description of the experimental steps carried out in a workflow in a system-neutral language, so that workflows from different systems can be annotated, queried and understood without relying on their specific language. Our description of the workflow therefore needs to provide a simplified high-level description, suitable for describing the steps of the workflow, but without the complexity of a fully functional and operational workflow language.
(2) Functional annotations for workflows as a whole. This is similar to the principle of documenting the ''design and purpose'' of software code advocated by Wilson et al. [15]; for example, describing the hypothesis to be tested by the workflow or providing sketches of the tasks it carries out.
(3) High-level functional annotations to the steps of a workflow, using controlled vocabularies, in order to facilitate a domain-specific understanding of what each task aims to achieve.

R4. Changes/evolutions of workflows, data, and environments should be trackable. According to our empirical study [4], volatility of third-party resources accounts for 50% of the causes of workflow decay. Although changes to third-party resources are not always under the author's control, mechanisms can be provided to remedy the issue. At the same time, attempting to re-run or reproduce a workflow in settings different from the original ones (at a different time, on a different machine, or using different datasets) is common practice in scientific research. Therefore, we must provide support for users to deal with, document, and subsequently trace through changes, so that they can: (1) retrieve the original version of the input data and environment settings in order to reproduce/verify the original results; (2) retrieve the different parameter configurations used to generate the different versions of outputs; and (3) identify the changes made to the workflow specification in the process of experimenting with alternative/replacement services, different parameter settings, etc.
This requires precise provenance tracking of workflow executions and workflow evolution. Provenance information can be very useful when the workflow is being adapted to run in a new environment, with different local libraries, operating systems or access to third-party resources [16].

R5. Auxiliary data and information should be packaged with workflows. Our requirement analysis highlights a need for publishing more than the workflow specifications themselves.
Guidelines for scientific reproducibility and scientific software development [13-15,21] have likewise identified the need to share data, methods and code in order to achieve scientific transparency and reproducibility.
We note however that beyond simply making these resources available, there is a need for a mechanism that links individual resources to the specific version of a workflow-based experiment, and describes its role in that experiment. The same file, database entry, or piece of data with an identifier, may be used in any number of experiments. It is the contextual information about the role it played that is required for understanding.
Without this important contextual information when workflows are shipped from one lab to another we may lose the link between specific versions and configurations in the sea of trials-and-errors. Being able to share all these resources and auxiliary information about them (like provenance or annotations) as a single entity, and keep this relationship information within, is therefore fundamental.
It is this need to not only aggregate content, but richly describe that aggregation that is the driving motivation for us to adopt the Research Object model for building our workflow-specific extensions.
In response to the requirements outlined above, we have developed four ontologies that support the creation of workflow-centric research objects:

ro: http://purl.org/wf4ever/ro#
wfdesc: http://purl.org/wf4ever/wfdesc#
wfprov: http://purl.org/wf4ever/wfprov#
roevo: http://purl.org/wf4ever/roevo#

These ontologies provide the mechanism to describe an aggregation of resources, and to enrich that aggregation with the metadata required for workflow preservation.

ro: The Research Object Ontology is used to specify aggregations of workflows, provenance traces and other auxiliary resources, and to annotate them (described in Section 4.3 and illustrated in Fig. 7).

wfdesc: The Workflow Description Ontology, in response to R3, is used to describe workflow specifications in a system-neutral manner (described in Section 4.1 and illustrated in Fig. 5).

wfprov: The Workflow Provenance Ontology, in response to R1 and R2, is used to describe the provenance traces obtained by executing workflows (described in Section 4.2 and illustrated in Fig. 5).
roevo: The Research Object Evolution Ontology, in response to R4, is used for describing the evolution of Workflow Research Objects; it makes it possible to track and describe the changes made to a Workflow Research Object at different levels of granularity (described in Section 4.4 and illustrated in Fig. 8).
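For concreteness, a single RDF document can draw on all four ontologies at once. A Turtle preamble combining their namespaces with those of the reused ORE and PROV vocabularies could look as follows (the prefix labels are conventional, not mandated):

```turtle
@prefix ro:     <http://purl.org/wf4ever/ro#> .
@prefix wfdesc: <http://purl.org/wf4ever/wfdesc#> .
@prefix wfprov: <http://purl.org/wf4ever/wfprov#> .
@prefix roevo:  <http://purl.org/wf4ever/roevo#> .
@prefix ore:    <http://www.openarchives.org/ore/terms/> .
@prefix prov:   <http://www.w3.org/ns/prov#> .
```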

Case study: investigating the epigenetic mechanisms involved in Huntington's disease
In this section we describe our case-study, a workflow-based experiment investigating aspects of Huntington's disease. This study was performed with a team of scientists from the Leiden University Medical Centre (HG-LUMC) as part of the EU FP7 Wf4ever project, a project focused on workflow preservation.
Huntington's disease (HD) is the most commonly inherited neurodegenerative disorder in Europe, affecting approximately 1 in 10,000 people. Although the genetic mutation that causes HD was identified 20 years ago [27], the downstream molecular mechanisms leading to the HD phenotype are still poorly understood. Transcriptional deregulation is a prominent feature of HD, with gene expression changes taking place even before the first symptoms arise. Epigenetic alterations can be responsible for such transcriptional abnormalities. Linking changes in gene expression to epigenetic information might shed light on the disease aetiology.
The team from HG-LUMC analysed HD gene expression data from three different brain regions that they integrated with publicly available epigenetic data to test for overlaps between differentially expressed genes in HD and these epigenetic datasets.
The epigenetic datasets considered in this analysis were CpG islands and chromatin marks. Epigenetic changes can switch genes on and off and control which genes are transcribed; they are therefore suspected to be implicated in various diseases. CpG islands and the selected chromatin marks are areas of the genome where these changes can occur. CpG islands are areas of the genome with a high concentration of CG dinucleotides where methylation occurs; if they are located near a gene promoter, methylation can affect the expression of that particular gene. Methylated areas of the genome are responsible for turning a gene off. Chromatin marks also play an important role in gene transcription by making chromatin regions accessible or repressed. The genes that overlapped with each of these epigenetic datasets were interpreted and prioritised using a text mining method called concept profile matching [28,29]. To interpret the gene lists, the scientists enriched them with annotations of biological processes and exported the annotations that describe them best. In addition, the gene list was further prioritised based on its relation to huntingtin (HTT) (Fig. 1).

Fig. 1 sketches the two main analyses that the scientists followed for gene interpretation, namely gene annotation and gene prioritisation. The ellipses in the figure represent data artifacts, rectangles represent analysis steps, and the edges specify the dataflow dependencies. Given a list of genes (which overlap with an epigenetic feature, CpG islands or one of the four chromatin states), gene annotation is used to gather information about the genes in the list. Gene prioritisation, on the other hand, is a two-step process. Given a concept (gene or biological process) that is provided as input by the scientists, a set of terms describing that concept is retrieved. The list of terms obtained as a result is then used to prioritise the gene list.
In the case of this example, the scientists were interested in prioritising the gene list against the huntingtin (HTT) concept.

Workflows
The steps illustrated in Fig. 1 were performed using three scientific workflows. Specifically, gene annotation, which consists of one step (AnnotateGenes in Fig. 1), was performed using the Taverna workflow annotate_genes_biological_processes.t2flow (illustrated in Fig. 2). Gene prioritisation, on the other hand, was performed using two workflows: the step getTermSuggestions (in Fig. 1) was performed using the workflow getConceptsSuggestionsFromTerm.t2flow, 4 and the prioritizeList step was performed using the workflow prioritize_gene_list_related_to_a_concept.t2flow. 5

Fig. 2 shows the workflow annotate_genes_biological_processes, written for the Taverna workflow system, that is used to annotate the gene list. The workflow uses a local knowledge base, mined from the literature by a text mining tool [30], to enrich the knowledge about a set of input genes. The workflow takes as input: a list of comma-separated Entrez gene identifiers; the name of the database used to map the gene IDs to the local concept profile identifiers (for Entrez gene IDs, the database name should be ''EG''); a cut-off parameter for the number of annotations to be obtained; and an identifier for the predefined concept set used in the local database (in this case ''5'' is used, which stands for biological processes). The workflow can be found on myExperiment: http://www.myexperiment.org/workflows/3921.html.

Creating a workflow research object
In order to preserve the workflows and their context, a Workflow Research Object is created that aggregates various information resources related to the workflows, including the original hypothesis, example inputs used for running the workflows, the workflow definitions themselves, metadata descriptions about them, and finally, execution traces of the workflow runs. Fig. 3 depicts the process by which the scientists in HG-LUMC created the Research Object to encapsulate the implemented workflows and all resources associated with the in silico analysis.
First, a blank ''pack'' is created in myExperiment through the myExperiment web portal [2]. A pack is a basic aggregation of resources, which can be workflows, files, presentations, papers, or links to external resources. From the viewpoint of myExperiment users, Workflow Research Objects take the form of packs. Indeed, Workflow Research Objects can be viewed as an evolution of myExperiment packs.
As a result of creating a blank pack, a resolvable identifier is allocated by the Research Object Digital Library (RODL) [31] for the new Workflow Research Object. Where myExperiment acts as a front-end for Workflow Research Objects, the RODL acts as a back-end for their storage and retrieval.
The scientists then populate the newly created Workflow Research Object by filling in the title and the description. They also provide a text file specifying the hypothesis that they are investigating using the workflows. The hypothesis for the HD analysis is as follows:

Epigenetic phenomena are implicated in Huntington's disease gene deregulation.
Also included is a sketch that depicts the main steps of the overall investigation, specified using a graphical drawing tool.
Specifications of workflows are provided in their native language, in this case the t2flow language of Taverna. These specifications are automatically transformed into the wfdesc format, which can be used, for instance, for querying the workflows and retrieving information about their constituent steps using the SPARQL query language [32]. The scientists upload files containing example inputs that can be used to feed the execution of the uploaded workflows, and specify which files can be used as an input for each workflow. These are then followed up with files containing the traces of workflow runs, obtained by executing the uploaded workflows. The traces are again automatically transformed, this time into the wfprov format.
Finally, the scientists provide a file summarising the conclusions drawn from the analysis of the workflow results. The contents of the conclusions file in the HD example are as follows:

The analysis of the results produced by the workflows we have designed allowed us to identify both known and novel associations with Huntington, and to prioritise mechanisms that are likely to be involved in HD and are associated with epigenetic regulation. A full analysis of the results is presented in Mina et al., 2014 [33].
The Workflow Research Object created for the HD investigation can be accessed online at http://purl.org/net/jwsRO446.

Workflow Research Object ontologies
An overview of the four ontologies is depicted in Fig. 4, which illustrates how our proposed models extend and link existing ontologies. The first two ontologies, wfdesc and wfprov, specify workflows and their provenance execution traces respectively, extending the W3C prov-o ontology [34]. Our third ontology, ro, aggregates workflow specifications, their provenance traces and other auxiliary resources, such as data files, images, etc.; ro extends the ore ontology [35] to specify aggregations, and uses the Annotation Ontology, ao [36], to specify annotations. Finally, the roevo ontology is used to specify the evolution of Workflow Research Objects; for this purpose, it extends the prov-o and ro ontologies. In what follows, we present the four Workflow Research Object ontologies in detail.

Specifying workflows using wfdesc
The workflow description vocabulary (wfdesc) 6 is used to describe the workflow specifications included in a Workflow Research Object. The features of the ontology were established by an examination of the core and overlapping concepts used in three major data-driven workflow systems: Taverna [37], Wings [38] and Galaxy [39].
The upper part of Fig. 5 7 illustrates the terms that compose the wfdesc ontology. Using this ontology, a workflow is described using the following three main terms: wfdesc:Workflow is used to represent workflows; it is defined as a subclass of prov:Plan [34].
wfdesc:Process is used to represent a step in a workflow. wfdesc:DataLink is used to specify data dependencies between the processes in the workflow. A data link connects the output of a given process to the input of another process, specifying that the artifacts produced by the former are used as input by the latter.

The workflow depicted in Fig. 2 can be expressed using wfdesc. 8 There are four processes, namely getSimilarConceptsProfile_input, getSimilarConceptsProfilesPredefined, get_scores and Merge_String_List_to_a_String_4. These processes are connected using three data links, D1, D2 and D3. For example, the data link D3 connects the output nodelist of the process get_scores to the input stringlist of the process Merge_String_List_to_a_String_4. The wfdesc RDF representation of the workflow fragment can be found in Appendix A.1, and the complete RDF file is available online at http://purl.org/net/gene_bio_process_wf.
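As an illustrative sketch (not the authoritative encoding, which is given in Appendix A.1), the data link D3 described above might be expressed in Turtle along the following lines. The example.org base URI is hypothetical, and the port and link property names (wfdesc:hasSubProcess, wfdesc:hasInput, wfdesc:hasOutput, wfdesc:hasSource, wfdesc:hasSink) follow the published wfdesc documentation [10]:

```turtle
@prefix wfdesc: <http://purl.org/wf4ever/wfdesc#> .
@prefix :       <http://example.org/annotate_genes#> .  # hypothetical base URI

:workflow a wfdesc:Workflow ;
    wfdesc:hasSubProcess :get_scores , :Merge_String_List_to_a_String_4 ;
    wfdesc:hasDataLink :D3 .

:get_scores a wfdesc:Process ;
    wfdesc:hasOutput :nodelist .          # output port of get_scores

:Merge_String_List_to_a_String_4 a wfdesc:Process ;
    wfdesc:hasInput :stringlist .         # input port of the merge step

# D3 connects the output of get_scores to the input of the merge step
:D3 a wfdesc:DataLink ;
    wfdesc:hasSource :nodelist ;
    wfdesc:hasSink   :stringlist .
```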

Describing workflow runs using wfprov
The wfprov ontology is used to describe the provenance traces obtained by executing workflows. The lower part of Fig. 5 illustrates the structure of the wfprov ontology and its alignments with the prov-o ontology.
wfprov:WorkflowRun represents the execution of a workflow. wfprov:ProcessRun represents the enactment of a process and it is a subclass of prov:Activity. wfprov:Artifact represents an artifact that is used or generated by a given process run and it is a subclass of prov:Entity.
Some example wfprov provenance information can be found on the right side of Fig. 6, obtained by enacting the workflow (Annotate_gene_list_w) represented on the left of the figure. It shows the process runs that are part of this workflow run (0a475274a985). It also specifies which input files were used (27f69bea08ae, efce37ddf040 and 5fc16e10c982) and the final result obtained by the workflow fragment (c7b7616a1b20), along with intermediate results. Parameter values and process runs are connected to the workflow descriptions using the properties wfprov:describedByParameter and wfprov:describedByProcess respectively. All the process runs are connected to the wfprov:WorkflowRun through the property wfprov:wasPartOfWorkflowRun, so we can navigate easily through them.
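A minimal Turtle sketch of such a trace, with shortened node names standing in for the identifiers above and a hypothetical base URI, might read as follows (the linking properties are taken from the wfprov documentation [10]):

```turtle
@prefix wfprov: <http://purl.org/wf4ever/wfprov#> .
@prefix :       <http://example.org/runs#> .  # hypothetical base URI

:wfrun a wfprov:WorkflowRun .                  # e.g., 0a475274a985

:prun a wfprov:ProcessRun ;
    wfprov:wasPartOfWorkflowRun :wfrun ;
    wfprov:usedInput :in1 ;                    # e.g., 27f69bea08ae
    wfprov:describedByProcess :get_scores .    # link back to the wfdesc description

:in1 a wfprov:Artifact .

:out1 a wfprov:Artifact ;                      # e.g., c7b7616a1b20
    wfprov:wasOutputFrom :prun .
```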
The wfprov RDF representation of the above example can be found in Appendix A.2, and the complete RDF file is available online at http://purl.org/net/jwsWfprov.

Workflow descriptions and provenance can be used to enrich the description of the data produced by the workflow, to indicate the original datasets used by the workflow to produce the results, and the transformations that were applied to the data retrieved from the original data sources. Such information can be used for crediting the authors of the original data sources, for enriching the textual description of the datasets produced as a result of the workflow execution, or even for specifying the creation date and the authors of the dataset, as recommended by the W3C Data Catalog Vocabulary. 10

Describing aggregations using the ro ontology
We developed the ro 11 ontology to build Research Objects that aggregate a workflow, provenance traces, and other auxiliary resources, e.g., hypotheses, conclusions, data files, etc. In its development we used and extended the ORE vocabulary [40], which defines a standard for the description and exchange of aggregations of Web resources. Workflow Research Objects are defined in terms of three main ORE concepts: ore:Aggregation, which groups together a set of resources so that they can be treated as a single resource.
ore:AggregatedResource, which refers to a resource aggregated in an ore:Aggregation. An ore:AggregatedResource can be aggregated by one or more ore:Aggregations and it does not have to be physically included in an ore:Aggregation. An ore:Aggregation can aggregate other ore:Aggregations. ore:ResourceMap, which is a resource that provides descriptions of an ore:Aggregation.
Using ORE, we defined the following terms for specifying Workflow Research Objects: ro:ResearchObject represents a Workflow Research Object; it is a sub-class of ore:Aggregation. ro:Resource represents a resource that can be aggregated within a Workflow Research Object and is a sub-class of ore:AggregatedResource. Typically, a ro:ResearchObject aggregates multiple ro:Resources, specified using the property ore:aggregates. ro:Manifest, a sub-class of ore:ResourceMap, represents a resource that is used to describe a ro:ResearchObject. It plays a similar role to the manifest in a JAR or a ZIP file, and is primarily used to list the resources that are aggregated within the Workflow Research Object.
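To make the ORE grounding concrete, the following Turtle sketch (with a hypothetical base URI and illustrative resource names drawn from the case study) shows a Research Object aggregating two resources and the manifest that describes it; ore:describes is the standard ORE property linking a resource map to its aggregation:

```turtle
@prefix ro:  <http://purl.org/wf4ever/ro#> .
@prefix ore: <http://www.openarchives.org/ore/terms/> .
@prefix :    <http://example.org/hd-ro/> .  # hypothetical base URI

:ro1 a ro:ResearchObject ;                  # an ore:Aggregation
    ore:aggregates :hypothesis.txt ,
                   :annotate_genes_biological_processes.t2flow .

:hypothesis.txt a ro:Resource .             # an ore:AggregatedResource
:annotate_genes_biological_processes.t2flow a ro:Resource .

# the manifest (an ore:ResourceMap) lists and describes the aggregation
:manifest.rdf a ro:Manifest ;
    ore:describes :ro1 .
```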
As well as being able to aggregate resources, we require a general mechanism for annotation. For this purpose, we make use of the Annotation Ontology (AO) [36]. The ro ontology reuses three main Annotation Ontology concepts for defining annotations: ao:Annotation, 12 used for representing the annotation itself; ao:target, used for specifying the ro:Resource(s) or ro:ResearchObject(s) subject to annotation; and ao:body, which comprises a description of the target.
Workflow Research Objects use annotations as a means for decorating a resource (or a set of resources) with metadata information. The body is specified in the form of a set of RDF statements, which can be used to annotate the date of creation of the target, its relationship with other resources or Workflow Research Objects, etc.
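As a sketch of this annotation pattern (hypothetical base URI; the body graph is shown only by reference, as in Fig. 7), using the ao:target and ao:body properties named above:

```turtle
@prefix ro: <http://purl.org/wf4ever/ro#> .
@prefix ao: <http://purl.org/ao/core/> .
@prefix :   <http://example.org/hd-ro/> .  # hypothetical base URI

# a1 annotates the hypothesis file; its body is an RDF graph held in file1.rdf
:a1 a ao:Annotation ;
    ao:target :hypothesis.txt ;
    ao:body   :file1.rdf .
```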
The Workflow Research Object model does not prescribe specific vocabularies to be used for annotations. Users are free to use vocabularies that they deem suitable for encoding their annotations. The intention is to keep the ro ontology as domain neutral as possible.
Note that for the next release of the ro ontology, we intend to use the W3C Open Annotation model, 13 which is a feature-compatible successor to the Annotation Ontology.

Fig. 7 illustrates a fragment of the RDF ''manifest'' file that describes the Workflow Research Object from our case-study. The node wfro represents the Workflow Research Object, which is composed of a number of ro:Resources, e.g., a text file specifying the hypothesis, a sketch specifying the overall experiment, and t2flow files specifying the Taverna workflows match_two_gene_lists_prioritize_gene_list.t2flow and explainScoresStringInput2.t2flow. The Workflow Research Object is described using a manifest file, which extends the ORE term ore:ResourceMap. The figure also illustrates an annotation, labelled a1, which is used to describe the workflow explainScoresStringInput2.t2flow using a named graph encoded within the file file1.rdf.

Tracking research object evolution using the roevo ontology
The roevo ontology is used for describing the evolution of Workflow Research Objects. Specifically, it allows one to track and describe the changes made to a Workflow Research Object at different levels of granularity: the changes made to it as a whole (its creation and current status) and the changes made to the individual aggregated resources (additions, modifications and removals). The roevo ontology extends the prov-o ontology, which provides the foundational information elements for describing the evolution of Research Objects. roevo:VersionableResource represents a resource that is subject to evolution, which can be a roevo:SnapshotRO, a roevo:ArchivedRO, a ro:Resource, or a ro:AggregatedAnnotation. Since we want to track the provenance of a roevo:VersionableResource, we consider this class to be a sub-class of prov:Entity. roevo:ChangeSpecification designates a set of (unit) changes (additions, removals or updates) that, given a roevo:VersionableResource, yields a new roevo:VersionableResource (see the object properties roevo:fromVersion and roevo:toVersion in Fig. 8). roevo:Change designates a (unit) change, which can be the addition, removal or modification of a resource or a Workflow Research Object.
Changes are chronologically ordered using the roevo:hasPreviousChange property.
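As a hedged Turtle sketch of these constructs (snapshot URIs are hypothetical, and the property local names follow the text and Fig. 8, so they may differ slightly in the published ontology):

```turtle
@prefix roevo: <http://purl.org/wf4ever/roevo#> .

# A change specification turning one snapshot into the next.
<#changeSpec> a roevo:ChangeSpecification ;
    roevo:fromVersion <snapshot-1> ;
    roevo:toVersion   <snapshot-2> .

# Two unit changes, chronologically ordered.
<#change1> a roevo:Change .                  # e.g. removal of a resource
<#change2> a roevo:Change ;                  # e.g. addition of its replacement
    roevo:hasPreviousChange <#change1> .
```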
To illustrate how the roevo ontology can be used, consider a Workflow Research Object that contains the workflow illustrated in Fig. 9. Some time after its creation, the workflow could no longer be run because the web service that implements the process explainScoresStringInput was no longer available. To repair the workflow, the scientist created a new Workflow Research Object containing a new workflow, obtained by replacing the process explainScoresStringInput with a process associated with an available web service that performs the same task as the unavailable one. The roevo ontology allows capturing the evolution of the Workflow Research Object at different granularities, as we describe below.

Fig. 10 illustrates how the evolution at the level of the Workflow Research Object can be captured using roevo. It specifies that the Research Object named data_interpretation-2-snapshot was revised to give rise to a new Workflow Research Object named data_interpretation-2-snapshot-1; notice that we make use of the prov-o object property prov:wasRevisionOf. data_interpretation-2-snapshot-1 was obtained using a change specification consisting of two (unit) changes that are ordered using the property roevo:hasPreviousChange. The first change removes a resource representing a file containing the specification of the Taverna workflow annotate_genes_biological_processes.t2flow. The second change adds a file containing the specification of a new Taverna workflow, annotate_genes_biological_processes_xpath_cpids.t2flow.

Fig. 11 illustrates how the evolution can be captured at a finer grain, i.e., at the level of the workflow instead of the Research Object. It specifies that the workflow annotate_genes_biological_processes_xpath_cpids.t2flow was a revision of the workflow annotate_genes_biological_processes.t2flow.
Such a revision took place using a change specification that consists of six (unit) changes, ordered using the roevo:hasPreviousChange property:
(i) change1 removes the datalink oldDataLink1, connecting the process explainScoresStringInput_input and the process explainScoresStringInput in the workflow (see Fig. 9).
(ii) change2 removes the datalink oldDataLink2, connecting the process explainScoresStringInput and the process explainScoresStringInput_output.
(iii) change3 removes oldProcess, representing the process explainScoresStringInput.
(iv) change4 adds newProcess, which represents the new process associated with an available web service.
(v) change5 adds a datalink, newDataLink1, connecting the process explainScoresStringInput_input to the newly added process.
(vi) change6 adds a datalink, newDataLink2, connecting the new process to the process explainScoresStringInput_output.
The RDF turtle listing of the above example can be found in Appendix A.3.

The Workflow Research Object family of tools
We have developed a suite of tools to support scientists in creating, annotating, publishing and managing Workflow Research Objects. The Research Object Manager (described in Section 5.1) is a command line tool for creating, displaying and manipulating Workflow Research Objects. It incorporates the essential functionality for Workflow Research Object management, aimed especially at developers and a technically skilled audience used to working in a command-line environment. The Research Object Digital Library (RODL, described in Section 5.2) acts as a full-fledged back-end. RODL incorporates capabilities to deal with collaboration, versioning, evolution and quality management of Workflow Research Objects. Finally, we have also extended the popular virtual research environment myExperiment [2] to allow end-users to create, share, publish and curate Research Objects (Section 5.3). The developed tools are interoperable. For example, a user can utilise the Research Object Manager to create Research Objects and upload them to the RODL portal or the development version of myExperiment, where they can undergo further changes.

The Research Object Manager
The Research Object Manager is a command line tool for creating, displaying and manipulating Workflow Research Objects. It is primarily designed to support a user working with Workflow Research Objects in the user's local file system. The Research Object Manager and RODL can exchange Workflow Research Objects using the Workflow Research Object vocabularies. The Research Object Manager also includes checklist evaluation functionality, which is used to evaluate whether a given Workflow Research Object satisfies pre-specified properties (e.g., the input data is declared, the hypothesis of the experiment is present, the Workflow Research Object has some examples to play with, etc.).
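A hypothetical session might look like the following; the command names echo the online user guide, but the exact arguments and flags may differ between releases:

```shell
# Hypothetical RO Manager session (arguments illustrative).
ro create "HD data interpretation"               # initialise a new Research Object
ro add workflow.t2flow                            # aggregate a local file
ro annotate workflow.t2flow title "Gene annotation workflow"
ro evaluate checklist checklist.rdf ready-to-run  # check pre-specified properties
```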
The Research Object Manager is documented in a user guide that is available online. The source code is maintained in the Wf4ever GitHub repository.

Research object digital library (RODL)
RODL is a back-end service: rather than providing a user interface directly, it exposes interfaces through which client software can interact with it and provide different user interfaces for managing Workflow Research Objects.
The main system-level interface of RODL is a set of REST APIs, including the Research Object API and the Research Object Evolution API. The Research Object API, also called the Research Object Storage and Retrieval API, defines the formats and links used to create and maintain Workflow Research Objects in the digital library. Given that semantic metadata is an important component of a Workflow Research Object, RODL supports content negotiation for the metadata resources, including formats such as RDF/XML, Turtle and TriG.
The Research Object Evolution API defines the formats and links used to change the lifecycle stage of a Workflow Research Object, to create an immutable snapshot or archive from a mutable live Workflow Research Object, and to retrieve the evolution provenance of a Workflow Research Object. The API follows the roevo ontology (see Section 4.4), which is visible in the evolution metadata generated for each state transition.
Additionally, RODL provides a SPARQL endpoint that allows queries over HTTP against the metadata of all stored Workflow Research Objects. A running instance of RODL is available for testing.
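As a hedged sketch of the kind of query such an endpoint can answer (the choice of dct:creator as the creator property is illustrative, not mandated by the model):

```sparql
# List every Research Object in the library with its creator.
PREFIX ro:  <http://purl.org/wf4ever/ro#>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?ro ?creator
WHERE {
  ?ro a ro:ResearchObject ;
      dct:creator ?creator .
}
```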

Workflow Research Object-enabled myExperiment
myExperiment [2] is a virtual research environment targeted towards collaborations for sharing and publishing workflows (and experiments). While initially targeted at workflows, the creators of myExperiment were aware that scientists needed to share more than just workflows and experiments. Because of this, myExperiment was extended to support the sharing of packs. At the time of writing, myExperiment had 337 packs. Just like a workflow, a pack can be annotated and shared. The notion of a Research Object, presented in this paper, can be viewed as an extension of the myExperiment pack. A myExperiment pack is like a folder whose constituent resources can be virtual (not necessarily files). It allows aggregating resources, versioning them, and specifying the kinds of the constituent resources as well as the relationships between those resources.
In order to support more complex forms of sharing, reuse and preservation, we have incorporated the notion of Workflow Research Objects into the development version of myExperiment. In addition to the basic aggregation supported by packs, this alpha version of myExperiment provides mechanisms for specifying metadata that describes the relationships between the resources within the aggregation. For example, a user is able to specify that a given file represents a hypothesis, a workflow run obtained by enacting a given workflow, or conclusions drawn by the scientists after analysing that run.

Example competency queries
To illustrate the potential of Workflow Research Objects for preservation, and the value of their structured representation, we have developed a series of competency queries. These queries are designed to evaluate our approach by demonstrating the ability to answer questions about a workflow's data and metadata, and have been drawn from the requirements outlined in Section 2.
The queries are capable of:
(i) retrieving metadata associated with a workflow description (addressing requirement R3);
(ii) retrieving information about the relationship between workflow descriptions and workflow runs (addressing requirement R2);
(iii) retrieving lineage information associating the results of a workflow run with its inputs (addressing requirement R2);
(iv) detecting differences between two versions of a Workflow Research Object (addressing requirement R4);
(v) retrieving information about the relationship between a Workflow Research Object and the data artifacts it encompasses (addressing requirement R1).
All queries can be seen to address requirement R0, being predicated on the availability of additional data or metadata. In this section we evaluate the queries against the structured data and metadata captured in our HD case-study Research Object. For each query we present a description, its translation into SPARQL, and the results obtained by evaluating it.

Query 1 Find the creator of the Workflow Research Object. This query is useful, e.g., for the (re)user of the workflow to identify the person to credit.
The SPARQL [32] query for answering this question can be formulated as follows.
Query 2 Find the workflow used to generate the reported gene annotation result. This query can be used to identify the experiment (workflow) that generated a given result.
The SPARQL query for answering the above question can be formulated as follows. Evaluating it returns a list of workflows, which can be found in the SPARQL results listed in Appendix A.4.
Query 3 Find the inputs used to feed the execution of the workflow that generated a given result. This is an example of a lineage query, used to identify the input data values that contributed to a result obtained from a workflow execution. The SPARQL query for answering the above question can be formulated as follows:
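One hedged sketch of such a lineage query, assuming the wfprov vocabulary (the result URI and the exact property names are illustrative rather than taken from the published listing):

```sparql
# Inputs used by the workflow run that produced a given result.
PREFIX wfprov: <http://purl.org/wf4ever/wfprov#>

SELECT DISTINCT ?input
WHERE {
  <http://example.com/result_1> wfprov:wasOutputFrom ?run .
  ?run wfprov:wasPartOfWorkflowRun ?wfRun .
  ?pr  wfprov:wasPartOfWorkflowRun ?wfRun ;
       wfprov:usedInput ?input .
}
```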
The SPARQL results of the above query can be found in Appendix A.4.

Query 5 Find the Workflow Research Objects that use a given gene association file as input.
The SPARQL query presented below is used to retrieve Workflow Research Objects that use a given gene ontology association file, identified by <http://example.com/gaf_1>, as input. More specifically, the query retrieves the Workflow Research Objects that contain a workflow run, such that the gene ontology association file in question is used by a process run that belongs to that workflow run.
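A hedged sketch of such a query follows; the vocabulary choices mirror the description above, but the published listing may differ in detail:

```sparql
# Research Objects aggregating a workflow run in which some process
# run used the given gene association file as input.
PREFIX ro:     <http://purl.org/wf4ever/ro#>
PREFIX ore:    <http://www.openarchives.org/ore/terms/>
PREFIX wfprov: <http://purl.org/wf4ever/wfprov#>

SELECT DISTINCT ?ro
WHERE {
  ?ro a ro:ResearchObject ;
      ore:aggregates ?wfRun .
  ?pr wfprov:wasPartOfWorkflowRun ?wfRun ;
      wfprov:usedInput <http://example.com/gaf_1> .
}
```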

Related work
This section presents three strands of related work: existing approaches for preserving workflows, approaches for representing additional information about workflows driven by different motivating requirements, and approaches for representing bundle structure.

Scientific workflow preservation
Preservation of digital objects is a long-studied topic in the digital preservation community. There is a growing recognition that preserving scientific or business workflows requires new features to be introduced, particularly given the dynamic nature of these objects. A few recent proposals from this community have taken a similar approach to ours [41,42], preserving more than the process objects themselves and capturing additional contextual information about the processes, the data, and the human actors involved in the processes. So far these works place more emphasis on the preservation of software or business processes, which complements our focus on scientific workflows.
Another aspect of workflow preservation is to provide the infrastructure to support the enactment and execution of workflows in the long term. Virtual machines can be used to package up all the original settings and dependency libraries that are needed to re-enact a workflow. Similarly, packaging tools such as Docker [21] or ReproZip [17] can help users to create relatively lightweight packages that include all the dependencies required to reproduce a workflow or a computational experiment. We too have a zip-based serialisation of our Workflow Research Objects, described in the RO Bundle Specification [43]. These approaches differ from ours in that they lack a structured description of the aggregation. As a result, they lack a convenient mechanism to attach arbitrary annotations. They are also limited to aggregating resources that can be directly serialised, and lack the ability to describe an aggregation that includes remote resources, such as large third-party databases.
A number of existing scientific workflow systems take a similar approach to ours in making use of provenance tracking to support reproducibility and enable workflow preservation (e.g., VisTrails [44] and Wings [38]). Provenance information is particularly helpful when a workflow can no longer be executed [6,45], due to changes to third-party resources used by the workflow or to the execution environment (such as the OS or dependency libraries).

Workflow/experiment descriptions
Analogous to software documentation, descriptions of a workflow, such as its main function and how it is divided into smaller steps, are also critical for understanding and preserving workflows. This is particularly useful when information like provenance is unavailable, incomplete or incomprehensible. Existing workflow systems use different languages to specify their workflows, which presents a challenge for interpreting a workflow description or querying workflow execution traces. Several attempts have been made to represent workflows from different systems in a unified language, but driven by different requirements than ours. For example, the IWIR model [46] was designed as an interchange language to make workflow templates interoperable among workflow systems. In our work, we focus on the descriptions of workflows, their steps and their resources for proper preservation, leaving out of scope whether a template can be imported by another workflow system.
Other related efforts are D-PROV [47] and OPMW [48], developed in parallel to our vocabularies. Their scope is similar to ours, but the complexity of the workflow patterns covered by each model is different. This is partially due to the different types of workflow systems that were used to drive their design requirements. D-PROV aims at representing complex scientific workflows, which may include loops and optional branches. OPMW takes a simpler approach, modelling pure dataflow workflows. Driven by our requirements, wfdesc does not cover loop or branch patterns, which are uncommon in the majority of scientific workflow systems. But it does provide descriptions for sub-workflows, i.e., nested workflows included as part of a given workflow, which is a pattern not covered by OPMW.
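The nested-workflow pattern can be sketched in Turtle as follows; the local names are illustrative, and the property name follows the wfdesc vocabulary as we understand it, so the published ontology may differ in detail:

```turtle
@prefix wfdesc: <http://purl.org/wf4ever/wfdesc#> .

# An outer workflow whose second step is itself a workflow.
<#outer> a wfdesc:Workflow ;
    wfdesc:hasSubProcess <#step1>, <#inner> .

<#step1> a wfdesc:Process .
<#inner> a wfdesc:Workflow ;          # a nested workflow is also a process
    wfdesc:hasSubProcess <#innerStep> .
<#innerStep> a wfdesc:Process .
```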
wfdesc is essentially aimed at capturing the core structure of scientific workflows. If one needs to capture the bigger context, for example about the scientific experiments or investigations, existing vocabularies can be used for this purpose. For example, OBI (Ontology for Biomedical Investigations) and the ISA (Investigation, Study, Assay) model are two widely used community models from the life science domain for describing experiments and investigations. OBI provides common terms, like investigations or experiments, to describe investigations in the biomedical domain [22]. It also allows the use of domain-specific vocabularies or ontologies to characterise the experimental factors involved in an investigation. ISA structures the descriptions of an investigation into three levels: Investigation, for describing the overall goals and means of the experiment; Study, for documenting information about the subject under study and the treatments it may have undergone; and Assay, for representing the measurements performed on the subjects. We have shown how the ISA framework can be used together with Research Objects to capture the bigger context of a scientific investigation and boost the reproducibility of its results [49].

Scientific investigation preservation and packaging
The Knowledge Engineering from Experimental Design (KEfED) model aims to capture more than the process of a scientific investigation. The model provides a formalisation of observational and interpretational reasoning [50]. It is driven by the need to enable reasoning over scientific observations by curating the observations and the process leading to the experimental results. While KEfED allows designing workflow-like processes, it is not built on a standard vocabulary. Moreover, it does not capture the evolution of workflow descriptions.
The Core Scientific Metadata Model (CSMM) [51] is a model for organising data by studies. It aims to capture high-level information about scientific studies and the data they produce. It is currently deployed and used in data management infrastructure developed for large-scale scientific facilities, such as the ISIS Neutron Source [52] and the Diamond Light Source [53]. The model provides a hierarchical way to manage scientific investigations, by research programme, projects and studies, and a way to categorise datasets into collections and files and associate them with individual investigations. Compared with Workflow Research Objects, CSMM does not provide constructs for specifying workflows or capturing their provenance traces.

Representation of packaging structure
Several efforts have been proposed to allow scientists to package resources that are relevant to a given investigation. For example, Scientific Publishing Packages (SPP) [54] are compound digital objects that encapsulate a collection of digital scientific objects, including raw data, derived products, algorithms, software and textual publications, in order to provide a context for the raw data. Their initial goal was to enable digital libraries to consume all these diverse information objects related to the scientific discovery process as one compound digital object. The model has a strong notion of data lineage, enabling the expression of the provenance of derived data results. However, to our knowledge this work has not been widely adopted and is no longer actively developed. Unlike Workflow Research Objects, SPP does not cater for the description of workflows, their provenance traces, or their evolution.
ReproZip [55] is a tool that records workflows of command-line executions and associated resources, including files, dependency libraries and variables. It then creates a package that can be used to rerun and verify the reproducibility of such workflows. Compared with Workflow Research Objects, ReproZip is confined to capturing command-line executions that invoke local programs; in Workflow Research Objects, we target workflows that make use of distributed services that are not necessarily accessible locally. Moreover, ReproZip does not capture information about the evolution of workflows over time, and it adopts a proprietary language for workflow specifications.
Provenance-To-Use (PTU) [18] is similar to ReproZip. PTU relies on a user-space tracking mechanism for better portability instead of a kernel-based provenance tracing mechanism. Like ReproZip, PTU adopts a proprietary language for specification, and it does not capture the evolution of the workflow specification.
Science Object Linking and Embedding (SOLE) is a system that allows linking articles with science objects [56]. A science object can be the source code of a piece of software, a dataset or a workflow. SOLE allows the reader (curator) to specify human-readable tags that link the paper with science objects. It transforms each tag into a URI that points to a representation of the corresponding science object. While their objective is similar to ours, the authors of SOLE take the view that the scientific article is the main object that contains links to other (science) objects. In our case, we focus on scientific workflows and link them to other resources, e.g., their provenance traces.
Our goal of workflow preservation is also related to facilitating reproducibility. It aims to complement many other existing efforts that approach reproducibility through policy and infrastructure (such as runmycode.org [57]), preservation of the computation environment (such as SHARE [58]), the creation of executable papers (e.g., Utopia [59], Sweave [60] or IPython [61]), and the organisation of reproducibility case studies and assessments (e.g., the Reproducibility Initiative or Mozilla Science Code Review [62]).
To enable reproducibility, the above proposals tend to opt for having all the data sources and computational tools necessary for executing the steps of the computation available locally. While this approach is desirable, it is not always possible. In many cases, scientists want to use datasets and tools that are remote and cannot be deployed locally, either because the providers of those datasets and tools do not wish to grant access to their resources, or because the resources are large or computationally expensive. Our approach allows scientists to describe the use of remote resources in their analyses, and provides the means to gather information about those resources and their relationships. While this does not guarantee full reproducibility, we believe it is more realistic, and a step towards enabling reproducibility.

Conclusions
We have presented in this paper a novel approach to scientific workflow preservation that makes use of a suite of ontologies for specifying Workflow Research Objects. These Research Objects contain workflow specifications, provenance traces obtained by executing the workflows, information about the evolution of the Workflow Research Object and its component elements, and annotations describing the aggregation as a whole using existing ontologies. We have also reported on available tools that can be used to create and preserve Workflow Research Objects through repositories like myExperiment.
While the notion of Workflow Research Object was initially developed as part of the Wf4ever project, its ethos, models and tools are being adopted and exploited by other communities, such as digital preservation (e.g., the EU SCAPE and Timbus projects) and workflow-based scientific research (e.g., the EU BioVeL project). In our ongoing work, we seek to collaborate with these communities, as well as others, such as Open Access publishers (e.g., GigaScience) and digital libraries (e.g., FigShare or Dataverse [63]), to improve the Workflow Research Object concept and vocabularies. We also intend to align our ontologies with existing similar standards and initiatives, such as ISA, OPMW and D-PROV.
We believe that the work presented in this paper has the potential to:
- Facilitate the process by which scientists package and annotate the resources necessary for preserving their scientific workflows.
- Encourage scientists to reuse existing workflows. For example, users will have elements that allow them to understand the workflow they are reusing, e.g., example inputs and provenance traces.
- Emphasise the importance of associating datasets with computations (workflows), their provenance, and the people involved.
As such, we think that the work presented in this paper has the potential to promote data citation and its associated advantages, such as encouraging data sharing, tracking data usage, encouraging the enrichment of publications, assuring the long-term availability of data and increasing trust in research findings.