Connecting Data Publication to the Research Workflow: A Preliminary Analysis

The data curation community has long encouraged researchers to document collected research data during active stages of the research workflow, to provide robust metadata earlier, and support research data publication and preservation. Data documentation with robust metadata is one of a number of steps in effective data publication. Data publication is the process of making digital research objects ‘FAIR’, i.e. findable, accessible, interoperable, and reusable; attributes increasingly expected by research communities, funders and society. Research data publishing workflows are the means to that end. Currently, however, much published research data remains inconsistently and inadequately documented by researchers. Documentation of data closer in time to data collection would help mitigate the high cost that repositories associate with the ingest process. More effective data publication and sharing should in principle result from early interactions between researchers and their selected data repository. This paper describes a short study undertaken by members of the Research Data Alliance (RDA) and World Data System (WDS) working group on Publishing Data Workflows. We present a collection of recent examples of data publication workflows that connect data repositories and publishing platforms with research activity ‘upstream’ of the ingest process. We re-articulate previous recommendations of the working group, to account for the varied upstream service components and platforms that support the flow of contextual and provenance information downstream. These workflows should be open and loosely coupled to support interoperability, including with preservation and publication environments. Our recommendations aim to stimulate further work on researchers’ views of data publishing and the extent to which available services and infrastructure facilitate the publication of FAIR data. 
We also aim to stimulate further dialogue about, and definition of, the roles and responsibilities of research data services and platform providers for the ‘FAIRness’ of research data publication workflows themselves.


Introduction
The data curation community has long encouraged researchers to document their collected research data early in the research workflow in a manner that will support publication and preservation, as well as future understanding, interoperability, and reuse. Introducing continuous, considered, and consistent data documentation practices to the research workflow, as close to the data collection point as possible, should reduce the data sharing burden that researchers associate with the data deposition process. Documentation of data closer in time to data collection would also help mitigate the high cost that repositories associate with the ingest process, as seen in the work by Beagrie et al. (2008 and 2010). In this extension to the Data Publishing Workflows working group's initial report (Austin et al., 2015), we provide a preliminary review of a selection of research workflows, with the intent of identifying connections between the goal of research data publication and the incorporation of such 'upstream' barrier-reducing measures into the research workflow. These measures, intended to facilitate data publication, might include data preparation practices that offer the possibility of interaction with a repository before that data is ready for publication, use of workflow platforms that support such practices, and participation in "a novel publishing paradigm where 'publishing' is intended as making a product online available, discoverable, peer-reviewable, reusable according to given rights, real-time accessible, citable, and interlinked with its research activity and associated products" (Assante et al., 2015). A sampling of the research workflows that may be influenced in such a way was described by Addis (2015) and includes data management planning; collection, creation, analysis, and use of data; data selection and access decisions; resolving ethical issues through de-identification; and publication.
Frequently, moving data curation activities closer to the research activity implies integrating elements such as code, software, models, documentation, and other products from the research process. It is important to understand how the intention to publish data might be made integral to the research workflow. For example, considering how best to cross-link the scholarly literature with different software releases and different versions of data, as well as considering integration with lab books, IPython notebooks and so on, could be important contributions to making data publishing a usual part of the researcher's workflow.
Given the range of disciplinary practices in the research workflow, it is to be assumed that such an analysis will show stronger differences than this group's previous analysis (Austin et al., 2015). The latter focused on high-level outcomes that are broadly applicable to multiple disciplines and are less subject to the dynamic changes that are frequently found in research processes. The work of Williams and Pryor (2009) underlines the complexity and diversity of research workflows and respective information flows across disciplines.
As a first step in the direction of the upstream analysis, we consider how the examples collected instantiate the report's recommendations on data publishing.

Examples of upstream workflows
Twelve workflows were collected, representing a range of disciplines and broader projects. Here we present a first step towards a state-of-the-art review of current data publishing solutions or projects that tie in with an active research workflow. They are put into the context of the recommendations made by Austin et al. (2015). In addition to these research workflows, several workflow tools and guidelines that support and enable research workflows that produce data publications were recommended to us during our collection process. These included the Berkeley Initiative for Transparency in the Social Sciences, the Open Science Framework, and Taverna, which are described in more depth in the Appendix.

Start small, building modular, open source and shareable components
Research workflow examples such as those detailed by IPCC-DDC (WDCC) and CERN provide additional components that are small, modular, open source and shareable, and which clearly complement the previous static data publication workflow analysis (Austin et al., 2015). Other workflows accommodate the more complex research workflows and the "work in progress" nature of some of the content elements by establishing a counterpart that allows early referencing and versioning, and that often facilitates collaborative communication. It should be noted that access is often restricted to content that is "work in progress". The diverse content that might be accumulated in such an approach could be published openly, following the example of nanopublications (for data and software).
Nanopublications offer, in part, an established method for enhancing reproducibility by way of data modelling frameworks and executable workflows. González-Beltrán et al. (2015) conducted an experiment to reproduce the results from a selected life science paper using a range of nanopublication methodologies. Their resulting paper provides useful insights into both the relative merits of the systems themselves and the reasons why better systems are needed to support reproducibility. The authors also point out that, if the principles of nanopublication can be evaluated and accepted by a critical mass of the research community, they could strengthen the scholarly communications model throughout its lifecycle: at points within the research, throughout the review process, and in the publication model.
Some of the represented workflows pay more attention to the computational components, which is reasonable for those areas of research that are heavily computational. The implementation of standardized, automated components (with instructions on how to use data and related materials) is considered an important step for future reproducibility of research (see for example Gil et al., 2007). One example of an executable workflow is the integration of the Galaxy platform with the data journal GigaScience and with open RDM platforms such as myExperiment.
A closer connection between digital research infrastructure, traditional repositories and research communities is proposed in the concept of Science 2.0 Repositories. Assante et al. (2015) suggest that research infrastructure services should intercept and publish research products, whilst providing researchers with social networking tools for discovery, notification, sharing, discussion, and assessment of research outputs. Even though it is possible that the concept could be implemented in a modular way, the overall concept provides a single locus that would serve one or multiple research communities with a complex information system.
In summary, with the diverse content that results from an upstream research workflow, there is a need to address the individual needs (metadata, restrictions, publication products) step by step together with the research community.

Follow standards that facilitate interoperability and permit extensions
This preliminary analysis underlines the need to understand and distinguish between the different available standards: e.g. disciplinary and generic metadata standards, and standards for exchanging and exposing data. As dependencies between modules and objects might be more prevalent upstream in the research workflow (e.g. data that can only be analyzed with specific software), it is vital to ensure components can exchange information smoothly and with minimal information loss. This reinforces the ongoing work on the FAIR Data Principles being coordinated by FORCE11. At the time of writing, the principles are still open for community consultation, but in essence they strongly encourage an approach that makes research data "Findable, Accessible, Interoperable, and Reusable" for both humans and machines.
The advanced solutions identified by this analysis predominantly serve specific disciplines or communities, including life and biomedical sciences, climate sciences and high energy physics. All of them provide standardized interfaces between the components (closed and open counterparts) as well as data curation and standardization support. One example is IPCC-DDC, which uses detailed project naming conventions for directory structures, data header information and file names. It appears that these solutions are being developed relatively closely with the research communities employing them. Improvements to workflows for data discovery and exchange are still required; for example, current solutions utilize JSON and JSON-LD. Many solutions in this preliminary analysis use APIs to exchange and expose information about their content. While APIs at least enable the exchange of metadata across workflows, more open and sustainable approaches are based on open access protocols, and on vocabularies openly published as Linked Open Data. Metadata captured upstream in the research process needs to be clearly exposed if it is to be reused by others and the benefits fully realised.
There is a growing number of electronic laboratory notebooks (ELNs) intended to help incorporate metadata curation into the data production workflow. The term 'curation at source' (Frey, 2008) has been used for such attempts to make metadata creation more effective, more efficient, and less error-prone. Two of the examples submitted to the Working Group illustrate this: "RSpace ELN to DataShare Repository" and "Ontologies for research data tools". In the first, open standards are deployed to enable deposit from a proprietary ELN to an institutional data repository. In the second, Linked Open Data is used to enrich research workflows with relevant descriptors, which may be published as domain ontologies further upstream.
In the first example, "RSpace ELN to DataShare Repository", the workflow enables researchers to deposit directly from the RSpace electronic lab notebook (ELN) to the DataShare institutional data repository. The ELN content is exported as XML documents and packaged as a zip archive with a METS descriptive header, including the DataCite minimum metadata required for DataShare. The packaged content, including citation metadata, is deposited to DataShare using the SWORD protocol. This workflow results from a partnership between the University of Edinburgh and Research Space, a provider of ELN software.
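The packaging step described above can be illustrated with a minimal Python sketch. It is not the RSpace or DataShare implementation: the function names, the file name mets.xml, the choice of four DataCite properties (creator, title, publisher, publication year) and the simplified METS layout are all assumptions made for illustration, and the SWORD deposit itself is left out because it depends on the repository's endpoint and credentials.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
DATACITE_NS = "http://datacite.org/schema/kernel-4"

# Assumption: a subset of DataCite's mandatory properties; the identifier is
# taken to be assigned by the repository on deposit.
REQUIRED_FIELDS = ("creator", "title", "publisher", "publicationYear")


def build_mets_header(metadata, file_names):
    """Return a simplified METS-style header (bytes) describing the package."""
    ET.register_namespace("mets", METS_NS)
    ET.register_namespace("xlink", XLINK_NS)
    ET.register_namespace("datacite", DATACITE_NS)

    root = ET.Element(f"{{{METS_NS}}}mets")

    # Descriptive metadata section wrapping the DataCite-style fields.
    dmd = ET.SubElement(root, f"{{{METS_NS}}}dmdSec", {"ID": "dmd-1"})
    wrap = ET.SubElement(
        dmd, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "OTHER", "OTHERMDTYPE": "DataCite"}
    )
    xml_data = ET.SubElement(wrap, f"{{{METS_NS}}}xmlData")
    for field in REQUIRED_FIELDS:
        ET.SubElement(xml_data, f"{{{DATACITE_NS}}}{field}").text = str(metadata[field])

    # File section listing every exported ELN document in the archive.
    file_sec = ET.SubElement(root, f"{{{METS_NS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "CONTENT"})
    for index, name in enumerate(file_names, start=1):
        entry = ET.SubElement(group, f"{{{METS_NS}}}file", {"ID": f"file-{index}"})
        ET.SubElement(
            entry,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": name},
        )
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def build_package(documents, metadata):
    """Zip exported XML documents together with a METS-style header.

    documents: mapping of file name -> XML text exported from the notebook.
    metadata:  mapping holding the fields listed in REQUIRED_FIELDS.
    """
    missing = [field for field in REQUIRED_FIELDS if not metadata.get(field)]
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")

    names = sorted(documents)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mets.xml", build_mets_header(metadata, names))
        for name in names:
            archive.writestr(name, documents[name])
    return buffer.getvalue()


# Hypothetical usage with placeholder content.
package = build_package(
    {"experiment-1.xml": "<document><title>Assay run</title></document>"},
    {
        "creator": "Doe, Jane",
        "title": "Assay run, week 12",
        "publisher": "Example University",
        "publicationYear": 2015,
    },
)
```

Refusing to build a package when a required field is missing mirrors the idea of 'curation at source': gaps in citation metadata surface while the researcher is still working in the notebook rather than at ingest.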
In the second example, "Ontologies for research data tools", the workflow employs Dendro (da Silva et al., 2014), an ontology-based collaborative platform for research data management. Dendro offers researchers a file management environment with a tool for creating metadata descriptors as Linked Open Data (LOD), optionally picking recommended terms from published vocabularies, including elements from well-recognized standards like Dublin Core. Curators can work with Dendro to design domain-specific metadata models and enrich the terms available to the researchers they work with. The Dendro workflow optionally includes Labtablet, a mobile application designed to allow researchers to capture metadata during fieldwork. Locally relevant terms are packaged with the data for deposit in a public repository, while the terms themselves are published on the web as candidate ontologies for the researchers' domain, allowing for their evolution through broader community reuse.
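To make the idea of a descriptor record concrete, the following Python sketch combines recommended Dublin Core terms with locally defined terms in a single JSON-LD document. It does not reproduce Dendro's data model or API: JSON-LD is used only as one convenient Linked Open Data serialization, and the function name, the example.org IRIs and the local term are hypothetical.

```python
import json

# Namespace of the DCMI Metadata Terms vocabulary (Dublin Core).
DCTERMS = "http://purl.org/dc/terms/"


def build_descriptor(dataset_iri, recommended, local_vocab_iri, local):
    """Build a JSON-LD record mixing Dublin Core terms and local terms.

    recommended: mapping of Dublin Core term name -> value.
    local:       mapping of locally defined term name -> value; these are the
                 candidate ontology terms a curator might publish later.
    """
    record = {
        "@context": {"dcterms": DCTERMS, "local": local_vocab_iri},
        "@id": dataset_iri,
    }
    for term, value in recommended.items():
        record[f"dcterms:{term}"] = value
    for term, value in local.items():
        record[f"local:{term}"] = value
    return record


# Hypothetical usage: all IRIs and values are placeholders.
descriptor = build_descriptor(
    "https://example.org/datasets/42",
    {"title": "River sediment samples", "creator": "Doe, Jane", "created": "2015-06-01"},
    "https://example.org/vocab/hydrology#",
    {"samplingDepthCm": 30},
)
print(json.dumps(descriptor, indent=2))
```

Keeping the local terms under their own namespace prefix separates them from the established vocabulary, so they can travel with the data on deposit and still be published independently as a candidate ontology.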

Facilitate data citation, e.g. through use of digital object PIDs, data/article/person/software linkages, researcher PIDs
With more complex workflows and dynamic content, it is even more important for humans and machines to be able to identify the data, software, and documentation correctly and uniquely for the purposes of reproducibility. Hence, it is not surprising that most solutions clearly commit to the use of PIDs and their versioning capabilities. Independent of any software environment, PIDs can be used to connect content such as data, software and publications.
It should be noted that the use of PIDs applies not only to digital objects (data, software, any text document, etc.), but also to the physical objects and the persons involved in the processes. The advent of ORCID as a unique identifier for contributors allows easy attribution of content to the individual person. It can be expected that researchers use several independent systems throughout their research process; such IDs could therefore be used to connect content automatically across those systems, where permitted.
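As an illustration of how researcher identifiers could connect content held in independent systems, the sketch below validates ORCID iDs using their published check-digit scheme (ISO 7064 MOD 11-2) and groups object PIDs by contributor. The record layout, the function names and the example DOIs are assumptions made for this sketch; real systems would exchange such records through their own APIs and subject to access permissions.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for char in base_digits:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the length, digits and check character of a hyphenated ORCID iD."""
    compact = orcid.replace("-", "")
    if len(compact) != 16:
        return False
    base = compact[:15]
    if not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_digit(base) == compact[15].upper()


def link_by_orcid(records):
    """Group object PIDs by contributor, skipping records with invalid iDs.

    records: iterable of dicts with hypothetical keys "orcid" and "pid",
             e.g. exported from a notebook, a code host and a repository.
    """
    linked = {}
    for record in records:
        orcid = record.get("orcid", "")
        if is_valid_orcid(orcid):
            linked.setdefault(orcid, []).append(record["pid"])
    return linked


# Hypothetical usage: the DOIs are placeholders; the iD is ORCID's
# well-known example identifier.
links = link_by_orcid(
    [
        {"orcid": "0000-0002-1825-0097", "pid": "doi:10.1234/example-dataset"},
        {"orcid": "0000-0002-1825-0097", "pid": "doi:10.1234/example-software"},
        {"orcid": "0000-0002-1825-0098", "pid": "doi:10.1234/mistyped-id"},
    ]
)
```

Validating the check character before linking keeps a mistyped identifier from silently attaching content to the wrong person.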
Many of the workflows that were studied incorporate PIDs: RSpace & DataShare, Elsevier, Imperial College London, Computational Chemistry, IPCC-DDC, CERN Analysis Preservation, Galaxy and Science 2.0 Repositories. This ensures that content can be tracked throughout any module or workflow. Ideally, solutions would be able to track changes to a digital object through internal, restricted and public modules.
The pervasive use of such identifiers can assist in instantiating the active practice of data citation. It appears that most solutions today try to facilitate data citation, and the Joint Declaration of Data Citation Principles has found general consensus. The analysis reuse patterns. Exposing information about content and its identifiers in a machine-readable way facilitates such exercises.

Document roles, workflows and services
Some of the examples identified in this analysis are still works in progress at the time of writing. Hence it is to be assumed that documentation is not yet comprehensive. However, one can note that documentation of roles and responsibilities in such solutions is significant. This would also help researchers to include such relevant information in data management plans (according to DMPonline). Given the more complex nature of upstream systems, often involving a collaborative approach amongst several partners, there is a need for documented service level agreements and respective guidance for partners. User support is particularly relevant in order to generate uptake of the service in the user community. If the added benefit is not highlighted explicitly, it might be difficult to generate interest in a new tool, for instance.
In the "RSpace ELN to DataShare" example, an institutional data repository (DataShare) provided a checklist approach to deposit. This subsequently facilitated a partnership with the ELN software provider Research Space. The resulting partial automation of the deposition workflow shows how clear documentation can offer direct benefits to repository depositors and users. As a result, researchers can capture data in a structured way during the research process, and then retain and deposit this structure without duplication of the initial effort. Retaining the original structure of the research in a packaged form that may be associated with a publication benefits the reproducibility of the research.
Repositories partnering with third parties can extend the trusted repository model to their partners by delegating certain data management or curation functions to them. Standards bodies, for example the Data Seal of Approval, recognize that a data service may be partially outsourced (Data Seal of Approval, 2013). Repositories can partner with providers of research tools and upstream services, as well as integrating downstream with journal publishers or with harvesting and aggregating services. Trust is transitive: where researchers use tools that they and their institution trust, this can facilitate a level of delegation of curation tasks to the research group, and may reduce repository ingest costs. There is also potential for economic and research benefits further 'downstream', to the extent that well-packaged research data facilitates easier integration with publication platforms and ease of reuse.

The curators' role in connecting research workflows to publishing platforms
The examples submitted to the Working Group often identified some measure of intermediation by curators to enable workflows to be joined up effectively. This could range from simply making researchers aware of tools, through enabling elements of automation, to supporting the uptake of services. This points to another area of innovation: the methods that curators use to engage with researchers and understand the workflows they are integrating. An example of this was "Ontologies for research data tools". Here the authors describe their approach to defining context-specific domain ontologies, in which they invite researchers to an interview about their data activities, their requirements and their expectations regarding data sharing. This interview is based on the Data Curation Profile Toolkit (Witt, 2009). That process is complemented by performing content analysis of the researchers' publications, and by discussing with the researchers the fragments of information that should be provided along with the dataset to help others interpret it.

Summary
This first step towards a state-of-the-art review shows that practices and products are emerging to serve upstream research workflows better in data publishing. These include:
• An extension of the "traditional" data publishing model (Austin et al., 2015) to preserve internal "work in progress", i.e. dynamic content early in the research process
• An extension of collaborative features that enable easy collaboration with colleagues when conducting research
• Active interventions by curators which lead to better connected workflows and more richly engaged researchers
• Solutions that enable computational workflows (including preserved content)
• Solutions that are easily extendable, facilitated by APIs and new data models
More work needs to be done to embed such tools and workflows into the "business as usual" experience of a critical mass of researchers. Further investigations are needed to determine how data publishing can accommodate the results of the research workflow. This preliminary analysis underlines that a few solutions are under way, and the discussions within the working group (sessions) also highlighted a considerable interest in such solutions. Over time the emerging developments might require an update of the reference model proposed by this working group (Austin et al., 2015) to refine the upstream components in more detail. This will require more in-depth work once more solutions are "on the market" and in use.
More work is also needed to understand whether and how communities (i.e. the individual researchers) really use such tools. So far there is almost no data available on actual usage, which is of the utmost importance for understanding whether and how workflows work. Only with considerable uptake by researchers can such upstream workflows work in the mid and long term. In the light of these developments, the recommendations presented in Austin et al. (2015) may themselves need a versioned update in the near future. Community engagement to support uptake of the services is critical. This is a task for e-infrastructure providers, funders, thought leaders within disciplines, research managers and other key stakeholders.
Some of the findings are also supported by a recent report provided by Matthew Addis, who considers both the function and effects of various RDM workflows. Addis et al. (2015) contains a number of case studies taken from UK-based Higher Education Institutions. The case studies span a range of disciplines, institution sizes and research-intensity levels, dataset sizes, and so forth. Whilst acknowledging the impossibility of devising a 'one size fits all' solution, the report does draw a number of useful conclusions:
• When presented with clear and seamless workflows, researchers are more likely to engage with the whole of the data publishing cycle.
• Automation, wherever possible, will drive speed, accuracy, and the ability of groups of institutions to provide a high level of services, as well as keeping costs down.
• A single point of contact or interface, even where different workflows/funders/subject areas are concerned, will also support engagement.
• Providing trusted metrics for funders as well as the institutions contributes greatly to the value of the exercise, particularly if these can be specifically linked with tangible career enhancement.