Achieving human and machine accessibility of cited data in scholarly publications

This brief article provides operational guidance on implementing scholarly data citation and data deposition, in conformance with the Joint Declaration of Data Citation Principles (JDDCP, http://force11.org/datacitation), to help achieve widespread, uniform human and machine accessibility of deposited data. The JDDCP is the outcome of a cross-domain effort to establish core principles around cited data in scholarly publications. It deals with important issues in identification, deposition, description, accessibility, persistence, and evidential status of cited data. Eighty-five scholarly, governmental, and funding institutions have now endorsed the JDDCP. The purpose of this article is to provide the necessary guidance for JDDCP-endorsing organizations to implement these principles and to achieve their widespread adoption.


INTRODUCTION
Citation of robustly maintained, described and identified data in persistent digital repositories is an important step towards significantly improving the discoverability, provenance documentation, validation, and reuse of scholarly data; and in validating the robustness of assertions based upon particular data (CODATA (2013); Altman and King (2006); Uhlir (2012); Ball and Duke (2012); Goodman et al. (2014); Borgman (2012)). It can help reduce the rate of false positives that persist in scholarly literature, and will be transformative in improving the robustness and reproducibility of research findings.
The Joint Declaration of Data Citation Principles (JDDCP) (https://www.force11.org/datacitation) outlines core principles for citing data, based on significant study by participating groups (1) and independent scholars (CODATA (2013); Altman and King (2006); Uhlir (2012); Ball and Duke (2012)). It is the latest development in a collective process, reaching back to at least 1977, to raise the importance of data as an independent scholarly product, and to make data transparently available for verification and reproducibility (Altman and Crosas (2013)). However, the JDDCP deliberately did not provide implementation guidelines, which are addressed in this and forthcoming articles.
The purpose of this document is to outline a set of common guidelines to operationalize JDDCP-compliant machine accessibility. Its goal is to do this in a way that is as uniform as possible across conforming repositories and the associated citations of the data they contain. The recommendations outlined here were developed as part of a community process by members of the FORCE11 Data Citation Implementation Group (DCIG, https://www.force11.org/datacitationimplementation), over a period of approximately one year.
Accessibility to machines and humans is fundamental to providing the required Web access to stable repositories of cited scholarly data and associated metadata, which may have differing lifecycles. This notion is implied by all eight of the JDDCP principles, beginning with
• Principle 1 - Importance: "Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications."
And it is particularly strongly endorsed in the following:
• Principle 4 - Unique Identification: "A data citation should include a persistent method for identification that is machine actionable..."
• Principle 5 - Access: "Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data."
• Principle 6 - Persistence: "Unique identifiers, and metadata describing the data, and its disposition, should persist - even beyond the lifespan of the data they describe."
The methods proposed below cover:
• identifiers and identifier schemes;
• landing pages;
• minimum acceptable information on landing pages;
• best practices for dataset description; and
• recommended data access methods.
Additional sets of DCIG recommendations on other implementation issues for JDDCP endorsers will be provided in future articles.

WHAT IS MACHINE ACCESSIBILITY?
Machine accessibility of cited data, in the context of this document and the JDDCP, means access to data and metadata stored in a robust repository by Web services (Booth et al. (2004)), preferably RESTful Web services (Fielding (2000); Fielding and Taylor (2002); Richardson and Ruby (2011)), independently of integrated browser access by humans.
Clearly, "machine accessibility" is also an underlying prerequisite to human accessibility, as browser access to remote data is always mediated by machine-to-machine communication.
We call out machine accessibility separately here, as in the JDDCP, to emphasize the importance of program-to-program retrieval of data as an integrated services model component.

Unique Identification
Unique identification in a manner that is machine-resolvable on the Web, and that has a demonstrated long-term commitment to persistence, is fundamental to providing access to cited data and its associated metadata. There are several identifier schemes on the Web that meet these two criteria. The best identifier scheme for data citation in a particular community of practice will be one that meets these criteria and is widely used in that community.
Our general recommendation, based on the JDDCP, is to use any currently available identifier scheme that is machine actionable, globally unique, and widely (and currently) used by a community, and that has a long-term commitment to persistence. Best practice is to choose a scheme that is cross-discipline.
Examples of identifier schemes, meeting JDDCP criteria for robustly-accessible data citation, are shown in Table 1 below.

Identifier | Resolution services | Achieving persistence | Enforcing persistence | Action on object removal
DataCite DOI | datacite.org | registration with contract (2) | link checking | DataCite contacts owners; metadata should persist
CrossRef DOI | crossref.org | registration with contract (3) | link checking | CrossRef contacts owners per policy (4)
(2001)) domain resolver metadata should persist
Table 1. Examples of identifier schemes meeting JDDCP criteria.
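
To make "machine actionable" concrete for the DOI schemes listed in Table 1, the following minimal sketch turns a DOI, as it might appear in a citation, into a Web-resolvable URL. It assumes the public DOI proxy at https://doi.org/ as the resolver, which is not named in this article; the function name doi_to_url and the example DOI are hypothetical.

```python
from urllib.parse import quote

# Assumption: the public DOI proxy is used as the resolver.
DOI_RESOLVER = "https://doi.org/"

# Prefixes under which a DOI commonly appears in citations.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return a resolvable HTTPS URL for a bare or prefixed DOI."""
    doi = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI starts with the directory indicator "10." and
    # separates registrant prefix and suffix with a slash.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path.
    return DOI_RESOLVER + quote(doi, safe="/")


# Hypothetical DOI, used only for illustration.
print(doi_to_url("doi:10.1234/example.5678"))
```

A citation processor could apply such a function to every cited identifier, so that both readers and programs arrive at the same landing page.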

Landing pages
The identifier included in a citation should point to a landing page or set of pages rather than to the data itself (Rans et al. (2013); Clark et al. (2014)). This is strongly implied by three considerations. First, as mandated in the JDDCP, the metadata and the data may have different lifespans, the metadata potentially surviving the data. Second, the cited data may not be legally available to all, for reasons of licensing or confidentiality (e.g. Protected Health Information). The landing page provides a method to vend metadata even if the data are no longer present. And it also provides a convenient place where access credentials can be validated. Third, resolution to a landing page allows for an access point that is independent from any multiple encodings of the data which may be available.
By "landing page(s)" we mean a set of representations and presentations of information about the data via both structured metadata and unstructured text and other information. Landing pages should combine human-readable and machine-readable information on a selection of the following items.
• Tools/software: What tools and software may be associated or useful with the datasets, and how to obtain them (certain datasets are not readily usable without specific software).
• Versions: What versions of the data are available, if there is more than one.
• Explanatory or contextual information: Provide explanations, contextual guidance, caveats, and/or documentation for data use, as appropriate.
• Access controls: Access controls based on content licensing, Protected Health Information (PHI) status, Institutional Review Board (IRB) authorization, embargo, or other restrictions, should be implemented here if appropriate.
• Licensing information: Information regarding licensing should be provided, with links to the relevant licensing or waiver documents as required (e.g., Creative Commons CC0 waiver description (https://creativecommons.org/publicdomain/zero/1.0/), or other relevant material).
• Dataset descriptions: The landing page must provide information to programmatically retrieve data where a user or device is so authorized. (See Best practices for dataset description for formats.)
• Persistence statement: Reference to a statement describing the data and metadata persistence policies of the repository should be provided at the landing page. Data persistence policies will vary by repository but should be clearly described. (See Persistence guarantees for recommended language.)
• Data availability and disposition: The landing page should provide information on the availability of the data if it is restricted, or has been de-accessioned (i.e. removed from the archive). As stated in the JDDCP, metadata should persist beyond de-accessioning.
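
The behaviour described in this section (metadata is always served, while the data link depends on availability and authorization) can be sketched as follows. This is an illustration under stated assumptions, not a prescribed implementation: the record layout, the field names, and the function landing_page_view are all hypothetical.

```python
def landing_page_view(record: dict, authorized: bool) -> dict:
    """Decide what a landing page exposes for one dataset record.

    `record` is a hypothetical repository record. Metadata is always
    returned; the data URL is returned only when the data still exist
    and the requester may access them.
    """
    view = {
        "identifier": record["identifier"],
        "metadata": record["metadata"],  # persists beyond the data
        "persistence_policy": record.get("persistence_policy"),
    }
    if record.get("deaccessioned"):
        # Data removed from the archive: keep the page, note the fact.
        view["availability"] = "de-accessioned"
        view["data_url"] = None
    elif record.get("restricted") and not authorized:
        # Licensing, PHI, IRB or embargo restrictions apply.
        view["availability"] = "restricted"
        view["data_url"] = None
    else:
        view["availability"] = "available"
        view["data_url"] = record["data_url"]
    return view
```

The same decision could drive both the HTML rendering and the machine-readable representation of the landing page, so that the two never disagree about availability.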

Minimum acceptable information on landing pages
To provide a minimum acceptable level of information on landing pages, there are three guidelines.

Minimum content encoding formats for landing pages:
• HTML (for humans); that is, native browser-interpretable format used to generate a graphical display in a browser window, for human reading and understanding.
• At least one non-proprietary machine readable format; that is, a content format with a fully specified syntax capable of being parsed by software without ambiguity, at a data element level. Options: XML, JSON/JSON-LD, RDF (Turtle, RDF-XML, N-Triples, N-Quads), microformats, microdata, RDFa.

Minimum metadata content
• Dataset Identifier: A machine-actionable identifier resolvable on the Web to the dataset.
• Title: The title of the dataset.
• Description: A description of the dataset, with more information than the title.
• Creator: The person(s) and/or organizations who generated the dataset and are responsible for its integrity.
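
A repository might check the four minimum elements above before publishing a landing page. The sketch below serializes a record as JSON, one of the machine-readable options listed earlier. The key names, the helper missing_minimum_fields, and the example values are assumptions made for illustration only.

```python
import json

# The four minimum metadata elements named in this section.
REQUIRED_FIELDS = ("identifier", "title", "description", "creator")


def missing_minimum_fields(metadata: dict) -> list:
    """Return the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not metadata.get(name)]


# Hypothetical record; every value is a placeholder.
record = {
    "identifier": "https://doi.org/10.1234/example.5678",
    "title": "Example survey dataset",
    "description": "Responses to a hypothetical survey, one row per respondent.",
    "creator": ["Example Research Group"],
}

problems = missing_minimum_fields(record)
if problems:
    raise SystemExit("missing fields: " + ", ".join(problems))
print(json.dumps(record, indent=2))
```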

Additional suggested metadata content
• Creator Identifier(s): ORCiD ID(s) of the individual creator(s).
• License: The license under which access to the content is provided (preferably a link to standard license text, e.g. https://creativecommons.org/publicdomain/zero/1.0/).

Best practices for dataset description
The World Wide Web Consortium (http://w3.org) standard for dataset description on the Web is the W3C Data Catalog Vocabulary (Mali et al. (2014)). This is a strongly endorsed best practice for dataset description, common across domains, and widely used. It is a settled standard that can be recommended without qualification. The W3C Health Care and Life Sciences Dataset Description specification (Gray et al. (2014)), currently in editor's draft status, provides capability to add additional useful metadata beyond the DCAT vocabulary. This is an evolving standard which we recommend for provisional use.
Data might also be presented in other formats, depending on the application area, in which case, content negotiation would be desirable for the data URI as well as the landing page URI.
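
As an illustration of a DCAT-based description, the sketch below builds a small JSON-LD document using terms from the DCAT and Dublin Core Terms namespaces. The selection of properties is our own assumption about what a minimal description could contain, and the function name dcat_dataset and all example values are hypothetical.

```python
import json


def dcat_dataset(identifier, title, description, creator,
                 license_url, download_url, media_type):
    """Build a minimal DCAT dataset description as a JSON-LD dict."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": identifier,
        "@type": "dcat:Dataset",
        "dct:identifier": identifier,
        "dct:title": title,
        "dct:description": description,
        "dct:creator": creator,
        # One distribution per available encoding of the data.
        "dcat:distribution": {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": download_url},
            "dcat:mediaType": media_type,
            "dct:license": {"@id": license_url},
        },
    }


# Placeholder values for illustration only.
doc = dcat_dataset(
    identifier="https://doi.org/10.1234/example.5678",
    title="Example survey dataset",
    description="Responses to a hypothetical survey.",
    creator="Example Research Group",
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    download_url="https://repo.example.org/data/5678.csv",
    media_type="text/csv",
)
print(json.dumps(doc, indent=2))
```

Keeping the download URL inside a distribution, rather than on the dataset itself, leaves the landing page identifier independent of any particular encoding of the data.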

Data access methods
The following are the recommended best approaches for serving content. These can and should be used together for maximum flexibility and accessibility.
1. Use HTTP Accept headers to serve different content based on the request.
• Also known as "content negotiation".
• Commonly used in REST web services to serve XML, JSON, HTML, or an RDF serialization.
• Requires webmaster intervention.
• Generic: works for any kind of 'alternate' type relationships.
2. Use HTTP Link headers to direct non-human agents to alternate representations.
• Requires webmaster intervention.
• Generic: works in any kind of served content.
3. Use link elements in HTML to connect to associated content in other formats.
• Example: OAI-ORE to explain how files are inter-related, or linking to a file with the DataCite XML.
• Like method 2, but doesn't require webmaster intervention.
• Only works in HTML documents.
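
The first two methods can be sketched on the server side as follows. This is a simplified illustration, not a complete implementation of HTTP content negotiation: it ignores media type parameters other than q, and it falls back to HTML where a stricter server might answer 406 Not Acceptable. The function names and URLs are hypothetical.

```python
def parse_accept(header: str) -> list:
    """Return (media_type, q) pairs from an Accept header, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for field in fields[1:]:
            if field.lower().startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(accept: str, available: list, default: str = "text/html") -> str:
    """Pick the representation to serve for a given Accept header."""
    for media_type, q in parse_accept(accept or "*/*"):
        if q <= 0:
            continue
        if media_type == "*/*":
            return default
        if media_type.endswith("/*"):
            major = media_type[:-1]  # e.g. "application/"
            for candidate in available:
                if candidate.startswith(major):
                    return candidate
        elif media_type in available:
            return media_type
    return default


def link_header(alternates: dict) -> str:
    """Build a Link header value advertising alternate representations."""
    return ", ".join(
        f'<{url}>; rel="alternate"; type="{media_type}"'
        for url, media_type in alternates.items()
    )
```

For the third method, the same alternates could instead be written into the HTML head of the landing page, for example as `<link rel="alternate" type="application/ld+json" href="...">`, which requires no server configuration.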

Persistence guarantees
The topic of persistence guarantees is important from the standpoint of what repository owners and managers should provide to support JDDCP-compliant citable persistent data. We recommend that all organizations endorsing the JDDCP adopt a Persistence Guarantee for data and metadata based on the following template:
"[Organization/Institution Name] is committed to maintaining persistent identifiers in [Repository Name] so that they will continue to resolve to a landing page providing metadata describing the data, including elements of stewardship, provenance, and availability. [Organization/Institution Name] has made the following plan for organizational persistence and succession: [plan]."
As noted in the Landing pages section, when data is de-accessioned, the landing page should remain online, continuing to provide persistent metadata and other information, including a notation on data de-accessioning. Authors and scholarly article publishers will decide which repositories meet their persistence and stewardship requirements, based on the guarantees provided and their overall experience in using various repositories. Guarantees need to be supported by operational practice.
Registries of data repositories such as re3data (http://r3data.org) and publishers' lists of "recommended" repositories for cited data, such as those maintained by Nature Publications (http://www.nature.com/sdata/datapolicies/repositories), should take note of repository compliance with these guidelines, and provide compliance checklists.
Other deliverables from the DCIG are planned for release in early 2015, including a revision to the NISO-JATS XML schema for document publication and archiving (NISO (2014)), specifically designed to support data citation; and a review of selected data-citation workflows from early-adopter publishers (Nature, BioMed Central, Wiley and Faculty of 1000). The NISO-JATS revision is currently under review by the National Information Standards Organization (NISO) as a draft of NISO JATS version 1.1d2.
It is our hope that publishing this document and others in the series will accelerate the adoption of data citation on a wide scale in the scholarly literature, to support open validation and reuse of results.
We welcome comments and questions, which should be addressed to the forcnet@googlegroups.com open discussion forum.
6. "This information can be changed as needed to reflect the current state of the identified resource without changing its identifier, thus allowing the name of the item to persist over changes of location and other related state information."
7. For example, the French National Library has rigorous internal checks for the 20 million ARKs that it manages via its own resolver.
8. In most cases the national libraries archive also the content itself (in addition to the content holder) to be preserved with the NBN.
9. Force11.org (http://force11.org) is a community of scholars, librarians, archivists, publishers and research funders that has arisen organically to help facilitate the change toward improved knowledge creation and sharing. It is incorporated as a US 501(c)3 not-for-profit organization in California.