Data publication consensus and controversies

John Kratz; Carly Strasser

doi:10.12688/f1000research.3979.2

Home Browse Data publication consensus and controversies

ALL Metrics

Views

Downloads

Get PDF

Get XML

Export

▬

✚

Review

Revised

Data publication consensus and controversies

[version 2; peer review: 2 approved, 1 approved with reservations]

John Kratz¹, Carly Strasser¹

PUBLISHED 16 May 2014

Author details Author details

¹ California Digital Library, University of California Office of the President, Oakland, CA, 94612, USA

OPEN PEER REVIEW

REVIEWER STATUS

This article is included in the Research on Research, Policy & Culture gateway.

This article is included in the Data: Use and Reuse collection.

Abstract

The movement to bring datasets into the scholarly record as first class research products (validated, preserved, cited, and credited) has been inching forward for some time, but now the pace is quickening. As data publication venues proliferate, significant debate continues over formats, processes, and terminology. Here, we present an overview of data publication initiatives underway and the current conversation, highlighting points of consensus and issues still in contention. Data publication implementations differ in a variety of factors, including the kind of documentation, the location of the documentation relative to the data, and how the data is validated. Publishers may present the data as supplemental material to a journal article, with a descriptive “data paper,” or independently. Complicating the situation, different initiatives and communities use the same terms to refer distinct but overlapping concepts. For instance, the term “published” means that the data is publicly available and citable to virtually everyone, but it may or may not imply that the data has been peer-reviewed. In turn, what is meant by data peer review is far from defined; standards and processes encompass the full range employed in reviewing the literature, plus some novel variations. Basic data citation is a point of consensus, but the general agreement on the core elements of a dataset citation frays if the data is dynamic or part of a larger set. Even as data publication is being defined, some are looking past publication to other metaphors, notably “data as software,” for solutions to the more stubborn problems.

Corresponding author: John Kratz

Competing interests: No competing interests were disclosed.

Grant information: JK is supported by a Council on Library and Information Resources/Digital Library Foundation Postdoctoral Fellowship in Data Curation for the Sciences and Social
Sciences funded by the California Digital Library and the Alfred P. Sloan Foundation.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2014 Kratz J and Strasser C. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

How to cite: Kratz J and Strasser C. Data publication consensus and controversies [version 2; peer review: 2 approved, 1 approved with reservations]. F1000Research 2014, 3:94 (https://doi.org/10.12688/f1000research.3979.2) First published: 23 Apr 2014, 3:94 (https://doi.org/10.12688/f1000research.3979.1) Latest published: 16 Oct 2014, 3:94 (https://doi.org/10.12688/f1000research.3979.3)

Revised Amendments from Version 1

The article type has been changed from an Opinion to a Review to better reflect the content of the article. We shall submit another version addressing reviewer comments once the article has received the full complement of reviews.

See the authors' detailed response to the review by Mark Parsons and Peter Fox

Introduction: what does data publication mean?

The idea that researchers should share data to advance knowledge and promote the common good is an old one, but in recent years the conversation has shifted from sharing data to “publishing” data^1–3. This shift in language stems from the conviction that datasets should join the scholarly record and be afforded the same first class status as traditional research products like journal articles⁴. While many in the scholarly communication community share this goal, different people and organizations often imply different things by the phrase data publication.

The community largely agrees on two essential properties of a data publication^2,4. First, published data is publicly available now and for the indefinite future; access might demand payment of fees or acceptance of a legal agreement, but not the approval of the author. Second, like a book or journal article, a data publication can be formally cited. Open questions flock around a third property: how and to what extent a published dataset must be validated. In an effort to clarify the terminology, Callaghan et al. (2012)⁴ draw a distinction between data that has been shared, published (lower-case “p”), or Published (upper-case “P”): shared data is available, published data is available and citable, and Published data is available, citable, and validated. In practice, availability is usually satisfied by depositing the dataset in a repository, citability by assigning a persistent identifier (e.g. a Digital Object Identifier, or DOI), and validity by peer review.

Why publish data?

The underlying goals of data publication are to enable research to be reproduced and data to be reused. Hidden primary data exacerbates science’s very public “reproducibility crisis”^5–9, most recently illustrated by the collapse of a pair of irreproducible Nature articles describing a simple method to transform somatic cells into pluripotent stem cells^10,11. Widespread publication of the data underlying research papers could help expose both honest errors and fraud¹². The leaders of the US National Institutes of Health (NIH) recently cited “provid[ing] greater transparency of the data that are the basis of published manuscripts” as one way to improve scientific reproducibility¹³.

Journals already frequently require authors to supply underlying data on request. In 2011, Alsheikh-Ali et al.¹⁴ found that 88% of high-impact journals required a statement regarding the availability of underlying data; half of those made willingness to provide data a condition of publication. However, the authors of 59% of papers examined in the study failed to adhere to the availability instructions. Vines et al. (2014)¹⁵ could only obtain underlying data from 101 of 516 papers published from 1991 to 2011. Availability dropped off sharply with time; data could be obtained from only two of the 62 oldest papers. Now, some journals require that underlying data be published simultaneously with the article.

In 2010, a coalition of Ecology and Evolutionary Biology journals began to require that the data underlying articles be archived with a maximum embargo of one year^16,17. F1000Research has had a similar policy (without an embargo period) since its inception, and the Public Library of Science (PLOS) journals followed suit earlier this year¹⁸. Although there can be no substitute for funding new experiments and data collection, appropriate data reuse lowers costs and accelerates research. Documenting, publishing, and archiving data is time consuming and costly, but usually far less so than repeating the data collection. For example, Open Context published archaeological data from a site in eastern Turkey at the substantial cost of $10,000–15,0000, but this publication expense was minor compared to $800,000 spent to collect the data¹⁹. Piwowar (2011) contrasted the impact of $100,000 in National Science Foundation (NSF) grants, which generates an average of three to four papers, with an estimate that the same investment in curating, archiving, and publishing data could contribute to over 1,000 publications²⁰. Furthermore, while some data is merely expensive to recreate, time-dependent or ephemeral data, (e.g. climate records or observations of unique astronomical events) should be published because it can never be recreated for any price²¹.

Types of data publication

The still-congealing phrase “data publication” covers diverse classes of research objects published via diverse processes. Depending on the speaker, a data publication might be a spreadsheet on a website, a set of images in an institutional archive, a stream of readings from a weather station transmitted over the internet, or a peer-reviewed article describing a dataset. Because disciplines, sub-disciplines, and individual researchers consider different assortments of digital material to be data, it is unlikely that any single structure will suit every discipline and dataset. But, we can hope that a manageable number of designs will fit most data. Five data publication models described by Lawrence et al. (2011) are distinguished “by how the roles involved in publication are distributed between the various actors” (e.g. the author, archive or journal)³. Here, we will more simply group data publications into three categories based on the accompanying documentation; a dataset may supplement a traditional research paper, be the subject of a “data paper”, or be independent of any paper (Figure 1).

Figure 1. To be published, datasets are typically deposited in a repository to make them available and assigned an identifier to make them citable.

Some, but not all, publishers review datasets to validate them.

Data that supplements a paper

The most familiar kind of data publication is a traditional journal article accompanied by underlying data. That data can be hosted by the journal as supplementary material or deposited in a third-party repository. The trend is away from supplemental material because repositories are considered to be better suited to ensure long-term preservation and access to the data. For instance, The Journal of Neuroscience stopped publishing supplemental material in 2010; the announcement promotes disciplinary repositories as “vastly superior to supplemental material as a mechanism for disseminating data”²². Data underlying any peer-reviewed or otherwise “reputable” publication can be deposited in the Dryad repository. Dryad makes data available and citable, but the publisher of the article must manage any assessment of scientific validity. Other third-party repositories include Figshare, Zenodo, institutional repositories (e.g. the Purdue Research Repository), and discipline-specific repositories (e.g. DNA sequences are deposited in GenBank²³ and protein structures in the Protein Data Bank²⁴).

Data as the subject of a paper

A data paper describes a dataset with thoroughly detailed rationale and collection methods, but lacks any analysis or conclusions²⁵. Data papers are flourishing as a new article type in journals such as F1000Research, Internet Archaeology, and GigaScience, as well as in dedicated journals like Geoscience Data Journal, Nature Publishing Group’s Scientific Data, and a trio of “metajournals” from Ubiquity Press.

Data paper length and structure varies between journals, but the tendency is toward a short, tightly structured format. All journals require an abstract, collection methods, and a description of the dataset; a few encourage authors to suggest potential uses for the data (e.g. Internet Archaeology, and Open Health Data). Some journals supplement this general framework with field-specific sections. (e.g. Internet Archaeology and the Journal of Open Archaeology Data each include a section for temporal and geographic scope.) Data papers are most sharply defined not by the presence of any particular information, but by the absence of analysis or conclusions. A crisp distinction from other article types is important because many journals do not consider a data paper to be prior publication if the authors seek to publish an analysis of the same dataset (e.g. Nature-titled journals, Science, and others listed by F1000Research).

Data journals generally limit themselves to publishing the description of the dataset; a trusted repository publishes the data itself. For instance, Scientific Data and Geoscience Data Journal each direct authors to a list of approved repositories. As an exception, GigaScience hosts data in an integrated repository named GigaDB. An early implementer of data papers, The International Journal of Robotics Research²⁵ is unusual in that they permit authors to host datasets on their own websites.

Data independent of any paper

To be useful or reproducible, a dataset must be accompanied by descriptive information (i.e. metadata)²¹, but this need not take the form of a journal article. Instead, some repositories publish rich, structured and/or freeform description together with the data. The distinction between a data repository and a data publisher is often indistinct. Repositories provide access and citability, but the degree of validation varies widely and few are equipped to provide peer review. For instance, to make data publication as easy as possible for authors, Figshare and Zenodo publish datasets from any field with minimal validation.

Availability

Fundamentally, to publish is to make public, and to publish data is to make data publicly available. Present availability requires mechanisms for access; future availability also requires preservation (e.g. long-term storage, format migration)^21,26,27. As in print publication, published data need not be free or legally unencumbered, and data use agreements constrain many published datasets. If access is limited, it should be contingent on clear and objective criteria; writing a request to the creator for permission should not be part of the process. For example, before granting access to restricted data, The Interuniversity Consortium for Political and Social Research (ICPSR) judges the applicant’s proposed security measures, but not the merit of their research. Datasets from social science or clinical studies that involve human participants are easily the most common source of access restrictions because of the need to protect privacy. In the United States, the Health Insurance Portability and Accountability Act of 1996 (HIPPA) Privacy Rule severely limits the disclosure of medical information²⁸.

As a practical matter, publishing a dataset usually includes depositing it in a trustworthy repository. What constitutes a “trustworthy” repository is somewhat subjective and there are a handful of certification schemes to choose from. In 2007, The Center for Research Libraries (CRL) published the most extensive scheme: the Trusted Repository Audit Checklist (TRAC)²⁹. Many repositories consult TRAC for self-assessment, but only four (listed by the CRL) have completed the lengthy and rigorous process to be officially certified. The process to obtain a Data Seal of Approval (DSA) is considerably more streamlined. The DSA guidelines were also first released, by The Dutch Data Archiving and Networked Services (DANS), in 2007; 24 repositories have been stamped with the DSA since then. Few of the hundreds of repositories in operation (e.g, the 973 now listed Databib or the 609 at re3data.org) have pursued any kind of certification. Given the low adoption of repository certification, a more typical way to decide trustworthiness is to judge by the organization responsible. Repositories run by governments or large universities are likely to be considered trustworthy (although the effects of the 2013 US government shutdown on the PubMed biomedical article database³⁰ might give one pause).

Citability

Data citation is the element of publication that has come the farthest toward consensus. This year, a coalition–including Future Of Research Communication and E-Scholarship (FORCE11)³¹, the Committee on Data for Science and Technology (CODATA)³², and the Digital Curation Centre (DCC)–released a Joint Declaration of Data Citation Principles. The first of the eight principles states, in part, that “[d]ata citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications”. Most of the time, this means that when a published dataset contributes to a paper, it should be cited formally in the reference list.

Data publishers enable formal citation by assigning unique permanent identifiers, most commonly the same ones used for journal articles: Digital Object Identifiers (DOIs). In addition to clarifying exactly what resource is being cited, a DOI can be resolved to locate the referenced dataset. Note, however, that a DOI is neither sufficient nor necessary for citability- if a dataset moves and the DOI is not updated, the citation breaks and, conversely a well-maintained web-address works as well as a DOI.

Simple case

The present consensus is that a dataset should be cited using, at a minimum, five elements largely familiar from article citations: creator(s), title, year, publisher and identifier. This format agrees with CODATA’s recommendation³² and conveys all the information required to obtain a DataCite DOI³³ or be listed in the Thomson-Reuters Data Citation Index. However, this article-derived format fails to address some of the complications unique to datasets, described below.

Deep citation

The first major complication that data citation faces is the need for deep citation. When supporting an assertion in writing, it usually suffices to cite the entirety of a journal article and leave it to the inquisitive reader to find the relevant passage. But, to reproduce an analysis performed on a subset of a larger dataset, the reader needs to know exactly what subset was used (e.g. a limited range of dates, only the adult subjects, wind speed but not direction). Datasets vary so widely in structure that there may not be a good general solution for describing subsets. The most common suggestion is to cite the entire dataset in the reference list and describe the subset in the text of the paper³⁴. In straightforward cases, the Federation of Earth Science Information Partners (ESIP) and the National Snow and Ice Data Center (NSIDC) both recommend including a list of variables or range of dates in the formal citation.

Dynamic datasets

The second major complication arises when datasets change. In the past, the printing process cemented one version of an article as the version of record. Even for traditional scholarly literature, web-based publishing and preprint servers (e.g. arXiv.org) are complicating the situation, but datasets are especially prone to be dynamic. Two kinds of dynamic datasets warrant consideration: growing datasets that add new data while never changing or deleting existing data, and revisable datasets where data may by added, deleted, or changed.

Consider USC00046336, a weather station at the Oakland Museum. Each day, the high temperature, low temperature and amount of precipitation recorded at the Museum³⁵ flow, together with data from more than 20,000 other stations, into the swelling Global Historical Climate Network (GHCN)-Daily³⁶ dataset. Or, consider WormBase³⁷, a genome database used by the Caenorhabditis elegans research community. WormBase encompasses genomic sequences of C. elegans and 20 related species massively annotated with gene structures, protein sequences, expression patterns, and a host of other information from empirical data and computational predictions. Every two months, WormBase responds to new data and better computational models by issuing a revised version with new material added and inaccurate material deleted or corrected.

Additions and updates to published datasets are extremely valuable, but a researcher seeking to reproduce an analysis of a dynamic dataset needs access to a particular version. To enable that access, previous versions must be preserved and citable. Growing datasets can be cited with an access date or a date range in the citation, as recommended by ESIP and NSIDC. Revisable datasets are more difficult; the most common approach is to accumulate revisions and periodically publish a new version with a citable version number. For example, WormBase identifies each release with a citable version number and makes all of the previous versions available.

Controversy persists around the specific issue of identifiers for dynamic datasets. DataCite recommends, but does not insist, that their DOIs refer to immutable objects. NSICD and ESIP instruct researchers to use a single identifier for growing datasets and include the access date in the citation; each major version of a revisable datasets gets a new identifier, but minor versions do not. In contrast, the DCC, Dataverse, and the UK Natural Environment Research Council (NERC) insist that any change to a dataset should trigger a new identifier^4,34,38. To handle the difficulties with dynamic data that this policy creates, the DCC recommends periodically issuing growing datasets a new identifier that refers to the “time-slice” of new records and freezing versions of revisable datasets as individually-identified “snapshots”.

Just-in-time identifiers

The difficulties surrounding deep citation and dynamic data could potentially be solved by turning the identifier-issuing process on its head. Instead of the dataset publisher issuing identifiers for data at the level that researchers seem likely to cite, researchers could issue identifiers for precisely the part of the dataset that they want to cite. The Research Data Alliance (RDA) Data Citation Working Group recently put forth a sophisticated proposal applicable to data in (or convertible to) databases. Identifiers created under this scheme would wrap together identification of a database, a query to return the cited dataset, the version of the database queried for this analysis, and a number of other useful components. Although promising, many technical and policy issues must be resolved before this approach can be widely adopted.

Validation

Data validation is the least resolved aspect of data publication, and fundamental questions are still unanswered: What minimum level of quality should a published dataset guarantee? How and by what criteria can datasets be evaluated against that guarantee? Is literature peer-review an appropriate model?

Callaghan et al. (2012)⁴ draw a useful distinction between technical and scientific review. Technical review verifies that a dataset is complete, its description is complete, and that the two match up. Domain expertise is generally not required, and many repositories provide at least some level of technical review. Scientific review evaluates the methods of data collection, the overall plausibility of the data, and the likely reuse value. Scientific review does require domain expertise, making this level of validation more difficult to organize, and few repositories provide it. When data is published with a data paper, review may be split between the repository for technical review and the data journal for scientific review.

Data paper peer review

Peer review guarantees that journal articles entering the scholarly record reach some level of validity (although the aforementioned reproducibility crisis calls into question exactly what that level is). In many fields, peer-reviewed publications enjoy a much higher status than any other literature. Any effort to apply the prestige of “publication” to datasets cascades naturally into an effort to apply the prestige of “peer review”. But as data validation seeks to model itself on literature peer review, literature peer review itself is in flux^39–41. Open peer review at F1000Research and post-publication commenting at PubMed Commons are just two of many ongoing web-enabled experiments in article evaluation.

Journal article reviewers traditionally consider whether the methods used are appropriate for the questions asked and the data collected support the conclusions drawn. In the absence of particular questions and conclusions, it is not obvious what peer review of data should certify. A dataset may be suitable for some purposes, but not for others⁴². In addition, while a reviewer can be expected to read an entire article, they cannot inspect every point in a large dataset. Finally, researchers are already over-whelmed by peer review of articles⁴³ and may find any increased workload unreasonable. Despite all these difficulties, venues for peer-reviewed data papers are opening rapidly.

Data paper journals wrap scientific peer review of the paper and the dataset together into a single process. GigaScience, an exception, assigns technical review of the dataset to a separate data reviewer. The standards that various data journals provide to reviewers are fairly uniform, with the exception that about half of consider novelty or potential impact, while the rest only require that the dataset be scientifically sound. While review standards are similar, processes differ widely.

As an example, compare Biodiversity Journal and Scientific Data. Both journals divide reviewer guidelines into three sections along similar lines, which Biodiversity Journal calls “quality of the data”, “quality of the description”, and “consistency between manuscript and data”. Scientific Data follows a traditional peer-review process: an editor appoints reviewers who are encouraged to remain anonymous. In contrast, review at Biodiversity Journal follows a flexible and open process featuring entirely optional anonymity and multiple types of reviewer. There, an editor appoints two or three “nominated” reviewers who must report back and several “panel” reviewers who read the paper and only comment at their discretion. Additionally, the authors may choose to open the paper to public comment during the review process.

Independent data validation

Data journals all model their data validation more or less faithfully on literature peer review, but independent data validation practices and proposals are considerably more varied. On the conservative end of the spectrum, Lawrence et al. (2011) propose a set of criteria for independent data peer review⁴⁴. The Planetary Data System (PDS) peer-reviews datasets through the unusual process of holding an in-person meeting with representatives of the repository, the dataset creators, and the reviewers.

Two examples from archaeology, Open Context and the Digital Archaeological Record (tDAR), illustrate the diversity of approaches to data validation. Open Context provides multiple validation processes that incorporate peer review in a way that goes beyond the simple accept/reject binary¹⁹. Each Open Context dataset is rated from one to five based not on quality per se, but on the thoroughness of the validation; a one comes with no guarantees, a three has passed a technical review, and a five has passed external peer review. Whereas Open Context is a boutique publisher, focusing on data presentation and reuse, tDAR is a large repository primarily concerned with with collecting and preserving archaeology data for future use. tDAR is able to operate at scale by performing only technical validation and streamlining data deposition with a minimum of mandatory description. However, tDAR also serves as a platform for high-quality data publication. The repository accommodates contributors who wish to provide more information, and much of the content is deposited by digital curators who can be relied on to supply rich descriptions. Furthermore, two data paper journals, Internet Archaeology and Journal of Open Archaeological Data, recommend tDAR as a repository for their peer-reviewed data. Thus, data validation depends not only on discipline and data type, but on a host of external factors, including the goals of the organizations and researchers involved.

Pre-publication validation can be supplemented or replaced by post-publication feedback from successful or unsuccessful reusers. Parsons et al. (2010) suggest that “data use in its own right provides a form of review”, and go on to point out that the context of reuse demonstrates that the data is not generically “good”, but fit for some particular purpose⁴². The DANS repository solicits feedback from researchers who use its datasets: users are asked to rate the dataset on a one to five scale in each of six criteria (e.g., data quality, quality of the documentation, structure of the dataset)^45,46.

Beyond data publication

In a 2013 paper⁴⁷, Parsons and Fox argue that thinking about data through the the metaphor of print “publication” is limiting. Diverse kinds of material are regarded as data by one research community or another and, while at least some aspects of publication apply well to at least some kinds of data, other approaches are possible. An alternative metaphor that seems to be gaining traction is “data as software”⁴⁸. In some cases, it may be better to think of releasing a dataset as one would a piece of software and to regard subsequent changes as analogous to updated versions. The open-source software community has already developed many potentially relevant tools for working collaboratively, managing multiple versions, and tracking attribution. Ram (2013)⁴⁹ catalogs a multitude of scientific uses for the software version control system Git, including data management. Open Context uses Git and Mantis Bug Tracker to track and correct dataset errors. Furthermore, projects such as IPython Notebook integrate data, processing, and analysis into a single package. However, scientific software struggles for recognition⁵⁰ just as data does, so using it to alter or affect the academic reward system for data is a tricky prospect.

Ultimately, while “data as software” is promising, data is not software. Nor is it literature. The prestige and familiarity of terms like “publication” and “peer-review” are powerful, but we may have to stretch their definitions if we are determined to apply them to data.

Author contributions

JK collected information and prepared the first draft of the manuscript. JK and CS designed the scope and direction of the study. Both authors contributed to the writing and editing of the manuscript.

Competing interests

No competing interests were disclosed.

Grant information

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Acknowledgements

The authors would like to thank colleagues at the CDL and Jodi Reeves Flores for productive discussions. Margaret Smith and Christina Doyle contributed invaluable suggestions for editing the manuscript.

Faculty Opinions recommended

References

1. Costello MJ: Motivating online publication of data. BioScience. 2009; 59(5): 418–427. Publisher Full Text
2. Smith VS: Data publication: towards a database of everything. BMC Res Notes. 2009; 2: 113. PubMed Abstract | Publisher Full Text | Free Full Text
3. Lawrence B, Jones C, Matthews B, et al.: Data publication. Int J Digital Curation. 2011.
4. Callaghan S, Donegan S, Pepler S, et al.: Making data a first class scientific output: Data citation and publication by NERC’s environmental data centres. Int J Digital Curation. 2012; 7(1): 107–113. Publisher Full Text
5. Mobley A, Linder SK, Braeuer R, et al.: A survey on data reproducibility in cancer research provides insights into our limited ability to translate findings from the laboratory to the clinic. PLoS One. 2013; 8(5): e63221. PubMed Abstract | Publisher Full Text | Free Full Text
6. Pashler H, Harris CR: Is the replicability crisis overblown? three arguments examined. Perspect Psychol Sci. 2012; 7(6): 531–536. Publisher Full Text
7. Zimmer C: Rise in scientific journal retractions prompts calls for reform. The New York Times. 2012.
8. Hiltzik M: Science has lost its way at a big cost to humanity. Los Angeles Times. 2013. Reference Source
9. Begley CG, Ellis LM: Drug development: Raise standards for preclinical cancer research. Nature. 2012; 483(7391): 531–533. PubMed Abstract | Publisher Full Text
10. Cyranoski D: Acid-bath stem-cell study under investigation. Nature. 2014. Publisher Full Text
11. Tabuchi H: One author of a startling stem cell study calls for its retraction. The New York Times. 2014. Reference Source
12. Drew BT, Gazis R, Cabezas P, et al.: Lost branches on the tree of life. PLoS Biol. 2013; 11(9): e1001636. Publisher Full Text
13. Collins FS, Tabak LA: Policy: NIH plans to enhance reproducibility. Nature. 2014; 505(7485): 612–613. PubMed Abstract | Publisher Full Text
14. Alsheikh-Ali AA, Qureshi W, Al Mallah MH: Public availability of published research data in high-impact journals. PLoS One. 2011; 6(9): e24357. PubMed Abstract | Publisher Full Text | Free Full Text
15. Vines TH, Albert AYK, Andrew RL: The availability of research data declines rapidly with article age. Curr Biol. 2014; 24(1): 94–7. PubMed Abstract | Publisher Full Text
16. Whitlock MC, McPeek MA, Rausher MD: Data archiving. Am Nat. 2010; 175(2): 145–146. PubMed Abstract | Publisher Full Text
17. Fairbairn DJ: The advent of mandatory data archiving. Evolution. 2011; 65(1): 1–2. PubMed Abstract | Publisher Full Text
18. Bloom T, Ganley E, Winker M: Data access for the open access literature: PLOS’s data policy. PLoS Biol. 2014; 12(2): e1001797. Publisher Full Text
19. Kansa EC, Kansa SW: We all know that a 14 is a sheep: Data publication and professionalism in archaeological communication. J Endocrinol Metab Arch Heritage Studies. 2013; 1(1): 88–97. Reference Source
20. Piwowar HA, Vision TJ, Whitlock MC: Data archiving is a good investment. Nature. 2011; 473(7347): 285–285. PubMed Abstract | Publisher Full Text
21. Gray J, Szalay AS, Thakar AR, et al.: Online scientific data curation, publication, and archiving. 2002; 103–107. Reference Source
22. Maunsell J: Announcement regarding supplemental material. J Neurosci. 2010; 30(32): 10599–10600. Reference Source
23. Benson DA, Cavanaugh M, Clark K, et al.: GenBank. Nucleic Acids Res. 2013; 41(Database issue): D36–42. PubMed Abstract | Publisher Full Text | Free Full Text
24. Berman HM, Westbrook J, Feng Z, et al.: The Protein Data Bank. Nucleic Acids Res. 2000; 28(1): 235–242. PubMed Abstract | Publisher Full Text | Free Full Text
25. Newman P, Corke P: Data papers — peer reviewed publication of high quality data sets. Int J Rob Res. 2009; 28(5): 587–587. Publisher Full Text
26. Waters D, Garrett J: Preserving Digital Information. Report of the Task Force on Archiving of Digital Information. ERIC. 1996. Reference Source
27. Beagrie N: Digital curation for science, digital libraries, and individuals. Int J Digital Curation. 2008; 1(1): 3–16. Publisher Full Text
28. Office for Civil Rights. Renal resource guide. 2003. Reference Source
29. Center for Research Libraries (U.S.) and OCLC. Trustworthy repositories audit & certification (TRAC) criteria and checklist. Center for Research Libraries; OCLC Online Computer Library Center, Inc Chicago: Dublin, Ohio. 2007. Reference Source
30. Hayden EC: NIH shutdown effects multiply. Nature. 2013. Publisher Full Text
31. FORCE11. Improving future research communication and e-scholarship. 2012. Reference Source
32. CODATA-ICSTI Task Group on Data Citation Standards and Practices. Out of cite, out of mind: The current state of practice, policy, and technology for the citation of data. Data Sci J. 2013; 12: 1–75. Reference Source
33. Starr J, Gastl A: isCitedBy: a metadata scheme for DataCite. D-Lib Magazine. 2011; 17(1). Publisher Full Text
34. Altman M, King G: A proposed standard for the scholarly citation of quantitative data. D-Lib Magazine. 2007; 13(3/4). Publisher Full Text
35. Global Historical Climate Data Network. Daily summaries station details: OAKLAND MUSEUM, CA US, GHCND:USC00046336. Reference Source
36. Menne MJ, Durre I, Vose RS, et al.: An overview of the global historical climatology network-daily database. J Atmospheric & Oceanic Technology. 2012; 29(7): 897–910. Publisher Full Text
37. Harris TW, Baran J, Bieri T, et al.: WormBase 2014: new views of curated biology. Nucleic Acids Res. 2014; 42(Database issue): D789–793. PubMed Abstract | Publisher Full Text | Free Full Text
38. Ball A, Duke M: How to cite datasets and link to publications. 2012. Reference Source
39. Pulverer B: A transparent black box. EMBO J. 2010; 29(23): 3891–3892. PubMed Abstract | Publisher Full Text | Free Full Text
40. Herron DM: Is expert peer review obsolete? A model suggests that post-publication reader review may exceed the accuracy of traditional peer review. Surg Endosc. 2012; 26(8): 2275–2280. PubMed Abstract | Publisher Full Text
41. Kriegeskorte N, Walther A, Deca D: An emerging consensus for open evaluation: 18 visions for the future of scientific publishing. Front Comput Neurosci. 2012; 6: 94. PubMed Abstract | Publisher Full Text | Free Full Text
42. Parsons MA, Duerr R, Minster JB: Data citation and peer review. Eos, Transactions American Geophysical Union. 2010; 91(34): 297–298. Publisher Full Text
43. Diederich F: Are we refereeing ourselves to death? The peer-review system at its limit. Angew Chem Int Ed Engl. 2013; 52(52): 13828–9. PubMed Abstract | Publisher Full Text
44. Lawrence B, Jones C, Matthews B, et al.: Citation and peer review of data: Moving towards formal data publication. Int J Digital Curation. 2011; 6(2): 4–37. Publisher Full Text
45. Grootveld M, Van Egmond J: editors. Data Reviews, peer-reviewed research data. Number 5 in DANS Studies in Digital Archiving. Data Archiving and Networked Services. 2011. Reference Source
46. Grootveld M, Van Egmond J: Peer-reviewed open research data: Results of a pilot. Int J Digital Curation. 2012; 7(2): 81–91. Publisher Full Text
47. Parsons MA, Fox PA: Is data publication the right metaphor? Data Sci J. 2013; 12: WDS32–WDS46. Publisher Full Text
48. Schopf JM: Treating data like software: a case for production quality data. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries, JCDL ’12, New York, NY USA, 2012; 153–156. ACM. Publisher Full Text
49. Ram K: Git can facilitate greater reproducibility and increased transparency in science. Source Code Biol Med. 2013; 8(1): 7. PubMed Abstract | Publisher Full Text | Free Full Text
50. Pradal C, Varoquaux G, Langtangen HP: Publishing scientific software matters. J Comput Sci. 2013; 4(5): 311–312. Publisher Full Text

Comments on this article Comments (5)

Version 3

VERSION 3 PUBLISHED 16 Oct 2014

Revised

Comment

Version 2

VERSION 2 PUBLISHED 16 May 2014

Revised

Discussion is closed on this version, please comment on the latest version above.

Reader Comment 22 Aug 2014

Leonardo Candela, ISTI-CNR, Italy

22 Aug 2014

Reader Comment

Rather than a comment, I highlight here a potential issue in Reference 3. If I'm not mistaking it should be:
Lawrence, B.; Jones, C.; Matthews, B.; Pepler, S. & Callaghan, S. ... Continue reading Rather than a comment, I highlight here a potential issue in Reference 3. If I'm not mistaking it should be:
Lawrence, B.; Jones, C.; Matthews, B.; Pepler, S. & Callaghan, S. Citation and Peer Review of Data: Moving Towards Formal Data Publication International Journal of Digital Curation, 2011, 6, 4-37 doi:10.2218/ijdc.v6i2.205
Rather than a comment, I highlight here a potential issue in Reference 3. If I'm not mistaking it should be:
Lawrence, B.; Jones, C.; Matthews, B.; Pepler, S. & Callaghan, S. Citation and Peer Review of Data: Moving Towards Formal Data Publication International Journal of Digital Curation, 2011, 6, 4-37 doi:10.2218/ijdc.v6i2.205
Competing Interests: No competing interests were disclosed. Close
Report a concern
Discussion is closed on this version, please comment on the latest version above.

Version 1

VERSION 1 PUBLISHED 23 Apr 2014

Discussion is closed on this version, please comment on the latest version above.

Reader Comment 06 May 2014

Eric Kansa, Open Context (http://opencontext.org), USA

06 May 2014

Reader Comment

This is an excellent, and a tremendously useful overview of the issues involved in data publishing. From my perspective in archaeology, the discussion of tDAR and Open Context is useful, ... Continue reading This is an excellent, and a tremendously useful overview of the issues involved in data publishing. From my perspective in archaeology, the discussion of tDAR and Open Context is useful, since these different systems try to serve different needs. You may find this poster by Beth Sheehan comparing these different systems useful as well: http://www.slideshare.net/asist_org/rdap14-comparing-disciplinary-repositories-tdar-vs-open-context

One small point of clarification on a minor factual point. The Journal of Open Archaeological Data (JOAD) also lists Open Context (http://opencontext.org) as a repository for data, see: http://openarchaeologydata.metajnl.com/about/editorialPolicies#opencontext

Similarly, Internet Archaeology also lists Open Context in the same vein: http://intarch.ac.uk/authors/data-papers.html
This is an excellent, and a tremendously useful overview of the issues involved in data publishing. From my perspective in archaeology, the discussion of tDAR and Open Context is useful, since these different systems try to serve different needs. You may find this poster by Beth Sheehan comparing these different systems useful as well: http://www.slideshare.net/asist_org/rdap14-comparing-disciplinary-repositories-tdar-vs-open-context

One small point of clarification on a minor factual point. The Journal of Open Archaeological Data (JOAD) also lists Open Context (http://opencontext.org) as a repository for data, see: http://openarchaeologydata.metajnl.com/about/editorialPolicies#opencontext

Similarly, Internet Archaeology also lists Open Context in the same vein: http://intarch.ac.uk/authors/data-papers.html
Competing Interests: I direct Open Context (see: http://opencontext.org/about/people), so I have a professional interest in discussions of this project. Close
Report a concern
Reader Comment 02 May 2014

Konrad Hinsen, Centre de Biophysique Moléculaire (CNRS), France

02 May 2014

Reader Comment
First of all, thanks for this article, which is a good introduction to the problems surrounding data publication.

One aspect which deserves more attention is the question "What is data?" Or, ... Continue reading
First of all, thanks for this article, which is a good introduction to the problems surrounding data publication.

One aspect which deserves more attention is the question "What is data?" Or, more precisely, which categories of data should be distinguished with respect to publication? This is related to the last paragraph of this article that starts with "Ultimately, while “data as software” is promising, data is not software." Data is indeed not software - but software is data.

I would like to propose the following categories of scientific data:
Observational data. This is the "raw input" of science: data from experiments, observations, polls, etc.

Machine-readable information generated by humans. This category includes software, input files, workflows, etc. Information for human consumption but also stored electronically could be included as well: articles, drawings, software documentation, etc.

Data resulting from a computation: processed observational data, output of simulations, etc.
Data in category 1 is not reproducible in any way, and thus needs to be archived and published. Data in category 2 cannot be reproduced exactly by anyone else, but could be regenerated approximately from less complete/precise data by a domain expert. Nevertheless, it should be archived and published as well in order to produce a complete and accurate record of scientific activities. Data in category 3 can be reproduced by computation if the data in categories 1 and 2 is available. It may be convenient to share it nevertheless, in particular if recomputation is expensive, but it's less fundamental than categories 1 and 2.

I believe that these categories are more useful than the traditional separation into data, software, and writeup, in particular for questions such as archiving, citing, and updating. In particular, the vague term "dataset" does not distinguish clearly between categories 1 and 3.
First of all, thanks for this article, which is a good introduction to the problems surrounding data publication.

One aspect which deserves more attention is the question "What is data?" Or, more precisely, which categories of data should be distinguished with respect to publication? This is related to the last paragraph of this article that starts with "Ultimately, while “data as software” is promising, data is not software." Data is indeed not software - but software is data.

I would like to propose the following categories of scientific data:
Observational data. This is the "raw input" of science: data from experiments, observations, polls, etc.

Machine-readable information generated by humans. This category includes software, input files, workflows, etc. Information for human consumption but also stored electronically could be included as well: articles, drawings, software documentation, etc.

Data resulting from a computation: processed observational data, output of simulations, etc.
Data in category 1 is not reproducible in any way, and thus needs to be archived and published. Data in category 2 cannot be reproduced exactly by anyone else, but could be regenerated approximately from less complete/precise data by a domain expert. Nevertheless, it should be archived and published as well in order to produce a complete and accurate record of scientific activities. Data in category 3 can be reproduced by computation if the data in categories 1 and 2 is available. It may be convenient to share it nevertheless, in particular if recomputation is expensive, but it's less fundamental than categories 1 and 2.

I believe that these categories are more useful than the traditional separation into data, software, and writeup, in particular for questions such as archiving, citing, and updating. In particular, the vague term "dataset" does not distinguish clearly between categories 1 and 3.
Competing Interests: none Close
Report a concern
Reader Comment 01 May 2014

Hans Pfeiffenberger, Alfred Wegener Institut, Germany

01 May 2014

Reader Comment
Dear authors,
your article is a very noteworthy and valuable, broad overview of many of the issues surrounding "data publication". I would like to offer this as recommended reading to anybody ... Continue reading
Dear authors,
your article is a very noteworthy and valuable, broad overview of many of the issues surrounding "data publication". I would like to offer this as recommended reading to anybody unfamiliar with the field. However, there is one omission and one erroneous/misleading statement which I strongly suggest to correct:
In the first paragraph of "Data as the subject of a paper" you list a number of quite representative examples of data journals, but manage to omit the probably first example of a "pure" data journal (with peer review of data), ESSD, founded in 2008.
A brief summary of ESSD's rationale and approach was published 2011 in D-Lib Magazine, doi:10.1045/january2011-pfeiffenberger

In the second paragraph of "Citability" you write "DOI is neither sufficient nor necessary for citability- if a dataset moves and the DOI is not updated, the citation breaks and, conversely a well-maintained web-address works as well as a DOI."

I regard this as strongly misleading, at least for a novice to the domain of publishing or identifiers/DOIs: What typically breaks, sooner or later, is a bookmark with a "normal" URL. The DOI system - which I would characterize as "handle system with a policy" - was set up to work around that fact of life. The contracts data centers (DC) have to sign with "their" (DataCite) DOI registration agency typically contain wording such as: "DC has to ensure that registered content will be available for the entire duration of the agreement." (See "contractual form" , linked to from TIB's "DOI registration" page.) Admittedly, this and other such agreements are difficult to find.

By the way, this agreement also adresses the issue of fixity: "Once an item is registered, it may not be altered. If an item is changed, it has to be registered with a new DOI name."
Beyond those corrections, I suggest you provide the reader with some pointers about the venues where the ongoing discussions about data publication issues are actually being led. E.g., there are a number of working and interest groups at the Research Data Alliance (not just the one on Data Citation)

best regards,
Hans Pfeiffenberger
Dear authors,
your article is a very noteworthy and valuable, broad overview of many of the issues surrounding "data publication". I would like to offer this as recommended reading to anybody unfamiliar with the field. However, there is one omission and one erroneous/misleading statement which I strongly suggest to correct:
In the first paragraph of "Data as the subject of a paper" you list a number of quite representative examples of data journals, but manage to omit the probably first example of a "pure" data journal (with peer review of data), ESSD, founded in 2008.
A brief summary of ESSD's rationale and approach was published 2011 in D-Lib Magazine, doi:10.1045/january2011-pfeiffenberger

In the second paragraph of "Citability" you write "DOI is neither sufficient nor necessary for citability- if a dataset moves and the DOI is not updated, the citation breaks and, conversely a well-maintained web-address works as well as a DOI."

I regard this as strongly misleading, at least for a novice to the domain of publishing or identifiers/DOIs: What typically breaks, sooner or later, is a bookmark with a "normal" URL. The DOI system - which I would characterize as "handle system with a policy" - was set up to work around that fact of life. The contracts data centers (DC) have to sign with "their" (DataCite) DOI registration agency typically contain wording such as: "DC has to ensure that registered content will be available for the entire duration of the agreement." (See "contractual form" , linked to from TIB's "DOI registration" page.) Admittedly, this and other such agreements are difficult to find.

By the way, this agreement also adresses the issue of fixity: "Once an item is registered, it may not be altered. If an item is changed, it has to be registered with a new DOI name."
Beyond those corrections, I suggest you provide the reader with some pointers about the venues where the ongoing discussions about data publication issues are actually being led. E.g., there are a number of working and interest groups at the Research Data Alliance (not just the one on Data Citation)

best regards,
Hans Pfeiffenberger
Competing Interests: I happen to be the founder and chief editor of ESSD Close
Report a concern
Reader Comment 30 Apr 2014

Chris HJ Hartgerink, Liberate Science GmbH, Germany

30 Apr 2014

Reader Comment

Possibly of interest to your paper is dat, a program in development to provide version control of datasets (more so than git is able to). It has received funding recently ... Continue reading Possibly of interest to your paper is dat, a program in development to provide version control of datasets (more so than git is able to). It has received funding recently from the Knight Foundation (see here) and is something worth looking out for in terms of data sharing, but more importantly, preservation and logging.

Thank you for writing this — it provides a succinct introduction to an important issue.
Possibly of interest to your paper is dat, a program in development to provide version control of datasets (more so than git is able to). It has received funding recently from the Knight Foundation (see here) and is something worth looking out for in terms of data sharing, but more importantly, preservation and logging.

Thank you for writing this — it provides a succinct introduction to an important issue.
Competing Interests: No competing interests were disclosed. Close
Report a concern
Discussion is closed on this version, please comment on the latest version above.

Author details Author details

¹ California Digital Library, University of California Office of the President, Oakland, CA, 94612, USA

Competing interests

No competing interests were disclosed.

Grant information

JK is supported by a Council on Library and Information Resources/Digital Library Foundation Postdoctoral Fellowship in Data Curation for the Sciences and Social
Sciences funded by the California Digital Library and the Alfred P. Sloan Foundation.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Article Versions (3)

version 3

Revised

Published: 16 Oct 2014, 3:94

https://doi.org/10.12688/f1000research.3979.3

version 2

Revised

Published: 16 May 2014, 3:94

https://doi.org/10.12688/f1000research.3979.2

version 1

Published: 23 Apr 2014, 3:94

https://doi.org/10.12688/f1000research.3979.1

© 2014 Kratz J and Strasser C. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

Download

Export To

metrics

	Views	Downloads
F1000Research	-	-
PubMed Central Data from PMC are received and updated monthly.	-	-

Citations

SEE MORE DETAILS

CITE

how to cite this article

Kratz J and Strasser C. Data publication consensus and controversies [version 2; peer review: 2 approved, 1 approved with reservations] F1000Research 2014, 3:94 (https://doi.org/10.12688/f1000research.3979.2)

NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

track

receive updates on this article

Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?

Key to Reviewer Statuses VIEW HIDE

ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.

Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions

Version 2

VERSION 2

PUBLISHED 16 May 2014

Revised

Views

Reviewer Report 10 Jun 2014

Ingrid Dillo, Data Archiving and Networking Services (DANS), The Hague, The Netherlands

Approved

https://doi.org/10.5256/f1000research.4518.r4540

The article focuses on a topic that receives a lot of interest these days. Therefore it is very timely. The article provides a useful and valuable overview of the current state of affairs and the ongoing debate. The title does justice to the content of the article. So does the abstract.

This is the second version of the article. I do not see many changes in the text based on the earlier critical comments made by Mark Parsons and Peter Fox.

The overview is very informative for everyone who needs a quick introduction into the subject. I do miss the opinion of the authors themselves on the issues at hand and on the quoted suggestions by others. This would have been appropriate in a concluding paragraph.

Detailed comments:

In the Introduction I miss a clear link between data publishing and data citation and creating the possibility for researchers to receive academic credits for their work on data. This academic credit is crucial as an incentive for researchers to put valuable time and effort in sharing their data.
In the paragraph Why publish data? a reference to the Dutch fraud cases might be useful, as these cases got a lot of international attention and more or less triggered the discussion in the Netherlands with respect to research data management, long term preservation of data and data publishing and citation. A reference could be:

Doorn P, Dillo I, van Horik R: Lies, Damned Lies and Research Data: Can Data Sharing Prevent Data Fraud? International Journal of Digital Curation. 2013; 8(1): 229-243 http://dx.doi.org/10.2218/ijdc.v8i1.256
In the paragraph Types of data publication a threefold model is introduced to categorize data publications. It is not clear how this model relates to the terminology and categorisation presented in the introduction. This could be somewhat confusing for the reader.

Furthermore, there are of course many other models available, e.g. that of the The Data Publication Pyramid, developed on the basis of the Jim Gray pyramid, to express the different manifestation forms that research data can have in the publication process:

Reilly S, Schallier W, Schrimpf S, et al.: Report on integration of data and publications. October 2011. Located at: http://www.stm-assoc.org/2011_12_5_ODE_Report_On_Integration_of_Data_and_Publications.pdf

Or the model presented in the report: Costas, R., Meijer, I., Zahedi, Z. and Wouters, P. (2013). The Value of Research Data - Metrics for datasets from a cultural and technical point of view. A Knowledge Exchange Report, available from www.knowledge-exchange.info/datametrics
With respect to trustworthy digital repositories, I would like to add a few comments. First of all, in Europe a European Framework for Audit and Certification of Digital Repositories is emerging. It contains three certification standards (DSA, DIN31644/NESTOR seal and ISO13636) and three levels of certification (basic, extended and formal) see: http://www.trusteddigitalrepository.eu/Site/Trusted%20Digital%20Repository.html

Of these three standards, only DSA has been up and running for some time now ,with 31 seals awarded and 30 ongoing self-assessments at this moment. The NESTOR seal has become available only very recently and the ISO standard is not yet officially available. The accompanying ISO 16919 standard: Requirements for bodies providing audit and certification of candidate trustworthy digital repositories, has been published very recently and now the ISO organization needs to be set up in the different countries, including the training of national auditors. The audits done by CRL are not fully official, since CRL is no formal ISO accreditation body.

In Europe we see a growing interest in TDRs, coming from funders who want to push open data and data sharing and demand the deposit of publicly funded data in long term TDRs. Furthermore European research infrastructures and projects are also looking more and more into the issue of trust hen it comes to data sharing and a groeing number of them is incorporating (parts of) the DSA guidelines into there repositories and policies (e.g. CESSDA, CLARIN, EUDAT).

Yet another certification procedure is offered by the ICSU/WDS to repositories that aim to become a member of the World Data System. See: https://www.icsu-wds.org/community/membership/certification

The certification of TDRs could also help publishers/editorial boards with Data Availability Policies to point their authors to the right repositories for the long-term storage of their data.

Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

CITE

Report a concern

Respond or Comment

Views

Reviewer Report 28 May 2014

Mark Costello, Institute of Marine Science, University of Auckland, Auckland, New Zealand

Approved

https://doi.org/10.5256/f1000research.4518.r4543

Having made some similar comments myself I must agree with this review. But a few points may merit amendment:

Abstract:
Note that what 'publication' means is the same as in print as in digital form. It does not imply peer-review or editorial oversight in either format.

Citability:
Yes, data citations are important but some datasets and web-based resources do not show how they should be cited, some journals do not allow citations to web resources in the Reference list, and even were both possible, too many authors neglect to cite actual datasets and instead cite a web site (which may have many datasets) or a related print paper.

I agree that a DOI is not enough and only permanent if it is updated when documents are moved. A full author-tile-publisher citation as you suggest is more informative and human readable.

I do not think it is problematic to cite parts of datasets. Pages and chapters in books are already cited for example. In most datasets it is also possible to identify individual dat records. Also, the actual data used could be provided in an Appendix so the reader is left in no doubt. Neither do I think 'versioning' is a problem. Where new data are added (e.g. to a time-series) then they comprise a new dataset, as they would if published in print. Where many corrections are made they a dataset can be treated like a paper; i.e. the original can be 'retracted' and replaced, or the new version be published with the metadata stating that it is more accurate.

You mention venues for peer-reviewed data papers. For this article to advance previous articles, perhaps it could expand on these venues and how they manage the details of the peer-review process?

I have published a few papers you may find of interest:

Costello MJ, Wieczorek J. 2014. Best practice for biodiversity data management and publication. Biological Conservation, 173, 68-73. http://www.vliz.be/en/imis?module=ref&refid=234968
Costello MJ, Appeltans W, Bailly N, Berendsohn WG, de Jong Y, Edwards M, Froese R, Huettmann F, Los W, Mees J, Segers H, Bisby FA. 2014. Strategies for the sustainability of online open-access biodiversity databases. Biological Conservation 173, 155-165. http://www.marinebiology.ugent.be/component/imis/?module=ref&refid=230520
Costello MJ, Michener WK, Gahegan M, Zhang Z-Q, Bourne P. 2013. Data should be published, cited and peer-reviewed. Trends in Ecology and Evolution 28 (8), 454-461. http://dx.doi.org/10.1016/j.tree.2013.05.002
Costello, M.J., Vanden Berghe E. 2006. “Ocean Biodiversity Informatics” enabling a new era in marine biology research and management. Marine Ecology Progress Series 316, 203-214. http://www.int-res.com/abstracts/meps/v316/
Costello MJ, Michener WK, Gahegan M, Zhang Z-Q, Bourne P, Chavan V. 2012. Quality assurance and intellectual property rights in advancing biodiversity data publications. ver. 1.0, Copenhagen: Global Biodiversity Information Facility, Pp. 33, ISBN: 87‐92020‐49‐6. Accessible at http://links.gbif.org/qa_ipr_advancing_biodiversity_data_publishing_en_v1.

Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

CITE

Report a concern

Respond or Comment

Version 1

VERSION 1

PUBLISHED 23 Apr 2014

Views

297

Reviewer Report 06 May 2014

Mark Parsons, Research Data Alliance, Troy, NY, USA

Peter Fox, Rensselaer Polytechnic Institute, Troy, NY, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.4264.r4541

General Comments:
Note: This review was written by Parsons and accepted (with some modification) by Fox. Insights likely come from conversations between Fox and Parsons, errors from Parsons.

I am very glad the authors wrote this essay. It is a well-written, needed, and useful summary of the current status of “data publication” from a certain perspective. The authors, however, need to be bolder and more analytical. This is an opinion piece, yet I see little opinion. A certain view is implied by the organization of the paper and the references chosen, but they could be more explicit. The paper would be both more compelling and useful to a broad readership if the authors moved beyond providing a simple summary of the landscape and examined why there is controversy in some areas and then use the evidence they have compiled to suggest a path forward. They need to be more forthright in saying what data publication means to them, or what parts of it they do not deal with. Are they satisfied with the Lawrence et al. definition? Do they accept the critique of Parsons and Fox? What is the scope of their essay?

The authors take a rather narrow view of data publication, which I think hinders their analyses. They describe three types of (digital) data publication: Data as a supplement to an article; data as the subject of a paper; and data independent of a paper. The first two types are relatively new and they represent very little of the data actually being published or released today. The last category, which is essentially an “other” category, is rich in its complexity and encompasses the vast majority of data released. I was disappointed that the examples of this type were only the most bare-bones (Zenodo and Figshare). I think a deeper examination of this third category and its complexity would help the authors better characterize the current landscape and suggest paths forward.

Some questions the authors might consider: Are these really the only three models in consideration or does the publication model overstate a consensus around a certain type of data publication? Why are there different models and which approach is better for different situations? Do they have different business models or imply different social contracts? Might it also be worthy of typing “publishers” instead of “publications”? For example, do domain repositories vs. institutional repositories vs. publishers address the issues differently? Are these models sustaining models or just something to get us through the next 5-10 years while we really figure it out?

I think this oversimplification inhibited some deeper analysis in other areas as well. I would like to see more examination of the validation requirement beyond the lens of peer review, and I would like a deeper examination of incentives and credit beyond citation.

I thought the validation section of the paper was very relevant, but somewhat light. I like the choice of the term validation as more accurate than “quality” and it fits quite well with Callaghan’s useful distinction between technical and scientific review, but I think the authors overemphasize the peer-review style approach. The authors rightly argue that “peer-review” is where the publication metaphor leads us, but it may be a false path. They overstate some difficulties of peer-review (No-one looks at every data value? No, they use statistics, visualization, and other techniques.) while not fully considering who is responsible for what. We need a closer examination of different roles and who are appropriate validators (not necessarily conventional peers). The narrowly defined models of data publication may easily allow for a conventional peer-review process, but it is much more complex in the real-world “other” category. The authors discuss some of this in what they call “independent data validation,” but they don’t draw any conclusions.

Only the simplest of research data collections are validated only by the original creators. More often there are teams working together to develop experiments, sampling protocols, algorithms, etc. There are additional teams who assess, calibrate, and revise the data as they are collected and assembled. The authors discuss some of this in their examples like the PDS and tDAR, but I wish they were more analytical and offered an opinion on the way forward. Are there emerging practices or consensus in these team-based schemes? The level of service concept illustrated by Open Context may be one such area. Would formalizing or codifying some of these processes accomplish the same as peer-review or more? What is the role of the curator or data scientist in all of this? Given the authors’s backgrounds, I was surprised this role was not emphasized more. Finally, I think it is a mistake for science review to be the main way to assess reuse value. It has been shown time and again that data end up being used effectively (and valued) in ways that original experts never envisioned or even thought valid.

The discussion of data citation was good and captured the state of the art well, but again I would have liked to see some views on a way forward. Have we solved the basic problem and are now just dealing with edge cases? Is the “just-in-time identifier” the way to go? What are the implications? Will the more basic solutions work in the interim? More critically, are we overemphasizing the role of citation to provide academic credit? I was gratified that the authors referenced the Parsons and Fox paper which questions the whole data publication metaphor, but I was surprised that they only discussed the “data as software” alternative metaphor. That is a useful metaphor, but I think the ecosystem metaphor has broader acceptance. I mention this because the authors critique the software metaphor because “using it to alter or affect the academic reward system is a tricky prospect”. Yet there is little to suggest that data publication and corresponding citation alters that system either. Indeed there is little if any evidence that data publication and citation incentivize data sharing or stewardship. As Christine Borgman suggests, we need to look more closely at who we are trying to incentivize to do what. There is no reason to assume it follows the same model as research literature publication. It may be beyond the scope of this paper to fully examine incentive structures, but it at least needs to be acknowledged that building on the current model doesn’t seem to be working.

Finally, what is the takeaway message from this essay? It ends rather abruptly with no summary, no suggested directions or immediate challenges to overcome, no call to action, no indications of things we should stop trying, and only brief mention of alternative perspectives. What do the authors want us to take away from this paper?

Overall though, this is a timely and needed essay. It is well researched and nicely written with rich metaphor. With modifications addressing the detailed comments below and better recognizing the complexity of the current data publication landscape, this will be a worthwhile review paper. With more significant modification where the authors dig deeper into the complexities and controversies and truly grapple with their implications to suggest a way forward, this could be a very influential paper. It is possible that the definitions of “publication” and “peer-review” need not be just stretched but changed or even rejected.

Detailed comments:

The whole paper needs a quick copy edit. There are a few typos, missing words, and wrong verb tenses. Note the word “data” is a plural noun. E.g., Data are not software, nor are they literature. (NSICD, instead of NSIDC)
Page 2, para 2: “citability is addressed by assigning a PID.” This is not true, as the authors discuss on page 4, para 4. Indeed, page 4, para 4 seems to contradict itself. Citation is more than a locator/identifier
In the discussion of “Data independent of any paper” it is worth noting that there may often be linkages between these data and myriad papers. Indeed a looser concept of a data paper has existed for some time, where researchers request a citation to a paper even though it is not the data nor fully describes the data (e.g the CRU temp records)
Page 4, para 1: I’m not sure it’s entirely true that published data cannot involve requesting permission. In past work with Indigenous knowledge holders, they were willing to publish summary data and then provide the details when satisfied the use was appropriate and not exploitive. I think those data were “published” as best they could be. A nit, perhaps, but it highlights that there are few if any hard and fast rules about data publication.
Page 4, para 2: You may also want to mention the WDS certification effort, which is combining with the DSA via an RDA Working Group:
Page 4, para 2: The joint declaration of data citation principles involved many more organizations than Force11, CODATA, and DCC. Please credit them all (maybe in a footnote). The glory of the effort was that it was truly a joint effort across many groups. There is no leader. Force11 was primarily a convener.
Page 4, para 6: The deep citation approach recommended by ESIP is not to just to list variables or a range of data. It is to identify a “structural index” for the data and to use this to reference subsets. In Earth science this structural index is often space and time, but many other indices are possible--location in a gene sequence, file type, variable, bandwidth, viewing angle, etc. It is not just for “straightforward” data sets.
Page 5, para 5: I take issue with the statement that few repositories provide scientific review. I can think of a couple dozen that do just off the top of my head, and I bet most domain repositories have some level of science review. The “scientists” may not always be in house, but the repository is a team facilitator. See my general comments.
Page 5, para 10: The PDS system is only unusual in that it is well documented and advertised. As mentioned, this team style approach is actually fairly common
Page 6, para 3: Parsons and Fox don’t just argue that the data publication metaphor is limiting. They also say it is misleading. That should be acknowledged at least, if not actively grappled with.

Competing Interests: No competing interests were disclosed.

We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.

CITE

Report a concern

Author Response 12 May 2014

John Kratz, California Digital Library, University of California Office of the President, Oakland, CA, 94612, USA

12 May 2014

Author Response

Thank you for refereeing our paper and thank you especially for delivering your report so quickly.

We submitted the paper as a review article, not an opinion piece, and it was ... Continue reading Thank you for refereeing our paper and thank you especially for delivering your report so quickly.

We submitted the paper as a review article, not an opinion piece, and it was reclassified somewhere along the way. I contacted an editor at F1000 about the issue, and I believe it will be switched back shortly. While there is undoubtedly a viewpoint inherent in the way we have organized the manuscript, it was our intention to deliver a timely summary of the current landscape as a foundation for future thinking, not to offer prescriptions or to endorse particular approaches. We have no shortage of opinions about data publication, and a true opinion piece may follow at some point, but our aim here was to remain fairly neutral. I think the paper you are asking for would also be valuable, but it's an entirely different paper from the one we have written.

That said, your report is full of suggestions for expansion of analysis and clarification of scope that would absolutely improve the paper (e.g. the question of why some issues resist consensus more than others is an excellent one), and we will certainly address them in the next version.
Thank you for refereeing our paper and thank you especially for delivering your report so quickly.

We submitted the paper as a review article, not an opinion piece, and it was reclassified somewhere along the way. I contacted an editor at F1000 about the issue, and I believe it will be switched back shortly. While there is undoubtedly a viewpoint inherent in the way we have organized the manuscript, it was our intention to deliver a timely summary of the current landscape as a foundation for future thinking, not to offer prescriptions or to endorse particular approaches. We have no shortage of opinions about data publication, and a true opinion piece may follow at some point, but our aim here was to remain fairly neutral. I think the paper you are asking for would also be valuable, but it's an entirely different paper from the one we have written.

That said, your report is full of suggestions for expansion of analysis and clarification of scope that would absolutely improve the paper (e.g. the question of why some issues resist consensus more than others is an excellent one), and we will certainly address them in the next version.
Competing Interests: I am an author of the selected paper. Close
Report a concern
Respond or Comment

COMMENTS ON THIS REPORT

Author Response 12 May 2014

John Kratz, California Digital Library, University of California Office of the President, Oakland, CA, 94612, USA

12 May 2014

Author Response

Thank you for refereeing our paper and thank you especially for delivering your report so quickly.

We submitted the paper as a review article, not an opinion piece, and it was ... Continue reading Thank you for refereeing our paper and thank you especially for delivering your report so quickly.

We submitted the paper as a review article, not an opinion piece, and it was reclassified somewhere along the way. I contacted an editor at F1000 about the issue, and I believe it will be switched back shortly. While there is undoubtedly a viewpoint inherent in the way we have organized the manuscript, it was our intention to deliver a timely summary of the current landscape as a foundation for future thinking, not to offer prescriptions or to endorse particular approaches. We have no shortage of opinions about data publication, and a true opinion piece may follow at some point, but our aim here was to remain fairly neutral. I think the paper you are asking for would also be valuable, but it's an entirely different paper from the one we have written.

That said, your report is full of suggestions for expansion of analysis and clarification of scope that would absolutely improve the paper (e.g. the question of why some issues resist consensus more than others is an excellent one), and we will certainly address them in the next version.
Thank you for refereeing our paper and thank you especially for delivering your report so quickly.

We submitted the paper as a review article, not an opinion piece, and it was reclassified somewhere along the way. I contacted an editor at F1000 about the issue, and I believe it will be switched back shortly. While there is undoubtedly a viewpoint inherent in the way we have organized the manuscript, it was our intention to deliver a timely summary of the current landscape as a foundation for future thinking, not to offer prescriptions or to endorse particular approaches. We have no shortage of opinions about data publication, and a true opinion piece may follow at some point, but our aim here was to remain fairly neutral. I think the paper you are asking for would also be valuable, but it's an entirely different paper from the one we have written.

That said, your report is full of suggestions for expansion of analysis and clarification of scope that would absolutely improve the paper (e.g. the question of why some issues resist consensus more than others is an excellent one), and we will certainly address them in the next version.
Competing Interests: I am an author of the selected paper. Close
Report a concern

Comments on this article Comments (5)

Version 3

VERSION 3 PUBLISHED 16 Oct 2014

Revised

Comment

Version 2

VERSION 2 PUBLISHED 16 May 2014

Revised

Discussion is closed on this version, please comment on the latest version above.

Reader Comment 22 Aug 2014

Leonardo Candela, ISTI-CNR, Italy

22 Aug 2014

Reader Comment

Rather than a comment, I highlight here a potential issue in Reference 3. If I'm not mistaking it should be:
Lawrence, B.; Jones, C.; Matthews, B.; Pepler, S. & Callaghan, S. ... Continue reading Rather than a comment, I highlight here a potential issue in Reference 3. If I'm not mistaking it should be:
Lawrence, B.; Jones, C.; Matthews, B.; Pepler, S. & Callaghan, S. Citation and Peer Review of Data: Moving Towards Formal Data Publication International Journal of Digital Curation, 2011, 6, 4-37 doi:10.2218/ijdc.v6i2.205
Rather than a comment, I highlight here a potential issue in Reference 3. If I'm not mistaking it should be:
Lawrence, B.; Jones, C.; Matthews, B.; Pepler, S. & Callaghan, S. Citation and Peer Review of Data: Moving Towards Formal Data Publication International Journal of Digital Curation, 2011, 6, 4-37 doi:10.2218/ijdc.v6i2.205
Competing Interests: No competing interests were disclosed. Close
Report a concern
Discussion is closed on this version, please comment on the latest version above.

Version 1

VERSION 1 PUBLISHED 23 Apr 2014

Discussion is closed on this version, please comment on the latest version above.

Reader Comment 06 May 2014

Eric Kansa, Open Context (http://opencontext.org), USA

06 May 2014

Reader Comment

This is an excellent, and a tremendously useful overview of the issues involved in data publishing. From my perspective in archaeology, the discussion of tDAR and Open Context is useful, ... Continue reading This is an excellent, and a tremendously useful overview of the issues involved in data publishing. From my perspective in archaeology, the discussion of tDAR and Open Context is useful, since these different systems try to serve different needs. You may find this poster by Beth Sheehan comparing these different systems useful as well: http://www.slideshare.net/asist_org/rdap14-comparing-disciplinary-repositories-tdar-vs-open-context

One small point of clarification on a minor factual point. The Journal of Open Archaeological Data (JOAD) also lists Open Context (http://opencontext.org) as a repository for data, see: http://openarchaeologydata.metajnl.com/about/editorialPolicies#opencontext

Similarly, Internet Archaeology also lists Open Context in the same vein: http://intarch.ac.uk/authors/data-papers.html
This is an excellent, and a tremendously useful overview of the issues involved in data publishing. From my perspective in archaeology, the discussion of tDAR and Open Context is useful, since these different systems try to serve different needs. You may find this poster by Beth Sheehan comparing these different systems useful as well: http://www.slideshare.net/asist_org/rdap14-comparing-disciplinary-repositories-tdar-vs-open-context

One small point of clarification on a minor factual point. The Journal of Open Archaeological Data (JOAD) also lists Open Context (http://opencontext.org) as a repository for data, see: http://openarchaeologydata.metajnl.com/about/editorialPolicies#opencontext

Similarly, Internet Archaeology also lists Open Context in the same vein: http://intarch.ac.uk/authors/data-papers.html
Competing Interests: I direct Open Context (see: http://opencontext.org/about/people), so I have a professional interest in discussions of this project. Close
Report a concern
Reader Comment 02 May 2014

Konrad Hinsen, Centre de Biophysique Moléculaire (CNRS), France

02 May 2014

Reader Comment
First of all, thanks for this article, which is a good introduction to the problems surrounding data publication.

One aspect which deserves more attention is the question "What is data?" Or, ... Continue reading
First of all, thanks for this article, which is a good introduction to the problems surrounding data publication.

One aspect which deserves more attention is the question "What is data?" Or, more precisely, which categories of data should be distinguished with respect to publication? This is related to the last paragraph of this article that starts with "Ultimately, while “data as software” is promising, data is not software." Data is indeed not software - but software is data.

I would like to propose the following categories of scientific data:
Observational data. This is the "raw input" of science: data from experiments, observations, polls, etc.

Machine-readable information generated by humans. This category includes software, input files, workflows, etc. Information for human consumption but also stored electronically could be included as well: articles, drawings, software documentation, etc.

Data resulting from a computation: processed observational data, output of simulations, etc.
Data in category 1 is not reproducible in any way, and thus needs to be archived and published. Data in category 2 cannot be reproduced exactly by anyone else, but could be regenerated approximately from less complete/precise data by a domain expert. Nevertheless, it should be archived and published as well in order to produce a complete and accurate record of scientific activities. Data in category 3 can be reproduced by computation if the data in categories 1 and 2 is available. It may be convenient to share it nevertheless, in particular if recomputation is expensive, but it's less fundamental than categories 1 and 2.

I believe that these categories are more useful than the traditional separation into data, software, and writeup, in particular for questions such as archiving, citing, and updating. In particular, the vague term "dataset" does not distinguish clearly between categories 1 and 3.
First of all, thanks for this article, which is a good introduction to the problems surrounding data publication.

One aspect which deserves more attention is the question "What is data?" Or, more precisely, which categories of data should be distinguished with respect to publication? This is related to the last paragraph of this article that starts with "Ultimately, while “data as software” is promising, data is not software." Data is indeed not software - but software is data.

I would like to propose the following categories of scientific data:
Observational data. This is the "raw input" of science: data from experiments, observations, polls, etc.

Machine-readable information generated by humans. This category includes software, input files, workflows, etc. Information for human consumption but also stored electronically could be included as well: articles, drawings, software documentation, etc.

Data resulting from a computation: processed observational data, output of simulations, etc.
Data in category 1 is not reproducible in any way, and thus needs to be archived and published. Data in category 2 cannot be reproduced exactly by anyone else, but could be regenerated approximately from less complete/precise data by a domain expert. Nevertheless, it should be archived and published as well in order to produce a complete and accurate record of scientific activities. Data in category 3 can be reproduced by computation if the data in categories 1 and 2 is available. It may be convenient to share it nevertheless, in particular if recomputation is expensive, but it's less fundamental than categories 1 and 2.

I believe that these categories are more useful than the traditional separation into data, software, and writeup, in particular for questions such as archiving, citing, and updating. In particular, the vague term "dataset" does not distinguish clearly between categories 1 and 3.
Competing Interests: none Close
Report a concern
Reader Comment 01 May 2014

Hans Pfeiffenberger, Alfred Wegener Institut, Germany

01 May 2014

Reader Comment
Dear authors,
your article is a very noteworthy and valuable, broad overview of many of the issues surrounding "data publication". I would like to offer this as recommended reading to anybody ... Continue reading
Dear authors,
your article is a very noteworthy and valuable, broad overview of many of the issues surrounding "data publication". I would like to offer this as recommended reading to anybody unfamiliar with the field. However, there is one omission and one erroneous/misleading statement which I strongly suggest to correct:
In the first paragraph of "Data as the subject of a paper" you list a number of quite representative examples of data journals, but manage to omit the probably first example of a "pure" data journal (with peer review of data), ESSD, founded in 2008.
A brief summary of ESSD's rationale and approach was published 2011 in D-Lib Magazine, doi:10.1045/january2011-pfeiffenberger

In the second paragraph of "Citability" you write "DOI is neither sufficient nor necessary for citability- if a dataset moves and the DOI is not updated, the citation breaks and, conversely a well-maintained web-address works as well as a DOI."

I regard this as strongly misleading, at least for a novice to the domain of publishing or identifiers/DOIs: What typically breaks, sooner or later, is a bookmark with a "normal" URL. The DOI system - which I would characterize as "handle system with a policy" - was set up to work around that fact of life. The contracts data centers (DC) have to sign with "their" (DataCite) DOI registration agency typically contain wording such as: "DC has to ensure that registered content will be available for the entire duration of the agreement." (See "contractual form" , linked to from TIB's "DOI registration" page.) Admittedly, this and other such agreements are difficult to find.

By the way, this agreement also adresses the issue of fixity: "Once an item is registered, it may not be altered. If an item is changed, it has to be registered with a new DOI name."
Beyond those corrections, I suggest you provide the reader with some pointers about the venues where the ongoing discussions about data publication issues are actually being led. E.g., there are a number of working and interest groups at the Research Data Alliance (not just the one on Data Citation)

best regards,
Hans Pfeiffenberger
Dear authors,
your article is a very noteworthy and valuable, broad overview of many of the issues surrounding "data publication". I would like to offer this as recommended reading to anybody unfamiliar with the field. However, there is one omission and one erroneous/misleading statement which I strongly suggest to correct:
In the first paragraph of "Data as the subject of a paper" you list a number of quite representative examples of data journals, but manage to omit the probably first example of a "pure" data journal (with peer review of data), ESSD, founded in 2008.
A brief summary of ESSD's rationale and approach was published 2011 in D-Lib Magazine, doi:10.1045/january2011-pfeiffenberger

In the second paragraph of "Citability" you write "DOI is neither sufficient nor necessary for citability- if a dataset moves and the DOI is not updated, the citation breaks and, conversely a well-maintained web-address works as well as a DOI."

I regard this as strongly misleading, at least for a novice to the domain of publishing or identifiers/DOIs: What typically breaks, sooner or later, is a bookmark with a "normal" URL. The DOI system - which I would characterize as "handle system with a policy" - was set up to work around that fact of life. The contracts data centers (DC) have to sign with "their" (DataCite) DOI registration agency typically contain wording such as: "DC has to ensure that registered content will be available for the entire duration of the agreement." (See "contractual form" , linked to from TIB's "DOI registration" page.) Admittedly, this and other such agreements are difficult to find.

By the way, this agreement also adresses the issue of fixity: "Once an item is registered, it may not be altered. If an item is changed, it has to be registered with a new DOI name."
Beyond those corrections, I suggest you provide the reader with some pointers about the venues where the ongoing discussions about data publication issues are actually being led. E.g., there are a number of working and interest groups at the Research Data Alliance (not just the one on Data Citation)

best regards,
Hans Pfeiffenberger
Competing Interests: I happen to be the founder and chief editor of ESSD Close
Report a concern
Reader Comment 30 Apr 2014

Chris HJ Hartgerink, Liberate Science GmbH, Germany

30 Apr 2014

Reader Comment

Possibly of interest to your paper is dat, a program in development to provide version control of datasets (more so than git is able to). It has received funding recently ... Continue reading Possibly of interest to your paper is dat, a program in development to provide version control of datasets (more so than git is able to). It has received funding recently from the Knight Foundation (see here) and is something worth looking out for in terms of data sharing, but more importantly, preservation and logging.

Thank you for writing this — it provides a succinct introduction to an important issue.
Possibly of interest to your paper is dat, a program in development to provide version control of datasets (more so than git is able to). It has received funding recently from the Knight Foundation (see here) and is something worth looking out for in terms of data sharing, but more importantly, preservation and logging.

Thank you for writing this — it provides a succinct introduction to an important issue.
Competing Interests: No competing interests were disclosed. Close
Report a concern
Discussion is closed on this version, please comment on the latest version above.

Open Peer Review

Reviewer Status

Reviewer Reports

	Invited Reviewers
	1	2	3
Version 3 (revision) 16 Oct 14	read
Version 2 (revision) 16 May 14		read	read
Version 1 23 Apr 14	read

Mark Parsons, Research Data Alliance, Troy, NY, USA

Peter Fox, Rensselaer Polytechnic Institute, Troy, NY, USA
Mark Costello, University of Auckland, Auckland, New Zealand
Ingrid Dillo, Data Archiving and Networking Services (DANS), The Hague, The Netherlands

Comments on this article

All Comments(5)

Add a comment

Browse by related subjects

Back to all reports

Reviewer Report

52 Views

06 Nov 2014 | for Version 3

Mark Parsons, Research Data Alliance, Troy, NY, USA

52 Views Cite this report Responses(0)

Approved

A nice rewrite and a very much improved paper, especially since it is now properly classified as a review article. Just a few small corrections listed below.

Page 4, Para 5:
"Writing a request to the creator should [very rarely] be part of the process" As noted before creator permission is sometimes legitimate.

Page 5, para 2:
"The most familiar kind of data publication is a traditional journal article accompanied by underlying data. " [citation needed] or change the sentence.

Page 5, para 7
"Data papers are predated by an approach that Lawrence et al. (2011) call data publication by proxy..." This is not true. As Hans Pfeiffenberger notes, ESSD dates back to 2008. Indeed I wouldn't be surprised if Lawrence chatted a bit with ESSD editors, Pfeiffenberger and Carlson, in preparing his paper.

Page 6, para 1
NSIDC does do external scientific reviews but they only go outside when they don’t have expertise in house. So it's sort of a combined approach. A quibble, but they pride themselves on in-house scientific expertise and engagement.

Page 6 para 2:
This paragraph is a little confused. Domain repositories don’t usually serve interdisciplinary use very well, but I don’t see how it’s necessarily bad to be distributed. What do you mean by publishing the whole research story? No one entity can do that.

Competing Interests

No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Respond to this report

Responses (0)

Back to all reports

Reviewer Report

67 Views

10 Jun 2014 | for Version 2

Ingrid Dillo, Data Archiving and Networking Services (DANS), The Hague, The Netherlands

67 Views Cite this report Responses(0)

Approved

In the Introduction I miss a clear link between data publishing and data citation and creating the possibility for researchers to receive academic credits for their work on data. This academic credit is crucial as an incentive for researchers to put valuable time and effort in sharing their data.
In the paragraph Why publish data? a reference to the Dutch fraud cases might be useful, as these cases got a lot of international attention and more or less triggered the discussion in the Netherlands with respect to research data management, long term preservation of data and data publishing and citation. A reference could be:

Doorn P, Dillo I, van Horik R: Lies, Damned Lies and Research Data: Can Data Sharing Prevent Data Fraud? International Journal of Digital Curation. 2013; 8(1): 229-243 http://dx.doi.org/10.2218/ijdc.v8i1.256
In the paragraph Types of data publication a threefold model is introduced to categorize data publications. It is not clear how this model relates to the terminology and categorisation presented in the introduction. This could be somewhat confusing for the reader.

Furthermore, there are of course many other models available, e.g. that of the The Data Publication Pyramid, developed on the basis of the Jim Gray pyramid, to express the different manifestation forms that research data can have in the publication process:

Reilly S, Schallier W, Schrimpf S, et al.: Report on integration of data and publications. October 2011. Located at: http://www.stm-assoc.org/2011_12_5_ODE_Report_On_Integration_of_Data_and_Publications.pdf

Or the model presented in the report: Costas, R., Meijer, I., Zahedi, Z. and Wouters, P. (2013). The Value of Research Data - Metrics for datasets from a cultural and technical point of view. A Knowledge Exchange Report, available from www.knowledge-exchange.info/datametrics
With respect to trustworthy digital repositories, I would like to add a few comments. First of all, in Europe a European Framework for Audit and Certification of Digital Repositories is emerging. It contains three certification standards (DSA, DIN31644/NESTOR seal and ISO13636) and three levels of certification (basic, extended and formal) see: http://www.trusteddigitalrepository.eu/Site/Trusted%20Digital%20Repository.html

Of these three standards, only DSA has been up and running for some time now ,with 31 seals awarded and 30 ongoing self-assessments at this moment. The NESTOR seal has become available only very recently and the ISO standard is not yet officially available. The accompanying ISO 16919 standard: Requirements for bodies providing audit and certification of candidate trustworthy digital repositories, has been published very recently and now the ISO organization needs to be set up in the different countries, including the training of national auditors. The audits done by CRL are not fully official, since CRL is no formal ISO accreditation body.

In Europe we see a growing interest in TDRs, coming from funders who want to push open data and data sharing and demand the deposit of publicly funded data in long term TDRs. Furthermore European research infrastructures and projects are also looking more and more into the issue of trust hen it comes to data sharing and a groeing number of them is incorporating (parts of) the DSA guidelines into there repositories and policies (e.g. CESSDA, CLARIN, EUDAT).

Yet another certification procedure is offered by the ICSU/WDS to repositories that aim to become a member of the World Data System. See: https://www.icsu-wds.org/community/membership/certification

The certification of TDRs could also help publishers/editorial boards with Data Availability Policies to point their authors to the right repositories for the long-term storage of their data.

Competing Interests

No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Respond to this report

Responses (0)

Back to all reports

Reviewer Report

60 Views

28 May 2014 | for Version 2

Mark Costello, Institute of Marine Science, University of Auckland, Auckland, New Zealand

60 Views Cite this report Responses(0)

Approved

Costello MJ, Wieczorek J. 2014. Best practice for biodiversity data management and publication. Biological Conservation, 173, 68-73. http://www.vliz.be/en/imis?module=ref&refid=234968
Costello MJ, Appeltans W, Bailly N, Berendsohn WG, de Jong Y, Edwards M, Froese R, Huettmann F, Los W, Mees J, Segers H, Bisby FA. 2014. Strategies for the sustainability of online open-access biodiversity databases. Biological Conservation 173, 155-165. http://www.marinebiology.ugent.be/component/imis/?module=ref&refid=230520
Costello MJ, Michener WK, Gahegan M, Zhang Z-Q, Bourne P. 2013. Data should be published, cited and peer-reviewed. Trends in Ecology and Evolution 28 (8), 454-461. http://dx.doi.org/10.1016/j.tree.2013.05.002
Costello, M.J., Vanden Berghe E. 2006. “Ocean Biodiversity Informatics” enabling a new era in marine biology research and management. Marine Ecology Progress Series 316, 203-214. http://www.int-res.com/abstracts/meps/v316/
Costello MJ, Michener WK, Gahegan M, Zhang Z-Q, Bourne P, Chavan V. 2012. Quality assurance and intellectual property rights in advancing biodiversity data publications. ver. 1.0, Copenhagen: Global Biodiversity Information Facility, Pp. 33, ISBN: 87‐92020‐49‐6. Accessible at http://links.gbif.org/qa_ipr_advancing_biodiversity_data_publishing_en_v1.

Competing Interests

No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Respond to this report

Responses (0)

Back to all reports

Reviewer Report

297 Views

06 May 2014 | for Version 1

Mark Parsons, Research Data Alliance, Troy, NY, USA

Peter Fox, Rensselaer Polytechnic Institute, Troy, NY, USA

297 Views Cite this report Responses(1)

Approved With Reservations

The whole paper needs a quick copy edit. There are a few typos, missing words, and wrong verb tenses. Note the word “data” is a plural noun. E.g., Data are not software, nor are they literature. (NSICD, instead of NSIDC)
Page 2, para 2: “citability is addressed by assigning a PID.” This is not true, as the authors discuss on page 4, para 4. Indeed, page 4, para 4 seems to contradict itself. Citation is more than a locator/identifier
In the discussion of “Data independent of any paper” it is worth noting that there may often be linkages between these data and myriad papers. Indeed a looser concept of a data paper has existed for some time, where researchers request a citation to a paper even though it is not the data nor fully describes the data (e.g the CRU temp records)
Page 4, para 1: I’m not sure it’s entirely true that published data cannot involve requesting permission. In past work with Indigenous knowledge holders, they were willing to publish summary data and then provide the details when satisfied the use was appropriate and not exploitive. I think those data were “published” as best they could be. A nit, perhaps, but it highlights that there are few if any hard and fast rules about data publication.
Page 4, para 2: You may also want to mention the WDS certification effort, which is combining with the DSA via an RDA Working Group:
Page 4, para 2: The joint declaration of data citation principles involved many more organizations than Force11, CODATA, and DCC. Please credit them all (maybe in a footnote). The glory of the effort was that it was truly a joint effort across many groups. There is no leader. Force11 was primarily a convener.
Page 4, para 6: The deep citation approach recommended by ESIP is not to just to list variables or a range of data. It is to identify a “structural index” for the data and to use this to reference subsets. In Earth science this structural index is often space and time, but many other indices are possible--location in a gene sequence, file type, variable, bandwidth, viewing angle, etc. It is not just for “straightforward” data sets.
Page 5, para 5: I take issue with the statement that few repositories provide scientific review. I can think of a couple dozen that do just off the top of my head, and I bet most domain repositories have some level of science review. The “scientists” may not always be in house, but the repository is a team facilitator. See my general comments.
Page 5, para 10: The PDS system is only unusual in that it is well documented and advertised. As mentioned, this team style approach is actually fairly common
Page 6, para 3: Parsons and Fox don’t just argue that the data publication metaphor is limiting. They also say it is misleading. That should be acknowledged at least, if not actively grappled with.

Competing Interests

No competing interests were disclosed.

Respond to this report

Responses (1)

Author Response

12 May 2014

John Kratz, California Digital Library, University of California Office of the President, Oakland, CA, 94612, USA

Thank you for refereeing our paper and thank you especially for delivering your report so quickly.

We submitted the paper as a review article, not an opinion piece, and it was reclassified somewhere along the way. I contacted an editor at F1000 about the issue, and I believe it will be switched back shortly. While there is undoubtedly a viewpoint inherent in the way we have organized the manuscript, it was our intention to deliver a timely summary of the current landscape as a foundation for future thinking, not to offer prescriptions or to endorse particular approaches. We have no shortage of opinions about data publication, and a true opinion piece may follow at some point, but our aim here was to remain fairly neutral. I think the paper you are asking for would also be valuable, but it's an entirely different paper from the one we have written.

That said, your report is full of suggestions for expansion of analysis and clarification of scope that would absolutely improve the paper (e.g. the question of why some issues resist consensus more than others is an excellent one), and we will certainly address them in the next version.

View more View less

Competing Interests

I am an author of the selected paper.

Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions

[1] 1. Costello MJ: Motivating online publication of data. BioScience. 2009; 59(5): 418–427. Publisher Full Text

[2] 2. Smith VS: Data publication: towards a database of everything. BMC Res Notes. 2009; 2: 113. PubMed Abstract | Publisher Full Text | Free Full Text

[3] 3. Lawrence B, Jones C, Matthews B, et al.: Data publication. Int J Digital Curation. 2011.

[4] 4. Callaghan S, Donegan S, Pepler S, et al.: Making data a first class scientific output: Data citation and publication by NERC’s environmental data centres. Int J Digital Curation. 2012; 7(1): 107–113. Publisher Full Text

[5] 5. Mobley A, Linder SK, Braeuer R, et al.: A survey on data reproducibility in cancer research provides insights into our limited ability to translate findings from the laboratory to the clinic. PLoS One. 2013; 8(5): e63221. PubMed Abstract | Publisher Full Text | Free Full Text

[6] 6. Pashler H, Harris CR: Is the replicability crisis overblown? three arguments examined. Perspect Psychol Sci. 2012; 7(6): 531–536. Publisher Full Text

[7] 7. Zimmer C: Rise in scientific journal retractions prompts calls for reform. The New York Times. 2012.

[8] 8. Hiltzik M: Science has lost its way at a big cost to humanity. Los Angeles Times. 2013. Reference Source

[9] 9. Begley CG, Ellis LM: Drug development: Raise standards for preclinical cancer research. Nature. 2012; 483(7391): 531–533. PubMed Abstract | Publisher Full Text

[10] 10. Cyranoski D: Acid-bath stem-cell study under investigation. Nature. 2014. Publisher Full Text

[11] 11. Tabuchi H: One author of a startling stem cell study calls for its retraction. The New York Times. 2014. Reference Source

[12] 12. Drew BT, Gazis R, Cabezas P, et al.: Lost branches on the tree of life. PLoS Biol. 2013; 11(9): e1001636. Publisher Full Text

[13] 13. Collins FS, Tabak LA: Policy: NIH plans to enhance reproducibility. Nature. 2014; 505(7485): 612–613. PubMed Abstract | Publisher Full Text

[14] 14. Alsheikh-Ali AA, Qureshi W, Al Mallah MH: Public availability of published research data in high-impact journals. PLoS One. 2011; 6(9): e24357. PubMed Abstract | Publisher Full Text | Free Full Text

[15] 15. Vines TH, Albert AYK, Andrew RL: The availability of research data declines rapidly with article age. Curr Biol. 2014; 24(1): 94–7. PubMed Abstract | Publisher Full Text

[16] 16. Whitlock MC, McPeek MA, Rausher MD: Data archiving. Am Nat. 2010; 175(2): 145–146. PubMed Abstract | Publisher Full Text

[17] 17. Fairbairn DJ: The advent of mandatory data archiving. Evolution. 2011; 65(1): 1–2. PubMed Abstract | Publisher Full Text

[18] 18. Bloom T, Ganley E, Winker M: Data access for the open access literature: PLOS’s data policy. PLoS Biol. 2014; 12(2): e1001797. Publisher Full Text

[19] 19. Kansa EC, Kansa SW: We all know that a 14 is a sheep: Data publication and professionalism in archaeological communication. J Endocrinol Metab Arch Heritage Studies. 2013; 1(1): 88–97. Reference Source

[20] 20. Piwowar HA, Vision TJ, Whitlock MC: Data archiving is a good investment. Nature. 2011; 473(7347): 285–285. PubMed Abstract | Publisher Full Text

[21] 21. Gray J, Szalay AS, Thakar AR, et al.: Online scientific data curation, publication, and archiving. 2002; 103–107. Reference Source

[22] 22. Maunsell J: Announcement regarding supplemental material. J Neurosci. 2010; 30(32): 10599–10600. Reference Source

[23] 23. Benson DA, Cavanaugh M, Clark K, et al.: GenBank. Nucleic Acids Res. 2013; 41(Database issue): D36–42. PubMed Abstract | Publisher Full Text | Free Full Text

[24] 24. Berman HM, Westbrook J, Feng Z, et al.: The Protein Data Bank. Nucleic Acids Res. 2000; 28(1): 235–242. PubMed Abstract | Publisher Full Text | Free Full Text

[25] 25. Newman P, Corke P: Data papers — peer reviewed publication of high quality data sets. Int J Rob Res. 2009; 28(5): 587–587. Publisher Full Text

[26] 26. Waters D, Garrett J: Preserving Digital Information. Report of the Task Force on Archiving of Digital Information. ERIC. 1996. Reference Source

[27] 27. Beagrie N: Digital curation for science, digital libraries, and individuals. Int J Digital Curation. 2008; 1(1): 3–16. Publisher Full Text

[28] 28. Office for Civil Rights. Renal resource guide. 2003. Reference Source

[29] 29. Center for Research Libraries (U.S.) and OCLC. Trustworthy repositories audit & certification (TRAC) criteria and checklist. Center for Research Libraries; OCLC Online Computer Library Center, Inc Chicago: Dublin, Ohio. 2007. Reference Source

[30] 30. Hayden EC: NIH shutdown effects multiply. Nature. 2013. Publisher Full Text

[31] 31. FORCE11. Improving future research communication and e-scholarship. 2012. Reference Source

[32] 32. CODATA-ICSTI Task Group on Data Citation Standards and Practices. Out of cite, out of mind: The current state of practice, policy, and technology for the citation of data. Data Sci J. 2013; 12: 1–75. Reference Source

[33] 33. Starr J, Gastl A: isCitedBy: a metadata scheme for DataCite. D-Lib Magazine. 2011; 17(1). Publisher Full Text

[34] 34. Altman M, King G: A proposed standard for the scholarly citation of quantitative data. D-Lib Magazine. 2007; 13(3/4). Publisher Full Text

[35] 35. Global Historical Climate Data Network. Daily summaries station details: OAKLAND MUSEUM, CA US, GHCND:USC00046336. Reference Source

[36] 36. Menne MJ, Durre I, Vose RS, et al.: An overview of the global historical climatology network-daily database. J Atmospheric & Oceanic Technology. 2012; 29(7): 897–910. Publisher Full Text

[37] 37. Harris TW, Baran J, Bieri T, et al.: WormBase 2014: new views of curated biology. Nucleic Acids Res. 2014; 42(Database issue): D789–793. PubMed Abstract | Publisher Full Text | Free Full Text

[38] 38. Ball A, Duke M: How to cite datasets and link to publications. 2012. Reference Source

[39] 39. Pulverer B: A transparent black box. EMBO J. 2010; 29(23): 3891–3892. PubMed Abstract | Publisher Full Text | Free Full Text

[40] 40. Herron DM: Is expert peer review obsolete? A model suggests that post-publication reader review may exceed the accuracy of traditional peer review. Surg Endosc. 2012; 26(8): 2275–2280. PubMed Abstract | Publisher Full Text

[41] 41. Kriegeskorte N, Walther A, Deca D: An emerging consensus for open evaluation: 18 visions for the future of scientific publishing. Front Comput Neurosci. 2012; 6: 94. PubMed Abstract | Publisher Full Text | Free Full Text

[42] 42. Parsons MA, Duerr R, Minster JB: Data citation and peer review. Eos, Transactions American Geophysical Union. 2010; 91(34): 297–298. Publisher Full Text

[43] 43. Diederich F: Are we refereeing ourselves to death? The peer-review system at its limit. Angew Chem Int Ed Engl. 2013; 52(52): 13828–9. PubMed Abstract | Publisher Full Text

[44] 44. Lawrence B, Jones C, Matthews B, et al.: Citation and peer review of data: Moving towards formal data publication. Int J Digital Curation. 2011; 6(2): 4–37. Publisher Full Text

[45] 45. Grootveld M, Van Egmond J: editors. Data Reviews, peer-reviewed research data. Number 5 in DANS Studies in Digital Archiving. Data Archiving and Networked Services. 2011. Reference Source

[46] 46. Grootveld M, Van Egmond J: Peer-reviewed open research data: Results of a pilot. Int J Digital Curation. 2012; 7(2): 81–91. Publisher Full Text

[47] 47. Parsons MA, Fox PA: Is data publication the right metaphor? Data Sci J. 2013; 12: WDS32–WDS46. Publisher Full Text

[48] 48. Schopf JM: Treating data like software: a case for production quality data. In Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries, JCDL ’12, New York, NY USA, 2012; 153–156. ACM. Publisher Full Text

[49] 49. Ram K: Git can facilitate greater reproducibility and increased transparency in science. Source Code Biol Med. 2013; 8(1): 7. PubMed Abstract | Publisher Full Text | Free Full Text

[50] 50. Pradal C, Varoquaux G, Langtangen HP: Publishing scientific software matters. J Comput Sci. 2013; 4(5): 311–312. Publisher Full Text

Data publication consensus and controversies

Abstract

Revised Amendments from Version 1

Introduction: what does data publication mean?

Why publish data?

Types of data publication

Figure 1. To be published, datasets are typically deposited in a repository to make them available and assigned an identifier to make them citable.

Data that supplements a paper

Data as the subject of a paper

Data independent of any paper

Availability

Citability

Simple case

Deep citation

Dynamic datasets

Just-in-time identifiers

Validation

Data paper peer review

Independent data validation

Beyond data publication

Author contributions

Competing interests

Grant information

Acknowledgements

References

Comments on this article Comments (5)

Open Peer Review

Comments on this article Comments (5)

Open Peer Review

Reviewer Status

Reviewer Reports

Comments on this article

Browse by related subjects

Competing Interests Policy

Stay Updated