Committing to Data Quality Review

Amid the pressure and enthusiasm for researchers to share data, a rapidly growing number of tools and services have emerged. What do we know about the quality of these data? Why does quality matter? And who should be responsible for data quality? We believe an essential measure of data quality is the ability to engage in informed reuse, which requires that data are independently understandable (CCSDS, 2012). In practice, this means that data must undergo quality review, a process whereby data and associated files are assessed and required actions are taken to ensure files are independently understandable for informed reuse. This paper explains what we mean by data quality review, what measures can be applied to it, and how it is practiced in three domain-specific archives. We explore a selection of other data repositories in the research data ecosystem, as well as the roles of researchers, academic libraries, and scholarly journals in regard to their application of data quality measures in practice. We end with thoughts about the need to commit to data quality and who might be able to take on those tasks.
Received 14 January 2014 | Accepted 26 February 2014
Correspondence should be addressed to Limor Peer, The Institution for Social and Policy Studies, Yale University, 77 Prospect Street, P.O. Box 208209, New Haven, CT 06520-8209. Email: limor.peer@yale.edu
An earlier version of this paper was presented at the 9th International Digital Curation Conference.
The International Journal of Digital Curation is an international journal committed to scholarly excellence and dedicated to the advancement of digital curation across a wide range of sectors. The IJDC is published by the University of Edinburgh on behalf of the Digital Curation Centre. ISSN: 1746-8256. URL: http://www.ijdc.net/
Copyright rests with the authors. This work is released under a Creative Commons Attribution (UK) Licence, version 2.0. For details please see http://creativecommons.org/licenses/by/2.0/uk/
International Journal of Digital Curation 2014, Vol. 9, Iss. 1, 263–291. DOI: 10.2218/ijdc.v9i1.317 (http://dx.doi.org/10.2218/ijdc.v9i1.317)


Introduction
We are seeing a growing number of tools and services that allow researchers to share their data, their code, their research design, and their analyses, and that's a good thing. Amid this growth and enthusiasm we think it is imperative to ask: What do we know about the quality of these research products? Why does quality matter? And who should be responsible for their quality?
Judgments about the quality of data are often tied to specific goals, such as authenticity, verity, openness, transparency, and trust (Altman, 2012; Bruce and Hillman, 2013). Data quality might consist of a combination of goals, often in competition with each other or prioritized differently by stakeholders (Wang and Strong, 1996). As Kevin Ashley (2013) recently observed, some may prize the completeness of the data while others their accessibility. He urges that curation practices 'be explicit about quality metrics and curation processes in domain-independent ways.' For the purpose of our discussion, we define data quality as a set of measures that determine if data are independently understandable for informed reuse. We argue that this perspective not only complements many of the goals referenced above, it also provides a roadmap for implementing specific quality measures and practices. We then urge the scientific community to subscribe to this vision of data quality by committing to data quality review. The paper explains what we mean by data quality, what measures can be applied to it, and how they are practiced in three domain-specific archives. Next, we explore a selection of other data repositories in the research data ecosystem and ask whether there are gaps in the application of quality measures in practice and how they might be addressed. We end with thoughts about the need to commit to data quality and the review of data quality, and who might be able to take on those tasks.

Data Quality: Independently Understandable Data for Informed Reuse
We distinguish between the quality of the research and the quality of research products, including data, metadata, and code. Our concern here is with the products of the research that are made publicly available for reuse by being placed in archives or repositories. Although our perspective is on social science data, we believe that our recommendations and discussions in this paper could apply across scientific domains.
Data reuse means that the original researchers, or other researchers, may use the data at a future time without predefining what those specific uses might be. Motivations for reuse can be varied and include data verification, new analysis, re-analysis, meta-analysis, and reproducing original analysis and results. In order to enable reuse, data need to be processed, shared, and preserved in a way that ensures that they are 'independently understandable to (and usable by) the Designated Community,' and that there is enough information to be understood 'without needing the assistance of the experts who produced the information' (CCSDS, 2012). The concept of 'informed use' has also made its way into recent efforts to establish common citation principles; among the 'first principles' for data citation is the following: [...] provides useful suggestions directed specifically at research in progress, 'making your data understandable, easy to analyze, and readily available to the wider community of scientists.' Similarly, Allan Dafoe (2013) offers recommendations for producing 'good replication files for researchers engaged in statistical analysis,' including preparing all data and analysis in code, following best practices for coding, fully describing variables, and documenting every empirical claim. Goodman et al.'s list of 'Ten Simple Rules for the Care and Feeding of Scientific Data' includes adopting format and metadata standards, and keeping careful track of versions of data and code (2014). And a replication-oriented set of recommendations from Sandve et al. (2013) itemizes 'Ten Simple Rules for Reproducible Research,' including keeping track of how every result was produced, recording all results in standard formats, and providing public access to scripts, runs and results.
These guides and best practices are an expression of significant cultural changes in the research community, which is coming to terms with a more open science. They have enormous potential to change how data are prepared for publication and sharing, if they are implemented uniformly. This paper explores ways to validate that best practices have been implemented and that files can truly be considered ready for independently understandable informed reuse.

Data Quality Review
A data quality review is a process whereby data and associated files are assessed and required actions are taken to ensure files are independently understandable for informed reuse. This is an active process, involving a review of the files, the documentation, the data, and the code. We strongly believe that data quality cannot be realized without a data quality review. Below we explain what this review entails, who is positioned to carry out such a review, and what it means to commit to such a review.
Data quality requires that files are clearly identified, and that they are functional and accessible for the long term. A review of these basic aspects of data quality entails generating persistent identification (file level and study level where appropriate), creating a citation, recording file sizes and formats, creating checksums, checking that all necessary files are present, creating a study-level metadata record including file information (where appropriate), and creating non-proprietary file formats for dissemination and preservation. This also includes preservation-oriented steps, such as implementing a migration strategy for file formats, and ongoing bit monitoring.
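As an illustration of the file-level steps just listed (recording file sizes and formats, creating checksums, and checking that all necessary files are present), the following is a minimal Python sketch rather than any archive's actual tooling. The function names, the manifest layout, the use of the file extension as a stand-in for format identification, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(study_dir, required_names=()):
    """Record size, format, and checksum for every file in a study folder.

    The extension is used as a rough proxy for format (an assumption);
    `required_names` is a hypothetical list of files the reviewer expects.
    """
    study_dir = Path(study_dir)
    entries = []
    for path in sorted(p for p in study_dir.rglob("*") if p.is_file()):
        entries.append({
            "file": path.relative_to(study_dir).as_posix(),
            "bytes": path.stat().st_size,
            "format": path.suffix.lower().lstrip(".") or "unknown",
            "sha256": sha256_of(path),
        })
    present = {entry["file"] for entry in entries}
    missing = sorted(set(required_names) - present)
    return {"files": entries, "missing": missing}
```

Re-running `build_manifest` at a later date and comparing the stored checksums with the new ones is one simple way to approximate the ongoing bit monitoring mentioned above.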
Data quality also requires that documentation supporting the use of data is comprehensive enough to enable others to explore the resource fully, and detailed enough to allow someone who has not been involved in the data creation process to understand how the data were collected (Digital Preservation Coalition, 2008, p. 25). Files making up a data set (data, code, metadata, contextual materials, etc.) need to be carefully reviewed to establish that there is comprehensive descriptive information about the files and about methods and sampling, and to take corrective actions where this information is missing, including creating documentation compliant with community standards, e.g., the DDI XML specification. All other known related research products (e.g., publications, registries, grants) also need to be explicitly linked to the data.
A data quality review also involves some processing (examining and enhancing) of the actual data. These actions require performing various checks on the data, which can be both automated and manual procedures. The United Kingdom Data Archive (UKDA) provides a comprehensive list: 'double-checking coding of observations or responses and out-of-range values, checking data completeness, adding variable and value labels where appropriate, verifying random samples of the digital data against the original data, double entry of data, statistical analyses such as frequencies, means, ranges or clustering to detect errors and anomalous values, correcting errors made during transcription' (UKDA, n.d.).
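Some of the automated checks in the UKDA list (out-of-range values, data completeness, and frequencies) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the valid range, and the missing-value code are hypothetical, and the sample values are invented for the example.

```python
from collections import Counter


def check_variable(values, valid_range=None, missing_codes=()):
    """Summarise one variable: completeness, out-of-range values, frequencies."""
    missing_codes = set(missing_codes)
    # Treat None and any declared missing code as "not observed".
    observed = [v for v in values if v is not None and v not in missing_codes]
    out_of_range = []
    if valid_range is not None:
        low, high = valid_range
        out_of_range = [v for v in observed if not (low <= v <= high)]
    return {
        "n": len(values),
        "n_missing": len(values) - len(observed),
        "out_of_range": out_of_range,
        "frequencies": Counter(observed),
    }


# Invented example: respondent ages, with -9 assumed to mean "refused".
ages = [34, 27, None, 211, 45, -9, 27]
report = check_variable(ages, valid_range=(18, 99), missing_codes={-9})
```

A flagged value such as 211 would still need manual follow-up, for example verification against the original source, which is why the list above combines automated and manual procedures.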
In addition, data need to be reviewed for risk of disclosure of research subjects' identities, of sensitive data, and of private information (Lyle, Alter and Green, 2014) and potentially altered to address confidentiality or other concerns.
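One simple screening step in a disclosure review is to look for combinations of indirectly identifying variables that are shared by very few records. The Python sketch below illustrates that idea only; it is not the procedure described by Lyle, Alter and Green. The function name, the variable names, and the threshold `k` are assumptions, and the records are invented.

```python
from collections import Counter


def small_cells(records, quasi_identifiers, k=5):
    """Return combinations of quasi-identifier values shared by fewer than k records."""
    counts = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Invented records; "zip" and "occupation" stand in for indirect identifiers.
records = [
    {"zip": "06520", "occupation": "teacher"},
    {"zip": "06520", "occupation": "teacher"},
    {"zip": "06520", "occupation": "judge"},
]
risky = small_cells(records, ["zip", "occupation"], k=2)
```

Here the single record for a judge in that postal area would be flagged for a decision by the reviewer, such as recoding or suppression, ideally in consultation with the data producer.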
Similar to data files, code files should also be subject to examination and potential enhancement to provide transparency and enable future informed reuse. A data quality review requires that code is executed and checked, that an assessment is made about the purpose of the code (e.g., recoding variables, manipulating or testing data, testing hypotheses, analysis), and about whether that goal is accomplished. As Victoria Stodden, a long-time advocate for code disclosure, put it: 'A research process that uses computational tools and digital data introduces new potential sources of error: Were the methods described in the paper transcribed correctly into computer code? What were the parameters settings and input data files? How were the raw data filtered and prepared for analysis? Are the figures and tables produced by the code the same as reported in the published article? The list goes on' (Stodden, 2013a). Roger Peng, also an advocate of reproducible research, argues that articles that have passed the reproducibility review 'convey the idea that a knowledgeable individual has reviewed the code and data and was capable of producing the results claimed by the author. In cases in which questionable results are obtained, reproducibility is critical to tracking down the "bugs" of computational science' (Peng, 2011).
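Once the deposited code has been executed, one of Stodden's questions (whether the figures and tables produced by the code match those reported) can be partly checked by machine. The following Python sketch assumes the reviewer has already collected the published values and the re-computed values into two dictionaries; the function name, the statistic names, the numbers, and the tolerance are all hypothetical.

```python
import math


def compare_results(reported, reproduced, rel_tol=1e-3):
    """List statistics whose reproduced value differs from the published one.

    Returns a dict mapping each mismatching name to (published, reproduced);
    a statistic that the code did not produce at all maps to (published, None).
    """
    discrepancies = {}
    for name, published in reported.items():
        if name not in reproduced:
            discrepancies[name] = (published, None)
        elif not math.isclose(published, reproduced[name], rel_tol=rel_tol):
            discrepancies[name] = (published, reproduced[name])
    return discrepancies


# Invented values for illustration only.
reported = {"mean_turnout": 0.412, "n": 1200}
reproduced = {"mean_turnout": 0.4121, "n": 1187}
problems = compare_results(reported, reproduced)
```

A relative tolerance allows for rounding in the published tables; any remaining discrepancy, such as the differing sample size here, would be taken back to the author.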
These data review activities are essential for ensuring and enhancing data quality over time. To more clearly illustrate what is involved, we briefly report on data quality review practices of three domain-specific data archives and three general data repositories (see also Appendix 1).

Data Quality Review in Domain-Specific Data Archives
In this section we describe key data quality review practices in three disciplinary data archives: ICPSR and two small, domain-specific data archives (the Social Science Data Archive at UCLA and the ISPS Data Archive at Yale University). These are only three of numerous social science data archives, and we focus on them because we know them best.
Data quality review is embedded in data curation practices. The goal of curation is to maintain, preserve and add value to digital research data throughout its lifecycle, which reduces threat to the long-term research value of the data, minimizes the risk of its obsolescence, and enables sharing and further research (DCC, n.d.). 'Gold standard' curation processes are carried out by data archives around the globe. Their approach to data processing involves organizing, describing, cleaning, enhancing, and preserving [...] Most of these curation steps take place prior to sharing of the files.

ICPSR at the University of Michigan, Ann Arbor
ICPSR (The Inter-university Consortium for Political and Social Research) is a well-known, member-based repository and archive of data used in social science quantitative research, maintaining more than 500,000 files. An international consortium of more than 700 academic institutions and research organizations, ICPSR provides leadership and training in data access, curation, and methods of analysis for the social science research community.

Data
ICPSR accepts data from the social and behavioral sciences, including surveys, public opinion polls, census enumerations, and other files produced by government agencies, research organizations and individual scholars. Data are deposited using an online form that collects metadata such as study description and methodology, including study design, sampling, weighting, geographic details, as well as citations to publications that resulted from analyses of the data. Multiple data formats are generated from deposited files for dissemination and preservation.

Data quality review
At ICPSR, once data are submitted in a submission information package, the data pass through a 'pipeline' for processing and enhancement. Steps include reviewing data and documentation for confidentiality issues and completeness, and assessing formats of study documentation and datasets. Depending on the outcome of the initial review, ICPSR staff may, in consultation with the data producer, recode variables to address confidentiality issues, check for undocumented or out of range codes, and standardize missing values. Study documentation is enhanced to ensure that question text, labels, and response categories and value labels are associated with variables. ICPSR documents all changes to files with syntax files and all correspondence with PIs and depositors. Once ICPSR completes processing work, the data collection goes through an internal quality review to ensure the data collection is complete and self-explanatory, as well as to ensure no unintended changes were made during processing.
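Two of the processing steps named above, checking for undocumented codes and standardizing missing values, can be illustrated with a short Python sketch. It does not represent ICPSR's actual pipeline, which the paper describes as documented through syntax files; the function names, the codebook, and the codes treated as missing are assumptions for the example.

```python
def undocumented_codes(values, codebook):
    """Return codes that occur in the data but have no label in the codebook."""
    return sorted(set(values) - set(codebook))


def standardize_missing(values, missing_codes, standard=None):
    """Replace study-specific missing-value codes with one standard marker."""
    missing_codes = set(missing_codes)
    return [standard if v in missing_codes else v for v in values]


# Invented codebook and responses for one survey question.
codebook = {1: "Yes", 2: "No", -9: "Refused"}
responses = [1, 2, 2, 7, -9, 99]

unknown = undocumented_codes(responses, codebook)      # codes to query with the depositor
cleaned = standardize_missing(responses, {-9, 99})     # assumes 99 is confirmed as missing
```

In keeping with the practice of documenting all changes, a script of this kind would be retained alongside the data so that every recode can be traced and, if necessary, reversed.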

Metadata
ICPSR creates a full and complete metadata record for the data collection based on the DDI schema, and produces a DDI-compliant codebook. Any other documentation files are formatted as PDF files.

UCLA SSDA
The UCLA SSDA (The Social Science Data Archive at the University of California, Los Angeles) is a small domain-specific repository of surveys, polls, enumerations and administrative data used in social science quantitative research. SSDA serves the entire UCLA campus in providing access to publicly accessible data, and providing curation services and long term preservation of data collected by UCLA investigators.

Data
Data collected in survey research by UCLA faculty are initially described by the researcher in a detailed Deposit Agreement. Acceptance of materials is based on the Archive's ability to carry out all phases of the workflow considering the allocation of resources and fitness to collection policy. Some lifecycle curatorial processes (e.g., metadata creation compliant with DDI, distributed replication) are shared among a partnership of archives through the Data Preservation Alliance for Social Sciences (DataPASS). Other processes, such as media refreshing and file format migration, are carried out by the UCLA SSDA.

Data quality review
Data deposited at the UCLA SSDA undergo many of the same operations as those listed above for ICPSR. We have developed a workflow to address data quality at each step, from initial appraisal through ingest, metadata production, access, and preservation. The Archive employs several open source and licensed software tools to carry out these tasks, including statistical software packages, emulation software, and Colectica Designer, Colectica for Excel and Colectica Repository. SSDA works with researchers to resolve inconsistencies in the data, and any changes are made with researcher approval.

Metadata
In order to produce complete lifecycle level metadata, Colectica Designer permits us to import statistical package format files, create item level documentation, and export DDI compliant metadata and documentation. Colectica for Excel is useful as an intermediary step to document variables, values, labels and question text. We use Colectica Repository to enable reuse through an item-level search capability, and links to downloadable files.

ISPS Data Archive
The ISPS Data Archive at the Institution for Social and Policy Studies, Yale University is a small, specialized data repository, dedicated to supporting reproducible research (Peer and Green, 2012). It is meant to capture and preserve the intellectual output of a single unit within the university, to provide free and public access to research materials in line with open access principles, and to be used for reproducing research results through replication, i.e., by using author-provided data, code, codebooks, and other research materials.

Data
Data deposited in the ISPS Data Archive are produced by scholars affiliated with ISPS, with special focus on experimental design and methods. Field or other experiments (i.e., survey, natural, lab) produce original, often 'small' data of high value for researchers, educators, policy makers and students. Datasets frequently combine these data with survey or administrative data.

Data quality review
Researchers are asked to deposit all research output, including data, metadata, statistical code, codebooks, research materials and description files, and all files are subject to review before publication. The ISPS Data Archive pipeline closely follows that of ICPSR, including checking data for confidentiality and completeness, and assessing and enhancing study documentation and dataset formats. In addition, the Archive has developed curatorial practices that include verification and replication of the original research results. The ISPS Data Archive pipeline relies on some software, such as Stata, R, and Stat/Transfer, but many steps are manual. ISPS works closely with researchers when changes to the data or code files are made.

Metadata
The specialized nature of experimental data requires high quality documentation and metadata to facilitate replication of, and provide meaning to, each study. The ISPS Data Archive adheres to prevailing metadata standards, including OAI-PMH, Dublin Core, and the Data Documentation Initiative. Study-level metadata are compiled from information provided by researchers (via a deposit agreement form) and from associated materials (e.g., published article). Study-level metadata are made available on the Archive website and depend on content management functionality for search. For variable-level metadata, ISPS uses Stat/Transfer to produce and make available XML files based on DDI version 3.1 for datasets.
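To give a sense of what a study-level record expressed in Dublin Core might look like, the Python sketch below builds a small XML fragment with the standard library. It is not the ISPS Data Archive's actual export: the wrapping `record` element, the function name, and the field values are hypothetical, and only the Dublin Core element namespace is taken from the standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a flat XML record with one Dublin Core element per (name, value) pair.

    The <record> wrapper is an assumption for illustration; harvesting protocols
    such as OAI-PMH define their own container elements.
    """
    root = ET.Element("record")
    for name, value in fields:
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Invented study description.
xml_text = dublin_core_record([
    ("title", "Example Field Experiment on Voter Turnout"),
    ("creator", "Researcher, A."),
    ("date", "2014"),
    ("type", "Dataset"),
])
```

Variable-level documentation of the kind DDI supports needs a much richer structure than this flat record, which is one reason archives rely on dedicated tools to produce it.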

Data Quality Review in General Data Repositories
Next, we describe the practices of three data repositories and data sharing venues. We illustrate varying curation policies and actions, and how measures of the 'quality' of data are reflected in the goals and capabilities of these repositories. Information for the comparison was taken from the websites for these examples; it may be that this information is in flux and there may be features under review for future implementation.
Also note that some general repositories have developed as data publishing services, and in many respects they do not share the curatorial mission of domain-specific data archives, which are more closely involved in data preparation and review prior to publicly sharing data. The examples we use (e.g., Dryad, Dataverse and figshare) provide secure storage, persistent identifiers, and useful guidelines, and they support varying degrees of file inspection. However, depositors have to take on the responsibility for preparing data for sharing, with the data documentation and code properly vetted prior to submission.

Dryad
Dryad services have been set up to provide 'long-term access to its contents at no cost to researchers, educators or students, irrespective of nationality or institutional affiliation. Data files associated with any published article in the sciences or medicine, as well as software scripts and other files important to the article' may be deposited in Dryad.

Data
Dryad partners with journal publishers to make available the data behind published articles. 'Most data in the repository are associated with peer-reviewed articles, although data associated with non-peer reviewed publications from reputable academic sources, such as dissertations, are also accepted.' Data are linked both to and from the corresponding publication and, where appropriate, to and from select specialized data repositories (e.g., GenBank).

Data quality review
Dryad has a curatorial team that checks files for technical problems and 'works to enforce quality control on existing content.' Curators check for copyright statements and licenses, and identifiable human subject data. The review improves the odds that the data will be reusable. In terms of quality review, while the Dryad curators may discover problems, they do not verify that the data deposited can be reused to replicate findings in publications. Instead, '[s]ubmitters are advised to follow community data standards for the content and format of data files. Submitters should aim to provide sufficient data and descriptive information such that another researcher would be able to evaluate and reproduce the findings described in the publication.' Dryad does not limit the types of files that are put into the repository. However, a draft preservation policy describes levels of support that will be given to specific file types. The information content of the original file is never intentionally modified or processed, but copies may be made in different file formats to facilitate preservation.
'When a data file is submitted in a non-preferred format, a Dryad curator will convert it into the most appropriate preferred format. Both formats will be made available, labeled as original deposited file and transformed file for preservation.'

Metadata
Dryad has placed itself on the side of promoting good practice without actually requiring it, and relies on the scholarly community to press for completeness of

Data
Researchers are advised to 'deposit preferred or commonly used file formats in your discipline to ensure that others will be able to more easily replicate your research (and) to remove information from your datasets that must remain confidential.'

Data quality review
It is expected that data review happens prior to submitting data to a Dataverse system. The analytical tools that are part of the Dataverse software can be used to view documentation, to confirm sample size, to run summary statistics for the purposes of checking for missing information, and to review metadata in system files. However, changes to the files need to be made outside the Dataverse and resubmitted as new versions. These changes need to be made by the data depositor or another designated researcher or curator. There is no disclosure analysis for sensitive data built into Dataverse, but there is a new project to integrate the DataTags.org web application with 'Secure Dataverse.' This initiative will provide 'a standardized framework for sharing when data cannot be 100% open' (Crosas, 2013). Another new feature recently announced is an application for an integrated publishing workflow for open data. The application returns a data citation that can be inserted in publications. If the depositor chooses to release the data to the public when the article is published, the metadata and data are released at that point.

Metadata
Upon submitting data to Dataverse, a metadata record is created using a template containing fields selected by the depositor. Additional metadata is generated from statistical data files when they are submitted to Dataverse. The documentation files are compliant with the DDI 2 metadata schema. Review of the metadata record and documentation is not part of the Dataverse services, but an organization may choose to mediate deposits prior to release.

figshare
'[figshare] allows researchers to publish all of their data in a citable, searchable and sharable manner. All data is persistently stored online under the most liberal Creative Commons licence, waiving copyright where possible. Users of the site maintain full control over the management of their research whilst benefiting from global access, version control and secure backups in the cloud.'

Data
figshare is a repository for 'figures, datasets, media, papers, posters, presentations and filesets' that offers unlimited storage space for data that is made publicly available on the site. Researchers can upload any file type to figshare, and attempts are made to display all file types in a web browser. The repository's publicly available content is replicated in 'CLOCKSS's geographically and geopolitically distributed network of redundant archive nodes.' figshare hosts the supplemental data for all seven PLOS journals.

Data quality review
Curatorial review is not part of the figshare model. In an article by Ned Stafford (2013), 'Peter Murray-Rust, a chemist at the University of Cambridge, UK, says he likes the figshare model, allowing researchers to publish first and sort out the problems of formats quality, et cetera later.' In other words, data review is left to the researcher.

Metadata
Upon submitting data to figshare, a metadata record is created, based upon Dublin Core. It is not reviewed.

Other Stakeholders and Data Quality Review
So far, our examination of data quality review has focused on data repositories and archives, as increasingly that is where data can be found. Three other important stakeholders have an interest in data quality and may hold the keys to data quality review: the researchers themselves, academic libraries, and scholarly journals.

Researchers
There is agreement that researchers are best positioned to ensure the quality of the data they share for future reuse. For example, Donald J. Treiman (2009), in his book on data analysis in Stata, recommends archiving .do and .log files as a professional practice, using Stata codebook commands to document a data file, and including them with papers submitted for publishing. In a recent white paper, Ember and Hanisch (2013) [...] needs of curation, preservation, interoperability, and metadata.' Research culture and habit seem to play a significant role (Sandve et al., 2013). Could research teams themselves take on more of the curatorial tasks similar to those done by data archives? Part of the solution might be to incorporate the right training, guidance and tools that support data quality into the habits of researchers as part of the efforts to make their data independently understandable over time.
The production of metadata is often cited as one of the most significant barriers to researchers sharing data (Tenopir et al., 2011). Edwards et al. (2011) state that 'just as with data themselves, creating, handling, and managing metadata products always exacts a cost in time, energy, and attention: metadata friction.' Tools that ease the creation and capture of metadata during the research process are essential. Many of these tools would have to be domain specific, but a suite of curatorial tools for capturing contextual and descriptive information needs to be developed.
Other data quality issues arise during the research process in addition to the challenges of metadata production and capture, including inconsistent labelling, coding errors, version confusion, and lack of awareness about problems with proprietary formats that might not be usable by others and are difficult to migrate or emulate over time. These aspects of data quality would also benefit from having the right tools in the research space. Data and code review, for example, could take place in collaborative research spaces, allowing researchers to do quality review work while actively engaged in the research process. In their 'Ten Simple Rules for the Care and Feeding of Scientific Data,' Goodman et al. (2014) recommend 'publish[ing] workflow as context.' Similarly, Tyler Walters (2014) discusses how researchers are using repositories to deposit 'data generated in the first stages of a research project' and points out the work by the Sustainable Environment - Actionable Data (SEAD) 35 initiative to 'support coauthorship, shared tagging, microcitation, threaded discussions, and reviewing and commenting on data and research projects.' These collaborative research environments do not provide long-term preservation, and ideally could develop seamless integration with long-lived repositories. The advantage of considering virtual research environments as essential components for data quality is that many of the data quality review tasks are performed before files are deposited in repositories. Capturing the 'workflow' of the research team could go a long way in addressing the challenges of producing data that is independently usable, especially if guidelines are followed in regard to documentation, file formats, persistent identifiers, and the inclusion of methodology statements and documents explaining research methods or decisions about sampling.
Some examples of collaborative research platforms 36 that could capture data, metadata, and workflow needed for informed reuse include GitHub, 37 a hosted Git repository popular among open source developers. As a collaborative workflow, it allows one to 'take part in collaboration by forking projects, sending and pulling requests, and monitoring development' (CrunchBase, n.d.). Increasingly, it is used for other collaborative projects, including research. 38 The emerging Open Science Framework is 'part network of research materials, part version control system, and part collaboration software.' It has many potential uses, and is so far mostly recognized as the site of a project on reproducibility in psychology research. Zenodo enables uploading files into its system directly from Dropbox. Finally, the newly announced figshare Projects system provides collaborative spaces for private, secure file management based upon the figshare platform.
35 SEAD: http://sead-data.net
36 Note that these platforms are often referred to as 'repositories' but they are in effect locations for managing changes to code, often collaboratively and openly. This is in distinction to the standard definition of repository as place to store and maintain things (see http://www2.archivists.org/glossary/terms/r/repository). In fact, GitHub, for example, states that it does not provide archiving (see https://help.github.com/articles/can-i-archive-a-repository).
37 GitHub: https://github.com
38 Zach Jones describes GitHub's appeal: It offers a hosting environment for a complete research project (that) is reproducible and transparent by default in a more comprehensive manner than a typical journal
Data and code review could also take place after publication; once materials are released, the scientific community could review them. In the future, there may be incentives for researchers to do so, and post-publication crowd-sourced peer review may prove to be a successful model.
Services supporting these efforts include RunMyCode 43 , which enables easy dissemination of the necessary pieces required to submit the research to scrutiny by fellow scientists, and ResearchCompendia 44 , a 'web service allowing people to share the research software and data associated with a scientific publication.' 45 Other tools, such as Active Papers 46 which consists of 'a file combining datasets and programs in a single package, which also contains a detailed history of which data was produced when, by running which code, and on which machine,' may prove to contribute to data quality as well. These services and tools are important facilitators for people who wish to have their data and code validated via a peer review process.

Academic libraries
Academic institutions, and their libraries, increasingly desire to be involved in the lifecycle data management process (Burnett, 2013). Some libraries have a history of including data files in their collection policies, and they support tools for data reuse and analysis, but most have only partnered with individual local researchers to provide guidance on data acquisitions or research data management. Exceptions are data libraries and data archives that have taken on stewardship of datasets, sometimes going back to the early 1970's. Institutional repositories are making progress in taking on the role of stewardship of data outputs by their affiliated researchers. 'In libraries, we see a similar trend of assisting researchers with the creation of metadata and its ingest along mandated replication archive. With a public Git repository the data, any manipulation code, and the associated models are available at any time that a change was 'committed' to a file tracked in said Git repository. Keeping data, data manipulation code, model code, code for visualizations (tables and graphs), along with the manuscript in a Git repository on GitHub (or a similar site, such as Bitbucket) thus subsumes and extends the advantages of journal maintained replication archives (Jones, 2013 Peer,Green & Stephenson | 277 with research data into a repository for preservation and access' (Riley, 2014). In biomedical libraries, 'informationists' work with research teams to advise on 'data management and curation, including metadata standards and preservation and preparation of data for sharing' (Federer, 2013). And finally, Kimpton and Minton-Morris (2014) point out that '(t)hough some libraries are accepting deposits without intervention, most try to review data as it is added to make sure that it includes appropriate bibliographic information.' 
However, in most cases institutional repository services are not able to take on the responsibility of reviewing data beyond basic bibliographic-level information, and they rely upon data being properly prepared for sharing prior to submission. In a 2013 survey of members, the Association of Research Libraries found that none of the responding institutions offered or carried out a clearly defined data quality review; instead libraries addressed 'data management best practices (both online resources and workshops), helping researchers identify (and apply) appropriate metadata standards, research file organization and naming, data citation, data sharing and access, and data storage and backup' (Fearon, 2013, p. 14).
It has been proposed that partnerships between data archives and institutional repositories be established so that the services and expertise of high-end curatorial institutions can be shared with those who are not able to take on those tasks (Green and Gutmann, 2007). The ARL survey previously mentioned also encouraged 'collaboration within the library, across a campus, and sometimes across institutions… A common theme throughout the survey is the recognition that, in order to provide comprehensive RDM services and to support scientists throughout the data lifecycle, libraries need to collaborate, either formally or informally, with other units at the institution' (Fearon, 2013, p. 14). For example, at the UCLA SSDA, a pilot project has been initiated to study the workflow of data quality review and curation processes. One objective is to determine the possibility of developing a cooperative data curation infrastructure where some tasks are carried out by the archive and some by the institutional repository.

Scholarly journals
It is too often the case that 'the amount of real data and data description in modern publications is almost never sufficient to repeat or even statistically verify a study being presented' (Goodman et al., 2014). Some scholarly journals have started to require that data be published with articles and meet a minimal set of requirements. Others take it further: it is the policy of the American Economic Review to 'publish papers only if the data used in the analysis are clearly and precisely documented and are readily available to any researcher for purposes of replication' (AEA, 2014). However, there is no quality review of the submissions. Allan Dafoe calls for better replication practices, particularly in political science. He places responsibility on authors to provide quality replication files, but also suggests that journals encourage high standards for replication files and that they conduct a 'replication audit' that will 'evaluate the replicability and robustness of a random subset of publications from the journal' (Dafoe, 2013). A document produced at a workshop on peer review held at the British Library recently recommended that 'publishers should provide simple and, where appropriate, discipline-specific data review (technical and scientific) checklists as basic guidance for reviewers' (Tedds et al., 2013).
An example of a journal that takes an active role in data review is the Journal of Open Psychology Data (JOPD), 47 which requires open peer review of data descriptions and data deposit. Its peer review process has been developed 'to ensure that each paper correctly describes the data, and that it has been openly archived in accordance with best practices. The datasets themselves are not reviewed in terms of validity or importance.' All JOPD data papers are peer reviewed according to the following criteria: 'The methods section of the paper must provide sufficient detail that a reader can understand how the dataset was created, and would within reason be able to recreate it.' 48 The F1000 group identifies the 'complexity of the relationship between the data/article peer review conducted by our journal and the varying levels of data curation conducted by different data repositories' (Lawrence, 2013). The group provides detailed guidelines for authors on what they are expected to submit, and ensures that everything is submitted and all checklists are completed (F1000, 2014). It is not clear, however, whether the group itself reviews the data to make sure they replicate the results.
Scientific Data 49 is 'a new open-access, online-only publication for descriptions of scientifically valuable datasets.' The journal uses 'a new type of content called the Data Descriptor, which combines traditional narrative content with curated, structured descriptions of research data' including detailed methods and technical analyses supporting data quality. Data are not contained in the journal, but can be accessed via references and links to both related journal articles and data files stored at data repositories (particularly in figshare or Dryad). Professional in-house curation of the data descriptions 'helps to ensure standardized and uniformly discoverable content.' Scientific Data's aims align with the measures of quality to which we refer in this paper in these areas: 'Offer transparency in experimental methodology, observation and collection of data… Ensure all interested parties - scientists, policy makers, NGOs, companies, funders and the public - can find, access, understand and reuse the data they need.' 50

Committing to Data Quality Review
Our review of various players in the research data ecosystem reveals that data quality review is not uniformly practiced. At this time, we see little evidence that researchers, academic libraries and scholarly journals are committed to fully reviewing data to ensure quality, and we have explained the ways in which general data repositories fall short of full data quality review. The data quality review practices at data archives can go a long way toward ensuring that data are accurate, complete, and well documented, and that they are delivered in a way that maximizes their use and reuse. We acknowledge that our perspective has been focused on the social sciences, and that conversations with other disciplines are productive. Exciting developments in biology, for example, include investments by organizations such as ENCODE 51 and EMBL-EBI 52 in data quality. In addition, we acknowledge variation in practice not only among disciplines, but among individual researchers. This paper does not intend to cover all of the MIME types, varieties of research habits and workflows, or technologies and tools. Still, as evident in our research for this paper, some domain-specific data archives currently offer the most comprehensive data quality review.
48 Detailed guidelines specify that '(t)he deposited data must include a version that is in an open, nonproprietary format. The deposited data must have been labeled in such a way that a third party can make sense of it (e.g. sensible column headers, descriptions in a readme text file). The deposited data must be actionable - i.e. if a specific script or software is needed to interpret it, this should also be archived and accessible. Participant data should be sufficiently anonymized and appropriate consent forms should be signed' (JOPD, 2014).
49 Scientific Data: http://www.nature.com/scientificdata
50 Scientific Data - Principles: http://www.nature.com/scientificdata/principles
While data archives may currently be best positioned to carry out such review, we believe that reviewing the quality of the data is the responsibility of any entity that assumes responsibility over the data (Peer, 2011). We think that the stakeholders and caretakers of scientific materials, such as data and code, must share the responsibility of meeting the challenges of data quality review in order to ensure that data, documentation, and code are of the highest quality so as to be independently understandable for informed reuse, in the long term. The commitment to data quality review, however, has to involve the entire research community for two reasons.
First, domain-specific data archives have limitations. The models described at ICPSR, ISPS, and SSDA at UCLA may not be applicable to other contexts, and indeed may not always be employed by other domain-specific archives. Quality review requires significant investment in staffing, relationships and resources. The ISPS Data Archive and the UCLA SSDA staff have data management and archival skills, as well as domain and statistical expertise. Both invest in relationships with researchers and learn about their research interests and methods to facilitate communication and trust. Further, the reproducibility imperative at ISPS does not neatly apply to more generalized data, or to data that is not tied to publications. In other instances, a larger lab, greater volume of research, or simply more data will require greater resources and may make the level of review we endorse challenging. All of this requires the right combination of domain, technical and interpersonal skills as well as time, which translates into higher costs. A recent white paper on 'Sustaining Domain Repositories for Digital Data' has articulated the financial impact of the demands of data stewardship and 'aims to start a conversation with funding agencies about how secure and sustainable funding can be provided for domain repositories' (ICPSR, 2013). With regard to ICPSR, quality review practices are carried out within the context of a large consortium of paying members, and the level of review ICPSR offers has come to be expected from the 'gold standard' data archive in the United States. Still, ICPSR's staff and financial resources are finite, it is specific in selection and scope, and access is sometimes limited to members.
New initiatives, such as ICPSR's service, openICPSR (Lyle, 2013a), which facilitates data deposit into an open repository and provides a review by professional data curators who are experts in developing metadata for the social and behavioral sciences, might sidestep some of these limitations. This landscape is constantly changing; as Margaret Hedstrom (2013) points out, data archives and repositories still need to work out exactly what role they want to play in the data supply chain.
Second, in many situations it is imperative that quality review occurs outside repositories because data are being disseminated in a variety of ways. Obviously, if there are no curatorial services in place, the full burden of quality review falls to the researchers and whatever support they have available prior to publishing data, and they need to locate a trusted place to put and get data. Yet, even as Guédon and Stodden make a compelling argument that open repositories hold the key to the future credibility of the scientific enterprise, Christine Borgman reminds us that most repositories and archives follow the letter, not the spirit, of the law: They take steps to share data, but they do not review the data. 'Who certifies the data? Gives it some sort of imprimatur?' she asks (Borgman, 2013). Even when review steps are taken - for example, normalizing data to one format, such as SPSS - how can we be sure that there was no loss of precision (e.g., formats, missing values, labels)? As Stodden pointed out at Open Repositories 2013, it is not clear 'who, if anyone, checks replication pre-publication' (Stodden, 2013b). She suggested that this activity is community-dependent, often done by students or other researchers continuing a project, and that the community can adjust norms by rewarding high integrity, verifiable research.
If researchers are not familiar with the repository and archive options in their subject area, it can be difficult for them to determine what type of curatorial review of data, documentation, and code various repositories and archives really do. One way to locate repositories for sharing and storing research data by subject discipline is to search a digital repository register (e.g., OpenAIRE, Databib, and re3data). However, it is difficult to assess what curatorial practices each of the repositories offers. There is no question that these can be very useful tools, but we suggest that it would be helpful if they would include information about the level of curatorial review, if any, that has been given to data after submission. There have been efforts to develop criteria for ensuring a level of data quality as it relates to repository operations. For example, the registry re3data.org uses a quality standard icon to indicate that a repository is either 'certified or supports a repository standard.' 53 Certification of repositories commonly focuses upon the important aspects of a repository's implementation, sustainability, and technical adequacy. 54 However, we find that repository certification metrics do not include explicit information about how much, and what types, of data quality review are done by the archive or repository itself. The Data Seal of Approval 55 differs from the other certification methods in that it has clear requirements that are assigned specifically to the data producer. The data producer is required to deposit the data with sufficient information for others to assess the quality of the data and compliance with disciplinary and ethical norms, provide the data in formats recommended by the data repository, and provide the data together with the metadata requested by the data repository. There are no explicit requirements for the repository to complete a data review or undergo the curatorial actions we describe in this paper.
We strongly believe that, as more entities take on various roles in the review, curation, or dissemination of data - especially entities removed from the original data producers - strong controls should be put in place to ensure that there is no potential for unintentional (and even intentional?) changes that can significantly alter the data. For example, cleaning files could unknowingly reduce decimal precision due to imprecise format specifications to revising codes. As Lyle (2013b) [...] understandable informed use of the data, and in no way jeopardize any other aspects of data quality (e.g., accuracy, authenticity, verity, etc.). A serious conversation about ways to ensure 'zero harm' to data and code needs to take place in the scientific community.
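One control of this kind can be automated. The Python sketch below is an illustration only, not a procedure used by any archive discussed in this paper. It assumes that the file exists as CSV text both before and after processing, with the same rows in the same order, and it flags any numeric cell that has fewer decimal places after processing. The function names `decimal_places` and `find_precision_loss` are hypothetical.

```python
import csv
import io
from decimal import Decimal, InvalidOperation


def decimal_places(text):
    """Return the number of digits after the decimal point, or None if not numeric."""
    if not isinstance(text, str):
        return None
    try:
        value = Decimal(text.strip())
    except InvalidOperation:
        return None
    if not value.is_finite():
        return None
    # A negative exponent counts the recorded decimal digits, trailing zeros included.
    return max(-value.as_tuple().exponent, 0)


def find_precision_loss(original_csv, processed_csv):
    """Compare two CSV texts cell by cell; list cells that lost decimal places."""
    original_rows = list(csv.DictReader(io.StringIO(original_csv)))
    processed_rows = list(csv.DictReader(io.StringIO(processed_csv)))
    if len(original_rows) != len(processed_rows):
        raise ValueError("row counts differ; files cannot be compared cell by cell")
    problems = []
    for row_number, (before, after) in enumerate(
        zip(original_rows, processed_rows), start=1
    ):
        for column, old_text in before.items():
            new_text = after.get(column)
            old_places = decimal_places(old_text)
            new_places = decimal_places(new_text)
            # Flag numeric cells that vanished, became non-numeric, or were shortened.
            if old_places is not None and (
                new_places is None or new_places < old_places
            ):
                problems.append((row_number, column, old_text, new_text))
    return problems


if __name__ == "__main__":
    original = "id,income\n1,52340.125\n2,48000.50\n"
    processed = "id,income\n1,52340.13\n2,48000.50\n"
    for problem in find_precision_loss(original, processed):
        print(problem)
```

A check like this covers only one narrow kind of harm. Changes to labels, missing-value codes, or variable formats would need separate comparisons, and a reported difference still requires a curator's judgment about whether the change was intended.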
In spite of these challenges, we believe that stewardship of data requires this type of quality review because it leads to better science (Peer, 2013). Usable '[d]ata-rich research environments can promote new fields of study, improve understanding of complex systems, such as the Earth's climate, and lead to new products such as pharmaceutical drugs' (Wallis, Rolando and Borgman, 2013). This endeavor requires more and better tools, as well as smart, effective partnership among the various stakeholders. 'The social nature of science and the network of interested stakeholders in the future of access to scientific data,' says Gold (2007), 'make it essential to develop social and policy tools to support this future.' As Jones et al. (2006) observe, the key is 'to find the balance of responsibility for documenting data between individual researchers and trained data stewards who have advanced expertise with appropriate metadata standards and technologies.' And, the National Digital Stewardship Alliance recently urged the scientific community to 'work together to raise the profile of digital preservation and campaign for more resources and higher priority given to digital preservation, and to highlight the importance of digital curation and the real costs of ensuring long term access' (NDSA, 2014).

Conclusion
Independently understandable, informed reuse of data in the long term is in jeopardy: data are being lost at an alarming rate (Gibney and Van Noorden, 2013; Vines et al., 2014). At the same time, more data than ever are being released publicly. Unfortunately, in both scenarios, there is still significant misunderstanding about what is necessary to archive data for long-term usability. Digital preservation practices go beyond storing and managing the bits (Owens, 2012). We can think of a continuum of data curation that progresses from a basic level where data are accepted 'as is' for the purpose of storage and discovery, to a higher level of curation which includes processing for preservation, improved usability, and compliance, to an even higher level of curation which also undertakes the verification of published results.
Data archives have traditionally taken on 'gold standards' of data processing as described above, but repositories vary widely in the curatorial processing they offer for incoming data, and in the preservation services they can provide over the long term. Researchers sometimes believe that assigning a persistent link, e.g., a DOI, and maintaining redundant backups will be enough to make data accessible and understandable for decades. Certainly repository systems offer more secure homes for research data than researchers may have had in the past, but we suggest that threats remain. Among the pitfalls of this approach is the lack of quality review when data are submitted to digital repositories. Those researchers who follow the guidelines we describe have better odds that their data will be independently usable over time, but what if those guidelines have not been followed? Wouldn't it be better to catch and correct the problems with formats, metadata, missing data, mismatches between data and code, disclosure review, etc. when the data are submitted and reviewed by a research team, a repository, or an archive, rather than waiting for those problems to prevent long term understandability and use of the data? The lack of quality review as a curatorial practice can have severe consequences and can contribute to the loss of data over time.
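Some of the problems listed above can be detected mechanically at the point of submission. The Python sketch below is an illustration under stated assumptions, not a description of any repository's workflow. It assumes that the data arrive as CSV text and that the codebook is available as a mapping from variable name to description. The function name `review_deposit` and the default missing-value codes are hypothetical.

```python
import csv
import io


def review_deposit(data_csv, codebook, missing_codes=("", "NA")):
    """Return a list of findings for a human reviewer to follow up on."""
    reader = csv.DictReader(io.StringIO(data_csv))
    columns = reader.fieldnames or []
    findings = []

    # Every variable in the data should be described in the codebook.
    for column in columns:
        if column not in codebook:
            findings.append(f"variable '{column}' is not described in the codebook")

    # Every codebook entry should correspond to a variable in the data.
    for name in codebook:
        if name not in columns:
            findings.append(
                f"codebook entry '{name}' has no matching variable in the data"
            )

    # Count cells that are empty or carry one of the assumed missing-value codes.
    missing_counts = {column: 0 for column in columns}
    for row in reader:
        for column in columns:
            value = row.get(column)
            if value is None or value.strip() in missing_codes:
                missing_counts[column] += 1
    for column, count in missing_counts.items():
        if count:
            findings.append(f"variable '{column}' has {count} missing value(s)")

    return findings


if __name__ == "__main__":
    data = "id,age,q1\n1,34,2\n2,,NA\n"
    codebook = {
        "id": "Respondent identifier",
        "age": "Age in years",
        "party": "Party identification",
    }
    for finding in review_deposit(data, codebook):
        print(finding)
```

Checks of this kind only surface candidates for review. Deciding whether an undocumented variable or a missing value is an error still depends on the domain knowledge of the research team or the curator.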
A conversation about reviewing the data we put in repositories is a sign of maturity in the scholarly community, across all scientific domains, and recognition that simply sharing data is necessary, but not sufficient. We call on the community as a whole to commit to data review by practicing it and by demanding to know when it has been done. Our hope is that it becomes a cornerstone in standard approaches to data curation and will become common practice once appropriate tools and frameworks are in place.

Appendix 1: Quality Measures in Practice

Notes
All changes to data, documentation, and code are reviewed and recorded during processing. Preservation actions are more involved than the two discussed here.