Using Data Management Plans to Explore Variability in Research Data Management Practices Across Domains

This paper describes an investigation into how researchers in different fields are interpreting and responding to the U.S. National Science Foundation’s data management plan (DMP) requirement. As documents written by the researchers themselves, DMPs can provide insight into researchers’ understanding of the potential value of their data to others; the environment in which their data are developed and prepared; and their willingness and ability to ensure the data are available to others now and in the long term. With support from the Institute of Museum and Library Services, the authors conducted a content analysis of DMPs generated at their respective institutions using a shared rubric. By developing and testing a rubric designed to understand and evaluate the content of DMPs, the authors intend to develop a more complete understanding, at a larger scale, of how researchers plan for managing, sharing, and archiving their data.
Accepted 24 February 2016. Correspondence should be addressed to Susan Wells Parham, Georgia Institute of Technology Library, 266 4th St NW, Atlanta, GA 30332. Email: susan.parham@gatech.edu
An earlier version of this paper was presented at the 11th International Digital Curation Conference.
The International Journal of Digital Curation is an international journal committed to scholarly excellence and dedicated to the advancement of digital curation across a wide range of sectors. The IJDC is published by the University of Edinburgh on behalf of the Digital Curation Centre. ISSN: 1746-8256. URL: http://www.ijdc.net/
Copyright rests with the authors. This work is released under a Creative Commons Attribution (UK) Licence, version 2.0. For details please see http://creativecommons.org/licenses/by/2.0/uk/
International Journal of Digital Curation 2016, Vol. 11, Iss. 1, 53–67. DOI: 10.2218/ijdc.v11i1.423


Introduction
In growing recognition of the importance of research datasets as standalone scholarly products of research, funding agencies have introduced requirements for the inclusion of a data management plan (DMP) with proposals. The primary purpose of a DMP is to describe the data resulting from a project, and how they will be made publicly accessible for reuse. In response to the growing need among researchers for support in addressing research data management (RDM) mandates, many research and academic libraries are allocating significant thought, effort, and capital toward developing RDM services.
The Data management plan As Research Tool (DART) project has as its premise that data management plans can be a rich source of information about researchers' data management knowledge, capabilities, practices, and needs. By using these plans as a window into research practices, we can discern variability in RDM habits across broad research domains, as well as the extent to which university and library resources are being consigned in the plans. Such investigations will help inform efforts to develop or improve RDM services and infrastructure. In this paper we show how librarians and other data support professionals can use DMPs as a tool for exploring local RDM behavior and identifying data management services needs. We also discuss our analysis of 500 data management plans written by researchers at five U.S. research institutions.

Background
The DART team conducted an analysis of 500 DMPs from awarded proposals submitted to the U.S. National Science Foundation (NSF), using an analytic rubric that we developed and tested for this task (Whitmire, Rolando and Westra, 2015). Our work builds upon previous research conducted on the analysis of NSF DMPs. Articles from librarians at Cornell University (Steinhart, Chen, Arguillas, Dietrich and Kramer, 2012), the University of Illinois at Urbana-Champaign (Mischo, Schlembach and O'Donnell, 2014), the University of Minnesota (Bishoff and Johnston, 2015), and Georgia Institute of Technology (Parham and Doty, 2012) have all noted that researchers have difficulty understanding and responding to the requirements. Many research libraries have set up a DMP review service as a means to support researchers (Dietrich, Adamus, Miner and Steinhart, 2012), and as a means of training librarians about the data needs of researchers (Davis and Cross, 2015).
In contrast to previous work, our research utilized a dataset of plans from multiple U.S. institutions, affording us a rich source of content for comparisons across the seven directorates that comprise the NSF. By investigating and comparing DMPs written for different directorates, we begin to see how researchers in different fields understand and interpret the NSF data management requirements. We also get an idea of how well equipped they are to meet these requirements, as well as a glimpse into how they are currently managing their research data.
In our findings, we focus on results from six of the seven NSF directorates: Biology (BIO); Computer and Information Science and Engineering (CISE); Engineering (ENG); Geosciences (GEO); Math and Physical Sciences (MPS); and Social, Behavioral, and Economic Sciences (SBE). We discuss variability in researcher data

Approach
To facilitate consistent review across the project team, we developed an analytic rubric for the assessment of NSF data management plans (Rolando, Carlson, Hswe, Parham, Westra and Whitmire, 2015; Whitmire et al., 2015). The rubric contains assessment criteria across three performance levels for both NSF-wide and directorate-specific DMP content requirements (Whitmire, Carlson, Hswe, Parham and Westra, 2016a). The rubric was tested and improved through two rounds of individual reviews of the same set of DMPs and subsequent assessment of inter-rater reliability (IRR). We used intraclass correlation (ICC) to assess IRR (McGraw and Wong, 1996; Shrout and Fleiss, 1979; R package 'irr'), and, through improvements in the rubric, were able to achieve a median ICC score of 0.76, which falls within the range considered excellent agreement between raters. We anticipate that the analytic rubric we developed to facilitate this work can be used by others to conduct their own RDM assessments.
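As a rough illustration of the inter-rater reliability check described above, the following Python sketch computes one common form of the intraclass correlation: ICC(2,1) in the notation of Shrout and Fleiss (two-way random effects, absolute agreement, single rater). The choice of this ICC form, the function name icc_2_1, and the example scores are assumptions made for illustration only. The analysis reported here was carried out with the R package 'irr', and the exact model settings are not stated in this section.

```python
from statistics import mean


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per rated item (for example one
    DMP criterion), each row holding one score per rater.
    """
    n = len(ratings)       # number of rated items
    k = len(ratings[0])    # number of raters
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical scores on a three-level scale (0, 1, 2); rows are rated
# items and columns are raters. These are not values from the DART data.
example_scores = [
    [2, 2, 1],
    [0, 0, 0],
    [1, 2, 1],
    [2, 2, 2],
    [0, 1, 0],
]
print(round(icc_2_1(example_scores), 2))
```

Computing the statistic once per rubric criterion and then taking the median across criteria would be one way to arrive at a summary figure of the kind reported above, although that workflow is likewise an assumption.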
Each team member assessed a random sample of 100 DMPs from their respective institution to create a dataset with 500 total DMP reviews (Whitmire, Carlson, Westra, Hswe and Parham, 2016b). This approach avoided potential rater bias for a given directorate, and distributed the work of reviewing plans evenly across the team. The resulting set of DMPs reflected the research strengths of each institution, but in aggregate also provided a sample distribution among the directorates that is similar to the national NSF awards. In addition to recording performance level ratings for the assessment criteria, we also gathered supplementary information, such as how researchers said they would share and archive their data, whether or not they mentioned the institutional repository or other university resources, if they mentioned a specific metadata standard, and so on. We translated the rubric into a Qualtrics survey to facilitate data collection and co-location, and to standardize scoring and collection of supplementary information. Some of the randomly selected plans stated that the research would not produce data and that, therefore, no DMP was needed. This analysis is based on the proposals that included a DMP (465 of the 500 selected). We did not drill down to the division level to ascertain differences that might be found there.
There are some inherent limitations in conducting a content analysis of DMPs. The information presented in a DMP regarding the data can be fairly complex, written for experts in the field. Without disciplinary knowledge, it may be difficult to fully understand the plan. The DMP is only one component of a grant proposal, and may make reference to other parts of the application that are unavailable. In addition, researchers write proposals to win funding, and therefore may be more motivated to write a DMP that appeals to the stated goals of the agency, rather than to provide an accurate description of their practices and intentions.

Results and Discussion
The distribution of DMPs selected for this study across the NSF directorates closely follows the overall funded proposal distribution (see Table 1). This indicates that our selection of DMPs was suitably random, and that findings may be generalized. Of the 500 DMPs in our sample, 465 (93%) stated that the proposed project would produce data (Table 2), and therefore described a plan. The numbers and percentages in the rest of the paper refer to this subset of 465 plans. NSF guidelines across all directorates stipulate that the DMP must describe the data to be captured, created or collected. We evaluated how well researchers described their data, and found variability between the directorates (Table 3). Proposals submitted to BIO and SBE had data management plans that better defined the types of data to be produced during research, while those submitted to CISE were significantly less complete. Among all directorates, 5.8% to 15.3% of DMPs (or 9.5% overall) failed to describe the data that would be produced in any way. Throughout our review, we note that the DMPs submitted to BIO do a consistently better job of meeting rubric criteria.
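The comparison between the sample's directorate distribution and the NSF-wide distribution can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names percent_shares and largest_gap are hypothetical, and the counts below are placeholders, not the values reported in Table 1.

```python
def percent_shares(counts):
    """Convert raw counts per directorate into percentages of the total."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


def largest_gap(sample_counts, reference_counts):
    """Return the directorate whose share differs most between the two
    distributions, with the absolute gap in percentage points."""
    sample = percent_shares(sample_counts)
    reference = percent_shares(reference_counts)
    gaps = {name: abs(sample[name] - reference[name]) for name in reference}
    worst = max(gaps, key=gaps.get)
    return worst, gaps[worst]


# Placeholder counts for illustration only; see Table 1 for the real data.
sample_counts = {"BIO": 60, "CISE": 70, "ENG": 110, "GEO": 85, "MPS": 90, "SBE": 50}
reference_counts = {"BIO": 1200, "CISE": 1500, "ENG": 2300, "GEO": 1500, "MPS": 2200, "SBE": 900}

directorate, gap = largest_gap(sample_counts, reference_counts)
print(f"Largest difference: {directorate} ({gap:.1f} percentage points)")

# The share of sampled plans that described a data plan, from the text: 465 of 500.
print(f"{100 * 465 / 500:.0f}% of sampled DMPs stated the project would produce data")
```

A small largest gap is consistent with the claim that the sample tracks the national distribution, though a formal goodness-of-fit test would be a stronger check than this simple comparison.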

Data Sharing
Of the DMPs reviewed, only 2.6% included statements that the data would not be shared, while 7.5% failed to specify how the data would be shared (Figure 1). Options for data sharing were not mutually exclusive, as many of the DMPs noted several different avenues for sharing. The most popular means, observed in 36.1% of DMPs across all directorates, was through journals (tables, supplements, etc.). However, as Figure 1 and the following directorate summaries show, the relative percentages for data sharing methods varied considerably across domains. We discuss selected findings below.
An overwhelming proportion of BIO DMPs (75%) indicated that data centers or repositories would serve as the key platforms for sharing. This percentage is much higher than what was observed in DMPs overall, suggesting that researchers in biology fields are not only more likely to deposit data into repositories and data centers, but are also more familiar with this dissemination approach. The BIO DMP preference for data centers and repositories may also help explain why types of data in these DMPs are more thoroughly described than in DMPs overall. In addition, the named data centers (most frequently GenBank, Dryad, and the Sequence Read Archive) are fairly established centers and repositories, suggesting that the propensity for sharing, or at least the intention to share, is common in the fields associated with biology. No BIO DMPs stated that researchers were not planning to share their data, nor did any fail to specify how data would be shared. Given how thoroughly BIO DMPs described the data for their proposed projects, it's probably not surprising that these DMPs also stood out on the question of how data would be shared.
For CISE DMPs, personal websites maintained by project personnel were the top venue proposed for sharing data (43.9%). This is a much higher percentage than what was found in DMPs overall. Second to this option were sharing them on request and sharing them through "other method" (both 30.3%), such as through GitHub, SVN, or Bitbucket repositories, systems that typically track versions of code as part of developing and maintaining software applications. The "other method" response was also higher in CISE DMPs than in DMPs overall (21.7%). A smaller proportion of CISE DMPs (13.6%) indicated that they would be using a subject-based data repository.
DMPs in the ENG directorate followed the trend of DMPs overall: publication of results in a journal was the leading means for sharing data (45.3%), followed by sharing on request (37.7%). ENG DMPs also displayed a preference for data sharing via conference presentations and proceedings, a venue similar to journal publications. Only 8.5% of ENG proposals indicated a data center or repository compared to 34.4% overall, although roughly 20% specified sharing via an institutional repository.
For GEO, 66.3% of DMPs favored sharing data via specifically named data centers, repositories, or data-sharing platforms, almost double the percentage reflected across all of the DMPs (34%), and second only to the BIO directorate for this form of dissemination. The centers and repositories mentioned ranged from those associated with supporting journal articles (e.g., Dryad), to national data centers (e.g., the National Geophysical Data Center, the NSF-sponsored Biological and Chemical Oceanography Data Management Office, or the National Institutes of Health's GenBank). Many small, boutique databases were indicated, as well as nationally federated systems like DataONE. Like the DMPs for BIO, GEO DMPs showed little preference for institutional repositories, which were only mentioned in 6% of plans.
By far the most popular means of sharing MPS data was via supplemental information for an article (54.1%). This number is much higher than that of DMPs overall (36.1%). Sharing data via supplemental information is an approach common to chemists, and although it is often in the form of a PDF rather than actual data files, this method is supported by NSF Chemistry Division DMP guidance. Another popular choice was institutional repositories (28.2%), higher than the 16.6% of DMPs overall.
As with DMPs to the BIO and GEO directorates, SBE DMPs also showed a preference for data repositories (42%), such as the Inter-university Consortium for Political and Social Research (ICPSR) and the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI). Institutional repositories marked another popular method for sharing SBE data (20%); SBE is second only to MPS in this regard. SBE DMPs also expressed a relative disinclination for data sharing: 10% stated they would not be sharing data, a proportion much higher than in all the other directorates, perhaps because of the preponderance in SBE of projects involving human subjects or using restricted-access data.
The preference in BIO, GEO, and SBE to share data through domain-specific repositories suggests that these researchers have more familiarity with such a dissemination practice than scientists in other domains and that infrastructure (repositories) exists to host data in these fields. For CISE, researchers in computer science fields may be more inclined to develop a solution locally, rather than use an institutional or national solution. The reviews also revealed that ENG and MPS plans showed a preference for sharing data via journal publications and supplemental information, and when requested by other researchers, two of the most conventional ways of disseminating research results. Table 4 shows a comparison of how well researchers described their plans for sharing data, by directorate.
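Because the sharing options were not mutually exclusive, the percentages reported in this section are computed per method over all plans and need not sum to 100. A minimal Python sketch of that tally follows; the function name method_percentages and the coded responses are hypothetical and are not drawn from the study dataset.

```python
from collections import Counter


def method_percentages(plans):
    """Percentage of plans that mention each sharing method.

    Each plan is a collection of methods, so a single plan can count
    toward several methods at once.
    """
    counts = Counter(method for methods in plans for method in set(methods))
    return {method: 100 * n / len(plans) for method, n in counts.items()}


# Hypothetical coded responses for four plans (illustration only).
coded_plans = [
    {"journal", "on request"},
    {"repository"},
    {"journal", "personal website"},
    {"repository", "journal"},
]

for method, share in sorted(method_percentages(coded_plans).items()):
    print(f"{method}: {share:.1f}%")
```

Running the same tally separately on the plans for each directorate would give a per-directorate breakdown of the kind summarized in Figure 1.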

Data Discovery and Reuse

Metadata
In order for data to be reused, it must be discoverable, accessible, well documented, and in a format that facilitates reuse (Van Tuyl and Whitmire, 2016). Metadata and other types of documentation (e.g., readme files or data dictionaries) facilitate discovery and reuse. Most DMPs that we reviewed did not specify a metadata standard (85.1%; Figure 2), but again, inter-directorate differences reveal key behavioral variability between research domains. Unsurprisingly, BIO DMPs have the highest percentage of plans (38.5%) that mention a metadata standard (Ecological Metadata Language and Darwin Core most often), while only 6.1% of CISE DMPs specify a standard. Also note that the greater percentage of BIO DMPs (38.5%) provided complete, detailed information about metadata, compared with the CISE low of 9.1%. In many cases, DMPs that did not name a specific metadata standard were still assessed as having fully addressed the topic (Figure 2). Slightly more than fifteen percent of MPS plans mentioned a metadata standard, while 18.8% received a "fully addressed" rating. While initially seeming erroneous, this result makes sense in light of the fact that metadata standards do not exist for all data types or domains. In some cases, a DMP did not have to mention a specific metadata standard in order to receive a "fully addressed" rating. For example, in lieu of listing standards, several CISE DMPs mentioned creating locally relevant metadata fields and/or readme files. GEO DMPs that did not mention a particular standard (because one doesn't exist for that data type) often mentioned the creation of a readme file. In addition, some plans described important characteristics of the data that they would capture, such as equipment calibration settings or corrections, without defining these fields as metadata. A number of MPS plans discussed documentation for experiments, such as would be recorded in lab notebooks, which reflects common practices in chemistry, for example.
Many DMPs simply stated that they would create "metadata" or "documentation" without providing detail or explanation, a phenomenon noted across all of the directorates. This may indicate that researchers have a limited understanding of what metadata is, and its role in making their data discoverable and useable by external audiences. While the proportion of plans that did not address metadata at all is a discouraging 57%, we saw many researchers making an honest effort at addressing what can be a difficult topic.

Policies for Reuse, Redistribution, and the Creation of Derivatives
NSF guidelines state that DMPs should include statements on policies for data reuse, redistribution and derivative creation. The collection of DMPs we reviewed lacked detail about these policies: 56.3% did not mention reuse policies. Across all directorates, the policy statements made on reuse, redistribution, or the creation of derivatives tended to be very permissive, but also vague, implying that these issues had not been given much consideration. For example, one DMP stated, "there are no limitations on any data or samples generated during the scope of this research", and another claimed, "no issues regarding… intellectual property are foreseen for this work." Other statements indicate that the researchers did not understand what was being asked of them: "We are constructing an original dataset, so there are no re-use or redistribution issues to be addressed." A desire on the part of the researcher for others to provide some form of attribution or to cite the data appropriately was also observed. These statements were also rather vague, and typically did not define the method of attribution or citation standard.
Several DMPs referred to their institution's policies or technology transfer office; however, few if any details were provided as to what these policies actually permit for using, redistributing, or creating derivatives of the data. As such, these statements conveyed a sense that researchers felt the need to protect themselves against possible contradictions between what the funding agency required, and what their institutions, as presumed owners of the data, would permit. In addition, in some cases the results of the research were anticipated to have commercial applications, requiring the researchers to work with their institution's technology transfer or similar office before considering the release and reuse of their data.
Finally, a noticeable minority of DMPs specified a particular license for governing the reuse, redistribution and creation of derivatives from their data. The licenses most commonly assigned to data sets were some form of Creative Commons license (not always specified), a GNU General Public License, or a BSD license. In contrast, a few researchers referred to having or developing data use agreements of their own to address these issues.
Much like the results for metadata, these numbers as a whole suggest a need for improved understanding among researchers of what the concepts of reuse and redistribution mean, and how to address them in a DMP through a stated policy or guideline. There is likely an assumption that data reuse and redistribution are natural byproducts of data sharing, and are addressed by repository policies. It may also be the case that without well-publicized instances of data reuse, researchers are less aware of this prospect, and thus provide less detail in DMPs.

Data Curation Infrastructure
As part of our analysis, we also documented how often researchers mentioned campus infrastructure and library services. As shown in Table 8, nearly 21% of the reviewed DMPs mentioned using library services, from 3.7% in plans for GEO proposals, to 32.9% for MPS funding proposals. Most of these references were to library-run institutional repositories, either as a means of sharing or archiving data, or as a place to deposit articles and other products of research. Some plans did mention library consultation services for data management plans and metadata standards.
References to campus infrastructure were a bit more varied, and included the use of campus storage and backup services, and department or campus web servers. We found that many researchers seem to conflate archiving research data with simply using campus storage services. We also found that 29% of the DMPs did not specify how they were planning to archive their data, in contrast to those that did not specify plans for sharing data (8%) (Figure 3). Twenty-eight percent of DMPs specified a data center or repository, a number heavily skewed by DMPs submitted to the BIO and GEO directorates. Centralized storage servers (23%) and PC/external storage (15%) were also notably present, and were particularly popular in ENG and MPS DMPs. This may indicate a lack of awareness of repository options, or it may indicate a reluctance to surrender local control over the data set to a third party for curation purposes. Furthermore, as archiving is often interpreted simply as long-term storage, it may not be clear why a data center is needed or what value its preservation services would have for the researcher. We also noted that in some DMPs a repository was mentioned for sharing, but not specifically for archiving data. As noted above, there may be an assumption that all repositories provide both access and preservation services.

Conclusion
The findings of this project indicate that the application of an analytic rubric to DMPs can yield valuable information. Though we have several caveats, we conclude that reviewing DMPs as an authentic artifact of researchers' intentions can present a useful snapshot of current data practices, uncover institutional challenges for compliance, and inform the development or augmentation of useful data services. As noted throughout the paper, we found a number of data management concepts that appear to be unclear to researchers across disciplines. This shortfall demonstrates the importance of building strong support systems to ensure that researchers respond adequately to funding agency requirements, and to ensure that they receive the full benefits that good data management, sharing, and preservation afford.
The process of generating a rubric that is aligned with both general and directorate-level NSF guidance highlighted the variability in guidance from directorate to directorate. The NSF expressly relies on research disciplines ("communities of practice") to promulgate and apply their own data management practices and infrastructure in the review of DMPs for funding decisions (National Science Foundation, 2015). This is a logical course of action, and one that should be supported. However, directorate guidelines would benefit from a shared understanding of concepts and terminology, expressed as clearly and unambiguously as possible. Common definitions and consistent approaches to accountability could help improve the quality of the DMPs and post-award compliance.
In addition to the data management concepts that researchers across domains did not understand or address (such as policies regarding data access and reuse), we also found that researchers in certain domains addressed some concepts more fully. Areas of divergence were particularly noticeable in terms of data description and sharing. We found that DMPs submitted to the BIO directorate provided more detailed descriptions of their data, of how they would share data, and of metadata (including naming a metadata standard), and more often named specific domain repositories for sharing data. This trend leads us to speculate about the relationship between the existence of national, domain-specific repositories and the proclivity of researchers in that domain to use them effectively and in large numbers. The proportion of researchers in BIO who "fully addressed" the topic of metadata was nearly twice that of any other directorate. In many ways, the process of sharing data obligates the researcher to think about data formats, and about creating useful documentation. Can other research disciplines benefit from the creation of strong disciplinary repositories, with their attendant policies and standards?
As the NSF and other agencies rely on communities of practice to develop appropriate responses to the challenges in managing, sharing, and archiving data, mechanisms for communication across communities are also needed. Researchers in some fields, such as ecology, have developed support structures and processes, including data centers, publications on best practices, metadata standards and common tools, which other fields might consider in shaping their own efforts. There is an opportunity to bring stakeholders together from across mature and emerging domains to move toward shared best practices or infrastructure. Cross-disciplinary and open membership organizations, such as the Research Data Alliance (RDA), are increasingly important conduits in leveraging efforts from one field to inform thinking and possible approaches for others.
Observations made in our study and others regarding the current quality of DMPs do not appear especially promising, an observation supported by work that shows that the presence of a data management plan does not, in most cases, lead to effective sharing of research data (Van Tuyl and Whitmire, 2016). If research institutions are committed to supporting researchers in meeting this requirement, we must acknowledge that crafting an authentic DMP will require researchers to re-conceptualize how they conceive and carry out their research on a fundamental level. The potential impact to the cultures of practice for many fields is likely to need time to fully take root and play out, and will require the support not only of disciplinary groups and funding agencies, but also of research institutions.
At the institutional level, librarians, IT personnel, grant administrators, and others have stepped up to provide assistance to researchers in responding to the DMP requirement, but clearly more collaboration is required. In addition to increased training on data management topics such as metadata and its applications, formats suited for sharing data, and documentation for data reuse, researchers clearly need guidance on data licensing options and intellectual property policies. Expertise in these areas resides in a variety of groups within one institution, so successful training programs and other support require partnerships that value and prioritize these efforts. Forging alliances and partnerships between libraries, IT centers, grant administrators, and others should become a priority to build data management capacity and address local needs.

Figure 1. Methods of sharing research data as described in NSF data management plans. Numbers are percentages (shaded by color according to the scale).

Figure 2. Aspects of how well metadata is addressed in DMPs. In the first three columns, the DMP performance level ratings for 465 DMPs are shown (in %). The most commonly named metadata standards and the percent of DMPs that name a specific metadata standard are also shown. Percentages are shaded by color according to the scale at right.

Figure 3. Methods of archiving research data as described in NSF data management plans. Numbers are percentages (shaded by color according to the scale).

Table 1. Number and percentage of proposals funded for the National Science Foundation (NSF) as a whole (FY 2014) and for proposals reviewed for this paper.

Table 2. "Yes" responses to the question: "Will the project produce data?", by number and percent of DMPs from NSF-wide or within each directorate.

Table 3. DMP performance level ratings for the criterion: "Describes what types of data will be captured,

Table 4. DMP performance level ratings for the criterion: "Describes how the data will be made publicly available." Numbers are percentages, and are shown across all DMPs and by directorate.

Table 7. DMP performance level ratings for the criterion: "Describes policies or provisions for building off of the data, such as through the creation of derivatives." Numbers are percentages, and are shown across all DMPs and by directorate.

Table 8. The percentages of DMPs that mentioned the use of library services or campus-wide resources or services, across all DMPs and by directorate.