Introduction

Achieving the open research data, software and code recommended by UNESCO () requires a diligent and transparent approach to their management. A Data Management Plan (DMP)—a structured, formal document describing the roles, responsibilities, and related activities for data management during and after research—is not a new concept (e.g., ; ). This is particularly true for funded research. By 2011, the US National Science Foundation required a peer-reviewed DMP with every grant application (). Smale et al. () reported that in 2017, many funding agencies required DMPs within the initial funding application; for example, 86% of UK Research Councils, 37% of Australian institutions, and 63% of US funding bodies. The practice has now grown such that the majority of funding agencies require DMPs. Funders, and increasingly journals, are also requiring that data supporting a project are published alongside an article, with a proper citation in the references section of the paper, and that data are Findable, Accessible, Interoperable and Reusable (FAIR, ). The FAIR Data Principles serve as a guiding framework to periodically assess data over a project’s life cycle ().

Michener () outlined ten necessary components of a DMP, covering anticipated data, quality assessment, metadata formats, storage and preservation methods, licensing and dissemination, responsible parties, and ethical and budgetary considerations. These components have provided the basis for many DMPs (e.g., ; ; ), and promoted awareness that not only data but also other digital objects should be managed in the same way (). Despite these mandates, guidelines, and requirements, researchers commonly submit a plan with a proposal but seldom revisit it (). When even a limited level of good data management is maintained throughout a project, the benefits become evident (; ).

The Belmont Forum

The Belmont Forum is a partnership of international funding organisations, national science councils, and regional consortia committed to the advancement of global environmental science and accelerated delivery of data-driven environmental research (). In 2013, the Belmont Forum initiated a multiphased ‘e-Infrastructure and Data Management Collaborative Research Action’ (CRA) (). The Science-Driven e-Infrastructures Innovation CRA intimately links research thinking and technological innovation to accelerate the full path of discovery-driven data use and Open Science, with an objective to create solutions from potentially disruptive innovations.

The genesis of the Belmont D(DO)MP

The Belmont Forum commissioned a study () which concluded that researchers should:

  • adopt data principles that establish a global, interoperable e-Infrastructure with cost-effective solutions to widen access to data and ensure its proper management and long-term preservation; and
  • promote good data planning and stewardship among Belmont Forum agency-funded research, enabling harmonisation of e-Infrastructure through enhanced project data planning, monitoring, review and sharing.

Allison et al. () further recommended that data should be discoverable with minimum delay; accessible and open by default; understandable to allow broad reuse; managed and protected in sustainable, trustworthy repositories; and supported by a highly skilled workforce. Such recommendations preceded the FAIR Data Principles () and the TRUST Principles for Digital Repositories ().

In response, the Belmont Forum developed an e-Infrastructures and Data Management Toolkit () that includes a rubric for researchers preparing a Data and Digital Outputs Management Plan (D(DO)MP) for proposal submission, plus data management webinars and guidelines (). Once a project is awarded, the Belmont Forum specifies that the D(DO)MP should be a living, actively updated document describing the management lifecycle for the data and other digital outputs to be collected, reused, processed, and/or generated. The Belmont Forum’s D(DO)MP criteria, however, focus primarily on preservation requirements and provide minimal guidance on activities during an active research project.

The PARSEC project

The four-year PARSEC project (‘Building New Tools for Data Sharing and Reuse through a Transnational Investigation of the socioeconomic Impacts of Protected Areas’, ) was funded by the Belmont Forum and will end in 2024. We were therefore mandated to apply the Belmont Forum D(DO)MP guidelines and, in consultation with our researchers, we expanded the scope to include digital object management during the entire research lifecycle. In doing so, we changed the ‘O’ in D(DO)MP from ‘outputs’ to ‘objects’. When all the data management work is left until the end of the grant, or the time of publication, the task can be significant and sometimes impossible, as important descriptive details may be lost. Including the management activities needed during the research effort makes preservation much easier. This paper summarises our experience to date and generalises lessons learned.

PARSEC is an assembly of participants from Brazil, France, Japan, and the United States of America, with collaborators from Australia. It is notable that, with the exception of two funded postdocs and a portion of one principal investigator’s time, PARSEC project members are volunteers. The disciplinary range is considerable, with specialists in data science, research data management, image analysis, artificial intelligence, machine learning, deep learning, spatial systems, socioeconomics, wildlife biology, and ecology. Consequently, PARSEC provides an in-project model for the relationship between those who can help researchers enhance digital object management (the PARSEC Data Science Strand) and those who ‘do the research’ (the PARSEC Synthesis Science Strand), and exposes the cultural and geographical challenges in doing such management well.

PARSEC’s core scientific investigation (carried out by the PARSEC Synthesis Science Strand) examines the influence of marine and terrestrial protected areas on socioeconomic outcomes in the surrounding communities. The approach uses existing data, combining satellite and other remotely sensed data with socioeconomic data using artificial intelligence and other tools. The role of the PARSEC Data Science Strand within the project is to promote best practices for data and software management, such as preparing them properly for reuse and for preservation in a trusted, community-accepted repository with proper attribution. At the time of the PARSEC award, there was no global tool for a researcher to easily find a preserved dataset, the publications that cited it, the creators of the dataset, and any related software. This made automated credit and attribution very difficult and acted as a barrier to researchers citing the data used in their research. The Data Science Strand partnered with DataCite to develop a tool () where researchers can find datasets and software in repositories and confirm that their citations are properly ‘linked’ to the publication through persistent identifiers. The Data Science Strand was primarily responsible for the development of the D(DO)MP workbook over the life of the PARSEC project. We describe the application and development of that D(DO)MP, its evolving content in response to needs, and its internal and external outcomes. We also discuss some lessons learned through this process and provide recommendations and advice for others.
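As an illustration of the kind of lookup such a tool relies on, the minimal sketch below queries the public DataCite REST API for a dataset DOI and lists its creators and related identifiers, through which citing publications and associated software can be traced. The DOI shown is a hypothetical placeholder, and the sketch is not the PARSEC/DataCite tool itself.

```python
# Minimal sketch: look up a dataset DOI in the public DataCite REST API and
# list its creators and related identifiers (links to papers, software, etc.).
# The DOI below is a hypothetical placeholder; a real DOI is needed for the
# request to succeed.
import requests

DATACITE_API = "https://api.datacite.org/dois/"
doi = "10.1234/example-dataset"  # hypothetical DOI for illustration only

resp = requests.get(DATACITE_API + doi)
resp.raise_for_status()
attrs = resp.json()["data"]["attributes"]

print("Creators:")
for creator in attrs.get("creators", []):
    # ORCID iDs, when supplied by the depositor, appear under nameIdentifiers
    ids = [n.get("nameIdentifier") for n in creator.get("nameIdentifiers", [])]
    print(" ", creator.get("name"), ids or "")

print("Related identifiers (papers, software, other datasets):")
for rel in attrs.get("relatedIdentifiers", []):
    print(" ", rel.get("relationType"), rel.get("relatedIdentifier"))
```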

The PARSEC D(DO)MP

As planned for the SEI CRA in 2018, the Belmont Forum began requiring inclusion of a compliant D(DO)MP in all stages of the grant submission and project reporting process. At proposal submission, the following information was required by the Belmont Forum: (i) datasets and digital outputs expected, (ii) FAIR policy conformance, (iii) plan personnel, (iv) output protection, (v) post-project data management, (vi) restrictions and preservation of restrictions, (vii) documentation and metadata for reuse, and (viii) long-term support costs.

Following the update to include all elements for an awarded CRA, the PARSEC D(DO)MP was designed to provide guidance to the researchers for data stewardship during the project and for preservation of data and software products. In the initial stages of the project, we scoped the data management sophistication of the team members by creating an ‘entry profile’ to gauge their existing data and software use and their preferred tools (). The PARSEC D(DO)MP was guided by the research data workflow, using the four to five steps relevant to a synthesis centre (adapted from ) rather than the more detailed workflow proposed by Michener et al. (). Questions did not directly refer to the steps in the research data lifecycle but were guided by them. This enabled us to ascertain that most group members were unfamiliar with DMPs, with only five of the 19 respondents actively using one in other projects. This survey, and many discussions within the Data Science Strand team and at Synthesis Science Strand workshops, informed the development of our D(DO)MP. In short, the value of having a D(DO)MP as an integrated tool, or protocol, in the research lifecycle was not apparent to most PARSEC Synthesis Science Strand members.

Once the grant was awarded, to ensure individual team members understood the importance of PARSEC’s compliance with the D(DO)MP, we required each member to read and sign a ‘code of conduct’ when they joined the project, in which the following was stated:

Derived data and digital outputs generated during PARSEC activity will be documented and released at the time of publication where possible into the public domain in compliance with Belmont Forum requirements. Policies for broad access and sharing, including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements, metadata description, intended repositories, and so on will be clearly described in the Project’s Data and Digital Outputs Management Plan.

The required acknowledgement text for all outputs was also spelt out, as was the expected team behaviour within the project, according to the European Code of Conduct for Research Integrity (). It should be noted that adherence to this established code relies on the ethical practice of giving research data and software the same attribution as peer-reviewed publications.

The ORCID IDs and institutional ROR IDs of each team member were recorded, and after considerable discussion, communication tools were determined (Figure 1). A password-protected Google Drive was created and made available to PARSEC team members for all team contacts, project documents, and working spaces. Project space on the Open Science Framework connected to our Google Drive and provided space for datasets. Access to Amazon Web Services was facilitated through our Brazil team members. Analysis and synthesis were supported by our own dedicated GitHub space (PARSECworld), while the PARSEC Community in Zenodo was our main unrefereed document output service. We stated the desire for each dataset and digital output to be developed with representation from each PARSEC country. We chose the licensing for our data and code, and a community-accepted, trusted data repository, the Environmental Data Initiative (), at the beginning of the project (Figure 1).

Figure 1 

The tools chosen for the various tasks along the data workflow of the PARSEC project.
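A small illustration of keeping the recorded identifiers actionable: the sketch below simply checks that an ORCID iD and a ROR ID resolve against the public ORCID and ROR APIs before they are entered in the team roster. The identifiers shown are placeholders, and this check is an illustrative suggestion rather than a documented PARSEC procedure.

```python
# Minimal sketch: confirm that a team member's ORCID iD and their institution's
# ROR ID resolve before recording them. Endpoints are the public ORCID and ROR
# APIs; the identifiers below are placeholders, not PARSEC team records.
import requests

def orcid_resolves(orcid_id: str) -> bool:
    r = requests.get(f"https://pub.orcid.org/v3.0/{orcid_id}",
                     headers={"Accept": "application/json"})
    return r.status_code == 200

def ror_resolves(ror_id: str) -> bool:
    r = requests.get(f"https://api.ror.org/organizations/{ror_id}")
    return r.status_code == 200

# Placeholder identifiers for illustration only
print(orcid_resolves("0000-0002-1825-0097"))  # ORCID's published example iD
print(ror_resolves("05gq02987"))              # hypothetical ROR ID
```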

These decisions and the security measures for each are described in the D(DO)MP, which is arranged as a detailed workbook, with descriptions of operational procedures plus links to tracking sheets for recording digital outputs. The text is in two major sections. The first is a narrative section describing the Operational Procedures, with appropriate items quoted from the Belmont Forum Rubric and responsible parties identified. The second provides the links to tracking sheets (and their documentation) for within-project recording of four types of digital objects: scholarly publications, datasets, software, and other digital outputs such as posters, presentations, training, and workshop materials. Minor revisions (URL corrections, typographic errors) are handled as ‘patches’, with major revisions incorporated in a new version. The document preamble briefly describes such changes. We requested that all PARSEC team members read and comply with the D(DO)MP.
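For concreteness, a minimal sketch of what one tracking-sheet entry for a digital object might look like is given below. The field names and values are illustrative assumptions; the actual PARSEC tracking sheets are linked spreadsheets whose exact columns are defined in the workbook.

```python
# Minimal sketch of a tracking-sheet entry for a project digital object.
# Field names and values are illustrative, not the PARSEC workbook's columns.
import csv
import os

FIELDS = ["type", "title", "creators_orcid", "identifier", "repository",
          "licence", "related_publication", "date_released"]

record = {
    "type": "dataset",                       # dataset | software | publication | other
    "title": "Example protected-area indicator table",
    "creators_orcid": "0000-0002-1825-0097", # placeholder ORCID iD
    "identifier": "10.1234/example-doi",     # hypothetical DOI
    "repository": "Environmental Data Initiative",
    "licence": "CC-BY-4.0",
    "related_publication": "10.1234/example-paper",
    "date_released": "2023-08-12",
}

path = "digital_objects_tracking.csv"
new_file = not os.path.exists(path)
with open(path, "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    if new_file:
        writer.writeheader()  # write the header only when the sheet is created
    writer.writerow(record)
```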

The evolving content of the PARSEC D(DO)MP

The implementation of the D(DO)MP Workbook required some vigilance, since oversight is important to establish and maintain new habits. This was mainly achieved at the biannual joint meetings of the two Strands.

As stipulated by the Belmont Forum, a D(DO)MP should not be a static document but should be re-examined and updated on a cyclic basis. To date, there have been three versions of the PARSEC D(DO)MP Workbook: the initial version published in 2020 (), then two revisions in 2021 and 2023 (; ). For both revisions, as well as updating the tracking sheet for PARSEC research outputs, the Data Science Strand—with inputs from the Synthesis Science Strand—evaluated the uptake and efficacy of the tools and methods being employed for management of these outputs. Through such evaluation, an example of which is given in the next section, the Data Science Strand decided whether to continue using these tools and methods, swap them for an alternative deemed better suited to user needs, or remove them completely. For example, it quickly became clear that the D(DO)MP is substantial (~20 pages) and unwieldy for regular use by busy researchers, so a reference summary of resources was added to the 2021 version () and modified later (e.g., ). As we moved through the project, we added several important items to the Belmont Forum rubric (). These included processes that would help to ensure the quality of digital objects during the life of the project and prior to preservation, support for inclusive collaboration and authorship for data and digital outputs, guidance on giving attribution for data and digital outputs previously created or created by others, and suggested decisions for project backup and closeout ().

D(DO)MP internal impact

To evaluate the team’s use of the tools and practices established in the 2020 and 2021 versions of the D(DO)MP, we surveyed the 33 team members in October 2022 about their use of the current suite of tools (PARSEC Google Drive, ORCID, email, GitHub plus Zenodo, the EDI Repository, the PARSEC Zenodo Community, and the PARSEC Zotero site), asking whether they could access and use the nominated tools, and the frequency of use, both within and outside of PARSEC. Open-ended comments were invited.

Twelve PARSEC team members responded to the survey. All were able to access the Google Drive for the project (although one preferred to use Dropbox), all were able to access ORCID (although one respondent chose not to have an ORCID ID at all), and all accessed and used email. The publishing tool for non-refereed research products, Zenodo, has been accessed by 75% of respondents, and slightly fewer have used the Zotero site (67%). The code and data repositories (development platform and preservation) have been accessed by 67% in the case of GitHub, but only 42% for EDI. These values reflect the stage the individual team members are at in the project, their roles, and the nature of some of the tools. The programmers are most likely to use GitHub: ‘Considering that I’m not part of the system development team, I didn’t need to use Git in the context of the PARSEC project’ (Respondent (R)15), and ‘One does not think of archiving data products in EDI until there are some final data products ready to share, and that is limited at present to the “end-of-project scenario”’ (R4).

Several comments were made about the use of tools like Slack, Google Drive, or OSF in a researcher’s life; for example, ‘To share, people need first to know each other and to trust each other. That was maybe difficult for PARSEC to achieve between all countries because some planned in-person general meetings could not occur due to Covid…Sharing is still seen as such an additional burden so far…’ (R13), and ‘Slack today is just a tool typical for software development users and not used by other researchers in the team’ (R5). Suggestions were made to improve awareness and uptake of all the tools, including ‘…a private web site which centralises Zotero, GitHub, Google, Zenodo links would be useful’ (R2).

Did the existence of the D(DO)MP make a difference to the members of the PARSEC project? Establishing a code of conduct, and expectations and standards for citation, early in the project has benefitted and protected all members throughout. Having a clear and central location for communication and temporary storage has facilitated transparency, with all members of the team utilising the PARSEC Google Drive. Establishing the PARSEC Zenodo Community, the Zotero Group Library, and the GitHub organisation for the project has enabled effective document and code sharing across the multinational team, and reduced duplication of effort. Our tracking sheets have provided the basis for reporting to the Belmont Forum and all four country funders, making that onerous task much easier than it would otherwise have been. All team members have helped populate these tools. Without the D(DO)MP to stimulate us to make these decisions and continually evaluate their utility, it is doubtful that the project would have been as organised and productive.

D(DO)MP external impact

Translation and reuse of our work is one of the aims of the PARSEC project. This includes the various guidelines that pertain to data and software management. The versions of the D(DO)MP have been well viewed, with 510 unique downloads from our Zenodo community as of 12 August 2023 (; ; ). Other contributions, such as the repository guidelines, have had less impact, with 71 unique downloads by 12 August 2023 (). The number of items in the Zenodo PARSEC Community (140 as of 12 August) is testament to the recorded activity and outreach of the project, as evidenced by the increasing overall download numbers as we share the materials in workshops and webinars.

Lessons Learnt

The PARSEC D(DO)MP has been the catalyst for enabling the data and software management practices advocated by the Data Science Strand to be adopted by the entire team. It is the agreed-upon protocol used and improved by the team, facilitating the evolution of the documented practices. By gathering all project data and software management information into the D(DO)MP, we have enhanced communication among all team members (five countries, four languages) throughout the project, and it is considered an essential component for supporting our project management holistically. It has furthermore led PARSEC members to develop purpose-built tools to support compliance with the D(DO)MP through automation and better management of datasets, their attributes, metadata, and rules.

Trust

In a multidisciplinary, multinational project, where members are volunteers and previously unacquainted, relationships need to be actively built to establish trust among them. Only then can data, code, and knowledge sharing be effective (). Discontinuities due to differences in culture and the challenges of communicating across time zones can greatly affect the within-team transfer of knowledge (). The early development of a D(DO)MP provides a basis for an environment where new data and software sharing practices can be learnt and promulgated.

Collaboration

An unexpected complication has been that although PARSEC members were brought together with a common goal for the project, in many cases, the strongest bonds have been with those they already knew best and (therefore) trusted. Creating new working relationships, and thereby ensuring that everyone is comfortable to follow the D(DO)MP and share their research outputs with everyone (including potential academic rivals) across the teams, has required significant effort and time. In this regard, the PARSEC project was severely hampered by the reduced number of face-to-face meetings that we could organise because of the COVID-19 pandemic.

The above does not negate the value of any element of the D(DO)MP. Although many of its elements were unknown to a large number of PARSEC members at the start of the project, most have since seen, for example, the value of maintaining their ORCID record in line with the D(DO)MP requirements. Moreover, as we approach the end of the project, the use of EDI to archive datasets is now increasing. A D(DO)MP does need, however, to be developed alongside a robust communication plan, with training for the various tools at relevant times, and with common platforms that can be bookmarked, with user instructions and useful functionalities highlighted (as suggested by one of our survey comments).

Communication

Because a D(DO)MP is an evolving document undergoing cyclic updates, a numbering system is necessary to distinguish between revisions. A D(DO)MP should also have a mechanism to collect feedback, driving the events that trigger a new version. Brief surveys filled this need for the PARSEC D(DO)MP and provided relatively broad feedback and anecdotes. A more traceable alternative would be to manage the document in a GitHub repository, with feedback collected as issues, although a survey instrument allows the prevalence of an opinion to be estimated at different points in the project.
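If the GitHub route were taken, gathering the feedback that should feed into the next version could be as simple as listing the open issues carrying an agreed label, as in the sketch below. The repository name and label are hypothetical, chosen only to illustrate the idea.

```python
# Minimal sketch of the GitHub-based alternative: collect D(DO)MP feedback as
# issues on the repository that holds the document, then list the open items
# before preparing a new version. The repository and label names are hypothetical.
import requests

OWNER, REPO = "PARSECworld", "ddomp"   # hypothetical repository
LABEL = "ddomp-feedback"               # hypothetical issue label

resp = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/issues",
    params={"labels": LABEL, "state": "open"},
    headers={"Accept": "application/vnd.github+json"},
)
resp.raise_for_status()

for issue in resp.json():
    print(f'#{issue["number"]} {issue["title"]} (opened by {issue["user"]["login"]})')
```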

Flexibility

Even with a D(DO)MP in place at the heart of the project, and regularly revisited, sustained patience, coordination, and encouragement by team leaders is vital. The first four years of the PARSEC project have highlighted several aspects of developing and implementing a D(DO)MP for a complex, multifaceted project. Firstly, some components in the D(DO)MP are only suitable for a subset of PARSEC members. In addition, those who are not already familiar with a particular tool tend to prefer ad hoc solutions or to use a mechanism already common in their particular domain. For example, programmers engaged in machine learning are familiar with, and are thus heavy users of, cloud storage and GitHub; others only use Google Drive; and those tasked with publishing resources are most likely to use Zenodo. It should also be noted that broad international use of tools and services may not be possible because of cost or access considerations. Indeed, in our survey, one respondent commented on the inaccessibility of Google Workspace to scientists in China. This was not an issue for PARSEC (China is not a partner), but this limitation is likely to be an issue for other partnerships.

In conclusion, we have found that a D(DO)MP can be, as intended, a valuable tool for a research project. It provides a place to summarise the co-determined data and software management practices, to track and modify these practices as they evolve, to quickly recall decisions through a common landing place for team members, and to help them follow the agreed processes for sharing within the project and more broadly. To be most effective, it should be sensitive to the particular and evolving needs of the project and the people associated with it.

Top tips for D(DO)MPs

Make complying with a project’s D(DO)MP-determined practices as easy as possible, by:

  1. Ensuring all team members are committed to the intended collaboration goals of the project. Periodically review commitment as personal situations change.
  2. Providing a way for team members to give feedback at any time. This can include regularly conducting one-on-one discussions to check on possible barriers.
  3. Providing training for team members who are new to a tool or process relevant to their expertise, timed to suit the point they have reached in the project.
  4. Providing the team with a short reference to all resources. This can be a landing page, an infographic with hyperlinks, or any concise resource.