A parasite's perspective on data sharing

Abstract Data generation is expensive in terms of both time and money. Sharing data enables the rapid replication, validation, and application of discoveries, increasing the pace and accuracy of research. As research parasites, or users of other people's data, we recognize that a strong science ecosystem requires that those who share best be recognized. We find that widely accessible benchmark datasets have provided outsized benefits, and we hope that the benefits of sharing will also begin to accrue to individual investigators who share well. Funders can enhance progress by adjusting incentives to better support data sharers, which will make their programmatic investments more effective. We note some funders who are making such efforts.

The primary goal of scientific publishing is to disseminate findings to the community and disclose supporting observations that allow the community to infer their validity and robustness. Widely used forms of data release, such as making data available only upon reasonable requests, can delay sharing indefinitely. Though these practices reduce the scientists' ability to critically evaluate or build upon previous work, some journals still allow these practices. Some editors have questioned whether data sharing may bring harm to certain fields, going so far as to call someone who has re-analyzed shared data without providing co-authorships to sharers "research parasites" [1]. This has given rise to "Research Parasite Awards" focused on data reuse [2] and "Research Symbiont Awards" focused on data sharing [3] that seek to recognize exemplar participants in this scientific ecosystem. As research parasites, we provide our perspective on some of the best examples of data sharing. We note how these practices provide an efficient path to accurate, impactful findings. We also discuss how funders can encourage innovation and improvements in science by preferring to support researchers who are sharing.
During the outbreak of the deadly Zika virus (ZIKV) in South and Central America in 2013 and 2014, the virus spread so rapidly that researchers studying it felt the need to share their data before completing the publication process. Acknowledging this, several researchers announced a data release, including sequence data of candidate strains, to the World Health Organization to enable other researchers to develop vaccines [4]. The Zika experimental-science team, who were studying rhesus macaques infected with Zika virus, made their data public rather than publishing them in a journal and continued to update their results daily for others to benefit from their progress [5]. These scientists committed to sharing information rapidly during public health emergencies, but there were no guidelines for formal recognition and rewards for such efforts at the time.
There are also sharing-first projects outside of infectious disease research. As part of a $62 million initiative, the UK Biobank recruited 500,000 individuals from the United Kingdom aged between 40-69 years in 2006-2010 and publicly released all generated data for bona fide research use [6]. The consortium behind this effort collected and continues to collect extensive phenotypic data including various physical measurements, accelerometry, questionnaires, biochemical assays, and genotyping data from all of its participants. The conventional model for novel, multisite data collection is to restrict access to only a   [8]. For ENCODE and ENCODE Community publications, the ENCODE official website [9] was used in place of PubMed for searches up to its finalization in 2016. For the past decade, we estimated that approximately 6,516 publications benefited from these public resources.
small set of participants who produce publications and other accomplishments linked to the resource. However, the UK Biobank is transforming the way public genetic and health data should be shared. This consortium, while providing strategically deidentified electronic health records, genotypes, and other research data to benefit public health, explicitly stated that no resource users will be expected to add the UK Biobank as coauthors on their publications or share income generated from research using these resources.
Certain data resources undergo continuous updates. In an example from ecology, researchers are actively building platforms to embrace evolving or dynamic data [7]. Yenni and colleagues describe an automated workflow where researchers can automatically integrate new data with versioning and archiving and update associated manuscripts and expand relevant supplementary tables. This framework not only allows faster integration and automated analysis of continuously updated data, it also ensures systematic review of the workflow and faster dissemination of results and provides transparency for others to validate the study and its findings. Figure 1 depicts citations for publicly shared data between 2008 and the present. For the past decade, we estimated that approximately 6,516 publications benefited from these public resources.
If data sharing plays such a key role in more rapid and accurate scientific progress, how can this behavior be better encouraged? Funders have a unique role in the scientific ecosystem. Those working at universities and other research institutions often face incentive structures that are challenging for those who seek to make change, such as promotion and tenure committees that are not accustomed to crediting data sharing. Alberts et al. [10], in an article describing flaws in the biomedical research ecosystem, note that at many institutions, a faculty salary is almost entirely covered by research grants. This increases universities' reliance on funder dollars and provides funders with an expanded role in driving evaluation metrics throughout the ecosystem.
Funders benefit when scientists find errors earlier and make discoveries sooner, and some are strengthening their resourcesharing expectations. The Chan-Zuckerberg Initiative expects grantees to share protocols and preprints for projects that they fund. Alex's Lemonade Stand Foundation asks applicants to describe their past sharing behavior alongside their plans for sharing resources that result from the proposal, and both are assessed for impact. Since 2017, the Bill & Melinda Gates Foun-dation has required peer-reviewed publications and associated data to be discoverable and accessible openly and has offered to pay for open access as needed. These are important changes because researchers who share face challenges due to the scientific norms around credit. These challenges include the costs associated with open access publishing and high-quality data sharing as well as the risks that errors could be found in their work or that others could extend their work, causing them to lose future accomplishments and credit. By providing incentives for sharing, funders help to mitigate these perceived risks and thus improve the value of their investment. Funders can make their investments go further by supporting a culture among investigators that values behaviors that accelerate scientific progress.