Scope and financial impact of unpublished data and unused samples among U.S. academic and government researchers

Summary Unpublished data and unused samples are common byproducts of research activity, but little is known about the scope and economic impact of their disuse. To fill this knowledge gap, we collected self-reported anonymous survey responses from 301 academic and government scientists from randomly selected institutions. Respondents estimated that they published ∼60% of their data and 95% had unpublished data. Of those collecting specimens, 60% stored unused samples. Systemic and logistical issues were identified as major contributory factors. The median cumulative self-reported estimated value of unused resources per researcher was $28,857, with life science ($36k) and government ($109k) researchers reporting the costliest assets. Using NSF headcounts, we estimated that the current cumulative value of unused resources at universities is approximately $6.2 billion, about 7% of the current annual R&D budget. These findings provide actionable information that can be used by decision makers to reduce obstacles that undermine scientific progress and productivity.


INTRODUCTION
In the United States over $600 billion is funneled into R&D each year to advance scientific knowledge and to address pressing societal needs. 1 Despite efforts to guarantee a return on investment through mandated reporting and specified deliverables, the problem of unpublished data and unused specimens remains. Ironically, while unpublished data are almost universally acknowledged by scientists, relatively few studies have examined their economic impact or the reasons for their disuse.
Some scientists have been quite vocal about inefficiency in research. [2][3][4][5][6][7][8][9] These concerns reached a pinnacle in 2009, when Iain Chalmers and Paul Glasziou estimated that over 85% of biomedical research is avoidably wasted. 10 Marija Purgar et al. 11 arrived at a similar estimate for ecology. Although both analyses attributed this waste to a variety of factors, including flaws in relevance, design, methodology, and bias, a lack of publishing and reporting of data was one of the main causes. 6,9 To date, most of the public discourse on unpublished data has been field-specific 7,11,12 with a central focus on highly visible biomedical studies such as clinical trials. However, it has been deduced that most ''dark data'' are actually held by a greater number of small labs receiving less sizable grants. 8 In addition, unused samples have yet to be considered in this problem. To begin to address this knowledge gap, we collected anonymous information on unpublished data and unused samples from 301 US scientists representing a breadth of fields, institutional types and sizes, and research roles. Our goal was to quantify the amount of unpublished data and unused samples researchers possessed and to understand the reasons for their disuse.
Recent changes in publication practices have expanded what it means to ''publish'' data; thus, we established clear definitions for what constituted ''unpublished data'' (Table 1). Publishing data ahead of peer review (pre-prints) and uploading stand-alone data to repositories are valuable ways to promote data sharing among scientists and to overcome many limitations of the peer-review publication system. [13][14][15][16] Despite promising gains in popularity, widespread adoption has yet to be achieved and varies considerably by field, [17][18][19] with researchers and publishers debating how this information should be used and presented. [20][21][22][23] In the meantime, peer-reviewed manuscripts are still viewed by many as the ''gold standard'' for justifying experimental rationale and demonstrating productivity for grants and career advancement. Thus, we retained the traditional view of ''publishing'' as appearing in peer-reviewed scientific journals. However, many pre-prints published ahead of journal submission would be included in our definition of ''publishing,'' and using these new mechanisms does not prevent data from being published in peer-reviewed format.

Respondent demographics
To assemble contact lists for academic institutions, a random number generator was used to select institutions that were stratified by Carnegie Classification 24 and sector. Carnegie Classification ranks accredited degree-granting institutions by factors such as the size of the student body, degrees conferred, and/or research dollars. We selected 22 R1 (doctoral universities-very high research activity), 15 R2 (doctoral universities-high research activity), 15 D/PU (doctoral/professional universities, moderate research activity), and 15 M1-3 academic institutions (Master's colleges and universities; 1, larger; 2, medium; and 3, smaller programs) and 28 government research agencies that represented 39 states across the US. Contact lists were compiled from departmental websites or institutional directories and primarily included tenured and tenure-track faculty. Research staff, post-docs, and graduate students were also included if their contact information was available.
Survey invitations were sent to 10,206 contacts via email, and an additional 176 contacts were made via LinkedIn messages. Invitation emails from 568 accounts bounced due to defunct email addresses or recipients opting out of Survey Monkey emails. Of the 5,713 individuals who opened their invitation emails, 317 responses were received, for a 5.5% response rate and a 74% completion rate. The survey questions can be found in Data S1.
Sixteen responses were excluded from further evaluation due to incomplete information. The remaining 301 respondents were binned by research field into life sciences (54%), physical sciences (24%), social sciences (15%), or engineering (7%). Position and sector data sums exceeded 100%, as some respondents held cross-institutional appointments. The majority of respondents were academic researchers affiliated with R1 institutions (58%), followed by R2 (21%), and government researchers (14%). Researchers at D/PU and M1-3 institutions were collapsed into a single category (11%) due to low numbers. Most respondents were tenured faculty (44%), tenure-track faculty (14%), or non-tenure track faculty/postdocs (23%). Other position and demographic information is presented in Table 2.

Unpublished data
Respondents were asked detailed questions regarding their unpublished data, as defined in Table 1. The publication process is a lengthy continuum that can dead-end in the manuscript preparation or revision stage. 7,12 To institute a reasonable time cutoff, we asked researchers to exclude from their responses data that were ''part of articles currently under review and/or not part of a publication to be submitted in the coming year'' (which would include many pre-prints). Finally, we asked respondents to remain unbiased as to the data's projected value or impact (i.e., not to exclude data in which the results were negative, null, or inconclusive). Additional definitions used in the study are presented in Table 1.
Based on the criteria given, 95% of respondents indicated that they possessed unpublished data. Examples of responses included: ''gene overexpression in cell lines,'' ''plant physiology,'' ''training evaluation data,'' ''large longitudinal dataset of pregnancy,'' ''imaging studies on human brain tissue,'' ''attitudes and perceptions on local environmental issues,'' ''diversity and prevalence of pathogens,'' ''sediment transport to marshes,'' etc. (Data S2). Respondents were then asked if they possessed unpublished data falling into certain categories (Figure 1A) and to estimate how much of their unpublished data fell into each of those categories (Figure 1B). Types of data meriting additional definitions included: ancillary findings, data unrelated to the lab's mission; orphan data, data that did not ''fit'' well into the lab's other papers but would not constitute a publishable unit; and data producing negative results (also termed ''null data''), data for which there was a failure to reject the null hypothesis, including positive and negative controls, and collected without technical issues. As data could be assigned to more than one category, the sum of percentages exceeded 100%. ''Unfinished projects,'' the largest source of unpublished data, were reported by 82% of respondents and represented ∼50% of their unpublished data. ''Orphan data'' was the second most common category at 47%, followed by data producing ''negative (null) results'' at 34%. Write-in responses in the ''other'' category (19%) echoed issues related to unfinished projects, such as personnel changes, lack of time to write, or difficulty publishing (Figure S1A). The age of unpublished data is shown in Figure S1B.
To prepare for the survey creation, we informally interviewed 13 scientists regarding their unused resources. These discussions revealed wide differences of opinion regarding whether negative results were ''publishable'' and prompted us to ask respondents to remain unbiased as to their data's projected value or impact. To quantify varying opinions regarding the publishing of negative results, we presented respondents with various statements and ascertained the degree to which they agreed, disagreed, or held no opinion (Figure 1C). The strongest opinions were that publishing of negative results was important (agreed-strongly agreed; mean ± s.e.m. 1.2 ± 0.5), but that publishing negative results in respected journals was difficult (agreed-strongly agreed; 1.25 ± 0.56).

[Figure 1 caption] Self-reported estimates of the quantity, type, and cost of unpublished data among researchers. Unpublished data were defined as being: publishable (meets respondents' field's rigor and reproducibility standards; no technical issues); not currently published in a peer-reviewed scientific journal; not in articles currently under review; and not part of a publication to be submitted in the coming year. Respondents were asked to remain agnostic of projected value or impact. (A) Percent of researchers possessing each type of unpublished data. Types of data meriting additional definitions: ancillary findings, data unrelated to the lab's mission; orphan data, data that do not ''fit'' well into the lab's other papers but do not constitute an entire publishable unit; and negative results, data in which there was a failure to reject the null hypothesis, includes positive and negative controls, and was collected without technical issues.
Next, we sought to quantify the amount of unpublished data respondents had accumulated. To proceed, we defined a ''unit'' of unpublished data as being ''sufficient to create a single table or graph.'' We took this approach so that respondents could answer based on conventions within their field and area of research. In addition, this estimate would capture subsets of ''orphan data'' that may have been excluded from a larger set of data used for publication.
In keeping with this definition, most respondents estimated that they possessed 10-24 units of unpublished data, with over 60% possessing between 5 and 50 units ( Figure 1D). Respondents were then asked to estimate the typical cost to produce one unit of data, including the cost of materials, supplies, services, etc. but excluding the salary costs of full-time laboratory personnel ( Figure 1E). Most estimates (60.8%) fell between $100 and $2,999 but 10% of respondents estimated that the average unit of data cost them more than $10,000.
The midpoints of these ranges were then multiplied (quantity × cost) to calculate the approximate estimated value of unpublished data for each researcher. These values were stratified by field (Figure 1F) and tier/sector (Figure 1G). Researchers in engineering and life sciences possessed the highest median value of unpublished data ($31k), followed by the physical sciences ($13k) and social sciences ($5k). Researchers in the government sector possessed the costliest data ($56k), followed by R1 ($28k), D/PU and M1-3 ($17k), and R2 ($13k) institutions.
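As a concrete illustration of the valuation described above, the per-researcher estimate reduces to multiplying range midpoints. The sketch below uses a hypothetical respondent whose answers fall in two of the response bins mentioned in the text; the function names are our own, not from the study:

```python
def midpoint(low, high):
    """Midpoint of a survey response range."""
    return (low + high) / 2

def estimated_value(quantity_range, unit_cost_range):
    """Approximate per-researcher value of unpublished data:
    midpoint of the quantity range times midpoint of the per-unit cost range."""
    return midpoint(*quantity_range) * midpoint(*unit_cost_range)

# Hypothetical respondent: 10-24 units of data at $1,000-$2,999 per unit
print(estimated_value((10, 24), (1000, 2999)))  # 17.0 * 1999.5 = 33991.5
```

Because responses were binned ranges rather than exact figures, the midpoint is a deliberately coarse but unbiased point estimate within each bin.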
While these numbers represent self-reported estimates for the costs of data collection and processing over a range of data types, the findings demonstrate that unpublished data may represent a significant potential loss of resource investment and that systemic issues such as personnel turnover, time constraints, and perceived publication biases contribute to deficits in their utilization.

Publication efficiency and publication pressure
To gain further insights into how researchers viewed their own publishing activity, we asked respondents to estimate their publication efficiency, a theoretical scenario in which being 100% efficient means publishing 100% of all ''publishable'' data in peer-reviewed journals. These were compared to self-reported estimates of publication pressure (Figures S2A-S2C). Estimates of publication efficiency were normally distributed with a mean ± s.d. of 59.18 ± 23.05% (Figure 2A). Publication efficiency did not significantly differ between researchers in different fields (Figure 2B) but was lower at D/PU and M1-3 academic institutions (Figure 2C), where perceived publication pressure was also reduced (Figure S2B). Among individual researchers, there was no significant relationship between publication pressure and publication efficiency (p = 0.70) or the number of estimated unpublished data units (p = 0.83). There was, however, an association between the number of unpublished data units and publication efficiency (r_s = −0.22, p < 0.01).
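The r_s reported above is Spearman's rank correlation, i.e., the Pearson correlation computed on ranks rather than raw values, which makes it robust to the binned, non-linear survey responses. A minimal self-contained sketch (the data below are made up for illustration and are not the survey's responses):

```python
def rank(values):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Made-up example: more unpublished units tends to mean lower efficiency
units      = [5, 12, 24, 37, 50, 75]
efficiency = [90, 70, 65, 50, 55, 30]
print(round(spearman(units, efficiency), 3))  # -0.943 (strongly negative rho)
```

A negative rho here, as in the survey, means researchers reporting more unpublished units tended to report lower publication efficiency.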
We then asked respondents to estimate the publication efficiency of the average research laboratory in their field. Differences in the perception of efficiency between self and peers are shown in Figure 2D. On average, individuals felt that they were as efficient or slightly less efficient than their peers (mean, −3.19%; median, 0%); however, greater differences emerged when respondents were stratified by position (Figure 2E) and gender (Figure 2F). While the number of estimated unpublished data units was identical between males and females (mean and median of 10-24 units of data), females viewed themselves as being significantly less efficient (−6.50% versus −0.76%; p < 0.05). Together, these data provide additional psychological and sociological insights into the research and publication process.

Unused samples
Next, we asked respondents about their unused samples. In order to include as many research fields as possible, we broadly defined unused samples as ''samples/specimens that are produced in excess, left-over from experiments or collections, or so easily generated that the laboratory would be willing to share them with other respected collaborators'' and excluded ''samples that the laboratory would be unwilling to share or that are not suitable for use in publication due to insufficient rigor and/or reproducibility standards or technical/experimental issues.'' Given the wide range of fields we included in the survey, we allowed each researcher to decide what constituted an ''unused,'' ''sharable'' sample within their specialty.
A total of 53% of survey respondents collected samples or specimens as part of their research. Sixty percent of these individuals indicated that they had material meeting the definition of unused samples/specimens, with wide variation seen among respondents. Respondents were also asked to provide examples of their unused samples, which included ''serum and tissue from infected animals,'' ''whole animals; frozen tissues,'' ''frozen fruit flies from a selection experiment,'' ''embedded mouse brain tissue,'' etc. (Data S3). A range of 10-100 samples was the most common answer (22%), but the median answer corresponded to 300-500 samples (Figure 3A). Respondents were then asked to estimate the average cost required to generate one sample, including the cost of materials, reagents, services, maintenance fees, disposables, etc. but excluding salary costs for full-time laboratory employees (Figure 3B). Seventy-five percent of respondents estimated this cost to be between $1 and $100 per sample. The age of samples reported ranged from 0 to 50 years (Figure 3C), implying gradual accumulation of unused samples over time.
For each researcher, the midpoints of the ranges they selected were multiplied (quantity × cost) to calculate the approximate estimated value of unused samples. These values were stratified by field (Figure 3D) and tier/sector (Figure 3E). Researchers in the life sciences possessed the costliest unused samples (median of $26k), followed by the physical and social sciences ($11k). None of the engineering and technology respondents possessed unused samples that met the specified criteria (Figure 3 legend). In terms of sector and tier, government researchers had the most expensive unused samples ($100k), followed by R1 researchers ($26k), R2 ($22k), and D/PU and M1-3 ($15k).
In addition, respondents identified obstacles that interfered with their ability to share their samples with potential collaborators (Figure 3F). These researchers identified others not knowing of the samples' existence and an inability to find collaborators as the largest obstacles, ahead of accessibility, selectivity, resource challenges, or confidentiality.

Total unused resources
To estimate the total cost of unused resources per investigator, the self-reported estimated costs of unpublished data and unused samples were summed for each researcher (Figure 4A). Across all researchers, the median value of total unused resources was $28,857. However, the possession of very costly assets among some researchers skewed the mean to $657,048. Researchers in the life sciences ($36k) and engineering ($31k) had the most expensive median unused resources compared to the physical ($19k) and social sciences ($5k). The median value of unused resources for government researchers ($109k) was approximately three times as high as at R1 institutions ($34k), followed by R2 ($31k) and D/PU and M1-3 institutions ($10k).
These data were then used to estimate the total value of unused resources at U.S. academic institutions. We recalculated the median total unused resource value for full-time doctorate holders in academia only and then multiplied this number by corresponding NSF headcounts 25 across research fields (Figure 4D). The resulting calculations projected a total of $6.2 billion in unused laboratory assets, the bulk of which may reside within the life sciences (65%, $3.9b; Figure 4E). Engineering researchers were estimated to hold 20% of unused assets ($1.2b), followed by the physical sciences (10%, $0.6b) and social sciences (8%, $0.5b). Although government researchers were included in our survey, they were excluded from this computation, as similarly stratified headcount data were not available for this group.
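The extrapolation step above amounts to a weighted sum: per-field median value times per-field headcount. The sketch below shows only the arithmetic; the medians and headcounts are placeholders, not the study's actual medians or the NSF figures:

```python
# Placeholder per-field median unused-resource value ($ per researcher)
median_value = {"life": 36_000, "engineering": 31_000,
                "physical": 19_000, "social": 5_000}
# Placeholder full-time doctorate-holder headcounts per field
headcount = {"life": 110_000, "engineering": 40_000,
             "physical": 33_000, "social": 95_000}

# Field-level totals and the grand total across academia
per_field = {f: median_value[f] * headcount[f] for f in headcount}
total = sum(per_field.values())
for f, v in sorted(per_field.items(), key=lambda kv: -kv[1]):
    print(f"{f}: ${v / 1e9:.1f}b ({v / total:.0%})")
print(f"total: ${total / 1e9:.1f}b")
```

Note that using a median rather than a mean deliberately damps the influence of the few respondents with extremely costly assets, so the projection is conservative with respect to outliers.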

DISCUSSION
Economists define ''stranded assets'' as ''assets that have suffered from unanticipated or premature write-downs, devaluation, or conversion to liabilities.'' 26,27 While this term is often used in the context of devalued real estate or inaccessible fossil fuels, it is highly applicable to unpublished data and unused samples among government and academic researchers. In analogous terms, the research process distills investments of time and money into data, the value of which is scientific knowledge and advances to meet societal needs. In some research fields, the generation or acquisition of samples is a vital, resource-intensive precursor to obtaining and using data. In addition, the publications that originate from scientific activity provide value in the form of career advancement, the ability to obtain future grant funding, and societal impact. However, when data remain unpublished and samples are left unused, the value of these investments becomes ''stranded'' in a state of stored potential. In cases where these materials are time-sensitive or their existence remains unknown, their value may be lost completely.
Our data suggest that the average researcher possesses about $29,000 in stranded assets, which means that the average R1 institution has millions of dollars in unused assets and, across US academic institutions, a grand total of $6.2 billion may be at stake. It is important to place these findings in temporal context and emphasize that this is a cumulative snapshot of unused resources. The rate of aggregation is currently unclear, as some respondents' answers may represent an entire career's worth of accumulated unused resources, while junior investigators may possess only a few years' worth. While this figure does encompass an enormous amount of resources, it represents a fraction of the total academic R&D spending in a given year, about 7% of the $90 billion spent at universities in 2021. 28 Moreover, it should again be emphasized that this is not an estimate of ''waste'' but instead the amount of stranded assets that are in danger of becoming waste.
Here, we extrapolated the approximate value of unused resources in academia; however, the total unused laboratory resource economy in the U.S. is likely orders of magnitude larger when considering inputs from the government sector and private industry. Others have given much larger estimates. In their Lancet paper on research waste, Glasziou and Chalmers (2009) surmised that the percent loss attributed to non-publication alone (if data passed muster in other areas) would be ∼25%. Given NIH's $45 billion annual budget, 29 this would translate to about $11 billion each year for biomedical research alone.
Our findings suggest that data are often not withheld by choice, as only 8% of respondents indicated that they possessed data they did not want to publish. Instead, the evidence indicates that systemic issues and publication bottlenecks hamper the efficient conversion of data into peer-reviewed publications. During the manuscript preparation process, carving out a cohesive publishable unit from results often leads to data being ''orphaned'' because some datasets are not sufficiently robust to be published on their own or may be regarded as tangential. Indeed, 47% of respondents possessed unpublished data that met this description. Over a third possessed unpublished negative (null) results and felt strongly that these data are undervalued by peers and publishers. Finally, unfinished projects were by far the primary source of unpublished data, with personnel turnover and a lack of time identified as major culprits. These results echo findings from other studies in which lack of time, directionality of findings, and stalling in the preparation/review process have all been identified as causes of unpublished data. 7,12,30 With respect to unused samples, it appears that the opportunity to convert these stranded assets into usable resources via collaboration is stymied primarily by a logistical constraint: widespread knowledge of their availability.
Despite the obstacles faced by researchers in publishing, most feel their performance is commensurate with that of their peers. However, further stratification reveals that women tend to view themselves much more critically than men despite reporting identical quantities of unpublished data, a discrepancy that many would ascribe to ''imposter syndrome.'' [31][32][33][34] Although the literature is divided over the effects of gender on imposter syndrome, this finding supports previous reports that females experience higher rates than males. 34 Also of note is that the amount of publication pressure individuals experienced had no bearing on their perceived publication efficiency or the self-reported amount of unpublished data. Together, these data provide a unique view of the intersection of scientific productivity with factors such as stress, systemic issues, resource limitations, and self-image.

While it is not realistic to expect that all the challenges identified here can be completely overcome, there is hope that solutions can be identified. In the past 10 years, much attention has been given to shifting the publishing framework toward open science as a means of facilitating data sharing. As these practices become more mainstream, they represent an essential way to reduce waste in the form of unpublished data. Moreover, they may represent one of the few options to overcome publication bias against negative or null results.
While continued efforts in this area are valuable, our findings argue that tackling unfinished projects would also be fruitful, as they are a predominant source of unpublished data. Perhaps the greatest challenge with unfinished projects is that career development timelines and financial pressures are often not compatible with the pace and compensation structure of research. In academia, those directly involved with data generation (typically students and early career researchers) often leave their positions to acquire better salaries and to advance their careers, leaving behind unfinished projects. Indeed, our data provide actionable evidence and the financial impetus that decision-makers could use to justify policy changes targeting these problems. For example, requirements for better research planning can be put in place, such as transition plans for personnel changes and institutional incentives for completing projects and publishing manuscripts and/or raw data. One survey respondent suggested an ''institutional requirement for delivering all unpublished data with details required for publication and an agreement for authorship before departure from the lab.'' As the financial loss from unfinished projects is significant, implementing incentives to conclude experiments, publish results, and mentor replacement staff could also be justified.
There is also an important lesson to be learned from conventional ''stranded assets'' in economics: the way to make use of them is to repurpose them. Similarly, unpublished data and unused samples can be mobilized through new applications and collaborative efforts. Another respondent commented that ''sometimes we don't see value in data … but others might find it very valuable'' and suggested the need for collaborative tools and exchanges to help scientists ''make mutual agreements for analysis/publication.'' Such an approach could help make use of hard-to-publish negative results, reproducibility studies, pilot data, orphan data, or ancillary findings and would help overcome the central challenge of unused samples: knowledge of their availability.

Limitations of the study
These findings are based on the parameters and definitions set forth in the survey, which could under- or over-estimate the actual pool of unused resources in several respects. First, data that researchers anticipate will take longer than one year to submit to a peer-reviewed journal were counted as ''unpublished.'' This cut-off was established because many papers stall in the manuscript preparation stage and the likelihood of publication tends to decline over time 7,12; however, it is not uncommon for some completed projects to take multiple years to be submitted, so some data could have been prematurely counted as ''unpublished.'' Moreover, we did not count stand-alone submissions in data repositories or pre-prints, although many such data would have been included in our estimates, as they are often posted ahead of or in tandem with peer-reviewed publications. Alternatively, the survey could have underestimated resources in that it did not quantify data that still need work in order to become ''publishable.'' Finally, we did not include the cost of employee time and effort in the value of data and samples, as we felt this would be too difficult to disentangle within the confines of this survey. This is arguably the most resource-intensive aspect of data and sample generation, which leads us to hypothesize that our findings are, overall, likely underestimates.
Another challenge of this survey was that the generalizability of the questions made it difficult to capture field or institution-specific idiosyncrasies. This is particularly challenging because some ''units'' of data may be large and complex, such as cohort or population data, and some may be small, such as data from a simple experiment.
As with any survey, concerns regarding response bias must be acknowledged. At 5.5%, our response rate was on the low side, but not unexpected for a ''cold'' email survey given the characteristics of the population being surveyed (i.e., busy professionals). [35][36][37][38] However, lower response rates are not necessarily correlated with response bias, 39,40 and previous studies have demonstrated that response rates of 5% in similar respondent populations produce reliable survey results provided a sampling frame of at least 500 individuals. 41 In support of this interpretation, our publication efficiency statistic of ∼60% aligns closely with previously reported publication rates of 58%, 12 ∼45%, 3 and >50%. 4

STAR+METHODS
Detailed methods are provided in the online version of this paper.

ACKNOWLEDGMENTS
A National Science Foundation I-Corps grant and Flinn Foundation Entrepreneurial fellowship provided support during the preliminary interviews that showed the value of a wider investigation. We would like to gratefully acknowledge the support of Dr. Christine Flanagan in her role as editor and whose discussions helped to frame the survey content. Dr. Deborah Goldberg and Dr. Shaun McCullough provided valuable mentoring, advice, and input.