Robustness of evidence reported in preprints during peer review

Scientists have expressed concern that the risk of flawed decision making is increased through the use of preprint data that might change after undergoing peer review. This Health Policy paper assesses how COVID-19 evidence presented in preprints changes after review. We quantified attrition dynamics of more than 1000 epidemiological estimates first reported in 100 preprints matched to their subsequent peer-reviewed journal publication. Point estimate values changed an average of 6% during review; the correlation between estimate values before and after review was high (0·99) and there was no systematic trend. Expert peer-review scores of preprint quality were not related to eventual publication in a peer-reviewed journal. Uncertainty was reduced during peer review, with CIs reducing by 7% on average. These results support the use of preprints, a component of biomedical research literature, in decision making. These results can also help inform the use of preprints during the ongoing COVID-19 pandemic and future disease outbreaks.


Introduction
As the use of preprint articles substantially increased during the COVID-19 pandemic, [1][2][3][4][5] researchers hypothesised that the use of unreviewed scientific articles could mislead personal and public health decision making. In bypassing the quality control of peer review, improvements to an article that would have been made during review are not incorporated and articles that are of too low quality to be published in a peer-reviewed journal might be communicated to the scientific community and the public. [6][7][8][9] Nevertheless, the US Government has encouraged policies that promote the use of preprints to increase scientific communication. When the National Institutes of Health (NIH) 10 announced their policy supporting the citation of preprints, whether biomedical researchers would use preprint articles as the US Government intended (ie, as "a complete and public draft of a scientific document") 10 or whether largely incomplete studies would be published was unclear. The National Library of Medicine also created a preprint policy to include some preprint articles in PubMed. 11 Currently, little is known about how much the evidence base of scientific articles changes during peer review. A retrospective observational study published in 2020 suggested that data reported in preprints might systematically overestimate epidemiological data, affecting health-care policy decision making. 5 Scientists, journalists, and members of the public can therefore question how much the evidence presented in a preprint might change if it was peer reviewed. This question concerns the robustness of the evidence base in preprint versions of an article, that is, the resilience of this evidence to the processes of peer review. In this Health Policy paper, we aimed to quantify how much data reported in preprints change from preprint versions to peer-reviewed, published versions.

Search strategy
We aimed to identify studies that reported original data on COVID-19 measurements, had a preprint deposited ahead of journal publication, and had a peer-reviewed publication linked to the preprint. The NIH iSearch COVID-19 Portfolio, part of the NIH iCite web service, [12][13][14][15] provides links between preprints deposited on the servers arXiv, bioRxiv, ChemRxiv, medRxiv, Preprints.org, Qeios, and Research Square and the version published in a peer-reviewed journal indexed in PubMed.
The NIH iSearch COVID-19 Portfolio uses artificial intelligence, machine learning approaches, and curation by biomedical research experts to comprehensively index preprint and peer-reviewed publications on COVID-19. 16,17 In addition, the COVID-19 Portfolio integrates natural language processing and detailed search engine functionality to allow users to identify articles that contain relevant epidemiological concepts. We searched the COVID-19 Portfolio for items published between Jan 1 and Sept 29, 2020, that contained the following keywords or their synonyms in the title or abstract: basic reproduction rate, incidence, case fatality rate, or infection fatality rate (appendix p 10). Articles that contained more than one keyword were analysed separately.

Data analysis
The curating process followed guidelines to ensure consistency. Curators (LN, AS, SL) first verified that a version of each preprint was available in a peer-reviewed journal. Any articles that the NIH iSearch COVID-19 Portfolio identified as having a subsequent version were excluded from analysis if the subsequent version was also a preprint and no journal-published version existed. If the preprint and published versions were suitable, the article version was verified as the initial version of the preprint, not a subsequent version that included changes due to peer review. Only the initial preprint submission was included in the analysis as subsequent updates to the preprint might have incorporated changes because of ongoing peer review, thereby underestimating the amount of change during peer review. All sections of articles, including supplementary materials included in either the preprint, published version, or both, were included in the analysis. Curators examined identified preprint articles in random order. If there were data present on the search terms of interest (case fatality rate [CFR], infection fatality rate [IFR], disease incidence, and basic reproduction number [R0]), point estimates and their CIs were recorded. If a preprint mentioned one estimate (eg, CFR) in the title or abstract but also reported data for a different estimate that was not mentioned in the title or abstract (eg, R0), 18,19 the data for the additional estimate were excluded as possibly ancillary. The same process was done when analysing the journal article. No data were collected if curators had to guess the value (eg, data were presented in a scatter plot but the estimate was not in the full text). Data points that were repeated (eg, mentioned in the abstract and in a table with accompanying data points) were deduplicated. Articles that reproduced data from another paper were excluded unless they also provided their own estimates.

Peer-review scores
High data alteration during peer review, either in terms of magnitude or censorship, might be expected in papers that were not published because of their low quality. This argument is untestable because the counterfactual peer-reviewed manuscripts do not exist; however, it makes a testable prediction. If article quality affects the outcome of a preprint being published in a peer-reviewed journal, then this outcome should be statistically related to independent peer-review assessments of preprint quality.
To test this hypothesis, we aggregated peer-review scores of preprint quality from the Rapid Reviews: COVID-19 database, 20 which solicits transparent reviews of preprints. These so-called overlay review scores are produced by experts in the research area and include members of the National Academy of Medicine, the University of California, Lawrence Berkeley National Laboratory, and the University of Washington. Overlay peer review is an acknowledged form of quality review: the National Library of Medicine recently began accepting overlay journals into PubMed for indexing. 21 We averaged the review scores assigned to each preprint, similar to the aggregation of peer-review scores conducted by NIH. 15

Data transformation and statistics
Percentage data were normalised onto a 0-1 scale for consistency. To test for systematic differences between the matched preprint and publication data points, their ratio was calculated and the log ratio values were tested with a two-sided exact Wilcoxon signed rank test. The same procedure was used to detect systematic differences in ranges of CIs before and after peer review (appendix p 2).
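The log-ratio test described above can be illustrated with a minimal Python sketch. The function names and the brute-force enumeration of sign assignments are assumptions made for illustration, not the authors' code; the source does not state which software was used. Enumerating every sign assignment yields an exact two-sided p value conditional on the observed ranks, but it is only practical for small samples.

```python
import math
from itertools import product

def log_ratios(preprint, published):
    """Natural log of published/preprint for each matched pair of estimates."""
    return [math.log(b / a) for a, b in zip(preprint, published)]

def midranks(values):
    """Rank values in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def exact_signed_rank_test(diffs):
    """Two-sided exact Wilcoxon signed rank test by enumerating sign flips.

    Zero differences are dropped. Returns (W+, p value).
    The loop visits 2**n sign assignments, so keep n small.
    """
    d = [x for x in diffs if x != 0]
    ranks = midranks([abs(x) for x in d])
    w_plus = sum(r for x, r in zip(d, ranks) if x > 0)
    centre = sum(ranks) / 2  # expected W+ under the null hypothesis
    observed = abs(w_plus - centre)
    extreme = 0
    for signs in product((0, 1), repeat=len(d)):
        w = sum(r for s, r in zip(signs, ranks) if s)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(d)

# Hypothetical matched estimates (already on a 0-1 scale), for illustration only.
preprint_values = [0.020, 0.051, 0.110, 0.300, 0.014, 0.082]
published_values = [0.021, 0.049, 0.115, 0.290, 0.015, 0.080]
w_plus, p_value = exact_signed_rank_test(log_ratios(preprint_values, published_values))
print(f"W+ = {w_plus}, two-sided exact p = {p_value:.3f}")
```

Working on log ratios rather than raw differences treats a doubling and a halving as equal and opposite changes, which suits estimates that span several orders of magnitude.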
We used logistic models to examine the relationship between peer-review scores and publication probability (appendix p 8). Area of research was also included as an independent variable.
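As a rough illustration of this kind of model, the following Python sketch fits a logistic regression of publication status on a mean review score and a research-area indicator. The data, variable names, score scale, and the plain gradient-ascent fitting routine are all hypothetical choices made for this sketch; the actual model specification is given in the appendix and is not reproduced here.

```python
import math

def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, outcomes, learning_rate=0.01, steps=5000):
    """Fit logistic regression by gradient ascent on the log-likelihood.

    rows: list of predictor lists (an intercept term is added internally).
    outcomes: list of 0/1 values (1 = published in a journal).
    Returns coefficients as [intercept, b1, b2, ...].
    """
    n_coefs = len(rows[0]) + 1
    coefs = [0.0] * n_coefs
    for _ in range(steps):
        gradient = [0.0] * n_coefs
        for row, y in zip(rows, outcomes):
            x = [1.0] + list(row)
            error = y - sigmoid(sum(c * v for c, v in zip(coefs, x)))
            for j, v in enumerate(x):
                gradient[j] += error * v
        coefs = [c + learning_rate * g for c, g in zip(coefs, gradient)]
    return coefs

def predict(coefs, row):
    """Predicted publication probability for one preprint."""
    x = [1.0] + list(row)
    return sigmoid(sum(c * v for c, v in zip(coefs, x)))

# Hypothetical data: [mean review score, 1 if public health else 0].
preprints = [[2.0, 0], [3.0, 0], [4.0, 0], [3.5, 0],
             [2.5, 1], [3.0, 1], [4.0, 1], [3.5, 1]]
published = [0, 1, 0, 0, 1, 1, 1, 0]
intercept, b_score, b_area = fit_logistic(preprints, published)
print(f"score coefficient = {b_score:.3f}, area coefficient = {b_area:.3f}")
```

A score coefficient near zero would correspond to the absence of a relationship between review scores and publication; a dedicated statistics package would additionally provide standard errors and p values, which this sketch omits.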

Results
We identified 100 matched preprint-journal-article pairs using the NIH iSearch COVID-19 Portfolio. [12][13][14][15]17 On average, matched articles reported 19 epidemiological point estimates per paper. We analysed a total of 1921 data points appearing in the preprint version, the publication version, or both. Of the 1606 data points reported in preprints, 173 (11%) were deleted during peer review. Consequently, 1433 estimates (89%) persisted after undergoing peer review. An additional 315 (18%) were added during peer review.
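The counts above can be checked with a short calculation. The sketch below is illustrative; the function name is hypothetical, and the choice of the published-version total as the denominator for the added estimates is an assumption that happens to reproduce the reported 18%.

```python
def attrition_summary(preprint_points, deleted, added):
    """Summarise how data points were deleted, kept, and added during review."""
    persisted = preprint_points - deleted
    published_points = persisted + added       # points in the journal version
    total_points = preprint_points + added     # points in either version
    return {
        "persisted": persisted,
        "published_points": published_points,
        "total_points": total_points,
        "pct_deleted": 100 * deleted / preprint_points,
        "pct_persisted": 100 * persisted / preprint_points,
        # Assumption: added points are expressed relative to the published total.
        "pct_added": 100 * added / published_points,
    }

summary = attrition_summary(preprint_points=1606, deleted=173, added=315)
for name, value in summary.items():
    print(name, round(value, 1))
```

With these inputs, 1433 estimates persist, 1748 appear in the published versions, and 1921 appear in at least one version, which matches the totals reported above.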
Point estimate values changed an average of 6% during review. We observed similar results at the article level (appendix pp 3-4). Furthermore, we did not observe any systematic increase or decrease in values after peer review, indicating that preprint values are not systematically inflated (Wilcoxon signed rank test; p=0·74). The correlation between preprint and peer-reviewed estimate values was high, more than 0·99 (figure 1). These results held at all scales of the dataset (appendix pp 5-6). Accordingly, an assessment of agreement between preprint and publication point estimates using a non-parametric Bland-Altman analysis 22 showed high agreement between data from preprints and their published versions (appendix p 6).
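To make the two agreement measures concrete, the following Python sketch computes a Pearson correlation and one common non-parametric variant of Bland-Altman limits (the median difference with the 2·5th and 97·5th percentiles of the paired differences). Both the choice of Pearson correlation and this percentile-based variant are assumptions for illustration; the source does not specify the exact procedures, and the data are hypothetical.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def nonparametric_agreement(xs, ys):
    """Median paired difference with 2.5th and 97.5th percentile limits."""
    diffs = [y - x for x, y in zip(xs, ys)]
    cuts = statistics.quantiles(diffs, n=40)  # 39 cut points in 2.5% steps
    return statistics.median(diffs), cuts[0], cuts[-1]

# Hypothetical matched estimates, for illustration only.
preprint_values = [0.020, 0.051, 0.110, 0.300, 0.014, 0.082, 2.4, 3.1]
published_values = [0.021, 0.049, 0.115, 0.290, 0.015, 0.080, 2.5, 3.0]
r = pearson_r(preprint_values, published_values)
median_diff, lower, upper = nonparametric_agreement(preprint_values, published_values)
print(f"r = {r:.4f}; median difference = {median_diff:.4f} ({lower:.4f} to {upper:.4f})")
```

A high correlation alone does not rule out a constant offset between versions, which is why an agreement analysis of the paired differences is a useful complement.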
For infectious disease research, time-varying estimates constitute a mechanism of change from preprint to peer-reviewed versions of articles that might contribute a unique form of variance in the underlying data compared with other areas of research. We analysed the correlation between time from posting a preprint online to publication versus the fold-change of point estimates in both preprint and publication versions and found no significant correlation (R²<0·01; p=0·19; appendix p 3). Other areas of research might show less change than was measured here.
Outliers in this data analysis were most extreme in the incidence dataset as authors updated their estimates of the number of people infected, which sometimes increased substantially. However, after subdividing the data into categories based on the estimate type (appendix p 6), none of these changes in estimate values were statistically significant (Wilcoxon signed rank test; CFR p=0·81; IFR p=0·18; incidence p=0·13; R0 p=0·23). Thus, there were small changes to the 89% of data points that persisted through peer review that were not statistically or practically significant.
We identified 67 articles in the Rapid Reviews: COVID-19 database in the areas of biology, medicine, and public health with a timeframe that overlapped the papers from the NIH iSearch COVID-19 Portfolio. A positive relationship between overlay peer review scores and publication probability would support the hypothesis that the quality of published preprints differs meaningfully from that of those that remain unpublished. We did not observe such a direct relationship (figure 2, appendix p 7). However, public health research is more likely to be published than biology research (p=0·02). Preprint age (p=0·58) and peer-review scores (p=0·88) were not statistically significant independent variables, suggesting that time since posting a preprint online is not rate-limiting in this sample (an average of 417 days had elapsed from the original preprint posting). These analyses do not support the hypothesis that article quality scores are significantly associated with subsequent publication in a peer-reviewed journal (figure 2; appendix p 7).
Another possible effect of peer review is that it could improve evidence not by changing the central tendency, but by reducing uncertainty of estimates. This process could be done by modifying the data or experimental or analytical procedures in a way that reduces CIs. Therefore, we examined the change in point estimates' corresponding CIs (n=495). Similar to changes in point estimates, the amount of change in the CI ranges was small: 7·4% (geometric mean). Unlike point estimates, CIs showed a systematic tendency to decrease between preprint and published versions of articles, indicating that CI ranges reduced during peer review (Wilcoxon signed rank test; p<0·001; figure 2). Typically, this reduction was a result of increased sample size after review; authors added estimates (appendix p 3) and added data to their existing estimates during review. Based on these results, measurements of uncertainty, such as CI ranges, can be expected to decrease slightly during the peer-review process.
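The geometric-mean summary of CI change can be sketched as follows. This is a minimal illustration under stated assumptions: the source reports a geometric mean but does not spell out the formula, so the sketch assumes the quantity averaged is the ratio of published to preprint CI width; the function names and data are hypothetical.

```python
import math

def ci_width(lower, upper):
    """Range of a confidence interval."""
    return upper - lower

def geometric_mean_width_ratio(preprint_cis, published_cis):
    """Geometric mean of published/preprint CI width across matched estimates.

    Each argument is a list of (lower, upper) tuples in matching order.
    A result below 1 means intervals narrowed on average.
    """
    logs = [
        math.log(ci_width(*pub) / ci_width(*pre))
        for pre, pub in zip(preprint_cis, published_cis)
    ]
    return math.exp(sum(logs) / len(logs))

# Hypothetical matched intervals, for illustration only.
preprint_cis = [(0.010, 0.030), (2.1, 3.4), (0.40, 0.62)]
published_cis = [(0.012, 0.029), (2.2, 3.3), (0.41, 0.62)]
ratio = geometric_mean_width_ratio(preprint_cis, published_cis)
print(f"average change in CI width: {100 * (ratio - 1):.1f}%")
```

Averaging on the log scale keeps a halving and a doubling of interval width symmetric, so a few intervals that widen sharply do not dominate the summary.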

Discussion
The results of this study augment a small but growing literature on the reliability of data in preprints and changes to research articles caused by peer review, especially regarding COVID-19. Almost a quarter of COVID-19-related research is hosted on preprint servers, and these preprints have been broadly disseminated by members of the scientific community and the public. 23 Metascience studies suggest that the discrepancy between preprints and peer-reviewed articles is small and the quality of reporting is within comparable range, supporting the validity of communicating research findings in preprints before review. 24,25 The results of our study and others [24][25][26][27][28][29][30] suggest that the reliability of data reported in preprints is generally high. Although there are measurable effects on research articles after peer review, such as the observed reduction in CIs, effect sizes are small. The amount of change to articles during peer review is small and expert opinions of article quality are not significantly different for preprints that are published versus preprints that are not published (figure 2). Overall, articles submitted to   31 This study is not an experimental trial of the marginal effects of peer review on primary data and should not be interpreted as such. Instead, we quantify the amount of expected change from the time preprints are posted online until the time published versions are available. Our findings support the use of preprints in decision making as a component of biomedical research literature, and could help inform the use of preprints during the ongoing COVID-19 pandemic and future disease outbreaks. Future research should test the generalisability of these findings to other areas of research and to other time periods.

Contributors
BIH and HY conceptualised this Health Policy, did the formal analysis, used the software, and reviewed and edited the final draft. Data curation was done by LN, AS, SL, and SA. BIH supervised the process, and LN and BIH did all administration. The original draft was written by LN, HY, AS, SL, SA, and BIH.

Declaration of interests
BIH led the development of the National Institutes of Health iSearch COVID-19 Portfolio, which was used for data collection in this Health Policy paper. Support was provided by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison, with funding from the Wisconsin Alumni Research Foundation. All other authors declare no competing interests.