Releasing a preprint is associated with more attention and citations for the peer-reviewed article

Preprints in biology are gaining popularity, but release of a preprint still precedes only a fraction of peer-reviewed publications. We examined whether having a preprint on bioRxiv was associated with metrics of the corresponding peer-reviewed article. We assembled a dataset of 74,239 articles, 5,405 of which had a preprint, published in 39 journals. Based on log-linear regression and random-effects meta-analysis, articles with a preprint had a 51% higher Altmetric Attention Score and 37% more citations compared to articles without one. These associations were independent of several other article- and author-level variables (e.g., scientific subfield and last author publication age) and unrelated to journal-level variables such as access model and Impact Factor. This observational study can help researchers and publishers make informed decisions about how to incorporate preprints into their work.


Introduction
Preprints offer a way to freely disseminate research findings while a manuscript is being peer reviewed (Berg et al., 2016). Although releasing a preprint in disciplines such as physics and computer science (primarily via arXiv.org) is standard practice (Ginsparg, 2011), preprints in the life sciences are just starting to catch on ("PrePubMed: Monthly Statistics for December 2018," n.d.). Progress has been spurred by ASAPbio ("ASAPbio: Accelerating Science and Publication in biology," n.d.), bioRxiv.org (now the largest repository of biology preprints), and others. However, some researchers in the life sciences remain reluctant to release their work as preprints, partly for fear of being scooped, as preprints are not universally considered a marker of priority (Bourne et al., 2017). Furthermore, some journals explicitly or implicitly refuse to accept manuscripts released as preprints (Reichmann et al., 2019), perhaps partly for fear of publishing articles not seen as novel or newsworthy. Currently, the number of preprints released each month in the life sciences is only a fraction of the number of peer-reviewed articles published (Abdill and Blekhman, 2019).
In particular, how does releasing a preprint relate to the outcomes (insofar as they can be measured) of the peer-reviewed article? Previous work found that papers posted on arXiv before acceptance at a computer science conference received more citations in the following year than papers posted after acceptance (Feldman et al., 2018). Another study found that articles with preprints on bioRxiv had higher Altmetric Attention Scores and more citations than those without, but the study was based on only 776 peer-reviewed articles with preprints (commensurate with the size of bioRxiv at the time) and did not examine differences between journals (Serghiou and Ioannidis, 2018). We sought to build on these efforts by leveraging the rapid growth of bioRxiv. Independently from our work, a comprehensive recent study currently on bioRxiv replicated the findings of Serghiou and Ioannidis, but did not quantify journal-specific effects or account for differences between scientific fields (Fraser et al., 2019).

Collecting the data
Data came from four primary sources: PubMed, Altmetric, CrossRef, and Rxivist. We obtained data for peer-reviewed articles from PubMed using NCBI's E-utilities API via the rentrez R package (Winter, 2017). We obtained Altmetric Attention Scores using the Altmetric Details Page API via the rAltmetric R package. The Altmetric Attention Score ("Attention Score" in the rest of the manuscript) is an aggregate measure of mentions from various sources, including social media, mainstream media, and policy documents ("Our sources," 2015). We obtained numbers of citations using the CrossRef API (specifically, we used "is-referenced-by-count"). We obtained links between bioRxiv preprints and peer-reviewed articles using the CrossRef API via the rcrossref R package. We verified and supplemented the links from CrossRef using Rxivist (Abdill and Blekhman, 2019) via the Postgres database in the public Docker image (https://hub.docker.com/r/blekhmanlab/rxivist_data). We merged data from the various sources using the Digital Object Identifier (DOI) and PubMed ID of the peer-reviewed article.
Inclusion criteria for peer-reviewed articles:
• Had a publication type in PubMed of Journal Article and not Review, Published Erratum, Comment, Lecture, Personal Narrative, Retracted Publication, Retraction of Publication, Biography, Portrait, Autobiography, Expression of Concern, Address, or Introductory Journal Article. This filtered for original research articles.
• Had at least one author. A number of editorials met all of the above criteria, but lacked any authors.
• Had an abstract of sufficient length. A number of commentaries and news articles met all of the above criteria, but either lacked an abstract or had an anomalously short one. We manually inspected articles with short abstracts to determine a cutoff for each journal.
• Had at least one Medical Subject Headings (MeSH) term. Although not all articles from all journals had MeSH terms (which are added by PubMed curators), this requirement allowed us to adjust for scientific subfield within a journal using principal components of MeSH terms.
Inclusion criteria for bioRxiv preprints:
• Indexed in CrossRef or Rxivist as linked to a peer-reviewed article in our dataset.
• Released prior to publication of the corresponding peer-reviewed article.
Inclusion criteria for journals:
• Had at least 50 peer-reviewed articles in our dataset previously released as preprints. Since we stratified our analysis by journal, this requirement ensured a sufficient number of peer-reviewed articles to reliably estimate each journal's model coefficients and confidence intervals (Austin and Steyerberg, 2015).
• We excluded the multidisciplinary journals Nature, Nature Communications, PLoS One, PNAS, Royal Society Open Science, Science, Science Advances, and Scientific Reports, since some articles published by these journals would likely not be released on bioRxiv, which could have confounded the analysis.
We obtained all data on September 28, 2019, thus all predictions of Attention Score and citations are for this date. Preprints and peer-reviewed articles have distinct DOIs, and accumulate Attention Scores and citations independently of each other. We manually inspected 100 randomly selected articles from the final set, and found that all 100 were original research articles. For those 100 articles, the Spearman correlation between number of citations from CrossRef and number of citations from Web of Science Core Collection was 0.98, with a mean difference of 2.5 (CrossRef typically being higher).

Inferring author-related variables
Institutional affiliation in PubMed is a free-text field, but is typically a series of comma-separated values with the country near the end. To identify the corresponding country of each affiliation, we used a series of heuristic regular expressions (Table S1 shows the number of affiliations for each identified country). Each author of a given article can have zero or more affiliations. For many articles, especially less recent ones, only the first author has any affiliations listed in PubMed, even though those affiliations actually apply to all the article's authors (as verified by the version on the journal's website). Therefore, the regression modeling used a binary variable for each article corresponding to whether any author had any affiliation in the U.S.
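The article-level U.S. affiliation flag can be illustrated with a minimal Python sketch. It is not the authors' implementation: the pattern `US_PATTERN` and the function `has_us_affiliation` are hypothetical stand-ins for the heuristic regular expressions summarized in Table S1, and the choice to inspect only the last two comma-separated fields is an assumption based on the country typically appearing near the end.

```python
import re

# Hypothetical pattern; the actual regular expressions used in the study
# are not given in the text.
US_PATTERN = re.compile(r"\b(?:USA|U\.S\.A|United States)\b")


def has_us_affiliation(affiliations):
    """Return True if any free-text affiliation appears to be in the U.S.

    Only the last two comma-separated fields are inspected, because the
    country is typically near the end of the affiliation string.
    """
    for affiliation in affiliations:
        tail = ",".join(affiliation.split(",")[-2:])
        if US_PATTERN.search(tail):
            return True
    return False
```

Passing the pooled affiliations of all authors of an article yields the binary variable used in the regression; an article with no listed affiliations is coded as False under this sketch.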
Author disambiguation is challenging, and unique identifiers are currently sparse in PubMed and bioRxiv. We developed an approach to infer an author's previous publications in PubMed based on that person's name and affiliations. We applied our approach to the last author of each article in our dataset. We limited the search to last authors in order to limit computation time.
The primary components of an author's name in PubMed are last name, fore name (which often includes middle initials), and initials (which do not include last name). Fore names are present in PubMed mostly from 2002 onward. For each article in our dataset (each target publication), our approach went as follows:
1. Get the last author's affiliations for the target publication. If the last author had no direct affiliations, get the affiliations of the first author. These are the target affiliations.
2. Find all publications between January 1, 2002 and December 31, 2018 in which the last author had a matching last name and fore name. We limited the search to last-author publications to approximate publications as principal investigator and to limit computation time. These are the query publications.
3. For each query publication, get that author's affiliations. If the author had no direct affiliations, get the affiliations of the first author. These are the query affiliations.
4. Clean the raw text of all target and query affiliations (make all characters lowercase and remove non-alphanumeric characters, among other things).
5. Calculate the similarity between each pair of target affiliation and query affiliation. Similarity was a weighted sum of the shared terms between the two affiliations. Term weights were calculated using the quanteda R package (Benoit et al., 2018) and based on inverse document frequency, i.e., log10(1 / frequency), from all affiliations from all target publications in our dataset. Highly common (frequency > 0.05), highly rare (frequency < 10^-4), and single-character terms were given no weight.
6. Find the earliest query publication for which the similarity between a target affiliation and a query affiliation is at least 4. This cutoff was manually tuned.
7. If the earliest query publication is within two years of when PubMed started including fore names, repeat the procedure using last name and initials instead of last name and fore name.
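The cleaning and similarity steps (4 and 5) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R/quanteda code: the function names, the whitespace tokenizer, and the treatment of each affiliation as a set of distinct terms are all hypothetical choices.

```python
import math
import re
from collections import Counter


def tokenize(affiliation):
    """Lowercase, replace non-alphanumeric characters with spaces, and
    return the set of distinct terms (a simplified version of step 4)."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", affiliation.lower())
    return set(cleaned.split())


def idf_weights(affiliations, max_freq=0.05, min_freq=1e-4):
    """Weight each term by log10(1 / frequency), where frequency is the
    fraction of affiliations containing the term. Highly common, highly
    rare, and single-character terms get zero weight."""
    n_docs = len(affiliations)
    doc_counts = Counter(t for aff in affiliations for t in tokenize(aff))
    weights = {}
    for term, count in doc_counts.items():
        freq = count / n_docs
        if len(term) == 1 or freq > max_freq or freq < min_freq:
            weights[term] = 0.0
        else:
            weights[term] = math.log10(1 / freq)
    return weights


def similarity(target, query, weights):
    """Sum the weights of the terms shared by two affiliations (step 5).
    Terms absent from the weight table contribute nothing."""
    shared = tokenize(target) & tokenize(query)
    return sum(weights.get(term, 0.0) for term in shared)
```

Under this sketch, step 6 amounts to keeping the earliest query publication for which `similarity(...)` between any target affiliation and any query affiliation is at least 4.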
For a randomly selected subset of 50 articles (none of which had been used to manually tune the similarity cutoff), we searched PubMed and authors' websites to manually identify each last author's first last-author publication. The Spearman correlation between manually identified and automatically identified dates was 0.88, the mean error was 1.74 years (meaning our automated approach sometimes missed the earliest publication), and the mean absolute error was 1.81 years (Fig. S1). The most common reason for error was that the author had changed institutions (Table S2).

Calculating principal components of MeSH term assignments
Medical Subject Headings (MeSH) are a controlled vocabulary used to index PubMed and other biomedical databases ("Medical Subject Headings," 1999). For each journal, we generated a binary matrix of MeSH term assignments for the peer-reviewed articles (1 if a given term was assigned to a given article, and 0 otherwise). We only included MeSH terms assigned to at least 5% of articles in a given journal, and excluded the terms "Female" and "Male" (which referred to the biological sex of the study animals and were not related to the article's field of research). We calculated the principal components (PCs) using the prcomp function in the R stats package and scaling the assignments for each term to have unit variance. We calculated the percentage of variance in MeSH term assignment explained by each PC as that PC's eigenvalue divided by the sum of all eigenvalues.
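The PC calculation can be sketched in Python with NumPy, as an analogue of (not a substitute for) the prcomp call described above. The function name `mesh_principal_components` is hypothetical, the exclusion of the "Female" and "Male" terms is assumed to have happened upstream, and dropping constant columns is an added safeguard not mentioned in the text.

```python
import numpy as np


def mesh_principal_components(assignments, min_fraction=0.05):
    """PCA of a binary article-by-term matrix with unit-variance scaling.

    assignments: 2-D array-like (articles x MeSH terms) of 0/1 values.
    Returns (PC scores, fraction of variance explained by each PC).
    """
    x = np.asarray(assignments, dtype=float)
    # Keep terms assigned to at least `min_fraction` of articles; also drop
    # constant columns, which cannot be scaled to unit variance.
    keep = (x.mean(axis=0) >= min_fraction) & (x.std(axis=0) > 0)
    x = x[:, keep]
    # Center each term and scale by its sample standard deviation.
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # SVD of the scaled matrix: the columns of u * s are the PC scores.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    eigenvalues = s**2 / (x.shape[0] - 1)
    return u * s, eigenvalues / eigenvalues.sum()
```

The second return value corresponds to each PC's eigenvalue divided by the sum of all eigenvalues, matching the definition of variance explained given above.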

Quantifying the associations
Attention Scores are real numbers ≥ 0, whereas citations are integers ≥ 0. Therefore, for each journal, we fit two types of regression models for Attention Score and three for citations:
• Log-linear regression, in which the dependent variable was log2(Attention Score + 1) or log2(citations + 1).
• Gamma regression with a log link, in which the dependent variable was "Attention Score + 1" or "citations + 1". The response variable for Gamma regression must be > 0.
• Negative binomial regression, in which the dependent variable was citations. The response variable for negative binomial regression must be integers ≥ 0.
Each model had the following independent variables for each peer-reviewed article:
• Preprint status, encoded as 1 for articles preceded by a preprint and 0 otherwise.
• Publication date (equivalent to time since publication), encoded using a natural cubic spline with three degrees of freedom. The spline provides flexibility to fit the non-linear relationship between citations (or Attention Score) and publication date. Source: PubMed.
• Number of authors, log-transformed because it was strongly right-skewed. Source: PubMed.
• Number of references, log-transformed because it was strongly right-skewed. Sources: PubMed and CrossRef. For some articles, either PubMed or CrossRef lacked complete information on the number of references. For each article, we used the maximum between the two.
• U.S. affiliation status, encoded as 1 for articles for which any author had a U.S. affiliation and 0 otherwise. Source: inferred from PubMed as described above.
• Last author publication age, encoded as the amount of time in years by which publication of the peer-reviewed article was preceded by publication of the last author's first last-author publication. Source: inferred from PubMed as described above.
• Top 15 PCs of MeSH term assignments (or all PCs, if there were fewer than 15). Source: calculated from PubMed as described above.

We evaluated goodness-of-fit of each regression model using mean absolute error and mean absolute percentage error. To fairly compare the different model types, we converted each prediction to the original scale of the respective metric prior to calculating the error.
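The back-transformation and the two error measures can be sketched as below, for the log-linear case. The function names are hypothetical, and the `offset` used in the percentage error is an assumption: the text does not say how articles with a value of zero were handled, so this sketch divides by the observed value plus one.

```python
def back_transform(log2_prediction):
    """Invert log2(y + 1) to get a prediction on the original scale."""
    return 2.0 ** log2_prediction - 1.0


def mean_absolute_error(observed, predicted):
    """Average absolute difference on the original scale."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mean_absolute_percentage_error(observed, predicted, offset=1.0):
    """Average absolute error relative to (observed + offset), in percent.

    The offset avoids division by zero for articles with no citations;
    it is an assumption of this sketch, not a documented choice.
    """
    terms = [abs(o - p) / (o + offset) for o, p in zip(observed, predicted)]
    return 100.0 * sum(terms) / len(terms)
```

Predictions from the Gamma and negative binomial models would need their own inverse transformations before being passed to the same error functions.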
As a secondary analysis, we added to the log-linear regression model a variable corresponding to the number of days by which release of the preprint preceded publication of the peer-reviewed article (using 0 for articles without a preprint), using preprint release dates from CrossRef and Rxivist and publication dates from PubMed.
We extracted coefficients and their 95% confidence intervals from each log-linear regression model. Because preprint status is binary, its model coefficient corresponded to a log2 fold-change. We used each regression model to calculate predicted Attention Score and number of citations, along with corresponding 95% confidence intervals and 95% prediction intervals, given certain values of the variables in the model. For simplicity in the rest of the manuscript, we refer to exponentiated model coefficients as fold-changes of Attention Score and citations, even though they are actually fold-changes of "Attention Score + 1" and "citations + 1".
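As a worked illustration of the relationship between a coefficient and a fold-change, the short Python sketch below converts a coefficient on the log2 scale to a fold-change and a percentage difference. The function names are hypothetical; the example values are the pooled fold-changes reported in the Results (1.51 and 1.37), converted back to coefficients purely for illustration.

```python
import math


def fold_change(log2_coefficient):
    """Exponentiate a log2-scale coefficient to obtain a fold-change."""
    return 2.0 ** log2_coefficient


def percent_change(log2_coefficient):
    """Express the fold-change as a percentage difference from 1."""
    return (fold_change(log2_coefficient) - 1.0) * 100.0


# Reported pooled fold-changes, mapped to the log2 scale and back.
for label, reported in [("Attention Score", 1.51), ("citations", 1.37)]:
    coefficient = math.log2(reported)
    print(
        f"{label}: coefficient {coefficient:.3f}, "
        f"fold-change {fold_change(coefficient):.2f}, "
        f"{percent_change(coefficient):.0f}% higher"
    )
```

A coefficient of 0 gives a fold-change of 1, i.e., no association, and a coefficient of 1 corresponds to a doubling of "metric + 1".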
We performed each random-effects meta-analysis based on the Hartung-Knapp-Sidik-Jonkman method (IntHout et al., 2014) using the metagen function of the meta R package (Schwarzer et al., 2015). We performed meta-regression by fitting a linear regression model in which the dependent variable was the journal's coefficient for preprint status (from either Attention Score or citations) and the independent variables were the journal's access model (encoded as 0 for "closed or hybrid" and 1 for "immediately open"), log2(Impact Factor), and log2(percentage of articles released as preprints).
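The pooling step can be sketched in Python as below. This is not the metagen implementation: it assumes the DerSimonian-Laird estimator for the between-journal variance (the text does not state which estimator was used) and applies the Hartung-Knapp variance adjustment, with the confidence interval based on a t distribution with k - 1 degrees of freedom. The function names are hypothetical, and the caller supplies the t critical value so that the sketch needs only the standard library.

```python
import math


def random_effects_meta(estimates, std_errors):
    """Pool per-journal coefficients across k journals.

    Between-journal variance (tau2) uses the DerSimonian-Laird estimator,
    an assumption of this sketch; the standard error uses the
    Hartung-Knapp adjustment.
    """
    k = len(estimates)
    w = [1.0 / se**2 for se in std_errors]
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Random-effects weights and pooled estimate.
    w_star = [1.0 / (se**2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    # Hartung-Knapp variance of the pooled estimate.
    resid = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w_star, estimates))
    hk_var = resid / ((k - 1) * sum(w_star))
    return {"estimate": pooled, "std_error": math.sqrt(hk_var),
            "tau2": tau2, "df": k - 1}


def confidence_interval(result, t_critical):
    """Interval as estimate +/- t_critical * std_error, where t_critical is
    the t quantile for result["df"] degrees of freedom."""
    half_width = t_critical * result["std_error"]
    return result["estimate"] - half_width, result["estimate"] + half_width
```

Because the coefficients are on the log2 scale, the pooled estimate and its interval bounds would be exponentiated (base 2) to report fold-changes.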

Results
We first assembled a dataset of peer-reviewed articles indexed in PubMed, including each article's Altmetric Attention Score and number of citations and whether it had a preprint on bioRxiv. Because we sought to perform an analysis stratified by journal, we only included articles from journals that had published at least 50 articles that had a preprint on bioRxiv. Overall, our dataset included 74,239 articles, 5,405 of which had a preprint, published in 39 journals between January 1, 2015 and December 31, 2018 (Fig. 1 and Table S3). Release of the preprint preceded publication of the peer-reviewed article by a median of 174 days (Fig. S2).
Across journals and often within a journal, Attention Score and citations varied by orders of magnitude between articles (Fig. S3 and Fig. S4). Older articles within a given journal tended to have more citations, whereas older and newer articles tended to have similar distributions of Attention Score. In addition, Attention Score and citations within a given journal were weakly correlated with each other (median Spearman correlation 0.18, Table S4). These findings suggest that the two metrics capture different aspects of an article's impact.

We next used regression modeling to quantify the associations of an article's Attention Score and citations with whether the article had a preprint. To reduce the possibility of confounding (Falagas et al., 2013; Fox et al., 2016), each regression model included terms for an article's preprint status, publication date, number of authors, number of references, whether any author had an affiliation in the U.S., the last author's publication age, and the article's approximate scientific subfield within the journal (Table S5). We inferred last author publication ages using names and affiliations in PubMed (see Methods for details). We approximated scientific subfield as the top 15 principal components (PCs) of Medical Subject Heading (MeSH) term assignments (Fig. S5, Fig. S6, and Table S6), analogously to how genome-wide association studies use PCs to adjust for population stratification (Price et al., 2006).
For each journal and each of the two metrics, we fit multiple regression models. For Attention Scores, which are real numbers, we fit log-linear and Gamma models. For citations, which are integers, we fit log-linear, Gamma, and negative binomial models. Log-linear regression consistently gave the lowest mean absolute error and mean absolute percentage error (Fig. S7 and Table S7), so we used only log-linear regression for all subsequent analyses (Table S8).
We used the regression fits to calculate predicted Attention Scores and citations for hypothetical articles with and without a preprint in each journal, holding all other variables fixed (Fig. 1 and Fig. S8). We also examined the exponentiated model coefficients for having a preprint (equivalent to fold-changes), which allowed comparison of relative effect sizes between journals (Fig. 2). Both approaches indicated higher Attention Scores and more citations for articles with preprints. Similar to Attention Scores and citations themselves, fold-changes of the two metrics were weakly correlated with each other (Spearman correlation 0.19).
To quantify the overall evidence for each variable's association with Attention Score and citations, we performed a random-effects meta-analysis of the respective model coefficients (Table 1 and Table S9). Based on the meta-analysis, an article's Attention Score and citations were positively associated with its preprint status, number of authors, number of references, and U.S. affiliation status, and slightly negatively associated with its last author publication age.
In particular, having a preprint was associated with a 1.51 times higher Attention Score (95% CI 1.43 to 1.59) and 1.37 times more citations (95% CI 1.31 to 1.43) of the peer-reviewed article. In a separate meta-analysis, the amount of time between release of the preprint and publication of the article was positively associated with the article's Attention Score, but not its citations (Table S10 and Table S11). Taken together, these results suggest that having a preprint is associated with a higher Attention Score and more citations independently of other article-related variables.
We did not perform a random-effects meta-analysis of the coefficients for the MeSH term PCs, because the MeSH terms underlying a given PC varied from one journal to another. However, within each journal, typically several PCs had p-value ≤ 0.05 for association with Attention Score or citations (Fig. S9). In addition, if we excluded the MeSH term PCs from the regression, the fold-changes for having a preprint increased modestly (Fig. S10 and Table S12). These results suggest that the MeSH term PCs capture meaningful variation in scientific subfield between articles in a given journal.
Finally, using meta-regression, we found that the log fold-changes of the two metrics were not associated with the journal's access model, Impact Factor, or percentage of articles with preprints (Table 2 and Table S13). This result suggests that these journal-level characteristics do not explain journal-to-journal variation in the differences in Attention Score and citations between articles with and without a preprint.

Discussion
The decision of when and where to disclose the products of one's research is influenced by multiple factors. Here we find that having a preprint on bioRxiv is associated with a higher Altmetric Attention Score and more citations of the peer-reviewed article. The associations appear independent of several other article- and author-level variables and unrelated to journal-level variables such as access model and Impact Factor.
The advantage of stratifying by journal as we did here is that it accounts for the journal-specific factors (both known and unknown) that affect an article's Attention Score and citations. The disadvantage is that our results only apply to journals that have published at least 50 articles that have a preprint on bioRxiv. In fact, our preprint counts may be an underestimate, since some preprints on bioRxiv have been published as peer-reviewed articles, but not yet detected as such by bioRxiv's internal system (Abdill and Blekhman, 2019). Furthermore, the associations we observe may not apply to preprints on other repositories such as arXiv Quantitative Biology and PeerJ Preprints.
We used the Altmetric Attention Score and number of citations on CrossRef because, unlike other article-level metrics such as number of views, both are publicly and programmatically available for any article with a DOI. However, both metrics are only crude proxies for an article's true scientific impact, which is difficult to quantify and can take years or decades to assess.
For multiple reasons, our analysis does not indicate whether the associations between preprints, Attention Scores, and citations have changed over time. First, historical citation counts are not currently available from CrossRef, so our data included each article's citations at only one moment in time. Second, most journals had a relatively small number of articles with preprints, so we did not model a statistical interaction between publication date and preprint status, and we largely ignored characteristics of the preprints themselves. In any case, the associations we observe may change as the culture of preprints in the life sciences evolves.
Grouping scientific articles by their research area(s) is an ongoing challenge (Piwowar et al., 2018; Waltman and van Eck, 2012). Although the principal components of MeSH term assignments are only a simple approximation, they do explain some variation in Attention Score and citations between articles in a given journal. Thus, our approach to estimating scientific subfield may be useful in other analyses of the biomedical literature.
Our heuristic approach to infer authors' publication histories from their names and free-text affiliations in PubMed was accurate, but not perfect. The heuristic was necessary because unique author identifiers such as ORCID iDs currently have sparse coverage of the published literature. This may change with a recent requirement from multiple U.S. funding agencies ("NOT-OD-19-109: Requirement for ORCID iDs for Individuals Supported by Research Training, Fellowship, Research Education, and Career Development Awards Beginning in FY 2020," n.d.), which would enhance future analyses of scientific publishing.
Because our data are observational, we cannot conclude that releasing a preprint is causal for a higher Attention Score and more citations of the peer-reviewed article. Even accounting for all the other factors we modeled, having a preprint on bioRxiv could be merely a marker for research likely to receive more attention and citations anyway. For example, perhaps authors who release their work as preprints are more active on social media, which could partly explain the association with Attention Score, although it would likely not explain the association with citations. If there is a causal role for preprints, it may be related to increased visibility that leads to "preferential attachment" (Wang et al., 2013) while the manuscript is in peer review. These scenarios need not be mutually exclusive, and without a randomized trial they are extremely difficult to distinguish.
Altogether, our findings contribute to the growing observational evidence of the effects of preprints in biology (Fraser et al., 2019), and have implications for preprints in chemistry and medicine (Kiessling et al., 2016; Rawlinson and Bloom, 2019). Consequently, our study may help researchers and publishers make informed decisions about how to incorporate preprints into their work.

Figures and Tables
Table 1 Random-effects meta-analysis (across journals) of model coefficients from log-linear regression. A positive coefficient means that an article's Attention Score or number of citations increases as that variable increases (or if the article had a preprint or had any author with a U.S. affiliation). However, coefficients for different variables have different units and are not directly comparable. P-values were adjusted using the Bonferroni-Holm procedure, based on having fit two models for each journal. Meta-analysis statistics for the intercept and publication date are shown in Table S9.
Had any author with U.S. affiliation    0.100   1.19e-02    0.076    0.124   3.30e-10   6.59e-10
Last author publication age (yrs)      -0.002   6.17e-04   -0.004   -0.001   6.49e-04   6.49e-04
Figure 1 Absolute effect size of having a preprint, by metric and journal. Each point indicates the predicted mean of that metric for a hypothetical article with or without a preprint, assuming the hypothetical article was published three years ago and had the mean value (i.e., zero) of each of the top 15 MeSH term PCs and the median value (for articles in that journal) of number of authors, number of references, U.S. affiliation status, and last author publication age. Error bars indicate 95% confidence intervals. Journal names correspond to PubMed abbreviations. Journals are ordered by mean predicted mean Attention Score and citations.
Figure 2 Relative effect size of having a preprint, by metric and journal. Fold-change corresponds to the exponentiated coefficient from log-linear regression, where fold-change > 1 indicates higher Attention Score or number of citations for articles that had a preprint. A fold-change of 1 corresponds to no association. Error bars indicate 95% confidence intervals. Journals are ordered by mean log fold-change. Bottom row shows estimates from random-effects meta-analysis (also shown in Table 1).