Code sharing in ecology and evolution increases citation rates but remains uncommon

Abstract Biologists increasingly rely on computer code to collect and analyze their data, reinforcing the importance of published code for transparency, reproducibility, training, and as a basis for further work. Here, we conduct a literature review estimating temporal trends in code sharing in ecology and evolution publications since 2010, and test for an influence of code sharing on citation rate. We find that code is rarely published (only 6% of papers), with little improvement over time. We also find that there may be incentives to publish code: publications that share code have tended to be low-impact initially, but accumulate citations faster, compensating for this deficit. Studies that additionally meet other open-science criteria (open-access publication or data sharing) have still higher citation rates, with publications meeting all three criteria (code sharing, data sharing, and open-access publication) tending to have the most citations and the highest rate of citation accumulation.

reproducibility of analyses, but it also may enhance the impact of publications (e.g., greater uptake of methods, more citations) and reduce duplicated efforts, allowing science to progress more effectively (McNutt, 2014; Munafò et al., 2017; Nosek et al., 2015; "Reality check on reproducibility," 2016). Furthermore, well-documented code facilitates the peer-review process, provides a valuable educational resource (Busjahn & Schulte, 2013), and facilitates our ability to credit developers, as software and package-usage data can be harvested directly from published code (Merow, Boyle, et al., 2023).
Has the increasing appreciation of code sharing influenced code sharing practices over time? Recent evidence suggests that biologists may be reluctant to share code. A study focused on publications in ecology journals with policies that mandated or encouraged code sharing found that 73% failed to share code (Culina et al., 2020), while a study focused on publications using agent-based models found that 81% did not provide code (Barton et al., 2022). PLOS open-science indicators likewise suggest that code sharing is rare, with 92% of publications in Agricultural and Biological Sciences failing to share code (in comparison, 49% fail to share data; Public Library of Science, 2023). While some papers include the statement "code available upon request," this promise is often not met (Stodden et al., 2018). Where published, code may also not be reusable due to licensing issues (Stodden, 2009).
Resistance to code sharing and reuse may arise from unfamiliarity with best sharing practices, insecurity about code quality, fears of misuse or unsolicited appropriation of ideas, and excess preparation costs (Cadwallader & Hrynaszkiewicz, 2022; Gomes et al., 2022). However, it has been argued that many perceived issues with code sharing stem from misunderstandings of its risks and benefits (Gomes et al., 2022). To better understand how code sharing practices have changed over time and whether code sharing actually improves citation rates, we (1) estimated trends in R code sharing for articles in ecology and evolution published between 2010 and 2022 and (2) tested whether the citation rate was higher for papers that shared code. We focus on R because it has become the dominant coding language in ecology and evolution (Lai et al., 2019).

| List of ecology and evolution publications citing R
To generate a list of papers in ecology and evolution that likely made use of the R programming language (R Core Team, 2023), we performed a query on the Scopus database (https://www.scopus.com) using the rscopus R package (Muschelli, 2019). We searched Scopus (performed August 19, 2022) for peer-reviewed journal articles that: (1) included the words "ecology" or "evolution" in an "all fields" search (which searches article titles, keywords, abstracts, and journal titles); (2) were published in journals within the subject area "agriculture and biological sciences"; (3) were published after January 1, 2010; (4) were written in English (as this is currently the dominant language of publication in ecology and evolution; Mauranen et al., 2010); and (5) included a citation of R in their reference list.
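A search along these lines can be sketched with the rscopus package. The query string below is illustrative only (the authors' exact field codes and filters are not given in full), and a valid Elsevier API key is required:

```r
# Sketch of a Scopus query via rscopus; the query string is an
# approximation of the criteria described above, not the exact search.
library(rscopus)

set_api_key("YOUR_SCOPUS_API_KEY")  # hypothetical placeholder key

query <- paste(
  'ALL("ecology" OR "evolution")',  # all-fields search
  'AND SUBJAREA(AGRI)',             # agriculture & biological sciences
  'AND PUBYEAR > 2009',             # published after January 1, 2010
  'AND LANGUAGE(english)'
)

res <- scopus_search(query = query, count = 25, max_count = 200)
df  <- gen_entries_to_df(res$entries)$df  # one row per article
```

Criterion (5), citing R in the reference list, would still need to be checked against each article's references, since reference-list contents are not part of the basic search fields.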

| Checking for code and data availability
We manually evaluated a randomly chosen subset of the publications on our overall list. We selected a total of 1001 papers, evenly distributed across the time period (77 per year × 13 years).
Papers that cited R but did not use it (or were unclear on whether they used it; n = 3) were discarded and replaced by a randomly selected paper from the same year. For each publication in this subset, we manually identified whether the publication shared any R code, either as supplementary information or via a link (e.g., to a Github repository). For each paper, we (i) checked for the presence of code in supplemental material, (ii) skimmed publications for code and data availability statements, (iii) searched through publications for terms associated with code (i.e., "code," "supplement," "appendix," "R," "script," "Github"), and (iv) searched publications for URLs. Papers were scored with a binary variable indicating whether they shared R code or not. We did not distinguish between publications that shared sufficient code for reproduction and those that did not. We also did not attempt to rerun the code or assess its reproducibility, and only recorded the presence of any code, even if it was incomplete. Where code was included, we recorded the license the code was provided under, or the lack thereof. We also assessed whether publications were open access and whether they shared open data in order to understand the importance of open code relative to these other open-science components. Open-access information was provided by the rscopus R package (Muschelli, 2019). Open data was scored as a binary variable indicating whether the authors shared the full set of raw data underlying the analyses or not. To control for differences in citation rates among journals, we downloaded impact factor information using the scholar R package (Keirstead, 2016) on June 16, 2023. To estimate the proportion of publications which use R but do not properly cite it, we screened 130 randomly selected publications evenly distributed across the time period. These publications were selected using criteria identical to those for the publications that did cite R, except that they did not include R in their list of references.
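The keyword and URL screens in steps (iii) and (iv) could in principle be automated along these lines (a sketch only; the screening in this study was performed manually, and the `fulltext` variable and patterns are illustrative):

```r
# Flag whether an article's text mentions code-related terms (step iii)
# and extract any URLs it contains (step iv).
# 'fulltext' is assumed to hold the article text as a single string.
code_terms <- c("\\bcode\\b", "supplement", "appendix",
                "\\bR\\b", "script", "github")

mentions_code <- function(fulltext) {
  any(sapply(code_terms, grepl, x = fulltext, ignore.case = TRUE))
}

extract_urls <- function(fulltext) {
  regmatches(fulltext, gregexpr("https?://[^\\s)\"]+", fulltext))[[1]]
}
```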

| Checking for code citations
Where code was shared in a citable location, such as a DOI or URL (n = 33), we assessed whether the code itself was cited by querying the Scopus database for the URL (and DOI, where appropriate) using the rscopus R package (Muschelli, 2019). Publications where code was shared in appendices or supplementary information (n = 22) were excluded, as there was no way of distinguishing citations of the code from citations of the publication itself.

| Impact of code sharing on citations
We additionally modeled the relationship between code sharing and citation count using generalized linear models in R. We modeled the dependent variable (cumulative number of citations of each article by 2022) using a Poisson distribution, which models the number of independent events occurring within a period of time (Bolker, 2008).
In addition to the predictor variable for code sharing (binary, yes/no), we included other variables that were hypothesized to influence citation count. Data sharing (binary, yes/no) may increase citation counts, as readers may cite papers as data sources (Christensen et al., 2019; Piwowar et al., 2007). Open access (binary, yes/no) may also increase citation counts by reaching a broader set of readers (Tang et al., 2017). Publications accumulate citations over time, and so citation count should increase with publication age (continuous, 1-13 years). Finally, publications in higher-impact journals may be more likely to be read and cited, and hence journal impact factor (continuous, 0-11.633) may be positively associated with citation count. In addition to main effects, we considered two classes of interactions: (1) interactions between publication age and other main effects, which are appropriate if a main effect modifies the rate at which a publication accumulates citations over time; and (2) interactions between code sharing and the other open-science components (data sharing and open access; Table 1). Continuous variables were scaled and centered. Overall model pseudo-R² for the best-performing model was calculated using the function r.squaredGLMM in the rsq package (Zhang, 2018).
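The citation model described above can be sketched as follows. The data frame and column names (`cites`, `code_shared`, `data_shared`, `open_access`, `age`, `impact_factor`) are illustrative stand-ins, not the authors' actual variable names:

```r
# Sketch of the Poisson GLM of citation counts: main effects,
# age interactions, and interactions among open-science components.
# 'dat' is assumed to hold one row per screened paper.
dat$age_s    <- scale(dat$age)            # publication age, scaled/centered
dat$impact_s <- scale(dat$impact_factor)  # impact factor, scaled/centered

fit <- glm(
  cites ~ code_shared + data_shared + open_access + age_s + impact_s +
    age_s:code_shared + age_s:data_shared + age_s:open_access +
    age_s:impact_s +
    code_shared:data_shared + code_shared:open_access,
  family = poisson(link = "log"),
  data   = dat
)
summary(fit)
```

Under a log link, a significant positive age-by-code-sharing interaction corresponds to code-sharing papers accumulating citations at a faster multiplicative rate per year.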

| RESULTS
We identified 28,227 articles that met our search criteria. From this set of articles, we randomly selected 1001 papers (the closest number to 1000 that is evenly divisible by 13) evenly spread across the temporal range (13 years), for a total of 77 papers per year. Overall, R code was available for only 55 of the 1001 papers examined (5.5%; Figure 1). When shared, code was most often in the Supplemental Information (40%), followed by Github (22%), Figshare (11%), or other repositories (37%). The majority of code (67%) did not include a license. Where a license was included, it was nearly always permissive or copyleft (e.g., CC0, CC-BY, GPL, and MIT), with only one publication including a proprietary license. Open-access publications were twice as likely to share code as closed-access publications (8.5% vs. 4.24%; Χ² = 7.2576, p = .008599). Publications with open data were 12 times more likely to share code than closed-data publications (26.5% vs. 2.2%; Χ² = 133.36, p = 9.999e-05). Among the set of publications that did not cite R, 6.2% mentioned using R in the text. Of the 33 papers that shared code via potentially citable DOIs or URLs, we were unable to find any citations of the code itself.

| Code sharing over time
The proportion of publications sharing code has increased significantly (p = .00157) over time (Figure 1, Table 2), with code sharing increasing at an average of 0.6% per year over this period. A Durbin-Watson test indicated no temporal autocorrelation in residuals (DW = 1.7544, p = .6475). We note that the years 2021 and 2022 showed notable shifts toward more frequent sharing (although 2013 showed a similar level of code sharing), but the percentage of code sharing has been consistently below 20% over the past decade, and has remained lower than the percentage of open-access papers or papers sharing data (Figure 1). Over this same time period, the proportion of publications including data also increased significantly (p = 1.48e-06; Figure 1), while the proportion of open-access publications did not change significantly (p = .926; Figure 1).

| DISCUSSION
We found that the scientific literature in ecology and evolution still falls far short of the code sharing required for adequate reproducibility and transparency, despite an increasing trend in code sharing over the last 12 years. This low rate of code sharing undoubtedly hinders scientific progress and likely has far-reaching financial consequences, since a lack of reproducibility means that coding must be continuously redone for common analytical tasks (Freedman et al., 2015). Further, our results indicate that the failure to share code may also reduce the academic impact of scientists, as sharing code leads to a higher rate of citation accumulation (i.e., a significant year-by-code-sharing interaction; p < 2e-16; Table 3). Surprisingly, our model found negative effects of code sharing and data sharing on citation count, despite the positive impacts of code sharing and data sharing on citation rate (i.e., significant interactions between code or data sharing and age). One possible cause of this discrepancy could be that scientists may be less likely to share the code and data underlying publications they expect to be impactful if they are planning related studies using the same code or data. Alternatively, this discrepancy may be due to the increase in code availability over time, which leads to a disproportionate number of the papers that share code being young and hence having few citations. We also did not find support for an interaction between publication age and impact factor, suggesting that though impact factor may affect the total number of citations, it does not strongly affect the citation rate.
We also did not find evidence of a significant interactive effect of publication age and open access on citation count. Given that most shared code lacked a license, scientists, journals, and organizations need to give more thought to software licenses. We note that considering licensing is important both for scientists who wish their code to be freely available and to benefit from citations stemming from reuse, and for scientists who wish to embrace transparency without allowing use of their code.
We note that there are important limitations to our study. The low rates of code sharing and data sharing limited our sample sizes, which in turn likely reduced model precision. These low sample sizes may help explain the anomalously high number of publications sharing code in 2013 (Figure 1). Further, these low rates of sharing led to a major imbalance in our main variable of interest, code sharing (55 papers that shared vs. 946 that did not). We also note that while we treated code and data sharing as binary variables, there is tremendous variation in the amount and quality of the data and code that are shared. Where some publications include only summary data or example code, others include well-documented code and data, and this variation could affect citation count.
Some of the variables we examined may change over time, potentially weakening inferences: archived data and code may be lost, publications may become open access, and impact factors change over time. Our search was limited to English-language publications, so these trends may not hold for publications in other languages (Konno et al., 2020), though such publications may be expected to share code even less often (Serwadda et al., 2018). Further, our work focused on papers that cited R, and does not account for the small percentage of papers which note using R but do not cite it. Thus, our results may overestimate the proportion of total publications around the globe sharing code. Finally, our work focused on one programming language, R, which may not be broadly representative.
To create an environment conducive to reproducibility and transparency, several practical steps are available. Although maintaining well-documented code in a version-controlled public repository (e.g., Github) and a public archive with directions for its use (e.g., Zenodo) is ideal for code sharing, other options that require less effort can at least ensure the distribution of code to other researchers interested in using it. Recent advances in artificial intelligence (e.g., ChatGPT) have made documenting scripts easier, thus lowering the cost to authors of sharing documented code (Merow, Serra-Diaz, et al., 2023). Finally, we stress that as code sharing increases, our attribution practices must keep pace, both for scientific transparency and to credit the developers (Merow, Boyle, et al., 2023).
We tested for a trend in code sharing over time by modeling code sharing (binary, yes/no) as a function of year (relative to 2010) using a generalized linear model. Modeling was performed using the function glm in the stats R package (R Core Team, 2023) with a binomial error distribution. We similarly tested for temporal trends in two other open-science components, open-access publication (binary) and open data (binary). We also tested whether open-access or open-data papers were disproportionately likely to share code using chi-square tests via the chisq.test function in the stats R package (R Core Team, 2023).
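These trend and association tests can be sketched as follows, again with illustrative column names (`code_shared`, `data_shared`, `open_access`, `year`) standing in for the actual dataset:

```r
# Sketch of the temporal-trend models and chi-square tests described
# above. 'dat' is assumed to hold one row per screened paper, with
# binary 0/1 indicators for each open-science component.
dat$year0 <- dat$year - 2010  # year relative to 2010

# Binomial GLMs: probability of sharing as a function of year
trend_code <- glm(code_shared ~ year0, family = binomial, data = dat)
trend_data <- glm(data_shared ~ year0, family = binomial, data = dat)
trend_oa   <- glm(open_access ~ year0, family = binomial, data = dat)
summary(trend_code)

# Chi-square tests: are open-access or open-data papers
# disproportionately likely to share code?
chisq.test(table(dat$open_access, dat$code_shared))
chisq.test(table(dat$data_shared, dat$code_shared))
```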
The lack of an interactive effect of open access and publication age, as well as the lack of change in the percentage of open-access publications over time (Figure 1), may be driven by changes to the open-access status of publications over the years. As journals switch to open access, many papers that were initially published closed access have been converted to open after varying lengths of time, potentially degrading the signal of open access on citation counts.
FIGURE 1 Temporal trends of open-access publications (open publication), open data, and code sharing (open code) between 2010 and 2022. Sample size was 77 papers per year, 1001 total. The lines show trends in each open-science component over time, with dashed lines representing non-significant trends and solid lines indicating significance. Open data (p = 1.48e-06) and code sharing (p = .00157) increased significantly over time, while open-access publication (p = .926) had no statistically significant relationship. Note that publications may be converted to open access after publication as journal policies change, so the open-publication trend should be interpreted with caution.
Even where journals have policies that mandate or encourage code sharing, compliance often remains low (Culina et al., 2020). More ambitious solutions might include incorporating links between methods text and the corresponding code, employing dedicated code editors to help improve code style and clarity (similar to the Data Editors employed by The American Naturalist), or incorporating computational notebooks (e.g., RMarkdown, Quarto, Jupyter; Peng, 2011). Such measures will enhance transparency in reporting and provide reviewers and readers with the critical information necessary to reproduce and validate a study's findings. The adoption of these principles and practices will serve to promote the integration of open code in the scientific landscape, enhancing the verifiability and impact of our research. Given the growing list of reasons for code sharing, we encourage scientists to embrace open code and open science more generally.

FIGURE 2 Code sharing can balance out low impact factors. Low impact factor (IF; 1.3) and high impact factor (4.7) are defined using the 0.1 and 0.9 quantiles in our dataset. Fully open = open code and open-access publication; fully closed = closed-access publication and lack of publicly available code. Predictions are based on estimated model coefficients (Table 3). Our model predicts that 3 years after publication, fully open papers published in a low-impact journal may have roughly the same number of citations as fully closed papers published in high-impact journals (on average). Legend is arranged in descending order of citations at year 13.

We additionally found significant interactions between open access and code sharing (p < 2e-16; Table 3) and between data sharing and code sharing (p < 2e-16; Table 3), with publications meeting all three open-science criteria (code sharing, data sharing, and open access) having the highest overall predicted citation rates ("Fully open," Figure 2).
TABLE 3
Note: Statistically significant model terms in bold. ***p < .001, *p < .05.
TABLE 2 Estimated coefficients for models of temporal trends in code sharing, data sharing, and open access over time.