Credit Lost: Two Decades of Software Citation in Astronomy

Software has been a crucial contributor to scientific progress in astronomy for decades, but practices that enable machine-actionable citations have not been consistently applied to software itself. Instead, software citation behaviors developed independently from standard publication mechanisms and policies, resulting in human-readable citations that remain hidden over time and that cannot represent the influence software has had in the field. These historical software citation behaviors need to be understood in order to improve software citation guidance and develop relevant publishing practices that fully support the astronomy community. To this end, a 23-year retrospective analysis of software citation practices in astronomy was developed. Astronomy publications were mined for 410 aliases associated with nine software packages and analyzed to identify past practices and trends that prevent software citations from benefiting software authors.


Introduction
Citations reflect how authors contextualize their arguments and justify their conclusions. Citations provide a mechanism for upholding intellectual honesty and are used to assign value to scholarship. Citations are also the primary mechanism by which concrete connections are built between ideas over time. Both the longevity and the validity of these connections are determined by people who have traditionally been able to follow clear citation rules and have never had to consider why those rules exist. But software citation practices are not yet standardized to this extent, often leaving software authors to instruct article authors on how to cite their code. Software authors must then develop rough approximations of how often their code is used based on the assumption that article authors have followed their instructions. Similarly, article authors are tasked with making choices about how to cite software with little guidance on how to make sure software authors benefit from those citations. When software authors do not provide instructions, article authors create citations ad hoc following the only conventions they know, which were developed for citing papers and books. Journal reviewers and editors are then responsible for enforcing software-related policies without knowing how to critically assess software citations when they see them or how to recognize when such citations are missing.
Software authors, article authors, and publishers all need to make decisions about how software is represented as recorded knowledge to ensure that future generations can access that software and know who wrote it. Astronomers have needed to make choices about software citation for decades, though without clear metadata standards or citation practices, and they have had to do this in a scholarly landscape where articles had scientific value and software was just a tool. As a result, normative practices emerged that shoehorn software into citation practices meant for articles without adopting the same level of editorial review that article citations receive. Software authors correspondingly began to add preferred citations to their websites and documentation that do not persistently or consistently link to their code, leading to software citations that go undetected by systems now capable of identifying them. Technical and procedural mechanisms now exist to help enable software citation, but "community acceptance" of standardized citation practice is still needed. Tools and policies are only helpful if the astronomy community uses them and can make informed choices about them.
Citing software will only become more important as infrastructure to support computational work in astronomy improves and the field moves away from fragile systems toward scalable services that can handle immense astronomy data sets (O'Mullane et al. 2019). Without understanding the behaviors that are currently keeping software citations hidden, astronomers building the software the field relies upon will continue to lose credit for their work despite the community's attempts to acknowledge its value. Without change, only the software authors most capable of advocating for themselves will be rewarded for their contributions, and only their software will be curated and supported with the resources needed to sustain it. For all software authors to benefit from software citation, community-specific guidance will need to be developed and implemented "within the context of existing scholarly communication and software development norms" (Katz & Chue Hong 2018). It is therefore necessary to study past software citation behaviors, critique existing norms, and establish new practices.

Software Citation Principles
In astronomy, the need for software citation and resources to support software authors has been discussed at length through meetings and workshops (e.g., Allen et al. 2015; Norén 2015; Bouquin et al. 2018), as well as presentations and papers (e.g., Muna et al. 2016; Smith 2018a). In these contexts, astronomers have focused on the need to give software authors credit for their contributions to the field. They have also voiced a perception that software development is undersupported and that increased acceptance of software citation has the potential to bolster software-centric astronomy careers, in addition to scientific transparency, future code utility, and infrastructure. In 2016, a FORCE11 working group established the "Software Citation Principles," a set of high-level ideas that echoed the astronomy community's perceptions.
The Software Citation Principles state that software should be considered an important contribution to science and that software citations should enable normative and legal credit and attribution to be given to software authors. The Principles go on to make it clear that software citations need to uniquely and persistently identify software, in addition to enabling access to the software itself and its associated metadata. Specifically, software citations should identify the version of the software being used, and the link used to identify that software should be a durable, stable, permalink. Employing this strong, unique identification mechanism ensures that specific people can be given credit for their software, and specific context can be established for how their software gets used over time.

Persistent Identifiers
Although the Software Citation Principles may seem straightforward, unique identification and persistence are not concepts that the astronomy community is generally considering as they undertake their research. Astronomers are not typically taught how persistent identifiers (e.g., DOIs) work or how permalinks differ from standard URLs. As a result, astronomers who write software or want to acknowledge software in their articles may not fully understand the decisions they are making when identifying software in citations.
To clarify, URLs are fragile because they designate locations where a digital object exists at a specific time, but content at that location can change or be removed. On the other hand, persistent identifiers can be used to create permalinks that are then associated with metadata and tied to digital objects maintained by external entities responsible for their long-term care. Archival repositories (e.g., Zenodo 5) and academic publishers create persistent identifiers for the digital objects that they publish and maintain, whereas URLs are temporary locations where digital content can be found. A URL is just another piece of metadata that can be associated with a persistent identifier. For example, software deposited in Zenodo gets archived and assigned a DOI. The record associated with the archival deposit includes updatable metadata fields (e.g., related links), but the DOI will always point to the deposited files maintained by Zenodo; in this way, linkrot is prevented. This distinction is particularly important when considering the ways that software evolves over time and the fact that it is often collaboratively developed using distributed version control platforms (e.g., GitHub). Software developed this way is "born networked" (Acker 2014) and cannot be uniquely identified using standard URLs.
"Corner.py," a software package used to make scatterplot matrices (Foreman-Mackey 2016), can be used to illustrate how persistent identifiers can impact software citation. Corner.py has four releases that have been archived and assigned DOIs using Zenodo. Corner.py was formerly called "triangle.py" though, so if an article author had used "https://github.com/dfm/triangle.py" in a citation rather than the DOI for triangle.py's first release ("10.5281/zenodo.10598"), that citation would now redirect to the renamed software ("https://github.com/dfm/corner.py"), which currently has 23 more contributors than the originally cited software. Further, if an article author chose to use the information that had been available on GitHub about triangle.py to create their citation, they may have listed the only committer as the sole author of the code. This would have been reasonable given the fact that at the time, the sole committer was also the only person listed on the GitHub repo's license. 6 By referring only to the GitHub repo, the article author would not have known to list the other seven people listed as authors on the first release of triangle.py on Zenodo (Foreman-Mackey et al. 2014), leaving those software authors unrecognized. Similarly, if the current owner of the corner.py GitHub repo decided to delete the repo entirely or transfer ownership of the repo to a different person, the original URL and the current URL would both lead to a 404 page instead of the code. If the article author did not list the version of software they had used, there would be no way to determine who actually authored the code.
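The distinction drawn in this example can be sketched as a small data structure. The following is a minimal illustration only, not Zenodo's actual data model: the class name `ArchivedRecord` and its fields are hypothetical, and the only values taken from the text are the DOI of the first triangle.py release and the two GitHub URLs. The sketch assumes that a DOI is turned into a permalink through the doi.org resolver.

```python
from dataclasses import dataclass, field


@dataclass
class ArchivedRecord:
    """Hypothetical model of an archival deposit (not Zenodo's real schema)."""

    doi: str  # persistent identifier, assigned once at deposit time
    related_links: list = field(default_factory=list)  # updatable metadata

    def permalink(self) -> str:
        # The DOI is resolved through doi.org, so the citation target does
        # not depend on where the development repository currently lives.
        return f"https://doi.org/{self.doi}"


record = ArchivedRecord(
    doi="10.5281/zenodo.10598",
    related_links=["https://github.com/dfm/triangle.py"],
)

# The repository is renamed: only the metadata link is updated.
record.related_links = ["https://github.com/dfm/corner.py"]

# The permalink an article author would cite is unchanged by the rename.
print(record.permalink())
```

In this sketch the URL is stored as one more metadata field on the record, which mirrors the point that a URL is just another piece of metadata associated with a persistent identifier.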
Permalinks allow people to establish the specific context in which a given piece of software exists at a specific time and the ability to explicitly define the people who built that software. URLs allow people to navigate to specific places online, but what exists in those places changes. However, permalinks designed specifically to identify software have only recently become available to software authors. Originally, persistent identification mechanisms (e.g., DOIs, arXiv IDs, 7 ADS bibcodes) were developed to allow article authors to identify and cite more traditional scholarly objects like articles and books and to disambiguate them from the broader network of objects. More recently, authors were given the ability to attach identifiers to themselves (e.g., ORCiDs 8 ) and to connect their research to identifiers for data sets as well (e.g., MAST DOIs 9 ). The resulting network of persistent identifiers makes it clear to future generations how to access past research, who did that research, and the context in which that research was done. Persistent identifiers for software now need to be incorporated into this important ecosystem. Article authors looking to cite software in the past though had to cite something else.

Citing Something Else
Citation practices for scholarly publications like articles are well established, and the tools used to search the scholarly literature have been capable of identifying article citations since they began indexing citation metadata. In astronomy, the primary tool to navigate the literature and find citation information is the Astrophysics Data System (ADS). The ADS is designed to index curated, structured metadata associated with persistent identifiers provided by trusted sources like scholarly publishers and librarians. The ADS can easily be used to show how often a given paper has been cited. The ADS is not designed to parse through full-text sources seeking out the many ways an article author might decide to acknowledge a piece of software. With no other tools at their disposal though, software authors have, in the past, needed to construct full-text ADS queries in an attempt to find out how often their software was used and had no operationally defined way to summarize those "citations" when they were found. Software authors needing to demonstrate the "impact" of their work therefore understandably began to hack the system. Rather than ask people to cite their software, software authors began writing papers about their software and asking people to cite those "software papers." When article authors cite papers, they know to cite persistent identifiers. Software authors who write software papers can therefore accrue machine-actionable citations when their software papers are cited: the ADS then indexes citation metadata and gives these software authors a way to show the "impact" of their code. 10 Publishers recognizing the need for article authors to have something to cite worked to acknowledge and accept software papers for this reason.
AAS Publishing began accepting software papers and introduced a policy in 2016 (Vishniac & Lintott 2016) that made citing a software paper an acceptable way to cite software, stating that software citation could be done by citing a paper describing the software, and/or by directly citing the software using a software DOI. 11 In 2018 December, AAS also made public a collaboration with the Journal of Open Source Software (JOSS) 12 wherein JOSS would offer software reviews for papers submitted to the AAS journals (Vishniac & Lintott 2018). 13 The resulting JOSS papers link to archived versions of the reviewed software with persistent identifiers that can be cited by article authors and counted as software citations. And while the AAS recommends that authors of software papers archive their software, it is not required. Instead, the AAS software policy states that software papers minimally must, "provide a clear statement on how to access the code - for example, by contacting the author..." Software papers are not yet accepted by all publishers, but when they are accepted, they have allowed article authors to cite something when they want to cite software. Software papers have also not required significant institutional or cultural change for them to take advantage of existing citation tracking systems. But software papers are not software, and reliance on software papers to credit software authors is problematic. Specifically, software papers create ambiguity around software authorship and can prevent authors from getting credit for their work: software paper authorship lists are static, but software is dynamic with authorship lists that change from one version of code to the next. Software papers also make locating software more difficult over time as links from papers to their associated code bases break if they exist at all.
Complicating these issues is the fact that citations to software papers for the purpose of software citation are indistinguishable from citations to those papers for other purposes. Software papers can also put important information about open source software behind paywalls, in addition to arbitrarily creating more work for software authors who already need to make time for documentation and supporting the communities that contribute to their projects.
Indirect software citation has also been done through the use of the Astrophysics Source Code Library (ASCL). The ASCL functions as a repository when software authors deposit software with the ASCL, but the majority of ASCL records are "registry" records generated by the registry's curator to represent software (i.e., ASCL identifiers for registered software are associated with curated records, not curated software deposits). 14 ASCL registry records list links to websites and documents associated with the software represented by the registry record to make that software more visible. This "registry" function creates a new type of abstraction layer between software and software authors, in addition to presenting new challenges for generating, maintaining, and reconciling registry records with archival records in repositories and elsewhere (Katz et al. 2019a). The issues presented by software registries have yet to be resolved, but ASCL records are indexed by the ADS and can, like software papers, accrue machine-actionable citations. The practice of citing software registry records could become a recommended practice for citing software that otherwise has no clear or direct way to identify it (e.g., the software itself cannot be located and/or its authors are unable to be determined). However, in cases where software authors can be determined and the software can be unambiguously identified, astronomers can now directly cite software rather than citing something else.

Citing Software
Recognizing the need for persistent, unique software identification mechanisms, AAS Publishing, ADS, and the archival repository Zenodo began working together in 2016 with funding from the Alfred P. Sloan Foundation on a project called "Asclepias" (Muench et al. 2017). Now in its early implementation, Asclepias enables the automatic creation and management of metadata about software versions, licensing, and authorship to facilitate software citations that can be easily discovered using ADS. To take advantage of this new ADS functionality though, software authors need to deposit releases of their software into Zenodo or another archive that generates software DOIs. 15 Article authors need to then cite the software DOIs generated by the archive, rather than the various URLs and papers authors have cited in the past. What still remains to be done is the implementation of practices that capitalize on these newly available software citation features in ADS. Specifically, updated, clear, enforceable guidance and policies need to be established to normalize the practice of "publishing" software and citing published software in astronomy. Katz et al. (2019b) define publishing software as the "formal process of archiving a copy of the software... and creating a resolvable identifier for it," clarifying that "Unpublished software is often publicly available, but hosted by an organization that does not commit to long term preservation." This distinction between published and unpublished software is analogous to the distinctions one can make between sharing a document on a personal website, versus submitting it to a publisher that will archive it, create a record and an identifier for it, then expose metadata associated with that record so it can be found and indexed by systems like ADS.
10 Citation counts for software papers are associated with each paper's ADS record (e.g., Citations to Astropy Collaboration et al. 2013 can be viewed at https://ui.adsabs.harvard.edu/abs/2013A%26A...558A..33A/citations).
11 The most recent version of this policy is provided by AAS Editorial Board (2018).
12 The Journal of Open Source Software is a "developer-friendly journal for research software packages. It's designed to make it as easy as possible to create a software paper for your work." (Smith 2016).
13 https://blog.joss.theoj.org/2018/12/a-new-collaboration-with-aas-publishing
14 "ASCL's editors seek out both new and old peer-reviewed papers that describe methods or experiments that involve the development or use of source code and add entries for the found codes to the library. This approach ensures that source codes are added without requiring authors to actively submit them, resulting in a comprehensive listing that covers a significant number of the astrophysics source codes used in peer-reviewed studies." (ASCL 2020).
15 https://blog.datacite.org/doi-registrations-software/
Although Asclepias will help resolve many technical challenges impacting software citation in astronomy moving forward, software citation behaviors will continue to be enacted by people who have already developed a culture of practice around software citation. Technical interventions cannot have their intended positive effects if information associated with published software is incorrect or that information gets discarded (Katz & Chue Hong 2018). For this reason, the following retrospective analysis of software citation practices in astronomy was developed.

Methods
Case studies are difficult to replicate, time consuming, and not broadly generalizable. However, case studies can help contextualize perceptions and shape future directions for research more effectively than anecdotes alone. A case study though is not in itself a research method. Rather, the case study is the scope in which methods are applied. In this case study, the method employed is content analysis, wherein content from the astronomy literature was analyzed and inductively encoded into categories to quantitatively describe software citation practices in the field (Franzosi 2008). Specifically, the content analysis we employed aimed primarily to estimate how often article authors have attempted to give software authors credit, and the extent to which their practices diverge from those needed to consistently identify software citations.
Content analysis executed through a case study was appropriate in this context because it would be irrational to undertake a broader investigation without first understanding the limitations of the tools available for investigating software citations in the astronomical literature (e.g., text parsing scripts, APIs). Further, content analysis at scale is necessarily reductionist and would limit our ability to identify normative behaviors that are not detectable as patterns within the text (e.g., preferred citation practices). Pursuing a case study also allowed us to isolate the impact that policies may have on software citation practices in specific journals, thus establishing a baseline for future comparison.

Selecting Software
Nine software packages developed in whole or in part at the Center for Astrophysics were selected for inclusion in the case study (Table 1). The packages were selected to represent a spectrum of software (Smith 2018b) with varying degrees of complexity and a range of intended uses. Some software was developed by small groups with relatively narrow intended purposes (e.g., AstroBlend), while others were as complex as collaboratively developed data reduction pipelines associated with large surveys (e.g., spec2d). The software packages selected would also be expected to be cited across different timeframes within the case study period, with the earliest package being developed in 1990 (SAO Image DS9) and the latest having been released in 2017 (PlasmaPy). Software that is unlikely to have been mentioned in the astronomy literature (e.g., small packages, analysis scripts without titles or public development history) was excluded from our study.
Two different searches were conducted to identify mentions of the selected software packages throughout the astronomy literature: a search through XML files representing full-text articles published by the AAS, and a search of publications indexed by the ADS using the ADS API.

Search 1: AAS XML
The ADS was given permission by AAS Publishing to share XML files representing 76,791 articles published in the Astronomical Journal (AJ), Astrophysical Journal (ApJ), Astrophysical Journal Letters (ApJL), and the Astrophysical Journal Supplements Series (ApJS) between 1995 July and 2018 May for text mining (Table 2). These files were subsequently parsed with Python scripts to identify mentions of each software package. The scripts are available on Zenodo and can be found in Appendix A.
If any software package was mentioned in a given XML file, the XML tags associated with that mention were extracted to determine where the mention was found within the document. Files were flagged as containing software mentions in article bibliographies, footnotes, acknowledgments, and elsewhere. Multiple software mentions could be found in multiple locations within a single XML file. Table 3 contains JATS XML tags 16 that were used to determine whether or not a software mention was classified as a bibliographic entry, footnote, or acknowledgment. In an effort to be as exhaustive as possible in identifying ways that authors may have attempted to give a recognizable form of attribution to the software they were mentioning (i.e., beyond mentioning a title in-text without additional formatting), more ambiguous tags such as "ex-link" and "back" were classified as "other" recognizable forms of attribution. The XML tag "ex-link" could be used to include an external link anywhere in the body of an article, whereas "back" could be used to identify back matter like appendices, acknowledgments, glossaries, reference lists, or footnotes. When an "ex-link" or "back" tag was nested within a less ambiguous location tag, the less ambiguous location was used to define the software mention.
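The classification rule described above can be sketched in a few lines of Python. This is an illustrative approximation, not the scripts deposited on Zenodo: the tag-to-location mapping stands in for Table 3 and is an assumption, the function name `classify_mentions` and the sample XML are hypothetical, and the external-link element is written `ext-link`, the spelling used in the JATS tag library.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for the tag list in Table 3 (not the study's actual list).
LOCATION_TAGS = {
    "ref-list": "bibliography",
    "ref": "bibliography",
    "fn": "footnote",
    "ack": "acknowledgment",
}
AMBIGUOUS_TAGS = {"ext-link", "back"}


def classify_mentions(xml_text, alias):
    """Return the set of locations in which `alias` appears verbatim."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: elem for elem in root.iter() for child in elem}
    locations = set()
    for elem in root.iter():
        # Text owned by this element: its leading text, the tails of its
        # children, and its attribute values (e.g., link targets).
        owned = [elem.text or ""]
        owned += [child.tail or "" for child in elem]
        owned += list(elem.attrib.values())
        if not any(alias in chunk for chunk in owned):
            continue
        # Tag names of the element and its ancestors, innermost first.
        tags, node = [], elem
        while node is not None:
            tags.append(node.tag)
            node = parent.get(node)
        specific = [LOCATION_TAGS[t] for t in tags if t in LOCATION_TAGS]
        if specific:
            locations.add(specific[0])  # the less ambiguous tag wins
        elif AMBIGUOUS_TAGS.intersection(tags):
            locations.add("other")
        else:
            locations.add("in-text")
    return locations


sample = """<article>
  <body><p>Images were displayed with SAO Image DS9.</p></body>
  <back>
    <ack><p>We thank the developers of SAO Image DS9.</p></ack>
    <fn-group><fn><p>See <ext-link>SAO Image DS9</ext-link></p></fn></fn-group>
  </back>
</article>"""

print(sorted(classify_mentions(sample, "SAO Image DS9")))
```

Because the footnote and acknowledgment in the sample sit inside `back`, they illustrate the nesting rule: the more specific `fn` and `ack` tags define the location, and "other" is assigned only when no specific tag encloses the mention.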

Search 2: ADS API
In order to understand software citation practices in non-AAS publications, it was important to also search the ADS API for the same software over the same period of time covered by the AAS XML files. The ADS API search allowed us to identify software mentions in all publications indexed by the ADS. We compared the ADS API search results to the AAS XML search results in order to assess the accuracy of the API search. Having confirmed through extensive manual vetting the accuracy of our XML search results, we were able to use this comparison to understand the limitations associated with using the ADS API's full-text search capabilities to identify software mentions in non-AAS publications.
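The comparison between the two searches can be sketched with set operations. The sketch below is an assumption about how such a comparison might be tallied, not the procedure used in the study: the function name `compare_searches`, the result keys, and the placeholder article identifiers are all hypothetical. It treats the manually vetted XML results as the reference set.

```python
def compare_searches(xml_hits, api_hits):
    """Compare two collections of article identifiers.

    `xml_hits` is treated as the vetted reference set; `api_hits` is the
    set returned by a full-text API search over the same articles.
    """
    xml_hits, api_hits = set(xml_hits), set(api_hits)
    both = xml_hits & api_hits
    return {
        "found_by_both": len(both),
        "missed_by_api": len(xml_hits - api_hits),
        "api_only": len(api_hits - xml_hits),  # candidates for manual review
        # Share of vetted mentions that the API search also returned.
        "api_recall": len(both) / len(xml_hits) if xml_hits else None,
    }


# Placeholder identifiers, for illustration only.
summary = compare_searches(
    xml_hits=["paper-A", "paper-B", "paper-C"],
    api_hits=["paper-B", "paper-C", "paper-D"],
)
print(summary)
```

Articles returned only by the API search are counted separately because they cannot be labeled as false positives without inspection.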

Software Aliases
To comprehensively search both the AAS XML and the publications indexed by ADS for software mentions, it was necessary to define lists of "software aliases" or search strings that could have been used by article authors mentioning software in their papers. Searching for exact matches to these software aliases made it possible to distinguish exactly how the software was being mentioned in a given article, whereas partial string matches or searches for title variations would lead to ambiguous and incomplete results.
For example, searching for inclusive variations of "TARDIS" could result in false positives if words like "stardisk" occurred in an article. Similarly a name like "Stroble" could return false positives for "AstroBlend." Fuzzy matches would also be inappropriate, as something like "Astropy" could return results based on words that end with "tropy" like "entropy." Searching for something other than exact matches would additionally make it impossible to distinguish a mention of "Astropysics" (the original title of Astropy) from a mention of the current title.
Searching only for title variations would also be insufficient. For instance, searching for variations of "spec2d" would miss papers that used the software authors' suggested citation method of inserting a specific funding acknowledgment into the article: "The analysis pipeline used to reduce the DEIMOS data was developed at UC Berkeley with support from NSF grant AST-0071048." 17 A search for "spec2d" would also miss things like a footnote containing "keck.hawaii.edu/inst/deimos/pipeline.html." Other aliases were needed to account for software abbreviations that may have been written out in full like TARDIS, which stands for "Temperature And Radiative Diffusion In Supernovae." Additional aliases were added to include papers that do not mention the software's title. For example, the earliest paper describing TARDIS is titled, "A spectral synthesis code for rapid modeling of supernovae" (Kerzendorf & Sim 2014).
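The difference between partial and exact matching can be shown with a short sketch. The two functions below are illustrative assumptions rather than the study's scripts: `partial_match` is a case-insensitive substring test, and `exact_match` is one possible reading of "exact," namely a case-sensitive match that must not be embedded in a longer word.

```python
import re


def partial_match(alias, text):
    """Case-insensitive substring test (prone to false positives)."""
    return alias.lower() in text.lower()


def exact_match(alias, text):
    """Case-sensitive match that may not sit inside a longer word."""
    pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
    return re.search(pattern, text) is not None


examples = [
    ("TARDIS", "a stardisk model of accretion"),
    ("Stroble", "rendered with AstroBlend"),
    ("Astropy", "built on Astropysics"),
    ("Astropy", "built on Astropy."),
]
for alias, text in examples:
    print(alias, "|", text, "|", partial_match(alias, text), exact_match(alias, text))
```

Under these assumptions the substring test matches all four examples, while the exact test matches only the last one, which keeps a mention of "Astropysics" distinct from a mention of "Astropy."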
To define the necessary list of software aliases, online searches were conducted to gather phrases, alternate spellings and titles, as well as links and identifiers that could be used to refer to the software packages. In this way, we followed conventions typically employed in the early stages of a systematic review of the medical literature wherein search terms are vetted for inclusion in a final list that will be incorporated into a comprehensive search across multiple systems (Henderson et al. 2010).
For the purposes of this study, we used the term "identifier" to refer to software aliases that are DOIs, ADS bibcodes, Zenodo records, ASCL records, and arXiv ePrints, whereas "links" include URLs to many different types of personal, institutional, and corporate websites (e.g., GitHub repositories). Software authors were also contacted to contribute to the list of aliases associated with their software. In total, 410 aliases were gathered for the nine selected software packages (Appendix B).
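The distinction between "identifiers" and "links" can be approximated with pattern checks. The sketch below is a rough illustration and not how the alias list was actually built: the regular expressions are simplified assumptions that do not validate identifiers, the function name `classify_alias` is hypothetical, and the third category, "phrase," is introduced here for titles and acknowledgment text that are neither identifiers nor links.

```python
import re

# Rough, illustrative patterns; none of them validates an identifier.
IDENTIFIER_PATTERNS = [
    re.compile(r"^10\.\d{4,9}/\S+$"),    # DOI
    re.compile(r"^\d{4}\S{15}$"),        # 19-character ADS bibcode
    re.compile(r"^ascl:\d{4}\.\d{3}$"),  # ASCL record ID
    re.compile(r"^arXiv:\S+$"),          # arXiv ePrint ID
    # URLs that point at identifier records rather than ordinary websites.
    re.compile(r"^(https?://)?(www\.)?(zenodo\.org|ascl\.net|arxiv\.org|doi\.org)/\S+$"),
]
# A link needs either a scheme or a path, so a title such as "corner.py"
# is not mistaken for a URL.
URL_PATTERN = re.compile(r"^(https?://\S+|[\w.-]+\.[a-z]{2,}/\S*)$", re.IGNORECASE)


def classify_alias(alias):
    """Label an alias as 'identifier', 'link', or 'phrase'."""
    alias = alias.strip()
    if any(p.match(alias) for p in IDENTIFIER_PATTERNS):
        return "identifier"
    if URL_PATTERN.match(alias):
        return "link"
    return "phrase"


for alias in [
    "10.5281/zenodo.10598",
    "http://ascl.net/1608.001",
    "https://github.com/dfm/corner.py",
    "keck.hawaii.edu/inst/deimos/pipeline.html",
    "Temperature And Radiative Diffusion In Supernovae",
]:
    print(alias, "->", classify_alias(alias))
```

A bare host name without a scheme or path would fall through to "phrase" in this sketch, so a real alias list would still need manual review.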

Preferred Citations
When available, preferred citations and other attribution phrases for software were also recorded. Examples of preferred citations are listed in Table C1. For the purposes of this study, "preferred citations" are direct instructions given by software authors, or in the case of software registries (i.e., ASCL), registry curators, specifying how to attribute credit to the software in a publication in lieu of software authors.
Preferred citations were examined to better understand how they might impact what Katz & Chue Hong (2018) refer to as "quality of information" issues related to software citation metadata. Katz & Chue Hong (2018) state that there is a need to "ensure that systems collect and retain required metadata, making it easier to discover and reuse software," but the need to ensure quality must extend to the origin of the metadata provided. Without ensuring that quality metadata is available, it will matter little whether or not that metadata is discarded at some point in the publication process.
By looking for preferred citations to include in our software alias list, we identified a predominant trend to keep in mind while interpreting the results of our content analysis: there was always more than one preferred citation associated with a software package, and these instructions often contradicted each other or were unclear as to which suggested practice was most preferred. For example, in 2018 September, the PlasmaPy Community had thoroughly documented their code and decided to update their "CITATION" file to instruct the We highly encourage researchers to acknowledge the packages that PlasmaPy depends on, including but not limited to Astropy, NumPy, and SciPy." From these instructions, it is unclear when to cite the first suggestion rather than including the specified acknowledgment message. It is also unclear when it would be appropriate to cite the archived code directly. Further examination of the Zenodo records listed in these instructions also highlights a problem that reviewers checking for software citations in the future could face. Although you can tell at a glance that citations to Zenodo DOIs come from Zenodo (i.e., "Zenodo" is listed as the publisher), you cannot easily tell whether or not a Zenodo DOI is a software DOI, because Zenodo mints DOIs for all content deposited in its archive and not just software. This issue is demonstrated by PlasmaPy's first preferred citation, which points to a Zenodo record for a slide deck published in 2018 April (PlasmaPy Community et al. 2018a). A reviewer could easily and wrongly assume that this DOI resolves directly to archived code. We also found that software DOIs do not necessarily mean direct software citation, as is the case with papers deposited in Zenodo as LaTeX files (e.g., Astropy's paper Price-Whelan et al. 2018).
Moreover, the PlasmaPy preferred citation example illustrates the challenge of having many locations to update preferred citation information. For instance, when one downloads the archival software release listed in the citation instructions (PlasmaPy Community et al. 2018b), the unzipped .tar.gz file contains a "CITATION" file with different instructions that do not mention the option to cite the archival release. 19 The same issue arises when software authors list citation information in documentation that goes un-updated once new publications are released. For instance, Stingray's "Read the Docs" page is out of sync with the ASCL record it points to: "Currently, the best way to cite Stingray in papers and other projects is with our Astrophysics Source Code Library [http://ascl.net/1608.001] entry. Stay tuned for the first version release later in 2018." Software authors also regularly suggested citations to more than one software alias in their preferred citations. For instance, spec2d requests citations to two different identifiers in addition to an acknowledgment message about the software's funding source. 20 Understanding that we would likely find multiple "preferred citations" for each software package helped us flesh out our alias list further and established software identification as an area where specific community guidance is needed. This idea was bolstered by the fact that we were unable to find preferred software citation guidance from astronomy publishers, leading us to suspect that the existence of preferred citations is a normative phenomenon not linked to publisher policies or recommendations.
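One way a reviewer might check what a Zenodo DOI actually identifies is to look at the resource type recorded in the DOI's metadata. The sketch below assumes metadata shaped like a DataCite REST API response, in which the general resource type is stored under `data.attributes.types.resourceTypeGeneral`; the function name `is_software_doi` is hypothetical, and the two sample records are hand-written placeholders, not the actual PlasmaPy records.

```python
def is_software_doi(datacite_record):
    """Return True when DataCite-style metadata labels the deposit as software."""
    types = (
        datacite_record.get("data", {})
        .get("attributes", {})
        .get("types", {})
    )
    return types.get("resourceTypeGeneral", "").lower() == "software"


# Hand-written placeholder records, for illustration only.
slide_deck = {"data": {"attributes": {"types": {"resourceTypeGeneral": "Text"}}}}
code_release = {"data": {"attributes": {"types": {"resourceTypeGeneral": "Software"}}}}

print(is_software_doi(slide_deck))
print(is_software_doi(code_release))
```

As the LaTeX example in the text shows, a "software" label is not sufficient on its own: a deposit can carry that label and still not be the code being cited, so a check like this only narrows what needs manual review.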

Limitations
Case studies are inherently limited in their generalizability, and limitations were also unavoidable in our content analysis. Specifically, the existence of confounding and ambiguous software aliases could result in false positives and/or negatives that needed to be identified iteratively. It is possible that we were not able to identify all possible confounds. For example, searching for the software package "Stingray" required us to weed out results that were mentions of the stingray nebula, 21 stingray-shaped objects, 22 actual stingrays (i.e., animals), 23 and the identically named Corvette. 24 "Stingray" is also a name given to multiple instruments 25,26 and was even returned as part of an email domain. 27 We also limited our inclusion of aliases that introduced an unreasonable quantity of confounds. For instance, we chose not to include "DEIMOS" as an alias for spec2d as it is also the name of the outermost Martian moon.
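The confound problem described above can be illustrated with a minimal sketch. The pattern list below is hypothetical and simply mirrors the kinds of false positives named in this section; it is not the list used in the study, and the function names are assumptions introduced for illustration only.

```python
import re

# Hypothetical confound patterns for the alias "stingray"; they mirror the
# false positives described in the text, not the study's actual list.
CONFOUND_PATTERNS = [
    r"stingray\s+nebula",     # the astronomical object
    r"stingray[- ]shaped",    # morphological descriptions
    r"corvette",              # the identically named car
    r"@[\w.-]*stingray",      # alias embedded in an email domain
]


def is_probable_confound(snippet: str) -> bool:
    """Return True when a snippet matches any known confound pattern."""
    text = snippet.lower()
    return any(re.search(pattern, text) for pattern in CONFOUND_PATTERNS)


def filter_mentions(snippets):
    """Keep only snippets that do not look like known confounds."""
    return [s for s in snippets if not is_probable_confound(s)]
```

A filter of this kind only removes confounds that have already been recognized, which is why manual review and iteration remain necessary.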
Ambiguous software aliases also led us to miss variations of URLs that could be associated with a given package, or miss aliases altogether. For example, some packages had a proliferation of possible URL variations of which our list is not exhaustive. The acknowledgment recommendation provided by the WCSTools website illustrates this issue well as it suggests citing at least one of nine papers. 28 Each of the nine papers has their own publisher provided URL and DOI, ADS and arXiv entries, as well as three to six URLs linking from the WCSTools website to reproductions of content associated with the papers. WCSTools also has an ASCL record, which points back to the WCSTools website 29 and an ADS bibcode. 30 These aliases needed to be added to the list of URLs where versions of the software itself could be found and downloaded. 31 This content analysis is further limited by our decision not to include all possible links to past software versions, subroutines, and affiliated libraries. For WCSTools alone, including all of these variations would have added 246 URLs to our alias list just to include links that would allow a user to download past versions of the software and their associated README files. 32 To ensure we would actually be able to review our results, we chose to include as many aliases as possible while assessing whether or not it would be reasonably likely that the alias would be mentioned in a paper. We accepted these limitations as not inhibiting our ability to assess trends in software citation behavior and having an acceptable impact on the results of our content analysis.
In addition to content analysis limitations related to our choice of software aliases, our decision to focus on the ADS rather than a domain-agnostic index like Google Scholar also meant we were unable to search for software mentions in literature outside of astronomy and its related fields. Because the ADS is the primary index used by the astronomy community, and because our study is focused on past citation practices by astronomers, we accepted this limitation and suggest that a future study could expand our searches to include queries across indexing platforms.
Our decision to focus on the ADS also meant that licensing restrictions related to text mining impacted our study. The ADS API is not designed to reliably determine with a high degree of confidence where within an article a specific string might be found (e.g., in a bibliographic entry, footnote, table) because text-mining restrictions limit the maximum number of characters that can be extracted from article sources. We were therefore only able to determine whether or not aliases were mentioned within a given document, and not where those aliases were mentioned. 33 In order to approximate whether or not an alias mention was associated with a bibliographic entry, the presence or absence of bracketed content near the alias was noted. The presence or absence of bracketed content is not a definitive indication of a bibliographic entry; it only increases the likelihood that the software alias has a corresponding bibliographic entry associated with it. For these reasons, our analysis of the ADS API search results is less granular and conclusive than our XML search; however, insights can still be gained through comparison.
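The bracketed-content heuristic described above can be sketched as follows. This is an illustrative approximation only: the 80-character window, the regular expression, and the function name are assumptions rather than details reported in the study.

```python
import re

# Bracketed content such as "(Author et al. 2013)" or "[12]".
BRACKETED = re.compile(r"\([^()]*\)|\[[^\[\]]*\]")


def has_bracketed_content_near(snippet: str, alias: str, window: int = 80) -> bool:
    """Return True if bracketed text lies within `window` characters of any
    case-insensitive mention of `alias` in `snippet`."""
    for match in re.finditer(re.escape(alias), snippet, flags=re.IGNORECASE):
        lo = max(0, match.start() - window)
        hi = match.end() + window
        # Restrict the search to the slice of text around this mention.
        if BRACKETED.search(snippet, lo, hi):
            return True
    return False
```

As noted in the text, a positive result only raises the likelihood of a bibliographic entry; parenthetical remarks unrelated to citation would also trigger this check.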
One other technical limitation that will have affected our results is that the XML schema used by AAS Publishing changed over the course of the 23 yr covered by our case study, so our XML tag extraction may be imperfect. XML files were also not available for all AAS articles that could have been included in this study based exclusively on our established timeframe. When the ADS had only PDFs, those articles were excluded from the content analysis.

Results
For each of the two search approaches, summary statistics were aggregated and examples were identified to represent characteristic software citation behavior associated with the nine software packages selected for study. The limitations described herein were incorporated into our analysis of the results.

Presence of Software Aliases in AAS XML
In total, 76,791 XML files were parsed and 1469, or roughly 2%, were found to contain at least one software alias associated with one of the nine software packages we selected. Table 4 breaks down the number of articles associated with each software package found in each AAS journal.
Astropy was found to have the highest total number of XML files in which an article author had attempted to directly give the software a recognizable form of attribution using bibliographic entries, footnotes, acknowledgments, and/or "other" mechanisms (Table 5).
Of the aliases we searched for within the AAS XML files, 109 of the 410, or 26.6%, were found (Appendix D). The software package spec2d had the highest number of unique aliases found in AAS articles (22) followed closely by Astropy and SAOImage DS9 (21). Table 6 shows the total number of unique aliases associated with each software package in the XML. These results demonstrate that the length of time a software package has been available is not predictive of the degree of variation article authors employ when identifying that software.
The XML alias search also showed that the "identifiers" mentioned most in the literature were all associated with journal articles (Table 7), and none of the identifiers found in the XML actually resolved to specific software releases directly. Similarly, the most common "non-identifier" external links also did not point to software releases. The only paper containing a non-identifier that resolved to a software release was Law et al. (2016), which mentioned "svn.sdss.org/public/repo/manga/mangadrp/tags/v1_5_4/pro/spec2d" in a footnote. Instead, article authors used general project URLs (i.e., "astroblend.com," "astropy.org," "ds9.si.edu") or institutionally hosted landing pages (i.e., "astro.berkeley.edu/~cooper/deep/spec2d" and "tdc-www.harvard.edu/wcstools"). In the case of RADMC-3D, the most mentioned external link 34 pointed to an institutionally hosted website that now leads to a 404, but the Internet Archive has saved snapshots of the web page through 2019 March. 35 TARDIS was only referred to using variations of its title and no external links were found, whereas Stingray's most common external links were to subdomains of the software's GitHub repository. Stingray's GitHub URLs pointed to both tutorial material 36 and a code called "HENDRICS," 37 last released in 2017. All found aliases categorized as "non-identifiers" are listed in Table D2.
Because many aliases could be found in a single paper, we also found the maximum number of unique aliases in a given XML file for each software package (Table 8).
All software packages were associated with papers containing at least two different aliases. Astropy was found to be associated with the paper that used the highest number of unique aliases (9). The paper with the most Astropy aliases turned out to be about "HaloTools" v0.2 (Hearin et al. 2017a), an open Astropy-affiliated software package for modeling large-scale structure (Hearin et al. 2017b).
Interestingly, the HaloTools paper has two bibliographic entries that resolve to archived versions of the HaloTools code on Zenodo but no bibliographic entry for an archived version of

Analysis of XML Tags
Analysis of the XML tags associated with software aliases emphasizes the disuniformity of software citation practices employed by the astronomy community over the last 23 yr. For each mention of a software alias in an XML file, our scripts extracted the prior four tags in the file's XML tree structure, along with the content of those tags. Table E1 shows all unique combinations of XML tags associated with aliases included in this case study. Table 9 shows the total number of unique XML tag combinations associated with each software package.
The article we identified as being associated with AstroBlend can be used to demonstrate the encoding scheme we used when categorizing alias mentions and to clarify how we counted unique combinations of tags. The paper Vogt et al. (2016) contains two AstroBlend aliases: "astroblend" and "astroblend.com." The first alias was found in the body of the text, and the second was found in a footnote associated with the in-text mention.
"For more extensive examples and tutorials on how to use BLENDER with astrophysical data sets (including different ways to import the data and by-passing the X3D file format altogether), we refer the interested reader to the ASTROBLEND 26 project."
"26 http://www.astroblend.com"
The first alias mentioned, "astroblend," was nested within the tags sc, p, sec, and sec. The alias "astroblend.com" was nested within ext-link, p, fn, and p. The example of XML content below is associated with the "astroblend" sc tag, as well as the "astroblend.com" tags ext-link, p, and fn.
For more extensive examples and tutorials on how to use <sc>blender</sc> with astrophysical data sets (including different ways to import the data and by-passing the <sans-serif>X3D</sans-serif> file format altogether), we refer the interested reader to the <sc>astroblend</sc> project. <xref ref-type="fn" rid="apj521773fn23"><sup>26</sup></xref>
<fn id="apj521773fn23">
<label><sup>26</sup></label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://www.astroblend.com">http://www.astroblend.com</ext-link>
</p>
</fn>
The paper associated with AstroBlend was counted as a single unique paper containing two unique combinations of tags, and the mention of "astroblend.com" was categorized as a footnote. The paper as a whole was classified as containing an alias in a footnote, meaning that even though "astroblend" was not associated with any more specific attribution tags like those associated with bibliographic entries, the authors of the paper clearly attempted to give an identifiable form of credit to AstroBlend by putting a software alias in a predictable location.
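The tag extraction described above can be approximated with Python's standard library. The sketch below is an assumption about how such a step might look, not the scripts used in the study: it records, for every element whose own text mentions an alias, the tag of that element followed by the tags of its ancestors, innermost first, up to four tags. The function name and the simplified sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def alias_tag_paths(xml_text: str, alias: str, depth: int = 4):
    """For each element whose own text mentions `alias`, return the tags of
    that element and its ancestors (innermost first, at most `depth` tags)."""
    root = ET.fromstring(xml_text)
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent = {child: elem for elem in root.iter() for child in elem}
    paths = []
    for elem in root.iter():
        if alias.lower() in (elem.text or "").lower():
            chain, node = [], elem
            while node is not None and len(chain) < depth:
                chain.append(node.tag)
                node = parent.get(node)
            paths.append(tuple(chain))
    return paths


# Simplified stand-in for the AstroBlend example (attributes omitted).
SAMPLE = (
    "<sec><sec><p>we refer the interested reader to the "
    "<sc>astroblend</sc> project."
    "<fn><p><ext-link>http://www.astroblend.com</ext-link></p></fn>"
    "</p></sec></sec>"
)
```

Calling `alias_tag_paths(SAMPLE, "astroblend.com")` on this simplified sample yields the chain ext-link, p, fn, p, matching the nesting reported for the footnote mention. Text that follows a child element (ElementTree's `tail`) is not inspected in this sketch.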
This classification procedure was applied to all XML files associated with software aliases and used to classify whether or not alias mentions were associated with bibliographic entries, footnotes, or acknowledgments. Tags associated with bibliographic entries were often, but not exclusively, found nested within acknowledgment tags. Individual alias mentions could therefore be counted as both acknowledgments and bibliographic entries (Table 10).
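One way to express this classification step is sketched below. The mapping from tag names to categories is an assumption based on common JATS-style element names (`ref`, `fn`, `ack`); the study's actual tag combinations are those listed in Table E1, and the "other" category is omitted here for brevity.

```python
# Assumed mapping from JATS-style tag names to attribution categories;
# the study's actual tag combinations (Table E1) are not reproduced here.
CATEGORY_TAGS = {
    "bibliographic entry": {"ref", "ref-list", "mixed-citation", "element-citation"},
    "footnote": {"fn", "fn-group"},
    "acknowledgment": {"ack"},
}


def classify_mention(tag_chain):
    """Return every category whose tags appear among the enclosing tags.
    A mention may belong to more than one category at once."""
    tags = set(tag_chain)
    found = {name for name, members in CATEGORY_TAGS.items() if tags & members}
    return found or {"inconclusive"}


def classify_paper(tag_chains):
    """Combine mention-level categories into one paper-level set."""
    found = set()
    for chain in tag_chains:
        found |= classify_mention(chain)
    recognized = found - {"inconclusive"}
    return recognized or {"inconclusive"}
```

Under these assumptions, a paper with one in-text mention (sc, p, sec, sec) and one footnote mention (ext-link, p, fn, p) is classified at the paper level as containing an alias in a footnote, consistent with the AstroBlend example.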
Finally, tags demarking "other" forms of attribution were incorporated into the list of papers associated with each software package to determine the total number of unique papers that gave a recognizable form of credit to the software they were mentioning (Table 11). Despite expanding our definition of recognizable types of credit to include "other" tags, some papers could not clearly be classified as containing aliases in either bibliographic entries, footnotes, acknowledgments, or associations with "other" forms of attribution. We classified these papers as "inconclusive" (Table 12).
In total, there were 343 papers where software had been mentioned but the location of the alias used could not be determined. Aliases associated with older software were more likely to be mentioned without XML tags to indicate a clear location, but ambiguous papers were found as late as 2017. For example, some articles mentioned software by its title without further attribution: "Combining our isochrone data from the Dartmouth Stellar Evolution Database with our most updated star parameters using astropy.io, we created a diagram to illustrate log g versus Teff of our binary stars, as seen in Figure 10." (Aleo et al. 2017).
"Imaging observations were obtained in optical passbands from the 48 imaging telescope at FLWO and were de-biased/flat-fielded using ASTROPY packages." (Nicholl et al. 2017).
"The position of each geometric center is estimated using DS9 tools and the 24 image." (Silva et al. 2017).
"The astrometry of individual processed images was solved with the routine imwcs in the package wcstools." (Chen et al. 2009).
Papers mentioning software aliases in inconclusive locations sometimes mentioned more than one of the software packages included in our case study. Mallick et al. (2012) mentioned aliases for both SAOImage DS9 and WCSTools, but the alias locations could not be determined with either mention, suggesting that this behavior was generally applied and not tied to a specific software package included in our study:
"The averaged spectrum was then analyzed in ds9 image display device, and the stars with an Hα emission line (detectable as an enhancement over the continuum) were identified manually by careful visual inspection, with the help of the photometric image of the field."
"The stars were obtained using the task "IMTMC" in "WCSTOOLS" package in IRAF, for the J (λ=1.25 μm), H (λ=1.63 μm), and Ks (λ=2.14 μm) bands."
In other instances, software alias locations could not be determined because the citation associated with the alias was actually to a paper that had used the same software to perform a particular analysis:
"If we consider a changing disk radius with primary mass as the most significant factor, it has been shown for a benchmark suite of models, including MCFOST and RADMC-3D
Inconclusive mentions occasionally showed that an alias could have been flagged as a recognizable form of credit, but it was either misformatted or previously unrecognized as a software alias despite efforts to comprehensively identify all relevant aliases. For example, a 2013 paper was found to have inconclusively mentioned RADMC-3D (Beaumont et al. 2013). The paper mentioned the software twice as "RADMC-3D" in a table, and twice in the body of the paper, but the associated entry in the bibliography was oddly formatted: "Dullemond, C. P. 2012, ascl soft, 2015." We found this same error in Smith et al. (2013): "Dullemond, C. P. 2012, Astrophysics Source Code Library, 2015."
It seems that the article authors had attempted to cite one of the relevant ASCL records 39 associated with RADMC-3D, but had done so without mentioning any predictable variation of that record's multiple identifiers or URLs (e.g., ascl.net/1202.015; 2012ascl.soft02015D; ascl:1202.015), and no corrections had been made during the article review process. The article authors also misinterpreted the ASCL record number to be 2015, possibly by misreading its ADS bibcode.
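The relationship between the three identifier variants quoted above can be made explicit with a short sketch. The construction below is inferred from the single example given in the text (ascl:1202.015 and 2012ascl.soft02015D) and assumes a record registered in 2000 or later; the function name is hypothetical.

```python
def ascl_aliases(ascl_id: str, first_author_initial: str) -> dict:
    """Build identifier variants for an ASCL record ID such as '1202.015'.

    Assumes the ID has the form YYMM.NNN and that the record dates from
    2000 or later; the bibcode pattern is inferred from the example above.
    """
    yymm, number = ascl_id.split(".")
    year = 2000 + int(yymm[:2])
    month = yymm[2:]
    bibcode = f"{year}ascl.soft{month}{number}{first_author_initial}"
    return {
        "ascl_id": f"ascl:{ascl_id}",
        "url": f"ascl.net/{ascl_id}",
        "ads_bibcode": bibcode,
    }
```

For the RADMC-3D record, the bibcode produced this way ends in "soft02015D". The digits "2015" appear inside that string, which is one plausible route to the misread record number seen in the two bibliography entries.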
Other interesting inconclusive mentions of RADMC-3D were linked to our decision to include "RADMC" as an alias for "RADMC-3D." This was largely done to account for the fact that an ASCL record associated with RADMC-3D states, "RADMC-3D is a new incarnation of the older software package RADMC (ascl:1108.016)." It was not clear from this description whether or not article authors should interpret RADMC to be an earlier version of RADMC-3D, or another software package entirely distinct from it; we therefore included RADMC and its associated aliases in our alias list. When we later identified a paper containing an "RADMC" alias, we realized there were yet additional papers being used to identify the RADMC code that had not been listed on the RADMC ASCL record. We had included all links and identifiers listed on the ASCL record (e.g., "used in," "described in"), but the inconclusive paper had cited an entirely different article written in the same year by the same authors to represent RADMC: "We construct 2D dust radiative transfer models with RADMC (Dullemond & Dominik 2004) to calculate the passive heating, and adopt the viscous heating profile from an MRI-active disk model (Landry et al. 2013), which includes a viscosity prescription for accretion driven by MRI turbulence." (Yu et al. 2016).
By citing Dullemond & Dominik (2004b) instead of the citation listed as "preferred" by ASCL (Dullemond & Dominik 2004a), the article authors were citing a more relevant paper that served to credit the same software authors for their code. This behavior shows a clear attempt to cite software without divorcing the software from its application in a particular context.

Comparing XML Results to ADS API Results
The search we conducted using the ADS API had two components: a comparison to the results of our AAS XML search, and a search across all papers indexed by ADS. 40 The comparison between the results of the ADS API search and the search we conducted directly in the same files without the API was used to approximate how effectively we could use the ADS API full-text search to identify software mentions in non-AAS journals. To do this comparison, we used the ADS API to search for the same aliases we compiled for the XML search and then filtered the results to include only papers published in AAS journals during the same period of time. We then counted the total number of AAS papers associated with each software package. API search results for AAS papers were also refined systematically then manually reviewed for false positives caused by known confounds identified during the XML search.
Our comparison between the two search methods showed that the ADS API returned roughly 59.7% of the AAS papers we had been able to identify through our XML search (Table 13). We therefore estimate that our use of the ADS API to search for software aliases in the rest of the astronomy literature will miss up to 40.3% of the possible results.
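The comparison reduces to a simple recall calculation: the share of XML-identified papers that the API also returned. The sketch below uses placeholder paper identifiers rather than any data from the study, and the function name is an assumption.

```python
def api_recall(xml_papers, api_papers) -> float:
    """Fraction of XML-identified papers that the API search also returned."""
    xml_set, api_set = set(xml_papers), set(api_papers)
    if not xml_set:
        raise ValueError("no XML-identified papers to compare against")
    return len(xml_set & api_set) / len(xml_set)


# Placeholder identifiers only; these are not results from the study.
toy_xml = {"paper-a", "paper-b", "paper-c", "paper-d", "paper-e"}
toy_api = {"paper-a", "paper-b", "paper-c", "paper-x"}
toy_recall = api_recall(toy_xml, toy_api)
```

With a recall of 59.7%, the complementary miss rate is 100% − 59.7% = 40.3%, which is the upper bound quoted above for the wider literature, under the assumption that the API performs no better outside AAS journals.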

ADS API Search of All Indexed Literature
The ADS API search results of the broader astronomy literature could not be analyzed with the same level of granularity that was applied to the XML results. We did not manually refine our API results when searching across all literature indexed by the ADS, meaning that any false positives our systematic filtering failed to remove remained in our results. New confounds may have also been introduced; this is particularly true regarding the software package "Stingray," whose title is commonly used in other contexts. Regardless of the API's limitations though, we executed the search to roughly approximate the number of papers where article authors mentioned our case study software packages in their publications. Table 14 shows the total number of papers returned by the API for each software package. Table 15 shows the total number of aliases found using the API. The API search found roughly 81% of the aliases we were able to identify when searching the AAS XML directly. Technical limitations did not allow us to use the API to extract XML tags for each mention of a software alias. We instead extracted "highlight" content from articles using the ADS API and tried to identify bracketed content that could reasonably indicate that an in-text citation was associated with that alias. Table 16 shows the estimated percentage of papers mentioning software aliases in bibliographic entries based on this assumption.
Much like the outcome of our XML search, the results of our API search showed that papers suspected of containing bibliographic entries for software did not contain identifiers that resolved directly to software releases (Table 17). We also found no new non-identifiers that resolved directly to software releases. Rather, papers were cited independent of a given article's publisher.
In addition to identifiers for papers, our API search also found identifiers tied to ASCL records. With the exception of an ASCL record for "Astropysics," an alias for Astropy, 41 and one of the ASCL entries for RADMC-3D, 42 the ASCL records associated with bibliographic entries actually listed a preferred citation that was not an ASCL record. For example, Astropy had been mentioned using the alias "ascl:1304.002," but the preferred citation method listed on this ASCL record is actually a request to cite two separate papers (Figure 1). Likewise, the ASCL record "ascl:1109.015," associated with WCSTools, links to further instructions about preferred citations (Figure 2). No change log is available for ASCL records; it is therefore correspondingly unclear whether or not ASCL records have always listed preferred citation methods other than the ASCL records themselves or if the preferred citation method was different when the ASCL record was mentioned in the literature.

Software Mentioned over Time
In recent years, software citation advocacy and other efforts to develop viable career trajectories for software authors have been premised on a perceived need for improved awareness about why software matters. Our case study demonstrates though that advocacy for the idea that software is important may not be as crucial as previously suspected.
Specifically, our analysis of the literature shows that when software becomes available to the community, the community starts mentioning that software in their papers. The results of our AAS XML search show that at least one paper mentioned each of our software packages within a year of it being made available (Figure 3). For the case of software released prior to 1995 (i.e., 1990 for SAOImage DS9 43 ), that software was mentioned from the start of our case study timeframe. Our search using the ADS API supported this finding (Figure 4).
Having identified software alias mentions throughout the entire timeframe covered by our case study, we can reasonably begin making assertions about the perception that increased advocacy is needed for scientific software to be valued and cited by the astronomy community. Rather than being grounded empirically, this perception may be driven instead by the fact that software citation has been practiced in such a way that we have not been able to systematically identify software citations, leading people to doubt the value of citing software or to believe software is not being cited. This assertion is bolstered by work done in 2016 by Howison and Bullard, who noted that software mentions in a random sample of biology publications "often fail to accomplish many of the functions of citation," and suggested "reducing the acceptability of using a variety of informal forms of mentioning software" to improve the situation (Howison & Bullard 2016). This study also pointed out the need to focus not just on how articles are published but on how software is made available, stating the "problem is not motivation," but that citations to unpublished software do not meet the goals outlined by the Software Citation Principles. One more complicating feature that could be resulting in a perceived need to emphasize the value of software to instigate software citation adoption is that article authors may be under the impression that they are already citing software, because in a way, they are. Article authors may mention software in their papers and therefore feel they do not need to change their practices even if those practices do not result in machine-actionable citations. For these reasons, any specific software citation guidance shared with the astronomy community in the future should be referred to as "revised" or "updated" software citation practices rather than as general guidance on software citation.
Acknowledging the fact that the community already attempts to cite software and that they need to adopt new practices may improve the likelihood that normative behaviors shift.

Discussion
The content analysis described herein allowed us to assess how the astronomy community's past software citation practices may influence the adoption of new software citation practices. What we found was that although the community clearly values software, the way the community acknowledges software is neither specific nor stable and does not consistently give normative legal credit to software authors. While it is true that assigning persistent identifiers directly to software will allow article authors to unambiguously give credit to software authors, establishing technical mechanisms to identify citations to published software does not fully address the needs of the astronomy community.
One of the primary concerns outlined by our results is that historical software identification norms in astronomy complicate efforts to systematically define and quantify software citations. The number of software aliases associated with the nine software packages in this case study helps illustrate the need for standard software identification practices. Just as important though is the need to address the community's "preferred citation" practices. It is clear from our results that preferred citation instructions are ambiguous (i.e., multiple preferred citations; contradicting preferred citations), change with time (e.g., citation instructions go out of date), and are dispersed throughout the software development ecosystem (e.g., tutorials, landing pages, repositories, citation files). It is also clear that article authors may not comply with preferred citation instructions and may instead attempt to associate their use of a given software package with a particular scientific finding-some articles mention software because that software was mentioned in a related study, while other software is identified using articles that are not "software papers." Moreover, our case study demonstrates that there are many reasons why article authors choose to use unresolvable links when they mention software. Article authors may want to point out relevant tutorial material or identify software affiliated with a larger project but do not know the best way to cite the software itself in these instances. Similarly, article authors may want to refer to a software package generally, rather than pointing to a specific version of a software package. This finding demonstrates the value of understanding what motivates astronomers to cite software and highlights the need for nuanced instruction and resources that make it clear when and how it might be appropriate to identify software using something other than (or in addition to) a persistent identifier. 
Existing in-text software citation guidance in astronomy, as well as recently created software citation checklists (Chue , could therefore be expanded to support more complex problem-solving in this area. Publishers should also adapt general in-text examples 44 to their specific review procedures and policies. The need to refer to a software package generally also emphasizes how important it will be for software authors to discuss and explicitly define software author lists when describing the "concept" of a software package versus a specific software release. Improved tooling may be needed to accommodate these preferences, as our study shows there is little evidence to suggest that software authorship choices like these could be satisfactorily automated. For example, it would be valuable to give authors more control over authorship lists associated with their software's "concept" DOI on Zenodo. Currently, concept DOIs on Zenodo resolve to a software package's most recent software release and provide no explanation as to how to define an authorship list when citing a concept DOI. Although Zenodo plans to change this method of representing software as a concept, 45 it is not obvious what article authors should do in the meantime or how much control software authors will have over authorship decisions like this in the future. The use of general project names instead of authorship lists (e.g., "AstroPy Collaboration") may also warrant further explanation.
Tools for generating and managing machine-actionable software metadata files (e.g., CodeMeta 46 ) throughout software development and archival processes will also be important when responding to the software identification issues outlined in this study. Incorporating tooling to generate software metadata files into development workflows will help prevent software metadata from being inconsistent across platforms, in addition to helping stop the further proliferation of software acknowledgment messages that do not include bibliographic entries. Building these tools into software archival processes will correspondingly ensure that structured metadata is added at the point of deposit even if it is missing during software development. The adoption of structured metadata files will also make tools like "CiteAs" 47 more helpful to the astronomy community. Tooling for generating and managing software metadata will be needed throughout other aspects of the article publishing landscape as well, but higher priority may need to be given to tools that directly support software authors who have not traditionally been responsible for cataloging their own work. Similarly, tools to help article authors generate proper citations using software metadata files will also be important when responding to the software identification issues outlined in our study. The R function "citation(pkgname)" (R Core Team 2020) is one example of this type of tooling: the "citation(pkgname)" function generates BibTeX for R packages using metadata provided by software authors 48 and could be the basis for creating similar functions for Python, the open programming language most commonly used in astronomy. 49
One other issue our results brought into focus is how widespread the practice of citing papers to represent software really is. This practice is long established and not likely to fade if publisher policies and reviewer instructions do not change reciprocally.
For instance, when referring to software papers and software DOIs, the AAS Software Policy states: "Ideally, both forms of citation should be included. The former extends credit to the authors for their publication and tells the reader where to learn about the software. The latter gives the reader access to the exact version of the software used in the project" (AAS Editorial Board 2018).
The language in this policy is problematic in that it does not require article authors to cite DOIs when they exist (a practice that would require additional editorial attention), and it emphasizes the idea that credit is tied to traditional publications, whereas software DOIs simply enable access to software. Access to specific versions of software is important, but it is not obvious to the astronomy community that access to specific software is what enables specific people to be given credit for their work. Further, citing a specific software version is not always the article author's goal (i.e., citing software as a concept). Citing a specific piece of software is also not as straightforward as always citing a software DOI. If an article author would like to cite a software package that does not have a specific DOI, our case study shows that article authors are likely to mention general project landing pages that do not credit any specific software authors.
Further, software policies that offer more specific guidance may be valuable to the community given the range of "citation" behaviors identified in our study and emphasized by the diversity of locations in which software aliases were found. For this reason, publisher policies and author support materials may need to make more overt the distinctions between software citations and other attribution mechanisms. For example, although AAS Editorial Board (2018) states that article authors may wish to include links to code repositories "alongside" formal references, it may not be obvious that "formal references" require bibliographic entries, whereas links to code repositories may be more appropriately placed in footnotes. The AAS Software Policy also suggests that article authors create a "software section" in their papers stating that this section is, "analogous to acknowledging a major facility or instrument and is done for the same reason, to give credit to a project which is generally useful for the community." It is not clear though when this kind of acknowledgment section should be used instead of (or in addition to) a formal reference.
Figure 2. "Preferred citation method" directs the user to an institutional website where preferred citation information is available.
45 See "Where does the Concept DOI resolve to?": https://help.zenodo.org/#versioning.
46 https://codemeta.github.io/
47 CiteAs is a search engine for finding citation information for diverse research products: http://citeas.org/.
48 https://www.r-bloggers.com/how-to-cite-packages/
49 The proprietary programming language MATLAB developed by MathWorks plays a niche role in astronomy, but MATLAB's File Exchange citations provide another example of useful tooling: https://www.mathworks.com/matlabcentral/about/fx/#Citations.
The assumption that the astronomy community innately and thoroughly understands the technologies that drive scholarly publication systems (because they interact with them regularly) could be why these explanations are not given more explicitly. It would therefore be valuable to incorporate software citation guidance into other resources related to scholarly publishing that are not linked to specific publisher guidelines.

Future Work
This case study should serve as a baseline for future studies of how various interventions impact software citation implementation in astronomy. In addition to expanding on this initial case study through work related to Asclepias, 50 it will be important from here to map past software citation practices to new recommendations. In developing new recommendations, it will be essential to work with the astronomy community and publishers to review them. Creating new recommendations without community input could result in specific needs being overlooked and in effort being dedicated to metadata-related tools and resources that cannot be easily incorporated into software development and archival workflows. Input from the community will also help prioritize potential projects to support software citation implementation, inform how funding might be sought to enable those projects, and help determine updates to existing publisher policies.
It is also important to recognize that software citation standards are still being normalized in all disciplines, and software developed for one discipline may be used in entirely different contexts. The astronomy community therefore stands to instigate trends that could ripple throughout the broader scholarly communication landscape, and it should work to ensure all tools developed for the astronomy community are open and as modular as possible. This will help other disciplines use them and could potentially support federated software identification across platforms.