Curated Archiving of Research Software Artifacts: Lessons Learned from the French Open Archive (HAL)

Software has become an indissociable support of technical and scientific knowledge. The preservation of this universal body of knowledge is as essential as preserving research articles and data sets. In the quest to make scientific results reproducible, and pass knowledge to future generations, we must preserve these three main pillars: research articles that describe the results, the data sets used or produced, and the software that embodies the logic of the data transformation. The collaboration between Software Heritage (SWH), the Center for Direct Scientific Communication (CCSD) and the scientific and technical information services (IES) of the French Institute for Research in Computer Science and Automation (Inria) has resulted in a specified moderation and curation workflow for research software artifacts deposited in HAL, the French global open access repository. The curation workflow was developed to help digital librarians and archivists handle this new and peculiar artifact: software source code. While implementing the workflow, a set of guidelines has emerged from the challenges and the solutions put in place to help all actors involved in the process.


Introduction
Modern research relies on software, but software has only recently gained recognition. While strategies for preserving articles and even data are already the norm, software is still a unique artifact for which it is rare to find dedicated deposits and preservation mechanisms in institutional repositories (Milliken, 2019). We need to preserve source code alongside scientific articles and datasets to scaffold future work on top of these open science pillars. As declared in the Inria/UNESCO Paris call: 'Recognise software source code as a fundamental research document on a par with scholarly articles and research data;' (UNESCO-Inria, 2019). Today, software is still too often considered as just data, even though data is gathered through observations or experiments, whereas software is a product of human ingenuity, written by authors and contributors, and embodying the logic of the data transformation. As mentioned in (Alliez et al., 2019), it is challenging to determine who should get credit for the software and which authority has the capability of doing so. Software can be designed and developed by a large number of contributors, with a rich development history and a complex web of dependencies. This is why software source code should be considered a research output category of its own. We need to establish preservation strategies to capture both the scientific knowledge it contains and the metadata needed to comprehend its context.
To ensure preservation of source code, three actors in the French and international research community have collaborated to provide a place for researchers to deposit their source code.

Hyper Articles en Ligne a.k.a HAL
The first actor in this collaboration is HAL, the French national open access repository, created in 2000 by the French National Centre for Scientific Research (CNRS) and maintained by the Center for Direct Scientific Communication (CCSD), intended to provide tools for the open archiving and dissemination of scientific outputs. HAL is a repository where researchers can deposit their academic outputs in compliance with their copyrights. Since its creation, HAL has supported different types of deposits: publications, documents (e.g. pre-prints and reports), ArXiv's example and implement a sophisticated moderation workflow in order to ensure that quality metadata is attached to every deposit into the platform.
In order to extend the existing HAL moderation workflow to support deposits of research software, a similar workflow had to be implemented to handle the following aspects: artifact attribution, classification, compliance with metadata requirements, and appropriate content. As described in detail in (Alliez et al., 2019), keeping humans in the loop, similarly to the ArXiv moderation (ArXiv moderators, 2019), is essential to obtain quality metadata and better credit attribution.

The submission form
Contributors must fill out a descriptive metadata form on submission, to ensure the most accurate information about the source code is captured. The metadata is used for moderating the submission and is preserved with the software in both HAL and the SWH archive.
The design of the form was adapted from the pre-existing deposit form for scientific articles; see figure 3, where you can choose the software type and add a software license. The HAL metadata schema included terms that are applied to all deposits (e.g. author, title and keywords). However, it was not sufficient to describe software artifacts: software requires more specific elements to adequately describe its complexities. We surveyed the software vocabulary landscape for a vocabulary adapted to scientific software, and we found that the CodeMeta vocabulary was a perfect fit. A refinement of the schema.org classes SoftwareApplication and SoftwareSourceCode, it provides a convenient bridge with linked data and the semantic web. In addition, the core metadata for software is compliant with existing standards like TEI and Dublin Core.
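To illustrate what a CodeMeta description of a deposit can look like, the following minimal sketch builds a JSON-LD record in Python. The helper name `build_codemeta` and all sample values are hypothetical, and the mapping actually used by HAL may differ; only the property names and the context URL come from the CodeMeta and schema.org vocabularies.

```python
import json

# JSON-LD context published by the CodeMeta project (version 2.0).
CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"


def build_codemeta(name, description, authors, license_id, keywords=()):
    """Return a CodeMeta-style dict describing a source code deposit.

    `authors` is a list of (given_name, family_name) pairs and
    `license_id` is an SPDX identifier such as "MIT".
    This helper is illustrative and is not part of HAL or SWH.
    """
    return {
        "@context": CODEMETA_CONTEXT,
        "@type": "SoftwareSourceCode",
        "name": name,
        "description": description,
        "author": [
            {"@type": "Person", "givenName": given, "familyName": family}
            for given, family in authors
        ],
        "license": f"https://spdx.org/licenses/{license_id}",
        "keywords": list(keywords),
    }


# Hypothetical sample deposit.
record = build_codemeta(
    name="example-tool",
    description="A small research prototype.",
    authors=[("Ada", "Lovelace")],
    license_id="MIT",
    keywords=["example", "research software"],
)
print(json.dumps(record, indent=2))
```

Serialising the record as `codemeta.json` keeps the description next to the source code, so that it can travel with the archive.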
In Table 1, we compare the HAL metadata terms with the following legend: regular text marks a term that already existed for an article deposit; bold text marks a term that is mandatory for a software source code deposit; italic text marks a term specifically added for software.

The software deposit guidelines
We identified that a set of requirements beyond this submission form was needed to curate software deposits. To this end, we have created two user guides, one for the researchers who submit the software (Gruenpeter and Sadowska, 2018a), and one for the digital archivists in charge of the moderation (Gruenpeter and Sadowska, 2018b).
When researchers want to archive and share their code as a citable artifact, they can submit it either to the main HAL instance or to a specific institutional instance (e.g. Inria's instance).
No matter where the deposit lives, all materials are discoverable on the central HAL instance.
In the current implementation, researchers must provide a compressed archive, containing the source code (mostly text files).
Researchers are asked to prepare the software source code archive before submission by adding the following files:
README - The elements that we require and recommend in the README file were taken from the "Best Practices on How to Release Software" (Raymond E. S., 2000) and are grouped into MUST include, SHOULD include and CAN include elements.
To help researchers and ensure uniformity of the submitted metadata, we have added autocompletion for the license property, using normalised terms directly extracted from the SPDX reference standard, developed and maintained by the software industry.

Curating software -including humans in the loop
The professionals curating deposits into HAL are librarians and archivists. They are employed by specific institutions, if the institution has authority over its institutional repository (e.g. Inria and University of Lorraine), or directly by the CCSD, which operates HAL and all attached services. The curation of deposited digital artifacts is one of the roles they assume as information experts. Most of these librarians and archivists have a background in academic institutions, and curating these deposits is one of their key responsibilities.
The process of moderating source code deposits requires human intervention, which leads to direct interactions between the submitting researcher and these curators.
These consultations center around the metadata attached to the deposit rather than the source code itself, although a light inspection of the code is done to ensure the metadata describing it is correct.
Functional or scientific evaluation of the artifact is not in the scope of the moderation process put in place for HAL software deposits: that role belongs not to repositories or archives, but to reviewing committees. These committees might review software to verify installation instructions, documentation, functionality and tests. Examples of how this is done can be seen by looking at the Image Processing On Line journal (IPOL team, 2019), which has been publishing software implementing image processing algorithms for almost a decade, or the Journal of Open Source Software, which includes many of these criteria in its review guidance documentation (JOSS team, 2019).
A growing number of conferences have an Artifact Evaluation Committee (AEC) that evaluates the software artifacts associated with the submitted articles. For example, the POPL conference has had an Artifact Evaluation Process (AEP) since 2015, where the AEC checks for consistency with the paper, good documentation, and reusability for further research. Artifact evaluation is now also encouraged by the Association for Computing Machinery (ACM) with the ACM badges, which can be awarded if the evaluation criteria are met.
By contrast, the HAL moderation process only verifies the accuracy of the descriptive information regarding a deposited software source code artifact and the accuracy of its attribution. During the process, the digital archivist also inspects the artifact to check that the content included in the archive does fit a research deposit. The deposit will not be reviewed in the academic sense of the term, so the functionality of the source code and its reproducibility are not verified.
In figure 4, the contribution and moderation workflow is detailed with the actions that each actor will take to ensure proper archiving of source code. First, the contributor (who can be a researcher, a team member or an institutional representative in charge of the contribution) will prepare the artifact as detailed in the software deposit guidelines, upload the compressed archive, and add metadata on the submission form. Then, the moderator will review the deposit by verifying that the metadata matches the artifact itself and the values in the submission form. The moderator will also check for extraneous content, for example videos, images, or other material that is unlikely to be part of a software source code bundle. If the contributor has listed a code repository, the moderator will verify that the authors of the deposit and of the code repository are the same, even if they use pseudonyms, to ensure due credit is given. Our experience over the first two years of operation shows that, with the support of the guidelines, the software moderation process does not add greatly to the workload of digital archivists, and can be performed by them.
The IES-Inria and CCSD teams, which play the role of digital archivists for the HAL platform, are used to working with articles, reports and other textual deposit types. The software deposit was very different from what they were used to reviewing. When establishing the requirements for a software deposit, we realized that there is no need, at this point, to act as an AEC and verify the functionality, quality and reproducibility of the artifact itself.
Therefore, the main actions the digital archivist performs while reviewing a software deposit are:
- detecting extraneous or abusive content (illegal or harassing),
- verifying consistency between the metadata and the software source code itself,
- completing or correcting the deposit metadata if needed.
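Parts of these checks lend themselves to simple tooling that supports, but does not replace, the moderator. The following sketch is an assumption about how such helpers could look, not a description of the tools used by HAL: the suffix list and the function names `flag_extraneous` and `unmatched_authors` are hypothetical.

```python
from pathlib import PurePosixPath

# Suffixes that are unusual in a source code bundle (illustrative list only).
EXTRANEOUS_SUFFIXES = {".mp4", ".avi", ".mov", ".png", ".jpg", ".jpeg", ".gif"}


def flag_extraneous(paths):
    """Return the archive members whose suffix suggests non-source content."""
    return [
        path for path in paths
        if PurePosixPath(path).suffix.lower() in EXTRANEOUS_SUFFIXES
    ]


def _normalise(name):
    """Lower-case a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def unmatched_authors(form_authors, repository_authors):
    """Return authors from the form that do not appear in the repository list."""
    known = {_normalise(name) for name in repository_authors}
    return [name for name in form_authors if _normalise(name) not in known]


members = ["src/main.c", "demo/video.MP4", "docs/logo.png", "README"]
print(flag_extraneous(members))
print(unmatched_authors(["Ada  Lovelace", "Alan Turing"], ["ada lovelace"]))
```

Flagged files and unmatched names are only candidates for review: an image can legitimately belong to the documentation, and a pseudonym cannot be resolved by string comparison, which is one reason why a human stays in the loop.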
During the review process, the digital archivist can request modifications to the deposit from the contributor using a request ticket system, which provides a channel with pre-written responses for identified recurrent issues.
During the test phase, communicating with the contributors and researchers about their deposits enriched the curation process and helped create better specifications for the HAL software source code deposit guidelines.

Transferring source code from HAL to SWH
The HAL platform had already implemented transfers of content to ArXiv via the SWORD protocol, as described in HAL's documentation (CCSD Development team, 2017). The integration between HAL and SWH has been designed and implemented using the same protocol.
The deposit is automatically pushed to SWH after a moderator has validated the submission. On reception, the deposit is verified by an automated tool. If the verification passes, the deposit is published on HAL's platform and scheduled for ingestion in the SWH archive. Otherwise, a detailed error is returned.
The SWORD 2.0 (Jones and Lewis, 2013) implementation provides the technical interface between a client (HAL) and a server (SWH) to push deposits of software source code with associated metadata; it is described in the API documentation (Software Heritage Development team, 2017). First, when a deposit arrives at SWH, an automated verification ensures that the artifact contains a compressed archive and the associated metadata. After it is verified, the ingestion of the content into the archive starts, as illustrated in figure 5.
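The verification step described above can be pictured with a short sketch. It is not the Software Heritage implementation: the function name `verify_deposit`, the status strings and the list of required metadata fields are assumptions made for illustration, and only ZIP and tar archives are recognised here.

```python
import io
import tarfile
import zipfile

# Metadata fields assumed to be required for the purpose of this sketch.
REQUIRED_METADATA = ("title", "author")


def _is_tar(archive_bytes):
    """Return True if the bytes can be opened as a (possibly compressed) tar."""
    try:
        with tarfile.open(fileobj=io.BytesIO(archive_bytes)):
            return True
    except tarfile.TarError:
        return False


def verify_deposit(archive_bytes, metadata):
    """Return ("verified", []) or ("rejected", [detailed error messages])."""
    errors = []
    if not (zipfile.is_zipfile(io.BytesIO(archive_bytes)) or _is_tar(archive_bytes)):
        errors.append("archive is not a readable zip or tar file")
    for field in REQUIRED_METADATA:
        if not metadata.get(field):
            errors.append(f"missing metadata field: {field}")
    return ("verified", []) if not errors else ("rejected", errors)


# Build a tiny valid archive in memory and verify it with sample metadata.
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w") as _archive:
    _archive.writestr("README", "example\n")
good_archive = _buffer.getvalue()

print(verify_deposit(good_archive, {"title": "example-tool", "author": "A. Lovelace"}))
print(verify_deposit(b"not an archive", {"title": "example-tool"}))
```

Returning the full list of problems, rather than stopping at the first one, matches the idea of a detailed error that the client can relay to the contributor.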
During the ingestion of the software artifact, SWH computes an intrinsic identifier, the SWH-ID, using a cryptographic signature of the software artifact; see Di Cosmo, Gruenpeter and Zacchiroli (2018) for a detailed explanation.
This SWH-ID does not depend on a resolver and makes it possible to identify the deposit regardless of future developments and organizational changes. The SWH-ID is presented alongside the HAL-ID on the software artifact view on the HAL platform.
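To make the notion of an intrinsic identifier concrete, the sketch below computes the identifier of a single file's content. It assumes the published SWH-ID scheme for contents, in which `swh:1:cnt:` is followed by the Git-compatible SHA-1 of the file bytes. The identifier displayed on HAL refers to a directory; its computation also hashes the tree structure and is not reproduced here.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the content-level SWH-ID of `data`.

    The digest is the Git blob hash: SHA-1 over the header
    b"blob <length>\\0" followed by the raw bytes.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


source = b"print('hello')\n"
identifier = content_swhid(source)
print(identifier)

# The identifier depends only on the bytes: the same content always maps to
# the same identifier, and any change to the content produces a different one.
assert content_swhid(source) == identifier
assert content_swhid(source + b"#") != identifier
```

Because the identifier is derived from the content alone, anyone holding a copy of the file can recompute and check it without contacting a registry.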

The software view
The deposited software artifacts are accessible on the HAL platform in a specific software view, presented in figure 6, which shows the complete metadata record and offers several services: TEI, Dublin Core or BibTeX exports, and the link to the browsable source code on SWH (figure 7).

From test phase to global integration
After we defined the specifications and requirements for the software source code deposit, the CCSD and SWH engineers built a prototype which was only accessible on HAL-Inria, and provided a first test of a software deposit and the HAL to SWH integration.
Between February 2018 and July 2018, a panel of researchers was invited to test the software deposit, as described in (Barborini et al. 2018). Their feedback was integrated into the final version and contributed to improving the deposit guidelines. Throughout this period, the IES-Inria digital archivists tested the moderation process. With their input, a few ergonomic changes were made to the moderation view and to the standardised responses used to request changes from submitters. During the test phase, 12 software artifacts were uploaded.
The test phase was incredibly valuable for creating and consolidating specific guidelines for the contributors and for the moderators.
The official opening of the software artifact deposit for all HAL instances was on the 25th of September 2018 and was reported by the local press.
As of December 2019, we count 80 source code deposits and 98 software record deposits, which is a promising start for curating software deposits as a research output.

Deposits without source files
During the test phase, researchers could also deposit metadata records about source code without the source code itself, similar to "bibliographic records." Occasionally, users have chosen to deposit only descriptive information about a software artifact, because they needed the reference to the software record in their activity reports. The clear drawback is that it is impossible for the digital archivist to check the information deposited. One approach is to prevent software deposits without the software source code itself, which would be a compressed static archive without its development history.
While this approach is reasonable for researchers that do not use collaborative development platforms, it turns out to be an annoyance for those that have made their software source code available online, or even archived it already in SWH.
The next version of software deposit in HAL should make it possible to provide just the link to SWH, or to the code repository, from which SWH can fetch the source code, instead of uploading a compressed archive. This would lower the barrier for software deposits into HAL.

Open issues
We have handled a variety of deposits since the service opened, and discovered interesting corner cases that led us to evolve our software deposit policy:
- Collective authorship: sometimes we receive the request to use the team name as the software author, instead of providing the full list of contributors. We are evaluating a solution that supports one collective author and, at the same time, a sort of "corresponding author" for managing the deposit. We also keep in mind that authorship can be established only with a clear link between a person and a deposit, which is difficult with collective authorship.
- Legacy software: software that was created a long time ago should be archived in its original state, but it would be useful to add extra information to describe its origin. We are working on a dedicated standard for this particular use case.
- Software collections: sometimes researchers try to deposit a single archive containing many different software tools or software libraries.
- Research experiments that do not really qualify as a software tool on their own: for this particular use case, the researchers usually only need long-term archival and intrinsic identifiers. We plan to refer them to the dedicated guidelines for source code archival and reference available on the Software Heritage website (Di Cosmo, 2019).
- Software source code deposits that include large datasets, instead of a reference to a separate data deposit.

The importance of a software license
During the test phase, the license of the software was not a mandatory metadata element and the user form did not instruct users how to choose a license. As could be expected, this led to deposits with many variations in the software license names and even deposits without a license. Hence we made the license mandatory, and we now provide autocompletion for license names using the standard list developed by the SPDX project of the Linux Foundation for a large consortium of industry players.
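The effect of autocompletion on metadata uniformity can be illustrated with a short sketch. It is not the HAL implementation: the function name `suggest_licenses` is hypothetical, and the list below is only a small sample of identifiers from the SPDX license list rather than the complete list.

```python
# A small sample of SPDX license identifiers (the real list is much longer).
SPDX_SAMPLE = [
    "Apache-2.0",
    "BSD-2-Clause",
    "BSD-3-Clause",
    "CECILL-2.1",
    "GPL-2.0-only",
    "GPL-3.0-only",
    "GPL-3.0-or-later",
    "LGPL-3.0-only",
    "MIT",
    "MPL-2.0",
]


def suggest_licenses(prefix, known=SPDX_SAMPLE, limit=5):
    """Return up to `limit` identifiers starting with `prefix` (case-insensitive)."""
    needle = prefix.strip().lower()
    if not needle:
        return []
    return [name for name in known if name.lower().startswith(needle)][:limit]


print(suggest_licenses("gpl-3"))
print(suggest_licenses("bsd"))
```

Whatever the contributor types, the value stored with the deposit is one of the normalised identifiers, which avoids the spelling variations observed during the test phase.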

Publishing versus sharing
Research software has been around for decades, and some research institutions have a long experience in managing it as a valuable output of research (Alliez et al., 2019), but attention has started to grow in the broader scholarly ecosystem only very recently. This new interest has spawned a rich discussion about what a software publication could actually be. In this context we would like to stress that, in the scholarly world, a precise semantics is attached to the term publication: an academic publication is a research result that has been qualified through some form of peer review; a result that has been simply shared, for example by making it available somewhere on the Internet, is usually not regarded as a publication. Software is, in its vast majority, developed outside of academia, and for open source software in particular it is common practice to share it broadly on code hosting platforms like GitHub, GitLab, and many others. This act of sharing does not carry the same meaning as the act of academic publishing, and code hosting platforms do not play at all the same role as publishers in the academic world.
Hence we should refrain from using the term "publication" when we talk about software that is simply shared on the Internet, even when its source code is deposited in institutional archives. The research community is still exploring exactly how to handle software when it comes to credit and academic recognition, with various ongoing experiments such as the AEC, IPOL, the Journal of Open Source Software (JOSS team, 2019; Smith et al., 2018), the Dagstuhl DARTS series, ACM Badges, etc.; it is up to researchers to reach an agreement on this very sensitive issue.
For this reason, in the metadata for software deposited via HAL, we do not indicate HAL as a publisher.

Keeping the human in the loop
Even if we do not know yet what should qualify as a software publication, we do know that we need quality metadata to describe research software and to cite software artifacts. We argue that this requires human intervention, and that it is not enough to just share software on code hosting platforms like GitHub, or self-archive it on repositories like Figshare or Zenodo. This is why a moderation process is put in place for deposit in HAL and archival in SWH: to ensure that the deposit is a software artifact that reflects a scientific endeavour and that due credit is attributed to all authors of the software, without a quality and functionality review of the source code.

Software Identification, reference and citation
We follow the Software Citation Principles (Smith et al. 2016) to create a citation for software deposits into HAL. In figure 8, we propose a citation format containing metadata submitted with the software deposit, which is already available on the HAL platform. In the citation format, two identifiers are used: the first, the HAL-ID, for the research product, and the second, the SWH-ID of the root directory containing the complete development tree, for the software source code itself. While the HAL-ID identifies the metadata and thus the attribution of the research product, the SWH-ID references the exact version of the software source code associated with the deposit. Each identifier caters to different use cases.
At the moment we are working on a proposal for a specific BibTeX @software entry, as already introduced in BibLaTeX (Kime, Wemheuer and Lehman, 2019), to provide a better BibTeX export on the HAL platform. The proposal is developed with Inria's citation working group and will be shared with FORCE11's Software Citation Implementation WG and the RDA Software Source Code IG for feedback.
The proposal development is public and can be viewed and commented on in its dedicated repository.
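As a rough illustration of what an export with both identifiers could produce, the sketch below formats a BibLaTeX-style @software entry. It does not reproduce the proposal under development: the function name `software_bibtex`, the choice of the `note` field to carry the SWH-ID, and every sample value (including the placeholder identifiers and URL) are assumptions.

```python
def software_bibtex(key, authors, title, year, version, url, swhid):
    """Return an @software entry as a string.

    Values are inserted as-is, so callers are assumed to pass text that
    needs no BibTeX escaping.
    """
    fields = [
        ("author", " and ".join(authors)),
        ("title", title),
        ("year", str(year)),
        ("version", version),
        ("url", url),    # landing page of the deposit (carries the HAL-ID)
        ("note", swhid), # assumption: SWH-ID stored in the generic note field
    ]
    lines = [f"@software{{{key},"]
    lines += [f"  {name} = {{{value}}}," for name, value in fields]
    lines.append("}")
    return "\n".join(lines)


# Placeholder values only; they do not refer to a real deposit.
entry = software_bibtex(
    key="doe2019tool",
    authors=["Doe, Jane", "Roe, Richard"],
    title="Example Tool",
    year=2019,
    version="1.0",
    url="https://example.org/hal-00000000",
    swhid="swh:1:dir:" + "0" * 40,
)
print(entry)
```

Keeping the two identifiers in separate fields reflects their different roles: one points to the curated metadata record, the other to the exact source code that was archived.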

Conclusion
Decades of experience handling research projects at Inria have shown that a proper moderation process is important to ensure the high quality of the metadata associated with research software artifacts. To support this process, the collaboration between Software Heritage, Inria and HAL has created tools and guidelines that enable digital archivists to efficiently handle research software deposits, and offers HAL users dedicated services to help them preserve and disseminate their software artifacts. We believe that this is an important step forward in the long journey to make software a first class research output in the scholarly ecosystem. On the HAL-CCSD-Inria-SWH collaboration roadmap, there are a few milestones ahead: allowing the deposit of metadata with a link to a code repository which will be archived in SWH, or with a direct reference to a SWH artifact through its SWH-ID; exporting the BibTeX format with a complete @software entry; exporting other software citation formats (e.g. codemeta.json); improving links between teams, people, articles and data and software deposits; and improving the researchers' CV export with software research outputs. We believe that these improvements will encourage researchers to share their software and benefit the research and digital curation communities.

Figure 1. The Open Science pillars for sharing articles, data and software.

Figure 3. The software deposit form on the HAL-Inria instance platform

README elements:
- MUST include: name of the software/project; a brief description of the project
- SHOULD include: project website or documentation pointer; authors/credits list (if not in AUTHORS file); license (if not in LICENSE file); contact and support
- CAN include: list of features; developer's build environment; build, installation, requirements - how to run the code; usage - how to use the source code; recent project news; visual

Figure 4. The moderation process when reviewing a software artifact for archival

Figure 5. The deposit status on the Software Heritage archive

Figure 6. A software deposit on the HAL platform

Figure 8. The proposed citation for software artifacts on the HAL platform.

Table 1. The descriptive metadata to ensure an accurate description of the source code artifact