Identifying scientific publications countrywide and measuring their open access: The case of the French Open Science Barometer (BSO)

Abstract We use several sources to collect and evaluate academic scientific publication on a country-wide scale, and we apply it to the case of France for the years 2015–2020, while presenting a more detailed analysis focused on the reference year 2019. These sources are diverse: databases available by subscription (Scopus, Web of Science) or open to the scientific community (Microsoft Academic Graph), the national open archive HAL, and databases serving thematic communities (ADS and PubMed). We show the contribution of the different sources to the final corpus. These results are then compared to those obtained with another approach, that of the French Open Science Barometer for monitoring open access at the national level. We show that both approaches provide a convergent estimate of the open access rate. We also present and discuss the definitions of the concepts used, and list the main difficulties encountered in processing the data. The results of this study contribute to a better understanding of the respective contributions of the main databases and their complementarity in the broad framework of a countrywide corpus. They also shed light on the calculation of open access rates and thus contribute to a better understanding of current developments in the field of open science.


INTRODUCTION
Open access to publications (e.g., Laakso & Björk, 2012; Piwowar, Priem et al., 2018) within the general framework of Open Science is now an issue shared by many institutions, universities and research organizations, and funders. France is no exception: Two national plans for Open Science have been successively launched, in 2018 and 2021, by the Ministry of Higher Education, Research and Innovation (MESRI). Generalizing open access to publications is the first axis of these two plans, with a goal of 100% of  German Open Access Monitor (OAM), the Danish Open Access Indicator, or the COKI Open Access Dashboard. Other countries have also adopted national strategies for monitoring open access (Carvalho, Laranjeira et al., 2017).
In its guide to assisting research organizations and funders in setting up a tool for monitoring open access publications (Philipp, Botz et al., 2021), the organization Science Europe considers the constitution of the corpus of publications to be analyzed as one of the key stages in the process. We could add that it is even one of the major challenges of this exercise. Indeed, no database provides an easy and complete answer to this question. The large databases, such as the Web of Science (WoS) and Scopus, have the advantage of systematically listing a large part of the millions of scientific publications published each year in the world. The metadata are standardized and allow for efficient searching. However, the coverage of science, technology, and medicine (STM) and of English-language publications in international journals is privileged, while other disciplinary fields, other languages of publication, and other sources or document types are less fully surveyed (Mongeon & Paul-Hus, 2016; Van Leeuwen, Moed et al., 2001; Vera-Baceta, Thelwall, & Kousha, 2019). Moreover, these databases are accessible only by subscription, so their data are not open or reusable. If we consider thematic databases such as PubMed or NASA/ADS, their metadata are both high quality and open. On the other hand, they cover a very specific disciplinary field: An exhaustive census of publications in a multidisciplinary context will therefore require multiple sources.
As for open archives, while they have the advantage of listing types of publications, languages, and sources that are often absent from large databases, they offer insufficiently standardized metadata, which complicates their collection and processing. Thus, no single database offers comprehensiveness, standardized metadata, and openness. As Huang, Neylon et al. (2020) conclude in a recent article: "Any institutional evaluation framework that is serious about coverage should consider incorporating multiple bibliographic sources." Current Research Information Systems (CRIS) can be a way around this difficulty, provided that they are not fed solely by the large commercial databases mentioned above. They are increasingly being used in universities to help manage, understand, and evaluate research activities. However, most CRIS are, today, still used only at an institutional level (Sivertsen, 2019). Although their aggregation at the country level to constitute a national base is progressing, it is still most often correlated with the implementation of a public funding policy based on scientific publication performance, as is the case in Denmark, Finland, Hungary, Italy, Norway, and Poland (Puuska, Nikkanen et al., 2020). While the motivation is primarily financial, a national database is also an opportunity to set up effective monitoring of open access policies at the country level, as Finland has experimented with (Pölönen, Laakso et al., 2020).
For countries that do not have such a pool of data, the implementation of a monitoring tool on this scale implies selecting from among the existing databases, whether commercial or not, those that will best meet the objective set. The German Ministry of Education and Research has thus chosen to use the Dimensions and WoS databases to establish its corpus. Universities UK, the association of 140 UK universities, has chosen to use Scopus to produce its latest report on the effects of new policies to promote open access.
In the case of France, the objective of the MESRI was to set up a tool that would enable the steering of the national policy on open science, by measuring, on an annual basis, the level of open access of all publications with at least one French affiliation. This request was accompanied by a very specific requirement: "a transparent methodology and reproducible results." It is with this in mind that the French Open Science Barometer (BSO) was carried out, as described by Eric Jeangirard (2019). For the BSO, the constitutive choice is to use only open sources. The methodology used consists in scanning all the papers referenced in Unpaywall and in the national open archive HAL (see below) to identify either the French authors or the presence of the mention of France in the affiliation. The publications thus identified were then enriched with information on their scientific discipline, using natural language processing (NLP), also based on open source code, to determine, from the title, the discipline to which a document belongs. Finally, the open access status was determined using the Unpaywall database. The corpus obtained by this strategy is available in open access from the MESRI OpenData portal. In accordance with the recommendations made at the European level (Open Access Monitoring: Philipp et al., 2021), the French National Open Science Barometer is published on an annual basis.
About 150,000 publications are thus identified each year by the BSO. The purpose of this study is to consider an alternative approach, this time based on the use of the main open or nonopen bibliographic databases, and to analyze the extent to which this new corpus differs from that of the BSO. Our approach is based on the use of six complementary sources, namely WoS, Scopus, Microsoft Academic Graph, PubMed, NASA/ADS, and the HAL open archive, to identify and assess academic scientific publication at the scale of a country, in this case France, for publications released during the 6 years 2015-2020. As the year scale seemed to us more relevant to characterize scientific production, we chose to highlight, in the context of this article, the data related to the year 2019. We then compare the corpus obtained with that of the BSO, and we show to what extent the diversity of the sources used makes it possible to refine the identification and characterization of French scientific production, as well as the estimation of the open access rate.
While there is an abundant literature on the comparison between Scopus, WoS, and other generalist databases (see, for example, in a national production context Archambault, Campbell et al. [2009], Bartol, Budimir et al. [2014], and Moed, Markusova, and Akoev [2018], or for a statistical comparison of large reference databases Mongeon and Paul-Hus [2016], Pranckutė [2021], and Visser, van Eck, and Waltman [2021]), our study provides a detailed quantitative view in the specific context of French research. Far from identifying a source that would be optimal, our study shows the importance of diversifying the sources used to provide complementary views on a country's publication.

Scientific publications
We consider here scientific publications indexed in databases (private or public) and accessible in open archives. All types of documents are taken into account. This primarily concerns articles, generally published in international peer-reviewed journals, but also conference proceedings, book chapters, or any other publication, provided that it has a DOI. Limiting the corpus to documents with a DOI is, however, an important restriction, which we must explain here.
To facilitate the aggregation of results, and to avoid duplication, we have chosen, as does the BSO (French Open Access Monitoring), to restrict the cross-referencing of data to publications identified by a DOI. This step is necessary to allow the efficient cross-referencing of documents identified in each database by their DOI identifier, common to all databases. In addition, the Unpaywall database, which will inform us about open access in the next step, only lists publications with a DOI.
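As an illustration of DOI-based cross-referencing, the following minimal sketch merges DOI lists from several sources into one deduplicated set. The helper names `normalize_doi` and `merge_by_doi` are hypothetical, and the normalization step (lowercasing, stripping resolver prefixes) is an assumption on our part rather than a procedure described in this study; it relies only on the fact that DOIs are case-insensitive.

```python
from typing import Iterable, Optional, Set

# Resolver prefixes that sometimes precede a DOI in exported metadata (assumed list).
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Return the DOI in lowercase, without surrounding spaces or resolver prefix."""
    doi = raw.strip().lower()
    for prefix in _PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def merge_by_doi(*sources: Iterable[Optional[str]]) -> Set[str]:
    """Union of normalized DOIs across sources; records without a DOI are skipped."""
    merged: Set[str] = set()
    for source in sources:
        for raw in source:
            if raw:  # None or empty string: no DOI, record cannot be cross-referenced
                merged.add(normalize_doi(raw))
    return merged


# Toy example: the same publication exported with different DOI spellings.
corpus = merge_by_doi(["10.1000/ABC", None], ["doi:10.1000/abc", "10.2000/xyz"])
```

In this toy example the two spellings of the first DOI collapse into a single entry, and the record without a DOI is dropped, which mirrors the restriction discussed above.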
Let us note that the requirement of the presence of a DOI immediately rules out a certain number of journals that do not adhere to this very general technology of persistent identifiers (Gorraiz, Melero-Fuentes et al., 2016); some of these journals may be, as Wang, Shen et al. (2020) point out, key journals in their discipline, with the example, for the field of Artificial Intelligence, of the Journal of Machine Learning Research.
Moreover, grey literature, under which we can group preprints, reports, theses, and in some cases conference proceedings, is often ignored by open access measurement tools, mainly for two reasons: The first corresponds to a concern to discard literature whose scientific relevance cannot be sufficiently controlled (lack of peer review); the second is rather related to technical considerations, in particular a difficulty in identifying these publications in the absence of complete and standardized metadata, especially persistent identifiers. In practice, this leads to ignoring a large proportion of the work published in certain disciplines where the thematic field, the regional vocation, or the applicative nature of the publications takes precedence over international referencing.
Our methodology, based on the use of the DOI, therefore effectively excludes some of the documents that might be of interest to us. This is why we will come back to publications without DOIs at the end of our study, by proposing an estimate of the share of grey literature in French national production (Section 5.2).
Finally, it should be noted that the publications taken into account to establish our corpus are exclusively those that have a digital version: It is this digital version for which we will try to measure the degree of accessibility. Thus, peer-reviewed research published in books or monographs is only covered when it is in digital format and has a DOI. For this reason, nonacademic publishing generally falls outside the scope of our study.

Open access
A scientific article that is only available on payment of a subscription or a fee (price per article) is considered closed. In contrast, a scientific article that is freely available, either on a publisher's website or after the deposit of the full text (in its final layout or not) on an open archive, is deemed open.
Our source of information for the open access status of an article will be the Unpaywall database (Piwowar et al., 2018), specifically the data in the "is_oa" field. If the value returned for a given publication is equal to "True," the publication will be considered open. If this value is "False," the publication will be considered closed. The so-called "bronze" status is considered open.
Note that the open access status may vary over time, because a closed publication may have its embargo lifted or be subsequently deposited in an open archive. Thus, in our study, it will be the status observed in February 2021, as recorded in the Unpaywall database snapshot for that date.
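The open/closed rule stated above can be sketched as follows. This is a minimal illustration that assumes each Unpaywall record has already been loaded as a Python dictionary; the function name `access_status` is hypothetical, and the `oa_status` key in the example is shown only to illustrate that a "bronze" record is counted as open through its `is_oa` flag.

```python
def access_status(record: dict) -> str:
    """Classify a record as 'open' or 'closed' from its boolean is_oa flag."""
    is_oa = record.get("is_oa")
    if is_oa is True:
        return "open"
    if is_oa is False:
        return "closed"
    # Missing or non-boolean value: refuse to guess.
    raise ValueError("record has no boolean is_oa value")


# Toy records (illustrative values, not taken from the study's data).
bronze_record = {"doi": "10.1000/abc", "is_oa": True, "oa_status": "bronze"}
closed_record = {"doi": "10.2000/xyz", "is_oa": False, "oa_status": "closed"}

statuses = [access_status(bronze_record), access_status(closed_record)]
```

Because the flag is read from a dated snapshot, running the same classification on a later snapshot may return a different status for the same DOI.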
Let us recall that for France, the Law for a Digital Republic of 7 October 2016 9 establishes the possibility of deposit in an open archive of the postprint of any scientific article resulting from research funded at least 50% by the state or public authorities, at the expiration of a period of 6-12 months depending on the scientific field (respectively, STM or Humanities & Social Sciences).

Sources Used to Constitute the FR-2015-2020 Corpus
The collection of metadata related to a large set of publications is facilitated by the use of databases that systematically, if not exhaustively, collect a large part of the millions of scientific publications published each year worldwide.
In this article, we have privileged the databases providing a search capability for the mention of the country in the affiliation, and we have collected the publications whose affiliation mentions the country considered in our study, France, using the corresponding query modes of six databases that, to our knowledge, effectively cover French scientific production.
We did not use the Dimensions database, as it is not considered to be a reliable source for establishing a corpus on a country scale (Guerrero-Bote, Chinchilla-Rodríguez et al., 2021).
We use the following databases in our study:
• Scopus (Baas, Schotten et al., 2020) references more than 25,000 journals and is considered one of the most comprehensive databases for international peer-reviewed journals. Query by country is possible. Metadata extraction is limited to batches of 20,000 documents. This database is available by subscription from Elsevier.
• WoS (Birkle, Pendlebury et al., 2020) has been the reference database for scientometrics since the pioneering work of Garfield (1964). The query by country is provided in the advanced query mode. This database is available by subscription from Clarivate Analytics. In this study, we use all the indexes (including ESCI: Emerging Sources) except for the Book Citation Index, which was not available to us.
• HAL (https://hal.archives-ouvertes.fr/): French researchers are invited to deposit on this platform the products of their research, whether they are publications (article in a journal, communication in a conference, chapter of a book, book, poster, file, patent), unpublished documents (prepublication, working document, report), academic works (thesis, HDR, course), or research data (image, video, software, map, or sound). The recorded documents are either in the form of a bibliographic record only or accompanied by the full text of the article. This production can be grouped within different collections or portals relating to a theme (SHS for example), a medium (images and videos), or a research structure (university, laboratory, or research team), but it remains possible to carry out queries covering all portals and collections. After 20 years of use (Berthaud, Charnay, & Fargier, 2021), more than 2,700,000 works are now recorded in this archive.
9 Law for a Digital Republic; see in particular its article 30: https://www.legifrance.gouv.fr/dossierlegislatif/JORFDOLE000031589829/.
HAL data can be queried using an advanced query or the API. The latter, which is available free of charge, allows the identification of the country of affiliation.
• The NASA/ADS database (Kurtz, Eichhorn et al., 2000) is one of the most recognized examples of a bibliographic database covering a research field: astrophysics and physics. Its query mode allows querying by country. Access is free.
• The PubMed database is one of the preferred and free access points for metadata related to biomedical science research. A query by affiliation is possible (Ibarra, Ferreira et al., 2018).
• The Microsoft Academic Graph (MAG) database (Herrmannova & Knoth, 2016; Wang et al., 2019), one of the three products of the Microsoft Research project, is one of the largest open publication and citation data sets. It is populated automatically, using bibliographic data from web pages crawled by the Bing search engine, also a Microsoft product. The data can be accessed using the Academic Knowledge API. It should be noted that MAG does not contain structured data on affiliation country. Identification of French outputs (provided by the Curtin Open Knowledge Initiative team) was by applying a query to the affiliation string (OriginalAffiliation data element from the MAG PaperAuthorAffiliations table, linked via the PaperID to the DOI) that sought to determine whether the affiliation string ended with "France" (or one of a small set of non-English names). This number may not match that in the online COKI country dashboard, which maps affiliation country from GRIDs in MAG to the country of organization in the GRID database.
Some of the characteristics of these databases as well as the number of documents obtained for 1 year (the year 2019), in the framework of the query "France 2015-2020" carried out in October 2021 are presented in Table 1.

Aggregation of Results for Publications Identified by a DOI
As mentioned above, to facilitate the aggregation of results and to avoid duplication, we have chosen, as does the BSO (French Open Access Monitoring), to restrict the cross-matching of data to publications identified by a DOI. For the HAL archive, the difficulty is that the DOI is not systematically filled in, because it is not a compulsory metadata field at the time of deposit. While only 2-3% of the documents characterized as articles in WoS or Scopus do not have a DOI recorded, this proportion rises to 22% for documents characterized as articles in HAL. In addition, the open archive contains many unpublished documents, preprints, reports, or theses that do not (or do not yet) have a DOI: Together with the book chapters, these documents represent half of the publications without a DOI, which will not be considered for the rest of the study.
However, we will return to HAL in Section 5 for a discussion of grey literature.
Note that for MAG, we had direct access to the DOI lists through the COKI team, whom we thank for their help.
One of the objectives of this study is the measurement of the share of open access to publications. For this we use the Unpaywall database 12, which is the leading database in this field (Holly, 2018; Piwowar et al., 2018).
This database offers a simplified access mode (by batches of 1,000 DOIs) which allows us to easily obtain the status of a publication (open or closed access, with the publisher and/or in an open archive) at the time of the query. It is also possible to download a complete version of the database (called a Snapshot). For this study, we used the version dated February 2021. For the year 2019, this version lists more than 6 million publications.
Querying the Unpaywall database also allows us to validate the DOIs identified in the previous step: We consider that DOIs not found in Unpaywall generally correspond to identifiers that have not been confirmed by Crossref, the agency that certifies their quality and continuity.
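The two operations just described, submitting DOIs in batches of 1,000 and setting aside DOIs that Unpaywall does not know, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `batches` and `split_validated` are hypothetical, no network call is made, and the set of DOIs known to Unpaywall is assumed to have been extracted beforehand (for example from a snapshot).

```python
from typing import Iterable, Iterator, List, Set, Tuple


def batches(dois: List[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield successive slices of at most `size` DOIs."""
    for start in range(0, len(dois), size):
        yield dois[start:start + size]


def split_validated(
    dois: Iterable[str], unpaywall_dois: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (confirmed, unconfirmed) DOIs with respect to the Unpaywall set."""
    candidates = set(dois)
    confirmed = candidates & unpaywall_dois
    return confirmed, candidates - confirmed


# Toy example: 2,500 synthetic DOIs split into three batches.
synthetic = ["10.1000/{}".format(i) for i in range(2500)]
batch_sizes = [len(b) for b in batches(synthetic)]

confirmed, unconfirmed = split_validated(
    ["10.1000/1", "10.1000/2", "10.9999/typo"],
    unpaywall_dois={"10.1000/1", "10.1000/2"},
)
```

Here the unconfirmed set plays the role of the DOIs that are treated as not validated in the rest of the study.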
Moreover, it is not uncommon to find differences in the date of publication from one database to another (often due to the time lag between the version published online (early access) and the "final" publication). We have chosen to use the year of publication provided in the Unpaywall database as the reference year (see Table 3), whether or not it is consistent with the year of publication mentioned in the source database. This choice is also the one adopted by the BSO (French Open Access Monitoring). Table 3 presents the results of the cross-matching between the six sources and their validation with Unpaywall.
The first column recalls the number of DOIs obtained from each source, already presented in Table 2. The second column presents the numbers of DOIs found in Unpaywall and recorded in this database as published in 2019.
Note that to obtain the counts in Table 3 we cross-referenced the results of the queries covering the whole of the years 2015-2020 for the six sources with the year 2019 from Unpaywall. Discrepancies in publication dates affect about 8% of the documents. Because of the reassignment of publication dates, the number of DOIs with confirmed output (second column of Table 3) for a given year may be larger than the original number of DOIs for this year (case of MAG), despite a small loss of unidentified DOIs. In the following section, the 139,514 records described in column 2 will be cross-referenced with the BSO.
12 Unpaywall: https://www.unpaywall.org.
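The reassignment of publication years described above can be illustrated with a short sketch. It assumes that each source is available as a set of DOIs covering the whole 2015-2020 query, and that a mapping from DOI to Unpaywall publication year has been built beforehand; the function name `confirmed_for_year` and the toy data are hypothetical.

```python
from typing import Dict, Set


def confirmed_for_year(
    source_dois: Dict[str, Set[str]],
    unpaywall_year: Dict[str, int],
    year: int,
) -> Dict[str, Set[str]]:
    """For each source, keep the DOIs known to Unpaywall and dated `year` there.

    The year recorded in the source database is deliberately ignored:
    only the Unpaywall year decides which reference year a DOI belongs to.
    """
    return {
        name: {doi for doi in dois if unpaywall_year.get(doi) == year}
        for name, dois in source_dois.items()
    }


# Toy data: "10.1/x" is unknown to Unpaywall, "10.1/c" is dated 2020 there.
sources = {
    "source_a": {"10.1/a", "10.1/b", "10.1/x"},
    "source_b": {"10.1/b", "10.1/c"},
}
years = {"10.1/a": 2019, "10.1/b": 2019, "10.1/c": 2020}

per_source = confirmed_for_year(sources, years, 2019)
corpus_2019 = set().union(*per_source.values())
```

Because the filter is applied to the full multi-year query of each source, a DOI that a source dates 2018 or 2020 still enters the 2019 corpus when Unpaywall dates it 2019, which is how a confirmed count can exceed the source's own count for that year.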

Overlap of the Two Sets
The corpus thus constituted (FR-2019) can now be compared with that of the French Open Science Barometer (BSO), which also aims to cover all French production, for several years including 2019 13 .
Because the BSO data are also restricted to publications with a DOI and have benefited from the Unpaywall query, it is easy to cross-reference the two sets of DOIs. The result is summarized in Table 4. Table 4 shows that, if we restrict ourselves to the data validated after querying Unpaywall, 8% of the total data set (i.e., 13,707 DOIs) are not identified in the BSO, while conversely 17% of the documents (i.e., 27,898 DOIs) had not been identified in our FR-2019 corpus.
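The comparison of the two DOI sets reduces to elementary set operations, as in the following minimal sketch. The function name `overlap_summary` and the toy DOIs are hypothetical; shares are expressed relative to the union of the two sets, which is our reading of "the total data set" above.

```python
from typing import Dict, Set


def overlap_summary(ours: Set[str], bso: Set[str]) -> Dict[str, float]:
    """Count common and exclusive DOIs, with shares relative to the union."""
    union = ours | bso
    only_ours = ours - bso
    only_bso = bso - ours
    total = len(union) or 1  # avoid division by zero on empty inputs
    return {
        "union": len(union),
        "common": len(ours & bso),
        "only_ours": len(only_ours),
        "only_bso": len(only_bso),
        "share_only_ours": len(only_ours) / total,
        "share_only_bso": len(only_bso) / total,
    }


# Toy example with four distinct DOIs in total.
summary = overlap_summary(
    ours={"10.1/a", "10.1/b", "10.1/c"},
    bso={"10.1/b", "10.1/c", "10.1/d"},
)
```

Under the same reading, the figures quoted above are mutually consistent: 139,514 + 27,898 = 167,412 DOIs in the union, of which 13,707 (about 8%) are absent from the BSO and 27,898 (about 17%) are absent from FR-2019.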

Data from Our FR-2019 Corpus That Are Not Part of the BSO Corpus
The data from our sources not included in the BSO corpus seem to correspond mainly to a failure to identify the France affiliation in the algorithm developed by Jeangirard (2019). This was expected and corresponds to what Jeangirard calls false negatives, which he says he cannot estimate and which we estimate here at 9% of the BSO corpus.
In our study, the main sources contributing to this subset not identified by the BSO are Scopus (63%), WoS (41%), and MAG (23%); these shares overlap, because the same document can be indexed by several sources. We believe that these documents come from the less represented publishers, for which it is likely that specific algorithms for extracting the country of affiliation have not been developed for BSO.

Data from the BSO Corpus Absent from the FR-2019 Corpus
The data from the BSO corpus not included in our sources come mainly from humanities and social sciences journals (44%), biomedical journals (24%), and basic biology journals (12%). We note a significantly higher proportion of articles in French in this BSO-only subset: 31% compared to the average of 15% for the global corpus (the language analysis methodology will be presented in Section 4.4).
13 The BSO data were produced in December 2020 and are made available on the Open Data portal of the Ministry of Higher Education (MESRI): https://data.enseignementsup-recherche.gouv.fr/explore/dataset/open-access-monitor-france/.
These are mainly journals or resources not covered by the databases we have used, in particular, documentary resources and journals with a national scope in French or English. For example, the most represented sources in this set are the following:
This set of documents also includes the "false positives" reported by Jeangirard (2019) (i.e., documents that his algorithm wrongly identified as publications from the France set). These are publications for which none of the authors has an affiliation in France but which the BSO algorithm nevertheless retained. Jeangirard estimates the false positive rate at 4% (which would correspond to about 6,000 publications for the year 2019).
We can try to estimate more precisely this share of false positives: The search in Scopus of DOIs corresponding to publications collected for the BSO but not confirmed by our other sources sheds light on this subject (Table 5).
This search allows us to identify 3,616 probable false positives: The Scopus database recognizes the DOI, the year is indeed 2019, but the article does not include, according to Scopus, an affiliation in France. This corresponds to 3.5% of the DOIs common to BSO and Scopus, which thus seems compatible with the 4% estimated by Jeangirard (2019). Let us note once again that the cross-referencing of the different sources highlights divergent assessments of the publication date of the articles.
Table 6 presents the contributions of each source to the overall corpus (aggregating the two approaches: our FR-2019 corpus and the one collected for the BSO).
Table 7 presents the cross-referenced contributions of the sources to the overall corpus. It should be noted that the fact that a publication is identified in database A and is not identified in database B as being part of the corpus does not necessarily mean that it is absent from database B: It may be present in database B, but with a DOI that has not been filled in or is incorrect, or a failure to identify the country (no affiliation with France).
Note that we do not use here the original BSO open access observations, which were made at a different date, and thus could not be directly compared to ours. We have chosen to report all the calculations to the same observation date: that of the production of the Unpaywall snapshot in February 2021.
The reader is referred to Akbaritabar and Stahlschmidt (2019) for a discussion of the merits and limitations of these rate calculations. In their conclusions the authors recommend the use of multiple sources to reduce errors and gaps, and this is clearly a view we share. Cross-matching all these data sets allowed us to correct, at least in part, the problem of false negatives and to obtain a refined estimate of the open access rate.
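The false-positive check described earlier in this section (DOI recognized by Scopus, year equal to 2019, no affiliation in France) can be sketched as follows. This is a minimal illustration, not the procedure actually run for the study: the function name `probable_false_positives` and the record layout (a dictionary with `year` and `countries` keys) are hypothetical.

```python
from typing import Dict, Iterable, List


def probable_false_positives(
    bso_only_dois: Iterable[str],
    reference_records: Dict[str, dict],
    year: int = 2019,
    country: str = "France",
) -> List[str]:
    """Flag BSO-only DOIs that the reference database knows, dates `year`,
    and for which it lists no affiliation in `country`."""
    flagged = []
    for doi in bso_only_dois:
        record = reference_records.get(doi)
        if record is None:
            continue  # DOI unknown to the reference database: no verdict possible
        if record["year"] == year and country not in record["countries"]:
            flagged.append(doi)
    return flagged


# Toy records (illustrative values only).
records = {
    "10.1/a": {"year": 2019, "countries": {"Germany", "Italy"}},
    "10.1/b": {"year": 2019, "countries": {"France", "Spain"}},
    "10.1/c": {"year": 2018, "countries": {"Canada"}},
}
flagged = probable_false_positives(["10.1/a", "10.1/b", "10.1/c", "10.1/z"], records)
```

DOIs that the reference database does not index are left undecided rather than counted, which is why such a check yields probable rather than confirmed false positives.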

Variation in Open Access Rate by Document Type
The calculation for the articles alone, using the journal-article nomenclature proposed by Unpaywall, shows, as expected, a significantly higher open access rate: 57% for the BSO corpus and for our corpus, and 56% for the corpus resulting from the aggregation of the two sets.
This category is interesting insofar as the national policy enacted by Article 30 of the 2016 law mentioned above concerns a "scientific writing […] published in a periodical appearing at least once a year" (i.e., in our terminology, a scientific journal article).
In this context, it is worth mentioning that the approaches presented here do not distinguish between publicly funded research articles and other articles from private and industrial research, for which the open science commitments do not apply.
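A rate per document type can be computed with a simple grouping, as in the minimal sketch below. It assumes that each Unpaywall record is a dictionary in which the document type is carried by a `genre` key (with values such as `journal-article` or `proceedings-article`) and the open access flag by `is_oa`; the function name `oa_rate_by_genre` and the toy records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def oa_rate_by_genre(records: Iterable[dict]) -> Dict[str, float]:
    """Share of open records per document type; untyped records go to 'other'."""
    totals = defaultdict(int)
    opened = defaultdict(int)
    for record in records:
        genre = record.get("genre") or "other"
        totals[genre] += 1
        if record.get("is_oa") is True:
            opened[genre] += 1
    return {genre: opened[genre] / totals[genre] for genre in totals}


# Toy records (illustrative values only).
rates = oa_rate_by_genre(
    [
        {"genre": "journal-article", "is_oa": True},
        {"genre": "journal-article", "is_oa": False},
        {"genre": "proceedings-article", "is_oa": True},
        {"genre": None, "is_oa": False},
    ]
)
```

Grouping untyped records under "other" is a design choice of this sketch; it keeps every record in the denominator of some category rather than silently dropping it.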
The details of the types of documents identified for both approaches are given in Table 9. The percentages observed are very similar in the two data sets (FR-2019 and BSO) for articles and conference proceedings. The differences are more noticeable for book chapters and can be explained by a significantly wider coverage in the case of the BSO. The "other" category covers too many different situations for the differences in the observed rate to be significant.

Observation of Annual Trends (2015-2020)
To assess our ability to measure annual changes, we extracted the data (and present the annual counts in Table 10) for each of the years 2015 to 2020, following the same methodology as outlined for 2019. For 2019 the counts are identical to those in Tables 3 and 7. Table 11 provides the data from Table 4
The year 2020, observed in February 2021, has a different character, as the observation is made before the 6-month, 1-year, or in some cases longer embargoes have expired.
In Table 13, we give examples of observations of the open access status (Gold, Green, etc.) as provided by Unpaywall for 2 distinct years. These few examples allow us to affirm the absence of significant bias between the 2 data sets: The two strategies lead to quite similar estimates.
A comparison of the rates obtained for the French corpus with those obtained on an international scale would go beyond the limits of this article: The interested reader may refer to the
It is possible to cross-reference the observations presented above with information on the language in which the article is written: Are articles in French, for example, more often, or less often, in open access? To examine this, as this information is not systematically provided by all databases, we analyzed the title of the article as provided by Unpaywall by applying the simple language detection software langdetect 14. Only detections assigned with a displayed probability greater than 0.99 were retained.
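The title-based language detection with a 0.99 probability threshold can be sketched as follows. This is a minimal illustration and not the script used for the study: the function name `title_language` is hypothetical, and the detector is passed as a parameter so that the logic can be exercised without the library installed. By default it falls back on `detect_langs` from the langdetect package, which returns candidate languages carrying `lang` and `prob` attributes.

```python
from typing import Callable, Optional


def title_language(
    title: str,
    detector: Optional[Callable] = None,
    threshold: float = 0.99,
) -> Optional[str]:
    """Return the detected language code, or None below the probability threshold."""
    if detector is None:
        # Lazy import: only needed when no custom detector is supplied.
        from langdetect import detect_langs
        detector = detect_langs
    try:
        candidates = detector(title)
    except Exception:
        return None  # e.g. empty titles or titles without usable text
    if not candidates:
        return None
    best = max(candidates, key=lambda candidate: candidate.prob)
    # Strict inequality: only probabilities greater than the threshold are retained.
    return best.lang if best.prob > threshold else None
```

With the default detector, a call such as `title_language("Mesurer l'ouverture des publications scientifiques françaises")` would be expected to return a language code only when the top candidate clears the threshold. The langdetect README notes that results can vary between runs unless `DetectorFactory.seed` is set, so fixing the seed is advisable for a reproducible count.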
In the framework of our study of French national scientific production, for the year 2019, the two main languages concerned are English (83% of the detected documents) and French (15%), the rest of the detected languages not exceeding 3% in total (Table 14). The distribution differs by document type: In particular, communications to (mostly international) conferences (labeled proceedings-article in Unpaywall) are almost always in English. Table 15 shows that the rates of open access observed vary greatly according to the discipline (extracted here from the BSO). As a general rule, documents detected as being written in French are much less frequently in open access.
Most of the French language material without open access comes from three areas: medical research, including journals for practitioners, and the humanities and social sciences.
14 Langdetect (https://pypi.org/project/langdetect/) is a Python port of Nakatani Shuyo's language-detection library (https://github.com/shuyo/language-detection). When published (in 2010), it claimed to reach 99%+ accuracy on 49 supported languages.

Discussion of the Sources Used
The six sources we have chosen to use actually provide three different insights:
• Scopus and WoS provide extensive coverage of the literature in peer-reviewed journals and international conference proceedings; while Scopus has a slightly wider coverage, the use of the two databases together provides a 10-20% improvement over what would be obtained with a single database. The MAG database, which will soon be discontinued, brings, as a complement, a set of documents not indexed by WoS and Scopus, contributing to a further increase of about 10% of the corpus identified in our study.
• The HAL open archive is filled at the initiative of the authors who deposit the bibliographic record (metadata) and, if applicable, the full text in its preprint or publisher version. Part of the archive contains grey literature (Schöpfel, Prost, & Ndiaye, 2019), and the DOI is filled in irregularly rather than systematically. The metadata and DOI do not seem to be thoroughly quality controlled: For this reason, this source should be considered with caution for bibliometric studies. However, it is a reference source for French research and a cornerstone of the national open science policy.
• The ADS and PubMed databases are thematic databases and are therefore only intended to cover parts of the research field. On the other hand, both databases are deep in their field and cover grey literature and sources not indexed by the large generalist databases.
This study sheds new light on the coverage of French scientific production by the various databases. While the WoS and Scopus voluntarily restrict themselves to the perimeter of peer-reviewed publications appearing in referenced journals or books (Baas et al., 2020; Birkle et al., 2020), the use of complementary databases, whether thematic or not, allows us to have a more complete view of the share of literature that is not or poorly referenced, and that may be less general in scope geographically, linguistically, or thematically. We observe that the strategy adopted by the BSO allows for the systematic collection of data on a significant quantity of these publications, which are often neglected in bibliometric studies. Far from identifying an optimal source, our study shows the importance of diversifying the sources used to provide complementary views on a country's publication.

Characteristics of Excluded National Production Without DOI
Publications without a DOI form a heterogeneous group of peer-reviewed and grey literature. The share of unreferenced grey literature can be estimated in particular through the HAL open archive, by considering documents without a DOI, which were not taken into account in our study. However, one should first check that the absence of a DOI is not due to incomplete metadata, but reflects articles from journals that do not assign DOIs. As the open archive is fed mainly by author deposits, which are neither complete nor systematic, this approach can only be qualitative.
We note first of all, unsurprisingly, a very strong disciplinary variation: Only 15% of the documents in the field of humanities and social sciences (SSH) deposited in HAL have a DOI, while the proportion is 70% in chemistry or physics, the overall average being 42% for the year 2019 considered here (see Table 2). This rate reaches 50% in the field of computer science. Among the records without a DOI, the share of records from the SSH fields is 52%, compared to an SSH share of 12% of publications with a DOI.
We also note that the full text is deposited significantly less frequently for documents without a DOI: 39%, whereas the average is 44%.
We also note, for HAL (year 2019), a strong differentiation by language (we use here the language recorded in the archive):
• Among the documents without a DOI, the proportion of articles in French is 57% (49% for articles in English), while for articles with a DOI it is only 8%.
• 91% of the documents in French have no DOI (or no DOI indicated).
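The figures above mix two kinds of proportion that are easy to confuse: the share of a category that has a DOI (for example, 15% of SSH documents), and the share of a category among the documents lacking a DOI (for example, SSH representing 52% of records without a DOI). A minimal sketch of the distinction follows, assuming a hypothetical record layout of (field, DOI or None) and invented toy records.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, Optional[str]]  # (field, doi or None); hypothetical layout


def doi_rate_by_field(records: List[Record]) -> Dict[str, float]:
    """Share of records in each field that carry a DOI."""
    total, with_doi = Counter(), Counter()
    for field, doi in records:
        total[field] += 1
        if doi:
            with_doi[field] += 1
    return {field: with_doi[field] / total[field] for field in total}


def field_share_among_missing(records: List[Record]) -> Dict[str, float]:
    """Share of each field among the records that lack a DOI."""
    missing = Counter(field for field, doi in records if not doi)
    n = sum(missing.values())
    return {field: count / n for field, count in missing.items()} if n else {}


# Toy records, invented for illustration only (not data from HAL).
toy_records: List[Record] = [
    ("ssh", None), ("ssh", None), ("ssh", None), ("ssh", "10.1/x"),
    ("physics", "10.1/y"), ("physics", "10.1/z"), ("physics", None),
    ("chemistry", "10.1/w"),
]

print(doi_rate_by_field(toy_records))
print(field_share_among_missing(toy_records))
```

The same two functions apply unchanged if the first element of each record is a language rather than a discipline.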
We found nearly 90,000 records without a DOI in HAL (Table 2). If we restrict ourselves to documents classified as articles, book chapters or conference papers, nearly 56,000 records without a DOI (or without a DOI indicated) listed in HAL had to be excluded from this study.
For journal articles (category ART in HAL), we tried to estimate the proportion for which a DOI exists but was not entered: If we consider the articles without a DOI published in a journal for which other articles have a DOI, we find that this concerns 31% of the articles without a DOI (in HAL in 2019). We therefore estimate that at least 30% of the DOIs missing in HAL are DOIs that exist but were not filled in. Most of this 30% can be expected to be covered by the other sources. If this assumption is correct, then out of the 56,000 records without a DOI entered in HAL, around 40,000 articles or communications genuinely lack a DOI and were therefore not taken into account. This point will be the subject of further study.
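The heuristic described above, and the extrapolation that follows from it, can be sketched as below. This is an illustration under stated assumptions rather than the code used in the study: the (journal, DOI or None) layout, the function name, and the toy articles are hypothetical, and the extrapolation applies the rounded 30% share, measured on journal articles, to all 56,000 records, as the text does.

```python
from typing import Iterable, Optional, Tuple

Article = Tuple[str, Optional[str]]  # (journal, doi or None); hypothetical layout


def share_probably_unfilled(articles: Iterable[Article]) -> float:
    """Among articles without a DOI, share whose journal has a DOI
    recorded for at least one other article."""
    articles = list(articles)
    journals_with_doi = {journal for journal, doi in articles if doi}
    missing = [journal for journal, doi in articles if not doi]
    if not missing:
        return 0.0
    flagged = sum(1 for journal in missing if journal in journals_with_doi)
    return flagged / len(missing)


# Toy articles, invented for illustration only (not data from HAL).
toy_articles = [
    ("Journal A", "10.1/a1"), ("Journal A", None),
    ("Journal B", None), ("Journal B", None),
    ("Journal C", "10.1/c1"),
]
# One of the three DOI-less articles is in a journal that otherwise uses DOIs.
print(share_probably_unfilled(toy_articles))

# Extrapolation with the rounded figures quoted in the text.
records_without_doi = 56_000
unfilled_share = 0.30
remaining = round(records_without_doi * (1 - unfilled_share))
print(remaining)  # 39200, consistent with "around 40,000"
```

The heuristic yields a lower bound: a journal that assigns DOIs but has no DOI recorded for any of its articles in the archive is not flagged.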

Validation of the Open Strategy Used for the BSO
The comparison between the result obtained with our sources and the open strategy of the BSO validates the use of the latter. In brief, this strategy consists of scanning all the DOIs available from Unpaywall, and also from HAL, to identify either French authors or a mention of France in the address.
We observe that this strategy makes it possible to identify more than 20,000 records (if we exclude the false positives) not found by our approach (i.e., about 17% of the total): These come mainly from journals that are not indexed in the major international databases, more particularly in the biomedical and social science fields.
Our approach also identified approximately 13,000 DOIs not included in the BSO, from which we estimate the false negative rate of the BSO strategy to be close to 9% (see Table 4).
Recurrent sources of error include conflicting approaches to publication date (with the usual confusions between the first online publication and the final date of the reference; see for example Liu, 2021).
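The comparison of the two corpora described in this section amounts to set operations on DOIs. A minimal sketch follows, assuming hypothetical function names and toy identifiers. The denominator behind a rate such as the 9% quoted above is not specified in this section, so the choice made here (the union of both corpora) is an assumption and not necessarily the definition used for Table 4.

```python
from typing import Dict, Iterable, Set


def normalize(dois: Iterable[str]) -> Set[str]:
    """DOIs are case-insensitive, so compare them in lowercase."""
    return {d.strip().lower() for d in dois}


def compare_corpora(reference: Iterable[str], candidate: Iterable[str]) -> Dict[str, float]:
    """Count shared and exclusive DOIs between two corpora."""
    ref, cand = normalize(reference), normalize(candidate)
    union = ref | cand
    missed = ref - cand
    return {
        "shared": len(ref & cand),
        "only_in_candidate": len(cand - ref),
        "only_in_reference": len(missed),
        # Assumption: misses are expressed as a share of the union of both corpora.
        "candidate_miss_rate": len(missed) / len(union) if union else 0.0,
    }


# Toy identifiers, invented for illustration only (not data from the study).
multi_source = {"10.1/a", "10.1/b", "10.1/c", "10.1/D"}
bso = {"10.1/a", "10.1/b", "10.1/d", "10.1/e", "10.1/f"}

result = compare_corpora(multi_source, bso)
print(result)
```

Such a comparison is only as reliable as the publication year assigned to each DOI: a record dated by first online publication in one corpus and by final issue date in the other appears as a spurious miss on both sides.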

CONCLUSIONS
The main results of our study are as follows.
• Our study validates a strategy for assembling a collection of scientific publications with an affiliation in France for a given year. This corpus is deliberately restricted to publications with a DOI. We present the details of the counts for the year 2019. We estimate that the corpus of outputs with a DOI covers around 80% of French national scholarly production in 2019, with an additional set of 40,000 articles or communications without a DOI not taken into account here.
• Our determination of cross-coverage by the various databases can help users of these databases identify overlaps and complementarities, in a context comparable to that of our study.
• The use of multiple sources ensures validation at a sufficiently fine level to shed light on the geographical, thematic, linguistic, and other disparities that affect bibliometric studies. Our study confirms the relevance of adopting a multisource approach.
• The open-source strategy used by the BSO effectively identifies the vast majority of publications with a persistent identifier (DOI) for Open Science monitoring.
• The determination of the open access rate has been refined. It should be remembered that this rate depends on the date of observation and may differ depending on the type of documents considered. Our objective here is not to comment on the 54% or 53% rate reached for the opening of publications in 2019 (observed in February 2021), but to note the convergence of two different methodologies that allow us to accurately draw the shifting landscape of open science at the country level.
The place of the national open archive HAL, and of other open archives, in the Open Science strategy deserves dedicated treatment in a further study. The objective of such a study would be to examine how to reconcile, on the one hand, the specific challenges of open archives, which must remain easy for authors to deposit in, and on the other hand, the requirements of a referencing and query environment that should not only provide open access to the scientific knowledge produced by French research, but also support the most diverse possible readership in consulting it.