Towards Trusted Identities for Swiss Researchers and their Data

Julien A. Raemy and René Schneider
In this paper we report on efforts to enhance the Swiss persistent identifier (PID) ecosystem. We first describe the current situation and the need for improvement, and then describe in detail the steps undertaken to create a Swiss-wide model. A case study was undertaken using several data sets from the domains of art and design in the context of the ICOPAD project. We provide a set of recommendations to enable a PID service that could mint Archival Resource Key (ARK) identifiers or a flavour of Research Resource Identifiers (RRIDs) as a complement to Digital Object Identifiers (DOIs). We conclude with some remarks concerning the transferability of this approach to other areas and the requirements for a national hub for PID management in Switzerland.
Received 06 December 2018 ~ Revision received 18 October 2019 ~ Accepted 06 January 2020
Correspondence should be addressed to René Schneider, HEG-Genève, Rue de la Tambourine 17, CH-1227 Carouge, Switzerland. Email: rene.schneider@hesge.ch
An earlier version of this paper was presented at the 14th International Digital Curation Conference.
The International Journal of Digital Curation is an international journal committed to scholarly excellence and dedicated to the advancement of digital curation across a wide range of sectors. The IJDC is published by the University of Edinburgh on behalf of the Digital Curation Centre. ISSN: 1746-8256. URL: http://www.ijdc.net/
Copyright rests with the authors. This work is released under a Creative Commons Attribution Licence, version 4.0. For details please see https://creativecommons.org/licenses/by/4.0/
International Journal of Digital Curation 2020, Vol. 14, Iss. 1, 303–314. DOI: 10.2218/ijdc.v14i1.596 (http://dx.doi.org/10.2218/ijdc.v14i1.596)


Introduction
Persistent Identifiers (PIDs) are a necessary tool to ensure referenceability and, as their name says, identifiability in a bi-unique manner. Further mechanisms can be built on this foundation, citability probably being the most important one (Hakala, 2010; Van de Sompel, Sanderson, Shankar and Klein, 2014).
Their persistence is a real added value compared to the cool URIs of the Linked Data approach, although this does not preclude a joint use of PIDs and Linked Data. For a deeper understanding of the relationship between cool URIs and PIDs see Bazzanella, Bortoli and Bouquet (2013). It is often forgotten that PIDs encompass a large, heterogeneous and ever-growing variety of designators, for which the metaphor of an ecosystem seems appropriate (see Espasandin, Jaquet and Lefort (2018) for a complete overview). The combination of several PIDs bears a high potential for the creation of what are called Trusted Identities in research data management, where their use is essential for both sharing and archiving data, regardless of whether the focus is on open access or long-term archiving. In fact, PIDs seem to work as a hinge, or even as a fulcrum, between these two approaches in data curation.
Nevertheless, a brief look at the actual use of persistent identifiers shows that, instead of diverse ecosystems, preference is given to a monoculture. In this paper we illustrate this and propose solutions to improve the situation, with examples from the Swiss scientific landscape.

The Swiss PID Landscape
Currently the situation in Switzerland is dominated by one and only one PID, namely the prominent Digital Object Identifier (DOI), which has been assigned since 2010 by the library of the Swiss Federal Institute of Technology Zürich (ETH), a DataCite member. The PID service is used for documents and data and is free of charge only for scientists working at the ETH, a constraint that can be seen as inhibiting wider use of the service.
To overcome this constraint, some institutions recommend that their scientists publish their data on Zenodo, which is hosted at the European Organization for Nuclear Research (CERN), an extra-territorial organisation located on the Swiss-French border near Geneva. This extra-territoriality leads other institutions to forbid their scientists to publish data there, since their policies do not allow data publication outside Swiss borders.
A third Swiss data repository is FORSbase, which is only open to data from the social sciences and has no strict policy concerning identifiers (some data sets have DOIs and others have no persistent identifiers at all).
Parallel to these well-established services, the allocation of Archival Resource Key (ARK) identifiers is in preparation at the University of Basel's Digital Humanities Lab as part of the development of the Data and Service Center for the Humanities (DaSCH), a long-term archive for data from the (digital) humanities (Rosenthaler, Fornaro and Clivaz, 2015). These ARKs will hence only be given to data from a restricted number of scientific disciplines during and after completion of the project.
Another ongoing Swiss project, Data Life-Cycle Management (DLCM), is still under development and also plans to attribute DOIs to the data it will host in the near future.
So far, no PIDs are allocated to persons or organizations, although another initiative, known as SWITCH edu-ID, has been started to create IDs for all people involved in higher education institutions in Switzerland. It is developed by SWITCH, an organization providing nationwide IT support for higher education. In this context it should be viewed critically that these identifiers are not really based on a social contract (as with, for example, ORCID, where each researcher willingly allows the creation of an ID for himself or herself) but are automatically attributed for a lifetime as soon as a person enters an academic institution, a circumstance that might raise legal problems if the identifiers are used for anything other than purely administrative matters.
This mixture of monopolised attribution, contradictory policies for free-of-charge deposits, work in progress, and unclear legal issues leads to an unsatisfying and confusing situation, far from the scenario needed to create trusted identities for dynamic and ready-to-ingest research data at all granularity levels, including scientists and their organizations.
For this reason, the project ICOPAD (Identités de confiance pour les données de l'art et du design, i.e. trusted identities for art and design data) started in 2017 to reflect on a suitable PID model for the Swiss scientific landscape, specifying all necessary requirements and workflows.

The ICOPAD Project
Since the situation described above can only be improved through collaboration and partnership, ICOPAD started as a joint project between four Swiss institutions (namely the HEG-Genève as instigator and managing authority; the Swiss Institute for Cultural Sciences, the Swiss Institute for Art Research, the Zurich Central Library as well as the Zurich University of the Arts). All institutions had worked together successfully in a previous project on Linked Data for Art and Design in Zurich (LOD-Z) (Prongue, Ricci, Schneider and Schurte, 2017). That project ended with the development of a prototype system and the desire to work further on PIDs in order to loosen the dependency on cool URIs.
The goal of this 18-month project (July 2017 to December 2018) was to work out all details underlying this need and to provide a showcase demonstrating the structural feasibility of the approach for the disciplines of art, design, and digital humanities. From this showcase, general conclusions were to be derived, and the transferability of the model and workflow to other disciplines that share the same needs for their scientists and the data they generate was to be demonstrated. The ultimate goal of the project was to clarify all questions concerning the creation of a Swiss PID infrastructure, namely its theoretical, organisational, technical, and financial aspects. A more concrete implementation will be carried out in a follow-up project at the national level.

Preliminary Studies
The project started with a first study of the situation in Switzerland, which resulted in a detailed description of the situation outlined above as well as a first overview of the PID universe.
For the sake of generalisation and comprehensibility for the wider community, a flow chart was developed that can be understood as a decision path towards the attribution of one or more complementary PIDs in Switzerland. This work was strongly inspired by prior work done by the Australian National Data Service and other policy checklists (ANDS, 2017) and led to a decision diagram.
Based on our research, we decided that dynamic data sets (Treloar, Groenewegen and Harboe-Ree, 2007), i.e. those not yet being considered for long-term preservation, those needing a fine level of granularity, and data likely to be modified often (versioning) could be better handled with PIDs other than DOIs.
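The three criteria named above can be read as a simple decision rule. The following Python sketch is only an illustration of that reading and not a transcription of the project's flow chart: the class and field names are hypothetical, and combining the criteria with a logical "or" is our assumption.

```python
from dataclasses import dataclass


@dataclass
class DataSetProfile:
    """Hypothetical description of a data set; field names are illustrative."""
    in_long_term_preservation: bool  # already considered for long-term preservation?
    needs_fine_granularity: bool     # must parts of the object be addressable?
    frequently_modified: bool        # is versioning expected?


def recommend_pid(profile: DataSetProfile) -> str:
    """Return 'DOI' or 'complementary PID' according to the three criteria."""
    is_dynamic = not profile.in_long_term_preservation
    # Assumption: any single criterion is enough to suggest a PID other than a DOI.
    if is_dynamic or profile.needs_fine_granularity or profile.frequently_modified:
        return "complementary PID"
    return "DOI"


# A stable, archived data set with coarse granularity keeps its DOI.
print(recommend_pid(DataSetProfile(True, False, False)))
# An archived manuscript that must be citable at folio level gets a complement.
print(recommend_pid(DataSetProfile(True, True, False)))
```

In practice the decision path would have more branches (for example, the type of entity), which this sketch deliberately leaves out.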

Synthetic Overview and First Recommendations
To gain more insight into the possibilities that the PID universe offers, a second exploratory study was initiated. This work led to a synthetic overview of almost all PIDs available in practice and their graphical representation as a disc. The panorama can be viewed from the centre outwards. It is organised around ten classification criteria integrated into the circular sections (categories, systems, standards, syntax, opacity, granularity, metadata, resolvability, hosting, and cost) and two ancillary factors (main domains and the type of entities that can be described) (Espasandin et al., 2018). Based on the 27 different PIDs identified on the disc, the authors recommended an attribution system that combined ARK, DOI, and ORCID. The former could be assigned to any kind of entity and the two latter PIDs could be leveraged together.
After these two preliminary investigations the work continued along two lines: (1) a detailed analysis of the diverse data sets delivered by the project partners, to determine exactly which PIDs were needed; and (2) the development of an overall attribution model containing one or more complements to the DOIs attributed so far.

Data Set Analysis
The Swiss Institute for Art Research (SIK-ISEA), the Zurich Central Library (ZB), and the Zurich University of the Arts (ZHdK) provided nine different use cases within the ICOPAD project. The scope and the data sets of the participating institutions are summarised in Table 1 below.

Table 1. Scope and data sets of the participating institutions.

Institution: SIK-ISEA
Object of investigation and epistemological interest: The SIK-ISEA use case is an artist's dictionary entry containing biographical information, images of the artist's artworks, as well as hyperlinks to Wikipedia and the GND (Gemeinsame Normdatei), which is the German integrated authority file. The interest of SIK-ISEA is, according to their website, to provide a daily updated online encyclopaedia of biographic information concerning artists and their works in Switzerland.
Data set types/entities: Artists; Artworks; Dictionary entries

Institution: ZB
Object of investigation and epistemological interest: The data sets provided by the ZB already had DOIs, but these were not sufficient for their needs: they wanted their manuscripts and ancient books to be citable at a finer level of granularity that could point to an area of interest for the end user (page, folio, segment, paragraph, etc.).
Data set types/entities: Digital surrogates

Institution: ZHdK
Object of investigation and epistemological interest: The ZHdK has several platforms (eMuseum), subdomains (Medienarchiv), as well as dedicated webpages on which its data sets are hosted. They all have their own (ephemeral) identifiers, which are derived from their different databases. Therefore, the same instance can have several identifiers depending on which platform it is on. The ZHdK would like to unify this by attributing PIDs to all kinds of entities (a person, an object, an event, etc.) that it hosts. The ZHdK would also like to implement a PID solution with a Linked Data approach in order to migrate existing applications such as the eMuseum or the Medienarchiv to a stable yet agile environment, and to continue the development of its own products based on the principles of PID and LOD.

Modelling
After the period of data exploration, a modelling process started with the axiom-like assertion that the DOI-only approach followed so far in Switzerland is not sufficient, which led to the conclusion that a complement must be found. This circumstance was named "DOI+" internally and can be described more formally as the DOI plus a complement of x further identifiers. This assertion leaves two options open: (1) x = 1, i.e. a single identifier a as complement, internally designated as "DOI + 1", with a being a distinctly chosen PID; or (2) many identifiers as complement, with n being the number of possible identifiers combined in a vector or n-tuple.
Option (2) may be further differentiated into a sequential and a parallel approach. Sequential means that an order of priority is assigned to the different identifiers existing for the entities; parallel means that no further distinction is made and that an identifier is assigned as soon as one is found for an entity.
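The difference between the sequential and the parallel variant can be made concrete with a minimal Python sketch. It is an illustration only: the function names, the scheme labels and the identifier values (taken from test or demonstration namespaces) are assumptions and do not stem from the ICOPAD data sets.

```python
from typing import Dict, List


def sequential(found: Dict[str, str], priority: List[str]) -> List[str]:
    """Order the identifiers found for an entity by a fixed scheme priority."""
    ranked = [found[scheme] for scheme in priority if scheme in found]
    # Schemes that are not in the priority list come last, in discovery order.
    rest = [value for scheme, value in found.items() if scheme not in priority]
    return ranked + rest


def parallel(found: Dict[str, str]) -> List[str]:
    """Assign every identifier as soon as it is found, without any ranking."""
    return list(found.values())


# Identifiers found for one hypothetical entity, in the order of discovery.
found = {"doi": "10.5555/12345678", "ark": "ark:/99999/fk4demo"}

print(sequential(found, priority=["ark", "doi"]))
print(parallel(found))
```

With a priority list, the first element of the result can be treated as the preferred identifier of the entity; without one, all identifiers have equal standing.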
Finally, it was decided to opt for a solution that combines both approaches and creates a link both to the cool URIs of the Linked (Open) Data world and to all other PIDs. Internally this model was written down in an informal notation (without claiming it to be a correct mathematical formula). It means that first one specifically chosen PID a is assigned to all entities, which are then transformed into a Linked Data representation in which the OWL predicate sameAs is used to assign further PIDs to an entity. At the end of the modelling process, the only remaining question was which PID to choose for a.
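How the chosen PID a and the further PIDs could be tied together with owl:sameAs may be sketched as follows. The sketch emits N-Triples as plain strings and uses no RDF library. The identifiers are placeholders from test namespaces, and using a resolvable ARK URL as the primary PID is an assumption made purely for illustration.

```python
from typing import Iterable, List

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triples(primary_uri: str, other_uris: Iterable[str]) -> List[str]:
    """Return one N-Triples statement per further PID linked to the primary PID."""
    return [f"<{primary_uri}> <{OWL_SAME_AS}> <{other}> ." for other in other_uris]


# Hypothetical entity: primary PID "a" plus one further PID.
primary = "https://n2t.net/ark:/99999/fk4demo"
others = ["https://doi.org/10.5555/12345678"]

for triple in same_as_triples(primary, others):
    print(triple)
```

Each further PID becomes one statement about the same entity, so identifiers can be added later without touching the primary one.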

Choosing the Appropriate PID for All Entities
Based on the preliminary study made by Espasandin, Jaquet and Lefort (2018), preference was given to the ARK identifier and the corresponding Name-to-Thing (N2T) resolver. The ARK anatomy is given in Figure 3. It can be seen that ARKs can provide a fine level of granularity through the use of parallel subdomains or Name Mapping Authority Hostports (NMAH) as well as qualifiers. Each organisation is able to implement its own ARK policy and naming practices. For instance, Kunze and Rodgers (2008) suggest that ARK identifiers should not be reassigned once they have been made public, should be opaque (i.e. carry no widely recognisable semantic information) and should contain a check character to guard against common transcription errors.
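To illustrate the parts of an ARK mentioned above (NMAH, NAAN, name and qualifier), the following Python sketch splits an ARK into its components. It assumes the general form [https://NMAH/]ark:/NAAN/Name[Qualifier], accepts only numeric NAANs, and treats everything after the first "/" or "." following the name as the qualifier. These simplifications are ours; the example value uses the test NAAN 99999 and is not a real object.

```python
import re
from typing import NamedTuple, Optional


class Ark(NamedTuple):
    nmah: Optional[str]  # Name Mapping Authority Hostport, if present
    naan: str            # Name Assigning Authority Number
    name: str            # base object name
    qualifier: str       # optional sub-object ("/...") or variant (".") part


ARK_RE = re.compile(
    r"^(?:https?://(?P<nmah>[^/]+)/)?"  # optional resolver host
    r"ark:/?(?P<naan>\d+)/"             # label and NAAN (numeric only here)
    r"(?P<name>[^/.?#]+)"               # base name up to the first "/" or "."
    r"(?P<qualifier>[/.][^?#]*)?$"      # optional qualifier
)


def parse_ark(value: str) -> Ark:
    """Split an ARK string into NMAH, NAAN, name and qualifier."""
    match = ARK_RE.match(value)
    if match is None:
        raise ValueError(f"not an ARK: {value!r}")
    return Ark(match["nmah"], match["naan"], match["name"], match["qualifier"] or "")


print(parse_ark("https://n2t.net/ark:/99999/fk4demo/page7.tiff"))
```

The qualifier is what would let a library cite a single page or folio of a digitised manuscript while the base name still identifies the whole object.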
ARKs were chosen for the following reasons, to name only a few:
• ARK identifiers are free once you have found an institution that is willing to connect to the California Digital Library (CDL)'s Name Assigning Authority Number (NAAN) registry; thus they seem to be the appropriate alternative to the pay-per-ID approach practised so far in Switzerland;
• ARKs are built on a completely different theoretical model, consisting of a decentralised and domain-agnostic (i.e. DNS-agnostic) approach, which allows considerable freedom for the internal management of the data sets;
• ARKs allow easy use with LOD, a circumstance that has proven its feasibility in a number of projects, e.g. the data.bnf.fr project for creating a Linked Data application for several databases hosted by the French National Library (BnF);
• ARKs can be combined effortlessly with other specifications, such as the International Image Interoperability Framework (IIIF) canonical URI syntax.
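As an illustration of the last point, an ARK can serve as the identifier segment of an IIIF Image API URI, provided that its slashes and colon are percent-encoded. The sketch below assumes the URI pattern {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format} with the default values of version 3.0 of the Image API as we understand them; the server name and the ARK are hypothetical.

```python
from urllib.parse import quote


def iiif_image_uri(base: str, ark: str, region: str = "full", size: str = "max",
                   rotation: str = "0", quality: str = "default",
                   fmt: str = "jpg") -> str:
    """Build an IIIF Image API request URI with an ARK as identifier."""
    # The identifier must not contain raw slashes, so every reserved
    # character of the ARK is percent-encoded.
    identifier = quote(ark, safe="")
    return f"{base.rstrip('/')}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"


print(iiif_image_uri("https://iiif.example.org/iiif/3", "ark:/99999/fk4demo"))
```

Passing another region or size value (for instance a pixel region of a folio) yields a citable link to a detail of the same image.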

Recommendations
In the context of the ICOPAD project, several alternative approaches have been identified, either to mint ARKs or to design a dedicated identifier. For the latter, the PID taken as a model is the Research Resource Identifier (RRID), which is used to give consistent links to biological data in the following format: RRID:Identifier. It is leveraged by, among others, the University of California, and these identifiers can be found on Google Scholar or PubMed. For the purpose of the ICOPAD project, the authors and the participating institutions have decided to label this RRID-like identifier the Art and Design Identifier (ADID).
1. ARK via its own means: Each organisation could request a NAAN from the CDL and mint its own ARKs. It could decide whether to deploy one or several NMAH. This solution could also be pursued for some time before switching to the identifiers offered by a national hub minting either ARKs (3) or RRID-like identifiers (5).
2. ARK via DaSCH: As stated in the Swiss PID landscape section, the DaSCH project attributes ARK identifiers for the needs of its research data archival system. The internal structure of the system could, both theoretically and practically, allow the minting of ARKs for third parties. The NMAH (ark.dasch.swiss) and NAAN (72163) would then be the ones that DaSCH has deployed or received. Unfortunately, DaSCH does not see itself as the future national hub at the moment.
3. ARK via a national hub: The creation of a Swiss hub that is able to mint ARKs and offer services to organisations that have participated in, or are interested in, the ICOPAD project effort. This national hub could deploy one or several NMAH.

4. RRID-like (ADID) via its own means: Each organisation could mint its own RRID-like identifiers (ADIDs). This solution could also be pursued for some time before switching to the identifiers offered by a national hub minting either ARKs (3) or ADIDs (5).
5. RRID-like (ADID) via a national hub: The creation of a Swiss hub that is able to mint ADIDs rather than ARKs. It could also deploy one or several hostname services (the equivalent of the NMAH in the ARK anatomy) to resolve PIDs.
Figure 4 shows a matrix that combines two types of variables: the hostname (or the subdomain that can resolve PIDs) and the PID authority (in other words, the organisation that is able to mint PIDs). In the matrix, the five approaches are exemplified with three projected and hypothetical PID URLs from the ICOPAD contributing institutions. The "core numbers" of the digital objects are the following:
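Independently of the actual values shown in Figure 4, the logic of combining a hostname with a PID authority for each of the five approaches can be sketched in Python. Only the DaSCH NMAH (ark.dasch.swiss) and NAAN (72163) are taken from the text. The other hostnames, the test NAAN 99999, the ADID syntax and the core number are hypothetical placeholders, and the resulting URLs do not claim to reflect DaSCH's or any institution's real naming practice.

```python
from typing import Dict

DASCH_NMAH = "ark.dasch.swiss"        # from the text
DASCH_NAAN = "72163"                  # from the text
OWN_HOST = "pid.institution.example"  # hypothetical institutional hostname
HUB_HOST = "pid.hub.example"          # hypothetical national hub hostname
TEST_NAAN = "99999"                   # placeholder NAAN for illustration


def pid_urls(core: str) -> Dict[str, str]:
    """Return one projected PID URL per approach for a given core number."""
    return {
        "1 ARK via own means": f"https://{OWN_HOST}/ark:/{TEST_NAAN}/{core}",
        "2 ARK via DaSCH": f"https://{DASCH_NMAH}/ark:/{DASCH_NAAN}/{core}",
        "3 ARK via national hub": f"https://{HUB_HOST}/ark:/{TEST_NAAN}/{core}",
        "4 ADID via own means": f"https://{OWN_HOST}/ADID:{core}",
        "5 ADID via national hub": f"https://{HUB_HOST}/ADID:{core}",
    }


for approach, url in pid_urls("x1y2z3").items():
    print(approach, "->", url)
```

The sketch shows that the core number stays the same across all five approaches; only the hostname and the minting authority vary.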

Conclusions and Future Work
We have seen in this paper that the importance of persistent identifiers for research data is still underestimated. Briefly said, they can be seen as the conditio sine qua non for adhering to the FAIR principles (Wilkinson et al., 2016). Despite the diversity of PIDs, the Swiss landscape depends heavily on DOIs (with some alternatives and/or complements starting to become available). The major problem may be that whenever people think of PIDs they chiefly think of DOIs, and it remains difficult to convince them that alternatives, or PIDs for other entities, exist. Besides that, DOIs are mainly assigned to publications and data sets at a coarse level of granularity. They are also less suitable for linked data enrichment and do not permit decentralised management.
Secondly, people often see linked data or cool URIs as an alternative to PIDs, without thinking of them as possibly being two sides of the same coin that together bring research data to their full potential. The state-of-the-art work done on PIDs in general, on their use in Switzerland, and on the still unfulfilled scientific needs (illustrated using the example of several data sets from the domain of art and design) has shown that there is still a lot of work to do.
The peculiarity of the Swiss scientific landscape, being quite developed and advanced but not fully connected to the rest of Europe, requires some common efforts