Design and Implementation of the First Generic Archive Storage Service for Research Data in Germany

Research data, as the true valuable good in science, must be saved and subsequently kept findable, accessible and reusable for reasons of proper scientific conduct for a time span of several years. However, managing long-term storage of research data is a burden for institutes and researchers. Because of the sheer size and the required retention time, apt storage providers are hard to find. Aiming to solve this puzzle, the bwDataArchive project started development of a long-term research data archive that is reliable, cost-effective and able to store multiple petabytes of data. The hardware consists of data storage on magnetic tape, interfaced with disk caches and nodes for data movement and access. On the software side, the High Performance Storage System (HPSS) was chosen for its proven ability to reliably store huge amounts of data. However, the implementation of bwDataArchive is not dependent on HPSS. For authentication, the bwDataArchive is integrated into the federated identity management for educational institutions in the State of Baden-Württemberg in Germany. The archive features data protection by means of a dual copy at two distinct locations on different tape technologies, data accessibility by common storage protocols, data retention assurance for more than ten years, data preservation with checksums, and data management capabilities supported by a flexible directory structure allowing sharing and publication. As of September 2019, the bwDataArchive holds over 9 PB and 90 million files and sees a constant increase in usage and users from many communities.

All authors contributed equally to all sections of the paper.

Received 12 January 2018 ~ Revision received 27 September 2019 ~ Accepted 22 October 2019

Correspondence should be addressed to Dr. Felix Bach, Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology (KIT). Email: felix.bach@kit.edu

Copyright rests with the authors. This work is released under a Creative Commons Attribution Licence, version 4.0. For details please see https://creativecommons.org/licenses/by/4.0/

International Journal of Digital Curation 2020, Vol. 15, Iss. 1, 15 pp. DOI: 10.2218/ijdc.v15i1.553


Introduction
As predicted in (Hey, Tansley and Tolle, 2009), today's science is largely data-driven. Results, for example from computer simulations, experiment recordings, surveys or digital reproductions of cultural goods, play a crucial role in scientific progress. The need to preserve data over longer time spans is inevitable, and prominently backed by national funding agencies, such as the German Research Foundation (DFG) (DFG, 2013). However, in practice researchers are forced to store their data either on facilities with limited data storage capacity at their home institution or at external providers. Storing large amounts of data at the home institution is often simply impossible, e.g. because the storage must be cleared to make room for other users, whereas external providers cannot be used because of either security and privacy considerations or prohibitively high costs (Tristram et al., 2016).
In High Performance Computing (HPC), huge amounts of data are produced during simulation runs (Schembera et al., 2017). For example, at the High-Performance Computing Center Stuttgart (HLRS), whose main users stem from engineering, computational fluid dynamics, molecular dynamics or climate research, a typical large project generates thousands of files, each up to several gigabytes in size. After successful simulation runs and the evaluation and publication of the results, the data becomes inactive and is either deleted or stored on, e.g., a local storage system. Since the data is inactive but should be available for at least a ten-year period, there is a need for a long-term archiving service that can handle a rapidly increasing amount of data and ensures bit-stream preservation. To address this requirement, the bwDataArchive research project was established to build an interoperable, state-wide data archive. After successful implementation of the services, selected users of HLRS and of the Karlsruhe Institute of Technology (KIT) were given early access for evaluation and to help improve the service. The bwDataArchive service moved to production in November 2016 and has seen a continuous increase in users and usage since.

Project bwDataArchive
The three-year project, with the goal to develop a reliable and cost-effective infrastructure for long-term data storage, started in January 2014 and was funded by the Ministry of Science, Research and the Arts of the State of Baden-Württemberg. The project and the early production phase delivered a long-term storage service that is easy to use, widely accessible and in principle able to support any scientific community (van Wezel et al., 2015).

Scope and Timeline
At the beginning of the project, hardware and software were procured and installed at the Steinbuch Centre for Computing (SCC), the information technology center of KIT. In a tender, the High Performance Storage System¹ (HPSS) was selected, on which the bwDataArchive service was to be implemented. The complete hardware setup comprises several computers, storage units and tape systems; it is described in detail in a later chapter. For operation, the bwDataArchive makes extensive use of the service infrastructure of SCC, which provides networking, installation, monitoring and user support services.

1 HPSS: http://www.hpss-collaboration.org/

Archive Storage Software
The project selected the HPSS data and tape management system (Watson and Coyne, 1995) as the core of the archive solution. HPSS manages file-based data storage and presents a POSIX-compatible file system layer that is addressable via an API. It manages disk as well as tape resources and orchestrates the migration of data between these resources through managed policies. The HPSS system is used by many large supercomputer centers², is actively developed by a collaboration, and has a set of features that are beneficial for archival storage. The HPSS license allows access to the data for reading after contract expiration. These qualities are very important for long-term storage of data, since they add to the reliability, sustainability and therefore trust in the system. The HLRS also employs the HPSS system, where it is operated as an HSM solution in conjunction with the Lustre parallel file system. Before the project, KIT exclusively used the IBM Spectrum Protect³ software for retention and archiving of data, which includes the data from the GridKa⁴ data center. Valuable experience and performance evaluation baselines from both data centers and products could therefore be re-used in the project. HPSS will replace Spectrum Protect for several use cases at KIT.

2 Sites using the HPSS system: http://www.hpss-collaboration.org/customersM.shtml
3 Spectrum Protect was previously named TSM.
4 GridKa is the German T1 computer center of the worldwide LHC computing grid, dedicated to storing and analyzing data for the Large Hadron Collider (LHC) at CERN in Geneva, Switzerland: http://www.gridka.de
The feature set of HPSS makes it a good candidate to manage the archive storage. In principle, bwDataArchive is independent from HPSS and could have been built with Spectrum Protect or other software. However, the level of complexity of connecting the user interface with the archive storage system also depends on the functions offered by the archive storage software. For example, the file system interface of HPSS allowed us to support well-known, easy-to-use services. As another example, the checksums generated in HPSS can be accessed by the user. With Spectrum Protect, these features would have to be programmed and, ultimately, maintained.

Checksums and Data Integrity
A digital archive must ensure that its content has not changed over time. The requirement for algorithmically generated checksums was investigated in a feasibility study in the EUDAT2020 project. The joint research activity (JRA) aimed to implement universal checksum support for long-term archives. Experts from three European data centers were involved, each operating a different commercial archive software solution. The results of the project, including a product-independent checksum system in Python, are described in (Krauß, Cadolle Bel, Kennedy and Jankowski, 2015) and (Krauß, Jankowski and Kennedy, 2018).
Related to the checksums is the end-to-end data integrity checking feature of HPSS, which follows the standardised T10 PI data integrity specification⁵. For every data object that enters the archive system, a checksum is generated that is written on disk and tape alongside the data. If data is corrupted while at rest, the storage hardware will detect a checksum error while reading, so the system can automatically correct the error by replacing the damaged copy with the second copy.

5 T10 PI specification: http://www.t10.org/ftp/t10/document.03/03-224r0.pdf
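To make the ingest-and-verify pattern concrete, the following Python sketch records a checksum when a file enters an archive and compares it against a freshly computed one at verification time. It is an illustration of the general technique only; the choice of SHA-256, the chunk size and the JSON catalogue are assumptions and do not describe the internal HPSS implementation.

    import hashlib
    import json
    from pathlib import Path

    CATALOGUE = Path("fixity_catalogue.json")  # hypothetical checksum store

    def file_checksum(path: Path, algorithm: str = "sha256",
                      chunk_size: int = 8 * 1024 * 1024) -> str:
        """Compute a checksum by streaming the file in large chunks."""
        digest = hashlib.new(algorithm)
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def record_on_ingest(path: Path) -> None:
        """Store the checksum computed at ingest time."""
        catalogue = json.loads(CATALOGUE.read_text()) if CATALOGUE.exists() else {}
        catalogue[str(path)] = file_checksum(path)
        CATALOGUE.write_text(json.dumps(catalogue, indent=2))

    def verify_fixity(path: Path) -> bool:
        """Re-read the file and compare it against the recorded checksum."""
        catalogue = json.loads(CATALOGUE.read_text())
        return file_checksum(path) == catalogue[str(path)]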
Commonly practiced in, e.g., archives, museums and libraries, the requirement for fixity checking (Barsness et al., 2017), which means reviewing the data by comparing stored and freshly generated checksums, may actually increase the risk of data corruption when executed on tape systems. Because the verification involves retrieving the data from tape, the additional mechanical stress may add to the risk of read errors (cloud providers may not be immune to this effect either, since many of them utilise tape storage in the background). Data from the physical sciences, which is the focus of the bwDataArchive project, runs into hundreds of terabytes and petabytes, which makes it prohibitive to read back the data at short intervals.
Fixity checking on small data sets that live (and stay) on online media may be useful as an additional protection layer. For data on tape this is certainly different. From a technical viewpoint, and based on the experience from our large storage operations, we decided to rely on the following measures to properly protect the data over the years and ensure its integrity:
- Writing two data copies to separate tape volumes.
- Hardware supporting the T10 PI end-to-end data integrity checking.
- Migrating all data every five to seven years to new tape media. This can be done as part of a regular hardware refresh to keep up with technological progress.
- The concurrent use of different tape technologies. If one day it becomes known that one technology has a systematic error that, for example, results in severe data corruption on tapes, only one copy is affected.
Should better or cheaper systems become available, it must be possible to migrate the data and continue with a new system. We estimate a migration duration of six to twelve months with the current size of the archive, depending on the available hardware. Alternatively, a new system can work alongside the existing archive: access to existing data is read-only and new data is written to the new system. The HPSS software allows read-only data access for a limited time after the license acquired for writing expires.

Sizing and Costs
Initial volume and data rate requirements from HLRS were supplemented with the well-known requirements of the GridKa T1 center and those from universities and institutions in the state of Baden-Württemberg (Potthoff et al., 2014). Together these constitute an estimated volume of 30 PB and a combined I/O rate of 10 GB/s⁶. These numbers were used to size the HPSS hardware. After deployment and stabilization of the service, during which only a selected group of users will be able to store data, existing archives will be transferred to the bwDataArchive service, including the 20 PB of data from the GridKa T1 center.

6 Estimates for the first year after the start of service of the archive in 2016.
Research data is kept for reference or further analysis for many years. Therefore, project proposals today commonly include a chapter on how they plan to handle the data accumulated during the project and after the project is completed (EU, 2016). The data management plan (DMP) is a formal document which includes, among other information on the produced data, an estimate of the costs to store data after the project finishes (Jensen, 2011). If the costs of an archive are known before the research project starts, funding can be put aside or acquired for that purpose. However, the costs must be known well before the actual use, and, apart from the stored volume, the costs will change over time.
One of the goals of the bwDataArchive project was to develop a sustainable and dependable cost and contract model that will help researchers establish data management plans. The requirements and the payment model were established in cooperation with the Research Data Management team⁷, the KIT Library and the RADAR⁸ project of the Leibniz Institute for Information Infrastructure in Karlsruhe. Requirements also stem from the Helmholtz Portfolio Program LSDMA (Meyer et al., 2014), in which researchers from diverse communities, e.g. climate, energy and medicine, work jointly with computer experts from KIT and other computer centers on the development of novel data services.
The bwDataArchive service is the first known long-term data storage service on a pay-per-use basis for universities, research infrastructures and projects outside the well-known commercial 'cloud' offerings.

Data Center
The hardware of the bwDataArchive service is able to handle the large data streams coming from the HLRS HPC center and consists of many components. The components, some of them duplicated for redundancy, are listed below and depicted in Figure 1:
- Front-end nodes for user access and transfers
- A cache disk system with servers that buffer the data
- A database node storing technical metadata
- Two physically distinct and geographically separated (ca. 13 km) tape libraries equipped with Oracle T10000D, LTO-7 and IBM TS1155 tape drives. Enterprise tape technologies such as Oracle and IBM serve for the first copy due to their better performance, whereas LTO is used as the fallback for the second
- A node to schedule and manage GridFTP third-party transfers (bwdahub, details in a later chapter)
- User management and integration of an identity provider (IdP) through the bwIDM⁹ infrastructure
- An IP network with 10 and 40 Gbit links

The hardware is complemented with:
- Service and support information, documentation, a help desk portal and support workflows
- The set of service descriptions and contracting, accounting, and billing workflows

Service Management
Data in archives will stay there for a long time, if not forever. Over the lifetime of the data, the researcher who stored it may no longer have a relationship with the institution, i.e. KIT, that allowed him or her to use the archive. So, ultimately, the data will be orphaned. It is therefore required to keep the identification credentials of the users of the service separate from those used at the institution, in order to grant extended access (see the next chapter for implementation details). At the same time, the permission to store data must be granted by the institution because costs will be incurred. Therefore, the service is connected to the federated identity management bwIDM, in which all universities and many other institutions in Baden-Württemberg take part (Köhler et al., 2014). bwIDM serves as a trusted source for user identities and, over time, will enable dynamic authorization and role management.
The service documentation and user support build on the existing services and infrastructures at SCC. Written documentation is provided via the bwDataArchive website¹⁰ and Wiki pages¹¹. The Wiki contains background information about the technology in use, pointers to other information sources, answers to common questions, instructions and best practices regarding the different access protocols, as well as pre-formatted documents, i.e. the service level agreement (SLA) and manuals, for download. Users may request help via a web-based portal, which is also used for other state-wide services, by sending e-mail to the help desk of the SCC, or directly to the support e-mail list of the service. Because the service is offered to diverse user groups, it was deemed necessary to provide different support entry points. Behind the scenes, the support workflow directs requests to the team responsible for running the service.
The sustainability of an information technology infrastructure in a research context has always been a challenge, and implementations thereof have been developed with varying success. Most successful have been models where funding is secured up-front and subsequently pledged on a regular basis. A positive example is the WLCG¹², a cooperative infrastructure that processes and stores data from the LHC at CERN (Adamova, 2013), and more recently the EUDAT CDI¹³, a consortium that stems from the EUDAT2020¹⁴ project.
Implementing the pay-per-use scenario for the use of the archive is not the primary expertise of the core team of the bwDataArchive project. In close cooperation with the legal and procurement department at KIT, the project drafted a contract document set that contains the description, pricing, service level agreement, and other components that together comprise a legally binding contract for the delivery and use of long-term data storage offered through the bwDataArchive service. Initial customers are universities and public institutions in Baden-Württemberg. In time, the service will be used by (customers of) international infrastructure operators such as the aforementioned EUDAT CDI and the future European data infrastructure developed in the EOSC-hub¹⁵ project.

Implementation and Features
The list of features offered by the bwDataArchive service fulfils the initial requirements and could be implemented in a straightforward manner. The set of user-level features available at the start of the service consists of a) data security and redundancy by means of two data copies at two separate locations, b) data accessibility by common storage protocols like SFTP and GridFTP, c) long-term preservation by data retention of ten years and more upon request, d) the generation of checksums on input and optional checksum verification on output, e) authorization based on IdP/Shibboleth, and finally f) flexibility of usage through a directory layout that allows the adoption of new use cases.

User Access
Access to the archive is provided via the well-known and well-understood protocols SFTP and GridFTP (Allcock, Bester et al., 2002). The latter supports high-speed parallel data transfer, and users of the service can make use of the gtransfer¹⁶ tools, which shield users from most of the complexity of GridFTP. GridFTP is the transport protocol of choice for high-speed movement of large data volumes. On the other hand, SFTP clients are available for virtually all computing platforms, and both command-line and graphical user interface variants exist. Its ease of use lowers the threshold for using the archive. Since users of bwDataArchive mostly come from data-intensive science, where the command line is widespread, an interface like SFTP is no impediment at all. Additionally, high-level services like the aforementioned generic research data repository RADAR and the central KIT repository KITopen¹⁷ are based on the archive engine. They provide users with helpful functionality, such as an advanced rights and roles concept, flexible metadata management and the publication of datasets with DOIs, all conveniently accessible through a web interface.

16 gtransfer software repository: https://github.com/fr4nk5ch31n3r/gtransfer
17 KITopen stores the 'data' portion of the repository in the archive: https://www.bibliothek.kit.edu/cms/kitopen.php
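As an illustration of how simple SFTP-based ingest can be, the following Python sketch uploads a file with the Paramiko library. The host name, account, credentials and target path are placeholders and not the actual service endpoints.

    import paramiko

    # Placeholder values; the real host name, account and archive path differ.
    HOST = "archive.example.org"
    USER = "archive_user"

    transport = paramiko.Transport((HOST, 22))
    transport.connect(username=USER, password="...")  # or key-based authentication
    sftp = paramiko.SFTPClient.from_transport(transport)

    # Upload a local file into the user's archive directory.
    sftp.put("simulation_results.tar", "private/simulation_results.tar")

    sftp.close()
    transport.close()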

Authentication, Authorization and Account Management
Authorized users may interact with the archive to store, retrieve and list data. The bwDataArchive is an infrastructure service that organizations, in particular institutes and universities in Baden-Württemberg, can choose to offer to their employees. The authorization procedure and policy of the organization determine who is entitled to use the service. The account management of bwDataArchive authenticates users via the bwIDM federated identity management system, but can also register users independently. The bwIDM authentication system builds on the Shibboleth/SAML standard to forward authentication and authorization information of users from an Identity Provider (IdP), the institution using the service, to a Service Provider (SP), the bwDataArchive service. The implementation of this authentication scheme is fully GDPR compliant.
At the IdP, the archive service user's account is tagged with a dedicated service entitlement for the bwDataArchive service. The entitlement is set by the organization the user is associated with, and thereby authorizes the use of the bwDataArchive service. During the Shibboleth/SAML registration handshake, the account of the user is queried for the proper entitlement which, when present, is reported by the IdP of the local user organization to the bwDataArchive service. Through this mechanism each site can decide which users are authorized to use the archive service. At the end of the registration process, the user is required to accept the terms of use, and the user-related data is recorded in a service-specific user database.
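The entitlement check can be pictured as follows. The Python sketch below tests whether the attributes released by an IdP contain the entitlement required by the service provider; the attribute name follows common Shibboleth/SAML usage, while the entitlement value itself is a made-up placeholder, not the one defined by bwIDM.

    # Hypothetical entitlement value; the real URI is defined by the bwIDM federation.
    REQUIRED_ENTITLEMENT = "urn:example:bwDataArchive"

    def is_authorized(saml_attributes: dict) -> bool:
        """Check whether the IdP released the entitlement required for the service."""
        entitlements = saml_attributes.get("eduPersonEntitlement", [])
        return REQUIRED_ENTITLEMENT in entitlements

    # Example of attributes as they might arrive from the IdP after the handshake:
    attributes = {
        "eduPersonPrincipalName": ["user@university.example"],
        "eduPersonEntitlement": ["urn:example:bwDataArchive"],
    }
    assert is_authorized(attributes)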
Among the usual items, e.g. name and institution, the user may optionally, and only by consent, register an additional email address and an ORCID¹⁸ that permit contacting the registrant after they are no longer a member of the organization. Data associated with a bwDataArchive user may be kept in the archive storage for ten years or more, whereas the person who stored the data may no longer be a member of the organization that granted the use of the storage. For that reason, bwDataArchive remembers users as long as associated data is stored. When a user has left the organization, an event that is tracked by the bwIDM federation, the system no longer allows new data to be stored. Access to existing data remains possible, but changes and additions are prohibited.

18 ORCID: https://orcid.org

Figure 2. Interaction of storage applications, FUSE components and HPSS.
If data is deleted, the deletion is recorded in the database. This means the data, though still on tape, is no longer accessible to users or even administrators, and is eventually overwritten by a tape management process. Additionally, users are able to remove their account and all of their personal data completely, fulfilling the requirements of the GDPR. Once started, the deletion process is irreversible.
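The deletion behaviour can be thought of as a 'soft delete': the catalogue entry is flagged rather than the tape content being erased immediately. The following SQLite sketch is purely illustrative; the actual database schema of the service is not described here.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE archive_files (
            path       TEXT PRIMARY KEY,
            tape_copy  TEXT,
            deleted    INTEGER DEFAULT 0   -- 0 = visible, 1 = marked for deletion
        )
    """)
    conn.execute("INSERT INTO archive_files (path, tape_copy) VALUES (?, ?)",
                 ("private/run42.tar", "tape-A017"))

    # 'Deleting' only flags the record; the tape copy is reclaimed later
    # by a separate tape management process.
    conn.execute("UPDATE archive_files SET deleted = 1 WHERE path = ?",
                 ("private/run42.tar",))

    # Listings and reads only consider non-deleted entries.
    visible = conn.execute(
        "SELECT path FROM archive_files WHERE deleted = 0").fetchall()
    print(visible)   # -> []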

Access to HPSS with FUSE
The HPSS-managed storage is accessed using common protocols through the FUSE¹⁹ file system abstraction library. HPSS-FUSE is a Linux FUSE module that presents HPSS as a file system to applications. Although only a basic set of I/O and metadata operations is used, the implementation allows selecting some HPSS specifics, such as the tape family (to gather data on a particular set of tapes), the class of service (to support files with different file sizes), and the file checksum (to support user-verifiable checksums). Additionally, files can be pinned to disk or recovered (undeleted) from the HPSS trashcan. Figure 2 shows a schematic overview of all components involved. Data from SFTP or GridFTP is sent via glibc, the VFS layer, the FUSE kernel module, libfuse and libhpss to HPSS.

Depending on the workload, the performance of FUSE access to storage is comparable with direct storage access. However, because of the relatively large metadata overhead, writing and reading large files results in better performance compared to I/O with small files (Vangoor, Tarasov and Zadok, 2017). Data transport to and from the bwDataArchive service is done exclusively via HPSS-FUSE. Currently, the service is based on five machines (Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60 GHz, 64 GB) dedicated to data transport to clients. Two of the nodes are reserved for GridFTP transfers, the others for SFTP. More nodes will be added if required, since client tools can (and should) open many parallel sessions in order to improve throughput.
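Because HPSS-FUSE exposes the archive as an ordinary file system, applications on the transfer nodes can use plain POSIX I/O. The sketch below copies a file through a hypothetical mount point in large sequential blocks, reflecting the observation that large files perform better than many small ones; the mount path and block size are assumptions.

    import os

    MOUNT_POINT = "/hpss"               # hypothetical HPSS-FUSE mount point
    BLOCK_SIZE = 32 * 1024 * 1024       # large sequential writes amortise metadata overhead

    def archive_file(source: str, relative_target: str) -> None:
        """Copy a file into the FUSE-mounted archive using large sequential blocks."""
        target = os.path.join(MOUNT_POINT, relative_target)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(source, "rb") as src, open(target, "wb") as dst:
            while True:
                block = src.read(BLOCK_SIZE)
                if not block:
                    break
                dst.write(block)

    archive_file("results/run42.tar", "private/run42.tar")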

Managing and Scheduling High Volume Data Transfers
The archive is typically used to store migrated data from HPC environments, like that at HLRS. A large number of files needs to be moved easily and efficiently, i.e. without much user involvement. The bwdahub node acts as a user-friendly front-end to schedule high-speed and high-volume data transfers. Users can log in via GSI-SSH and subsequently trigger data transfers between the bwDataArchive service and external sites, as depicted in Figure 3. All necessary tools are pre-installed on this node, so users do not have to install the software on a computer at their home institution. The front-end tools for transferring data are gtransfer and gsatellite, and both rely on the Globus GridFTP client (globus-url-copy), uberftp, as well as tgftp. The user does not directly interact with the latter applications.

For high-throughput data transfers, the GridFTP protocol is used. Besides delivering high data rates, GridFTP can checksum and encrypt data during transfer (Allcock, Liming, Tuecke and Chervenak, 2002). Logically, GridFTP consists of two parts: a protocol interpreter (PI), responsible for managing the transfers, and a data transfer process (DTP), dedicated to transferring the data. To increase the performance of the transfers, it is recommended to use multiple (at least four) DTPs. At KIT, two GridFTP servers are installed and configured with four DTPs each on the two archive front-end machines. Each PI can make use of all eight DTPs, effectively spreading the load over both machines if concurrent or striped transfers are requested. At HLRS, a GridFTP server with a split configuration was set up: the PI (GridFTP front-end node) resides on a different machine than the six DTPs (GridFTP back-end node). The HLRS local HPSS file system is mounted on the GridFTP back-end node. Additionally, the HLRS Lustre file system can be accessed directly via another GridFTP node in order to transfer online simulation data from the supercomputer workspaces to the bwDataArchive.
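For orientation, a transfer can also be started directly with the Globus GridFTP client that gtransfer builds on. The Python sketch below wraps globus-url-copy and assumes that the client is installed, that a valid grid proxy exists, and that the endpoint URLs shown are placeholders; the -p option requests parallel data streams, in line with the recommendation to use several DTPs.

    import subprocess

    # Placeholder endpoints; real host names and paths are defined by the service.
    SOURCE = "gsiftp://hpc.example.org/lustre/workspace/run42.tar"
    DESTINATION = "gsiftp://archive.example.org/private/run42.tar"

    # -p 4 : four parallel data streams, -vb : report transfer performance
    subprocess.run(
        ["globus-url-copy", "-p", "4", "-vb", SOURCE, DESTINATION],
        check=True,
    )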

Directory Structure and Directory Functionalities
The directory structure as well as the filenames in bwDataArchive are the only form of (indirect) metadata in the system, since bwDataArchive is intended as base archival storage and not as a research data repository. However, it is well suited to serve as a sustainable, secure and scalable base for repositories. Such services build their (metadata) functionality on top of the directory structures of bwDataArchive.
The archive service uses an elaborate directory layout that aligns with the currently known business cases, i.e. registered users and projects. It provides a chrooted user directory with shared access for selected groups and, in the near future, function directories. The latter cater for special functions or options that are made available for the content within the function directory. Function directories are made visible in the registered user's directory as bind mounts with the assistance of autoFS. The use of autoFS for this functionality is promising but still experimental and has difficulties handling many simultaneously active users.²⁰

Function Directories
The internal directory structure of the service is shown in Table 1. Within the archive root (AR), a directory (RU) is created for each registered user, which itself is located inside a first-level (PL) directory and a second-level (SL) directory. The second-level directory layer reduces the number of entries per directory, which improves traversal speed and caching at the client. The system automatically creates the function directory 'private/' in the RU directory, where users have read and write permission based on POSIX ACLs. This is the central workspace for a user to store files and create directories. Typically, users ingest data themselves, whereas data managers might ingest data into the shared group directory. Deleting data is possible at any time, as long as the user has the correct permissions and the data is not yet immutable.
The function concept allows for multiple directories at SL/ that are managed by the system. Within SL/, a directory is usually a bind mount that shows the content of a shared directory, or a directory that allows storing huge files (i.e. > 1 TB), etc.
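To illustrate the two-level fan-out, the sketch below derives a registered user's directory path from the archive root. The rule for forming the first-level (PL) and second-level (SL) directory names is not given in this paper; a hash prefix of the user name is used here purely as an assumed example.

    import hashlib
    from pathlib import PurePosixPath

    ARCHIVE_ROOT = PurePosixPath("/archive")   # AR; placeholder path

    def registered_user_dir(username: str) -> PurePosixPath:
        """Build AR/PL/SL/RU/ for a registered user.

        The PL/SL naming rule is an assumption (hash prefix of the user name);
        it only illustrates how two directory levels limit entries per directory."""
        digest = hashlib.sha1(username.encode()).hexdigest()
        first_level, second_level = digest[:2], digest[2:4]
        return ARCHIVE_ROOT / first_level / second_level / username

    user_dir = registered_user_dir("jdoe")
    private_dir = user_dir / "private"          # created automatically by the system
    print(private_dir)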

File Exchange and Mutual Directory Access
To enable file exchange between users, a group share directory is created for each group of users that require mutual access to files. The concept can be as generic as a group share for all members of a scientific institution, or very specific, for a group consisting of only two users. The group access can cover multiple institutions and may also include users that have no institution in the context of bwIDM. The data itself does not change owner, and accounting is still done against the owner of the data. The group share is created for the group leader, and each group member has this directory mapped into his or her RU directory. The future user management console will enable the archive administrators to designate a user as group leader, and group leaders to add or remove users to and from the group.

Immutable Data
An important feature of the archive is to prevent data from being changed after the archive is locked. There are two main cases where data should be made immutable: removal of writing authorisation, and referenced data. Once a registered user is no longer a member of a home institution, writing to the archive is made impossible and no additional costs will be incurred. Similarly, when archived data is referenced in a publication, changes to the data are no longer allowed. The user should still have read access with his or her registered account, which is independent of his or her home institution account. This function is only available at the directory level and is implemented by mounting the RU directory as read-only. Currently, this is implemented using autoFS maps. Changes and deletions are not possible even if the user has access rights, thus effectively preventing accidental deletion. This solution is fail-safe because the contents of the RU directory must be made explicitly available and are not visible elsewhere.

Conclusion and Outlook
Hosting the bwDataArchive service at a large institution, such as KIT, assures the sustainability and dependability of the service. Both qualities, in turn, add to the trust in the data storage service. The bwDataArchive project has built such a service, which is already in use by researchers at KIT and is being tested by projects at HLRS and several universities in Baden-Württemberg. With a focus on the requirements of users that need to move data away from expensive disk storage and, in general, seek affordable long-term storage, the service has incorporated some unique features that resulted in quick acceptance. The infrastructure and features of the installation match the requirements and can scale with growing demands. As of September 2019, the archive holds over 9 PB of data in roughly 90 million files from more than 400 users. Figure 4 shows the trend of both the volume and the number of files since 2015.
The amount of data stored does not prohibit migration to a new platform or a different technology in the future, though moving the data may take some time.