The International Journal of Digital Curation

With new scientific instruments growing exponentially in their capability to generate research data, new infrastructure needs to be developed and deployed to allow researchers to effectively and securely manage their research data from collection, through publication, to eventual dissemination to research communities. In particular, researchers need to be able to easily acquire data from instruments, store and manage potentially large quantities of data, easily process the data, share research resources and work spaces with colleagues both inside and outside of their institution, search and discover across their accessible collections, and easily publish datasets and related research artefacts. The ARCHER Project has developed production-ready generic e-Research infrastructure including: a Research Repository; Scientific Dataset Managers (both a web and a desktop application); Distributed Integrated Multi-Sensor and Instrument Middleware; and a Collaborative Workspace Environment. Institutions can selectively deploy these components to greatly assist their researchers in managing their research data.


Introduction
The need for data management infrastructure is being felt ever more acutely in the e-research community due to:
• The quantity of scientific data increasing exponentially, challenging researchers to keep track of it all;
• This large quantity of electronic data creating new challenges for collaboration;
• Much greater expectations placed on online publishing of data and verifiability of experiments;
• Concerns about security and privacy in many disciplines of e-research; and
• A significant push to streamline the workflows of e-research by providing centralised, persistent, and reliable storage.
The ARCHER Project 1 was funded by the Australian Government's Department of Education, Science and Training in 2006 as an attempt to begin to address these concerns. By providing a cogent, data-centred view of the e-research enterprise, ARCHER allows researchers the flexibility of iterative and heuristic workflows and ease of collaborative management, and ensures that data remain well curated and publication-ready, with appropriate metadata, provenance, and authorisation.
ARCHER has produced a suite of tools developed jointly by Monash University, James Cook University, and the University of Queensland, drawing on, integrating and extending existing open source toolkits. It provides infrastructure to assist researchers in collecting, managing, storing, collaborating on, and publishing scientific data. The Project builds on the earlier DART Project (Atkinson, Beitz, Buckle & Treloar, 2007; Faux et al, 2007; Treloar, 2006, 2007a, 2007b), taking selected proof-of-concept tools and moving them to production. ARCHER completed its tool suite in September 2008, and the resulting products and source code are openly available from its website 1.
In this paper, we take a closer look at the features and benefits of the ARCHER suite of data management tools, identify ARCHER's relationship with the Australian e-Research environment, and demonstrate how its components can be loosely coupled with other e-Research data management components to provide a comprehensive data management solution from data collection, through to publication, and to its eventual dissemination within a research community.

Overview
The ARCHER initiative has developed a suite of open-source production-ready generic e-Research infrastructure components to provide better management of research data, including:
• DIMSIM 2 - concurrent data capture and analysis, and telemetry
Figure 1 demonstrates how these components may be integrated.
Figure 1. Potential architecture for ARCHER's deployment.

Distributed Integrated Multi-Sensor & Instrument Middleware (DIMSIM)
New scientific instruments are collecting research data at phenomenal rates, and conventional practices, such as storing the collected research data on CDs or portable hard drives, will not suffice to ensure long-term storage and management. Other potential challenges include: dealing with complex and distributed instruments; determining the status of a remote experiment; transferring data from a remote instrument to the desired data store; and starting an analysis while the experiment is still running.
DIMSIM addresses these problems and allows multiple sensors to be easily integrated. It is built on CIMA 5 (Atkinson et al, 2007), which makes instruments more easily accessible over a network. In turn this supports: direct deposition of collected research data into a network data store, without human intervention; concurrent analysis; and remote telemetry. By having DIMSIM deposit research data directly into a large, reliable, and secure institutional research repository, many storage concerns are alleviated. If the institutional research repository also supports rich metadata, then curation can begin at collection time, improving curation quality and potentially reducing its costs. Figure 2 shows a snapshot of live telemetry being captured by DIMSIM during an X-ray crystallography experiment.

ARCHER Research Repository
A research repository (for experimental data) needs to: be secure and reliable; cope with large and numerous datasets; support rich discipline-specific metadata; support good data management practices; and be easily accessible.
The Storage Resource Broker (SRB) was chosen as the foundation for ARCHER's Research Repository because of its demonstrated ability to deal with large and numerous datasets. One key limitation, however, was its metadata repository (MCAT 6), which only supports key/value pairs. ARCHER considered this inadequate for the requirements of proper data curation and so augmented the metadata repository with an additional metadata store called iCAT. iCAT is implemented using a relational database which, although not the most flexible solution, was chosen as the new store for the metadata about research data mainly because of its ability to scale.
A schema was required which provided at the top levels a highly structured and scientific discipline-agnostic approach, while allowing for additional discipline-specific metadata. CCLRC's 7 Scientific Metadata Model was identified as the most suitable solution, providing a rigid structure of Project (Study) → Experiment (Investigation) → DataSet → DataFile in the top levels (see Figure 3), and offering discipline-specific schemas associated with Samples, DataSets, and DataFiles.
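The hierarchy described above can be illustrated with a minimal Python sketch. It is not the iCAT schema itself: the class and field names are assumptions made for illustration, the discipline-specific metadata is reduced to free-form dictionaries, and attaching Samples to an Experiment is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataFile:
    name: str
    # Discipline-specific metadata (keys are illustrative, not a real schema).
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class DataSet:
    name: str
    files: List[DataFile] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Sample:
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Experiment:  # "Investigation" in the CCLRC model
    title: str
    samples: List[Sample] = field(default_factory=list)  # placement assumed
    datasets: List[DataSet] = field(default_factory=list)


@dataclass
class Project:  # "Study" in the CCLRC model
    title: str
    experiments: List[Experiment] = field(default_factory=list)


def count_files(project: Project) -> int:
    """Walk Project -> Experiment -> DataSet -> DataFile and count the files."""
    return sum(len(ds.files) for exp in project.experiments for ds in exp.datasets)


# Hypothetical example: one project, one experiment, one dataset of two images.
dataset = DataSet(
    name="run-01",
    files=[DataFile("img_001.img", {"wavelength": "0.9537"}), DataFile("img_002.img")],
    metadata={"technique": "X-ray diffraction"},
)
project = Project("Protein structures", [Experiment("Crystal A", datasets=[dataset])])
```

The fixed upper levels stay the same for every discipline, while the dictionaries carry whatever a particular discipline needs to record.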

Crystallography Data Management System (XDMS)
XDMS is the web-based data management component of the ARCHER suite of e-Research infrastructure tools, and sits on top of ARCHER's Research Repository. It promotes good data management practices and provides researchers with data access, data deposit, data export, curation facilities, search and discovery services, and the ability to associate persistent identifiers with datasets.
XDMS provides two levels of metadata support: a generic core metadata profile, applicable across disciplines, using the CCLRC Scientific Metadata Model; and a domain-specific metadata profile, which is user-configurable and editable by the ARCHER Metadata Editor. Metadata associated with the various levels within the CCLRC metadata hierarchy, including discipline-specific metadata, can be searched and browsed, enabling researchers to easily locate objects and collections.
XDMS provides support for the deposition of research data, and can automatically extract a datafile's metadata from its header and associate it with the deposited datafile. Due to connection timeout issues inherent in all web browsers, ingestion of large quantities of research data via HTTP is not practical, and deposition of multiple datafiles is better handled by ARCHER's desktop client data management component HERMES.
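Header-based metadata extraction of this kind can be sketched as follows. This is a hedged illustration only, not XDMS code: the header layout (plain-text `KEY=VALUE;` lines in the first 512 bytes), the function names, and the in-memory `repository` dictionary are all assumptions.

```python
import re
from typing import Dict

# One "KEY=VALUE;" pair per line (hypothetical header layout).
HEADER_FIELD = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*?)\s*;\s*$")


def extract_header_metadata(raw: bytes, header_bytes: int = 512) -> Dict[str, str]:
    """Parse KEY=VALUE; pairs from the first header_bytes bytes of a datafile."""
    header = raw[:header_bytes].decode("ascii", errors="ignore")
    metadata: Dict[str, str] = {}
    for line in header.splitlines():
        match = HEADER_FIELD.match(line)
        if match:
            metadata[match.group(1)] = match.group(2)
    return metadata


def deposit(repository: Dict[str, dict], name: str, raw: bytes) -> dict:
    """Store a record for the datafile together with its header metadata."""
    record = {"size": len(raw), "metadata": extract_header_metadata(raw)}
    repository[name] = record
    return record
```

Because the metadata is captured at deposit time, the researcher does not need to re-enter values that the instrument has already written into the file.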

HERMES
HERMES is ARCHER's desktop client data management tool, and can sit on top of ARCHER's Research Repository. It functions as a desktop file browser, which allows browsing of local drives, Samba, SRB, GridFTP, FTP and Secure FTP file systems. HERMES allows upload and download of large files singly and in batches which, coupled with its support for a wide range of storage solutions, makes it ideal as a file transfer agent.

ARCHER Collaborative Workspace Development Tool
Generally, each research discipline within an institution has its own unique set of needs for e-Research technologies. These needs may also vary within the same discipline in different institutions, making it practically impossible to produce a generic e-Research portal which satisfies all researchers. Therefore some level of customisation is usually necessary. ARCHER's approach was to develop generic e-Research components which could be coupled together selectively and which would share the same authentication system. Such an approach included the adoption of a portal development tool that was easily adaptable and which supported customisation of the information architecture, including research data stored in ARCHER's Research Repository.
PLONE is a popular content management system and was extensively customised

ARCHER's Place in the Australian e-Research Environment
Australia's e-Research environment is influenced by the Platforms for Collaboration (PFC) capability 9, which is part of the National Collaborative Research Infrastructure Strategy 10. The PFC contains a number of services, two of which (ARCS 11 and ANDS 12) can be associated with ARCHER's tools.
This section provides a brief overview of these services and describes ARCHER's synergies with each of them.

Australian Research Collaboration Service (ARCS)
ARCS's objective is to provide long-term e-Research support services for the Australian research community with a particular focus on interoperability and collaboration infrastructure, tools, services and support. It offers services such as:
• Video collaboration;
• Web-based collaboration;
• Research Data Fabric; and
• Remote Instrumentation and Sensor Network Activities.

Australian National Data Service (ANDS)
The overall objective of this service is to improve researchers' practices in managing their research data, predominantly by:
• Transforming collections of Australian research data into a cohesive network of research repositories;
• Assisting Australian research data managers to become experts in creating, managing and sharing research data under well-formed and maintained data management policies;
• Increasing the amount of research data that is routinely deposited into stable, accessible and sustainable data management and preservation environments;
• Enabling researchers to find and access any relevant data in the Australian "data commons"; and
• Facilitating the sharing of Australian data to support international and nationally distributed multidisciplinary research teams.
Through its work on research repositories, ARCHER has contributed to the development of the ARCS Data Fabric 13. The ARCS Data Fabric is intended to make it easy for researchers to store and share their data outside their usual institutional confines. This is encouraging new collaborations to form, and providing new research opportunities.

ARCHER's Synergies with ARCS
ARCS has also adopted HERMES as the front-line tool for providing access to its Data Fabric from a client desktop. HERMES's support for multiple file systems makes it very easy for researchers to move their data from one digital repository to another.

ARCHER's Synergies with ANDS
ARCHER's relationship with ANDS is that it provides software components that enable Australian researchers to manage their research data better, thereby, it is hoped, increasing the amount of data stored in secure, reliable, and sustainable repositories. ANDS hopes that this will help to facilitate the sharing of Australian research data, both locally and internationally.

Research Data Management: Gluing the Pieces Together
This section describes how ARCHER's components may be coupled with additional data management components from the ARROW and TARDIS projects to provide researchers with a comprehensive data management solution.

ARROW
ARROW 14 is a consortium consisting of Monash University (lead institution), together with the University of New South Wales, Swinburne University of Technology, and the National Library of Australia. Its objective is to identify and test software or solutions to support best-practice institutional digital repositories that would contain e-prints, electronic theses, e-research and electronic journals. This project has produced the ARROW Repository 15, an institutional publication repository built on the Fedora open source repository platform (Lagoze et al, 2006).

TARDIS
TARDIS (The Australian Repositories for Diffraction Images) 16 (Androulakis et al, 2008) is a multi-institutional collaborative venture consisting of Monash University (lead institution), Institute for Molecular Bioscience (University of Queensland), the University of Melbourne, St Vincent's Health Melbourne, Bio21 Institute, ARROW, ARCHER, the Australian National University, the University of Sydney, Australian Partnership for Sustainable Repositories, eCrystals Federation Project, and the University of Southampton, UK. It aims to facilitate the archiving and sharing of raw X-ray diffraction images, collectively known as a dataset. It has developed a number of client desktop tools that assist in the preparation and deposition of a collection of raw crystallographic datasets into an institutional publication repository. It also provides a community portal, "TARDIS", which harvests metadata of published crystallographic datasets from registered institutional publication repositories, indexes the collected metadata, and then provides a federated search across institutional repositories.
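The harvest, index, and search steps of such a portal can be sketched in a few lines of Python. This is an illustrative simplification rather than the TARDIS implementation: the repositories are stand-in in-memory lists, the record fields and function names are hypothetical, and no particular harvesting protocol is assumed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Record = Dict[str, str]
Harvested = List[Tuple[str, Record]]


def harvest(repositories: Dict[str, Iterable[Record]]) -> Harvested:
    """Collect (repository name, metadata record) pairs from every repository."""
    return [(repo, rec) for repo, recs in repositories.items() for rec in recs]


def build_index(harvested: Harvested) -> Dict[str, Set[int]]:
    """Map each lower-cased word in any metadata value to record positions."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for position, (_repo, record) in enumerate(harvested):
        for value in record.values():
            for token in value.lower().split():
                index[token].add(position)
    return index


def search(query: str, harvested: Harvested, index: Dict[str, Set[int]]) -> Harvested:
    """Return records, from any repository, that contain every query word."""
    tokens = query.lower().split()
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return [harvested[i] for i in sorted(hits)]


# Hypothetical registered repositories and records.
repositories = {
    "monash": [{"id": "m1", "title": "Lysozyme diffraction images"}],
    "uq": [{"id": "u1", "title": "Insulin diffraction images"}],
}
harvested = harvest(repositories)
index = build_index(harvested)
```

Only the metadata is copied to the portal in this sketch. Each result keeps the name of its source repository, so the datasets themselves can stay where they were published.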

Modelling the Curation and Migration of Research Data from Collection to Dissemination
As data are collected, shared, published, and disseminated, they migrate through a range of conceptual domains (Treloar et al, 2007; Treloar & Harboe-Ree, 2008). Each of these domains can be defined by a set of attributes which describes the data objects and the repositories that store them (e.g. accessibility of the data and richness of the metadata). The boundaries between these domains can be referred to as curation boundaries.
6 MCAT - SRB http://www.sdsc.edu/srb/index.php/MCAT
7 Council of the Central Laboratory of the Research Councils - now the Science and Technology Facilities Council (STFC) http://www.scitech.ac.uk/
The International Journal of Digital Curation Issue 1, Volume 4 | 2009
research data in both native file format and packaged into a METS format. It can also deposit the METS package directly into a Fedora-based Public Domain Repository. Public Domain repositories are where data are made available to a general audience rather than the collaboration group (Treloar & Harboe-Ree, 2008), with a guarantee of long-term persistence. These are typically provided by institutionally supported repositories, and use technologies such as Fedora and DSpace rather than SRB, so packaging is necessary for transferring the data across. As with deposition, exporting large quantities of research data is better handled by HERMES. XDMS allows persistent identifiers to be generated for research datasets using CNRI's 8 Handle technology, enabling researchers to easily share references to their Datasets with selected colleagues.
One of the ARCS's collaborative tool offerings is ARCHER's enhanced version of PLONE. Its customised plug-ins make it well suited to the developing e-Research environment, and allow PLONE collaborative tools to directly access and link to research data stored in an ARCHER Research Repository, enabling researchers to easily collaborate around their research data.
There are four domains:
• Private Research Domain, where data are initially collected and are generally only shared amongst a tight-knit research team;
• Shared Research Domain, where the team may open up access to the research data to a select group of researchers (e.g. reviewers assessing prepublished research data);
• Public Domain, where the data are relatively open to the public; and
• Community Domain, where selected data are made available to a community for dissemination and further collaboration.
This model is explained further in Figure 5 below.

Figure 5. A Model for the Curation and Migration of Research Data. 17
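As a rough illustration of this model, the four domains and the curation boundaries between them can be sketched in Python. The ordering follows the sequence in which the domains are listed in the text. The metadata fields required at each boundary are purely hypothetical, chosen only to show that crossing a boundary can be conditional on richer metadata.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, Set


class Domain(IntEnum):
    PRIVATE_RESEARCH = 1
    SHARED_RESEARCH = 2
    PUBLIC = 3
    COMMUNITY = 4


# Hypothetical metadata a data object must carry before entering each domain.
REQUIRED: Dict[Domain, Set[str]] = {
    Domain.SHARED_RESEARCH: {"owner"},
    Domain.PUBLIC: {"owner", "title", "identifier"},
    Domain.COMMUNITY: {"owner", "title", "identifier", "discipline"},
}


@dataclass
class DataObject:
    name: str
    domain: Domain = Domain.PRIVATE_RESEARCH
    metadata: Dict[str, str] = field(default_factory=dict)


def migrate(obj: DataObject) -> Domain:
    """Move obj across the next curation boundary if its metadata suffices."""
    if obj.domain is Domain.COMMUNITY:
        raise ValueError("already in the final domain")
    target = Domain(obj.domain + 1)
    missing = REQUIRED[target] - set(obj.metadata)
    if missing:
        raise ValueError(f"cannot enter {target.name}: missing {sorted(missing)}")
    obj.domain = target
    return target
```

A data object with only an `owner` field can move into the Shared Research Domain in this sketch, but is stopped at the next boundary until a title and an identifier have been added.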

• XDMS - web-based research data manager and curator
• HERMES - desktop client research data manager and file transfer agent
• Collaborative Workspace Development Tool (based on Plone 4), for creating e-Research Portals

by ARCHER to support e-research collaboration anchored to research data and projects. Part of this customisation is a PLONE plug-in, which enables PLONE to access the ARCHER Research Repository. This in turn enables links, comments, blogs, and discussions to be made on stored research data.
8 Corporation for National Research Initiatives