Computing Environments for Reproducibility: Capturing the "Whole Tale"

The act of sharing scientific knowledge is rapidly evolving away from traditional articles and presentations to the delivery of executable objects that integrate the data and computational details (e.g., scripts and workflows) upon which the findings rely. This envisioned coupling of data and process is essential to advancing science but faces technical and institutional barriers. The Whole Tale project aims to address these barriers by connecting computational, data-intensive research efforts with the larger research process, transforming the knowledge discovery and dissemination process into one where data products are united with research articles to create "living publications" or "tales". The Whole Tale focuses on the full spectrum of science, empowering both users in the long tail of science and power users who need access to big data and compute resources. We report here on the design, architecture, and implementation of the Whole Tale environment.


Introduction
The pervasive use of computation for scientific discovery has ushered in a new type of scientific research process. Researchers, irrespective of scientific domain, routinely rely on large amounts of data, specialized computational infrastructure, and sophisticated analysis processes from which to test hypotheses and derive results. While scholarly research has evolved significantly over the past decade, the same cannot be said for the methods by which research processes are captured and disseminated. In fact, the primary method for dissemination, the scholarly publication, is largely unchanged since the advent of the scientific journal in the 1660s. This disparity has led many to argue that the scholarly publication is no longer sufficient to verify, reproduce, and extend scientific results [Pen11, KS14, AAQAMI11, SBB + , DMR + 09, SLP14].
The challenges associated with rethinking the scholarly publication model are complicated by the pervasive increase in the collection and analysis of data, coupled with dramatic increases in computational power, and new methods for investigation such as data-driven discovery. The scientific landscape is now littered with a vast array of powerful cyberinfrastructure for acquiring, storing, analyzing, publishing, and archiving data. However, current approaches regarding the dissemination, validation, and verification of computationally based research outcomes do not yet accommodate this reality. Despite the increasing recognition of the need to share all aspects of the research process, scholarly publications today are often disconnected from the underlying data and code that produced the findings. While efforts have been made to support data publication [Cro11,fig17,CPB + 15], "unfortunately, the vast majority of data submitted along with publications are in formats and forms of storage that makes discovery and reuse difficult or impossible" [COP15].
Studies of published data have also shown that data availability decays with time, if the data are available at all [VAA + 14].
To address these challenges we present Whole Tale [LCG + 16], a research environment that captures and, at the time of publication, exposes salient details of the entire research process via access to persistent versions of the data and code used, workflow provenance, and data lineage (including parameter settings, intermediate, and output data). The Whole Tale directly addresses the transformation of the scientific enterprise to deeply computational research by supporting the entire research pipeline, from prepublication collaboration, through publication, and to post-publication access and re-use in the broader scientific community. We present here the design and current implementation of the Whole Tale environment (https://dashboard.wholetale.org).
The Whole Tale strengthens the three layers of scholarly publication: scholarly process, data, and computational analysis. Traditionally, the first layer of scholarly publication has been accomplished through the production and dissemination of research articles. As data become more open and transportable, a second layer of research output, linking publications to the associated data, has emerged [KS14]. This is now followed by the recognition of an important and new third layer: communicating the process of inquiry itself, i.e., a complete computational narrative, through the linking and sharing of methods, source code, and data, thereby introducing a new model of reproducible science and accelerated knowledge discovery [SMB + 16a]. The Whole Tale strengthens the second layer (linking data, code, and digital scholarly objects to publications), builds a robust third layer that integrates all parts of the research story into a computational tale (conveying the holistic experience of reproducible scientific inquiry, i.e., sharing the source code, data, and methods, along with the computational environment in which inquiry is conducted), and makes both layers accessible from the scholarly publication. To the user it thus appears that by sharing a paper as a tale, the narrative is shared together with an on-demand, virtual computer that is preloaded with all the relevant data, methods, software packages, and analysis frontends needed to reproduce, tinker with, or even extend the paper.
The Whole Tale environment can also be seen as a form of science gateway [WD07]: it simplifies access to a vast array of cyberinfrastructure for a broad range of domain scientists. The architecture described herein builds upon advances made within the science gateways community to leverage external services for core functionality such as user authentication and authorization, data management, and management of computational resources. Finally, the Whole Tale architecture is based on extensible APIs that can be leveraged by other gateways for recording computational processes, importing and managing data, issuing identifiers, and sharing and publishing reproducible tales.
The remainder of this article is structured as follows. In Section 2 we first present examples from three scientific domains that highlight some of the challenges commonly faced by computational scientists. In Section 3 we describe prior and current efforts towards enabling reproducible research. We then describe high level requirements of the Whole Tale in Section 4 before presenting the architecture and current implementation in Section 5. In Section 6 we review related work. Finally, we summarize our contributions in Section 7.

Science Narratives
We first describe three scientific domains that, like many others, have embraced computational and data-driven science. We focus specifically on examples that elucidate usage requirements for the Whole Tale.

Materials Science
Materials scientists are now generating vast amounts of computational and experimental data from a wide set of user facilities (e.g., the APS, SNS, NSLS-II), from simulations at Leadership Computing Facilities, from individual research labs, and from high-throughput experiments. To address the deluge of high quality data there are now numerous data repositories designed to store and provide access to curated materials data including: the Materials Data Facility (MDF) [BCP + 16], Materials Project [JOH + 13], Citrination [OMM16], and NoMaD (Novel Materials Design) [TJ16]. With these rich materials data sources, opportunities are available to conduct new types of analysis and for researchers to supplement their own data to expand investigations.
Computational approaches are having a profound effect on materials science. Over the past several decades, concurrent advancements in physics-based simulation methods and computing power have made it possible to model material behavior on a large range of length and time scales [Yip07]. Consequently, computational tools are starting to become an integral part of designing new materials [CNB + 04]. For example, quantum-mechanics-based calculation tools are routinely used in the development of structural metals and semiconductors, among other materials [CHN + 13]. Increasingly, these computational processes are based on new machine learning methods to construct rich models of materials properties [HMP + 16, WACW16,Raj05]. With such changes, researchers are increasingly in need of new methods for publishing not only references to the data used to derive results but also the models and computational processes that underpin results.
As one example of these changes we describe the process undertaken by Ward et al. to design new metallic glasses using machine learning models [WACW16]. The authors first assembled a collection of materials data [KYTM97]. In order to build a machine learning model, they used custom software to transform the raw data (text strings describing each material's composition and properties) into a form compatible with their models: finite-length vectors of physically meaningful inputs. They trained machine learning models using Weka [HFH + 09] and employed the models to scan over several million compositions to identify novel glass-forming alloys. In an effort to make their methods verifiable and reproducible the authors published their workflows (as text input and data files) as supplementary information to the paper. Using Whole Tale these researchers could streamline this process via access to a large and varied amount of data, a platform for conducting their analyses using containerized frontends, and the ability to subsequently publish their entire method (including data and analyses) with a persistent identifier. Readers of their manuscript could then view their methods (as a tale) and reproduce the exact steps taken within the Whole Tale environment.

Astronomy
Detailed analysis and visualization of astronomical datasets, particularly those generated from computational simulations, requires both access to the original underlying data (or catalogs of reduced data products) and access to computing resources. One particular example being explored by the Whole Tale project is that of studying the formation of the first stars and galaxies in the universe (see, for instance, [SWO + 15]), but another common use case is that of galaxy formation [KAA + 14]. These simulations are conducted on large-scale computing resources; typically, the analysis utilizes community packages such as yt [TSO + 11] and, through the development of scripts and interactive analysis sessions, produces either publication-level plots or reduced data products that can be reanalyzed at a later date. In the case of observational astronomy, multiple datasets may need to be synthesized to create a unified understanding of either a particular class of object or a region of the sky; in many cases, this will require small "slices" of data from many different sources (e.g., from several public registries) to be combined.
For the specific case of analysis and visualization of simulations, Whole Tale will provide access to a collaborative environment, where scripts and analysis methods can not only be transplanted seamlessly between datasets, but where they can be collaborated on between individuals, such as an advisor and a student. Researchers will be able to conduct simulations, make available the results of those simulations inside the Whole Tale environment, and then conduct their analysis in that environment directly. The scripts that produce plots and analysis products for publication purposes will be combined with these datasets to form a tale, which can then be accessed, remixed, and modified for subsequent analysis either by those researchers or by others. This will open up new avenues for discovery, which at present is constrained both by the difficulty of providing access to data and by the difficulties inherent in collaborating on the specific methods of analysis and visualization.

Archaeology
A set of grand challenges in archaeology have been identified by the community through a crowdsourced effort and synthesis workshop [KAB + 14]. While archaeological data and research are essential to addressing fundamental questions, e.g., about the origin and trajectories of civilizations or the response of societies to climate change, the community lacks the capacity for acquiring, managing, analyzing, and synthesizing datasets needed to address such important questions. This in turn led to recommendations for computational infrastructure, tools, and scientific case studies to demonstrate archaeology's ability to contribute to transdisciplinary research on long-term social dynamics [KAK + 15].
One such project is SKOPE [SKO17], which is developing an online resource and toolkit for paleoenvironmental data and models that will enable researchers to easily discover, explore, visualize, and synthesize knowledge of environmental factors most relevant to humans in the past. SKOPE's focus on transparent, reproducible research, facilitated by different forms of provenance, makes it an ideal partner project and science driver for Whole Tale. To address the complex, multi-stage workflows inherent in this domain, these researchers will employ YesWorkflow [MSK + 15, MBBL15] to create graphical, queryable representations of each of the computational workflows enacted as part of the research. These workflows capture the prospective provenance of all data products generated during the study. Such workflows can be used within the Whole Tale environment to create hybrid forms of provenance [PDM + 16, ZCW + 17] that combine prospective with retrospective provenance information of intermediate and final data products, complete with records of the specific program executions involved, the values of program arguments applied, and, where possible, the values of key variables within the programs themselves as exposed by YesWorkflow.
Finally, there are several community efforts such as "How To Do Archaeological Science Using R" [Mar17] that aim to improve community practice: i.e., instead of sharing methods only via traditional publications, reproducibility and reuse are facilitated by authors communicating their methods also via open code repositories and using tools to package computational narratives as research compendia for R [MBM17, BBJ + 17]. These efforts provide current, real-world science use cases that Whole Tale aims to support and enhance.
1. To facilitate reproducibility, share the data, software, workflows, and details of the computational environment in open trusted repositories.
2. To enable discoverability, persistent links should appear in the published article and include a permanent identifier for data, code, and digital artifacts upon which the results depend.
3. To enable credit for shared digital scholarly objects, citation should be standard practice [SKNF16,Mar14].
4. To facilitate reuse, adequately document digital scholarly artifacts.
5. Journals should conduct a Reproducibility Check as part of the publication process and enact the Transparency and Openness Promotion (TOP) Standards at level 2 or 3 [NAB + 15a].
6. Use Open Licensing when publishing digital scholarly objects [Sto09a,Sto14].
7. To better enable reproducibility across the scientific enterprise, funding agencies should instigate new research programs and pilot studies.
To date, more than 5000 journals have signed on to the TOP Guidelines [NAB + 15b, Sto09a]. Journals are progressively taking steps to encourage the submission and publication of reproducible computational research [SGM13].
• c. Investigators and grantees are encouraged to share software and inventions created under the grant or otherwise make them or their products widely available and usable.
The NSF held an agency-wide Director's Symposium "Robust and Reliable Science: The Path Forward" on September 10, 2015. More recently, on February 25-26, 2017, the National Science Foundation's Directorate on Mathematical and Physical Sciences held a workshop "Systematic Approaches to Robustness, Reliability, and Reproducibility in Scientific Research" fomenting a discussion around reproducibility [MRS17]. In December of 2016 the Advisory Committee to the Computer and Information Science and Engineering Directorate at NSF released a report "Realizing the Potential of Data Science" [BRC + ] which included recommendations on reproducibility:
• Recommendation 2: Invest in research into data science infrastructure that furthers effective data sharing, data use, and life cycle management: ... Research outcomes should ultimately be translatable to infrastructure that enables access to data in ways that: ... (iii) support reproducibility; (iv) support access, provenance, sustainability, and other life cycle challenges.
• Recommendation 3: Support research into effective reproducibility: Develop research programs that support computational reproducibility and computationally-enabled discovery, as well as cyberinfrastructure that supports reproducibility.
Scientific societies, in part in their role as publishers, are taking steps toward reproducibility as well. The ACM has implemented a system of badging for publications that have digital artifacts available [ACM16]. In November of 2016 IEEE held a workshop on publication practices for reproducibility, "The Future of Research Curation and Research Reproducibility" [B + 16].
Finally, the National Academies of Sciences, Engineering, and Medicine released a report in April 2017, Fostering Integrity in Research [Nat17], which contained two recommendations regarding reproducible research:
• Recommendation 6: Through their policies and through the development of supporting infrastructure, research sponsors and science, engineering, technology, and medical journal and book publishers should ensure that information sufficient for a person knowledgeable about the field and its techniques to reproduce reported results is made available at the time of publication or as soon as possible after publication.
• Recommendation 7: Federal funding agencies and other research sponsors should allocate sufficient funds to enable the long-term storage, archiving, and access of datasets and code necessary for the replication of published findings.
There have been concurrent advances in European open access and open data policy. In 2003 the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities was signed by nearly 300 stakeholder groups including research and educational institutions, libraries, museums, funding agencies, and governments from around the world to help establish the Internet as the primary medium of communication and dissemination of scientific knowledge [Ber03]. EUDAT [EUD17], a European-based effort to share and preserve data across international borders and across research disciplines, was started in 2012 and continues actively today. OpenAIRE [Ope17] is a European repository effort to, in part, link data to publications and was started in 2009. On the infrastructure side, EuroCloud [Eur17] was launched in 2010 in part to support cloud-based research and innovation in Europe.
The 2017 version of the European Code of Conduct for Research Integrity explicitly mentions data integrity [ALL17]. Their list of "Good Research Practices" includes:
• Research institutions and organisations support proper infrastructure for the management and protection of data and research materials in all their forms (encompassing qualitative and quantitative data, protocols, processes, other research artefacts and associated metadata) that are necessary for reproducibility, traceability and accountability.
A Dagstuhl seminar on Reproducibility of Data-Oriented Experiments [FFR16] summarizes:
• Transparency, openness, and reproducibility are vital features of science. Scientists embrace these features as disciplinary norms and values, and it follows that they should be integrated into daily research activities. These practices give confidence in the work; help research as a whole to be conducted at a higher standard and be undertaken more efficiently; provide verifiability and falsifiability; and encourage a community of mutual cooperation. They also lead to a valuable form of paper, namely, reports on evaluation and reproduction of prior work. Outcomes that others can build upon and use for their own research, whether a theoretical construct or a reproducible experimental result, form a foundation on which science can progress. Papers that are structured and presented in a manner that facilitates and encourages such post-publication evaluations benefit from increased impact, recognition, and citation rates. Experience in computing research has demonstrated that a range of straightforward mechanisms can be employed to encourage authors to produce reproducible work. These include: requiring an explicit commitment to an intended level of provision of reproducible materials as a routine part of each paper's structure; requiring a detailed methods section; separating the refereeing of the paper's scientific contribution and its technical process; and explicitly encouraging the creation and reuse of open resources (data, code, or both).
As noted in several of the recommendations discussed above, new research and new technologies are needed to implement reproducible computational research, and the Whole Tale represents one initiative to address these gaps in our research and dissemination infrastructure.

Design Requirements
The Whole Tale project is intended to support the lifecycle of data. This means that all parts of the lifecycle, from data ingest or creation through to publication of the resulting scholarly objects such as data, code, workflows, and manuscripts, should be manageable within the Whole Tale framework. Our discussion of design and implementation therefore reflects an integrated view of the generation of computational scientific findings that includes all these research activities. This integrated approach to research is crucial to enable reproducibility and downstream re-use of scholarly objects.
To provide such support, Whole Tale incorporates data ingestion, identity management, data publication, and the deployment of user-facing "frontends." We use the term frontend to describe any environment in which data can be operated on, ranging from terminals with a command-line interface to specialized analysis programs. Examples of common frontends include interactive notebooks (e.g., Jupyter, RStudio), HTML5 web-apps, and domain-specific GUIs (e.g., OpenRefine). We briefly describe the requirements in several core areas to motivate the architecture presented in the following section.

Data Ingestion
Researchers now have access to an enormous amount of data from sources such as data repositories, instruments, and local storage. Researchers who want to act upon data, for example to test a hypothesis or reproduce a result, must first discover and then obtain access to data distributed across many possible locations. The Whole Tale environment aims to reduce these barriers by providing mechanisms by which researchers can ingest data from a wide variety of sources. We focus initially on four commonly used data sources:
• Data Repositories: There is an increasing number of domain-specific, institutional, project-centric, and publisher-owned data repositories. Many data repositories, including the Materials Data Facility (MDF) and DataONE [MAB + 12] (a federation of repositories), support common interfaces for accessing published metadata and data. These interfaces include the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [LvdSHNW08] as well as many custom REST interfaces.
• Storage systems: Research data is distributed across a range of local systems, from instruments to archival storage. Each storage system implements one of many interfaces for accessing that data (e.g., object storage, tape interfaces, cloud storage, high performance file systems, etc.).
• Web accessible data: There is a vast amount of data stored on web pages or web-based data repositories. In these cases, data can be discovered and downloaded using HTTP-based tools.
• Local data: Much research data exists on researchers' personal computers, shared clusters, or otherwise inaccessible (in terms of a common API) devices.
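As an illustration of the repository interfaces listed above, the following minimal sketch builds an OAI-PMH `ListRecords` request URL. The verb and the `metadataPrefix`, `set`, and `from` arguments are part of the OAI-PMH protocol; the base URL and the function name are hypothetical placeholders, and the sketch does not describe how Whole Tale itself queries repositories.

```python
from urllib.parse import urlencode


def oai_pmh_list_records_url(base_url, metadata_prefix="oai_dc",
                             set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec        # restrict harvesting to one set
    if from_date is not None:
        params["from"] = from_date      # selective harvesting by datestamp
    return f"{base_url}?{urlencode(params)}"


# The base URL below is an assumption used only for illustration.
url = oai_pmh_list_records_url("https://repository.example.org/oai",
                               from_date="2017-01-01")
print(url)
```

A harvester would issue an HTTP GET on this URL and parse the XML response, following any resumption token the repository returns.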

Analysis Frontends
As mentioned above, we use the term frontend to describe any environment in which data is operated on, ranging from command-line terminals to specialized analysis programs. The choice of frontend used for a specific scientific analysis may be based on analysis requirements, data type, or user preference. One example frontend that is commonly used by researchers is the Jupyter notebook environment [jup17a]. Jupyter notebooks support multiple language backends (Python, R, Julia, and many others), widget development for interactive exploration, file editing, and shell activity within a unified, web-based environment.
To address the needs of a wide range of users and use cases, the Whole Tale must support an extensible set of frontends. Users coming to Whole Tale should be able to search available frontends by the types of data they can support, the user interface offered (web, command-line, digital notebook, etc.), and which user provided them. Having discovered a frontend, users should then be able to rapidly deploy it on demand and access data directly from within the frontend. The Whole Tale environment must manage the execution of a frontend while also capturing the steps followed by a user such that the entire frontend can be packaged, published, and shared with others.
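The search requirement above can be sketched as a simple filter over a catalog of frontend descriptions. The `Frontend` record, its field names, and the sample catalog are all assumptions made for illustration; they are not the Whole Tale data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frontend:
    """Hypothetical description of a frontend in a searchable catalog."""
    name: str
    interface: str          # e.g., "notebook", "web", "command-line"
    data_types: frozenset   # data formats the frontend can open
    owner: str              # user who provided the frontend


def search_frontends(frontends, data_type=None, interface=None, owner=None):
    """Return the frontends that match every criterion that is not None."""
    results = []
    for fe in frontends:
        if data_type is not None and data_type not in fe.data_types:
            continue
        if interface is not None and fe.interface != interface:
            continue
        if owner is not None and fe.owner != owner:
            continue
        results.append(fe)
    return results


# Sample catalog; names, formats, and owners are invented for the example.
catalog = [
    Frontend("jupyter", "notebook", frozenset({"csv", "hdf5"}), "alice"),
    Frontend("rstudio", "notebook", frozenset({"csv"}), "bob"),
    Frontend("openrefine", "web", frozenset({"csv", "tsv"}), "alice"),
]
matches = search_frontends(catalog, data_type="hdf5", interface="notebook")
print([fe.name for fe in matches])
```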

Persistent Identification
One of the primary goals of the Whole Tale is to enable publication and identification of scholarly objects, where the term scholarly object is used to describe not only a traditional publication but also data and computational processes. A flexible identification and resolution service is required to allow persistent identifiers (e.g., DOIs, ARKs, Handles) to be associated with these different objects. Furthermore, models are needed to allow researchers to organize their objects in different ways, both for their own purposes and also to simplify collaboration and discovery. As such, the Whole Tale must provide a way for researchers to organize their scholarly objects.

Authentication and Authorization
The Whole Tale aims not to reinvent existing capabilities but rather to interoperate with existing services and cyberinfrastructure providers (e.g., repositories, compute environments, libraries). Each of these existing providers might be managed independently with proprietary identities and authorization models. It is therefore important that Whole Tale adhere to existing authentication models and ensure that digital artifacts accessed and created during the exploration and publication of scholarly objects are correctly authorized.
When designing the authentication model it is desirable that researchers are able to sign in once, access a range of supported services, and have their identity and permissions be used securely across services. Given the rate of identity proliferation, the authentication system should allow researchers to authenticate with their preferred identity (e.g., campus identity, ORCID, Google account) and control authorization at a fine grained level (e.g., revoking access when needed). Rather than restrict the identities used in the system, we instead associate identities with actions and artifacts, and allow other users to determine trust based on knowledge of the identity used. To enable extensibility to new tools and services, Whole Tale must support standard authentication and authorization protocols through which external services and clients can easily integrate with the system. Because the Whole Tale focuses on acting upon research data, we consider issues related to sensitive data (e.g., restricted medical or government data) outside the scope of this work.

Reproducibility: Defining a Tale
The final, and perhaps most important, aim of the Whole Tale is to define a model for reproducibility by capturing the data, methods, metadata, and provenance of a particular research activity within the system. We refer to this entity as a tale. As has been observed time and again, successful adoption of new models is often related to the ease by which they can be used. As such, it is crucial that capturing, publishing, and "replaying" a tale is simple and unobtrusive: the relevant provenance of an analysis should be transparently recorded without requiring users to manage or record the data and computational process used in their work. Having created one or more tales, researchers should be able to simply share them with others, publish them to connected repositories, associate persistent identifiers, and link them to publications. Other researchers who access a tale should, just as simply, be able to instantiate a version of the tale and execute it in the same state as it was when published. Tales also contain Intellectual Property metadata with licensing information for their components (data, scripts, workflow information, etc.), which is crucial to enabling ease of re-use, reproducibility, and broad access.

Architecture and Implementation
The Whole Tale architecture uses a range of flexible APIs to enable users to ingest and manage data, manage frontends, and capture, replay, and extend tales. The general architecture of the platform is shown in Figure 1. Our development philosophy follows open source principles to be consistent with our goals of research transparency, but more importantly to enable the re-use and extension of the project and encourage a community to grow around the Whole Tale (see https://github.com/whole-tale).

General Architecture
At the heart of the Whole Tale infrastructure lies the Metadata Management System, which creates an abstraction layer between user-facing interfaces and the physical location of the data. For this purpose we utilize Girder [gir17], a general purpose framework with a simple data model and REST interface. Using Girder, datasets can be organized into Collections, containing Folders and Items. Folders are a hierarchically nested organizational structure that consists of other Folders and Items. An Item is the basic unit of data in the system. Items live beneath Folders and contain zero or more Files, which represent raw data objects. Each organizational object, i.e., Collection, Folder and Item, can be annotated with metadata. Additionally Girder provides models for user and group management (Users, Groups) and an access control model for resource management. Each of these objects is represented by a model with a RESTful interface that can be used to create, store, and retrieve persistent records in an internal MongoDB. As a result, data is managed entirely by reference. That is, the data stored in external data repositories are completely decoupled from the data in Whole Tale: e.g., an Item representing an external object (e.g., an HTTP URL, or a file on a Globus endpoint) can be easily copied, renamed, and moved around Girder's Folder structure without performing any operations on the actual data. This approach is advantageous as it allows Whole Tale to scale to represent very large datasets leaving data management tasks to external systems (e.g., Globus) and only copying the data when needed.
The Whole Tale builds upon Girder by providing plugins that:
• Introduce new models for objects specific to the project, such as Recipe, Image, Tale, Instance, Repository (see Section 5.2 for details).
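To make the by-reference organization concrete, the following sketch models Folders, Items, and file references with plain Python dataclasses. It is an illustration only and does not use Girder's actual classes or REST API; the class names, the `move_item` helper, and the example URL are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class FileRef:
    """Reference to bytes held by an external system (not a copy)."""
    url: str
    size: int


@dataclass
class Item:
    name: str
    files: list = field(default_factory=list)
    meta: dict = field(default_factory=dict)


@dataclass
class Folder:
    name: str
    folders: list = field(default_factory=list)
    items: list = field(default_factory=list)
    meta: dict = field(default_factory=dict)


def move_item(item, src, dst, new_name=None):
    """Move (and optionally rename) an Item; only references change hands."""
    src.items.remove(item)
    if new_name is not None:
        item.name = new_name
    dst.items.append(item)


raw = Folder("raw")
curated = Folder("curated")
# A one-terabyte external object is represented by a small record.
item = Item("run1.h5", files=[FileRef("https://data.example.org/run1.h5", 10**12)])
raw.items.append(item)
move_item(item, raw, curated, new_name="baseline.h5")
print(curated.items[0].name, curated.items[0].files[0].url)
```

The move and rename touch only the in-memory records; the URL stored in the reference, and the data behind it, are unchanged.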
• Allow users to execute tasks such as building Images from Recipes and creating Instances from Tales.
• Manage transfers of streams of data from remote Repositories into running Instances (see Section 5.5 for details).

The Whole Tale Workflow
As mentioned in Section 4.1, Whole Tale's functionality includes the reuse of published scientific data. Users may register data located in an external repository which Whole Tale understands as native Girder objects, such as Folders and Items. Registration of data, say in preparation for conducting research in Whole Tale, is a two-step process. First, users provide a data identifier (e.g., DOI, data provider specific UUID, URL), which is passed to the external search engine of each supported data provider. Basic information about the dataset is obtained (e.g., name, size, provider; see Fig. 2). To obtain this information we define a new Repository endpoint 2 in Girder, which abstracts access to repository-specific interfaces. The registration procedure starts with the creation of a Folder object to group all the references related to the selected dataset (datasets may be comprised of many files and folders). For each of the files provided by the data provider an Item is created as a child object in the main Folder. Each Item stores the information about the original name of the file, its location, and the protocol to access it (e.g., HTTP, Globus, etc.).
In some cases datasets include references to other datasets. In this case, we create a sub-Folder for each reference and the procedure continues recursively. As this may be a time-consuming process, users are notified of progress. Once the registration is complete, users are allowed to modify the resulting data hierarchy, i.e., rename Items, move Folders, etc. However, these modifications do not affect the provenance attributes of those objects (e.g., source repository, location, name, etc.). It is important to note that when data is registered from a remote source the system creates only a shallow copy of the entire dataset. As the user interacts with the imported dataset (e.g., in a tale), the raw data is copied on the fly and cached thereafter to create a deep copy of the data. At present the Whole Tale supports repository access via DataONE and Globus [CTF14]. Additional repositories can be added by implementing a simple interface that provides the necessary information (name, size, location, and access protocol) and embedding it within the Repository model.
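The recursive registration procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the shape of the provider `listing`, the `register` function, the provider name `ExampleRepo`, and all field names are hypothetical.

```python
from typing import Any, Dict


def register(dataset: Dict[str, Any], provider: str) -> Dict[str, Any]:
    """Build a Folder-like dict of references for one dataset.

    `dataset` is a hypothetical provider listing of the form
    {"name": str, "files": [{"name", "url", "size"}], "datasets": [...]}.
    No data is downloaded; only references are recorded (a shallow copy).
    """
    folder: Dict[str, Any] = {"name": dataset["name"], "items": [], "folders": []}
    for f in dataset.get("files", []):
        folder["items"].append({
            "name": f["name"],
            # Provenance attributes: kept even if the user renames the Item.
            "provenance": {
                "provider": provider,
                "original_name": f["name"],
                "location": f["url"],
                "protocol": f["url"].split("://", 1)[0],
                "size": f["size"],
            },
        })
    # Referenced datasets become sub-Folders, registered recursively.
    for child in dataset.get("datasets", []):
        folder["folders"].append(register(child, provider))
    return folder


listing = {
    "name": "survey-2016",
    "files": [{"name": "a.csv", "url": "https://example.org/a.csv", "size": 120}],
    "datasets": [{
        "name": "calibration",
        "files": [{"name": "c.nc", "url": "globus://endpoint/c.nc", "size": 4096}],
    }],
}
tree = register(listing, provider="ExampleRepo")
tree["items"][0]["name"] = "renamed.csv"   # user edit; provenance is unchanged
```

Keeping the user-visible name separate from the provenance record is one simple way to let users reorganize the hierarchy without losing the link to the source.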
The availability of the data, and the fact that it can be freely composed into a dynamic dataset through the Folder and Item hierarchy, is a necessary ingredient of the most important artifact of the Whole Tale project: the tale itself. A tale bundles a frontend and relevant data into a research environment. The environment itself is based on a Docker image, a lightweight, stand-alone, executable package that includes everything needed to run a research environment for a tale. To ensure that the image can be reconstructed in exactly the same state, we require a machine-parseable description of all runtime dependencies. For this purpose we use a Dockerfile as a Recipe for constructing the environment (i.e., the Docker image). For reproducibility purposes we treat each modification to a given Dockerfile as a prescription for a different frontend. We store these recipes in a Git repository alongside a log of changes, and represent this object as a Recipe in Whole Tale.
Depending on the complexity of an image, the process of building it can be lengthy and resource-intensive. We utilize a Distributed Task Queue (Celery/ZMQ) that is integrated with Girder to asynchronously build, track the status of, and deposit Docker images in a local instance of a Docker registry, a server application that stores and distributes Docker images. The state of a given image is represented within the metadata management system through the Image model. Once a Docker image is successfully built and deposited in the Docker registry, it becomes accessible on registered Whole Tale compute resources. At that point it is available to the user as an executable research environment. The Image, which represents the frontend, and the Folder, which represents the data, can then be combined into a tale (see Fig. 3).
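The asynchronous build pattern can be illustrated with a short sketch. It deliberately uses Python's standard `concurrent.futures` thread pool as a stand-in for the task queue, and a placeholder `fake_build` function instead of a real Docker build and registry push. The `Image` fields and the status values are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Image:
    """Hypothetical stand-in for the build state held by the Image model."""
    recipe_commit: str
    status: str = "unavailable"   # unavailable -> building -> available | failed
    digest: str = ""


def fake_build(commit: str) -> str:
    """Placeholder for a real Docker build followed by a registry push."""
    if not commit:
        raise ValueError("recipe has no commit id")
    return "sha256:" + commit[:12]


def build_async(image: Image, pool: ThreadPoolExecutor):
    """Submit the build and update the Image record when it finishes."""
    image.status = "building"

    def run() -> None:
        try:
            image.digest = fake_build(image.recipe_commit)
            image.status = "available"
        except Exception:
            image.status = "failed"

    return pool.submit(run)


with ThreadPoolExecutor(max_workers=2) as pool:
    ok = Image(recipe_commit="3f2a9c1d7e0b4a55")
    bad = Image(recipe_commit="")
    futures = [build_async(ok, pool), build_async(bad, pool)]
    for fut in futures:
        fut.result()   # wait here; a web client would poll image.status instead
```

The request that triggers a build returns immediately, and the stored status is what a user interface would query while the build runs.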

Web Interface
The Whole Tale is designed to be easy to use and accessible to a wide range of users. Its primary interface is a web-based application that allows users to manage data; create, modify, and share frontends for analyzing data; and create, publish, and reproduce tales by linking together datasets and frontends.
The web interface supports the standard set of file and folder operations as one would expect in a desktop finder or file manager application (rename, move, delete, etc.). Files or datasets can be registered from external data repositories (via a search workflow) or dragged and dropped from a user's desktop into the environment. Users can also view their registered files, as well as public datasets that may have been registered by other users.
The Whole Tale web interface 3 (shown in Fig. 4) is implemented using the Ember.js open-source JavaScript web framework [Emb17]. Ember is based on the Model-View-ViewModel (MVVM) pattern, enabling developers to create single-page applications (SPAs). Ember also provides front-end data models, which provide seamless access to Web APIs. The Whole Tale interface is implemented using the Semantic UI development framework [Sem16].

Authentication and Authorization
We base the Whole Tale authentication and authorization model on Globus Auth [TAC + 16], a platform for identity and access management. By leveraging Globus Auth, we essentially outsource core authentication functionality to a highly reliable service provider and need not implement our own user management functionality (e.g., password management, user creation workflows, etc.). Globus Auth provides a number of desirable properties for the Whole Tale. First, it allows researchers to authenticate using a range of identities, including those common in academia (e.g., campus credentials and ORCID). It also allows researchers to link together different identities such that presentation of one identity enables permissions granted to any identity in that set. Second, it supports standard web authentication and authorization protocols (e.g., OpenID Connect and OAuth 2) that simplify integration in Whole Tale services and also provides an extensible model by which other related services can leverage Whole Tale capabilities. Third, it provides an extensible delegated authorization model by which services (e.g., Whole Tale) can obtain delegated tokens to access other services (e.g., data repositories) on behalf of users. Conversely, the model also allows external services (e.g., publishers) to obtain tokens to access Whole Tale services on behalf of users.
We have implemented support for Globus Auth by extending Girder's OAuth plugin. This integration allows users to authenticate with Whole Tale using any of the supported identity providers. Whole Tale is configured to request access ("scopes") to various resources on behalf of users including their profile and linked identities, as well as being able to access other services including Globus transfer and MDF.
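As a generic illustration of how scopes are requested in an OAuth 2 authorization-code flow, the sketch below builds the URL to which a user's browser would be redirected. The endpoint, client identifier, and redirect URI are hypothetical placeholders, not the values used by Whole Tale or Globus Auth; only the query parameter names and the OpenID Connect scope names follow the standards.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical values; a real deployment would use its registered client
# and the identity provider's documented authorization endpoint.
AUTHORIZE_URL = "https://auth.example.org/oauth2/authorize"
CLIENT_ID = "tale-demo-client"
REDIRECT_URI = "https://tale.example.org/oauth/callback"


def authorization_url(scopes, state):
    """Build an OAuth 2 authorization-code request URL."""
    params = {
        "response_type": "code",
        "client_id": CLIENT_ID,
        "redirect_uri": REDIRECT_URI,
        "scope": " ".join(scopes),   # OAuth 2 scopes are space-delimited
        "state": state,              # echoed back; guards against CSRF
    }
    return AUTHORIZE_URL + "?" + urlencode(params)


state = secrets.token_urlsafe(16)
url = authorization_url(["openid", "profile", "email"], state)
query = parse_qs(urlparse(url).query)
```

After the user consents, the provider redirects back with a short-lived code that the service exchanges for tokens limited to the requested scopes.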

Data Management
The Whole Tale Data Management System (DMS) 4 is responsible for managing the "bits" that make up the data used in tales. Primary data, which is data that is sourced from external services, does not, in general, come with a uniform access mechanism. Each external service is free to define its own rules and mechanisms of access. The DMS addresses this issue by providing a POSIX interface to primary data. This interface allows tales to act upon diverse, distributed data as if the data were local. A secondary goal of the DMS is to provide data locality. Primary data services are assumed to be geographically distributed; as such, there is significant latency when data is accessed directly. The DMS provides an abstraction layer that hides these differences. The main components of the DMS are:
• The transfer subsystem: manages the movement of data from external data providers to a storage area local to the Whole Tale infrastructure. It does so through the use of transfer adapters, which are specific to each external provider.
• The storage management system: controls the use of storage space by evicting data that is considered to be unlikely to be used frequently. It acts as a data cache for external data.
• The filesystem interface: allows tales to access cached external data through a POSIX interface.
From a user perspective, the process of consciously interacting with the DMS is limited to composing filesystem hierarchies to be used in tales. An initial process of ingestion described in Section 5.2 populates the Whole Tale backend with metadata about available external data collections. Many such collections are available and navigating them in a tale, through a filesystem interface, can be difficult. Users are, therefore, allowed to freely construct specialized subsets of all the data accessible to a user and known to Whole Tale. These specialized subsets are termed "sessions." Each tale is associated with a session. This association is seen by the user as a filesystem that contains the data items composing the session. The filesystem is implemented as a FUSE [S + 05] layer. The filesystem is currently stored using OpenStack's Cinder block storage. Direct access to files on this filesystem results in data transfers from external data sources, unless the data already exists locally. A locking mechanism ensures that data corresponding to files that are in active use by a tale cannot be removed to reclaim storage space.
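The interplay between sessions, transfer adapters, and local caching can be sketched in a few lines. This is an in-memory illustration only: there is no FUSE layer, the adapters are stubs that return placeholder bytes, and the names `ADAPTERS`, `adapter`, `Session`, and `TRANSFERS` are hypothetical.

```python
from typing import Callable, Dict, List
from urllib.parse import urlparse

# Registry of transfer adapters keyed by URL scheme. Each adapter returns
# the bytes for a URL; these are stubs for illustration.
ADAPTERS: Dict[str, Callable[[str], bytes]] = {}
TRANSFERS: List[str] = []   # log of URLs actually fetched from a provider


def adapter(scheme: str):
    """Decorator that registers a transfer adapter for one URL scheme."""
    def wrap(fn):
        ADAPTERS[scheme] = fn
        return fn
    return wrap


@adapter("https")
def fetch_https(url: str) -> bytes:
    TRANSFERS.append(url)
    return b"bytes via https for " + url.encode()


@adapter("globus")
def fetch_globus(url: str) -> bytes:
    TRANSFERS.append(url)
    return b"bytes via globus for " + url.encode()


class Session:
    """Maps POSIX-style paths to external URLs and fetches lazily."""

    def __init__(self, mapping: Dict[str, str]):
        self.mapping = mapping
        self.cache: Dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        url = self.mapping[path]
        if url not in self.cache:   # cache miss: trigger a transfer
            scheme = urlparse(url).scheme
            self.cache[url] = ADAPTERS[scheme](url)
        return self.cache[url]


session = Session({
    "/data/file1": "https://website/file1",
    "/data/file2": "globus://endpoint/file2",
})
first = session.read("/data/file1")
again = session.read("/data/file1")   # served from the local cache
other = session.read("/data/file2")
```

Dispatching on the URL scheme keeps provider-specific logic inside the adapters, so code running in a tale sees only paths.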
Maintaining local copies of all external data available to Whole Tale is not feasible. Consequently, the storage management system acts as a cache and garbage collector, periodically traversing the local storage and purging data in a way that meets storage constraints as well as optimizing the latency of data access for tales. The exact optimization mechanism is flexible and involves sorting data based on an objective function that is calculated based on metadata generated by the DMS, such as usage count, usage frequency, and time of last access.
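One possible form of such an objective function is sketched below. The scoring formula and its weighting are illustrative assumptions, not the function used by the DMS; the sketch only shows the general pattern of ranking cached entries by usage metadata, skipping locked files, and purging until a capacity constraint is met.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class CacheEntry:
    path: str
    size: int            # bytes
    use_count: int
    last_access: float   # seconds since the epoch


def score(entry: CacheEntry, now: float) -> float:
    """Higher score = more worth keeping. The formula is illustrative."""
    age_hours = (now - entry.last_access) / 3600.0
    return entry.use_count / (1.0 + age_hours)


def evict(entries: List[CacheEntry], locked: Set[str],
          capacity: int, now: float) -> List[str]:
    """Return paths to purge so that the total size fits within `capacity`."""
    total = sum(e.size for e in entries)
    victims: List[str] = []
    # Consider the least valuable entries first; never touch locked files.
    for e in sorted(entries, key=lambda e: score(e, now)):
        if total <= capacity:
            break
        if e.path in locked:
            continue
        victims.append(e.path)
        total -= e.size
    return victims


now = 1_000_000.0
entries = [
    CacheEntry("a.nc", 40, use_count=10, last_access=now - 3600),    # score 5.0
    CacheEntry("b.nc", 30, use_count=1, last_access=now - 36000),    # score ~0.09
    CacheEntry("c.nc", 50, use_count=2, last_access=now - 7200),     # score ~0.67
]
# b.nc has the lowest score but is in use by a running tale, so c.nc goes.
purged = evict(entries, locked={"b.nc"}, capacity=80, now=now)
```

If every evictable file is locked, the function returns fewer victims than needed, and the caller would have to retry later or refuse new transfers.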

Tale Execution and Management
Once a tale has been created it can be executed (see Fig. 6). The only requirement for execution on a specific compute resource is the availability of a Docker Engine and two lightweight helper daemons: a reverse proxy that is responsible for routing all traffic in and out of a running tale instance (e.g., configurable-http-proxy or NGINX), and a tale management daemon (TMD) 5 that is responsible for managing the tale and its data dependencies. Instantiating a tale is a multi-step process:
1. A request for a tale instance is sent from the Metadata Management System (MMS) to the TMD running on a computational cluster, along with credentials (access token).
2. The TMD creates a Docker volume using the "local" driver, which is basically an empty POSIX directory inside the host's filesystem.
3. The TMD creates a Docker container using an Image referenced by the tale and the volume created in the previous step.
4. The TMD creates the FUSE layer using the Folder referenced by the tale and mounts it in the mountpoint corresponding to the Docker volume.
5. The TMD starts the Docker container and registers the internal port by which the container can be accessed with the reverse proxy.
6. The TMD returns basic information about the container (routing path, container id, host where it is running, etc.) to the MMS.
7. The MMS creates an Instance model to store the information provided by the TMD and exposes it to the web interface.
The Instance object created during the tale instantiation is a regular RESTful object. It allows the UI to query information about running tales, and to update, suspend, or delete them.
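The instantiation sequence can be traced with a small in-memory sketch. The `FakeTMD` class and `create_instance` function below are hypothetical stand-ins: they issue no Docker or FUSE calls and merely record what a real daemon would do at each numbered step, with made-up identifiers and an assumed internal port.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeTMD:
    """In-memory stand-in for the tale management daemon; no Docker calls."""
    host: str
    log: List[str] = field(default_factory=list)
    routes: Dict[str, int] = field(default_factory=dict)

    def instantiate(self, tale: Dict[str, str], token: str) -> Dict[str, str]:
        # Step 1: the request arrives with an access token.
        if not token:
            raise PermissionError("missing access token")
        # Step 2: create an empty local volume.
        volume = "vol-" + uuid.uuid4().hex[:8]
        self.log.append("create volume " + volume)
        # Step 3: create a container from the tale's Image and the volume.
        container = "ctr-" + uuid.uuid4().hex[:8]
        self.log.append(f"create container {container} from {tale['image']}")
        # Step 4: mount the FUSE view of the tale's Folder on the volume.
        self.log.append(f"mount FUSE view of {tale['folder']} on {volume}")
        # Step 5: start the container and register it with the reverse proxy.
        path = "/tale/" + container
        self.routes[path] = 8888   # hypothetical internal port
        self.log.append("start container, register route " + path)
        # Step 6: return basic information to the caller.
        return {"container_id": container, "path": path, "host": self.host}


def create_instance(tmd: FakeTMD, tale: Dict[str, str], token: str) -> Dict[str, str]:
    """MMS side: request the instance and store the result (step 7)."""
    info = tmd.instantiate(tale, token)
    return {"tale": tale["name"], "status": "running", **info}


tmd = FakeTMD(host="node-1.example.org")
tale = {"name": "demo", "image": "jupyter-base", "folder": "survey-2016"}
instance = create_instance(tmd, tale, token="example-token")
```

The returned record corresponds to what the text calls the Instance object: enough routing information for the web interface to reach, update, or delete the running tale.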
Tales can host any frontend as they are based on a generic Docker container, the only requirements of which are an open port for user access and a mountpoint for the Whole Tale FUSE filesystem. At present, pre-configured Jupyter and RStudio frontends are provided. These types of frontends were prioritized based on user needs and popularity. While users can create their own frontends, we will continue to add base frontends to simplify use.
Whole Tale containers are executed on OpenStack virtual machines (running Container Linux). On each VM we deploy Docker Swarm to manage the execution and scheduling of Docker containers on our resource cluster. We use cloud resources at the National Center for Supercomputing Applications (NCSA), Texas Advanced Computing Center (TACC), and San Diego Supercomputer Center (SDSC). There are well-established security risks of running containers on shared infrastructure. To mitigate these risks we require that users upload source images to be built on demand, and we disable inter-container communication to limit possible interference between containers. In future work we intend to investigate methods for validating and certifying containers.

Tale Representation
Tales are defined by their execution environment, the data used, and metadata related to the tale. The key elements of a tale are as follows:
Environment: captures the active components of a tale. For this purpose we rely on a Docker image and container. The tale maintains a reference to a Git repository (including the hash of a specific commit). This Git repository is used as the working directory to build the Docker image. The environment includes a name, a Git repository URL, a commit ID that references a specific version of the repository, and an optional configuration object that defines specific parameters passed to Docker when running the container.
Data: is represented by a set of Girder objects (folders, items, files). Each object is described with several internal descriptors (e.g., name, size, child-parent relations, uuid, creation/modification time, etc.). Those that are most important to capture in a tale are the source URL and access protocol (e.g., https://website/file1, globus://endpoint/file1), provider (e.g., DataONE, Globus, etc.), unique identifier (e.g., URN in DataONE, or DOI used by publication repository), and in the future an optional checksum. Given that these objects are mapped to a filesystem, each object must also include its size and a POSIX-compatible name. The name includes the full path (with respect to the mount point) as the tale must recreate the entire directory structure.
Metadata: represents information about the tale that is not specifically related to the environment or data. We expect that tale metadata will grow over time based on user needs and tale usage. At present, tales may include metadata that describes the title, authors, description, icon, illustration, category (e.g., tags), publication status, as well as licensing information for all artifacts [Sto09b].
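The three elements can be pictured as a single serializable record. The sketch below is illustrative only: the field names are not a published schema, and every value (repository URL, commit, identifier, license, and so on) is a placeholder.

```python
import json

# All values are placeholders; the field names are illustrative, not a schema.
tale = {
    "environment": {
        "name": "jupyter-base",
        "repository": "https://git.example.org/recipes/jupyter-base.git",
        "commit": "3f2a9c1d7e0b4a55",
        "config": {"port": 8888},     # optional parameters passed to Docker
    },
    "data": [
        {
            "path": "survey/file1",   # POSIX name relative to the mount point
            "size": 120,
            "url": "https://website/file1",
            "protocol": "https",
            "provider": "DataONE",
            "identifier": "urn:uuid:00000000-0000-0000-0000-000000000000",
            "checksum": None,         # optional, planned for the future
        },
    ],
    "metadata": {
        "title": "Example tale",
        "authors": ["A. Researcher"],
        "description": "Placeholder description.",
        "category": ["example"],
        "published": False,
        "license": "CC-BY-4.0",
    },
}


def directories(tale):
    """Directories needed under the mount point to recreate the layout."""
    dirs = set()
    for obj in tale["data"]:
        parts = obj["path"].split("/")[:-1]
        for i in range(1, len(parts) + 1):
            dirs.add("/".join(parts[:i]))
    return sorted(dirs)


serialized = json.dumps(tale, indent=2, sort_keys=True)
restored = json.loads(serialized)
```

Because each data object carries its full path, the directory structure can be rebuilt from the record alone, as `directories` shows.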

Related Work
Scientific reproducibility is becoming an increasingly widespread concern and stakeholders are exploring a range of approaches to address challenges. For example, data repositories now support data analysis [RMLS16,WLLB17], science gateways facilitate the capture of rich provenance information [GDP + 15], and publishers enable verification of figures and computational results from within papers [She14], and through third party offerings [ZMRE11,GM13].
Science gateways allow users to conduct (generally domain-specific) analyses that exploit advanced computing infrastructure. They provide intuitive user interfaces that abstract the complexities of submitting jobs via queue submission systems or instantiating virtual computing environments for executing GUI-based tools [MK10]. Given the gateway's position at the center of all analysis, it is possible to capture the steps performed by users (see e.g., [JWS + 14]). Often these steps are recorded in the form of workflows [GNT10] or in other standard formats [BBC + 12]. Science gateways are generally focused on a specific domain, and on the analysis of data. Unlike the Whole Tale they do not provide a general model for capturing and sharing computational processes on arbitrary datasets and linking these artifacts with publications.
Scientific workflow systems, such as Galaxy [GNT10] and Kepler [LAB + 06], provide the ability for users to create flexible analysis routines comprising various processing steps. They typically provide extensible interfaces via which external data can be imported for analysis. While their goals overlap somewhat with the Whole Tale, there are significant differences between these systems. Workflow systems prescribe a particular format and analysis model and therefore require researchers to modify their computational processes to fit that model. They do not support the range of interactive and user-specific analyses enabled by the Whole Tale.
As data repositories grow in size and usage there is increasing interest in offering co-located analysis capabilities. Often these capabilities are offered as a set of tools that allow users to aggregate datasets and perform simple computations. However, recently several data repositories have added more advanced computing environments for processing managed data. For example, the Wolfram Data Repository [Wol17] provides tight coupling with the Wolfram programming environment for analyzing and visualizing hosted data. The Cloud Kotta secure data enclave [BCGD16a,BCGD16b] provides a co-located analysis framework that supports interactive Jupyter notebooks and a batch submission system for analyzing sensitive data. These systems support only data stored within their respective repositories. They also do not provide a model for sharing analyses in a standard format nor are they capable of capturing complete provenance.
The widespread adoption of interactive programming environments (e.g., Jupyter) has led to countless examples of multi-user, interactive analysis environments. For example, JupyterHub [jup17b] supports multiple Jupyter notebook instances simultaneously via execution of notebook processes on a single server. Tmpnb [tmp17] and Binder [bin17] provide multi-user environments by launching Docker containers for notebook instances. Tmpnb is used to provide temporary notebooks for replicating analyses published in Nature [She14]. These systems provide similar analysis capabilities to the Whole Tale; however, they do not provide standard models for discovering and accessing data, capturing the computational process and the data used, or any form of linkage to publications.
Several platforms have emerged with the aim of hosting or linking digital scholarly objects to publications, including Zenodo [Zen17], RunMyCode [SHP12], ResearchCompendia [SMS15], and SparseLab [Don07]. These platforms typically provide a web-based location for collecting data, code, and other information required for verification of the published claims, with a link to the article or the article itself.
Publishers are providing repository services either as standalone or in support of published claims, in addition to hosting supplementary materials. Springer-Nature for example provides the figshare service [fig17], and Elsevier provides Mendeley [Men17]-both host digital scholarly objects such as data and code and attach unique identifiers such as digital object identifiers (DOIs) to hosted items. Other projects exist to help close similar gaps in a variety of areas. For example, the journal Image Processing Online (http://ipol.im) provides reproducible publications for the image processing community, Code Ocean provides reproducibility functionality for IEEE publications, the Madagascar project extends the reproducibility functionality described by Claerbout and Karrenbach in 1992 (http://www.ahay.org), and the WaveLab project pioneered reproducibility in signal processing (http://statweb.stanford.edu/~wavelab/), just to name a few.
There are other general efforts aimed at aggregating digital resources and capturing the provenance of research artifacts. W3C PROV [GMB + 12] defines a model for representing provenance using a data model that represents the entities (e.g., files), agents (e.g., people), and activities (e.g., computational processes) associated with data processing. Research Objects (RO) [BDRG + 10] provides a model for capturing a single unit of research including, for example, the datasets, analysis scripts, and derived results associated with a paper. RO provides a formal specification for encoding these objects, as well as associated attribution and provenance information. W3C PROV and ProvONE [CVLM + 15], a PROV extension to link prospective and retrospective provenance [MBBL15], have been incorporated into DataONE [CJCV + 16]. We will be adding similar provenance support to Whole Tale in the future.

Summary
The widespread adoption of computational and data-driven science has significantly altered the discovery lifecycle. However, the methods by which scientific results are published have not kept pace with the drastic changes to the underlying processes used for discovery. The Whole Tale aims to redefine the model via which computational and data-driven science is conducted, published, verified, and reproduced. The Whole Tale builds upon a wide range of efforts to support data discovery and ingestion, analysis using flexible frontends, scalable computation in isolated containers, and ultimately publication of verifiable and reproducible processes using these artifacts.
The Whole Tale architecture consists of a set of microservices (e.g., for data access, persistent identifier creation, etc.) and interoperability software that leverages, where possible, existing cyberinfrastructure. The resulting services not only provide value to end users through the Whole Tale web interface but also to application developers through REST APIs. Through a number of Whole Tale working groups, we are actively engaging several science communities to pilot these capabilities and evaluate their use for enabling reproducible science.