Storing Reproducible Results from Computational Experiments using Scientific Python Packages

Computational methods have become a prime branch of modern science. Unfortunately, retractions of papers in high-ranked journals due to erroneous computations as well as a general lack of reproducibility of results have led to a so-called credibility crisis. The answer from the scientific community has been an increased focus on implementing reproducible research in the computational sciences. Researchers and scientists have addressed this increasingly important problem by proposing best practices as well as making available tools for aiding in implementing them. We discuss and give an example of how to implement such best practices using scientific Python packages. Our focus is on how to store the relevant metadata along with the results of a computational experiment. We propose the use of JSON and the HDF5 database and detail a reference implementation in the Magni Python package. Further, we discuss the focuses and purposes of the broad range of available tools for making scientific computations reproducible. We pinpoint the particular use cases that we believe are better solved by storing metadata along with results in the same HDF5 database. Storing metadata along with results is important in implementing reproducible research and it is readily achievable using scientific Python packages.


Introduction
Exactly how did I produce the computational results stored in this file? Most data scientists and researchers have probably asked this question at some point. For one to be able to answer the question, it is of utmost importance to track the provenance of the computational results by making the computational experiment reproducible, i.e. describing the experiment in such detail that it is possible for others to independently repeat it [LMS12], [Hin14]. Unfortunately, retractions of papers in high-ranked journals due to erroneous computations [Mil06] as well as a general lack of reproducibility of computational results [Mer10], with some studies showing that only around 10% of computational results are reproducible [BE12], [RGPN+11], have led to a so-called credibility crisis in the computational sciences.
The answer has been a demand for requiring research to be reproducible [Pen11]. The scientific community has acknowledged that many computational experiments have become so complex that more than a textual presentation in a paper or a technical report is needed to fully detail them. Enough information to make the experiment reproducible must be included with the textual presentation [RGPN+11], [CG12], [SLP14]. Consequently, reproducibility of computational results has become a requirement for submission to many high-ranked journals [Edi11], [LMS12].
But how does one make computational experiments reproducible? Several communities have proposed best practices, rules, and tools to help in making results reproducible, see e.g. [VKV09], [SNTH13], [SM14], [Dav12], [SLP14]. Still, this is an area of active research with methods and tools constantly evolving and maturing. Thus, the adoption of the reproducible research paradigm in most scientific communities is still ongoing, and will be for some time. However, a clear description of how the reproducible research paradigm fits in with customary workflows in a scientific community may help speed up its adoption. Furthermore, if tools that aid in making results reproducible for such customary workflows are made available, they may act as an additional catalyst.
In the present study, we focus on giving guidelines for integrating the reproducible research paradigm in the typical scientific Python workflow. In particular, we propose an easy-to-use scheme for storing metadata along with results in an HDF5 database. We show that it is possible to use Python to adhere to best practices for making computational experiments reproducible by storing metadata as JSON serialized arrays along with the results in an HDF5 database. A reference implementation of our proposed solution is part of the open source Magni Python package.
The remainder of this paper is organized as follows. We first describe our focus and its relation to a more general data management problem. We then outline the desired workflow for making scientific Python experiments reproducible and briefly review the fitness of existing reproducibility aiding tools for this workflow. This is followed by a description of our proposed scheme for storing metadata along with results. Following this specification, we detail a reference implementation of it and give plenty of examples of its use. The paper ends with a more general discussion of related reproducibility aiding software packages followed by our conclusions.

The Data Management Problem
Reproducibility of computational results may be considered a part of a more general problem of data management in a computational study. In particular, it is closely related to the data management tasks of documenting and describing data. A typical computational study involves testing several combinations of various elements, e.g. input data, hardware platforms, external software libraries, experiment specific code, and model parameter values. Such a study may be illustrated as a layered graph like the one shown in figure 1. Each layer corresponds to one of the elements, e.g. the version of the NumPy library or the set of parameter values. The edges in the graph mark all the combinations that are tested. An example of a combination that constitutes a single simulation or experiment is the set of connected vertices that are highlighted in the graph in figure 1. In the present study, we focus on the problem of documenting and describing such a single simulation. A closely related problem is that of keeping track of all tested combinations, i.e. the set of all paths through all layers in the graph in figure 1. This is definitely also an interesting and important problem. However, once the "single simulation" problem is solved, it should be straightforward to solve the "all combinations" problem by appropriately combining the information from all the single simulations.
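The layered-graph view can be made concrete with a small sketch. The layer names and values below are purely illustrative assumptions, and the sketch assumes that every vertex is connected to every vertex in the next layer, which need not hold for the graph in figure 1, where only the tested combinations are joined by edges.

```python
from itertools import product

# Hypothetical layers of a computational study; all names and values are
# illustrative and not taken from the paper.
layers = {
    "input_data": ["dataset_a", "dataset_b"],
    "platform": ["laptop", "cluster"],
    "numpy_version": ["1.9", "1.10"],
    "script": ["my_experiment.py"],
    "parameters": [{"iterations": 100}, {"iterations": 500}],
}

# A single simulation picks exactly one vertex from each layer.
single_simulation = {name: values[0] for name, values in layers.items()}

# Assuming fully connected layers, the set of all paths through the graph
# is the Cartesian product of the layers.
all_combinations = [
    dict(zip(layers, combination))
    for combination in product(*layers.values())
]

print(len(all_combinations))  # 2 * 2 * 2 * 1 * 2 = 16
```

Documenting one element of all_combinations is the "single simulation" problem; the list as a whole corresponds to the "all combinations" problem.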

Storing Metadata Along With Results
For our treatment of reproducibility of computational results, we adopt the meaning of reproducibility from [LMS12], [Hin14]. That is, reproducibility of a study is the ability of others to repeat the study and obtain the same results using a general description of the original work. The related term replicability then means the ability of others to repeat the study and obtain the same results using the exact same setup (code, hardware, etc.) as in the original work (see footnote 1). As pointed out in [Hin14], reproducibility generally requires replicability.
The lack of reproducibility of computational results is oftentimes attributed to missing information about critical computational details such as library versions, parameter values, or precise descriptions of the exact code that was run [LMS12], [BPG05], [RGPN+11], [Mer10]. Several studies have given best practices for how to detail such metadata to make computational results reproducible, see e.g. [VKV09], [SNTH13], [SM14], [Dav12].
Here we detail the desired workflow for storing such metadata along with results when using a typical scientific Python workflow in the computational experiments.That is, we detail how to document a single experiment as illustrated by the highlighted vertices in figure 1.

The Scientific Python Workflow
In a typical scientific Python workflow, we define an experiment in a Python script and run that script using the Python interpreter, e.g. by invoking python my_experiment.py from the command line.
This is a particularly generic setup that only requires the availability of the Python interpreter and the libraries imported in the script. We argue that for the best practices for detailing a computational study to see broad adoption by the scientific Python community, three elements are of critical importance. Any method or tool for storing the necessary metadata to make the results reproducible must
1. be very easy to use and integrate well with existing scientific Python workflows.
2. be of high quality to be as trustworthy as the other tools in the scientific Python stack.
3. store the metadata in an open format that is easily inspected using standard viewers as well as programmatically from Python.
These elements are some of the essentials that have made Python so popular in the scientific community (see footnote 2). Thus, for storing the necessary metadata, we seek a high quality solution which integrates well with the workflow exemplified above. Furthermore, the metadata must be stored in such a way that it is easy to extract and inspect when needed.

Existing Tools
Several tools for keeping track of provenance and aiding in adhering to best practices for reproducible research already exist, e.g. Sumatra [Dav12], ActivePapers [Hin15], or Madagascar [Fom15]. Tools like these generally function as reproducibility frameworks. That is, when used with Python, they wrap the standard Python interpreter with a framework that, in addition to running a Python script (using the standard Python interpreter), also captures and stores metadata detailing the setup used to run the experiment. E.g. when using Sumatra, one would replace python my_experiment.py with [Dav12]
$ smt run -e python -m my_experiment.py
1. Some authors (e.g. [SLP14]) swap the meaning of reproducibility and replicability compared to the convention we have adopted.
2. See http://cyrille.rossant.net/why-using-python-for-scientificcomputing/ for an overview of the main arguments for using Python for scientific computing.
This idea of wrapping a computational simulation is different from the usual scientific Python workflow, which consists of running a Python script that imports other packages and modules as needed, e.g. importing NumPy for numerical computations. This difference is illustrated in figure 2.
We argue that an importable Python library for aiding in making results reproducible has several advantages compared to using a full blown reproducibility framework. A major element in using any tool for computational experiments is being able to trust that the tool does what it is expected to do. The scientific community trusts Python and the SciPy stack. For a reproducibility framework to be adopted by the community, it must build trust as the wrapper of the Python interpreter that it effectively is. That is, one must trust that it handles experiment details such as input parameters, library paths, etc. just as accurately as the Python interpreter would have done. Furthermore, such a framework must be able to fully replace the Python interpreter in all existing workflows which use the Python interpreter. A traditional imported Python library does not have these potentially staggering challenges to overcome in order to see wide adoption. It must only build trust among its users in the same way as any other scientific library. Furthermore, it would be easy to incorporate into any existing workflow. Thus, ideally we seek a solution that allows us to update our my_experiment.py to have a structure in which, under the usual if __name__ == '__main__': guard, a call to reproducibility_library.store_metadata(...) is followed by a call to run_my_experiment(...).
Interestingly, the authors of the Sumatra package have to some degree pursued this idea by offering an API for importing the library as an alternative to using the smt run command line tool.
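A minimal sketch of a script with the structure described above, using only the standard library. The helper collect_metadata stands in for the hypothetical reproducibility_library.store_metadata; neither that name nor the metadata fields chosen here come from an existing package, and a real library would also persist the metadata rather than merely print it.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def collect_metadata(parameters):
    """Gather a few details about the setup (hypothetical helper)."""
    return {
        "parameters": parameters,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "command_line": list(sys.argv),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def run_my_experiment(iterations):
    """Stand-in for the actual computation: a sum of squares."""
    return sum(k * k for k in range(iterations))


def main():
    parameters = {"iterations": 10}
    # Capture the metadata first, then run the experiment itself.
    metadata = collect_metadata(parameters)
    result = run_my_experiment(**parameters)
    return metadata, result


if __name__ == "__main__":
    metadata, result = main()
    print(json.dumps(metadata, indent=2))
    print(result)
```

The script is still run as python my_experiment.py; the only change to the workflow is one extra import and one extra call.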
Equally important to how the results are obtained is how they are inspected afterwards. Thus, one may ask: How are the results and the metadata stored, and how may they be accessed later on? For example, Sumatra by default stores all metadata in a SQLite database [Dav12] separate from simulation results (which may be stored in any format), whereas ActivePapers stores the metadata along with the results in an HDF5 database [Hin15]. The idea of storing (or "caching") intermediate results and metadata along with the final results has also been pursued in another study [PE09].
We argue that this idea of storing metadata along with results is an excellent solution. Having everything compiled into one standardized and open file format helps keep track of all the individual elements and makes it easy to share the full computational experiment including results and metadata. Preferably, such a file format should be easy to inspect using a standard viewer on any platform, just like the Portable Document Format (PDF) has made it easy to share and inspect textual works across platforms. The HDF5 Hierarchical Data Format [FP10] is a great candidate for such a file format due to the availability of cross-platform viewers like HDFView and HDFCompass as well as its capabilities in terms of storing large datasets. Furthermore, HDF5 is recognized in the scientific Python community with bindings available through e.g. PyTables, h5py, or Pandas [McK10]. Also, bindings for HDF5 exist in several other major programming languages.
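The single-file idea can be sketched with h5py, one of the bindings named above (NumPy and h5py are assumed to be installed). The group and dataset names, the metadata fields, and the choice of JSON text for the metadata are illustrative assumptions and do not reflect the layout used by any particular tool.

```python
import json
import os
import tempfile

import h5py
import numpy as np

# Stand-ins for real results and metadata.
results = np.arange(12, dtype=np.float64).reshape(3, 4)
metadata = {"parameters": {"iterations": 100}, "numpy_version": np.__version__}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "experiment.hdf5")

    # Write the results and the serialized metadata into the same file.
    with h5py.File(path, "w") as h5file:
        h5file.create_dataset("results/data", data=results)
        h5file.create_dataset(
            "metadata/setup", data=json.dumps(metadata, indent=2)
        )

    # Read both back. Depending on the h5py version, the stored string
    # is returned as bytes or as str, so both cases are handled.
    with h5py.File(path, "r") as h5file:
        stored_results = h5file["results/data"][...]
        raw = h5file["metadata/setup"][()]
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        stored_metadata = json.loads(text)

print(stored_metadata["parameters"])
```

Because the metadata is plain text inside the file, it travels with the results when the file is shared and can be read in a generic HDF5 viewer.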

Suggested Library Design
Our above analysis reveals that all elements needed for implementing the reproducible research paradigm in scientific Python are in fact already available in existing reproducibility aiding tools: Sumatra may serve as a Python importable library and the ActivePapers project shows how metadata may be stored along with results in an HDF5 database. However, no single tool offers all of these elements for the scientific Python workflow. Consequently, we propose creating a scientific Python package that may be imported in existing scientific Python scripts and may be used to store all relevant metadata for a computational experiment along with the results of that experiment in an HDF5 database.
Technically, there are various ways to store metadata along with results in an HDF5 database. Probably the most obvious way is to store the metadata as attributes to HDF5 tables and arrays containing the results. However, this approach is only recommended for small metadata (generally < 64 KB). For larger metadata it is recommended to use a separate HDF5 array or table for storing the metadata. Thus, for the highest flexibility, we propose to store the metadata as separate HDF5 arrays. This also allows for separation of specific result arrays or tables and general metadata. When using separate metadata arrays, a serialization (a representation) of the metadata must be chosen. For the metadata to be human-readable using common HDF viewers, it must be stored in an easily readable string representation. We suggest using JSON [ECM13] for serializing the metadata. This makes for a human-readable representation. Furthermore, JSON is a standard format with bindings for most major programming languages. In particular, Python bindings are part of the standard library (introduced in Python 2.6). This would effectively make Python >=2.6 and an HDF5 Python interface the only dependencies of our proposed reproducibility aiding library. We note, though, that the choice of JSON is not crucial. Other formats similar to JSON (e.g. XML or YAML) may be used as well. We do argue, though, that a human-readable format should be used such that the metadata may be inspected using any standard HDF5 viewer.
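The serialization step and the size consideration can be sketched with the standard library alone. The function names, the example metadata, and the use of the 64 KB figure as a hard cut-off are assumptions made for illustration; the actual HDF5 write is left out.

```python
import json

# Rough threshold (in bytes) below which attributes are recommended.
ATTRIBUTE_LIMIT = 64 * 1024


def serialize_metadata(metadata):
    """Return a human-readable JSON string for a metadata dictionary."""
    return json.dumps(metadata, indent=2, sort_keys=True)


def suggested_storage(serialized):
    """Suggest 'attribute' for small metadata and 'array' otherwise."""
    size = len(serialized.encode("utf-8"))
    return "attribute" if size < ATTRIBUTE_LIMIT else "array"


small = {"iterations": 100, "seed": 6021}
large = {"source_code": "x = 1\n" * 20000}  # e.g. a full copy of a script

small_text = serialize_metadata(small)
large_text = serialize_metadata(large)

print(suggested_storage(small_text))
print(suggested_storage(large_text))
```

Always using separate arrays, as proposed above, avoids having to make this decision at all, at the cost of a slightly more verbose file layout.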

Magni Reference Implementation
A reference implementation of the above suggested library design is available in the open source Magni Python package [OPA+14]. In particular, the subpackage magni.reproducibility is based on this suggested design. Figure 3 gives an overview of the magni.reproducibility subpackage.
In magni.reproducibility, a differentiation is made between annotations and chases. Annotations are metadata that describe the setup used for the computation, e.g. the computational environment, values of input parameters, platform (hardware/OS) details, and when the computation was done. Chases, on the other hand, are metadata describing the specific code that was used in the computation and how it was called, i.e. they chase the provenance of the results.
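The distinction between annotations and chases can be illustrated with a standard-library sketch. The function names and dictionary keys are hypothetical and are not the API of magni.reproducibility; the sketch only shows what kind of information falls in each category.

```python
import inspect
import platform
from datetime import datetime, timezone


def collect_annotations(parameters):
    """Metadata describing the setup used for the computation."""
    return {
        "parameters": parameters,
        "platform": {
            "system": platform.system(),
            "machine": platform.machine(),
            "python": platform.python_version(),
        },
        "datetime": datetime.now(timezone.utc).isoformat(),
    }


def collect_chases():
    """Metadata describing which code ran and how it was called."""
    # context=0 skips the source-line lookup; [1:] drops this helper's frame.
    frames = inspect.stack(0)[1:]
    return {
        "call_stack": [
            {
                "function": frame.function,
                "file": frame.filename,
                "line": frame.lineno,
            }
            for frame in frames
        ]
    }


def run_experiment(iterations):
    chases = collect_chases()
    annotations = collect_annotations({"iterations": iterations})
    return annotations, chases


annotations, chases = run_experiment(iterations=50)
print(chases["call_stack"][0]["function"])
```

Both dictionaries contain only strings, numbers, lists, and dictionaries, so they can be serialized to JSON directly.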

Requirements
Magni uses PyTables as its interface to HDF5 databases. Thus, had magni.reproducibility been a package of its own, only Python and PyTables would have been requirements for its use. The full magni package (as of version 1.5.0) has additional requirements.
We now give several smaller examples of how to use magni.reproducibility to implement the best practices for reproducibility of computational results described in [VKV09], [SNTH13], [SM14]. An extensive example of the usage of magni.reproducibility is available at doi:10.5278/VBN/MISC/MagniRE. This extensive example is based on a Python script used to simulate the Mandelbrot set using the scientific Python workflow described above. An example of a resulting HDF5 database containing both the Mandelbrot simulation result and metadata is also included. Finally, the example includes a Jupyter Notebook showing how to read the metadata using magni.reproducibility.

Quality Assurance
The Magni Python package is fully documented and comes with an extensive test suite. It has been developed using best practices for developing scientific software [WAB+14] and all code has been reviewed by at least one other person than its author prior to its inclusion in Magni. All code adheres to the PEP8 style guide (see footnote 19) and no function or class has a cyclomatic complexity [McC76], [WM96] exceeding 10. The source code is under version control using Git and a continuous integration system based on Travis CI (see footnote 20) is in use for the git repository. More details about the quality assurance of magni are given in [OPA+14].

Related Software Packages
Independently of the tool or method used, making results from scientific computations reproducible is not only for the benefit of the audience. As pointed out in several studies [Fom15], [CG12], [VKV09], the author of the results gains at least as much in terms of increased productivity. Thus, using some method or tool to help make the results reproducible is a win for everyone. In the present work we have attempted to detail the ideal solution for how to do this for the typical scientific Python workflow.
19. See https://www.python.org/dev/peps/pep-0008/
20. See https://travis-ci.org/
A plethora of related alternative tools exist for aiding in making results reproducible. We have already discussed ActivePapers [Hin15], Sumatra [Dav12], and Madagascar [Fom15], which are general reproducibility frameworks that allow for wrapping most tools, not only Python based computations. Such tools are definitely excellent for some workflows. In particular, they seem fit for large fixed setups which require keeping track of several hundred runs that only differ by the selection of parameters and for which the time cost of initially setting up the tool is insignificant compared to the time cost of the entire study. That is, they are useful in keeping track of the full set of combinations in a large computational study as marked by all the edges in the layered graph in figure 1. However, as we have argued, they are less suitable for documenting a single experiment based on the typical scientific Python workflow. Also, these tools tend to be designed for use on a single computer. Thus, they do not scale well for big data applications which run on compute clusters.
Another category of related tools is graphical user interface (GUI) based workflow managing tools like Taverna [OAF+04] or VisTrails [SFC07]. Such tools seem to be specifically designed for describing computational workflows in particular fields of research (typically bioinformatics related fields). It is hard, though, to see how they can be effectively integrated with the typical scientific Python workflow. Other much more Python oriented tools are the Jupyter Notebook and Dexy. These tools, however, seem to focus more on implementing the concept of literate programming and documentation than on reproducibility of results in general.

Conclusions
We have argued that metadata should be stored along with computational results in an easily readable format in order to make the results reproducible. When implementing this in a typical scientific Python workflow, all necessary tools for making the results reproducible should be available as an importable package. We suggest storing the metadata as JSON serialized arrays along with the results in an HDF5 database. A reference implementation of this design is available in the open source Magni Python package, which we have detailed with several examples of its use. All of this shows that storing metadata along with results is important in implementing reproducible research and it is readily achievable using scientific Python packages.

Fig. 1: Illustration of a typical data management description problem as a layered graph. In this exemplified experiment, several combinations of input data, hardware platforms, software libraries (e.g. NumPy), algorithmic/experimental setup (described in a Python script), and parameter values are tested. The challenging task is to keep track of both the full set of combinations tested (marked by all the edges in the graph) as well as the individual simulations (e.g. the combination of highlighted vertices).

Fig. 2: Illustration of the difference between a full reproducibility framework (on the left) and an importable Python library (on the right). The reproducibility framework calls the metadata collector as well as the Python interpreter, which in turn runs the Python simulation script which e.g. imports NumPy. When using an importable library, the metadata collector is imported in the Python script alongside e.g. NumPy.

if __name__ == '__main__':
    reproducibility_library.store_metadata(...)
    run_my_experiment(...)

Fig. 3: Illustration of the structure of the magni.reproducibility subpackage of Magni. The main modules are the data module for acquiring metadata and the io module for interfacing with an HDF5 database when storing as well as reading the metadata. A subset of available functions is listed next to the modules.