The MPO API: A tool for recording scientific workflows

https://doi.org/10.1016/j.fusengdes.2014.02.011

Highlights

  • A description of a new framework and tool for recording scientific workflows, especially those resulting from simulation and analysis.

  • An explanation of the underlying technologies used to implement this web based tool.

  • Several examples of using the tool.

Abstract

Data from large-scale experiments and extreme-scale computing is expensive to produce and may be used for high-consequence applications. The Metadata, Provenance and Ontology (MPO) project builds on previous work [M. Greenwald, Fusion Eng. Des. 87 (2012) 2205–2208] and is focused on providing documentation of workflows, data provenance and the ability to data-mine large sets of results. While there are important design and development aspects to the data structures and user interfaces, we concern ourselves in this paper with the application programming interface (API) – the set of functions that interface with the data server.

Our approach for the data server is to follow the Representational State Transfer (RESTful) software architecture style for client–server communication. At its core, the API uses the POST and GET methods of the HTTP protocol to transfer workflow information to and from the database via a web server: the information is carried in message bodies and the target resource is specified in the URL. Higher level API calls are built upon this core API. This design facilitates implementation on different platforms and in different languages and is robust to changes in the underlying technologies used. The command line client implementation can communicate with the data server from any machine with HTTP access.

Introduction

The Metadata, Provenance and Ontology (MPO) project [1] provides an unobtrusive interface for recording important metadata associated with scientific workflows and builds on ideas previously presented [2]. The project goal is not to create or replay the workflows themselves but to capture the structure of the workflow, annotations and any other relevant metadata in a centralized database. It is unobtrusive in that the degree of recording done is up to the user. For both design and implementation reasons, a Representational State Transfer (RESTful) software architecture style [3], [4] is used for the interface to the data server. The main aspect of this style is that the data server maintains no ‘memory’ or command histories of intermediate data states, and this drives a data-centric design rather than a procedural one. Identified resources are accessed through the API using HTTP verbs and data-oriented URLs or routes. This property facilitates a clean separation between client and server, so changes may be made to the RESTful API more easily without breaking existing clients. Because transactions are atomic and without side effects, the API is robust to interruptions in workflow recording.
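As a minimal sketch of this stateless, resource-oriented style, the Python fragment below builds (but does not send) one POST and one GET request using only the standard library. The server address, the `workflow` route and the payload field names are placeholders chosen for illustration; they are assumptions and not the actual MPO routes.

```python
import json
import urllib.request

# Hypothetical server address; the real MPO host and routes are not shown here.
BASE_URL = "https://mpo.example.org/api"


def build_post(route, payload):
    """Build a POST request carrying a JSON-encoded body for a resource route."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/{route}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_get(route, uid):
    """Build a GET request for a single resource identified by its UID."""
    return urllib.request.Request(f"{BASE_URL}/{route}/{uid}", method="GET")


# Each request is self-contained: everything the server needs is in the URL
# and the message body, so no session state has to be kept between calls.
post_req = build_post("workflow", {"name": "example", "description": "demo"})
get_req = build_get("workflow", "1234")

print(post_req.get_method(), post_req.full_url)
print(get_req.get_method(), get_req.full_url)
```

Because each request stands alone, a client that is interrupted between two calls can simply resume with the next request; nothing on the server has to be rolled back or replayed.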

A workflow is defined by the activities it contains and the data objects consumed or produced by those activities. These, combined with the direction of the workflow, result in a directed acyclic graph (DAG) [5] representation of the workflow structure that is recorded through the API methods and stored in the database. An example DAG is shown in Fig. 1, in which arrows connect parent to child. Such a graph may be presented by a web based client. Atomic HTTP POSTs are used to construct the graph. The change in state of the workflow is represented in a JSON (JavaScript Object Notation) [6] encoded data structure that is passed to and from the data server using the HTTP methods POST and GET. These messages are sent to routes (URLs) on the server that correspond to the resources of concern. We describe these URLs in Section 2.1.
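The graph structure described above can be sketched in a few lines of Python. This is an illustrative model only, not the MPO schema: the node identifiers, the `kind` and `name` fields and the helper `topological_order` are all hypothetical. The sketch stores parent-to-child edges, checks that the graph is acyclic, and encodes each node as a separate JSON message of the sort that could be sent in an individual POST.

```python
import json
from collections import deque

# Hypothetical node records: "kind" distinguishes the workflow head,
# activities and data objects; the field names are illustrative only.
nodes = {
    "wf": {"kind": "workflow", "name": "example run"},
    "in": {"kind": "dataobject", "name": "input file"},
    "sim": {"kind": "activity", "name": "simulation"},
    "out": {"kind": "dataobject", "name": "output file"},
}

# Directed edges point from parent to child.
edges = [("wf", "in"), ("in", "sim"), ("sim", "out")]


def topological_order(nodes, edges):
    """Return the nodes in parent-before-child order; raise if a cycle exists."""
    indegree = {uid: 0 for uid in nodes}
    children = {uid: [] for uid in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    ready = deque(uid for uid, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        uid = ready.popleft()
        order.append(uid)
        for child in children[uid]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


order = topological_order(nodes, edges)

# One JSON message per node, as might be sent in separate POSTs.
messages = [json.dumps({"uid": uid, **nodes[uid]}) for uid in order]
print(order)
```

Emitting nodes in parent-before-child order means every message refers only to nodes that already exist, which fits a recording model built from small, independent POSTs.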

Users would typically use an intermediary client rather than composing RESTful queries directly. To best illustrate the uses of the API we will discuss three examples in Section 3: archiving gyrokinetic simulation results while capturing select metadata for future searching and indexing, tracking the status of a full-wave RF simulation by instrumenting a parallel batch script, and using an IDL client implementation to instrument a fusion analysis.

The MPO API

A workflow in MPO is represented as a directed acyclic graph with data objects (ellipses) and activities (rectangles) as the nodes. The head node (diamond) of the graph is the workflow label (cf. [1] for examples). Nodes in the graph have default metadata attached to them, such as creation time and creator, as well as user-specified metadata in the form of key-value pairs. Each of these items is a resource to be manipulated. In Fig. 1, we show a workflow graph visualization of the data structure

Applications of the client method interface

The following examples use the command line client and follow a common format. The command line client is pointed to by the $MPO shell variable. A new workflow is created with the ‘init’ method and the resulting UID of the workflow is captured in a local shell variable for use in subsequent methods that add onto its structure. The first activity or data object attached to the workflow references the workflow UID twice: the first argument is the workflow to be added to and the second is

Future work and conclusions

We have developed a RESTful API for accessing the Metadata, Provenance and Ontology (MPO) data server and high level method interfaces for the MPO project. These provide a robust and platform independent framework for clients to communicate with the MPO data server. Because the communication in the RESTful approach uses the HTTP protocol, there is near universal support for implementation in other languages. We have presented examples in Python and IDL; Matlab/Octave support is also planned. In

Acknowledgments

This work was supported by the US DOE, Office of Advanced Scientific Computing Research and the Office of Fusion Energy Sciences under DE-SC0008697, DE-AC02-05CH11231, and DE-SC0008736.
