YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts

Scientific workflow management systems offer features for composing complex computational pipelines from modular building blocks, for executing the resulting automated workflows, and for recording the provenance of data products resulting from workflow runs. Despite the advantages such features provide, many automated workflows continue to be implemented and executed outside of scientific workflow systems due to the convenience and familiarity of scripting languages (such as Perl, Python, R, and MATLAB), and to the high productivity many scientists experience when using these languages. YesWorkflow is a set of software tools that aim to provide such users of scripting languages with many of the benefits of scientific workflow systems. YesWorkflow requires neither the use of a workflow engine nor the overhead of adapting code to run effectively in such a system. Instead, YesWorkflow enables scientists to annotate existing scripts with special comments that reveal the computational modules and dataflows otherwise implicit in these scripts. YesWorkflow tools extract and analyze these comments, represent the scripts in terms of entities based on the typical scientific workflow model, and provide graphical renderings of this workflow-like view of the scripts. Future versions of YesWorkflow also will allow the prospective provenance of the data products of these scripts to be queried in ways similar to those available to users of scientific workflow systems.


Introduction
Many scientists use scripts (written, e.g., in Python, R, or MATLAB) or scientific workflow environments for data processing, analysis, model simulation, result visualization, and other scientific computing tasks. In addition to the widespread use in the natural sciences, computational automation tools are also increasingly used in other domains, e.g., for data mining workflows in the digital humanities [VZ12], or to implement data curation workflows for natural history collections [DCM+12]. One advantage of using scientific workflow systems (e.g., Galaxy [GNT10], Kepler [LAB+06], Taverna [OAF+04], VisTrails [BCC+05], RestFlow [MM10, TMnG+13]) is that they often include capabilities to track data as it is being processed. By capturing and subsequently sharing such provenance information, scientists can provide a detailed account of how their results were derived from the given inputs via intermediate results, workflow steps, and parameter settings, thereby facilitating transparency and reproducibility of workflow products. In addition to this external use, provenance information can also be used internally, e.g., to allow scientists to trace sources of errors and to debug their workflows.
The data provenance captured by workflow environments is sometimes called retrospective provenance to distinguish it from another form called prospective provenance [CFV+08, LLCF10]. The former consists of data dependencies and lineage information recorded at runtime, which can then be used later for retrospective exploration and analysis (a.k.a. "querying provenance" [DF08]). In contrast, prospective provenance is a description of the computational process itself, i.e., the workflow specification is considered a form of provenance information, describing the method by which analysis results and other data products are obtained. Scientific workflow systems therefore naturally support both forms of provenance, i.e., prospective provenance by visually presenting a workflow as a directed graph with data and process steps, and retrospective provenance by capturing and subsequently exporting runtime provenance.
Despite these and other advanced features of workflow systems, a vast number of computational "workflows" continue to be developed using general purpose or specialized scripting languages such as Python, R, and MATLAB. This is true in particular for the "long tail of science" [WRB13, Hei08], where advanced features such as provenance support are rarely available. For example, provenance libraries for R have only recently been announced [LB14], while for Python, a new tool called noWorkflow has just been developed [MBC+14]. The noWorkflow (not only workflow) system uses Python runtime profiling functions to generate provenance traces that reflect the processing history of the script. Thus, noWorkflow allows users to continue working in their familiar Python scripting environment, without adopting a new system, while retaining the advantage of automatic capture of retrospective provenance information similar to that available in workflow systems.
In the following, we describe a new tool called YesWorkflow that complements noWorkflow by revealing prospective provenance in scripts, i.e., YesWorkflow makes latent workflow information from scripts explicit. In particular, dataflow dependencies that are often "hidden" inside of a script and not easily understood by outsiders looking at the script are extracted from simple user annotations and can then be exported and visualized in graph form.
The main features of YesWorkflow (or YW for short) are:
• YW exposes prospective provenance (workflow structure and dataflow dependencies) from scripts based on simple user annotations.
• YW annotations are embedded inside of comments, so they are language independent and can be used, e.g., in Python, R, and MATLAB.
• YW annotations and the underlying model are deliberately kept simple to allow scientists a very low entry bar for adoption.
• The YW-toolkit is a grass-roots, agile, open source effort, whose simple and modular architecture and underlying UNIX philosophy facilitate interoperability and extensibility.
• The current YW prototype generates different, easily reusable output formats, including three different graph views, i.e., a process-centric, a data-centric, and a combined view of the extracted workflow graph in Graphviz/DOT form.
We discuss YW limitations and plans for future development in Section 7.

YesWorkflow Model and Annotation Syntax
In order to use the YesWorkflow tools, a script author marks up scripts using a simple keyword-based annotation or tagging mechanism, embedded within the comments of the host language. YW annotations are expressions of the form @tag value. Here, @tag is one of the recognized YW keywords, after which a value follows, separated by one or more whitespace characters. Thus, the YW annotation syntax mimics the syntax of conventional documentation generators such as Javadoc and Doxygen.
The YW tool then interprets the embedded, structured comments and builds a simple workflow model of the script. This model represents scripts in terms of scientific workflow entities, i.e., programs, workflows, ports, and channels:
• A program block (short: program or block) represents a computational step in the script that receives input data and produces (intermediate or final) output data. A program is designated in a script by bracketing the relevant code between a pair of @begin and @end comments. Program blocks are usually visualized as boxes. A block that contains other programs is considered a workflow.
• A port represents a way in which data flows into or out of a program or workflow. Ports are identified by @in and @out annotations in the source code comments.
• A channel is a connection between an @out port of a program and an @in port of another (or, in case of feedback loops, the same) program. YW infers channels by matching the names of @in and @out ports within the same workflow.
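The model above can be illustrated with a minimal sketch in Python. The toy script, the helper names (extract_annotations, build_model), and the regular expression are assumptions made for illustration only; they are not the YW-Extract or YW-Model implementation, which may behave differently. The sketch assumes "#"-style comments and, for brevity, matches port names across all blocks rather than only within the same enclosing workflow.

```python
import re

# A toy script marked up with YW comments (hypothetical example).
SCRIPT = '''
# @begin clean_data
# @in raw_table
# @out clean_table
clean_table = [row for row in raw_table if row is not None]
# @end clean_data

# @begin summarize
# @in clean_table
# @out summary
summary = len(clean_table)
# @end summarize
'''

# "@tag value": a keyword followed by whitespace and a value.
ANNOTATION = re.compile(r"@(begin|end|in|out)\s+(\S+)")

def extract_annotations(source):
    """Return (tag, value) pairs found in '#' comments, in source order."""
    found = []
    for line in source.splitlines():
        _code, _hash, comment = line.partition("#")
        found.extend(m.groups() for m in ANNOTATION.finditer(comment))
    return found

def build_model(annotations):
    """Group ports by enclosing block and infer channels by port name."""
    blocks = {}   # block name -> {"in": [...], "out": [...]}
    stack = []    # currently open blocks, innermost last
    for tag, value in annotations:
        if tag == "begin":
            blocks[value] = {"in": [], "out": []}
            stack.append(value)
        elif tag == "end":
            if stack:
                stack.pop()
        elif stack:
            blocks[stack[-1]][tag].append(value)
    # A channel links an @out port to every @in port with the same name.
    channels = [
        (src, name, dst)
        for src, src_ports in blocks.items()
        for name in src_ports["out"]
        for dst, dst_ports in blocks.items()
        if name in dst_ports["in"]
    ]
    return blocks, channels

blocks, channels = build_model(extract_annotations(SCRIPT))
print(channels)
```

Because the annotations live entirely in comments, the same extraction step would apply to an R or MATLAB script once the comment delimiter is adjusted.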
Alternative Workflow Views. The process-oriented view in Figure 1 is the default YW view shown to the user, as it emphasizes the overall block structure, given by the script author using @begin and @end markers. However, the extracted YW model can also be rendered in other forms. For example, Figure 2 depicts a data-oriented view, where data elements (i.e., dataflow channels obtained from @in and @out tags) are shown as nodes, while programs are only mentioned in edge labels. Finally, Figure 3 shows a combined workflow view, i.e., in which both programs and data channels are represented as nodes.
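The difference between the three views can be sketched as three small Graphviz/DOT generators over the same model. The dictionary layout of BLOCKS and the function names are hypothetical, and the emitted DOT is a simplified approximation of what a graph module might produce, not the actual YW-Graph output.

```python
# Hypothetical model: block name -> names of its @in and @out ports.
BLOCKS = {
    "clean_data": {"in": ["raw_table"], "out": ["clean_table"]},
    "summarize": {"in": ["clean_table"], "out": ["summary"]},
}

def process_view(blocks):
    """Programs are nodes; each inferred channel is an edge labeled with its data name."""
    lines = ["digraph process_view {", "  rankdir=LR;", "  node [shape=box];"]
    for src, src_ports in blocks.items():
        for name in src_ports["out"]:
            for dst, dst_ports in blocks.items():
                if name in dst_ports["in"]:
                    lines.append(f'  "{src}" -> "{dst}" [label="{name}"];')
    lines.append("}")
    return "\n".join(lines)

def data_view(blocks):
    """Data items are nodes; programs appear only as edge labels."""
    lines = ["digraph data_view {", "  rankdir=LR;", "  node [shape=ellipse];"]
    for block, ports in blocks.items():
        for src in ports["in"]:
            for dst in ports["out"]:
                lines.append(f'  "{src}" -> "{dst}" [label="{block}"];')
    lines.append("}")
    return "\n".join(lines)

def combined_view(blocks):
    """Both programs (boxes) and data items (default ellipses) are nodes."""
    lines = ["digraph combined_view {", "  rankdir=LR;"]
    for block, ports in blocks.items():
        lines.append(f'  "{block}" [shape=box];')
        for name in ports["in"]:
            lines.append(f'  "{name}" -> "{block}";')
        for name in ports["out"]:
            lines.append(f'  "{block}" -> "{name}";')
    lines.append("}")
    return "\n".join(lines)

print(process_view(BLOCKS))
```

Each function returns plain DOT text, so the result can be written to a file and rendered with the Graphviz dot tool; changing rankdir to TB would give a top-down layout.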

Querying YesWorkflow Models
The workflow structure of large scripts can be difficult to interpret fully even when represented graphically. While the YW prototype is limited to such graphical views, the YW comments and model are sufficient to support queries that reveal specific aspects of the script in workflow terms. Example workflow-structure queries that will be supported by YesWorkflow include:
• List all of the code blocks defined in the script along with any description given for each.
• List the code blocks nested (directly or indirectly) within a particular code block.
• List the code blocks that invoke a particular function or external program.
• List the code blocks that contain a particular block (directly or indirectly).
• List the code blocks that receive inputs derived (directly or indirectly) from the outputs of a particular upstream code block.
• List the code blocks affected (directly or indirectly) by a particular parameter value provided to the script.
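One of the structure queries listed above, finding the blocks that receive inputs derived from a given upstream block, reduces to graph reachability over the inferred channels. The following is a sketch under stated assumptions: the channel list and the function name downstream_blocks are invented for illustration, and the planned YW-Query module may expose such queries quite differently.

```python
from collections import deque

# Hypothetical channels: (producer block, data name, consumer block).
CHANNELS = [
    ("load", "raw", "clean"),
    ("clean", "tidy", "fit"),
    ("clean", "tidy", "plot"),
    ("fit", "model", "plot"),
]

def downstream_blocks(channels, start):
    """Blocks whose inputs derive, directly or indirectly, from outputs of `start`."""
    successors = {}
    for src, _data, dst in channels:
        successors.setdefault(src, set()).add(dst)
    seen = set()
    queue = deque([start])
    while queue:                      # breadth-first traversal of the channel graph
        block = queue.popleft()
        for nxt in successors.get(block, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(downstream_blocks(CHANNELS, "clean")))
```

The same traversal, started from the blocks that consume a given parameter, would answer the "affected by a particular parameter value" query.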
Prospective Data Provenance Queries. YesWorkflow additionally will allow scripts marked up with YW comments to be queried from a data provenance perspective. Because YesWorkflow analyzes the definition of a workflow (the script plus YW comments) rather than information recorded during a run of the script, YesWorkflow will support queries against prospective provenance. Example prospective provenance queries include:
• Given the name of an output of the script, list the inputs to the script that the output depends on (directly or indirectly).
• List the computational steps (code blocks) involved in deriving a particular output of the script, or of a named intermediate data product.
• For a particular computational step reveal where each input to the step comes from: an input to the script, a constant in the script, a value produced by a different step, etc.
• Reveal the complete derivation of a particular script output. That is, list the sequence of code blocks and input and intermediate data products leading to the output.
Results of queries of this kind optionally may be rendered graphically.
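A prospective provenance query of the first two kinds can be sketched as a backward traversal from a data product to the blocks and script inputs it depends on. The model contents and the function name derivation are hypothetical, and the sketch assumes that each data name is produced by at most one block.

```python
# Hypothetical model: block name -> names of its @in and @out ports.
BLOCKS = {
    "normalize": {"in": ["raw_counts", "method"], "out": ["normalized"]},
    "select": {"in": ["normalized", "threshold"], "out": ["selected"]},
    "report": {"in": ["selected"], "out": ["summary"]},
}

def derivation(blocks, output):
    """Return (script_inputs, steps) that the named data product depends on."""
    # Assumption: every data name has a single producing block.
    producer = {name: block for block, ports in blocks.items() for name in ports["out"]}
    inputs, steps = set(), []
    pending, visited = [output], set()
    while pending:
        data = pending.pop()
        if data in visited:
            continue
        visited.add(data)
        block = producer.get(data)
        if block is None:
            inputs.add(data)          # produced by no block: an input to the script
        else:
            if block not in steps:
                steps.append(block)
            pending.extend(blocks[block]["in"])
    return inputs, steps

inputs, steps = derivation(BLOCKS, "selected")
print(sorted(inputs), steps)
```

Note that the query uses only the annotated structure of the script; no run of the script is needed, which is what distinguishes it from a retrospective provenance query.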
Inference of Retrospective Data Provenance. As described above, YesWorkflow will allow prospective provenance to be inferred from scripts marked up with YW comments. We additionally foresee that combining the information extracted from a marked-up script with references to data files corresponding to a run of that script will in some cases allow the retrospective provenance of those files to be inferred (see also [BML12] and [ZL10]). That is, in cases where the entire sequence of data derivation steps for a particular output can be determined unambiguously from YW annotations, YesWorkflow will support queries of the following kind even in the absence of a run-time data-provenance recorder:
• Given a file output by a run of a script, indicate which files input to the script this output file was derived from (or affected by).
• Given an input file to a script, indicate which output files were derived (or affected) by the data contained in that file.
• Indicate which parameter values applied to a run of the script affected which of its output files.

YesWorkflow Examples
In the following we show YesWorkflow views extracted from real-world scientific use cases. The scripts were annotated with YW tags by scientists and script authors, with only modest training and mark-up effort. Due to lack of space, the actual MATLAB and R scripts with their YW markup are not included here. However, they are all available from the yw-idcc-15 repository on the YW GitHub site [Yes15].

Analysis of Gene Expression Microarray Data
Bioinformatics workflows commonly possess a pattern of large numbers of incoming parameters and outputs at each stage of computation. In addition, analysis of even a single bioinformatics dataset tends to yield a large number of different output files. Hence, bioinformatics pipelines are attractive candidates for workflow systems, which can capture this complexity [Bie12]. Figure 4 shows a YesWorkflow representation of an R script performing a classic, complex bioinformatics task: analysis of Affymetrix gene expression microarray data. This R script was modeled on our previous workflows developed in the Kepler environment [SMLB12]. The script analyzes experiment designs consisting of two conditions (e.g., microarrays from control-treated cells vs. microarrays from drug-treated cells) with multiple replicates in each condition. The R script employs a set of standard BioConductor [GCB+04] packages mixed with custom programming. The workflow consists of four fundamental tasks: normalization of data across microarray datasets (Normalize), selection of differentially expressed genes (DEGs) between conditions (SelectDEGs), determination of gene ontology (GO) statistics for the resulting datasets (GO Analysis), and creation of a heatmap of the differentially expressed genes (MakeHeatmap). Each module produces outputs, and each module (aside from MakeHeatmap) requires external parameter inputs. Importantly, this graphical representation clearly indicates the dependence of each module on datasets and parameter inputs. This example demonstrates that YesWorkflow can provide informative visualizations of bioinformatics workflows, especially workflows involving large numbers of inputs and outputs.

Terrestrial Biospheric Modeling
In the Multi-scale Synthesis and Terrestrial Model Intercomparison Project (MsTMIP), climate scientists primarily use MATLAB scripts to standardize terrestrial biosphere model output across multiple models and simulation runs for intercomparison purposes and to facilitate diagnosis and attribution. MsTMIP is a large, collaborative effort, aimed at harmonizing a number of complex terrestrial biospheric models for the purposes of comparing these model outputs [HSM+13]. There is a strong need to standardize many aspects of the MsTMIP process, to assure greater uniformity in the treatment of the codes and outputs of the disparate models in the intercomparison analyses. Current practice in MsTMIP, however, is representative of many scientific investigations, i.e., researchers develop their codes with a specific focus on functionality and efficiency. Comments are added primarily as "bookmarks" to assist with accessing appropriate code areas for debugging, optimization, or discussion. In the more general case, depending on whether the codes are developed in a collaborative context, structured in-code documentation may be recommended or required by the project. Nevertheless, the mechanisms for these "code annotations" are typically unformalized and unstructured, and rely primarily on the ability to insert non-executable "comment" statements in the code.
As the complexity of code grows, and the numbers of variants and alternative approaches increase, MsTMIP researchers need a clear and consistent way to document, review, and share their model intercomparison scripts. This provides a compelling use case for YW, in that MsTMIP brings together models from a number of independent efforts that require harmonization into a single framework for evaluating their relative capabilities to predict critical earth system features, such as global Net Ecosystem Exchange (NEE) data from terrestrial biogeographic realms.

Paleoclimate Reconstruction
As another working example from a different field, we have used the YesWorkflow markup syntax to analyze the paleoclimate reconstruction workflow presented by Bocinsky and Kohler [BK14]. Their reconstruction method takes as input a spatial interpolation of contemporary weather data, the long-term record of climate held in regional tree-ring chronologies, and a handful of parameters, and uses a novel regression-based analysis method to generate spatial reconstructions of climate extending 2000 years or more back in time. Figure 6 shows that the YW system nicely exposes the prospective provenance hidden in the underlying R script, even for scripts whose workflow views are highly non-linear.

YW Architecture
The YesWorkflow software distribution is envisioned as a set of standard modules that can be used together or independently. The primary goal of this modularity is to enable YW users and developers to implement, independently, alternatives to any module as needed to solve problems particular to their research domain. It will be possible to develop these alternative implementations and extensions in any programming language. One way we plan to facilitate such easy replacement of YW modules is to require that each standard module optionally read and write files, in well-defined formats, that represent the expected inputs or outputs of that module. Any program that produces or consumes these file formats can then function as an alternative to one or more standard YW modules and can provide identical, overlapping, or completely different capabilities (e.g., the current YW prototype is primarily implemented in Java, but also contains some alternative YW modules implemented in Python).
Five standard modules (implemented in Java) currently are implemented or planned. The YW-Extract module identifies YW comments in a script and produces a language-independent representation of the script and the YW annotations. YW-Model interprets the comments identified by YW-Extract and builds a model of the script in terms of entities analogous to the components of a traditional scientific workflow as described in Section 2, while YW-Graph operates on the outputs of YW-Model to produce the dataflow graphs discussed in that same section. As described in Section 3, the planned YW-Query module will allow users to probe the structure of a complex script without having to inspect a visual representation of it. An envisioned YW-Validate module will ensure that YW comments in a script are consistent both with the other YW comments in the script and with the script itself. Finally, the YW-CLI module enables a user to execute sequences of the standard modules, starting from an input file with format appropriate to the first module in the executed sequence.

Related Work
The YW approach can be seen in the tradition of programming code annotation, which is widely used for facilitating code understanding and for generating documentation (e.g., Doxygen, Epydoc, Javadoc, etc.). YW builds on programming code annotation to provide a higher level of abstraction by revealing the dataflow that underlies the interactions between the different pieces of a script or program.
YW is also related to ideas from literate programming, as available in tools such as Knitr [Xie13] and IPython [PG07]. In literate programming, a script is decomposed into snippets of macros, which are interspersed within documents that are written in natural language to explain the script and eventually analyze the results it generates upon execution. While borrowing ideas from literate programming, YW is primarily targeted at developers who are using pure traditional scripting environments to edit their scripts and programs. YW aims at providing a consistent interpretation and visualization of codes wherever the language provides for insertion of non-executable "comments".
YW can also contribute to the area of reproducible computational research [SLP14], which seeks to provide scientists with sufficient information to understand and eventually validate the results claimed by their peers. For instance, the SOLE system [PMF+12] allows linking articles with science objects, which can be source code, a dataset, or a workflow. SOLE allows the reader (curator) to specify human-readable tags that link the paper with science objects, and it transforms each tag into a URI that points to a representation of the corresponding object. While in SOLE the scientific article is the main object that contains links to other (science) objects, we focus on the scripts produced by the scientists, and aim to facilitate the understanding of their dataflow logic. Gavish and Donoho [GD11] present the notion of a Verifiable Computational Result (VCR), where every result is assigned a unique identifier, and results produced under the exact same conditions have the same identifier to support reproducibility.
Various tools have been proposed to capture the runtime provenance of scripts. Mechanisms that capture provenance at the operating system level [FMS08, GS12, MRHBS06] monitor system calls to track the data dependencies between computational processes. Some tools [BGS08, Dav12, HAW13, MBC+14] have been developed to capture runtime provenance for Python scripts: while Bochner et al. [BGS08] and Davison [Dav12] propose Python libraries and APIs that need to be added to the code to capture the execution steps, ProvenanceCurious [HAW13] and noWorkflow [MBC+14] are transparent and do not require changes to the scripts. Similarly, RDataTracker [LB14] captures provenance from the execution of R scripts, and the approach taken by Tariq et al. [TAG12] supports all programming languages allowed by the LLVM compiler framework. We note that the YW approach is complementary to these tools, since it captures prospective provenance of scripts. We argue that YW, along with runtime provenance approaches, provides a low-effort entry point for scientists who want to reap some of the benefits of scientific workflow systems while still using their familiar scripting environments.

YesWorkflow Development Roadmap
In the following we list some limitations of the current YW prototype and highlight features planned for future releases of the software.
Visualization of Nested Code Blocks. The YW-Extract and YW-Model modules support nesting of code blocks. Any pair of @begin and @end comment lines can enclose code that contains any number of other code blocks delimited with @begin and @end comment lines. The workflow model constructed for a script reflects such nesting, i.e., the top-level workflow corresponding to the script as a whole may contain one or more programs (code blocks), and any of these programs can in turn be a sub-workflow that contains further nested programs and workflows. Future versions of YW-Graph will reveal these nested code blocks and render sub-workflows graphically.
Functions and Function Calls. YW-Extract currently expects nested code blocks to be defined in-line. However, many scripts are structured as functions (or classes) with a top-level script that calls these functions (or methods on objects). These functions can in turn call other functions. Future versions of YesWorkflow will allow function declarations to be marked up with YW comments in a manner similar to that supported by Javadoc and Doxygen. Calls to these functions also will be annotated with YW markup. The result will be that YW-Extract and YW-Model will be able to represent function calls as nested code blocks.
Interactive Graphs. YW-Graph currently produces static graphical views (in the well-known Graphviz/DOT format). An interactive viewer for YW graphical output will make these graphs easier to explore and interpret. In the planned graphical user interface, clicking on a data item in the combined or data views optionally will highlight the (prospective) direct and indirect data dependencies for that data item (the data from which it will be derived when the script is run). Features for expanding and collapsing nested sub-workflows also will facilitate exploration of these graphs.
Live Graph View. Although the primary function of YesWorkflow is to reveal workflow-like structure in existing scripts, YesWorkflow also can be used as a design tool when developing new scripts (or even before a script is written). Future versions of YesWorkflow will better support such applications by providing live-update features to the interactive graph capabilities described above. Given a set of script files, the live-graph feature will monitor these files for changes and update the chosen graphical view automatically. Users of this feature will continue to be able to use their favorite text editor or IDE for developing their scripts.
Distinguished Data and Parameters. The inputs to scripts for processing scientific data often can be viewed either as data (the data to be processed by the scripts) or as parameters (values that control how that data is processed). Planned versions of the YW comment vocabulary will allow data and parameters to be distinguished. YW-Graph optionally will emphasize graph edges, nodes, and labels representing data over those representing parameters.
Validation of Comments. The future YW-Validate module will perform extensive validation of YW comments in light of the actual code in the script. This capability will help guide users adding YW comments to their script. Perhaps more importantly, automatic validation will help prevent initially correct YW comments from becoming stale (i.e., incorrect) when the underlying script is changed or refactored. Validity checks that YW-Validate will perform include:
• Confirm that data names used in @in and @out comments actually appear in the code bracketed by associated @begin and @end comments.
• Confirm that the names of functions referred to in YW comments for function declaration or for function calls match the names of the functions actually declared or called.
• Confirm that continuous data dependency chains exist from each script output all the way back to script inputs (and embedded constants).
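The first of these checks can be approximated with a short sketch that looks for each @in and @out name in the code lines enclosed by its block. This is only an illustration of the idea: the function name unused_port_names and the toy script are hypothetical, the check assumes "#"-style comments, and it uses a plain textual match rather than any real analysis of the host language. The planned YW-Validate module may work differently.

```python
import re

# Toy script in which the second block declares an output name
# (grand_total) that never appears in its code. Hypothetical example.
SCRIPT = '''
# @begin scale
# @in values
# @out scaled
scaled = [v * factor for v in values]
# @end scale

# @begin total
# @in scaled
# @out grand_total
result = sum(scaled)
# @end total
'''

TAG = re.compile(r"@(begin|end|in|out)\s+(\S+)")

def unused_port_names(source):
    """Return (block, tag, name) for ports whose name is absent from the block's code."""
    problems, stack = [], []
    for line in source.splitlines():
        code, _hash, comment = line.partition("#")
        for frame in stack:               # code belongs to every enclosing block
            frame["code"].append(code)
        match = TAG.search(comment)
        if not match:
            continue
        tag, value = match.groups()
        if tag == "begin":
            stack.append({"name": value, "ports": [], "code": []})
        elif tag == "end" and stack:
            frame = stack.pop()
            text = "\n".join(frame["code"])
            for port_tag, port in frame["ports"]:
                if not re.search(rf"\b{re.escape(port)}\b", text):
                    problems.append((frame["name"], port_tag, port))
        elif tag in ("in", "out") and stack:
            stack[-1]["ports"].append((tag, value))
    return problems

print(unused_port_names(SCRIPT))
```

A report of this kind flags annotations that may have become stale after a refactoring, such as an output that was renamed in the code but not in its @out comment.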

Conclusions
YesWorkflow is an agile, grass-roots effort that aims at bringing workflow modeling and analysis features to scientific "workflows" that are defined in script form. Through simple user-annotations in the comments of scripts, dataflow and workflow structure are revealed by the YW toolkit. The user can thus exploit prospective provenance information from scripts, e.g., by visualizing, querying, and analyzing this information.
Our early YW prototype [Yes15] has been used by scientists from different domains to mark up complex, real-world scientific scripts with ease. Encouraged by the enthusiastic response of the early adopters, a number of researchers will be incorporating YW into their projects, thereby guiding and driving the future development of YesWorkflow.
MsTMIP researchers plan to annotate their scripts such that authors, as well as reviewers and potential new users, will be able to click on the workflow steps in the interactive YW graph viewer and inspect the corresponding code-blocks in the original script. When clicking on data elements, they will be taken to a folder containing the data instances that were used in the various runs of the script (provided these have been shared). Since the YW approach is language independent, it will also facilitate code migration, say from MATLAB to R, or from R to Python.
In the Kurator project [HLM14] we plan to enable collection managers to author their own data curation workflows both with an Akka-based workflow system and with scripting languages such as Python and R. In the latter case, Kurator tool users will annotate their scripts with YW comments to enable provenance queries to span script-based curation workflows. The Kurator team also plans to use the YW-Graph and YW-Query tools to graphically render workflows defined using the Kurator-Akka workflow system and to query the prospective provenance of products of these workflows.
Finally, DataONE is planning a number of enhancements to the YW annotation language. For example, in addition to the currently supported, simple user-defined vocabulary for program blocks and data elements, controlled vocabularies from shared ontologies may be used with these extensions. Similarly, to improve YW interoperability within the DataONE infrastructure, PROV [MM13] and ProvONE [CVLM+15] compatible vocabulary extensions may be used in YesWorkflow in the future.

Figure 2: Data-oriented workflow view: program blocks are mentioned in edge labels only, while data channels are exposed as proper graph nodes.

Figure 3: Combined workflow view of a script: both programs and data are nodes.

Figure 4: Process workflow view of an Affymetrix analysis script (in R).

Figure 5: Combined workflow view of a MsTMIP script (in MATLAB). YW views can be easily tweaked via Graphviz properties in the generated DOT files: here, a "Taverna-style" top-down layout is used, as opposed to the default left-to-right display.