Compi: a framework for portable and reproducible pipelines

Compi is an application framework to develop end-user, pipeline-based applications with a primary emphasis on: (i) user interface generation, by automatically creating a command-line interface based on the pipeline-specific parameter definitions; (ii) application packaging, with compi-dk, a version-control-friendly tool to package the pipeline application and its dependencies into a Docker image; and (iii) application distribution, provided through a public repository of Compi pipelines, named Compi Hub, which allows users to discover, browse, and reuse them easily. By addressing these three aspects, Compi goes beyond traditional workflow engines, having been specially designed for researchers who want to take advantage of common workflow engine features (such as automatic job scheduling or logging, among others) while keeping the simplicity and readability of shell scripts, without the need to learn a new programming language. Here we discuss the design of various pipelines developed with Compi to describe its main functionalities, as well as to highlight the similarities and differences with comparable tools that are available. An open-source distribution under the Apache 2.0 License is available from GitHub (https://github.com/sing-group/compi). Documentation and installers are available from https://www.sing-group.org/compi. A specific repository for Compi pipelines is available from Compi Hub (https://www.sing-group.org/compihub).

Tasks whose dependencies are complete are sent to a worker thread pool of parameterizable size. When there are no more tasks to run, the pipeline execution ends.

Every time a task is about to run and there is a free thread in the pool, a new subprocess is spawned by invoking the system Bash interpreter to execute the task script. Pipeline parameters are passed to the task script through environment variables, a robust and standard mechanism with two main benefits. On the one hand, environment variables are easily accessed via "$variable_name" or more complex expressions, so pipeline parameters are available directly within the task script. On the other hand, this allows Compi to pass parameters to scripts written in languages other than Bash, because virtually all programming languages give access to environment variables.

Tasks written in languages other than Bash are executed through task interpreters. Any task can have a task interpreter defined in the pipeline specification: an intermediate, user-defined Bash script that takes the task script as input and invokes an interpreter for a different programming language.

Moreover, it is possible to define task runners, which follow the same concept as interpreters but serve a different purpose. They are not defined within the pipeline specification, but rather at execution time, and are intended to tailor task execution to specific computing resources without modifying the pipeline itself. In this way, the workflow definition is decoupled from the workflow execution, making it possible to change how tasks are executed without modifying the workflow XML file. For instance, if tasks must be run in a cluster environment such as SGE or Slurm, computations must be initiated via a submission command (qsub in SGE or srun in Slurm). Task runners intercept task execution, replacing the default Bash interpreter with queue submissions.
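As a minimal sketch of these two mechanisms, a runner is simply a Bash script that decides how the task code is executed. The "task_code" variable is the one Compi hands to runners (as described later for Figure 6B); the pipeline parameter "input_fasta" is purely illustrative:

```shell
#!/usr/bin/env bash
# Sketch of a Compi task runner. Assumptions: "task_code" holds the task
# script (as the text describes for runners), and "input_fasta" is an
# illustrative pipeline parameter that Compi would export as an
# environment variable; here both are simulated.
export input_fasta=/data/sample.fasta          # normally exported by Compi
task_code='echo "aligning $input_fasta"'       # normally provided by Compi

# A local runner simply feeds the task code back to Bash:
bash -c "$task_code"

# A cluster runner would submit it to the queue instead, e.g. for Slurm:
#   srun --ntasks=1 bash -c "$task_code"
```

Because the task code and its parameters travel as plain environment variables, swapping the local `bash -c` invocation for a queue submission changes where the task runs without touching the pipeline definition.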

Compi project architecture

Compi comprises three main modules. The most important is the core module, which contains the workflow execution engine and its main data structures. On top of it, there are two additional modules. On the one hand, the cli (command-line interface) module contains the command-line user interface for running pipelines, which generates a specific pipeline application tailored to the pipeline's tasks and parameters. On the other hand, the dk (Development Kit) module allows developers to create a portable application in the form of a Docker image and to publish pipelines at Compi Hub.

Compi Hub

As discussed above, Compi Hub is a public repository where Compi pipelines can be published. The Compi Hub front-end was implemented using the Angular v7 web application framework, while the back-end was implemented using TypeScript and offers a RESTful API that supports all the functionality of the front-end. This REST API is also used by the compi-dk tool to allow pipeline developers to publish their pipelines from the command line. The Compi Hub back-end runs in a Node.js server and uses a MongoDB database to store the data.

An iteration-level dependency tells Compi that iterations over the "analyze" loop can start as soon as the corresponding iterations over the "preprocess" loop have finished. Figure 3 shows an example of this, where two foreach tasks, "preprocess" and "analyze", iterate over the same set of items ("samples"). The "analyze" task depends on the "preprocess" task, but at the iteration level (after="*preprocess"). Without the '*' prefix, the "analyze" task would only start when the whole "preprocess" task has finished.

Pipeline execution and dependency management

One of the main features of Compi is the creation of a classic CLI for the entire pipeline, based on its parameter specifications, to help users run the pipeline.
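The iteration-level dependency illustrated in Figure 3 could be expressed roughly as follows. This is only a sketch: the after="*preprocess" attribute and the task names come from the example above, while the remaining element and attribute names are illustrative approximations that may not match the exact Compi pipeline schema:

```xml
<!-- Sketch only: after="*preprocess" and the task names come from the text;
     other element/attribute names are illustrative approximations. -->
<foreach id="preprocess" of="param" in="samples" as="sample">
  <![CDATA[ preprocess_sample "$sample" ]]>
</foreach>
<foreach id="analyze" after="*preprocess" of="param" in="samples" as="sample">
  <![CDATA[ analyze_sample "$sample" ]]>
</foreach>
```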
The generated CLI is displayed when a user executes "compi run -p pipeline.xml --help", describing the Compi execution parameters and the specification of each pipeline task. Supplementary File 1 shows this CLI for the RNA-Seq Compi pipeline (https://www.sing-group.org/compihub/explore/5d09fb2a1713f3002fde86e2).

As Figure 4 illustrates, the execution of Compi pipelines can be controlled using multiple parameters that fall into three main categories: pipeline inputs (i.e. the pipeline definition and its input parameters), logging, and execution control. In Figure 4A, the "compi run" command receives the pipeline definition file explicitly, while the "--params" option indicates that the input parameters must be read from the "compi.params" file. On the other hand, in Figure 4B the pipeline definition file is omitted and Compi assumes that the pipeline definition must be read from a file named "pipeline.xml" located in the current working directory. Also, pipeline parameters are passed on the command line after all the "compi run" parameters, separated by the '--' delimiter.

Regarding logging options, both examples include the "--logs" option to specify a directory where the standard (stdout) and error (stderr) outputs of each task are saved, along with the specific parameter values used in each execution. These are saved in three files named with the corresponding task name as a prefix (e.g. "task-name.out.log", "task-name.err.log", and "task-name.params"). To avoid unnecessary file creation, Compi does not save task outputs unless the "--logs" option is used. It is also possible to specify which tasks should be logged using the "--log-only-task" or "--no-log-task" parameters. In addition to the task-specific logs, Compi displays its own log messages during pipeline execution. These messages can be disabled by including "--quiet", as shown in Figure 4B.
In contrast, in the example given in Figure 4A, the "--show-std-outs" option forces Compi to forward each task log to the corresponding Compi output, which is very useful for debugging during pipeline development.

The third group of options allows controlling the execution of the pipeline. For instance, the "--num-tasks" parameter used in Figure 4A sets the maximum number of tasks that can be run in parallel. It is important to note that this is not necessarily equivalent to the number of threads the pipeline will use, as some tasks may spawn parallel processes themselves. The "--abort-if-warnings" option, also used in the example in Figure 4A, tells Compi to abort the pipeline execution if there are warnings in the pipeline validation. This is a useful and recommended option for pipeline testing during development, as it avoids undesired effects that may arise from ignoring such warnings. A typical scenario that causes a warning is when the name of a pipeline parameter is found inside the task code, but the task does not have access to it because it is neither a global parameter nor defined in the set of task parameters.

One of the most notable features of the Compi workflow execution engine is that it allows fine-grained control over the execution of the pipeline tasks. While many workflow engines only allow launching the entire pipeline or partially relaunching it from a point of failure, which can also be done in Compi using the "resume" command, Compi additionally allows launching sub-pipelines using modifiers such as "--from", "--after", "--until", or "--before". In the example shown in Figure 4B, a combination of the "--from" and "--until" modifiers is used, resulting in the execution of all tasks in the path between "task-1" and "task-10", including both. If "--after" and "--before" are used instead, then "task-1" and "task-10" are not executed.
A fifth modifier is "--single-task", which runs only the specified task and is not compatible with the other four modifiers. The Compi documentation (https://www.sing-group.org/compi/docs/running_pipelines.html#examples) includes several examples illustrating how each of these options works.

By default, Compi runs each task as a local command (Figure 5A). This means that if a task invokes a certain tool (e.g. a Clustal Omega alignment running "clustalo -i /path/to/input.fasta -o /path/to/output.fasta"), the tool must be available either through the PATH environment variable or via the absolute path to its binary executable. Since dependency management is always cumbersome, a special effort has been made to offer developers and users several alternatives to deal with this problem, which are explained below.

One way to address this issue is by means of a file with a custom XML runner definition, as done in the example shown in Figure 4A with "--runners-config pipeline-runners.xml". Individual runners are defined through a "runner" element within a runners file, where the "task" attribute specifies the list of tasks that the runner must execute. In this way, when a task identifier is assigned to a runner, Compi asks the runner to run the corresponding task code instead of running it as a local command. Using pipeline runners to handle dependencies allows Docker images to take responsibility for them (Figure 5D). For instance, Figure 6A shows a pipeline task named "align" that uses a tool (defined by the pipeline parameter "clustalomega") that receives a file as input and produces an output file. The runner defined in Figure 6B for the same task runs the specified task code (available in the environment variable "task_code") using a Docker image.
The runner here is almost a generic Docker runner; the key points are:

- First, the creation of a variable ("$envs") with the list of parameters that must be passed as environment variables to the Docker container.

- Second, running the Docker image with the list of environment variables and mounting the directory containing the command's input and output files ("workingDir" in this example).

Such a Docker runner makes it possible to follow an image-per-task execution pattern, where each task is executed using a different container image. An example of this execution pattern can be found in the GenomeFastScreen pipeline (https://sing-…).

Each pipeline published at Compi Hub shows general information such as its creation date, as well as links to external repositories in GitHub or Docker Hub. In addition, for each pipeline version, Compi Hub displays the following information:

- Overview: this section is headed by the pipeline DAG, generated in the back-end using the "compi export-graph" command. Since it is an interactive graph, visitors can use it to navigate to each task description. Figure 10 shows the Metatax pipeline DAG (https://www.sing-group.org/compihub/explore/5d807e5590f1ec002fc6dd83). The DAG is followed by two tables describing the pipeline tasks and the parameters associated with the displayed pipeline.

- Readme: in addition to the instructions given in this section, we also recommend that pipeline publishers provide test datasets to help users test the pipelines themselves.

- Runners: this section displays a list of example runner configurations when they are present in the compi-dk project. Runner configurations must be stored as XML files within the "runners-example" directory of the project.

- Params: this section shows a list of example parameter configurations when they are present in the compi-dk project. Parameter configurations must be stored as plain-text key-value files in the "params-example" directory of the project.

As can be seen, Compi Hub was not designed to be merely a pipeline repository. Our aim is for developers to accompany each pipeline with all the information necessary to ensure its portability and reproducibility by other researchers.