The COMBAT-TB Workbench: Making Powerful Mycobacterium tuberculosis Bioinformatics Accessible

ABSTRACT Whole-genome sequencing (WGS) is a powerful method for detecting drug resistance, genetic diversity, and transmission dynamics of Mycobacterium tuberculosis. Implementation of WGS in public health microbiology laboratories is impeded by a lack of user-friendly, automated, and semiautomated pipelines. We present the COMBAT-TB Workbench, a modular, easy-to-install application that provides a web-based environment for Mycobacterium tuberculosis bioinformatics. The COMBAT-TB Workbench is built using two main software components: the IRIDA platform for its web-based user interface and data management capabilities and the Galaxy bioinformatics workflow platform for workflow execution. These components are combined into a single easy-to-install application using Docker container technology. We implemented two workflows, for M. tuberculosis sample analysis and phylogeny, in Galaxy. Building our workflows involved updating some Galaxy tools (Trimmomatic, snippy, and snp-sites) and writing new Galaxy tools (snp-dists, TB-Profiler, tb_variant_filter, and TB Variant Report). The irida-wf-ga2xml tool was updated to be able to work with recent versions of Galaxy and was further developed into IRIDA plugins for both workflows. In the case of the M. tuberculosis sample analysis, an interface was added to update the metadata stored for each sequence sample with results gleaned from the Galaxy workflow output. Data can be loaded into the COMBAT-TB Workbench via the web interface or via the command line IRIDA uploader tool. The COMBAT-TB Workbench application deploys IRIDA, the COMBAT-TB IRIDA plugins, the MariaDB database, and Galaxy using Docker containers (https://github.com/COMBAT-TB/irida-galaxy-deploy). IMPORTANCE While the reduction in the cost of WGS is making sequencing more affordable in lower- and middle-income countries (LMICs), public health laboratories in these countries seldom have access to bioinformaticians and system support engineers adept at using the Linux command line and complex bioinformatics software. The COMBAT-TB Workbench provides an open-source, modular, easy-to-deploy and -use environment for managing and analyzing M. tuberculosis WGS data and thereby makes WGS usable in practice in the LMIC context.

user-friendly web interface for sequencing data and metadata storage, workflow execution, and result visualization. We adopted IRIDA as the basis of the COMBAT-TB Workbench because of its integration with the Galaxy platform, its proven track record in public health bioinformatics (PHAC-NML has analyzed over 200,000 pathogen samples using IRIDA [T. Matthews, personal communication]), and the fact that it is open source.
IRIDA stores sequence samples on disk and sample metadata in a MariaDB database (10). Sequence data are shared between IRIDA and the Galaxy analysis platform (reducing data duplication).
Scientific workflows and IRIDA plugins. The Workbench uses Galaxy for its bioinformatics workflow composition and execution. Two workflows, for M. tuberculosis sample analysis and phylogeny, were implemented in Galaxy. Building workflows in Galaxy involves connecting Galaxy tools to construct an analysis workflow where the Galaxy tools (also known as tool wrappers) themselves connect command line bioinformatics tools to the Galaxy framework. Building our workflows involved updating some Galaxy tools (Trimmomatic [11], snippy [12], and snp-sites [13]) and writing new Galaxy tools (snp-dists [14], TB-Profiler [5], tb_variant_filter [15], and TB Variant Report [16]). In addition to the work on Galaxy tools, the command line tb_variant_filter and TB Variant Report tools were created as part of the COMBAT-TB project.
The irida-wf-ga2xml tool (17) was updated to be able to work with recent versions of Galaxy, and it was used to build IRIDA plugin skeletons. These plugin skeletons were further developed into IRIDA plugins for both workflows, in the case of the M. tuberculosis sample analysis involving the addition of an interface between the Galaxy workflow output and the metadata stored for each sequence sample. The M. tuberculosis Sample Report and M. tuberculosis Phylogeny plugins are hosted in Github repositories (https:// github.com/COMBAT-TB/irida-plugin-tb-sample-report and https://github.com/COMBAT -TB/irida-plugin-tb-phylogeny, respectively) and deployed into IRIDA as part of the COMBAT-TB Workbench deployment process.
Deployment. The COMBAT-TB Workbench application (https://github.com/COMBAT -TB/irida-galaxy-deploy) deploys IRIDA, the COMBAT-TB IRIDA plugins, the MariaDB database IRIDA uses for metadata storage, and Galaxy using Docker containers. The Docker containers are orchestrated using docker-compose (19). This allows the entire Workbench to be installed by users without advanced Linux systems administration knowledge and hides the complexity of the underlying software from the user.
Integration with external data storage. Data can be loaded into the COMBAT-TB Workbench via the web interface or via the command line IRIDA uploader tool. This allows data from external storage (for example, the storage of a sequencing facility) to be loaded into the workbench in bulk.
Comparison to other similar open-source software. At the time of writing we found two systems, Innuendo (20) and the IRIDA project (8), that were comparable to the COMBAT-TB Workbench (Table 1).
Innuendo is a web interface to sequence storage and Nextflow (21) workflow execution aimed at analysis of foodborne pathogens. It is oriented around common tasks in the foodborne pathogen surveillance terrain, such as molecular typing of pathogens. The platform is strongly tied to the workflows of a foodborne pathogen surveillance lab, and adding additional species to the analysis system requires modifying the underlying database and adding a whole-genome multilocus sequence typing (wgMLST) scheme. As such, Innuendo addresses a different challenge to the one the COMBAT-TB project tackles.
IRIDA is the official bioinformatics platform for public health genomics within the Public Health Agency of Canada. It has been in use since 2016 by Canada's provincial and national public health laboratories for genomic investigations of foodborne disease outbreaks, as part of PulseNet Canada's foodborne disease surveillance activities. While it is more flexible than Innuendo, as it allows deployment of a wide variety of Galaxy workflows and is not species specific, it is complex to deploy, with the installation guide assuming knowledge of deployment of both Galaxy and the Tomcat Java Servlet system (22). The COMBAT-TB Workbench, in contrast, is straightforward to deploy with a single Linux command. Innuendo and IRIDA both support storing and modifying metadata for samples. While the Galaxy platform provides a flexible platform for web-based bioinformatics, it lacks similar features for organizing samples together with their metadata and forces users to maintain metadata separately (perhaps in a spreadsheet). The COMBAT TB Workbench builds on the IRIDA support for metadata storage.
Use case. For the purpose of these analyses, we installed the COMBAT-TB Workbench on a virtual machine with 8 virtual CPUs, 32 GB RAM, and 3000 GB hard disk space, running Ubuntu 18.04 with Docker version 19.03.14 and docker-compose version 1.27.4.
The COMBAT-TB Workbench user interface is organized around projects and analyses. Projects store sequence samples, are associated with a reference genome, and allow controlling sharing and access to samples. Samples can be selected from a project and added to a cart. Once samples are in the cart, selecting the cart displays a workflow selection screen from which analyses can be started (Fig. 2).
Selecting one of the workflows allows the workflow parameters to be set and the analysis to be started. Workflows can be either per-sample, in which case a workflow instance is started for each sample, or multisample (for example, phylogenies) in which case the samples are analyzed as a group. Once a workflow has been executed, it creates a new entry visible via the Analyses interface. This interface allows for monitoring the status of workflow execution and visualization of workflow analysis results.
Analysis of data from MDR M. tuberculosis in Indonesia. Tania et al. (23) collected M. tuberculosis samples from 30 patients with confirmed pulmonary tuberculosis treated in four hospitals in the western region of the Indonesian island of Java. Phenotypic drug susceptibility testing (DST) was performed on the cultured samples, and DNA was isolated and sequenced Thirty samples were retrieved from EBI European Nucleotide Archive (ENA) and uploaded to the COMBAT-TB Workbench using the irida-uploader (24) command line tool. Per-sample quality control was performed automatically (using FastQC [25]) after sample uploading. Uploading and quality control took 9 and 3 min, respectively.
The inferred ancestral M. tuberculosis reference genome produced by Comas et al. (26) was downloaded from Zenodo (27) and uploaded to the COMBAT-TB Workbench FIG 2 COMBAT-TB Workbench workflow selection screen. The workbench user interface is organized around projects and analyses. Samples are selected from a project and added to a cart (not shown here). The user sees the workflow selection screen after selecting a cart. One of the 9 workflows can be selected, and the corresponding workflow parameters will be set automatically for analysis. Sequences stored in the cart are displayed on the far right.
as the default reference genome for the sequencing project. While the H37Rv reference genome can be used, we used the M. tuberculosis inferred ancestral reference, as Goig et al. (28) showed previously that this genome is equidistant, in terms of sequence variants, from all known M. tuberculosis lineages and thus provides a superior reference to H37Rv (NC_000962.3) for variant calling, especially if that variant calling is going to be used for phylogeny construction.
Per-sample reporting. The M. tuberculosis Sample Report pipeline, to generate drug resistance phenotype prediction and lineage assignment (see Materials and Methods), was run on all 30 samples. Detailed output from this pipeline is made available in a persample analysis report (Fig. S1) that includes a full report on variants identified in the sample (annotated using information from the COMBAT-TB NeoDB [29]), drug resistance prediction (from TB-Profiler), and quality control information on read mapping.
Upon completion of the Sample Report pipeline, the output (as described in Fig. S1)namely, drug resistance prediction, drug resistance-associated variants, and assigned lineage-was added to the metadata that were originally associated with each sample (Fig. 3).
Examining the read mapping outputs shows that only 9.63% and 1.47% of reads from samples SRR12416824 and SRR12416842, respectively, mapped against the M. tuberculosis genome. Further investigation with kraken2 (30) using the standard database from 14 April 2020 showed that the majority (63.55%) of reads from SRR12416824 were classified as belonging to the Mycobacterium avium complex (MAC) and the majority of reads (75.39%) from SRR12416842 were classified as the nontubercular mycobacterium Mycolicibacterium fortuitum. These samples were thus excluded and not used in subsequent analyses. This step provides a useful interactive space for users to consider the origin (sequence identity) of the reads before submitting it to a phylogenetic analysis pipeline.
The results were broadly concordant with those found by Tania et al. (23). Some small differences are likely a result of different versions of the TB-Profiler software used. In our own analysis, updating the TB-Profiler version from 2.8.4 to version 3.0.6 reduced the discordance between streptomycin resistance predicted by the COMBAT-TB Workbench and that predicted by MGIT by two samples. This illustrates the improvement of drug resistance prediction from WGS data over time. As noted above, Phylogeny on all samples. The 28 samples that previously passed quality control were selected in the web interface and submitted to the M. tuberculosis Phylogeny Pipeline (https:// github.com/COMBAT-TB/irida-plugin-tb-phylogeny). This pipeline computes a maximum-likelihood phylogeny using the single nucleotide variants (SNVs) identified in each sample (Fig. 4).
The IRIDA Advanced Visualization view (Fig. 5) allows metadata from the sequence analysis project's sequencing store to be associated with tips (i.e., samples) in the phylogeny view. In addition to the metadata computed by the M. tuberculosis Sample Report pipeline, an Excel spreadsheet was generated associating each sample ID with the sample IDs and hospital collection sites identified in the paper by Tania et al. (23). This spreadsheet was used to load additional metadata into the IRIDA project.
The distance between hospital sites varied from 10 km (Mampang Prapatan Hospital, Jakarta, Indonesia, to Cempaka Putih Islamic Hospital, Jakarta, Indonesia) to 107 km (Drajat Prawiranegara Hospital, Serang, Indonesia, to M. Goenawan Partowidigdo Pulmonary Hospital, Bogor, Indonesia). When data were visualized this way, it was apparent that there was no clear relationship between phylogenetic relationship (and thus lineage) and hospital site, illustrating that the outbreaks occurring in the region were circulating in areas broader than hospital catchment areas.
Analysis of run times. The run times of the steps in the analyses are listed in Table 2. Galaxy scheduled the execution of analysis steps in parallel when data dependencies allowed. The similarities of the run times between the sample processing and the phylogeny pipelines are to some extent due to the fact that IRIDA currently starts all analyses from sequence reads rather than from assembled genomes. This limitation is currently being addressed by the IRIDA development community (31).
Analysis of data from M. tuberculosis in Spain. Xu et al. (32) collected M. tuberculosis samples from 117 patients and analyzed them to understand dynamics of TB transmission in the Valencia Region, in Spain. While the key element of their analysis is identifying individual transmission patterns, their work also provides a convenient larger data set to examine the performance of the COMBAT-TB Workbench. Uploading of the 117 samples took 21 min, and per-sample analysis took a total of 6 h. A phylogeny was computed using the samples. Phylogeny generation took just over 8 h, and visualization with the advanced phylogeny viewer revealed clustering by M. tuberculosis lineage (Fig. S2), as expected. Unfortunately, Xu et al. (32) did not include the output of their phylogeny construction in their results, so a direct comparison of the computed phylogenetic trees is not possible.

DISCUSSION
The COMBAT-TB Workbench makes routine M. tuberculosis WGS sample storage and bioinformatics analysis accessible in an extensible framework. The IRIDA platform on which it is built allows pipelines to be added as needed, and the use of Docker container technology means that installation on a machine (with the required Docker and docker-compose software) is a straightforward process not requiring advanced system administration skills.
Access to this project that is based on open-source best practices reduces the analytical cost of WGS, increasing opportunities for M. tuberculosis WGS deployment in low-and middle-income countries. As the capabilities of the underlying IRIDA and Galaxy platforms continue to evolve, the COMBAT-TB Workbench will naturally acquire additional features in addition to the features being added by the authors.  van Heusden et al.

MATERIALS AND METHODS
Implementation of the COMBAT-TB Workbench. IRIDA and Galaxy are both server environments, and IRIDA relies on a MariaDB database. The COMBAT-TB Workbench executes each of these servers in Docker containers, and the execution of the COMBAT-TB Workbench as a whole is orchestrated using docker-compose. The COMBAT-TB Workbench is deployed on any machine that runs Docker and docker-compose by fetching its code from Github (https://github.com/COMBAT-TB/irida-galaxy-deploy) and running a single command ("dockercompose up -build -d"). Updates to the workbench are similarly applied using a single command. The COMBAT-TB Workbench updates its IRIDA plugins from their repositories on Github on startup.
Implementation of the M. tuberculosis Sample Report pipeline. The TB sample report workflow starts by running Trimmomatic (v. 0.38.1) with the Sliding Window trimmer, truncating reads when the average quality within a 4-base window drops below a quality score of 30, followed by the minimum-length trimmer, which discards reads shorter than 20 bases. Only reads which remain in read pairs after quality trimming are used in subsequent analyses.
After quality trimming, sequence reads are mapped to a user supplied reference genome. The workflow requires a genome with the same coordinate scheme as the M. tuberculosis H37Rv reference (RefSeq NC000962). Mapping and variant calling are done using snippy (v. 4.4.5), a Perl-based pipeline that combines the bwa-mem (v. 0.7.17) (33) mapper and the freebayes (v. 1.3.2) (34) variant caller. Snippy has been shown (35) to produce good-quality variant calls in M. tuberculosis when run with default parameters. Samtools (v. 1.9) (36) flagstat is run to provide an overview of statistics from the mapping process.
After variant calling, variants are annotated with SnpEff (v. 4.3) (37) using the H37Rv reference genome annotation. Variants are then filtered using tb_variant_filter (v. 0.1.3). While snippy performs quality-based filtering of the variants it predicts, this tool offers a variety of filtering options commonly used in M. tuberculosis variant filtering. In our workflow, it filters out variants in the PE/PPE gene regions (38), in the repetitive and insertion section regions identified by UVP (39) and those with lower than 30 supporting reads or within 5 bp of an indel. These filtering options, like all tool options in the workflow, can optionally be altered by the user.
In parallel to the SnpEff annotation and variant filtering steps, the mapped reads are provided to TBProfiler (v. 2.8.4), which performs its own variant calling and lineage and drug resistance prediction.
Finally the filtered, annotated variants and the TBProfiler results are fed to tb_vcf_report (v. 0.1.7), which produces a report further annotated with information from the COMBAT-TB eXplorer database (29) in both text and HTML formats.
The final reports provided to the user are the variant reports from tb_vcf_report, text and JSON format reports from TBProfiler, variants in VCF format from SnpEff, and mapping statistics from samtools flagstats. These reports include both user-readable and raw data suitable for further downstream analysis. The metadata stored in the sample line list are updated with mapping percentage, M. tuberculosis lineage, spoligotype information, and drug resistance information.
Implementation of the M. tuberculosis Phylogeny Pipeline. The workflow in the phylogeny module starts with quality filtering of samples using fastp (v. 0.19.5) (40) with default settings. The filtered reads are then aligned to the user provided reference using snippy and predicted variants are filtered with tb_variant_filter as described above. In addition, only SNVs are retained, as the phylogeny software used in the workflow cannot extract meaningful information from indels.
For each sample, the identified sequence variants are inserted into the reference genome, yielding one sequence per sample, of the same length as the reference but with variants from the sample inserted. These are concatenated into a multiple-sequence alignment (MSA), which is used as input to snp_dists (v. 0.6.3). Variant and constant sites are identified using snp_sites (v. 2.5.1), and the multiple-sequence alignment is filtered to retain only variant sites. A phylogeny is then built using IQ-TREE (v. 1.5.5.1) (41)(42)(43).
The final report includes the SNP distance matrix and (Newick format) tree. The tree is also visualized in a phylogeny viewer and can also be displayed with associated sample metadata.
The SNP distance matrix, the phylogeny and the filtered variants (in VCF format) are presented to the user as output.
Data availability. DNA from cultured samples was deposited in the NCBI Sequence Read Archive (SRA) with BioProject accession number PRJNA633244.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.