BGDMdocker: a Docker workflow for data mining and visualization of bacterial pan-genomes and biosynthetic gene clusters

Recently, Docker technology has received increasing attention throughout the bioinformatics community. However, its implementation has not yet been mastered by most biologists; accordingly, its application in biological research has been limited. In order to popularize this technology in the field of bioinformatics and to promote the use of publicly available bioinformatics tools, such as Dockerfiles and Images from communities, government sources, and private owners in the Docker Hub Registry and other Docker-based resources, we introduce here a complete and accurate bioinformatics workflow based on Docker. The present workflow enables analysis and visualization of pan-genomes and biosynthetic gene clusters of bacteria. This provides a new solution for bioinformatics mining of big data from various publicly available biological databases. The present step-by-step guide creates an integrative workflow through a Dockerfile, allowing researchers to build their own Image and run a Container easily.


INTRODUCTION
Docker is an open source project and platform for building, shipping, and running any app, enabling the widespread distribution of applications. Docker allows users to package an application, along with all its dependencies, into a standardized unit for software development (https://docs.docker.com/). Docker has three core components: Image, Container, and Repository. An Image can start software as complex as a database, wait for you (or someone else) to add data, store the data for later use, and then wait for the next person. Containers afford resource isolation and allocation benefits similar to those of virtual machines; however, a different architectural approach allows the former to be more portable and efficient.
We included these applications (Prokka, panX, and antiSMASH) and their dependencies in BGDMdocker (a bacterial genome data mining Docker-based workflow) to enable the workflow to be implemented online with a single run. We additionally wrote three standalone Dockerfiles for Prokka, panX, and antiSMASH in order to meet the various requirements of different users. We recommend setting up the workflow with three independent files, each with a specific purpose. This method is presented in the Supplementary Information. Here, we describe how to build the workflow and conduct the analysis in detail.

MATERIALS AND METHODS
Installation of latest Docker on your host
1. Copy the following commands to quickly install the latest Docker-CE (https://docs.docker.com/engine/installation/). They apply to Ubuntu, Debian, Raspbian, Fedora, CentOS, Red Hat, SUSE, Oracle Linux, etc.:

$ curl -fsSL https://get.docker.com -o get-docker.sh
$ sudo sh get-docker.sh

If you would like to use Docker as a non-root user, consider adding your user to the "docker" group, e.g., using:

$ sudo usermod -aG docker <user name>

Type the following command at your shell prompt (cmd.exe or PowerShell on Windows). If this outputs the Docker version, your installation was successful.

$ docker version
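
The installation check above can also be scripted. The following is a minimal sketch, assuming a POSIX shell; the variable name docker_status is illustrative and not part of the workflow. It only tests whether the docker client is on the PATH and prints the client version, so it does not confirm that the Docker daemon is running.

```shell
#!/bin/sh
# Check whether the docker client is available before building the workflow.
# docker_status is a hypothetical variable name used only in this sketch.
if command -v docker >/dev/null 2>&1; then
    docker_status="installed"
    # `docker --version` prints the client version and does not contact the daemon.
    docker --version
else
    docker_status="missing"
    echo "docker not found on PATH; repeat the installation step" >&2
fi
echo "Docker client: $docker_status"
```

If the status is "missing", the installation commands should be repeated before continuing.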
Use Docker to build the BGDMdocker workflow
1. On your host (with Docker installed), build the BGDMdocker Image from the Dockerfile and run a Container; the detailed command lines are given in the Supplementary Information.
If you use the "-v /home:/home" parameter, Docker will mount the local folder /home into the Container under /home, so that all of your data are stored in one directory in the home folder of the host operating system and can be accessed from inside the Container.
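
As an illustration of the build and run steps and of the "-v" bind mount, the sketch below assembles the two commands and prints them for review instead of executing them. The Image tag bgdmdocker:latest and the build context "." are assumptions made for this example; the commands actually used for the workflow are those in the Supplementary Information.

```shell
#!/bin/sh
# Dry run: assemble the build and run commands and print them for review.
# All values below are illustrative assumptions, not taken from the paper.
IMAGE_TAG="bgdmdocker:latest"   # hypothetical Image tag
BUILD_CONTEXT="."               # directory assumed to contain the Dockerfile
HOST_DIR="/home"                # host folder shared with the Container
CONTAINER_DIR="/home"           # mount point inside the Container

# `docker build -t <tag> <context>` builds an Image from the Dockerfile in <context>.
build_cmd="docker build -t $IMAGE_TAG $BUILD_CONTEXT"

# `docker run -it -v <host>:<container> <image>` starts an interactive Container
# with the host folder bind-mounted at the given path inside the Container.
run_cmd="docker run -it -v $HOST_DIR:$CONTAINER_DIR $IMAGE_TAG"

echo "$build_cmd"
echo "$run_cmd"
```

Both paths in the "-v" argument are absolute. Files written under /home inside the Container then appear in /home on the host and persist after the Container exits.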
We analyzed the pan-genome and biosynthetic gene clusters of 44 B. amyloliquefaciens strains using the BGDMdocker workflow. For detailed commands, see the Supplementary Information.

RESULTS
Fast and reproducible building of the BGDMdocker workflow across computing platforms using Docker
Using Docker technology, the Dockerfile script can be used to build Images and run a Container in seconds or milliseconds on Linux and Windows. The file may also be deployed on Mac and on cloud-based systems such as Amazon EC2 or other cloud providers. The Dockerfile is a small, plain-text file that may be easily stored and shared. Therefore, the user is not required to install and configure the programs manually.
Here, based on the Debian 8.0 (Jessie) Image, we have established a novel Docker-based bioinformatics platform for the study of microbial genomes and pan-genomes (Fig. 1). The workflow, which offers the advantages of cross-platform and modular reuse, provides biologists with simple and standardized tools to extract biological information from their own experiments and from online sequence databases. Researchers may therefore focus solely on mining information from the obtained sequences rather than determining how to install the software package. We have uploaded this Dockerfile to GitHub for sharing with relevant scientific researchers.
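
To show how small such a file is, the sketch below writes a hypothetical Dockerfile skeleton to a temporary directory and reads back its base Image. Only the Debian 8 (Jessie) base is taken from the text; the working directory, the default command, and the placeholder comment are assumptions, and the skeleton is not the published BGDMdocker Dockerfile.

```shell
#!/bin/sh
# Write a minimal, hypothetical Dockerfile skeleton into a temporary directory.
workdir=$(mktemp -d)

cat > "$workdir/Dockerfile" <<'EOF'
# Base Image: Debian 8 (Jessie), as stated in the text
FROM debian:jessie
# Placeholder: the tool installation steps of the published Dockerfile go here
# Working directory inside the Container (assumed for this sketch)
WORKDIR /home
# Default command: open an interactive shell (assumed for this sketch)
CMD ["/bin/bash"]
EOF

echo "Dockerfile written to $workdir/Dockerfile"

# Read the base Image back from the FROM instruction.
base_image=$(grep '^FROM' "$workdir/Dockerfile" | cut -d' ' -f2)
echo "Base Image: $base_image"
```

Because the whole environment is described in a plain-text file of this kind, it can be versioned and shared like any other script.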

Datamining and visualizing the pan-genomes of B. amyloliquefaciens
To explore the results, a website (http://bapgd.hygenomics.com/pangenome/home) was built for the interactive exploration of the B. amyloliquefaciens pan-genome and biosynthetic gene clusters obtained using the BGDMdocker workflow. Visualization allowed for the rapid filtering and searching of genes. For each gene cluster, panX displayed an alignment and a phylogenetic tree, mapped mutations within that cluster to the branches of the tree, and inferred gene losses and gains on the core-genome phylogeny. Here, we provide the summary statistics of the pan-genome (Table 1), the phylogenetic relationships of the 44 B. amyloliquefaciens strains (Fig. 2), and screenshots of the website (http://bapgd.hygenomics.com/pangenome/home) (Figs. 3 and 4). All data may be visualized and downloaded without registration.

Datamining and visualizing the biosynthetic gene clusters of B. amyloliquefaciens
Results from the identification and analysis of the biosynthetic gene clusters of 44 B. amyloliquefaciens strain genomes, using the BGDMdocker workflow, have been uploaded to our website (http://bapgd.hygenomics.com/pangenome/home). All data may be downloaded without registration.
Here, we provide brief summary statistics for the biosynthetic gene clusters of all 44 strains (Table 2), as well as an example of the type and number of biosynthetic gene clusters in the Y2 (NC_017912) strain (Table 3) and representative screenshots of the website (http://bapgd.hygenomics.com/pangenome/home) (Figs. 5 and 6). There are a total of 31 gene clusters in the genome of Y2. Among these, 21 gene clusters show similarities to known clusters in MIBiG (http://mibig.secondarymetabolites.org/), such as surfactin, mersacidin, and fengycin; the remaining 10 gene clusters are unknown.
Note: Genome sequences of the 44 B. amyloliquefaciens strains (https://www.ncbi.nlm.nih.gov/genome/genomes/848) were downloaded from the GenBank RefSeq database. "Acc gene" refers to accessory gene (dispensable gene); "Uni gene" refers to unique gene (strain-specific gene); "All genes" refers to the genes recorded in the *.gbff files, including pseudogenes; "Total genes" refers to the genes recorded in the *.gbff files that were used for pan-genome analysis, excluding pseudogenes.

DISCUSSION
The Dockerfiles of the BGDMdocker scripts are convenient for deployment and sharing, and it is easy for other users to customize the Images by editing the Dockerfile directly. This is in contrast to Makefiles and other installations, for which the resulting builds differ across different machines (Boettiger, 2015). Dockerfiles can maintain and update related adjustments, rapidly recover from system failure events, control versions, and build application environments with optimal flexibility.
BGDMdocker Images enable portability and modular reuse. Bioinformatics tools are written in a variety of languages and require different operating environment configurations across platforms. Docker technology is capable of executing the same functions and services in different environments without additional configurations (Folarin, Dobson & Newhouse, 2015), thus creating reproducible tools with high efficiency. By constructing pipelines with different tools, bioinformaticians may automatically and effectively analyze biological problems of interest.
The BGDMdocker Container enables application isolation with high efficiency and flexibility. Applications may run independently in Containers with Docker technology, and each management command (start, stop, boot, etc.) may be executed in seconds or milliseconds. Hundreds or thousands of Containers may be run on a single host at the same time (Ali, El-Kalioby & Abouelhoda, 2016), thus ensuring that the failure of one task does not disrupt the entire process: new Containers may be initialized rapidly to continue the task until the entire process is complete, thus improving overall efficiency.
In recent years, several online tools and software suites have been developed for pan-genome analysis, including Roary, PGPA, SplitMEM, PanGP, and PanTools. However, these pipelines generally have many dependencies but only a single function, so their installation is complex and challenging, which limits researchers' ability to focus directly on their analyses of interest (Table 4). Although the BGDMdocker workflow includes several tools, installing and running the software is quite simple. Biologists may automatically install, configure, and test the scripts, making these processes faster and the results repeatable.
Note: "Total" of biosynthetic gene clusters includes "Known" and "Unknown." "Known" biosynthetic gene clusters are inferred from MIBiG (Minimum Information about a Biosynthetic Gene cluster, http://mibig.secondarymetabolites.org). "Unknown" biosynthetic gene clusters are detected by Cluster Finder and further categorized into putative ("Cf_putative") biosynthetic types. A full integration of the recently published Cluster Finder algorithm now allows the use of this probabilistic algorithm to detect putative gene clusters of unknown types. A "-" in the host column indicates that the host is unrecorded.

CONCLUSION
Here, we present a BGDMdocker workflow to achieve bacterial and viral genome annotation, pan-genome analysis, mining of biosynthetic gene clusters, and visualization of results on a local host or online. This allows researchers to browse information for every gene, including duplication, diversity, indel events, and sequence alignments, as well as for biosynthetic gene clusters, including structure, type, description, detailed annotation, and predicted core structure of the target compounds. These tools and their installation commands and dependencies were all written in a Dockerfile. We used this Dockerfile to build a Docker Image and run a Container for analyzing the pan-genome of 44 B. amyloliquefaciens strains retrieved from a public database. The pan-genome included a total of 172,388 genes and 2,306 core gene clusters. The visualization of the pan-genomic data included alignments, phylogenetic trees with mutations within each cluster mapped to the branches of the tree, and inference of gene losses and gains on the core-genome phylogeny for each gene cluster. In addition, 997 known (MIBiG: http://mibig.secondarymetabolites.org database) and 553 unknown (antiSMASH-predicted clusters and Pfam database) genes in biosynthetic gene clusters and orthologous groups were identified in all strains.
The BGDMdocker workflow for the analysis and visualization of pan-genomes and biosynthetic gene clusters may be fully reused immediately across different computing platforms (Linux, Windows, Mac, and cloud-based systems), with flexible and rapid deployment of integrated software packages across various platforms. This workflow may also be used for pan-genome analysis and visualization of other species. Additionally, the visual display of data provided in this study may be completely duplicated.
All resulting data and relevant tools and files may be downloaded from our website (http://bapgd.hygenomics.com/pangenome/home) with no registration required.
Note: √, the function is provided; ×, the function is not provided.