Data on genome sequencing, analysis and annotation of a pathogenic Bacillus cereus 062011msu

Bacillus species 062011 msu is a harmful pathogenic strain responsible for abscessation in sheep and goat populations, as studied by Mariappan et al. (2012) [1]. The organism specifically targets female sheep and goats and reduces milk and meat production. In the present study, we performed whole genome sequencing of the pathogenic isolate using the Ion Torrent sequencing platform and generated 458,944 raw reads with an average length of 198.2 bp. The genome sequence was assembled, annotated and analysed for the genetic islands, metabolic pathways, orthologous groups, virulence factors and antibiotic resistance genes associated with the pathogen. In parallel, the 16S rRNA sequencing study and genome sequence comparison data confirmed that the strain belongs to the species Bacillus cereus and exhibits 99% sequence homology with the genomes of B. cereus ATCC 10987 and B. cereus FRI-35. Hence, we have renamed the organism Bacillus cereus 062011msu. The Whole Genome Shotgun (WGS) project has been deposited at DDBJ/ENA/GenBank under the accession NTMF00000000 (BioProject PRJNA404036, https://www.ncbi.nlm.nih.gov/bioproject/PRJNA404036; BioSample SAMN07629099).



Value of the data
Bacillus cereus 062011msu is a deadly pathogenic bacterium known for causing abscesses, mainly in female sheep and goats. Hence, the genome sequence resource and its annotation details can be used to understand the pathogenicity of the bacterium, for the benefit of farmers who rear sheep and goats.
The genome annotation data of Bacillus cereus 062011msu provide a broad overview of the subsystem features, metabolic pathways, orthologous groups, virulence factors and antibiotic resistance genes associated with the genome. Most of the unique genes of the strain were found to be clustered in ten genetic islands, and this study provides a detailed analysis of the genes clustered on those islands.
The data obtained from 16S rRNA analysis and genome sequence comparison with other Bacillus species provide significant information for the identification and taxonomic classification of this new bacterial strain. Although a previous study using partial 16S rRNA sequences reported the pathogen to be genetically similar to Bacillus anthracis [1], the whole genome data confirmed that the strain in fact belongs to the species Bacillus cereus and is phylogenetically related to B. cereus ATCC 10987 and B. cereus FRI-35.
The entire genome dataset can be used further to determine the genes and biochemical pathways related to the pathogenicity (abscess formation) of the strain and to develop new antimicrobial drugs against the pathogen.

Data
The data represent the genome sequencing, assembly, annotation and comparative analysis of the pathogenic bacterium Bacillus cereus 062011msu. Table 1 gives the summary statistics of the draft genome assembly of B. cereus 062011msu. The length and Phred quality score distributions of the raw and filtered reads are illustrated in Supplementary Fig. S1. Fig. 1 shows the 10 genetic islands predicted in the genome of the isolate. The details of the genes clustered on the genetic islands are given in Supplementary Table S1. Fig. 2 shows the subsystem distribution of the B. cereus 062011msu genome based on RAST genome annotation. The complete list of the RAST-annotated genes is given in Supplementary Table S2. Fig. 3 gives a complete overview of the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways associated with the annotated genome sequence. Fig. 4 shows the Clusters of Orthologous Groups (COG) distribution. Fig. 5 represents the phylogenetic tree constructed from the 16S rRNA comparison of the strain with its closely related homologs. The phylogenetic analysis data obtained from the 23S rRNA comparison are depicted in Supplementary Fig. S2. Fig. 6 shows the annotated circular genome map of B. cereus 062011msu obtained from DNAPlotter.

Genome sequencing, quality assessment and de novo assembly
Bacillus cereus 062011msu was isolated from the abscess tissue of affected female sheep and goats in Maruthamputhur village near the Alangulam region, Tirunelveli District, Tamil Nadu, India [1]. Whole genome sequencing of the isolate on the Ion Torrent Personal Genome Machine (Life Technologies, Carlsbad, CA) [2] produced 458,944 raw reads with an average length of 198.2 bp and a total size of 90,974,357 bp (90.974 Mb). The FastQC (version 0.11.5) plug-in software (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) [3] and CLC Genomics Workbench version 9.0.1 [4] were used to analyse read quality and to trim ambiguous, low-quality reads. After quality assessment and trimming, a total of 432,619 cleaned reads were obtained, with an average length of 178.08 bp. The trimmed reads were assembled into 3,200 contigs with an average length of 1,611 bp and a GC content of 35.3% using the de novo assembly algorithm of CLC Genomics Workbench version 9.0.1.
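The summary figures quoted above (contig count, mean contig length, GC content, mean read length) are simple to recompute from sequence files. The following is a minimal sketch only, not the CLC Genomics Workbench procedure used in this study; the function names and the toy contigs are hypothetical and serve purely as illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def assembly_summary(lines: Iterable[str]) -> Dict[str, float]:
    """Return contig count, total length, mean length and GC percentage."""
    lengths, gc, acgt = [], 0, 0
    for _, seq in read_fasta(lines):
        lengths.append(len(seq))
        gc += seq.count("G") + seq.count("C")
        # Ambiguous bases (e.g. N) are excluded from the GC denominator.
        acgt += sum(seq.count(base) for base in "ACGT")
    total = sum(lengths)
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "mean_bp": total / len(lengths) if lengths else 0.0,
        "gc_percent": 100.0 * gc / acgt if acgt else 0.0,
    }


# Toy input: two hypothetical contigs, not the real assembly.
toy = [">contig_1", "ATGCATGC", ">contig_2", "AATT"]
summary = assembly_summary(toy)
print(summary)

# Mean raw read length from the totals reported in the text.
mean_read_length = 90_974_357 / 458_944
print(round(mean_read_length, 1))
```

With a real assembly, the lines of the contig FASTA file would be passed in place of the toy list, for example via `open(path)`. The last two lines show that the reported total size and read count are consistent with the stated average read length of 198.2 bp.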

Genome sequence annotation and genomic data analysis
The draft genome contigs of Bacillus cereus 062011msu were annotated using the NCBI Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP) [5] and Rapid Annotation of microbial genomes using Subsystem Technology (RAST) version 2.0 (http://rast.nmpdr.org/) [6]. The PGAAP annotation of the isolate's genome showed a total of 7,061 CDS (2,301 protein coding genes and 4,760 pseudogenes) and 81 RNA genes, including 66 tRNAs, 10 rRNAs and 5 ncRNAs. The annotation details are available in the whole genome shotgun (WGS) project under the accession NTMF00000000. Genomic islands are sets of genes of probable horizontal origin that facilitate the diversification, adaptation and evolution of pathogenic microbes [7,8]. The genomic islands in our study were predicted by submitting the PGAAP-generated GenBank file to IslandViewer 4 (http://www.pathogenomics.sfu.ca/islandviewer/) [8]. A total of 219 genes were clustered on 10 genetic islands.
The data obtained from the RAST annotation server revealed that the draft genome contains 8721 coding sequences and 472 subsystems, with "Amino Acids and Derivatives" and "Carbohydrates" being the most represented subsystem features. In addition, the annotated subsystem features included 257 genes associated with "Virulence, Disease and Defense": 154 genes associated with resistance to antibiotics and toxic compounds, 52 genes associated with the synthesis of antibacterial peptides (bacteriocins), 50 genes associated with invasion and intracellular resistance, and one gene associated with adhesion. The KEGG (Kyoto Encyclopedia of Genes and Genomes) biological pathways associated with the genome of Bacillus cereus 062011msu were identified by annotating the protein coding sequences against the KEGG pathway database using the BLAST2GO program [9]. A total of 3104 sequences were mapped to 116 different KEGG pathways. Among them, the pathways associated with nucleotide metabolism, amino acid metabolism, metabolism of cofactors and vitamins, and carbohydrate metabolism were the most dominant in the genome dataset. The prediction and classification of the orthologous groups associated with the protein coding genes of Bacillus cereus 062011msu were performed using the eggNOG database (evolutionary genealogy of genes: Non-supervised Orthologous Groups) embedded within the BLAST2GO software [10]. The COG (Clusters of Orthologous Groups) data showed that the cluster for "Amino acid transport and metabolism" (381 sequences) forms the largest functional group. Among the other functional groups, the clusters for "Function unknown" (344 sequences), "General function prediction only" (289 sequences), "Inorganic ion transport and metabolism" (281 sequences) and "Carbohydrate transport and metabolism" (261 sequences) were the most highly represented categories.
Given the pathogenic nature of the strain, the virulence factors and toxin genes residing in the genome of Bacillus cereus 062011msu were further screened by annotating the coding sequences against the Virulence Factor Database (VFDB) [11] using local BLASTX with an E-value cutoff of 1E-5. A total of 1108 sequences homologous to 743 virulence factors and toxin genes were identified from the BLAST search. Among them, 56 genes showed complete sequence homology (100%) with the annotated genome dataset of B. cereus 062011msu, indicating that the flagellar proteins might play a regulatory role in the pathogenicity of the bacterium. Previous in vitro experiments by Mariappan et al. (2012) reported that the pathogen was sensitive to tetracycline (TET) and ciprofloxacin (CPFX) [1]. The antibiotic resistance genes present in Bacillus cereus 062011msu were screened using the curated Antibiotic Resistance Genes Database (ARDB) (http://ardb.cbcb.umd.edu/) [12]. The data showed that the pathogen carries a total of 14 antibiotic resistance genes conferring resistance to antibiotics such as bacitracin, penicillin, fosfomycin, streptogramin A, chloramphenicol, doxorubicin, fluoroquinolone, puromycin, streptomycin, beta-lactam, lincomycin and fosmidomycin, thus confirming the susceptibility of the strain to TET and CPFX.
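The E-value filtering step described above can be sketched as follows. This is an illustrative sketch only: it assumes the BLASTX results were written in the default BLAST+ tabular layout (-outfmt 6), and it keeps the best-scoring hit per query, which is an assumption of this example rather than a documented detail of the study. The query and subject names in the toy data are hypothetical.

```python
import csv
import io
from typing import Dict, List

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, max_evalue: float = 1e-5) -> List[Dict[str, str]]:
    """Keep the best hit per query among hits with E-value <= max_evalue."""
    best: Dict[str, Dict[str, str]] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        if float(hit["evalue"]) > max_evalue:
            continue
        prev = best.get(hit["qseqid"])
        if prev is None or float(hit["bitscore"]) > float(prev["bitscore"]):
            best[hit["qseqid"]] = hit
    return list(best.values())


# Toy tabular output with hypothetical identifiers (not real VFDB entries).
toy = "\n".join([
    "cds_1\tVF_a\t100.00\t300\t0\t0\t1\t900\t1\t300\t1e-150\t600",
    "cds_1\tVF_b\t80.00\t250\t50\t0\t1\t750\t1\t250\t1e-40\t200",
    "cds_2\tVF_c\t35.00\t60\t39\t0\t1\t180\t1\t60\t0.5\t30",
    "cds_3\tVF_a\t62.50\t120\t45\t0\t1\t360\t1\t120\t2e-12\t70",
])

hits = filter_hits(io.StringIO(toy))
n_queries = len(hits)
n_factors = len({h["sseqid"] for h in hits})
n_identical = sum(float(h["pident"]) == 100.0 for h in hits)
print(n_queries, n_factors, n_identical)
```

In the toy data, `cds_2` is discarded because its E-value exceeds the cutoff, and the three counters correspond in spirit to the numbers of homologous sequences, distinct virulence factors and 100% identity matches reported in the text.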

Genome sequence comparison, 16S and 23S rRNA analysis, and genome map visualization
The closest neighboring strains of Bacillus cereus 062011msu, based on genome sequence comparison using the RAST server, were identified as Bacillus cereus AND1407 (score 544), Bacillus cereus MSX-D12 (score 406) and Bacillus cereus BAG3O-2 (score 387). Based on the local similarity of the aligned nucleotide sequences using the rapid sequence comparison tool BLAST [13], the genome of Bacillus cereus 062011msu exhibited 99% sequence homology with the genomes of Bacillus cereus ATCC 10987, Bacillus cereus strain M3, Bacillus cereus FRI-35, Bacillus thuringiensis serovar finitimus YBT-020, Bacillus cereus strain CC-1, Bacillus cereus NC7401 and Bacillus cereus AH187. In microbial genomics research, comparison of the 16S rRNA gene sequence has emerged as a reliable technique for identifying new bacterial strains associated with pathogenicity and infections [14]. The deduced 16S rRNA sequence of the Bacillus cereus 062011msu genome was aligned to its nearby homologs using ClustalW multiple sequence alignment, and the phylogenetic analysis was performed with the maximum likelihood method and 100 bootstrap replicates using the MEGA7 software (www.megasoftware.net/) [15]. The phylogenetic tree based on 16S rRNA sequence comparison confirmed that the pathogenic strain belongs to the species Bacillus cereus and has a close evolutionary relationship with B. cereus ATCC 10987 and B. cereus FRI-35, as the three clustered together in a monophyletic clade. We also performed a phylogenetic analysis based on 23S rRNA gene sequence comparison using the MEGA7 software. The 16S rRNA gene tree was concordant with the 23S rRNA gene tree, which also identified B. cereus ATCC 10987 as the closest evolutionary homolog of the pathogen.
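As an illustration of what a percentage such as the 99% figure above measures, the sketch below computes percent identity between two sequences that have already been aligned. It uses one common convention (columns containing a gap are ignored), which is an assumption of this example and not necessarily the convention applied by BLAST or MEGA7; the function name and the toy sequences are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over alignment columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


# Toy alignment: 9 ungapped columns, 8 of which match.
identity = percent_identity("ACGT-ACGTA", "ACGTTACGTC")
print(round(identity, 2))
```

Counting gap columns as mismatches instead would lower the value, so the convention should always be stated alongside a reported identity.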
The complete genomic map of Bacillus cereus 062011msu, representing the GC content and GC skew graphs, the coordinates and the coding sequence features on both forward and reverse strands obtained from the RAST annotation, was generated with DNAPlotter (http://www.sanger.ac.uk/science/tools/dnaplotter) [16].
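The GC content and GC skew tracks on such a map are windowed statistics, with GC skew defined as (G - C)/(G + C). The sketch below shows the calculation under stated assumptions: the window and step sizes are arbitrary illustrative values, not the settings used by DNAPlotter for Fig. 6, and the toy sequence is hypothetical.

```python
from typing import List, Tuple


def gc_windows(seq: str, window: int, step: int) -> List[Tuple[int, float, float]]:
    """Return (start, GC fraction, GC skew) for each full window of seq."""
    seq = seq.upper()
    out = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        gc_fraction = (g + c) / window
        # Skew is undefined when a window has no G or C; report 0.0 there.
        skew = (g - c) / (g + c) if g + c else 0.0
        out.append((start, gc_fraction, skew))
    return out


# Toy sequence split into three non-overlapping windows of 4 bases.
tracks = gc_windows("GGGCATATCCCG", window=4, step=4)
for start, gc_fraction, skew in tracks:
    print(start, gc_fraction, skew)
```

On a real genome the window would typically span thousands of bases, and the sign change of the skew along the chromosome is what the plotted graph makes visible.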