Complete genome sequence and characterization of the haloacid-degrading Burkholderia caribensis MBA4

Burkholderia caribensis MBA4 was isolated from soil for its capability to grow on haloacids. This bacterium has a genome size of 9,482,704 bp. Here we report the genome sequence and annotation, together with characteristics of the genome. The complete genome sequence consists of three replicons, comprising 9056 protein-coding genes and 80 RNA genes. Genes responsible for dehalogenation and uptake of haloacids are arranged as an operon. Dehalogenation of haloacetate produces glycolate, and three glycolate operons were identified. Two of these operons contain an upstream glcC regulator gene. It is likely that the expression of one of these operons is responsive to haloacetate. Genes responsible for the metabolism of the dehalogenation product of halopropionate were also identified.


Introduction
Human activities are thought to have a great impact on the environment. While the development of industry has greatly improved our living conditions, it has also escalated many environmental problems. Pollution has been an issue for a long time. Halogenated compounds have been used indiscriminately with the expansion of industrialization. Many of these compounds are found in the environment as disinfection by-products [1]. Not only do they cause environmental problems, they also have a deleterious impact on our health [2].
Many bacteria are capable of transforming halogenated compounds and utilizing them as carbon and energy sources. These bacteria are distinguished by encoding enzymes known as dehalogenases, which catalyze the breakdown of halogenated compounds through cleavage of the carbon-halogen bond [3]. Burkholderia caribensis [4] MBA4 was isolated for its ability to mineralize 2-haloacids [5]. The dehalogenase gene, deh4a, together with a downstream permease gene, deh4p, forms an inducible operon that mediates the transformation and uptake of 2-haloacids, respectively, in MBA4 [6]. The dehalogenase has been purified and characterized [5,7,8]. The permease has also been investigated [9]. Moreover, MBA4 possesses a cryptic dehalogenase with a signal peptide [10,11]. While a proteomic analysis of the degradation of chloroacetate by MBA4 has been described, the identification of the differentially expressed proteins was hampered by the lack of a comprehensive protein database [12]. The acquisition of a complete genomic sequence was therefore deemed necessary. Here we describe the characterization of B. caribensis MBA4 and its complete genome sequence and annotation, with an emphasis on genomic features and genes related to the degradation of haloacids.

Genome sequencing information
Genome project history
The genome of MBA4 was selected for sequencing in order to unravel the genetic background that enables the bacterium to utilize haloacids. MBA4 has a genome larger than most Burkholderia species, with a size of more than 9 Mbp. Table 2 shows the project information and its association with MIGS version 2.0 compliance [21].
The 100- and 300-bp Illumina paired-end data and the 454 reads were obtained from the Centre for Genome Sciences (previously Genome Research Centre), The University of Hong Kong. The PacBio long reads were obtained from Groken Bioscience. Bar codes were trimmed and low-quality reads were filtered using the commercial software CLC Genomics Workbench 6.0.1 (CLC bio, Aarhus, Denmark). After trimming and filtering, Illumina paired-end and 454 reads were de novo assembled with CLC Genomics Workbench 6.0.1 using default settings. Scaffolds were then generated from the contigs with SSPACE basic 2.0 [22] using information derived from the paired-end reads. De novo assembled transcripts from nine sets of RNA-seq paired-end raw data were mapped to the scaffolds to remove some of the internal gaps and ambiguous bases, and to join the scaffolds together. Standard PCR and Sanger sequencing were employed to fill the gaps inside the scaffolds. Multiplex PCR was used to amplify unknown regions between scaffolds, and some scaffolds were linked after subsequent cloning and sequencing. Clean PacBio reads were assembled by SMRT Analysis v2.3.0 HGAP.2 with the pre-assembled high-quality draft genome as reference sequences. Ambiguous bases and inserted/deleted regions between the PacBio-assembled and pre-assembled high-quality draft sequences were manually corrected using consensus sequences derived from the nine sets of transcriptome data.
A draft genome was annotated automatically with the Rapid Annotation using Subsystem Technology (RAST) server [23-25] and the Prokaryotic Genomes Automatic Annotation Pipeline from NCBI [26]. Subsequent annotation of the complete genome was based on the annotated draft sequences. Minor corrections were made manually.

Genome properties
The complete genome is represented by three replicons. The total size of the genome is 9,482,704 bp with a GC content of 62.46 % [27]. A total of 9151 genes were predicted for the genome, including 15 pseudogenes. As for RNA genes, 18 rRNA and 62 tRNA genes were identified. About 80.07 % of the total genes are protein-coding genes with known function, while 1729 genes were annotated as hypothetical proteins [27]. Among the total, 6596 genes were assigned to COGs. The properties and the statistics of the genome are described in Table 3. The distribution of the genes in COG functional categories [28] is shown in Table 4. Circular genome maps, shown in Fig. 3, were generated using CGView [29] based on ORFs with COG information, tRNA, rRNA and GC content.
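The gene counts quoted above can be cross-checked with simple arithmetic. The sketch below assumes that the 9151 predicted genes partition into protein-coding genes, RNA genes and pseudogenes, and that the 80.07 % figure is taken relative to all 9151 genes; both are assumptions made for illustration, and the variable names are hypothetical.

```python
# Counts quoted in the text.
total_genes = 9151
pseudo_genes = 15
rrna_genes = 18
trna_genes = 62
hypothetical_proteins = 1729

# RNA genes are the sum of rRNA and tRNA genes.
rna_genes = rrna_genes + trna_genes

# Assumption: total = protein-coding + RNA + pseudogenes.
protein_coding = total_genes - pseudo_genes - rna_genes

# Protein-coding genes with an assigned function.
known_function = protein_coding - hypothetical_proteins

# Assumption: the percentage is taken relative to all predicted genes.
pct_known_function = round(100 * known_function / total_genes, 2)

print(rna_genes, protein_coding, known_function, pct_known_function)
```

Under these assumptions the tallies agree with the 9056 protein-coding genes and 80 RNA genes given in the abstract and with the 80.07 % figure quoted above.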

Insights from the genome sequence
The haloacid-utilizing operon, comprising the dehalogenase gene deh4a and the permease gene deh4p, was found in replicon CP012747. Besides deh4a, eight other genes in the genome are annotated as haloacid dehalogenase or haloacid dehalogenase-like protein. However, in previous studies, when MBA4 was grown in medium containing monochloroacetate (MCA) as the sole carbon and energy source, only Deh4a was detected. A BLASTN analysis showed that the nucleotide sequences of these other genes differ from that of deh4a, suggesting that they are not homologs of deh4a. It would be interesting to investigate whether these putative dehalogenases have a function similar to that of Deh4a. When MCA is taken into the cell and hydrolyzed by Deh4a, glycolate is produced. Further transformation of glycolate is mediated by glycolate oxidase, an enzyme that consists of three subunits, viz. GlcD, GlcE and GlcF.
Fig. 3 Genome maps of B. caribensis MBA4. The outer circle indicates the location of all ORFs. All ORFs are colored according to their COG functional groups. Light venetian red and medium rose colored arrows indicate tRNA and rRNA genes, respectively. GC content is in black, and GC skew + and - are in green and fuchsia, respectively. The sizes of the replicons are not drawn to scale.
Fig. 4 Schematic representation of the genomic organization of three glycolate oxidase genes in B. caribensis MBA4. Glycolate oxidase genes comprising glcDEF were identified in replicons CP012746, CP012747 and CP012748. In replicons CP012747 and CP012748, a glcC regulator gene was also discovered. In replicon CP012747, a glcB gene, encoding malate synthase, was found downstream of glcDEF.
The genes encoding glycolate oxidase are clustered as an operon. In MBA4, three glycolate oxidase operons were identified. One of these is located downstream of deh4a, in replicon CP012747. This operon has a downstream malate synthase gene, glcB, and an upstream regulator gene, glcC, on the opposite strand. Another glcDEF operon, also with an upstream glcC, was discovered in replicon CP012748. A third glycolate oxidase operon, located in replicon CP012746, has neither glcC nor glcB in its neighborhood (Fig. 4). It is apparent that glycolate could be utilized in three ways after transformation to glyoxylate by glycolate oxidase. Whether these three glycolate oxidases are responsible for three different courses awaits further investigation.
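The GC content and GC skew tracks described in the legend of Fig. 3 follow conventional definitions: GC content is the fraction of G and C among the bases, and GC skew is (G - C)/(G + C), usually computed over sliding windows. The following is a minimal sketch of those definitions, not the CGview implementation; the window size and step used for Fig. 3 are not stated in the text, so the defaults below are placeholders, and all function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def gc_skew(seq: str) -> float:
    """GC skew, (G - C) / (G + C); 0.0 when the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def skew_track(seq: str, window: int = 10000, step: int = 10000):
    """GC skew per window as (start, skew) pairs.

    The window and step defaults are placeholders, not the values
    used to draw Fig. 3. A trailing partial window is skipped.
    """
    return [
        (start, gc_skew(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    toy = "GGGCCCCG"  # toy sequence, not MBA4 data
    print(gc_content(toy))
    print(skew_track(toy, window=4, step=4))
```

Windows with positive skew would be drawn in one color and windows with negative skew in the other, which corresponds to the green and fuchsia tracks in the figure legend.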
As for other features of the genome, 612 tandem repeats were found by Tandem Repeats Finder [30]. At least 58 genomic islands were predicted by IslandViewer [31]. The online CRISPRFinder [32] identified ten CRISPR regions, one confirmed and nine questionable. Four incomplete prophage regions and one questionable prophage region were identified using PHAST [33].

Conclusions
In this study, we report the complete genome sequence of Burkholderia caribensis MBA4, which was isolated for its ability to utilize haloacetates. Examination of genes such as those encoding dehalogenases and glycolate oxidases has provided insight into the metabolism of the bacterium in transforming haloacetates for use as carbon and energy sources. Further analysis of genes related to the conversion of halopropionate would be fruitful.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
KFK carried out the molecular biology study. YP conducted the assembly, annotation and data analysis, and drafted the manuscript. JSHT conducted the data analysis, conceived of the study, participated in its design and coordination, and drafted the manuscript. All authors read and approved the final manuscript.