Draft genome sequence data of Clostridium thermocellum PAL5 possessing high cellulose-degradation ability

Clostridium thermocellum is a potent cellulolytic bacterium. C. thermocellum strain PAL5, which was derived from strain S14 isolated from bagasse paper sludge, possesses higher cellulose-degradation ability than the representative strains ATCC27405 and DSM1313. In this work, we determined the draft genome sequence of C. thermocellum PAL5. Whole-genome sequencing of genomic DNA was performed on the Illumina HiSeq 2500. We obtained 215 contigs of >200 bp (N50, 78,366 bp; mean length, 17,378 bp). The assembled data were subjected to the National Center for Biotechnology Information (NCBI) Prokaryotic Genome Annotation Pipeline, and 3,198 protein-coding sequences, 53 tRNA genes, and 4 rRNA genes were identified. The data are accessible at NCBI (accession number SBHL00000000). Our data resource will facilitate further studies of efficient cellulose degradation using C. thermocellum.


Data
The thermophilic anaerobic bacterium Clostridium thermocellum (recently called Hungateiclostridium thermocellum) is a multifunctional ethanol producer, capable of both saccharification and fermentation [1]. C. thermocellum PAL5 was derived from strain S14 [2–4], which was isolated from bagasse paper sludge. The cellulolytic activity of strain PAL5 was compared with that of C. thermocellum ATCC27405ᵀ, the type strain of this species [5], and C. thermocellum DSM1313 [6] by incubation for 3 days at 60 °C in CTFUD medium [7] containing 1.0% microcrystalline cellulose powder instead of cellobiose. PAL5 showed better cellulose-degrading ability than the other strains (Fig. 1), indicating that PAL5 may, like strain S14, possess high cellulose-degradation ability.
In this work, we determined the draft genome sequence of C. thermocellum PAL5 to identify which factors affect its cellulose-degradation ability. In total, 81,421,880 single reads of 100 bp were obtained after filtering for quality score. De novo genome assembly was performed using the CLC Genomics Workbench (CLC Bio, Qiagen, Valencia, CA); 215 contigs of >200 bp, excluding scaffolded regions, were obtained. Features of the genome are shown in Table 1. The assembled data for PAL5 were subjected to the NCBI Prokaryotic Genome Annotation Pipeline (PGAP), and 3,198 protein-coding sequences (CDSs), 53 tRNA genes, and 4 rRNA genes were identified. The equivalent values for strain ATCC27405 were 3,204 CDSs, 56 tRNA genes, and 12 rRNA genes (GenBank accession number: NC_009012). The sequencing results for PAL5 in this work were therefore similar to the known genome information for the type strain and could be considered reliable.
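The contig statistics reported for this assembly (number of contigs of >200 bp, N50, and mean length) are simple functions of the contig lengths. The following is a minimal Python sketch of how such values can be computed; it is not the procedure used by the CLC Genomics Workbench, the function name `assembly_stats` is hypothetical, and the contig lengths are toy values, not data from PAL5.

```python
def assembly_stats(contig_lengths, min_length=200):
    """Return (contig count, N50, mean length) for contigs longer than min_length.

    N50 is taken here as the length of the shortest contig in the smallest set
    of longest contigs that together cover at least half of the total length.
    """
    kept = sorted((n for n in contig_lengths if n > min_length), reverse=True)
    if not kept:
        raise ValueError("no contigs above the length cutoff")

    total = sum(kept)
    running = 0
    n50 = kept[-1]
    for length in kept:
        running += length
        if 2 * running >= total:  # reached half of the assembled bases
            n50 = length
            break

    return len(kept), n50, total / len(kept)


# Toy contig lengths (hypothetical); the 150 bp contig is dropped by the cutoff.
toy_lengths = [150, 300, 500, 1000, 2000, 4000]
count, n50, mean_length = assembly_stats(toy_lengths)
print(count, n50, mean_length)
```

Taking the reported figures at face value, 215 contigs with a mean length of 17,378 bp correspond to a total assembled length of roughly 215 × 17,378 ≈ 3.74 Mb.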
We calculated the average nucleotide identity (ANI) [8] among eight strains of C. thermocellum, including PAL5, and two outgroup strains, C. clariflavum DSM19732 (CP003065.1) and Herbivorax (Hungateiclostridium) saccincola GGR1 (CP025197.1). The ANI value is calculated as the mean identity of BLASTn matches between the virtually fragmented query genome and the reference genome. A dendrogram of relatedness based on the ANI values (Suppl. Table 1) was constructed using the unweighted pair group method with arithmetic mean (UPGMA) (Fig. 2) and the single-linkage method (data not shown) as clustering methods; it showed that PAL5 is closely related to all the C. thermocellum strains.

Value of the data
• Clostridium thermocellum PAL5, which has strong cellulose-degradation ability, was derived from strain S14, which was isolated from bagasse paper sludge.
• Data on the draft genome sequence of strain PAL5 can be used to search for and characterize genes and enzymes related to high cellulose-degradation ability.
• Comparison of genome sequence data between C. thermocellum strains offers an opportunity to understand differences in cellulose-degradation ability.
Eight putative cellulosomal scaffolding proteins of PAL5 were identified from the genomic data by similarity with strain ATCC27405 (Table 2). The protein accession numbers corresponding to CipA and OlpB were divided into three nonconsecutive fragments; we suggest this was because the single reads could not be concatenated by the algorithm used in the de novo assembly. We consider that our genome data are of sufficient quality for further analysis of which factors affect the cellulose-degradation ability of strain PAL5 and others.

Genomic DNA extraction and sequencing
Genomic DNA of C. thermocellum PAL5 was extracted from microbial cells grown under anaerobic conditions at 60 °C. We used the cetyltrimethylammonium bromide (CTAB) method to extract genomic DNA [9]. The genomic DNA was processed to template samples using the TruSeq Nano DNA LT Library Prep Kit (Illumina, San Diego, CA). The template samples were formed into clusters using the HiSeq PE Rapid Cluster Kit v2-HS and HiSeq Rapid Duo cBot v2 Sample Loading Kit, and then sequenced using the HiSeq Rapid SBS Kit v2-HS (Illumina) with the HiSeq 2500 next-generation sequencer (Illumina). De novo genome assembly was performed using the CLC Genomics Workbench. The assembled data were subjected to the NCBI PGAP.

Genomic average nucleotide identity
ANI analysis, which is used as an in silico counterpart of DNA–DNA hybridization, was performed. ANI values for pairwise combinations of the whole-genome sequences of the C. thermocellum strains were calculated using the web tool ANI calculator (http://enve-omics.ce.gatech.edu/ani/). The matrix of ANI values between C. thermocellum strains was converted to a dendrogram using the unweighted pair group method with arithmetic mean (UPGMA) and the single-linkage clustering method in the R statistical software.
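The conversion of an ANI matrix into a dendrogram can be illustrated with a short sketch. The analysis described here was carried out in R; the Python code below is not that script but a minimal, self-contained illustration of the same two clustering rules. It assumes that ANI percentages are turned into distances as 100 − ANI (an assumption, since the transformation is not stated above), the function name `cluster_ani` and the strain labels are hypothetical, and the ANI values are invented toy numbers, not those of Suppl. Table 1.

```python
from itertools import combinations


def cluster_ani(names, ani, method="average"):
    """Agglomerative clustering of genomes from pairwise ANI values.

    names:  list of genome labels
    ani:    dict mapping frozenset({a, b}) -> ANI in percent (symmetric)
    method: "average" (UPGMA) or "single" (single linkage)
    Returns (nested-parenthesis tree string, list of (a, b, merge distance)).
    """
    # Assumption: distance = 100 - ANI (%).
    dist = {pair: 100.0 - value for pair, value in ani.items()}
    sizes = {name: 1 for name in names}  # active clusters and their sizes
    merges = []

    while len(sizes) > 1:
        # Pick the closest pair of active clusters.
        a, b = min(combinations(sorted(sizes), 2),
                   key=lambda pair: dist[frozenset(pair)])
        d_ab = dist[frozenset((a, b))]
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = f"({a},{b})"

        # Update distances from the merged cluster to every remaining cluster.
        for other in sizes:
            d_a = dist[frozenset((a, other))]
            d_b = dist[frozenset((b, other))]
            if method == "average":   # UPGMA: size-weighted mean
                d_new = (size_a * d_a + size_b * d_b) / (size_a + size_b)
            elif method == "single":  # single linkage: nearest member
                d_new = min(d_a, d_b)
            else:
                raise ValueError(f"unknown method: {method}")
            dist[frozenset((merged, other))] = d_new

        sizes[merged] = size_a + size_b
        merges.append((a, b, d_ab))

    return next(iter(sizes)), merges


# Toy ANI values in percent (hypothetical, for illustration only).
names = ["strain_1", "strain_2", "strain_3", "outgroup"]
ani = {
    frozenset(("strain_1", "strain_2")): 99.5,
    frozenset(("strain_1", "strain_3")): 99.0,
    frozenset(("strain_2", "strain_3")): 99.0,
    frozenset(("strain_1", "outgroup")): 80.0,
    frozenset(("strain_2", "outgroup")): 80.0,
    frozenset(("strain_3", "outgroup")): 80.0,
}
tree, merges = cluster_ani(names, ani, method="average")
print(tree)
```

The two methods differ only in how the distance from a merged cluster to the remaining clusters is updated, so comparing their dendrograms is a quick check of whether the grouping depends on the linkage rule.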