Supplementary materials for Phylogenomics and genetic analysis of solvent-producing Clostridium species

Cite this dataset

Brown, Steven (2024). Supplementary materials for Phylogenomics and genetic analysis of solvent-producing Clostridium species [Dataset]. Dryad. https://doi.org/10.5061/dryad.g4f4qrfx7

Abstract

The genus Clostridium is a large and diverse group within the Bacillota (formerly Firmicutes), whose members can encode useful complex traits such as solvent production, gas-fermentation, and lignocellulose breakdown. We describe 270 genome sequences of solventogenic clostridia from a comprehensive industrial strain collection assembled by Professor David Jones that includes 194 C. beijerinckii, 57 C. saccharobutylicum, 4 C. saccharoperbutylacetonicum, 5 C. butyricum, 7 C. acetobutylicum, and 3 C. tetanomorphum genomes. We report methods, analyses and characterization for phylogeny, key attributes, core biosynthetic genes, secondary metabolites, plasmids, prophage/CRISPR diversity, cellulosomes and quorum sensing for the 6 species. The expanded genomic data described here will facilitate engineering of solvent-producing clostridia as well as non-model microorganisms with innately desirable traits. Sequences could be applied in conventional platform biocatalysts such as yeast or Escherichia coli for enhanced chemical production. Recently, gene sequences from this collection were used to engineer Clostridium autoethanogenum, a gas-fermenting autotrophic acetogen, for continuous acetone or isopropanol production, as well as butanol, butanoic acid, hexanol and hexanoic acid production.

README: Supplementary Dataset and Tables for Phylogenomics and Genetic Analysis of Solvent-Producing Clostridium Species

This Dryad submission supports a publication entitled “Phylogenomics and Genetic Analysis of Solvent-Producing Clostridium Species” accepted for publication at Scientific Data (SDATA-23-01651) in April 2024 (https://www.nature.com/sdata/). The paper describes 270 genome sequences and detailed analyses. The raw sequence and genome sequence information are publicly available, described in the publication and in the related works section of this submission. This Dryad dataset provides supplementary information to the Scientific Data publication.

Description of the data and file structure

There are 11 files in this dataset in Excel format:

Supplementary Table 1. This file contains 270 DJ strain designations, linked to database references (e.g. GenBank accession numbers, Genome IDs) and provides important genome analysis results such as CheckM2 outputs, contig and gene numbers. Header information is provided in row 1 and information is organized by strain designation in ascending order.

Supplementary Table 2. This table lists the subset of strains whose genomes are described in the Sci Data paper and that are available from international culture collections (colored red). The table is arranged by phylogenetic clade and by species/strain designation, and it also lists other strains of multiple species that are deposited in international culture collections (colored black). Blank cells separate the species for visual ease of access.

Supplementary Table 3. The National Chemical Products (NCP) chemical and fermentation company operated industrial ABE fermentation processes in South Africa from 1935 to 1983. This table cross-references the newly generated genome sequence information to historical NCP designations. Header information is provided in row 1. Information is organized by DJ strain designation in ascending order, followed by NCP information, then full strain name and associated JGI IMG Genome ID separated by a comma.

Supplementary Table 4. Core to the clostridial acetone-butanol-ethanol (ABE) process are key biosynthesis proteins, which are listed in this table. Type strain protein sequences used in the Sci Data paper as input search sequences are listed, along with search output results organized by individual enzymes, protein sequence and by strain hit results. The table contains several blank cells that separate enzymes for visual ease of access.

Supplementary Table 5. BlastP analysis of RRNPP-type quorum-sensing systems. Table S5a: Blast hits corresponding to putative RRNPP-type quorum-sensing systems in seven C. acetobutylicum genomes. Below the first table, separated by a blank row, is Table S5b: Blast hits corresponding to putative RRNPP-type quorum-sensing systems in seven C. saccharoperbutylacetonicum genomes. The genome columns indicate the type strain reference input sequence used to search DJ genomes, which are described in the references provided in the table. Putative hits to DJ strains are shown per genome and protein accession numbers are provided.

Supplementary Table 6. Summary of cellulosomal genes in butanol-producing clostridia. The presence of sequences for cohesin(s) or dockerin in the given gene is indicated by Coh(s) or Doc, respectively. The genes are organized in a cluster, except for the underlined genes, which are scattered through the respective genomes. Among all the genomes analyzed, cellulosomal elements were retrieved only in C. acetobutylicum, C. saccharoperbutylacetonicum and one of the C. beijerinckii strains. Analysis of the seven C. acetobutylicum strains revealed that each genome contains 6 cohesins (5 in the scaffoldin sequence and one in the orfX gene) and 10 dockerins, as described previously for type strain ATCC 824 (Lopez-Contreras, A.M. et al, 2003). Dockerin-containing proteins include the following glycoside hydrolase (GH) catalytic modules: three GH5s, four GH9s, one GH44, one GH48 and one GH74. Nine of these twelve cellulosomal genes are organized in a gene cluster, identical to that reported previously for C. acetobutylicum strains (Lopez-Contreras, A.M. et al, 2003; Dassa, B. et al, 2017). Similarly, the analysis of the four C. saccharoperbutylacetonicum genomes revealed a cellulosomal organization similar to that of strain N1-4 (Levi Hevroni, B., et al, 2020), including 2 cohesins and 8 dockerins. The annotation of the dockerin-containing enzymes is also analogous to that of the N1-4 strain, with two GH5s, two GH9s, one GH26, one GH44, one GH48 and one GH74. Moreover, five of the genes encoding the putative enzymes are organized in a gene cluster, together with the two-cohesin (bivalent) scaffoldin gene. C. beijerinckii DJ015 appeared to contain two cellulosomal genes, highly similar to the C. saccharoperbutylacetonicum genes encoding ScaA and GH48 (99 and 100% sequence identities, respectively). However, a PCR assay was unable to confirm the presence of the genes for ScaA and GH48 in strain DJ015. The DJ015 genome was generated from paired-end Illumina data, contained 665 annotated contigs, a greater number of total contigs, and was excluded from this study, as were several other genomes.

Supplementary Table 7. Distribution of glycoside hydrolases (GHs) identified in DJ strains. Annotation of glycoside hydrolases was performed with dbCAN2 (Zhang, H., Yohe, T., Huang, L., Entwistle, S., Wu, P., Yang, Z., Busk, P.K., Xu, Y., Yin, Y. dbCAN2: a meta server for automated carbohydrate-active enzyme annotation. Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W95–W101. https://doi.org/10.1093/nar/gky418). The number of different GH families of enzymes (described by dbCAN2) is shown per genome (headers). Blank cells indicate no hit identified.

Supplementary Table 8. Putative in silico plasmid detection in DJ genomes. Contigs were classified into plasmid and chromosomal categories to determine plasmid presence or absence using PlasFlow v1.1.0. Header information is provided in row 1. Information is organized by DJ strain, putative detection (Yes or No) in the PlasFlow Plasmid Call header, followed by the predicted number of different plasmids per genome (Predicted Plasmid Count), followed by plasmid size estimates. Where plasmids were not identified, a "-" value is present.
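
As an illustration of the layout described for Supplementary Table 8, the sketch below reads a few rows and treats "-" as a missing value. It assumes the sheet has first been exported from Excel to CSV. The column names "Strain" and "Plasmid Size Estimates" and all sample values are hypothetical placeholders; only "PlasFlow Plasmid Call" and "Predicted Plasmid Count" are header names taken from the description above.

```python
import csv
import io

# Hypothetical rows imitating a CSV export of Supplementary Table 8.
# "Strain" and "Plasmid Size Estimates" are assumed header names and the
# values are invented for illustration; they are not taken from the dataset.
SAMPLE = """Strain,PlasFlow Plasmid Call,Predicted Plasmid Count,Plasmid Size Estimates
DJ001,Yes,2,150000; 12000
DJ002,No,-,-
DJ003,Yes,1,65000
"""


def load_plasmid_calls(handle):
    """Read rows, mapping the "-" placeholder to None and counts to integers."""
    rows = []
    for row in csv.DictReader(handle):
        cleaned = {}
        for key, value in row.items():
            value = value.strip()
            cleaned[key] = None if value == "-" else value
        count = cleaned["Predicted Plasmid Count"]
        # A missing count means no plasmid was predicted for that genome.
        cleaned["Predicted Plasmid Count"] = int(count) if count is not None else 0
        rows.append(cleaned)
    return rows


rows = load_plasmid_calls(io.StringIO(SAMPLE))
with_plasmids = [r["Strain"] for r in rows if r["PlasFlow Plasmid Call"] == "Yes"]
total_plasmids = sum(r["Predicted Plasmid Count"] for r in rows)
print(with_plasmids, total_plasmids)
```

To use this on the real file, replace the in-memory sample with an open file handle and adjust the assumed header names to match row 1 of the table.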

Supplementary Table 9. Analysis of the DJ genomes identified a number of prophage sequences, which are listed. The table includes taxonomic affiliation (prophage, host species, genomic boundaries, NCBI RefSeq & Marker genes) using the IMG/VR approach as well as vContact clusters, along with quality information obtained with CheckV. The host genome, contig number and genomic boundaries are described under the Prophage header, respectively, separated by bars. Headers are provided in row 1 for other attributes. The table contains several blank cells where there was no hit or information, or for visual ease of access.

Supplementary Table 10. CRISPR-Cas systems for ABE clostridia genomes were analyzed for prevalence, diversity, and dynamics in closely related host species. The complete Type I-B system was the most common, detected in all the C. saccharobutylicum, C. saccharoperbutylacetonicum, and C. tetanomorphum genomes, and a minority of genomes from C. butyricum and C. beijerinckii. This dataset identifies CRISPR-Cas systems, arranged by hits to host genome species, and contains details such as the number of spacer sequences, locations, repeat sequences as named in row 1.

Supplementary Table 11. List of matches between industrial clostridia CRISPR spacers and phage sequences. Information about phage sequences was obtained either from this study or from the IMG/VR v3 database. Row 1 of this dataset has descriptive column headers that identify attributes of sequence matches. The host genome (JGI IMG Genome ID), contig number and starting genome coordinate are described under the Spacer header, respectively, separated by bars. Host species are identified, followed by array name and additional characteristics such as phage match and ecosystem.
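
The bar-separated Spacer field described above can be split into its three parts with a few lines of code. The sketch below is a minimal example that assumes "bars" means the "|" character and that the starting coordinate is a plain integer; the example value is invented and the names `SpacerLocus` and `parse_spacer_id` are hypothetical.

```python
from typing import NamedTuple


class SpacerLocus(NamedTuple):
    """Components of a Spacer field: genome ID, contig, start coordinate."""
    genome_id: str
    contig: str
    start: int


def parse_spacer_id(value: str) -> SpacerLocus:
    # Assumption: the three components are separated by the "|" character.
    parts = [part.strip() for part in value.split("|")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 bar-separated fields, got {len(parts)}: {value!r}")
    genome_id, contig, start = parts
    return SpacerLocus(genome_id, contig, int(start))


# Invented example value; real entries should be checked against the table.
example = parse_spacer_id("0000000000|contig_1|12345")
print(example)
```

Keeping the genome ID as a string avoids losing leading zeros or altering long identifiers, while the coordinate is converted to an integer so it can be compared or sorted numerically.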

Sharing/Access information

The source for the underlying genome and sequence data is described in:

Microorganisms. 2023 Sep 7;11(9):2253. doi: 10.3390/microorganisms11092253. Solvent-Producing Clostridia Revisited. David T Jones, Frederik Schulz, Simon Roux, Steven D Brown. https://pubmed.ncbi.nlm.nih.gov/37764097/

For convenience, the 270 sub-projects are deposited under NCBI BioProject PRJNA990349.

Methods

Genomes were sequenced at the JGI using Pacific Biosciences (PacBio) technology on either an RSII instrument (P6/C4 chemistry), or Sequel (v2.1 v2 or v3.0 chemistries) or using an Illumina NovaSeq (v1 chemistry) and have been reported previously (https://pubmed.ncbi.nlm.nih.gov/37764097/). All genome sequences were annotated using the JGI IMG pipeline, the vast majority with version v.5.0.10, and details for each are available at the JGI IMG database. Genomes used in this study are curated under the GOLD study ID Gs0118866, with links to raw sequence data available via JGI or via the NCBI SRA database. Contigs were classified into plasmid and chromosomal categories to determine plasmid presence or absence using PlasFlow v1.1.0. For convenience, the 270 sub-projects are deposited under NCBI BioProject PRJNA990349. Strains analyzed in this study are shown in Supplementary Table 1, and relevant culture collection details are provided in Supplementary Tables 2-3. Additional methodological details are provided in the accepted Scientific Data manuscript SDATA-23-01651.

Funding

U.S. Department of Energy (DOE), Office of Science (SC), Office of Biological and Environmental Research (BER), Award: DE-SC0018249, Biosystems Design