Natural diversity of cellulases, xylanases, and chitinases in bacteria

Background Glycoside hydrolases (GH) targeting cellulose, xylan, and chitin are common in the bacterial genomes that have been sequenced. Little is known, however, about the architecture of multi-domain and multi-activity glycoside hydrolases. In these enzymes, combined catalytic domains act synergistically and thus display overall improved catalytic efficiency, making these proteins of high interest for the biofuel technology industry. Results Here, we identify the domain organization in 40,946 proteins targeting cellulose, xylan, and chitin derived from 11,953 sequenced bacterial genomes. These bacteria are known to be capable, or to have the potential, to degrade polysaccharides, or are newly identified potential degraders (e.g., Actinospica, Hamadaea, Cystobacter, and Microbispora). Most of the proteins we identified contain a single catalytic domain that is frequently associated with an accessory non-catalytic domain. Regarding multi-domain proteins, we found that many bacterial strains have unique GH protein architectures and that the overall protein organization is not conserved across most genera. We identified 217 multi-activity proteins with at least two GH domains for cellulose, xylan, and chitin. Of these proteins, 211 have GH domains targeting similar or associated substrates (i.e., cellulose and xylan), whereas only six proteins target both cellulose and chitin. Fifty-two percent of multi-activity GHs are hetero-GHs. Finally, GH6, −10, −44 and −48 domains were mostly C-terminal; GH9, −11, −12, and −18 were mostly N-terminal; and GH5 domains were either N- or C-terminal. Conclusion We identified 40,946 multi-domain/multi-activity proteins targeting cellulase, chitinase, and xylanase in bacterial genomes and proposed new candidate lineages and protein architectures for carbohydrate processing that may play a role in biofuel production. 
Electronic supplementary material The online version of this article (doi:10.1186/s13068-016-0538-6) contains supplementary material, which is available to authorized users.


Background
Glycoside hydrolases (GH) are key enzymes for the processing of complex carbohydrates [1]. Plant-derived cellulose and xylan represent the major source of carbon in terrestrial ecosystems, whereas chitin is the most abundant source of carbon in marine ecosystems. The deconstruction of these polysaccharides by GH is key to the global carbon cycle [2] and to mammal nutrition [3], and is a primary target of several industries (e.g., biofuel production) [4]. Many GH families for carbohydrate processing have been identified, some being populated with many identified and characterized proteins (e.g., cellulases from GH5) and others containing few sequences (e.g., arabinases from GH93) [1]. Sixty-one GH families have been assigned a PFam ID, allowing for domain identification based on HMM-profile recognition [5,6]. According to characterized proteins in the CAZy database (http://www.cazy.org), many GH families display substrate specificity, and so the potential activity of a GH can be determined by examining the protein sequence. For example, most enzymes from GH families 5, 6, 7, 8, 12, 44, 45, and 48 act on cellulose, while GH families 10, 11, and 30 are mostly xylanases, and GH families 18, 19, and 85 are chitinases [1,7]. There are also some GH families that do not target specific substrates (e.g., GH16).
The complete breakdown of polysaccharides requires the synergistic action of multiple enzymes acting on internal bonds (e.g., endocellulase), extremities (e.g., exocellulase), and intermediate degradation products (e.g., β-glucosidase). Thus, most identified degrader lineages have several genes coding for GH and many seemingly redundant enzymes targeting similar substrates [1,7,8]. Across environments, polysaccharides are associated and form complex structures (e.g., plant cell walls); therefore, many degraders target several substrates (e.g., cellulose and xylan) [7,9]. To degrade complex polysaccharides, bacteria have adopted several strategies, including the production of (i) individual enzymes, sometimes associated with non-catalytic accessory domains such as carbohydrate-binding modules (CBM) [8]; (ii) complex proteins with multiple GH domains (i.e., multi-activity GH, MAGH), with or without CBM [10]; and (iii) non-covalent multi-protein complexes called cellulosomes [4]. When released simultaneously, distinct GH domains act synergistically and display overall improved hydrolytic activity compared to single domains. Synergy among GH domains is further achieved by the physical association of catalytic domains into complex proteins with multiple catalytic domains, and in cellulosomes [4,10]. These MAGHs and protein complexes are promising tools for improving biomass processing [4,10,11,12].
A bacterium's potential ability to deconstruct polysaccharides can be predicted from the number and diversity of GH domains in its genome [7,13]. In sequenced bacterial genomes, the presence of GHs is mostly conserved at the genus level [7,9]; therefore, the presence of GH domains in new members of previously identified genera can be easily inferred. Little is known, however, about the conservatism of GH domain organization across bacteria. To address this question, we developed a custom bioinformatic pipeline aimed at identifying and listing the protein architectures (i.e., the domain organization) for GHs targeting cellulose, xylan, and chitin in sequenced bacterial genomes. Next, we analyzed the conservatism of domain organization in MAGHs, and investigated the variability of domain organization in identified polysaccharide degraders (i.e., bacterial lineages associated with GH for polysaccharide degradation). We hypothesized that, across bacterial genomes, the distribution of GH domains and the architecture of proteins with GH domains would correlate. Indeed, several groups of bacteria are systematically identified as polysaccharide degraders [8,14,15] and the distribution of GH domains in sequenced bacterial genomes is phylogenetically conserved at the genus level [7,9]. Thus, one could expect that bacteria from the same genus, with similar GH content, share similar GH organization.
Finally, we specifically investigated the association of GH domains in MAGHs. We expected a high frequency of MAGHs with synergistic domains (i.e., targeting the same substrate) and/or domains targeting physically associated substrates (i.e., cellulose and xylan in plant cell walls). MAGHs with a combination of catalytic domains that target the same substrate and/or physically associated substrates would benefit from identical regulation and expression processes and would increase the synergy between catalytic domains by reducing their diffusion [16], among other benefits. Conversely, we expected that there would be few proteins with GH domains targeting unrelated substrates (e.g., cellulose:chitin or xylan:chitin).
Our systematic investigation of the association and organization of catalytic and accessory domains involved in carbohydrate processing across sequenced bacterial genomes highlights new proteins and new domain architectures, and provides new insights into how bacteria process complex carbohydrates, with implications for biofuel research.

Distribution of GH for cellulose, xylan, and chitin
We searched 11,953 sequenced genomes and identified 40,946 proteins containing 41,196 domains that target cellulose, xylan, or chitin (Additional file 1). First, 25,682 identified proteins were single domain (Table 1) with no accessory domain. Next, 15,047 proteins were multidomain proteins with a unique GH domain (i.e., MDGH) targeting cellulose, xylan, or chitin, associated with other domains (e.g., CBM). Finally, 217 multi-activity proteins (i.e., MAGH) had multiple catalytic domains for cellulose, xylan, or chitin, along with accessory domains.
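The three protein classes described above can be illustrated with a short Python sketch. This is not the authors' pipeline: the function name `classify_architecture`, the simplified family labels (e.g., "GH5", "CBM2"), and the representation of a protein as an ordered list of domain labels are assumptions made here for illustration only.

```python
# Hypothetical, simplified labels for the GH families of interest
# (cellulose, xylan, and chitin) as listed in the GH identification section.
TARGET_GH = {f"GH{n}" for n in (5, 6, 7, 8, 9, 12, 44, 45, 48,  # cellulose
                                10, 11, 30,                      # xylan
                                18, 19, 85)}                     # chitin


def classify_architecture(domains):
    """Classify one protein from its ordered list of domain labels.

    Returns "single-domain" (one target GH, nothing else),
    "MDGH" (one target GH plus other domains), or
    "MAGH" (two or more target GH domains).
    """
    n_target = sum(1 for d in domains if d in TARGET_GH)
    if n_target == 0:
        raise ValueError("protein has no GH domain of interest")
    if n_target >= 2:
        return "MAGH"
    return "single-domain" if len(domains) == 1 else "MDGH"


if __name__ == "__main__":
    for arch in (["GH5"], ["GH5", "CBM2"], ["GH9", "CBM3", "GH48"]):
        print("-".join(arch), "->", classify_architecture(arch))
```

Under these definitions the classes are mutually exclusive, which is consistent with the reported counts: 25,682 + 15,047 + 217 = 40,946 proteins.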
To identify bacteria with a high potential for cellulose, xylan, and chitin processing, we first investigated the average frequency of GH domains for cellulose, xylan, and chitin per genome, at the genus and species levels (Fig. 1).

Conservatism of protein architecture
Excluding unique domains, observed once, we identified 210 types of protein domains associated with the GH domains of interest (Additional file 26: Table S1). First, we identified 18 additional types of GH domains targeting oligosaccharides (e.g., GH2, 3) and other sugars (e.g., mannanase from GH26, galactosidase from GH35). Next, we identified other catalytic domains, including glycosyltransferase (mostly GT2), polysaccharide deacetylase, some lipases and esterases (e.g., GDSL), and a few alpha/beta-hydrolases. We also identified many non-catalytic domains, including 13,573 CBMs from 17 families targeting cellulose (e.g., CBM2, 3), xylan (e.g., CBM35), and chitin (e.g., CBM5_12). Next, we identified 1102 dockerins (i.e., PF00404) and 7 cohesins (i.e., PF00963), both cellulosome components, associated with GHs. Finally, 478 domains of unknown function (DUF), 60 bacterial neuraminidase domains (i.e., BNR, PF02012, PF14873), 727 S-layer homology domains (i.e., SLH, PF00395), 2597 fibronectin domains (i.e., PF00041, PF16893), 1308 lectin domains (e.g., PF14873, PF11721), and 23 Cadherin-like domains (i.e., PF12733, PF00028, PF16184) were identified, among others. With the exception of CBMs and some lectins, most of these domains were not listed in the CAZy database. However, their high frequency in association with GHs for cellulose, xylan, and chitin suggested that these accessory domains could have functional or structural implications in carbohydrate processing. We next tested the conservatism of protein architecture in genera with >3 sequenced genomes by clustering the genomes based on the architecture of proteins with GH domains (accounting for all the accessory domains) and, in a separate analysis, the distribution of GH domains only (Fig. 3; Additional file 27: Table S2). First, in a few genera, including Cellulomonas (n = 5 genomes) and Cytophaga (n = 4 genomes), the clustering based on protein architecture did not correlate with the distribution of GH domains.
Next, in some genera, including Caldicellulosiruptor (n = 10 genomes), the clustering based on protein architecture correlated partially with the distribution of GH domains (P mantel = 0.002, r mantel = 0.55) (Fig. 3a, b). Finally, in many genera the two clusterings were highly consistent (Fig. 3c; Additional file 27: Table S2). For example, in Xanthomonas (n = 131 genomes), the clustering of strains based on proteins targeting cellulose, xylan, chitin, and their accessory domains correlated with the clustering based on the distribution of GH domains (P mantel = 0.001, r mantel = 0.96, Fig. 3c; Additional file 27: Table S2; Additional file 28: Figure S25). Significant correlations were independent of the number of sequenced genomes and unaffected by the number of GHs per genome.

Discussion
Protein domains are defined as "conserved, functionally independent protein sequences that bind or process ligands using a core structural motif" [17]. Although many proteins are known to be multi-domain assemblages [18], most studies of proteins are focused on individual domains and do not consider how interactions between domains might affect the structure and the activity of enzymes. The selection pressure for the domain combination is governed by the structural (see [19] for review) and the functional advantage provided to the organism. Indeed, multi-domain proteins connect complementary domains and activities. Thus, analyzing the architecture of GHs that target cellulose, xylan, and chitin in bacterial genomes allows us to further understand the distribution of GH domains [7,9]. Our HMM-based survey of bacterial genomes reveals the variability and the distribution of GH architecture in well-known degrader genera (e.g., Clostridium, Ruminococcus) and avoids biased interpretation of bacterial carbohydrate processing based on known, or predicted, hydrolytic capabilities. To the best of our knowledge, the cellulases, xylanases, and chitinases identified here outnumber those in currently available databases (Table 1) and uncover the high potential for carbohydrate processing in lineages not included in previous studies (e.g., Actinospica) [1,13]. However, there are a number of caveats associated with our approach to studying the diversity of enzymes involved in carbohydrate processing across bacteria. We recognize that some GH genes we identified as potential cellulases, xylanases, and chitinases may have other enzymatic functions, given that some GHs have side activities (e.g., [20]). In addition, some enzymes identified as cellulases are instead involved in cellulose biosynthesis or in the interaction between microorganisms and plants (e.g., GH8) [9,13,21].
Our analysis of the distribution of GH architectures across sequenced genomes supports the hypothesis that the distribution of GH domains and the potential for carbohydrate processing are taxonomically conserved [7,9]. Many genera (e.g., Stenotrophomonas, Rhizobium, Gluconacetobacter) displayed conserved protein architectures. In these genera, knowing the exact protein architectures provides a way to estimate the GH content and the protein architecture in newly identified strains. However, beyond conserved sets of single-domain proteins, many bacteria display species-specific protein architectures. These unique protein architectures have no effect on the clustering of genomes, and their domain organization cannot be predicted. This highlights the multimodularity of GH and suggests the rapid evolution of closely related organisms regarding their potential to target substrates in the environment. Thus, using our data, it is possible to infer the GH content of taxonomically identified bacteria and complex microbial communities (e.g., metagenomes). Because of the extensive variation even between closely related strains, however, inferring the exact protein architecture will remain a major challenge.
Multi-activity proteins (i.e., MAGHs) mainly correspond to associations of GH domains targeting similar substrates (e.g., cellulase:cellulase). In addition, most MAGHs are homo-GHs (e.g., GH5-GH5). The association of two identical GH domains into MAGHs suggests a duplication-fusion of the catalytic domain, whereas the rare hetero-GHs (e.g., GH5-GH6) result from more complex recombination [32,33]. Thus, bacteria target one substrate at a time and take advantage of the synergistic activity among catalytic domains targeting similar substrates [8].
This allows for precise regulation of each pathway for carbohydrate deconstruction, as observed in a few bacterial lineages (e.g., Streptomyces [34]) and in filamentous fungi [35].
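The distinction between homo-GHs and hetero-GHs, and between same-substrate and mixed-substrate MAGHs, can be made concrete with a minimal Python sketch. The function name `describe_magh` and the simplified family labels are hypothetical; the substrate assignments follow the family lists given in the text.

```python
# Substrate assigned to each GH family of interest, following the text.
# Labels such as "GH5" are simplified, hypothetical identifiers.
SUBSTRATE_BY_GH = {
    **{f"GH{n}": "cellulose" for n in (5, 6, 7, 8, 9, 12, 44, 45, 48)},
    **{f"GH{n}": "xylan" for n in (10, 11, 30)},
    **{f"GH{n}": "chitin" for n in (18, 19, 85)},
}


def describe_magh(gh_domains):
    """Describe a MAGH from its target GH domains (N- to C-terminal order).

    Returns a pair: "homo-GH" or "hetero-GH", and the targeted
    substrates joined by ":" in alphabetical order.
    """
    if len(gh_domains) < 2:
        raise ValueError("a MAGH needs at least two target GH domains")
    kind = "homo-GH" if len(set(gh_domains)) == 1 else "hetero-GH"
    substrates = sorted({SUBSTRATE_BY_GH[d] for d in gh_domains})
    return kind, ":".join(substrates)


if __name__ == "__main__":
    for domains in (["GH5", "GH5"], ["GH5", "GH6"], ["GH10", "GH5"]):
        print("-".join(domains), "->", describe_magh(domains))
```

Note that a hetero-GH such as GH5-GH6 still targets a single substrate (cellulose), so the two descriptors are independent of each other.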

Conclusions
In the environment, microbes (i.e., fungi and bacteria) are essential for the deconstruction of complex carbohydrates (e.g., cellulose) [37]. The increasing number of sequenced genomes, mostly from bacteria, and their consistent annotation [38] provide an unprecedented opportunity to perform large-scale comparative genomics [9,39,40]. Our systematic investigation of sequenced bacterial genomes to identify protein architectures has many potential uses. First, it provides an overview of the spatial organization of catalytic domains (i.e., GHs) and their association with CBMs, as well as other non-catalytic accessory domains involved in carbohydrate binding. Second, our analysis reveals the heterogeneous distribution of GHs in bacteria. Indeed, although GH domains are conserved within bacterial genera [7,9], the complex domain architectures are mostly species-specific. Thus, knowing the phylogenetic distribution and the association between catalytic domains targeting the major carbohydrates, it will be possible to predict the GH content in most bacteria. This will help identify new bacterial isolates with increased potential for carbohydrate processing. However, the GH architecture remains extremely variable and thus cannot be predicted. Finally, listing the GH architectures will serve as a guide for future tests on the taxonomic breadth of domain associations and their spatial organization.

GH identification
Protein sequences from sequenced bacterial genomes were retrieved from the PATRIC database [41] and analyzed using a custom bioinformatic pipeline aimed at identifying proteins involved in cellulose, xylan, and chitin processing. Briefly, bacterial proteins with GH domains targeting cellulose, xylan, and chitin were identified using a custom database of hidden Markov model (HMM) profiles retrieved from PFam-A [6]. Then, selected proteins with GHs for cellulose, xylan, or chitin were analyzed against the entire PFam-A database (as of December 2015) to confirm the GH domains and identify their associated domains (e.g., CBMs). Identified domains with an E-value < 10^-5 and alignment coverage > 60 % of the PFam length were used in subsequent analyses. Substrate specificity of identified GH and CBM domains was derived from biochemically characterized bacterial homologs found in the CAZy database [1,7]: GH 5, 6, 7, 8, 9, 12, 44, 45, and 48 were identified as cellulases; GH 10, 11, and 30 were identified as xylanases; and GH 18, 19, and 85 were identified as chitinases. Some recently identified GH families (e.g., GH74) have no assigned HMM and thus are not included in this study. Sequences of interest can be retrieved directly from the database using the IDs listed in figures and supplementary data (e.g., VBIactrob102134_8073) and the PATRIC portal (https://www.patricbrc.org/portal/portal/patric/Home) [41]. Finally, the complete taxonomy of each individual strain was retrieved from the NCBI taxonomy server (http://www.ncbi.nlm.nih.gov/Taxonomy/).
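The two filtering thresholds described above can be sketched in Python as follows. This is an illustrative sketch, not the custom pipeline itself: the `DomainHit` record, its field names, and the function `passes_thresholds` are hypothetical, and coverage is assumed here to mean the fraction of the HMM profile length spanned by the alignment.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    """Hypothetical record for one HMM domain hit on a protein."""
    protein_id: str
    family: str        # simplified label, e.g. "GH5" or "CBM2"
    evalue: float
    hmm_from: int      # first aligned profile position (1-based, inclusive)
    hmm_to: int        # last aligned profile position (inclusive)
    hmm_length: int    # total length of the HMM profile


def passes_thresholds(hit, max_evalue=1e-5, min_coverage=0.60):
    """Keep hits with E-value < 1e-5 and coverage > 60 % of the profile.

    Assumption: coverage = aligned profile positions / profile length.
    """
    coverage = (hit.hmm_to - hit.hmm_from + 1) / hit.hmm_length
    return hit.evalue < max_evalue and coverage > min_coverage


if __name__ == "__main__":
    hits = [
        DomainHit("p1", "GH5", 1e-20, 1, 250, 270),    # kept
        DomainHit("p2", "GH10", 1e-3, 1, 300, 300),    # E-value too high
        DomainHit("p3", "GH18", 1e-30, 1, 100, 300),   # coverage too low
    ]
    print([h.protein_id for h in hits if passes_thresholds(h)])
```

Both conditions are strict inequalities, matching the "<" and ">" used in the text; a hit with exactly 60 % coverage would be discarded under this reading.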

Statistical analysis
GH distribution and domain organization in sequenced bacterial genomes were analyzed using the Vegan, Stats, and APE packages in the R software environment [42,43]. Bacterial strains were clustered using two distinct approaches. First, genomes were clustered according to the distribution of GH domains per genome, regardless of the protein architecture. Second, we compared the architecture of all identified proteins with GH domains for cellulose, xylan, and chitin, including accessory domains, and then clustered the sequenced genomes as described before. To investigate correlation among clusterings based on the number of sequenced genomes in a particular bacterial lineage or the number of GH domains within a genome, we performed Mantel correlation tests (999 permutations) on the distance matrices used for clustering.
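For readers unfamiliar with the Mantel test, the following pure-Python sketch shows the general idea: the Pearson correlation between the off-diagonal entries of two distance matrices is compared with the correlations obtained after randomly permuting the rows and columns of one matrix. The analysis reported here was run in R; this sketch is only an illustration under common conventions (one-sided test, Pearson correlation, p = (hits + 1) / (permutations + 1)), and the function names and the `seed` parameter are assumptions.

```python
import math
import random


def _upper(matrix, order):
    """Upper-triangle entries of a square matrix after reordering rows/columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def _pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(d1, d2, permutations=999, seed=0):
    """One-sided permutation Mantel test between two distance matrices.

    Returns (r, p): the observed correlation and the fraction of
    permutations (plus the observed one) with correlation >= r.
    """
    identity = list(range(len(d1)))
    x = _upper(d1, identity)
    r_obs = _pearson(x, _upper(d2, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute genomes in the second matrix only
        if _pearson(x, _upper(d2, order)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


if __name__ == "__main__":
    points = [0, 1, 3, 7, 15]  # toy data: five genomes placed on a line
    d1 = [[abs(a - b) for b in points] for a in points]
    d2 = [[2 * abs(a - b) for b in points] for a in points]
    print(mantel(d1, d2))
```

Permuting whole rows and columns together, rather than shuffling individual distances, preserves the dependence structure within each matrix, which is why the test is preferred over an ordinary correlation test on pairwise distances.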