Large‐scale analysis of the genome of the rare alkaline‐halophilic Stachybotrys microspora reveals 46 cellulase genes

Fungi are of great importance in biotechnology, for example in the production of enzymes and metabolites. The main goal of this study was to obtain a high‐coverage draft of the Stachybotrys microspora genome and to annotate and analyze the genome sequence data. The rare fungus S. microspora N1 strain is distinguished by its ability to grow in an alkaline halophilic environment and to efficiently secrete cellulolytic enzymes. Here we report the draft genome sequence composed of 3715 contigs, a genome size of 35 343 854 bp, with a GC content of 53.31% and a coverage around 20.5×. The identification of cellulolytic genes and of their corresponding functions was carried out through analysis and annotation of the whole genome sequence. Forty‐six cellulases were identified using the fungicompanion bioinformatic tool. Interestingly, an S. microspora endoglucanase selected from those with a low isoelectric point was predicted to have a halophilic profile and share significant homology with a well‐known bacterial halophilic cellulase. These results confirm previous biochemical studies revealing a halophilic character, which is a very rare feature among fungal cellulases. All these properties suggest that cellulases of S. microspora may have potential for use in the biofuel, textile, and detergent industries.

Fungi are of great importance in biotechnology, for example in the production of enzymes and metabolites. The main goal of this study was to obtain a high-coverage draft of the Stachybotrys microspora genome and to annotate and analyze the genome sequence data. The rare fungus S. microspora N1 strain is distinguished by its ability to grow in an alkaline halophilic environment and to efficiently secrete cellulolytic enzymes. Here we report the draft genome sequence composed of 3715 contigs, a genome size of 35 343 854 bp, with a GC content of 53.31% and a coverage around 20.59. The identification of cellulolytic genes and of their corresponding functions was carried out through analysis and annotation of the whole genome sequence. Forty-six cellulases were identified using the fungicompanion bioinformatic tool. Interestingly, an S. microspora endoglucanase selected from those with a low isoelectric point was predicted to have a halophilic profile and share significant homology with a well-known bacterial halophilic cellulase. These results confirm previous biochemical studies revealing a halophilic character, which is a very rare feature among fungal cellulases. All these properties suggest that cellulases of S. microspora may have potential for use in the biofuel, textile, and detergent industries.
Since the completion of the first human genome sequence, the demand for cheaper and faster sequencing methods has increased dramatically. This demand has led to the development of second-generation or next-generation sequencing (NGS) sequencing methods. The NGS platform performs massively parallel sequencing (sequencing millions of DNA fragments from a single sample) that facilitates high-throughput sequencing, allowing the entire genome to be sequenced in less than a day [1]. The creation of the NGS platform allows laboratories to access sequences and rapidly increases the number of research results in multiple fields such as fungal biology research.
Fungi have immeasurable impacts on ecosystems, and are so important in ecology, agriculture, medicine, and biotechnology that they affect many aspects of society. They are used as suitable organisms for basic research because of the typical haplobion life cycle that facilitates the study of the phenotype of mutations and the ability of most cells to differentiate throughout the organism. In addition, many fungi are easy to culture and are useful for microbiological, genetic, and molecular techniques. Therefore, fungi were one of the first model organisms for the study of genetics, biochemistry, cells, and developmental biology. Therefore, it is not surprising that the first eukaryote for which the complete genomic sequence was obtained was the fungus Saccharomyces cerevisiae [1].
Fungi are also very important in biotechnology and in the manufacture of drugs and enzymes. The first antibiotic Alexander Fleming isolated was derived from a fungal strain of Penicillium species. Today, the genomes of over 1000 filamentous fungi have been sequenced. NGS technology has revolutionized genomic resequencing. In strain comparisons, gene mapping, or transcriptome and ChIP analysis, the de novo assembly of the eukaryotic genome still represents a significant hurdle due to its size and range of repeats [2]. Filamentous fungi are good candidates because the 30-90 Mb genome contains few repetitive regions and is expected to assemble from short sequence reads much easier than mammals and higher plants [3]. We are working on a rare, locally isolated fungal strain belonging to the Stachybotrys microspora. Stachybotrys is related to the genus Memnoniella, most Stachybotrys species inhabit cellulose-rich substances [4]. This genus is widely distributed and contains about 50 species [5]. Interestingly, our strain is characterized by growing in a cellulosic-based medium over a wide pH range of 4-9. This is rarely reported in most known fungal species. In addition to its ability to grow at alkaline pH, S. microspora produces neutral and alkaline endoglucanases [6] and secretes several b-glucosidases [7][8][9][10][11][12][13], whereas fungi generally have one or two b-glucosidase. It has been shown that at least six b-glucosidases are produced experimentally [7][8][9][10][11][12][13], but the number of genes is probably much higher (see below in the Results section). One of these b-glucosidases, called BglG, although produced constitutively on 1% glucose, is inhibited by glucose, retaining only 33% at 10 mM glucose while stimulated to 120% by the same xylose concentration [14]. In this regard, it should be noted that one of the major challenges in the bioconversion of lignocellulosic biomass to liquid biofuels involves the search for sugar-tolerant b-glucosidases, as glucose inhibits b-glucosidase activities. b-Glucosidase is one of the key enzymatic components of cellulase that converts cellobiose into glucose to complete the final stages of cellulose hydrolysis. This removes the inhibition exerted by cellobiose on cellobiohydrolase, the enzyme that initiates the attack on cellulosic substances. Therefore, b-glucosidase plays a very important role in the enzymatic production of bioethanol from biomass [14][15][16][17].
To take advantage of the potential of this strain, we conducted a molecular study of the cellulolytic genes to investigate the molecular expression profile of the corresponding gene. We succeeded in isolating three bglucosidase genes from the glycoside hydrolases 1 and 3 families, confirming the presence of b-glucosidase, and showed different expression patterns [9,10]. In addition to secreting cellulase, the fungus produces xylanase, protease, chitinase, b-glucanase, and pectinase, some of which share the same regulatory mechanism to induce or suppress their respective activities [18]. In addition to its richness in b-glucosidases, our fungus is distinguished by the very specific properties of its endoglucanases. Indeed, we have shown that it secretes two enzymes, called EG1 and EG2, which are very interesting for their high salt tolerance [19,20]. Indeed, EG1 is the most halophilic fungal endoglucanase ever studied, with an optimal activity at 5.6 M NaCl and resists SDS by retaining more than 70% of its activity at 10% SDS [19]. Moreover, this enzyme is active in the presence of ion liquids [20].
For all these interesting data, sequencing the entire genome of this rare fungus is crucial for advancing knowledge of its genomic properties. We analyzed the genome sequence of this strain and identified 46 cellulase genes. The halophilic profile was studied through the physicochemical parameters of one endoglucanase predicted to be halophile and through its molecular modeling.
Genomic DNA was extracted as follows: 1 g of frozen mycelium was ground into fine powder with alumina and mixed with 5 mL of extraction buffer [10 mM Tris-HCl (pH 8.0), 50 mM EDTA and 0.5% SDS]. After two extractions with an equal volume of phenol equilibrated with Tris-HCl, the DNA was precipitated overnight with 0.1 times the amount of 3 M sodium acetate (pH 5.2) and 2.5 times the amount of ethanol [22]. DNA quality and concentration were determined by horizontal gel electrophoresis on a 0.8% agarose gel in 50 mM Tris-Acetate-EDTA buffer, pH 8.

Illumina MiSeq analysis and assembly
Sequencing libraries were prepared using the Nextera XT DNA Library Prep Kit (Illumina, San Diago, CA, USA) and genomic sequencing was performed using the MiSeq Sequencer (Illumina) and Reagent Kit v3 (Illumina). Unless otherwise stated, default values were used throughout the software.
Run duration is based on the number of cycles performed. We performed a paired-end run up to 2 9 301 sequencing cycles plus any Index Reads with MCS v2.3. De novo assembly of the raw data was performed using SPADES v3.11.1 [23]. The assembly uses the default settings and adds an option to modify the reading.
N50: statistic defines assembly quality in terms of contiguity. Given a set of contigs, the N50 is defined as the sequence length of the shortest contig at 50% of the total assembly length. It can be thought of as the point of half of the mass of the distribution; the number of bases from all contigs longer than the N50 will be close to the number of bases from all contigs shorter than the N50. N50 can be described as a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value. L50: Given a set of contigs, each with its own length, the L50 is defined as the count of the smallest number of contigs whose length sum makes up half of the genome size.

Phylogenetic analysis of marine fungi
In order to investigate the halophilic feature of produced cellulases from S. microspora, the entire 18S ribosomal sequences from marine fungi were aligned using the multiple sequence alignment tool of the MUSCLE algorithm, integrated in the MEGA X package, and subjected to cluster analysis by the distance with the neighbor-joining method using MEGA X software (Molecular Evolutionary Genetics Analysis, https://www.megasoftware.net/).

Homology modeling and docking
The secondary structure of the protein sequence deduced from the studied S. microspora endoglucanase was modeled by homology with a Swiss-model server [24] using the structure of a tri-modular halophilic bacterial cellulase with a family 46 carbohydrate binding module (PDB: 5e09.1.A) as a template [25]. The built model was displayed in the software SWISS PDB VIEWER v4.01 [26]. A ProSA server was used to evaluate the parameters and predict the quality of the modeled structure [27].
The Molecular Operating Environment MOE 2019 (MOE) software was used for molecular mechanics optimization, Protein Property Descriptors, and structure visualization. Within MOE, the 3D structure of the halophilic and nonhalophilic endoglucanase was repaired for missing atoms, charges were assigned according to the OPLS-AA force field, and molecular mechanics optimization until the gradient of 0.01 kcalÁ( A mol) À1 was reached. Its properties were then calculated across a range of pH values pH = [4,10] and ionic strengths 'NaCl' I = 0.1 M with temperature T = 310 K, Viscosity g = 0.89 (cP), and Dielectric e r = 78. PROTPARAM (https://web.expasy.org/protparam/) was used as a tool to enable the calculation of various physical and chemical parameters of the predicted GH5 endoglucanase. Calculated parameters include molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index, and overall mean hydropathy (GRAVY).
Molecular docking was performed using the SwissDock server (http://www.swissdock.ch/). SwissDock is based on the docking software EADOCK DSS, whose algorithm consists of many steps [28,29]. The target molecule was provided as a PDB file. Binding mode was evaluated and clustered using FullFitness. The clusters were then ranked based on the perfect fit of those elements. In subsequent cycles, the structure with the lowest "Full Fitness" and the estimated DG value was selected and the adjacent docked ligand structures were collected as representatives. In other words, of the many binding modes generated, only the minimal energy conformational state of the ligand-binding protein complex was considered. Binding mode results were visualized and analyzed by Swissdock using UCSF CHIMERA 1.14 [30].

Results and Discussion
Genome sequencing and assembly The S. microspora genome was sequenced using the next-generation Illumina-MiSeq sequencer, and genome assembly was performed using the SPADES program.
The raw data included 5 374 208 paired reads, with an exact length of 301 bp, that were used for assembly. Gene prediction was performed using AUGUSTUS v2.0 [31], and the final annotation included 10 830 protein-coding genes. The genome sequence had 35 343 854 bp, with a GC content of 53.31% and a 20.5-fold coverage. The total number of contigs was 3715, the values of N50, N75, L50, and L75 were 15.961, 7.477, 644, 1463, respectively.

Functional annotation with the Omicsbox platform
BLAST hits were imported into OmicsBox to perform Gene Ontology mapping and annotation. InterPro protein signatures and domain hits were obtained using INTERPROSCAN. The output was then imported into OmicsBox and merged with the GO annotation and mapping results (Fig. 1).
In fact, the full Gene Ontology annotation workflow used cloud-powered algorithms (BLAST, INTERPROSCAN, GO MAPPING) to get the most complete annotation labels. Detailed statistics of every step were generated to summarize the results. Combined graphs for the three categories (Biological Process, Molecular Function, and Cellular Component) were generated including the annotations from INTERPROSCAN results. To give an idea regarding the OmicsBox workflow, an example of the output after performing annotation and a sample results analysis of family 3 b-glucosidase (Figs S1-S3).
A total of 11 000 sequences included in a FASTA file as an input were blasted and only 8500 proteins were annotated with a given function. Moreover, the analysis of species distribution for the highest BLAST hits indicated a high homology with the closely related fungi Stachybotrys chlorohalonata and Stachybotrys chartarum (approximately 100% similarity for some cellulases genes like GH1 b-glucosidases and one GH5 endoglucanase) but also with many species from the genus Fusarium (Fig. 2).
The study of the enzyme distribution in the Omicsbox platform gave that most of the annotated proteins are enzymes belonging to hydrolases, transferases, oxidoreductases, translocases, ligases, lyases, and isomerases (Fig. 3), with a high level of hydrolases in comparison with other classes of enzymes (Fig. 3). In fact, hydrolases englobe classical enzymes such as proteases and lipases but also glycoside hydrolases, a widely distributed group of carbohydrate active enzymes (CAZy) that hydrolyze glycosidic bonds into glycosides, glycans, and glycol-conjugates [32].

Functional annotation with the fungicompanion web server
To address the demand for quick, automatically generated genome annotations, we used the fungicompanion web server (http://fungicompanion.gla.ac.uk/). This server allows uploading the target assemblies and select a closely related reference species to guide the annotation. The resulting annotation using the fungicompanion web server is summarized in Table 1. Unfortunately, the fungicompanion server has its own limitations in the choice of the reference genome. In the absence of S. chlorohalonata and S. chartarum in the server, we were forced to use only Aspergillus fumigatus and Fusarium graminarium, as they are the closest species to our strain among all the options offered by the program.
Moreover, we were also forced to reduce the total number of nodes, as the annotation under this server  3. Distribution of enzymes classes: oxidoreductases, transferases, hydrolases, lyases, isomerases, ligases, and translocases. One thousand and seven hundred sequences were annotated to be hydrolases. is limited to 3000 nodes. A comparative table regarding the annotation of cellulase genes between the Omicsbox platform and the fungicompanion web server ( Table 2) indicates some differences between these tools precisely in the number of predicted proteins, which is higher with Omicsbox (around 10 830) than with the fungicompanion web server (5225 and 5931 using as references the genomes of A. fumigatus and F. graminarium, respectively).

Screening of cellulases genes
The predicted cellulases (b-glucosidases, exoglucanases, and endoglucanases), using fungicompanion and the Omicsbox platform, are presented in Tables 2 and 3, with a total of 46 cellulases for the fungicompanion and 43 for the Omicsbox platform. Moreover, the annotation of the S. microspora genome using the dbCAN meta server with Diamond, Hmmer, and Hotpep programs gave comparable results (data not shown). An inventory of cellulases enzymes was performed to compare the richness of the GH families between S. microspora and the most related fungi according to the genomic analyses: Fusarium oxysporum, S. chlorohalonata, S. chartarum, and Trichoderma harzianum ( Table 3). The results show that Fusarium oxysporium possesses more cellulase enzymes than the other studied fungi and shares the same number of endoglucanases with S. microspora. However, a higher number of cellobiohydrolases are found in S. chlorohalonata and S. chartarum.
This table shows that T. harzianum and S. microspora share the same high number of bglucosidases, which it is not the case for Trichoderma reesei, which shows only 7 b-glucosidases from the GH3 glycoside family [33]. This result confirms the richness of our strain concerning the production of bglucosidases and endoglucanases (only two cellobiohydrolases and six endoglucanases for T. reesei against 10 cellobiohydrolases and 17 or 18 endoglucanases (depending on the platform used) secreted by S. microspora), which is considered a very interesting characteristic for further industrial use.
It should be noted that the number of the predicted endoglucanases and b-glucosidases differ, depending on the annotation tool used, but the predicted exoglucanases (cellobiohydrolases) number is the same whether it was predicted using the fungicompanion tool or the Omicsbox platform.

Phylogenetic analysis
Biochemical studies on S. microspora have shown a halophilic character of secreted endoglucanases [19,20]. Moreover, some strains form Stachybotrys genera are considered a marine fungus like S. chartarum according to the World Register of Marine Species (https:// fr.wikipedia.org/wiki/Stachybotrys#WRMS), S. longispora [33], and Stachybotrys sp. Strain MF347 [34]. In fact, having a marine origin is concordant with somehow the halophilic feature for some fungal secreted enzymes. Therefore, we performed a multiple alignment of 18S ribosomal sequences (Fig. S4) and constructed a phylogenetic tree for some known marine fungi with S. chartarum and S. chlorohalonata compared to S. microspora (Fig. S4). We note that the most related fungi to S. microspora is S. chartarum, sharing the same clade in the tree. In this context, the purification of the EG1 led us to reflect on the halophilic character of our strain, since this enzyme has the strongest resistance to NaCl ever described for a cellulase, i.e., 5.6 M [20]. In fact, it must be said that although EG1 served as the instigator of our research, this protein has not yet been identified as a protein sequence or as a gene since we have not yet succeeded in determining its N-terminal sequence, despite several attempts. We think it would be blocked and we will try to use the LC-MS/MS to approach it. All we know about EG1 is its size, 55 kDa, its extreme resistance to high concentrations of NaCl, and its other biochemical properties reported in our article [20]. In the meantime, we tried in this study to see if there were candidate sequences with halophilic profiles. So, for now, the model is chosen only on the basis of size and halophilia.
The analysis of known halophilic enzymes has shown that such enzymes are endowed with an isoelectric point pI value between 4 and 5 and a higher number of acidic amino acids residues compared to other basic ones [35]. More important, the structure prediction of such enzymes indicates that the acidic amino acids (negatively charged amino acids) are predominantly present on the protein surface, giving a tertiary structure entirely colored in red [36]. In this context, we compiled data on annotated S. microspora endoglucanases regarding their pI, size, and ratio of negatively charged versus positively charged amino acids (Table 4). Table 4 reveals that seq2, 10, 12, 13 and many others show the same range of pI (around 4 and 4.5) and electronegativity; but we chose the seq10 for the following reasons. (a) seq10 showed the highest value of total solvent accessibility around 23 590 compared to seq12 (13 842.7) and seq13 (16 620.3); (b) the area of protein patches negative for seq10 is higher than those for seq12 and seq13 (Table 6). That said, seq13 is closer to seq10 and Bacillus sp. BG as seq12; (c) we did not analyze seq2 and other endoglucanases (having pI in the same range as seq10) due to their small sizes, considering the large size of EG1.
Therefore, seq10 was chosen for further analyses. Indeed, seq10 is a GH5 endoglucanase that presented a molecular weight of 59 kDa and a theoretical pI of 4.23; its predicted physicochemical properties give a halophilic profile with a total number of negatively charged residues (Asp + Glu) of 73 and 26 positively charged residues (Arg + Lys). Moreover, the electrostatic potential analyses indicate that almost the entire solvent-accessible surface of the studied endoglucanase is negatively charged, which is consistent with the halophilic nature of this enzyme (Fig. 4A). The instability index (II) was computed to be 31.76 and classifies the protein as stable.

Homology modeling of the predicted halophilic endoglucanase
More interestingly, the seq10-GH5 endoglucanase from S. microspora showed 33.60% sequence identity with a tri-modular halophilic cellulase from Bacillus sp. BG displaying a Carbohydrate-Binding Module from the Family 46 (CBM46), a typical (b/a)8 TIM barrel catalytic domain and a CBM_X domain (PDB: 5e09.1.A), as determined by Zhang et al. [25].
The parameters and prediction quality of the modeled structure were evaluated using the ProSA server [27]. The overall model quality gave a Z score of À7.65. The determined scores are within the range of native proteins of similar size (data not shown). The catalytic domain shows the features of the GH5 family, with a typical (b/a)8 TIM barrel, while the CBM46 domain has a sandwich-like structure. The catalytic and the CBM46 domains form an extended substrate-binding cleft, within which several tryptophan residues are well exposed. Such residues are proven to be important for substrate binding. Also, almost the entire surface of CelB is negatively charged, which is a signature of halophilic proteins [25]. Table 5 indicates also the differences regarding the physicochemical properties between the endoglucanase from S. microspora, the template from Bacillus sp. BG «5e09.1», and the predicted nonhalophile endoglucanase from S. microspora.
We recall that proteins, and more generally solutes, are often charged in solution. Knowledge of their electro-hydrodynamic properties is of utmost importance in the design of pharmacological formulations and systems within which these are to be manipulated. When these charged objects are suspended in a buffer solution of a given pH and ionic strength I, there is a cloud of ions that forms around them, which dampens the perceived electrostatic field and gives the object an apparent charge that will dictate how it reacts to external electrostatic fields [37].
The properties structure of the halophilic and nonhalophilic endoglucanase (Fig. 4C) Table 6.  At pH 7, the analysis of the structures surface of the three proteins (nonhalophilic, halophilic, and reference) share that the ratio (area of ionic protein patch (es)/accessible surface area) is almost similar (11%, 14%, and 15%, respectively). However, we note that the area of the negative protein patch is much higher for halophilic endoglucanase and the reference than the nonhalophilic endoglucanase (2779 and 2930 vs. 812 A 2 , respectively).
We also note that the ratio (area of negative/ionic protein patch), which translates the percentage of the negative charge compared with the total charged surface is dominant for the halophilic endoglucanase (84% and 92%) than the nonhalophytic endoglucanase (57%). The analysis of the area of the negative and positive protein patch with a range of pH values from 4 to 10 is shown in Fig. 6. We note that the area of the negative protein patch remains much higher than the area of the positive protein patch for the halophilic endoglucanase when the pH varies from 4 to 10, while we can have an equality of the area of positive and negative for the nonhalophilic endoglucanase when the pH is between 5.5 and 6. These results are in line with the fact that halophilic endoglucanases are more negatively charged.

Docking of the predicted halophilic endoglucanase with its substrate
Moreover, docking studies can provide important information and are a very useful tool for understanding the prevailing binding modes between proteins and ligands. In our work, we performed blind docking of CT3 (b-cellotriose, a small substrate used to study endoglucanase activity) using the SwissDock webbased program in order to investigate the possible modes of interaction. Figure S5 shows that the best binding solutions are located in the same regions with values of FullFitness ranging from À2178.56 to À173.97 kcalÁmol À1 and with an estimated DG of À7.2.

Conclusion
Previous biochemical studies carried out in our laboratory have shown the production of cell wall degrading enzymes from S. microspora. In fact, our strain is characterized by growing in a cellulosicbased medium over a wide pH range of 4-9, a very rare feature reported in most known fungal species. In addition to its ability to grow at alkaline pH, S. microspora produces neutral/alkaline and halophilic endoglucanases and secretes several b-glucosidases. Sequencing the entire genome of this rare fungus is crucial for advancing knowledge of its genomic properties. The genome sequence of S. microspora consists of 3715 contigs with a genome size of 35 343 854 bp and a GC content of 53.31%. Two bioinformatic tools, Omicsbox and the fungicompanion web server, were used to identify 46 cellulases, confirming the abundance of cellulolytic enzymes. An endoglucanase, named seq10, was predicted to exhibit a halophilic profile, compared to other annotated and analyzed endoglucanases and was chosen for deeper analyses. It was shown to share strong structural resemblance with a known bacterial halophilic enzyme. These findings corroborate prior biochemical research on S. microspora halophilic endoglucanases, which is a very unusual feature for known fungal cellulases. All of these characteristics make S. microspora ideal for a wide range of applications, including biofuel, textiles, and detergents.

Supporting information
Additional supporting information may be found online in the Supporting Information section at the end of the article. Fig. S1. A flow diagram of the steps for functional annotation with the OmicsBox platform. Fig. S2. Sample results after functional annotation under OmicsBox platform. Fig. S3. A sample results analysis of family 3 b-glucosidase under OmicsBox platform. Fig. S4. Phylogenetic tree for some known marine fungi with S. chartarum and S. chlorohalonata compared to S. microspora using MEGA X package. Fig. S5. Visualization of the predicted GH5 endoglucanase from S. microspora with different modes of binding using b-cellotriose (CT3) as a ligand.