Comparison of the performance of multiple whole-genome sequence-based tools for the identification of Bacillus cereus sensu stricto biovar Thuringiensis

ABSTRACT The Bacillus cereus sensu stricto (s.s.) species comprises strains of biovar Thuringiensis (Bt) known for their bioinsecticidal activity, as well as strains with foodborne pathogenic potential. Bt strains are identified (i) based on the production of insecticidal crystal proteins, also known as Bt toxins, or (ii) based on the presence of cry, cyt, and vip genes, which encode Bt toxins. Multiple bioinformatics tools have been developed for the detection of crystal protein-encoding genes based on whole-genome sequencing (WGS) data. However, the performance of these tools is yet to be evaluated using phenotypic data. Thus, the goal of this study was to assess the performance of four bioinformatics tools for the detection of crystal protein-encoding genes. The accuracy of sequence-based identification of Bt was determined in reference to phenotypic microscope-based screening for the production of crystal proteins. A total of 58 diverse B. cereus sensu lato strains isolated from clinical, food, environmental, and commercial biopesticide products underwent WGS. Isolates were examined for crystal protein production using phase contrast microscopy. Crystal protein-encoding genes were detected using BtToxin_Digger, BTyper3, IDOPS (identification of pesticidal sequences), and Cry_processor. Out of 58 isolates, the phenotypic production of crystal proteins was confirmed for 18 isolates. Specificity and sensitivity of Bt identification based on sequences were 0.85 and 0.94 for BtToxin_Digger, 0.97 and 0.89 for BTyper3, 0.95 and 0.94 for IDOPS, and 0.88 and 1.00 for Cry_processor, respectively. Cry_processor predicted crystal protein production with the highest specificity, and BtToxin_Digger and IDOPS predicted crystal protein production with the highest sensitivity. Three out of four tested bioinformatics tools performed well overall, with IDOPS achieving high sensitivity and specificity (>0.90). IMPORTANCE Strains of Bacillus cereus sensu stricto (s.s.) biovar Thuringiensis (Bt) are used as organic biopesticides. Bt is differentiated from the foodborne pathogen Bacillus cereus s.s. by the production of insecticidal crystal proteins. Thus, reliable genomic identification of biovar Thuringiensis is necessary to ensure food safety and facilitate risk assessment. This study assessed the accuracy of whole-genome sequencing (WGS)-based identification of Bt compared to phenotypic microscopy-based screening for crystal protein production. Multiple bioinformatics tools were compared to assess their performance in predicting crystal protein production. Among them, identification of pesticidal sequences performed best overall at WGS-based Bt identification.

the B. cereus group are foodborne pathogen B. cereus sensu stricto (here referred to as Bc) (1) and entomopathogen B. thuringiensis (here referred to as Bt), the latter of which is commercially available as a biopesticide for application in organic farming (2).
The insecticidal activity of Bt is supported by the production of insecticidal proteins (i.e., Bt toxins), including parasporal crystal proteins such as crystal (Cry) and cytolytic (Cyt) toxins, and non-parasporal proteins such as vegetative insecticidal proteins (Vip) (3,4).Bt toxins act against different insect species, including Lepidoptera, Diptera, Coleoptera, and Hymenoptera (3,5).In 1995, the United States Environmental Protection Agency (US EPA) registered the first Bt biopesticide products for use in the US (6).Currently, over 180 Bt products are registered under 15 different EPA product code numbers, and they all include strains of at least one of the four Bt subspecies (i.e., kurstaki, israelensis, aizawai, and tenebrionis) (6).However, despite the long history of agricultural application of Bt, the concerns over safety of Bt for humans have been raised over the past decade (7)(8)(9)(10)(11).
Bt strains have been reported to encode and produce human enterotoxins that are known to contribute to foodborne illness caused by Bc (8,9,12).These include hemolysin BL (Hbl), non-hemolytic enterotoxin (Nhe), and cytotoxin K (CytK) (13)(14)(15)(16).Given that routine diagnostic assays do not differentiate between Bt and Bc, it is possible that some Bc-associated outbreaks may have been caused by Bt (17).However, direct evidence for Bt causing human illness has not been explicitly established (17,18).
The high genomic similarity of Bc and Bt results in their classification into the same genomospecies, even based on the most conservative classification criteria for genomospecies (13).Moreover, these two species cannot be distinguished by typical culture-based detection methods.The only phenotypic trait that is used for differenti ating Bc and Bt is microscopy-based detection of parasporal crystal proteins (i.e., Bt toxins), which are responsible for the bioinsecticidal properties of Bt strains (19)(20)(21).The above-outlined taxonomic classification shortcomings have been addressed in a new proposed taxonomic framework that considers both genomic and phenotypic informa tion for B. cereus group species identification (22).Within the proposed taxonomic framework, Bc and Bt are represented by one genomospecies, Bacillus cereus sensu stricto.Furthermore, a biovar Thuringiensis has been defined to represent genomes that carry crystal protein-encoding genes (i.e., Bt toxin genes), including cry, cyt, and vip genes (2).Reliable detection of Bt toxin genes is therefore required for accurate genome-based detection of B. cereus s.l.biovar Thuringiensis (Bt).
One of the main challenges in detecting Bt toxin genes is the high variability in their sequences.For example, over 300 variants of cry genes have been identified based on the amino acid sequence similarity (23,24).Thus, multiple whole-genome sequencing (WGS)-based bioinformatics tools have been developed not only to detect known Bt toxin gene sequences but also to predict new variants of these genes.These tools include BtToxin_Digger (25), IDOPS (identification of pesticidal sequences) (26), Cry_processor (27), and BTyper3 (13).Each tool has been developed using a different approach.For example, BtToxin_Digger uses multiple algorithms, including Basic Local Alignment Search Tool (BLAST), hidden Markov models (HMMs), and support vector machine (SVM) method, to comprehensively detect known and potentially novel Bt toxin gene variants.Among these algorithms, BLAST is known to be the most conservative, whereas HMM and SVM model are more likely to produce false positive results and be more suitable for the detection of novel gene variants.IDOPS uses profile HMMs to predict Bt toxin-encoding genes and annotates sequences adjacent to predicted Bt toxin-encoding genes, which can help users interpret the genetic context in which these genes are detected.Cry_processor searches for 3-domain cry gene sequences using profile HMMs, which has the limitation of missing other genes (e.g., cyt and vip).Lastly, BTyper3 uses BLAST to search for crystal protein-encoding genes based on a specific amino acid similarity threshold and hence may not be able to detect novel variants of Bt toxin genes.
Despite the availability of these tools, their performance has not been assessed using phenotypic assays.This study therefore aimed to (i) compare Bt toxin-encoding gene identification tools and (ii) compare their performance against phenotypic data (i.e., microscopy screening for crystal protein production).

RESULTS AND DISCUSSION
We first compared the completeness of hybrid and short-read genome assemblies using the N50 metric and the number of contigs, as determined by QUAST (Fig. 1).N50 values for hybrid assemblies (min = 443,575 bp, max = 5,675,203 bp, mean = 4,245,832.3bp) were significantly higher than N50 for short-read assemblies (min = 22,040 bp, max = 514,228 bp, mean = 113,641.28bp) (t-test P value = 1.98 × 10 −47 ).The number of contigs equal to or longer than 1,000 bp was significantly higher in short-read assemblies (min = 32, max = 607, mean = 235) compared to hybrid assemblies (min = 1, max = 98, mean = 17) (t-test P value = 7.37 × 10 −23 ).N50 values were, on average, 59% shorter in short-read assemblies than hybrid assemblies.Similarly, there were on average 79% fewer contigs in hybrid assemblies than short-read assemblies.Both of these metrics taken together indicate that hybrid assemblies were more complete.Both types of assemblies were further used for the assessment of bioinformatics tools for the detection of Bt toxin-encoding genes to measure the effect of assembly completeness on the performance of individual tools.

IDOPS performed best overall for Bt toxin-encoding gene detection in comparison to phenotypic data
Prior to the application of bioinformatics tools for the detection of Bt toxin-encoding genes, we identified and removed redundant genomes, here defined as genomes that differed by <6 single nucleotide polymorphisms (SNPs).Briefly, using PubMLST's seven-gene multi-locus sequence typing (MLST) scheme for "B.cereus, " 37 different sequence types (STs) were identified with biovar Thuringiensis identified in 10 STs (i.e., ST1085, ST1099, ST1142, ST138, ST15, ST1734, ST325, ST33, ST414, and ST8).Out of the 37 different MLST STs, 6 STs had more than one isolate (i.e., ST1099, ST1424, ST15, ST8, ST24, and ST73); thus, high-quality SNPs were identified within those STs.As a result, a total of 20 clonal genomes were excluded from further analyses to mitigate clonal redundancy bias.This resulted in a total of 58 isolates that were included in the assessment of the performance of four bioinformatics tools (i.e., IDOPS, Cry_pro cessor, BtToxin_Digger, and BTyper3).IDOPS uses profile HMMs (28) to detect pestici dal protein-encoding genes in accordance with the BPPRC (Bacterial Pesticidal Protein Resource Center) nomenclature system (23).Cry_processor was developed based on an HMM-based algorithm and has two different search modes available: domain only (DO) and find domain (FD).DO mode detects the sequences comprising all the three domains in the right order, and FD mode searches novel domains of the matched Bt toxin gene sequences, using the Bt-Toxin nomenclature database (29).Here, we tested both modes to compare the performance of the pipeline and found no differences in results between the two modes.BtToxin_Digger uses three different methods (i.e., BLAST, HMM, and SVM) for the identification of Bt toxin gene sequences using the Bt toxin nomenclature database.Lastly, BTyper3 uses BLAST to detect Bt toxin genes using translated amino acid sequences with conservative detection thresholds of 70% identity and 50% coverage.The production of crystal proteins was confirmed for 18 out of 58 isolates using phase contrast microscopy (Fig. 2).When applied to short-read assemblies, CryProcessor predicted crystal protein production with the highest specificity (0.97).BtToxin_Digger and IDOPS predicted crystal protein production with the highest sensitivity (0.94), whereas BtToxin_Digger had the lowest specificity (0.82).When applied to hybrid assemblies, the same bioinformatics programs showed the highest and lowest specific ities and sensitivities (Table 1).Specifically, CryProcessor predicted protein production with higher specificity (1.00) compared to the other three programs: BTyper3 (0.97), BtToxin_Digger (0.85), and IDOPS (0.95).However, the other three programs showed higher sensitivity than CryProcessor (0.83): BTyper3 (0.88), BtToxin_Digger (0.94), and IDOPS (0.94).Similar to results observed for short-read assemblies, BtToxin_Digger had the lowest specificity when applied to hybrid assemblies (0.82).These results indicate that genome assembly completeness was not a major factor influencing the perform ance of bioinformatics tools for the detection of Bt toxin-encoding genes.Three out of four tested bioinformatics tools (i.e., IDOPS, BTyper3, and Cry_processor) performed well overall, with IDOPS achieving sensitivity and specificity >0.90.BtToxin_Digger was highly sensitive but also nonspecific, as it resulted in the highest number of false positive hits (n = 6).The other three tools produced a maximum of three false positive hits (Table 1).
Even though IDOPS performed best among the assessed bioinformatics tools, it made two false positive predictions (i.e., isolates PS00086 and PS00100).These two isolates were not phylogenetically closely related and were predicted as positive for Bt toxin-encoding genes by IDOPS' profile HMMs 5 (CryM5) and 6 (CryM6).All other IDOPS models (i.e., CryM1 to M4, CytM1, CytM2, and VipM1) correctly predicted the presence of Bt toxin-encoding genes.The CryM5 profile HMM targets Cry proteins with less conserved variants of the classical 3 domains, while CryM6 targets the C-terminal region of Cry pesticidal proteins (26).This could explain why the CryM5 and CryM6 models produced false positive predictions on non-Bt strains.
BtToxin_Digger, which had the lowest specificity, uses a combination of multiple algorithms, including BLAST, HMM, and SVM.We examined whether false positive results were caused by any specific algorithm and found that the majority of false positive calls were attributable to the outcomes of predictive models (i.e., HMM and SVM) (Table 2).HMM had both low sensitivity and specificity (Sn = 0.68, Sp = 0.68), whereas SVM had low sensitivity and high specificity (Sn = 0.00, Sp = 0.95).In contrast to HHM and SVM, we found that BLAST search alone performed very well (Sn = 0.97, Sp = 0.95) and produced BtToxin_Digger results that were comparable to the other three programs.Although the HMM and SVM model implemented in BtToxin_Digger produce more false positive results compared to the BLAST approach, they may be more useful for the discovery of novel insecticidal protein-encoding genes compared to BLAST or other bioinformatics tools tested here.

Phylogenetic distribution of biovar Thuringiensis strains within the B. cereus group
Overall, 12 out of 18 biovar Thuringiensis strains were classified into adjusted eight-group panC phylogenetic group IV, 2 into group II, 1 into group V, and 3 into group VI (Fig. 2).The maximum-likelihood phylogenetic analysis showed that the biovar Thuringien sis strains (as defined by the detection of Bt toxin-encoding genes by at least three bioinformatics tools or the microscopic screening) were found across different line ages of phylogenetic group IV (corresponding to B. cereus s.s.) and were intermixed with non-Thuringiensis strains (Fig. 2).Notably, most of the biovar Thuringiensis strains included in this study were not closely related to the B. thuringinesis type strain (ATCC 10792) nor the B. cereus s.s.type strain (ATCC 14579).An average SNP difference relative to the B. thuringiensis type strain was 44,765 SNPs (min = 7,717 SNPs, max = 76,229 SNPs).Compared to the B. cereus s.s.type strain, tested strains differed by an average of 37,748.94SNPs (min = 6,808 SNPs, max = 62,696 SNPs).This demonstrates that phylogenetic analysis alone is not sufficient for the identification of strains belonging to biovar Thuringiensis.This finding agrees with previous studies and suggests that Bt toxin genes (e.g., cry, cyt, and vip) likely disseminated among B. cereus group strains through horizontal gene transfer (16).
Bt toxin genes are typically found on plasmids (30,31), although they have also been detected within the chromosome of the B. thuringiensis HER1410 strain (32).We found Bt toxin-encoding genes on non-chromosomal contigs in 17 out of 18 hybrid assemblies of strains belonging to biovar Thuringiensis.Only one isolate (PS00095) carried the cry27Aa1 gene on a chromosomal contig.This gene shared 75% identity and 100% coverage with the reference sequence in the B. thuringiensis delta-endotoxin nomenclature database.A transfer and chromosome integration of a plasmid-borne gene sequence may have been facilitated by a transposase; however, no transposable elements were detected adjacent to cry27Aa1.
Overall, this work demonstrates that sequence-based detection of cry, cyt, and vip can serve as an accurate method for the detection of B. cereus biovar Thuringiensis, which can aid in the identification of this biovar, as defined in the taxonomic nomenclature proposed by Carroll et al. (13) or a variation of it agreed upon by the stakeholders.

Isolates included in the study
A total of 78 B. cereus group strains available in the Kovac lab culture collection, some of which have been reported previously (15,33), were included in the study based on the following criteria: (i) an isolate was classified as B. cereus s.s.(phylogenetic group IV) using BTyper 3 (v.3.2.0) (n = 72) or (ii) an isolate belonged to any of the B. cereus s.l.phylogenetic groups and had cry, cyt, and/or vip genes detected in its draft genome using BTyper3 (n = 6) (v.3.2.0) (13).Isolate information, including year of isolation, origin of isolation, and NCBI accession numbers are listed in the Supplemental Material (Table S1).

Illumina whole-genome sequencing and sequence quality control
Total genomic DNA was extracted from isolates using E.Z.N.A Bacterial DNA extraction kit (Omega Bio-Tek, Georgia, USA) by following the manufacturer's instructions.Extracted DNA was examined for quality and quantity using Nanodrop One (Thermo Fisher Scientific, Massachusetts, USA) and Qubit 3 (Thermo Fisher Scientific, Massachusetts, USA), respectively.DNA libraries were prepared using a NexteraXT library preparation kit (Illumina, California, USA).Samples were sequenced using an Illumina NextSeq with 150-bp paired-end reads.Additionally, genomes of 15 isolates with publicly available WGS reads were obtained from the NCBI Sequence Read Archive (SRA; NCBI BioProject accessions PRJNA437714 and PRJNA288462) (15,33).Illumina adapters and low-quality reads were trimmed using Trimmomatic (v.0.39) with a sliding window size of 4 and quality cutoff value of 15 (SLIDINGWINDOW:4:15) (34).Reads shorter than 36 bp were excluded (MINLEN:36).Trimmed reads qualities were assessed using FastQC (v.0.11.9) (35).

Nanopore whole-genome sequencing and sequence quality control
DNA was extracted using the QIAamp DNA blood mini kit (Qiagen, Hilden, Germany) by following the manufacturer's instructions with additional steps for cell lysis.Briefly, 3 loopfuls of biomass grown at 30°C on brain-heart infusion agar (BD Biosciences, New Jersey, USA) was collected and mixed with a phosphatebuffered saline (PBS) buffer (137 mM NaCl, 1.7 mM KCl, 10 mM Na 2 HPO 4 , 1.8 mM KH 2 PO 4 ).One gram of 0.1-mm zirconia/silica beads (Biospec Products, Oklahoma, USA) was added to tubes, which were then vortexed in a horizontal position to disrupt cell walls.The Oxford Nanopore Technologies (ONT) RBK-004 rapid barcoding kit was used for the library preparation by following the manufacturer's instructions, and four libraries were pooled for sequencing (ONT, Oxford, UK).An R9 flow cell (FLO-MIN106; ONT, Oxford, UK) was used for sequenc ing using a MinION Mk1C device (ONT).Raw sequencing signal files were used for high-accuracy basecalling and adapter trimming using Guppy (v.6.0.1).Low-quality reads were trimmed using FiltLong (v.0.2.1) (36), and the quality of trimmed reads was assessed using FastQC (v.0.11.9).
To remove redundant and clonal isolates, high-quality SNPs were identified using the FDA CFSAN SNP pipeline with default setting (v.2.2.1) (46).This analysis was conducted separately for isolates belonging to each individual MLST sequence type that contained two or more isolates.MLST STs were assigned using BTyper 3 (v.3.2.0) and PubMLST's seven-gene MLST scheme for "B.cereus" (47).Pairwise SNP differences were calculated within each ST, and for each group of clonal genomes (i.e., isolates that differed by <6 whole-genome SNPs), one high-quality representative genome was selected based on N50 values and number of contigs.The same SNP pipeline was applied to all identified biovar Thuringiensis genomes by using two type strains as references (B.thuringiensis type strain ATCC 10792 and B. cereus type strain ATCC 14579) to calculate average SNP differences between biovar Thuringiensis strains and the two type strains.

Microscopy-based screening of isolates for the production of insecticidal crystal proteins
Isolates were grown on T3 agar for 3 days at 30°C to promote sporulation and crystal protein production.After completed incubation, one colony was resuspended in 10 µL of PBS on a microscope slide and covered with a cover slip (VWR international, Pennsyl vania, USA) and screened within 10 minutes for the presence of crystal proteins using a phase-contrast microscope (Olympus BX51, Olympus, Tokyo, Japan).Crystal proteins were differentiated by shape (bipyramidal, cuboidal, and circular), size (smaller than a vegetative cell and endospore of Bacillus), and color (darker than Bacillus endo spore).Initial screening results were classified as "positive, " "negative, " or "inconclusive." Screening was repeated three times by two individuals to assess the production of crystal proteins.

Statistical analysis
All statistical analyses in the study were conducted using R (v4.2.1) (49).A paired t-test was employed to compare the average N50 and the number of contigs between short-read assembly and hybrid assembly (50).The sensitivity, specificity, positive predictive value, and negative predictive value were calculated to assess the perform ance of four different bioinformatics tools for the detection of Bt toxin-encoding genes (51).Genome assemblies produced using short reads as well as hybrid assemblies were used.The results of microscopy screening were used to validate the sequence-based bioinformatic prediction of B. cereus biovar Thuringiensis.

ADDITIONAL FILES
The following material is available online.

FIG 1
FIG1 Distribution of contig counts in hybrid and short-read assemblies.(A) N50 and (B) number of total contigs (≥1,000 bp).

TABLE 1
Sensitivity, specificity, and positive and negative predictive values for predicting crystal protein production using four bioinformatics tools a PPV, positive predictive value.b NPV, negative predictive value.c IDOPS, identification of pesticidal sequences.

TABLE 2
Sensitivity, specificity, and positive and negative predictive values for predicting crystal protein production using different BtToxin_Digger algorithms (i.e., BLAST, HMM, and SVM model) applied on hybrid assemblies a BLASTP, Basic Local Alignment Search Tool for protein sequences.b PPV, positive predictive value; NPV, negative predictive value.

TABLE 3
Description of bioinformatics tools for prediction of Bt toxin-encoding genes a BLASTP, Basic Local Alignment Search Tool for protein sequences.