Proposal of a Taxonomic Nomenclature for the Bacillus cereus Group Which Reconciles Genomic Definitions of Bacterial Species with Clinical and Industrial Phenotypes.

The Bacillus cereus group comprises numerous closely related species, including bioterrorism agent B. anthracis, foodborne pathogen B. cereus, and biopesticide B. thuringiensis Differentiating organisms capable of causing illness or death from those used in industry is essential for risk assessment and outbreak preparedness. However, current species definitions facilitate species-phenotype incongruences, particularly when horizontally acquired genes are responsible for a phenotype. Using all publicly available B. cereus group genomes (n = 2,231), we show that current species definitions lead to overlapping genomospecies clusters, in which 66.2% of genomes belong to multiple genomospecies at a conventional 95 average nucleotide identity (ANI) genomospecies threshold. A genomospecies threshold of ≈92.5 ANI is shown to reflect a natural gap in genome similarity for the B. cereus group, and medoid genomes identified at this threshold are shown to yield resolvable genomospecies clusters with minimal overlap (six of 2,231 genomes assigned to multiple genomospecies; 0.269%). We thus propose a nomenclatural framework for the B. cereus group which accounts for (i) genomospecies using resolvable genomospecies clusters obtained at ≈92.5 ANI, (ii) established lineages of medical importance using a formal collection of subspecies names, and (iii) heterogeneity of clinically and industrially important phenotypes using a formalized and extended collection of biovar terms. We anticipate that the proposed nomenclature will remain interpretable to clinicians, without sacrificing genomic species definitions, which can in turn aid in pathogen surveillance; early detection of emerging, high-risk genotypes; and outbreak preparedness.IMPORTANCE Historical species definitions for many prokaryotes, including pathogens, have relied on phenotypic characteristics that are inconsistent with genome evolution. This scenario forces microbiologists and clinicians to face a tradeoff between taxonomic rigor and clinical interpretability. Using the Bacillus cereus group as a model, a conceptual framework for the taxonomic delineation of prokaryotes which reconciles genomic definitions of species with clinically and industrially relevant phenotypes is presented. The nomenclatural framework outlined here serves as a model for genomics-based bacterial taxonomy that moves beyond arbitrarily set genomospecies thresholds while maintaining congruence with phenotypes and historically important species names.

B. mycoides and B. weihenstephanensis, as has been documented previously ( Fig. 1 and 2A1; Table S2) (29,34,44). The type strains of B. mobilis and B. wiedmannii also produced overlapping genomospecies in which a genome could share Ն95 ANI with both species type strains ( Fig. 1 and 2A1; Table S2). The largest source of ambiguity, however, stemmed from B. anthracis and neighboring lineages, as the genomospecies cluster formed by the B. anthracis reference genome overlapped with those of B. pacificus, B. paranthracis, and B. tropicus ( Fig. 1 and 2A1; Table S2).
Genomic elements responsible for anthrax, emetic, and insecticidal toxin production exhibit heterogeneous presence in multiple species using current genomospecies definitions. Additional nomenclatural discrepancies arise when a trait of interest is plasmid encoded, such as anthrax toxin genes cya (edema factor encoding), lef (lethal factor encoding), and pagA (protective antigen encoding) (45): 93 of 241 (38.6%) genomes most closely resembling the B. anthracis reference genome at Ն95 ANI did not possess anthrax toxin genes ( Fig. 3A and B; Table S4). Notably, isolates FIG 1 Dendrogram constructed using symmetric pairwise average nucleotide identity (ANI) dissimilarities calculated between 2,218 B. cereus group genomes from NCBI's RefSeq database with N 50 of Ͼ20 kbp (i.e., D ANI sym in Materials and Methods) and the average linkage hierarchical clustering method implemented in the hclust function in R. Blue tip labels denote the location of species type strain/reference genomes in the dendrogram, while tree height corresponds to ANI dissimilarity. Branch colors correspond to branch height within the tree. Dashed vertical lines appear at dissimilarities of 7.5, 6, 5, and 4, which correspond to ANI thresholds of 92.5, 94, 95, and 96, respectively (from left to right in order of appearance along the x axis). which most closely resemble B. anthracis by current species definitions (i.e., Ն95 ANI), despite lacking anthrax toxin-encoding genes, do not appear to be uncommon. Such strains have been isolated from diverse environments (e.g., soil, animal feed, milk, spices, egg whites, and baby wipes) and from six continents, plus the International Space Station (Table S4). The classification of these isolates as B. anthracis could lead to incorrect assumptions of their anthrax-causing capabilities.
Importantly, isolates which display phenotypic characteristics associated with "B. cereus" (e.g., motility and gamma bacteriophage resistance) can cause anthrax (2,30,31,(40)(41)(42). Despite the assertion that it is a clonal species with low diversity (46)(47)(48), the B. anthracis genomospecies cluster formed at 95 ANI encompasses lineages which fall outside the one most commonly associated with anthrax illness (Fig. 3A and B). At a 95 ANI genomospecies threshold, three of seven genomes deposited in RefSeq as anthraxcausing "B. cereus" most closely resembled the B. anthracis reference genome ( Fig. 3A and B; Table S5), while also sharing Ն95 ANI with the B. paranthracis type strain genome (Table S5). The remaining four anthrax-causing "B. cereus" genomes most closely resembled the B. tropicus type strain, shared Ն95 ANI with the B. paranthracis type strain, and shared between 94 and 95 ANI with the B. anthracis species reference genome (Table S5). The separation of anthrax-causing "B. cereus" genomes into two genomospecies at 95 ANI was maintained when medoid genomes were used ( Fig. 2A2; Table S5). As such, several anthrax-causing "B. cereus" strains are technically still B. anthracis at 95 ANI ( Fig. 3A and B) and despite having a mosaic of phenotypic characteristics attributed to "B. cereus" and B. anthracis.

FIG 2
Weighted undirected graphs constructed using symmetric pairwise average nucleotide identity (ANI) values calculated between 2,218 B. cereus group genomes from NCBI's RefSeq database with N 50 of Ͼ20 kbp (i.e., S ANI sym in Materials and Methods). Nodes represent individual genomes, while weighted edges connect each pair of genomes with a mean ANI value of Ն95 (A) and Ն92.5 (B), where edge weight corresponds to the mean ANI value of the pair. Nodes (i.e., genomes) are colored by (i) closest matching type strain genome or (ii) closest matching medoid genome of clusters formed at the respective ANI value. Graphs were constructed using the graphout layout algorithm implemented in R's igraph package, using 1 million iterations and a charge of 0.02. Similar issues plague emetic "B. cereus," designated as such by its ability to produce cereulide, a toxin responsible for foodborne illness characterized by vomiting symptoms (13,38,49). At 95 ANI, all 30 emetic "B. cereus" genomes most closely resembled the B. paranthracis type strain, were confined to a single medoid-centric genomospecies, and were interspersed among genomes which lacked cereulide synthetaseencoding genes cesABCD ( Fig. 3A and C; Table S6). cesABCD were detected in five genomes representing two additional medoid-based genomospecies at 95 ANI ( Fig. 3A and C; Table S6). One contained the type strains of B. weihenstephanensis and B. mycoides, which is unsurprising considering that cereulide-producing B. weihenstephanensis has been isolated in rare cases (28,50). However, two genomes categorized previously as emetic "B. weihenstephanensis" belonged to a completely separate genomospecies at 95 ANI ( Fig. 3A and C; Table S6).
The Cry and Cyt insecticidal proteins associated with B. thuringiensis (i.e., Bt toxins), which can be plasmid mediated, face similar issues, as B. thuringiensis has historically been defined by its ability to produce insecticidal toxins (e.g., Cry and Cyt toxins) (51). However, genes encoding known insecticidal toxins were detected in nine of 21 B. cereus group type strain-centric genomospecies at 95 ANI ( Fig. 3A and D). These results are consistent with previous findings, as Bt toxin production has been attributed to numerous lineages (29,51,52).
ANI-based comparisons to medoid genomes using a lowered genomospecies threshold of Ϸ92.5 eliminate the species overlap problem. Numerous bacterial genomospecies have showcased a breakpoint in genome similarity which is close to Tip and branch labels are colored by genomospecies assignment using medoid genomes of genomospecies clusters formed at the widely used genomospecies threshold of 95 ANI (clusters are arbitrarily numbered) (A) and presence (colored) and absence (gray) of anthrax toxin genes cya, lef, and pagA (B); cereulide synthetase-encoding cesABCD (C); and one or more previously described Cry or Cyt insecticidal toxin-encoding genes (D). Phylogenies were constructed using core SNPs identified in 79 single-copy orthologous gene clusters present in 2,231 B. cereus group genomes. The type strain of "B. manliponensis" (i.e., the most distantly related member of the group) was treated as an outgroup on which each phylogeny was rooted. Virulence genes (cya, lef, and pagA and cesABCD) were detected using BTyper version 2.3.2 (default thresholds), while insecticidal toxin-encoding genes were detected using BtToxin_scanner version 1.0 (default settings; presence and absence of highconfidence, previously known Cry-and Cyt-encoding genes are shown, with predicted putative novel insecticidal toxin-encoding genes excluded). 95 ANI (5); however, ANI values among a significant proportion of B. cereus group genomes, particularly B. anthracis and neighboring lineages, fall within the 93 to 95 ANI range, with a breakpoint occurring at Ϸ92.5 ANI ( Fig. 4; Fig. S1). Using a hard 92.5 ANI threshold for B. cereus group genomospecies assignment, rather than 95, nearly eliminates the species overlap problem: only six of 2,231 genomes (0.269%) were assigned to 2 or more medoid-based genomospecies ( Fig. 2B2; Table S7), compared to 18.2% and 66.2% of genomes assigned to multiple genomospecies at 95 ANI when medoid genomes and species type strain/reference genomes were used, respectively ( Fig. 2; Tables S2 and S3). Eighteen genomospecies were present at a 92.5 ANI threshold, compared to 36 medoid-centric genomospecies at 95 ANI ( Fig. 3A and Fig. 5; Tables S3  and S7). Notably, at 92.5 ANI, seven genomospecies did not possess type strains of any published species (Table S7), indicating that putative novel genomospecies may be present. While one of these genomospecies has recently been proposed as novel species "B. clarus" (53), the remaining six are uncharacterized (Table S8).

DISCUSSION
When applied to bacteria, the concept of "species" is notoriously ambiguous, particularly in cases where it is intertwined with a phenotype, and even more so when that phenotype is an established component of the medical or industrial lexicon. Taxonomic definitions based on phenotype lack nuance in the omics era, as they ignore underpinning genomic diversity which can be leveraged to improve assessment of an isolate's pathogenic potential or industrial utility. Furthermore, taxonomy based on phenotype can be ambiguous-and even misleading-when a trait is lost, gained, or not widespread throughout a lineage. A notable example is provided by botulinum neurotoxin (BoNT)-producing species, to which the Clostridium botulinum label has historically been applied, despite multiple genomospecies exhibiting BoNT production capabilities (6). Adherence to a nomenclature just for the sake of taxonomic rigor, however, can be equally problematic when a lineage has deep roots in medicine or industry. Shigella spp. and Escherichia coli, for example, constitute a single genomospecies but are considered to be distinct entities, despite genomic inconsistencies reflected in their nomenclature (7,54,55).
An ideal taxonomy should be interpretable, without sacrificing the resolution provided by contemporary technologies. Several publications have appended the term "biovar" to species names to denote isolates which exhibit interesting phenotypes (e.g., anthrax-causing "B. cereus" as B. cereus biovar anthracis and Cry-producing B. wiedmannii as B. wiedmannii biovar thuringiensis) (41,52). We therefore propose a taxonomic framework consisting of (i) an amended collection of genomospecies, corresponding to resolvable genomospecies obtained at Ϸ92.5 ANI; (ii) a formal collection of subspecies, which account for established lineages of medical importance; and (iii) a formalized and extended collection of biovars, which account for phenotypic heterogeneity (Fig. 6). Tip and branch labels are colored by genomospecies assignment using medoid genomes of genomospecies clusters formed at the proposed genomospecies threshold of 92.5 ANI. Phylogeny was constructed using core SNPs identified in 79 single-copy orthologous gene clusters present in 2,231 B. cereus group genomes. The type strain of "B. manliponensis" (i.e., the most distantly related member of the group) was treated as an outgroup on which the phylogeny was rooted.
Bacillus cereus Group Taxonomy ® Note that a recently proposed "genomovar" framework for B. cereus sensu stricto/B. thuringiensis (56) is not adopted here, due to the lack of genomospecies boundaries between their type strains (shown here and elsewhere [33][34][35], including the paper proposing the framework [56]), as well as the lack of a standardized species definition for B. thuringiensis (B. thuringiensis has been used to refer to any B. cereus group species capable of producing Bt toxins [29] or to the genomospecies formed by the B. thuringiensis type strain genome [56]).
Proposed taxonomic nomenclature. (A) Genomospecies. The B. cereus group currently consists of eight published genomospecies (designated I to VIII), four previously proposed genomospecies (designated ix to xii), and six putative novel genomospecies (designated xiii to xviii) (Fig. 5). A genome belongs to a genomospecies if it FIG 6 Taxonomic hierarchy for the proposed B. cereus group nomenclature. Taxonomic levels are listed in the left margin, with levels which are optional/not applicable to all organisms denoted as such. Rounded boxes shaded in light green correspond to possible taxonomic designations at their respective level, while blue boxes correspond to requirements that an isolate and/or its genome must meet to be assigned that designation. Possible forms which the final taxonomic assignment can take can be found in the gray box at the bottom of the chart.
shares ¬92.5 ANI with the genomospecies medoid genome (see Table S7 in the supplemental material). Due to the resolvability of genomospecies at this threshold, it follows that (i) a genome does not belong to a genomospecies if it shares «92.5 ANI with the genomospecies medoid genome, (ii) two genomes belong to the same genomospecies if they share ¬92.5 ANI with each other, and (iii) two genomes belong to different genomospecies if they share «92.5 ANI with each other (i.e., in practice, a genomospecies medoid genome does not need to be used for genomospecies assignment, but rather any genome of known genomospecies; see Tables S6 and S7 for a comprehensive list of genomospecies assignments). When written, genomospecies names immediately follow the genus name (Bacillus or B.) and are italicized and lowercase.
Published genomospecies. (I) Bacillus pseudomycoides. The B. pseudomycoides genomospecies contained 111 genomes, including the genome of species type strain B. pseudomycoides strain DSM 12442. All genomes previously classified as B. pseudomycoides relative to the type strain at a 95 ANI threshold remain in this genomospecies, and no additional genomes belong to the genomospecies. As such, this genomospecies remains consistent with its previous classification, and its name remains unchanged.
(II) Bacillus paramycoides. The B. paramycoides genomospecies contained six genomes, including the genome of species type strain B. paramycoides strain NH24A2. All genomes previously classified as B. paramycoides relative to the type strain at a 95 ANI threshold remain in this genomospecies, and no additional genomes belong to the genomospecies. As such, this genomospecies remains consistent with its previous classification, and its name remains unchanged. Additionally, all members of the lineage formerly known as emetic "B. cereus" belong to B. mosaicus (see "Subspecies" and "Biovars" below). While the species formerly known as B. anthracis is the oldest described former species in this group, it is not proposed as the genomospecies name, as doing so could lead to incorrect assumptions of an isolate's anthrax-causing potential. As such, the proposed genomospecies name (mosaicus) is chosen to reflect the diversity of lineages and phenotypes present among members of this genomospecies. All genomes previously assigned to the abovementioned former species using their respective type strain or reference genomes at a 95 ANI threshold belong to B. mosaicus.
(IV) Bacillus cereus sensu stricto. The B. cereus sensu stricto genomospecies contained 949 genomes, including those of type strains B. cereus sensu stricto (B. cereus sensu stricto strain ATCC 14579) and former species B. thuringiensis (now B. cereus sensu stricto serovar Berliner biovar Thuringiensis strain ATCC 10792; see "Biovars" below). B. cereus sensu stricto was chosen as the genomospecies name, with Thuringiensis proposed as a biovar to account for phenotypic heterogeneity within B. cereus sensu stricto, as well as the presence of insecticidal toxins in other genomospecies (see "Biovars" below). All genomes previously assigned to the species B. cereus sensu stricto and former species B. thuringiensis at a 95 ANI threshold using these type strains belong to B. cereus sensu stricto.
(V) Bacillus toyonensis. The B. toyonensis genomospecies contained 230 genomes, including the type strain of B. toyonensis (B. toyonensis strain BCT-7112). All genomes previously classified as B. toyonensis relative to the type strain at a 95 ANI threshold remain in this genomospecies, and no additional genomes belong to the genomospecies. As such, this genomospecies remains consistent with its previous classification, and its name remains unchanged.
(VI) Bacillus mycoides. The B. mycoides genomospecies contained 164 genomes, including the type strain of B. mycoides (B. mycoides strain DSM 2048), and former species B. nitratireducens (now B. mycoides strain 4049), B. proteolyticus (now B. mycoides strain TD42), and B. weihenstephanensis (now B. mycoides strain WSBC 10204). Additionally, all members of the lineages formerly known as emetic B. weihenstephanensis belong to B. mycoides (see "Biovars" below). B. mycoides was selected as the genomospecies name, as it is the oldest of the published former species described in this cluster (and remains consistent with taxonomic changes recently proposed by others [44]). All genomes previously assigned to the abovementioned species using their respective type strain or reference genomes and a 95 ANI threshold belong to B. mycoides.
(VII) Bacillus cytotoxicus. The B. cytotoxicus genomospecies contained 14 genomes, including the type strain of B. cytotoxicus (B. cytotoxicus strain NVH 391-98). All genomes previously classified as B. cytotoxicus relative to the type strain at a 95 ANI threshold remain in this genomospecies, and no additional genomes belong to the genomospecies. As such, this genomospecies remains consistent with its previous classification, and its name remains unchanged.
(VIII) Bacillus luti. The B. luti genomospecies contained nine genomes, including the type strain of B. luti (B. luti strain TD41). All genomes previously classified as B. luti relative to the type strain at a 95 ANI threshold remain in this genomospecies, and no additional genomes belong to the genomospecies. As such, this genomospecies remains consistent with its previous classification, and its name remains unchanged.
Putative novel genomospecies. Six putative genomospecies (xiii to xviii [ Table S8]) have not been proposed as novel genomospecies. Future novel B. cereus group genomospecies should (i) share Ͻ92.5 ANI with all B. cereus group genomes and (ii) share Ն97% 16S rRNA gene similarity with known B. cereus group species (a definition used in previous studies [35]).
(B) Subspecies. The following subspecies are proposed to ensure that the medically important lineages formerly known as B. anthracis and emetic "B. cereus" remain interpretable. When written, subspecies names are italicized and lowercase and can optionally (i) be appended to the species name, after the nonitalicized delimiter "subspecies" or "subsp.," prior to a serotype designation (if applicable); or (ii) follow the genus name (Bacillus or B.) (5) which was replicated here. The use of the term "subspecies anthracis" does not indicate whether an isolate produces anthrax toxin or possesses the machinery required for anthrax toxin synthesis (see "biovar Anthracis" below). b . Bacillus mosaicus subsp. cereus (can be written as B. mosaicus subsp. cereus; B. cereus) refers to the lineage formerly known as emetic "B. cereus." All genomes possessing cereulide synthetase genes (cesABCD) which did not belong to the B. mycoides species cluster (see "Genomospecies" above) shared Ն97.5 ANI with the emetic reference strain formerly known as B. cereus strain AH187 (now B. mosaicus subsp. cereus biovar Emeticus; RefSeq accession no. GCF_000021225.1). As such, isolates assigned to this subspecies (i) produce cereulide and belong to the species B. mosaicus, (ii) possess cesABCD and belong to the species B. mosaicus, and/or (iii) share ¬97.5 ANI with emetic reference genome B. cereus strain AH187 (now B. mosaicus subsp. cereus biovar Emeticus; RefSeq accession no. GCF_000021225.1). The use of the term "subspecies cereus" does not indicate whether an isolate produces cereulide or possesses the machinery required for cereulide synthesis (see "Biovar Emeticus" below).
(C) Biovars. The following biovars are proposed to account for phenotypes of clinical and industrial importance which can be distributed across species and heterogeneous in their appearance in individual lineages. While phenotypic evidence of a trait is ideal, biovars can be predicted at the genomic level. When written, (i) the first letter of the biovar is capitalized; (ii) the biovar name is not italicized; (iii) the biovar is appended to the end of a species, subspecies (if applicable), or serotype name (if applicable), following the nonitalicized delimiter "biovar"; (iv) if applicable, multiple biovars follow the nonitalicized, plural delimiter "biovars," are listed in alphabetical order, and are each separated by a comma and a single space; (v) biovar(s) may follow the genus name (Bacillus or B.) directly, with the species, subspecies (if applicable), and serotype (if applicable) names omitted a. Biovar Anthracis is applied to an isolate (i) known to produce anthrax toxin (preferred) and/or (ii) known to possess anthrax toxin-encoding genes cya, lef, and pagA. Capsular genes (e.g., cap, has, and bps) (21,22,57) are deliberately excluded from this definition as a conservative measure (i.e., to avoid cases in which an isolate might cause anthrax via a previously unknown capsule). Examples include B. mosaicus subsp. anthracis biovar Anthracis (i.e., anthrax-causing members of the "clonal" lineage often associated with anthrax disease; can be written as B. anthracis biovar Anthracis or B. Anthracis); B. mosaicus biovar Anthracis (i.e., anthrax-causing lineages formerly known as "anthrax-causing B. cereus"; can be written as B. Anthracis). b. Biovar Emeticus is applied to an isolate known to produce cereulide (preferred) and/or to possess cereulide synthetase-encoding genes (cesABCD). Examples include B. mosaicus subsp. cereus biovar Emeticus (i.e., cereulide-producing lineages formerly known as emetic "B. cereus"; can be written as B. cereus biovar Emeticus or B. Emeticus) and B. mycoides biovar Emeticus (i.e., cereulideproducing lineages formerly known as "emetic B. weihenstephanensis"; can also be written as B. Emeticus). c. Biovar Thuringiensis can be applied to an isolate known to produce one or more Bt toxins (e.g., Cry, Cyt, or Vip toxins; preferred) and/or to possess Bt toxinencoding genes. Examples include B. mosaicus biovar Thuringiensis and B. cereus sensu stricto biovar Thuringiensis (both of which can be written as B. Thuringiensis).
The proposed taxonomy offers numerous advantages. Most importantly, it is consistent; it provides an explicit, standardized framework for taxonomic classification using genomic and/or phenotypic methods, and it resolves previous nomenclatural ambiguities. Second, the proposed taxonomy is backwards compatible with important medical and industrial taxonomic definitions. For example, any B. cereus group isolate capable of producing Bt toxins can be referred to as B. Thuringiensis, which is equivalent to the traditional species definition (29). All isolates capable of producing anthrax toxin can be referred to as B. Anthracis, while members of the "clonal" anthrax lineage remain B. anthracis (using subspecies notation). Finally, the proposed taxonomy is flexible and can be extended to account for additional lineages or phenotypes through the adoption of novel subspecies or biovars, respectively. For example, biovars can be proposed to describe B. cereus group members capable of causing diarrheal foodborne disease (i.e., biovar Cereus), as this disease involves multiple toxins and is not fully understood (58). The nomenclature proposed here not only provides a standardized framework for taxonomic classification which accounts for both phylogenomic diversity and phenotypic heterogeneity, but also serves as a model taxonomic framework which moves beyond arbitrary genomospecies thresholds while maintaining historical congruence.

MATERIALS AND METHODS
Acquisition and initial characterization of Bacillus cereus group genomes. All genomes in the NCBI RefSeq Assembly database (59) which were submitted as one of 18 published B. cereus group species (35) were downloaded, along with the type strain genomes of three proposed effective B. cereus group species (60-62) (n ϭ 2,231, accessed 19 November 2018) (see Tables S1 and S6 in the supplemental material). QUAST version 4.0 (63) was used to assess the quality of each genome, and BTyper version 2.3.2 (31) was used to detect B. cereus group virulence genes in each genome, using default minimum amino acid sequence identity and coverage thresholds (50% and 70%, respectively) (Table S6) (31,64). Prokka version 1.12 (65) was used to annotate each genome, and the resulting coding sequences were used as input for the command-line implementation of BtToxin_scanner version 1.0 (BtToxin_scanner2.pl), which was used to identify Bt toxin genes in each genome using default settings (66).
Calculation of pairwise ANI values, hierarchical clustering, and medoid genome identification. FastANI version 1.0 (5) was used to calculate ANI values between each of 2,231 genomes (4,977,361 comparisons). To ensure that the breakpoints and shape of the distribution of pairwise ANI calculations were robust, genomes which (i) fell below various N 50 thresholds (i.e., Յ10 kbp, 20 kbp, 50 kbp, and 100 kbp) and/or (ii) contained any contigs classified in domains other than Bacteria, phyla other than Firmicutes, and/or genera other than Bacillus using Kraken version 2.0.8-beta (67,68) and the complete standard Kraken database (accessed 6 August 2019) were removed (Fig. S2). For medoid genome identification (described below), all genomes with N 50 of Ͼ20 kbp in the original set of 2,231 RefSeq genomes were used in subsequent steps (n ϭ 2,218) (Table S6 and Fig. S2).
The resulting pairwise ANI values were used to construct a similarity matrix, S ANI , using R version 3.6.0 (69) and the reshape2 package (70) as follows, where n ϭ 2,218: Let g 1 , g 2 , . . . g n be a set of n genomes, denoted by G (G ϭ {g 1 , g 2 , . . . g n }). Similarity function ANI(g i , g j ) denotes the ANI value shared by query and reference genomes g i and g j , respectively, where ANI: G ϫ G¡[0,100].
Similarity matrix S ANI can be defined as S ANI ϭ (s ij ); s ij ϭ ANI ij ϭ ANI(g i , g j ).
Similarity matrix S ANI was converted to a dissimilarity matrix, D ANI , as follows, where J denotes an n ϫ n matrix where each element is equal to 1: D ANI ϭ 100J ϪS ANI .
ANI as a similarity function is not symmetric [i.e., for all g i , g j , ANI(g i , g j ) ANI(g j , g i )], as minor differences between corresponding values in the upper and lower triangles of D ANI existed: max[d(g i , g j ), d(g j , g i )] ϭ 0.504; min[d(g i , g j ), d(g j , g i )] ϭ 0; mean[d(g i , g j ), d(g j , g i )] ϭ 0.056; median[d(g i , g j ), d(g j , g i )] ϭ 0.046.
As such, D ANI is not a symmetric matrix (i.e., D ANI D ANI T ). To coerce D ANI to a symmetric matrix, D ANI sym , the following transformation was applied: D ANI sym ϭ 0.5͑D ANI ϩ D ANI T ͒. The hclust function in R's stats package was used to perform average linkage hierarchical clustering, using D ANI sym as the dissimilarity structure, and the resulting dendrogram was annotated using the ggplot2 (71), dendextend (72), and viridis (73)  where d(g i , g j ) ϭ 100 Ϫ ANI(g i , g j ).
To construct a graph with each of 2,218 genomes represented as nodes and ANI values represented as weighted edges, D ANI sym was converted to a symmetric similarity matrix, S ANI sym , as follows: The igraph (75) package in R was used to construct each graph, with S ANI sym treated as an adjacency matrix, and edges with weights (i.e., ANI values) less than a similarity threshold T s (i.e., T s ϭ [92.5, 95]) removed. Genomospecies assignment. FastANI version 1.0 was used to assign each of 2,231 B. cereus group genomes to a genomospecies, using (i) species reference/type strain genomes (n ϭ 21) (Table S1) and medoid genomes identified at (ii) 95 ANI (n ϭ 36) (Table S3) and (iii) 92.5 ANI (n ϭ 18) (Table S7) as reference genomes for each of three separate runs.
Phylogeny construction. Amino acid sequences of protein-encoding features produced by Prokka were used as input for OrthoFinder version 2.3.3 (76). Single-copy orthologous clusters (i.e., genes) present in all 2,231 genomes were identified using an iterative approach, in which OrthoFinder was used to identify single-copy genes core to n of the 2,231 genomes, sampled randomly without replacement, where n ϭ 30 or n ϭ 11 for 74 and 1 (the remainder) iteration(s), respectively. The union of single-copy genes present in all n genomes in each random sample of genomes was then queried again using OrthoFinder, which identified a total of 79 single-copy genes core to all 2,231 genomes. Nucleotide sequences of each of the 79 single-copy core genes were aligned using PRANK v.170427 (77). The resulting alignments were concatenated, and SNP-sites version 2.4.0 (78) was used to produce an alignment of variant sites, excluding gaps and ambiguous characters. IQ-TREE version 1.6.10 (79) was used to construct a maximum likelihood phylogeny, using the alignment of core single nucleotide polymorphisms (SNPs) detected in all 2,231 genomes. The GTRϩGϩASC nucleotide substitution model (i.e., general time reversible model [80] with a gamma parameter [81] to allow rate heterogeneity among sites and an ascertainment bias correction [82] to account for the use of solely variant sites) was used, along with 1,000 replicates of the ultrafast bootstrap approximation (83). The resulting phylogeny was annotated in R using the ggplot2 (71), ape (84), phytools (85), phylobase (86), ggtree (87), and phangorn (88) packages.
Data availability. Accession numbers for all genomes queried in this study are available in Table S6. BTyper3, a command-line tool for characterizing B. cereus group genomes using the framework outlined here, is available at https://github.com/lmc297/BTyper3. An R package, bactaxR, is available for identifying medoid genomes and constructing plots using the methods described here at https://github.com/ lmc297/bactaxR.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.