Advanced prokaryotic systematics: the modern face of an ancient science

Prokaryotic systematics is one of the most progressive disciplines and has embraced technological advances over the last century. The availability and affordability of new sequencing technologies and user-friendly software have revolutionised the discovery of novel prokaryotic taxa, including the identification and nomenclature of uncultivable microorganisms. These advances have enabled scientists to resolve the structure of complex heterogeneous taxa and to rectify the taxonomic status of strains misclassified because of errors associated with the sensitivity and/or reproducibility of phenotypic approaches. Time- and labour-intensive experimental characterisation of strains could be replaced by determining the presence or absence of genes or operons responsible for phenotypic and chemotaxonomic properties, such as the presence of mycolic acids and menaquinones. However, the quality of genomic data must be acceptable, and phylogenomic threshold values for interspecies and supraspecies delineation should be carefully considered in combination with genome-based phylogeny for a reliable and robust classification. These technological developments have empowered prokaryotic systematists to reliably identify novel taxa with an understanding of community ecology and their biosynthetic and biodegradation potentials.


Introduction
Prokaryotic systematics, the discipline of identifying, classifying and naming novel prokaryotic taxa, provides the foundation for all microbiological research of agricultural, ecological, industrial, medical and veterinary importance. Prokaryotes are abundant in nature, with an estimated 2.2-4.3 million species, of which only ~21,000 have been characterised and validly published [1][2][3]. The discipline, which primarily relied on phenotypic characterisation for species identification, has evolved over a century since the first edition of Bergey's Manual of Determinative Bacteriology in 1923 to include numerical taxonomy in 1962 and to adopt polyphasic taxonomy in the 1990s [3][4][5][6]. Polyphasic taxonomy, which includes morphological, biochemical and chemotaxonomic characterisation, DNA-DNA hybridisation (DDH) and phylogenetic analysis of variation in the 16S rRNA gene sequence, was the gold standard in prokaryotic systematics until whole genome sequencing became affordable in the last decade. Polyphasic systematics is time- and labour-intensive and involves approaches such as DDH that are quite sensitive to minor variations in experimental conditions and in the probes used for hybridisation, which sometimes results in poor reproducibility between laboratories [7,8]. The variation in 16S rRNA gene sequences may not be enough to resolve closely related species [9]. Distinct species can show >99% similarity in their 16S rRNA gene sequences, for example, several species within the family Geodermatophilaceae [10] and within the genus Mycobacterium [11]. Furthermore, horizontal transfer of the 16S rRNA gene has been reported in multiple bacterial species [12][13][14], suggesting that variation in the 16S rRNA gene should be carefully considered for species delineation.
The modernisation of the practices of prokaryotic systematics would not have been possible without the significant contribution and tireless efforts of the scientific community, which continuously develops bioinformatic tools to help systematists with limited computational experience. The availability of interactive, user-friendly bioinformatics tools and resources played an important role in facilitating the smooth integration of whole genome sequences into the discipline. For example, EzBioCloud and the Type (Strain) Genome Server (TYGS) are used extensively by prokaryotic systematists. EzBioCloud has an integrated sequence database of high-quality 16S rRNA gene and genome sequences of archaea and bacteria [26]. The server hosts various bioinformatic tools for identification of novel species from 16S rRNA gene or genome sequences. Similarly, TYGS is a simple and fully automated genome analysis pipeline that can extract 16S rRNA gene sequences from the genome sequences and determine the closest type strains for further analyses [27,28]. TYGS also performs state-of-the-art phylogenomic analysis on a user-provided set of genomes and constructs a highly robust genome-based phylogenetic tree using the Genome BLAST Distance Phylogeny (GBDP) method [27,28]. However, some challenges remain to be addressed. In this review, we discuss the progress made by the incorporation of genomics into prokaryotic systematics, along with important computational tools, and how the definitions of genus and species have evolved over the last decade.

Important phylogenomic approaches for prokaryotic systematics
Several taxogenomic approaches are available to assist systematists in the identification of novel taxa by comparing genome sequences. One of the most used approaches is the calculation of average nucleotide identity (ANI) between pairs of genomes, in which genome sequences are cut into 1020 bp fragments that are then searched with BLAST [29] against the other genomes [30,31]. ANI values are calculated from pairwise matches with at least 70% alignable length and at least 30% sequence identity. Users can easily compare up to 50 genomes with the online Genome-based distance matrix calculator (http://enve-omics.ce.gatech.edu/g-matrix/). OrthoANI is another useful tool; it calculates ANI values from a set of orthologous genes within the dataset to address the bias introduced by variation in the size of sequence alignments under pairwise reciprocal similarity criteria [32]. However, both of these approaches can be slow for large datasets. More recently, a fast computational approach has been developed that calculates ANI values from orthologous gene pairs between two genomes using an alignment-free approximate sequence mapping method [33]. The reliability of all these approaches appears to be comparable, with a cut-off value of 95% for defining species [30,32,33].
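The fragment-based ANI calculation described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used by any of the cited tools: the `FragmentHit` record and the function names are hypothetical, the best BLAST hit of each fragment is assumed to have been computed beforehand, and dropping the final partial fragment is an assumption made here for simplicity.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

FRAGMENT_LENGTH = 1020        # fragment size given in the text (bp)
MIN_IDENTITY = 30.0           # minimum percent identity of a retained match
MIN_ALIGNED_FRACTION = 0.70   # minimum alignable fraction of a fragment


@dataclass
class FragmentHit:
    """Best hit of one query fragment in the other genome (hypothetical record)."""
    percent_identity: float   # identity over the aligned region, 0-100
    aligned_length: int       # number of fragment bases that aligned
    fragment_length: int = FRAGMENT_LENGTH


def fragment_genome(sequence: str, size: int = FRAGMENT_LENGTH) -> List[str]:
    """Cut a sequence into consecutive fragments; a shorter tail is dropped (assumption)."""
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, size)]


def average_nucleotide_identity(hits: Iterable[FragmentHit]) -> Optional[float]:
    """Mean identity of the hits passing both filters, or None if no hit passes."""
    kept = [
        hit.percent_identity
        for hit in hits
        if hit.percent_identity >= MIN_IDENTITY
        and hit.aligned_length / hit.fragment_length >= MIN_ALIGNED_FRACTION
    ]
    return sum(kept) / len(kept) if kept else None
```

In this sketch a fragment that aligns over only a small part of its length is ignored rather than counted as a low-identity match, which is why ANI reflects the identity of the shared portion of two genomes rather than the proportion of genome that is shared.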
Pairwise average amino acid identity (AAI) values are often considered more reliable, particularly for circumscribing higher taxa [31,34,35]. AAI values are calculated from protein sequences showing >30% identity and an alignable region of >70% after a two-way BLAST search [36,37]. Again, the online Genome-based distance matrix calculator (http://enve-omics.ce.gatech.edu/g-matrix/), with its user-friendly interface, can be used to calculate AAI values for up to 50 genomes. Another efficient and easy-to-use tool for AAI calculation is the EzAAI pipeline, which uses the hierarchical clustering tool MMseqs2 [38] for a sensitive protein sequence search and faster calculation of AAI values [39]. The AAI values it produced were comparable to those obtained from the BLAST-based approach. As with ANI, a 95% threshold is used for defining species.
An important approach for species delineation is the calculation of digital DDH (dDDH) values using an efficient, user-friendly web interface, the genome-to-genome distance calculator [40,41]. The calculator estimates the overall similarity between a pair of genomes and provides a value that mimics experimental DDH values, helping systematists to avoid tedious and often irreproducible laboratory-based DNA-DNA re-association experiments. As with laboratory-based DDH values, two strains with a dDDH value of ≥70% are assigned to the same species. The values are calculated by three different formulas, but formula #2 is recommended because it is not affected by genome size and provides robust values, particularly when incomplete draft genomes are analysed [42].
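One way to see why a formula can be independent of genome size is to compute it from the matched regions alone. The sketch below assumes, based on the published description of the calculator rather than on this review, that formula #2 divides the number of identical bases by the total length of the high-scoring segment pairs (HSPs). The function name and input layout are hypothetical, and the fitted model that converts this distance into a dDDH value is not reproduced here.

```python
from typing import Iterable, Tuple


def formula2_distance(hsps: Iterable[Tuple[int, int]]) -> float:
    """Distance from (identical_bases, hsp_length) pairs.

    Assumed form: 1 - (sum of identities / sum of HSP lengths).
    Neither genome length enters the calculation.
    """
    identical = 0
    length = 0
    for identical_bases, hsp_length in hsps:
        identical += identical_bases
        length += hsp_length
    if length == 0:
        raise ValueError("no HSPs supplied")
    return 1.0 - identical / length
```

Under this assumed form, losing part of a genome in a draft assembly removes HSPs from both the numerator and the denominator, so the distance changes little as long as the remaining HSPs are representative.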
A pangenomic approach based on the core/pan-genome ratio, expressed as a percentage, between different strains is emerging as an important tool for the reclassification of bacterial species [43,44]. Distinct species were estimated to show a discontinuity or an abrupt break of >10% in the core/pan-genome ratio [43,44]. The core/pan-genome ratio was 94% when six Klebsiella pneumoniae genomes were compared; however, this ratio decreased to 79% and 72% when Klebsiella pneumoniae subsp. rhinoscleromatis and K. pneumoniae subsp. ozaenae were successively compared [43]. A reduction of 13-27% was observed when Klebsiella mobilis, Klebsiella variicola and Klebsiella oxytoca genomes were compared with the six Klebsiella pneumoniae genomes, leading to the conclusion that K. pneumoniae subsp. rhinoscleromatis and K. pneumoniae subsp. ozaenae should be treated as distinct species [43][44][45].
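The core/pan-genome ratio and the break observed when a further genome is added can be illustrated with simple set operations. This is a hedged sketch, not the pipeline used in the cited studies: it assumes each genome has already been reduced to a set of gene-family identifiers, the function names are hypothetical, and the >10% break is interpreted here as a drop of more than 10 percentage points.

```python
from typing import Dict, Set

BREAK_THRESHOLD = 10.0  # drop (percentage points) taken to indicate a distinct species


def core_pan_ratio(genomes: Dict[str, Set[str]]) -> float:
    """Percentage of pan-genome gene families that are present in every genome."""
    if not genomes:
        raise ValueError("at least one genome is required")
    families = list(genomes.values())
    core = set.intersection(*families)   # families shared by all genomes
    pan = set.union(*families)           # families found in any genome
    if not pan:
        raise ValueError("genomes contain no gene families")
    return 100.0 * len(core) / len(pan)


def ratio_drop_on_adding(genomes: Dict[str, Set[str]], name: str, families: Set[str]) -> float:
    """Decrease in the core/pan-genome ratio when one more genome is added."""
    before = core_pan_ratio(genomes)
    after = core_pan_ratio({**genomes, name: families})
    return before - after
```

A candidate genome would then be flagged when `ratio_drop_on_adding(...)` exceeds `BREAK_THRESHOLD`; in the Klebsiella example in the text, the ratio fell from 94% to 79%, a drop of 15 points.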
A potential ANI threshold of ~75% has been suggested for separating different genera [17]. However, the suitability of ANI values for delineating genera was challenged in favour of the percentage of conserved proteins (POCP) between a pair of genomes [46]. Conserved proteins are identified based on >50% aligned length with >40% sequence identity and an e-value <1 × 10⁻⁵. The hypothesis is that at least 50% of the genome-wide proteome will be shared by two individual species belonging to the same genus [46].
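A POCP calculation using the criteria above might look like the following sketch. It assumes the commonly used formulation POCP = 100 × (C1 + C2) / (T1 + T2), where C1 and C2 are the numbers of conserved proteins found in each direction of the comparison and T1 and T2 are the total protein counts of the two genomes; that formula, the `ProteinHit` record and the function names are assumptions introduced for illustration and are not given in this review.

```python
from dataclasses import dataclass
from typing import Iterable

MAX_E_VALUE = 1e-5          # e-value must be below this
MIN_IDENTITY = 40.0         # percent identity must exceed this
MIN_ALIGNED_FRACTION = 0.5  # aligned fraction of the query must exceed this


@dataclass
class ProteinHit:
    """Best hit of one query protein in the other genome (hypothetical record)."""
    e_value: float
    percent_identity: float
    aligned_fraction: float  # aligned length / query length


def is_conserved(hit: ProteinHit) -> bool:
    """Apply the three conservation criteria stated in the text."""
    return (
        hit.e_value < MAX_E_VALUE
        and hit.percent_identity > MIN_IDENTITY
        and hit.aligned_fraction > MIN_ALIGNED_FRACTION
    )


def pocp(hits_a_vs_b: Iterable[ProteinHit], hits_b_vs_a: Iterable[ProteinHit],
         proteins_in_a: int, proteins_in_b: int) -> float:
    """Percentage of conserved proteins between genomes A and B (assumed formula)."""
    if proteins_in_a + proteins_in_b == 0:
        raise ValueError("protein counts must not both be zero")
    conserved_a = sum(1 for hit in hits_a_vs_b if is_conserved(hit))
    conserved_b = sum(1 for hit in hits_b_vs_a if is_conserved(hit))
    return 100.0 * (conserved_a + conserved_b) / (proteins_in_a + proteins_in_b)
```

Proteins without any hit simply do not appear in the hit lists, so they count towards the totals but not towards the conserved proteins.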
Phylogenetic analyses are important for inferring evolutionary relatedness among strains, which helps systematists assign species to the correct taxon. PhyloPhlAn 3.0 is a highly efficient approach for rapidly constructing phylogenies from large genomic datasets [47]. The program automatically selects the most informative clade-specific loci from genome assemblies or genome-wide protein sequences to create a multiple-sequence alignment and construct a maximum-likelihood tree [47]. PhyloPhlAn trees are highly robust, and the program can perform taxonomic assignments based on the NCBI taxonomy. However, several other approaches, including multilocus sequence analyses, core genome phylogeny and genome distance-based phylogeny [40,48], have also been used to study taxonomic relatedness between strains. 16S rRNA gene sequences remain central to the process because they help identify relevant genomes/organisms for comparative analyses. GenBank is the fastest growing database of prokaryotic genomes (363,287 prokaryotic genome sequences; accessed on 01/10/2021) and has served as a great resource for facilitating phylogenomic comparisons [49]. The scientific community can freely access the genome sequences and use them in comparative analyses. As mentioned before, automated genomic pipelines such as the EzBioCloud server [26] and TYGS [28] have integrated phylogenetic and taxogenomic approaches that are interactive and very easy to use for determining the taxonomic status of strains and classifying them without the need to install any software. The availability of easy-to-use computational tools has contributed to the integration of genome sequencing into prokaryotic systematics.

What has been resolved by integration of genomics into prokaryotic systematics?
The taxonomic structure of large complex taxa, such as the phylum Actinomycetota (formerly known as Actinobacteria), remained unresolved until the integration of genomics [19]. The phylogenomic analyses of 1,142 strains led to a large-scale restructuring of the phylum, including the addition of two new orders, ten families and 17 new genera, along with the reclassification of more than 100 species [19]. Rhodococcus, a heterogeneous actinobacterial taxon that includes species of ecological, industrial and medical importance, encompasses multiple species-groups which equate to the rank of distinct genera [17,22]. A phylogenomic review of the phylum Bacteroidota (formerly known as Bacteroidetes) resulted in the elevation of the families Balneolaceae and Saprospiraceae to a new phylum Balneolaeota, and the addition of a new class Saprospiria [50]. Similarly, phylogenomic analyses of the genus Flexibacter resulted in the definition of new monophyletic genera with emended species descriptions [50].
Phylogenomic analyses also resolved the taxonomic structure of clinically and ecologically important taxa. The genus Mycobacterium sensu lato was divided into five monophyletic clades designated as "Fortuitum-Vaccae", "Terrae", "Triviale", "Abscessus-Chelonae" and "Tuberculosis-Simiae"; the first four correspond to the four novel genera Mycolicibacterium, Mycolicibacter, Mycolicibacillus and Mycobacteroides, respectively, while the "Tuberculosis-Simiae" clade retains the genus name Mycobacterium [51]. The "Tuberculosis-Simiae" clade encompasses pathogenic strains including Mycobacterium tuberculosis, which is responsible for tuberculosis in humans. Similarly, the Burkholderia cepacia complex, which contains strains responsible for fatal lung infections in humans, was resolved into 36 groups with a reassignment of 22 species to the correct rank and the identification of 14 novel species [52].
The genus Frankia is known for its ecological and agricultural significance, but the taxonomic relatedness between Frankia species remained vague [53,54]. Using phylogenomic analyses, twelve Frankia species were shown to form four distinct groups with capacities to produce plant growth-promoting hormones, bioremediation potential, and potential to produce novel bioactive compounds [20]. Phylogenomic analyses also highlighted the close phylogenetic relationship of the genera Streptomyces, Kitasatospora and Streptacidiphilus, which are well known for their ecological and genetic diversity [55]. These genera shared a number of genomic features and capacities to produce similar secondary metabolites, potentially due to high recombination frequencies [55]. In addition, genomic analysis provided new insights into the biotechnological potential of Kitasatospora for novel antibiotic discovery [55]. Therefore, the integration of phylogenomics into prokaryotic systematics has made the classification of prokaryotic taxa more robust, helped to explain the evolutionary processes behind the genomic diversity between related taxa and revealed the biosynthetic capacities of different strains.

Time to move on to the modern prokaryotic systematics
Traditional phenotyping methods using in vitro biochemical assays or API 20/CHL strips provide an overview of the metabolic features of microorganisms but are sometimes sensitive to minor variations in experimental conditions, resulting in poor reproducibility and/or ambiguous data. But do we really need to continue with the traditional phenotyping methods?
Some studies reported a lack of consistency between chemotaxonomic and phylogenetic data; for example, the genus Turicella differed from Corynebacterium strains only in lacking mycolic acids and in its quinone profile [19,56]. Genomic characterisation revealed that these chemotaxonomic variations resulted from the deletion of genes responsible for the synthesis of mycolic acids and menaquinone in the former [57]. Turicella otitidis, the sole Turicella species, was reclassified as Corynebacterium otitidis based on phylogenetic analyses [57]. Therefore, identification of the genetic basis of phenotypic characteristics could replace phenotypic and chemotaxonomic characterisation of microorganisms. It is now possible to model the pathways for a particular phenotype by mapping them onto the genome sequences, an approach that is emerging as a prominent discipline called microbial phenomics [58].
High-throughput phenotyping technologies such as Biolog Phenotype MicroArrays have been extensively used for the characterisation of metabolic pathways in combination with genomic data, which has also been useful for taxonomic studies [53][59][60][61]. This system can determine approximately 2,000 phenotypes for a microbial cell, including its capacity to use different carbon, nitrogen, phosphorus and sulphur substrates, its ability to tolerate a range of pH and salinity, and its sensitivity to several antibiotics and inhibitory compounds. Several R packages are available for processing high-throughput phenotypic data [62,63]. Therefore, it is time to modernise the practices of polyphasic taxonomy by minimising the battery of biochemical and chemotaxonomic assays and adopting a more balanced taxonogenomic approach for species description. Taxonogenomics combines phenotypic and genotypic characteristics for the identification of microorganisms [64]. The most relevant basic features, such as morphological and growth properties, along with the phylogenomic characteristics should be sufficient for a species description [64][65][66][67]. Indeed, several new species have been described using the taxonogenomic approach [68][69][70].

Has genomics empowered prokaryotic systematists?
Despite the importance of non-cultivable microbes in maintaining different ecosystems, our knowledge of the microbial world is mostly limited to cultivable taxa. Only ~2% of bacteria from nearly half of the identifiable bacterial domains are culturable [71]. Environmental metagenomics and single cell sequencing have made it possible to assess microbial diversity, including the unculturable bacteria, so-called 'microbial dark matter', in their ecological niches, which is guiding further studies to investigate their biosynthetic or catalytic potential [72][73][74]. New taxa identified by culture-independent methods have expanded the tree of life to include the "Candidate Phyla Radiation (CPR)", a large group of mostly non-cultivable bacterial lineages that represent more than 15% of the domain Bacteria, including representatives of over 70 phyla and two superphyla (Candidatus Parcubacteria and Candidatus Microgenomates) [75]. Significantly higher numbers of new phyla and novel species are regularly being identified from complex environmental samples using metagenomic and single cell sequencing approaches than with the traditional culture-based approach [76][77][78]. In line with the practices of prokaryotic systematics, a roadmap for the nomenclature of the high numbers of non-culturable Archaea and Bacteria has been developed and is extensively discussed by the community [24]. Assessing the cell number and morphology of non-culturable bacteria from a complex sample is also now possible using microfluidic single-cell dispenser systems [79]. Therefore, these technological developments have empowered systematists not only to identify novel prokaryotic taxa but also to study community ecology and their biosynthetic and catalytic (biodegradation) capacities.

The key challenges
The technological advancements and availability of user-friendly phylogenomic tools have clearly benefited prokaryotic systematists. However, some challenges still need addressing. The most important issue is the quality of the sequencing reads and assembled genomes. More than 2 million entries in GenBank were found to be contaminated in a recent study, including 4-8% of the data for bacterial species [80]. It is important to check the quality of sequencing reads and genome assemblies before using them in phylogenomic analyses. Assemblies fragmented into too many short contigs, and contigs with poor coverage, should be excluded. Genome assemblies should be checked for completeness, contamination and percentage of ambiguous bases using programs such as CheckM [81] before downstream analyses.
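Some of the simpler assembly checks mentioned above can be computed directly from the contig sequences. The sketch below is an illustration only and does not replace tools such as CheckM: it covers contig count, N50, ambiguous bases and removal of short contigs, but not completeness, contamination or coverage. The function names and the 500 bp default are hypothetical choices, not recommendations from the cited studies.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Length of the contig at which half of the total assembly size is reached."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def ambiguous_percent(contigs: List[str]) -> float:
    """Percentage of bases that are not A, C, G or T (case-insensitive)."""
    total = sum(len(contig) for contig in contigs)
    if total == 0:
        return 0.0
    unambiguous = sum(contig.upper().count(base) for contig in contigs for base in "ACGT")
    return 100.0 * (total - unambiguous) / total


def drop_short_contigs(contigs: List[str], min_length: int = 500) -> List[str]:
    """Keep contigs of at least min_length bases (the default is a hypothetical cut-off)."""
    return [contig for contig in contigs if len(contig) >= min_length]


def assembly_summary(contigs: List[str]) -> Dict[str, float]:
    """Collect the basic statistics for one assembly."""
    lengths = [len(contig) for contig in contigs]
    return {
        "contigs": len(contigs),
        "total_length": sum(lengths),
        "n50": n50(lengths),
        "ambiguous_percent": ambiguous_percent(contigs),
    }
```

A summary of this kind is useful as a first screen; an assembly with a very large contig count or a high percentage of ambiguous bases would be a candidate for exclusion before any ANI, AAI or dDDH values are calculated.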
While phylogenomic tools provide usable metrics, the threshold values suggested for interspecies and supraspecies delineation should be carefully considered in combination with genome-based phylogeny [82][83][84]. For instance, if dDDH values are marginally below the threshold and ANI and/or AAI values are >95% for a pair of genomes, the phylogenetic relationship could be considered along with other ecological characteristics, such as association with a particular clinical condition that has been characterised by the presence of the same set of virulence genes. In a recent study, Corynebacterium strains from yellow-eyed penguins belonging to two distinct lineages were proposed as two subspecies, despite dDDH values below the 70% threshold, because their ANI and AAI values were higher than the 95% cut-off, they were very closely related phylogenetically and both were associated with diphtheritic stomatitis [84].
The publicly available resources allow users to analyse only a small number of genomes, often limited to 50-75. This causes problems for scientists resolving the taxonomic status of large and complex taxa, for which these programs need to be installed on Unix/Linux-based high-performance computers and the data analysed from the command line. This requires significant investment in computational resources and advanced knowledge to configure and run phylogenomic analyses, which may not be feasible for all prokaryotic systematists.

Conclusions - The "genomic" definition of a genus and species
According to the polyphasic approach, groups of phylogenetically closely related strains with similar phenotypic characteristics, ≥70% pairwise DNA-DNA hybridisation (DDH) values and >98.7% identity between their 16S rRNA gene sequences are often assigned to the same species [85][86][87]. The definition of prokaryotic genera is somewhat vague and often relies on the monophyly of strains in a phylogenetic tree and an average rRNA gene sequence divergence of <6%.
Phylogenomically, a prokaryotic genus is defined as a monophyletic group of species sharing at least 50% of the genome-wide proteome (pairwise POCP values ≥50%). A group of phylogenetically closely related strains with ANI and/or AAI values of ≥95% and a dDDH value of ≥70% belongs to the same species. Therefore, genome-based phylogenomic metrics could be used for interspecies and supraspecies delineation.
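The genomic definitions above can be expressed as a simple decision rule. The sketch below is an illustration under stated assumptions rather than an accepted standard: the function names and the "conflicting" category are hypothetical, the thresholds are treated as inclusive, and the rule deliberately leaves monophyly to be checked separately on a genome-based phylogeny.

```python
from typing import Optional

SPECIES_IDENTITY_CUTOFF = 95.0  # ANI and/or AAI (%)
SPECIES_DDDH_CUTOFF = 70.0      # dDDH (%)
GENUS_POCP_CUTOFF = 50.0        # POCP (%)


def same_genus(pocp: float) -> bool:
    """POCP criterion for two species of one genus (monophyly is checked separately)."""
    return pocp >= GENUS_POCP_CUTOFF


def assess_species(ddh: float, ani: Optional[float] = None,
                   aai: Optional[float] = None) -> str:
    """Compare one pair of genomes against the species thresholds.

    Returns "same species", "distinct species", or "conflicting" when the
    identity-based and dDDH criteria disagree and the phylogeny must decide.
    """
    if ani is None and aai is None:
        raise ValueError("provide at least one of ANI or AAI")
    identity_ok = any(
        value is not None and value >= SPECIES_IDENTITY_CUTOFF for value in (ani, aai)
    )
    ddh_ok = ddh >= SPECIES_DDDH_CUTOFF
    if identity_ok and ddh_ok:
        return "same species"
    if not identity_ok and not ddh_ok:
        return "distinct species"
    return "conflicting"
```

Returning a separate "conflicting" outcome rather than forcing a yes/no answer reflects the point made earlier that marginal values should be interpreted together with the genome-based phylogeny and ecological characteristics.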
In summary, prokaryotic systematics is clearly a progressive discipline that has evolved over a century by embracing advanced technologies. While the polyphasic approach provides important information on the cellular composition and lifestyle of an organism, its techniques sometimes show poor reproducibility between laboratories. Since whole genome sequencing has integrated well into prokaryotic systematics, more emphasis could be placed on describing the genetics, i.e., the presence or absence of genes involved in phenotypic properties, where possible. It is probably time to accept the phylogenomic description of novel taxa (species and genera) for validation, without extensive polyphasic characterisation (phenotypic and chemotaxonomic properties), where the quality of genome assemblies is excellent and robust phylogenomic analyses present compelling evidence.