The Next Million Names for Archaea and Bacteria

Latin binomials, popularised in the 18th century by the Swedish naturalist Linnaeus, have stood the test of time in providing a stable, clear, and memorable system of nomenclature across biology. However, relentless and ever-deeper exploration and analysis of the microbial world has created an urgent need for huge numbers of new names for Archaea and Bacteria. Manual creation of such names is difficult and typically relies on expert-driven nomenclatural quality control. To ensure that the legacy of Linnaeus lives on in the age of microbial genomics and metagenomics, we propose an automated approach, employing combinatorial concatenation of roots from Latin and Greek to create linguistically correct names for genera and species that can be used off the shelf as needed. As proof of principle, we document over a million new names for Bacteria and Archaea. We are confident that our approach provides a road map for how to create new names for decades to come.

This creates an urgent unmet need for new taxonomic names for Archaea and Bacteria.
Currently, creation of well-formed names relies on time-consuming nomenclatorial quality control by a dwindling pool of experts conversant with classical languages and the International Code of Nomenclature of Prokaryotes.
These problems are compounded by the custom of creating names in an as-needed, just-in-time fashion.
Here, we outline a novel approach with three features: creation of names en masse before they are tied to taxa; combinatorial concatenation of roots from Latin and Greek, drawing on stocks of roots with relevant meanings; computerised automation of the creation of new names.
Another pressing problem is that most microbiologists follow Shakespeare in possessing, at best, 'small Latin, less Greek' [10] and so are poorly equipped for creating well-formed binomials that comply with the rules of Latin grammar and are presented with clear, plausible etymological justifications (Box 1). Despite the publication of several 'how-to' guides [11][12][13], this skills gap has led to the propagation of numerous malformations. A high-profile example is the species epithet pyloridis, which even passed validation in the International Journal of Systematic Bacteriology before it had to be corrected, according to the rules of Latin grammar, to pylori [14,15]. What's more, bacteriologists are bound by the provisions of the Code, which include many detailed and difficult rules and recommendations on how names should be formulated [5]. These exacting requirements mean that new names have to undergo time-consuming nomenclatorial quality control by a dwindling pool of experts, who are required to be conversant with classical languages, the Code, and contemporary microbiology [16]. These problems are compounded by the custom of creating names in an ad hoc, as-needed, just-in-time fashion, which provides a non-stop drip-by-drip flow of work for nomenclatural experts.
Another key challenge stems from the exhilarating success of high-throughput sequencing and bioinformatics, which, twinned with molecular phylogenetics, represent a remarkable unifying force across the whole of biology, drawing together all cellular organisms into a single great tree of life (http://tolweb.org/tree/). Within microbiology, such advances have been driven by culturomics (high-throughput culture followed by whole-genome sequencing, which has delivered many hundreds of new species) and via metagenomics, which has delivered many thousands of metagenome-assembled genomes (MAGs), mostly from uncultured organisms [17,18]. In addition, bioinformatics analyses have enabled the development of a comprehensive genome-based taxonomy, GTDB, ranging from species up to domains [19].
While nomenclature has largely kept up with culturomics [20], valid publication of names for bacterial and archaeal species currently requires deposition of cultured type strains in public repositories. This requirement controversially precludes application of the ICNP rules to uncultured organisms identified and characterised by metagenomics [21,22] (Box 2). One work-around is to apply the designation Candidatus (abbreviated to Ca.) to names for uncultured taxa [23]. Although the resulting names have no standing according to the Code, the Candidatus approach provides a clear, memorable, and potentially stable nomenclature for uncultured species that mirrors the nomenclature for cultured species. However, so far barely more than 850 species-level Candidatus names have been published in the peer-reviewed literature [24].
Similarly, Latin names have yet to be assigned to the vast majority of new species or genera defined by genome-based taxonomies, which include not only those represented solely by MAGs, but also new taxa for which cultured strains are available. Instead, almost all new genera and species identified in these settings have been assigned unstable, confusing, and hard-to-remember alphanumerical identifiers. For example, the current release of the GTDB (https://gtdb.ecogenomic.org R05-RS95 July 17 2020) documents 31 910 species of Archaea and Bacteria, but 23 171 have only placeholder names, including 4827 species epithets created by the addition of an alphabetical suffix to an existing name (e.g., Helicobacter pylori_A). Similarly, of GTDB's 9428 genera, 6652 have only placeholder names, 708 of them created by appending an alphabetical suffix (e.g., Pseudomonas_D).
So, should we conclude that the legacy of Linnaeus is no longer relevant to microbiology in the age of genomes and metagenomes? Should we be happy to refer to a new species as, for example, UBA6965 or sp000063525? We believe that the answer is a resounding 'no!' However, high-throughput generation of taxa via sequence-based approaches clearly precludes the detailed attention usually applied to the one-by-one construction of Latin binomials. Instead, we propose that the problem can best be solved by automating the creation of well-formed names.

Glossary
Ancient Greek: (abbreviated Gr.) a classical language, the language of the many celebrated poets, playwrights, and philosophers. Even in ancient times, many Greek words were carried over into Latin and this trend continues in the formation of taxonomic names.
Binomial: a Latinised name for biological species, written in italics and composed of two parts, the first capitalised and identifying the genus, the second identifying the species, for example, Escherichia coli.
Candidatus: a category of name for archaeal or bacterial taxa representing as-yet uncultured organisms. The Code grants no standing to such names, but specifies that they should be prefixed with Candidatus (in italics), the genus and species names should be in Roman type, and the entire name in quotation marks; for example 'Candidatus Phytoplasma allocasuarinae'.
Cladistics: an approach to biological classification, pioneered by the German entomologist Willi Hennig, in which organisms are grouped into monophyletic groups (also called clades) and classification strictly reflects phylogeny.
Culturomics: high-throughput culture of microbes as an approach to discover taxonomic novelty.
Etymology: justification for the new name, including a description of constituent terms, their origins, grammatical properties, and meanings.
Genome-based taxonomy: an approach to molecular phylogenetics in which genome sequences are used to create phylogenies and taxonomies: epitomised by the Genome Taxonomy Database (GTDB).
… together with effectively but not validly published names, Candidatus names, and names of cyanobacterial taxa validly published under the Botanical Code.
Metagenome-assembled genome (MAG): a genome sequence that has been reconstructed by assembling and binning metagenomic reads.
Molecular phylogenetics: analysis of inherited differences in informational macromolecular sequences (DNA, RNA, or proteins) to identify evolutionary relationships and construct phylogenetic trees.
Monophyletic group: another term for a clade, a taxonomic group that includes a common ancestor and all its descendants.
Nomenclature: a system for giving names to organisms.
Nomina nuda: a term (singular: nomen nudum) for names that look like taxonomic names but have no standing as they have not been published according to the rules of the relevant nomenclatural code.
Protologue: a description of a new taxon which includes the etymology of the name, a description of the taxon, and of the designated nomenclatural type (a strain for a new species, but a species for a new genus).
Taxonomy: the branch of biology concerned with the classification, identification, and nomenclature of organisms; it can also refer to a particular scheme for categorisation.

Box 1
Latin remains the language of taxonomic nomenclature. This brings the advantages of neutrality and stability, but presents problems for most microbiologists, as Latin is no longer widely taught in schools. Unlike English, Latin is a highly inflected language, where the endings of nouns and adjectives vary according to their role in the sentence and linguistic properties. For example, adjectives change their endings to reflect the gender of the noun they are qualifying, for example, the neuter form faecale is used in Microbacterium faecale, but the masculine form faecalis is used in Enterococcus faecalis. Another problem is that many taxonomic names come from Ancient Greek, which has its own alphabet, and so have to be transcribed into the Roman alphabet and then Latinised before use (experts still argue over whether Acinetobacter should have been Akinetobacter).
These problems are compounded by the fact that bacteriologists are also bound by the Code, which includes 65 rules and dozens of recommendations, some requiring subjective judgements, for example, avoiding names that are 'very long or difficult to pronounce'. The Code insists that words from languages other than Latin or Greek should be avoided if equivalents exist in Latin or Greek, but allows genus names to be created in an arbitrary manner, so long as they are treated as Latin nouns.
All this makes it difficult for the non-expert to get things right, so that up to half of all newly proposed names for Archaea and Bacteria need to be corrected before use. Common problems include trying to use poorly Latinised English words (e.g., geesorum instead of anserum for a species associated with geese) or making up nonsensical etymologies [28]. The Code clarifies that names are primarily labels rather than descriptions: 'The primary purpose of giving a name to a taxon is to supply a means of referring to it rather than to indicate the characters or the history of the taxon'. The Code also explains what is required for names to be validly published, which includes a description of the taxon. Valid publication of names is typically accompanied by a protologue, which includes a description of the taxon with an etymology and designation of type material.

Trends in Microbiology

Box 2. Culture Wars
Anyone seeking a stable system of nomenclature would not start from where we are now. Rather than one code for all organisms, there are several, with different rules for naming algae-fungi-and-plants, animals, and Bacteria-and-Archaea. Cyanobacteria were for a long time treated as plants and so most still lack validly published names according to the ICNP. Oddly, names for phyla have never been included in the Code. And, for cladists, the term 'prokaryote' is deprecated, because prokaryotes are no longer considered a monophyletic group [29][30][31], so one could even argue that the ICNP is misnamed and should instead be named the International Code of Nomenclature of Archaea and Bacteria. It is also worth noting that there is no specific term for people who study Archaea and Bacteria; here, we have tended to use 'bacteriologist', noting that the term 'archaeologist' has been appropriated by another discipline.
Despite claiming that 'Nothing in this Code may be construed to restrict the freedom of taxonomic thought or action', the ICNP is very particular about what counts as type material in naming a species: only living pure cultures meet the requirement, an ironic contrast to the International Code of Nomenclature for algae, fungi, and plants, which requires that type material be dead or at least inert! This requirement for live cultures has attracted controversy. One objection is that almost all microorganisms live in communities and in challenging conditions, so organisms grown in pure culture, at best, provide an incomplete view of the natural world and, at worst, represent a laboratory artefact [32].
A more pressing objection stems from the fact that most microbial species remain uncultured, even though they are increasingly accessible by metagenomics. Many molecular microbiologists now suggest that uncultured organisms should have their own names and that genome sequences should be acceptable as type material [22]. However, after lengthy debate, a proposal to amend the Code to accommodate this request was rejected in March 2020 [33].
In the meantime, we have to muddle through with Candidatus names, which, although mentioned in the Code, have no priority. Does this matter? Probably not: although the de jure position is that a Candidatus name could be replaced by a fresh name any time in the future, the de facto situation is that, for most microbiologists, it will be simply too much effort to create and validly publish new names when perfectly good names already exist. After all, nearly half of the new names assigned to cultured organisms in journals other than the International Journal of Systematic and Evolutionary Microbiology are never validly published, which simply requires a request for inclusion in a Validation List in that journal [34]. Reassuringly, online resources, such as the List of Prokaryotic Names with Standing in Nomenclature (LPSN), National Center for Biotechnology Information (NCBI), and GTDB, already incorporate Candidatus names, and such de facto arrangements do a great job at allowing these names to be used while also preventing confusion over which names have already been used, whether formally or informally.

Automating the Creation of Names
To meet the need for a stable, clear, and memorable nomenclature for the next million bacterial or archaeal species, we propose abandoning the current cottage-industry approach and instead advocate the automated creation of names en masse, in advance of the need to allocate them to biological entities.
How is this even possible? The answer is an approach that reaches back to classical times: joining individual word roots from Greek or Latin together to create compound words with new meanings. For example, the 1st century Greek geographer Strabo gave us Rhinoceros, combining Ancient Greek roots for 'nose' and 'horn', while Linnaeus named the genus Chrysanthemum using the Ancient Greek roots for 'gold' and 'flower'. This principle has been systematised in the Code to create new genus names, with the rule that a connecting vowel -o- is used after Greek roots and -i- after Latin roots. Any new genus name that is created inherits its grammatical properties (including gender and declension) only from the last element in the word formation. As many as four roots have been combined to give us validly published genus names such as Ectothiorhodospira from the Greek roots for 'outside-sulfur-rose-spiral' or Allocatelliglobosispora from the Greek and Latin roots for 'another-chainlet-sphere-spore'.
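The connecting-vowel rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function name `join_roots`, the bare stems passed to it, and the simple elision test (dropping the connecting vowel when the next element begins with a vowel) are all assumptions made for the example.

```python
VOWELS = set("aeiouy")

# Connecting vowels as stated in the text: -o- after Greek roots, -i- after Latin roots.
CONNECTORS = {"greek": "o", "latin": "i"}


def join_roots(stem: str, origin: str, next_element: str) -> str:
    """Join a bare stem to the following element of a compound name.

    Assumption for this sketch: the connecting vowel is elided when the
    following element already starts with a vowel.
    """
    connector = CONNECTORS[origin]
    if next_element[0].lower() in VOWELS:
        connector = ""
    return (stem + connector + next_element).capitalize()


print(join_roots("enter", "greek", "monas"))
print(join_roots("av", "latin", "monas"))
print(join_roots("chrys", "greek", "anthemum"))
```

The three calls rebuild Enteromonas, Avimonas, and Chrysanthemum from stems; the last one exercises the assumed elision branch.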

The Great Automatic Nomenclator
We propose to extend this approach so that very large numbers of genus names can be created using an automated combinatorial concatenation of a relatively small set of starting terms. Let us say we wish to explore a biome-specific generic namespace defined by combinations of three terms (Figure 1, Key Figure). If we select ten roots to be deployed in each of the initial, middle, and final positions, then it becomes possible to create, from just 30 roots, ten-times-ten-times-ten = a thousand names with little effort, an approach we have already used to create names for several hundred new genera from the chicken gut microbiome [25].
To automate this approach, we have created a Python script named the 'Great Automatic Nomenclator' (or Gan: https://github.com/telatin/gan) after a short story by Roald Dahl [26]; 'gan' also means 'garden' in Hebrew, reflecting its fertile productivity. The script takes, as its input, tables of roots in a specified format and then performs combinatorial concatenation, considering ICNP rules governing the use or elision of connecting vowels. In addition, because the input roots have already completed linguistic quality control, the new names are grammatically correct and come complete with etymological justifications that can be used in a protologue. Gan can be interpreted and run using the programming language Python, which can be installed on a range of operating systems (Windows, macOS, Linux, and even mobile devices running iOS or Android). However, version 1.0 still requires detailed curation of the input files and produces rather basic outputs that can be finessed by editing in, for example, Excel. We anticipate that the program will become more user-friendly and productive in subsequent versions.

The Power of Prefixes
Before exploring the full power of this combinatorial approach, let us take a quick look at an easy win in creating new names that reflect phylogenetic positions. Since the time of Linnaeus, when advances in taxonomy demand that a new taxon be split from an existing taxon, it has been common practice to add a short prefix (or less commonly, a suffix) to the existing name to create a new name, with an etymology that defines the new taxon as 'related to but distinct from the' pre-existing taxon. This approach has already seen extensive use for names for Bacteria (https://lpsn.dsmz.de/text/genera-named-after-other-genera), using prefixes such as neo- (from the Greek for 'new') or allo- (from the Greek for 'other').
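To make the combinatorial idea concrete, the following is a minimal sketch of three-position concatenation with a running etymology. It is not the Gan code and does not reproduce Gan's input format: the root tables, their glosses, and the function name `generate` are hypothetical, and the combining forms are assumed to already carry their connecting vowels.

```python
from itertools import product

# Hypothetical root tables: (combining form, gloss).
# Initial and middle forms are assumed to include their connecting vowel already.
INITIAL = [("galli", "chicken"), ("avi", "bird")]
MIDDLE = [("faeci", "faeces"), ("stercori", "faeces")]
FINAL = [("monas", "unit"), ("bacterium", "small rod"), ("microbium", "microbe")]


def generate(initial, middle, final):
    """Yield (name, etymology) for every initial-middle-final combination."""
    for (a, a_gloss), (b, b_gloss), (c, c_gloss) in product(initial, middle, final):
        name = (a + b + c).capitalize()
        yield name, f"{a_gloss} + {b_gloss} + {c_gloss}"


names = dict(generate(INITIAL, MIDDLE, FINAL))
print(len(names))    # 2 * 2 * 3 combinations
print(10 ** 3)       # ten roots in each of three positions
for name, etymology in sorted(names.items())[:3]:
    print(name, "=", etymology)
```

Because the number of names is the product of the table sizes, adding one root to any position multiplies the output rather than adding to it, which is why 30 roots suffice for a thousand names.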

We have collated a list of 38 prefixes that can be used for this purpose (Table 1). We then used Gan to apply these prefixes to validly published names for Bacteria and Archaea. Using this approach, we have been able to create over 130 000 new genus names and over 700 000 species names, complete with grammatical metadata and etymological justifications, that can be applied to sister taxa related to, but distinct from, already named taxa (see Tables S1-S5 in the supplemental information online). Of course, some of these names will never be used as the number of new names is much larger than the number of new sister taxa that are likely to be discovered. However, as proof of immediate utility, this approach could be applied to all genera marked in GTDB simply with an alphabetical suffix (Bacillus_A, Bacillus_B, etc.) to generate well-formed Latin names for over 600 new genera. It is also worth noting that there are precedents for incorporating more than one prefix into a bacterial name (e.g., Parapseudoflavitalea or Allopseudarcicella), so if we allow two prefixes to be added to all existing names (while avoiding using the same prefix twice) we would be able to generate over 4 million new genus names and 29 million new species names.
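The arithmetic behind the one- and two-prefix counts can be illustrated with a short sketch. The prefix list here is a four-item illustrative subset rather than the 38 prefixes of Table 1, the function name `prefixed_names` is hypothetical, and the code concatenates naively, so it ignores the vowel elision seen in names such as Allopseudarcicella.

```python
from itertools import permutations

# Illustrative subset only; the article's Table 1 lists 38 prefixes.
PREFIXES = ["neo", "allo", "para", "pseudo"]


def prefixed_names(genus, prefixes, max_prefixes=2):
    """Yield names with one or two distinct prefixes (naive concatenation)."""
    base = genus.lower()
    for n in range(1, max_prefixes + 1):
        # permutations never repeats an item, so no prefix is used twice
        for combo in permutations(prefixes, n):
            yield ("".join(combo) + base).capitalize()


names = list(prefixed_names("Bacillus", PREFIXES))
print(len(names))       # 4 single prefixes + 4 * 3 ordered pairs
print(38 + 38 * 37)     # candidates per existing name with 38 prefixes
```

With 38 prefixes, each existing name yields 38 single-prefix candidates and 38 × 37 = 1406 ordered two-prefix candidates, which is where the jump from hundreds of thousands to millions of names comes from.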

Flexible Endings
Often in the past, final word elements for bacterial genus names have reflected cellular morphology, as in the ending 'coccus' in Enterococcus, describing a coccus associated with the gut. However, if we are to create a set of names that can be applied flexibly to any bacterium or archaeon, particularly to uncultured genera, we need to use last-word elements that can be used without knowledge of phenotypic characters (e.g., cellular or colonial morphology). We have therefore collated a set of last-word elements that can be used in genus names derived from biomes and/or in association with proper nouns (Table 2). Use of such elements brings not just remarkable combinatorial power, but also minimises clashes with botanical and zoological codes.

People and Places
Under the auspices of the ICNP, Archaea and Bacteria have often been named after places or people (mythical or real). In 2005, the nomenclature expert Hans Trüper expressed exasperation at excessive use of this approach with place names, which he termed 'localimania'. However, the practice continues, with the most salient example being the use of Massilia (the Latin name for Marseille) by the IHU Méditerranée Infection, a term that has found its way into over 260 species or genus names [20]. Usefully, these include validly published precedents for combining a proper noun with other roots, for example, Methanomassiliicoccus. The way is thus open for combining names of places associated with identification of new taxa through genomic or metagenomic analyses with additional roots, including our set of last-word elements, for example, Brisbanmonas, Brisbanibacterium, for species delineated by the GTDB project in Brisbane (Figure 1).
Linnaeus made widespread use of names drawn from mythology. This practice continues in microbiology. For example, the genus name Cronobacter was applied to a pathogen of children after Cronos, a Titan who swallowed his children as soon as they were born. Again, this approach provides a precedent for combining a proper noun with other roots in, for example, Neptunicoccus or Poseidonocella, and paves the way for the creation of names for new taxa identified through genomic or metagenomic analyses. For example, combining our flexible end elements with names for over 100 sea deities drawn from diverse cultures, we have been able to create names for over 1000 marine microorganisms (Tables S1 and S2).
The ICNP provides rules for a well-established approach for turning surnames into genus names by the addition of Latin endings or diminutives; examples include Escherichia and Salmonella. Recently, this approach has been broadened into combining personal names with other roots, but so far only for a couple of dozen names [27]. Application of this approach to the several hundred surnames that have already been used in genus names for bacteria and archaea would allow the creation of many thousands of new names (e.g., Salmoniimonas, Salmoniiplasma, Salmoniimicrobium). However, we note that many of those who created the conceptual and technical framework for microbial taxonomy have yet to be honoured in our discipline; for example, you will look in vain for Archaea or Bacteria named after Carl Linnaeus, Charles Darwin, or Willi Hennig! We have therefore compiled a gender-balanced list of 80 worthy scientists and have used our program to create 640 new names from this list, including Darwiniibacterium and Hennigiimonas (Tables S1 and S2).

Binomials for Biomes
Another well-established approach for naming new taxa is to describe the habitat or biome in which the organism is found. For example, as we have noted, Enterococcus describes a coccus found in the gut. However, in the age of metagenomics and microbiome research, we need new genus names for inhabitants of each microbiome by the dozen or even in the hundreds.
Fortunately, this need can be easily met using our combinatorial approach to link our final word elements to terms that describe an organ or a host, for example, Enteromonas, for a microbe associated with the intestine, or Avimonas for one associated with birds. As the common names from classical languages for organs or animals typically provide multiple roots for the same organ/tissue (e.g., faeci-, merdi-, excrementi-, stercori-, cacco- for faeces) or for the same animal (e.g., galli-, pulli-, alectryo-, cotto- for chicken), this approach has allowed us to generate thousands of names for new genera from animal microbiomes, drawing on over 200 curated roots specifying organs, tissues, or hosts (Tables S1 and S2). The same approach can be applied to the use of classical roots for biomes associated with plants, for example, Leguminimicrobium for a microbe from beans, or with the abiotic environment, for example, Oceanimonas for a microbe from the oceans, or Chthonomicrobium for a subterranean microbe (Tables S1 and S2).
This combinatorial approach proves particularly powerful when, as well as using common names from classical languages, one exploits the genus name of the host as a neo-Latin term that can be combined with other roots, for example, Drosophilimonas or Arabidopsidimicrobium. As there are hundreds of thousands of named genera of eukaryotes, this opens up the creation of millions of names for host-associated bacterial genera. Similarly, adopting neo-Latinised versions of technical terms for a particular biome, for example, generating roots such as nasopharyngo-, lotici-, bioreactori- or phylloplani-, brings added precision and enhanced fecundity to the creation of names for the inhabitants of microbiomes.

Stepping up to Three Roots
The remarkable power of combinatorial concatenation steps up a gear when we move from two roots in a row to three. Here, we propose an approach in which the first root specifies a general context, for example, a host, a general environment, a person or a place, while the second root specifies a more specific context, such as an organ or tissue or a specific environment. Using our software on terms for animal hosts and their organs/tissues together with our final word elements, we have generated over 100 000 new genus names for inhabitants of animal microbiomes (Tables S1 and S2). This approach could also be used for biomes from the abiotic environment, for example, giving us Chthonohydromonas for a microbe from a subterranean water source. However, even more names can be created if existing host genus names or personal nouns are used in the first root position, for example, Triticirhizomicrobium, Darwiniintestimonas, or Brisbaniiterriplasma.

The Species Problem
So far, we have concentrated on the creation of genus rather than species names. Nonetheless, a similar principle of combinatorial concatenation of classical roots works here too, even though the context is slightly different. For a start, a species name consists of a genus name and a species epithet. Unlike genus names, species epithets are typically genitive nouns or adjectives, although nouns in the nominative case in apposition are occasionally used. This creates an exacting requirement for selecting the correct form of the noun in the genitive case or of the adjective in the nominative case, which has to take the same gender as the genus name. This has become easier lately, as the requisite forms for such nouns and adjectives can be found in online dictionaries, such as Wiktionary (https://www.wiktionary.org). Moreover, because the grammatical properties of a species epithet are inherited from the final component of the word, as long as that final term is formatted appropriately, it can be used effortlessly and repeatedly in multiple constructions. Furthermore, unlike a genus name or a Linnaean binomial, a species epithet need not be unique and can in fact be used again and again; for example, the epithet massiliensis has been used over 100 times. Thus, as typically only a few dozen species names are needed per genus, a preformed stock of names can easily be created for each biome by reusing single roots across multiple genera (e.g., avium, gallinarum, for species associated generally with birds or specifically with chickens) or by adding multiple roots in front of a final element (e.g., merdavium, faecavium, caccavium for species associated with bird faeces).
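The gender-agreement requirement for epithets can be illustrated with a small lookup, offered here as a sketch under stated assumptions rather than as how Gan handles species names. The dictionaries, the function name `species_name`, and the practice of passing the genus gender in by hand are all hypothetical; the faecalis/faecale forms follow the examples given in Box 1.

```python
# Genitive nouns keep a single form whatever the gender of the genus name.
GENITIVE_EPITHETS = {"avium": "of birds", "gallinarum": "of hens"}

# Adjectives must agree with the gender of the genus name.
# faecalis is a two-termination adjective: -is (masculine/feminine), -e (neuter).
ADJECTIVE_EPITHETS = {
    "faecalis": {"masculine": "faecalis", "feminine": "faecalis", "neuter": "faecale"},
}


def species_name(genus, genus_gender, epithet):
    """Combine a genus name with an epithet, inflecting adjectives for gender."""
    if epithet in GENITIVE_EPITHETS:
        form = epithet
    else:
        form = ADJECTIVE_EPITHETS[epithet][genus_gender]
    return f"{genus} {form}"


print(species_name("Enterococcus", "masculine", "faecalis"))
print(species_name("Microbacterium", "neuter", "faecalis"))
print(species_name("Enterococcus", "masculine", "avium"))
```

Storing each adjective once with all of its gendered forms reflects the point made above: once the final element is formatted correctly, it can be reused across any number of genera.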

Concluding Remarks and Future Perspectives
In 1999, the nomenclature expert Hans Trüper claimed 'in view of the million names that will have to be formed in the future… [arbitrary names] are a simple necessity, whether Latin formalists like them or not.' [12]. Contrary to Trüper, here we have shown how combinatorial use of Greek and Latin roots could be used to create millions of well-formed taxonomic names for Bacteria and Archaea. What's more, we have put this principle into practice in the documentation of a million names in the supplementary material (Tables S1-S5).
In so doing, we have outlined a scalable system for filling taxonomic namespace that circumvents onerous and expert-dependent one-by-one creation of names, exploiting computational automation to deliver millions of names that are linguistically correct, meet the requirements of the ICNP, and so can be used off the shelf, as needed. We are thus providing added impetus to efforts to create a nomenclature for uncultured organisms and hold a mirror up to the current failure to incorporate uncultured organisms into the Code. We expect that our approach could be broadened to cover the need for well-formed names across the whole of the Darwin tree of life (https://www.darwintreeoflife.org). We have started a process that raises many questions (see Outstanding Questions) and have created a program that has to be run over the command line, but we predict that, one day, naming Bacteria and Archaea might be as easy as using Google Translate. In the meantime, we have provided a template showing how input files for Gan should be formatted (Table S6).
The software we have created for this purpose is freely available. However, it comes with the warning: 'Caveat Nomenclator!' in that it will concatenate terms that simply do not belong together. For example, Gallidentimonas might appear to be a well-formed Latin name, but it is nonsensical as hens do not have teeth. Similarly, some newly created names might fall foul of the ICNP recommendation to avoid names that are overly long, difficult to pronounce or are disagreeable in form (e.g., with prefixes repeated in tandem as in, say, neoneoaurum), so we recommend sorting names by size, using the shortest first, and weeding out repetitive forms. In addition, as the current version of the software does not check whether a name has already been listed by LPSN or used under any of the other nomenclature codes, users should check this themselves.
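The recommended post-processing (shortest first, no tandem repeats, no names already in use) might be sketched as follows. This is not part of Gan: the function names, the 20-letter length cut-off, and the `existing` list standing in for an export from LPSN are all hypothetical choices made for illustration, and the Code itself sets no numerical length limit.

```python
import re


def has_tandem_repeat(name, min_len=3):
    """True if any run of at least min_len letters is immediately repeated."""
    pattern = r"([a-z]{%d,})\1" % min_len
    return re.search(pattern, name.lower()) is not None


def shortlist(candidates, existing_names, max_length=20):
    """Drop used, overlong, or repetitive names; return the rest shortest first."""
    taken = {n.lower() for n in existing_names}
    kept = [
        n for n in candidates
        if n.lower() not in taken
        and len(n) <= max_length          # hypothetical threshold
        and not has_tandem_repeat(n)
    ]
    return sorted(kept, key=lambda n: (len(n), n))


candidates = [
    "Neoneobacillus", "Alloparabacillus", "Parabacillus",
    "Allobacillus", "Neobacillus", "Parapseudoneoallobacillus",
]
existing = ["Neobacillus"]  # stand-in for a list of names already in use
print(has_tandem_repeat("neoneoaurum"))
print(shortlist(candidates, existing))
```

A filter of this kind cannot catch semantic nonsense such as Gallidentimonas, so human review of the shortlisted names is still needed.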

Outstanding Questions
How far can the creation of taxonomic names be automated? Will it soon prove possible for bacteriologists to feed in terms in any modern language to a completely automated and unsupervised system to generate any taxonomic names that they need? We must also stress that we have not created or even named any new taxa, merely provided software to generate names that could be used for this purpose, but only once they have been published in peer-reviewed journals and have been properly attached to nomenclatural types. For the time being, our names remain naked, as what the jargon calls nomina nuda! The challenge now is for readers in the microbiology community to clothe them with strains, sequences, circumscriptions, positions, and ranks.