1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life

Metagenomic and microbial sequence data are made easier to interpret with the addition of 1,003 genomes to the Genomic Encyclopedia of Bacteria and Archaea. We present 1,003 reference genomes that were sequenced as part of the Genomic Encyclopedia of Bacteria and Archaea (GEBA) initiative, selected to maximize sequence coverage of phylogenetic space. These genomes double the number of existing type strains and expand their overall phylogenetic diversity by 25%. Comparative analyses with previously available finished and draft genomes reveal a 10.5% increase in novel protein families as a function of phylogenetic diversity. The GEBA genomes recruit 25 million previously unassigned metagenomic proteins from 4,650 samples, improving their phylogenetic and functional interpretation. We identify numerous biosynthetic clusters and experimentally validate a divergent phenazine cluster with potential new chemical structure and antimicrobial activity. This Resource is the largest single release of reference genomes to date. Bacterial and archaeal isolate sequence space is still far from saturated, and future endeavors in this direction will continue to be a valuable resource for scientific discovery.


R e s o u R c e
Systematic surveys of the diversity of cultivated microorganisms have lagged behind improvements in sequencing technologies. Traditionally, most isolate sequencing projects are chosen based on the clinical or biotechnological relevance of the target organisms or their physiology 1 . In 2015, 43% of sequenced bacterial genomes comprised just ten human pathogenic species. While sequencing different strains of the same species aided our understanding of pathogenesis, the focus on specific bacterial species results in a biased phylogenetic representation of sequence space. This skewed phylogeny narrowed our view of the functional and evolutionary diversity of microbial life. There is a direct correlation between phylogenetic distance and novel function discovery 2,3 , which suggests that filling the gaps in the phylogenetic tree might result in a substantial increase in new genes, protein families and pathways 4 .
Reference genomes can fill phylogenetic gaps, but also serve as anchors for the identification of sequence fragments from metagenomic studies. Previous efforts to expand the bacterial and archaeal reference genomes by targeted sequencing of phylogenetically underrepresented lineages have enabled vast improvements in taxonomic assignment in metagenomic data sets 5 . Furthermore, access to completed genomes enables more accurate whole-genome-based taxonomic assignments 6,7 and improved phylogenies 8,9 . Bacterial and archaeal type strains are the representative unit of a microbial species, and are chosen when the species name is established. Type strains are maintained in at least two different culture collections and provide easy access to source strain material for subsequent experiments. Typically, a type strain has well-characterized taxonomic and phenotypic data, isolation source metadata, and other criteria, as defined by the International Code of Nomenclature of Prokaryotes (ICNP) 10 . As of December 5, 2015, there were 12,981 bacterial and archaeal species with valid, published names, with 650 new type strains added (on average) every year 11,12 . However, despite their importance, the genomes of only 826 type strains were publicly available at the start of this study.
The Genomic Encyclopedia of Bacteria and Archaea (GEBA) pilot project presented the analysis of 56 type-strain genomes and validated the usefulness of a phylogeny-driven 'encyclopedia' of bacteria and archaea 3 . We now present a substantially expanded data set (GEBA-I) comprising 1,003 reference genomes from 974 bacterial and 29 archaeal type strains. Our objectives were to provide an expanded reference genome catalog of broad phylogenetic and physiological diversity, to determine how this catalog facilitates the discovery of protein families and expands the diversity of known functions, and to ascertain whether these type-strain genomes improve the recruitment and phylogenetic assignment of existing metagenomic sequences.

RESULTS
Increased phylogenetic diversity of microbial genomes
974 bacterial and 29 archaeal genomes (from 579 genera in 21 phyla and 43 classes) were sequenced as part of the GEBA Initiative (GEBA-I), using a phylogeny-based scoring system for strain selection 6,13 .

Of the 1,003 genomes presented, 396 GEBA-I genomes were the first sequenced representative of a genus (Fig. 1a). The Caldithrixae, Deferribacteres, Synergistetes and Thermodesulfobacteria (Fig. 1a) phyla have the most new genera. The most populous phyla, in terms of numbers of genomes sequenced, were the Proteobacteria (with 330 genomes), Firmicutes (178), Bacteroidetes (163) and Actinobacteria (157). The remaining 175 genomes belonged to 17 additional phyla, including the only sequenced representative of the Caldithrixae phylum (Supplementary Table 1). The GEBA-I strains originate from a multitude of habitats including extreme environments, terrestrial biomes, industrial waste and human body sites (Supplementary Fig. 1) and unsurprisingly have diverse physiology, genome size and average G+C content (Supplementary Fig. 2). GEBA-I is a high-quality reference resource with 99.4% (on average) genome completeness (assessed using CheckM 14 ; Supplementary Table 1). Annotation of the 1,003 GEBA-I genomes resulted in 3,472,483 predicted genes from 3.75 Gbp of assembled sequence data (Supplementary Fig. 3 and Supplementary Table 1). All GEBA-I genomes are publicly available through the Integrated Microbial Genomes with Microbiomes (IMG/M) system 15 and GenBank, and the corresponding strains through the respective culture collection (Supplementary Table 1).
To quantify the increase in phylogenetic diversity contributed by GEBA-I genomes compared with all previously available, validly named archaeal and bacterial species (i.e., type strains), we measured the diversity distance of all sequenced type strains in a comprehensive 16S rRNA gene tree 6 . The GEBA-I genomes increased the phylogenetic distance threefold, expanding the overall diversity of the type-strain sequence space by ~24% (Fig. 1b). Further, we applied a whole-genome comparative analysis based on the average nucleotide identity to verify the relative novelty of the GEBA-I genomes compared to a set of 14,625 control genomes. We found that the vast majority (845/1,003) of the GEBA-I genomes were 'singletons' on the basis of the proposed criteria for defining a "species group" 7 , verifying that no other sequenced representative of that species is available.
Expanding the universe of known proteins
A total of 3,402,887 protein-coding sequences were predicted from the 1,003 GEBA-I genomes. We compared this data set with 23,470,984 non-redundant proteins from all available (14,625) control bacterial and archaeal genomes. Clustering ~26 million total proteins at ~30% sequence identity over 80% alignment length using KClust resulted in 1.89 million protein clusters (containing at least two sequences) and 2.6 million singletons. Of these, 55,105 clusters and 436,840 singletons were composed of proteins from GEBA-I genomes only (Supplementary Table 2), corresponding to a 10.5% increase in known protein sequence diversity.
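The tally of GEBA-I-only clusters and singletons described above can be sketched as follows. This is a minimal illustration, assuming the clustering output has already been parsed into (cluster ID, genome ID) pairs, one per protein; the function name `split_geba_only` and the toy identifiers are hypothetical and do not reflect KClust's actual output format.

```python
from collections import defaultdict

def split_geba_only(assignments, geba_genomes):
    """Return (cluster_ids, singleton_ids) whose proteins all come from GEBA-I genomes.

    assignments: iterable of (cluster_id, genome_id), one entry per protein (assumed input).
    geba_genomes: set of genome IDs that belong to GEBA-I.
    """
    members = defaultdict(list)
    for cluster_id, genome_id in assignments:
        members[cluster_id].append(genome_id)

    clusters, singletons = [], []
    for cluster_id, genomes in members.items():
        # Keep only groups with no protein from a control genome.
        if all(g in geba_genomes for g in genomes):
            if len(genomes) == 1:
                singletons.append(cluster_id)   # one protein only
            else:
                clusters.append(cluster_id)     # at least two proteins
    return clusters, singletons

# Toy example with hypothetical IDs.
toy = [("c1", "geba_A"), ("c1", "geba_B"),
       ("c2", "geba_A"), ("c2", "ctrl_X"),
       ("c3", "geba_B"), ("c4", "ctrl_X")]
print(split_geba_only(toy, {"geba_A", "geba_B"}))
```

In the toy input, only `c1` is a GEBA-I-only cluster and only `c3` is a GEBA-I-only singleton; `c2` is excluded because it also contains a control protein.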
To test if this represents a meaningful increase, or a mere continuation of a trend that has been ongoing since the advent of whole genome sequencing, we calculated the growth rate of new protein families (per 1,000 genomes) (Fig. 2a), and the number of protein families added by newly sequenced bacterial and archaeal genomes over time (i.e., in chronological order of their date of release; Fig. 2a, inset). First, we observed that the growth rate of new protein families markedly declined after the first 2,000 sequenced genomes. Addition of the GEBA-I genomes (noted in red) resulted in a dramatic increase in the growth rate of new protein families, equivalent to the protein family novelty initially observed with the first 2,000 genomes. Second, we found that the number of protein families added over time was initially large with the addition of the first 5,000 genomes, but almost plateaued at around 15,000 genomes (Fig. 2a, inset). The addition of GEBA-I genomes led to a substantial increase in the number of added protein families (Fig. 2a, inset). Together, this reinforces the hypothesis that substantial functional gene novelty remains to be discovered within the cultivated genome space and suggests that continued phylogeny-driven sequencing efforts will result in an expanded catalog of diverse protein families.
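As an illustration of the growth-rate calculation, the sketch below counts how many previously unseen protein families each successive block of genomes contributes when genomes are processed in chronological order of release. It assumes each genome is represented as a set of protein-family IDs; the function name and the window parameter are illustrative choices, not the authors' pipeline.

```python
def new_families_per_window(genome_families, window=1000):
    """Count newly observed protein families per block of `window` genomes.

    genome_families: list of sets of family IDs, ordered by release date (assumed input).
    Returns one count per block; a trailing partial block is reported as-is,
    without rescaling to the full window size.
    """
    seen = set()
    counts = []
    new_in_block = 0
    for i, families in enumerate(genome_families, start=1):
        novel = families - seen          # families not seen in any earlier genome
        new_in_block += len(novel)
        seen |= novel
        if i % window == 0:
            counts.append(new_in_block)
            new_in_block = 0
    if len(genome_families) % window:
        counts.append(new_in_block)      # trailing partial block
    return counts

# Toy example: four genomes, blocks of two.
print(new_families_per_window([{1, 2}, {2, 3}, {3}, {4, 5, 6}], window=2))
```

A cumulative sum over the returned counts gives the total number of families over time, which corresponds to the kind of curve shown in the Fig. 2a inset.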
In order to explore whether increased functional novelty is correlated with specific phylogenetic lineages, we examined the minimum 16S rRNA gene distance compared to the total number of new protein clusters for each GEBA-I genome (Fig. 2b). In general, genomes with increased phylogenetic distance (i.e., greatest 16S distance from reference) encoded the greatest number of novel protein families. As expected, many of the genomes with the greatest phylogenetic distance and number of novel genes belonged to phyla for which few or no sequenced representatives were previously available (Fig. 1a). For example, Ktedonobacter racemifer 16 , a member of the phylum Chloroflexi, contributed 5,102 genes to GEBA-I-only clusters and singletons (Fig. 2b). However, a handful of GEBA-I genomes with closely related reference genomes (i.e., near-identical 16S rRNA gene sequences) also encoded a preponderance of novel genes. The most striking outliers were Mycobacterium genavense ATCC 51234 and Promicromonospora kroppenstedtii RS16, DSM 19349, contributing 1,327 and 2,038 novel genes, respectively (Fig. 2b and Supplementary Table 2). For the M. genavense genome, this observation is explained by the highly conserved nature of the 16S rRNA gene for this group, with other sequenced markers revealing a higher rate of polymorphism; for example, the 16S-23S internal transcribed spacer is preferred for species discrimination 17,18 . Thus, the close evolutionary relationship for M. genavense implied by this minimum 16S rRNA gene distance (distance = 0.018, Mycobacterium parascrofulaceum) is likely an underestimation, and not a good indicator of actual evolutionary distance for this genome. Conversely, the relatively smaller sizes of genomes with high 16S distance to reference but few novel genes (e.g., Mycoplasma elephantis, Allofustis seminis, both host-associated) suggest that they may have undergone streamlining or genome reduction.

Figure 1 (caption fragment) The number of new genera added by GEBA-I per phylum is displayed next to the pie charts. Bootstrap support values ≥50% are shown with small circles on nodes with robust phylogenetic support. (b) Overall increase in 16S rRNA gene diversity relative to all the type strains. Blue denotes the genetic diversity covered by 828 genomes of type strains before GEBA-I, red denotes the diversity covered by the GEBA-I genomes and gray denotes the remaining type strains lacking a genome sequence. Balanced relative phylogenetic diversity (bRPD) was calculated by adding branch lengths between each leaf and root node in the tree followed by proportional downweighting of internal branches 6 .
Exploring GEBA-I-only protein clusters
A total of 55,105 clusters were composed exclusively of proteins from GEBA-I genomes. Approximately 25% of these clusters (13,371 in total) contained proteins arising from a single genome (designated here as "homogeneous" or paralogous clusters), and possibly result from lifestyle-specific gene expansion, or from proliferation of integrated elements like phage or transposons (Fig. 2c). For example, the 13.6-Mbp genome of Ktedonobacter racemifer contributed a striking 411 homogeneous clusters, the largest number proportional to genome size of all the analyzed GEBA-I genomes; most of these clusters are implicated in regulatory functions, such as two-component signal transduction systems (TCS) involved in sensing and responding rapidly to environmental stimuli. Although TCS themselves are not novel, the K. racemifer encoded genes (e.g., Histidine Kinase, Cluster ID: 2509672) have a novel domain configuration involving multiple sensory PAS folds 19 , and high levels of sequence divergence from existing TCS (Supplementary Fig. 4). Four related clusters (Cluster IDs: 2586264, 809557, 4221619, 3082022) from the termite hindgut isolate Sphaerochaeta coccoides may represent another lifestyle-specific expansion 20 , with some clusters arranged as tandem arrays (Supplementary Fig. 5), suggesting gene expansion by recent gene duplication.
For the remaining 41,734 clusters in GEBA-I genomes (designated as "heterogeneous clusters"), varying levels of "heterogeneity" were identified in terms of membership within the same genus, family, order or class (Fig. 2c). We found a subset of clusters that originated from members of two or more phyla (designated as "hyper-heterogeneous" clusters; Fig. 2c). One of these clusters is a four-protein cluster (66% amino acid identity, Cluster ID: 2968370) present in four disparate species (Thermodesulfobacterium hveragerdense, Thermodesulfobacterium thermophilum, Thermodesulfovibrio thiophilus, Desulfurella acetivorans) from three phyla (Thermodesulfobacteria, Nitrospirae and Proteobacteria) that share a common physiology of thermophilic anaerobic sulfur reduction. While members of these particular genera or their higher taxonomic groups may not be well represented in sequence databases, the lack of cluster membership from genomes of relatively well-saturated phyla such as Proteobacteria is curious, suggesting horizontal gene transfer among these possibly cohabiting species. Further support for this speculation may be the putative function of the proteins themselves: rhodanese-like sulfotransferases, described as versatile proteins using persulfide chemistry to accomplish cellular functions ranging from cell cycle progression to stress resistance to sulfur metabolism 21 . A case with no apparent unifying theme in terms of known ecological niche or physiology is a co-localized pair of three-gene clusters (Cluster IDs: 4177102 and 4403394 with 49% and 43% amino acid identity, respectively) from two domains of life, namely, Maritalea myrionectae, Cucumibacter marinus (both Proteobacteria) and Methanolobus tindarius (an archaeon), with possible functions in quinolone export.
Hyper-heterogeneous clusters are curious instances of phylogenetic discordance, that is, cases in which the phylogenetic history of an individual gene differs from the known species history. Plausible explanations for this observation (as reviewed by Galtier and Daubin 22 ) include: horizontal gene transfer, where the phylogeny is influenced by the number and nature of transfers that have transpired; incomplete lineage sorting due to rapid speciation events, that is, the ancestral polymorphism is not fully resolved into two monophyletic lineages when the second speciation occurs; hidden paralogy, where the phylogeny of paralogs partly reflects the duplication history of the gene independent of species divergence history; or convergent evolution.
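The cluster categories used above (homogeneous, heterogeneous and hyper-heterogeneous) can be expressed as a simple rule on cluster membership. The sketch below is an illustration only: it assumes each GEBA-I-only cluster is given as a list of (genome, phylum) pairs with at least two proteins, and it reports hyper-heterogeneous clusters as their own label although the text treats them as a subset of the heterogeneous clusters. The function name is hypothetical.

```python
def classify_cluster(members):
    """Label a GEBA-I-only cluster by the taxonomic spread of its members.

    members: list of (genome_id, phylum) tuples, one per protein (assumed input,
    expected to contain at least two proteins).
    """
    genomes = {genome for genome, _ in members}
    phyla = {phylum for _, phylum in members}
    if len(genomes) == 1:
        return "homogeneous"            # all proteins from a single genome (paralogous)
    if len(phyla) >= 2:
        return "hyper-heterogeneous"    # members span two or more phyla
    return "heterogeneous"              # several genomes, one phylum

# The four-protein cluster described in the text (Cluster ID: 2968370).
example = [
    ("Thermodesulfobacterium hveragerdense", "Thermodesulfobacteria"),
    ("Thermodesulfobacterium thermophilum", "Thermodesulfobacteria"),
    ("Thermodesulfovibrio thiophilus", "Nitrospirae"),
    ("Desulfurella acetivorans", "Proteobacteria"),
]
print(classify_cluster(example))
```

Finer levels of heterogeneity (same genus, family, order or class) could be handled the same way by passing additional taxonomic ranks per member.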
The large number of singletons identified in the GEBA-I genomes represents potential new functions and confirms that a large proportion of functional novelty still remains to be captured. One such example is a putative pepsin A encoded by Endozoicomonas elysicola DSM 2238, isolated from the gastrointestinal tract of a mollusk sea slug. Although pepsin-like enzymes are commonly found in eukaryotes, the E. elysicola candidate is the first instance of a secreted bacterial pepsin (based on a signal peptide) containing all the conserved residues of its eukaryotic counterparts (Supplementary Fig. 6). To verify that singletons are not artifacts of gene prediction pipelines, we assessed their size distribution and presence of signaling or other structural motifs (Supplementary Table 2). Based on this, more than 70% of singletons are >100 amino acids in length, and of these, 31% possess either a signal peptide or two or more transmembrane helices.
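The singleton plausibility check described above can be sketched as a two-step filter: first by protein length, then by the presence of a signal peptide or at least two transmembrane helices. The record fields (`length`, `has_signal_peptide`, `tm_helices`) are hypothetical names chosen for this illustration; in practice these values would come from upstream prediction tools.

```python
def singleton_support(singletons, min_length=100, min_tm_helices=2):
    """Return (fraction longer than min_length, fraction of those with a structural motif).

    singletons: list of dicts with keys 'length' (amino acids),
    'has_signal_peptide' (bool) and 'tm_helices' (int); field names are assumptions.
    """
    long_enough = [s for s in singletons if s["length"] > min_length]
    with_motif = [
        s for s in long_enough
        if s["has_signal_peptide"] or s["tm_helices"] >= min_tm_helices
    ]
    frac_long = len(long_enough) / len(singletons) if singletons else 0.0
    frac_motif = len(with_motif) / len(long_enough) if long_enough else 0.0
    return frac_long, frac_motif

# Toy records, not real data.
toy = [
    {"length": 150, "has_signal_peptide": True, "tm_helices": 0},
    {"length": 250, "has_signal_peptide": False, "tm_helices": 3},
    {"length": 120, "has_signal_peptide": False, "tm_helices": 1},
    {"length": 80, "has_signal_peptide": True, "tm_helices": 0},
]
print(singleton_support(toy))
```

Note that the second fraction is computed relative to the length-filtered set, matching the "of these" phrasing in the text.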
Biosynthetic clusters for secondary metabolites
Microbial secondary metabolites are organic compounds that are not directly involved in primary growth and development, but rather have auxiliary functions such as defense, communication and other interactions. Genes encoding biosynthetic enzymes for the synthesis of secondary metabolites are typically co-localized on the chromosome and are referred to as "biosynthetic gene clusters" (BCs). While only a few of the selected type strains in this study were known to be prolific producers of secondary metabolites, a large bounty of potential new BCs were predicted in the GEBA-I genomes (Supplementary Table 3).
A total of 23,839 BCs were predicted from 1,003 GEBA-I genomes using the IMG-ABC system 23 . Three Pseudonocardiaceae genomes (Pseudonocardia acaciae, P. spinosispora and Sciscionella marina) encoded the greatest total number of BCs among all GEBA-I genomes (Fig. 3a). These included numerous nonribosomal peptide synthetases and polyketide synthases, as well as lantipeptides, bacteriocins, ectoines, thiopeptides, and others. We observed a clear correlation between the number of predicted BCs and genome size, with an average of 6.41 (±2.4 s.d.) BCs predicted per Mb of sequence (Supplementary Fig. 7). Actinobacterial genomes were outliers with an average of 9.58 (±3.4 s.d.) BCs per Mb. This observation likely reflects their particular ecological niches, which involve multiple (perhaps antagonistic) interactions with cohabiting microbes (e.g., P. acaciae was isolated from a competitive plant rhizosphere environment). While Streptomyces species are known to be prolific producers of antibiotics and other natural products 24 , genomes from the Nocardiaceae and Pseudonocardiaceae families of Actinobacteria had not been sequenced extensively before this study, and therefore had not been intensively targeted for BC gene discovery. Given that six of the top ten BC-rich genomes in GEBA-I belong to these two families, future sequencing efforts focused on these clades may prove fruitful for discovering natural products.
On average, the GEBA-I genomes devote nearly 10% of their genome to secondary metabolite biosynthesis, with actinobacterial GEBA-I genomes apportioning an average 16.5% (±8% s.d.) of their genome. Among the actinobacterial GEBA-I soil isolates, Actinoalloteichus cyanogriseus and Smaragdicoccus niigatensis encode the greatest fraction of BCs at 39% and 36%, respectively. This is the highest percentage reported so far for any genome, surpassing the previous record for Streptomyces bingchenggensis 25 . Given that Actinobacteria are vigorously pursued for new antimicrobial product discovery 26 , these two previously unrepresented genera isolated from soil and an oil spring, respectively, might contribute new classes of bioactive compounds.
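The two per-genome quantities discussed above, BC density (BCs per Mb) and the fraction of the genome devoted to BCs, reduce to simple arithmetic once BC coordinates are available. The sketch below assumes BCs are supplied as non-overlapping, half-open (start, end) intervals in base pairs; the function name and input layout are illustrative and are not the IMG-ABC data model.

```python
def bc_summary(genome_size_bp, bc_spans):
    """Return (BCs per Mb, fraction of genome covered by BCs).

    genome_size_bp: assembled genome length in base pairs.
    bc_spans: list of (start, end) half-open intervals, assumed not to overlap.
    """
    bc_bp = sum(end - start for start, end in bc_spans)
    per_mb = len(bc_spans) / (genome_size_bp / 1e6)   # cluster count per megabase
    fraction = bc_bp / genome_size_bp                 # share of genome inside BCs
    return per_mb, fraction

# Toy example: a 2-Mb genome with two 100-kb clusters.
print(bc_summary(2_000_000, [(0, 100_000), (500_000, 600_000)]))
```

If predicted clusters can overlap, the intervals would need to be merged before summing, otherwise the genome fraction is overestimated.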
In addition to predicting biosynthetic gene clusters, we annotated the class of secondary metabolite synthesized by each BC across the GEBA-I genomes. Most of the predicted BC products were unclassified, reflecting both the limited information available for characterized natural products and the rich genomic resource of biosynthetic capabilities contributed by GEBA-I. For example, nine new phenazine pathways with novel operon structures and genes were identified in the GEBA-I genomes 23 . Phenazines are a large class of nitrogen-containing heterocyclic secondary metabolites that have potent antimicrobial and antifungal activity, and are produced by a wide [...]
A crude extract of P. halotolerans DSM 18316 produced three known phenazines (PCA, PDC and griseoluteic acid); however, d-alanylgriseoluteic acid was not observed (Fig. 3b). The phenazine operon in P. halotolerans DSM 18316 included all of the core phenazine genes found across all taxa known to produce the two core phenazines (phenazine-1-carboxylic acid (PCA) and phenazine-1,6-dicarboxylic acid (PDC); Fig. 3c). This operon also contained additional phenazine-modifying genes that exhibited the same pathway architecture found in Pantoea agglomerans Eh1087, a known producer of griseoluteic acid as well as d-alanylgriseoluteic acid 27 . The three genes known to modify griseoluteic acid to d-alanylgriseoluteic acid in P. agglomerans Eh1087 are present in the P. halotolerans DSM 18316 genome, yet the amino acid incorporated by the amino acid adenylation domain is likely different. Some of the other prominent metabolites (unknown peaks in Fig. 3b) may contain this potentially new phenazine. Furthermore, we also identified the biosynthetic genes likely responsible for the pelagiomicin phenazine antibiotic (structure known) produced by M. variabilis ATCC 700307 (ref. 28) (Supplementary Fig. 8).

Improved taxonomic assignment of metagenomic sequences
The ability to phylogenetically analyze and provide taxonomic classification to metagenomic data is largely dependent upon reference microbial genomes. Previous efforts to expand the genomic reference set through inclusion of phylogenetically underrepresented lineages have yielded dramatic improvement in classification of metagenomic data 5 . Here, we evaluated whether the GEBA-I genomes could serve as phylogenetic anchors for metagenomic studies. A total of 3,402,887 GEBA-I proteins were compared to 2,664,695,939 non-redundant protein sequences derived from 4,948 metagenomes in the IMG database. The GEBA-I protein set recruited 25,576,559 previously unassigned metagenomic proteins from 4,650 metagenomes (Supplementary Table 4). The majority of newly recruited proteins were derived from metagenomes of terrestrial (32%) and aquatic (28%) habitats and from plant-associated samples (21%) (Fig. 4a and Supplementary Fig. 9). This finding is primarily attributed to the high proportion of metagenome samples from these particular habitats. Solirubrobacter soli DSM 22325 (ref. 29), a ginseng field soil isolate, recruited the highest number of metagenome proteins (Supplementary Fig. 9); the habitat distribution of these new hits was 50% terrestrial, 34% plant host associated and 6.5% aquatic, and a tiny fraction were from termite gut samples. Although GEBA-I strain selection was based on phylogenetic placement rather than numerical dominance within certain environmental samples, about 282 genomes, designated "top recruiters," were found to notably recruit protein sequences from 1,204 individual environmental samples. Furthermore, we found evidence that a number of the genomes that significantly recruited metagenomic proteins may serve as important members of the microbial community in terms of abundance and encoded metabolic potential. For example, the cellulose-degrading soil isolate Rudaea cellulosilytica 30 preferentially recruits sequences (over 87% coverage of total coding sequence (CDS)) from two corn rhizoplane samples, at high abundance (based on an average read depth of ~25), but not from other plant rhizosphere samples (Fig. 4c).
We hypothesize that R. cellulosilytica is an opportunist in senescing corn rhizoplane samples taken from a drought-stressed continuous corn plot (where root decomposition from previous years probably provided plentiful substrate for its growth), because it is not present in samples from unstressed corn in subsequent years (personal communication, James M. Tiedje, Michigan State University). Another notable example is an anaerobic sulfate reducer, Desulfomicrobium baculatum DSM 4028 (over 85% coverage of total isolate CDS), which is abundant in an oil pipeline biofilm sample and likely had a pivotal role in the microbial-induced corrosion that led to failure of the pipeline 31 (Fig. 4b).
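The two quantities quoted for these top recruiters, CDS coverage and average read depth, can be illustrated with the sketch below. It rests on assumptions that are not stated in the text: coverage is taken as the fraction of an isolate's CDS with at least one metagenomic hit in a given sample, and depth is averaged over all recruited hits. The function name and input layout are hypothetical.

```python
def recruitment_stats(isolate_cds, hits):
    """Return (CDS coverage, mean read depth) for one isolate in one metagenome sample.

    isolate_cds: set of CDS identifiers for the isolate genome.
    hits: iterable of (cds_id, read_depth) pairs for recruited metagenomic sequences;
          hits to identifiers outside isolate_cds are ignored.
    """
    depths_by_cds = {}
    for cds_id, depth in hits:
        if cds_id in isolate_cds:
            depths_by_cds.setdefault(cds_id, []).append(depth)

    # Coverage: share of isolate CDS recruited at least once (assumed definition).
    coverage = len(depths_by_cds) / len(isolate_cds) if isolate_cds else 0.0
    all_depths = [d for depths in depths_by_cds.values() for d in depths]
    mean_depth = sum(all_depths) / len(all_depths) if all_depths else 0.0
    return coverage, mean_depth

# Toy example with hypothetical identifiers.
print(recruitment_stats({"a", "b", "c", "d"},
                        [("a", 20), ("b", 30), ("b", 10), ("x", 99)]))
```

Under these assumptions, a genome such as R. cellulosilytica would be flagged in a sample when coverage exceeds a chosen threshold (for instance 0.85) and depth is high.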
Overall, we found a correlation between isolation source of the GEBA-I strain and the metagenome sample habitat, as expected. Some interesting exceptions were identified. For example, Inquilinus limosus DSM 16000, a GEBA-I strain isolated from sputum of cystic fibrosis (CF) patients (although not known to cause disease or pathology), showed recruitment of proteins from several plant rhizosphere metagenome samples (e.g., Arabidopsis, corn). We hypothesize that closely related Inquilinus species or strains may be members of the plant root microbial community. Indeed, Inquilinus spp. have been previously reported in 16S rRNA surveys of root nodules of wild legumes 32,33 . There is mounting evidence that human-pathogenic enteric bacteria such as Salmonella can colonize plant tissues, and use similar mechanisms for infection of animal and plant hosts 34,35 . [...] representation of public databases, in this case, in adding to complementary cultivation-independent efforts to explore the breadth of microbial diversity and ecology.

Figure 4 (caption fragment) Phylogenetic analyses of whole genome sequences were conducted using the high-throughput version of the Genome-Blast Distance Phylogeny approach. Internal branch support above 60% is colored in a range from red (60%) to green (100%). The colored dots decorating the terminus of every tree branch indicate the isolation source habitat for the given GEBA-I genome. The outermost circle bearing a black bar chart denotes the total number of metagenomic sequences with protein blast hits to that GEBA-I genome (Supplementary Fig. 7 and Supplementary Table 4). The habitat distribution for these hits is given in the colored concentric circles that follow.
Other investigators have taken advantage of early access to the GEBA-I genomes and discovered prominent member species in their samples, for example, Treponema succinifaciens 36 and Treponema brennaborense in the gut microbiomes of non-human primates and traditional hunter-gatherers 37 , Ktedonobacter racemifer in an enrichment to identify rare soil microbes 38 , Coraliomargarita akajimensis 39 in an Amazon river plume 40 , and Sphaerobacter thermophilus 41 in thermophilic switchgrass-adapted compost 42 .
We also report genome features and a large set of CRISPR-Cas systems comprising more than 28,000 novel spacer sequences (Supplementary Table 5 and Supplementary Fig. 10). These CRISPR-Cas data enabled identification of novel associations between viruses and their hosts 43 .

DISCUSSION
This Resource data set is the single largest effort (to our knowledge) to increase the phylogenetic coverage of cultivated bacterial and archaeal isolates. We observed that genomes with increased phylogenetic distance encoded the highest number of novel protein families, supporting the rationale for continued phylogeny-driven sequencing efforts aimed at expanding the representation of cultivated microbes.
Recent studies of uncultivated bacteria and archaea using metagenomics or single-cell genomics have revealed immense unexplored phylogenetic diversity and have provided insights into microbial ecology and evolution 5,44-48 . Those studies have also bolstered gene discovery efforts, particularly for biofuel and biotransformation applications and secondary metabolites 49-51 . New species, strains and clusters arising from the uncultivated majority are now complemented by our Resource of cultivated microbe genomes.
Genomes reconstructed from metagenomic data contain much valuable information. However, a widely perceived problem is that these genomes are of relatively low quality. Artifacts arising from highly fragmented, chimeric or contaminated sequences make assertions, comparisons and accurate estimations of diversity difficult. Metagenomic data also contribute to 'homology creep', which results in speculative, sequence-based predictions, particularly for phylogenetically divergent organisms, and underscores the urgent need for biochemical validation 52 . One path forward, as previously proposed by the research community 12 , is the development of a saturated collection of isolate reference genomes, which, along with biochemical and genetic characterization, could serve as a solid foundation to support assembly, annotation and interpretation of the exponentially growing amounts of data from uncultivated microorganisms. While our GEBA-I selection of type strains exclusively targeted phylogenetic gaps in the isolate genome space (rather than genomes likely to be present in existing metagenomes), we did observe improvements in recruitment of metagenomic data. In addition, we uncovered potentially important members of microbial communities previously lacking taxonomic identity due to absence of reference genomes.
Unlike genome sequences reconstructed from metagenomes of (as-yet) uncultivated microbial species and strains, the GEBA-I species are all cultivable. We hope that GEBA-I will provide a foundation for an array of experiments, including the development of microbial model systems and analyses of biotechnologically relevant pathways, for years to come.

METHODS
Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.