Systematic and Applied Microbiology Phylogenomics reveals the basis of adaptation of Pseudorhizobium species to extreme environments and supports a taxonomic revision of the genus

The family Rhizobiaceae includes many genera of soil bacteria, often isolated for their association with plants. Herein, we investigate the genomic diversity of a group of Rhizobium species and unclassiﬁed strains isolated from atypical environments, including seawater, rock matrix or polluted soil. Based on whole-genome similarity and core genome phylogeny, we show that this group corresponds to the genus Pseudorhizobium. We thus reclassify Rhizobium halotolerans , R. marinum , R. ﬂavum and R. endolithicum as P. halotolerans sp. nov., P. marinum comb. nov., P. ﬂavum comb. nov. and P. endolithicum comb. nov., respectively, and show that P. pelagicum is a synonym of P. marinum . We also delineate a new chemolithoautotroph species, P. banﬁeldiae sp. nov., whose type strain is NT-26 T (= DSM 106348 T = CFBP 8663 T ). This genome-based classiﬁcation was supported by a chemotaxonomic comparison, with increas- ing taxonomic resolution provided by fatty acid, protein and metabolic proﬁles. In addition, we used a phylogenetic approach to infer scenarios of duplication, horizontal transfer and loss for all genes in the Pseudorhizobium pangenome. We thus identify the key functions associated with the diversiﬁcation of each species and higher clades, shedding light on the mechanisms of adaptation to their respective ecological niches. Respiratory proteins acquired at the origin of Pseudorhizobium were combined with clade-speciﬁc genes to enable different strategies for detoxiﬁcation and nutrition in harsh, nutrient-poor environments.


Introduction
Bacteria of the family Rhizobiaceae (Alphaproteobacteria) are usually soil-borne and found in association with plant roots, where they mostly rely on a saprophytic lifestyle degrading soil organic compounds and plant exudates, including aromatic compounds [6,7,50]. This particular versatility in using various organic compounds likely stems from the presence of some of the largest known sets of carbohydrate transporter genes in Rhizobiaceae genomes [38,74]. Some members of this taxon sometimes engage in a sym-biotic or pathogenic relationship with a specific host plant, with the ability to switch to these lifestyles being determined by the presence of adaptive megaplasmids in the bacterium [8,19,74].
It is more unusual to isolate Rhizobiaceae strains from an arseniccontaining rock in a sub-surface environment mostly devoid of organic matter such as the Granites goldmine in the Northern Territory, Australia [53]. Rock samples from this mine containing arsenopyrite (AsFeS) were used to enrich for and isolate organisms (designated NT-25 and NT-26) capable of using arsenite (oxidation state +3, i.e. As(III)) as the electron donor coupling its oxidation to arsenate (As(V)) with oxygen and using carbon dioxide as the sole carbon source [53]. 16S rRNA gene sequence analysis revealed that these strains were very closely related and likely belonging to the same species in the family Rhizobiaceae; they were provisionally named Rhizobium sp. [53]. However, recent advances in multi-locus sequence analysis (MLSA) and genome sequencing led to the recognition of the polyphyly of the genus Rhizobium. Subsequently, the taxonomy of Rhizobiaceae was largely revised, with many Rhizobium species being reclassified in newly created genera, including Neorhizobium, Pararhizobium, Allorhizobium and Pseudorhizobium [25,41,42,44] suggesting that the taxonomic status of NT-25 and NT-26 should be re-examined.
The strains NT-25 and NT-26 can withstand very high levels of arsenic (greater than 20 mM arsenite and 0.5 M arsenate), thanks to functions encoded notably by the aio, ars and phn/pst genes, which are for most located on an accessory 322-kb megaplasmid distantly related to symbiotic plasmids of rhizobia [1]. In addition, these strains can gain energy from the oxidation of arsenite, a function encoded by the aio operon, with the mechanism in NT-26 studied in detail [4,5,14,51,52,54,71]. Comparative genomics showed that this operon formed a stable genetic unit found sporadically in a diverse set of bacteria, with the most closely related sequences found in members of the Rhizobiaceae [1,2].
Having identified the genetic features allowing NT-25 and NT-26 to live in harsh environments, we sought to investigate if this combination of adaptive determinants were only present in these ecologically specialized strains, or if they were the trademark of a wider taxononomic group. We thus searched organisms closely related to NT-25/NT-26 based on their 16S rRNA sequence. Their closest relative, strain TCK [15], was selected for its ability to oxidize sulphur compounds, including hydrogen sulphide, sulphite and thiosulphate, a phenotype shared by NT-25 and NT-26 [1]. Other close relatives were strikingly all isolated from polluted environments: R. sp. strain Khangi-ran2 from a soil contaminated with petroleum, R. sp. strain Q54 from an arsenic-contaminated paddy soil; R. flavum strain YW14 from organophosphorus insecticide-contaminated soil [20] and R. halotolerans AB21 from soil contaminated with the detergent chloroethylene [16]. Interestingly, several pathways mediating resistance to these toxic compounds or relating to their metabolism rely on the cellular respiration machinery, including oxidative degradation of noxious organic compounds, or the oxidation of arsenite. The more distantly related species R. endolithicum has the peculiar ability to live within a rock matrix [46], whereas the even more distant relatives Pseudorhizobium pelagicum and R. marinum live in open sea waters, a quite unusual feature within the Rhizobiaceae [25,37]. The environments these organisms were isolated from suggest bacteria in this group have a special ability to live in habitats that are chemically harsh and depleted in organic nutrients.
We sequenced the genomes of all known bacterial isolates closely related to Rhizobium sp. NT-26, resulting in eight complete or almost complete genome sequences. We complemented our dataset with all currently available complete or near-complete genome data for the bacterial families Rhizobiaceae and (closest relative) Aurantimonadaceae. We used this extensive dataset to compute a robust phylogenomic tree covering a broad taxonomic scope, which led us to the delineation of a new species, Pseudorhizobium banfieldiae sp. nov., and the reclassification of four species into the genus Pseudorhizobium. Using a phylogenetic framework, we analysed the distribution and history of all pangenome genes in this genus, and revealed key genetic innovations along its diversification history. Crucially, a large repertoire of respiratory chain proteins was acquired by the ancestor of Pseudorhizobium and later expanded in descendant lineages. The diversification of Pseudorhizobium species was then marked by their respective acquisition of unique metabolic pathways, providing each species with some specific detoxification mechanisms. Finally, we predicted and experimentally tested phenotypic traits that characterize and distinguish the studied species, providing at least one new diagnostic phenotype (inhibition of growth of P. banfieldiae by the azo dye Congo Red).
For long-read sequencing, cells were cultured at UCL until stationary phase in a minimal-salts medium (MSM) containing 0.08% yeast extract (YE) at 28 • C [53]. Genomic DNA was extracted using the Wizard DNA Purification kit (Promega, Madison, Wisconsin) according to the manufacturer's instructions. Quality of the genomic DNA was assessed as described in the Supplementary Methods. DNA libraries were prepared for sequencing at the Earlham Institute on the PacBio RSII platform using C4-P6 chemistry with one SMRT cell per genome. This generated 82-174 × 10 3 long reads per genome (mean: 139 × 10 3 ), representing 0.332-1.26 × 10 9 bp (mean: 0.982 × 10 9 ). Illumina short-read sequencing and short read-only genome assembly was conducted at the DSMZ, as previously described [73]. Hybrid assembly of short and long reads was performed using the Unicycler software (version 0.4.2, bold mode) [72], relying on the programs SPAdes [3] for prior short read assembly, miniasm [35] and Racon [69] for prior long-read assembly and Pilon [68] for polishing of the consensus sequence.
Unless specified otherwise, the following bioinformatic analyses were conducted using the Pantagruel pipeline under the default settings as described previously [31] and on the program webpage http://github.com/flass/pantagruel/. This pipeline is designed for the analysis of bacterial pangenomes, including the inference of a species tree, gene trees, and the detection of horizontal gene transfers (HGT) through species tree/gene tree reconciliations [64]. A more detailed description of genomic datasets and bioinformatic analyses is given in Supplementary methods.

Genomic dataset
We used Prokka [56] as part of Pantagruel (task 0) to annotate the new genomes sequences, using a reference database of annotated proteins from the Rhizobium/Agrobacterium group (see Sup. Methods and list of reference genomes at https://doi.org/10. 6084/m9.figshare.13118405). We complemented our set of eight new genomes with a dataset of 563 publicly available bacterial genomes obtained from the NCBI RefSeq Assembly database that cover the alphaproteobacterial families Rhizobiaceae and (sister group) Aurantimonadaceae, for a total of 571 genomes (dataset '571Rhizob').

Reference species trees
From the 571Rhizob genome dataset, we define the pseudocore genome (thereafter referred as pCG 571 ) as the set of genes occurring only in a single copy and present in at least 561 out of the 571 genomes (98%). The pCG 571 gene set includes 155 loci, for which protein alignments were concatenated and used to compute a reference species tree (S ML571 ) with RAxML [59] under the model PROTCATLGX; branch supports were estimated by generating 200 rapid bootstraps. From the S ML571 tree, we identified the well-supported clade grouping 41 genomes including all representative of Neorhizobium spp. and Pseudorhizobium spp. and our new isolates (dataset '41NeoPseudo'). We restricted the pCG 571 concatenated alignment to the 41 genomes of this clade of interest, which we used as input to the Phylobayes program and ran a more accurate (but computationally more expensive) Bayesian phylogenetic inference under the CAT-GTR + G4 model [30] to generate a robust tree for the 41 genomes (S BA41 ). We finally used this S BA41 tree as a fixed input topology for Phylobayes to infer an ultrametric tree (unitless 'time' tree) under the CIR clock model [34], further referred to as T BA41 .

Gene trees, reconciliations and orthologous group classification
Gene trees were computed for each of the 6714 homologous gene family of the 41-species pangenome with at least four sequences using MrBayes [48] under the GTR + 4G + I model. The resulting gene tree samples had the first 25% trees discarded as burn-in and we used the remainder as input for the ALEml program [63][64][65], to reconcile these gene trees with the reference tree T BA41 and estimate evolutionary scenarios for each gene family, featuring events of gene duplication, transfer and loss (DTL). Based on the estimated gene family evolutionary scenarios, we could define orthologous gene groups (OGGs) based on a true criterion of orthology, i.e. common descent from an ancestor by means of speciation only [17], rather than a proxy criterion such as bidirectional best hits (BBH) in a similarity search. This has the advantage of explicitly detecting the gain of an OGG in a genome lineage by ways of HGT or gene duplication. We then built a matrix of OGG presence/absence in the '41NeoPseudo' dataset, and computed the clades-specific core genome gene set for each clade of the species tree S BA41 . Hierarchical clustering was performed based on the OGG presence/absence matrix using the pvclust function form the pvclust R package version 2.0−0 [61] with default settings, to obtain bootstrap-derived p-values (BP) and approximately unbiased (AU) branch support estimates. We compared the distribution of functional annotations (Gene Ontology terms) between sets of genes specific to each clade in the S BA41 tree and corresponding reference gene sets made of the clade's core-genome or the clade's pangenome.

Overall genome relatedness measurement
We used the GGDC tool (version 2.1) for digital DNA-DNA hybridization (dDDH) to compare the genomes of the closest relatives of the NT-25/26 clone, using the formula d 4 (BLASTN identities/HSP length) [39]. We also used compareM [47] to estimate the amino-acid average identity (AAI) [27] between genomes of the 41NeoPseudo dataset.

Biochemical tests
A range of phenotypic assays were performed on a set of ten strains, including the five newly PacBio-sequenced strains as well as the relevant type strains. Salt tolerance: growth of strains was assayed at 28 • C in liquid R2A medium (DSMZ medium 830) supplemented with increasing concentrations of NaCl (0-9% range was tested) to determine their minimum inhibitory NaCl concentration (MIC NaCl ). Congo Red assay: strains were plated on yeast extract -mannitol agar medium (YEM) [60] with 0.1 g/L Congo Red dye for seven days. The commercial biochemical identification system for Gram-negative bacteria 'Api 20 NE' (BioMérieux) was used for an initial analysis of the biochemical capacities. High-throughput phenotyping was conducted using the GenIII microplates (Biolog, Inc., Hayward, California) for testing for growth with 94 single carbon or nitrogen nutrient sources or with inhibitors (antibiotics, salt, etc.). The GenIII phenotype data were analysed using the 'opm' R package [68]. The association of accessory gene occurrence with phenotypic profiles obtained with the Biolog GenIII (continuous values) was tested using the phylogenetic framework implemented in the 'treeWAS' R package [12].

MALDI-TOF typing
Sample preparation for MALDI-TOF mass spectrometry was carried out according to Protocol 3 in Ref. [55]. Instrumental conditions for the measurement were used as described by Ref. [67]. The dendrogram was created by using the MALDI Biotyper Compass Explorer software (Bruker, Version 4.1.90).

Fatty acid profiles
Fatty acid methyl esters were obtained as previously described [24] and separated by using a gas chromatograph (model 6890 N; Agilent Technologies). Peaks were automatically computed and assigned using the Microbial Identification software package (MIDI), TSBA40 method, Sherlock version 6.1. The dendrogram was created with Sherlock Version 6.1. Polar lipids and respiratory lipoquinones were extracted from 100 mg freeze-dried cells and separated by two-dimensional silica gel thin layer chromatography by the identification service of the DSMZ as previously described [21].

Genome sequencing of eight new Rhizobiaceae genomes
We determined the genome sequences of strains Rhizobium sp. NT-25, R. flavum YW14 T , R. sp. Q54, R. sp. TCK and R. sp. Khangi-ran2 using hybrid assembly of Illumina short sequencing reads and PacBio long reads, both at high coverage (Sup . Table S1). Hybrid assembly yielded high-quality complete genomes with all circularized replicons (chromosomes and plasmids) for all strains except Q54. In the genome assembly of strain Q54, only one 463-kb plasmid is circularized, leaving eleven fragments, of which one is chromosomal (size 3.79 Mb) and ten (size range: 1-216 kb) that could not be assigned to a replicon type. In addition, strains AB21, JC140 and P007 were sequenced using Illumina short sequencing reads only; their assembly produced high-quality draft genomes, with 20-84 contigs, with N50 statistics ranging 336-778 kb and average coverage 43x-75x (Sup. Table S1). All these genomes carry plasmids, with one to four confirmed circular plasmids in strains NT-25, TCK, YW14 and Khangiran2 (plasmid size range: 15-462 kb) and possibly more for strains Q54, AB21, JC140 and P007.

Comparison of genomes with digital DNA-DNA hybridization and AAI similarity
To direct the assignment of strain NT-26 and the newly sequenced strains to existing or new species, we proceeded to pairwise comparisons of the new whole genome sequences with the already published reference genomes of strain R. sp. NT-26 and type strains of related species P. pelagicum R1-200B4 T and R. marinum MGL06 T using dDDH ( Table 1). As expected, strains NT-25 and NT-26 are highly related (98% dDDH; 100% AAI) and are thus considered to belong to the same clone (thereafter referred to as the 'NT-25/26 clone'). They are also closely related to strain TCK (71.5-71.7% dDDH; 98.4-98.5% AAI), indicating these three strains form a new species. Strains R. sp. Q54 groups clearly with R. endolithicum JC140 T (81% dDDH; 98.2% AAI) and thus is assigned to the species R. endolithicum. The dDDH score of 76.30% (97.8% AAI) between the genomes of P. pelagicum and R. marinum type strains (R1-200B4 and MGL06) indicates that both strains belong to the same species. R. marinum [37] having priority over P. pelagicum [25,42], this warrants that R. marinum should be kept as the valid species name and P. pelagicum as its synonym. Strains R. flavum YW14 T , R. halotolerans AB21 T , and R. sp. Khangiran2 are closely related, but have dDDH scores below the classic 70% species threshold [58]. However, R. sp. Khangiran2 has a AAI similarity score of 97.1% with R. halotolerans AB21 T (96.5% with R. flavum YW14 T ), values that are similar to the within-species scores we report above.
While dDDH value saturate around 20% when comparing distant species (Table 1), AAI values decrease more gradually. AAI values between type strains from different genera do not exceed 76%, whereas values between types of Neorhizobium species are all above 82%, suggesting a discontinuity of AAI values can be used to determine genus membership. Interestingly, the strain cluster including NT-26 and the type strains of R. flavum, R. halotolerans, R. endolithicum, R. marinum and the type strain of the Pseudorhizobium genus, R1-200B4 T (= LMG 28314 T = CECT 8629 T ), all show AAI values above 80%, suggesting membership of a same genus (Table S2). The genus Rhizobium is well known to be paraphyletic and several new genera have been recently defined to solve that issue [40,41]; our observations further support the reclassification of R. halotolerans, R. flavum, R. endolithicum and R. marinum into the Pseudorhizobium genus, thus becoming P. halotolerans (Phalo), P. flavum (Pfla), P. endolithicum (Pendo) and P. marinum (Pmari).

Phylogeny of Neorhizobium and Pseudorhizobium
We produced a large phylogeny based on the concatenated 155 core proteins of 571 Rhizobiaceae and Aurantimonadaceae complete genomes and using a fast ML method of inference (S ML571 ) (Sup. Fig. S1). In addition, we generated a phylogeny focused on the group of interest encompassing the Neorhizobium and Pseudorhizobium genera ('41NeoPseudo' dataset) using a Bayesian inference and more realistic molecular evolution model (S BA41 ) ( Fig. 1), which confirmed the groupings described above based on dDDH and AAI. Almost all branches in S BA41 are well supported with Bayesian posterior probability (PP) support >0.97, except some internal branches in the Neorhizobium clade and the branch grouping strains R. sp. Khangiran2, Phalo AB21 T and Pfla YW14 T . Both the S ML571 and S BA41 trees place strain Khangiran2 closer to AB21 T than to YW14 T , in agreement with pairwise AAI values. This indicates strain Khangiran2 should be classified as a member of P. halotolerans. The relatively low PP support of 0.79 on the stem branch of the Pfla + Phalo group suggests that gene flow may have occurred between this group and their close relatives.  The clade containing strains NT-25, NT-26 and TCK groups with the Phalo + Pfla clade, and further with Pendo and then Pmari, as a well-separated clade from Neorhizobium, i.e. the Pseusdorhizobium genus. Accordingly, we propose that strains NT-25, NT-26 and TCK should form a new species in this genus and we propose to name it ¨Pseudorhizobium banfieldiae(Pban).
The positions of strains "Rhizobium oryzae" B4P and "R. vignae" CCBAU 05176 in the S ML571 and S BA41 phylogenetic trees (Sup. Fig.  S1, S2) and their pairwise AAI values (Table S2) makes it clear that they were incorrectly named and should be designated as Neorhizobium. sp. B4P and N. galegae CCBAU 05176. Similarly, other strains present in the clade corresponding to the genus Neorhizobium and currently identified as Rhizobium sp. need to be renamed as Neorhizobium sp.: strains YS-1 r, NFR07, NFR12, Leaf306 and Leaf321.
We compared the phylogenies based on core genome alignment to those obtained with alternative sources of information on genomic variation (detailed in Suppl. Text). Based on the distribution of pangenome genes, i.e. the presence/absence of orthologous accessory genes in the '41NeoPseudo' genomes (S CL41 ), a hierarchical clustering shows a very similar picture to S BA41 , with good support for most branches leading to major clades and species ( Fig. 2B; Sup. Fig. S3). Low support for the branch separating Pfla YW14 T from Phalo strains Khangiran2 and AB21 T suggests frequent HGT between these species. Additionally, strains that branch deep in the S BA41 tree all cluster together as a sister group of the Pseudorhizobium clade. This limited resolution of gene pres-ence/absence data beyond the species level may be explained by inter-specific HGT, possibly driven by convergent adaptation.
Phylogenetic trees were also built from a restricted set of classic marker genes (atpD, recA, rpoB, glnII and gyrB), either separately or in concatenation, i.e. in a multi-locus sequence analysis (MLSA) (Supp. Methods; Supp. Dataset S1). The monophyly of all species within Pseudorhizobium is recovered with the MLSA, but not with any single marker gene. However, the deeper grouping Phalo+Pfla is not recovered by the MLSA tree, with the inclusion of Pban in that clade being moderately supported (see Suppl. Text). These results indicate that marker gene-based analyses are mostly consistent with the information obtained from whole genomes, with MLSA providing a satisfying framework for species typing of Pseudorhizobium strains. However, the study of deeper evolutionary relationship and the classification of distant strains should be based on genome-wide data, in accordance to recent guidelines [29].

Phenotype-based classification
We found that clustering of strains based on data generated from lipid, protein or metabolic screens yielded a classification broadly similar to the one obtained from core-genome alignments. Fatty acid profiling showed the poorest resolution as it could not discriminate between species (Fig. 2C). However, it distinctly clustered Pban+Phalo+Pfla+Pendo to the exclusion of its outgroups Neorhizobium and P. marinum, indicating a synapomor- phic change of lipid composition at the common origin of these four species. A proteome screen (using MALTI-TOF mass spectrometry) further differentiated Pendo from the group Phalo+Pfla+Pban (Fig. 2D). Metabolism profiles (based on growth curves in 95 different conditions) proved the most accurate sequence-independent predictor of the phylogenomic tree as it also distinguished the group Phalo+Pfla from Pban and thus almost completely mirrored the branching pattern in the genus Pseudorhizobium (Fig. 2E). All phenotype screens however led to cluster Pmari MGL06 T with outgroups N. galegae and R. leguminosarum (Fig. 2C-E), showing the limited ability of chemotaxonomic and phenotypic analyses to resolve taxonomy at deeper evolutionary scales, likely due to convergence of adaptive traits.

Clade-specific gene sets reveal specific functions and ecologies
We inferred gene family evolution scenarios accounting for HGT history by reconciling gene tree topologies with that of the species tree S BA41 . Based on these scenarios, we delineated groups of orthologous genes that reflect the history of gene acquisition in genome lineages -every gain of a new gene copy in a genome lineage creating a new OGG. We looked for groups of OGGs with contrasting occurrence patterns between a focal clade and its rel-atives, to identify specific events of gene gain or loss that led to the genomic differentiation of the clade. Data for all clade comparisons in our '41NeoPseudo' dataset are presented in Sup. Table S3 and are summarized below for the clades on the lineage of strain NT-26; more detailed information and description of gene sets specific to other groups are listed in the Supplementary Text. Major gene sets that have contrasting pattern of occurrence in Pseudorhizobium are listed in Table 2 and those specifically contributing to the differentiation of the NT-25/26 clone lineage are depicted in Fig. 3.

'NT-25/NT-26 clone'
Genes specific to this group are mostly part of mobile or selfish elements. As expected from previous studies [1], this includes the 322-kb plasmid in NT-26 and homologous 119-kb plasmid in NT-25 that carry the arsenite oxidation aio locus and extra copies of the arsenic resistance ars operon and the pst/phn locus, which encode phosphate-specific transporters with high affinity for phosphate but not for its structural analogue arsenate. In addition, the chromosome is laden with specific mobile elements. A prophage is located between two tRNA genes (positions 1347-1419 kb; length 71 kb) characterized by an entire set of phage structural genes and an integrase gene at the end of the locus, with no other identified function n/a n/a n/a n/a n/a n/a n/a n/a Inhibition by Congo Red dye a  I  I  I  T  I  T  T  T  T  T n/a Biolog GenIII metabolic activities C07 (L-Fucose) ---

Genotypic traits Predicted pathways/activities
Arsenic oxidation a: I: inhibited; T: tolerant. n/a: data not available. 'Pseudorhizobium banfieldiae' (Pban) Of all the genes exclusively present in Pban compared to other Pseudorhizobium, the most striking feature is a locus encod-ing the Calvin cycle pathway (including the RuBisCO enzyme) and respiratory chain cytochromes, the main determinants of chemoautotrophy in this species. This locus belongs to a larger Pban-specific region composed of two closely located 27-kb and 80kb fragments, which suggests it results from the recent insertion and domestication of a mobile element (likely interrupted by an even more recent insertion/rearrangement). Among the 37 Pbanspecific genes in this extended region, several code for enzymes of the classes oxidoreductase, monooxygenase, decarboxylase and glutathione-S transferase, which all use reduced electron acceptors and/or protons, and with their putative substrates including aromatic cyclic and halogenated organic compounds.
This suggests a functional link between chemoautotrophy and detoxification pathways. Reconstructed HGT scenarios indicate that the donor of these genes was a deep-branching lineage of Neorhizobium (Sup. Fig. S4; Sup. Table S4), but also that it was preceded by series of HGT events, dated as early as the diversification of the Neorhizobium/Pseudorhizobium group. This suggests genes coding for chemoautotrophy have been circulating for a long time in this wider taxon, and were later fixed in the Pban lineage. Other Pban-specific genes include a locus putatively encoding the biosynthesis of a lipopolysaccharide O-antigen with an N-acetylneuraminic acid (Neu5Ac) function, and a 26-kb region encoding putative enzymes and transporters related to pathways for the utilization of taurine and for the degradation of (possibly halogenated) aromatic compounds (Sup . Table S5).
Conversely, several genomic regions have been lost in the Pban lineage. An operon that encodes a multimeric Na + /H + cation antiporter was present in the ancestor of of Pban, Phalo, Pfla and Pendo, then specifically lost in Pban; a homolog is present in Pmari strain MGL06 T , with the gene evolution scenario indicating this gene is a HGT recipient from Phalo to MGL06. An operon encoding a cellulose synthase is present in all other Pseudorhizobium species, indicating the likely presence of a cellulose-like polymer in their exopolysaccharide, but not in Pban where it was lost. Finally, Pban genomes specifically lack genes coding for a respiratory complex including several cytochrome c oxidases, in linkage with a gene encoding the EutK carboxysome-like microcompartment protein, whose known homologues are involved in the degradation of ethanolamine (see Supplementary text). This respiratory complexencoding locus often includes genes coding for redox enzymes that may be the terminal electron acceptor; interestingly, these genes vary with the species (Sup. Fig. S5): Pfla YW14 carries a copper-containing nitrite reductase, while Phalo strains AB21 and Khangiran2 carry a (non-homologous) TAT-dependent nitrousoxide reductase; the locus in Pendo strains harbours no gene encoding such terminal electron acceptor, but other genes encoding metabolic enzymes that differ among strains. This suggests that this respiratory chain and associated putative micro-compartment are used as an evolutionary flexible platform for the reductive activities of these organisms.

'Pseudorhizobium sub-clade Pban+Phalo + Pfla'
The Pban+Phalo + Pfla clade presents two large specific gene sets: the 20-kb super-operon paa coding for the uptake and degradation of phenylacetate, and the 13-kb locus including the soxXYZABCD operon that encodes the sulphur oxidation pathway, allowing the lithotrophic oxidation of thiosulphate.
'Pseudorhizobium sub-clade Pban+Phalo + Pfla + Pendo' Genes specific to the Pban+Phalo + Pfla + Pendo clade are enriched in the cellular process of NAD cofactor biosynthesis (GO:0009435), tryptophan catabolism (GO:0019441) and phosphatidic acid biosynthesis (GO:0006654). They also carry an operon encoding a thiosulphate sulphurtransferase with a pyrroloquinoline quinone (PQQ)-binding motif, a SoxYZ-like thiosulphate carrier and a SoxHlike metallo-protease and a membrane-bound PQQ-dependent dehydrogenase with glucose, quinate or shikimate as predicted substrates. In addition, a 17-kb locus including the pqqBCDE operon involved in the biosynthesis of cofactor PQQ and PQQ-dependent methanol metabolism enzymes was also specifically gained in this clade, but later lost by Pfla strain YW14.

Other Pseudorhizobium species
Traits specific to other clades, including the species Pendo, Pmari and Phalo, are discussed in the Supplementary Text. Among the many species-specific traits found, we can highlight the following predictions: Phalo features a specific pathway involved in the biosynthesis of carotenoids; Pendo has specific accessory components of its flagellum, and misses many genes that are otherwise conserved in the genus, including a cyanase gene; Pmari, as the most diverged species in the genus, has several hundred species-specific genes, including a 27-kb locus coding for a potassium-transporting ATPase, extrusion transporters and degradation enzymes with putative phenolic compound substrates, and a poly(3-hydroxybutyrate) (PHB) depolymerase.

Validation of bioinformatic predictions of phenotypes
We aimed to experimentally validate the predictions of cladespecific phenotypes that would allow us to distinguish taxa and also to confirm the bioinformatically predicted functions of the identified genes. We thus implement a new version of the polyphasic approach to taxonomy [58], where genome-based discovery of phenotypes complements genome relatedness-based delineation of taxa, and ultimately would help us link the conservation of a genotype in a taxon with relevant aspects of its ecology [32]. We focus on P. banfieldiae, for which our dataset provides the best phylogenetic contrast, with three genomes sampled within the taxon and eight genomes sampled in close relatives, ensuring the robust identification of species-specific genes (Fig. 3). We describe below the clade-specific phenotypic traits that were experimentally validated. For other predicted traits, i.e. the specific utilization of taurine and phenylacetate for Pban and Pban+Phalo+Pfla clades, respectively, the experimental test did not match the expectations (see Supplementary text).

Pban-specific chemolithoautotrophy
This is a known trait of all Pban strains, which were indeed isolated for that particular property [15,53]. All strains can grow with thiosulphate as a sole electron source and by fixing carbon dioxide as a C source; the use of arsenite as an electron donor is unique to the NT-25/26 clone, due to the presence of the aio operon on the clone's specific plasmid.

Genus-wide salt tolerance
Tolerance of salt is a known trait of all previously reported strains of Phalo (up to 4% NaCl), Pfla (up to 4% NaCl), Pendo (up to 5% NaCl), and Pmari (up to 7% NaCl for former P. pelagicum strains and up to 9% NaCl for strain MGL06 T ), having indeed inspired the choice of the species epithet of R. halotolerans [16,20,25,37,46]. This phenotype could be conferred, at least in part, by the expression of a Na + /H + antiporter, a function that was identified as a marine niche-associated trait in Rhodobacteraceae [57]. The Na + /H + antiporter genes are missing in Pban, leading us to predict a lower salt tolerance in this species (Fig. 3). The ability to grow in NaCl concentrations ranging from 0 to 9% was tested for 11 strains of Pseudorhizobium and related organisms (Sup. Table S6). The results showed no significant difference between Pban and other species in the genus with all tolerating up to 3-6% NaCl under our test conditions, apart from Pmari strain MGL06, which still grew in the presence of 7% NaCl, rejecting the hypothesis of the presence of a Na + /H + antiporter as cause of this phenotype. The common baseline of salt tolerance in Pseudorhizobium suggests that core genes encode salt tolerance factors or, less parsimoniously, that all lineages have convergently evolved such traits. The levels of salt tolerance we measured are lower than previously reported for Pendo and Pmari, and higher for Phalo, suggesting that other factors that determine salt tolerance in other conditions were not expressed in our experiment.
Pban-specific lack of production of a cellulose polymer Pban and Phalo strains were plated on yeast extract-mannitol agar medium (YEM) supplemented with 0.1 g/l Congo Red dye, a characteristic marker of beta-glucan polymers, to test for the presence of cellulose or a related polymer such as curdlan in their capsular polysaccharide [26]. Contrary to expectations, P. banfieldiae strains were coloured by the dye, which was observed for other tested Pseudorhizobium strains. Pban strains had an salmon-orange hue, while strains from closest relative species Phalo and Pfla had a more intense red or pink-red colour, suggesting that they bound the dye more strongly (Sup. Fig S6; Sup. Table S7). In addition, growth of Pban strains was inhibited in the presence of the Congo Red dye, resulting in small, dry colonies on YEM plates (Sup. Fig S6; Sup. Table S7) or no growth on MSM + 0.08% YE (data not shown). This impeded growth phenotype was unique to Pban among studied Pseudorhizobium strains, except for Phalo strain Khangiran2. The inhibitory effect of the dye has been previously observed for other bacteria deficient in the production of beta-glucan polymers [62], and the inhibition observed on Pban strains could thus reflect the absence of protection that the cellulose-like polysaccharide provides to other Pseudorhizobium isolates. The case of Phalo strain Khangiran2 remains unclear: its growth is inhibited by exposure to the dye, but its pink-red colour indicates it binds the dye more strongly than Pban strains, thus suggesting that it does express a beta-glucan polymer.

Search for genotype-phenotype associations
We took advantage of our well-defined phylogenomic framework and of the compendium of phenotypes that we had tested (Sup . Table S8) to search for potential associations between the distribution of accessory genes and the distribution of phenotypes, with the expectation of revealing new gene functions. Using a genome-wide association (GWAS) testing framework, we looked for the basis of significant phenotypes that are not necessarily distributed following the taxonomical structure of species. Specifically, we explored the association between metabolic traits and the distribution of OGGs in the accessory genome, using the species tree to account for potential spurious associations linked to oversampling of closely related strains. The GWAS reported numerous significant associations, listed in Sup. Table S9. Manual exploration of results singled out only one association with a clear association pattern that we believe to be of biological relevance: the utilization of beta-methyl-d-glucoside, observed most strongly in strains Pban NT-26, Phalo AB21 and N. galegae HAMBI 540, was associated with the presence of several chromosomal clusters of genes. These include two operons, one encoding an allophanate hydrolase, and another encoding a transporter and a dienelactone hydrolaserelated enzyme.

Discussion
The traditional polyphasic approach in bacterial taxonomy combines several criteria, in particular marker gene phylogeny and biochemical phenotypes, to determine the boundaries of taxa [58]. The validity of this approach has recently been questioned, due to the growing evidence that most phenotypes are encoded by genes that may be accessory within a species or be shared promiscuously among species, making them poor diagnostic characters [28]. Instead, the growing practice in the field is to use whole genome sequences to estimate similarity between bacterial isolates [10] and to compute phylogenetic trees based on genome-wide data [44]. A tree provides the hierarchical relationships between organisms, while the level of overall genome relatedness provides an objective criterion for the delineation of species. This criterion requires a threshold, and 70% has been proposed for dDDH, which is equivalent to the classical score of DNA-DNA hybridization (DDH) widely considered as adequate for species delineation [9], as obtained using the GGDC tool [39]; we also established that in this dataset, 97% amino-acid average identity (AAI) provides a practical species threshold. However, similarity thresholds are arbitrary and the relevance of species boundaries proposed based on this sole criterion may be questioned.
In this study, pairwise genome similarities among strains Khangiran2, Phalo AB21 T and Pfla YW14 T are all below, but close to the thresholds of 70% dDDH and 97% AAI. According to recent taxonomical guidelines [29], criteria other than similarity-based ones should be considered to decide on species boundaries. We therefore complemented our taxonomic investigations with an assessment of the genomic, chemotaxonomical and predicted ecological differentiation between strains. These three strains form a clade ('Phalo+Pfla') that is well supported in the S ML571 ML tree. Genes specific to this clade are enriched in functions involved in the biogenesis and modification of membrane lipids. This important cellular pathway could be the means of adaptation to a shared ecological niche, and this group could thus constitute an ecological species [32]. However, core genes specific to Phalo (i.e. strains AB21 and Khangiran2) are also significantly enriched with coherent cellular functions, including the biosynthesis of carotenoid pigments and related isoprenoid metabolism -a key metabolic pathway suggesting that the core genome of Phalo may also be involved in the adaptation to its own specific niche. Therefore, the ecological arguments rather support Phalo and Pfla to be two distinct ecological species. In addition, the relatively lower support in S BA41 Bayesian tree for the Phalo+Pfla clade is a strong argument against its election to species status. For this reason, we recommend that P. halotolerans and P. flavum shall remain two distinct species until further evidence to the contrary.
A recent large-scale analysis of Alphaproteobacteria type strain genomes recommended that the species P. pelagicum be amalgamated with R. marinum [23]. By renaming the only species yet recognised as part of the genus Pseudorhizobium as a Rhizobium, Hördt et al. implied that Pseudorhizobium is not a bona fide taxon. However, we bring extensive evidence that Pseudorhizobium is well differentiated genomically and phenotypically and forms a bona fide bacterial genus. Through the phylogeny-aware comparison of genomes, we explored the functional specificities of lineages within this genus (Fig. 3). From the functional annotation of genomes, we identified the genomic basis of known phenotypic traits, such as chemolithoautotrophy or sulfur oxidation, and predicted others such as a cellulose component of the bacterial coat or a lipopolysaccharide O-antigen. We mapped the distribution of these traits within a phylogenetic framework, identifying those traits which presence or absence was exclusive to a group. The only new prediction of contrasting phenotypes that could be verified in the lab is related to the absence of a cellulose-like polymer in Pban. Phe-notypic features of wider evolutionary groups were documented, including the general tolerance of members of the Pseudorhizobium genus to NaCl (Sup . Table S6; Supplementary text). This shared feature suggests that the ancestor of the group might have been itself salt tolerant, and thus possibly a marine organism -a hypothesis consistent with the basal position in the genus tree of seaborne P. marinum.
We showed that some cellular processes and pathways were over-represented in the specific core genome of clades of the Pseudorhizobium genus. This pattern results from the serial acquisition of genes with related functions in an ancestral lineage and their subsequent conservation in all descendants -a process likely driven by positive selection [33]. Indeed, under an ecotype diversification model, the acquisition of genes enabling the adaptation to a different ecological niche can trigger the emergence of a new ecotype lineage [11]. Ecological isolation of this ecotype may in turn drive the differentiation of its core genome, with additional adaptive mutations (including new gene gains and losses) producing a knock-on effect leading to ecological specialization [32]. Analysing the specific gene repertoire of each clade of the Pseudorhizobium genus indeed shed light on putative ways of adaptations of these groups to their respective ecological niches.
The emergence of the highly specialized NT-25/NT-26 clone could be explained by a hypothetical scenario involving a sequence of ecotype diversification events: a first key event was the acquisition of multiple new cytochromes and interacting redox enzymes in the Pseudorhizobium genus ancestor, enhancing its capacity to exploit the redox gradients between available environmental compounds. This was followed by the acquisition by the ancestor of Pban+Pfla+Phalo+Pendo of a first set of sulphur oxidation enzymes, with electrons from the periplasmic oxidation of thiosulphate being transferred to the carrier molecules SoxYZ and PQQ, likely to fuel oxidative enzymes such as a jointly acquired toxic carbohydratedegrading metallo-hydrolase. Then, the sox gene cluster was gained by the ancestor of Pban+Pfla+Phalo, allowing it to use thiosulphate as a source of electrons to fuel respiration and therefore to convert them into proton motive force and to recycle the cellular pool of redox cofactors. The joint acquisition of the phenylacetate degradation pathway -many reactions of which require reduced or oxidized cofactors [66] -allowed this organism to use this aromatic compound and its breakdown products as carbon and electron source. This set of new abilities would have allowed this lineage to colonize new habitats either depleted in organic nutrients or contaminated with toxic organic compounds. This was followed by the acquisition of RuBisCO and other Calvin cycle genes and additional respiratory cytochromes by the Pban ancestor. The encoded metabolic pathways provide electrons to the cell and allow the fixation of carbon dioxide, thus allowing that ancestor to live chemolithoautotrophically using sulphur oxidation -again this has likely let this lineage colonize environments yet uncharted by most rhizobia, such as rock surfaces. Finally, the acquisition of a plasmid carrying the arsenite oxidation genes and other factors of resistance to arsenic and heavy metals, allowed the NT-25/NT-26 clone ancestor to successfully colonize the extremely toxic and organic nutrient-poor environment of a gold mine.
Aside from this scenario of extreme specialization towards chemolithoautotrophy and resistance against toxic heavy metals, all species in the genus Pseudorhizobium have achieved significant ecological differentiation from the bona fide rhizobial lifestyle, which is characteristic for members of the most closely related genus Neorhizobium typically isolated from soil and the plant rhizosphere [22,36,41,45,70]. P. endolithicum has only been found inside the mineral matrix of sand grains and its high salt and temperature tolerance indicate it is likely adapted to this peculiar lifestyle, even though its capacity to nodulate soybean indicates its ecological niche encompasses various lifestyles [46]. Among the many gene gains and losses that occurred over the long branch leading to Pendo, one notable change involved the structural and biosynthetic genes of the flagellum, possibly leading to a deviant morphology of this bacterial motor in this species. This might be linked to its ability to colonize the interior of sand rock particles necessitating a particular type of motility.
The P. marinum core genome is largely differentiated from the rest of the genus, owing to its early divergence. Among its speciesspecific components, some genes are involved in functions that are key for survival in a typically marine lifestyle. These include transport of K + and Cl − ions, urea and various sugars and organic acids and amino-acids and the degradation of toxic phenolic compounds, in combination with many signal transduction systems, which must allow the rapid scavenging/extrusion of rare/excess ions or toxins in response to changing availability of mineral and organic nutrients and the rise in toxicity of the environment. In addition, the (de)polymerisation of storage compound PHB may allow the cell to survive long-term starvation during nutrient-depleted phases. Finally, the ability to synthesize of the osmoprotectant NAGGNa trait common to all sequenced members of the genus Pseudorhizobium -makes this species particularly adapted to life in marine habitats and other environments where salinity can vary strongly.

Conclusion
In summary, in this work we have used a comparative genomics approach within a phylogenomic framework to identify the unique characters of five species of the genus Pseudorhizobium, shedding light on the genome evolution that led them to adapt to their respective ecological niche. Our analysis highlights how speciesand higher groups -within this clade of the Rhizobiaceae family evolved towards strikingly different ecological strategies, through the acquisition of traits such as tropism and resistance to environmental toxins, thus allowing each species to colonize its own peculiar niche. Aerobic, Gram negative non-spore forming rods forming white colonies on YMA. Optimal growth at 28-30 • C and pH 7−8. Catalase test is positive. Production of ␤-galactosidase is positive. The production of indol is negative. The production of arginine dehydrolase and gelatinase is negative. Growth is observed in presence of 0-4% NaCl, and up to 7% for certain isolates. The main fatty acids are C 18:1 7c/C 18:1 6c. The G + C content of genomic DNA is 61.8-62.8 mol%.
The type species is P. marinum (synonym: P. pelagicum) and the type strain is P. marinum R1-200B4 T .
Delineation of the genus was determined based on wholeproteome similarity analysis (AAI) and the phylogenetic analysis of the concatenated alignments of 155 conserved genes. Strains within the genus have all above 80% AAI similarity between each other, and below 76% AAI similarity with N. galegae HAMBI 540 T , the type strain of sister genus Neorhizobium.

Description of Pseudorhizobium halotolerans sp. nov
The description of the species is the same as the descriptions given by Diange and Lee [16], except that it is tolerant to NaCl up to 5 % (w/v), instead of 4 %.
The type strain, AB21 T (= DSM 105041 T = KEMC 224-056 T = JCM 17536 T ), was isolated from chloroethylene-contaminated soil from Suwon, South Korea. We note that the name Rhizobium halotolerans that was proposed in the original publication [16] has not been validly published yet.

Description of Pseudorhizobium flavum comb. nov
The description of the species is the same as the descriptions given by Gu et al. [20]. Notably, it has a tolerance of 0-4 % NaCl (w/v).

Description of Pseudorhizobium endolithicum comb. nov
The description of the species is the same as the descriptions given by Parag et al. [46].

Description of Pseudorhizobium marinum comb. nov
The genus Pseudorhizobium was described along with the species name P. pelagicum Kimes et al. 2015, which is a heterotypic synonym of R. marinum Liu et al. 2015. Because the genus name Pseudorhizobium has been validly published (with type strain R1-200B4 T = LMG 28314 T = CECT 8629 T ) [42], the species epithet marinum is now to be preceded by the Pseudorhizobium genus prefix.
The description of the species is the same as the descriptions given by Liu et al. [37].
Etymology: N.L. gen. n. banfieldiae [ baen · fil · di · ae], named in honour of Prof Jillian Banfield, environmental microbiologist whose research revolutionised the view of bacterial and archeal diversity.
P. banfieldiae strains are salt tolerant up to 4% NaCl. The following phenotypes distinguish them from other members of Pseudorhizobium: they are sulphur oxidizers, and can harvest electrons from sulphur compounds including thiosulphate; they are autotrophic and can assimilate carbon from CO 2 in the presence of an electron source, such as the reduced inorganic sulphur compound thiosulphate. Note that arsenite oxidation and autotrophy in the presence of arsenite are accessory traits borne by a plasmid and are not diagnostic of the species.
In addition, growth of P. banfieldiae strains is inhibited on yeast extract -mannitol agar medium (YEM) supplemented with 0.1 g/l Congo Red dye, resulting on small, dry colonies, and are coloured salmon-orange by the dye.
The type strain is NT-26 T (= DSM 106348 T = CFBP 8663 T ), isolated from arsenopyrite-containing rock in a sub-surface goldmine in the Northern Territory, Australia.

Data availability
All genomic data were submitted to the EBI-ENA under the BioProject accession PRJEB21840/ERP024139, in rela-

Funding
This work was supported by the European Research Council (ERC) (grant ERC260801-BIG IDEA to FB). FL was supported by a Medical Research Council (MRC) grant (MR/N010760/1) to XD. Computational calculations were performed on Imperial College high-performance computing (HPC) cluster and on MRC Cloud Infrastructure for Microbial Bioinformatics (MRC CLIMB) cloud-based computing servers [13]. THO was supported by a Biotechnology and Biological Sciences Research Council (BBSRC) grant (BB/N012674/1) to JMS.

Authors contribution
FL, JMS, JP and FB designed the study. SMMD, FJZ, JZ, JMS, THO and JP isolated, provided and cultivated the bacterial strains. SV and AF performed phenotypic analyses. JS and FL analysed the phenotypic data. FL and HB conducted genome assemblies. FL and XD wrote the phylogenetic analysis software. FL conducted the bioinformatic and evolutionary analyses. FL, JMS, JP, FB and XD wrote the manuscript. All authors read and approved the content of the manuscript.