An updated phylogeny of the Alphaproteobacteria reveals that the parasitic Rickettsiales and Holosporales have independent origins

The Alphaproteobacteria is an extraordinarily diverse and ancient group of bacteria. Previous attempts to infer its deep phylogeny have been plagued with methodological artefacts. To overcome this, we analyzed a dataset of 200 single-copy and conserved genes and employed diverse strategies to reduce compositional artefacts. Such strategies include using novel dataset-specific profile mixture models and recoding schemes, and removing sites, genes and taxa that are compositionally biased. We show that the Rickettsiales and Holosporales (both groups of intracellular parasites of eukaryotes) are not sisters to each other, but instead, the Holosporales has a derived position within the Rhodospirillales. Furthermore, we find that the Rhodospirillales might be paraphyletic and that the Geminicoccaceae could be sister to all ancestrally free-living alphaproteobacteria. Our robust phylogeny will serve as a framework for future studies that aim to place mitochondria, and novel environmental diversity, within the Alphaproteobacteria.


INTRODUCTION
The Alphaproteobacteria is an extraordinarily diverse and disparate group of bacteria and well-known to most biologists for also encompassing the mitochondrial lineage (Williams, Sobral, and Dickerman 2007; Roger, Muñoz-Gómez, and Kamikawa 2017).

[...] alleviating phylogenetic artefacts. We found that amino acid compositional heterogeneity, and more generally long-branch attraction, were major confounding factors in estimating phylogenies of the Alphaproteobacteria. In order to counter these biases, we used novel dataset-specific profile mixture models and recoding schemes (both specifically designed to ameliorate compositional heterogeneity), and removed sites, genes and taxa that were compositionally biased. We also present three draft genomes for endosymbiotic alphaproteobacteria belonging to the Rickettsiales and [...]
Compositional heterogeneity appears to be a major confounding factor affecting phylogenetic inference of the Alphaproteobacteria
The average-linkage clustering of amino acid compositions shows that the Rickettsiales, Pelagibacterales and Holosporales are clearly distinct from other alphaproteobacteria. This indicates that these three taxa have divergent proteome amino acid compositions (Fig. 1A). These taxa also have the lowest GARP:FIMNKY ratios in all the Alphaproteobacteria (Fig. 1A); the Pelagibacterales being the most divergent, followed [...]

As a first step to discriminate between these two alternatives, we used maximum likelihood to estimate a tree on our 200-gene dataset for the Alphaproteobacteria under the site-heterogeneous model LG+PMSF(ES60)+F+R6. The resulting tree united the Rickettsiales, Pelagibacterales and Holosporales in a fully supported clade (Fig. 2A).
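
The GARP:FIMNKY ratio mentioned above compares the abundance of amino acids typically encoded by GC-rich codons (G, A, R, P) with that of amino acids typically encoded by AT-rich codons (F, I, M, N, K, Y). The following is a minimal sketch of how such a ratio could be computed for a concatenated proteome; the function name and the toy sequences are hypothetical illustrations and are not taken from our pipeline or data.

```python
from collections import Counter

# Amino acid sets that define the ratio.
GARP = set("GARP")      # residues typically encoded by GC-rich codons
FIMNKY = set("FIMNKY")  # residues typically encoded by AT-rich codons


def garp_fimnky_ratio(sequence: str) -> float:
    """Return (count of G, A, R, P) / (count of F, I, M, N, K, Y)."""
    counts = Counter(sequence.upper())
    garp = sum(counts[aa] for aa in GARP)
    fimnky = sum(counts[aa] for aa in FIMNKY)
    if fimnky == 0:
        raise ValueError("no FIMNKY residues; the ratio is undefined")
    return garp / fimnky


# Hypothetical toy "proteomes" (assumption: invented for illustration only).
toy_proteomes = {
    "taxon_gc_rich": "GARPGARPLLFK",
    "taxon_at_rich": "FIMNKYKKNIGA",
}
ratios = {name: garp_fimnky_ratio(seq) for name, seq in toy_proteomes.items()}
```

Under this sketch, a taxon with a low ratio is enriched in FIMNKY residues relative to GARP residues, which is the pattern reported above for the Rickettsiales, Pelagibacterales and Holosporales.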

The clustering of these three groups is suggestive of a phylogenetic artefact (e.g., long-branch attraction or LBA); indeed, such a pattern resembles the one seen in the tree of proteome amino acid compositions (see Fig. 1A). This is because the three groups have the longest branches in the Alphaproteobacteria tree and have compositionally biased and fast-evolving genomes (see Fig. 2 [...]

[...] Holosporales is disrupted (Fig. 2B). The new, more derived placements for the Pelagibacterales and Holosporales are well supported (further described below), and support tends to increase as compositionally biased sites are removed (Fig. S8).

Furthermore, when each of these long-branching taxa is analyzed in isolation (i.e., in [...] LG+PMSF(ES60)+F+R6 model and from a dataset whose compositional heterogeneity [...]

A fourth independent analysis further supports a derived placement of the Holosporales nested within the Rhodospirillales. Bayesian inference using the CAT-Poisson+Γ4 model, on a dataset whose compositional heterogeneity had been decreased by removing 50% of the most compositionally biased sites but for which no taxa had been removed, also recovered the Holosporales as sister to the Azospirillaceae (see Fig. S6).
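
To illustrate the general idea of removing a fixed fraction of the most compositionally biased sites, the sketch below ranks alignment columns by a simple per-site score and drops the top-scoring half. The score used here (the absolute difference in FIMNKY frequency between a designated set of biased taxa and all remaining taxa) is an assumption made for illustration and should not be read as the metric used in our analyses; the function names and the toy alignment are likewise hypothetical.

```python
FIMNKY = set("FIMNKY")


def fimnky_fraction(column):
    """Fraction of FIMNKY residues in a column, ignoring gaps and unknowns."""
    residues = [aa for aa in column if aa not in "-?X"]
    if not residues:
        return 0.0
    return sum(aa in FIMNKY for aa in residues) / len(residues)


def remove_biased_sites(alignment, biased_taxa, fraction=0.5):
    """Drop the given fraction of sites with the highest bias score.

    alignment   : dict mapping taxon name -> aligned sequence (equal lengths)
    biased_taxa : set of taxon names treated as compositionally biased
    fraction    : proportion of sites to remove (0.5 removes 50%)
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    scores = []
    for i in range(length):
        biased_col = [alignment[n][i] for n in names if n in biased_taxa]
        other_col = [alignment[n][i] for n in names if n not in biased_taxa]
        # Hypothetical score: disparity in FIMNKY usage between the two groups.
        scores.append(abs(fimnky_fraction(biased_col) - fimnky_fraction(other_col)))
    n_remove = int(length * fraction)
    # Rank sites from most to least biased; ties are broken by position.
    ranked = sorted(range(length), key=lambda i: (-scores[i], i))
    drop = set(ranked[:n_remove])
    keep = [i for i in range(length) if i not in drop]
    return {n: "".join(seq[i] for i in keep) for n, seq in alignment.items()}


# Hypothetical toy alignment (invented for illustration only).
toy_alignment = {
    "biased_1": "KGNA",
    "biased_2": "KGIA",
    "other_1": "RGAA",
    "other_2": "RGPA",
}
trimmed = remove_biased_sites(toy_alignment, {"biased_1", "biased_2"}, fraction=0.5)
```

In this toy case, the first and third columns separate the two groups by FIMNKY usage and are removed, leaving only the two columns that are compositionally uninformative about group membership.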

The Rhodospirillales is a diverse order and comprises five well-supported families
The Rhodospirillales is an ancient and highly diversified group, but unfortunately this is [...]

[...] Rhizobiales, and the Rhodobacterales sister to both (e.g., Fig. 2B and 3). This is consistent throughout most of our results and such interrelationships become very robustly supported as compositional heterogeneity is increasingly alleviated (Fig. S8).

The placement of the Rickettsiales as sister to the Caulobacteridae (i.e., all other alphaproteobacteria) remains stable across different analyses (Fig. 2B, S10C, S11C, S12C and S13D); this is also true when the other long-branching taxa, the [...]

[...] were primarily aimed at reducing amino acid compositional heterogeneity among taxa, a phenomenon that permeates our dataset (Fig. 1). Compositional heterogeneity is a clear violation of the phylogenetic models used in our, and previous, analyses, and is known to cause phylogenetic artefacts (Foster 2004). In the absence of more sophisticated models for inferring deep phylogeny, the only way to counter artefacts caused by compositional heterogeneity is by removing compositionally biased sites or taxa, or recoding amino acids into reduced alphabets. A combination of these strategies reveals that the Rickettsiales sensu lato (i.e., the Rickettsiales and Holosporales) is polyphyletic. Our analyses suggest that the Holosporales is derived within the Rhodospirillales, and that therefore this taxon should be lowered in rank and renamed the Holosporaceae family (see Fig. 2B and 3). The same methods suggest that the Rhodospirillales might indeed be a paraphyletic order and that the Geminicoccaceae could be a separate lineage that is sister to the Caulobacteridae (e.g., Fig. 2B). These two results, combined with our broader sampling, reorganize the internal phylogenetic structure of the Rhodospirillales and show that its diversity can be grouped into at least five well-supported major families (Fig. 3).
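
Recoding amino acids into a reduced alphabet can be sketched as a simple lookup that maps each of the 20 amino acids to one of a small number of states. The example below uses the widely known six-state Dayhoff grouping purely as an illustration; it is an assumption for the sake of the example and is not one of the dataset-specific schemes used in our analyses. The function name and the test sequence are hypothetical.

```python
# Six-state Dayhoff grouping (illustrative; not the dataset-specific
# schemes used in the analyses described in the text).
DAYHOFF6_GROUPS = ["AGPST", "C", "DENQ", "FWY", "HKR", "ILMV"]

# Map every amino acid to the index of its group, written as a character.
RECODE = {
    aa: str(state)
    for state, group in enumerate(DAYHOFF6_GROUPS)
    for aa in group
}


def recode(sequence, table=RECODE):
    """Recode an aligned protein sequence into the reduced alphabet.

    Gaps ('-') are preserved; any other unrecognized character becomes '?'.
    """
    recoded = []
    for aa in sequence.upper():
        if aa == "-":
            recoded.append("-")
        else:
            recoded.append(table.get(aa, "?"))
    return "".join(recoded)


example = recode("MKR-GAX")
```

Because K and R fall into the same state, a lysine-to-arginine replacement becomes invisible after recoding. This is the sense in which a reduced alphabet can dampen compositional differences among taxa, at the cost of discarding some phylogenetic signal.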

In 16S rRNA gene trees, the Holosporales has most often been allied to the [...]

[...] artefactually attracted to the Rickettsiales (e.g., Fig. 2A), but as compositional bias is increasingly alleviated (through site removal and recoding), they move further away from them (Fig. 2B). The Holosporales is placed within the Rhodospirillales as sister to the family Azospirillaceae (Fig. 3). The similar lifestyles of the Holosporales and [...]

[...] (Fig. 3). We restrict the Rhodospirillaceae sensu stricto to the subgroup that is sister to the Acetobacteraceae (Fig. 3). The other two subgroups are the Rhodovibriaceae and the Azospirillaceae; the latter is sister to the Holosporaceae (Fig. 3).

Based on our fairly robust phylogenetic patterns, we have updated the higher-level taxonomy of the Alphaproteobacteria (Fig. 4). We exclude the Magnetococcales from the Alphaproteobacteria class because of its divergent nature (e.g., see Fig. 1 [...] Table 1).

Taxon and gene selection
The selection of 120 taxa was largely based on the phylogenetically diverse set of alphaproteobacteria determined by Wang and Wu (2015). To this set of taxa, recently sequenced and divergent unaffiliated alphaproteobacteria were added, as well as those claimed to constitute novel order-level taxa. Some other groups, like the Pelagibacterales, Rhodospirillales and the Holosporales, were expanded to better represent their diversity (see Fig. S1).

[...] (Table S1). This was done as an alternative way to [...] which implements the methods of Susko and Roger (2007), was used to find the best recoding schemes; please see Fig. 3, S12 and S14 legends for the specific recoding [...]