Independent origins and evolution of the secondary replicons of the class Gammaproteobacteria

Multipartite genomes, consisting of more than one replicon, have been found in approximately 10 % of bacteria, many of which belong to the phylum Proteobacteria. Many aspects of their origin and evolution, and the possible advantages related to this type of genome structure, remain to be elucidated. Here, we performed a systematic analysis of the presence and distribution of multipartite genomes in the class Gammaproteobacteria, which includes several genera with diverse lifestyles. Within this class, multipartite genomes are mainly found in the order Alteromonadales (mostly in the genus Pseudoalteromonas ) and in the family Vibrionaceae . Our data suggest that the emergence of secondary replicons in Gammaproteobacteria is rare and that they derive from plasmids. Despite their multiple origins, we highlighted the presence of evolutionary trends such as the inverse proportionality of the genome to chromosome size ratio, which appears to be a general feature of bacteria with multipartite genomes irrespective of taxonomic group. We also highlighted some functional trends. The core gene set of the secondary replicons is extremely small, probably limited to essential genes or genes that favour their maintenance in the genome, while the other genes are less conserved. This hypothesis agrees with the idea that the primary advantage of secondary replicons could be to facilitate gene acquisition through horizontal gene transfer, resulting in replicons enriched in genes associated with adaptation to different ecological niches. Indeed, secondary replicons are enriched both in genes that could promote adaptation to harsh environments, such as those involved in antibiotic, biocide and metal resistance, and in functional categories related to the exploitation of environmental resources (e.g. carbohydrates), which can complement chromosomal functions.


Dataset Filtering
We sought to identify phylogenetic signal through a Bidirectional Best Hits (BBH) approach, in which 30 conserved markers in the form of Hidden Markov Profiles are searched against the input proteomes, and the resulting matches are forwarded to a thorough search against the entire Pfam protein database. These markers consist of housekeeping genes that are involved in information processing (replication, transcription, and translation) or central metabolism, and thus are thought to be relatively recalcitrant to lateral gene transfer [2,3]. These are listed below: 30S ribosomal protein S10 PF00338 rpsK 30S ribosomal protein S11 PF00411 rpsM 30S ribosomal protein S13 PF00416 rpsS 30S ribosomal protein S19 PF00203 smpB SsrA-binding protein PF01668 tsf elongation factor Ts PF00889 Using this hybrid approach we were able to assign 30 markers in 94.1% of the entries (2190/2327). We then investigated case by case those species for which markers were not found, to see if at least one other representative of the genus was present in the data set. As a form of quality control, we filtered out possibly incomplete genomes that did not contain all of the marker proteins. The dataset was further filtered to include one random representative per species. Since hits for markers dnaG and pyrG were very heterogeneous, both in terms of the spread of sequence and domain composition, we decided not to take these two markers into account for the subsequent multiple sequence alignment stage. The final number of species included before the multiple alignments stage was 1139, with 28 eligible markers. While manually editing each MSA with BioEdit version 7.2.5 [4], we further excluded the following species, due to a gappy and dishomogeneous signal throughout most of the alignments: And, for the same reason, the following markers: From an initial number of 2327 species 1190 were removed, leaving a total of 1128 organisms of non redundant species and 25 markers, and 4392 distinct alignment patterns with a proportion of gaps of 0.0%. Multiple sequence alignments were performed using Mafft [5] version 7.205 in automatic mode; the RAxML version 8.2.9 algorithm [6] was used to build a maximum likelihood phylogeny, using the LG amino acid substitution model and the GAMMA rate heterogeneity. The final tree is the bootstrap best tree following 200 bootstrap replicates, which was visualized using iTol [7].

Replicon-level Phylogenetic Analyses Data set description
We downloaded the amino acid sequences and features information for 259 publicly available species assigned to the orders of Alteromonadales and Vibrionales. A total of 25 genera were represented (19 and 6, from each order, respectively). 141/259 species had at least one secondary replicon assigned other than a primary

Supplementary_Figure_1
The first subplot illustrates the relationship between the roary's selected identity threshold and the total number of orthologous gene clusters in the pangenomes of each genus. The second subgraph illustrates the number of core genes in relation to the selected identity threshold. The x-axis shows the identity threshold, and the y-axis is the total number of genes and the number of core genes, respectively.

Supplementary_Figure_2
Phylogeny of the Gamma-Proteobacteria.

Supplementary_Figure_3
Boxplot showing the average of the sum of sizes of all secondary replicons in a genome to that of the primary chromosome in the same assembly in the Alteromonodales order and in the Vibrionaceae family.

Supplementary_Figure_4
Phylogenetic relationship of the secondary replicons of the order Alteromonodales and Vibrionales. Maximum likelihood phylogeny on 348 sites for 139 secondary replicons using partitioning protein ParA AAA_31 (A) and 244 sites for 134 replicons using ParBc (B) as markers. Clades whose average branch length distance to their relative leaves is less than 0.7 were collapsed.

Supplementary_Figure_5
Expanded, unrooted phylogenetic tree on of all replicons for both orders. Bootstrap support greater or equal than 90 is reported; phylogenetic signal computed using partitioning protein ParA AAA31.

Supplementary_Figure_6
Expanded, unrooted phylogenetic tree of the secondary replicons for both orders. Bootstrap support greater or equal than 90 is reported; phylogenetic signal computed using partitioning protein ParA AAA31.

Supplementary_Figure_7
Expanded, unrooted phylogenetic tree of all replicons for both orders. Bootstrap support greater or equal than 90 is reported; phylogenetic signal computed using partitioning protein ParBc.

Supplementary_Figure_8
Expanded, unrooted phylogenetic tree of the secondary replicons for both orders. Bootstrap support greater or equal than 90 is reported; phylogenetic signal computed using partitioning protein ParBc.

Supplementary_Figure_9
Bar plots showing COG category abundances in replicon specific accessory pangenomes of each Vibrionaceae group. Each sub-plot concerns data from a different Vibrionaceae genus, while the sub-plot Vibrio spp.
represents the comparison between the accessory pangenomes of chromosomes and the additional secondary replicons present in the Vibrio sp. THAF190c, THAF191c, THAF191d and THAF64 strains. The individual bars correspond to the percentage of gene clusters assigned to a specific COG category (blue -accessory chromosome pangenomes, orange -accessory secondary replicon pangenomes Asterisks symbolize comparisons for which the difference was statistically significant (p < 0.05 in Fischer's exact test).

Supplementary_Figure_10
Bar plots showing the abundance of KEGG categories in replicon specific accessory pangenomes.
Plots are similar to those shown in Figure 5 and Supplementary_Figure_9, but the annotation of specific gene clusters was performed based on the KEGG functional categories.                                ad  ace  ae  }Ps  eu  do  alt  ero  mo  na  s tra  nsl  uc  ida  TA  C1  25   {Se  co  nd  ary  rep  lico  n Pse   ud  oa  lte  rom on   ad  ace  ae  }Ps  eu  do  alt  ero  mo  na  s nig   rifa  cie  ns  KM  M  66  1   {Se  co  nd  ary  rep  lico  n Pse   ud  oa  lte  rom on   ad  ace  ae  }Ps  eu  do  alt  ero  mo  na  s tra  ns  luc  ida  KM  M  52  0   {Se  co  nd  ary  rep  lic  on  Ps  eu  do  alt  ero  mo  na  da  ce  ae  }Ps  eu  do  alt  ero  mo  na  s sp  . 13  -15   {Se  co  nd  ary rep   lic  on  Ps  eu  do  alt  ero  mo  na  da  ce  ae  }Ps  eu  do  alt  ero  mo  na  s sp  . 16  -SW -7   {Se  co  nd  ary rep   lic  on  Ps  eu  do  alt  ero  mo  na  da  ce  ae  }Ps  eu  do  alt  ero  mo  na  s arc  tic  a A 37   -1-2   {Se  co  nd  ary rep   lic  on  Ps  eu  do  alt  ero  mo  na  da  ce  ae  }Ps  eu  do  alt  ero  mo  na  s ca  rra  ge  en  ov  or  a KC  TC  22  32  5   {Se  co  nd  ary rep   lic  on  Ps  eu  do  alt  ero  mo  na  da  ce  ae  }Ps  eu  do  alt  ero  mo  na  s ag  ari  vo  ran  s Ha  o  20  18   {Se  co  nd  ary rep lic   on  Ps  eu  do  alt  ero  m  on  ad  ac  ea  e}P  se  ud  oa  lte  ro  m  on  as  ag  ari  vo  ran  s DS  M  14  58  5   {Se  co  nd  ary re   pli  co  n  Ps  eu  do  alt  er  om on   ad  ac  ea  e}P  se  ud  oa  lte  ro  m  on  as  sp  . Xi1  3   {Se  co  nd  ar  y re  pl  ico  n  Ps  eu  do  alt  er  om on   ad  ac  ea  e}P  se  ud  oa  lte  ro  m  on  as  es  pe  jia  na  DS  M  94  14   {Se  co  nd  ar  y re  pl  ico  n  Ps  eu  do  alt  er  om on   ad  ac  ea  e}P se   ud  oa  lte  ro  m  on  as  do  ng  ha  en