Shared properties of gene transfer agent and core genes revealed by comparative genomics of Alphaproteobacteria

Gene transfer agents (GTAs) are phage-like particles that transfer pieces of cellular genomic DNA to other cells. Homologues of the Rhodobacter capsulatus GTA (RcGTA) structural genes are widely distributed in the Alphaproteobacteria and particularly well conserved in the order Rhodobacterales. Possible reasons for their widespread conservation are still being discussed. It has been suggested that these alphaproteobacterial elements originate from a prophage that was present in an ancestral bacterium and subsequently evolved into a GTA that is now widely maintained in extant descendant lineages. Here, we analysed genomic properties that might relate to the conservation of these alphaproteobacterial GTAs. This revealed that the chromosomal locations of the GTA gene clusters are biased. They primarily occur on the leading strand of DNA replication, at large distances from long repetitive elements, and thus are in regions of lower plasticity, and in areas of extreme GC skew, which also accumulate core genes. These extreme GC skew regions arise from the preferential use of codons with an excess of G over C, a distinct phenomenon from the elevated GC content that has previously been found to be associated with GTA genes. The observed properties, along with their high level of conservation, show that GTA genes share multiple features with core genes in the examined lineages of the Alphaproteobacteria.


INTRODUCTION
Gene transfer agents (GTAs) are phage-like particles that transfer small pieces of genomic DNA between cells. They have been identified in multiple Gram-negative bacteria and one archaeon [1]. Currently there are five distinct GTA types known, each appearing to have an independent evolutionary origin and varying breadths of taxonomic distribution [1]. Homologues of the Rhodobacter capsulatus GTA (RcGTA) genes are found in the genomes of members of multiple orders of the class Alphaproteobacteria [2,3], and the functionality of these RcGTA-like elements has been confirmed in divergent members of the alphaproteobacterial order Rhodobacterales [1,[4][5][6]. It has been suggested that these GTA elements are descendants of a prophage that integrated into the genome of an ancestral alphaproteobacterium, subsequently lost multiple phage-specific features, such as DNA replication and packaging specificity, and acquired mutations that resulted in a reduced head size [1]. This proto-GTA was then maintained through to the evolution of the extant lineages where the GTA genes are under cellular control.
Most of the RcGTA structural genes are located in a gene cluster of approximately 14 kb [7] that is conserved in the genomes of almost all examined members of the order Rhodobacterales [3]. This set of genes is also conserved to varying degrees in approximately half of the members of the alphaproteobacterial orders Rhizobiales, Sphingomonadales and Caulobacterales [3]. Possible reasons for the widespread conservation of these GTA genes are still being discussed. On the one hand, GTAs might contribute to the transfer of beneficial genes among cells [1,[7][8][9]. This hypothesis is supported by findings on the unrelated GTA produced OPEN ACCESS by Bartonella spp., where particle release and DNA uptake are restricted to the subpopulations with the highest fitness [10] and thus the GTAs are more likely to transfer genes that offer a benefit to the recipient cell. However, a modelling approach did not find support for any fitness advantage to GTA-producing over non-GTA-producing populations, as the resulting gene transfer did not compensate for the loss caused by GTA release [11], which requires cell lysis of the producing subpopulation of cells [12,13]. Perhaps GTAs are simply defective remnants of previously functional prophages [8], but this is difficult to reconcile with the findings that the RcGTA-related genes are under purifying selection [14] and that the production of RcGTA is co-regulated with the ability of cells to receive DNA from the particles [15]. Alternatively, an immunological function of GTAs has been proposed where GTAs transfer prophage DNA that can be incorporated into a recipient cell's CRISPR/Cas array and thereby 'vaccinate' the cells before an actual infection takes place [11].
The RcGTA family gene clusters show an increased GC content [(G+C)/(A+T+G+C)] relative to the rest of the genome, which results from a bias in the encoded proteins to contain amino acids that have a lower carbon content [16]. This could be important for the production of GTAs during nutrient limitation, as observed for RcGTA [17]. A previous analysis of different factors associated with GTA gene expression [18] drew our attention to the localization of the GTA gene clusters in regions of especially high GC skew, which is the normalized ratio of guanine to cytosine [(G-C)/(G+C)] and different from absolute GC content, in two considered species, R. capsulatus and Dinoroseobacter shibae. Circular bacterial chromosomes can be divided into two halves, the right and left replichores, based on the orientation relative to the origin (ori) and terminus (ter) of replication. The GC skew typically has positive values on the right replichore and negative values on the left replichore as guanine and cytosine dominate on the leading and lagging strand, respectively [18][19][20][21][22][23]. This asymmetric distribution is thought to be largely driven by deamination of cytosine to thymine, which might be affected by DNA replication, since the distribution pattern matches replication directionality [23,24]. This chromosomal composition bias is increased by some factors, such as an elevated growth rate [22], and decreased by others, such as recombination [25], and an overall more pronounced GC skew correlates with lower numbers of repeats [26]. Repetitive sequences and mobile genetic elements such as prophages can facilitate chromosomal rearrangements that reduce genomic stability, although these increase genomic plasticity and can provide an organism with greater adaptability [27][28][29].
Motivated by these previous observations related to DNA composition patterns, we performed a comprehensive genome sequence and structure analysis focused on patterns of GTA gene cluster conservation in four orders of the Alphaproteobacteria. This revealed trends in their localization, GC skew, codon usage and potential DNA methylation, and led to the overall conclusion that GTA genes share multiple properties with core genes in these bacteria.

METHODS
Analyses were carried out with R studio version 4.0.3 and relevant packages (Table 1).

Genome dataset and chromosome reorientations
Closed genomic sequences from bacteria within four alphaproteobacterial orders, the Rhodobacterales (n=147), Sphingomonadales (n=114), Caulobacterales (n=30) and Rhizobiales (n=462), were obtained from the National Center for Biotechnology Information (NCBI) GenBank assembly database (e.g. https://www.ncbi.nlm.nih.gov/assembly/?term=Rhodobacterales) on 12 March 2019. The accession numbers of the sequences used are provided in Table S1 (available in the online version of this article). We note that there have been subsequent taxonomic revisions among these bacteria but do not believe these detract from the utility or meaning of our analyses as based on this previous, long-standing taxonomic organization. The origin of replication (ori) was identified on each chromosome using Ori-Finder and default settings [30]. The ptt files were generated from gbff files (https://github.com/sgivan/gb2ptt#gb2ptt), downloaded from the NCBI database on 23 April 2019. Only chromosomes where one ori could be unambiguously identified were subsequently included in the investigation. We next determined the locations of the gene encoding the GTA major capsid protein (MCP) and found that all were located on the presumed major chromosomes (largest replicons) that were used for subsequent analyses.

Impact Statement
Gene transfer agents (GTAs) are phage-derived genetic elements that mediate gene transfer in some prokaryotes. The genetic potential for GTA production is notably widespread in some lineages within the class Alphaproteobacteria. The functionality of most of these elements remains untested, but their conservation leads to speculation that they serve some beneficial function for these bacteria. We analysed the genomic properties of alphaproteobacterial GTA genes that might relate to their conservation. This revealed they have biased chromosomal locations such that they primarily occur on the leading strand of DNA replication, in regions of lower plasticity, and in areas of extreme GC skew. These are properties of core genes in these bacteria, providing further evidence of the importance of GTAs for the biology of these organisms.
Depending on the analysis, the positions of ori or the GTA MCP gene were used to reorient the DNA fasta and gff files using custom R functions that are available within the newly developed package 'reorientateCircGenomes' (https://github.com/SonjaElena/reorientateCircGenomes.git). This package simplifies reorientation of sequences within fasta and gff files that originate from the NCBI database or Prokka based on base pair location or ProteinID. It can also be used to visualize circular chromosomes with strand information, GC skew and locations of selected genes.

Homology analysis
To identify homologous proteins, and thus gene families, in the genomes from the different orders, all proteins were blasted against each other and a matrix was generated for each order using Proteinortho version 5.16b [31]. The criteria to be considered a homologue were an e-value ≤1e-05, identity ≥30 % and coverage ≥75 %. Based on the identification of specific genes of interest (e.g. the GTA major capsid protein gene) in reference organisms' genomes, this database was then used to identify homologues in the other genomes. The reference organisms for the Rhodobacterales, Sphingomonadales, Caulobacterales and Rhizobiales are Dinoroseobacter shibae DFL 12 (=DSM 16493), Sphingopyxis alaskensis RB2256, Brevundimonas subcrescentus ATCC 15264 and Brucella suis 1330, respectively. A protein was designated a core protein if it was present in ≥90 % of all genomes within an order.

DNA composition analysis
The GC skew was calculated as (G-C)/(G+C) for a sliding window of 10000 bp. For cumulative visualizations, the sliding window size was set to 0.1 % of the chromosome lengths and then the mean GC skew of all organisms being considered was calculated.
GC skew peaks were identified independently of the reversal of the right and left replichores by using sliding quantiles to identify the local GC skew minima and maxima. The positions on the chromosome with GC skew values that belonged to the upper or lower 3 % of the GC skew values in a sliding window of 150 000 bp were identified. Genes located at these locations were then identified for further analyses. The relative GC skew was calculated as (skew sample -skew control )/skew control . The GC content of genes was calculated as G+C over the length of each protein coding region.
To examine the GC skew of prophages in relation to their respective host genomes, the insertion positions on the chromosome were determined using PHASTER [32]. Hits that overlapped with GTA locations, identified by ProteinOrtho, were attributed to be part of GTAs. To compare the GC contents and skews of GTAs and phages, the genomic sequences of phages that infect bacteria in the four considered orders were downloaded from the NCBI virus database (on 4 November 2020). The GC contents and skews were determined over sliding windows of 1000 bp for the phages and per gene for the GTAs.

Identification of repeats, methylation motif sites and large-scale inversions
Repetitive elements were identified with RepSeek [33]. Only repeats with a length >800 bp and identity >90 % were included to focus the analyses on long and highly similar repeats that are more prone to recombination [34]. We excluded overlapping repeats because these are probably not a major reason for large-scale chromosomal rearrangements. CcrM methylation potential was examined by searching for the GANTC motif in the DNA sequences. To rule out the possibility that the pattern we observed was caused by base composition instead of a possible methylation site, we also examined variations of this pattern that have equal base compositions (CGANT, CTGAN) (Fig. S1). Large-scale chromosome rearrangements were identified by visual inspection using MAUVE with the option progressiveMauve [35].

Codon usage
The DNA sequences for each open reading frame (ORF) were extracted from the genome fasta nucleotide files, based on the position information in the annotation (gff) files using the Biostrings and Genomic Ranges packages in R (Table 1). After sorting the ORFs of each organism according to whether they were found in GC skew peaks or not, the occurrence of each codon in each group was counted. The means per codon for both groups and for each organism were then calculated. The relative codon usage was determined as (peak-not-peak)/not-peak. For visualization, codons were grouped based on the encoded amino acids.

Phylogenetics
To identify closely related strains in which the location of the GTA gene cluster was switched between right and left replichores, a phylogenetic tree was generated for each order using RNA polymerase β protein (RpoB) amino acid sequences. Those RpoB sequences were identified with ProteinOrtho using sequences AAV96733.1, AAN30162.1, ANF54622.1 and ABF53199 as references for the Rhodobacterales, Rhizobiales, Caulobacterales and Sphingomonadales, respectively. The alignments and trees were generated with mega X [36]. The default settings were used for pairwise and multiple alignments. Partial deletion with a delay divergent cutoff of 30 % was used for gaps and missing data. The trees were constructed with the maximum-likelihood method. The branching patterns were evaluated using 100 bootstrap replicates, and the LG model was applied with gamma distribution at invariant sites. The site coverage cutoff was 95 %.

Dataset generation
Closed genomic sequences were downloaded from the NCBI database for four orders of the class Alphaproteobacteria. These orders were chosen based on their high numbers of available complete genomic sequences and their high level of GTA gene cluster conservation [3]. The Rhizobiales had the most genomes (462), followed by the Rhodobacterales (147), Sphingomonadales (114) and Caulobacterales (30) ( Table 2). To standardize analyses of gene localization, repeats and methylation motifs, all chromosomes were reoriented to the origin of replication (ori). Genomes were excluded when a single ori could not be unambiguously identified. For GTA gene cluster-related analyses, further dataset reduction was made based on the presence of the major capsid protein (MCP) gene. These selection criteria resulted in a reduced set of genomes available for analysis (Table 2). Repeating the analyses with only one representative strain per species for the Rhodobacterales and Rhizobiales showed that there was no bias in the results caused by overrepresentation of strains for certain species (data not shown).

GTA gene clusters are located on the leading strand, close to the terminus of replication and far from repeat regions
As expected, based on previous analyses, the presence of complete GTA gene clusters was most well conserved in the genomes of Rhodobacterales members, followed by the Rhizobiales, Sphingomonadales and Caulobacterales (Fig. 1). The localization was conserved near the chromosomal replication terminus (ter) in the Rhodobacterales (Fig. 1) and to a lesser extent in the Rhizobiales. The clusters were more scattered in the Sphingomonadales and tended to be halfway between ori and ter in the Caulobacterales (Fig. 1). *Based on more than one representative strain per species; final number of genomes included in analysis.
Long repeats (>800 bp) enable homologous recombination [34] and can be responsible for extensive chromosomal rearrangements. Indeed, recombination events between repetitive elements were identified as the likely explanation for the GTA gene cluster's location on different replichores in closely related species in the Rhodobacterales and Rhizobiales. For example, the replichore switch observed between Phaeobacter inhibens 2.10 and P. galleciensis was associated with regions containing many transposable elements (Fig. S2), which are often the cause of homologous recombination [37]. Similarly, the recombination event associated with replichore differences between B. suis and Brucella abortus was due to a region with paralogous genes (Fig. S2).
The MCP gene was found on the leading strand of DNA replication in all but five genomes of the Rhodobacterales and Rhizobiales, irrespective of the replichore the cluster was located on (Fig. S3). DNA replication and transcription occur simultaneously in bacteria and head-on collisions between the replisome and RNA polymerase lead to disruptions of both processes that require conflict resolution mechanisms [38]. This has a strong effect on genome organization and evolution because conflicts occur more often on the lagging strand (head-on), and genes oriented this way have higher mutation rates [39]. These disruptions are less common for genes on the leading strand (co-directional) [40][41][42] and slower evolving core genes tend to have this orientation [41,43]. Therefore, the GTA genes are like core genes with respect to gene orientation on the chromosome and although recombination events switch GTA gene clusters between replichores, the orientation of the clusters on the leading strand of DNA replication is maintained.

Cumulative genomic GC skew reveals a unique pattern associated with Rhodobacterales GTA clusters
A typical GC skew pattern for a circular genome is positive on the right replichore and negative on the left replichore. It has been hypothesized that inversions from the co-directional to head-on orientation can be traced by a sign change of the GC skew (e.g. right replichore, leading strand, positive GC skew to right replichore, lagging strand, negative GC skew) ( Table 3) [41]. We applied this categorization to all genes whereby not following the typical GC skew indicates an inversion from the other strand. By determining the proportion of genes present in the different genome orientations and with different GC skews we found that co-directional localization was predominant in all four orders, with 8-10 % more genes located on the leading strand than on the lagging strand (Fig. S4a). The overall proportion of genes that follow the typical GC skew was similar in the four orders and ranged from 62.6-66.8 % (Fig. S4b). This trend was most pronounced on the leading strand in the Rhodobacterales (Fig. 2a) and on the lagging strand in the Rhizobiales and Sphingomonadales, while equal numbers of genes followed the typical GC skew on the leading and lagging strands in the Caulobacterales. Approximately equal numbers of genes that follow the typical GC skew or are inverted were found on the lagging strand in the Rhodobacterales (Fig. 2a). Thus, the Rhodobacterales genomes have the strongest conservation of the typical GC skew pattern on the leading strand but the lowest conservation on the lagging strand. The same distribution pattern, although with a slightly stronger preference for the leading strand, was found for core genes (Fig. S4c). The differences between the gene proportions ( Fig. S4c)   Most organisms' genomes follow the typical GC skew pattern, and it is considered an archetypal genomic property. A high GC skew might reflect the original ancestral genome of the bacteria considered here. Indeed, it was recently proposed that a deviation  from this pattern could be used to detect misassembled genomic sequences [42]. In the Rhodobacterales the expected skew is reduced on the lagging strand due to increased inversions, with genes changed from co-directional to head-on orientation. It is unclear why this pattern is found for this group, but it has been suggested that switching genes from co-directional to the head-on orientation might have benefits by increasing evolvability [41], although this is still under debate [44,45]. There are three hypotheses on the evolutionary history of genome architecture [41] whereby there is either a reduction of head-on genes, a reduction of co-directional genes, or the original ratio is retained. While all four orders studied here have similar proportions of head-on and co-directional genes (Fig. 2a), the inversions on the leading and lagging strands change in different directions in the Rhodobacterales compared to the other three orders, making it difficult to draw a general conclusion from our analysis.
Next, we used cumulative representations of the GC skew (determined over a sliding window) to evaluate the magnitudes of and patterns in the GC skews. The maximal deviations from zero were similar in all four orders, but the patterns varied among them (Fig. 2b). The GC skew values deviated >±0.03 in the Rhodobacterales and Rhizobiales, resulting in clear bimodal distributions (Fig. 2b). In contrast, the distributions were closer to normal in the Sphingomonadales and Caulobacterales, although some bimodality could still be seen. The GTA gene clusters were predominantly located within GC skew peaks in all four orders (Fig. 2c). Indeed, comparing the absolute GC skews of the regions where GTA gene clusters are located to the remainders of the genomes showed that they have greater GC skews than average (Fig. 2c). Interestingly, the strongest deviations could be observed in the Sphingomonadales and Caulobacterales, which have more genes with GC skews around zero (Fig. 2b). This indicates that GTA clusters tend to have high GC skews, irrespective of the rest of the genome's overall properties.

Correlation between GC skew and codon usage in the Rhodobacterales
A difference in GC skew value could affect the codon usage, so we examined which codons are enriched in genes with a high GC skew and checked for potential differences in codon usage between genes in GC skew peak and non-peak regions. In the Rhodobacterales we found that codons that were overrepresented in GC skew peaks also had a significantly higher GC content (Fig. 3). This contradicts findings of a negative correlation between GC content and composition bias [25]. Although the Rhizobiales and Rhodobacterales displayed very similar bimodal GC skew distributions, no differences between the codon usage in peak and non-peak regions were observed in the Rhizobiales. Similarly, no significant correlations were found for the other two orders. Next, we compared codon usage to codon GC skew. Significant positive correlations were found for the Rhodobacterales and Caulobacterales (Spearman's correlation coefficient P-values=0.0002 and 0.012, respectively) (Fig. 3). Hence, in the Caulobacterales GC content and in the Rhodobacterales the GC content and GC skew positively correlate with codon usage, which suggests that special codons are used in GC skew peaks and that there is a distinct evolutionary process occurring in the Rhodobacterales.
To determine what influence GC content and GC skew might have on codon usage, we selected codons with identical GC content that encode the same amino acid and compared them based on their GC skew and usage (Fig. 4). We found that codons with higher GC skew were more predominant in GC skew peaks for the Rhodobacterales and Caulobacterales. Thus, in GC skew peaks of those orders, codons with a higher proportion of guanine are preferred instead of cytosine. This was also true for GTA genes compared to non-peak genes in all four orders (Fig. 4). It was previously shown that GTA genes preferentially encode amino acids with a lower carbon content [16], which results in increased GC content [3]. However, our results show that these genes also preferentially use codons with increased GC skew, which is responsible for the location of GTA gene clusters in GC skew peak regions.
Nucleotide strand bias is generally attributed to cytosine deamination events that produce thymine [23,46]. Therefore, cytosine levels should decrease as they are converted into thymine if deamination is responsible for the GC skew. To see whether there is a lower occurrence of cytosine in the GC skew peaks, we examined codons for their G and C content and their relative usage. We found that an increased codon usage correlated with an increased G content in the Rhodobacterales and Caulobacterales, while the C content tended to decrease (Fig. S5). In the GTA genes, the use of both guanine and cytosine correlated positively with the increase in codon usage (Fig. S6). There was no significant change in guanine or cytosine levels in the peak versus non-peak regions of the Sphingomonadales or Rhizobiales. This could mean that deamination processes are not responsible for the GC skew or at least that they play a smaller role than the preferred selection for guanine-containing codons in the Rhodobacterales and Caulobacterales.

Core and GTA genes are commonly found in GC skew peaks
We found that core genes, defined as those present in ≥90 % of all genomes of an order, were accumulated in regions with GC skew peaks in all four orders (Fig. 5a). The ratios of core genes in peak to non-peak regions (calculated as percentage peak/percentage non-peak) were the most extreme in the Rhodobacterales (32) and Rhizobiales (14), but were also >1 in the Sphingomonadales (2.4) and Caulobacterales (2.4).
To investigate which genes are in GC skew peaks and how conserved their presence in peaks is, we compared the numbers of members of gene families found inside and outside of peaks (without differentiation according to the direction of the peaks) (Fig. 5b). Besides the GTA gene cluster genes that are strongly enriched in GC skew peaks in all four orders, there were multiple examples of genes associated with central physiological processes, such as protein processing (i.e. chaperones; dnaK and clpB), translation (ribosomes), cell division, nicotinamide adenine dinucleotide metabolism, flagellar motility and recombination, that were frequently found in GC skew peaks (Table S2). Overall, although core genes do not necessarily have to be in GC skew peaks, they are overrepresented in these peaks.
The localization of GTA gene clusters as well as many core genes in genomic areas with GC skew peaks was especially evident for the Rhodobacterales and Rhizobiales, which also have the most pronounced GC skews of the considered orders. This is particularly interesting for two reasons. First, these core genes have been conserved in a wide range of genomes over a long period of time and thus originate from a common ancestor [44,47]. Second, the GTA gene cluster is believed to have evolved from a prophage that integrated into the genome of a shared ancestor of multiple alphaproteobacterial orders [2], which means that it has been present in these genomes as long as many of the core genes [14] and also shares the property of being preferentially located in GC skew peaks. Thus, it is possible that the original ancestor of these bacteria had a more extreme GC skew, which is supported by the following points. GTAs are evolutionarily related to phages, which have been shown to integrate into the host genome in ways that maximize their success of replication, such as preferential integration on the leading strand and closer to ter, and with counter-selection for motifs that could result in disruption of macrodomain structures of the host genome (e.g. Fig. 3. Relationships between codon usage and GC content or GC skew. The median relative codon usages for each order were calculated from the mean codon usage values per genome (peak−non-peak/non-peak) and correlated with the GC content (left) or GC skew (right). The Spearman correlation was used to test significance and P-values are given above the plots.
motifs associated with the ter macrodomain are not found in prophages) [45]. The GTA gene clusters showed similar trends (co-directional orientation and localization closer to ter), and possibly these properties have been maintained since evolving from the original prophage. Regarding GC skew, however, prophages and the GTA gene cluster differ. We found no preferential localization of phages in GC skew peaks (Fig. S7), which also fits with their place as accessory mobile genetic elements as opposed to core genes. Moreover, our results have some commonality with findings for eukaryotes, in which highly conserved genes are also contained in genomic regions with strong GC skew [48][49][50], and it was also shown that those highly conserved genes encode proteins with longer half-lives [50].

Core and GTA genes are located far from repetitive elements
Although GTA gene clusters are occasionally translocated between replichores (Figs S2 and 3), the cluster itself and its orientation relative to the direction of DNA replication are well preserved. Therefore, we investigated the stability of these localizations. Regions with repeats, especially long repeats (>800 bp), represent hotspots for chromosome rearrangement and thus increase the local genome plasticity. We found that the distance of GTA gene clusters to long repeats is higher than for most other genes Fig. 4. Relative codon usage in GC skew peak genes and GTA genes. Only codons that have a GC content greater than zero and codons specifying a particular amino acid that have the same GC content were considered. The relative codon usage values for peak (or GTA) versus non-peak genes were calculated as [peak (or GTA)]/non-peak. The GC content and skew for each codon in each order are indicated based on their shape and colour, respectively, according to the legend at the top right. The codons are indicated at the bottom of the plots with dashed vertical lines separating the different amino acids, which are indicated above the plots using their one-letter codes.
( Table S3). To investigate whether this characteristic is also shared by core genes, we sorted all genes into two groups according to their distances from the nearest repeat, with 'far' being considered as >1/1000 of the genome size (approximately >4 kb for most organisms) and 'near' being considered as <1/1000 of the genome size (approximately <4 kb for most organisms). Core and GTA genes were predominantly found further away from repeats in all four orders (Fig. 6) and therefore are localized in regions with lower plasticity. Interestingly, a previous analysis of genomes from different phyla showed that genomes with a stronger overall GC skew tend to have fewer repeats and it was concluded that stable chromosomes have higher GC skews and fewer repeats [29]. As discussed above, it was previously documented that GTA genes have a higher GC content [3], and high-GC-content regions have also been shown to have lower rearrangement frequencies [16]. Studies on the gammaproteobacterium Vibrio parahaemolyticus and the epsilonproteobacterium Helicobacter pylori also showed that GC content and plasticity regions are negatively correlated [51,52]. This could also contribute to the conserved GTA gene cluster localization.   6. Relationship of gene conservation to distance from repeats. Genes were classified as far from >1/1000 of the genome size (approximately >4 kb for most organisms), or near to <1/1000 of the genome size (approximately <4 kb for most organisms), repeats. The percentages of the number of genes found in ≥90 % of the genomes are shown for the Rhodobacterales and Rhizobiales. The percentages of genes found in more than six or eight genomes (representing the numbers of genomes closest to 90%) are shown for the Caulobacterales and Sphingomonadales, respectively.

Relationship between potential DNA methylation sites and GTA gene cluster localization
One of the key RcGTA regulators is the response regulator protein CtrA [7]. This regulator is almost universally conserved throughout the Alphaproteobacteria [53], with shared and unique roles in different lineages and acting as a key cell cycle regulator in some. In Caulobacter crescentus, where CtrA has been best characterized, the ctrA promoter and multiple promoters of the CtrA regulon are targeted by the methyltransferase CcrM and the transcriptional regulator GcrA. CcrM methylates the adenine residue of the motif 5′-GANTC-3′, which the protein GcrA then binds to and recruits RNA polymerase to initiate transcription from the associated promoter [54,55]. While it is well documented that methylation has a strong influence on the CtrA phosphorelay and its regulon, and that CtrA controls the GTA gene cluster in Rhodobacterales members, a connection between all three components (CtrA, methylation by CcrM and the GTA gene cluster) has not yet been investigated to our knowledge. Therefore, we analysed potential CcrM methylation patterns by determining the occurrence of the GANTC recognition motif over the length of the chromosomes (normalized to 1000 bp). A strong increase in methylation motifs in the region around ori was observed in all four orders (Fig. 7). Strong and slight additional increases in GANTC motif numbers towards ter also exist in the Caulobacterales and Rhodobacterales, respectively.
The reorientation of each chromosome, such that the position of the MCP was the first gene, showed greater overall fluctuations in the GANTC motif pattern occurrences (Fig. 7), probably due to the greater variability in the localization of the GTA genes compared to ori (Fig. 1). A slight drop in GANTC numbers at the MCP gene can also be seen in the Caulobacterales and Sphingomonadales genomes. In the Rhodobacterales, this reorientation showed that the GTA regions had the lowest methylation potentials across genomes, even though there is a slight increase in motifs near ter in this group. The strongest conservation of GTA gene cluster localization is also found in this order, where it is biased towards ter. This is in accordance with our previous study where the Rhodobacterales ctrA gene was also found conserved near to ter and showed the lowest number of GANTC motifs in ctrA promoter regions [56]. Hemi-methylation of this motif during replication has been found to be a signal for transcriptional activation [57]. The presence of less GANTC sequences and its location near ter might indicate that ctrA transcriptional control is uncoupled from replication in this group. However, future studies are needed to show the potential significance of the low number of GANTC motifs in the GTA gene cluster region.

CONCLUSIONS
In this study, we performed a comprehensive genome structure analysis for four orders of the class Alphaproteobacteria to examine patterns of the RcGTA-type gene cluster localization, genomic GC skew and DNA methylation. We found that the GTA gene cluster shares properties with core genes, such as localization in low-plasticity regions, gene orientation on the leading strand of DNA replication and localization in regions of especially strong GC skew. These high-GC-skew regions at least partly arise due to a selection for codons with higher GC skew and are not necessarily associated with codon GC content enrichment. The GTAs studied here are proposed to have evolved from a phage that integrated into the genome of an ancestral alphaproteobacterial host. Generally, phages try to mimic their host's genome structure [58], but we did not find any notable elevation of GC skew among phages and prophages. Therefore, it seems that part of the evolutionary process of becoming a GTA included gaining this elevated GC skew, and this might be connected to other properties these genes have in common with core genes, such as their location in regions distant from repetitive elements.

Conflicts of interest
The authors declare that there are no conflicts of interest.