Correlation between bacterial G+C content, genome size and the G+C content of associated plasmids and bacteriophages

Based on complete bacterial genome sequence data, we demonstrate a correlation between bacterial chromosome length and the G+C content of the genome, with longer genomes having higher G+C contents. The correlation value decreases at shorter genome sizes, where there is a wider spread of G+C values. However, although significant (P<0.001), the correlation value (Pearson R=0.58) suggests that other factors also have a significant influence. A similar pattern was seen for plasmids; longer plasmids had higher G+C values, although the large number of shorter plasmids had a wide spread of G+C values. There was also a significant (P<0.0001) correlation between the G+C content of plasmids and the G+C content of their bacterial host. Conversely, the G+C content of bacteriophages tended to reduce with larger genome sizes, and although there was a correlation between host genome G+C content and that of the bacteriophage, it was not as strong as that seen between plasmids and their hosts.


INTRODUCTION
The redundancy of the genetic code, where as many as six different codons may encode a single amino acid, allows at least some tolerance of the nucleotides used by different organisms. This tolerance, at least in part, means that the bacterial genomic guanine+cytosine (G+C) content may vary enormously, depending on the species. Recently, this range was shown to extend from 17 to 75 mol% [1]. The factors influencing this variation have been debated for at least 50 years [2,3], including the suggestion that mutational bias acts upon genomes. This bias, together with environmental factors, was thought to exert a selection pressure towards the most adapted genome composition for a given habitat. Subsequent research suggested that this mutational bias generally acts across all bacterial species and promotes a trend towards genomes with higher adenine+thymine (A+T) content [4][5][6]. Other research revealed that the G+C content of individual bacterial species is correlated to a number of factors. These factors are not mutually exclusive and have included variables such as the organism's living environment [7], the ability or inability to fix atmospheric nitrogen [8], an organism's preference for aerobic or anaerobic conditions [8,9], and normal optimal temperature range [10,11]. The interconnection of these intrinsic and extrinsic factors means that no single condition is likely to be responsible for the G+C content of an organism, but rather this is due to multiple factors, which in turn makes identification of the relationships between them difficult to analyse.
Various approaches have been adopted to analyse the factors that might influence the G+C content, including traditional (laboratory-based) microbiology and in silico analyses using phylogenetic studies in an attempt to identify similarities between organisms with particular G+C contents. One of the simplest hypothesized relationships was that of a potential correlation between the genomic G+C content of an organism and genome size. This was first proposed by Sueoka [3] and has been studied further by others since (e.g. [12][13][14][15]). Initial investigations relying on examining the genome size posed problems due to shearing of DNA during the extraction process, thereby potentially leading to underestimations of the correct size. Even with the advent of pulsed-field gel electrophoresis [16], which greatly overcame the potential problem of DNA fragmentation, this issue was not fully resolved. However, with the improvements to DNA sequencing methods, particularly the increased use of next-generation sequencing to determine complete genome sequences, accurate values for both genome size and G+C content are becoming increasingly available.
The present study makes use of data from genome sequences and is, to our knowledge, the largest investigation undertaken to date to assess the potential relationship between genome size and G+C content. Furthermore, it also includes plasmids in the analysis and compares their G+C content to that of their host organism.

METHODS
Data were downloaded from the National Center for Biotechnology Information (NCBI) database, on 12 June 2017. For that purpose, Linux shell commands were used (awk for address parsing and wget for downloading), wrapped in a python script. At the time of downloading, the database contained 14 774 genome entries. The downloaded dataset included a number of draft and incomplete sequences. Only entries containing the text string 'complete' in their Fasta definition line (defline) were selected. The same criterion was applied for the separation of plasmids and phages, namely the existence of the text strings 'plasmid' and 'phage'. The rest were assigned as bacterial genomic sequences. The majority of bacteriophage genomes were downloaded from a separate directory in NCBI, but some sequences were also included in the main dataset for microbial genomes. These two datasets were merged after cleaning and any duplicates were removed computationally. Further entries described in their defline as 'putative' or 'endosymbiont' were also removed. This subset was comparatively small and lacked clear annotation.
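The defline-based selection described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the function names `classify_defline` and `deduplicate`, the case-insensitive matching, the order in which 'plasmid' and 'phage' are tested, and the use of the first defline token as a duplicate key are all assumptions.

```python
def classify_defline(defline):
    """Return 'plasmid', 'phage' or 'chromosome', or None if the entry is excluded."""
    text = defline.lower()  # assumption: matching is case-insensitive
    if "complete" not in text:
        return None  # draft or incomplete sequence
    if "putative" in text or "endosymbiont" in text:
        return None  # removed because of unclear annotation
    if "plasmid" in text:
        return "plasmid"
    if "phage" in text:
        return "phage"
    return "chromosome"  # everything else is treated as bacterial genomic


def deduplicate(deflines):
    """Keep the first entry for each identifier (assumed to be the first token)."""
    seen = set()
    unique = []
    for defline in deflines:
        key = defline.lstrip(">").split()[0]
        if key not in seen:
            seen.add(key)
            unique.append(defline)
    return unique
```

Applied to a merged list of deflines, `deduplicate` would run first and `classify_defline` would then assign each surviving entry to one of the three groups.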
All data manipulation and statistical analysis was performed using python 2.7 (implemented in anaconda 2, v4.4.0) (Python Software Foundation, https://www.python.org), in a Linux 64-bit environment. Standard python libraries were used for data cleaning and subsequent analysis, such as pandas, scipy and numpy.
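For illustration, the two quantities analysed throughout (sequence length and G+C content in mol%) can be computed from a nucleotide string as in the sketch below. The function name `gc_content` and the decision to exclude ambiguous bases such as N from the denominator are assumptions; the text does not state how ambiguous bases were handled.

```python
def gc_content(sequence):
    """Return G+C content as a percentage of the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # assumption: ambiguous bases (e.g. N) are ignored
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / total


example = "ATGCGGCCNN"
print(len(example), gc_content(example))  # length counts every symbol, including N
```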
Ordinary least squares (OLS) was applied for linear regression, using python with statsmodels.OLS. This method still provides unbiased regression estimates in the presence of unequal variance across the data (heteroskedasticity) [17], which was evident across all datasets. One drawback, however, is that when heteroskedasticity is present, the standard errors and P values from OLS can be too small or too large, and cannot be trusted for inference. To mitigate this effect, OLS was used with the HCCM (heteroskedasticity consistent covariance matrix) method [17], which in python statsmodels is implemented with the cov_type='HC0' option.
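To show what the HC0 correction changes, the sketch below fits a simple one-predictor regression in plain python and returns both the classical and the HC0 standard error of the slope. It is an illustration only, not the statsmodels code used in the study (where the same estimator is requested with `sm.OLS(y, X).fit(cov_type='HC0')`); the function name `ols_hc0` and the toy data are hypothetical. The slope and intercept are identical under both approaches; only the standard errors differ.

```python
import math


def ols_hc0(x, y):
    """Simple linear regression y = a + b*x with classical and HC0 slope errors."""
    n = len(x)
    x_mean = sum(x) / float(n)
    y_mean = sum(y) / float(n)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Classical standard error: assumes one common residual variance.
    sigma2 = sum(e ** 2 for e in residuals) / (n - 2)
    se_classical = math.sqrt(sigma2 / sxx)
    # HC0 (White) standard error: each squared residual weights its own point.
    meat = sum(((xi - x_mean) ** 2) * (e ** 2) for xi, e in zip(x, residuals))
    se_hc0 = math.sqrt(meat) / sxx
    return slope, intercept, se_classical, se_hc0


# Hypothetical toy data, not taken from the study.
print(ols_hc0([0.0, 1.0, 2.0], [0.0, 1.0, 4.0]))
```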

RESULTS
After cleaning, the dataset comprised 12 424 complete genome sequences from bacterial sources; 6671 from bacterial chromosomes, 5744 from plasmids and 4580 from phages. Inevitably, extensively studied microbial species, such as Escherichia coli or Bacillus spp., were represented by more than one strain.
The G+C content ranged from 13.5 mol% (Zinderia insecticola CARI) to 87.5 mol% (Streptomyces autolyticus strain CGMCC0516 plasmid), with a mean value of 48.4 mol%. In distributions with heavy skew, the median is a better estimate of a representative value. For the whole dataset, this was slightly higher than the mean, at 48.5 mol%. Lengths varied from 744 bp (Tremblaya phenacola PAVE plasmid) to 16 Mb (Minicystis rosea strain DSM 24000). Mean and median lengths were 2.08 and 1.64 Mb, respectively.

IMPACT STATEMENT
Larger genomes provide an opportunity for containing more genes due to the larger amount of DNA. However, the reasons associated with this are still debated and relatively unclear. Using genome sequences accessible from public databases, this paper examines the potential relationship between G+C content and genome length. In addition to studying bacterial genomes, the work also looks at this relationship between G+C content and genome length for both plasmids and bacteriophages. We also compare the G+C content of both plasmid and bacteriophage genomes relative to the G+C content of the organism from which they were isolated. In general, we found that larger bacterial genomes tend to have higher G+C contents, as was the case for plasmids. However, in bacteriophages, the G+C content appeared to reduce with an increase in size. There was a high level of correlation between the G+C content of plasmids and their host organism, a pattern that was seen to some extent between bacteriophages and the organisms they infect, but with a lower correlation level.

Bacterial genomes
Bacterial genomic sequence length ranged from 112 kb (Nasuia deltocephalinicola strain PUNC) to 16 Mb (M. rosea strain DSM 24000), with a mean length of 3.66 Mb and a median of 3.78 Mb (Table 1). The lowest G+C content was that of Z. insecticola CARI at 13.5 mol% and the highest that of Anaeromyxobacter dehalogenans 2CP-C, at 74.9 mol% (Table 2). The mean G+C content was 48.8 mol% and the median was 49.3 mol%.
The data showed a prominent heteroskedasticity. Longer sequences tended to have higher G+C content values, while variation in G+C started high in short genomes and decreased as genomes became longer. In keeping with previous research [13,14], this creates a data plot of a roughly triangular shape (Fig. 1). There is a positive correlation between genomic G+C content and bacterial genome length, though this is not a simple one: length is associated more with the range of G+C content than with its absolute value. As noted above, small sequences accommodate the whole range of G+C content, while as length increases, G+C values tend to occupy the upper part of the range. This is in keeping with the data in Table 1, where the five longest genome sequences all have G+C values of 69 mol% or more, whilst the shortest five examples range from 16.6 to 58.8 mol%.
Therefore, trying to fit a linear regression model onto this dataset was potentially problematic. Using heteroskedasticity-robust regression, the linear model explained only a small proportion of the variation (Pearson R=0.58, P<0.001). This is equivalent to an r² of 0.34 and, thus, around 66% of the variation in G+C content cannot be accounted for by this model. The heteroskedastic pattern could [...] (Table 2).
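The step from the Pearson R value to the proportion of variation explained is simple arithmetic, sketched here for the value reported above. The helper name `variance_explained` is hypothetical.

```python
def variance_explained(pearson_r):
    """Return (r squared, unexplained fraction) for a Pearson correlation value."""
    r_squared = pearson_r ** 2
    return r_squared, 1.0 - r_squared


r2, unexplained = variance_explained(0.58)
print(round(r2, 2), round(unexplained, 2))  # prints 0.34 0.66
```

In other words, a correlation of 0.58 corresponds to roughly a third of the variation in G+C content being associated with genome length in this linear model.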
Plasmids showed a similar pattern of G+C content variation to that seen in bacterial genomes, namely high variability of G+C in smaller sequences and a tendency for high G+C content as the size increased (Fig. 2). However, given the generally smaller length of these plasmids, the general abundance of shorter sequence lengths generated a rotated L-shape pattern when plotted, rather than the triangular shape seen for bacterial chromosomes.
Correlation between plasmid and host G+C content
A linear relationship (Fig. 3) [...] (Table 2).
Phages showed the pattern seen in both bacterial and plasmid genomes, namely high variability of G+C in smaller sequences. However, unlike bacterial and plasmid genomes, larger phage genomes showed a tendency for lower G+C content as the size increased (Pearson R value=−0.14, P<0.0001) (Fig. 4).
Correlation between phage and host G+C content
A linear relationship (Fig. 5) was evident between phage G+C content and the corresponding G+C content of the host organism (Pearson R value=0.90, P<0.0001). This was the best regression result for the whole dataset. The linear equation obtained approached a one-to-one relationship between the two variables, with the phage G+C content increasing about 0.88 mol% for every 1 mol% increase in host G+C, with about 81% of the variation being explained by this relationship (r²=0.81).

DISCUSSION
The data presented here demonstrate that there is a correlation between the length of a bacterial genome and its G+C content, particularly in the case of organisms with longer genome lengths. However, it is also clear that this alone is not enough to explain the complete variation in genome G+C content, as evidenced by the results from the linear regression model. Therefore, other factors need to be considered to explain the G+C content. Probably the most obvious of these would be the organism's normal optimal temperature range [10,11], as G+C-rich DNA, in which each G-C base pair is held by three hydrogen bonds, is more resistant to denaturation of the double strand than A+T-rich DNA, in which each A-T base pair is held by two. However, other environmental factors also need to be considered [7][8][9], together with the physiological capabilities of the organism [8]. Moreover, the heteroskedasticity of the length versus percentage G+C plot suggests that multifactorial variables may be most important in terms of organisms with shorter genome lengths, arguing that the roles played by environmental factors in terms of influencing the G+C content of a bacterial genome will require meta-analytical approaches to elucidate the other key factors. It is also worth noting that to date there has been a bias towards sequencing genomes of organisms that are either medically or agriculturally important. It will be interesting to determine whether the patterns observed continue as more bacterial genome sequences become available from organisms that are not medically important or from those that lack agricultural significance.
In the case of the chromosomal analysis, the G+C content does not go above 75 mol% or below 13 mol%. In part, this may be a reflection of the restrictions of the genetic code, where encoding certain amino acids requires at least some usage of A/T or G/C, e.g. phenylalanine requires TTC or TTT as a codon (with G+C-rich organisms likely to favour TTC) and glycine requires GGN as a codon (with A+T-rich organisms likely to favour either GGA or GGT). In addition to this requirement of compliance with the genetic code, there may also be restrictions imposed whereby unusual or rare codons are incorporated into genes [19], with the possible effect of slowing down the rate of translation to allow correct protein folding to take place. Moreover, there is evidence to suggest that DNA replication in organisms with a higher G+C content is associated with differences in the DNA polymerases present, such as the use of polC, and in the number and types of variants of the dnaE gene [20], as evidenced by organisms such as Pseudomonas putida [21].
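The codon constraint mentioned above can be made concrete with a short sketch that computes the G+C fraction of each synonymous codon for phenylalanine and glycine in the standard genetic code. The function name `codon_gc_fraction` is hypothetical, and the example covers only these two amino acids.

```python
def codon_gc_fraction(codon):
    """Return the fraction of G or C bases in a codon."""
    codon = codon.upper()
    return sum(1 for base in codon if base in "GC") / float(len(codon))


phenylalanine = ["TTT", "TTC"]
glycine = ["GGA", "GGT", "GGC", "GGG"]

for name, codons in (("Phe", phenylalanine), ("Gly", glycine)):
    fractions = [codon_gc_fraction(c) for c in codons]
    # The minimum and maximum bound what codon choice alone can achieve.
    print(name, min(fractions), max(fractions))
```

Even with the most extreme synonymous choices, a phenylalanine codon cannot exceed one G or C in three positions and a glycine codon cannot fall below two in three, which is one way the genetic code limits how far genomic G+C content can drift in either direction.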
Plasmids can be considered as genetic components of the bacterial cell and it is not surprising that their G+C content is correlated to that of their host. This observation has previously been discussed by Campbell and colleagues [22], where a substantial similarity in genomic signatures between prokaryotes and their plasmids was reported, although more recently Rocha and Danchin [23] reported that genetic elements that can be considered as 'intracellular pathogens', such as plasmids, phages and insertion sequences, have a tendency to have a lower G+C content than their host organism. However, this conclusion was drawn from a much smaller dataset relative to the current work. Moreover, with the potential benefits associated with some genes on plasmids, it makes sense to see a similarity in terms of G+C content for plasmid-borne genes that rely on the transcriptional and translational factors of the host organism (e.g. the encoding of specific tRNA molecules by the bacterial host). It has also been proposed that similarity in G+C content acts as a way of allowing the bacterial cell to discriminate between compatible and non-compatible DNA [24], although factors such as methylation patterns ensure that this is not as simple a mechanism as relying on the G+C content alone.
Moreover, the increasing number of examples of lateral gene transfer, or horizontal gene transfer, shows that inter-species transfer of genes is more commonplace than first imagined. While there are other means of moving DNA from one organism to another, using plasmids as a vector for this transfer is regarded as one of the most important. This is true for both inter-species conjugation of plasmids and transformational uptake of plasmids that have been released into the ecosystem by an alien species. Therefore, although the plasmids described are known to have been isolated from a particular bacterial species, it is impossible to determine when a given plasmid first became part of the bacterial cell, and also which organism(s) may have acted as the prior host(s). As above, it will be interesting to put this into context based on both bacterial and plasmid sequence data when sequences from additional organisms become available.
Conversely, phages can be regarded as true parasites of the cell, depending on the host organism for expression of their genes, without the potential associated benefit of factors such as antibiotic-resistance genes. However, this in turn also places a dependence on them to maintain a G+C pattern similar to that seen in the organisms they infect. As mentioned above, there have been reports to suggest that intracellular pathogens may have a G+C content lower than their host organism [23], and we also find this to be the case in the current analysis of phages, based on a much larger dataset than was used previously. The evolutionary explanation for this is unclear, although reducing the phage's metabolic burden via reduced pyrimidine synthesis has been proposed (e.g. [23,25]).
In terms of phage genome analysis, the site of any incorporation into the bacterial genome (e.g. as part of any lysogenic cycle) could also influence the G+C content of the phage genome. This would be in keeping with reports of heterogeneity of G+C content across bacterial genomes [26], where sliding window analysis identified regions of intragenomic variation of G+C content within a single species.
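A sliding window analysis of the kind referred to above can be sketched as follows. This is a generic illustration rather than the method of [26]: the function name `sliding_window_gc`, the window and step sizes, and the toy sequence are all assumptions.

```python
def sliding_window_gc(sequence, window, step):
    """Return (start, G+C percentage) for each full window along the sequence."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    seq = sequence.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, 100.0 * gc / window))
    return results


# Toy sequence with a G+C-rich half followed by an A+T-rich half.
for start, gc in sliding_window_gc("GGGGCCCCAAAATTTT", window=8, step=4):
    print(start, gc)
```

Regions whose window values depart markedly from the genome-wide G+C content are candidates for intragenomic heterogeneity, for example integrated prophage DNA.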
In conclusion, using a considerably increased dataset relative to previous work, we propose that a simple linear regression between bacterial chromosome length and G+C content accounts for at least some of the relationship. The same relationship is also true for bacterial chromosome G+C content and plasmid G+C content, although phages tend to have a lower G+C content than their hosts. However, in all cases there are other factors involved, although the true extent of each of these factors remains unclear, arguing for additional analyses via techniques such as principal component analysis or multiple regression analysis on data regarding the ecosystems from which organisms have been isolated.