Genome-scale rates of evolutionary change in bacteria

Estimating the rates at which bacterial genomes evolve is critical to understanding major evolutionary and ecological processes such as disease emergence, long-term host–pathogen associations and short-term transmission patterns. The surge in bacterial genomic data sets provides a new opportunity to estimate these rates and reveal the factors that shape bacterial evolutionary dynamics. For many organisms estimates of evolutionary rate display an inverse association with the time-scale over which the data are sampled. However, this relationship remains unexplored in bacteria due to the difficulty in estimating genome-wide evolutionary rates, which are impacted by the extent of temporal structure in the data and the prevalence of recombination. We collected 36 whole genome sequence data sets from 16 species of bacterial pathogens to systematically estimate and compare their evolutionary rates and assess the extent of temporal structure in the absence of recombination. The majority (28/36) of data sets possessed sufficient clock-like structure to robustly estimate evolutionary rates. However, in some species reliable estimates were not possible even with ‘ancient DNA’ data sampled over many centuries, suggesting that they evolve very slowly or that they display extensive rate variation among lineages. The robustly estimated evolutionary rates spanned several orders of magnitude, from approximately 10−5 to 10−8 nucleotide substitutions per site year−1. This variation was negatively associated with sampling time, with this relationship best described by an exponential decay curve. To avoid potential estimation biases, such time-dependency should be considered when inferring evolutionary time-scales in bacteria.


Introduction
Estimating the rate of molecular evolution is critical for understanding a variety of evolutionary and epidemiological processes. Rates of molecular evolution are the product of the number of mutations that arise per replication event, the frequency of replication events per unit time and the probability of mutational fixation. These rates are determined by a variety of factors including background mutation rate, the direction and strength of natural selection, generation time and population size. Both generation time and population size have been shown to scale negatively with the substitution rate in a range of organisms (Bromham, 2009). For example, because of differences in generation time, spore-forming bacteria evolve more slowly over time than those that do not form spores (Weller & Wu, 2015). Similarly, lineages undergoing adaptive evolution are expected to accumulate substitutions more rapidly than those subject to purifying selection (Eyre-Walker & Keightley, 2007).
Despite the wealth of sequence data, genome-wide rates of evolutionary change in bacteria are often uncertain. At one end of the spectrum, rates as high as~10 À5 nucleotide substitutions per site year À1 have been reported for Neisseria gonorrhoeae (Pérez-Losada et al., 2007). In contrast, genome-wide rates of only~10 À9 substitutions per site year À1 have been observed in Mycobacterium tuberculosis (Comas et al., 2013). Importantly, however, these estimates are not always readily comparable because they use different methods and sources of data. Furthermore, most previous studies have not investigated the degree of temporal structure (i.e. clock-like behaviour) in the data, such that their reliability is uncertain. The increasing availability of genomic data means that it is now possible to estimate substitution rates with sequences collected over a number of years (i.e. tip-date calibration), not only for rapidly evolving pathogens, such as RNA viruses, but also in some DNA viruses and bacteria (Biek et al., 2015). The rates at which bacteria evolve, the strength of clock-like signal in the data and the determinants of any rate differences observed have not been systematically investigated. However, these parameters are of importance both for the accurate interpretation of outbreak investigations that may depend on the reliable estimation of the time-scale of putatively linked transmission cases, and for revealing the long-term time-scales over which bacteria have been associated with specific hosts. Importantly, estimates of evolutionary rate have been shown to scale negatively with their time-scale of measurement in several organisms (Ho et al., 2011), a pattern that has been attributed to the gradual purging of deleterious mutations over time (Ho & Larson, 2006;Penny, 2005). In the context of phylogenetic analyses, the time-scale of measurement corresponds to the age of the calibration or the sampling time-frame (Duchêne et al., 2014;Molak & Ho, 2015). In viruses, natural selection, mutational saturation and substitution model inadequacy have also been shown to contribute to this pattern (Duchêne et al., 2014Ho et al., 2015a). However, the phenomenon of timedependency of rate estimates remains largely unexplored in bacterial genomes, although there is some empirical evidence that it may also hold in these organisms (Comas et al., 2013).
To provide a comprehensive picture of genomic-scale evolutionary rates in bacteria and their temporal dynamics, particularly the extent of time-dependency in the data, we analysed, using a variety of phylogenetic methods, 36 publically available whole genome SNP data sets from bacterial pathogens associated with human disease sampled over periods extending over 2000 years.

Methods
Data collection. We collected 35 whole genome nucleotide SNP alignments from previously published studies (Baines et al., 2015;Bart et al., 2014;Bos et al., 2014;Bratcher et al., 2014;Croucher et al., 2011;Davies et al., 2015;Devault et al., 2014;Eldholm et al., 2015;Gaiarsa et al., 2015;Holden et al., 2013;Holt et al., 2008Holt et al., , 2013Holt et al., , 2016Howden et al., 2013;Marvig et al., 2013;Merker et al., 2015;Njamkepo et al., 2016;Schuenemann et al., 2013;Schultz et al., 2016;Stinear et al., 2014;Uhlemann et al., 2014;Wagner et al., 2014;Ward et al., 2014;Zhou et al., 2013Zhou et al., , 2014, and one unpublished data set of Salmonella enterica serovar Kentucky (Table S1). The Salmonella enterica serovar Kentucky data set comprised 88 strains isolated between 1937 and 2012 (Le Hello et al., 2013). Whole genome sequencing was performed at GATC Biotech (Germany) using an Illumina HiSeq system and analysed as described previously . Although sequence reads for most other data sets are publically available in the NCBI Short Read Archive (SRA), we obtained the original alignments from the authors wherever possible. With this approach we take advantage of domain knowledge of the original studies for choice of reference sequence and identification of repetitive or horizontally transferred sequences, which are important for the accurate generation of SNP

Impact Statement
Phylogenetic methods can be used to infer the evolutionary time-scale of groups of organisms. In some pathogens it is possible to estimate the time of origin of disease outbreaks or of cross-species transmission events. However, the accuracy of these estimates relies on our understanding of the rate at which genetic change accumulates over time. The recent surge of genomic data presents an unprecedented opportunity to enhance our understanding of microbial evolution, with the potential of improving future inferences of their evolutionary time-scale. We estimated the rate of nucleotide substitution in 36 complete genomes of different bacterial pathogens using a range of computational methods. We find large differences in the rates produced. For example, some bacteria, such as those that cause hospital-derived infections, evolve orders of magnitude more rapidly than those that cause tuberculosis, which undergo extended periods of latency. We also find that the sampling times appear to play an important role in determining their rate. Our results provide the first genomic perspective of bacterial rates of evolution, thereby improving our understanding of the timescale over which they diversify.
alignments. We removed outgroup taxa and samples that were distantly related to the majority of sequences to limit our analyses to the taxonomic group of interest and to minimize the artificial inflation of among-lineage rate variation due to the presence of very long branches in the phylogenetic trees. For the Neisseria meningitidis and M. tuberculosis Lineages 2 and 4 data sets we obtained sequence reads from the Short Read Archive and called SNPs using the RedDog pipeline [as described previously ; available at https://github.com/katholt/RedDog].
SNPs can be introduced into bacterial genomes individually via mutations or in clusters by recombination, a process that may bias demographic inferences and rate estimates (Hedge & Wilson, 2014;Lapierre et al., 2016). In particular, ignoring recombination can lead to incorrect estimates of branch lengths, which in turn results in an apparently overdispersed molecular clock, precluding accurate estimates of evolutionary rates and time-scales. In some cases, removing sites with evidence of recombination can alleviate this problem (see Fig. S1). Thus, we removed all genomic regions with evidence of recombination using Gubbins v1.4.2 (Croucher et al., 2014), and verified these results using RDP4 (Martin et al., 2015). For RDP4 we removed sequences with significant evidence of recombination according to six of the methods implemented in the program: Bootscan (Salminen et al., 1995), Geneconv (Padidam et al., 1999), Maxchi (Smith, 1992, Siscan (Gibbs et al., 2000), RDP (Martin & Rybicki, 2000) and 3seq (Boni et al., 2007). Accordingly, we discarded an entire data set of N. meningitidis in which nearly half of the sequences displayed recombination (Table S1). We also visually checked that the alignments were free of additional obviously recombining regions. Our final data set comprised 16 bacterial species from 13 genera, with genome sizes ranging from 1.4 Mbp (Streptococcus pyogenes) to 6.2 Mbp (Pseudomonas aeruginosa). The data set sizes ranged from 189 (Streptococcus pneumonia and M. tuberculosis Lineage 4) to 15 (Mycobacterium leprae ancient DNA) sequences, and alignment lengths from 15 394 SNPs (Acinetobacter baumannii) to 402 (M. tuberculosis Lineage 4) SNPs.
All the data sets analysed here included sampling times associated with each sequence in the form of year of isolation. Individual data sets spanned sampling ranges of approximately 2000 years in the case of M. leprae to 2.5 years in Staphylococcus aureus ST8:USA300. Four data sets included ancient DNA samples: M. leprae (Schuenemann et al., 2013), two Yersinia pestis data sets (Wagner et al., 2014), and one M. tuberculosis data set comprising strains sampled from New World mummies, animal strains and human Lineage 6 (Bos et al., 2014). In M. leprae, the oldest sample had an estimated age of approximately 2000 years (Schuenemann et al., 2013), while in the Y. pestis data sets the oldest samples had ages of approximately 600 and 1500 years (Wagner et al., 2014). The New World mummy M. tuberculosis samples had ages of approximately 900 years (Bos et al., 2014). Notably, one of the Y. pestis data sets included samples from both contemporary strains and those from the first, second and third plague pandemics, while the other is a subset that only included sequences from the second pandemic that began with the Black Death and contemporary strains. The remaining data sets had sampling times from a few years to several decades (Table S1). All the data sets used here are available online (http://zenodo.org/record/ 45951#.Vr1M-JN95E4).
Maximum-likelihood (ML) analysis and root-to-tip regression. We estimated ML trees using PhyML v3.1 (Guindon et al., 2010), employing the GTR+G substitution model with four categories for the G distribution of amongsite rate heterogeneity and sub-tree pruning regrafting branch-swapping. We did not consider the proportion of invariable sites in the substitution model because the data sets comprised SNPs only, such that all sites are variable. The branch lengths in the ML trees are the expected number of nucleotide substitutions per site; as we are working with SNP alignments, this corresponds to the expected number of substitutions per variable site. To convert the branch lengths into genome-wide distances, we multiplied them by the length of the SNP alignment, and divided them by the core genome length that reflects the size of the genome in which SNPs were called (this information was extracted from the original publications). We fitted regressions for the root-to-tip distance as a function of sampling time using NELSI v0.21 , where the position of the root is selected to maximize the determination coefficient, R 2 . Under clock-like evolution there should be a linear relationship between sampling time (year) and the expected number of nucleotide substitutions along the tree. The slope of the line corresponds to the substitution rate, although this estimate is statistically invalid because data points may not be phylogenetically independent. The extent to which the points deviate from the regression line reflects the amount of among-lineage rate variation, measured using R 2 (Rambaut et al., 2016).
Bayesian analysis and date randomizations. We estimated rates of evolutionary change using a Bayesian Markov chain Monte Carlo method implemented in BEAST v1.8 (Drummond et al., 2012), with a chain length of 10 8 steps, sampling every 5000 steps. If the effective sample size of any of the parameters was less than 200, we increased the chain length by 50 % and reduced the sampling frequency accordingly. For all data sets we used the GTR+G substitution model and the sampling times (tip dates) for calibration. For this analysis we utilized both strict and uncorrelated lognormal molecular clock models and constant-size coalescent and Bayesian Skyline demographic models, resulting in four possible model combinations. We compared the statistical fit of these models by estimating marginal likelihoods via path-sampling (Xie et al., 2011). We report the rate estimates from the model with the highest marginal likelihood.
To validate our Bayesian rate estimates we conducted a daterandomization test, which consists of repeating the analysis while assigning the sampling times randomly to the sequences (Firth et al., 2010). In all cases we used the best-fit combination of demographic and clock model, described above. We conducted ten randomization replicates per data set to generate an expectation of the rate estimates in the absence of temporal structure, which appears to be sufficient to assess temporal structure in bacterial data (Murray et al., 2015). Two criteria have been proposed to assess temporal structure. One method, known as CR1 , considers that data do not have temporal structure if the mean rate with the correct sampling times is contained within the 95 % highest posterior density (HPD) of that from any of the randomizations. CR2 is more conservative; the data do not have temporal structure if the HPD of the estimate with the correct sampling times overlaps with that from any of the randomizations. These criteria have different levels of type I and type II errors . Here, we considered the proportion of randomized replicates with HPDs that overlapped with those obtained using the correct sampling times, following CR2. We arbitrarily determined that data had 'strong' temporal structure if the proportion was 0, 'moderate' if it was between 0 and 0.5, and 'low' if it was less than 0.5. We verified that samples with the same sampling time did not form monophyletic groups in our ML trees. Such pattern can produce false positives in the date-randomization test (i.e. incorrectly suggesting that a data set has temporal structure) Murray et al., 2015).
Regression analyses of the time-dependency of substitution rates. A key element of our study was to determine whether the time-span of sampling was associated with the evolutionary rate estimate. To test for this pattern we conducted a least squares linear regression for the genome-wide rate as a function of sampling time. Importantly, only rate estimates with strong and moderate temporal structure were included in this analysis. We used a log 10 transformation because our rate estimates span several orders of magnitude. A potential shortcoming of this analysis is that the data do not necessarily represent independent samples because some correspond to closely related lineages. This can be addressed by using phylogenetic independent contrasts, or phylogenetic generalized least squares. However, these methods require a phylogenetic tree of the evolutionary relationships of all data points that cannot be estimated for our data because the sites in different SNP alignments are not necessarily homologous. Instead, we used an approach in which the regression is conducted using a single randomly chosen data point from each species. We repeated this procedure 1000 times and verified that the range of slope estimates with random subsamples did not include zero, and we report the mean value and the corresponding confidence interval. For our regression of the rate as a function of sampling time we used a test described previously (Duchêne et al., 2014) to verify that the slope estimate was not a statistical artefact that sometimes occurs when fitting a regression for a ratio as a function of its denominator.
The linear regression method, however, does not provide a realistic description of time-dependency, which has been suggested to follow a decay curve (Ho et al., 2011;Penny, 2005). We therefore modelled the relationship between rate and time using a double exponential curve of the form r=a/ T b , where r and T correspond to the rate estimate and sampling time on log 10 scale, respectively, and parameters a and b control the asymptote and rate of decay in the function. This parameterization is similar to that proposed by O'Fallon (2010). Parameters a and b were optimized using the algorithm of Nelder & Mead (1965). To obtain a confidence interval around the parameter estimates, we conducted 100 bootstrap replicates of the rate estimates and sampling times.

Results
Measuring the extent of clock-like structure in bacterial evolution Our root-to-tip regressions revealed large differences in the degree of clock-like behaviour, with 22 of the 35 data sets having R 2 values of less than 0.5, suggesting weak clock-like behaviour, and eight with R 2 values of 0.7 or higher, suggesting stronger clock-like behaviour (Fig. 1). A data set of Staphylococcus aureus multilocus sequence type ST239 had the highest R 2 , at 0.96, with similar values observed in Vibrio cholerae (0.92), Shigella sonnei (0.89) and Shigella dysenteriae type 1 (0.88). The lowest R 2 was 1.52Â10 À3 for M. tuberculosis Lineage 2. The regression slope (rate) for M. tuberculosis Lineage 2 and Y. pestis from the second global pandemic included negative values, suggesting that these rates are either too low, or that there is extensive among-lineage rate variation, to allow reliable rate estimation.
Our comparison of molecular clock models included the strict clock, which assumes that the rate of evolution is constant across lineages, and the uncorrelated lognormal relaxed clock that treats the rate across lineages as a random variable. Most bacterial data sets favoured a relaxed molecular clock, such that there is marked rate variation among lineages (Figs S2 and S3). The date-randomization test suggested that 28 data sets had strong to moderate temporal signal (Figs 2 and S2) (defined as 5 randomizations with an HPD overlapping with the HPDs estimated with the correct tipdates). Only one (M. leprae) of the 15 bacterial species (excluding N. meningitidis, which was discarded because of the large number of recombinant sequences) analysed had no data sets displaying moderate to strong temporal signal. However, five of eight species represented by two or more data sets displayed a mix of both weak signal and strong to moderate temporal signal (Fig. 2), suggesting that a lack of temporal signal may be a property of the individual data sets rather than a true species effect.
That even data sets from slowly evolving bacteria such as M. tuberculosis, which exhibit substitution rates in the order of 10 -8 substitutions per site year À1 , possessed moderate temporal signal underlines the power of genome-scale data   Fig. 1. Regressions of the root-to-tip genetic distance (expected nucleotide substitutions per site) as a function of sampling time (year) for 35 bacterial data sets. Each point corresponds to an individual sampled genome (SNP sequence in the alignment), and the red dashed line is the linear regression using least squares; the R 2 coefficients are also shown. Shading corresponds to the degree of temporal structure according to the date-randomization test in BEAST; blue indicates strong temporal structure, while orange and red indicate moderate and low temporal structure, respectively.
to estimate evolutionary dynamics. Despite this, it is striking that some data sets, including ancient DNA samples, had very little temporal signal, such that incorporating historical DNA data is not always sufficient for reliable rate estimation. For example, previous studies have suggested no temporal structure in data sets of Y. pestis with sampling time ranges of~290 years (Bos et al., 2016) and~650 years (Wagner et al., 2014), and of M. leprae with a sampling range of almost 2000 years (Schuenemann et al., 2013).
There was large variation in rate estimates for some closely related bacteria. Our analyses included several data sets of different serovars of Salmonella enterica: Kentucky, Agona, Typhi and Paratyphi A. Interestingly, the rate estimates for the human-restricted serovars Typhi and Paratyphi A (the agents of typhoid fever) (mean rates ranging from 1.78Â10 À7 to 8.02Â10 À8 substitutions per site year À1 ) were consistently lower than those estimated for the host-generalist serovars Kentucky and Agona (5.34-3.95Â10 À7 substitutions per site year À1 ). The rate estimates for Staphylococcus aureus were also highly variable between lineages, which included CC398, type USA300 and multilocus sequence types ST22, ST93 and ST239. The estimated mean rate for the livestock-associated CC398 was 2.43Â10 À6 substitutions per site year À1 (HPD: 1.14Â10 À6 -3.86Â10 À6 ), the highest for this species. In contrast, the estimated mean rate for ST93 was 5.55Â10 À7 substitutions per site year À1 (HPD: 2.77Â10 À7 -9.60Â10 À7 ), nearly five times lower (Fig. 2).
We also investigated whether rate estimates based on rootto-tip regression were consistently biased compared to those obtained using the Bayesian approach in BEAST (Fig. 3). If the estimates from the two methods are the same, then they should fall along the line y=x when plotted against each other. Notably, most points fell above the regression line, implying that the mean Bayesian estimates (y-axis) were higher than those from the regression (x-axis). A probable cause of this pattern is that deep branches in the phylogenetic trees are over-represented in the regression method. If substitution rates do indeed vary in a timedependent manner see below), such that higher rates are observed toward the present, then rate estimates obtained using regression may exhibit a downward bias (Fig. 3). The exceptionswere three Klebsiella pneumoniae data sets (CC258), only one of which had sufficient temporal structure for reliable estimation, with a mean rate estimate using BEAST of 2.99Â10 À7 substitutions per site year À1 (HPD: 9.51Â10 À8 -5.65Â10 À7 ), while that using regression was 3.05Â10 À7 substitutions per site year À1 [95 % confidence interval (CI): 1.03Â10 À7 -5.07Â10 À7 ); and Bordetella pertussis, with a mean rate of 1.74Â10 À7 substitutions per site year À1 using BEAST (HPD: 1.53Â10 À7 -1.94Â10 À7 ), and of 2.15Â10 À7 substitutions per site year À1 (CI: 2.24Â10 À7 -2.78Â10 À7 ) using regression.

Modelling time-dependent rates of evolution
For those data sets with strong to moderate temporal structure, we investigated the relationship between evolutionary rate and sampling time, considered as the time-span between the youngest and oldest samples in each data set. Our linear regression for the rate estimates as a function of sampling time on a log 10 scale revealed a significant negative association between rate and time with slope =À0.701 (CI: À4.74 to À5.81 and P=0.0003; Fig. 4). The optimal parameterization for the decay function r=a/T b was a=À5.90 (CI: À6.39 to À5.77) and b=À0.17 (CI: À0.26 to À0.13) (Fig. 4). Importantly, the residual errors were normally distributed around the fitted curve ( Fig S4).

Discussion
We have performed the largest comparative and systematic study of evolutionary rates in bacteria to date, providing an important resource for future studies of bacterial evolution. This analysis revealed an approximately two order of magnitude range of evolutionary rates for those bacterial data sets in which there was sufficient temporal structure for rate estimation. For the most rapidly evolving bacteria analysed here, such as E. faecium, Staphylococcus aureus and A. baumannii, we estimated genome-wide rates between 10 À5 and 10 À6 substitutions per site year À1 . Not only are these rates within the range of those previously estimated for rapidly evolving bacteria (Holt et al., 2012;Ward et al., 2014) but, more strikingly, they are also similar to those of slowly evolving DNA viruses (Duchêne et al., 2014;Firth et al., 2010). However, even our highest rate estimates are lower than some reported previously, such as those for Bacterial data set Fig. 2. Bayesian estimates of genome-scale nucleotide substitution rates for all bacterial data sets. The axis for the nucleotide substitution rate is shown on log 10 scale. Circular points represent the mean rate estimate, and error bars correspond to the 95 % HPD values. Colours indicate the degree of temporal structure according to the date-randomization test as indicated in Fig. 1. For comparison, the x symbol represents the point estimates using regression, while the asterisk (*) corresponds to estimates that were negative, and are thus not shown.
Helicobacter pylori (Kennemann et al., 2011) andN. gonorrhoeae (Pérez-Losada et al., 2007), at~10 À5 and 10 À4 substitutions per site year À1 , respectively. As these species also experience high rates of recombination (Maixner et al., 2016;Yahara et al., 2016), which can disrupt temporal signal, we suggest that these unusually high estimates be treated with caution until they are verified. For example, our N. meningitidis data set exhibited extensive recombination, precluding standard phylogenetic analyses.
The lowest rate estimates obtained here, at~10 À8 substitutions per site year À1 , were also comparable to previous studies of slowly evolving bacteria, notably M. tuberculosis (Kay et al., 2015). Importantly, however, even though they were relatively low, these rates were sufficiently rapid to be accurately estimated using genomic-scale data. In contrast, rate estimates lower than about 1.5Â10 À8 substitutions per site year À1 could not be validated using our methods and hence may require data sampled over far longer time periods. However, it was also noteworthy that the M. leprae and one of the Y. pestis data sets did not have sufficient temporal structure for rate estimation despite the availability of ancient DNA. Interestingly, a recent analysis of Bronze Age strains of Y. pestis is consistent with a much higher substitution rate in this bacterium, at~10 À7 substitutions per site year À1 (Rasmussen et al., 2015), in contrast to the data presented here and previously (Cui et al., 2013;Morelli et al., 2010). Why these data sets differ so fundamentally in estimated substitution rate is clearly an important area for future study.
Nearly all bacterial species investigated here displayed genome evolution that was measurable over a period of 10-100 years. Data sets with sampling times spanning less than  Fig. 3. Estimates of genome-scale nucleotide substitution rates using a Bayesian method (BEAST) compared to those estimated via rootto-tip regression. The line represents y=x. Points that fall on the line correspond to data sets for which the mean Bayesian estimates closely match those from the regression. Points above and below the line are data sets for which the Bayesian estimate is higher or lower than that from the regression, respectively. The colour corresponds to the degree of temporal structure according to the date-randomization test as indicated in Fig. 1. 10 years were largely unreliable. Importantly, for several species the strength of temporal signal varied substantially between data sets, and a lack of temporal signal in one data set could not be taken as a general indication of overall lack of signal for the species.
Notably, our analysis reveals that evolutionary rates in bacteria display a negative relationship with sampling time, such that their rates can be considered time-dependent (Ho et al., 2011). This observation is of particular importance in the context of the analysis of genome sequences taken from individual disease outbreaks or transmission chains (Didelot et al., 2012), as the evolutionary rates measured over these very short time-scales will probably contain deleterious mutations yet to be purged by purifying selection and substitutional saturation is increasingly apparent over time. As such, these estimates are expected to be much higher than those for samples obtained over longer time-scales. For example, based on our data (Fig. 4), we would predict the evolutionary rate estimated for a given bacterial species or clade over a sampling frame of 10 years to be more than an order of magnitude higher than that estimated for the same bacteria sampled over a period of 100 years. Conversely, the longer-term (and lower) evolutionary rates of the type inferred here would tend to over-estimate the time-scale of bacterial transmission when applied to outbreak data. The relationship between evolutionary rate and sampling time can be used to assess the extent to which extrapolating rates of evolution over different time-scales will lead to a bias in estimates of divergence times. To this end, future studies should compare bacterial data sets of the same species Sampling time (years) Fig. 4. Estimates of genome-wide nucleotide substitution rates in human-associated bacterial pathogens as a function of sampling time in years. The axes are shown on a log 10 scale. The colour corresponds to the degree of temporal structure according to the date-randomization test; blue indicates strong temporal structure and orange indicates moderate temporal structure. The dashed line corresponds to the linear regression, while the solid line corresponds to the decay curve (both fitted using only the points with strong and moderate temporal structure). The grey lines represent 100 bootstrap replicates of the decay curve, and thus represent the uncertainty in the decay function.
across different sampling time-frames to more precisely determine the nature of the time-dependent curve.
Strikingly, in some cases the time-dependent pattern appears to hold within closely related bacterial lineages, such as our A. baumanii and Salmonella enterica Paratyphi A data sets. However, it is notable that both reliable rate estimates for M. tuberculosis, estimated over sampling frames of 15 and 895 years, were nearly identical. M. tuberculosis are notoriously slow growing bacteria whose generation time is orders of magnitude slower than the other bacteria analysed here. It is likely that other factors such as genome size and DNA G+C (guanine and cytosine) content also contribute to rate variation between bacteria (Rocha et al., 2006). Given the time dependency of rate estimates we observed, we suggest future studies of bacterial evolutionary dynamics would be best addressed by comparing multiple independent replicate sample sets for each species, collected over matched time-spans. Overall, our analyses show that genome evolution can now be reliably measured in bacteria and establish a genomic framework for understanding long-term evolutionary dynamics in bacteria.