Navigating the future of bacterial molecular epidemiology
Stephen Baker 1, William P Hanage 2 and Kathryn E Holt 3
Technological advances in high-throughput genome sequencing have led to an enhanced appreciation of the genetic diversity found within populations of pathogenic bacteria. Methods based on single nucleotide polymorphisms (SNPs) and insertions or deletions (indels) build upon the framework established by multi-locus sequence typing (MLST) and permit a detailed, targeted analysis of variation within related organisms. Robust phylogenetics, when combined with epidemiologically informative data, can be applied to study ongoing temporal and geographical fluctuations in bacterial pathogens. As genome sequencing, SNP detection and geospatial information become more accessible these methods will continue to transform the way molecular epidemiology is used to study populations of bacterial pathogens.

Introduction
In 1854, as a cholera epidemic ravaged London's Soho district, John Snow investigated the outbreak by plotting the location of cholera cases on a street map. From the distribution of cases, and by questioning local residents, Snow concluded that the source of the outbreak was a public water pump. This was the first epidemiological investigation to make use of maps and Snow is hailed as having founded the science of epidemiology [1]. Now, our understanding of bacteriology and our mapping capabilities have improved to the extent that we can characterise an isolate by genome sequence, and precisely locate it using global positioning system (GPS) technology.
Combining genetic, phenotypic, spatial and temporal data allows a comprehensive view of the epidemiology of bacterial pathogens and their evolution, helping to explain how virulence and other phenotypic traits evolve in bacterial species over time [2,3]. Adaptation occurs via a number of processes, including mutations, and the movement of genes among distinct lineages through recombination [4]. Both mutation and recombination produce genomic diversity that can be used to discriminate between related organisms. Multiple techniques have been employed to assay genomic differences among different lineages or clones of the same species, such as pulsed field gel electrophoresis (PFGE) and multi-locus variable number tandem repeat (VNTR) analysis (MLVA). The majority of these methods suffer from issues of portability and a limited understanding of the processes through which variation arises. By contrast, DNA sequence-based techniques provide robust and portable differentiation within bacterial populations, and can be used to infer their phylogenetic history.
Multi-locus sequence typing (MLST) is a well-established method to study bacterial populations exhibiting sufficient nucleotide diversity in a small number of genomic loci [5]. Databases containing MLST and associated data from hundreds or thousands of isolates can be accessed via the internet (http://www.mlst.net/ and http://pubmlst.org/) [6]. While MLST has provided numerous insights into the epidemiology and population genetics of bacteria, technological advances in DNA sequencing (e.g. 454, Illumina/Solexa and ABI SOLiD platforms) allow the rapid sequencing of entire bacterial genomes [7]. As a result, sequence-based analysis of bacterial populations exhibiting levels of nucleotide diversity too low for MLST has become possible [8]. Tagged genomic libraries can be used to generate sequence data from multiple isolates in a single assay, providing sufficient information to discover single nucleotide polymorphisms (SNPs), small insertions or deletions (indels) and variation in gene content in multiple bacterial strains over a short time frame.
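The idea of discovering SNPs by comparing sequences from multiple isolates can be illustrated with a minimal Python sketch. It assumes the sequences have already been aligned to equal length; the function name `find_snp_sites`, the isolate labels and the toy sequences are hypothetical and are not taken from any of the cited studies. Real pipelines work from read mappings with quality filtering, which this sketch does not attempt.

```python
from typing import Dict, List, Tuple


def find_snp_sites(alignment: Dict[str, str]) -> List[Tuple[int, Dict[str, str]]]:
    """Return (position, {isolate: base}) for every variable alignment column.

    Positions are 0-based. Columns containing a gap ('-') or any character
    other than A, C, G, T are skipped, so indels are not reported here.
    """
    names = sorted(alignment)
    lengths = {len(alignment[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")

    snps = []
    for pos in range(lengths.pop()):
        column = {name: alignment[name][pos].upper() for name in names}
        bases = set(column.values())
        # A SNP column has only unambiguous bases and more than one of them.
        if bases <= set("ACGT") and len(bases) > 1:
            snps.append((pos, column))
    return snps


# Hypothetical toy alignment of four isolates (not real data).
toy_alignment = {
    "isolate_A": "ACGTACGTAC",
    "isolate_B": "ACGTACCTAC",
    "isolate_C": "ACATACCTAC",
    "isolate_D": "ACGTACGT-C",
}

snp_sites = find_snp_sites(toy_alignment)
for pos, column in snp_sites:
    print(pos, column)
```

In this toy alignment the variable columns are positions 2 and 6; the gapped column at position 8 is ignored. The per-isolate bases at the SNP positions, concatenated, give the kind of SNP profile that haplotype-based analyses work from.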
It is recognised that the distribution of some bacteria may be related to geographical patterns, such as climatic zones and movement of human populations [9,10,11]. In addition, the spatial distribution of genetic variants can shed light on pathogen evolution and transmission. This type of analysis is aided by GPS devices that can be used to record the coordinates of relevant locations. The spatial distribution of bacterial pathogens can be considered at a local level (e.g. streets, hospital wards), at regional level (cities, provinces) or even globally. In a hospital setting, the appearance of an identical genotype in several locations may indicate nosocomial transmission, but in a community setting, the simultaneous appearance of an identical genotype in widely dispersed locations may be a warning of an imminent epidemic.
Figure 1. Google map and haplotype map outlining the circulation of multiple Salmonella Typhi haplotypes in a small urban area of Jakarta. The SNP typing of 54 S. Typhi strains from a single location in Jakarta identified several different haplotypes circulating within a two-year period [17]. (a) A minimum spanning tree showing relationships between the eight different S. Typhi haplotypes (e.g. H45) identified. The tree shows the overall population structure defined by the SNPs targeted in the assay, defined in ref. [14]. Coloured circles correspond to haplotypes found in the sample (colour corresponds to the colour scheme in part b). Grey circles are haplotypes that were not identified among isolates from Jakarta. H45 is the ancestral group and the red circle denotes Salmonella Paratyphi A strains.
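The spatial reasoning described above, in which an identical genotype appearing in widely dispersed locations is treated as a possible warning sign, can be sketched in Python. The sketch groups GPS-tagged cases by genotype and reports the largest great-circle (haversine) distance within each group. The genotype labels, the coordinates and the function names are hypothetical, and the choice of maximum pairwise distance as a dispersion measure is an assumption made for illustration rather than a method from the cited work.

```python
import math
from collections import defaultdict
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def max_spread_by_genotype(cases):
    """Map each genotype to the largest pairwise distance (km) among its cases.

    `cases` is an iterable of (genotype, latitude, longitude) tuples.
    A genotype seen only once has a spread of 0.0.
    """
    by_genotype = defaultdict(list)
    for genotype, lat, lon in cases:
        by_genotype[genotype].append((lat, lon))

    spread = {}
    for genotype, points in by_genotype.items():
        spread[genotype] = max(
            (haversine_km(*p, *q) for p, q in combinations(points, 2)),
            default=0.0,
        )
    return spread


# Hypothetical cases: (genotype label, latitude, longitude).
toy_cases = [
    ("G1", 0.0, 0.0),
    ("G1", 0.0, 1.0),   # one degree of longitude away, on the equator
    ("G2", 10.0, 20.0),
    ("G2", 10.0, 20.0), # same location as the previous G2 case
    ("G3", -5.0, 30.0), # single case
]

spread = max_spread_by_genotype(toy_cases)
for genotype in sorted(spread):
    print(genotype, round(spread[genotype], 1), "km")
```

A surveillance system would compare such a spread against a threshold suited to the setting (a hospital ward versus a city), and would also take sampling dates into account; both are left out of this sketch.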

The phylo-geographical distribution of Salmonella Typhi
Defining the population structure of Salmonella Typhi (S. Typhi), the causative agent of the human-restricted disease typhoid fever [12], has historically been particularly challenging. Typhoid is common in parts of Asia, South America and Africa, particularly in densely populated areas with poor sanitation. S. Typhi is genetically monomorphic, rendering MLST largely uninformative [13]. However, an approach that studied variation at 200 loci identified sufficient SNPs to define a minimum spanning tree containing 80 distinct haplotypes [14]. This comprehensive study, consisting of data from 105 strains isolated over 84 years on three continents, identified remarkable homogeneity with only very limited phylogeographic signal. Furthermore, there was evidence for the persistence of multiple haplotypes in a single country over decades, indicating a stable population rather than clonal replacement by successively better-adapted lineages. Yet, there was evidence of a recent clonal expansion of a specific haplotype (H58) in Asia. Molecular typing of Vietnamese S. Typhi isolates suggests that sensitive isolates have been replaced by isolates predominantly of the H58 haplotype, which is frequently associated with multiple drug resistance (MDR), and shows evidence of microevolution and acquired resistance mutations within this emerging clone [15,16].
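To make the notion of a minimum spanning tree of haplotypes concrete, the following Python sketch links SNP haplotypes so that the total number of SNP differences along the tree is minimised. It uses Prim's algorithm on Hamming distances, which is one standard way to build such a tree and is not necessarily the procedure used in ref. [14]. The haplotype labels and SNP profiles are hypothetical toy data.

```python
from typing import Dict, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length SNP profiles differ."""
    if len(a) != len(b):
        raise ValueError("SNP profiles must have the same length")
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(haplotypes: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Build a minimum spanning tree with Prim's algorithm.

    Returns a list of (haplotype_u, haplotype_v, snp_distance) edges.
    Ties are broken by haplotype name, so the result is deterministic.
    """
    names = sorted(haplotypes)
    if not names:
        return []

    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Cheapest edge joining a haplotype in the tree to one outside it.
        dist, u, v = min(
            (hamming(haplotypes[u], haplotypes[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        edges.append((u, v, dist))
        in_tree.add(v)
    return edges


# Hypothetical haplotypes defined by the bases at five SNP loci.
toy_haplotypes = {
    "root": "AAAAA",
    "h1": "CAAAA",   # one SNP from root
    "h2": "CAGAA",   # one further SNP from h1
    "h3": "AAAAT",   # one SNP from root, on a separate branch
}

mst_edges = minimum_spanning_tree(toy_haplotypes)
for u, v, dist in mst_edges:
    print(f"{u} - {v}: {dist} SNP(s)")
```

For the toy data every edge in the tree spans a single SNP, giving a total tree length of three. This quadratic-time search per step is adequate for tens of haplotypes; larger collections would call for a priority queue or a graph library.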
A high-throughput SNP detection platform was used to identify S. Typhi haplotypes circulating in an urban area of Jakarta [17] (Figure 1). The S. Typhi strains were isolated as part of a case/control study to identify risk factors for typhoid [18]. The SNP profiling of 140 S. Typhi strains identified nine haplotypes circulating in the Indonesian archipelago over more than 30 years, with eight detected in a single suburb over two years. One specific haplotype of S. Typhi was dominant and uniquely associated with an atypical flagella antigen [19]. These findings show that marked genotypic and phenotypic differences can exist within a relatively monomorphic pathogen population within a limited geographical area over a short time frame.
An additional 2000 SNPs have since been identified within the S. Typhi population, providing additional loci for more refined SNP typing of clinical isolates [20]. Whole-genome sequencing of 19 strains, chosen to be representative of the global S. Typhi population, also detected other forms of genetic variation (indels) that could potentially be used as markers for studying S. Typhi diversity. The development and use of a custom SNP array (containing over 1500 SNP loci) for S. Typhi using the GoldenGate platform (Illumina) provided greater discriminatory power than any previous study of S. Typhi. Application of this assay to S. Typhi populations in Nairobi, Kenya and Kathmandu, Nepal again showed multiple S. Typhi haplotypes co-circulating in a single city [21,22]. In both cities, however, a single haplotype was dominant, supporting the notion that clonal expansion, rather than successive clonal replacement, is the ongoing force in the population of this pathogen. Our current sequencing and SNP typing work, which relates specific haplotypes to the spatial and temporal distribution of typhoid cases in an area of Kathmandu, is expected to help elucidate specific transmission routes and microevolution within a highly localised area.

The global dissemination of Staphylococcus aureus: unraveling tangled transmission routes
S. aureus is a major human pathogen in both hospital settings and the community, where it is a leading cause of skin and soft tissue infections. In most cases, S. aureus is carried asymptomatically by humans (and domestic animals, in which it can also be an important pathogen in some contexts [23]). In healthcare settings the circulation of methicillin-resistant strains (MRSA) is a constant challenge for infection control, and the emergence of MRSA as a cause of severe disease among healthy adults in the community is a cause for considerable concern [24].
The mainstays for studying the molecular epidemiology of S. aureus have been MLST [25] and staphylococcal protein A (spa) typing [26]. For MRSA, these two methods are augmented by sequencing of the staphylococcal cassette chromosome mec (SCCmec), which carries the methicillin resistance gene mecA [27]. The MLST system has been enhanced by an interface with simple mapping data, where users can add and access geographical information (http://maps.mlst.net/view_maps.php) (Figure 2). The information gathered from MLST indicates that MRSA has evolved multiple times, leading to the circulation and predominance of particular clonal complexes and sequence types, for example ST5, ST225 and ST239 [28]. However, the repeated emergence of resistance on these backgrounds means that they are heterogeneous with respect to the type of spa gene and the SCCmec they carry. Nübel et al. discovered 156 bi-allelic polymorphisms (BiPs) in 138 global ST5 MRSA isolates [29]. These BiPs defined 89 haplotypes, which clustered according to the continent of isolation, but not the spa typing group. Furthermore, sublineages were found to be locally clustered. These data suggest that the global dissemination of MRSA is restricted and that locally dominant MRSA strains may be the result of SCCmec transfer into a strain of S. aureus that is pre-adapted, already exhibiting superior fitness.
A close relative of the well-described ST5 MRSA clone, namely ST225, has recently become increasingly prevalent in healthcare settings in central Europe [28]. The spatiotemporal dynamics of the spread of ST225 have been studied via mutation detection at 269 loci in a collection of 73 ST225 strains from Europe and the United States [30]. The ST225 MRSA strains demonstrated remarkable uniformity, with only 36 haplotypes (resulting from 48 BiPs) identified. This lack of diversity implied a recent common ancestor. A reconstructed ancestral scenario suggested the spread of this strain from Germany across central Europe, with the eventual expansion of the dominant clade from 1995 onwards [30]. This work illustrates the potential of combined sequence and spatial analysis to reconstruct strain dissemination events in the recent past.
ST239 is another widely dispersed lineage of MRSA, common in mainland Asia, South America and parts of Eastern Europe [31,32]. The genomes of 63 globally distributed ST239 isolates were recently sequenced using multiplex Illumina/Solexa sequencing [33]. SNPs identified among the 63 genomes revealed a strong phylogeographic signal, with highly similar sequences identified in the same geographic area. A close relationship was noted between strains from Portugal and South America, which is consistent with the historical and modern links between these two regions. In some cases, strains did not cluster by geography, and were considered to represent intercontinental transfer, including evidence of a single transmission event from Southeast Asia initiating an outbreak in the United Kingdom. This study also indicated that such a method may be suitable for studying local transmission events: five potentially linked strains, isolated from the same Thai hospital over 13 weeks, differed by only 14 individual nucleotide changes.
A glimpse of the future of bacterial molecular epidemiology may be offered by a recent study of the geographic distribution of differing MSSA and MRSA clones in Europe [34]. This work integrated data from 450 hospitals spanning 26 European countries and provided a snapshot of the current S. aureus strains circulating across Europe, identifying dominant spa types that form distinct geographical groups when compared using spatial statistics. Additionally, it introduced a public Web-based mapping and genotyping tool that could be applied to other organisms (http://www.spatialepidemiology.net/) (Figure 2).
This online tool has also been integrated with a smartphone application (EpiCollect), making the collection and interrogation of epidemiological data in the field a reality [35]. The coordination of such a large network will act as a blueprint for conducting similar investigations and outlines an obvious direction for microbiological reference laboratory networks and surveillance systems.

Conclusions
We have described several examples of recent work showing the potential of high-resolution genome sequencing for the study of the evolution of bacterial pathogens. This work has been enhanced by combining genomic data with epidemiological and geographical information. A general observation is that additional metadata (such as disease syndrome, antimicrobial resistance phenotype and isolation date) are an increasingly important element in molecular-based projects.
The examples we have considered are drawn from relatively clonal pathogens, meaning that recombination rates within these species are low. This is an important caveat as recombination can obscure phylogenetic signal, and so methods that rely on a robust phylogeny may be compromised. Recombination also has the potential to introduce phenotypic traits into different genetic backgrounds (e.g. SCCmec elements). One of the interesting questions that remains is the role of local clonal expansion, and the extent to which this is eroded (or perhaps facilitated) by the widespread movement of selected genes among lineages.
An understanding of the dynamics of bacterial populations can help to determine appropriate interventions, including the use of vaccines, therapeutics, public health measures and ongoing pathogen surveillance. The combination of new technologies that yield increasingly accurate, high-resolution spatial and genetic data on large populations of bacteria promises to extend our understanding of the dynamics and transmission of pathogenic bacteria even further.