Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study

Summary Background Tuberculosis incidence in the UK has risen in the past decade. Disease control depends on epidemiological data, which can be difficult to obtain. Whole-genome sequencing can detect microevolution within Mycobacterium tuberculosis strains. We aimed to estimate the genetic diversity of related M tuberculosis strains in the UK Midlands and to investigate how this measurement might be used to investigate community outbreaks. Methods In a retrospective observational study, we used Illumina technology to sequence M tuberculosis genomes from an archive of frozen cultures. We characterised isolates into four groups: cross-sectional, longitudinal, household, and community. We measured pairwise nucleotide differences within hosts and between hosts in household outbreaks and estimated the rate of change in DNA sequences. We used the findings to interpret network diagrams constructed from 11 community clusters derived from mycobacterial interspersed repetitive-unit–variable-number tandem-repeat data. Findings We sequenced 390 separate isolates from 254 patients, including representatives from all five major lineages of M tuberculosis. The estimated rate of change in DNA sequences was 0·5 single nucleotide polymorphisms (SNPs) per genome per year (95% CI 0·3–0·7) in longitudinal isolates from 30 individuals and 25 families. Divergence is rarely higher than five SNPs in 3 years. 109 (96%) of 114 paired isolates from individuals and households differed by five or fewer SNPs. More than five SNPs separated isolates from none of 69 epidemiologically linked patients, two (15%) of 13 possibly linked patients, and 13 (17%) of 75 epidemiologically unlinked patients (three-way comparison exact p<0·0001). Genetic trees and clinical and epidemiological data suggest that super-spreaders were present in two community clusters. Interpretation Whole-genome sequencing can delineate outbreaks of tuberculosis and allows inference about direction of transmission between cases. The technique could identify super-spreaders and predict the existence of undiagnosed cases, potentially leading to early treatment of infectious patients and their contacts. Funding Medical Research Council, Wellcome Trust, National Institute for Health Research, and the Health Protection Agency.

We first searched the archive database for all isolates within the above groups. From these groups, specific isolates were selected for sequencing as below. Where initially selected samples could not be found or did not grow, replacements were selected in the same way as the initial samples. In some cases the freezer location was not listed in the database so the isolate could not be retrieved; in other cases there was no vial present in the specified freezer location; other vials contained non-viable bacteria -unsurprising given the age of some of the samples.
(i) Cross-sectional paired isolates were chosen at random, until 50 pairs of DNA preparations ready for sequencing had been obtained (in total 74 pairs selected for locating in freezers and growing). As no data were available on likely within-host diversity at the time the choices were made, the target sample size was 50 to ensure that if no single nucleotide polymorphisms (SNPs) were observed between any pair, then the upper 97.5% confidence limit around the observation of 0% with ≥1 SNP was 7%.
(ii) Longitudinal samples within individuals were selected to maximize the time period between first and last isolate for each patient, as longitudinal diversity over time was considered to be most relevant to onward transmission. Any intervening samples were also included for sequencing from these patients. No data were available on likely diversity over time when the samples were chosen, so the arbitrary decision was made to sequence 100 isolates within this group, a similar number to those previously used to estimate molecular clock rates (Didelot X, Eyre D, Cule M et. al. Microevolutionary analysis of Clostridium difficile genomes to investigate transmission, Genome Biology 2012 (in press).
(iii) and (iv) We aimed to sequence all isolates from all household outbreaks known to the surveillance laboratory (total 93 isolates) and from 10 reasonably sized (6-47 patients) MIRU-VNTR-based community clusters identified by the public health teams as containing some cases where direct case-to-case transmission was supported and others where it was uncertain (total 207 isolates). (Note: in other groups (i), (ii), and (iii), 46 isolates were sequenced from 18 patients who also belonged to a very large MIRU-VNTR-defined cluster containing >280 patients (reference 27 in main paper); these 18 patients were analysed as an 11 th cluster but selection for sequencing was not based on this cluster membership).
Of note, both failure to be located and failure to grow were strongly related to duration of storage. As expected, the longer the duration of storage the more likely an isolate could not be located, and, if it was located, that it failed to grow. For example, among the cross-sectional isolates the mean time since original isolation among the missing isolates was 8 years and among the isolates that failed to re-grow it was 9 years. This contrasted to 5 years for the successfully cultured isolates. Among the longitudinal isolates the mean time was 10 years for missing isolates and 8 years for those that failed to re-grow. It was 6 years for successfully cultured isolates. For groups (i), (ii) and (iii) missing data from one isolate sometimes meant that other sequences were also excluded (eg if one of two longitudinal isolates from a patient in (ii) could not be located/failed to grow, then the patient could not contribute any data to (ii)).
All missing data is assumed to be completely at random in the analysis, with sequenced cases assumed to represent the underlying population. This assumption was felt to be reasonable by the HPA referral laboratory as they have not seen any marked variation in type of cases over the last decade, so length of storage is therefore not a plausible confounder. The global lineages of sequenced strains reflect those prevalent in the Midlands, as would be expected from essentially random sampling.
In real-world, real-time settings, detecting outbreaks is unlikely to suffer from the problem of missing, unculturable or un-sequencable samples. No isolates will fail to grow as the sequenced cases will all be, by definition, culture positive. Growth failures in our study were strongly related to previously cultured isolates dying in the freezer or as a consequence of freeze-thawing. Sequencing failures will undoubtedly continue to occur but are much more likely when there is little DNA input (a consequence of poor growth) or where there is contamination (not a problem that is unique to sequencing). S6. Summary of colour-coded epidemiological links within 11 community clusters: green = known link; orange = possible link; red = no known link. Each patient is depicted by a study number within a red node. Edges linking the nodes are annotated by the minimum SNPs/months between patient isolates.

S2. Table of all isolates and patients appearing in each collection (C-1 = cross-sectional isolates from individuals, C-2 = longitudinal isolates from individuals, C-3 = longitudinal isolates from households, C-4 = MIRU-VNTR defined community clusters
The networks are an attempt to summarise the links between patients in each cluster in a way that maximises epidemiological and genetic proximity. To avoid double counting we restricted the number of connecting edges to the number of nodes minus one. Methods used to produce network diagrams were as follows: For each cluster we started with the first patient to be diagnosed. We sought to draw a link (an 'edge') to another patient with a 'known' epidemiological link. If there was >1 patient to choose from, we chose the patient with an isolate closest in SNPs, and where this failed to identify a unique edge, we chose the patient closest in time (as judged by date of isolation of sample). If no 'known' epidemiological link existed, we sought 'possible' linkage before 'no known' linkage, in each instance prioritising by SNPs and time as described. The second edge and all subsequent edges were determined by the same rules until the network was complete.