Phylogenomics of Brazilian epidemic isolates of Mycobacterium abscessus subsp. bolletii reveals relationships of global outbreak strains

Rapidly growing, non-tuberculous mycobacteria (NTM) in the Mycobacterium abscessus (MAB) species are emerging pathogens that cause various diseases including skin and respiratory infections. The species has undergone recent taxonomic nomenclature refinement, and is currently recognized as two subspecies, M. abscessus subsp. abscessus (MAB-A) and M. abscessus subsp. bolletii (MAB-B). The recently reported outbreaks of MAB-B in surgical patients in Brazil from 2004 to 2009 and in cystic fibrosis patients in the United Kingdom (UK) in 2006 to 2012 underscore the need to investigate the genetic diversity of clinical MAB strains. To this end, we sequenced the genomes of two Brazilian MAB-B epidemic isolates (CRM-0019 and CRM-0020) derived from an outbreak of skin infections in Rio de Janeiro, two unrelated MAB strains from patients with pulmonary infections in the United States (US) (NJH8 and NJH11) and one type MAB-B strain (CCUG 48898) and compared them to 25 publically available genomes of globally diverse MAB strains. Genome-wide analyses of 27,598 core genome single nucleotide polymorphisms (SNPs) revealed that the two Brazilian derived CRM strains are nearly indistinguishable from one another and are more closely related to UK outbreak isolates infecting CF patients than to strains from the US, Malaysia or France. Comparative genomic analyses of six closely related outbreak strains revealed geographic-specific large-scale insertion/deletion variation that corresponds to bacteriophage insertions and recombination hotspots. Our study integrates new genome sequence data with existing genomic information to explore the global diversity of infectious M. abscessus isolates and to compare clinically relevant outbreak strains from different continents.


Introduction
Rapidly growing, nontuberculous mycobacteria (NTM) species are opportunistic pathogens that cause a range of diseases including lung (Griffith, 2010) and skin (Khan and Khakoo, 2011) infections, and are thought to be primarily acquired from the environment (Primm et al., 2004). NTM often infect immune-compromised patients and are sometimes associated with nosocomial infections due to their occurrence in tap water and ability to form biofilms (Cardoso et al., 2008;Falkinham et al., 2001;Feazel et al., 2009). For example, several outbreaks of post-surgical infections caused by a subtype of Mycobacterium abscessus were reported in various regions of Brazil from 2004 to 2009 (Cardoso et al., 2008;Duarte et al., 2009;Monego et al., 2011;Viana-Niero et al., 2008). These emerging Brazilian outbreak strains were thought to be related to the selection and spread of a particularly virulent epidemic isolate named BRA100, showing a high level of resistance to glutaraldehyde (GTA), the disinfectant used in hospital sites that had reported cases (Duarte et al., 2009;Leao et al., 2009;Shang et al., 2011). The GTA-resistant BRA100 epidemic isolate, CRM-0019, from Rio de Janeiro showed increased virulence in macrophages and mice compared to a GTA-susceptible reference clinical isolate CIP108297 suggesting that BRA100 may represent a particularly invasive and pathogenic strain (Shang et al., 2011). Whether differences in the colonial morphotype of BRA100 isolates relative to CIP108297 (Shang et al., 2011) and, therefore, cell envelope lipid composition (Howard et al., 2006) may account for this difference in pathogenicity is currently under investigation.
The current era of bacterial genome sequencing by next generation sequencing methods provides the opportunity to evaluate genetic diversity at a much finer resolution than traditional gene sequencing. Recently, whole genome sequencing and large-scale variant analyses resolved the genetic relationships of MAB infections in a cystic fibrosis (CF) clinic in the United Kingdom (UK) from 2006 to 2012 (Bryant et al., 2013). Though both MAB-A and MAB-B were detected, the genomic composition of a subset of MAB-B strains provided the first suggestive evidence of person-to-person transmission and infection of MAB (Bryant et al., 2013). Building on these and other studies, we hypothesized that an investigation of MAB genomic diversity would reveal relationships of strains from geographically different locations and increase our understanding of clinically relevant MAB phylogeny. To this end, we sequenced the genomes of five diverse strains of MAB including two Brazilian epidemic isolates derived from patients with skin infections in Rio de Janeiro (CRM-0019 and CRM-0020), two strains derived from patients with pulmonary infections (NJH8 and NJH11) in the United States (US) and one European type strain originally derived a patient with a pulmonary infection (CCUG 48898) . Together with publically available genomic data, we use an integrative phylogenomics approach to compare our newly sequenced genomes to 25 additional MAB genomes including representative isolates from the recent UK outbreak and unrelated strains from the United States, China, Malaysia and Europe to explore the diversity of infectious MAB isolates in a global context.

Bacterial strains sequenced in this study
Five strains of MAB were sequenced in this study, including two Brazilian MAB-B strains, CRM-0019 and CRM-0020 that were associated with an epidemic of post-surgical, cutaneous infections in Rio de Janeiro, Brazil in 2006. These strains were isolated from soft tissue biopsies and were identified as having BRA100 pulse field gel electrophoresis (PFGE) molecular fingerprints (Duarte et al., 2009). Two additional strains were isolated from the sputa of US patients referred to National Jewish Health in Denver, CO, including NJH8 that was derived from a patient with a pulmonary NTM infection in 2011 and was identified as MAB-A by rpoB sequencing, and NJH11 that was derived from a patient with cystic fibrosis and a pulmonary NTM infection in 2009, and was identified as MAB-B (formerly M. abscessus subsp. massiliense) by rpoB sequencing. The type MAB-B strain, CIP108297 T /CCUG 48898, that was originally isolated from the sputum of a patient with hemoptoic pneumonia  and which has an existing draft genome (Tettelin et al., 2012) was also re-sequenced in this study.
Mycobacterial cultures were grown in Middlebrook 7H9-OADC broth supplemented with 0.05% Tween 80 or on 7H11-OADC agar, and genomic DNA was extracted as previously described (Pelicic et al., 1997). Genomic sequencing libraries for the MAB strains were prepared using standard protocols and sequenced on Life Technologies SOLiD3 or SOLiD5 platforms (Life Technologies Corporation, Carlsbad, CA) (Table 1). DNA from two strains, CRM-0019 and CCUG 48898, were sequenced as both paired-end and fragment libraries.

Taxon sampling, variant analyses and phylogenomics
For genomic comparisons, we selected 25 representative MAB isolates with publically available draft genomes or next generation re-sequencing data including strains from Brazil, China, Malaysia, the United Kingdom, the United States, and type strains originally derived in Europe (Table 2). For MAB strains with publically available draft genomes, we performed multi-genome alignments of each draft genome to the complete reference genome of the MAB-A strain ATCC 19977 (Ripoll et al., 2009). Whole genome alignments and single nucleotide polymorphism (SNP) identification were performed with the progressiveMAUVE algorithm in Mauve 2.3.1 (Darling et al., 2010). For strains with nextgeneration re-sequencing data, sequence reads were mapped to the complete genome of ATCC 19977 with either Lifescope 2.1 software for SOLiD data (Life Technologies Corporation, Carlsbad, CA) or Bowtie software (Langmead et al., 2009) for Illumina data (Bryant et al., 2013). SNPs were called using the SAMtools pileup program (Li et al., 2009) and were filtered using a custom Perl script using the following parameters: minimum SNP quality score of 20, minimum of 10× read depth, the majority of base calls support the variant base, and less than 25% of variant calls occurring at the beginning or end of the fragment reads.
After extracting all variable sites, indels were removed, and a genotype matrix was created for SNP sites in which at least one strain differed from the reference genome, and for which high quality variant and/or reference calls were available for all strains. Base calls were then concatenated and sequences were compared using Neighbor-joining with 100 bootstrap replicates in MEGA 5.1 (Tamura et al., 2011).

Genome coverage analysis of M. abscessus subsp. bolletii outbreak strains
For six MAB-B epidemic or outbreak strains (CRM-0020, 19a,19b,2f,28f) and two unrelated strains with next generation sequence data (CCUG 48898 and NJH11), sequence reads were mapped to the complete genome of the MAB-B BRA100 strain, GO-06 (Raiol et al., 2012) as described above. Sequence coverage values were estimated with a custom perl script that counts all mapped reads in sliding, non-overlapping 1Kb windows across the entire GO-06 genome (5068 total windows). A matrix of read counts for all strains was converted to z-scores, and normalized read counts were clustered by hierarchical clustering with a Spearman correlation distance metric and average linkage with the Multiple Experiment Viewer (MeV) Java package (Saeed et al., 2003).

Phylogenomics of M. abscessus outbreak and globally diverse strains
After reconstructing the phylogeny using 27,598 concatenated SNPs ( Figure 1A), the phylogenomic relationships between MAB strains are congruent with the topology of recently reported phylogenies in the MAB group (Bryant et al., 2013;Heydari et al., 2013;Macheras et al., 2011;Zelazny et al., 2009). Consistent with previously reported relationships, the three formerly described MAB subspecies are monophyletic on the tree and, as expected, NJH8 and NJH11 cluster with the MAB-A and MAB-B (formerly M. abscessus subsp. massiliense) clades, respectively. Although sampling within any locality was not exhaustive, our analyses revealed significant global genomic diversity within MAB. In general, we observed greater genetic diversity within MAB-B compared to MAB-A. A majority of SNPs vary within MAB-B (22,882/27,598=82.9%) compared to the SNP variation observed within MAB-A (3,507/27,598=12.7%). When strictly comparing strains belonging to MAB-B, formerly M. abscessus subsp. massiliense, a majority of SNPs are also variable (17,994/27,598=65.2%). Contrastingly, the SNP variation observed within strains formerly recognized as M. abscessus subsp. bolletii (1,457/27,598=5.3%) is much lower than other MAB subspecies.
A small number of SNPs (76/27,598=0.27%) are variable within a clade composed almost entirely of previously reported MAB-B epidemic or outbreak strains ( Figure 1B). Interestingly, however, the BRA100 epidemic isolates from the Brazilian states of Rio de Janeiro (CRMs) and Goias (GO-06) are not monophyletic. In fact, CRM-0019 and CRM-0020 are more closely related to MAB-B strains isolated in the recently reported UK outbreak in CF patients than they are to the BRA100 outbreak strain GO-06 (Raiol et al., 2012), consistent with previously observed phylogenomic relationships between UK strains and GO-06 (Bryant et al., 2013). A comparison of variable sites within Brazilian strains identified 75 SNPs, all of which differentiate CRM-0019 and CRM-0020 from GO-06 suggesting a low level of variation between epidemic strains from distinct states in Brazil.
Variant analyses between outbreak strains from different geographic locations revealed only one SNP between strains from the Brazilian (CRM-0019 and CRM-0020) and UK (19a,19b,2f, and 28f) MAB-B outbreaks. Additionally, our analyses suggest that strains M18 and 47J26 from Malaysia and the UK, respectively, which were not implicated in previous outbreaks are nearly identical to the recent outbreak strains as they have no SNPs compared to the UK strains and only one SNP compared to the Brazilian CRM strains. This lack of variation suggests that a recent circulating strain of MAB-B likely emerged in the Brazilian and UK outbreaks, and further suggests this strain may be emerging in another global location. With CRM-0019, CRM-0020 and GO-06 each isolated in 2006;19a, 19f and 47J26 isolated in 2009;and 28f isolated in 2010; the isolation dates and phylogenetic pattern collectively suggest the Brazilian MAB-B outbreak as ancestral to the subsequent UK CF clinic MAB-B outbreaks.
The lack of difference between strains from disparate outbreak localities is surprising especially since the CRM strains were derived from cutaneous infections and the UK strains from pulmonary samples. More interesting is the Malaysian strains (M154, M159 and M172) isolated during the same timeframe as the Brazilian and UK outbreaks strains (Table  2) are genetically diverse, and for the most part genetically distinct from the Brazilian and UK outbreak strains. The US-derived MAB-B strains are also genetically different from the Brazilian, UK and Malaysian strains as a comparison of NJH11 and 5S0304 to the MAB-B outbreak strains revealed relatively high SNP variation (12144/27,598=44.0%).

Large-scale deletions and gene content differences among MAB-B outbreak strains
Given the striking similarities in SNP profiles of MAB-B isolates from geographically distinct outbreaks, we further explored large-scale genomic differences that would not be detected by the SNP analysis. First, we analyzed total genome coverage in six outbreak and two unrelated MAB-B strains compared to the complete genome of the BRA100 strain, GO-06 (Raiol et al., 2012) (Table 3). All samples had substantial sequencing depth (from 57X to 195X), yet the genome mapping analysis showed that only 89.4 to 94.2% of the GO-06 genome is represented in the other MAB-B strains. This suggests that 5.8 to 10.6% of the GO-06 genome is absent in one or more of the other strains, even those isolated from the neighboring region of Rio de Janeiro. This is consistent with genome differences previously described through DNA-DNA hybridization experiments of diverse MAB strains ) and comparable to differences reported between the genomes of pathogenic strains of Helicobacter pylori (Alm et al., 1999) and Streptococcus agalactiae (Tettelin et al., 2005).
To test the hypothesis that the BRA100 epidemic and UK outbreak strains share similar large-scale deletion patterns, we calculated read coverage values for all strains in sliding 1Kb windows across the entire GO-06 genome (Figure 2). We observed overlapping deletions in some, but not all, regions including a 80Kb deletion that is shared by the CRM and UK strains, but not by NJH11, CCUG 48898 and GO-06 (Figure 2A: X). This region is consistent with a phage insertion as it is flanked by integrase genes and has 111 predicted open reading frames (ORFs) that are primarily annotated as hypothetical or virus-related (Juhas et al., 2009). A different 60Kb deletion that is shared by the CRM strains and CCUG 48898 shows differential mapping patterns in the UK strains and NJH11 (Figure 2A: Y). This region has 64 predicted ORFs including two transposases, a serine-family recombinase and an integrase and is likely a hotspot for homologous recombination (Juhas et al., 2009). A cluster analysis of genome-wide deletion patterns compared to GO-06 revealed that the outbreak strains cluster largely by geographic location ( Figure 2B). This suggests that, though highly similar in their core genomes, the CRM epidemic and UK outbreak strains are evolving through local horizontal gene transfer events.

Conclusions
Our study explores the phylogenomic relationships of globally diverse MAB isolates including five newly sequenced strains and 25 strains with publically available genomes. Genome wide SNP analysis revealed three distinct subgroups consistent with previous subspecies classifications. We observed more SNP diversity within the formerly recognized subspecies M. abscessus subsp. massiliense than within MAB-A or the formerly recognized subspecies M. abscessus subsp. bolletii. Moreover, our results suggest that the Brazilian BRA100 epidemic isolates from Rio de Janeiro (CRM-0020 and CRM-0019) and the state of Goias (GO-06) belong to a monophyletic clade that also includes representative strains from a recent UK CF clinic outbreak (19a, 19f, 2f and 28f), a single UK CF patient infection (47J26) and one Malaysian strain (M18). In contrast, the CRM epidemic strains show significant variation compared to other MAB-B strains from the US, Malaysia and Europe. Though the CRM and UK strains show considerable similarities in core genome SNPs suggesting a widespread epidemic clone, large-scale variant analysis revealed geographicspecific deletions suggesting independent evolution of outbreak isolates within each outbreak locale. Whether the CRM strains pose a risk of patient-to-patient transmission as suggested for the UK outbreak strains (Bryant et al., 2013) is unclear, as GTA resistance remains the most likely explanation for the rapid transmission of epidemic BRA100 strains across Brazil (Duarte et al., 2009). It is important to underscore the genetic similarities between the sequenced Brazilian BRA100 epidemic and UK outbreak strains. Isolates sampled from multiple outbreak locations on different continents showing no genetic divergence can be caused by: i) a lack of genetic variation in the species; ii) isolates stochastically sharing genetic identity by state; and iii) the possibility of a globally circulating, infectious strain (Achtman, 2008). The infectious nature of the BRA100 epidemic isolates and recent research into the epidemiology of MAB in CF patients warrants these possibilities to be investigated further to better characterize the epidemiological impact.  Large-scale genomic variation among MAB-B outbreak strains. Next-generation sequence reads for six outbreak strains (2f, 19a, 19b, 28f, CRM-0019, CRM-0020) and two epidemiologically unrelated strains (CCUG 48898 and NJH11) were mapped to the complete genome of the BRA100 strain, GO-06. Normalized read coverage counts were estimated for all non-overlapping 1Kb windows and plotted relative to GO-06 genomic coordinates. A. Normalized read mapping coverage (y-axes show z-scores) in a 500Kb region of the GO-06 genome (x-axis). Two contiguous regions with z-scores less than −2.0 represent two large scale deletions including an 80Kb deletion with 111 predicted ORFs (X) and a 60Kb deletion with 64 predicted ORFs (Y) B. Hierarchical cluster analysis of genomewide deletion patterns for all 5068 windows across the GO-06 genome.  Table 3 Mapping coverage of re-sequenced MAB-B strains versus the complete genome of the MAB-B BRA100 strain, GO-06 (Raiol et al., 2012) and average sequencing depth across the GO-06 genome.