Evidence for a rapid rate of molecular evolution at the hypervariable and immunogenic Mycobacterium tuberculosis PPE38 gene region

Background PPE38 (Rv2352c) is a member of the large PPE gene family of Mycobacterium tuberculosis and related mycobacteria. The function of PPE proteins is unknown but evidence suggests that many are cell-surface associated and recognised by the host immune system. Previous studies targeting other PPE gene members suggest that some display high levels of polymorphism and it is thought that this might represent a means of providing antigenic variation. We have analysed the genetic variability of the PPE38 genomic region on a cohort of M. tuberculosis clinical isolates representing all of the major phylogenetic lineages, along with the ancestral M. tuberculosis complex (MTBC) member M. canettii, and supplemented this with analysis of publicly available whole genome sequences representing additional M. tuberculosis clinical isolates, other MTBC members and non tuberculous mycobacteria (NTM). Where possible we have extended this analysis to include the adjacent plcABC and PPE39/40 genomic regions. Results We show that the ancestral MTBC PPE38 region comprises 2 homologous PPE genes (PPE38 and PPE71), separated by 2 esat-6 (esx)-like genes and that this structure derives from an esx/esx/PPE duplication in the common ancestor of M. tuberculosis and M. marinum. We also demonstrate that this region of the genome is hypervariable due to frequent IS6110 integration, IS6110-associated recombination, and homologous recombination and gene conversion events between PPE38 and PPE71. These mutations result in combinations of gene deletion, gene truncation and gene disruption in the majority of clinical isolates. These mutations were generally found to be IS6110 strain lineage-specific, although examples of additional within-lineage and even within-cluster mutations were observed. Furthermore, we provide evidence that the published M. tuberculosis H37Rv whole genome sequence is inaccurate regarding this region. Conclusion Our results show that this antigen-encoding region of the M. tuberculosis genome is hypervariable. The observation that numerous different mutations have become fixed within specific lineages demonstrates that this genomic region is undergoing rapid molecular evolution and that further lineage-specific evolutionary expansion and diversification has occurred subsequent to the lineage-defining mutational events. We predict that functional loss of these genes could aid immune evasion. Finally, we also show that the PPE38 region of the published M. tuberculosis H37Rv whole genome sequence is not representative of the ATCC H37Rv reference strain.


Background
The Mycobacterium tuberculosis genome contains two large gene families that together comprise around 10% of its protein coding capacity [1]. These families, termed PE and PPE, appear to have originated in the fast growing mycobacterial species before undergoing extensive expansion and diversification in certain slow growing species, particularly M. ulcerans, M. marinum and members of the M. tuberculosis complex (MTBC) [2]. The large multi-protein families encoded by these genes are of unknown function, although reports suggest that at least some members are cell surface associated [3][4][5][6] and can be antigenic [4,5,[7][8][9], a finding that has stimulated interest in their potential role in vaccine production, e.g. [10,11]. PPE proteins contain a proline-proline-glutamic acid (PPE) amino acid sequence at positions 7-9 in a highly conserved N-terminal domain of approximately 180 amino acids. The C-terminal domains of both PE and PPE protein families are highly variable in both size and sequence and often contain repetitive DNA sequences that differ in copy number between genes [1]. Several studies have shown that some PE and PPE genes are polymorphic and this has been interpreted as indicating strong selection pressure for antigenic variants that may aid in host immune evasion [3,7,[12][13][14][15][16][17].
A recent phylogenetic analysis of the 69 PPE genes present in the M. tuberculosis reference strain H37Rv has uncovered their evolutionary relationships and reveals that they can be divided into several subfamilies [2] (Figure 1). PPE38 (Rv2352c) is shown to be a member of PPE sublineage IV (the SVP subfamily) and analysis of its protein sequence confirms that it encodes the SVP subfamilydefining amino acid sequence (GxxSVPxxW) at positions 309 -317. However, along with the closely related gene PPE49 (Rv3125c), it shares a more recent common ancestor with PPE sublineage V members (the MPTR subfamily) than with any other member of the SVP sublineage ( Figure 1). Although no reports are available regarding its antigenicity or other biochemical features, because of its position on the "border" of sublineages IV and V, PPE38 was included in a larger study aimed at determining the genetic variation of PE and PPE genes between various strains of M. tuberculosis (manuscript in preparation). Here we present our analysis of this gene and its surrounding region using a cohort of phylogenetically diverse and well-defined M. tuberculosis clinical isolates representing all of the major phylogenetic lineages, along with the most ancestral MTBC divergent member, M. canettii. This has been supplemented by in silico analysis of this genomic region in the whole genome sequences of 15 publicly available M. tuberculosis strains, 8 other MTBC members (M. bovis, M. bovis BCG, M. microti, 3 × M. africanum, dassie bacillus and oryx bacillus), and 14 non tuberculous mycobacteria (NTM) species (7 fast growing species and 7 slow growing species). We show that this region is hypervariable in MTBC members and that this has resulted in a rapid rate of genetic divergence occurring between most M. tuberculosis strain lineages.

Identification of the variable PPE38 region and the RvD7 deletion
The published M. tuberculosis H37Rv genome sequence [1] predicts the amplification of a 1335 bp PCR product spanning the entire PPE38 gene when using the PPE38F/R primer pair (Figure 2a, Table 1). However, our analysis of 3 M. canettii clinical isolates, the H37Rv and H37Ra American Type Culture Collection (ATCC) reference strains (ATCC numbers 25618 and 25177 respectively), and 40 M. tuberculosis clinical isolates from different IS6110 RFLP-defined strain lineages covering all three principal genetic groups (PGGs) [18], revealed that only 7 strains produced this amplicon. Most samples (including the H37Rv and H37Ra ATCC reference strains and the 3 M. canettii's) produced a dominant amplicon of approximately 3.4 kb, while other samples produced amplicons of alternate sizes varying from approximately 2.5 to 5 kb, and 3 samples failed to amplify. Analysis of the H37Ra whole genome sequence revealed that the 3.4 kb amplicon (actual predicted size = 3398 bp) results from the presence of a second copy of PPE38 along with 2 esat-6 (esx)-like genes (annotated as MRA_2374 and MRA_2375 in H37Ra) in this region (Figure 2b) [19]. The second copy of PPE38 has been previously identified and designated as PPE71 in the CDC1551 whole genome sequence [20]. Its coding region is identical to PPE38 and both genes also share the same 5'-untranslated region up to position -35 bp. As previously suggested by Zheng and colleagues [19], this genomic structure suggests that the published H37Rv sequence represents the result of a homologous recombination event between PPE38 and PPE71 that has deleted one of these genes along with MRA_2374 and MRA_2375. This deletion is annotated as RvD6 in their analysis of the H37Ra whole genome sequence [19]. However, the authors did not acknowledge that the term RvD6 was appropriated in 2005 to define a specific variation between the M. bovis and H37Rv genomes [21]. For purposes of clarity and uniformity we therefore propose that the this deletion rather be termed RvD7.

Detailed analysis of the PPE38 region
In order to analyse the variation in this region more thoroughly, we designed additional primers (PPE38IntF/IntR, Table 1, Figure 2b) to allow PCR analysis and sequencing of the region between PPE38 and PPE71. Sequence analysis of the complete 3.4 kb product produced using the PPE38F/R primers in the H37Rv ATCC reference strain, one M. canettii and 4 of our clinical M. tuberculosis isolates Phylogenetic reconstruction of the evolutionary relationships between members of the H37Rv M. tuberculosis PPE protein fam-ily members Figure 1 Phylogenetic reconstruction of the evolutionary relationships between members of the H37Rv M. tuberculosis PPE protein family members. The phylogenetic tree was constructed from a phylogenetic analysis done on the 180 aa Nterminal domains of the PPE proteins. Results show the division of PPE proteins into 5 sublineages with PPE38 (Rv2352c, highlighted in green) located at the border of sublineages IV (SVP subfamily) and V (MPTR subfamily). Reproduced from ref [2] with permission from the authors.  in silico from publicly available whole  genome sequences -see below), along with group, lineage (F) and mutation details is listed in additional file 1.

Analysis of clinical isolates displaying alternate PPE38 region genetic structures and determination of lineage specificity
The PPE38 region of clinical isolates that produced PCR amplicons of sizes that did not correspond to the H37Ralike genotype were analysed in more detail by sequencing PCR amplicons. In order to characterize IS6110-associated mutations, the IS5' and Xho1 primers were used (Table 1). Twelve isolates possessed IS6110-mediated mutations, with two of these also displaying indels involving presumably recombination-mediated swapping of parts of the 5' untranslated regions of PPE38 and PPE71. One isolate revealed a 5'-untranslated region indel without an accompanying IS6110 mutation. Four isolates displayed the RvD7 genotype as defined by the H37Rv whole genome sequence ( Figure 2a). The final isolate failed to produce PCR product when using any of the PPE38-associated primer pairs, although PCRs directed at other regions of the genome were successful. We conclude that this isolate possesses a large deletion in the PPE38 region. Details and figures of the characterized mutations can be found in additional files 1 and 2 (S1 -S18). Additional clinical isolates were investigated in many cases in order to determine whether specific characterized mutations were IS6110 lineage-, cluster-, or isolate-specific. Results showed that in most cases the mutation was specific to all of the different clusters analysed from within the lineage, although several instances of within-lineage and even within-cluster variation was observed. Details of this analysis can be found in additional file 2 (S1 -S18).

In silico analysis of the PPE38 region in M. tuberculosis and other MTBC member whole genome sequences
The results obtained from our clinical isolates encouraged us to further investigate the genomic structure of this region in isolates whose whole genome sequences are publicly available. Along with the H37Rv and H37Ra sequences previously described we also analysed the region in 13 M. tuberculosis and 6 non-M. tuberculosis MTBC members for which the whole genome sequences are publicly available. For convenience, although the dassie and oryx bacillus genomes have not been completed, we have included known information on their PPE38 regions [22,23] in this section, thus providing a total of 21 additional MTBC genomes for analysis. Surprisingly, only 4 of these genomes (H37Ra, CDC1551 and M. africanum isolates GM041182 and K85) displayed the "normal" (ancestral) H37Ra-like PPE38 genotype of 2 PPE genes separated by 2 esx-like genes ( Figure 2b). Six genomes (including H37Rv) displayed the RvD7 genotype ( Figure  2a). Six genomes displayed various IS6110-associated mutations that, in some cases, were associated with additional indel mutations. The remaining 7 genomes, including all of the non-human animal-associated organisms, displayed large RD5 and RD5-like [24,25] deletions that spanned the entire PPE38 region including adjacent genes. Details and figures of all the characterized muta- tions can be found in the additional files 1 and 2 (S19 -S32). A schematic representation of the 7 large RD5 and RD5-like deletions can be seen in Figure 3.

Analysis of micro-mutations within the PPE38/71 gene sequences
Along with the macro-mutational events described above, the PPE38 region of 15 isolates from our cohort plus the fully sequenced genomes were also examined for mutations at the micro-mutational level. Apart from the 21del mutation which is described below, only 4 isolates (M. canettii, SAWC 1870, KZN 4207 and K85) were confirmed to possess micro-mutations. These are detailed in additional file 1. These results demonstrate that micromutations within the PPE38 region are rare.

Analysis of the 21del mutation
The 21del mutation consists of an in-frame 21 bp deletion that results in the loss of amino acids 357 -63 and was ini-tially identified in PPE71 of the CDC1551 whole genome sequence as well as in our clinical isolate SAWC 1645 (Haarlem, F10). The M. tuberculosis strain C, which possesses the RvD7 deletion, also shows this mutation, demonstrating that PPE38, rather than PPE71, has been deleted in this case. Interestingly, while all are PGG2 members, SAWC 1645 belongs to F10 of the Haarlem lineage while CDC1551 and strain C belong to the LCC lineage ("4 banders" and "3 banders" respectively). In order to further track the presence of the 21del mutation in our clinical isolates, PCR primers were designed to distinguish between the 21del (90 bp) and wild type (111 bp) genotypes (Table 1). This PCR was initially performed on all PGG2 isolates from our cohort. Results revealed the presence of this mutation in all 4 members of the LCC ("2, 3, 4 and 5 banders") as well as in 5 of 8 lineages representing the Haarlem, Pre-Haarlem and Haarlem-like clades (Figure 4). A simplified phylogenetic tree of PGG2 lineages in relation to their 21del genotypes is shown in Figure 5. All  LCC members showed a "heterozygote-like" "2, WT/ 21del" (2 genes, 1 wild type, 1 21del) genotype, suggesting the CDC1551-like structure, while results for the Haarlem lineages were more variable with "homozygotelike" signals for both the wild type and 21del genotype observed (Figures 4 and 5). In these cases PPE38F/R and PPE38IntF/IntR PCRs were used to determine the number of PPE38/71 genes present and thus distinguish between recombination (1 gene) and gene conversion (2 genes) events. Figure 5 shows that recombination and gene conversion events were observed within the F1, 2, 4, 10, 19 and 24 Haarlem lineages. Mutational analysis of the 2 Haarlem-like lineages (F6 and F7) suggests that the 3' region of PPE38, including the region corresponding to the 21del position in PPE71, has been removed by an IS6110-associated mutation [see additional file 2, S11 and S12]. The "1, 21del" genotype seen in both these lineages is therefore not due to recombination with deletion of PPE38.

RD5 and RD5-like deletions seen in MTBC isolates
We next performed the 21del PCR on additional isolates from the various LCC and Haarlem lineages in order to determine whether the observed genotypes were cluster or lineage-specific and also to investigate any additional instances of gene conversion or recombination. Four LCC "6 bander" isolates, which represent a lineage not originally used in our study, were also included. Results are shown in Table 2 and show that 90 additional isolates (including the Haarlem F4, CDC1551 and strain C whole genome sequences), representing 5 LCC and 8 Haarlem lineages (including Haarlem, Pre-Haarlem and Haarlemlike lineages) were analysed. Of these, 5 isolates (5.6%) displayed an altered genotype compared to the standard genotype observed within their lineage. Where genotypic changes were observed the PPE38F/R and PPE38IntF/IntR PCRs were again used to differentiate between recombination and gene conversion events ( Table 2) No cases of within-cluster genotypic alterations were observed.

Analysis of the plcABC genes from publicly available whole genome sequences
Previous reports have revealed that the genomic region adjacent to PPE38 encompassing the three phospholipase (plc) gene loci plcA, plcB and plcC is subjected to frequent deletions and IS6110 insertions [26][27][28][29][30]. We therefore also examined this region in the 15 publicly available M. tuberculosis whole genome sequences and 8 non-M. tuberculosis MTBC members described above. Numerous mutations were observed. These included SNPs, micromutations that resulted in frameshifts and altered amino acid incorporation, IS6110 integration and a case of gene fusion between plcA and plcB in isolate 02_1987. Some of the observed SNPs were found to be lineage-specific. For example, a sSNP (A → C) at position 435 of plcA distinguished all "ancient" strains (TBD1+) from "modern" strains (TBD1-) [31]. Large RD5 and RD5-like deletions have been previ-21del PCR results for all 21 members of PGG2 In isolates that possess both a normal and a 21del gene copy an additional amplicon of approximately 100 bp is seen. This presumably represents a heteroduplex comprising both amplicons.  [24,25], M. microti (where it is part of the RD5 mic deletion [32]), the dassie bacillus (where it is part of the RD5 das deletion [22]) and in the oryx bacillus (where it is part of the RD5 oryx deletion [23]) ( Figure 3). Two of the M. tuberculosis isolates (T92 and 94_M4241A) showed similar RD5-like deletions ( Figure 3). A complete summary of the results can be seen in Table 3.
Analysis of the PPE39 and PPE40 genes from publicly available whole genome sequences As described above, many instances of PPE38-encompassing RD5-like deletions that span the region from plcABC to PPE39 or PPE40 have been detected in the non-M. tuberculosis MTBC members [22][23][24][25]32]. Similar deletions were also found in the T92 and 94_M4241A M. tuberculosis isolates ( Figure 3, Table 4). We analysed the mutational 21del genotype results in relation to PGG2 phylogeny Figure 5 21del genotype results in relation to PGG2 phylogeny. A simplified phylogenetic tree of PGG2 lineages shows that the 21del mutation is seen only within the Haarlem and LCC lineages indicating that they share a recent common ancestor. Results also suggest frequent gene conversion and recombination events, particularly between the Haarlem groups. Each mutational type is shown in colour and indicates the number of PPE38/71 genes present and the genotype (WT = wildtype, ie lacking the 21del mutation). Green (a): 21 bp deletion (21del) in PPE71, both genes retained; Blue (b): Recombination between PPE38/71 with PPE71 deletion; Yellow (c): Gene conversion leading to deletion of PPE71 and duplication of PPE38; Grey (d); Recombination between PPE38/71 with PPE38 deletion. The "Haarlem-like" lineages (F6 and F7) could not be included in the analysis because the 3' end of PPE38, including the region homologous to PPE71 21del, has been deleted due to an IS6110-associated deletion event [see additional file 2, S11 and S12].
Haarlem Groups status of the PPE39 and PPE40 genes in all available whole genome sequences in order to determine the extent of the hypervariable region that appears to be centered around PPE38. Our analysis demonstrates that, apart from RD5like deletions, both genes are frequently subjected to additional mutational events at both the micro-and macromutational scale and that in many cases the resultant protein function is predicted to be abolished or altered (Table  4). Several mutations are of particular interest. Isolates 94_M4241A and M. bovis BCG both revealed the presence of a single PPE39/40 fusion gene ( Figure 3). Analysis of the DNA sequences of PPE39 and PPE40 genes shows that they are identical to position 538 (N-terminal conserved region) after which they diverge. The PPE39/40 fusion genes possesses PPE40 upstream sequence indicating that this portion of the gene is indeed PPE40. However, the sequences following position 538 are specific to PPE39 suggesting that the fusion has occurred at the point of divergence. The resultant proteins are thus predicted to be identical to PPE39 although the upstream regulatory sequences correspond to PPE40. Also of note was the finding that, despite being unrelated, the PPE39 gene in isolates CDC1551 and EAS054 both share the same 3 bp inframe deletion of nucleotides. The deletion occurs at a trinucleotide repeat region (4 × GCG, positions 79 -90) and we predict that microsatellite instability has resulted in independent deletion events to both these isolates. Finally, we also found that the Haarlem and F11 isolates share a direct IS6110 integration at position 47 of PPE39 with a resultant GGA duplication at the site of insertion. Interestingly, isolate CPHL_A has an identical IS6110 insertion in PPE40 and isolate 02_1987 also reveals an IS6110 integration at this point. A more detailed analysis of this apparent sequence-specific hotspot for IS6110 integration has recently been accepted for publication.

Analysis of the PPE38 gene region in non-tuberculous mycobacterial species
The extensive variability observed at the M. tuberculosis PPE38 region led us to examine its structure in more distant evolutionary time in order to gain insights into its evolutionary history. The whole genome sequences of 7 slow growing and 7 fast growing species of non-tuberculous mycobacteria (Figure 6), as well as several actinobacteria members outside the mycobacterium genus, were analysed for protein sequences showing homology to PPE38, MRA_2374, MRA_2375, plcABC and other genes found in the M. tuberculosis PPE38 region. The genomic region surrounding proteins of high homology was examined for similarities to the M. tuberculosis structure.
We first investigated the structure of this region in actinobacteria outside of the genus Mycobacterium (including members of the genera Corynebacterium, Rhodococcus, and Nocardia). In all cases glyS and an orthologue of Rv2345 (which are both located near PPE38 in M. tuberculosis, see Figures 3 and 7) could be found situated in close proximity to each other and in all cases they were separated by between 1 and 5 genes. These genes are unrelated to any genes found in the PPE38 region in the mycobacteria. The M. smegmatis genome contains a region homologous to the M. tuberculosis esxA/esxB operon found in the RD1 region [33]. However, the single PE/PPE and esx gene pairs located within this region are the only ones present in the genome and it has previously been demonstrated that PE/ PPE expansion has only occurred in certain slow growing mycobacteria and not in the fast-growers [2]. A number of fast-growing mycobacterial genomes have been gilvum to only contain two genes, namely glyS and the orthologue of Rv2345. This seems to represent the structure of the region before the insertion of PPE38 and the other genes found in this region in the slow-growing mycobacteria (Figure 7). The genome of M. vanbaalenii contains the same region with the insertion of an ortho-  Figure 3).

M. bovis BCG Deleted
Deleted Deleted plcABC part of the RD5 region ( Figure 3). CPHL_A sSNP A → C at position 435. + + All genes predicted to be fully functional. Hybrid plcA/B gene predicted to be non-functional. Part of massive genome rearrangements seen in this isolate ( Figure S23).  logue of Rv2248 between glyS and the orthologue of Rv2345. The other PPE, esx and plcABC genes are absent from the regions and the rest of the genomes of these organisms.
M. abscessus, which is one of the earliest mycobacterial species to diverge within the genus Mycobacterium, has an expanded region containing glyS, an aminotransferase, an 1-aminocyclopropane-1-carboxylate deaminase, an GntR family transcriptional regulator, and the orthologue of Rv2345. It is unclear whether the genes between glyS and the orthologue of Rv2345 have been inserted or whether this represents the ancient structure of the region.  PPE39 part of RD5 region (Figure 3). PPE40 function predicted to be altered.

Oryx bacillus Deleted
Deleted PPE39 and PPE40 are parts of the RD5 oryx region (Figure 3).

Dassie bacillus Deleted
Gene present. Deletion analysis suggests that PPE40 is intact but exact sequence is unknown.  From this result it seems that the ancestral region only consisted of these four genes (Figure 7).

M. leprae
A substantially reduced PPE38 region was identified in the genome of M. leprae, which contains only glyS, an IS6110 element (pseudogene -ML0827c), PPE38 (pseudogene, named PPE7 in the M. leprae database), plcA (pseudogene) and the orthologue of Rv2345 (pseudogene -ML0830c). Due to the extreme reductive evolution of this organism's genome [34], it is unclear what the original structure of this region in M. leprae was before genome downsizing, so Possible evolutionary scenario for the MTBC PPE38 gene region Figure 7 Possible evolutionary scenario for the MTBC PPE38 gene region. Analysis of fast growing mycobacterial species and the M. avium complex indicates that the homologues of Rv2345 and glyS have been in close proximity for a long evolutionary period and that the insertion of homologues to PPE38 and Rv2348c between these genes was also a relatively early event (a, b this organism was also not found to be useful for investigating the evolution of this region. M. marinum M. marinum and M. ulcerans share a recent common ancestor and both are also closely related to the MTBC ( Figure 6). M. marinum has the most extensive PE/PPE gene repertoire yet discovered and contains 105 PPE genes [35]. BLAST analysis of M. marinum proteins with the PPE38 amino acid sequence identified 2 genes (MMAR_3661 and MMAR_3664) with highest homology. Two esx-like genes are located downstream from each PPE38/71 homologue suggesting that the M. tuberculosis PPE38 region evolved initially by duplication of a esx/esx/ PPE sequence, to produce the structure seen in M. marinum, followed by deletion of the esx gene pair downstream of PPE38 homologue MMAR_3661 (Figure 7).  Figure 7). Sequence analysis of 1 amplicon confirmed that, apart from several intergenic SNPs, the structure was identical to that seen in H37Rv.

M ulcerans
The genome sequence of M. ulcerans shows that it has recently evolved from a M. marinum-like ancestor that acquired a virulence plasmid from another actinobacterium [36]. Since their divergence M. ulcerans has undergone extensive reductive evolution that has included genome downsizing [37]. This has resulted in alterations to the PPE38 region and no region of significant homology could be found. This organism was thus not found to be useful for investigating the evolution of this region.

Discussion
Using PCR and sequencing-based analysis of clinical isolates in conjunction with data obtained from publicly available whole genome sequences of M. tuberculosis, non-M. tuberculosis MTBC members and other non-tuberculous mycobacteria, we have investigated alterations of the PPE38 gene region, along with its evolutionary history. Analysis of the M. marinum whole genome sequence shows that the MTBC PPE38 region probably arose from the duplication of an esx/esx/PPE38 gene cluster followed by the deletion of one esx/esx gene pair (Figure 7). The more ancient evolution of the region is difficult to interpret from the available mycobacterium genome sequences. Analysis of the M. avium complex suggests that the insertion of PPE38 between Rv2345 and glyS was an early event but the exact timing of the esx and plc gene appearances remains unresolved. These questions will only be answered by the sequencing of more Mycobacterial species evolutionary situated close to the M. tuberculosis and M. avium complexes (e.g. M. kansasii) as well as additional species located on different phylogenetic branches, such as M. gordonae/M. asiaticum and members of the extended helix 18 group such as M. terrae ( Figure 6).
Our results demonstrate that the M. tuberculosis PPE38 region is hypervariable, adding to mounting evidence indicating that MTBC genomes are not as homogeneous as previously thought [38], and that they have undergone, and continue to undergo, considerable divergence from their most recent common ancestor. From a total of 69 MTBC isolates analysed 36 (52%) were found to contain major structural alterations. When smaller micromutations that are predicted to alter PPE38 or PPE71 protein function are included in this tally only 22 isolates (32%) remain that show the ancestral H37Ra-like structure (Figure 2b) containing the identical PPE38 and PPE71 genes. It should be noted that several of the analysed isolates were close relatives (e.g. SAWC 2576, KZN 4207, KZN 1435 and KZN 605 all belong to F15) and thus our mutation frequency may be a slight overestimate. However, countering this is the fact that genotypic analysis for many of our clinical isolates was based on PCR analysis rather than sequencing and additional micro-mutations may have gone undetected.
The hypervariability of the PPE38 region results from the combination of a high frequency of IS6110 integration events, IS6110-associated recombination/deletion events, homologous recombination and gene conversion events.
The frequency and variety of IS6110-associated mutations observed was striking. At least 20 of the 69 isolates (29%) displayed IS6110-associated mutations and these ranged from direct integrations, both into genes and intergenic regions, to recombination events that resulted in partial or full gene deletions. IS6110 integrations were also found to be common in PPE39 and PPE40 (Table 4) and they are also implicated in the large RD5-like deletions observed in isolate 94_M4241A and members of the non-human animal adapted MTBC members (Figure 3). The reason for the high IS6110 activity within this, or any of the other previously described M. tuberculosis IS6110 hotspot regions [39][40][41][42], is unclear. The element does not display any obvious insertion site sequence specificity, although in our analysis of PPE38, PPE39 and PPE40 we documented multiple, independent, identical integration sites. A more detailed analysis of this finding has recently been accepted for publication. Also of note is the finding that of all IS6110 integrations that were found to disrupt the 4 PPE genes analysed here, all occurred in their 5' (conserved N-terminal) regions. Apart from the obvious negative functional effects of gene deletion and disruption, IS6110 can also function as a mobile promoter and upregulate genes located downstream of its integration site [43][44][45]. Three of our clinical isolates revealed IS6110 integrations upstream of genes and an investigation into the transcriptional effects could be of interest. The dynamic nature of the genome in this region in relation to IS6110associated integration and recombination is further evidence for the role of IS6110 in the generation of genome plasticity in M. tuberculosis and its influence on the organism's evolution [46].
The finding that 10 of the 69 isolates harboured the RvD7 genotype demonstrates a high frequency of homologous recombination between PPE38 and PPE71. Additional analysis of homologous recombination and gene conversion between these genes was greatly aided by the identification of the 21del mutation. This in-frame deletion has allowed us to distinguish between the 2 genes in the PGG2 LCC and Haarlem groups. 21del analysis demonstrated a high frequency of both homologous recombination and gene conversion, particularly between the various Haarlem groups, that result in various combinations of single/double/wildtype/mutant genotypes (Figures 4 and 5, Table 2). Springer and colleagues [47] have reported that in M. smegmatis homologous recombination can only originate in regions of high (> 99%) sequence homology but, once initiated, can extend across heterologous regions with limited constraint. Termination of the event was found to require another region of high sequence similarity. This is consistent with the situation seen between PPE38 and PPE71 either with or without the 21del mutation.
The case for a high frequency of gene conversion between PPE38 and PPE71 is supported by comparisons of the M. tuberculosis and M. marinum genomes. We found that within each genome there is extreme homology between the PPE38/71 and MMAR_3661/MMAR_3664 genes (over 95% in M. marinum and generally 100% in M. tuberculosis at both the DNA and protein level), while between genomes the homology between PPE38/71 and MMAR_3661 and MMAR_3664 is only 86% at the DNA level and 37% and 36% respectively at the protein level.
This extreme intra-genomic but lower inter-genomic homology strongly suggests that both pairs of genes have diverged from a recent common ancestral sequence but have been prevented from significant intra-genome divergence by regular gene conversion events. Additional evidence is provided by the sequence of these genes in M. canettii. Here, each gene contains a non-synonymous SNP (A → C) at nucleotide position 1054. This indicates that mutation in one gene followed by gene conversion has occurred in either the M. tuberculosis or M. canettii lineages since they last shared a common ancestor. Gene conversion between PPE38 and PPE71 could thus explain the apparent paradox between a high macro-mutational frequency, suggesting non-essentiality of the genes, and low micro-mutational frequency, which would normally be an indication of gene essentiality.
These results add to accumulating evidence supporting frequent PE/PPE-associated homologous recombination and gene conversion in M. tuberculosis. Using a microarray-based methodology Karboul and colleagues [48] mapped numerous deletion mutations spanning adjacent PE and PPE genes and found that they resulted in in-frame fusion genes. Homologous recombination, using the highly conserved N-terminal gene regions as substrates, was strongly implicated in these events. Our own analysis of the PPE39/40 fusion gene observed in the M. bovis BCG and 94_M4241A whole genome sequences provides support for this finding. Two additional reports have provided evidence for between-strain recombination in close proximity to PE and PPE genes and the authors have proposed the existence of recombination hot spots within or close to these gene family members [49,50]. Regarding gene conversion, the use of the 21del polymorphism to detect this event in 2 highly homologous proximal genes is similar to that recently reported for the PE_PGRS17 and PE_PGRS18 gene pair [51]. This study reported the presence of a 12 bp insertion associated with a set of 40 SNPs that is found in either PE_PGRS17 alone or in both genes. Analysis of this polymorphism in isolates representing a broad spectrum of M. tuberculosis lineages shows that numerous gene conversion events have occurred between these genes throughout the evolutionary history of the PGG2 and PGG3 groups. Apart from its utility in detecting PPE38/71 recombination and gene conversion events, the 21del mutation is of interest for additional reasons. Firstly, it confirms a close evolutionary relationship between the PGG2 groups, LCC and Haarlem, recently identified by our group (N. C. Gey van Pittius, unpublished results). Secondly, the mutation has become fixed in the majority of lineages and clusters from within these groups, indicating that it might provide the organism with a survival advantage that is able to override the homogenising effect of recombination/gene conversion events.
Indeed, this mutation may represent the initial stages of evolutionary divergence between these 2 genes.
Homologous recombination/gene conversion events are also presumably responsible for the indel mutations involving the exchange of PPE38/71 upstream sequence regions observed in several isolates. Typically, both genes share the same upstream sequence to position -35 before diverging. The finding that isolate SAWC 3656 contains an indel upstream of PPE38 that involves replacement of the normal sequence from position -36 to -83 with PPE71 upstream sequence indicates a gene conversion event where PPE71 has replaced PPE38 in an imperfect recombination that has included a portion of its 5'-untranslated region.
The other examples, and particularly 02_1987, indicate that homologous recombination can also produce more complex results. Isolate 02_1987 is a particularly good example of the benefits of whole genome sequence analysis with respect to the large-scale mutational events described in this study. Along with the plcA/ B and PPE39/40 mutations previously described (Tables 3  and 4), this genome was also found to possess numerous additional gene truncations, inversions and IS6110 insertions involving both PPE38-associated genes and others [see additional file 2, S23] and it provides an idea of the amount of genomic plasticity that can be tolerated by a M. tuberculosis isolate that has successfully infected a host and caused disease.
Because our sample cohort is well-defined in terms of evolutionary relationships we were able, in many instances, to determine mutation status at the lineage-, cluster-or isolate-specific level. Although most mutations were found to be lineage-specific, in 5 cases at least one isolate that represented a different cluster from the same lineage revealed an altered genotype. Thus, genotypic variability was often observed within RFLP-defined lineages. Variability was generally not observed within clusters although in most cases the numbers analysed were limited. However, our analysis demonstrated that within cluster alterations can occur with one lineage showing 4 distinct mutations, including 3 within the same cluster [see additional file 2, S7]. These results emphasise the hypervariability of the PPE38 region and demonstrate its rapid ongoing evolution at the within-lineage and even the within-cluster level.
Our results show that the PPE38 region's hypervariability extends to the adjacent plcABC and PPE39/40 regions. The plcABC region has previously been reported as a preferential region for IS6110 integration [29] and our results thus extend this region from plcC to PPE40. This results in a "hot-spot region" of around 11.3 kb when using the CDC1551 sequence as a reference. The importance of the plcABC genes, along with plcD, which is located in another genomic region, has been emphasised by knockout experiments showing that triple (plcABC) or quadruple knockouts are impaired during the late phase of infection in a mouse model [52]. However, several examples of clinical isolates that possess mutations in all 4 plc genes have been reported [28,30], revealing that their functions are not always essential for the bacteria's pathogenicity. The finding that the plcABC region is deleted in many non-M. tuberculosis MTBC members [22][23][24][25]32],  (Table 3). This mutation frequency is around double that found in the extensive study of Kong and colleagues [30]. This difference might reflect the greater accuracy of whole genome sequence analysis compared to a methodological approach based on PCR and Southern analysis, along with the fact that we included non-M. tuberculosis MTBC members with known RD5 or RD5-like deletions in our analysis. Several other micromutations (nsSNPs and microinsertions) were also detected that are predicted to abolish or alter protein function (Table 3). These results confirm the frequent loss of function of these genes in clinical isolates and suggest that previous studies may have underestimated this frequency.  (Figure 3). RD5-like deletions were found to be less common in M. tuberculosis isolates and were observed in just 1 of our clinical isolates and 2 of the M. tuberculosis whole genome sequences (Figure 3). The relatively low frequency of RD5-like deletions in M. tuberculosis is supported by the findings of Tsolaki and colleagues [53] who identified only 1 such deletion in a total of 100 phylogenetically diverse strains. This low deletion frequency in M. tuberculosis compared to non-M. tuberculosis MTBC members may signify that the absence of this region may provide the organism with a selective advantage in nonhuman hosts, a hypothesis that is strengthened by the finding that the RD5 mic deletion is found in vole, but not human, M. microti isolates [32].
Surprisingly, our analysis of 3 independent H37Rv samples confirmed the typical H37Ra-like structure [19], thus contradicting the published sequence [1] from which the RvD7 genotype is defined. We suggest that the hypervariability of this region may have influenced the results of the published H37Rv whole genome sequence and conclude that the results for this genomic region are not representative of its true sequence. We propose that either a culture-specific PPE38/71 recombination/deletion occurred to produce the non-representative RvD7 genotype or, alternatively, some subclones used for the H37Rv sequencing project may have become mixed with those from other isolates. The second possibility is supported by analysis of the H37Ra sequence which, surprisingly, was found to be far more similar to the CDC1551 sequence than to H37Rv [19,20]. Whatever the explanation, we suggest that the sequence accuracy of this region for other whole genome sequences that show the RvD7 genotype be treated with caution.
The biological consequences of the described mutations are unknown but our results suggest that functional loss of PPE38/71, MRA_2374 and MRA_2375 (and possibly also plcABC, PPE39, PPE40 or combinations of all of these) do not result in a significant loss of bacterial virulence. We base this conclusion on the high frequency of independent mutations found in this region and the fact that the large number of mutations identified (at least in relation to PPE38/71, MRA_2374 and MRA_2375) were mostly lineage specific, indicating that the original mutated organism had successfully caused disease, transmitted to new hosts and undergone further evolutionary expansion and divergence. The best example of this is that of the typical Beijing's (F29) where IS6110-associated recombination/deletion events have resulted in the complete loss of functional PPE38, PPE71, MRA_2374 and MRA_2375 [see additional file 2, S4]. Despite this, Beijing F29 represents the dominant M. tuberculosis lineage throughout much of Asia and its incidence continues to rise rapidly in many countries and regions throughout the world [54,55]. Beijing F29 is also known to have diverged into many distinct sub-lineages [56,57]. The apparent absence of a deleterious phenotypic effect from mutations in the PPE38 region is supported by the transposon site hybridisation studies of Sassetti and colleagues who found that none of the genes analysed in our study were essential for growth either in vitro [58] or in an in vivo mouse model of infection [59]. In addition to these studies, which relate to M. tuberculosis in growth phase, these genes also do not undergo significant differential regulation during dormancy phase [60,61]. The plc, PPE and esx genes are all members of multi-gene families with numerous members within the M. tuberculosis genome and it is possible that genetic redundancy is responsible for the observations of Sassetti et al. Whether the loss of expression of these genes can, in some cases, be beneficial to the organism remains unclear but many examples of potential "virulence suppressor" genes have been documented [62]. PlcA, PPE and esx genes have all been shown to produce antigenic proteins [7][8][9][63][64][65] and it is conceivable that the loss of such potentially potent antigens could aid in immune escape. A recent study [66] has characterised the cellular immune response to 167 peptides representing 8 ORF's (Rv2346c -Rv2353c) within the RD5 region (referred to in this study by the Behr et al [24] annotation, RD7) that is absent in M. bovis, M. caprae and M. bovis BCG compared to M. tuberculosis. A high secretion ratio of IFNγ to IL-10 was observed in response to this peptide pool suggesting that expression of genes within RD5 might produce a protective effect. Loss of genes within this region could therefore result in increased pathogenesis and disease virulence. Finally, our work provides a cautionary note regarding vaccine development studies (which often utilise PE and PPE proteins or peptides) by indicating that at least some PPE gene family members are able to undergo rapid evolutionary change.

Conclusion
This study presents a detailed analysis of mutations at the PPE38 genomic region in a variety of M. tuberculosis isolates representing all major evolutionary lineages, along with analysis of this region from other MTBC and nontubercule mycobacterial species, in order to ascertain its evolutionary history. We conclude that this region is hypervariable due to frequent IS6110 integrations, IS6110-associated recombination/deletion events, and gene conversion and recombination between PPE38 and PPE71. Gene conversion was implicated in the low levels of variation observed at the micro-mutational scale between PPE38 and PPE71. Furthermore, mutational analysis of numerous additional isolates at the lineage and cluster levels has provided insights into the molecular evolution of this region. We describe multiple instances of fixation of PPE38-associated mutations at the lineage level, along with examples of within-lineage and even within-cluster variation, indicating rapid and extensive evolution of the region. Because these mutations generally result in the functional loss of genes we conclude that they do not result in a significant loss of fitness and that, since they have been shown to be highly antigenic, they may in fact aid in the organism's survival.

DNA sample collection and determination of strain/ lineage/cluster
M. tuberculosis isolates from patients residing in an epidemiological field site near Cape Town, South Africa, were genotyped according to the internationally standardized IS6110 DNA fingerprinting method [67]. DNA fingerprints were analyzed with GelCompar software, using the unweighted-pair group method using average linkages and Dice coefficients [68]. Isolates with an IS6110 similarity index of ≥ 65% were grouped into strain lineages [69]. Spoligotyping was also done to further classify lineages into clades [70].