Correlation of drug resistance with single nucleotide variations through genome analysis and experimental validation in a multi-drug resistant clinical isolate of M. tuberculosis

Background Genome sequencing and genetic polymorphism analysis of clinical isolates of M. tuberculosis is carried out to gain further insight into molecular pathogenesis and host-pathogen interaction. Therefore the functional evaluation of the effect of single nucleotide variation (SNV) is essential. At the same time, the identification of invariant sequences unique to M. tuberculosis contributes to infection detection by sensitive methods. In the present study, genome analysis is accompanied by evaluation of the functional implication of the SNVs in a MDR clinical isolate VPCI591. Result By sequencing and comparative analysis of VPCI591 genome with 1553 global clinical isolates of M. tuberculosis (GMTV and tbVar databases), we identified 141 unique strain specific SNVs. A novel intergenic variation in VPCI591 in the putative promoter/regulatory region mapping between embC (Rv3793) and embA (Rv3794) genes was found to enhance the expression of embAB, which correlates with the high resistance of the VPCI591 to ethambutol. Similarly, the unique combination of three genic SNVs in RNA polymerase β gene (rpoB) in VPCI591 was evaluated for its effect on rifampicin resistance through molecular docking analysis. The comparative genomics also showed that along with variations, there are genes that remain invariant. 173 such genes were identified in our analysis. Conclusion The genetic variation in M. tuberculosis clinical isolate VPCI591 is found in almost all functional classes of genes. We have shown that SNV in rpoB gene mapping outside the drug binding site along with two SNVs in the binding site can contribute to quantitative change in MIC for rifampicin. Our results show the collective effect of SNVs on the structure of the protein, impacting the interaction between the target protein and the drug molecule in rpoB as an example. The study shows that intergenic variations bring about quantitative changes in transcription in embAB and in turn can lead to drug resistance.


Background
In spite of the worldwide efforts to combat mycobacterial diseases, it continues to be a great challenge to achieve this goal. In addition to the various strategies adopted by this pathogen to escape host immune system, M. tuberculosis has gainfully utilized genetic variability for its highly successful growth, pathogenesis, immunity and persistence [1]. The complete genome sequence of the strain of M. tuberculosis H37Rv and the recent surge in data on clinical isolates permits high-throughput whole-genome analysis, relationship and correlation with drug resistance [2][3][4][5][6][7][8][9]. This has revealed local, global and patient specific heterogeneity in M. tuberculosis strains [10]. The increased genetic diversity and variation in M. tuberculosis is implicated in strain specific immunogenicity and pathogenicity [11]. A large number and the most frequent occurrence of variation is seen in the genes for lipid metabolism and PE/PPE genes [12,13].
The whole genome sequence of more than 11,000 clinical isolates of M. tuberculosis are reported in NCBI-SRA, however, the correlation between genetic alterations and the phenotype is carried out mainly in the drug resistance genes [14]. Thus, drug resistance in tuberculosis is a phenomenon more complex than previously assumed where the association of intergenic regions and the identification of new genes have resulted from whole-genome sequence based approach [15].
We analysed the genetic polymorphism in an Indian, multi-drug resistant (MDR) clinical isolate VPCI591, from a 50 year old male patient from Vallabhbhai Patel Chest Institute (VPCI), Delhi. It is TBd-1 positive and EAI clade spoligotype and the isolate is resistant to all the first-line tuberculosis drugs [16,17]. The availability of this strain made it possible to validate selected variations and investigate their effect on the function of the gene. We carried out the comparative analysis of the variations detected in VPCI591 with the genome sequence of 1553 global clinical isolates, to identify the shared and unique Single Nucleotide Variations (SNV). This analysis also indicated the invariant regions/genes that could be of potential use as markers for M. tuberculosis complex.

Sequencing of VPCI 591 genome
The sequencing depth of coverage was calculated to be 133X. The FastQC data analysis showed high quality reads after adapter removal. The quality score distribution of all the reads and GC nucleotide distribution of the reads are shown in (Additional file 1: Fig. 1a & b). The distribution pattern shows typical pattern of GC rich mycobacterial genomes. The mean quality score distribution is also high (Additional file 1: Fig. 1c).
To increase the confidence in the identification of SNVs, four different pipelines were used in data analysis, namely Maq (0.6.6), Bowtie 2 (2.0.4) for variant detection in massively parallel sequencing data, in combination with Samtools, GATK (3.6-0-g89b7209) and Picard Tools (1.119), Varscan and Mpileup of Samtools and inGAP as given under methods. The consensus of the four pipelines was considered to call the Single Nucleotide Variations (SNV) in the genome of VPCI591. We compared VPCI591 with global datasets of variation totalling to 1553, obtained from GMTV (http://mtb. dobzhanskycenter.org) [18] and tbVar database (http:// genome.igib.res.in/tbVar/) [19].

Analysis of single nucleotide variations. Distribution of genic variations
A total of 1917 SNVs were chosen as consensus variations with high confidence from the four pipelines as described under methods and were considered for further analysis (Fig. 1a). 1685 genic and 232 non genic/intergenic variations were obtained. 1059 NS, 596 SY, 18 SG and 12 SL variations were there distributed among 1685 SNVs (Fig. 1b).
We identified 1685 genic SNVs in VPCI591 distributed among 1245 genes (Fig. 2a). We have found 926 genes having single variation among which 737 are NS (Fig. 2a,  b). The 1059 NS variations found in VPCI591 are distributed in 894 genes (Fig. 2b). It is known that the PE-PPE genes are highly variant among clinical isolates [20]. In VPCI591, among the PPE genes, Rv3347c has 8 SNVs out of which 5 are non-synonymous SNV and Rv0355c has 7 SNVs, 6 of which are NS (Fig. 2a, b). VPCI591 contains multiple non-synonymous variations which are also seen in 11 genes known for drug resistance and multiple variations are identified in gyrA, embC, embB and rpoB. Based on SIFT score (< 0.05), several variations were predicted to be deleterious; for example A384V in gyrA, I21T in inhA, N394D in embC, P913S in embA, D12A in pncA and all the 3 variations in rpoB (Fig. 2c).
The genetic variations were found in genes responsible for resistance to other drugs such as, fluoroquinolones, ethionamide, isoniazid, ethambutol, pyrazinamide and rifampicin (Fig. 2c). RNA polymerase β subunit (rpoB) harbours 3 variations known to cause resistance to rifampicin (Fig. 2c). We found similar variations in 1553 global clinical isolates of M. tuberculosis, that we analysed, however the co-occurrence of all the three is unique to VPCI591. In addition, based on SIFT score, we can predict deleterious effect of the SNVs in the following genes; mmpL4 (Rv0450c) and mmpL8 (Rv3823c) involved in membrane transport, proC (Rv0500, pyrroline-5-carboxylate reductase), trcS (sensor histidine kinase, Rv1032c), cyp121 (Rv2276, cytochrome P450 and polyketide synthases pks2 (Rv3825c) and pks7 (Rv1661). VPCI591 has SNVs in the two component system genes, mprA (Rv0981) and mprB (Rv0982). We have investigated its effect on the expression of immune response genes and also on phago-lysosome fusion [21]. A complete list of NS variations with gene names, position of SNV, genomic location and amino acid (AA) change is given in (Additional file 2).

Functional classification of genes with non-synonymous SNV
The classification of genes containing SNV according to Camus et al., [4] and PANTHER [22] was carried out (Fig. 3a). Classification based on Camus et al., [4] identified 219 genes of intermediary metabolism, respiratory pathway genes, 200 genes involved in cell wall and cell processes, 49 PE and PPE proteins, 65 genes for lipid metabolism of Mycobacterium, 45 regulatory proteins and 11 insertion elements, and phage proteins (Fig. 3a). A large number of genes (211) are annotated as conserved hypothetical or uncharacterized proteins (Fig. 3a). The genes with NS variations (894) were classified according to their molecular function using PANTHER [22]. Among the mapped genes, 290 genes are associated with catalytic activity, 56 genes with binding function and 48 genes having transporter activity (Fig. 3b). On classifying the catalytic activity genes further, it was observed that 109 of these genes have transferase activity, 88 genes have hydrolase activity and 80 genes have oxidoreductase activity (Fig. 3c). Among the 109 genes for transferase activity, 40 genes are involved in transferring acyl group, 31 genes have methyltransferase activity, 12 are glycosyl group transferases and 10 have kinase activity (Fig. 3d). The genes in the following classes were under represented among the SNV containing genes: structural constituents of ribosome (GO:0003735), DNA-binding transcription factor activity (GO:0003700), transcription coregulator activity (GO:0003712) and guanyl-nucleotide exchange factor activity (GO:0005085).  Table 1). In these cases we identified the domains lost due to which the loss of function is predicted, which was correlated with the loss of a long peptide, including the loss of the functional domain. For example, in Rv2187 (fadD15), 558 amino acids are lost which results in the loss of AMP synthetase/ligase domain. Since many of these genes are annotated as hypothetical  (Table 2). Rv0325 gains 155 amino acids to form class I SAM dependent methyl transferase. In Rv1870, 11 AA were added and it is converted in to an endonuclease.

Variation in intergenic region between divergently transcribed genes
There are several intergenic regions that are flanked by divergently transcribed genes. The presence of promoters or regulatory regions within such intergenic noncoding regions are known in M. tuberculosis [23]. Thus intergenic SNVs can affect the transcription/regulation of the downstream gene. We have mapped the intergenic variations occurring upstream of genes and also those mapping between divergently transcribed genes ( Fig. 4a; I and II) and computed the relative distance between the SNV position and the gene(s). We have identified a total of 232 intergenic variations, 32 of them are positioned between genes transcribing in opposite directions and these have the potential to affect both the genes. These were binned into categories depending upon the distance between the downstream genes ( Fig.  4b). There are 77 variations that map within 50 bp from transcription start site, 39 SNVs between 50 and 100 bp region and 35 SNVs between 100 and 150 bp regions indicating that majority of the intergenic variations are close to the transcription start site. Such variations in putative promoter or regulatory variations can affect both the flanking genes transcribed in opposite directions. Some of the genes whose transcription could be affected by the SNV in VPCI591 were identified (Fig.  4c). The complete list of the intergenic variations is given in (Additional file 3 and Additional file 4) respectively. We found a variation within the 85 bp intergenic region between embC and embAB, 4 bp upstream of   The comparative analysis with global clinical isolates, led to the identification of 141 unique variations in VPCI591, among which 125 are genic and 16 are intergenic. Most of the genic variations were nonsynonymous and stop-gain SNVs. Ontology classification revealed that majority of them fall under genes with catalytic function (75%) and transportation (17%). The complete list of the unique genes with amino acid alteration along with the SNV position in the genome is shown in (Additional file 6). A number of intergenic variations lie within the known promoter/regulatory regions. Intergenic variation in the cis-regulatory sequences of mce1 operon between Rv0166 (fadD5) and Rv0167 (yrbE1A) were identified [23]. SNV was identified within IS 1560 elements and mutation hotspot for anti-tuberculosis drug ethambutol, between embC and embAB gene. The complete list of unique intergenic variations identified in VPCI591 along with their genomic position and the adjacent genes is shown in (Additional file 7).

The effect of SNV in RpoB subunit and its possible implication
As mentioned earlier, 3 variations co-occurring in RpoB (Rv0667) at D441A, L458P and I1112M which is unique to VPCI591. We examined the effect of these on the structure of RpoB and its interaction with rifampicin. To understand the binding of rifampicin at the active site of RpoB, molecular docking analysis was performed using the crystal structure of RpoB (PDB Id: 5UHB). The locations of the 3 variations are mapped in the crystal structure at D441, L458 and I1112 (Additional file 8: Fig. A). The wild type protein showed His451, Phe439, Gln438, Arg454, Arg465, Ser456, Asn493 interacting with the drug rifampicin at the active site (Fig. 5a, c). Each variation was examined for its effect separately as well as in combination with each other; D441A showed loss of all the interactions with drug except Gly438, Phe439 and Arg465 (Additional file 8: Fig. B & C) and, while L458P also showed loss of all interactions with drug except Gln438, His441, Phe439 and Arg465 (Additional file 8: Fig. D and E). The variation I1112M showed loss of interactions with drug except Gln438, Phe439 and Arg465 (Additional file 8: Fig. F & G). The combination of all the three variations, D441A, L458P and I1112M as present in the clinical isolate VPCI591, showed loss of all interactions with the drug retaining the interaction with only with Arg454 ( Fig. 5b & d). The binding energy and the respective Ki values for the 3 variations were calculated for individual variations. The wild type shows (ΔG) − 9.27 kcal/mol while the variants D441A, L458P and I1112M have − 6.67 kcal/mol, − 7.11 kcal/mol and − 7.18 kcal/mol respectively. The combination of the three, D441A + L458P + I1112M has ΔG equal to − 5.82 kcal/ mol showing loss of binding of rifampicin at the active site. This could result in drug resistance. The Ki values calculated from the wild type is 160 nM whereas RpoB from VPCI591 shows 53.96 μM. The complete representation of the individual variations as well as the combined, the number of hydrogen bonding and the relative distances in Å is shown in ( Table 3).

Effect of SNV in intergenic region upstream of embA (Rv3794) gene
Several NS variations were identified with the associated genes for ethambutol resistance in embC (C809T, A1180G), embA (P913S), embB (E378A, E504D) (Fig. 6a). We found NS variation at embR (C110Y) which is not a part of the operon. In addition to the genic variations, an intergenic SNV at position 4,243,299 C > T at -4 bp position upstream to embA gene is identified in VPCI591. The SNV was confirmed by PCR followed by Sanger sequencing. The expression level of embAB was compared with that of the reference strain H37Rv by quantitative PCR, using expression of sigA for normalization. No SNV is identified in sigA in VPCI591. There is 3.6 fold increases in the expression of embA gene in VPCI591 as compared to H37Rv in log phase and 1.25 fold increase in expression in the stationary phase of growth, indicating a gain-of-function due to the intergenic variation (Fig. 6b).

Discussion
The clinical isolate VPCI591, is a multidrug resistant strain that has been used in various analysis [16,23]. The sensitivity of VPCI591 for first line anti-tuberculosis drugs is tested and it shows high MIC; INH-(MIC > 300 mg/L), RIF-(MIC > 125 mg/L), STR-(MIC 40 mg/L), EMB-(MIC 15 mg/L [17];. We have earlier carried out a directed polymorphism analysis of the mce 1 operon of VPCI591 and identified a gain-of-function SNV at 196800 (G > C) in the intergenic region between Rv0166-Rv0167 [23]. This strain has also been analysed for the effect of an SNV in MprA (Rv0981) in hostpathogen interaction [21]. In the light of these background studies, the availability of the genome sequence provides the complete genetic profile that can be analysed for its correlation with various phenotypes. This led us to predict the molecular correlates for the resistance phenotype of VPCI591 for two of the first line drugs, rifampicin and ethambutol. Second line drugsusceptibility testing had not been put up for the isolate. However, WGS results showed variations at gyrA (E21Q, S95T, A384V and G668D) and gyrB (M330I) for fluoroquinolones.
Among the high confidence SNVs identified, the antigenic PE-PPE proteins show high variation. They have multiple variations in a given gene in VPCI591 as in the 1553 global clinical isolates of M. tuberculosis, that we analysed from the repositories tbVar [19] and GMTV [18]. However there are also certain invariant PE, PPE family of genes. The number of SNVs per gene is highest in the PPE genes Rv0355c and Rv3347c; out of the 7 SNVs in Rv0355c, 6 are non-synonymous, while 5 out of 8 SNVs in Rv3347c are non-synonymous. The members such as PPE 35, PPE55, PPE8, PPE54, PPE34, PPE24 genes are known to be highly polymorphic among clinical isolates [20]. SNVs leading to stop-loss (SL) in PPE33, PPE67 and stop-gain (SG) in PPE42, PPE35, in addition to SG in fatty acid ligase gene, fadD15 are also identified. However the phenotypic consequence of these variations is not reflected in the ability of the pathogen to infect and survive in the human host or in in-vitro growth.
The classification of the SNV containing genes based on their biological function indicates that genes involved in transferase activity were the major class represented among the non-synonymous SNV containing genes in VPCI591 as also in others as reflected in tbVar [19] and GMTV [18] databases. It can be speculated that transferases including those involved in the post-translational modifications can tolerate variations and thus show plasticity, due their role in conferring functional diversity and having various classes of proteins as substrates.
SNVs mapping in the intergenic region between two genes that are transcribed in opposite directions, can impact the transcription of both the genes, if they map within the putative promoters or regulatory regions. There are several instances of such intergenic SNV in VPCI591, including SNV at genomic position 2,010,614, between Rv1776c (transcriptional regulator) and Rv1777 (cytochrome P450), Rv2333c (drug efflux protein) and Rv2334 (cysteine synthase).
In the present analysis, we have validated such an effect on embAB gene. We note that, VPCI591 harbours several SNVs which might be associated with ethambutol resistance; in embC/Rv3793 (T270I) and (N394D), embA/Rv3794 (P913S) and two variations in embB/ Rv3795 (E378A) and (E504D). There is also another non-synonymous change in embR/Rv1267c (C110Y). In addition, VPCI591 has a variation in pknH/Rv1266c (R607Q). PknH-mediated increase in the transcription of embAB genes significantly alters resistance to ethambutol [24]. Though among the various mechanisms of ethambutol resistance in M. tuberculosis, the over expression of embAB gene due to the presence of intergenic mutation, is one of the most well-understood mechanisms [25]. In case of VPCI591, the SNVs in several genes conferring resistance to ethambutol, and a threefold increase in the transcription of embAB collectively contribute to the high MIC of 15 μg/ml of VPCI591 [17] relative to MIC of 0.5-2 μg/ml of H37Rv [17,26].
Similarly, the high resistance of VPCI591 to rifampicin is attributed to the co-occurrence of three genic nonsynonymous SNVs which is unique to VPCI591 and weakens the interaction of RNA polymerase β subunit with rifampicin, as reflected in the increase in binding affinity. However, the effect of SNVs on the affinity of RpoB for rifampicin may not be the only reason for this level of resistance. It is known that in addition to RpoB mutation, the active efflux pumps like Rv1258c, Rv1410c, and Rv0783, the major facilitator superfamily (MFS) proteins, contribute to rifampicin resistance [27]. VPCI591 bears no variation in these genes except SNV in Rv2936/drrA (H309D), implicated in rifampicin efflux [27]. VPCI591 has NS variations in rpoC (A172V) and no variation in rpoA. The presence of these off-target variations are also known to contribute to the evolution and survival of drug-resistant M. tuberculosis [28,29]. The identification of genes that do not show any SNV in a large number of global clinical isolates (invariant regions), would be potential diagnostic markers for identification of M. tuberculosis complex [30][31][32]. We found several invariant genes in 1553 global clinical isolate including VPCI591. It has been shown that the invariant region in 190 bp region of Rv1458c and partial regions of Rv0440 can be used as diagnostic markers for the identification of M. tuberculosis complex (MTBC) [30]. The major group of invariant genes in our study includes, toxin antitoxin pathway genes like Rv0661c, Rv2760c, Rv0664, and Rv0581, and transposase of IS elements (Insertion elements) like IS6110. Several ribosomal subunit genes and the putative mycobacteriophage proteins like probable PhiRv1 phage protein are also devoid of variations in the clinical isolates. Several secretory pathway proteins such as Rv3875 (ESAT-6), TatA, SecG and sugar transport proteins like SugB, proteins of electron transport chain nuoE, nuoC, nuoK are also found to be invariant. Thus such invariant regions present in clinical isolates can be used as diagnostic markers.

Conclusion
The genetic variation in M. tuberculosis clinical isolates VPCI591 is common among almost all functional classes of genes. We demonstrate that the variations bring about quantitative changes in transcription and can lead to altered structure of the protein and interaction with the drug molecule. Likewise intergenic distance for nongenic SNVs plays a very important role if they lie within Fig. 6 Effect of intergenic SNV on the expression of embAB. a. Diagrammatic representation of embCAB operon of VPCI591 with the SNVs (vertical arrows). The direction of transcription is indicated by horizontal arrows. The intergenic region between embC and embAB (85 bp) is represented by dotted lines. b. Relative expression of embAB gene in VPCI591 with respect to H37Rv at log phase and stationary phase, represented as mean ± SEM (*p < 0.05). The results from two biological replicates are shown the putative promoter or regulatory region. However, the clinical isolates survive and are pathogenic implying that there is no drastic negative effect that is obvious. On the other hand, the invariant genes can be explored for their potential as drug targets.

Genome sequencing and pipeline for analysis of genome variation
Genomic DNA was isolated using standard protocol [33]. The genome sequencing was done using Illumina GAIIx analyser at CSIR-Institute of genomics and Integrative biology (IGIB). TruSeq2_SE adapters were used for sequencing. Sequence read file for the clinical isolate VPCI591 have been deposited in SRA format, NCBI: SRX5802345, Bioproject accession number PRJNA540936, BioSample accession numbers: SAMN11568242.
FastQC (0.11.5) was used for deciphering the overall quality statistics of the data [34]. Trimmomatic (0.36) [35] was used to remove adapter contamination shown by FastQC. M. tuberculosis H37Rv (NC_000962) sequence was used as the reference strain to identify the genetic variations.

Classification of the genome sequence data
The whole genome SNVs were classified into genic and non-genic/intergenic on the basis of their location on the genome. The genic variations were further classified into (a) Synonymous (SY), (b) Non-synonymous (NS), (c) Stop-gain (SG), and (d) Stop-loss (SL) categories depending on their effect on the protein sequence. In case of SG and SL variations we predicted the number of amino acids lost or gained due to premature termination or extension respectively. In SL variations, the extended protein resulting from the addition of amino acids was analysed for its predicted function by BLASTP [44] and InterProScan [45]. The ontology for genes harbouring NS variations were carried out using PANTHER [22] and further classified according to the major pathway [4]. The non-synonymous SNVs in drug resistance genes were retrieved by literature mining as well as the analysis of TBDReaMDB, a comprehensive drug resistance database [14]. The SNVs were screened for their predicted effect through SIFT [46]. For non-genic/intergenic variations, the relative distance between the SNV and the downstream gene was determined. From global comparative analysis with tbVar [19] and GMTV [18] database, invariant genes and variations unique to VPCI591 were identified.

Structural analysis of RpoB (Rv0667)
To assess the effect of the NS-SNV in RpoB detected in VPCI591, structural analysis was carried out. The atomic coordinates of RpoB complex with rifampicin (PDB Id: 5UHB; Resolution: 4.29-Å) [47] was obtained from Protein Data Bank (http://www.rcsb.org/pdb). The effect of variation on the structure and the binding of rifampicin was investigated by molecular docking to examine the protein-ligand (rifampicin) interaction using AutoDock 4.2 [48]. The structure of the protein with SNVs was generated using Swiss-PdbViewer [49]. All the hetero atoms were removed, and both the protein and ligand files were prepared and saved in PDBQT format and used as initial input for AutoDock following the standard protocols. The Lamarckian genetic algorithm (LGA) method was applied for docking and to deal with the protein-antagonist interactions [50]. The polar hydrogen atoms were added geometrically and Gasteiger charge to all the atoms of the protein was assigned using Auto-DocTools (ADT). The 3D affinity grid fields with grid map of 50 × 50 × 50 points spaced equally at 0.375 Å and centre of the grid box was 164.1 × 163.34 × 20.42 using auxiliary program AutoGrid to evaluate the binding energies between the ligand and receptor. The standard protocol of ADT utility was used to generate both the grid parameter file (gpf) and docking parameter file (dpf). The resultant docked models of wild-type and mutant complexes with rifampicin were analyzed using Schrödinger and PyMOL to visualize molecular interactions.

Expression analysis
The intergenic variation mapping between embC and embAB genes inVPCI591 was validated by Sanger sequencing using the primers, forward (CCTAGGAACG GTGACTCG) and reverse (AGACGACGGCTGCTAG GC). For expression analysis total RNA was extracted from the clinical isolate VPCI591 and H37Rv using the RNeasy mini kit (Qiagen) and was treated with DNase using TURBO™ DNase kit (Invitrogen) and cDNA was prepared using First strand cDNA synthesis kit (Fermentas K1612). Quantitative PCR was performed with sigA/ Rv2703 gene as control and fold change was measured by ΔΔCt method using FastStart universal master mix (Roche). The following primers were used: embA (F-GTAATGAGCGATCTCACCGG/ R-CGGTGATCTG GGTGATGTTG); sigA (F-AACGCACCGCCACCAA GTC/ R-TGGTGCTGGTCGTAGTGTCCTG).