Analysis of mutations in pncA reveals non-overlapping patterns among various lineages of Mycobacterium tuberculosis

Pyrazinamide (PZA) is an important first-line anti-tuberculosis drug, resistance to which occurs primarily due to mutations in pncA (Rv2043c) that encodes the pyrazinamidase enzyme responsible for conversion of pro-drug PZA into its active form. Previous studies have reported numerous resistance-conferring mutations distributed across the entire length of pncA without any hotspot regions. As different lineages of Mycobacterium tuberculosis display a strong geographic association, we sought to understand whether the genetic background influenced the distribution of mutations in pncA. We analyzed the whole genome sequence data of 1,480 clinical isolates representing four major M. tuberculosis lineages to identify the distribution of mutations in the complete operon (Rv2044c-pncA-Rv2042c) and its upstream promoter region. We observed a non-overlapping pattern of mutations among various lineages and identified a lineage 3-specific frame-shift deletion in gene Rv2044c upstream of pncA that disrupted the stop codon and led to its fusion with pncA. This resulted in the addition of a novel domain of unknown function (DUF2784) to the pyrazinamidase enzyme. The variant molecule was computationally modelled and physico-chemical parameters determined to ascertain stability. Although the functional impact of this mutation remains unknown, its lineage specific nature highlights the importance of genetic background and warrants further study.

Mycobacterium tuberculosis poses a huge burden on global health, the effective treatment of which is becoming increasingly complex as drug resistance spreads worldwide 1 . According to a WHO report published in 2015 only one in four multidrug resistant-TB cases are diagnosed 2 , highlighting the need for rapid and accurate diagnostic tools. To this end, whole genome sequencing (WGS) offers a better alternative than current methods 3 . In the last few years, many WGS projects have been undertaken around the world to catalogue genetic determinants associated with phenotypic resistance [4][5][6] . These have largely focused on two key areas. The first is to improve the accuracy in predicting phenotypic resistance based on genetic information, which could then be implemented as an alternative to conventional drug susceptibility testing (DST) and other diagnostic tests. The second is to detect novel mutations associated with drug resistance that are not captured in the current catalogue. Studies have shown the diagnostic potential of WGS to achieve higher diagnostic sensitivity and specificity for some drugs [6][7][8] , but further work is required to improve the accuracy of prediction for several first and second line drugs that form the core of anti-TB treatment. This includes pyrazinamide (PZA), resistance to which is mainly due to the acquisition of mutations in pncA (Rv2043c) that impairs the conversion of the pro-drug PZA into its active form pyrazinoic acid 9 . This first-line drug is likely to remain indispensable owing to its bactericidal activity against the persister bacterial sub-population 10 .
Low sensitivity in predicting PZA resistance is largely due to the large number of pncA gene mutations observed in phenotypically resistant isolates without clear hotspots for mutations, which poses significant hurdles for the development of genotypic assays 9,11 . Alternative assays such as high-resolution melt analysis (real-time PCR based) that detect variations in the entire pncA sequence and upstream promoter region are also limited by the inability to detect genetic events such as transversions or small deletions 12 . Furthermore, phenotypic screening of PZA resistance is challenging since its optimal bactericidal effect requires a low pH, which can lead to inconsistent results 13 .
Previous studies have reported that the propensity to acquire drug resistance varies among different lineages of M. tuberculosis 14 . For example, strains belonging to the East Asian lineage (lineage 2) were observed to acquire drug resistance more rapidly than other lineages 15 . Furthermore, strains of different lineages carrying the same resistance mutations have been observed to have different MIC levels for the corresponding drug. These observations clearly highlight the importance of bacterial genetic background 16 . The mechanisms for this lineage specific behavior is surmised to be multifactorial 17,18 , but is poorly understood and requires investigation. It also indicates a need to explore hitherto unknown associations between lineage specific variations and their impact on drug resistance.
Given the above and taking into account the multiplicity of mutations observed in pncA gene, we first determined the lineage-wise distribution of resistance conferring mutations identified in a phenotypically PZA resistant strain collection. As pncA is part of a larger operon comprising of two additional genes, we also identified the distribution of all mutations in the complete operon and its upstream promoter region using a collection of nearly one and a half thousand M. tuberculosis isolates chosen irrespective of their phenotype 6 , mainly to analyze if there exists any lineage specific differences that could impact on PZA susceptibility. This led to the identification of a lineage 3 specific frame-shift deletion in Rv2044c gene upstream of pncA which resulted in the fusion of these two genes. This added an additional domain of unknown function to the pyrazinamidase enzyme. The variant pncA molecule in lineage 3 was computationally modeled and physico-chemical parameters determined and compared to the native molecule. This study emphasizes the importance of analyzing lineage-specific mutations and their potential impact on drug resistance mechanisms in M. tuberculosis.

Results
Lineage-wise distribution of resistance determining mutations in pncA. The genome sequence data of 254 phenotypically PZA resistant isolates representing four major lineages -18 isolates of lineage 1 (7.1%), 149 isolates of lineage 2 (58.7%), 32 isolates of lineage 3 (12.6%) and 55 isolates of lineage 4 (21.7%) -were analyzed to identify the resistance determining mutations in gene pncA (including 40 bp upstream region). Out of 254 strains investigated, resistance determining mutations were observed in 181 isolates (71.2%) comprising 8 isolates of lineage 1, 118 isolates of lineage 2, 24 isolates of lineage 3 and 31 isolates of lineage 4. Complete information for these mutations is provided in Supplementary Table S1. The lineage-wise distribution of resistance determining mutations and their frequency at each distinct position along the gene are shown in Fig. 1. This clearly showed an absence of overlap among lineages at 85% of the positions where mutations were identified, for example as shown at positions 202 and 385. Mutations that confer resistance in M. tuberculosis are generally acquired via convergent evolution in different lineages, an example being mutations in rpoB that confers resistance to rifampicin and are concentrated in the rifampicin resistance determining region 19 as shown in Supplementary Figure S1 (complete details available in Supplementary Information). By contrast, here we observed that resistance determining mutations were distributed across pncA with minimal overlap observed between strains of different lineages at any particular position.

Lineage-wise distribution of mutations in the complete operon (Rv2044c-pncA-Rv2042c).
Genome sequence data for 1480 M. tuberculosis isolates belonging to the major lineages were selected irrespective of phenotype from publically accessible genomes that were generated by a previous study 6 . The process for selecting isolates is described in the methodology. Our collection represented all four major lineages -141 isolates of lineage 1 (9.5%), 91 isolates of lineage 2 (6.1%), 315 isolates of lineage 3 (21. 3%) and 933 isolates of lineage 4 (63%). It was previously reported that pncA (Rv2043c) is co-transcribed as a polycistron along with Rv2044c located 40 base pairs (bp) upstream to pncA, and Rv2042c is located immediately downstream with a 1 bp overlap with pncA 20 . To capture this entire region, 1801bp sequence of H37Rv reference strain (NC_000962.3) corresponding to the complete operon (Rv2044c-pncA-Rv2042c) together with 85 bp upstream (to include the promoter region for the operon) was used for variant calling with GATK HaplotypeCaller. After filtering for variants using metrics described in the methodology, mutations were detected at 68 distinct positions in the operon. The lineage-wise distribution of all the single nucleotide polymorphisms (SNP) and insertion/deletion (indels) sites is shown in Table 1. Complete details of genetic variants at each position and their frequency are provided in Supplementary Table S2.  Table 1. Comparative statistics of single nucleotide polymorphism (SNP) and insertion/deletion (indels) sites observed among different lineages.  Table S2). Our analysis also identified variants at two positions that were highly specific to strains of a particular lineage (present in 97% of isolates belonging to that lineage and absent in the remaining isolates/lineages) as shown in Fig. 2. The first was a SNP (C1515G) in Rv2042c in lineage 2 isolates, and the second was a single nucleotide deletion (GCCG/232/GCG) in Rv2044c in lineage 3 isolates. Apart from the above mentioned two variants, SNP (C553T) in pncA was observed in 81% of lineage 3 isolates. This has been reported previously as a lineage 3-associated silent mutation (Ser65Ser) 19 , which was absent in a small number of basal isolates of this lineage.  The crystal structure of native pyrazinamidase has been determined by Petrella et al. 27 , and was available in the RSCB Protein Databank (PDB). An in silico evaluation was performed to determine a representative conformation of the variant pyrazinamidase enzyme in equilibrium. 3D modelling was performed using I-TASSER software, and refinement of the protein 3D model was performed in multiple steps to address the lack of homology to the extended region in the variant enzyme (see methodology). The final model with the highest confidence score contained 31.4% Alpha helix, 9.5% Strand, 2.9% 3-10 helix, and the remainder included mainly Coil, as shown in Fig. 4. The Ramachandran plot assessment of residues of the 3D model using RAMPAGE revealed 87.8% residues in favorable region, 7.3% in allowed region, and 5% comprised outliers. This final model was further subjected to MD simulations for 35 ns using Gromacs and the stability and conformational changes were analyzed. The flattening of the RMSD plot of the protein backbone around 35 ns in Fig. 5A indicated that the molecule achieved a stationary phase during the later stages of simulation and showed fluctuations around 5.4 A° at the end of simulations. The plot of gyration radius was also stable around 2.4 A° with no major modifications in the secondary structure of the protein, representing the compactness of the protein during the simulation as shown in Fig. 5B. The total energy trajectory remained stable over the entire simulation period at around −1.018e + 3 KJ/mol, as shown in Fig. 5C. Ramachandran plot analysis of the model obtained after MD simulation depicted 86.8% residues in favorable region, 10.3% in allowed region and 3.0% in disallowed regions.

Discussion
M. tuberculosis has co-evolved with humans over time, which has led to the emergence of different geographically compartmentalized bacterial lineages 28 . Genetic diversity of these lineages may influence both virulence and transmission potential 29 . In addition, strains from specific lineages have been shown to have an increased capacity to acquire resistance or mitigate the associated fitness cost 30 . This supports the importance of a detailed understanding of the effect of genetic background on biological traits, including drug resistance. The genetic  variation among lineages is also pertinent in relation to pyrazinamide where resistance conferring mutations are diverse and distributed across pncA. Hence, we analyzed the lineage wise distribution of mutations in the operon comprised of Rv2044c, pncA and Rv2042c. We identified an important frameshift deletion in Rv2044c affecting pncA that was restricted to M. tuberculosis lineage 3. This frameshift deletion disrupted the stop codon of Rv2044c, and the hybrid molecule composed of the two genes effectively increased length of PncA by 119 amino acids. As pncA is reported to be expressed as part of a polycistronic mRNA product of the operon 20 , any effect on expression due to this deletion would be minimal. This observation is supported by the evidence of enzymatic activity in lineage 3 isolates harboring this deletion, as reported recently by Miotto et al. 31 . In their study on pyrazinamide resistance, sequencing of pncA was carried out for 1950 clinical isolates and upstream promoter region (>100 bp) was also included for a small number of isolates.
This strongly suggests that the lineage-specific variation identified in this study does not disrupt pyrazinamidase enzyme activity and that a variant enzyme with an additional domain DUF2784 of unknown function is transcribed in isolates of lineage 3. Furthermore, structural refinement of the 3D model obtained from I-TASSER using MD simulations showed that the variant molecule is stable as displayed by the trajectories of RMSD, total energy and radius of gyration for 35 ns, which validates the conformation of the predicted structure. Minor fluctuations observed in the trajectories could be attributed to the presence of coil regions in the protein model, which are difficult to stabilize during the simulations. However, functional studies comparing the native and variant PncA are necessary to elucidate the role of the additional domain and its full impact on enzyme function.
To evolve from an environmental bacterium into a human restricted professional pathogen, M. tuberculosis could have selected many of the attributes that are found commonly in non-tuberculous mycobacteria (NTM) and smooth tubercle bacilli (M. canettii). A previous study that compared genomes of MTBC with 11 other related NTM species comprising both free living environmental bacteria and opportunistic pathogens 26 revealed that orthologs of Rv2042c and pncA were conserved in the majority of NTM except for M. leprae, which has undergone extensive genome reduction 32 . The third member of the operon (Rv2044c ortholog) was absent in the majority of NTMs including slow growing species such as M. marinum, M. leprae and M. avium, the exception being M. smegmatis. The evolution of different slow growing mycobacterial species from their common ancestor involved the acquisition or deletion of various important genes that are species-specific or shared by small number of slow growing mycobacteria 33,34 . The presence of orthologs of Rv20444c in a limited number of slow growing mycobacterial species and all MTBC members indicates its importance in the latter. Although all members of the MTBC harbor an ortholog of Rv2044c, the evolutionary fixation of the frameshift deletion in lineage 3 isolates identified here strongly suggests that the variant pncA represents a beneficial adaptive change in the associated geographical setting.
A recent multi-country surveillance report which assessed levels of pyrazinamide and rifampicin resistance in five different high burden countries observed that levels of pyrazinamide resistance were significantly lower compared to the levels of rifampicin resistance only in Pakistan 35 . Although this study did not report lineage information, other reports have shown the predominance of lineage 3 in Pakistan and Northern India (Delhi) [36][37][38] . This raises the possibility of alternative mechanisms of pyrazinamidase enzyme action in these strains. Therefore, mechanisms by which M. tuberculosis lineage 3 strains acquire PZA resistance need careful investigation.
The genetic background constituted by lineage specific non-resistance conferring mutations might influence drug resistance and amelioration of their impact through epistatic interactions 17 . To our knowledge, our study is the first to report the effect of a lineage-specific non-resistance mutation on the drug resistance associated gene, pncA. Moreover, this study highlights the need for examination of lineage-specific variations, mainly indels whose downstream effect is commonly not estimated compared to SNPs. Minimal overlap in the position of variants between lineages could be affected by the limited number of isolates tested for some lineages. Future studies of large bacterial collections containing under-sampled lineages promise to yield better resolution and provide deeper insights into the dynamics of resistance acquisition.

Methods
Screening for mutations. Intitially, the read data of 254 strains reported as phenotypically resistant to PZA were considered from two previous studies 6,37 in order to identify the resistance determining mutations along the gene pncA. The sequence reads were aligned to the region of H37Rv corresponding to pncA gene including 40 bp upstream region. Further, to identify all the genetic variants along the operon, out of 3651 genome sequences made available by a previous study from Walker et al. 6 , the read data that fulfilled the following criteria were chosen randomly irrespective of phenotype -read length of 75 bp,100 bp and 101 bp with coverage greater than or equal to 100× (complete accession details are provided in Supplementary Table S3). These were downloaded from the NCBI sequence read archive using the SRA tool kit 39 . The sequence reads belonging to different lineages of M. tuberculosis were aligned to the region of H37Rv corresponding to (Rv2044c-pncA-Rv2042c) genes and 85 bp upstream to identify variants.
Variant calling and annotation. This variant calling step involved generation of an alignment file using bwa-mem 40 and further processed using Picard tools 41 to mark duplicate reads before applying the GATK HaplotypeCaller 42 . All variations called in the VCF (variant call format) file were filtered for metrics -QUAL (>50), DP (>10), MQ (>40), QD (>20). Further, filtering based on AD values was performed to include only those sites where reads supporting alternate allele were greater than or equal to 75 percent of total reads aligned at that position. The annotation of variants and identification of resistance determining ones was done using in-house written python scripts. The effect of lineage specific mutations were visualized using Expasy translate tool 43 and confirmed with gene prediction software GeneMarkS 44 .

Domain identification and structure analysis of variant protein. The prediction of domains was
done using NCBI CD-search 21 . Physico-chemical properties (molecular weight, theoretical PI, instability index 45 , aliphatic index 46 , GRAVY score 47 ) were determined for the native and variant proteins using Prot-Param server 48 . The three-dimensional (3D) model of the variant protein comprising 305 amino acids (aa) was initially predicted by ab-inito modeling in I-TASSER online server 49 . The top 10 threading templates detected by the I-TASSER included a template file corresponding to the original pyrazinamidase crystal structure (3PL1-186 aa) PDB as the top most alignment, but also none of the other templates had any alignment to the newly added region of the protein. In order to overcome this effect, the first 119 aa sequence which did not have any reported alignment was modeled ab-initio separately again using I-TASSER server. In the last step, the model obtained in previous run for first part of the protein was provided as a user template while submitting complete sequence to I-TASSER for final 3D modeling.
The best model obtained in final run was chosen based on C-score and visualized in PyMOL 50 before subjecting to structure refinement and validation analysis. Molecular Dynamics (MD) simulations were performed using GROMACS version 5.1.4 51 . This mainly included topology generation with GROMOS96 54a7 force field in an aqueous environment, solvation of defined cubic box, neutralization of charge using 4 sodium ions and geometry optimization. Initial unconstrained global dynamics was carried out in two steps -temperature (300 K) for 100 picoseconds followed by Pressure (1 bar) for 100 picoseconds. The final step of MD simulations for 35 nanoseconds was performed at pressure (1 bar) and temperature (300 K).The Ramachandran plots of variant pyrazinamidase before MD simulation and final one obtained after MD simulation were compared using RAMPAGE tool 52 .
Data Availbility Statement. All the read data analysed is availble on NCBI and accession details are provided.