Structural, functional and molecular dynamics analysis of cathepsin B gene SNPs associated with tropical calcific pancreatitis, a rare disease of tropics

Tropical Calcific Pancreatitis (TCP) is a neglected juvenile form of chronic non-alcoholic pancreatitis. Cathepsin B (CTSB), a lysosomal protease involved in the cellular degradation process, has recently been studied as a potential candidate gene in the pathogenesis of TCP. According to the Cathepsin B hypothesis, mutated CTSB can lead to premature intracellular activation of trypsinogen, a key regulatory mechanism in pancreatitis. So far, CTSB mutations have been studied in pancreatitis and neurodegenerative disorders, but little is known about the structural and functional effect of variants in CTSB. In this study, we investigated the effect of single nucleotide variants (SNVs) specifically associated with TCP, using molecular dynamics and simulation algorithms. There were two non-synonymous variants (L26V and S53G) of CTSB, located in the propeptide region. We tried to predict the effect of these variants on structure and function using multiple algorithms: SIFT, Polyphen2, PANTHER, SDM sever, i-Mutant2.0 suite, mCSM algorithm, and Vadar. Further, using databases like miRdbSNP, PolymiRTS, and miRNASNP, two SNPs in the 3′UTR region were predicted to affect the miRNA binding sites. Structural mutated models of nsSNP mutants (L26V and S53G) were prepared by MODELLER v9.15 and evaluated using TM-Align, Verify 3D, ProSA and Ramachandran plot. The 3D mutated structures were simulated using GROMACS 5.0 to predict the impact of these SNPs on protein stability. The results from in silico analysis and molecular dynamics simulations suggested that these variants in the propeptide region of Cathepsin B could lead to structural and functional changes in the protein and thus could be pathogenic. Hence, the structural and functional analysis results have given interim conclusions that these variants can have a deleterious effect in TCP pathogenesis, either uniquely or in combination with other mutations. Thus, it could be extrapolated that Cathepsin B gene can be screened in samples from all TCP patients in future, to decipher the distribution of variants in patients.


INTRODUCTION
. The minor allele frequency (MAF) of variants L26V and S53G in TCP patients were 0.46 and 0.09 respectively. Also in 2008, an article suggested that coexistence of variants in transcription factor 7-like 2 (TCF7L2), SPINK1 and CTSB (L26V), might lead to exocrine damage in TCP and determine the onset of FCPD (Mahurkar et al., 2008). The analysis in this paper (Mahurkar et al., 2008) was performed with TCP patients and control population from Dravidian and Indo-European ethnicities. There is another article which described a missense mutation (p. Q334P) in cathepsin B gene discovered in chronic pancreatitis patients but not in TCP patients (Xiao et al., 2017). An article in 2014 illustrated no association of L26V mutation with TCP (Singh, Choudhuri & Agarwal, 2014). In this paper, statistical significance tests have indicated the lack of association of L26V mutation with TCP, but this mutation was observed in seven out of 150 TCP patients. Hence, the association of this mutation with TCP cannot be completely disregarded (Singh, Choudhuri & Agarwal, 2014). Although researchers have identified SNVs in CTSB gene observed in TCP patients, lacunae lie in the information about the functional effect of these SNVs in the pathogenesis of TCP.
Human Cathepsin B (catB, E.C 3.4.22.1) is a lysosomal cysteine protease which is involved in several cellular processes like protein degradation, extracellular matrix degradation, regulatory mechanisms, cell death, autophagy and antigen representation (Olson & Joyce, 2015). It belongs to papain superfamily and acts both as an endopeptidase and as an exopeptidase. Cathepsin B is synthesized as an inactive proenzyme (cathepsin B) and is activated by other proteases and by autocatalytic processes (Pungerčar et al., 2009). Procathepsin B (length of protein = 339aa) has an N-terminus propeptide of 62 amino acid length from Arg-Lys (18-79 residues). Signal sequence (1-17 residues) and post-translational glycosylation modification (Katunuma, 2010) targets cathepsin B to endosomes/lysosomes (Ghosh, Dahms & Kornfeld, 2003) via mannose-6-phosphate receptor pathway. Propeptide exhibits an essential role in the processing and maturation of cathepsin B. It acts as (a) a scaffold for catalytic domain during protein folding, (b) involved in intracellular trafficking of cathepsin B to lysosome after N-terminal glycosylation and phosphorylation and (c) as a high-affinity reversible inhibitor for the premature activation of zymogen. The crystal structure of procathepsin B [PDB ID: 3PBH] has a propeptide region [ArP1 to LysP62], and main chain [Leu1 to Asp254] enzyme residues (Podobnik et al., 1997). The main chain has two domains (R and L domains) with active site residues Cys29 and His199, located at the interdomain cleft. The propeptide siting in the active site cleft is in reverse direction to that of the substrate, thus suggesting its role as an inhibitor. The structure has an ''occluding loop'' [Ile105 to Pro126] which has an alternate conformation in propeptide and in mature enzyme. The occluding loop is lifted above in procathepsin B, while it is tightly packed in the active enzyme, thus exposing the active sites in Cathepsin B. Procathepsin B is activated by other proteases like cathepsin D and, also by autoactivation. The potential intermolecular cleavage site identified in cathepsin B is CystP42-GlyP43. At low pH, acidic residues at propeptide surface destabilize propeptide secondary structure, resulting in distortion of hydrophilic and hydrophobic interaction with mature region of protein. Subsequently, intermolecular cleavage takes place, and propeptide gets completely dissociated from mature enzyme. Thus, autoactivation is a bimolecular process. Once CTSB gets activated, it activates trypsinogen (Halangk et al., 2000). Mutations affect different regions of Cathepsin B protein but how these variants affect the function of Cathepsin B is yet to be studied.
Since CTSB plays a cardinal role in premature trypsinogen activation, therefore in the present study, we decided to analyze computationally the functional and structural effect of the missense variants identified in the previous study (Mahurkar et al., 2006), in order to determine the clinical significance of these mutations in TCP pathogenesis. We predicted the effect of these coding variants in the propeptide region of cathepsin B, using various in silico algorithms. Also, we predicted that variants present in 3 UTR region (noncoding) in cathepsin B are associated with miRNA binding sites, and hence they could be significant. Evidential results from the structural and functional analysis of SNVs in Cathepsin B have implicated the potential role of these variants in the pathogenesis of tropical calcific pancreatitis. This study is the first attempt to structurally and functionally characterize the variants found in human Cathepsin B protein screened in TCP patients.

Data curation
The single nucleotide polymorphisms (SNPs) in CTSB gene associated with TCP was extracted from an article published in 2006 (Mahurkar et al., 2006). In this article, researchers have done direct exome sequencing of CTSB gene, taking samples from 25 controls and 51 TCP patients, and further replicating the sequencing in 130 controls and 89 TCP patients from the same cohort, in order to ensure their results. In the current study, we have mapped the SNPs extracted from the literature on current human genome assembly, GRCh38.7. The mRNA accession number, NM_147782.2, and protein accession number, NP_680092.1 of gene Cathepsin B (CTSB), was used in our computational analysis. The current data about these SNVs were retrieved from human SNP database, dbSNPbuild150. The workflow for the computational analysis performed to decipher the significance of SNPs is depicted in Fig. 1.

Sequence retrieval and alignment
The sequence of Cathepsin B (CTSB) was retrieved from UniProt database: P07858 (CATB_HUMAN). The non-synonymous variants (L26V and S53G) were manually inserted in the wild-type protein sequence for further analysis.

Non-synonymous SNP analysis
The functional effect of mutations was predicted using the following algorithms: SIFT (Sorting Intolerant from Tolerant) (Ng & Henikoff, 2003), PolyPhen-2 (Polymorphism Phenotyping v2) (Adzhubei et al., 2010) and PANTHER (Mi et al., 2013). SIFT predicts whether the non-synonymous coding mutation affects protein function or not, based on sequence homology and physical properties of amino acids. SIFT calculates median conservation value for each amino acid position and thus measures the diversity of sequence. Finally, it gives the score which is the normalized probability of an amino acid change. Score of less than 0.05 are deleterious substitutions. PolyPhen-2 predicts the impact of substitution on structure and function of a protein, after annotating the substitutions and finally building conservation profiles. The prediction algorithm of PolyPhen-2 calculates Naïve Bayes posterior probability about the damaging effect of mutation and gives prediction sensitivity scores also. PolyPhen-2 annotates the substitution as ''Possibly damaging'', ''Probably damaging'' or ''Benign'', based on their scores. PANTHER predicts the functional effect of coding mutation based on ''evolutionary preservation'' metric of a given substitution and calculates preservation time-position-specific evolutionary preservation (PESP). Longer the PESP time, the more likely that substitution will have a deleterious effect. All these softwares: SIFT, PolyPhen-2, and PANTHER, were based on evolutionary conservation-based algorithms. Another tool, ProtParam (Bendtsen et al., 2004) was used to calculate the hydropathicity or the GRAVY (grand average of hydropathicity) score (Kyte & Doolittle, 1982) of mutated procathepsin B sequences. Hydrogen bond length and the rotational angles of main chain hydrogen bonds is a significant descriptor to study the conformation and dynamics of a protein. Therefore, to calculate the altered hydrogen bonding patterns in the mutated three-dimensional procathepsin B structures, Vadar v1.8 program (Willard et al., 2003) was used. This program calculates the H-bond distances in the main chain, side chain and the bond angle between main chain residues.
Further, the stability of the mutated models was calculated by SDM (Site Directed Mutator) server (Worth, Preissner & Blundell, 2011), I-Mutant suite (Capriotti, Fariselli & Casadio, 2005) and mCSM (Mutation Cutoff Scanning Matrix Calculation) (Pires, Ascher & Blundell, 2014) webservers. These algorithms calculate the difference in change in Gibbs free energy ( G). SDM server is used to calculate the difference in thermal stability of wild-type protein structure and mutated protein structure, using constrained environment specific substitution tables (ESSTs). I-Mutant suite is a support vector machine-based algorithm which predicts the protein stability upon mutation, by taking datasets from ProTherm (Bava et al., 2004) database. This server also calculates the change in Gibbs free energy ( G) between wild-type and mutant protein structures. mCSM server is used to predict the effect of mutations in proteins using graph-based signatures. It predicts the effect of single point mutation on protein stability by extracting data sets about various thermodynamic parameters from the ProTherm database and calculates change in Gibbs free energy ( G) between wild-type and mutant protein structures. Altogether, these three algorithms calculate G for a protein on mutation. The change in Gibbs free energy ( G) is as follows: in Gibbs free energy of a system from unfavourable to favourable condition H = Change in enthalpy of the system S = Change in entropy of the system, T = Temperature of the system G = Value of free energy stability change of a protein upon mutation G W = Change in Gibbs free energy of the wild-type protein from unfavourable to favourable conditions G m = Change in Gibbs free energy of mutant from unfavourable to favourable conditions. G > 0 = Increase protein stability upon mutation G < 0 = Decrease protein stability upon mutation.

Homology modeling
The two non-synonymous SNPs (L26V and S53G) were modelled to analyze the structural effect of variants on protein. The position specific iterated blast program (PSI-BLAST) with protein databank database (PDB) and default advanced settings, was used to find the template for homology modelling. MODELLER 9.15 (Šali & Blundell, 1993) was used to build mutated models of procathepsin B. The best predicted models according to the lowest value of DOPE score (Discrete Optimized Protein Energy), was used for further evaluation and analysis. The predicted 3D mutated models, L26V and S53G, were evaluated for their quality by using Verify-3D (Eisenberg, Lüthy & Bowie, 1997) and ProSA (Protein Structure Analysis) servers (Wiederstein & Sippl, 2007). Verify-3D examines the correctness of 3D-structures by comparing the 3D structure to the 1D structure. If 3D-1D score of each amino acid is ≥ 1, then the model is correct. ProSA analysis the correctness of theoretical models by calculating the Z -score of the input structure. ProSA considers C-alpha atoms of protein structure and calculates Z -scores based on the similarity of crystal and NMR structures of the same size. If the Z -score for a model is negative, then it is a model with minimum or no errors. After these models pass the respective thresholds, the Ramachandran plot was evaluated (Lovell et al., 2002). TM-score (Zhang & Skolnick, 2005) and root mean square deviation (RMSD) values of mutant structures (L26V and S53G), were calculated with respect to wild-type by using TM-Align web server (Berendsen, Van der Spoel & van Drunen, 1995).

Molecular dynamics simulation
The molecular dynamic simulation was performed with the Gromacs-5.0 package (Kutzner et al., 2015) on the native (PDB ID: 3PBH) and mutant structures (S53G and L26V). This computational investigation was done with a viewpoint to examine if these single nucleotide variants might lead to changes in surface properties or distort the protein orientation. The protein molecule was solvated in a dodecahedron box with SPC216 water molecules at 1.5 Åmarginal radiuses. The system was made neutral by adding 7 Na+(Sodium ions) because the initial charge of the system is −7. Subsequently, the molecular system was subjected to steepest distance energy minimization until reaching the criterion of 1,000 kJ/mol (the minimization is converged when maximum force is less than 1,000 kJ/mol) with OPLS-all atom force field (Kutzner et al., 2015). Berendson temperature coupling method (Berendsen et al., 1984) was used to regulate the temperature inside the box at 300 k. Isotropic pressure coupling was performed using Parinello-Rahman method (Martoňák, Laio & Parrinello, 2003), and the pressure of the system was maintained at 1 bar. LINCS algorithm (Hess et al., 1997) was used to treat bond lengths including H-bonds. Van der Waals and Coulomb interactions were truncated at 1nm, and Particle Mesh Ewald method (Darden, York & Pedersen, 1993) was used to compute electrostatic interaction. Finally, the simulation was performed for 35ns. The structural deviations between native and mutated structures were subjected to comparative analysis by computing RMSD (Root mean square deviation) and RMSF (Root mean square fluctuation). The trajectories were analyzed, and finally, protein compactness was studied by calculating the radius of gyration (Rg). The secondary structure analysis of wild-type and simulated mutant structures was also done using the do-dssp program of Gromacs.

Data curation
The SNPs associated with TCP were extracted from literature and is tabulated in Table 1. The SNPs were further categorized according to their type. There were total of 23 SNPs  in CTSB gene found to be associated with TCP. The non-coding region included 20 SNPs (one deletion, six in 5 UTR, two in 3 UTR and 11 in introns). Coding region had 2 missense variants and 1 synonymous variant (Table 1).

Sequence retrieval and alignment
The protein sequence (NP_680092.1) was retrieved from the UniProt database, and the desired variants were manually inserted in the sequence Fig. 2. The UniProt ID for the procathepsin B sequence is P07858 (CATB_HUMAN). Mutant 1 (L26V) where residue L (Leucine) is substituted with residue V (Valine). Mutant 2 (S53G) where residue S (Serine) is substituted with residue G (Glycine).

Homology modeling
The 3D-structure of mutant (L26V and S53G) proteins was predicted after template searching by PSI-BLAST. The protein sequence (NP_680092.1) of procathepsin B was used as a query, and the resulting templates were then filtered. Finally, the X-ray crystal structure of human procathepsin B (PDB ID: 3PBH) with 2.5 Å resolution (Sequence identity: 100% and Query coverage: 93%) was used as a template for homology modeling. The DOPE scores, TM-scores, and RMSD of the predicted best models by Modeller 9.15 are shown in Table 2. The stereochemical properties of the mutated procathepsin B structures were evaluated using the Ramachandran plot from RAMPAGE. The plot defines the amino acids in favoured, allowed and outlier regions in the mutated structures as well as in the wild-type Cathepsin B structures (Table 3). Verify-3D did structure validation of predicted models, and it was observed that 99.05% amino acids had average 3D-1D protein score in a 21 residue sliding window ≤ 0.2 for L26V mutated model and 97.16% amino acids had average 3D-1D protein score in a 21 residue sliding window ≤ 0.2 in S53G mutated model. Additionally, the ProSA web server was also used to evaluate the quality of predicted 3D mutated models. The Z -score (by the ProSA webserver) of L26V model was −7.32 and of S53G model was −7.47, which were within the acceptable range of X-ray and NMR studies. The interaction energy analyzed by ProSA tool was negative for maximum residues in L26V, and S53G predicted models, in a sliding window of 10 and 40 respectively. Since the mutations were present in the propeptide region of procathepsin B, only the mutated propeptide structures are shown in Fig. 3 and Fig. 4.

Non-synonymous SNP analysis
The functional effect of the mutations predicted by using algorithms described in the ''Methods'' section are tabulated in Table 4. Analysis by SIFT and PANTHER suggests that S53G and L26V mutations can have damaging effects. The GRAVY scores of wild-type WT was −0.470, and the mutants (L26V and S53G) was −0.469 for both. Thus, it could be concluded that there is no significant effect of mutations on hydropathicity of the protein.
The comparative analysis of hydrogen bond lengths and the rotational angles between WT and mutants were calculated using Vadar v1.5 server, at 10 different regions of protein,  playing imperative role in the functioning of procathepsin B. Remarkable differences in the H-bond lengths and Bond angles between WT and mutants (S53G and L26V) were observed as represented in Fig. 5. Hence, it could be interpreted that these mutations alter the binding between residues in mutated structures., The change in Gibbs free energy ( G) of the mutated structures calculated by SDM server, I-Mutant 2.0, and mCSM webservers indicatedestabilization of mutated proteins (Table 5). Additionally, the results of MSA revealed that leucine at the 26th position and Serine at the 53rd position are conserved residues. Extrapolating these results indicate that any alteration in conserved residues will affect the structure and function of the protein (Fig. 6). Altogether, L26V and S53G mutations were predicted to have a deleterious effect on the structure and function of the protein, through non-synonymous SNP analysis algorithms.

Analysis of SNPs in UTR region
UTRs play an essential role in mRNA processing during post-translational mechanism. Hence, the SNPs in the UTR region can significantly affect the functionality of UTRs, provided they affect the miRNA binding sites. The 3 UTR region is essential for microRNA (miRNA) binding which can lead to degradation or transcriptional suppression of mRNA and thus, can further affect the downstream processing. The databases miRdbSNP, PolymiRTS, miRNASNP, were used to predict the significance of SNPs in the 3 UTR region ( Table 6). The two SNPs present in 3 UTR region, rs709821 and rs8898, were predicted to be present in miRNA binding sites and therefore are significant. The SNP, rs8898 was predicted to create a new miRNA site, while rs709821 disrupt a non-conserved miRNA site as predicted by PolymiRTS database. The two miRNAs targeting CTSB gene having both non-coding variants, hsa-miR-96 and hsa-miR-1271, are pancreas specific miRNAs, deciphered from the miRNet database (Fan et al., 2016).

Molecular dynamics simulations
The comparative analysis of trajectories by calculating RMSD, RMSF and Radius of gyration after MD simulation of 35 ns, for both native and mutants, was performed. Interestingly, it was observed from RMSD of backbone residues, that both mutated structures (S53G and L26V) were conspicuously deviated from the native structure (PDB ID: 3PBH). To infer effect of mutations on the dynamic behaviour of each residue, RMSF of C α-atoms was calculated. There was fluctuation observed in mutated protein structures as compared to the native structure. The protein compactness was determined by the radius of gyration (Rg). It was observed that Rg of mutated structures were distinctly fluctuated as compared to native structure, throughout the simulation (Fig. 7). The secondary structure analysis of wild-type and simulated mutant structures, by do-dssp program, implicated that mutations had caused deviation to the protein (Fig. 8).

DISCUSSION
Tropical Calcific Pancreatitis (TCP) has distinct morphological characteristics with undefined etiology. The propeptide region of procathepsin B (Arg1 -Lys 62) i.e., the N-terminal part inhibits the activity of Cathepsin B in the pancreas, thereby regulating its premature activation, also act as a scaffold for protein folding and as a chaperone for endosome/lysosomal trafficking. L26V and S53G are the two missense variants observed in the propeptide region of Cathepsin B protein in TCP patients. The in-silico SNP analysis of the mutated protein sequences, resulted in alteration of secondary structure, thereby predicting an adverse folding effect on the protein. The phenotypic effect of the mutations was calculated using sequence analysis algorithms, and it was observed that at least two algorithms indicated a deleterious effect of these mutations on protein functionality. The free energy ( G) calculations of mutated proteins structures, by various algorithms, indicated that mutations are destabilizing. Further, the comparative analysis of H-bond distances between mutated and native 3D-structure of procathepsin B provided a unique information about the structural characteristics of motifs around main chain H-bonds which are altered in mutant protein structures, thereby affecting the function of the protein (Penner et al., 2014). Additionally, MD simulation of mutated and native protein structures indicated that the mutations distinctly deviate the structural conformation of procathepsin B, thereby having a deleterious effect on downstream signalling mechanism. Thus, the structural and functional analysis of mutated procathepsin B predicts the significance of these mutations in the propeptide region of Cathepsin B. Hence, we could extrapolate from these results (in silico analysis of the mutated structures and sequences) that both mutations (L26V and S53G) have a deleterious effect on structure and function of protein.
These results will provide a lead towards designing the experimental research strategy on the mutations involved in the pathogenesis of TCP to understand the disease etiopathogenesis.

CONCLUSION
The dearth of information about the etiopathogenesis of tropical calcific pancreatitis was the driving force for this study. The literature only has the information about the SNVs in the cathepsin B gene associated with TCP and lacks the crucial theories about the relative effects of these SNVs in the pathogenesis of TCP. In this study, we predicted the structural and functional effect of cathepsin B SNVs which were identified in TCP patients in previous studies. The predicted deleterious effect of these SNVs is a lead towards developing