Potential protein activity modifications of amino acid variants in the human transcriptome

Background: The occurrence of widespread RNA and DNA sequence differences in the human transcriptome was reported in 2011. Similar findings were described in a second, independent publication on personal omics profiling, which investigated the occurrence of dynamic molecular and related medical phenotypes. The suggestion that the RNA sequence variation was likely to affect disease susceptibility prompted us to investigate, with a range of algorithms, the amino acid variants reported in the identified peptides to determine whether they might be disease-causing. Results: The predictive qualities of the different algorithms were first evaluated with nonsynonymous single-base nucleotide polymorphism (nsSNP) datasets, using independently established data on amino acid variants in several proteins as well as data obtained by mutational mapping and modelling of binding sites in the human serotonin transporter protein (hSERT). The predictive algorithms were validated at a level of 75%. Using the same algorithms, we found that widespread RNA and DNA sequence differences were predicted to impair the function of the peptides in over 57% of cases. Conclusions: Our findings suggest that a proportion of edited RNAs which serve as templates for protein synthesis is likely to modify protein function, possibly as an adaptive survival mechanism in response to environmental modifications.


INTRODUCTION
In a widely discussed publication, Li and coworkers reported the occurrence of widespread RNA and DNA sequence differences (RDD) in the human transcriptome (Li et al., 2011). The authors emphasized the consistent pattern of the observations and concluded that the RDDs had biological significance and were not just "noise" (Li et al., 2011). The nature of the amino acid variants in the novel RDD peptides was investigated by mass spectrometry. The authors also suggested that the RNA sequence variation was likely to affect disease susceptibility by modifying the function of the protein. Subsequently, Li and coworkers (2012) responded to comments which did not dispute their findings but disagreed on the number of RDDs. Li and coworkers (2012) also pointed out that several other research groups had reported similar phenomena. Whereas Li and coworkers (2011) had sequenced peptides from immortalized B cells in culture as well as from some primary skin cells and brain tissue, Chen and coworkers (2012) identified other forms of RNA variants, henceforth RNA-edits, in circulating white blood cells. Despite the expected difference in gene expression between in vitro and in vivo cells of different origins, both studies reported RNA editing, which according to our results often leads to modified protein activity. The role of technical artifacts due to sequencing or sequence mapping was also widely discussed; nevertheless, such artifacts only partially explain the discovered RDDs (Pickrell et al., 2012). This would suggest that RNA editing is a general, widespread mechanism that can affect phenotype and should be further investigated.
Until recently nsSNPs have been the major group of amino acid polymorphisms associated with protein stability. Worth and coworkers have estimated that up to 80% of disease-associated nsSNPs are associated with protein stabilization effects (Worth et al., 2007). It is possible that the amino acid variants reported by Li and coworkers (2011) have the potential to alter the function of the RDD peptides. By using algorithms such as Polyphen-2 (POLYmorphism PHENotyping ver. 2) (Adzhubei et al., 2010) and SIFT (Sorting Intolerant From Tolerant) (Kumar et al., 2009) it should be possible to determine how amino acid variants might change the function of the RDD peptides and of peptides arising from other similar phenomena. The practical application of the two algorithms is now so well established that they are integrated into Ensembl [www.ensembl.org/index.html]. Other available algorithms are PANTHER (Protein Analysis Through Evolutionary Relationships) (Mi et al., 2007), PhD-SNP (Predictor of human Deleterious Single Nucleotide Polymorphisms) (Capriotti et al., 2006) and SNAP (Synonymous Non-synonymous Analysis Program) (Bromberg et al., 2008). By using the five foregoing algorithms we have evaluated the data from Li et al. (2011) and Chen et al. (2012) and provide further insight into the predictive properties of the algorithms. All of the algorithms are widely known and recommended as in silico tools for assessing the functional effect of nsSNPs (Patnala et al., 2013). We also validate the predictive properties of the algorithms by drawing upon amino acid variant data produced in different studies (Allali-Hassani et al., 2009; Andersen et al., 2010). In addition, the RNA editing of the GluR2 transcript GRIA2, a process which results in the variant 607Arg instead of Gln607 (Mercucci et al., 2011), has also been investigated.

* e-mail: Joanna Zyla: joanna.zyla@polsl.pl; Robert A. Bulman: robert.bulman@phe.gov.uk
Abbreviations: APF, affects protein function; BLAST, the basic local alignment search tool; BNG, benign; D, disease-causing; DNA, deoxyribonucleic acid; DSLS, differential static light scattering; hSERT, human SERotonin transporter protein; MSC, median of conservation value; N, neutral; NN, non-neutral; NSP, number of sequence at position; nsSNP, nonsynonymous single-base nucleotide polymorphism; PANTHER, protein analysis through evolutionary relationships; PhD-SNP, predictor of human deleterious single nucleotide polymorphisms; Polyphen-2, polymorphism phenotyping ver. 2; PRD, probably disease-causing; PSD, possibly disease-causing; RDD, RNA and DNA sequence differences; RI, reliability index; RNA, ribonucleic acid; Sens : Spec, sensitivity and specificity; SIFT, sorting intolerant from tolerant; SNAP, synonymous non-synonymous analysis program; subSPEC, substitution position-specific evolutionary conservation; TOL, tolerated

METHODS
The disease-causing potential of the variants was first evaluated by using the algorithms Polyphen-2 (Adzhubei et al., 2010) and SIFT (Kumar et al., 2009). Subsequently, the other available algorithms, PANTHER (Mi et al., 2007), PhD-SNP (Capriotti et al., 2006) and SNAP (Bromberg et al., 2008), were also used to evaluate the data from Li et al. (2011) and Chen et al. (2012) and to provide further insight into the predictive properties of the algorithms. BLAST (The Basic Local Alignment Search Tool) (Altschul et al., 1992), a rapid sequence similarity search tool, evaluates a submitted protein sequence to detect high degrees of amino acid sequence similarity. The algorithms then check the position of an amino acid variant throughout all available sequences. An amino acid that occurs only rarely at that position among the similar sequences has the potential to impair the protein's function. More detail is available in reviews on the function of the algorithms (Capriotti et al., 2006; Mi et al., 2007; Bromberg et al., 2008; Kumar et al., 2009; Adzhubei et al., 2010).
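The idea of checking how often a variant residue occurs at a given position among similar sequences can be illustrated with a minimal sketch. It is not the scoring scheme of any of the five algorithms; the function names and the toy alignment are hypothetical, and the sequences are assumed to be already aligned to equal length.

```python
from collections import Counter


def residue_frequencies(aligned_seqs, position):
    """Relative frequency of each residue in one alignment column.

    `position` is 1-based; gap characters ('-') are ignored.
    """
    column = [seq[position - 1] for seq in aligned_seqs
              if seq[position - 1] != "-"]
    counts = Counter(column)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {residue: n / total for residue, n in counts.items()}


def variant_frequency(aligned_seqs, position, variant_residue):
    """Frequency of the variant residue at `position` (0.0 if never seen)."""
    return residue_frequencies(aligned_seqs, position).get(variant_residue, 0.0)


# Hypothetical toy alignment, not real protein data.
toy_alignment = ["MKDLA", "MKDLA", "MKELA", "MRDLA", "MK-LA"]

# Glu (E) is seen in one of the four non-gap residues at position 3.
freq_e = variant_frequency(toy_alignment, 3, "E")
# Trp (W) is never seen at position 3.
freq_w = variant_frequency(toy_alignment, 3, "W")
```

In this toy case a substitution to a residue that is never observed at the position would be the more suspicious one; the real tools combine such conservation information with further features and their own thresholds.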

Validation of algorithms.
A validation of the predictive properties of the algorithms was sought by using the amino acid variants reported by Allali-Hassani and coworkers (2009), who noted that approximately 75% of the mutations affect the biochemical function directly. Allali-Hassani and coworkers (2009), in their extensive account, assessed the thermostability of nsSNP variants and wild type proteins by differential static light scattering (DSLS), which is essentially an aggregation-based method for assessing protein stability. The authors demonstrated that 46 nsSNP amino acid variants altered the stability and activity of 16 human enzymes (Allali-Hassani et al., 2009). In addition, amino acid substitution data reported by Andersen et al. in their account of mutational mapping and modelling of binding sites in the human serotonin transporter (Andersen et al., 2010) have also been used to ascertain the performance of the algorithms.
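One simple way to express such a validation is the fraction of variants for which a prediction matches the experimental observation. The following is a minimal sketch under that assumption; the function name and the example values are hypothetical and are not taken from the cited datasets.

```python
def agreement_rate(predicted, observed):
    """Fraction of variants where prediction and experiment agree.

    Both arguments are equal-length sequences of booleans
    (True = variant affects the protein, False = it does not).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        raise ValueError("at least one variant is required")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)


# Hypothetical calls for four variants, for illustration only.
predicted_calls = [True, True, False, True]
observed_effects = [True, False, False, True]
rate = agreement_rate(predicted_calls, observed_effects)
```

Here three of the four hypothetical calls match, so `rate` is 0.75. Sensitivity and specificity could be derived in the same way by counting agreements separately for affected and unaffected variants.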

RESULTS
Validation of algorithms
The predictive qualities of the algorithms have been evaluated (Table 1) by using nsSNP data which had already been probed by the biophysical and chemical techniques described by Allali-Hassani and coworkers (2009). Data reported by Allali-Hassani and coworkers are included in the first two columns of Table 1. Further classification of a disease-causing status is indicated in the note attached to Table 1. Examination of Table 1 indicates that over 83% of the nsSNPs in PKM2 are predicted to be disease-causing when ∆T_agg values range from -2.8 to -11.1. For PRMT3, 70% of the nsSNPs are classified as disease-causing when ∆T_agg values are -3.2 to -5.8.
We have also evaluated the predictive properties of the algorithms by using data published by Andersen and coworkers (2010), who examined the influence of amino acid variants on the transport of (S)-citalopram by hSERT (Table 2). The authors noted these characteristics: (i) extension of the acidic side chain by one methylene unit (Asp98Glu) caused a 15-fold loss of potency of (S)-citalopram; (ii) Asn-177 contributes to formation of the surface region in the substrate binding pocket of hSERT; (iii) Phe-341 is a major determinant of the contour of the binding site of hSERT. Again, the predictions of the algorithms support the independently acquired data reported by Andersen and coworkers (2010).
The variant amino acid 607-arginine in the GluR2 transcript GRIA2 (Mercucci et al., 2011) abolishes the 100% impermeability to calcium. Of the five algorithms, Polyphen-2 and SIFT fail to predict this established variation in the phenotype.

DISCUSSION
Our findings suggest that a proportion of edited RNAs which serve as templates for protein synthesis is likely to modify protein function. The process of editing RNA, which is still not well understood, is possibly an adaptive survival mechanism in response to environmental modifications. By checking an amino acid sequence for a single variant, it is possible to identify a potential effect that might induce an alteration of clinical significance in the organism. The predictions by the algorithms of disease-causing states for some amino acid variants are important for clinical practice. As shown in Table 1, the value of such predictions is strengthened by the results reported by Allali-Hassani and coworkers (2009). This outcome indicates that the evaluated algorithms have a satisfactory level of prediction and as such are in agreement with the experimentally obtained data reported by Allali-Hassani and coworkers (2009). However, when the algorithms are applied clinically it is necessary to recognize that there are methodological differences which might lead to a variation in the results (Hicks et al., 2011). We suggest that checking RNA editing by only one of the methods is not reliable. It is also important to recognize that the algorithms indicate only a possible effect and as such might not be wholly adequate for medical diagnosis. Nevertheless, all of them are helpful tools for identifying a potentially relevant and interesting polymorphism among candidates and could also help to assess the effects of RNA editing.
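Because a single method is not considered reliable, one way to combine the five algorithms is a simple vote. The sketch below is only an illustration of that idea: the threshold of three agreeing algorithms is an assumption made here, not a rule taken from the study, and the example calls are hypothetical.

```python
def consensus_call(calls, min_agreeing=3):
    """Return True if at least `min_agreeing` algorithms flag the variant.

    `calls` maps an algorithm name to a boolean
    (True = predicted to affect protein function).
    `min_agreeing` is an assumed threshold, chosen for illustration.
    """
    flagged = sum(1 for is_flagged in calls.values() if is_flagged)
    return flagged >= min_agreeing


# Hypothetical calls for one variant: two algorithms miss the effect,
# three report it.
example_calls = {
    "Polyphen-2": False,
    "SIFT": False,
    "PANTHER": True,
    "PhD-SNP": True,
    "SNAP": True,
}
flag_majority = consensus_call(example_calls)
flag_strict = consensus_call(example_calls, min_agreeing=5)
```

With a majority threshold the hypothetical variant is flagged even though two algorithms miss it, whereas requiring unanimity would discard it. Which threshold is appropriate depends on whether missed variants or false alarms are the greater concern.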
This study provides a means of independently evaluating the predictive qualities of the algorithms. We have used data pertaining to RDD variants (Table 3) and to amino acid variants in RNA-edits (Table 4) to evaluate the likely impact on the phenotypes of RDD and RNA-edit proteins. A comparison of Tables 3 and 4 reveals striking differences in the characteristics of the RDD and RNA-edit peptides. With the exception of the predictions by Polyphen-2 for AP2A2, DFNA5, ENO1, ENO3, FABP3 and FH, the predictions for phenotype variations in Table 3 are quite similar between algorithms. In contrast, in Table 4 the extent of the similarity in the predictions by the algorithms is much reduced. Predictions of a disease-causing state by PANTHER and SNAP are similar for a few amino acids. Only a few amino acid variants in Table 4 are predicted to be disease-causing by Polyphen-2 and SIFT. PhD-SNP does not report any of the amino acid variants to be disease-causing. In Table 5 we summarize the percentage of disease-causing amino acid variants in the RDD (Li et al., 2011) and RNA-edit (Chen et al., 2012) proteins.
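A per-algorithm percentage of disease-causing variants can be tallied from the prediction labels used in the table notes. The following minimal sketch makes two assumptions that are not stated in the text: that PRD, PSD, D, APF and NN all count as disease-causing while N, BNG and TOL do not, and that lower-case (not fully conclusive) labels are counted in the same class as their upper-case forms. The example labels are hypothetical.

```python
# Assumed grouping of the abbreviations used in the table notes.
DISEASE_LABELS = {"PRD", "PSD", "D", "APF", "NN"}
NEUTRAL_LABELS = {"N", "BNG", "TOL"}


def is_disease_causing(label):
    """Map a prediction label to True (disease-causing) or False."""
    key = label.strip().upper()  # assumption: 'prd' is counted like 'PRD'
    if key in DISEASE_LABELS:
        return True
    if key in NEUTRAL_LABELS:
        return False
    raise ValueError(f"unknown prediction label: {label!r}")


def percent_disease_causing(labels):
    """Percentage of labels that map to a disease-causing prediction."""
    if not labels:
        raise ValueError("at least one label is required")
    positives = sum(is_disease_causing(label) for label in labels)
    return 100.0 * positives / len(labels)


# Hypothetical labels from one algorithm for four variants.
example_labels = ["PRD", "PSD", "BNG", "bng"]
percentage = percent_disease_causing(example_labels)
```

For the four hypothetical labels the result is 50.0. Treating inconclusive lower-case calls as a separate class would lower or raise the percentage, so the grouping should be stated whenever such a summary is reported.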
The algorithms have also been used to examine a form of RNA editing where there is a clear variation of the phenotype. In the GluR2 transcript GRIA2 the 607-arginine amino acid variant abolishes the 100% impermeability to calcium. Of the five algorithms, Polyphen-2 and SIFT fail to predict this established variation in phenotype.
The data we have reported here provide an opportunity to consider the predictive properties of algorithms that are likely to be important aids for identifying variations in function of a wide variety of proteins, for example those reported by Li and coworkers (2011). The data for PKM2 (Table 1) are in quite good agreement. The predictions by Polyphen-2 of the variant amino acids in Tables 3 and 4 are distinctly different. In Table 4 there is a much lower occurrence of disease-causing amino acid variants than there is in Table 3. In Table 3, with the exception of Polyphen-2, the algorithms predict that many amino acid variants are disease-causing.

Table 4. Evaluation of amino acid variants identified as part of an investigation of personalized medicine. subSPEC, substitution position-specific evolutionary conservation; PANTHER scoring is thus: subSPEC value of -3.5 and "greater" indicates disease-causing prediction. PRD, probably disease-causing; PSD, possibly disease-causing; D, disease-causing; APF, affects protein function; NN, non-neutral; N, neutral; BNG, benign; TOL, tolerated. The use of prd, bng, d and n indicates that the prediction is not fully conclusive. RI, reliability index; MSC, Median of Conservation value; NSP, Number of Sequence at Position; Sens : Spec, sensitivity and specificity. Results in bold are significantly deleterious; results in italics appear significant but are not statistically significant.

In summary, we have demonstrated: (i) in Tables 1 and 2, that the five algorithms are largely in agreement with the independently reported experimental data published by Allali-Hassani and coworkers (2009) and Andersen and coworkers (2010); (ii) that the algorithms predict that many of the amino acid variants in the RDD peptides will vary the phenotype of the RDD species reported by Li and coworkers (2011) (Table 3); (iii) in Table 4, that the extent of potentially disease-causing amino acid variants reported by Chen et al. (2012) is lower than that in the RDD proteins reported by Li and coworkers (2011).
Our findings suggest that a proportion of edited RNAs which serve as templates for protein synthesis is likely to modify protein function, possibly as an adaptive survival mechanism in response to environmental modifications.

Table 1. Predictions by algorithms of the disease-causing potential of the nsSNPs previously selected by using differential static light scattering.
Notes: (1) ∆T_agg = T_agg[wild type] - T_agg[nsSNP variant]. subSPEC, substitution position-specific evolutionary conservation; PANTHER scoring is thus: subSPEC value of -3.5 and "greater" indicates disease-causing prediction. PRD, probably disease-causing; PSD, possibly disease-causing; D, disease-causing; APF, affects protein function; NN, non-neutral; N, neutral; BNG, benign; TOL, tolerated. The use of prd, bng, d and n indicates that the prediction is not fully conclusive. RI, reliability index; MSC, Median of Conservation value; NSP, Number of Sequence at Position; Sens : Spec, sensitivity and specificity. INMT, indolethylamine N-methyltransferase; PKM2, pyruvate kinase muscle 2; SULT1A1, sulfotransferase. Results in bold are significantly deleterious. While the results in italics might appear significant, they are not statistically significant.

Table 2. Predictions by algorithms of the influence of amino acid variants on the transport of 5-hydroxytryptamine by the human serotonin transporter (hSERT). subSPEC, substitution position-specific evolutionary conservation; PANTHER scoring is thus: subSPEC value of -3.5 and "greater" indicates disease-causing prediction. PRD, probably disease-causing; PSD, possibly disease-causing; D, disease-causing; APF, affects protein function; NN, non-neutral; N, neutral; BNG, benign; TOL, tolerated. The use of prd, bng, d and n indicates that the prediction is not fully conclusive. RI, reliability index; MSC, Median of Conservation value; NSP, Number of Sequence at Position; Sens : Spec, sensitivity and specificity. Results in bold are significantly deleterious; results in italics appear significant but are not statistically significant.

Table 3. Evaluation of induction of phenotype variation by amino acid variants in RDD peptides. subSPEC, substitution position-specific evolutionary conservation; PANTHER scoring is thus: subSPEC value of -3.5 and "greater" indicates disease-causing prediction. PRD, probably disease-causing; PSD, possibly disease-causing; D, disease-causing; APF, affects protein function; NN, non-neutral; N, neutral; BNG, benign; TOL, tolerated. The use of prd, bng, d and n indicates that the prediction is not fully conclusive. RI, reliability index; MSC, Median of Conservation value; NSP, Number of Sequence at Position; Sens : Spec, sensitivity and specificity. Results in bold are significantly deleterious; results in italics appear significant but are not statistically significant.