Identifying highly-penetrant disease causal mutations using next generation sequencing: Guide to whole process

Recent technological advances have created challenges for geneticists and a need to adapt to a wide range of new bioinformatics tools and an expanding wealth of publicly available data (e.g. mutation databases, software). This wide range of methods and a diversity of file formats used in sequence analysis is a significant issue, with a considerable amount of time spent before anyone can even attempt to analyse the genetic basis of human disorders. Another point to consider is although many possess ‘just enough’ knowledge to analyse their data, they do not make full use of the tools and databases that are available and also do not know how their data was created. The primary aim of this review is to document some of the key approaches and provide an analysis schema to make the analysis process more efficient and reliable in the context of discovering highly penetrant causal mutations/genes. This review will also compare the methods used to identify highly penetrant variants when data is obtained from consanguineous individuals as opposed to non-consanguineous; and when Mendelian disorders are analysed as opposed to common-complex disorders.

difficult to understand and parse. Comprehensive information about the VCF format and its companion software VCFtools [1] is available online (vcftools.sourceforge.net).
Because of the substantial decrease in the price of DNA sequencing and genotyping [3], there has been a sharp increase in the number of genetic association studies being carried out, especially in the form of genome-wide association studies (GWAS, statistics available at www.genome.gov/gwastudies/). As whole genome sequencing (WGS) is prohibitively expensive for large genetic association studies [4][5][6], whole exome sequencing (WES), where only the protein coding region of the genome (i.e. the exome) is targeted and sequenced [7], has emerged as the attractive alternative. The decision to carry out WES over WGS is not solely influenced by cost, which currently stands at one-third in comparison [8], but also by the fact that most of the known Mendelian disorders (~85%) are caused by mutations in the exome [9], and by the fact that reliably interpreting variation outside of the exome is still challenging as there is little consensus (even with ENCODE data [10] and non-coding variant effect prediction tools such as CADD [11] and GWAVA [12]).
For complex diseases, WES can provide more evidence for causality compared to GWAS, assuming that the causal variants are exonic. This is because the latter uses linkage disequilibrium (LD) patterns between common markers [13] whereas WES directly associates the variant itself with the phenotypes/disorder. Therefore using GWAS, especially in gene-dense regions, one cannot usually make conclusive judgements about which gene(s) is causal without further sequencing or functional analysis. WES has been successfully used in identifying and/or verifying over 300 causal variants for Mendelian disorders (statistics from omim.org/) [14,15]. WES currently stands at approx. $1000 for 50x read depth (variable prices, less for larger studies). However since there is a great deal of variation in the human genome [16], finding the causal variant(s), especially ones with low penetrance, is not going to be trivial. This problem can be exacerbated by the nature of the disorder(s) analysed. It is relatively easier to map variants causing rare monogenic diseases, as there is most likely to be a single variant present in the cases that is not in the controls; but in contrast, common complex (polygenic) disorders are much harder to dissect when searching for causal variants.
In this paper, our aims are to (i) provide a guide for genetic association studies dealing with sequencing data to identify highly penetrant variants, (ii) compare the different approaches taken when data is obtained from unrelated or consanguineous individuals, and (iii) make suggestions about how to rank single nucleotide variants (SNVs) and/or insertions/deletions (indels) following the standard filtering/ranking steps if there are several candidate variants. To aid the process of analysing sequencing data obtained from consanguineous individuals, we have also made available an autozygosity mapping algorithm (AutoZplotter) which takes VCF files as input and enables manual identification of regions that have longer stretches of homozygosity than would be expected by chance.

Mapping sequence reads
The raw reads produced should then be aligned to a reference genome (e.g. GRCh38; see the NCBI Genome Reference Consortium), and there are many open source and widely applied tools for this purpose (Table 1).
However, solely depending on automated methods and software can leave many reads spanning indels misaligned, therefore post-reviewing the data for mismapping is always a good practice, especially in the candidate regions. Attempting to remap misaligned reads with a lower stringency using software such as Pindel would be an ideal way to go about solving such a problem [24]. GATK provides a base recalibration and indel realignment algorithm for this purpose.
Effective variant calling depends on accurate mapping to a dependable reference sequence. If available, using a population specific reference genome would be most ideal to filter out known neutral SNPs existing within the region of origin of the analysed subjects (e.g. East-Asian reference for subjects of Japanese origin). Inclusion of ambiguity codes (e.g. IUPAC codes) for known poly-allelic variants to create a composite reference genome can also be useful (although not essential).

Variant calling
There are many tools available for the identification of SNVs, indels, splice-site variants and CNVs present in the query sequence(s). Each variant calling tool has advantages and disadvantages and has made compromises relating to issues such as speed of analysis, annotation and reliability of the output file (Table 2). Separating true variation from sequencing artefacts still represents a considerable challenge. When dealing with very rare disorders, the candidate regions in the output VCF (or BAM) files should be reviewed, either by inspecting the QC scores in the VCF or by visualising the alignments in IGV [25]. Performing this step could highlight sequencing errors such as over-coverage (due to greater abundance of capture probes for the region or double capturing due to poorly discriminated probes hybridising to the same region) or under-coverage (due to probes not hybridising because of high variability in the region). For rare Mendelian disorders, a single causal variant is expected, so it is important that the variants analysed are reliable. Therefore setting strict parameters for read depth (e.g. ≥ 10x), base quality score (e.g. ≥ 100) and genotype quality scores (e.g. ≥ 100) initially can eliminate wrong base and genotype calls. These thresholds can then be relaxed if no variants with a strong candidacy are found after filtering (also see the Best Practices section of the GATK documentation for variant analysis).
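As an illustration of this kind of initial quality filter, the following minimal Python sketch checks one data line of a single-sample VCF against thresholds for read depth, site quality and genotype quality. It assumes that the depth and genotype quality are stored in the per-sample `DP` and `GQ` fields and that the quality of the call is read from the `QUAL` column; the function name `passes_quality` and the threshold values in the example are illustrative assumptions, and appropriate values depend on the variant caller and its score scale.

```python
def passes_quality(vcf_line, min_depth, min_qual, min_gq):
    """Return True if a single-sample VCF data line meets all three thresholds.

    Assumes tab-separated columns in the standard order, with FORMAT in
    column 9 and one sample in column 10 (hypothetical single-sample input).
    """
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]
    if qual == "." or float(qual) < min_qual:
        return False
    # Pair each FORMAT key (e.g. GT, DP, GQ) with the sample's value.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    try:
        depth = int(sample["DP"])
        genotype_quality = float(sample["GQ"])
    except (KeyError, ValueError):
        # Missing or non-numeric DP/GQ: treat the call as unreliable.
        return False
    return depth >= min_depth and genotype_quality >= min_gq


# Example with illustrative thresholds; relax them if no candidate survives.
line = "1\t1000\t.\tA\tG\t250.0\tPASS\t.\tGT:DP:GQ\t1/1:35:99"
keep = passes_quality(line, min_depth=10, min_qual=100, min_gq=90)
```

Passing the thresholds as arguments rather than fixing them in the function makes it straightforward to repeat the filtering with less stringent values.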
GATK [2] is one of the most established SNP discovery and genome analysis toolkits, with extensive documentation and an actively maintained forum. It is a structured programming framework which makes use of the programming philosophy of MapReduce to solve the data management challenge of NGS by separating data access patterns from analysis algorithms. GATK is also regularly updated and widely cited.
SAMtools [26] is a variant caller which uses a Bayesian approach and has been used in many WGS and WES projects including the 1000 Genomes Project [16]. SAMtools also offers many additional features such as alignment viewing and conversion to a BAM file. A recent study has compared GATK, SAMtools and Atlas2 and found GATK to perform best in many settings [27]. However all three were highly consistent with an overlapping rate of ~90%. SOAPsnp is another highly used SNP and genotype caller and is part of the reliable SOAP family of bioinformatics tools (http://soap.genomics.org.cn/).

Additional checks of autozygosity
For data obtained from consanguineous families, confirming expected autozygosity (i.e. homozygous for alleles inherited from a common ancestor) would be an additional check worth carrying out. If the individual is the offspring of first cousins then the level of autozygosity would be near 6.25% (F=0.0625); and 12.5% (F=0.125) for offspring of double first cousins (or uncle-niece unions, see Supp. Figure S1 for a depiction of these). These values will be higher in endogamous populations (e.g. for offspring of first cousins: 6.25% + autozygosity brought about due to endogamy. See Supp. Fig. S3  Materials) that we developed takes VCF files as input enabling easy and reliable visualisation and analysis of LRoH for any type of data (WGS, WES or SNP chip).
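The expected values quoted for these unions follow from Wright's path-counting formula, in which each common ancestor of the parents contributes (1/2)^(n1+n2+1) to the inbreeding coefficient F. The short Python sketch below reproduces them; it assumes that the common ancestors are themselves not inbred, and the function name is illustrative.

```python
def inbreeding_coefficient(paths):
    """Wright's path-counting formula, assuming non-inbred common ancestors.

    `paths` holds one (n1, n2) pair per common ancestor of the two parents,
    where n1 and n2 are the numbers of generations separating each parent
    from that ancestor.
    """
    return sum(0.5 ** (n1 + n2 + 1) for n1, n2 in paths)


# First cousins share two grandparents, each two generations above both parents.
first_cousins = inbreeding_coefficient([(2, 2), (2, 2)])
# Double first cousins share all four grandparents.
double_first_cousins = inbreeding_coefficient([(2, 2)] * 4)
# Uncle and niece share two ancestors: his parents, who are her grandparents.
uncle_niece = inbreeding_coefficient([(1, 2), (1, 2)])
```

An observed proportion of the genome in autozygous regions that is far from the expected F may point to undeclared relatedness, to endogamy, or to a sample mix-up.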

STAGE 2 - FILTERING/RANKING OF VARIANTS
Once the quality control process is complete and VCF files are deemed analysis ready, the approach taken will depend on the type of disorder analysed. For rare Mendelian disorders, many filtering and/or ranking steps can be taken to reduce the thousands of variants to a few strong candidates. Screening previously identified genes for causal variants is a good starting point. Carrying out this simple check will allow the identification of the causal variant even from a single proband thus saving time and money. If no previously identified variant is found in the proband analysed, there are several steps which can be taken to identify novel mutations.

Using prior information to rank/filter variants
Locus specific databases (see http://www.hgvs.org/dblist/dblist.html for a comprehensive list) and 'whole-genome' mutation databases such as HGMD [32], ClinVar [33], LOVD (www.lovd.nl/) and OMIM [34] are very informative resources for this task. Finding no previously identified variants indicates a novel variant in the proband analysed. For rare Mendelian disorders, the search for the variant can begin with the removal of known neutral and/or common variants (MAF ≥0.1%), as this provides a smaller subset of potentially causal variants. This is a pragmatic choice as Mendelian disease causal variants are likely to be very rare in the population or unique to the proband. If the latter is true, the variant will be absent from public databases. For this process to be thorough, an automated annotation tool such as Ensembl VEP can be used. VEP enables incorporation of MAF (or GMAF, global MAF) from the EVS and the 1000 Genomes Project (see Supp. Material and Methods for details).

Using effect prediction algorithms to rank/filter variants
Ranking this subset of variants based on consequence (e.g. stop gains would rank higher than missense) and scores derived from mutation prediction tools (e.g. 'probably damaging' variants would rank higher than 'possibly damaging' according to Polyphen-2 prediction) would enable assessment of the predicted impact of all rare mutations. It is important to understand what is assumed at each filtering/ranking stage; and comments are included about each assumption and their caveats in Figure 2.
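A minimal sketch of such a ranking is given here in Python. It assumes that each variant has already been annotated with a Sequence Ontology consequence term and a PolyPhen-2 category (as produced, for example, by VEP); the dictionary keys `consequence` and `polyphen`, and the exact tier assigned to each term, are illustrative assumptions rather than a prescribed scheme.

```python
# Lower rank = stronger candidate. The tiers are an assumption for illustration:
# truncating variants first, then canonical splice sites, then missense.
CONSEQUENCE_RANK = {
    "stop_gained": 0,
    "frameshift_variant": 0,
    "splice_donor_variant": 1,
    "splice_acceptor_variant": 1,
    "missense_variant": 2,
    "synonymous_variant": 3,
}
PREDICTION_RANK = {"probably_damaging": 0, "possibly_damaging": 1, "benign": 2}


def rank_variants(variants):
    """Sort variants by consequence tier, then by predicted impact.

    Unknown or missing annotations are placed after all known categories.
    """
    def sort_key(variant):
        consequence = CONSEQUENCE_RANK.get(
            variant.get("consequence"), len(CONSEQUENCE_RANK))
        prediction = PREDICTION_RANK.get(
            variant.get("polyphen"), len(PREDICTION_RANK))
        return (consequence, prediction)

    return sorted(variants, key=sort_key)


candidates = [
    {"id": "v1", "consequence": "missense_variant", "polyphen": "possibly_damaging"},
    {"id": "v2", "consequence": "stop_gained"},
    {"id": "v3", "consequence": "missense_variant", "polyphen": "probably_damaging"},
]
ordered_ids = [v["id"] for v in rank_variants(candidates)]
```

Because the variants are sorted rather than discarded, lower-ranked candidates remain available for inspection if the top-ranked ones are excluded at a later stage.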
For individuals of European ancestry, a VCF file will have between eighty and ninety thousand variants for WES (more for individuals with African ancestry [35]); and approx. a tenth will be variants with 'predicted high impact' (also known as Φ variants i.e. rare nonsense, missense, splice-site acceptor or donor variants, exonic indels [36]). There are many algorithms which predict the functional effect of these variants (Table 3). A large proportion of these algorithms utilize sequence conservation within a multiple sequence alignment (MSA) of homologous sequences to identify intolerant substitutions, e.g. a substitution falling within a conserved region of the alignment is less likely to be tolerated than a substitution falling within a diverse region of the alignment (see Ng for a review [37]). A handful of these algorithms also utilize structural properties, such as the protein secondary structure and solvent accessible surface area, in order to boost performance. Well known examples of a sequence-based and structure-based algorithm are SIFT [38] and PolyPhen [39] respectively. Newer software such as FATHMM [40] and MutPred [41], which use state-of-the-art hidden Markov models and machine learning paradigms, are worth using for their performance. There are also several tools such as CONDEL-2 [42] which combine the output of several prediction tools to produce a consensus deleteriousness score. Although SIFT

Further filtering/ranking
With current knowledge, there are fifty synonymous mutations with proven causality (complex traits and Mendelian disorders combined) [46]. This is a very small proportion when compared to the thousands of published clinically relevant non-synonymous (i.e. missense and nonsense) mutations.
Therefore, when filtering variants for rare monogenic disorders, not taking non-coding variants and synonymous variants into account in the initial stages is a pragmatic choice. If ranking is preferred, then tools such as SilVA [47], which ranks all synonymous variants, and CADD [11], which ranks all variants (including synonymous variants) in the VCF files, should be used.
Highly penetrant (Mendelian or common-complex) disease causal variants are expected to be very rare, therefore most of them should not appear in publicly available datasets. However, filtering out all variants present in dbSNP, which is common practice, should not be carried out, as amplification and/or sequencing errors as well as potentially causal variants are known to make their way into this database [48,49]. Thus the use of a MAF threshold (e.g. ≤ 0.1% in 1000 Genomes and/or EVS) is a wiser choice than using absence from dbSNP as a filter. Upon completion of these steps, a smaller subset of variants with strong candidacy will remain for further follow up to determine causality.
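The difference between the two filters can be made concrete with a short Python sketch. It assumes that each variant carries its annotated minor allele frequencies under the hypothetical keys `maf_1000g` and `maf_evs` (with `None` when the variant is absent from a dataset) and a hypothetical `in_dbsnp` flag; none of these names come from a specific tool.

```python
def is_rare(variant, max_maf=0.001, sources=("maf_1000g", "maf_evs")):
    """True if the variant is at or below max_maf in every frequency source.

    A missing frequency (None) is treated as "not observed" and therefore
    passes; max_maf=0.001 corresponds to the 0.1% threshold.
    """
    for source in sources:
        maf = variant.get(source)
        if maf is not None and maf > max_maf:
            return False
    return True


variants = [
    {"id": "novel", "in_dbsnp": False, "maf_1000g": None, "maf_evs": None},
    {"id": "rare_known", "in_dbsnp": True, "maf_1000g": 0.0002, "maf_evs": None},
    {"id": "common", "in_dbsnp": True, "maf_1000g": 0.12, "maf_evs": 0.10},
]

kept_by_maf = [v["id"] for v in variants if is_rare(v)]
kept_by_dbsnp_absence = [v["id"] for v in variants if not v["in_dbsnp"]]
```

In this toy example the MAF threshold retains the rare variant that already has a dbSNP entry, whereas filtering on absence from dbSNP would discard it.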
As many online tools are expected to keep logs of the processes run on their servers, downloading a local version of the chosen tools (or the VEP cache from the Ensembl website) is recommended to protect the confidentiality of genetic information. VEP also enables incorporation of MAF from the EVS and the 1000 Genomes Project, and many other annotations (e.g. conservation scores, presence of the variant position in the public version of HGMD, PubMed), which will make the filtering steps more manageable. Figure 3 suggests the route to take to help differentiate causal variant(s) from non-causal ones for Mendelian disorders. At this stage one must gather all information that is available about the disorder and use it to determine which inheritance pattern fits the data and what complications there might be (e.g. the possibility of compound heterozygotes in disorders which show allelic heterogeneity). Supp. Figure S2 can be used to observe the contrast between the routes taken when analysing Mendelian (Figure 3) and complex disorders.

Public data as a source of evidence
Having a candidate gene list based on previously published literature (e.g. by using OMIM or disease/pathway specific databases such as the Ciliome database [50]) and knowledge about the biology of the disorder (e.g. biological pathways) is useful. Resources such as STRING and KEGG provide protein-protein interaction and pathway information drawn from a variety of sources [51,52]. SNPs3D is a user friendly interface which is designed to suggest candidates for different disorders [53]. UCSC Gene Sorter (accessible from https://genome.ucsc.edu/) is another useful tool for collating a candidate gene list as it groups genes according to several features such as protein homology, coexpression and gene ontology (GO) similarity.
Uniprot's (http://www.uniprot.org/) Blast and Align functions can provide essential information about the crucial role a certain residue plays within a protein if it is highly conserved throughout many species. This is especially important for SNVs (excluding nonsense mutations as they truncate the protein) where the SNV itself should be causal.
An example of the filtering process for an autosomal recessive disorder such as PCD is depicted in Figure 5. If several variants pass the filtering steps, information about the relevant genes should be gathered using databases such as GeneCards (www.genecards.org/) and NCBI Gene (www.ncbi.nlm.nih.gov/gene) for functional information, and GEO Profiles (www.ncbi.nlm.nih.gov/geoprofiles) and Unigene (www.ncbi.nlm.nih.gov/unigene) for expression data about the gene's product. If available, one can also check whether a homologue is present in different species using databases such as HomoloGene (www.ncbi.nlm.nih.gov/homologene) and whether a similar phenotype is observed in model organisms. For example, if the disorder affects the cerebral cortex but the gene product is only active in the tissues located in the foot, then one cannot make a good argument for the identified variant in the respective gene being 'causal'. There are many complications that may arise depending on the disorder, such as genetic (locus) heterogeneity [54], allelic heterogeneity [55] and incomplete penetrance [56]. Therefore gathering as many cases as possible from the same family is helpful. However, for very rare Mendelian disorders this may not be possible, thus it is important to seek other lines of evidence (e.g. animal models, molecular analyses).

Mapping causal loci within families
For rare Mendelian disorders, familial information can be crucial. The availability of an extended pedigree can be very informative in mapping which variant(s) fits the mode of inheritance in the case(s) and not in the unaffected members of the family (e.g. for autosomal recessive mutations, confirming heterozygosity in the parents is a must). This provides linkage data, the importance of which is best illustrated by Sobreira et al., where WES data from a single proband was sufficient to discover the causal variants in two different families [57]. Where available, previously published linkage data (i.e. associating a chromosomal region to a Mendelian disorder) should also be made use of.
Traditionally a LOD score of 3 (i.e. odds of 1000:1 in favour of linkage) is required for a variant/region to be accepted as causal. Reaching this threshold requires many large families with many affected individuals. However, this is not feasible for most disease causal variants (which are very rare by nature), and other lines of evidence such as animal knockouts, molecular studies and alignments are required to make a case for the causality of variants, especially mutations which are not stop gains (e.g. missense).
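To show why this threshold is demanding, the following Python sketch computes the two-point LOD score for the simplest textbook case: n fully informative, phase-known meioses of which k are recombinant, evaluated at a recombination fraction θ against the null of free recombination (θ = 0.5). This idealised setting is an assumption made for illustration; real pedigrees require dedicated linkage software.

```python
import math


def lod_score(n_meioses, n_recombinants, theta):
    """LOD = log10[ theta^k * (1 - theta)^(n - k) / 0.5^n ].

    Assumes phase-known, fully informative meioses; theta must lie in [0, 0.5].
    """
    if not 0 <= theta <= 0.5:
        raise ValueError("theta must be between 0 and 0.5")
    n, k = n_meioses, n_recombinants
    if theta == 0:
        if k > 0:
            return float("-inf")  # a recombinant is impossible at theta = 0
        log_likelihood = 0.0
    else:
        log_likelihood = k * math.log10(theta) + (n - k) * math.log10(1 - theta)
    return log_likelihood - n * math.log10(0.5)


# With no recombinants, each informative meiosis adds log10(2), about 0.301.
nine = lod_score(9, 0, 0.0)   # about 2.71, below the threshold
ten = lod_score(10, 0, 0.0)   # about 3.01, just above the threshold
```

Even under these ideal conditions, ten informative meioses without a single recombinant are needed to exceed a LOD score of 3, which is rarely available for a very rare disorder.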
As mentioned previously, understanding the characteristics of a Mendelian disorder is important. If the disorder is categorised as 'familial' (i.e. occurs more in families than by chance alone), which are usually very rare by nature, then availability of familial data becomes crucial -as unaffected members of the family are going to be the main source of information when determining neutral alleles. Any homozygous (and rare) stop gains in previously identified genes would be prime candidates.
The approach taken in families is different from the approaches taken when analysing common Mendelian disorders using unrelated individuals. For common Mendelian disorders (e.g. Finnish Heritage disorders [58-60]), fitting the dataset into a recessive inheritance model requires most (if not all) affected individuals to have two copies of the disease allele, enabling the identification of founder mutations as they will be overrepresented in the cases. These variants will be homozygous through endogamy and not consanguinity.

Autozygosity mapping
For consanguineous subjects, the causal mutation usually lies within an autozygous region (characterised by long regions of homozygosity, LRoH, which are generally >5Mb, see [61]), thus checking whether any candidate genes overlap with an LRoH can narrow the region(s) of interest. There are several tools which can identify LRoHs, such as Plink, AutoSNPa and AgilentVariantMapper. We have made available a Python script (AutoZplotter) to plot the heterozygosity/homozygosity status of variants in VCF files to allow for screening of short autozygous regions as well as LRoHs.
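Checking candidate genes against LRoH amounts to a simple interval overlap test, sketched here in Python. The gene names and coordinates are placeholders invented for the example, and intervals are assumed to be closed (chromosome, start, end) tuples on the same reference build.

```python
def overlaps(a, b):
    """True if two (chromosome, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def genes_in_lroh(genes, lroh_regions):
    """Return the names of the genes that overlap any of the LRoH regions."""
    return [
        name
        for name, interval in genes.items()
        if any(overlaps(interval, region) for region in lroh_regions)
    ]


# Placeholder candidate genes and one hypothetical 6.2 Mb LRoH on chromosome 2.
candidate_genes = {
    "GENE_A": ("chr2", 1_500_000, 1_560_000),
    "GENE_B": ("chr2", 9_000_000, 9_040_000),
    "GENE_C": ("chr7", 1_500_000, 1_560_000),
}
lroh = [("chr2", 1_000_000, 7_200_000)]
prioritised = genes_in_lroh(candidate_genes, lroh)
```

Only the first placeholder gene falls within the autozygous region in this example and would therefore be prioritised for follow up.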

AutoZplotter
There are several software packages which can reliably detect long runs of homozygosity (>5Mb); however, they struggle to identify regions that are shorter than this. Therefore we developed AutoZplotter, which plots homozygosity/heterozygosity state and enables quick visualisation of suspected autozygous regions. The input format of AutoZplotter is VCF, thus it suits any type of genetic data (e.g. SNP array, WES, WGS).
AutoZplotter was used for this purpose in a previous study by Alsaadi et al [18].
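The underlying idea of scanning genotype calls for uninterrupted homozygous stretches can be sketched in a few lines of Python. This is not the AutoZplotter implementation, which relies on plotting and visual inspection; it is a simplified illustration that assumes position-sorted calls from one chromosome, and the default thresholds are arbitrary assumptions.

```python
def homozygous_runs(calls, min_length=1_000_000, min_variants=25):
    """Find runs of consecutive homozygous calls on one chromosome.

    `calls` is a position-sorted list of (position, is_homozygous) pairs.
    A run is reported as (start, end, n_calls) if it spans at least
    min_length bases and contains at least min_variants calls. Any
    heterozygous call ends the current run.
    """
    runs = []
    current = []  # positions of the homozygous calls in the run being built

    def close_run():
        if (current and len(current) >= min_variants
                and current[-1] - current[0] >= min_length):
            runs.append((current[0], current[-1], len(current)))
        current.clear()

    for position, is_homozygous in calls:
        if is_homozygous:
            current.append(position)
        else:
            close_run()
    close_run()  # flush the final run
    return runs


# Toy data: 30 homozygous calls, one heterozygous call, then 5 homozygous calls.
calls = ([(1_000_000 + i * 100_000, True) for i in range(30)]
         + [(4_000_000, False)]
         + [(4_100_000 + i * 100_000, True) for i in range(5)])
runs = homozygous_runs(calls)
```

A practical implementation would usually tolerate occasional heterozygous calls caused by genotyping errors, which this sketch deliberately does not do.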

Exceptional cases
There can always be exceptional cases (in consanguineous families also), such as compound heterozygotes (i.e. individuals carrying different variants in the two copies of the same gene). This would require haplotype phasing and confirmation of variant status in the parents and the proband(s) (i.e. each parent being heterozygous for one of the variants and not carrying the other), either by sequencing PCR amplicons containing the variant or by genotyping the variant directly. Beagle and HAPI-UR are two haplotype phasing tools that are widely used because of their efficiency and speed [62,63].
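When both parents have been genotyped, candidate compound heterozygotes can be screened for with a simple rule, sketched here in Python: a gene qualifies if the proband carries at least one heterozygous variant seen only in the mother and at least one seen only in the father. Genotypes are assumed to be encoded as alternate allele counts (0, 1 or 2), and the dictionary keys and gene names are illustrative assumptions.

```python
from collections import defaultdict


def compound_het_genes(variants):
    """Return genes with heterozygous variants inherited from both parents.

    Each variant is a dict with the keys 'gene', 'proband', 'mother' and
    'father'; genotypes are alternate allele counts (0, 1 or 2).
    """
    origins = defaultdict(set)
    for variant in variants:
        if variant["proband"] != 1:
            continue  # only heterozygous calls in the proband are relevant
        if variant["mother"] == 1 and variant["father"] == 0:
            origins[variant["gene"]].add("maternal")
        elif variant["father"] == 1 and variant["mother"] == 0:
            origins[variant["gene"]].add("paternal")
    return sorted(gene for gene, seen in origins.items()
                  if seen == {"maternal", "paternal"})


trio_variants = [
    {"gene": "GENE_A", "proband": 1, "mother": 1, "father": 0},
    {"gene": "GENE_A", "proband": 1, "mother": 0, "father": 1},
    {"gene": "GENE_B", "proband": 1, "mother": 1, "father": 0},
    {"gene": "GENE_B", "proband": 1, "mother": 1, "father": 0},
]
candidates = compound_het_genes(trio_variants)
```

In the toy data the second gene is excluded because both of its variants come from the mother and therefore lie on the same inherited copy.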

Identifying highly penetrant variants for common-complex disorders
For common complex disorders, identifying causal variants in outbred populations has proven to be a difficult and costly process (Supp. Figure S2); and these disorders can have many unknowns such as the significance of environmental factors [64-66] and epistasis [67]. Many of the causal variants may be relatively rare (and almost always in the heterozygous state) in the population, introducing issues with statistical power. Traditional GWAS do not attempt to analyse them, thus they are largely ignored, leaving a lot of the heritability of common complex disorders unexplained. Analysing individuals with extreme phenotypes where the segregation of disease mimics autosomal recessive disorders (e.g. in consanguineous families) can be useful in identifying highly penetrant causal genes/mutations for complex disorders (e.g. obesity and leptin gene mutations [68]). The genetic influence in these individuals is predicted to be higher, and they are expected to have a single highly penetrant variant in the homozygous state. These highly penetrant mutations can mimic the causal variants of Mendelian disorders. Therefore similar study designs can be used (e.g. autozygosity/homozygosity mapping).

CONCLUSIONS
The NGS era has brought data management problems to traditional geneticists. Many data formats and bioinformatics tools have been developed to tackle this problem. One can easily be lost in the plethora of databases, data formats and tools. "Which tools are out there? How do I use it? What do I do next with the data I have?" are continually asked questions. This review aims to guide the reader in the rapidly changing and ever expanding world of bioinformatics. Figure 4 depicts a summary of the analysis process from DNA extraction to finding the causal variant, putting into perspective which file formats are expected at each step and which bioinformatics tools we prefer due to reasons mentioned before. Researchers can then appreciate the stage that they are at and how many other steps are required for completion as well as knowing what to do at each step.
Whole exome sequencing is the current gold standard in the discovery of highly penetrant disease causal mutations. As knowledge on the non-coding parts of the genome can still be considered to be in its early days, the human exome is still a pragmatic target for many. As approx. 1600 known Mendelian disorders (and ~3500 when suspected ones are included) and most common-complex disorders are still waiting for their molecular basis to be figured out (from omim.org/statistics/entry, true as of 15/07/14), future genetic studies have much to discover. However for these projects to be fruitful, careful planning is needed to make full use of available tools and databases (see Table 4).
Finally, with this paper we have also made AutoZplotter available (input format: VCF), which plots homozygosity/heterozygosity state and enables quick visualisation of suspected autozygous regions. This can be important for shorter autozygous regions where other autozygosity mappers struggle.

Table 1: Tools for aligning reads to a reference genome
These are some of the many tools built for aligning reads produced from high throughput sequencing. Some have made speed their main purpose whereas others have paid more attention to annotating the files produced (such as mapping quality). Thus a manual review of candidate regions may prove to be crucial especially when dealing with very rare disorders. These aligners use similar algorithms to determine contiguous sequences; however, MAQ and BWA are widely used and have been praised for their computational efficiency and multi-platform compatibility [74].

- Predicts the impact of protein mutations. User friendly website and accepts many formats.
*PANTHER [94,95] 0.53 (unweighted) Predicts the effect of amino acid change based on protein evolutionary relationships. It provides a number ranging from 0 (neutral) to -10 (most likely deleterious) and allows the user to decide on the "deleteriousness" threshold. It is constantly updated making it a very reliable tool.

CONDEL-2 [42] - Combines FATHMM and Mutation Assessor (as of version 2) in order to improve prediction. In principle, it outperforms each of the tools it combines when they are used individually.

Notes
Advisably, feed the data into multiple prediction tools (Table 3) and apply weight according to consistency of predictions. Rank indels and nonsense in exons highest, then splice donor/acceptor mutations; and then predicted 'damaging' SNVs higher than 'tolerated' ones.

Assumptions
Causal variant is most likely coding

Notes
Either filter variants in non-coding regions or use CADD C-score to rank all variants.

Assumptions
The variant responsible for Mendelian disorders will not be present in publicly available control databases (or will be rare)

Notes
Rank SNVs according to frequency in 1000 Genomes Project, EVS and dbSNP; ranking very rare/unique variants higher than common ones -or filter all common ones.

Assumptions
Previous literature is reliable

Notes
If there are genes known to be associated with the disorder/pathway, rank them higher than the others (i.e. for PCD, the prime candidates would be genes affecting the relevant organ/organelles involved in the respiratory pathway such as the lung and cilia). GeneCards website provides comprehensive information about every gene

Figure 2: Post-VCF file procedures (example for sequencing data)
Every step can be automated through the use of pipelines and bioinformatics tools. Whilst performing the steps listed above, one must always bear in mind the assumptions behind the procedures. Ranking of rare SNVs would be advised over filtering as it allows the researcher to observe all variants as a continuum from most likely to least likely.

Figure 3: Finding 'the one' in Mendelian Disorders
Flowchart summary. Starting question: does the disease/disorder follow a dominant or recessive mode of inheritance? Steps shown: identify the most biologically plausible variants and rank variants according to biological plausibility, using predicted functional effect (see Table 3), whether the gene product is active in the tissue/region, homologues in different species, and functional analyses in model organisms (e.g. knockouts). Yes branch: publish, including as many candidate SNVs in the paper as possible for potential future analyses by other groups. No branch: return to Figure 2 and re-check assumptions.
Familial (very rare) disorders are more likely to be following a recessive mode of inheritance, thus family data is crucial (to rule out de novo mutations). Also it is crucial to include as many family members as possible. For common Mendelian disorders, if the disorder is following a recessive inheritance model, the possibility of the existence of compound heterozygotes should be taken into account when fitting the data into a recessive model. Finally, functional post-analysis of candidate variant(s), especially in mouse knockouts, can be crucial.
*If a consanguineous family, identify regions where there are long runs of homozygosity (LRoH) for each individual; and amongst these regions, the ones which are shared by the affected and not by the unaffected.

Figure 5: After all the filtering steps in the above figure are applied, the total will be reduced to a single candidate. The numbers here are for illustration purposes only (adapted from [36]). Homozygosity step is added as PCD is an autosomal recessive disorder.

Supp. Figure S2 Finding 'the lot' in Complex disorders: Searching for causal variants (WES example)
The standard procedure is to compare cases with controls and detect whether there are any significant differences in the allele frequencies of each variant. The statistical power of this approach is going to predominantly depend on sample size and penetrance of the causal variant. Covariates should be identified and population stratification should be controlled for in the regression models. The clinical significance of the variant must also be taken into account especially when searching for variants with very low effect sizes. One must consider whether it is worth sequencing more exomes in order to reach exome wide significance for the identification of a variant which does not have any considerable effect on patients' health.

Carr IM, Bhaskar S, O'Sullivan J, Aldahmesh MA, Shamseldin HE, et al. (2013) Autozygosity mapping with exome sequence data. Hum Mutat 34: 50-56.
31. Carr IM, Flintoff KJ, Taylor GR, Markham AF, Bonthron DT (2006) Interactive visual analysis of SNP data for rapid autozygosity mapping in consanguineous families. Hum Mutat 27: 1041-104
Sim N-L, Kumar P, Hu J, Henikoff S, Schneider G, et al. (2012) SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res 40: W452-W457.
85. Kumar P, Henikoff S, Ng PC (2009) Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protocols 4: 1073-1081.
86. Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, et al. (2010) Identifying a High Fraction of the Human Genome to be under Selective Constrain