Research article
A computational method for prediction of rSNPs in human genome

https://doi.org/10.1016/j.compbiolchem.2016.04.001

Highlights

  • A computational method for detection of rSNPs is proposed.

  • A new ensemble method for handling unbalanced data is applied.

  • Differences in hydroxyl radical cleavage patterns caused by SNPs are analyzed.

Abstract

Regulatory single nucleotide polymorphisms (rSNPs) in human genomes are thought to be responsible for phenotypic differences, including susceptibility to diseases and treatment outcomes, even though they do not change any gene product. However, a genome-wide search for rSNPs has not been properly addressed so far. In this work, a computational method for rSNP identification is proposed. As background SNPs far outnumber rSNPs, an ensemble method is applied to handle the imbalanced data: it first converts the unbalanced dataset into several balanced ones and then builds a model for each balanced dataset. Two major types of features are extracted, namely sequence-based features and allele-specific features. A random forest is then applied to build the recognition model for each balanced dataset. Finally, ensemble strategies are adopted to combine the results of the individual models. We have tested our method on a set of experimentally verified rSNPs, and leave-one-out cross-validation showed that it achieves a sensitivity of 73.8%, a specificity of 71.8% and an area under the ROC curve (AUC) of 0.756. In addition, our method is threshold-free and does not rely on data of regulatory elements, so it should adapt better to different data scenarios. The original data and the MATLAB source code are available at https://sourceforge.net/projects/rsnpdect/.

Introduction

Variations in human genomes are believed to be associated with phenotypic differences and to have an impact on genetic diseases (Altshuler et al., 2008, Colombo et al., 2011). Among them, single nucleotide polymorphisms (SNPs) are among the most common (Altshuler et al., 2010). The mechanisms by which SNPs influence the phenotype and the roles they play in complex diseases have been widely studied recently (Bonadies et al., 2010, Swindell et al., 2014). SNPs residing in coding regions (cSNPs), especially non-synonymous mutations that change the amino acid sequence, are readily detectable in terms of their likely impact on protein function and/or structure (Stranger et al., 2007, Adzhubei et al., 2010). However, genome-wide association studies (GWAS) indicate that most significant variants are located in intronic or intergenic regions, which do not encode proteins and are difficult to interpret (Hollenhorst et al., 2009). This implies that SNPs may affect gene expression in a regulatory way (Lappalainen et al., 2010); such SNPs are called regulatory SNPs (rSNPs).

It is known that rSNPs can lead to the development of diseases, and their detection is in demand, especially for personalized medicine. However, GWAS and next generation sequencing (NGS) have provided far too many SNPs, making it difficult and expensive to find the significant ones through experimental procedures alone. Computational methods are therefore needed to improve efficiency and reduce cost. In this work, a SNP is considered regulatory if it can influence the gene expression level by altering the binding affinity of transcription factors (TFs) in vivo. This interpretation of rSNP is the same as that of Molineris et al. (2013) and should be contrasted with the definition used elsewhere, for example in rSNPBase (Guo et al., 2014), where SNPs are assigned regulatory functions if they are located within regulatory regions. Some previous rSNP-finding approaches focus on the prediction of regulatory elements, especially the binding sites of specific TFs, assuming that SNPs located within these regions could have regulatory functions (Ameur et al., 2009, Ponomarenko et al., 2003). Methods of this kind produce numerous false positives, because a SNP that overlaps a TF binding site does not necessarily alter the TF-DNA binding affinity. A natural improvement is to use the score generated by a TF position weight matrix (PWM) to measure the binding affinity of the corresponding TF, and to regard SNPs that generate large score differences as more likely to be rSNPs (Andersen et al., 2008). rSNP-MAPPER is a tool developed for large-scale prediction of rSNPs, in which score differences of TF PWMs are used directly to judge whether significant changes occur at putative transcription factor binding sites (TFBSs) (Riva, 2012). Manke et al. (2010) use a Fourier transform to calculate the distributions of PWM scores, and then compute the ratio of the P-values of the affinity scores of the two alleles to determine whether a SNP significantly disrupts TF binding. Another PWM-based method is is-rSNP (Macintyre et al., 2010). This algorithm calculates the distributions of PWM scores and of the ratios between allele scores via convolution methods, and then computes the distributions of P-value ratios to assign statistical significance to rSNP effects. Recently, ChIP-seq ENCODE data have been introduced for locating rSNPs (Boyle et al., 2012). Bryzgalov et al. (2013) put forward the view that ChIP-seq peaks, which represent enrichment of TF binding loci in a genomic region, indicate regulatory function, and that SNPs located in these peaks are therefore more likely to affect TF binding. Another method using ENCODE data is GWAS3D (Li et al., 2013). Besides ENCODE data, GWAS3D also integrates many other kinds of annotation data, such as cell type-specific chromatin states and cross-species conservation, to analyze the probability that genetic variants affect regulatory elements. However, methods using ENCODE data require a large amount of experimental annotation data, which makes them difficult to transfer to studies of other organisms.
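The PWM score-difference idea used by the approaches above can be illustrated with a minimal sketch. It is not the implementation of any of the cited tools: the function names, the pseudocount, the uniform background of 0.25, and the choice of taking the best-scoring window that covers the SNP are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into a log2-odds PWM (assumed uniform background)."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def best_window_score(pwm, seq, snp_index):
    """Highest PWM score over all windows of the motif width that cover snp_index."""
    width = len(pwm)
    first = max(0, snp_index - width + 1)
    last = min(snp_index, len(seq) - width)
    best = -math.inf
    for start in range(first, last + 1):
        score = sum(pwm[i][seq[start + i]] for i in range(width))
        best = max(best, score)
    return best

def allele_score_difference(pwm, flank5, ref, alt, flank3):
    """Reference-minus-alternate score; a large magnitude suggests altered TF binding."""
    snp_index = len(flank5)
    ref_score = best_window_score(pwm, flank5 + ref + flank3, snp_index)
    alt_score = best_window_score(pwm, flank5 + alt + flank3, snp_index)
    return ref_score - alt_score

# Toy motif "TGA" built from made-up counts (hypothetical data, not a real TF matrix).
toy_counts = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]
toy_pwm = log_odds_pwm(toy_counts)
difference = allele_score_difference(toy_pwm, "CT", "G", "C", "AC")
```

A positive `difference` means the reference allele fits the toy motif better than the alternate allele. The published tools go further by attaching statistical significance to such differences rather than using the raw value.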

Here, an effective computational method for rSNP identification is presented. We aim to provide an algorithm that does not rely on the prediction of regulatory elements, so that rSNPs within unknown regulatory regions can also be predicted. To this end, two major types of classification features, sequence-based features and allele-specific features, are extracted to train and test the random forest classifiers. None of these features requires regulatory data. The proposed algorithm is tested on the collected data, and the cross-validation results show that it achieves better predictions than rSNP-MAPPER and is-rSNP. Compared with the aforesaid methods, our method neither relies on the prediction of regulatory elements, which is difficult because experimental regulatory data are insufficient, nor needs extra information such as TF PWMs or ENCODE data. This simplicity should make our method easier to transfer to other organisms and species, so the model will be helpful for future studies of regulatory variations and in particular their roles in diseases.
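The balanced-ensemble strategy outlined in the abstract (convert the unbalanced dataset into several balanced ones, train one model per balanced dataset, then combine the models) can be sketched as follows. This is only an illustration under stated assumptions: the majority class is split into disjoint chunks, the models are combined by averaging their scores, and a nearest-centroid scorer stands in for the random forest so that the sketch needs no external library. None of these choices is confirmed by the text above, and all function names are hypothetical.

```python
import math
import random

def balanced_subsets(minority, majority, seed=0):
    """Pair all minority samples with disjoint majority chunks of similar size.

    Assumes `minority` is non-empty.
    """
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    n_chunks = max(1, len(shuffled) // len(minority))
    chunks = [shuffled[i::n_chunks] for i in range(n_chunks)]
    return [(minority, chunk) for chunk in chunks]

def train_centroid(pos, neg):
    """Stand-in base learner (NOT a random forest): score by distance to class centroids."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    c_pos, c_neg = centroid(pos), centroid(neg)

    def score(x):
        d_pos, d_neg = math.dist(x, c_pos), math.dist(x, c_neg)
        total = d_pos + d_neg
        # Closer to the positive centroid gives a score nearer to 1.
        return 0.5 if total == 0 else d_neg / total
    return score

def train_ensemble(minority, majority, train_fn, seed=0):
    """Train one model per balanced subset."""
    return [train_fn(pos, neg) for pos, neg in balanced_subsets(minority, majority, seed)]

def ensemble_score(models, x):
    """Combine the models by averaging their scores (one possible ensemble strategy)."""
    return sum(model(x) for model in models) / len(models)

# Toy feature vectors (made up): 3 "rSNP" samples versus 12 background samples.
rsnps = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
background = [[-0.1 * (i % 4), 0.05 * (i % 3)] for i in range(12)]
models = train_ensemble(rsnps, background, train_centroid)
```

Because every balanced subset reuses all minority samples, no rSNP example is discarded, while each model sees a different slice of the background SNPs. In practice `train_centroid` would be replaced by a function that fits a random forest and returns its scoring function.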

Section snippets

Datasets

ORegAnno (Griffith et al., 2008; http://www.oreganno.org/oregano/) is an open database for the curation of known regulatory elements from the scientific literature, and it contains 175 regulatory polymorphisms. Other literature was also searched for documented, experimentally verified regulatory SNPs that show allele-specific binding to TFs in human: Andersen et al. (2008) have made a collection in which 104 one-base substitution polymorphisms are provided; Bryzgalov et al. (2013)

Performance on collected rSNPs

Sensitivity (SN), specificity (SP) and the area under the receiver operating characteristic curve (AUC) are used to quantitatively evaluate the performance of our method. To obtain a relatively objective assessment on the small, unbalanced dataset we have collected, leave-one-out cross-validation is employed, which means that one sequence picked randomly from the entire sample set serves as the test sequence, and all the remaining sequences constitute the training set, then
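The evaluation measures named above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes labels coded as 1 (rSNP) and 0 (background), at least one sample of each class, the rank-based formulation of AUC, and the standard form of leave-one-out cross-validation in which every sample is held out exactly once. The `train_fn` argument is a hypothetical callable that fits a model and returns a scoring function.

```python
def sensitivity_specificity(y_true, y_pred):
    """SN = TP / (TP + FN), SP = TN / (TN + FP), for labels coded 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def leave_one_out_scores(samples, labels, train_fn):
    """Hold out each sample once, train on the rest, and score the held-out sample."""
    scores = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train_fn(train_x, train_y)
        scores.append(model(samples[i]))
    return scores

# Toy labels and scores (made up) to show how the functions are called.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_predictions = [1 if s >= 0.5 else 0 for s in toy_scores]
sn, sp = sensitivity_specificity(toy_labels, toy_predictions)
toy_auc = auc(toy_labels, toy_scores)
```

Sensitivity and specificity depend on the cut-off used to turn scores into class labels (0.5 in the toy example), whereas AUC summarises the ranking across all cut-offs.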

Conclusion

As the number of SNPs discovered in humans and other species grows rapidly, more effort is urgently needed to fully interpret their biological and medical functions. In this paper, a new computational method for identifying rSNPs that change TF binding is proposed. As elaborated in this manuscript, the proposed method is simple to implement and can easily be applied to different data scenarios. We hope that our method would be helpful for experimental

Funding

This work was supported by grants from the Ph.D. Program Foundation of the Ministry of Education of China (No. 20110201110010).

Acknowledgements

We are grateful to our colleagues in the Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi’an Jiaotong University, for their help during the course of this work, in particular Dr. Shanxin Zhang and Dr. Hongqiang Lv for their critical suggestions.

References (44)

  • A. Ameur

    Identification of candidate regulatory SNPs by combination of transcription-factor-binding site prediction, SNP genotyping and haploChIP

    Nucleic Acids Res.

    (2009)
  • M.C. Andersen

    In silico detection of sequence variations modifying transcriptional regulation

    PLoS Comput. Biol.

    (2008)
  • E.P. Bishop

    A map of minor groove shape and electrostatic potential from hydroxyl radical cleavage patterns of DNA

    ACS Chem. Biol.

    (2011)
  • N. Bonadies

    PU.1 is regulated by NF-kappa B through a novel binding site in a 17 kb upstream enhancer element

    Oncogene

    (2010)
  • A.P. Boyle

    Annotation of functional variation in personal genomes using RegulomeDB

    Genome Res.

    (2012)
  • L. Breiman

    Random forests

    Mach. Learn.

    (2001)
  • L.O. Bryzgalov

    Detection of regulatory SNPs in human genome using ChIP-seq ENCODE data

    PLoS One

    (2013)
  • N.V. Chawla

    SMOTE: Synthetic minority over-sampling technique

    J. Artif. Intell. Res.

    (2002)
  • F. Colombo

    A 5′-region polymorphism modulates promoter activity of the tumor suppressor gene MFSD2A

    Mol. Cancer

    (2011)
  • M. Friedel

    DiProDB: a database for dinucleotide properties

    Nucleic Acids Res.

    (2009)
  • L.M. Fu

    CD-HIT: accelerated for clustering the next-generation sequencing data

    Bioinformatics

    (2012)
  • S. Garcia et al.

    Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy

    Evol. Comput.

    (2009)