Computer Science ›› 2022, Vol. 49 ›› Issue (6A): 309-313. doi: 10.11896/jsjkx.210700262

• Image Processing & Multimedia Technology •

Identification of 6mA Sites in Rice Genome Based on XGBoost Algorithm

SUN Fu-quan1,2, LIANG Ying1

  1. 1 College of Information Science and Engineering, Northeastern University, Shenyang 110819, China
    2 School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao, Hebei 066004, China
  • Online: 2022-06-10 Published: 2022-06-08
  • Corresponding author: LIANG Ying (1971805@stu.neu.edu.cn)
  • About author: (17853462077@163.com) SUN Fu-quan, born in 1964, Ph.D, professor. His main research interests include big data analysis and medical image processing.
    LIANG Ying, born in 1996, postgraduate. Her main research interests include bioinformatics and genomic functional site recognition.
  • Supported by:
    National Key Research and Development Project (2018YFB1402800), Hebei Higher Education Research and Practice Project (2018GJJG422) and Hebei Provincial High-level Talents Funding Project (A202101006).

Abstract: N6-methyladenine (6mA) sites play an important role in regulating gene expression in eukaryotes. Accurate identification of 6mA sites helps in understanding the distribution and biological functions of 6mA sites in the genome. At present, various experimental methods are used to identify 6mA sites in different species, but they are expensive and time-consuming. This paper therefore proposes P6mA-Rice, an XGBoost-based model for identifying 6mA sites in the rice genome. First, a sequence-based DNA encoding that introduces nucleotide position specificity, together with other relevant DNA properties, is used to represent the given sequences; effective feature extraction criteria are proposed from seven aspects so that DNA information is captured more comprehensively. Then, the features are further screened according to XGBoost feature importance, yielding the selected feature set PS6mA (named P6mA in the Chinese-language abstract). Finally, the selected features are fed into the XGBoost tree boosting classifier to construct the P6mA-Rice model. The jackknife test on a benchmark dataset shows that P6mA-Rice achieves 90.55% sensitivity, 88.48% specificity, a Matthews correlation coefficient of 79.00%, and 89.49% accuracy. Extensive experiments validate the effectiveness of P6mA-Rice.

Key words: DNA, N6-methyladenine, Position specificity, Sequence, XGBoost
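
The evaluation protocol named in the abstract (a jackknife test reported as sensitivity, specificity, Matthews correlation coefficient, and accuracy) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the two-dimensional feature vectors are made-up toy data, and a simple nearest-centroid classifier stands in for XGBoost so the example runs with the standard library alone. It is not the authors' pipeline and does not reproduce their figures.

```python
import math


def jackknife_predictions(samples, labels, fit, predict):
    """Jackknife (leave-one-out): hold out each sample once, train on the rest."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions


def fit_centroids(train_x, train_y):
    """Stand-in classifier (NOT XGBoost): one mean vector per class."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def evaluate(labels, predictions):
    """Sensitivity, specificity, accuracy and MCC from binary labels (1 = 6mA site)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "Acc": (tp + tn) / len(pairs),
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy feature vectors (hypothetical): four positives, four negatives.
samples = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.2, 0.2],
           [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.1, 0.1]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

predictions = jackknife_predictions(samples, labels, fit_centroids, predict_centroid)
metrics = evaluate(labels, predictions)
print(predictions)
print(metrics)
```

Passing `fit` and `predict` as arguments keeps the leave-one-out loop independent of the classifier, so a trained gradient-boosting model could be substituted for the stand-in without changing the evaluation code.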

CLC Number:

  • TP391