Abstract
In bioinformatics, the classification of unknown biological sequences is a key task that is fundamental for simplifying the consistency, aggregation, and survey of organisms and their evolution. Biological sequences can be viewed as data points of high, non-fixed dimension, where the dimension corresponds to the sequence length. Numerical encoding, for example one-hot encoding (OHE), plays an important role in the computational analysis of DNA sequences. However, OHE has two drawbacks: (1) it adds no information that could serve as an additional predictive variable, and (2) when a variable has many classes, it enlarges the feature space significantly. To overcome these drawbacks, this paper presents a computationally efficient framework for classifying DNA sequences of living organisms in the image domain. The proposed strategy relies on a multilayer perceptron (MLP) trained with the pseudoinverse learning autoencoder (PILAE) algorithm. PILAE training does not require setting learning control parameters or specifying the number of hidden layers. Therefore, the PILAE classifier can achieve better performance than other deep neural network (DNN) approaches such as the VGG-16 and Xception models. Experimental results demonstrate that the proposed strategy achieves high prediction accuracy and significantly high computational efficiency across different datasets.
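To make the second OHE drawback concrete, the following is a minimal sketch of one-hot encoding for a DNA string. It is not the authors' code. The function name `one_hot_encode`, the four-letter alphabet, and the choice to map unknown symbols such as N to an all-zero row are assumptions made for illustration only.

```python
# Illustrative sketch only; names and conventions are assumptions,
# not taken from the paper.
ALPHABET = "ACGT"


def one_hot_encode(sequence, alphabet=ALPHABET):
    """Return a flat list of 0/1 values, len(alphabet) entries per symbol."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    encoded = []
    for symbol in sequence.upper():
        row = [0] * len(alphabet)
        # Assumption: symbols outside the alphabet (e.g. N) stay all-zero.
        if symbol in index:
            row[index[symbol]] = 1
        encoded.extend(row)
    return encoded


if __name__ == "__main__":
    for seq in ("AC", "ACGTN"):
        vec = one_hot_encode(seq)
        # The encoded length is len(seq) * len(ALPHABET).
        print(seq, len(vec), vec)
```

The encoded length is the sequence length times the alphabet size, so it grows with both, and each row carries no information beyond the identity of the symbol.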
Acknowledgements
This work was fully supported by grants from the National Natural Science Foundation of China (NSFC) (61375045) and the Joint Research Fund in Astronomy (U1531242) under the cooperative agreement between the NSFC and the Chinese Academy of Sciences (CAS).
Ethics declarations
Conflict of interest
As authors of the manuscript, we, Mohammed A. B. Mahmoud and Ping Guo, declare that we have no conflict of interest with each other.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Communicated by V. Loia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Mahmoud, M.A.B., Guo, P. DNA sequence classification based on MLP with PILAE algorithm. Soft Comput 25, 4003–4014 (2021). https://doi.org/10.1007/s00500-020-05429-y
DOI: https://doi.org/10.1007/s00500-020-05429-y