DNA sequence classification based on MLP with PILAE algorithm

  • Methodologies and Application
Soft Computing

Abstract

In bioinformatics, classifying unknown biological sequences is a key task that supports the consistent cataloguing, grouping, and study of organisms and their evolution. Biological sequences can be viewed as high-dimensional data whose dimensionality is not fixed but depends on the length of each sequence. Numerical encoding schemes such as one-hot encoding (OHE) play an important role in the computational analysis of DNA sequences. However, OHE has two drawbacks: (1) it adds no information that could serve as an additional predictive variable, and (2) when a variable has many classes, it enlarges the feature space significantly. To overcome these drawbacks, this paper presents a computationally efficient framework for classifying DNA sequences of living organisms in the image domain. The proposed approach relies on a multilayer perceptron (MLP) trained with the pseudoinverse learning autoencoder (PILAE) algorithm. PILAE training requires neither setting learning control parameters nor specifying the number of hidden layers. The PILAE classifier can therefore achieve better performance than other deep neural network (DNN) approaches such as the VGG-16 and Xception models. Experimental results demonstrate that the proposed approach achieves high prediction accuracy as well as significantly high computational efficiency across different datasets.
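The second OHE drawback named in the abstract can be illustrated with a short sketch. The code below is not taken from the paper: the function names, the four-letter alphabet, the 650-base sequence length, and the choice of k-mers as the "classes" are all assumptions made for illustration only.

```python
import numpy as np

# Assumed four-letter DNA alphabet; ambiguous bases such as N are not handled.
ALPHABET = "ACGT"


def one_hot_encode(sequence: str) -> np.ndarray:
    """Return a (len(sequence), 4) matrix with exactly one 1 per row."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(sequence), len(ALPHABET)), dtype=np.uint8)
    for position, base in enumerate(sequence.upper()):
        encoded[position, index[base]] = 1
    return encoded


def one_hot_feature_count(sequence_length: int, k: int = 1) -> int:
    """Count one-hot features when each overlapping k-mer is one class.

    There are (sequence_length - k + 1) k-mer positions, and each position
    needs 4**k indicator columns.
    """
    return (sequence_length - k + 1) * len(ALPHABET) ** k


encoded = one_hot_encode("ACGTTA")
print(encoded.shape)                     # (6, 4)

# Hypothetical 650-base sequence: the feature count grows quickly
# as the number of classes per position increases.
print(one_hot_feature_count(650, k=1))   # 2600
print(one_hot_feature_count(650, k=3))   # 41472
```

Each row of the encoded matrix carries only the identity of a single base, which reflects the first drawback, and the feature count scales with both sequence length and the number of classes per position, which reflects the second.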

Notes

  1. https://github.com/photomedia/DDV

  2. http://www.photomedia.ca/DDV/

  3. http://www.boldsystems.org/index.php/TaxBrowser_Home

Acknowledgements

This work was fully supported by grants from the National Natural Science Foundation of China (NSFC) (61375045) and the Joint Research Fund in Astronomy (U1531242) under a cooperative agreement between the NSFC and the Chinese Academy of Sciences (CAS).

Author information

Corresponding author

Correspondence to Mohammed A. B. Mahmoud.

Ethics declarations

Conflict of interest

As authors of the manuscript, we, Mohammed A. B. Mahmoud and Ping Guo, declare that we have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mahmoud, M.A.B., Guo, P. DNA sequence classification based on MLP with PILAE algorithm. Soft Comput 25, 4003–4014 (2021). https://doi.org/10.1007/s00500-020-05429-y

  • DOI: https://doi.org/10.1007/s00500-020-05429-y
