Abstract
Microarray data have become an integral part of clinical practice and drug discovery. Because such data are voluminous and heterogeneous, the interpretability and stability of traditional gene selection methods come into question. To improve the stability of gene selection, and thereby make its results easier to explain, an improved Extended ReliefF gene selection algorithm is proposed. It encodes gene affinity information using a new formula based on Bayes’ theorem, and it uses Manhattan distance to find the nearest neighbors in a pooled sample. The algorithm works in four stages: initializing the sample gene weights, improving the gene weights, maximizing the sample gene weights, and finally applying a mutation operation. The proposed method selects the most informative genes, which are highly relevant to the prognosis of the disease. Further, to assess the accuracy and stability of the algorithm, soft classification is performed for the Relieved_F, STIR, VLS-ReliefF, I-RelieF, conventional ReliefF, and proposed Extended ReliefF algorithms using three classifiers, namely Support Vector Machine (SVM), Multilayer Perceptron (MLP), and Random Forest (RF), on ten microarray datasets. According to the findings, MLP takes much longer to train than RF and SVM. SVM is much faster to train, whereas MLP excels in accuracy. The approach becomes more stable as the similarity among the genes selected from multiple training sets rises. Overall, the proposed gene selection algorithm substantially outperforms the other feature selection methods in both accuracy and stability.
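To make the two ideas in the abstract concrete, the following is a minimal sketch of a textbook Relief-style weight update with Manhattan distance, together with one possible way to score selection stability. It is not the extended algorithm proposed in the article: the Bayes-based affinity formula, the weight maximization stage, and the mutation operation are not reproduced. The function names `relief_weights`, `top_genes`, and `jaccard_stability` are hypothetical. The sketch assumes a two-class problem with at least `n_neighbors + 1` samples per class, and the use of mean pairwise Jaccard similarity as the stability score is also an assumption, since the abstract only states that stability rises with the similarity of the genes selected across training sets.

```python
from itertools import combinations

import numpy as np


def relief_weights(X, y, n_neighbors=1):
    """Textbook Relief-style gene weights using Manhattan (L1) distance.

    X: (n_samples, n_genes) expression matrix; y: class labels.
    Assumption: two classes, each with more than n_neighbors samples.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_genes = X.shape

    # Scale each gene to [0, 1] so per-gene differences are comparable.
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # constant genes: avoid division by zero
    Xs = (X - lo) / span

    weights = np.zeros(n_genes)
    for i in range(n_samples):
        diff = np.abs(Xs - Xs[i])   # per-gene absolute differences
        dist = diff.sum(axis=1)     # Manhattan distance to sample i
        dist[i] = np.inf            # never pick the sample itself
        same = y == y[i]
        hits = np.argsort(np.where(same, dist, np.inf))[:n_neighbors]
        misses = np.argsort(np.where(~same, dist, np.inf))[:n_neighbors]
        # Reward genes that differ on nearest misses, penalize on nearest hits.
        weights += diff[misses].mean(axis=0) - diff[hits].mean(axis=0)
    return weights / n_samples


def top_genes(weights, k):
    """Indices of the k highest-weighted genes, as a set."""
    return set(np.argsort(weights)[::-1][:k].tolist())


def jaccard_stability(subsets):
    """Mean pairwise Jaccard similarity of non-empty selected gene sets."""
    pairs = list(combinations(subsets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy data (made up for illustration): gene 0 separates the classes.
    X = [[0.0, 0.5, 0.1], [0.1, 0.4, 0.9], [0.9, 0.5, 0.2], [1.0, 0.4, 0.8]]
    y = [0, 0, 1, 1]
    w = relief_weights(X, y)
    print(top_genes(w, 1))
    print(jaccard_stability([{0, 1}, {0, 2}, {0, 1}]))
```

In practice the gene subsets passed to `jaccard_stability` would come from running the selector on different training splits (for example, cross-validation folds); a score closer to 1 means the same genes are chosen regardless of the split.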
Data availability
The datasets used in this research were downloaded from genomics-pubs.princeton.edu/oncology/database.htm.
Abbreviations
- SVM: Support Vector Machine
- MLP: Multilayer Perceptron
- CV: Cross-validation
- RF: Random Forest
- LOOCV: Leave One Out Cross-Validation
Funding
This work is funded under the Data Science Research of Interdisciplinary Cyber-Physical Systems (ICPS) Programme of the Department of Science and Technology (DST) [Sanction Number T-54], New Delhi, Government of India, India.
Author information
Contributions
Both authors contributed to the planning and design of the study. Ms. Neha Srivastava prepared the materials, collected the data, and carried out the analysis. Ms. Neha Srivastava wrote the initial draft of the manuscript, and Dr. Devendra K. Tayal provided feedback on earlier drafts. Both authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
About this article
Cite this article
Srivastava, N., Tayal, D.K. Assessing gene stability and gene affinity in microarray data classification using an extended relieff algorithm. Multimed Tools Appl 83, 45761–45776 (2024). https://doi.org/10.1007/s11042-023-17149-0
DOI: https://doi.org/10.1007/s11042-023-17149-0