Gene selection in autism – Comparative study
Introduction
Autism belongs to the pervasive neurodevelopmental disorders and affects a broad spectrum of human functions [1], [2]. An important problem is early recognition of the disorder, which enables proper treatment of autistic individuals. Microarray gene expression data are now studied to find the genes, or sequences of genes, that are best associated with autism and might be treated as biomarkers. Identifying these genes is difficult because of the many outliers, the high variance of the data and the bad conditioning of the problem [1], [3], manifested by the small number of available observations (usually measured in hundreds) compared with the very large number of genes (tens of thousands).
These complexities raise the challenge of how to identify the genes that are the most informative for this disorder and that can be used to distinguish autistic individuals from others. Many feature selection methods have been used for gene selection in different problems. They include clustering methods [4], neural networks and Support Vector Machines [5], [6], [7], statistical tests [8], linear regression methods applying forward and backward selection [9], fuzzy expert system based algorithms [10], [11], rough set theory [12], global optimization methods, including genetic algorithms, chaotic binary particle swarm optimization and artificial bee colony (ABC) in connection with the kNN classifier [13], [14], [15], the ReliefF method combined with different classifiers [7], various statistical methods [16], [17], as well as fusion of many selection methods [6], [18]. Although most of these methods have been applied in cancer research, they might also be adapted to autism. Many solutions have tried several specialized methods and chosen the best one as the most appropriate for the particular problem.
However, it should be mentioned that each selection method uses its own specialized procedure for assessing class-discriminative features. The results depend on the applied selection mechanism, which might work well in some data mining problems and be inefficient in others. An additional difficulty in autism recognition is the very high variance of gene expression among individuals belonging to the same group. For example, in the NCBI database of autism [19], containing 146 observations and 54,614 genes, the variance of gene expression values across individuals ranges from 0.099 to 24.19 × 10^6, with a mean of 2.38 × 10^4 and a median of 38.84; 13,245 genes have an expression variance higher than 1000. Fig. 1 presents the mean value and variance of gene expressions in the analyzed data. It confirms the very high variability of the gene expressions and the existence of many outliers.
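The variance statistics quoted above can be reproduced in outline as follows. This is only a minimal sketch: the function name `variance_summary`, the toy matrix and the use of the sample variance (rather than the population variance) are assumptions, since the text does not state how the figures were computed.

```python
import statistics

def variance_summary(expr, threshold=1000.0):
    """Summarize per-gene variance.

    expr: list of observations, each a list of expression values (one per gene).
    """
    n_genes = len(expr[0])
    # Sample variance of each gene (column) across all observations (rows).
    variances = [
        statistics.variance([row[g] for row in expr]) for g in range(n_genes)
    ]
    return {
        "min": min(variances),
        "max": max(variances),
        "mean": statistics.mean(variances),
        "median": statistics.median(variances),
        "n_above_threshold": sum(v > threshold for v in variances),
    }

# Toy data: 4 observations x 3 genes (hypothetical values, not the NCBI data).
toy = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 300.0],
    [3.0, 30.0, 500.0],
    [4.0, 40.0, 700.0],
]
summary = variance_summary(toy)
```

On the real data the same summary would be computed over the 146 observations, with the threshold of 1000 taken from the text.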
This high variance means that the particular choice of observations used in the selection procedure may lead to completely different results. It makes the application of a single method inefficient for autistic data and calls for a special procedure, based on the application of many selection methods in multiple runs. The important task in such an approach is to fuse the selection results into the final solution.
The primary aim of the paper is to find a small population of the most informative genes strongly associated with autism. These genes might be useful as biomarkers of this neurodevelopmental disorder and at the same time serve as the input attributes of an automatic system for autism prediction.
The application of many different feature selection methods cooperating in an ensemble is proposed as the tool to solve this task. The most important requirement is to use methods that are based on different principles of operation, which supports their independent performance. Their number is not strictly defined. In this solution we have used eight methods, which in our opinion provide sufficient diversity of operation.
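Why methods built on different principles are useful can be illustrated with two simple scoring criteria. The sketch below is an assumption-laden illustration, not the eight methods used in the paper: `fisher_score` and `median_gap_score` are stand-in criteria chosen here, and the expression values are hypothetical.

```python
import statistics

def fisher_score(a, b):
    """Fisher discriminant score: squared mean gap over summed class variances."""
    gap = statistics.mean(a) - statistics.mean(b)
    spread = statistics.variance(a) + statistics.variance(b)
    return gap * gap / spread if spread > 0 else 0.0

def median_gap_score(a, b):
    """Outlier-tolerant score: absolute difference of the class medians."""
    return abs(statistics.median(a) - statistics.median(b))

def top_k(expr_by_gene, labels, scorer, k):
    """Rank genes by `scorer` and return the names of the k best."""
    scores = {}
    for gene, values in expr_by_gene.items():
        a = [v for v, y in zip(values, labels) if y == 1]
        b = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = scorer(a, b)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical data: label 1 = autism, label 0 = control.
labels = [1, 1, 1, 0, 0, 0]
expr_by_gene = {
    "g1": [5.0, 5.1, 4.9, 1.0, 1.1, 0.9],  # clean, consistent shift
    "g2": [9.0, 1.0, 9.5, 1.0, 1.2, 0.8],  # larger median shift, but noisy
    "g3": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no class difference
}
selections = {
    "fisher": top_k(expr_by_gene, labels, fisher_score, k=1),
    "median_gap": top_k(expr_by_gene, labels, median_gap_score, k=1),
}
```

Here the variance-sensitive criterion prefers `g1` while the median-based one prefers `g2`, which is the kind of disagreement that a later fusion step has to resolve.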
Such an approach has not previously been applied to autistic data by other authors. Some works present ensembles of methods for microarray data in cancer problems [6], [18]; however, their strategy was different. The authors of [18] proposed a multicriterion fusion-based approach in which the integration is done at the level of features. In our approach the fusion is performed at two levels: the level of individual methods and the level of many classifiers forming an ensemble that produces the final decision. Moreover, we propose different strategies for selecting the size of the optimal gene set. The applied methods rely on different principles and therefore assess the discrimination ability of a gene independently.
It is important to fuse their results into one final group of genes that might be treated as biomarkers of autism. In this paper we present three different approaches to gene fusion: the purity of the clusterization space, the genetic algorithm and the random forest. The limited set of genes may also be used as the input attributes of a classification system responsible for early identification of autistic individuals. This system is composed of many classifiers arranged in an ensemble integrated by the random forest. The results of numerical experiments performed on the NCBI database [19] are presented and discussed.
Materials
The basic numerical experiments of gene selection have been performed on the NCBI dataset related to autism. The database is publicly available and was downloaded from the GEO (NCBI) repository [19]. The dataset contains 146 observations and 54,613 genes. It consists of two classes: the first is related to children with autism (82 observations) and the second to the control group of healthy children (64 observations).
All subjects in the database are male.
Methods
The gene expression array of autism considered in this work contains more than 50,000 genes. Naturally, most of them have no class discrimination ability. Therefore, a first filtration of genes should be done in the introductory stage to reduce this number significantly. We have applied a strategy in which the genes with similar mean expression values for the autistic and reference (control) classes over all observations are eliminated first as not discriminative in
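The first filtration described above, which discards genes whose class means are similar, might be sketched as follows. The function name `filter_by_mean_gap`, the threshold `min_gap` and the toy values are hypothetical; the excerpt does not say how "similar" is quantified.

```python
import statistics

def filter_by_mean_gap(expr_by_gene, labels, min_gap):
    """Keep only genes whose class means differ by at least `min_gap`.

    expr_by_gene: dict mapping gene name -> expression values per observation.
    labels: 1 for the autistic class, 0 for the control class.
    """
    kept = []
    for gene, values in expr_by_gene.items():
        autism = [v for v, y in zip(values, labels) if y == 1]
        control = [v for v, y in zip(values, labels) if y == 0]
        gap = abs(statistics.mean(autism) - statistics.mean(control))
        if gap >= min_gap:  # assumed criterion; the real threshold is unknown
            kept.append(gene)
    return kept

# Hypothetical toy data: two autistic and two control observations.
labels = [1, 1, 0, 0]
expr_by_gene = {
    "gA": [10.0, 12.0, 3.0, 5.0],  # class means 11 vs 4
    "gB": [6.0, 8.0, 7.0, 7.0],    # class means 7 vs 7
    "gC": [2.0, 4.0, 1.0, 2.0],    # class means 3 vs 1.5
}
kept_genes = filter_by_mean_gap(expr_by_gene, labels, min_gap=2.0)
```

A plain mean gap ignores the scale of each gene, so in practice a normalized difference may be preferable; that choice is left open here.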
Comparative analysis of selection results
The three methods applied in the second step of gene selection produced different sets of the most important genes. Among the 24 sets corresponding to the eight methods of the first stage and the three approaches to the final stage of selection, only 13 sets contained the same 10 commonly selected genes. They include: HIST1H2BG, TRPV6, CAPS2, ZSCAN18, SNHG7, CFC1B, RHPN1, Clone FP18821 unknown mRNA, EVPLL and PSENEN. These genes can be treated as the extended set of the most
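Finding genes shared by many selection results amounts to counting in how many sets each gene appears. The sketch below assumes a simple frequency count; the function name `genes_shared_by`, the placeholder gene names and the set memberships are invented for illustration and do not reproduce the 24 sets of the study.

```python
from collections import Counter

def genes_shared_by(selected_sets, min_sets):
    """Return the genes that occur in at least `min_sets` of the given sets."""
    # Count each gene once per set, even if a set lists it twice.
    counts = Counter(gene for s in selected_sets for gene in set(s))
    return sorted(gene for gene, n in counts.items() if n >= min_sets)

# Hypothetical selection results from four method/fusion combinations.
selected_sets = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB", "geneD"},
    {"geneA", "geneC", "geneD"},
    {"geneA", "geneB"},
]
common = genes_shared_by(selected_sets, min_sets=3)
```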
Classification results
The final experiments compared the class recognition ability of the selected sets of genes. This time the available data set was split into two independent parts: 40% of the samples were used only for selecting the best genes and the remaining 60% only for class recognition. This random splitting was repeated 10 times and the results were averaged. The classification stage was performed using the genes selected on the basis of the first subset. Thanks to such
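The evaluation protocol above (select genes on 40% of the samples, assess recognition on the other 60%, repeat 10 times and average) can be outlined as below. Several parts are assumptions made only to keep the sketch self-contained: the split is stratified per class, the selector is a plain mean-gap ranking, and recognition is scored by leave-one-out nearest-centroid classification rather than by the ensemble of classifiers used in the paper.

```python
import random
import statistics

def stratified_split(y, select_fraction, rng):
    """Split sample indices per class into a selection part and a test part."""
    select_rows, test_rows = [], []
    for cls in sorted(set(y)):
        rows = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(rows)
        cut = round(select_fraction * len(rows))
        select_rows += rows[:cut]
        test_rows += rows[cut:]
    return select_rows, test_rows

def class_means(X, y, rows, gene):
    """Mean expression of one gene per class, restricted to the given rows."""
    return {cls: statistics.mean(X[i][gene] for i in rows if y[i] == cls)
            for cls in set(y[i] for i in rows)}

def select_genes(X, y, rows, k):
    """Stand-in selector: the k genes with the largest gap between class means."""
    def gap(gene):
        m = class_means(X, y, rows, gene)
        return abs(m[1] - m[0])
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def loo_accuracy(X, y, rows, genes):
    """Leave-one-out nearest-centroid accuracy on `rows`, using only `genes`."""
    correct = 0
    for held_out in rows:
        train = [i for i in rows if i != held_out]
        centroids = {g: class_means(X, y, train, g) for g in genes}
        def dist(cls):
            return sum((X[held_out][g] - centroids[g][cls]) ** 2 for g in genes)
        predicted = min((0, 1), key=dist)
        correct += predicted == y[held_out]
    return correct / len(rows)

def repeated_evaluation(X, y, k=1, repeats=10, select_fraction=0.4, seed=0):
    """Average recognition accuracy over repeated selection/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        select_rows, test_rows = stratified_split(y, select_fraction, rng)
        genes = select_genes(X, y, select_rows, k)   # uses the 40% part only
        scores.append(loo_accuracy(X, y, test_rows, genes))  # 60% part only
    return statistics.mean(scores)

# Hypothetical toy data: gene 0 separates the classes, genes 1 and 2 are noise.
data_rng = random.Random(1)
y = [1] * 10 + [0] * 10
X = [[10.0 * label + data_rng.random(), data_rng.random(), data_rng.random()]
     for label in y]
mean_accuracy = repeated_evaluation(X, y)
```

Keeping the selection rows and the test rows disjoint is the point of the protocol: genes chosen with knowledge of the test samples would give an optimistic estimate of recognition accuracy.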
Conclusions
The paper has presented and compared collective approaches to the selection of the most important genes/transcripts, which are the most informative for autism and can be used as biomarkers to distinguish the two classes of data. It was shown that a multistep collective approach, applying many different and properly integrated feature selection methods, is able to extract a small subset containing the most informative genes. The theoretical results were validated and supported by the experiments
Tomasz Latkowski was born in Poland, 1987. He received the M.Sc. and Ph.D. degrees from the Military University of Technology, Warsaw, Poland, in 2011 and 2016, respectively, all in electronic engineering. His research interest is in the area of artificial intelligence methods, data mining and their application in biomedical signal processing.
References (31)
- et al., A review of gene linkage, association, expression studies in autism, in assessment of convergent evidence, Int. J. Dev. Neurosci. (2007)
- et al., Microarray gene expression classification with few genes: Criteria to combine attribute selection and classification methods, Expert Syst. Appl. (2012)
- et al., Gene selection, classification using Taguchi chaotic binary particle swarm optimization, Expert Syst. Appl. (2011)
- et al., Reducing bioinformatics data dimension with ABC-kNN, Neurocomputing (2013)
- et al., Robust gene signatures from microarray data using genetic algorithms enriched with biological pathway keywords, J. Biomed. Inform. (2014)
- et al., Autism, increased paternal age related changes in global levels of gene expression regulation, PLoS One (2011)
- et al., Developing a predictive gene classifier for autism spectrum disorders based upon differential gene expression profiles of phenotypic subgroups, North Am. J. Med. Sci. (2013)
- et al., Cluster analysis and display of genome wide expression patterns, Proc. Natl. Acad. Sci. U.S.A. (1998)
- et al., Gene selection for cancer classification using SVM, Mach. Learn. (2002)
- et al., Data mining methods for gene selection on the basis of gene expression arrays, Int. J. Appl. Math. Comput. Sci. (2014)
- A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inference of gene changes, Bioinformatics
- Linear regression and two-class classification with gene expression data, Bioinformatics
- A fuzzy logic approach to analyzing gene expression data, Physiol. Genom.
- Design of fuzzy expert system for microarray data classification using a novel genetic swarm algorithm, Expert Syst. Appl.
- A robust gene selection method for microarray-based cancer classification, Cancer Inform.
Stanislaw Osowski was born in Poland in 1948. He received the M.Sc., Ph.D., and Dr. Sc. degrees from the Warsaw University of Technology, Warsaw, Poland, in 1972, 1975, and 1981, respectively, all in electrical engineering. Currently he is a professor of electrical engineering at the Institute of the Theory of Electrical Engineering, Measurement and Information Systems, Warsaw University of Technology, and is also employed in the Electronic Faculty of the Military University of Technology, Warsaw, Poland. His research and teaching interests are in the areas of artificial intelligence, neural networks, data mining, and biomedical signal and image processing. He is a Senior Member of IEEE.