Gene selection in autism – Comparative study

doi:10.1016/j.neucom.2016.08.123

Neurocomputing

Volume 250, 9 August 2017, Pages 37-44

https://doi.org/10.1016/j.neucom.2016.08.123

Abstract

The paper investigates the application of several feature selection methods to the identification of the most important genes in autism disorder. The study is based on gene expression microarray data. The applied methods assess the importance of genes using different selection principles. The most important step is to fuse the results of these selections into a common set of genes that are best associated with autism. These genes may be treated as biomarkers of the disorder and used in the early prediction of autism. The paper proposes and compares three methods of such fusion: purity of the clusterization space, a genetic algorithm, and a random forest in the role of integrator. The numerical experiments concern the identification of the most important biomarkers and their application in autism recognition. They show that the applied strategy of fusing many independent selection methods leads to a significant improvement in the autism recognition rate.

Introduction

Autism belongs to the pervasive neurodevelopmental disorders, affecting a broad spectrum of human functions [1], [2]. An important problem is the early recognition of this disorder, which enables proper treatment of autistic individuals. Nowadays, microarray gene expression data are studied to find the genes or sequences of genes that are best associated with autism and might be treated as biomarkers. Identifying these genes is difficult because of the many outliers, the high variance of the data, and the ill-conditioning of the problem [1], [3], which is manifested by the small number of available observations (usually measured in hundreds) compared with the very large number of genes (tens of thousands).

These complexities raise the challenge of how to identify the genes that are the most informative for this disorder and that can be used to distinguish autistic individuals from others. Many methods developed in feature selection have been used for gene selection in different problems. They include clustering methods [4], neural networks and Support Vector Machines [5], [6], [7], statistical tests [8], linear regression methods applying forward and backward selection [9], fuzzy expert system based algorithms [10], [11], rough set theory [12], global optimization methods, including genetic algorithms, chaotic binary particle swarm optimization and artificial bee colony (ABC) in connection with the kNN classifier [13], [14], [15], the ReliefF method combined with different classifiers [7], various statistical methods [16], [17], as well as the fusion of many selection methods [6], [18]. Although most of these methods have been applied in cancer research, they might also be adapted to autism. Many solutions have tried several specialized methods and chosen the best one as the most appropriate for the particular problem.

However, it should be mentioned that each selection method uses its own specialized procedure for assessing class-discriminative features. The results depend on the applied selection mechanism, which might work well in some data mining problems and be inefficient in others. An additional difficulty in autism recognition is the very high variance of gene expression among individuals belonging to the same group. For example, in the NCBI database of autism [19], containing 146 observations and 54,614 genes, the variance of gene expression values across individuals ranges from 0.099 to 24.19 × 10⁶, with a mean of 2.38 × 10⁴ and a median of 38.84, and 13,245 genes have an expression variance higher than 1000. Fig. 1 presents the mean value and variance of gene expressions in the analyzed data. It confirms the very high variability of the gene expressions and the existence of many outliers.
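The variance statistics quoted above can be reproduced in principle with a few lines of code. The following is a minimal sketch on synthetic data, not the authors' procedure: the function name `variance_summary`, the toy matrix, and the use of the sample (n − 1) variance are assumptions made here for illustration.

```python
import random
import statistics


def variance_summary(expr, threshold=1000.0):
    """Summarize per-gene variance of an expression matrix.

    expr is a list of observations (rows), each a list of gene
    expression values (columns). The sample variance (n - 1) is an
    assumption; the source does not say which estimator was used.
    """
    genes = list(zip(*expr))  # transpose: one tuple of values per gene
    variances = [statistics.variance(values) for values in genes]
    return {
        "min": min(variances),
        "max": max(variances),
        "mean": statistics.fmean(variances),
        "median": statistics.median(variances),
        "n_above": sum(v > threshold for v in variances),
    }


# Synthetic toy data: 30 observations of 4 genes with very different spreads.
rng = random.Random(0)
spreads = [0.5, 5.0, 100.0, 500.0]  # hypothetical per-gene standard deviations
toy = [[rng.gauss(100.0, s) for s in spreads] for _ in range(30)]
summary = variance_summary(toy)
print(summary)
```

On real microarray data the same summary would be computed over all genes; a mean far above the median, as reported in the text, indicates a heavily skewed variance distribution dominated by a few highly variable genes.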

This high variance of the data means that a particular choice of the set of observations used in the selection procedure may lead to completely different results. This makes the application of a single method inefficient for autism data and calls for a special procedure. The procedure proposed here is based on the application of many selection methods in multiple runs. The important task in such an approach is to fuse the selection results into the final solution.

The primary aim of the paper is to find a small population of the most informative genes strongly associated with autism. These genes might be useful as biomarkers of this neurodevelopmental disorder and at the same time serve as the input attributes to an automatic system for autism prediction.

The application of many different feature selection methods cooperating in an ensemble is proposed as the tool to solve this task. The most important requirement is to use methods that are based on different principles of operation, which guarantees their independent performance. Their number is not strictly defined. In this solution we have used eight methods, which in our opinion are sufficiently diverse in operation.
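The idea of running several selection methods many times and accumulating their choices can be sketched as follows. This is an illustration only: the two scoring functions (`mean_gap` and `fisher_score`) are simple stand-ins chosen here, not the eight methods used in the paper, and the vote counting is one possible way to summarize the runs rather than any of the three fusion strategies the paper proposes.

```python
import random
import statistics
from collections import Counter


def mean_gap(x0, x1):
    """Absolute difference of class means (stand-in scorer)."""
    return abs(statistics.fmean(x1) - statistics.fmean(x0))


def fisher_score(x0, x1):
    """Squared mean difference divided by summed class variances (stand-in scorer)."""
    gap = statistics.fmean(x1) - statistics.fmean(x0)
    spread = statistics.pvariance(x0) + statistics.pvariance(x1)
    return gap * gap / (spread + 1e-12)


SCORERS = (mean_gap, fisher_score)


def vote_genes(X, y, top_k=2, n_runs=20, frac=0.8, seed=0):
    """Count how often each gene lands in the top_k of any scorer over random subsamples."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    rows = {c: [i for i, label in enumerate(y) if label == c] for c in (0, 1)}
    votes = Counter()
    for _ in range(n_runs):
        # Draw a class-wise random subsample so both classes are always present.
        sub = {c: rng.sample(r, max(1, int(frac * len(r)))) for c, r in rows.items()}
        for scorer in SCORERS:
            scored = []
            for g in range(n_genes):
                x0 = [X[i][g] for i in sub[0]]
                x1 = [X[i][g] for i in sub[1]]
                scored.append((scorer(x0, x1), g))
            scored.sort(reverse=True)
            for _, g in scored[:top_k]:
                votes[g] += 1
    return votes


# Synthetic toy data: genes 0 and 1 differ between classes, genes 2..5 are noise.
rng = random.Random(1)
y = [0] * 20 + [1] * 20
X = [[rng.gauss(5.0 * label if g < 2 else 0.0, 1.0) for g in range(6)] for label in y]
votes = vote_genes(X, y)
print(votes.most_common())
```

Genes that collect votes across many runs and many scorers are less likely to owe their rank to one particular subsample, which is the motivation the text gives for combining methods.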

To our knowledge, such an approach has not previously been applied to autism data by other authors. There are some works presenting an ensemble of methods for microarray data in cancer problems [6], [18]; however, the strategy was different. The authors of [18] have proposed a multicriterion fusion-based approach, in which the integration is done at the level of features. In our approach the fusion is performed at two levels: the level of individual methods and the level of many classifiers forming an ensemble that produces the final decision. Moreover, we propose different strategies for selecting the size of the optimal gene set. The applied methods rely on different principles and therefore assess the discrimination ability of each gene independently.

The important point is to fuse their results into one final group of genes that might be treated as the biomarkers of autism. In this paper we present three different approaches to gene fusion: the purity of the clusterization space, the genetic algorithm and the random forest. The limited set of genes may also be used as the input attributes of a classification system responsible for the early identification of autistic individuals. This system is composed of many classifiers arranged in an ensemble integrated by a random forest. The results of numerical experiments performed on the NCBI database [19] are presented and discussed.

Section snippets

Materials

The basic numerical experiments of gene selection have been performed on the NCBI dataset related to autism. The database is publicly available and was downloaded from the GEO (NCBI) repository [19]. The dataset contains 146 observations and 54,613 genes. It consists of two classes: the first is related to children with autism (82 observations) and the second to the control group of healthy children (64 observations).

All subjects in the database are male.

Methods

The gene expression array for autism considered in this work contains more than 50,000 genes. Naturally, most of them have no class discrimination ability. Therefore, a first filtration of genes should be done in the introductory stage to reduce this number significantly. We have applied a strategy in which the genes with similar mean expression values for the autistic and reference (control) classes across all observations are eliminated first as not discriminative in

Comparative analysis of selection results

The three different methods applied in the second step of gene selection have produced different sets of the most important genes. Among the 24 sets corresponding to the eight methods of the first stage and the three approaches to the final stage of selection, there were only 13 sets containing 10 commonly selected genes. They include: HIST1H2BG, TRPV6, CAPS2, ZSCAN18, SNHG7, CFC1B, RHPN1, Clone FP18821 unknown mRNA, EVPLL and PSENEN. These genes can be treated as the extended set of the most

Classification results

The final experiments compared the class recognition ability of the selected sets of genes. This time the available data set was split into two independent parts: 40% of the samples were used only for selecting the best genes and the remaining 60% only for class recognition. This random splitting was repeated 10 times and the results were averaged. The classification stage was performed using the genes selected on the basis of the first subset. Thanks to such
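The splitting protocol described in this snippet (40% of samples for selection, 60% for recognition, 10 random repetitions, averaged results) can be sketched as a small harness. This is an assumption-laden illustration: `repeated_split`, `toy_select` and `toy_evaluate` are hypothetical names, the split is a plain (unstratified) random shuffle, and the toy callables merely record the splits instead of running real gene selection or classification.

```python
import random
import statistics


def repeated_split(n_samples, select, evaluate, n_repeats=10, select_frac=0.4, seed=0):
    """Repeat a disjoint selection/recognition split and average the scores.

    select(sel_idx) returns the chosen genes using only the selection part;
    evaluate(genes, rec_idx) returns a score using only the recognition part.
    """
    rng = random.Random(seed)
    cut = round(select_frac * n_samples)
    scores = []
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        sel_idx, rec_idx = idx[:cut], idx[cut:]  # the two parts never overlap
        genes = select(sel_idx)
        scores.append(evaluate(genes, rec_idx))
    return statistics.fmean(scores), scores


# Toy callables that only record the splits (placeholders for real methods).
splits = []


def toy_select(sel_idx):
    splits.append([sel_idx])
    return [0, 1]  # hypothetical indices of "selected" genes


def toy_evaluate(genes, rec_idx):
    splits[-1].append(rec_idx)
    return len(rec_idx) / 146  # placeholder score, not an accuracy


mean_score, scores = repeated_split(146, toy_select, toy_evaluate)
print(len(scores), len(splits[0][0]), len(splits[0][1]))
```

Keeping the selection samples out of the recognition stage means that the reported recognition rate is not inflated by genes chosen on the very samples being classified.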

Conclusions

The paper has presented and compared collective approaches to the selection of the most important genes/transcripts, which are the most informative for autism and can be used as biomarkers to distinguish the two classes of data. It was shown that a multistep collective approach, applying many different, properly integrated feature selection methods, is able to extract a small subset containing the most informative genes. The theoretical results were validated and supported by the experiments

Tomasz Latkowski was born in Poland in 1987. He received the M.Sc. and Ph.D. degrees from the Military University of Technology, Warsaw, Poland, in 2011 and 2016, respectively, both in electronic engineering. His research interests are in the areas of artificial intelligence methods, data mining and their application in biomedical signal processing.

References (31)

M.S. Yang et al.
A review of gene linkage, association and expression studies in autism and an assessment of convergent evidence
Int. J. Dev. Neurosci.
(2007)

C.J. Alonso-González et al.
Microarray gene expression classification with few genes: Criteria to combine attribute selection and classification methods
Expert Syst. Appl.
(2012)

L. Chuang et al.
Gene selection and classification using Taguchi chaotic binary particle swarm optimization
Expert Syst. Appl.
(2011)

T. Prasartvit et al.
Reducing bioinformatics data dimension with ABC-kNN
Neurocomputing
(2013)

R.M. Luque-Baena et al.
Robust gene signatures from microarray data using genetic algorithms enriched with biological pathway keywords
J. Biomed. Inform.
(2014)

M. Alter et al.
Autism and increased paternal age related changes in global levels of gene expression regulation
PLoS One
(2011)

V. Hu et al.
Developing a predictive gene classifier for autism spectrum disorders based upon differential gene expression profiles of phenotypic subgroups
North Am. J. Med. Sci.
(2013)

M. Eisen et al.
Cluster analysis and display of genome wide expression patterns
Proc. Natl. Acad. Sci. U.S.A.
(1998)

I. Guyon et al.
Gene selection for cancer classification using SVM
Mach. Learn.
(2002)

M. Muszyński et al.
Data mining methods for gene selection on the basis of gene expression arrays
Int. J. Appl. Math. Comput. Sci.
(2014)

P. Baldi et al.
A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inference of gene changes
Bioinformatics
(2001)

X. Huang et al.
Linear regression and two-class classification with gene expression data
Bioinformatics
(2003)

P.J. Woolf et al.
A fuzzy logic approach to analyzing gene expression data
Physiol. Genom.
(2000)

P.G. Kumar et al.
Design of fuzzy expert system for microarray data classification using a novel genetic swarm algorithm
Expert Syst. Appl.
(2012)

X. Wang et al.
A robust gene selection method for microarray-based cancer classification
Cancer Inform.
(2010)

Stanislaw Osowski was born in Poland in 1948. He received the M.Sc., Ph.D., and Dr. Sc. degrees from the Warsaw University of Technology, Warsaw, Poland, in 1972, 1975, and 1981, respectively, all in electrical engineering. Currently he is a professor of electrical engineering at the Institute of the Theory of Electrical Engineering, Measurement and Information Systems, Warsaw University of Technology, and is also employed in the Electronic Faculty of the Military University of Technology, Warsaw, Poland. His research and teaching interests are in the areas of artificial intelligence, neural networks, data mining, and biomedical signal and image processing. He is a Senior Member of the IEEE.
