Elsevier

Neurocomputing

Volume 250, 9 August 2017, Pages 37-44
Gene selection in autism – Comparative study

https://doi.org/10.1016/j.neucom.2016.08.123

Abstract

The paper investigates the application of several feature selection methods to the identification of the most important genes in autism. The study is based on gene expression microarray data. The applied methods assess the importance of genes according to different selection principles. The most important step is to fuse the results of these selections into a common set of genes that are best associated with autism. These genes may be treated as biomarkers of the disorder and used in the early prediction of autism. The paper proposes and compares three different methods of such fusion: purity of the clusterization space, a genetic algorithm, and a random forest in the role of integrator. The numerical experiments concern the identification of the most important biomarkers and their application in autism recognition. They show that the applied fusion strategy, which combines many independent selection methods, leads to a significant improvement in the autism recognition rate.

Introduction

Autism belongs to the pervasive neurodevelopmental disorders, affecting a broad spectrum of human functions [1], [2]. An important problem is the early recognition of this disorder, which enables proper treatment of autistic individuals. Nowadays, microarray gene expression data are studied to find the genes or sequences of genes that are best associated with autism and might be treated as biomarkers. The difficulties in identifying these genes are the many outliers, the high variance of the data, and the ill-conditioning of the problem [1], [3], manifested by the small number of available observations (usually measured in hundreds) in comparison to the very large number of genes (tens of thousands).

These complexities raise the challenge of how to identify the genes that are the most informative for this disorder and that can be used to distinguish autistic individuals from others. Many feature selection methods have been used for gene selection in different problems. They include clustering methods [4], neural networks and Support Vector Machines [5], [6], [7], statistical tests [8], linear regression methods applying forward and backward selection [9], algorithms based on fuzzy expert systems [10], [11], rough set theory [12], global optimization methods, including genetic algorithms, chaotic binary particle swarm optimization and artificial bee colony (ABC) in connection with the kNN classifier [13], [14], [15], the ReliefF method combined with different classifiers [7], various statistical methods [16], [17], as well as the fusion of many selection methods [6], [18]. Although most of these methods have been applied in cancer research, they might also be adapted to autism. Many solutions have tried several specialized methods and chosen the best one as the most appropriate for the particular problem.

However, it should be mentioned that each selection method uses a specialized procedure for assessing the class-discriminative features. The results depend on the applied selection mechanism, which might work well in some data mining problems and be inefficient in others. An additional difficulty in autism recognition is the very high variance of gene expression among individuals belonging to the same group. For example, in the NCBI database of autism [19], containing 146 observations and 54,614 genes, the variance of gene expression values across individuals ranges from 0.099 to 24.19 × 10^6, with a mean of 2.38 × 10^4, a median of 38.84, and 13,245 genes with an expression variance higher than 1000. Fig. 1 presents the mean value and variance of gene expressions in the analyzed data. It confirms the very high variability of the gene expressions and the existence of many outliers.
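Summary statistics of this kind can be computed along the following lines. This is a minimal sketch, assuming the expression data are held as a list of per-observation rows and that the sample variance is used; the toy matrix and the function name `variance_summary` are illustrative and do not come from the original study.

```python
import statistics

def variance_summary(expr, threshold=1000.0):
    """Summarize per-gene variance for an observations-by-genes matrix.

    expr: list of rows, one row per observation, one column per gene.
    threshold: variance level used to count highly variable genes
               (1000 mirrors the figure quoted in the text).
    """
    # Transpose so that each element holds all observations of one gene.
    per_gene = list(zip(*expr))
    # Sample variance of each gene across observations (an assumption).
    variances = [statistics.variance(values) for values in per_gene]
    return {
        "min": min(variances),
        "max": max(variances),
        "mean": statistics.mean(variances),
        "median": statistics.median(variances),
        "n_above_threshold": sum(v > threshold for v in variances),
    }

# Toy data (hypothetical): 4 observations, 3 genes.
toy = [
    [1.0, 10.0, 100.0],
    [2.0, 10.0, 300.0],
    [3.0, 10.0, 100.0],
    [4.0, 10.0, 300.0],
]
summary = variance_summary(toy)
```

On the toy matrix only the third gene exceeds the variance threshold, which illustrates how a small number of genes can dominate the mean while leaving the median low.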

This high variance of the data means that a particular choice of the observations used in the selection procedure may lead to completely different results. This makes the application of a single method inefficient for autistic data and calls for a special procedure. The procedure will be based on applying many selection methods in multiple runs. The important task in such an approach is to fuse the selection results into the final solution.
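The idea of pooling many selection runs can be illustrated with plain frequency voting. This is a sketch under stated assumptions: vote counting is a simple stand-in and is not one of the three fusion strategies studied in the paper, and the function name `fuse_by_votes` and the gene labels are hypothetical.

```python
from collections import Counter

def fuse_by_votes(selections, top_k):
    """Rank genes by how often they were selected across methods and runs.

    selections: iterable of gene-name collections, one per (method, run) pair.
    top_k: number of genes to keep in the fused set.
    """
    votes = Counter()
    for selected in selections:
        # Count each gene at most once per selection.
        votes.update(set(selected))
    # Sort by descending vote count, breaking ties alphabetically
    # so that the result is deterministic.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in ranked[:top_k]]

# Hypothetical output of three selection runs (gene names are placeholders).
runs = [
    ["g1", "g2", "g3"],
    ["g2", "g3", "g4"],
    ["g3", "g5", "g2"],
]
fused = fuse_by_votes(runs, top_k=2)
```

Genes that survive across many runs and methods are less likely to be artifacts of one particular subset of observations, which is the motivation for repeating the selection.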

The primary aim of the paper is to find a small population of the most informative genes strongly associated with autism. These genes might be useful as biomarkers of this neurodevelopmental disorder and at the same time serve as the input attributes to an automatic system for autism prediction.

The application of many different feature selection methods cooperating in an ensemble will be proposed as the best tool to solve this task. The most important requirement is to use methods that are based on different principles of operation, so that they perform independently. Their number is not strictly defined. In this solution we have used eight methods, which in our opinion are sufficiently diverse in operation.

Such an approach to autistic data has not been applied by other authors. There are some works presenting an ensemble of methods for microarray data in cancer problems [6], [18]; however, the strategy was different. The authors of [18] have proposed a multicriterion fusion-based approach, in which the integration is done on the level of features. In our approach the fusion is performed on two levels: the level of individual methods and the level of many classifiers, which form an ensemble producing the final decision. Moreover, we propose different strategies for selecting the size of the optimal gene set. The applied methods rely on different principles and therefore assess the discrimination ability of a gene in an independent way.

The important point is to fuse their results into one final group of genes that might be treated as the biomarkers of autism. In this paper we will present three different approaches to gene fusion: the purity of the clusterization space, the genetic algorithm and the random forest. The limited set of genes may also be used as the input attributes in the classification system responsible for early identification of autistic individuals. This system is composed of many classifiers arranged in an ensemble integrated by the random forest. The results of numerical experiments performed on the NCBI database [19] will be presented and discussed.

Materials

The basic numerical experiments of gene selection have been performed on the NCBI dataset related to autism. The database is publicly available and was downloaded from the GEO (NCBI) repository [19]. The dataset contains 146 observations and 54,613 genes. It consists of two classes: the first is related to children with autism (82 observations) and the second to the control group of healthy children (64 observations).

All subjects in the database are male.

Methods

The gene expression array of autism considered in this work contains more than 50,000 genes. Naturally, most of them have no class discrimination ability. Therefore, a first filtration of genes should be done in the introductory stage to reduce this number significantly. We have applied a strategy in which the genes with similar mean values of expression for the autistic and reference (control) classes within all observations are eliminated first as not discriminative in
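A minimal sketch of mean-based pre-filtering of this kind follows. It assumes that "similar" is judged by the absolute difference of the class means against a threshold; the threshold `min_gap`, the label coding (1 for autism, 0 for control), and the function name are hypothetical, since the source does not state the exact criterion.

```python
import statistics

def prefilter_by_mean_gap(expr, labels, min_gap):
    """Return indices of genes whose class means differ by more than min_gap.

    expr: list of rows (observations) with one column per gene.
    labels: list of class labels, 1 for autism and 0 for control.
    min_gap: hypothetical threshold on the absolute difference of class means.
    """
    autism = [row for row, y in zip(expr, labels) if y == 1]
    control = [row for row, y in zip(expr, labels) if y == 0]
    kept = []
    for j in range(len(expr[0])):
        mean_a = statistics.mean(row[j] for row in autism)
        mean_c = statistics.mean(row[j] for row in control)
        # Genes with nearly equal class means are dropped.
        if abs(mean_a - mean_c) > min_gap:
            kept.append(j)
    return kept

# Toy data: gene 0 separates the classes, gene 1 does not.
toy_expr = [[5.0, 1.0], [6.0, 2.0], [1.0, 1.0], [2.0, 2.0]]
toy_labels = [1, 1, 0, 0]
kept_genes = prefilter_by_mean_gap(toy_expr, toy_labels, min_gap=1.0)
```

A cheap filter like this only reduces the candidate pool; the actual ranking is left to the selection methods applied afterwards.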

Comparative analysis of selection results

The three different methods applied in the second step of gene selection have produced different sets of the most important genes. Among the 24 sets, corresponding to the eight methods of the first stage and the three approaches to the final stage of selection, there were only 13 sets containing 10 commonly selected genes. They include: HIST1H2BG, TRPV6, CAPS2, ZSCAN18, SNHG7, CFC1B, RHPN1, Clone FP18821 unknown mRNA, EVPLL and PSENEN. These genes can be treated as the extended set of the most

Classification results

Final experiments were aimed at comparing the class recognition ability of the selected sets of genes. This time the available data set was split into two independent parts: 40% of the samples were used only for selecting the best genes and the remaining 60% only for class recognition. This random splitting was repeated 10 times and the results were averaged. The classification stage was performed using the genes selected on the basis of the first subset. Thanks to such
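The splitting protocol described here (40% for selection, 60% for recognition, 10 random repetitions, averaged results) can be sketched as follows. This is only an outline of the data flow: `select_fn` and `evaluate_fn` are hypothetical placeholders for the gene selection and classification stages, and how the recognition subset is used internally is not specified by the source.

```python
import random
import statistics

def repeated_split_evaluation(n_samples, select_fn, evaluate_fn,
                              selection_fraction=0.4, n_repeats=10, seed=0):
    """Average a score over repeated random selection/recognition splits.

    select_fn(indices) -> selected genes, using only the selection subset.
    evaluate_fn(genes, indices) -> score, using only the recognition subset.
    Both callables are hypothetical placeholders supplied by the caller.
    """
    rng = random.Random(seed)
    n_selection = round(selection_fraction * n_samples)
    scores = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The two subsets are disjoint by construction.
        selection_idx = indices[:n_selection]
        recognition_idx = indices[n_selection:]
        genes = select_fn(selection_idx)
        scores.append(evaluate_fn(genes, recognition_idx))
    return statistics.mean(scores)

# Dummy callables that only demonstrate the data flow.
split_sizes = []

def dummy_select(indices):
    split_sizes.append(len(indices))
    return ["gene_a", "gene_b"]

def dummy_evaluate(genes, indices):
    return 1.0

mean_score = repeated_split_evaluation(146, dummy_select, dummy_evaluate)
```

Keeping the selection and recognition subsets disjoint prevents the selected genes from being tuned to the samples on which they are later tested.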

Conclusions

The paper has presented and compared collective approaches to the selection of the most important genes/transcripts, which are the most informative for autism and can be used as biomarkers to distinguish the two classes of data. It was shown that a multistep collective approach, applying many different, properly integrated feature selection methods, is able to extract a small subset containing the most informative genes. The theoretical results were validated and supported by the experiments

Tomasz Latkowski was born in Poland in 1987. He received the M.Sc. and Ph.D. degrees from the Military University of Technology, Warsaw, Poland, in 2011 and 2016, respectively, both in electronic engineering. His research interests are in the areas of artificial intelligence methods, data mining and their application in biomedical signal processing.

References (31)

  • P. Baldi et al., A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inference of gene changes, Bioinformatics (2001)
  • X. Huang et al., Linear regression and two-class classification with gene expression data, Bioinformatics (2003)
  • P.J. Woolf et al., A fuzzy logic approach to analyzing gene expression data, Physiol. Genom. (2000)
  • P.G. Kumar et al., Design of fuzzy expert system for microarray data classification using a novel genetic swarm algorithm, Expert Syst. Appl. (2012)
  • X. Wang et al., A robust gene selection method for microarray-based cancer classification, Cancer Inform. (2010)

Stanislaw Osowski was born in Poland in 1948. He received the M.Sc., Ph.D., and Dr. Sc. degrees from the Warsaw University of Technology, Warsaw, Poland, in 1972, 1975, and 1981, respectively, all in electrical engineering. Currently he is a professor of electrical engineering at the Institute of the Theory of Electrical Engineering, Measurement and Information Systems, Warsaw University of Technology, and is also employed in the Electronic Faculty of the Military University of Technology, Warsaw, Poland. His research and teaching interests are in the areas of artificial intelligence, neural networks, data mining, and biomedical signal and image processing. He is a Senior Member of IEEE.
