Brief Communication
Finding rule groups to classify high dimensional gene expression datasets

https://doi.org/10.1016/j.compbiolchem.2008.07.031

Abstract

Microarray data provides quantitative information about the transcription profile of cells. Machine learning methodology has increasingly attracted bioinformatics researchers as a way to analyze microarray datasets, and several machine learning approaches are widely used to classify and mine biological datasets. However, many gene expression datasets are extremely high dimensional, so traditional machine learning methods cannot be applied effectively and efficiently. This paper proposes a robust algorithm that finds rule groups to classify gene expression datasets. Unlike most classification algorithms, which select dimensions (genes) heuristically to form rule groups that identify classes such as cancerous and normal tissues, our algorithm guarantees finding the best-k dimensions (genes) to form rule groups for the classification of expression datasets. Our experiments show that the rule groups obtained by our algorithm achieve higher accuracy than other classification approaches.

Introduction

Mining gene expression datasets has generated interest among many bioinformatics researchers (Alon et al., 1999, Chen et al., 2007, Knapp and Chen, 2007, Mann et al., 2007, Mramor et al., 2007). One of the important trends in bioinformatics is the identification of genes or groups of genes that differentiate diseased tissues from normal tissues. Classifying tissues as cancerous or normal using the identified genes is one of the key problems in bioinformatics. Golub et al. (1999) first showed that gene expression signatures give better diagnostic performance in acute leukemia classification than other diagnostic methods in use at the time. Many other studies (Bhattacharjee et al., 2001, Khan et al., 2001, Shipp et al., 2002) have since been undertaken in almost all cancer types. Recently, a wide range of statistical and machine learning methods for microarray data analysis has been developed (Allison et al., 2006, Nahar et al., 2007, Asyali et al., 2006, Pham et al., 2006). Because microarray data has a high number of genes and a small number of samples, finding accurate patterns to classify it is a hard task. This paper attempts to find rule groups for several different cancer datasets. The results are encouraging in terms of accuracy and effectiveness.

Gene expression data is usually represented as a matrix; each element in the matrix represents the expression level of a particular gene under a particular condition. We assume that a gene expression matrix has n rows and m columns. The rows represent samples that are divided into different classes such as cancerous tissue and normal tissue. The columns represent genes, whose number usually exceeds several thousand. The number of rows is much lower than the number of columns, as the number of samples used ranges from ten to several hundred. Traditional machine learning techniques such as decision trees and support vector machines cannot classify this kind of extremely high dimensional data effectively, because they use heuristics to select significant dimensions (genes), and many discriminative dimensions can be left out. In this paper, we propose a classification method that generates rule groups to categorize samples. A rule is a conjunction of constraints on several dimensions (genes); each gene is constrained to one interval. For example, (gene1 > 120.5) ∧ (gene2 ≤ 20.3) is one such rule. If a sample satisfies the conjunction of a rule, it is covered by the rule. The above rule covers samples whose expression values of gene1 are larger than 120.5 and whose expression values of gene2 are smaller than or equal to 20.3. In contrast to traditional machine learning algorithms that use heuristics, our method guarantees finding the best-k genes, those most discriminative for classifying samples in different classes, to form rule groups. The value of the parameter k is set to around 5. This choice follows the principle of Occam's razor (Mitchell, 1997), by which each rule should not be too long; otherwise, the problem of overfitting will arise (Quinlan, 1986).
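
As a minimal sketch of how such a rule covers samples, the example rule above can be checked against a small expression matrix. The matrix values, the column indices `GENE1` and `GENE2`, and the helper `covers` are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

# Hypothetical toy expression matrix: rows are samples, columns are genes.
X = np.array([
    [130.0, 15.2],   # sample 0
    [125.1, 25.0],   # sample 1
    [100.3, 10.0],   # sample 2
    [121.0, 20.3],   # sample 3
])
GENE1, GENE2 = 0, 1  # column indices (assumed names for this sketch)

# A rule is a list of (gene column, lower bound, upper bound) constraints,
# each read as: lower < value <= upper.
rule = [
    (GENE1, 120.5, np.inf),    # gene1 > 120.5
    (GENE2, -np.inf, 20.3),    # gene2 <= 20.3
]

def covers(rule, X):
    """Return a boolean mask of the samples that satisfy every constraint."""
    mask = np.ones(X.shape[0], dtype=bool)
    for col, low, high in rule:
        mask &= (X[:, col] > low) & (X[:, col] <= high)
    return mask

covered = covers(rule, X)  # samples 0 and 3 satisfy both constraints
```

Representing each constraint as a half-open interval keeps the conjunction a simple element-wise AND over columns, which matches the description of a rule as one interval per gene.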

Approach

A rule group is associated with a target class, as different classes have different rule groups that reflect the common characteristics of each class. Throughout this paper, the samples that belong to the target class are treated as positive samples, and the samples that belong to other classes are treated as negative samples. For the sake of consistency, we treat dimensions as columns (or genes) in the gene expression matrix.
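
The positive/negative convention can be illustrated with a short sketch. The label values, the helper `split_by_target`, and the `covered` mask below are hypothetical and serve only to show how a rule's coverage would be counted against a chosen target class.

```python
import numpy as np

# Hypothetical class labels for six samples.
labels = np.array(["cancer", "normal", "cancer", "normal", "normal", "cancer"])

def split_by_target(labels, target):
    """Return boolean masks (positive, negative) for the given target class."""
    positive = labels == target
    return positive, ~positive

pos, neg = split_by_target(labels, "cancer")

# Hypothetical coverage mask of some rule over the same six samples.
covered = np.array([True, False, True, True, False, False])
n_pos_covered = int((covered & pos).sum())  # positives the rule covers
n_neg_covered = int((covered & neg).sum())  # negatives the rule covers
```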

Rule groups reveal biological relationship between cellular function and group

Methods

Our algorithm enumerates all possible combinations of items to find the rule group that describes a specific class. As in most rule generation algorithms, the gene expression data is first discretized into symbols. Because the dimensionality of gene expression data is usually very high, genes with low discriminative power are removed in the preprocessing step of our algorithm.
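
The following sketch illustrates exhaustive enumeration of item combinations over already discretized data. It is not the authors' algorithm: this excerpt does not give the discretization method, the pruning strategy, or the ranking criterion, so the toy data, the function `enumerate_rules`, and the choice to keep only combinations that cover no negative sample are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical discretized data: each sample maps gene -> symbol.
samples = [
    {"g1": "high", "g2": "low",  "g3": "low"},
    {"g1": "high", "g2": "low",  "g3": "high"},
    {"g1": "low",  "g2": "low",  "g3": "high"},
    {"g1": "low",  "g2": "high", "g3": "high"},
]
is_positive = [True, True, False, False]

def enumerate_rules(samples, is_positive, max_items):
    """Exhaustively list item combinations that cover no negative sample."""
    # An item is a (gene, symbol) pair observed in the data.
    items = sorted({(g, s) for sample in samples for g, s in sample.items()})
    rules = []
    for size in range(1, max_items + 1):
        for combo in combinations(items, size):
            if len({g for g, _ in combo}) < size:
                continue  # skip combinations that constrain one gene twice
            covered = [all(sample.get(g) == s for g, s in combo)
                       for sample in samples]
            n_pos = sum(c and p for c, p in zip(covered, is_positive))
            n_neg = sum(c and not p for c, p in zip(covered, is_positive))
            if n_pos > 0 and n_neg == 0:
                rules.append((combo, n_pos))
    # Assumed ranking: more positives covered first, then shorter rules.
    rules.sort(key=lambda r: (-r[1], len(r[0])))
    return rules

rules = enumerate_rules(samples, is_positive, max_items=2)
best_rule, best_cover = rules[0]
```

The `max_items` bound plays the role of the constraint on rule length; without it, and without pruning, the number of combinations grows exponentially with the number of genes.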

Experiment

We test our algorithm on five widely used gene expression datasets: ALL-AML leukemia (ALL), breast cancer (BC), colon tumor (CT), lung cancer (LC) and prostate cancer (PC). The rows of the datasets represent clinical samples; the columns represent gene expression values, each of which is a real number giving the expression level of a specific gene for a sample. Each dataset contains two classes of samples. The datasets can be found at http://sdmc.i2r.a-star.edu.sg/rp.

Table 1 shows the information

Conclusion

In this paper, we propose a robust algorithm to find rule groups that describe a specific class in high dimensional gene expression datasets. Our algorithm enumerates all possible combinations of dimensions. By introducing pruning and a constraint on the number of items, the procedure can be executed efficiently on any personal computer. The algorithm guarantees finding the best-k rules for a specific class; the predictive accuracy is found to be better than that of the state of

References (22)

  • J. An et al. DDR: an index method for large time series datasets. Inform. Syst. (2005)
  • D.B. Allison. Microarray data analysis: from disarray to consolidation and consensus. Nat. Rev. Genet. (2006)
  • U. Alon et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. U.S.A. (1999)
  • J. An et al. Another inductive algorithm. Lect. Notes Artif. Intellig. (2005)
  • M.H. Asyali. Gene expression profile classification: a review. Curr. Bioinformatics (2006)
  • Bayardo, R.J., 1998. Efficiently mining long patterns from databases. 17th ACM SIGMOD International Conference on...
  • A. Bhattacharjee. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl Acad. Sci. U.S.A. (2001)
  • Q. Chen et al. Detecting inconsistency in biological molecular databases using ontology. Data Min. Knowl. Discov. (2007)
  • Clark, P., Boswell, R., 1991. Rule induction with CN2: some recent improvements. Mach. Learn. EWSL-91,...
  • U.M. Fayyad et al. Multi-interval discretization of continuous-valued attributes for classification learning. IJCAI (1993)
  • T.R. Golub. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999)

This work is partially supported by Grant DP0344488 from the Australian Research Council.