Brief Communication
Finding rule groups to classify high dimensional gene expression datasets☆
Introduction
Mining gene expression datasets has generated interest among many bioinformatics researchers (Alon et al., 1999, Chen et al., 2007, Knapp and Chen, 2007, Mann et al., 2007, Mramor et al., 2007). One of the important trends in bioinformatics is the identification of genes or groups of genes that differentiate diseased tissues from normal tissues. Classifying tissues as cancerous or normal using the identified genes is one of the key problems in bioinformatics. Golub et al. (1999) first showed that gene expression signatures give better diagnostic performance in acute leukemia classification than other currently used diagnostic methods. Many other studies (Bhattacharjee et al., 2001, Khan et al., 2001, Shipp et al., 2002) have since been undertaken across almost all cancer types. Recently, a wide range of statistical and machine learning methods for microarray data analysis has been developed (Allison et al., 2006, Nahar et al., 2007, Asyali et al., 2006, Pham et al., 2006). Because microarray data combine a high number of genes with a small number of samples, it is hard to find accurate patterns to classify them. This paper attempts to find rule groups for several different cancer datasets. The results are encouraging in terms of accuracy and effectiveness.
Gene expression data is usually represented as a matrix; each element in the matrix represents the expression level of a particular gene under a particular condition. We assume that a gene expression matrix has n rows and m columns. The rows represent samples that are divided into different classes such as cancerous tissue and normal tissue. The columns represent genes, which usually number more than several thousand. The number of rows is much lower than the number of columns, as the number of samples typically ranges from ten to several hundred. Traditional machine learning techniques such as decision trees and support vector machines cannot classify this kind of extremely high dimensional data effectively, because they use heuristics to select significant dimensions (genes) and many discriminative dimensions can be left out. In this paper, we propose a classification method that generates rule groups to categorize samples. A rule is a conjunction of conditions on several dimensions (genes); each gene is constrained to one interval. For example, (gene1 > 120.5) ∧ (gene2 ≤ 20.3) is one such rule. If a sample satisfies the conjunction of a rule, it is covered by the rule. The above rule covers samples whose expression values of gene1 are larger than 120.5 and whose expression values of gene2 are smaller than or equal to 20.3. In contrast to traditional machine learning algorithms that use heuristics, our method guarantees finding the best-k genes, those most discriminative for classifying samples in different classes, to form rule groups. The parameter k is set to around 5, because by the principle of Occam's razor (Mitchell, 1997) each rule should not be too long; otherwise, the problem of overfitting will arise (Quinlan, 1986).
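The notion of a rule covering a sample can be made concrete with a small sketch. This is only an illustration: the tuple encoding of a condition, the function name covers, and the sample values are assumptions made here, not part of the paper; only the example rule (gene1 > 120.5) ∧ (gene2 ≤ 20.3) comes from the text.

```python
import operator

# Assumed encoding: a condition is (gene name, comparison, threshold),
# and a rule is a list of conditions that must all hold (a conjunction).
_OPS = {">": operator.gt, "<=": operator.le}


def covers(rule, sample):
    """Return True if the sample satisfies every condition of the rule."""
    return all(_OPS[op](sample[gene], threshold) for gene, op, threshold in rule)


# The example rule from the text: (gene1 > 120.5) AND (gene2 <= 20.3).
rule = [("gene1", ">", 120.5), ("gene2", "<=", 20.3)]

# Hypothetical samples, each mapping a gene name to an expression value.
samples = [
    {"gene1": 130.0, "gene2": 15.0},  # satisfies both conditions
    {"gene1": 130.0, "gene2": 25.0},  # fails the gene2 condition
    {"gene1": 100.0, "gene2": 10.0},  # fails the gene1 condition
]

covered = [covers(rule, s) for s in samples]
print(covered)
```

Only the first sample is covered, since a conjunction requires every gene constraint to hold at once.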
Section snippets
Approach
A rule group is associated with a target class, as different classes have different rule groups that reflect the common characteristics of each class. Throughout this paper, the samples that belong to the target class are treated as positive samples, and the samples that belong to other classes are treated as negative samples. For the sake of consistency, we treat dimensions as columns (or genes) in the gene expression matrix.
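The positive/negative convention described above amounts to a one-versus-rest split of the sample rows. A minimal sketch, assuming class labels are stored in a list aligned with the rows of the matrix (the function name split_by_target and the labels are hypothetical):

```python
def split_by_target(labels, target):
    """Return row indices of positive (target class) and negative samples."""
    positives = [i for i, label in enumerate(labels) if label == target]
    negatives = [i for i, label in enumerate(labels) if label != target]
    return positives, negatives


# Hypothetical class labels, one per sample (row) of the expression matrix.
labels = ["cancer", "normal", "cancer", "normal", "normal"]

pos, neg = split_by_target(labels, "cancer")
print(pos, neg)
```

Switching the target class to "normal" swaps the two roles, which is why each class gets its own rule group.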
Rule groups reveal biological relationship between cellular function and group
Methods
Our algorithm enumerates all possible combinations of items to find the rule group that describes a specific class. As in most rule generation algorithms, the gene expression data is first discretized into symbols. Because the dimensionality of gene expression data is usually very high, genes with low discriminative power are removed in the preprocessing step of our algorithm.
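The enumeration idea can be sketched by brute force on a toy matrix. This is not the authors' algorithm: the snippet does not describe their discretization, gene filtering, or pruning, so the median split into "hi"/"lo" symbols, the requirement that a rule cover no negative sample, the ranking by number of positives covered, and all names and data below are assumptions made for illustration.

```python
from itertools import combinations
from statistics import median


def discretize(matrix):
    """Assumed discretization: 'hi' if a value exceeds its gene's median, else 'lo'."""
    n_genes = len(matrix[0])
    cuts = [median(row[j] for row in matrix) for j in range(n_genes)]
    return [["hi" if row[j] > cuts[j] else "lo" for j in range(n_genes)]
            for row in matrix]


def matches(sample, combo):
    """True if the discretized sample carries every (gene index, symbol) item."""
    return all(sample[j] == symbol for j, symbol in combo)


def find_rules(symbols, labels, target, k):
    """Enumerate item combinations of size 1..k; keep those covering no negatives."""
    pos = [s for s, lab in zip(symbols, labels) if lab == target]
    neg = [s for s, lab in zip(symbols, labels) if lab != target]
    items = [(j, sym) for j in range(len(symbols[0])) for sym in ("hi", "lo")]
    found = []
    for size in range(1, k + 1):
        for combo in combinations(items, size):
            if len({j for j, _ in combo}) < size:
                continue  # two items on the same gene contradict each other
            if any(matches(s, combo) for s in neg):
                continue  # assumed criterion: a rule must exclude all negatives
            support = sum(matches(s, combo) for s in pos)
            if support:
                found.append((support, combo))
    # Rank by positives covered (descending), then prefer shorter rules.
    found.sort(key=lambda t: (-t[0], len(t[1]), t[1]))
    return found


# Toy data: 4 samples (rows) by 3 genes (columns), values invented.
matrix = [
    [150.0, 10.0, 5.0],
    [140.0, 12.0, 9.0],
    [90.0, 30.0, 6.0],
    [80.0, 28.0, 8.0],
]
labels = ["cancer", "cancer", "normal", "normal"]

rules = find_rules(discretize(matrix), labels, target="cancer", k=2)
for support, combo in rules[:3]:
    print(support, combo)
```

The number of combinations grows very quickly with the number of genes, which is why removing low discriminative genes beforehand and bounding the rule length matter even in this naive form.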
Experiment
We test our algorithm on five widely used gene expression datasets: ALL-AML leukemia (ALL), breast cancer (BC), colon tumor (CT), lung cancer (LC) and prostate cancer (PC). The rows of the datasets represent clinical samples; the columns represent gene expression values, each a real number giving the expression level of a specific gene for a sample. Each dataset contains two classes of samples. The datasets can be found at http://sdmc.i2r.a-star.edu.sg/rp.
Table 1 shows the information
Conclusion
In this paper, we propose a robust algorithm to find rule groups that describe a specific class in high dimensional gene expression datasets. Our algorithm enumerates all possible combinations of dimensions. By introducing pruning and a constraint on the number of items, the procedure can be executed efficiently on any personal computer. The algorithm guarantees finding the best-k rules for a specific class; the predictive accuracy is found to be better than that of the state of
References (22)
- et al. DDR: an index method for large time series datasets. Inform. Syst. (2005)
- Microarray data analysis: from disarray to consolidation and consensus. Nat. Rev. Genet. (2006)
- et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. U.S.A. (1999)
- et al. Another inductive algorithm. Lect. Notes Artif. Intellig. (2005)
- Gene expression profile classification: a review. Curr. Bioinformatics (2006)
- Bayardo, R.J., 1998. Efficiently mining long patterns from databases. 17th ACM SIGMOD International Conference on...
- Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl Acad. Sci. U.S.A. (2001)
- et al. Detecting inconsistency in biological molecular databases using ontology. Data Min. Knowl. Discov. (2007)
- Clark, P., Boswell, R., 1991. Rule induction with CN2: some recent improvements. Mach. Learn. EWSL-91,...
- et al. Multi-interval discretization of continuous-valued attributes for classification learning. IJCAI (1993)
- Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science
☆ This work is partially supported by Grant DP0344488 from the Australian Research Council.