Finding rule groups to classify high dimensional gene expression datasets

doi:10.1016/j.compbiolchem.2008.07.031

Computational Biology and Chemistry

Volume 33, Issue 1, February 2009, Pages 108-113

Abstract

Microarray data provide quantitative information about the transcription profile of cells. Machine learning methodology has increasingly attracted bioinformatics researchers as a way to analyze microarray datasets, and some machine learning approaches are widely used to classify and mine biological datasets. However, many gene expression datasets have extremely high dimensionality, so traditional machine learning methods cannot be applied to them effectively and efficiently. This paper proposes a robust algorithm that finds rule groups to classify gene expression datasets. Unlike most classification algorithms, which select dimensions (genes) heuristically to form rule groups that identify classes such as cancerous and normal tissues, our algorithm guarantees finding the best-k dimensions (genes) to form rule groups for the classification of expression datasets. Our experiments show that the rule groups obtained by our algorithm achieve higher accuracy than other classification approaches.

Introduction

Mining gene expression datasets has generated interest among many bioinformatics researchers (Alon et al., 1999, Chen et al., 2007, Knapp and Chen, 2007, Mann et al., 2007, Mramor et al., 2007). One of the important trends in bioinformatics is the identification of genes or groups of genes that differentiate diseased tissues from normal tissues. Classifying tissues as cancerous or normal using the identified genes is one of the key problems in bioinformatics. Golub et al. (1999) first showed that gene expression signatures give better diagnostic performance in acute leukemia classification than other diagnostic methods in use at the time. Many other studies (Bhattacharjee et al., 2001, Khan et al., 2001, Shipp et al., 2002) have since been undertaken in almost all cancer types. Recently, a wide range of statistical and machine learning methods for microarray data analysis has been developed (Allison et al., 2006, Nahar et al., 2007, Asyali et al., 2006, Pham et al., 2006). Because microarray data combine a large number of genes with a small number of samples, it is hard to find accurate patterns for classifying them. This paper attempts to find rule groups for several different cancer datasets. The results are encouraging in terms of accuracy and effectiveness.

Gene expression data are usually represented as a matrix; each element in the matrix represents the expression level of a particular gene under a particular condition. We assume that a gene expression matrix has n rows and m columns. The rows represent samples that are divided into different classes such as cancerous tissue and normal tissue. The columns represent genes, which usually number more than several thousand. The number of rows is much lower than the number of columns, as the number of samples ranges from ten to several hundred. Traditional machine learning techniques such as decision trees and support vector machines cannot classify this kind of extremely high dimensional data effectively, because they use heuristics to select significant dimensions (genes) and many discriminative dimensions can be left out. In this paper, we propose a classification method that generates rule groups to categorize samples. A rule is a conjunction of conditions on several dimensions (genes); each gene is constrained to one interval. For example, (gene1 > 120.5) ∧ (gene2 ≤ 20.3) is one such rule. If a sample satisfies the conjunction of a rule, it is covered by the rule. The above rule covers samples whose expression values of gene1 are larger than 120.5 and whose expression values of gene2 are smaller than or equal to 20.3. In contrast to traditional machine learning algorithms that use heuristics, our method guarantees finding the best-k genes, those most discriminative for classifying samples in different classes, to form rule groups. The value of the parameter k is set to around 5. This choice follows the principle of Occam's razor (Mitchell, 1997): each rule should not be too long; otherwise, the problem of overfitting will arise (Quinlan, 1986).
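The notion of a rule covering a sample can be illustrated with a minimal Python sketch. It uses the example rule (gene1 > 120.5) ∧ (gene2 ≤ 20.3) from the text; the names Condition and covers, and the three toy samples, are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Condition:
    """One interval constraint on a single gene (hypothetical helper type)."""
    gene: str
    op: str          # either ">" or "<="
    threshold: float

    def holds(self, sample: Dict[str, float]) -> bool:
        value = sample[self.gene]
        if self.op == ">":
            return value > self.threshold
        return value <= self.threshold


def covers(rule: List[Condition], sample: Dict[str, float]) -> bool:
    """A rule covers a sample when every condition of the conjunction holds."""
    return all(cond.holds(sample) for cond in rule)


# The example rule from the text: (gene1 > 120.5) AND (gene2 <= 20.3).
rule = [Condition("gene1", ">", 120.5), Condition("gene2", "<=", 20.3)]

# Toy expression values, invented for illustration only.
samples = [
    {"gene1": 150.0, "gene2": 10.0},   # satisfies both conditions
    {"gene1": 150.0, "gene2": 25.0},   # fails the gene2 condition
    {"gene1": 100.0, "gene2": 10.0},   # fails the gene1 condition
]
coverage = [covers(rule, s) for s in samples]
```

Only the first toy sample is covered, because a conjunction requires every gene constraint to hold at once.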

Section snippets

Approach

A rule group is associated with a target class, as different classes have different rule groups that reflect the common characteristics of each class. Throughout this paper, the samples that belong to the target class are treated as positive samples, and the samples that belong to other classes are treated as negative samples. For the sake of consistency, we treat dimensions as columns (or genes) in the gene expression matrix.
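The positive/negative convention can be sketched as a simple split of the expression matrix by class label. The function name split_by_target, the class labels, and the toy values are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def split_by_target(
    rows: Sequence[Sequence[float]],
    labels: Sequence[str],
    target: str,
) -> Tuple[List[Sequence[float]], List[Sequence[float]]]:
    """Return (positive, negative) samples relative to the target class."""
    positives = [row for row, label in zip(rows, labels) if label == target]
    negatives = [row for row, label in zip(rows, labels) if label != target]
    return positives, negatives


# Rows are samples, columns are genes; values are invented.
rows = [[5.1, 0.2], [4.8, 0.3], [1.0, 3.3]]
labels = ["ALL", "AML", "ALL"]
positives, negatives = split_by_target(rows, labels, target="ALL")
```

Repeating the split with each class in turn as the target would give one positive/negative partition, and hence one rule group, per class.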

Rule groups reveal biological relationship between cellular function and group

Methods

Our algorithm enumerates all possible combinations of items to find rule groups that describe a specific class. As in most rule generation algorithms, the gene expression data are discretized into symbols. Because the dimensionality of gene expression data is usually very high, genes with low discriminative power are removed by the preprocessor of our algorithm.
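A brute-force sketch of enumerating item combinations over already discretized data is given below. It is not the paper's algorithm: the preprocessing and pruning are omitted, and the score (positives covered minus negatives covered) is an assumption chosen only to make the example concrete. An item is taken here to be a (gene, symbol) pair, which is also an assumption.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Item = Tuple[str, str]          # (gene, discretized symbol), assumed encoding
Rule = Tuple[Item, ...]


def covers(rule: Rule, sample: Dict[str, str]) -> bool:
    """A rule covers a sample when every (gene, symbol) item matches."""
    return all(sample.get(gene) == symbol for gene, symbol in rule)


def enumerate_rules(
    positives: List[Dict[str, str]],
    negatives: List[Dict[str, str]],
    k: int,
) -> List[Tuple[int, Rule]]:
    """Score every combination of at most k items (no pruning)."""
    items = sorted({item for sample in positives for item in sample.items()})
    scored = []
    for size in range(1, k + 1):
        for rule in combinations(items, size):
            if len({gene for gene, _ in rule}) < size:
                continue  # a gene can take only one symbol within a rule
            pos = sum(covers(rule, s) for s in positives)
            neg = sum(covers(rule, s) for s in negatives)
            if pos > 0:
                scored.append((pos - neg, rule))  # hypothetical score
    # Best score first; shorter rules win ties.
    scored.sort(key=lambda pair: (-pair[0], len(pair[1]), pair[1]))
    return scored


# Toy discretized samples, invented for illustration.
positives = [{"g1": "high", "g2": "low"}, {"g1": "high", "g2": "high"}]
negatives = [{"g1": "low", "g2": "low"}, {"g1": "low", "g2": "high"}]
scored = enumerate_rules(positives, negatives, k=2)
best_score, best_rule = scored[0]
```

The number of combinations grows quickly with the number of items, so exhaustive enumeration like this is only practical with a small k and after low discriminative genes have been removed.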

Experiment

We test our algorithm on five widely used gene expression datasets: ALL-AML leukemia (ALL), breast cancer (BC), colon tumor (CT), lung cancer (LC) and prostate cancer (PC). The rows of the datasets represent clinical samples; the columns represent gene expression values, which are real numbers giving the expression level of a specific gene in a sample. There are two classes of samples in each of these datasets. The datasets can be found at http://sdmc.i2r.a-star.edu.sg/rp.

Table 1 shows the information

Conclusion

In this paper, we propose a robust algorithm that finds rule groups describing a specific class in high dimensional gene expression datasets. Our algorithm enumerates all possible combinations of dimensions. By introducing pruning and a constraint on the number of items, the procedures can be executed efficiently on any personal computer. The algorithm guarantees finding the best-k rules for a specific class; the predictive accuracy is found to be better than that of the state of

References (22)

J. An et al. DDR: an index method for large time series datasets. Inform. Syst. (2005)
D.B. Allison. Microarray data analysis: from disarray to consolidation and consensus. Nat. Rev. Genet. (2006)
U. Alon et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. U.S.A. (1999)
J. An et al. Another inductive algorithm. Lect. Notes Artif. Intellig. (2005)
M.H. Asyali. Gene expression profile classification: a review. Curr. Bioinformatics (2006)
Bayardo, R.J., 1998. Efficiently mining long patterns from databases. 17th ACM SIGMOD International Conference on...
A. Bhattacharjee. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl Acad. Sci. U.S.A. (2001)
Q. Chen et al. Detecting inconsistency in biological molecular databases using ontology. Data Min. Knowl. Discov. (2007)
Clark, P., Boswell, R., 1991. Rule induction with CN2: some recent improvements. Mach. Learn. EWSL-91,...
U.M. Fayyad et al. Multi-interval discretization of continuous-valued attributes for classification learning. IJCAI (1993)
T.R. Golub. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999)


^☆: This work is partially supported by Grant DP0344488 from the Australian Research Council.

Brief Communication
Finding rule groups to classify high dimensional gene expression datasets☆