
Identifying Subspace Gene Clusters from Microarray Data Using Low-Rank Representation

  • Yan Cui,

    Affiliation School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, Jiangsu, China

  • Chun-Hou Zheng,

    Affiliation College of Electrical Engineering and Automation, Anhui University, Hefei, Anhui, China

  • Jian Yang

    csjyang@mail.njust.edu.cn

    Affiliation School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, Jiangsu, China

Abstract

Identifying subspace gene clusters from gene expression data is useful for discovering novel functional gene interactions. In this paper, we propose to use low-rank representation (LRR) to identify subspace gene clusters from microarray data. LRR seeks the lowest-rank representation among all the candidates that can represent the genes as linear combinations of the bases in the dataset. The clusters can be extracted from the block diagonal representation matrix obtained by LRR, and they can well capture the intrinsic patterns of genes with similar functions. Meanwhile, the parameter of LRR can balance the effect of noise, so the method is capable of extracting useful information from data with a high level of background noise. Compared with traditional methods, our approach can identify genes with similar functions yet without similar expression profiles. It can also assign one gene to different clusters. Moreover, our method is robust to noise and can identify more biologically relevant gene clusters. When applied to three public datasets, the results show that the LRR based method is superior to existing methods for identifying subspace gene clusters.

Introduction

With the advent of DNA microarray technology, it is now possible to study the transcriptional response of a complete genome to different experimental conditions. However, reconstructing regulatory networks from high-throughput DNA microarray data is one of the foremost challenges of current bioinformatics research. Fortunately, many studies have revealed that regulatory networks are modular and hierarchically organized [1]–[7]. Since subspace gene clusters may represent co-regulated genes to some degree, we can reconstruct the whole regulatory network starting from the identified clusters. In addition, since genes with similar cellular functions tend to cluster together, identifying clusters in DNA microarray data might provide much deeper insight into biological function and relevance.

Traditional clustering methods, such as hierarchical clustering [8], K-means clustering [9], self-organizing maps [10], and model-based methods [11]–[14], organize gene expression data into clusters of genes possessing similar expression profiles across all the conditions, and the clusters are exclusive and exhaustive (Figure 1(A)). These methods identify gene clusters by assuming that genes with similar expression profiles share similar functions or the same pathway. Although these clustering methods classify genes successfully when applied to relatively small datasets, when used for analyzing large-scale expression data they are limited by three well-recognized drawbacks [3]. Firstly, commonly used algorithms assign each gene to a single cluster, whereas in fact many genes may participate in several functions and should thus be included in several clusters [3], [10]. Secondly, these algorithms group genes on the basis of their expression under all experimental conditions, whereas cellular processes are generally affected only by a small subset of these conditions, so that a gene can participate in multiple clusters or in none at all [15]. In the analysis of a particular cellular process, therefore, most conditions do not contribute information but instead increase the amount of background noise [3]. Thirdly, due to the complex procedures of microarray experiments, gene expression data often contains a huge amount of noise. These algorithms force each gene into a cluster, which may make them sensitive to noise [15]–[17].

Figure 1. Clustering and subspace clustering of a gene expression matrix: (A) a gene cluster must contain all columns, (B) subspace clusters correspond to arbitrary subsets of rows and columns, shown here as rectangles.

https://doi.org/10.1371/journal.pone.0059377.g001

Recently, subspace clustering methods have been proposed to find subgroups of genes that exhibit similar behavior across subsets of samples, experimental conditions, or time points [15], [18]. Subspace clustering was first proposed by Agrawal et al. in the general data mining domain [19] to find subsets of objects such that the objects appear as a cluster in a subspace formed by a subset of features. Figure 1(B) shows an example of two subspace clusters (a and b) embedded in a gene expression matrix. From Figure 1(B), we can see that a subspace cluster is defined as a submatrix spanned by a set of genes and a set of samples. Genes or samples can be part of more than one subspace cluster or of no subspace cluster. In addition, the subsets of samples for various subspace clusters can be different. Two subspace clusters can share some common genes and samples, and some genes may not belong to any subspace cluster [15]. Therefore, the goal of subspace clustering techniques is to find a set of significant subspace clusters in a matrix. Until now, subspace clustering approaches have found widespread applications in many fields [20], e.g., pattern recognition, data compression, image processing and bioinformatics. These methods include algebraic methods [20], such as Generalized Principal Component Analysis (GPCA) [21], iterative methods [20], statistical methods [20], and spectral clustering-based methods [20].

GPCA presents an algebraic way to model the data drawn from a union of multiple subspaces. The intersections between subspaces are automatically allowed; hence, GPCA can deal with both independent and dependent subspaces. However, this method is sensitive to noise and outliers due to the difficulty of estimating the polynomials from real data, which also causes the high computation cost of GPCA [20].

Algebraic algorithms are sensitive to noise. A very simple way of improving their performance in the case of noisy data is to use iterative refinement. The main advantage of such an iterative method is its simplicity, since it alternates between assigning points to subspaces and estimating the subspaces via PCA. However, the algorithm’s convergence to the global optimum depends on a good initialization, and the method is sensitive to outliers.

Statistical methods include Mixture of Probabilistic PCA (MPPCA) [22], Agglomerative Lossy Compression (ALC) [23] and Random Sample Consensus (RANSAC) [24]. MPPCA is a simple and intuitive method, where each iteration can be computed in closed form by using PPCA. An important drawback of MPPCA is that the number and dimensions of the subspaces need to be known beforehand. In principle, ALC does not need to know the number of subspaces and their dimensions. In practice, however, the number of subspaces is directly related to the parameter of ALC. When the number of subspaces is known, one can run ALC for several values of its parameter; this typically increases the computational complexity of the method. Another disadvantage of ALC, perhaps the major one, is that there is no theoretical proof for the optimality of the agglomerative procedure. The main advantage of RANSAC is its ability to handle outliers explicitly. An important drawback of RANSAC is that its performance deteriorates quickly as the number of subspaces increases. Another critical drawback of RANSAC is that it requires the dimensions of the subspaces to be known and equal.

Spectral clustering algorithms are very popular techniques for clustering high-dimensional data. One of the main challenges in applying spectral clustering to the subspace clustering problem is to define a good affinity matrix. This is because two points could be very close to each other but lie in different subspaces (e.g., near the intersection of two subspaces). Conversely, two points could be far from each other but lie in the same subspace. One of the most popular spectral clustering algorithms is sparse subspace clustering (SSC) [25], [26]. SSC is robust to noise, and its computational complexity does not grow exponentially with the number of subspaces and their dimensions. Nonetheless, it requires solving N optimization problems in O(N) variables; hence, it may be time consuming. Another possible disadvantage of SSC is that it is provably correct only in the case of independent or disjoint subspaces.

In this paper, we propose to use Low-Rank Representation (LRR) to identify subspace gene clusters from microarray data. Compared with traditional techniques, such as K-means clustering, our method can cluster genes with similar functions but without similar expression profiles. Moreover, LRR is better suited to analyzing microarray data than other subspace clustering algorithms, such as GPCA, since it is robust to noise and outliers via the lowest-rank criterion, and it is also capable of extracting useful information from a high level of background noise. The contribution of this paper is a new subspace gene clustering method based on LRR, together with its theoretical analysis.

Materials and Methods

Low-Rank Representation

Before we present the Low-Rank Representation (LRR) based method for identifying gene clusters from microarray data, we first introduce the LRR algorithm, which is a new framework for seeking the lowest-rank representation matrix [27]. Supposing that there is a gene expression dataset with $m$ genes and $n$ samples, we can denote it as a matrix $X$ of size $m \times n$. When the data is noiseless, the LRR algorithm looks for a representation $Z$ by solving the problem

$$\min_{Z} \ \operatorname{rank}(Z) \quad \text{s.t.} \quad X = AZ \tag{1}$$

where $A$ is a dictionary that linearly spans the data space.

We call the optimal solution $Z^{*}$ of the above problem the “lowest-rank representation” of the data $X$ with respect to the dictionary $A$. The above optimization problem is difficult to solve due to the discrete nature of the rank function. Fortunately, as suggested by matrix completion methods [28], [29], the following convex optimization provides a good surrogate for problem (1):

$$\min_{Z} \ \|Z\|_{*} \quad \text{s.t.} \quad X = AZ \tag{2}$$

where $\|\cdot\|_{*}$ denotes the nuclear norm of a matrix [27], [30], i.e., the sum of the singular values of the matrix. Note that the block diagonal structure of $Z^{*}$ directly induces a clustering of the genes (each block corresponds to a cluster). So the clustering task is equivalent to finding a block diagonal representation matrix $Z^{*}$.

However, due to the complex procedures of microarray experiments, gene expression data often contains a huge amount of noise. Therefore, the optimization model of LRR is formulated as:

$$\min_{Z,E} \ \|Z\|_{*} + \lambda\|E\|_{2,1} \quad \text{s.t.} \quad X = AZ + E \tag{3}$$

where $\|E\|_{2,1} = \sum_{j}\sqrt{\sum_{i}\left([E]_{ij}\right)^{2}}$ is the $\ell_{2,1}$-norm of the matrix of errors $E$. Minimizing the $\ell_{2,1}$-norm of the noise encodes the assumption that some data vectors are corrupted and others are clean. Since in this case the solution $Z^{*}$ to Eq. (3) may not be block diagonal, it is treated as an affinity matrix instead, and spectral clustering methods are applied to $|Z^{*}| + |Z^{*}|^{T}$ to obtain a block diagonal matrix, where $(\cdot)^{T}$ denotes the matrix or vector transpose and $|\cdot|$ denotes a matrix whose entries are the absolute values of the original.
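As a concrete illustration of the two matrix norms used above, here is a minimal NumPy sketch (ours, not taken from the paper's supplementary MATLAB code):

```python
import numpy as np

def nuclear_norm(M):
    """Nuclear norm: the sum of the singular values of M."""
    return np.linalg.svd(M, compute_uv=False).sum()

def l21_norm(E):
    """l2,1-norm: the sum of the Euclidean norms of the columns of E."""
    return np.linalg.norm(E, axis=0).sum()

# For the identity matrix all singular values are 1, so its nuclear norm
# equals its dimension; for a single column, the l2,1-norm is its length.
```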

The LRR algorithm proceeds by solving the optimization problem in (3) using an Augmented Lagrange Multiplier (ALM) method [31]–[33].

Identifying Gene Clusters Using LRR

Denote the gene expression data matrix as $X$ of size $m \times n$, with each row of $X$ containing the expression levels of a gene in all the samples, and each column of $X$ containing the expression levels of all the genes in one sample. Our goal in using the LRR algorithm to model the gene expression data is to discover subspace gene clusters, so we cluster the genes according to their representation matrices. When we apply the LRR algorithm to cluster genes, we need to use $X^{T}$ instead of $X$. Eq. (3) can then be written as:

$$\min_{Z,E} \ \|Z\|_{*} + \lambda\|E\|_{2,1} \quad \text{s.t.} \quad X^{T} = AZ + E \tag{4}$$

In order to cluster the genes into their respective subspaces, we need to compute an affinity matrix that encodes the pairwise affinities between data vectors. So we use the data itself as the dictionary, i.e., problem (4) becomes

$$\min_{Z,E} \ \|Z\|_{*} + \lambda\|E\|_{2,1} \quad \text{s.t.} \quad X^{T} = X^{T}Z + E \tag{5}$$

Then Eq. (5) can be written as

$$\min_{Z,E} \ \|Z\|_{*} + \lambda\|E\|_{2,1} \quad \text{s.t.} \quad Y = YZ + E \tag{6}$$

where $Y = X^{T} = [y_{1}, y_{2}, \ldots, y_{m}]$ (each column is a gene) and $Z = [z_{1}, z_{2}, \ldots, z_{m}]$ is the coefficient matrix with each $z_{i}$ being the representation of $y_{i}$. $\|\cdot\|_{2,1}$ is the $\ell_{2,1}$-norm, and the parameter $\lambda$ is used to balance the effects of the two parts; it could be chosen according to properties of the two norms, or tuned empirically. Since the $\ell_{2,1}$-norm encourages the columns of $E$ to be zero, the underlying assumption here is that some data vectors are corrupted and the others are clean.

The optimization problem (6) is convex and can be solved by various methods [31]. For efficiency, we adopt in this paper the Augmented Lagrange Multiplier (ALM) method [31]–[33]. We first convert Eq. (6) to the following equivalent problem:

$$\min_{Z,E,J} \ \|J\|_{*} + \lambda\|E\|_{2,1} \quad \text{s.t.} \quad Y = YZ + E, \ \ Z = J \tag{7}$$

This problem can be solved by the ALM method, which minimizes the following augmented Lagrange function:

$$\mathcal{L} = \|J\|_{*} + \lambda\|E\|_{2,1} + \operatorname{tr}\!\left(Y_{1}^{T}(Y - YZ - E)\right) + \operatorname{tr}\!\left(Y_{2}^{T}(Z - J)\right) + \frac{\mu}{2}\left(\|Y - YZ - E\|_{F}^{2} + \|Z - J\|_{F}^{2}\right) \tag{8}$$

The above problem is unconstrained, so it can be minimized with respect to $J$, $Z$ and $E$ in turn, fixing the other variables, and then updating the Lagrange multipliers $Y_{1}$ and $Y_{2}$, where $\mu > 0$ is a penalty parameter. The inexact ALM method, also called the alternating direction method, is outlined in Algorithm 1. Its convergence properties can be proven similarly to those in [32].

After solving problem (6), as in [25], we utilize the lowest-rank representation (denoted by $Z^{*}$) to define the affinity matrix of an undirected graph, and we treat the gene clustering problem as a graph partitioning problem. The gene vectors correspond to the vertices, and the affinity between vertices $i$ and $j$ is computed by $\left[|Z^{*}| + |Z^{*}|^{T}\right]_{ij}$. In clustering, we seek to partition the set of vertices into disjoint sets such that, by some measure, the similarity among the vertices within a set is high and the similarity across different sets is low. So we can use spectral clustering algorithms such as Normalized Cuts [34] to produce the final clustering results, since Normalized Cuts is an unbiased measure which simultaneously minimizes the disassociation between the clusters of a graph and maximizes the association within the clusters. The final clusters are the gene clusters that we discover from the gene expression data. Integrating LRR with spectral clustering has the following advantages. First, since LRR may fail to obtain a block diagonal representation in complex applications, spectral clustering ensures the robustness of the clustering. Second, it is convenient to integrate the lowest-rank representation with other information by defining such an undirected graph. For example, in some specific applications such as subspace segmentation, one may want to enforce that only neighboring samples can be connected by edges. Therefore, the proposed clustering method based on LRR is capable of extracting useful information from a huge amount of background noise, and it can also well capture the intrinsic patterns of genes with similar functions. Algorithm 2 summarizes the whole LRR clustering algorithm for gene expression data.

Algorithm 1 Solving Problem (6) by Inexact ALM

Input: data matrix $Y = X^{T}$, parameter $\lambda$

Initialize: $Z = J = 0$, $E = 0$, $Y_{1} = 0$, $Y_{2} = 0$, $\mu = 10^{-6}$, $\mu_{\max} = 10^{6}$, $\rho = 1.1$, $\varepsilon = 10^{-8}$

while not converged do

1) fix the others and update $J$ by $J = \arg\min_{J} \frac{1}{\mu}\|J\|_{*} + \frac{1}{2}\left\|J - \left(Z + Y_{2}/\mu\right)\right\|_{F}^{2}$

2) fix the others and update $Z$ by $Z = \left(I + Y^{T}Y\right)^{-1}\left(Y^{T}Y - Y^{T}E + J + \left(Y^{T}Y_{1} - Y_{2}\right)/\mu\right)$

3) fix the others and update $E$ by $E = \arg\min_{E} \frac{\lambda}{\mu}\|E\|_{2,1} + \frac{1}{2}\left\|E - \left(Y - YZ + Y_{1}/\mu\right)\right\|_{F}^{2}$

4) update the multipliers $Y_{1} = Y_{1} + \mu\left(Y - YZ - E\right)$, $Y_{2} = Y_{2} + \mu\left(Z - J\right)$

5) update the parameter $\mu$ by $\mu = \min\left(\rho\mu, \mu_{\max}\right)$

6) check the convergence conditions $\|Y - YZ - E\|_{\infty} < \varepsilon$ and $\|Z - J\|_{\infty} < \varepsilon$

end while
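As an illustration, the inexact ALM iteration above can be sketched in a few dozen lines of NumPy. This is a simplified re-implementation for exposition, not the authors' supplementary MATLAB code; the parameter defaults (μ, ρ, ε) follow common choices for this algorithm.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau*||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def l21_shrink(M, tau):
    """Column-wise shrinkage: proximal operator of tau*||.||_{2,1}."""
    norms = np.linalg.norm(M, axis=0)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return M * scale

def lrr_inexact_alm(Y, lam=0.1, rho=1.1, mu=1e-6, mu_max=1e6,
                    eps=1e-8, max_iter=1000):
    """Solve min ||Z||_* + lam*||E||_{2,1} s.t. Y = Y Z + E by inexact ALM."""
    d, n = Y.shape
    Z = np.zeros((n, n)); J = np.zeros((n, n)); E = np.zeros((d, n))
    L1 = np.zeros((d, n)); L2 = np.zeros((n, n))   # Lagrange multipliers
    YtY = Y.T @ Y
    inv_mat = np.linalg.inv(np.eye(n) + YtY)       # constant across iterations
    for _ in range(max_iter):
        J = svt(Z + L2 / mu, 1.0 / mu)                             # step 1
        Z = inv_mat @ (YtY - Y.T @ E + J + (Y.T @ L1 - L2) / mu)   # step 2
        E = l21_shrink(Y - Y @ Z + L1 / mu, lam / mu)              # step 3
        R1, R2 = Y - Y @ Z - E, Z - J
        L1, L2 = L1 + mu * R1, L2 + mu * R2                        # step 4
        mu = min(rho * mu, mu_max)                                 # step 5
        if max(np.abs(R1).max(), np.abs(R2).max()) < eps:          # step 6
            break
    return Z, E
```

For clean data drawn from a low-dimensional subspace, the returned Z approximately satisfies Y = YZ with E near zero.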

Algorithm 2 Subspace Clustering by LRR

Input: data matrix $X$, number of subspaces $k$

1) Obtain the lowest-rank representation $Z^{*}$ by solving problem (6)

2) Construct an undirected graph by using the lowest-rank representation to define the affinity matrix $|Z^{*}| + |Z^{*}|^{T}$ of the graph

3) Use Normalized Cuts to cluster the vertices of the graph into $k$ clusters
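The spectral step can be sketched as follows. For illustration this uses plain normalized spectral clustering (eigenvectors of the normalized graph Laplacian followed by a simple k-means), which is closely related to, but not identical with, the Normalized Cuts algorithm of [34]; the function names and the deterministic k-means initialization are ours.

```python
import numpy as np

def _kmeans(P, k, n_iter=50):
    """Tiny deterministic k-means with farthest-point initialization."""
    centers = [P[0]]
    for _ in range(k - 1):
        d = np.min([((P - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(P[np.argmax(d)])
    centers = np.array(centers)
    labels = np.zeros(len(P), dtype=int)
    for _ in range(n_iter):
        labels = np.argmin(((P[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = P[labels == j].mean(axis=0)
    return labels

def spectral_clusters(Z, k):
    """Partition genes into k clusters from an LRR coefficient matrix Z,
    using the affinity |Z| + |Z|^T and the normalized graph Laplacian."""
    W = np.abs(Z) + np.abs(Z).T
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    U = vecs[:, :k]                      # k smallest eigenvectors
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return _kmeans(U, k)
```

On a perfectly block diagonal Z*, the recovered labels coincide with the blocks.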

Gene Expression Data

In this paper, three published gene expression datasets are used to analyze the performance of our method: the yeast dataset [35], the yeast_Spellman dataset [36] and the normal human tissue dataset [37].

The yeast dataset contains 173 samples collected under several different conditions, including temperature shocks, hyper- and hypo-osmotic shocks, exposure to various agents such as peroxide, menadione, diamide and dithiothreitol, amino acid starvation, nitrogen source depletion, and progression into stationary phase, etc. This dataset contains 6152 genes in each sample.

The yeast_Spellman dataset contains 6178 genes in 73 samples collected under several different conditions, including different carbon sources, temperatures, yeast strain backgrounds, etc.

The normal human tissue dataset consists of Affymetrix oligonucleotide array measurements of 7070 genes in 59 samples of 19 kinds of tissues.

GeneCodis Analysis

GeneCodis is a web-based tool for finding sets of biological annotations that frequently appear together and are significant in a set of genes. It allows the integrated analysis of annotations from different sources and generates statistical rank scores for single annotations and their combinations. GeneCodis is an important extension of existing tools for the functional analysis of gene lists [38]–[40], and it is publicly available at http://genecodis.cnb.csic.es.

The application of GeneCodis is simple. It takes as input a list of genes belonging to a cluster and determines biological annotations, or combinations of annotations, that are over-represented with respect to a reference list. One must also select the annotation categories to include in the analysis. In addition, the organism selected is Saccharomyces cerevisiae for the yeast and yeast_Spellman datasets, and Homo sapiens for the normal human tissue dataset. When the genes are submitted, both modular enrichment analysis and singular enrichment analysis can be obtained. For a detailed description of this method, see the online help for the program.

In the GeneCodis method, P-values are obtained through hypergeometric analysis and corrected by the false discovery rate (FDR) method [41], [42]. Briefly, a gene list of the same size as the input list is generated by randomly selecting genes from the set of genes defined as the reference distribution. The process of extracting frequent sets of annotations is repeated, and P-values for the annotations and combinations of annotations generated from this random list are calculated using the same statistical test. This process is repeated 100 times, and the corrected P-value for each set of K annotations is calculated as the fraction of permutations in which any annotation of the same value of K has a P-value as good as or better than the observed P-value.
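To make the statistics concrete, here is a small sketch of a hypergeometric enrichment P-value together with a Benjamini–Hochberg FDR correction. Note that GeneCodis itself uses the permutation-based correction described above; BH is shown only as a common, simpler stand-in, and the function names are ours.

```python
from math import comb

def hypergeom_pval(N, K, n, k):
    """P(X >= k): enrichment P-value for observing k annotated genes in a
    cluster of n, given K annotated genes among N genes overall."""
    # math.comb(a, b) returns 0 when b > a, so out-of-range terms vanish
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted P-values (step-up procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    prev = 1.0
    for rank_idx in range(m - 1, -1, -1):
        i = order[rank_idx]
        prev = min(prev, pvals[i] * m / (rank_idx + 1))
        adj[i] = prev
    return adj
```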

Results and Discussion

In this section, the proposed method is applied to identify subspace gene clusters in the form of functional links among genes. GeneCodis is used to investigate the enrichment of functional annotations of genes in each cluster. First, the LRR based method, K-means and GPCA are run on synthetic data. Then, these methods are used to find gene clusters in real gene expression data. The MATLAB code is supplied in File S1 and File S2.

Simulation on Synthetic Data

We first tested our method on synthetic data to assess its clustering performance. We synthesize a network that consists of 2000 genes, each with 200 samples, and 20 transcription factors. We construct 20 independent subspaces $\{S_i\}$ whose bases $\{U_i\}$ are computed by $U_{i+1} = TU_i$, $1 \le i \le 19$, where $T$ is a random rotation and $U_1$ is a random orthogonal matrix. So each subspace has a dimension of 100. We sample 100 data vectors from each subspace by $X_i = U_i Q_i$, with $Q_i$ being a random coefficient matrix. A noise matrix is then added to the data matrix at different signal-to-noise ratios (SNR). The simulation scheme in [27] is used to generate the synthetic data. The average receiver operating characteristic (ROC) curves for four different SNRs are shown in Figure 2. The corresponding $\lambda$ in the LRR based method is set to 0.01, 0.1, 1 and 10, respectively.
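Under the stated scheme, the synthetic data can be generated roughly as follows. This is a sketch with illustrative defaults; the construction of the rotation T via QR and the SNR-based noise scaling are our assumptions, since the text does not pin these details down.

```python
import numpy as np

def synthetic_subspace_data(n_sub=20, dim=100, ambient=200,
                            per_sub=100, snr=10.0, seed=0):
    """Noisy data from a union of subspaces, in the spirit of the
    simulation described above (dimensions are illustrative)."""
    rng = np.random.default_rng(seed)
    # random rotation T and initial orthogonal basis U_1 via QR
    T, _ = np.linalg.qr(rng.normal(size=(ambient, ambient)))
    U, _ = np.linalg.qr(rng.normal(size=(ambient, dim)))
    blocks, labels = [], []
    for i in range(n_sub):
        Q = rng.normal(size=(dim, per_sub))   # random coefficients
        blocks.append(U @ Q)                  # X_i = U_i Q_i
        labels += [i] * per_sub
        U = T @ U                             # U_{i+1} = T U_i
    X = np.hstack(blocks)
    # scale additive Gaussian noise so that ||X||^2 / ||noise||^2 = snr
    noise = rng.normal(size=X.shape)
    noise *= np.linalg.norm(X) / (np.sqrt(snr) * np.linalg.norm(noise))
    return X + noise, np.array(labels)
```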

thumbnail
Figure 2. ROC curves for synthetic data. (SNR denotes the signal-to-noise ratio).

https://doi.org/10.1371/journal.pone.0059377.g002

From Figure 2, we can see that for different SNRs, our LRR based method consistently outperforms the competing methods. When the noise level increases (i.e., the SNR decreases), all three methods suffer performance degradation, reflected in a corresponding decrease in the AUCs (Table 1). However, the AUC of LRR ranges from 0.9908 down to 0.8928, while the AUCs of K-means and GPCA drop from 0.9233 and 0.9255 down to 0.6643 and 0.6154, respectively. The decrease rates of K-means and GPCA are higher than that of LRR. In other words, K-means and GPCA are sensitive to noise, while LRR is robust to it. Moreover, the proposed method can extract useful information from a high level of background noise.

From the experiments on synthetic data, we can conclude that the LRR method outperforms the other methods for clustering.

Experimental Results on the Yeast Dataset

For the yeast dataset, we first used KNNimpute [43] to fill in missing values. When we apply our method to cluster the genes, the parameters $\lambda$ and $k$ need to be considered. The original gene expression matrix obtained from a scanning process contains noise, missing values, and systematic variations arising from the experimental procedure, so the gene expression data contains a huge amount of noise. In our method, the parameter $\lambda$ is used to balance the effect of noise. For the yeast dataset, we take $\lambda = 0.1$, because with this value the enrichment analysis based on GO achieves the most significant result. With regard to the parameter $k$, more clusters are discovered as $k$ increases. In this experiment, following former work [44]–[46], we choose $k = 30$.
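For reference, the row-wise KNN imputation idea behind KNNimpute [43] can be sketched as below. This simplified version replaces a missing entry with the unweighted average of that column's values in the k most similar complete rows (the original method uses a similarity-weighted average and handles further edge cases; it also assumes enough complete rows exist).

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaN entries row by row using the k nearest complete rows,
    with distances measured over each row's observed columns."""
    X = X.astype(float).copy()
    complete = ~np.isnan(X).any(axis=1)
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        obs = ~np.isnan(X[i])
        # distances to complete rows over the observed columns
        d = np.linalg.norm(X[complete][:, obs] - X[i, obs], axis=1)
        nn = X[complete][np.argsort(d)[:k]]
        X[i, ~obs] = nn[:, ~obs].mean(axis=0)
    return X
```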

Table 2 lists the most enriched GO category of the modular enrichment analysis for each cluster uncovered from the yeast dataset. In this table, the first two columns list the total number of genes and the number of annotated genes in each cluster, respectively. It should be noted that among the 6152 genes, only 6123 were annotated in the GO and KEGG databases. The last column lists the P-values obtained through hypergeometric analysis corrected by the false discovery rate (FDR) method. In Table 3, we analyze the function of the genes in two clusters. For example, 14 out of the 30 genes in Cluster C3 share the function ‘response to stress’, and C3 is significantly associated with ‘response to stress’. Therefore, genes sharing similar functions can be grouped in the same cluster. Table 4 lists a few functional categories in each significantly enriched module (corrected P-value <10−20). Figures 3 and 4 show the enriched combinations of significant annotations of the Biological Process (BP) and Molecular Function categories of C17, respectively, using pie graphs and bar graphs.

Figure 3. Enriched combinations of significant annotations of Biological Process of Cluster C17: (A) pie graph, (B) bar graph.

https://doi.org/10.1371/journal.pone.0059377.g003

Figure 4. Enriched combinations of significant annotations of Molecular Function of Cluster C17: (A) pie graph, (B) bar graph.

https://doi.org/10.1371/journal.pone.0059377.g004

Table 2. The most enriched GO categories of modular enrichment in each gene cluster uncovered by LRR from the yeast dataset.

https://doi.org/10.1371/journal.pone.0059377.t002

Table 3. Results of gene analysis in gene clusters uncovered by LRR from the yeast dataset.

https://doi.org/10.1371/journal.pone.0059377.t003

Table 4. Singular enrichment of GO (or KEGG) categories in gene clusters uncovered by LRR from yeast dataset.

https://doi.org/10.1371/journal.pone.0059377.t004

We also applied K-means clustering and GPCA to the yeast dataset as comparison methods, and the experimental results are listed in Tables 5 and 6. These two tables show that K-means clustering and GPCA are effective in clustering the genes. However, our method can identify more significantly enriched clusters: 11 clusters were discovered with corrected P-value <10−20, versus only 4 and 1 clusters by K-means and GPCA, respectively. The number of annotated genes in each cluster found by our algorithm is also larger than for the other two methods.

Table 5. The most enriched categories of modular enrichment in each gene cluster uncovered by K-means clustering from the yeast dataset.

https://doi.org/10.1371/journal.pone.0059377.t005

Table 6. The most enriched categories of modular enrichment in each gene cluster uncovered by GPCA from the yeast dataset.

https://doi.org/10.1371/journal.pone.0059377.t006

Moreover, the proposed approach is effective in clustering together genes with similar expression profiles and similar functional categories. Cluster C17, for example, is strongly associated with the ‘structural constituent of ribosome’ process (corrected P-value <3.58803×10−202). The heatmap indicates similar expression patterns of these genes under different experimental conditions (Figure 5(A)). More significantly, our algorithm can cluster genes which show different expression profiles but similar functions. Cluster C14, for example, is significantly enriched for ‘cellular amino acid biosynthetic process’ (corrected P-value <2.33136×10−41). The heatmap reveals at least two distinctive expression patterns in this cluster (denoted as a and b, Figure 5(B)), which shows the advantage of our subspace clustering method.

Figure 5. Two heatmaps of expression values of genes analyzed by the proposed algorithm from the yeast dataset: (A) a heatmap of expression values of genes in Cluster C17, and the heatmap shows similar expression patterns of genes in different samples, (B) a heatmap of expression values of genes in Cluster C14, and the heatmap shows different expression patterns of genes in different samples (denoted as a and b).

https://doi.org/10.1371/journal.pone.0059377.g005

We also compared the statistical significance of the common enriched functional categories in gene clusters uncovered by our algorithm and by K-means (Table 7). Among the functional categories detected as significant by both methods, for five out of eight our method produced a significantly lower corrected P-value than the K-means method did. In addition, our method is robust to noise, while the other two methods are not. We also found that the performance of GPCA is no better than that of K-means clustering on this dataset. To investigate the performance of these methods explicitly, Table 8 lists the average values of the negative logarithm of the corrected P-value on the three datasets using K-means, GPCA and LRR, respectively. From Table 8 (a), we can also see that the LRR based method outperforms the other methods on the yeast dataset.

Table 7. Comparison of statistical significance of enriched functional categories in gene clusters uncovered by LRR and K-means from yeast dataset.

https://doi.org/10.1371/journal.pone.0059377.t007

Table 8. The average values of negative logarithm of corrected P-value on three datasets.

https://doi.org/10.1371/journal.pone.0059377.t008

Experimental Results on Yeast_Spellman Dataset

We again first used KNNimpute to fill in missing values. In this experiment, we chose $\lambda = 0.01$ because the dataset contains a huge amount of noise. Table S1 (in File S1 and File S2) lists the most enriched GO category of the modular enrichment analysis for each gene cluster uncovered from the yeast_Spellman dataset. In this dataset, 6074 of the 6178 genes are annotated in GO and KEGG. Table S2 also lists a few functional categories in each significantly enriched module (corrected P-value <10−10). Figures 6(A) and 6(B) demonstrate that the proposed method can identify clusters of genes with similar or different expression profiles, respectively. Cluster C27 is significantly enriched for ‘ribosome biogenesis’ (corrected P-value <2.47852×10−104), and Cluster C10 is significantly enriched for ‘structural constituent of ribosome’ (corrected P-value <2.63854×10−38). Compared with the experimental results of the K-means clustering and GPCA methods listed in Tables S3 and S4, our approach discovers 11 significantly enriched clusters with corrected P-value <10−10, versus 8 and 3 clusters by K-means clustering and GPCA, respectively. We also compared the statistical significance of the common enriched functional categories in gene clusters uncovered by our algorithm and by K-means (Table S5). Among the functional categories detected as significant by both methods, for six out of eight our method produced a significantly lower corrected P-value than the K-means method did. In this experiment we also found that the result of GPCA is not robust to noise. In addition, for the yeast_Spellman dataset, the average value of the negative logarithm of the corrected P-value using the LRR based method listed in Table 8 (b) is higher than the corresponding values using K-means and GPCA. Therefore, the LRR based method achieves better results than the other methods.

Figure 6. Two heatmaps of expression values of genes analyzed by the proposed algorithm from the yeast_Spellman dataset: (A) a heatmap of expression values of genes in Cluster C27, and the heatmap shows similar expression patterns of genes in different samples, (B) a heatmap of expression values of genes in Cluster C10, and the heatmap shows different expression patterns of genes in different samples (denoted as a and b).

https://doi.org/10.1371/journal.pone.0059377.g006

Experimental Results on Normal Human Tissue Dataset

In this experiment, we chose $\lambda = 10$ since the dataset contains some noise but no missing values. Table S6 lists the most enriched GO category of the modular enrichment analysis for each gene cluster discovered from the human tissue dataset; only 5991 of the 7070 genes were annotated in GO and KEGG. Among the 30 gene clusters, ten were significantly enriched in GO and KEGG categories (corrected P-value <10−20). Each cluster appeared to be dominated by only a few functional categories (Table S7). Similar to the above two experiments, Figure 7(A) shows the heatmap of Cluster C18, which is significantly enriched for ‘muscle filament sliding’ (corrected P-value <1.29274×10−39). It can be seen that the expression patterns of the genes in this cluster are similar and the genes share similar functional categories. By contrast, the heatmap of Cluster C3, which is significantly enriched for ‘extracellular region’ (corrected P-value <4.79992×10−25), reveals different expression profiles but similar functions (Figure 7(B)).

Figure 7. Two heatmaps of expression values of genes analyzed by the proposed algorithm from the normal human tissue dataset: (A) a heatmap of expression values of genes in Cluster C18, and the heatmap shows similar expression patterns of genes in different samples, (B) a heatmap of expression values of genes in Cluster C3, and the heatmap shows different expression patterns of genes in different samples (denoted as a and b).

https://doi.org/10.1371/journal.pone.0059377.g007

The experimental results of the K-means clustering and GPCA methods are listed in Tables S8 and S9. Our method can discover 10 significantly enriched clusters with corrected P-value <10−20, while K-means and GPCA can identify 1 and 3 clusters, respectively. Moreover, from Table 8 (c), we can also see that the performance of the LRR based method is better than that of the other methods. Similar to the above two experiments, we compared the statistical significance of the common enriched functional categories in gene clusters uncovered by our algorithm and by K-means (Table S10). For all five common functional categories, our method produced a lower corrected P-value than the K-means method did. For this dataset, the performance of GPCA is better than that of K-means clustering.

Conclusions

In this study, we presented a low-rank representation (LRR) based method for identifying subspace gene clusters from microarray data. The new approach clusters genes via a low-rank criterion: our goal is to find a block diagonal representation matrix from the gene expression data using LRR, in which each block corresponds to a cluster, so that the genes in each cluster have similar functions. Compared with other clustering methods, the proposed method offers several advantages. First, it can identify genes with similar functions but without similar expression profiles. Second, it can assign one gene to different clusters. Third, it is capable of extracting useful information from data with a high level of background noise. In summary, our method leads to a significant improvement in identifying biologically relevant gene clusters. In the experiments we also found that many of the categories discovered by the different methods differ, so in practice several methods can be combined to obtain more reliable results.
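The core computation summarized above can be sketched as follows. The following NumPy code solves the LRR problem min ||Z||_* + λ||E||_{2,1} s.t. X = XZ + E with the inexact augmented Lagrange multiplier (ALM) method [27], [32]; the symmetric affinity |Z| + |Z|^T would then be fed to spectral clustering (normalized cuts [34]) to extract the block structure. This is a minimal sketch, not the code of File S1; the μ schedule, tolerance, and iteration cap are illustrative assumptions.

```python
import numpy as np

def lrr(X, lam=0.1, tol=1e-6, max_iter=500):
    """Inexact ALM for min ||Z||_* + lam*||E||_{2,1}  s.t.  X = X Z + E.
    X is d x n (conditions x genes); returns representation Z and noise E."""
    d, n = X.shape
    Z = np.zeros((n, n)); J = np.zeros((n, n)); E = np.zeros((d, n))
    Y1 = np.zeros((d, n)); Y2 = np.zeros((n, n))          # Lagrange multipliers
    mu, mu_max, rho = 1e-2, 1e6, 1.1                      # penalty schedule
    XtX = X.T @ X
    inv_mat = np.linalg.inv(XtX + np.eye(n))              # fixed across iterations
    for _ in range(max_iter):
        # J-step: singular value thresholding (proximal operator of ||.||_*)
        U, s, Vt = np.linalg.svd(Z + Y2 / mu, full_matrices=False)
        J = U @ np.diag(np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # Z-step: closed-form least-squares update
        Z = inv_mat @ (XtX - X.T @ E + J + (X.T @ Y1 - Y2) / mu)
        # E-step: column-wise shrinkage (proximal operator of ||.||_{2,1})
        Q = X - X @ Z + Y1 / mu
        norms = np.linalg.norm(Q, axis=0)
        E = Q * (np.maximum(norms - lam / mu, 0.0) / np.maximum(norms, 1e-12))
        # Multiplier and penalty updates
        R1 = X - X @ Z - E
        R2 = Z - J
        Y1 += mu * R1; Y2 += mu * R2
        mu = min(rho * mu, mu_max)
        if max(np.abs(R1).max(), np.abs(R2).max()) < tol:
            break
    return Z, E
```

The affinity matrix W = |Z| + |Z|^T built from the returned Z is approximately block diagonal when the genes lie in independent subspaces, and each diagonal block yields one gene cluster after spectral clustering.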

Finally, the subspace gene clusters identified by the proposed method may represent co-regulated genes to some degree. However, due to the limited information present in any single dataset, genes in the same cluster might be co-expressed but not necessarily co-regulated [47]–[49]. Designing an effective algorithm for finding co-regulated genes is therefore left as future work.

Supporting Information

Table S1.

The most enriched GO categories of modular enrichment in each gene cluster uncovered by LRR from the yeast_Spellman dataset.

https://doi.org/10.1371/journal.pone.0059377.s001

(DOC)

Table S2.

Singular enrichment of GO (or KEGG) categories in gene clusters uncovered by LRR from the yeast_Spellman dataset.

https://doi.org/10.1371/journal.pone.0059377.s002

(DOC)

Table S3.

The most enriched categories of modular enrichment in each gene cluster uncovered by K-means clustering from the yeast_Spellman dataset.

https://doi.org/10.1371/journal.pone.0059377.s003

(DOC)

Table S4.

The most enriched categories of modular enrichment in each gene cluster uncovered by GPCA from the yeast_Spellman dataset.

https://doi.org/10.1371/journal.pone.0059377.s004

(DOC)

Table S5.

Comparison of statistical significance of enriched functional categories in gene clusters uncovered by LRR and K-means from the yeast_Spellman dataset.

https://doi.org/10.1371/journal.pone.0059377.s005

(DOC)

Table S6.

The most enriched GO categories of modular enrichment in each gene cluster uncovered by LRR from the normal human tissue dataset.

https://doi.org/10.1371/journal.pone.0059377.s006

(DOC)

Table S7.

Singular enrichment of GO (or KEGG) categories in gene clusters uncovered by LRR from the normal human tissue dataset.

https://doi.org/10.1371/journal.pone.0059377.s007

(DOC)

Table S8.

The most enriched categories of modular enrichment in each gene cluster uncovered by K-means clustering from the normal human tissue dataset.

https://doi.org/10.1371/journal.pone.0059377.s008

(DOC)

Table S9.

The most enriched categories of modular enrichment in each gene cluster uncovered by GPCA from the normal human tissue dataset.

https://doi.org/10.1371/journal.pone.0059377.s009

(DOC)

Table S10.

Comparison of statistical significance of enriched functional categories in gene clusters uncovered by LRR and K-means from the normal human tissue dataset.

https://doi.org/10.1371/journal.pone.0059377.s010

(DOC)

File S1.

The MATLAB code of the Low-Rank Representation algorithm is supplied in this Supplementary File.

https://doi.org/10.1371/journal.pone.0059377.s011

(DOC)

File S2.

The yeast dataset used in our experiment is supplied in this Supplementary File.

https://doi.org/10.1371/journal.pone.0059377.s012

(XLS)

Author Contributions

Conceived and designed the experiments: YC. Performed the experiments: YC CHZ. Analyzed the data: YC JY. Contributed reagents/materials/analysis tools: CHZ JY. Wrote the paper: YC CHZ JY.

References

  1. Bar-Joseph Z, Gerber GK, Lee TI, Rinaldi NJ, Yoo JY, et al. (2003) Computational discovery of gene modules and regulatory networks. Nature Biotechnology 21: 1337–1342.
  2. Alon U (2007) Network motifs: theory and experimental approaches. Nature Reviews Genetics 8: 450–461.
  3. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, et al. (2002) Revealing modular organization in the yeast transcriptional network. Nature Genetics 31: 370–377.
  4. Chen L, Xuan J, Riggins RB, Wang Y, Hoffman EP, et al. (2010) Multilevel support vector regression analysis to identify condition-specific regulatory networks. Bioinformatics 26: 1416–1422.
  5. Kiddle SJ, Windram OPF, McHattie S, Mead A, Beynon J, et al. (2010) Temporal clustering by affinity propagation reveals transcriptional modules in Arabidopsis thaliana. Bioinformatics 26: 355–362.
  6. Goncalves JP, Aires RS, Francisco AP, Madeira SC (2012) Regulatory snapshots: integrative mining of regulatory modules from expression time series and regulatory networks. PLoS ONE 7(5): e35977.
  7. Song W-M, Di Matteo T, Aste T (2012) Hierarchical information clustering by means of topologically embedded graphs. PLoS ONE 7(3): e31929.
  8. Johnson S (1967) Hierarchical clustering schemes. Psychometrika 32: 241–254.
  9. Hartigan J (1972) Direct clustering of a data matrix. Journal of the American Statistical Association 67: 123–129.
  10. Tamayo P, Slonim D, Mesirov J (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences USA 96: 2907–2912.
  11. Banfield J, Raftery A (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49: 803–821.
  12. Ben-Dor A, Chor B, Karp R (2003) Discovering local structure in gene expression data: the order-preserving submatrix problem. Journal of Computational Biology 10: 373–384.
  13. Fraley C, Raftery A (2002) Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association 97: 611–631.
  14. McLachlan G, Basford K (1988) Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
  15. Jiang D, Tang C, Zhang A (2004) Cluster analysis for gene expression data: a survey. IEEE Transactions on Knowledge and Data Engineering 16: 1370–1386.
  16. Sherlock G (2000) Analysis of large-scale gene expression data. Current Opinion in Immunology 12: 201–205.
  17. Smet FD, Mathys J, Marchal K, Thijs G, Moor M, et al. (2002) Adaptive quality-based clustering of gene expression profiles. Bioinformatics 18: 735–746.
  18. Tchagang AB, Phan S, Famili F, Shearer H, Fobert P, et al. (2012) Mining biological information from 3D short time-series gene expression data: the OPTricluster algorithm. BMC Bioinformatics 13: 54.
  19. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In Proc. ACM SIGMOD Int'l Conf. Management of Data, 94–105.
  20. Vidal R (2011) Subspace clustering. IEEE Signal Processing Magazine 28: 52–68.
  21. Vidal R, Ma Y, Sastry S (2005) Generalized principal component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 27: 1–15.
  22. Tipping M, Bishop C (1999) Mixtures of probabilistic principal component analyzers. Neural Computation 11: 443–482.
  23. Ma Y, Derksen H, Hong W, Wright J (2007) Segmentation of multivariate mixed data via lossy coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence 29: 1546–1562.
  24. Fischler MA, Bolles RC (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24: 381–395.
  25. Elhamifar E, Vidal R (2009) Sparse subspace clustering. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2790–2797.
  26. Elhamifar E, Vidal R (2010) Clustering disjoint subspaces via sparse representation. In Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 1926–1929.
  27. Liu G, Lin Z, Yu Y (2010) Robust subspace segmentation by low-rank representation. In Proc. Int. Conf. Machine Learning.
  28. Candes EJ, Recht B (2009) Exact matrix completion via convex optimization. Foundations of Computational Mathematics.
  29. Keshavan R, Montanari A, Oh S (2009) Matrix completion from noisy entries. In NIPS.
  30. Fazel M (2002) Matrix rank minimization with applications. PhD thesis, Stanford University.
  31. Liu G, Lin Z, Yan S, Sun J, Yu Y, et al. (2012) Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 99: 1–14.
  32. Lin Z, Chen M, Wu L, Ma Y (2009) The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. UIUC Technical Report UILU-ENG-09-2215.
  33. Bertsekas D (1982) Constrained Optimization and Lagrange Multiplier Methods. Athena Scientific.
  34. Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22: 888–905.
  35. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, et al. (2000) Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell 11: 4241–4257.
  36. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9: 3273–3297.
  37. Hsiao L, Dangond F, Yoshida T, Hong R, Jensen RV, et al. (2001) A compendium of gene expression in normal human tissues reveals tissue-specific genes and distinct expression patterns of housekeeping genes. Physiological Genomics 7: 97–104.
  38. Tabas-Madrid D, Nogales-Cadenas R, Pascual-Montano A (2012) GeneCodis3: a non-redundant and modular enrichment analysis tool for functional genomics. Nucleic Acids Research 40: 478–483.
  39. Nogales-Cadenas R, Carmona-Saez P, Vazquez M, Vicente C, Yang X, et al. (2009) GeneCodis: interpreting gene lists through enrichment analysis and integration of diverse biological information. Nucleic Acids Research 37: 317–322.
  40. Carmona-Saez P, Chagoyen M, Tirado F, Carazo JM, Pascual-Montano A (2007) GENECODIS: a web-based tool for finding significant concurrent annotations in gene lists. Genome Biology 8: R3.
  41. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, et al. (2004) GO::TermFinder - open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20: 3710–3715.
  42. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57: 289–300.
  43. Troyanskaya O, Cantor M, Sherlock G, Brown PO, Hastie T, et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17: 520–525.
  44. Li H, Sun Y, Zhan M (2007) The discovery of transcriptional modules by a two-stage matrix decomposition approach. Bioinformatics 23: 473–479.
  45. Zhang J, Zheng C, Liu J, Wang H (2011) Discovering the transcriptional modules using microarray data by penalized matrix decomposition. Computers in Biology and Medicine 41: 1041–1050.
  46. Zheng CH, Zhang L, Ng T, Shiu C, Huang DS (2011) Molecular pattern discovery based on penalized matrix decomposition. IEEE/ACM Transactions on Computational Biology and Bioinformatics 8: 1592–1603.
  47. Allocco DJ, Kohane IS, Butte AJ (2004) Quantifying the relationship between co-expression, co-regulation and gene function. BMC Bioinformatics 5: 18.
  48. Yeung KY, Medvedovic M, Bumgarner RE (2004) From co-expression to co-regulation: how many microarray experiments do we need? Genome Biology 5: R48.
  49. Clements M, Someren E, Knijnenburg T, Reinders M (2007) Integration of known transcription factor binding site information and gene expression data to advance from co-expression to co-regulation. Genomics Proteomics Bioinformatics 5: 86–101.