Block-Constraint Laplacian-Regularized Low-Rank Representation and Its Application for Cancer Sample Clustering Based on Integrated TCGA Data

Low-Rank Representation (LRR) is a powerful subspace clustering method because it successfully learns the low-dimensional subspace structure of data. With the breakthrough of "omics" technologies, many LRR-based methods have been proposed and applied to cancer clustering based on gene expression data. Moreover, studies have shown that, besides gene expression data, other genomic data in TCGA also contain important information for cancer research. Therefore, these genomic data can be integrated as a comprehensive feature source for cancer clustering. How to establish an effective clustering model for comprehensive analysis of integrated TCGA data has become a key issue. In this paper, we extend the traditional LRR method and propose a novel method named Block-constraint Laplacian-Regularized Low-Rank Representation (BLLRR) to model multigenome data for cancer sample clustering. The proposed method is dedicated to extracting richer subspace structure information from multiple genomic data to improve the accuracy of cancer sample clustering. Considering the heterogeneity of different genome data, we introduce the block-constraint idea into our method: in the BLLRR decomposition, each genome data type is treated as a data block, and different constraints are imposed on different data blocks. In addition, a graph Laplacian term is introduced to better learn the topological structure of the data by preserving local geometric information. The experiments demonstrate that the BLLRR method can effectively analyze integrated TCGA data and extract more subspace structure information from multigenome data. It is a reliable and efficient clustering algorithm for cancer sample clustering.


Introduction
Cancer has seriously threatened the health of people all over the world. For cancer patients, timely detection, accurate diagnosis, and effective treatment are vital for saving lives [1]. Cancer classification, as an important prerequisite for early diagnosis and treatment of cancer, has always been a challenging focus in cancer research. Modern medical research shows that the cause of cancer is the variation and mutation in genes, and these gene mutations and abnormalities cause pathological differences in cancer, forming different classifications in clinical diagnosis [2]. Thus, cancer research at the genetic level has received much attention from biologists.
With the advent of the postgenome era in bioinformatics research, vast quantities of genomic data are being generated by DNA-microarray and deep-sequencing techniques [3][4][5][6]. Because these techniques can concomitantly profile thousands of genes, the expression data they produce can fully reflect transcription activity at a given time point, which affords researchers avenues to understand and study life mechanisms at the genome-wide scale.
The Cancer Genome Atlas (TCGA), as the largest component of the International Cancer Genome Consortium (ICGC), is by far the largest open genome database for cancer. As of the end of the TCGA project, the TCGA database had collected more than 11,000 cancer cases involving 33 cancer types [7]. The TCGA project aims to comprehensively and systematically study the biological and molecular basis of the formation, growth, and metastasis of cancer cells by mapping the genome of human cancers. The TCGA database can provide us with diverse genomics data. These genome data provide an unprecedented opportunity to systematically and comprehensively consider different genetic aberrations of biological processes. Therefore, cancer research based on TCGA data has become a hotspot in the field of bioinformatics.
Clustering of cancer samples is an important means of cancer classification. Its purpose is to find groups of samples with similar expression patterns. Based on TCGA data, a large number of articles on cancer clustering have been produced. For example, Yu et al. developed a method named Graph-based Consensus Clustering (GCC) to research the classes of samples based on microarray data [8]. Zheng et al. adopted Nonnegative Matrix Factorization (NMF) and sparse NMF methods to study tumor clustering [9]. Based on the maximum correntropy criterion, Wang et al. proposed a new nonnegative matrix factorization method named NMF maximum correntropy criterion (NMF-MCC) for cancer clustering from gene expression data [10]. Kong et al. presented a P-norm Singular Value Decomposition (PSVD) method for tumor clustering [11]. Feng et al. enforced graph-Laplacian regularization and the P-norm on PCA and presented the PgLPCA method for selecting feature genes and sample clustering [12]. Virmani et al. used DNA methylation data to cluster lung cancer [13]. Ye et al. studied tumor clustering based on independent component analysis (ICA) and affinity propagation (AP) [14]. Based on genomic data, Liu et al. adopted the Robust Principal Component Analysis (RPCA) approach to research tumor clustering [15]. Liu et al. presented a network-assisted coclustering method to identify cancer subtypes [16]. These studies show that, besides gene expression data, other genomic data in TCGA also contain the feature information needed for cancer clustering and can be used as feature sources for cancer clustering research. Therefore, it is reasonable to think that integrated data composed of multiple genome data contain more cancer clustering features than single genome data, which is helpful for studying cancer clustering better. However, different genomic data in the TCGA database come from different categories of genomics assays and therefore have different characteristics.
In other words, these genomic data are heterogeneous, which makes the integration and analysis of different genome data a major bottleneck in bioinformatics research [17]. Hence, most cancer clustering methods are based on single genomic data in the TCGA database, most frequently on gene expression data. This may ignore the interaction of different genetic factors, which is not conducive to the detection of cancer pathogenesis [18]. Obviously, these clustering methods cannot be directly used for comprehensive analysis of integrated TCGA data. In this case, how to establish an effective clustering algorithm for comprehensive analysis of TCGA integrated data, to further improve the reliability of cancer clustering, has become an urgent problem.
In recent years, Liu et al. developed a novel matrix transformation method known as the Low-Rank Representation (LRR) method [19] for subspace segmentation. The LRR method is based on an important assumption: high-dimensional data can be approximated as mappings of an unknown low-dimensional space.
That is, the high-dimensional data can be recovered from the low-dimensional space. Under this assumption, LRR aims at finding the lowest-rank structural representation of each sample through a low-rank constraint. Based on the recovered lowest-rank representation matrix, each sample is grouped into its own subspace. In LRR, because the global space information of the input data is exploited to recover the subspace structures embedded in the high-dimensional data, LRR can effectively pick up the underlying subspace structures of data. As a result, the LRR method has achieved excellent performance in subspace segmentation and has been frequently applied in many fields [20][21][22][23][24][25][26]. It is well known that, in the real world, high-dimensional data often reside on unknown nonlinear manifolds. However, the classical LRR method loses sight of the local structure information in data, resulting in the loss of the inherent topological characteristics of the nonlinear manifold.
Meanwhile, with the deepening of manifold learning theory and graph theory research, more and more researchers have introduced the graph regularization constraint into their algorithms [27][28][29][30][31][32][33]. For example, Long et al. presented a graph-regularized discriminative nonnegative matrix factorization (GDNMF) method [29]; in the GDNMF model, the discriminative information and local geometrical information were taken into account by imposing the graph regularization constraint on the NMF model. Huang et al. presented the Hypergraph-based Attribute Predictor (HAP) for attribute learning [31]. To further improve the classification performance of the Extreme Learning Machine (ELM), Peng et al. proposed a graph-regularized ELM named GELM [32]. Cheng et al. proposed a Graph-regularized Dual Lasso method to integrate the geometrical structure within traits and genetic markers [33]. Similarly, in order to learn the topological structure of data better, researchers introduced manifold learning into the LRR method [34][35][36][37][38]. For example, in order to improve the effectiveness of facial expression recognition, Wang et al. presented a regularized low-rank representation approach by combining linear subspace learning with data recovery [34]. Yin et al. combined LRR with a graph regularizer and developed the Nonnegative Sparse Hyper-Laplacian-regularized LRR (NSHLRR) method [36]. Wang et al. put forward Laplacian-regularized Low-Rank Representation (LLRR) to identify differentially expressed genes [37]. Besides, these LRR-based methods combining graph regularization have also aroused great interest among biologists and have been used in bioinformatics modeling for cancer clustering or cancer classification. Gan et al. applied latent low-rank representation to derive features for tumor clustering [39]. Wang [42].
Although these studies show that LRR-based methods with manifold constraints perform well in cancer clustering, the applicability of these methods to multitype integrated data analysis needs further study. Inspired by the success of the LRR method and graph regularization, in this work, we present a novel method referred to as Block-constraint Laplacian-regularized Low-Rank Representation (BLLRR) to research cancer sample clustering. The BLLRR method is devoted to obtaining a lowest-rank representation matrix that reflects the similarity between samples through comprehensive analysis of integrated TCGA data. Considering that different types of TCGA data have different characteristics and noise, in our method, we treat each type of data as a data block and impose different constraint strengths on different types of data. These different constraint strengths can well balance the noise from different genomic data. In addition, in order to maintain the nonlinear geometrical relationships of real data, a manifold-based graph Laplacian is introduced into BLLRR. Graph Laplacian, also named graph regularization, can maximize the smoothness of the nonlinear manifold of data by maintaining local geometrical relationships within data, which greatly enhances the capability of the BLLRR method to learn the subspace structure. The contributions of this paper are as follows. (1) A framework of cancer sample clustering based on multigenome data is proposed. This will bring cancer clustering research out of the confinement of analyzing single gene expression data. (2) We develop a novel method called BLLRR to model integrated TCGA data. In the BLLRR method, we introduce the block-constraint idea to decompose integrated TCGA data. The block-constraint solves the bottleneck problem of heterogeneous data integration and analysis by imposing different constraints on different genome data.
Besides, in order to smooth the nonlinear manifold structure of data, graph regularization is introduced into BLLRR. Both graph regularization and the block-constraint enable our method to pick up the subspace structures embedded in multigenome data well. (3) In BLLRR, adaptive balance parameters are proposed to balance the noise of different types of data. Namely, the constraint strength of each type of data is constantly adjusted with the iterations, which greatly reduces the trouble of parameter selection and makes the model more adaptable. (4) The BLLRR model is applied to the clustering of cancer samples, and many cancer clustering experiments are provided. The experimental results substantiate the feasibility of cancer clustering based on integrated multigenome data and also show that the BLLRR method has remarkable reliability and accuracy in cancer sample clustering. The rest of this paper is organized as follows. In the Methodology section, firstly, classical LRR and graph Laplacian are briefly reviewed in 2.1 and 2.2, and then the proposed BLLRR method is elaborated in 2.3. In Section 2.3.1, the objective function of BLLRR is given. In Section 2.3.2, the solving process of the BLLRR method is introduced, and the iteration formulas of the optimal solution are given. In Section 2.3.3, the model of decomposition of multigenome data by BLLRR is established. In Section 2.3.4, the clustering process based on the optimal coefficient matrix obtained by BLLRR is introduced. In Section 3, the datasets used for the experiments are introduced, and the results and discussion of the cancer sample clustering experiments are presented. In Section 4, we conclude the paper.

Methodology
2.1. LRR. LRR is a representation-based subspace clustering method. The basic assumption of LRR is that high-dimensional data come from multiple low-dimensional subspaces and that these subspaces are independent [19]. So, high-dimensional data can be regarded as mappings of data in these low-dimensional subspaces. Based on this, the LRR method is devoted to calculating the mapping weights of high-dimensional data. The weight matrix is often known as the coefficient matrix or low-rank representation matrix. As the nuclear norm is commonly used to approximate the rank operator, the resulting LRR problem is a convex optimization problem with nuclear norm regularization. Supposing the high-dimensional data matrix is represented by X, of which each column vector represents a data point, the problem of LRR is formulated as

min_{Z,E} ‖Z‖_* + γ‖E‖_1, s.t. X = AZ + E, (1)

where A is referred to as a dictionary matrix by which the whole low-dimensional space can be linearly spanned, Z is the coefficient matrix corresponding to A, ‖·‖_* denotes the nuclear norm, so ‖Z‖_* is the sum of the singular values of Z, E is a noise or perturbation term, ‖·‖_1 denotes the l1-norm, a regularization strategy that produces sparsity in matrices, so ‖E‖_1 is the sum of the absolute values of the elements of E, and γ is a scalar parameter. After LRR decomposition, the coefficient matrix Z is obtained from the high-dimensional data. Ideally, A is noiseless, and the coefficient matrix Z is sparse and symmetric. In general, the data matrix X itself is selected as the dictionary matrix, so LRR can be reformulated as

min_{Z,E} ‖Z‖_* + γ‖E‖_1, s.t. X = XZ + E. (2)

In such a case, the coefficient matrix Z reflects the mapping relationships between all samples. These mapping relationships are actually the similarities between samples, which can reveal the low-dimensional subspace structure embedded in the high-dimensional data. Given Z = [z_1, z_2, ..., z_n], the column vector z_i denotes the similarities between the i-th sample and all samples.
The more similar two samples are, the more likely they are to come from the same subspace. So, subspace clustering can be implemented based on Z.
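To make the connection between Z and the subspaces concrete, the following is a small numpy sketch of the noiseless special case of problem (2): when X is clean and A = X, the minimizer of ‖Z‖_* subject to X = XZ has the closed form Z = V_r V_rᵀ, the shape interaction matrix built from the right singular vectors of X (Liu et al. [19]). For data drawn from independent subspaces, this Z is block-diagonal, which is exactly the structure clustering exploits. The toy data and variable names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 5 samples from each of two independent 2-D subspaces in R^20.
U1, _ = np.linalg.qr(rng.standard_normal((20, 2)))
U2, _ = np.linalg.qr(rng.standard_normal((20, 2)))
X = np.hstack([U1 @ rng.standard_normal((2, 5)),
               U2 @ rng.standard_normal((2, 5))])   # 20 x 10, one sample per column

# Noiseless LRR with A = X: the minimizer of ||Z||_* s.t. X = XZ
# is the shape interaction matrix Z = Vr Vr^T.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-10))      # numerical rank (here 4 = 2 + 2)
Vr = Vt[:r].T                   # n x r
Z = Vr @ Vr.T                   # lowest-rank representation, 10 x 10

# For independent subspaces, Z is block-diagonal: cross-subspace
# affinities vanish while within-subspace affinities do not.
within = np.abs(Z[:5, :5]).mean()
across = np.abs(Z[:5, 5:]).mean()
print(within, across)
```

Reading off the two blocks of Z then assigns each sample to its subspace, which is the idea formalized by the clustering step later in the paper.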

2.2. Graph Laplacian.
As is known to all, high-dimensional data observed in the real world usually lie on nonlinear low-dimensional manifolds. Keeping the local geometric structure of the data is very important for smoothing the nonlinear manifold structure. Graph Laplacian, a popular approach to preserving the intrinsic structure embedded in high-dimensional data, is built on an essential idea named local invariance [43]. Supposing that X = [x_1, x_2, ..., x_n] is the observed data, each column vector of X is a data sample. These data samples and their neighbors form the local geometric structures of the original observed data. In practice, the neighborhood relationship is assumed to be linear [44], i.e., each data sample in a local geometry can be treated as a linear combination of its neighbors. So, the linear representation coefficients between data samples can efficiently characterize the local geometric structures. Accordingly, we construct a k-nearest-neighbor graph G. Here, each data sample is treated as a node, so graph G has n nodes. At the same time, we define the weight of each edge connecting two nodes of graph G as follows:

W_ij = 1, if x_j ∈ N_k(x_i) or x_i ∈ N_k(x_j); W_ij = 0, otherwise, (3)

where W_ij is the weight of the edge associating nodes i and j, x_i and x_j are the data samples corresponding to nodes i and j, respectively, and N_k(x_i) is the set of k-nearest neighbors of x_i. The weights of all edges in graph G constitute a weight matrix denoted as W. Obviously, the affinity between any two nodes of graph G can be measured by the matrix W. According to the idea of local invariance, the natural assumption in manifold theory is that the affinity relations of data samples in the input space should be kept in the new space.
That is to say, if data samples are near each other in the intrinsic geometry of the observed data, then their mappings on the output low-dimensional manifold are nearby too. This hypothesis can be enforced through the neighborhood relationships. In mathematics, the relationship can be formulated as follows:

min_Z (1/2) Σ_{i,j} ‖z_i − z_j‖² W_ij, (4)

where z_i and z_j are the representations of x_i and x_j on the low-dimensional manifold, respectively. Next, we define a diagonal matrix S of size n × n whose i-th diagonal element is S_ii = Σ_j W_ij. Apparently, S_ii indicates the total affinity related to sample x_i, so the matrix S is often called the degree matrix. Accordingly, a Laplacian matrix [45] L is defined as L = S − W. It is not difficult to prove that the relationship defined by (4) can be rewritten as

(1/2) Σ_{i,j} ‖z_i − z_j‖² W_ij = tr(ZLZᵀ). (5)

Because formulation (5) describes the local adjacency relation of graph G through the edge weight matrix W, which keeps the affinity between pairs of nodes, it is called the graph Laplacian regularization.
This rule is essential to preserving the inherent geometric structure of the original data distribution.
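The construction of W, S, and L described above can be sketched in a few lines of numpy. This is a minimal illustration of the k-NN graph with the 0-1 weighting of (3); the function name and the brute-force distance computation are our choices, not part of the paper.

```python
import numpy as np

def knn_graph_laplacian(X, k):
    """Build a 0-1 k-NN weight matrix W and the Laplacian L = S - W.

    X: m x n data matrix, one sample per column (as in the text).
    """
    n = X.shape[1]
    # Squared Euclidean distances between all pairs of columns.
    d2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]   # skip the point itself
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)                  # edge if either point is a neighbor of the other
    S = np.diag(W.sum(axis=1))              # degree matrix, S_ii = sum_j W_ij
    return W, S - W                         # L = S - W

X = np.random.default_rng(1).standard_normal((5, 30))
W, L = knn_graph_laplacian(X, k=4)
# L is symmetric positive semidefinite and its rows sum to zero,
# which is what makes tr(Z L Z^T) a valid smoothness penalty.
print(np.allclose(L, L.T), np.allclose(L.sum(axis=1), 0))
```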

2.3. BLLRR Method
The traditional LRR [19] method and its improved algorithms, such as NSHLRR [36], LLRR [37], and SSC-LRR [41], improve robustness to noise by enforcing an l1-norm or l2,1-norm constraint on the perturbation term. In these methods, all samples are subject to a uniform constraint strength; therefore, these methods are only applicable to the study of a single type of data and cannot be used directly on heterogeneous data. However, in practice, we need to obtain more useful information through comprehensive analysis of various heterogeneous data. For the analysis of multiple heterogeneous data, two issues need to be considered. One is that heterogeneous data have different characteristics because they come from different experiments or environments. The other is that multiple heterogeneous data bring more complex noise. Based on these two aspects, when dealing with multiple heterogeneous data, we introduce the block-constraint idea: we treat each category of data as a data block, and on different data blocks, we impose different constraint strengths. The block-constraint can not only balance the noise from different data but also preserve the feature information in the data by following the characteristics of heterogeneous data. In addition, similar to LLRR, to better discover the intrinsic geometrical structure embedded in the high-dimensional space, a manifold constraint is also introduced into the algorithm. So, the optimization problem is formulated as follows:

min_{Z,E} ‖Z‖_* + λ‖Z‖_1 + α tr(ZLZᵀ) + Σ_{l=1}^{c} γ_l ‖E_l‖_1, s.t. X = XZ + E, (6)

where X = (X_1, ..., X_c) ∈ R^{m×n} is the input data matrix, a collection of multiclass data in which c is the number of data categories and X_l is the l-th category of data. Accordingly, E = (E_1, ..., E_c) ∈ R^{m×n} is the noise matrix; specifically, E_l is the noise term with regard to X_l. α and λ are penalty parameters, and γ_l (l = 1, ..., c) is the weighting parameter that balances the noise terms of the different categories.
In (6), the LRR method is combined with the graph Laplacian and the block-constraint, so it is named the Block-constraint Laplacian-regularized Low-Rank Representation method. Obviously, when c = 1, the BLLRR model degenerates into the LLRR model, whose objective function is as follows:

min_{Z,E} ‖Z‖_* + λ‖Z‖_1 + α tr(ZLZᵀ) + γ‖E‖_1, s.t. X = XZ + E. (7)

2.3.2. The Optimization of BLLRR.
In order to recover the low-rank representation from data, many algorithms have been developed [46][47][48]. In particular, the ADM with Linearized Adaptive Penalty (LADMAP) [48] is a more efficient algorithm, and it is applied in this paper to solve problem (6). Firstly, an auxiliary variable J is introduced to make problem (6) separable, so (6) can be converted into the following optimization problem:

min_{Z,J,E} ‖Z‖_* + λ‖J‖_1 + α tr(ZLZᵀ) + Σ_{l=1}^{c} γ_l ‖E_l‖_1, s.t. X = XZ + E, Z = J. (8)

Then, we remove the linear constraints in (8) by introducing the augmented Lagrangian formulation.
Therefore, optimization problem (8) can be transformed into minimizing the following augmented Lagrangian:

‖Z‖_* + λ‖J‖_1 + α tr(ZLZᵀ) + Σ_{l=1}^{c} γ_l ‖E_l‖_1 + ⟨M_1, X − XZ − E⟩ + ⟨M_2, Z − J⟩ + (μ/2)(‖X − XZ − E‖_F² + ‖Z − J‖_F²), (9)

where M_1 and M_2 are Lagrangian multipliers, μ is a penalty parameter that can be adaptively adjusted, ‖·‖_F is the matrix Frobenius norm, and ‖Y‖_F² is the sum of squares of all elements of matrix Y. Finally, in order to optimize the variables Z, J, and E by alternate updating, the original optimization problem is divided into three subproblems:

min_Z ‖Z‖_* + α tr(ZLZᵀ) + ⟨M_1, X − XZ − E⟩ + ⟨M_2, Z − J⟩ + (μ/2)(‖X − XZ − E‖_F² + ‖Z − J‖_F²), (10)

min_J λ‖J‖_1 + ⟨M_2, Z − J⟩ + (μ/2)‖Z − J‖_F², (11)

min_E Σ_{l=1}^{c} γ_l ‖E_l‖_1 + ⟨M_1, X − XZ − E⟩ + (μ/2)‖X − XZ − E‖_F². (12)

(1) The Computation of Z. Fixing E and J, the iteration formula of Z is obtained by solving subproblem (10).
Firstly, we define a quadratic term that collects all the Z-dependent smooth terms of (10):

q(Z) = α tr(ZLZᵀ) + (μ/2)(‖X − XZ − E + M_1/μ‖_F² + ‖Z − J + M_2/μ‖_F²).

Then, subproblem (10) is recast as the objective ‖Z‖_* + q(Z). Following LADMAP, q(Z) is replaced by its linearization at the current iterate plus a proximal term. Finally, the solution of Z is given by

Z^{K+1} = Θ_{1/(ημ)}(Z^K − ∇q(Z^K)/(ημ)),

where Θ(·) is the singular value thresholding operator [49] and η is the proximal parameter of the linearization. (2) The Computation of J. Fixing the current values of the other variables, the iteration formula of J is obtained by solving subproblem (11). The solution of J is given by

J^{K+1} = Ω_{λ/μ}(Z^{K+1} + M_2/μ),

where Ω(·) is the soft shrinkage operator and Ω_ε(x) is defined as Ω_ε(x) = max(x − ε, 0) + min(x + ε, 0).
(3) The Computation of E. Similarly, fixing Z and J, the iteration formula of E is obtained by solving subproblem (12). According to Lemma 1 in [50], the operator solving subproblem (12) is denoted as Γ(·). So, the solution of E is as follows:

E_l^{K+1} = Γ_ς(D_l), for l = 1, ..., c.
Here, E_l is the l-th submatrix of E and denotes the noise signal corresponding to X_l, D = X − XZ^{K+1} + M_1^K/μ^K, D_l is the l-th submatrix of D, and ς = γ_l/μ^K denotes the threshold of the corresponding block. The iteration formulas of M_1, M_2, and μ are as follows:

M_1^{K+1} = M_1^K + μ^K(X − XZ^{K+1} − E^{K+1}),
M_2^{K+1} = M_2^K + μ^K(Z^{K+1} − J^{K+1}),
μ^{K+1} = min(μ_max, ρμ^K),

where ρ ≥ 1 is an update rate and μ_max is an upper bound on μ. The main procedure of BLLRR is shown in Algorithm 1.
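The three proximal operators used in the updates above — singular value thresholding Θ, soft shrinkage Ω, and the block-wise shrinkage Γ with a different threshold γ_l/μ per data block — can be sketched in numpy as follows. The function names and the toy example are ours; this is an illustration of the operators, not the paper's full solver.

```python
import numpy as np

def svt(Y, tau):
    """Singular value thresholding Theta_tau(Y): shrink the singular values."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft_shrink(x, eps):
    """Soft shrinkage Omega_eps(x) = max(x - eps, 0) + min(x + eps, 0)."""
    return np.maximum(x - eps, 0.0) + np.minimum(x + eps, 0.0)

def block_shrink(D, blocks, gammas, mu):
    """Block-wise operator Gamma: each feature block gets E_l = Omega_{gamma_l/mu}(D_l).

    blocks: number of rows (features) in each data block, in order.
    gammas: one constraint strength per block.
    """
    E = np.empty_like(D)
    start = 0
    for rows, g in zip(blocks, gammas):
        E[start:start + rows] = soft_shrink(D[start:start + rows], g / mu)
        start += rows
    return E

# Example: shrink a two-block residual D with different strengths per block,
# mimicking how heterogeneous genome data receive different constraints.
rng = np.random.default_rng(2)
D = rng.standard_normal((6, 4))
E = block_shrink(D, blocks=[4, 2], gammas=[0.5, 2.0], mu=1.0)
```

Applying a larger γ_l shrinks the corresponding block harder, which is precisely how the block-constraint absorbs block-specific noise.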

2.3.3. The BLLRR Model of Integrated TCGA Data. Though people have been studying cancer clustering based on gene expression for many years, it has been increasingly recognized that DNA copy number variation and DNA methylation also play an important role in cancer understanding and clustering research [51][52][53][54]. Moreover, as mentioned earlier, the TCGA dataset can provide a variety of genomic data for each sample, which makes it possible to study cancer based on a variety of biological processes. Therefore, we integrate these different genomics data as an integrated feature source to research cancer clustering. Figure 1 shows a schematic diagram of the multiassay genomic data. In Figure 1, mRNA expression, DNA copy number, and DNA methylation represent different genomics assay data from TCGA, in which each row represents a feature from a certain type of genome data and each column represents a sample. Therefore, in the integrated data, each sample contains all the features from the three categories of genomic data. Now, we focus on the integrated multigenome data. In our integrated data, there are three different types of genome data, and each category of data is regarded as a data block. Because of the heterogeneity of the different data blocks, in the BLLRR method, we impose different constraints on each data block, which are called the block-constraint. After BLLRR decomposition, the coefficient matrix Z, which reflects the similarity between samples, is obtained. It is not difficult to understand that samples with high similarity can be regarded as located in the same subspace. Consequently, based on Z, the samples can be clustered. The schematic depiction of the BLLRR decomposition of integrated multigenome data is shown in Figure 2. In this figure, X is the multigenome data matrix, Z is the low-rank representation matrix, E is the noise matrix, and γ_l is the constraint intensity on the l-th category of data.
As shown in Figure 2, the observation data are decomposed into two parts: the low-rank matrix and the noise matrix. Of course, an appropriate constraint strength, i.e., the scale parameter γ, is critical to enhance the robustness of BLLRR and obtain accurate similarity patterns between samples. Because different constraints are imposed on different types of data blocks, it is difficult to tune the parameters γ_l by the traditional method of parameter tuning. Furthermore, because different types of data have different noise, it is reasonable to assume that the noise of a certain type of data is related only to that type of data. Thus, we propose a new idea called parameter self-regulation to set the parameters γ_l for the different data blocks. Specifically, the parameters are adjusted along with the iteration process. For category l, the parameter is set as follows:

γ_l^i = ‖D_l^i‖_F / ‖D_l‖_F, (21)

where γ_l^i is the constraint intensity of the i-th feature in category l. As previously described, D = X − XZ^{K+1} + M_1^K/μ^K is an intermediate matrix generated in the iteration process, and it has the same dimensions and the same block correspondence as E. So, D_l is the submatrix corresponding to category l, and D_l^i denotes its i-th feature vector. As can be seen from formula (21), in the BLLRR method, we impose a different constraint on each feature vector to balance the noise terms of the different categories of data. The constraint intensity of each feature vector is calculated as the ratio of the F-norm of the feature vector to the F-norm of the data block in which the feature is located. In the iteration process of the BLLRR algorithm, D is constantly updated, so the constraint strength of each type of data is also constantly adjusted with the iterations.
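The self-regulated constraint strengths described above — each feature's γ as the ratio of its norm to the Frobenius norm of its data block — can be computed as in the following sketch. The function name and demo matrix are illustrative.

```python
import numpy as np

def adaptive_gammas(D, blocks):
    """Per-feature constraint strengths, one array per data category.

    gamma_l^i = ||D_l^i|| / ||D_l||_F : the norm of the i-th feature
    (row) of block l divided by the Frobenius norm of the whole block.
    D: m x n intermediate matrix with the same block layout as E.
    blocks: number of features (rows) in each category, in order.
    """
    gammas = []
    start = 0
    for rows in blocks:
        Dl = D[start:start + rows]
        block_norm = np.linalg.norm(Dl)          # Frobenius norm of the block
        row_norms = np.linalg.norm(Dl, axis=1)   # norm of each feature vector
        gammas.append(row_norms / max(block_norm, 1e-12))
        start += rows
    return gammas

# Demo: an 8-feature residual split into two categories of 5 and 3 features.
D = np.random.default_rng(3).standard_normal((8, 4))
g1, g2 = adaptive_gammas(D, blocks=[5, 3])
```

One consequence of this normalization is that the squared strengths within each block sum to one, so noisier features of a block automatically receive larger thresholds without any manual tuning.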

2.3.4. Clustering with BLLRR.
As discussed previously, the coefficient matrix Z obtained after BLLRR decomposition reflects the similarities between samples. According to Z, the samples with high similarity are clustered into one class. However, observation data from the real world are inevitably noisy, so Z is usually neither sparse nor symmetric. Before using Z for clustering, we need to process it to improve the accuracy and interpretability of the clustering. Firstly, Z is normalized by rows and shrunk under an appropriate threshold ζ that is very small and close to zero. After this treatment, Z becomes a sparse matrix Ẑ; that is, each sample is similar to only a few other samples, which is critical for the clustering problem. Next, we construct an affinity graph over all the samples. Based on Ẑ, we define an affinity matrix Z̃ to denote the affinities between samples in the affinity graph. In Z̃, the elements z̃_ij and z̃_ji both denote the affinity of samples i and j, so z̃_ij is equal to z̃_ji and Z̃ is symmetric. Consequently, the affinity matrix is defined as Z̃ = (|Ẑ| + |Ẑᵀ|)/2. So far, based on the affinity matrix, the sample clustering problem can be regarded as a graph segmentation problem. After the above two steps of processing, the affinity matrix is sparse and symmetric; however, it does not directly expose the block structure needed for clustering, so the clustering results of the samples cannot be read off from it directly. Finally, classical spectral clustering, which applies K-means to the spectral embedding of Z̃, is adopted to obtain the final clustering labels of the samples.
The main clustering procedure of BLLRR is shown in Algorithm 2.
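The post-processing pipeline just described — row normalization, shrinkage at ζ, symmetrization into an affinity matrix, and spectral clustering — can be sketched end-to-end as follows. The threshold value, the normalized spectral embedding, and the use of scipy's kmeans2 are our illustrative choices under the steps named in the text.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def cluster_from_Z(Z, K, zeta=1e-3, seed=0):
    """Cluster samples from a BLLRR-style coefficient matrix Z."""
    # (1) Normalize Z by rows.
    Zn = Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1e-12)
    # (2) Shrink entries below the small threshold zeta -> sparse matrix.
    Zs = np.where(np.abs(Zn) >= zeta, Zn, 0.0)
    # (3) Symmetric affinity matrix (|Z| + |Z^T|) / 2.
    A = (np.abs(Zs) + np.abs(Zs.T)) / 2.0
    # (4) Spectral embedding: top-K eigenvectors of D^{-1/2} A D^{-1/2}.
    d = np.maximum(A.sum(axis=1), 1e-12)
    M = A / np.sqrt(np.outer(d, d))
    _, vecs = np.linalg.eigh(M)
    emb = vecs[:, -K:]
    emb = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    # (5) K-means on the embedding gives the final labels.
    _, labels = kmeans2(emb, K, minit='++', seed=seed)
    return labels

# Toy coefficient matrix: two groups of five mutually similar samples.
Z = np.zeros((10, 10))
Z[:5, :5] = 0.2
Z[5:, 5:] = 0.2
labels = cluster_from_Z(Z, K=2)
```

On this block-structured toy Z, the first five and last five samples land in two distinct clusters, mirroring how the affinity matrix's block structure drives the final assignment.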

Experimental Results and Discussion
Firstly, the original datasets from TCGA and their integrated datasets for experiments are introduced.
Then, based on the experimental datasets, we carry out cancer sample clustering experiments to test the effectiveness of our method. In addition, in order to further demonstrate the performance of BLLRR, we choose K-means, GNMF [27], gLPCA [55], LRR [19], and LLRR [37] as comparison methods in our experiments. In the following sections, we give the experimental results and discuss the clustering performance of the BLLRR method in detail.

Input: observation matrix X, Laplacian matrix L; parameters α, λ
Output: Z

Datasets. Three datasets downloaded from TCGA are used in the experiments: COAD, ESCA, and HNSC, whose sample numbers are 281, 192, and 418, respectively. In addition, each dataset includes three categories of genome data: DNA copy number variation, mRNA expression level, and DNA methylation. Also, in the three datasets, each sample from the same category of genome data contains the same number of genes. Specifically, in the DNA copy number data, one sample contains 23,627 genes; in the mRNA expression data, one sample contains 20,502 genes; and in the DNA methylation data, one sample contains 21,031 genes.
As stated earlier, besides mRNA expression data, both DNA copy number data and DNA methylation data also play an important role in cancer clustering research. According to Figure 1, we integrate the three types of genome data from each dataset into multigenome data for cancer sample clustering. The three integrated datasets are COInteg, corresponding to the COAD dataset; ESInteg, corresponding to the ESCA dataset; and HNInteg, corresponding to the HNSC dataset.
Thus, COInteg contains 281 samples, ESInteg contains 192 samples, and HNInteg contains 418 samples; in each case, every sample contains 65,160 genes.

Evaluation Index of Clustering Performance.
In clustering research, evaluation is necessary work. Many indexes have been designed to evaluate the performance of clustering algorithms, such as accuracy (AC), true positive rate (TPR), false positive rate (FPR), the receiver operating characteristic (ROC) curve, precision, and the F1-measure. In this paper, we use AC, TPR, and FPR to evaluate the clustering performance of the BLLRR algorithm. Next, we introduce them concisely.

AC.
For a given dataset, the ratio of the number of correctly clustered samples to the total number of samples is defined as AC [56]. In practice, AC is calculated by comparing the clustering labels with the real labels of the samples. The mathematical definition of AC is as follows:

AC = (1/N) Σ_{i=1}^{N} δ(s_i, map(r_i)),

where N is the total number of samples in the experimental dataset, r_i is the clustering label of sample i assigned by the clustering algorithm, s_i is the real label of sample i, and δ(s_i, map(r_i)) is a function that compares the clustering label of a sample with its real label: if the clustering label is consistent with the real label, the function value is 1; otherwise, the value is 0. map(r_i) is a mapping function that matches the clustering label of the sample to its real label to facilitate the comparison. By the Kuhn-Munkres method [57], the best matching can be found.
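The AC definition above, including the Kuhn-Munkres matching of cluster labels to true labels, can be implemented with scipy's Hungarian-algorithm routine. The contingency-table construction is our illustrative implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    """AC = (1/N) * sum_i delta(s_i, map(r_i)).

    The label mapping `map` is found with the Kuhn-Munkres (Hungarian)
    algorithm on the contingency table of predicted vs. true labels.
    """
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    classes = np.unique(np.concatenate([true_labels, pred_labels]))
    # C[r, s] = number of samples with predicted label r and true label s.
    C = np.zeros((classes.size, classes.size), dtype=int)
    for s, r in zip(true_labels, pred_labels):
        C[np.searchsorted(classes, r), np.searchsorted(classes, s)] += 1
    row, col = linear_sum_assignment(-C)   # negate to maximize total matches
    return C[row, col].sum() / true_labels.size

# Cluster labels are arbitrary names: a perfect clustering with permuted
# label names still scores 1.0 after the optimal relabeling.
acc = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0])
print(acc)  # 1.0
```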

TPR and FPR.
. TPR and FPR, as common metrics widely used to evaluate clustering quality, are all calculated based on the confusion matrix. So, let us start with a brief introduction to confusion matrices. Confusion matrix, also known as the error matrix, is a standard format for evaluating. Obfuscation matrix is a two-dimensional matrix. Each row represents an actual class, and each column represents a predicted class. e confusion matrix of a simple case with two classes is shown in Table 1. Generally, among these two classes, the one we are concerned with is designated as a positive class and the other as a negative class. In this table, true positive (TP) denotes the number of positive class samples that are correctly clustered into positive class. True negative (TN) indicates the number of negative class samples that are correctly clustered into negative class. False positive (FP) denotes the number of negative class samples that are incorrectly clustered into positive class. False negative (FN) means the number of positive class samples that are incorrectly clustered into negative class. TPR and FPR are defined as follows: From the calculation formulas of TPR and FPR, we can see that the TPR represents the ratio of the number of Input: Observation data X, clustering number K Output: Z (1) Get the coefficient matrix Z of problem (8) using BLLRR method.
(2) Normalize Z by rows: z_i = z_i / ||z_i||_2.
(3) Shrink Z to obtain the sparse matrix by setting z_ij = z_ij if z_ij ≥ ζ, and z_ij = 0 otherwise.
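Steps (2)-(3), plus the construction of the affinity matrix used before K-means (described in the conclusion), can be sketched as follows. This is an illustrative sketch: the threshold value ζ and the symmetrisation W = (|Z| + |Z|ᵀ)/2 are assumptions here (the latter is the usual LRR-style choice, not spelled out in this excerpt).

```python
import numpy as np

def postprocess_coefficients(Z, zeta=0.01):
    """Steps (2)-(3): row-normalise Z, then zero out entries below the
    shrinkage threshold zeta (the default value here is an assumption)."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                 # guard against all-zero rows
    Zn = Z / norms                          # z_i <- z_i / ||z_i||_2
    return np.where(Zn >= zeta, Zn, 0.0)    # keep z_ij only if z_ij >= zeta

def affinity_matrix(Z):
    """Symmetric affinity between samples from the sparse coefficient
    matrix; the symmetrisation (|Z| + |Z|^T) / 2 is the common
    LRR-style choice and is assumed here."""
    A = np.abs(Z)
    return (A + A.T) / 2.0
```

The resulting affinity matrix can then be fed to a graph-based clustering step, with K-means producing the final labels as in the evaluation protocol below.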
From the calculation formulas of TPR and FPR, we can see that TPR represents the ratio of the number of samples correctly clustered into the positive class to the total number of samples in the positive class, and FPR represents the ratio of the number of samples incorrectly clustered into the positive class to the total number of samples in the negative class.
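As a quick sanity check, the two rates follow directly from the four confusion-matrix counts:

```python
def tpr_fpr(tp, fn, fp, tn):
    """TPR = TP / (TP + FN): fraction of positive-class samples clustered
    into the positive class.
    FPR = FP / (FP + TN): fraction of negative-class samples incorrectly
    clustered into the positive class."""
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical counts: 90 of 100 tumour samples and 45 of 50 normal
# samples clustered correctly gives TPR = 0.9 and FPR = 0.1.
```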

Experimental Results.
In this section, many sample clustering experiments are performed on the experimental datasets to fully demonstrate the performance of our method. Firstly, we apply LLRR to cluster cancer samples based on DNA copy number variation, mRNA expression level, DNA methylation, and their integrated data. As mentioned earlier, when the BLLRR method is applied to single genomic data, the BLLRR model is equivalent to the LLRR model. The clustering accuracies are shown in Table 2, where DNA copy number variation is denoted by CN, mRNA expression level by GE, and DNA methylation by ME, and the best result on each dataset is shown in bold.
From Table 2, we can see that the clustering accuracy on each single genome data type from our three experimental datasets is over 92%. This indicates that each of the three categories of genomic data contains useful information for cancer sample clustering. Next, we compare the clustering results on the different genomic data of each dataset. Table 2 shows that, for the COAD and ESCA datasets, the clustering accuracy on GE data is the best, reaching 95.35% and 96.51%, respectively, while for the HNSC dataset, the clustering accuracy on ME data is the best, reaching 97.22%.
This comparison further indicates that, besides GE data, CN data and ME data can also serve as feature source data for studying the clustering of cancer samples. Finally, for each dataset, we compare the clustering accuracy on integrated multigenome data with that on single genome data. It is not difficult to see that, on all three datasets, the clustering effect on integrated data is worse than the best clustering effect achieved on single genome data. The fundamental reason for this result is that the LLRR method ignores the heterogeneity of different genome data and imposes the same constraint intensity on the integrated multigenome data. Consequently, when LLRR is used to decompose multigenome data, the noise and characteristic information of the different genome data cannot be well processed. Clearly, the LLRR model is suitable only for single genome data, not for multigenome data. Summing up the above analysis, we reach two conclusions: (1) DNA copy number variation, mRNA expression level, and DNA methylation data are all of great significance for the clustering of cancer samples, so it is reasonable to integrate them into multigenome data for cancer sample clustering. (2) When processing integrated multigenome data, the heterogeneity of the data must be fully considered.
Secondly, in order to test the clustering performance of the BLLRR method on multigenome data, cancer sample clustering experiments are conducted on the three integrated multigenome datasets. As comparison methods, K-means, GNMF, gLPCA, LRR, and LLRR are also used to cluster cancer samples. Moreover, for the sake of comparability of the experimental results, we uniformly use the K-means algorithm to obtain the final clustering results for GNMF, gLPCA, LRR, and LLRR, just as for the BLLRR method. Because K-means randomly selects cluster centers for each run, each clustering result differs slightly. In order to reduce the impact of this variability on the evaluation of the experimental results, in all our experiments we take the average of 30 clustering results as the final result. Specifically, for GNMF, gLPCA, LRR, LLRR, and BLLRR, we first decompose the experimental data to obtain a matrix for clustering; we then run K-means 30 times on the obtained matrix and take the mean of the 30 clustering accuracies as the final clustering result. Table 3 gives the clustering accuracy of each method on the multigenome data in detail. As before, the best result for each dataset is displayed in bold.
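The averaging protocol above can be sketched as follows; this is a minimal illustration with scikit-learn's KMeans, where best_match_accuracy is a brute-force stand-in for the Kuhn-Munkres label matching (adequate for the two-class case used in these experiments):

```python
from itertools import permutations

import numpy as np
from sklearn.cluster import KMeans

def best_match_accuracy(real, pred):
    # Try every assignment of cluster labels to real labels and keep
    # the best accuracy; equivalent to Kuhn-Munkres for small K.
    best = 0.0
    for perm in permutations(np.unique(real)):
        mapping = dict(zip(np.unique(pred), perm))
        acc = float(np.mean([mapping[p] == r for p, r in zip(pred, real)]))
        best = max(best, acc)
    return best

def mean_kmeans_accuracy(X, real, k, runs=30):
    """Repeat K-means `runs` times on the matrix produced by the
    decomposition step and report the mean AC, damping the randomness
    of K-means centre initialisation (the paper uses 30 repetitions)."""
    accs = [best_match_accuracy(
                real,
                KMeans(n_clusters=k, n_init=10, random_state=s).fit_predict(X))
            for s in range(runs)]
    return float(np.mean(accs))
```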
Of these methods used for comparison, LRR and LLRR are LRR-based clustering methods; K-means, GNMF, and gLPCA are traditional methods. Firstly, as can be seen from Table 3, the clustering accuracies of LRR and LLRR are higher than those of three traditional methods on the whole.
This benefits from the successful learning of the subspace structure embedded in the data by the LRR-based methods, which reflects the importance of the subspace structure for clustering research. Secondly, comparing LRR with LLRR, we can see that the clustering performance of LLRR is better than that of LRR. This is due to the introduction of the graph regularization term in the LLRR method. As introduced previously, graph regularization preserves the geometrical relationships of the data and smooths the nonlinear manifold. Therefore, LLRR has a better ability to learn the subspace structure than LRR. Thirdly, we compare BLLRR with LLRR. It is very clear that, on each integrated dataset, the clustering accuracy of BLLRR is higher than that of LLRR. The basic clustering ideas of LLRR and BLLRR are consistent, and in both algorithms the graph Laplacian is introduced to help obtain the subspace structure of the data. The main difference between the two methods is that, when decomposing multigenome data, the idea of block-constraint is introduced into the BLLRR method: each category of genome data contained in the integrated data is regarded as a data block, and different constraints are imposed on different data blocks. Because the block-constraint considers the peculiarities of the different genome data within the multigenome data, it improves the robustness of BLLRR to the complex noise of multigenome data and protects the feature information of each genome data type well. In the LLRR method, by contrast, the integrated multigenome data are treated as single genome data and a uniform constraint strength is imposed, which ignores the peculiarities of the different data types. So, BLLRR can deal with multiple heterogeneous data more effectively than LLRR.
Finally, comparing the results of BLLRR in Table 3 with the results of LLRR on single genomic data in Table 2, we can see that, on all three datasets, the clustering results of BLLRR on multigenome data are better than the best results of LLRR on single genomic data. This indicates that multigenome data contain more subspace structure information than single genome data and can be used as a comprehensive feature source for cancer research. Meanwhile, it again illustrates that BLLRR is capable of mining more useful subspace information from multiple genomic data for sample clustering. Based on the above analysis, we conclude that the BLLRR method has a powerful ability to learn the intrinsic subspace structure within multiple heterogeneous data and can effectively cluster cancer samples by decomposing multiple genomic data. We would now like to further explain the importance of the parameter c_l and the rationality of our setting of c_l. Firstly, as can be seen from formula (21), we set the corresponding parameter c_l according to the overall expression level of the different genomic data, which helps to set up an appropriate constraint for each genome data type with its own expression level. Moreover, c_l is continuously updated during the iterations, so the parameter c_l helps to better process the complex noises in multigenomic data. We then compare the experimental results of the BLLRR method and the LLRR method to illustrate the rationality of the parameter c_l. As discussed earlier, when a uniform constraint strength is applied to multiple genome data, BLLRR degenerates into LLRR. From the comparative analysis of Tables 2 and 3, we can draw two points. The first is that, for the LLRR method, the clustering results on multigenomic data are worse than those on single genome data. This indicates that imposing uniform constraints on multigenomic data cannot handle the different noise levels.
The second is that, for the BLLRR method, the clustering results on multigenomic data are better than those on single genome data. This proves that the parameter c_l can effectively balance the complex noises in the different genomic data. Summarizing the above analysis, both the formula and the experimental results show that the parameter c_l obtained in BLLRR is reasonable and effective.
However, the samples in our experimental datasets are extremely imbalanced: there are many tumor samples and few normal samples. Sample imbalance is a common problem in the field of bioinformatics. To indicate the degree of sample imbalance, for each integrated dataset we calculate the ratios of the two types of samples, as shown in Table 4, where n_T and n_N represent the number of tumor samples and the number of normal samples, respectively. Thus, n_T/n_N denotes the ratio of tumor samples to normal samples, and n_N/n_T the ratio of normal samples to tumor samples. In this case, the normal samples are surrounded by a large number of tumor samples, which is disadvantageous to the clustering of normal samples.
Finally, in view of this situation, we use TPR and FPR as evaluation measures to study the clustering effect on each class of samples. In cancer clustering research, researchers tend to pay more attention to the disease samples, that is, the cancer or tumor samples. Therefore, we regard cancer samples as positive samples and normal samples as negative samples. The values of TPR and FPR on all multigenome data are recorded in Table 5. According to the definition of TPR, the larger the value of TPR, the better the clustering effect on cancer samples; and the smaller the value of FPR, the better the clustering effect on normal samples. So, in Table 5, for each dataset, both the maximum value of TPR and the minimum value of FPR are marked in bold. For ease of comparison, we also use histograms to illustrate the results, as shown in Figures 3 and 4.
In our data, because the positive-class samples far outnumber the negative-class samples, in the following description the positive-class samples are also called majority-class samples and the negative-class samples are also called minority-class samples. From Figure 3, we can see that the TPR values of the various methods are generally high on all three datasets; on ESInteg in particular, the mean TPR exceeds 99%. In addition, as can be seen from Figure 4, most FPR values exceed 60%. In particular, Table 5 shows that the FPR values of GNMF on COInteg and of LRR on HNInteg are 100%.
These results show that the extreme imbalance of the sample distribution is beneficial to the clustering of majority-class samples but poses a great challenge to the clustering of minority-class samples. In order to demonstrate the clustering performance of BLLRR on minority-class samples, we compare LRR, LLRR, and BLLRR. Firstly, as can be seen from Table 5, for LRR, the values of TPR and FPR are the highest on each dataset. This shows that the LRR method is sensitive to extremely imbalanced datasets when learning the subspace. That is, when the dataset is extremely unbalanced, the LRR method can learn the subspace structure of the majority-class samples well but cannot learn the subspace structure of the minority-class samples well. So, LRR is not suitable for the study of subspace clustering when the samples are extremely unbalanced. Secondly, as can be seen from Figure 4, compared with LRR, LLRR improves the clustering performance on minority-class samples. This further shows that graph regularization helps to learn the subspace information better by preserving the local geometric structures in high-dimensional data, which is of great significance for the clustering of minority-class samples. Finally, we compare BLLRR with LLRR. We can see from Figure 4 that, on each dataset, the FPR value of the BLLRR method is far less than that of the LLRR method and is the smallest of all the compared methods. This shows that the block-constraint is beneficial for extracting more abundant structural information from multigenome data, thus avoiding the loss of the intrinsic subspace structure of the minority-class samples in manifold learning. In addition, this experimental result also proves the validity of the BLLRR method for clustering samples on extremely unbalanced data. To sum up, BLLRR can effectively learn the subspace structure embedded in multigenome data, so BLLRR can still cluster each class of samples effectively even though the samples are extremely unbalanced.

Conclusion
In this paper, we put forward a novel method termed BLLRR to analyze integrated TCGA data. In the BLLRR model, the graph Laplacian is introduced so that the BLLRR method better respects the local geometric relationships of the data when learning the manifold structure. In addition, in order to deal with heterogeneous data better, the idea of block-constraint is introduced, which allows BLLRR to impose different constraint intensities on different data blocks. Because the block-constraint can well balance the complex noise of multiclass data and better preserve the useful characteristic information of each class of data, our method is competent to learn the subspace structure of multiple heterogeneous data. We then apply the BLLRR method to cancer sample clustering based on multigenome data. Firstly, the integrated multigenome data are decomposed by BLLRR, and a coefficient matrix is obtained. Secondly, we construct the affinity matrix denoting the affinities between samples based on the coefficient matrix. Finally, we regard sample clustering as a graph segmentation problem and use K-means to achieve the cancer sample clustering. The experimental results show that our method has remarkable subspace learning ability. Especially for minority-class samples in extremely unbalanced datasets, the clustering performance of the BLLRR method is clearly better than that of the other methods. So, the BLLRR method is an efficient and reliable method for multigenome data analysis. In the future, we will continue to work on the comprehensive analysis of TCGA data.

Data Availability
The datasets supporting the findings of this work are available at https://cancergenome.nih.gov/.

Conflicts of Interest
The authors declare that they have no conflicts of interest.