Exploring the classification of cancer cell lines from multiple omic views

Background Cancer classification is of great importance for understanding pathogenesis, making diagnoses and developing treatments. The accumulation of extensive omics data for a large number of cancer cell lines provides a basis for large-scale, low-cost classification of cancer. However, the reliability of cell lines as in vitro models of cancer has been controversial. Methods In this study, we explored the classification of pan-cancer cell lines using single and integrated multiple omics data from the Cancer Cell Line Encyclopedia (CCLE) database. Five representative types of cancer omics data were included in the analysis: mRNA, miRNA, copy number variation, DNA methylation and reverse-phase protein array data. The TumorMap web tool was used to illustrate the landscape of molecular classification. The molecular classification of patient samples was compared with that of cancer cell lines. Results Eighteen molecular clusters were identified using integrated multiple omics clustering. Three of these were pan-cancer clusters. By comparison with single omics clustering, we found that integrated clustering could capture both shared and complementary information from each omics data type. Omics contribution analysis for clustering indicated that, although all five omics data types were of value, mRNA and proteomics data were particularly important. While the classifications were generally consistent, samples from cancer patients were more diverse than cancer cell lines. Conclusions The clustering analysis based on integrated omics data provides a novel multi-dimensional map of cancer cell lines that can reflect the extent to which pan-cancer cell lines represent primary tumors, and an approach to evaluate the importance of omic features in cancer classification.


Materials & Methods
Cancer cell lines and data pre-processing
Our study involved 1,019 cell lines from 31 previously established cancer types. The mRNA, miRNA, CNV, METHY and RPPA data were downloaded from the CCLE database for all cell lines (https://portals.broadinstitute.org/ccle/data) (Ghandi et al., 2019). The numbers of cancer cell lines and the cancer types involved are shown in Table 1.
For mRNA sequencing data, we used the gene-level RSEM values provided by the CCLE database. We used miRNA expression data from CCLE for miRNA analysis. For DNA methylation, the promoter CpG data were used for clustering analysis. Reverse-phase protein array data were downloaded for protein analysis. In parallel, we downloaded segmented copy number profiles from the CCLE database for CNV analysis. These SNP6.0 array data were used as the input for the Gistic2.0 software (Mermel et al., 2011). Before pre-processing the data, we mapped the segmented copy number to the chromosome arm level using Gistic2.0. The resulting arm-level copy number variation was the input for CNV clustering analysis. Next, the following steps were performed to improve the dataset quality for single omics clustering.

(1) For each omics dataset, cell lines with more than 20% of features missing, and features missing in more than 20% of cell lines, were filtered out.

(2) For each omics dataset, the missing data points were filled in using mean imputation.

(3) For mRNA and miRNA data, a log2(x+1) transformation (where x is the mRNA or miRNA value) was performed before feature selection.

(4) For mRNA and METHY data, only the 5,000 features with the highest variance were selected. For miRNA, RPPA and CNV data, all features were considered.

Manuscript to be reviewed

Feature contribution of integrated multiple omics clustering
We used normalized mutual information (NMI), a measure of the interdependence between two random variables, to measure the contribution of each feature of each omics type. The function "rankFeaturesByNMI" in the R package "SNFtool" (version 2.3.0) was used to compute NMI (Liu et al., 2018a; Wang et al., 2014). Code is provided in Script S1.
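As a rough illustration of the quantity being ranked, NMI between two label assignments can be sketched in a few lines of Python. This is not the SNFtool implementation. The normalization by the geometric mean of the two entropies is one common convention and is assumed here. The helper names (`entropy`, `mutual_information`, `nmi`, `rank_features_by_nmi`) and the idea of passing in precomputed per-feature cluster labels are hypothetical simplifications.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two equally long label sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )


def nmi(a, b):
    """NMI = I(a; b) / sqrt(H(a) * H(b)); the normalization is an assumption."""
    if len(a) != len(b):
        raise ValueError("label sequences must have the same length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # a constant labelling carries no information
    return mutual_information(a, b) / math.sqrt(h_a * h_b)


def rank_features_by_nmi(feature_labels, integrated_labels):
    """Sort features by how well their own clustering matches the integrated one.

    feature_labels: dict mapping a feature name to the cluster labels obtained
    from that feature alone (hypothetical input, assumed to be precomputed).
    """
    scores = {name: nmi(labels, integrated_labels)
              for name, labels in feature_labels.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    integrated = [0, 0, 1, 1, 2, 2]
    per_feature = {
        "gene_A": [1, 1, 2, 2, 0, 0],  # same partition, relabelled
        "gene_B": [0, 1, 0, 1, 0, 1],  # partition unrelated to the clusters
    }
    for name, score in rank_features_by_nmi(per_feature, integrated):
        print(f"{name}\t{score:.3f}")
```

Because NMI depends only on how samples are partitioned, not on the label values themselves, a feature whose clustering reproduces the integrated clusters under different cluster names still receives the maximum score.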
Tumor maps of cancer cell lines
We used the TumorMap website to create pan-cancer cell line maps from the above integrated data. TumorMap is an interactive website for exploring high-dimensional and complex omics data (https://tumormap.ucsc.edu/) (Newton et al., 2017).

We initially clustered cell lines based on each type of omics data: mRNA, miRNA, CNV, METHY and RPPA. The optimal number of clusters was set to 10 (Fig. 1 and Fig. S1).

In the hierarchical clustering result of 901 cell lines by mRNA (Fig. 1A, Table S1 and Fig. S1A), we found that one cluster was mainly formed from a single type of cancer (C7 [SKCM]

Fig. S4). Each cluster was also mixed with a small number of cell lines from other cancer types. Besides SKCM cell lines, C15 also contained one glioblastoma multiforme cell line (LN229) with a low level of VHL and high expression of hsa-miR-146a, hsa-miR-29b and hsa-miR-188-3p (Aurich et al., 2017). Notably, although SARC is the dominant cancer type in C6, its proportion within the cluster is relatively low.

Fig. S4). On the one hand, the proportions of the two dominant cancer types were almost equal in C3, C13 and C18, and C3 was characterized by a high level of CDH1. On the other hand, in C5, C9 and C11, one of the two dominant cancer types accounted for over 50%. C9 had high levels of VAV1 and STAT5A and a low level of CTNNB1 (Bertagnolo et al., 2011; Harir et al., 2007; Ysebaert et al., 2006).

lines in C11 had high levels of protein binding involved in heterotypic cell-cell adhesion, SNARE binding and G-protein beta-subunit binding in GO terms (Figs. 3C-3H and File S1). Meanwhile, pan-squamous morphology carcinoma cell lines (C5 [HNSC-ESCA]) were characterized by up-regulation of the nicotine addiction and cell adhesion molecules pathways. These cell lines also had high levels of CAV1, EGFR and ITGA2 (Ando et al., 2007; Song et al., 2015b).
We also observed cases in which cell lines of the same cancer type were dispersed across two clusters. For instance, cell lines from ALL were divided into two clusters, C4 and C8, despite common characteristics such as Human T-cell leukemia virus 1 infection, Th17 cell differentiation and the TNF signaling pathway. In the KEGG enrichment analysis, the ALL cell lines in C8 showed up-regulation of ECM-receptor interaction and down-regulation of the antigen processing and presentation pathway, while the ALL cell lines in C4 had a low level of cellular senescence (Figs. 3A, 3B and File S1). GO enrichment analysis showed that the ALL cell lines in C8 had a low level of cell-cell junction and high levels of calcium channel activity, while the ALL cell lines in C4 were down-regulated in growth factor receptor binding and sulfur compound binding (Figs. 3C-3H and File S1). At the other four omics levels, the features of these two clusters differed as well. For example, the levels of PTEN (a tumor suppressor gene), LCK and Syk (two immune-related genes) and hsa-miR-151-5p (related to tumor invasion and metastasis) were completely inconsistent.

Integrated multiple omics clustering provided a global view of cancer types because it could capture both shared and complementary information from each omics data type. Several cancer types that were mixed together in one single omics dataset were separated in other single omics data or in the integrated omics data. For example, BRCA and SCLC were mixed together based on miRNA data, but they were separated into two distinct molecular clusters based on integrated omics data. Besides, in single omics clustering, three pan-organ system clusters were found only based on mRNA data, and the pan-squamous morphology carcinoma cluster was found only based on METHY and RPPA data. In contrast, the pan-gastrointestinal, pan-gynecological and pan-squamous morphology carcinoma clusters were simultaneously identified by integrated multiple omics clustering.

PeerJ reviewing PDF | (2019:12:43613:2:0:NEW 12 May 2020)

The relative contribution of each omics data type to the integrated clustering was computed based on the NMI value. On the basis of the top 20% of statistical features from the five omics data types, we found that RPPA and mRNA contributed 32.24% and 29.62%, respectively, followed by METHY (16.24%) (Fig. 2C and Table 3). This result indicated that mRNA and proteomics data were particularly important for cancer molecular classification. Meanwhile, mRNA and RPPA data revealed more information than the other omics data in single omics clustering. For instance, pan-organ system clusters were identified based on mRNA and RPPA data, but not based on miRNA and CNV data. These results indicated that mRNA and proteomics data could be preferred if multiple omics data cannot be measured simultaneously.
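The text does not spell out how per-feature NMI values were aggregated into the percentages reported above. Purely as an illustration, the sketch below assumes one plausible scheme: within each omics type, take the top 20% of features by NMI, average their NMI values, and normalize the per-omics averages so that they sum to 100%. The function name `omics_contribution`, the input layout and the toy numbers are hypothetical and are not taken from Script S1.

```python
from collections import defaultdict


def omics_contribution(feature_scores, top_fraction=0.2):
    """Percentage contribution per omics type under the assumed scheme.

    feature_scores: iterable of (omics_type, nmi_value) pairs, one per feature.
    """
    by_omics = defaultdict(list)
    for omics_type, score in feature_scores:
        by_omics[omics_type].append(score)

    top_means = {}
    for omics_type, scores in by_omics.items():
        # Keep the highest-scoring fraction of features, but at least one.
        k = max(1, round(top_fraction * len(scores)))
        top = sorted(scores, reverse=True)[:k]
        top_means[omics_type] = sum(top) / k

    total = sum(top_means.values())
    if total == 0:
        return {omics_type: 0.0 for omics_type in top_means}
    # Normalize the per-omics means so that they sum to 100%.
    return {omics_type: 100.0 * mean / total
            for omics_type, mean in top_means.items()}


if __name__ == "__main__":
    # Toy NMI values, invented for illustration only.
    toy = [("mRNA", v) for v in (0.30, 0.10, 0.05, 0.05, 0.02)]
    toy += [("RPPA", v) for v in (0.45, 0.20, 0.10, 0.05, 0.01)]
    for omics_type, share in omics_contribution(toy).items():
        print(f"{omics_type}: {share:.1f}%")
```

Averaging within each omics type, rather than counting features in a pooled ranking, keeps small panels such as RPPA from being swamped by the 5,000 selected mRNA or METHY features. Whether the original analysis made the same choice is not stated.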
The comparison of classification between cancer samples and cell lines
We compared the classification results for the 19 cancer types shared by cancer cell lines from CCLE and patient samples from TCGA (Hoadley et al., 2018). Clusters of patient samples and of cell lines were each divided into three types: clusters dominated by a single cancer type, pan-cancer clusters and clusters mixed with other cancer types (Table 4).

For hematopoietic and lymphatic malignancies, the classification of cancer cell lines is more diverse than that of patient samples. For example, in our findings some LAML cell lines were clustered together in one group, while others were mixed with LCML cell lines in another group (LAML-LCML). For patient samples, there is only one LAML group (Hoadley et al., 2018). The classification of DLBC cell lines was consistent with that of patient samples. As with the hematopoietic malignancies, the SARC patient samples were clustered into a group of their own. For cell lines, however, in addition to those gathered in a single group, a few SARC cell lines were mixed with GBM.

For most solid tumors, the classification of patient samples is generally richer and more diverse than that of the corresponding cell lines. In general, patient samples of the same cancer type can be divided into multiple groups, while cell lines of the same cancer type are clustered in one group. For example, the breast cancer samples were classified into three subgroups (chr8q amp, HER2 amp and Luminal). In addition, a large number of BRCA samples gathered with other cancer types in a mixed cluster. In contrast, except for a few BRCA cell lines that were mixed into a pan-gynecological cluster, almost all BRCA cell lines were clustered in a single

The TumorMap landscape of pan-cancer cell lines
We used the TumorMap web tool to visualize the landscape of pan-cancer cell lines. The same layout and four different color schemes (SNF-CC cluster, TCGA disease, pan-organ system and histology) were used to reveal that most cancer cell lines gathered based on organ systems and histopathological similarity (Fig. 4). More nuance within a cancer type was also apparent. The

Campbell et al., 2018; Liu et al., 2018b). We found that cell lines within C11 (COAD/READ-STAD) and C5 (HNSC-ESCA) were tightly gathered, while cell lines within C18 (OV-UCEC) were relatively dispersed. The TumorMap landscape showed that cancer cell lines with similar histological characteristics tended to group together, even though histological information was not used when calculating similarities (Fig. 4D). The hematopoietic and lymphatic malignancies were remote from other cancer types on the map. This result underscored that the molecular characteristics of hematopoietic and lymphatic malignancies differed from those of other cancer types (Fig. 4D). Moreover, C15 (SKCM) and C17 (KIRC) were also far away from other solid tumor groups on the map.
We downloaded the drug susceptibility data for 24 anticancer drugs across 504 cell lines from the CCLE database. We used the TumorMap web tool to analyze the relationship between drug susceptibility and the pan-cancer clustering. We divided the analysis results into four types and chose representative drugs as examples (Fig. S5).

(1) These anticancer drugs have a strong effect on almost all cancer cell lines. For example,

(3) Only one cell line or a few cell lines are sensitive to these anticancer drugs. For example, as a BRAF inhibitor, PLX4720 has an obvious effect on some SKCM (C15) cell lines, but has no effect on other cell lines (Fig. S5C).

Our study showed that clusters were strongly influenced by organ system and cell of origin. Three pan-cancer cell line clusters (the pan-gastrointestinal group, the pan-gynecological group and the pan-squamous morphology carcinoma group) were simultaneously identified by integrated multiple omics clustering (Berger et al., 2018; Campbell et al., 2018; Liu et al., 2018b). Common functional mechanisms and multiple omics characteristics within the same pan-cancer cluster may have potential clinical application value. The clusters obtained by integrated clustering provide a reference for treating the same disease with different therapies. On the one hand, cell lines of one cancer type with different molecular features gathered in different clusters. Although these cell lines belong to the same cancer type, the appropriate therapies based on their molecular characteristics may differ. On the other hand, the treatment of a cluster containing multiple cancer types may be the same. A comprehensive analysis of cancer classification could be used to elucidate potential disease mechanisms and provide additional guidance for molecular treatments.

We also showed that mRNA and proteomics data grouped cell lines by cancer type more strongly than the other omics data. This is meaningful for biologists and oncologists choosing which types of omics data they need for their particular analysis.

Cancer types with -lg(P-value) > 3 in each cluster were defined as dominant cancer types. All blank cells indicate instances where the P-value = 0.