Integrated multi-omics characterization of KRAS mutant colorectal cancer

KRAS mutation is the most frequent oncogenic aberration in colorectal cancer (CRC). The molecular mechanism and clinical implications of KRAS mutation in CRC remain unclear and show high heterogeneity within these tumors. Methods: We harnessed the multi-omics data (genomic, transcriptomic, proteomic, and phosphoproteomic etc.) of KRAS-mutant CRC tumors and performed unsupervised clustering to identify proteomics-based subgroups and molecular characterization. Results: In-depth analysis of the tumor microenvironment by single-cell transcriptomic revealed the cellular landscape of KRAS-mutant CRC tumors and identified the specific cell subsets with KRAS mutation. Integrated multi-omics analyses separated the KRAS-mutant tumors into two distinct molecular subtypes, termed KRAS-M1 (KM1) and KRAS-M2 (KM2). The two subtypes had a similar distribution of mutated residues in KRAS (G12D/V/C etc.) but were characterized by distinct features in terms of prognosis, genetic alterations, microenvironment dysregulation, biological phenotype, and potential therapeutic approaches. Proteogenomic analyses revealed that the EMT, TGF-β and angiogenesis pathways were enriched in the KM2 subtype and that the KM2 subtype was associated with the mesenchymal phenotype-related CMS4 subtype, which indicated stromal invasion and worse prognosis. The KM1 subtype was characterized predominantly by activation of the cell cycle, E2F and RNA transcription and was associated with the chromosomal instability (CIN)-related ProS-E proteomic subtype, which suggested cyclin-dependent features and better survival outcomes. Moreover, drug sensitivity analyses based on three compound databases revealed subgroup-specific agents for KM1 and KM2 tumors. Conclusions: This study clarifies the molecular heterogeneity of KRAS-mutant CRC and reveals new biological subtypes and therapeutic possibilities for these tumors.

2 by cosine similarity analysis against the Catalogue of Somatic Mutations in Cancer (COSMIC 22 V3). The 96 types single nucleotide variants on CRC genomic landscape were profiled by 23 Lego plot. Moreover, we set up a threshold of 0.4 (-ta and -td parameters of GISTIC2) in 24 filtering the amplified or deleted regions based on the distribution of germline copy number 25 variants. GISTIC2 generated arm level and focal level SCNAs for the cohort with G-Score 26 and FDR-Q value indicating the significance and strength of the identified SCNAs. 27

Single cell analyses of CRC cellular landscape 28
The single cell RNA-seq and metadata were curated form the Samsung Medical Center 29 (SMC cohort) [2] and Katholieke Universiteit Leuven (KUL cohort) [3]. Briefly, gene 30 expression matrix from the CellRanger pipelines was filtered and normalized using the Seurat 31 R package[4] within the following criteria: >1,000 unique molecular identifier (UMI) 32 counts; >200 genes and <6,000 genes; and <20% of mitochondrial gene expression in UMI 33 counts. We further utilized the reciprocal PCA (RPCA) implemented in Seurat (v4.0) to 34 integrate and align the tumor cells from the two scRNA datasets. After the integration, to 35 perform dimension reduction, we first scaled the data by shifting and scaling the expression of 36 each gene so that the mean expression across cells was 0 and the variance across cells was 1. 37 Afterwards we ran PCA analysis and clustered and visualized the aligned dataset using 38 UMAP projection. The major cell types in the datasets have been annotated by comparing the 39 canonical marker genes and the differentially expressed genes (DEGs) for each cluster. 40

Data Imputation 41
3 Missing values of preprocessed RNA-seq, protein and phosphoproteome data were 42 imputed by KNN method using R package "impute". For protein data, genes presented in at 43 least 70% samples were reserved with the imputation parameters: k = 5, rowmax = 0.3, 44 colmax = 0.4. For phosphoproteome data, phosphosites presented in at least 50% samples 45 were reserved according to previous report [5] with the imputation parameters: k = 5, rowmax 46 = 0.3, colmax = 0.4. For RNA-seq data, genes presented in at least 80% samples were used 47 for data analysis as previously described [6]. The imputed data described above was z-scored 48 for each sample 49

Consensus Molecular Clustering 50
We adopted the mRNA, imputed proteomic data and imputed phosphoprotein data to a 51 similarity matrix using R package "CancerSubtypes"[6, 7]by default parameters. The 52 similarity matrix was used as the input of unsupervised clustering performed by R package 53 "ConsensusClusterPlus"[8] with the parameters: maxK = 10, reps = 500, clusterAlg = 54 "spectralAlg." The number of clusters was demonstrated by the stable shape and maximum 55 area of the consensus cumulative distribution function (CDF) curve with the clearest 56 consensus matrix and the rapid decrease of average silhouette from k = 2 to 4. NMI values for 57 three data types were calculated by the function "rankFeaturesByNMI" in the R package 58 "SNFtool" using default parameters. Besides, each filtered data matrix was used for 59 consensus clustering respectively by R package "ConsensusClusterPlus" with the parameters: 60 maxK = 10, reps = 500, clusterAlg = "km", distance = "euclidean". 61

Variable selection analysis and subset prediction 62
4 Variable selection analysis was used for KRAS mutant subset signature selection 63 performed by the R package "VSURF" with random forest algorithm. The number of trees 64 was set to 10,000 as previously described. mRNA data of selected signature from 65 interpretation step was used to build a prediction model for KRAS-WT sub-groups performed 66 by the function "predict" in the R package "VSURF" with the parameters: type = "class", step 67 = "interp." The subsets of KRAS mutant (nonsynosmous mutation) subtype in TCGA and 68 CCLE datasets were acquired by unsupervised clustering based on the mRNAs signature 69 using the imputed RNA-seq data. values indicated increased sensitivity to drugs treatment. Before analysis, we firstly removed 98 the agents with more than 20% of missing data, and the remaining drug missing data were 99 also imputed by 'impute' R package. For the filtration of the potential drug, fold-change 100 differences of the protein expression levels of candidates' drug targets between tumor and 101 normal tissue were calculated in CCRC and CPTAC cohort. A higher fold change value 102 indicated a greater potential of candidate agent for CRC treatment. Thirdly, a comprehensive