Phenotypic plasticity underlies local invasion and distant metastasis in colon cancer

Phenotypic plasticity represents the most relevant hallmark of the carcinoma cell, as it endows the cell with the capacity to transiently alter its morphological and functional features while en route to the metastatic site. However, the study of phenotypic plasticity is hindered by the rarity of these events within primary lesions and by the lack of experimental models. Here, we identified a subpopulation of phenotypically plastic colon cancer cells: EpCAMlo cells are motile, invasive, chemo-resistant, and highly metastatic. Bulk and single-cell RNAseq analysis of EpCAMlo cells indicated (1) enhanced Wnt/β-catenin signaling, (2) a broad spectrum of degrees of epithelial-to-mesenchymal transition (EMT) activation, including hybrid E/M states (partial EMT, pEMT) with highly plastic features, and (3) high correlation with the CMS4 subtype, which accounts for colon cancer cases with poor prognosis and a pronounced stromal component. Of note, a signature of genes specifically expressed in EpCAMlo cancer cells is highly predictive of overall survival in tumors other than CMS4, thus highlighting the relevance of quasi-mesenchymal tumor cells across the spectrum of colon cancers. Enhanced Wnt signaling and the downstream EMT activation represent key events in eliciting phenotypic plasticity along the invasive front of primary colon carcinomas. Distinct sets of epithelial and mesenchymal genes define transcriptional trajectories through which state transitions arise. pEMT cells, often earmarked by the extracellular matrix glycoprotein SPARC together with nuclear ZEB1 and β-catenin along the invasive front of primary colon carcinomas, are predicted to represent the origin of these (de)differentiation routes through biologically distinct cellular states and to underlie the phenotypic plasticity of colon cancer cells.


Sample-size estimation
• You should state whether an appropriate sample size was computed when the study was being designed
• You should state the statistical method of sample size computation and any required assumptions
• If no explicit power analysis was used, you should describe how you decided what sample (replicate) size (number) to use
Please outline where this information can be found within the submission (e.g., sections or figure legends), or explain why this information doesn't apply to your submission:
The number of replicates in the mouse experiments was decided on the basis of past experience and practical considerations. Details can be found in the figure legends. For bulk RNA sequencing, 4 replicates were processed for each subpopulation. For single-cell RNA sequencing, 1 biological replicate per subpopulation was processed.

Replicates
• You should report how often each experiment was performed
• You should include a definition of biological versus technical replication
• The data obtained should be provided and sufficient information should be provided to indicate the number of independent biological and/or technical replicates
• If you encountered any outliers, you should describe how these were handled
• Criteria for exclusion/inclusion of data should be clearly stated
• High-throughput sequence data should be uploaded before submission, with a private link for reviewers provided (these are available from both GEO and ArrayExpress)
Please outline where this information can be found within the submission (e.g., sections or figure legends), or explain why this information doesn't apply to your submission:
In vitro studies on the cell lines (e.g., qPCRs, chemo assays, MTT) were performed at least 3 times, each time starting from a different frozen vial (biological replicates). For each biological replicate, two or three technical replicates were generated, depending on the assay, by dividing the same batch of cells into multiple samples. Details can be found in the figure legends.
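As an illustration of the biological versus technical replicate distinction above, the following minimal Python sketch averages technical replicates within each biological replicate and then summarizes across biological replicates as mean ± SD. This aggregation strategy, the function name `summarize`, and all numerical values are assumptions made for illustration; the submission does not state how technical replicates were combined.

```python
from statistics import mean, stdev

# Hypothetical readings (made-up values): one inner list per biological
# replicate (i.e., per frozen vial), holding its 2-3 technical replicates.
readings = [
    [0.82, 0.80, 0.84],
    [0.91, 0.89],
    [0.76, 0.78, 0.77],
]


def summarize(biological_replicates):
    """Average technical replicates first, then summarize across biological ones.

    Returns (mean, sample SD, number of biological replicates).
    """
    per_biological = [mean(technical) for technical in biological_replicates]
    return mean(per_biological), stdev(per_biological), len(per_biological)


center, spread, n = summarize(readings)
print(f"mean ± SD = {center:.3f} ± {spread:.3f} (n = {n} biological replicates)")
```

Collapsing technical replicates first keeps N equal to the number of independent biological replicates, so that measurements from the same batch of cells are not counted as independent observations.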

Statistical reporting
• Statistical analysis methods should be described and justified
• Raw data should be presented in figures whenever informative to do so (typically when N per group is less than 10)
• For each experiment, you should identify the statistical tests used, exact values of N, definitions of center, methods of multiple test correction, and dispersion and precision measures (e.g., mean, median, SD, SEM, confidence intervals); and, for the major substantive results, a measure of effect size (e.g., Pearson's r, Cohen's d)
• Report exact p-values wherever possible alongside the summary statistics and 95% confidence intervals. These should be reported for all key questions and not only when the p-value is less than 0.05.
Please outline where this information can be found within the submission (e.g., sections or figure legends), or explain why this information doesn't apply to your submission: (For large datasets, or papers with a very large number of statistical tests, you may upload a single table file with tests, Ns, etc., with reference to sections in the manuscript.)
For each experiment, data are shown as mean ± SD. IBM SPSS Statistics, R and Python were used for data analysis. The Mann-Whitney U test was used to analyze the difference between two groups of quantitative variables; the significance level was set at 5%. Information on the analysis of the sequencing data can be found in the Materials and Methods.

Group allocation
• Indicate how samples were allocated into experimental groups (in the case of clinical studies, please specify allocation to treatment method); if randomization was used, please also state if restricted randomization was applied
• Indicate if masking was used during group allocation, data collection and/or data analysis
Please outline where this information can be found within the submission (e.g., sections or figure legends), or explain why this information doesn't apply to your submission:
For the mouse experiments, mice of similar age (6-8 weeks) were randomly assigned to experimental groups, keeping equal proportions of both sexes.

Additional data files ("source data")
• We encourage you to upload relevant additional data files, such as numerical data that are represented as a graph in a figure, or as a summary table
• Where provided, these should be in the most useful format, and they can be uploaded as "Source data" files linked to a main figure or table
• Include model definition files including the full list of parameters used
• Include code used for data analysis (e.g., R, MatLab)
• Avoid stating that data files are "available upon request"
Please indicate the figures or tables for which source data files have been provided:
Figure 1: source data 1, source data 2, source data 3; Figure 2: source data 1; Figure 3: source data 1; Figure 5: source data 1.
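The two-group comparison described under Statistical reporting (mean ± SD, Mann-Whitney U test, 5% significance level) can be illustrated with the following minimal Python sketch. It is not the authors' analysis code: the helper names `midranks` and `mann_whitney_exact`, the exact-enumeration approach, and the data values are all assumptions chosen to keep the example self-contained and suited to small groups.

```python
from itertools import combinations
from statistics import mean, stdev


def midranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_exact(x, y):
    """Two-sided exact Mann-Whitney U test for small groups.

    Enumerates every way of assigning the pooled ranks to the first group and
    counts assignments whose U is at least as far from n1*n2/2 as observed.
    Returns (U for the first group, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    offset = n1 * (n1 + 1) / 2
    center = n1 * n2 / 2
    u_observed = sum(ranks[:n1]) - offset
    deviation = abs(u_observed - center)
    total = extreme = 0
    for chosen in combinations(range(n1 + n2), n1):
        u = sum(ranks[i] for i in chosen) - offset
        total += 1
        if abs(u - center) >= deviation - 1e-12:
            extreme += 1
    return u_observed, extreme / total


ALPHA = 0.05  # 5% significance level, as stated in the reporting form

# Hypothetical measurements (made-up values, not taken from the study).
control = [1.0, 2.0, 3.0, 4.0]
treated = [5.0, 6.0, 7.0, 8.0]

u, p = mann_whitney_exact(control, treated)
print(f"control: {mean(control):.2f} ± {stdev(control):.2f} (mean ± SD)")
print(f"treated: {mean(treated):.2f} ± {stdev(treated):.2f} (mean ± SD)")
print(f"U = {u}, exact two-sided p = {p:.4f}, significant at 5%: {p < ALPHA}")
```

Exact enumeration is practical only for small groups, because the number of assignments grows combinatorially. For larger samples, an established routine such as `scipy.stats.mannwhitneyu` would be the usual choice.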