High risk glioblastoma cells revealed by machine learning and single cell signaling profiles

Recent developments in machine learning implemented dimensionality reduction and clustering tools to classify the cellular composition of patient-derived tissue in multi-dimensional, single cell studies. Current approaches, however, require prior knowledge of either categorical clinical outcomes or cell type identities. These algorithms are not well suited for application in tumor biology, where clinical outcomes can be continuous and censored and cell identities may be novel and plastic. Risk Assessment Population IDentification (RAPID) is an unsupervised, machine learning algorithm that identifies single cell phenotypes and assesses clinical risk stratification as a continuous variable. Single cell mass cytometry evaluated 34 different phospho-proteins, transcription factors, and cell identity proteins in tumor tissue resected from patients bearing IDH wild-type glioblastomas. RAPID identified and characterized multiple biologically distinct tumor cell subsets that independently and continuously stratified patient outcome. RAPID is broadly applicable for single cell studies where atypical cancer and immune cells may drive disease biology and treatment responses.


Introduction
Malignant cells in human tumors are remarkably diverse in their functional cell identities and this intra-tumor cellular heterogeneity is closely linked to patient outcomes 1, 2 . However, bench and computational tools that have driven our understanding of altered phospho-protein signaling networks in cancer have historically been under-used in solid tumor research due to a lack of technology and samples. In blood cancers, single cell profiles of signaling networks have revealed cancer cells present at diagnosis whose abundance is closely linked to subsequent clinical outcomes, including patient survival [3][4][5][6][7] . This single cell snapshot proteomics approach focuses on a select set of key proteins that govern cell functional identity and can robustly measure targets, such as phospho-proteins, that are inaccessible to sequencing modalities. Suspension mass cytometry is a valuable platform for solid tumor analysis, as it is relatively low cost, well-powered to detect rare and novel cell types, and able to sensitively measure phosphorylated transcription factors and other mechanistic determinants of cancer cell identity 8,9 .
Quantitative analysis of single cell cytometry data has recently moved from an era of human-driven identification of cell types using guide markers (expert gating) and embraced machine learning tools that can automatically reveal and characterize novel and abnormal cells [10][11][12][13] . In building an automated cytometry workflow, algorithm developers need to decide whether users will supervise the discovery of cell subsets using clinical knowledge. CITRUS is an automated cell subset discovery tool that uses prior knowledge of categorical labels, such as "disease" or "healthy", to identify cell clusters associated with those labels 14 . CellCNN is another supervised analysis tool that requires prospective assignment of samples to categories and uses convolutional neural networks to learn a filter that predicts whether new cells match one of the groups 15 . Other cell subset discovery approaches do not supervise the analysis with knowledge of clinical outcomes but do use prior biological knowledge to identify cell subpopulations and then test whether differential outcomes are associated these cell subsets 5,6,16 . In mass cytometry analysis, another common approach is to use tools for automated, unsupervised cell discovery and characterization, including SPADE 17 , t-SNE 18 , UMAP 19 , FlowSOM 20 , and Marker Enrichment Modeling (MEM 21 ). These tools help explore the structure of multidimensional data and reveal subpopulations that can be overlooked in expert manual analysis 10,12,13,22 . However, while it is possible to quickly review enriched features of the groups 21 , it would also be powerful to test whether groups with similar phenotypes share an association with differential risk of death 23 . RAPID, a fully unsupervised workflow presented here, implements t-SNE, FlowSOM, and MEM analysis of single cell mass cytometry data to reveal risk stratifying cell populations 23 .
Mass cytometry has recently been developed for human solid tumors 11,16,24,25 , including glioblastoma, the most common primary malignant brain tumor in adults 26 . The median survival of glioblastoma patients after diagnosis has remained approximately 12-15 months for over a decade 27,28 . These highly aggressive tumors are composed of both tumor and stromal cells, which harbor diverse genomic, transcriptomic, and proteomic expression profiles reflecting abnormal neural lineages 24,29,30 . Previous studies in glioblastomas have either measured signaling states in bulk primary tumors [31][32][33] or characterized genomic and transcriptomic profiles in a modicum of single cells (<1000) 29,30,34,35 . These approaches, however, have not yet improved clinical practice or outcome for patients with glioblastoma. Although both inter-and intra-tumoral dysregulation of signaling in glioblastoma, particularly the disruption of receptor tyrosine kinase (RTK) homeostasis, is hypothesized to drive disease aggressiveness, very little is known about the activation states of signaling effector proteins in glioblastoma and how these signaling changes may be associated with cell subpopulations and patient clinical outcomes 36,37 . Novel, molecularly-driven criteria may give valuable insights into the biology of tumor progression and identify patients more likely to benefit from targeted therapeutics in development for this devastating malignancy.
Here, two new technologies were created in parallel: 1) a tailored set of 34 antibodies for single cell mass cytometry of glioblastoma focused on phospho-protein signaling effectors, stem cell proteins, and transcription factors critical to neural development, and 2) an unsupervised cell discovery workflow termed RAPID (Risk Assessment Population Identification). These technologies were combined to reveal and characterize novel populations of risk stratifying glioblastoma cells.

Comprehensive patient-specific analysis reveals glioblastoma cells with potentiated signaling and aberrant lineage protein co-expression
Tissue samples were collected from 28 patients with IDH wild-type glioblastoma after primary surgical resection (Supplementary Table 1). As of February 2019, the median progression free survival (PFS) and overall survival (OS) after diagnosis were 6.3 and 13 months, respectively, similar to those observed in larger populations of patients undergoing standard therapy 27 . Resected tissues were immediately dissociated into single-cell suspensions as previously reported 38 . Cells were stained with a customized antibody panel for mass cytometry designed to capture the expression of known cell surface proteins, intracellular proteins, and phospho-signaling events (Figure 1, 2, Supplementary Table 2, Supplemental Information) that are critical for gliomagenesis and pathogenesis 32,39,40,30,34,35 . Collectively, the antigens included in this panel positively identified >99% of viable single cells within any given tumor sample.
The first round of data analysis was patient-specific and used to computationally isolate the glioblastoma cells from the stromal cells. A patient-specific t-SNE view of single-cell protein expression was generated for all tumor and stromal cells from each patient's tumor (Supplemental Information). These patient-specific t-SNE maps were generated using 26 of the 34 measured markers 18 (Supplemental Table 2). Patient specific t-SNE maps revealed nonglioblastoma populations of immune (CD45 + ) and endothelial (CD45 -CD31 + ) cells, consistent with prior mass cytometry studies of gliomas 11,21,24 . Non-immune, non-endothelial cells were computationally isolated from each individual patient prior to subsequent downstream analysis of tumor-intrinsic phenotypic parameters (Figure 1, 2). These remaining CD45 -CD31cells were labeled as glioblastoma cells. Ion counts for mass-tagged antibody reporters spanned from 0 to nearly 10,000, representing protein expression from a sensitivity limit of around 400 molecules per cell to 1 x 10^7 molecules per cell 8 . This sensitivity and the ability to capture at least 4,500 live glioblastoma cells from every patient provided excellent statistical power to observe rare cell types representing as little as 1% of the cancer cells (which might themselves be as little as 25% of the tumor cells).
Within a single patient's tumor, plots of cell density across the t-SNE embedding revealed 5 or more phenotypically distinct subpopulations of glioblastoma cells (Figure 2, Supplemental Information). These intra-tumoral subsets of glioblastoma cells were distinguished by differences in expression of core neural identity proteins and by aberrant co-expression of neural lineage and stem cell proteins. For example, in tumor LC26, abnormal glioblastoma cell subsets were apparent and distinguished by lineage aberrancy. Common abnormal co-expression phenotypes in glioblastoma cells included expression of astrocytic S100B and stem-like CD133 or co-expression of markers associated with different molecular subtypes of glioblastoma, such as mesenchymal (CD44) and classical (EGFR) (Figure 2) 32 .

RAPID identifies prognostic cell subsets in glioblastoma disease
The second round of data analysis used an equal number of each patient's glioblastoma cells to create a single, common t-SNE map of glioblastoma cell phenotypes across all patients (N = 131,880 cells; 4,710 cells x 28 patients). Prior to creating this common map, mass cytometry standardization beads were used to remove batch effects and to set the variance stabilizing arcsinh scale transformation for each channel following field-standard protocols 11,38,41 . This common t-SNE map was generated using 24 of 34 measured markers (Supplementary Table 2) and was used for automated analysis of risk stratifying cell subsets.
Once a common, low-dimensional view of all patients was established, the RAPID algorithm used statistical analysis of cell density, feature variance, and population abundance to automatically set all computational analysis parameters. Critically, RAPID was designed to set analysis parameters independent of clinical outcomes. To identify an appropriate number of stable clusters containing phenotypically homogenous cells, RAPID used iteratively executed unsupervised self-organizing maps from FlowSOM 42 . RAPID repeatedly tested a range of clusters  and identified the number of clusters that minimized intra-cluster variance for each feature while maintaining cluster stability (see Supplemental Methods). Within the glioblastoma patients examined here, RAPID identified 43 phenotypically distinct glioblastoma cell subsets (Figure 1). RAPID assigned patients to high or low abundance for a given cluster based on a cut-point, set as the interquartile range of the population abundance across the samples (see Supplemental Methods). For example, for cluster 24, the interquartile range was 0.67% to 3.36%, resulting in a cut point of 2.69% (Supplementary Table 4). Those patients with < 2.69% were designated 'low' for cluster 24 while those with > 2.69% were assigned to the 'high' group. Finally, RAPID applied a univariate Cox survival analysis to determine the correlation between the abundance of tumor cells in each cluster and patient survival outcome.
The output of RAPID, when using t-SNE and FlowSOM, is a PDF containing a colorcoded, 2D t-SNE plot depicting all FlowSOM clusters, a 2D t-SNE plot colored by clusters which were significantly associated with patient outcome, and Kaplan-Meier survival estimates of patients for each subset (additional files described in Methods) (Figure 1b).

Distinct glioblastoma cellular phenotypes associate with patient prognostic outcomes
RAPID identified 43 phenotypically distinct glioblastoma cell clusters. Of these, 7 clusters were considered "universal" because cells from every tumor were observed in these clusters (ranging from 0.02% to 28.05%, Supplemental Table 4). The abundance of the 43 clusters varied extensively across patients. Tumors contained a median of 14 clusters at >1% with a range from 5 cell clusters in LC06 to a maximum of 27 cell clusters LC25. Overall, the presence of a greater number of GBM cell clusters at >1% abundance within a tumor was not observed to be associated with differential survival (ρ=0.047, p=0.812).
In contrast, the abundance of 9 glioblastoma cell clusters was closely correlated with overall survival (Fig. 3,4). Clusters were identified here as prognostic by assessing the hazard ratio (HR) of death in patients who were either high or low for the cell cluster. Negative and positive prognostic clusters were colored red or blue in graphs if they were significantly associated (p<0.05) with an HR that was >1 or <1, respectively.
Four Glioblastoma Negative Prognostic (GNP) clusters (red; clusters 33, 34, 37, and 42) and five Glioblastoma Positive Prognostic (GPP) clusters (blue; clusters 2,3,4,5, and 41) were identified (Figure 3, 4). The remaining 34 clusters were not associated with differential prognosis. RAPID was also used to identify glioblastoma cell clusters with differential PFS, as opposed to OS. Assessing PFS can be especially useful for cancers with longer median survival where progression-free survival is the most useful clinical assessment. Of the 43 subsets identified by RAPID, 4 subsets were significantly associated with PFS (subsets 20, 33, and 43 with negative PFS and subset 3 with positive PFS, Supplemental Figure S1).
To determine if the effect of cell subset abundance was continuous and independent of other features known to stratify glioblastoma survival, a multivariate Cox proportional-hazards model analysis was performed incorporating known features and GNP or GPP cell abundance. Known predictors included were age 43,44 , MGMT promoter methylation status 45,46 , and treatment variables including the extent of surgical resection 47,48 , therapy with temozolomide 27 , and radiation 49,50  . Thus, the abundances of GNP and GPP cell subsets were associated with distinct and contrasting patient outcomes (Figure 3, 4, Supplemental Figure S1), and their predictive value was independent of each other and known prognostic factors of patient survival.
Immune cells were intentionally excluded from initial RAPID analyses and subsequent biaxial gating confirmed that the GNP and GPP subsets did not contain any unexpected residual CD45-or CD31-positive cells (99.50% and 98.64% non-immune, non-endothelial cells, respectively, Figure 3, 4). However, infiltrating immune cells can comprise a large proportion of non-cancer cells in glioblastomas and have highly variable overall abundance across patients [51][52][53] . Notably, GPP-high patients' tumors all contained > 9% CD45 + cells, whereas GNP-high patients' tumors were not observed to contain more than 9% CD45 + cells (p < 0.001, Supplementary Figure 4).

Comparable identification of prognostic glioblastoma cells with different subsampling and dimensionality reduction tools
The cells input to RAPID should be equal in number from each patient in order to remove the possibility that a single patient would disproportionately impact the identification of cell clusters and risk assessment. However, this limits the RAPID analysis to a number of cells equal to the smallest observed from any one patient and it creates the possibility that the cells randomly selected from tumors where many cells were measured might not be representative. For the tumors studied here, the number of glioblastoma cells measured ranged from 4,710 to 330,000 cells per patient.
To test whether the cells sampled for RAPID were representative of the tumor from which they were selected, 9 additional t-SNE analyses were created, each with a different sample of 4,710 cells selected at random, with replacement, from each patient. Each of these 9 t-SNE maps were then used in a new RAPID analysis, creating 10 total analyses (the original and 9 new tests). In these analyses, RAPID identified different numbers of optimal clusters ranging from 18 to 48. Of these, a total of 48 clusters from the 9 new runs were considered prognostic. Because the 10 RAPID analyses ran on different subsampling of cells, the f-measure could not be calculated on a cell-by-cell basis. However, the average f-measure based on patient categorization (GNP high, GNP and GPP low, and GPP high) was 0.79 between t-SNE runs.
Thus, to quantify the degree of similarity between the 48 newly identified clusters and the 9 original GNP and GPP clusters, the root-mean-square deviation (RMSD) in the MEM enrichment values was calculated as a way of determining if the phenotype of the newly identified clusters was stable, even when different cells were sampled from the tumors. GNP subsets from subsequent runs were highly similar to the GNP subsets identified by the initial analysis described above and the same was observed for GPP subsets (Supplemental Figure S5; GNP v GNP average RMSD = 92.5, GPP v GPP average RMSD = 88.2, and GNP v GPP average RMSD = 80.8).
To test whether RAPID could use different types of dimensionality reduction values as input parameters, the algorithm was implemented replacing t-SNE with UMAP (Uniform Manifold Approximation and Projection), a tool that emphasizes both local and global data structure 19 . RAPID identified 31 populations using UMAP input; 4 of these were prognostic and significantly associated with OS (1 GNP UMAP and 3 GPP UMAP ) ( Figure 5). GNP UMAP MEM scores reflected the characteristic S100B and SOX2 expression observed in the GNP populations along with an active pro-survival basal signaling status. GPP UMAP subsets were similarly defined by co-expression of EGFR and CD44 and a general lack of the measured phosphorylated signaling effectors ( Figure 5). When the cells identified using t-SNE were overlaid on the UMAP axes, they occupied similar phenotypic space as UMAP-identified clusters, and vice versa (fmeasure for cell assignment to GNP, GPP, or neither = 0.872, Figure 5). Thus, when UMAP was used in the RAPID algorithm, GNP and GPP populations were identified that had comparable phenotypes to those identified previously in t-SNE analyses, confirming that RAPID is not dependent upon a specific dimensionality reduction tool ( Figure 5).

Towards tracking clinically distinct glioblastoma cells in the clinic
After patterns are recognized by a machine learning approach, it can be valuable to determine whether the learned features can be identified using simpler models that can be applied by experts or machines to new datasets. One approach is to create a decision tree using one-or two-dimensional gating 23 , consistent with traditional strategies in immunology and hematopathology. Such gates make the identification of cells computationally less intensive and more pragmatic for wide-spread clinical use and have been previously used in glioblastoma mass cytometry 24 . Therefore, a traditional, lower-dimensional strategy was developed to use a small number of simple gates to capture the GNP and GPP cell populations ( Figure 6).
A population that was consistent with both the phenotype and risk stratification of GNP cells was identifiable using 3 gates and 6 proteins (S100B, EGFR, SOX2, p-STAT5, GFAP, and CD44, Figure 6). Similarly, GPP cells could be identified with 3 gates and 6 proteins (S100B, EGFR, cyclin B1, p-STAT5, GFAP, and CD44, Figure 6). This gating scheme accurately captured both GNP and GPP cells (f-measure of 0.826 for categorizing cells as GNP, GPP, or neither using RAPID cell populations as truth and biaxial gating as test). GNP and GPP cells identified by traditional gating were also mapped back onto the t-SNE axes and largely occupied the same regions of the biaxial t-SNE map as the cells identified by the RAPID algorithm ( Figure  6). The cells identified by traditional gating were quantitatively comparable in their phenotype, as seen by a comparable MEM label for the gating identified GNP and GPP cell subsets ( Figure  6). Thus, a simple gating model of GNP or GPP cell identity successfully recovered GNP and GPP cells by assessing only 7 total key features observed to be enriched following RAPID. This indicated that, once revealed, GNP and GPP cell subsets were phenotypically cohesive in a traditional cell biological sense and could be reliably quantified by traditional approaches compatible with standard clinical flow cytometric profiling.

Discussion
RAPID is a novel automated workflow that identifies cell subpopulations associated with patient outcomes. The RAPID workflow automatically assigned single tumor cells from IDH wild-type glioblastomas into computational clusters based on phenotypic similarity, generated a quantitative phenotypic descriptor of each population, and determined the correlation between the abundance of populations and clinical outcomes. Two significant glioblastoma cell types were identified: Glioblastoma Negative Prognostic (GNP) cells, characterized by high expression of S100B, SOX2, p-STAT3, and p-STAT5, were associated with a decreased overall survival, while Glioblastoma Positive Prognostic (GPP) cells, characterized by high expression of EGFR and CD44, were associated with longer overall survival. Critically, therapeutically targetable signaling events were identified as a signature of prognostic cell populations identified by RAPID, suggesting potentially novel therapeutic strategies for patients with these characteristics.
High-dimensional cytometry was critical to revealing novel prognostic glioblastoma cells in two ways. First, assessment of a large number of cells per tumor enabled the use of an unsupervised approach in the identification of rare and novel cell subsets across patients. Second, per-cell quantification of phosphorylated signaling effector proteins revealed potential mechanisms of tumor cell regulation that are not inherently apparent in bulk tumor data. Supervised analysis of single cell data has previously uncovered signaling events tied to patient survival in hematologic malignancies 3,4,7 . To our knowledge, our findings are the first to reveal a similar connection in a solid malignancy using an automated, unsupervised approach, reinforcing the importance of cell signaling in multiple human malignancies Although other workflows and algorithms have been developed to identify cell populations of interest in cancer samples (CITRUS 14 , DDPR 6 , Phenograph 4 , Cytofast 54 ), these largely require a level of prior knowledge which may not always be available, especially for solid tumors. For example, Levine et al. used Phenograph 4 and an understanding of the coupling of surface markers and signaling status in healthy bone marrow to classify negative prognostic leukemia cells. Similarly, a map of the healthy developmental lineage was instrumental in using DDPR to identify features of negative prognostic leukemia cells 6 . Supervised methods, including CITRUS and Cytofast, require that samples to be grouped at the beginning of the analysis before generating an overview of cell cluster phenotypes in cytometry data 54 . These methods, however, require that the data have already been clustered and that each sample be prospectively assigned to a group, whereas RAPID enables analysis with continuous, ungrouped data. In studies of diseased human tissue, it is difficult to anticipate the number of expected unique phenotypic subsets and once identified, these subsets can be challenging to manually annotate as they may reasonably be assigned to one or more cell-type categories (this study and 29 ). It is particularly valuable to be independent of prior knowledge of expected cell clusters in studies of diseases like primary glial tumors, where healthy samples are infrequently obtained and the developmental lineage is largely quiescent. RAPID is designed to be free from supervision in the identification of the number of clusters and also in the assessment of cluster abundance in tumors. RAPID also employs MEM to automatically provide a quantitative description of the features which are most selectively enriched on each cell cluster. Furthermore, RAPID is modular, such that different dimensionality reduction tools (t-SNE, UMAP) can be used with different clustering algorithms (dbSCAN, FlowSOM) within the workflow. A benefit of RAPID is the streamlined, unsupervised application of these tools such that a user can input raw data files (for example, FCS files from cytometry platforms) or equivalent data types in conjunction with patient survival data, and RAPID will output quantitatively described cell clusters and their significance with respect to patient outcome. For the data set used in this study (131,880 cells), RAPID ran in 15 minutes from start to finish.
In this study, RAPID analysis of glioblastoma patient samples demonstrated a link between altered signaling and possible abnormal lineage programs in glioblastoma 55 . The GNP signature was defined by abnormal neural development features and simultaneous high basal phosphorylation of multiple signaling effectors downstream of RTKs (Fig. 3). STAT5 signaling, a common feature of all GNP cell subsets, is required in development of many tissues to block apoptosis and drive cell cycle entry 2 . For example, p-STAT5 is an essential feature of negative prognostic acute myeloid leukemia signaling profiles 3,4 . Here, RAPID identified the connection between p-STAT5 and glioblastoma outcome, previously unidentified in primary patient samples. STAT3 and S6 phosphorylation, identified here in GNP cells, agreed with prior studies indicating the importance of p-STAT3 in T cell suppression 56 and mTOR-dependent signaling in tumor formation 57,58 . These phosphorylation signaling events in GNP cells should be explored as a potential therapeutic target and a biomarker of therapy response.
In contrast, the time-to-progression-prolonging GPP signature was defined by EGFR and CD44 co-enrichment, diminished evidence of proliferation, and specific lack of STAT5 phosphorylation. Previous molecular subtyping predicts EGFR expression in the classical subset of tumors and CD44 expression in mesenchymal tumors 32 . As these studies were based on bulk tumor data, cells co-expressing EGFR and CD44 (classified as GPP cells in this study) may have been missed; single glioma cells have been shown to co-express pro-tumor receptors 29,36 . Genetically, glioblastomas commonly have amplified EGFR 31,32 ; however, we noted examples of tumors with robust EGFR amplification that contained both high and low percentages of GPP cells (data not shown), highlighting the importance of measuring protein expression in addition to genomic content. Although EGFR signaling through mTOR and EGFRvIII has been linked with increased p-S6 and p-STAT3/5 respectively, we did not observe these associations in the GNP or GPP subsets 59,60 . Instead, these cells showed enrichment of p-NFĸB (Fig. 3), a transcription factor that activates pro-apoptotic programs 61,62 .
Recent studies have revealed significant variation in immune cell abundance and relative proportions of immune cell subsets across glioblastoma 63,64 . Here, unfavorable GNP cells were associated with diminished tumor-infiltrating immune cells and GPP cells were associated with higher proportions of immune cells in the tumor microenvironment. These results invite the question of whether an altered immune microenvironment precedes development of an aggressive glioblastoma or whether more aggressive tumors suppress anti-tumor immunity. These findings argue that immunotherapy is likely to be more efficacious in tumors containing GPP cells, but that additional research is needed to understand whether GNP cells directly suppress microglia or immigrant leukocytes.
The GNP and GPP subsets correlated with survival independent of the effects of other widely accepted prognostic factors (age 43,44 , MGMT promoter methylation status 45,46 , and treatment including extent of surgical resection 47,48 , therapy with temozolomide 27 , and radiation 49,50 ). These cells were identified in pre-therapy, untreated patient samples, suggesting that these phenotypes are linked to biological mechanisms of therapy response or tumor detection by the immune system. Future studies of recurrent glioma samples would illuminate the persistence of these populations. If GNP subsets have the capacity to evade therapy and retain their active proliferation properties, recurrent tumors would be expected to contain higher proportions of GNP cells and have a more uniform phenotype. Although much has been made of loss or gain of genetic aberrations post-temozolomide and radiation therapies, little is known about signaling in recurrent tumor cells and thus it is unclear if clonal evolution and/or a shift in activated phosphoproteins is necessary for tumor cell survival and repopulation. Other factors have recently been shown to correlate with patient outcomes including the location of the tumor with respect to the largest neural stem cell niche in the adult brain, the ventricular-subventricular zone (V-SVZ) 65 .
In the future, one fascinating question will be to determine whether V-SVZ contacting tumors, which correlate with worse outcomes, contain more cells with a GNP-like phenotype and fewer GPP-like cells.
Critically, these discoveries using RAPID led to a development of a lower-dimensional pipeline which can be immediately adopted for clinical stratification. Moreover, the combination of single-cell snapshot proteomics and the automated RAPID algorithm can be immediately applied to the discovery of critical onco-signaling events in other types of intractable human malignancies, providing a needed complement to genomic classification.

Patient samples
Surgical resection specimens of 28 IDH-wildtype glioblastomas collected at Vanderbilt University Medical Center between 2014 and 2016 were processed into single cell suspensions following an established protocol 38 . Only samples that were confirmed to be IDH-wildtype glioblastomas by standard pathological diagnosis were used. All samples were collected with patient informed consent in compliance with the Vanderbilt Institutional Review Board (IRBs #030372, #131870, #181970), and in accordance with the declaration of Helsinki.

Patient characteristics and collection of clinical data
All patients were adults ( 18 years of age) at the time of their maximal safe surgical resection of their cerebral (supratentorial) glioblastomas. Extent of surgical resection was independently classified as either gross total or subtotal resection by a neurosurgeon and a neuroradiologist. Gross total resection was defined as agreement by both viewers of no significant residual tumor enhancement on patients' gadolinium-enhanced magnetic resonance imaging (MRI) of the brain obtained within 24 hours after surgery. All patients were considered for treatment with postoperative chemotherapy (temozolomide) and radiation according to the standard of care 27 , after determination of MGMT promoter methylation status by pyrosequencing (Cancer Genetics, Inc., Los Angeles, CA, USA). Multiplex polymerase chain reaction (PCR) was used to determine IDH1/2 mutational status. Patients' postoperative course was followed until February 2019, noting time to first, definitive radiographic progression or recurrence of glioblastoma as agreed upon by the treating neuro-oncologist and neuroradiologist, and the time to patients' death. All deaths were deemed to be due to the natural course of patients' glioblastoma. Median overall survival of the analyzed 28 patients with IDH wild-type glioblastoma was 388.5 days (13 months) and median PFS was 187.5 days (6.3 months), which is typical for the disease 26,27 .

Mass cytometry analysis
Cells derived from patient samples were prepared as previously described 38 . A multi-step staining protocol was used, which included 1) live surface stain, 2) 0.02% saponin permeabilization intracellular stain, and 3) intracellular stain after permeabilization with ice-cold methanol (Supplementary Table 2). After staining, cells were resuspended in deionized water containing standard normalization beads (Fluidigm) 53 , and collected on a CyTOF 1.0 instrument located in the Mass Cytometry Core Facility at Vanderbilt University. Rhodium viability stain and cleaved caspase-3 antibody were included in staining to exclude non-viable and apoptotic cells, respectively. Detection of total histone H3 was used to identify intact, nucleated cells 24 . A 32-dimensional mass cytometry antibody panel was used to analyze over 2 million viable cells from 28 tumors (ranging from 4,860 to 336,284 cells per tumor). Data were normalized with MATLAB-based normalization software 53 , and were arcsinh transformed (cofactor 5), prior to analysis using the Cytobank platform 66 .

Survival and statistical analysis
Time from surgical resection to death (overall survival, OS) and time from surgical resection to the initial radiographic recurrence or death before radiographic assessment (progression free survival, PFS) were plotted and analyzed in R. Survival time points were censored if, at last follow up, the patient was known to be alive or had not had radiographic progression. Differences in the survival curves were compared using the Cox univariate regression model, reporting a hazard ratio (HR) between the survival curves.
A Cox proportional-hazards regression model was created to assess the influence of GNP and GPP cells on OS and PFS as continuous variables while accounting for other factors known to affect survival, including age at diagnosis, MGMT promoter methylation status, extent of surgical resection (EOR), treatment with temozolomide (TMZ), and radiation (XRT). The hazard model can be written as: where ℎ( ) ℎ 0 ( ) represents the ratio of hazard comparing the risk of death at time t to the baseline hazard (obtained when all variables are equal to zero) and represents the hazard ratio of variable . The data were fit using R software, version 3.5 (R foundation for Statistical Computing, Vienna, Austria). The proportional-hazards assumption was tested in all multivariate models and supported by a non-significant relationship between Schoenfeld residuals and time for each covariate included in the model (p > 0.38; degree of freedom = 1) and the overall model (p = 0.96; degrees of freedom = 6 and 7). Statistical significance α was set at 0.05 for all statistical analyses, one-or two-tailed noted in figure legends.
An f-measure was used to quantify the level of agreement between classifications of patients or cells between alternative analysis strategies as wells as multiple RAPID iterations. ). An f-measure of 1 indicates perfect agreement between two different strategies or iterations as opposed to an fmeasure of 0 which would mean no agreement between classifications of patients or cells from two strategies or iterations. Patients could be classified as GNP high, GNP and GPP low, or GPP high, while cells were classified as GNP, GPP, or neither. To calculate the f-measure of patient categorization, the classification of the 28 patients into the three prognostic groups from the t-SNE implementation of RAPID was used as the reference point from which to compare patient classification resulting from the UMAP implementation of RAPID or the biaxial gating strategy. Similarly, the stability of the RAPID workflow in assigning cells to GNP, GPP, or nonsignificant clusters was tested by using the t-SNE implementation of RAPID (FlowSOM seed 38) as the reference from which to compare 100 iterations of RAPID (random FlowSOM seed per iteration). Calculation of the f-measure was implemented using R software, version 3.5.

Data availability
Files will be made available in FlowRepository upon publication following peer review.