An integrated single-cell RNA-seq map of human neuroblastoma tumors and preclinical models uncovers divergent mesenchymal-like gene expression programs

Background Neuroblastoma is a common pediatric cancer, where preclinical studies suggest that a mesenchymal-like gene expression program contributes to chemotherapy resistance. However, clinical outcomes remain poor, implying we need a better understanding of the relationship between patient tumor heterogeneity and preclinical models. Results Here, we generate single-cell RNA-seq maps of neuroblastoma cell lines, patient-derived xenograft models (PDX), and a genetically engineered mouse model (GEMM). We develop an unsupervised machine learning approach (“automatic consensus nonnegative matrix factorization” (acNMF)) to compare the gene expression programs found in preclinical models to a large cohort of patient tumors. We confirm a weakly expressed, mesenchymal-like program in otherwise adrenergic cancer cells in some pre-treated high-risk patient tumors, but this appears distinct from the presumptive drug-resistance mesenchymal programs evident in cell lines. Surprisingly, however, this weak-mesenchymal-like program is maintained in PDX and could be chemotherapy-induced in our GEMM after only 24 h, suggesting an uncharacterized therapy-escape mechanism. Conclusions Collectively, our findings improve the understanding of how neuroblastoma patient tumor heterogeneity is reflected in preclinical models, provides a comprehensive integrated resource, and a generalizable set of computational methodologies for the joint analysis of clinical and pre-clinical single-cell RNA-seq datasets. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-024-03309-4.

(a-d) Inflection point determination in acNMF curves as implemented by the elbow R package (v0.0.0.9000).cNMF was run across a range of ranks in each independent split of data.Multiple Jaccard lengths were tested as a metric for comparison of the splits, and the resulting network-derived communities (synonymous with reproducible GEPs) were plotted across all ranks.For each Jaccard length, the inflection point of the resulting concave curve is determined by first connecting the minimum and maximum data points (red line) and calculating residuals to the curve (green arrows) (a).After fixing these differences to the x axis, the maximum value of the new profile is chosen (b).This value represents the x value of the inflection point (c), and consequently determines the rank utilized in downstream analyses.The Jaccard length that most reliably stabilizes the solution after the inflection point is represented by fitting a linear regression model of the values on the right side of the inflection point, and choosing the value that minimizes the slope (d).
(e-r) UMAP plots showing all recovered GEPs by acNMF in the simulated single-cell RNA-seq dataset.
Dots are colored by the activity score of the recovered gene expression program.All cell identity programs (panels e-q) and the activity program (panel r) were successfully recovered.
(s) Network plot of community detection of simulated data at k=14, which results in only 13 reproducible programs (of 14 ground truth programs).
(t) Representative acNMF line plots for each neuroblastoma single-cell RNA-seq dataset highlight the applicability of the method to real data, where in each case the number of replicated gene expression programs reasonably stabilizes beyond a dataset-specific inflection point.
(u) Results of benchmarking acNMF to 3 different published models.A simulated dataset of 15,000 cells of 13 cell types and 1 activity program was used.The adjusted rand index (ARI) was calculated by the ARI function in the R library aricode.
(v) Jaccard similarity between scVI latent representations and acNMF GEPs applied to the GOSH dataset.
acNMF identified a greater number of gene expression programs with enhanced detection of cell types of the tumor microenvironment (e.g.endothelial cells, which exist in all tumors).(g) Like (f) but for the Adrenergic I (pre-neuronal-like) program marker genes highlighted in Fig. 3h (NEFL, NNAT, RPRM, BASP1, NEFM).
(h) Like (f), but for the NB-MYC program marker genes highlighted in Fig. S3d (EEF1A2, EEF1B2, EEF2, EEF1A1, E1F3L).(a) A graph of GEPs identified by acNMF across 8 neuroblastoma single-cell RNA-seq datasets, from humans, PDXs, GEMM, and cell lines.Nodes represent individual GEPs identified through the acNMF method, and edges connect nodes from different datasets if the nodes are statistically matched using a Jaccard similarly test.Node size is represented by the degree of each node.The node connections are identical to Fig. 4a, but here, the nodes have been colored by dataset, rather than by community.Importantly, acNMF identified reproducible GEPs that are not represented in the network because they were not identified in another dataset.This can be partially attributed The picture can't be display ed.
to patient specific signatures that highlight the ability of acNMF to uncover the distinct heterogeneity of neuroblastoma.Of note, an interactive version of this graph-based representation of these datasets can be accessed in our web-based platform.Each node is clickable and opens a detailed report and analysis of each gene expression program.Furthermore, the unique GEPs not represented in the network can be found and explored using this browser.(a) RNA velocity was estimated in each mouse using scVelo.Cells are colored by one of two gene sets -those genes found to be significantly repressed after cisplatin treatment (green), and those genes induced by the chemotherapeutic (magenta).Whereas the velocities projected on the 2D embedding (UMAP) in the drug treated mice are all directed towards the cells most highly enriched for the induced gene set, the origin of these velocities arises from those cells most enriched in those genes that are repressed due to the treatment.Importantly, this was observed in every drug-treated mouse tumor.As the duration of this treatment was short (only 24 hours), the observation of these multiple cell states co-existing in these tumors likely represents an initiation of transcriptional changes with groups of cells existing in various stages of the response.Thus, the RNA velocity trajectory embeddings recover an accurate representation of the plausible cell state transitions.and is colored by drug treatment.
Figure S1.acNMF performance on simulated and real data.

Figure S2 .
Figure S2.Inferred CNV profiles for gene expression programs enriched for the Van Groningen (a-f) BioCircos plots representing copy number variation (CNV) profiles inferred from GOSH single-cell RNA-seq data.Each segment on the middle track represents chromosomes 1-22 (ordered clockwise).The outermost track represents Hidden Markov Model (HMM) CNV calls from InferCNV for the highconfidence cancer cells in each of the 4 neuroblastoma samples in GOSH.The innermost track shows the inferred CNV profile in cells highly expressing the program (purple line; calculated from the top 50 cells, or cells with an activity score h > 50).The orange line shows the inferred CNV profile for 50 randomly chosen reference normal immune cells.Deviation of the purple line from the orange line is indicative of copy number changes.Panels (a-c) show the 3 programs with statistical enrichment of the Van Groningen adrenergic (ADRN) signature.Copy number changes are predicted to be abundant in these cells and their expression features are consistent with neuroblastoma cancer cells.Panels (d-f) show the 3 programs with strongest statistical enrichment of the Van Groningen mesenchymal (MES) signature.Inferred copy number profiles in these cells are largely consistent with the reference normal, and these cells have expression features of fibroblast subpopulations.NOTE: Our web-based platform allows interactive versions of these plots to be accessed for every gene expression program, across all datasets in this manuscript.

Figure S3 .
Figure S3.Cells expressing multiple adrenergic and mesenchymal-like gene expression programs

Figure S4 .
Figure S4.Novel neuroblastoma gene expression programs are reproducible across human

Figure S7 .
Figure S7.Analysis of RNA-ISH multispectral data with HALO

Figure S8 .
Figure S8.Rare mesenchymal-like signatures in neuroblastoma cells in human and PDX models.
(a) BioCircos plot (similar to panels in Figure S5 (a-f)), but for Program 12 in the Jansky dataset, which is exclusive to the NB14 sample and weakly expresses mesenchymal-like features (while remaining predominantly adrenergic).(b) UMAP plot showing the activity of Program 12 in the Jansky dataset.
UMAP plot showing the activity of Program 19 in the PDX dataset, (specific to SJNBL063820; which also weakly expresses mesenchymal-like features).(d) Table showing the gene weights (loadings) for the top-ranked Van Groningen mesenchymal signature genes on Program 19 from the PDX dataset (i.e.these correspond to the highly differentially expressed genes in cells highly expressing Program 19).(e) Boxplots showing the mean expression (y axis) of the genes from the classical Van Groningen mesenchymal (MES; orange box) or adrenergic (ADRM; purple box) signature, in cells that are highly expressing (h > 50) the acNMF-identified TWISTT1 expressing Program 33 in the Dong dataset, which is expressed in cancer cells from the T230 sample.The yellow box shows the mean expression of all other genes in these cells.
(f) Like (t) but for Program 11, which is expressed in cancer cells of the T200 sample in Dong.NOTE: Our web-based platform allows interactive versions of these plots, along with a very detailed report, to be accessed for every gene expression program, across all datasets in this manuscript.(g)Full sized immunohistochemical staining images for the mesenchymal-marker gene COL1A1 in the three PDX samples from Figure5j.

Figure S9 .
Figure S9.Mesenchymal-like gene expression programs in drug treated tumors.

Figure S10 .
Figure S10.Examining transcriptional dynamics in cell lines with RNA velocity.

Figure S11 .
Figure S11.Examining transcriptional dynamics of drug response with RNA velocity and (b) Monocle3 was used to derive a trajectory graph in single cell RNA-seq data from our GEMM.The root nodes were manually defined in untreated mouse samples (NB856 and NB864, specifically), and cells were plotted in pseudotime.(c)Morans' I spatial correlation test was implemented on the principal graph of the trajectory inference in (b).The top three panels illustrate genes identified in our DESeq2 differential expression analysis that also co-varied in pseudotime (Crabp1, MYCN, and Thumpd3, respectively).The bottom panels show the reverse trend and are significantly increased in those cells at later pseudotime (Ptn, Snca, and Twist1, respectively).Each dot represents a cell from (b)