Inferring spatial transcriptomics markers from whole slide images to characterize metastasis-related spatial heterogeneity of colorectal tumors: A pilot study

Over 150 000 Americans are diagnosed with colorectal cancer (CRC) every year, and annually over 50 000 individuals will die from CRC, necessitating improvements in screening, prognostication, disease management, and therapeutic options. Tumor metastasis is the primary factor related to the risk of recurrence and mortality. Yet, screening for nodal and distant metastasis is costly, and invasive and incomplete resection may hamper adequate assessment. Signatures of the tumor-immune microenvironment (TIME) at the primary site can provide valuable insights into the aggressiveness of the tumor and the effectiveness of various treatment options. Spatially resolved transcriptomics technologies offer an unprecedented characterization of TIME through high multiplexing, yet their scope is constrained by cost. Meanwhile, it has long been suspected that histological, cytological, and macroarchitectural tissue characteristics correlate well with molecular information (e.g., gene expression). Thus, a method for predicting transcriptomics data through inference of RNA patterns from whole slide images (WSI) is a key step in studying metastasis at scale. In this work, we collected tissue from 4 stage-III (pT3) matched colorectal cancer patients for spatial transcriptomics profiling. The Visium spatial transcriptomics (ST) assay was used to measure transcript abundance for 17 943 genes at up to 5000 55-micron (i.e., 1–10 cells) spots per patient sampled in a honeycomb pattern, co-registered with hematoxylin and eosin (H&E) stained WSI. The Visium ST assay can measure expression at these spots through tissue permeabilization of mRNAs, which are captured through spatially (i.e., x–y positional coordinates) barcoded, gene specific oligo probes. WSI subimages were extracted around each co-registered Visium spot and were used to predict the expression at these spots using machine learning models. We prototyped and compared several convolutional, transformer, and graph convolutional neural networks to predict spatial RNA patterns at the Visium spots under the hypothesis that the transformer- and graph-based approaches better capture relevant spatial tissue architecture. We further analyzed the model’s ability to recapitulate spatial autocorrelation statistics using SPARK and SpatialDE. Overall, the results indicate that the transformer- and graph-based approaches were unable to outperform the convolutional neural network architecture, though they exhibited optimal performance for relevant disease-associated genes. Initial findings suggest that different neural networks that operate on different scales are relevant for capturing distinct disease pathways (e.g., epithelial to mesenchymal transition). We add further evidence that deep learning models can accurately predict gene expression in whole slide images and comment on understudied factors which may increase its external applicability (e.g., tissue context). Our preliminary work will motivate further investigation of inference for molecular patterns from whole slide images as metastasis predictors and in other applications.


Introduction
Colorectal cancer (CRC) is the third leading cause of cancer-related death in the United States, and there are disparities in screening and outcomes between age, race, and gender. 1 CRC incidence is rising among younger adults who are not typically incorporated into screening programs, illustrating the importance of developing timely and lower-cost prognostication methods to better assess the tumor's malignant potential. Currently, the pathological TNM-staging system (pTNM), which determines staging based on the impact of local invasiveness-histological stage (T-stage), and metastatic potential-nodal (N-stage) and distant (M-stage) metastasis, is the most predictive factor for risk of recurrence and prognosis. Metastasis, in many cases, is challenging to assess at the time of primary tumor resection. 2 For instance, specimen inadequacy often hinders the complete inference of nodal involvement. 3 It is thus crucial to develop orthogonal, less invasive, but equally informative technologies that can shed light on disease pathology and prognosis. One promising direction is to study the Tumor Immune Microenvironment (TIME)-the amalgamation of immune cells, chemokines, cytokines, and other immune modulators, etc. that accrete at the invasive front and inside the tumor at the primary site. [4][5][6] Recent studies have demonstrated that monocyte/lymphocyte immune infiltrates and their spatial distribution, density, and relationships play an important role in providing a coordinated anti-tumoral response. Yet, the full importance of TIME has not been elucidated, as most clinical studies consider either a few canonical markers at a time (e.g., immunoscore, which assesses cytotoxicity at the primary site) or only study cell mixtures which lack a single-cell or spatial dimension. 7 Spatially resolved transcriptomics (spatial omics), as enabled through technologies such as 10x Genomics Spatial Transcriptomics (ST, Pleasanton, CA) or Nanostring GeoMX Digital Spatial Profiling (DSP, Seattle, WA), is an actively growing area of research that provides rich information about how different areas of tissue interact by analyzing highly multiplexed gene expression at staggering spatial resolution. These technologies can be configured to study the distribution, density, and spread of tumorinfiltrating lymphocytes (TILs) as they may relate to concomitant somatic alterations. 8,9 Assay costs are currently exceedingly high, as profiling just 4 capture areas can cost tens of thousands of dollars, though costs are being driven downwards with new advances in chemistry and lower sequencing costs. Thus, sufficiently powering spatial transcriptomic association studies or extending their generalizability of the inferences to specific patient subgroups that lie outside of these small cohorts is challenging due to cost. In comparison, tissue slides stained with hematoxylin and eosin (H&E) to assess tissue histomorphology are routinely ordered at a very low cost, and there is ample evidence to suggest that many concurrent molecular alterations coincide with morphological features. Thus, the prediction of RNA expression using image data across a slide presents an opportunity to reveal critical prognostic information for patients at a lower cost, which can motivate relevant downstream analyses.
Deep learning approaches, which rely on using multi-layer artificial neural networks (ANNs), have proven instrumental for image analyses in the context of digital pathology. 10 Of relevance for this study is the assessment of whole slide images (WSI), digitized tissue slides, from which machine learning applications can predict the primary site of a metastatic lesion, tumor stage, and the outcome of immunohistochemical stains. Convolutional neural networks (CNNs), a type of predictive machine learning model, are powerful tools for extracting dense information from highdimensional image data. Prior works have employed these algorithms to extract morphological features from H&E-stained tissue to complement whole transcriptome analyses. As WSI can extend to hundreds of thousands of pixels along each spatial dimension, they are usually broken into subimages to enable efficient computation. Of relevance to our research topic, He et al.
(2020) used a DenseNet-101 model to regress on co-localized gene expression levels, 11 and Levy-Jurgensen et al. (2020) employed an InceptionV3 model to detect dichotomized gene expression for given patches of tissue. 12 However, these techniques do not analyze the potential for integrating spatial context outside the patch-level (i.e., spatially correlated patches are assumed to be independent and identically distributed), and, therefore may miss larger macroarchitectural contextual cues of aberrant expression. Zeng et al. (2022) and Pang et al. (2021) investigated approaches to integrate broader spatial context into their gene expression prediction models by using vision transformers and demonstrated that it is possible to outperform convolutional architectures that make predictions on individual patches (e.g., ST-Net). 13,14 However, there remain several unknowns, such as: (1) the scale of tissue features relevant for inferring RNA, (2) how well these models preserve global spatial expression characteristics (e.g., patterns of clustering), (3) whether relevant domain knowledge can make inferences more informative, and (4) whether these effects depend on the specific genes under study (i.e., what resolution/architectural context is optimal for specific genes and what does it say about the tumor biology). Furthermore, many spatial omics studies attempt to study one slide, where potential endogeneity is introduced by a matter of spatial distance between collection sites.
Here, we compared a convolutional neural network and a graph attention network for inferring spatially co-registered gene expression from WSI. We comment on the role the tissue macroarchitecture may play in RNA inference as elucidated by changing subimage size and the use of contextual models. We also assessed performance on a subset of immunerelated genes and how spatial characterization varies across these models/factors and slides. These explorations will motivate future downstream work to characterize factors pertaining to tumor nodal/distant metastasis.

Overview
The primary goal of this work is to predict the gene expression detected by a Visium spot at any given location on the slide. Our method is as follows ( Fig. 1 neural network and graph attention neural networks, the latter leveraging larger spatial context. These models will predict both binary (dichotomized expression) and count-based (continuous; zero-inflated negative binomial) objectives. We also configured additional approaches (e.g., Transformer) for comparison. 4. Capture surrounding tissue context at different scales: Ablation study over patch size to determine whether relevant biological information is encoded outside the Visium collection spot area. 5. Leave one-patient-out cross-validation: Evaluation on held-out slides/ patients as a measure of external applicability. 6. Recover spatial biology inferences: Spatial autocorrelation tests for the capacity of models to draw similar spatial inferences.

Data collection
The primary dataset utilized in this study was acquired from 4 pathologic T Stage-III (pT3) matched (pTNM system) colorectal cancer patients at Dartmouth Hitchcock Medical Center, determined through a retrospective review of pathology reports from 2016 to 2019 following IRB review and approval. These 3 patients were drawn from a set of 36 patients included in a prior study 15 -half of the patients had concurrent tumor metastasis (slides A1 and B1 had tumor metastasis; slides C1 and D1 did not have tumor metastasis) and were otherwise matched on age, sex, tumor grade, tissue size, mismatch repair (MMR) status, and tumor site using iterative patient resampling with t-tests for continuous variables and Fisher's exact tests for categorical variables. The 4 patients were subselected to restrict the tumor site (3 in the right colon, 1 in the transverse colon), grade (3 grade 1, 1 grade 2), node status (2 metastasis cases with N-1a), and account for differences in sex (50% female within both the non-metastasis and metastasis groups). We restricted the cohort to patients without MMR deficiencies as determined through immunohistochemistry (IHC) to control for microsatellite instability status. Tissue blocks were sectioned into 10-micron thick layers, and specific capture areas that contained various distinct macroarchitectural regions (all containing epithelium, tumor-invasive front, intratumoral, lymphatics, etc.) were annotated by the practicing pathologist. A histotechnician carefully extracted/manually cut these capture areas from the tissue, and slides were sent to the Single Cell Genomics Core in the Center for Quantitative Biology for simultaneous H&E staining, imaging, and Visium profiling.
Visium spatial transcriptomics profiling: Traditional transcriptome-wide association studies use bulk/dissociated specimens, which disregards spatial information (e.g., cell type, location, co-expression). Prior spatial assays (e.g., IHC) only allow for targeted analyses (1-10 molecular markers). 16,17 Recently, spatially resolved transcriptomic technologies have been developed which allows for untargeted high multiplexing (i.e., ∼20 000 molecular markers) at high spatial resolution. The 10X Genomics Visium Spatial Transcriptomics (ST) assay elucidates spatial variation for expression of tens of thousands of protein coding genes at up to 5000 (6.5 mm) or 14 000 (11 mm) 55-micron capture areas (1-10 cells) per tissue section. 5,18 This technology has been featured in hundreds of high impact studies, [19][20][21][22][23][24][25][26][27] many of which have explored spatial tumor heterogeneity, making this technology the cornerstone spatial biology assay. After a deparaffinization step: (1) tissue is stained with H&E and imaged, then tissue is (2) permeabilized to release mRNA, which are (3) hybridized to target probes on slide, captured through poly(A) tail binding; (4) probes are then ligated, and (5) undergo extension and amplification; finally, (5) probes and spatial barcodes are sequenced by an Illumina sequencer. This allows for unbiased/gridded profiling of up to 5000 spots (1-10 cells/spot) per 6.5 mm by 6.5 mm capture area. A graphical description of the application of the Visium ST assay on our study cohort can be found in Fig. 2.

Preprocessing
Spatial gene expression profiles contain information for 17 943 genes at almost 5000 locations per slide (after filtering out non-tissue-total number of Visium dots: 4950, 4922, 4887, and 4169 per slide), sampled in a honeycomb formation. Each Visium spot covers a circular capture area with a diameter of 130 pixels at 20x magnification. After sequencing, we used the SpaceRanger package to preprocess the Visium reads into gene count matrices.
As whole slide images (WSIs) derived from the Visium capture areas (size of capture area-6.5 × 6.5 mm) span thousands of pixels along each dimension, we subdivided the image into square patches centered on the Visium spot. We associated the gene expression of each patch based on the Visium spot at the center of the patch and ignored expression at other spots contained within the patch.

Inference targets
We used the SpatialDE library to select the top 1000 genes based on their mean fraction of spatial variance (FSV) across all slides (i.e., selected genes which exhibited the greatest spatial variation across the 4 slides). We tested the capacity of our models to recover expression for all 1000 genes based on dichotomized expression (binary classification) and the original counts (regression). 28 For binary classification, we classify tiles as having a "high" expression for a gene if its individual expression is greater than the median for that slide, as shown by Levy-Jurgensen et al. For binary tasks, we used a weighted binary cross-entropy loss. The loss was independently taken for positive and negative Visium spots and summed together to account for unbalanced labels.
We model the gene expression distribution for regression tasks as a negative binomial distribution with zero inflation. The model predicts the parameters of the distribution (mean μ, dispersion factor σ, and inflation of zero count π) and is optimized with negative log-likelihood loss. The inferred value is equal to the expected value of the negative binomial distribution (accounting for zero inflation):  1. Overview of tested model architectures: Whole slide images are divided into patches co-localized with the Visium spots and an Inception model is used to predict counts and dichotomized expression for 1000 genes. Features derived from Inception model are additionally fine-tuned using graph neural network for inference; inferred expression profiles are compared to the ground truth through a cluster analysis, spatial autocorrelation tests, and pathway analysis.

Modeling approaches
First, we compare the performance of the following models using the dichotomized labels and continuous regression objectives. For these models, we sought to establish whether increasing the spatial receptive field by varying the patch size (ablation study) and training graph attention networks (GATs) positively impacted our capacity to predict spatial gene expression. All models featured an output layer, which simultaneously predicted the expression of all selected genes ( Fig. 1). Details of these approaches can be found below.

Local patch prediction model
We initialized our model using InceptionV3 weights (Szegedy et al., 2015). 29 InceptionV3 was chosen because it has demonstrated high performance in gene imputation studies by Levy-Jurgensen et al. These models are trained for 25 epochs with a learning rate of 0.0001 and a batch size of 32. The patch-size labeled Inception models they were configured to make predictions on (i.e., amount of incorporated surrounding information): Inception-256 (256 pixels), Inception-512 (512 pixels), and Inception-768 (768 pixels) (model suffix indicates patch size).

Contextualized patch prediction models
The Visium spots lie on a hexagonal array, each of which can be treated as a node in a graph, each connected to other Visium spots within 150 pixels. After training the InceptionV3 models from our local patch classification experiments, we extracted patch image embeddings (e.g., ndimensional descriptive vector) for each Visium spot from the penultimate layer of the trained Inception-256 model. We test a graph attention network (GAT) to see how iterative message-passing can improve model performance. We also compared these results to that obtained using a Vision Transformer (ViT). 30(p16) GAT models were labeled by the number of graph attention layers used to make predictions (i.e., the amount of incorporated surrounding information; the number of layers dictates the size of the neighborhood; 8 attention heads per layer): GAT-1 (1 layer), GAT-2 (2 layers), and GAT-4 (4 layers) (model suffix indicates the number of layers). ViT models were labeled by the size of the patch used to form contextual embeddings: ViT-224 (224-pixel patch sizes) and ViT-384 (384pixel patch sizes).

Data augmentation during training and hyperparameters
To improve the robustness of the model results to new tissue contexts, we transformed patch images by shifting and scaling the pixel intensities by the mean and variance of ImageNet. Then, we applied color jitter and random rotations; between −15 and 15 degrees. Random horizontal and vertical flips also augment patches if they are not used in the GAT. Through a coarse hyperparameter search, learning rates were set to 1e-4, and models were trained for 25 epochs. Batch sizes for the Inception model were set to 32, save for the regression models for patch sizes 512 and 768, where batch sizes were set to 16 and 8, respectively.

Comparison of model performance
To compare model and patch-size performances, we performed leaveone-slide-out cross-validation (CV), where 3 of the 4 slides were used for training/validation, and the final slide was used for testing. This scheme was repeated 4 times to report on test performance across all 4 slides in an unbiased manner using macro-averaged (across slides) median (across genes) area under the receiver operating curve (AUROC) and average precision (AP) statistics for the binary outcome and correlation coefficients (e.g., Spearman) to compare true versus predicted counts. Non-parametric bootstrapping was used to assess statistical significance through the calculation of a 95% confidence interval. During cross-validation, we use the same set of InceptionV3 embeddings to train the contextual models Graphical description of Visium spatial transcriptomics assay: (A) Capture areas of 6.5 2 mm 2 area are selected from H&E stained slides for profiling; after tissue dissection of these defined regions, (B) sections are placed onto Visium slide; (C) each section is profiled at individual 55-micron Visium spots (i.e., 1-10 genes), selected based on where underlying tissue is detected (red) from the co-registered H&E stained WSI; (D) tissue is permeabilized at individual Visium spots, which releases mRNA (purple), which are captured through gene-specific hybridization probes which are spatially barcoded; probes which capture mRNA are ligated, extended, and sequenced using an Illumina sequencer to (E) quantify gene expression at each Visium spot, clustered into 4 patterns (purple, orange, green, and blue); as spots are co-registered to H&E, expression of individual genes can be predicted.
(i.e., GAT) that corresponded to the same cross-validation fold. We acknowledge that these statistics could also vary by models, patch size, and slides, so we plotted scatters of test AUCs of each gene to compare these factors in a pairwise manner (e.g., for slide 1, comparing Inception to GAT, or patch size 256 to 768) to draw additional inferences on suitable modeling approaches for a subset of genes. Identifying a subset of genes that obtained optimal performance for one approach versus another was as important as comparing overall performance. These performance differences could suggest relevant tissue features at different scales. For instance, the GAT could extract features from the larger macroarchitecture, indicating its relevance for a gene that does not predict well from InceptionV3.

Model interpretation through pathway analysis and gene embeddings
Due to fundamental limitations in tissue biology, it is unrealistic to expect that every gene can be predicted from tissue histology. We sought to establish which types of genes could be inferred from histology through pathway analysis. Using the Elsevier Pathways database available through the Enrichr package, we performed a pathway analysis of the top 250 genes ranked by AUROC averaged across the CV folds. 31 Enrichr reports overrepresented pathways using a modified fisher's exact test. Detected pathways were filtered based on tissue specificity (i.e., could reasonably be involved with the colon). To determine whether different pathways could be inferred from different tissue contexts, pathway analysis results were compared across models.
We also sought to assess how well the predicted gene expression profiles recapitulated relationships/clustering between the Visium spots. This was accomplished through the comparison of Uniform Manifold Approximation and Projection (UMAP) embeddings for the ground truth and predicted expression profiles (on held-out slides) using InceptionV3 and GAT. 32 Ground truth and predicted gene expression profiles were projected to a lower dimensional space using AlignedUMAP to preserve the orientation/alignment between the ground truth and predicted expression profiles (i.e., positioning of Visium spots across projections is relatively preserved) to enable comparison between the approaches. Visium spots corresponding to the ground-truth UMAP projections were clustered using hierarchical density-based clustering (HDBSCAN). 33 HDBSCAN also identified outlier Visium spots removed from the cluster analysis and scatterplots for all 3 datasets (ground truth, InceptionV3, GAT). Mapper, a Topological Data Analysis method, was used to provide topological summaries of the embeddings by reducing the number of points based on overlap and connectivity. [34][35][36] Mapper embedding plots for InceptionV3 and GAT were colored using the cluster information from the ground truth expression to qualitatively assess topological consistency (i.e., were relationships between Visium spots preserved). This procedure was repeated across the held-out test slides.

Recapitulating spatial biology through spatial autocorrelation tests
In addition to tissue clustering, spatial variation is often used as a proxy for explaining the diversity of cellular lineages interacting across the tissue slide. While this factor alone is not an exhaustive assessment of spatial biology, this served as a target for our preliminary assessment (an exhaustive exploration of spatial analyses is out-of-scope of this work, although it is a future direction). After performing cross-validation, we sought to investigate the ability of our algorithms to recapitulate known spatially variable genes on each of the held-out slides from the cross-validation folds. We used two libraries to determine spatially variable (SV) genes: SpatialDE and SPARK-X. 37 Gene expression counts were summed to create a total count matrix. Genes with low overall expression across the slide (i.e., below a threshold of 30% slide coverage) were removed. The data was then transformed to a normal distribution (by Anscombe transform) to account for the negative binomial distribution of the gene expression. Since SpatialDE's computation time increases cubically with each additional expression patch, we reduced the resolution of the Visium data through 2 × 2 median pooling (i.e., taking median expression for specific genes from 2 × 2 neighborhoods). The reduced memory requirements allowed us to perform further histomolecular assessments on the images, including clustering based on spatial variability.
Using SpatialDE, we extracted the Fraction of Spatial Variance (FSV) and P-values for each gene from the ground-truth set. In addition, we used SpatialDE's Automatic Expression Histology (AEH) to identify 5 groups of genes that were co-expressed spatially from the ground-truth data. Separately, we ran SPARK/SPARK-X on the 1000 genes from each held-out validation slide, reporting projection and adjusted P-value (Bonferroni-corrected) statistics. We separately ran this procedure for the ground-truth expression and predicted expression profiles. As an indication of performance, we expected P-values derived for the projection covariance function for both the inferred and original expression data to correlate well with each other for each slide. This was accomplished using the Fisher's exact test after dichotomizing spatial autocorrelation statistics into "high autocorrelation" (low P-value) and "low autocorrelation" (high P-value) and similar dichotomization for statistics extracted from the inferred expression. Thresholds for dichotomization were chosen to maximize the magnitudes of the Fisher's exact test statistics. Dichotomized spatial autocorrelation was cross-tabulated across the genes to report odds ratios with P-values. An odds ratio and corresponding confidence interval of more than one (i.e., statistically significant) would suggest the ability of the model to recapitulate spatial autocorrelation from the slide. Separately, we sought to assess whether genes that were predicted with high accuracy (AUC) were spatially autocorrelated using a similar methodology (i.e., comparing dichotomized spatial autocorrelation on the ground-truth expression with dichotomized AUC for each modeling approach). We also compared model accuracies between AEH groups of co-expressed genes identified from SpatialDE using Kruskal-Wallis ANOVA statistical tests. Similar to the AUC comparison, we performed comparisons across models, patch sizes, and slides. It should be noted that these analyses were done on the top 1000 spatially variable genes; dichotomized autocorrelation is for this reference group. For the original and inferred expression, genes which exhibited spatial autocorrelation which differed between patients with and without metastasis were selected for a pathway analysis using the MSigDB Hallmarks gene sets via the Enrichr software.

Model comparison
First, we assessed the ability of the Inception model to predict gene expression without considering the surrounding tissue macroarchitecture. Across the whole transcriptome, we noted moderately strong concordance between the predicted and actual expression across the held-out slide folds.
Impact of patch size to leverage surrounding spatial context Importantly, the model achieved optimal performance by increasing access to the surrounding macroarchitecture by increasing the patch size.

Impact of model architecture to leverage surrounding spatial context
After optimizing model architectures, the Inceptionv3 model appeared to outperform the GAT and Transformer approaches for the whole transcriptome assessment. This was true for both classification and regression modeling objectives. The GAT-1 model outperformed the GAT-4 model, which incorporated more of the surrounding tissue context, for the classification task, though GAT-4 outperformed GAT-1 for regression and demonstrated a similar capacity to predict count outcomes as Inception-512 ( Fig. 3; Table 1). A breakdown of these model performances for individual held-out validation slides can be found in the appendix (Supplementary Table 1; Supplementary Figs 1-2).

GAT outperforms inception on a subset of genes
While overall, Inception outperformed GAT, it is important to recognize that there are many genes for which GAT achieves superior performance

Pathway analysis
The top-performing genes by AUROC for both types of models were also highly related to tumor aggression and migration. For instance, CDH1, heavily implicated with tumor suppression, achieved AUROC of 0.880. 38 RAB25 (Fig. 4A), which can serve as a tumor suppressor or oncogene depending on the context, obtained an AUROC of 0.879. TNS4 has been heavily implicated for multiple prognostic outcomes following surgery and was predicted with an AUC of 0.871. 39 AXIN2 (Fig. 4C-E), which inhibits the Wnt signaling pathway and serves to regulate immune cell infiltration, 40 was detected with an AUROC of 0.873. The pathway analysis  corroborated these findings and was highly relevant for metastasis formation. For instance, genes associated with cancer metastasis, cell motility and proliferation, glycolysis, and the epithelium to mesenchymal transition were identified by both models. Interestingly, the GAT model was able to identify genes related to anti-EGFR therapy (Cetuximab) resistance in colorectal cancer (Supplementary Table 2).

Inferred spatial expression is topologically consistent with ground truth
Visual inspection of mapper diagrams of clustered UMAP embeddings illustrates topological consistency (i.e., preserve relationships) between the predicted expression and the ground truth (Fig. 5). Qualitatively, the gene expression embeddings produced from Inception appear to be more closely aligned with the ground truth embedding plots across the tissue types (e.g., Fig. 5A, C, where clusters are placed in the approximately same area for ground truth and Inception in the Mapper diagrams).

Spatial autocorrelation
We also compared the capabilities of each modeling approach for their ability to recapitulate slide-level spatial autocorrelation parameters. Results indicate a large significant positive association between predicted and actual spatial autocorrelation for both Inception and GAT approaches. For half of the held-out slides, GAT demonstrated a larger effect estimate than Inception (Table 3).
Separately, results indicate that highly spatially autocorrelated genes, as determined using the actual gene expression, were predicted with higher accuracy using Inception and GAT versus genes which lacked spatial variation (Table 3). These models varied in their ability to associate spatial  variation with model accuracy. For Inception, there was a large statistically significant effect (Fig. 6), while spatial variation was not as associated with accuracy for the GAT-i.e., there was a statistically significant association for the first 2 slides and no statistically significant effect for the final 2. For the Inception model, groups of genes that tended to be co-expressed, as determined through the AEH analysis on the raw expression data, were found to have widely different accuracies (Table 3; Supplementary Figs. [3][4][5]. Similar to the spatial variation analysis, GAT model accuracy did not vary substantially between AEH groups determined from the raw expression data. Using the ground truth, Inception, and GAT results, we identified the set of genes, which exhibited different spatial variation for primary sites with (A1, B1) and without metastasis (C1, D1) based on the dichotomized thresholds. Through a pathway analysis (gene set testing of Cancer Hallmark genes), genes that were differently autocorrelated were related to the epithelial to mesenchymal transition (Supplementary Table 3).

Discussion
Assessment of colon cancer tumor metastasis status and recurrence risk depends on the ability to assess lymph node involvement. For cases where lymph nodes are unable to be completely assessed, leveraging information found in the tumor immune microenvironment (TIME) at the primary resection site presents a viable alternative. Studying TIME can help us understand how immune cells infiltrates and interactions with other cell lineages to form a coordinated anti-tumoral immune response. These interactions can inform us about immune suppression, activation, exhaustion, tumor cell energy metabolism, pluripotency, cytokine secretion/recruitment, immune escape pathways, and collagen/extracellular matrix remodeling. These pathways play a significant role in creating an environment that is permissive for metastasis. Several assays have been developed to study the TIME-perhaps the most notable among them being the Immunoscore assay which assesses T-cell cytotoxicity (CD3/CD8 density) both within and around the tumor as both a metastasis predictor and independent recurrence risk factor. 7,41 Yet, most technologies to assess the primary site lack a spatial component (e.g., bulk expression), which does not enable a comprehensive characterization of the tissue. Spatial transcriptomic technologies enable high multiplexing at incredible spatial resolution. Due to both fiscal and sampling (i.e., placement of capture area) constraints, findings are not likely to be clinically actionable or reproducible through a lowcost test. Inferring a spatial digital biomarker from a routine histological slide (WSI) has the potential to enable high-throughput tissue characterization for the creation of nascent decision-making aids which can ; embedding plots are summarized using Mapper, which flexibly clusters the expression data with overlapping clusters containing multiple Visium spots; Mapper nodes are sized by the number of associated spots and colored by the dominant cluster in the set of node-associated spots with cluster assignments determined using HDBSCAN; outliers were filtered from these embedding plots and the cluster assignments plotted over the WSI on the right. Table 3 Statistical testing results from spatial autocorrelation analysis, comparing: (A) Spatial autocorrelation from raw expression (high/low) with inferred spatial autocorrelation from predicted expression values (high/low); (B) spatial autocorrelation from raw expression (low/high) with model AUC (high/low); (C) whether model accuracy changed depending on the AEH group; (A) and B) were determined through Fisher's exact tests while (C) utilized Kruskal-Wallis ANOVA testing. complement existing specimen findings. In this study, we explored the potential for deep learning models to predict spatial gene expression from formalin-fixed specimens, the ability to recapitulate well-known spatial findings, and comment on the degree to which spatial information at higher-order contexts (i.e., macroarchitecture) plays a role in modeling spatial expression. The present study dissected tissue sections centered around the TIME to identify spatially variable metastasis-related biomarkers which could be predicted from tissue histomorphology localized to specific architectures in the TIME.
We add further evidence that deep learning models can accurately predict gene expression in whole slide images, reporting an AUC of 0.8. Furthermore, we demonstrate that increasing the receptive field can improve the performance of certain subsets of genes (e.g., Inception-256 versus Inception-768 with AUCs of 0.769 and 0.79 across 1000 genes respectively). Interestingly, although InceptionV3 (AUC=0.79) outperformed other modeling approaches overall (GAT AUC=0.75; Transformer AUC= 0.76), this did not apply to all genes. We noted that certain genes were predicted more effectively at local spatial resolutions (Inception; e.g., CDH1, KRT18, PABPC1, RAB25, SOX9, EPCAM) while others benefited from considering a broader architectural context (GAT; e.g., PECAM1, COL6A1, GAS1, IL1R1, LAPTM5). For instance, we noticed that COL6A1 and COL6A2 were consistently predicted better by the GAT model as compared to Inceptionv3. [42][43][44] COL6A1 is a crucial component of collagen secretion, extracellular matrix maintenance, and mesenchymal phenotype promotion. Of the other markers listed in this paragraph, an interesting observation is that genes which were better predicted using the Inception approach have been previously implicated with cellular adhesion (e.g., CDH1, EPCAM), 45-48 cellular proliferation (e.g., RAB25), 49,50 stemness (e.g., SOX9), 51,52 and apoptosis (e.g., PABPC1), 53 whereas genes predicted using GAT were related to angiogenesis (e.g., PECAM1), 54 induction of a mesenchymal phenotype (e.g., COL6A1), glycolysis/energy metabolism (e.g., GAS1), 55 and regulation and secretion of pro-inflammatory cytokines (e.g., IL1R1, LAPTM5), 56 amongst others. These results suggest that graph neural networks are more adept at capturing tumor migration and invasive processes (e.g., collagen matrix remodeling and energy metabolism required to enable tumor/immune cell motility) and global structural responses (e.g., inflammation) aimed at curbing further growth. A notable finding was that the GAT model predicted genes related to response to anti-EGFR therapy, as indicated in the pathway analysis. 57 Anti-EGFR (epidermal growth factor receptor) therapies, such as Cetuximab, are chimeric monoclonal antibodies widely used to curb colorectal cancer metastasis by blocking the activation of EGFR tyrosine kinases, thereby preventing further cell proliferation. This finding was unexpected because we assumed that the GAT model would predict genes involved in pathways that enable large-scale tissue structural changes. However, these results are consistent with a recent clinical trial that showed that anti-EGFR therapy nonresponders may benefit from treatments that target fibrotic pathways. 58 Moreover, it was reported that a subset of tumors with anti-EGFR resistance also exhibited a pro-fibroblast phenotype and altered T cell infiltration in the tumor microenvironment. Since anti-EGFR resistance may be Fig. 6. Boxen plots illustrating the predictive accuracy of Inceptionv3 (AUC, y-axis) across genes, separated by whether highly significant spatial variation was reported (blue versus orange, x-axis); gathered from validation slides held-out of the training/validation set. The positive association demonstrates higher accuracy for genes with significant spatial variability (i.e., not distributed randomly); these genes were more accurately predicted by our RNA inference model. (A-D) correspond to slides A1-D1, respectively associated with extracellular matrix remodeling, using the GAT model to predict gene expression in these pathways may help to identify early response or resistance to anti-EGFR therapy, which warrants further investigation. It should be mentioned that none of the patients in this study were treated with Cetuximab at the time of surgical resection. In comparison to genes and pathways that could be predicted from GAT inferred transcriptomics data, the Inception approach appears to be proficient at characterizing signatures related to proliferation and escape pathways to further promote tumor aggressiveness. Nonetheless, both approaches overall demonstrated exemplary performance on predicting these unique molecular signatures and further research is needed to characterize the extent to which these models can be used in conjunction to capture these metastasis-related pathways. While both models demonstrated the capacity to recapitulate well-known cancer markers, it was clear that certain genes can be better predicted by considering receptive field size and choosing a model that best incorporates this spatial information.
Spatial autocorrelation was recapitulated for these slides from the deep learning model predictions. We noticed that increases in modeling accuracy were associated with spatial gene expression variation. Genes that were differentially autocorrelated between metastatic and non-metastatic tumors corroborated with well-established oncogenic pathways. For instance, both the GAT and Inception models recaptured genes related to the Epithelial to Mesenchymal (E2M) transition with greater statistical power than the ground-truth expression data. For the Inception model, differentially spatially autocorrelated genes included: CALD1, 59 73 It is interesting to note that while the identified genes are tied to an E2M phenotype, only 2 of these genes (PCOLCE and THBS1) were found in common between both methods. This suggests that each method may capture different molecular aspects of the E2M phenotype.
Nonetheless, these findings supports our overarching modeling approach to characterize spatial heterogeneity. Yet, identified pathways from the spatial autocorrelation analysis once again differed between Inception and GAT. The graph neural networks prioritized the pathways of angiogenesis, myogenesis, glycolysis, and apical junction, which are related to cell migration and energy metabolism as we discussed before. These pathways correspond to the genes that the GAT model could accurately predict. The Inception model selected the pathways of TNF-alpha signaling via NF-kB, TGF-beta signaling, and coagulation, which are mainly involved in inflammation, immune escape mechanisms, and injury responses. These pathways may overlap with or encompass the ones derived from GAT, which had inferior model performance. As some genes were better predicted using the broader spatial architecture, it would make sense to decide which modeling approach to utilize on a gene-by-gene basis and whether the local versus spatial context is preferred. This allows for the selection of optimal modeling approaches in a gene-specific manner to extend the broad applicability of our framework across all disease-relevant genes.
Importantly, the techniques featured in this work will prove useful for inferring gene expression on slides from external cohorts. This is necessary due to the prohibitive costs of spatial transcriptomics. If validated properly on a slightly larger set of slides while carefully controlling for potential confounders (e.g., site, MMR status, etc.), these methods may overcome sources of patient-specific batch effects to allow the report of less-biased effect estimates for large-scale cohort studies. Validation efforts on these held-out slides can leverage assessment methods such as Spark-X and SpatialDE to ensure spatial heterogeneity from the inferred expression data is similar to the initial internal validation cohort. To this end, we achieved remarkable performance for the ability to predict spatial heterogeneity, which will potentially help power future studies elucidating spatial transcriptomic predictors of colon metastasis.
Our results did not support the hypotheses that, overall, spatial gene expression estimation would benefit from message-passing between patch locations via transformer models and graph convolutional networks.
However, these neural networks outperform local patch prediction across a large set of genes, suggesting the relevance of these methods for genes which leverage the spatial context. For GAT, underperforming genes can potentially be explained by the fact that graph convolutional networks tend to smoothen features, 74 which may weaken the model's predictions if the gene expression data is not as smooth. While these findings could alternatively suggest that neighboring patches may not be as correlated as suspected, this may only hold true for a subset of genes. Increasing graph convolution layers may result in optimal performance for genes, which rely on higher-order dependencies between tissue regions; however, such genes may be outnumbered by genes better suited for the Inception approach.
There were a few study limitations of note. Our validation scheme featured the use of held-out slides. Generally, as 2 cases had tumor metastasis, whereas the other 2 did not have concurrent metastasis, we believed heldout slides would have similar heterogeneity in expression and morphology as the training/validation slides. Indeed, it is possible that tissue expression/morphology existed outside of this range, differentially impacting the GAT/transformer models designed to capture long-range spatial dependencies. As we had manually selected capture areas, we did not expect slides to have exactly matching/analogous architectural features; thus, the GAT/transformer model could have been unable to generalize as it aims to integrate this higher-level information. There were 4 whole slide images for both training and cross-validation, and while these provided nearly 15 000 patches in the training set when taken individually, they provided limited opportunities for models to learn diverse global contexts that transformer models would have benefitted from. For instance, different molecular alterations (e.g., mismatch repair deficiency) and tissue location (e.g., right versus left colon) are factors related to patient prognosis and may confound/modify disease associations. Models aiming to infer molecular alterations across a large cohort of CRC patients should be privy to these factors. These challenges could have been ameliorated by pretraining the GAT/Transformer on a variety of colorectal cancer tissue contexts, which does not explicitly require spatial omics data. Furthermore, the assessment of tumor metastasis using the ground truth and inferred spatial expression information in this study uses differences in spatial autocorrelation between patients with and without metastasis to determine the presence of metastasis. There are many other parameters related to tumor metastasis which warrant further consideration-for instance, how different cell-types may co-localize within distinct architectures (e.g., proximity to surrounding lymphatics/vasculature, ligand-receptor binding), differential expression of cell-type specific markers at the invasive margin (e.g., suppression of key lineages), and associating tumors which have metastasized with risk of recurrence/mortality. Integration of single cell RNAseq information can help facilitate these nuanced assessments by mapping single cell expression profiles to the Visium spots. Assessing different parameters of metastatic pathogenesis (e.g., formation, aggressiveness, local/distant progression, outcomes) will be the subject of future work which will be enhanced through inferring expression across large study cohorts.
We also acknowledge the impact of evolving workflows for spatial resolution of transcriptomics information. Many workflows rely on the prediction of tissue features from fresh or fresh frozen tissue. We utilized formalin-fixed paraffin-embedded (FFPE) tissue slides. Only recently have deparaffinization and assay workflows been developed for FFPE. As many of these specimen processing workflows are still under development, manual staining and imaging could have both introduced batch effects and impacted tissue quality. For instance, while a plethora of data preprocessing workflows offer capabilities to combat batch effects through normalization, a robust comparison of preprocessing workflows was outside of the study scope. Our group is keen to adopt future iterations of the FFPE workflows. We expect that protocol updates for the FFPE workflow and data collection will provide major improvements in specimen processing, data quality, and resolution, improving prediction models. Enlarging the capture area and evaluating complementary molecular assays (e.g., spatial proteomics) will improve the resolution and scope of our findings. 75 Although this work does not explicitly assess the tumor immune microenvironment, the methodology explored here can help facilitate spatial analyses and will be used to motivate future clinical findings. Future work will clearly demarcate these regions (e.g., intratumoral, invasive margin, away from the tumor) for a more granular analysis that is well-adjusted for potential confounders.

Conclusions
Tumor metastasis is heavily tied to poor prognostic outcomes and risk of recurrence. Assessment of key niches at the primary site (e.g., tumor immune microenvironment) may reveal histomorphological or biological factors relating to nodal involvement or distant metastasis. In this work, we investigated the potential to infer spatially resolved transcriptomics from the tissue histology of colorectal cancer patients through several sophisticated neural network approaches. Our findings indicate that neural networks can be effectively employed in this capacity and that selection of a neural network model could be informed by its relevance to a molecular pathway resolved from histological features at different scales. Findings reaffirm the role of the epithelial-to-mesenchymal transition as an important metastasis-related process. In the future, with additional algorithmic finetuning, data curation, and standardization, there are opportunities to generalize these findings to perform large-scale spatial molecular assessments.