An information theoretic approach to detecting spatially varying genes

Summary A key step in spatial transcriptomics is identifying genes with spatially varying expression patterns. We adopt an information theoretic perspective to this problem by equating the degree of spatial coherence with the Jensen-Shannon divergence between pairs of nearby cells and pairs of distant cells. To avoid the notoriously difficult problem of estimating information theoretic divergences, we use modern approximation techniques to implement a computationally efficient algorithm designed to scale with in situ spatial transcriptomics technologies. In addition to being highly scalable, we show that our method, which we call maximization of spatial information (Maxspin), improves accuracy across several spatial transcriptomics platforms and a variety of simulations when compared with a variety of state-of-the-art methods. To further demonstrate the method, we generated in situ spatial transcriptomics data in a renal cell carcinoma sample using the CosMx Spatial Molecular Imager and used Maxspin to reveal novel spatial patterns of tumor cell gene expression.


In brief
Quantifying the degree of spatial organization in a biological signal is a key building block of spatial transcriptomics analysis. Jones et al. conceptualize this in information theoretic terms, providing an interpretable and accurate tool to reveal spatial organization.

INTRODUCTION
The last several years have seen a slew of advancements in spatially resolved transcriptomics, 1,2 proteomics, 3 and genomics. 4 These methods create an opportunity to develop a deep understanding of the spatial organization of tissues at the cellular level. When the expression of a large number of genes is measured, a key consideration is identifying which genes show signs of structured, non-random spatial organization. Analogous to identifying differentially expressed genes, these methods aim to identify spatially varying genes (SVGs).
Similar questions have been examined in geostatistics, ecology, and demography for decades, so naturally, existing spatial statistics methods have been put to use. Traditional measures of spatial auto-correlation, the most common of which are Moran's I and Geary's C, 5 have numerous general purpose implementations but have also been included in spatial expression analysis toolkits like Squidpy 6 and MERINGUE. 7 A number of methods have built more elaborate, special-purpose models based on Gaussian processes with covariance functions applied over spatial coordinates (''kriging'' in geostatistics parlance), which have long been a standard tool in probabilistic modeling of spatial data. SpatialDE 8 tests for SVGs by separately fitting a Gaussian process model and a regression model excluding spatial covariance and comparing the two using a c 2 test. SPARK 9 elaborates on this primarily by using Poisson likelihood, which is a more natural model of count data, and by testing models with ten different covariance functions to account for various potential patterns of spatial coherent expression. GPcounts 10 models counts using a negative binomial distribution and tests for spatial coherence using a likelihood ratio test MOTIVATION Identifying genes with spatially coherent expression patterns is a key task in spatial transcriptomics analysis. Previously proposed methods have various shortcomings, often failing to adequately conceptualize the problem, making overly strong distributional assumptions, struggling to scale to the rapidly increasing scale of the data, or failing to demonstrate any improvement over classical spatial statistics methods.
on the marginal likelihoods of models with and without the prior Gaussian process.
Gaussian process models have many appealing properties, but they also have one major shortcoming: exact inference is Oðn 3 Þ, where here n is the number of cells or spots. This quickly becomes problematic once n is larger than a few thousand. Given the recent rapid progress in spatial assays, cubic time inference marks a method for rapid obsoletion. A recent method, nnSVG, 11 overcomes this problem by leveraging a modern approximate inference scheme for Gaussian process models. SPARK-X 12 instead bypasses the issue by using an entirely different model than SPARK, a null-hypothesis test on the Frobenious inner product of the spatial coordinate covariance matrix and one computed over a gene's expression. When these are highly non-independent, the inner product test statistic will be large, yielding smaller p values. To consider various patterns of spatial coherence, it runs this test on different transformations of the coordinates.
Apart from the Gaussian process methods, trendsceek 13 uses a mark-segregation test, which tests the statistical independence of the distance between two cells and a gene's expression in both. If conditioning on the distance does not alter the distribution over expression pairs, there is no detectable spatial coherence. Independence is tested using summary statistics, and p values are computed by shuffling expression values across spatial locations to compute an empirical null distribution. Sepal 14 constructs a model simulating continuous time diffusion of expression across the spatial domain and measures spatial coherence by the time required for the process to reach a homogeneous state. SpaGFT 15 uses the graph Fourier transform to decompose an expression signal on a neighborhood graph into a frequency domain, letting it test for significant lower frequency variations. Finally, scGCO 16 bins a gene's expression using a Gaussian mixture model, clusters cells using a hidden Markov random field on the spatial neighborhood graph, and then tests for overrepresentation of expression bins within cell clusters. Other methods based on first clustering cells, such as SpaGCN 17 and STAMarker, 18 are able to detect SVGs but only with respect to specific domains and so are not directly comparable.
Giotto, 19 in addition to interfacing with SpatialDE, trendsceek, and SPARK, implements its own SVG test called binary spatial extraction (BinSpect), which binarizes expression values using either K-means clustering (BinSpect-kmeans) or threshold ranking (BinSpect-rank) and runs a Fisher's exact test using a contingency table of values from neighboring cells. SpaGene 20 also heuristically binarizes expression data but instead considers the degree distribution for a subgraph of the k nearest neighbor graph consisting of just high expression cells, operating under the principle that if high expression cells collocate, we would expect an increase in higher degree nodes in this graph when compared with a shuffled graph.
Existing methods often make either strong distribution assumptions, discard information through binarization, or are highly tuned in an ad hoc manner to specific types of patterns. In the work presented here, we develop an entirely new approach to measuring spatial coherence, adopting an information theoretic perspective and building on recent advancements in approximating mutual information. Like trendsceek, our goal is to quan-tify the degree of statistical dependence between a pair of expression values and their spatial vicinity, but rather than running null hypothesis tests on summary statistics, we take a more direct route by estimating the Jensen-Shannon (JS) divergence between the two, computing a score analogous to mutual information. Spatial coherence can then be neatly defined as how much information gene expression values in the same neighborhood share. In a spatially incoherent setting, this divergence is low; nearby expression values share no more information than randomly selected values. When expression is organized spatially, expression pairs become increasingly predictable based on whether they are nearby or not, and this divergence will be high.
Information theoretic divergences are appealing on theoretical grounds, but their application is often fraught. Computing mutual information directly necessitates a model of the joint probability and, in the case of continuous domains, computing a multidimensional integral. Recently, mutual information has gained renewed interest in the context of deep learning. 21 Belghazi et al. 22 showed, in a method they call mutual information neural estimation (MINE), how mutual information between to variables X; Y can be estimated by training a neural network classifier to distinguish pairs of observations (de facto draws from the joint distribution PðX;YÞ) from pairs drawn from the marginal distributions PðXÞ;PðYÞ, which can be formed by shuffling observations. Intuitively, the better the neural network is able to distinguish observed pairs from shuffled pairs, the higher the mutual information. This intuition is formalized by showing that the objective function used to train the neural network forms a lower bound on the mutual information.
In our method, Maxspin (a portmanteau of ''maximization of spatial information''), we adopt an approach similar to MINE to compute lower bounds on the JS divergence between pairs of nearby and pairs of distant gene expression values. On a genewise basis, we optimize a simple classifier to distinguish pairs of expression value pairs chosen uniformly at random across spatial locations from pairs chosen from nearby locations according to a short random walk on the spatial neighborhood graph ( Figure 1A). Spatial coherence is scored as the degree to which a classifier can recognize uniformly sampled pairs from spatially proximate pairs, with an objective function that is a lower bound on the JS divergence between the two distributions, representing a measure of spatial auto-information, which we will refer to as simply ''spatial information'' here. This principle can be trivially generalized to consider the spatial information between pairs of genes, quantifying the degree of spatial coexpression. Maxspin is available under an open source licence at https://github.com/dcjones/maxspin. MERFISH. The simulation has the advantage of known ground truth.
For the real data, as a proxy for ground truth, we annotated spatially varying clusters of cells or spots and called differential expression between these using pairwise Wilcoxon signedrank tests and a Benjamini-Hochberg q value cutoff of 0.01. This list no doubt omits some SVGs with spatial expression patterns not conforming to the annotated regions, but we expect it to contain the large majority of the most obvious examples, and omissions should equally disadvantage each method and not a priori bias the benchmark toward any particular method.
Because of the wide variety of settings, we adopt Moran's I as a baseline method and consider the degree to which a method improves on (or falls below) its performance, as measured by the area under the precision-recall curve (PR-AUC). Moran's I is simple, easy to interpret and implement, and has been a mainstay of spatial statistics for 70 years. Thus, it is a meaningful hurdle that any specialty method for detecting SVGs should aim to surpass.
GPcounts was excluded from these benchmarks. Though it has a test of spatial coherence, it was designed for very small numbers of cells and fails to scale to any of the datasets evaluated here. We were also unable to run trendsceek as the software is unmaintained and has been rendered unusable by library incompatibilities.

Simulations
To explore a variety of scenarios with a known ground truth, we simulated spatial expression data. The simulation assumes A B C Figure 1. Overview of the Maxspin algorithm for detecting spatially varying genes (A) The maximization of spatial information (Maxspin) algorithm proposed here works by training a simple classifier to distinguish between expression pairs sampled from nearby locations and those chosen uniformly at random. (B) If a gene displays spatial organization, there is a distributional shift between the two sets of sampled cell pairs. The classifier accuracy acts as an approximation of the Kullback-Leibler divergence between these distributions, which can be interpreted as spatial information.
(C) The spatial information score computed effectively quantifies the degree of spatial coherence for each gene evaluated, here shown in a Visium mouse brain dataset.
some number of discrete cell types and genes. Random expression values were drawn from negative binomial distributions. For 7,000 of the 8,000 simulated genes, this distribution is fixed across cell type, and for the remaining 1,000 it is perturbed, representing a ground-truth subset of SVGs. To simulate the spatial arrangement of cells, we developed a cellular Potts model 23 in which cells migrate and deform according to an actin-inspired mechanism. 24 Cells adhere to one another with greater or lesser strength according to a random cell type affinity matrix.
To consider conditions of more or less spatial coherence, we initialized cell positions into highly ordered arrangements by assigning cell type according to sine waves varying across the spatial dimensions, producing homogeneous clumps of cells. Simulation parameters were then chosen so that cell arrangements slowly grew more disordered with more iterations while still displaying some patterning due to varying cell type adhesions. Thus, running the simulation for a larger number of iterations produces an arrangement of cells with spatial patterns that are subtler but still present. We ran a variety of simulations with 2,000 and 10,000 cells. Because SPARK and SpatialDE are both Gaussian process methods, the former with Oðn 3 Þ inference and the latter limited by GPU memory, they were unable to scale to the n = 10; 000 simulations.
Maxspin was found to outperform other methods by a large margin in these simulations (Figures 2A and 2B). We varied parameters of the simulation, including setting the negative binomial dispersion parameter r between 3 and 6, the number of cell types between 2, 4, and 8, and the number of simulation iterations between 10, 100, and 1,000. These results hold up across these different parameterizations ( Figure 2C). SPARK and nnSVG compete for second, but SPARK fails to scale to 10,000 cells, and SPARK-X shows wildly variable performance. Geary's C slightly improves on Moran's I, perhaps due to more subtle local variation, while all other methods consistently underperform Moran's I. We additionally adapted simulations from Zhu et al. 12 consisting of cells distributed uniformly at random across a rectangle, with a subset of SVGs that have modulated expression either in a single vertical band region (''streak'') or a circular region (''hotspot''), examples of which are shown in Figures S1C and S1D. We generated a number of simulations by varying parameters controlling mean expression, overdispersion, and effect size. The cosine and Gaussian coordinate transformations used by SPARK-X are well tuned to detect these particular simple spatial patterns, yet Maxspin achieves largely equivalent performance without any specialty spatial kernels (Figures S1A and S1B). As with the cellular Potts simulation, SPARK and nnSVG compete for second place, Geary's C slightly surpasses Moran's I, and all other methods underperform Moran's I. We might expect SpatialDE to perform similarly to the other Gaussian process methods, but it once again assigns 0 p values to many examples, erasing any meaningful ordering.
Because most biological instances of spatially coherent expression, especially at the cellular level, are so much more complex, we believe that these streak and hotspot simulations are less realistic than our cellular Potts model, yet they do offer an important alternative perspective. SPARK-X does well here, but considerably worse on the more sophisticated cellular Potts A C B Figure 2. Evaluation of the performance of methods to detect spatially varying genes using simulated data (A and B) Spatial transcriptomics data were simulated using a cellular Potts model for (A) 2,000 and (B) 10,000 cells, with expression drawn from a negative binomial distribution that is perturbed across cell types for the ground-truth spatially varying genes. Methods were benchmarked by computing the area under their precision-recall curves and subtracting the value achieved by Moran's I. (C) Performance is further broken down by various parameterizations of the simulation. Across the board, performance increases with lower dispersion and fewer simulation iterations (resulting in simpler spatial distributions), with the number of cell types playing a more ambiguous role. model, with accuracy collapsing in some cases of subtle spatial organization, while Maxspin very consistently performs well across all simulations.

Visium
We used human prefrontal cortex data from Maynard et al., 25 generated on the 103 Visium platform. The expression of 21,151 genes was profiled in 12 samples. Visium slides consist of a grid of 4,992 spots laid out in a hexagonal pattern. In this dataset, between 3,460 and 4,789 spots were covered by the tissue samples. Compared with single-cell methods, Visium spots are typically much more deeply sequenced but are lower resolution, containing several cells in most cases. To determine a ground-truth set of SVGs, we used the authors' annotation of cortical layers, calling differential expression between all pairs of adjacent layers, resulting in 7,775 SVGs on average.
Every method considered was able to scale to the Visium data by virtue of having a strict maximum number of spots. In future generations of the technology with more spots, this will become more problematic for SPARK and SpatialDE, for which this dataset pushes the limit on tractability. Overall, we demonstrate consistently improved accuracy using Maxspin ( Figure 3A). Following Maxspin, the two Gaussian process methods SpatialDE and nnSVG perform well, while SPARK-X was inconsistent, showing high accuracy for some of the 12 samples but greatly diminished for others. SPARK and BinSpect offer only very small improvements over Moran's I, while Geary's C and SpaGene tend to underperform it. SpaGFT, Sepal, and scGCO all performed very poorly with respective median D PR-AUC values of À0.52, À0.49, and À0.42. They were excluded from the plot for clarity. Sepal requires data on a regular grid, so we were only able to run it on this benchmark. In Figure 1C, a span of information scores computed by Maxspin is shown for one sample in this dataset.

MERFISH
A human lung cancer MERFISH dataset was taken from the Vizgen MERFISH FFPE Human Immuno-oncology dataset, 26 consisting of 500 genes measured across 270,968 cells. With a deliberately selected panel of genes and a large number of cells, many, if not all, genes will display some degree of spatial organization. To construct a benchmark with fewer and harder to detect SVGs, we downsampled counts to 100 per cell and then split the data into regions along a 30 3 30 grid, retaining any region with at least 300 cells, resulting in 508 regions. We jointly clustered cell types across the full data, manually selecting clusters with obvious spatial organization; we then called differential expression between pairs of spatially organized clusters separately on each region. The result is a region-specific lists of SVGs. Each method was then run on individual regions and compared with these lists.
Maxspin has the highest median improvement over Moran's I. Consistent with other benchmarks, nnSVG performs well, SPARK-X and SpatialDE are again seen to be somewhat inconsistent, and other methods underperform the baseline method, Moran's I. SpaGFT and scGCO performed very poorly, with respective median D PR-AUC scores of À0.37 and À0.30. These appear to fail in different ways: SpaGFT assigns very low p values to genes with only a small number of counts, while scGCO appears to assign a p value of roughly 0.5 to most genes.
We benchmarked using one additional MERFISH dataset, profiling a mouse primary motor cortex dataset. This also showed improved performance by Maxspin, though to a lesser degree due to the simpler patterns of spatial organization ( Figure S3).

CosMx SMI
CosMx SMI is a recently introduced platform from NanoString for spatial profiling of RNA and protein expression. We used CosMx Article ll data from He et al. 27 to evaluate detection of SVGs. Expression of 960 genes was measured across roughly 800,000 cells, across 8 human non-small cell lung cancer samples. We relied on the authors' annotation of tissue microenvironments. We tested individual fields of view for each sample, in part to create a variety of scenarios but also to allow SPARK and SpatialDE to be run on these data, as they ordinarily would struggle to scale to this many cells. Our ground-truth set of SVGs consisted of any gene that varied between any pair of microenvironments, which was on average 419 per field of view. Maxspin had consistently higher performance that competing methods ( Figure 3C). SPARK-X sometimes improved on Moran's I significantly, but as with the Visium data, it is inconsistent, producing poor results on other samples. The remaining methods show performances close to or sometimes below that of Moran's I. As with the MERFISH data, and in stark contrast to the Visium data, SpatialDE showed poor performance, with a median D PR-AUC of À0.197. Again, scGCO and SpaGFT also showed very poor performance, with median D PR-AUC values of À0.298 and À0.134, respectively. Figure 3D shows several examples of large disagreements between methods. Often methods will disagree on the relative significance of very small neighborhood of cells with elevated expression (e.g., CCL20), with Maxspin preferring patterns involving more cells. The Gaussian process methods seem to occasionally fail on examples with large unoccupied areas (e.g. OLFM4). SPARK-X produces some unpredictable results, failing to identify fairly obvious examples when they consist of sporadic small regions spread throughout the sample (e.g., ITGA3, TPSB2).
The probe set includes 20 negative control probes, which we expect to be unexpressed. Thus, any non-zero counts for probes represents pure technical noise that should not be detected as spatially varying. Figure 3E shows that while these are typically ranked lowly by all methods, the median rank for Maxspin is the lowest, with nnSVG very nearly tied.
In Figure S5, we show additional examples of spatial information scores computed by Maxspin on small sections of one sample. The highest scoring genes are within tumor regions, which is unsurprising given their specific spatial locality.

Spatially varying expression in renal cell carcinoma
We used Maxspin to investigate patterns of SVG expression in a CosMx dataset of a human renal cell carcinoma (RCC) sample, consisting of 178,410 cells and 993 genes, with an average of 222 counts per cell. The data involved both a large number of cells, which many existing methods fail to scale to, and relatively sparse counts, introducing uncertainty that many methods fail to account for. Maxspin is well suited to this data, as it scales to very large numbers of cells and is able to account for uncertainty under a variety of distributional assumptions.
We first identified cell types using unsupervised clustering of expression data alone. Count data were normalized using Sanity 28 and then clustered using Leiden. 29 This produced 24 clusters, to which we assigned names by comparing average expression across clusters, making use of an existing catalog of human kidney cell types 30 ( Figure 4C).
Computing spatial information across all cells will often recapitulate markers for known cell types. For example, the highest scoring gene in this dataset is the immunoglobulin gene IGHG1, revealing populations of B cells that we could have annotated by other means. To reveal more subtle spatial patterns, we instead separately ran Maxspin on each cluster to discover patterns of spatial variation that are less visible to standard clustering methods. Many examples of spatially coherent intra-cluster variations were found in the tumor cell clusters ( Figure 4B).
Many of these genes have been studied in the context of RCC. For example, GPX3, VIM, and IGFBP3 are known to be important markers in clear cell RCC, 31,32,33 but much less is known about their patterns of spatial expression within the tumor. DUSP5, which in varying contexts can act either as a tumor suppressor or promoter, 34 is seen to be highly expressed specifically in the upper left tumor region. Curiously, the adjacent tumor region shows distinctly high expression of the MT-RNR2 like-1 pseudogene. MMP7 is upregulated along the tumor-stroma boundary, consistent with its role in degrading extracellular matrix to facilitate tumor invasion. 35 In non-tumor cell clusters, a number of interesting examples of SVGs also emerge ( Figure 4A). Tumor-associated macrophages disproportionately show upregulation of SPP1 and downregulation of CD163, consistent with prior findings. 36,37 Along the tumor-stroma boundary, extracellular matrix genes fibronectin (FN1) and collagen type 4 alpha 1 (COL4A1) are highly expressed in fibroblasts and endothelial cells, respectively. The transcription factor X-box binding protein 1 (XBP1), which is involved in plasma cell differentiation, 38 shows spatially varying clusters of high expression within one of the B cell clusters. Within an epithelial cell cluster, SERPINA1 shows elevated expression surrounding one of the tumor regions but not the others.
A gene's disaggregated information score can be plotted as a saliency map, which provides clear visual explanations for why a gene scores highly. Examples of SVGs with a span of spatial information scores are shown in Figure S4. Not surprisingly, high expression is typically a necessary condition for a high information score, but it is not a sufficient condition, as some high expression genes have relatively low spatial information.
The spatial auto-information score can be trivially generalized to compute the spatial information between a pair of genes. Doing this for all Oðn 2 Þ pairs is typically intractable unless n is small, but if we first filter to consider only genes with high auto-information, this becomes a viable way to explore the spatial relationships between genes, which avoids coercing cells into distinct types or clusters where they might not always unambiguously fit.
We chose genes with a spatial information score above 20, resulting in 48 genes, and then computed pairwise information between these genes. Hierarchical clustering of this pairwise information matrix reveals specific relationships of spatial organization among these genes ( Figure 4D). Twenty of the genes were broadly tumor associated yet, as indicated by the structure of the dendrogram, show varying patterns of expression within the tumor. Outside of the tumor-associated group, we see some predictable groupings, which we have broadly identified as B cell/immunoglobulin, extracellular matrix, tubule, major histocompatibility complex (MHC), housekeeping, and long non-coding RNA (lncRNA). There is significant variation within these groups, for example the MHC class I gene B2M is in the same tree as four MHC class II genes, but at a much greater distance. Article ll OPEN ACCESS XBP1 shows a highly distinct pattern of expression, with low spatial information with any other of the genes. Taken together, this gives a rough map of the spatial organization of a subset of genes in this dataset.

Performance
To evaluate computational costs of the methods, we ran them on simulated data with varying numbers of cells. Since these methods test each gene individually, they all scale linearly with the number of genes, and thus we kept the number of genes fixed at 8,000.
The results of this benchmark are shown in Figure 5. SPARK was by far the most computationally expensive method in terms of time and memory. We ran benchmarks for SPARK only out to 5,000 cells, as it became exceedingly time consuming. Most methods scale roughly linearly with the number of cells. The exceptions are the Gaussian process methods SPARK and SpatialDE. The one other Gaussian process method, nnSVG, manages to achieve linear time inference but was still the second slowest method. SpatialDE overtakes nnSVG at 10,000 cells but is faster for fewer cells in part because it runs on a GPU.
To avoid giving Maxspin any opportunity for unfair advantage, we disabled convergence checks when optimizing and let it run for the maximum number of iterations. As a result, the results reported for Maxspin are strictly an upper bound on inference time. Regardless, it significantly outperforms Gaussian process methods, though it cannot match the performance of methods like Moran's I, with simple closed form computations.
SpatialDE and Maxspin differ from the other methods in that they run on GPUs. Though it is possible to run Maxspin without a GPU, it is not designed or optimized to do so. While not advised, running Maxspin without a GPU is possible and is still considerably faster than SPARK. GPU methods also have to work within the constraints of GPU memory, which is typically much more limited than system memory (e.g., 12 GB GPU memory vs. 64 GB system memory in the system we used for these benchmarks). Maxspin is designed to split work up and run in batches, avoiding an issue with memory constraints until spatial transcriptomics technology scales to orders of magnitude larger. SpatialDE, on the other hand, does rapidly run up against this limitation. Indeed, memory usage is inherently higher in Gaussian process inference, even when approximate inference methods are used.
All benchmarks were performed using Nvidia GeForce 3080 Ti GPU and AMD Ryzen 5950X CPU.

DISCUSSION
The spatial information score computed by Maxspin appears to better capture spatially coherent expression patterns than existing methods across various datasets and simulations. When presenting the results as D PR-AUC values, quantifying the improvement over Moran's I, a simple and well-established definition of spatial correlation, we see that specialized modern methods are not always an improvement. Methods aimed at detecting SVGs are sometimes guilty of disregarding decades of work in spatial statistics and fail to demonstrate a consistent advantage.
Taking the benchmarking results together, we see some patterns emerge. Methods that rely on binning or binarization of expression data-SpaGene, scGCO, and BinSpect-clearly discard some useful information and consistently underperform Moran's I. In the case of scGCO, this underperformance is dramatic, typically failing to detect any SVG that is not highly expressed, suggesting that binning is not the only issue. Uniquely among the tested methods, scGCO tries to first cluster cells into spatial domains and then use these domains to inform the test of spatial organization. Other methods such as SpaGCN 17 and STAMarker 18 follow a similar approach, except they detect SVGs specific to each of the identified spatial domains, which is difficult to compare directly with the methods discussed here. Conceivably, if the spatial domains are accurately identified and account for the bulk of spatial variation, this could be a sensitive test. Yet, if spatial variation is not confined to disjoint homogeneous domains and instead varies in a more complex fashion, there is potentially much these methods would miss.
Gaussian process methods-SpatialDE, nnSVG, and SPARK-often perform well. In the case of SpatialDE, it performed very well on the Visium benchmark, and very poorly elsewhere, producing a preponderance of 0 p values, which is a numerical issue that could conceivably be fixed. SpaGFT takes a unique approach by using the graph Fourier transform but generally fares poorly, particularly, it seems, in low-expression examples. SPARK, which uses exact inference, rapidly becomes intractable for large numbers of cells. SpatialDE also fails to scale, though it uses approximate inference. This leaves nnSVG as the clear state of the art in Gaussian process-based tests. However, each Gaussian process model lags behind Maxspin. This is in part due to Maxspin fitting a discriminative model rather than the generative models that the Gaussian process methods involve. Weaker distributional assumptions are involved, making it less prone to influence by small numbers of outlier cells.
SPARK-X is an appealing alternative that is highly efficient and adept at detecting certain spatial patterns. The set of coordinate transformation kernels it uses yields a test that is highly sensitive to, for example, circular regions and bands that span the sample, as demonstrated most distinctly in their simulation, but also in the Visium and an MERFISH brain samples, which have similar bands of distinct cell types. When spatial patterns get more subtle and local, most clearly shown in the cellular Potts simulation, it can entirely fail to detect them. Maxspin excels in this setting and also very nearly matches SPARK-X in sensitivity to the very simple, broad patterns that SPARK-X is specially tuned to.
Maxspin also offers improved explainability compared with more opaque statistical models. Because the information score is a sum across cells, the individual cell-level values can be visualized, showing precisely which cells and regions contribute to a high information score ( Figure 6). No other method yet proposed can be disaggregated in this way.
In our analysis of RCC using CosMx SMI, we demonstrated how Maxspin can be used to uncover patterns of expression in a high-dimensional yet relatively sparse dataset by scaling linearly with the number of cells and accounting for uncertainty in expression estimates. Though we largely focused on auto-information, pairwise information is a simple extension that allows genes to be clustered into an atlas of similar and dissimilar spatial patterns. As demonstrated by first clustering into cell types and then computing spatial information, spatial analysis often uncovers patterns of expression that traditional clustering approaches are insensitive to. The clearest indication of this is the several examples of spatially organized expression within the tumor regions. This variation in expression resists explanation by clustering into cell types and is revealed most clearly by gene-level spatial analysis.
The principle presented here, using a discriminative model to maximize a bound on JS divergence, has other potential uses that we have yet to fully explore. An obvious extension is to directly control for possible covariates by estimating a score similar to conditional mutual information. This is straightforwardly achievable by fitting the model twice, once including expression values and the covariate and again with just the covariate, and Article ll OPEN ACCESS then subtracting the two. A less trivial extension is to use Maxspin as an objective function in gating, clustering, or dimensionality reduction tasks. This adds a level of complexity because we would be simultaneously optimizing a function to assign cells to clusters (or to low-dimensional representations) and the discriminative bound on spatial information, but this is not so different from the setup in generative adversarial networks, 39 which have seen tremendous success despite there being a level of trickiness training such models.
By framing spatial coherence as spatial information and developing the tools for practical inference, we have presented a promising new perspective on spatial expression analysis and have taken the first step in exploring this idea by demonstrating a simple discriminative model that is more efficient and more accurate than state-of-the-art Gaussian process models at identifying SVGs High saliency corresponds to regions that contribute most to the overall spatial information score, typically by residing in a region of consistently low or high expression. Saliency here is defined as the accuracy with which a cell can be classified in repeated shuffled versus non-shuffled binary classification tasks. across many settings. Though not a replacement in all cases for generative probabilistic models, it is, in many cases, a more accurate and efficient alternative that is designed to scale with spatial transcriptomics technology as it inevitably expands to very large datasets in the coming years.
Limitations of the study Ultimately, ''spatial coherence'' or ''spatially varying'' expression can be subject to definitional disagreement. There may be different defensible answers to what constitutes a meaningful pattern of spatial organization. Other fields have grappled with these issues long before the recent relevance of spatial statistics in genomics. For example, the question of how best to measure segregation has been debated for decades in spatial demography. 40 The benchmarks presented here all rely on calling differential expression between identified transcriptionally distinct regions. While unlikely to capture all meaningful spatial variation, it is an intuitive baseline that we expect captures the most significant variations that should be uncontroversial as examples of SVGs. Further, these results hold up in our simulations in which the ground truth is known. Yet, for any given method, it is possible to design a benchmark that violates its assumptions and results in poor performance.
Maxpsin is designed with the intention of being run on a GPU, which adds some potential complication to using the software when compared with, for example, computing Moran's I. In a setting in which a GPU is unavailable and a dataset is very large, using Maxspin may be impractical. We hope the extensive benchmarks here will provide some guidance on a suitable alternative. If efficiency is critical, our results show that one can do much worse than Moran's I.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:   Article ll OPEN ACCESS Similar to, 42 we found that using Jensen-Shannon divergence in place of Kullback-Leibler divergence lead to somewhat better results. The resulting objective function can be thought of as a non-standard version of mutual information, which we refer to here simply as spatial information. It can also be rewritten as standard mutual information, though no longer between neighboring expression values, but instead the mutual information between observed pairs of expression values and 0-1 binary variable z indicating whether they have been shuffled or not. More formally, if we let ðx;x 0 Þ be drawn from a mixture distribution zp N ðx;x 0 Þ + ð1 À zÞpðxÞpðx 0 Þ, then the objective can be written as Both of these definitions hinge on how ''spatial neighborhood'' is defined. In the simplest form, we can consider two cells/spots neighbors if they are within some distance e. If s; s 0˛Rd are spatial coordinates corresponding to x;x 0 , respectively, we can marginalize out the spatial coordinates within the same neighborhood to arrive at a joint neighborhood expression distribution on fixed radius neighborhoods where p N ðs; s 0 Þ is chosen to assign higher probability to nearby coordinates. To recover Equation 4, we can make it a uniform distribution over pairs within distance ε of each other. A related notion of spatial mutual information occurs in Altieri, et al. 43 There, spatial mutual information is defined simply as the mutual information between a pair of observed values ðx; x 0 Þ and their spatial distance dðs;s 0 Þ, both of which are discretized in their setting. This is seemingly more straightforward as it avoids having to define a neighborhood distribution, but in a large dataset training on all n 2 pairs would be inefficient, and randomly sampled subsets of pairs will tend to be mostly distant. So in practice, this is potentially harder to estimate. Furthermore, the precise distance between a pair of cell is unlikely to be meaningful when those cells are far away from each other.
Another possible definition is to consider the mutual information between a single expression value and its entire spatial neighborhood. This may enable the detection of more elaborate spatial patterns, but has practical issues. Neighborhoods could easily be memorized by a model, overestimating mutual information, so the joint distribution would have to be very carefully modeled to avoid overfitting. We would also have to aggregate information across every node's neighborhood, using for example a graph neural network, 44 at every step of training and evaluation. The computational cost would thus limit how large of a neighborhood we could consider.
Computing either the JS or KL divergence has two major hurdles in practical applications. First, it necessitates finding a reasonable model for the joint and marginal probabilities. An overparameterized model for p N ðx; x 0 Þ risks memorizing the data or otherwise discovering spatial patterns that are too subtle to be meaningful, whereas an insufficiently expressive model will fail to find true patterns. Second, even with a suitable model in hand, we still have to compute the integral in Equation 2. In this work, we sidestep both of these issues by instead estimating a lower bound on the JS divergence using a simple discriminative model.

Approximating Jensen-Shannon divergence
A number of recent studies have focused on avoiding the problem of computing mutual information directly by deriving more tractable lower bounds. 21 These formalize the intuition that the easier it is to distinguish random draws form the joint distribution from random draws from the marginals, the higher the mutual information must be. Or, in the context of spatial transcriptomics: a gene's expression is spatially coherent if we can, with some reliability, predict whether a pair of expression values was drawn at uniform, or drawn from nearby cells.
A number of bounds on different divergence measures have been derived (for a review and comparison, see Tsai, et al. 45 ). For this work, we found the bound on Jensen-Shannon divergence proposed by Nowozin, et al. 46 to work well, which is equivalent to the binary cross-entropy objective used by Brakel, et al. 47 A Jensen-Shannon based variant of the standard mutual information can be defined as the Jensen-Shannon divergence between p N ðx; x 0 Þ and pðxÞpðx 0 Þ, and bounded below by where sðxÞ : = logð1 + exp ðxÞÞ is the softplus function, and f q is a classifier function with parameters q. Both expectations are taken over ðx; x 0 Þ with respect to either the joint neighborhood distribution or the product of the marginals. Intuitively, we want to train f q to distinguish samples drawn from the same neighborhood, using p N , and pairs that are independent draws from across the sample. The better it is able to distinguish these, the higher the lower bound on the Jensen-Shannon mutual information. Thus, expression patterns are spatially coherent to the degree that we can learn a pattern in nearby pairs.
Assuming we can efficiently sample from p and p N , and f q is differentiable with respect to q, this objective can be easily optimized using stochastic gradient descent, by drawing samples from the joint and marginals at each iteration, computing the objective in Equation 6, and backpropagating gradients to update q.
Concretely, each step of the optimization algorithm proceeds as follows, where x = ðx 1 ; .; x n Þ is a single gene's expression measured across n cells or spots, and s = ðs 1 ; .; s n Þ is their spatial positions.
I)ðrandomwalk k ðiÞ fori = 1; .; nÞ 8 k-step random walk from each cell x 0 )ðx I1 ; .; x In Þ 8 Expression at neighboring cells x)shuffleðxÞ 8 Randomly shuffle values across cells x 0 )ðx I1 ; .;x In Þ. 8 Neighboring expression in shuffled data d)ðks i À s Ii k 2 fori = 1; .; nÞ 8 Euclidean distance of each random walk q ) q + gV q À X n i = 1 s Here g is the step size. In practice we use the Adam 48 optimizer, rather than a fixed step size.
Testing for statistical significance Because Maxspin is based on attempting a binary classification between shuffled and non-shuffled, it is trivial to construct a nullhypothesis statistical test in addition to quantifying spatial information. The natural null hypothesis is that the classifier can do no better than random guessing, and thus the number of correctly classified examples k is drawn from a Binomialðn; 0:5Þ distribution. The p value is then computed from the CDF of the Binomial distribution. Most genes will either have low expression or display some level of spatial organization, if the spatial context examined large enough. Because of this, examining spatial information will often be more informative than binarizing into SVG or non-SVG using a null hypothesis test. This is doubly true for reasons of numerical precision: p values computed for highly spatially organized genes will tend to saturate floating-point arithmetic and produce many zeros. As noted, SpatialDE, among other methods, suffered from this phenomenon, reducing its performance in our benchmarks.

Choosing neighborhood joint distribution
The definition of ''spatial neighborhood'' is determined by the neighborhood joint distribution p N . Optimizing the Jensen-Shannon lower bound using stochastic gradient descent necessitates a distribution that can be efficiently sampled from. In Equation 5 we showed how joint neighborhood expression can be defined in terms of a joint distribution over pairs of positions pðs; s 0 Þ defining a soft neighborhood, where s; s 0 are spatial positions of cells.
In Maxspin, we use the distribution induced by considering random walks of a fixed length between s and s 0 . A neighborhood graph can be formed by Delaunay triangulation, fixed distance neighborhoods, or according to a grid in the case of Visium data. From this graph, a transition matrix is formed where each neighbor is visited with equal probability. We can then define the joint distribution over positions as pðs; s 0 Þ : = p k ðs 0 jsÞpðsÞ where pðsÞ is a uniform distribution over cell positions, and p k ðs 0 jsÞ is the probability of starting at the cell at position s and arriving at the cell at position s 0 after k random steps.
Larger values of k will consider a larger neighborhood, and thus be more tuned to finding larger scale spatial patterns, and a small k will be more tuned to finding small scale patterns. Throughout this paper we used k = 10, finding the results to be not particularly sensitive to this parameter, so long as it was not very small (less than 3). The limiting stationary distribution is a uniform distribution over pairs of nodes, so at a certain point performance will degrade if k is too large, but the computational burden of sampling such long walks becomes an issue before that point.
This random walk distribution is largely equivalent to the p-step random walk kernel discussed by Smola, et al., 49 which is proposed as a more practical alternative to their graph diffusion kernel.
It has yet to be established definitively whether definitions of spatial locality that are graph based, as used here, or distance based, as used for most Gaussian process covariance functions, are a better fit for spatial transcriptomics in general. Graph based approaches are simpler though, since there is no need to try to calibrate the scale of the kernel function to the spatial coordinates which come in a variety of units (pixels, grid indices, micrometers, etc). As with common distance based kernel functions, random walk based kernels also emphasize nearness, as a walk between two distant cells is less probable.