Optimizing the design of spatial genomic studies

Spatial genomic technologies characterize the relationship between the structural organization of cells and their cellular state. Despite the availability of various spatial transcriptomic and proteomic profiling platforms, these experiments remain costly and labor-intensive. Traditionally, tissue slicing for spatial sequencing involves parallel axis-aligned sections, often yielding redundant or correlated information. We propose structured batch experimental design, a method that improves the cost efficiency of spatial genomics experiments by profiling tissue slices that are maximally informative, while recognizing the destructive nature of the process. Applied to two spatial genomics studies—one to construct a spatially-resolved genomic atlas of a tissue and another to localize a region of interest in a tissue, such as a tumor—our approach collects more informative samples using fewer slices compared to traditional slicing strategies. This methodology offers a foundation for developing robust and cost-efficient design strategies, allowing spatial genomics studies to be deployed by smaller, resource-constrained labs.

Note that allowing us to split up the second term: This yields the alternative form of IG due to the symmetry of mutual information (MI):
This is easily extended to the setting where we are choosing the design $x$ for the $t$-th experimental iteration, and we have already observed data $Y_{1:t-1}$. In this case, our "prior" is the posterior for $\theta$ given $Y$ where $x \in \mathbb{R}^p$.
Suppose our design space is $\mathcal{X} \subseteq \mathbb{R}^p$, and on each iteration we choose a single design $x$.
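For orientation, the identities this passage refers to can be written in a standard textbook form. The block below is a sketch under assumed notation, not a reproduction of the original equations: $p(\theta)$ is the prior, $p(y \mid \theta, x)$ is the likelihood under design $x$, $p(y \mid x)$ is the marginal likelihood, and $H[\cdot]$ denotes (differential) entropy.

```latex
\begin{align*}
\mathrm{EIG}(x)
  &= \mathbb{E}_{p(y \mid x)}\big[ H[p(\theta)] - H[p(\theta \mid y, x)] \big] \\
  &= \mathbb{E}_{p(\theta)\,p(y \mid \theta, x)}\big[ \log p(\theta \mid y, x) - \log p(\theta) \big] \\
  &= \mathbb{E}_{p(\theta)\,p(y \mid \theta, x)}\big[ \log p(y \mid \theta, x) - \log p(y \mid x) \big]
     && \text{(Bayes' rule)} \\
  &= H[p(y \mid x)] - \mathbb{E}_{p(\theta)}\big[ H[p(y \mid \theta, x)] \big].
\end{align*}
```

The first line measures the expected reduction in entropy over $\theta$; the last line measures the same quantity in the space of observations, which is the symmetry of mutual information. For iteration $t > 1$, the same chain should hold with $p(\theta)$ replaced by the posterior given $Y_{1:t-1}$.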
Recall the differential entropy of a $D$-dimensional multivariate normal distribution with mean vector $m$ and covariance matrix $C$: $H[\mathcal{N}(m, C)] = \frac{D}{2}\log(2\pi e) + \frac{1}{2}\log|C|$. We can plug the entropy into the equation for the EIG (Equation A.2) to compute the EIG under the GP regression model.
On the first experimental iteration (before observing any data), we can compute the EIG using Equation A.2 as follows:
Factoring out $\tau^2 I$ in the log of the first term leads us to cancel the second term as follows:
On experimental iteration $t > 1$, we can compute the IG using Equation A.3:
where $\Sigma(X)$ is the posterior covariance of the GP, given by
Following a similar series of calculations, we obtain
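The GP regression calculation described above can be sketched numerically. The example below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a squared-exponential (RBF) kernel, Gaussian noise with variance `tau2`, and a candidate design represented as a batch of locations `X_new`. The function names are hypothetical. Under these assumptions the EIG reduces to half the log determinant of $I + \Sigma(X)/\tau^2$, where $\Sigma(X)$ is the prior covariance on the first iteration and the posterior covariance afterwards.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between rows of A and rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def gp_posterior_cov(X_new, X_obs, tau2, lengthscale=1.0, variance=1.0):
    """Covariance of the latent function at X_new given noisy data at X_obs."""
    K_nn = rbf_kernel(X_new, X_new, lengthscale, variance)
    if X_obs is None or len(X_obs) == 0:
        return K_nn  # first iteration: the prior covariance
    K_no = rbf_kernel(X_new, X_obs, lengthscale, variance)
    K_oo = rbf_kernel(X_obs, X_obs, lengthscale, variance) + tau2 * np.eye(len(X_obs))
    return K_nn - K_no @ np.linalg.solve(K_oo, K_no.T)


def gp_eig(X_new, tau2, X_obs=None, lengthscale=1.0, variance=1.0):
    """EIG of observing X_new: 0.5 * log det(I + Sigma(X_new) / tau2)."""
    Sigma = gp_posterior_cov(X_new, X_obs, tau2, lengthscale, variance)
    _, logdet = np.linalg.slogdet(np.eye(len(X_new)) + Sigma / tau2)
    return 0.5 * logdet
```

A design can then be chosen by evaluating `gp_eig` over a set of candidate location batches and keeping the maximizer. Locations close to previously observed ones score lower because their posterior variance has already shrunk.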

A.1.3 Variational inference approach
In the border-finding application, we approximate the posterior $p(\theta \mid y)$ using variational inference [1], where the variational family is the set of multivariate Gaussians with diagonal covariance matrices. Specifically, $q(\theta) = \mathcal{N}_K(\theta \mid \eta, \operatorname{diag}(\psi))$, where $K$ is the number of model parameters, $\eta$ is the variational mean, and $\psi$ is the vector of variational marginal variances. We then minimize the KL divergence from the approximate posterior to the true posterior with respect to the variational parameters, $\phi = \{\eta, \psi\}$. This is equivalent to maximizing the evidence lower bound (ELBO). To see this, note that the KL divergence from $q(\theta)$ to $p(\theta \mid y)$ can be split into the log evidence and the ELBO:
\[ D_{\mathrm{KL}}\big(q(\theta) \,\|\, p(\theta \mid y)\big) = -\mathbb{E}_q\left[\log \frac{p(\theta \mid y)}{q(\theta)}\right] = -\mathbb{E}_q\left[\log \frac{p(y, \theta)}{q(\theta)\, p(y)}\right] = \log p(y) - \mathbb{E}_q\left[\log \frac{p(y, \theta)}{q(\theta)}\right] \geq 0. \]
The KL divergence is nonnegative, so we obtain a lower bound on the log evidence:
\[ \log p(y) \geq \mathbb{E}_q\left[\log \frac{p(y, \theta)}{q(\theta)}\right] =: \mathcal{L}(\phi). \tag{A.5} \]
We maximize the ELBO $\mathcal{L}(\phi)$ with respect to $\phi$ using stochastic variational inference [1].
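To make the ELBO maximization concrete, the sketch below fits a diagonal-covariance Gaussian $q(\theta)$ by stochastic gradient ascent with reparameterized samples. This is one common way to optimize the ELBO and may differ from the procedure the authors used. The function names and the toy Gaussian target are hypothetical and chosen only because the optimum is known in closed form. The variances are parameterized through log standard deviations `rho`, so $\psi = \exp(2\rho)$ stays positive.

```python
import numpy as np


def fit_mean_field_gaussian(grad_log_joint, K, n_steps=2000, n_samples=32,
                            lr=0.01, seed=0):
    """Maximize the ELBO over q(theta) = N(eta, diag(psi)) by stochastic ascent.

    grad_log_joint maps an (n_samples, K) array of theta values to the
    gradient of log p(y, theta) at each row (same shape).
    """
    rng = np.random.default_rng(seed)
    eta = np.zeros(K)  # variational mean
    rho = np.zeros(K)  # log standard deviations; psi = exp(2 * rho)
    for _ in range(n_steps):
        eps = rng.standard_normal((n_samples, K))
        std = np.exp(rho)
        theta = eta + std * eps          # reparameterized draws from q
        g = grad_log_joint(theta)
        grad_eta = g.mean(axis=0)
        # Chain rule through theta, plus d/d(rho) of the entropy of q, which is 1.
        grad_rho = (g * eps * std).mean(axis=0) + 1.0
        eta += lr * grad_eta
        rho += lr * grad_rho
    return eta, np.exp(2.0 * rho)


# Toy target (hypothetical): log p(y, theta) = -0.5 (theta - m)^T P (theta - m) + const.
P = np.array([[2.0, 0.5], [0.5, 1.0]])
m = np.array([1.0, -1.0])


def toy_grad_log_joint(theta):
    return -(theta - m) @ P  # P is symmetric


eta, psi = fit_mean_field_gaussian(toy_grad_log_joint, K=2)
```

For a Gaussian target with precision matrix `P`, the mean-field optimum has mean `m` and variances `1 / diag(P)`, so the fitted `eta` and `psi` should land near those values up to stochastic-gradient noise.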

A.2.1 Allen Brain Atlas experiment
Data preprocessing. To filter out voxels outside of the tissue, we removed spatial locations with intensity less than two. The spatial coordinates were centered to have a mean of zero, and the intensities for each gene were standardized by subtracting the mean and dividing by the standard deviation.
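The preprocessing steps above can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the function name is hypothetical, `coords` is taken to be an array of spatial coordinates with one row per voxel, and `expr` an array of intensities with one column per gene. The text does not say which intensity the threshold is applied to, so the sketch assumes the mean intensity across genes at each location. It also assumes the standardization statistics are computed after filtering.

```python
import numpy as np


def preprocess(coords, expr, min_intensity=2.0):
    """Filter low-intensity voxels, center coordinates, standardize each gene."""
    coords = np.asarray(coords, dtype=float)
    expr = np.asarray(expr, dtype=float)

    # Assumption: a location is kept when its mean intensity across genes
    # is at least `min_intensity`.
    keep = expr.mean(axis=1) >= min_intensity
    coords, expr = coords[keep], expr[keep]

    # Center spatial coordinates to mean zero along each axis.
    coords = coords - coords.mean(axis=0)

    # Standardize each gene (column) to mean zero and unit standard deviation.
    expr = (expr - expr.mean(axis=0)) / expr.std(axis=0)
    return coords, expr, keep
```

A gene with constant intensity across the retained voxels would produce a division by zero in the last step, so such genes would need to be dropped beforehand.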

A.2.2 Visium prostate cancer experiment
The provided graph-based cluster labels were used to label the spots containing carcinoma cells (spots belonging to Cluster 1 were labeled as carcinoma).