Comprehensive visualization of cell–cell interactions in single-cell and spatial transcriptomics with NICHES

Abstract Motivation Recent years have seen the release of several toolsets that reveal cell–cell interactions from single-cell data. However, all existing approaches leverage mean celltype gene expression values, and do not preserve the single-cell fidelity of the original data. Here, we present NICHES (Niche Interactions and Communication Heterogeneity in Extracellular Signaling), a tool to explore extracellular signaling at the truly single-cell level. Results NICHES allows embedding of ligand–receptor signal proxies to visualize heterogeneous signaling archetypes within cell clusters, between cell clusters and across experimental conditions. When applied to spatial transcriptomic data, NICHES can be used to reflect local cellular microenvironment. NICHES can operate with any list of ligand–receptor signaling mechanisms, is compatible with existing single-cell packages, and allows rapid, flexible analysis of cell–cell signaling at single-cell resolution. Availability and implementation NICHES is an open-source software implemented in R under academic free license v3.0 and it is available at http://github.com/msraredon/NICHES. Use-case vignettes are available at https://msraredon.github.io/NICHES/. Supplementary information Supplementary data are available at Bioinformatics online.


Workflow
The main function of our method is RunNICHES ( Figure S1). It takes a single-cell or spatial gene expression matrix and the associated cell metadata as input and the external ligand-receptor mechanism databases as reference, and builds different cell interaction matrices (e.g. the Cell-Cell Matrix and NICHES Matrix) as specified by the user. It further aggregates and assembles the userspecified metadata accordingly. The exact matrix computation is described in detail in the Supplementary section Cell-Cell Matrix Construction and Niche Matrix Construction.
For the output, users can easily convert the cell interaction matrices and other information into formats for different single cell data analysis frameworks as they prefer, as demonstrated in our online vignette (https://msraredon.github.io/NICHES/articles/08%20Interoperability%20-%20NICHES.html), and then perform downstream analysis.
NICHES generates all user-requested cell-interaction matrices along with associated compiled metadata required for downstream data visualization, clustering, and differential analysis. The outputs of NICHES include: -Assay: Stores the cell interaction matrix and associated data transformations.
-Metadata: Stores the aggregated metadata information, as mentioned in the Supplemental Methods, for every input metadata category (e.g., cell types, sample, batch). The aggregated metadata will include the sending metadata, receiving metadata, and sending-receiving metadata as appropriate for a given output. We argue that by carrying over the metadata (i.e., sample labels, coarse-and fine-grain cluster labeling, experimental conditions) from source data, NICHES allows rapid downstream differential analysis between already tagged groupings of cells.
absence of downstream signal inhibition, and subsequent phenotypic response, which is cellstate dependent. We think of NICHES as quantifying the mechanistic 'wavelengths' that a sending and receiving cell are jointly 'tuned' to.
2) In its present form, NICHES multiplies ligand expression on sending cells with receptor expression on receiving cells. This operator was chosen because it is zero-preserving. We would like to note that simple multiplication of ligand and receptor expression values does not necessarily reflect the modulation of phenotypic responses by involved cells. Such effects are often non-linear and do not necessarily correlate with ligand and receptor expression. However, multiplying sending and receiving values does provide a reasonable big-picture proxy of cellular connectivity within a dataset that allows organization of the data, similarly to practices in graph-network-based approaches that are widely used in genomics. Multiplication of ligand and receptor information, either via product or geometric mean, is used as part of the workflows in CellChat (Jin, et al., 2021), iCellNet (Noël, et al., 2021), NATMI (Hou, et al., 2020), SingleCellSignalR (Cabello-Aguilar, et al., 2020, SoptSC (Wang, et al., 2019), CellCall (Zhang, et al., 2021), and CommPath (Lu, et al., 2022), among others (Almet, et al., 2021).
3) When mechanisms are queried which containing more than one ligand subunit or more than one receptor subunit, NICHES multiplies the expression of all ligand subunits to yield a single ligand-mechanism expression value and multiplies the expression of all receptor subunits to yield a single receptor-mechanism expression value (see Figure S1 for a diagram of this computation). We have chosen this approach so that zero expression of any subunit within a given mechanism yields zero connectivity for that mechanism. It should be noted, however, that the sparse nature of single-cell transcriptomic data means that NICHES may output zero connectivity for a given mechanism between two cells if all subunits were not captured in a cell used for a given cell-cell pairing. Imputation prior to running NICHES can lessen this artifact, at the cost of using pseudo-values to calculate connectivity. For a demonstration of this difference, please see https://msraredon.github.io/NICHES/articles/02%20NICHES%20Single.html 4) The SystemToCell and CellToSystem outputs estimate connectivity between an entire single-cell system and each individual cell within that system. These tools are designed to quantify changes in system signaling character due either to altered cellular gene expression or to altered cellular distribution or representation. These tools use the mean operator to combine ligand information from multiple sending cells and/or to combine receptor information from multiple receiving cells. 5) When spatial information is provided, NICHES allows users to limit analysis to connectivity within either a defined radius or to local nearest-neighbor communities. However, when spatial information is not provided, NICHES treats all barcodes within an input system as available for sampling and analysis. If cells have been captured which are not to be treated as within a biological system of interest, we recommend removing these cells from the input data prior to running the SystemToCell or CellToSystem functionality, as otherwise ligand or receptor information from these cells will be included downstream. 6) We recommend standardizing total input cell number to maximize the accuracy of cross-system system-level comparisons. We make this recommendation because artifacts may otherwise theoretically be created in instances in which variable cell-number denominators act on constant system-expression numerators for a given mechanism. 7) Batch effects which cause biologically artifactual variance in ligand and receptor expression will affect the outputs from NICHES. This should be noted by users.
8) Differential testing may be performed on NICHES output data. If comparing experimental conditions containing multiple batches, users may run NICHES either for each batch independently or for an entire condition (containing multiple batches) at once. Running NICHES on individual batches will preserve batch effects for downstream analysis. Conversely, running NICHES on an experimental condition containing multiple batches will have the effect of smoothing cell-signaling batch effects, since cells will be paired with other cells within their condition but not necessarily within their specific batch. 9) If a single dataset is input into NICHES which contains information across multiple experimental conditions, cells may be paired with cells outside of their respective condition. We currently do not recommend doing this unless such patterns are deliberately of interest to a user.

Supplemental Findings
Simulation 1: NICHES reveals hidden intra-cluster signaling heterogeneity To demonstrate NICHES's ability to discover intra-cluster heterogeneity, we generated a synthetic single-cell RNA-seq dataset with 400 cells equally divided into 2 cell types (C1 and C2), which can be separated by their marker genes ( Figure S3, A, top).
Additionally, there are 2 subtypes S1 and S2 within C1 and 2 subtypes S3 and S4 within C2 ( Figure S3, A, bottom). By design, S1 and S2 interact differentially with S3 and S4 through 2 ligand-receptor mechanisms. Specifically, S2 has a higher expression of A2m ligand gene compared to S1 (corresponding receptor gene is Lrp1 that is highly expressed in S3 and S4), and S3 has a higher expression of receptor gene Agtr2 compared to S4 (corresponding ligand gene is Ace which is highly expressed in S1 and S2), as shown in Figure S3, B. Due to the subtle gene expression difference, one cannot distinguish the subtypes based on the overall gene expression profiles. But because the nuance involves ligand and receptor genes, we may apply NICHES to capture this signal.
Figure S3, C, shows the cell-cell matrix output of NICHES, embedded in 2D mechanism space, clustered via k-means (top) and labeled by ground-truth subtype interaction (bottom). Sending cells are all from C1 and receiving cells are all from C2, and the space is 2-dimensional because there are only 2 ligandreceptor mechanisms in this simulation: A2m-Lrp1 and Agtr2-Ace. Note that the clustering results ( Figure S3, C, top) are almost completely identical to the ground truth subtype interaction pair labels ( Figure S3 C, bottom). For reference, we have added a single black point representing the single mean connectivity value between C1 and C2 which results from using a mean-wise computational method to assess ligand-receptor connectivity in this simulation.
Simulation 2: NICHES preserves information regarding differential signaling distributions in disparate experimental conditions Next, we sought to demonstrate that NICHES can register how cellular archetype shift influences cellcell connectivity. We simulated 2 scRNAseq cases, both of which contain 2 cell types (C1 and C2), as shown in Figure S4, A. In both cases, we simulated 1 active ligand-receptor channel between C1 and C2 (C1 expresses the ligand gene A2m and C2 expresses the receptor gene Lrp1). C1 expression level of ligand A2m is sparse but high in Case 1, and broad but low in Case 2, with similar mean values ( Figure  S4, B). By design, the expression level of receptor by population C2 is identical in both cases. Because the mean expression level of A2m is similar, the mean connectivity of A2m-Lrp1 is similar between the 2 cases ( Figure S4, C). NICHES, however, is able to detect this significant difference in connectivity ( Figure S4, D).
Simulation 3: NICHES allows high-dimensional visualization of altered systems-cell topology due to addition or removal of cells Next, we sought to evaluate the ability of NICHES to map changes in system-to-cell signaling topology. We simulated two cell systems to be compared. The first contains two cell types (C1 and C2, Figure S5, A) while the second contains 3 cell types (C1, C2, and C3, Figure S5 B). C1 and C2 are identical in character in each case, and are connected via two simulated mechanisms, A2m-Lrp1 and Ace-Agtr2, where C1 cells express the ligands A2m and Ace and C2 cells express the receptors Lrp1 and Agtr2. C3, which is only present in the three-cell system, also expresses ligands A2m and Ace. From the perspective of a receiving cell (from population C2, in this simulation), the mean expression of ligand is higher in the three-cell system as compared to the two-cell system. NICHES quantifies this difference and allows analysis and visualization between conditions ( Figure S5, C).

Figure S1: NICHES data infrastructure and workflow
NICHES takes a single-cell or spatial gene expression matrix (columns are cells, rows are genes) and the associated cell metadata (e.g., cell types, sample IDs, batch information) as input. By referencing the ligand-receptor interaction mechanisms via external databases (e.g., Fantom 5, OmniPath, or other user-customized databases), NICHES builds cell interaction matrices (e.g., Cell -Cell Matrix where columns are cell pairs and rows are interaction mechanisms, and Niche Matrix where columns are niche environments and rows are interaction mechanisms) via RunNICHES function. These new cellsignaling objects may be output as either Seurat objects from the Seurat package or as tabular matrices that are cross-compatible with multiple single-cell platforms. Here, we schematize the outputs from NICHES and additional data forms which may be computed from these outputs. Cell interaction matrices are output (Assay) along with other useful information (e.g., aggregated Metadata) and may be processed to yield data embeddings (Reductions) and nearest neighbor graphs (Graphs) for downstream analysis (e.g., data visualization, clustering, and differential analysis).

Figure S2: NICHES connectivity for mechanisms with multiple subunits
NICHES is capable of calculating intercellular connectivity for ligand-receptor mechanisms with any number of subunits. For a given mechanism, ligand subunit expressivities on the sending cell are multiplied together and receptor subunit expressivities on the receiving cell are multiplied together.
These two values are then multiplied to yield mechanism connectivity between the sending and receiving cell. This operation is zero-preserving by design, so that the lack of expression of even a single subunit on either the sending or receiving cell within a given cell-cell pairing will cause a connectivity value of zero.

Figure S3: NICHES captures heterogeneity in cell-cell connectivity
In this simulation using synthetic data (see Methods) there are two celltypes labeled C1 and C2 containing signaling subtypes S1-S4 which do not resolve in gene space (A). These subtypes communicate in distinct ways via two distinct signaling mechanisms: A2m-Lrp1 and Ace-Agtr2 (B). Subpopulation S2 expresses ligand A2m higher than S1 while subpopulation S3 expresses receptor Agtr2 higher than S4. This expression pattern creates four distinct cell-cell signaling relationships even though only two celltypes have been considered. NICHES allows rapid observation of these distinct relationships using two-dimensional embeddings and k-means clustering (C, top) which closely matches the ground truth subtype pairings in this simulation (C, bottom). Mean connectivity between C1 and C2 is represented in black in the lower panel of (C). A biological counterpart to this simulation is show in Figure 1F.

Figure S4: NICHES captures cross-condition archetype shift in cell-cell connectivity
In this simulation using synthetic data (see Methods), we compare two cell-systems (A, Case 1 and Case 2) representing different experimental conditions or tissues containing the same number of cells, with the same celltypes present, and similar mean connectivity for a given signaling mechanism. Case 1 cells express ligand sparsely but highly, while Case 2 sending cells express ligand broadly but lowly (B).
Receptor expression is identical in each case. While mean connectivity is nearly identical (C), NICHES captures the significantly different ground-truth connectivity between the two cases (D).

Figure S5: Differential System-to-Cell Signaling
(A-C) represent a simulation using synthetic data (see Methods), to demonstrate the capability of NICHES to measure altered system-cell signaling topology due to the addition or removal of cells. In Case 1 we have a two-cell system containing communicating cell types C1 and C2 (A). In Case 2, a third cell type (C3) has entered the system which expresses ligands cognate to receptors on C1 and C2 (B). When we use NICHES to calculate system-to-cell signaling within each case, we see a clear shift in the character of the sensed environment of celltype C2 due to the altered mean ligand expression within the system. This functionality of NICHES empowers the study of complex biological questions, such as how added, aberrant or infiltrating cells might affect the microenvironment of a receiving celltype across experimental conditions or disease states.
(D-H) represent an applied biological example of this same use-case. In this demonstration, using data from (Adams, et al., 2019), two separate tissue systems are considered: a Control system containing 6 cell types, and an IPF condition which contains an additional aberrant cell population (annotated as "Aberrant_Basaloid") which is not identified in the control (D). We use NICHES to ask how this change in community structure influences the connectivity associated with receiving fibroblasts. (E-F) are visualizations of the System-To-Cell output from NICHES, colored by Condition and Receiving Cell Type, respectively. The measurements associated with receiving fibroblasts may be isolated (G) and differential expression testing performed to reveal mechanisms that are more-or less-prominently engaged in the fibroblast milieu across experimental condition.
Scripts to replicate the above analyses are available in the software releases associated with this publication as well as online here: https://msraredon.github.io/NICHES/articles/09%20System%20Effects%20of%20Aberrant%20Cells.ht ml

Figure S6: NICHES Reveals Intra-Celltype Microenvironment Heterogeneity
A) Spatial transcriptomic data labeled by dominant celltype (see Methods). B) UMAP embedding of cellular niche for each transcriptomic location. C) Sub-clustering of the L5 IT niche represented spatially and D) within UMAP space. Exploration of marker mechanisms reveals niche interactions specific to the microenvironments within each subcluster (E-I).

Figure S7: NICHES runtime and scalability
NICHES is designed to allow rapid analysis of cell-cell signaling patterns. Runtime scales linearly with output edge number. Edge number is a function of input cell number and cell type number and is dataset specific.

Dependencies
The internal workings of NICHES are dependent on a large number of other R software packages, in particular dplyr (Wickham, et al., 2019) and Seurat (Butler, et al., 2018). A maintained list of dependencies can be viewed in the DESCRIPTION file.

Mathematical and Computational Formalism
First, we define basic notations: Table S1. Notations Notation Description and terminology = { ! , " , … , # ! }. An ordered set of cells in the system where $ is the total number of cells in the system. In the spatial transcriptomic datasets, we also use to represent each measurement (e.g. spots). is the gene expression level of receptorfrom Cell . = 6 ! , " , … , # * 8. An ordered set of ligand-receptor mechanisms where / is the total number of mechanisms. Each mechanism + has a corresponding ligand + and receptor + = [ ! , " , … , # * ]. A vector of reference ligands. Each + has a corresponding mechanism + from in which it participates. + can consist of multiple subunits. = [ ! , " , … , # * ]. A vector of reference receptors. Each + has a corresponding mechanism + from in which it participates. + can consist of multiple subunits. ∈ {0,1} , unweighted and directed adjacency matrix that indicates which cells are connected and can interact in the system. For instance, %. = 1 indicates that Cell and Cell are connected and the ligand signals of Cell can be received by the receptors of Cell Given the gene expression data of a cell system , along with a list of known ligand-receptor mechanism , we aim to define a vector $ " $ ( ∈ ℝ # * for every connected cell pair cell and cell in a pre-defined cell adjacency matrix , such that $ " $ ( can characterize the / -dimensional ligandreceptor interaction profiles between cell sending signal via ligand and cell receiving signal via its receptors.

Cell-Cell Matrix Construction
To construct the Cell-Cell Matrix, we define where the mechanism + consists of ligand + and receptor + . We choose the multiplication operation so when the ligand or the receptor are not expressed the product is zero, representing no cell-cell signaling.
We concatenate the $ " $ ( Cell-Cell Interaction vectors to construct the Cell-Cell Matrix: where / is the total number of mechanisms and 2 = ∑ is the total number of (directed) connected cells.
can be used as input to many computational analysis pipelines, including dimensionality reduction, clustering, differential expression, and trajectory inference, etc., to investigate cell-cell interactions at the individual cell level.

Computing the Adjacency Matrix
One step before computing a Cell-Cell Matrix is to compute the adjacency matrix . For single-cell RNA-seq datasets we assume a fully connected cellular system. However, the computational complexity of becomes I K, which greatly hinders the application of Cell-Cell Matrix onto cellular systems of large number of cells (e.g. $ > 1 × 10 6 ).
To reduce the complexity, we adopt a random sampling scheme to down-sample edges and to compute a new M as follows: Let's denote the set of cell type labels in the system by = 6 ! , " , … , # 7 8 where 7 is the total number of cell types. The set of cells associated with each type is denoted by = 6 ! , " , … , # 7 8. For each pair of cell types within {( + , 0 )| = 1,2, … , 7 ; = 1,2, … , 7 }, we draw 2 sets of cells 89:,; ' and 89:,; , from cells of cell type + and 0 uniformly, i.e., For spatial transcriptomic datasets we constrain the interactions among cells to be only within a certain distance threshold or a certain local neighborhood. To be more specific, let us denote as the Euclidean distance matrix among cells computed from the spatial locations, where %. is the distance between Cell and Cell . Given a distance threshold , %. is computed as %. = ^1 , %. ≤ 0, for each entry of for spatial transcriptomic datasets. Alternatively, the user can specify the parameter k which computes a k-nearest neighbor (knn) graph from and the adjacency matrix will be computed as a mutual nearest neighbor graph from this knn graph, i.e., as %. = ^1 , , ℎ 0,

Niche Matrix Construction
Besides the base cell-cell interaction formulation, we extend our original definition of the Cell-Cell Matrix to investigate cellular niche and cellular influence interactions.
Specifically, we define the Niche Matrix as: ∈ ℝ # * ×# ! , where / is the total number of mechanisms and $ is the total number of cells in the system. A column vector of is defined as , each column of is a / -dimensional vector that characterizes the interaction profiles between cells sending ligand signals to Cell which possesses the relevant receptors to receive these signals.
The connectivity value on one mechanism (e.g. + ) between sending cells and Cell is defined as where the mechanism + consists of ligand + and receptor + , * ' denotes the row vector of + 's expression levels across the cells that are connected to Cell , and () is a vector operator which, in our implementation, can be either (default) or .
Similarly, we define the Influence Matrix as: ∈ ℝ # * ×# ! where each column of is a / -dimensional vector that characterizes the interaction profiles between Cell that sends the ligand signals and the cells receiving from it. Each connectivity value between Cell and the system is defined as where the mechanism + consists of ligand + and receptor + , , ' denotes the row vector of + 's expression levels across the cells that connect to Cell in the system, and () is again either (default) or .
For single-cell RNA-seq datasets without spatial coordinates, we assume a fully connected involving all cells measured within a system. For spatial transcriptomic datasets, we construct in the same fashion as for the spatial Cell-Cell Matrix, limiting edges to neighbors within radius or within a userdefined set of nearest neighbors.

Metadata Mapping
NICHES allows the researcher to carry over any and all metadata (i.e. sample labels, coarse-and finegrain cluster labeling, experimental conditions) from source data, allowing rapid downstream differential analysis between already tagged groupings of cells. Because each Cell-Cell Matrix contains many individual measurements of cell pairings (or environmentcell pairings in the Niche Matrix), differential analysis can be used to reveal ligand-receptor mechanisms preferential to a given celltype-celltype pairing within a system, or to identify top differential signaling mechanisms across subject, disease state, experimental condition, or tissue. Such calculations may be performed for specific celltype-celltype pairings or for other custom groupings as the user desires, based on mapped metadata. We currently recommend using ROC analysis to measure how well a mechanism differentiates two groups compared to standard two-sample tests when the columns in Cell-Cell Matrix or Niche Matrix are no longer independent.

Methods for Application to Synthetic Data
We generate 3 simulation datasets (Simulation 1, Simulation 2, Simulation 3) for 3 separate simulation analyses ( Figure S3, S4, S5) respectively. For each dataset, we simulate 3 categories of genes: signaling genes (ligands and receptors), 50 non-signaling marker genes to differentiate each cell type, and 5000 noise genes. We assume the genes in the datasets follow negative binomial (NB) distributions parametrized by parameter which characterizes the mean expression level, and the dispersion parameter . We describe the exact parameter settings for each dataset as follows:    NB( =10, =20) NB( =20, =20) NB( =30, =20) Noise genes (5000 genes) NB( =15, =20) Methods for Application to Native Lung scRNAseq Data Data was downloaded from (Raredon, et al., 2019), subset to 4 main populations of interest, and run through standard principle component analysis (PCA), clustering, and UMAP embedding pipelines in Seurat (McInnes, et al., 2018;Stuart, et al., 2019). Data was imputed using ALRA (Linderman, et al., 2022) and then run through the NICHES function RunCellToCell. The resulting signaling matrix was then used to create a new Seurat object which was scaled and run through PCA again and embedding using UMAP. FindAllMarkers was used in Seurat to identify cell-cell interaction markers of interest.

Methods for Application to Brain Spatial Transcriptomic Data
Anterior mouse brain data was downloaded from 10x Genomics (2020) and preprocessed following the steps in Seurat (Stuart, et al., 2019), with subsetting to the frontal cortex region only. We integrated the data with a reference single-cell RNA-seq dataset (Tasic, et al., 2016) and used its cell type annotations to predict the labels of the spatial pixels by a probabilistic classifier (Seurat TransferData function). We then annotated each spatial pixel by its most probable label.
For NICHES matrix construction, we imputed the data with ALRA (Linderman, et al., 2022) based on the normalized data matrix, and then applied the NICHES Neighborhood-to-Cell function to compute niche signaling between direct histologic neighbors. The resulting niche matrix was embedded using UMAP (McInnes, et al., 2018) in Seurat 4.0 (Hao, et al., 2021). FindAllMarkers in Seurat was used to compute top markers for each celltype niche.