ADEPT: Autoencoder with differentially expressed genes and imputation for robust spatial transcriptomics clustering

Summary Advancements in spatial transcriptomics (ST) have enabled an in-depth understanding of complex tissues by quantifying gene expression at spatially localized spots. Several notable clustering methods have been introduced to utilize both spatial and transcriptional information in the analysis of ST datasets. However, data quality across different ST sequencing techniques and types of datasets influence the performance of different methods and benchmarks. To harness spatial context and transcriptional profile in ST data, we developed a graph-based, multi-stage framework for robust clustering, called ADEPT. To control and stabilize data quality, ADEPT relies on a graph autoencoder backbone and performs an iterative clustering on imputed, differentially expressed genes-based matrices to minimize the variance of clustering results. ADEPT outperformed other popular methods on ST data generated by different platforms across analyses such as spatial domain identification, visualization, spatial trajectory inference, and data denoising.


INTRODUCTION
The complex tissues in our body consist of diverse cell types, each specialized to carry out a particular function. Cell behavior is influenced by the surrounding environment, including signaling with adjacent and distant cells. 1 Deciphering the spatial domains of different cell types in tissue is consequently critical for understanding the behavior of cells and the progression of disease pathology. 2 Single-cell RNA sequencing techniques (scRNA-seq) have made it possible to characterize cells by their types and physiological properties at an unprecedented per-cell resolution; 3 however, the lack of information regarding the spatial location of cells prohibits us from investigating the complicated transcriptional architecture of heterogeneous tissues. Conducting scRNA-seq while also identifying the spatial context of cells within the tissue could allow a deeper understanding of location-specific gene expression and cell behavior. Recent technological advances in Spatial Transcriptomics (ST) have made this possible. 3 The technique of ST has greatly accelerated the study of complex transcriptional architecture within heterogeneous tissue, while only slightly sacrificing cell resolution (1-10 cells in each sequencing spot). 4 There are two main categories of methods of ST sequencing. The first category is performed via fluorescence-based in situ transcriptomics, which includes methods like single-molecule fluorescent in situ hybridization (smFISH), 5 spatially resolved transcript amplicon readout mapping (STARmap), 6 and multiplexed errorrobust fluorescence in situ hybridization (MERFISH). 7 The second category consists of a combination of spatial barcoding and next-generation sequencing. Methods like Slide-seq 8 and 10x Visium are in this category 9 .
An important first step in ST research is to cluster the spots and define spatially coherent regions in terms of expression data and location adjacency. This is essential for downstream analyses, such as cell type or tissue annotation, new cell type identification, spatially variable gene identification, and gene ontology (GO) analysis. 10 Some naive approaches which have been previously applied to ST data include traditional clustering algorithms like Louvain, 11 spectral clustering 12 and k-means. 13 These methods, though capable of leveraging the spatial and histology data [14][15][16][17][18][19] to identify segmented or layered spatial domains for different ST data, produce results which are somewhat unstable and of great variance. In recent years, several popular methods including BayesSpace, 17 SpaGCN, 14 SEDR, 18 CCST, 16 and STAGATE 19 have been proposed and demonstrated their superiority compared to previous baseline models, however, Data quality across different ST sequencing techniques and types of datasets appear as a crucial factor that influences the performance of different methods and influences benchmarks. For example, previous studies with scRNA-seq data have attempted to solve the dropout effect, i.e. the large percentage of missing events or excessive zero counts, 20 to improve the quality of the sequencing data. Gene dropout describes the phenomenon of a gene being observed at a certain expression level in one cell, but not being detected in another cell of the same type. Previous analyses of scRNA-seq data have shown that effective imputations for dropout effects could improve the clustering results and downstream analyses. 21,22 A recent benchmarking study showed that examining and understanding the statistical properties of the excessive zero values in ST data is important to facilitate the development of best practices for various data analytic tasks in the field. 23 In this study, we were motivated to develop a robust clustering algorithm, called ADEPT, an Autoencoder with Differentially Expressed genes and imputation, by tackling the data quality effect across different types of ST data. ADEPT employs a graph autoencoder to learn the low-dimensional latent embedding of each spot via both gene expression and spatial context. To control and stabilize data quality, ADEPT relies on the selection of differentially expressed genes (DEGs) and imputation of the multiple DEG-based matrices for the initial and final clustering of the graph autoencoder backbone, to minimize the variance of clustering results. The DEG selection and imputation are performed multiple times and averaged for estimating robustness. We have benchmarked ADEPT against five other popular methods on ST data generated by different ST platforms to demonstrate its robustness and superiority in different downstream analyses such as spatial domain identification, visualization, spatial trajectory inference, and data denoising.

RESULTS
The overall pipeline of ADEPT ADEPT is a multi-stage framework that performs spot clustering iteratively, and gradually increases clustering quality in an unsupervised manner, without the need for label information. Its overall pipeline is described in the following steps ( Figure 1): There are two types of input data for ADEPT, sequencingbased, and image-based ST data. The image-based input is optional. In the first data preprocessing step, ADEPT utilizes the gene expression profile. Each spot is treated as one node and the associated gene expression as a feature vector. ADEPT then constructs the graph structure based on the node (spot) adjacency via the k-Nearest Neighbor (kNN) algorithm. In the second step, ADEPT feeds the constructed graph to a Graph Autoencoder (GAE), to learn a low-dimensional latent representation of each node. 24,25 The autoencoder contains a naturally coupled reconstruction loss function which is minimized to acquire the low-dimensional spatial distribution of the node embeddings, in an unsupervised fashion. In the third step, an initial clustering is performed on the node embeddings and several differentially expressed gene (DEG) lists are selected from the initial clusters relying on non-zero rates of the expression matrix. A Gaussian mixture model is used to cluster the node embeddings after the reconstruction loss threshold is reached and convergence of the model has occurred. In the fourth step, ADEPT extracts and imputes multiple DEG-based gene expression matrices, and combines them into a final imputed matrix as the GAE model input for final clustering. Finally, in the fifth step, further downstream analyses can be performed based on the final clustering result, such as spatial domain identification, spatial trajectory inference, and ST matrix imputation.
To evaluate the performance of our spatial clustering method, we used ADEPT on several annotated benchmark datasets, with five other tools. The accuracy and robustness of ADEPT were evaluated by using the adjusted rand index (ARI), Fowlkes-Mallows score (FMS), and purity, as well as comparing visualization of spatial domain identification between each tool and the ground truth. We also performed further downstream analyses such as spatial trajectory inference and ST matrix imputation for some specific datasets.

Benchmark datasets
We used datasets from the dorsal lateral prefrontal cortex (DLPFC), a human breast cancer dataset, and the STARmap dataset. ll OPEN ACCESS Figure 1. Overview of the ADEPT framework ADEPT begins by processing ST raw data as input. It first constructs a k-Nearest Neighbor graph based on spatial locations. To construct node features from gene expression information, a data quality control step is involved to remove low-quality genes. The model input is then generated and fed into the ADEPT backbone. ADEPT learns a low-dimensional latent representation with both spatial context and expression information via a graph autoencoder. After acquiring the initial clustering results from the embeddings, DEG selection, and matrix imputation modules are executed to refine the final clustering result. The output of ADEPT is then used for downstream analyses such as spatial domain identification, spatial trajectory inference, and ST matrix imputation. The DLPFC dataset includes 12 human DLPFC sections, taken from three individual samples. 26 The total number of spots ranges from 3498 to 4789, depending on the section. For this dataset, the authors have meticulously manually annotated all 12 sections for cortical layers 1 to 6 and white matter (WM). 26 The breast cancer dataset contains 2 sections; however, only the first section of the dataset (BC1 in the article) has annotation. 18 In the annotation provided by SEDR, the tissue is segmented into 20 areas. The breast cancer dataset contains 3,798 spots in total and around 33K genes. The STARmap dataset, which was generated from the mouse visual cortex that spans from the hippocampus to the corpus callosum, and encompasses six neocortical layers, contains just one slice. The STARmap dataset 6 has fewer cells and sequenced genes (1020 genes on 1207 cells) when compared to the used 10x Visium datasets, but has single-cell resolution.

Experimental setup and default parameters for ADEPT
For ADEPT, we used the Adam optimizer 27 to minimize the reconstruction loss with an initial learning rate of 1e À 3 and a weight decay of 1e À 5 . The default number of iterations was set to 1000. We used a 2-layer structure for both the encoder and the decoder in ADEPT's backbone. Attention mechanisms were turned on for the first layer of the encoder and the last layer of the decoder correspondingly. The input feature dimension of the encoder's first layer was equal to the total number of genes after filtering and data quality control. The dimension of the encoded hidden features, used for clustering spots, was set to 32. The input dimension of the hidden layer was set to 512. Because this tool uses an autoencoder backbone, the input feature size of the encoder layers was equal to the output feature size of the decoder layers, and vice versa. We run all experiments with the same default parameter setting.

Control data quality by non-zero rate
There are two additional self-adaptive hyperparameters that need to be estimated in our framework. The first one is the minimum counts across cells per gene used in the data preprocessing step of ADEPT. In Figure 2, there are nine subplots in total for the eight DLPFC datasets and the breast cancer dataset. The figure depicts a trend of a non-zero rate of increase as a function of an increasing number of low-quality genes being filtered. In these log-like curves, there is a rapid increase near the origin, which illustrates that there are always a few genes that have significantly poor sequencing quality in most spots. The expression for these genes could contain noise or excessive zero values that would negatively affect the clustering performance, so we need to exclude these genes from our framework. Furthermore, as suggested by the starting non-zero rate in each subplot, data quality in different batches or different types of datasets could vary from each other drastically. In ADEPT, we have chosen an empirical threshold of 0.14 to ensure that after the screening of low-quality gene features, expression matrices would have acceptable and fairly uniform data quality. This approach works with high-quality sequencing data as well. In Figure 2I, the sequencing quality is high, so ADEPT would only need to remove genes with sum of counts in all spots less than 5. For image-based ST datasets generated by STARmap technology, ADEPT chooses not to execute this step by default because the number of genes is already quite lower than 10x Visium data. The minimum-counts hyperparameter is thus estimated based on each dataset and its sequencing platform.
Another crucial hyperparameter is DEG lists kept after initial clustering. As we describe in the STAR Methods section, ADEPT selects DEG lists by the non-zero rate for each dataset. DEG lists and their corresponding non-zero rates from DEG-based matrices are illustrated in Table S1. As shown in the table, each dataset has several different DEG lists that satisfy the non-zero rate threshold. All of these candidate lists will later be used in the imputation step of our framework.

ADEPT demonstrates robust clustering performance across different ST datasets
To quantitatively demonstrate the spatial clustering capability of ADEPT, we first tested it on two 10x Visium benchmark datasets which contain manual annotations as ground truth, the DLPFC dataset and the breast cancer dataset. Using the ground truth, we compared the clustering performance of ADEPT with five other recently developed spatial clustering tools (BayesSpace, 17 SpaGCN, 14 SEDR, 18 CCST 16 and STAGATE 19 ) based on the adjusted rand index (ARI), which is a commonly used similarity measure between two given clusters by considering all pairs of samples. We visualized results using SCANPY. 28 Boxplots of ARI values from 20 experiments of each tool for all 10 datasets are shown in Figure 3. The average ARIs can be found in Table 1. For both boxplot results and average ARIs, ADEPT achieved the best performance in eight out of ten datasets, and came in second place for the other two datasets. In iScience Article addition, for all four DLPFC sections in Figure 3A, from 151673 to 151676, ADEPT outperformed STAGATE, which is also an autoencoder-based method. Finally, the overall variance of ADEPT was much lower than the other methods due to the integration of the imputation step in ADEPT. In DLPFC 151507 and breast cancer, ADEPT failed to achieve the best performance; however, its result was only slightly lower than the average ARI of SpaGCN on the breast cancer dataset ( Figure 3D). We also adopted two additional metrics, purity and Fowlkes-Mallows Score (FMS), to evaluate the clustering performance of all tools on these benchmark datasets (Table 2). ADEPT still accomplished the best performance in eight out of ten datasets and came in second place for the other two datasets.
As expected, ADEPT can effectively distinguish spatial domains, no matter if they are in layered structures ( Figures 4A, 4B, and 4D for DLPFC and STARmap datasets) or in more complex structures ( Figure 4C for the breast cancer dataset). For example, in Figure 5A, in the visualization comparison of DLPFC 151675, ADEPT showed a clear pattern of separation of the seven-layered regions and achieved the best clustering accuracy (ARI = 0.64). For comparison, the clustering result of SEDR, SpaGCN, and BayesSpace could not thoroughly reveal the expected layer pattern in this section, and the border of the clusters is chaotic with many outliers inside each cluster, which impairs the overall clustering accuracy and harms the overall visual effects. CCST and STAGATE were also effective in creating clear borders between clusters; however, CCST failed to identify all seven layers consistently even with numerous runs of the algorithm. Because some tools can achieve the best performance for certain datasets, it is hard to claim which tool works the best for every dataset. We further tested the robustness of ADEPT against five other methods by calculating the sum of rankings and the average ranking for all datasets. Each method was assigned a ranking score from 1 to 6 based on the metrics ranking. In Table S2, we present the sum of rankings and the average ranking in each dataset for all the methods. The table shows that CCST, SEDR and SpaGCN have a similar sum of rankings and average ranking. In both ranking metrics, ADEPT achieves the best average ranking (1. In terms of execution time and memory usage, ADEPT required an average runtime of 17 min for standard 10x datasets with less than 5k spots and 1.5 min for the STARmap dataset. Among all six tools (Table S3), STAGATE, SpaGCN, and BayesSpace achieved the minimum computation cost (an average runtime of 3-5 min for 10x Visium datasets, and 1.5-2 min for the STARmap dataset). All tools consumed approximately 0.5-2 GB GPU memory for 10x datasets.
ADEPT imputes and denoises gene expressions of biomarker genes for better spatial expression patterns ADEPT can also impute and denoise gene expressions for clear spatial expression patterns, because imputation is an innate feature of the pipeline. We compared the gene expressions of six layer-marker genes 26 (LAMP5, NEFH, RASGRF2, B3GALT2, NTNG2, and ATP2B4) between raw data and after imputation by ADEPT in the DLPFC section 151675. In Figure 5B, the heatmaps of the raw and imputed data of the six genes are shown. Before our imputation, the heatmaps of these six genes were comparably chaotic. Using LAMP5 as an example, all spots with higher gene expression are in a scattered distribution without any significant patterns. The refined expression of LAMP5 successfully reveals the regional pattern that has a higher expression level in layer 2 and adjacent layers. We also used violin plots to confirm the same conclusion. We plotted the expression level of the same set of genes as in Figure 5B, by ground truth layers. The plot clearly shows the distinct expression levels of all six genes in different layers, which is consistent with the conclusion we have drawn previously. As reported in the previous studies, 26,29 after imputation, these results collectively demonstrate the ability of ADEPT to reduce both noises and dropouts and recover spatial patterns more ideally, whereas its raw spatial expression, though affected by sequencing qualities and other factors, is randomly distributed. iScience Article ADEPT improves spatial trajectory inference In Figure 5D, we show that ADEPT was capable of revealing the distance between spatial domains by projecting embedded features down to a two-dimensional space by a UMAP plot, and further inferring the spatial trajectory using a trajectory inference tool called PAGA. 30 For instance, in the DLPFC section 151675, those clusters of each layer were distributed reasonably and showed consistent spatial trajectories from layer 1 to layer 6 and white matter (WM) in the UMAP plots generated by the embeddings of ADEPT and STAGATE. However, in the UMAP plot of another autoencoder-based method, SEDR embeddings, layers were not separated and connected clearly, which resulted in comparably worse results. The PAGA graphs of both ADEPT and STAGATE embeddings showed a linearly-connected development tendency from layer 1 to layer 6 and WM, whereas the PAGA results of SEDR embeddings were mixed for some middle layers. CCST, SpaGCN, and BayesSpace could not be used to perform these analyses.

DISCUSSION AND CONCLUSION
In this study, we have proposed a multi-stage graph-based deep clustering method, combined with DEG discovery and expression profile imputation. Our method, ADEPT, was successfully tested on 10x DLPFC, Breast cancer datasets, and the STARmap dataset. It was compared with other methods in terms of visualization and metrics and biological downstream analysis, illustrating its advantages. We show that ADEPT could robustly recover the layered patterns of DLPFC and STARmap datasets, while it could also capture the clustered distribution of tumor tissues in the breast cancer dataset. By taking the advantage of DEGs and imputation, ADEPT exhibited far less prediction variance compared to other notable methods. ADEPT achieved clustering performance with the highest ARI values among five other The chosen value for the hyperparameter minimum counts across cells per gene and selected DEG lists for each dataset are also provided. The highest average ARI in each column is highlighted. iScience Article state-of-the-art ST clustering tools for most of the datasets. In conclusion, ADEPT is a powerful and efficient method that can facilitate clustering-based downstream analyses, promote the discovery of gene markers, and refine ST expression profiles.

Limitations of the study
As more ST clustering algorithms are implemented there is no consensus on which clustering method is the best, and a comprehensive benchmarking framework becomes increasingly necessary. Although we benchmarked several ST datasets and ADEPT outperformed existing methods most of the time, it is hard to conclude that ADEPT will achieve the best performance on other new datasets. In addition, the scalability of ADEPT is not satisfactory at present. Runtime will greatly increase for future datasets with substantially more gene features and clusters. These are areas of future improvement for our method.

DECLARATION OF INTERESTS
The authors declare no competing interests.  iScience Article k-nearest-neighbor (kNN) algorithm, which is typically exploited to construct a graph when there are no explicit edge relationships provided. In ADEPT, k is set to 6 as default. This hyperparameter is not very sensitive across different datasets. In the end, the inputs for the model are the list of edges and the node features, which are major components for any graph-based neural network. 31

Graph attention autoencoder
We design the graph attention autoencoder consisting of three core parts: the encoder, the decoder, and the graph attention layer. 32 In order to strengthen the connection between nodes that are represented by similar expression profiles and to smooth out the clustering result, ADEPT applies the attention layer mechanism, which is widely used to compute the weight between node pairs. These weights indicate the different contributions of each neighbor used in the aggregation process. Based on node embedding, the weights of edges are automatically calculated. Therefore, this model relies less on the initial edge weight and enables learning of the relationships between nodes by implicitly taking consideration of spot similarity, rather than explicitly calculating the weight with heuristic methods (e.g., in SpaGCN 14 ).
The encoder's input consists of the normalized gene expression profile and the edge lists. The encoder then generates node embeddings by aggregating information from all neighbors. We denote h ð0Þ u as the normalized expression feature of spot u. By treating expression profiles as the initial spot embedding, the l th encoder layer generates the embedding of spot u in layer l as follows: where a ðlÞ uv is the attention coefficient, ε is is an exponential linear unit (ELU), 33 N u denotes all the neighbors of node u including u itself, h ðl À 1Þ v denotes the embedding of node v in the l À 1 layer, a ðlÞ is another learnable parameter, and W ðlÞ is the matrix of trainable parameters in l th layer. The attention coefficient a ðlÞ uv of layer l for every node pair ðu; vÞ in the encoder is used to measure the importance of the neighboring node v towards learning a higher quality representation of node u on graphG. This is calculated by Equations 2 and 3 where 4 is the concatenation operation and s is a sigmoid activation function. The output of the L-layer encoder h ðLÞ u is the final embedding of spot u and represents the hidden embedding with the lowest dimension. The decoder attempts to reconstruct the normalized expression profile for each spot u (i.e., h ð0Þ u ) given the latent embeddings of the encoder. More specifically, the output of the encoder is given as the input to a L-layer decoder (i.e., b h ð0Þ u = h ðLÞ u ), with the l th layer of the decoder (from the perspective of spot u) defined as: where b a ðlÞ uv is the decoder attention coefficient which is calculated similarly as a ðkÞ uv in the encoder, and c W ðlÞ is the matrix of trainable parameters in l th layer. Ultimately, the output of the decoder, b h ðLÞ u , is trained via updating the parameters of the encoder and decoder layers in an attempt to reconstruct the normalized expression profiles (i.e., h ð0Þ u ).

Differentially expressed gene lists selection by non-zero rate
As mentioned in the section describing the overall pipeline of ADEPT, several differentially expressed gene (DEG) lists are selected from the initial clusters relying on a range of non-zero rates from 0.3 to 0.4. We chose this empirical threshold range to maintain an optimal number of DEGs. Within this range, ADEPT generates a specific number of DEGs and selects multiple DEG lists that meet the non-zero rate range threshold. Across this range of non-zero rates, ADEPT selects a small proportion of differentially expressed genes and determines if the data quality for each selection is acceptable. Across different types of datasets, we maintain the same range of non-zero rates, resulting in different DEG list selections for the next The ranks for each group are summed and these sums are used to determine the likelihood that gene expression among groups is different. This is performed using Equations 5 and 6, where n x represents the number of spots in the cluster to be analyzed, n y represents the number of spots in all other clusters, R x represents the sum of ranks for the cluster to be analyzed and R y represents the sum of ranks for all other clusters.
The minimum value between U x and U y is used to find the p-value of each gene via Mann-Whitney tables. 35 Genes with the lowest p-values are the most statistically significant between groups. When using this method, there is a trade-off for the number of DEGs selected. Since all the DEGs are selected sequentially from the top ranking to the bottom, more features will be kept when we choose to select more DEGs. However, keeping more DEGs introduces increasingly more noise and a higher dropout rate. On the other hand, if we choose to keep the smallest amount of DEGs possible, many useful features will be sacrificed.
Here, we thus introduce a DEG candidates selection step, which aims to optimize the number of DEGs which are kept for each cluster by optimizing the range of the non-zero rates after performing DEG selection. Once several different DEG lists are selected based on the initial clustering results, they will be utilized for imputation and final clustering.

Expression matrix imputation
At this stage, ADEPT has selected several DEG lists based on non-zero rates and extracted multiple DEGbased matrices. We then impute these matrices and merge them for robust estimation, to minimize the variance of the final clustering result. The strategy of our imputation method is to use the average expression value of the gene within the same cluster and the pseudo-labels that are obtained from our initial clustering result to complete the dropout entries.
Specifically, we denote X˛R g3c as a gene expression matrix, where g is the number of genes and c is the number of spots. The ði; jÞ th component of X is represented as x ij . The estimated value of a dropout event is given by averaging the gene expression values across all spots for each non-zero gene within the same cell cluster: E À x ij Á = P j 0˛c lustern x ij 0 kcluster n k (Equation 7) x ij denotes i th gene dropout of spot j. j 0˛c luster n refers to all other spots within the same cluster that have values in their i th gene, whilekcluster n k indicates the size of cluster n excluding those spots with dropout in i th gene.
Additionally, we denote C 1 , C 2 , /, C K as different clustering results based on different selected DEG lists. The Eðx ij C K Þ was computed for each clustering result C 1 , C 2 , /, C K . Finally, ADEPT estimates the final imputation for dropout events x ij , and Eðx ij Þ, by computing the average of the estimated results from all different DEG lists: The imputed expression matrix contains all genes that occur in either of the K DEG lists at least once.

ll
OPEN ACCESS