Inferring cell diversity in single cell data using consortium-scale epigenetic data as a biological anchor for cell identity

Abstract Methods for cell clustering and gene expression from single-cell RNA sequencing (scRNA-seq) data are essential for biological interpretation of cell processes. Here, we present TRIAGE-Cluster which uses genome-wide epigenetic data from diverse bio-samples to identify genes demarcating cell diversity in scRNA-seq data. By integrating patterns of repressive chromatin deposited across diverse cell types with weighted density estimation, TRIAGE-Cluster determines cell type clusters in a 2D UMAP space. We then present TRIAGE-ParseR, a machine learning method which evaluates gene expression rank lists to define gene groups governing the identity and function of cell types. We demonstrate the utility of this two-step approach using atlases of in vivo and in vitro cell diversification and organogenesis. We also provide a web accessible dashboard for analysis and download of data and software. Collectively, genome-wide epigenetic repression provides a versatile strategy to define cell diversity and study gene regulation of scRNA-seq data.


INTRODUCTION
Single-cell RNA sequencing offers unprecedented opportunities to investigate the diversity of cell processes underpinning de v elopment and disease. Standar d analysis wor kflows typically involve stepwise analytic tools that decompose high-dimensional data to re v eal the complexity of and relationships between cell types. Howe v er, the computational challenges associated with single cell analysis are considerable, partially due to common user-defined parameter settings that make interpretation of cell di v ersity a subjecti v e e x er cise. Furthermor e, downstr eam analyses often r ely on prior knowledge, using known mar ker genes, subjecti v e data subtraction and r efer ence annotation databases ( 1 ). Currently, few methods utilize unsupervised orthogonal reference points for defining cellular di v ersity and interpreting genetic regulation of cell states.
Cell type clustering methods are essential for downstream analyses to generate hypotheses about cellular processes. Most current clustering algorithms group cells based on gene expression similarity and rely on ma thema tically defined constraints for data analysis ( 2 ). Agglomerati v e clustering methods such as SINCERA ( 3 ), pcaReduce ( 4 ), and CIDR ( 5 ) use distance measurements such as Euclidean, Manhattan, Jaccard, Minkowski and Canberra ( 6 ) to merge clusters iterati v ely based on similarities ( 7 ). K-means clustering groups data points by taking a distance measure and iterati v ely optimising the centroid of each group to minimize within-cluster distance ( 8 ). Model-based approaches like Gaussian Mixture Models (GMM) exemplified by scG-MAI ( 9 ) use (multiple) distributions to captur e r elationships between cells ( 8 ). Density-based clustering such as DBSCAN ( 10 ) and GiniClust ( 11 ), are non-parametric appr oaches that gr oup cells based on data point density, where low-density ar eas ar e consider ed as outliers. Lastl y, gra phbased clustering is an extension of density-based clustering ( 12 ) wher e r elationships among cells ar e r epr esented by a similarity graph ( 13 ). Popular clustering approaches, such as Louvain and Leiden, are implemented in Seurat ( 14 ) and SCANPY ( 15 ).
Owing to the complex nature of cell clustering, there is no widely accepted standard for parameter setting or method selection ( 2 ). Moreover, most clustering methods assume all cells in a dataset are equally important and ther efor e interpret them equally, despite the well-known influence of sequencing depth, ambient RNA and technical variation on the variability and accuracy of scRNA-seq data (16)(17)(18). Esta blishing justifia ble and generalisa ble criteria for excluding data beyond the standard trimming of doublets and highquality cells remains challenging.
Similarl y, gene expression anal ysis often depends on arbitrary selection of computational parameters and r efer ence points. For instance, in differential expression (DE) gene anal ysis, using highl y variable genes ( 19 ) between cell clusters can results in data interpretation that is only relevant to individual datasets ( 20 ). Grouping genes with co-expression patterns has been an effecti v e strategy to study functional gene-to-gene relationship. For instance, gene modules identified by fitting a Gaussian mixture model to gene expression data has been used to stra tify pa tient (21)(22)(23) and various cancer cell subtypes (24)(25)(26)(27) by inferring functional relationship between genomic elements. Also, Weighted correlation network analysis (WGCNA) connects co-expressed gene modules to external r esour ces such as clinical data, SNPs (single-nucleotide polymorphisms) and Gene Ontology (GO) ( 28 ). Howe v er, current models mostly rely on expression data alone, and the biolo gicall y meaningful incorporation of orthogonal genomic data into computational systems biology is not commonly used. Alternati v e methods also use cell-le v el abundance features to identify cell clusters ( 29 ), or model distributional differences in the expression of individual genes to extract biological meaning from the regulatory network relevant to cell clusters (30)(31)(32)(33)(34). These methods may lack more complex gene regulatory information needed to interpret gene expression networks and are often heavily influenced by the data structure ( 33 , 35 ).
Recent studies, such as CADD ( 36 ) and UnTANGLeD ( 37 ), use statistical methods to quantify probabilities and relationships in the genome to help interpret complex genomic data. These methods are powerful because they provide an independent, unsupervised, simple and quantifiable analysis of the genome that seamlessly interfaces with orthogonal genomic data to interpret genetic information. We have recently developed TRIAGE (Transcriptional Regulatory Inference Analysis of Gene Expression), a computational method that provides an independent and unsupervised model to interpret genomic data from any cell or tissue type, without the need for external r efer ence da ta, sta tistical cutoffs, or prior knowledge. We used consortium data from the Human Cell Atlas, FANTOM and a draft map of the human proteome to demonstrate that histone modifica tion deposition pa tterns can be used to discern genes underpinning cell identity ( 20 ). We found that genes frequently harboring broad H3K27me3-domains across diverse cell types significantly enrich for cell-type specific regulatory genes ( 20 ). A genome-wide quantitati v e feature devised from TRIAGE, called 'r epr essi v e tendency score' (RTS), serves as an independent r efer ence point that can infer cell-type regulatory potential for each protein-coding gene in different cell types, which could help advance our understanding of gene regulation and its impact on cell identity.
Her e, we pr esent a two-step pipeline that complements the baseline methods for identifying transcriptionally distinct cell populations. The pipeline draws on the fundamental principles of TRIAGE, enabling the analysis of cell clustering and interpretation of cell populations. This approach is scalable to any cell type and data type, including applications across species, making it possible to analyse biological di v ersity in scRNA-seq data.

Identifying priority cell type-specific regulatory genes
To calculate the RTS, we followed the two-step approach described in TRIAGE ( 20 ), using EpiMap data from 834 cell types [39]: (i) we calculated the total length (breadth) of H3K27me3 domains in base pairs for each gene across 834 EpiMap cell types, and (ii) we multiplied each value by the proportion of cell types in which the gene's H3K27me3 breadth was among the top 5% of broad domains. This analysis assigned a single score to each gene, indicating its association with broad H3K27me3 domains. The genes were ranked by their RTS, and a priority set of 993 genes were identified as those above the inflection point.

Correlation analysis of RTS priority gene expression during cell differentiation pseudotime
To identify genes that significantly influence differentiation potential, we used a single-cell RNA sequencing dataset characterising definiti v e endoderm dif ferentia tion using 125 iPSC lines ( 38 ). For each cell line, we calculated the average expression of each RTS priority gene at day 0, day 1 and day 2 of endoderm dif ferentia tion. Next, we performed a linear r egr ession between the average gene expression at pluripotency and the average pseudotime of each cell line at day 3 of dif ferentia tion, as calcula ted in the original da ta ( 38 ). We applied a false discovery rate correction to account for multiple testing. We repeated this analysis for gene expression at days 1 and 2 of differentiation.

Evaluation of relationship between RTS priority gene rank and differentiation time.
To investigate the correlation between RTS values and lineage de v elopment, as well as cell type identification, we analysed a time-course in vitro cardiac cell differentiation ( 39 ). We examined the relationship between RTS values and cell de v elopment linea ges by tracking the distrib ution of RTS values across different stages of differentiation. We also calculated the average expression level of each RTS priority gene at each stage of dif ferentia tion compared to its corresponding RTS value and visualised results using pheatmap R package ( 40 ) with Z -score normalisation for each gene.

Pre-processing of mouse gastrulation atlas data ( 41 )
To pr epar e the mouse gastrula tion a tlas da ta, we converted mouse gene names to human gene names and removed PAGE 3 OF 18 Nucleic Acids Research, 2023, Vol. 51, No. 11 e62 genes that had no expression in all cells. Additionally, we excluded low-quality cells using the perCellQCMetrics and quickPerCellQC functions from the R package scater ( 42 ). The filtered data was then normalised by library size using the computeSumF actor s function from the scran R package ( 43 ). Normalised counts were used for TRIAGE ( 20 ) conversion, which generates a discordance score (DS) count matrix. To reduce the dimensionality of the data, we performed a downstream principal component analysis (PCA) using the RunPCA function in Seurat and computed 50 principal components (PCs). We then used the 50 PCs f or UMAP (unif orm manif old appr oximation and pr ojection) and t-SNE (t-distributed stochastic neighbour embedding) through the RunUMAP and Runtsne function ( 14 ). All other data used in this paper were processed in the same way as the mouse gastrulation atlas data. Gi v en that UMAP has demonstrated its usefulness in identifying cell types in highly heterogeneous datasets, as well as its scalability and faster runtime in scRNA-seq data (44)(45)(46)(47)(48)(49), we selected UMAP as the dimensional reduction space to a ppl y our TRIAGE-Cluster pipeline. Howe v er, we also compared the performance of UMAP and t-SNE with respect to our pipeline. The FindClusters function with Louvain algorithm as default setting is utilized in Seurat to obtain clusters at 20 different resolutions ranging from 0 to 2 with an increment of 0.1. These clusters are then utilized in downstream benchmarking analysis.

Density estimation and peak selection
For each cell in the data set, the RTS priority gene with highest expression for each cell was identified and its corresponding RTS was used for downstream analysis as follows: UMAP coordinates were used with the assigned RTS for each cell as weight for estimating kernel density. The function stats.gaussian kde from scipy package ( 50 ) estimates the probability density function of RTS-weighted cell states. For any gi v en cell x, the weighted Gaussian kernel density estimator ˆ f can be calculated as where n is the total number of cells, w i is RTS and x i is the UMAP embeddings of the i -th cell, h is the bandwidth of a Gaussian kernel K defined as where μ is the mean of the Gaussian distribution and σ is its standard deviation. Bandwidth h was determined using Scott's Rule ( 51 ) and adjusted to suit scRNAseq data as follows where n denotes number of cell and d denotes number of dimensions. A density map with contours (contour le v els = 10) were generated. At each contour le v el, we used unsupervised DBSCAN from the sklearn package ( 52 ) to group cells into spatially separated clusters. From each contour le v el, we selected regions with local RTS maxima as peaks (i.e. clusters without smaller clusters within them from a higher contour le v el). Peaks (i.e. clusters) from all contour le v els r epr esent the cell population di v ersity in the dataset. It should be noted that UMAP coordinates are not linear transformations of input data and only preserve close similarities ( 53 ). We hypothesised this is sufficient in order for density estimates to work and tested it against another topological dimension reduction method, t-SNE, by comparing the biological di v ersity capture from peaks obtained from UMAP and t-SNE.
J accar d similarity dendr ogr am. We computed the Euclidean distance matrix using the Jaccard similarities of the top 100 DS genes for each pair of peaks, and then used the R package hclust to perform hierarchical clustering and extract the dendrogram r epr esentation of relationship between all peaks.
Adjusted rand inde x (ARI) anal ysis. We used the ARI ( 54 ) to assess the performance of different clustering methods by comparing it with the original annotation of the mouse gastrula tion a tlas da ta ( 41 ). Additionally, to evalua te the impact of removing RTS priority gene on accuracy of data structure, we employed adjustedRandIndex function from mclust R package ( 55 ), to compare four removal scenarios with the gold standard that we used, which was the high expression RTS genes for all cells generated from the original RTS priority gene rank. We tested four gene removal scenarios: (i) random removal of genes from the RTS priority gene list, (ii) removal of RTS priority genes from the bottom rank to the top, (iii) removal of RTS priority genes from the top rank to the bottom and (iv) random removal of genes from the whole genome. Each removal step is 50 genes.
Spearman r ank corr elation. To assess the ability of TRIAGE-Cluster and Seurat in identifying biolo gicall y distinct cell types, we (i) evaluated peak-peak correlation across multiple resolutions using the average expression of either the entire transcriptome (16 946 genes) or a set of highly variable genes (1000 genes) with both the original expression matrix and the TRIAGE-converted matrix (DS matrix), and (ii) assessed peak-peak correlation with selected resolutions giving the same number of clusters for both methods, using the entire transcriptome (16 946 genes) with the DS matrix as it provides comprehensi v e understanding of the biological context. We conducted Spearman rank correlation analysis using the cor function from the stats package. Highly variable genes are defined using the modelGeneVar and getTopHVGs ( n = 1000) functions from the scran package.
Gene ontology enrichment. To verify that TRIAGE-Cluster peaks capture a greater biology di v ersity, we compared the significant GO terms ( P -value < 0.05) between TRIAGE-Cluster peaks and Seurat clusters, using the top 100 genes ranked by DS from each peak or cluster. We conducted a GO enrichment analysis using the topGOtable function from the pcaExplorer package with org .Hs.eg .db, the genome wide annotation for human. SC3 (single-cell consensus clustering) analysis. Since the mouse gastrulation atlas data includes over 5000 cells, we employed a hybrid approach with an SVM model to obtain clusters as outlined in the SC3 pipeline ( 56 ). We generated the same number of clusters as TRIAGE-Cluster for downstream benchmarking anal ysis. Additionall y, to assess cell di v ersity captured by SC3 clusters, we performed Spearman rank correlation and GO enrichment using the same cluster numbers used for benchmarking with Seurat as examples.
MAGIC imputation. To evaluate the impact of sparsity of scRNA-seq data on RTS priority genes and peaks, we employed the magic function with default settings from the MAGIC R package ( 57 ) to r estor e the structur e and minimize the 'dropout' issue in the mouse gastrula tion a tlas data. Subsequently, we compared the peaks generated from the MAGIC-imputed data with those from the original mouse gastrulation atlas data to assess biological di v ersity by GO enrichment analysis.

Development of TRIAGE-ParseR
Processing of H3K27me3 data. We used consolidated H3K27me3 BED files for 111 Roadmap tissue and cell types ( 58 ). We also downloaded bigwig files r epr esenting 834 samples from EpiMap ( 59 ) and converted them into bedgraph format using bigWigToBedGraph ( 60 , 61 ). Subsequently, we used MACS2 ( 62 ) to call H3K27me3 enriched loci withbroad option to capture broad deposition of H3K27me3.
Extracting H3K27me3 patterns. We performed PCA to extract orthogonal patterns of H3K27me3 depositions from consortium-le v el epigenomic data ( 58 , 59  Analysis of H3K27me3 patterns among top 100 genes ranked by DS. We compared underlying H3K27me3 patterns associated with top 100 genes ranked by DS which are highly enriched with variably expressed transcription factors between different tissue groups (defined by expression coefficient of variation > 1). To this end, we first analysed enrichment of H3K27me3 patterns across genes. For each gene, empirical probability of observing a gi v en value higher than the 95th percentile of the 26 827 RefSeq genes in each eigen vector was calculated. We define that genes with the empirical P -value of < 0.05 (right-tail) demonstrate strong association with a gi v en H3K27me3 pattern. To understand whether each of the top 100 genes ranked by DS demonstra te significant associa tion with a gi v en pattern, we randomly sampled another 100 genes (null distribution) and counted incidences ( r ) where the observed value is higher than samples in the null distribution. We permuted this process 1000 times and calculated a probability of gene i ( p i ) to have a value higher than 95% of the random samples.
Identifying an appropriate cluster number for genes by H3K27me3 patterns. To cluster genes ranked by DS based on associated H3K27me3 patterns, we used Bayesian information criterion ( 63 ) to find a suitable number of clusters. The BIC finds a suitable number of clusters from distributions of observed H3K27me3 depositions. While we used top 100 genes ranked by DS based on their enrichment of regulatory genes, the user can select any meaningful number of genes for this analysis. For a gi v en top 100 TRIAGE-prioritised gene set (i.e. n = 100), values of top m PCs (i.e. X i = x i, 1 , x i, 2 , . . . , x i,m , for i -th ranked gene where i ∈ { 1 , 2 , . . . , 100 } ) were used as features for the parameter selection. The method then ranks PCs by le v el of variance observ ed among the input gene set and selects top m PCs (default = 10 PCs). Subsequentl y, the maxim um likelihood to observe PC values from all top 100 genes ranked by DS was iterati v ely calculated with a varying number of clusters θ where θ ∈ { 1 , 2 , . . . , 100 } . The cluster was defined by a Gaussian model ϕ j ( X| μ j, 1 , μ j, 2 , . . . , μ j,m , σ j, 1 , σ j, 2 , . . . , σ j,m ) for j th cluster centred at means μ j with variances σ j across m PCs, where m is 67 or 70 for Roadmap or EpiMap data respecti v ely. The clusters represent a region of high probability mass estimated by expecta tion-maximisa tion (EM) algorithm. Intuiti v ely, if θ is 1, e v ery gene belongs to the same cluster while if θ is 100, e v ery gene belongs to its own cluster. The aim of BIC is to find an appropriate value for θ that gi v es the lowest BIC value (i.e. a cluster number which is the most balanced point between model complexity and likelihood of observing data gi v en the cluster number and model parameters).
where ˆ L is the maximum value of the likelihood function (L) gi v en the number of clusters ( θ ) and model parameters. The likelihood functions are written as follows: wher e π k r epr esents a probability (or mixture proportion) that gene X i belonging to k -th cluster: We aim to find θ such that the BIC is the smallest. Hence, this problem is equivalent to finding the maximum loglikelihood value gi v en θ (i.e. ln ˆ L ).
Due to sensitivity of EM method to initial parameter values, this analysis was repeated multiple times (default = 10 times) and the most frequently occurring values were chosen as an appropriate number of clusters. To ensur e r eproducibility of the r esults, r andom state par ameter for GaussianMixture function from scikit-learn Python module was set to 42.
Assigning genes to H3K27me3 clusters using gaussian mixtur e models . After identifying an appropria te number of clusters ( h ) for a gi v en prioritised gene set, the next step is to probabilistically assign each gene into one of the clusters. We assumed multivariate Gaussian distributions for each cluster. The Gaussian Mixture Model (GMM) estimates parameters of the model through the EM algorithm in dimensions defined by selected PCs. Briefly, the EM randomly initiates parameters (i.e. mean and variance of each cluster) then iterati v ely adjusts these parameters until convergence. By default, iterations stop when log-likelihood gain of the model is less than 1e −3.
For each observation, the expectation step ( E ) calculates likelihood of the observation belonging to clusters gi v en current model parameters while the maximisation step ( M ) adjusts these parameters to maximise the likelihood of data. Once the model is trained, posterior probability of the gene to fall into each cluster is calculated. Finally, the gene is assigned to a cluster with the highest posterior probability.
where Z i is a latent variable for i -th gene being in a k -th cluster and Q i is a cluster where i th gene is finally assigned.

Functional annotation of gene clusters.
To test functional relevance of gene clustering, pr otein-pr otein interaction data as well as GO enrichment analysis using STRING data were performed ( 64 ). Gene clusters with significant PPI (FDR < 0.001) were identified and enriched GO terms were extracted.
STRING PPI anal ysis. To anal yse functional connectivity between genes, STRING database was utilised. PPI enrichment was assessed with default settings which include both functional and physical protein associations. Network plots r epr esent both a series of e xperimental e vidence (i.e. protein fusion, neighbourhood, co-occurr ence, co-expr ession and other e xperimental e vidence) and computational predictions (i.e. text mining and database). It is recommended that PPI is used to validate the gene cluster as weak connectivity may indicate less biolo gicall y meaningful grouping.
Comparing expression and H3K27me3 breadth values to predict genes with different biological processes. Logistic regression models were trained independently with principal components of either expression or H3K27me3 breadth values pertaining to genes with different GO terms. A separate model was trained for each GO term, aiming to predict whether genes are associated with a particular term. Maximum values of breadth and expression were set to 1e5. The SMOTE algorithm ( 65 ) was used to oversample the underr epr esented class (i.e. genes with a gi v en GO term).
Data visualisation of integrated data in a circular plot. We used the circ lize dendr ogr am from the circ lize R package to cr eate a cir cular dendrogram r epr esenting Jaccard similarities. The resulting dendrogram was then used to order the circular plot generated by the circos-heatmap function for the TRIAGE-Cluster peak number, and the proportion of cells in TRIAGE-Cluster peaks, based on original annotation and timepoints. The circular plots were compiled in Adobe Illustrator for visualization. Figure 1 provides a stepwise schematic for the workflow de v eloped in this study, in which TRIAGE is used to determine cell populations and regulatory gene programs in scRNA-seq data. A glossary of terms is also provided in Table 1 . First, TRIAGE determines the frequency of broad H3K27me3-domains at each protein coding gene locus across 834 EpiMap biosamples, resulting in a gene's r epr essi v e tendency score (RTS) (Supplementary Figure S1). Second, any orthogonal input gene expression data set (or any data mapped to protein coding genes) is multiplied with the RTS to deri v e a DS. The DS prioritizes genes with high regulatory potential (ranked by DS value). Third, TRIAGE-Cluster uses the top-ranked gene's RTS value for each cell as a weight to identify cell clusters in a single-cell UMAP space. Fourth, we use epigenetic data in a machine learning model to parse pseudo-bulk DS gene lists into biolo gicall y functional groups.

Using EpiMap data to calculate genome-wide r epr essive tendency scores
Our first study ( 20 ) de v eloped RTS using 111 NIH Epigenome Roadmap H3K27me3 samples and evaluated the consistency of RTS across various sample types. We demonstra te tha t RTS r emain highly consistent r egardless of the input samples cellular origins. In this study, we rank the expanded RTS generated across 834 cell samples and 27 tissue groups in EpiMap ( 20 , 59 ) (Table S2) and identify 993 priority genes above the inflection point of the interpolated R TS curve (R TS > 0.013) (Figure 2 A). Highl y variabl y expressed transcription factors (TFs) represent a positi v e gene set of cell type specific regulatory genes ( 20 ) and are significantly enriched among RTS priority genes (Figure 2 B). Furthermore, genes ranked highly by RTS tend to be more cell type specific and lowly expressed ( Figure 2 C-E), a profile typical of regulatory factors, like TFs. We evaluated whether the expression level of RTS priority genes influences cell dif ferentia tion more than would be expected by random chance. To test this, we analysed expression of RTS prioritised genes to evaluate their expression abundance correlated with differentiation trajectory using pseudotime analysis of definiti v e endoderm ov er three days of dif ferentia tion measured from 125 iPSC lines ( 38 ). These data show that expression of RTS priority genes between day 0-2 significantly affects the dif ferentia tion potential of cells (Odds Ratio (OR) on day 0 = 2.10, P = 0.04, day 1 = 3.75, P = 7 × 10 −14 , and day 2 = 2.94, P = 2.7 × 10 −20 ) (

High Low
Step 2: Calculate Discordance Score (DS) for any input scRNA-seq dataset Use the gene-based RTS to calculate a Discordance Score metric from any quantitative analysis of gene expression at single-cell resolution

Step 3: Cell Clustering and Cell Peak Identification using TRIAGE-Cluster
For every cell in the scRNA-seq dataset, use RTS priority genes as a biological reference point to identify Cell Peaks for inferring cell heterogeneity   ( 59 ). RTS genes above the inflection point of the RTS curve are defined as RTS priority genes to assist in peak identification using TRIAGE-Cluster. ( B ) Input single cell expression matrix is transformed to discordance matrix to convert the original expression value to discordance score (DS). The DS results in high ranking of cell type regulatory genes. ( C ) We use RTS priority genes in a density-based clustering method, TRIAGE-Cluster, to identify cell populations in UMAP space. ( D ) For each peak, genes are ranked by pseudo-bulk discordance score (DS). TRIAGE-ParseR analysis uses PCA and Gaussian mixture model (GMM) to group genes into functional groups to assist with cell type identification.
We next tested whether the RTS values could be used as an orthogonal metric to identify cell types in scRNA-seq da ta. We evalua ted da ta sets for method de v elopment and selected a mouse gastrula tion a tlas da ta which profiles 116 312 cells collected at nine sequential time-points ranging from e6.5 to e8.5 days post-fertilisation ( 41 ) (Supplementary Figure S2 and S3). These data comprise all primary germ layers, consisting of 37 classified cell types including ectoder m, mesoder m, and endoder m ( Supplementary Figure S2). We first find that high-RTS ranked genes have cellspecific expression patterns across the UMAP space (Figure 2 H). Second, analysis of pseudo-bulk gene expression of two annotated cell types, cardiomyocytes and endothelial cells shows that original gene expression enriches for housekeeping and functional genes, while the TRIAGE transformation to DS enriches for regulatory genes underpinning   Figure 3 A demonstrates a model for how RTS assigned to genes by TRIAGE quantifies a distinction between nonspecific vs cell type-specific genes. For example, EOMES r epr esents a broad mesendoderm cell type (RTS 0.19), while more specific markers of la teral pla te mesoderm ( HAND1 , RTS: 0.28) and cardiomyocytes ( NKX2-5 , RTS: 0.35) demonstrate that RTS values can be used as a surroga te indica tor for genes demarca ting biolo gicall y unique cell types. Similar examples are shown for paraxial mesoderm and endoderm (Figure 3 A). To test this model, we analysed time-course data of cell differentiation ( 39 ) and evaluated RTS scores in cell types from early mesendoderm into dif ferentia ted cell types (Supplementary Figure S4a). These da ta demonstra te tha t RTS scor es do not simply incr ease with dif ferentia tion into definiti v e cell types. Instead, high RTS genes can be used to identify biolo gicall y distinct cell types at e v ery stage of differentiation irrespecti v e of de v elopment stage or lineage (Supplementary Figure S4b). Next, we aimed to test this hypothesis by de v eloping a single cell clustering method that utilizes RTS values for cell type identification.

TRIAGE RTS demarcates cell specific clusters in scRNA-seq data
We first tested whether RTS assigned to genes could provide a method to identify gene clusters in scRNA-seq data. We designed a proof-of-concept analysis comparing Seurat clustering versus RTS priority genes as a decision-making feature influencing clustering tree analysis over 10 resolutions (Figure 3 B). Standard Seurat clustering progressi v ely divides clusters with incr easing r esolution (Figur e 3 C). For TRIAGE-Cluster, we assumed enrichment of RTS priority genes in a cell popula tion demarca tes unique cell types and evaluated this at each step wise r esolution of clustering. We found that RTS priority genes characterising new clusters quickly plateaus demonstrating that increased clustering resolutions do not necessarily imply biologically distinct cell types. Furthermore, RTS priority genes trimmed the number of clusters, while still being sensiti v e enough to identify new clusters even at very high r esolutions, r epr esenting rare cell types that cannot be detected in broad categoriza tions a t low r esolutions (Figur e 3 C).

Weighted density analysis of RTS priority genes determines cell clustering
Based on these observations, we de v eloped an approach drawing on the concept of a topo gra phic ma p with contour lines indicating peaks of cell specificity in a low-dimensional gene expression profile (Figure 3 D). We hypothesised that RTS prioritised genes expressed in cells demarcate cellspecific identity and a gene's RTS can define contour lines tha t separa te peaks (specific cell types) from valleys (transitional or non-specific cell types) (Figure 3 D). Ther efor e, we treated RTS as the weighting parameter to adjust and estimate cell density in a low-dimensional whole transcriptome space. We selected 2D UMAP embeddings instead of t-SNE, as UMAP better preserves the global structure of scRNA-seq data ( 44 ) which is supported by a comparati v e analysis of t-SNE and UMAP methods (Supplementary Figure S5a, b).
We applied this method to the mouse gastrulation atlas data ( 41 ) Table S3). A further confirmation of this using a second data set of human adult heart with myocardial infarction ( 66 ) shows consistent results that all annotations from the original data ar e r etained and r epr esented with one or more peaks, as determined by TRIAGE-Cluster (Supplementary Table S4). For selected populations we show how high-RTS genes, which we call anchor genes, re v eal identitydefining genes across different cell types in the data. For example, cardiomyocytes ar e anchor ed primarily by the mesendoderm-associated gene GATA4 and cardiac regulatory factor NKX2-5 . Parietal endoderm is anchored by GATA4 and the endoderm-associated transcription factor SOX17 .
We next evaluated peak-to-peak associations obtained through the pairwise Jaccard similarity analysis of the op 100 DS genes in each peak of the mouse gastrulation atlas data (Figure 3 F). The output provides an integrated view of TRIAGE-Cluster deri v ed peak relationships in which we show (i) the output dendrogram organising peak relationships using DS, (ii) each peak's time point as it appears during embryonic de v elopment, (iii) the original annotation for each peak from the mouse gastrulation atlas data and (iv) the highest expressing RTS priority gene which we call the peak's 'anchor gene' (Figure 3 H and Figure S6).

PAGE 11 OF 18
Nucleic Acids Research, 2023, Vol. 51, No. 11 e62 in vivo and in vitro scRNA-seq data sets ( 39 , 45 , 63 , 67-75 ) (Supplementary Table S5) which shows that high-RTS genes are the primary dri v ers of clustering across all data sets (Figure 4 B). We assessed this further to address whether RTS-prioritised genes are under-represented in sparse scRNA-seq data. Despite the typically low expression le v els of high-RTS genes in scRNA-seq dataset, our comparison of the original and MAGIC-imputed ( 57 ) versions of mouse gastrula tion a tlas da ta re v eals that the peaks identified in the original data capture most of the high-RTS genes and exhibit greater biological diversity compared to MAGIC-imputed data (Supplementary Figure S5c, d). We next tested the performance of TRIAGE-Cluster using the assumption that distinct cell types have gene expr ession differ ences r eflected by lower cluster -cluster pairwise correlation. Comparing TRIAGE-Cluster against Seurat, we tested gene expression similarity between all pairwise peaks at different clustering resolutions using Spearman rank correla tion. Seura t clustering shows highly stable similarity between cell types at all clustering resolutions regardless of input data type (original expression or DS, i.e. when expression is multiplied by RTS). Furthermore, Seurat shows stable correlation between clusters using either the whole transcriptome or highly variable genes for gene expression similarity comparison. This suggests that separation of distinct cell types over different cluster resolutions in Seurat is not dri v en by significant global gene expression differ ences (Figur e 4 C). To further test the performance of TRIAGE-Cluster, we compared the peaks identified by TRIAGE-Cluster with the clusters identified by Seurat and SC3. By using ARI across different cluster numbers and at 77 peaks / clusters, we found that TRIAGE-Cluster outperformed the other methods (Supplementary Figure S5e, f).
In contrast, TRIAGE-Cluster peaks consistently result in lower peak-peak Spearman rank correlation compared to Seurat and the corr elation incr eases at higher clustering r esolutions as differ ences between cell types become less distinct (Figure 4 C). We achie v ed the lowest peak-peak correlation using TRIAGE-Cluster with highly variable DS values assigned to genes. We further demonstrate this comparing peak-peak correlations of TRIAGE-Cluster versus Seurat clusters in which the clustering resolution of both methods results in the same cluster number (Figure 4 D). Peakpeak correlations of cell types originally defined in the single cell atlas of mouse gastrulation atlas data ( 41 ) were similar to those defined by Seurat (Figure 4 D-E). To further valida te tha t TRIAGE-Cluster peaks capture functionally relevant gene sets in the data, we performed Gene Ontology analysis using the genes with the top 100 DS for each peak or cluster. These data show that TRIAGE-Cluster peaks have more GO BP terms than Seurat clusters and therefore demonstrates that TRIAGE-Cluster extracts genes involved in more diverse biological processes (Figure 4 F, Tables S6-S9). To evaluate alternate clustering methods we compared TRIAGE-Cluster and SC3 ( 56 ). While Seurat uses a graph-based approach to connect cells based on their mutual nearest neighbour relationships, SC3 constructs a consensus matrix to r epr esent the similarity between cells based on clustering results from multiple algorithms. As with comparison to Seurat, we found that peaks generated by TRIAGE-Cluster exhibit grea ter dif ferences in cell di v er-sity than those from SC3 based on enrichment in gene ontologies r epr esented in the clusters (Supplementary Figur e S5g-j). Taken together, these data show that a combination of DS score with TRIAGE-Cluster efficiently identifies biolo gicall y distinct cell types in scRNA-seq data.
In our companion study, we demonstrate the utility of TRIAGE-Cluster by a ppl ying it on a multiplexed single cell atlas of iPSC differentiation data that integrates temporal data across eight time points and signalling data with nine de v elopmental signalling perturbation ( 71 ). Fortyeight peaks were identified from the in vitr o dif ferentia tion a tlas da ta, and clustering performance was evaluated using a trajectory approach VIA, which evaluates cell states' relationship ( 76 ). Comparing analysis using all cells vs TRIA GE-Cluster peaks, TRIA GE-Cluster peaks gi v e better peak-to-node assignment in the trajectory and significant correlation between pseudotime and actual time. These da ta demonstra te tha t TRIAGE-Cluster can define the biological di v ersity and improv e cell-cell trajectory analysis in scRNA-seq data. Overall, the assessment of the impact of RTS gene on clustering output using thirteen independent scRNA-seq datasets (Figure 4 A, Table S5) presented in this manuscript, along with our companion study a ppl ying the TRIAGE-Cluster pipeline to iPSC dif ferentia tion da ta covering di v erse lineages from in vitro and in vivo de v elopment, provide compelling evidence for the generalizability of our method across different assay conditions such as drug perturbations.

TRIAGE-ParseR enhances identification of cell gene regulatory networks
To complement TRIAGE-Cluster, we de v eloped a method called TRIAGE-ParseR to extract regula tory informa tion from a distinct population of cells. We aimed to de v elop a method that distinguishes biolo gicall y meaningful gene groups without reliance on subjecti v e e xternal r efer ence points (such as differ ential expr ession) or prior knowledge (such as GO, STRING, marker gene selection, or gene class biases) ( 19 ).
Based on the rationale that expression of key developmental genes is modulated by both gain or loss of H3K27me3 ( 77 , 78 ), we hypothesized that principal component analysis (PCA) of (breadth of) H3K27me3 for genes across di v erse cell states, specifically in di v erse human tissue and cell types, could help separate functionally relevant gene-gene relationships. By training logistic regression models with principal components obtained from gene expression or H3K27me3 breadth values, we demonstrate superiority of H3K27me3 as the input feature to predict genes with various Gene Ontology terms (Supplementary Figure S7) After performing PCA, we evaluated the input genes list for shared H3K27me3 deposition patterns encoded by the top PCs and modelled by a GMM. To generate the input gene list, the original expression matrix of a cell type is converted to DS and evaluated using TRIAGE-ParseR (   GNAS  TNNT2  SMARCD3  PRRX2  IGF2BP3  SPEG  DUSP9  PDIA6  BAMBI  PPP1R14C  CAMTA1  CADM1  GPC3  MAP1B  MEIS1  CRABP1  RBBP7  ACTC1  DLGAP4  CITED1  TNNI1  PTH1R  IRX3  TSC22D3  TPM1  ACTA1  PDGFRA  CRIP2  WNT2  PRTG  METAP1D  ANP32B  RPL38  ANKRD1  CAPN6  PPP1R1A  TSPAN18  CDKN1C  TENM4  SPON1  HS6ST2  GMDS  TMEM163  STK3  WNT5A  C4orf48  IGFBP5  FXYD6  NKX2−5  TBX5  COL2A1  MSX1  HAND2  IRX4  CDX2  TBX3  NR2F1  OTX2  NR2F2  PAX6  NKX2−6  BUB3  GATA5  IGF2BP1  GATA6  IGF2  HAND1  CA4  RNF220  TBX20  ISL1  PRDM6  MEIS2  MSX2  SFRP1  FGF8  BMP7  DLK1  SPR  CNTFR  IRX1  HCN4  PPIE  TNNI3  LEF1  BARX1  BMP4  MEIS3  TCF21  NAV2  IRX5  CELF4  PLXNA4  IRX2  KCNH2  ZNF503  SIX1  pa ttern demonstra te tissue-specific biological processes (Supplementary Figure S8b), and (iii) distinct tissue types can be characterised by these H3K27me3 patterns (Supplementary Figur e S8c). Furthermor e, these data show that the first PC recovers genes with strong tendency to deposit broad H3K27me3 across many tissue types (i.e. high-RTS genes) while subsequent PCs capture more tissue typespecific patterns (Supplementary Figure S8d). This justifies the importance of incorporating di v erse PCs to re v eal biolo gicall y informati v e signals that distinguish gene programs underpinning different cell types. We used TRIAGE-ParseR to evaluate gene expression from heart tissue and blood (Supplementary Figure S8e-g). In addition to the first PC which shows high concordance with RTS, analysis of di v erse PCs re v eals unique gene-gene TRIAGE-P arseR signatur es that can be used to inform gene-gene relationships.
We applied TRIAGE-ParseR on single cell clusters identified in the mouse gastrulation atlas data and focussed the initial demonstration of the method on cardiomyocytes (Figure 5 B). In the workflow pipeline ( Figure 5 C), pseudobulk gene expression (top ranked are primarily housekeeping and structural genes) is transformed with TRIAGE to deri v e the DS (top ranked are primarily cardiac regulatory genes). The top 100 genes ranked by DS are used as input to TRIAGE-ParseR as this gene number gave strong enrichment of variably expressed TFs as the surrogate of de v elopmentally regulated genes ( 79 ) (Figure 5 C, Figure S8h). The result showed three gene clusters with significant enrichment by STRING analysis (pr otein-pr otein interaction FDR < 0.001). Figure 5 D shows the raw H3K27me3 deposition patterns that clearly demarcate common epigenetic patterns governing each gene group.
We compared GO enrichment at each step of the workflow (Figure 5 C). While original gene expression shows enrichment in terms related to structural features of cardiomyocytes, TRIAGE DS pr ovides r obust enrichment in di v erse pa thways associa ted with cardiomyocyte morphogenesis and dif ferentia tion, TRIAGE-ParseR separa tes these genes into groups. Genes in cluster 3 (the most significantly r epr essed genes across di v erse cell types, Fig (Figure 5 E). Indeed, STRING analysis re v ealed that genes in these clusters formed significant pr otein-pr otein interactions and ther efor e r e v ealed both known and unknown gene-gene relationships (PPIs, enrichment P < 0.001) (Figure 5 F). These data demonstrate that H3K27me3 patterns re v eal gene regulatory networ ks rele vant to biological functions of genes and TRIAGE-ParseR can meaningfully segregate genes to facilitate cell type identification and biological interpretation.

Combing TRIAGE-cluster and TRIAGE-ParseR to identify cell types in scRNA-seq data
We utilised both TRIAGE-Cluster and TRIAGE-ParseR as a collecti v e suite of methods to evaluate cell types in the mouse gastrula tion a tlas da ta. Figure 6 A shows a holistic view of all TRIAGE-Cluster peaks, peak relationship at bandwidth 0.3, as well as corresponding original annotation and timepoint. While the mouse gastrulation atlas data set provides a powerful r esour ce for studying the earliest stages of mammalian organogenesis, most cell classifications captur e r edundant or broad cell groups that lack the nuance of cell di v ersification during early de v elopment. We ther efor e implement TRIAGE-Cluster as a complementary strategy for identifying more distinct cell types based on its ability to determine local maxima of RTS-ranked genes in the data set as a surrogate indicator of biolo gicall y distinct cell populations.

DISCUSSION
Understanding biological relationships between cell types and the gene networks that govern these differences remains central to maximising the value of technologies capturing genomic data at single cell resolution. This study demonstra tes tha t tools like TRIAGE tha t use an orthogonal reference point to determine identity-defining features provide a powerful mechanism for defining and interpreting the genetic basis of cell populations. The values assigned to all genes based on their RTS can act as an independent biological r efer ence point, which can be implemented downstr eam of dimensionality reduction algorithms to cluster cell types.
Regulation of the genome is in part mediated through histone modifications that control genome accessibility, acting as an on / off switch to regulate gene programs ( 80 ). We reasoned that regions with the broadest histone modification domains (r epr esenting extr eme le v els of on / off control) could be used to quantify the probability that a locus has a r ole contr olling cell decisions and functions. Based on this rationale, TRIAGE ( 20 ) uses patterns of histone methylation data to identify regions of the genome that are enriched in cell-specific identity defining genes. Genes with a high RTS have consistent r epr essive chromatin in most cell types and, ther efor e, in the rare instances when these genes get turned on, they likely have key r oles contr olling cell decisions or functions.
The collecti v e suite of methods built around TRIAGE provide se v eral major advantages: (i) TRIAGE is a customisable metric with a single value assigned to genes which makes it simple to implement in any genomic data set, (ii) it is versatile, providing a quantitati v e genomic model to weight gene prioritisation using routine genetic analysis tools, (iii) the simplicity of TRIAGE enables broad implementa tion in da ta analysis pipelines for any data sample from any cell, tissue, disease, or individual and providing a method to enrich for factors most likely to cause phenotypic changes in cell identity and (iv) can help prioritise genetic targets for drug de v elopment and identifying the primary determinants of disease and de v elopment.
TRIAGE-Cluster incorporates gene-le v el information inferred from H3K27me3, which offers a restricted view of biolo gicall y relevant cell populations in scRN A-seq data and provides higher sensitivity for detecting rare cell types. Our evaluation of TRIAGE-Cluster on in vitro and in vivo scRNA-seq datasets shows that it can accurately identify at least one peak in e v ery annotated cell type of the data (Supplementary Tables S3 and S4). This feature makes TRIAGE-Cluster highly suitable for generalizable use across a wide range of de v elopmental and adult tissues, as well as various assay conditions. Since our analysis demonstrates peaks generated from TRIAGE-Cluster can captur e gr ea ter dif ferences in cell di v ersity compared to current methods (Figure 4 and Figure S5), we introduce it as a complementary method to baseline clustering methods to help identify transcriptionally distinct cell types and gene expression pa tterns tha t may be missed by other methods. TRIAGE-ParseR provides complementary perspecti v es in studying gene-gene relationships. Unlike ChromHMM ( 81 ) or Paige et al. 's work ( 82 ), which focus on exploring the combina torial rela tionship between various histone modi-fications and other genomic features in a specific cellular conte xt, TRIAGE-ParseR le v erages a di v erse range of biological contexts. This approach allows the tool to extract epigenetic patterns defined by H3K27me3 that are uni v ersall y a pplicable. Namel y, (i) it provides an unsupervised a pproach to study the gene regulatory basis of cells without drawing on external r efer ence points (like differ ential expression analysis) or prior knowledge (including gene expression data), (ii) it parses genes into groups to re v eal biological mechanisms underpinning cell states and (iii) uses an independent r efer ence da ta set for evalua ting epigenetic pa tterns which provides a mechanism for applying this method to any input gene list in various biological contexts.
In future studies, we aim to address three limitations in the current workflow. First, we demonstrate that the current clustering method significantly restricts cell selection to capture clusters with the most biologically distinct identity. Our method has demonstrated the ability to capture theoretically any cell type from di v erse data types (Supplementary Tables S3 and S4), and furthermore, it has the potential to offer greater clarity and insight into the di v ersity of cell subtypes within a gi v en popula tion. W hile this is a major advantage, it comes at the cost of data trimming which should be weighed relati v e to the loss of statistical power provided by the scale of cell sampling. To address this limitation, we aim to use TRIAGE-Cluster in combination with da ta imputa tion or trajectory inference methods to interpret cell-cell relationships using all cells in the data set. Second, the current method is reliant on assumptions built into dimensionality reduction representing cell relationships in UMAP space. We indeed demonstra te tha t peaks generated from UMAP exhibit greater biological di v ersity compared with another topological dimensional reduction method, t-SNE (Supplementary Figure S5a, b), but the major limitations exist for simplifying the data into these graphical presentations. To address this, we propose additional strategies such as comparison of cell types using TRIAGEdendrograms (Figure 3 H, Figure S6) which provide independent approaches to evaluate cell-cell relationships. We aim to de v elop approaches that provide the analysis value of TRIAGE-Cluster but avoid the limita tions associa ted with topo gra phical ma p r epr esentations. Third, TRIAGE-ParseR provides a unique approach to decipher gene-gene relationships but draws on only a single epigenetic reference point. Future de v elopments of this method need to parameter test additional histone modifications r epr esenting acti v e promoters (H3K4me3) or enhancers (H3K27ac and H3K4me1) as complementary to r epr essi v e chromatin marked by H3K27me3 as a basis for more nuanced analysis of gene-gene relationships.
This study introduces a unique strategy for identifying cell types by using epigenetic information as a biological r efer ence point for cell clustering and deconstructing gene modules in single-cell expression data. The workflow is illustrated by analysing in vivo organogenesis and in vitro dif ferentia tion a tlases. Collecti v el y, TRIAGE highl y ranks regions of the genome enriched in genetic features with profound effects on cell identity because these regions encode potent determinants of cell de v elopmental processes. These ranking features and genome-wide predictions make TRIAGE unique among genomic analysis methods and e62 Nucleic Acids Research, 2023, Vol. 51, No. 11 PAGE 16 OF 18 position it to address areas of need in cell biology and genetics. In addition to our prior studies ( 20 ), TRIAGE-Cluster and TRIAGE-ParseR provide a growing portfolio of methods that complement baseline methods such as Seurat and SC3. These methods facilitate analysis of large-scale data sets by integrating genome-wide epigenetic data with scRNA-seq data to parse cell-cell and gene-gene relationship governing mechanisms of cell identity.

SUPPLEMENT ARY DA T A
Supplementary Data are available at NAR Online.