Clustering-independent analysis of genomic data using spectral simplicial theory

The prevailing paradigm for the analysis of biological data involves comparing groups of replicates from different conditions (e.g. control and treatment) to statistically infer features that discriminate them (e.g. differentially expressed genes). However, many situations in modern genomics such as single-cell omics experiments do not fit well into this paradigm because they lack true replicates. In such instances, spectral techniques could be used to rank features according to their degree of consistency with an underlying metric structure without the need to cluster samples. Here, we extend spectral methods for feature selection to abstract simplicial complexes and present a general framework for clustering-independent analysis. Combinatorial Laplacian scores take into account the topology spanned by the data and reduce to the ordinary Laplacian score when restricted to graphs. We demonstrate the utility of this framework with several applications to the analysis of gene expression and multi-modal genomic data. Specifically, we perform differential expression analysis in situations where samples cannot be grouped into distinct classes, and we disaggregate differentially expressed genes according to the topology of the expression space (e.g. alternative paths of differentiation). We also apply this formalism to identify genes with spatial patterns of expression using fluorescence in-situ hybridization data and to establish associations between genetic alterations and global expression patterns in large cross-sectional studies. Our results provide a unifying perspective on topological data analysis and manifold learning approaches to the analysis of large-scale biological datasets.

* These authors contributed equally to this work.
# Correspondence to: pcamara@pennmedicine.upenn.edu

The main objects of study in this paper are a finite data set $X$ (often also termed a point cloud) with a notion of distance or dissimilarity $\|x_i - x_j\|$, where $x_i, x_j \in X$ and $\|\cdot\|$ is a distance in $X$, and a set of features $\{f_r\}$ defined as maps from $X$ into a formally real field $F$. In 2006, He, Cai, and Niyogi proposed an algorithm for unsupervised feature selection called the Laplacian score (1).
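They construct a weighted nearest neighbor graph $G$ with nodes $x_i \in X$ and adjacency matrix $A$, and define the Laplacian score of a feature $f_r$ as

\[ R_G(f_r) = \frac{\tilde{f}_r^{T} L \tilde{f}_r}{\tilde{f}_r^{T} D \tilde{f}_r}, \qquad \tilde{f}_r = f_r - \frac{\langle f_r, \mathbf{1} \rangle_G}{\langle \mathbf{1}, \mathbf{1} \rangle_G} \mathbf{1}, \]

where $D$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$, $L = D - A$ is the graph Laplacian, $\langle f, g \rangle_G = f^{T} D g$, and $\mathbf{1}(x_i) = 1$, $\forall x_i \in X$, is the unit feature vector. The Laplacian score ranks features according to their consistency with the structure of $G$. Specifically, features with small values for $R_G(f_r)$ take high values in highly connected nodes of $G$. This approach to unsupervised feature selection has become widespread, as it offers substantially more statistical power than ranking features according to their variance (1). In what follows, we generalize these notions to simplicial complex representations of the data.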
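As a concrete illustration of the Laplacian score described above, the following Python sketch computes it with NumPy on a toy path graph. It assumes the standard form of He, Cai, and Niyogi, in which each feature is centered by its degree-weighted mean and the score is the ratio of the Laplacian quadratic form to the degree-weighted variance. The function name `laplacian_score`, the toy graph, and the two features are illustrative choices and are not taken from the original work.

```python
import numpy as np

def laplacian_score(A, F):
    """Laplacian score of each column of F on a graph with adjacency matrix A.

    Assumes a symmetric adjacency matrix, no isolated nodes, and
    non-constant features (a constant feature gives a zero denominator).
    """
    A = np.asarray(A, dtype=float)
    F = np.asarray(F, dtype=float)
    d = A.sum(axis=1)                  # node degrees, the diagonal of D
    L = np.diag(d) - A                 # graph Laplacian L = D - A
    means = (d @ F) / d.sum()          # degree-weighted mean of each feature
    Ft = F - means                     # centered features
    num = np.einsum("ir,ij,jr->r", Ft, L, Ft)   # centered f^T L f
    den = np.einsum("ir,i,ir->r", Ft, d, Ft)    # centered f^T D f
    return num / den

# Toy example (hypothetical data): a path graph 0-1-2-3-4-5.
n = 6
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

smooth = np.array([0, 0, 0, 1, 1, 1])        # varies slowly along the path
alternating = np.array([0, 1, 0, 1, 0, 1])   # flips at every edge
F = np.column_stack([smooth, alternating])

scores = laplacian_score(A, F)
print(scores)  # the smooth feature receives the smaller score
```

On this toy graph the feature that changes across a single edge scores lower than the one that changes across every edge, which is the ranking behavior the text describes.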
Preliminary Definitions. We first recall some standard definitions from algebraic topology that will be used below (2). We define an ordered abstract simplicial complex $K$ on a finite set $X = \{x_0, \ldots, x_n\}$ as a collection of ordered subsets of $X$ which is closed under inclusion, i.e. $\tau \subset \sigma \Rightarrow \tau \in K$, $\forall \sigma \in K$. The elements of $K$ with $p + 1$ points are called $p$-simplices. We denote the set of $p$-simplices of $K$ by $S_p(K)$.
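There are multiple ways to construct an ordered abstract simplicial complex from a data set $X$ and an order relation $x_0 < x_1 < \cdots < x_n$ among the elements of $X$ (3). One example is the Čech complex $\check{C}(X, \varepsilon)$, in which a subset of $X$ spans a simplex whenever the balls of a common radius, fixed by the scale $\varepsilon$ and centered at its points, have a non-empty common intersection.

Given a $(p + 1)$-simplex $\sigma = \{x_0, \ldots, x_{p+1}\} \in K$, we define its boundary as the linear combination of $p$-simplices

\[ \partial \sigma = \sum_{i=0}^{p+1} (-1)^{i} \{x_0, \ldots, \hat{x}_i, \ldots, x_{p+1}\}, \]

where $1$ is the unit element of $F$ and $\hat{x}_i$ indicates that the vertex $x_i$ is omitted. More generally, the boundary operator can act on linear combinations of $p$-simplices. We denote by $\mathrm{sgn}(\tau, \sigma)$ the sign of a $p$-simplex $\tau$ contained in the boundary of $\sigma$.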
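A short worked example may help fix the sign conventions. It assumes the usual alternating-sign convention for ordered simplices, in which omitting the $i$-th vertex contributes the sign $(-1)^i$; the labels $x_0, x_1, x_2$ are illustrative.

```latex
% Boundary of the 2-simplex \sigma = \{x_0, x_1, x_2\}
\partial \{x_0, x_1, x_2\} = \{x_1, x_2\} - \{x_0, x_2\} + \{x_0, x_1\}

% Signs of the faces read off from the expression above
\mathrm{sgn}(\{x_1, x_2\}, \sigma) = +1, \quad
\mathrm{sgn}(\{x_0, x_2\}, \sigma) = -1, \quad
\mathrm{sgn}(\{x_0, x_1\}, \sigma) = +1

% Applying the boundary operator twice gives zero
\partial \partial \{x_0, x_1, x_2\}
  = (\{x_2\} - \{x_1\}) - (\{x_2\} - \{x_0\}) + (\{x_1\} - \{x_0\}) = 0
```

The cancellation in the last line is the general property that the boundary of a boundary vanishes, which is what makes the combinatorial Laplacian and its harmonic forms well defined.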
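We now turn our attention to maps from $X$ into $F$. We define a $p$-point feature of $X$ as a map into $F$ that depends on $p$ points of $X$.

With these definitions we can extend the inner product $\langle f, g \rangle_G$ to a weighted inner product between $p$-forms on simplicial complexes,

\[ \langle f^{(p)}, g^{(p)} \rangle_K = \sum_{\sigma \in S_p(K)} w(\sigma) \, f^{(p)}(\sigma) \, g^{(p)}(\sigma), \]

where $w(\sigma) \in F$ is the weight of $\sigma$. In particular, notice that the inner product between 0-forms is equivalent to $\langle f, g \rangle_G$, where $G$ is the 1-skeleton of $K$.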
Finally, we introduce Eckmann's generalization of the graph Laplacian to discrete $p$-forms on simplicial complexes (6).

Relation to Feature Extraction. Feature extraction is a problem closely related to feature selection, in which a finite set of synthetic features that optimally capture the structure of the point cloud is engineered. These synthetic features can then be used for dimensionality reduction and de-noising. The Laplacian score of He, Cai, and Niyogi follows from Laplacian Eigenmaps for dimensionality reduction (7). In what follows, we show how this relation can be naturally extended to simplicial complexes and combinatorial Laplacian Eigenmaps.
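For the graph case, the Laplacian Eigenmaps construction mentioned above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes an unweighted, connected graph with no isolated nodes, solves the generalized eigenproblem $L y = \lambda D y$ through the symmetrically normalized Laplacian, and uses a hypothetical function name `laplacian_eigenmaps` and a toy graph made of two triangles joined by one edge.

```python
import numpy as np

def laplacian_eigenmaps(A, n_components=1):
    """Return the smallest non-trivial eigenvalues and eigenvectors of L y = lambda D y.

    Assumes a symmetric adjacency matrix of a connected graph, so that
    only the first (constant) eigenvector has eigenvalue zero.
    """
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    L = np.diag(d) - A
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalization: D^{-1/2} L D^{-1/2} has the same eigenvalues
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, U = np.linalg.eigh(L_sym)        # eigenvalues in ascending order
    Y = d_inv_sqrt[:, None] * U               # map back to generalized eigenvectors
    return eigvals[1:n_components + 1], Y[:, 1:n_components + 1]

# Toy graph (hypothetical): triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

eigvals, embedding = laplacian_eigenmaps(A, n_components=1)
print(eigvals)
print(np.sign(embedding[:, 0]))  # nodes of each triangle share a sign
```

The first non-trivial eigenvector plays the role of a synthetic feature: it is the non-constant function on the nodes with the smallest score, and here it separates the two triangles.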
To that end, we consider the diagonalization problem of the combinatorial Laplacian.

Bivariate Combinatorial Laplacian Score. The combinatorial Laplacian score for discrete differential forms can be thought of in close analogy to the concept of variance for random variables. In this regard, it is natural to extend the combinatorial Laplacian score to pairs of discrete differential forms, similarly to the covariance of pairs of random variables. For 0-forms, the corresponding expression reduces to the weighted covariance of $f^{(0)}$ and $g^{(0)}$, with $G$ the 1-skeleton graph of $K$. Thus, the bivariate score for 0-forms is small for pairs of features that take mutually exclusive high values in adjacent nodes of $G$ (e.g. $f^{(0)}(x_1) = 1$, $f^{(0)}(x_2) = 0$, $g^{(0)}(x_1) = 0$, $g^{(0)}(x_2) = 1$, $A_{12} = 1$). Note in particular that when the two forms are identical, the bivariate score reduces to the combinatorial Laplacian score introduced above.
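The two-node example in the text can be checked numerically for 0-forms on a graph. The sketch below is an assumption-laden illustration: the normalization used here (the geometric mean of the two degree-weighted variances) is one choice that makes the bivariate score coincide with the univariate Laplacian score when the two features are identical, and it may differ from the normalization used in the original work. The function name `bivariate_laplacian_score` is hypothetical.

```python
import numpy as np

def bivariate_laplacian_score(A, f, g):
    """Bivariate Laplacian score of two node features on a graph (0-forms only).

    Assumed normalization: sqrt of the product of the degree-weighted
    variances of f and g. This is an illustrative choice.
    """
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                  # node degrees
    L = np.diag(d) - A                 # graph Laplacian

    def center(h):
        h = np.asarray(h, dtype=float)
        return h - (d @ h) / d.sum()   # subtract the degree-weighted mean

    ft, gt = center(f), center(g)
    norm = np.sqrt((ft @ (d * ft)) * (gt @ (d * gt)))
    return (ft @ L @ gt) / norm

# Two adjacent nodes with mutually exclusive high values, as in the text.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
f = [1.0, 0.0]
g = [0.0, 1.0]

score_fg = bivariate_laplacian_score(A, f, g)
score_ff = bivariate_laplacian_score(A, f, f)
print(score_fg, score_ff)
```

Under this normalization the mutually exclusive pair receives a negative score, while the score of a feature with itself is positive and equals its univariate Laplacian score.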
Directions for Future Research. There are at least two directions that we believe deserve further investigation. Graph Laplacians admit elegant interpretations in terms of random walks, Markov chains, and diffusion processes (8). It is still unclear to us how this formalism can be generalized to higher-dimensional combinatorial Laplacians and random walks on simplicial complexes, although some progress has already been made in that direction (9,10).
Additionally, given a filtration of Čech complexes with varying scale $\varepsilon$, it has been shown that its cohomology (spanned by the zero-eigenvalue $p$-forms of the combinatorial Laplacian) can be formulated in terms of the theory of persistence (11). It is an open question whether this formulation can be extended to non-zero eigenvalues of the combinatorial Laplacian.
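The correspondence between zero eigenvalues of the combinatorial Laplacian and cohomology can be checked numerically on a single small complex. The sketch below assumes unit weights on all simplices, real coefficients, and the alternating-sign boundary convention; it handles one fixed complex rather than a filtration, and the helper names (`boundary_matrix`, `combinatorial_laplacian`, `betti_number`) are illustrative.

```python
import numpy as np

def boundary_matrix(simplices_p, simplices_pm1):
    """Matrix of the boundary operator from p-simplices to (p-1)-simplices."""
    index = {s: i for i, s in enumerate(simplices_pm1)}
    B = np.zeros((len(simplices_pm1), len(simplices_p)))
    for j, s in enumerate(simplices_p):
        for k in range(len(s)):
            face = s[:k] + s[k + 1:]          # omit the k-th vertex
            B[index[face], j] = (-1) ** k     # alternating sign
    return B

def combinatorial_laplacian(simplices, p):
    """Unweighted combinatorial Laplacian acting on p-simplices.

    `simplices` maps each dimension to a list of sorted vertex tuples.
    """
    n_p = len(simplices[p])
    L = np.zeros((n_p, n_p))
    if p > 0:
        B_p = boundary_matrix(simplices[p], simplices[p - 1])
        L += B_p.T @ B_p                      # contribution from lower dimension
    if simplices.get(p + 1):
        B_up = boundary_matrix(simplices[p + 1], simplices[p])
        L += B_up @ B_up.T                    # contribution from higher dimension
    return L

def betti_number(simplices, p, tol=1e-9):
    """Number of (numerically) zero eigenvalues of the p-th Laplacian."""
    eigvals = np.linalg.eigvalsh(combinatorial_laplacian(simplices, p))
    return int(np.sum(np.abs(eigvals) < tol))

# A hollow triangle has one loop; filling it in removes the loop.
hollow = {0: [(0,), (1,), (2,)], 1: [(0, 1), (0, 2), (1, 2)]}
filled = {**hollow, 2: [(0, 1, 2)]}

print(betti_number(hollow, 0), betti_number(hollow, 1))
print(betti_number(filled, 0), betti_number(filled, 1))
```

For $p = 0$ the matrix returned by `combinatorial_laplacian` is the ordinary graph Laplacian of the 1-skeleton, so the same code also illustrates how the graph case sits inside the simplicial one.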
Addressing these questions will further contribute to the conceptual unification of the tools of manifold learning and topological data analysis.