Geometry of the Gene Expression Space of Individual Cells

There is a revolution in the ability to analyze gene expression of single cells in a tissue. To understand this data we must comprehend how cells are distributed in a high-dimensional gene expression space. One open question is whether cell types form discrete clusters or whether gene expression forms a continuum of states. If such a continuum exists, what is its geometry? Recent theory on evolutionary trade-offs suggests that cells that need to perform multiple tasks are arranged in a polygon or polyhedron (line, triangle, tetrahedron and so on, generally called polytopes) in gene expression space, whose vertices are the expression profiles optimal for each task. Here, we analyze single-cell data from human and mouse tissues profiled using a variety of single-cell technologies. We fit the data to shapes with different numbers of vertices, compute their statistical significance, and infer their tasks. We find cases in which single cells fill out a continuum of expression states within a polyhedron. This occurs in intestinal progenitor cells, which fill out a tetrahedron in gene expression space. The four vertices of this tetrahedron are each enriched with genes for a specific task related to stemness and early differentiation. A polyhedral continuum of states is also found in spleen dendritic cells, known to perform multiple immune tasks: cells fill out a tetrahedron whose vertices correspond to key tasks related to maturation, pathogen sensing and communication with lymphocytes. A mixture of continuum-like distributions and discrete clusters is found in other cell types, including bone marrow and differentiated intestinal crypt cells. This approach can be used to understand the geometry and biological tasks of a wide range of single-cell datasets. The present results suggest that the concept of cell type may be expanded. In addition to discreet clusters in gene-expression space, we suggest a new possibility: a continuum of states within a polyhedron, in which the vertices represent specialists at key tasks.


A. Estimation of technical noise and bimodality in the intestinal cell dataset
In the data of [1], 5 of the genes had duplicate primers, allowing estimation of technical noise and bimodality effects. For all 5 genes, measures were highly correlated at high expression, but for low expression, often one of the primers displayed a read of 0, leading to a bimodal effect as described in [2,3]. Since bimodality occurs only at low expression levels (S21a Figure), we assumed that they are due to failure of one primer to sample the gene. Thus, for the 5 genes that had duplicate primers, we took the mean expression of the primers, unless one of them was zeroin which case we took the non-zero value. See S21b Figure for analysis of false-negative error (i.e. the chance of a gene to be detected only by one of the primers) for different expression levels. Since the error is highest at low expression, we removed 10% genes and cells with low expression as described in methods. We compared this to the method of McDavid et al [2] for removing outlier cells, and find excellent agreement (89% overlap in the cells removed by the two methods, using parameters which remove 10% of the cells).

B. Robustness of archetypes to the sampling of the data, Intestinal dataset
In order to test the robustness of the method to the specific sampling of the data, we created 1,000 randomized datasets by bootstrapping the intestinal dataset (resampling the data with replacements to reach the same size as the original dataset). We computed the positions of the archetypes for each of the datasets. Table S1 shows the errors in each of the archetypes. The relative errors were calculated by computing the variance in each of the 3 principal axes of the noise (variation between bootstrapped datasets) in each archetype, taking their mean value, and dividing by the distance of the archetype from the origin (the dataset mean). Figure 2c-f and 4b show the archetype positions mean and standard deviation, represented by ellipsoids whose principal axes are aligned with the principal axes of the noise in each archetype, and their length is the standard deviation of the archetype position in these directions.
We also tested the robustness of the list of enriched genes in every archetype under bootstrapping, using the same 1D enrichment method as for the original dataset. All of the enriched genes in the list repeated in at least 50% of the bootstrapped data sets, except for 6 genes that were consequently removed. The inferred tasks remained the same under bootstrapping.

C. Comparison of archetypal analysis to clustering methods
When studying the composition of a tissue, several clustering techniques can be used to classify cells into subtypes based on their gene expression [4]. These methods give a discrete description of the data, by assigning each cell to one group: Where ⃗ is the approximated description of data point (cell) ⃗, ⃗ are the clusters centroids, and N is the number of clusters.
The description suggested by archetypal analysis is different because it is continuous: Each cell is represented by its position inside the polytope, given by its distance from the archetypes: Where ⃗ are the archetypes vectors, and N is the number of archetypes.
The Pareto-inspired interpretation of the polytopal geometry [5] of the data further suggests a biological insight in terms of the tasks of the system and their tradeoffs, which is lacking in clustering approaches.
In this section we compare the sensitivity of the description of the data by archetypal analysis to the sensitivity of its description by 3 commonly used clustering methods: k-means clustering [6], UPGMA hierarchical clustering [7], and self-organizing maps [8,9].
In order to check the sensitivity of the different methods to the sampling of the data, we created 1,000 random datasets by bootstrapping (resampling of the data points with replacement). For each bootstrapped dataset we found archetypes using the PCHA algorithm [10] and clustered the data using the three methods.
Each cell in each dataset was described by 76 coordinates. In archetypal analysisthe coordinates of the best description of the cell by linear combination of the archetypes: and in clusteringthe centroid of the cluster that the cell is assigned to: We gathered these descriptions for each cell, for each of the methods. Then we computed the normalized standard deviation (SD) for each cell by dividing the standard deviation along the 1 st principal component of variation in these descriptions by the standard deviation of the 1 st principal component of the complete real data: ⃗ Visualization of the results can be seen in Figure S3b, where cells are colored by their normalized SD in each of the methods and in Figure S4 which shows the distribution of the normalized SD among cells. Note that cells next to the vertices of the polygon have low SD in all of the methods (in UPGMA even lower than in archetypal analysis), since their assignment to a cluster is robust. However, cells in the middle of the dataset, which do not belong naturally to any particular cluster, have high SD in all of the three clustering methods. These cells are assigned in different bootstrapped datasets to different clusters, since the boundaries of the clusters are sensitive to the specific sampling of the data (S2 Figure  Mathematically, ⃗ and ⃗ are both fairly robust to the sampling of the data (see Table S1), but because of the different constrains, ⃗ in clustering is not robust to sampling while in archetypal analysis it is, in data with a geometry such as the present dataset. Note that if the data was arranged in distinct, well-separated clouds, both clustering and AA would be robust to sampling.
The robustness of archetype analysis to data sampling in comparison to the other methods tested here is summarized in S3a Figure, that shows the percentage of cells with high normalized SD (variation between bootstrapping datasets larger than 0.3 of the variation captured by the first PC of the real dataset), which is less than 1% in archetypal analysis and more than 14% in all of the other methods.

D. Definition of cell types in the Intestinal dataset
We examined two methods to divide the data points into cell types.
Second, we assigned each data point to a cell type according to the closest archetype (for example, if a point was closer to the enterocyte archetype than to other archetypes, it was marked as an enterocyte). Sorting by the two methods gave similar results (85% overlap).
This second method classifies the Nodal cells as well, which were not previously recognized.
Note that the published data in Dalerba et al does not include 34 genes, included in the present analysis, kindly provided by Dalerba et al for this study. A subset of these genes allows identification of the Nodal class.
Finally, we also used t-SNE, an algorithm for non-linear projection of the data that conserves the pairwise distance between points [11]. We projected the data into 2D using t-SNE, and then used k-means clustering with distance measure of one minus the cosine of the included angle between points, and 20 replicates starting from arbitrary initial cluster centroids, to divide the data into 4 clusters. Again, classification agreed with the other methods (79% overlap).

E. Effect of sample size on the statistical significance of polytopes
Practical use of the present approach raises the question of how much data is needed to discover polytopes reliably. To test this, in this section we study the effect of sample size needed to attain statistical significance for tetrahedral geometry, given that the data lies perfectly in a tetrahedron. We randomly sampled points from a perfect tetrahedron and checked the p-value for a tetrahedron by applying our algorithm to this data. The test was done for 20 to 70 points, and the p-value was computed as described in Methods: Statistical significance of best fit polytopes. Results are shown in S8 Figure. 20 points yield a non-significant p-value of 0.2; 30 points are enough to get a p-value of 0.04; while for 40 points the p-value is already small -0.003. In general, the p-value appears to drop exponentially with the number of points. However, considering technical and biological noise as well as other effects beside trade-offs that may be in play, one should expect higher p-values when considering real datasets.
In light of these results, it is not surprising that when testing each cell type separately goblet cells (18 cells) do not show a significant polytope. However, enterocytes (89 cells) and nodal cells (114 cells) could in theory yield a significant p-value. The fact that they do not show such geometry may suggest that the current approach may not be useful for these cells, since other effects dominate the structure of the data. Alternatively, tradeoff may exist in these cells, but is not manifested in the set of genes that were measured in this study. These cells might also have too many competing tasks (too many archetypes) to be detectable.

F. Analysis of a single-cell qPCR dataset of a human colon cancer xenograft from a single cancer cell
An important question that can now be studied using single-cell technologies is the heterogeneity of cancer tumors, and specifically, whether tumors heterogeneity recapitulates developmental processes in normal tissues [1,[12][13][14][15]. We studied a dataset of human colon cancer xenograft, created from a single human cancer cell that was injected to mice, as described in [1], which contained the expression of 90 genes in 589 cells.
We pre-processed the data as described in Methods: Processing and normalization of the data.
We united 5 genes with multiple primers as described in S1 Text Section A, and removed low expressed genes and low expressing cells, to obtain a dataset of 521 cells and 76 genes.
We found that the cancer cells data is, to a good approximation, low dimensionalonly 2 PCs explained 37% of the variance in the data, while the third PC explained only 4% more (S11b Figure). In order to compare between the cancer data and the normal tissue data, we projected the cancer cells into the space spanned by the 2 first PCs of the healthy bottom crypt cells (the two datasets included the same genes). The normal cells are the cells that appear in Fig. 5b (though in Fig. 5b the analysis was carried using only a subset of genes to compare with mouse data (Fig. 5a)).
In agreement with Dalerba et al [1], we notice that the cancer cells re-inhabit the Pareto front spanned by the normal cells, with a similar density distribution (compare Fig. 5(b), Fig. S11(a)). This hints that the human cancer cells undergo differentiation similar to the normal mouse and human tissues.

G. Analysis of a single-cell mass cytometry dataset from human bone marrow
We analyzed data of protein levels in human healthy bone marrow cells, acquired by single-cell mass cytometry [11,16]. The data contained measurements of 31 protein levels in 10,000 cells, and was preprocessed and normalized as described in Methods.
First, we asked whether the data geometry is polyhedral. We found that the data is approximately 4 dimensional and that it is well described by a 4D polytope with 5 vertices (pvalue 0.005, Fig. 6, S12 Figure (a) and (b)). A 4D simplex explains 52% of the variance in the data, giving a description almost as accurate as the full 4D space spanned by the first 4 PCs, which explains 54%. Projections of the simplex on all the PC pairs are shown in S12(c) Figure. .
Next, we identified the archetypes and inferred the tasks they perform. In order to do so we examined the archetypes gene expression profiles (S14 Figure) and carried a leave-1-out enrichment analysis (S13 Figure, Table S5, Methods: 1D Gene enrichment at archetypes). We found that 4 of the archetypes showed marker genes identified with a known cell type. Archetype 1 is enriched with surface markers CD3 and CD4, and thus represents CD4 T cells. Archetype 2 is enriched with CD8 and express high CD33 and thus recognized with CD8 Tcells; Archetype 3 is enriched with CD19 and CD20, markers of B-cells; And archetype 4 is enriched with CD11b and CD33, markers of macrophages/ monocytes. All of these archetypes are enriched with CD45, a marker of leukocytes. The last archetype does not express this marker, and is not enriched for any other gene except from division marker ki67. We hypothesize this archetype represents non-leukocytes, which express proteins that were not included in this dataset.
One unexpected feature of this dataset is the existence of cells in the middle of the simplex. This may correspond to cells in the original viSNE analysis [11] that were not amenable to classification (S15 Figure). Based on the present results, one may hypothesize that some bone marrow cells express a combination of markers which does not label them as classic cell types.

H. Analysis of a single-cell RNA-Seq dataset of stimulated mouse spleen dendritic cell
Dendritic cells are known to carry out a wide range of immune functions, such as pathogen recognition, antigen presentation and regulation of lymphocytes [17][18][19]. It has been suggested that these functions cannot all be carried out by the same cell, and evidence for the existence of DC subtypes was found [20][21][22][23][24]. However, these cells cannot be easily clustered based on their gene expression profiles [25]. We therefore applied the Pareto approach to analyze dendritic cells, hypothesizing that they may reveal tradeoffs between immune tasks.
We used the single-cell RNA-Seq data from [25], of mice spleen CD11c+ cells. Each cell was characterized by 20,091 gene expression counts, based on sampling a fraction of the cell's mRNA pool. This data was classified by Jaitin et al into seven groups using a probabilistic mixture model. One group of cells, however, seemed to defy clear clustering (class VII in [25]) -These are the dendritic cells in the spleen. We therefore analyzed LPS stimulated dendritic cells. We performed down-sampling as in [25] and focused on the 500 genes with the highest standard deviation across samples, resulting in 312 cells with 400 mRNA counts per cell. Results were similar for choosing genes by highest mean or median. The data was preprocessed and normalized as described in Methods.
We find that the data is well described by a tetrahedron (p<10^-3, Fig. 6 bottom row). A tetrahedron explains about 10% of the variance in the data (S23a Figure). This fraction of explained variance is lower than in the other datasets analyzed in this study, presumably because of the large number of genes in the RNA-Seq dataset (500 versus a few tens in the qPCR and cytof methods) which may contribute larger experimental variability. When inspecting the eigenvalue loadings of the data compared to shuffled data (see Methods: determining the number of archetypes, S23b Figure), it seems that the data is embedded in a 5 dimensional space. However, we chose to describe the data by a 3D polytope since the relatively small number of cells and of reads per cell does not allow finding robust archetypes in 5D space (5D archetypes are not stable under bootstrapping). It is thus possible that some of the archetypes that we find may represent unification of higher order archetypes.
To study the archetypes tasks, we carried an enrichment analysis on biological functions (MSigDB gene sets [26]), using parts of ParTI software package [27]. In this analysis we considered all 10,208 genes that were expressed in the LPS-stimulated dendritic cells. The value for each gene group in each cell was set to be the average value of the expression of genes that belong to this group. We binned the cells according to their Euclidean distance from each archetype (in the 500D gene expression space) and asked which biological functions are enriched maximally in the cells closest to each archetype, as described in Methods: 1D enrichment analysis.
To address concerns about multiple hypothesis testing [28,29], we used the Benjamini-Hochberg (FDR<0.01) test to ensure that the functional categories found significant are not due to the fact that we test many categories. As a second test, we shuffled the data and repeated the enrichment analysis, to find that only 34 ± 14 functional categories are enriched on average in shuffled data compare to 1164 in the real data (total over all archetypes). We conclude that type-II errors do not explain the present enrichment.
To remove concerns about sensitivity to sampling, raised because of the sparseness of the data, we further checked the sensitivity of the enrichment to bootstrapping of the data. We resampled the data with replacement 100 times and inspected only gene groups that were statistically significantly enriched in the same archetype in more than 80% of the bootstrapped data sets (see Table S6 for the full list, S24 Figure shows the archetypes gene expression profiles).
The results suggest that dendritic cells trade-off between 4 key immune tasks: (i) response to virus (cytoplasmic DNA response and interferon pathways) (ii) Dendritic cell maturation, and formation of cytoskeletal features. (iii) Antigen presentation and activation of lymphocytes (iv) putative apoptosis pathways.
The first archetype is enriched with anti-viral functions and interferon pathways. It shows high expression of genes such as Stat2, Irf7, Ifit3 and Ifit1 that were previously described as part of an anti-viral module which is expressed only in a subset of dendritic cells [18,22]. Cells closest to this archetype correspond to this previously described group.
The second archetype is enriched with cytoskeleton organization and biogenesis functions. It is enriched with genes Tmem176a and Tmem176b, that were previously reported as marking an immature dendritic cell state [30]. In addition it expresses Ccr7, which is present in migrating cells [24,31]]. We hypothesize that this archetype represents cells that are found in a maturation process. A main task of these cells is phagocytosis, which also requires high cytoskeletal activity [32][33][34]. Enrichment of GO terms such as lytic vacuole, vacuole and lysosome strengthen this hypothesis.
The third archetype is enriched with cytokine production and secretion, and cell-cell signaling.
Highly expressed genes in this archetypes are chemokines including Ccl5, Ccl22, Ccl4, and antigen presentation related genes such as subunit IL27, H2-Eb1, H2-Aa, and Cd74. Cells near this archetype are likely to be active mature dendritic cells [22,35,36]. Their task is antigen presentation and stimulation of lymphocytes including T cells.
The fourth archetype is harder to interpret. It is enriched with, apoptosis pathways, activity in mitochondria and membrane parts, and sphingolipids metabolism. Therefore we hypothesize it may represent cells that undergo LPS-induced apoptosis [37][38][39].
Further analysis of dendritic cells using the Pareto archetype analysis approach may be promising due to the continuous nature of their gene expression, and their ability to perform multiple tasks. Future advances in single-cell RNA-Seq technology may allow better characterization of gene expression profiles of single cells from this fascinating population and achieve better understanding of their function. In addition, it would be highly interesting to explore the spatial position of cells near the different archetypes in the spleen, which are known to contain DCs in different stages of their activation [24].

I. Uniformity of the distribution of cell states varies between tissues
Different tissues have different distributions of cell states, ranging from clumped to uniformly distributed geometries in gene expression space. Clumped geometry relates to the classic idea of separate cell types, where uniform filling challenges this view. Both geometries can be described by the Pareto perspective, with different distributions on the Pareto fronti.e. the polytope. Different factors that can influence this distribution are the shape of the fitness function, the shape of the performance functions where steep performance functions (as determined by a condition on their second derivative) lead to population of the front only near the vertices [5], and the flexibility that the system needs when facing a temporally fluctuating environment .
In this section, we quantify the deviation from a uniform distribution for each tissue. To do this, we compute the mean local density in a ball of volume around each data point, where is equal to the volume of the convex hull of the data divided by the number of data points N. We then compute the same mean local density for a uniform distribution, , by generating N uniformly distributed random points in the same convex hull. is equal to the global density with a correction for edge effects for points near the convex hull. The deviation from uniformity of the data is then given by: .
When the data is more clustered then uniform distribution, the average local density around data points is higher than for a uniform distribution, thus . The larger is, the larger the deviation from uniform distribution. reflects a deviation from a uniform distribution toward a more ordered geometry, e.g. if the points are arranged on a lattice. A value of is consistent with a uniform distribution. This measure is inspired by Ripley's K function [40].
We projected the data points and the archetypes into the 3 first PCs space of the data, and the density analysis was carried in this space. Uniformly distributed random points were drawn 10 times for each datasets to estimate mean and error bars.
We found that bone marrow cells are the most clustered among the datasets we checked ( ) (S16 Figure), aligning with the classic view of the hierarchical hematopoietic lineage [41,42]. Intestinal cells are less clustered ( ), dendritic cells even more ( ), and intestinal progenitors are the most continuous ( ), in agreement with recent findings about their plasticity [43] and with studies on embryonic stem cells [44].

J. Comparison of archetypal analysis to Principal Component Analysis
In this section we compare archetypal analysis (AA) and Principal Component Analysis (PCA) as applied to the single cell intestinal dataset. A fair comparison is between 3PCs and a tetrahedron, because they both describe 3D subspaces of the 76D gene space. Another way to see this is that each point in the 3PC space is described by three coordinates, and each point in the tetrahedron is also described by three coordinates, due to the constraint ∑ , where N is the number of archetypes, and are the weights of the archetypes ⃗ for each cell, The AA result defines the data much more strictly than PCA: 3PCs allows any point in the 3D subspace spanned by the 3PCs, whereas AA restricts the data to be inside a tetrahedron, which is only a small part of the 3D space (S17 Figure).
Thus, the fact that a tetrahedron whose vertices are on the convex hull of the data explains almost all of the variance that is explained by the first 3 PCs (45%, 47%, respectivelyi.e. 96% of the 3PC explained variance) is remarkable.
It is useful to compare the gene expression profiles that correspond to the first PCs to the gene expression profiles of the archetypes (S18 Figure). Figure S19 show the projections of the archetypes on the first 6 PCs to more clearly present these differences. It can be seen that the first PC is composed largely of archetype 3 (stem cells) minus archetype 1 (enterocytes); the second PC is composed of archetype 1 (enterocytes) minus archetype 2 (Nodal cells), while the third PC has the biggest contribution from the 4 th archetype (goblet cells). PCs 4-6 do not correlate specifically with any of the archetypes.
In general, the PCs mix together different archetypes and thus it is harder to infer the extreme gene expressions of each cell type from them. This is because the tetrahedron is not aligned with the PC axes.
In addition to this analysis, we compared the sensitivity of the description by PCA to the sensitivity of the AA description, in a similar way to S1 Text Section C: "Comparison of archetypal analysis to clustering methods". We created 1,000 bootstrapped sets by sampling with replacement data points (the same bootstrapped sets that were used in S1 Text Section C).
For each bootstrapped dataset we computed the positions of the 4 archetypes by PCHA algorithm and the coordinates of the 3 first PCs.
Each cell in each dataset was described by 76 coordinates. In archetypal analysisthe coordinates of the best description of the cell by linear combination of the archetypes: Where ⃗ is the approximated description of data point (cell) ⃗, ⃗ are the archetypes vectors, and N=4 is the number of archetypes, And in PCAthe projection of the cell on the first 3 PCs: Where N=3 is the number of archetypes, and ⃗⃗⃗⃗⃗ are the PC vectors. Note that while the PCs have to be orthogonal, the archetype vectors do not obey this constraint.
We gathered these descriptions for each cell, for each of the methods. Then we computed the normalized standard deviation (SD) for each cell by dividing the standard deviation along the 1 st principal component of variation in these descriptions by the standard deviation of the 1 st principal component of the complete real data:

⃗
The results, presented in Figure S3, show that PCA is less sensitive to the sampling of the data than clustering techniques, but more sensitive than archetypal analysis. Most of the variation was due to lack of robustness of the third PC which separates the goblet cells. It appears that the relatively small number of goblet cells in the dataset caused large fluctuations in the third PC coordinates when bootstrapping, where archetypal analysis was able to sufficiently buffer these fluctuations presumably because it uses information distributed more uniformly across the entire dataset to infer archetype positions.

K. Increasing the number of archetypes may reveal subtle trends
In the manuscript, we chose to fit the intestinal cells data to a 3d-polytope, namely a tetrahedron. The reasons are described in Methods: Determining the number of archetypes.
However, further analysis using higher order polytopes may reveal more subtle trends. In Figure  S20 we show a dendrogram-like tree of archetypes, in which we relate the archetypes found in different polytope fits. The tree was generated by fitting the data to k archetypes and computing their Euclidean distance, in the 76-dimensional gene expression space, from the k-1 archetypes whose positions were computed before. Characterization of these archetypes was then done by carrying a leave-1-out enrichment analysis (Methods: 1D Gene enrichment at archetypes), and inspecting the enriched genes. The enriched genes in each archetype for n = 2 to 6 is shown in Table S7.
It may be seen that with two archetypes (line), the separation between stem cells and enterocytes is captures. At three archetypes (triangle) the nodal class splits from the enterocytes. At four archetypes a goblet archetype appears, and remains distinct thereafter. At five archetypes, stem cells split into two archetypes. At six archetypes, nodal and enterocyte archetypes begin to mix.