Constructing local cell-specific networks from single-cell data

Significance Understanding gene regulatory networks is a topic of great interest because it can provide insights into cellular development, and identify factors that differ between normal and abnormal cells and phenotypes. Single-cell RNA sequencing provides a unique opportunity to gain understanding at the cellular level, but the technical features of the data create severe challenges when constructing gene networks. We develop a method that successfully skirts these challenges to estimate a cell-specific network for each single cell and cell type. Application of our algorithm to two brain cell samples furthers our understanding of autism spectrum disorder by examining the evolution of gene networks in fetal brain cells and comparing the networks of cells sampled from case and control subjects.


Performance under a correlated bivariate normal
locCSN and oCSN return networks (CSNs) in which each edge is the value of a test statistic that can be used to reject the null 17 of independence between two genes. These networks can be used for two-sample testing, and their average gives the fraction of 18 cells for which the test is rejected, which may be used as an aggregate indicator of nonlinear co-expression. 19 Fig . S2a shows the results of oCSN (1) and locCSN on data simulated from a correlated bivariate normal distribution, where ρ = 0.4. For this example, the relationship between X and Y is the same for all cells, and thus the ideal test would 20 reject the null of independent X and Y for all cells. 21 We find that oCSN rejects the independence hypothesis for only 16% percent of the data points, while locCSN shows greater 22 power, rejecting 55%. The lower power of oCSN in this example can be attributed to the choice of a fixed quantile range for 23 the window used to estimate the marginal and joint densities. In particular, in areas of high density, where oCSN's power is 24 lowest, the window becomes extremely small (Fig. S2b). As a result, oCSN has good power to detect co-expression only for 25 extreme points, while locCSN has good power over a greater range of expression values. Similar patterns can be found in Fig.   26 S2c, which shows that locCSN chooses a constant window size for correlated Gaussian data, and in Fig. S2d, which shows 27 p-values computed by oCSN and locCSN for correlated normal data with ρ ranging between 0.1 and 0.9.  Figure 1A). Scatterplots of gene expression levels for gene pairs (x, y) and (w, z), showing regions of high and low density (red and blue) compared to the product of the marginal densities, with corresponding CSN edges or non-edges for these gene pairs highlighted for two cells i and j. The connection between gene pairs are different across cells. (b) Standard deviation derived window size for one cell. The first scatter plot shows the quantile derived window for gene y, wy = 2dy. With this window size, we calculate the standard deviation of gene x expression, using cells within this window size. We take the obtained standard deviation for gene x as the window size for gene x, wx = 2s.d.(x). Second scatter plot shows the same thing as the first one with swapped x and y. Finally, we use the window size derived from standard deviation for further calculation, that is wx = 2s.d.(x) and wy = 2s.d.(y).  and 100 genes. The code to reproduce this simulation is here: code. The gene-gene correlation matrix exhibits block structure 34 of the form (Fig. S3a), or equivalently has off-diagonal entries for the 6 blocks given by the matrix 0.9 0.7 0.5 0.3 0.1 0 0.7 0.9 0.7 0.5 0.3 0 0.5 0.7 0.9 0.7 0.5 0 0.3 0.5 0.7 0.9 0.7 0 0.1 0.3 0.5 0.7 0.9 0 where the first 5 blocks have 15 genes each, and the 6'th block has 25 genes that are independent from all others.

37
CSN are performed on log-transformed CPM with ESCO simulated read counts data. Fig. S3c  For the same choice of window sizes, Fig. S4a-c show ROC and AOC for the task of discriminating gene pairs that are 43 uncorrelated from those whose correlation is ≥ 0.5. 2100 genes pairs were randomly selected for this task, balanced between 44 the two categories. The curves show that for this task, the standard deviation-based window sizes used by locCSN perform 45 better than the fixed quantile range windows proposed by (1). Fig. S4g and h shows that the average CSN heatmaps using the 46 standard deviation-based window size resemble the original connection matrix (Fig. S3a) more strongly that do those using the quantile-based window sizes. at α = 0.05 for the weaker connection scenario.

64
The two scenarios both support that standard deviation windows work better than quantile windows in terms of false 65 discovery and accuracy. The simulation also suggests that the choice of threshold also depends on how true connections are 66 defined. If we only consider strong connections (correlation > 0.9) as connected, we can use larger threshold. On the other 67 hand, if we want to include medium strength connections (correlation ≥ 0.5) as connected, we can use the smaller threshold.  73 Surprisingly, in this example the averaged CSN not only identifies the simulated block structure, as would be expected 74 of a useful measure of co-expression, but also estimates the true correlation more accurately than the empirical Pearson's 75 correlation in L1 norm, as can be seen in Table S1. While the average CSN is not designed to be an estimator of Pearson's 76 estimating this quantity.

78
To give an intuitive example where such a phenomenon might possibly occur, consider a simple model where X and Y are 79 random variables with identical marginal distributions (after centering and rescaling), where Y = X with probability p and is 80 generated independently of X otherwise. In this case, it can be seen that the Pearson correlation of X and Y is equal to p, the 81 fraction of data points for which X and Y are not independent -i.e., the fraction that should reject a local independence test 82 analogous to locCSN or oCSN. In such a case, it might be possible that estimating the fraction p of non-independent points 83 could be more accurate than generically estimating the Pearson correlation, particularly if the data has high noise.
84 Table S1. Distances between true connection matrix and estimated matrices, measured by L 1 norm.  Compare CSN with BigSCale correlation. Next we simulate data from ESCO, with 10000 genes and 2500 cells. We focus on 85 125 housekeeping genes that are not correlated with each other. After down-sampling the read counts to weaken the signal,  For the human brain cortex atlas data, we focus on 10 cell types from 4 samples for analysis, specifically the 6 neuron cell  Table S2 shows the number of cells and metacells for 7 major cell-types: P, IP, ExN, ExM, ExM-U 97 and ExDp.

98
Prior to CSN construction along the curve, we generate metacell bins based on pseudotime of the curve within each cell-type.

99
Each bin contains around 800 metacells, which are relatively homogeneous; however, for each cell-type, some metacells are 100 deemed outliers based on their pseudotime scores. We retain metacells whose pseudotime are within 2 standard deviations of 101 the mean within the bin. These are the cells that will be utilized for CSN construction. The number of bins used for each 102 cell-type, and number of metacells (before and after outlier screening) are listed in Table S3. Since there are only 58 metacells 103 for ExM-U in the leftmost curve, we ignore this cell-type for the D-Curve analysis. This cell-type properly belongs in the 104 U-curve analysis. Table S3 shows the overlapping of metacells for the two curves. P metacells are shared across both curves.

105
IP and ExN metacells are largely shared between the two curves. The split in trajectories occurs during development of the 106 ExM cell-type, which show considerably less overlap of metacells between the two curves.
107         (Fig. S14a and S15). For instance, the AST* cell-types are merged into one cell-type, while the L* cell clusters are 134 distinct and analyzed individually (Fig. S14b).

135
To circumvent challenges due to sparse counts, which are especially prevalent in single-nuclei RNA-seq data, we cluster 136 similar cells and form metacells (7). To avoid batch effects, metacells are created within a sample and cell-type (Table S7). We 137 merged AST-*, IN-*, L* and Neu* together as broad cell-type for the summary of the number of metacells (Table S6). The 138 number of metacells in each cell-type are shown in Table S7. Within a cell-type, some metacells exhibited heterogeneity that 139 was poorly delineated into clusters. For each metacell within a cell-type, we constructed CSNs using the nearby 100 metacells 140 from UMAP plot of the combined ASD and control cells (Fig. S16).

141
For broad cell-type AST, In, L and Neu, more than one original cell-type is included within the broad cell-types. We then 142 determine, based on heterogeneity of cells, whether to analyze the cells within a broad cell-type or within a more refined 143 cell-type. Numbers of metacells in each original cell-type are presented in Table S7. From the UMAP and tSNE plot, we  Table S5. Metacells for cell-types are shown in Table S7.   Control  278  102  238  253  414  238  107  177  105  244  353  504  261  ASD  425  75  265  206  358  211  109  164  94  235  469  310  300  Total  703  177  503  459  772  449  216  341  199  479  822 814 561  Table S10. p-values from sLED-CSN after removing DN genes. The removal is for the 10 cell-types with significant signal in the original analysis (Table S9). * indicates significant difference (p-value < 0.005) after adjustment for multiple testing.     ASD samples are significantly more variable than control samples.

151
N 2 denote the same from class 2. The test statistic Q is a scaled q-norm divergence measurement, with q ∈ (0, 2) recommended(8), and is given by with p-value calculated by permutation test. is computed from the spectrum of D. Additionally, the test also identifies a small cluster of leverage genes corresponding to 164 the non-zero entries of the sparse leading eigenvector. The differential network genes are the ones that explain 90% of the 165 variability among the leverage genes. These are candidate genes that have altered co-expression structure between the two 166 groups. As with DISTp, the p-value of the test statistic is determined by permuting samples among cell classes.  Z (1−α) reverse CDF (cumulative distribution function) of standard normal at 1 − α. V (1) , V   Fig. S21. Heatmap of gene expression of 3 branches from Liver data. Rows correspond to genes: marker genes for Cc (black), Hb (red), Hc (green), HVG (navy), HEG (cyan) and Regulation genes (magenta); and columns for branches, Cs (black), Hb (red) and Hc (green).  (Table S16). For a large set of genes, for example Setting 3 with 1000 genes, parallel computing is recommended to 186 speed up the process of generating CSNs. We also speed up our algorithm by an approximate CSN calculation, which partitions 187 the outcome space for each pair of genes into a grid. Cells that fall into the same grid yield the same test statistic (called 188 fuzzy). With these approximations CSN can be readily applied to large datasets with good accuracy Fig. S22. Here we include the runtime for the real analysis. In our package locCSN: https://github.com/xuranw/locCSN, we also include