hdWGCNA identifies co-expression networks in high-dimensional transcriptomics data

Summary Biological systems are immensely complex, organized into a multi-scale hierarchy of functional units based on tightly regulated interactions between distinct molecules, cells, organs, and organisms. While experimental methods enable transcriptome-wide measurements across millions of cells, popular bioinformatic tools do not support systems-level analysis. Here we present hdWGCNA, a comprehensive framework for analyzing co-expression networks in high-dimensional transcriptomics data such as single-cell and spatial RNA sequencing (RNA-seq). hdWGCNA provides functions for network inference, gene module identification, gene enrichment analysis, statistical tests, and data visualization. Beyond conventional single-cell RNA-seq, hdWGCNA is capable of performing isoform-level network analysis using long-read single-cell data. We showcase hdWGCNA using data from autism spectrum disorder and Alzheimer’s disease brain samples, identifying disease-relevant co-expression network modules. hdWGCNA is directly compatible with Seurat, a widely used R package for single-cell and spatial transcriptomics analysis, and we demonstrate the scalability of hdWGCNA by analyzing a dataset containing nearly 1 million cells.


INTRODUCTION
The development and widespread adoption of single-cell and spatial genomics approaches has led to routine generation of high-dimensional datasets in a variety of biological systems. These technologies are frequently used to study developmental stages, evolutionary trajectories, disease states, drug perturbations, and other experimental conditions. Despite the inherent complexity and interconnectedness of biological systems, studies leveraging single-cell and spatial genomics typically analyze individual features (genes, isoforms, proteins, etc.) one by one, greatly oversimplifying the underlying biology. These datasets provide an opportunity for investigating and quantifying the relationships between these features to further contextualize their roles across biological conditions of interest.
Here we developed hdWGCNA, a framework for co-expression network analysis 1 in single-cell and spatial transcriptomics data. Co-expression networks are based on transformed pairwise correlations of input features, resulting in a quantitative measure of relatedness between genes. 1,2 Hierarchical clustering on the network structure allows us to uncover functional modules of genes whose expression profiles are tightly intertwined, 3,4 which typically correspond to specific biological processes and disease states. Considering that unique cell types and cell states have distinct gene expression programs, we designed hdWGCNA to facilitate multi-scale analysis of cellular and MOTIVATION Single-cell and spatial transcriptomics assays are commonly used to profile the molecular signatures of biological systems, yielding high-dimensional datasets that can be used to model gene regulation across cell types, cell states, and spatial niches. Many statistical tools for high-dimensional transcriptomics data analysis focus on individual features rather than the underlying network structure, ignoring potential interactions between transcripts or genes. Here, we introduce hdWGCNA, a comprehensive methodological framework for the inference, analysis, and interpretation of gene co-expression networks in high-dimensional transcriptomics data. hdWGCNA is implemented as an open-source R package that extends the Seurat ecosystem of data analysis tools. spatial hierarchies. hdWGCNA provides a rich suite of functions for data analysis and visualization, providing biological context for co-expression networks by leveraging a variety of biological knowledge databases. To maximize usability among the genomics community, the hdWGCNA R package extends the data structures and functionality of the widely used Seurat package, [5][6][7] and we developed an extensive documentation website for hdWGCNA demonstrating its use on new datasets. Further, we used hdWGCNA to analyze a single-cell RNA sequencing (scRNA-seq) dataset consisting of 1 million cells, showcasing the scalability of hdWGCNA in large datasets.
In this study, we applied hdWGCNA in a variety of high-dimensional transcriptomics datasets from different technologies and biological conditions. As a common use case, we first performed iterative network analysis of the major cell types in the human prefrontal cortex (PFC), identifying shared and specific network modules in each cell type. We constructed co-expression networks in anterior and posterior mouse brain sections profiled with 103 Genomics Visium spatial transcriptomics (ST), and found distinct spatial patterns of these gene expression programs. Using long-read scRNA-seq data from the mouse hippocampus, 8 we uncovered splicing isoform co-expression networks in the radial glia lineage involved in cell fate specification. Network analysis of inhibitory neurons from published single-nucleus RNA sequencing (snRNA-seq) in autism spectrum disorder (ASD) donors 9 revealed modules disrupted in ASD containing key genetic risk genes such as SCN2A, TSC1, and SHANK2. We performed consensus co-expression network analysis of microglia from three Alzheimer's disease (AD) snRNA-seq studies, [10][11][12] yielding multiple gene modules corresponding to disease-associated microglia and polygenic risk of AD. Finally, we used hdWGCNA to project gene modules from two bulk RNA-seq studies of AD patients into an snRNAseq dataset of the AD brain, showing that our approach allows for interrogation of gene modules and networks that have been previously identified.

RESULTS
Constructing co-expression networks from highdimensional transcriptomics data Here we describe hdWGCNA, a comprehensive framework for constructing and analyzing co-expression networks in highdimensional transcriptomic data ( Figure 1A). Given a gene expression dataset as input, co-expression network analysis typically consists of the following analysis steps: computing pairwise correlations of input features, weighting correlations with a soft-power threshold ðbÞ, computing the topological overlap between features, and unsupervised clustering via the Dynamic Tree Cut algorithm 3 ( Figure S1 and STAR Methods). The sparsity and noise inherent in single-cell data can lead to spurious genegene correlations, thereby complicating co-expression network analysis. Additionally, the correlation structure of single-cell or spatial transcriptomic data varies greatly for different subsets (cell types, cell states, anatomical regions). A typical hdWGCNA workflow in scRNA-seq data accounts for these considerations by collapsing highly similar cells into ''metacells'' to reduce sparsity while retaining cellular heterogeneity and by allowing for a modular design to perform separate network analyses in specified cell populations.
Metacells are defined as small groups of transcriptomically similar cells representing distinctive cell states. There are several approaches to identify metacells from single-cell genomics data. [13][14][15][16] We leverage a bootstrapped aggregation (bagging) algorithm for constructing metacell transcriptomic profiles from single-cell datasets by applying K-nearest neighbors (KNN) to a dimensionality-reduced representation of the input dataset (STAR Methods, Algorithm 1). This approach can be performed for each biological replicate to ensure that critical information about each sample (age, sex, disease status, etc.) is retained for downstream analysis. We computed gene-gene correlations in the normalized gene expression matrix from the single-cell dataset and metacell expression matrices while varying the number of cells to collapse into a single metacell (the KNN K parameter). The distribution of these gene-gene correlations displays a spike at zero for the singlecell expression matrix, with flattened distributions corresponding to more non-zero correlations in the metacell matrices, indicating that metacell expression profiles are less prone to noisy gene-gene correlations compared with the single-cell matrix ( Figure 1B) (STAR Methods). We note that sparsity (defined in Equation 1) is greatly reduced in the metacell matrices for each cell type compared with the single-cell matrices, with over a 10-fold reduction in some cases (Figure 1C). We applied hdWGCNA to a dataset of CD34 + hematopoietic stem and progenitor stem cells 16 using two additional metacell approaches 14,16 and found that all approaches were suitable for downstream network analysis ( Figure S2; Data S1). Metacell algorithms strive to retain biologically meaningful signals spanning a spectrum of cell states in a tissue of interest; therefore, it is necessary to carefully apply these approaches to avoid obscuring these cell states. For example, the hdWGCNA metacell algorithm requires a dimensional reduction of the input expression matrix, but these reductions often contain technical artifacts. The choice of dimensionality reduction method and handling of technical artifacts would then influence the effectiveness of metacell construction. Further, the optimal number of cells to merge together to form a single metacell may differ across cell types and tissues, attempting to balance between increasing information content of the aggregated group while avoiding merging of dissimilar cells. Aside from metacell approaches, pseudo-bulk aggregation of all cells in a given population have yielded favorable results in benchmarks of differential gene expression tests, 17 suggesting that, given a sufficient sample size, pseudo-bulk expression profiles are likely suitable for co-expression network analysis.
While co-expression modules consist of many genes, it is convenient to summarize the expression of the entire module into a single metric. Module eigengenes (MEs), defined as the first principal component of the module's gene expression matrix (STAR Methods, Algorithm 2), describe the expression patterns of entire co-expression modules. hdWGCNA computes MEs using specific accommodations for high-dimensional data, allowing for batch correction and regression of continuous covariates (STAR Methods, Algorithm 2). Optionally, hdWGCNA can use alternative gene scoring methods such as or UCell 18 or Seurat's AddModuleScore function, and we show that these scores are correlated with MEs ( Figure S3).
We demonstrate hdWGCNA in single-cell transcriptomic data through an iterative network analysis of six major cell types in the Zhou et al. human PFC snRNA-seq dataset of 11 cognitively normal donors ( Figure 1A). 11 We constructed metacells and performed co-expression network analysis for each major cell type in the human PFC dataset 11 using the standard hdWGCNA workflow, yielding distinct network structures and sets of gene modules (Data S2 and S3). Networks were constructed using metacell expression matrices for each cell type separately, but we computed MEs for each module using the entire snRNA-seq dataset, allowing us to interrogate the cell-type specificity of these modules' expression programs across all cell types. This iterative network analysis revealed 96 co-expression modules across the six major cell types. Through differential module eigengene (DME) analysis, we found shared and distinct module expression patterns across different cell types (Data S2; STAR Methods), and we highlight specific modules from each cell type ( Figures  1D and 1E). Further, we performed a pairwise gene set overlap analysis of the 96 co-expression modules, and, while we did find that some modules had significant overlaps across the different cell types, the gene sets comprising these modules were overall quite distinct, with a maximum Jaccard index between two modules of 0.297 and a median of 0.005 (STAR Methods and Figure S4). The expression of module hub genes, which are highly connected members of the co-expression network ranked by eigengene-based connectivity (kME), tend to display cell-type-specific patterns, such as the myelination genes CNP and PLP1 in oligodendrocyte (ODC) module ODC-M3 ( Figure 1D). However, some co-expression modules may correspond to cellular processes common to multiple cell types, in which case the hub genes may be widely expressed. We inspected the MEs of selected cell-type-specific modules and found that the overall expression patterns were similar to that of their constituent hub genes ( Figures 1D and 1E).
We showcase some of the downstream functionalities of hdWGCNA using the ODC co-expression network ( Figures 1F-1H). For network visualization, we used Uniform Manifold Approximation and Projection (UMAP) 19 to embed the co-expression (F) UMAP plot of the ODC co-expression network. Each node represents a single gene, and edges represent co-expression links between genes and module hub genes. Point size is scaled by kME. Nodes are colored by co-expression module assignment. The top two hub genes per module are labeled. Network edges were downsampled for visual clarity.
(G) snRNA-seq UMAP as in (A) colored by MEs for the 10 ODC co-expression modules as in (F).
network topological overlap matrix (TOM) into a two-dimensional manifold, using the topological overlap of each gene with the top hub genes from each module as input features (STAR Methods; Figure 1F). We found that eight of the 10 ODC modules were specifically expressed in ODC cells based on their MEs (Figure 1G; Wilcoxon rank-sum test Bonferroni-adjusted p <0.05). Finally, we performed module preservation analysis 20 to test the reproducibility of these modules in an independent dataset 12 and found that all of the ODC-specific modules were significantly preserved (Z summary preservation R 5). In sum, these network analyses in the human PFC dataset shows the core capabilities of the hdWGCNA workflow ( Figure S1). Finally, we performed a similar iterative network analysis on a peripheral blood mononuclear cell (PBMC) scRNA-seq dataset of nearly 1M cells, highlighting the scalability of hdWGCNA in large datasets ( Figure S5; Data S1).

Runtime
Test root-mean-square error (RMSE) Predicting a gene's expression using the top 10 module hub genes E Figure 2. Runtime, memory usage, and performance of hdWGCNA singular value decomposition (SVD) of sparse matrices, and the runtime and memory usage of this function both linearly increase with the number of cells in the dataset. Optionally, ModuleEigengenes can employ the harmony 22 algorithm following SVD, which increases runtime but not memory usage. Further, the efficiency of this function varies with the number of co-expression modules detected and with the number of features in each modules. Finally, ModuleConnectivity computes eigengene-based connectivity as product-moment correlation coefficients between the sparse gene expression matrix and the MEs matrix, which resulted in fast calculations with low memory usage. We next sought to evaluate the co-expression networks identified by hdWGCNA using a functional coherence analysis. We used the EGAD neighbor-voting algorithm 23 to predict known biological pathway associations of genes based on the coexpression network structure using the cell-type-specific coexpression networks from the Zhou et al. 11 human PFC dataset. In principle, we expect that co-expressed genes are involved in similar biological processes, and therefore co-expression network structures should be predictive of biological pathway membership. Briefly, EGAD performs a 3-fold cross-validation classification by occluding the pathway labels of a subset of genes and then attempting to predict the pathway membership of those occluded genes based on their labeled neighbors in the network. We used EGAD to test the functional coherence of our hdWGCNA co-expression networks for a set of Gene Ontology (GO) terms, reporting area under the receiver operating characteristic curve (AUC) value for each term ( Figure 2C). We report a similar level of functional coherence in these co-expression networks to a previous study that evaluated co-expression networks derived from scRNA-seq data with different measures of gene-gene association. 24 The inhibitory neuron network performed the best for functional coherence with a median neighbor-voting AUC of 0.592, while the lowest-performing network was from the oligodendrocytes with a median AUC of 0.549. We tested whether there was a bias toward genes that were multifunctional based on the frequency that they appeared in the annotated set of GO terms, and we found that multifunctional genes did not bias the co-expression functional coherence results ( Figure 2D).
In principle, genes within the same co-expression modules derived from specific cell types should be functionally related or co-regulated. The expression of module hub genes, which exhibit the highest intramodular connectivity, may be predictive of the expression of other module member genes if the network is well defined and contains meaningful structures. For each of the 96 cell-type PFC co-expression modules, we sought to predict the expression of each gene using the top 10 module hub genes as the input features to a XGBoost 25 regularized regression model. In this analysis, we performed 5-fold cross-validation, and we report the performance as root-mean-square error (RMSE) of the test set averaged over each fold ( Figure 2E). Overall, we found that module hub gene expression was generally predictive of module member gene expression across all modules in the six cell-type co-expression networks, where the module with the best performance had an average test set RMSE of 0.0159 and the module with the worst perfor-mance had an average test set RMSE of 0.209 ( Figure 2E). This analysis and our functional coherence analysis provide support that hdWGCNA co-expression networks and gene modules capture biologically relevant information in specific cell types.
Spatial co-expression networks represent regional expression patterns in the mouse brain ST enables the investigation of biological patterns that might otherwise be hidden in other -omics technologies, such as scRNA-seq or bulk RNA-seq. 26,27 We used hdWGCNA to identify spatial co-expression network modules in the murine brain using a publicly available Visium transcriptomics dataset from 103 Genomics ( Figure 3A). This ST dataset consists of one posterior and one anterior slice originating from a sagittal brain section from a single male mouse at 8 weeks of age. Sequencing-based ST approaches such as Visium yield transcriptome-wide gene expression profiles localized to individual ''spots'' where a single spot likely contains multiple cells, and this dataset is composed of 2,696 spots in the anterior slice and 3,353 spots in the posterior slice. Data sparsity is also inherent to the current generation of these technologies, therefore we propose a metaspot aggregation approach prior to network analysis ( Figure S6). Evenly spaced spots throughout the input ST slide are used as principal spots, with at least one other spot in between two principal spots. The transcriptomes of the principal spots and their direct neighbors are aggregated into metaspot expression profiles, containing at most seven ST spots ( Figure S6A). Similar to metacells in scRNA-seq, the sparsity of the metaspot expression matrix was reduced compared with the original ST matrix ( Figure S6B), and the distribution of gene-gene correlations in the metaspot expression matrix was less concentrated at zero ( Figure S6C). hdWGCNA is capable of processing any number of ST samples in the same co-expression network analysis by constructing metaspots separately for each sample.
We applied hdWGCNA in the mouse brain Visium dataset, identifying 12 spatial modules (SM1-12; Figure S7; Data S1), and we embedded the co-expression network in two dimensions using UMAP ( Figure 3B). DME analysis showed that spatial coexpression modules displayed distinct regional expression profiles based on their MEs ( Figure 3C; Data S3), encompassing a wide array of cellular processes such as the myelination module SM1 in the white matter tracts, and synaptic transmission modules SM7, SM9, SM11, and SM12 ( Figure S7C; Data S2). For example, DME analysis showed that expression of SM4 was localized to the ventricles and cortical layer 1 near the bloodbrain barrier ( Figure 3C). Further, the hub genes of SM4 include hemoglobin subunits (Hba-a1, Hba-a2, Hbb-bt), and we show that SM4 was enriched for biological processes associated with brain vasculature (Figures 3B and S7C). We compared these gene modules with cluster marker genes from a wholemouse-brain snRNA-seq dataset 28 and found significant correspondences, such as the striatum module SM7 and medium spiny neurons (Fisher's exact test false discovery rate [FDR] <0.05; Figure S7D). Additionally, we performed network analysis on a subset of this dataset containing cortical layers 2-6 (Figure S8), identifying additional fine-grained spatial co-expression modules localized to specific cortical layers (Data S1 and S2). Isoform-level co-expression networks reveal cell fate decisions in the radial glia developmental lineage Different isoforms of the same gene are often involved in distinct biological processes. 29 Conventional single-cell transcriptomics assays capture information at the gene level, thereby missing much of the biological diversity and regulatory mechanisms that occurs at the isoform level. 30 Emerging long-read sequencing approaches enable us to profile cellular transcriptomes at isoform A B C Figure 3. Spatial co-expression networks represent regional expression patterns in the mouse brain (A) Visium spatial transcriptomics (ST) in anterior (left, 2,696 spots) and posterior (right, 3,353 spots) mouse brain sections, colored by Louvain clusters annotated by anatomical regions. (B) UMAP plot of the mouse brain ST co-expression network. Each node represents a single gene, and edges represent co-expression links between genes and module hub genes. Point size is scaled by kME. Nodes are colored by co-expression module assignment. The top five hub genes per module are labeled. Network edges were downsampled for visual clarity.
(C) ST samples colored by MEs for the 12 spatial co-expression modules. Gray color indicates an ME value less than zero.  Figure 4. Isoform co-expression network analysis reveals fate-specific expression programs in the hippocampal radial glia lineage (A) UMAP plot of cells from the mouse hippocampus ScISOrSeq dataset. 8 Major cell types are labeled and the cells used for co-expression network analysis are colored. This dataset contains expression information for 96,093 isoforms and 31,053 genes in 6,832 cells from one mouse brain sample. ASC, astrocytes; CPX, resolution, 8,31-33 thus providing new opportunities to model the relationships between isoforms using co-expression network analysis.
We used hdWGCNA to perform isoform co-expression network analysis in radial glia lineage cells from the mouse hippocampus at postnatal day 7 (P7) profiled with single-cell isoform RNA sequencing (ScISOrSeq) 8 ( Figure 4A; STAR Methods). This dataset contains isoform-level and gene-level expression data from 6,832 nuclei derived from a single mouse hippocampus sample. Radial glia, which share transcriptomic similarities with mature astrocytes, are progenitor cells that give rise to numerous distinct cell fates, including neuronal cells, astrocytes, oligodendrocytes, and ependymal cells. 34,35 To model this developmental process, we applied Monocle3 36 pseudotime to 2,190 radial glia lineage cells ( Figure 4B). We identified three trajectories corresponding to distinct cell fates, termed the ependymal (EPD) trajectory, astrocyte (ASC) trajectory, and the neural intermediate progenitor cell (NPC) trajectory.
Isoform co-expression network analysis revealed 11 modules in the radial glia lineage ( Figure 4C; Data S1). Of the genes retained for network analysis, 61.5% had a single isoform, 18.2% had multiple isoforms that were all assigned to the same module, and 20.4% had multiple isoforms spread across several modules (Figure 4D). Thus, these network modules capture information about the roles of different isoforms of the same gene in distinct biological processes. We inspected module eigenisoform (MEiso) patterns throughout the developmental lineage, thereby uncovering isoform modules critical for cell fate decisions ( Figure 4E; Data S2 and S3). Increased expression of modules RGL-M1 and RGL-M2, which were enriched cilium assembly genes ( Figure 4F), was associated with the transition from a radial glia to an ependymal cell state. A steady expression level of module RGL-M5 (glial development, astrocyte differentiation) was found in the transition from radial glia to astrocytes, while a decreased expression of RGL-M5 led to alternative fates. Four modules (RGL-M3, RGL-M8, RGL-M9, and RGL-M11) displayed an increase in expression in the neuronal trajectory, containing genes associated with cellular processes such as non-canonical Wnt signaling, neuronal synaptic plasticity, and RNA splicing ( Figure 4F).
We inspected the isoforms of three selected genes that had hub isoforms in different co-expression modules: Gfap, H3f3b, and Cd9 ( Figures 4G-4I). Gfap encodes a key intermediate filament protein in astrocytes that is involved in astrocytic reactivity during central nervous system (CNS) injuries or neurodegeneration, 39 and we found that modules RGL-M4 and RGL-M6 contained hub isoforms of Gfap featuring alternative splicing, alternative transcription start site (TSS) usage, and alternative transcription end site (TES) usage ( Figure 4G). Different isoforms of the histone H3.3 subunit gene H3f3b were hubs for modules RGL-M1 and RGL-M11, which were associated with ependymal and neuronal cell fates respectively, suggesting that alternative TES usage in H3f3b plays a role in regulating epigenetic factors in murine hippocampal development ( Figure 4F). Cd9 encodes a transmembrane protein and is a known glioblastoma biomarker, 40 and we found subtle differences in the TSS between hub isoforms in modules RGL-M6 and RGL-M9 that we show as a splicing summary graph 38 ( Figure 4I), supporting functional changes mediated by small isoform differences.

Co-expression network analysis of inhibitory neurons in ASD
Co-expression networks can be interrogated to further understand the molecular phenotypes of complex polygenic diseases in primary human tissue samples. We applied hdWGCNA to 20,249 inhibitory neurons (INHs) from an snRNA-seq dataset of the human PFC in 22 ASD patients, 24 age-matched controls, and eight epilepsy patients 9 (Figures 5A and S10; Data S1). The INH network contained 14 modules, and we show hub genes that have a known association with ASD in the SFARI database on the co-expression UMAP ( Figure 5B). The MEs showed that some modules were primarily confined to a single INH cluster (INH-M3, INH-M1) while others were spread across multiple neuronal groups ( Figure 5C). Furthermore, DME analysis revealed significant differences between MEs in ASD and control samples for all modules except INH-M4 in at least one INH subpopulation ( Figure 5D; Data S3; Wilcoxon rank-sum test Bonferroni-adjusted p <0.05). However, by focusing on the DME results with an absolute average log 2 (fold change) R 0:5, we note that many of the largest differences were found in the SST + inhibitory neuron clusters. Furthermore, three co-expression modules (C) UMAP plot of the radial glia lineage isoform co-expression network. Each node represents a single isoform, and edges represent co-expression links between isoforms and module hub isoforms. Point size is scaled by kMEiso. Nodes are colored by co-expression module assignment. Network edges were downsampled for visual clarity. (D) Donut chart showing the percentage of genes with one isoform, with multiple isoforms that are all assigned to the same module, and with multiple isoforms that are spread across more than one module.  Article ll OPEN ACCESS latest genome-wide association study (GWAS) of ASD 41 (Figure 5E), but we note that all of these modules contained several ASD-associated SFARI genes. INH-M11 was enriched for genes associated with synaptic transmission, ion transport, glutamate receptor signaling, and nervous system development ( Figure 5F; Data S2), and this module was downregulated in ASD for five of the six INH subtypes ( Figure 5D). Similarly, INH-M13 was associated with RNA processing ( Figure 5F) and was downregulated in ASD in all INH subtypes except PVALB+ neurons ( Figure 5D). One of the INH-M13 hub genes is CHD2, whose de novo variants have been identified in individuals with ASD. 42,43 CHD2 is part of the CHD family of chromatin-modifying proteins and can alter gene expression by modification of chromatin structure. Similarly, rare loss-of-function mutations have been reported in the SCN2A gene, a hub gene of the INH-M11 module. 44 We also find enrichment of several ASD-associated genes such as TSC1 highlighting that these modules are functional and provide new insights into the role of inhibitory neurons in ASD. Finally, we tested for the preservation of these modules in 19,425 inhibitory neurons from an snRNA-seq dataset of the PFC from donors with major depressive disorder (MDD) and controls 45 (34 samples), and we found substantial evidence of preservation across all modules except INH-M1 (Figures S10C-S10E).

Consensus network analysis of microglia in AD
Microglia, the resident immune cells of the brain, are implicated in the pathology and genetic risk of several CNS diseases, including AD. 46-49 Transcriptomic and epigenomic studies in human tissue and AD mouse models have identified multiple cell states of microglia, representing a spectrum between homeostatic and disease-associated microglia (DAMs). 12,50,51 Our previous study defined a set of transcription factors, genes, and cis-regulatory elements involved in the shift between homeostatic and DAM cell states in human AD, identifying shared and distinct signatures compared with the DAM signature from 5xFAD mice. 12 Here we sought to expand on previous work by providing a systems-level analysis of gene expression throughout the spectrum of microglia cell states.
We modeled the cell-state continuum between homeostatic and DAM-like microglia by employing a pseudotime analysis of microglia from three human AD snRNA-seq datasets [10][11][12] ( Figures 6A, 6B, and S11). Next, we performed consensus coexpression network analysis using microglia integrated from three human AD snRNA-seq datasets, 10-12 identifying four consensus modules ( Figure 6C; Data S1). Consensus network analysis is an approach that performs network analysis sepa-rately for each dataset, followed by a procedure to retain structures common across the individual networks, and thus it is well suited for analyzing microglia co-expression from these different sources (STAR Methods).
Classical markers of homeostatic microglia, such as CSF1R, CX3CR1, and P2RY12, were members of MG-M2, while known DAM genes, including APOE, TYROBP, and B2M, were members of MG-M1. GO term enrichment analysis associated MG-M2 with homeostatic microglia functions such as cell migration, synapse organization, and response to colony-stimulating factor, contrasting disease-related processes enriched in MG-M1 including amyloid fibril formation, microglial activation, maintenance of blood-brain barrier, and cytokine production ( Figure 6D; Data S2). Together, this suggests that MG-M1 comprises the gene network underlying DAM activation in AD, while MG-M2 represents the network of homeostatic microglia genes. The MEs for MG-M1 and MG-M2 display opposing patterns throughout the microglia pseudotime trajectory, contextualizing this trajectory as the transcriptional shift from homeostatic microglia (start) to a DAM-like cell state (end) (Figures 6E and 6F). Furthermore, DME analysis revealed significant changes in these modules between AD and control brains in evenly spaced windows throughout the microglia trajectory ( Figure 6G; Wilcoxon rank-sum test Bonferroni-adjusted p <0.05; Data S3). Coexpression networks behave as functional biological units; therefore, we reason that the hub genes and other members of MG-M1 represent candidates for an expanded set of human DAM genes including ACTB, TPT1, and EEF1A1.
Aside from modules MG-M1 and MG-M2, which contained well-known microglia gene signatures, we also identified modules MG-M3 and MG-M4 containing genes associated with key microglial processes such as axon guidance, phagocytosis, and myeloid cell differentiation (Figures 6C and 6D). CD163, a hub gene of MG-M4, is known to be involved in the breakdown of the blood-brain barrier. 53,54 The trajectory of MG-M4, containing CD163 as a hub gene, was consistent with that of DAM-like module MG-M1, and was enriched for processes including phagocytosis, myeloid cell differentiation, and neutrophil activation ( Figure 6D); therefore, it is possible that MG-M4 represents an alternative microglial activation module. 55 We performed single-cell polygenic risk enrichment for AD risk in the microglia trajectory, 46,52 and identified a significant increase throughout the trajectory, revealing an enrichment of AD genetic risk single-nucleotide polymorphisms (SNPs) in DAMs ( Figure 6H; STAR Methods; Data S3). We show that expression of these modules was significantly correlated with AD genetic risk (Pearson correlation p <0.05), with the strongest correlation in alternative activation module MG-M4 ( Figure 6I).
(B) Gene co-expression network derived from inhibitory neurons, represented as a two-dimensional UMAP embedding of the TOM. Nodes represent genes, colored by module assignment. Module hub genes with prior evidence of ASD association from SFARI are labeled. Edges represent co-expression relationships between genes and module hub genes. Network edges were downsampled for visual clarity. To ensure that these microglial modules were reproducible across other datasets and in mouse models of AD, we performed module preservation analysis 20 (STAR Methods; Figure 6J). We projected the microglial consensus modules into a dataset of the PFC in aged human samples, 56 the superior frontal gyrus (SFG) and entorhinal cortex (EC) in AD samples, 57 the occipital cortex (OC) and the occipitotemporal cortex (OTC) in human AD samples, 58 the PFC from 5xFAD mice, 11 and whole-brain samples from 5xFAD mice 28 ( Figure 6H). Additionally, we projected these modules into an snATAC-seq dataset of the PFC in human AD, 12 using gene activity 59 as a proxy for gene expression from chromatin accessibility data. These module preservation tests showed the microglia consensus modules were broadly preserved and reproducible across brain regions and in mouse models of AD, providing further support that this network is relevant in AD biology and microglial activation.
Projecting network modules from bulk RNA-seq cohorts into relevant single-cell datasets hdWGCNA allows for interrogating co-expression modules inferred from a given reference dataset in a query dataset. Modules can be projected across datasets by computing MEs in the query dataset, and preservation of the network structure can be assessed via statistical testing. 20 For example, modules can be projected between different species to link transcriptomic changes between mouse models and human disease patients, or modules can be projected across data modalities from single-cell to spatial transcriptomics to provide regional context to cellular niches.
To date, it remains cost-prohibitive for most researchers to perform high-dimensional -omics studies of large patient cohorts, but there are numerous large-scale disease-relevant bulk RNA-seq datasets containing thousands of samples from consortia such as the Encyclopedia of DNA Elements (ENCODE), 60 the Genotype-Tissue Expression (GTEx) project, 61 and The Cancer Genome Atlas (TCGA). 62 By projecting coexpression modules derived from bulk RNA-seq patient cohorts into single-cell datasets, we can layer disease-related information onto the single-cell dataset and attribute cell-state-specific expression patterns to the bulk RNA-seq data. We demonstrate projecting modules in this manner using co-expression modules from two bulk RNA-seq studies of AD 56,63 as the references and a human AD snRNA-seq dataset 12 (57,950 nuclei from AD 11 samples and seven control samples of the PFC) as the query. These studies both used AD samples and controls from the same patient cohorts (Religious Orders Study and Memory and Aging Project, Mayo Clinic, Mount Sinai School of Medicine), 64-66 but they took unique approaches for co-expression network analysis. The AMP-AD study from Wan et al. 63 performed network analysis separately from each brain region, while, in our previous study, 56 we performed consensus network analysis across the different brain regions. We projected these modules into a snRNA-seq dataset of AD and control samples from the PFC (Figure 7A), and we found distinct cell-type-specific expression patterns based on their MEs (Figures 7B and  7C). This analysis demonstrates hdWGCNA's ability to transfer co-expression information across datasets to uncover otherwise unseen biological insights.

DISCUSSION
Classical bioinformatic approaches for transcriptomics analysis such as differential gene expression are useful for finding individual genes that are altered in a particular disease or condition of interest, but they do not provide information about the broader context of these genes in specific pathways or regulatory regimes. For example, biological processes such as development or regeneration require coordination of distinct sets of genes in certain cell types with spatial specificity. Therefore, to understand these complex processes, we must look beyond individual genes. We developed hdWGCNA to provide a succinct methodology for investigating systems-level changes in the transcriptome in single-cell or ST datasets. We designed hdWGCNA to be highly modular, allowing for multi-scale analyses of different cellular or spatial hierarchies in a technology-agnostic manner.
In this study, we demonstrated that hdWGCNA is compatible with single-cell and ST datasets and can be easily adapted for novel transcriptomics approaches such as ScISOrSeq. Co-expression networks have been successful for analyzing bulk proteomics datasets in human disease samples, 67,68 and we expect that hdWGCNA could be swiftly adapted for single-cell and spatial proteomics datasets as the technology matures and becomes more widely available. 69 hdWGCNA includes built-in functions to leverage external biological knowledge sources to provide insight for co-expression networks, for example by comparing gene modules with functional gene sets such as disease-associated genes from GWAS expression quantitative trait loci (eQTLs), or transcription factor target genes. Unlike other network analysis pipelines such as single-cell regulatory network inference and clustering (SCENIC) 70 or CellChat, 71 hdWGCNA is a purely unsupervised approach and does not require prior knowledge or databases in the inference procedure. The co-expression information computed by hdWGCNA can be easily retrieved from the Seurat object to facilitate custom downstream analyses beyond the hdWGCNA package. hdWGCNA allows for comparisons between experimental groups via DME testing and module preservation analysis, which allowed us to identity inhibitory neuron modules that were dysregulated in ASD and enriched for ASD genetic risk genes, and microglial modules that were dysregulated in AD and enriched for DAM genes. Our network analyses of the ASD and AD datasets shows that hdWGCNA is capable of uncovering expanded disease-relevant gene sets via the interaction partners of known disease-associated genes such as the ASD SFARI genes or the AD DAM genes. We showed that the co-expression networks inferred by hdWGCNA were highly reproducible in unseen datasets, indicating that this is a robust methodology that reflects the under-lying biology of the system of interest rather than picking up on technical artifacts. Further, hdWGCNA sheds new light on previously identified co-expression networks and gene modules by allowing modules to be projected from a reference dataset to a query dataset. The hdWGCNA R package directly extends the familiar Seurat pipeline and the SeuratObject data structure, enabling researchers to rapidly incorporate network analysis into their own workflows, going beyond cell clustering and differential gene expression analysis toward systems-level insights.
Limitations of the study Transcriptomic measurements of single cells are generally noisy, imposing challenges and limitations in the analysis of these datasets. Technical noise may arise from dropout events or from various steps in the experimental protocols, potentially making downstream data analysis and interpretation more difficult. hdWGCNA explicitly tries to handle the issues of technical dropouts and data sparsity by constructing networks in metacell or metaspot transcriptomic profiles rather than directly using the single-cell data. Furthermore, we show that module preservation statistical testing can assess the reproducibility of a co-expression network in external validation datasets, giving additional confidence in the results from hdWGCNA.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

DECLARATION OF INTERESTS
The authors declare no competing interests. Data and code availability d All of the sequencing data used in this paper were obtained from publicly available sources, and are listed in the key resources table. d The hdWGCNA R package has been deposited at Zenodo (see key resources table). The R package code and full tutorials are available at https://swaruplab.bio.uci.edu/hdWGCNA. The data processing and analysis code has been deposited at Zenodo (see key resources table) and is available on GitHub at this repository: https://github.com/smorabit/hdWGCNA_paper. d Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

METHOD DETAILS
Bootstrapped aggregation of single cell transcriptomes to form metacells Single-cell gene expression datasets typically contain many more zero valued entries than non-zero valued entries, meaning that these datasets are sparse. We formally define the sparsity of a gene expression matrix in Equation 1. Given an un-normalized counts matrix X with genes and N c cells, sparsity is the sum of all zero valued elements.
Complementing sparsity, the density of a single gene expression matrix is the sum of all non-zero valued elements divided by the total number of matrix elements, such that density = 1 À sparsity. A matrix is considered sparse if sparsity > 0:5. Conventional single-cell gene expression assays yield sparse gene expression matrices. In general, correlations of sparse vectors may lead to downstream conclusions that are not robust or reproducible. Thus, as part of the hdWGCNA workflow, we propose a bootstrapped aggregation (bagging) algorithm to construct a gene expression matrix M with considerably reduced sparsity prior to performing coexpression network analysis. Zero valued entries in a gene expression matrix have both biological and technical origins, 75 and it is important to prioritize preserving relevant biological signals while reducing technical noise. For example, a biological zero may be attributed to a gene that is only expressed in a given cell population, whereas a technical zero may arise from low sequencing depth.
We define the set of unique cell barcodes C and the set of unique genes G such that kCk = N c and kGk = N g . Transcriptomically similar cells are identified in a dimensionally-reduced representation D of the gene expression matrix X using the k-nearest neighbors (KNN) algorithm, 76 yielding N c sets of k cells. Inherently, there is overlap between these N c sets of k neighboring cells, and we include a parameter m to control for the maximum allowable overlap. Cells are uniformly randomly sampled from C, and gene expression signatures from X are aggregated (sum or average) with their k nearest neighbors. A cell is skipped if its neighbors have too much overlap with the set of neighbors from previously selected cells, in order to reduce redundancy in the downstream metacell Aggregation of neighboring spatial transcriptomic spots to form metaspots Sequencing-based ST approaches such as the 103 Genomics Visium platform also yield sparse transcriptomic profiles, thus introducing the same potential pitfalls as single-cell data for co-expression network analysis. To alleviate these issues, we sought to develop a data aggregation approach similar to our metacell algorithm. This approach leverages spatial coordinates rather than the dimensionality-reduced representation ( Figure S6). For each ST spot, we obtain a list of physically neighboring spots. We then devise a grid of ''principal spots'', which are evenly spaced spots throughout the input tissue which serve as anchor points for aggregating neighboring spots. Each principal spot and its neighbors are aggregated into one metaspot, with at most seven spots merging into one metaspot and at most two overlapping spots between metaspots. We implemented this procedure as part of the hdWGCNA R package in the MetaspotsByGroups function. Similar to the MetacellsByGroups function, the user may specify groups within the Seurat object to perform the aggregation, such that metacells would only be grouped within the same tissue slice, anatomical region, or other annotation. For all downstream analysis with hdWGCNA, the metaspot expression dataset can be used in place of the metacell expression matrix.
Computing co-expression networks Following metacell or metaspot construction, hdWGCNA constructs co-expression networks and identifies gene modules, building off of the WGCNA workflow. 1,4,77,78 The gene-gene adjacency matrix A is computed by taking the pairwise correlation of genes in G in the metacell expression matrix M, or in a subset of M for a specified cell population. Consider the gene expression vectors x i = M iÃ and x j = M jÃ for an arbitrary pair of genes ði; jÞ˛G, we compute the signed correlation as: Note that a i;j is a linear transformation that retains the sign of the correlation while satisfying 0 % a i;j % 1. We define A as a symmetric adjacency matrix of size N g 3N g containing the signed correlations a i;j for all pairs ði; jÞ˛G as in Equation 2. In order to emphasize strong correlations, we raise the elements of A to a power b, and we refer to this as soft power thresholding.
Now we have the gene-gene correlation raised to a power b, and an alternative metricã i;j which also retains the sign of the correlation between these genes. The final co-expression network is then computed as a signed topological overlap matrix (TOM). The TOM describes shared neighbors between the a pair of genes ði; jÞ. We define the signed TOM as where k i and k j represent the connectivity between genes i and j In the signed TOM, negative correlations serve to negatively reinforce the network connection, which is not the case in the unsigned TOM.

TOM unsigned
Genes are then grouped into modules based on the TOM network representation using the Dynamic Tree Cut algorithm, 3 such that co-expression modules consist of genes with high topological overlap. Dynamic Tree Cut hierarchically clusters genes based on their dissimilarity in the TOM, denoted as DissTOM = 1 À TOM, thereby yielding a mapping between module assignments and gene names. The overall process transforming a metacell expression matrix M to a signed TOM co-expression network is implemented as part of the hdWGCNA R package in the ConstructNetwork function. Here we described the recommended workflow, using a signed adjacency matrix and a signed TOM, but ConstructNetwork can optionally construct unsigned or signed hybrid networks as well.

Computing module eigengenes
Module eigengenes (MEs) are a convenient metric to summarize the gene expression of a given co-expression module. While the coexpression network was computed using the metacell expression matrix M, we compute MEs in the single-cell expression matrix X, thus yielding information about the activity of each module in each cell. The expression matrix for the I-th module consisting of genes G ðIÞ 3G is X ðIÞ = X G ðIÞ ;Ã . The ME for module I is then computed by performing singular value decomposition (SVD), such that X ðIÞ = UDV T . Prior to running SVD, X ðIÞ must be scaled and centered, and we accomplish this using the Seurat function ScaleData. Importantly, ScaleData enables us to optionally perform regression to diminish the effects of selected technical covariates prior to computing MEs. The first column of V, containing the right-singular vectors V ðIÞ = ðv While SVD or other dimensionality reductions on a single-cell gene expression matrix contains critical biological information, technical artifacts are also present in these representations. There are many computational methods aiming to reduce technical effects in a reduced dimensional space, and these methods are often referred to as ''batch-correction'' or ''integration'' approaches. 79 In particular, Harmony 22 is an algorithm well suited for correcting batch effects that may be present in a dimensionality-reduced single-cell expression dataset, 79 and here we propose applying Harmony to MEs to maximize the biological information content of each ME. We implemented the ME computation algorithm, as defined in Algorithm 2, as part of the hdWGCNA R package in the function ModuleEigengenes.
Projecting co-expression modules in unseen data In a typical hdWGCNA workflow, we perform metacell bagging, co-expression network analysis, module identification, and ME computation using the same single-cell gene expression dataset, starting from the expression matrix X. Given the module-gene assignment table derived from a reference dataset X, we can run the ModuleEigengenes algorithm on a query dataset Y where the genes in Y must be contained in the set of genes in X such that G Y 4G X . We implemented this process in the hdWGCNA R package as the ProjectModules function. Importantly, we designed ProjectModules to be agnostic towards the data modality or species used in the reference and query datasets, thereby allowing for a host of comparative analyses. ProjectModules can facilitate crossspecies analysis leveraging a table that maps gene symbols between two genomes. Modules can be projected into epigenomic data modalities such as single-cell assay for transposase accessible chromatin with sequencing (scATAC-seq) provided a measure of gene expression estimated from chromatin accessibility, such as Signac 59 gene activity or ArchR 80 gene scores. This approach can also be used to project modules from bulk expression datasets into single-cell or spatial transcriptomics datasets.
Implementation of the hdWGCNA R package hdWGCNA greatly extends upon scWGCNA, 12 our previous method for co-expression network analysis in single-cell transcriptomics data. scWGCNA was originally used to identify co-expression networks using bulk and single-cell RNA-seq together, 12 and in another study we showed that scWGCNA was suitable for network analysis using scRNA-seq alone. 28 Contrasting the hdWGCNA

Algorithm 2. ModuleEigengenes
Require: X such that dimðXÞ = N g ; N c ⊳ normalized gene expression matrix of N g genes and N c cells Require: modules ⊳ the table containing mappings between genes and modules. Require: mods ⊳ list of modules. Require: covariates ⊳ covariates to regress. Require: batches ⊳ batch identity to correct with Harmony, or null to ignore. ME )½ B for I in mods do modules ðIÞ ) subset(modules, module = = I) G ðIÞ ) modules ðIÞ [,gene] X ðIÞ )X G ðIÞ ;Ã X ðIÞ ) ScaleData(X ðIÞ , covariates) V ðIÞ ) SVD ðX ðIÞ Þ ME ðIÞ )V ðIÞ 1 if batchessNULL theñ V ðIÞ ) Harmony(V ðIÞ , batches) ME ðIÞ )Ṽ package, the implementation of scWGCNA was an R package containing a single function for metacell construction, and a single tutorial to cover the basics of network analysis using the WGCNA package 1 with the metacell matrix. We implemented hdWGCNA as an open-source object-oriented R package that leverages the widely used SeuratObject data structure. The hdWGCNA R package includes all necessary functions for network inference, data visualization, statistical testing, and downstream analysis such as pathway enrichment. Further, hdWGCNA includes functions to extract the network data from the SeuratObject to easily facilitate custom analysis with external Bioconductor or R packages. In order for hdWGCNA to be widely useful across the genomics community, we developed a detailed documentation website containing tutorials for network analysis in single-cell and spatial transcriptomics data, as well as tutorials for advanced analysis like consensus network analysis and network preservation testing. Unlike scWGCNA, the metacell construction algorithm (Algorithm 1) in hdWGCNA includes new parameters to avoid redundant metacells, the module eigengene algorithm in hdWGCNA (Algorithm 2) accounts for batch effects and additional covariates in the input dataset, and hdWGCNA contains functions to handle spatial transcriptomics datasets. Several key steps in co-expression network analysis, like calculating module eigengenes and eigengene-based connectivity, have been re-implemented to operate on sparse matrices, greatly decreasing runtime and memory usage. hdWGNCA is completely technology agnostic, and can be adapted to handle high dimensional counts matrices from any single-cell or spatial transcriptomics platform. Additionally, hdWGCNA includes a novel approach for visualizing genes and the underlying network in a two-dimensional manifold of co-expression space using UMAP. 19 As shown throughout this manuscript, hdWGCNA includes functions for projecting co-expression networks into a variety of external datasets. The widespread adoption of single-cell genomics has led to many biologists running their own computational analysis, and we designed the hdWGCNA R package with these individuals in mind through our various step-by-step tutorials and detailed documentation.

Reprocessing published datasets
The key resources table details the different datasets used throughout this manuscript. We used several published datasets generated by our own group, 12,28,56 and sequencing data was not re-downloaded for these studies. For all human snRNA-seq datasets, we applied a uniform processing pipeline to process each dataset starting from the raw sequencing data and resulting in an anndata object 72 containing UMI counts, normalized gene expression, cluster identities, and cell type annotations. Parameters used throughout this processing pipeline vary slightly between different datasets, and all parameters are noted in the data processing scripts in our github repository. For each biological replicate, we used the kb count function from kallisto j bustools 81 to psuedoalign raw sequencing reads to the reference transcriptome and quantify gene expression attributed to each cell barcode. The human reference transcriptome (GRch38) was obtained from the 103 Genomics website (version 2020-A, July 2020), and was re-formatted for use with kallisto j bustools using the kb ref function. For each of the UMI counts matrices, we used the remove-background function from cellbender 74 to simultaneously identify which barcodes corresponded to cells and to remove counts attributed to ambient RNA. We then used scrublet 73 to compute ''doublet scores'', the likelihood of each barcode mapping to more than one cell. Counts matrices from each biological replicate in a given dataset are then merged into a single anndata object, and any relevant sample level meta-data (age, sex, disease status) was stored in the adata.obs table. We performed a percentile filtering of cells that were outliers from each dataset based on the number of UMI per cell the percentage of UMI attributed to mitochondrial genes per cell, and the doublet score. Filtering based on these criteria was performed in each sample, as well as dataset-wide. After filtering, downstream data processing steps were carried out with SCANPY. 72 The UMI counts matrix was normalized with lnðCPMÞ using the functions sc.pp.normalize_total and sc.pp.log1p. Highly variable genes were identified using the function sc.pp.highly_variable_genes, and these genes are used as the features for downstream analysis steps such as principal component analysis (PCA). The normalized expression matrix was then scaled to unit variance and centered at zero using the function sc.pp.scale. PCA was performed on the scaled expression matrix using the function sc.tl.pca. Harmony 22 was used to correct the PCA matrix for batch effects using the function sc.external.pp.harmony_integrate. The harmonized PCA matrix was then used to construct a cell neighborhood graph using the function sc.pp.neighbors. The cell neighborhood graph was then used to compute a two-dimensional representation of the data with uniform manifold approximation and projection 19 using the function sc.tl.umap, and to group cells into clusters with Leiden clustering 82 using the function sc.tl.leiden. We inspected the gene expression signatures in each Leiden cluster for a panel of canonical cell-type marker genes in order to assign a cell-type label to each cluster, and to identify additional doublet clusters that may have escaped the previous filtering steps. The distribution of quality control metrics was inspected in each cluster. We filtered out cells belonging to clusters that displayed conflicting expression of cell-type marker genes, or were outliers in their quality control metrics. After filtering these low-quality clusters, we ran UMAP and Leiden clustering again, resulting in the final processed dataset. We used a custom script to convert the datasets from anndata to SeuratObject by saving the individual components (counts matrix, cell metadata, gene meta-data, dimensionality reductions, etc.) in Python and then loading them back into R to create a SeuratObject.

Iterative network analysis of major cell types in the human cortex
We performed an iterative co-expression network analysis of the major cell types (ASC, EX, INH, MG, ODC, OPC) in the human PFC snRNA-seq dataset from Zhou et al., 11 only including samples from control brains (36,671 cells and 36,601 genes). We retained genes that were expressed in at least 5% of cells for downstream analysis. Metacells were computed separately for each major Article ll OPEN ACCESS cell type and each sample using the hdWGCNA function MetacellsByGroups, aggregating 25 cells per metacell. Further, we ran MetacellsByGroups while varying the K parameter in order to asses the resulting metacell expression matrix sparsity. For each cell type, we applied the following hdWGCNA commands with default arguments to perform network analysis: TestSoftPowers, ConstructNetwork, ModuleEigengenes, ModuleConnectivity, and RunModuleUMAP. We performed module preservation analysis 20 of the ODC co-expression modules in an external snRNA-seq dataset of the human PFC. 12 Modules were projected from the reference to query dataset using the hdWGCNA function ProjectModules, and the module preservation test was performed using ModulePreservation with 100 permutations.
Comparison of hdWGCNA with alternative metacell approaches For the purpose of co-expression network analysis, we compared our metacell aggregation approach (Algorithm 1) with two alternative approaches, namely Metacell2 14 and SEACells. 16 We ran the three metacell approaches using the recommended settings on the same dataset, and then ran hdWGCNA on each of the resulting metacell expression matrices. We used a scRNA-seq of 6,800 CD34 + hematopoietic stem and progenitor stem cells included with the SEACells package, and we used the cluster annotations from the original study. Notably, SEACells and Metacell2 do not account for cell labels in their aggregation procedures, which may result in a number of metacells containing transcriptomes from differently labeled cells. For the hdWGCNA metacell algorithm, we aggregated 50 cells per metacell. For the three metacell expression matrices derived from the different algorithms, we performed co-expression network analysis with the standard hdWGCNA pipeline by sequentially running the following functions with default parameters: TestSoftPowers, ConstructNetwork, ModuleEigengenes, ModuleConnectivity, and RunModuleUMAP. With the same cluster settings, Dynamic Tree Cut recovered a different number of co-expression modules for the three methods (hdWGCNA: 16 modules; MC2: 13 modules; SEACells: 20 modules). We performed pairwise comparisons between the gene modules detected with each metacell approach using Fisher's exact test to test module overlaps. Additionally, we performed rank-rank hypergeometric overlap 83 (RRHO) tests using the RRHO function from the R package RRHO (version 1.13.0) to compare the kME ranking between modules across methods. To compare MEs and Seurat module scores, we ran the AddModuleScore function, and computed Pearson correlations between each ME and each module score.
Isoform co-expression network analysis in the mouse hippocampus We performed isoform co-expression network analysis in radial glia lineage cells (radial glia, astrocytes, ependymal cells, and neural intermediate progenitor cells) from mouse hippocampus ScISOrSeq dataset from Joglekar et al. 8 using the hdWGCNA R package. The gene-level counts matrix for this dataset was obtained from the Gene Expression Omnibus database (GEO: GSE15845), and the isoform-level counts matrix was obtained directly from the authors of the original study. We formatted this dataset as a Seurat object with an isoform-level expression assay and a gene-level expression assay. The standard Seurat processing pipeline was used on the gene-level expression assay, where we sequentially ran the functions NormalizeData, FindVariableFeatures, ScaleData, and RunPCA with default parameters. The dataset was projected into two dimensions by running UMAP on the PCA matrix with 30 components using the RunUMAP function. For all downstream purposes, the cell-type annotations from the original study were used.
Radial glia cells were selected for network analysis, and isoforms expressed in fewer than 1% of these cells were excluded, yielding a set of 2,190 cells and 10,375 isoforms from 4,770 genes. We constructed metacells separately for each cell type on the isoformlevel expression assay using the hdWGCNA function MetacellsByGroups with k = 30. We performed a parameter sweep for the soft-power threshold b using the function TestSoftPowers. The isoform co-expression network was constructed using the ConstructNetwork function with the following parameters: networkType = ''signed'', TOMType = ''signed'', soft_power = 5, deep-Split = 4, detectCutHeight = 0.995, minModuleSize = 50, mergeCutHeight = 0.5. This approach identified 11 isoform co-expression modules. Isoform-level module eigenisoforms were computed using the ModuleEigengenes function, and eigenisoform-based connectivity was computed using the ModuleConnectivity function with default parameters. We computed a semi-supervised UMAP projection of the co-expression network using the hdWGCNA function RunModuleUMAP, with the module labels and the top six Article ll OPEN ACCESS hub isoforms (by kMEiso) per module as the input features. We used the enrichR to identify enriched pathways in each module ranked by using the following databases: GO_Biological_Process_2021, GO_Cellular_Component_2021, GO_Molecular_Function_2021, WikiPathway_2021_Mouse, and KEGG_2021_Mouse.
To assess isoform co-expression network dynamics throughout the cellular trajectories within the radial glia lineage, we performed pseudotime analysis using Monocle 3 36 (version 1.0.0). We computed a UMAP of just radial glia lineage cells using the Monocle 3 function run_umap. A trajectory graph was built on this UMAP representation using the function learn_graph, and pseudotime was calculated with the function order_cells using the radial glia cells as the starting point. We split the pseudotime trajectory into three lineages based on the distinct cell fates (astrocyte, neuronal, and ependymal). We grouped cells into 50 evenly-sized bins throughout each trajectory, and we applied loess regression to the average module eigenisoform of each module in these bins to inspect the dynamics of each module throughout development. We wrote a custom script to generate a GTF of isoform models output from the ScISOrSeq pipeline. To visualize expressed isoforms, we plotted isoforms from this GTF on the UCSC genome browser as well as in Swan. 38 Co-expression analysis network of inhibitory neurons in autism spectrum disorder We selected inhibitory neurons from the Velmeshev et al. 9 human autism spectrum disorder (ASD) snRNA-seq dataset for co-expression network analysis. Of the 121,451 cells in this dataset, 20,249 were labeled as inhibitory neurons based on marker gene expression profiles. We retained 11,194 genes which were expressed in at least 10% of cells from any cluster, and had non-zero variance in the inhibitory neuron population. Metacell transcriptomic profiles were constructed separately for each of the 54 samples and each cell type using the hdWGCNA function MetacellsByGroups, aggregating 50 cells into one metacell. We selected a soft-power threshold b = 9 based on the parameter sweep performed with the TestSoftPowers function. The co-expression network was computed with the ConstructNetwork function with the following parameters: networkType = ''signed'', TOMType = ''signed'', soft_ power = 9, deepSplit = 4, detectCutHeight = 0.995, minModuleSize = 50, mergeCutHeight = 0.2. Module eigengenes were computed using the ModuleEigengenes function, and we applied Harmony 22 to correct MEs based on sequencing batch. Eigengene-based connectivity for each gene was computed using ModuleConnectivity. The co-expression network was embedded in two dimensions using UMAP with the RunModuleUMAP function with the top five genes (ranked by kMEs) per module as the input features. Distributions of MEs were compared between ASD and control samples for each inhibitory neuron subpopulation using a two-sided Wilcoxon rank-sum test with the R function wilcox.test. We used the enrichR 85 to perform enrichment analysis on the top 100 genes in each module ranked by kME using the following databases: GO_Biological_Process_2021, GO_Cellular_Component_2021, GO_ Molecular_Function_2021, WikiPathway_2021_Human, and KEGG_2021_Human. Furthermore, we computed the overlap between co-expression modules and ASD-associated genes from the SFARI Gene database using the R package GeneOverlap, which calculates the overlap between sets of genes using Fisher's exact test.
Consensus co-expression network analysis of microglia in Alzheimer's disease We performed consensus co-expression network analysis of microglia in Alzheimer's disease (AD) using three published snRNA-seq datasets. [10][11][12] The individually processed datasets were merged into a single Seurat object comprising 189,127 nuclei, and the datasets were integrated into a common dimensionally-reduced space using PCA and Harmony. 22 We retained all nuclei labeled microglia for network analysis based on expression of canonical marker genes such as CSF1R (9,904 nuclei), and genes expressed in at least 5% of microglia from any of the three studies were retained (7,900 genes). Metacells were constructed in groups of cells based on AD diagnosis status and study of origin, aggregating 25 cells per metacell. Within hdWGCNA, we used the SetMultiExpr function to create a list of expression matrices containing the selected genes and metacells for the three studies. We performed a separate parameter sweep for the three expression matrices using the hdWGCNA function TestSoftPowerConsensus, ensuring that we . The consensus coexpression network was contructed using the hdWGCNA function ConstructNetwork using the consensus = TRUE option. Individual TOMs were computed for each dataset, and they were scaled based on the 80th percentile in order to alleviate different statistical properties specific to each dataset rather than the underlying biology. A consensus TOM was computed by taking the element-wise minimum of the individual TOMs from each dataset. Therefore, large topological overlap values between two genes, which indicate a strong co-expression relationship, are supported across all three datasets in the consensus TOM. We performed hierarchical clustering on the consensus TOM, and we used the Dynamic Tree Cut algorithm 3 was used to identify consensus co-expression modules based on the hierarchy. Module eigengenes were computed using the ModuleEigengenes function, and we applied Harmony 22 to correct MEs based on the dataset of origin. Eigengene-based connectivity for each gene was computed using ModuleConnectivity. We visualized the network using UMAP with the top ten hub genes (ranked by kMEs) per module as the input features, annotating the hub genes and known disease-associated microglia genes. 50 We used the enrichR 85 to perform enrichment analysis on the top 100 genes in each module ranked by kME using the following databases: GO_Biological_Process_2021, GO_ Cellular_Component_2021, GO_Molecular_Function_2021, WikiPathway_2021_Human, and KEGG_2021_Human. We sought to model the transcriptional dynamics governing the shift between homeostatic and activated microglia in AD, therefore we performed pseudotime analysis using Monocle 3 36 to build a continuous trajectory of microglia cell states. A trajectory graph was built on the microglia UMAP using the function learn_graph, and pseudotime was calculated with the function order_cells. We oriented the start of pseudotime based on the expression of homeostatic microglia marker genes, such as P2RY12, CX3CR1, and Cell Reports Methods 3, 100498, June 26, 2023 e8 Article ll OPEN ACCESS CSF1R. We grouped cells into 50 evenly-sized bins throughout each trajectory, and we applied loess regression to the average module eigengene of each module in these bins to inspect the dynamics of each module throughout the microglia trajectory.
To link the integrated microglia snRNA-seq dataset with polygenic risk of disease for individual cells, we used the scDRS python package (version 1.0.0). 52 This pipeline takes 1) a set of putative disease genes derived from GWAS summary statistics and 2) a scRNA-seq dataset as inputs, and outputs disease enrichment statistics for a given disease (raw and normalized disease scores, cell-level scDRS p value, and Z-scores converted from the p values). GWAS summary statistics of 74 diseases and complex traits supplied by scDRS were utilized as gene sets, among which a gene set by Jansen et al. 46 provided the set of genes associated with AD. We then visualized the AD scDRS Z-scores in the integrated AD microglia trajectory, and we correlated the scDRS score with the trajectory using a Pearson correlation.
We performed module preservation 20 analysis in a variety of external datasets from human and mouse 11,12,28,[56][57][58] to test for the reproducibility of the consensus AD microglia modules in the microglia population from each dataset. We used the hdWGCNA function ProjectModules to compute module eigengenes for the consensus AD microglia modules for each query dataset. The module preservation test was performed using the hdWGCNA function ModulePreservation with 100 permutations, and we reported the preservation Z-summary statistics in a heatmap. For the Morabito & Miyoshi et al. snATAC-seq dataset, we used the gene activity 59 representation as a gene-level summary of chromatin accessibility in order to assess the module preservation at the epigenomic level.
Analysis of bulk RNA-seq co-expression modules in single-cell data We projected gene co-expression modules from two bulk RNA-seq studies of AD 56,63 into a published snRNA-seq study of AD to assess their expression patterns within various cell populations. While both of these studies used the samples from the same bulk RNA-seq cohort, the set of modules from Morabito et al. 2020 56 was based on a consensus network analysis across six brain regions while the other set of modules from the AMP-AD study 63 were constructed separately for seven different brain regions. Module eigengenes were computed for each of these bulk RNA-seq modules in the snRNA-seq dataset using the hdWGCNA function ProjectModules, using Harmony to correct MEs based on sequencing batch. We visualized the MEs of the projected modules in the snRNA-seq dataset using the Seurat function DotPlot.