Deciphering cellular transcriptional alterations in Alzheimer’s disease brains

Large-scale brain bulk-RNAseq studies identified molecular pathways implicated in Alzheimer’s disease (AD), however these findings can be confounded by cellular composition changes in bulk-tissue. To identify cell intrinsic gene expression alterations of individual cell types, we designed a bioinformatics pipeline and analyzed three AD and control bulk-RNAseq datasets of temporal and dorsolateral prefrontal cortex from 685 brain samples. We detected cell-proportion changes in AD brains that are robustly replicable across the three independently assessed cohorts. We applied three different algorithms including our in-house algorithm to identify cell intrinsic differentially expressed genes in individual cell types (CI-DEGs). We assessed the performance of all algorithms by comparison to single nucleus RNAseq data. We identified consensus CI-DEGs that are common to multiple brain regions. Despite significant overlap between consensus CI-DEGs and bulk-DEGs, many CI-DEGs were absent from bulk-DEGs. Consensus CI-DEGs and their enriched GO terms include genes and pathways previously implicated in AD or neurodegeneration, as well as novel ones. We demonstrated that the detection of CI-DEGs through computational deconvolution methods is promising and highlight remaining challenges. These findings provide novel insights into cell-intrinsic transcriptional changes of individual cell types in AD and may refine discovery and modeling of molecular targets that drive this complex disease.


Background
Alzheimer's disease (AD) is a neurodegenerative disease that affects~5.7 million patients with annual cost of more than $230 billion in the US [1]. Effective diseasemodifying drugs are still elusive despite the urgent need and increasing global burden [2,3]. Pathologically, AD is marked by amyloid-beta plaques and neurofibrillary tangles, along with neuronal loss and gliosis in the affected brain regions. Transcriptome-wide expression levels have been analyzed from bulk brain tissue of hundreds of AD patients and neuropathologically normal controls [4][5][6][7][8] to discover genes and biological pathways that are perturbed in and/or lead to AD. Systems biology and bioinformatics analysis of these data have implicated altered pathways in AD including immune response [6] and myelin metabolism [4,5]. However, a fundamental knowledge gap remains concerning whether diseaseassociated changes in brain gene expression are due to changes in cellular composition of the AD samples secondary to disease neuropathology, or due to changes in the intrinsic regulation/activity of genes in the central nervous system (CNS) cells. From a clinical perspective, it is difficult to target changes in cellular composition secondary to neuropathology, whereas intrinsic changes in gene expression that may drive AD progression are potentially "druggable".
We expect that identifying cell-intrinsic differentially expressed genes (CI-DEGs) of individual CNS cell types will reveal novel insights into the genes and pathways that could potentially identify drug targets and lead to AD therapeutics. This approach circumvents the influence of cell-composition changes that can impact disease associated DEGs obtained from bulk tissue transcriptome analysis. RNA sequencing (RNAseq) studies from single cell, single nucleus or purified adult human CNS cells [9][10][11] can be used to identify CI-DEGs. Even though such single cell-type RNAseq data can effectively serve as a reference to annotate bulk-tissue transcriptome data [4], such approaches remain costly, cumbersome and limited in sample sizes. On the other hand, there exist large-scale bulk brain RNAseq datasets [5,8,12], which can be leveraged to identify CI-DEGs through analytic deconvolution of bulk tissue expression into signals of individual cell types by using cell proportions or their proxies [13,14].
In this study, we describe the design and application of a bioinformatics pipeline that uses cell type marker genes to estimate cell proportion [15,16] to deconvolute bulk tissue transcriptome data using three computational approaches [13,14,17] and to subsequently identify CI-DEGs. We applied our pipeline to the analysis of three post-mortem brain datasets, one from dorsolateral prefrontal cortex (DLPFC) [8] and two from temporal cortex (TCX) [4,12,18] regions, comprised of 685 unique samples. Consensus CI-DEGs common to both TCX and DLPFC regions were analyzed for enrichment of gene ontology (GO) terms. We compared the results of consensus CI-DEGs to consensus bulk-DEGs. In addition, for the DLPFC [8] dataset that had both bulk and single nucleus RNAseq [19] (snRNAseq) data, we compared the CI-DEGs from the computational deconvolution to CI-DEGs obtained from snRNAseq [19].
To our knowledge, this is the first study to detect consensus CI-DEGs and their enriched gene ontology (GO) terms from multiple brain regions using multiple computational deconvolution algorithms for AD and control RNAseq samples. Our study illustrates the cell proportion landscape of AD and control brain regions assessed in three independent RNAseq studies [4,7,8,12]. We identify consensus CI-DEGs many of which are not observed in bulk-DEG analysis and characterize their cell-type specificity. GO terms that are enriched for CI-DEGs implicate cell intrinsic transcriptional alterations that may influence AD, rather than be a result of cell-proportion changes in this disease. These CI-DEGs and their biological pathways may serve as refined molecular targets for therapeutic discoveries and disease modeling in AD. Our study also demonstrates that detection of CI-DEGs through computational deconvolution methods is promising while some challenges remain.

Results
Cellular composition in three brain cohorts from two brain regions We analyzed three cohorts each consisting of postmortem brains from AD and control subjects (Table S1), namely the Rush Religious Orders Study and Memory and Aging Project dorsolateral prefrontal cortex (DLPFC) [7,8], Mayo Clinic temporal cortex (TCX-Mayo) [4,12], and Mount Sinai VA Medical Center Brain Bank temporal cortex (TCX-MSBB) [18]. We generated the TCX-Mayo RNAseq dataset, and downloaded DLPFC and TCX-MSBB RNAseq datasets from the AMP-AD knowledge portal on Synapse (www.synapse.org).
Cell proportions (Table S2) were estimated for DLPFC, TCX-Mayo and TCX-MSBB datasets independently using the digital sorting algorithm (DSA) method [16] and the top 100 marker genes (Table S3) obtained from R package BRETIGEA [15] for each of the following cell typesneuron, oligodendrocyte, microglia, oligodendrocyte progenitor cell (OPC), astrocyte and endothelial cell.
An inspection about the pairwise correlation between marker genes (Fig. 1a) revealed that markers of OPC have poor median pairwise Pearson correlation values of 0.12 in DLPFC, 0.11 in TCX-Mayo and 0.06 in TCX-MSBB respectively, whereas among the other five cell types neuronal markers have the highest median correlation (0.68 in DLPFC, 0.78 in TCX-Mayo and 0.67 in TCX-MSBB), and microglia markers have the lowest correlation (0.37 in DLPFC, 0.42 in TCX-Mayo and 0.44 in TCX-MSBB). In addition, a computer simulation study (Fig. S1) demonstrated that the estimated proportions of OPC were not robust upon using different selection of marker genes. Therefore, we did not include OPC in downstream analyses in this study.
In all three datasets, neuronal cell proportion estimates were significantly lower in AD compared to controls (Fig. 1b). The magnitude of this decrease was the greatest for TCX-Mayo (AD mean proportion = 28.0%, Control = 35.7%; ratio of AD:control cell proportions = 0.78), followed by TCX-MSSM (AD = 42.3%, control = 49.3%; ratio = 0.87) and DLPFC (AD = 42.4%, control = 47.4%; ratio = 0.89). The estimated proportions of microglia were significantly higher in AD vs. controls for all datasets, with higher magnitude in TCX-Mayo (AD:control ratio = 1.19) and TCX-MSBB (AD:control ratio = 1.19) than for DLPFC (AD:control ratio = 1.06). The estimated proportions of astrocytes and endothelial cells were significantly higher in AD vs. controls for DLPFC and TCX-Mayo datasets, although the magnitude was greater in TCX-Mayo (1.40 and 1.30 respectively) than in DLPFC (1.07 and 1.14 respectively) for both cell types. Oligodendrocyte proportion is significantly higher in AD in DLPFC with AD:control ratio 1.14 and TCX-MSBB with AD:control ratio 1.27, although remains unchanged in TCX-Mayo with the ratio 0.94. Collectively, these findings demonstrate that the proportions of CNS cell types are different in post-mortem AD vs. control brains for most cell types. Although these proportional changes with AD are mostly consistent across the different studies, their extent varies across brain regions, with TCX tending towards higher magnitude of neuronal loss and microglia proliferation than DLPFC. It needs to be emphasized that the cell proportion changes estimated here are relative values, rather than absolute cell proportion changes between ADs and controls.

Differential expression analyses
In this study, three computational approaches were applied to identify cell intrinsic differential expression in individual cell types (CI-DEGs, Table S4-S6), namely Cell-CODE [14], PSEA [13] and our method WLC. Differentially expressed genes from bulk brain tissue (bulk-DEGs) were identified through linear regression without adjusting for cellular composition (Table S7). For the DLPFC, TCX-Mayo and TCX-MSBB datasets, we obtained bulk-DEGs and CI-DEGs from the three computer algorithms for neuronal, oligodendrocytic, microglial, astrocytic and endothelial cell types respectively.
We compared bulk-DEGs across the three datasets (Fig. 2a, top panel). Similarly, CI-DEGs from CellCODE, PSEA and WLC are compared across datasets (Fig. 2a, lower panels), such that CI-DEGs shared between datasets are required to be consistent in the designated cell type. All DEGs are identified at nominal p-value cutoff 0.05 and shared CI-DEGs have the same direction of change in the compared datasets. The ratio of overlap between any two datasets over all DEGs, i.e. the number in overlapping areas of the Venn diagram over the total number (Fig. 2a, top panel), is 30.0% or 1711/5697 in up-regulated bulk-DEGs, and 34.8% or 2214/6371 in down-regulated bulk-DEGs. This ratio of overlap in bulk-DEGs is much higher than that in CI-DEGs (2.7, 4.7 and 10.0% in up-regulated genes from CellCODE, PSEA and WLC respectively; 3.1, 6.8 and 9.3% in down-regulated genes from CellCODE, PSEA and WLC respectively).

Consensus CI-DEGs between DLPFC and TCX
To obtain the consensus CI-DEGs that are shared between DLPFC and TCX brain regions, we selected those CI-DEGs that are detected in "DLPFC and TCX-Mayo" or in "DLPFC and TCX-MSBB" under any of the three algorithms (Fig. S2). We combined all such genes, which collectively comprised the consensus CI-DEGs for each cell type (Fig. 2b). Similarly, consensus bulk-DEGs were the combined set of bulk-DEGs shared between "DLPFC and TCX-Mayo" or "DLPFC and TCX-MSBB".
Most consensus CI-DEGs are from neuronal cells (N = 559), followed by oligodendrocytes (N = 260), whereas microglia contributed the least number (N = 101). The majority (65.5% or 366/559) of neuronal CI-DEGs is down-regulated in AD, and the majority (66.0% or 140/ 212) of endothelial CI-DEGs is up-regulated in AD, with other cell types lying in between. Some of these CI-DEGs are also among the 1000 marker genes of the corresponding cell type from BRETIGEA [15]; 14.7% or 82/559 of neuronal CI-DEGs are also neuronal markers, 25.4% or 66/260 of oligodendrocyte CI-DEGs are also oligodendrocyte markers, and other cell types lie in between.
With regards to consensus bulk-DEGs (Fig. S3), 28.2% of them (885/3135) are cell type markers; 10.4% neuronal markers, 5.6% oligodendrocyte, 3.4% microglia, 4.8% astrocyte and 4.0% endothelial markers. The above observations indicate that computational deconvolution algorithms could identify CI-DEGs for both marker genes and non-marker genes. Importantly, the proportion of non-marker CI-DEGs is greater than that in bulk-DEGs. This suggests that compared to bulk-DEGs, CI-DEGs may be capturing a greater proportion of expression changes that are not due to mere cell population changes. We also compared the consensus bulk-DEGs with consensus CI-DEGs (Fig. S4). We determined that only a small fraction (15.0% or 29/193) of the up-regulated neuronal CI-DEGs was also present in up bulk-DEGs although the overlap is still significant (Fig. 2c). In comparison, most of the up-regulated CI-DEGs of the other four cell types were included in up bulk-DEGs. On the other hand, most (84.2% or 308/366) of the downregulated neuronal CI-DEGs were also present in down bulk-DEGs, whereas most of the down-regulated CI-DEGs of the other four cell types were absent from this group. Since bulk-DEGs did not adjust for neuronal loss and gliosis in AD (Fig. 1b), its ability to identify upregulated neuronal genes and down-regulated glial genes is likely to be compromised. For the same reason, bulk-DEGs may have a false inflation of detecting downregulated neuronal and up-regulated glial genes.

Enriched GO terms of consensus CI-DEGs between DLPFC and TCX
To identify pathways implicated by CI-DEGs that are robust across brain regions, we performed Gene Ontology (GO) enrichment analysis [20,21] for the consensus CI-DEGs, assessing separately those that are up vs. down in AD subjects (Table S8-S17). Figure 3 illustrates the top two enriched GO terms by enrichment p-values, after filtering out terms that encompass less than four CI-DEGs or are cellular compartments.
Consensus CI-DEGs revealed biological pathways that are perturbed in AD in specific brain cell types. Some of these pathways have previously been implicated in AD and others are novel. Down-regulated neuronal CI-DEGs were enriched in neuropeptide hormone activity (GO:0005184) and hormone activity (GO:0005179) pathways, which include VGF (a.k.a. neuroendocrine regulatory peptide 1) [22] and corticotropin releasing hormone (CRH) [23] (Table S13). Consensus up-regulated neuronal CI-DEGs were significantly enriched in potassium channel activity (GO:0005267) and regulation of ion transport (GO:0043269) pathways (Table S8). The latter GO term encompasses most of the genes from the former, and also includes other genes involved in neuronal functions such as the glutamate ionotropic receptor NMDA type subunit 1, GRIN1 [24] and SYT1, which encodes the synaptic vesicle protein, synaptotagmin [25].
Many of the most significant GO terms are related to key functions of the respective cell types for the glial CI-DEGs, as well. The top enriched pathway of downregulated CI-DEGs in oligodendrocytes is myelination (GO:0042552), including myelin basic protein (MBP) [4], plasmolipin (PLLP) [4,5], myelin and lymphocyte protein (MAL), and myelin-associated glycoprotein (MAG) [26] (Table S14). Up-regulated CI-DEGs of oligodendrocytes are enriched in ceramide biosynthetic process (GO: 0046513) including ceramide synthase 4 (CERS4) and UDP glycosyltransferase 8 (UGT8) [5] (Table S9). Ceramide is a constituent of sphingomyelin, a sphingolipid which is particularly found in the myelin sheath; and also a multi-functional signaling molecule [27,28]. Hence, both the down-regulated and the up-regulated oligodendroglial consensus CI-DEGs highlight different components of the myelin biology that are perturbed in AD.
Astrocytes, a cell type that plays a critical role in maintaining brain energy dynamics [33] and metabolism [34], show enrichment of oxidoreductase activity (GO: 0016491) and drug metabolic process (GO:0017144) in down-regulated CI-DEGs which includes genes glutathione S-transferase mu 2 (GSTM2) [35] and thioredoxin2 (TXN2) [36] (Table S16). Astrocytic up-regulated consensus CI-DEGs are enriched for cell-cell junction assembly (GO:0007043) process (Table S11), including the Fig. 3 Top two enriched GO terms in up (red) or down-regulated (blue) consensus CI-DEGs between DLPFC and TCX regions, per cell type astrocytic gap junction protein connexin43 (GJA1) [37], which was identified as a key regulator associated with AD related outcomes. The other top GO process for astrocytic up-regulated consensus CI-DEGs is adenylate cyclase-inhibiting G protein-coupled receptor signaling pathway (GO:0007193), which harbors adenylate cyclase 8 (ADCY8), involved in memory functions [38].

Comparison of CI-DEGs from computational deconvolution vs. snRNAseq
We determined the extent to which each of the three computational deconvolution algorithms could detect CI-DEGs from bulk tissue by comparison of their results with those obtained in a published snRNAseq study [19]. The ROSMAP dataset utilized in our study has both bulk RNAseq from DLPFC (bulk-DLPFC) as well as snRNAseq (snDLPFC) in a subset of its participants [19]. We compared the bulk-DLPFC data deconvoluted with three different algorithms with the published snDLPFC [19] data. Endothelial CI-DEGs were not available from the snRNAseq study, therefore overlap of results could be assessed only for four cell types.
We tested the overlap between the top CI-DEGs for each cell type obtained from deconvoluted bulk-DLPFC and those from snDLPFC ranked by their p values (Fig. 4a). We evaluated the overlap for a range of top CI-DEGs up to top 1000 genes. Overlap for CI-DEGs that are either up (Fig. 4a, upper panel) or down (Fig. 4 a, lower panel) in AD were assessed separately. Hence, overlapping genes had both similar ranks and direction of effect in both deconvoluted bulk-DLPFC and snDLPFC analyses. We established the significance of overlap using simulations for a range of top ranked genes (N = 200, 600 and 1000) (Table S18).
Neuronal CI-DEGs retained their significance of overlap across all comparisons and for all algorithms, except for the top 1000 up-regulated neuronal CI-DEGs deconvoluted with PSEA. Microglial CI-DEGs had the least numbers of significant overlap for their top ranked genes.
Astrocytic and oligodendrocytic top ranked CI-DEGs had significance of overlap between the neuronal and microglial results (Table S18). These findings are reflective of the abundance of these cell types, with the most abundant neurons having the most overlap for the top ranked CI-DEGs between deconvoluted bulk-DLPFC and snDLPFC.
Amongst these comparisons, we determined that the significance for overlap was best for all algorithms for the top ranked 600 genes. Using WLC deconvoluted results, the overlap for the top 600 CI-DEGs from bulk-DLPFC and snDNPFC are statistically significant for all eight comparisons (Fig. 4a). For the top 600 genes, overlap with CellCODE results is significant for all except downregulated oligodendrocyte and up-regulated astrocyte CI-DEGs. For PSEA, none of the microglia CI-DEGs had significant overlap. PSEA results for the top 600 genes were otherwise significant for all but up-regulated oligodendrocyte and down-regulated astrocyte genes.
We also performed a comparison of CI-DEGs identified at nominal significance (p-value < 0.05) with each algorithm from bulk-DLPFC to nominally significant snDLPFC results (Fig. 4b, Table S19). As with the above comparison, genes that are either up or down in both deconvoluted bulk-DLPFC and snDLPFC data were analyzed separately for each cell type.

Discussion
There is an increasing number of large scale RNAseqbased transcriptome datasets generated in bulk tissue for many diseases, including brain tissue from patients with AD, other neurodegenerative diseases and controls [5][6][7][8]12]. Comparison of such transcriptome data from patient and control individuals has been instrumental in the identification of genes and co-expression networks that are altered in and may therefore be risk factors for these diseases [4][5][6][7]44]. The discovery that many of these transcriptional networks harbor genes with disease risk variants provides support for the utility of this bulk-transcriptome approach in deciphering molecules and pathways that are risk factors for these conditions. Nevertheless, there is clear evidence for abundant transcriptome alterations in bulk tissue from affected organs of patients with disease, which appears to be due to changes in cellular composition of that tissue as a consequence of the disease processes [4,45]. Given this, there is a strong need to detect cell-intrinsic transcriptional alterations to be able to distinguish gene expression changes that are upstream of and may therefore be causative factors for disease from those that are merely a result of cell proportion changes that occur due to disease pathology. The discovery of molecules and pathways that are upstream of and risk factors for disease pathology is of paramount importance for development of targeted therapies. This information can also aid in the identification of refined disease biomarkers reflective of disease-causative expression alterations in these conditions. Detection of cell-specific transcriptional changes can also help develop more accurate disease models harboring these cellular alterations. Further, discovery of cell-specific transcriptional alterations in disease may uncover expression changes, particularly in less abundant cell types, which may be missed by the analysis of bulk transcriptome. Thus, there is a growing effort to identify cell-specific expression alterations in human diseases [9-11, 14, 15, 17, 46-48].
There are two general approaches to decipher cellspecific transcriptional changes in AD. One approach is to conduct single nucleus (snRNAseq), single cell (scRNAseq) or purified cell RNAseq experimentally, followed by data analyses. The alternative approach is to design relatively complex bioinformatics pipelines to decipher cell-intrinsic information of individual cell types from bulk tissue microarray or RNAseq data. The first approach is limited due to significantly higher cost and experimental challenges. Additionally, these approaches may have the drawback that the procedures of dissociating cells break cell-to-cell communication and thus may not reflect the true expression signal in vivo. The alternative bioinformatics approach to decipher cell-specific transcriptome alterations from bulk tissue has the potential to avoid the above weaknesses of the experimental approach. Furthermore, a bioinformatics approach can make use of the large amount of existing RNAseq data [4][5][6][7][8]12] from fresh or frozen bulk brain tissues with minimum added cost, and may better reflect the true situation in brain where different cells are in communication.
In this study, applying three different algorithms [13,14] including our own novel approach, we estimated cell-intrinsic gene expression for deconvoluted cell types from three large bulk RNAseq datasets [4,8,12,18] from two brain regions. We identified consensus cellintrinsic transcriptional alterations (CI-DEGs) in AD, which are conserved across cohorts and brain regions. We also performed an in-depth comparison of these CI-DEGs with bulk brain RNAseq data obtained from the same datasets, collectively comprised of 685 unique brain samples. To our knowledge, this is the first study to detect CI-DEGs and their enriched gene ontology (GO) terms in computationally deconvoluted large-scale RNAseq data from AD and control brain samples. Additionally, we conducted a detailed comparison of CI-DEGs deconvoluted from bulk-DLPFC data with the three algorithms and those obtained from snRNAseq [19] (snDLPFC) of a subset of the samples from the same cohort [7,8].
The main findings from our study can be summarized as follows: 1) The direction of change in cellular proportions in AD is consistent across two brain regions and three datasets for most cell types, although the magnitude of change seems to vary. Our findings revealed greater neuronal loss and microgliosis in TCX compared to DLPFC. 2) We identified CI-DEGs and bulk tissue DEGs (bulk-DEGs) independently in two TCX and one DLPFC datasets. The overlap in bulk-DEGs across datasets is greater than that for CI-DEGs. 3) We performed an in-depth comparison of the consensus CI-DEGs, common to both TCX and DLPFC against the consensus bulk-DEGs detected in these same datasets. We identified significant overlap between consensus CI-DEGs and consensus bulk-DEGs. The extent of overlap between consensus bulk-DEGs and consensus CI-DEGs was greatest for down-regulated neuronal genes (p = 9.7E-206). This was followed by up-regulated non-neuronal genes (p ranging from 1.6E-77 for endothelia to 1.1E-17 for microglia). 4) Despite the statistically significant overlap between consensus bulk vs. CI-DEGs, the majority of the consensus CI-DEGs for up-regulated neuronal and down-regulated non-neuronal genes were not detected in bulk tissue. This finding highlights the potential ability of computational deconvolution approach to identify CI-DEGs that may be missed in bulk-DEGs especially for genes that are not moving in the direction of cell proportion changes. 5) We identified GO-terms enriched for consensus CI-DEGs, and detected processes that have previously been implicated in AD as well as novel ones. 6) Using an snRNAseq [19] dataset as comparison, we assessed the performance of our CI-DEG detection algorithm (WLC), and the published CellCODE [14] and PSEA [13] approaches. We determined that WLC had comparable or superior performance in the detection CI-DEGs that had significant overlap with snDLPFC results.
Our findings highlight the consistency and reproducibility of our findings across two different brain regions from three different studies conducted separately. We identified similar directions of change in AD:Control cell proportions in TCX and DLPFC. As expected from known AD neuropathology, neuronal populations are significantly lower, and microglial populations are significantly higher in AD vs. control brains in all datasets. Consistent with this pattern of reproducibility, we also found significant overlap of consensus CI-DEGs detected in TCX and DLPFC for all cell types and for both directions of change, i.e. up or down in AD, with consensus bulk-DEGs.
Using consensus CI-DEGs, we identified GO terms, which include processes and genes that have previously been implicated in AD, thereby providing further validation of our approach. Detailed discussion of all of the pathways identified with the consensus CI-DEGs is beyond the scope of this study. Instead, we herein highlight some of the CI-DEG enriched pathways.
Down-regulated neuronal CI-DEGs were enriched in neuropeptide hormone activity (GO:0005184) pathway. These terms include VGF (a.k.a. neuroendocrine regulatory peptide 1), which is selectively expressed in some neurons and shown to be reduced in AD and Parkinson's disease [22]. Corticotropin releasing hormone (CRH), which is a neuronally expressed peptide that mediates stress in the hypothalamic-pituitary-adrenal axis, is also a member of the same GO term. CRH has been implicated in both adverse outcomes related to AD pathology in model systems and epidemiology studies, as well as having an important role in learning and memory [23]. Neuronal reduction of CRH may either be a potentially neuroprotective response in AD brains or lead to further negative impact on memory. Although the biological implications of our finding remain to be uncovered, our results are aligned with the potential importance of the neuroendocrine system in AD.
Interestingly, CI-DEGs also implicate biological processes that are enriched for neuronal-DEGs that are up in AD, despite reductions in neuronal cell populations in this condition for both TCX and DLPFC. This suggests that our cell type specific transcriptome deconvolution successfully captures transcript changes that occur in the direction opposite to that of cell-population changes, and that may therefore be missed in bulk-DEG approaches.
Indeed, the significant GO terms potassium channel activity (GO:0005267) and regulation of ion transport (GO:0043269) harbor many potassium channels, which are up in AD for neuronal CI-DEGs in both TCX and DLPFC, but do not have consistent results in bulk-DEGs from the same cohorts. These findings highlight the potential utility of cell-specific transcript deconvolution approaches in reducing noise from cell population changes, thereby resulting in consistent detection of true signal. Notably, potassium channels have been reported to be up in AD brains and mouse models of AD [49][50][51], leading to their suggestion as a potential therapeutic avenue in this condition.
Notably, the computational approach we describe can be extended to cellular sub-types. To exemplify this, we utilized published snRNAseq [19] data to identify marker genes of excitatory and inhibitory neurons. Using those and the previously applied markers for the glial cells, we ran DSA algorithm [3] to estimate cell proportion (Fig. S6). We applied PSEA, WLC and CellCODE to identify consensus CI-DEGs in the same fashion as described (Table S20, Fig. S7). As expected, both excitatory and inhibitory neuron proportions are significantly lower in AD samples. Interestingly, the up-regulated excitatory neuronal consensus CI-DEGs are enriched in cation channel activity, which is not observed in inhibitory neurons (Tables S21-S23). These results highlight the ability of this approach to refine transcriptional alterations from bulk brain data to cellular sub-types.
Some consensus CI-DEGs point to AD-related perturbations of key cellular functions for the specific cell type. An example of this is consensus oligodendrocyte CI-DEGs. The down-regulated oligodendrocyte CI-DEGs are enriched for the myelination GO term (GO:0042552) and those that are up in this cell type are enriched in ceramide biosynthetic process (GO:0046513).
Ceramide dysregulation has been implicated in AD [52,53]. Increased ceramide species were observed in AD and other neuropathological disorders compared to controls [53], and the activation of the neutral sphingomyelinase-ceramide pathway induces oligodendrocyte death [54]. Our present observation from oligodendrocyte CI-DEGs are consistent with these findings.
Another potential utility of the computational deconvolution approach is its complementarity to snRNAseq data. While the latter is able to provide transcriptional data at a cellular level, it can miss information from low abundance cell populations. Indeed, in a recent snRNAseq [19] study of AD and control brain samples, only 121 endothelial cells were identified out of a total of 70, 634 cells; and consequently no endothelial DEGs were reported. In contrast, our computational approach identified 140 genes that are up-regulated in endothelial cells in AD and 72 genes that are down-regulated. We further validated the endothelial expression of these genes by confirming their expression in sorted endothelial cells [9]. The significantly enriched GO terms include cytoskeleton organization and translational initiation, for these genes up-and down-regulated, respectively, in endothelia (Fig. 3, Tables S12, S17). These findings highlight the ability of the computational deconvolution approach to provide biological insights that could be missed by snRNAseq approaches.
Despite the biological insights gained from computationally deconvoluted CI-DEGs, they also have some shortcomings. Compared to bulk-DEGs, CI-DEGs between different datasets show less degree of overlap, regardless of the deconvolution algorithm utilized. This highlights the challenge in the field for the ultimate goal of minimizing detection of transcriptional perturbations due to cell proportion changes while maximizing discovery of those that lead to disease. Put differently, CI-DEGs may enhance discovery of true cell-specific transcriptional changes but this may come at the expense of increased false negative findings. In contrast, bulk-DEGs may capture a larger number of perturbed transcripts, but some may be merely due to cell population differences between diseased and healthy tissue. Ultimately, findings from both approaches may provide the greatest utility in detecting true positives.
Another analytic caveat is that cell proportion estimation approaches, including those used in this study yield relative values rather than the absolute levels of cell proportions. Detection of absolute levels in a typical bulk RNAseq experiment will be challenging if not impossible due to different mRNA amounts per cell type and the library preparation protocols (e.g. TruSeq® RNA Sample Preparation v2 Guide) which require similar amounts of cDNA from each sample to be sent to the sequencer. This leads to a normalization which would preclude the detection of the absolute cell proportions, as illustrated in an example in Fig. S8. Despite these challenges, our findings reveal that analytic deconvolution of bulk RNAseq can detect cell-specific transcriptional changes.
Comparison of CI-DEGs deconvoluted from bulk-DLPFC and those detected using an orthogonal approach of snRNAseq [19] from the same cohort (ROS-MAP) demonstrates the ability of these computational approaches to identify true cell-specific expression changes in AD. Using our in-house WLC algorithm, there was significant overlap with the results from snRNAseq and CI-DEGs of most cell types (Fig. 4, Tables S18-S19). CellCODE and PSEA also identified significant overlaps, but to a lesser extent, especially for rarer cell types such as microglia. Hence, our WLC algorithm demonstrates at least comparable performance to these two algorithms [13,14]. This is also supported by the higher degree of overlap among different cohorts for CI-DEGs detected by WLC than the other two algorithms (Fig. 2a). Due to the challenges in deconvoluting noisy data from human series, different computational approaches may be utilized and combined, and that calls for a more devoted effort in developing such algorithms.
In summary, using three distinct computational approaches, we deconvoluted brain bulk-RNAseq data from three large and independent cohorts [8,12,18]. We detected cell population changes that are observed consistently across cohorts, and congruent with the known disease pathology. Although there is significant overlap between consensus CI-DEGs and consensus bulk-DEGs, there are more unique CI-DEGs that change in the direction opposite to that of cell population changes. This suggests that CI-DEGs may have utility in detecting disease-related transcriptional changes above and beyond those due to cell proportion changes. Consensus CI-DEGs identify GO terms, including those for hormone activity, myelin biology and channel activity. The enriched CI-DEGs include genes previously implicated in AD or neurodegeneration, such as VGF, CRH, MOBP and MBP, and other novel genes.

Analysis datasets
We generated the TCX-Mayo data, which consists of temporal cortex RNAseq measurement of 80 AD patients and 28 controls diagnosed according to neuropathologic criteria [4,12]. RNAseq data were processed and quality control (QC) was conducted as described [4,12]. ROSMAP DLPFC [7,8] and TCX-MSBB [18] datasets were downloaded from the AMP-AD Knowledge Portal on Synapse (syn8691134 and syn8691099). We further filtered out non-Caucasian samples and those that had incongruent sex based on provided covariate vs. transcriptome data. All samples were classified as AD or control based on neuropathological data. All TCX-Mayo AD samples had Braak neurofibrillary tangle (NFT) score ≥ 4. TCX-Mayo controls had Braak score ≤ 3 and were without any neurodegenerative disease diagnoses. TCX-MSBB AD samples had Braak ≥4 and CERAD ≥2; and controls had Braak ≤3 and CERAD ≤1. DLPFC AD samples form ROSMAP had Braak score ≥ 4 and CERAD neuritic plaque score ≤ 2. ROSMAP control samples had Braak ≤3 and CERAD neuritic plaque score ≥ 3.
It should be noted that the CERAD [55] neuritic plaque score as applied by the ROSMAP study is defined such that high CERAD indicates lower neuritic plaque burden and decreased probability of AD. In the MSBB study, higher CERAD indicates higher plaque burden.
Raw RNA read counts were normalized using conditional quantile normalization (CQN) method [56] implemented in R cqn package, as previously described [4]. This normalization takes into consideration library size, gene length and GC content. It also performs a log2 transformation so that the resulting distribution for each gene is Gaussian-like. We also determined covariates that contributed significantly to the variation of gene expression in these RNAseq cohorts (Fig. S5) for adjustment in the analyses.

Cell proportion estimation
Digital sorting algorithm (DSA) [16] was applied to estimate cell proportions through R DSA package, function DSA. For each cell type, DSA first computes the average marker gene expression in the analysis dataset, the purpose of which is to construct a variable that better reflects cell proportion variation among subjects. To reduce the effect of outlier expression that is occasionally seen in RNAseq data, we modified the original DSA so that the median instead of mean expression was computed.

CI-DEG analysis for individual cell types
In this study we identified CI-DEGs from deconvoluted bulk RNAseq data using three different algorithms, namely PSEA [13], CellCODE [14] and our in-house algorithm WLC. All analyses adjusted for the following variables: Sex, RIN, age at death and batch for DLPFC and TCX-MSBB datasets, and sex, RIN and age at death for TCX-Mayo dataset (Fig. S5).
PSEA [13] applies model selection procedures to select cell type(s) that should be included in baseline (control) or AD condition, and then estimate differential expression in specific cell types (CI-DEGs). We used the R package PSEA to obtain CI-DEG results of PSEA approach, through functions em_quantvg (to generate candidate models) and lmfitst (to fit all candidate models and pick the best one). Expression values used in PSEA are in linear scale (non log-transformed).
CellCODE [14] assesses overall gene expression difference between AD and control groups with adjustment of relative cell proportion, followed by assigning the cell type that correlates best with the difference (CI-DEGs). R package CellCODE was used to obtain DEG results of Cell-CODE approach, through functions getAllSPVs (to construct surrogate variable using marker genes through singular value decomposition) and lm.coef (to estimate difference between AD and control groups). Strictly speaking, CellCODE measures the overall differential expression rather than CI-DEGs but identifies the cell type that is most correlated with the observed difference between AD and control using cellPopT function. However, for simplicity, we refer to the DEGs from CellCODE as CI-DEGs in this study. Expression values used in CellCODE are in log scale.
Our in-house method WLC, described in the method section, applies weighted linear regression with constraints and model selection procedures, which also estimates differential expression in specific cell types (CI-DEGs). It guarantees the fitted relative gene expression is nonnegative. By weighing the expression, it smooths out the extreme data points. The procedures of this algorithm is illustrated by the following high level pseudo code.
Fit Eq.
(1) to identify a set of cell type T in which the gene is significantly expressed.
If the set T is not empty: Fit Eq.(2) to identify a set of cell type Φ ⊆ T in which the gene is differentiallly expressed Let Ω = {each cell type in T, Φ, T } For each element Θ Ω. Fit Eq.(3) with adjustment for other covariates C k (1 ≤ k ≤ m) Keep Akaike information criterion (AIC) of this model fitting Identify Θ best that gives the best AIC. Use the estimated values from the best model and obtain p-value from F-test y i ∼a 0 þ X t∈ 1;2;3;4;5 f g a t x i;t þ ∈; s:t:a t ≥ 0 1≤ t ≤ 5 ð Þ ð1Þ In Eqs.1, 2 and 3, y i is the observed expression of a gene in subject i; x i, t is the median marker gene expression of cell type t in subject i; C i, k is covarite k in subject i. In Eq.1, α t is the overall relative expression in cell type t. In Eq.2, β t is relative expression at the baseline condition in cell type t; d i = 0 if subject i is in control group, and d i = 1 if subject i is in AD group; therefore, β Δ t is the difference of relative gene expression between baseline condition and AD condition in cell type t. Of note, due to the biological meaning these coefficents, they must satisfy constraints such that α t , β t , β t þβ Δ t be non-negative. In addition, y i is in linear scale rather than in log scale [57], and these non-logtransformed expression values tend to have extreme data points that need to be weighted down. Based on the above considerations, Eq. 1 was fitted by weighted least square linear regression with contraints, which is implemented in lsei function in R package limSolve. The weight of each observation (w i ) is determined by formulae Eqs. 3 and 4 below. Intuitively, if the expression of a gene in a sample is extremely distant from the median expression of all samples in the same diagnosis group, the weight of that sample is smaller than 1 for that gene.

GO enrichment analysis
Using genes included in the CI-DEG analysis as background genes, p-value for GO term enrichment with consensus CI-DEGs was calculated by "enrichmentAnalysis" function from WGCNA R package [35].

Comparison of CI-DEGs from computational deconvolution vs. snRNAseq
To determine if computational bulk-tissue RNAseq could reveal true CI-DEGs, we downloaded and utilized a published snRNAseq study [19] from frozen DLPFC tissues (snDLPFC) which compared gene expression from 24 individuals with AD-pathology to that from 24 individuals without AD-pathology in six cell types -excitatory neurons, inhibitory neurons, oligodendrocyte, microglia, oligodendrocyte precursor cells and astrocytes. These snDLPFC samples are from the ROSMAP project, of which we analyzed 474 bulk-DLPFC RNAseq data in our current study. Twenty-four (9 AD cases, 15 controls) of the 48 individuals in snDLPFC study are also included in the bulk-DLPFC dataset. Hence both the snDLPFC and bulk-DLPFC are from the same cohort with significant overlap. Three deconvolution methods were included in this comparison -CellCODE [14], PSEA [13] and WLC, our in-house method. Two types of comparisons were made between the deconvoluted bulk-DLPFC and snDLPFC results. In the first comparison, we ranked the genes by their p-values of differential expression between AD and control subjects, per cell type. We compared the top N up-or down-genes from the snDLPFC study to those identified by each deconvolution algorithm, per cell type. Genes common to both the deconvoluted bulk-DLPFC and snDLPFC were counted and plotted.
In the second set of comparisons, CI-DEGs identified from deconvolution methods at nominal significance (pvalue < 0.05) were compared to those identified in the snRNAseq data (p-value < 0.05).
To assess if the observed overlap from each set of analyses is significant with regards to overlap between random selections, we obtained empirical p values from computer simulations described below.

Empirical p-value for the number of overlapping genes
The empirical p-values for the number of overlapping genes between snDLPFC and bulk-DLPFC was obtained using a computer simulation as follows. (A) Let S sn stand for all genes in snDLFC and S bulk for all genes in bulk-DLFC; (B) randomly assign up-regulation on each S sn gene at probability 0.5, and on each S bulk gene at probability 0.5; (C) randomly pick N genes from upgenes of S sn , randomly pick N genes from up-genes of S bulk , and count the number of overlapping genes; (D) Steps B-C were repeated 10,000 times, and the numbers of overlaps (M 1 , M 2 , …, M 10000 ) were obtained; (E) Let M be the number of observed overlapping genes, and the empirical p-value is (1 + number of occurrences that M i >= M)/10,001.

Identification of excitatory and inhibitory neuronal markers
Using a published human brain snRNAseq dataset [19], we identified excitatory and inhibitory neuronal markers using Seurat R package FindMarker function. Excitatory neuronal markers are those that a) are detected in > = 70% of excitatory neurons, b) have average normalized expression in excitatory neurons that are > = 4.5X of that in each of the other cell types (i.e. inhibitory neurons, oligodendrocytes, microglia, astrocytes and OPC), and c) have rank sum test p-value < 0.05 in the comparison of expression levels in excitatory neurons and each of the other cell types. Inhibitory neuronal markers were similarly identified except that we required a less stringent detection limit of > = 50% of inhibitory neurons as the > = 70% threshold yielded too few genes. Excitatory neuronal markers were further refined by requiring their presence in the neuronal markers from BRETIGEA [15]. A total of 53 excitatory and 62 inhibitory neuronal markers were identified (Table S20). Excitatory and inhibitory CI-DEGs were identified as described above.

Conclusions
This study demonstrates the utility of our analytic approach in deciphering cell-specific transcriptional alterations using bulk tissue in a complex disease, provides a comprehensive comparison of our pipeline to existing ones, identifies patterns of cell proportions in AD and control samples across brain regions, discovers novel CI-DEGs with replication across independent cohorts and highlights biological processes with cell-specific expression changes in AD. These findings are expected to refine discovery of molecular therapeutic targets, biomarkers that reflect cellular transcriptional alterations in AD and accelerate generation of more accurate disease models.
Additional file 1 Figure S1. Pearson correlation of estimated cell proportions using different sets of marker genes randomly selected. The steps of this simulation study are as follows.  Figure S2. Illustration of obtaining consensus CI-DEGs between DLPFC and TCX regions, using up-regulated CI-DEGs in neuronal cells as an example. Figure S3. Left panel: the number of upregulated and down-regulated consensus bulk-DEGs between DLPFC and TCX-Mayo, or between DLPFC and TCX-MSSM. Right panel: percent of consensus CI-DEGs that are also cell type marker genes. Cell type markers are from BRETIGEA, containing 1000 markers for each of the five cell types. Figure S4. Venn diagram for all consensus bulk-DEGs and consensus CI-DEGs. Figure S5. Source of variance analysis of in the DLPFC RNAseq dataset (left panel), TCX-Mayo (middle panel) and TCX-MSBB (right panel). For each gene, a full model was fitted in which cqn normalized gene expression with gene expression as dependent variable and RIN, age at death, sex, batch, and diagnosis group as independent variables (for DLPFC); RIN, age at death, sex, flowcell, and diagnosis group as independent variables (for TCX-Mayo); RIN, age at death, sex, batch, and diagnosis group as independent variables (for TCX-MSBB). Partial models were fitted using the same dependent variable and all but one independent variable. F statistics were obtained by comparing the full model and partial model for each independent variable. Y-axis is the mean of values of F statistics over all genes. In DLPFC and TCX-MSBB, diagnosis, age at death, sex, RIN and batch contributed more than random errors to the variation of gene expression, whereas in TCX-Mayo diagnosis, age at death, sex, and RIN contributed more than random errors to the variation of gene expression. Figure S6. Estimated cell proportion of excitatory neuron, inhibitory neuron, oligodendrocyte, microglia, astrocyte and endothelial cells. Red asterisk indicates location shifts between cell proportions in AD and control groups at nominal p value 0.05 from Wilcoxon rank sum test. Figure S7. Venn diagram of consensus CI-DEGs in excitatory neuron and inhibitory neuron. Figure S8. Illustration of absolute change of cell proportion (blue panel) and relative change of cell proportion (pink panel) in a bulk RNAseq experiment. This is a hypothetical experiment to illustrate the effect of library preparation step on cell type proportions. For the ease of description, assume there are only two cell typesneurons and glia; assume the total mRNA amount of a glial cell is λ, and the total mRNA amount of a neuronal cell is 2λ on average. In the above scenario there is one control and two AD samples (Case 1 and Case 2). We assume that the cell counts in the Control are baseline (i.e. no changes due to disease). In Case 1, half of the neuronal cells were lost while the number of glia cells remains the same. In other words, the absolute proportion of neuronal cell loss is 50%. In a typical bulk RNAseq approach, the library preparation involves a normalization step (TruSeq® RNA Sample Preparation v2 Guide). The goal of this step is to send similar amounts of cDNAs to the sequencer. After this step, the (relative) neuronal loss detected in Case 1 is 10% that of the Control. In Case 2, one quarter of neuronal cells are lost and and glial cell increase by 50% compared to the Control (absolute changes). However, after the cDNA normalization step, the (relative) neuronal loss that can be detected is about 10% that of the Control. Therefore, even though a bioinformatics approach could accurately estimate the number of cells of different cell types that were sequenced, it would be difficult to trace back to the absolute proportion of cell changes.
Additional file 2. WGCNA: Weighted gene co-expression network analysis AG046139, R01 AG018023, U01 AG006576, U01 AG006786, R01 AG025711, R01 AG017216, R01 AG003949, NINDS grant R01 NS080820, CurePSP Foundation, and support from Mayo Foundation. This study was in part funded by NIH RF1 AG051504 and R01 AG061796 (NET). We thank the Mayo Clinic Advanced Genomic Technology Center staff and bioinformatics core facility for all gene expression measurements. XW thanks Dr. E. Aubrey Thompson for text editing and Jeremy Burgess for provide insightful inputs. XW was funded in part by a pilot grant from the Mayo Clinic ADRC (P50 AG016574). The results published here are in whole or in part based on data obtained from the AMP-AD Knowledge Portal (https://adknowledgeportal.synapse.org). Study data were provided by the Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago. Data collection was supported through funding by NIA grants P30AG10161 (ROS), R01AG15819 (ROSMAP; genomics and RNAseq), R01AG17917 (MAP), R01AG30146, R01AG36042 (5hC methylation, ATACseq), RC2AG036547 (H3K9Ac), R01AG36836 (RNAseq), R01AG48015 (monocyte RNAseq) RF1AG57473 (single nucleus RNAseq), U01AG32984 (genomic and whole exome sequencing), U01AG46152 (ROSMAP AMP-AD, targeted proteomics), U01AG46161(TMT proteomics), U01AG61356 (whole genome sequencing, targeted proteomics, ROSMAP AMP-AD), the Illinois Department of Public Health (ROSMAP), and the Translational Genomics Research Institute (genomic). The results published here are in whole or in part based on data obtained from the AMP-AD Knowledge Portal (https://adknowledgeportal.synapse.org/ ). These data were generated from postmortem brain tissue collected through the Mount Sinai VA Medical Center Brain Bank and were provided by Dr. Eric Schadt from Mount Sinai School of Medicine.
Authors' contributions XW, SL, MA and NE-T conceived the idea. XW, MA and NE-T developed and designed the study, and wrote the manuscript. XW and SL developed the mathematical theory and XW performed the computations. DD provided tissue from the Mayo Clinic Brain Bank and neuropathologically characterized TCX-Mayo brain samples. TN, KM, SL, MA and MC performed sample selection and library preparation of the TCX-Mayo dataset. MA, ZQ, TC, TP and J.R. performed quality control of DLPFC and TCX-MSBB datasets. JC and YA supervised the analytical aspects of the project. NE-T provided funding, supervision and direction for the whole study. All authors read and approved the final manuscript.
Author's information Not applicable.

Availability of data and materials
The data and scripts in this manuscript are available via the AD Knowledge Portal (https://adknowledgeportal.synapse.org). The AD Knowledge Portal is a platform for accessing data, analyses and tools generated by the Accelerating Medicines Partnership (AMP-AD) Target Discovery Program and other National Institute on Aging (NIA)-supported programs to enable openscience practices and accelerate translational learning. The data, analyses and tools are shared early in the research cycle without a publication embargo on secondary use. Data is available for general research use according to the following requirements for data access and data attribution (https://adknowledgeportal.synapse.org/DataAccess/Instructions). For access to content described in this manuscript see: https://doi.org/10.7303/syn22228843.