Meta-analysis of single-cell and single-nucleus transcriptomics reveals kidney cell type consensus signatures

While the number of studies involving single-cell or single-nucleus RNA-sequencing technologies grows exponentially within the biomedical research area, the kidney field requires reference transcriptomic signatures to assign each cluster its matching cell type. The present meta-analysis of 39 previously published datasets, from 7 independent studies, involving healthy human adult kidney samples, offers a set of 24 distinct consensus kidney cell type signatures. The use of these signatures may help to ensure the reliability of cell type identification in future studies involving single-cell and single-nucleus transcriptomics while improving the reproducibility of cell type allocation.


Introduction
The kidney is a vital and highly complex organ responsible for blood filtration, elimination of metabolites and waste products, and body homeostasis (oxygen sensing, pH balance, electrolyte levels, systemic blood pressure regulation). These critical functions are enabled through its functional subunit, the nephron, and an intricate interplay between the nephron sub-structures and some mesenchymal cells, involving the coordinated action of more than 20 cell types (immune cells, stromal cells, nephron epithelial cells, and cells from rare populations such as the juxtaglomerular apparatus) from the entry of the blood in glomerular capillaries to the urine compartment 1 . Nephron sub-structures are organized following the urine flow: blood arrives at the glomerulus and is filtered through fenestrated capillaries and the slit diaphragm (n.b. this active fenestration process is controlled by podocytes), then primary urine passes first through the Bowman capsule lined by parietal epithelial cells (PEC) and then through the tubule per se, including the proximal tubule (PT; reabsorption of water, sodium, calcium, potassium, chloride, phosphate, proteins and glucose), the loop of Henle (LoH; reabsorption of water, sodium and chloride; urine concentration), the distal tubule consisting of the distal convoluted tubule (DCT) and the connecting tubule (CNT; fine tuning of sodium and chloride levels, regulation of H+ and HCO3-, ensuring pH balance), and finally flows into the collecting ducts in the cortex (CCD) and medulla (MCD; water reabsorption; urine concentration) until its storage in the bladder 2,3 .
Although bulk transcriptomics has critically advanced the understanding of kidney development, physiology and diseases [4][5][6] , such an approach is not suitable for investigating renal cell type-specific features at the single-cell scale. More recently, advances in high-throughput single-cell (scRNA-seq) and single-nucleus (snRNA-seq) transcriptomics have made it possible to evaluate cell populations and biological processes of different tissues at the single-cell/nucleus level [7][8][9] . Except for liquid samples (e.g. blood, urine), scRNA-seq involves tissue dissociation, single-cell emulsion and encapsulation, passage through the microfluidics one cell at a time, library preparation, high-throughput sequencing, and finally data analysis. As it may be performed on frozen tissue, snRNA-seq may overcome some issues observed with scRNA-seq, such as a dissociation-induced stress response leading to the expression of a specific set of genes, poor viability, and loss of rare and fragile cell types 10,11 . Although heterogeneity between individuals and cell states has been demonstrated 12,13 , studies are often performed on a reduced number of samples due to the limited availability of human tissue and the cost of the technology 15 . However, no consensus list of human kidney cell type transcriptomic markers has been published so far. Thus, the establishment of consensus transcriptomic kidney cell type signatures might be of utmost importance considering the significant batch effects within scRNA-seq and snRNA-seq datasets [16][17][18] . Here we present a meta-analysis of publicly available scRNA-seq and snRNA-seq datasets from 39 healthy adult kidneys, consisting of 68,028 single cells and 33,412 single nuclei. As data were taken from different sources, they were normalized following the SCTransform analysis pipeline in Seurat v4, and batch effects were mitigated by integration of scRNA-seq and snRNA-seq samples, respectively. Cell types were attributed to clusters using broad cell type markers, consensus cell type signatures were computed, and labelled scRNA-seq and snRNA-seq samples were integrated together to map cell types depending on the method used (sc/snRNA-seq). Finally, single-cell and single-nucleus consensus signatures were benchmarked by enrichment in previously published and annotated datasets.

Fig. 1 Meta-analysis workflow. All available scRNA-seq and snRNA-seq data were retrieved and downloaded from the Gene Expression Omnibus (GEO) or Zenodo repository. Single-cell RNA-seq and snRNA-seq samples were analysed separately, quality control metrics were measured and poor quality cells were filtered out accordingly. Then scRNA-seq samples and snRNA-seq samples were integrated independently. High resolution unsupervised clustering followed by visualisation of the expression of specific transcriptomic markers made it possible to attribute to each cluster a clear cell type (certain cell types were attributed to several clusters), or a cell type followed by « na » (i.e. not attributed) for the cells that did not show strong enough expression of differentiation markers. Consensus signatures were computed using the FindAllMarkers() function in Seurat. Single-cell RNA-seq and snRNA-seq were then integrated together to evaluate the matching between the annotations of both datasets. Finally, consensus signatures were used for cell type enrichments on previously published and annotated datasets.

www.nature.com/scientificdata

Analysis workflow.
To determine the consensus gene signature associated with each kidney cell type, we first collected kidney scRNA-seq and snRNA-seq data available in public databases according to the workflow presented in Fig. 1. Data collection yielded 3 scRNA-seq and 4 snRNA-seq publicly available datasets [19][20][21][22][23][24][25][26][27][28][29][30][31][32] , encompassing a total of 101,431 cells and 35,764 nuclei, from 32 and 7 healthy adult kidneys, respectively (Table 1). Samples from the different datasets were pre-processed with Seurat v4, and cells expressing between 200 and 3500 genes were kept for analysis (discarding cell debris and cell doublets). As some kidney cell populations highly express mitochondrial genes, the threshold to use for the percentage of mitochondrial gene expression in kidney tissue is debated (varying between 20% and 50% across studies) 33,34 . We chose to keep cells with less than 30% mitochondrial genes expressed 24 (Table 2). Although nuclei should not express mitochondrial genes, nuclei with less than 5% mitochondrial genes expressed were kept to limit the loss due to possible minor contamination. Since confounding variables may affect the different samples from the different studies and further analysis (Tables 1, 3), scRNA-seq and snRNA-seq samples were integrated separately using the Seurat IntegrateData() function to mitigate the batch effects, following the newly implemented SCTransform framework for normalization and count data variance stabilization 35 . Finally, the integrated datasets consisted of 68,028 single cells and 33,412 single nuclei.

Generation of a healthy human kidney consensus scRNA-seq dataset.
To generate a healthy human kidney consensus scRNA-seq dataset, we first assessed the quality of the integration by comparing the distribution of cells on a Principal Component Analysis (PCA) plot, before and after integration, using both Harmony and Seurat v4 correction (Fig. 2a). The correction of PC1 and PC2 by Harmony did not look as good as the one obtained with Seurat, for which PC1 and PC2 no longer depended on the origin of the samples after integration. Thus the Seurat v4 correction was used for further computations. Uniform Manifold Approximation and Projection (UMAP) of the integrated dataset showed very good mixing of cells from the different samples and from the different batches (Fig. 2b,c). In addition, it has been suggested that kidney cells express subsets of genes that are regulated in a sex-dependent manner in mice 36 . However, as the sex was not known for 24 of the 32 samples, we could not evaluate whether a sex bias may occur in cell type attribution in humans (Fig. 2d).
Generation of a healthy human kidney consensus snRNA-seq dataset.
The mitigation of batch effects in the integrated snRNA-seq dataset was not as good as that obtained for the scRNA-seq dataset, but the PC1 and PC2 distribution was more satisfactory after integration using Seurat v4 than with Harmony (Fig. 6a). Hence the Seurat correction was adopted to pursue the analysis. When nuclei are displayed according to the sample of origin or the batch the sample comes from (i.e. the publication), it is clear that sample GSM3135714 from batch GSE114156 is not well integrated into the dataset (Fig. 6b,c). As there are only 7 samples, and only some of the nuclei from this sample do not mix with the nuclei from the other samples, we chose to keep the nuclei from this sample in the analysis and to exclude only the non-mixed ones after clustering. In contrast to the scRNA-seq dataset, the sex was known for the 7 snRNA-seq samples, which made it possible to appreciate differences in sex representation within each identified population, in particular for the principal cells of the collecting duct (PC.CD) and the cells from both the ascending thin limb (LoH.ATL) and the thick ascending limb of the loop of Henle (LoH.TAL; Fig. 6d). Unfortunately, with only 5 men and 2 women, we could not assess whether these differences were due to a real sex bias rather than inter-individual differences or some remaining batch effects. Moreover, this potential sex bias differed from what was shown in mice, where the authors observed discrepancies in the PTC populations while comparing 2 males to 2 females 36 .
HVG were computed for every cell type, and these gene lists correspond to the consensus transcriptomic cell type signatures of kidney nuclei from healthy adult individuals (Fig. 8c and Supp. Table 2).

Joint analysis of scRNA-seq and snRNA-seq labeled datasets.
To study the similarities and discrepancies between the results obtained with the two procedures, scRNA-seq and snRNA-seq samples were integrated together. The mitigation of the batch effects for the integration of 39 samples was overall acceptable, as attested by the correction of PC1 and PC2 (Fig. 9). Again, the Harmony correction was not as satisfactory as the Seurat v4 one. Samples looked well merged, but nuclei and cells did not colocalize everywhere (Fig. 10a,b). The allocated cell types were highly consistent between cells and nuclei (Fig. 10c). Of note, we cannot rule out that selecting viable cells on the basis of mitochondrial gene expression influences this observation, since we cannot filter nuclei on the same basis. However, the fact that, overall, cell types were allocated at the same coordinates in cells and nuclei may give further confidence in the identified cell populations in both scRNA-seq and snRNA-seq datasets. In light of these results, we would not recommend integrating scRNA-seq and snRNA-seq datasets before cell types have been allocated to cells and nuclei. Overall, these results demonstrated that snRNA-seq and scRNA-seq consensus signatures should be used to enrich for cell types within snRNA-seq and scRNA-seq datasets, respectively.

Validation of the enrichment of consensus signatures for automatic cell type annotation.
To test whether enrichment of consensus signatures is suitable for the identification of cell types within scRNA-seq and snRNA-seq datasets, we downloaded publicly available annotated datasets from Kuppe et al. 55,56 (scRNA-seq) and Lake et al. 9,57 (snRNA-seq). Sample expression matrices were processed and integrated as previously described. Then CelliD v1.0 58 was used to perform enrichment analysis for scRNA-seq or snRNA-seq consensus signatures in every single cell or nucleus, respectively, and the results were compared to the original labelling of the authors. To better evaluate automatic cell type annotation on test datasets, original labels were adapted to match the nomenclature of the consensus signatures (Table 5).
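The comparison between original and signature-based labels can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the label map is an illustrative excerpt written for this example (the full mapping is in Table 5), and the function names `harmonize`, `agreement` and `confusion` are hypothetical.

```python
from collections import Counter

# Illustrative excerpt of an original-to-consensus label map.
# The complete mapping is given in Table 5; these entries are assumptions.
LABEL_MAP = {
    "Proximal Tubule": "PTC",
    "S1": "PTC",
    "Thick Ascending Limb": "LoH.TAL",
}

def harmonize(labels, mapping):
    """Rename original labels to the consensus nomenclature (unknown labels pass through)."""
    return [mapping.get(label, label) for label in labels]

def agreement(reference, predicted):
    """Fraction of cells whose predicted label equals the reference label."""
    if len(reference) != len(predicted):
        raise ValueError("label vectors must have the same length")
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)

def confusion(reference, predicted):
    """Count (reference, predicted) label pairs to see where annotations differ."""
    return Counter(zip(reference, predicted))

# Toy example: four cells with original author labels and automatic annotations.
original = ["Proximal Tubule", "S1", "Thick Ascending Limb", "Thick Ascending Limb"]
automatic = ["PTC", "PTC", "LoH.TAL", "DCT"]

reference = harmonize(original, LABEL_MAP)
score = agreement(reference, automatic)
pairs = confusion(reference, automatic)
```

Inspecting the off-diagonal entries of `pairs` shows which populations are relabelled by the enrichment, which is more informative than the overall agreement alone.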
After filtering out poor quality cells and cell doublets (fewer than 200 or more than 3500 expressed genes, or more than 30% of mitochondrial genes), the scRNA-seq dataset from Kuppe et al. 55,56 consisted of 81,239 cells from 19 samples, representing a total of 13 chronic kidney disease patients (hypertensive nephrosclerosis) 55 (Fig. 11 and Fig. 12a). Enrichment of consensus scRNA-seq signatures was performed following Multiple Correspondence Analysis (MCA), and UMAP was computed on the residues of the MCA using the RunMCUMAP() function implemented in CelliD 58 . To avoid annotating cells with the uninformative « na » label, signatures for « na » annotated cell types were not tested. Enrichment retrieved cell labels closely related to the original labels (Fig. 12b,c). Some differences were observed: in particular, the non-attributed endothelial cells were recognized as B cells, a population of macrophages was recognized as dendritic cells, and the cells of the thick ascending limb of the loop of Henle were labelled as distal tubule cells. As only cell types belonging to the list of consensus signatures may be attributed, we did not find any Schwann cell, urothelial cell, monocyte or mast cell (dendritic cells instead), myofibroblast or pericyte (vascular smooth muscle cells instead), or macula densa cell (thin ascending limb of the loop of Henle instead) (Fig. 12c). Overall, automatic cell type annotation using scRNA-seq consensus signatures closely matched the original labels from Kuppe et al. 55 , demonstrating its suitability and reliability to help in cell type allocation (Fig. 12b,c).
Concerning the validation of our identified snRNA-seq signatures, the Lake et al. 9,57 snRNA-seq dataset was processed as previously described to remove poor quality nuclei, and finally consisted of 17,375 nuclei from 43 samples that belonged to 16 individuals, including 14 tumor-free regions of nephrectomies and 2 deceased donor kidneys 9 (Fig. 13). The nomenclature of the original labels was modified, as previously described for scRNA-seq, to match the nomenclature of the consensus cell type signatures (Fig. 14a, Table 5). Again, enrichment of snRNA-seq consensus signatures was done after computation of MCA and UMAP, and signatures for « na » annotated cell types were not included for enrichment. The annotations were overall conserved between the original labelling and the consensus signature-based labelling (Fig. 14b,c). However, a subpopulation of proximal tubule cells was enriched for the descending thin limb of the loop of Henle (LoH.DTL) and some parietal epithelial cells (PEC) in the automatic annotation (Fig. 14b). In addition, cells originally labelled as LoH.DTL and some cells labelled as ascending thin limb of the loop of Henle (LoH.ATL) from samples NK37, NK38, NK45 and NK46 were still unassigned after consensus signature enrichment (meaning that no cell type was enriched with an FDR < 0.01). This sizeable unassigned population, which belonged to 4 of the 43 samples, may be considered « non-conventional » cells (although it may be due to remaining batch effects, as the samples were collected and conserved differently). In an original study, such nuclei would benefit from an in-depth analysis, since they could belong to non-tested cell types or non-steady cell states.
In conclusion, enrichment of consensus signatures allows the automatic and reliable annotation of kidney cell types in scRNA-seq and snRNA-seq datasets, and may be used to help in the decision of cell type allocation.
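The logic of signature enrichment with an FDR cutoff can be sketched as below. This is a simplified stand-in written in plain Python, not CelliD itself: it assumes each cell is summarized by a set of its top-ranked genes, tests overlap with each signature using a hypergeometric tail probability, corrects across signatures with Benjamini-Hochberg, and returns « unassigned » when nothing passes FDR < 0.01. The gene names, signature contents and background size are toy values.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K belong to the signature."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def annotate(cell_genes, signatures, n_background, fdr_cutoff=0.01):
    """Return the most significantly enriched signature name, or 'unassigned'."""
    names = list(signatures)
    pvals = [
        hypergeom_sf(len(cell_genes & signatures[name]), n_background,
                     len(signatures[name]), len(cell_genes))
        for name in names
    ]
    fdr = benjamini_hochberg(pvals)
    best = min(range(len(names)), key=lambda i: fdr[i])
    return names[best] if fdr[best] < fdr_cutoff else "unassigned"

# Toy signatures and cells (synthetic gene identifiers, 2,000 background genes).
toy_signatures = {
    "PTC": {f"g{i}" for i in range(30)},
    "LoH.TAL": {f"g{i}" for i in range(100, 130)},
}
ptc_like = {f"g{i}" for i in range(25)} | {f"g{i}" for i in range(500, 525)}
unrelated = {f"g{i}" for i in range(1000, 1050)}

label_a = annotate(ptc_like, toy_signatures, n_background=2000)
label_b = annotate(unrelated, toy_signatures, n_background=2000)
```

Because only tested signatures can be returned, a cell from an untested population ends up either « unassigned » or labelled with its closest tested type, which is why the enrichment is best treated as an aid to manual annotation.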

Discussion
Single-cell RNA-seq and snRNA-seq are increasingly used within the kidney field. While major kidney cell types are generally retrieved, cell type identification is inconsistent between studies and therefore lacks reproducibility. It is to be expected that every batch of samples, or every single sample, would not yield the exact same transcriptomic signatures. This is mainly due to the small sample size of most human single-cell studies, constrained by the cost of the technology and the limited availability of healthy human samples, which is in general not sufficient to generalize the conclusions to the overall population. In addition, scRNA-seq and snRNA-seq samples still suffer from a very low sequencing depth that may give rise to false positives or false negatives within the differentially expressed genes computed between two populations (not to mention that these differentially expressed genes are calculated on the basis of the raw data, not the matrix normalized by the integration). Mapping reads to different versions of the reference genome, as well as the pre-processing of the data, are also critical factors contributing to batch effects between studies. This results in limited reproducibility and reliability between the different studies involving human kidney scRNA-seq and snRNA-seq. In this meta-analysis, we performed integration of 32 scRNA-seq and 7 snRNA-seq samples, from 3 and 4 different studies respectively [26][27][28][29][30][31][32] . After assessing the mitigation of batch effects, we ran high resolution unsupervised clustering and allocated cell types to clusters, based on the expression of known markers, before computing consensus cell type signatures. Although scRNA-seq and snRNA-seq samples did not distribute equally everywhere on the UMAP, cell type allocation was highly consistent across the two datasets. Finally, we showed that enrichment of consensus signatures achieved cell type allocation consistent with previously annotated datasets 56,57 . These consensus signatures may thus help increase reproducibility and reliability between future studies involving scRNA-seq or snRNA-seq in the kidney field.
Our present study tried to standardize cell type nomenclature by way of meta-analysis. Even though the proximal nephron is functionally and anatomically divided into three segments (S1 to S3), our study could not discriminate these segments, and all proximal tubule data are lumped into one proximal category. The same is true for the three descending thin limb segments and for the distal convoluted tubule, which is not resolved into DCT1 and DCT2. One plausible explanation is that these subpopulations are part of the unattributed populations, i.e. PTC.na and LoH.na. It is also possible that we did not find these subpopulations because of the sparsity of the data, especially with such heterogeneity in the data sources. The datasets used in this study are the first published in the field; they were pre-processed with different tools and suffer from strong batch effects that prevent the detection of many markers. However, the signatures that we propose still allowed cell types to be identified in the validation step.
Single-cell and single-nucleus transcriptomics make it possible to study the expression of every detected gene (i.e. read count) in every single cell, or every single nucleus, within a suspension of cells. The advantages of snRNA-seq over scRNA-seq have been studied in healthy and fibrotic mouse kidney tissue, and include the following: less dissociation bias, fewer stressed or dead cells, and the possibility to use frozen tissue pieces, which may allow more and better quality biological samples to be processed 11 . Furthermore, scRNA-seq and snRNA-seq samples may present transcriptomes that are too different in certain cell populations to allow a good detection of every cell type in both kinds of experiment. This is not surprising since the set of RNA expressed in the nucleus differs from the set of RNA expressed in an entire cell. In other words, scRNA-seq samples contain nuclear, mitochondrial and cytosolic RNA, while snRNA-seq samples only contain nuclear and cytosolic RNA. Therefore, cell type assignment depending on the expression of canonical marker genes, which is the standard in single cell/nucleus transcriptomics analyses, may differ between scRNA-seq and snRNA-seq. Our analysis confirmed these assumptions and as a consequence provides two sets of cell type signatures, obtained by scRNA-seq or snRNA-seq strategies. In addition, we were not able to find immune cells within snRNA-seq datasets except a few T cells, which is consistent with previous reports attesting that snRNA-seq in the kidney generally fails to detect immune cells 9,11,59 . Interestingly, we detected a population of parietal epithelial cells in both scRNA-seq and snRNA-seq datasets that express pluripotent cell, tubular epithelium and podocyte markers (e.g. SOD2, KRT8, KRT18, WT1, CD24, PAX2, SOX4, VIM, RACK1, NUPR1; Supp. Table 2) and may actually correspond to previously described parietal epithelial stem cells 60 . These cells look very different from the other clusters, express self-renewal markers (e.g. CD24, PAX2) and match the PAX2+ CD24+ population previously observed in the glomerulus 60 . By contrast, we failed to detect any CD133+ mesenchymal stem cell-like population.
To introduce biological heterogeneity and mitigate technical variability, we encourage authors who use previously published healthy kidney datasets as control datasets to integrate several samples from different studies instead of using the samples from a single study. For those who add new healthy human kidney samples to their single-cell or single-nucleus studies, we would advise comparing the cell type signatures from control cells with the consensus signatures we provide, and assigning cell types in their dataset using enrichment of consensus signatures (e.g. CelliD 58 ).
However, this approach is biased in the sense that cell type enrichment depends on the tested cell types: if a cell type is not tested, it cannot be attributed to cells/nuclei, even if it should be. Thus, one of the main limitations of this method is that every cell or nucleus will be attributed a cell type from the tested list: the enriched cell type with the lowest p-value will be attributed, which can be misleading (if there is no enriched cell type, then the cell/nucleus is labelled « unassigned »). This further means that the consensus signatures we provide only define the cell types identified in the current meta-analysis. Therefore, depending on the settings, it could make sense to use only certain consensus signatures, for instance if the studied cell or nucleus populations have been purified by FACS prior to the transcriptomics. For the same reason, such cell type enrichment should be used as a decision helper rather than a decision maker in cell type attribution to cells/nuclei. However, a more unbiased approach is possible for original studies, based on unsupervised clustering followed by extraction of the cell-specific signatures using CelliD, and finally enrichment of functional terms or pathways of these signatures. A more general limitation of such single-cell studies is the statistical power for the computation of HVG (Wilcoxon Rank Sum test), which depends on the number of cells allocated to every single cell type. Indeed, the statistical power is higher for the computation of the PTC signature (computed on 29,246 PTC cells vs 38,782 cells within the rest of the dataset) than for that of DCT cells (computed on 248 DCT cells vs 67,780 cells) in the scRNA-seq dataset, for instance.
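The effect of group size on the Wilcoxon Rank Sum test can be made concrete with the cell counts quoted above. The sketch below is an illustration under textbook assumptions, not part of the published analysis: it uses the standard null variance of the Mann-Whitney U statistic without ties, n1 n2 (n1 + n2 + 1) / 12, rescaled to the probability (AUC) scale, so a smaller value means a smaller shift in expression can be detected. The function name `auc_null_sd` is hypothetical.

```python
from math import sqrt

def auc_null_sd(n_in, n_out):
    """Null standard deviation of U / (n_in * n_out) for the rank-sum test (no ties)."""
    return sqrt((n_in + n_out + 1) / (12 * n_in * n_out))

# Cell counts reported in the text for the scRNA-seq dataset.
ptc_sd = auc_null_sd(29_246, 38_782)   # PTC cells vs the rest of the dataset
dct_sd = auc_null_sd(248, 67_780)      # DCT cells vs the rest of the dataset

# How much noisier the DCT comparison is relative to the PTC comparison.
noise_ratio = dct_sd / ptc_sd
print(f"PTC: {ptc_sd:.5f}  DCT: {dct_sd:.5f}  ratio: {noise_ratio:.2f}")
```

Under these assumptions the DCT comparison is several times noisier than the PTC one, so marker genes with modest effect sizes are more likely to be missed for rare populations.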
In the future, these signatures may benefit from being updated by integrating newly published healthy human kidney single-cell datasets, which may increase the biological variability and the number of cells for every population while mitigating the batch effects even better. In addition, the very low sequencing depth of these experiments implies that the results should be interpreted with caution. To address this issue in cell type identification when working specifically with kidney tissue, and to validate the identified cell type-specific signatures, a bulk transcriptomic analysis of micro-dissected healthy human nephron segments would be very helpful, as has been performed in rodents 6 . Nevertheless, cell type allocation by enrichment of consensus signatures may depend on the size of the signatures, i.e. the size of the gene lists, which ranges from 27 (scRNA-seq signatures, LoH.TAL) to 311 (snRNA-seq, EC.art) genes in the present meta-analysis. Thus, we recommend performing such enrichment with both the complete signatures and truncated signatures that are close in size.
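Truncating signatures to a comparable size could look like the sketch below. It assumes each signature is stored as a list ordered from the strongest to the weakest marker, which the text does not specify, and the function name `truncate_signatures` and the gene identifiers are hypothetical.

```python
def truncate_signatures(signatures, size=None):
    """Keep the first `size` genes of every signature.

    `signatures` maps a cell type name to a list of genes ranked from the
    strongest to the weakest marker (assumed ordering). When `size` is None,
    the length of the shortest signature is used.
    """
    if size is None:
        size = min(len(genes) for genes in signatures.values())
    return {name: genes[:size] for name, genes in signatures.items()}

# Toy signatures with the smallest and largest sizes quoted in the text.
toy = {
    "LoH.TAL": [f"tal_gene_{i}" for i in range(27)],
    "EC.art": [f"ec_gene_{i}" for i in range(311)],
}

truncated = truncate_signatures(toy)
sizes = {name: len(genes) for name, genes in truncated.items()}
```

Running the enrichment on both `toy` and `truncated` and comparing the resulting labels gives a simple check of whether signature length is driving the assignment.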
Studies involving scRNA-seq and snRNA-seq technologies in the kidney are barely comparable, because of a lack of standardized workflow (technically and analytically) and a diversity in the references used for cell type recognition. In this meta-analysis, 32 scRNA-seq samples from 3 studies, and 7 snRNA-seq samples from 4 studies, were integrated and analysed. This resulted in the computation of 30 consensus cell type signatures for kidney cell types. Future studies in the field may benefit from the use of these signatures to automatically allocate cell types to cells/nuclei.

Methods
Data acquisition. Single-cell RNA-seq and snRNA-seq datasets generated from healthy adult kidney samples were downloaded from the Gene Expression Omnibus database (GEO; https://www.ncbi.nlm.nih.gov/geo/) as count matrices [26][27][28][29][30][31][32] . The collection consists of 7 snRNA-seq samples from 4 independent studies (GEO Accession ID: GSE114156, GSE118184, GSE131882, GSE151302) and 32 scRNA-seq samples from 3 independent studies (GEO Accession ID: GSE131685, GSE140989, GSE159115) [26][27][28][29][30][31][32] . The clinical and technical information regarding the samples gathered from these studies is provided in Tables 1 and 3. Expression matrices of scRNA-seq samples GSM4819730 and GSM4819731 from batch GSE159115 were merged prior to the analysis since they belong to the same individual, as were snRNA-seq samples GSM3320197 and GSM3320198 from batch GSE118184 21,23 . Data downloaded from GEO were already pre-processed for each dataset, in different ways across the different studies involved (Table 3). This heterogeneous pre-processing of the samples may bias the analysis. However, since our goal is to provide widely usable consensus cell type signatures, this technical variation is important to retain.
Quality control and filtering out of poor quality cells and nuclei. We used R software v4.1.0 (https://www.r-project.org/) and the Seurat v4.0.5 package 61 (https://satijalab.org/seurat/) to perform the analysis. As observed in previous studies, human kidney scRNA-seq datasets generally present with high mitochondrial gene counts, which may be attributed to the processing time of human kidney samples as well as the processing itself. Moreover, kidney tissue notoriously contains a lot of mitochondria, consistent with the high levels of energy needed for a proper filtration process. Therefore the standard filtering out of cells with >5% mitochondrial gene expression was not suitable for the processing of these scRNA-seq data.
Cells with <200 or >3500 expressed genes (cell debris and doublets), or >30% mitochondrial gene expression, were filtered out, whereas nuclei with <200 or >3500 expressed genes, or >5% mitochondrial gene expression, were filtered out. In total, 68,028 high quality cells and 33,412 high quality nuclei were obtained after applying these thresholds. Table 2 presents quality control metrics of every sample (i.e. number of cells/nuclei, mean number of reads per cell/nucleus, mean number of features expressed per cell/nucleus, % mitochondrial genes, % ribosomal genes), prior to and after filtering. Data were normalized and scaled (regressing out % mitochondrial genes), and highly variable genes were computed using the SCTransform() function 35 (Seurat v4) for every scRNA-seq and snRNA-seq sample. The identified HVG were then used to compute PCA for every sample. SCTransform is a newly implemented statistical method in Seurat v4, taken from the sctransform R package (https://github.com/satijalab/sctransform), that aims to better resolve the technical variability and sequencing depth differences between cells/nuclei across datasets 35 . It is particularly useful when working with datasets obtained from different sources, which introduce substantial variability.

Table 5 (continued). C36: Proximal Tubule, PTC; C37: S1, PTC; C38: S1/2 1, PTC; C39: S1/2 2, PTC; C40: S1/2 3, PTC.
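The gene-count and mitochondrial thresholds described in the quality control paragraph can be expressed as a small filter. This is a plain Python sketch for illustration, not the Seurat code used in the study; the `Barcode` record and its field names are hypothetical, and the handling of values exactly at a threshold is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Barcode:
    """Per-cell or per-nucleus QC summary (hypothetical structure)."""
    barcode: str
    n_genes: int      # number of genes with at least one count
    pct_mito: float   # percentage of counts from mitochondrial genes

# Thresholds stated in the text.
MIN_GENES = 200
MAX_GENES = 3500
MAX_PCT_MITO = {"cell": 30.0, "nucleus": 5.0}

def passes_qc(b: Barcode, kind: str) -> bool:
    """Keep barcodes with 200-3500 expressed genes and mitochondrial % within the cap.

    Values exactly at a threshold are kept here; that boundary choice is an assumption.
    """
    return MIN_GENES <= b.n_genes <= MAX_GENES and b.pct_mito <= MAX_PCT_MITO[kind]

# Toy barcodes: debris, a good cell, a high-mito cell, a likely doublet, a clean nucleus.
toy = [
    Barcode("a", 150, 2.0),
    Barcode("b", 1200, 12.0),
    Barcode("c", 1200, 45.0),
    Barcode("d", 5000, 3.0),
    Barcode("e", 1000, 1.0),
]

kept_as_cells = [b.barcode for b in toy if passes_qc(b, "cell")]
kept_as_nuclei = [b.barcode for b in toy if passes_qc(b, "nucleus")]
```

The same barcode can pass the cell filter and fail the nucleus filter, as barcode "b" does here, which reflects the stricter mitochondrial cap applied to nuclei.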
Integration and dimensional reduction. Single-cell and single-nucleus samples always depend on confounding variables and may thus present differences that are called batch effects. To allow any comparison between samples, batch effects need to be mitigated as much as possible, which is done by the integration process (i.e. normalization step). Because further computations depend on this process, the quality of the integration deserves to be evaluated. Thus, two integration approaches were considered: the Seurat v4 method that outputs a corrected expression matrix for a list of genes to consider, and the Harmony v0.1.0 method that directly corrects the residues of the PCA for each sample.
Integration of 32 scRNA-seq samples on the one hand, and 7 snRNA-seq samples on the other, was achieved by consecutively running the PrepSCTIntegration(), FindIntegrationAnchors() and IntegrateData() functions from Seurat, with 2,500 integration features. Then PCA was computed and the first 30 PCs were used as input for uniform manifold approximation and projection (UMAP) of the integrated scRNA-seq and snRNA-seq datasets. Harmony was run as well, and UMAP was computed on the first 30 corrected PCs of both datasets. The distribution of the cells or nuclei from the different samples was compared between the two methods. Of note, Seurat and Harmony are among the best-performing batch effect correction methods to date 17,18 .
Clustering and cell type annotation. High resolution clustering is important in such a meta-analysis: since there are still notable batch effects, small batch-dependent clusters may be identified. In addition, more clusters may resolve more closely related cell types, and thus yield more consensus cell type signatures. Unsupervised clustering was performed using the FindClusters() function with the Louvain algorithm in both datasets (resolution = 3.4 and 3.0 in scRNA-seq and snRNA-seq datasets, respectively). The distribution of samples across clusters was studied using the pheatmap v1.0.12 R package. The cells were then labelled according to the expression of specific markers (Table 4). To match the nomenclature adopted for consensus cell type signatures, original labels from Kuppe C, et al. and Lake BB, et al. were changed (Table 5).
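Checking how samples distribute across clusters, which the study visualised as a heatmap, amounts to a cluster-by-sample contingency table. The sketch below shows one way to compute it and to flag clusters dominated by a single sample; the function names and the 0.9 dominance threshold are hypothetical choices for illustration, not values from the study.

```python
from collections import Counter, defaultdict

def sample_fractions_by_cluster(cluster_ids, sample_ids):
    """For each cluster, return the fraction of its cells coming from each sample."""
    counts = defaultdict(Counter)
    for cluster, sample in zip(cluster_ids, sample_ids):
        counts[cluster][sample] += 1
    return {
        cluster: {s: n / sum(ctr.values()) for s, n in ctr.items()}
        for cluster, ctr in counts.items()
    }

def batch_dominated(fractions, threshold=0.9):
    """Clusters where one sample contributes at least `threshold` of the cells."""
    return sorted(c for c, fr in fractions.items() if max(fr.values()) >= threshold)

# Toy assignment: cluster 0 mixes three samples, cluster 1 comes from one sample only.
clusters = [0, 0, 0, 0, 1, 1, 1, 1, 1]
samples = ["s1", "s2", "s3", "s1", "s3", "s3", "s3", "s3", "s3"]

fractions = sample_fractions_by_cluster(clusters, samples)
flagged = batch_dominated(fractions)
```

A flagged cluster is not necessarily an artefact, since a rare cell type may be captured in one sample only, so such clusters call for inspection of marker expression before any exclusion.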

Data availability
The single-cell and single-nucleus datasets generated in the study have been deposited on Figshare 62,63 . These files contain 4 assay slots (raw counts matrix, sample-dependent SCT-transformed values, post-integration SCT-corrected values, and the secondary integration SCT-corrected values) and some meta-data slots, including the dataset of origin (GEO sample accession number), the batch of origin (GEO series accession number), the method used (scRNA-seq vs snRNA-seq), the clusters, and the cell type labelling. The Figshare repository also contains supplementary Tables 1 and 2 64,65 .

Table 5. Nomenclature for test datasets.