Single-cell methylation analysis of brain tissue prioritizes mutations that alter transcription

Summary Relating genetic variants to behavior remains a fundamental challenge. To assess the utility of DNA methylation marks in discovering causative variants, we examined their relationship to genetic variation by generating single-nucleus methylomes from the hippocampus of eight inbred mouse strains. At CpG sequence densities under 40 CpG/Kb, cells compensate for loss of methylated sites by methylating additional sites to maintain methylation levels. At higher CpG sequence densities, the exact location of a methylated site becomes more important, suggesting that variants affecting methylation will have a greater effect when occurring in higher CpG densities than in lower. We found this to be true for a variant’s effect on transcript abundance, indicating that candidate variants can be prioritized based on CpG sequence density. Our findings imply that DNA methylation influences the likelihood that mutations occur at specific sites in the genome, supporting the view that the distribution of mutations is not random.

One impediment to progress is that due to the relatively large intervals into which quantitative trait loci are mapped, where there are usually thousands, and often tens of thousands, of candidate causal variants.How are these variants to be prioritized for functional study, and is there another level in addition to genetic variation that determines their function?The majority of causative variants lie in regulatory regions of the genome and likely act by altering a molecular phenotype, [18][19][20] such as DNA methylation, which is associated with neuronal function and behavior. 21NA methylation occurs at regulatory elements in the genome, affecting transcription factor binding affinity and controlling gene transcription, 22,23 roles that suggest it may play a role in mediating the effect of sequence variants that alter behavior.While both CpG and CH methylation (mCG and mCH, where H = A, C, or T) show cell-type specificity, [22][23][24] only methylation at CpG dinucleotides propagates through cell division, providing stable marks that differentiate cell types. 16Methylation's cell-type specificity in the brain means that mutations, as much as they act through methylation, will have different consequences depending on the cell type that they affect.Hence single-cell data from multiple individuals are needed to understand the relationship between mutation, methylation state, and behavioral outcome.However, to use CpG methylation marks to discover causative variants requires an understanding of the relationship between sequence and methylation variation. 25hat happens when a mutation removes a methylated CpG site?Broadly speaking, there are three possible consequences.The simplest, at least for interpreting genetic association studies, is that the mutation disrupts a sequence motif, with consequences for whatever function that motif performs.8][29] Knowing the location of the mutation at a methylated site within a known motif or regulatory region will suggest candidate proteins.Second, the function of methylation may depend on the density of methylated cytosines in a region.For instance, CpG density is associated with active histone marks and high expression, [30][31][32] and CpGs contribute to transcriptional activity regardless of whether they are part of a sequence motif. 33In this case, the consequences of a mutation will depend on local context (e.g., how many other cytosines are methylated), rather than on the specific sequence.Third, there is the possibility that methylation function and DNA sequence are independent.][36] In this paper, we examine the relationship between methylation and sequence variation using data from eight inbred mouse strains.Existing methylomes are mostly derived from one strain (C57BL/6J, B6) 16,17,[37][38][39] or are from array-based assays that do not interrogate all methylation sites at a single-cell level in the brain. 40We generated genome-wide, base-resolution maps of multiple cell types from the hippocampus in each of the eight strains.We chose CAST/EiJ, a fully sequenced representative of M. m. castaneus, as an outgroup and compared methylation and sequence variation to those of five classical laboratory strains (A/J, C57BL/6J, BALB/cJ, FVB/J, and DBA/2J) (all M. m. domesticus) and two wild-derived inbred strains: WSB/EiJ (M. m. domesticus) and PWK/PhJ (M. m. musculus).We show that interpretation of the functional consequences of sequence variation, as mediated by methylation, depends on local CpG density.

Cell clusters identified from methylation profiles in eight mouse strains
We generated 13,683 single-nucleus methylation sequence (snmC-seq) profiles from microdissected ventral hippocampus tissue of eight mouse strains (Figure 1).We chose the ventral hippocampus because the cytoarchitecture of the hippocampus is less complex than other parts of the brain, and because many behaviors that are involved in hippocampal function have been mapped. 1111,694 of these profiles passed our quality control (QC) metrics.Strains were sampled in duplicate, and fluorescence-activated nuclei sorting was used to isolate approximately 83% NeuN-positive and 17% NeuN-negative cells to enrich for neurons.
We performed iterative clustering analysis on the mC dataset based on similarity of global CG and CH methylation in 100-kb bins of the mouse genome.Using this approach, we identified a total of 20 distinct cell types within the ventral hippocampus.
We identified cells belonging to every major subregion of the ventral hippocampus and to every major cell-type class (astrocytes, microglia, excitatory neurons, inhibitory neurons) based on the hypo-methylation states of multiple canonical markers of each cell type.Other neuronal types that were not confidently ascribed to a hippocampal subregion were labeled first by whether they express inhibitory or excitatory markers and then by differences in CH hypomethylation of genes.These ambiguous subregion clusters may relate to other known neuronal subtypes within the hippocampus that were previously defined by RNA sequencing.
Sequence coverage varied widely among the 20 cell types we identified: from less than 10X coverage for Exc-CA2, Exc-Fstl4-Grm3, EC, Exc-CA3-Kcnh5, VLMC, Exc-Gfra1, and Exc-GM45686 to more than 50 for Exc-DG (Table S1).Genome coverage for each cell type is given in the supplemental information.We decided to use the Exc-DG hippocampal cell type for subsequent analyses, because of the high coverage and because we were unable to identify any subregion clusters, indicating homogeneity.

Conservation of methylated sites between strains depends on CpG sequence density
1][32][33] To do this, we estimated the density of CpG dinucleotides in windows of 2 Kb (this size was chosen to capture regions of clustered methylation; results for 1 kb and 500 bp were the same) and explored its relationship with the number of methylated sites.
Figure 2A shows results for the Exc-DG hippocampal cell type.The distribution of methylated sites, with respect to CpG sequence density, consists of three fractions.First, there is a highly methylated fraction at densities of less than 25 CpG/Kb.At densities between 25 and 40 CpG/Kb DNA, segments are partly methylated, while DNA in which less than 20% of CpGs are methylated occurs in a fraction of high CpG density (greater than 40 CpG/Kb).The same relationship is observed for other cell types for which we have sequence coverage greater than 10-fold.At lower sequence coverages, the estimates of percentage methylation for most of the CpG density segments are too variable to reveal the relationship (illustrated in Figure S2).
We next explored the relationship between CpG sequence density and mutations in the Exc-DG cell type.Using CAST/EiJ as an outgroup, we calculated the percentage of sequence variation (defined as either the loss or the gain of a cytosine in the comparison between the outgroup and the strain; other nucleotides were excluded) in each 2-kb segment for each strain, doing this separately for methylated and non-methylated CpGs.We summed sequence variants in each 2-kb segment and expressed them as a percentage of the total number of CpG sites in the segment.
Figure 2B shows that the percentage of sequence variants in methylated DNA (blue lines) is higher than in unmethylated DNA (gray lines) and that this relationship depends on the density of CpGs.2][43] Our results show that CpG sequence density is correlated with both the fraction of methylated CpGs and with the distribution of mutations at methylated CpGs, and that this relationship is non-linear.
To what extent is the density of methylated CpG determined by sequence?We turn next to consider what maintains the correlation between two strains in the number of methylated CpG sites in each segment of DNA.There are two possibilities: the amount of CpG methylation can be the same in two different mouse strains because the same sites are methylated or because the total number of sites is equal in the two strains, regardless of which sites are methylated.To test between these alternatives, we calculated two values for each 2-kb segment of the genome.First, we counted the number of methylated CpGs in each strain.We set the larger number as the denominator to derive a ratio of the two numbers, bounded between 0 and 1.If the ratio is 1, then the segment contains the same number of methylated sites.If the ratio is less than one, it means one strain has fewer methylated sites than the other.Second, we derived a measure of the correlation between sites.We estimated the probability that both sites are in the same state (methylated or not) in each segment.If the probability is 1, this means we can fully predict which sites are methylated in one strain from knowing the state in the other.Probabilities less than 1 mean that knowing the methylation state of sites in one strain is less predictive of their state in the second strain.
Figure 3 shows that for low CpG sequence densities, the ratio is close to 1 for each strain comparison, while the probability of sites being in the same state is about 0.8.In other words, the methylation density is maintained, even though not all homologous sites are methylated in both strains.The pattern changes as densities increase above 40 CpG/Kb, until at densities above 80 CpG/Kb it reverses, with the probability becoming higher than the ratio, though the small sample sizes of hypomethylated CpGs introduce a large variance.Thus, constancy is maintained in two ways: at low density (less than 40 CpG/Kb) different sites may be methylated, while at higher density, it is more likely that identical sites in each strain are methylated.

The impact of mutations at methylated CpG sites depends on local CpG sequence density
Our findings suggest that a mutation removing a methylated CpG site may have different phenotypic consequences when it occurs in a region of low compared to high CpG sequence density.Methylation in general is a repressive mark, 25 so we expect that most CpG mutations that disrupt methylation will increase gene expression (by relieving suppression).Consequently, sequence variants disrupting CpG methylated sites in low CpG sequence density regions would be less predictive of transcript abundance than mutations in high CpG sequence density regions.We tested the hypothesis by examining the association between mutations and transcript abundance.Using singlecell RNA data, we compared two strains, B6 and DBA/2J, for the Exc-DG cell type.
We generated single-nuclei transcriptomes for a total of 30,880 nuclei (median UMI counts per cell: 5,816) from two biological replicates that passed our doublet and low-quality filtering steps (STAR Methods).We carried out multi-modal clustering and visualization using uniform manifold approximation and projection (UMAP) to embed transcriptomic and methylation data (Figure 4A), noting that there were no differences by strain (Figure 4B), and that the same major cell-type groups from the methylation dataset were represented, based on the same marker gene set used to annotate the methylation data (Figure 4C).Major cell-type clusters within snRNA and snMethylation were highly consistent between modalities (Figure 4C), except for the small non-neuronal clusters MGC/OPC/VLMC (STAR Methods), which were excluded from further analysis.
We ran DESeq2 44 to identify transcripts that were differentially expressed and, using a liberal threshold of p <0.05 (unadjusted), identified 2,049 transcripts for downstream analysis (we also examined the effect of including transcripts at more conservative thresholds, and we give results in the supplemental information).For mutations, we took all single-nucleotide sites in the mouse genome that altered a methylated CpG present in the strain D2 to either ''A'' or ''T'' (to ensure the mutation results in a site that cannot be methylated, regardless of the strand on which it occurs).That yielded 23,959 mutations, of which 1,862 lie within regions with more than 40 CpGs/Kb, categorized as ''high'' density.We matched the position of each methylated CpG site to a gene interval, defined as running from 2 kb upstream of the transcriptional start site to 2 kb downstream of the 3 0 end of the gene.
When mutations occur in high CpG sequence density regions, the mean ratio of D2 to B6 transcripts is significantly higher than it is in low CpG sequence density regions (mean = 0.15 vs. mean = 0.02; t = 4.3, degrees of freedom [df] = 1,924.6,p value = 1.9eÀ05), supporting the hypothesis that mutations in high CpG sequence density regions are associated with a relatively larger effect.It is possible that some of the effects we are seeing are attributable to linked mutations that happen to be enriched in high-density CpG regions of the genome, but most of the other classes of mutations (such as insertions and deletions) will decrease the amount of transcript (rather than increase it, as most methylation mutations are expected to do).We can exclude some of these mutations indirectly, because their mean effect on gene expression is known to be relatively large. 13e divided the sample into those transcripts with a fold change greater than two and those less than two.The dataset with smaller effects contains 1,555 genes including 19,397 mutations, of which 1,450 are situated in high CpG sequence density regions.
Figure 5A shows that the D2:B6 ratio of transcript abundance is about 20-fold larger when mutations occur in regions of high CpG sequence density than in low density (mean ratio in high density: 0.19, mean ratio in low density: 0.01, t = 5.55, df = 1,662.6,p value = 3.3eÀ08).To examine whether CpG sequence density itself contributes to the increase in the D2:B6 ratio, Figure 5B plots D2:B6 ratios in high and low CpG sequence density for CpG sites without mutations.
There are higher D2:B6 ratios in the CpG high sequence density regions than in the low, and because there are so many more sites without mutations, the effect is highly significant (p = 4.11eÀ18), but the increase is much smaller than that found in regions with mutations (mean ratio in high: 0.034, mean ratio in low: À0.001).D2:B6 ratios in high CpG sequence density sites containing mutations are significantly higher than ratios in high CpG sequence density sites without mutations (t = 4.93, df = 1,493.1,p value = 9.18eÀ07), again consistent with the hypothesis that the effect of mutations in high CpG sequence density will be larger than in low CpG sequence density sites.
What is the effect of mutation on the transcript abundance ratio in the high-and low-density regions?We addressed this question using a linear model.In high-density regions, mutations have a highly significant positive effect (beta = 0.16, t = 5.2, p = 2.39eÀ07), while density makes no significant contribution (beta = À0.0002,t = À1.4,p = 0.16).The situation is reversed The horizontal axis represents the probability that both sites in two strains are in the same state (methylated or not) in each 2-kb segment (black lines) and also the ratio of the total number of methylated sites in two strains, again for each 2-kb segment (blue lines).Each line is a separate strain comparison, which is the same as that shown in Figure 1B.The horizontal axis shows the density of CpG sites (regardless of methylation) per kilobase of genomic DNA. in low CpG density regions: mutations make no detectable contribution (beta = 0.007, t = 0.911, p = 0.362), while the effect of CpG density, though small, is highly significant (beta = 0.006, t = 61.3,p < 2.2eÀ16).These results indicate that the effect of mutations altering methylation at CpG sites is undetectable when they occur in regions of low CpG density.
Might CpG sequence density be a surrogate for other confounds that explain the predictive power of mutation in the highdensity regions?For example, it could be that high-density CpG sequences are enriched with functional elements that promote expression, and the mutations are linked to such elements.To exclude the effect of such confounds, we used a set of ''universal chromatin state'' annotations of the mouse genome based on over 900 datasets from various cell and tissue types. 45We found a significant increase in the fit of a model that included an interaction between mutation and density (i.e., allowing the effect of mutation to vary according to CpG density), compared to a model that only included mutations, chromatin state, and density (high vs. low) to predict D2:B6 ratios (F = 28.7,p = 8.41eÀ08, and Table S4).Finally, we addressed the issue of whether p values were well calibrated for this model (and others) by randomly sampling the ratio and repeating the analyses 10,000 times (we did this by permuting expression levels among genes, STAR Methods).
What of the large effect changes?To test our expectation that the large fold changes in expression are likely not associated with changes in repressive methylation, we examined genomic regions associated with transcripts in which DEseq2 reports a change of more than 2-fold difference.We identified 216,168 po-sitions associated with genes, including 4,693 mutations.Their effect is large but decisively negative: À0.23, p = 2.98eÀ23.This result justifies our exclusion of the larger effect loci from the analysis of the impact of mutations on methylated CpGs.
Our analyses make assumptions about the threshold for distinguishing high and low CpG sequence density regions and in the division of effects by fold changes in transcript abundance.We carried out analyses to test these assumptions and found that our main conclusions hold, namely the impact of mutations at methylated CpG sites depends on local CpG sequence density (STAR Methods, Figures S2, S3, S4, and S5 and Tables S3 and S4).We were also able to see the same effect in six other cell types for which we had sufficient data; for cell types with low coverage there were too few mutations lying within regions whose methylation state we could confidently call (Tables S5 and S6 and Figure S6).

CpG mutations are enriched in enhancers
How might the CpG mutations in high-density regions be acting to alter RNA expression?Are there differences in the way mutations in low-and high-density regions operate?To address this, we examined the relationship between mutations and annotated functional elements.Since functional elements differ between cell types, we first generated a set of single-nucleus assay for transposase-accessible chromatin (ATAC) sequence data from the two strains (B6 and D2).
We generated a single-nucleus ATAC-seq dataset for a total of 27,206 nuclei (median read-pairs per cell: 29,806) from two biological replicates after doublet removal and QC (STAR Methods).We then carried out integration and linking of our snATAC-seq dataset with our snRNA-seq dataset to obtain cell-type annotations Finally, we confirmed through joint embedding that major cell-type clusters within snATAC and snMethylation were highly consistent between modalities and strain (Figures 4A, 4B, and 4C and STAR Methods).
We annotated the open-chromatin sites based on their overlap with chromatin states. 45We want to know if the location of CpG mutations that contribute to changes in transcript abundance differs in any of 15 chromatin state groups (excluding one state labeled ''artifacts''), comparing sites lying in high CpG sequence density regions to those in low sequence density regions.For each site, we used a linear model, predicting the change in D2:B6 ratios from the presence of mutations, doing this separately in high and low CpG sequence density regions for each state.Table 1 shows the result, giving the number of sites for each chromatin state and the p value of the linear model for the predicted effect of CpG mutations in both high and low sequence CpG density regions of the genome.After correcting for multiple testing, only the effect of mutations in active enhancers in high CpG density regions is significant (p = 8.30EÀ04, exceeding a Bonferroni corrected threshold of 0.001 for testing 2 x 15 states).

DISCUSSION
Our study of methylation states in eight inbred strains, combined with the nearly complete sets of sequence variants and func-tional annotations, allows us to address the relationship between sequence and methylation variation.We find that the relationship depends in a non-linear fashion on the density of CpG sequences per kilobase of DNA.At densities less than 40 CpG/ Kb, there are more mutations in the methylated than unmethylated DNA.The relationship is consistent across brain cell types.The significance of the inflection at 40 CpG/Kb is unclear, but presumably it reflects how cells determine CpG sequence density.It is known that increasing CpG sequence density alters promoter activity, 33 possibly by recruitment of ZF-CxxC domaincontaining proteins that bind to unmethylated CpGs. 46t lower CpG sequence densities, cells compensate for the loss of methylated sites by recruiting additional sites in the same DNA segment, sites that arise due to mutational gain of cytosines.At higher sequence densities, the exact location of a methylated site becomes more important; specific sites are maintained even at the cost of reducing the number of methylated sites.In other words, the impact of a mutation on methylation depends on the CpG sequence density, with different consequences for embedded functional elements.These are exposed to higher mutation rates in lower density regions but with relaxed constraint on where the mutation will occur; in higher density regions, they are exposed to lower mutation rates, but mutations are likely to be constrained to preserve the same sites of methylation.
We tested this prediction by examining the impact of CpG density on mutations, using as output the relative change in transcript abundance in a comparison of two inbred strains.Since CpG methylation in general suppresses transcription, we expected mutations that abrogate a methylated CpG site to increase the relative amount of transcript and found this to be so.Consistent with our predictions from how CpG density relates to methylation, we found that the impact of mutations depended on the CpG density.In regions of low CpG density, we were unable to detect a significant effect of mutation on transcription, despite the very large number of sites analyzed.By contrast, in regions of high CpG density, the far fewer mutations we found made a highly significant contribution to variation in transcript abundance (though the overall effect is small, less than 1%).We also found that in high-density regions, CpG mutations are enriched in chromatin states that mark enhancers.
It is commonly assumed that genetic effects on behavior (as with other complex traits) are likely mediated by changes in transcription, an idea supported by the success of transcriptionwide association to facilitate the discovery of genes for common complex traits. 47Assuming that the CpG mutations' effects on transcription can be extrapolated to their effects on behavior, our findings suggest ways to prioritize the detection of genetic variants that alter behavior.One implication is that variants within regions of low CpG sequence density are unlikely to have any detectable effect on a phenotype, which would simplify searches for causative mutations.Conversely, prioritizing the search for causative variants to regions of high CpG sequence density and to enhancers should accelerate the discovery of variants that are causal for behavior.Importantly, there are far fewer variants in high compared to low CpG sequence density regions of the genome (approximately 1,500 compared to 15,000 in a com-parison of two strains).While still large, this number is easily within the range of massively parallel reporter assays. 48Furthermore, when interest is focused on a single locus, identified for example from fine mapping of behavioral phenotypes, there will be far fewer candidate variants to consider: approximately three candidate variants will lie within a 5-Mb locus, mapped in a comparison between B6 and D2 strains (this does not take into account the non-random distribution of variants across the mouse genome).
4][45][46][47][48][49][50][51][52][53][54] We add to this literature by indicating that sequence variants disrupting methylation act primarily through a specific class of candidate functional elements (we identified enhancers) in the context of high CpG sequence density.While we are unable to distinguish standing genetic variation (subject to selective pressure) from de novo variation, which makes us unable to determine whether the pattern observed in mice is independent of selection, the pattern suggests that the increased mutational load associated with methylation may be targeted to a subset of functional elements, consistent with findings in other species. 55In other words, it is possible that methylation may be being used to target increased rates of mutation to specific elements in the genome.

Limitations of the study
There are several limitations of our study.First, we used two mouse replicates to generate the methylation RNA data, limiting our ability to detect differences between strains.Second, for most cell types, sequence coverage was too low to detect ), together with the number of sites lying within genes (including 2 kb upstream and downstream) whose expression could be detected in the Exc-DG cell type in the hippocampus.The sites are divided into those lying in high and low CpG sequence density regions of the genome and then by whether they contain a mutation that disrupts a CpG site (''Mutation'' and ''No mutation'' columns).The proportion (expressed as a percentage) of sites with mutations is shown for each chromatin state.The p value from a linear model testing for the predicted effect of mutations on RNA fold change is shown.
most methylated sites.We cannot be certain that results found for the high-coverage cell types will apply to others, though as explained in the supplemental information, we think this is unlikely.Third, we cannot be certain we have identified all cell types.Cell types that we inadvertently believe to be homogeneous may confound our analysis, as the effects of mutations in cell-type-specific methylation sites will be obscured.We may also be missing cell types in which mutations have different effects from those documented here.Finally, our analysis is limited to CpG sites and to one region of the brain, the hippocampus.It is possible that mutations affecting non-CpG methylated sites behave differently.It is also possible, though we think unlikely, that different findings will emerge from analysis of different brain regions.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

QUANTIFICATION AND STATISTICAL ANALYSIS
Mapping and primary quality control All reads were mapped to the mouse mm10, Genome Reference Consortium Mouse Build 38 (GCA_000001635.2)).The gene and transcript annotation used was a GENCODE GTF file. 65snmC-seq2 reads were mapped using Bismark (V0.22.3) to SNP-swapped mm10 genomes.This was done to allow us to directly compare chromosomal coordinates between the 8 strains.SNPs for each strain were downloaded from the Sanger Institute Mouse Genomes website. 12Custom code was used to generate 8 separate SNP-swapped genomes for each strain by replacing each corresponding SNP nucleotide position in the B6J mm10.fafile with the nucleotide of that strain.snRNA and snATAC-seq reads were mapped using 10X Cell Ranger (V6.0.2) and 10X Cell Ranger ATAC (V2.0.0) respectively against mm10 for B6 and an SNP-swapped mm10 for DBA.We retained introns for RNA analysis while default settings were used for ATAC analysis.
Methylation features were calculated as fractions of methylcytosine over total cytosine across gene bodies ± 2kb flanking regions and 100kb bins spanning the entire genome.Methylation features were further split into CG and CH methylation types.Features overlapping our methylation mm10 blacklist were removed.100kb bin features were then filtered on minimum mean coverage >500 and maximum mean coverage <3000.Gene body features were filtered on minimum coverage >5 and all remaining features were normalized per cell using the beta binomial normalization technique in allcools. 16ngle-nucleus RNA sequencing data quality control and preprocessing All quality control and preprocessing were done under the Seurat package framework. 57Per biological sample, we filtered out cells that (1) fall below the 5th percentile of the total UMI counts (nCount_RNA) or the 5th percentile total number of unique genes expressed (nFeature_RNA) or 700 unique genes expressed, whichever was more stringent; (2) are over the 95th percentile quantile in either the total UMI counts or the total number of unique genes expressed; (3) have larger than 5% mitochondria fraction (percent.mt).
Global coverage normalization: counts per million (CPM) was applied to each cell followed by log transformation (''LogNormalize'').We then projected cells from each biological sample to low dimensional space using principal components analysis (PCA) on highly variable features selected by Seurat.Potential doublets were identified and subsequently removed from the downstream analysis by DoubletFinder, 60 ran in the top 15 principal components space with the expected doublet rate set to the recommended amount from 10X genomics based on loading volume (10k nuclei per well).
Single-nucleus ATAC sequencing data quality control and preprocessing All quality control and preprocessing were done under the Seurat and Signac package framework. 58Per sample, we first used Model-based Analysis for ChIP-Seq (MACS) to call sample specifc de novo peaks from its fragments file. 59We then merged sample-specific sets of peaks to a unified peaks set while removing peaks with length larger than 10000bp or smaller than 20bp.A unified peaks by cells count matrix was constructed from the fragments file while removing cells with lower than 200 peaks detected and peaks only present in less than 10 cells.Cells were filtered based on the following criteria: (1) appropriate number of non-duplicate, usable read-pairs (passed_filters from Cell Ranger's output singlecell.csv).Specifically, we set it to larger than 3000, 4000, 2500, and 5000 for the 2 BL6 samples and 2 DBA samples respectively.(2) number of fragments overlapping peaks (peak_region_fragments from Cell Ranger's output singlecell.csv)falls within the 5th percentile and the 95th percentile.(3) ratio of fragments overlapping peaks over the total number of non-duplicate, usable read-pairs falls within the 5th percentile and the 95th percentile.(4) nucleosome_signal: the ratio of fragments between 147 bp and 294 bp (mononucleosome) to fragments <147 bp (nucleosome-free) is smaller than 4. ( 5) TSS enrichment score is larger than 2. We did not include a filter for ratio of peaks in black list regions over the total number of non-duplicate, usable read-pairs as this was removed during the construction of the DBA SNP-swapped reference genome.
We normalized the count data with Text Frequency Inverse Document Frequency (RunTFIDF) and performed Singular Value decomposition (RunSVD) on top 90% informative features selected by Signac (FindTopFeatures).The first low dimensional embedding was excluded from downstream doublet detection and clustering analysis due to high correlation with sequencing depth.
Potential doublets were identified and subsequently removed from downstream analysis by DoubletFinder, 60 ran on the 2 nd À11 th low dimensional embedding with the expected doublet rate set to recommended amount from 10X genomics based on loading volume (10k nuclei per well).
Finally, we built a gene-by-cell transcriptional activity matrix that counts per cell, at the gene body and 2000bp upstream to capture the promoter region, the total number of ATAC-seq counts.

Cell type cluster generation and modality integration
Single-cell methylation clustering Principal component analysis was run using Scanpy 61 default parameters followed by k-nearest neighbors (knn) using only the top 16 principal components by amount of variance explained and k = 15.Divergences between strains were evident in initial unsupervised clustering so Harmony 62 was used to correct the batch between strains and unbias clusters arising from strain differences.After Harmony was applied, iterative clustering was performed with a combination of leiden unsupervised clustering and UMAP dimensionality reduction, identifying clusters as cell types by marker gene hypomethylation. 16ingle-nucleus RNA sequencing data integration, clustering and annotation Gene counts were normalized using SCTransform, 63 and regressed out percentage of reads from mitochondrial genes.We then integrated cell from all samples using reciprocal principal components analysis (rPCA) implemented in Seurat 4.0.5 57 on the top 5000 integration genes and top 30 reciprocal principal components.For clustering, we standardized the integrated data, performed PCA on all integrated genes and ran de novo Louvain clustering algorithm in the top 15 principal components space with resolution set to 0.1.Cluster markers that are conserved between the strains were called using non-parametric Wilcoxon rank-sum test and subsequently used for annotation.We annotated clusters by manually checking conserved markers against the ALLEN BRAIN MAP's Mouse Whole Cortex and Hippocampus dataset.Single-nucleus ATAC sequencing data integration, clustering and annotation We first jointly projected all cells' ATAC peak profile to uncorrected Latent Semantic Indexing (LSI) embeddings with TFIDF transformation followed by calculating SVD on the top 90% most informative peaks.Peak profile embeddings were then integrated in shared low dimensional space via integration anchors identified in the 2 nd to 30 th reciprocal LSI space as implemented in Signac 1.5.0.We then integrated cell transcriptional activity profiles by performing SCTransform after regressing out percentage of activity from mitochondrial genes and carried out rPCA integration on integration genes identified from the single-nucleus RNA experiment and the top 10 reciprocal principal components.We transferred the single-nucleus RNA annotation onto the single-nucleus ATAC cells by linking the RNAs expression profiles with ATAC transcriptional activity profiles through canonical correlation analysis (CCA) described in Seurat.Pairs of cells from each modality that are mutual nearest neighbors in the top 15 canonical component space were identified as ''transfer anchors''.''Anchors'' were further filtered and weighted by distances in the integrated peak embeddings prior to impute ATAC cells' annotation.Co-embedding single-nucleus methylation with single-nucleus RNA and single-nucleus ATAC sequencing data While we obtained separate independent annotations for single-nucleus methylation cells and RNA and ATAC cells, a joint embedding demonstrates that cluster annotation is highly concordant across modalities.
We used the negative of the average mCH fraction of the gene body ± 2kb as the proxy of methylation cell's transcriptional activity as described previously. 16Single-nucleus RNA expression profiles were linked to single-nucleus methylation gene body mCH profiles via CCA on RNA integration genes.We then identified ''transfer anchors'' in the top 15 canonical component space and used them to impute methylation cells' expression profiles.Single-nucleus ATAC cell expression profiles were imputed similarly with previously computed ''anchors''.All three modalities were merged on their integrated or imputed expression profiles and projected to low dimensional space via PCA, and visualized by UMAP performed on the top 15 principal components.
In general, cell-type annotation demonstrated high concordance across the three modalities, except on single nucleus methylation of non-neuronal cells, where the VLMC, OPC and MGC cell clusters did not colocalize with the corresponding expressionbased annotation counterpart.This could be partially explained by the fact that mCH methylation is largely not present in the non-neuronal population. 66In addition, both the "transfer anchor" and subsequently the imputed transcriptomic profile for single nucleus methylation cells were identified using the ''gene body only'' mCH fraction, a significant reduction in information compared to what was used for the de novo single nucleus methylation annotation, which included both the mCG and mCH fraction at 100kb genome wide.

Analysis of cell type specific effects and relationships with sequence variation
Cell-type specific differential test for single-nucleus ATAC and single-nucleus RNA We used DESeq2 44 for cell-type specific pseudobulk level differential expression analysis and differential accessibility analysis.Per cell type, raw counts were aggregated to replicate level and DESeq2 was run under default parameters to detect statistical evidence of strain differences.Pipeline to identify relationship between methylation and sequence variation in multiple strains Data from each strain for each cell type was combined into a single R data frame, where the following pieces of information were included for each site.Genome location was obtained from mm10, Genome Reference Consortium Mouse Build 38 (GCA_000001635.2))and gene information from the associated databases. 65Sequence variants were downloaded from the Sanger Institute Mouse Genomes project. 12e confirmed that the sequence variants coincided with the expected position by searching for sites with rs numbers and checking that the coordinates agreed between those in the file and those in the mouse mm10 assembly.We confirmed that a file has the correct sequence variants by downloading genome sequence from the UCSC browser and confirming that the sequence is ''C' at the ''+'' strand indices from the comparison file.We added to this file the location of all CpGs, from the genome and from each strain, with the script: FindCpGs.pl-c chromosome (assumes the presence of a sequence file to process of the form chr1.fa in a given directory) -f <annotation file> This script identifies a five base pair context around any cytosine in the genomic sequence, testing for the presence of any variant that changes the reference B6 genome to a C by using the strain variant information. 12This generates a file with the 5 bp sequence context in the reference genome, the sequence for each strain (marked as CG for CpG), the number of methylated sites, the sequence coverage and a column for annotations (containing gene information and other potentially relevant features, downloaded from UCSC genome browser databases. 65e extracted regions of 2 kilobases in length (different segment sizes were also analyzed), examined for each the methylation and mutational spectrum.To do so we consider variation in coverage in each segment and correct for this by down sampling the strain with the most sequence reads at each region (we randomly reduced the amount of methylation proportionate to the ratio in the total sample).We require a minimum coverage of 5 reads and to call a site methylated we require at least 10% of the reads to be methylated.If the coverage is good enough to call a site, but lacks evidence for methylation, then the site is regarded as not methylated.
We compared two definitions of similarity

Figure 2 .
Figure 2. Relationship between methylation state, mutations, and CpG density (A) Percentage of CpG sites that are methylated is shown on the vertical axis.The horizontal axis is the sequence density of CpG sites (regardless of methylation) per kilobase of genomic DNA.Each blue line represents a different inbred strain where aj = A/J, b6 = C57BL/6J, balb = BALB/cJ, d2 = DBA/2J, fvb = FVB/J, pwk = PWK/PhJ, and wsb = WSB/EiJ.(B) Percentage of CpG sites for the seven inbred strains that are mutated, relative to the outgroup strain CAST/EiJ.The percentages are shown for methylated CpGs (in blue) and non-methylated (in gray).The horizontal axis again shows CpG sequence density per kilobase of genomic DNA.

Figure 3 .
Figure 3.Comparison of the probability of methylation state with the ratio of the number of methylated sites

Figure 4 .
Figure 4. Co-embedding of single-nucleus methylation with single-nucleus RNA and single-nucleus ATAC sequencing data for two strains, C57BL/6J and DBA/2J (A) UMAP embedding of single-nucleus methylation, single-nucleus RNA-seq, and single-nucleus ATAC cells after integration, colored by sequencing platform.(B) is similar to (A) but colored by strain label.(C) Concordance between independent single-nucleus methylation annotation and single-nucleus RNA transcriptomic-based annotation on major ventral hippocampus cell types.

Figure 5 .
Figure 5.Effect of mutations at methylated CpGs on D2:B6 ratios of transcript abundance in regions of low and high CpG sequence density Each dot represents the change in D2:B6 ratios of transcript abundance for mutations lying in regions of high CpG sequence density (>40 CpG/Kb) and in regions of low CpG sequence density (<40 CpG/Kb).The horizontal bars indicate the mean change in RNA transcript, with upper and lower 95% confidence intervals shown as dotted lines.(A) Sites that have mutations disrupting methylated CpGs.(B) Sites without mutations.

Table 1 .
The association of CpG mutations on RNA transcript abundance varies by local chromatin state 45e table shows 15 chromatin state groups (from Vu and Ernst45

TABLE
d RESOURCE AVAILABILITY B Lead contact B Materials availability B Data and code availability d EXPERIMENTAL MODEL AND STUDY PARTICIPANT DE-TAILS B Mouse strains d METHOD DETAILS B Ventral hippocampus microdissections B Generating snmC-seq2 libraries B Generating snRNA-seq libraries B Generating snATAC-seq libraries d QUANTIFICATION AND STATISTICAL ANALYSIS B Mapping and primary quality control B Single-nucleus methylation data quality