Using long-read sequencing to detect imprinted DNA methylation

Abstract Systematic variation in the methylation of cytosines at CpG sites plays a critical role in early development of humans and other mammals. Of particular interest are regions of differential methylation between parental alleles, as these often dictate monoallelic gene expression, resulting in parent of origin specific control of the embryonic transcriptome and subsequent development, in a phenomenon known as genomic imprinting. Using long-read nanopore sequencing we show that, with an average genomic coverage of ∼10, it is possible to determine both the level of methylation of CpG sites and the haplotype from which each read arises. The long-read property is exploited to characterize, using novel methods, both methylation and haplotype for reads that have reduced basecalling precision compared to Sanger sequencing. We validate the analysis both through comparison of nanopore-derived methylation patterns with those from Reduced Representation Bisulfite Sequencing data and through comparison with previously reported data. Our analysis successfully identifies known imprinting control regions (ICRs) as well as some novel differentially methylated regions which, due to their proximity to hitherto unknown monoallelically expressed genes, may represent new ICRs.


INTRODUCTION
Methylation of the fifth carbon of cytosines (5mC or simply mC) is an epigenetic modification essential for normal mammalian development. Methylation differences between alleles contribute to establish allele-specific expression patterns. As obtaining genome-wide haplotyped methylomes with short reads remains challenging, we evaluated the ability of long-read, nanopore-based sequencing to improve allele-specific methylation analyses.
We apply the technique to the study of genomic imprinting, where differential expression of the maternal and paternal alleles in the offspring is at least partially set by the differential methylation (1)(2)(3)(4)(5). Imprinting is proposed to arise from the diverging interests of the maternal and paternal genes (6). In accordance with its primordial role in allocation of resources from the mother to the offspring, the placenta, along with the brain, is the organ where parental conflict results in the most pronounced imprinted expression (7)(8)(9). We thus conduct a survey of differential methylation and expression in murine embryonic placenta.
Recent studies have increased the number of genes identified as subject to imprinting in mouse to ∼200 (10)(11)(12)(13)(14)(15). The cause of the differential expression between paternal and maternal alleles is only known for a subset of these genes; maternal histone marks can play a role (14), and in other cases it involves the differential methylation of adjacent regions (5). The differential methylation patterns may be established in the gametes and persist through the epigenetic reprogramming occurring after fertilization (16).
These differentially methylated regions (DMRs) are called primary DMRs, or imprinting control regions (ICRs). Alternatively, differential methylation may arise during development, perhaps as a downstream effect of differential expression, in which case the regions are called somatic or secondary DMRs (17).
Apart from the parent of origin of the allele, genetic differences can also be associated with differential methylation. In this case, F1 hybrids of distinct mouse strains will display DMRs between the alleles according to the strain of origin (18), and not the parent. Genetically determined DMRs can have profound effects on phenotype, for instance in humans by altering the expression of mismatch repair genes important in cancer (19). Therefore, we also investigate the link between DNA methylation and expression for strain-biased genes.
Reconstructing haplotyped methylomes necessitates the simultaneous measurement of DNA methylation and single-nucleotide polymorphisms (SNPs) differentiating the alleles. This can be achieved by deep sequencing of bisulfite-converted DNA on the Illumina platforms, although the short reads combined with the reduced complexity of the bisulfite-treated DNA make the process inefficient, meaning many regions with low SNP density remain unresolved. Long reads provided by third generation sequencing technologies can overcome the requirement of a high SNP density, while several methods allow the assessment of base modifications on native DNA (thus also avoiding the reduction in complexity associated with bisulfite conversion). These methods include: analysis of polymerase kinetics for PacBio SMRT sequencing (20), and detection of deviations in the electric signal for Oxford Nanopore sequencing, via nanopolish (21), signalAlign (22), mCaller (bioRxiv doi:10.1101/127100), Tombo (bioRxiv doi:10.1101/094672) or DeepSignal (bioRxiv doi:10.1101/385849). We note that, for the dominant eukaryotic genome base modification at 5mC, the PacBio technology requires very high coverage making it impractical for use in the analysis of mammalian genomes (23). PacBio SMRT sequencing can be combined with bisulfite treatment (SMRT-BS) to facilitate 5mC detection, but this approach is currently only available for targeted sequencing (24) and the bisulfite treatment introduces the same drawbacks noted above in addition to fragmenting the DNA. Additionally, while PacBio technology is limited to maximum read lengths of between 50 and 100 kb (25), Oxford Nanopore sequencing has no theoretical upper limit on read length and exhibits no bias in sequencing quality with read length (26), which is especially beneficial in genomic regions devoid of SNPs, or highly repetitive regions.
Here, we use the MinION and PromethION long-read nanopore sequencers to generate whole-genome haplotyped methylomes from murine embryonic placenta. With a mean coverage of 10× we successfully identify known ICRs as well as novel parent-of-origin DMRs near imprinted genes, as well as strain-specific DMRs close to both strainbiased genes and structural variants. We show the improved efficiency of this strategy over existing workflows to resolve allele-specific methylation, and highlight its utility in investigating the mechanisms of genomic imprinting.

Animal strains and husbandry
All mice were maintained and treated in accordance with Walter and Eliza Hall Institute Animal Ethics Committee approved protocols under approval number WEHI AEC 2014.026. Mus musculus castaneus mice were obtained from Jackson Labs. Note that due to prior inter-crossing for transgene transmission, the female M. musculus domesticus C57BL/6 mice that served as dams for our study comprise 12.5% FVB/NJ genome, however for simplicity we will refer to this mouse as B6. Wild-type B6 were reciprocally mated to wild-type Mus musculus castaneus (Cast).

DNA and RNA extraction
Pregnant females were sacrificed at E14.5 by CO 2 asphyxiation and the embryonic portion of each placenta was dissected from the maternal portion in phosphate buffered saline, as we have done previously (27). Samples were snap frozen in buffer RLT plus (Qiagen) and DNA and RNA were later extracted from the same sample using the All-Prep DNA/RNA Mini Kit (Qiagen), according to manufacturer's instructions. Samples were sexed by polymerase chain reaction (PCR) using primers for Otc (X-linked gene) and Zfy (Y-linked gene) as previously described (28), and male samples were selected for further analysis.

Illumina sequencing
Reduced Representation Bisulfite Sequencing (RRBS) libraries were made from 100 ng of DNA purified from the embryonic layer of a male B6 × Cast E14.5 placenta using the Ovation RRBS Methyl-Seq System (NuGEN), according to the manufacturer's recommendations, which include use of the Qiagen Epitect kit for bisulfite conversion. The resultant library was sequenced on a HiSeq 2500 (Illumina) using 100 bp paired-end reads and analysed as previously described (28,29).

Nanopore sequencing
The B6 × Cast F1 sample was sequenced on three Min-ION flow cells with the 1D Sequencing Genomic Ligation (LSK108) protocol from ONT with minor adjustments: Nucleic Acids Research, 2019, Vol. 47, No. 8 e46 4 g of starting material were used for each library preparation, and for two libraries DNA was sheared to 10 kb with a Covaris G-Tube, whereas shearing was omitted for the third library (resulting in longer read lengths). Reads were basecalled with Albacore 1.2.2. The Cast × B6 F1 was sequenced on one PromethION flow cell with the 1D Sequencing Genomic Ligation (LSK109) protocol without shearing, and basecalled with Albacore 2.2.7. Nanopore reads were aligned to the same SNP-masked genome as before, using BWA-MEM (arXiv:1303.3997).

Haplotyping
Haplotyping is achieved through the identification of SNPs that are unique to one or the other allele. Examining only the SNPs identified as passing all filters in Keane et al. (38), we combine two distinct methods to confidently haplotype each read.
Basecall haplotyping. Where a read is aligned to a SNP position i on the reference genome, we assign a score S i if the aligned base agrees with the reference haplotype, or 1 − S i if the aligned base agrees with the alternate haplotype, where the score depends on the basecalling quality score q i of the base in question as where the co-efficients of the above relationship were determined empirically on successfully haplotyped reads. Bases which match neither haplotype, or which exhibit a deletion at the SNP location are excluded from the analysis. Finally, the read is assigned an aggregate haplotype value h ∈ [0, 1] across the n informative SNP calls as follows:  (39). Briefly, the raw signal corresponding to the section of the read aligned to the reference at the SNP position is realigned using a HMM, and the likelihood of the sequence of 6-mers in this vicinity is maximized by choosing the more likely of the two possible alleles. Each read is then assigned scores according to the same rule as in Basecall Haplotyping, where nanopolish quality scores are offset by −35 in order to exhibit a similar relationship to basecall quality scores.
where H = 1 represents a read assigned to the reference haplotype, H = −1 represents a read assigned to the alternate haplotype and H = 0 represents an unassigned read. This process is shown graphically as a flowchart in Supplementary Figure S3.
Resolution of maternal recombination. Owing to the cross of an FVB-strain into the maternal line in the grandparental generation, it is necessary to resolve which section of the maternal genome was contributed by recombination from the FVB chromosome. We run the above haplotyping procedure with three possible outcomes, rather than two: mm10, FVB and CAST, with variants called by Keane et al. (38). The proportion of maternal (non-CAST) reads within any 100 Kb region was fitted to a recursive partition tree, which splits continuous data into a stepwise function, here representing the proportion of a contiguous section of chromosome haplotyped to FVB ( Supplementary Figure S5). Fitting was performed using the R package rpart with parameters minsplit=5 and cp=0.1 (40). SNPs in sections of the chromosome with mean proportion of FVB >50% were replaced with the FVB allele for further analysis.

Methylation calling
We determined the methylation status of each CpG site on each read using nanopolish call-methylation (21). Briefly, nanopolish uses a 5-base alphabet, with 5-methylcytosine represented as M, to build a Gaussian mixture model representing every possible 6-mer with both methylated and unmethylated cytosine in a CpG context, excluding those 6-mers which contain both the methylated and unmethylated base. We ran nanopolish separately on reads haplotyped to the maternal and paternal chromosome, using a SNP-masked version of each chromosome to decrease bias in reads with expected deviations from the mm10 reference. Nanopolish then assigns each section or 'event' of nanopore current to a base on the reference genome and calculates the likelihood of each 6-mer containing the CpG site being either methylated (L M (d ij )) or unmethylated (L C (d ij )) given the data d ij for a call group i covered by a read j. Groups of consecutive CpG sites in which the distance between any two adjacent sites is <11 bases (therefore having overlap between 6-mers containing the cytosines in question) are chained into CpG call groups. All sites within the one CpG call group are assumed to have the same methylation status, such that each 6-mer is only considered once. We convert these likelihoods to probabilities as follows: and since M and C are mutually exclusive and jointly exhaustive, Then, defining the prior probability of methylation as P(M) = p 0 , and rearranging for P(M|d ij ), .
Noting results from Decato et al. (41) showing methylation levels ranging from 0.433 to 0.538 for mouse placental tissue we set p 0 = 0.5, so finally we define the single-read, singlesite probability of methylation as .
Comparison with RRBS methylation calls. Individual methylation calls on a single CpG call group are aggregated over the set of reads covering each group in order to compare with aggregate values provided by bisulfite sequencing. That is, for each CpG call group i covered by n reads, we define the call group average In order to compare methylation calls between nanopore and RRBS, we must split CpG call groups defined by nanopolish as CpG sites separated by <11 base pairs into individual sites, including GpC sites on the reverse strand, with each site retaining the same ␤ value as the original call group. Only those CpG sites for which both RRBS and nanopore data exist are considered.

Identification of differentially methylated regions
Following methylation detection and haplotype assignment of each read, it is possible to assign each call of methylation on the genome to one of the two haplotypes. The aggregated ␤ methylation values for each CpG group are tested for DMRs using the DSS software (42). Briefly, DSS tests for differential methylation at single CpG-sites, using a Wald test on the co-efficients of a beta-binomial regression of count data with an 'arcsine' link function. Then, using a default P-value threshold of 10 −5 , DSS aggregates differentially methylated sites into DMRs based on a maximum separation between sites and a minimum density and number of sites in each DMR.
To detect parent-of-origin DMRs, we perform DSS with the comparison B6♀ and Cast♀ versus Cast♂ and B6♂; to detect strain-specific DMRs, we perform DSS a second time with the comparison B6♀ and B6♂ versus Cast♂ and Cast♀.

Visualization of haplotyped methylation
Owing to the noisy nature of nanopore methylation calls, we use a loess smoothing curve to visually represent the methylation of a single nanopore read (43). Here, the smoothing parameter ␣ is determined by where L is read length. This relationship was determined empirically to have minimal impact on visualization while minimizing computation time. Genomic tracks of methylation and expression were plotted with the following Bioconductor packages: ggbio (44), rtracklayer (45) and Genomi-cRanges (46).

Nanopore methylation calls are concordant with other technologies
We sequenced the embryonic portion of placenta derived from a male embryonic day 14.5 (E14.5) conceptus from a C57BL/6 (Black6, or B6) × Castaneus (Cast) F1 on the MinION platform to a depth of ∼8× ( Supplementary Figure S1) and called methylation using nanopolish (21). The genome-wide methylation data successfully recapitulated known patterns: CpG islands (CGIs), as defined by Irizarry et al. (47), separated into two groups of high and low methylation ( Figure 1A); the methylation level dipped at transcriptional start sites (TSSs) ( Figure 1B), and the average genome-wide methylation level was ∼50%, as previously reported for placental tissue (41).
To further validate the accuracy of the nanopore methylation calls, we compared them to RRBS data on the same sample at sites covered by both methods. Nanopore methylation calls showed an overall similar distribution to RRBS methylation calls, albeit with a bias toward intermediate values of methylation ( Figure 1C). Despite being less correlated than measurements from methylation-specific technologies (48) with a median absolute deviation of 0.18, persite methylation was also relatively concordant between the two methods, with 85% of CpG sites being called correctly when converting average methylation values to binary calls ( Figure 1D).
Sequencing of E14.5 embryonic placenta from the reciprocal cross (Cast × B6) on the PromethION platform at 12× coverage (Supplementary Figure S1) generated comparable results.

Increased haplotyping efficiency with nanopore reads
We next used high-confidence SNPs between the Cast and B6 strains to haplotype RRBS and nanopore reads. In order to mitigate the high sequencing error rates of nanopore sequencing, we used two methods of haplotype assignment, denoted the 'basecall method', based on FASTQ data, and the 'signal method', based on the phase-reads module from nanopolish, an HMM, which uses the raw nanopore signal to predict genotype (39). Additionally, we only assigned a haplotype to those nanopore reads with at least five highconfidence SNPs (Supplementary Figure S3). All RRBS reads overlapping at least one SNP were assigned a haplotype (31). Only 24% of the mapped RRBS reads could be assigned to one haplotype, whereas 74% of mapped nanopore reads were haplotyped in the expected proportions (Figure 2A and B): roughly half of the haplotyped reads were assigned to the maternal haplotype, and half to the paternal haplotype, albeit showing a slight bias toward the paternal haplotype (due to an increased number of split reads in regions where sections of the Cast genome has a deletion with respect to the B6 genome). The pattern of haplotype assignment was consistent across the autosomal chromosomes, while, as expected for a male sample, almost all (91%) of the reads aligned to the X chromosome were assigned to the maternal haplotype ( Figure 2B). Haplotyping of the Cast × B6 cross gave similar results (Supplementary Figure S2A). The lack of maternal bias in read haplotype indicates minimal maternal contamination, which is also reflected in consistent RNA-seq library sizes (Supplementary Figure S4). We further evaluated the accuracy of the haplotyping of the nanopore reads by sequencing the same tissue from the parents (B6 only, and Cast only). Following the same haplotyping procedure, 85.7% of the reads were correctly as- signed to the relevant genotype, and 1.5% were misassigned (Supplementary Figure S2B and C). The large majority of nanopore reads showed strong agreement between the two haplotyping methods. Discrepancy between the basecall and signal methods are typically due to a low number of SNPs being scored by one or both methods, resulting in these alignments being filtered out by the haplotyping procedure (Supplementary Figure  S3). However, when examining the overall predictive performance of the two methods with Area Under Receiver Operating Characteristic Curve (AUROC) on the singlestrain experiments, the signal method marginally outperformed the basecall method (see Table 1). We also found that the combination of the two methods achieved a slight improvement again over the signal method, either with a logistic regression model (49) or an ad hoc combination of the two approaches (see 'Materials and Methods' section). The ad hoc method allowed classification of 30 000 additional reads over the logistic regression and signal method, both of which excluded reads for which nanopolish failed to produce output. In both nanopore and RRBS data, the main cause of haplotyping failure is the lack of SNPs in the region covered by the read (Supplementary Figure S3). While the proportion of successfully haplotyped nanopore reads could increase with optimization for longer reads and anticipated improved sequencing accuracy, RRBS haplotyping efficiency is limited by the short read length.

Parent-of-origin and strain-biased gene expression
To investigate the correlation between differential methylation and differential gene expression, we performed RNAseq on the same F1 placental tissue from reciprocal crosses of B6 and Cast, in quadruplicates. Maternal tissue contamination was unlikely as for each embryo maternal and paternal counts were similar (Supplementary Figure S4). We found 135 genes with a parental bias in expression (imprinted genes, 10% FDR, Figure 3A): 88 with higher expression from the maternal allele and 47 with higher expression of the paternal allele. Among the 135 genes, 53 corresponded to well-characterized imprinted genes in classic databases (50)(51)(52)(53). A further 17 of these genes, including Fkbp6, Smoc1/2, Gzmc/d/e/f/g, Zdhhc14 and Arid1b have been identified as imprinted in one or several recent studies (12)(13)(14)(15). The remaining 65 genes constitute novel candidate genes with parent-biased expression in mouse placenta. The complete annotated list is reported in Additional File 1.

Known imprinting control regions are observable by nanopore sequencing
We combined the methylation and haplotyping data from the nanopore reads to compare methylation between the parental alleles. To highlight the linkage information of the methylation data available for nanopore reads, as well as the per-site per-read data, we plotted the loess fit of the cytosine methylation levels for each read in the region of interest ( Figure 4A and B).
DMRs at known ICRs (52) were readily visible and concordant with matched allele-specific RNA-seq and RRBS data ( Figure 4A and B). Nanopore data recapitulated methylation differences at most known ICRs ( Figure 4C), often showing extended differential methylation past the annotated ICR borders.

Nanopore sequencing reveals novel differentially methylated regions
Next, we sought to define DMRs between parental alleles as well as between strains de novo, using the differential methylation tool DSS (42). We ranked putative DMRs based on the area statistic. Using the DSS default threshold of 10 −5 , we obtained a total of 933 DMRs, of which 309 were explained by parent-of-origin differences, and the remainder by strain-specific effects (Additional File 2). We then examined these DMRs, in conjunction with haplotyped RRBS and RNA sequencing data for corroborating evidence of differential methylation and differential expression, respectively, in order to find putative DMRs of interest at imprinted genes.   Of the 20 highest ranking DMRs, 15 corresponded to known ICRs. Although many of the lower-ranking DMRs are potential false-positives, they also included regions of known imprinted expression (for instance two small detected DMRs immediately adjacent to known DMRs at the IMPACT and NESP ICRs, shown in Figure 4A and B, respectively.) Thus in the absence of statistically robust DMRfinding methods for nanopore data, we kept this permissive threshold.
Five ICRs annotated in the WAMIDEX database (52) were not detected de novo (Table 2). INPP5F V2 and GRB10 simply lacked coverage in the B6 × Cast sample but showed clear differential allelic methylation in the Cast × B6 sample; GNAS-EXON1A also lacked coverage in B6 × Cast but the reciprocal sample and the RRBS did not suggest differential methylation, while NDN lacked coverage in both samples. The last undetected region was GPR1-ZBDF2; however Duffié et al. (17) have shown that this region lacks important features of a bona fide ICR, and that a neighbouring maternally hypermethylated region is the true ICR. The region in question was readily detected as differ- entially methylated from the nanopore data (DMR #229, Table 2).
In addition to ICRs, we detected numerous DMRs at imprinted genes that do not appear to be present in gametes (54)(55)(56), and which therefore likely constitute secondary DMRs (Table 2). When RRBS coverage was present, the bisulfite data corroborated the de novo DMR identification. Five of the secondary DMRs have been described previously, although they are not currently compiled in a database: maternal hypermethylation at the Igf2 promoter (57), maternal hypermethylation at the placental-specific promoter of Gab1 (58), paternal hypermethylation of the Meg3 TSS (59), maternal hypermethylation at the Slc38a4 TSS (60) and paternal hypermethylation at the Igf2r promoter (2).
Other novel secondary DMRs overlapped introns rather than TSSs. Park2, a recently identified maternally-biased gene (13), had seven intronic DMRs, all displaying hypermethylation of the maternal allele. Rian displayed a DMR that had not been previously reported in mice, although its human orthologue also presents an intronic imprinted DMR (61).
In some cases, inspection of the parent-specific DMRs revealed unannotated imprinted transcription nearby, for example in the Kcnq1 and Igf2r clusters ( Table 2). These RNA-seq reads may be part of imprinted long non-coding RNAs, frequent at imprinted clusters.
While we have collated all the imprinted DMRs that we found to directly overlap with imprinted expressionin addition to the WAMIDEX ICRs-in Table 2, we note that other imprinted DMRs may be associated with the imprinted expression of more distant genes, or with genes that are only expressed or imprinted in specific tissues. For example, we found imprinted DMRs in the promoters of Smoc2 (DMR #224) and Arid1b (DMR #863), two genes recently identified as being imprinted (and also imprinted in our data). The strong DMR #110 overlapped the TSS of Gtsf2, which was poorly expressed in placenta but may be imprinted in the tissues where it is expressed (in gonocytes and spermatids (62)). In the absence of chromatin conformation data or functional validation, we did not attempt to formally assign these DMRs to specific genes.
We however used the top, most reliable 400 DMRs to calculate the distance of genes to their nearest DMR depending on their expression status. We found that parentally biased genes were more likely to be proximal to parentof-origin DMRs than unbiased genes (median distance 0.9 Mbp compared to 7.4 Mbp), whereas strain-biased genes and DMRs did not show this relationship (median distance 2.9 Mbp compared to 3.1 Mbp) The distributions of distances to the nearest DMR are shown in Figure 5, which shows a striking relationship between parentally biased genes and parent-of-origin DMRs. This result is consistent with parental bias in expression being driven necessarily by epigenetic differences, whereas differential expression between strains is mainly driven by genetic differences.

Long reads provide advantages in differential methylation analysis
Inspection of the DMRs revealed multiple advantages of our nanopore-based method of methylome haplotyping over traditional bisulfite sequencing ( Figure 6).
We were able to resolve DMRs in regions of low SNP density, where there were no haplotyped RRBS reads despite the presence of a CGI ( Figure 6A). The particular DMR in Figure 6A encompassed the TSS of the imprinted gene Peg10, and was much wider than the previously annotated ICR. The increased DMR width was a regular occurrence at ICRs (Table 2).
Our method also uncovered novel secondary DMRs at known imprinted genes such as Jade1 (Figure 6B), as well as at previously uncharacterized imprinted transcripts such as AC158554.1, annotated as a long intergenic noncoding RNA (ENSMUSG00000116295, Figure 6C).
Another advantage provided by the long reads was apparent at the ZIM2-PEG3 ICR ( Figure 6D). RRBS data from the B6Cast F1 showed that certain CG dinucleotides were highly methylated on the maternal allele (100% methylation at these positions) while others were variably methylated, resulting in averages of 25-50% methylation. Two scenarios could give rise to these intermediate values: either the variable positions are randomly unmethylated in all maternal alleles, or there exists two populations of maternal alleles, one where CG dinucleotides are methylated throughout the region and one where the variable positions are consistently unmethylated. The long nanopore reads revealed that the second scenario contributes to the observed intermediate methylation patterns: there was a mixture of cells, in some of which the maternal allele showed a contiguous loss of methylation. Although this result is well known to those who have practiced Sanger bisulfite sequencing, the haplotyped methylome derived from nanopore sequencing allowed investigation of this variability more accurately (no PCR bias) and across the whole genome.
Eight strain-specific DMRs were also found within 5 kb of a gene with strain-biased expression ( Figure 6E). Most of these exhibit structural variation proximal to the DMR, although a small number exhibited changes in expression seemingly not associated with any structural variant.
One example of a structural variant between B6 and Cast associated with differential methylation can be found on Figure 6F. Upstream of the Tap2 gene, an IAPEZ retrotransposon in the B6 genome is absent from the Cast genome (Cast reads are truncated upstream of the repeat and absent downstream), and the gene-proximal border is more highly methylated on B6 alleles than on Cast alleles. This differential methylation would be consistent with the insertion of the transposon attracting methylation that spreads to adjacent regions. We could also see at this repeat region a lot of spuriously mapping reads from both strains, suggesting that the repeat is present in multiple copies that the current assembly fails to account for.

DISCUSSION
Determining allele-specific methylation patterns in diploid or polyploid cells with short-read sequencing is hampered by the dependence on a high SNP density and the reduction  A B Figure 5. Proximity of differentially expressed genes to DMRs. (A) Distribution of distance from genes to imprinted DMRs, shown on a log-scale. Inset shows distances from 0 to 10 000 bp on a linear scale. Imprinted genes are much more frequently located within 100-100 000 bp of an imprinted DMR. (B) Distribution of distance from genes to strain-specific DMRs. Strain-biased genes are for the most part located no closer to a strain-specific DMR than non-differentially expressed genes, indicating the strain-specific differential expression is likely caused by other factors, such as genomic differences. In both cases, we use only DMRs ranked in the top 400.
in sequence complexity inherent to bisulfite treatment. In this study, we demonstrate the use of long-read nanopore sequencing to derive haplotyped methylomes of the embryonic portion of mouse placentae. Methylation estimates from nanopore reads are consistent with previous knowledge ( Figure 1). The longer read lengths allowed most reads to overlap multiple SNPs, resulting in accurate haplotyping of 75% of the reads, a much higher proportion than comparable short-read data ( Figure 2). Sequencing of native DNA not only maintains the sequence complexity that is lost in bisulfite treatment, but also has the potential to detect a variety of base modifications outside 5mC, bypassing the need for specialized chemistries such as bisulfite (for 5mC) or oxidative-bisulfite (for 5hmC) treatments. Furthermore, we are able to characterize allele-specific methylation at a relatively shallow level of genomic coverage (∼10×), which is substantially lower than the coverage required by Pacific Biosciences single-molecule sequencing to ascertain any native base modification (25×) or 5mC in particular (250×) (23). Nonetheless, a detailed comparison of the per-  formance of nanopore and PacBio in the detection of base modifications on matched samples would be of interest. Recent increases in throughput of nanopore sequencing instruments make this approach a cost-effective way of obtaining genome-wide allele-specific methylation for mammalian-sized genomes, compared to the alternatives of short-read whole-genome bisulfite sequencing, or PacBio SMRT sequencing. Thus, the approach we present is unique in its ability to characterize allele-specific single-molecule cytosine methylation state in eukaryotes, in which the 5mC modification is both common and highly relevant to transcriptional regulation.
The haplotyped methylomes for reciprocal B6 × Cast F1 samples confirm the parent-of-origin specific methylation of ICRs and provide an improved definition of their boundaries (Figure 4). By integrating the haplotyped methylomes with allele-specific expression data, we identified novel DMRs linked to imprinted genes. These are likely to constitute secondary DMRs, whose role and origin are unclear. We note that the low sequencing coverage in this study (∼10×) limits our ability to detect modest methylation differences between alleles, and is thus most suited to the detection of large differences such as those occurring at ICRs. We confirm a large number (70) of previously identified imprinted genes and propose another 65 as new candidates (Figure 3 and Additional File 1). This suggests that although the monoallelically expressed genes are now well characterized, sensitive analyses can still uncover parentally biased genes. Interestingly, though we find more maternal-dominant genes than paternal-dominant ones (88 and 47, respectively), the imbalance is much less pronounced than in Finn et al. (2014) (12) (96% maternal dominance). Applying long-read sequencing to the transcriptome also promises improvements in the percentage of usable data, the detection of allele-specific as well as isoform-specific differential expression and even the detection of RNA base modifications.
Our allele-specific methylation and expression data can also be used to reveal strain-biased expression of genes linked to strain-specific DMRs. The genetic divergence between the two strains accounts for most of the differences in expression, however the presence of DMRs could suggest an epigenetic component to the regulation of a subset of genes.
We foresee a number of improvements that will make the determination of haplotyped methylomes by nanopore sequencing more efficient and comprehensive in the future. First, we expect to see an expansion in the types and nucleotide contexts of base modifications characterized. Our analysis is based on Simpson et al. (21) (21), and is limited to 5mC at CpG sites. However that is not a limitation of the technology, as has been demonstrated by others ((22), Tombo (bioRxiv doi:10.1101/094672), mCaller (bioRxiv doi:10.1101/127100)). Second, improvements can be made in reaching true nucleotide-resolution methylation calls. Where multiple CpG sites occur within less than twice the k-mer length (here 6), all these sites are considered to have the same methylation state. Again, this is not a limitation of the technology, as more complete training data will allow resolution of mixed methylation states. Finally, we see opportunities for improvement in the analysis of nanopore methylation data. Instead of binary calls from bisulfite sequencing, the output of nanopore sequencing is a likelihood ratio that the site is methylated versus unmethylated. Currently, there is no method for the detection of differential methylation that accepts these continuous values as input. Additionally, the DMR detection algorithm that we used was designed originally for bisulfite data, and we expect that algorithms designed specifically to incorporate long reads and probabilistic methylation assignment would achieve greater levels of accuracy.

CONCLUSIONS
We demonstrate that long-read sequencing using nanopore technology can efficiently generate haplotyped mammalian methylomes. With no additional sample preparation than that routinely used for basic sequencing and with only a mean coverage of ∼10×, we identify differential allelic methylation throughout the genome. Combined with expression data, this improves the resolution of imprinting analyses. Our approach is widely applicable to other systems, for instance with more complex genetics, or to phase cancer mutations with methylation state and to determine the effects of structural variation on methylation.

DATA AVAILABILITY
All sequencing data are available at ENA under study accession ERP109201. Processed data can be explored via a Genome browser (63,64), interactive plots (37) and a summary page available from bioinf.wehi.edu. au/haplotyped methylome. All analysis scripts are available on GitHub at github.com/scottgigante/haplotypedmethylome.