Low-input and single-cell methods for Infinium DNA methylation BeadChips

Abstract The Infinium BeadChip is the most widely used DNA methylome assay technology for population-scale epigenome profiling. However, the standard workflow requires over 200 ng of input DNA, hindering its application to small cell-number samples, such as primordial germ cells. We developed experimental and analysis workflows to extend this technology to suboptimal input DNA conditions, including ultra-low input down to single cells. DNA preamplification significantly enhanced detection rates to over 50% in five-cell samples and ∼25% in single cells. Enzymatic conversion also substantially improved data quality. Computationally, we developed a method to model the background signal's influence on the DNA methylation level readings. The modified detection P-value calculation achieved higher sensitivities for low-input datasets and was validated in over 100 000 public diverse methylome profiles. We employed the optimized workflow to query the demethylation dynamics in mouse primordial germ cells available at low cell numbers. Our data revealed nuanced chromatin states, sex disparities, and the role of DNA methylation in transposable element regulation during germ cell development. Collectively, we present comprehensive experimental and computational solutions to extend this widely used methylation assay technology to applications with limited DNA.


Introduction
DNA modifications, including 5-methylcytosines (5mCs) and 5-hydroxymethylcytosines (5hmCs), are canonical forms of epigenetic modification in human and other mammalian genomes.DNA methylation is found mainly in CpG dinucleotide contexts, where it is extensively implicated in gene transcriptional regulation, cell identity maintenance, organismal development, aging, and diseases ( 1 ).Infinium DNA methylation BeadChips are among the most popular genomewide methylation assays in humans and other species due to the ease of experiment and data analysis ( 2 ).These arrays have been the primary data generation workhorse for large data consortia such as The Cancer Genome Atlas (TCGA), with public methylome profiles of over 80 000 HM450 samples ( 3 ) and a similar number of EPIC array methylome profiles deposited to Gene Expression Omnibus (GEO).While the adoption of sequencing-based methods is catching up in case and mechanistic studies, the Infinium technology remains the most used assay platform for population-level studies such as meQTL studies ( 4 ,5 ), epigenetic risk scores ( 6 ,7 ), and other epigenome-wide association studies ( 8 ,9 ).This is, in part, due to the necessity in population studies to cover a large number of samples with nuanced variation in methylation levels and to dissect multiple cohort covariates such as sex, age, genetic background, and tissue type.In addition to being a powerful and popular tool for biological discovery, the technology has recently enabled rapid clinical application development ( 10 ).This platform has found wide success in cancer diagnosis ( 11 ), cell-free liquid biopsy ( 12 ), and forensics ( 13 ).Recently, the Infinium BeadChips have also been used to generate the largest DNA methylome atlas across different mammalian species (14)(15)(16).
Despite these successes, a significant drawback of this technology is that it requires over 200 ng of input DNA from the standard processing protocol ( 2 ).This requirement constrains scientific and clinical applications with limited DNA availability.For example, as few as 25 primordial germ cells (PGCs) can be found in the early mouse embryo ( 17 ).Serum-derived, tumor-originated cell-free DNA (cfDNA) in cancer patients ( 18 ) holds value in non-invasive early cancer diagnosis ( 19 ).But as little as five ng / ml cfDNA in healthy subjects and 30 ng / ml in cancer patients ( 20 ) may be available.Also, DNA obtained from crime scene traces are often in the picogram to nanogram range ( 21 ).The DNA concentrations obtained from these tissues are much lower than the Infinium array standard protocol requires.
In addition to circumstances where the input DNA quantity is limited, it may be of interest to study a complex tissue to dissect cell-to-cell heterogeneity purposefully, even when there is no shortage of total DNA in the tissue.DNA methylation encodes distinctive cell identity fingerprints, which can be used to infer cellular phenotypes ( 1 ), trace the state of the DNA-releasing cells ( 22 ), and infer cell proportions ( 23 ).By performing DNA methylation analysis on laser-capture microdissected specimens, one can compare methylomes at different locations in tumors ( 24 ,25 ) or select specific cell types from the brain for analysis ( 26 ).In most cases, laser-capture microdissected tissues are limited in quantity and involve pre-assay whole genome amplification, as is done with SNP arrays ( 27 ).
The extreme of increasing the cell resolution at the cost of working with small DNA amounts is epitomized by the rapid development of single-cell DNA methylome assay technologies ( 28 ).Each human or mouse cell carries 5-7 pg.In the past decade, most technologies have been based on preamplifying deaminated DNA (29)(30)(31) using random priming or single-strand adapter ligation before feeding amplified DNA to high-throughput sequencing.In addition, multiple enzymatic cytosine conversion methods have been developed to replace sodium bisulfite conversion to better preserve genomic DNA during library preparation ( 32 ).Inspired by these singlecell methods, we posit that similar preamplification methods and enzymatic conversion can also be used with Infinium arrays.
Besides the experimental challenges, current computational methods are not fully optimized for low-input DNA data.Current signal preprocessing practices may lead to low probe signal detection rates and probe over-masking.Signal detection has traditionally been determined by comparing probes' signal intensities with negative control probes ( 33 ) or readings in the Infinium-I out-of-band channel ( 34 ).A conservative threshold of the detection P -value was then determined to mask lowintensity probes.When the foreground and background signal intensities are separable, using a conservative threshold on high-input data would not harm detection sensitivity.This approach leads to a significant loss of true biological signals for limited DNA input.A method that maximally preserves biological signals while removing pure artifacts remains an unmet need.
Here, we systematically developed and evaluated experimental and computational methods to improve array sensitivity at low-input ranges and single cells.Our evaluation encompasses previously attempted adaptations, including using different bisulfite conversion elution ( 35 ), using Formalin-Fixed Paraffin-Embedded (FFPE) restoration ( 36 ), combining bisulfite conversion with DNA extraction ( 37 ) as well as methods that were never previously used with Infinium arrays, such as the enzymatic conversion and different preamplification strategies.We developed a new signal detection framework to address the computational challenge of processing data from limited DNA.We showed that this new method significantly improved array detection rates while effectively masking probes whose readings are dominated by background signals.We showed that the Infinium BeadChip is compatible with samples of low input down to single cells.And we presented end-to-end solutions to enable this technology for lowinput and single-cell samples.

DNA extraction and bisulfite conversion
NIH3T3 and B16-F0 cells were harvested by centrifugation at 100g for 5 min at room temperature and washed twice using PBS (Gibco, 10010023).The DNeasy Blood and PAGE 3 OF 16 Tissue Kit (Qiagen, 69504) was used to extract genomic DNA from NIH3T3 and B16, according to the manufacturer's protocol.DNA samples were quantified using Qubit 4.0 Fluorometer (Invitrogen) using the dsDNA HS Assay Kit (Invitrogen, Q33231).Bisulfite conversion was performed using three kits.DNA bisulfite conversion using EZ DNA Methylation Kit (Zymo Research, D5001) was performed according to the manufacturer's instructions with the specified modifications for Illumina Infinium Methylation Assay.DNA bisulfite conversion using EZ DNA Methylation-Gold Kit (Zymo Research, D5005) and EZ DNA Methylation-Direct Kit (Zymo Research, D5020) was performed according to the manufacturer's protocol.Cell lysis and bisulfite conversion from sorted cells and PGCs were performed with EZ DNA Methylation-Direct Kit according to the manufacturer's instructions.

DNA restoration
After bisulfite conversion, the bisulfite-converted DNA was eluted, resulting in an 8 μl volume.The Infinium HD FFPE DNA restoration kit (Illumina, WG-321-1002) was then used according to the manufacturer's instructions.Following a 1min incubation, the elution of the DNA was carried out using autoclaved ultrapure water for a 10 μl elution volume.The eluted DNA was stored at -20 • C before undergoing Infinium array processing.Intermediate DNA purifications were performed using the DNA Clean and Concentrator-25 Kit (Zymo Research, D4064).

Cytosine conversion elution size optimization
The Illumina Infinium Mouse Methylation BeadChip assays were conducted according to the manufacturer's specifications with slight modifications ( Supplementary Tables S1 A and S1 B).The original protocol specifies using 4 μl obtained from 12 to 22 μl eluted BCD, mixed with 4 μl 0.1 N sodium hydroxide for amplification and BeadChip reaction (Infinium HD Assay Methylation Protocol Guide 15019519 v01).However, commercial bisulfite conversion kits typically produce over 10 μl elution in purifying the converted DNA, leading to only part of the eluted DNA (4 μl) used for the BeadChip assay.Previous studies have adjusted the elution size or additional concentration steps to minimize DNA loss.For example, one option is to mix 7 μl eluted DNA with 1 μl 0.4 N sodium hydroxide (NaOH) (40)(41)(42).DNA input can be maximized by increasing NaOH concentration in denature step of the Infinium array.To preserve more input DNA, we compared four alternative combinations of elution buffer, input DNA volume, and NaOH concentration and volume.

Enzymatic methyl (EM)-array sample preparation
Libraries were prepared using the NEBNext Enzymatic Methyl-seq (NEB, E7120S) kit, following the manufacturer's instructions.50, 5, 2, or 0.5 ng of 5mC adaptorligated NIH3T3 DNA was used as input.HiFi HotStart Uracil + Ready Mix (KAPA Biosystems, KK2801) was used to amplify the libraries following conversion before purification over SPRI beads (0.8 × left-sided) and elution in nuclease-free water to yield final libraries.Libraries were then quantified by Qubit HS (Invitrogen, Q32851) and quality checked on an Agilent Bioanalyzer 2100 before sequencing on an Illumina MiSeq instrument to confirm conversion efficiencies.Details of each input's elution size and library amplification cycles are listed in Supplementary Table S1 C.

ELBAR detection P -value calculation
In the standard Infinium BeadChip usage, > 200 ng DNA is profiled, and probe detection calling is employed to filter out probes whose signals are subject to substantial background influence.However, this practice will cause significant biological signal loss from low-input datasets where users often seek to retain the most biological signal but can tolerate some background influence.To meet this need, we developed ELBAR (Eliminating BAckground-dominated Reading) to exclude / mask only probes lacking biological signals and entirely dominated by background signals.The ELBAR method is based on the observation that the beta value ranges depend on the probes' total signal intensities and includes the following steps.First, we define total signal intensities as the sum of signals methylated (M) and unmethylated (U) alleles.Pooling in-band and out-of-band signals, ELBAR bins probes by log 2 -transformed M + U signal intensities.Next, ELBAR calculates the upper and lower bounds (calculated using the 5% and 95% quantiles to accommodate outliers, defined as the beta value envelope ) of each bin as M + U varies.Third, we define the background signal by looking for the first bin that deviates in the beta value envelope of the bin from the smallest M + U. Lastly, these probes' maximum M and U signals were treated as the true background signal to compute detection P -values.

Infinium BeadChip data preprocessing and analysis
The Illumina Infinium BeadChips technology is based on sodium bisulfite conversion of DNA, with single base resolution genotyping of targeted CpG sites via probes on a microarray.Probes are designed to match specific 50 base regions of bisulfite-converted genomic DNA, with a CpG site at the probe's 3 end ( 47 ).Upon hybridizing with bisulfite-converted DNA, the probe undergoes a single-base extension that incorporates a fluorescently labeled ddNTP at the 3 CpG site, enabling distinguishing of the C / T conversion resulting from bisulfite conversion.The fluorescent signals provide insight into the methylation status (methylated or unmethylated) of specific cytosine residues in the DNA sample.Signal intensity refers to the strength of the fluorescent signal emitted by the hybridized probes on the BeadChip.The signal intensity is directly related to the quantity of target molecules bound to specific probes.Probe success rates represent the proportion of successfully captured CpG probes for targeted CpGs.
The IDAT files generated, along with all public datasets utilized, were processed using the SeSAMe R package.This encompassed preprocessing, quality control, and analysis, adhering to the established preprocessing workflow ( 43 ).The probe detection p-value was computed using the pOOBAH algorithm, which leverages the fluorescence of out-of-band (OOB) probes.Subsequently, normalization was performed using noob, which applies a normal exponential deconvolution of fluorescent intensities based on the OOB probes.
Additionally, a dye bias correction was applied using the dye-BiasNL function.Infinium Methylation BeadChip manifest and annotations data, which include gene, chromatin state, sequence context, the 'PGCMeth' probes list (designed to target CpGs highly methylated in E13.5 PGCs) ( 48 ), and other functional annotations, were obtained from http://zwdzwd.github.io/InfiniumAnnotation .Metagene plot was generated using the KYCG_plotMeta function in the SeSAMe package.To calculate the F1 scores of each sample against the 250-ng control (205243950081_R01C01), a beta value greater than 0.5 was rounded to one and set to zero otherwise.Then, the F1 score is calculated by treating one as true and 0 as false and comparing the target sample with the 250-ng control, i.e.F1 = 2TP / (2TP + FN + FP) where TP is the true positive counts, FP is the false positive counts, and FN is the false negative counts.

Results
Characterizing the suboptimal DNA signatures in public Infinium datasets Suboptimal DNA quality and quantity impact Infinium methylation data through the manifestation of lower signal intensity and probe success rate.We first studied the probe success rate of over 100 000 public Infinium datasets deposited to GEO to identify the determinants of Infinium array performance (Figure 1 A).Comparing DNA sources, we observed that bone, buccal, plasma cfDNA, esophagus, and saliva often yield data with suboptimal probe detection success.Further dataset stratification by sample preservation reveals that FFPE samples are significantly lower in probe success rate than non-FFPE samples.The lower signal intensity leads to skewing of beta values, which represent the methylation level at a specific site, towards an intermediate reading at 0.5 due to stronger relative influences of the signal background (Figure 1 B-D) ( 34 ).This asymptotic convergence to 0.5, defined by the upper and lower bounds of beta values, forms a beta value envelope .This increase in the background signal influence is a continuous spectrum instead of a dichotomy of detection success vs. failure.When the input DNA is of high quality and quantity, most probes have stable and clustered signal intensities, leading to a bimodal beta value distribution.But in suboptimal datasets such as from FFPE and cfDNA samples, more probes carry lower signal intensities, leading to beta values approaching 0.5 (Figure 1 C).This drop in sample quality in FFPE samples is likely intertwined with low DNA input amount, as supported by a similar transition of intermediate methylation readings in the low input data (Figure 1 D and Supplementary Figure S1 A).We also found that FFPE and cfDNA samples lose detection at different genomic regions.cfDNA, saliva, and buccal cells preferentially lose detection at GC-rich promoter sites such as TssA and TssBiv, while FFPE samples are less biased across genomic territories (Figure 1 E).The full names of the chromatin states are available in Supplementary Figure S1 B.

Infinium BeadChip workflows for low DNA input and single cells
To improve signal detection in low-input experiments, we developed 13 non-standard workflows using: (i) preamplification of the genomic DNA (see Supplementary Methods ); and (ii) enzymatic cytosine conversion methods, besides other protocol adjustments that preserve DNA load (see Figure 2 A, Supplementary Tables S1 A-S1 C).The probe success rate and F1 scores were presented in Figure 2 B-D for three workflows, each representing distinct characteristics: the unmodified workflow (Workflow A), the workflow with maximized elution size (Workflow C), and the most optimized workflow with preamplification (Workflow J).Additionally, Figure 2 C and D included a workflow achieved through enzymatic base conversion, labeled as Workflow M. In Supplementary Figures 2 B-2 D, we illustrated probe success rates, F1 scores and Spearman correlations for all 13 workflows, encompassing the four previously mentioned workflows.We found that the standard Illumina workflow (Workflow A) can detect signals on 70% array probes with two ng DNA without modification, consistent with our prior characterization of the EPICv2 BeadChip ( 49 ).In the sub-2-ng range, the detection success rate drops rapidly for Workflows A and B (Figure 2 B, C, and Supplementary Figure S2 B).Enzymatic base conversion (Workflow M) maximizes signal detection (84%) in 0.5 ng-input experiments, followed by a whole-genome preamplification-based method (Workflow J) and one with elution size modification alone (Workflow C).Due to the allelic nature of DNA, we use the F1 score (Materials and methods), which binarizes beta values for comparison to the reference sample, rather than correlation.In the five-cell experiments, Workflow J (Figure 2 B and D, and Supplementary Figure S2 C) consistently reaches over 70% in signal detection (pOOBAH < 0.05) and close to 90% in the F1 score.Notably, Workflow J detects around 25% probes in single cells with an F1 score > 70%, suggesting that the detected data is biologically informative despite a higher rate of detection failure.In contrast, the standard workflow is consistently under 50% in the F1 score, suggesting more biologically misleading readings.
For 0.5 and 2 ng input ranges, Workflow J also improved the data quality as indicated by the higher correlation coefficient with a 250-ng dataset (Figure 2 E and F, and Supplementary Figure S2 D).The improvement is most evident in the recovery of intermediate DNA methylation, which is only observed in workflow J at 0.5 ng (Figure 2 F), suggesting that preamplification helps all samples from fewer than ten cells to those with over 1 ng input (Figure 2 E).
Previous studies have shown that FFPE restoration kits may improve signal detection on DNA of suboptimal quality ( 36 ).The FFPE restoration kit did not significantly improve the array performance with 50 ng and 0.5 ng non-FFPE DNA input (Workflows D and E, Supplementary Figure S2 E).We also did not observe substantial performance differences among the three different bisulfite conversion kits ( Supplementary Figure S2 F).EZ-direct kit produced data of a slightly better probe detection rate likely due to the minimization of DNA loss from a single-step bisulfite conversion and purification.
We also compared preamplification based on random hexamers (N6) and random hexamers with a T7 primer at the 5end (N6-T7) for whole genome amplification ( Supplementary Figures S2 F and S2 G) ( 50 ).Workflow J showed the highest Spearman`s rho compared to other workflows with preamplification (Workflows F-I).The T7 sequence in N6-T7 serves as a second primer, allowing further PCR amplification.However, additional PCR cycles did not improve array performance (Workflows K and L, Supplementary   PAGE 7 OF 16 we followed the N6 amplification strategy for the subsequent analysis.

Optimized workflow resolves intercellular heterogeneity while maintaining cell line identity
DNA methylomes profiled from a small number of cells often reveal cell-to-cell heterogeneity.We next tested whether our low-input method can uncover this heterogeneity and whether the cell population averages reduce to measurements from high DNA input.We merged the single and five-cell methylomes, respectively, and compared the combined data with the 250-ng methylome (Figure 3 A).We found that the merged methylome reinstated the intermediate methylation measurements which are otherwise missing from the single low-input experiment representatives.As DNA methylation readouts are allelic (taking only 0%, 50%, and 100% in diploid cells), we expect a reduction of non-allelic fractions as the cell population becomes less heterogeneous.Focusing on CpGs showing intermediate methylation (0.3-0.7) in the 250-ng data, we observed a gradual dichotomization of methylation levels approaching 0% and 100% in the 10-cell ( n = 4) datasets and more in the five-cell ( n = 5) datasets (Figure 3 B).This polarization is likely due to a higher genomic DNA homogeneity from reduced cell numbers.The lingering non-allelic methylation fractions are likely due to amplification bias and residual cell-to-cell heterogeneity.Given the same input cell number, preamplification (Workflow J) retained more intermediate readings (Figure 3 C).The standard workflow (Workflow A) became nearly fully dichotomized at ten cells and struggled globally in the 5-cell experiment (Figure 3 C).Despite the resolution of intermediate methylation levels, small-cell-number data largely retain global and focal methylation differences intrinsic to the cell type identity ( Supplementary Figure S3 A).All datasets from five to ten cells cluster with the 250 ng datasets of the same cell line on a tdistributed stochastic neighbor embedding (tSNE) projection (Figure 3 D).Most single-cell samples are also grouped accordingly despite the erroneous placement of two single-cell B16 samples, likely due to lack of biological signal, as suggested by their ultra-low probe detection rates (0.13) compared to other samples of similar input DNA amounts.The five-cell datasets are clearly separated in a metagene plot that suggested a higher global methylation for the B16 cells at all input ranges ( Supplementary Figure S3 B).Finally, the differentially methylated CpGs between the two cell lines are largely preserved in five-cell methylomes (Figure 3 E).Random discrepant methylation did occur more frequently at CpGs intermediately methylated in the 250 ng datasets for the two cell lines, respectively (Figure 3 E), enriching for bivalent chromatin ( Supplementary Figures S3 C and S3 D).Collectively, these data suggest the Infinium arrays can robustly profile fivecell samples.While the Infinium arrays can profile single cells, their performances are unstable (Figure 3 D).

GC-rich and high copy number regions retain detection in low input datasets
We next explored which genomic regions are most susceptible to detection loss in high and low input experiments.We first observed that signal intensities intrinsically depend on the probe sequences.We studied the within-sample intensity Z -score of HM450 autosomal probes across 749 normal samples from the TCGA cohort (Methods) (Figure 4 A, Supplementary Figure S4 A).The violin plot displays the complete set of Z -scores for probes in the cohort.Data points corresponding to individual samples were smoothed and provided a comprehensive view of the overall distribution pattern.The Z -score distribution for different probes shows clear probe dependence.Probes at the high and low signal intensity extremes have little overlap, suggesting a strong sequence dependence.Probes targeting GC-rich regions (as indicated by the number of 'C's in the probe sequences since 'G's are replaced by ' A' s to pair with 'T's from bisulfite conversion) are associated with higher signal intensities (Figure 4 B).This is supported by an enrichment of the detected probes in CpG islands (Figure 4 C, D, and Supplementary Figure S4 B), gene promoters, transcription factor binding sites, e.g.TFAP2C, and promoter-associated histone modifications, e.g.H3K64ac ( Supplementary Figures S4 C and D).
Consistently, the probes that fail detection in 250-ng and 5-cell samples are significantly enriched in low-CpG density regions (Figure 4 C).PMD solo-WCGWs-CpGs flanked by A / Ts and with no other CpGs within a 70-bp neighborhood at partially methylated domains ( 51 )-are observed to lose most signal detection, consistent with their CpG-sparse nature.Probes targeting non-CG cytosine methylation also tend to lose detection in low-cell-number samples.Interestingly, mitochondrial CpG probes, transposable element (TE) CpGs, and other multi-mapping probes have the least probe detection loss in low-input datasets.The mitochondrial genome showed nearly 100% probe detection success in single, and two-cell experiments.This is likely due to the high copy number of mitochondrial genomes per cell ( 52 ).Similarly, other high copy number repetitive elements, such as the Satellite, B1 elements, and other SINE1 / 7SL elements, also show high probe success rates (Figure 4 E and F).These results suggest that the multi-mapping probes may be used as a TE profiling tool for low-input samples.More prevalent heterogeneity in DNA methylation within quiescent chromatin was observed.CpGs displaying lower methylation levels in bulk tissues but higher levels in individual five cells (Figure 4 G, right panel), or vice versa ( Supplementary Figure S4 E, right panel), are both enriched in quiescent chromatin regions.In contrast, promoters (Tss) and gene bodies (Tx) showed consistently low (Figure 4 G, left panel) and high methylation levels ( Supplementary Figure S4 E, left panel), respectively, in both the bulk tissue and the 5-cell samples.

ELBAR preserves more signal detection for low-input datasets
The conventional detection P -value calculation aims at preventing false discovery in high-input datasets, where probes with suboptimal signal intensity are rare and a relatively clear decision boundary can be found.In low-input samples, more probes carry lower signal intensity and can overlap with measurements purely dominated by background signals ( Supplementary Figure S5 A).Applying the same detection P -value threshold may lead to a significant loss of biologically useful readings.To better balance sensitivity for lowinput datasets, we developed the ELBAR algorithm, based on the observation that probes dominated by signal-backgroundonly are always associated with intermediate methylation readings (Figure 5 A).In brief, ELBAR looks for low-signal probes with intermediate methylation to model the background signal (Methods).Doing so can effectively remove  background-induced artificial readings while minimally removing probes with biological signals (Figure 5 B).In the cell line experiments, probes that survive ELBAR masking maintain a bimodal distribution of beta values as biologically expected.In contrast, pOOBAH, a prior method designed for high-input datasets, masked probes more aggressively.The probes surviving pOOBAH masking are slightly asymmetric in the beta-value envelope and show a small amount of background-dominated beta values (Figure 5 B).ELBAR effectively masked these beta values associated with low signal intensity and artifactually fixed around 0.5.
Testing ELBAR on public EPIC, HM450 (Figure 5 C, Supplementary Table S2 A), and MM285 (Figure 5 D) datasets, we found that it could rescue a significant number of probes compared to pOOBAH.Interestingly, experiments with array-wide failure remain low in detection rates, suggesting ELBAR can discriminate probe failure against arraywide failures.The probes rescued by ELBAR from pOOBAH show biological relevance, evidenced by higher correlation with the 250 ng datasets ( Supplementary Figure S5 B).Of note, ELBAR combines negative control probes for background calibration and only considers intermediate methylation reading from low-intensity probes.Hence, its masking would not be influenced by true biological methylations.For example, we validated ELBAR's performance in samples with globally high, intermediate, and low methylation levels ( Supplementary Figures S5 C-E), including testicular seminoma tissues ( Supplementary Figure S5 F).ELBAR improves detection in wide input ranges (Figure 5 E, P -value = 2.7E-8, t -test of the method sensitivity difference in multiple linear regression) without harming accuracy (Figure 5 F, no statistical significance detected from method accuracy differences).
To further validate ELBAR's performance in FFPE samples, we compared ELBAR, pOOBAH and minfi's detectionP function in a previous study of 53 melanoma FFPE tissue samples ( 53 ).We used five samples with the best detection p -values to derive a putative ground truth methylation profile and evaluated the measurement deviation in probe sets stratified by the masking status under the three methods (Figure 5 G).Probes that survive all three masking methods have the lowest methylation level deviation as expected, followed by the probes masked only by pOOBAH.These probes are the greatest in number compared to other probe masking groups, suggesting that pOOBAH may have caused a significant loss of biological signal in this dataset.Overall, ELBAR-masked probes are associated with higher measurement deviation from the ground truth (dark red in Figure 5 G), despite that they may survive the masking by the other two methods.Collectively, these results point to an advantage of using ELBAR for detection calling over pOOBAH and detectionP in FFPE samples.

Low-input BeadChip data captures the demethylation dynamics in primordial germ cells
PGCs are typically present in low numbers, hindering their analysis by the standard Infinium array processing workflow (54)(55)(56).In mammals, PGCs undergo genome-wide epigenetic reprogramming, including global DNA methylation loss, as they migrate from the epiblast to the bipotential gonads ( 57 ).This corresponds to embryonic day(E)7.5 to E14.5 of development in the mouse.We applied our optimized method to study the methylation of mouse gonadal PGCs collected at E11.5 to E14.5 (Figure 6 A).For each time point, PGCs from a pair of embryonic gonads were FACS sorted (Methods), and the aliquoted volume varied from 0.25 μl to 9 μl.We employed workflow J, with pre-amplified DNA amounts ranging from ∼1 ng to 13 ng ( Supplementary Table S2 B).As a contrast, we included methylome profiles of mouse liver, lung, ovary, and testes tissues in our analysis (Methods, Supplementary Table S2

C).
Consistent with prior knowledge, PGCs exhibit the lowest methylation level relative to somatic tissues and adult gonads across major chromatin states (Figure 6 B and C).Consistent with the probe design rationale, 'PGCMeth' probes showed resistance to methylation in E11.5 and E13.5 PGCs.The genome-wide distribution of PGC methylation loss is largely uncoupled from their methylation states observed in non-PGC tissues.For example, the genomic regions with active gene transcription (ChromHMM states 'Tx' and 'TxWk') were associated with the highest methylation in non-PGC tissues (Figure 6 B).But in PGCs, gene bodies are less methylated than heterochromatin and transcriptionally quiescent regions (Figure 6 B and C).We observed a partial methylation loss in E11.5 PGCs.The demethylation process culminated in E13.5 PGCs.Male PGCs initiated re-methylation as early as E14.5, while female PGCs remained similarly unmethylated at E14.5 as E13.5.This is consistent with de novo methylation occurring in female germline postnatally as oocytes are recruited for growth during each reproductive cycle ( 58 ).This sex disparity in methylation rebound is evident in imprinted geneassociated differentially methylated regions (DMRs) and gene bodies (Figure 6 D and Supplementary Figure S6 A).
The arrays allow for detailed analysis of the timing of the DNA methylation change across genomic elements.For example, the retained DNA methylation at heterochromatin is enriched at TRIM28 binding sites (Figure 6 E and Supplementary Figure S6 B).TRIM28 regulates the transcription of TE, particularly endogenous retroviruses (ERV) ( 59 ).Specific ERV elements are known to retain DNA methylation in human PGC development ( 60 ) and mice ( 61 ).These observations suggest a critical role for DNA methylation in TE suppression for maintaining germ cell genome integrity and intergenerational epigenetic inheritance.
Interestingly, PGC residual methylation is also enriched for the binding of the DNA methylation reader proteins MECP2 and MBD1, supporting their reported roles in DNA methylation-mediated TE regulation ( 62 ).Among TE DNA families, LTR elements were more resistant to PGC demethylation than SINEs and LINEs, consistent with their functions as transcriptional promoters ( 63 ).The role of methylation retention in TE regulation is further supported by the higher methylation retention in evolutionarily younger TE subfamilies than older TE families ( 64 ) (Figure 6 F).Intracisternal A Particle (IAP) elements were previously highlighted to exhibit extensive methylation retention in PGCs ( 65 ).We identified a heterogeneous pattern of their demethylation dynamics in PGCs.IAPLTR2, IAPEY and IAPA are among the most resistant families, while IAPEY5, IAPLTR4 and IAP5 are the least methylated.
In addition to TEs, imprinted gene DMRs, germ cells, and placenta-specific hypomethylated sites show later DNA methylation loss compared to the rest of the genome.This is due to active DNA demethylation pathways, mediated by TET proteins, that are required for methylation erasure at imprinting control regions ( 39 ,66 ).CGs flanked by A / Ts are more susceptible to aging-associated DNA demethylation ( 51 ).We did not observe this sequence context preference for PGC development ( Supplementary Figure S6 C), suggesting a distinct, TET-mediated demethylation mechanism.

Discussion
Despite the successful employment of Infinium BeadChips in population-scale DNA methylome studies, their potential for 'difficult' DNA, i.e. when the input is limited in quality, quantity, or both, has not been fully explored or optimized.This restricts the scope of the Infinium array usage for cell-free DNA, microdissected tissues, and other samples of limited availability.Here, we presented experimental and computational resources to enable array usage in these suboptimal settings, especially with low-input DNA.Experimentally, we explored the array's compatibility with random priming-based whole genome preamplification and enzymatic base conversion by TET / APOBEC3A.Our data suggests that both preamplification and enzymatic base conversion by TET / APOBEC3A using the NEBNext Enzymatic Methyl-seq (Workflow J).Computationally, we developed ELBAR for preserving biological signals from suboptimal input datasets.ELBAR excludes only probes dominated by background noises.Besides, we comprehensively characterized the biological and technical determinants of array performance from public datasets.From surveying 100 000+ datasets and using probe detection rate as the main performance metric, we found that cell-free DNA, saliva, bone and FFPE samples, tend to have worse detection rates compared to cultured cells, primary and fresh frozen tissues.FFPE samples and those of sheer lower input show a different genomic distribution in signal detection.

PAGE 13 OF 16
Plasma cfDNA tends to cause detection loss at CpG-sparse, GC-low genomic regions and preserved detection at CpG-rich regions such as bivalent chromatin.In contrast, FFPE-induced detection loss is less biased across genomic regions.This lowinput sample bias can be attributed to the weaker intrinsic signal from GC-low probes (Figure 4 C and D).As cfDNA localization is known to be linked to nucleosome footprints and can inform cell of origin ( 67 ), the array signal intensity bias may serve as an unconventional source of cell of origin signal to complement the methylation signal that the array data already carries.
Although Illumina recommends a minimum of 200 ng of DNA for current Infinium BeadChips, there has been research exploring lower input amounts.For example, reproducible results were achieved with over 125 ng of DNA from peripheral blood ( 68 ), and an input of 16 ng exhibited a high correlation with 500 ng input, albeit with lower reproducibility.Another study identified 75 ng as the minimum requirement ( 69 ), and a few other studies suggested that 10 ng is acceptable ( 35 , 70 , 71 ).The fact that the standard protocol works at these sub-optimal input ranges is likely due to the isothermal amplification.However, the true lower input limit of the technology remains underexplored, and the extent to which precision and sensitivity are compromised with decreased DNA input remains incompletely resolved.Our work shows that Infinium arrays are reasonably compatible with picogram-range input DNA or single-digit cell number.Preamplification using Klenow fragments and enzymatic conversion further magnifies probe signal intensities, leading to reproducible profiling of five-cell methylomes.Interestingly, additional adapterbased PCR amplification did not lead to a further increase in probe detection or accuracy ( Supplementary Table S1 D), likely due to loss of library complexity from amplification bias ( 72 ).
The Infinium technology is currently not cost-competitive for projects that only massively profile methylomes in single cells due to the incompatibility with the sample multiplexing ( 73 ).The focus of this work is to explore the capacity of Infinium technology in profiling samples of variable quantities and whether the data is comparable.This is most relevant to applications where DNA can be of high or limited quantities, such as microdissected tissues ( 24 ) and cell-free DNA.In fact, techniques shared with sequencing-based methods, such as preamplification using Klenow or other polymerases ( 27 ), would benefit both high and low input samples (e.g. the 50-ng samples in Figure 2 C and D).We showed that lowinput Infinium BeadChip data is comparable to high-input data and is biologically relevant.It reflects the allelic nature of the DNA methylation signal on low DNA input.Intermediate methylation levels were resolved to high and low methylation readings.These binary readings reflect cell-to-cell heterogeneity and when merged, their population averages recapitulated bulk input measurements.Our single-cell array data reached 20% detection, similar to previous deep single-cell WGBS datasets ( 30 ).
Previous data analysis workflows relied on a single threshold for determining signal detection success ( 74 ).While this is a viable assumption for the high input data, it does not always hold for low-input datasets.In low-input datasets, biological signals overlap more with background signals in signal intensity, particularly for probes with intrinsically low foreground signals.Our analysis showed that this intrinsic propensity is linked to the number of Cs in the probe sequences, likely rem-iniscent of a GC content bias as the probes are designed to be G-less to pair with converted genomic DNA.The stronger overlaps of biological with background signal not only obfuscate the detection discrimination but also bias the beta value calculation towards 0.5 due to the incomplete subtraction of signal background from the observed compound signal.Users should consider this major tradeoff of measurement precision for sensitivity.
Compared to pOOBAH and other detection calling methods designed to minimize false discovery in high-input datasets, ELBAR seeks to mask only probes fully dominated by signal background, leaving probes with true biological signals visible in downstream analysis.However, we caution that probe readings surviving ELBAR detection may be influenced by background signals to various degrees, leading to unstable quantitative accuracy.The detection p -values can serve as measures of background influences besides their traditional use for probe masking.For high-input samples, pOOBAH and ELBAR perform similarly (Figure 5 E and F), suggesting that one may use ELBAR for data from all input settings.
Despite the reduced probe detection in low-input datasets, probes that target multi-copy DNA, such as mitochondria and repetitive elements (e.g. the B1 elements and satellite sequences), retain high signal intensities.In the low-input datasets, we observed that these probes measure aggregated signals from multiple genomic loci, making an unconventional use of the methylation BeadChips-as a tool to study the global epigenetic regulation of multi-copy TEs.In our work, we applied our low-input protocol to profile mouse PGCs.We validated the dynamics of global methylation erasure in PGCs, a sex disparity in remethylation, as well as the demethylation resistance at TRIM28 binding sites which are known to escape germ cell epigenetic remodeling ( 60 ).These multi-mapping probes also revealed that evolutionarily younger LTR repeat families retained more methylation than other repeat families.These methylation retentions can protect germ-line genome integrity from TE transcriptional mobilization.

Conclusion
We presented experimental and computational solutions for applying Infinium BeadChips to low-input and single-cell samples.Based on whole-genome preamplification and enzymatic base conversion, our new methods revealed a previously underappreciated low-input potential of this popular methylation profiling assay.We demonstrated the power of these methods by applying them to uncover detailed demethylation dynamics of murine primordial germ cell development.

PAGEFigure 1 .
Figure 1.L o w er input arra y s e xhibit lo w er signal intensity and lo w er probe success rates.( A ) T he probe success rate in 105475 public Infinium meth ylation B eadChip arra y data sets.( B ) Schematic sho wing the dependence of beta v alue on the probe's tot al signal intensit y.The x-axis represents the total signal intensity, and the y-axis represents beta values.( C ) The intensit y-bet a plot compares different DNA sources and ( D ) input amounts in ng.The green and red curves denote the beta value envelopes defined by the green and the red channel background signal, respectively.( E ) The probe detection rate of FFPE and cfDNA samples at different genomic regions from HM450 (left) array data and (right) EPIC array data.

Figure 2 .
Figure 2. Infinium BeadChip performance in ultra-low input ranges.( A ) A summary table of workflows used in this study.Workflow A is the Illumina standard w orkflo w. ( B ) B o x plots w ere used to visualiz e the probe success rates (top) and the F1 score (bottom).T he number of samples f or each e xperiment w as displa y ed ne xt to each bo x. (C, D) Comparison of f our main preparation methods based on ( C ) probe success rate and ( D ) F1 score.T he number on the top right corner of each tile indicates the number of samples analyzed in each experiment.See also Supplementary Figure S2 .(E, F) Smooth scatterplots for the comparison of workflows A, C, J, and M with ( E ) 2 ng and ( F ) 0.5 ng of DNA input ( R : spearman's rho, P : P -value).The dashed squares indicate intermediate beta values (0.25-0.75) on both axes.

Figure 3 .PAGE 9 OF 16 Figure 4 .
Figure 3. L o w input meth ylation arra y data unco v ered sample-to-sample heterogeneity.( A ) Smooth scatter plots f or meth ylomes from a representativ e single-cell dataset, the merged single-cell dataset ( N = 10), a representative 5-cell dataset, and the merged five-cell dataset ( N = 5), respectively, against the 250-ng input.( B ) Distribution of intermediate methylation (0.3-0.7, dashed black square) of CpGs in the 250-ng methylome in 10-and 5-cell datasets.Each blue dot represents a CpG, and the X-axes were arranged in the order from chromosome 1 to chromosome Y, with each chromosome displa y ed in alternating gray and black colors.( C ) DNA methylation level distribution of 5 cells to 50 ng with different workflows.( D ) tSNE cluster of NIH3T3 and B16 using 20 0 0 probes most v ariable in the meth ylation le v el.Fractions labeled in the plot are probe success rates f or eac h dataset.All B1 6-F0, NIH3T3, and PGCs e x cept 50, 100 and 250 ng w ere whole genomes amplified b y Workflo w I or J. ( E ) Heatmap sho wing arra y perf ormance comparison of 5-cell methylomes of B16-F0 and NIH3T3 with 250-ng methylomes.20 0 0 highly variable CpGs with the highest standard deviations of DNA methylation value among samples were used.

PAGE 10 OF 16 NucleicFigure 5 .
Figure 5. ELBAR detection preserves more probes with biological signals.( A ) A schematic illustration of the ELBAR algorithm.The x-axis represents the total signal intensity, and the y-axis represents beta values.( B ) Performance of ELBAR in eliminating background-dominated reading probes compared to pOOBAH and unfiltered in PGCs, 0.1-, 0.5-and 250-ng-input datasets.Dashed bo x es illustrate the artificial, background-dominated readings left by pOOBAH masking.The green and red curves denote the beta value envelopes defined by the green and the red channel background signal, respectively.(C, D) ELBAR performance regarding the probe success rates for public human ( C ) and mouse ( D ) array datasets.(E, F) Comparison of ELBAR performance with pOOBAH regarding probe success rate ( E , P -value = 2.7E-8, t -test of method difference in a multiple linear regression) and Spearman's rho ( F , P -value = 0.71, t -test of method difference in a multiple linear regression) in low-input datasets with DNA input ranging from single cell to 250 ng. ( G ) Comparisons of three probe masking methods: (top) the total number of readings surviving detection masking from 53 FFPE samples, and (bottom) the methylation reading deviation from putative ground truths in these 53 samples.

Figure 6 .
Figure 6.L o w-input Infinium arra y data re v eals epigenetic dynamics of primordial germ cell de v elopment.( A ) Expected DNA meth ylation dynamics during primordial germ cell de v elopments.E11.5 and E13.5 refer to embryonic days after fertilization.(B, C) DNA demethylation patterns in PGCs and tissues stratified by ChromHMM states ( B ) and mouse array probe design groups ( C ). ( D ) DNA methylation distribution in gene body regions in PGCs and tissues.M: male; F: female.( E ) Enriched genomic features by CpGs that retain methylation level (over 0.5) in the three female E11.5 PGCs.The size of each circle represents the log2 odds ratio (OR) of individual curated CpG sets, and the y-axis indicates the -log10 of the false discovery rate (FDR) for each curated CpG set.Arranged from left to right along the x-axis are four distinct sets of databases: chromatin states, probe design group, histone modification consensus, and transcription factor binding site consensus motif.( F ) DNA demethylation patterns in TEs (left) and IAP elements (right) across PGCs and tissues.