CpG Islands Recruit a Histone H3 Lysine 36 Demethylase

Summary
In higher eukaryotes, up to 70% of genes have high levels of nonmethylated cytosine/guanine base pairs (CpGs) surrounding promoters and gene regulatory units. These features, called CpG islands, were identified over 20 years ago, but there remains little mechanistic evidence to suggest how these enigmatic elements contribute to promoter function, except that they are refractory to epigenetic silencing by DNA methylation. Here we show that CpG islands directly recruit the H3K36-specific lysine demethylase enzyme KDM2A. Nucleation of KDM2A at these elements results in removal of H3K36 methylation, creating CpG island chromatin that is uniquely depleted of this modification. KDM2A utilizes a zinc finger CxxC (ZF-CxxC) domain that preferentially recognizes nonmethylated CpG DNA, and binding is blocked when the CpG DNA is methylated, thus constraining KDM2A to nonmethylated CpG islands. These data expose a straightforward mechanism through which KDM2A delineates a unique architecture that differentiates CpG island chromatin from bulk chromatin.

(A) Nonlinear fitting of the SPR signal at equilibrium with the Langmuir binding isotherm, illustrating the interaction of KDM2A with a 75 bp oligonucleotide probe containing 1 CpG. The point at which the curve intersects the vertical line represents the KD.
(B) The same analysis as in (A) was applied to data from a probe containing 6 CpGs.

Figure S2
(A) KDM2A immunoprecipitation from V6.5 mouse ES cell nuclear extract followed by western blot analysis. Full-length KDM2A protein migrates on SDS-PAGE at the expected size of 132 kDa. Smaller immunoreactive protein species (marked with *) represent KDM2A degradation products. The antibody heavy chain is marked with **.
(D) Indirect immunofluorescence with anti-KDM2A antibodies (red) and DAPI staining of DNA (blue). Endogenous KDM2A staining is found throughout the nucleoplasm in Dnmt1+/+ MEFs (upper panels). DAPI-bright punctate foci mark regions of densely methylated pericentromeric heterochromatin (lower panels).
(E) In Dnmt1-/- MEFs, which are nearly devoid of CpG methylation, KDM2A localizes to the hypomethylated pericentromeric DAPI-bright repeat regions (compare upper and lower panels). In Dnmt1-/- MEFs a proportion of cells lack clear DAPI foci (indicated with an asterisk), and in these instances focal KDM2A staining is not observed. The subpopulation of cells that lack clear DAPI-bright foci is specific to the Dnmt1-/- MEFs and may result from structural alterations caused by prolonged culture in the absence of DNA methylation. In both (D) and (E) the co-localization of KDM2A foci with DAPI-bright foci was counted in an additional 180 (D) and 172 (E) nuclei and is indicated as a percentage of total cells counted.
(E) Average KDM2A tag density at expressed and silent CpG island promoters compared to expressed and silent non-CpG island promoters. The X-axis shows position relative to the transcription start site (TSS) of each category of gene. Importantly, KDM2A tag density at both silent and expressed CpG island genes is enriched in comparison to silent or expressed non-CpG island genes, suggesting that KDM2A binding is determined by non-methylated CpG content at these genes rather than by the underlying transcriptional status of the gene. The enrichment of KDM2A at some expressed non-CpG island genes is likely due to inclusion of some non-methylated island genes that fall below the threshold for inclusion in the algorithm-defined CpG island set.
(F) Average KDM2A tag density at GC-rich compared to GC-poor promoters. As an alternative to using the CpG island algorithm to define promoter type, gene promoters were categorized as either GC-rich or GC-poor as increased GC content is a feature of experimentally identified nonmethylated islands. GC-rich promoters show clear enrichment of KDM2A tag density compared to GC-poor promoters.
(G) As observed in (E), KDM2A binding occurs at GC-rich promoters whether silent or expressed, supporting the observation that KDM2A nucleation occurs independently of the transcriptional state of the associated gene. The clear separation of KDM2A-bound promoters based on GC content also supports the contention that KDM2A binding seen at some expressed non-CpG island genes in (E) is likely due to inclusion of non-methylated island regions as a result of algorithm-based limitations.
(H) KDM2A tag density at algorithm-defined CpG island and non-CpG island clusters. The magnitude of KDM2A tag density is greater at algorithm-defined CpG island clusters than at those that fall outside the cutoffs for inclusion in the bioinformatically defined CpG island set.

Bisulfite Sequencing
Bisulfite conversion of DNA was performed using the EZ DNA Methylation-Gold Kit (Zymo Research). PCR-amplified DNA was cloned into pGEM-T Easy (Promega) and sequenced.

Immunofluorescence
Cells were fixed for 20 min in 4% paraformaldehyde (

Protein Expression and Purification
Expression and purification of KDM2A ZF-CxxC domain and KDM2A 1-747 constructs were performed as previously described (Klose and Bird, 2004) except for the KDM2A 1-747 constructs where the elution from the Ni-NTA was directly loaded onto a pre-washed StrepTactin Superflow column (IBA). The sample was allowed to flow over the column several times and the resin was then washed with wash buffer (100 mM Tris-Cl pH 8, 150 mM NaCl) and eluted in elution buffer (100 mM Tris-Cl pH 8, 150 mM NaCl, 2.5 mM desthiobiotin). Purified protein was dialyzed overnight in BC100 buffer (50 mM Hepes pH 7.9, 100 mM KCl, 10% glycerol, 0.5 mM DTT) and stored at -80°C.

EMSA and SPR probes
A short randomly generated DNA probe (GTAGGCGGTGCTACACGGTTCCTGAAGTG) containing two CpGs was assembled by annealing complementary oligonucleotides and then end-labeled with [γ-32P]ATP (Perkin Elmer) and T4 polynucleotide kinase (Fermentas). The probes were then purified using a nucleotide removal kit (Qiagen) and used for the experiments in Figure 1B. A longer CpG-containing DNA probe was generated by PCR amplification of a 147 base pair fragment from the pGEM-3Z 601 plasmid (a gift from T. Owen-Hughes and J. Widom) for the experiments in Figure 1D (sequence available on request). Amplified DNA was then purified on a Resource Q anion exchange column (GE Healthcare). A small aliquot of purified DNA was methylated overnight with SssI DNA methyltransferase (NEB) and purified using the QIAquick PCR purification kit (Qiagen). A small fraction of each of the non-methylated and methylated DNA samples was radioactively end-labeled using T4 polynucleotide kinase (Fermentas) and [γ-32P]ATP (Perkin Elmer). The labeled DNA was then purified using the QIAquick PCR purification kit (Qiagen) and stored at -20°C. For SPR experiments, three 75 base pair oligonucleotide probes containing 0, 1 or 6 evenly spaced CpG sites were generated with a 5' biotin moiety. The same random DNA sequence was used as the basis for each probe, and the indicated CpG density was substituted into this sequence (sequence available on request).
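As a quick sanity check on probe design, the CpG content of a sequence can be counted directly. The following is a minimal sketch, not part of the original methods; the helper name `count_cpg` is hypothetical, and only the short EMSA probe sequence is taken from the text (the longer probe sequences are not given).

```python
def count_cpg(seq: str) -> int:
    """Count CpG dinucleotides (a C immediately followed by a G) on the given strand."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")


# Short EMSA probe sequence quoted in the text above.
SHORT_PROBE = "GTAGGCGGTGCTACACGGTTCCTGAAGTG"

if __name__ == "__main__":
    # Report probe length and number of CpG sites.
    print(len(SHORT_PROBE), count_cpg(SHORT_PROBE))
```

Because a CpG on one strand pairs with a CpG on the other, counting on a single strand is sufficient for a double-stranded probe.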

Electrophoretic mobility shift assay (EMSA)
EMSA reactions were assembled in binding buffer (4% Ficoll 400, 20 mM Hepes pH 7.9, 150 mM KCl, 1 mM EDTA, 0.5 mM DTT, 25 ng/µl poly-dAdT competitor DNA). The protein sample was incubated in EMSA buffer for 10 min at room temperature prior to addition of the radiolabeled probe. The mixture was incubated for 20 min at room temperature and loaded onto a 0.8% agarose gel in 0.5X TBE. The gel was run at 100 V and 4°C, then dried onto DE81 anion exchange paper (Whatman) using a vacuum-driven gel dryer (Amersham) and exposed to a phosphorimager screen overnight. The screen was then scanned using a fluorescent image analyser (FLA-7000, Fujifilm).

Surface plasmon resonance
Surface plasmon resonance measurements were made using a Biacore T100 machine (GE Healthcare). Three different oligonucleotide probes containing 0, 1 or 6 CpG sites were used. 2.5 pmol of each probe diluted in coupling buffer (10 mM Hepes pH 7.9, 150 mM NaCl, 3 mM EDTA, 0.005% Tween 20) was immobilized on the surface of a streptavidin (SA) coated sensor chip (GE Healthcare) at a flow rate of 5 µl/min. All flow cells were then washed with 200 µM biotin in 1X binding buffer (10 mM Hepes pH 7.9, 150 mM NaCl, 1 mM DTT) to block all unbound streptavidin sites. Serial dilutions (3 µM, 1.5 µM, 750 nM, 375 nM, 187.5 nM, 0 nM) of KDM2A 1-747 protein were prepared in 1X binding buffer containing 100 ng/µl of poly(dAdT), and 20 µl of each dilution was injected at a flow rate of 5 µl/min. Bound proteins were allowed to dissociate in binding buffer for 10 min, followed by an elution step with 20 µl of regeneration buffer (2.5 M NaCl, pH 5.3). The KD value was determined for the 1 and 6 CpG probes using a Langmuir binding curve.
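To illustrate how a KD can be extracted from equilibrium SPR responses, the sketch below fits the 1:1 Langmuir isotherm R = Rmax · C / (KD + C) by least squares. It is an illustration under stated assumptions, not the Biacore evaluation procedure used here: the function names, the search bounds, and the synthetic responses are all hypothetical, and only the concentration series (in nM) comes from the text. For a fixed KD the best Rmax has a closed form, so the fit reduces to a one-dimensional golden-section search over log10(KD).

```python
import math


def langmuir(conc, kd, rmax):
    """Equilibrium response for 1:1 binding: R = Rmax * C / (KD + C)."""
    return rmax * conc / (kd + conc)


def best_rmax(concs, resp, kd):
    """Closed-form least-squares Rmax for a fixed KD."""
    f = [c / (kd + c) for c in concs]
    return sum(r * x for r, x in zip(resp, f)) / sum(x * x for x in f)


def sse(concs, resp, kd):
    """Sum of squared residuals at a given KD, using the optimal Rmax."""
    rmax = best_rmax(concs, resp, kd)
    return sum((r - langmuir(c, kd, rmax)) ** 2 for c, r in zip(concs, resp))


def fit_langmuir(concs, resp, kd_lo=1.0, kd_hi=1e6, iters=200):
    """Golden-section search over log10(KD); returns (KD, Rmax).

    kd_lo and kd_hi are assumed search bounds in the same units as concs.
    """
    phi = (math.sqrt(5) - 1) / 2
    a, b = math.log10(kd_lo), math.log10(kd_hi)
    for _ in range(iters):
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if sse(concs, resp, 10 ** c) < sse(concs, resp, 10 ** d):
            b = d
        else:
            a = c
    kd = 10 ** ((a + b) / 2)
    return kd, best_rmax(concs, resp, kd)


if __name__ == "__main__":
    # Concentration series from the text, in nM.
    concs = [0.0, 187.5, 375.0, 750.0, 1500.0, 3000.0]
    # Synthetic, noise-free responses for a hypothetical KD of 500 nM.
    resp = [langmuir(c, 500.0, 100.0) for c in concs]
    print(fit_langmuir(concs, resp))
```

In this model the KD is the concentration at which the response reaches half of Rmax, which is what the vertical line in a fitted binding plot marks.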

Sequencing data analysis
Sequenced tags of 51 bp in length were mapped to the mouse genome (mm9) using the Bowtie aligner (Langmead et al., 2009). Only uniquely mapped tags with no more than two mismatches were retained. Genomic positions at which the number of mapped tags exceeded a significance threshold defined by a Z-score of 7 were identified as anomalous, potentially resulting from amplification bias, and the tags mapped to those positions were discarded. The final sets used for further analysis comprised 10,042,076 KDM2A ChIP tags and 9,760,426 input tags. The characteristic sizes of DNA fragments in the ChIP and input samples were estimated to be 100 bp and 90 bp, respectively, using strand cross-correlation analysis (Kharchenko et al., 2008).
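The removal of anomalous positions described above can be sketched as follows. The text does not specify how the Z-score was computed, so this example assumes it is taken over the tag counts of all positions that carry at least one tag, with positions keyed by chromosome, coordinate and strand; the function name and the tuple layout are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def filter_anomalous_positions(tags, z_threshold=7.0):
    """Drop every tag at a position whose tag count has a Z-score above the threshold.

    tags: iterable of (chrom, position, strand) tuples for uniquely mapped tags.
    Assumption: the Z-score is computed over counts at positions with >= 1 tag.
    """
    tags = list(tags)
    counts = Counter(tags)
    values = list(counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return tags  # all positions equally covered; nothing stands out
    bad = {pos for pos, n in counts.items() if (n - mu) / sigma > z_threshold}
    return [t for t in tags if t not in bad]
```

With this definition a pile-up can only exceed a Z-score of 7 when many positions are present, which is always the case for genome-wide data but matters when trying the function on small toy inputs.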
Since the positions of sequenced tags correspond to the 5' ends of the DNA fragments, these positions were shifted by half of the characteristic fragment size towards the fragment 3' ends to represent the centers of the DNA fragments. The positions from the positive and negative DNA strands were combined. To determine regions of significant tag enrichment of the ChIP sample over the input, we calculated the tag fold-enrichment in overlapping 500-bp windows. The continuous regions of enrichment were determined based on a Z-score threshold of 3. Enriched regions shorter than 250 bp were discarded, and clusters separated by less than 500 bp were merged. The one percent of genes with the highest and lowest tag counts were excluded from averaging. The profiles were smoothed in a 100 bp running window. The annotation from the UCSC Genome Browser was used to determine the boundaries of CpG islands (Karolchik et al., 2003). CpG island promoters were defined as promoters of genes whose TSS is encompassed by a CpG island. All other genes were assigned to the non-CpG island group. To determine the sets of GC-rich and GC-poor sequences at TSS, thresholds of 0.55 and 0.40 were used to stratify 4-kb fragments centered at the TSS by GC content. To avoid possible distortions, genes shorter than 1 kb and overlapping genes were filtered out for the calculation of the average tag density profiles around the TSS. The groups of 'strong' and 'weak' CpG islands were defined based on the GC content and the CpG observed-to-expected (o/e) ratio of the island. The selected threshold values correspond to the means for all the CpG islands in the genome and were equal to 0.65 for the GC content and 0.9 for the CpG o/e ratio. Islands with both GC content and CpG o/e ratio above the thresholds were identified as strong, and those with both values below the thresholds as weak. There were 3,506 strong islands, 3,547 weak islands, and 8,972 islands that were not included in either group.
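Two of the steps above, shifting tags to fragment centers and post-processing enriched regions, can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, regions are assumed to be half-open (start, end) intervals on a single chromosome, and the order of operations (discard short regions first, then merge) is an assumption read from the order of the sentence in the text.

```python
def fragment_center(pos, strand, fragment_size):
    """Shift a tag's 5'-end position by half the fragment size toward the 3' end."""
    half = fragment_size // 2
    return pos + half if strand == "+" else pos - half


def filter_and_merge(regions, min_len=250, max_gap=500):
    """Discard regions shorter than min_len, then merge neighbours closer than max_gap.

    regions: list of (start, end) intervals on one chromosome, end exclusive.
    """
    kept = sorted(r for r in regions if r[1] - r[0] >= min_len)
    merged = []
    for start, end in kept:
        if merged and start - merged[-1][1] < max_gap:
            # Gap to the previous region is below max_gap: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With the fragment sizes estimated above, ChIP tags would be shifted by 50 bp and input tags by 45 bp, in opposite directions for the two strands.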

Microarray and Data analysis
Total RNA was extracted from the KDM2A knockdown and control cell lines in biological replicates. RNA labeling and hybridization to Agilent 4 × 44K human gene expression microarrays (Agilent Technologies, Inc., Santa Clara, CA) were carried out by Oxford Gene Technology (Oxford, UK). Hybridizations were performed for two biological replicates of each sample. Microarray data were background-corrected and quantile-normalized between arrays using the Bioconductor packages Agi4x44PreProcess and limma (Smyth, 2004). The individual probe data were summarized for all the unique transcripts present on the array (20,240 transcripts in total). Fold-change and statistical significance were estimated from the mean expression values of the replicate sets.
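For readers unfamiliar with between-array quantile normalization, the idea can be sketched as follows. This is a simplified illustration, not the limma or Agi4x44PreProcess implementation: the function names are hypothetical, ties are broken by original order rather than averaged, and the fold change is computed on the ratio of replicate means.

```python
import math


def quantile_normalize(arrays):
    """Quantile-normalize equal-length arrays (one list of intensities per array).

    Each value is replaced by the mean, across arrays, of the values sharing its rank.
    Ties are broken by original order, which is a simplification.
    """
    n = len(arrays[0])
    sorted_cols = [sorted(a) for a in arrays]
    rank_means = [sum(col[i] for col in sorted_cols) / len(arrays) for i in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        result.append(out)
    return result


def log2_fold_change(knockdown, control):
    """log2 ratio of mean knockdown intensity to mean control intensity."""
    kd_mean = sum(knockdown) / len(knockdown)
    ctrl_mean = sum(control) / len(control)
    return math.log2(kd_mean / ctrl_mean)
```

After normalization every array has the same distribution of intensities, so the remaining differences between knockdown and control reflect changes in the rank of individual transcripts rather than array-wide intensity shifts.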