Chemical Methods for Decoding Cytosine Modifications in DNA

1.1. Introduction to Mammalian DNA Base Modifications 
Figure 1. Structures of modified bases in phage DNAs: (A) α-putrescinylthymine, (B) 5-dihydroxypentyluracil, (C) α-glutamylthymine, and (D) 2-aminoadenine.

Genetic information is encoded by the four bases adenine (A), guanine (G), cytosine (C), and thymine (T). Hydrogen-bonded base pairing between the cognate pairs A-T and C-G within the base stack of the DNA double helix provides the molecular basis for the genetic code. 1 It is evident that there are other molecular mechanisms for encoding function within DNA. The major groove and the minor groove each exhibit a hydrogen bonding pattern that enables the primary sequence of the DNA double helix to be read without being unwound, which is important for sequence-dependent events such as the binding of transcription factors. Furthermore, there are enzyme-dependent chemical modifications to the canonical bases that have the potential to dynamically alter the structure, recognition, and function of DNA. Examples of naturally occurring DNA base modifications are shown in Figure 1. There are organisms whose genomes exhibit a substantial level of chemically modified bases; for example, in bacteriophages all or a major proportion of one of the four bases is commonly replaced by a modified base. 2 Modified bases in genomes can arise either at the level of modified mononucleotides, which are subsequently incorporated via polymerase-mediated DNA synthesis, or after DNA synthesis by modification of the canonical bases within DNA. 3 Functions of modified bases in the DNA of phages include protection from host and phage nucleases, signaling for transcription and replication of the DNA, and facilitating the packaging of the DNA. 4 Eukaryotes appear to have a smaller repertoire of DNA base modifications. 5-Methylcytosine (5mC) is the best-studied DNA base modification. It contains a methyl group at the 5-position of the cytosine base, which protrudes into the major groove of the DNA, presenting a potential recognition site (or obstacle) for protein binding without changing the Watson−Crick base pairing.
This chemical derivative of C has functional consequences for the cell, most notably in the control of gene expression. The study of heritable changes in gene expression mediated by dynamic changes in 5mC, without changes in the primary DNA sequence, is a major aspect of the field of epigenetics. Given that epigenetic changes are of vital importance to developmental biology and to numerous areas of human disease, including cancer and metabolic diseases, it is important to elucidate the underlying mechanisms that cause, and stem from, the chemical modification of DNA bases.
In mammals, 5mC was first discovered in the late 1940s and has been found to play essential roles in maintaining cellular function and genomic stability, including processes such as the inactivation of one of the two X chromosomes in female mammals; genomic imprinting, in which genes are expressed in a manner dependent on the parent of origin; and the silencing of mobile genetic elements called transposons. 5 A family of enzymes called the DNA methyltransferases (DNMTs) is known to be responsible for the generation and maintenance of 5mC in genomes. 6 The standard mechanism of 5mC formation involves initial nucleophilic attack by a cysteine residue of the DNMT at the C6 position of cytosine, then nucleophilic attack by C5 on the methyl group of the donor S-adenosylmethionine (SAM), followed by elimination to restore aromaticity in the base (Figure 2).
The function of methylation in mammals depends on the context of the modification within the genome. There is a strong positive correlation between gene silencing and methylation of regions rich in C-G diads called CpG islands (CGIs) near transcription start sites (TSS) and also the first exon within long-term silenced genes. 5a Within gene bodies, there is a positive correlation between active transcription and gene body methylation on active X chromosomes. Studies also suggest that DNA methylation in gene bodies could play a role in regulating alternative splicing. 7
In the 1970s two papers suggested that mammals contained very high levels of another cytosine modification, 5-hydroxymethylcytosine (5hmC), at up to 25% of all C bases. 8 However, others could not corroborate these results, 9 and 5hmC had been widely viewed as a potential DNA damage product. 10 In 2009, two studies published in Science demonstrated the presence of 5hmC in mouse brain and embryonic stem (ES) cells. 11 Furthermore, Tahiliani et al. showed that ten-eleven translocation 1 (TET1), a 2-oxoglutarate (2-OG) and Fe(II)-dependent dioxygenase, could catalyze the conversion of 5mC to 5hmC (Figure 3). 11b Genome-wide experiments have since mapped the location of 5hmC to promoter regions, transcription start sites, and gene bodies. In ES cells, 5hmC is also enriched at developmental genes that are poised for changes in transcriptional activity. 12
In 2011, 5-formylcytosine (5fC) was detected in mouse ES cells and brain cortex, and 5-carboxycytosine (5caC) in mouse ES cells, by thin layer chromatography and tandem liquid chromatography−mass spectrometry. 13 Quantification by mass spectrometry of DNA digested into nucleosides showed that the genomic DNA of ES cells contained 5fC at levels of around 0.2% relative to G and 5caC at 10-fold lower levels than 5fC. 13 In mammalian brain tissues, levels of 5fC were found to be 2−3, and levels of 5caC 3−4, orders of magnitude lower than those of 5hmC. 14 Several studies have mapped the locations of 5fC and 5caC in the genomes of mouse ES cells. 15 Furthermore, single base resolution 5fC sequencing methods have enabled single base resolution genomic maps of 5mC, 5hmC, and 5fC in embryonic stem cells. 15c,16 The discovery of 5hmC, 5fC, and 5caC in mammalian DNA has raised the need to elucidate their function. A popular hypothesis is that such oxidized cytosine modifications constitute part of the pathways that lead to active DNA demethylation.
There are several proposed pathways for demethylation; one involves the iterative oxidation of 5mC by the TET family enzymes, followed by base excision repair or deformylation/decarboxylation. A potential mechanism for active demethylation is through the thymine DNA glycosylase (TDG) enzyme, which can excise both 5fC and 5caC from DNA but does not remove 5mC or 5hmC. 13c,17 Following this base excision, the abasic site would be repaired by the base excision repair (BER) pathway. 18 It is also possible that a decarboxylase enzyme could directly remove the carboxylic acid group from 5caC (Figure 4). 19
It is clear from work during the past five years that the enzyme-mediated chemical modification of cytosine in DNA has emerged as an important area of scientific investigation. The focus of this review is the chemical methodologies that have been created and explored to detect, measure, and elucidate cytosine derivatives in the genomic DNA of living systems.

GENOME-WIDE PROFILING METHODS
The decoding of DNA falls within the general scope of chemical structure elucidation and has naturally been enabled by the creation and application of chemical approaches. Decoding the sequence of the four canonical DNA bases was first made widely accessible in the late 1970s with two independent chemical approaches from Maxam and Gilbert 20 and from Sanger. 21 The Sanger sequencing approach was optimized, automated, and employed to decode the 3 billion base human genome reference sequence via the Human Genome Project. The Solexa/Illumina sequencing approach originated from the Balasubramanian and Klenerman laboratories in the late 1990s 22 and has been developed 23 and optimized to the point where, over the past five years, very high capacity (human genome scale) sequencing experiments have become routinely possible in relatively small laboratories. While the advent of widely accessible large scale sequencing has had an impact on genetics and genomics, these advances also hold the potential to decode and help elucidate noncanonical DNA bases in the genomes of organisms.

2.1.1. Restriction Endonuclease Detection of 5mC.
Restriction enzymes recognize short DNA sequences present in double stranded DNA and cleave the phosphodiester backbone of both strands by direct hydrolysis or through a covalent enzyme intermediate (Figure 5). 24 This reaction forms two fragments of double stranded DNA. The cleavage reaction can be blocked by modifications to the DNA bases in the recognition site. 25 Absolute quantitation of the levels of modified bases at a specific restriction site can be obtained by using two restriction enzymes that cleave at the same site, but where only one can cleave in the presence of a specific DNA base modification. 26 This difference arises from the differing capacity of the enzymes to recognize the DNA sequence when a methyl group is present in the major groove. The modification can be quantified by measuring the difference in how often a specific site has been cut by each restriction enzyme. This method is regularly used with quantitative polymerase chain reaction (qPCR) to measure specific restriction sites.
The two restriction enzymes regularly used to detect 5mC are HpaII and MspI, which both cut at the same DNA sequence, CCGG. This sequence is ideal as it contains a CpG dinucleotide, which is where the majority of 5mC resides in mammals. 27 HpaII is methylation-sensitive and will only cut a CCGG sequence that does not contain 5mC, whereas MspI is methylation-insensitive and will cut a CCGG sequence with or without 5mC (Figure 6).
2.1.2. Genome Wide Restriction Endonuclease Detection of 5mC. The two most frequently used restriction enzyme based techniques for detecting 5mC on a genome wide scale are the HpaII tiny fragment enrichment by ligation-mediated PCR (HELP) assay 28 and Methyl-Seq. 29 The HELP assay was developed in 2006 and Methyl-Seq in 2009, and both rely on a comparison of genomic DNA after HpaII and MspI digestion. In both methods, genomic DNA is separately digested with HpaII and MspI, and adapters are then ligated to the ends of the digested DNA fragments. The library of MspI digested fragments represents the total population of sites, as MspI digests at both 5mC and C, whereas the HpaII library represents a subset of these sites, as HpaII only digests at C.
The HELP assay uses fluorescently labeled primers to amplify each adapted library using PCR. Different fluorophores are applied to the HpaII and MspI libraries. A DNA microarray is then used that contains the sequences of specific genomic regions of interest. The method was developed using a DNA microarray that detects 1339 sites in the mouse genome, representing a total of 6.2 Mbp. 28 The presence of a CCGG site results in a fluorescent signal being detected in the MspI library. When a site is fully methylated, no fluorescent signal is detected in the HpaII library; however, if there is partial or no methylation, a fluorescent signal will be detected. An HpaII/MspI ratio is then calculated for each genomic region to give relative quantification at a large number of genomic loci simultaneously.
In Methyl-Seq, following ligation of adapters to the HpaII and MspI libraries, each library is sequenced using next-generation sequencing technologies. This creates millions of short genomic reads all starting at CCGG sites. Sites that are only sequenced in the MspI library are fully methylated and are called as "methylated". When sites are present in the HpaII library there is either partial or no methylation, and they are called as "unmethylated". The initial publication demonstrated that this method could be used to assay 90 000 regions in the human genome. 29 Partially methylated regions are undetectable in Methyl-Seq, as they are labeled as "unmethylated", whereas the HELP assay provides relative quantification. However, new DNA arrays are needed for each new region of interest in the HELP assay, whereas in Methyl-Seq a much larger number of sites is analyzed without the need for DNA microarray development.
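The two read-outs described above can be illustrated with a minimal Python sketch. The function names, the site identifiers, and the use of plain signal intensities are assumptions made for illustration only; neither published pipeline is reproduced here.

```python
def call_methyl_seq(mspi_sites, hpaii_sites):
    """Classify CCGG sites in the way the text describes for Methyl-Seq.

    mspi_sites  : iterable of site identifiers sequenced in the MspI library
    hpaii_sites : iterable of site identifiers sequenced in the HpaII library
    A site seen only in the MspI library is called "methylated"; a site that
    is also seen in the HpaII library is called "unmethylated" (this call
    covers both partial and no methylation).
    """
    hpaii_sites = set(hpaii_sites)
    return {
        site: ("unmethylated" if site in hpaii_sites else "methylated")
        for site in mspi_sites
    }


def help_ratio(hpaii_signal, mspi_signal):
    """Relative HpaII/MspI signal for one array feature (HELP-style read-out).

    Values near 0 suggest full methylation; larger values suggest partial or
    no methylation. The inputs are hypothetical fluorescence intensities.
    """
    if mspi_signal <= 0:
        raise ValueError("MspI signal must be positive for a CCGG site")
    return hpaii_signal / mspi_signal


# Hypothetical example data
calls = call_methyl_seq({"chr1:100", "chr1:200"}, {"chr1:200"})
ratio = help_ratio(50.0, 100.0)
```

The sketch makes the trade-off in the text concrete: the Methyl-Seq call is binary, whereas the HELP ratio retains a graded (relative) value.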
2.1.3. Restriction Endonuclease Detection of 5hmC. With the recent discovery of 5hmC in the mammalian genome, 11 there has been growing interest in developing techniques to detect this base to enable the elucidation of its function. Restriction endonuclease methods developed to quantitatively detect 5hmC in the genome rely heavily on the βGT enzyme found in T4 bacteriophage. 26a βGT adds a glucose moiety to the primary alcohol on the hydroxymethyl group of 5hmC in double stranded DNA. 30 As the primary alcohol group of 5hmC is present in the major groove of the double stranded DNA, this enzyme functionalizes the DNA major groove with a glucose moiety that consequently alters the recognition potential at that site. The most commonly used method involves designing primers for quantitative PCR analysis of a specific region of interest in the genome that contains a single restriction site for the enzyme used. 26a MspI will cleave DNA with C, 5mC, or 5hmC in its restriction site, but when glucosylated 5hmC is present in the restriction site, MspI will no longer cut the DNA. 26a The levels of 5hmC can be determined by performing quantitative PCR on undigested, digested, and glucosylated then digested DNA. Quantifying the difference between each digestion then gives the percentage of 5hmC at that restriction site. HpaII does not cleave DNA containing 5mC or 5hmC and can be used in parallel with the above method to determine the combined level of 5mC and 5hmC at the same site. Thus, by comparing the HpaII data with those obtained for 5hmC alone, levels of C, 5mC, and 5hmC can be obtained through the differences (Figure 6).
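The difference calculation described above can be sketched as follows. This is a minimal illustration, assuming that each qPCR measurement has already been converted to an amount of intact (amplifiable) template in arbitrary units; the function name and argument names are hypothetical, and the MspI-only signal is treated as an incomplete-digestion background.

```python
def site_fractions(undigested, mspi, gluc_mspi, hpaii):
    """Estimate C, 5mC, and 5hmC fractions at one CCGG site from qPCR data.

    undigested : intact template without any digestion (reference)
    mspi       : intact template after MspI (cuts C, 5mC, and 5hmC)
    gluc_mspi  : intact template after βGT glucosylation then MspI
                 (glucosylated 5hmC is protected)
    hpaii      : intact template after HpaII (cuts only unmodified C)
    """
    if undigested <= 0:
        raise ValueError("undigested reference must be positive")
    f_5hmc = (gluc_mspi - mspi) / undigested      # protected only when glucosylated
    f_modified = (hpaii - mspi) / undigested      # 5mC + 5hmC resist HpaII
    f_5mc = f_modified - f_5hmc
    f_c = (undigested - hpaii) / undigested       # cut by HpaII
    return {"C": f_c, "5mC": f_5mc, "5hmC": f_5hmc}


# Hypothetical measurements (arbitrary units)
fractions = site_fractions(undigested=100.0, mspi=0.0, gluc_mspi=10.0, hpaii=40.0)
```

With these example numbers the site would be estimated at roughly 60% C, 30% 5mC, and 10% 5hmC. The three fractions sum to one only when the MspI background is zero.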
Along with the creation of methods to detect 5mC and 5hmC using HpaII, MspI, and βGT, there has also been the development of novel families of enzymes, PvuRts1I 31 and MspJI, 32 that only digest 5hmC or glucosylated 5hmC, and not C or 5mC. These enzymes offer the ability to directly detect 5hmC modifications on a genome wide scale. 33
2.1.4. Restriction Endonuclease Detection of 5fC and 5caC. Following the discovery of 5fC and 5caC, little has been done to map them in the genome using restriction endonucleases or to find out how existing enzymes interact with them. One study has indicated that MspI could not digest synthetic DNA that contained 5fC or 5caC; 13a however, this has not been taken further to look at genome-wide levels. There is a need for robust data on the specificity and discrimination of these restriction enzymes for all cytosine modifications before such methods can be widely used with confidence.
2.1.5. Advantages and Disadvantages of Restriction Endonucleases. Restriction endonucleases provide a simple and relatively low cost way of accurately quantifying modified bases at single restriction sites. These methods do not detect modified bases at single base resolution, as a modification present at any position at its cut site can block digestion. Using PCR to achieve absolute quantification at many genomic sites in parallel can become very time-consuming, as separate PCR reactions must be run for each site. However, restriction endonuclease techniques are now available to detect 5hmC at a genome wide scale, albeit with only relative quantification. It will be of great interest to combine the HELP assay and Methyl-Seq approaches to also detect 5hmC, using the βGT enzyme that inhibits MspI digestion of 5hmC.

Chemical Based Profiling
DNA immunoprecipitation sequencing (DIP-Seq) is a technique that uses a probe (protein or small molecule) that noncovalently or covalently recognizes a DNA feature of interest, so that fragments of genomic DNA carrying the feature can be isolated by affinity enrichment and then characterized by high throughput DNA sequencing. For example, methylated DIP-Seq (MeDIP-Seq) uses an antibody that binds and enriches for methylated DNA fragments from genomic DNA. The enriched fragments are decoded by sequencing, and the sequences are then computationally aligned and "stacked" against the reference genome to provide a genome-wide profile of methylation sites. The resolution of this approach is a function of the fragment size of the prepared DNA library 34 (Figure 7).
A similar antibody-based approach has been developed for mapping 5hmC, called hMeDIP-Seq. 35 Trypanosomes contain a protein, called JBP1, which binds to glucosylated 5-hydroxymethyluracil. It was shown that JBP1 also binds glucosylated 5hmC, and it was then used in place of an antibody to map 5hmC. 36 Although protein-based enrichment methods are widely used to map DNA modifications, this technique depends heavily on the quality of the antibody used. Low specificity of the antibody for the targeted modification or cross-reactivity with off-target sites results in high background noise. To overcome these issues, alternative chemical profiling methods were developed (Figure 8). The first chemical profiling method reported for 5hmC, hmC-seal, was developed by Song et al. 37 The method uses a β-glucosyltransferase (βGT) that can transfer a 6-N3-glucose onto the hydroxyl moiety of 5hmC. Subsequent copper-free click chemistry attaches dibenzocyclooctyne-modified biotin to the base. The biotin−streptavidin interaction can either be used to quantify 5hmC with avidin-horseradish peroxidase (HRP) or to efficiently enrich for 5hmC containing fragments with streptavidin-coated beads.
Another chemical enrichment method for 5hmC, termed GLIB (glucosylation, periodate oxidation, and biotinylation), used glucosylated 5hmC that was subsequently treated with sodium periodate, which oxidatively cleaved the vicinal diols in glucose to yield a dialdehyde. 12a The aldehydes were then reacted with a hydroxylamine-biotin probe. In the case of 5fC, the chemical reactivity of the aldehyde moiety on the modified base itself was exploited by chemoselective reaction with a hydroxylamine-biotin probe to perform the first genome wide mapping of this modification. 15a Fragmented genomic DNA containing 5fC from mouse ES cells was reacted with the probe and pulled down with streptavidin-coated beads to enrich for 5fC-containing DNA fragments, which were subsequently decoded by sequencing. Song et al. extended their hmC-seal method in order to enrich for 5fC containing DNA (fC-seal method). 15c To do so, they first blocked 5hmC with unmodified UDP-Glc using βGT. Subsequently they reduced 5fC to 5hmC using sodium borohydride and then glucosylated the newly generated 5hmC with an azide-modified glucose. The azide was clicked to a biotin containing probe using copper-free click chemistry, which in turn allowed the pulldown of 5fC containing fragments. A joint chemical-antibody approach has also been deployed to detect 5hmC in vivo. Bisulfite treatment of DNA was used to convert all the 5hmC to a stable cytosine-5-methylsulfonate adduct (CMS), and an antibody was then used to detect these chemically modified 5hmC bases. 12a Furthermore, it has been demonstrated that DNMTs can be used to tag small molecules onto 5hmC, 38 and this reaction could be used to map 5hmC.
With regard to 5caC, it can also be captured by using 1-ethyl-3-[3-(dimethylamino)propyl]-carbodiimide hydrochloride (EDC)-catalyzed amide bond formation between the carboxyl group of 5caC and a biotin modified amine. 39 It remains to be seen whether the labeling sensitivity and selectivity of this method are sufficient to apply it to genomic DNA, given the very low abundance of 5caC in genomic DNA.
In Maxam and Gilbert sequencing, 20 the Gs are methylated by dimethylsulfate, the purines (A and G) are depurinated using formic acid, and the pyrimidines (C and T) are hydrolyzed using hydrazine. Hydrazine in the presence of high salt hydrolyzes C only. The DNA backbone is subsequently cleaved at the sites where the bases have reacted, using hot piperidine. Electrophoresis of these fragments generates a sequencing ladder corresponding to reading 5′ to 3′ on the DNA (Figure 9).
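The lane logic of the four cleavage reactions described above can be illustrated with a short Python sketch. The lane labels ("G", "A+G", "C+T", "C"), the data layout, and the assumption of a 5′ end label (so that shorter fragments lie closer to the 5′ end) are choices made for this illustration only.

```python
# Which combination of lanes shows a band for each base:
#   G -> dimethylsulfate lane ("G") and formic acid lane ("A+G")
#   A -> formic acid lane only
#   C -> hydrazine lane ("C+T") and hydrazine + high salt lane ("C")
#   T -> hydrazine lane only
LANE_PATTERN_TO_BASE = {
    frozenset({"G", "A+G"}): "G",
    frozenset({"A+G"}): "A",
    frozenset({"C+T", "C"}): "C",
    frozenset({"C+T"}): "T",
}


def read_ladder(bands):
    """Read a sequence 5' to 3' from a ladder.

    bands maps fragment length (1-based position from the labeled 5' end)
    to the set of lanes in which a band of that length is observed.
    Positions with no band, or with an unexpected pattern, are reported
    as "N".
    """
    if not bands:
        return ""
    sequence = []
    for length in range(1, max(bands) + 1):
        lanes = frozenset(bands.get(length, ()))
        sequence.append(LANE_PATTERN_TO_BASE.get(lanes, "N"))
    return "".join(sequence)


# Hypothetical ladder for the sequence GATC
ladder = {1: {"G", "A+G"}, 2: {"A+G"}, 3: {"C+T"}, 4: {"C+T", "C"}}
sequence = read_ladder(ladder)
```

A position that gives no band in any lane appears as "N" in this sketch, which is how a gap in the ladder would surface.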
Aspects of this sequencing method allow it to also be used to sequence methylcytosine in DNA. 40 The reaction of hydrazine (which leads to cleavage at C and T) with 5mC is inefficient and therefore does not introduce a strand cleavage. This results in a gap in the sequencing pattern. This gap, together with the identification of G on the complementary strand, determines the location of 5mC in the DNA sequence. 40
A modified Maxam and Gilbert sequencing method uses different chemicals for the selective detection of 5-methylcytosine in DNA sequences. 41 N-Sodio-N-chloro-p-nitrobenzenesulfonamide and N-sodio-N-bromo-m-nitrobenzenesulfonamide display differential reactivity toward C and 5mC, in that only N-sodio-N-chloro-p-nitrobenzenesulfonamide showed high selectivity toward C, producing a cleavage at C sites upon hot piperidine treatment. Treatment with N-sodio-N-bromo-m-nitrobenzenesulfonamide generated two products, with cleavage at the C and 5mC sites (Figure 10).
By combining the results obtained using both of these compounds, the authors claim the accurate identification of 5mC residues in DNA sequences. When this method was combined with the use of β-glucosyltransferase, the introduction of a glucose moiety to the hydroxyl group of 5hmC could be used to distinguish 5mC from 5hmC and C.
Two more methods are available that can supplement the Maxam and Gilbert sequencing method for the interrogation of cytosine modification in DNA. The first one exploits the selective detection of 5mC by using uracil DNA glycosylase (UDG). 42 Bisulfite treated DNA is subsequently treated with UDG to initiate uracil elimination followed by DNA cleavage in alkaline conditions. As 5mC is resistant to conversion to U by bisulfite, cleavage can only be observed at C sites. The second method uses hot alkali treatment, thereby selectively cleaving the DNA at sites of 5fC and 5caC. 43 While these sequencing methods work well on short synthetic DNA strands and could potentially be applied for the development of probes to study genomic samples, Maxam and Gilbert type sequencing is rather time-consuming and cumbersome compared to modern sequencing approaches and so this approach may not be suitable for the routine genome-wide study of epigenetic modifications.
Munzel et al. described a chemical method to discriminate between C and 5mC. 44 The chemical reagent O-allylhydroxylamine, in contrast to bisulfite, does not exploit reactivity differences but gives different reaction products with cytosine and 5mC (Figure 11).
The reagent forms a stable mutagenic adduct with cytosine, which can exist in two oxime-type configurations, E or Z, which in turn are in equilibrium via the amino isomeric form. The amino tautomer effectively base pairs as C, whereas the E-imino isomer pairs as T and the Z-imino isomer interferes with base pairing, causing polymerase stalling. Which isomer is formed depends on the steric hindrance between the O-allyl chain and the functional group at the 5-position of the cytosine. In the case of C, the allylhydroxylamine adduct switches into the E-isomeric form, which generates C to T transition mutations that can easily be detected by sequencing. In contrast, the 5mC adduct adopts exclusively the Z-isomeric form, which causes the polymerase to stop. A limitation of this method is that it does not distinguish between 5mC and 5hmC. The detection principle is based on sterics, and 5hmC imposes an even larger steric strain than 5mC, so the two cannot be told apart after incubation with O-allylhydroxylamine.
2.3.2. Bisulfite Sequencing of 5mC. Bisulfite sequencing (BS-Seq) has been regarded as the gold standard for 5mC detection and is therefore widely used to detect 5mC at single base resolution in a large variety of cell types and disease models. 45 In BS-Seq, bisulfite mediates an overall hydrolytic deamination of C to U, but does not alter 5mC. 46 Following DNA sequencing, all of the Cs in the DNA that were deaminated will read as Ts, so any remaining Cs are assumed to have come from 5mC, which does not deaminate during the bisulfite treatment.
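The read-out logic described above can be illustrated with a small in silico sketch in Python. It is a simplified model under stated assumptions: a single strand is considered, conversion is taken to be complete, and methylated positions are supplied as a set of 0-based indices. The function names are hypothetical.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Return the sequence as read after bisulfite treatment and sequencing.

    Unmodified C is deaminated to U and therefore reads as T; C at a
    position listed in methylated_positions (5mC) is left unchanged.
    """
    return "".join(
        "T" if base == "C" and index not in methylated_positions else base
        for index, base in enumerate(sequence)
    )


def call_methylated(reference, read):
    """Positions where the reference has C and the converted read still has C."""
    return [
        index
        for index, (ref_base, read_base) in enumerate(zip(reference, read))
        if ref_base == "C" and read_base == "C"
    ]


# Hypothetical example: only the C at index 1 is methylated
reference = "ACGTCCGA"
read = bisulfite_convert(reference, {1})
methylated = call_methylated(reference, read)
```

In this example the read becomes "ACGTTTGA", and comparing it with the reference recovers index 1 as the only retained C.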
The deamination reaction of C to U with bisulfite was first observed in 1970. 47 The bisulfite anion adds across the C5−6 double bond of C at acidic pH to generate an adduct, which has lost the aromaticity of the base and undergoes hydrolysis with loss of ammonia. The resulting uracil bisulfite adduct rearomatizes to form uracil upon an increase in pH (Figure 12). This reaction requires the single stranded form of DNA (ssDNA), owing to the inaccessibility of the C5−6 double bond to the bisulfite anion in the double helix. 48 The bisulfite deamination of C to U is highly pH sensitive; the optimal pH for the overall reaction, comprising adduct formation and deamination, is between 5.0 and 5.3. 49 Above pH 5.3 there is a sharp decrease in cytosine bisulfite adduct formation due to the dissociation of the bisulfite anion to the sulfite conjugate base. 49 The deamination rate also decreases above pH 5.3, as the N3 unprotonated bisulfite adduct deaminates at 1% of the rate of the N3 protonated adduct. However, deamination of the bisulfite adduct is base-catalyzed, so the rate also decreases at pH values below 5.0 due to the protonation of the most effective catalytic species, sulfite. 49 There is a positive linear relationship between the concentration of bisulfite and the reaction rate. 49 Further studies in 1980 demonstrated that bisulfite reacts more slowly with 5mC than with C, due to inhibition of adduct formation by the electronic effect of the methyl group. 50 This difference in reactivity between C and 5mC is the basis of BS-Seq. 46
Bisulfite treatment of DNA causes a degree of DNA degradation, and for a long time the mechanism was thought to be depurination of A and G due to protonation at low pH. 51 However, it was later discovered that the true degradation mechanism involves depyrimidination of the bisulfite adduct of C, while no degradation was observed with A, G, or T, as no bisulfite adduct forms. 51 Once depyrimidination has occurred to form an abasic site, DNA strand scission (degradation) will occur in basic conditions, 20 such as those in the bisulfite workup (Figure 13). Hydroquinone has been used as an additive 52 to protect DNA from degradation, and commercial BS-Seq products contain "DNA protect buffers"; however, there has been no definitive examination of their effectiveness.
When analyzing genomic DNA samples it is usually the case that there are multiple copies of the same genetic sequence, due to the extraction of DNA from more than one cell, unless working at the single cell level. Each copy of the same genetic sequence can contain different cytosine modifications, as epigenetic states are dynamic. This means that when analyzing a population of DNA molecules, what can be quantified is the percentage of copies that contain each modification at the same genomic location at a given time point (e.g., if 50% of the sequencing reads show a C at a given site in BS-Seq, this would suggest that 50% of the DNA copies in the cell population carry 5mC at that site). BS-Seq has been used to obtain this quantitative map of 5mC across the whole genomes of many plants and mammals, at single base resolution. 45 An adaptation of this method, reduced representation bisulfite sequencing (RRBS-Seq), has been developed that can fractionate the genome to retain mainly biologically relevant CGIs, genomic regions that contain a high percentage of CpG dinucleotides. 7e RRBS-Seq works by enzymatically digesting the genome with MspI followed by removal of the undigested DNA, resulting in enrichment of CpG sites. Due to this enrichment, RRBS-Seq allows the same depth of coverage (number of times each site is sequenced) as whole genome sequencing but with less sequencing, as regions that have a low percentage of CpG sites will not be sequenced.
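The population-level quantification described above reduces to a simple proportion of reads, sketched here in Python. The function name and the convention of returning None for an uncovered site are assumptions made for illustration.

```python
def methylation_level(c_reads, t_reads):
    """Fraction of reads showing C at one cytosine position in BS-Seq.

    c_reads : reads in which the position still reads as C (unconverted)
    t_reads : reads in which the position reads as T (converted)
    Returns None when the site has no coverage.
    """
    if c_reads < 0 or t_reads < 0:
        raise ValueError("read counts must be non-negative")
    coverage = c_reads + t_reads
    if coverage == 0:
        return None
    return c_reads / coverage


# Hypothetical counts at one site
level = methylation_level(c_reads=50, t_reads=50)
```

With 50 C reads and 50 T reads the level is 0.5, matching the 50% example in the text. The precision of such an estimate depends on the depth of coverage at the site.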
BS-Seq has also been used to jointly map 5mC with histone modifications, in ChIP-bisulfite-sequencing (ChIP-BS-Seq) 53 and in sequencing of bisulfite-treated chromatin immunoprecipitated DNA (BisChIP-Seq). 54 Both of these methods involve initially enriching the genome for DNA located around histones of interest by chromatin immunoprecipitation of genomic DNA with antibodies targeted at histone modifications of interest. BS-Seq is then carried out on this enriched DNA to obtain a map of 5mC in these regions. This allows the direct analysis of methylation status at regions of the genome that coincide with specific histone modifications. The drawback of these techniques is that, by enriching the DNA using antibodies, there is no longer absolute quantification of the DNA methylation status. Furthermore, they are both reliant on the availability and specificity of histone modification antibodies.
A further adaptation of BS-Seq has been made to simultaneously map methylation status and nucleosome positions by a method termed nucleosome occupancy and methylome sequencing (NOMe-Seq). 55 This method uses a GpC methyltransferase (M.CviPI) that will methylate all of the cytosines present in GpC context outside nucleosomes. BS-Seq of this GpC methylated DNA can then be used to detect regions of unmethylated cytosines in GpC context, which will convert to Us and which mark the positions of nucleosomes. All of the cytosines in GpC context outside of nucleosomes are methylated and will still read as C. Furthermore, it is possible to detect the natural CpG methylation status of the DNA around each nucleosome. It would be of great interest to combine ChIP-BS-Seq/BisChIP-Seq with NOMe-Seq to generate a joint map of 5mC with specific histone modifications along with the exact position of each nucleosome.
2.3.3. Detection of 5hmC with Bisulfite. The realization that 5hmC exists in mammalian DNA has revealed an important shortcoming of BS-Seq: treatment of 5hmC with bisulfite results in a stable cytosine-5-methylsulfonate adduct (CMS) that, like 5mC, does not undergo deamination and is therefore read as C in sequencing data. 5c,56 Thus, 5mC and 5hmC are indistinguishable by sequencing that follows bisulfite conversion, and all reported examples of BS-Seq methylation analysis have therefore actually been measuring the sum of 5mC plus 5hmC, rather than the true 5mC level, which may confound the interpretation of the data in some cases. Another potential issue is that the CMS adduct, when present at high density, can stall common DNA polymerases. 56 Resolving 5mC and 5hmC in sequencing data is important given that each modification may have a distinct role in biology. 57 Two distinct methods, oxidative bisulfite sequencing (oxBS-Seq) 16a and TET-assisted bisulfite sequencing (TAB-Seq), 16b have been invented to quantitatively sequence 5hmC at single base resolution in genomic DNA. Both methods unequivocally resolve 5mC from 5hmC during bisulfite treatment.
The oxBS-Seq approach exploits the observation that reaction of bisulfite with 5fC leads to deformylation and deamination. Thus, oxBS-Seq comprises a selective and quantitative chemical oxidation of 5hmC to 5fC in genomic DNA using potassium perruthenate. 16a,58 The resulting 5fC is subsequently, efficiently transformed to U with bisulfite treatment. In oxBS-Seq, only 5mC will read as a C, giving a direct read out for the level and position of 5mC in a DNA sequence. 5hmC can be identified as the difference between oxBS-Seq and BS-Seq, where 5mC and 5hmC read as a C ( Figure 14). OxBS-Seq has been used, in combination with targeting (RRBS-Seq), to generate a single base resolution map of 5mC and 5hmC status of CpG islands in mouse embryonic stem cells (mESCs). 16a TAB-Seq uses the β-glucosyltransferase enzyme to modify 5hmC present in the genome, and a recombinant mouse TET1 enzyme to oxidize the 5mC to 5caC. 16b,59 Prior glucosylation of the 5hmC protects it from oxidation by the TET1 enzyme. Reaction of DNA with bisulfite cause decarboxylation and deamination of 5caC to form uracil, leaving the glucosylated 5hmC unconverted. TAB-Seq therefore gives a direct read out of 5hmC. 5mC can be identified as the difference between TAB-Seq and BS-Seq ( Figure 15).
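The subtraction used in oxBS-Seq can be written out as a short Python sketch. This is an illustration under stated assumptions, not the authors' analysis: the function names and the per-site read-count inputs are hypothetical, and clipping negative differences to zero is a simplification introduced here to absorb sampling noise.

```python
def fraction_read_as_c(c_reads, t_reads):
    """Fraction of reads at one cytosine position that still read as C."""
    total = c_reads + t_reads
    if total == 0:
        raise ValueError("no coverage at this site")
    return c_reads / total


def estimate_5mc_5hmc(bs_c, bs_t, oxbs_c, oxbs_t):
    """Estimate 5mC and 5hmC levels at one site from BS-Seq and oxBS-Seq counts.

    BS-Seq:   5mC and 5hmC read as C  -> fraction = 5mC + 5hmC
    oxBS-Seq: only 5mC reads as C     -> fraction = 5mC
    """
    bs = fraction_read_as_c(bs_c, bs_t)
    oxbs = fraction_read_as_c(oxbs_c, oxbs_t)
    # Assumption of this sketch: negative differences are treated as zero
    hmc = max(bs - oxbs, 0.0)
    return {"5mC": oxbs, "5hmC": hmc}


# Hypothetical counts: 75 of 100 reads are C in BS-Seq, 50 of 100 in oxBS-Seq
print(estimate_5mc_5hmc(bs_c=75, bs_t=25, oxbs_c=50, oxbs_t=50))
```

The same arithmetic applies to TAB-Seq with the roles reversed: the TAB-Seq fraction estimates 5hmC directly, and 5mC is the BS-Seq fraction minus the TAB-Seq fraction.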
TAB-Seq has been used to generate a high-resolution map of 5hmC across the whole genome of mESCs. 16b The researchers detected sites that contained high levels of 5hmC throughout the entire genome. β-Glucosyltransferase was shown to exhibit inefficiencies when glucosylating 5hmCpGs when another 5hmC is within 4 bp, 16b which may pose difficulties for sites with multiple 5hmCs in close proximity.
The deformylation of 5fC by reaction with bisulfite had not been described prior to oxBS-Seq; however, the decarboxylation of 5-carboxyuridine (analogous to 5caC) was observed in 1969. 60 The mechanism of decarboxylation of 5caC is thought to proceed through a single addition of bisulfite to the C5−6 double bond, which breaks the aromaticity of the base, followed by decarboxylative elimination leading to desulfonation (Figure 16A). The mechanism of deformylation of 5fC has been proposed to proceed through a double addition of bisulfite to 5fC, across the C5−6 double bond and to the aldehyde, both of which are well documented in the literature. 61 This bis-adduct could then deformylate and desulfonate to cytosine (Figure 16B).
2.3.4. Detection of 5fC with Bisulfite. As discussed, bisulfite causes 5fC to deformylate and deaminate to form U; thus naturally occurring 5fC is indistinguishable from C in BS-sequencing and does not interfere with the detection of 5mC. Two chemical methods, 5fC-assisted bisulfite sequencing (fCAB-Seq) 15c and reduced bisulfite sequencing (redBS-Seq), 16c have been invented to quantitatively sequence 5fC at single base resolution in genomic DNA. Both methods function by exploiting chemistry to block the conversion of 5fC to U during bisulfite treatment.
Figure 14. Reaction scheme and sequencing output for oxBS-Seq. (A) 5hmC is oxidized to 5fC by potassium perruthenate, which is then deaminated by sodium bisulfite. (B) 5mC is the only base to read as a C in oxBS-Seq; 5hmC can be distinguished as the difference between the readout of C bases from BS-Seq and oxBS-Seq.
In fCAB-Seq a substituted hydroxylamine is reacted with the formyl group of 5fC to form an oxime. 15c This oxime is not susceptible to hydrolytic deamination to U during bisulfite treatment. Therefore, by subtracting the data obtained from BS-Seq, where C and 5fC read as a U, from data obtained by fCAB-Seq where only C reads as a U and 5fC reads as a C, 5fC can be identified as the difference (Figure 17).
fCAB-Seq has been used to detect 5fC status at single base resolution of several targeted regions in mouse genomic DNA. Sequencing was carried out on wild type mESC genomic DNA, along with mESC DNA from cells where the TDG enzyme, thought to be responsible for the removal of 5fC, has been knocked down.
In redBS-Seq, sodium borohydride is used to reduce 5fC to 5hmC in genomic DNA. 16c Given that 5hmC is read as a C in sequencing that follows bisulfite reaction, the reduced 5fC will no longer deaminate to U during bisulfite treatment. By subtracting the data obtained from BS-Seq, where C and 5fC are read as a U, from data obtained by redBS-Seq, where only C reads as a U and 5fC reads as a C, 5fC can be identified as the difference (Figure 18).
By combining RRBS-Seq with redBS-Seq, a single base resolution map of 5fC at CpG sites across the mESC genome was generated. Furthermore, this method was employed in parallel with oxBS-Seq to generate a high-resolution map of 5mC, 5hmC, and 5fC.
2.3.5. Detection of 5caC with Bisulfite. The discovery of 5fC was accompanied by the discovery of 5caC, present at levels ten times lower than 5fC in genomic DNA. 13a 5caC deaminates during bisulfite treatment. One method has been published to detect 5caC in DNA, termed chemical modification-assisted bisulfite sequencing (CAB-Seq). 39 In CAB-Seq, 5caC is converted to an amide by reaction with EDC and a primary amine. It was demonstrated that amide derivatives of 5caC resist conversion to U during bisulfite treatment and are therefore read as a C. This method could potentially be used to detect 5caC by subtracting the BS-Seq data, where C, 5fC, and 5caC read as a U, from the CAB-Seq data, where 5caC reads as a C (Figure 19).
Figure 15. Reaction scheme and sequencing output for TAB-Seq. (A) 5hmC is blocked from further reaction by glucosylation by βGT. 5mC is then oxidized to 5caC with TET1 oxidase, which is then deaminated by sodium bisulfite. (B) 5hmC is the only base to read as a C in TAB-Seq; 5mC can be distinguished as the difference between the readout of C bases from BS-Seq and TAB-Seq.
The CAB-Seq method has been demonstrated on synthetic DNA with qualitative sequencing technologies. It would be of great interest to explore how quantitatively CAB-Seq can measure the level of 5caC in DNA and at specific locations in the genome. While global genomic levels of 5caC are extremely low, it will be important to address whether there are sites where 5caC is abundant, as is the case with 5fC. 15c,16c
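The readouts described in Sections 2.3.3 to 2.3.5 can be summarized as a small lookup table, sketched here in Python. The table and the helper `resolved_by` are illustrative constructs, not part of any published tool. Entries the text does not state explicitly (5caC reading as U in fCAB-Seq and redBS-Seq, and 5fC reading as U in TAB-Seq and CAB-Seq) are assumptions of this sketch.

```python
# Bases that still read as C after each treatment; all other cytosine
# species (including unmodified C) are taken to read as U/T.
READS_AS_C = {
    "BS-Seq":    {"5mC", "5hmC"},
    "oxBS-Seq":  {"5mC"},
    "TAB-Seq":   {"5hmC"},
    "fCAB-Seq":  {"5mC", "5hmC", "5fC"},   # 5caC -> U is an assumption
    "redBS-Seq": {"5mC", "5hmC", "5fC"},   # 5caC -> U is an assumption
    "CAB-Seq":   {"5mC", "5hmC", "5caC"},  # 5fC -> U is an assumption
}


def resolved_by(method_a, method_b):
    """Bases whose readout differs between two methods.

    These are the species that a subtraction of the two data sets can report.
    """
    return READS_AS_C[method_a] ^ READS_AS_C[method_b]


for pair in [("BS-Seq", "oxBS-Seq"), ("BS-Seq", "TAB-Seq"),
             ("redBS-Seq", "BS-Seq"), ("CAB-Seq", "BS-Seq")]:
    print(pair, "->", sorted(resolved_by(*pair)))
```

Each subtraction in this table isolates exactly one species, which is the design principle shared by all of these methods: a chemical or enzymatic step flips the bisulfite readout of a single modified base relative to standard BS-Seq.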

SINGLE MOLECULE SEQUENCING
Sequence analysis of modified cytosines by bisulfite-based methods has been greatly enabled by the advent of low-cost, high-throughput (sometimes called "Next Generation") sequencing on platforms such as the Solexa/Illumina system. 22,23 Generally, such approaches have been applied to genomic DNA derived from populations of cells, thereby providing an average representation of the cell population. Single cell analysis via bisulfite sequencing can be achieved by careful and efficient manipulation of the genomic DNA isolated from a single cell. 62 There are also single molecule sequencing approaches, at various stages of development, that have the potential to directly detect modifications to DNA bases and decode genomes from single cells without prior amplification steps.

SMRT Sequencing
One approach for the single molecule sequencing of modified bases is to exploit the pausing of a polymerase due to the presence of chemical tags; this has been demonstrated for the detection of 5hmC in single-molecule real-time (SMRT) sequencing. 63 SMRT DNA sequencing is a single molecule sequencing technology in which the continuous incorporation of phospholinked nucleotides by a DNA polymerase is detected as fluorescent pulses. The kinetics of nucleotide incorporation depends on the nature of the bases, and the polymerase incorporation rate is typically slower at a modified base position. 5mC and 5hmC have a similar low kinetic signature, which makes it difficult to distinguish between them and nonmodified C. 63b However, 5fC and 5caC give a greater signal than 5mC and 5hmC and, through oxidation of 5mC with the TET enzymes, have been used to detect 5mC. 64 In order to sequence 5hmC in a genomic DNA sample with high confidence, Song et al. combined the selective chemical labeling of 5hmC with SMRT sequencing technology. 65 To this end, 5hmC was glucosylated with an azide-modified glucose using β-glucosyltransferase, and a cleavable biotin-containing disulfide linker was then clicked onto the azide group (Figure 20).
After enrichment of 5hmC containing DNA strands, the fragments were released from the streptavidin beads by DTT treatment and tested for kinetic signatures during SMRT sequencing. This method represents the first example of a single molecule sequencing method being employed to detect 5hmC at single base resolution. In principle, this approach could enable sequencing of modified bases in long reads (>10 kbp).

Nanopore Sequencing
Protein or solid-state nanopores, which contain pores that allow single-stranded DNA to pass through, have the potential to sequence DNA. 66 The nanopore sequencing concept involves the measurement of the current passing through a pore as DNA translocates through it. Each different base gives a distinct current signature when moving through a nanopore, which provides the basis for decoding the base sequence. 66b Early attempts suggest that it might be feasible in the future to use such nanopores to discriminate 5mC and 5hmC (in addition to G, C, T, and A) in DNA. 67 By chemically altering the primary alcohol of 5hmC, it is possible to create a more distinct current signature with which to sequence 5hmC in synthetic DNA. 68

ELUCIDATING DNA MODIFICATIONS IN THE FUTURE
There have been considerable advances in the creation of chemical and enzymatic methods that enable the detection of modifications of cytosine bases in genomic DNA. It is now possible to decode 5mC, 5hmC, 5fC, and 5caC in addition to G, C, A, and T in DNA at single base resolution. When coupled with the recent (and ongoing) transformations in DNA sequencing technologies, it is practical to carry out such analysis on whole human (and other species') genomes. Collectively these methods will pave the way to understand the role of modified cytosines in nature and ultimately the exploitation of this knowledge in medicine, agriculture, and biotechnology.

AUTHOR INFORMATION
Corresponding Author *E-mail: sb10031@cam.ac.uk. Balasubramanian's research is on the chemical biology of nucleic acids and the genome and has included an approach for genome sequencing, the study of G-quadruplex nucleic acids, and modified bases in DNA.