Dear Editor,

TALE (transcription activator-like effector) DNA-binding repeats, which can be assembled modularly to target almost any DNA sequence, provide a powerful tool for genetic editing. All known recognition codes are for the four fundamental bases. To date, there has been no report on the recognition of modified DNA by TALEs. In this study, we report two crystal structures of engineered TALE repeats in complex with methylated DNA elements at 1.85 Å and 1.95 Å resolutions, respectively. Biochemical analysis shows that the TALE code NG, but not HD, binds to 5-methylcytosine (mC). Our findings not only extend the application of TALEs to epigenetic modification and cancer research, but also reveal previously unconsidered limits on the applications of TALEs.

DNA methylation is a major epigenetic mark and plays a pivotal role in diverse biological processes in a wide range of organisms. In mammals, DNA methylation usually occurs at the C5 position of cytosine in the CG context. Hypermethylation of the CpG islands may lead to gene silencing1,2.

The TALEs are a family of DNA-binding proteins3,4,5. A TALE contains a number of sequence repeats, each recognizing one DNA base. Each TALE repeat consists of 33-35 highly conserved amino acids except for those at positions 12 and 13, which are named RVD (repeat variable diresidue) and determine DNA-binding specificity. The recognition codes between the RVDs and the DNA bases have been established through experimental and computational approaches6,7. For example, the bases A, G, C and T can be recognized by the RVDs NI (Asn and Ile), NN (Asn and Asn), HD (His and Asp) and NG (Asn and Gly), respectively4. The modular nature of TALE repeats provides an important tool for genetic manipulation8,9,10.
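The one-repeat-per-base code described above can be sketched as a simple lookup. This is only an illustration of the four RVD/base pairs listed in the text; the helper name `rvds_for_target` and the example sequence are hypothetical, and the sketch ignores any sequence requirements outside the repeat region.

```python
# Base -> RVD pairs as listed in the text: A/NI, G/NN, C/HD, T/NG.
RVD_FOR_BASE = {"A": "NI", "G": "NN", "C": "HD", "T": "NG"}


def rvds_for_target(sequence):
    """Return one RVD per base of a 5'->3' target sequence (hypothetical helper)."""
    rvds = []
    for position, base in enumerate(sequence.upper(), start=1):
        if base not in RVD_FOR_BASE:
            raise ValueError(f"no RVD listed for base {base!r} at position {position}")
        rvds.append(RVD_FOR_BASE[base])
    return rvds


# Toy target, chosen only for illustration.
print(rvds_for_target("GATC"))  # ['NN', 'NI', 'NG', 'HD']
```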

We recently determined the high-resolution structure of DNA-bound TALE dHax3, which provided the molecular basis for base-specific DNA recognition11. The 34-residue TALE repeat comprises two α-helices connected by a short loop where the RVD resides. The RVD loop tracks along the major groove of DNA (Figure 1A). Only the second residue of the RVD, namely the one at position 13, is in direct contact with the base in the sense DNA strand, whereas the first residue helps maintain the RVD loop conformation through a hydrogen bond.

Figure 1

The TALE repeats containing code NG, but not HD, recognize 5-methylcytosine. (A) The crystal structure of DNA-bound dHax3 TALE repeats (PDB accession code 3V6T). RVDs in each repeat are shown as red spheres. The upper and lower DNA strands are colored gold and silver, respectively. (B) Structural analysis of the code NG→T suggested that NG may recognize 5-methylcytosine (mC). (C) The DNA fragments with three T bases substituted with mC retained binding to TALE repeats. The EMSA protocol is described in detail in Supplementary information, Data S1. (D) Crystal structure of dHax3 in complex with methylated DNA dHax3-5mC at 1.85 Å resolution. The overall structure is identical to Figure 1A; see also Supplementary information, Figure S2A. Only one mC and the Gly13 in the corresponding repeat are shown to highlight the van der Waals interaction, which is indicated by the black dashed line, between the 5-methyl group of mC and the Cα atom of Gly. (E) Crystal structure of a dHax3 variant in complex with methylated DNA containing (mC)G(mC)G at 1.95 Å resolution. The hydrogen bond between Asn13 and base G is shown as a red dashed line. The Cα atoms of Gly13 residues are shown as red spheres. (F) NG, but not HD, recognizes 5-methylcytosine. Lanes 1-20: The DNA of the dHax3 box with six T bases replaced by mC retained binding to dHax3 repeats, whereas replacement by C led to complete loss of binding. Lanes 21-40: HD specifically recognizes C. Substitution of the five cytosines in the dHax3 box with any other nucleotide, including mC, leads to almost complete loss of binding to dHax3 repeats. All the structure figures were prepared with PyMOL12.

Notably, the DNA base T is recognized by Gly13 in most cases. The lack of a side chain in Gly not only provides sufficient space to accommodate the 5-methyl group of thymine but also allows optimal van der Waals interactions between the Cα atom of Gly13 and the 5-methyl group11 (Figure 1B). This observation immediately suggests the possibility that mC might be recognized by Gly13 in the RVD, because the only difference between the bases T and mC is at position 4, which is not involved in binding to TALE repeats. To examine this possibility, we replaced three T bases in the sense DNA strand with three mC bases and performed DNA-binding studies using the electrophoretic mobility shift assay (EMSA).

Confirming our prediction, the dHax3 protein binds to the triply modified DNA, with the forward strand 5′-TCCCT(mC)TA(mC)CTC(mC)-3′ (Figure 1C). This binding is very similar to that for the unmodified dsDNA, with the forward strand 5′-TCCCTTTATCTCT-3′ (Figure 1C). This result is rather striking, considering the fact that three T-A base pairs have been replaced by three mC-G base pairs in the dsDNA.

Next, we crystallized the binary complex between dHax3 and the triply modified DNA-binding sequence, and determined its structure at 1.85 Å resolution (Figure 1D and Supplementary information, Figures S1A, S2A and Table S1). As anticipated, the 5-methyl group of mC points to the Cα of Gly13 with a distance of 3.4-4.0 Å for the three mC bases (Figure 1D and Supplementary information, Figure S2B). As DNA methylation mostly occurs at cytosine in the CG context, we constructed a dHax3 variant that is expected to recognize the DNA element 5′-TCCCTT(mC)G(mC)GTCT-3′, where the RVDs NG and NN are designed for bases mC and G, respectively (Supplementary information, Figure S1A). The crystal structure of this dHax3 variant, which we name dHax3-mCG, in complex with its target dsDNA was also obtained and refined at 1.95 Å resolution (Figure 1E and Supplementary information, Figure S2C and Table S1). The coordination of mC bases by Gly13 residues is identical to that in the first structure (Figure 1D). In fact, the two structures of dHax3 variants in complex with triply and doubly methylated DNA elements are nearly identical to that of dHax3 bound to the unmodified DNA11, with root-mean-squared deviation values of less than 0.3 Å over more than 900 Cα atoms (Supplementary information, Figure S2D).

Encouraged by the structural findings, we replaced all six T bases with mC in the sense DNA strand. A subsequent EMSA study revealed that dHax3 retained similar binding to this DNA element as to the unmodified DNA (Figure 1F, lanes 1-10; Supplementary information, Figure S1B). In contrast, there was no detectable binding between dHax3 and the DNA element in which the six T bases were replaced by the base C (Figure 1F, lanes 11-20). This result is rather striking, because it suggests a qualitative and reliable method for differentiating the bases mC and C. We next examined whether the RVD code HD, which favors the base C, may also recognize mC. Substitution of the five C bases with mC or T in the sense DNA strand led to complete abrogation of DNA binding by dHax3 (Figure 1F, lanes 21-30). Substitution of the five C bases with A or G resulted in significant impairment of DNA binding (Figure 1F, lanes 31-40). These results illustrate the specific nature of mC recognition by TALE repeats involving the RVD NG, but not HD.
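The qualitative outcomes of these EMSA experiments can be collected in a small lookup table, sketched below. The table and helper names are hypothetical. Note the simplification: each experiment replaced all six T bases or all five C bases at once, so the entries record whole-site observations for dHax3 rather than measured affinities of single repeats.

```python
# Qualitative EMSA outcomes for dHax3 as described in the text, keyed by
# (RVD, base facing that RVD). "mC" stands for 5-methylcytosine.
# True  = binding retained; False = binding lost or significantly impaired.
# Assumption: whole-site substitutions are summarized as per-pair entries.
REPORTED_BINDING = {
    ("NG", "T"): True,    # unmodified site
    ("NG", "mC"): True,   # six T -> mC, binding similar to unmodified DNA
    ("NG", "C"): False,   # six T -> C, no detectable binding
    ("HD", "C"): True,    # unmodified site
    ("HD", "mC"): False,  # five C -> mC, binding abrogated
    ("HD", "T"): False,   # five C -> T, binding abrogated
    ("HD", "A"): False,   # five C -> A, significantly impaired
    ("HD", "G"): False,   # five C -> G, significantly impaired
}


def reported_binding(rvd, base):
    """Return True/False as reported in the text, or None if the pair was not tested."""
    return REPORTED_BINDING.get((rvd, base))


def separates_c_from_mc(rvd):
    """True when the text reports opposite outcomes for C and mC with this RVD."""
    c = reported_binding(rvd, "C")
    mc = reported_binding(rvd, "mC")
    return c is not None and mc is not None and c != mc
```

In this summary both NG and HD separate C from mC, but in opposite directions: NG retains binding to mC and loses it for C, whereas HD retains binding to C and loses it for mC.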

Our experimental characterization provides a molecular basis for distinguishing methylated and unmethylated cytosine. Binding of mC by TALE repeats through the RVD NG extends the DNA recognition code and has potential application in epigenetics and cancer research. For example, specific TALE repeats may be designed to recognize hypermethylated DNA regions; detection can be facilitated by fusing TALEs with fluorescent proteins.

Our study also strongly argues that the in vivo methylation status of the target DNA sequence must be considered in the design of specific DNA-binding TALEs. Methylation of the base C in vivo might render the DNA sequence unfit for binding by the designed TALEs. Because the methylation status of DNA sequences is frequently under dynamic control, one would have to design at least two TALEs for one DNA sequence (i.e., one for the methylated and one for the unmethylated form). In fact, assessment of the methylation status of specific DNA sequences in vivo can be greatly facilitated through quantification of the fluorescence signal of designed GFP-TALEs. Alternatively, CpG sequences may be avoided in the application of TALEs, although this practice will somewhat limit the potential applications. Despite these complexities, the discovery of mC binding by TALEs with RVD NG opens a number of exciting opportunities.
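The idea of designing two TALEs for one DNA sequence can be sketched as follows. This is a minimal illustration under stated assumptions, not a design protocol: the function names and the toy sequence are hypothetical, cytosines in a CG context are treated as the methylation candidates unless positions are supplied explicitly, and the only rule applied is the one reported here (HD for unmethylated C, NG for mC).

```python
# Base -> RVD pairs from the established code, plus NG for 5-methylcytosine
# as reported in this study.
BASE_TO_RVD = {"A": "NI", "G": "NN", "C": "HD", "T": "NG"}


def cpg_cytosines(sequence):
    """Return 0-based indices of every C that is directly followed by G."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i] == "C" and seq[i + 1] == "G"]


def design_pair(sequence, methylated=None):
    """Return (unmethylated_design, methylated_design) as lists of RVDs.

    `methylated` is an optional iterable of 0-based cytosine indices assumed
    to carry a 5-methyl group; by default every CpG cytosine is used.
    """
    seq = sequence.upper()
    unknown = set(seq) - set(BASE_TO_RVD)
    if unknown:
        raise ValueError(f"unsupported bases: {sorted(unknown)}")
    marks = set(cpg_cytosines(seq) if methylated is None else methylated)
    for i in marks:
        if not 0 <= i < len(seq) or seq[i] != "C":
            raise ValueError(f"position {i} is not a cytosine")
    unmethylated_design = [BASE_TO_RVD[base] for base in seq]
    methylated_design = [
        "NG" if i in marks else BASE_TO_RVD[base] for i, base in enumerate(seq)
    ]
    return unmethylated_design, methylated_design


# Toy target containing the CGCG motif; chosen only for illustration.
plain, methyl = design_pair("TCGCGT")
print(plain)   # ['NG', 'HD', 'NN', 'HD', 'NN', 'NG']
print(methyl)  # ['NG', 'NG', 'NN', 'NG', 'NN', 'NG']
```

Because NG binds both T and mC, the methylated design in this sketch would not by itself distinguish a methylated cytosine from a thymine at the same position.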