Revealing structural peculiarities of homopurine GA repetition stuck by i-motif clip

Abstract Non-canonical forms of nucleic acids represent challenging objects for both structure-determination and investigation of their potential role in living systems. In this work, we uncover a structure adopted by GA repetition locked in a parallel homoduplex by an i-motif. A series of DNA oligonucleotides comprising GAGA segment and C3 clip is analyzed by NMR and CD spectroscopies to understand the sequence–structure–stability relationships. We demonstrate how the relative position of the homopurine GAGA segment and the C3 clip as well as single-base mutations (guanine deamination and cytosine methylation) affect base pairing arrangement of purines, i-motif topology and overall stability. We focus on oligonucleotides C3GAGA and methylated GAGAC3 exhibiting the highest stability and structural uniformity which allowed determination of high-resolution structures further analyzed by unbiased molecular dynamics simulation. We describe sequence-specific supramolecular interactions on the junction between homoduplex and i-motif blocks that contribute to the overall stability of the structures. The results show that the distinct structural motifs can not only coexist in the tight neighborhood within the same molecule but even mutually support their formation. Our findings are expected to have general validity and could serve as guides in future structure and stability investigations of nucleic acids.


INTRODUCTION
DNA can, depending on the primary sequence, adopt various secondary structures distinctly different from the classical WC double helix. These non-canonical structures frequently appear in important regions of the genome and have specific functions in regulating biological processes (1). Indeed, searching for new sequence-dependent conformational arrangements and understanding the principles of their formation and their functional consequences are prerequisites for controlling gene expression and for the rational use of scientific findings in medical applications (2).
Alternating d(GA)n·d(TC)n sequences are abundant in genomes (3–6). They are especially frequent in genomes of rodents and primates, where a significant fraction of them is found in long (n ≥ 30 base-pairs) blocks (3). This microsatellite sequence is extremely polymorphic (7,8). Apart from the classical duplex and triplexes, the individual single strands can adopt various non-canonical arrangements: the d(CT) strand can form an intercalated iM with thymine bulges in slightly acidic conditions, and numerous models have been proposed for the arrangement of the d(GA) strand: alpha helix-like ordered single strands, parallel and also antiparallel duplexes, and tetraplexes (8–11). The polymorphism of the sequence may play a role in the many different biological functions reported for it. Microsatellite d(GA)·d(TC) is a recombination hot spot. A significant enhancement of homologous DNA recombination in minichromosomes of SV40 polyomavirus was found in humans and monkeys (12). It has been shown that the PGB protein found in human fibroblast cells selectively binds single-stranded d(GA) repeats, induces strand separation of the WC-paired heteroduplex and stabilizes formation of triple helices or other unusual DNA structures (13). The sequence d(AG)4 was found to have a higher activity as a primer for DNA polymerase I in Escherichia coli than any other dinucleotide repeat (14). Other experimental studies have shown that d(GA)n repetitions play various roles in different genes, e.g. at the hsp26 promoter of Drosophila melanogaster the d(AG) sequences surround the region responsible for the nucleosome binding (15) and have a critical influence on the formation of DNase I hypersensitive sites (16).
Self-association of d(GA)n has been studied since the 1980s, when its tetrahelical arrangement was proposed at neutral pH (17,18). Another work, employing pH titrations that induced disproportionation of the poly[d(AG)·d(CT)] sequence monitored by CD, distinguished six rearrangements (duplex → triplex + free d(AG) → 'acid-induced self-complex' of purine strand even in the presence of a complementary pyrimidine strand) (19). Similarly, the secondary structure of d(GA)10 at a lower salt concentration (<10 mM Na+) and acidic pH was described as a single-stranded left-handed helix built from unstacked nucleobases with sequential intramolecular ionic bonds between protonated A and the phosphate group of G (8,20). CD experiments showed that such ordered d(GA)n single strands tend to dimerize into a homoduplex at higher salt concentrations without substantial conformational rearrangement (21). Several spectroscopic and chemical-probing experiments provided data analogous to those of parallel G-quadruplexes, suggesting that the d(GA)10 oligonucleotide at high concentrations (≥100 mM Na+) of uni-univalent salts (or the more potent MgCl2) adopts a parallel duplex with direct stacking of G·G base-pairs (bps) and the intervening A residues pushed outside the duplex (22). Rippe et al. (9) proposed a parallel double helix structure of d(GA)n formed by symmetric syn G·G bps in N1H-O6 geometry (associated via the Watson-Crick edge, WCE) and anti A·A bps in N6H-N7 geometry (Hoogsteen edge, HE), vide infra.
Interestingly, the single 5′-GA-3′ segment in parallel duplexes adopts a base pairing geometry different from those proposed above. In the work by Robinson et al. (23), it was suggested that at acidic pH, CGATCG adopts a parallel duplex structure with the G2·G2 bp in N2H-N3 geometry (sugar edge, SE) and the A3·A3 bp in HE geometry. A similar pairing pattern was found in the sequences (CGA)2C, C(GA)3 and C(GA)3C, where the GA segment is attached to a CH+·C pair at its 5′-end (24,25). The unusual compact pairing of guanine was recognized, among others, based on substantial shielding of the N1H imino proton and inter-strand NOE connectivities observed in ¹H NMR spectra. This specific arrangement (termed the GA step in this paper) was found to be a powerful motif in promoting parallel duplex formation. The extraordinary stability of this structure was ascribed to the effective inter-strand stacking overlap, as shown in detail in the structure of the d(TCGA) duplex determined using NMR spectroscopy (see Supplementary Figure S1) (26).
To investigate the structure of d(GA)n dinucleotide repeats arranged in a parallel duplex (27) and consequently clarify the, in some cases, contradictory claims about stoichiometry, protonation state of adenine and type of base pairing found in the literature (18–21), a robust and well-defined model sequence is essential. We started our investigation with d(GA)n (n = 5, 10) sequences. However, the broad signals in their NMR spectra indicated a dynamic ensemble of structures preventing precise structure determination (see Supplementary Figure S2). Therefore, the flexibility of the system was decreased by shortening the d(GA)n block to two repeats (n = 2). To impose parallel orientation of the strands in our models, we introduced a C3 segment forming an i-motif (iM). A similar strategy was employed to stabilize a parallel duplex composed of A-T bps in reverse WC geometry (28). The resulting tremendous improvement in spectral properties encouraged us to design a series of constructs with different relative positions of the homopurine part and the iM clip, which allowed the high-resolution structures to be determined, vide infra.
The iM is constituted from hemi-protonated cytosine bps (CH+·C) formed under acidic conditions (29,30). Recent papers, however, show that, with increasing length of the C-rich sequence (31,32), formation of the iM can shift to the neutral pH range. The interest of the scientific community in iMs has been spurred by recent demonstrations of their existence in vivo. Specifically, iM structures were detected in regulatory regions of the human genome, namely in promoters and telomeric regions, by an iM-specific fluorescently marked antibody (33). At the same time, the existence of an iM formed by selected promoter sequences was demonstrated by in-cell NMR experiments (34). Recently, the compatibility of the iM with B-DNA at neutral pH has been demonstrated (35). In that work, the unimolecular iM is stabilized by a hairpin at one side and a minor groove tetrad at the other side. In general, the iM structure can adopt two topologies differing in the intercalation pattern: in the 3′E arrangement, the outermost CH+·C bp is located at the 3′-end of the sequence, whereas the 5′E arrangement is characterized by a 5′-external CH+·C bp (36). Previously, the overall higher stability of the extended 3′E topology was attributed to a larger number of favourable weak CH···O4 hydrogen bonds between sugar moieties, as indicated by a comparative MD simulation (37). This preference can be inverted if the resulting structure is a kinetic product of the folding process (38). Incorporation of 5-methylcytosine (mC) was reported to induce thermal stabilization of iMs and to modulate the preferred topology (39,40). Here, we used mC-for-C substitution as a label facilitating the assignment of NMR signals of the iM.
In this work, we performed a systematic CD and NMR investigation of the stability and structure of various d(C3R4) and d(R4C3) oligonucleotides, where R4 stands for a GA or AG repeat. Their global fold was examined using CD, absorption and 1D ¹H NMR spectra. For sequences adopting well-defined structures, NMR data were used to build high-resolution models that were further refined and analysed based on MD simulations in explicit solvent. The goals of this study are addressed in the following sequence:
• To screen the global structure and stability of the designed sequences by using CD and absorption spectroscopy
• To determine iM topology and base pairing of the purine block in various sequential contexts based on ¹H NMR experiments
• To characterize the modulation of iM topology by 5-methylation of cytosine
• To describe the structural differences in R-C and C-R steps using high-resolution models and to correlate these changes with global stability

[Table caption fragment] ... Figure S3). The measurements were performed in 10 mM potassium phosphate and 50 mM KCl (65 mM K+), pH 5, at 0.15 mM strand concentration.

Preparation of oligonucleotide samples
All oligonucleotides (see Table 1) were synthesized and HPLC-purified by the supplier, Merck. Dried pellets were dissolved in 500 µl of solution containing 10 mM potassium phosphate and 50 mM KCl (total of 65 mM K+) at pH 5. All measurements were performed in this solution, unless stated otherwise. For NMR measurements, the oligonucleotide samples were subjected to 4 cycles of centrifugal filtration using 3 kDa Millipore filters (Amicon) and concentrated to 50 µl. Afterwards, 50 µl of D2O and 400 µl of the above-described solution were added to yield 500 µl of final solution, which was annealed overnight, followed by a check of pH. Selected samples were lyophilized and transferred to 99.95% D2O.

UV absorption and CD spectroscopy
DNA strand concentrations were determined based on UV absorption measured at 260 nm on a UNICAM 5625UV/Vis spectrophotometer (Cambridge, U.K.) using molar absorption coefficients calculated by the nearest neighbour method (41). Sample concentration for CD experiments was (unless stated otherwise) 0.15 mM per strand.
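The concentration determination described above rests on the Beer-Lambert law, c = A260 / (ε260 · l). A minimal sketch follows; the function name and the numerical values are illustrative assumptions, and the extinction coefficient is a placeholder rather than a nearest-neighbour value for any sequence in this work.

```python
def strand_concentration_mM(a260: float, epsilon_260: float, path_cm: float) -> float:
    """Strand concentration in mM from absorbance at 260 nm (Beer-Lambert law).

    a260        -- measured absorbance (dimensionless)
    epsilon_260 -- molar absorption coefficient in M^-1 cm^-1
    path_cm     -- optical path length in cm
    """
    if epsilon_260 <= 0 or path_cm <= 0:
        raise ValueError("epsilon_260 and path_cm must be positive")
    # c [M] = A / (epsilon * l); multiply by 1000 to report mM
    return a260 / (epsilon_260 * path_cm) * 1000.0


# Hypothetical numbers for illustration only (not taken from the paper)
example_epsilon = 70000.0  # M^-1 cm^-1, placeholder value
example_conc = strand_concentration_mM(a260=0.525, epsilon_260=example_epsilon, path_cm=0.05)
```

In practice the coefficient would come from the nearest-neighbour calculation cited in the text (41), evaluated for each sequence.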
CD measurements were carried out using a Jasco 815 (Tokyo, Japan) dichrograph in 0.05 cm (singularly in 0.02 cm) path-length quartz Hellma cells placed in a Peltier cell holder. Spectra were measured in the range of 220–330 nm with a scan speed of 100 nm/min, and a set of four scans was averaged for each spectrum. The CD signal was expressed as the difference in the molar absorption, Δε, of the left- and right-handed circularly polarized light, according to the formula Δε [M⁻¹ cm⁻¹] = θ / (32.98 c l), where θ is the measured ellipticity value [mdeg], c is the molar strand concentration [M], and l is the optical path of the cell used [cm] (41).
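The conversion from ellipticity to Δε can be sketched as below. This sketch uses the commonly quoted relation θ [mdeg] = 32 980 · Δε · c [M] · l [cm]; the factor 32.98 given in the text corresponds to the same relation if the concentration is entered in mM, which is our reading and not something the text states. The function and constant names are assumptions.

```python
MDEG_PER_ABSORBANCE_UNIT = 32980.0  # theta [mdeg] = 32980 * delta_A


def delta_epsilon(theta_mdeg: float, conc_molar: float, path_cm: float) -> float:
    """Convert ellipticity (mdeg) to delta-epsilon (M^-1 cm^-1).

    Assumes concentration in mol/l and path length in cm.
    """
    if conc_molar <= 0 or path_cm <= 0:
        raise ValueError("concentration and path length must be positive")
    return theta_mdeg / (MDEG_PER_ABSORBANCE_UNIT * conc_molar * path_cm)


# Illustrative call with the strand concentration and cell used for CD in the text;
# the ellipticity value itself is a made-up example.
example_de = delta_epsilon(theta_mdeg=10.0, conc_molar=0.15e-3, path_cm=0.05)
```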
CD melting experiments were monitored by changes in Δε at wavelengths corresponding to the dominating CD band of particular sequences (specified in Table 1 and figures). The temperature was increased in 2 °C steps and the samples were equilibrated for 2 min at each temperature before collecting the spectrum (the total time spent at each temperature point was ∼6 min, resulting in a rate of temperature change of ∼0.33 °C/min). The Tm values were determined from dual baseline-corrected 1–0 normalized curves (1, native form; 0, denatured form) as the temperatures at which half of the molecules were folded (42). The error associated with the determination of melting temperature was estimated to be ±1 °C based on repeated measurements.
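The Tm determination described above (dual baseline correction, normalization to 1 for native and 0 for denatured, then reading off the half-folded point) can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the baselines are fitted as straight lines to the first and last few points, the midpoint is found by linear interpolation, and all names and the synthetic data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def folded_fraction(temps: Sequence[float], signal: Sequence[float],
                    n_low: int, n_high: int) -> List[float]:
    """Normalize a melting curve to 1 (native) .. 0 (denatured) using two baselines."""
    lo_s, lo_i = fit_line(temps[:n_low], signal[:n_low])      # native baseline
    hi_s, hi_i = fit_line(temps[-n_high:], signal[-n_high:])  # denatured baseline
    out = []
    for t, s in zip(temps, signal):
        native, denat = lo_s * t + lo_i, hi_s * t + hi_i
        out.append((s - denat) / (native - denat))
    return out


def melting_temperature(temps: Sequence[float], fractions: Sequence[float]) -> float:
    """Temperature at which the folded fraction crosses 0.5 (linear interpolation)."""
    for i in range(len(temps) - 1):
        f0, f1 = fractions[i], fractions[i + 1]
        if f0 >= 0.5 >= f1 and f0 != f1:
            return temps[i] + (f0 - 0.5) * (temps[i + 1] - temps[i]) / (f0 - f1)
    raise ValueError("folded fraction never crosses 0.5")


# Synthetic two-state curve sampled in 2 degree steps, midpoint placed at 40 degrees
temps = [float(t) for t in range(0, 82, 2)]
signal = [1.0 + 9.0 / (1.0 + math.exp((t - 40.0) / 3.0)) for t in temps]
tm = melting_temperature(temps, folded_fraction(temps, signal, n_low=5, n_high=5))
```

With the 2 °C sampling used in the text, interpolation between neighbouring points is what allows a Tm to be quoted more finely than the step size; the reported ±1 °C error comes from repeated measurements, not from this calculation.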
Hysteresis between the melting and renaturation processes was checked using UV absorption measured on a Varian Cary 4000 (Mulgrave, Australia) spectrometer and determined as the difference between the melting (Tm) and renaturation (Tren) temperatures. The absorption melting and denaturation experiments were carried out at 0.15 mM DNA concentration using 0.05 cm cells at three rates of temperature change: in 1 °C increments with 3 and 6 min waiting prior to taking each spectrum (total time 4.5 and 7.5 min per point), resulting in ∼0.22 and 0.13 °C/min, respectively, and also in 2 °C increments with 4.5 min waiting at each temperature (total time 6 min per point), resulting in 0.33 °C/min as in the CD measurement. The Tm and Tren values were determined as stated above for CD melting experiments. The course of melting and renaturation experiments taken in 10 mM potassium phosphate and 50 mM KCl, pH 5 (adjusted by 0.1 M HCl), was compared for selected sequences with the melting experiments carried out in a solution of potassium Robinson-Britton buffer (K-RB), pH 5, with KOH added up to a final concentration of 65 mM K+. K-RB buffer was prepared from a mixture of 0.04 M acids (boric, phosphoric and acetic) and 0.2 M KOH.
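The effective heating rates quoted above follow directly from the temperature step divided by the total time per point, and the hysteresis is the difference Tm − Tren. A short check of that arithmetic (function names are ours; the Tm and Tren values in the last line are hypothetical):

```python
def scan_rate(step_deg_c: float, total_min_per_point: float) -> float:
    """Effective rate of temperature change in degrees C per minute."""
    return step_deg_c / total_min_per_point


def hysteresis(t_melt: float, t_renat: float) -> float:
    """Hysteresis defined as melting minus renaturation temperature."""
    return t_melt - t_renat


# (step in degrees C, total minutes per point) for the three protocols in the text
protocols = [(1.0, 4.5), (1.0, 7.5), (2.0, 6.0)]
rates = [round(scan_rate(step, minutes), 2) for step, minutes in protocols]

# Hypothetical Tm and Tren, for illustration only
example_hysteresis = hysteresis(50.0, 44.0)
```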

NMR spectroscopy
The ¹H 1D and 2D NMR spectra of the studied sequences were measured on Bruker Avance III HD 600 and 700 MHz spectrometers equipped with a quadruple-resonance cryoprobe and a triple-resonance room temperature probe, respectively. The NMR samples were prepared at 0. [...] (44,45) experiments were recorded on selected samples in 99.95% D2O at 70, 100 and 150 ms mixing time to determine interproton distances. Resolved ³¹P shifts were assigned based on a ¹H-³¹P COSY experiment (46). Acquired data were processed using TopSpin 3.2 software. Sparky 3.114 (47) was used for the resonance assignment and integration of NOE cross-peaks.
A starting structure of the homoduplex segment was built manually from nucleotides which were arranged to qualitatively respect right-handed helicity and experimental data regarding -torsion angle. The starting structure of the iM clip was taken from the structure of C4A2 (PDB: 1YBN) (36). Isolated structures of the homoduplex segment and the iM clip were subjected to simulated annealing with the above-described restraints. Afterwards, the homoduplex segment was manually attached to both ends of the iM clip. The energy of the resulting structure was minimized, and the minimized structure was subjected to simulated annealing with the complete set of restraints. As the restraint violations between different simulated annealing runs were comparable, the most symmetrical structure was selected for further simulations. The selected structure was solvated by a 12 Å layer of water with Na+ and Cl− ions to yield isotonic concentration using Solvate software (www.mpibpc.mpg.de/grubmueller/solvate). A truncated octahedron was created afterwards by tleap (48). The solvated molecule was subjected to a 10-stage temperature equilibration (58) in order to relax both explicit solvent and solute. Restraints were included in the last two stages of equilibration.

Figure 1. Scheme of possible C3R4 and R4C3 arrangements: Both classes of sequences (C3R4 and R4C3) studied in this work can adopt two possible iM topologies: the 3′E arrangement with the outermost CH+·C bp at the 3′-end, and the 5′E arrangement with the outermost CH+·C bp at the 5′-end. The 5′→3′ polarity of individual strands is depicted by the orientation of triangles. The cytosines are denoted by blue and purines by grey triangles. The alternative descriptors reflecting the compactness of the iM clip are indicated in the header. Note the different stacking modes at C-R and R-C steps in extended and compact topology: in the extended topology the R·R bp is directly stacked on the CH+·C bp of the same duplex, whereas in the compact topology the direct stacking is interrupted by an intercalated CH+·C bp. The requirement to accommodate a CH+·C bp in compact iM topologies may affect the geometry of the R·R bp, vide infra.

Unbiased molecular dynamics
Constant-pressure unbiased molecular dynamics (MD) simulation was performed on previously equilibrated structures at 300 K with the SHAKE algorithm (59) constraining the length of bonds involving hydrogen. The integration step was set to 2 fs. Target pressure was set to 1 bar and pressure relaxation time to 6 ps. Temperature was regulated by the Berendsen thermostat (60) with a heat bath coupling constant of 5 ps. The non-bonded cutoff was set to 8 Å. No restraints were applied during MD. A snapshot was selected every 100 ps of MD, yielding 10 000 snapshots from 1 µs of simulation. Selected structural parameters were extracted from the trajectory using the cpptraj tool (61). The 3DNA software (62) was used to evaluate overlaps between stacked base-pairs.
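The simulation bookkeeping above can be checked with integer arithmetic: 10 000 snapshots spaced 100 ps apart correspond to 1 µs of simulated time, which at a 2 fs step is 5 × 10⁸ integration steps. The variable names below are ours and do not correspond to input keywords of any MD package.

```python
FS_PER_PS = 1_000

timestep_fs = 2                          # integration step from the text
snapshot_interval_fs = 100 * FS_PER_PS   # one snapshot every 100 ps
n_snapshots = 10_000                     # number of snapshots from the text

total_time_fs = n_snapshots * snapshot_interval_fs
total_time_us = total_time_fs / 1e9      # 1 microsecond = 1e9 fs

n_steps = total_time_fs // timestep_fs
steps_per_snapshot = snapshot_interval_fs // timestep_fs
```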

RESULTS AND DISCUSSION
We have studied purine segments GAGA and AGAG attached to the iM clip either at the 5′- or the 3′-end (Figure 1). In the text below we use the following nomenclature to specify residues and interatomic contacts. Because the i-motif (iM) can be viewed as a pair of intercalated parallel duplexes, we use subscripts I and II to distinguish nucleotides of the two duplexes and subscripts a and b to differentiate nucleotides within a duplex. All base-pairs are formed between symmetrically equivalent residues, e.g. G4_Ia base pairs with G4_Ib in duplex I. A list of oligonucleotides studied in this work is given in Table 1.

Figure 2. CD spectra comparing both isolated structural components (C3 and GA repetition) with two of the main constructs investigated in this work, C3GAGA and GAGAC3, (A) at pH 7 and (B) at pH 5. All measurements were done in 10 mM potassium phosphate and 50 mM KCl (total of 65 mM K+), apart from (GA)5, for which 100 mM K+ was added (total of 115 mM K+) to stabilize its neutral parallel duplex. Due to the inability of (GA)2 to adopt any ordered secondary structure, we show the CD fingerprint of a parallel duplex using the longer (GA)5 sequence. (C) pH-induced changes in CD spectra of the studied sequences monitored by Δε at the wavelengths indicated. Open triangles correspond to non-equilibrium states. CD spectra were measured at 1 °C, in 0.1 cm cells at 0.1 mM DNA strand concentration. The dependences started at alkaline pH and proceeded toward acidic pH.

Global fold and stability
The first estimation of the degree of folding for all the oligonucleotides was based on CD spectra. (GA)n repeats with n ≥ 5 at neutral pH provide CD spectra characteristic of a parallel duplex (see (GA)5 in Figure 2A), with a positive band at 260 nm and a negative band at 240 nm regardless of salt type. The same type of spectrum, but with a shallower negative band, is displayed by an ordered single strand of the GA repeat formed at acidic pH (Figure 2B) or under dehydrating conditions (21). The short GAGA sequence studied in this work remains unstructured under both conditions. In contrast, the C3 segment associates into the tetramolecular iM at acidic pH close to the cytosine pKa value (Figure 2B). Its CD spectrum is characterized by a dominating positive band at 285 nm and a negative band at 265 nm. Interestingly, connecting the two short sequences, C3 and GAGA, gives rise to ordered structures at acidic pH with distinct positive CD amplitudes at the positions of the characteristic CD bands of the two structural components (Figure 2). The band at 260 nm dominates in the CD spectrum of C3GAGA, and the one at the long-wavelength side is lower. In contrast, GAGAC3 provides a very high CD band at 285 nm and only a shoulder around 260 nm. Thus, in both cases, the shape of the CD spectrum is strongly influenced by the structural component positioned at the 3′-end of the molecule. C3, C3GAGA and GAGAC3 all provide conservative, plain spectra at pH 7, indicating that no ordered structure is formed. It is the pH-driven formation of the iM that enables constitution of the new structures consisting of a tetramolecular iM clip and a presumed GAGA homoduplex on both of its sides (see Figure 1). Therefore, we compare CD spectra and pursue all further experiments at pH ∼5.
However, it should be noted that while C3 adopts stable structures only at pH 5 and (GA)2 remains unfolded, both C3GAGA and GAGAC3 already transform toward their stable structures around pH 6 (Figure 2C). This is a very exceptional observation of a synergy between two structural blocks, previously reported only for an i-motif and a G-quadruplex in a single 38-nucleotide molecule (63).
In the following experiment we compare the effect of connecting the C3 block with GAGA and AGAG (Figure 3). Both sequences C3AGAG and AGAGC3 form ordered and cooperatively melting structures. CD spectra indicate that the structure of C3AGAG is similar to that of C3GAGA, but it is markedly less thermostable (Table 1). In contrast, the CD spectrum of AGAGC3 differs from that of GAGAC3. Its positive long-wavelength band is distinctly reduced, and a positive band at 260 nm may indicate a unique structural feature in AGAGC3. The structures with the AGAG repetition are much less thermostable than those containing the GAGA repeat.
Melting of all four sequences is fully reversible. However, whereas the courses of melting and refolding of C3GAGA follow the same curve, and only slightly differ in the case of C3AGAG, the sequence AGAGC3 and, especially, GAGAC3 display a hysteresis between the two processes (Supplementary Figure S3). The hysteresis diminished only slightly upon increasing the time for temperature equilibration. The presence of hysteresis in the case of sequences with the purine block at the 5′-end indicates slow and distinct kinetics of melting and renaturation of their structures. As the CD experiments were performed at a much lower DNA concentration than the NMR measurements, we checked the CD spectra of the four parental sequences at a distinctly increased DNA concentration approaching that used in NMR. The CD spectra of the sequences at 0.7 mM DNA concentration (Supplementary Figure S4) are essentially the same as those at the concentration used for CD measurements, with only slightly higher amplitudes. As expected for intermolecular assembly, the melting temperatures increase at higher oligonucleotide concentration (Supplementary Figure S4) (by 6 °C on average, data not shown).
Based on the analysis of the melting curves derived from CD measurements, the stability trends of the sequences summarized in Table 1 can be expressed as follows:
• Sequences containing the 5′-GA-3′ repetition (i.e. C3GAGA and GAGAC3) show a similar stability regardless of their position relative to the iM. Switching the order of the purine residues to 5′-AG-3′ (i.e. C3AGAG and AGAGC3) causes a significant drop in Tm, by 19 °C for purines at the 3′-end (C3AGAG versus C3GAGA) and by 12 °C for the 5′-end (AGAGC3 versus GAGAC3).
• Neglecting the sequence polarity and other structural effects, the presence of the C-G step is slightly less stabilizing than A-C (C3GAGA versus GAGAC3, both containing the (GA)2 segment) and G-C is more stabilizing than C-A (AGAGC3 versus C3AGAG, both containing the (AG)2 segment). We hypothesize that the presence of the A-C step and the (GA)2 block is linked to a significantly greater stability of GAGAC3 compared to C3AGAG with a C-A step and a single GA step, cf. Figure 3.

C3R4 adopts extended form of i-motif.
In the ¹H NMR spectrum of sequence C3GAGA, one set of cytosine N3H protons was observed (Figure 4A, top, Supplementary Figure S5). Other resonances of cytosine base and sugar protons were assigned based on intra-residual NOE contacts.
To distinguish the intercalation topology and resolve the associated ambiguous assignment of the C1 and C3 residues, we first focused on the inter-duplex contacts (H2′/H2″)_I–(NH2)_II present exclusively in the structure of the iM (Supplementary Figure S6) (64). Such contacts suggested the extended 3′E topology of the iM. A more detailed discussion of signal assignment using mC modification is given in Supplementary Data (see also Supplementary Figures S5, S7 and S8). Incorporation of mC in the sequence C3GAGA (i.e., mCC2GAGA and C2mCGAGA) has no effect on the topology of the iM (Figure 4A, bottom), which is also manifested by the same type of CD spectra showing only a slight increase of the iM band (Supplementary Figure S9). This is also in agreement with an increase of Tm (∼7 °C) caused by mC1 and only a minor one (∼2 °C) in the case of mC3. Similar to C3GAGA, no hysteresis between melting and renaturation curves is displayed by the two methylated analogues. The direct stacking between the C3 and G4 bps was confirmed by NOE connectivity (Supplementary Figure S10). Despite poorly resolved NMR signals of C3AGAG indicating a labile structure (Supplementary Figure S11), we were able to identify the iM to be in the extended 3′E topology.

R4C3 features equilibrium between extended and compact topology of iM.
The broadened and overlapped cytosine N3H signals suggest the presence of two interconverting iM topologies (Figure 4B, top, Supplementary Figure S12) in GAGAC3. The equilibrium was shifted by mC modification of the 3′-terminal cytosine in the sequence GAGAC2mC, which adopts the extended 5′E topology (Figure 4B, bottom, see also amplified bands in the CD spectrum, Supplementary Figure S13). Methylation of both C5 and C7 resulted in a significant increase in Tm, by 11 and 9 °C, respectively. However, the slope of the CD melting curve of GAGAmCC2 is less steep, reflecting lower cooperativity of the transition (Supplementary Figure S13), which along with broadened ¹H NMR signals (Supplementary Figure S12) suggests increased structural variability compared to GAGAC2mC. Melting and renaturation curves of GAGAC3 display an extensive hysteresis (Supplementary Figure S3), which is distinctly affected by methylation: the hysteresis of GAGAmCC2 is significantly reduced, whereas that of GAGAC2mC is even larger than in the parental GAGAC3. Based on the retrospective comparison of imino resonances between GAGAC3 and GAGAC2mC, we assume that the dominant set of signals in GAGAC3 corresponds to a structure with the iM in extended 5′E topology. For the reasons indicated above, we focused on the modified GAGAC2mC instead of GAGAC3. In the ¹H NMR spectrum of AGAGC3, one dominant set of cytosine N3H signals is observed, confirming the presence of the iM (Figure 4C, top, Supplementary Figure S15). The NOE contacts (Supplementary Figure S6) revealed the compact 3′E topology of the iM. Upon mC-for-C7 substitution in AGAGC2mC, the iM completely switched to the extended 5′E topology. The perturbed ¹H NMR shifts (Figure 4C, bottom, Supplementary Figures S16 and S17) and the altered pattern of (H2′/H2″)_I–(NH2)_II NOE contacts in AGAGC2mC compared to AGAGC3 further support the compact 3′E topology in the non-modified AGAGC3.
In AGAGmCC2, the ratio between the compact and extended topologies shifted to ∼1:1, as estimated from the integrals of ¹H NMR signals. The significant impact of cytosine methylation on the fold of AGAGC3 is also demonstrated in CD spectra, where the mC-for-C7 substitution (AGAGC2mC) caused amplification of the 260 nm band (Supplementary Figure S18). The difference in CD spectral pattern might be linked with the structural rearrangement in the purine segment, vide infra. Both mC5 and mC7 substitutions led to an increase in Tm (∼5–7 °C).

Purines in C3GAGA adopt four different base-pair geometries whereas those in C3AGAG remain mostly unstructured.
The NOE sequential walk C2_I–C1_II–C3_I–G4_I established the starting point for an assignment of the remaining part of the purine segment. The standard (H1′/H2′)_i–H8_(i+1) NOE connectivity was used to assign the signals of aromatic protons. The ¹H resonance at 10.0 ppm was assigned to G4 N1H based on its NOE contacts to C3 H5 and A5 H1′/H8. A significant shielding of G4 N1H indicates that it is not participating in any stable H-bond (sugar edge arrangement in Figure 5A, top). Further, we detected the signal of G4 NH2 (at 5.9 ppm), which is crucial for stabilization of the only possible C2-symmetrical pairing occurring at the sugar edge (SE). The other signal in the imino ¹H region, at 11.9 ppm, belongs to G6 N1H involved in an H-bond. In contrast to G4, G6 forms a bp via the Watson-Crick edge, where the NH2 is unbound (Figure 5A, bottom).
To confirm the proposed bp geometries of guanine residues, we carried out substitution experiments exploiting inosine (I)- and 2-aminopurine (aR)-based nucleotides (for ¹H NMR and CD spectra, see Supplementary Figures S20 and S21, respectively). As expected, the replacement of G4 by inosine in C3IAGA leads to a dramatic destabilization (decrease in Tm by 16 °C and significant reduction of CD intensity at 240 nm), whereas inosine in C3GAIA destabilizes the structure much less (decrease in Tm by 4 °C). Note that C3IAGA is the only sequence with the iM at the 5′-end which displays a distinct hysteresis (Supplementary Figure S3). Both ¹H NMR and CD experiments confirm substantial lability of the C3IAGA structure. Replacement of the G4 base by 2-aminopurine (aR) allowed us to examine the situation where only the amino group is present at the pairing interface (see Figure 5A, top). Such a sequence adopts a well-defined secondary structure only at low temperature (5 °C). In the light of all these observations, both N1H and NH2 of G4 seem to be essential for the formation and stability of the purine homoduplex (the role of N1H in the SE arrangement is described in section MD perspective). Sequence C3AGAG is not capable of forming a well-defined structure of the purine homoduplex at room temperature (Supplementary Figure S11). Finally, we were also able to identify the structure of the sandwiched A5 bp (Figure 5B). Well-resolved A5 N6H amino proton signals in combination with relatively strong NOE contacts to A5 H8 suggest a tight HE bp. Despite dynamic opening of the terminal A7 bp, we were able to determine its propensity to adopt the WCE arrangement, as indicated by the dipolar NOE contacts G6 H8–A7 H8 and G6 NH2–A7 H2. The structure of the CGA segment in C3GAGA shows clear similarity to CGA in the TCGA duplex (Supplementary Figure S1) (26).
Extending the purine duplex by an additional G in C3(GA)2G and by GAG in C3(GA)3G did not prevent the formation of the structure of the (GA)2 block observed in C3GAGA. In the case of C3(GA)2G, a higher structural uniformity was achieved compared to C3GAGA because of an additional stabilization of the A7 bp. In contrast, the additional residues in C3(GA)3G caused a severe broadening of G6 N1H, suggesting an exchange between alternative pairing modes in longer constructs (Supplementary Figure S22). The CD spectroscopy revealed an increase in amplitude of the 260 nm band without significant impact on Tm (Supplementary Figure S23). The conformation of the sugar-phosphate backbone was mapped in the well-defined structure of C3(GA)2G using ¹H-³¹P correlation, showing two outlying ³¹P signals assigned to the G4 and A5 residues (Supplementary Figure S24) that represent a transition from the iM to the GAGA segment (C3pG4) and a stretched conformation of the backbone connecting nucleotides in the GA step (G4pA5). An unusual conformation of the phosphate linkage was also observed in G·A mismatches both in DNA and RNA antiparallel duplexes (65,66).
GA step is formed in R 4 C 3 regardless of sequential context. The broadened NMR lines and low-intensity bands in CD spectra of GAGAC 3 indicate an equilibrium of two iM topologies. The mC-for-C 7 substitution shifts the equilibrium toward the extended 5′E iM topology (Figures 4B and 6). The terminal G 1 nucleotide does not form a stable bp, which in turn also affects the formation of the neighbouring A 2 bp. Assignment of NOESY spectra revealed that G 3 -A 4 adopts the structure of the GA step, vide infra. The A 4 bp in the tight HE geometry (Figure 5B), stacked on the 5′-face of the C 5 bp, forms a rigid motif and is probably responsible for the high thermal stability of GAGAC 2 mC (Supplementary Figure S25).
In AGAGC 3 , the structure of the stable GA step is formed by the G 2 -A 3 segment. Additionally, G 4 adopts the syn-conformation of the glycosidic bond, enabling the formation of the loose WCE G 4 bp (Figure 5A), which probably facilitates the intercalation of the C 7 II bp between the G 4 I and C 5 I bps and allows formation of the more stable 3′E compact topology (as described in the previous section, AGAGC 3 shows a higher propensity to form the compact topology of iM). The arrangement of residues at the junction in AGAGC 3 differs from the one described for A 2 C 4 (36), in which the iM adopts the extended 5′E topology and the tight A 2 bp in HE geometry stacks directly on the C 3 bp of iM. The integrity of AGAGC 3 is almost completely lost upon substitution of inosine for guanosine (Supplementary Figure S26), which supports an essential role of the AGAG segment for the overall stability. The base pairing in the G 2 -A 3 -G 4 segment is preserved despite the conversion of iM into the extended 5′E topology observed in AGAGC 2 mC (Figure 4C). The relative position of guanine N1H signals in 1 H NMR spectra remained unchanged; only the difference in chemical shifts is more pronounced (Supplementary Figure S27). Compared to AGAGC 3 , the extended topology of iM in AGAGC 2 mC is associated with residue G 4 adopting the anti-conformation of the glycosidic bond and direct stacking of the G 4 I bp on the 5′-face of the C 5 I bp (Figure 6). A detailed view of the CG step of both models is shown in Supplementary Figure S28.
Conserved structure of GA step. The detailed analysis of NOESY spectra allowed context-independent identification of NOE contacts observed in the conserved GA step and their classification as inter- or intra-strand (Supplementary Figure S29). Additionally, the characteristic values of 1 H NMR shifts of guanine N1H are preserved in all well-defined structures regardless of the relative position or topology of the iM. If we disregard the presence of the terminal (unpaired) A nucleotide, the 5′-GAG-3′ segment remains, in which 5′-G exhibits a lower chemical shift of N1H (∼10-11 ppm, SE bp) compared to G-3′ (∼11-12 ppm, WCE bp). The presence of a G·G bp at the 3′ terminus is not essential for the formation of the 5′-GA-3′ step. In sequences containing the GAGA segment, such as C 3 GAGA(G) and GAGAC 3 , only the GA attached to the iM clip adopts the structure of the GA step and prevents formation of another inter-strand stacked GA step in its vicinity. We hypothesize that the inability to adopt two consecutive GA steps originates in the poor stacking between the A·A bp in HE geometry and the G·G bp in SE geometry. Following this hypothesis, we speculate that the low thermal stability of sequence C 3 AGAG originates in iM enforcing a tight HE geometry of the adjacent A 4 bp, which would result in a poor overlap with the G 5 bp, thus destabilizing the core of the structure. It is documented in the literature that a preceding nucleotide can influence the pairing of purines in a palindromic antiparallel duplex (65). Interestingly, the bp geometry of the G·A mismatch is dictated by the position of the purine in the sequence: in d(Y-GA-R)·d(Y-GA-R), the H-bonds are formed between the sugar edge of G and the Hoogsteen edge of A, which results in inter-strand stacking similar to the one described above. In contrast, the Watson-Crick edges of G and A are preferred in G·A mismatches in the d(R-GA-Y)·d(R-GA-Y) context.
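The guanine N1H chemical-shift ranges quoted above for the 5′-GAG-3′ segment can be turned into a simple screening rule. The following is only an illustrative sketch: the function name and the treatment of the shared 11 ppm boundary are assumptions, the ranges are approximate, and an actual assignment would rest on the NOE contacts rather than on the shift alone.

```python
def classify_guanine_n1h(shift_ppm):
    """Suggest a G.G pairing geometry from a guanine N1H 1H shift (ppm).

    Ranges follow the approximate values given in the text:
    ~10-11 ppm -> SE bp, ~11-12 ppm -> WCE bp.
    Assumption: a shift of exactly 11 ppm, or one outside both
    ranges, is reported as "unassigned" rather than forced into a class.
    """
    if 10.0 <= shift_ppm < 11.0:
        return "SE"
    if 11.0 < shift_ppm <= 12.0:
        return "WCE"
    return "unassigned"


if __name__ == "__main__":
    # Hypothetical shift values, for illustration only
    for shift in (10.4, 11.6, 13.1):
        print(shift, classify_guanine_n1h(shift))
```

Reporting values on the boundary as unassigned keeps the rule conservative, since the two quoted ranges meet at 11 ppm.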
Two major findings related to our CD and NMR experiments. First, we described the iM topologies and the effect of mC-for-C substitution on the topology of tetrameric iM in R 4 C 3 and C 3 R 4 types of sequences. The iMs of non-modified sequences generally adopt the more stable 3′E topology, except for sequence GAGAC 3 . Two factors affect the folding topology: i) the preference of iM to adopt the 3′E rather than the 5′E topology and ii) an obstruction of the direct stacking in C-R and R-C steps by an intercalation of a CH + ·C base-pair occurring in the compact topology of iM. In C 3 GAGA, which adopts the extended 3′E topology with direct C 3 -G 4 stacking, the methylation of C 1 does not change the iM topology because the two effects act in synergy, which is responsible for structural uniformity. In the hypothetical compact 5′E topology, the mC 1 bp is intercalated between C 3 and G 4 , preventing their efficient base overlap that is present in the extended 3′E topology (Figure 6A). In contrast, in R 4 C 2 mC the 3′E topology is compact, which prevents direct R 4 -C 5 stacking and results in the two effects acting in competition. We explain the increased preference for the extended 5′E topology in R 4 C 2 mC compared to R 4 C 3 by hindrance of the sterically demanding mC 7 bp intercalated between R 4 and C 5 (Figure 6B, C). The mC, sequentially incorporated next to purine in C 2 mCR 4 and R 4 mCC 2 , seems to prefer the extended iM topology due to a more efficient intra-strand stacking with the R·R base-pair compared to the compact topology. We pointed out the ability of iM to switch the topology to allow a direct intra-strand R-C stacking.
Second, we described the base pairing geometries of purines arranged in the parallel homoduplex and structural changes at the duplex-iM junction induced by the altered iM topology. It has become evident that the base pairing geometry of purines adjacent to the iM clip represents the resultant of structural forces arising from both iM topology and neighbouring purine bps. We described the GA step as a conserved structural motif formed in all well-defined structures. However, our results indicate that the GA step does not simply repeat within a repetitive (GA) n segment. In the following section, we rationalize the differences in the thermal stability and the preference of iM in GAGAC 3 to adopt the extended 5′E topology by analysis of underlying supramolecular interactions identified from unbiased MD simulation.

MD perspective: supramolecular interactions forcing the purine pairing
The unbiased MD simulations were performed to assess the robustness of the complete tetramolecular models and to rationalize structural peculiarities of the adopted arrangements (Figure 7). The structural rigidity of the most stable sequences C 3 GAGA and GAGAC 2 mC during MD is shown in Supplementary Figures S31 and S32. Despite the rigid iM clips, the purine segment of GAGAC 2 mC exhibits increased flexibility compared to that of C 3 GAGA because of frequent opening of the bp at the 5′-end. The tendency toward partial disruption of the structure is even higher in the case of AGAGC 3 , where the intercalated C 7 bp loses its original structural integrity. Supramolecular interactions (i.e., stacking and hydrogen bonding) detected in MD snapshots from stable time periods were further evaluated with particular attention devoted to the purine homoduplex.
Figure 7. Averaged structures of tetramolecular C 3 GAGA, GAGAC 2 mC, and AGAGC 3 as obtained from unbiased MD trajectory starting from geometries produced by simulated annealing with NMR restraints (colour coding of nucleotides: C and mC, blue; G, orange; A, green; for technical details, see section Materials and Methods).
Base stacking. A comparison of base-pair overlaps in sequences C 3 GAGA, GAGAC 2 mC, and AGAGC 3 is shown in Figure 8. The averaged orientations of stacked nucleobases are in good agreement with the preferences of dinucleotide models simulated in (67). Only the GA step differs significantly from the reported arrangement because of the inter-strand character of the base stacking in the parallel duplex described here. For the sequence C 3 GAGA, we found that all three consecutive steps C 3 -G 4 , G 4 -A 5 and A 5 -G 6 are characterized by substantial bp overlaps of approximately 12 Å². In contrast, only the G 3 -A 4 and A 4 -C 5 steps reach comparable overlaps in the sequence GAGAC 2 mC. The step A 2 -G 3 shows a broader distribution of overlap area because of frequent disruptions of the A 2 bp. The stability of the AGAGC 3 structure is disfavoured on account of a very small overlap of G 4 I with the orthogonal C 7 II bp. However, the preceding steps G 2 -A 3 and A 3 -G 4 are stacked to a similar extent as in sequence C 3 GAGA. From this analysis it is evident that the overlap in the GA step is preserved in the presented sequences. Furthermore, the analysis revealed a significant difference in base overlap between the purine bp and the outermost CH + ·C bp in the extended and compact topologies of the iM (Figure 8, compare A 4 -C 5 and G 4 I -C 7 II ). However, the trend in base overlaps cannot fully explain the differences in thermal stability (Table 1).
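To make the notion of a base-pair overlap area concrete, the sketch below computes the overlap of two base outlines after projection onto a common plane. It is a generic illustration and not the procedure used in this work, which is not specified in this section: it assumes that each outline is approximated by a convex polygon with 2D coordinates in Å, and it uses Sutherland-Hodgman clipping followed by the shoelace formula. All function names are hypothetical.

```python
def signed_area(poly):
    """Shoelace formula; positive for counter-clockwise vertex order."""
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return total / 2.0


def _inside(p, a, b):
    """True if p lies on or to the left of the directed edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0.0


def _intersect(p, q, a, b):
    """Intersection of segment p-q with the infinite line through a-b."""
    dx1, dy1 = q[0] - p[0], q[1] - p[1]
    dx2, dy2 = b[0] - a[0], b[1] - a[1]
    denom = dx1 * dy2 - dy1 * dx2
    t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / denom
    return (p[0] + t * dx1, p[1] + t * dy1)


def clip_convex(subject, clipper):
    """Sutherland-Hodgman clipping of `subject` by a convex, CCW `clipper`."""
    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inp, output = output, []
        if not inp:
            break
        for j in range(len(inp)):
            cur, prev = inp[j], inp[j - 1]
            if _inside(cur, a, b):
                if not _inside(prev, a, b):
                    output.append(_intersect(prev, cur, a, b))
                output.append(cur)
            elif _inside(prev, a, b):
                output.append(_intersect(prev, cur, a, b))
    return output


def overlap_area(outline_a, outline_b):
    """Overlap area of two projected outlines (assumed convex polygons)."""
    if len(outline_a) < 3 or len(outline_b) < 3:
        return 0.0
    if signed_area(outline_b) < 0:  # enforce CCW order for the clipper
        outline_b = outline_b[::-1]
    clipped = clip_convex(outline_a, outline_b)
    return abs(signed_area(clipped)) if len(clipped) >= 3 else 0.0


if __name__ == "__main__":
    # Toy outlines (not real base geometries): two 2 x 2 squares offset by 1
    square_a = [(0, 0), (2, 0), (2, 2), (0, 2)]
    square_b = [(1, 1), (3, 1), (3, 3), (1, 3)]
    print(overlap_area(square_a, square_b))
```

Real purine outlines are not strictly convex, so a production analysis would need a general polygon-intersection routine or a dedicated nucleic-acid analysis tool; the convexity assumption is made here only to keep the example short.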

Base-backbone hydrogen bonding.
To understand the differences in stabilities, we performed an analysis of H-bonds extracted from the MD trajectory. First, we focused on the base-backbone H-bonds stabilizing the conserved GA step. The inter-strand H-bond between A a N6H and A b OP was identified already in the structure of TCGA (26). We discovered an intuitive trend in H-bond length distributions governed by the relative orientations of the GA step and the rigid iM (Supplementary Figure S34). However, the differences in length distributions are insufficient to explain the differences in the thermal stabilities. We then identified the formation of transient base-backbone H-bonds between the phosphate group in the conserved GpA step and the amino group of the 3′-adjacent residue with respect to A (G 6 , C 5 and G 4 in C 3 GAGA, GAGAC 2 mC, and AGAGC 3 , respectively; Figure 9). The most populated H-bond is formed in GAGAC 2 mC between C 5 NH 2 and the A 4 OP group within one strand, stabilizing the homoduplex-iM junction. An analogous inter-strand H-bond in C 3 GAGA between G 6 NH 2 and A 5 OP is prevented by a larger separation of the strands in the purine homoduplex. The metastable G 4 bp in the AGAGC 3 structure is shaped mainly by the intra-residual NH 2 -OP interaction, which probably depopulates the remote contact with A 3 OP. The importance of the G 4 NH 2 -G 4 OP interaction for the stability was demonstrated on the sequence AGAIC 3 , in which the absence of the NH 2 group prevented formation of a stable secondary structure. Thus, we suspect that the relatively high melting temperature of GAGAC 2 mC originates in the GAC region. Despite the structural flexibility of the G 1 and A 2 residues, the large A 4 -C 5 base overlap enforced by C 5 NH 2 bridges with a phosphate group seems to be the key factor imparting a greater thermal stability to the core region of this sequence. Base-backbone H-bonds are also known to be important for the stability of G·A mismatches in sheared geometry formed in tandem (68).
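The population of a transient H-bond, such as the C 5 NH 2 to A 4 OP contact discussed above, is commonly expressed as the fraction of MD frames in which geometric criteria are met. The sketch below illustrates this idea only; it is not the analysis protocol of this work. The cutoffs (donor-acceptor distance of at most 3.5 Å and donor-H-acceptor angle of at least 135°) and all function names are assumptions chosen for illustration.

```python
import math


def angle_deg(donor, hydrogen, acceptor):
    """Donor-H...acceptor angle in degrees, measured at the hydrogen."""
    v1 = [d - h for d, h in zip(donor, hydrogen)]
    v2 = [a - h for a, h in zip(acceptor, hydrogen)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(x * x for x in v2))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def hbond_occupancy(frames, max_distance=3.5, min_angle=135.0):
    """Fraction of frames in which the H-bond criteria are satisfied.

    frames: iterable of (donor, hydrogen, acceptor) coordinate triples in Å.
    max_distance and min_angle are assumed, commonly used cutoffs.
    """
    hits = 0
    total = 0
    for donor, hydrogen, acceptor in frames:
        total += 1
        close = math.dist(donor, acceptor) <= max_distance
        if close and angle_deg(donor, hydrogen, acceptor) >= min_angle:
            hits += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Three toy frames (hypothetical coordinates, not taken from the trajectories)
    toy_frames = [
        ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.9, 0.0, 0.0)),  # linear, short
        ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)),  # too long
        ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.5, 0.0)),  # bent (90 degrees)
    ]
    print(hbond_occupancy(toy_frames))
```

In practice the coordinate triples would be read from the trajectory with an MD analysis library, and the cutoffs would be chosen to match the force field and the question being asked.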
Supramolecular interactions at iM-homoduplex junction described using MD. Complementing our experimental data with a theoretical approach allowed us to explain the preference of iM for the 5′E topology in GAGAC 3 by the presence of additional base-backbone H-bonds and efficient base stacking in the A-C step located at the homoduplex-iM junction. We were able to pinpoint interatomic contacts explaining the trend in thermal stabilities. We showed that the presence of sequence-dependent weak supramolecular interactions can shift the equilibrium from the generally preferred 3′E to the 5′E topology. We described sequence C 3 GAGA, in which the efficiency of base stacking and the favourability of the 3′E iM topology act in synergy, resulting in structural uniformity and thermal stability of both the non-modified and the mC-modified sequence. We highlighted the structural differences at the iM-homoduplex junction in sequences AGAGC 3 and AGAGC 2 mC. We determined that the G-C step is the most versatile, as the G was observed to adopt the syn-conformation to accommodate the orthogonal CH + ·C base-pair in the compact 3′E topology, whereas in the extended 5′E topology the anti-conformation is preferred.

Summary
To summarize our work, we determined the structure of d(GA) repeats arranged in a parallel duplex enforced by an iM clip. We described a conserved structural motif adopted by the 5′-GA-3′ block (termed GA step), formed independently of the position of iM, and showed that the GA step does not simply repeat in the parallel d(GA) n segment. Additionally, we showed that 5-methylation of cytosine affects the equilibrium between the 3′E and 5′E topologies of tetrameric iM in a sequence-dependent manner, and we described the changes at the junction between duplex and iM induced by the altered topology of the iM clip. Finally, we rationalized the roles of different base pairing geometries and their steric requirements, efficiency of base-pair stacking, and formation of transient base-backbone H-bonds in the stability. We described two non-canonical structural motifs, which not only coexist in a single supramolecular arrangement but even significantly stabilize the individual building blocks. The only analogous study reported a simultaneous existence of i-motif and G-quadruplex in a single molecule (63). Unprecedented structural details revealed in this study provide valuable insights into the structure-stability relationship that are applicable to other non-canonical arrangements of nucleic acids.

DATA AVAILABILITY
Atomic coordinates have been deposited in the Protein Data Bank under accession numbers: 7BI0 (C 3 GAGA, 10 snapshots + averaged structure from 1 μs MD trajectory), 7BL0 (GAC 2 mC stable part, 10 snapshots + averaged structure from 1 μs MD trajectory), 7BLM (AGAGC 3 , 3 snapshots + averaged structure from initial 0.3 μs part of MD trajectory), 7BMA (AGAGC 2 mC, 10 snapshots + averaged structure from 1 μs MD trajectory). Summary of experimental restraints and statistical parameters calculated for initial models and MD snapshots is reported in Supplementary