High-throughput characterization of the role of non-B DNA motifs on promoter function

Summary Alternative DNA conformations, termed non-B DNA structures, can affect transcription, but the underlying mechanisms and their functional impact have not been systematically characterized. Here, we used computational genomic analyses coupled with massively parallel reporter assays (MPRAs) to show that certain non-B DNA structures have a substantial effect on gene expression. Genomic analyses found that non-B DNA structures at promoters harbor an excess of germline variants. Analysis of multiple MPRAs, including a promoter library specifically designed to perturb non-B DNA structures, functionally validated that Z-DNA can significantly affect promoter activity. We also observed that biophysical properties of non-B DNA motifs, such as the length of Z-DNA motifs and the orientation of G-quadruplex structures relative to transcriptional direction, have a significant effect on promoter activity. Combined, their higher mutation rate and functional effect on transcription implicate a subset of non-B DNA motifs as major drivers of human gene-expression-associated phenotypes.


In brief
Georgakopoulos-Soares et al. performed computational analyses of germline mutations and identified increased mutability at non-B DNA motifs. The contribution of non-B DNA motifs on gene expression was investigated using massively parallel reporter assays, identifying Z-DNA as a positive regulator of gene expression and finding that the orientation of G-quadruplexes influences promoter activity.

INTRODUCTION
Under physiological conditions, the favored conformation of DNA is a right-handed double helix, also known as B-DNA ( Figure 1A). However, alternative DNA conformations, collectively termed non-B DNA structures, have been recognized and shown to affect transcription, replication, recombination, and DNA repair, either transiently or for longer periods. 1 The propensity to form non-canonical structures and their biophysical properties are determined by non-B DNA motifs that can be identified from the primary sequence. [2][3][4][5] For example, Z-DNA is a left-handed double-helical structure that is formed by alternating purine-pyrimidine tracts (Figure 1B). G-quadruplexes (G4s) consist of four or more Gruns that are interspersed with loop elements ( Figure 1C).
Direct and tandem repeats, including mononucleotide repeat tracts, can form slipped structures ( Figure 1D); mirror repeats with high A/G content can form triple-stranded DNA structures ( Figure 1E); and inverted repeats can form hairpins and cruciforms ( Figures 1F and 1G).
Previous studies have shown that non-B DNA structures are mutational hotspots because they are more likely to be exposed as single-stranded DNA, making them vulnerable to damage. 6,7 Their increased mutability results in an excess of population variants overlapping non-B DNA motifs 8,9 and an excess of somatic mutagenesis at those sites in cancer. [10][11][12][13][14][15] Although variants overlapping non-B DNA motifs are frequently neutral in their effect, it is clear that non-B DNA motifs are a major source of genetic variation in the human genome. They are enriched in regulatory regions [16][17][18][19] and likely cause numerous disorders such as cancer, fragile X syndrome, and Friedreich ataxia. [20][21][22] As a result, they are likely hotspots for disease and genetic variation. 23 Thus, it is important to take non-B DNA motifs into consideration when modeling mutation rates and pathogenicity. 7,15,24 In the human genome, non-B DNA motifs are unevenly distributed. They are enriched in certain regulatory regions, including open chromatin, promoters, and 5 0 and 3 0 UTRs. [16][17][18][19] At the base-pair level, specific non-B DNA motifs are over-represented and positioned relative to critical gene features, such as the transcription start and end sites, splice junctions, and translation initiation regions, while their formation is often associated with transcriptionally active loci. [25][26][27][28][29][30][31] A number of studies have shown, primarily in cancer when targeting selected loci, that non-B DNA motifs can have an impact on the expression levels of various genes. For example, G4s were shown to modulate the expression of key cancer genes, such as MYC, c-Kit, BCL2, and KRAS, with their disruption resulting in pronounced expression changes. 25,32 Furthermore, recurrent mutations across cancer types and patients, including highly recurrent promoter mutations in the TERT and PLEKHS1 genes, overlap non-B DNA motifs [33][34][35] and likely predispose these regions to increased mutagenesis. However, the functional consequences of non-B DNA motif disruptions, either due to germline or somatic mutations at promoter regions, have not been studied in a systematic manner and remain poorly understood. Additionally, although the impact of promoter non-B DNA structures at individual genes on the regulation of gene expression has been demonstrated at individual loci, 34,36,37 the results are conflicting regarding the role of non-B DNA motifs acting as either activators or repressors. 38 Here, we set out to systematically identify the role of non-B DNA motifs on promoter transcriptional regulation. We find that non-B DNA motifs harbor an excess of polymorphisms, many of which affect gene expression levels. To gain further insights regarding the impact of non-B DNA motifs on gene expression, we analyzed various lentivirusbased massively parallel reporter assays (lentiMPRAs 39 ) to systematically test the effect of non-B DNA motifs on promoter activity. We observed a causal link between specific non-B DNA sequences and gene expression levels. We also show that biophysical properties, which influence the likelihood of secondary-structure formation and stability, are linked to these regulatory effects. Our results demonstrate that non-B DNA motifs are important determinants of promoter activity, and their increased mutability implicates them as major drivers of gene-expression-associated phenotypes.

Non-B DNA motifs harbor an excess of standing genetic variation
As previous studies demonstrated that non-B DNA motifs are enriched for somatic mutations, 11,14,15 we set out to analyze whether this enrichment also exists for germline variation. We took advantage of available whole-genome sequencing (WGS) datasets for thousands of individuals and analyzed them to determine whether non-B DNA sequences are enriched for variants. We measured the genome-wide distribution of 204,063,503 single-nucleotide polymorphisms (SNPs), including both rare and common variants as well as 25,925,202 small insertions and deletions (indels; <50 bp) derived from 15,496 genomes from the gnomAD project 40 relative to seven non-B DNA motifs: inverted repeats (IRs), direct repeats (DRs), mirror repeats (MRs), short tandem repeats (STRs), G4s, Z-DNA, and H-DNA motifs ( Figures 1A-1G). To form a null distribution, we generated simulated SNPs, controlling for trinucleotide context and proximity to the original SNP (STAR Methods). We observed an excess of SNPs directly overlapping non-B DNA motifs (Figure S1A; Mann-Whitney U, p < 0.0001), but the magnitude of the effect was small, and the highly significant p value was due to the large sample size. Of note, H-DNA motifs and IRs showed the highest (1.56) and lowest (1.05) fold enrichments, respectively ( Figures 1H, S1B, and S1C). Similarly, the proportion of indels overlapping non-B DNA motifs was substantially elevated relative to the simulated controls (2.26-fold, Mann-Whitney U, p < 0.0001; Figure S1D). The enrichment of genetic variants at individual non-B DNA motifs was higher for small indels than for SNPs, ranging from 2.44-fold for IRs to 13.68-fold for STRs (Figures 1H, S1E, and S1F). We further separated indels into insertions and deletions, finding differences depending on the non-B DNA motif category ( Figure S1G). For example, STRs had a higher frequency of deletions, whereas G4s had a higher frequency of insertions.
Extending our analysis to 505,529 structural-variant breakpoints derived from the gnomAD project, 40 we found a strong association with non-B DNA motifs, with 14.61% of structuralvariant breakpoints directly overlapping a non-B DNA motif versus 8.83% for simulated controls (Mann-Whitney U, p < 0.0001; Figure S1H), representing a 1.66-fold enrichment. For individual non-B DNA motifs, the enrichments ranged from 1.23fold for G4s to 3.50-fold for H-DNA motifs (Figures 1H and S1I-S1K; Mann-Whitney p < 0.0001 for all non-B DNA motifs), consistent with previous reports finding an excess of non-B DNA motifs at structural-variant breakpoints. 41 We separated structural variants into six categories: insertions, deletions, In schematics (D)-(G), spacer denotes the region of the non-B DNA motif that remains single stranded and exposed, whereas arms hybridize into double-stranded DNA.
(H) Distribution of non-B DNA motifs relative to 204,063,503 SNPs on the left. Distribution of non-B DNA motifs relative to 25,925,202 small indel variants in the center. Distribution of non-B DNA motifs relative to 505,529 structural variants on the right. Enrichment is corrected for trinucleotide context. DR, G4, IR, MR, and STR refer to direct repeats, G-quadruplexes, inverted repeats, mirror repeats, and short tandem repeats, respectively. (I) Association between structural-variant-breakpoint category and enrichment at non-B DNA motifs. INV, CPX, CTX, DEL, DUP, and INS refer to inversions, complex rearrangements, translocations, deletions, duplications, and insertions, respectively. Adjusted p values displayed as *p < 0.05, **p < 0.01, and ***p < 0.001. (J) Enrichment patterns of eQTLs at non-B DNA motifs relative to proximal regions. (K) eQTL density at G4 peaks from G4 antibody treatment.
Cell Genomics 2, 100111, April 13, 2022 3 Article ll OPEN ACCESS duplications, inversions, translocations, and complex. 40 We found that deletions, insertions, and duplications were the most enriched across non-B DNA motifs ( Figure 1I). Taken together, these results suggest that non-B DNA motifs are hotspots of genetic variation in the human population across different categories of population variants.
Non-B DNA motifs are enriched for gene-regulatoryassociated variants To gain further insights regarding the regulatory potential of these variants, we investigated the relative frequency of variants overlapping non-B DNA motifs across six regulatory-elementassociated sequences/functions defined by the Ensembl Regulatory Build: 42 promoters, CTCF-binding sites, open chromatin regions, transcription factor binding sites, promoter flanking regions, and enhancers. The analysis was performed across twelve different cell lines (STAR Methods), finding that most non-B DNA motifs were enriched for SNPs, indels, and structural variants across the regulatory elements, but more so for indels than for SNPs and structural variants ( Figures S2A-S2C). We also investigated the increase in mutagenicity for non-B DNA motifs across the seven annotated genic sub-compartments: genic, intronic, coding, and 5 0 and 3 0 UTRs as well as 1 kilobase (kb) upstream of the transcription start site (TSS) and 1 kb downstream of the transcription end site (TES). Most regions had elevated mutation rates, although the magnitude varied by mutation type and genic sub-compartment ( Figures S2D-S2F). As expected, coding regions showed the lowest mutagenicity relative to other regions, most likely due to selection constraints and increased DNA repair. 43 To analyze whether variants in non-B DNA motifs could have a substantial impact on gene expression, we analyzed expression quantitative trait loci (eQTL). We examined the frequency of eQTLs, characterized by the GTEx consortium, 44 at each of the seven non-B DNA motifs genome wide. We found an enrichment of eQTLs across all non-B DNA categories relative to their flanking regions, with the most pronounced effect for G4s ( Figure 1J). Although the excess of eQTLs in the vicinity of non-B DNA motifs can be explained by the higher background frequency of substitution and indel SNPs ( Figure 1H), our results indicate that a subset of mutations overlapping non-B DNA motifs impact gene expression.
As G4s had the most pronounced effect on gene expression, we next analyzed G4 sequencing (G4-seq) and G4 chromatin immunoprecipitation (ChIP)-seq datasets for their overlap with population variants and eQTLs. We investigated the association between population variants and G4s using previously published G4-seq datasets from the HEK-293T cell line with Pyridostatin (PDS) and K + treatments that provide in vitro evidence of G4 formation potential 45 and G4 ChIP-seq-derived peaks from the Ha-Cat cell line that provide in vivo evidence of sites that form G4 structures. 16 In accordance with the G4 motif analysis, we found that SNPs, indels, and structural variants were enriched at G4seq and G4 ChIP-seq peaks (Figures S3A-S3F; Mann-Whitney U, p < 0.001). We considered the G4 ChIP-seq sites that overlapped both G4-seq K + and G4-seq PDS peaks as the highest confidence, experimentally derived G4s ( Figure S3G) and found consistent enrichments of 1.14-fold, 1.41-fold, and 1.36-fold for substitutions, small indels, and structural variants ( Figures   S3H and S3I). Next, we found that eQTLs are found more frequently than expected by chance in the experimentally derived G4 sites. In total, 20,310 eQTLs overlapped with the 8,955 ChIP-seq peaks, with 34% of the peaks having one or more eQTL (Figures 1K and S3J). Interestingly, the enrichment for the experimentally derived G4s was more pronounced than our results derived from the G4 motif analysis. This is likely the result of G4 formation occurring more frequently in open chromatin and transcribed regions. 16 We also investigated if G4 ChIP-seq peaks overlapping genes display a preference for the template (non-coding) or non-template (coding) strands, using the G4 motif orientation within the peaks as proxy. After correcting for the background bias in the orientation of G4 motifs ( Figure S4A), we found that G4 motifs on the non-template strand overlap G4 ChIP-seq peaks 1.71fold more frequently than motifs on the template strand (binomial test, p < 1 3 10 À12 ) ( Figures S4B and S4C), suggesting significant bias in the formation of G4s, dependent on their orientation.

Non-B DNA motifs are enriched in promoter regions
We next investigated the distribution of non-B DNA motifs across the six regulatory elements defined by the Ensembl Regulatory Build (promoters, CTCF-binding sites, open chromatin regions, transcription factor binding sites, promoter flanking regions, and enhancers). For most non-B DNA motifs, we found an enrichment at promoters and CTCF-binding sites relative to other regulatory elements (Figures 2A and S5A), in accordance with previous findings. 46 Next, we separated the gene body into six compartments: a 1 kb window upstream from the TSS, the 5 0 and 3 0 UTRs, coding exons and introns, and a 1 kb window downstream of the TES. Consistently, promoter regions displayed a higher density of non-B DNA motifs than the gene body for most non-B DNA motifs, with the enrichment ranging from 0.97-fold for IRs to 3.02-fold for G4s (Figures 2B and S5B). We also found a significant enrichment of G4-seq-derived peaks for both PDS and K + treatments and for G4 ChIP-seqderived peaks at promoters relative to other regulatory elements ( Figure 2C). Across the gene body, we found the highest enrichments at promoters, coding regions, and 5 0 UTRs ( Figure 2D).
At promoters, for most non-B DNA motifs, the enrichment was higher upstream of the TSS than in the broader promoter region ( Figure 2E). A close investigation of the distribution of non-B DNA motifs relative to the TSS showed an enrichment of peaks $50 bp upstream of the TSS ranging between 1.28-and 1.89fold for DRs and G4 motifs, respectively ( Figure 2F). Importantly, we observed a 5-fold enrichment approximately 100 bp upstream of the TSS for G4 ChIP-seq peaks, consistent with the literature. 16 Interestingly, the ChIP-seq-derived enrichment was substantially larger than that of the G4 motif and the G4-seq datasets (Figure 2G), reflecting a preference in structure formation at promoters in vivo. We also performed a Gene Ontology (GO) term analysis in promoter upstream regions. For G4s, Z-DNA motifs, and MRs, we found multiple terms associated with developmental processes, such as pattern specification process (GO: 0007389), embryonic organ development (GO: 0048568), and positive regulation of neuron differentiation (GO: 0045666) ( Figure S6A). As these analyses suggest that some non-B DNA motifs could control tissuespecific gene expression, we used TissueEnrich to calculate the enrichment of tissue-specific genes and found sets of tissue-specific genes where a set of neuronal-specific genes were enriched for genes containing G4, MR, DR, and STR at their upstream promoter regions ( Figure S6B). Altogether, these results demonstrate that promoters are enriched for non-B DNA motifs relative to other regulatory elements and to other genic compartments and that some non-B DNA motifs are more likely to occur at developmental and neuronal genes. Therefore, the excess of genetic variants at (F) Base-pair resolution of distribution of nucleotide motifs relative to the TSS. IRs, MRs, DRs, STRs, and G4s are abbreviations for inverted repeats, mirror repeats, direct repeats, short tandem repeats, and G-quadruplexes, respectively. (G) G4 enrichment patterns relative to the TSS for G4 motif, G4-seq peaks in K + and PDS treatments, and from G4 ChIP-seq peaks.
Cell Genomics 2, 100111, April 13, 2022 5 Article ll OPEN ACCESS prompted us to investigate their functional impact on gene transcription utilizing MPRAs. We first analyzed two lentiMPRA datasets generated by our group as part of the ENCODE consortium, 47 where a total of 14,625 and 7,346 candidate promoter sequences were examined in both orientations in K562 and HepG2 cell lines. We identified non-B DNA motifs across the lentiMPRA tested sequences (STAR Methods) and examined their association with gene expression. We found that sequences with G4 and Z-DNA motifs showed significantly increased expression levels in both cell lines (Figures 3A and 3B; t tests, Bonferroni correction, p < 0.001), while for IRs, DRs, STRs, and MRs we did not observe consistent results (Figure S7A). As there is a known positive correlation between expression and guanine-cytosine (GC) content, 48 which was also observed in our lentiMPRA datasets (Pearson r = 0.398 and 0.261 in K562 and HepG2, respectively), we constructed a linear model to account for the contribution of GC content toward expression ( Figure S7B). Sequences with Z-DNA motifs had substantially elevated expression levels relative to se-quences without them, even after controlling for GC content in both cell lines (t tests, Bonferroni correction p < 0.001; Figures  3C and S7C). However, after GC-content correction, G4s were not associated with increased expression, and in HepG2, they were instead significantly associated with reduced expression levels ( Figures 3C and S7C). Similar results were obtained after removing outliers from the linear model (absolute Z score >2.5). Also, G4s on the template strand were associated with reduced expression relative to non-template strands in both cell lines, but the difference reached statistical significance only in the HepG2 cell line ( Figure S7D). For the other non-B DNA motifs, we could not find consistent effects in both cell lines, suggesting that nucleotide composition contributed to the observed effects before GC-content correction. Finally, we identified transcription factor binding sites (TFBSs) across the MPRA sequences using the JASPAR vertebrate nonredundant list of transcription factor motifs. 49 We compared the contribution of non-B DNA motifs relative to TFBSs toward expression levels, both before and after GC-content correction. We found that G4 and Z-DNA motifs had similar contributions to known TFBSs, such as EGR1, YY1, and SP9, resulting in increased expression levels relative to sequences without them ( Figure S8). However, only Z-DNA motifs had comparable effects when we accounted for GC content ( Figures 3D and 3E), and the results were consistent between HepG2 and K562 lentiMPRAs.
To further validate our findings, we analyzed lentiMPRA results from a library that characterized the effect of 3,623 de novo promoter mutations that were identified in the Simons Simplex Collection. 50 This library tested both alleles, centered around the variant, totaling 7,246 sequences along with 150 positive and 150 negative controls for their effect on promoter activity in neural progenitor cells (NPCs) (Figures S9A-S9C). This library had 1,234 sequences harboring one or more non-B DNA motifs ( Figure S9D). We observed that sequences harboring G4, DR, and Z-DNA motifs displayed a significantly higher expression than sequences without them (t tests, Bonferroni corrected p values, G4s, DRs, and Z-DNA p < 0.001), whereas sequences with IRs, MRs, and STRs did not show a significant association (p > 0.05) ( Figure 4A).
Similar to the analysis of the ENCODE MPRA libraries, we observed a significant contribution of the GC content toward the effects on expression of certain non-B DNA motifs. After constructing a linear model to adjust for GC content, we observed that G4 motifs are associated with decreased expression, while only Z-DNA sequences remained associated with higher expression ( Figure 4B), consistent with previous results. In this case, removing outliers maintained a positive association with G4s and gene expression. We also observed a substantial difference in the expression dependent on the orientation of G4 motifs, with G4s on the template strand having lower expression than those on the non-template strand before and after GC-content adjustment ( Figures 4C and S9E; Mann-Whitney U, p < 0.001). The primary sequence comprising consecutive G-runs that are interspersed by loop elements can form G4 structures ( Figure 1). The association between G-runs and gene expression was further investigated, finding that consecutive G-runs result in decreased expression when accounting for their GC-content contribution ( Figure S9F). Furthermore, we found that the length of the Z-DNA motif was positively associated with the expression levels (Kruskal-Wallis H test, p < 0.001; Figure 4D).
Similar to the previous MPRAs, we identified TFBSs across the MPRA sequences and compared the contribution of non-B DNA motifs relative to TFBSs toward expression levels before and after GC-content correction. We found that G4 and Z-DNA motifs had comparable contributions to TFBSs toward increasing expression levels with increases of 1.27-and 1.51-fold over sequences without them ( Figure S10A). However, when we accounted for GC content, the effect of non-B DNA motifs was not comparable to the best TFBS motifs ( Figure S10B). Therefore, we find substantial differences in the results in NPCs relative to HepG2 and K562 cell lines, with a lower contribution of Z-DNA motifs in NPCs, which might be due to the selection of loci that were not necessarily proximal to the TSS or due to the lower number of Z-DNA-containing sequences, with only 311 sequences having them.
To validate if the G4s we observed in this NPC lentiMPRA form these structures, we selected ten candidate promoter-proximal sequences with the lowest and highest expression among sequences with G4s (Table S1) and performed multiple spectroscopic assays to characterize their structures ( Figures 4E and  4F), as G4 structures possess distinct spectroscopic features. 51,52 We first used circular dichroism spectroscopy measurements of the G4-containing DNA oligonucleotides, in the presence of lithium ions (non-G4 stabilizing) or potassium ions (G4 stabilizing), to examine the formation potential of DNA G4s, which indicated that our candidate sequences can fold into G4 structures ( Figures 4E, 4F, S11A, and S11B). In addition, we conducted UV melting and found a hypochromic shift at 295 nm for the potassium-ion condition, which supported the formation of the G4 structure, with a melting temperature above physiological temperature (Figures 4E, 4F, S11A, and S11B).
To confirm the results from the circular dichroism and UVmelting experiments, we used fluorescent-based arrays, including N-methyl mesoporphyrin IX (NMM)-ligand-enhanced fluorescence and intrinsic fluorescence experiments ( Figures  4G, 4H,S12, and S12B). In the absence of NMM ligand, no fluorescence was observed at $610 nm. Upon NMM addition, weak fluorescence was observed under Li + , which was substantially enhanced when substituted with K + , supporting the formation of G4 that allows recognition of NMM and enhances its fluorescence ( Figure 4G). Similarly, the intrinsic fluorescence of G4s was increased when replacing Li + with K + , highlighting the formation of DNA G4s ( Figure 4H). Corroborating our results, we observed increased fluorescence intensity under conditions that promote G4 formation for all candidates. We also carried out two positive G4 controls and a negative B-DNA control to verify our findings above ( Figure S13). Combined, these results validate that these sequences form G4 structures in vitro.

Non-B DNA motifs have a significant effect on promoter activity
To directly test the effect of non-B DNA structures on promoter activity, we generated an MPRA library that introduces various non-B DNA perturbations to ten disease-associated genes. This set of genes included cancer oncogenes (CMYC, CKIT, BCL2, KRAS) and genes associated with different cancer types (ADAM12, ALOX5, SRSF6, VEGF12) as well as FMR1, associated with fragile X syndrome (OMIM: 300624), and SNX12, which is associated with neurodegenerative diseases (Table S2). As our MPRA-tested sequences are 200 bp in length, we first validated whether our selected 200 bp sequences could drive promoter activity using luciferase assays in K562, MCF7, IMR90, and HEK293T cells, finding the majority to be active in most cell lines ( Figures S14A and S14B).
Following validation of these 200 bp sequences, we next generated an MPRA library that included the following manipulations: (1) disruption of existing non-B DNA motifs and (2) introduction of different non-B DNA motifs with varied biophysical properties, including spacer-and arm-length changes in IRs, DRs, and MRs, orientation and loop length in G4s, and length in Z-DNA motifs. lentiMPRAs and subsequent computational analyses were carried out as previously described. 53 Briefly, oligonucleotides were synthesized and cloned into a lentiviral MPRA promoter vector ( Figure 5A;  (E) Circular dichroism (CD) spectra of the four candidate targets for G4 formation potential in presence of two cations.
(F) UV-melting profiles of the four G4 candidates in presence of K + . The reverse melting profile (K + rev ) is also shown and matched well with the forward melting profile (K + ). Hypochromic shift at 295 nm is a hallmark for G4 formation, which can be transformed into a negative peak in derivative plot (dAbs/dT) for G4 stability analysis. The melting temperature (Tm) of a G4 can be identified at the maximum negative value. (G) Fluorescence emission associated with NMM ligand binding to G4 candidates in the presence of Li + or K + ions. (H) Intrinsic fluorescence of four candidate DNA oligonucleotides under Li + or K + conditions. In (A)-(C), adjusted p values displayed as *p < 0.05, **p < 0.01, and ***p < 0.001. HEK293T cells for 3 days, to allow non-integrating lentivirus to degenerate, and DNA and RNA barcodes were sequenced. Since previous work in our lab showed that lower basal activity can have a significant effect on MPRA results, 54 these two cell lines were chosen as almost all the selected promoters showed R2-fold activity compared with empty vectors (except for CKIT in HEK293T). All experiments were done in triplicate, and computational analyses were carried out using MPRAflow 53 and MPRAnalyze. 55 We observed a strong correlation between all three replicates (Pearson r R 0.9 in all cases; Figure S15A) and between the two cell lines (Pearson r = 0.87; Figure S15B).
The promoters in our MPRAs showed variable expression, with the highest levels observed for SRSF6 and the lowest for ADAM12 ( Figure 5B). We investigated the contribution of each non-B DNA motif toward expression in both cell lines across the promoters, adjusting across genes using Z score normalization. Specifically, for each gene we calculated the Z score of each sequence, which was calculated by subtracting the expression levels of that sequence from the mean across all sequences of that gene and dividing by the standard deviation. In concordance with our previous MPRA analyses, we observed that sequences with Z-DNA and G4 motifs had significantly higher expression (Figures 5C and 5D). Interestingly, while we did not observe consistent results in our previous MPRA analyses for MRs, DRs, and IRs, here, we observed significantly higher expression levels when MRs and DRs were present, whereas for IRs we found significantly lower expression ( Figure 5C). For STRs, we did not find consistent patterns in the two cell lines. The above results across non-B DNA motifs did not change when we accounted for GC content; however, this was most likely due to our experimental design having only a small number of loci targeted, which, as a result, had a narrow and uninformative GC-content range. and on the right (duplication/substitution) for G4s. The collection of all promoters is ordered as an oligonucleotide library of 230-mer. The oligonucleotide library is PCR amplified and barcoded at the 5 0 UTR using a degenerate reverse primer. Cloning of PCR products into a lentiviral promoter assay vector was performed next. Cloning of PCR products into a promoter-less lentiviral vector was then performed. This plasmid library was sequenced to assign every barcode to one of the promoters in the library (left) and used to produce the lentiviral library (bottom), which was then used to infect the cell lines of interest (K562 and HEK293T). RNA and DNA were collected after 3 days post-infection, and the barcodes were sequenced. Promoter activity was calculated as the log(RNA/DNA). LTR, long terminal repeats; ARE, antirepressor element. (B) Expression levels of nine genes and their sequence variants for K562 and HEK-293T cell lines. (C) Boxplot displaying the Z score, for sequences with and without each non-B DNA motif, calculated separately for each gene, in K562 and HEK293T cell lines. (D) Sequences with Z-DNA motifs display higher expression than sequences with Z-DNA disruptions for SNX12 and SRSF6 genes. (E) Sequences with G4 motifs on the non-template strand have a higher expression than sequences with G4 motifs at the template strand. Article ll OPEN ACCESS For G4s, we introduced a single, two, or three mutations in one, two, three, or every G-run at the original G4 genomic sites. We compared the mutated sequences with the original sequence and found that sequences with disruptions in the Gruns did not display significant expression differences from the original sequences ( Figure S15C). We designed MPRA sequences with scrambled Z-DNA motifs or with disruptions of purines to pyrimidines in the alternating purine-pyrimidine tract, which served as Z-DNA controls. We found that there was a statistically significant reduction in expression following the disruption of Z-DNA motifs ( Figure 5D), supporting the notion that they are activating sequences. We also observed that non-template G4s had higher expression than those at the template strand in both cell lines and both before and after GC-content correction (Mann-Whitney U, Bonferroni corrected; Figure 5E), consistent with our earlier results. For Z-DNA, longer motifs resulted in higher expression ( Figure 5F). These results suggest that the non-B DNA motifs and their biophysical properties contribute to expression across promoter templates.

DISCUSSION
By analyzing thousands of WGS datasets, we found that non-B DNA motifs are hotspots for genetic variation, fitting with their known increased mutability properties. Their increased mutability is consistently observed across mutation types, including substitutions but also larger and more disruptive indels and structural variants. The increased likelihood of mutagenesis at non-B DNA motifs is also consistent with previous analyses of somatic mutations in cancer genomes. 15 Different mechanisms underlying the higher mutation rate at individual non-B DNA motifs have been previously identified, such as DNA polymerase slippage errors at microsatellites causing deletions, 20 which was also observed in this study. We also observed an excess of eQTLs in the vicinity of non-B DNA motifs. In particular, at experimentally identified G4s, the eQTL enrichment was even larger than that observed across G4 motifs ( Figures 1J and  1K), which is likely due to the formation of G4 motifs being more frequent in open chromatin regions and nucleosomedepleted regions. 16 We further show that non-B DNA motifs are enriched in promoters where they can directly influence downstream gene expression levels. Specifically, we observed that Z-DNA motifs increase expression, whereas the effect of G4s is dependent on the gene studied. Combined, these results suggest that gene-regulatory variants are more likely to occur at non-B DNA structures and that they have a substantial impact on gene expression.
The promoter effects of G4s have previously been shown to be inhibitory or activating depending on the target gene. [56][57][58] Similarly, previous work has suggested that Z-DNA sequences can act as both activating and repressing elements in promoters. 29,59,60 Here, we found that in the absence of chemical perturbations, Z-DNA sequences are more likely to be activating, while G4s are more likely to be inhibitory and promoter dependent. One of the mechanisms by which Z-DNA motifs might increase gene expression might be the reduction of nucleosome occupancy that they elicit. 60 The reduction of expression at promoters with G4 motifs could be due to interference with tran-scription factor or RNA polymerase II binding. In addition, template G4s have a more inhibitory effect than non-template ones. The stronger inhibitory effect at the template strand is also aligned with potentially interfering with RNA polymerase II binding. These results are suggestive of inhibitory effects of G4s in promoters, which can be mischaracterized if the effect of GC content is not taken into consideration, as well as orientation-dependent regulatory effects.
Non-B DNA structure formation depends on a plethora of factors, including DNA superhelicity as well as the activity of multiple enzymes such as topoisomerases and helicases. 61,62 Small molecules that stabilize G4s can substantially alter the thermodynamic equilibrium of structure formation, resulting in dramatic changes in gene expression. 63,64 Thus, targeting these sequences in key regulatory sites could be a potential novel therapeutic path. 65 Although the selectivity of such compounds is usually limited, molecules that discriminate among G4s have also been characterized. 66 These can modulate the activity of clinically important genes, as recently shown for the telomerase gene (TERT), where promoter mutations have been associated with a variety of cancers. 67 By targeting a G4 in the TERT promoter with a small molecule, the expression of telomerase was down-regulated in cancer cells. 34 However, small molecules targeting G4s could cause concomitant DNA damage and telomere dysfunction, influence telomere length, and interfere with other biological processes. 63 Targeting these non-B DNA structures via cis-regulation therapy could be an alternate approach to alter target gene expression. 68 It is increasingly recognized that non-B DNA motifs are involved in a plethora of cellular processes, such as transcription and translation initiation, splicing, and transcription termination. [26][27][28][29][69][70][71][72][73][74][75][76][77][78][79][80][81] Therefore, future work is required to explore the regulatory effects of mutations at non-B DNA motifs genome wide and to estimate their overall pathogenicity by integrating the topology of non-B DNA motifs and the downstream biological effects of their disruption. In addition, measuring the likelihood of mutagenesis for individual non-B DNA motifs per cell division in somatic and cancer cells could have important implications relevant to modeling cancer evolution and aging. Further systematic and high-throughput functional assays could extend our understanding of the functional diversity and clinical evaluation of particular non-B DNA motifs and the variants within them.
Limitations of the study Our study has multiple limitations. First, the examination of the regulatory roles of non-B DNA motifs through MPRA experiments did not investigate how molecules that stabilize their formation affect the conclusions reached. Secondly, the MPRA results are based on specific cell lines, and it would be of interest to examine which of these findings can be generalized across cell types and which effects are cell-type specific. We also cannot exclude the influence of the experimental design in our findings. Furthermore, additional experiments and mechanistic work are required to further our understanding, including biophysical and molecular experiments. Lastly, future work would be needed to resolve the relevance of mutations at non-B DNA motifs in the development and progression of

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

DECLARATION OF INTERESTS
The authors declare no competing interests. Ensembl regulatory build Regulatory features were derived from the Ensembl regulatory build for twelve commonly used cell lines across human tissues, namely A549, HMEC, HUVEC, IMR-90, K562, HepG2, HSMM, MCF-7, NHEK, H1-ESC, GM12878 and HCT116. 42 The enrichment in Figures 2A and 2C were calculated from the median enrichment across the cell lines.

Non-B DNA motif identification
The genome-wide analysis of non-B DNA motifs was performed using the positions derived from 82  For DR motif identification, the STR repeat threshold within the arm was set to 80%, in order to separate them from STR motifs. Enrichment of mutations at non-B DNA motifs was estimated as described in 15 .
G4-seq and G4 ChIP-seq maps G4-seq BedGraph data were derived from GEO accession code GSE63874 for the human genome, in two conditions, PDS and K + treatments. 45 G4 ChIP-seq data were derived from GEO accession code GSE107690 for K562 cell line. 83 G4 motifs were oriented as template and non-template based on their orientation relative to gene direction, across genic regions. Strand orientation of G4 motifs at G4-seq and G4 ChIP-seq peaks was performed by subsetting the strand of G4 motifs overlapping the peaks.
Transcription factor binding site maps Position frequency matrices (PFMs) of transcription factors were derived from JASPAR (release 2020) 49 for the non-redundant CORE collection (http://jaspar.genereg.net/download/CORE/JASPAR2020_CORE_vertebrates_non-redundant_pfms_meme.zip) and motif scanning was performed with FIMO. 87 Population variant analysis Nucleotide variants, indels as well as structural variants were derived from the GnomAD project for whole genome sequenced datasets. 40 Only variants with the filter flag PASS were analyzed. eQTL analysis eQTLs were derived from the GTEx consortium 44 and analyzed with the commands ''intersect'' and ''closest'' from BEDTools to investigate their intersection and distribution patterns with motifs from each non-B DNA category as well as with G4-seq and G4 ChIP-seq peaks.

Gene set enrichment analysis
For each type of non-B DNA motif, we extracted a group of genes that contain a non-B DNA motif within a 200 bp upstream window from their TSS and these were used to perform gene set enrichment analyses. GO analyses were performed using clusterProfiler, 88 where GO terms with at least 20 genes and gene ratio greater than 0.01 for at least one of the non-B DNA sets were considered. For visualization purposes, we only displayed a maximum of 10 GO terms with the highest gene ratio per non-B DNA set. Finally, we calculated the enrichment of each non-B DNA group across sets of tissue-specific genes using TissueEnrich 89 using default arguments.

Luciferase assay
Candidate promoters of 200 bp were PCR amplified using AccuPrime TM GC-Rich DNA Polymerase (ThermoFisher Scientific 12337016) and cloned into the pLSmP-Luciferase vector after digestion with SbfI and AgeI restriction enzymes (remove minimal promoter). Primers with 20bp homology to the vector cloning site were designed and PCR products were assembled to the lentiviral vector using NEBuilderâ HiFi DNA Assembly Master Mix (E2621S). Lentiviruses were produced using Lenti-Pac HIV Expression Packaging Kit (Genecopoeia, LT001) according to manufacturer's instructions. Small scale viral productions on HEK293T cells (2x800,000 cells seeded on a p6 well 24h prior to transfection; virus-containing culture media was collected 48h post-transfection and was used to infect desired cells) were performed of all different constructs to test luciferase activity in four different cell lines (MCF-7, IMR-90, K562, HEK293T). 50,000 cells were seeded on 96-well plates in a volume of 50 mL and another 50 mL of virus-containing medium was added in order to transduce them. Luminescence was measured 24h or 48h post-infection using Dual-Luciferaseâ Reporter Assay System (Promega, E1910).
Cell Genomics 2, 100111, April 13, 2022 e3 Article ll OPEN ACCESS lentiMPRA of promoters Each of the sequences was synthesized on a 7,500-feature microarray (Agilent OLS; 15 bp primer + 200 bp promoter + 15 bp primer = 230 mers). For the G4s that we studied across these genes, all selected loci overlapped G4-seq or G4 ChIP-seq peaks. lentiMPRA was performed as described previously with modifications. 53 In brief, PCR amplification of OLS library was performed using NEBNextâ High-Fidelity 2X PCR Master Mix (New England Biolabs, M0541S)(4x50 ml reactions using 20 ng of template OLS library and primers L1.Amp.F and L1.Amp.R; PCR program: 95 C, 2 min; (95 C, 15 sec; 65 C, 20 sec; 72 C, 1 min) x12 cycles; 72 C, 5min). Barcodes were added by PCR in the library amplification step in the 5 0 UTR of the GFP gene. This PCR 5 0 -tagging strategy allowed us to eliminate the confounding effect that lentiviral genome recombination might have on 3 0 -tagged libraries. Additionally, tagging barcodes in the PCR amplification step via primers harboring degenerate nucleotides enabled us to assay larger promoter sequences (200 bp instead of 171 bp of previous MPRA designs) and the cost-effective use of an oligonucleotide library 100 times smaller in size to obtain 100 barcodes per promoter (we ordered 7,500 different sequences instead of 750,000). 20 mg of lentiviral vector (pLSmP-GFP) were digested with SbfI and AgeI restriction enzymes. Linearized vector and PCR products were run on a 1% agarose gel and purified using QIAquick gel extraction kit (QIAGEN, 28704). 5x20 ml ligations containing 1:10 molar ratio between vector and inserts were performed using NEBuilderâ HiFi DNA Assembly Master Mix (E2621S). Ligations were pooled and purified using MinElute Reaction Cleanup Kit (QIAGEN 28204) and electroporated into ElectroMAX TM Stbl4 TM Competent Cells (ThermoFisher Scientific 11635018). 50 ml of electrocompetent bacteria and 60 ng of DNA were used per reaction in a 0.1 cm cuvette (Program: 1.2kV; 200 ohms; 25 mF; 1 pulse). 1:1,000 and 1:10,000 dilutions were seeded on LB plates with ampicillin in order to estimate the number of clones. Approximately 800,000 different clones were obtained and, thus, the complexity of the plasmid library with an estimated of 100 barcodes per insert. Insert-barcode fragment was amplified from the plasmid library and sequenced using NextSeq PE150 for the insert-barcode association.
Lentiviral particles were produced from the plasmid library as in the luciferase assay but scaling the process to 6x150 mm plates. In summary, 6x10 6 HEK293T cells were seeded per plate 48 hour before transfection, 5 mg of plasmid library and 5 mg of HIV packaging mix were co-transfected using 30 ml of EndoFectin. Media were collected 48h post-transfection and lentiviral particles were concentrated using Lenti-X TM Concentrator (Takara 631231). Lentiviral library was tested in a small scale experiment with HEK293T and K562 cell lines in order to titrate the number of desired integrations. Three million HEK293T and 4 million K562 cells were infected with the library with the multiplicity of infection (MOI) of 400 and 40, respectively, as calculated in small scale titration experiments. In order to improve infection, polybrene was added together with the lentiviral library at a final concentration of 8 mg/ml. After three days of culture, barcoded DNA and RNA were extracted from the cells using Allprep DNA/RNA mini kit (QIAGEN 80204). mRNA was purified using Oligotex mRNA mini kit (QIAGEN 70022), and reverse-transcribed using SuperScriptII (Life Technologies, 18064-071), according to manufacturer's instructions. Barcodes were amplified and sequenced using NextSeq PE15, as described previously. 53 We performed three independent replicates of infection for each cell line.

MPRA analysis pipeline
The design of MPRA sequences was performed with algorithms adjusted from 84 . For barcode insert mapping and filtering, we called a consensus sequence from the paired-end reads associating with barcode sequence from the index read. We aligned all consensus sequences back to all designed sequences (inserts) using BWA MEM (version 0.7.17-r1188). 86 As many of the designed sequences are either only 1bp mutation from each other, or the inverted orientation, we use CIGAR string with perfect sequence match and 0 mismatches as a strict filter. For RNA/DNA barcode counting and ratio normalization, RNA and DNA barcodes for each of three replicates were sequenced on an Illumina NextSeq instrument, UMI is used to remove PCR duplicates and the inserts with associated barcode counts lower than 3 are removed. Evaluating the effect of GC-content in the contribution to expression across the MPRA was performed by fitting a linear model and subtracting from each sequence the expected score due to GC-content.

NMM ligand enhanced fluorescence
Experiments were carried out as previously reported with slight modification. 92 Sample solutions of 100 mL total volume were prepared containing 1 mM DNA, 10 mM lithium cacodylate (LiCac) buffer (pH 7.0), 150 mM LiCl or KCl solution and 1 mM NMM ligand. HORIBA FluoroMax-4 Fluorometer was used to measure the fluorescence spectra. Before sample measurement, samples were first prepared without ligand and heated for denaturation at 95 C for 3 minutes followed by cooling down for 15 minutes by placing the sample solution at room temperature so as to undergo renaturation. The samples were then transferred into a quartz cuvette which had a path length of 1-cm and excited at 394 nm. The range from 550 to 750 nm of emission spectra were needed. All data were measured at 25 C in every 2 nm and the exit and entrance slit widths were 5 nm. The enhanced fluorescence spectra of samples in the absence of ligand were used for normalization. All of the above calculations were analyzed in Microsoft Excel.

Circular dichroism (CD) spectroscopy
Experiments were carried out as previously reported with slight modification. 93 Jasco J-1500 CD spectrophotometer was used to carry out the CD spectroscopy. A total of 2 mL sample solution was contained with a quartz cuvette which had a path length of 1-cm. Sample reactions consisting of 5 mM DNA, 150 mM KCl or LiCl and 10 mM LiCac (pH 7.0) were prepared. Then mixed thoroughly and denatured the sample solution for 5 minutes at 95 C and then incubated for 15 minutes by placing the sample solution at room temperature to undergo renaturation. All samples were measured at 25 C in a range from 220 to 310 nm. The spectra were e4 Cell Genomics 2, 100111, April 13, 2022 Article ll OPEN ACCESS needed every 1 nm. The time for responding was 0.5 s/nm and all of the spectra stated were 2 scans in average. By normalizing the data collected, the molar residue ellipticity was obtained and then smoothed over 5 nm. Spectra Manager Suite (Jasco Software) was used to analyze the collected data.
Thermal melting monitored by UV spectroscopy Experiments were carried out as previously reported with slight modification. 93 Sample reactions of 2 mL consisting of 5 mM DNA (except for concentration dependent melting that ranged from 1 -10 mM), 150 mM KCl and 10 mM LiCac (pH 7.0) were prepared. Samples were then mixed completely and heated for 3 minutes at 95 C for DNA denaturation and followed by renaturation for 15 minutes by placing the sample solution at room temperature. Samples were then transferred into a quartz cuvette which had a path length of 1-cm then sealed with 2 layers of Teflon tape in order to lower the chance of evaporation of the sample when the measurement reached high temperature. Measurements were conducted using Agilent Cary 100 UV-Vis Spectrophotometer with sample block initially set at 20 C for 5 minutes.
The samples were measured from 20 to 95 C (forward scan) with a 0.5 C/min temperature increment rate. There was a reverse scan measurement (95 to 20 C) that also had a 0.5 C/min increment rate after holding for 5 minutes at 95 C. At 295 nm (or 260nm for the B-DNA oligonucleotide), both of the forward and reverse scans were recorded for the folding and unfolding transitions.
The collected data were deducted by the blanked solutions which had the identical concentrations of the KCl and LiCac buffer (pH 7.0) only. The data's first derivatives were obtained by smoothing the data over 11 nm where all the processes and results were marked in Microsoft Excel. By taking average of the melting temperatures in both of the reversed and forward measurements, the final melting temperature was determined.
Intrinsic fluorescence spectroscopy Experiments were carried out as previously reported with slight modification. 93 Samples were prepared as done for UV-melting and CD-spectroscopy. HORIBA FluoroMax-4 Fluorometer was used to measure the intrinsic fluorescence spectra. After denaturation and renaturation of samples, the samples were transferred into a quartz cuvette which had a path length of 1-cm and excited at 260 nm. The range from 300 to 500 nm of the emission spectra were needed. All data were measured at 25 C of every 2 nm and the exit and entrance slit widths were 5 nm. The collected data were smoothed over 5 nm using Microsoft Excel.

Population variant analysis
For SNP variants, simulated controls were generated within 10kb from the original variant, controlling for trinucleotide content. To achieve this, the base-pair at the randomly selected simulated position, within 10kb from the original mutation, and both the 5' and the 3' adjacent base-pairs had to match those at the mutated sites, and the mutation and simulation sites had to be different from one another. In addition, in the simulations regions of the human genome for which mutation calling by GnomAD was not performed were excluded. For indels, we generated simulated indels within 10kb of the original indel site, correcting for indel length and with local GC content at a 100bp window each side of the indel site within 2.5% difference from the original. For structural variants, we simulated an equal number of breakpoints at random locations within 10kb of the original breakpoints, correcting for local GC content, with 2.5% maximum difference from the original GC content. Statistical significance was estimated with non-parametric Mann-Whitney U tests in Python using the SciPy library. 90 Across regulatory elements, z-scores were calculated from the density of mutations at non-B DNA motifs at that element, relative to the mean mutational density at that element, divided by the standard deviation.