Systematic characterization of short intronic splicing-regulatory elements in SMN2 pre-mRNA

Abstract Intronic splicing enhancers and silencers (ISEs and ISSs) are two groups of splicing-regulatory elements (SREs) that play critical roles in determining splice-site selection, particularly for alternatively spliced introns or exons. SREs are often short motifs; their mutation or dysregulation of their cognate proteins frequently causes aberrant splicing and results in disease. To date, however, knowledge about SRE sequences and how they regulate splicing remains limited. Here, using an SMN2 minigene, we generated a complete pentamer-sequence library that comprises all possible combinations of 5 nucleotides in intron 7, at a fixed site downstream of the 5′ splice site. We systematically analyzed the effects of all 1023 mutant pentamers on exon 7 splicing, in comparison to the wild-type minigene, in HEK293 cells. Our data show that the majority of pentamers significantly affect exon 7 splicing: 584 of them are stimulatory and 230 are inhibitory. To identify actual SREs, we utilized a motif set enrichment analysis (MSEA), from which we identified groups of stimulatory and inhibitory SRE motifs. We experimentally validated several strong SREs in SMN1/2 and other minigene settings. Our results provide a valuable resource for understanding how short RNA sequences regulate splicing. Many novel SREs can be explored further to elucidate their mechanism of action.


INTRODUCTION
Pre-mRNA splicing is an essential step for expression of most eukaryotic genes, during which introns are removed and exons joined to generate a mature mRNA. Exons and introns are either constitutively or alternatively spliced. Alternative selection of 5′ or 3′ splice sites, i.e. alternative splicing, to generate two or more mRNA isoforms is a common phenomenon for pre-mRNAs transcribed from human genes. Natural alternative splicing not only contributes to expansion of transcript diversity, but also serves as a posttranscriptional mechanism to regulate gene function, often in a cell-type- or developmental-stage-specific manner (1). The splicing pattern of a gene in a specific cell setting generally reflects an intricate interplay among multiple cis-acting elements and trans-acting factors, and is influenced by other cellular pathways, such as transcription through chromatin, and signaling pathways.
The strength of four core splicing signals (the 5′ splice site, the 3′ splice site, the poly-pyrimidine tract and the branch point site), which comprise the first layer of the 'splicing code', is the major determinant of intron and exon definition. However, the prevalence of pseudo-exons with core signals that also match the consensus elements indicates that additional cis-acting splicing signals are involved in distinguishing true exons from pseudo-exons (2,3). These auxiliary signals, termed splicing-regulatory elements (SREs), represent another essential aspect of the splicing code. Based on their role and position, SREs are classified as exonic and intronic splicing enhancers (ESEs and ISEs) or silencers (ESSs and ISSs). cis-Acting SREs are typically located in the vicinity of splice sites to exert their effects; however, distal elements located >500 nt away from exons, i.e. deep intronic elements, may still affect splice-site selection (4).
SREs can be RNA secondary structures that inhibit splicing by concealing key splicing sequences in double-stranded regions, or in some cases promote splicing by bringing the 5′ and 3′ splice sites into close proximity, or distal enhancer sequences close to their regulated exons (4,5). In most described cases, SREs are single-stranded RNA sequence motifs that are specifically bound by their cognate RNA-binding proteins (RBPs), which function as either splicing activators or repressors. The differences in the splicing pattern of a pre-mRNA in different cell types or developmental stages are mostly due to expression, localization, or phosphorylation alterations of various regulatory splicing factors, which constitute another layer of the regulatory splicing code.
It has long been established that purine-rich and AC-rich ESEs are bound by serine/arginine-rich (SR) proteins (6,7), which facilitate the recognition of adjacent splice sites by the basal splicing machinery. Such activities of SR proteins are often antagonized by hnRNP proteins that recognize splicing silencer sequences (8,9). Recently, an increasing number of RBPs have been found to be involved in splicing regulation. Humans have an estimated 1542 RBPs, or ∼7.5% of all protein-coding genes; among them, 692 are mRNA-binding proteins (10). RBPs possess at least one RNA-binding domain, such as an RNA-recognition motif (RRM), a K-homology (KH) domain, an arginine/glycine-rich domain, a DEAD-box motif, or a zinc-finger domain (10,11), and they typically bind to sequence motifs of 3-7 nt (12,13). One characteristic feature of many RBPs is the degeneracy of the sequence motifs they recognize. For example, hnRNP A1, a strong splicing repressor, binds tightly to the consensus pentamer motif UAGGG (13,14). However, the core sequence sufficient for recognition and binding is the dinucleotide AG, with improved affinity for sequences containing UAG or its weaker version, CAG (15). The degenerate nature of RBP-binding motifs makes it challenging to identify authentic binding sites for certain RBPs.
For a known RBP that regulates splicing, in vitro SELEX (systematic evolution of ligands by exponential enrichment) (16) can be employed in conjunction with a splicing assay to identify its functional consensus motif; with this method, functional sequence motifs specific for a subset of SR proteins were uncovered (17). Another frequently used method is CLIP (crosslinking and immunoprecipitation), in which a transcriptome-wide analysis is performed to detect direct protein-RNA interactions in cells; this powerful method has defined a cohort of SRE-RBP pairs that regulate numerous alternative splicing events essential for maintaining cellular homeostasis and function (12,18). With advances in next-generation sequencing technology, transcriptome-wide RNA sequencing is now preferentially used to detect splicing changes and associated SREs regulated by an RBP whose expression is manipulated via knockdown or overexpression; this approach requires appropriate bioinformatics tools to quantitate isoform reads associated with splicing events (19). On the other hand, to search for novel SREs, regardless of the RBPs that recognize them, several strategies have been employed, including conventional deletion and mutational analysis, as well as computational prediction methods, such as RESCUE-ESE (6) and mCross via analysis of published CLIP data (20). Several groups have built minigene reporter systems to screen random-sequence libraries inserted into an alternatively spliced exon or a flanking intron, generating a large amount of information about sequence motifs that influence splice-site selection (21-27). However, despite all the efforts undertaken during the past three decades, our knowledge about SREs, an essential part of the splicing code, remains incomplete.
To better understand short SREs, we generated and analyzed a complete library of pentameric sequences, and directly compared all 1024 sequences inserted at an intronic region in a minigene system. Based on a prior systematic analysis of 207 RBPs from different species, including 85 human proteins, with a method called 'RNAcompete', we know that a large number of RBPs recognize 5-nt or shorter motifs (13). Therefore, pentameric or shorter motifs represent a major fraction of SREs, and they are also at the practical limit of the kind of exhaustive one-by-one analysis we employed here. We chose the spinal muscular atrophy (SMA)-associated SMN2 as the model gene, because alternative splicing of its exon 7 has been extensively characterized. As the mutation region, we chose positions 11 to 15 (CAGCA) in intron 7, a region that is critical for modulation of exon 7 splicing (15,28). The percentage of SMN2 exon 7 inclusion in multiple cell lines, including HEK293 cells, is about 40%, which is ideal for observing changes in exon 7 inclusion in either direction. We analyzed the splicing pattern of each of the 1023 mutants in HEK293 cells, compared to the wild-type (WT) SMN2 minigene. Our data provide a thorough picture of how different pentameric intronic sequences can regulate an alternative splicing event.

Plasmids
SMN1/2 minigene constructs were pCI-SMN1 and pCI-SMN2 (15). These two minigenes comprise the 111-nt exon 6, a 200-nt shortened intron 6, the 54-nt exon 7, the 444-nt intron 7, and the first 75 nt of exon 8, followed by a consensus 5′ splice site. To obtain all 1023 SMN2 mutants for making the full pentamer library, we first set up 16 sequence groups in which the last two nucleotides of the pentamers were preset (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG and TT) and the remaining three nucleotides were generated by site-directed mutagenesis using a pair of partially overlapping primers comprising a 3-nt random-sequence pool. Sixteen pairs of primers were accordingly designed to obtain all mutant plasmids. We sequenced ∼100 clones for each group and obtained almost all 64 plasmids except for a few, which we then generated by site-directed mutagenesis with specific primer sets.
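The grouping scheme can be illustrated with a short enumeration. This is a minimal sketch of the combinatorics only, not of the cloning workflow; it assumes DNA notation (T in place of U) and the WT pentamer CAGCA stated in the text, and the function and variable names are hypothetical.

```python
from itertools import product

BASES = "ACGT"
WT_PENTAMER = "CAGCA"  # WT sequence at intron 7 positions 11-15 (from the text)

def build_groups():
    """Group all pentamers by their preset last two nucleotides."""
    groups = {}
    for suffix in ("".join(p) for p in product(BASES, repeat=2)):
        # The first three positions stand in for the 3-nt random-sequence pool.
        groups[suffix] = ["".join(p) + suffix for p in product(BASES, repeat=3)]
    return groups

groups = build_groups()
library = [pentamer for members in groups.values() for pentamer in members]
mutants = [pentamer for pentamer in library if pentamer != WT_PENTAMER]

# 16 groups of 64 pentamers give 1024 sequences; removing WT leaves the mutants.
print(len(groups), len(library), len(mutants))
```

Under these assumptions, the WT sequence falls into the "CA" group, so that group needs only 63 mutant plasmids.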

Cell culture and transfection
HEK293 cells were cultured in Dulbecco's modified Eagle's medium (DMEM, Invitrogen) supplemented with 10% (v/v) fetal bovine serum (FBS) and antibiotics (100 U/ml penicillin and 100 μg/ml streptomycin). For splicing analysis of the pentamer library, cells were seeded at 1.8 × 10⁵ per well in 12-well plates; the next day, 1 μg of each minigene plasmid was delivered to cells using branched polyethylenimine (PEI, Sigma-Aldrich) in FBS-free DMEM. For other experiments, 500 ng of each minigene plasmid was cotransfected with or without protein-expression plasmid(s) using PEI or Lipofectamine 2000 (Thermo Fisher Scientific). After incubation at 37 °C for ∼6 h, the transfection solution was removed and replaced with complete medium. Cells were collected for extraction of total RNA 24-36 h post-transfection.

RNA-affinity chromatography
Biotinylated WT RNA with two copies of GCACC and two mutants (Mut1 and Mut2) were purchased from Sigma-Aldrich. Each RNA (100 pmol) was incubated with 5 μl prewashed streptavidin beads at 4 °C for 2 h under rotation. The RNA/streptavidin beads were then incubated with HeLa cell nuclear extract in binding buffer (20 mM HEPES, pH 7.8, 100 mM KCl, 0.25 μg/μl yeast tRNA, 0.1 mM EDTA, 1 mM DTT, 1 mM PMSF, 0.05% Triton X, and Protease Inhibitor Cocktail (Roche)) at 30 °C for 1 h, followed by four washes with the same buffer at increasing salt concentrations: 50, 75, 100 and 150 mM, respectively. Bound proteins were eluted with Laemmli buffer and heated at 95 °C for 5 min.

Western blotting
Protein samples separated by 12% SDS-PAGE were electroblotted onto PVDF membranes (Millipore, Bedford, MA, USA). The blots were then probed with primary monoclonal antibodies (mAbs) or polyclonal antibodies (pAbs), followed by secondary IRDye 680RD-conjugated goat anti-mouse or goat anti-rabbit antibody (LI-COR Biosciences, Lincoln, NE, USA). Anti-T7 mAb was generated at Cold Spring Harbor Laboratory; anti-β-Tubulin mAb was purchased from Santa Cruz Biotechnology (Dallas, TX, USA), and anti-YB1 pAb from Abcam (Boston, MA, USA). Protein signals were detected with an Odyssey Infrared Imaging System (LI-COR Biosciences).

Motif set enrichment analysis (MSEA)
For MSEA, a motif set is a group of pentamers with the same sequence feature. Six layers of motif sets were used to describe the sequence features of a pentamer. The first layer uses non-positional motifs, which group pentamers based on 3-mer or 4-mer motifs regardless of the position of the motif (e.g. CCAUG and CAUGC pentamers both belong to the 'AUG' motif set). The second, third and fourth layers use position-dependent motifs, based on the first (Position 1), the second (Position 2), or the third (Position 3) 3-mer in a given pentamer (e.g. AUGCC and AUGGC pentamers both belong to the Position-1 'AUGNN' motif set; N = A, C, G or U). The fifth layer describes pentamers based on repeated dinucleotides or single-nucleotide repeats (e.g. U:U); it counts how many times a single nucleotide or dinucleotide is repeated in a given pentamer (e.g. UU is repeated twice in the AUUUU pentamer and is written as UU:UU). The sixth layer is based on the interaction of two nucleotides in the pentamer. For example, we use 'A-U--' to describe pentamers with an A as the first nucleotide and a U as the third nucleotide. Finally, we used the percent spliced-in (PSI) values of the 1024 pentamers in two independent replicate experiments to generate the initial data matrix (4 × 1024) in .gct format. We wrote in-house Perl scripts to generate the motif sets in .gmt format, and generated a ranked list of the 1024 pentamers based on average ΔPSI (mutant - WT) values in .rnk format. We then used the .gmt and .rnk files with GSEA software (v4.0.3) to compute the normalized enrichment score (NES), FDR q-value and leading-edge values (29). Motif sets with FDR q-value <0.05 and list value <50% in the leading-edge analysis were considered significant.
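The six layers can be sketched in code as follows. This is an illustrative Python reimplementation, not the in-house Perl scripts; the label prefixes (nonpos, pos1, repeat, pair), the rule of counting repeats as non-overlapping matches, and the position pattern notation (A-U-- for an A at position 1 and a U at position 3) are assumptions made for the example. The number of sets it produces therefore need not match the number reported for the actual analysis.

```python
from collections import defaultdict
from itertools import combinations, product

BASES = "ACGU"

def motif_sets_for(pentamer):
    """Return the labels of the motif sets that one pentamer belongs to."""
    labels = set()
    # Layer 1: non-positional 3-mers and 4-mers found anywhere in the pentamer.
    for k in (3, 4):
        for i in range(5 - k + 1):
            labels.add("nonpos:" + pentamer[i:i + k])
    # Layers 2-4: the 3-mer starting at Position 1, 2 or 3 (e.g. AUGNN).
    for i in range(3):
        labels.add(f"pos{i + 1}:" + "N" * i + pentamer[i:i + 3] + "N" * (2 - i))
    # Layer 5 (assumed counting rule): non-overlapping repeats of a single
    # nucleotide or dinucleotide, e.g. UU:UU for AUUUU.
    units = list(BASES) + ["".join(p) for p in product(BASES, repeat=2)]
    for unit in units:
        n = pentamer.count(unit)  # str.count counts non-overlapping matches
        if n >= 2:
            labels.add("repeat:" + ":".join([unit] * n))
    # Layer 6: two fixed nucleotides at two positions, e.g. A-U-- .
    for i, j in combinations(range(5), 2):
        pattern = ["-"] * 5
        pattern[i], pattern[j] = pentamer[i], pentamer[j]
        labels.add("pair:" + "".join(pattern))
    return labels

def build_gmt_lines(pentamers):
    """Invert pentamer -> labels into tab-separated .gmt lines."""
    members = defaultdict(list)
    for pentamer in pentamers:
        for label in motif_sets_for(pentamer):
            members[label].append(pentamer)
    # .gmt layout: set name, description field, then the members.
    return ["\t".join([label, "na"] + sorted(ps)) for label, ps in sorted(members.items())]

library = ["".join(p) for p in product(BASES, repeat=5)]
gmt_lines = build_gmt_lines(library)
print(len(gmt_lines), "motif sets under these assumed definitions")
```

Each output line follows the tab-separated .gmt layout that GSEA reads, so a file written from gmt_lines could in principle be paired with a .rnk file of ranked ΔPSI values.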

Statistical analysis
Experimental data other than the mutant library data are presented as mean ± standard deviation. Statistical significance was analyzed by Student's t-test and one-way ANOVA using SPSS 16.0 software; P < 0.05 was considered statistically significant.

Construction of a pentamer library to identify short SREs
To systematically explore pentameric and shorter sequences that are critical for splicing regulation, we took advantage of a previously reported SMN2 minigene (30) to generate a complete pentamer library by site-directed mutagenesis, and analyzed the effects of all possible mutants on exon 7 splicing. We mutated the 5-nt sequence CAGCA at positions 11-15 in SMN2 intron 7 into all possible combinations of sequences. We chose this intronic position on the basis of our previous mutational analysis, which showed that suboptimal SREs in this region can noticeably affect exon 7 splicing (15). We obtained a total of 1023 pentamer mutants (Figure 1A). We analyzed the splicing patterns of all 1024 WT and mutant minigenes after transient transfection into HEK293 cells, by fluorescence-labeled semiquantitative RT-PCR. For each experiment, we used the WT SMN1 and SMN2 minigenes as controls. We then calculated the difference in the percentage of exon 7 inclusion, i.e. the ΔPSI value, between each mutant and the WT SMN2 control in two independent experiments.
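The quantities involved can be written out explicitly. The sketch below assumes that PSI is computed from the signals of the exon 7-included and exon 7-skipped RT-PCR products in the usual way (included signal divided by total signal); any normalization applied to the raw band signals is not modeled, and the function names and example numbers are hypothetical.

```python
def psi(included, skipped):
    """Percent spliced-in from the signals of the two RT-PCR products."""
    total = included + skipped
    if total <= 0:
        raise ValueError("no signal in either band")
    return 100.0 * included / total

def delta_psi(mutant_psi, wt_psi):
    """Difference in exon 7 inclusion between a mutant and the WT control."""
    return mutant_psi - wt_psi

def mean_delta_psi(mutant_reps, wt_reps):
    """Average the per-experiment differences (one WT control per experiment)."""
    diffs = [delta_psi(m, w) for m, w in zip(mutant_reps, wt_reps)]
    return sum(diffs) / len(diffs)

# Hypothetical band signals and replicate PSI values, for illustration only.
print(psi(40.0, 60.0))
print(mean_delta_psi([85.0, 87.0], [41.0, 41.0]))
```

Pairing each mutant with the WT control from the same experiment, as done here, keeps plate-to-plate variation in the WT baseline out of the averaged difference.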

Characterization of the pentamer library with MSEA
PSI values (0-100%) from the pentamer library were highly reproducible in two biological replicates (Pearson's r = 0.9687, linear regression R² = 0.9310, Supplementary Figure S1). The PSI difference between two replicates of WT minigenes ranged from 0 to 21% (2.52% on average), and the PSI difference for mutant minigenes ranged from 0 to 31.84% (1.23% on average). In total, 686 pentamers had a positive ΔPSI value (up to +64%), and 330 pentamers had a negative ΔPSI value (down to -42%). Seven mutant pentamers had a ΔPSI of zero (e.g., CUACC). The range of ΔPSI values reflects the fact that the SMN2 minigene has a base average PSI value of ∼41%. A total of 584 pentamers induced >10% exon-inclusion increases, whereas 230 pentamers induced >10% exon-skipping increases (Supplementary Table S1). In other words, there were more stimulatory pentamers in the complete pentamer library; one reason may be that the WT pentamer CAGCA is a weak hnRNP A1-binding sequence (15).
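The two summary steps described here, replicate correlation and binning by a 10% cutoff, can be sketched as follows. The code is a minimal illustration with hypothetical function names; the label "within_cutoff" for pentamers that change inclusion by 10% or less is an assumption, and the example values are the three ΔPSI values quoted in the text.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two replicate PSI vectors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def classify(dpsi, cutoff=10.0):
    """Bin a pentamer by its change in exon 7 inclusion relative to WT."""
    if dpsi > cutoff:
        return "stimulatory"
    if dpsi < -cutoff:
        return "inhibitory"
    return "within_cutoff"

# ΔPSI values quoted in the text for three mutants.
examples = {"CCCGC": 44.53, "ACCGC": -40.01, "CUACC": 0.0}
for pentamer, dpsi in examples.items():
    print(pentamer, classify(dpsi))
```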
The top 50 most stimulatory and inhibitory pentamers, along with 3 nt of flanking sequence on each side, are listed in Figure 2. We randomly selected 15 of the top 50 from each category and re-evaluated their splicing patterns in HEK293 cells; the data from three independent experiments were consistent with the initial screening (Figure 1B). Although the effects of many stimulatory or inhibitory mutants can be attributed to known mechanisms, many of the mutants with strong splicing-regulatory effects represent cis-elements that have not been previously reported. In addition, we observed in many cases that a single-nucleotide difference between two mutants led to robust changes in exon 7 inclusion, from being one of the most stimulatory to one of the most inhibitory, or vice versa (Figure 1C). For example, CCCGC induced 44.53% more exon 7 inclusion, whereas ACCGC induced 40.01% more exon 7 skipping.
Identification of short SREs within or overlapping the stimulatory and inhibitory pentamers should help understand their mechanism of action. To this end, we employed an MSEA approach (see Methods). Initially, 1216 motif sets were generated based on the six layers of sequence features: (i) non-positional, (ii) Position-1, (iii) Position-2, (iv) Position-3, (v) repeat dimers and (vi) nucleotide interaction (Figure 3A). Each motif set comprises four to 232 pentamers. To remove extreme cases, we excluded motif sets with fewer than seven pentamers or >200 pentamers. Thus, we used 692 motif sets in MSEA to obtain the normalized enrichment score (NES) based on the ΔPSI ranking of the corresponding pentamers among the 1024 pentamers in the library (Figure 3A). We used the NES to describe the stimulatory or inhibitory tendency of a given pentamer. A positive NES means a stimulatory tendency, whereas a negative NES means an inhibitory tendency. For example, the AUUUU pentamer has positive NES values in all six layers of its sequence features (Figure 3A). This implies a strong stimulatory tendency of the pentamer.
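To make the size filter and the sign convention of the score concrete, the following sketch applies the 7 to 200 size window and computes a GSEA-style weighted running-sum enrichment score for one motif set. It is a simplified stand-in for the GSEA software, not a reproduction of it: the permutation-based normalization that turns the raw score into an NES, and the FDR estimate, are omitted, and all names and toy values are hypothetical.

```python
def filter_by_size(motif_sets, min_size=7, max_size=200):
    """Keep motif sets whose member count lies within the stated window."""
    return {name: members for name, members in motif_sets.items()
            if min_size <= len(members) <= max_size}

def enrichment_score(ranked, scores, members, weight=1.0):
    """Running-sum statistic over pentamers ranked from highest to lowest score.

    Hits add a step proportional to |score|**weight, misses subtract a
    constant step; the score is the largest deviation from zero.
    """
    members = set(members)
    hit_total = sum(abs(scores[p]) ** weight for p in ranked if p in members)
    miss_count = sum(1 for p in ranked if p not in members)
    if hit_total == 0 or miss_count == 0:
        raise ValueError("enrichment score is undefined for this motif set")
    running = 0.0
    best = 0.0
    for p in ranked:
        if p in members:
            running += abs(scores[p]) ** weight / hit_total
        else:
            running -= 1.0 / miss_count
        if abs(running) > abs(best):
            best = running
    return best

# Toy ranked list (hypothetical ΔPSI values), sorted from high to low.
scores = {"UUUUU": 30.0, "CGUCC": 20.0, "CAGCA": -10.0, "UAGGG": -20.0}
ranked = sorted(scores, key=scores.get, reverse=True)
print(enrichment_score(ranked, scores, {"UUUUU", "CGUCC"}))
print(enrichment_score(ranked, scores, {"CAGCA", "UAGGG"}))
```

A set concentrated at the top of the ranking yields a positive score and one concentrated at the bottom a negative score, which matches the stimulatory and inhibitory reading of the NES sign.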
To project the distance between pentamers in the six-layer space, we performed principal component analysis (PCA). We used the first three principal components to generate a three-dimensional PCA plot (Figure 3B). The first two principal components (PC1 and PC2) cover 79.44% of the variance. With the additional third principal component (PC3), the PCA plot covers 88.58% of the variance. We observed two major clusters in the PCA plot: (i) a stimulatory cluster (red) and (ii) an inhibitory cluster (blue) (Figure 3B). This shows that the six-layer sequence features clearly separate the stimulatory from the inhibitory pentamers.
Among the non-positional motif sets, 30 stimulatory and 46 inhibitory 3- and 4-mers were significant. Eight stimulatory and nine inhibitory Position-1 3-mers, six stimulatory and 12 inhibitory Position-2 3-mers, as well as seven stimulatory and 14 inhibitory Position-3 3-mers were all significant. Repeat dimers had six stimulatory (including single-nucleotide U:U repeats) and one inhibitory (AG:AG) SRE motifs. Nucleotide-interaction motif sets had 38 stimulatory and 23 inhibitory SRE motifs. Overall, stimulatory SRE motifs are more diverse (e.g. UUU, AUG and CCC), whereas inhibitory SRE motifs have strong preferences for A and G nucleotides. Also, inhibitory SRE motifs in general have higher NES scores (Figure 3C and D, Supplementary Table S2).
We looked at the sequence features of the 30 enriched stimulatory SRE motifs, and found that they can be separated into two major classes, which we termed U-rich and CG-containing (CG-core), plus one minor class that is C-rich (Figure 5A). Nineteen motifs belong to the U-rich class, with each motif containing at least two Us. Interestingly, only one of the U-rich motifs has a G (UUG). As the immediate tetranucleotide downstream of the library site is UUAU, this suggests that poly-U runs or U-rich sequences (>4 Us) mixed with scattered A or C tend to be strong ISEs. We interrogated all 32 pentamer mutations that are composed solely of A and/or U, and found 20 with ΔPSI >30 and all with a positive ΔPSI, reflecting that UA-rich sequences are generally ISEs. However, pentamers with 3-5 Us have much higher ΔPSI values than those with 3-5 As, pointing to the Us as the key component of these motifs (Supplementary Table S3); this explains why AAU is the only enriched motif with more As than Us. On the other hand, U-rich motifs mixed with scattered C, though mostly stimulatory, can be very inhibitory in some cases (Figure 2, Supplementary Table S4). It is reasonable to assume that most U-rich motifs function by binding to TIA1 and TIAL, and indeed overexpression of TIA1 promoted more exon 7 inclusion in pentamers with multiple Us than in those without Us (Supplementary Figure S2). However, we cannot rule out other mechanisms that may mediate the effects of particular motifs, considering the existence of multiple RNA-binding proteins with high affinity for U-rich sequences (13).
The second major class of stimulatory motifs includes eight motifs (CGU, CGC, GUCG, GCG, UCG, CUCG, CGUC and CGCU), all of which share the dinucleotide CG, highlighting this CG feature as the core of this type of splicing enhancer (Figure 5A). Alignment of all eight CG-core motifs suggests that UCGY (Y = pyrimidine) is a stronger extended version of the CG dinucleotide.
We designated the remaining two motifs, CUC and CCCU, both with more Cs (at least two Cs) than Us, as C-rich motifs (Figure 5A). As the first upstream flanking nucleotide is C, this hints that C-rich sequences, such as three or more consecutive Cs, or Cs mixed with scattered Us, constitute an ISE. We discuss this further below (Position 1-dependent MSEA).
The 46 inhibitory motifs (11 3-mers and 35 4-mers) enriched by non-positional MSEA can be separated into four classes (Figure 5B). Among the 11 inhibitory 3-mer motifs, eight contain at least one A and one G, and all display a pattern of NAG or AGN, including all possible combinations, but not BGA (B = C, G or U) or GAH (H = A, C, or U), confirming that the dinucleotide AG, rather than GA, is the core of purine-rich silencers. Most of the NAG/AGN motifs rank at the top of the 46-motif list (Supplementary Table S2). We designated those motifs containing at least one AG as AG-core motifs, which represent the most predominant class of inhibitory motifs and include 27 4-mers, in addition to the eight 3-mers (Figure 5B). Among the 27 4-mer motifs, 16 are NUAG, UAGN, NAGG and AGGN, consistent with UAGG being the SELEX winner sequence with high affinity for hnRNP A1 (14). Depletion of hnRNP A1 and its paralog protein hnRNP A2 indeed increased exon 7 inclusion in UAGG-containing mutants to a greater extent than in mutants lacking the motif (Supplementary Figure S3). The predominant enrichment of AG-containing motifs by non-positional MSEA supports the notion that hnRNP A/B proteins, particularly the abundant hnRNP A1 and hnRNP A2, are likely among the strongest splicing repressors.
We classified the seven enriched 3- and 4-mers that contain a copy of either UGG or GGU (UGG, UGGC, UGGU, GGUC, GGU, CUGG and UGGA) as UG-type motifs (Figure 5B). The silencing activities of these motifs may involve multiple mechanisms. It has been reported that two 8-nt sequences in intron 7, one from position 3 to 10 and the other from 281 to 289, form a double-stranded RNA structure, which impairs the annealing of U1 snRNA to the 5′ splice site of intron 7, contributing to the predominant skipping of SMN2 exon 7 (32). We found that mutations in either strand that presumably strengthen the long-distance RNA-RNA interaction also repressed exon 7 splicing in the SMN1 minigene (Supplementary Figure S4). UGNNN and UGGNN extend the predicted double-stranded RNA structure by at least two and three base pairs, respectively. Therefore, this may be an important part of the mechanism that makes them strongly inhibitory. On the other hand, UGG and GGU themselves have inhibitory activity, particularly when they are flanked with pyrimidines, as in UGGU, UGGC, CUGG and GGUC, and the consensus motif appears to be UGGY. Indeed, 14/16 NUGGY and UGGYN pentamers in the library strongly inhibited exon 7 splicing compared to the WT SMN2 minigene (Supplementary Table S5); the two exceptions were AUGGU and AUGGC, both of which are part of an RBFOX-binding motif (UGCAUG) (see below). The 16 NUGGN pentamer mutants provide an opportunity to analyze the UGG motif itself, with little interference from the RNA secondary structure. When UGG is followed by G or A instead of C or U, its inhibitory activity is markedly reduced or even reversed (Supplementary Table S6). This observation suggests that GGG and GGA are at best weakly inhibitory, whereas GGG may be even slightly stimulatory in some cases, and counteracts UGG.
Figure 3. (A) Six layers were used to describe the sequence features of pentamers. The non-positional feature (red layer) covers 3- to 4-mer motifs in any given position of the pentamer. For example, a 3-mer AUU motif set in this layer will include pentamers with AUU at any given position. Position-dependent features are described by three independent layers (blue, green and pink layers), and they cover pentamers whose 3-mer motif starts at the first nucleotide (Position 1), the second nucleotide (Position 2) or the third nucleotide (Position 3). The repeat-dimer feature is described by whether a pentamer has specific repeating nucleotide(s) (cyan layer). The nucleotide-interaction feature is described by the co-existence of any two nucleotides in a pentamer. Motif sets were created according to these sequence features. One motif set represents one particular sequence feature in a group of pentamers (e.g. AUUUU and AUUCU are both in the non-positional AUU motif set). A pentamer can be included in multiple motif sets. For example, the AUUUU pentamer is included in both AUU (non-positional) and AUUNN (Position 1) motif sets. Each motif set has a normalized enrichment score (NES) based on the ΔPSI ranking of its pentamers among the 1024 pentamers. If most of its pentamers are ranked toward the positive end of the ΔPSI spectrum, the motif set has a positive NES. A positive NES means that the sequence feature (or motif) is stimulatory to SMN2 exon 7 inclusion, whereas a negative NES means an inhibitory motif. (B) A three-dimensional PCA plot with 1,024 pentamer balls is shown. Percent variance explained is labeled next to the principal components (PC1, PC2, and PC3). The pentamer balls are colored based on their ΔPSI values (red = exon inclusion; blue = exon skipping). (C, D) Stimulatory and inhibitory motifs are shown in C and D, respectively. The sizes of SREs are scaled based on NES scores. SRE motifs are colored based on the layer of sequence feature: non-positional (red), Position-1 (blue), Position-2 (green), Position-3 (purple), repeat-dimer (cyan) and nucleotide-interaction (yellow).
It has been established that intronic G-rich motifs or poly-G runs act as splicing enhancers (33,34). Therefore, we interrogated all 40 pentamer mutants with a G triplet (NNGGG, NGGGN and GGGNN). Interestingly, except for seven AGG- and three UGG-containing pentamers, all others have a ΔPSI > 0 and 18 have a ΔPSI > 17 (Supplementary Table S7). For all pentamers containing four consecutive Gs, 5/7 robustly promoted exon 7 inclusion, the exceptions being AGGGG and UGGGG (Supplementary Table S7), confirming that poly-G stretches are indeed ISEs.
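The sizes of the two pentamer groups examined here follow directly from enumeration, as the short check below illustrates. It is only a counting sketch over the 1024 possible pentamers (variable names are hypothetical) and says nothing about their measured splicing effects.

```python
from itertools import product

library = ["".join(p) for p in product("ACGU", repeat=5)]

# Pentamers with a G triplet at any position (NNGGG, NGGGN or GGGNN).
g_triplet = [p for p in library if "GGG" in p]
# Pentamers with four consecutive Gs (GGGGN or NGGGG).
g_quad = [p for p in library if "GGGG" in p]

print(len(g_triplet), len(g_quad))
print(sorted(g_quad))
```

The overlap between the three triplet patterns is why the first count is 40 rather than 3 × 16 = 48.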
We classified three 4-mers (AGAC, GGAC and UGAC) as GAC-core motifs (Figure 5B). AGAC also falls into the AG-core class. Interestingly, when the nucleotide preceding GAC is C, its inhibitory activity is neutralized. We looked at all mutants containing CGAC, and found that 18/24 markedly promoted exon 7 splicing, compared to the WT SMN2 minigene, with ΔPSI > 11 (Supplementary Table S8). On the other hand, 22/24 mutants that comprise DGAC (D = A, U or G) had a ΔPSI < 0, and 18 markedly inhibited exon 7 splicing (ΔPSI < -12) (Supplementary Table S9). These data support DGAC motifs as a class of ISSs.
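Because a motif can span the junction between the library site and its flanks, counting "mutants containing CGAC" requires the sequence context. The sketch below assumes the flanks quoted elsewhere in the text (upstream trinucleotide UGC, downstream tetranucleotide UUAU) and uses a small IUPAC-to-regex table; the helper names are hypothetical.

```python
import re
from itertools import product

# IUPAC codes used in the text, written as regular-expression classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "U": "U", "N": "[ACGU]",
         "Y": "[CU]", "W": "[AU]", "D": "[AGU]", "B": "[CGU]", "H": "[ACU]"}

UPSTREAM, DOWNSTREAM = "UGC", "UUAU"  # flanks of the library site (from the text)
INSERT_START = len(UPSTREAM)
INSERT_END = INSERT_START + 5

def motif_regex(motif):
    """Compile an IUPAC motif as a lookahead so overlapping matches are found."""
    body = "".join(IUPAC[base] for base in motif)
    return re.compile("(?=(" + body + "))")

def pentamers_with(motif, library):
    """Pentamers whose flanked sequence has the motif overlapping the insert."""
    rx = motif_regex(motif)
    hits = []
    for pentamer in library:
        seq = UPSTREAM + pentamer + DOWNSTREAM
        for m in rx.finditer(seq):
            # Keep matches that include at least one library nucleotide.
            if m.start(1) < INSERT_END and m.end(1) > INSERT_START:
                hits.append(pentamer)
                break
    return hits

library = ["".join(p) for p in product("ACGU", repeat=5)]
cgac = pentamers_with("CGAC", library)
dgac = pentamers_with("DGAC", library)
print(len(cgac), len(dgac))
```

Under these assumptions both groups contain 24 pentamers, in line with the denominators reported above; 16 of the CGAC cases are pentamers that begin with GAC and borrow the upstream C.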
ACC and CUCC, which rank No. 15   named these elements GCWCC-type (W = A or U) ( Figure  5B).

Motifs enriched by position-dependent MSEA
Non-positional analysis generally identifies fully functional motifs in the library, but may miss motifs that span one of the two junctions; such motifs require position-dependent analysis to pinpoint. For example, some motifs may include the upstream trinucleotide UGC or the downstream dinucleotide UU, or a portion thereof, to form a fully functional motif.
For Position 1-dependent MSEA, we analyzed all 3-mers at Position 1 (nucleotides 1-3 in each pentamer), which identified 17 3-mers that strongly affected exon 7 splicing, with eight stimulatory and nine inhibitory ones (Figure 6A and B). The immediate upstream nucleotide C and CCC make a longer C-rich motif, and the C and GUC form the CG-core motif CGUC, which is on the enriched non-positional MSEA list. CCC is ranked as the top stimulatory motif on the list, though it failed to show enrichment by non-positional MSEA, indicating that poly-C runs (four or longer) are one class of strong ISEs.
It was not surprising to find AUG among the top two on the list, as it is part of a well-known splicing enhancer, UGCAUG, with the upstream flanking trinucleotide UGC (Figure 6A). UGCAUG is the binding motif of the RBFOX family of proteins. It has been well documented that these proteins promote splicing when binding downstream of the 5′ splice site, including in SMN2 intron 7 (35,36). Overexpression of RBFOX1 markedly promoted exon 7 splicing, but only for mutants that harbor UGCAUG at the upstream junction (Supplementary Figure S2), consistent with earlier observations. We classified these motifs as RBFOX-related. AUC appears not to rely on upstream sequences to form a stimulatory motif, and its enrichment at Position 1 is likely due to other 3-mers at this position forming inhibitory motifs with the upstream GC (GCWCC) or C (CAGN).
Regarding the nine inhibitory 3-mers enriched by Position 1-dependent MSEA (Supplementary Table S2), UGG and ACC replaced UAG and AGG as the top two inhibitory motifs, compared to the non-positional analysis, suggesting that upstream flanking nucleotides contribute to their silencing activities. Two 3-mers (UCC and UGA) are new. UGG and UGA can be explained by the above-mentioned long-distance RNA secondary structure. On the other hand, with the upstream dinucleotide GC, ACC and UCC form GCACC and GCUCC, respectively, which further confirms GCWCC as a class of potent splicing silencers (Figure 6B).
For Position 2-dependent MSEA, we analyzed all 3-mers at Position 2 (nucleotides 2-4 in each pentamer), which led to the identification of six stimulatory and 12 inhibitory motifs (Figure 6A and B, Supplementary Table S2). Five of the six stimulatory motifs, U-rich or CG-core, were also enriched by non-positional MSEA. The exception is ACU, which, similar to AUC identified by Position 1-dependent MSEA, appears not to rely on flanking sequences to form a functional ISE, but rather avoids the formation of silencers (such as AG, DGAC, UGG and CUCC) when positioned at 2-4. Among the 12 enriched inhibitory motifs, five (GAC, GGC, CCU, CCA and GGA) are new, but they all belong to the above-defined motif classes (Figure 6B). Six are AG-core and all were enriched by non-positional MSEA. That GAC is enriched here is most likely because its inhibitory activity is abolished or compromised when placed at Position 1, due to the upstream C. GGC, GGU and GGA apparently rely on the first nucleotide in the pentamers to be A or U, to form AG-core or UG-type motifs, whereas CCU and CCA rely on the first nucleotide to be A or U to make GCWCC silencers.
For Position 3-dependent MSEA, we analyzed all 3-mers at Position 3 (nucleotides 3-5), which identified seven stimulatory and 14 inhibitory 3-mers (Figure 6A and B, Supplementary Table S2). Except for UUA, all others are on the stimulatory-element list identified by non-positional MSEA. Together with the downstream dinucleotide UU, UUA forms UUAUU, which is among the top 50 stimulatory pentamers in the library (Figure 2). Another contributing factor is that it avoids the formation of the strong inhibitory motifs UAG and UAGG at nucleotide positions 3-5. Among the 14 inhibitory 3-mers, only GCC was not enriched above. We examined the 16 NNGCC mutants, but only half were inhibitory, and they include five comprising the AG-core, two comprising UGG motifs, as well as GCGCC, which is likely a weak version of GCWCC (Supplementary Table S1). Therefore, GCC is likely a moderate or weak ISS motif; its enrichment is partly because G at position 3 allows the formation of UAG, CAG, CAGG, AGG and UGG. This also explains why GAC replaced UAG as the most inhibitory 3-mer at positions 3-5.

Dinucleotide and nucleotide interaction analyses
Analysis of 3- and 4-mer motifs enriched by MSEA demonstrates that sequences as short as two nucleotides play critical roles in defining SREs, which prompted us to explore whether any two nucleotides can indeed be enriched with significant splicing-regulatory activity by MSEA. Repeat-dimer analysis identified five dinucleotides (UU, AU, UC, CG and CU) as stimulatory, and only one dinucleotide (AG) as inhibitory; single-nucleotide U repeats also displayed stimulatory activity (Supplementary Table S2). Based on the above 3-/4-mer motif analysis, none of the four U-containing dinucleotides (UU, AU, UC and CU) is sufficient to be a specific core of SREs. It is most likely that they form U-rich ISEs with the downstream UUAU. Indeed, our poly-U analysis shows that at least four Us are required for a potent increase in exon 7 inclusion (see below). Therefore, the only stimulatory dinucleotide core is CG, whereas the only inhibitory dinucleotide core is AG. As CG and AG constitute opposite splicing signals, a single nucleotide C > A mutation in CG-core motifs is expected to result in severe splicing defects. One such case is associated with breast cancer, in which the deleterious missense mutation c.5242C > A in BRCA1 causes exon 18 skipping, resulting in the loss of 26 aa that are essential for the protein's function (37).
We next performed nucleotide-interaction analysis (see MSEA in Materials and Methods) and identified 39 stimulatory nucleotide interactions, including eight U/U, 11 U/C, 11 U/A, two C/C, three G/U, one G/C, two CG and one AC (Supplementary Table S2). 30/39 are U/U, U/C and U/A interactions, confirming that poly-U or U-rich sequences with scattered C or A are generally strong ISEs. All four G/U and G/C interactions have the G at nucleotide position 1, and thus the G forms a CG dinucleotide with the upstream C, consistent with CG being the core of a class of ISEs. Indeed, the CG dinucleotide itself was enriched twice on the list of stimulatory nucleotide interactions. We also identified 23 inhibitory nucleotide interactions, which include nine A/G (among them four AG dinucleotides and two GA dinucleotides), four U/G (all Us being at position 1), four G/G, three C/C, two G/C and one AC at positions 1-2 (Supplementary Table S2). The nucleotide-interaction analysis is consistent overall with our classification of stimulatory and inhibitory motifs obtained by positional and non-positional MSEA.
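To make the role of the flanking nucleotides concrete, the following Python sketch locates a motif in a pentamer placed within its immediate context and reports whether the hit uses flanking nucleotides. The upstream flank UGC and the downstream flank UUAU are taken from the text, as in "(UGC)AUGNN" and "the downstream nucleotides UUAU"; the function name and the coordinate convention are assumptions made for illustration, and this is not the nucleotide-interaction analysis itself.

```python
UPSTREAM = "UGC"     # nucleotides preceding the pentamer, as in "(UGC)AUGNN"
DOWNSTREAM = "UUAU"  # nucleotides following the pentamer, as stated in the text

def motif_hits(pentamer, motif, upstream=UPSTREAM, downstream=DOWNSTREAM):
    """Return (start, spans_junction) for each occurrence of `motif`.

    `start` is 1-based relative to the pentamer, so values of 0 or below
    lie in the upstream flank. `spans_junction` is True when the hit uses
    at least one flanking nucleotide.
    """
    if len(pentamer) != 5:
        raise ValueError("expected a 5-nt sequence")
    seq = upstream + pentamer + downstream
    offset = len(upstream)
    hits = []
    i = seq.find(motif)
    while i != -1:
        start = i - offset + 1
        end = start + len(motif) - 1
        hits.append((start, start < 1 or end > 5))
        i = seq.find(motif, i + 1)  # step by one so overlapping hits are found
    return hits

# A G at position 1 forms CG with the upstream C.
print(motif_hits("GUUUU", "CG"))      # [(0, True)]
# AUGNN pentamers complete UGCAUG with the upstream UGC.
print(motif_hits("AUGAA", "UGCAUG"))  # [(-2, True)]
# An AG lying fully inside the pentamer does not span a junction.
print(motif_hits("CAGUU", "AG"))      # [(2, False)]
```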

Poly-C and poly-U tracts are comparable in promoting SMN2 exon 7 splicing
We next validated several interesting findings from the pentamer-library analysis. Multiple U-rich pentamers, such as UUUUU, UUCUU and UUUUA rank among the strongest stimulatory mutants (Figure 2), and form longer U-rich stretches with the downstream nucleotides UUAU. On the other hand, based on the Position 1-dependent MSEA, sequences with four or more consecutive Cs also strongly promote exon 7 inclusion. To gain further insight into splicing regulation by poly-U and poly-C sequences and compare their effects, we inserted different lengths of Us or Cs at two positions: one after position 10 (P10) and one after position 15 (P15) of intron 7 in the SMN2 minigene. These mutants were analyzed in HEK293 cells. As shown in Figure 7, four or more Us inserted at P10 robustly promoted exon 7 splicing, with the percentage of exon 7 inclusion increasing from 48% (WT SMN2) to 76% (4 Us) or higher (>4 Us). Insertion of three Us at P15, which forms a 5-U tract with the 2 Us downstream, increased the percentage to 54%. At both positions, the longer the inserted poly-U tract, the greater the exon 7 inclusion. Similarly, when we inserted three Cs at P10, which form a 5-C tract with the two flanking Cs, the percentage of exon 7 inclusion increased from 50% in SMN2 to 67%, and there was a strong correlation between the number of Cs and exon 7 inclusion (Figure 7B). However, when we placed poly Cs at P15, strong suppression of exon 7 splicing ensued, which, we believe, is due to the immediate upstream trinucleotide GCA forming a strong GCACC ISS with the poly-C sequences. Indeed, disruption of the GCACC motif by insertion of an extra A or U before the poly-C stretches restored the stimulatory activity of the poly-C tracts (Figure 7C). These data confirm that poly-C sequences have similar effects as poly-U in promoting exon 7 inclusion.
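The sequence consequences of the P15 insertions can be checked with a few lines of Python. The sketch below uses only the local context given in the text (GCA immediately upstream of P15 and UUAU downstream); the helper names are hypothetical, and the code says nothing about splicing outcomes, only about which tracts and motifs an insertion creates.

```python
import re

UPSTREAM_P15 = "GCA"     # trinucleotide immediately upstream of P15 (from the text)
DOWNSTREAM_P15 = "UUAU"  # nucleotides immediately downstream of P15 (from the text)

def insert_at_p15(insert):
    """Return the local sequence around P15 after inserting `insert`."""
    return UPSTREAM_P15 + insert + DOWNSTREAM_P15

def longest_run(seq, base):
    """Length of the longest uninterrupted run of `base` in `seq`."""
    runs = re.findall(f"{base}+", seq)
    return max((len(r) for r in runs), default=0)

# Three inserted Us join the two downstream Us to give a 5-U tract.
print(longest_run(insert_at_p15("UUU"), "U"))  # 5
# Poly-C at P15 creates GCACC together with the upstream GCA.
print("GCACC" in insert_at_p15("CCC"))         # True
# An extra A (or U) before the Cs disrupts GCACC.
print("GCACC" in insert_at_p15("ACCC"))        # False
print("GCACC" in insert_at_p15("UCCC"))        # False
```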

GCACC and GCUCC are strong ISSs
Among the top 13 inhibitory pentamers, eight are attributable to GCACC or GCUCC (Figure 2), whose silencer activities were confirmed by Position 1-dependent MSEA (Figure 5B) and the above poly-C insertion study (Figure 7). To further test GCACC as a novel strong ISS motif, we placed it in different sequence settings by inserting it at four positions: P15, P24, P33 and P42 in intron 7 of both the WT SMN2 and SMN1 minigenes; insertions at these sites cause no disruptions of known SREs. To avoid the formation of AG and AGG at P24 and P33, we also inserted UGCACC at these two sites. We transfected all insertion mutants into HEK293 cells for splicing analysis. As shown in Figure 8A, insertions of GCACC or UGCACC at P15 and P24 robustly repressed exon 7 splicing in SMN2, and the effect was less but still pronounced in SMN1. The effect of GCACC inserted at P33 and P42 was drastically compromised or abolished, suggesting that close proximity to the 5′ splice site is crucial for this element to exert its silencing activity.
To explore whether GCACC is a general SRE that regulates other alternative splicing events, we tested it in three additional minigenes in HEK293 cells. A MAPT minigene comprising part of exon 9, a truncated intron 9, exon 10, intron 10 and exon 11, was constructed. Resembling the endogenous gene, exon 10 of the minigene is alternatively spliced (Figure 8B). An RNA secondary structure is present at the 5′ splice site of MAPT intron 10 and regulates exon 10 splicing (38,39). To avoid disruption of the RNA structure and other natural SREs, we inserted one or two copies of GCACC, following position 32 (P32) of intron 10, with two neutral pentamers (GAACC and GCACA) as controls (Supplementary Table S1). As predicted, insertion of one copy of GCACC, but not GAACC or GCACA, at P32 markedly reduced exon 10 inclusion from 62% (WT) to 22%; and two copies of GCACC reduced exon 10 inclusion further, to 6% (Figure 8B).
The CASP3 gene expresses a minor exon 6-skipped isoform (40). We generated a CASP3 minigene comprising a genomic fragment from exon 5 to the first 100 nt of exon 7. One or two copies of GCACC or two controls, GCACA and GAACC, were inserted after position 10 of intron 6. We observed a modest reduction in exon 6 inclusion when one copy of GCACC was inserted, and a drastic reduction (from 81% in WT to 5%) when two copies were inserted (Figure 8C). In contrast, GAACC, either one or two copies, had no effect. In the case of GCACA, we observed inhibitory effects, but they were modest for both one and two copies.
GOLM2 exon 9 is alternatively spliced and regulated by SRSF1 (19). We generated a GOLM2 minigene that consists of exon 8, a shortened intron 8, exon 9, a shortened intron 9, and the first 141 nt of exon 10. We initially planned to insert GCACC around P10 in intron 9, because analysis in the SMN2 minigene revealed that the closer the motif is to the 5′ splice site, the stronger the splicing repression it exerts. However, we noted that the first 21 nt of intron 9 are highly GC-rich (16 out of 21 are Gs or Cs). The GC content at splice sites can affect splicing (41). Therefore, we inserted the GCACC motif after position 25 to avoid interference by the GC-rich stretch. A robust reduction was observed when one copy was inserted, and exon 9 inclusion further decreased to 1% when two copies were inserted (Figure 8D). In contrast, we did not observe a significant change in exon 9 splicing when GCACA (a control) was inserted. Intriguingly, insertion of GAACC also markedly inhibited exon 9 splicing, although no differences were observed between one and two copies, which will require further study to understand the underlying mechanism.
The potent repression of splicing by one or two copies of GCACC in all tested minigene contexts demonstrates that the motif is a general, strong ISS. We next performed RNA-affinity chromatography to identify potential RBPs. A biotinylated 14-nt RNA with two copies of GCACC bound to streptavidin beads was incubated with HeLa cell nuclear extract under splicing conditions. Two RNAs, one with a C in one copy mutated to A (Mut1), and the other with a C in both copies mutated to A (Mut2), were used as controls. The beads were washed four times with buffer containing different salt concentrations, up to 150 mM, and proteins that remained bound to RNA were analyzed by SDS-PAGE and Coomassie-blue staining. Two strong bands at ∼150 and ∼49 kDa were observed with the WT RNA sample (Figure 9A). The ∼49 kDa band was weak in the Mut1 sample, but absent in the Mut2 sample, highlighting a potential RBP that specifically binds to GCACC. The two bands were excised and analyzed by mass spectrometry. The smaller band contained mainly YB1 and the other band contained SF3B1. Western blotting confirmed that YB1 was enriched in the WT RNA sample, compared to the two controls. Unexpectedly, knockdown of either YB1 or SF3B1 with specific siRNAs had no effect on exon 7 splicing of the SMN2 minigene GCACC mutant (data not shown). It is possible that there is redundancy with other RBPs, or that YB1 affects exon 7 splicing of the mutant through different mechanisms, such that the net outcome is neutral. For example, it was reported that YB1 binds to A/C-rich ESEs to stimulate exon inclusion (42). Therefore, we used an MS2-tethering system to assess the effect of YB1 bound to the specific location. The five nucleotides at positions 11-15 were replaced with the MS2 sequence, and the new minigenes were termed MS2-SMN1/2. YB1 was fused to bacteriophage MS2 coat protein (CP), and GFP was used as a control. T7-tagged CP-YB1 fusion protein was properly expressed in HEK293 cells (Figure 9B). Overexpression of CP-YB1 potently repressed exon 7 splicing of both MS2-SMN1/2 minigenes, compared to CP and CP-GFP controls (Figure 9C−E), indicating that YB1 is a candidate protein that binds to the inhibitory GCACC motif. Indeed, in a prior SELEX study using recombinant GST-YB1, multiple RNAs containing GCACC were enriched (43), consistent with our conclusion. GCACC and GCUCC differ by one nucleotide. Whether their mechanisms of action are similar is unknown. We note that the first G is essential for the strong inhibitory activity of GCACC. In contrast, the first G appears not to be strictly required in GCUCC, as both mutants ACUCC and CCUCC in the pentamer library strongly inhibited exon 7 splicing, though with slightly weaker effects than GCUCC. The exception is mutant UCUCC, which had weak inhibitory activity, with a PSI of −3.50.
Figure 7. Comparison of the effects of poly-C and poly-U stretches on SMN2 exon 7 splicing in HEK293 cells. (A, B) Insertion of poly-U or poly-C runs at two sites: one after position 10 (P10) and one after position 15 (P15) in SMN2 intron 7. We named the mutants based on the insertion site and the number of Us or Cs. (C) We inserted a single A, U or G at P15 preceding the poly-C runs in the P15 poly-C mutants, to abolish the formation of GCACC, a novel strong ISS. We named the mutants with the inserted extra nucleotide and the number of Cs; for example, 'A + 3Cs' means insertion of ACCC. Quantitative data are shown in histograms (n = 3). (*) P < 0.05, (**) P < 0.01, (***) P < 0.001 compared to WT SMN2.
Figure 8. (A) ... , respectively, in both the SMN1 and SMN2 minigenes. Insertion of GCACC at P24 and P33 creates an AGG and AG, respectively. Therefore, we also tested insertion of UGCACC at P24 and P33 to evaluate the effects of the newly formed AGG and AG. The data show that the closer the GCACC motif is to the 5′ splice site of intron 7, the stronger its inhibitory effect is on exon 7 splicing. (B) We tested one or two copies of GCACC in a MAPT minigene by insertion after position 32 (P32) in intron 10. GAACC and GCACA served as controls. Robust inhibition of exon 10 inclusion is seen by GCACC, but not the two controls. The effect of two copies was even stronger. (C, D) One or two copies of GCACC were also inserted at P10 in intron 6 of a CASP3 minigene, and at P25 in intron 9 of a GOLM2 minigene. We observed inhibitory effects of the motif for both minigenes when one copy was inserted, and robust exon skipping when two copies were inserted. One of the controls also exhibited inhibitory effect in the respective minigenes, which will require further investigation to understand the underlying mechanisms. Quantitative data are shown in histograms (n = 3). (**) P < 0.01, (***) P < 0.001 compared to WT SMN2, SMN1, MAPT, CASP3 or GOLM2.
Figure 9. Identification of YB1 as a potential splicing repressor bound to GCACC. (A) RNA-affinity purification was performed to explore RBPs that bind to the GCACC motif. Biotinylated WT, Mut1 or Mut2 RNA bound to streptavidin beads was incubated with HeLa cell nuclear extract (NE) under splicing conditions. The beads were washed four times at different salt concentrations up to 150 mM. Bound proteins were eluted with SDS and analyzed by SDS-PAGE with Coomassie-Blue staining. Protein bands at ∼150 and ∼50 kDa were prominent in the WT RNA sample. Based on mass spectrometry analysis, these bands consisted mainly of SF3B1 and YB1, respectively. Enrichment of YB1 in the WT RNA sample was verified by Western blotting using an anti-YB1 pAb. (B) In an MS2-tethering assay, T7-tagged proteins with or without CP fusion were properly expressed in HEK293 cells, as detected with anti-T7 mAb. (C) The effects of various expressed proteins on exon 7 splicing of the MS2-SMN1/2 minigenes were analyzed. Each minigene plasmid (500 ng) was co-transfected with one (300 ng) or two (150 ng each) expression plasmids into HEK293 cells. Buffer: no expression plasmid. (D, E) Quantitative data are shown in histograms (n = 3). (***) P < 0.001 compared to T7-CP.

CG repeats potently promote SMN2 exon 7 splicing
We showed above that CG-containing motifs are a new class of ISEs, of which UCGY is a particularly strong version. Intriguingly, pentamers with two copies of CG tend to be more stimulatory than those with only one copy of CG (Supplementary Tables S1 and S2). Indeed, GCGCG, the only mutant that has three CG repeats, counting the upstream C, is one of the top stimulatory pentamers in the library (Figure 2), suggesting that the stimulatory activity of each CG copy is additive.
To verify that CG repeats are indeed potent ISE motifs, we inserted CGCGCG at P15, P24, P33 and P42, respectively, in intron 7 of the SMN2 minigene, and examined exon 7 splicing of these mutants in HEK293 cells. Insertions of CGCGCG at the three proximal sites, but not at P42, robustly promoted exon 7 inclusion (Figure 10A). We also tested CGCGCG in the above-mentioned three minigenes. When inserted at P32 in intron 10 of MAPT, or at P25 in intron 9 of GOLM2, the CG repeats inhibited exon inclusion (Figure 10B, D). We noticed that the sites of insertion in both minigenes are GC-rich, which might interfere with the activity of CGCGCG. On the other hand, when inserted at P10 in intron 6 of CASP3, the CG repeats promoted exon inclusion (Figure 10C). These results suggest that the effect of CG repeats is context-dependent.

DISCUSSION
Most RBPs recognize short sequence motifs to bind RNA and exert their functions (13). Although a cohort of SREs and their cognate proteins have been documented, it remains challenging to predict the patterns of sequence features that regulate alternative splicing. In the present study, we built a library of all possible pentameric sequences at a fixed site in intron 7 of an SMN2 minigene by site-directed mutagenesis, and analyzed in duplicate the effects of all 1023 mutants, plus the wild type, on exon 7 splicing in HEK293 cells. Using MSEA, we found that 96 2-4-mer motifs are enriched in pentamers that robustly promote or repress exon 7 splicing. Based on their sequence features, these motifs were grouped into four stimulatory classes: U-rich, C-rich, CG-core and RBFOX-related, as well as four inhibitory classes: AG-core, UG-type, GCWCC-type and GAC-core. The splicing effects of nearly all the top 50 most-stimulatory and most-inhibitory mutants are attributable to these motif classes (Figure 2). Therefore, our study revealed major classes of short intronic SRE motifs that potently affect alternative splicing. Three motif classes (CG-core, GCWCC and GAC-core) are novel, and will require further investigation to understand their underlying mechanisms and characteristics in splicing regulation.
Research on exon definition has focused on the −3 to +6 nucleotides of the 5′ splice site (44). The intronic nucleotides at the 5′ splice site beyond +6 are generally believed to have limited impact on exon definition. Our experiments showed that the +11 to +15 intronic nucleotides of SMN2 exon 7 have a strong impact on its exon definition. For example, the 15 pentamers with single-nucleotide variations (SNVs) in our minigene experiments showed +39% to −23.5% PSI changes. In contrast, a well-trained SpliceAI model (45) is not sensitive to these SNVs. All 15 SNVs are predicted to be lower than the lowest score cutoff (0.2) for the 5′ splice site of SMN2 exon 7 (Supplementary Figure S5). In other words, these SNVs are invisible to the exon-definition predictions of SpliceAI. In summary, the intronic nucleotides beyond the +6 position can substantially influence splicing, and exon-definition models can be improved by integrating the sequence features identified in this study.
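The count of 15 SNV pentamers follows from simple enumeration: five positions, each with three alternative nucleotides. A minimal Python sketch is given here; the function name is hypothetical, and the input pentamer is an arbitrary stand-in rather than the wild-type SMN2 sequence.

```python
BASES = "ACGU"

def single_nucleotide_variants(pentamer):
    """Return every sequence that differs from `pentamer` at exactly one position."""
    variants = []
    for i, ref in enumerate(pentamer):
        for alt in BASES:
            if alt != ref:
                variants.append(pentamer[:i] + alt + pentamer[i + 1:])
    return variants

def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

reference = "ACGUA"  # arbitrary placeholder, NOT the wild-type pentamer
snvs = single_nucleotide_variants(reference)
print(len(snvs))                                   # 15
print(all(hamming(reference, v) == 1 for v in snvs))  # True
```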
Systematic high-throughput screens of SREs have been reported in several studies using various strategies (21)(22)(23)(24)(25)(26). A shared feature is that they all used random sequences coupled with RNA sequencing or selection by enrichment tools, such as GFP, as opposed to a complete sequence library being analyzed one by one, which is the approach we used here. Our study presents a full picture of splicing patterns for all pentameric sequences with respect to their effects on SMN2 exon 7 splicing. Although different alternative splicing events may have distinct features, the general principles governing RNA splicing are widely applicable. Therefore, our results markedly improve our understanding of splicing regulation mediated by 5-nt or shorter sequences.
The length of random sequences used to generate a library in previous large-scale studies was 10 to 25 nt, except for one study that used hexamers (46); analysis of short sequences, such as pentamers or hexamers, enriched within the longer sequence was key to identifying functional motifs. We compared our data to those of prior studies. Interestingly, the strongest ISEs (CUUCUU, UAUUUU, UUGUUC, UCUUAU and UCUUAC) identified in a large-scale screen in HEK293 cells with 25-nt random sequences inserted downstream of alternative 5′ splice sites in an artificial citrine-based reporter by Rosenberg et al. are all U-rich motifs (22), which is highly consistent with our finding that U-rich sequences are the predominant class of stimulatory motifs. In fact, the top two hexamers CUUCUU and UAUUUU, which are equivalent to UUCUU and UAUUU, respectively, with the flanking C or U being included, are two of the strongest stimulatory pentamers in our library (Figure 2).
Wang et al. identified G-rich sequences as a group of the strongest ISEs in a GFP-based random-decamer screen in HEK293T cells (24), which is in broad agreement with our observation that G-rich sequences (four or more consecutive Gs) markedly promote exon 7 splicing, though G-rich motifs were not among the strongest ISEs in the present study.
One of our salient findings is that CG-core motifs promote exon 7 splicing. Although this class of ISEs was not previously reported, we found that 8/10 top 6-mer ESEs identified by Rosenberg et al. harbor at least 1 CG repeat, and half of these have two CG repeats (22), suggesting that CG-core motifs may be common splicing enhancers present in both exons and introns, though they may be inhibitory in some contexts, like MAPT. Of note, the CG dinucleotide is under-represented in vertebrate genomes (47).
Two studies specifically explored intronic SREs with a 10- or 15-nt random sequence library in a GFP-fused SMN1 minigene and a GFP-based reporter, respectively, in HEK293 cells, and a majority of their enriched 4-6-nt motifs comprise the dinucleotide AG (21,24), consistent with our data. Intronic SRE motifs identified by Culler et al. also include UG-rich and GACC (21), which highly resemble the UG-type and GAC-core motifs in our study.
On the other hand, we noted considerable discrepancies between our data and those in previous studies that relied on random-sequence pools. For example, in line with the established notion that the RBFOX-binding sequence UGCAUG is a strong ISE when placed downstream of the 5′ splice site of an alternative cassette exon (36), we found that UGCAUG, as seen with multiple (UGC)AUGNN mutants, robustly promotes exon 7 inclusion (Figure 2). However, UGCAUG was not identified in the previous high-throughput studies. Another example is the C-rich motif class, which we characterized as a main class of ISEs in the present study, but was likewise not found in the previous high-throughput studies. In addition, the two potent inhibitory pentamers GCACC and GCUCC we observed were not detected as winner ISS motifs in prior random-sequence screens, despite the enrichment of weaker elements, CCUCC or CUCC, in two studies (21,25).
These discrepancies could be due to the different gene contexts, different cell types or conditions used for analysis, effects of potential RNA secondary structures that affect splice-site recognition, or different flanking sequences of the library locations that may form functional overlaps at the two junctions (46). Indeed, we confirmed that the inhibitory effect conferred by UGGNN pentamers involves the formation of a double-stranded RNA structure with a downstream sequence (Supplementary Figure S4), and that the stimulatory effect of AUGNN pentamers is attributable to the UGCAUG motif at the upstream junction (Supplementary Figure S3). It is possible that other pentamers likewise form functional overlapping motifs or secondary structures with the flanking sequences to affect exon 7 splicing, which will require further study. Another key issue that was previously overlooked is the length of the random sequences. Based on previous studies, the binding sites of most RBPs are ∼5 nt. In light of our observations, motifs as short as 2 nt, such as CG and AG, can determine whether a pentamer is stimulatory or inhibitory, and many 3-6 nt motifs, such as UCG, AGG, UAG, CAG, GAC, UGG, GGU, YCGY, UAGG, CAGG, UGGU, CUCC, GCACC, U-rich, C-rich, UA-rich, poly-G and UGCAUG, are potent SREs. When the length of the library sequences is 10 nt or longer, it is inevitable that each sequence harbors multiple motifs and widespread overlaps form within the library sequences, giving rise to complex consequences. Some authentic strong enhancer motifs may be missed, due to silencer motifs being next to or overlapping them, and vice versa. The worst-case scenario is that an enhancer may be mistaken for a silencer, due to the presence of a nearby strong silencer, which, we believe, is relatively common. 
For example, UGCAUG itself is a known strong enhancer, and UGAC represents a strong silencer in our study; when both motifs overlap, as seen in mutant (UGC)AUGAC in our library, the silencer motif is dominant, and the net effect is moderately inhibitory (Supplementary Table S1). Finally, the completeness of our strategy should contribute to identification of multiple novel motifs in the present study. Previous studies using randomly-generated pools of mutants, coupled with high-throughput sequencing, may have missed sequences present in low abundance in the initial pool. However, we are also aware of weaknesses of the present study. Numerous RBPs bind to motifs that are longer than 5 nt. Moreover, our strategy may fail to detect a motif that requires another copy of the same or a different motif to synergistically or additively form a strong SRE. One good example is ISS-N1 in SMN2 intron 7, in which the tandem hnRNP A1 RRMs act in concert by binding simultaneously to each of the two juxtaposed AG motifs, resulting in potent suppression of exon 7 splicing (15,48).
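The (UGC)AUGAC case mentioned above can be traced at the sequence level with a small Python sketch that reports where each motif lies and whether the two occurrences share nucleotides. The helper names are hypothetical; the code only identifies overlapping motif spans and makes no prediction about which motif dominates.

```python
def find_spans(seq, motif):
    """Return 0-based, end-exclusive (start, end) spans of every occurrence of `motif`."""
    spans = []
    i = seq.find(motif)
    while i != -1:
        spans.append((i, i + len(motif)))
        i = seq.find(motif, i + 1)  # step by one so overlapping hits are found
    return spans

def spans_overlap(a, b):
    """True when two end-exclusive spans share at least one nucleotide."""
    return a[0] < b[1] and b[0] < a[1]

seq = "UGC" + "AUGAC"                   # upstream UGC followed by the pentamer AUGAC
enhancer = find_spans(seq, "UGCAUG")    # RBFOX-binding motif
silencer = find_spans(seq, "UGAC")      # silencer motif named in the text
print(enhancer)                                  # [(0, 6)]
print(silencer)                                  # [(4, 8)]
print(spans_overlap(enhancer[0], silencer[0]))   # True: the UG at the junction is shared
```

The same scan applied to a 10-nt or longer random insert would typically return several spans per sequence, which is the source of the ambiguity discussed above.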
Among the eight classes of stimulatory and inhibitory motifs enriched by MSEA, the U-rich, RBFOX-related and hnRNP A1-binding AG-core motifs were already well established, with known mechanisms (9,14,35,49,50). Ji et al. revealed PCBP1 and PCBP2 as global splicing activators; co-depletion of the two proteins inhibits inclusion of cassette exons flanked by intronic C-rich motifs that are immediately adjacent to the 5′ and/or 3′ splice site (51). Though we have not investigated which RBP(s) may mediate the effect of the C-rich motifs in our mutant minigenes, it is reasonable to assume that their cognate proteins are PCBP1 and PCBP2. Zheng et al. uncovered UGGU as the core motif in an ESS that inhibits a 3′ splice site in a bovine papillomavirus (BPV) type 1 late transcript, but no cognate RBP was identified (52). Although UG-type motifs are highly enriched in strong inhibitory pentamers in our library, their roles in splicing regulation may be over-estimated, owing to interference with the intramolecular RNA structure that impairs U1 annealing (Supplementary Figure S4). GAC-core motifs represent a novel class of short ISSs. Though they have not been previously characterized, SELEX performed by Cavaloc et al. revealed that SRSF7 (formerly 9G8) binds with high affinity to GAC repeats (53). However, whether the silencing activity of the GAC-core motifs we analyzed is mediated by SRSF7, which has been characterized as a splicing activator rather than a repressor, needs to be investigated.
Two notable findings in the present study are that the dinucleotide CG represents the core of the second most abundant class of stimulatory motifs, and that GCWCC-type motifs are potent ISSs. Whereas CG-core motifs or CG repeats will require further study to derive the consensus sequence, the winner sequences for GCWCC-type motifs are clearly GCACC and GCUCC. An early study by Zheng et al. delineated a so-called C-rich ESS sequence, GGCUCCCC, in BPV-1 pre-mRNA (52), which encompasses a GCUCC pentamer. It is intriguing that GCACC, despite being the second most inhibitory pentamer in our study, was not previously shown to regulate natural alternative splicing events. We believe that both CG-core and GCWCC motifs should play a widespread and important role in regulating alternative splicing, considering their potency in affecting splicing and expected relative frequency of occurrence of such short motifs in the human genome.
Based on our RNA-affinity data, YB1 specifically binds to the GCACC motif under splicing conditions. We indeed observed that YB1 potently repressed exon 7 splicing when tethered to intron 7 of the SMN1/2 minigenes. Our data are consistent with a previous SELEX study that revealed CACC as one of the consensus motifs with high affinity for YB1 (44). However, we cannot rule out that other proteins may also bind to the intronic motif individually or jointly to exert inhibitory effects. Future studies should further clarify how the GCACC motif and its cognate RBPs repress splicing, and identify their regulated alternative splicing targets, so as to gain insights into their significance in gene-expression regulation.
Although degeneracy is a typical feature for many RBPs, we observed that in many cases a single nucleotide change resulted in marked changes in exon 7 inclusion. The reason is that many core motifs are just 2-4 nt long, and a single nucleotide change can easily convert an enhancer into a silencer, for example, CG to AG, GGGG to AGGG and CCCC to CUCC, or vice versa. One surprising finding in the present study is the complexity of motifs comprising U and C. In the library, seven pentamers UUCUU, UCUUU, CUUUU, CCCUU, CUCUU, CCCCU and CCCUC are among the top 50 stimulatory ones, whereas four pentamers UCCUC, CUUCC, UUUCU and CCUCC are among the top 50 inhibitory ones. The sequence differences between them are quite subtle. A detailed study will be required to pinpoint the distinguishing sequence features and underlying mechanisms.
The pentamer library also provides a resource to better understand known motifs and their respective RBPs. For example, PTB is a strong splicing repressor whose known binding motifs are UCUU, UCUUC and CUCUCU (54,55). We indeed found that overexpression of PTB in HEK293 cells strongly inhibits exon 7 splicing in the SMN1 and SMN2 minigenes (data not shown). Paradoxically, however, the pentamers NUCUU and UCUUN, as well as UCUCU and CUCUC (which form CUCUCU with flanking nucleotides) all potently stimulate exon 7 splicing. Both UCUU and UCUC are enriched ISE motifs by MSEA. Another example is the RBFOX protein-binding motif UGCAUG, a strong ISE, selected by the Position 1-dependent MSEA. Based on our data (Supplementary Table S1), exon 7 splicing is much improved when UGCAUG is followed by C or U compared to G or A. In contrast, when UGCAUG is followed by AC, its stimulatory effect disappears. These examples highlight the complexity of interactions between SREs and their cognate RBPs in regulating pre-mRNA splicing.

DATA AVAILABILITY
All data generated or analyzed during this study are included in this published article and its supplementary information.