Quantitative prediction of variant effects on alternative splicing in MAPT using endogenous pre-messenger RNA structure probing

Splicing is highly regulated and is modulated by numerous factors. Quantitative predictions for how a mutation will affect precursor mRNA (pre-mRNA) structure and downstream function are particularly challenging. Here, we use a novel chemical probing strategy to visualize endogenous precursor and mature MAPT mRNA structures in cells. We used these data to estimate Boltzmann suboptimal structural ensembles, which were then analyzed to predict consequences of mutations on pre-mRNA structure. Further analysis of recent cryo-EM structures of the spliceosome at different stages of the splicing cycle revealed that the footprint of the Bact complex with pre-mRNA best predicted alternative splicing outcomes for exon 10 inclusion of the alternatively spliced MAPT gene, achieving 74% accuracy. We further developed a β-regression weighting framework that incorporates splice site strength, RNA structure, and exonic/intronic splicing regulatory elements capable of predicting, with 90% accuracy, the effects of 47 known and 6 newly discovered mutations on inclusion of exon 10 of MAPT. This combined experimental and computational framework represents a path forward for accurate prediction of splicing-related disease-causing variants.

predictions for how a mutation will affect precursor messenger RNA (mRNA) structure 23 and downstream function is particularly challenging. Here we use a novel chemical 24 probing strategy to visualize endogenous precursor and mature MAPT mRNA structures 25 in cells. We used these data to estimate Boltzmann suboptimal structural ensembles, 26 which were then analyzed to predict consequences of mutations on precursor mRNA 27 structure. Further analysis of recent cryo-EM structures of the spliceosome at different 28 stages of the splicing cycle revealed that the footprint of the B act complex with precursor 29 mRNA best predicted alternative splicing outcomes for exon 10 inclusion of the 30 alternatively spliced MAPT gene, achieving 74% accuracy. We further developed a β-

Introduction 39
Precursor messenger RNA (pre-mRNA) splicing is a highly regulated process in 40 eukaryotic cells (Z. Wang and Burge 2008). Numerous factors control splicing including 41 trans-acting RNA-binding proteins (RBPs), components of the spliceosome, and the 42 pre-mRNA itself. Pre-mRNA structure is a key attribute that directs splicing, particularly 43 alternative splicing, but we have a limited understanding of pre-mRNA structure-44 mediated splicing mechanisms (Taylor and Sobczak 2020). It has proven challenging to 45 develop quantitative models capable of predicting splicing outcome, specifically the 46 percent spliced in (PSI) for alternatively spliced exons. It is especially difficult to predict 47 outcome alterations due to genetic variation at exon-intron junctions because mutations 48 affect both the binding by RBPs and also pre-mRNA structure ( In this study, we exploited several technical developments that address these issues to 62 develop an integrated, RNA structure-based framework that accurately predicts splicing 63 outcomes. We measured endogenous pre-mRNA structure in cells taking advantage of 64 recent developments in mutational profiling (MaP) approaches for read-out of chemical 65 probing data (Homan et al. 2014) with targeted amplification of specific exon-intron 66 junctions. This novel approach enabled us to obtain single-nucleotide RNA structure 67 probing data for endogenous pre-and mature mRNAs in the same cell. Our RNA 68 structure modeling considers the equilibrium between multiple alternative structures 69 Tau. Exons 9, 11, and 12 are constitutively spliced, but exon 10 is alternatively spliced 83 resulting in MAPT isoforms with either four microtubule binding repeats (4R) or three 84 repeats (3R) when exon 10 is included or skipped, respectively. The normal ratio of 3R 85 to 4R isoforms is approximately 1:1 (Hefti et al. 2018). Twenty-nine clinically validated 86 disease-causing mutations have been identified in the region of the exon 10 -intron 10 87 junction (Stenson et al. 2003). These mutations result in impaired Tau function and are 88 implicated in neurodegenerative disease (Spillantini et al. 1998;Hutton et al. 1998;89 Clark et al. 1998; Rizzu et al. 1999; Goedert et al. 1999). Although some mutations alter 90 the Tau protein sequence (Mirra et al. 1999;Iseki et al. 2001), 20 of the disease-91 associated mutations deregulate MAPT pre-mRNA splicing altering the ratio of 3R to 4R 92 well-defined secondary structure (Petrov et al. 2014). As expected, the DMS reactivities 154 of unpaired nucleotides were significantly higher than for paired nucleotides both for 155 RNA probed in cells and for RNA isolated from cells prior to probing (Figure 1-figure  156 supplement 7A). This experiment confirmed that our DMS probing recapitulates native 157 RNA secondary structure regardless of the presence of proteins, consistent with 158 previous studies (Woods et al. 2017;Lackey et al. 2018). We used the SSU in-cell 159 reactivity data to calibrate the estimation of equilibrium ensembles (Materials and 160 methods), and we confirmed that structure modeling guided by experimental DMS 161 reactivities yielded a more accurate estimation of the SSU structure than the model not 162 informed by chemical probing data (Figure 1-figure supplement 7B).  The reactivities for exon 10 in the pre-mRNA and mature 4R isoform were highly 226 correlated ( Figure 2-figure supplement 1B). This high correlation was unexpected given 227 that the pre-mRNA undergoes splicing during the 5-minute treatment of the cells with 228 DMS. As we observed for the mature 4R isoform, exon 10 in the pre-mRNA mostly 229 formed base pairs with other exon 10 nucleotides (Figure 2-figure supplement 2). 230 However, when we compared DMS reactivities for pre-mRNA and the mature 4R 231 isoform, we found that DMS reactivity in exon 10 was significantly lower for the pre-232 mRNA (median in-cell DMS reactivity: 0.08) than for the 4R isoform (median in-cell 233 We used Boltzmann sampling of RNA structures supported by DMS reactivity data 245 (Spasic et al. 2018) (Materials and methods) to sample 1000 structures for the wild-246 type. We also generated ensembles for two RNAs that bear mutations in intron 10 that 247 are known to alter MAPT splicing: (i) an A to C mutation at position +15 (+15A>C) that 248 favors 3R isoform, and (ii) a C to G mutation at position +19 (+19C>G) that favors the 249 4R isoform (Tan et al. 2019). These mutant ensembles were generated using the same 250 DMS reactivities as the wild-type RNA, with the exception of the mutation site (see 251 Materials and methods), and thus provide a well-controlled prediction of the impact that 252 each mutation will have on the ensemble.  Contour plots were derived from the distribution of points on the t-SNE plot in B. 300 The darker the blue, the higher the density of structures. Contour lines 301 additionally represent density of points. Color scales for the three plots are 302 identical. Inserts are gel images representative of splicing assays using a 303 reporter plasmid expressing either the wild-type sequence (WT), the +19C>G 304 (3R) mutation or +15A>C (4R) mutation in HEK293 cells, where the RNA was 305 extracted and reverse transcribed to measure the isoform ratio using specific 306 PCR amplification (Materials and methods). In WT, both 3R (exon 9 -exon 11) 307 and 4R (exon 9 -exon 10 -exon 11) isoforms are expressed (two bands). In the 308 presence of the 3R mutation, only the 3R isoform is expressed (one band) 309 whereas for the 4R mutation only the 4R isoform is expressed (one band). Gel 310 insets for the 3R and 4R mutation are in their respective density plots.  best prediction accuracy (R 2 = 0.89; Figure 3B). Crucially, we found that using features 352 of the distribution of unfolding ΔG ‡ in the structural ensemble produced greater 353 predictive power than simply using the unfolding ΔG ‡ of a single minimum free energy 354 structure, supporting the importance of RNA ensemble behavior to splicing outcome 355  Compensatory mutations are double mutations that were designed to rescue changes in 366 exon 10 splicing caused by a single mutation (Grover et al. 1999). Although the model 367 supplement 1B). When exon 10 PSI was modeled with the changes to the motif strength 429 of all splicing regulatory elements, prediction accuracy increased (R 2 =0.51; Figure 4C) 430 compared with that obtained when only splice site strength was considered (R 2 =0.29); 431 for non-synonymous mutations accuracy was even higher (R 2 =0.75).  We next interrogated the model by performing a systematic in silico mutagenesis 533 analysis of the 100 nucleotides spanning the exon 10 -intron 10 junction ( Figure 6A). 6D). We observed that only a few mutations were predicted to have a PSI of zero (3R) 555 ( Figure 6D red bar). We therefore used splicing assays to experimentally determine the 556 splicing preference of six instructive variants (Materials and methods): 3 VUSs predicted 557 to be 3R, 1 VUS predicted to be 4R, and 2 VUSs predicted to maintain the WT splicing 558 ratio ( Figure 6D). We found that all six predictions were correct ( Figure 6E If the 5' splice site was always paired, only the 3R isoform would be produced. 636 However, the presence of 3R and 4R isoforms, usually in a 1:1 ratio implies that the 637 junction is accessible in a subset of the structures. We found disease-causing mutations 638 produced distinct shifts in the ensemble of the MAPT exon 10 -intron 10 junction 639 region; these shifts showed strong correlation with changes in the 3R to 4R isoform ratio 640 and confirmed that ensembles are essential at this junction. Our ability to accurately 641 predict the effects of mutations on ensembles significantly improved our quantitative 642 model (Figure 3 -figure supplement 1C). Exonic non-synonymous mutations promote splicing changes primarily by altering SRE 669 motifs, whereas exonic synonymous and intronic mutations altered RNA structure. A 670 combined model that accounted for both structure and SREs was the most accurate 671 predictor of exon 10 PSI ( Figure 5D). It was previously proposed that exon 10 is 672 alternatively spliced due to a weak 5' splice site (Ian D'Souza and Schellenberg 2005), 673 and, indeed, we found that mutations that strengthened the splice site increase 674 inclusion of exon 10 ( Figure 4A). SRE strength alterations overall skewed more toward 675 increased exon 10 inclusion, which suggest that SREs and the RBPs that bind them 676 buffer the effects of RNA structure to maintain the 1:1 isoform ratio. MAPT exon 10, the intron 9 -exon 10 junction), that are expected to impact exon 10 695 splicing regulation. Although our model provides an exact PSI prediction for each 696 mutation, we emphasize that its principal utility is in predicting the direction in which the 697 3R to 4R isoform ratio shifted from the wild-type ratio.

Base-pairing probabilities for SSU 878
The partition function for the SSU was generated using Rsample, using either the 879 sequence or using the sequence and the DMS reactivities. All possible i-j base pairing 880 probabilities were summed for each nucleotide i to generate a base pairing probability 881 per nucleotide i. 882

ROC curves for predicting SSU base pairs 884
Using the known secondary structure of the SSU, we assigned a nucleotide as either 0 885 or 1 if it was paired or unpaired. Only As and Cs were considered. DMS reactivities 886 were used to predict whether a nucleotide was paired; the higher the DMS reactivity, the 887 more likely a nucleotide is unpaired. ROC curves and AUC values were generated

Identification of representative structures for clusters 939
The 3000×2 matrix obtained after t-SNE dimensionality reduction, was clustered using windows that differed between the WT and mutants were around the location of the 1004 mutation, and only windows with strength above the threshold were considered.      pairs out of total number of base pairs. Initially, exon 10 was folded; thus, the 1777 percentage of intra-exon base pairs was 100%. Exon 11 nucleotides were then added 1778 one at a time and the RNA sequence was refolded after each addition. When all exon 1779 11 nucleotides were included in the folding, 88.9% of base pairs were intra-exon. base pairs out of total number of base pairs. Initially, exon 10 was folded; thus, the 1800 percentage of intra-exon base pairs was 100%. Intron 10 nucleotides were then added 1801 one at a time and the RNA sequence was refolded after each addition. When all intron 1802 10 nucleotides were included in the folding, 65.7% of base pairs were intra-exon/intron. 1803