Nano-DMS-MaP allows isoform-specific RNA structure determination

Genome-wide measurements of RNA structure can be obtained using reagents that react with unpaired bases, leading to adducts that can be identified by mutational profiling on next-generation sequencing machines. One drawback of these experiments is that short sequencing reads can rarely be mapped to specific transcript isoforms. Consequently, information is acquired as a population average in regions that are shared between transcripts, thus blurring the underlying structural landscape. Here, we present nanopore dimethylsulfate mutational profiling (Nano-DMS-MaP)—a method that exploits long-read sequencing to provide isoform-resolved structural information of highly similar RNA molecules. We demonstrate the value of Nano-DMS-MaP by resolving the complex structural landscape of human immunodeficiency virus-1 transcripts in infected cells. We show that unspliced and spliced transcripts have distinct structures at the packaging site within the common 5′ untranslated region, likely explaining why spliced viral RNAs are excluded from viral particles. Thus, Nano-DMS-MaP is a straightforward method to resolve biologically important transcript-specific RNA structures that were previously hidden in short-read ensemble analyses.

Genome-wide measurements of RNA structure can be obtained using reagents that react with unpaired bases, leading to adducts that can be identified by mutational profiling on next-generation sequencing machines. One drawback of these experiments is that short sequencing reads can rarely be mapped to specific transcript isoforms. Consequently, information is acquired as a population average in regions that are shared between transcripts, thus blurring the underlying structural landscape. Here, we present nanopore dimethylsulfate mutational profiling (Nano-DMS-MaP)-a method that exploits long-read sequencing to provide isoform-resolved structural information of highly similar RNA molecules. We demonstrate the value of Nano-DMS-MaP by resolving the complex structural landscape of human immunodeficiency virus-1 transcripts in infected cells. We show that unspliced and spliced transcripts have distinct structures at the packaging site within the common 5′ untranslated region, likely explaining why spliced viral RNAs are excluded from viral particles. Thus, Nano-DMS-MaP is a straightforward method to resolve biologically important transcript-specific RNA structures that were previously hidden in short-read ensemble analyses.
RNA structure is a main determinant of RNA function 1,2 , and is controlled largely through the folding of RNA into regions of single-stranded and double-stranded RNA 3 . Among the methods for interrogating RNA folding, chemical probing stands out for its ease of use and ability to determine RNA structure in situ 4,5 . During chemical probing, RNA is treated with reagents that react preferentially with single-stranded regions of RNA. One such reagent, dimethylsulfate (DMS), methylates the N3 position of cytosines and the N1 position of adenines at the Watson-Crick face of unpaired residues, giving rise to information that can be used to perform high-accuracy RNA structure predictions [6][7][8][9][10] . This small cell permeable chemical is used widely for the in situ or in vitro structural analysis of RNA or RNA-protein complexes [11][12][13][14] . In classical experiments, the modified nucleotides, 1-methyladenosine (m1A) and 3-methylcytosine (m3C), stall reverse transcription causing reverse transcriptase (RT) drop off to form truncated complementary DNAs (cDNAs) that can be analyzed by gel or capillary electrophoresis 5 . In DMS sequencing (DMS-seq), truncated cDNAs are subjected to next-generation sequencing to perform genome-wide measurements of RNA structure 15,16 . Alternatively, DMS mutational profiling sequencing (DMS-MaP) uses modified buffer conditions to perform error-prone reverse transcription of DMS-modified nucleotides 12,17,18 . DMS-MaP therefore allows for straightforward measurements of RNA structure by counting mutations.
DMS-MaP can perform high-throughput measurements of RNA structure 12,17 but also has its drawbacks. Most importantly, typical RT conditions produce short cDNA molecules ideal for sequencing on Illumina sequencing machines 19,20 . The resulting reads, however, rarely span whole transcript isoforms like those generated by alternative transcription start and termination sites, or by alternative splicing. Therefore, DMS-MaP is not well suited to identify structural differences Article https://doi.org/10.1038/s41592-023-01862-7 is therefore limited by the inherent error rate of the sequencing platform and by RT and polymerase chain reaction (PCR) errors. DMS-MaP experiments are commonly designed to induce mutation frequencies of 1-2% at A and C residues. Such mutational signatures can be detected by short-read Illumina sequencing where most nucleotides have a Phred quality score (Q-score) above 30 (Q30), which is equivalent to 99.9% accuracy. While nanopore sequencing devices can perform long-read sequencing, they have a magnitude higher error rate than Illumina sequencing machines, with a substantial proportion of reads exhibiting low accuracies 25 . Recent improvements to Nanopore sequencing chemistry and basecalling algorithms, however, have raised modal accuracy to Q20 (99% accuracy) 26 .
We first assessed whether we could obtain long cDNA molecules from RNA molecules treated with DMS. To do this, we reverse transcribed and amplified a 532 nucleotide (nt) portion of the unspliced (US) HIV-1 RNA from infected cells. This region comprises the highly structured 5′ UTR and the beginning of the viral gag gene, and folds into a series of stem-loop structures that regulate the HIV-1 life cycle 27 . For reverse transcription, we used MarathonRT because it was reported to generate long cDNAs in the presence of RNA modifications 28,29 . Still, we found an inverse relationship between DMS concentration and the amount and length of cDNAs recovered ( Supplementary  Fig. 1a). We tried to improve cDNA recovery by adjusting parameters such as reverse transcription time, temperature and Mn 2+ concentrations ( Supplementary Fig. 1b,c), but the only parameter that had a substantial effect was DMS concentration. This indicates a trade-off between DMS concentration and maximal recoverable transcript length. Consequently, the amplification of long transcript isoforms is only possible with DMS concentrations below those typically used in DMS-MaP experiments.
We next tested whether nanopore sequencing has a sufficient accuracy to enable high quality structure determination, especially at lower DMS concentrations that are expected to have reduced signal-to-noise ratio. We performed a nanopore sequencing run using Kit 12 chemistry in transcript isoforms (Fig. 1a) 21,22 . In humans, transcript isoforms are very common. Over 50% of genes show variability in transcription start site, 70% of genes exhibit alternative polyadenylation and around 95% of multi-exonic genes are alternatively spliced 23 . Consequently, much of the structural information of cellular RNAs obtained by current MaP techniques reflects a population average of distinct underlying structures and isoforms, likely concealing important gene regulatory mechanisms.
Here, we have overcome problems related to the ambiguous mapping of short sequencing reads to transcript isoforms in DMS-MaP experiments by developing nanopore DMS-MaP (Nano-DMS-MaP) (Fig. 1b). We used an ultraprocessive RT enzyme to generate long cDNA molecules with mutational signatures at sites of DMS modification. We also developed an analytical workflow that enables the structural determination of individual transcript isoforms from error-prone nanopore sequencing data. We apply Nano-DMS-MaP to resolve the complex structural landscape of human immunodeficiency virus (HIV)-1 transcripts in infected cells. We show that the genomic and spliced transcripts have distinct structures at their common 5′ untranslated region (UTR) structures, which includes the packaging motif. These structural differences likely explain the exclusion of spliced transcripts from the virion. Thus, we suggest that, in addition to increasing protein diversity, alternative splicing results in the generation of RNA transcripts with distinct functions mediated by altered RNA structures. Our data provide a powerful demonstration that critical regulatory mechanisms can be hidden in short-read ensemble analyses, and that these can be uncovered by long-read RNA structural analysis.

Optimization of nanopore long-read sequencing
Sequencing accuracy is critical for DMS-MaP experiments because this method relies on the reverse transcription of DMS adducts into mutations that must be distinguished from site-specific errors by normalization with an unmodified control sample 17,24 . Signal-to-noise ratio  and evaluated platform-and DMS-induced mutation rates on the US HIV-1 RNA (Methods). The sequencing run generated data with a mean Q-score of 16 (97% accuracy), indicating that global error rates were substantially above the signal in most DMS-MaP experiments. After implementing a median per read Q-score filter of ten to remove the lowest quality sequences, mean mutation rates were 2.5% without DMS, which increased progressively from 2.7% at 8 mM DMS to 3.6% at 85 mM DMS (Supplementary Table 1 and Extended Data Fig. 1a). Accordingly, signal-to-noise ratios ranged from 1.13 at the lowest DMS concentration to 1.75 at the highest concentration (Supplementary Table 2).
Next, we evaluated the quality of structural information by calculating normalized DMS reactivities for each DMS concentration (Methods). We then quantitatively compared the results against a reference structure previously obtained by chemical probing of HIV RNA extracted from virions (Extended Data Fig. 1b) 14 . For this comparison, we used the receiver operator characteristic area under the curve (ROC-AUC) score-a summary statistic to evaluate the correlation of DMS reactivity with strandedness (for example, whether the nucleotide was in single-or double-stranded RNA). A score of 0.5 signifies a random association of the two variables, whereas 1 indicates a perfect match. At the 8 mM DMS concentration, ROC-AUC scores reached 0.6, indicating the presence of low-quality structural information. Increasing the concentration of DMS improved the ROC-AUC scores, reaching 0.9 at 85 mM DMS concentration, indicating excellent agreement with the reference structure. By subsampling reads, we observed that ROC-AUC scores saturated at approximately 4,000 reads, indicating that the low signal could not be overcome by increasing read depths (Extended Data Fig. 1c).
To further improve the recovery of structural information, we systematically optimized filtering parameters. We evaluated an option to ignore insertions and deletions (indels) when counting mutations, as well as median per read Q-score filters and per position Q-score filters (Supplementary Table 2 and Extended Data Fig. 2). Ignoring indels decreased error rates by fivefold in the untreated sample, from 2.5% to 0.5%. The mutation rate in the DMS-treated samples also decreased (3.6% to 1.7% at 85 mM), but the much lower mutation rate in the control led to a substantial increase in signal-to-noise ratio (from 1.12 to 1.61 at 8 mM DMS and from 1.75 to 5.7 at 85 mM DMS) (Supplementary Table 2). Accordingly, ignoring indels improved ROC-AUC scores under almost all conditions, and especially at lower DMS concentrations, which are required to reach the longest read lengths (Extended Data Fig. 3). This observation is explained by a mutation type analysis, which revealed that a high proportion of nanopore sequencing errors are indels (Extended Data Fig. 4a), while DMS-induced mutations were nearly exclusively single nucleotide mismatches (Extended Data Fig. 4b,c). Whereas read median filters greater than Q-score 10 decreased coverage without improving signal-to-noise ratio, the inclusion of a per position filter to remove nucleotide positions with a Q-score of less than 22 led to another notable increase in signal-to-noise ratio (from 1.6 to 3.25 at 8 mM and from 5.7 to 19 at 85 mM) (Supplementary Table 2). Altogether, we identified optimal parameters for Nano-DMS-MaP, namely the discarding of indels, a median per read Q-score filter of 10 (to remove low-quality reads), and a per position Q-score filter of 22. These straightforward data treatment steps gave a three-to tenfold boost in signal-to-noise ratio over raw nanopore data, which translates to higher quality structural information at lower DMS concentrations and coverages (Extended Data Fig. 3 and Supplementary Table 2).

Nano-DMS-MaP recovers known structures
We next performed structural analysis of the US HIV-1 5′ UTR. Using optimal Nano-DMS-MaP parameters, the global mutation rate was 0.09% in the untreated control, 0.2% at 8 mM DMS and 1.05% at 85 mM DMS (Extended Data Fig. 4d). As expected from the chemical selectivity of DMS, mutations were located principally at A and C residues (Fig. 2a). Calculated DMS reactivities were consistent across all DMS concentrations for the US HIV-1 RNA (Extended Data Fig. 5a) and, when plotted onto the reference structure, there was a clear correspondence with strandedness (Fig. 2b,c). A reactivity threshold of approximately 0.5 gave the best separation between true and false classifications for the RNA in question (Fig. 2d). ROC-AUC scores of 0.92 indicated near-perfect agreement between our data and the reference structure (Fig. 2e). By subsampling reads, we found that ROC-AUC scores converged towards 0.9 for all DMS concentrations at read depths of 30,000, but similar scores could also be achieved at 4,000 reads for higher DMS concentrations (Fig. 2f). We also identified a highly consistent relationship between the Pearson's correlation of the DMS reactivities of two replicates and their agreement with secondary structure by ROC-AUC, which provides a generally applicable quality control measure for the accuracy of Nano-DMS-MaP data ( Fig. 2g and Extended Data Fig. 5b). When comparing the optimized Nano-DMS-MaP analysis against Illumina sequencing of the same cDNA, we observed equivalency in mutation rates and near-perfect agreement of the measured DMS reactivities at equal coverages ( Fig. 2h and Extended Data Fig. 5c-f). DMS-guided folding recovered the reference structure at all DMS concentrations, demonstrating that Nano-DMS-MaP can be used for RNA structure determination (Supplementary Data Files 2 and Extended Data Fig. 5f). Mutation type analysis surprisingly revealed slightly higher single nucleotide substitution rates in the Illumina dataset compared with our nanopore data, reinforcing the notion that nanopore errors are mainly indels (Extended Data Fig. 4b,c). This analysis also confirms that MarathonRT nearly exclusively generates single nucleotide substitutions at positions of DMS modification, which enables our simple data filtering steps to boost signal-to-noise ratio without introducing bias (Extended Data Fig. 4).
Next, we tested the general applicability of our workflow on a panel of compact, functionally diverse and highly structured RNAs in vitro (Extended Data Fig. 6). We selected these RNAs because they have complex, yet well-characterized, secondary structures. Furthermore, the three-dimensional structure of several of these RNAs was recently solved using an integrated approach combining information from chemical probing and cryo-electron microscopy experiments 30 . In all cases, Nano-DMS-MaP recovered structural information with ROC-AUC scores of between 0.81 and 0.96 (Fig. 2i). We also performed Nano-DMS-MaP on a well-characterized RNA in situ, selecting the 18S human ribosomal RNA because of its relatively long length (1.9 kilobases (kb)). Again, we obtained useful structural information with a ROC-AUC score of 0.76 at A and C residues, which is a value consistent with other chemical probing studies of ribosomal RNAs ( Fig. 2i and Extended Data Fig. 7) 31 .

The structure of the HIV-1 genome in cells and virions
We next assessed the capabilities of Nano-DMS-MaP for long-read structural analysis by in situ probing of the 8.5 kb HIV-1 genome in both infected cells and virions. Although nanopore sequencing itself does not have a theoretical limit on read length, Nano-DMS-MaP includes RT and PCR enzymatic reactions as potential length-limiting steps. To avoid complications during PCR due to the repeat regions used to form the HIV-1 long terminal repeats (LTRs), we performed amplification of the genome in two PCR reactions, each spanning 4 kb (Fig. 3a). Notably, only a single cDNA reaction spanning the whole 8.5 kb of the US RNA was required for both PCR reactions to be successful. Both amplicons could be generated at DMS concentrations of up to 20 mM from less that 0.5 ml of viral supernatant, which demonstrates the sensitivity of Nano-DMS-MaP for long-read RNA structural analysis. DMS reactivities were highly correlated between independent experimental replicates (Pearson's r 0.86-0.98) and DMS concentrations (Pearson's r 0.92 in cell, 0.98 in virion) (Extended Data Fig. 8a,b). The slightly lower correlation the between replicates obtained in the cell despite their similar coverages may indicate structural flexibility and/ or the presence of alternative structures in the cell that are not present in virions (Extended Data Fig. 8).
Article https://doi.org/10.1038/s41592-023-01862-7 A comparison of the DMS reactivities across the entire HIV-1 genome showed a slight but consistent trend for increased DMS accessibility in the virions (Fig. 3a). The molecular basis for this is unknown, but relaxation of RNA structure in the virion may facilitate reverse transcription during the next cycle of replication. A notable exception to this trend was the 5′ UTR, which showed decreased reactivity to DMS in the virion compared with the cells (Fig. 3a). This can be explained largely by DMS reactivity changes due to annealing of the tRNA primer,   Fig. 2). Measured DMS reactivities correlated with known highly structured regions of the HIV-1 genome, such as the 5′ UTR (Fig. 3b), the two-helix model of the frameshift site (Fig. 3c), and the Rev response element (RRE) (Fig. 3b,d).
Altogether, these data demonstrate the suitability of Nano-DMS-MaP for long-read structural probing.

Long-read sequencing detects diverse transcript isoforms
The HIV-1 genome is transcribed by the host cell into three major transcript classes: US, partially spliced (PS) and fully spliced (FS) (Fig. 4a).
During the late stages of infection, HIV-1 specifically packages the US genome into viral particles. However, the PS and FS viral RNAs are efficiently excluded from the packaging process by a poorly understood mechanism. This mechanism of exclusion applies to the over 50 different spliced transcripts produced in HIV-1 infected cells through the use of a variety of weak donor and acceptor sites [32][33][34][35] . All viral RNAs share the first 289 nt of the 5′ UTR, including a major packaging signal, known as stem-loop 1 (SL1). SL1 contains a palindromic dimerization initiation sequence (DIS) within its apical loop, and is the primary binding site for the viral structural protein Pr55 Gag (Fig. 4b) 36,37 . SL1 is included in all spliced viral RNAs because it lies upstream of the major splice donor site within stem-loop 2 (SL2) (Fig. 4b). Nevertheless, it has been reported that SL1 directs US, but not PS or FS, transcripts into nascent virions 27,36,38 . We therefore hypothesized that structural differences within the packaging signal shared between US and spliced transcripts may explain the selective recognition of the US transcripts by the viral packaging machinery. To address this hypothesis, we designed   an experiment to sequence and detect the individual US, FS and PS RNAs. We then performed isoform-resolved RNA structure analysis. We first established RT-PCR conditions specific for amplifying complex mixtures of US, PS and FS RNAs (Fig. 4c-e and Supplementary  Fig. 3a-c). For the reverse transcription of the US RNA, we used an RT primer that hybridized within the gag open reading frame (ORF). For the PS RNAs we used an RT primer complementary to a region in the D4-A7 intron (Fig. 4a), and for the FS RNAs we used an RT primer spanning the D4-A7 splice site (Fig. 4a). To PCR amplify the resulting DNA, we used PrimeSTAR GXL because we found it was able to amplify   Fig. 3d). As expected, we detected a single 0.5 kb product for the US-cDNA (Fig. 4c). From the PS cDNA, we detected a variety of spliced transcripts (300-1.4 kb), as well as a transcript of around 5.5 kb corresponding to the US RNA (Fig. 4d). For the FS cDNA, we detected a different subset of spliced transcripts (300-900 bp) (Fig. 4e). PCR amplicons from the US, PS and FS samples were then barcoded and sequenced on the Oxford Nanopore Technologies MinIon device using kit v.12 (Q20+) chemistry. From four runs on four flow cells, we obtained 5 million demultiplexed reads (2.6 Gb) with a mean Q-score of 15.9 (97.4% accuracy). When sequenced reads were plotted as a virtual gel, the relative proportion and lengths of these reads correlated with species previously detected on agarose gels (Fig. 4c-e). The sole exception was the 5.5 kb transcript, which was readily visible on the agarose gel, but was present at much lower intensity on the virtual gel based on the nanopore sequencing reads ( Fig. 4d and Supplementary Fig. 3e). These data indicate that nanopore cDNA sequencing can capture diverse transcripts, although there may be a bias against longer transcripts in complex mixtures arising either during library preparation and/or sequencing. We next mapped individual reads from the untreated sample to the HIV-1 transcriptome using IsoQuant 39 (Methods). Across all samples, over 80% of reads were unambiguously assigned to a single known isoform ( Supplementary Fig. 4a), showing efficient read-to-isoform assignment even in this complex splicing landscape. Approximately 13% of reads could be assigned equally well to several isoforms and were ignored in subsequent analysis ( Supplementary Fig. 4a). In addition, 4% of reads were discarded because they could not be assigned to any known spliced isoform (Supplementary Fig. 4a). Sequencing reads from the US sample mapped almost exclusively to the genomic RNA (98%) (Fig. 4f,g). In contrast, reads from the PS and FS reactions could be assigned uniquely to many different spliced transcripts (Fig. 4f,g). In the PS sample, we identified 16 transcript isoforms with at least 1,000-fold coverage in both replicates (6 with 4,000-fold coverage at 57 mM DMS concentration), including transcripts expressing Env/Vpu, Tat, Vif and Vpr ( Supplementary Fig. 4b). In the FS sample, we detected 15 transcripts with at least 1,000-fold coverage in both replicates (10 with 4,000-fold at 57 mM DMS concentration) which mapped to Nef, Rev, Tat and Vpr-expressing isoforms (Supplementary Fig. 4b). The most common splice acceptor site in PS transcripts was A1 (one-third of transcripts), found for example in Env13, Env5, Tat6 and Vif2 isoforms, followed by A2, found in Env9 and Vpr3 and A5, solely from Env1 expression. The general occurrence of acceptor sites was similar in FS transcripts (that is, D4A7-spliced). Here A1, found in Nef5, Nef3 and Tat2 isoforms, was the most common acceptor, followed by A2, found in Nef4, Rev7, Rev8 and Tat3, and A5, expressing the Nef2 isoform ( Supplementary Fig. 4c). The relative quantities of recovered transcripts obtained by Nano-DMS-MaP were also highly reproducible (r 2 = 0.988) across two independent experiments (Fig. 4h). All transcripts we detected were seen in previous studies of HeLa cells expressing HIV-1 (refs. 33,34,40). In the presence of DMS, we observed a progressive decrease in the proportion of reads mapping to longer transcripts with increasing DMS concentration. In particular, the 1.4 kb Vif2 and 897 bp Vpr3 transcripts in the PS sample were decreased compared with the shorter Env-expressing transcripts (Fig. 4g). Nevertheless, the effects were modest at most of the DMS concentrations. Altogether, these data confirm that nanopore cDNA sequencing accurately captures a comprehensive and biologically relevant view of the HIV-1 splicing landscape.

HIV-1 transcript isoforms have distinct structures
We next investigated the 5′ UTR structures of 16 different PS and FS HIV-1 transcripts in cells where read depths were more than 4,000 reads in both replicates, thus ensuring an informative and reproducible structural signal (mean Pearson's r = 0.95 at 57 mM) (Extended Data Fig. 9a). We also analyzed the native structure of the US RNA in virions.
A correlation analysis and hierarchical clustering of the DMS reactivity of the shared 5′ UTR of all isoforms revealed that the US RNA from cells and virions clustered together (Fig. 5a). As shown before (Fig. 3a), the tRNA primer binding site (PBS) and the dimerization initiation site (DIS) became fully protected from DMS in the virion, due to known intermolecular RNA interactions at these sites (Extended Data Fig. 9b). On the other hand, spliced viral RNAs had distinct, yet similar structural profiles, as they grouped together into their own cluster (Fig. 5a). Subclustering of the spliced RNAs was associated mainly with the first splice acceptor site usage, suggesting an effect of the sequence of the first adjacent exon on the 5′ UTR structure (Fig. 5a). To characterize this effect in more detail, we averaged the DMS reactivities according to first acceptor site (Fig. 5b). By subtracting spliced DMS reactivities from those of the US RNA we identified several regions within spliced transcripts showing strong and consistent changes in DMS reactivities indicative of structural rearrangements compared to the US RNA ( Fig. 5b and Extended Data Fig. 9b,c). Increases in DMS reactivities at positions C80, C84 and C85 in the poly(A) loop, and positions C109, C110 and C111 in the U5 region, are likely explained by the loss of sequences that are present only in the US RNA ( Fig. 4b and Fig. 5b). Increases in DMS reactivity at position C58 of the poly(A) stem cannot be explained by loss of downstream sequences, but may instead relate to transcription start site variation shown to regulate the translation and packaging of the US RNA via 5′ UTR remodeling 41,42 (Fig. 4b and Fig. 5b). The PBS was structured similarly in US and spliced RNAs, although there were distinct changes in the PAS stem, such as an increase in reactivity at position A220 and decreases in reactivity at A225 and A227 (Fig. 5b). Most strikingly, we observed clear increases in reactivity throughout the 3′ portion of the SL1 stem, indicating its structural reorganization (Fig. 5b).
With our isoform-specific probing data we next performed de novo folding of the individual isoforms. This analysis confirmed structural rearrangements associated with the first splice acceptor usage, but more generally an unfolding of several regions within the spliced RNA (Supplementary Data Files 2 and Extended Data Fig. 9d). Specifically, whereas the transactivation repeat (TAR) stem-loop was found in all isoforms, the poly(A) and tRNA-like structures were predicted to form only in D1A1/A2-spliced isoforms. The SL1 stem-loop, which is bound by the viral Pr55 Gag protein during packaging 36,37 , was never predicted as its canonical stem-loop (Fig. 5c,d and Extended Data Fig. 9d). Instead, for the D1A1/2 spliced isoforms, we identified an anti-PAS-SL1 interaction reported to promote the structural rearrangement of the HIV-1 5′ UTR of NL4-3 RNA into a monomeric and packaging incompetent conformation 9 (Fig. 5d). Taken together, these data show that the 5′ UTR is remodeled upon removal of intron 1, leading to structural reorganization of a packaging motif, which likely explains the exclusion of spliced RNA from the virion.

Discussion
Nano-DMS-MaP is a rapid, reproducible and straightforward method for long-read and isoform-resolved RNA structural probing. Using an ultraprocessive reverse transcriptase, we were able to generate long cDNA molecules with mutational signatures at sites of DMS modification and showed that nanopore cDNA sequencing can be used for RNA structure determination by mutational profiling. Nano-DMS-MaP therefore enables the identification of new regulatory mechanisms that are hidden in short-read ensemble analyses.
Despite the high intrinsic error rates of the nanopore sequencing platform, we were able to recover structural information. Critically, we found that ignoring indels during mutation counting decreased the substitution error rates by an order of magnitude. Together with additional quality score filters, we achieved an effective accuracy of 99.9% in the untreated control for single point mutations. This is equivalent to a Phred quality score of 30, which is widely considered a benchmark accuracy in next-generation sequencing 43,44 . This was Article https://doi.org/10.1038/s41592-023-01862-7 made possible because Nanopore datasets have unique error profiles with higher likelihood of indels compared to single point mutations ( Supplementary Fig. 5a). Ignoring indels did not result in lower quality structural data because MarathonRT almost always introduces single nucleotide mutations from DMS modifications. This allowed us to separate the DMS signal from the background noise introduced from nanopore basecalling errors (Extended Data  Fig. 4b,c). Other commercially available highly processive RTases, such as TGIRT-III, have similar characteristics on DMS-modified RNAs and may also be appropriate for Nano-DMS-MaP 12,45 . In agreement with previous reports, we also found that DMS can provide valuable structural information at U residues (Extended Data Fig. 10a-f) 46,47 . Thus, where read depths and DMS concentrations are high, information at U residues may be cautiously included in RNA structure analysis (Extended Data Fig. 10g). Mutations at G residues, however, did not correlate with secondary structure. This is because methylation at G   Article https://doi.org/10.1038/s41592-023-01862-7 residues occurs preferentially at the N7 position on the Hoogsteen-face, leading to a characteristic G to A substitution with MarathonRT (Extended Data Fig. 4b,c) 5 . Instead, this signal could be used to study noncanonical RNA structures involving Hoogsteen interactions, such as G-quadruplexes 48,49 . Our method is related to recent advances in long-read RNA structural probing by nanopore direct RNA sequencing (dRNA-seq) of chemically modified transcripts [50][51][52] . However, dRNA-seq structural probing requires specialized modification detection algorithms which may need continual updates as nanopore sequencing chemistry changes. Nano-DMS-MaP on the other hand immediately takes advantage of improvements in DNA basecaller accuracy, leading to higher signal-to-noise ratio without changes in the experimental or analytical pipeline. Moreover, in comparison with direct RNA sequencing, nanopore DNA sequencing has a higher throughput on the same flow cells, resulting in a lower cost per base than equivalent methods using dRNA sequencing.
Using Nano-DMS-MaP, we recovered high quality structural information on RNA molecules up to 4 kb in length, which would allow isoform-resolved analysis of most human mRNAs 53 . An important caveat is that we used lower DMS concentrations than typically used in short-read DMS-MaP experiments. Although the accuracy of Nano-DMS-MaP structural information was similar at low and high DMS concentrations, higher read depths are generally required for longer molecules (Fig. 2g). For example, the 1.5 kb long RNA only required 4,000-fold coverage (6 megabases (Mb)), but the 4 kb long RNA required 20,000-fold coverage (80 Mb). Thus, sequencing throughput can be limiting, especially when using lower DMS concentrations to analyze longer transcripts. Additionally, when sequencing complex mixtures, it may not always be possible to structurally characterize transcript isoforms of low abundance due to the small number of reads captured. Future increases in nanopore sequencing throughput, together with selective sequencing, may alleviate these limitations. Alternatively, further improvements in reverse transcriptase processivity would allow higher modification densities without the same tradeoffs in read length, which would reduce sequencing requirements to eventually deliver transcriptome-wide structural probing of RNA isoforms. Nano-DMS-MaP may also in time allow more accurate structural determination through the detection of long-range interactions by correlated chemical probing 7,9,31,54 and computational deconvolution of structural ensembles 8,[55][56][57] . Favorable characteristics of Nano-DMS-MaP for these analyses are a higher number of mutations per read and a low background nucleotide substitution rate compared with equivalent Illumina based methods (Extended Data Fig. 4 and Supplementary Fig. 5).
By applying our method to the complex splicing landscape of HIV-1 we identified strong and consistent increases in DMS reactivity in the SL1 stem of spliced viral RNAs, which is a structure involved in Pr55 Gag binding 27,36,37 . DMS-guided structural predictions revealed restructuring of SL1, although the underlying mechanism is unclear. One possibility is that the loss of the SL2 hairpin containing the splice donor site indirectly destabilizes SL1 structure 9 . Alternatively, sequences downstream of the splice site may be required to fold the HIV-1 5′ UTR into a packaging competent structure 36 . In support of the second possibility, DMS reactivity changes in the spliced RNAs clearly show the loss of the U5-AUG interaction 58 and a pseudoknot interaction between the poly(A) stem-loop and sequences in gag 59 . The U5-AUG 27,60-62 and polyA 9,63 structure have both been implicated in 5′ UTR structural switching and the selective packaging of the US RNA 9,63 . We also observed increased DMS reactivity changes at position C58 that is linked to transcription start site variation that alters 5′ UTR structure to regulate genome packaging and translation of the US RNA 42,64,65 . We also cannot exclude that the unfolding of SL1 is due to the preferential translation of the spliced viral RNAs themselves. Testing whether SL1 unfolding drives RNA packaging selectivity in cells, and the role of transcription start site variation on 5′ UTR folding and translation are key topics for future studies.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41592-023-01862-7.

Preparation and probing of in vitro transcribed RNA
DNA templates of the bacterial RNase P type A 67 , hc16 ligase, tetrahymena riboswitch and Vibrio cholerae glycine riboswitch 30 were generated by assembly from DNA oligos (IDT) according to the primerize scheme 68 . The HCV IRES was subcloned from a reporter plasmid kindly provided by N. Caliskan. Fully assembled products were cloned into pJet-1.2 vector (ThermoFisher Scientific) for propagation and confirmed by Sanger sequencing. DNA was then amplified from plasmids with primer T7_fw (AAAGAATTCTAATACGACTCACTATAGG) and the M13_pA_re (TTTTTTTTTTGATTATCATACTCTGATAATCCAGGAAACA GCTATGACCATG) with the exception of HCV IRES, for which primer T7_HCV_fw (AAAGAAGACTTGGGGTAATACGACTCACTATAGGCCAGC CCCCGATTG) was used. DNA amplicons were then purified with 1.2× SPRI bead purification with Mag-Bind TotalPure NGS beads (Omega Biotek). Briefly, the DNA-bead mixture was incubated for 5 min under light agitation and beads were pelleted on a magnetic rack (Invitrogen DYNAL), followed by removal of supernatant and two washes with 100 µl freshly prepared 70% ethanol. Finally, beads were air-dried for 3-5 min (until appearance changed from glossy to rough) and DNA was eluted by addition of 15 µl H 2 O, followed by 5 min incubation at room temperature. For in vitro transcription 500 fmol of DNA were prepared in 40 mM Tris pH 7.5, 18 mM MgCl 2 , 10 mM DDT, 1 mM spermidine, 5 mM NTPs, 40 U RNasin (Molox) and homemade T7 RNA polymerase for 3 h at 37 °C, followed by DNase I treatment for 30 min at 37 C and 1.6× SPRI bead purification.
For probing, 300 ng of the RNA was prepared in an 8 µl reaction mix in 0.5 mM EDTA, 30 mM HEPES pH 7.5, 300 mM NaCl and heated to 95 °C for 1 min, followed by placing on ice. To facilitate folding of the RNA 1 µl of 50 mM MgCl 2 (5 mM final concentration) was added before incubation at 37 °C for 30 min. To probe the RNA, 1 µl DMS diluted in ethanol was added at the indicated final concentrations before incubation at 37 °C for 7 min. The reaction was quenched with four volumes of 30% β-mercaptoethanol. RNAs probed at the same concentration were pooled, 0.1 volume of 3 M NaOAc and 3 volumes EtOH were added, and RNA was precipitated at -20 °C overnight. RNA was then pelleted by centrifugation at 16,000g for 30 min, washed twice with 70 % EtOH before resuspension in H 2 O and normalization to 100 ng µl -1 .
Read-to isoform mapping was performed using IsoQuant v.2.0 (ref. 39), using a general feature file (GFF) generated from previously published data (including transcript naming, as listed in Supplementary Table 2) 33 , but adjusted for PCR primer start and end sites. Only reads mapping uniquely to one isoform were subsorted for subsequent analyses. Sorted reads were first aligned to their specific reference sequences using LAST v.1419, by first indexing the transcript reference with 'lastdb -uNEAR -R01', then training mismatch matrices per sample with 'last-train -Q0', followed by alignment with 'lastal -Qkeep -m20 -p {mismatch_matrix_file} | last-split -m1.' The output maf file was then converted to a Sam file with the 'maf-convert sam' command. The SAM file was then processed using Samtools v.1.12. Briefly, using Samtools a header was added, the file was converted to BAM format, followed by sorting the BAM file and lastly indexing it. The final BAM files were then used as input for the mutational profiling analysis. Mismatch types were analyzed from BAM alignment files with perbase v.0.8.3 and custom python scripts. Figures were generated using the python library plotly v.4.14.3.
Calculating the correlation of reactivity between the two replicates was performed per sample and transcript with the RNAFramework tool rf-correlate for A and C residues only if not stated otherwise. Reactivity of biological replicates were combined for plotting and folding prediction via the rf-combine tool. ROC-AUC scores of reactivity profiles from the unspliced RNA in cells with increasing DMS concentration were calculated using a reference HIV NL4-3 5′ UTR structure 14 as ground truth data. Calculations were performed using the python package scikit-learn v.0.21.3.
Subsampling was performed on aligned BAM files of both replicates of the unspliced isoform in cells. First, the average coverage was determined with Samtools depth, which was then used to calculate the fraction of reads of the BAM file to be subsampled at each subsampling depth. Subsampled BAM files were then processed as described above, including the use of rf-combine to average the reactivities of both subsampled replicates before evaluating their ROC-AUC score.

De novo RNA structure folding
De novo folding of isoform structures with coverage of at least 4,000 reads was performed by converting the reactivities on A and C residues to bp2seq files for input in EternaFold v.1. 3.1 (ref. 69). The command to perform DMS-guided secondary structure prediction was 'eternafold predict {bp2seq_file} -evidence -numdatasources 1 -params Eterna-Fold/parameters/EternaFoldParams_PLUS_POTENTIALS.v1'. Structures were then plotted using VARNA v.3-93 (ref. 70).

Illumina sequencing
The amplicons from the US RNA sequenced previously by nanopore were prepared for Illumina sequencing as follows: The addition of the transposon sequence HF in 5′ and HR in 3′ of the amplicons was performed by five cycles of amplification using the primers PCR-HIV-HF_Fw (TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGggtctctctggttagacc) and PCR-US-HR_Rv (GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG gatggttgtagctgtcccag) using the Q5 DNA Polymerase (NEB). DNA amplicons were purified via SPRI bead as described above using one volume of beads. Purified products (25 ng) were used in the final sequencing library preparation with Nextera DNA Flex Library Prep (Illumina) and Nextera DNA CD Indexes (96 indexes, 96 samples, Illumina), according to the manufacturer's instructions. Paired-end PE150 sequencing was carried out on an Illumina Novaseq instrument (Novogene). Fastq files were preprocessed with cutadapt v.4.1 with the following parameters '-a CTGTCTCTTATA -A CTGTCTCTTATA-nextseq-trim 25-minimum-length 25-max-n 0.' The trimmed files were then aligned using bowtie v.2 with parameters '-D 20 -R 3 -N 1 -L 15 -i S,1,0.50,' followed by the same analytical pipeline used for the nanopore data.

Statistics and reproducibility
DMS reactivity data shown are the mean of two independent biological replicates of Nano-DMS-MaP experiments on A and C residues unless stated otherwise. RT-PCRs optimization experiments were performed at least twice.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
Sequencing data is available at Sequence Read Archive (SRP424422, Bioproject ID PRJNA938445). DMS reactivities as csv files are provided as csv files in Supplementary Dataset 1. De novo predicted structures with Eternafold are provided as db and varna files in Supplementary Dataset 2. Source data are provided with this paper.
Article https://doi.org/10.1038/s41592-023-01862-7 Extended Data Fig. 9 | Correlations, heat map analysis and structural predictions of spliced RNA. (a) Correlation between DMS reactivities obtained from 2 independent experimental replicates of the unspliced HIV-1 RNA in cells. Minimal required read depth per isoform was set and correlation for each isoform passing the filter was evaluated. Boxes represent quartile 1 (Q1) to quartile 3 (Q3). The second quartile (Q2) is marked by a line inside the box. Whiskers correspond to the box's edges +/−1.5 times the interquartile range (IQR: Q3-Q1). (b) A heatmap of DMS reactivities after probing at 57 mM DMS concentration for the unspliced HIV-1 5'UTR in cells and virions, as well as individual reactivities for spliced transcripts in cells. Reactivities of A and C residues are shown using a blue-white-red color scheme. (c) Line plot showing ΔDMS reactivity (mean spliced RNA per first acceptor site -unspliced RNA) for 57 mM at each position in the RNA. Error bands indicate standard deviation within the spliced RNA with same first acceptor site. (d) Arc-plots representing de novo RNA structural predictions for individual spliced isoforms. Predictions were performed using EternaFold guided by DMS reactivities. Underscore labels the actual isoform, depending on whether RNA had been reverse transcribed with RT-PS or RT-FS. For Vpr3 only the structure of the first 564 nt is shown. Varna files are provided in Supplementary Data Set 2.
Article https://doi.org/10.1038/s41592-023-01862-7 Extended Data Fig. 10 | RNA structural information at U residues. Box plots showing DMS reactivity distributions for single stranded and double stranded RNA, false positive and true positive against threshold, receiver operator characteristic curves and precision against recall curves for one of the two independent replicates of unspliced HIV RNA from cells treated with 85 mM DMS for (a) A, (b), C (c), G, and (d) U residues. (e) Heat maps of DMS reactivities at A, C, and U residues for the unspliced HIV-1 RNA in cells at increasing DMS concentration for one of the biologically independent replicates. (f) DMS reactivities of unspliced HIV RNA from cells treated with 85 mM DMS at A, C, U residues plotted onto the consensus structure of the HIV-1 5'UTR in the unspliced RNA. A, C and U residues are colored according to DMS reactivities using a blue-white-red color scheme. Boxes represent quartile 1 (Q1) to quartile 3 (Q3). The second quartile (Q2) is marked by a line inside the box. Whiskers correspond to the box's edges +/−1.5 times the interquartile range (IQR: Q3-Q1). (g) Positive predictive value and base-pairing sensitivity of reactivity-guided Eternafold structure prediction of the first 380 nt of the unspliced HIV 5' UTR using mean in cell Nano-DMS-MaP reactivity at A, C and U residues of both biologically independent replicates for increasing DMS concentrations.