ScanFold: an approach for genome-wide discovery of local RNA structural elements—applications to Zika virus and HIV

In addition to encoding RNA primary structures, genomes also encode RNA secondary and tertiary structures that play roles in gene regulation and, in the case of RNA viruses, genome replication. Methods for the identification of functional RNA structures in genomes typically rely on scanning analysis windows, where multiple partially-overlapping windows are used to predict RNA structures and folding metrics to deduce regions likely to form functional structure. Separate structural models are produced for each window, where the step size can greatly affect the returned model. This makes deducing unique local structures challenging, as the same nucleotides in each window can be alternatively base paired. We are presenting here a new approach where all base pairs from analysis windows are considered and weighted by favorable folding. This results in unique base pairing throughout the genome and the generation of local regions/structures that can be ranked by their propensity to form unusually thermodynamically stable folds. We applied this approach to the Zika virus (ZIKV) and HIV-1 genomes. ZIKV is linked to a variety of neurological ailments including microcephaly and Guillain–Barré syndrome and its (+)-sense RNA genome encodes two, previously described, functionally essential structured RNA regions. HIV, the cause of AIDS, contains multiple functional RNA motifs in its genome, which have been extensively studied. Our approach is able to successfully identify and model the structures of known functional motifs in both viruses, while also finding additional regions likely to form functional structures. All data have been archived at the RNAStructuromeDB (www.structurome.bb.iastate.edu), a repository of RNA folding data for humans and their pathogens.


INTRODUCTION
In coordination with (or in the absence of) experimental techniques to determine genome-scale RNA secondary structure, computational methods are indispensable for identifying functional RNA structures. Such techniques are driven by RNA Negative z-scores then, indicate that the MFE native is more negative (more stable) than RNAs with the same length/nucleotide content would typically generate: for example, a z-score of -1 indicates the MFE native is one standard deviation more stable, a z-score of -2 indicates the MFE native is two standard deviations more stable, etc.
The z-score is at the heart of several functional RNA prediction approaches, including the popular program RNAz (Gruber et al., 2007(Gruber et al., , 2010Washietl, Hofacker & Stadler, 2005b), which has been used to identify functional RNAs embedded within the human (Washietl et al., 2005a) and mouse (Thiel et al., 2018) genomes, as well as the Epstein-Barr virus (Moss & Steitz, 2013) and influenza (Moss, Priore & Turner, 2011) genomes. To span large sequences (e.g., genomes) a scanning window approach is used, where user-defined step and window sizes determine which nucleotides are analyzed: for example, defaults for RNAz window and step size are 120 and 40 nt, respectively. The justifications for small window sizes, optimally between 100 and 150 nt (Lange et al., 2012), are practical (prediction accuracy is higher for smaller RNAs), theoretical (due to the kinetics of folding, local motifs are favored) and algorithmic (RNAz, e.g., is trained on small ncRNA datasets). Another feature of many functional RNA prediction methods is the simultaneous consideration of homology in prediction: for example, align-and-fold approaches (Washietl, Hofacker & Stadler, 2005b) or fold-and-align approaches (Fu et al., 2015). Incorporating homology data can improve prediction accuracy and reduce false-positives; however, these methods are sensitive to alignment quality and sequence composition, evolutionary distance, and variation.
The ScanFold approach presented here is similar to others, in its reliance on the z-score, but focuses on single RNA sequences (vs. alignments) and divides the prediction process into a scanning step, a model building step, and an analysis step-where homology data or experimental results can be considered. For example, this process was previously used to map out the RNA structural landscape of the XIST long ncRNA (Fang et al., 2015). Here, as in other scanning window approaches, the challenge was to determine regions of interest for the structure modeling and analysis steps. For the study of XIST, a window z-score cutoff was used to define regions by overlapping low z-score windows. The cutoff was selected to best capture known elements, however, in many cases this may not be possible. This highlights a key drawback of scanning window approaches: individual windows are arbitrarily bounded sequence fragments while functional RNA structures are not , which makes defining the extent of motifs a challenge.
An early approach for overcoming the problems arising from artificially bounded windows was implemented in RNAplfold (Bernhart, Hofacker & Stadler, 2006). Here, the local base pairing probabilities from multiple overlapping windows are used to generate an average base pairing probability for each base pair predicted throughout the scan. In this way, regions with high base pairing probability (locally stable structures) can be quickly deduced. We take a different approach in this study: here, to define the extent of local motifs, we generate z-score weighted consensus structures to deduce those pairs most likely to be functional.
This approach was implemented in the program ScanFold-Fold and was used to analyze the Zika virus (ZIKV) (Atieh et al., 2016) and HIV-1 (Watts et al., 2009) (+)RNA genomes. These genomes were selected for their small sizes and the known importance of RNA structure in their functions. At either end of the ZIKV genome are untranslated regions (UTRs), a short (107 nt) 5′ UTR and a longer (465 nt) 3′ UTR, which form conserved RNA structures with important functions (Goertz et al., 2017). The 5′ UTR, plus a stretch of the downstream coding region, contains several functional structured domains (Fig. 1A). The 5′ UTR has a Y-shaped stem-loop A motif, which acts as the promoter of viral genomic (v)RNA replication (Filomatori et al., 2006;Lodeiro, Filomatori & Gamarnik, 2009;Thurner et al., 2004). Directly downstream is stem-loop B, which facilitates RNA interactions with the 3′ end (Alvarez et al., 2005). The cHP domain, which overlaps the capsid protein coding region, is required for efficient vRNA synthesis; additionally, cHP enhances start codon selection (Ye et al., 2016). The DCS-PK domain enhances vRNA replication by promoting vRNA circularization (Liu et al., 2013). The 3′ UTR contains six recognized structured motifs (Goertz et al., 2017) (Fig. 1B). From the 3′ end, it contains a large stem-loop structure (3′ SL), which is required for viral replication and is highly conserved throughout flavivirus genomes (Davis et al., 2013;Elghonemy, Davis & Brinton, 2005;Villordo et al., 2016;Yu & Markoff, 2005;Zeng, Falgout & Markoff, 1998). Directly upstream is a short, well conserved hairpin (sHP), thought to be involved in genome circularization (Villordo, Alvarez & Gamarnik, 2010). Upstream of this are two structures (DB-1 and É-DB), which have been shown to be conserved and duplicated, though their specific functions remains unknown (Villordo et al., 2016). The two remaining structures, SLI and SLII, which are resistant to host XRN1 exonucleases (Donald et al., 2016;Goertz et al., 2016;Pijlman et al., 2008), lead to an abundance of a highly structured ncRNA: the subgenomic flavivirus (sf)RNA, which is proposed to play roles in inhibition of the RIG-I host antiviral response (Akiyama et al., 2016;Chapman et al., 2014;Kieft, Rabe & Chapman, 2015). The HIV-1 RNA genome, whose secondary structure has been extensively characterized (Watts et al., 2009;Wilkinson et al., 2008), contains four structured RNA elements with known functions. On either end of the genome, in the UTRs, are trans-activation response (TAR) elements (stem-loops named the 5′TAR and 3′TAR, respectively; Figs. 1C and 1F) which are involved in viral replication (Das, Klaver & Berkhout, 1998); the 5′TAR has been shown to bind the viral Tat protein during transcriptional activation (Wimmer et al., 1999) and is processed into two micro RNAs (Ouellet et al., 2008). Within the coding region of the genome are structural elements as well: the gag-pol frameshift element (Fig. 1D), a stem-loop structure which alters the ribosomal reading frame to allow for proper translation of the gag and pol viral proteins (which are present on overlapping reading frames), and the Rev response element (RRE), a long stem-loop structure with five terminal stem-loops (Fig. 1E), which binds viral Rev protein and allows viral mRNA to be exported from the nucleus.
Our results are compared to previously described structure models from both ZIKV and HIV-1, and tested vs. available biochemical structure probing datasets. We performed multiple benchmarking analyses of ScanFold's ability to detect structures in the particularly well-characterized HIV-1 genome, and determined how parameters such as window size and shuffling technique affect results.

Data sets
The analyzed ZIKV genome was sequenced from the outbreak-lineage-derived reverse genetics system (Atieh et al., 2016) (NCBI accession KJ776791.2), and was selected to facilitate additional experimentation to better understand RNA structures' roles in ZIKV. The sequence for HIV-1 was from the genome chemically probed by Watts et al. (2009). SHAPE reactivity profiles for ZIKV were taken from extended data 6 in Huber et al. (2018) and for HIV-1 from supplementary dataset 2 in Watts et al. (2009).

ScanFold-Scan
The preliminary scanning window analysis for ZIKV and HIV-1 was performed by the ScanFold-Scan program (https://github.com/moss-lab/ScanFold). In this process, each window sequence is folded via RNAfold (Lorenz et al., 2011) to calculate its native MFE and associated base pairing structure at 37 C. Each sequence is then shuffled to produce, in this case, 50 random sequences. Two different shuffling techniques were used to generate random sequences: (1) mononucleotide shuffling, which generates a random sequence with the same mononucleotide content as the native sequence and (2) Clote's implementation of the (Altschul & Erickson, 1985) shuffling algorithm (http://clavius.bc.edu/clotelab/RNAdinucleotideShuffle/ShuffleCodeParts/ altschulEriksonDinuclShuffle.txt), which generates a shuffled sequence that maintains the mononucleotide and dinucleotide content of the native sequence. Each of the 50 randomized sequences is then folded to calculate an average MFE random value for use in the calculation of the thermodynamic z-score (see Introduction; Eq. (1)). Other metrics are calculated as well: for example, those derived from RNAfold's use of the partition function (McCaskill, 1990) (specifically, the ensemble diversity (ED), centroid structure, and frequency of MFE, which are metrics derived from the partition function to describe the nature of an MFE's structural ensemble), as well as a z-score stability ratio (calculated as the number of shuffled random MFEs which were more stable than native; referred to as the p-value), that can be useful as a quality control in downstream analyses. All of the aforementioned metrics are compared and described in detail in Freyhult, Gardner & Moulton (2005).

ScanFold-Fold
The ScanFold-Fold program analyzes the output of a scanning window analysis, focusing on MFE structures and their z-scores. The algorithm first reads the sequence and MFE structure from every window, generating a comprehensive list of all primary sequence nucleotides (i), the number of windows each i appears in (W i ), a list of all nucleotides each i base pairs with (j), and the number of windows each base pair arrangement (i-j) appears in (W i-j ). For each i-j, the calculated metrics from all occurrences of the i-j are recorded and summed (e.g., for the z-score metric, this sum is referred to as Z sum ). Next, the average MFE, ED, and z-score for each i-j arrangement are calculated as the sum of each metric's value divided by W i-j ; an example of this calculation is shown for the average thermodynamic z-score (Z avg ) in Eq. (2).
As well as average metrics observed for each i-j arrangement, a coverage-normalized z-score (Z norm ; Eq. (3)) is calculated as Z sum divided by the total number of windows covering i (W i ).
This coverage-normalized z-score (as opposed to Z avg ) gives more weight to i-j arrangements which consistently appear in low z-score windows and provides a normalized metric for comparison of regions with lower window coverage (near the ends, where i's are covered by only a few windows). This initial processing is output into a log file (an example portion of which is shown for i-1099 of ZIKV in Table 1).
For each i in the sequence, a single i-j arrangement is selected to represent the most favorable arrangement; here the "most favorable" arrangement is considered to be the one with the lowest Z norm . Selection based on Z norm results in a list of the most favorable i-j arrangements for every i of the input sequence. Importantly, the ScanFold-Fold algorithm must consider upstream and downstream base pairing competition when selecting the "best" i-j arrangement; it is possible that different i's will compete for the same j, which would result in the generation of unrealistic models depicting single nucleotides paired with multiple partners.
In cases of competition, such as shown in Table 2, where all three i's compete for the same j-1104, only one i can be selected to partner with j-1104. Here, the lowest Z norm among the set of competing i-j arrangements was observed for the unpaired arrangement (i-1104 and j-1104); therefore, j-1104 is awarded to i-1104 and an assumption is made that i-1099 and i-1078 do not pair with j-1104; for the sake of the consensus model, they will remain unpaired (see i-1099 in Table 3 where the final partner for i is shown to be i-1104 and j-1104; the Ã indicates that the winning j for i-1099 had a more favorable arrangement which is reported in its place).
These results are printed to a file as well; an example portion of which is shown in Table 3, for ZIKV nt's i-1099 to i-1108). Ultimately, this selection process allows for the generation of connectivity table (CT) files which span the entire genome.

Filtering
Since the primary interest for this analysis is to reveal potentially functional structures, a filtering process is employed to hone in on base pairs common to analysis windows with highly negative z-scores. For this filtering process, Z avg for each i-j arrangement is considered; only i-j arrangements with a Z avg below a filter value are written to a CT file. An example of this filtering process is shown in Fig. 2 for the first 2,000 nt of ZIKV (full genome in Fig. S1). By default, the ScanFold-Fold program will write multiple CT files, each using a different filter value: -2, -1, 10 (no filter) are used by default, and a user-defined value is also allowed. The user-defined filter value is provided because the definition of a "significantly" negative z-score can vary  (3)) per i nucleotide of the sequence; competition is allowed, that is, multiple partners are permitted to pair with the same nucleotides. (C) The third track shows the base pairs which remain after prohibiting multiple pairing partners per nucleotide; that is, competition is disallowed whereby only a single pairing partner is allowed per i or j nucleotide. This track is equivalent to the "no filter" results from ScanFold-Fold. The base pairs from this track are then subjected to filtering based on their Z avg (see Methods Eq. (2)). The final tracks depict which base pairs from the results above possessed Z avg scores less than (D) -1 (E) -1.6 (one standard deviation below the mean z-score) and (F) -2. (see "Comparison of shuffling techniques" in Results) and users may want to select different values.

Alignment
A total of 37 ZIKV genomes (curated in the ZikaVR database; Gupta et al., 2016) were aligned to the scanned ZIKV genome (accession KJ776791.2) using the MAFFT web server (Katoh, Rozewicki & Yamada, 2017;Kuraku et al., 2013) with default settings. Aligned sequences were compared to ScanFold-Fold predicted base pairs (with Z avg < -1) to tabulate the types of base pairs that are found throughout the alignment; nucleotides with mutations which maintained a ScanFold-Fold predicted base pair are noted in figures as "structure-preserving."

ScanFold-Fold predicted motifs in the ZIKV genome
The ZIKV genome was analyzed with ScanFold-Scan using a 120 nt window with a one nt step: resulting in 10,688 analyzed windows (Table S1). For each window, several metrics of RNA folding were predicted (described in the Results and Discussion of Andrews, Baber & Moss (2017)); two metrics of particular interest are plotted vs. the ZIKV genome in Fig. 3: the MFE (Fig. 3A) and the thermodynamic z-score ( Fig. 3B; Eq. (1)) using a mononucleotide shuffling technique. Across the ZIKV genome are windows where low MFEs overlap low z-score windows, but also places where they do not (Fig. 3: highlighted in yellow). Even for a relatively small genome such as ZIKV, 3,349 windows had z-scores less than -1 (signifying the window MFE prediction was one standard deviation more stable than random) and 994 windows with z-scores less than -2. With so many windows of interest, and so many competing models, it is a challenge to identify the most functionally significant base pairs. This was the impetus behind the development of ScanFold-Fold: to identify the base pairs which were responsible for generating low z-score regions and that persisted across multiple overlapping analysis windows. In total, 22,180 unique base pairs were predicted throughout all scanning windows ( Fig. S1A; Table S2); some nucleotides were predicted to form base pairs with as many as 16 different partners, highlighting the challenge of finding a single model (e.g., i-1099; Table 1). ScanFold-Fold records the metrics from each window where the base pair appears, generating a set of cumulative metrics. For each i, only one base pairing partner is selected. Selecting base pairs with the lowest Z norm (Eq. (3)) and allowing competition (see Materials and Methods) yielded a smaller group of 6,831 base pairs (Fig. S1B). Disallowing competition (see Materials and Methods), however, yields a much smaller group of 2,259 base pairs ( Fig. S1C; Table S3). To focus on the most significant hits, cumulative z-score filters were applied to identify the base pairs which were consistently found in low z-score windows: Z avg filters of -1 and -2 were used.
With a Z avg filter of -1 and 1,114 base pairs were identified ( Fig. S1D; Table S4). Consistent with the presence of structured functional motifs, many base pairs were found within the known ZIKV structured regions; a total of 194 base pairs were found within previously identified 5′ and 3′ end structured domains. ScanFold-Fold was able to identify 86 of the 114 known base pairs (Goertz et al., 2017) in the 3′ UTR (Fig. S2) and 75 of the 81 known base pairs (Ye et al., 2016) within the 5′ structured domain (Fig. S3).
Further filtering the results to Z avg < -2 reduced the number of base pairs to 233 ( Fig. S1F; Table S5). Of these, 121 were found in known structural regions. Interestingly, regions immediately adjacent (within 240 nt, or two window lengths) to these known functional motifs also had 86 Z avg pairs < -2 (Figs. 4A, 4B and 4G), suggesting that the regions of functional structure at either end may be larger than previously thought (full structural models of the extended 5′ and 3′ ends are shown in Fig. S4). The remaining 26 base pairs contribute to four structures found within the core coding region (Figs. 4C-4F).
To determine the structural conservation of these ScanFold-Fold identified base pairs, an alignment was performed of 37 ZIKV genomes curated in the ZikaVR database (Gupta et al., 2016); aligned sequences are reported in Table S6. The ScanFold base pairs with Z avg < -1 were mapped to the alignment to determine the conservation of base pairing across ZIKV genomes. When mutations occurred in predicted paired regions they generally preserved base pairing: for example, ScanFold-Fold predicted base pairs were over 96% conserved (Table S7). Multiple structure-preserving mutations occur throughout novel predicted motifs (Fig. 4) as well as in previously described ZIKV structures (Fig. S4).

ScanFold-Fold identified motifs in the HIV-1 genome
In order to benchmark the ScanFold pipeline, ScanFold-Fold Z avg < -2 base pairs were compared to well-characterized, experimentally-supported models for HIV-1 RNA structural motifs. Using the same pipeline used for the ZIKV genome, 13 structured regions were identified that contained base pairs with Z avg < -2. These regions are shown in Fig. 5. Again, all previously described RNA structural elements were identified with the ScanFold pipeline; the 5′TAR element (Fig. 5A), the gag-pol frameshift element (Fig. 5E), the five terminal stem loops of the RRE (Fig. 5K), and the 3′TAR element  463-4,520, 5,587-5,702, 7,096-7,205, and 7,651-7,755, respectively. Structure (G) is directly upstream of the 3′ structured region, located at nt 10,237-10,368, and has been annotated to show the location of the stop codon near the terminal loop (circled in green). Base pairs are colored by their z-score cutoff; green lines show base pairs which were predicted in the z-score < -1 results ( Fig. S1D; Table S4) and blue lines show base pairs which were predicted in z-score < -2 results ( Fig. S1E; Table S5). Sites with structure-preserving mutations are highlighted with green circles. All nucleotides are shown with their SHAPE reactivity scores as shown in Huber et al. (2018).  (2015)) and are also in agreement with SHAPE reactivity data. The remaining structures, while not previously described, are in good agreement with SHAPE reactivity data (with some slight discrepancies for the first hairpin of the structures in Figs. 5H and 5K).

Comparison of shuffling techniques
The process of shuffling RNA can affect the z-score (Forsdyke, 2007). Dinucleotide shuffling preserves nearest-neighbor nucleotides (that can stack in helixes), while mono-nucleotide shuffling abolishes this pattern-potentially overestimating the magnitude of the z-score (Gesell & Washietl, 2008). To determine the impact of mono-vs. dinucleotide shuffling on ScanFold results (which rely primarily on the thermodynamic z-score), both dinucleotide and mononucleotide shuffling were performed on ZIKV and HIV-1. These two shuffling techniques were implemented during each analysis to identify the differences in the resulting Z avg base pairs. For the ZIKV genome, the same overall z-score pattern (resulting in identification of similar motifs) was observed between shuffling techniques (Fig. 6A), however the mean z-score across the genome differed slightly: -0.55 and -0.18 for mononucleotide and dinucleotide shuffling, respectively (Fig. 6B). Dinucleotide shuffling results using a Z avg cutoff of -2 yielded fewer base pairs (147 bps) than mononucleotide shuffling (233 bps); this is likely due to the generally more positive z-scores that arise from using dinucleotide shuffling. Overall, the results for the known structures in the UTR regions are the same between shuffling techniques (Fig. S5); however, in the 3′ UTR, mononucleotide shuffling detected DB-1 and É-DB with a Z avg < -2 while dinucleotide shuffling did not. The other differences between results can be seen in the coding region. Between the two techniques, eight structured regions were identified in the coding region of ZIKV with Z avg < -2 base pairs (Figs. S5A-S5H), half of which were identified by both techniques with slight differences in the quantity and location of base pairs (Figs. S5A, S5C, S5D and S5H). Three structured regions were identified exclusively by mononucleotide shuffling (Figs. S5B, S5E and S5F) and one structured region was exclusive to dinucleotide shuffling (Fig. S5D).  Figure 6 Comparison of shuffling techniques. The ZIKV genome was analyzed with ScanFold-Scan using a 120 nt window size, one nt step size, and 50 iterations of either mononucleotide or dinucleotide shuffling during the calculation of the thermodynamic z-score. (A) For each window, the calculated z-score is plotted as a point along the genome (positioned according to the starting coordinate of the respective window). To observe the general z-score trend present for each shuffling technique, a moving average for every 120 points is plotted as a line against genomic position as well; where green and blue coloring refers to mononucleotide and dinucleotide shuffling, respectively. (B) The same data from above is plotted here as whisker plot: center lines indicate the median; crosses indicate the mean; the box limits indicate the 75th and 25th percentiles; whiskers extend to 1.5 times the interquartile range; outliers are represented as dots. Full-size  DOI: 10.7717/peerj.6136/ fig-6 Similar to Zika, the average z-scores for the HIV-1 genome were more positive using dinucleotide shuffling than mononucleotide shuffling (-0.15 and -0.49, respectively). Minimal differences were observed between each shuffling technique's ability to detect known structured regions over a range of window sizes (Fig. 7). Most of the base pairs identified in these regions were identical, where the only differences are due to mononucleotide shuffling detecting more base pairs: for example, there were 18 more identified pairs in regions found using a 120 nt window (Fig. 7D). This is consistent with the more positive z-scores obtained using dinucleotide shuffling overall, where a less stringent Z avg filter would likely identify more base pairs. Though there were differences between the number of base pairs identified in the regions intervening the known structural elements, identified base pairs were consistent with SHAPE reactivity data (Fig. S6).

Comparison of results from different window sizes
Since the MFE structures predicted throughout scanning window analyses are sensitive to window size, the HIV-1 genome was analyzed using five different window sizes (100, 120 150, 180, and 210 nt) and a Z avg filter of -2. The results obtained using each window size were compared to published SHAPE reactivity profiles from Watts et al. (2009). All window sizes using a mononucleotide shuffle correctly identified base pairs from the previously-described structural elements of HIV-1 (Fig. 7); where larger window sizes generally identified more base pairs around known functional structures (with a single exception: using a window size of 180 nt, none of the base pairs from the poly(A) stem located directly upstream of 5′TAR element (Lavender, Gorelick & Weeks, 2015) had Z avg below -2 (Fig. 7B).
In regions with no previously-described functional structures, base pairs with Z avg < -2 are consistent with the SHAPE reactivity profiles as well (Fig. S6), however, the location and quantity of bases pairs differs between window sizes. In general, the number of base pairs identified increased as window size increased, while prediction accuracy in known structured regions remained the same or was diminished. This suggests that a window size between 100 and 150 nt may be optimal; this is consistent with findings from a previous study that aimed to identify the optimum window size for detection of structured regulatory elements embedded in long mRNA molecules (Lange et al., 2012).

DISCUSSION
The ScanFold-Fold analysis of ZIKV reiterates the prominence of RNA structure in the 5′ and 3′ regions of its genome, indicates that these (previously described) structured regions could be larger, and reveals several potentially functional structures within the core coding region. This was achieved by an approach which condenses thousands of scanning window models into a list of base pairs with the highest likelihood of generating functional RNA structures; greatly reducing the dependence on subjective manual curation of results. The ScanFold-Fold algorithm was able to successfully identify known base pairs within the 5′ and 3′ regions with high positive predictive value (PPV) and sensitivity (Table S8) (Bellaousov et al., 2013;Mathews, 2004;Mathews et al., 1999). In the 5′ UTR region, ScanFold-Fold positively identified all known structures (with slight variations; Fig. S3), while the 3′ UTR model was accurate in identifying the structures of all elements, except for regions of the exonuclease resistant SLI and SLII structures (Fig. S2); likely due to the presence of complex RNA pseudoknot structures here (Akiyama et al., 2016), which can be difficult to predict computationally due to non-nested base pairing that complicates the use of recursive algorithms (Schlick & Pyle, 2017).
ScanFold-Fold identified structures directly upstream and downstream of the 5′ and 3′ UTR regions (nt 270-528 and nt 10,237-10,370, respectively) have metrics that are equally as favorable as those in the known structural regions. These structures (Figs. 4A, 4B and 4G) have unusually stable thermodynamic stability (compared to random), are well conserved throughout ZIKV genomes (Table S7), and the structures in Figs. 4A and 4G are supported by SHAPE reactivity data (reactivity data for the structure in Fig. 4B was not reported). Their close proximity to functional UTR regions (Fig. S4) suggests that they may play important roles in conjunction with these other elements: for example, in genome replication, processing, etc. The structures predicted in the core coding region (Figs. 4C-4F), also have metrics that indicate they may have evolved to form functional conformations, are conserved, and SHAPE reactivity data (where available) agrees with these models (Figs. 4D and 4E). There are numerous potential functions for these core region RNA structural motifs: for example, serving as sites of post-transcriptional modifications (Coutard et al., 2017;Dong et al., 2014;Gokhale et al., 2016;Lichinchi et al., 2016), facilitating packaging of the genome (Stockley et al., 2013), acting as localization signals (Pratt & Mowry, 2013), modulating the rate of translation to affect protein folding (Khrustalev et al., 2017), or to affect transcript (genome) stability (Wu & Brewer, 2012).
Known and novel ZIKV motifs predicted by ScanFold-Fold are corroborated by a recent study of ZIKV combining biochemical structure probing with comparative analysis (Huber et al., 2018; Extended data 6). This highlights the robustness of ScanFold-Fold predicted models and the program's ability to rapidly deduce likely functional motifs. Similarly, our benchmarking of ScanFold-Fold using HIV-1 data also showed good correlations with predictions and experimental data, and demonstrate its ability to independently identify functional (e.g., named) RNA motifs. Interestingly, in addition to named HIV-1 motifs, several additional structures were identified (Figs. 5F, 5J and 5L) which contain the same structurally conserved base pairs identified in a comparative analysis of HIV-1 with two primate SIV strains (Lavender, Gorelick & Weeks, 2015). No functions are currently proposed for these structures, however, their folding metrics are highly suggestive of their importance.
The map of the RNA structural landscape of the ZIKV genome and the reanalysis of HIV-1 presented in this report serves as a guide for future analyses. The functional importance of novel ScanFold-Fold identified motifs can be tested in virio by designing mutations to disrupt/compensate structure (e.g., using a tool such as RNA2DMut; Moss, 2018)-while maintaining (or minimally disrupting) amino acid coding (and codon use)-then introducing them into the genome: for example, via ZIKV or HIV-1 genetics systems (Atieh et al., 2016;Smyth et al., 2014) to assess effects on viability, infectivity, and replication. A similar strategy was previously used to test RNA structural motifs predicted to occur in influenza A virus . Furthermore, the presence of conserved base pairing within coding regions would be expected to impact their evolution (to maintain both functional protein and RNA structure); thus, these data can also potentially be helpful in understanding constraints placed on the evolution of these viruses. This is particularly significant to understanding the evolution and outbreak of pathogenic ZIKV strains.

Considerations
The underlying folding algorithm used in ScanFold-Scan, RNAfold (version 2.3.3), has been extensively benchmarked vs. experimental data (Lorenz et al., 2011), and is one of the top performing single-sequence, energy-based folding algorithms available (Puton et al., 2014). Despite the similarity in prediction accuracies between other top performing folding algorithms (such as RNAstructure; Reuter & Mathews, 2010 and UNAfold;Markham & Zuker, 2008), differences between their MFE structure predictions still arise due to the different ways the Turner energy parameters are implemented. It should be noted that output from any RNA folding algorithm (Puton et al., 2014), when properly formatted, can be considered in the ScanFold-Fold process. This could address specific inaccuracies or limitations of any particular approach. Indeed, one could combine results from multiple predictions approaches in ScanFold-Fold to get consensus motifs. Likewise, another way to address algorithm limitations is to incorporate data from biochemical structural analyses (Deigan et al., 2009;Washietl et al., 2012;Zarringhalam et al., 2012). Although, unconstrained ScanFold-Fold results were consistent with SHAPE data from ZIKV and HIV-1, these data could have also been included as constraints in the ScanFold-Scan window analyses: for example, by constraining reactive bases in each window.
A limitation of the ScanFold procedure is that base pairing beyond the window size used cannot be predicted. For example, functional long-range RNA-RNA interactions (LRIs) have been identified within genomes of positive-strand RNA viruses such as ZIKV (Huber et al., 2018). These interactions span thousands of nucleotides (much greater in distance than the typical scanning analysis window) and play functional roles in viral transcription, translation, and replication (Nicholson & White, 2014). Because they span such large distances, scanning window approaches are unable to explicitly predict LRIs; however, by deducing local regions with high propensity of folding, these no longer need to be considered when trying to deduce LRIs.
The ScanFold-Scan approach presented here was developed as a single sequence alternative to approaches for functional RNA motif discovery. It differs from alignment-based methods such as RNAz and locARNA by dividing the analysis steps to allow phylogenetic comparisons to be run after folding. In this way, ScanFold-Scan and ScanFold-Fold can be used to detect both conserved and nonconserved elements, which may be significant for recently-evolved viral strains, for example. It should be noted, however, that output from alignment-based approaches are also compatible with ScanFold-Fold and can readily be used to detect conserved elements from alignments.

CONCLUSIONS
In conclusion, this report presents a bioinformatics scan of the ZIKV and HIV-1 genomes and a novel analysis pipeline/method for functional RNA motif discovery that (1) recapitulates known functional motifs in both viruses, (2) suggests that regions of RNA structure in ZIKV may be larger than previously reported, (3) finds novel motifs that may be functionally important, and (4) provides a road-map for testing the functions of RNA structure in the biology of both ZIKV and HIV: for example, by disrupting structure via mutations to viral genomes.