A specific sequence in the genome of respiratory syncytial virus regulates the generation of copy-back defective viral genomes

Defective viral genomes of the copy-back type (cbDVGs) are the primary initiators of the antiviral immune response during infection with respiratory syncytial virus (RSV) both in vitro and in vivo. However, the mechanism governing cbDVG generation remains unknown, thereby limiting our ability to manipulate cbDVG content in order to modulate the host response to infection. Here we report a specific genomic signal that mediates the generation of a subset of RSV cbDVG species. Using a customized bioinformatics tool, we identified regions in the RSV genome frequently used to generate cbDVGs during infection. We then created a minigenome system to validate the function of one of these sequences and to determine if specific nucleotides were essential for cbDVG generation at that position. Further, we created a recombinant virus unable to produce a subset of cbDVGs due to mutations introduced in this sequence. The identified sequence was also used as a site for cbDVG generation during natural RSV infections, and common cbDVGs originating from this sequence were found among samples from various infected patients. These data demonstrate that sequences encoded in the viral genome determine the location of cbDVG formation and, therefore, that the generation of cbDVGs is not a stochastic process. These findings open the possibility of genetically manipulating cbDVG formation to modulate infection outcome.

Introduction
Defective viral genomes (DVGs), which are generated during the replication of most RNA viruses, potentiate the host innate immune response [1-5] and attenuate the infection in vitro and in vivo [4,6-9]. Importantly, in naturally infected humans, the presence of DVGs correlates with enhanced antiviral immune responses during RSV infection [6] and reduced disease severity in influenza virus infection [8]. Significant effort is currently invested in harnessing DVGs as antivirals due to their strong immunostimulatory activity and ability to interfere with the replication of the standard virus. However, despite over 50 years of appreciating their critical functions in multiple aspects of viral infections, the molecular mechanisms that drive DVG generation remain largely unknown. This lack of understanding hampers our ability to effectively harness DVGs for therapeutic purposes and limits our capacity to generate tools to elucidate their mechanism of action and impact during specific viral infections.
There are two major types of DVGs: deletion and copy-back (cb) [10]. Both types are unable to complete a full replication cycle without the help of a co-infecting full-length virus [11,12] and can be packaged to become part of the viral population [13]. Deletion DVGs, common in influenza virus and positive-strand RNA viruses, retain the 3' and 5' ends of the viral genomes but carry an internal deletion [14-16]. These types of DVGs are believed to arise from recombination events [17,18] and can strongly interfere with the standard virus [19]. cbDVGs are common products of non-segmented negative-sense (nns) RNA virus replication, including Sendai virus (SeV), measles virus, and respiratory syncytial virus (RSV), and are the primary stimulators of the innate immune response during nnsRNA virus infection [6,7,20,21]. cbDVGs arise when the viral polymerase detaches from its template at a "break point" and resumes elongation at a downstream "rejoin point" by copying the 5' end of the nascent daughter strand [12,22]. This process results in the formation of a new junction sequence and a truncated genome flanked by reverse complementary ends [23]. cbDVGs have long been thought to result from errors made by the RNA-dependent RNA polymerase (RdRp) during replication due to a combination of lack of proofreading activity and the presence of a polymerase with lower replication fidelity [12]. No pattern or specific sequences for the break and rejoin points of cbDVGs have been reported so far.
Based on our consistent observations of discrete populations of cbDVG generated during RSV infections in vitro and in vivo [6], we set out to test the hypothesis that the generation of cbDVGs is not completely stochastic but instead is regulated by a carefully orchestrated process. Upon identification of cbDVG populations generated during infection, we show that specific viral sequences within the viral genome are preferred sites for cbDVG generation and that these sequences are conserved across viral strains. Utilizing this knowledge, we generated a recombinant virus that produced a restricted set of cbDVGs, serving as strong evidence that specific sequences dictate where cbDVGs are generated. In addition, we demonstrate for the first time that common cbDVGs are generated independently in natural infections in humans, further supporting an orchestrated origin for cbDVGs.

High sensitivity detection of cbDVG populations during infection
To acquire a comprehensive view of the population of cbDVGs generated during infection, we developed an algorithm to identify cbDVG junction regions within RNA-seq datasets with high sensitivity. The principle of this Viral Opensource DVG Key Algorithm (VODKA) is illustrated in Fig 1A. In brief, the break point (T') and the rejoin point (W) are far apart in the parental viral genome, but in cbDVGs they become contiguous and form the new cbDVG junction sequence when the viral polymerase (Vp) is released from the template and rejoins in the nascent strand. Since cbDVG junction sequences are absent in the full-length viral genome, VODKA identifies the sequence reads that capture junction sequences and then filters out false-positive reads that fully align to the reference genome.
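As a rough illustration of the principle described above, the following Python sketch builds a toy copy-back junction read and recovers its break and rejoin coordinates by exact string matching. It is not VODKA's implementation: the function names (`make_junction_read`, `find_junctions`), the random toy reference, the exact-match search, and the `min_anchor` parameter are all assumptions made for illustration. A real tool would rely on a read aligner and tolerate mismatches.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def make_junction_read(ref: str, break_pos: int, rejoin_pos: int, flank: int) -> str:
    """Toy junction read (1-based coordinates on `ref`).

    The part before the junction is the reverse complement of the reference
    starting at the break point; the part after the junction follows the
    reference starting at the rejoin point, so the read switches strand.
    """
    before = revcomp(ref[break_pos - 1 : break_pos - 1 + flank])
    after = ref[rejoin_pos - 1 : rejoin_pos - 1 + flank]
    return before + after


def find_junctions(read: str, ref: str, min_anchor: int = 10) -> list:
    """Return candidate (break, rejoin) pairs for a read, or [] if none."""
    # Filter step: a read that aligns in full to either strand of the
    # reference cannot contain a copy-back junction.
    if read in ref or revcomp(read) in ref:
        return []
    hits = []
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        i = ref.find(revcomp(left))  # left part maps to the opposite strand
        j = ref.find(right)          # right part maps to the reference strand
        if i != -1 and j != -1:
            hits.append((i + 1, j + 1))
    return hits


# Hypothetical 400-nt reference and junction coordinates, for illustration only.
rng = random.Random(0)
ref = "".join(rng.choice("ACGT") for _ in range(400))
read = make_junction_read(ref, break_pos=250, rejoin_pos=340, flank=20)

print(find_junctions(read, ref))          # candidate (break, rejoin) pairs
print(find_junctions(ref[100:140], ref))  # [] because the read aligns in full
```

Because bases adjacent to the junction can match either side, more than one split point may be reported for a single read.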
We corroborated VODKA's performance by testing for the presence of the highly dominant DVG-546 in samples from infections with SeV Cantell (Fig 1B). Briefly, 98,543 out of the 98,626 (99.9%) cbDVG junction reads identified by VODKA from an RNA-seq data set obtained from SeV Cantell-infected cells mapped exactly to the known junction region of DVG-546 ( Fig 1C, bottom panel). By aligning cbDVG junction reads to the SeV full-length antisense genome, we determined the location of the major break and rejoin regions (blue peaks in the upper panel of Fig 1C). Each read that aligned to a break or rejoin region contained two portions, one of which fully aligned to the reference genome. The aligned reads in the break region (pink box, Fig 1C) and in the rejoin region (gray box, Fig 1C) corresponded to the DVG-546 sequence before the break point and after the rejoin point, respectively. The breakpoint for DVG-546 predicted by VODKA (14932±1_15292±1) exactly matched the one identified by Sanger sequencing (" in Fig 1C), thereby establishing the efficiency and accuracy of VODKA in identifying cbDVG-specific sequences.
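The numbers in this paragraph can be checked with a short calculation. The sketch below assumes a SeV genome length of 15,384 nt (a value not stated in the text) and a coordinate convention in which the cbDVG spans the break point to the genome end plus the rejoin point to the genome end; the helper name `cbdvg_length` is hypothetical. Under these assumptions the predicted break and rejoin positions give a length of 546 nt, in line with the name DVG-546.

```python
def cbdvg_length(genome_length: int, break_pos: int, rejoin_pos: int) -> int:
    """Length of a copy-back DVG under the assumed convention:
    (break..end of genome) plus (rejoin..end of genome), 1-based inclusive."""
    return (genome_length - break_pos + 1) + (genome_length - rejoin_pos + 1)


SEV_GENOME_LENGTH = 15384  # assumption: not given in the text

length = cbdvg_length(SEV_GENOME_LENGTH, break_pos=14932, rejoin_pos=15292)
print(length)  # 546

# Fraction of VODKA junction reads mapping to the known DVG-546 junction.
fraction = 98543 / 98626
print(f"{fraction:.1%}")  # 99.9%
```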
We then used VODKA to identify the population of cbDVGs generated during RSV infection. RSV is a virus known to generate immunostimulatory cbDVGs in infected patients [6], and thus a subject of interest in our laboratory. We analyzed pooled RNA-seq datasets from six RSV-infected cell cultures. These cultures were infected with RSV generated from the same parental stock (strain A2) that was first depleted of DVGs and then passaged independently in different cell lines to generate six different stocks enriched for DVGs. The presence of cbDVGs in these infections was confirmed using a specific RT-PCR followed by Sanger sequencing. By aligning VODKA-identified cbDVG junctions to the RSV A2 reference antigenome, we observed 4 major break hotspots spanning over 1300 nucleotides (red down-facing arrows in Fig 1D). In contrast, only 2 major rejoin hotspots were observed within a narrower region of 223 nucleotides in length at the 3' end of the viral genome (black down-facing arrows in Fig 1D). Remarkably, the rejoin area with the highest peak included counts present in all six virus stocks. We then compared these break and rejoin hotspots to those generated in infections with a different stock of RSV enriched in cbDVGs (stock7) from which the major cbDVGs were identified upon Sanger sequencing of PCR amplicons. We observed that the cbDVG rejoin points from stock7 were located within the strongest rejoin hotspot, whereas its four break points were distributed more broadly across the genome (Fig 1D). These results reveal strong hotspots for the polymerase to rejoin during cbDVG formation and suggest a large degree of conservation of the RSV cbDVG rejoin positions.

The red dashed square marks the unique junction region that distinguishes cbDVGs from full-length viral genomes. This junction region was used by VODKA to identify cbDVGs.
(B) A549 cells were mock infected or infected with SeV Cantell at a MOI of 1.5 TCID50/cell and harvested at 2, 6, 12, and 24 h post infection followed by detection of cbDVGs by RT-PCR using primers SeV DI1 and gSeV DI1 (S1 Table). (C) Alignment of SeV cbDVG junction reads obtained from VODKA to the last 3 kb of the SeV antisense genome (top) or the DVG-546 junction sequence (bottom). Blue histogram shows a synopsis of total coverage at any given position of the last 3 kb of the SeV reference genome (top) or the DVG-546 junction sequence (bottom). The numbers on the left-side axis of the graphs represent the total number of reads at the position with the highest coverage. Colored nucleotides beneath the blue histogram represent the consensus sequences of cbDVG junctions identified by VODKA. The size of the letter represents the degree of conservation at the site, and if there were mutations at a certain position the most representative nucleotide is listed on top. Black nucleotides beneath colored nucleotides indicate the sequence of the DVG-546 junction region obtained by Sanger sequencing. The grey boxed areas in the graphs mark the region after the rejoin point and the pink boxed area the DVG region before the break point. (D) A549 cells were infected with RSV stocks1-7. For stocks1-6, RNA was extracted from the supernatants of the infected cells at 48 h post infection, followed by RNA-seq and VODKA screening. DVG junction reads were then mapped to the

Identification of specific RSV genomic regions responsible for cbDVG generation
To determine if candidate hotspots were involved in cbDVG generation, we selected the region containing the most break points (Break1, dark grey in Fig 1D), or the most rejoin points (Rejoin1+Trailer, light grey in Fig 1D) for further testing using a minigenome system [24]. We constructed an RSV minigenome backbone (BKB) that included the reporter gene mKate2 for flow cytometry quantification of transcripts produced by the viral polymerase. In addition, the minigenome BKB included restriction enzyme sites to insert the selected Break1 and/or Rejoin1 regions (Fig 2A). The goal was to use this system to establish whether sequences in the candidate break and rejoin regions altered the polymerase elongation capacity, eventually leading to the generation of cbDVGs. The strategy used for detection of cbDVGs is illustrated in S1 Fig. As illustrated in Fig 2B, in this system mKate2 expression should only occur if the viral polymerase replicates the entire minigenome sequence from the trailer to the leader. Co-transfection of the minigenome construct with the four helper plasmids expressing the polymerase proteins (L, P, NP, and M2-1) resulted in mKate2 expression in 8-17% of the cells, whereas no mKate2 expression was detected in control transfections that lacked the viral polymerase (Fig 2C; -Vp). Constructs containing only the Rejoin1 sequence led to similar mKate2 expression as the BKB construct, whereas constructs containing Break1 caused a ~30% reduction in mKate2 expression (Fig 2D and 2E). We verified that the difference in mKate2 expression among transfections with different constructs was not due to variable transfection efficiency. These results are consistent with the concept that during cbDVG generation, the viral polymerase falls off the template at the break region, leading to a reduced amount of newly synthesized template available for mKate2 transcription.
To formally assess whether candidate break and rejoin sequences lead to cbDVG formation, we cloned the designated Pair1 composed of Break1/Rejoin1 into the minigenome system. Upon transfection, we observed that the construct containing Pair1 led to a degree of mKate2 expression similar to that of the construct bearing Break1 alone (Fig 3A and 3B). We also observed two major amplicons (white arrowheads in Fig 3C), both of which were absent in cells transfected with the construct bearing Break1 alone. These two amplicons contained cbDVGs that were confirmed by conventional Sanger sequencing (S3A Fig). The individual break and rejoin points of these minigenome-generated cbDVGs are indicated in Fig 3D. Interestingly, the rejoin points clustered in close proximity to the rejoin points that we identified from in vitro infected cells. Taken together, these data demonstrate that RSV cbDVG rejoin points fall into a discrete region of the viral genome, which is critical for cbDVG generation.

Specific nucleotide composition determines the position for cbDVG rejoin
Since the late Rejoin1 + early Trailer region of the RSV genome was highly enriched with DVG rejoin points relative to other regions in the RSV genome, we then examined which specific features within this region impacted cbDVG generation. This region is within one of the most A and U enriched areas of the RSV genome (Fig 4A), suggesting that nucleotide composition might play a role in directing cbDVG formation. To avoid affecting the L "gene end" signal and the genome trailer region, we chose to mutate six nucleotides at the beginning of this rejoin region (nucleotide positions 191-186 from the 3' end of the antigenome) to either all Us (named GC>Us) or all GCs (named AU>GCs). We then used RT-PCR (DI-1/DI-R primer set) to detect cbDVG-like fragments formed in cells co-transfected with the GC>Us or AU>GCs mutant constructs and polymerase-expressing plasmids, as described earlier.

antigenome of the RSV A2 reference strain. Blue histogram shows synopsis of total coverage at any given position of the last 3 kb of the RSV reference genome. The number on the right side of the graph represents the total reads at the position with highest coverage. Based on the VODKA output, individual peaks were identified as break or rejoin regions indicated by red or black down-facing arrows, respectively. For stock7, RNA was extracted from infected cells at 24 h post infection, followed by DVG-specific RT-PCR using primers DI1 and DI-R (see S1 Table) that capture cbDVGs larger than 453 bp and with the break after DI1 and the rejoin after DI-R. Based on conventional cloning and Sanger sequencing of PCR products, the break and rejoin points of cbDVGs detected in this stock are marked by red and black up-facing arrows, respectively. Dashed lines indicate join sequences from individual cbDVGs in stock7. Light and dark grey shades represent the rejoin and break regions that were cloned into the RSV minigenome backbone for further analysis in Fig
Going from 5' to 3': T7 promoter, hammerhead ribozyme, complementary sense RSV leader sequence (LeC), NS1 gene start sequence, NS1 non-coding sequence, gene encoding mKate2 [45], L gene start sequence (GS), L gene end sequence (GE), restriction enzyme sites, complementary sense trailer sequence (TrC), hepatitis delta virus ribozyme, and T7 terminator. All constructs were generated in a pSl1180 vector and included three restriction enzyme sites that were used for ligating Break1 and Rejoin1 as indicated. The length of all minigenome constructs after the T7 promoter and before the T7 terminator obeyed the rule of six. Red arrows indicate the primers that were used for detecting cbDVGs by PCR. One primer was within the viral trailer and the other primer was designed against the 3' end of the break region. (B) Workflow of the RSV minigenome system. BSR-T7 cells were transfected with expression vectors encoding the polymerase genes M2-1, L, NP, P, and the minigenome construct to be tested. The minigenome was then transcribed by the T7 polymerase, generating a template for amplification by the viral polymerase. If the paired junction region was not functional, the viral polymerase continued to elongate from the 3' TrC to the 5' LeC, exposing the promoter in Le for transcription leading to mKate2 expression. If break and rejoin sequences were functional, the viral polymerase recognized those signals, resulting in dropping off from the template, generation of cbDVG-like fragments, and reduction of mKate2 expression. Since DVG junctions are split into two non-contiguous regions (break and rejoin) in the minigenome plasmid, detection of the end-to-end DVG junction regions via DI-RT-PCR is a second indication of a productive junction region. (C) BSR-T7 cells were transfected with BKB either with or without the polymerase expression plasmids. One representative plot of mKate2 expression determined by flow cytometry is shown. (D-E) BSR-T7 cells were co-transfected with all 4 helper plasmids as well as either a BKB construct or a construct including the candidate signal sequences. mKate2 expression was measured by flow cytometry. The quantification of all repeats is shown as a percentage in D (F test P = 0.0028). Representative flow plots are shown in E.

Mutant GC>Us generated a dominant amplicon (lane 2, Fig 4B) that was absent in cells transfected with mutant AU>GCs (lane 4, Fig 4B). From sequencing PCR products within the areas marked by asterisks in Fig 4B, we identified five distinct rejoin points from mutant AU>GCs and three from mutant GC>Us (Fig 4C). Compared to WT Pair1, mutant GC>Us did not generate rejoin points proximal to the mutated region (grey area in Fig 4C), whereas mutant AU>GCs still produced cbDVG-like fragments at the mutated area.
To rule out bias due to primer location, we designed two additional forward primers to detect cbDVG-like fragments from the same samples. Rejoins detected with DI-F2 primers are identified with red arrows in Fig 4C, while rejoins detected with DI-F3 are indicated with blue arrows. Transfections with mutant GC>Us resulted in one strong amplicon while no predominant amplicon was observed in transfections with mutant AU>GCs (Fig 4D), agreeing with results obtained using the DI-F1 primer set. Sequencing confirmed that the strong amplicons produced by all three different primer sets in transfections with mutant GC>Us were analogous cbDVG-like fragments and shared their break and rejoin points (sequence in S3 Fig, DVG 303bp). To examine if the observed lack of a predominant product resulting from mutant AU>GCs was due to a general reduction of the replication ability of the viral polymerase induced by the mutations, we introduced the same mutations in the construct with Rejoin1 alone and examined mKate2 expression by flow cytometry. We found no significant differences between Rejoin1 and the two mutants, Rejoin1-GC>Us and Rejoin1-AU>GCs. None of these constructs reduced mKate2 expression compared to BKB (S4 Fig), suggesting that the function of the RSV minigenome system remained intact despite the mutations. Altogether, these data suggest that a minimal content of C nucleotides in the rejoin region determines if cbDVGs are produced at that particular genomic location.
Table). A representative gel picture is shown in C. Bands labeled with a white arrow were confirmed to correspond to cbDVG by conventional sequencing. Their sequence is shown in S3A

Table). Confirmed cbDVG-like fragments are indicated with asterisks next to the gel.

To determine if either of the two Cs within the mutated sequence was critical for cbDVG rejoining at this location, we performed a similar analysis using three new constructs: the first C at position 188 or the second C at position 186 from the 3' end of the antigenome mutated to U (named C188U or C186U, respectively), or both Cs mutated to Us (named AU). Transfection of the C186U, but not the C188U construct, resulted in one major DVG amplicon (indicated with an asterisk in Fig 4E; sequences in S3 Fig). The C186U construct rejoin points skipped the mutation area and concentrated in the early trailer region, similar to GC>Us. This was confirmed by the two other primer sets. A strong band shown in lane C188U at a high molecular weight (indicated with an arrowhead in Fig 3E) was determined by Sanger sequencing not to correspond to a cbDVG. The construct bearing the double mutation (AU) behaved similarly to C186U in terms of the rejoin positions (Fig 4C). Thus, we found the second C at position 15037 (position 186 from the 3' trailer end of the antigenome) to be critical for cbDVG generation.
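The two coordinate systems used above (distance from the 3' end of the antigenome versus absolute position) are related by a simple offset. The sketch below assumes an antigenome length of 15,222 nt, a value inferred here from the stated equivalence of position 186 from the 3' end and position 15037 rather than given in the text. The toy sequence and the helper names (`position_from_3prime`, `substitute_from_3prime`) are hypothetical and do not reproduce the actual RSV sequence.

```python
def position_from_3prime(genome_length: int, distance: int) -> int:
    """Convert a 1-based distance from the 3' end into a 1-based position."""
    return genome_length - distance + 1


def substitute_from_3prime(seq: str, distance: int, new_base: str) -> str:
    """Replace the base located `distance` nt from the 3' end (1 = last base)."""
    i = len(seq) - distance
    return seq[:i] + new_base + seq[i + 1:]


ANTIGENOME_LENGTH = 15222  # assumption: inferred from 186 <-> 15037
print(position_from_3prime(ANTIGENOME_LENGTH, 186))  # 15037
print(position_from_3prime(ANTIGENOME_LENGTH, 188))  # 15035

# Toy 200-nt tail with Cs placed 188 and 186 nt from the 3' end (not real sequence).
tail = ["A"] * 200
tail[200 - 188] = "C"
tail[200 - 186] = "C"
toy = "".join(tail)

c188u = substitute_from_3prime(toy, 188, "U")
c186u = substitute_from_3prime(toy, 186, "U")
double_mutant = substitute_from_3prime(c188u, 186, "U")
print(c186u.count("C"), double_mutant.count("C"))  # 1 0
```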

A conserved rejoin region determines cbDVG formation during viral infection
Next, to establish whether Rejoin1 impacts cbDVG generation during viral infection, we created a mutant virus harboring mutations identical to the GC>Us minigenome construct. This virus is herein identified as gRSV-FR-GC>Us. The backbone of the recombinant RSV (Line 19) included the mKate2 gene and we used mKate2 expression to estimate its replication. As shown in Fig 5A, cells infected with gRSV-FR-GC>Us expressed the same level of mKate2 protein as cells infected with the WT reporter virus (gRSV-FR-WT) at 72 h post infection. Both viruses began to generate cbDVGs at passage 3 (P3) and the pattern of cbDVGs was maintained, and became stronger, by P5 (Fig 5B). We verified that P5 gRSV-FR-GC>Us still carried the mutations we introduced (Fig 5C). Interestingly, gRSV-FR-WT produced 4 major DVGs, whereas gRSV-FR-GC>Us only generated one dominant cbDVG (asterisks in Fig 5B, confirmed sequence in S3G and S3H Fig), which is consistent with results from the minigenome system. The dominant cbDVG generated in cells infected with gRSV-FR-GC>Us rejoined at the early trailer region and skipped the mutation site, similar to what was observed in the minigenome system. Cells infected with gRSV-FR-WT produced one cbDVG that rejoined within the mutation site and three other cbDVGs that rejoined in the same region as that of the mutant virus (Fig 4D). Populations of cbDVGs with no rejoining at the mutation site were repeatedly observed upon independent passages of the mutant virus, although the specific DVG species varied among lineages (S5 Fig). These data further support a critical role of Rejoin1 in cbDVG generation.
The majority of rejoin points found in infection with gRSV-FR-WT, which derived from RSV Line 19, were located within the early trailer sequence, rather than around the mutation site as found during infection with RSV stocks1-7 derived from RSV strain A2 (Fig 5D). Alignment of both sequences revealed one natural mutation in RSV Line 19 that introduced three GCs right at the beginning of the trailer sequence, which are not present in RSV A2 (sequence indicated with a red horizontal line in Fig 5D). The increased GC content in this position in Line 19 likely explains why gRSV-FR-WT generates more cbDVGs at this location than RSV A2 stocks1-7. Regardless of this natural preference for rejoining in the early trailer, gRSV-FR-GC>Us diminished the rejoin signal at the mutation site, as no cbDVG rejoin points were found at this location, resulting in less diversity of cbDVG generation compared to the WT virus. Overall, these data confirm that the common rejoin region sequence tested in the minigenome system determines cbDVG rejoining during RSV infection and that the content of C nucleotides, and possibly G nucleotides, in this region critically determines the site of cbDVG rejoin.

Conserved cbDVGs form during natural RSV infections in humans
To examine whether the Rejoin1 region was utilized during natural infections, we applied VODKA to RNA-seq datasets obtained from RSV-positive pediatric samples. A total of 10 clinical specimens were sequenced; 4 were classified as DVG-low and 6 as DVG-high based on semi-quantification following cbDVG PCR. VODKA outputs were aligned to the reference genome of an RSV strain A isolate (Reference genome NCBI KJ672447, 2012) and showed that, consistent with previous cbDVG-RT-PCR results, samples from DVG-low patients (upper panel in Fig 6A) contained ~8-fold fewer cbDVG junction reads than DVG-high patients (lower panel in Fig 6A). In addition, coverage mapping showed the presence of multiple break and rejoin regions. Some of them were a mix of both break points and rejoin points (Fig 6A, red and black arrows). The rejoin points were particularly noteworthy because the majority of them clustered within one narrow AU-rich "Rejoin1+Trailer" region (red ticks in Fig 6B) similar to that identified in in vitro infections (blue ticks in Fig 6B). According to the frequency of different cbDVG junction positions, we illustrated the top 6 major cbDVGs (one of them is a snap-back) in Fig 6C (Table 1; break and rejoin positions shown as T'_W). All of them were found in multiple patients (Fig 6C and Table 1). The most abundant cbDVG again had the rejoin point within the "Rejoin1+Trailer" region, despite a higher diversity of rejoin points compared to in vitro infection. Taken together, these results demonstrate that a conserved rejoin region drives the generation of most cbDVGs during RSV infection in vitro and in vivo and that identical RSV cbDVGs are generated in different naturally infected individuals.
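Identifying cbDVGs shared across patients amounts to grouping junctions by their T'_W key and counting the samples in which each key appears. A minimal Python sketch follows, assuming per-sample lists of (break, rejoin) pairs; the function name `shared_species`, the sample labels, and all coordinates are hypothetical and are not taken from Table 1.

```python
from collections import defaultdict


def shared_species(junctions_by_sample: dict, min_samples: int = 2) -> dict:
    """Map each 'break_rejoin' key to the sorted samples that contain it,
    keeping only species found in at least `min_samples` samples."""
    samples_per_species = defaultdict(set)
    for sample, junctions in junctions_by_sample.items():
        for break_pos, rejoin_pos in junctions:
            samples_per_species[f"{break_pos}_{rejoin_pos}"].add(sample)
    return {
        key: sorted(samples)
        for key, samples in samples_per_species.items()
        if len(samples) >= min_samples
    }


# Hypothetical junction calls for three samples (not real data).
example = {
    "patient_A": [(100, 200), (150, 200)],
    "patient_B": [(100, 200)],
    "patient_C": [(300, 210)],
}
print(shared_species(example))  # {'100_200': ['patient_A', 'patient_B']}
```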

Discussion
DVGs are critical regulators of viral replication and pathogenesis in multiple RNA virus infections, but the mechanisms modulating their generation are unknown. Historically, DVGs were thought to result from random errors introduced by the viral polymerase during replication. However, mounting evidence indicates that the generation of cbDVGs is not totally stochastic. Specifically, we show that during RSV infection discrete hotspots in the viral genome mark sites for the viral polymerase to release and rejoin during cbDVG formation, both in vitro and during natural RSV infections in humans. Moreover, we show that the content of C nucleotides, and possibly G nucleotides, within the major rejoin hotspot critically impacts the generation of cbDVGs at that position. We also identified specific nucleotides that, when mutated, altered the ability of recombinant viruses to generate diverse species of DVGs. The identification of a specific sequence involved in cbDVG formation opens the unprecedented possibility of genetically manipulating the content of cbDVGs during infection. This possibility may significantly impact our ability to generate tools to further understand the role of these viral products in virus pathogenesis, as well as to potentially manipulate the cbDVG content for antiviral and/or therapeutic purposes.
In this study, we utilized a custom-designed algorithm, VODKA, to identify cbDVGs in infections in vitro or from children naturally infected with RSV. VODKA outputs were consistent with previous results obtained using classic DVG-RT-PCR and demonstrated a higher sensitivity in the detection of cbDVGs both in vitro and in clinical samples. False-positive DVG junction reads were ruled out by screening all reads aligned to the host (reads from the human transcriptome) using VODKA. This test resulted in a minimal number of hits compared to viral samples, adding to the evidence reported throughout this manuscript to support the specificity of cbDVG detection by VODKA. VODKA can successfully identify cbDVGs in a number of viruses, including SeV (Fig 1), offering a powerful tool for cbDVG detection in clinical samples. Furthermore, since cbDVGs, compared to other types of DVGs, have the most potent immunostimulatory function, VODKA can be used to identify candidates for development of novel cbDVG-based adjuvants.
Based on our data, we conclude that the rejoin position significantly influences cbDVG generation. One C nucleotide substitution alone can influence the location of the DVG rejoin point, implying that a strong rejoin signal likely needs an optimal number of C, and possibly G, nucleotides in specific locations. However, the total amount of cbDVGs produced and their immunostimulatory activity are not necessarily altered by the single C substitution in one rejoin hotspot, suggesting redundancy among rejoin hotspots in cbDVG generation. More research is needed to investigate whether mutations in other rejoin hotspots, alone or in combination, will alter the overall amount of cbDVGs and their function. Interestingly, the same differential distribution of cbDVG rejoin points was observed when we compared cbDVG generation from infections with RSV A2 and Line 19, which differ in their GC content at the beginning of the trailer region. This observation also implies that the preference of usage among different hotspots as cbDVG rejoin points may vary among different RSV subtypes. In addition, our data suggest that rejoin sequences influence the function of break signals when inserted as pairs in the construct. Our data are in agreement with data from in vitro infections with measles virus lacking the C protein, where break points of cbDVGs were widely distributed along the genome, whereas the rejoin points were clustered in a narrow region close to the 5' end of the genome [25].
Further investigation into the molecular details of how the viral polymerase recognizes these signals may lead to important insights about the mechanism involved in RSV replication and the generation of cbDVGs. A lower density of nucleocapsid proteins (NPs) at certain genomic locations has been shown to result in increased cbDVG formation in SeV infection [26]. However, the mutations described to be responsible for low NP density were absent in our SeV stocks, suggesting that alternative mechanisms are likely involved. The usage of C nucleotides as a signal closely resembles the recognition of "gene end" or "gene start" by the viral polymerase during transcription [27,28] and it would be intriguing to evaluate if the mechanisms of cbDVG generation and viral RNA transcription are related. Another factor influencing DVG accumulation is their length, which is tightly related to the spatial structure of the viral RNPs. In paramyxoviruses, although it is thought that "only genomes with hexametric or heptametric lengths are efficiently replicated" [29,30], some cbDVGs generated in vitro do not obey this rule [25,31,32]. For RSV, we observed that a number of cbDVGs do not follow the rule of six or seven. Nonetheless, cbDVGs of certain lengths may have increased replication efficiency and thus an enhanced fitness advantage. Interestingly, in our minigenome system, although cbDVGs from Pair1 contained the expected rejoin point positions, break points frequently fell into a region further ahead of Break1, suggesting that the distance between the break and rejoin points may also play a role in determining where the break position is. In addition to genomic sequences, other factors, such as viral proteins, likely play an important role in DVG generation. For instance, influenza viruses harboring a high-fidelity polymerase generate fewer deletion DVGs [33].
Mutations in non-structural protein 2 of influenza have also been shown to increase the de novo generation of DVGs by altering the fidelity of viral polymerase [34]. Host factors may be essential contributors to DVG generation as well [10]. For example, vesicular stomatitis virus produces a large amount of snap-back DVGs in most cell lines, except human-mouse somatic cell hybrids, and this cellular attribute was mapped to human chromosome 16 [35]. Similarly, infection with measles virus did not show de novo generation of defective interfering particles (DIPs) in human WI-38 cells and SeV did not produce cbDVGs in chicken embryo lung cells [36,37]. Despite the potential importance of these additional factors on DVG generation, the current work represents a major paradigm shift with the identification of sequences that regulate cbDVG formation.
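The length constraint discussed earlier in this section (the rule of six or seven) amounts to a divisibility check on the cbDVG length. The short Python sketch below illustrates it; the helper name `length_rules` is hypothetical, 546 is taken from the name DVG-546 on the assumption that it denotes the length in nucleotides, and 1000 is an arbitrary illustrative value.

```python
def length_rules(length: int) -> dict:
    """Report whether a genome length is a multiple of six and/or seven."""
    return {"rule_of_six": length % 6 == 0, "rule_of_seven": length % 7 == 0}


print(length_rules(546))   # {'rule_of_six': True, 'rule_of_seven': True}
print(length_rules(1000))  # {'rule_of_six': False, 'rule_of_seven': False}
```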
Remarkably, we found various common cbDVGs present in more than one patient, and at least one of those cbDVGs was also present in infections in vitro. These observations support a conserved origin for cbDVGs during infection and challenge the idea that DVGs occur as a random product of virus replication. To date, all studies on DVG biology have been correlative in nature. This work opens up new areas of investigation and may ultimately allow us to manipulate the ability of viruses to produce DVGs, providing a powerful tool to study the role of DVGs in viral pathogenesis.

Ethics statement
Studies of human samples were approved by the University of Pennsylvania Institutional Review Board. The embryonated chicken eggs used in these studies were 10 days old and were obtained from Charles River.

Cells and viruses
A549 cells (human type II alveolar cells, ATCC, #CRM-CCL185) and HEp2 cells (HeLa-derived human epithelial cells, ATCC, CCL23) were cultured at 7% CO2 and 37˚C in Dulbecco's modified Eagle's medium supplemented with 10% fetal bovine serum (FBS), 1 mM sodium pyruvate, 2 mM L-Glutamine, and 50 mg/ml gentamicin. BSR-T7 cells (hamster kidney cells, BHK cells constitutively expressing the T7 polymerase, provided by Dr. Christopher Basler's lab at Icahn School of Medicine) were maintained in 10% FBS DMEM with 1 mg/ml Geneticin (Invitrogen). All cell lines were treated with mycoplasma removal agent (MP Biomedicals) and routinely tested for mycoplasma before use. Sendai virus Cantell stock (referred to as SeV HD, containing a high DVG particle content) was prepared in embryonated chicken eggs as described previously [7,38]. The SeV HD stock used in these experiments had a high infectious to total particle ratio of 500:15,000. RSV-HD stocks 1-7 (stocks of RSV derived from strain A2, ATCC, #VR-1540, with a high content of cbDVGs) were prepared and characterized as described previously [6,39] in MAVS KO (3 lineages, stock1-3), STAT1 KO (3 lineages, stock4-6), and WT A549 cells (1 lineage, stock7), respectively. Briefly, RSV was fixed-volume passaged until stocks accumulated a high content of cbDVGs. The cell lines were kindly provided by Dr. Susan Weiss (University of Pennsylvania).

Plasmids
Mammalian expression vectors for RSV N (NR-36462), P (NR-36463), M2-1 (NR-36464), and L (NR-36461) proteins, and the RSV reverse genetic backbone pSynkRSV-line19F (rRSV-FR, NR-36460) were obtained from BEI Resources. Detailed information on the constructs can be found in reference [40]. The backbone plasmid of the RSV minigenome used for testing various DVG junction regions was constructed by cloning two regions amplified from pSynkRSV-line19F into the pSL1180 vector. The first region included a T7 promoter, a hammerhead ribozyme, the RSV leader sequence, and the gene encoding monomeric Katushka 2 (mKate2), while the second region included the RSV trailer sequence, a hepatitis delta virus ribozyme, and a T7 terminator. These regions were sequentially cloned into the pSL1180 vector using restriction enzyme pairs SpeI/SandI and SandI/EcoRI, respectively. The potential cbDVG break and/or rejoin regions (positions in S1 Table) were then inserted between those two regions using restriction enzyme pairs NotI/SandI and SandI/SpaI, respectively. A detailed scheme of the construct can be seen in Fig 2A. Pair1 and Rejoin1 mutations were introduced using the commercial site-directed mutagenesis kit QuikChange II XL (Agilent, CA) according to the manufacturer's protocol. All primers used for cloning are listed in S1 Table. Mutations in the reverse genetic backbone pSynkRSV-line19F were generated by fusion PCR using primers in S1 Table as previously described [41].

Nasopharyngeal aspirates
Nasopharyngeal aspirates from pediatric patients were obtained from the Clinical Virology Laboratory of the Children's Hospital of Philadelphia. All samples used were banked samples obtained as part of standard testing of patients. Samples were de-identified and sent to our lab for RNA extraction and cbDVG detection as indicated below.

RNA extraction and DVG-RT-PCR
Total RNA was extracted using TRIzol or TRIzol LS (Invitrogen) according to the manufacturer's specifications. For detection of RSV DVGs in RSV infection, 1-2 μg of isolated total RNA was reverse transcribed with the DI1 primer using the SuperScript III reverse transcriptase (Invitrogen) without RNase H activity to avoid self-priming. Recombinant RNase H (Invitrogen) was later added to the reverse transcribed samples and incubated for 20 min at 37˚C. DVGs were partially amplified using both the DI1 primer and the DI-R primer. The temperature cycle parameters used for the cbDVG-PCR in a BioRad C1000 Thermal Cycler were: 95˚C for 10 min and 33-35 cycles of 95˚C for 30 sec, 55˚C for 30 sec and 72˚C for 90 sec, followed by a hold at 72˚C for 5 min. The ramp rate of all steps was 3˚C/sec. A detailed method can be found in [6]. For detection of cbDVGs in the RSV minigenome system, extracted RNAs were treated with 2 μl TurboDNaseI (Invitrogen) for 15 min at 37˚C, followed by reverse transcription. The same procedures as above were used, except that the DI1 primer was replaced with the DI-F1, DI-F2, and DI-F3 primers, each of which was paired with the DI-R reverse primer to amplify PCR products of different sizes. Sequences of all primers are listed in S1 Table.

RT-qPCR
Total RNA (1 μg) was reverse transcribed using the high capacity RNA to cDNA kit from Applied Biosystems. cDNA was diluted to a concentration of 10 μg/μl and amplified with specific primers in the presence of SYBR green (Applied Biosystems). qPCR reactions were performed in triplicate using specific primers and the Power SYBR Green PCR Master Mixture (Applied Biosystems) in a Viia7 Applied Biosystems Light-cycler. Gene expression levels of RSV G were normalized to the GAPDH copy number. Sequences of primers used in these studies can be found in S1 Table.

RNA-Seq
RNA-Seq for SeV Cantell and RSV HD stocks 1-6 was performed as previously described [42]. RNA was extracted using TRIzol reagent and was re-purified using the PicoPure RNA isolation kit (Thermo Fisher Scientific). RNA quality was assessed using the RNA Pico 6000 module on an Agilent Tapestation 2100 (Agilent Technologies) prior to cDNA library preparation. For the SeV and RSV stock RNA-Seq datasets, total cDNA libraries were prepared starting from 75 ng (SeV Cantell) or 450 ng (RSV HD stocks) of extracted raw RNA using the Illumina TruSeq Stranded Total RNA LT kit with Ribo-Zero Gold, according to the manufacturer's instructions. Samples were run on an Illumina NextSeq 500 to generate 75 bp, single-end reads, resulting in 21-53 million reads per sample, with an average Q30 score ≥ 96.8%. For sequencing of samples from RSV-positive patients, including 4 DVG-low patients and 6 DVG-high patients, 100-450 ng of extracted raw RNA was used for cDNA library preparation with the same kit as above. Samples were run on an Illumina NextSeq 500 to generate 150 bp, paired-end reads, resulting in 60-170 million reads per sample, with an average Q30 score ≥ 84.6%. To analyze genomic AU-content relative to DVG break and rejoin points, we calculated the percentage of A or U nucleotides over sliding windows of 40 bases using the Python programming language (Python Software Foundation, https://www.python.org/). We plotted AU-content and cbDVG rejoin points in R using the ggplot2 package [43].
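
The sliding-window AU-content calculation can be illustrated with a minimal sketch. This is not the script used in the study: the function name `au_content`, the step size of one base, and the treatment of T as U (for cDNA-derived sequences) are assumptions made only for illustration.

```python
def au_content(seq, window=40):
    """Return the percentage of A/U bases for every window start position.

    Assumptions (not from the original analysis): windows advance one base
    at a time, and T is treated as U so cDNA sequences can be used directly.
    """
    seq = seq.upper().replace("T", "U")
    if len(seq) < window:
        return []
    is_au = [1 if base in "AU" else 0 for base in seq]
    count = sum(is_au[:window])              # A/U count in the first window
    percentages = [100.0 * count / window]
    for i in range(window, len(seq)):
        # Slide by one base: add the entering base, drop the leaving base.
        count += is_au[i] - is_au[i - window]
        percentages.append(100.0 * count / window)
    return percentages


if __name__ == "__main__":
    # Toy sequence only; a real analysis would read the viral reference genome.
    toy = "AUGC" * 20
    print(au_content(toy, window=40)[:3])
```

The running count avoids rescanning each window, so the cost stays linear in genome length; the resulting list can be exported for plotting alongside break and rejoin positions.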

Viral Opensource DVG Key Algorithm (VODKA)
Based on our in vitro RSV experiments, we assumed that most cbDVGs are generated from viral sequence near the 5' end of the genome (close to the Trailer sequence). Therefore, starting with the last 3 kb of a reference viral genome, we built an index of potential DVG sequences by taking all possible combinations of two non-overlapping segments of L bases, where L is the read length. The segments are linked by reverse complementing the second segment (C-D) and adding the first segment (A-B) to it (S6 Fig). Sequenced reads are aligned to the potential DVGs using bowtie2 [44] and subsequently undergo two filtering steps. First, reads are removed unless they map across a breakpoint (A_C) with at least 15 bp of mapped sequence on each side. Second, reads that map cleanly to the reference genome are filtered out. The pipeline outputs read counts for each breakpoint (A_C). To be consistent with the structure of copy-back DVGs in Fig 1A, A is equivalent to break point T' and C is equivalent to rejoin point W. VODKA output reads were further aligned to reference viral genomes (RSV A2: NCBI accession number KT992094.1; RSV 2012 clinical isolate: NCBI accession number KJ672447) or the known SeV DVG-546 to identify the potential DVG junction regions using the Geneious 7.0 software.
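
The index construction and the two filters described above can be sketched as follows. This is a toy illustration, not the VODKA code: all function names are hypothetical, the candidate is assumed to be the reverse complement of segment C-D followed by segment A-B (one reading of the description above), positions are 0-based, and exact substring matching stands in for bowtie2 alignment.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_candidates(genome, seg_len, tail=3000):
    """Map (a, c) start positions to candidate copy-back junction sequences.

    Only the last `tail` bases are used. Each candidate is assumed to be
    revcomp(segment C-D) + segment A-B, so the junction sits at offset
    `seg_len` within the candidate.
    """
    offset = max(0, len(genome) - tail)
    region = genome[offset:]
    starts = range(len(region) - seg_len + 1)
    candidates = {}
    for a, c in product(starts, repeat=2):
        if abs(a - c) < seg_len:             # segments overlap: skip
            continue
        first = region[a:a + seg_len]        # segment A-B
        second = region[c:c + seg_len]       # segment C-D
        candidates[(offset + a, offset + c)] = revcomp(second) + first
    return candidates


def spans_junction(read_start, read_len, junction, min_flank=15):
    """Filter 1: the read must cover at least `min_flank` bases per side."""
    left = junction - read_start
    right = read_start + read_len - junction
    return left >= min_flank and right >= min_flank


def maps_cleanly(read, genome):
    """Filter 2 stand-in: exact match to either strand of the reference."""
    return read in genome or revcomp(read) in genome


if __name__ == "__main__":
    toy_genome = "ACGTTGCAAC"
    index = build_candidates(toy_genome, seg_len=3, tail=10)
    print(len(index), index[(0, 5)])
```

The number of candidates grows quadratically with the size of the indexed region, so this sketch is suitable only for toy inputs; restricting the index to the last 3 kb of the genome, as described above, keeps that growth bounded.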

RSV minigenome and reverse genetics system
BHK cells constitutively expressing the T7 polymerase (BSR-T7 cells) were transfected with different minigenome constructs, gRSV-FR-WT, gRSV-FR-GC>Us, or gRSV-FR-AU>GCs as well as the sequence-optimized helper plasmids encoding N, P, M2-1, and L, all under T7 control as described previously [40]. Cells were incubated with transfection complex (total plasmid: Lipofectamine = 1:3.3) for 2 h at room temperature and then overnight at 37˚C using Opti-MEM as medium. The following morning, the medium was replaced with antibiotic-free tissue culture medium containing 2% FBS. For minigenome experiments, cells were harvested at 48 h post-transfection for either RNA extraction or flow cytometry. For mutant virus production, cells were maintained and split every 2-3 days until cytopathic effects (CPEs) were observed. Viruses were then collected and blindly passaged in HEp2 cells three times to obtain P3. P3 was titrated and passaged two more times at an MOI of 10 to generate P4 and P5.

Flow cytometry
Transfected BSR-T7 cells were trypsinized 48 h post-transfection and were either directly diluted in FACS buffer (PBS containing 2% FBS and 20 mM EDTA) or stained with aqua LIVE/DEAD. Cells were washed twice in FACS buffer before flow cytometry analysis on an LSRFortessa (Becton Dickinson). Data analysis was performed using FlowJo version Legacy.

Statistical analysis
All statistical analyses were performed with GraphPad Prism version 5.0 (GraphPad Software, San Diego, CA) and R v3.4.1. A statistically significant difference was defined as a p-value <0.05 by one-way analysis of variance (ANOVA) with a post hoc test to correct for multiple comparisons (based on specific data sets as indicated in each figure legend).

Code availability
The VODKA algorithm is open-source and available at: https://github.com/itmat/VODKA.

Data availability
All data are available upon request to the corresponding author. Raw RNA-Seq data of FISH-FACS sorted SeV-infected cells and RSV-infected samples have been deposited in the Gene Expression Omnibus (GEO) database for public access (SeV: GSE96774; RSV: GSE114948).