Common position of indels that cause deviations from canonical genome organization in different measles virus strains

The canonical genome organization of measles virus (MV) is characterized by total size of 15 894 nucleotides (nts) and defined length of every genomic region, both coding and non-coding. Only rarely have reports of strains possessing non-canonical genomic properties (possessing indels, with or without the change of total genome length) been published. The observed mutations are mutually compensatory in a sense that the total genome length remains polyhexameric. Although programmed and highly precise pseudo-templated nucleotide additions during transcription are inherent to polymerases of all viruses belonging to family Paramyxoviridae, a similar mechanism that would serve to non-randomly correct genome length, if an indel has occurred during replication, has so far not been described in the context of a complete virus genome. We compiled all complete MV genomic sequences (64 in total) available in open access sequence databases. Multiple sequence comparisons and phylogenetic analyses were performed with the aim of exploring whether non-recombinant and non-evolutionary linked measles strains that show deviations from canonical genome organization possess a common genetic characteristic. In 11 MV sequences we detected deviations from canonical genome organization due to short indels located within homopolymeric stretches or next to them. In nine out of 11 identified non-canonical MV sequences, a common feature was observed: one mutation, either an insertion or a deletion, was located in a 28 nts long region in F gene 5′ untranslated region (positions 5051–5078 in genomic cDNA of canonical strains). This segment is composed of five tandemly linked homopolymeric stretches, its consensus sequence is G6-7C7-8A6-7G1-3C5-6. Although none of the mononucleotide repeats within this segment has fixed length, the total number of nts in canonical strains is always 28. These nine non-canonical strains, as well as the tenth (not mutated in 5051–5078 segment), can be grouped in three clusters, based on their passage histories/epidemiological data/genetic similarities. There are no indications that the 3 clusters are evolutionary linked, other than the fact that they all belong to clade D. A common narrow genomic region was found to be mutated in different, non-related, wild type strains suggesting that this region might have a function in non-random genome length corrections occurring during MV replication.


Background
Measles virus (MV) is an RNA virus with a singlestranded, negative sense, nonsegmented genome. It belongs to the genus Morbillivirus, family Paramyxoviridae. The MV genome contains six tandemly linked genes (N, P, M, F, H and L), separated by nontranscribed intergenic triplets. Genes are composed of open reading frames (ORFs) with 5′ and 3′ untranslated regions (5′ UTR and 3′ UTR, respectively). Six MV genes are flanked by a short leader transcriptional control region (TCR) at the 3′ end of the genome and a trailer TCR at its 5′ end. Although nearly 11 % of MV genome is composed of non-coding regions, the genome is arranged so that distances between ORFs are not longer than 160 nucleotides (nts). The only exception is the non-coding region between M and F genes' ORFs (M-F UTR). Its length is 1012 nts, which is 6.4 % of the total MV genome length. M-F UTR is composed of the two by far longest untranslated regions, M gene 3′ UTR and F gene 5′ UTR, 426 and 583 nts long, respectively, and intergenic triplet (Additional file 1: Table  S1). Although much investigated, the precise function of this region in MV [1][2][3], as well as in other Morbilliviruses (i.e. canine distemper virus [4,5] and peste des petits ruminants virus [6]), is not well understood. M gene 3′ UTR and F gene 5′ UTR are not essential for MV per se, but they modulate the production of M and F proteins and influence virus replication and cytopathogenicity [3]. The suggested mechanisms include mRNA stabilization and regulation of translation [3]. Furthermore, M-F UTR is among the most variable regions in the MV genome [7][8][9].
As with other members of the family Paramyxoviridae (which now comprises solely genera formerly belonging to subfamily Paramyxovirinae), MV replicates efficiently only when the nucleotide length of its genome is an even multiple of 6, a requirement called the "rule of 6" [10,11]. Each nucleoprotein (N) in the viral ribonucleoprotein complex interacts with exactly 6 nts. During copying, viral polymerase "sees" the nts in the context of N. Interaction points N1-N6 are not equivalent, as particular nts that are part of signals for polymerase can be recognized only if they are positioned in a proper N subunit point, a phenomenon called "N phase context" or "hexamer phasing" [11,12]. With the exception of the position of the F gene start, the phase of the transcription start sites of each gene is strictly conserved between the morbilliviruses [11,13].
The canonical MV genome organization (Additional file 1: Table S1) is characterized by its total size of 15 894 nts and precisely defined length of every genomic region [14]. We have previously described wild type measles virus strains with deviations from canonical genome organization (strains possessing insertions and deletions of one or few nts, leading to a change in the N phase context within some genomic regions, but not differing in total genome size) [8,15]. Since 2009, measles strains with genomes extended by 6 nts in total have been detected in the USA [16] and Europe ( [7], strain presented in this paper).
Like other RNA/DNA polymerases, paramyxoviral RNA-dependent RNA polymerases (RdRp) have the propensity to mistakenly insert or delete nts within homopolymeric tracts [17]. Should this happen during virus replication, it would lead to a change of total RNA length and deviation from the rule of 6. This divergence can be corrected by compensatory insertions or deletions that restore the polyhexameric length. The occurrence of such counter-mutations has been shown in a few studies. Sequence analyses of recombinant human parainfluenza virus 2 (HPIV2) [17] and HPIV3 [18] rescued from cDNAs that did not conform to the rule of 6 showed that obtained viruses contained nucleotide insertions that corrected the length of the viral genome in such a manner that it became polyhexameric. Recombinant polyploid MV containing foreign gene construct that disabled virus replication, accumulated nucleotide insertions that inactivated the foreign gene expression and possessed compensatory deletions that restored polyhexameric genome length [19].
Although programmed and highly precise pseudotemplated nucleotide additions during transcription are inherent to polymerases of all viruses belonging to the family Paramyxoviridae, a similar mechanism that would serve to non-randomly correct genome length has so far not been described in the context of copying a complete virus genome. During transcription, pseudo-templated nucleotide additions occur in: (a) reiterative copying of short runs (4-7 nts long) of template uridylates in polyadenylation of viral mRNAs; and (b) mRNA editing, a cotranscriptional insertion of a single non-templated G, which happens with defined frequency during P gene transcription [20][21][22]. During mRNA editing, polymerase stutters at the sequence 3′-UUUUUCCC-5′ on the template strand (positions 2491-2498 on genomic cDNA) and inserts an extra G, leading to a frameshift and the production of the V protein mRNA. In Sendai virus, minigenomes whose lengths did not conform to the rule of six and which contained the P gene editing site underwent in vitro nucleotide insertions or deletions within the editing site that generated polyhexameric genome lengths [20]. In a complete infectious virus, the P gene editing site is unlikely to be used for this function, as this would alter the expression of P and V proteins [22].
In order to explore whether non-recombinant measles strains showing deviations from canonical genome organization possess a common genetic characteristic, which would suggest that genome length correction is not a random process, we compiled and analysed all complete MV genomic sequences available in open-access sequence databases till 05/05/2016. During multiple sequence analyses, we identified the strains with putative indels and analysed their positions. In 9 out of 11 identified non-canonical MV sequences, a common feature was observed: one mutation, either an insertion or a deletion, was located in a 28 nts long region in F gene 5′ UTR.

Compilation of genomic MV sequences
Sixty-four complete genomic MV sequences were retrieved from the GenBank database (Table 1). In addition, 52 partial (nearly complete) MV sequences spanning genomic region 5051-5078 were also compiled (Additional file 2).  [23].

RNA extraction and reverse transcription
RNA was extracted using the guanidinium isothiocyanatephenol-chloroform method [24]. Prior to reverse transcription, RNA was denatured at 70°C for 10 min and immediately cooled at 4°C. Reverse transcription was performed at 42°C for 60 min using M7 primer (5′-GGAGGAGCAGATGCAAGATA-3′) and SuperScript III reverse transcriptase (Thermo Fisher Scientific). Reaction mixture contained 3. After the initial denaturation step at 94°C for 5 min, 45 cycles at 94°C for 30 s, 50°C for 30 s and 72°C for 1 min were performed, followed by a terminal elongation step at 72°C for 10 min.
Purified PCR products were sequenced on ABI PRISM 3130 Genetic Analyzer (Thermo Fisher Scientific), according to manufacturer's instructions. Nucleotide sequences were deposited in GenBank under acc. nos. KF515521 and KF515522.
The R index was calculated by dividing the number of mononucleotide repeats identified in an individual genomic segment with the number of nts in that segment.
For visualization of variability in 64 different complete measles genome sequences a Web-based program Fingerprint was used [25] (http://evol.mcmaster.ca/fingerprint/). In this program, the variability of a genomic position is quantified by considering the number of different residues (1-4) occurring at that position.

MV phylogenetic analyses and genotyping
Maximum likelihood phylogenetic trees were generated using MEGA software, under the most appropriate model of nucleotide substitution determined with jModeltest v2.1.4. Bootstrap probabilities for 1 000 iterations were calculated to evaluate confidence estimates.
MV genotyping, based on the last 450 coding nucleotides of the N gene (N450), was performed according to WHO recommendations [26].

Results
Sixty-four complete genomic MV sequences, belonging to ten different MV genotypes (out of 24), were retrieved from the GenBank database (Table 1). Some sequences were obtained after the sequencing of different samples of the same viral strain (e.g. of samples differing in passage histories). In six instances identical sequences were deposited under different names and therefore our data set contained 54 different entries.

Measles virus strains with non-canonical genomic properties
In 11 different sequences (Table 2), deviations from canonical genome organization were identified: some regions are longer (for 1, 2 or 7 nts) or shorter (for 1 or 2 nts) due to indels.
Epidemiologically/ancestrally/based on common genetic characteristics, 10 of these 11 strains group into three clusters:  an interesting feature is that the genomes of all three of them are prolonged. M gene 3′ UTR is extended for 7 cytidines in region 4763-4744, so that a homopolymeric tract of 12 cytidine residues is created. F gene 5′ UTR is shortened for 1 nt, leading to a total genome length of 15 900 nts. We observed the same insertion and deletion in our wild type isolate MVi/Zagreb.CRO/19.08[D4].
The 11 th strain in which mutations were observed is a strain belonging to the Edmonston lineage, submitted to GenBank under the name Edmonston, acc. no. K01711 [31]. In addition to this strain, 12 other sequences included in our analysis belong to the Edmonston lineage. They represent various vaccine strains (or different seeds of a same vaccine) that have all originated from a single wild type isolate [32]. In none of these 12 remaining Edmonston sequences were deviations from canonical genome organization observed.

Genomic positions of indels
The positions of identified indels are presented in Table 2. Mutations occurred either in polyadenosine, polyguanosine or polycytidine stretches or in positions next to them (e.g. the position of insertion in strain WA.USA/17.98 is located immediately after 7 nts long polyadenosine stretch). In all strains compensatory mutations were identified and the rule of 6 was conformed to. In SSPE strain MVs/Zagreb.CRO/47.02/[D6] SSPE two sites of insertions of a nucleotide were identified. Deletion of two nts was detected in a single downstream region.
With the exception of Edmonston, in all strains insertions are in M-F UTR and deletions are either in F gene 5′ UTR or in F gene ORF. Deletions in F gene ORF caused frameshifts and led to truncations of the cytoplasmic tail of F protein's F1 subunit, a feature often found in SSPE strains. Besides the two SSPE strains MVs/Zagreb.CRO/47.02/[D6] SSPE and SSPE-Kobe-1, a deletion in F gene ORF was also detected in MVi/ Tokyo.JPN/37.99(Y) and MVi/Tokyo.JPN/37.99(Y)C7, viruses that descended from a wild type strain that had caused a lethal encephalitis.
Excluding MVs/Zagreb.CRO/08.03/SSPE and Edmonston, in all strains one of the indels is placed within the 28 nts long segment in F gene 5′ UTR, located at positions 5051-5078 in the genomic cDNA of canonical strains (shown in bold in Table 2). The only non-canonical strain in which one deviation is placed before and the other after 5051-5078 segment (i.e. RdRp did not insert compensatory mutation in this region during genome/ antigenome copying) is the SSPE strain MVs/Zagreb.-CRO/08.03/SSPE.
The specificities found in the Edmonston sequence were not detected in any other of analysed strains. It is  the only sequence where the insertion site is located in the leader region and the deletion site is placed in a region used for P mRNA polyadenylation. The insertion of an A in the leader sequence disrupts the highly conserved replication promoter element positioned within the N gene [22]. For morbilliviruses, this element has the sequence 3′-(C 1 n 2 n 3 n 4 n 5 n 6 ) 3 -5′ (numbers in superscript indicate N phase context; the element's position corresponds to region 79-96 in genomic cDNA, nts 79, 85 and 91 being Gs) [22]. The nucleotide at position 85 in the Edmonston genomic cDNA sequence is A. Furthermore, the insertion located in the leader region leads to a change of the N phase contexts of transcription start signals of the N and P genes and of the transcription stop signal of the N gene. The phasing of the mRNA editing site is also changed. None of these sites are found in a random N phase context within morbilliviruses [11].

Indels in 5051-5078 segment
The consensus sequence of the 5051-5078 segment in canonical strains is G 6-7 C 7-8 A 6-7 G 1-3 C 5-6 , the total number always being 28. The sequence of this region in 54 different MV strains is presented in Fig. 1. Non-canonical

Distribution of homopolymeric sequences in measles genomes
All mutations identified during our study occurred either in homopolymeric stretches or in positions next to them. In order to investigate the locations and distribution pattern of mononucleotide repeats in MV strains, we identified all positions where minimally 5 nts long mononucleotide repeats are present. Analysis included all 54 different complete genomic sequences and only repeats found in at least two non-temporally and non-geographically related strains were counted.
The total number of homopolymeric runs was 37, 28, 26 and 10 for polycytosines, polyguanosines, polyadenosines and polythymidines, respectively. The distribution of repeats is shown in Fig. 2. With the exception of M-F UTR, the only mononucleotide repeats found in noncoding regions are the ones used for pseudo-templated polyadenylation of mRNAs (Fig. 2a). Homopolymeric runs were identified throughout the entire genome length except in the first 1 000 nts (numbering corresponding to genomic cDNA) (Fig. 2b). Considering that individual genomic segments (i.e. the coding and two non-coding regions of each gene) have different lengths, we calculated the R index, which indicates the number of repeats relative to segment length. While the coding regions have an R index in the range of 0.004-0.007, the R index of M gene 3′ UTR and F gene 5′ UTR is 0.030 and 0.029, respectively. These segments are especially rich in polycytosine repeats (viewed in genomic cDNA; Fig. 2a).
Although quite a large number of homopolymeric runs were identified in M gene 3′ UTR and F gene 5′ UTR, 13 and 17 respectively, indels were found in no more than 9 of them. This indicates that not all parts of this long non-coding region can tolerate such mutations, despite the fact that it is among most variable parts of the genome (Additional file 4: Figure S2, [7][8][9] (strains with prolonged genome), created by the insertion of an additional 7 cytosines into a 5-cytosine stretch, is the longest mononucleotide repeat identified in any of the analysed strains.

Discussion
The complete genomic organization of MV was deduced in the late 1980s [33]. Unlike some other virus species belonging to the Paramyxoviridae family, which are known to possess few different genomic lengths (e.g. Newcastle disease virus, as well as other avian paramyxoviruses within the genus Avulavirus [34]), MV genomic length and organization was for a long time considered to be uniform [14].
Until 2012 (when sequences of MV strains with prolonged genomes were released) and the publication of Bankamp et al., which describes these viruses [16], only rarely were reports of strains possessing non-canonical genomic properties published [8,15], and even in those reports observed indels were mentioned only marginally.
Eleven complete genomic sequences with non-canonical properties analysed in this paper were submitted to open public databases by six different research groups (including ours), making it less likely that their specificities resulted from errors in RT-PCR or in sequencing. Ten of the 11 strains are grouped in three clusters. There are no indications that these clusters are somehow evolutionary linked, other than the fact that they all belong to clade D.
The 11 th non-canonical sequence was obtained from a sample containing the Edmonston strain. A suggestion that mistakes might have occurred during the sequencing of this sample, which was done in the 1980s and early 1990s, was made by Bankamp et al. [16] although sequence submitters claim otherwise (personal communication with M. Billeter). As this virus was extensively passaged in vitro, it is possible that this has led to the origin of the infectious Edmonston-lineage virus possessing such genomic sequence.
In nine non-canonical strains (all except Edmonston and MVs/Zagreb.CRO/08.03/SSPE), one of the genome editing sites is located within a 28 nts long segment in F gene 5′ UTR, which is composed of five tandemly linked homopolymeric stretches. None of these five stretches has a definite length in canonical strains. The mutations detected in this region include both insertions and deletions. Compensatory mutations (leading to the re-establishment of polyhexameric length) were located in adjacent regions, M gene 3′ UTR and F gene ORF, so that the N phase contexts of start and stop signals of downstream genes were not changed. During the preparation of this manuscript, we sequenced M-F UTR of a D8 wild type strain that circulated in Croatia in 2014-2015 (GenBank acc. nos. KX555602 and KX555601 for N450 and M-F UTR, respectively) and found that it also possess an insertion of a nucleotide in 5051-5078 segment. Accompanying deletion is at nucleotide position 4714 or 4715, in M gene's 3′ UTR (data not shown).
As discussed by Skiadopoulos et al. [17], the genome length correcting mechanism could operate by involving either (a) random length corrections, followed by a stringent selection for virus in which the correction was close to the point of deviation, or (b) non-random length corrections, involving a replication complex that "senses" the deviation from the rule of 6 and acts to insert a correcting mutation at a second, downstream site in the nascent molecule. Our analysis favours the second hypothesis, as the same narrow genomic region was found to be mutated in different, non-related measles strains.
Indels detected in the sequence of Edmonston and MVs/Zagreb.CRO/08.03/SSPE show that also other mechanisms can be involved in genome length corrections. A similar result was obtained with recombinant MVs: Rager et al. [19] found that recombinant MV, with a foreign gene fused to H gene's C terminus, disabled the expression of the foreign gene due to the insertion of an A in an A 6 G 4 region. Compensating deletion occurred downstream, in the L gene coding region where an A was deleted from an A 5 G 4 sequence. Other clones carried an A deletion in a G 2 A 5 region of the foreign gene and the polyhexameric length was restored by the insertion of an A in different polyadenylation sites. None of the sites reported by Rager et al. [19] to be involved in genome length corrections were located within the 5051-5078 region.
Generally, studies that investigated genome length corrections of viruses belonging to the family Paramyxoviridae [17][18][19][20] reported that inserted or deleted residues were adenosines, uridines or guanosines. We found that cytidines can also be inserted, but this may be a consequence of insertion occurring during the synthesis of antigenomic RNA. Skiadopoulos et al. [17] proposed a hypothesis that the fact that they found only adenosines and uridines to be inserted or deleted might simply reflect a lower content of homopolymeric guanosines and cytidines in the regions most amenable to accepting a length correction, namely the non-translated regions and intergenic regions. With the exception of M gene 3′ UTR and F gene 5′ UTR, MV non-coding regions are relatively short and do not contain homopolymers other than polyuridylates used for polyadenylation of viral mRNAs. In contrast, M gene 3′ UTR and F gene 5′ UTR are the regions with the largest numbers of homopolymers relative to their length. Even when the absolute numbers of homopolymers are compared, the only region with more mononucleotide repeats is the 6.5 kb long L gene ORF. Therefore, it is not surprising that nearly all of identified indels were in M-F UTR.
Mononucleotide repeats are generally considered to be exceptionally unstable genetic elements, prone to indels [35]. In most bacterial genes they are underrepresented in coding regions [36,37], as they lead to high error rates of transcription [38] and translation [39]. The finding that 9 out of 10 wild-type non-canonical strains possess an indel within the same 28 nts long region was rather unexpected, as 26 other homopolymeric runs (of length ≥5 nts) were identified in M-F UTR, outside the 5051-5078 segment.
Presumably, MV has maintained a significant non-coding nucleotide sequence content for its functionally important regulatory elements. Known MV regulatory sequences (summarized in Parks et al. [40]) located within non-coding regions are promotor sequences, TCRs at genomic ends, gene end and gene start sequences, as well as intergenic regions that guide transcription termination and reinitiation. A specific regulatory function of F gene's 5′ UTR is its involvement in the determination of AUG that is used as the F protein start codon [2].
Since the compact genomic organization and highcoding capacity of genes offer a selective advantage for rapidly replicating RNA viruses [41], long, highly variable M-F UTR is likely to be present and evolutionary preserved because of its functionally important (and yet unknown) regions.