Genetic diversity of currently circulating rubella viruses: a need to define more precise viral groups

Recent studies have shown that the currently circulating rubella viruses are mostly members of two genotypes, 1E and 2B. Also, genetically distinct viruses of genotype 1G have been found in East and West Africa. This study used a Mantel test to objectively include both genetic diversity and geographic location in the definition of lineages, and identified statistically justified lineages (n=13) and sub-lineages (n=9) of viruses within genotypes 1G, 1E and 2B. Genotype 2B viruses were widely distributed, while viruses of genotype 1E as well as 1G and 1J were much more geographically restricted. This analysis showed that more precise groupings for rubella viruses are possible, which should improve the ability to track rubella viruses worldwide. A year-by-year analysis revealed gaps in surveillance that need to be resolved in order to support the surveillance needed for enhanced control and elimination goals for rubella.


INTRODUCTION
Rubella virus (RuV) is a positive-sense ssRNA virus in the genus Rubivirus within the family Togaviridae. The RuV genome is approximately 9762 nt in length and encodes five proteins, of which three are structural, the C or capsid and two envelope proteins, E1 and E2, and two are non-structural, P150 and P200.
Virological surveillance data on RuV are used to track progress towards goals for enhanced control and elimination, and possible eradication of rubella, to help with case classification, and to document transmission pathways. Although known RuVs comprise a single serotype, sufficient genetic variability exists to allow for virologic surveillance [1]. A systematic nomenclature for wild-type RuVs was established in 2006 [1,2]. Updates to the nomenclature were reported in 2007 and 2013 [3,4]. Currently, RuVs are divided into two clades (1 and 2), which differ by 8-10 % at the nucleotide level. The clades are sub-divided into genotypes, which are designated by letters, uppercase for accepted genotypes and lowercase for provisional genotypes. Clade 1 contains 10 genotypes (1a-1J), of which only 1a is provisional, and clade 2 contains 3 accepted genotypes (2A-2C). Initial phylogenetic studies on RuV were conducted using the sequences from all or most of the coding regions for the E1 protein, but a large majority of RuV sequences currently available are from a window of 739 nt (nucleotides 8731-9469) in the E1 coding region that was recommended by the World Health Organization (WHO) for routine molecular characterization [2]. In recent years, wholegenome sequencing has become more feasible [5][6][7], but the number of fully sequenced RuVs available in GenBank is still below 50. As new whole-genome sequencing techniques (socalled next-generation techniques) become more widely utilized, the availability of more whole genomes for RuVs will allow objective evaluation of other possible sequence windows.
Although 13 RuV genotypes are recognized, only 4 of the genotypes (1E, 1G, 1J and 2B) are now commonly detected [4,8]. Genotype 2A was last detected in 1980, except for several 2A vaccine strain isolates, [9] and genotypes 1D and 1I were last detected in the 1990s. Genotypes 1B, 1C, 1F, 1H and 2C were still detected in the first decade of this century, but no viruses of these genotypes have been detected since 2010. Genotype 1a, which is composed mostly of RuVs from the 1960s, including most of the vaccine strains, has been detected sporadically; however, it is possible that some of these detections are due to laboratory contamination, since most of the commonly used laboratory strains are also genotype 1a viruses. Of the four currently active genotypes, 1E and 2B have a wide geographic distribution and are frequently detected, with genotype 2B being the most widely distributed. The remaining two genotypes, 1G and 1J, appear to be more restricted geographically and are less frequently detected. Together, the number of sequences from these four genotypes represents over 70 % of all the sequences available for analysis, and intra-genotype diversity as high as 2.6 % has been observed for viral sequences of genotype 1G . It is clear that if classification of currently circulating viruses into more than four genotypes was possible, it would enhance the utility of the molecular epidemiology of RuV. To this end, a dataset of 1109, 739 nt-long RuV sequences (E1 gene) has been created and analysed to determine if the data support further sub-division of RuV genotypes based on sequence diversity and/or geographical distribution. Results reported here provide support for division of viruses of genotypes 1E, 1G and 2B into several defined lineages.

Genotype distribution in 2010-2014
Of the 1109 sequences in the full dataset, 635 represent viruses collected between 2010 and 2014. These 635 sequences represent viruses collected from 21 countries, with mainland China, Japan and Taiwan being the countries from which the most viruses were reported (Fig. S1, available in the online Supplementary Material). Genotype 2B is the most frequently represented (414 sequences) followed by 1E (198 sequences). Genotypes 1G and 1J are represented by only 17 and 6 sequences, respectively. Genotype 2B was reported in 20 countries on five continents. Genotype 1E was reported by nine countries, mainly from eastern Asia. Genotype 1J was rarely detected, with four countries reporting six sequences in total. Genotype 1G was reported by four countries, mainly from Africa. Epidemiologic data were available for 38 cases (Table S1).
Based on the available epidemiological data, RuVs of any genotype collected between 2010 and 2014 originated from at least 30 countries in Asia, Africa and Europe. A year-byyear analysis shows that at least 21 countries had clear molecular epidemiological evidence of circulation of rubella (Table 1), as RuV has been detected in at least 3 years of the last 5 or has been exported at least once. Among these countries, only China and Japan are consistently reporting virological data, with a mean of 40 and 30 virus sequences reported per year, respectively. In contrast, India, with a similar population size to China, reported only one viral sequence per year. Finally, there were no rubella genotypes from Africa reported in 2014, although rubella is endemic in many African countries based on incidence as determined by laboratory-confirmed cases [10] (Fig. S1). Concerning the genotype distribution, genotype 2B and 1E viruses were reported every year from 2010 to 2014 (Fig. S1). It is likely that detection of genotypes 1J and 1G is under-reported as they were predominantly found in countries with little or no rubella virologic surveillance.
Phylogenetic analysis of the four circulating genotypes: 1J, 1G, 1E and 2B Phylogenetic analysis of genotype 1J sequences showed that the six sequences of currently circulating viruses (identified with black circles in Fig. S2) cluster in one group that is part of a main lineage supported by a 69 % bootstrap value. The number of sequences available for genotype 1J is insufficient to further divide the genotype. Four of the six sequences were either detected in or known to be exported from the Philippines.
The number of sequences of genotype 1G is limited (77 sequences, 17 since 2010), but the genetic diversity between these sequences has a maximum of 5.4 % nucleotide divergence and an average divergence of 2.9 %. Fig. 1 illustrates the phylogenetic analysis, and Fig. S3 describes the analysis process for 77 genotype 1G sequences. Genetic distance matrices and geographic distance matrices for these 77 sequences were compared using a Mantel test (see Methods). The Mantel test showed a correlation between genetic and geographic distances (P=1Â10 À4 ). The topology of the tree identified three groups, 1G-L0, 1G-L1 and 1G-L2 (Table 2, Fig. 1). 1G-L0 is represented by sequences of viruses mainly collected in Europe between 1991 and 2008 and not currently known to be circulating. 1G-L1 is composed of sequences from viruses collected in or exported from West Africa, mostly Republic of Côte d'Ivoire but also Ghana, Nigeria and Cape Verde. This group was further divided into two sub-groups (1G-L1a and 1G-L1b) because of a strong bootstrap value (1G-L1a is supported by 98 % bootstrap) and co-circulation (RVs/Abobo-Est.CIV/18.12 of  Phylogenetic tree of RuVs of 1G genotype. A neighbour joining tree was generated with MEGA 6 using the maximum composite likelihood nucleotide substitution model [20][21][22]. The phylogenetic inference was tested with the bootstrap method with 1000 replications. Bootstrap values greater than 70 % are shown. Sequences of viruses collected after 2010 are identified with a black circle. The tree was rooted with RVi/Pennsylvania.USA/64VACC_JF727653.2. 1G-L1a and RVs/Abobo-Est.CIV/22.12 of 1G-L1b). 1G-L2 viruses, found in East Africa, are characterized with significant nucleotide diversity within the 739 nt sequencing window, with a mean intra-lineage distance greater than 2 %. Also, the Mantel test showed that the 1G-L2 sequences are geographically clustered (P=1Â10 À4 ). Based on the topology of the phylogenetic tree, we defined four groups: 1G-L2a found in Kenya, 1G-L2b found in Uganda, 1G-L2c found in Ethiopia and 1G-L2d found in Sudan. 1G-L2a and 1G-L2c are supported by bootstrap values of 99 and 93 %, respectively. A 1G-L2a virus was detected in Kenya in 2005. However, another 1G-L2a virus collected in the USA in 2010 (RVi/Yavapai.AZ.USA/4.10-JX477654) was also imported from Kenya, suggesting that rubella is endemic in Kenya. Lineage 1G-L2b is represented by viruses from Uganda. This sub-lineage showed some genetic diversity (with mean intra-group distances close to 1.5 %) and was further divided into three groups in a recent publication, where they were called 'lineages 1, 2 and 3' [11]. To be consistent with the proposed designations, these lineages were identified as 1G-L2b1, 1G-L2b2 and 1G-L2b3. show that rubella is endemic in Sudan. We expect that the designation of RuV lineages will be ongoing. A full description of the lineages circulating in Sudan will require many more viruses. ||'Within-group' mean genetic distance >1.5 %. There are not enough data to break the groups at the time of the analysis. ¶Smallest 'between-group' distance <1.5 %. Although the genetic distance is small between these two groups, these groups are kept to differentiate between currently circulating viruses and those collected at an earlier time.  ). Five groups were established based on the topology of the phylogenetic tree (Fig. S5). They were called 2B-L0, 2B-L1, 2B-L2, 2B-L3 and 2B-L4. The distance between groups was greater than 2.5 %, except for the distance between 2B-L0 and 2B-L1 (1.45 %). Although the distance is less than the 1.5 % cutoff, the distinction between lineages 2B-L0 and 2B-L1 is useful in that it allows focusing on the sequences of viruses collected since 2010. Lineage 2B-L0 is paraphyletic and likely extinct. Only six sequences of 2B-L1 were from viruses collected before 2010, the oldest sequence being from 2008. Lineage 2B-L1 is not geographically clustered (P=0.0164). These sequences are very homogeneous with a within-group distance of 0.7 %. In contrast, lineage 2B-L2 is genetically diverse with a within-group distance of 1.8 %. A Mantel test demonstrates some geographic clustering (P=1Â10 À4 ) within the 94 2B-L2 sequences. Three groups were established based on tree topology, called 2B-L2a, 2B-L2b and 2B-L2c. 2B-L2b and 2B-L2c are supported with bootstrap values greater than 80 %. All three 2B-L2 sub-lineages are likely to share a common ancestor originating from India, but epidemiologic and sequence data are insufficient to formerly assess the genetic origin of these lineages. Lineages 2B-L3 and L4 are mainly from eastern Asia.

DISCUSSION
To gain further insight into the genetic diversity of the currently circulating RuVs, we created a robust sequence dataset and analysed per cent identity, tree topology and geographical distribution. We found that for three of the four currently circulating genotypes, 1E, 1G and 2B, sufficient data exist to propose sub-dividing these genotypes into lineages, some of which also have evidence for geographic localization. The large number of sequences of genotypes 1E (460) and 2B (545) [14]. One or more of these lineages may be the same clusters identified by Zhu et al. [15], but without standardized naming, this is difficult to determine. Another example is the three lineages of viruses of genotype 2B suggested by Namuwulya et al. [11]. The three lineages were found within one branch of a 1Gonly phylogenetic tree and would, thus, be classified as sublineages using the naming strategy proposed here.
To help with future data analysis and definition of meaningful lineages, we described here a method for defining lineages. Although more than 2000 sequences of RuVs are available in GenBank, curation of the sequences was necessary to construct a robust dataset, mainly by elimination of duplicate sequences and multiple sequences derived from single outbreaks. Therefore, we chose to analyse two types of sequences. All unique sequences (unless they were found to contain errors such as deletions in coding regions or stop codons) and identical sequences from viruses were included only if they were collected at a different location or at least 2 weeks apart. Only sequences of currently circulating genotypes were analysed here, since these sequences are useful for molecular epidemiology. In this context, we did not analyse sequences of viruses belonging to genotypes that have not been reported for at least 10 years. Phylogenetic analysis was performed, and the resulting tree topology was examined to identify potential lineages. Since virologic surveillance for RuV is very incomplete, some expert opinion had to be used to organize sequences for analysis. For example, in a lineage composed of 10 sequences of viruses collected in West Africa and one sequence collected in the USA, where rubella has been eliminated, if there were no epidemiologic data directly linking the virus to West Africa, this sequence was deleted from the geographic analysis.
In order to better classify sequences belonging to the major genotypes, we suggest using three criteria: genetic diversity, geographic distribution and time. In general, the threshold for the level of genetic diversity within and between lineages is not standardized. We propose that a genetic distance of 1.5 % should be the cutoff for RuV sequences. Sequences within a lineage would differ by 1.5 % or less, while the distance between lineages would be greater than 1.5 %. The geographic distribution should be assessed in conjunction with genetic diversity. For example, a group of sequences could be genetically related by <1.5 %, but they could be distinguished by the location (i.e. where the group of viruses is circulating). The Mantel test compares genetic distance and geographic distance matrices. If these two matrices are correlated (i.e. if distant sequences are from viruses collected at distant locations and if close sequences are from viruses collected at close locations), then the P value from the Mantel test is very low (1Â10 À4 ). The Mantel test is often used in ecology to test the correlation between two types of distances [16]. We chose the Mantel test as it provides a statistic to measure objectively the likelihood of geographic clustering within a particular lineage. Finally, it is useful to identify genetically diverse viruses that are co-circulating.
Sequence data are much more useful if linked to meaningful metadata, and it is important that the metadata are as precise as possible. For example, knowing only the name of the country of detection is often not informative enough. Thus, it would be better to collect data at the smallest acceptable administrative level to estimate accurate latitude and longitude coordinates, which are required input data for the Mantel test computation. In addition, some lineages exist at a regional level (such as 2B-L1, found mainly in East Asia).
Considering latitude/longitude coordinates would help to analyse such lineages, as national borders are not very meaningful in the context of regional distribution. The metadata most useful for sequence analysis are often not the same as the metadata most useful for public health purposes. For example, sequences of viruses imported into a country may be useful to that country, but these sequences are best linked to the geographic origin of the virus when defining viral lineages.
The present analysis was somewhat limited by major gaps in surveillance for RuV. Rubella was declared to be eliminated in the Americas in 2015 (www.paho.org); thus, surveillance in the Americas should be maintained, as these countries serve as sentinels for the entire world. For example, an importation into the USA from Kenya provided evidence for endemic circulation of 1G-L2a in Kenya. Many countries where rubella is known to be endemic, based on incidence rates [10], are not reporting molecular data for RuV [17]. As more countries adopt elimination goals and approach elimination, surveillance for rubella, including molecular characterization, will hopefully improve. Regional and national commissions responsible for verification of rubella elimination require virologic surveillance data showing lack of an endemic genotype, which is consistent with elimination. When many more RuV sequences from most geographic regions are available (1000s of sequences for each genotype), the work described here is expected to be a part of a systematic phylogeographic analysis of RuVs, as has been done for other viruses [18]. If any currently rare RuV genotype experiences a resurgence, then sufficient viruses for that genotype may be available to be a part of this systematic phylogeographic analysis.  [19]. The alignment window was set to the 739 nt window within the E1 ORF used to genotype RuVs (2005). Duplicated sequences were removed only if they were from viruses collected at the same location and at the same time (epi week ±2). The exporting country was determined based on epidemiologic data, including, but not limited to, travel history and country of birth. Sequences of viruses for which importation was suspected, but with no epidemiologic information, were removed from the analysis (e.g. a new genotype appearing in a country with established RuV surveillance that had not detected that genotype previously but without further information). The final dataset contained 1109 sequences. The annotated dataset (with phylogenetic groupings) is available upon request.

Maps
Maps were generated in QGIS version 2.14.0. Shape files were downloaded from www.gadm.org/. Latitude and longitude were determined with www.findlatitudeandlongitude. com/. In cases where different administration levels had identical names (for example, Chinese province and city), the lowest administrative level (i.e. city) was used to determine the latitude and longitude.

Phylogeny and nomenclature
Phylogenetic trees were generated with MEGA 6 using the neighbour joining method and the maximum composite likelihood nucleotide substitution model [20][21][22]. Trees with a similar topology to those presented were also obtained with the maximum likelihood method. The phylogenetic inference was tested with the bootstrap method with 1000 replications. Bootstrap values greater than 70 % are shown. Trees were rooted with RVi/Pennsylvania.USA/64VACC_JF727653 (the RA27/3 vaccine strain). Pairwise distances as well as mean distances within and between groups were computed in MEGA 6. The Mantel test was performed with the package ade4 in R (http://cran.r-project.org/web/packages/ade4/ade4.pdf) with 10 000 permutations. This test measures the correlation between genetic distance and geographic distance matrices. The lower the P value, the better the correlation between matrices and the higher the significance of geographic clustering. A significant P value is expected to be in the order of 1Â10 À4 . Lineages were identified based on tree topology depending on genetic distances, geographic distances and cocirculation. Criteria for the genetic distance within a lineage were less than 1.5 %, and the distance between lineages was more than 1.5 %. These criteria are in agreement with previous nomenclature studies such as for influenza H5N1 [23]. The correlation between genetic distances and geographic distances was assessed with the Mantel test; a matrix of pairwise genetic distances was compared with a matrix of pairwise latitude/longitude coordinates based on the WHO name of the virus. Co-circulation was considered to have occurred when viruses differing by approximately 1.5 % at the nucleotide level were collected at the same time and at the same location (based on the WHO name of the virus). Mean pairwise genetic distances within and between groups were computed to validate the ultimate groupings.
By applying these criteria to three of the now commonly detected genotypes (1E, 1G and 2B), three levels of virus sequence groupings within some genotypes could be established (see Results). A systematic nomenclature was adopted to describe the grouping established here and to accommodate future groups that are likely to result from analysis of a more complete set of RuV sequences. A lineage, the first grouping below genotype, is designated by the genotype followed by a dash, L (for lineage) and a number (for example, 1G-L1). Lineages designated as 0 represent lineages that are not currently circulating and are likely extinct. If necessary, sub-lineages are identified by the lineage of which they are part followed by a letter and, if necessary, a number to designate a sub-sub-lineage. For example, 1G-L2b1 would represent the sub-sub-lineage 1, part of sub-lineage b, part of lineage 2 of genotype 1G.

Funding information
This work was supported by general funding through the Centers for Disease Control and Prevention.