Refining PRRSV-2 genetic classification based on global ORF5 sequences and investigation of their geographic distributions and temporal changes

ABSTRACT Porcine reproductive and respiratory syndrome virus (PRRSV) is an important swine pathogen affecting the global swine industry. The first open reading frame 5 (ORF5)-based genetic lineage classification system describing global PRRSV-2 genetic diversity was introduced over a decade ago. Although refinements have been proposed for the predominant lineage in the USA (lineage 1), PRRSV-2 phylogenetic classification system at international levels has not been thoroughly evaluated and updated since 2010. In this study, based on analysis of 82,237 global ORF5 sequences reported during 1989–2021, we classified PRRSV-2 into 11 genetic lineages (L1‒L11) and 21 sublineages (L1A‒L1F, L1H‒L1J, L5A‒L5B, L8A‒L8E, and L9A‒L9E). The proposed classification system is flexible for growth if additional lineages, sublineages, or more granular classifications are needed. For example, for more fine-scale epidemiological investigation, L1C was further divided into five groups (L1C.1‒L1C.5), with L1C.5 corresponding to the recently emerged L1C variant. Comparison between restriction fragment length polymorphism (RFLP) typing and phylogenetic classification revealed the inaccuracy of using RFLP to determine PRRSV-2 genetic relatedness in most scenarios. Genetic homology of six commercial PRRSV-2 vaccines to each lineage/sublineage and detection frequency of vaccine-like viruses were determined. Global geographic distribution of each lineage/sublineage was presented. Temporal dynamic changes of PRRSV-2 in the USA during 1989–2021 were investigated by analyzing 73,092 ORF5 sequences. In summary, this study refined PRRSV-2 ORF5-based phylogenetic classification and investigated the geographic distribution and temporal changes of PRRSV-2 at the lineage/sublineage levels. The refined classification system and molecular epidemiology data in this study will be invaluable for future characterization of PRRSV-2. IMPORTANCE In this study, comprehensive analysis of 82,237 global porcine reproductive and respiratory syndrome virus type 2 (PRRSV-2) open reading frame 5 sequences spanning from 1989 to 2021 refined PRRSV-2 genetic classification system, which defines 11 lineages and 21 sublineages and provides flexibility for growth if additional lineages, sublineages, or more granular classifications are needed in the future. Geographic distribution and temporal changes of PRRSV-2 were investigated in detail. This is a thorough study describing the molecular epidemiology of global PRRSV-2. In addition, the reference sequences based on the refined genetic classification system are made available to the public for future epidemiological and diagnostic applications worldwide. The data from this study will facilitate global standardization and application of PRRSV-2 genetic classification.

KEYWORDS PRRSV-2, ORF5, genetic classification, lineage, sublineage, restriction fragment length polymorphism analysis, molecular epidemiology P orcine reproductive and respiratory syndrome (PRRS) is one of the most devastat ing swine diseases causing tremendous economic losses to the swine industry worldwide.Economic losses caused by PRRS were estimated to be approximately $664 million annually or $1.8 million a day in the USA (1).Clinical signs including reproductive failure and late-term abortion in sows and respiratory diseases in all age of pigs (2,3) were first reported in the USA in 1987 and in Western Europe in the 1990s (4).Porcine reproductive and respiratory syndrome virus (PRRSV), the causative agent of PRRS, is an enveloped, single-stranded, and positive-sense RNA virus in the family Arterividae (5).According to the new virus taxonomy, PRRSV includes two species: Betaarterivirus suid 1 (with virus name PRRSV-1, previously known as the European genotype) and Betaarterivirus suid 2 (with virus name PRRSV-2, previously known as the North American genotype).The PRRSV genome is ~15 kb in length and is composed of 11 open reading frames (ORF), including ORF1a, ORF1b, ORF2a, ORF2b, ORF3, ORF4, ORF5a, ORF5, ORF6, ORF7, and a short transframe (TF) ORF in the nsp2 region.These ORFs encode 16 non-structural proteins (nsp1α, nsp1β, nsp2-nsp6, nsp7α, nsp7β, nsp8-nsp12, nsp2TF, and nsp2N) and eight structural proteins (GP2, E, GP3-GP5, GP5a, M, and N) (6,7).
Like other RNA viruses, the lack of 3´ proof reading abilities of an RNA-dependent RNA polymerase causes error-prone replication and genetic variation with an estima ted nucleotide substitution rate between 4.7 × 10 −2 and 1.55 × 10 −3 substitutions/site/ year among PRRSVs (8)(9)(10).Due to its high genetic variation and harboring neutraliza tion epitopes, ORF5 encoding major envelope glycoprotein 5 (GP5) has been widely used to study genetic diversity of PRRSV-1 and PRRSV-2 (11)(12)(13)(14).Restriction fragment length polymorphism (RFLP) typing based on three restriction enzymes MluI, HincII, and SacII on PRRSV-2 ORF5 was first introduced in 1998 to distinguish Ingelvac PRRS MLV vaccine (RFLP 2-5-2) and wild-type strains (15).Since then, PRRSV-2 genetic diversity and emerging PRRSV strains in North America have been widely reported according to RFLP patterns (16,17).Recently, 228 distinct RFLP patterns from over 40,000 sequences were reported in the USA during 2007-2019; temporal and geographic distributions of different RFLP patterns reflected extensive PRRSV-2 genetic diversity (18).However, RFLP typing cannot capture PRRSV-2 genetic changes outside the cutting sites of the three restriction enzymes, and single nucleotide change at the cutting sites can change the RFLP pattern; hence, RFLP typing has limited utility as a tool to define genetic relatedness of PRRSV-2.Also, RFLP typing has mainly been used in North America and has not been widely adopted in other parts of the world.
This study aimed to fulfill multiple objectives.(i) To refine the PRRSV-2 genetic classification system based on analysis of a large data set (n = 82,237) of global ORF5 sequences and establish reference sequences for epidemiological and diagnos tic applications.(ii) To contrast RFLP typing and phylogenetic classification to better understand the limitations of each technique.(iii) To analyze the geographic distribu tions of global PRRSV-2 sequences and temporal changes of PRRSV-2 circulating in the USA, including both wild-type and vaccine-like virus strains.

Refine PRRSV-2 genetic classification system based on global ORF5 sequen ces
A data set of 82,237 PRRSV-2 ORF5 sequences from various countries was used in this study (Table 1).Following the procedures described in Materials and Methods, global PRRSV-2 ORF5 sequences were classified into 11 genetic lineages (L1 to L11) on the basis of the nine genetic lineage classification system previously proposed by Shi et.al.(13) with the addition of two new lineages (L10 and L11).As exemplified in Fig. 1, there were over 85% bootstrap supports in lineages L1-L8 and L10-L11 while there was 50.6% bootstrap support in lineage L9.
The newly proposed PRRSV-2 ORF5-based phylogenetic classification in this study was cross-tabulated with two PRRSV-2 ORF5-based phylogenetic classification systems proposed in previous studies (10,13,14) as summarized in Table 3. Lineages L10 and L11 in our proposed classification system were previously undescribed and they have been only detected in Thailand and South Korea, respectively.Our proposed PRRSV-2 subline age names are different from the 37 sublineages distributed in lineages 1, 5, 8, and 9 proposed by Shi et al. in 2010 (13), in part because of newly evolved genetic diversity in the past decade and in part because reference sequences needed to apply the subline ages proposed by Shi et   sublineages within lineage L1 (L1A-L1C, L1Dα, L1Dβ, L1E-L1H), but did not propose sublineages for other lineages (10,14).Our proposed refinement of L1 sublineages attempts to be consistent with those described by Paploski et al.However, some modifications are proposed in the current study and a summary is provided in    Sublineages L8.1-L8.9 and L9.1-L9.17 were proposed by Shi et al. (13); however, we have refined lineage 8 to include sublineages L8A-L8E and lineage 9 to include sublineages L9A-L9E.
The classification system proposed in this study is flexible for growth if additional lineages, sublineages, or more granular classifications are needed in the future.Here, we use the newly emergent PRRSV-2 L1C variant as an example.Beginning October 2020, high mortality and morbidity of pigs associated with PRRSV was observed in Iowa and Minnesota swine farms and the virus was subsequently detected in other states.ORF5 sequence analysis suggested that the PRRS viruses from these cases formed a distinct cluster within the sublineage L1C and most of them had a 1-4-4 RFLP pattern, for which reason these PRRS viruses have been referred to as "PRRSV L1C 1-4-4 variant" or sometimes "PRRSV L1C variant" (35,36).However, sublineage L1C still includes numerous other clusters and it remained undetermined how to denote these clusters with a distinct name that distinguishes them for epidemiological applications.In this study, we performed phylogenetic analysis using 10,961 sequences classified in sublineage L1C and proposed five groups L1C.1-L1C.5 with high bootstrap support to describe sequences that consistently formed unique monophyletic clusters."L1C variant" sequences (n = 822) were annotated as L1C.5; several other L1C clusters were annotated as L1C.1 (n = 1,644), L1C.2 (n = 270), L1C.3 (n = 1,358), and L1C.4 (n = 2,157).A large number of L1C sequences that did not consistently form distinct genetic clusters were annotated as L1C-Others (n = 4,710) (Fig. 3a and Table 3).The pairwise distances within each of L1C.1 to L1C.5 were 1.29%-3.99%and the pairwise distances between L1C.1-L1C.5 were 5.51%-9.99%(Fig. 3b).

Relationship of PRRSV-2 RFLP typing and ORF5-based lineage/sublineage classification
Among 82,237 PRRSV-2 ORF5 sequences, 341 RFLP patterns were determined.The analyses revealed that the number of RFLP patterns detected in lineages or sublineages did not correspond to genetic differences within lineages or sublineages.For example, sublineage L1A (n = 16,625 sequences) had the lowest average pairwise genetic distance (3.64%) among all sublineages in lineage L1 but 86 distinct RFLP patterns were detected in L1A (Fig. 4a).In contrast, sublineage L1E (n = 2,134 sequences) had the highest average pairwise genetic distance (10.06%) among all sublineages in lineage L1 but only 50 distinct RFLP patterns were identified in L1E (Fig. 4b).These observations suggest that the number of RFLP patterns in these two sublineages is not a good indicator of virus genetic diversity.

Genetic lineages/sublineages, RFLPs, and detection frequency of vaccine-like viruses
Six commercial PRRSV-2 modified live virus (MLV) vaccines including Ingelvac PRRS MLV (Boehringer Ingelheim), Ingelvac PRRS ATP (Boehringer Ingelheim), Fostera PRRS (Zoetis), Prime Pac PRRS RR (Merck), Prevacent PRRS (Elanco), and PRRSGard (Pharmgate) vaccines were launched or relaunched in 1994, 1997, 2012, 2018, 2018, and 2020, respectively, to control PRRSV in swine herds.In the USA, all six commercial vaccines are officially approved for use.The lineage/sublineage and RFLP information of the six commercial PRRSV-2 MLV vaccines as well as their ORF5 nt identity ranges to PRRSV-2 viruses in each lineage and sublineage is summarized in Table 4.Each vaccine has variable nt identities to PRRSV-2 viruses in different lineages and sublineages.However, nt identities are commonly found in the 80%-90% range, which potentially makes it challenging to estimate the protective efficacy of vaccines against viruses in different lineages and sublineages using exclusively ORF5 nt identity.
As no standard cutoff of nt identity was defined to distinguish between vaccinelike and wild-type viruses, in the current study, we arbitrarily defined any sequence with ≥98% ORF5 nt identity to a vaccine virus to be vaccine-like, 95% to <98% to be vaccine-like suspect, and <95% ORF5 nt identity to be wild-type virus.Sequen ces considered as vaccine-like suspects could possibly be vaccine-like or wild-type descendants and more data are required to define the true status of such sequences, which was beyond the focus of this study.
Ingelvac PRRS ATP vaccine virus belongs to the sublineage L8A with RFLP 1-4-2, Fostera PRRS vaccine virus is in the sublineage L8C with RFLP 1-3-2, Prime Pac RR vaccine virus belongs to the lineage L7 with RFLP 1-4-4, and Prevacent PRRS vaccine virus is in the sublineage L1D with RFLP 1-8-4.The detection frequency of these MLV vaccine-like viruses among the sequences in the lineage or sublineage to which they respectively belong is summarized in Table 5a, b, and c in a similar way as for Ingelvac PRRS MLV.As PRRSGard vaccine is a chimeric vaccine between two different PRRSV-2 strains, ORF5 sequence alone cannot distinguish PRRSGard-like and wild-type strains.Therefore, PRRSGard vaccine-like virus data were not included in Table 5.

Global geographic distribution of PRRSV-2 ORF5-based lineages and sublineages
Ninety-seven ORF5 sequences without country information for sample collection were excluded from the investigation of PRRSV-2 geographic distribution.Based on the available information, historical circulations of PRRSV-2 based on ORF5 genetic lineages and sublineages in global regions are summarized in Table 3 and Fig. 5.
Among 571 PRRSV-2 ORF5 sequences from Southeast Asia, most sequences (n = 551) were from Thailand and Vietnam and only 20 sequences were from other countries (Table 1).All 17 sequences reported in Cambodia (n = 6), Laos (n = 4), Myanmar (n = 6), and Philippines (n = 1) belonged to L8E; for two sequences from Malaysia, one belonged to L5A and the other one could not be assigned to a lineage; and one sequence from Singapore belonged to L5A.In Thailand, L10, L8E, L1I, and L5A sequences were detected with the number and percentage of sequences shown in Fig. 5a.In addition, a few L5B and undefined sequences were detected in Thailand.In Vietnam, L8E and L5A sequences were detected (Fig. 5b).In South Asia, 55 L8E sequences were reported from India in this study (Table 1 and Table 3).In these areas, L8E sequences mainly included the so-called HP-PRRSV-like strains that emerged in China in 2006 and subsequently spread to other Asian countries (29,31,33).
During 1991-2020, 5,157 PRRSV-2 ORF5 sequences were reported from East Asia in GenBank and most of them were from China (n = 4,188) during 1996-2020 based on the available data in this study.In mainland China, the sequences from high to low frequency in this study mainly were classified as L8E (HP-PRRSV, HP-PRRSV-like, and CH-1a-like  5d).
In this study, limited data were available from Central and South America.Only one PRRSV-2 sequence was reported from Guatemala in Central America which belonged to L8, and 32 PRRSV-2 sequences were reported in South American countries including Peru (20 L1A sequences), Chile (11 L1B sequences), and Venezuela (one L2 sequence).
A large number of PRRSV-2 ORF5 sequences (n = 76,202) were included from North America in this study.Among them, most sequences (n = 74,119) were collected in the USA, specifically from the Iowa State University Veterinary Diagnostic Laboratory database (n = 55,524) and the United States Swine Pathogen Database (US-SPD) database (n = 18,595) during 1989-2021, while fewer sequences were from Mexico (n = 1,918) and Canada (n = 165).Among ORF5 sequences reported from Canada, there were three main genetic lineages including lineage L1 (mainly L1H and L1C sequences and a few sequences in L1E, L1F and L1I) during 1991-2018, L5 (L5A) during 1995-2018, and L8 (L8A and L8C) during 2005-2018 (Fig. 5e).Most sequences classified in L5A and L8A were Ingelvac PRRS MLV vaccine-like viruses and Ingelvac PRRS ATP vaccine-like viruses, respectively.In Mexico, the top three sequence lineages included lineage L1 (mainly L1B and L1A sequences and a few sequences in L1C, L1E, and L1H) during 2009-2021, L5 (L5A) during 2004-2021, and L8 (mainly L8D sequences and a few sequences in L8A and L8C) during 2008-2021 (Fig. 5f).In addition, 7 L2, 20 L6, 4 L9A, and 51 L9D sequences were reported in Mexico.Among 432 L5A sequences from Mexico, most sequences were Ingelvac PRRS MLV vaccine-like viruses (n = 394) reported during 2004-2021 while 19 and 19 sequences reported during 2006-2019 were vaccine-like suspects and wild-type viruses, respectively.Sequences in sublineage L8D had a distinct geographic distribution, found only in Mexico, and most ORF5 sequences (n = 495 of 572) in L8D had a distinct nt length (609 bp).Lineage and sublineage information of PRRSV-2 in the USA is described in the section below.
Since 1,027 PRRSV-2 ORF5 sequences from the USA had no collection years, they were excluded from the temporal analysis and thus 73,092 PRRSV-2 ORF5 sequences collected in the USA during 1989-2021 were used for analyzing temporal dynamic changes.
Lineage L1, representing ~7.5% of sequences during 1989-2001 and increasing to 29.6% in 2002, became the dominant lineage from 2004-2021 in the USA with a dramatic increase from 33.3% to 74.4% (Fig. 6).The distribution of sublineages within L1 is shown in Fig. after 2018 (Fig. 3c).L1C.2 sequences were overall at low detection levels for several years but it is noteworthy that L1C.2 sequences has had an increasing trend since 2018 (Fig. 3c).L1C.3 dramatically increased from 6.5% in 2015 to 41.0% in 2016 and became a dominant L1C group during 2016-2020.Detection frequency of L1C.4 sequences increased from 30.4% detection in 2009 to over 50% of L1C sequences in 2010-2011 then decreased to 10.4% in 2014 and less than 1% after 2017.A few L1C.5 (L1C variant) sequences were first detected in 2018-2019 with little attention until 2020 when the detection frequency increased to ~5.8% in 2020 followed by a sharp increase to 59.4% in 2021 (Fig. 3c).
Lineage L5 detection ranged roughly from 8.4% to 16.6% of all sequences during 2002-2009 and rose to ~20% during 2010-2021 (Fig. 6).For the temporal changes of sublineages in L5 shown in Fig. 8a, L5A was the most detected sublineage from 2002 to 2021 while a small number of sequences belonging to L5B were observed only from 2002 to 2008.Among L5A sequences, wild-type viruses with <95% nt identity to Ingelvac PRRS MLV vaccine accounted for ~4.5%-17.2% of the sequences from 2002 to 2005 and then dropped to less than 3% after 2012 (Fig. 8b).Sequences with 95% to <98% nt identity to Ingelvac  in the graph and the total number of sequences in the lineage reported in a particular year is indicated in the table below the graph.Percentages of sequences in L1D with <95%, 95% to <98%, and ≥98% ORF5 nt identity to Prevacent PRRS vaccine were calculated against the total number of sequences in L1 in a particular year and the data are indicated in the table below the graph.
Lineages L2, L6, and L7 accounted for a small number of sequences reported in the USA across years as shown in Fig. 6.L2 represented less than 1% of sequences from 2004 to 2011 and has not been reported since 2012, while L6 represented less than 1% of sequences from 2009 to 2021.L7 was not reported during 2007-2013 and was observed in less than 1% of sequences during 2014-2021.
Eight states with over 1,000 ORF5 sequences in each state reported from 2015 to 2021 (Iowa, Illinois, Indiana, Minnesota, Missouri, North Carolina, Nebraska, and Oklahoma) were selected to investigate whether the detection frequency of PRRSV-2 lineages/sub lineages varied at the state level.During 2015-2021, L1 was the dominant lineage in all eight states, representing 60.2%-79.1% of sequences, while lineages L5 and L8 were reported for 15.7%-33.5% and 1.9%-8.8%,respectively, across the eight states (Fig. 10a), indicating a lack of major differences in PRRSV lineage distribution between states.When the detection frequency of PRRSV-2 L1 sublineages was compared across states (Fig. 10b), L1A was the dominant sublineage in seven of eight states (except Oklahoma) between 2015 and 2021, and the proportion of L1A sequences varied from 40.9% to 90.9% of sequences among states.Sublineage L1C occurrence ranged from 19.3% to 30.2% in Iowa, Illinois, Indiana, Minnesota, and Nebraska, while L1C accounted for a smaller proportion (~10% of sequences) in Missouri, North Carolina, and Oklahoma.Sublineage L1H was the dominant sublineage in Oklahoma (representing over 80% of sequences), while it accounted for less than 20% of sequences in other states and apparently absent in North Carolina.A small number of sequences from L1B, L1D, L1E, and L1F were reported in some states, with L1E being somewhat more common in Oklahoma.is shown in the graph and total number of sequences in the lineage reported in a particular year is indicated in the table below the graph.Among all L8A sequences in the USA from 1993 to 2021, percentages of L8A sequences with <95%, 95% to <98%, and ≥98% ORF5 nt identity to Ingelvac PRRS ATP vaccine in a particular year or period are shown in (b).Among all L8C sequences in the USA from 2002 to 2021, percentages of L8C sequences with <95%, 95% to <98%, and ≥98% ORF5 nt identity to Fostera PRRS vaccine in a particular year or period are shown in (c).

DISCUSSION
This study refined PRRSV-2 ORF5-based genetic classification system based on analysis of 82,237 global PRRSV-2 ORF5 sequences accumulated during 1989-2021.The refined PRRSV-2 phylogenetic classification system includes 11 genetic lineages (L1-L11) and 21 sublineages (L1A-L1F, L1H-L1J, L5A-L5B, L8A-L8E, and L9A-L9E).In this refined classification system, the principle procedures for defining lineages and sublineages are similar to those performed by Shi et al. in 2010 (13).The inter-lineage pairwise nucleotide distance is generally >11% and the intra-lineage pairwise nucleotide distance is generally <11%.However, there are exceptions.For example, sequences in L5 and L7 are mainly vaccine-like sequences; the inter-lineage distances between L5 or L7 and other lineages are sometimes <11% (Table 2a).If L5 and L7 sequences were excluded, the mean pairwise nt distances at the inter-lineage levels would range from 10.88% (close to 11% cutoff) to 17.18%.The intra-lineage distances for L5, L7, and L10 in this study were low, with 0.46%, 2.69%, and 2.22%, respectively (Table 2a), as these lineages are largely comprised of vaccine-like sequences (L5 and L7) or lineages that are likely severely under-sampled (L10, detected only in Thailand).For other lineages (L1-L4, L6, L8, L9, and L11), the intra-lineage distances ranged from 6.87% to 11.61% (Table 2a).Notably, sequences in L9 do not always form monophyletic clusters and they technically could be divided into multiple lineages.However, the detection frequency of L9 sequences have dramatically declined and almost disappeared in recent years; due to lack of current epidemiological significance, we did not divide L9 sequences into multiple lineages.In this study, lineages L10 and L11 were proposed to describe PRRSV-2 sequences detected mostly in Thailand and South Korea, respectively.Among the 216 sequences classified as L10 (Table 3), one sample from which the sequence was derived was collected in 2004 but its sequence was deposited into GenBank in 2019, one sample was collected in 2009 but its sequence was deposited into GenBank in 2016, 56 samples were collected in 2010 and their sequences were deposited into GenBank in 2011 (n = 53) and in 2014 (n = 3), 68 samples were collected in 2011 and their sequences were deposited into GenBank in 2011 (n = 67) and in 2013 (n = 1), and 90 samples had no collection dates.For the sequences from those 90 samples, one was submitted in 2012 and 89 were submitted in 2014 to GenBank.Therefore, none of the 216 sequences in L10 were available in 2010 when the L1-L9 classification system (13) was proposed.In contrast, for L11 sequences, some of them were already available in 2010 when L1-L9 classification system (13) was proposed and they were classified as "L2" sequences at that time.However, with accumulation of more sequences over years, the so-called "L2" sequences did not form monophyletic clusters anymore and instead they formed two phylogenetic clusters.Therefore, in the current study, we proposed to classify them into two separate lineages, L2 and L11.
The proposed sublineages within L1 are overall consistent with what Paploski et al. proposed in 2019 and 2021 (10,14) with some exceptions.For example, merging the former L1B and L1G into a single sublineage L1B with discontinued use of term L1G, combining the former L1Dα and L1E into new L1E, and so on.Compared to the nine sublineages (8.1-8.9) in L8 and 17 sublineages (9.1-9.17) in L9 originally proposed by Shi et al. (13), we have proposed five sublineages for both L8 (L8A-L8E) and L9 (L9A-L9E) to simplify the classification and facilitate their use.The reference sequences representing the refined lineages and sublineages are provided together with this paper and users can apply the new classification system for their specific use.
The PRRSV-2 classification system proposed in this study provides flexibility to refine the classification system in the future when needed.A PRRSV-2 variant that emerged in October 2020 in the USA formed a unique cluster within L1C (35,36).In order to facilitate the molecular epidemiological investigation and distinguish this variant strain from other L1C sequences, we divided L1C sequences into five groups, L1C.1 to L1C.5, with L1C.5 corresponding to the novel L1C 1-4-4 variant.Within each group of L1C.1 to L1C.5, the pairwise nt distances were 1.29%-3.99%(Fig. 3B).In the future, if there are demands for epidemiological studies, sequences in other genetically diverse sublineages can be similarly sub-divided into more smaller groups that may be more relevant for routine PRRSV monitoring and investigation by swine health professionals.
Since its introduction in 1998 (15), RFLP typing has been widely adopted to describe PRRSV-2 strains in North America.However, it has become evident that RFLP typing itself has drawbacks and cannot accurately reflect genetic differences and relatedness of various PRRSV-2 strains (16,38,39).Here, we demonstrate that each RFLP 1-4-4, 1-8-4, 1-4-2, and 1-3-2 often represent multiple, genetically diverse viruses belonging to different lineages/sublineages (Fig. 4C-F).In contrast, two viruses with RFLPs 1-7-4 and 1-7-3 could be genetically closely related viruses (Fig. 4A).However, the lineage/sub lineage system also has its limitations.Certain sublineages include genetically diverse viruses, forming groups that are too large to accurately investigate specific epidemio logical questions.Despite this, one large advantage of the lineage/sublineage system proposed before (10,13,14) and further refined in this paper is that it increases the likelihood that genetically related sequences will be clustered together.Genetic relationship is a hallmark of epidemiological connection, and by eliminating some of the confusions that RFLP classification produces on PRRSV-2 (such as giving the same label for viruses that are not related to one another, or by ensuring that closely related viruses do not receive different labels), this problem is diminished.
Commercial MLV vaccines are commonly used to control PRRSV infection.ORF5 sequencing via the Sanger method is still the most commonly used method to distinguish PRRSV-2 vaccine strains from wild-type strains although vaccine-like PCRs and next-generation sequencing technology have started to be used in recent years (40)(41)(42)(43).However, there is no standard ORF5 nt identity cutoff value to define vaccine-like viruses and wild-type viruses.A recent study showed that the distribution of ORF5 nt distances from the vaccine sequence was bimodal for lineages containing a modified live virus vaccine, with most sequences either clustering near 0% or >5% from the vaccine sequence (32).This suggests that a cutoff somewhere between 2% and 5% would probably be appropriate, and may be dependent on the purpose of the analysis.In fact, both ORF5 nt identity and phylogenetic analysis should be conducted to help determine if a detected PRRSV sequence is vaccine-like, vaccine-like suspect, or wild-type virus.Even so, there still could be a challenge to obtain a clear answer.For example, phylo genetic analysis of sublineage L8C sequences at the Iowa State University Veterinary Diagnostic Laboratory (ISU VDL) revealed that a distinct cluster separate from the Fostera vaccine-like virus cluster was observed in 2020-2021 with 95% to <98% nt identity to Fostera vaccine virus (data not shown).After considering both phylogenetic data and nucleotide identity data, we still cannot clearly determine if the ORF5 sequences in that distinct cluster are Fostera vaccine-like, wild-type, or vaccine-origin viruses that are transmitting similarly to a wild-type virus.The conclusion could not be drawn without clinical records and complete genome analyses of those viruses.But an increasing detection of L8C viruses having 95% to <98% ORF5 nt identity to the Fostera vaccine virus in 2020-2021, coinciding with the decreasing percentage of L8C viruses having 98%-100% ORF5 nt identity to the Fostera vaccine virus (Fig. 9C), is consistent with the hypothesis that vaccine escapees diverge enough to be able to have some epide miological fitness.Further studies involving epidemiological investigations and whole genome sequencing via next-generation sequencing technology to better comprehend the process and determinants for the emergence and increase of occurrence of this vaccine-like-suspect group of sequences within L8C in the years 2020 and 2021 are warranted.
The global distribution data in this study demonstrate geographic differences of PRRSV-2 in terms of the extent of genetic diversity and circulation period at lineage and sublineage levels.With the availability of a large number of PRRSV-2 ORF5 sequences and sample collection date, we were able to analyze the temporal dynamic changes of PRRSV-2 in each lineage and/or sublineage in the USA.PRRSV-2 L1 viruses have been the dominant viruses circulating in the USA since 2005.However, the frequency of subline age sequences within L1 have changed in the USA over time (Fig. 7), including a sharp increase of L1A 1-7-4 viruses since 2014 and emergence of L1C.5 (L1C 1-4-4 variant) viruses in 2020.These findings for L1, both in terms of timings and relative frequencies, are remarkably consistent with the patterns of sublineage turnover documented in Paploski et al. (10), suggesting that these patterns are consistent across different ORF5 databases in the USA even though they capture different but overlapping segments of the host population.It is necessary to constantly monitor the emergence of PRRSV variant strains in the swine population though.For example, based on the data in Fig. 7, in addition to L1A and L1C, we should also closely monitor L1H and L1D in the US swine.
Undoubtedly, the PRRSV-2 ORF5-based classification system in our current study has limitations.Although the ORF5 sequences included in this study were continuously collected in the USA from 1989 to 2021 and likely reflect PRRSV-2 occurrence in the USA, PRRSV-2 ORF5 sequences from other countries were retrieved from sequences available in GenBank and may not adequately reflect the true PRRSV-2 status in other countries.It is very likely there are some PRRSV-2 lineages and sublineages that have not been determined in the current study because they were not present in substantial numbers in the ISU VDL (which accounted for 75% of US sequences analyzed here) or GenBank databases.Nevertheless, our classification system and the associated reference sequences continue to advance efforts for a meaningful PRRSV-2 classification system.In addition, we acknowledge that ORF5 only occupies ~4% of PRRSV genome, and phylogenetic analyses based on ORF5 sequences may not necessarily reflect PRRVS-2 evolution in other genomic regions, which also shapes PRRSV-2 genetic diversity.Interestingly, in a recent publication (44), analyses show that ORF5-based lineages and sublineages are relatively stable when extrapolated to the whole genome; sequences belonging to the same ORF5-based lineage remain largely clustered together when phylogenies are built with other portions of the genome (44).However, it should be noted that this observation may not necessarily always stand, especially when recombination of viruses occurs.Nonetheless, ORF5-based lineages are a useful tool to label viruses that are phylogenetically related, though, of course, analysis of wholegenome sequences will broaden the ability to characterize PRRSV-2 genetic diversity and relatedness, including the ability to identify and characterize recombinant PRRSV-2 strains.Recombination between an MLV vaccine virus and a field virus or between two MLV vaccine viruses or between wild-type viruses in different geographic regions has been well documented (26,34,(44)(45)(46)(47)(48)(49)(50)(51)(52)(53)(54)(55).Although it makes sense at this stage to use ORF5 sequences for genetic classification because of the availability of large databases and the shorter turnaround time to determine ORF5 sequences, genetic classification based on whole-genome sequences should be explored in the future when more whole-genome sequences become available.

Conclusions
In this study, we refined the PRRSV-2 ORF5-based phylogenetic classification system based on analysis of a large set of global ORF5 sequences.In the new system, 11 lineages (L1-L11), 21 sublineages (L1A-L1F, L1H-L1J, L5A-L5B, L8A-L8E, and L9A-L9E), and five groups within L1C (L1C.1-L1C.5)were defined to describe the global PRRSV-2 genetic diversity.Reference sequences representing the new classification were established for future epidemiological and diagnostic applications.This study also revealed genetic homology of PRRSV-2 MLV vaccines with viruses at each lineage/sublineage.Geographic distribution of global PRRSV-2 at lineage/sublineage levels and temporal dynamic changes of PRRSV-2 at lineage/sublineage levels in the USA were characterized.These data will be invaluable for future characterization of PRRSV-2.

Data sets
A data set of 82,237 global PRRSV-2 ORF5 sequences from two sources was used in this study: 57,260 sequences from the ISU VDL and 24,977 PRRSV-2 ORF5 sequences from the US-SPD (https://swinepathogendb.org) created by the Agricultural Research Service National Animal Disease Center, United States Department of Agriculture (56).The US-SPD database collects all sequences deposited in GenBank, representing sequences from global countries (56).For the sequences from the US-SPD database, virus strain or isolate name, GenBank accession number, sample collection date, sample collection location, and RFLP pattern information were compiled.For the ISU VDL PRRSV-2 ORF5 sequences, the information including case submission ID, sample collection date, site location, and RFLP pattern was compiled.The information of 82,237 PRRSV-2 ORF5 sequences including collection region/country and collection year is summarized in Table 1.The 57,260 sequences from ISU VDL accumulated from 2003 to 2021 included samples collected in the USA (n = 55,524), Canada (n = 2), Mexico (n = 1,733), and Venezuela (n = 1).The 24,977 sequences from the US-SPD database were collected during 1989-2021 from seven global regions, including East Asia, Southeast Asia, South Asia, Europe, North America (n = 18,943; Canada, the United States, and Mexico), Central America, South America, and unidentified countries.
Redundant ORF5 sequences with 100% nt similarity, ORF5 sequences with more than five ambiguous nucleotides, and incomplete ORF5 sequences were removed from ORF5 sequences retrieved from ISU VDL and US-SPD databases using mothur v. 1.44.3 (57).Previous studies analyzing 355 and 949 PRRSV-2 whole-genome sequences (26,34) indicate that the recombination hot spots were mainly located in nsp9 and the ORF2-ORF4 genomic regions.Recombination within ORF5 rarely occurred.Even if recombina tion within ORF5 occurs under rare circumstances, it may create a phylogenetic clade that is located between the two lineages to which the two parental PRRSV-2 sequences belong.That will still be interesting for investigation.Therefore, in the current study, ORF5 sequences with the potential within-ORF5 recombination were not excluded for analysis.Eventually, 40,601 PRRSV-2 ORF5 sequences from ISU VDL, 16,851 PRRSV-2 ORF5 sequences from US-SPD database, and 841 ORF5 reference sequences from Shi et al. (13) were included to analyze and refine the PRRSV-2 ORF5-based genetic classification system as described in the section "Phylogenetic analyses of PRRSV-2 ORF5 sequences" below.
After the new PRRSV-2 lineage/sublineage classification system was defined, 1,100 reference sequences were randomly selected from different clusters of phylogenetic trees to represent all lineages and sublineages.Subsequently, the established refer ence sequences were used to analyze all of the 82,237 sequences to determine their lineage/sublineage information.

Phylogenetic analyses of PRRSV-2 ORF5 sequences
Multiple sequence alignment was performed by the progressive method Fast Fourier Transform algorithm with a Normalized Similarity matrix (FFT-NS-1) in MAFFT v7.407 (58).The phylogenetic tree from the multiple sequence alignment result was constructed using maximum likelihood with stochastic algorithm, general time-reversible nucleotide substitution with 10 categories of FreeRate heterogeneity model (GTR + F + R10), and 1,000 bootstrap replicates in IQ-TREE v1.6.12 (59).Simple manipulation of tree (smot) v0.14.2 (https://github.com/flu-crew/smot)was performed to reduce the data set based on genetic relatedness from the large phylogenetic tree.Multiple sequence alignment of the remaining ORF5 sequences after smot analysis was performed by the progres sive method (FFT-NS-2) in MAFFT v7.407 (58) and four independent phylogenetic trees were built using maximum likelihood with stochastic algorithm, general time-reversible nucleotide substitution with 10 categories of FreeRate heterogeneity model (GTR + F + R10), and 1,000 bootstrap replicates in IQ-TREE v1.6.12 (59).Tree topology tests were performed to statistically compare four independent phylogenetic trees using RELL approximation with 10,000 replicates (60), weighted Kishino-Hasegawa test (61), weighted Shimodaira-Hasegawa test (62), and approximately unbiased (63).Then, the best tree topology was identified from among the four trees.
Nucleotide identities were calculated by the MegAlign 15 program in the DNASTAR Lasergene 15 software.Average pairwise raw genetic distance within and between lineages and sublineages were calculated by MEGA X (66) to investigate genetic diversity.For calculating the pairwise distances at the lineage level (L1-L11), 1,100 reference sequences and 9,419 additional sequences randomly selected from the phylogenetic clusters representing the 11 lineages were used.For calculating the pairwise distances among the sublineages L1A-L1F and L1H-L1J, 466 reference sequences representing the nine sublineages within L1 and 10,170 additional sequences randomly selected from the phylogenetic clusters representing nine L1 sublineages were used.For calculating the pairwise distances in L5, all of the 16,845 sequences within L5 including 67 reference sequences representing L5A and L5B were used.For calculating the pairwise distances in L8, all of the 10,611 sequences within L8 including 219 reference sequences represent ing L8A-L8E were used.For calculating the pairwise distances in L9, all of the 6,052 sequences within L9 including 133 reference sequences representing L9A-L9E were used.

Genetic diversity analyses and visualization regarding RFLPs and genetic lineages and sublineages
Discriminant analysis of principal component (DAPC) is a combination method between discriminant analysis and principal component analysis.The DAPC method (67) in the adegenent package (68) in R was implemented to analyze and visualize genetic diversity of PRRSV-2 ORF5 sequences.To analyze genetic difference of an RFLP among different genetic lineages, ORF5 sequences were aligned using the progressive method (FFT-NS-1) in MAFFT v7.407 (58) and imported to R and then DAPC version 2.1.5(67) was performed to visualize genetic difference.

PRRSV-2 ORF5 reference sequences for the refined genetic classification system and GenBank accession numbers
The 1,100 PRRSV-2 ORF5 reference sequences for the refined genetic classification system included 59 L1A, 34 L1B, 29 L1C.

FIG 3
FIG 3 Phylogenetic tree and temporal dynamics of ORF5 sequences classified in sublineage L1C.(a) Phylogenetic tree of ORF5 sequences classified in groups L1C.1 to L1C.5.(b) Average pairwise genetic distance (% difference) between and within the five groups L1C.1 to L1C.5 in sublineage L1C.(c) Temporal dynamics of US PRRSV-2 ORF5 sequences classified in L1C.1-L1C.5 during 2009-2021.Total number of L1C sequences is shown in the table below the graph and percent of sequences classified in L1C.1-L1C.5 are indicated in the graph.Sequences not classified in L1C.1-L1C.5 were defined as L1C-Others.

FIG 5
FIG 5 Detection frequency and distribution of PRRSV-2 ORF5-based lineages and sublineages in representative countries based on the data set in this study.The number and percentage of sequences belonging to the major PRRSV-2 lineages and sublineages in Thailand, Vietnam, mainland China, South Korea, Canada, and Mexico are shown in (a) to (f ).The number and percentage of sequences at the lineage level in the USA are shown in (g) and at the sublineage level in the USA are shown in (h) to (k).Two L5 sequences with undefined sublineage are not shown in (i).One L8E sequence is not shown in (j).

FIG 6
FIG6 Temporal dynamics at the lineage level of 73,092 PRRSV-2 ORF5 sequences with samples collected in the USA during 1989-2021.Percent distribution of each lineage is indicated in the graph and the total number of sequences reported in a particular year is indicated in the table below the graph.

FIG 7
FIG 7 Temporal dynamics at the sublineage level of US PRRSV-2 ORF5 sequences classified in lineage L1 during 1998-2021.Percent of each sublineage is shown

FIG 8
FIG 8 Temporal dynamics at the sublineage level of US PRRSV-2 ORF5 sequences classified in lineage L5 or L9 during 1989-2021.Number of sequences in each sublineage is shown in the graph and the total number of sequences in the lineage reported in a particular year is indicated in the table below the graph.(a) Temporal dynamics of sublineages in lineage L5.(b) Among all L5A sequences in the USA, percentages of L5A sequences with <95%, 95% to <98%, and ≥98% ORF5 nt identity to Ingelvac PRRS MLV vaccine in a particular year are shown.(c) Temporal dynamics of sublineages in lineage L9.

FIG 9
FIG 9 Temporal dynamics at the sublineage level of US PRRSV-2 ORF5 sequences classified in lineage L8 during 1993-2021.(a) Percent of each sublineage

FIG 10
FIG 10 Temporal dynamics of US PRRSV-2 ORF5 sequences classified in lineages and sublineages in L1 reported in eight US states during 2015-2021.Eight states are represented by two-letter state codes and the number of sequences reported in each state during 2015-2021 is indicated below the graph.(a) Temporal dynamics of lineages.The percentages were calculated against the total number of sequences during 2015-2021 for each state, respectively.(b) Temporal dynamics of sublineages in L1.The percentages were calculated against the total number of L1 sequences during 2015-2021 for each state, respectively.

TABLE 1
Countries and collection years of PRRSV-2 ORF5 sequences from 1989 to 2021 used in this study

Table 3
. For example, the sublineages L1B and L1G proposed by Paploski et al. generally cluster together and it is difficult to consistently distinguish them; therefore, L1B and L1G are combined into L1B in our new system and the use of L1G is discontinued.Similarly, L1Dα and L1E proposed by Paploski et al. cluster together and it is difficult to accurately differentiate them in the tree; therefore, L1Dα and L1E are combined into L1E in our new

TABLE 2
Average pairwise genetic distance (% difference) for PRRSV-2 ORF5 sequences classified into 11 lineages and 21 sublineages based on the newly proposed classification (a)

TABLE 3
The comparison between the newly proposed PRRSV-2 classification and two previously proposed PRRSV-2 classifications based on ORF5 sequences as well as RFLP typing and geographic distribution