Genomic evolution of the Coronaviridae family

The current outbreak of coronavirus disease-2019 (COVID-19) caused by SARS-CoV-2 poses unparalleled challenges to global public health. SARS-CoV-2 is a Betacoronavirus, one of four genera belonging to the Coronaviridae subfamily Orthocoronavirinae. Coronaviridae, in turn, are members of the order Nidovirales, a group of enveloped, positive-stranded RNA viruses. Here we present a systematic phylogenetic and evolutionary study based on protein domain architecture, encompassing the entire proteomes of all Orthocoronavirinae, as well as other Nidovirales. This analysis has revealed that the genomic evolution of Nidovirales is associated with extensive gains and losses of protein domains. In Orthocoronavirinae, the sections of the genomes that show the largest divergence in protein domains are found in the proteins encoded in the amino-terminal end of the polyprotein (PP1ab), the spike protein (S), and many of the accessory proteins. The diversity among the accessory proteins is particularly striking, as each subgenus possesses a set of accessory proteins that is almost entirely specific to that subgenus. The only notable exception to this is ORF3b, which is present and orthologous over all Alphacoronaviruses. In contrast, the membrane protein (M), envelope small membrane protein (E), nucleoprotein (N), as well as proteins encoded in the central and carboxy-terminal end of PP1ab (such as the 3C-like protease, RNA-dependent RNA polymerase, and Helicase) show stable domain architectures across all Orthocoronavirinae. This comprehensive analysis of the Coronaviridae domain architecture has important implication for efforts to develop broadly cross-protective coronavirus vaccines.


Introduction
Coronaviridae is a family of enveloped, positive-strand RNA viruses that infect a wide variety of animals. The Coronaviridae family belongs to the suborder Cornidovirineae, which, together with Tornidovirineae belong to the order Nidovirales (enveloped, positive-strand RNA viruses) (Fig. 1). Recent phylogenetic studies based on RNA-directed RNA polymerases indicate that Nidovirales, together with Picornavirales, Caliciviridae, Astroviridae, and their relatives form a distinct supergroup of RNA viruses (Picornavirus supergroup) Wolf et al., 2018). Nidovirales can infect a wide range of animal hosts, including insects, mollusks, crustaceans, and vertebrates, suggesting horizontal virus transfer across metazoan species . Coronaviridae are divided into two subfamilies Letovirinae and Orthocoronavirinae, the latter of which are the main focus of this work.
All Orthocoronavirinae viruses possess four shared structural proteins, the spike (S), envelope (E), membrane (M) and nucleocapsid (N) proteins. The genome is packed inside a helical capsid formed by the nucleoprotein N. This in turn is surrounded by an envelope containing the E and M proteins, which are involved in virus assembly, and the spike glycoprotein protein S, which mediates virus entry into host cells (McBride and Fielding, 2012). Orthocoronavirinae have relatively large viral genomes in comparison to other RNA viruses, with sizes ranging from 26 to 32 kilobases. The first two open reading frames, ORF1a and ORF1b, code for two overlapping large replicase-containing polyproteins, pp1a and pp1ab, with the larger pp1ab translated as a result of a -1 ribosomal frameshifting ( Fig. 2A). These large polyproteins are subsequently (self) cleaved into 15 or 16 mature proteins referred to as non-structural proteins (nsps). And while the PP1ab, S, E, M, and N proteins are found in all Coronaviridae family genomes, the individual protein domains show surprising diversity. In addition, depending on the specific strain, many coronaviruses contain additional ORFs coding for accessory proteins, many of which remain poorly characterized (Fig. 2B).
In this work, we performed a protein domain-centric evolutionary comparative genomics analysis of Coronaviridae genomes, revealing the complex domain architectures that have resulted from recombination and a complicated evolutionary history.
Homologs are genes that are related by shared ancestry. Orthologs were defined by Fitch in 1970 as homologous genes in different species that diverged by speciation. Genes that diverged by gene duplication, either in the same or different species, have been termed paralogs (Fitch, 1970(Fitch, , 2000. While the terms ortholog and paralog have no functional implications (Jensen, 2001), orthologs are often thought of as more functionally similar than paralogs at the same level of sequence divergence (Altenhoff et al., 2012;Eisen, 1998).
Protein domains are distinct functional and/or structural units of a protein. Domains tend to form stable compact three-dimensional structures that can often be independently folded. Many proteins are composed of multiple domains, with each domain having its own evolutionary history and biochemical function. Thus, the architecture of a protein is a product of the ordered arrangement of its constituent domains and their overall tertiary structure. During evolution, multiple domains can combine, creating a vast number of distinct domain combinations, even within the same species (Moore et al., 2008). Assembling multiple domains into a single protein creates an entity whose function can be more than the sum of its constituent parts. The generation of proteins with novel combinations of duplicated and then diverged domains is a major mechanism for rapid evolution of new functionality in genomes (Itoh et al., 2007;Peisajovich et al., 2010). This modular structure of proteins enables rapid emergence of a multitude of novel protein functions from an initially limited array of functional domains. Proteins can gain or lose domains via genome rearrangements; the domains themselves can be modified by small-scale mutations (Christian M. . Here we use the Domain-architecture Aware Inference of Orthologs (DAIO) approach described in (Zmasek et al., 2019) to compare the arrangement of protein domains (and by extension, proteins) in polyproteins and ORFs from different Orthocoronavirinae sub-genera, updating and expanding our knowledge of Nidovirales genome evolution at the domain level, which, for example, has been review previously in (Gorbalenya et al., 2006). This approach places proteins into groups in which all members are not only orthologous to each other but also have the exact same domain architecture. This analysis resulted in the classification of Coronaviridae proteins into "Strict Ortholog Groups" (SOGs), in which all proteins are orthologous to each other (related by speciation events) and exhibit the same domain architecture. The SOG classification also enabled the development of an informative naming convention for each SOG that includes information about the protein's function (if known) and a suffix indicating the taxonomic group (such as Betacoronavirus) where a particular SOG is present. The SOG classification results are publicly available through the Virus Pathogen Resource (ViPR) (Pickett et al., 2012) at https://www.viprbrc.org. Fig. 1. Nidovirales taxonomy. This figure is based on the taxonomy established by the International Committee on Taxonomy of Viruses (ICTV) and currently used by the U.S. National Center for Biotechnology Information (NCBI) and the Universal Protein Resource (UniProt) databases. Viruses which infect humans are listed in blue (Alphacoronaviruses) and red (Betacoronaviruses). Their taxonomic level is indicated in square brackets. For some viruses, no taxonomic level has been established as of this writing. An example of this is Human coronavirus OC43.

Nidovirales genome evolution: protein domain composition of extant and ancestral genomes
We analyzed complete sets of proteins for all publicly available Nidovirales genomes (for a total of roughly one million sequences, including ~900,000 for SARS-CoV-2) for the presence of protein domains as defined by Pfam 34.0 (March 2021, 19179 entries) (El-Gebali et al., 2018). Within Orthocoronavirinae, the number of distinct protein domains varied from 9 in poorly studied viruses such as the White-eye coronavirus HKU16 (Deltacoronavirus) to 44 in SARS-CoV-2. Most Nidovirales not belonging to Orthocoronavirinae have even poorer coverage in Pfam. To understand the evolutionary history of the observed domain distribution in extant species, we reconstructed the domain content of ancestral genomes, specifically those lying at internal nodes corresponding to major branching points in the evolution of Nidovirales. Since independent evolution of the same domain more than once is highly unlikely, we used Dollo parsimony (https://doi.org/10. 1093/sysbio/26.1.77), which, when applied to domain content, assumes that each domain can be gained only once and seeks to minimize domain losses, to reconstruct the Pfam domain repertoire of Nidovirales (Zmasek and Godzik, 2011) (see Fig. 3 and Supplementary Tables 1 and 2). It should be noted that these findings do not imply that every extant genome retained all domains gained on its path from its respective ancestor. In fact, similar to the situation in eukaryote evolution, domain losses are common (Zmasek and Godzik, 2011).
The two most ancestral proteins are strongly associated with the realm of Ribovira (RNA viruses and viroids): RNA-dependent RNA polymerase (RdRP_1) and viral RNA helicase (Viral_helicase1). RNAdependent RNA polymerase (RdRp) is an essential protein encoded in the genomes RNA viruses which catalyzes synthesis of the RNA strand complementary to a given RNA template (De et al., 1996). Viral RNA helicase is a member of the P-loop containing nucleoside triphosphate hydrolase superfamily and has multiple roles at different stages of viral RNA replication (Koonin et al., 1993).
11 domains are associated with the evolution of Nidovirales from Pisoniviricetes (positive-strand RNA viruses which infect eukaryotes). Four of these are domains are also found in eukaryotes and bacteria. These are the AAA domains AAA_11, AAA_12, and AAA_13 as well as the Macro domain. AAA_11, AAA_12, AAA_30 are members of the P-loop containing nucleoside triphosphate hydrolase superfamily (same as the viral RNA helicase mentioned above), which often perform chaperone-like functions that assist in the assembly, operation, or disassembly of protein complexes (Neuwald et al., 1999). In Orthocoronavirinae, these domains are part of the RNA helicase (Gomez de Cedro è et al., n.d.). In Orthocoronavirinae, the Macro domain is part of the papain-like peptidase protein (PL-pro), together with domains CoV_peptidase and NSP3_C (as well as other, genus-specific domains). The Hema_esterase domain appears both in some Nidovirales (Embecovirus and Torovirus) as well as in Herpesvirales (dsDNA viruses) and Influenza C and D viruses. Together with the Hema_HEF domain, this domain is part of the Haemagglutinin-esterase fusion glycoprotein found in Embecoviruses. It has been speculated that Haemagglutinin-esterases have been acquired from viral host lectins, although it is unclear whether this acquisition happened in a putative ancestral virus followed by speciation and gene loss or by multiple independent acquisitions (Chen and Li, 2013). The Corona_NS2A domain is found in various Riboviria, although from the genomes studied in this work it is present in only select Orthocoronavirinae genomes. The Corona_NS2A domain can be found in Rotaviruses (double-stranded RNA viruses in the family Reoviridae), where together with the Rotavirus_VP3 domain, they form a multifunctional enzyme, the VP3 protein, involved in mRNA capping (Zhang et al., 2013). In Berne Virus (Tornidovirineae, Torovirinae) Corona_NS2A is found encoded in the polyprotein (pp1ab). Interestingly, in Embecovirus and Luchacovirus it is encoded as an individual ORF. The remaining five domains are unique to Nidovirales and are found in (some) Orthocoronavirinae as well as in (some) Tobaniviridae. These are the proofreading exoribonuclease (CoV_ExoN), the 2 ′ -O-methyltransferase (CoV_Me-thyltr_2), the RNA synthesis protein NSP10 (CoV_NSP10), the uridylate-specific endoribonuclease (CoV_NSP15_C), and the S2 subunit of the Spike protein (CoV_S2).
The main finding from this analysis is that, during Coronaviridae evolution, the largest number of domain gains (26) occurred on the  Supplementary Tables 1 and 2. branch leading from Nidovirales to Orthocoronavirinae. These domain gains include the small envelope protein E (with CoV_E domain), the matrix/glycoprotein M (CoV_M), nucleocapsid N (CoV_nucleocap), and three domains of the spike glycoprotein (bCoV_S1_N, CoV_S1_C, and CoV_S2_C). These gains also include numerous domains encoded within the polyproteins Pp1a and Pp1ab, namely the CoV_peptidase and CoV_NSP3_C domains, which are part of the papain-like peptidase (PLpro), domain Peptidase_C30, which is the single domain of the 3C-like proteinase (3CL-pro), the N-terminal domain of the RNA-dependent RNA polymerase RdRP (CoV_RPol_N), as well as NSP2 (CoV_NSP2_C and CoV_NSP2_N domains), NSP4, NSP6, NSP7, NSP9, and two domains of NSP15 (CoV_NSP15_N and CoV_NSP15_M). A more detailed analysis of protein domain changes in the Pp1ab polyprotein and the spike glycoproteins in the Orthocoronavirinae family is provided below.
Besides the domains and proteins discussed above, the distribution of which can be best explained with (ancestral) gains and subsequent loss, there are also several domains present in Orthocoronavirinae most likely resulting from horizontal gene transfer or recombination (due to them being present in only a few Orthocoronavirinae genomes as well as being present in very distantly related species). Orthoreo_P10 (Orthoreovirus membrane fusion protein p10) is thought to be a multifunctional protein that plays a key role in virus-host interaction (Bodeló et al., 2002) and is currently only found in Rousettus bat coronavirus GCCDC1 (Nobecovirus), as well as in some Spinareovirinae genomes and in various Eukaryotes. PRK (Phosphoribulokinase/Uridine kinase family) is found in Cegacovirus species as well as in numerous bacteria and Eukaryotes. The Astro_capsid_p (Turkey astrovirus capsid protein) domain which has been described as part of capsid proteins from various astrovirus strains (Tang et al., 2005) is found in Cegacovirus species as well as Human, astrovirus-1 and select Eukaryotes.

Evolution of spike glycoproteins
Coronaviridae spike proteins are multifunctional proteins that mediate viral entry into host cells. Composed of two subunits, S1 and S2, they first bind to a receptor on the host cell surface through their S1 subunit and then fuse viral and host membranes through their S2 subunit. In SARS-CoV-2 (but not in SARS-CoV1 or MERS-CoV) the two subunits S1 and S2 have been shown to be proteolytically cleaved by a Furin protease (Örd et al., 2020). The spike proteins of Coronaviridae are known to bind a broad range of cellular targets, including sialic acids, sugars, and proteins (Fig. 4). For example, Human coronavirus 229E (Alphacoronavirus, subgenus Duvinacovirus) binds aminopeptidase N, whereas Human coronavirus NL63 (Alphacoronavirus) and SARS-CoV (Betacoronavirus) bind to angiotensin converting enzyme 2 (ACE2) (Graham et al., 2013).
Sequence analysis shows that spike glycoproteins are composed of distinct combinations of six Pfam protein domains ( Fig. 4 and Table 1). While the carboxy-terminal S2 subunit shows the same two-domain CoV_S2-CoV_S2_C arrangement for all Orthocoronavirinae genomes analyzed here, the amino-terminal S1 subunit differs significantly between Alpha-, Beta-. Gamma-, and Deltacoronaviruses (we use "-" to indicate connected domains in a protein). The S1 subunit of all Betacoronaviruses analyzed have a bCoV_S1_N-bCoV_S1_RBD-CoV_S1_C architecture, whereas Gamma-, and Deltacoronaviruses have a CoV_S1-CoV_S1_C arrangement (Gammacoronaviruses have a longer N-terminal extension). Surprisingly, the S1 subunits of Alphacoronaviruses differ between sub-genera. In Luchacovirus and Rhinacovirus, S1 has a bCoV_S1_N-CoV_S1_C arrangement and is thus similar to the arrangement found in Betacoronaviruses (but lack bCoV_S1_RBD), whereas the other sub-genera have the same architectures as in Gamma-, and Deltacoronaviruses, with differing lengths of the N-terminal extension. Interestingly, this split of Alphacoronaviruses is also found when analyzing the phylogenetic history of spike glycoproteins, both when performing phylogenetic inference on entire proteins (in which the phylogenetic signal is likely to be somewhat obscured by differences in domain architectures; data not show) as well as on CoV_S2 domains alone, as shown in Fig. 4. Interestingly, phylogenetic analysis of all other proteins does not show this split within Alphacoronaviruses. Therefore, this split is likely not the result of taxonomic misclassification, but rather some recombination event between some Alpha-and Betacoronavirus spike proteins and perhaps convergent evolution selecting for this architecture. This interesting difference in spike proteins within Alphacoronaviruses has been previously noted, for example for the Rhinolophus bat coronavirus HKU2 (Chinese horseshoe bat virus; Bat-CoV HKU2) from the Rhinacovirus subgenus (Lau et al., 2007).

Divergence of the polyprotein N-terminal domain/protein architecture
We used the DAIO approach to compare the arrangement of protein domains (and by extension, mature proteins) in polyproteins from different Orthocoronavirinae genera and sub-genera ( Fig. 5 and Table 2). For comparison, we also included two example polyproteins from Tobaniviridae, which are currently not as well studied as Orthocoronavirinae and thus appear devoid of many proteins/domains. The polyproteins of all Alphacoronaviruses studied here exhibit an identical arrangement of domains/proteins, and the Gamma-and Fig. 4. Phylogeny and domain organization of Coronaviridae spike glycoproteins. The phylogeny on the left side was calculated using a maximum likelihood approach applied to a MAFFT alignment of the CoV_S2 domains. Spike protein domain architecture for each genus is shown in the middle; for a description of the Pfam domains see Table 1. Host cell receptors and likely additional receptors are shown on the right side (Graham et al., 2013). The following abbreviations are used: ACE2, angiotensin converting enzyme 2; APN, aminopeptidase N; CEACAM1a, carcinoembryonic cell adhesion molecule 1a; DC-SIGN, dendritic cell-specific ICAM-grabbing non-integrin; DC-SIGNR, DC-SIGN-related protein; DPP4, dipeptidyl peptidase 4; LSECtin, liver and lymph node sinusoidal C-type lectin.   Table 2.
Deltacoronaviruses show a nearly identical arrangement. This is in sharp contrast to the Betacoronaviruses which show substantial variability between sub-genera, with Sarbecovirus and Embecovirus being the most divergent. The two main findings from this analysis are as follows. First, the central part of the polyprotein (NSP4-3CL-pro-NSP6-… -NSP10-RdRP) is identical for all Orthocoronavirinae analyzed here. The same is true for the carboxy-terminal part (Viral_helicase1-Methyltr_1-NSP15-Methyltr_2) with the sole exception that for human (but not for all other animal hosts) Alpha-and Betacoronviruses an AAA_30-AAA_12 arrangement replaces Viral_helicase. However, this is very likely a sequence analysis-related artefact, since AAA_30, AAA_12, and Vir-al_helicase1 are related members of the P-loop containing nucleoside triphosphate hydrolase-superfamily (P-loop_NTPase Pfam clan) and match the same target sequences with similar E-values. In contrast, the amino-terminal part of the polyprotein varies dramatically between genera and sub-genera of Orthocoronavirinae. Significantly, the papain-like peptidase protein in Gamma-and Deltacoronaviruses is smaller and simpler (three domains) than the form found in Alpha-and Betacoronaviruses (four to seven domains). It is particularly noteworthy that the papain-like peptidase in Embecoviruses has a different domain architecture, even when compared to other Betacoronaviruses, in that it has an additional domain belonging to the Peptidase C16 family (Peptidase_C16) and lacks the bCoV_SUD_M (single-stranded poly-A binding domain) domain. Also, Embecoviruses are unique in that they have domain of unknown function DUF3477 at the N-terminus. Sarbecoviruses are the only subgenus in which a bCoV_-SUD_C domain is present in the papain-like peptidase. In addition, Alpha-and Betacoronaviruses have more mature peptides on the aminoterminal side of the papain-like peptidase protein.

Major differences between Orthocoronavirinae accessory proteins
We also used DAIO to compare the accessory proteins across Orthocoronavirinae subgenera and found that most accessory proteins are subgenus-specific, despite oftentimes having been given identical names such as "NS7" (non-structural protein 7) or "ORF7". These names are based on the protein's placement in the genome and do not necessarily indicate homology or similarity in biological/molecular function. Therefore, these names cannot be used to compare or relate proteins between different subgenera, since for example, Hibecovirus ORF7 is not related to Cegacovirus ORF7 (also see Fig. 2B for additional examples).
Furthermore, most accessory proteins outside of the Betacoronavirus subgenera Embecovirus and Sarbecovirus do not have a corresponding Pfam entry [no profile Hidden Markov domain model (HMM)]. Accessory protein domains with an existing Pfam entry are listed in regular fonts in Table 3 (as are ORF1ab polyprotein, Spike glycoprotein, Membrane protein, Envelope small membrane protein, and Nucleoprotein). HMMs would be the ideal means to systematically classify accessory proteins since they represent a protein's molecular "signature" and are indifferent to the placement in a genome (Eddy, 2004). For this reason, HMMs were created for all accessory proteins that currently lack one. Domains defined by these new HMMs are in italic fonts in Table 3. Since these novel HMMs are representing sub-genus-specific domains, they do not appear in Fig. 3 and Supplementary Tables 1 and 2 as gains and losses.

The Nobecovirus sub-genus accessory proteins appear to be particularly prone to domain gain/loss
The Nobecovirus subgenus provides a more recent example of both domain gain and recurrent domain loss. In addition to the Orthoreo_P10 protein proposed to have been gained through recombination in the Rousettus bat coronavirus GCCDC1 species (Huang et al., 2016;Obameso et al., 2017;Paskey et al., 2020), various Nobecoviruses appear to have at least four other accessory proteins downstream of the nucleocapsid genome which we have designated as Nobecovirus_ORF7a, Table 2 Pfam domains in polyproteins. Nobecovirus_ORF7b, Nobecovirus_ORF7c, and Nobecovirus_ORF7d. A phylogenetic analysis of the RdRP domain of 27 Nobecovirus genomes revealed that domain loss was not necessarily associated with specific branches in the tree, indicating that domain loss is likely to have occurred repeatedly at discrete timepoints, and in multiple species. This has resulted genomes with varying combinations of accessory proteins even within the same species. Additionally, domain loss did not appear to be correlated with sampling timepoints, host species or the geographic location of isolation of viruses.

Classification of Orthocoronavirinae proteins into Strict Ortholog Groups
Using these newly developed HMMs, as well as the existing Pfam HMMs, we grouped Orthocoronavirinae proteins into Strict Ortholog Groups (SOGs), groups of orthologous proteins with the same domain architecture (Zmasek et al., 2019). Supplementary Table 3 provides on overview of the results from this analysis. The first column indicates the taxonomic distribution for each SOG (all uppercase descriptions are used for SOGs which are found in every member of a given taxonomic unit, whereas lowercase descriptions are used for SOGs which are present in some, but not all, members of a taxonomic unit). The domain architectures are also listed, using Pfam domain names for domains that have an entry in Pfam, and the names of our newly developed HMMs for domains that lack an entry in Pfam, as indicated by an asterisk in the fourth column ("-" is used to indicate domain connections in multidomain proteins). This table also lists the proposed names for each SOG [numbers in brackets are temporarily used to distinguish SOGs with same names but different domain architectures, currently a mixture of manually curated names and automatically inferred ones].

Conclusions
In this work we show that Orthocoronavirinae genomes evolved in what could be called three distinct 'modes'. (i) Certain sections of the genomes are stable and only differ by point mutations and small insertions and deletions. These sections include the central portion and the C-terminus of the ORF1ab polyprotein, encoding the 3C-like protease, NSP4, NSP6, NSP7, NSP8, NSP9, RNA-dependent RNA polymerase, Helicase (Hel), and NSP15 and the membrane protein M and nucleoprotein N, which are encoded by their own ORFs. These proteins are present and orthologous over all Coronaviridae genomes analyzed and thus help define this virus family.
(ii) The spike proteins and the papain-like peptidases, in contrast, differ in their domain architectures between genera. Similarly, the N-terminus of the polyproteins differ in the proteins encoded between genera, and for Betacoronaviruses, even between subgenera. The envelope small membrane protein E is orthologous across Alpha-and Betacoronaviruses, absent in Gammacoronaviruses, and encoded by a different, non-homologous gene in Deltacoronaviruses. (iii) The greatest variability is found in the accessory proteins. For these proteins, each sub-genus has its own unique set, with very little overlap between sub-genera. The only notable exception to this is NS3b which is present and orthologous over all Alphacoronaviruses.
In addition, we note the following: The establishment of the Orthocoronavirinae family is associated with a large gain of domains. While this superficially appears as if these domains appeared at the same time, en bloc, the more likely explanation is that these domains were gained one domain at a time, but most viral species emerging from the branch leading from Nidovirales to Coronaviridae either went extinct and/or have not been discovered yet.
From a domain presence/absence perspective Alpha-and and square brackets asterisk are used to indicate weak similarity.
C.M. Zmasek et al. Betacoronaviruses are similar to each other, as are Gamma-and Deltacoronaviruses.
The only Coronaviridae which possess the Haemagglutinin-esterase fusion glycoprotein (composed of domains Hema_esterase and Hema_-HEFG) are the Embecoviruses. Given that other viral species containing such proteins are phylogenetically distant (Torovirus, Herpesvirales, Influenza C and D viruses) it appears likely that this distribution pattern is the result of multiple, independent gene acquisitions from host species, instead of a acquisition by a putative ancestral virus followed by speciation and gene loss.
It is interesting that most of the differences in the polyproteins are towards the amino-terminal end, even though, when taking the need to keep coding sequences in-frame into account, mechanically a diverging carboxy-terminal end should be the favored "solution".
In the same context, is it noteworthy that proteins encoded at the amino-terminal end of the polyprotein appear to have functions related to modulating virus-host interaction and appear not as strictly essential as other proteins (such as the peptidases and RNA-dependent RNA polymerases). Examples are SARS-CoV-2 NSP1 which is believed to inhibits host translation (Thoms et al., 2020) and SARS-CoV-2 NSP2 which has been implicated in the modulation of host cell survival (Cornillez-Ty et al., 2009;Lei et al., 2018).
Finally, in the course of this work, we developed a consistent naming scheme for all Coronaviridae proteins as well as numerous novel hidden Markov models (HMMs) representing sub-genus specific accessory proteins. The resulting annotations of this efforts will be disseminated via the ViPR database (Pickett et al., 2012).

Materials and methods
We used a semi-automated software pipeline to analyze amino acid sequences for their protein domain-based architectures and to infer multiple sequence alignments and phylogenetic trees for the molecular sequences corresponding to these architectures, followed by gene duplication inference. This pipeline contains the following five major steps: (1) sequence retrieval; (2) domain architecture analysis, including the inference of the taxonomic distributions of domain architectureseach of which corresponds to one preliminary SOG -and manual naming of domain architectures/preliminary SOGs; (3) extraction of molecular sequences corresponding to domain architectures/preliminary SOGs; (4) multiple sequence alignment and phylogenetic inference; (5) gene duplication inference, to determine which preliminary SOGs contain sequences related by gene duplications and thus need to be divided into multiple, final SOGs. Links to all custom software programs developed for this work are available at https://sites.google.com/site/cmzmase k/home/software/forester/daio. The tools and methods used are described in more detail below.

Sequence retrieval
Individual protein sequences were downloaded from the ViPR database (Pickett et al., 2012), while entire proteomes were downloaded from UniProtKB (Bateman et al., 2017).

Multiple sequence alignments
Multiple sequence alignments were calculated using MAFFT version 7.313 (with "localpair" and "maxiterate 1000" options) (Katoh and Standley, 2013). Prior to phylogenetic inference, multiple sequence alignment positions with more than 50% gaps were deleted.

HMM construction
For ORFs lacking a defined Pfam domain, HMMs were constructed by first creating multiple sequence alignments of homologous sequences using MAFFT version 7.313 (with "localpair" and "maxiterate 1000" options) (Katoh and Standley, 2013). These multiple sequence alignments were then used as input for hmmbuild. Resulting HMMs where then tested against expected matching sequences, as well as against expected non-matching sequences.

Phylogenetic analyses
Phylogenetic trees were calculated for individual domain architectures (not full-length sequences). Distance-based minimal evolution trees were inferred by FastME 2.0 (Desper and Gascuel, 2006) (with balanced tree swapping and "GME" initial tree options) based on pairwise distances calculated by TREE-PUZZLE 5.2 (Schmidt and von Haeseler, 2007) using the WAG substitution model (Whelan and Goldman, 2001), a uniform model of rate heterogeneity, estimation of amino acid frequencies from the dataset, and approximate parameter estimation using a Neighbor-Joining tree. For maximum likelihood approaches, we employed RAxML version 8.2.9 (Stamatakis, 2006) (using 100 bootstrapped data sets and the WAG substitution model). Tree and domain composition diagrams were drawn using Archaeopteryx [htt ps://sites.google.com/site/cmzmasek/home/software/forester]. Rooting was performed by the midpoint rooting method. Unless otherwise noted, Pfam domains are displayed with an E = 10 − 6 cutoff. Gene duplication inferences were performed using the SDI and RIO methods Eddy, 2001, 2002). Automated genome wide domain composition analysis was performed using a specialized software tool, Surfacing version 2.002 (C M Zmasek and Godzik, 2012), a tool for the functional analysis of domainome/genome evolution [available at htt ps://sites.google.com/site/cmzmasek/home/software/forester/surfac ing]. All conclusions presented in this work are robust relative to the alignment methods, the alignment processing, the phylogeny reconstruction methods, and the parameters used. All sequence, alignment, and phylogeny files are available upon request.

Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.