Classification of human Herpesviridae proteins using Domain-architecture Aware Inference of Orthologs (DAIO)

We developed a computational approach called Domain-architecture Aware Inference of Orthologs (DAIO) for the analysis of protein orthology by combining phylogenetic and protein domain-architecture information. Using DAIO, we performed a systematic study of the proteomes of all human Herpesviridae species to define Strict Ortholog Groups (SOGs). In addition to assessing the taxonomic distribution for each protein based on sequence similarity, we performed a protein domain-architecture analysis for every protein family and computationally inferred gene duplication events. While many herpesvirus proteins have evolved without any detectable gene duplications or domain rearrangements, numerous herpesvirus protein families do exhibit complex evolutionary histories. Some proteins acquired additional domains (e.g., DNA polymerase), whereas others show a combination of domain acquisition and gene duplication (e.g., betaherpesvirus US22 family), with possible functional implications. This novel classification system of SOGs for human Herpesviridae proteins is available through the Virus Pathogen Resource (ViPR, www.viprbrc.org).


Human herpesviruses
Herpesviruses comprise a large and diverse order (Herpesvirales) of double stranded DNA viruses that infect humans and a wide range of other hosts (Pellet and Roizman, 2007;Virus Taxonomy: The Classification and Nomenclature of Viruses The Online 10th Report of the ICTV, 2017). Human diseases caused by herpesviruses range from vesicular rashes to cancer. The order Herpesvirales is subdivided into three families, including the Herpesviridae, which is further subdivided into three subfamilies, the Alpha-, Beta-, and Gammaherpesvirinae. Within subfamilies, groups of related herpesvirus species are classified into genera. The nine species of human herpesviruses are distributed across the three subfamilies and several genera (Table 1); these viruses are the main focus of this work. Prior studies found that the Beta-and Gammaherpesvirinae are more closely related to each other than to Alphaherpesvirinae (Montague and Hutchison, 2000). In contrast to some other human viruses, the human herpesviruses have a long evolutionary history, with evidence suggesting that the primordial herpesvirus diverged into the Alpha-, Beta, and Gammaherpesvirinae approximately 180 million to 220 million years ago (McGeoch et al., 1995). Coupled with their genome complexity and the availability of numerous complete genome sequences, this deep evolutionary history makes herpesviruses a tractable and informative model to study virus genome evolution at the levels of gene duplication and protein domain rearrangement. Nehrt et al., 2011;Rogozin et al., 2014), due to its importance for computational sequence functional analysis (Eisen, 1998;Zmasek and Eddy, 2002) and the significance of gene duplications for biological evolution (Zhang, 2003).
Orthologs (or groups/clusters of orthologs) have often been inferred by indirect methods based on (reciprocal) pairwise highest similarities [e.g. (Remm et al., 2001;Tatusov et al., 1997)]. In this work, we used explicit phylogenetic inference combined with comparison to a trusted species tree for orthology inference, as this approach is likely to yield more accurate results Eddy, 2002, 2001).

Protein domains and domain architectures
Many eukaryotic proteins, and by extension, proteins of eukaryotic viruses, are composed of multiple domains, components that can each have their own evolutionary history and functional implications. The architecture of a protein is a product of the ordered arrangement of its several domains and their overall tertiary structure. Evolutionarily, individual domains can combine with other partner domains, enabling formation of a vast number of domain combinations, even within the same species (Moore et al., 2008). Assembling multiple domains into a single protein creates a distinct entity that can be more than the sum of its constituent parts. The emergence of proteins with novel combinations of duplicated and then diverged domains is considered to be a major mechanism for rapid evolution of new functionality in eukaryotic genomes (Itoh et al., 2007;Peisajovich et al., 2010). It is especially important in the evolution of pathways, where novel linkages between existing domains may result in the rearrangement of pathways and their behaviors in the cell (Peisajovich et al., 2010). The modular structure of eukaryotic proteins provides a mechanism that enables evolutionarilyrapid differentiation and emergence of a multitude of novel protein functions from an initially limited array of functional domains. Proteins can gain (or lose) new domains via genome rearrangements, creating (or removing) domain combinations, in addition to modification of domains themselves by small-scale mutations (Patthy, 2003;Ye and Godzik, 2004).
Here we present a systematic classification of proteins catalogued in the NCBI RefSeq entries for each of the nine human herpesviruses plus selected comparisons with homologs from non-human herpesviruses based on phylogenetic inferencing and domain architecture analysis using Domain-Architecture Aware Inference of Orthologs (DAIO). This analysis resulted in the classification of proteins into "Strict Ortholog Groups" (SOGs), in which all proteins are orthologous to each other (related by speciation events) and exhibit the same domain architecture. The SOG classification also enabled the development of an informative name convention for each SOG that includes information about the protein's function (if known) and a suffix indicating the taxonomic distribution of the protein. For example, an "aBG" suffix would indicate that proteins of this group are found in some (but not all) human Alphaherpesvirinae species (lowercase "a"), and all human Beta-and Gammaherpesvirinae species (uppercase "B" and "G"). Such suffixes allow for the quick understanding of presumed conserved protein function and minimal common genome across the Herpesviridae family. The SOG classification results have been made publicly available through the Virus Pathogen Resource (ViPR) (Pickett et al., 2012) at https://www.viprbrc.org.

Results and discussion
For this analysis, we developed a rational, phylogeny-and domain architecture-aware classification approach for human herpesvirus proteins, the Domain-architecture Aware Inference of Orthologs (DAIO) method, which produces Strict Ortholog Groups (SOGs) of proteins. Before we present genome-wide findings, we show results for a few instructive SOG examples, including protein groups that have evolved in a "simple" manner, recapitulating the Herpesviridae evolutionary tree without gene duplications or domain rearrangements, and protein groups in which domain rearrangements (domain gains) and/or gene duplications have occurred. Table 2 lists the 23 SOGs common to all nine human herpesviruses. For every SOG, a suggested name is provided, composed of a protein names and a suffix indicating the taxonomic distribution (A, B, G: present in all human members of the Alpha-, Beta-, Gammaherpesvirinae, respectively; a, b, g: present in some but not all human members of the Alpha-, Beta-, Gammaherpesvirinae, respectively). Gene names/symbols (a forward slash is either part of the accepted gene name or is used to separate multiple gene names) and Pfam domain architecture names are also included. The table is organized into three sections. The first section lists protein families that have apparently evolved without gene duplication or domain rearrangements [e.g., uracil DNA glycosylase and the capsid scaffolding protein protease (CSPP)]; the second section lists proteins that have evolved with domain rearrangements and or duplications [e.g., glycoprotein B (gB), DNA polymerase, and multifunctional regulator of expression proteins (mRE)], and the third section lists proteins that share some function (and even genome region) but have been formed from distantly or unrelated domains (e.g., gL, gN, and DNA polymerase processivity factor).

Uracil DNA glycosylase and capsid scaffolding protein protease: Evolution of a stable domain architecture without gene duplications
Uracil DNA glycosylases catalyze the first step -removal of the RNA  base uracil from DNA -in base excision repair, the mechanism by which damaged bases in DNA are removed and replaced (Krusong et al., 2006). Uracil DNA glycosylases are found in eukaryotes, bacteria, and archaea, as well as in herpesviruses and poxviruses (Chen et al., 2002). Our phylogenomic analysis shows that for all nine human herpesviruses, uracil DNA glycosylase is well conserved and contains one Pfam domain, UDG (uracil DNA glycosylase superfamily). In addition, the gene tree for human herpesvirus uracil DNA glycosylases ( Fig. 1B) precisely recapitulates the herpesvirus species tree (Fig. 1A); therefore, this protein family can be inferred to have evolved from a single common ancestor and without any gene duplications or domain rearrangements (see Table 2 for virus-specific gene names).
Capsid scaffolding protein proteases are essential for herpesvirus capsid assembly and maturation, and have an essential serine protease activity (Liu and Roizman, 1993). These proteins contain one Pfam domain, Peptidase_S21. In contrast to uracil DNA glycosylases, currently available data indicate that protease-scaffolding proteins with a Peptidase_S21 domain are unique to Herpesvirales. Like uracil DNA glycosylases, CSPP evolved without domain architecture rearrangements or gene duplications (Fig. 1C, Table 2).
Other examples of Herpesviridae genes that have evolved without any domain architecture rearrangements or gene duplications are listed in the first section of Table 2.

Molecular evolution of gB: A highly conserved protein required for viral fusion with a recent domain acquisition in one virus lineage
Herpesvirus virions have an envelope that consists of an outer lipid bilayer studded with 12 or more surface glycoproteins (originally defined in HSV). After virion glycoprotein engagement with cell surface receptors, the envelope fuses with the plasma membrane -a process which, for herpes simplex virus 1 (HSV-1), requires four of its 12 envelope glycoproteins, namely glycoproteins gB, gD, gH, and gL (Cai et al., 1988;Forrester et al., 1992;Ligas and Johnson, 1988;Roop et al., 1993;Spear and Longnecker, 2003). In contrast, for other herpesviruses, only glycoproteins gB, gH, and gL have been reported to be required for membrane fusion (AlHajri et al., 2017).
gB and gH are highly conserved across all nine human herpesviruses (Table 2). A protein annotated as gL is also present in all nine human herpesviruses, yet its occurrences in members of the Alpha-, Beta-and Gammaherpesvirinae are homologous within, but not between subfamilies. gLs from different subfamilies contain unrelated protein domains (Pfam: Herpes_UL1, Cytomega_gL, and Phage_glycop_gL). gL is discussed in more detail below.
Detailed phylogenetic analysis of the human herpesvirus gB family ( Fig. 2A), including proteins from selected non-human members of the Herpesviridae, shows a picture of a protein that has evolved without gene duplications (or, at the very least, duplicated genes have not been retained) and with nearly completely conserved domain architectures.
The one exception to this is that human cytomegalovirus (HCMV) glycoprotein B (gB) has a short region of about 40 amino acids near its Nterminus that comes in two forms that differ by approximately 50% at the amino acid level. This sequence variant was identified in HCMV strains isolated from Chinese patients (Shiu et al., 1994) and is identified in Pfam as "HCMVantigenic_N domain". In our global hmmscan analysis (applying the same threshold of E = 10 −6 for every Pfam domain) Evalue support for presence of this domain in some strains is strong (E < 10 -22 ) and matching over the entire Pfam model while other HCMV strains do not exhibit significant sequence similarity with this domain. It has been suggested that this domain polymorphism may be implicated in HCMV-induced immunopathogenesis, as well as in strainspecific behaviors, such as tissue-tropism and the ability to establish persistent or latent infections (Pignatelli et al., 2004). In our new systemic naming approach (see below) we term the SOG of the protein with HCMVantigenic_N domain "Glycoprotein B_ ABG.b", whereas all other proteins fall into the "Glycoprotein B_ ABG.AbG" SOG.

Molecular evolution of DNA polymerase: A highly conserved protein with domain acquisition
All members of the Herpesviridae encode six conserved proteins that play essential roles at the replication fork during viral DNA replication: a single-strand DNA binding protein (major DNA binding protein), a DNA polymerase composed of two independently coded subunits (the catalytic DNA polymerase subunit and a DNA polymerase processivity factor encoded by three distantly related genes in members of the Alpha-, Beta-, and Gammaherpesvirinae, see below), and a three subunit helicase/primase complex (DNA replication helicase, DNA helicase primase complex associated protein, and DNA primase) (Pellet and Roizman, 2007).
Our analysis shows that the catalytic DNA polymerase subunits of all members of the Herpesviridae contain two domains: an N-terminal DNA polymerase family B exonuclease domain, and a C-terminal polymerase domain from DNA polymerase family B (Fig. 2B). Cellular family B DNA polymerases are the main polymerases involved with nuclear DNA replication and repair in eukaryotes and prokaryotes, and include DNA polymerases II and B, and polymerases α, δ, and ε (Garcia- Diaz and Bebenek, 2007). Family B DNA polymerases are also found in other dsDNA viruses, such as the insect Ascoviridae, and members of the Iridoviridae (e.g., fish lymphocystis disease virus) and Phycoviridae (e.g., chlorella virus) (Villarreal and DeFilippis, 2000). In addition to these two large and ubiquitous domains, Simplexvirus (which include human simplex virus 1 and 2) and Mardivirus also possess a small C-terminal domain, called the DNA polymerase catalytic subunit Pol (DNAPoly-mera_Pol) domain in Pfam (Zuccola et al., 2000), and are longer by about 45 aa on average than DNA polymerase proteins from other Herpesviridae. According to currently available genomic data, DNAPo-lymera_Pol is found in members of the Simplexvirus genus of the Alphaherpesvirinae. While varicella-zoster virus (Human herpesvirus 3) and other members of the Varicellovirus genus of the Alphaherpesvirinae possesses DNA polymerases that also tend to be longer, similarity of these protein regions to the DNAPolymera_Pol domain is low, using the current Pfam model for DNAPolymera_Pol (Pfam version 31.0). The function of this third domain is to mediate interaction between DNA polymerase and its cognate processivity factor (Bridges et al., 2000;Loregian et al., 2000) based on the observation that a peptide corresponding to the 27 C-terminal amino acids of HSV-1 DNA polymerase has been shown to inhibit viral replication by disrupting the interaction between DNA polymerase and UL42 (Digard et al., 1995;Loregian et al., 1999). In this context, it is interesting to note that the DNA polymerase processivity factors are only distantly-related across the Alpha-, Beta-, and Gammaherpesvirinae (see below). It is therefore conceivable that the interactions of Beta-, and Gammaherpesvirinae DNA polymerase processivity factors with their corresponding DNA polymerases (which lack a DNAPolymera_Pol domain) is different in nature than for Alphaherpesvirinae. As for Varicellovirus it is unclear whether they possess a functional DNAPolymera_Pol domain, and a definitive answer will require similar biochemical assays as have been performed for HSV-1.
Phylogenetic analysis of human herpesvirus DNA polymerase proteins, plus related proteins from selected mammalian herpesviruses, shows that, similar to the glycoprotein B family, DNA polymerases of the Herpesviride evolved without gene duplication. Nonetheless, in contrast to gB, DNA polymerases acquired a new domain early in Alphaherpesvirinae evolution. This domain might have been lost again, or underwent significant mutations, during Varicellovirus evolution. The presence of the longer domain in Varicelloviruses suggests that the longer domain emerged prior to the Varicellovirus/Simplexvirus split.

Evolution of viral multifunctional regulator of expression (mRE) proteins (homologs of HSV1 ICP27)
Multifunctional regulator of expression (mRE; also known as immediate-early protein IE63, infected cell protein 27, ICP27, and α27) is a protein with homologs in all human herpesviruses (for gene names see Table 2). Multifunctional regulator of expression is a regulatory protein that plays a role in the prevention of apoptosis during HSV1 infection (Aubert and Blaho, 1999). Multifunctional regulator of expression interacts directly with a number of proteins in performing its many roles. In particular, multifunctional regulator of expression protein contributes to host shut-off by inhibiting pre-mRNA splicing by interacting with essential splicing factors, termed SR proteins, and affecting their phosphorylation (Sciabica et al., 2003). Furthermore, the mRE protein C.M. Zmasek et al. Virology 529 (2019) [29][30][31][32][33][34][35][36][37][38][39][40][41][42] has been shown to associate with cellular RNA polymerase II holoenzyme in a DNA-and RNA-independent manner and to recruit RNA polymerase II to viral transcription/replication sites (Dai-Ju et al., 2006;Zhou and Knipe, 2002). mRE also competes with some transport receptors, resulting in the inhibition of host pathways while supporting mRNA export factor-mediated transport of HSV-1 mRNAs (Malik et al., 2012). All of the multifunctional regulator of expression proteins analyzed here have a single copy of a Pfam "Herpesvirus transcriptional regulator family" (Herpes_UL69) domain that is specific to members of the Herpesviridae. In addition to the Herpes_UL69 domain, human Simplexvirus mRE have an additional N-terminal domain, the "Herpes viral adaptor-to-host cellular mRNA binding domain" (HHV-1_VABD) (Tunnicliffe et al., 2011). Besides human Simplexvirus, architectures with C-terminal HHV-1_VABD and N-terminal Herpes_UL69 domains are also found in Chimpanzee herpesviruses (e.g. NCBI Reference Sequence: YP_009011042 (Severini et al., 2013)), while other non-human Simplexviruses lack the HHV-1_VABD domain. Using currently available genomic data, we were unable to detect HHV-1_VABD domains outside of the Simplexvirus genus.
Phylogenetic analysis of human herpesvirus mRE proteins, including proteins from selected herpesviruses of other mammals, shows that multifunctional regulator of expression proteins evolved without observable gene duplications (since this gene tree recapitulates the herpesvirus species tree).

Different domains performing the same, or similar, functions
Nine groups of human herpesviruses are annotated as performing the same, or very similar function, in the absence of discernable protein sequence similarity (Table 2, Fig. 3).
As mentioned above, DNA polymerase processivity factor is one of the six proteins that play essential roles at the replication fork during viral DNA replication. Processivity factors, also called clamp proteins, help to overcome the tendency of DNA polymerase to dissociate from the template DNA, and thus greatly enhance DNA polymerase processivity (Weisshart et al., 1999;Zhuang and Ai, 2010). In contrast to the protein families discussed so far, DNA polymerase processivity factors are only distantly-related across the Alpha-, Beta-, and Gammaherpesvirinae. In the Alphaherpesvirinae, the protein is composed of two tandem Herpes_UL42 domains; Betaherpesvirinae have a single Herpe-s_PAP domain; Gammaherpesvirinae have a single Herpes_DNAp_acc domain (Fig. 3A, B, C). These three domains are very distant homologs and are members of the DNA clamp superfamily (Pfam clan CL0060). gL (Fig. 3D, E, F) is another example of a protein function performed by different, probably non-homologous domains present in different Herpesviridae subfamilies (Pfam domains Herpes_UL1, GlyL_C, Cyto-mega_gL, and Phage_glycop_gL). Interestingly, the open reading frames for these seemingly unrelated proteins are located in analogous conserved genomic contexts, including open reading frame sizes and orientations relative to the surrounding conserved coding regions.

Gene duplication during viral 7-transmembrane receptor domain protein evolution
In contrast to the protein families discussed so far, the evolutionary history of human Herpesviridae proteins with 7-transmembrane receptor domains is more complex (Fig. 4) (Spiess et al., 2015). By comparing this gene tree with a species tree for human Herpesviridae (Fig. 1A), we can infer three gene duplication events (marked as red squares in Fig. 4), resulting in four groups of orthologous genes: UL33/U12, US27, U51/ORF74, and US28. In our new nomenclature (see below), we call the first group "Gprotein coupled receptor homolog UL33/U12_B" because it is found in all four human Betaherpesvirinae species (uppercase B suffix). The second group is called "G-protein coupled receptor homolog US27_b" as it is found in some human Betaherpesvirinae (lowercase b suffix). The third group is called "G-protein coupled receptor homolog U51/ORF74_bg" because it found in some human Betaherpesvirinae and in some human Gammaherpesvirinae (lowercase "bg" suffix). The fourth group is called "Envelope protein US28_b". No orthologous genes were found in the human Alphaherpesvirinae. Whenever available, we base our names preferably on (Mocarski, 2007) or the "Recommended name" (under "Protein names") from the UniProtKB database (Bateman et al., 2017). For reasons of consistency and objectivity, we used an automated approach to root all trees by mid-point rooting. It is possible, that the true root for the 7-transmembrane domain proteins tree is at the base of the U51-ORF74 subtree. In this case there would be only two duplications in the tree, but still the same four ortholog groups: U51/ORF74, US28, US27, UL33/U12. Functionally, all these proteins appear to be hijacked human proteins that are being used by the virus to modulate the host immune system. In particular, many of them appear to act as chemokine (orphan) receptors (Casarosa et al., 2003(Casarosa et al., , 2001Isegawa et al., 1998;Murphy, 2001;Zhen et al., 2005) (Fig. 5).

The complex evolution of US22 domain proteins
Proteins with US22 domains have the most complex evolutionary history of all Herpesviridae proteins, even though among the human herpesviruses, the US22 domain has been found only in C.M. Zmasek et al. Virology 529 (2019) 29-42 betaherpesviruses (Hanson et al., 1999). US22 domain proteins are also present in Gallid herpesvirus 2 (a member of the Alphaherpesvirinae), in members of the Alloherpesviridae family, in other dsDNA viruses (e.g., Poxviridae and Iridoviridae), and in some animal species. Most proteins with US22 domains carry two copies of the domain. US22 is a member of a large group of distantly homologous proteins (the SUKH superfamily, Pfam clan CL0526), which, for example include bacterial Syd proteins. It has been suggested that a function of the US22 family is to act against various anti-viral responses by interacting with specific host proteins (Zhang et al., 2011).
Here we summarize the results of our phylogenetic analysis of US22 domain proteins of the human bataherpesviruses. Unfortunately, the phylogenetic signal across this group of protens is weak, thus some support values are low. Two groups of US22 orthologs span all four human betaherpesviruses: CMV tegument protein UL23 is likely to have orthologs in HHV-6A, HHV-6B, HHV-7 (Roseolovirus) Protein U3 ("Tegument protein UL23/Protein U3_B"). Similarly, CMV Tegument protein UL43 is likely to be orthologous to HHV-6A, HHV-6B, HHV-7 (Roseolovirus) Protein U25 ("Tegument protein UL43/Protein U25_B"). U3 and U25 are paralogous towards each other, as they are connected by a gene duplication, as are HCMV UL23 and 43. Four groups of orthologs specific to Roseolovirus are Tegument protein DR1, Tegument protein DR6, Protein U7, and Protein U17/U16. In U17/U16 proteins, it is unclear whether they possess a second US22 domain, as the similarity to this domain is weak to the point of insignificance. In contrast, U7 proteins possess at least three US22 domains and an additional C-terminal Herpes_U5 domain. Proteins U7 are most closely related to CMV UL29, but differ in their domain architecture (lack of Herpes_U5 domain). Thus CMV UL29 forms its own species-specific group of orthologs. Numerous proteins with US22 domains are specific to CMV (and thus all paralogous to each other) given current data: apoptosis inhibitor UL38, early nuclear protein HWLF1, tegument protein UL26, US24, protein UL24, UL29, UL36, US23, US26, protein IRS1, and protein TRS1. C.M. Zmasek et al. Virology 529 (2019) 29-42 2.8. The inferred minimal proteomes of the human herpesviruses As described above, we classified viral proteins into "strict ortholog groups," requiring that all proteins exhibit the same domain architecture and are orthologous to each other. We attempted to give an informative name for each of these groups including a suffix that indicates the taxonomic distribution of a protein. For example, an "aG" suffix would indicate that proteins of this group are found in some (but not all) members of human alphaherpesvirus species (lowercase "a"), and members of both human gammaherpesvirus species (uppercase "G").
Families which have a (some) domain(s) in common but differ in their domain architectures, are more difficult to rationally name (we found 17 of these cases). An example of such a family is DNA polymerase. In such cases, the suffix is split by a period into two parts. The first part indicates overall presence of common domain(s) for all members of this SOG, the second part (after the period) relates to specific domain architectures. Thus, "DNA polymerase_ ABG.aBG" refers to the simpler DNA_pol_B_exo1--DNA_pol_B domain architecture present in nearly all Alphaherpesvirinae species. "DNA polymerase_ ABG.a" refers to the DNA_pol_B_exo1--DNA_pol_B-DNAPolymera_Pol DA that is present in a smaller subset of Alphaherpesvirinae species.
The rationale behind this approach for labeling members of protein families that have different domain architectures is that it gives users a choice between "traditional" ortholog groups, which do not consider domain architectures (by ignoring the part after the period), and SOGs (taking the full name into account).
In total, we were able to establish 169 SOGs (Supplementary Table 1). Of these, 40 (23 +8 +9) functionally similar groups (Table 2) are present in all 9 human Herpesviridae species and represent the core proteins of human herpesviruses.
Besides proteins with clearly defined Pfam domains, we found 29 protein families for which Pfam domains have not been defined. Classification of these proteins was based on manual BLAST searches. An example of such a family is the virion host shutoff protein UL41.
Another unusual case is the HSV1 UL13 serine threonine protein kinase. All nine human herpesviruses have homologs of this protein, but its associated Pfam domain UL97 only matches sequences in betaherpesviruses. Extension of the family to alpha-and gammaherpesviruses is thus based on manual BLAST searches.
Proteins which are species or strain specific are listed in Supplementary Table 2.

Dissemination of SOG data through the ViPR database
In order to make the results of DAIO classification available to all Herpesvirus researchers for experimental hypothesis testing, we incorporated SOG data into the Virus Pathogen Resource (ViPR) at https://www.viprbrc.org (Pickett et al., 2012). Through ViPR, scientists can search, sort, and download SOG names (including taxonomic distribution), Pfam domain architecture data, and individual protein sequences belonging to selected SOGs. Fig. 6A shows an example of a search result table, which includes data for some of the protein families discussed above, namely glycoprotein B family members (associated with two distinct SOGs: "Glycoprotein B_ ABG.b" and "Glycoprotein B_ ABG.AbG"), DNA polymerase ("DNA polymerase_ ABG.a" and "DNA polymerase_ ABG.aBG"), and multifunctional regulator of expression ("Multifunctional regulator of expression_ABG.a" and "Multifunctional regulator of expression_ABG.aBG"). By clicking on the "Total # of Proteins" table entries, users can view and download the individual protein sequences belonging to a given SOG. Fig. 6B shows how SOG data, including domain architecture information, is part of protein annotations in ViPR (Simplexvirus "DNA polymerase_ABG.a" example). As new genome sequence data become available, the SOG data in ViPR is continuously updated in order to keep current with the ever expanding universe of Herpesvirus protein sequences. In addition, SOG annotations in ViPR will be expanded to include non-human Herpesviruses in the future. SOG data is also available for Pox-and Coronaviruses in ViPR, and will be applied to other virus families in the future.

Conclusions
In this work, we used Domain-architecture Aware Inference of Orthologs (DAIO) to provide a classification for proteins of human herpesviruses, based on domain architecture and phylogenetic history. While the work presented here is limited to human herpesviruses, and thus does not take full advantage of all the sequence data that is currently available, we plan to extend our DAIO approach to all herpesviruses with a known phylogenetic history.
A major contribution of our classification system to herpesvirus biology is that it provides a series of testable hypotheses for further experimental investigations. For example, it informs experimental reconstruction of minimal genome viruses. Such synthesized minimal genomes could prove useful for identification of genes responsible for pathogenic and other biological differences between viruses.
Of particular interest in the field of molecular biology is the relationship between domain architecture and protein function. The detailed analysis of domain architectures presented here suggests studies that investigate the functional effects of removing or swapping domains in viral multidomain protein architectures The fact that Simplexvirus DNA polymerases contain the extra DNAPolymera_Pol domain and that this domain architecture is conserved among Simplexvirus isolates suggests that it may provide some unique function necessary for efficient replication of Simplexviruses. This hypothesis could be explored experimentally. Similarly, what would be the consequence of adding a Cterminal GlyL_C domain to the gL protein of VZV (which contains one Herpes_UL1 domain), and so making it similar to the gL protein found in HSV-1 and HSV-2 (which has a Herpes_UL1--GlyL_C architecture)?
Interestingly, while it has been noted that domain loss is an important mechanism in eukaryote evolution (probably equally-and possibly even more-important than domain gain) (Zmasek and Godzik, 2011); and references therein), in herpesvirus evolution domain loss seems to play a lesser role, as most of the events we were able to detect are domain gains (according to the parsimony principle).
Another implication of this work relates to the observation that in some cases proteins that share the same name are composed of either unrelated (e.g. gL) or very distantly related domains (e.g. DNA polymerase processivity factor) in different herpesvirus species. This raises the question -are such share named truly justified for proteins composed of unrelated domains? And to what extent has their putative shared function been experimentally validated.
Our approach is also expected to facilitate the detection and subsequent experimental study of species-(and strain-) specific proteins (listed in Supplementary Table 2). Whereas HSV1 and HSV2 do not have any species specific proteins given current data, VZV has six, and CMV has by far the most with 130 proteins which are not found in any other species. Interestingly, many of these 130 proteins are specific to one strain (or isolate) of CMV. Unsurprisingly, many of these species-and strain-specific protein do not yet have a Pfam domain (and thus were analyzed by manual BLAST searches in this work). An example of such a protein is the ORF45 protein of KSHV (Zhu and Yuan, 2003). Our automated approach provides a starting point for the systematic computational and experimental study of these species-and strain-specific proteins-studies, which eventually will provide answers to such questions as: Are these species-and strain-specitic proteins essential under certain conditions? Do they result in altered pathology or clinical symptoms? Do they function in host interaction? Do they possess as of yet undiscovered, but shared protein domains?
In summary, we developed a computational approach called Domain-architecture Aware Inference of Orthologs (DAIO) for the classification of viral proteins into groups of orthologous proteins with identical domain architecures (SOGs). In addition, we established a nomenclature for SOGs that provides the user with information about the biological function and taxonomic distribution for the member proteins of a SOG. We applied this classification and nomenclature to the proteomes of all human Herpesviridae species and made the results publicly accessible via the ViPR database. The acquisition and retention of novel domain architectures suggests that some Herpesviridae proteins may have acquired novel functional characteristics, which can now be explored experimentally.

Materials and methods
We developed a semi automated software pipeline to analyze amino acid sequences for their protein domain based architectures and to infer multiple sequence alignments and phylogenetic trees for the molecular sequences corresponding to these architectures, followed by gene duplication inference. This pipeline contains the following five major steps: (1) sequence retrival; (2) domain architecture anlysis, including the inference of the taxonomic distributions of domain architectureseach of which corresponding to one preliminary SOG, and manual naming of domain architecures/preliminary SOGs (to be automated in future versions of this pipeline); (3) extraction of molecular sequences corresponding to domain architectures/preliminary SOGs; (4) multiple sequence alignment and phylogenetic inference; (5) gene duplication inference, to determine which preliminary SOGs contain sequences related by gene duplications and thus need to divided in multiple, final SOGs. Links to all custom software programs developed for this work are available here: https://sites.google.com/site/cmzmasek/home/ software/forester/daio. In the following the tools and methods used are described in more detail.

Sequence retrieval
Individual protein sequences were downloaded from the ViPR database (Pickett et al., 2012), while entire proteomes were downloaded from UniProtKB (Bateman et al., 2017).

Multiple sequence alignments
Multiple sequence alignments were calculated using MAFFT version 7.313 (with "localpair" and "maxiterate 1000" options) (Katoh and Standley, 2013;Kuraku et al., 2013). Prior to phylogenetic inference, multiple sequence alignment columns with more than 50% gaps were deleted. For comparison we also performed the analyses based on alignments for which we only deleted columns with more than 90% gaps.

Phylogenetic analyses
Phylogenetic trees were calculated for individual domain architectures (not full-length sequences) except for US22 domain proteins, because US22 domain alignments lack phylogeneticly sufficient signal. Distance-based minimal evolution trees were inferred by FastME 2.0 (Desper and Gascuel, 2002) (with balanced tree swapping and "GME" initial tree options) based on pairwise distances calculated by TREE-PUZZLE 5.2 (Schmidt et al., 2002) using the WAG substitution model (Whelan and Goldman, 2001), a uniform model of rate heterogeneity, estimation of amino acid frequencies from the dataset, and approximate parameter estimation using a Neighbor-Joining tree. For maximum likelihood approaches, we employed RAxML version 8.2.9 (Stamatakis et al., 2005) (using 100 bootstrapped data sets and the WAG substitution model). Tree and domain composition diagrams were drawn using Archaeopteryx [https://sites.google.com/site/cmzmasek/home/ software/forester]. Rooting was performed by the midpoint rooting method. Unless otherwise noted, Pfam domains are displayed ith a E = 10 −6 cutoff. Gene duplication inferences were performed using the SDI and RIO methods Eddy, 2002, 2001). Automated genome wide domain composition analysis was performed using a specialized software tool, Surfacing version 2.002 [Zmasek CM (2012), a tool for the functional analysis of domainome/genome evolution [available at https://sites.google.com/site/cmzmasek/home/software/ forester/surfacing]. All conclusions presented in this work are robust relative to the alignment methods, the alignment processing, the phylogeny reconstruction methods, and the parameters used. All sequence, alignment, and phylogeny files are available upon request.

Phylogenomic analyses and development of novel naming schema using strict ortholog groups
The processes for defining and naming strict ortholog groups were formalized into a set of "rules" and then implemented into a semi-automatic domain-centric phyloinformatics pipeline. Any unique arrangement of single or multiple Pfam domains is considered a domain architecture (DA) Godzik, 2012, 2011). Most proteins of members of the Herpesviridae have DAs consisting of only a single domain. For example, the UDG domain of uracil DNA glycosylase is a single domain DA, whereas the combination of N-terminal DNA_-pol_B_exo1 and C-terminal DNA_pol_B (denoted as DNA_pol_B_ex-o1--DNA_pol_B) of DNA polymerases is a DA with two domains.
In this analysis, we consider a given DA "present" in a given Herpesviridae species S if the DA is present under a set of thresholds in at least one strain of the species S. The rationale for this is that it is possible to miss a DA in a genome, due to incomplete or erroneous sequences, erroneous assembly and gene-predication (false negatives), and even recent, actual gene loss. The opposite (false positive), on the other hand, is far less likely. For this work, we used two thresholds: a minimal domain length of 40% of the length set forth in the Pfam database (domain fragments are unlikely to be functionally equivalent to full length domains) and a hmmscan E-value cutoff of E = 10 −6 .
For every domain architecture, a set of bootstrap resampled phylogenetic trees (gene trees) was calculated by RAxML (Stamatakis et al., 2005) using protein sequences from one representative for each of the nine human Herpesviridae species. For comparison and validation, we also calculated phylogenetic trees that included non-human hosted Herpesviridae. For illustrations, gene duplications were inferred by comparing the consensus gene trees to the species tree ( Fig. 1) for Herpesviridae using the SDI (Speciation Duplication Inference) algorithm (Zmasek and Eddy, 2001). To obtain confidence values on orthology assignments (bootstrap support values), we employed the RIO approach (Resampled Inference of Orthologs) to compare sets of bootstrap resampled phylogenetic trees with the species tree for Herpesviridae (Zmasek and Eddy, 2002).
In this work, we define a strict ortholog group (SOG) as sequences related by speciation events and exhibiting the same domain architecture (based on Pfam domains from Pfam 31.0, a length threshold of 40%, and E-value cutoff of E = 10 −6 ). Based on this approach for defining SOGs, we developed the following naming syntax.
For protein families such as uracil DNA glycosylase, which exhibit the same DA in all nine human Herpesviridae, and which are related by speciation events only, we base our names on (Mocarski and Edward, 2007) as the base name and add a case-sensitive suffix that indicates the taxonomic distribution -"ABG" in this case, since uracil DNA glycosylase appears in each human Alpha-, Beta-, and Gammaherpesvirinae species. Therefore, the full name is "uracil DNA glycosylase_ABG". To indicate presence in some, but not all members of a subfamily, we use lower-case suffixes. "Replication origin-binding protein_ Ab" implies that members of this SOG are present in all human Alphaherpesvirinae species ("A"), and in some (but not all) Betaherpesvirinae ("b").
While most of the human Herpesviridae protein families fall into these basic cases, families which have a (some) domain(s) in common but differ in their DA, are more difficult to rationally name. An example of such a family is glycoprotein B described above. Because members of this family have different DAs, namely "Glycoprotein_B" and "HCMVantigenic_N-Glycoprotein_B", it is composed of two SOGs (named "Glycoprotein B_ ABG.AbG" and "Glycoprotein B_ ABG.b"). In such cases, we split the suffix into two parts, separated by a period. The first part ("ABG") indicates overall presence of common domain(s) for all members of this SOG, Glycoprotein_B in this case. The second part (after the period) relates to entire DAs. ". AbG" of "Glycoprotein B_ ABG.AbG" means that the Glycoprotein_B DA is present in all human Alpha-and Gamma-, and some Betaherpesvirinae. ".b" of "Glycoprotein B_ ABG.b" implies that the "HCMVantigenic_N-Glycoprotein_B" DA is present in some Betaherpesvirinae.