Atypical Divergence of SARS-CoV-2 Orf8 from Orf7a within the Coronavirus Lineage Suggests Potential Stealthy Viral Strategies in Immune Evasion

Orf8 is one of the most puzzling genes in the SARS lineage of coronaviruses, including SARS-CoV-2. Using sophisticated sequence comparisons, we confirm its origins from Orf7a, another gene in the lineage that appears as more conserved, compared to Orf8.

identity to the human/civet strain (35). Further confirming the origin of Orf8 from bat species, it has also been argued that this region may play a role in tracing the origin of SARS strains in epidemic outbreaks, since the reported deletions do not affect the survival of the SARS-CoV-1 virus (20). Importantly, SARS-CoV-2 has an intact Orf8 (non-split, Orf8ab) gene, which is known to be absent from the more lethal MERS strain (20).
Motivation. The Orf7a protein has a known Ig-like structure and exhibits large genome-level variation and yet relative conservation as a single protein family, whereas Orf8, which is predicted to have a similar Ig-like structure, exhibits fewer genome-level variations apart from the split a/b pair or a full deletion and is quite variable at the protein sequence level. It should be noted that neither specific disordered regions have been detected for either of the two genes (36) nor any peculiar codon usage patterns (37). However, this peculiar trajectory of Orf8/Orf7a in SARS-CoV-2 indicates a stealth viral strategy that might be key to control virulence and pathogenicity, with roles in host cell-virus interactions. Here, we investigate the origins and evolution of Orf8 across the coronavirus lineage by sequence and recombination analysis, ultimately merging the two Ig-like families by sensitive searches and identifying the shared conserved residues that define the common Ig fold.

RESULTS
Patterns of the multiple sequence alignment. The multiple alignment, as presented, reveals a turbulent evolutionary history across multiple coronavirus strains for this pair of SARS-CoV-2 genes and their homologs ( Fig. 1). At the top of the alignment, there are nine members of the SARS Orf8a and Orf8b lineage, not always corresponding to a cognate pair, and the three truncated structures (block i, Fig. 1A). The next set (block ii) is composed of 93 Orf7a homologs, excluding those 3 of known structure (2 for SARS-CoV-1 and 1 for SARS-CoV-2, intertwined within the first block, due to their truncated N termini, block i thus containing 12 members). The Orf7a group is followed by a quite heterogeneous set of 24 X4-like/Orf9 sequences from bats, Orf8 from pangolin and Orf9 from SARS-CoV-1 (block iii): despite different names and significant variation, they represent similar sequences of common origin. Finally, the bottom group (block iv) comprises 67 Orf8 sequences from SARS-CoV-2 intermixed with those from bat or pangolin hosts. The mutations V71L (V62L) and L96S (L84S) reported elsewhere (38) are clearly seen within the Orf8 family, among others (Fig. 1A, see DS04 at https:// doi.org/10.6084/m9.figshare.12678491.v1). Of 265 processed SARS-CoV-2 genomes, only two samples from Wuhan, China (WH19002/2019 and WH19004/2020), along with the USAWA-UW413/2020 case, exhibit no mutations compared to the original reference strain (Wuhan-Hu-1) (see DS.snp).
Sequence conservation across virus strains. The first 15 residues are considered to function as signal peptides for Orf7a and Orf8 proteins and were long thought to be the only element shared between them (13). We now extend the signal peptide similarity throughout the superfamily and suggest a common origin for those genes, augmenting independently obtained results by parallel efforts (23). The conservation pattern extends across the entire length of the superfamily, based on evidence from the database searches and the multiple alignment (Fig. 1A). It is notable that the N-terminal region is aligned separately from the downstream protein sequence, suggesting a potential cleavage site. Interestingly, most residues at position 2 are lysines, with two exceptions shown for SARS-CoV-2: one substitution by glutamate (QJF76096) and another by arginine (QIU91254), similar to bat X4-like/Orf9 entries (in the middle of the alignment, block iii). Then, from position 15 onward, the proteins of known structure suggest the presence of the Ig-like beta-sandwich, and an insertion of ;20 residues in the bottom part of the group (block iv, X4-like and Orf8), mostly covered by the regular expression {W.[I,L,V][K,R][I,V,Y]}. The Orf8a/b pair of SARS-CoV-1 underlines the structural flexibility of this segment and the possibility of having two distinct protein products that associate to perform the role of Orf8 in this viral strain. A shorter insert of about 10 residues is also found in Orf8b and some of its Orf9 bat homologs, mostly covered by the regular expression {L[I,V].RC}. Note that these two segments are  (52). (b) Sequence-structure conservation for SARS-CoV-1 Orf7a (1xak), 10 conserved residues (apart from three residues found in the signal sequence but not available in the structure) are indicated by red; disulfide bridges are indicated by green, rendered by UCSF Chimera (59). (c) Cladogram based on the final alignment, with SARS-CoV-2 sequences marked by red color, generated by FastTree (55) (DS.tree), visualized by IcyTree (57). The block iv group containing Orf8 homologs exhibits the widest variation. Due to significant variation, a large number of gaps or a fragmented gene structure, a cladogram is selected instead of a more typical phylogenetic tree to depict the structure of the superfamily. matched onto members of the Orf8ab pair, underlining the likely origin of the intact SARS-CoV-2 Orf8 from those, as mentioned above.
Block annotation and consistency checking. An annotated version of the alignment was generated, manually trimming low-occupancy positions, and marking the block structure (108 residues long, see DS05 at https://doi.org/10.6084/m9.figshare .12678491.v1 see Fig. S2A). From the annotated alignment (DS05), an automatically trimmed alignment (39) was also generated (54 residues long, see DS06 at https://doi .org/10.6084/m9.figshare.12678491.v1; see Fig. S2B), with two of the conserved positions not included (Y47, P79), in addition to removing most of the variation from rapidly changing residues across family members. This alignment reveals a consistent picture further suggesting that the common origin of these genes is indeed determined by the positions that define the Orf7a Ig-like fold, with Orf8 being subjected to rapid evolutionary change.
Phylogenetic tree and sequence space. An orthogonal approach to tree inference is a multidimensional alignment embedding in sequence space (40), which confirms the unique evolutionary history for the superfamily, not readily seen in a tree graph (Fig. 2). The distances in the three-dimensional (3D) embedding represent variation seen across multiple groups: in particular, within-family distances from Orf7a groups are clearly less variable compared to the Orf8 groups, supporting the notion of Orf8 as a highly variable protein in coronaviruses. However, the distances between SARS-CoV-1 and SARS-CoV-2 strain groups are comparable (Fig. 2). Predictably, when nonconserved residues are ignored, the within-family distances appear similar. SARS-CoV-1 Orf7a and Orf8 (also the concatenated ab) come closer than their SARS-CoV-2 counterparts, indicating that there is a faster evolution in the latter, also affected by more extensive sampling of the sequence space. This pattern is also reflected in a tree comparison tanglegram, with the major groups remaining stable, as the fast-evolving nodes alter the tree topology (see Fig. S3).
Phylogenetic profiling across the coronavirus pangenome. For both the Orf7a and Orf8 SARS-related families, the phylogenetic distribution of their members is distinctly restricted, supported by an ongoing coronavirus pangenome study (8), which lists these genes separately. We have confirmed a single key exception (X4-like, block iii) in the alpha group (QCX35187.1, https://www.ncbi.nlm.nih.gov/nuccore/MK720946 .1/) and its X4-like homologs (e.g., QCX35176.1), as previously shown (23). When mapped onto the tree of 89 representative coronavirus strains derived from DNAbased whole-genome alignment (41), the restricted distribution pattern is clearly visible (see Fig. S4A). In addition, no clear evidence for recombination for this region of the SARS-CoV-2 genome is detectable across the 89 strains, markedly challenging previous claims (35,42) and further supporting the scenario of gene duplication and extreme divergence. No homologs of this superfamily have ever been detected outside the coronavirus group. Furthermore, we were not able to confirm any other similarities at the structural level (23); these remain particularly valuable predictions for future research.
From structure to function, as well as constraints from protein interactions. The discovered interactions of these proteins for SARS-CoV-2 with the mammalian host proteins reveal a contrasting picture (43): while Orf7a is relatively more conserved, only two interactions have been detected for this protein (https://amp.pharm.mssm.edu/ covid19/genesets/20). On the other hand, 47 interactions have been detected for Orf8, which has a much faster evolutionary rate (https://amp.pharm.mssm.edu/covid19/ genesets/22) (43). This could be an artifact related to the coverage of protein interactions between the protein complement of the virus and the host cell, and yet it may also point to a more complex interaction landscape for Orf8, underlining its importance in viral strategies to invade cells and propagate.

DISCUSSION
Here, we show for the first time that remote, nontrivial sequence similarities between the SARS-CoV-2 proteins Orf7a and Orf8 are detectable using supervised sequence space walks in database searches, aimed at precision and reproducibility (44). The detection of similarity between Orf8 and Orf7a, a protein of known structure with an Ig-like fold, implies the importance of this gene duplication for virus biology and confirms previous results, independently derived by different approaches, i.e., profile-profile searches and protein modeling (23). We assessed the extent at which sequence comparisons alone can establish unambiguously the homology between Orf8 and Orf7a family members within the coronavirus lineage and entire pangenome. The embedding of sequence conservation patterns in a multidimensional space reveals atypical divergence patterns, with "equidistant" groupings for Orf7a and Orf8 across SARS-CoV-1 and SARS-CoV-2 but significant divergence of Orf8 compared to within-family distances for Orf7a (Fig. 2). Moreover, the conservation of Orf7a compared to Orf8 is clearly demonstrated not only by its similarity to bat viruses for SARS-CoV-1 and pangolin-Manis javanica viruses for SARS-CoV-2, respectively, but with the stark difference between Orf8 in these strains (;35% identity for SARS-CoV-1 Chinese bat-Rhinolophus sinicus virus homologs compared to ;88% identity for SARS-CoV-2 pangolin virus homologs), a puzzling pattern (Fig. 3).
During the final stages of this study, the structure of Orf8 from SARS-CoV-2 has been announced in a preprint (PDB 7JTL), confirming the presence of an Ig-like fold (45) and further supporting the notion of a common origin between the Orf7a and Orf8 families. Of the 13 conserved alignment positions, 9 are available (4 are missing at the N terminus): Q28, C30 (strand 1), P41, C42 (loop between strands 2 and 3), Y47 (strand 3), P79 (loop before last three strands in both structures, at 16.141 Å distance of C-alpha atoms, due to the presence of an Orf8-specific region [45], thus excluded from superposition measurements), and C95, C107, and V135 (the latter also excluded due to uncertainty) are shifted and interpreted differently between the automatically derived profile-driven alignment here and the structure-based alignment (45). Remarkably, when the two cysteines are shifted, the root mean square deviation between 7 C-alpha atom pairs drops from 9.883 Å (from their sequence-based match) to 3.398 Å (to their structure-based match), thus confirming the rather different outcomes of the sequence-and structure-based alignments.
We provide strong evidence for the peculiar divergence of Orf8 from Orf7a, within an otherwise dense viral genome and delimit the phylogenetic distribution of these two genes across coronaviruses. Neither of these genes is found in the gamma or delta coronavirus groups, suggestive of a likely loss in those coronavirus sublineages (see Fig. S4BC). It is quite perplexing that no member of this family is present in the MERS clade (23). Given that Orf7a with an Ig-like structure is potentially an immune antagonist with a pivotal role in the viral infection strategy and the recent observation that Orf8 downregulates MHC-I (46), the Orf7a/Orf8 superfamily might be a key system for immune evasion, known for other analogous cases, including herpesviruses, poxviruses, and adenoviruses (47). The detected protein interactions of SARS-CoV-2 exhibit such a sharp contrast between Orf7a and Orf8 (43) that, barring a technical artifact with respect to coverage, the hypothesis arises for Orf7a being used as a conserved template, to generate variants, such as Orf8, wreaking havoc through immune evasion in the host cell.
In summary, the pair Orf7a/Orf8 may be an under appreciated element in SARS-CoV-2 biology, given its peculiar and quite unusual patterns of divergence and functional properties that might be related to virulence and the pathogenicity of this strain which has caused the COVID-19 pandemic.

MATERIALS AND METHODS
Database searches and documentation. Using the Orf8 protein sequence of SARS-CoV-2 as a primary query, we extensively searched NCBI's 'nr' protein collection (48), following an approach described elsewhere (44), including masking by CAST (49) and permissive E value thresholds in supervised mode. This approach, named "sequence space walk," allows the use of permissive thresholds to detect weak sequence similarities, by inspecting all hits at each single iteration. We specifically searched the Virus subset of 'nr' using PSI-BLAST (50) and the following parameters: max target sequences, 500; Expect 10; word-size 2; BLOSUM62; gap costs, 11/1; compositional adjustment, none; Filter, none; and E value, 0.1. Against the full 'nr', the search exits early at ;200 hits with the same parameters (not shown). Fragments and sequences with unspecified residues were excluded (see below). Search statistics are provided (Table 1), along with the hit tables and PSSMs (the PSSM is unavailable for step 1 only) (DS.runs).
Other software tools. Sequence alignments were performed with MAFFT and default parameters (51) and visualized with JalView 2 (52). Family analysis was performed with AlignmentViewer (40), and dimensionality reduction for sequence space exploration (2D and 3D) was aided by UMAP (53), as implemented in AlignmentViewer. The color coding for sequence glyphs was selected according to mView (54) using a residue width of 2 pixels and a residue height of 1 pixel. Validation of sequence database searches, identifier tracking, and alignment quality checks were performed using various scripts and alignments were automatically trimmed with TrimAl (39), unless stated otherwise. Trees were inferred using FastTree (55) on the NGPhylogeny.fr servers with the LG/gamma options (56) and visualized with IcyTree (57). Tree annotation for FIG 3 Closest homologs of the SARS-CoV-1 and SARS-CoV-2 for Orf7a and Orf8. A restricted search against the virus subset of 'nr' that excludes SARS-1/-2 (taxid 694009 and 2697049, respectively) as targets reveals the similarity of superfamily members to coronavirus strains from other hosts. Orf7a of SARS-1 is most similar to Chinese bat (Rhinolophus sinicus) strains (;95% identity). The concatenated Orf8ab (from Orf8a/Orf8b) shows the closest-yet lower-similarity to Chinese bats (;35% identity), since the host species Rhinolophus ferrumequinum is excluded. In contrast, Orf7a of SARS-2 is most similar to pangolin (Manis javanica) strains (;97% identity), with Orf8 exhibiting high levels of similarity again to pangolin (;88% identity). Identifiers next to the host name are provided. The scale (sequence identity) is shown on the right. "NA" signifies that there are no similarities detected in this restricted sequence search. The diagram was generated by using ClustVis (66). SARS-CoV-1 is referred to here as "SARS-1," and SARS-CoV-2 is referred to here as "SARS-2." Orf8 for "SARS-1" corresponds to a concatenated Orf8ab.
genes and clades was supported by MicroReact (58). Protein structure analysis and annotation was performed with UCSF Chimera (59). Protein interaction data were obtained from the Covid-19 drug/gene set library (60). Variations were calculated using Virulign (61) over a representative sample of SARS-CoV-2 genomes. Recombination breakpoints were identified using BALi-Phy (62) to compute the pairwise homoplasy index across the genome alignment (63). Regions without breakpoints were identified as nonrecombining regions (15). Further details are given in Text S1 in the supplemental material.
Supervised sequence space walk. The initial query sequence Orf8 from SARS-CoV-2 detects its closest homologs within the Orf8 family (PF12093) (29), followed at step 3 by distant homologs of the Orf7a family (PF08779) (17,64), variably called X4 or Orf10, from a range of viral hosts that include bats, pangolins and civets. The search converges at step 9 with high precision, i.e., no known false positives (Fig. 4a). The detected region of homology spans the length of these proteins beyond the signal sequence, as observed independently (23), and unifies the two Pfam entries into a single superfamily (65), with few invariant residues across its members (see below). All PSSMs are stored for future searches (see Text S1, DS.runs) and re-use by the community. Since the results are time sensitive at present (July 2020), the PSSM collection is provided to ensure reproducibility: very many and highly similar sequences will be deposited in the meanwhile, without substantially modifying the main conclusions or challenging the validity of the reported sequence search results.
Identifier processing. There are 24 sequence entries that are lost (as false negatives) in the iterative search, 20 of which reappear in subsequent iterations (Fig. 4b). Of those, 1 entry is lost at step 2, 11 are lost at step 3, another 11 are lost at step 4, and just 1 is lost at step 6. All lost entries were eventually recovered, except for 4 (all of representing sequences shorter than 25 residues). A link with the 24 entries is provided for further inspection (identifiers are available in DS.runs).
Quality control and alignment. The iterative search returns 465 unique entries, many of which are partial sequences (of short length) or have multiple undefined residues (see DS01 at https://doi.org/10 .6084/m9.figshare.12678491.v1); just 1 entry (QJI07349.1) returns two hits in separate regions, due to 70 undefined middle positions (one duplicate ID). When partial/undefined sequences were removed, 288 entries remained, which were then aligned by MAFFT, revealing the conservation patterns of the superfamily (Fig. 4c, (31).
Cross-family sequence similarities. The sequence space walk with Orf8 revealed its closest homologs at the first steps and at step 3 connected them with Orf7a and its homologs (Fig. 4a), with minimum sequence identity dropping from 20 to 8% at convergence (Table 1). This outcome is also confirmed by the resulting alignments, with an average sequence identity dropping from 40 to 10%. The alignment depiction as glyphs for both the initial (Fig. 4c) and the final (Fig. 4d) alignments is also reflected in the matrix of pairwise sequence similarities for the former (see Fig. S1a) and the latter (see Fig. S1b), respectively; these cross-similarity patterns were generated by AlignmentViewer (40). (Note that in this study, for data annotation purposes only, SARS-CoV-1 and SARS-CoV-2 are marked as "SARS-1" and "SARS-2," respectively.)

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.