Conflicting and ambiguous names of overlapping ORFs in the SARS-CoV-2 genome: A homology-based resolution

At least six small alternative-frame open reading frames (ORFs) overlapping well-characterized SARS-CoV-2 genes have been hypothesized to encode accessory proteins. Researchers have used different names for the same ORF or the same name for different ORFs, resulting in erroneous homological and functional inferences. We propose standard names for these ORFs and their shorter isoforms, developed in consultation with the Coronaviridae Study Group of the International Committee on Taxonomy of Viruses. We recommend calling the 39 codon Spike-overlapping ORF ORF2b; the 41, 57, and 22 codon ORF3a-overlapping ORFs ORF3c, ORF3d, and ORF3b; the 33 codon ORF3d isoform ORF3d-2; and the 97 and 73 codon Nucleocapsid-overlapping ORFs ORF9b and ORF9c. Finally, we document conflicting usage of the name ORF3b in 32 studies, and consequent erroneous inferences, stressing the importance of reserving identical names for homologs. We recommend that authors referring to these ORFs provide lengths and coordinates to minimize ambiguity caused by prior usage of alternative names.


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the recently identified strain (F. Wu et al., 2020a;Zhou et al., 2020) of the species Severe acute respiratory syndrome-related coronavirus in the family Coronaviridae (subgenus Sarbecovirus, genus Betacoronavirus, subfamily Orthocoronavirinae) (Gorbalenya et al., 2020) that is the causative agent of coronavirus disease 2019 (COVID-19). Characterization of the SARS-CoV-2 proteome is vital for understanding its molecular biology and for development of countermeasures against the COVID-19 pandemic. Of particular interest are proteins that are unique to SARS-CoV-2, differ substantially from their SARS-CoV homologs, or have not been well characterized in other viruses of this species.
Coronaviruses have positive-sense single-stranded RNA genomes that encode proteins expressed from genomic and subgenomic RNAs using complex regulation at the transcriptional, translational, and posttranslational levels (Fung et al., 2016;Fung and Liu, 2018;Sola et al., 2015). Some of the protein-coding open reading frames (ORFs) are conserved across coronaviruses, with homologs in all strains, and were named according to a uniform coronavirus-wide nomenclature (de Groot et al., 2012). At the 5′ end are two large ORFs, ORF1a and ORF1b. ORF1a encodes polyprotein pp1a, and the combination of ORF1a and ORF1b encodes polyprotein pp1ab via a programmed frameshift. Polyproteins pp1a and pp1ab are proteolytically processed to yield 11 and 15 non-structural proteins ("nsp's"), respectively (16 unique, nsp1-nsp16). These include the 3C-like cysteine proteinase (nsp5), RNA-dependent RNA polymerase (nsp12), helicase (nsp13), and exonuclease (nsp14) (Snijder et al., 2003). The name ORF1ab is sometimes used to refer to the two ORFs combined via the frameshift. However, we refer to ORF1a and ORF1b as separate ORFs following common practice in the nidovirus field motivated by their large sizes and small overlap, despite the fact that ORF1b begins at a frameshift site rather than a start codon, unlike the other ORFs we discuss here. The other ORFs conserved across coronaviruses encode, from 5′ to 3′, S (Spike protein), E (Envelope), M (Membrane), and N (Nucleocapsid). Other "accessory" ORFs, located in the region downstream of ORF1b, may be species-specific or present only in some strains of a species.
SARS-CoV-2 has a full complement of ORFs previously identified in other viruses of the species Severe acute respiratory syndrome-related coronavirus, which includes the prototype SARS-CoV, the causative agent of the 2002-2003 SARS outbreak. In addition to the ORFs common to all coronaviruses these include, from 5′ to 3′, the accessory genes ORF3a, ORF6, ORF7a, ORF7b, and ORF8 (split into ORF8a and ORF8b in some SARS-CoV isolates) (Cui et al., 2019;Liu et al., 2014;Wu et al., 2020a). Because of the unprecedented interest in SARS-CoV-2, its proteome has been extensively investigated by various experimental and computational techniques. One additional independent ORF, ORF10, and several additional ORFs overlapping S, ORF3a, and N in alternative positive-sense reading frames have been hypothesized to encode functional proteins. Some of these alternative-frame ORFs are unique to SARS-CoV-2, some are completely or partially homologous to ones already described for SARS-CoV, and one is present in SARS-CoV but was not identified until the SARS-CoV-2 genome was investigated. Alternative-frame ORFs could be translated from the same subgenomic RNA as the main ORF via leaky scanning or internal ribosome entry, from a subgenomic RNA specific to the alternative-frame ORF, or via a translational frameshift (Di et al., 2017;Firth and Brierley, 2012;Irigoyen et al., 2016;Kim et al., 2020b;Liu and Inglis, 1992;O'Connor and Brian, 2000;Thiel and Siddell, 1994). Due in part to the biological and evolutionary complexity of these ORFs and the incremental nature of scientific discovery, different research groups have assigned different names to the alternative-frame ORFs, which has complicated clear scientific communication.
Here we propose a standard set of names for these overlapping SARS-CoV-2 ORFs for use by the scientific community in order to facilitate unambiguous communication and minimize confusion while the coding potential and biological function of these ORFs continue to be investigated.

SARS-CoV-2 overlapping ORFs and their ambiguous names
The term "open reading frame" or "ORF" has been used with slightly different meanings by different authors. Here we use the term to mean any contiguous stretch of RNA codons beginning with a start codon, ending with a stop codon, and with no intermediate in-frame stop codons. Given appropriate evidence, the 5′ end of the ORF might be moved to a site with a known stop codon readthrough or frameshift signal, as in the case of ORF1b, in order to accommodate the complexity of genome expression in viruses. (Note that, although we require an ORF to end with a stop codon, we do not include the stop codon when we report the lengths and coordinates of the ORF.) We do not require that an ORF exceeds some minimum length or that undisputed evidence is available for its translation into a protein. In what follows, we will only be discussing ORFs with AUG start codons, but our definition would include ORFs with other start codons (typically near-cognate to AUG, such as CUG). By this definition, the conceptual translation of the nucleotide sequence using a codon table determines whether a genome region is an ORF, whereas experimental or computational evidence is needed to determine if an ORF is indeed translated and encodes a functional protein during virus infection. This evidence may come from, but is not limited to, ribosome profiling, protein or peptide detection, and observation of evolutionary signals. Although a large number of ORFs satisfy our definition, we will only be discussing ORFs for which some evidence has suggested translation. Their consideration would benefit from having agreed nomenclature, even if for some of them this evidence may not pass the test of time.
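The definition above can be illustrated with a minimal Python sketch. It is not a tool used in this work; the function name `find_orfs` and the `Orf` record are hypothetical, and only the standard stop codons UAA, UAG, and UGA are assumed. It scans every start codon, extends in frame to the first stop codon, and reports 1-based coordinates and codon counts that exclude the stop codon, as stipulated above.

```python
from typing import Iterator, NamedTuple

STOP_CODONS = {"UAA", "UAG", "UGA"}  # standard genetic code


class Orf(NamedTuple):
    start: int   # 1-based coordinate of the first nucleotide of the start codon
    end: int     # 1-based coordinate of the last nucleotide before the stop codon
    codons: int  # codon count, stop codon excluded


def find_orfs(rna: str, start_codons=frozenset({"AUG"})) -> Iterator[Orf]:
    """Yield every stretch that begins with a start codon and ends at the
    first in-frame stop codon. Start codons with no downstream in-frame
    stop are skipped, because the definition requires a terminating stop."""
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    for i in range(len(rna) - 2):
        if rna[i:i + 3] not in start_codons:
            continue
        for j in range(i + 3, len(rna) - 2, 3):
            if rna[j:j + 3] in STOP_CODONS:
                # j is the 0-based index of the stop codon, which equals the
                # 1-based coordinate of the last coding nucleotide.
                yield Orf(start=i + 1, end=j, codons=(j - i) // 3)
                break


if __name__ == "__main__":
    for orf in find_orfs("AUGAAAAUGCCCUAA"):
        print(orf)
```

Because every start codon is scanned, a downstream in-frame AUG is reported as a separate, shorter ORF, which is how isoforms that differ only in their start site would appear. Passing `start_codons={"AUG", "CUG"}` would extend the scan to a near-cognate start codon.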
At least six ORFs overlapping S, ORF3a, and N in alternative reading frames have been hypothesized to encode functional proteins. These ORFs are detailed in Fig. 1 and Table 1, and issues relating to their naming are discussed in the following paragraphs.
UniProt (The UniProt Consortium, 2019) annotates two ORFs overlapping N in a different reading frame, namely a 97 codon ORF with coordinates 28284-28574, which they call ORF9b, and a 73 codon ORF with coordinates 28734-28952, which they call ORF14. (As a result of our recommendation, the 73 codon ORF is called ORF9c beginning with UniProt release 2021_01.) The name ORF14, which is out of sequence from the other SARS-CoV-2 ORF names, dates back to the 2003 paper that introduced the SARS-CoV genome (Marra et al., 2003), which numbered all ORFs sequentially, including overlapping ORFs. Later papers renumbered so that overlapping ORFs were distinguished using different letters following a shared number, but the name ORF14 continued to be used by some authors whereas others used the name ORF9c. Various authors have referred to the 97 and 73 codon SARS-CoV-2 ORFs overlapping N, respectively, as ORF9a and ORF9b (Cagliani et al., 2020;Davidson et al., 2020;Wu et al., 2020a), ORF9b and ORF9c (Gordon et al., 2020;Michel et al., 2020;Nelson et al., 2020b), or ORF13 and ORF14 (Lu et al., 2020), resulting in ambiguity about whether ORF9b refers to the 97 or 73 codon ORF.
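The ambiguity described above can be made concrete with a small Python sketch. The dictionary keys are informal labels chosen here for illustration, and the function `possible_meanings` is hypothetical; the name pairs themselves are those listed in the preceding paragraph.

```python
# Names used for the two N-overlapping ORFs under each convention.
# Each tuple is (name for the 97 codon ORF, name for the 73 codon ORF).
# The keys are informal labels for illustration only.
CONVENTIONS = {
    "Cagliani / Davidson / Wu 2020a": ("ORF9a", "ORF9b"),
    "Gordon / Michel / Nelson 2020b": ("ORF9b", "ORF9c"),
    "Lu 2020": ("ORF13", "ORF14"),
    "UniProt before release 2021_01": ("ORF9b", "ORF14"),
}

# Recommended names for the 97 and 73 codon ORFs, in the same order.
RECOMMENDED = ("ORF9b", "ORF9c")


def possible_meanings(name: str) -> set:
    """Return the recommended names that a bare name could refer to."""
    meanings = set()
    for names in CONVENTIONS.values():
        for used, recommended in zip(names, RECOMMENDED):
            if used == name:
                meanings.add(recommended)
    return meanings


if __name__ == "__main__":
    for name in ("ORF9a", "ORF9b", "ORF9c", "ORF14"):
        print(name, "->", sorted(possible_meanings(name)))
```

Under these conventions, a bare "ORF9b" resolves to two different ORFs, whereas "ORF9a", "ORF9c", "ORF13", and "ORF14" each resolve to one, which is why stating the length or coordinates alongside the name removes the ambiguity.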
Biological and evolutionary complexity have engendered even greater confusion about the names of ORFs overlapping ORF3a. SARS-CoV contains an alternative-frame 154 codon ORF, ORF3b, that partially overlaps both ORF3a and E (Chan et al., 2005), but the homologous 155 codon region in SARS-CoV-2 contains several in-frame stop codons. The longest alternative-frame ORF overlapping SARS-CoV-2 ORF3a is the 57 codon ORF with coordinates 25524-25694 that overlaps a 5′-proximal portion of ORF3a that has no homology to SARS-CoV ORF3b (Fig. 1), though this ORF is truncated to 13 codons in a substantial fraction of isolates (Lu et al., 2020). Because there is no SARS-CoV-2 ORF of comparable length in the region homologous to SARS-CoV ORF3b, Chan et al. (2020) referred to this 57 codon ORF as ORF3b (the paper does not explicitly state the length or coordinates, and ORF3b is not included in the corresponding NCBI record, accession MN975262, but the ORF can be inferred from the amino acid sequence specified in their Fig. 4). However, Konno et al. used the name ORF3b to refer to the 22 codon ORF with coordinates 25814-25879 at the 5′ end of the region homologous to SARS-CoV ORF3b, which they reported to be an interferon antagonist when expressed from a plasmid (Konno et al., 2020b), a property that had previously been reported for the much longer SARS-CoV ORF3b (Kopecky-Bromberg et al., 2007). Adding to the potential for confusion, the non-homologous 57 codon ORF overlapping ORF3a has also been reported to function as an interferon antagonist in a paper that also referred to it as ORF3b (Lu et al., 2020).
The 57 codon ORF was predicted to be translated and functional based on a statistical test (Schlub et al., 2018) for unexpectedly long overlapping ORFs, a ribosome profiling analysis, and a dN/dS analysis comparing SARS-CoV-2 to pangolin-CoV GX/P5L, and was named ORF3c in an early preprint (Nelson et al., 2020a), but its name was changed to ORF3d in the final published version (Nelson et al., 2020b) to reflect the consensus reported here. It was also predicted to be a bona fide gene using an independent sequence composition analysis method, but left unnamed (Pavesi, 2020). Complicating matters further, a ribosome profiling study reported evidence for translation of a 33 codon isoform of the 57 codon ORF that starts at a downstream in-frame AUG (coordinates 25596-25694), calling it ORF3a.iORF2, but did not obtain evidence that the full 57 codon isoform is translated (Finkel et al., 2020). As this example illustrates, some proteins have more than one potential start site and it can be difficult to determine which are functionally important. Other studies that discuss the 22 or 57 codon ORFs are listed in Table 2 and Supplementary Table 1. A third ORF overlapping ORF3a, the 41 codon ORF with coordinates 25457-25579, was proposed to be translated based on synonymous constraint across 6 closely-related strains of the species and called ORF3h (Cagliani et al., 2020). This ORF was independently identified using ribosome profiling (Finkel et al., 2020), by synonymous constraint in a larger group of strains (Firth, 2020), and using evolutionary signatures of protein-coding regions (Jungreis et al., 2021), and referred to as ORF3a.iORF1, ORF3c, and ORF3c in these three respective studies, with additional evidence of purifying selection reported by comparing different SARS-CoV-2 isolates (Nelson et al., 2020b). Adding to the confusion, this ORF has also been referred to as 3b protein (Pavesi, 2020).
Interestingly, the 41 codon and 57 codon ORFs have a 59-nucleotide overlap (including the stop codon of the former), so if both encode functional proteins then this region of ORF3a would be translated in the main reading frame and both alternative reading frames, three frames in total (ORFs 3a, 3c and 3d, Fig. 1B). Translation of three genes overlapping the same sites in different reading frames is rare but known to occur in at least one other virus, namely Env, Tat, and Rev in HIV-2 (Bakouche et al., 2013).
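The 59-nucleotide overlap and the three reading frames can be checked with simple coordinate arithmetic, sketched below in Python. The coordinates of the 41 and 57 codon ORFs are those given in the text. The start coordinate of ORF3a (25393) is not stated in the text and is assumed here from the NC_045512.2 annotation. The function names are hypothetical.

```python
# Assumption: ORF3a start coordinate in NC_045512.2 (not stated in the text).
ORF3A_START = 25393

# (first, last) coordinates from the text; stop codons are excluded.
ORF3C = (25457, 25579)  # 41 codon ORF
ORF3D = (25524, 25694)  # 57 codon ORF


def frame(start: int, main_start: int) -> int:
    """Shift of an ORF's codons in the 3' direction relative to the main ORF
    (0, 1, or 2 nucleotides)."""
    return (start - main_start) % 3


def overlap_nt(first, second, include_stop_of_first=True) -> int:
    """Number of nucleotides shared by two coordinate ranges, optionally
    extending the first range by its 3-nucleotide stop codon."""
    first_end = first[1] + (3 if include_stop_of_first else 0)
    shared = min(first_end, second[1]) - max(first[0], second[0]) + 1
    return max(0, shared)


if __name__ == "__main__":
    print("frame of 41 codon ORF:", frame(ORF3C[0], ORF3A_START))
    print("frame of 57 codon ORF:", frame(ORF3D[0], ORF3A_START))
    print("overlap incl. stop codon:", overlap_nt(ORF3C, ORF3D))
```

Under the stated assumption, the 41 and 57 codon ORFs fall in frames +1 and +2 relative to ORF3a, and their overlap is 59 nucleotides when the stop codon of the 41 codon ORF is included (56 without it).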
Lastly, the 39 codon ORF with coordinates 21744-21860 overlapping the Spike protein was found to show evidence of translation in a ribosome profiling experiment (Finkel et al., 2020). The sequence of this ORF displays evidence of purifying selection between human hosts, using a πN/πS method intended for use with overlapping genes (Nelson et al., 2020b).
Ambiguity in the usage of the name ORF3b has been particularly confusing. Two of the earliest papers about the SARS-CoV-2 genome used the term ORF3b to mean different genomic regions: Wu et al. show ORF3b as the region homologous to SARS-CoV ORF3b in their Fig. 1 [...]

Fig. 1. [...] (22 codons). The region homologous to SARS-CoV ORF3b, which overlaps the 3′ half of ORF3a and the 5′ end of the envelope protein ORF (E), is also shown (light blue background). Note that ORFs 3a, 3c, and 3d are in different reading frames (+0, +1, and +2, respectively), so the 59 nucleotide region common to all three could be a rare example of RNA translated in three different reading frames. C. Nucleocapsid ORF (N) containing ORFs 9b (97 codons) and 9c (73 codons).

Table 1
Recommended standard names. Recommended standard names for each of six ORFs overlapping S, ORF3a, or N, in 5′-3′ order, and the shorter isoform of ORF3d, with number of codons, coordinates, and a list of other names that have been used in previous publications or preprints. Codon counts and coordinates do not include the stop codon. Coordinates are with respect to the Wuhan-Hu-1 reference genome (NCBI: NC_045512.2). Frame +1 or +2 indicates that codons are shifted one or two nucleotides, respectively, in the 3′ direction from codons in the main (larger) ORF, which occupies frame +0.

[Table 1 body not recovered. Surviving column headers: Recommended ORF name; Length (codons).]

[...] (Hachim et al., 2020) of the 57 amino acid protein as evidence of expression (Konno et al., 2020a), though this was corrected in the published version (Konno et al., 2020b). On the other hand, a report about interferon evasion (Xia et al., 2020) refers to one of the investigated ORFs as ORF3b, without providing coordinates, but we were able to infer from the PCR primers in their Supplementary Table S1 that the amplified region was the 155 codon SARS-CoV-2 region homologous to SARS-CoV ORF3b, so presumably in this case it was the 22 codon ORF that was expressed, since that is at the 5′ end of this region. Furthermore, at least five review papers discuss the 57 codon and 22 codon ORFs overlapping ORF3a as if they were the same entity (Supplementary Table 1). Particularly confusing is one review that mentions ORF3b without providing coordinates and depicts it graphically as the ORF at the 3′ end of the SARS-CoV-2 region homologous to SARS-CoV ORF3b (overlapping the 3′ end of ORF3a and the 5′ end of E), but we can infer from the cited reference that they are referring to the 22 codon ORF at the 5′ end of the homologous region (Sa Ribero et al., 2020). We know of at least one instance of similar confusion for ORF9b: Davidson et al. (2020) report that they "could not detect peptides from ORF9b as described by Bojkova et al., 2020", but "peptides corresponding to the ORF9a protein were identified"; however, Davidson's ORF9a and Bojkova's ORF9b are different names for the same 97 amino acid protein, and Davidson's ORF9b is the 73 amino acid protein, for which Bojkova et al. also did not find peptides (Bojkova et al., 2020), so the two studies detected the same N-overlapping protein after all. Examples of confusion caused by inconsistent naming continue to accumulate rapidly.

Consensus nomenclature for SARS-CoV-2 overlapping ORFs and isoforms
After discussions with many of the researchers that have published evidence related to overlapping ORFs in SARS-CoV-2, and in consultation with members of the Coronaviridae Study Group of the International Committee on Taxonomy of Viruses, we propose consensus nomenclature for the six aforementioned overlapping ORFs, and the shorter isoform of ORF3d (Table 1, Fig. 1).
Our naming decisions were motivated by the following considerations. First, we strongly recommend using the same name as the SARS-CoV homolog where one exists. This rule is in agreement with the prevailing practice for ORF and protein naming by the Coronaviridae Study Group. This facilitates the transfer of knowledge in the coronavirus field and cross-communication between research on SARS-CoV-2, SARS-CoV, and other viruses of the species Severe acute respiratory syndrome-related coronavirus. The importance of this rule can be appreciated by imagining a scenario in which researchers use the name "hemoglobin" to refer to hemoglobin (homologous) in one eukaryotic species, but to insulin (not homologous) in another; accurate transfer of knowledge would be error-prone indeed. Practically, homology recognition is based on analysis of sequence and structure similarity, which is not always straightforward for small ORFs unless assisted by other considerations like genome collinearity (synteny). However, we recommend that researchers naming ORFs make every possible effort to determine whether there is homology to ORFs in related strains or species, and avoid names that could lead to mistaken assumptions of homology or lack thereof. At least eight studies using the name ORF3b to refer to the 57 codon SARS-CoV-2 ORF overlapping ORF3a (following Chan et al. (2020)) have mistakenly assumed homology and a possible functional relationship to SARS-CoV ORF3b due to this departure from the homology rule (Supplementary Table 1). The application of the same-name-for-homologs rule to ORF9b is unambiguous, because SARS-CoV and SARS-CoV-2 encode full-length homologs. On the other hand, there are several small ORFs within the region of the SARS-CoV-2 genome homologous to SARS-CoV ORF3b, but to our knowledge only the ORF at the 5′ end of this region, beginning at the AUG codon homologous to the start codon of SARS-CoV ORF3b, has been proposed to be protein-coding (Konno et al., 2020b).
Thus, we assign the name ORF3b to the 22 codon ORF, in line with prior studies of ORFs of various lengths in bat viruses of this species homologous to the 5′ end of SARS-CoV ORF3b (Zhou et al., 2012).
Table 2
22 codon ORF (ORF3b)
Genome report (Wu et al., 2020a)
Empirical (Konno et al., 2020b;Lokugamage et al., 2020;Xia et al., 2020;Zhang et al., 2020)
Review (Sa Ribero et al., 2020)
57 codon ORF (ORF3d)
Genome report (Chan et al., 2020)
Empirical (Banerjee et al., 2020;Gordon et al., 2020;Hachim et al., 2020;Hayn et al., 2020;Lam et al., 2020;Laurent et al., 2020;Samavarchi-Tehrani et al., 2020;St-Germain et al., 2020)
Laboratory resource, sequence clone collection (Kim et al., 2020a)
Computational (Michel et al., 2020;Pasquier and Robichon, 2020;Sadegh et al., 2020)
Review (Celik et al., 2020;Garofalo et al., 2020;Helmy et al., 2020;Taefehshokr et al., 2020;Wu et al., 2020b;Yang et al., 2020;Yi et al., 2020;Yoshimoto, 2020;Zinzula, 2020)
Unclear (Unclear)
Empirical (Lei et al., 2020;Nabeel-Shah et al., 2020;Yuen et al., 2020)
Computational (Sun, 2020)

Second, we maintain the convention of naming overlapping ORFs as ORF{Number1}{letter}, where "Number1" is the number of the main ORF (note that the numeric names of S and N are ORF2 and ORF9a, respectively (Inberg and Linial, 2004)) and "letter" is a lower case letter. We reserve "a" for the main ORF and default to sequential (alphabetical) letters to name additional overlapping ORFs in 5′-3′ order, but retain the flexibility to accommodate the same-name-for-homologs rule or historical usage. In the case of ORFs overlapping ORF3a, "a" is taken by the main ORF and "b" by the SARS-CoV homolog. Thus, we have named the remaining two overlapping ORFs 3c and 3d in 5′-3′ order even though they occur 5′ of ORF3b. In the case of ORF9c, we choose to use "c" because it is 3′ of ORF9b, and because the homologous ORF in SARS-CoV has sometimes been called ORF9c (though it has also been called ORF14). Finally, we extend our convention by naming smaller isoforms of overlapping ORFs using alternative start codons according to the template ORF{Number1}{letter}{-}{Number2}. Specifically, we introduce the name ORF3d-2 for the 33 codon isoform of ORF3d.
Whether either, both, or neither of these two isoforms encode a functional protein has yet to be determined, so we have chosen to name the shorter isoform in case it is the only functional isoform, but use a name that relates it to ORF3d in case both are functional. There are several other sub-ORFs that have been proposed to be translated (Finkel et al., 2020) but we have not assigned names to them because, as far as we know, ORF3d-2 is the only one for which anyone has proposed that the shorter form is translated but the longer one is not. If, in the future, names are needed for other smaller isoforms of overlapping ORFs using alternative start codons, we suggest using a naming strategy that is analogous to ORF3d-2.
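The two name templates, ORF{Number1}{letter} and ORF{Number1}{letter}{-}{Number2}, lend themselves to mechanical checking. The following Python sketch shows one possible way to do so; the regular expression, the `OrfName` record, and the function `parse_orf_name` are illustrative assumptions rather than part of the proposed nomenclature.

```python
import re
from typing import NamedTuple, Optional

# ORF{Number1}{letter} with an optional {-}{Number2} isoform suffix.
NAME_RE = re.compile(
    r"ORF(?P<number>[1-9]\d*)(?P<letter>[a-z])(?:-(?P<isoform>[1-9]\d*))?"
)


class OrfName(NamedTuple):
    main_orf: int           # Number1: number of the main ORF
    letter: str             # lower case letter; "a" is reserved for the main ORF
    isoform: Optional[int]  # Number2, or None for the full-length ORF


def parse_orf_name(name: str) -> Optional[OrfName]:
    """Parse a name that follows the templates; return None otherwise."""
    match = NAME_RE.fullmatch(name)
    if match is None:
        return None
    isoform = match.group("isoform")
    return OrfName(
        main_orf=int(match.group("number")),
        letter=match.group("letter"),
        isoform=int(isoform) if isoform else None,
    )


if __name__ == "__main__":
    for name in ("ORF3d", "ORF3d-2", "ORF9c", "ORF14", "ORF3a.iORF2"):
        print(name, "->", parse_orf_name(name))
```

Names such as ORF14 or ORF3a.iORF2 do not fit either template and are returned as `None`, which makes older names easy to flag in a list of annotations.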
While researchers have presented experimental or computational evidence of translation or function for each of the six overlapping ORFs discussed here, a final consensus in the community has not yet been achieved. We would like to emphasize that in choosing names we are not intending to imply anything about the strength of evidence for translation or function of any of these ORFs, or parts thereof. With the humbling recognition that our knowledge of coronavirus biology in general and the SARS-CoV-2 genome in particular is far from complete, we have tried to suggest naming rules with sufficient flexibility to handle future discoveries.

Conclusions
We have proposed standard names for six SARS-CoV-2 ORFs and one shorter isoform that have been hypothesized to encode accessory proteins. The ORF names we have recommended here have been endorsed by several members of the Coronaviridae Study Group of the International Committee on Taxonomy of Viruses, namely Stanley Perlman, Bart L Haagmans, and Benjamin W Neuman, and two coauthors John Ziebuhr and Alexander E. Gorbalenya; by the other coauthors of this paper, who represent many of the groups that initially proposed or reported additional evidence for the protein-coding status of some of these ORFs; and by the virus curator of SwissProt/UniProt, Philippe Lemercier. We hope that future publications will adopt the recommended names, including the published versions of any current preprints that refer to these ORFs, in order to facilitate unambiguous communication and minimize confusion. We also recommend that authors referring to any of these ORFs explicitly provide the length or genome coordinates with respect to the reference SARS-CoV-2 genome Wuhan-Hu-1 (NCBI: NC_045512.2), and report the name used in any cited paper if it is different. These practices should help to resolve ambiguities caused by names that have already appeared in the literature.
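As an illustration of the recommendation to report lengths and coordinates, the Python sketch below collects the recommended names with the codon counts and coordinates quoted in this paper, checks that each codon count agrees with its coordinates (stop codon excluded), and formats an unambiguous reference. The function names and the output format are hypothetical.

```python
# (recommended name, codons, first coordinate, last coordinate), as quoted in
# the text. Coordinates are 1-based, inclusive, relative to NC_045512.2, and
# exclude the stop codon.
ORFS = [
    ("ORF2b", 39, 21744, 21860),
    ("ORF3c", 41, 25457, 25579),
    ("ORF3d", 57, 25524, 25694),
    ("ORF3d-2", 33, 25596, 25694),
    ("ORF3b", 22, 25814, 25879),
    ("ORF9b", 97, 28284, 28574),
    ("ORF9c", 73, 28734, 28952),
]


def codons_from_coordinates(start: int, end: int) -> int:
    """Codon count implied by inclusive coordinates that exclude the stop codon."""
    length = end - start + 1
    if length <= 0 or length % 3 != 0:
        raise ValueError(f"{start}-{end} is not a whole number of codons")
    return length // 3


def describe(name: str, codons: int, start: int, end: int) -> str:
    """Format a reference that states name, length, and coordinates together."""
    return f"{name} ({codons} codons, {start}-{end}, NC_045512.2)"


if __name__ == "__main__":
    for name, codons, start, end in ORFS:
        assert codons_from_coordinates(start, end) == codons
        print(describe(name, codons, start, end))
```

A check of this kind catches a common source of mismatch, namely coordinates that include the stop codon while the stated length does not.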