Conflicting and Ambiguous Names of Overlapping ORFs in SARS-CoV-2: A Homology-Based Resolution

At least six small alternate-frame open reading frames (ORFs) overlapping well-characterized SARS-CoV-2 genes have been hypothesized to encode accessory proteins. Researchers have used different names for the same ORF or the same name for different ORFs, resulting in erroneous homological and functional inferences. We propose standard names for these ORFs and their shorter isoforms, developed in consultation with the Coronaviridae Study Group of the ICTV. We recommend calling the 39 codon Spike-overlapping ORF ORF2b; the 41, 57, and 22 codon ORF3a-overlapping ORFs ORF3c, ORF3d, and ORF3b; the 33 codon ORF3d isoform ORF3d-2; and the 97 and 73 codon Nucleocapsid-overlapping ORFs ORF9b and ORF9c. Finally, we document conflicting usage of the name ORF3b in 32 studies, and consequent erroneous inferences, stressing the importance of reserving identical names for homologs. We recommend that authors referring to these ORFs provide lengths and coordinates to minimize ambiguity due to prior usage of alternative names.

causative agent of the 2002-2003 SARS outbreak. In addition to the ORFs common to all coronaviruses, these include, from 5′ to 3′, the accessory genes ORF3a, ORF6, ORF7a, ORF7b, and ORF8 (split into ORF8a and ORF8b in some SARS-CoV isolates) (Cui et al., 2019; Liu et al., 2014; F. Wu et al., 2020). Because of the unprecedented interest in SARS-CoV-2, its proteome has been extensively investigated by various experimental and computational techniques. One additional independent ORF, ORF10, and several additional ORFs overlapping S, ORF3a, and N in alternative positive-sense reading frames have been hypothesized to encode functional proteins. Some of these alternative-frame ORFs are unique to SARS-CoV-2, some are homologous in whole or in part to ones already described in SARS-CoV, and one was present in SARS-CoV but was not identified until the SARS-CoV-2 genome was investigated. Alternative-frame ORFs could be translated from the same subgenomic RNA as the main ORF via leaky scanning or internal ribosome entry, from a subgenomic RNA specific to the alternative-frame ORF, or via a translational frameshift (Di et al., 2017; Firth and Brierley, 2012; Irigoyen et al., 2016; D. Kim et al., 2020; Liu and Inglis, 1992; O'Connor and Brian, 2000; Thiel and Siddell, 1994). Due in part to the biological and evolutionary complexity of these ORFs and the incremental nature of scientific discovery, different research groups have assigned different names to the alternative-frame ORFs, complicating communication.
Here we propose a standard set of names for these overlapping SARS-CoV-2 ORFs for use by the scientific community, in order to facilitate communication and minimize confusion while the coding potential and biological function of these ORFs continue to be investigated.

SARS-CoV-2 overlapping ORFs and their ambiguous names
The term "open reading frame" or "ORF" has been used with slightly different meanings by different authors. Here we use the term to mean any contiguous stretch of RNA codons beginning with a start codon, ending with a stop codon, and with no intermediate in-frame stop codons. Given appropriate evidence, the 5′ end of the ORF might be moved to a site with a known stop codon readthrough or frameshift signal, as in the case of ORF1b, in order to accommodate the complexity of genome expression in viruses. (Note that, although we require an ORF to end with a stop codon, we do not include the stop codon when we report the lengths and coordinates of the ORF.) We do not require that an ORF exceed some minimum length or that undisputed evidence be available for its translation into a protein. In what follows, we will only be discussing ORFs with AUG start codons, but our definition would include ORFs with other start codons (typically near-cognate to AUG, such as CUG). By this definition, the conceptual translation of the nucleotide sequence using a codon table determines whether a genome region is an ORF, whereas experimental or computational evidence is needed to determine whether an ORF is indeed translated and encodes a functional protein during virus infection. This evidence may come from, but is not limited to, ribosome profiling, protein or peptide detection, and observation of evolutionary signals. Although a large number of ORFs satisfy our definition, we will only be discussing ORFs for which some evidence has suggested translation. Their consideration would benefit from an agreed nomenclature, even if for some of them this evidence may not pass the test of time.
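The definition above can be made concrete with a short script. The following is a minimal sketch, not a tool used in this study: the function name `find_orfs` and its parameters are illustrative assumptions. It scans the three forward reading frames, reports every start codon that is followed by an in-frame stop codon, and excludes the stop codon from the reported coordinates and lengths, as described above.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orfs(rna, start_codons=("AUG",)):
    """Return (start, end, n_codons) for every ORF on the forward strand.

    Coordinates are 1-based and inclusive, and exclude the stop codon.
    No minimum length is imposed, and every in-frame start codon yields
    its own entry, so shorter isoforms of a longer ORF are also listed.
    """
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    orfs = []
    for i in range(len(rna) - 2):
        if rna[i:i + 3] not in start_codons:
            continue
        # Walk codon by codon until the first in-frame stop codon.
        for j in range(i + 3, len(rna) - 2, 3):
            if rna[j:j + 3] in STOP_CODONS:
                orfs.append((i + 1, j, (j - i) // 3))
                break
    return orfs
```

Passing `start_codons=("AUG", "CUG")` would extend the scan to a near-cognate start codon. Whether any ORF found this way is translated remains a separate question that the scan cannot answer.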
At least six ORFs overlapping S, ORF3a, and N in alternative reading frames have been hypothesized to encode functional proteins. These ORFs are detailed in Figure 1 and Table 1, and issues relating to their naming are discussed in the following paragraphs.
codons in a substantial fraction of isolates. Because there is no SARS-CoV-2 ORF of comparable length in the region homologous to SARS-CoV ORF3b, Chan et al referred to this 57 codon ORF as ORF3b (the paper does not explicitly state the length or coordinates, and ORF3b is not included in the corresponding NCBI record, accession MN975262, but the ORF can be inferred from the amino acid sequence specified in their Fig. 4).
However, Konno et al used the name ORF3b to refer to the 22 codon ORF with coordinates 25814-25879 at the 5′ end of the region homologous to SARS-CoV ORF3b, which they reported to be an interferon antagonist when expressed from a plasmid (Yoriyuki Konno et al., 2020), a property that had previously been reported for the much longer SARS-CoV ORF3b (Kopecky-Bromberg et al., 2007). Adding to the potential for confusion, the non-homologous 57 codon ORF has also been reported to function as an interferon antagonist and referred to as ORF3b. The 57 codon ORF was predicted to be translated and functional based on a statistical test for unexpectedly long overlapping ORFs, a ribosome profiling analysis, and a dN/dS analysis comparing SARS-CoV-2 to pangolin-CoV GX/P5L, and named ORF3c in an early preprint by Nelson et al (C. W. Nelson et al., 2020), but its name was changed to ORF3d in the final published version (Chase W. to reflect the consensus reported here. It was also predicted to be a bona fide gene using an independent sequence composition analysis method, but left unnamed (Pavesi, 2020). Complicating matters further, a ribosome profiling study reported evidence for translation of a 33 codon isoform of the 57 codon ORF that starts at a downstream in-frame AUG (coordinates 25596-25694), calling it ORF3a.iORF2, but did not find evidence that the full 57 codon ORF is translated (Finkel et al., 2020). As this example illustrates, some proteins have more than one potential start site and it can be difficult to determine which are functionally important. Other studies that discuss the 22 or 57 codon ORFs are listed in Table 2 and Supplementary Table 1.
Computational (Michel et al., 2020; Pasquier and Robichon, 2020; Sadegh et al., 2020)
Review (Celik et al., 2020; Garofalo et al., 2020; Helmy et al., 2020; Taefehshokr et al., 2020; Yang et al., 2020; Yi et al., 2020; Yoshimoto, 2020; Zinzula, 2020)
Unclear Unclear Empirical (Lei et al., 2020; Nabeel-Shah et al., 2020; Yuen et al., 2020)
Computational (Sun, 2020)

A third ORF overlapping ORF3a, the 41 codon ORF with coordinates 25457-25579, was proposed to be translated based on synonymous constraint across 6 closely-related strains of the species and called ORF3h (Cagliani et al., 2020). This ORF was independently identified using ribosome profiling (Finkel et al., 2020), by synonymous constraint on a larger group of strains (Firth, 2020), and using evolutionary signatures of protein-coding regions (Jungreis et al., 2020), and referred to as ORF3a-iORF1, ORF3c, and ORF3c, respectively, with additional evidence of purifying selection provided by comparing different SARS-CoV-2 isolates (Chase W. . Adding to the confusion, this ORF has also been referred to as ORF3b (Pavesi, 2020).
Interestingly, the 57-codon and 41-codon ORFs have a 59-nucleotide overlap (counting the stop codon of the 41-codon ORF), so if both encode functional proteins then this region of ORF3a would be translated in the main reading frame and both alternative reading frames, three in total (see ORFs 3c and 3d in Figure 1B). Translation of three genes overlapping the same sites in different reading frames is rare but known to occur in at least one other virus, namely Env, Tat, and Rev in HIV (Bakouche et al., 2013).
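The overlap and the reading frames can be checked with simple coordinate arithmetic. The sketch below is illustrative only. The start of the 57 codon ORF is derived from the end coordinate it shares with its 33 codon isoform (25694). The ORF3a start coordinate (25393) is not given in the text and is an assumption taken from the annotation of the reference genome NC_045512.2.

```python
# 1-based inclusive coordinates on the reference genome, stop codons excluded.
ORF3C = (25457, 25579)                       # 41 codons, as stated in the text
ORF3D_END = 25694                            # end shared with the 33 codon isoform
ORF3D = (ORF3D_END - 57 * 3 + 1, ORF3D_END)  # derived start: 25524
ORF3A_START = 25393                          # assumption: NC_045512.2 annotation


def overlap_nt(a, b):
    """Number of nucleotides shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def frame_relative_to(start, main_start):
    """Reading-frame offset (0, 1 or 2) of an ORF relative to the main ORF."""
    return (start - main_start) % 3


# Extend the upstream ORF by its three-nucleotide stop codon.
ORF3C_WITH_STOP = (ORF3C[0], ORF3C[1] + 3)

print(overlap_nt(ORF3C, ORF3D))            # 56 (stop codon not counted)
print(overlap_nt(ORF3C_WITH_STOP, ORF3D))  # 59 (stop codon counted)
print(frame_relative_to(ORF3C[0], ORF3A_START),
      frame_relative_to(ORF3D[0], ORF3A_START))  # 1 2
```

Under these assumptions the two overlapping ORFs fall in offsets 1 and 2 relative to ORF3a (offset 0), so the shared region would be read in all three forward frames.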
Lastly, the 39 codon ORF with coordinates 21744-21860 overlapping the Spike protein was found to show evidence of translation in a ribosome profiling experiment (Finkel et al., 2020).
This ORF also has evidence of being under purifying selection between human hosts, using a πN/πS method specifically intended for overlapping genes (Chase W. .
Ambiguity in the usage of the name ORF3b has been particularly confusing. Two of the earliest papers about the SARS-CoV-2 genome use the term ORF3b to mean different genomic regions: Wu et al Fig. 1 and Supplementary F. Wu et al., 2020), which have different definitions of ORF3b. We were able to infer from the PCR primers in a supplemental table (Hachim et al., 2020) (Davidson et al., 2020) report that they "could not detect peptides from ORF9b as described by Bojkova et al." (Bojkova et al., 2020), but "peptides corresponding to the ORF9a protein were identified"; however, Davidson's ORF9a and Bojkova's ORF9b are different names for the same 97 amino acid protein, and Davidson's ORF9b is the 73 amino acid protein, for which Bojkova et al did not find peptides either, so it turns out that the two studies detected the same N-overlapping protein after all. Examples of confusion caused by inconsistent naming continue to accumulate rapidly.

Consensus nomenclature for SARS-CoV-2 overlapping ORFs and isoforms
After discussions with many of the authors who have published evidence related to overlapping ORFs in SARS-CoV-2, and in consultation with members of the Coronaviridae Study Group of the International Committee on Taxonomy of Viruses, we propose consensus nomenclature for the six aforementioned overlapping ORFs, and the shorter isoform of ORF3d, in Table 1 and Severe acute respiratory syndrome-related coronavirus. The importance of this rule can be appreciated by imagining a scenario in which researchers use the name "hemoglobin" to refer to hemoglobin (homologous) in one eukaryotic species, but to insulin (not homologous) in another; transfer of knowledge would be error-prone indeed. In practice, homology recognition is based on analysis of sequence and structure similarity, which is not always straightforward for short ORFs unless assisted by other considerations such as genome collinearity (synteny). However, we recommend that researchers naming ORFs make every possible effort to determine whether there is homology to ORFs in related strains or species, and avoid names that could lead to mistaken assumptions of homology or lack thereof. At least eight studies using the name ORF3b to refer to the 57-codon SARS-CoV-2 ORF overlapping ORF3a (following Chan et al) have mistakenly assumed homology and a possible functional relationship to SARS-CoV ORF3b due to this departure from the homology rule (Zhou et al., 2012).
Second, we maintain the convention of naming overlapping ORFs as ORF{Number1}{letter}, where "Number1" is the number of the main ORF (note that the numeric names of S and N are ORF2 and ORF9a, respectively (Inberg and Linial, 2004)) and "letter" is a lower case letter. We reserve "a" for the main ORF and use increasing letters in 5′ to 3′ order for overlapping ORFs, but with flexibility to accommodate the same-name-for-homologs rule or historical usage. In the case of ORFs overlapping ORF3a, "a" is taken by the main ORF and "b" by the SARS-CoV homolog. Thus, we have named the remaining two overlapping ORFs 3c and 3d in 5′ to 3′ order even though they occur 5′ of ORF3b. In the case of ORF9c, we choose to use "c" because it is 3′ of ORF9b, and because the homologous ORF in SARS-CoV has sometimes been called ORF9c (though it has also been called ORF14).
Finally, we extend our convention by naming smaller isoforms of overlapping ORFs using alternative start codons according to the template ORF{Number1}{letter}{-}{Number2}.
Specifically, we introduce the name ORF3d-2 for the 33 codon isoform of ORF3d. Whether one, the other, both, or neither of these two isoforms encode a functional protein has yet to be determined, so we have chosen to name the shorter isoform in case it is the only functional isoform, but use a name that relates it to ORF3d in case both are functional. There are several other sub-ORFs that have been proposed to be translated (Finkel et al., 2020) but we have not assigned names to them because, as far as we know, ORF3d-2 is the only one for which anyone has proposed that the shorter form is translated but the longer one is not. If, in the future, names are needed for other smaller isoforms of overlapping ORFs using alternative start codons, we suggest using this template, analogous to ORF3d-2.
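For authors who curate ORF names programmatically, the template ORF{Number1}{letter}{-}{Number2} can be checked with a regular expression. This is a minimal sketch. The pattern and the function name `parse_orf_name` are our own illustrative choices, and the pattern deliberately rejects names without a letter (such as ORF10) because the template applies to overlapping ORFs and their isoforms.

```python
import re

# ORF{Number1}{letter} with an optional {-}{Number2} isoform suffix.
ORF_NAME = re.compile(r"ORF(?P<number1>\d+)(?P<letter>[a-z])(?:-(?P<number2>\d+))?")


def parse_orf_name(name):
    """Split a name that follows the template into its parts, else return None."""
    m = ORF_NAME.fullmatch(name)
    if m is None:
        return None
    return {
        "main_orf": int(m["number1"]),   # number of the main ORF
        "letter": m["letter"],           # "a" is reserved for the main ORF
        "isoform": int(m["number2"]) if m["number2"] else None,
    }


print(parse_orf_name("ORF3d-2"))      # {'main_orf': 3, 'letter': 'd', 'isoform': 2}
print(parse_orf_name("ORF9c"))        # {'main_orf': 9, 'letter': 'c', 'isoform': None}
print(parse_orf_name("ORF3a.iORF2"))  # None: does not follow the template
```

A check of this kind verifies only the form of a name. It cannot establish homology, which the naming rules require to be assessed separately.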
While researchers have presented experimental or computational evidence of translation or function for each of the six overlapping ORFs discussed here, a final consensus in the community has not yet been achieved. We would like to emphasize that in choosing names we are not intending to imply anything about the strength of evidence for translation or function of any of these ORFs, or parts thereof. With the humbling recognition that our knowledge of coronavirus biology in general and the SARS-CoV-2 genome in particular is far from complete, we have tried to suggest naming rules with flexibility to handle future discoveries.
The ORF names we have recommended here have been endorsed by several members of the and Alexander E. Gorbalenya; by the other coauthors of this paper, who represent many of the groups that initially proposed or reported additional evidence for the protein-coding status of some of these ORFs; and by the virus curator of SwissProt/UniProt, Philippe Lemercier. We hope that future publications will adopt the recommended names, including the published versions of any current preprints that refer to these ORFs, in order to facilitate communication and minimize confusion in the future. We also recommend that authors referring to any of these ORFs explicitly provide the length or genome coordinates with respect to the reference SARS-CoV-2 genome Wuhan-Hu-1 (NCBI: NC_045512.2), and report the name used in any cited paper if it is different. These practices should help to resolve ambiguities caused by names that have already appeared in the literature.
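The recommendation to report lengths and coordinates can be illustrated as follows. This is a sketch under stated assumptions, not an official resource. The dictionary holds only the lengths given in this paper and the coordinates stated or directly derivable in the text. Coordinates for ORF9b and ORF9c are not stated in the text and are therefore left as `None`. The helper names `describe` and `name_for_length` are hypothetical.

```python
REFERENCE = "NC_045512.2"  # Wuhan-Hu-1 reference genome

# name: (codons, (start, end) or None); 1-based inclusive, stop codon excluded.
RECOMMENDED = {
    "ORF2b":   (39, (21744, 21860)),
    "ORF3c":   (41, (25457, 25579)),
    "ORF3d":   (57, (25524, 25694)),  # start derived from the end shared with ORF3d-2
    "ORF3d-2": (33, (25596, 25694)),
    "ORF3b":   (22, (25814, 25879)),
    "ORF9b":   (97, None),            # coordinates not stated in the text
    "ORF9c":   (73, None),            # coordinates not stated in the text
}

# Consistency check: each coordinate span must equal three times the codon count.
for _name, (_codons, _coords) in RECOMMENDED.items():
    if _coords is not None:
        assert _coords[1] - _coords[0] + 1 == 3 * _codons, _name


def describe(name):
    """Format an unambiguous mention of an ORF, with length and coordinates."""
    codons, coords = RECOMMENDED[name]
    if coords is None:
        return f"{name} ({codons} codons)"
    return f"{name} ({codons} codons, {coords[0]}-{coords[1]} in {REFERENCE})"


def name_for_length(codons):
    """Recommended name for an ORF identified only by its length, else None."""
    matches = [n for n, (c, _) in RECOMMENDED.items() if c == codons]
    return matches[0] if len(matches) == 1 else None


print(describe("ORF3d"))    # ORF3d (57 codons, 25524-25694 in NC_045512.2)
print(name_for_length(57))  # ORF3d: the 57 codon ORF that some studies called ORF3b
print(name_for_length(97))  # ORF9b: called ORF9a by Davidson et al
```

Because the seven lengths are all distinct, the length alone is enough to map an earlier, conflicting name onto the recommended one, which is why reporting it alongside the name helps.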

Supplementary Information