A previously uncharacterized gene in SARS-CoV-2 and the origins of the COVID-19 pandemic

1Biodiversity Research Center, Academia Sinica, Taipei, Taiwan 2Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA 3Chair for Microbial Ecology, Technical University of Munich, Freising, Germany 4Department of Pathobiological Sciences, University of Wisconsin-Madison, Madison, WI, USA 5Global Health Institute, University of Wisconsin-Madison, Madison, WI, USA 6Bavarian Center for Biomolecular Mass Spectrometry (BayBioMS), Technical University of Munich, Freising, Germany 7Department of Epidemiology and Biostatistics, School of Public Health, SUNY Downstate Health Sciences University, Brooklyn, NY, USA 8Institute for Genomic Health, SUNY Downstate Health Sciences University, Brooklyn, NY, USA 9Departments of Integrative Biology and Statistics, University of California, Berkeley, CA, USA

The COVID-19 pandemic raises urgent questions about the properties that allow animal 2 viruses to cross species boundaries and spread within humans. Addressing these questions 3 requires an accurate and comprehensive understanding of viral genomes. One frequently 4 overlooked source of novelty is the evolution of new overlapping genes (OLGs), wherein a 5 single stretch of nucleotides encodes two distinct proteins in different reading frames. Such 6 "genes within genes" improve genomic information compression and may offer a major source 7 of genetic novelty via overprinting (Keese and Gibbs 1992), particularly as frameshifted 8 sequences preserve certain physicochemical properties of proteins (Bartonek et al. 2020). 9 However, OLGs also entail the cost that a single mutation may alter two proteins, constraining 10 evolution and complicating sequence analyses. Moreover, genome annotation methods 11 typically miss OLGs, favoring one open reading frame per genomic region (Warren et al. 12 2010). 13 In SARS-related betacoronaviruses (subgenus Sarbecovirus; sarbecoviruses), OLGs 14 are known but remain inconsistently reported. Novel overlapping gene candidates 26 To identify OLGs within the SARS-CoV-2 genome, we first generated a complete list of 27 candidate ORFs in the Wuhan-Hu-1 reference genome (NCBI: NC_045512.2). Specifically, 28 we used the Schlub et al. codon permutation method (Schlub et al. 2018) to detect 1 unexpectedly long ORFs while controlling for codon usage. One unannotated gene candidate, 2 here named ORF3c, scored highly (P=0.0104), exceeding the significance of two known OLGs 3 annotated in Uniprot (ORF9b and ORF9c [ORF14] within N; 4 https://viralzone.expasy.org/8996) (Figure 1; Supplementary Table 1). 5 ORF3c comprises 58 codons (including STOP) near the beginning of ORF3a (Table  6 1; Figure 2), making it longer than the known genes ORF7b (44 codons) and ORF10 (39 7 codons) (Supplementary Tables 2 and 3). ORF3c was discovered independently by Chan et. 8 al (2020) as 'ORF3b' and Pavesi (2020) as, simply, 'hypothetical protein'. Due to its naming 9 ambiguity and location within ORF3a, ORF3c has subsequently been conflated with ORF3b 10 in multiple studies (Fung et Table 4). It is also distinct from other OLGs 17 hypothesized within ORF3a (Table 1), and an independent sequence composition analysis 18 predicts ORF3c over the alternative candidate ORF3a*/ORF3h (Pavesi 2020). Thus, ORF3c 19 putatively encodes a novel protein not present in other sarbecoviruses, and the absence of 20 full-length ORF3b in SARS-CoV-2 distinguishes it from SARS-CoV-1 ( Figure 1). In contrast, 21 ORF3b plays a central role in SARS-CoV-1 immune interactions and its absence or truncation 22 in SARS-CoV-2 may be immunologically important (Konno et al. Yuen et al. 2020). 23 24 ORF3c molecular biology and expression 25 To assess expression of ORF3c, we re-analyzed the ribosome profiling (Ribo-seq) data of 26 Finkel et al. (2020), focusing on four high-coverage samples with ribosomes stalled by 27 28 1 genes are not precisely aligned. Note that ORF8 is not novel in SARS-CoV-2 as has been claimed 7 , and ORF9b and 9c are found throughout sarbecoviruses, though rarely annotated 8 in genomes at NCBI. ORF3b is full-length in only 3 sequences (SARS-CoV TW11, SARS-CoV Tor2, 9 and bat-CoV Rs7327), while the remainder fall into two distinct classes having an early or late premature 10 STOP codon (Supplementary Table 12   To further investigate the immunological properties of ORF3c, we predicted linear T-4 cell epitope candidates for each 9-mer of the SARS-CoV-2 proteome using NetMHCPan (Jurtz 5 et al. 2017) to estimate MHC class I binding affinity for representative HLA alleles (Sidney et 6 al. 2008). The lowest predicted epitope density occurs in ORF3c, the only gene significantly 7 depleted compared to both short unannotated ORFs (P=0.019; two-sided percentile) and 8 randomized peptides (P=0.044; permutation tests), followed by ORF8 and N ( Figure 4       Between-species divergence 17 To examine natural selection on ORF3c, we measured diversity at three evolutionary levels: 18 between-species (Sarbecovirus), between-host (human SARS-CoV-2), and within-host 19 (human SARS-CoV-2). At each level, we inferred selection by estimating mean pairwise 20 nonsynonymous (amino acid changing) and synonymous (not amino acid changing) 21 nucleotide divergence (d; between sarbecoviruses) or diversity (π; within SARS-CoV-2) 1 among all sequenced genomes at each level. Importantly, we combined standard (non-OLG) 2 Nucleotide differences were analyzed at three levels: between-species divergence (d), between-host 3 diversity (π; consensus-level), and within-host diversity (π; deep sequencing). Each gene/level is     Table 1 footnotes), showing evidence for purifying selection specific to the ORF3c region and the 7 comparison with pangolin-CoV GX/P5L. On the right-hand side, this analysis is repeated for the same-8 strand reading frame encoding ORF3b (ss13; see Table 1 footnotes), this time with respect to SARS-9 CoV-1, where ORF3b is functional. Here it is seen that there is constraint in this frame across much of 10 the gene and across sarbecoviruses.

12 13
Between-host evolution and pandemic spread 14 We obtained n=3,978 human SARS-CoV-2 consensus sequences from GISAID, limiting to 15 whole-genome high-coverage sequences lacking indels in coding regions (accessed April 10, 16 2020; Supplementary Table 17; GISAID Acknowledgments Table,  With respect to ORF3a, ORF3c-LOF (G25563U) has been identified as a strong 4 candidate for positive selection for its effect as ORF3a-Gln57His (Kosakovsky-Pond 2020). 5 However, temporal allele frequency trajectories (Supplementary Figure 4D) and similar signals 6 from phylogenetic branch tests are susceptible to ascertainment bias (e.g., preferential 7 sequencing of imported infections and uneven geographic sampling) and stochastic error 8 (e.g., small sample sizes). Thus, we performed an independent assessment to partially 9 account for these confounding factors. We first constructed the mutational path leading from 10 the SARS-CoV-2 haplotype collected in December 2019 to the haplotype carrying ORF3c-11 LOF (G25563U). This path involves five mutations (C241U, C3037U, C14408U, A23403G, 12 G25563U), constituting five observed haplotypes (EP-3 → EP-2 → EP → EP+1 → 13 EP+1+LOF, shown in Table 2). Here, EP is suggested to have driven the European Pandemic 14 (detected in German patient #4, footnote 3 of Table 2 corresponding to countries with or without early (January) samples ("early founder" and "late 26 founder", respectively) ( Figure 8). In the early founder group, EP-3 is the first haplotype 27 28 Table 2. The mutational path to European pandemic founder haplotypes a 1  shows the combined frequencies of EP+1 and EP+1+LOF (yellow and blue, respectively). Note that 12 EP-1 is never observed in our dataset.

14
detected in all countries but Germany, consistent with most early COVID-19 cases being 1 related to travel from Wuhan. As this implies that genotypes EP-3 and EP had longer to spread 2 in the early founder group, it is surprising that their spread is dwarfed by the increase of EP+1 3 and EP+1+LOF starting in late February. This turnover is most obvious in the late founder 4 group, where multiple haplotypes are detected in a narrow time window, and the number of 5 cumulative samples is always dominated by EP+1 and EP+1+LOF. Thus, the quick spread of 6 ORF3c-LOF seems to be caused by its linkage with another driver, either C14408U (+1  Table 13). We also examined 6 high-depth samples of pangolin-CoVs from Guangxi, but no 20 conclusions could be drawn (e.g., due to low quality; Methods; Supplementary infection rate x is small), derived mutations occurring in more than one sample must be 24 identical by state but not descent (i.e., recurrent). Limiting to 220 samples where the major 25 allele was also ancestral (Wuhan-Hu-1 genotype) at site 25563, ORF3c-LOF is observed as 26 a minor allele in two samples (SRR11410536 and SRR11479046 at frequencies of 1.9% and 27 4.6%, respectively). This proportion of samples with a recurrent mutation (2 of 220) is high but 28 not unusual, as 1.76% of possible genomic changes have an equal or higher proportion of 1 recurrence. Additionally, no mutations in ORF3c recur in >2.5% of samples ( Figure 9). 2 However, a small number of genomic loci exhibit high rates of recurrent mutation, with five 3 mutations observed in >10% of samples (Methods; Figure 9). Surprisingly, another STOP 4 mutation (A404U; nsp1-L47*) is observed in the majority of samples (Figure 9  Our analyses provide strong evidence that SARS-CoV-2 contains a third overlapping gene 2 (OLG), ORF3c, that has not been consistently identified or fully analyzed before this study. 3 The annotation of a newly emerged virus is difficult, and OLGs tend to be less carefully 4 documented than non-OLGs, for example, ORF9b and ORF9c are still not annotated in the 5 most used reference genome, Wuhan-Hu-1 (NCBI: NC_045512.2). This difficulty is 6 exacerbated for SARS-CoV-2 by the highly dynamic process of frequent gains and losses of 7 accessory genes across the Sarbecovirus subgenus. Therefore, de novo and homology-8 based annotation are both essential, followed by careful expression analyses using multi-omic 9 data and evolutionary analyses within and between species. In particular, we emphasize the 10 importance of using whole-gene or genome alignments when inferring homology for both 11 OLGs and non-OLGs, taking into account genome positions and all reading frames. 12 Unfortunately, in the case of SARS-CoV-2, the lack of such inspections has led to mis-13 annotation and a domino effect. For example, homology between ORF3b (Sarbecovirus) and 14 ORF3c (SARS-CoV-2) has been implied, leading to unwarranted inferences of shared 15 functionality and subsequent claims of homology between ORF3b and other putative OLGs 16 within ORF3a of SARS-CoV-2, e.g., ORF3h/3a* (Table 1). Given the speed of growth of 17 SARS-CoV-2 literature, it is likely this mistake will be further propagated. We therefore provide 18 a detailed annotation of Wuhan-Hu-1 protein-coding genes and codons in Supplementary 19 Tables 2 and 3, respectively, as a resource for future studies. 20 Our study highlights the highly dynamic process of frequent gains and losses of 21 accessory genes across the Sarbecovirus subgenus, with the greatest functional constraint 22 observed for the most highly expressed genes (Supplementary Figure 2). Indeed, while many 23 or all accessory genes may be dispensable for viruses in cell culture, they often play an Although mutational bias marginally favors the recurrence of ORF3c-LOF (within-host 11 analysis), the quick expansion of this mutation and its haplotype during this pandemic is 12 puzzelsome (between-host analysis). One potential explanation for the slower spread of EP-13 3 is a sampling policy bias in case isolation; in most countries, testing and quarantine 14 enforcement were preferentially applied to travellers who recently visited Wuhan, which may 15 have led to selective detection, isolation, quarantining, and tracing of EP-3 and EP 16 haplotypes. Because mutations occurring within Europe (e.g., candidates: C14408U, 17 G25563U) are unlikely from intercontinental travelers, one might expect them to contribute 18 more to community-acquired infections, particularly as testing biases might have provided an 19 opportunity for them to spread in Europe. However, the EP+1 and EP+1+LOF haplotypes also 20 grow faster in the late founder group, where it is unclear which haplotype was more travel 21 related. Because G25563U simultaneously creates a nonsynonymous change (ORF3a-Q57H) 22 and a loss of function (ORF3c-LOF), it could influence the spread of SARS-CoV-2 through 23 selection on either change. Despite that, the quick spread of G25563U seems to be caused 24 by its early occurrence in linkage with the +1 variant (C14408U causes RdRp-P323L), 25 suggesting that this mutation could be the real driver and the increase in ORF3c-LOF is due 26 to hitchhiking. The spread of EP+1 and EP+1+LOF but not EP is unexpected, as EP was the 27 earliest haplotype and carried Spike-D614G (A23403G), a variant with predicted functional 28 relevance (Bhattacharyya et al. 2020). Thus one may speculate that the +1 mutation 1 (C14408U) acts synergistically with D614G or other mutations (5'UTR-C241U, nsp3-C3037U) 2 unique to the EP background, causing differences among the haplotypes in infection rate, 3 disease rate, hospitalization rate, latent period, transmission rate, or other symptoms. The 4 faster spread of EP+1+LOF than EP+1 in the early founder countries but not late founder 5 countries (p=0.0312) also requires explanation. These observations highlight the necessity of 6 empirically evaluating the effects of 3c-LOF (G25563U), 3a-Q57H (G25563U), RdRp-P323L 7 (C14408U), and their interactions with Spike-D614G (A23403G). Lastly, because only five 8 major polymorphisms are considered in this analysis (Supplementary Table 18), it is possible 9 that the spread ORF3c-LOF or other haplotypes is further assisted or hindered by subsequent 10 mutations. 11 Our study has several limitations. Short peptides can be difficult to detect using mass 12 spectrometry, and the second half of 3c does not contain any potential targets. Thus, we were 13 unable to discriminate 3c, 9c, or 10 from noise even in two high-quality datasets, a limitation 14 likely to be true of any proteomics dataset for SARS-CoV-2. With respect to between-host 15 diversity, we focused on relatively abundant consensus-level sequence data; however, this 16 approach can miss important variation (Holmes 2009), stressing the importance of deeply 17 sequenced within-host samples, sequenced with technology appropriate for calling within-host 18 variants. As we use Wuhan-Hu-1 for read mapping and remove duplicate reads, reference 19 bias could potentially affect our within-host results (Degner et al. 2009). We detected natural 20 selection using counting methods that examine all pairwise comparisons between or within 21 specific groups of sequences, which may have less power than methods that trace changes 22 over a phylogeny. However, this approach is robust to errors in phylogenetic and ancestral 23 sequence reconstruction, and to artifacts due to linkage or recombination (Hughes et al. 2006;24 Nelson and Hughes 2015). Additionally, although our method for measuring selection in OLGs 25 does not explicitly account for mutation bias, benchmarking with other viruses suggests 26 detection of purifying selection is conservative (Nelson et al. 2020). Finally, given multiple 27 recombination breakpoints in ORF3a (Boni et al. 2020) and the relative paucity of sequence 28 data for viruses closely related to SARS-CoV-2, our analysis could not differentiate between 1 convergence, recombination, or recurrent loss in the origin of ORF3c. 2 In conclusion, OLGs are an important part of viral biology and deserve more attention. 3 We document several lines of evidence for the expression and functionality of a novel OLG in 4 SARS-CoV-2, here named ORF3c, and compare it to other hypothesized OLG candidates in 5 ORF3a. Finally, we provide a detailed annotation of the SARS-CoV-2 genome and highlight 6 mutations of potential relevance to the within-and between-host evolution of SARS-CoV-2 as 7 a resource for future studies.     Table 23).  Table 9), leaving 21 sequences for analysis (Supplementary Table   3 8). To produce whole-genome alignments, we first aligned all sequences using MAFFT. Then, coding 4 regions were identified using exact or partial sequence identity to SARS-CoV-2 or SARS-CoV-1 5 annotations, translated, and aligned at the amino acid level using ProbCons v1.12 (Do et al. 2005). The

11
The alignment is available in the supplementary data as sarbecovirus_aln.fasta, where pangolin-CoV 12 GD/1 has been removed because it is only available from GISAID.

13
Phylogenetic analysis and ancestral sequence reconstruction  18 Boussau and Gouy 2006) were contrasted based on their logLik value, while accounting for among-site 19 rate heterogeneity using discrete rate categories modeled by the Γ distribution (Yang 1995) and the 20 FreeRate model (Soubrier et al. 2012).

Proteomics analysis 22
To determine whether ORF3c is expressed, we re-analyzed four publicly available SARS-CoV-2 mass

17
NetMHCPan T-cell epitope analysis  antigen presentation compared to a set of random natural peptides. We employed the suggested 24 threshold of 2% to determine potential presented peptides, and 0.5% to identify strong MHC binder.

25
Both strong and weak binders were considered predicted epitopes.

26
Statistical tests on synonymous and nonsynonymous rate 27 Statistical and data analyses and visualization were carried out in R v3. 5

19
To estimate π, numbers of nonsynonymous and synonymous differences and sites were first  Within-host recurrent mutations analyses 9 We assume that each host was infected by a single genotype. Under this assumption, the minor allele