Comparative Genomic Analysis of Rapidly Evolving SARS-CoV-2 Viruses Reveal Mosaic Pattern of Phylogeographical Distribution

The Coronavirus Disease-2019 (COVID-19) that started in Wuhan, China in December 2019 has spread worldwide emerging as a global pandemic. The severe respiratory pneumonia caused by the novel SARS-CoV-2 has so far claimed more than 60,000 lives and has impacted human lives worldwide. However, as the novel SARS-CoV-2 displays high transmission rates, their underlying genomic severity is required to be fully understood. We studied the complete genomes of 95 SARS-CoV-2 strains from different geographical regions worldwide to uncover the pattern of the spread of the virus. We show that there is no direct transmission pattern of the virus among neighboring countries suggesting that the outbreak is a result of travel of infected humans to different countries. We revealed unique single nucleotide polymorphisms (SNPs) in nsp13-16 (ORF1b polyprotein) and S-Protein within 10 viral isolates from the USA. These viral proteins are involved in RNA replication, indicating highly evolved viral strains circulating in the population of USA than other countries. Furthermore, we found an amino acid addition in nsp16 (mRNA cap-1 methyltransferase) of the USA isolate (MT188341) leading to shift in amino acid frame from position 2540 onwards. Through the construction of SARS-CoV-2-human interactome, we further revealed that multiple host proteins (PHB, PPP1CA, TGF-β, SOCS3, STAT3, JAK1/2, SMAD3, BCL2, CAV1 & SPECC1) are manipulated by the viral proteins (nsp2, PL-PRO, N-protein, ORF7a, M-S-ORF3a complex, nsp7-nsp8-nsp9-RdRp complex) for mediating host immune evasion. Thus, the replicative machinery of SARS-CoV-2 is fast evolving to evade host challenges which need to be considered for developing effective treatment strategies.

downloaded on March 19, 2020 from NCBI database and based on quality assessment two 126 genomes with multiple Ns were removed from the study. Further the genomes were annotated 127 using Prokka [22]. A manually annotated reference database was generated using the Genbank 128 file of Severe acute respiratory syndrome coronavirus 2 isolate-SARS-CoV-129 2/SH01/human/2020/CHN (Accession number: MT121215) and open reading frames (ORFs) 130 were predicted against the formatted database using prokka (-gcode 1) [22]. Further the GC 131 content information was generated using QUAST standalone tool [23]. 132

Analysis of natural selection 133
To determine the evolutionary pressure on viral proteins, dN/dS values were calculated for 9 134 ORFs of all strains. The orthologous gene clusters were aligned using MUSCLE v3.  Team, 2015). 139

Phylogenetic analysis 140
To infer the phylogeny, the core gene alignment was generated using MAFFT [27] present 141 within the Roary Package [28]. Further, the phylogeny was inferred using the Maximum 142 Likelihood method based and Tamura-Nei model [29] in MEGAX [30] and visualized in 143 interactive Tree of Life (iTOL) [31] and GrapeTree [32]. 144 To determine the single nucleotide polymorphism (SNP), whole-genome alignments were 145 made using libMUSCLE aligner. For this, we used annotated genbank of SARS-CoV-146 2/SH01/human/2020/CHN (Accession no. MT121215) as the reference in the parsnp tool of 147 Harvest suite [33]. As only genomes within a specified MUMI distance threshold are recruited, 148 we used option -c to force include all the strains. For output, it produced a core-genome 149 alignment, variant calls and a phylogeny based on Single nucleotide polymorphisms. The SNPs 150 were further visualized in Gingr, a dynamic visual platform [33]. Further, the tree was 151 visualized in interactive Tree of Life (iTOL) [31]. 152 SARS-CoV-2 protein annotation and host-pathogenic interactions 153 updated in any protein database, we first annotated the genes using BLASTp tool [34]. The 156 similarity searches were performed against SARS-CoV isolate Tor2 having accession no. 157 AY274119 selected from NCBI at default parameters. The annotated SARS-CoV-2 proteins 158 were mapped against viruSITE [35] and interaction databases such as Virus.STRING v10.5 159 [36] and IntAct [37] for predicting their interaction against host proteins. These proteins were 160 either the direct targets of HCoV proteins or were involved in critical pathways of HCoV 161 infection identified by multiple experimental sources. To build a comprehensive list of human 162 PPIs, we assembled data from a total of 18 bioinformatics and systems biology databases with 163 five types of experimental evidence: (i) binary PPIs tested by high-throughput yeast two-hybrid 164 (Y2H) systems; (ii) binary, physical PPIs from protein 3D structures; (iii) kinase-substrate 165 interactions by literature-derived low-throughput or high-throughput experiments; (iv) 166 signaling network by literature-derived low-throughput experiments; and (v) literature-curated 167 PPIs identified by affinity purification followed by mass spectrometry (AP-MS), Y2H, or by 168 literature-derived low [36,38]. 169 Filtered proteins (confidence value: 0.7) were mapped to their Entrez ID [39] based on the 170 NCBI database used for interactome analysis. HPI were stimulated using Cytoscape v.3.7.2 171 [40]. 172

Functional enrichment analysis 173
Next, functional studies were performed using the Kyoto Encyclopedia of Genes and Genomes 174 (KEGG) [41,42] and Gene Ontology (GO) enrichment analyses using UniProt database [43] 175 to evaluate the biological relevance and functional pathways of the HCoV-associated proteins. 176 All functional analyses were performed using STRING enrichment and STRINGify, plugin of 177 Cytoscape v.3.7.2 [40]. Network analysis was performed by tool NetworkAnalyzer, plugin of 178 Cytoscape with the orthogonal layout. 179

Results and Discussion 180
General genomic attributes of SARS-CoV-2 181 In this study, we analyzed a total of 95 SARS-CoV-2 strains (available on March 19, 2020) 182 isolated between December 2019-March 2020 from 11 different countries namely USA (n=52), 183 China (n=30), Japan (n=3), India (n=2), Taiwan (n=2) and one each from Australia, Brazil, 184 oronasopharynges or lungs, while two of them were isolated from faeces suggesting both 186 respiratory and gastrointestinal connection of SARS-CoV-2 (Table 1). No information of the 187 source of isolation of the remaining isolates is available. The average genome size and GC 188 content were found to be 29879 ± 26.6 bp and 37.99 ± 0.018%, respectively. All these isolates 189 were found to harbor 9 open reading frames coding for ORF1a (13218 bp) and ORF1b (7788 190 bp) polyproteins, surface glycoprotein or S-protein (3822 bp), ORF3a protein (828 bp Our analysis revealed that strains of human infecting SARS-CoV-2 are novel and highly 201 identical (>99.9%). A recent study established the closest neighbor of SARS-CoV-2 as SARSr-202 CoV-RaTG13, a bat coronavirus [45]. As COVID19 transits from epidemic to pandemic due 203 to extremely contagious nature of the SARS-CoV-2, it was interesting to draw the relation 204 between strains and their geographical locations. In this study, we employed two methods to 205 delineate phylogenomic relatedness of the isolates: core genome ( Figure 1A & C) and single 206 nucleotide polymorphisms (SNPs) ( Figure 1B). Phylogenies obtained were annotated with 207 country of isolation of each strain ( Figure 1A & B). The phylogenetic clustering was found 208 majorly concordant by both core-genome ( Figure 1A) and SNP based methods ( Figure 1B). 209 The strains formed a monophyletic clade, in which MT093571.1 (South Korea) and 210 MT039890.1 (Sweden) were most diverged. Focusing on the edge-connection between the 211 neighboring countries from where the transmission is more likely to occur, we noted a strain 212 from Taiwan (MT066176) closely clustered with another from China (MT121215.1). With the 213 exception of these two strains, we did not find any connection between strains of neighboring 214 countries. Thus, most strains belonging to the same country clustered distantly from each other 215 and showed relatedness with strains isolated from distant geographical locations ( Figure 1A & 216 B). For instance, a SARS-CoV-2 strain isolated from Nepal (MT072688) clustered with a strain 217 from USA (MT039888). Also, strains from Wuhan (LR757998 and LR757995), where the 218 virus was originated, showed highest identity with USA as well as China strains; strains from 219 India, MT012098 and MT050493 clustered closely with China and USA strains, respectively 220 ( Figure 1A & B). Similarly, Australian strain (MT007544) showed close clustering with USA 221 strain ( Figure 1A & B) and one strain from Taiwan (MT066175) clustered nearly with Chinese 222 isolates ( Figure 1B). Isolates from Italy (MT012098) and Brazil (MT126808) clustered with 223 different USA strains ( Figure 1A & B). Notably, isolates from same country or geographical 224 location formed a mosaic pattern of phylogenetic placements of countries' isolates. For viral 225 transmission, contact between the individuals is also an important factor, supposedly due to 226 which the spread of identical strains across the border of neighboring countries is more likely. 227 But we obtained a pattern where Indian strains showed highest similarity with USA and China 228 strains, Australian strains with USA strains, Italy and Brazilian strains with strains isolated 229 from USA among others. This depicts the viral spread across different communities. However, 230 as genomes of SARS-CoV-2 were available mostly from USA and China, sampling biases is 231 evident in analyzed dataset as available on NCBI. Thus, it is plausible for strains from other 232 countries to show most similarity with strains from these two countries. In the near future as 233 more and more genome sequences will become available from different geographical locations; 234 more accurate patterns of their relatedness across the globe will become available 235

SNPs in the SARS-CoV-2 genomes 236
SNPs in all predicted ORFs in each genome were analyzed using SARS-CoV-237 2/SH01/human/2020/CHN as a reference. SNPs were determined using maximum unique 238 matches between the genomes of coronavirus, we observed that the strains isolated from USA 239

SARS-CoV-2-Host interactome unveils immunopathogenesis of COVID-19 280
Although the primary mode of infection is human to human transmission through close contact, 281 which occurs via spraying of nasal droplets from the infected person, yet the primary site of 282 infection and pathogenesis of SARS-CoV-2 is still not clear and under investigation. To 283 explore the role of SARS-CoV-2 proteins in host immune evasion, the SARSCoV-2 proteins 284 were mapped over host proteome database ( Figure 3B & Table 3). We identified a total of 28 285 proteins from host proteome forming close association with 25 viral proteins present in 9 ORFs 286 of SARS-CoV-2 ( Figure 3C). The network was trimmed in Cytoscape v3.7.2 where only 287 interacting proteins were selected. Only 12 viral proteins were found to interact with host 288 proteins ( Figure 3A). Detailed analysis of interactome highlighted 9 host proteins in direct 289 association with 6 viral proteins. Further, the network was analyzed for identification of 290 regulatory hubs based on degree analysis. We identified mitogen activated protein kinase 1 291 Similarly, the viral protein Papain-like proteinase (PL-PRO) which has deubiquitinase and 309 deISGylating activity is responsible for cleaving viral polyprotein into 3 mature proteins which 310 are essential for viral replication [57]. Our study showed that PL-PRO directly interacts with 311 PPP1CA which is a protein phosphatase that associates with over 200 regulatory host proteins 312 to form highly specific holoenzymes. PL-PRO is also found to interact with TGFβ which is a 313 beta transforming growth factor and promotes T-helper 17 cells (Th17)  agreement with our prediction where we found IL6 as an interacting partner. Our study also 322 showed JAK1/2 as an interacting partner which is known for IFNγ signaling. It is well known 323 that TGFβ along with IL6 and STAT3 promotes Th17 differentiation by inhibiting SOCS3 324 [62]. Th17 is a source of IL17, which is commonly found in serum samples of COVID-19 325 patients [61,63]. Hence, our interactome is supported from these findings where we found 326 SOCS3, STAT3, JAK1/2 as an interacting partner [64]. The results suggested that 327 proinflammatory cytokine storm is one of the reasons for SARS-CoV-2 mediated 328 immunopathogenesis. 329 In the next cycle of physical events the viral protein NC (nucleoprotein), which is a major 330 structural part of SARV family associates with the genomic RNA to form a flexible, helical 331 nucleocapsid. Interaction of this protein with SMAD3 leads to inhibition of apoptosis of SARS-332 CoV infected lung cells [65], which is a successful strategy of immune evasion by the virus. membranes. CAV1 has been reported to be involved in viral replication, persistence, and the 346 potential role in pathogenesis in HIV infection also [71]. Thus, ORF3a interactions will 347 upregulate viral replication thus playing a very crucial role in pathogenesis. Multiple 348 structural proteins were observed to target the SPECC1 proteins and linked with cytokinesis 350 and spindle formations during division. Thus, major viral assembly also targets the proteins 351 linked with immunity and cell division. Taken together, we estimated that SARS-CoV-2 352 manipulate multiple host proteins for its survival while, their interaction is also a reason for 353 immunopathogenesis. 354

Conclusions 355
As COVID-19 continues to impact virtually all human lives worldwide due to its extremely 356 contagious nature, it has spiked the interest of scientific community all over the world to 357 understand better the pathogenesis of the novel SARS-CoV-2. In this study, the analysis was 358 performed on the genomes of the novel SARS-CoV-2 isolates recently reported from different 359 countries to understand viral pathogenesis. With the limited data available so far, we observed 360 no direct transmission pattern of the novel SARS-CoV-2 in the neighboring countries through 361 our analyses of the phylogenomic relatedness of geographical isolates. The isolates from same 362 locations were phylogenetically distant, for instance, isolates from the USA and China. Thus, 363 there appears to be a mosaic pattern of transmission indicative of the result of infected human 364 travel across different countries. As COVID-19 transited from epidemic to pandemic within a 365 short time, it does not look surprising from the genome structures of the viral isolates. The 366 genomes of six isolates, specifically from the USA, were found to harbor unique amino acid 367 SNPs and showed amino acid substitutions in ORF1b protein and S-protein, while one of them 368 also harbored an amino-acid addition. This is suggestive of the severity of the mutating viral 369 genomes within the population of the USA. These proteins are directly involved in the 370 formation of viral replication-transcription complexes (RTC). Therefore, we argue that the 371 novel SARS-CoV-2 has fast evolving replicative machinery and that it is urgent to consider 372 these mutants to develop strategies for COVID-19 treatment. The ORF1ab polyprotein protein 373 and S-protein were also found to have dN/dS values approaching 1 and thus might confer a 374 selective advantage to evade host responsive mechanisms. The construction of SARS-CoV-2-375 human interactome revealed that its pathogenicity is mediated by a surge in pro-inflammatory 376 cytokine. It is predicted that major immune-pathogenicity mechanism by SARS-CoV-2 377 includes the host cell environment alteration by disintegration by signal transduction pathways 378 and immunity evasion by several protection mechanisms. The mode of entry of this virus by 379 S-proteins inside the host cell is still unclear but it might be similar to SARS CoV-1 like 380 viruses. Lastly, we believe as more data accumulate for COVID-19 the evolutionary pattern 381 will become much clear. 382  isolates. Highly similar genomes of coronaviruses were taken as input by Parsnp. Whole-591 genome alignments were made using libMUSCLE aligner using the annotated genome of 592 MT121215 strain as reference. Parsnp identifies the maximal unique matches (MUMs) among 593 the query genomes provided in a single directory. As only genomes within a specified MUMI 594 distance threshold are recruited, option -c to force include all the strains was used. The output 595 phylogeny based on Single nucleotide polymorphisms was obtained following variant calling 596 on core-genome alignment. C) The minimum spanning tree generated using Maximum 597

Authors Contribution
Likelihood method and Tamura-Nei model showing the genetic relationships of SARS-CoV-2 598 isolates with their geographical distribution. 599  Table 1: General genomic attributes of SARS-CoV-2 strains. 618