Integrated genomic view of SARS-CoV-2 in India

Background: India first detected SARS-CoV-2, causal agent of COVID-19 in late January 2020, imported from Wuhan, China. From March 2020 onwards, the importation of cases from countries in the rest of the world followed by seeding of local transmission triggered further outbreaks in India. Methods: We used ARTIC protocol-based tiling amplicon sequencing of SARS-CoV-2 (n=104) from different states of India using a combination of MinION and MinIT sequencing from Oxford Nanopore Technology to understand how introduction and local transmission occurred. Results: The analyses revealed multiple introductions of SARS-CoV-2 genomes, including the A2a cluster from Europe and the USA, A3 cluster from Middle East and A4 cluster (haplotype redefined) from Southeast Asia (Indonesia, Thailand and Malaysia) and Central Asia (Kyrgyzstan). The local transmission and persistence of genomes A4, A2a and A3 was also observed in the studied locations. The most prevalent genomes with patterns of variance (confined in a cluster) remain unclassified, and are here proposed as A4-clade based on its divergence within the A cluster. Conclusions: The viral haplotypes may link their persistence to geo-climatic conditions and host response. Multipronged strategies including molecular surveillance based on real-time viral genomic data is of paramount importance for a timely management of the pandemic.


Introduction
The ongoing pandemic of COVID-19 caused by SARS-CoV-2 following its first appearance in China has pressed the global community to take measures to flatten its transmission (Chan et al., 2020;Zhu et al., 2020). The severe symptoms of infection can include pneumonia, severe acute respiratory syndrome, kidney failure and even death with a coalescence of factors (Young et al., 2020;Zhu et al., 2020). Many COVID-19 cases have been reported to be asymptomatic and may serve as carrier of SARS-CoV-2 (He et al., 2020;Xu et al., 2020). Genome sequences of SARS-CoV-2 suggest its origin and transmission patterns after it enters a new population is proving to be an important step towards formulating strategies for management of this pandemic (Andersen et al., 2020;Chen & Li, 2020).
The first three cases in India were reported in late January and early February, in individuals with a travel history of Wuhan, China. India took drastic steps to contain the further spread of the virus including imposition of travel restrictions to and from the affected countries. There were no new cases of COVID-19 for almost a month. However, while the global focus was on China and other eastern countries like South Korea and Japan; European countries, the Middle East and the USA reported a surge in cases of COVID-19. March 2020 onwards, India also witnessed a surge of imported cases from countries other than China which has been further assisted with local transmission. In March, imposition of nationwide lockdown checked the epidemic curve. Despite these measurements, India is at the verge of a large outbreak as the transmission is rapidly increasing with more than 450,000 reported cases of COVID-19 by the fourth week of June 2020.
We carried out whole genome sequencing of SARS-CoV-2 (n=104) from Pan-India through Surveillance Program of the National Center for Disease Control (NCDC), Delhi. Here, we combine genetic and epidemiological data to understand the genetic diversity, evolution, and epidemiology of SARS-CoV-2 across India. The spectrum of variations would be an important tool towards contact tracing, effective diagnostics and backbone for drug and vaccine development.

Subject recruitment
The study was conducted jointly by the NCDC and CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB). Institutional ethical clearance was obtained at both institutes prior to initiation of research; the need for consent from the patients was waived by the committee. A total of 127 laboratory-confirmed cases of COVID-19 from targeted testing and available samples at NCDC which represent different geographic locations or states and travel history from different countries during the early phase of the outbreak (Table 1 and Extended data, Supplementary figure S1 [Kumar et al., 2020b]). were included in the study for genomic analyses. Targeted testing involved suspected cases; having symptoms (fever, cough and breathlessness) with recent travel history to high-risk countries (China, South Asia, Middle East, European countries such as Italy, Spain, UK, France and USA) or positive contacts of COVID-19 cases.

Sample collection and molecular investigations
The nasopharyngeal and oropharyngeal swabs (in viral transport medium) were received at NCDC, Delhi through the Integrated Disease Surveillance Programme were subjected to viral inactivation followed by RNA extraction using QIAamp Viral RNA Mini Kit (Cat. No. 52906, Qiagen). Total RNA content in the elute was quantified using NanoDrop (Thermo Fisher Scientific). The 260/280 ratio ranged between 1.6-2.2 for the majority of the samples. To ensure that sub-optimal RNA samples are also included in the study, we made use of SuperScript IV (Cat. No. 18091050, Thermo Fisher Scientific, Waltham, MA, USA), for superior first strand cDNA synthesis and included them for sequencing.

Molecular diagnosis of COVID-19
A quantitative reverse transcription (RT)-PCR assay was used on purified RNA for detection of SARS-CoV-2 in the samples. Quantitative RT-PCR was carried out using TaqMan assay chemistry on ABI7500 platform. The primer/probe concentrations and reaction conditions for diagnostics were as per the WHO protocols (Corman et al., 2020). Two target genes were used for diagnosis of SARS-CoV-2, envelope (E) gene for screening and RNA dependent RNA polymerase (RdRp gene) for confirmation. The positive samples were analyzed based on the country of origin (traveller), contact with positive case, geographical location (community), gender and age. Samples from each group were selected and further processed for WGS of the SARS-CoV-2.
Whole genome sequencing of SARS-CoV-2 cDNA synthesis: Total RNA from SARS-CoV-2 positive samples were quantified using Nanodrop and 50 ng of the RNA was taken for double-stranded cDNA synthesis. Briefly, first strand cDNA was made using 1.0 μl of random hexamer (50 ng/μl), 1.0 μl of dNTPs (10 nM) and 13.0 μl of total RNA with volume adjusted with nuclease-free water (NFW), followed by incubation at 65°C for 5 mins and cooling on ice. To this, 4.0 μl of 5X SSRT IV Buffer, 1.0 μl of 100 mM DTT, 1.0 μl of ribonuclease inhibitor and 1.0 μl of SSRT IV enzyme (200 U/μl) was added (Cat. No. 18091050, Thermo Fisher Scientific, Waltham, MA, USA) with incubation at 23°C for 10 minutes, 50°C for 10 minutes and 80°C for 10 minutes. 1.0 μl of RNase H was added to this and incubated at 37°C for 20 mins. Next, 20.0 μl of first strand cDNA was heated at 95°C for 3 minutes after addition of 10 pmol of random primers, 10 μM dNTPs and 1X Klenow Buffer, followed by immediate cooling on ice. Soon after, 1.0 μl of Klenow Fragment (Cat. No. M0210S, New England Biolabs) was added with incubation at 37°C for 60 mins, 75°C for 10 mins and 4°C for 10 mins. This was followed by Ampure beads purification (Cat. No. A63881, Beckman Coulter) and quantification using Qubit dsDNA HS assay kit (Cat. No. Q32854, Invitrogen).
Nanopore library preparation and sequencing: A total of 100 ng of double stranded cDNA was taken for next generation sequencing (NGS) using a highly multiplexed PCR amplicon approach for sequencing on the Oxford Nanopore Technologies (ONT) (Oxford, United Kingdom) MinION using V3 primer   England Biolabs). After adaptor ligation, it was purified using a combination of short fragment buffer and Ampure beads resulting in a sequencing ready library. Library quantification was conducted using the Qubit dsDNA HS assay kit (Cat. No. Q32854, Invitrogen) and 70 ng of the library was used for sequencing. Barcoding, adaptor ligation, and sequencing were performed on samples with Ct values between 16-31. A 'no template control' was created at the cDNA synthesis step and amplicon generation step to detect cross-contamination between samples. Controls were barcoded and sequenced with both the high-and low-titer sample groups. The sequencing flowcell was primed and used for sequencing using MinION Mk1B.

Illumina library preparation and sequencing
A common pool of cDNA was used for making both Illumina and Nanopore sequencing libraries and subsequent sequencing. cDNA (100 ng) was used to construct the Illumina library using theNextera XT protocol, as per manufacturer's instructions (15031942 v05, Illumina Inc). Briefly, tagmentation of cDNA was done which tagged and fragmented the cDNA by addition of amplicon tagment mix (ATM) and tagment DNA buffer, as per manufacturer's protocol, Illumina Inc with incubation at 55°C for 5 mins with heated lid option. Tagmentation was stopped by addition of neutralization tagment buffer. This was followed by the addition of unique index adapters (i7 and i5 adapters) to the samples. Index adapters are then used for PCR amplification at 72°C for 3 mins, 95°C for 30 secs and 12-cycles of 95°C for 10 secs, 55°C for 30 secs, 72°C for 30 secs; and 72°C for 5 mins. The PCR product was purified using AgencourtAMPure XP beads. The quantity of the sequencing ready library was measured using Qubit dsDNA HS assay kit (Cat. No. Q32854, Invitrogen) and quality by Agilent DNA HS kit (Cat. No. 5067-4626, Agilent). Illumina's MiSeq platform was used for sequencing.
Analysis pipeline for nanopore sequencing data The raw fast5 files were base-called and demultiplexed using the Guppy basecaller (version 3.5.2). The fastq files were normalized by read length, thereby eliminating possible chimeric reads. Pre-alignment quality control was carried out to assess the read quality using Nanopack tools (version 1.29.0) (De Coster et al., 2018). Minimap2 (version 2.17) has been used to align the raw reads with the reference (MN908947.3) (Li, 2018). Nanopolish were used for accurate variant calling from the aligned output (Loman, 2015) with options, minimum flanking sequence -10, ploidy -1 and minimum candidate frequency -0.15. The possible heterozygous variants are filtered out as a separate group after the variants have been called. Post-alignment QC was then performed with Nanopack tools as well as the seaborn (version 0.10.1) package in python to create the distribution of amplicon quality and CT-value vs coverage and depth. Finally, a consensus fasta was created, wherein genomic regions with low coverage and low quality were masked using BCFtools (version 1.9).

Miseq data analysis
The raw reads from the miseq were quality-checked by FASTQC (version 0.11.8). Trimgalore was used to trim the reads containing bad quality and the minimum length of 40 base pairs was kept as a threshold for the reads. HISAT2 is used to map the reads to the human genome (GRCh37) to remove potential the human rRNA reads for the contamination with default parameters [Kim et al., 2019]. The unmapped reads from the human are converted from bam to fastq using bam2fastq (version 2.29) to align to the SARS-CoV-2. HISAT2 was used to align unmapped reads with the SARS-CoV-2 reference genome (MN908947.3 build). Using both samtools (version 1.9) and BCFtools from the SARS-CoV-2 aligned bam files the consensus fasta was generated. The variants in the samples were called using BCFtools and VarScan.

Phylogeny and network analysis
The fasta sequences were aligned using MAFFT (version 7.455) considering the MN90894.3 version as the reference sequence. Phylogenetic trees were constructed using the Neighbour joining algorithm as statistical method and maximum composite likelihood as model in MEGA X software. FIGTREE (version 1.4.4) was used for the graphical visualisation of phylogenetic analysis. Pheatmap (version 1.0.12) and ComplexHeatmap (version 1.10.2) packages from R 3.6.2 were used to plot the heatmaps. Haplotype network analysis was conducted using PopART (version 1.7) [Leigh & Bryant, 2015].

Protein based annotation
In order to categorize the specific amino acid change and the proteins containing the variants, they were annotated with SnpEff (version 4.5) [Cingolani et al., 2012]. The annotation was performed according to the known reference genome of SARS-CoV-2 (i.e. NC_045512) in the NCBI database [Wang et al., 2020]. SARS-CoV-2 polypeptide ORF1ab encodes 16 non-structural proteins (nsp) as a result of proteolytic processing. Hence, for better mapping of the variants present in ORF1ab, we annotated the variants according to the respective nsp residue number.
Further, conservation analysis of the full-length sequences of proteins harbouring these mutations was done on the basis of the six other coronaviruses. The multiple sequence alignment of seven protein sequences was performed with Clustal Omega [Madeira et al., 2019]. The conservation score of ORF3a and ORF8 were calculated with low confidence due to introduced gaps at these positions during alignment. The amino acid type was defined as hydrophobic (G, A, V, L, I, M, P, F, W), polar (S, T, N, Q) or charged (H, K, R, D, E). With this definition, the type of change of residues was calculated.

Three-dimensional protein models
To map the high frequency mutations on proteins, we took protein structure models of SARS-CoV-2 from the Swiss Model repository (https://swissmodel.expasy.org/repository/ species/2697049) [Waterhouse et al., 2018] and models were generated through comparative modeling and by using Robetta prediction server [Kim et al., 2004]. In total three structural models were obtained from swiss model repository and the nsp12 structure was obtained from RCSB [PDB ID 6M71] [Gao et al., 2020]. The details of missing residues or structural domains of each protein is described below. The Spike protein exists as a homotrimer consisting of 1273 residues in each chain with a total of 3819 amino acids. While electron microscopy structures are available for different conformations of Spike protein, in particular S1 region, many residues within the S1 and S2 stalks are missing [Walls et al., 2020;Wrapp et al., 2020]. Therefore, to map the mutations onto the structure we obtained the S1 stalk model of the spike protein (residues 15-1137). Similarly, nsp3 also known as PL-PRO (papain-like proteinase) is a large multi-domain transmembrane protein. For mapping the nsp3 mutations, we considered the model for the nucleic acid binding domain (residue 1089-1203), which is conserved in betacoronaviruses [Angelini et al., 2013]. The Nucleocapsid protein comprise of N-terminal and C-terminal domains connected by linker region. However, structural information for the linker region is unknown.

Demographic details and travel history of the subjects
The majority of the SAS-CoV-2-positive samples were obtained from New Delhi, covering the national capital region of Delhi, India and a few clusters identified by the surveillance team (covering the states of Delhi, Tamil Nadu, Maharashtra, Uttar Pradesh, Andhra Pradesh, West Bengal, Bihar, Orissa, Rajasthan, Haryana, Punjab, Assam and Union territory of Ladakh). The mean (standard deviation) age of the total 127 subjects was 41.4±17.5 years with age range 0.5-76 years and median of 39 years. The male-to-female gender ratio of in the age group <39 years was 35:28, while the remaining 46 subjects >40 years had the ratio of 58:6. Exposure to COVID-19 was suggestive of travel history of subjects to Europe, West Asia and East Asia. A minority of subjects were from foreign countries: Indonesia (n=14), Thailand (n=2) and Kyrgyzstan (n=2). The identified localities of the subjects will further help in molecular surveillance of SARS-CoV-2 in respective geographical regions.

Profile of SARS-CoV-2 genome sequences
The average amplicon coverage for the V3 ARTIC primers used in the study was more than 1000X coverage across the majority of the samples (Extended data, Supplementary figure S2 [Kumar et al., 2020b]). We also looked into whether lower Ct values are a good indicator of genome coverage using a minimal set of virus mapping reads. We plotted genome coverage and average sequencing depth across Ct value of both the genes (E and RdRp). It was observed that higher Ct values (27 onwards) have increased possibility of lower genome coverage (Extended data, Supplementary figure S2 [Kumar et al., 2020b]), although some lower Ct value samples also had incomplete genome coverage.
We sequenced a subset of samples on orthogonal platforms and sequencing methods (shotgun and amplicon) using ONT and Illumina platform. Significantly, we observed that the genetic variants were common between both the platforms.

NGS analysis and construction of phylogeny network for SARS-CoV-2 sequences
A total of 104 samples passed the quality threshold for mapping full genome coverage threshold for SARS-CoV-2 genome <0.05 N content with median coverage ~1500× (see Underlying data for each accession number [Kumar et al., 2020a]. A total of 23 samples that did not qualify the threshold criteria were excluded from strain identification. The phylogenetic analysis of 104 high quality sequences reveal all the strains to be grouped into two major clades, a sub-clade and other clades ( Figure 1 and  being called N), C23929T (S-protein), n=50 (other being low depth/N bases). The variant C6310A (NSP3); n=22 being observed as another frequent alteration ( Figure 1B). We also observed few novel variants, G12685T (NSP8); n=5 and TC1706T (NSP2); n=4, T7621C (ORF1a/b); n=3, A21792T (S protein); n=3, G13920A (NSP12/RdRp); n=2, A16355G (NSP13/Helicase); n=2 and G18803T (NSP14/Exonuclease); n=2 in this cluster. The majority of the key cluster variants 11083,13730,28311,6312,23929 are also shared in sequences submitted from Singapore and Brunei; additionally, similar clade sequences were observed in India submitted by National Institute of Mental Health and Neuro-Sciences (NIMHANS) and Gujarat Biotechnology Research Centre (GBRC) cohort (see Underlying data for details of all accession numbers [Kumar et al., 2020a]. Based on the geographical location of the subjects of this cluster, a considerable number of Indonesians (n=7) and two each from Thailand and Kyrgyzstan were part of this cluster from our study site. This probably suggests introduction of this particularly from East Asian countries into India. Redefining cluster 2 with neighbourhood re-joining With over represented variants in cluster 2 for variants 11083/137 30/28311/6312/23929, we defined this cluster with A4 clade. This has similarity with sequences submitted from Singapore, Brunei and other Indian sequences submitted. The haplotype network analysis suggests that these sequences are having a common origin from East Asia/South-East Asia ( Figure 1C and Figure 2). This A4 clade has multiple variants in important region of viral genome, RdRp (A97V), N-capsid (P12L), NSP3 (T2016K), NSP6 (L37F) and NSP3 (S1197R) variants. In our cohort of samples, the majority of subjects were from Tamil Nadu, Delhi and Indonesia and others were from various other states (Figure 2).

Protein-wise analysis of SARS-CoV-2 variants
To provide quantitative insights into the mutant proteins, we characterized amino acid substitutions across the 104 viral genomes. Of the 53 point mutations identified, 29 were missense that resulted in amino acid substitutions. Extended data, Supplementary figure S4 [Kumar et al., 2020b] plots the occurrence of these mutations as a function of each viral protein. The frequency of amino acid variations was highest in nsp6 (L37F), present in 68 genomes, followed by nsp12 (A97V) in 65, nsp3 (T1198K) in 62 and nucleocapsid (P13L) in 53 genomes. Interestingly, the D614G mutation in spike protein, which is considered as a prevalent global mutation [Bhattacharyya et al., 2020;Korber et al., 2020], was present in only 26 of the 104 sequenced genomes.
The analysis of occurrence of each mutation with the type of amino acid change have shown that ~45% of these are synonymous changes (Extended data, Supplementary figure S5 [Kumar et al., 2020b]). Within frequently occurring mutations, P13L, L37F, A97V also showed no major residue alterations. However, T1198K in nsp3 involve acquisition of a charged group along with the key S protein mutation (D614G) also involves loss of the charged group. These mutations that lead to positively charged groups may cause more severe structural and functional effects.
We also compared SARS-CoV-2 mutation sites with other six coronavirus sequences (Extended data, Supplementary figure S5b [Kumar et al., 2020b]). Most of the mutations were present in variable locations. Out of 29 mutations, 10 are present on highly conserved residue locations. Interestingly, a higher frequency of mutations are at positions that evolve faster/are variable across the coronaviruses, except for A97L and L37F, which are present on conserved locations.
The structural analysis of different viral proteins, nucleocapsid, nsp3, nsp12, and spike protein was conducted and analysis of nucleocapsid protein [Kang et al., 2020] showed its variants were present in the linker region ( Figure 3A). The observed mutations in nsp12 (a highly conserved protein) are overlaying onto the interface (P323L) and NiRAN (A97L) region. The latter is critical as it contains a Zn+ binding site; however, little is known about the exact functional output. In contrast, the P323L mutation is present on protein interaction junctions where a hydrophobic cleft is known to bind to inhibitors ( Figure 3B).
The amino acid change from proline to leucine may result in significant backbone changes, due to the absence of unique proline-induced distortions in the protein backbone. Next, we mapped mutations within Nsp3 protein ( Figure 3C). In particular, the mutations were present on the NAB domain of nsp3, which is a nucleic acid binding domain and also interacts with nsp12 [Jian et al., 2018]. This mutation may impact RNA synthesis machinery; however, little is known about its exact mechanism of action. Lastly, the D614G mutation in spike protein is an interesting substitution and has been reported with increased tally ( Figure 3D) [Bhattacharyya et al., 2020;Korber et al., 2020]. Structurally, this mutation is located in the S1 subunit that also contains the RBD domain. Although present outside the functional region, the proximity of D614G around S1 cleavage site implicates an important change in the local environment.

Discussion
This is the first comprehensive genomic picture of the SARS-CoV-2 prevalent in the Indian population during the early phase of outbreaks. The understanding is important keeping in view the vast geographical expanse and population density of India. There were three major waves of viral entry in India associated with multiple outbreaks (Extended data, Supplementary figure S6 [Kumar et al., 2020b]). The first wave includes importation of SARS-CoV-2 (A2a cluster) through travelers from Europe (Italy, UK, France, etc.) and the USA. Second wave of SARS-CoV-2 (A3 cluster) was linked with the Middle East (Iran and Iraq). The third wave comprises combined viral (haplotype redefined as A4) entries from Southeast Asia (Indonesia, Thailand and Malaysia) and Central Asia (Kyrgyzstan). The study, taken together with those of other reported genomes (Potdar et al., 2020), revealed that the A4 cluster (previously unclassified) is the most prevalent in the available genome sequences from India. The observed distinct A4 genome lineage of SARS-CoV2 in the Indian Subcontinent, which is present in East Asian Countries like Singapore and Indonesia, may allow further research and investigation to understand the evolution of SARS-CoV-2 genomes in Southeast Asian countries. Many novel mutations identified may be specific to Indian conditions, but more genomic data is needed to strengthen the assumption to rule out sampling bias and other factors (Lu et al., 2020). However, a more detailed analysis of these genomes might provide information whether these variations need to be considered during design of diagnostic primers as the need for testing shoots up. It may allow for creation of cost-effective panels to trace the movement of lineage specific strains across geographical regions more rapidly and effectively. Lots of efforts are ongoing to identify suitable vaccine candidates through docking studies. These observations are important to consider the variants that map to the Indian genomes during such prioritization studies, since these strains would now form a major fraction of the genomes that are likely to become more prevalent in India after lockdown. Mapping of these variant genomes in conjunction with the clinical history in terms of recovery, hospitalization and co-morbidity might allow identification of variants that should be actionable and would also have relevance for prognosis. It is imperative that robust genomic data based on large sample size, including rural populations with even distribution can bring out the real scenario once correlated with epidemiological data eventually helping in drafting of further management policies.   [Kumar et al., 2020a].

Data availability
This project contains all GISAID accession numbers generated and analysed in this study.
This file contains the following extended data: • Supplementary Figure S1: A schematic diagram showing numbers of samples with their geographical affiliations with respect to states of India.
• Supplementary Figure S2: Sequencing data quality parameters and orthogonal platform validation.