SARS-CoV-2 Next Generation Sequencing (NGS) data from clinical isolates from the East Texas Region of the United States

The SARS-CoV-2 virus has evolved throughout the pandemic and is likely to continue evolving into new variants. Some of these variants may affect functional properties, including infectivity, interactions with host immunity, and disease severity. Compromised vaccine efficacy is an emerging concern with every new viral variant. Next-generation sequencing (NGS) has emerged as the tool of choice for discovering new variants and understanding the transmission dynamics of SARS-CoV-2. Deciphering the SARS-CoV-2 genome has enabled epidemiological surveillance and the forecasting of altered etiology. Clinical presentations of the infection are influenced by age, immune status, comorbidities such as diabetes, and the infecting variant. Thus, clinical management and vaccine efficacy may differ for new variants. For example, some monoclonal antibody treatments are variant-specific, and some vaccines are less efficacious against the Omicron and Delta variants of SARS-CoV-2. Consequently, identifying local outbreaks and monitoring SARS-CoV-2 Variants of Concern (VOC) are among the primary strategies for containing the pandemic. Although NGS is a gold standard for genomic surveillance and variant discovery, the assays are not approved for variant diagnosis for clinical decision-making. Advanta Genetics, Texas, USA, optimized the Illumina COVIDSeq protocol to reduce cost without compromising accuracy and validated the assay as a Laboratory Developed Test (LDT) according to the guidelines prescribed by the College of American Pathologists (CAP) and the Clinical Laboratory Improvement Amendments (CLIA). The whole genome of the virus was sequenced in 161 samples from the East Texas region using the Illumina MiniSeq® instrument and analyzed using the Illumina BaseSpace (https://basespace.illumina.com) bioinformatics pipeline.
Briefly, libraries were prepared using the Illumina COVIDSeq research use only (RUO) kit, individual libraries were normalized using the DNA concentration measured by a Qubit Flex Fluorometer, and the pooled libraries were sequenced on the Illumina MiniSeq® instrument. The Illumina BaseSpace application was used for sequencing QC, FASTQ generation, genome assembly, and identification of SARS-CoV-2 variants. This whole genome shotgun project (n = 161) has been deposited at GISAID.

Dataset link: SARS-CoV-2 Next Generation Sequencing (NGS) data from clinical isolates from the East Texas Region of the United States (Original data)
Keywords: SARS-CoV-2; qRT-PCR; Next generation sequencing; Variants of concern; Epidemiology; Transmission dynamics

Specifications Table
Subject: Health and Medical sciences
Specific subject area: Genomic: Virology
Type of data: Raw FASTQ files; available at GISAID; Figure; Table (supplementary)
How the data were acquired: Data were acquired by sequencing SARS-CoV-2 from PCR-positive patient samples from the East Texas region. Sequencing data were analyzed using DRAGEN COVID Lineage BaseSpace Labs (3.5.4). Analyzed data have been published [1,2].
Data format: Raw data (FASTQ files, accession ID); analyzed data for surveillance over time; filtered by coverage
Description of data collection: Nasopharyngeal swab samples were collected from suspected SARS-CoV-2 cases and tested for SARS-CoV-2 by qRT-PCR. Only positive samples (Ct values of ∼30) were selected for SARS-CoV-2 whole genome sequencing using the COVIDSeq protocol and the MiniSeq® (Illumina) instrument. In a real-time reverse transcription PCR (qRT-PCR) assay, "Ct" refers to the cycle threshold: the number of amplification cycles required before the viral genetic material is detected in the sample. A lower Ct value indicates more viral RNA in the sample.

Value of the Data
• This data may be useful to researchers mapping the evolution of SARS-CoV-2 variant mutations from the East Texas region, as well as the efficacy of diagnostic techniques for variant calling.
• The data can be useful for researchers working on circulating variants of SARS-CoV-2. This study may help to develop strategies and control programs for this pandemic.
• The data will be used to retrieve information about the circulating and dominating strains of SARS-CoV-2, which may help better understand the transmission dynamics of SARS-CoV-2 and develop strategies for preventing the pandemic from returning. Moreover, the data can be used to develop genomics-derived transmission prediction models to predict infectious disease spread in the future. Several studies have suggested the differential clinical presentation and vaccine efficacy for different SARS-CoV-2 strains, which may guide therapeutic decisions, such as monoclonal antibody therapies and future vaccination strategies [3,4].

Objective
The primary objective of this data acquisition was to monitor the evolution of SARS-CoV-2 variants in the East Texas region. The second objective was to establish SARS-CoV-2 variant detection as a Laboratory Developed Test (LDT) for clinical reporting.

Data Description
Here we present whole genome sequencing data obtained from 161 qRT-PCR positive SARS-CoV-2 nasopharyngeal samples from the East Texas region collected between August 2020 and September 2022. Data repository: GISAID: EPI_SET_20220715vh [1]. Data were analyzed after determining the limit of detection (LOD) of genomic coverage by computing the depth of coverage (X times) and percent genome coverage for all tested samples. A minimum depth of coverage of 200X and a minimum of 90% genome coverage were required for successful variant detection (Fig. 1). Importantly, all 161/161 (100%) observations with a minimum of 90% genome coverage at a minimum of 200X resulted in the correct variant call after the analysis.
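The coverage criterion above can be illustrated with a minimal Python sketch. It is not the BaseSpace implementation; the record type, field names, and sample identifiers are hypothetical, and the sketch assumes the thresholds are inclusive (at least 200X and at least 90%).

```python
from dataclasses import dataclass

# Thresholds taken from the text; treating them as inclusive is an assumption.
MIN_MEAN_DEPTH = 200.0       # depth of coverage (X)
MIN_GENOME_COVERAGE = 90.0   # percent of the reference genome covered


@dataclass
class CoverageMetrics:
    """Per-sample coverage summary (hypothetical structure)."""
    sample_id: str
    mean_depth: float
    percent_covered: float


def passes_coverage_qc(m: CoverageMetrics) -> bool:
    """Return True when a sample meets both coverage thresholds."""
    return (m.mean_depth >= MIN_MEAN_DEPTH
            and m.percent_covered >= MIN_GENOME_COVERAGE)


def filter_by_coverage(metrics):
    """Keep only the samples that pass the coverage QC."""
    return [m for m in metrics if passes_coverage_qc(m)]


if __name__ == "__main__":
    # Illustrative values only, not taken from the dataset.
    demo = [
        CoverageMetrics("demo-1", 250.0, 95.0),
        CoverageMetrics("demo-2", 150.0, 99.0),
        CoverageMetrics("demo-3", 500.0, 85.0),
    ]
    for m in filter_by_coverage(demo):
        print(m.sample_id)
```

A sample must satisfy both conditions, so high depth cannot compensate for incomplete genome coverage, and vice versa.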
The data identified greater genomic diversity early in the pandemic, before the classification of VOC. Initial samples from July 2020 identified the SARS-CoV-2 574; and B.1.602. The data revealed a progressive shift away from non-VOC lineages: 100% of samples tested from July-August 2021 were called as the WHO-classified Delta VOC. Continued virus mutation led to co-circulating variants in samples tested from December 2021, with Omicron (58%) and Delta (42%) both detected. By contrast, samples from April-September 2022 were 100% Omicron, with dominance progressing from Omicron BA.2 to Omicron BA.5 (Fig. 2).
Representative variant sequences were retrieved from the GISAID database for global SARS-CoV-2 sequence analysis, available on the Nextstrain server [5]. All individual consensus genome sequence files were aligned using the Clustal-W Multiple Sequence Alignment Tool [6]. Phylogenetic analysis was performed using the Clustal Omega Server, and the phylogenetic tree was constructed using the MEGA X tool with default maximum likelihood parameters [7,8].
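The per-period variant percentages reported above (for example, Omicron versus Delta in December 2021) amount to a simple tally of calls per collection period. The following Python sketch shows one way to compute such shares; the input format (pairs of period label and variant name) is an assumption, and the demo records are illustrative rather than taken from the dataset.

```python
from collections import Counter, defaultdict


def variant_shares(records):
    """Compute the percentage of each variant within each period.

    `records` is an iterable of (period, variant) pairs, a hypothetical
    input format. Returns {period: {variant: percent}} rounded to 0.1.
    """
    counts = defaultdict(Counter)
    for period, variant in records:
        counts[period][variant] += 1

    shares = {}
    for period, counter in counts.items():
        total = sum(counter.values())
        shares[period] = {
            variant: round(100.0 * n / total, 1)
            for variant, n in counter.items()
        }
    return shares


if __name__ == "__main__":
    # Illustrative records only.
    demo = [("period-A", "Omicron")] * 3 + [("period-A", "Delta")]
    print(variant_shares(demo))
```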

Experimental Design, Materials and Methods
A total of 161 nasopharyngeal swab samples were collected from patients with a positive SARS-CoV-2 qRT-PCR assay at Advanta Genetics in Tyler, Texas, USA (https://aalabs.com/). Total nucleic acid (NA) was extracted using the Roche MagNA Pure 96 system and Viral RNA Small Volume Kits (Port Scientific Inc, Canada). Isolated NA was archived at -80 °C until library preparation. Whole genome synthetic RNAs from three reference strains (Omicron, Delta, and Wuhan) were obtained from BEI Resources and sequenced with each sequencing batch for quality control.
Libraries were prepared using the Illumina COVIDSeq protocol (Illumina Inc, USA). Briefly, first-strand cDNA was synthesized using reverse transcriptase and random hexamer primers. The SARS-CoV-2 genome was amplified using two sets of primers (COVIDSeq Primer Pool-1 and 2) in two multiplex PCR reactions. Libraries were constructed by tagmentation and adapter ligation using IDT for Illumina Nextera UD Indexes Set A (Integrated DNA Technologies). Individual libraries were quantified using a Qubit 2.0 fluorometer (Invitrogen, Inc.), normalized, and pooled in equimolar concentrations. Final library pools were diluted to a 2 pM loading concentration, and dual-indexed paired-end sequencing of 75 bp reads was performed using a high-output (HO) flow cell (150 cycles) on an Illumina MiniSeq® instrument.
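The arithmetic behind equimolar pooling and dilution to a 2 pM loading concentration can be sketched as follows. This uses the standard conversion from mass concentration to molarity for double-stranded DNA (about 660 g/mol per base pair). The mean fragment length is not stated in the text and must be supplied by the user; the function names and the amount pooled per library are hypothetical, and the sketch does not reproduce the Illumina denaturation and dilution protocol.

```python
AVG_BP_MASS = 660.0  # approximate g/mol per base pair of double-stranded DNA


def library_molarity_nM(conc_ng_per_ul: float, mean_fragment_bp: float) -> float:
    """Convert a fluorometric concentration (ng/uL) to molarity (nM)."""
    return conc_ng_per_ul / (AVG_BP_MASS * mean_fragment_bp) * 1e6


def equimolar_volumes_ul(molarities_nM, fmol_per_library: float):
    """Volume of each library (uL) that contributes the same molar amount.

    1 nM equals 1 fmol/uL, so volume = amount / molarity.
    """
    return [fmol_per_library / m for m in molarities_nM]


def dilution_factor(stock_nM: float, target_pM: float = 2.0) -> float:
    """Fold dilution needed to bring a pool from stock_nM to target_pM."""
    return stock_nM * 1000.0 / target_pM


if __name__ == "__main__":
    # Illustrative inputs: two libraries with an assumed 400 bp mean fragment length.
    molarities = [library_molarity_nM(c, 400.0) for c in (10.0, 20.0)]
    print(equimolar_volumes_ul(molarities, fmol_per_library=50.0))
    print(dilution_factor(4.0))
```

Pooling by molar amount rather than by mass keeps the number of library molecules per sample roughly equal, which helps even out read depth across samples.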
The Illumina BaseSpace (https://basespace.illumina.com) bioinformatics pipeline was used for data QC, FASTQ generation, genome assembly, and SARS-CoV-2 variant detection. Briefly, raw FASTQ files were trimmed and quality checked (Q > 30) using BaseSpace's FASTQ-QC application. QC-passed FASTQ files were aligned to the SARS-CoV-2 reference genome (NCBI reference sequence NC_045512.2) using the Bio-IT processor (version: 0x04261818). BaseSpace's DRAGEN COVID Lineage (version: 3.5.4) was used to determine the SARS-CoV-2 variant and to generate a single consensus FASTA file. Finally, individual consensus FASTA files were also analyzed for lineage assignment using the online version of Phylogenetic Assignment of Named Global Epidemic Lineages (PANGOLIN) (https://pangolin.cog-uk.io). Only consensus variants identified by both applications were used for further analysis.
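The final concordance step (keeping only variants identified by both DRAGEN COVID Lineage and PANGOLIN) can be expressed as a short Python sketch. It assumes that each tool's calls have already been exported into a dictionary mapping sample identifier to lineage name; that export format, the function name, and the demo values are hypothetical and do not reflect either tool's actual output files.

```python
def concordant_calls(dragen_calls: dict, pangolin_calls: dict) -> dict:
    """Return {sample_id: lineage} for samples where both tools agree.

    Lineage names are compared after trimming whitespace and upper-casing,
    so purely cosmetic differences do not count as disagreement.
    Samples missing from either input are dropped.
    """
    def normalize(name: str) -> str:
        return name.strip().upper()

    agreed = {}
    for sample_id, lineage in dragen_calls.items():
        other = pangolin_calls.get(sample_id)
        if other is not None and normalize(lineage) == normalize(other):
            agreed[sample_id] = normalize(lineage)
    return agreed


if __name__ == "__main__":
    # Illustrative values only.
    dragen = {"demo-1": "BA.5", "demo-2": "AY.3", "demo-3": "BA.2"}
    pangolin = {"demo-1": "BA.5 ", "demo-2": "AY.4"}
    print(concordant_calls(dragen, pangolin))
```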

Ethics Statements
This research used de-identified residual samples, and the study was exempted by the Institutional Review Board (IRB).

Declaration of Competing Interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
SARS-CoV-2 Next Generation Sequencing (NGS) data from clinical isolates from the East Texas Region of the United States (Original data) (GISAID).