Sierra SARS-CoV-2 sequence and antiviral resistance analysis program

Introduction Although most laboratories are capable of employing established protocols to perform full-genome SARS-CoV-2 sequencing, many are unable to assess sequence quality, select appropriate mutation-detection thresholds, or report on the potential clinical significance of mutations in the targets of antiviral therapy Methods We describe the technical aspects and benchmark the performance of Sierra SARS-CoV-2, a program designed to perform these functions on user-submitted FASTQ and FASTA sequence files and lists of Spike mutations. Sierra SARS-CoV-2 indicates which sequences contain an unexpectedly large number of unusual mutations and which mutations are associated with reduced susceptibility to clinical stage mAbs, the RdRP inhibitor remdesivir, or the Mpro inhibitor nirmatrelvir Results To assess the performance of Sierra SARS-CoV-2 on FASTQ files, we applied it to 600 representative FASTQ sequences and compared the results to the COVID-19 EDGE program. To assess its performance on FASTA files, we applied it to nearly one million representative FASTA sequences and compared the results to the GISAID mutation annotation. To assess its performance on mutations lists, we applied it to 13,578 distinct Spike RBD mutation patterns and showed that exactly or partially matching annotations were available for 88% of patterns Conclusion Sierra SARS-CoV-2 leverages previously published data to improve the quality control of submitted viral genomic data and to provide functional annotation on the impact of mutations in the targets of antiviral SARS-CoV-2 therapy. The program can be found at https://covdb.stanford.edu/sierra/sars2/ and its source code at https://github.com/hivdb/sierra-sars2.


Introduction
SARS-CoV-2 sequencing is performed for surveillance as well as research and clinical purposes. The extent of sequencing for clinical purposes may increase as more SARS-CoV-2 inhibitors become available, particularly if resistance to these inhibitors arises. Although most laboratories are capable of performing full-genome SARS-CoV-2 sequencing employing established laboratory and sequence analysis protocols [1][2][3][4][5][6][7][8], many are unable to assess sequence quality, select appropriate mutation-detection thresholds, or report the potential clinical significance of SARS-CoV-2 mutations in the targets of antiviral therapy.
We previously briefly described a sequence analysis program called Sierra SARS-CoV-2 in a paper on the Stanford Coronavirus Antiviral Resistance Database (CoV-RDB) [9]. The program utilizes the same codebase as Sierra HIV [10,11], the Stanford HIV Drug Resistance Database sequence analysis program [10,11]. The program accepts three types of input: FASTQ files containing short reads from a deep sequencing instrument, FASTA sequences, and lists of Spike amino acid mutations.
To assess the performance of Sierra SARS-CoV-2 on FASTQ files, we applied it to two sets of sequences from the NCBI Sequence Read Archive (SRA) [12] and to sequences from a clinical laboratory. To assess its performance on consensus FASTA sequences, we applied it to 963,237 SARS-CoV-2 genome sequences from GISAID [13]. To assess its performance interpreting Spike mutations, we applied it to 13,578 distinct Spike receptor binding domain (RBD) amino acid mutation patterns from approximately 4.7 million SARS-CoV-2 GISAID sequences.

Methods
Sierra SARS2-CoV-2 provides native support for FASTA sequences and lists of mutations, defined as amino acid differences from the Wuhan-Hu-1 reference sequence (GenBank accession NC_045512.2). Support for FASTQ files is provided through an auxiliary pipeline that converts FASTQ files to comma-delimited files containing the frequency of each codon at each genomic position, i.e., codon frequency (CodFreq) files [9,11]. Table 1 summarizes SARS-CoV-2 output depending on whether CodFreq files, FASTA sequences, or mutation lists are submitted. Fig. 1 illustrates the workflow of Sierra SARS2-CoV-2 for all three input types.

Generation of CodFreq files and consensus sequences
CodFreq files contain seven columns: gene, amino acid position, number of reads at a position, codon, number of reads for a codon, amino acid, and proportion of reads for a codon. For our application, the CodFreq format has several advantages over the commonly used variant call format (VCF) because CodFreq files can be interpreted without a reference sequence and used independently from the accompanying SAM/BAM file. CodFreq files can be used to generate a consensus FASTA sequence containing mixtures of codons above a user-specified threshold.
The CodFreq pipeline can be run on batched sequences using the Sierra SARS-CoV-2 frontend or locally using a pre-built Docker image. A shell script is provided on GitHub for running the CodFreq pipeline from a local host (https://github.com/hivdb/codfreq). The frontend identifies paired-end files and prompts users to confirm the pairing. An advanced option is provided for users submitting primer information. The pipeline reports progress for each backend task.
The CodFreq pipeline includes the following steps (Supplementary Figure): (1) The Fastp program trims adapters, removes regions with low phred scores, and stitches paired reads; (2) MiniMap2 aligns FASTQ sequence reads to the reference sequence [14]; (3) Samtools converts the resulting SAM text file into a binary BAM file and a BAI index file [15]; (4) PySam reads the BAM file to determine the frequency of each codon at each position; and (5) PostAlign, a program we created, adjusts the placement of indels through a codon-aware process (https://github. com/hivdb/post-align). Depending on user input, the programs Cutadapt or iVar are used to trim SARS-CoV-2 primers [16,17]. The CodFreq and BAM files are provided for users to download.

Identification of amino acid mutations and lineage assignment
Minimap2 and PostAlign are used to analyze FASTA sequences. Minimap2 aligns a query sequence to the reference sequence and saves the alignment in Pairwise mApping Format (PAF) files, which are then converted into pairwise nucleotide alignments. PostAlign adjusts indels using a codon aware process and position-specific gap scores to increase consistency of indel placement in accordance with alignments of established SARS-CoV-2 variants. PostAlign also separates alignments into discrete genes, identifies mutations, and numbers them by gene. If complete genomes are submitted, the Pangolin program is used to assign the PANGO lineage [18].

Report generation and mutation annotation
The Sierra SARS-CoV-2 report contains sections summarizing sequence and mutation data (Supplementary File). The sequence summary reports the genes present in a sequence, areas in which sequence data are missing, the consensus sequence, and the assigned PANGO lineage. It contains a figure plotting read coverage along the sequence and read depths for Mpro, RdRp, and Spike genes. Dropdown menus enable users to interactively adjust the minimum number and proportion of reads for reporting non-consensus mutations.
Each mutation in a sequence is annotated with the following information: (1) The proportion and number of reads containing the mutation; (2) Whether the mutation is unusual, defined as having a global prevalence below 0.01% based on the open-source sequence analysis pipeline created by the Kosakovsky Pond laboratory [19]; (3) Whether the mutation is an mAb resistance mutation defined as a Spike mutation associated with reduced susceptibility to one or more clinical-stage mAbs; (4) Whether the mutation is a potential RdRp or Mpro resistance mutation; and (5) Comments for the most well studied Spike mutations associated with reduced mAb susceptibility and for most mutations associated with Mpro and RdRp inhibitor reduced susceptibility. The lists of mAb and potential RdRp and Mpro resistance mutations are updated monthly.
Each list of Spike mutations is also used to interrogate CoV-RDB for published mAb, convalescent plasma, and vaccinee plasma susceptibility data. There is an option to display just data from variants with exactly matching sets of mutations versus comprehensive results with data for variants with a subset or superset of the submitted mutations.

Generating a list of mAb resistance mutations
mAb resistance mutations were defined as Spike mutations with a median ≥5 fold reduction in susceptibility compared with wildtype according to CoV-RDB and/or having an escape fraction ≥0.1 in the deep mutational scanning (DMS) platform developed by the Bloom Laboratory at the Fred Hutchinson Cancer Research Center [20,21]. As of September 2022, there were 488 spike mutations meeting these criteria. Fig. 2 illustrates the 160 RBD-associated mAb-resistance mutations having a prevalence ≥0.0001% with data on their neutralizing antibody susceptibilities, DMS escape fractions, and whether they were selected in vitro and/or in vivo.

Generating a list of mutations associated with potential small molecule inhibitor resistance
Mpro and RdRp mutations were classified as potential drugresistance mutations if they met one of the following three criteria: (1) they were associated with 2.5-fold or higher reductions in susceptibility in either a biochemical assay or in cell culture; (2) they were selected during an in vitro passage experiment; or (3) they were selected in a person receiving an Mpro or RdRp inhibitor. Fig. 3 illustrates that as of September 2022, 42 mutations at 28 positions were reported to be possibly associated with reduced susceptibility to Mpro inhibitors Table 1 Overview of the Sierra SARS2-CoV-2 Analysis Report.  Sierra SARS-CoV-2 work flow for handling FASTQ files, FASTA files, and lists of SARS-CoV-2 mutations. Sierra provides native support for FASTA sequences and mutation lists. Support for FASTQ files is provided through an auxiliary pipeline that converts FASTQ files into CSV files containing the frequency of each codon at each position in a genome. The workflow for the auxiliary pipeline is shown in Supplementary Figure. The Supplementary File shows an example of the HTML output. Fig. 2. SARS-CoV-2 Spike RBD mAb-resistance mutations. The mAb-resistance mutations shown met one or more of the following criteria: (1) having a ≥5-fold reduction in susceptibility to a clinical stage mAb; (2) having a DMS escape fraction ≥0.1 and having a global prevalence >0.001%; (3) having been selected in vitro by an mAb; or (4) having been selected in vivo in a patient receiving an mAb or experiencing prolonged infection. A dark blue cell indicates a ≥25-fold reduction in susceptibility; a light blue cell indicates a 5-25-fold reduction in susceptibility; a white cell indicates a <5-fold reduction in susceptibility; and a gray cell indicates the absence of susceptibility data. Cells with a plus (+) symbol indicates that the mutation had a DMS escape fraction ≥0.1. Bold mutations with a yellow background represent the consensus for one or more variants of concern or of interest. The numbers in the "in vivo" column indicate the numbers of times the mutation was selected in vivo during prolonged infection or in a patient receiving an mAb. The numbers in the "in vitro" column indicate the number of times the mutation was reported to be selected during passage in the presence of an mAb.  27.8% as Omicron variants, 11.6% as Alpha variants, and 7.8% as other variants. For the SRA sequences, we used the parameter -skip-technical to exclude adapters, primers, and bar-codes from the downloaded FASTQ file. The SUH sequences were generated using a recently published pipeline [44].

FASTA files
On March 25, 2022, a random set of 963,237 FASTA files was selected from 9,632,370 GISAID sequences [45].

Mutation data
The global prevalence of each Spike, Mpro and RdRp mutation was obtained from a publicly available quality controlled analysis pipeline created by the Kosakovsky Pond laboratory that contained 4,740,761 Spike, 5,328,735 Mpro, and 5,076,452 RdRp sequences containing 201,167 Spike, 5,404, Mpro and 32,788 RdRp distinct mutation patterns [46,47].

FASTQ files
To evaluate the CodFreq pipeline using FASTQ files, we tested the 200 NCBI SRA Illumina files, the 200 NCBI SRA ONT files and the 200 SUH Illumina files. For the 400 Illumina and 200 ONT sequences, we compared the consensus codon of each CodFreq file to the codon in the consensus FASTA file generated by the EDGE COVID-19 program (version 20220314). For both pipelines, a codon-level read depth ≥5 and a mutation-detection threshold of 50% were used.

Illumina sequences
For regions successfully aligned by both Sierra and EDGE, each program detected a mean 11.0, 1.7, and 0.19 amino acid mutations per sequence in Spike, RdRp, and Mpro. Of the 4,413 Spike mutations detected by either program, Sierra and EDGE detected the same mutation in 98.9% of cases; 0.7% were detected only by Sierra and 0.3% only by EDGE. Of 760 RdRp and Mpro mutations, Sierra and EDGE detected the same mutation in 98.4% of cases; 1.5% were detected only by Sierra and 0.1% only by EDGE. The 59 discordances in the three genes resulted from small differences in the threshold at which mutations were detected (n=49) and in placement of indels (n=10).

ONT sequences
For regions successfully aligned by both Sierra and EDGE, Sierra detected a mean 19.0, 1.7, and 0.48 mutations and EDGE detected a  [34,35], which are associated with reduced susceptibility to nirmatrelvir (NTV) or ensitrelvir (ENS) either biochemically or in cell culture, which have been selected in vitro, the effect of mutations on Mpro fitness determined either biochemically or in cell culture, and the global mutation prevalence as of June 2022. For RdRp, the figure shows which mutations reduced susceptibility to remdesivir (RDV), which have been selected by RDV in vitro and in vivo, and the global mutation prevalence as of June 2022. A dark blue cell indicates ≥10-fold reduction in susceptibility; a light blue cell indicates 5-10-fold reduction; a very light blue cell indicates a 2.5-5-fold reduction; and a white cell indicates a <2.5-fold reduction. A gray cell indicates the absence of susceptibility data. *G15S is the consensus amino acid for the Lambda variant. †E166V has been reported in three persons receiving nirmatrelvir in the EPIC-HR study [22]. §Variable reductions in susceptibility were reported for this mutation in different studies. For RDV, S759A was evaluated only in combination with V792I; F480L and F557L were evaluated only in combination with each other. mean 18.7, 1.7, and 0.49 mutations per sequence in Spike, RdRp, and Mpro. Of the 3,855 Spike mutations detected by either program, Sierra and EDGE detected the same mutation in 94.3% of cases; 3.8% were detected only by Sierra and 2.0% only by EDGE. Of 448 detected RdRp and Mpro mutations, Sierra and EDGE detected the same mutation in 96.9% of cases; 2.0% were detected only by Sierra and 1.1% only by EDGE. The 235 discordances in the three genes resulted from small differences in the threshold at which mutations were detected (n=174) and in the placement of indels (n=61).

FASTA files
We compared the Spike, Mpro, and RdRp mutation lists generated by Minimap2 and PostAlign with the GISAID "AA substitutions" metadata, generated by the CoVSurver program [45] for 963,237 FASTA sequences. The list of mutations for Spike, Mpro, and RdRp genes identified by Sierra and GISAID were identical for 99.4%, 99.9% and 99.5% of sequences, respectively. However, there were differences in the placement of indels for Spike. Nearly all Spike differences were caused by indels at several positions, such as the Omicron BA.1 N-terminal domain deletion that has alternatively been placed at position 211 [48,49] or 212 [50][51][52]. The non-indel differences resulted from how mutations in regions surrounding missing sequence data were handled.  Fig. 5A-C show the number of unusual mutations per sequence in Spike, RdRP, and Mpro. In Spike, 92.9% sequences had no unusual mutations, 6.7% had one, 0.4% had two, and <0.1% had three or more unusual mutations. In RdRp, 96.1% sequences had no unusual mutation, 3.8% had one and 0.1% had two or more unusual mutations. In Mpro, 99.2% sequences had no unusual mutation, 0.7% had one and 0.1% had two or more unusual mutations. Fig. 6 shows the numbers of usual and unusual Spike mutations at different mutation thresholds in the 200 NCBI Illumina and 200 NCBI ONT sequences. At mutation detection thresholds <50%, there was a markedly higher proportion of unusual mutations in ONT compared with Illumina sequences.

Neutralizing susceptibility data in CoV-RDB for submitted RBD mutation patterns
The Spike mutation pattern dataset contained 13,578 distinct patterns of Spike RBD mutations. Each RBD mutation pattern was submitted to Sierra to determine the frequency for which complete or partial neutralizing susceptibility data was available in CoV-RDB [9] (Fig. 7). For 76.7% of sequences (1.3% of patterns), CoV-RDB contained data exactly matching the submitted mutation pattern. For 10.2% of sequences (86.6% of patterns), CoV-RDB contained data partially matching the submitted mutation pattern (i.e., CoV-RDB contained data for mutation patterns representing a subset, superset, or intersecting set of the mutations in the submitted mutation pattern). For 13.0% of sequences (12.0% of patterns), CoV-RDB contained no data matching the pattern of submitted mutations.

Discussion
Sierra SARS-CoV-2 is an open-source web-based program that accepts FASTQ and FASTA files and lists of Spike mutations. Depending on the nature of the input data, it generates a consensus nucleotide sequence, assigns a sequence lineage, identifies amino acid mutations, and uses the mutations to interrogate a quality-controlled sequence analysis pipeline for global mutation prevalence data and CoV-RDB for data on SARS-CoV-2 susceptibility to antiviral agents and to plasma from previously infected and/or vaccinated persons.
We assessed the performance of Sierra SARS-CoV-2 using 600 FASTQ datasets, nearly one million FASTA sequences, and approximately 13,500 distinct Spike RBD mutation patterns. In the analysis of FASTQ sequences, Sierra SARS-CoV-2 and EDGE COVID-19 were highly concordant and in the analysis of FASTA sequences, Sierra SARS-CoV-2 and the GISAID mutation list were highly concordant. For both analyses, most discordances resulted from equally acceptable placements of several commonly occurring indels. An analysis of approximately 13,500 distinct Spike RBD mutation patterns, showed that exactly or partially matching annotation data were available for 88% of reported mutation patterns.
Sierra SARS-CoV-2 uses mutation prevalence data to identify sequences with an unexpectedly large number of unusual mutations. Indeed, only 0.1% of quality-controlled Spike sequences had three or more unusual mutations and only 0.1% of quality-controlled Mpro and RdRp sequences had two or more unusual mutations. Therefore, the presence of many unusual mutations in a sequence suggests the possibility of sequence artifact or possibly, although less likely, a novel variant.
Sierra SARS-CoV-2 uses published data to identify mutations potentially associated with reduced antiviral susceptibility. Although few major SARS-CoV-2 lineages circulate at any time, an increasing number of Omicron sub-variants containing different spike mutation patterns are now reported in many regions [53]. Therefore, a sequence analysis program that provides susceptibility data for mutation patterns, as well as for variants of concern has become increasingly relevant. Additionally, an increasing number of Mpro mutations associated with reduced nirmatrelvir susceptibility have been identified in vitro, although few have been reported in persons receiving nirmatrelvir.
In conclusion, Sierra SARS-CoV-2 is one of a few open-source analytic pipelines actively maintained and available through a web interface [3,6,7]. It uniquely leverages published data to improve the quality control of submitted viral genomic data and to provide functional annotation on the impact of mutations in the targets of antiviral therapy.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Fig. 7. Availability of neutralizing susceptibility data in CoV-RDB for submitted sets of Spike receptor binding domain (RBD) mutations. The 13,578 unique patterns of RBD mutations, present in 4,740,761 sequences, were submitted to Sierra SARS-CoV-2. Exactly matching susceptibility data were available for 183 mutation patterns (1.3% of mutation patterns derived from 76.7% of sequences). Partially matching susceptibility data were available for 11,760 patterns (86.6% of patterns from 10.2% of sequences) including cases for which CoV-RDB contained data for a subset, superset, or intersecting set of mutation patterns. No matching susceptibility data were available for 1,635 mutation patterns (12.0% of patterns from 13.0% of sequences). Each of the five tables contain examples of the five scenarios: exact match, subset, superset, intersection, and no match with one column showing the submitted mutation pattern, another showing the closest CoV-RDB pattern, and the third showing the number of sequences (except for the tables showing the patterns that contained an exact match or no match in CoV-RDB).