SpikeSeq: A rapid, cost efficient and simple method to identify SARS-CoV-2 variants of concern by Sanger sequencing part of the spike protein gene

In 2020, the novel coronavirus, SARS-CoV-2, caused a pandemic, which is still raging at the time of writing this. Here, we present results from SpikeSeq, the first published Sanger sequencing-based method for the detection of Variants of Concern (VOC) and key mutations, using a 1 kb amplicon from the recognized ARTIC Network primers. The proposed setup relies entirely on materials and methods already in use in diagnostic RT-qPCR labs and on existing commercial infrastructure offering sequencing services. For data analysis, we provide an automated, open source, and browser-based mutation calling software (https://github.com/kblin/covid-spike-classification, https://ssi.biolib.com/covid-spike-classification). We validated the setup on 195 SARS-CoV-2 positive samples, and we were able to profile 85% of RT-qPCR positive samples, where the last 15% largely stemmed from samples with low viral count. We compared the SpikeSeq results to WGS results. SpikeSeq has been used as the primary variant identification tool on > 10.000 SARS-CoV-2 positive clinical samples during 2021. At approximately 4€ per sample in material cost, minimal hands-on time, little data handling, and a short turnaround time, the setup is simple enough to be implemented in any SARS-CoV-2 RT-qPCR diagnostic lab. Our protocol provides results that can be used to choose antibodies in a clinical setting and for the tracking and surveillance of all positive samples for new variants and known ones such as Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1) Delta (B.1.617.2), Omicron BA.1(B.1.1.529), BA.2, BA.4/5, BA.2.75.x, and many more, as of October 2022.


Introduction
The pandemic caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a crisis faced by nearly every country in the world at the time of writing this article. Throughout 2020 and 2021, the spread of SARS-CoV-2 caused millions of deaths, and more than 600 million people have been diagnosed with COVID-19, primarily with Real-Time quantitative Polymerase Chain Reaction (RT-qPCR) diagnostic tests (Wang et al., 2021). Variants of the virus are identified by whole genome sequencing (WGS) based mutation calls and compared to the originally published genome (Wu et al., 2020) or by variant specific qPCR tests (Vogels, 2021;Vogels et al., 2020;Spiess et al., 2021). WGS of SARS-CoV-2 has led to the identification of several VOCs of SARS-CoV-2, such as the Pango (Rambaut et al., 2020) Fig. 1 shows the mutations in the SpikeSeq amplicon and that all the shown variants can be distinguished from each other using only the SpikeSeq amplicon. Monitoring VOCs makes it possible to tailor the societal response for maximal containment of SARS-CoV-2 at the lowest cost. While WGS is crucial for surveillance work, it is not optimal for near real-time contact tracing of VOCs, as it is it is expensive, slow and requires a centralized setup and a high degree of technical and bioinformatics expertise. In a clinical setting, clear and rapid differentiation between variants can be used for selecting therapeutic antibodies, e.g. Casirivimab/Imdevimab for Delta infected patients and retaining Sotrovimab for Omicron infected patients (Li and Gandhi, 2022). The clinical laboratories around the world can therefore benefit greatly by using our low cost SpikeSeq for routine screening of COVID-19 positive patients. SpikeSeq is a cost efficient, rapid, and simple method for SARS-CoV-2 variant mutation typing using no additional equipment or reagents to that already in use for diagnostic RT-qPCR SARS-CoV-2 testing (Fig. 2). At a cost of approximately 4€ per reaction in materials, and approximately one and a half hours of hands-on time per 96 samples (Fig. 2), this system can readily be implemented for large scale typing of all positive samples, without the use of specialized equipment or personnel. The material cost is much cheaper than the cost of WGS and does not increase in price per sample if only a few samples are run at a time (Quick, 2020). SpikeSeq consists of an RT-PCR with a single ARTIC Network primer set, using the same enzyme mix used in diagnostic COVID-19 RT-qPCR tests, and using Sanger sequencing to sequence a 1001 bp amplicon located in the spike gene. Sanger sequencing as a technology is more than 40 years old (Sanger et al., 1977), and a large existing commercial infrastructure offers Sanger sequencing services. We provide a bioinformatics tool that enables automated basecalling, mapping and calling of mutations from Sanger electropherogram files (.ab1), fasta or fastq files. The results are reported in a user-friendly table as well as in fastq format, available as a command line interface from https://github.com/kblin/covid-spike-cl assification or in a browser-based app running locally from https://ssi. biolib.com/app/covid-spike-classification/run without data leaving the computer. SpikeSeq has been the primary variant profiling tool for all positive samples from the test facility at the Technical University of Denmark (DTU) and used by several other Danish hospitals, totaling > 10.000 samples through 2021.

Input RNA
The RNA from the diagnostic SARS-CoV-2 RT-qPCR facility at the Centre for Diagnostics at DTU was obtained from oropharyngeal swabs with specimens transferred immediately into a tube containing 3 M Guanidine Thiocyanate for instant lysis of viral particles. This ensures stability of RNA during transportation and during the process of transferring patient material from sampling tube to plate format. RNAdvance Viral Reagent Kit -Built on SPRI (Solid Phase Reversible Immobilization) bead-based technology (#C63510 Bechman Coulter, IN, USA), was used for the purification of input RNA. RNA was frozen for 1-2 days at − 20 C • prior to being used for SpikeSeq.

RT-PCR and sequencing reaction
The SpikeSeq Reverse Transcription Polymerase Chain Reaction (RT-PCR) was set up in 20 μl RT-PCR reactions using 10 μl One Step Pri-meScript III (Takara, Shimogyō-ku, Kyoto), 0.4 μM forward primer (Table 1), 0.4 μM reverse primer, and 5 μl purified template RNA per reaction (Table 1). RT-PCR was performed with the following program: reverse transcription for 5 min at 52 • C, then hot start polymerase activation at 95 • C for 10 s, followed by 45 cycles of 95 • C for 5 s, 58 • C for 30 s, and 72 • C for 1 min, followed by 5 min at 72 • C. For each Sanger reaction, 1.5 μl unpurified RT-PCR product was diluted to 18.5 μl containing 1.2 µM of the nCoV-2019_76_LEFT_alt3 primer. The Sanger sequencing was performed by Eurofins Genomics (Eurofins, Luxembourg, Luxembourg).

WGS
To compare the SpikeSeq mutation calls to WGS mutation calls, 29 samples with SpikeSeq data were subjected to WGS. The illumina based WGS method used at Rigshospitalet for the samples of this study has been described previously (Andersen et al., 2021). Sequence reads were trimmed using BBDuk (Bushnell et al., 2017) v.36.49 and aligned to Wuhan-Hu-1 reference genome (GenBank MN908947.3) using Fig. 2. Workflow of SpikeSeq typing of SARS-CoV-2 variant mutations. Time estimates are based on a single 96 well PCR plate and no automation. The total hands-on time is one hour and 30 min, and only requires the set up of an RT-PCR from already extracted RNA from a diagnostics lab followed by transfer of 1.5 μl of the product to a new 96 well PCR plate. The plate is subsequently sent for Sanger sequencing at a commercial provider. The total cost per sample including plastware, enzymes, transport, and sequencing is approx. 4 €.
After the raw data is received, data analysis takes less than five minutes for 96 samples. If Sanger sequencing is available in-house, the time to result could be less than six hours. Overall, Sanger sequencing data from 85% of samples were of sufficient quality for mapping. The non-template control is not depicted, as it did not produce a Sanger read that mapped to the reference genome. * : Ct values stem from a 2-step PCR where the first 7 cycles are not fluorescence registered, possibly meaning that the Sanger sequencing assay is even more sensitive than suggested by this figure.
Minimap2 (Li, 2018) v.2.17. Aligned reads were sorted using Samtools (Li et al., 2009)  From the assembled WGS consensus sequence, the amplicon sequence used for SpikeSeq was extracted and subjected to the same analysis as the Sanger data, using the -fasta switch in https://github.com/kbl in/covid-spike-classification.

Results & discussion
In RT-qPCR SARS-CoV-2 tests, the relative number of viral copies in a sample is estimated by the threshold cycle (Ct) value, where a higher value indicates fewer viral copies. A batch of 195 SARS-CoV-2 positive samples was used to investigate the relationship between viral load and the ability of SpikeSeq to generate mutation calls. In Fig. 3, the SpikeSeq results of the 195 samples have been binned according to the diagnostic RT-qPCR Ct value for both of the targets (N1 and N2). Overall, the success rate of SpikeSeq was 85% (166/195 Sanger reads mapped to the correct position of the SARS-CoV-2 spike gene). The difference in sensitivity observed between N1 and N2 reflects that the Ct for the two RT-qPCR targets of SARS-CoV-2 can be different, and sometimes one target is not recorded (n = 8 for N1 and n = 4 for N2). Of note is that unpurified RT-PCR product was sequenced, significantly reducing the workload involved in mutation profiling. The fraction of samples with Ct < 30 with a mappable result from Sanger sequencing is approximately 90% (Fig. 3). For samples where the Sanger read successfully mapped, 99.1% of positions had corresponding sequence information available (4770 out of 4814 positions, CSC version 0.6.4, 166 samples mapped, 29 positions, counting the positions with multiple mutations once only, see Table S1). The positions without sequence information were mainly from the ends of the sequence, as is common in Sanger sequencing data. As expected, the "No Template Control" did not yield a result. Note that the initial seven cycles in the diagnostic RT-qPCR are not fluorescence recorded and therefore do not count towards the final Ct value, which means that the sensitivity could be even greater than reported. Five of the 166 sequences had the combination N501Y, A570D, P681H, and T716I mutations, which strongly suggests infection with the B.1.1.7 lineage (Fig. 1, Table S1) based on the known circulating variants at the time. This key result shows that at least 85% of SARS-CoV-2 positive sample scan be profiled using SpikeSeq, which is comparable to WGS success rates (Baker et al., 2021).
To verify SpikeSeq mutation calls, we focused on 29 samples where both SpikeSeq and whole genome consensus sequences were available. The sample set included 7 B. SpikeSeq has been used as the primary variant identification tool at DTU Diagnostics, where it detected the first case of Gamma in Denmark in early March 21. In early April 2021, before Delta was officially named, SpikeSeq data was used to identify a Delta infection and alert authorities of the unusual variant, allowing contact tracing to stop further transmission. Both of these variants were later confirmed by WGS (data not shown), and highlights the ability of SpikeSeq to find known and emerging variants.
For samples where the SpikeSeq mutation profile is not a known VOC or nonVOC, we suggest that the provided fastq file is aligned to the reference and manually viewed to identify additional mutations. Then, to gain more information from the mutation profile, we suggest using the fantastic resources available at nextstrain.org/ncov (Hadfield et al., 2018) to filter the GISAID deposited genome sequences to the samples with a similar mutation profile, which will give information on the geographical spread and novelty of the mutation profile.
An Achilles heel of any PCR based assay is mutations in the primer binding regions. This is true for all major SARS-CoV-2 WGS approaches, all variant specific RT-qPCR setups, and SpikeSeq, however the three approaches are differently affected. For WGS, where a combination of 30-100 primer sets is routinely used, dropout of a single amplicon will rarely influence the analysis. For RT-qPCRs on the other hand, mutations both in the primer binding regions and in the probe region can lead to misinterpretation of the results, sometimes even in an undecipherable way. This may potentially lead to wrong mutation calls and hence variant profiling. This was observed during the transition from Alpha to Delta in 2021 with the spike position 452 not only being Leucine (L) or Arginine (R) as expected, but also Glutamine (Q) and Methionine (M) providing false positives/negatives for a common RT-qPCR variant assay in Denmark. For the SpikeSeq method, mutations in the primer binding site could lead to lowered success rate of the assay, but not to wrong mutation calling, as there is either a good quality sequence or not. The primer sites chosen for SpikeSeq turned out to encompass and allow for the detection and distinction of all current and previous VOCs, many of which came to dominate after the assay design. Position K417, which is located in the middle of the forward primer and consequently outside of the SpikeSeq analysis window, is mutated in a number of lineages, such as Beta, Gamma, and all Omicron variants. It does not strongly reduce the primer binding affinity however, since it is not close to the 3 ′ end of the primer (Stadhouders et al., 2010). The mutation of position K417 in some VOCs is the only mutation in the primer sequences in the current and previous VOCs, highlighting the robustness of SpikeSeq. Nevertheless, if an important SARS-CoV-2 variant should emerge with significant mutations in the primer binding sites, the relentless work from the ARTIC network generates, tests, and publishes primer sets, which can be used instead of the original SpikeSeq primer set. In this way, SpikeSeq can be adapted to be used on any conceivable variant, with validated primer sets covering the entire genome of SARS-CoV-2 provided by the ARTIC network. This would be relevant for differentiation of e.g. Omicron BA.4 and Omicron BA.5, two variants where the SpikeSeq amplicon is identical.
SpikeSeq is easy to automate and we estimate that the total price of these simple steps is 4€ per sample in material cost including plastics, enzymes, buffers, shipping and sequencing. This assumes that excess RNA from a diagnostic COVID-19 qPCR facility is available as input for SpikeSeq. This makes the workflow cost efficient, even when compared to multi-target RT-qPCR typing or low cost optimized WGS methods such as CoronaHiT and ARTIC LoCost (€7.5-12 and £6.22-9.75, respectively) (Baker et al., 2021). Furthermore, the SpikeSeq price scales linearly with sample number with no minimum number of samples, as opposed to WGS approaches, where large batches are required to keep the cost down. If Sanger sequencing can be performed in-house the time to result can be less than six hours from positive COVID-19 sample to variant call, which is faster than any high throughput WGS based analysis, with less computational workload and data storage. The advantages of WGS approaches compared to SpikeSeq is the much greater resolution, where it is always possible to assign lineage to a sample. Having the complete genome sequence mean that all mutations can be identified, also any consequential mutations outside of the SpikeSeq analysis window. The cheapest WGS methods offer a relatively low cost of only double the price of SpikeSeq, if the sample volume is sufficiently great and the laboratory infrastructure and analysis infrastructure is readily available.
Globally, a number of scientists have utilized the ubiquitous Sanger Sequencing technology, either to generate complete genomes (Shaibu et al., 2021;Paden et al., 2020;Moniruzzaman et al., 2020), or, in similar efforts to this, to type SARS-CoV-2 variants. These protocols, published at a later date, are targeting the S gene, using a similar approach to SpikeSeq (Daniels et al., 2021;Bezerra et al., 2021). In one (Bezerra et al., 2021), the assay is -similiarly to SpikeSeq-based on ARTIC primers, and the amplicon is 320 nt shorter than the SpikeSeq assay amplicon and sequenced from both the forward and reverse primer. This will hypothetically mean a slightly higher assay success rate, but also mean that a handful of important positions are not included in the analysis window, including H655Y, N659K, P681R/H, A701V, T716I. Surprisingly, the K417 mutations are reported in this study, however the forward primer is placed over the position, which means that K417 cannot be reliably analyzed by the assay. The data analysis in (Bezerra et al., 2021) is based on commercial software, but could easily be adapted to be used with an open source software similarly to SpikeSeq, by simply loading the Sanger electropherogram files.
In (Daniels et al., 2021), it is suggested to use several amplicons to target the spike gene instead of a single amplicon, and while this would slightly increase the resolution of Sanger sequencing based variant identification, it would come with the price of increased complexity in both primer maintenance, laboratory work, data analysis, and cost of the analysis. The analysis software in SpikeSeq is automated, open source, secure, and easy to use for non-bioinformaticians, however neither study similar to SpikeSeq (Daniels et al., 2021;Bezerra et al., 2021) includes software for users of the methods.SpikeSeq uses identical enzymes to those used in COVID-19 diagnostic testing, simplifying the supply chain. Presumably, any RT-PCR enzyme mix will perform well with SpikeSeq. Identifying emerging VOCs requires large-scale tracking of the complete SARS-CoV-2 genome, and we do not suggest substituting the WGS efforts with SpikeSeq. SpikeSeq of all SARS-CoV-2 positive samples can however enable rapid, decentralized mutation typing and association to current variants and mutations of concern and has already proven useful for identifying emerged variants like Alpha, Beta, Gamma, Delta, and Omicron (BA.1,BA.2). The proposed amplicon will potentially continue to cover all mutations in important variants arising in the future. As RT-qPCR requires modification and testing of the assay for each new emerging variant, SpikeSeq has proven to be a more robust method, since no modifications have been needed in the assay to be able to identify all VOCs to date. However, with an established assay, variant specific qPCR will often be faster and potentially cheaper than SpikeSeq if analyzing only a few mutations per sample.

Conclusion
We show that typing of SARS-CoV-2 VOC specific mutations can be performed by SpikeSeq in a cheap and fast way, and we provide an automatic, open source, and upload-free mutation analysis software. Our SpikeSeq protocol is robust as it has been able to detect several VOCs arising after the assay was designed, namely Beta, Gamma, Delta, and Omicron (BA.1, BA.2, BA.4/5, BA.2.75.x), and many non-VOC variants as well. SpikeSeq has the potential to change how SARS-CoV-2 VOC mutations are typed globally, as it is the cheapest and simplest way proposed yet to variant type SARS-CoV-2 positive samples, without the need to constantly modify the assay. We achieved mutation calls from 85% of all samples, which is comparable to the reported WGS success rate. We do not suggest abandoning the use of WGS, as it is important for tracking emerging variants, but SpikeSeq have advantages over both WGS (speed, price, ease, data storage, small batch size) and qPCR (robustness, result resolution). SpikeSeq has been used routinely during the pandemic to type all positive samples at the High Throughput SARS-CoV-2 RT-qPCR testing facility at DTU and at the Copenhagen University Hospital, Rigshospitalet.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Data will be made available on request.