Looking for a needle in a haystack. SARS-CoV-2 variant characterization in sewage

SARS-CoV-2 variants are emerging worldwide, and monitoring them is key in providing early warnings. Here, we summarize the different analytical approaches currently used to study the dissemination of SARS-CoV-2 variants in wastewater and discuss their advantages and disadvantages. We also provide preliminary results of two sensitive and cost-effective approaches: variant-specific reverse transcription-nested PCR assays and a nonvariant-specific amplicon deep sequencing strategy that targets three key regions of the viral spike protein. Next-generation sequencing approaches enable the simultaneous detection of signature mutations of different variants of concern in a single assay and may be the best option to explore the real picture at a particular time. Targeted PCR approaches focused on specific signature mutations will need continuous updating but are sensitive and cost-effective.


Introduction
Wastewater surveillance for SARS-CoV-2 has proved to be useful in monitoring the evolution of the COVID-19 pandemic. However, emerging variants are posing new challenges. The SARS-CoV-2 variants α, β, γ and δ (also known as lineages B.1.1.7, B.1.351, P.1 and B.1.617.2, respectively) were first detected in the United Kingdom, South Africa, Brazil and India, respectively, and were immediately considered to be variants of concern (VOCs). Such variants, which have been associated with the fluctuations seen with the pandemic waves, possess mutations that affect viral infectivity and antigenicity. These mutations are mainly located in the gene encoding the viral spike (S) protein. In particular, mutations leading to the E484K and N501Y substitutions within the receptor-binding domain of the S protein have been demonstrated to give the S protein a greater affinity for the human ACE2 receptor [13]. The commonly applied PCR methods used to quantify the concentration of the virus in environmental samples use specific primers and probes targeting the nucleocapsid (N), envelope (E) or RNA-dependent RNA polymerase (RdRp) regions. However, as stated above, the VOCs and the new variants of interest (VOIs) have most of their signature mutations within the S gene. Figure 1 summarizes the signature mutations identified in each VOC and VOI.
Although the combination of genome sequence analysis of samples from COVID-19 patients with epidemiological datasets has produced reliable assessments of the extent of SARS-CoV-2 transmission in the community [22], the time lag between infection and symptoms and the future decrease in sequencing will add further delays compared to the expected immediacy of the results from wastewater surveillance. At the beginning of October 2020, several new SARS-CoV-2 variants started to circulate globally [7]. At that moment, the minimum number of clinical samples that had to be sequenced to find the α variant was 400, assuming that only 5% of the positive clinical samples had been sequenced and that the prevalence of this VOC in the population was 5% [20]. Thus, the analysis of SARS-CoV-2 genomes sequenced from clinical samples is limited to the fraction of the clinical samples subjected to whole-genome sequencing.
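The figure of 400 samples quoted above can be reproduced with a simple expected-value calculation. The source does not state the formula, so the sketch below is only one reading that matches the number: it assumes that the sequenced fraction and the variant prevalence are independent, and it asks how many positive samples are needed for one sequenced sample to carry the variant on average. The function name `min_positive_samples` is hypothetical.

```python
def min_positive_samples(sequenced_fraction: float, variant_prevalence: float) -> float:
    """Positive samples needed so that, on average, one sequenced sample carries the variant.

    Assumes sequencing is applied independently of which variant a sample contains.
    """
    if not (0 < sequenced_fraction <= 1 and 0 < variant_prevalence <= 1):
        raise ValueError("both arguments must be in the interval (0, 1]")
    return 1.0 / (sequenced_fraction * variant_prevalence)


# 5% of positive samples sequenced, 5% prevalence of the variant.
needed = min_positive_samples(0.05, 0.05)
print(round(needed))  # 400
```

Under this reading, halving either the sequenced fraction or the prevalence doubles the number of samples required.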
Monitoring the circulation of variants in wastewater has its caveats when dealing with mixtures of variants and/or the presence of inhibitors. Although the environmental surveillance of other epidemic viruses (like noroviruses) has been observed to be sensitive in detecting variants [17], the consensus sequences obtained from wastewater samples might lead to artificial genomes that do not represent an existing virus. However, SNPs can be linked to particular variant clusters or clades and give information about SARS-CoV-2 variants circulating in a region [15]. Thus, the study of the viral RNA sequences found in wastewater is important to understand viral transmission patterns and to establish an alert system for new SARS-CoV-2 variants.
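The risk of artificial consensus genomes can be illustrated with a toy example. The sketch below is not taken from any of the cited studies: the two variants, their alleles and the read counts are hypothetical, and the majority-vote `consensus` function stands in for a real consensus caller. It shows that when two variants are mixed and each dominates the reads at a different site, the per-site majority call matches neither of them.

```python
from collections import Counter


def consensus(reads_by_site):
    """Majority allele at each site; each site is a list of alleles observed in reads."""
    return [Counter(alleles).most_common(1)[0][0] for alleles in reads_by_site]


# Hypothetical mixture: variant X carries "A" at site 1 and "G" at site 2,
# variant Y carries "T" at site 1 and "C" at site 2.
# Uneven amplification lets X dominate the reads at site 1 and Y at site 2.
site1 = ["A"] * 70 + ["T"] * 30
site2 = ["G"] * 40 + ["C"] * 60

print(consensus([site1, site2]))  # ['A', 'C'], which matches neither X nor Y
```

This is why single mutations, or mutations observed together on the same read, are more reliable evidence of a variant in wastewater than a reconstructed whole genome.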

Recent trends in studies on SARS-CoV-2 variants in wastewater samples
A recently published study using the EU Sewage Sentinel System for SARS-CoV-2 provided an extensive report of 'The HERA Incubator' [10], with next-generation sequencing (NGS) information about the diversity of SARS-CoV-2 variants and their associated mutations at the community level. It determined the relative abundance of each VOC based on the abundance of reads associated with certain amino acid mutations [11]. The categorization of the mutations as unique or shared was based on the percentage of the sequences for associated mutations submitted to GISAID.
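A read-based estimate of relative abundance of this kind can be sketched as follows. This is a minimal illustration, not the method of the cited report: it assumes that each variant is represented by mutations unique to it, and that its abundance is approximated by the mean fraction of reads carrying those mutations. The mutation labels, read counts and the function `relative_abundance` are all hypothetical.

```python
def relative_abundance(mutation_reads, total_reads, signatures):
    """Mean read frequency of the unique signature mutations of each variant.

    mutation_reads: reads carrying each mutation
    total_reads:    total reads covering each mutation site
    signatures:     variant name -> list of mutations unique to that variant
    """
    result = {}
    for variant, mutations in signatures.items():
        # Skip sites without coverage to avoid dividing by zero.
        freqs = [
            mutation_reads.get(m, 0) / total_reads[m]
            for m in mutations
            if total_reads.get(m, 0) > 0
        ]
        result[variant] = sum(freqs) / len(freqs) if freqs else None
    return result


# Placeholder labels, not real mutation names.
signatures = {"alpha": ["alpha_sig_1", "alpha_sig_2"], "beta": ["beta_sig_1"]}
reads = {"alpha_sig_1": 250, "alpha_sig_2": 750, "beta_sig_1": 125}
depth = {"alpha_sig_1": 1000, "alpha_sig_2": 1000, "beta_sig_1": 1000}

abundance = relative_abundance(reads, depth, signatures)
print(abundance)  # {'alpha': 0.5, 'beta': 0.125}
```

Mutations shared between variants are excluded here by construction, which is one reason the categorization of mutations as unique or shared matters for this type of estimate.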

Quantitative RT-PCR based approaches
New quantitative reverse transcription PCR (RT-qPCR) protocols targeting specific mutations or deletions have been described to differentiate between SARS-CoV-2 variants. The first multiplex RT-qPCR assay, published in Ref. [26], uses the deletion within the ORF1a gene (which exists in most of the VOCs) and the HV69/70 deletion (present in the α variant) to differentiate this variant from the rest. Other research groups have developed allele-specific RT-qPCRs for the α variant [5,12,19,29] or multiplex assays for specific S protein mutations (L452R, E484K and N501Y) [27]. These RT-qPCR strategies can be used when there is already a high prevalence of the VOC in the community, or in other words, when SARS-CoV-2 RNA levels, measured with assays targeting the N gene, for example, are high. Using the same basis, reverse transcription droplet digital PCR (RT-ddPCR) is an alternative that might be more sensitive and allows the discrimination of closely related sequences [1,6,14].
[Abachin_et_al_2017] designed an RT-ddPCR assay using two different probes to discriminate between wild-type sequences and sequences containing the N501Y signature mutation (present in the α, β, γ and θ variants) in wastewater.
Figure 1. Spike protein mutations that can affect both tropism (receptor binding) and immune evasion and are therefore the focus of surveillance. All mutations indicated are related to the reference sequence (NC_045512). Variants of concern correspond to: α, β, γ and δ. To date (15 July 2021), the rest are variants of interest. Orange ticks indicate deletions and yellow ticks amino acid mutations.
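For a two-probe RT-ddPCR assay such as the one targeting N501Y, the proportion of mutant sequences can be estimated from droplet counts. The cited work does not describe its quantification in this section, so the sketch below relies on the Poisson correction generally used in digital PCR, with hypothetical droplet counts and hypothetical function names.

```python
import math


def copies_per_droplet(positive: int, total: int) -> float:
    """Poisson-corrected mean number of target copies per droplet."""
    if not 0 <= positive < total:
        raise ValueError("need 0 <= positive < total")
    return -math.log(1 - positive / total)


def mutant_fraction(mut_positive: int, wt_positive: int, total: int) -> float:
    """Fraction of mutant copies among mutant plus wild-type copies."""
    mut = copies_per_droplet(mut_positive, total)
    wt = copies_per_droplet(wt_positive, total)
    if mut + wt == 0:
        raise ValueError("no positive droplets for either probe")
    return mut / (mut + wt)


# Hypothetical run: 10,000 droplets, 150 positive for the mutant probe
# and 450 positive for the wild-type probe.
print(f"{mutant_fraction(150, 450, 10_000):.3f}")
```

The Poisson correction accounts for droplets that contain more than one copy; at low occupancy it changes little, and the fraction is close to the ratio of positive droplets.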

Amplicon sequencing based approaches
Reverse transcription-nested PCR (RT-nPCR) assays followed by Sanger sequencing and/or NGS analysis have been published for SARS-CoV-2 characterization. In October 2020, Martin and collaborators designed an RT-nPCR approach followed by Sanger sequencing and NGS analysis of the amplified products from five different regions of the viral genome, which demonstrated changes in the predominance of the virus variants [20]. La Rosa and coworkers [25] adopted a similar approach involving conventional Sanger sequencing of the amplicon but focusing only on key mutations of the S gene, which allowed rapid screening of the SARS-CoV-2 variants [_Rosa_et_al_2021]. Recently, another group from the United Kingdom used two different RT-nPCR assays targeting the RdRP and ORF8b gene regions for diagnostics and two primer sets targeting the S gene regions to discriminate between the α, β and γ variants [28].
Sequencing amplicons using NGS, commonly known as amplicon deep sequencing (ADS), has been applied not only to selected parts of the SARS-CoV-2 genome but also to the whole genome as an informative method for detecting and identifying SARS-CoV-2 variants. Several custom enrichment strategies based on designing primer sets coupled with Illumina-compatible library preparation kits have been used to sequence amplified fragments spanning the whole or near-complete genome of SARS-CoV-2 from environmental samples [2,15,18,20,28]. Other studies have used the open-source ARTIC protocol [3,16,23]. This protocol, released in March 2020 and designed to sequence the virus from clinical samples, uses 98 multiplexing PCR primer pairs to amplify the whole genome of the virus [24]. Similarly, the commercial AmpliSeq SARS-CoV-2 Research Panel (Thermo Fisher Scientific) consists of two pools with amplicons ranging from 125 bp to 275 bp that cover >99% of the SARS-CoV-2 genome and are compatible with either Illumina or Ion Torrent sequencing platforms [2]. Another strategy based on NGS is the use of a commercial oligo-capture approach, like the Illumina Respiratory Virus Oligo Panel (Illumina, Inc.) or the VirCapSeq Enrichment Kit (Roche), which are designed to enrich the sequences of human respiratory or vertebrate viruses, respectively; both have been applied to complex environmental samples prior to massive sequencing [8,21].
Based on the findings of available studies, the most abundant single nucleotide variations (SNVs) that have been identified in wastewater to date correspond to the most abundant SNVs in clinical samples [8]. The identification of an individual or several signature mutations (Figure 1) located in close proximity to one another within the same amplicon can help identify new SNVs in the population being analyzed. When using these approaches in environmental samples containing a mixture of variant sequences, there is a possibility of generating artificial genome reconstructions or artefacts during sequence assembly, which could result in unreliable VOC or VOI assignations.
The ADS of selected regions provides a more robust characterization of genomic variants compared to broader genome reconstructions within individual samples. When applied to clinical samples, long-read sequencing platforms have been proven to be efficient in obtaining highly accurate consensus-level sequences despite the higher error rates [4]. However, to our knowledge, this approach has not been applied in the study of SARS-CoV-2 variants in sewage.

Specific regions for the characterization of SARS-CoV-2 genomic variants
Approaches targeting selected regions of the SARS-CoV-2 genome in which signature mutations are located are of greater interest than the sequencing of other regions that are more conserved and less informative about genomic variants. For discriminating between variants, European authorities have established that sequencing should cover at least the S gene, particularly the part encoding the entire N-terminal region and the receptor-binding domain (RBD), corresponding to amino acids 1 to 541 [9]. Preliminary data obtained from two different approaches developed by our research group are detailed below. These approaches involved specific RT-nPCR assays targeting the signature mutations of the main VOCs and VOIs followed by Sanger sequencing (assays A and B) and an ADS strategy targeting three different regions of the S gene (assays A1, A2 and A3). Both approaches were tested in parallel in samples collected from February to May 2021 from wastewater treatment plants (WWTPs) of different sizes located in Catalonia, northeast Spain. More information about the methodology is provided in the Supplementary Material. The results obtained are summarized in Table 1 and the datasets generated are available in Zenodo under the DOI https://doi.org/10.5281/zenodo.5497909.
Table 1. Summary of SARS-CoV-2 concentrations (GC/L) detected using RT-qPCR and signature mutations detected using RT-nPCR and Sanger sequencing or ADS in a MiSeq platform. ND: not detected.
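When designing primers, the recommended coverage of S protein amino acids 1 to 541 has to be translated into genome coordinates. The sketch below shows that conversion. It assumes that the S coding sequence starts at nucleotide 21563 of the reference genome NC_045512.2, a value taken from the public annotation of that record and not stated in this article; the function names are hypothetical.

```python
from typing import Tuple

# Assumption: first nucleotide of the S coding sequence in NC_045512.2.
S_GENE_START = 21563


def codon_coordinates(aa_position: int, gene_start: int = S_GENE_START) -> Tuple[int, int]:
    """1-based genome coordinates (first, last nucleotide) of one codon."""
    if aa_position < 1:
        raise ValueError("amino acid positions are 1-based")
    first = gene_start + 3 * (aa_position - 1)
    return first, first + 2


def region_coordinates(aa_first: int, aa_last: int, gene_start: int = S_GENE_START) -> Tuple[int, int]:
    """Genome span covering the codons aa_first to aa_last, inclusive."""
    return codon_coordinates(aa_first, gene_start)[0], codon_coordinates(aa_last, gene_start)[1]


print(region_coordinates(1, 541))  # (21563, 23185)
print(codon_coordinates(501))      # (23063, 23065), the codon affected by N501Y
print(codon_coordinates(484))      # (23012, 23014), the codon affected by E484K
```

Under this assumption, amplicons intended to cover the N-terminal region and the RBD need to span roughly 1.6 kb of the genome, which is why the region is usually split across several shorter amplicons.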

Variant study approaches: the pros and cons
Different analytical approaches for the study of SARS-CoV-2 variants in wastewater samples have been developed, each one providing different types of information. In Table 2, the pros and cons of the different methodologies that have been used to date are listed. Depending on their intrinsic properties, a suitable application has been suggested.
RT-qPCR and RT-ddPCR are designed to detect a signature mutation of a particular variant and are the fastest at providing results. Both methodologies are often designed as duplex or multiplex assays, allowing the simultaneous detection of other variants and giving an estimate of their percentages among the co-occurring variants. Thus, they are appropriate for monitoring a specific variant in a region where it has spread and become established, since the target variant must reach a certain proportion relative to the others to be detected. RT-ddPCR might be more sensitive and precise than RT-qPCR, but it is also more expensive [1,6,14] [_Abachin_et_al_2017].
However, wastewater is a complex sample, and it is likely to contain a mixture of variants. In a region where the predominant variant circulating within the population is not clear or where the situation is constantly changing, non-variant-specific methodologies might be more suitable since they do not need continuous updating of the assay. In such cases, RT-nPCR assays followed by Sanger sequencing of specific regions containing signature mutations would be highly informative and would identify the predominant variant circulating in the population, as this type of sequencing gives information about the most abundant sequence amplified. Furthermore, RT-nPCR can use specific primers for a defined mutation that can target specific variants and regions where other mutations may occur. By contrast, if the objective is to perform an accurate characterization of the diversity present in wastewater, or in other words, to identify different variants present in a mixture, NGS analysis would be more appropriate. However, NGS techniques are considered to be expensive, and the extensive information they provide requires exhaustive bioinformatic analysis and expertise.

Conclusions
Monitoring SARS-CoV-2 variants in wastewater is important for epidemiological surveillance in a community. Different analytical approaches have been developed to identify and study the dissemination of SARS-CoV-2 variants in wastewater samples, including RT-qPCR, RT-nPCR, and NGS approaches. Due to their intrinsic nature, each method has pros and cons and provides different types of information that is important to consider when selecting the appropriate method for a specific objective. In a postpandemic scenario, when PCR-based assays and sequencing of clinical samples will decrease, the sequencing of a subset of wastewater samples may be enough to monitor the circulation of different VOCs and VOIs in a community. A representative sample needs to be collected regularly from a certain region to accurately estimate and monitor the prevalence of SARS-CoV-2 variants. Nonvariant-specific techniques may be the best option to explore the real picture of all the circulating variants at a particular time, providing broader information that can contribute to community surveillance. This study provides guidance on available approaches for detecting and identifying circulating SARS-CoV-2 variants considering different scenarios. Further work on the application of massive sequencing of SARS-CoV-2 from environmental samples is needed towards producing longer fragments in order to avoid overlapping and chimera constructions, and also shorter bioinformatic processing for an effective early warning.
Table 2. List of pros and cons of the different methodologies used in the study of SARS-CoV-2 variants in sewage samples.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.