External Quality Assessment of SARS-CoV-2 Sequencing: an ESGMD-SSM Pilot Trial across 15 European Laboratories

ABSTRACT This first pilot trial on external quality assessment (EQA) of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) whole-genome sequencing, initiated by the European Society of Clinical Microbiology and Infectious Diseases (ESCMID) Study Group for Genomic and Molecular Diagnostics (ESGMD) and the Swiss Society for Microbiology (SSM), aims to build a framework between laboratories in order to improve pathogen surveillance sequencing. Ten samples with various viral loads were sent out to 15 clinical laboratories that had free choice of sequencing methods and bioinformatic analyses. The key aspects on which the individual centers were compared were the identification of (i) single nucleotide polymorphisms (SNPs) and indels, (ii) Pango lineages, and (iii) clusters between samples. The participating laboratories used a wide array of methods and analysis pipelines. Most were able to generate whole genomes for all samples. Genomes were sequenced to various depths (up to a 100-fold difference across centers). There was a very good consensus regarding the majority of reporting criteria, but there were a few discrepancies in lineage and cluster assignments. Additionally, there were inconsistencies in variant calling. The main reasons for discrepancies were missing data, bioinformatic choices, and interpretation of data. The pilot EQA was overall a success. It was able to show the high quality of participating laboratories and provide valuable feedback in cases where problems occurred, thereby improving the sequencing setup of laboratories. A larger follow-up EQA should, however, improve on defining the variables and format of the report. Additionally, contamination and/or minority variants should be a further aspect of assessment.

KEYWORDS NGS, external quality assessment, ring trial, whole-genome sequencing

Whole-genome sequencing (WGS) of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) isolates has been used in many countries mainly to determine (i) specific viral lineages and (ii) the molecular epidemiological context. WGS will become increasingly important both as a typing technology in virological routine diagnostics of individual patients and for epidemiological surveillance. The European Centre for Disease Prevention and Control (ECDC) recently published a document to support the usage and implementation of WGS of SARS-CoV-2 in European countries (1).
Quality management is a central element for ensuring accurate and robust laboratory results for both routine diagnostic and reference laboratories. Internal and external controls are integral to the assessment of quality, e.g., in an ISO-accredited environment. In particular, external quality assessments (EQAs) represent a cornerstone in introducing new test methods, capacity building, and ensuring a baseline quality level. This is even more important in a pandemic situation, where a novel, previously unknown pathogen necessitates prompt development, validation, and rollout of assays for which microbiological expertise and diagnostic knowledge are limited. In this context, EQAs can ensure and improve testing quality and result comparability. They also allow, if sufficiently scaled, the comparison of the test performances of in-house-developed and commercial assays.
To date, no EQA results have been reported focusing on WGS of SARS-CoV-2, although some publications have shared quality aspects of a single center's experiences (2,3). Along these lines, individual centers in Switzerland have reported protocols on WGS with different epidemiological questions (4,5). In the past, the Swiss Institute of Bioinformatics has coordinated EQAs for viral metagenomics (6) and bacterial typing (7), which is an important first step in the capacity forming of WGS technology between diagnostic laboratories. Many other European countries are following suit.
For this reason, the European Society of Clinical Microbiology and Infectious Diseases (ESCMID) Study Group for Genomic and Molecular Diagnostics (ESGMD) and the Swiss Society of Microbiology (SSM) aimed to conduct a first EQA pilot trial on SARS-CoV-2 WGS, addressing three key aspects of genome analysis: (i) identification of single nucleotide polymorphisms (SNPs) and deletions, (ii) identification of Pango lineages (8), and (iii) assessing genomic relatedness using a molecular epidemiological approach.
The aim is to exchange knowledge and build a framework between diagnostic laboratories in order to improve quality for the continuing demands for high-quality genomes to address epidemiological questions during an ongoing pandemic.

MATERIALS AND METHODS
Design of the external quality assessment. The EQA was designed such that each laboratory could choose its own sequencing method as well as bioinformatic analysis. This introduces variability and makes disentangling methodological effects more difficult but best reflects clinical reality. Moreover, it provides direct feedback to laboratories concerning their sequencing pipeline.
An overview of the individual analysis pipelines is shown in Table 1, and a full description can be found in the supplemental material.
The desired key aspects for the EQA (SNPs/indels, Pango lineage assignment, and cluster assignment) as well as additional features such as read depth and percentage of missing data were reported back to the sequencing team at the University Hospital Basel (coordinating center for this pilot study).
Samples. Large quantities of virus suspensions were needed for the EQA. For this reason, it was decided to culture the virus to generate enough material. Vero76 cells were grown in Dulbecco's modified Eagle's medium (DMEM) (10% fetal bovine serum, 1% glutamine) in flat-bottom 96-well plates (Thermo Fisher Scientific, MA, USA). One hundred microliters of SARS-CoV-2-positive naso-oropharyngeal fluids was added, and cells were incubated for 48 h at 37°C. The cell culture supernatants were harvested, and SARS-CoV-2 RNA was quantified using the laboratory-developed Basel-SCoV2-112bp nucleic acid test (NAT), as described previously (9), targeting specific viral sequences of the spike glycoprotein S gene.
A total of 10 samples (named NGS1 to -10) of the cell culture supernatants were frozen and shipped on dry ice to participating laboratories. The viral isolates originated from routine diagnostic samples from Clinical Virology, University Hospital Basel, reflecting diverse epidemiological backgrounds. The cell culture supernatants used contained a range of viral loads of SARS-CoV-2, reflecting viral loads typically observed in routine diagnostics of acutely ill coronavirus disease 2019 (COVID-19) patients (see Table S1 in the supplemental material). To ensure that no changes occurred during culture, both primary material and the cell culture supernatant were sequenced and compared; the resulting sequences were identical (results not shown).
Assessment of variant calling. SNPs, compared to the reference Wuhan-Hu-1 strain, were assessed as reported (usually in the form of a list of variants). In order to compare results across centers and samples, a score was developed. As there is no "correct solution" to compare results against, a majority consensus approach was chosen; i.e., an SNP/indel was considered correct if the majority of laboratories detected it (ignoring missing data). If the correct base was called, a score of 1 was given per site. Incorrect base calls were scored as −1; missing data received a score of 0. If an ambiguous base was called where a true SNP occurred and the correct base was included in the ambiguity code (IUPAC), a score of 0.5 was given. Otherwise, reported ambiguous sites were not counted as SNPs. In the case of deletions that were present but not reported, we chose to set the score to −1, given that centers were instructed to report deletions and that a failure to report could be an artifact of the bioinformatics pipeline. The score was finally normalized per sample by the number of correct SNPs.
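The scoring scheme above can be illustrated with a minimal sketch. This is illustrative code, not the authors' pipeline; the sample positions, base calls, laboratory names, and the IUPAC subset are invented for the example:

```python
# Sketch of the majority-consensus variant-calling score described in the text:
# correct call = +1, incorrect call = -1, missing data = 0, IUPAC-ambiguous
# call containing the true base = +0.5; normalized by the number of consensus SNPs.
from collections import Counter

# Subset of the IUPAC nucleotide ambiguity codes used for partial credit.
IUPAC = {"R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC", "N": "ACGT"}

def consensus_truth(calls_per_lab):
    """Majority base per site across laboratories, ignoring missing data ('.')."""
    truth = {}
    sites = {pos for calls in calls_per_lab.values() for pos in calls}
    for pos in sites:
        votes = Counter(c[pos] for c in calls_per_lab.values() if c.get(pos, ".") != ".")
        if votes:
            truth[pos] = votes.most_common(1)[0][0]
    return truth

def score_lab(calls, truth):
    score = 0.0
    for pos, true_base in truth.items():
        called = calls.get(pos, ".")
        if called == ".":
            score += 0.0            # missing data
        elif called == true_base:
            score += 1.0            # correct base call
        elif true_base in IUPAC.get(called, ""):
            score += 0.5            # ambiguity code includes the true base
        else:
            score -= 1.0            # incorrect call (or unreported deletion)
    return score / len(truth)       # normalize by number of consensus SNPs

# Invented example: three labs, three variant positions.
labs = {
    "lab1": {241: "T", 3037: "T", 14408: "T"},
    "lab2": {241: "T", 3037: "Y", 14408: "T"},   # ambiguous call at 3037
    "lab3": {241: "T", 3037: "T"},               # missing data at 14408
}
truth = consensus_truth(labs)
print(score_lab(labs["lab2"], truth))
```

Under this rule, lab2 scores (1 + 0.5 + 1)/3 for its one ambiguous call, and lab3's missing site costs it one point of the maximum rather than being penalized as an error.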
Assessment of lineage and cluster assignment.
The "correct answer" was again assumed to be the majority consensus. Clusters were relabeled to unify the nomenclature and allow comparison between laboratories. We did not provide a strict definition of a cluster but allowed laboratories to determine clusters based on internal criteria. In addition, no classical epidemiological metadata that might have aided interpretation were provided.

RESULTS
Genome depth, coverage, and assembly. The mean read depth per center ranged from 313× to 37,172×, which reflects a >100-fold difference across centers. However, this was mostly driven by center 14, which sequenced to an extremely high read depth (Fig. 1A; see also Table S2 in the supplemental material). Centers 7 and 9 were on the lower end of the spectrum (mean depths ± standard deviations [SD] of 325× ± 275× and 313× ± 132×, respectively), whereas all other laboratories usually sequenced to a mean depth of between 1,000× and 8,000×. The majority of samples could be assembled to a consensus genome by all centers, with the exception of NGS8, for which assembly failed partially for center 7 and completely for center 9, as seen by the percentage of missing data shown in Fig. 1B (numeric values are shown in Table S3).
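The two summary metrics used here, mean depth ± SD and percentage of missing data, can be computed from a per-position depth profile and a consensus sequence roughly as follows (a minimal sketch with invented toy values, not the centers' actual pipelines):

```python
# Sketch of the per-sample summary metrics: mean read depth with standard
# deviation, and percentage of missing data (N or gap characters) in a
# consensus genome. Depth values and the consensus string are toy examples.
import statistics

def depth_summary(depths):
    """Mean and population standard deviation of a per-position depth profile."""
    return statistics.mean(depths), statistics.pstdev(depths)

def pct_missing(consensus):
    """Percentage of ambiguous (N) or gap (-) positions in a consensus sequence."""
    missing = sum(1 for b in consensus.upper() if b in "N-")
    return 100.0 * missing / len(consensus)

depths = [310, 320, 305, 317]          # toy per-position coverage values
mean, sd = depth_summary(depths)
print(f"mean depth {mean:.0f}x +/- {sd:.0f}x")
print(f"{pct_missing('ACGTNNACGT'):.0f}% missing")
```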
SNPs and indels. Variants have been assessed as reported and are displayed in Fig. S1A to J as a dot plot indicating the presence and absence of the variant. Some centers reported mixed sites using ambiguous codes, while others did not. Moreover, not all centers reported deletions. Whether these had been correctly called in the consensus genome was therefore checked for each variation and, if present, specifically marked in Fig. S1. Additionally, Table S5 lists the number of correct, wrong, and missing SNP calls for each sample and laboratory.
A variant calling score was developed in order to quantify and compare the variant calls per sample and laboratory (see Materials and Methods). The results are shown in Fig. 1C (numerical values are shown in Table S4), with average scores per sample across all centers (row marked with ø) also shown as a measure of congruence across laboratories. As expected, samples with a higher proportion of missing data produced a lower score if the affected regions harbored many variations (e.g., NGS3 by center 7, which had a coverage of 91%). Samples NGS7, -9, and -10 had many deletions, and laboratories not reporting these deletions received a correspondingly lower score. NGS8, however, was a sample with which many centers had problems. Many laboratories reported missing data for variant loci. Additionally, incorrect base calls were made, in particular by center 15 (Fig. S1H). A combination of several of these factors can in turn result in a lower mean score for a center (e.g., center 7, with an average score of 0.75) (Table S4).
Lineage assignment. Correct lineage assessment is of course dependent on correct SNP calling and sufficient coverage across the genome. The majority of centers assigned all samples to the correct lineage (Table 2). Two centers with the lowest mean depths failed in correctly assigning the lineage of one sample, NGS8 (B.1.177) (Table S2). Center 7, which provided a 57% complete genome (mean read depth of 39×), could assign the sample only to the parental lineage B. Rather surprisingly, the laboratory with by far the highest depth, center 14, assigned the lineages of two samples incorrectly: NGS7 and -9 were both assigned only as lineage A, as opposed to the correct, more specific assignment of A.27. This was due to an outdated version of Pangolin.

Cluster identification. Almost all centers reported the same clusters (Table 3). Samples NGS2 and NGS5 formed one cluster (cluster B); NGS3, NGS6, and NGS8 formed the second cluster (cluster C); and NGS7 and NGS9 formed the third cluster (cluster E).
The low coverage for sample NGS8 was a challenge for the two above-mentioned centers 7 and 9. However, center 7 reported a presumed allocation into the correct cluster using the partial genome (asterisk in Table 3). Center 9 could not identify the cluster due to unsuccessful sequencing (9× mean depth [Table S2, highlighted in red]). As a result, the cluster reported by center 9 was incomplete.
Center 12 had difficulties with two samples (NGS1 and -4) and allocated them incorrectly to cluster B (together with NGS2 and -5) (Table 3, shading). This was despite them falling into different Pango lineages (Table 2). Center 14 incorrectly assigned NGS1 and NGS4 to a separate cluster (Table 3, shading), again despite differing Pango lineage assignments. However, the other clusters were correctly assigned by both laboratories.

DISCUSSION
Impact of methodological choices. Given that laboratories had free choice over their experimental as well as analytical protocols, disentangling the individual effects of these differences is impossible. A factor known to influence sequencing success is viral load. For example, NGS8, while having a viral load comparable to those of NGS9 and -10 (threshold cycle [CT] values of 28.4 and 28.1, respectively), was on the lower end of the spectrum (CT value of 28) (see Table S1 in the supplemental material). This could be why many centers had problems with this sample. When grouping the sequencing methods roughly into Illumina single-end versus Illumina paired-end versus Oxford Nanopore Technologies (ONT) methods, a platform-related effect does not seem to have occurred (Fig. S2). In fact, centers 7 and 8 had very similar sequencing setups, with the exception of their analysis pipelines (Table 1). Center 8, however, was able to sequence to a greater depth and was therefore better able to perform accurate genomic analyses as it achieved overall higher coverage across the genome. Moreover, the small genome of SARS-CoV-2 and the lack of long repeat regions allow the use of short reads or single-end sequencing, which would be more problematic for WGS of other pathogens.
The mean depth had an effect only insofar as a too-low depth leads to too much missing data. Once a sufficient read depth had been achieved, there was no further clear correlation between the score of variant calling and depth (Fig. S3). In general, depth across the genome can be very uneven, and average depth as a measure does not fully take this into account. Technically, read depths of between 100× and 200× can be enough for genotyping. For example, samples NGS2 and -5 for center 7 have 191× and 131× coverages, respectively, as well as a small amount of missing data and a high variant calling score (Fig. 1). However, when coverage is uneven, missing data can still be an issue, even at a higher average depth (e.g., NGS10 for center 7 at 246×) (Fig. 1; Table S2). For accurately genotyping SARS-CoV-2, it is necessary to capture the entirety of the genome and not just some areas (even of biologically important areas such as the S gene) as the software used to determine the lineage built its models based on whole-genome diversity (the pangoLEARN algorithm within Pangolin) (8). It is therefore important to strive for the best coverage across the genome (i.e., a small amount of missing data), and "sufficient read depth," as mentioned above, is therefore a function of this. More even coverage in amplicon-based sequencing can, for example, be achieved by balancing primer sets.
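The point that average depth can hide uneven coverage can be made concrete with a short sketch (toy depth profiles invented for illustration): two profiles with identical means differ sharply in the fraction of positions covered at a usable depth.

```python
# Sketch of mean depth versus breadth of coverage: two toy depth profiles
# share the same mean, but one has dropouts that would produce missing data.
# The 20x cutoff is an illustrative minimum depth, not a value from the study.
def breadth(depths, min_depth=20):
    """Fraction of positions covered at >= min_depth."""
    return sum(d >= min_depth for d in depths) / len(depths)

even   = [200, 210, 190, 205]   # even coverage across positions
uneven = [780, 5, 10, 10]       # same mean, but three near-dropout positions

print(sum(even) / len(even), breadth(even))
print(sum(uneven) / len(uneven), breadth(uneven))
```

Both profiles average 201.25×, yet only a quarter of the uneven profile clears the cutoff, which is exactly why breadth of coverage, not mean depth, is the quantity to optimize.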
Rather than average depth, other factors, such as variant reporting capacity, mapping quality, and interpretation of data, play a larger role. This is an important point for diagnostic laboratories with respect to operational costs. The importance of this was highlighted by center 14, which sequenced to by far the highest depth but had difficulties with lineage and cluster assignments despite very good variant calling. Upon receiving a preliminary report, center 14 reexamined its analysis pipeline and found that it had used an outdated Pangolin and pangoLEARN version. The Pango lineage nomenclature is dynamic, meaning that the nomenclature system develops as SARS-CoV-2 evolves, and lineage definitions and names can change over time (8). The pilot EQA provided valuable feedback for the center to improve its workflows.
Cluster assignment, on the other hand, highlighted another challenge for the development of any EQA: communication and interpretation. The majority of other centers determined a cluster as a putative transmission cluster that differs between 0 and maximally 2 SNPs (thresholds vary slightly) (see supplemental methods in the supplemental material). Two centers had difficulties, which could be resolved upon feedback. Center 12 had interpreted the terminology "cluster" differently and instead reported the Nextclade assignment (10); center 14 in turn deemed samples NGS1 and NGS4 to belong to a single cluster. While they share an ancestor, most other laboratories deemed them sufficiently different to assign them to two separate clusters. In fact, they differ in 27 SNPs, whereas the other true clusters (clusters B, C, and E in Table 3) had 0 to 1 SNPs between genomes. This highlights that there is a certain element of subjectivity in data interpretation when lacking clear definitions as well as the need to clarify the objective of the task (in this case the assessment of transmission clusters rather than simply related sequences in a phylogenetic tree).
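The cluster rule most centers applied, linking genomes that differ by at most about 2 SNPs, can be sketched as single-linkage grouping over pairwise SNP distances. This is an illustrative sketch, not any center's pipeline; the sequences, sample names, and the exact threshold are invented or simplified for the example:

```python
# Sketch of transmission-cluster assignment by pairwise SNP distance:
# samples whose consensus genomes differ by <= threshold SNPs are linked
# into one cluster (single linkage via union-find). Missing bases (N) are
# ignored when counting differences, as a real pipeline typically would.
from itertools import combinations

def snp_distance(a, b):
    """Pairwise SNP distance, ignoring positions missing (N) in either genome."""
    return sum(1 for x, y in zip(a, b) if x != y and "N" not in (x, y))

def cluster(seqs, threshold=2):
    parent = {name: name for name in seqs}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in combinations(seqs, 2):
        if snp_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)          # link the two samples
    groups = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())

# Invented 10-bp "genomes": NGS5 is 1 SNP from NGS2, NGS1 is several SNPs away.
seqs = {
    "NGS2": "ACGTACGTAC",
    "NGS5": "ACGTACGTAT",
    "NGS1": "TTGTACGAAC",
}
print(cluster(seqs))
```

With the ≤2-SNP threshold, NGS2 and NGS5 fall into one cluster while NGS1 stands alone, mirroring how a 27-SNP difference (as between NGS1 and NGS4 here) clearly exceeds any plausible transmission-cluster cutoff.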
An important factor for routine sequencing is cost. In general, the amplicon-based protocols used in this study consist of a reverse transcription step, an amplification step, library preparation, and sequencing. As the first two steps are mostly the same for different sequencing technologies, cost is driven mainly by library preparation and sequencing itself. Here, Oxford Nanopore sequencing allows faster data generation due to real-time base calling, while sequencing on an Illumina machine typically takes slightly more than a day (11). Cost-wise, the price per sample will decrease with increasing throughput. But the many library preparation kits available as well as the wide range of sequencing machines used here (Table 1) make comparisons between the centers difficult.
All protocols used by the participating centers in this EQA were amplicon based, and primer bias can have an influence on sequencing accuracy. Here, primer sets varied between laboratories (Table 1). For the ARTIC v3 primers (which are public), we find no apparent bias in the data reported here compared to the other primer panels. However, centers 7 and 8 used the same primer panel but did not detect the variant G21255C in samples NGS3, -6, and -8 (Fig. S1C, F, and H). This SNP is present in almost all representatives of lineage B.1.177 (12). Whether this failure in detection is truly due to primer bias cannot be conclusively answered, however, as commercial primer sequences are often not public. One way to deal with this issue bioinformatically is to trim primer sequences prior to assembly. Nevertheless, primer bias is a real issue if it leads to dropouts. Fortunately, this is actively monitored by the community. For example, dropouts of the ARTIC v3 panel have been reported, especially for Beta and Delta variants. For this reason, a new primer panel has been developed to avoid high-frequency variant sites in the newer lineages (13).
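The bioinformatic primer-trimming step mentioned above can be sketched as follows. This is a simplified illustration, not a production trimmer (real pipelines such as those listed in Table 1 work on aligned reads in BAM format); the primer coordinates and the read are invented:

```python
# Sketch of primer trimming prior to consensus building: bases of a mapped
# read that fall inside known primer intervals (BED-style 0-based, half-open
# reference coordinates) are removed, so primer-derived sequence cannot mask
# a true variant under the primer site.
def trim_primers(read_start, read_seq, primer_intervals):
    """Return the read sequence with bases inside any primer interval removed."""
    kept = []
    for offset, base in enumerate(read_seq):
        pos = read_start + offset
        if not any(start <= pos < end for start, end in primer_intervals):
            kept.append(base)
    return "".join(kept)

primers = [(0, 5), (95, 100)]          # toy primer sites at the amplicon ends
print(trim_primers(2, "ACGTACGT", primers))
```

Here the first three bases of the read overlap the 5' primer interval and are clipped, leaving only genuinely genomic sequence for variant calling.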
Factors not assessed in this pilot EQA. This pilot EQA focused on reporting findings related to consensus genome sequences but did not include minority variant reporting. Center 15 reported issues with contamination for sample NGS8, yet lineage and cluster assignments were successful as the key sites were not affected. However, some contamination spilled over into the consensus genome as evidenced by a number of wrong variant calls (Fig. S1H). Similarly, some laboratories reported mixed loci as SNPs in their reports, although we were mostly interested in fixed changes. Differentiating between contamination and true, albeit rare, mixed infections or possible in-host evolution can be very difficult, especially in a clinical setting with high sample throughput. Assessment of contamination and analysis of minority variants would allow the provision of more detailed feedback to the laboratories. Contamination, for example, would likely be an isolated event for a center, resulting in mixed sites, while a true mixture would be prevalent across all centers. At the same time, it would offer an interesting analytical challenge, particularly if samples with true mixed infections were sent to participants.
Conclusion and lessons learned. The first ESGMD-SSM pilot EQA of SARS-CoV-2 sequencing was overall a success. Most centers generated whole-genome sequences and correctly identified all lineages and clusters. Additionally, there was a consensus regarding the majority of called SNPs despite the strong effect that missing data and unreported deletions (although present in the data) had on the scores of some centers. This suggests an overall high quality in each participating center. The standardized reporting of important variations in the genome should be the focus of improvement for some bioinformatic pipelines. The most critical aspect was coverage across the genome, which correlated with correct lineage and cluster assignments.
For a follow-up EQA, the variables to document, and the format in which to report them, have to be more clearly defined. Moreover, minority variants, e.g., from samples with mixed infections, should be included to some degree. Information on primer sets for amplicon-based methods should be carefully recorded, especially in light of new virus lineages. Instead of culture supernatants, it might also be of interest to include primary patient samples diluted in a clinical collection matrix as well as an empty control. Finally, to trigger a discussion on cluster definition, samples with high similarity but 2 to 5 SNP differences could also be included.
The COVID-19 pandemic required a rapid global laboratory response involving the development and rollout of new diagnostic assays and diagnostic platforms on an unprecedented scale. In response to the emergence and spread of virus variants of concern, WGS is increasingly being utilized not only for surveillance but also for diagnostic purposes, thus necessitating the rapid deployment and sharing of quality assurance schemes. This EQA pilot provides proof of feasibility for the development and operationalization of an EQA for WGS in a pandemic context, and lessons learned from its design, delivery, and results should inform future pandemic preparedness.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.
SUPPLEMENTAL FILE 1, PDF file, 0.8 MB.
SUPPLEMENTAL FILE 2, PDF file, 0.1 MB.
SUPPLEMENTAL FILE 3, PDF file, 0.1 MB.