Comprehensive, Comparative Evaluation of 35 Manual SARS-CoV-2 Serological Assays

ABSTRACT The onset of the coronavirus disease 2019 (COVID-19) pandemic resulted in hundreds of in vitro diagnostic devices (IVDs) coming to market, facilitated by regulatory authorities allowing “emergency use” without a comprehensive evaluation of performance. The World Health Organization (WHO) released target product profiles (TPPs) specifying acceptable performance characteristics for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) assay devices. We evaluated 26 rapid diagnostic tests and 9 enzyme immunoassays (EIAs) for anti-SARS-CoV-2, suitable for use in low- and middle-income countries (LMICs), against these TPPs and other performance characteristics. The sensitivity and specificity ranged from 60.1 to 100% and 56.0 to 100%, respectively. Five of 35 test kits reported no false reactivity for 55 samples with potentially cross-reacting substances. Six test kits reported no false reactivity for 35 samples containing interfering substances, and only one test reported no false reactivity with samples positive for other coronaviruses (not SARS-CoV-2). This study demonstrates that a comprehensive evaluation of the performance of test kits against defined specifications is essential for the selection of test kits, especially in a pandemic setting.

IMPORTANCE The markets have been flooded with hundreds of SARS-CoV-2 serology tests, and although there are many published reports on their performance, comparative reports are far fewer and tend to be limited to only a few tests. In this report, we comparatively assessed 35 rapid diagnostic tests or microtiter plate enzyme immunoassays (EIAs) using a large set of samples from individuals with a history of mild to moderate COVID-19, commensurate with the target population for serosurveillance, which included serum samples from individuals previously infected, at undetermined time periods, with other seasonal human coronaviruses, Middle East respiratory syndrome coronavirus (MERS-CoV), and SARS-CoV-1. The significant heterogeneity in their performances, with only a few tests meeting WHO target product profile performance requirements, highlights the importance of independent comparative assessments to inform the use and procurement of these tests for both diagnostics and epidemiological investigations.

KEYWORDS evaluation, SARS-CoV-2, serology
In November 2019, a novel acute respiratory disease (coronavirus disease 2019 [COVID-19]) [...] target populations, had inappropriate interpretations, and/or assessed small numbers of test kits. Few findings were published in peer-reviewed journals (1)(2)(3). Due to these shortcomings and in the face of requests for guidance on appropriate use and procurement, the World Health Organization (WHO) published guidance (4) recommending that RDTs be for research use only and that "They should not be used in any other setting, including for clinical decision-making, until evidence supporting use for specific indications is available." The WHO also stated, "although research into their performance and potential diagnostic utility is highly encouraged" (4).
To address some of the shortcomings and to better shape guidance and inform the procurement of serological assays, the WHO commissioned the National Serology Reference Laboratory, Australia (NRL), a WHO collaborating center and authorized WHO IVD prequalification evaluation laboratory, to develop a comprehensive protocol and conduct a large comparative evaluation of both RDTs and laboratory-based anti-SARS-CoV-2 serology tests suitable for use in low- and middle-income countries (LMICs). An open call for expression of interest for manufacturers to participate in the evaluation scheme was issued (5). One hundred two products from 71 manufacturers were submitted, and 45 products from 44 manufacturers were accepted based on compliance with both entry and short-listing criteria. Nine of the short-listed manufacturers withdrew their applications, leaving 35 products. In July 2021, the NRL and the WHO began publishing summary results for all 35 products, describing key observations and implications for use for clinical care and surveillance.
The primary aims of this study were to produce a statistically significant assessment of the performances of test kits designed to detect different classes of antibodies to different SARS-CoV-2 antigens; to directly compare performance data from a range of commercially available serology tests by using the same panel of samples; to provide comprehensive performance characteristics of serology assays; and to determine whether serology could serve any useful purpose in the diagnosis or surveillance of SARS-CoV-2 infections.

RESULTS
The results of this study represent the performances of the versions of the product and lot numbers used, and other versions or lots may result in different findings. The performance of all IVDs should be monitored over time with a well-designed quality assurance program. The invalid test rate, sensitivity, and specificity results for RDTs and EIAs are presented in Figures 1 and 2.
Invalid test rate. The invalid test rate ranged from 0.00% to 1.40%. A total of 13 of the 23 RDTs that reported IgG and IgM had no invalid test results. Only Biogenix reported an invalid test rate of >1.00%. Of the 12 tests that reported single IgG, total antibody, or neutralizing (Nt) antibody results, 7 had no invalid results.
Concordance with recent infection. The results for testing samples positive for anti-SARS-CoV-2 are expressed as "percent concordance with recent infection," including 95% confidence interval (CI) ranges. The 23 RDTs were evaluated for reactivity to IgG, IgM, or IgG and/or IgM against SARS-CoV-2 (Fig. 1).
Analytical sensitivity and lot-to-lot variation. Three samples (COVID461, COVID491, and COVID492) were doubling diluted from 1:2 to 1:1,024 and were tested with two reagent/test lots of each of the 35 tests. The results of testing the doubling dilution series are presented in Table S1 in the supplemental material. There was a large range of analytical sensitivities reported by the different test kits. In several RDT kits, a nonreactive [...] Four of the 35 tests reported a difference in the reactivities of two or more doubling dilutions when the same sample was tested with two different lots, including Lysun (IgG and IgM for COVID461), Singclean (IgM for COVID461), Biocan (IgM for COVID461), and Deepblue (IgG and IgM for COVID461 and -492).
Cross-reactivity and interference. The summary results for testing 55 samples containing potentially common cross-reacting analytes, 35 samples with interfering substances, and 31 samples from individuals with known past infection with SARS-CoV-1, Middle East respiratory syndrome coronavirus (MERS-CoV), and seasonal human coronavirus (HCoV) (HCoV-229E, HCoV-NL63, or HCoV-OC43) are presented in Fig. 3, with the complete set of results shown in Table S2. There was a broad range in the numbers of false-reactive results for the cross-reacting, interfering, and non-SARS-CoV-2 coronavirus samples. Only five tests (MDGen, OmniPath, Serion, Standard Q, and Wantai) reported no false reactivity for the 55 samples with potentially cross-reacting substances. Of the 35 samples with potentially interfering substances, the 5 samples containing rheumatoid factor were falsely reactive by most tests. Dynamiker reported false IgM reactivity for 47 of 55 (85.5%) cross-reacting samples and 26 of 35 (74.3%) samples containing interfering substances. Six tests (Bio Hit, MDGen, OmniPath, Serion, Sure Status, and Wondfo) reported no false-reactive results for the 35 samples containing interfering substances.
MDGen was the only test kit that had no false reactivity across the cross-reacting, interfering, and non-SARS-CoV-2 panels. However, this test also failed to detect true-positive samples, reporting a low sensitivity of 60.1%.
FIG 2 Invalid test rates, concordances with recent infection, and specificities of rapid test devices that report a single result for IgG only, total antibodies, or neutralizing antibodies presented as a heat map, with shades of green representing results of >90%, shades of yellow representing results of between 60 and 90%, and orange representing results of <60%. NA, not assessed; Total, total antibodies; Nt, neutralizing antibodies.

Evaluation of 35 Manual SARS-CoV-2 Serological Assays Microbiology Spectrum
Late-seroconversion panels. Most kits reported results reactive for the analyte tested (IgG/IgM/total/Nt) for all serial blood samples. BTNX, Deepblue, Dynamiker, Healgen, Lysun, Sensing, Standard Q, Sugentech, and VivaDiag each reported one or more negative IgM results about 30 to 40 days after symptom onset. For one five-member series, four test kits (BioHit, Biomedomics, Getein, and RightSign) did not detect IgM in any of the samples.
Seroconversion panels. The results for seroconversion panels are summarized in Table S4. Samples were drawn from 8 days before to up to 52 days after the start of symptoms. Most test kits detected SARS-CoV-2 IgG and IgM antibodies within the first week postinfection. Generally, the IgM response was detected earlier than or at the same time as the IgG response. There were some notable exceptions. MDGen, testing for IgG only, failed to detect antibodies in one patient's series of 9 samples and reported 5 negative results, 6 equivocal results, and 1 reactive result for a second patient's series of 14 samples. Serion, also testing for IgG only, reported negative results for the first five of a series of nine samples. The IgM responses by both RightSign and Standard Q decreased to undetectable levels in the same two of five seroconversion panels.
Repeatability and reproducibility. Repeatability and reproducibility studies were conducted on six EIAs. The results were expressed as the percent coefficient of variation (CV%) and are summarized in Table 1. Repeatability ranged from 3.70% to 11.37%, and reproducibility ranged from 3.52% to 13.42%.

DISCUSSION
Within 6 months of the start of the pandemic, numerous antibody detection SARS-CoV-2 RDTs and EIAs became available. Regulators allowed the use of these novel tests through some form of emergency use authorization, requiring manufacturers to provide only limited evidence of test kit performance. Numerous studies comparing the performances of test kits were published (2, 3, 6, 7), often as preprints (8), which were not subject to rigorous peer review (2,7). Early in the pandemic, the utility of SARS-CoV-2 serology testing was unknown, but it was used due to the lack of inexpensive point-of-care testing options and the long turnaround times for molecular diagnostics (7,9,10). RDTs claiming to detect IgM were used to diagnose recent infections (11,12). To develop rational guidance on use and inform procurement, a comprehensive, head-to-head comparison of test kits was established, and the results were compared with published WHO target product profiles (TPPs) specifying acceptable and desirable performance characteristics (13). For RDTs, sensitivity is acceptable at ≥90% and desirable at ≥95%, and specificity is acceptable at ≥97% and desirable at ≥99%. For higher-throughput assays, the acceptable and desirable sensitivities are ≥95% and >98%, and the acceptable and desirable specificities are ≥97% and ≥99%, respectively (13).
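The TPP comparison described above can be illustrated with a minimal sketch. The thresholds are those quoted in the text (reading the desirable sensitivity for higher-throughput assays as strictly greater than 98%); the names `TPP` and `tpp_grade`, and the use of the label "EIA" for higher-throughput assays, are assumptions made for illustration and are not taken from the study protocol.

```python
import operator

# WHO TPP thresholds (percent) as quoted in the text; "EIA" is used here
# as a hypothetical label for the higher-throughput assay category.
TPP = {
    "RDT": {
        "acceptable": {"sensitivity": (operator.ge, 90.0), "specificity": (operator.ge, 97.0)},
        "desirable": {"sensitivity": (operator.ge, 95.0), "specificity": (operator.ge, 99.0)},
    },
    "EIA": {
        "acceptable": {"sensitivity": (operator.ge, 95.0), "specificity": (operator.ge, 97.0)},
        "desirable": {"sensitivity": (operator.gt, 98.0), "specificity": (operator.ge, 99.0)},
    },
}


def tpp_grade(assay_format, sensitivity, specificity):
    """Return 'desirable', 'acceptable', or 'not met' for one test kit.

    A level is met only when BOTH sensitivity and specificity satisfy it.
    """
    observed = {"sensitivity": sensitivity, "specificity": specificity}
    for level in ("desirable", "acceptable"):
        criteria = TPP[assay_format][level]
        if all(compare(observed[name], cutoff)
               for name, (compare, cutoff) in criteria.items()):
            return level
    return "not met"


# The lowest sensitivity reported in this study (60.1%) fails the RDT
# criteria regardless of specificity.
print(tpp_grade("RDT", 60.1, 100.0))
```

Requiring both characteristics to pass at the same level mirrors how the results are discussed here, where a kit is described as meeting a TPP level only when sensitivity and specificity both reach it.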
This study used samples from individuals with a recent history of mild-to-moderate clinical disease. The percent concordances of the results of IgM assays compared with nucleic acid amplification testing (NAT)-confirmed recent infection and specificity were highly variable, ranging from 55.8 to 99.5% and 34.3 to 99.0%, respectively. Five of the 23 test kits (27.3%) that reported IgM results reported more than 30 of the 90 cross-reacting and interfering substance-containing samples as being falsely reactive. There is some evidence that the IgM response decreases over time. No test achieved both the acceptable sensitivity and specificity criteria of TPPs based on the detection of IgM. These findings support the position that there is very limited clinical and epidemiological utility of IgM antibody testing (14).
All tests evaluated detected SARS-CoV-2 IgG either independently (IgG only), in association with the detection of IgM (IgG/IgM), or as a total antibody (IgG, IgM, and IgA) or neutralizing antibody test. Eight RDTs reporting IgG achieved acceptable levels of both sensitivity and specificity (Fig. 3). No RDT had desirable levels of both sensitivity and specificity. EIAs are the preferred method to assess seroprevalence (2,15). Wondfo and OmniPath met the acceptable TPP criteria and Wantai met the desirable criteria for both sensitivity and specificity, respectively (Fig. 2). These findings support the use of these limited numbers of RDT and EIA products for serosurveillance or retrospective diagnosis (for unvaccinated individuals).
In addition to achieving high levels of concordance with recent infection and specificity, some products also had very low reactivity with cross-reacting substances. More specifically, Wantai (EIA) reported just one false-reactive result from the 35 samples with interfering substances and none from the 55 cross-reacting samples, whereas Wondfo (RDT) reported none and one, respectively. bioLytica (RDT) reported 5/35 and 11/55 false-reactive results, and Bio-Rad (EIA) reported 1/35 and 2/55, respectively.
The %CVs for the repeatability of seven tests that reported quantitative results ranged from 3.70 to 11.37%, whereas the %CVs for reproducibility ranged from 3.52 to 15.50%. The imprecision of quantitative tests should continually be monitored using a well-developed quality control (QC) program (16).
The results of this study indicate that the SARS-CoV-2 IgG response is detectable at the same time as or one bleed after the detection of IgM (10). The results of the seroconversion and late-seroconversion panels indicate that there was little evidence that the IgG response became undetectable within 7 weeks after infection, which is consistent with the results of other studies (10,17).
This study demonstrated that some tests had unacceptably poor concordance with recent infection and specificity, and others reported unacceptably high levels of false reactivity due to cross-reactive and interfering substances and antibodies from other coronaviruses. Most tests were reactive within the first week after the onset of symptoms. Several tests demonstrated a >2-fold difference in analytical sensitivity between the two lots, indicating inconsistent manufacturing practices.
Several limitations of this study should be noted. This study did not assess safety, usability, cost, or test kit robustness. This study used predominantly citrated plasma samples collected using plasmapheresis. This study, along with many others, applied tests in laboratory settings on plasma or serum samples, while they are also approved for use as point-of-care tests using (capillary) whole blood; therefore, it is not possible to ascertain the clinical accuracy of these tests in the intended settings of use. Some studies suggest a performance comparable to that with whole blood (18,19).
All positive samples were from individuals infected with the ancestral variant. This study did not evaluate the test kits using samples obtained from individuals vaccinated with varying vaccines or numbers of doses, with or without a history of infection. Additional studies in these populations will be required. Some of the samples used in the cross-reacting and interfering substance panels had limited clinical information.
The landscape of clinical diagnostics for SARS-CoV-2 has drastically evolved since the initial antibody tests emerged on the scene; point-of-care SARS-CoV-2 antigen- and molecular-based tests and expanded PCR laboratory capacities now fill the acute diagnostic needs (13). This evaluation was an attempt to better inform the role of antibody testing and subsequent procurement as part of the pandemic response. In the future, mechanisms should be on standby to allow more rapid comparative evaluations to identify good- and poor-performing products and better understand the appropriate use. Although this was a large study of 35 products, it included only a fraction of the products on the market and revealed dramatic variability in performance across various parameters. Nonetheless, a small number of products met WHO TPP criteria and, in the right context, could play a useful role in serosurveillance and epidemiological research. These results, coupled with more stringent regulatory requirements, may be useful in selecting products for these purposes.

MATERIALS AND METHODS
Test kit selection. In November 2020, the WHO issued an expression of interest and the evaluation protocol (5). The following exclusion criteria were used: products targeting IgM or IgA only; products needing proprietary platforms; products for which kit instructions for use (IFUs) were not included in the application; manufacturers without a free-sales certificate or ISO13485 accreditation; products that had low accuracy in early evaluations performed by the Foundation for New and Innovative Diagnostics (FIND) (6) (low accuracy defined as <80% sensitivity and <98% specificity); RDTs targeting anti-N antibodies only; and multiple products from a single manufacturer, with the exception of EIAs targeting anti-N antibodies. The latter two criteria were adopted considering the potential for future seroprevalence studies after mass vaccination campaigns with spike protein-based vaccines. Furthermore, RDTs targeting anti-N antibodies alone were excluded because the literature indicated that anti-N antibody titers decay more quickly than anti-S antibodies and therefore may be less sensitive for the detection of past infection (20). Serosurveillance would require high-accuracy, high-throughput assays such as EIAs (2). Of the test kits selected, 26 were RDTs, and 8 were EIAs. One EIA (OmniPath) was added at a later stage as it was the commercialized version of the product used in the RECOVERY trial, the data from which suggested that the use of serology could help select those patients who were most likely to benefit from treatment with a monoclonal antibody cocktail (Regeneron) (21). The complete list of test kits evaluated is summarized in Table 2. All selected test kits were provided to the NRL free of charge.
Sample panels. The performance characteristics evaluated depended on the class(es) of antibodies being detected and the method for the reporting of results. All test kits were evaluated for sensitivity (concordance with documented SARS-CoV-2 RNA positivity by quantitative PCR [qPCR]), specificity, analytical sensitivity, quantification, lot-to-lot variation, seroconversion, cross-reactivity, and interference.
Test kits reporting quantitative results (e.g., sample-to-cutoff [S/Co] values) were evaluated for repeatability and reproducibility.
Samples contained various anticoagulants, including sodium citrate or citrate dextrose solutions. The anticoagulants used in some other samples were unknown. Some test kits evaluated specified the use of certain anticoagulants in the IFU. False reactivity due to the anticoagulants used in the panel cannot be discounted, and the results should be interpreted accordingly.
(i) Sensitivity/concordance with recently confirmed SARS-CoV-2 infection. A total of 199 samples were obtained by two commercial organizations (BioMex, Heidelberg, Germany [BioMex], and Medical Research Networx Biologicals, FL, USA [MRN]) from nonhospitalized individuals with a recent history of clinical infection with ancestral SARS-CoV-2, confirmed by various commercial NATs. As samples were collected from individuals between January and April 2020, it is assumed that infections were not due to Delta or Omicron variants. These samples were collected between 14 and 71 days after the onset of symptoms or after a positive NAT result. The results of the positive sample panels were reported as "concordance with recent infection." Approximately half of the panel was tested by each of two reagent/test lots.
(ii) Specificity. A total of 300 plasma samples obtained from NRL's sample bank, collected prior to November 2019, were used as the specificity panel. These samples were obtained from healthy blood donors and screened negative for blood-borne infections by serology and NATs. These samples were assumed to be negative for SARS-CoV-2 antibodies, and no further confirmation testing was performed. Approximately half of the panel was tested by each of two reagent/test lots.
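As an illustration of how concordance with recent infection and specificity can be reported with 95% confidence intervals, the following sketch computes a proportion and its interval. The Wilson score method is an assumption (the interval method used in the study is not stated in this section), and the function name `proportion_with_wilson_ci` and the example counts are hypothetical.

```python
import math


def proportion_with_wilson_ci(successes, total, z=1.96):
    """Return (estimate, lower, upper) in percent.

    Uses the Wilson score interval; z=1.96 corresponds to a 95% CI.
    For sensitivity, successes = reactive results among positive samples;
    for specificity, successes = nonreactive results among negative samples.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    lower = max(0.0, centre - half_width)
    upper = min(1.0, centre + half_width)
    return 100 * p, 100 * lower, 100 * upper


# Hypothetical example: 297 nonreactive results in a 300-sample
# specificity panel (panel size from the text, count invented).
estimate, lower, upper = proportion_with_wilson_ci(297, 300)
print(f"specificity {estimate:.1f}% (95% CI {lower:.1f} to {upper:.1f}%)")
```

The Wilson interval is chosen here because it stays within 0 to 100% and behaves sensibly when a kit scores 100% on a panel, which several kits in this evaluation did.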
(iii) Analytical sensitivity/lot-to-lot variation. Three of the sensitivity panel samples had 10 doubling dilutions, from 1:2 to 1:1,024, prepared in human plasma negative for SARS-CoV-2 antibodies. All dilutions were tested by two reagent lots.
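The doubling dilution series and the lot-to-lot comparison can be sketched as follows. The endpoint definition used here (the number of consecutive reactive dilutions counted from 1:2) is an assumption, as are the function names and the example results; the study's own endpoint rule is not described in this section.

```python
def doubling_dilutions(steps=10):
    """Return the dilution factors 1:2 through 1:1,024 as integers."""
    return [2 ** k for k in range(1, steps + 1)]


def endpoint_step(results):
    """Count consecutive reactive dilutions, starting from 1:2.

    `results` is a list of booleans (True = reactive) ordered from the
    least dilute (1:2) to the most dilute (1:1,024) sample.
    """
    count = 0
    for reactive in results:
        if not reactive:
            break
        count += 1
    return count


def lot_difference(lot_a, lot_b):
    """Difference between two lots, in doubling dilution steps."""
    return abs(endpoint_step(lot_a) - endpoint_step(lot_b))


# Hypothetical results for one sample tested on two lots.
lot_1 = [True] * 6 + [False] * 4   # reactive up to 1:64
lot_2 = [True] * 4 + [False] * 6   # reactive up to 1:16
print(doubling_dilutions())
print(lot_difference(lot_1, lot_2))  # two doubling dilutions apart
```

A difference of two steps corresponds to a fourfold difference in endpoint titer, which is the kind of lot-to-lot discrepancy flagged in the results.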
(iv) Cross-reactivity. A total of 55 plasma or serum samples known to contain potentially cross-reacting analytes were tested by a single reagent/test lot along with a further 31 samples confirmed to be positive by NATs for severe acute respiratory syndrome coronavirus (SARS-CoV-1), Middle East respiratory syndrome coronavirus (MERS-CoV), or seasonal human coronavirus (HCoV-229E, HCoV-NL63, or HCoV-OC43) (Table 3). Samples were obtained from individuals with evidence of past infection with the organism indicated, unless specifically indicated by IgM reactivity.
(v) Interfering substances. A total of 35 plasma samples known to contain potentially interfering substances were tested by a single reagent/test lot. The interfering substance panel consisted of 5 visibly icteric samples, 5 visibly hemolyzed samples, 7 samples with visibly high levels of bilirubin, 5 lipemic samples, 5 samples with antinuclear antibodies, 3 samples positive for antibodies to double-stranded DNA (lupus), and 5 samples positive for rheumatoid factor.
(vi) Late-seroconversion panels. The late-seroconversion panel was comprised of 47 plasma samples collected by BioMex from 10 different, nonhospitalized, volunteer donors at various intervals commencing from 18 days or later after symptom onset. The purpose of this panel was to demonstrate the decline in IgM antibody titers over time.
(vii) Seroconversion panels. Seroconversion panels consisted of a total of 60 plasma samples collected by MRN from five different SARS-CoV-2 NAT-positive individuals at regular intervals from early infection to approximately 8 weeks after symptoms. The results of testing were used to determine the number of days after the onset of symptoms when the test kit first detected reactivity.
(viii) Repeatability. For repeatability studies (within-run precision), a positive sample diluted in negative plasma to give a low positive reaction or a commercial anti-SARS-CoV-2 quality control (QC) sample (DiaMex, Heidelberg, Germany) was tested 30 times in the same test run. The percent coefficient of variation (%CV) was calculated.
(ix) Reproducibility. For reproducibility studies, the same sample used in the repeatability study was tested 30 times across no fewer than five different runs, and the results were presented as the %CV.
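The %CV used for both repeatability and reproducibility can be computed as in the sketch below. It assumes the sample standard deviation (n - 1 denominator), which is not specified in the text; the function name `percent_cv` and the example S/Co values are hypothetical.

```python
import statistics


def percent_cv(values):
    """Percent coefficient of variation: 100 * standard deviation / mean.

    Uses the sample standard deviation (an assumption; the text does not
    state which estimator was used). Requires at least two values.
    """
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("mean is zero; %CV is undefined")
    return 100 * statistics.stdev(values) / mean


# Hypothetical sample-to-cutoff (S/Co) values from replicate testing of
# one low-positive sample; the study used 30 replicates.
replicates = [2.10, 2.25, 1.98, 2.17, 2.05, 2.30]
print(f"%CV = {percent_cv(replicates):.2f}")
```

The same function applies to repeatability (replicates within one run) and to reproducibility (the 30 results pooled across at least five runs), with only the input data differing.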
Testing protocol. (i) Rapid diagnostic tests. Rapid diagnostic testing was performed according to the test IFU by one operator. The results were read by that operator and independently read by a second reader. The intensities of the test and control lines were graded according to a defined scale (Table 4). When consensus for the sample reading was not met, a third, independent reader recorded their result, and the eventual consensus (2 of 3 readings being the same) was used as the final result. The number of invalid results, as defined by the IFU, was recorded.
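The reader consensus rule for RDTs described above can be expressed as a short sketch. The integer grades are placeholders for the scale in Table 4, and returning `None` when all three readings differ is an assumption, since the text does not describe that case; the function name `consensus_reading` is hypothetical.

```python
from collections import Counter


def consensus_reading(reader1, reader2, reader3=None):
    """Return the final line-intensity grade for one RDT result.

    If the first two readers agree, their grade is final. Otherwise a
    third reading is required, and the grade shared by 2 of 3 readers is
    used. Returns None when no two readers agree (assumed behavior).
    """
    if reader1 == reader2:
        return reader1
    if reader3 is None:
        raise ValueError("readers disagree; a third reading is required")
    grade, votes = Counter([reader1, reader2, reader3]).most_common(1)[0]
    return grade if votes >= 2 else None


# Hypothetical grades: readers 1 and 2 disagree, reader 3 settles it.
print(consensus_reading(1, 2, 2))
```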
(ii) Enzyme immunoassays. Enzyme immunoassays were performed singly according to the IFU by the same operator. Invalid test runs were defined as when the kit controls failed the manufacturer's validation criteria.
All results, recorded on hard-copy result sheets at the time of reading and manually transcribed into Microsoft Excel, were double-checked by a second, independent person daily.

ACKNOWLEDGMENTS
We thank NRL scientific staff, including Sadaf Mohiuddin, Jingjing Cai, Bethmi Liyanage, and all technical support staff, who took part in composing the panels and performing testing. In particular, we acknowledge Technopath Clinical Diagnostics and Seracare (MA, USA). We also acknowledge the Duke-NUS Medical School, the Erasmus Medical Center, Tan Tock Seng Hospital, and the International Vaccine Institute (IVI) for contributing samples from individuals previously infected with SARS-CoV-1, MERS-CoV, and seasonal human coronaviruses.