Comparison of influenza-specific neutralizing antibody titers determined using different assay readouts and hemagglutination inhibition titers: good correlation but poor agreement

Determination of influenza-specific antibody titers is commonly done using the hemagglutination inhibition (HAI) assay and the viral microneutralization (MN) assay. Both assays are characterized by high intra- and inter-laboratory variability. The HAI assay offers little opportunity for standardization. For the MN assay, variability might be due to the use of different assay protocols employing different readouts. We therefore aimed to investigate which of the MN assay readout methods currently in use would be the most suitable choice for a standardized MN assay that could serve as a substitute for the HAI assay. For this purpose, human serum samples were tested for the presence of influenza-specific neutralizing antibodies against A/California/7/09 H1N1 (49 sera) or A/Hong Kong/4801/2014 (50 sera) using four different infection readout methods for the MN assay (cytopathic effect, hemagglutination, ELISA, RT-qPCR) and using the HAI assay. The results were compared by correlation analysis and by determining the level of agreement before and after normalization to a standard serum. Titers as measured by the four MN assay readouts showed good correlation, with high Pearson's r for most comparisons. However, agreement between nominal titers varied with the readouts compared and the virus strain used. In addition, Pearson's correlation of MN titers with HAI titers was high, but agreement of nominal titers was moderate and the average difference between the readings of two assays (bias) was virus strain-dependent. Normalization to a standard serum did not result in better agreement of assay results. Our study demonstrates that different MN readouts result in nominally different antibody titers. Accordingly, the use of a common and standardized MN assay protocol will be crucial to minimize inter-laboratory variability. Based on reproducibility, cost effectiveness and unbiased assessment of results, we selected the MN assay with ELISA readout as most suitable for a possible replacement of the HAI assay.
© 2020 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Influenza A and B viruses are highly contagious airborne viruses which cause an infectious respiratory disease of varying severity. Being responsible for 2-5 million severe cases and 290000-650000 deaths per year worldwide, influenza viruses represent a significant burden to society [1][2][3]. Vaccines are the mainstay of infection control during seasonal epidemics and are considered as highly important to mitigate occasional pandemics. The majority of currently available influenza vaccines aims at eliciting antibodies which can prevent the virus from infecting its envisaged target cells. Accordingly, vaccine immunogenicity can be evaluated by determining serum levels of such antibodies [4][5][6].
The two primary serological assays for this purpose are the hemagglutination inhibition (HAI) assay and the virus microneutralization (MN) assay [7][8][9]. The HAI assay is generally used more frequently than the MN assay to measure influenza-specific humoral immunity for multiple reasons: a widely accepted correlate of protection exists for HAI titers, which is not the case for titers measured by MN assay; it requires less hands-on time than the MN assay; and it does not necessarily require Biosafety Level 2 (BSL2) conditions if performed with inactivated virus [10][11][12]. The HAI assay relies on the ability of hemagglutinin-specific antibodies to inhibit the binding between the hemagglutinin of the virus and the sialic acid receptors on the surface of red blood cells (RBCs; mostly from chicken or turkey). This assay thus quantifies antibodies which are able to prevent virus-induced agglutination of RBCs [10,13,14,15]. However, the HAI assay lacks reproducibility and consistency, as shown by several inter-laboratory studies, mainly because it requires reagents which are difficult to standardize, e.g. RBCs [10,14,16]. Red blood cells from various animal species differ in their ability to be agglutinated by influenza viruses, and for some virus subtypes, such as influenza A H5 and H7, the use of RBCs from selected animal species, in these cases horse, is required for efficient hemagglutination [10,16]. Moreover, an increasing number of currently circulating influenza A H3 strains show a reduced ability to agglutinate erythrocytes, and this assay cannot always detect antibody responses to influenza B strains [10,17,18]. Furthermore, in some HAI protocols agglutination is performed at room temperature, while in others it is performed at 0°C, thus introducing yet another cause of variability in the results [3,19,20].
Finally, the well-established HAI titer of 40, reported in the literature as the correlate of 50% protection, might not be as reliable as previously thought, since it is not valid for the whole population (it does not cover the very young and the elderly) and since it was established using viruses that are no longer circulating [10,14,21]. For this reason an HAI titer ≥ 40 is no longer accepted as a threshold of seroprotection by the EMA [22].
The MN assay on the other hand, is an assay for the detection of neutralizing antibodies to influenza viruses in human and animal sera. The assay starts with a virus-antibody reaction step, in which the virus is mixed with dilutions of serum and incubated for a certain time period in order for any antibodies to bind to the virus. The assay then proceeds with a step in which the mixture is inoculated into cells, Madin-Darby Canine Kidney cells (MDCK) in most of the cases, followed by an infection detection step, which allows for indirect evaluation of neutralizing antibody titers. Low levels of infection correlate with high titers of influenza specific neutralizing antibodies [8,18,23,24]. Several protocols exist for MN assays which vary in major steps of the MN protocol: cell seeding conditions, number of cells seeded, virus amount used in the infection step, virus-serum-cells incubation period etc. The biggest difference among MN assays is represented by the infection detection step which can be carried out using different readouts. The most commonly used readouts are: evaluation of cytopathic effect (CPE), detection of viral particles through hemagglutination (HA), detection of viral infectious particles through plaque reduction neutralization test (PRNT) and detection of viral nucleoprotein (NP) through ELISA assay [22,23,25]. In addition, novel readouts are recently emerging, such as quantification of viral RNA by qPCR [26].
The MN assay has been reported to be more sensitive than the HAI assay and the single radial hemolysis (SRH) assay for detection of antibodies to seasonal strains and H5N1 viruses, and it has the great advantage of being a functional assay, detecting antibodies which truly inhibit the infection [10,27,28,29]. As such, the MN assay could potentially substitute the poorly reproducible HAI assay. Nevertheless, the differences in pivotal steps in MN assay protocols, in particular the readout method used, account for a high degree of variability and inconsistency of the results [10,24]. Using a common readout, a harmonized protocol for the MN assay, and a common standard serum could potentially solve this issue [30,31].
Given this situation, the main aim of our study was to investigate which of the several MN assays and readout methods currently in use would be the most adequate choice for implementation in a standardized MN assay protocol in terms of reproducibility, standardization possibilities, cost effectiveness and comparability of results with results from HAI assays. Using two sets of human serum samples we determined MN titers against A/California/7/2009 (H1N1pdm) and A/Hong Kong/4801/2014 (H3N2), respectively, employing CPE, HA, ELISA, and qPCR as readout methods. PRNT readout, despite being commonly used as MN readout, was not included in our study because of the fact that it is time-consuming, and relies greatly on the expertise of the person performing the experiment for the interpretation of the results [18,32]. For each pair of assays, the titers obtained were compared by Pearson's correlation analysis and further analyzed for equivalence using the Bland Altman method [33]. In addition, antibody titers were determined by conventional HAI assay and results were compared to the results of the MN assays. The effect of normalizing titers to a standard serum on comparability among assays was also investigated.

Serum samples
Two panels of human sera were analyzed for the presence of antibodies against influenza A/California/7/09 (H1N1pdm) and A/Hong Kong/4801/2014 (H3N2): 49 sera for H1N1pdm and 50 sera for H3N2 respectively. These 99 sera were collected by the University of Siena from anonymized subjects whose vaccination history was unknown. Standard sheep hyper immune sera (anti-A/California/7/09 HA serum 16/114 and anti-A/Hong Kong/4801/2014-like HA serum 16/182) were purchased from the National Institute for Biological Standards and Control (NIBSC, Potters Bar, UK) and used as positive controls. Human serum minus IgA/IgM/IgG (Sigma) was used as negative control. All sera were treated with sodium citrate dihydrate tribasic solution 8% (Sigma) and heat inactivated at 56°C for 1 h before use.
MDCK cells were used for all MN assays in this study. Cell suspensions or adherent cell monolayers were prepared in different media depending on the MN assay considered.

Red blood cells (RBCs)
Red blood cells (RBCs) from turkey suspended in 50% Alsever solution (2.05% dextrose, 0.8% sodium citrate, 0.055% citric acid, and 0.42% sodium chloride) were used for HAI assay and MN-HA assays. The blood was washed twice with saline solution by two centrifugation steps (15 min at 295g). The pellet was collected to prepare a solution of 0.5% RBCs (v/v) in saline solution.

Viruses
The viruses used in this study were influenza A/California/7/09 (H1N1pdm, 09/268) and A/Hong Kong/4801/2014 (H3N2, 15/184). Both viruses were obtained from NIBSC and propagated in embryonated 10-11 days old chicken eggs according to the following procedure: using a special drill, a small area of the shell in a location corresponding to the air sac of the egg was removed. Influenza virus stock was diluted to obtain approximately 10^-4 pfu/ml in sterile PBS. Using an insulin syringe, 100 µl of virus was inoculated in the allantoic cavity of each egg. The holes in the shell were sealed using wax. The inoculated eggs were left in an incubator at 32-36°C for 24-72 h depending on the strain. Eggs were then left at −20°C for 30-40 min. Using sterile scissors the eggs were opened and the allantoic fluid was collected. The allantoic fluid was then centrifuged at approximately 650g for 10 min at 4°C to remove debris. Virus-containing supernatants were frozen at −80°C for long-term storage.

Virus titration
For each virus, the infectious titer was determined by infecting MDCK cells with 2-fold serial dilutions of the virus and calculating the 50% tissue culture infectious dose (TCID50) using the Spearman-Kärber formula [34]. The viruses were titrated separately for each of the assays, using experimental conditions (cell numbers, virus incubation time, readout etc.) similar to the ones later employed in the MN assay.
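As a minimal sketch of the Spearman-Kärber calculation mentioned above (not the authors' actual worksheet), the following Python function estimates the 50% endpoint from the fraction of infected wells at each dilution. The function name, argument names and well counts are illustrative assumptions. The formula assumes that the dilutions are equally spaced on a log scale and that the series runs from 100% to 0% infection.

```python
import math

def spearman_karber_log10_endpoint(first_log10_dilution, log10_step, fraction_infected):
    """Return log10 of the reciprocal dilution at which 50% of wells are infected.

    first_log10_dilution: log10 of the reciprocal of the first (most concentrated) dilution
    log10_step: log10 of the dilution factor (log10(2) for 2-fold steps)
    fraction_infected: fraction of infected wells at each dilution, in order
    """
    if not fraction_infected or fraction_infected[0] != 1.0:
        raise ValueError("the first dilution must show 100% infection")
    if fraction_infected[-1] != 0.0:
        raise ValueError("the last dilution must show 0% infection")
    return first_log10_dilution - log10_step / 2 + log10_step * sum(fraction_infected)

# Hypothetical 2-fold series (1:2 to 1:32) with 8 replicate wells per dilution
fractions = [8 / 8, 8 / 8, 6 / 8, 2 / 8, 0 / 8]
log10_endpoint = spearman_karber_log10_endpoint(math.log10(2), math.log10(2), fractions)
tcid50_per_inoculum = 10 ** log10_endpoint  # TCID50 contained in one inoculum volume
```

In this made-up example the 50% endpoint falls midway (on the log scale) between the 1:8 and 1:16 dilutions, which matches the observed drop from 75% to 25% infected wells.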

Virus inactivation
For use in HAI assays, the viruses were inactivated using formalin (37 wt%, Sigma Aldrich) at a final concentration of 0.02 wt% overnight in an 18°C water bath.

Microneutralization assays
All assays were performed using the most recent protocols employed in the GLP facility of Vismederi S.r.l. Differences among the protocols in important parameters, such as the absence or presence of FBS during infection or the infection of cells in suspension versus attached to the culture plate, were therefore accepted as given. Serum samples were tested in duplicate in two individual experiments, and geometric mean titers (GMTs) were calculated from the results of the two experiments for the final analysis of the data. If the titers of the two duplicates differed by more than 2 dilution steps, the sera were retested. The sera were serially 2-fold diluted from a starting dilution of 1:10 in 96-well microtiter plates and pre-incubated with live virus (amount depending on the readout, see below) for 1 h in a 37°C, 5% CO2 humidified incubator before incubation with adherent or suspension MDCK cells. After incubation of the virus-serum-cell mixture for the indicated period, the infection of MDCK cells was evaluated through four different readouts.
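The retest rule and the GMT calculation described above can be illustrated with a short Python sketch. The function names and the example titers are hypothetical, and the code is one possible reading of the rule rather than the facility's actual procedure.

```python
import math

def dilution_steps_apart(titer_a, titer_b):
    """Number of 2-fold dilution steps between two reciprocal titers."""
    return math.log2(max(titer_a, titer_b) / min(titer_a, titer_b))

def needs_retest(titer_a, titer_b, max_steps=2):
    """True if two duplicate titers differ by more than max_steps dilution steps."""
    return dilution_steps_apart(titer_a, titer_b) > max_steps

def geometric_mean_titer(titers):
    """Geometric mean, computed as the antilog of the mean log2 titer."""
    return 2 ** (sum(math.log2(t) for t in titers) / len(titers))

# Hypothetical titers of one serum from two independent experiments
gmt = geometric_mean_titer([40, 80])  # about 56.6
```

Averaging on the log scale keeps the result consistent with the 2-fold dilution scheme: a titer pair of 40 and 80 yields a GMT halfway between the two dilution steps.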

Evaluation of the cytopathic effect (CPE)
MDCK cells (350000 cells/ml) were pre-seeded in 96-well plates 48 h before infection in a final volume of 100 µl/well in Episerf Serum Free Medium (Gibco). Supernatants were removed from all the wells and 100 µl of the "virus (2000 TCID50/ml) + serum" mixture was applied to the wells. The medium used to dilute serum and virus in this assay was Episerf Serum Free Medium supplemented with 8 µg/ml of TPCK trypsin (Sigma). The cells were checked for signs of CPE 5 days post-infection by optical microscopy. Serum titers were expressed as the reciprocal of the highest dilution of serum showing less than 50% CPE in the MDCK cell lawn. Non-infected cell monolayers were used as negative control.

Evaluation of hemagglutination
MDCK cells were seeded and infected as described above. Five days post-infection the cell supernatants (100 µl) were harvested and transferred from the MN 96-well plate to a new V-bottom 96-well plate, and 1 vol of 0.5% RBCs in saline solution was added to each well. The plates were checked for hemagglutination after a 2 h incubation period at room temperature (RT) and the serum titers were expressed as the reciprocal of the highest dilution of serum that prevented agglutination. Supernatant from a non-infected cell monolayer was used as negative control.

Detection of influenza A virus nucleoprotein (NP) by ELISA assay
MDCK cells (150000 cells/ml) were seeded in 96-well plates in a final volume of 100 µl/well in EMEM supplemented with 0.5% FBS. 100 µl of the "virus (2000 TCID50/ml) + serum" mixture was applied to the cells right after seeding. The final volume in each well was then 200 µl.
18-22 h post-infection the ELISA assay was performed using the ELISA Starter Accessory Kit (Bethyl). To improve the sensitivity of NP detection, a mixture of two mouse monoclonal antibodies specific for influenza A NP was employed at a dilution of 1:4000 and in a 1:1 ratio (MAB8257, MAB8258, Merck). Horseradish peroxidase (HRP)-labeled goat anti-mouse IgG (H + L) (Invitrogen) was used as secondary antibody at a dilution of 1:16000. 3,3',5,5'-Tetramethylbenzidine (TMB) was used as substrate for HRP and the OD (measured at 450 nm) of each well was determined by ELISA reader. The OD value x, which represents the cut-off for virus neutralization, was calculated from the cell control (CC), which consisted of non-infected cells, and the virus control (VC), which consisted of infected cells without addition of serum [3]. All wells with an OD (450 nm) below or equal to x were considered positive for neutralization activity. The neutralizing titer was determined as the reciprocal of the highest serum dilution with an OD ≤ x.
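The cut-off formula itself is not reproduced in the text above. Purely as an assumption for illustration, the Python sketch below uses the midpoint between the mean VC and mean CC optical densities, a convention commonly used for ELISA-based MN assays; it may differ from the formula the authors applied. The function names, OD values and the fallback titer of 5 (half the starting dilution of 1:10) are likewise illustrative.

```python
def od_cutoff(vc_ods, cc_ods):
    """Assumed cut-off x: midpoint between mean virus-control and mean cell-control OD."""
    mean_vc = sum(vc_ods) / len(vc_ods)
    mean_cc = sum(cc_ods) / len(cc_ods)
    return (mean_vc - mean_cc) / 2 + mean_cc

def elisa_neutralizing_titer(reciprocal_dilutions, ods, cutoff, below_detection=5):
    """Reciprocal of the highest serum dilution whose OD is <= cutoff.

    Returns below_detection if no well reaches the cut-off.
    """
    positive = [d for d, od in zip(reciprocal_dilutions, ods) if od <= cutoff]
    return max(positive) if positive else below_detection

# Hypothetical plate readings (OD at 450 nm)
x = od_cutoff(vc_ods=[2.0, 2.0], cc_ods=[0.25, 0.25])
titer = elisa_neutralizing_titer([10, 20, 40, 80, 160], [0.3, 0.4, 0.9, 1.5, 1.9], x)
```

With these made-up readings the wells at 1:10, 1:20 and 1:40 fall at or below the cut-off, so the serum would be assigned a titer of 40.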

Detection of the influenza A virus Matrix 1 (M1) gene by RT-qPCR
The use of qPCR for the quantification of viral RNA was based on amplification of the Matrix protein 1 (M1)-encoding sequence and was performed according to Teferedegne et al. with some modifications [26]. Briefly, MDCK cells (300000 cells/ml) were seeded in 96-well plates in a final volume of 100 µl/well in Minimum Essential Medium (EMEM, Lonza) supplemented with 0.5% Fetal Bovine Serum (FBS, Lonza). 100 µl of the "virus (20000 TCID50/ml) + serum" mixture was added to the cells right after seeding. The final volume in each well was then 200 µl.
6 h post-infection, cytoplasmic RNA from the cells in each well was extracted as follows. Cell supernatants were discarded and cell lysates were obtained by incubating iScript™ RT-qPCR Sample Preparation Reagent (Biorad) on the cells for 1 min at RT (24°C); cell lysates were then collected and stored frozen at −80°C. Upon thawing, the RNA was reverse transcribed into cDNA using the PrimeScript RT reagent Kit (Takara). The M1 gene was quantified from each cDNA sample through quantitative PCR. The sequences of the primers used were AAGACCAATCCTGTCACCTCTGA for the M1 forward primer and CAAAGCGTCTACGCTGCAGTCC for the M1 reverse primer (both primer sequences are given 5'-3'). The protocol used in the thermocycler was the following: 95°C for 10 min (1×), 95°C for 15 s and 60°C for 1 min (40×), and finally 95°C for 15 s.
In parallel to each test, 10-fold serial dilutions of plasmid DNA containing the M1 sequence were run in order to generate a standard curve for M1 copy number calculation in our samples of interest. Furthermore, for each test, both cell controls and virus controls were run. Neutralizing antibody titers were calculated as the reciprocal of the highest dilution of serum showing a decrease in M1 copy number of at least 90% with respect to the virus control.
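The two calculation steps described above, converting quantification cycles to copy numbers via a plasmid standard curve and then applying the 90% reduction criterion, could be sketched in Python as follows. This is an illustration under assumptions: the function names, the Cq values and the copy numbers are invented, and the standard curve is fitted by ordinary least squares on log10 copies.

```python
import math

def fit_standard_curve(copies, cq_values):
    """Least-squares fit of Cq = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cq_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cq_values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def copies_from_cq(cq, slope, intercept):
    """Invert the standard curve to obtain a copy number from a Cq value."""
    return 10 ** ((cq - intercept) / slope)

def qpcr_neutralizing_titer(reciprocal_dilutions, sample_copies, vc_copies,
                            min_reduction=0.90, below_detection=5):
    """Reciprocal of the highest dilution with at least min_reduction fewer M1 copies than VC."""
    neutralized = [d for d, c in zip(reciprocal_dilutions, sample_copies)
                   if 1 - c / vc_copies >= min_reduction]
    return max(neutralized) if neutralized else below_detection

# Hypothetical 10-fold plasmid standards and their Cq values
slope, intercept = fit_standard_curve([1e2, 1e3, 1e4, 1e5], [33.0, 30.0, 27.0, 24.0])
# Hypothetical M1 copy numbers for one serum and for the virus control
titer = qpcr_neutralizing_titer([10, 20, 40, 80], [1e3, 5e3, 2e4, 4e5], vc_copies=5e5)
```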

Hemagglutination inhibition assay
The hemagglutination units (HAU) of each inactivated virus preparation were quantified by evaluating hemagglutination activity in serially diluted virus suspensions in the presence of 0.5% turkey RBCs. The virus titer was calculated as the reciprocal of the highest dilution of virus showing complete hemagglutination of RBCs. Serum samples were serially 2-fold diluted from a starting dilution of 1:10 in 96-well microtiter plates (25 µl per well). Inactivated virus suspension containing 4 HAU/25 µl was added to each well containing diluted serum (25 µl per well). The plates were manually agitated for 10 s. Plates were covered and incubated at RT for 1 h. Finally, 0.35% turkey RBCs in saline solution were added to each well (50 µl per well). Following 1 h incubation at RT, hemagglutination and its inhibition were read in each well. Wells containing only PBS were used as negative controls, while sheep hyperimmune sera were used as positive controls (see section "Serum samples"). Serum titers were expressed as the reciprocal of the highest dilution of serum that did not show agglutination. For both MN and HAI assays, if the initial dilution did not give a positive titer, the titer was recorded as half the minimum detectable titer for calculation purposes, e.g. 5 in the MN assay if the starting dilution was 1:10.
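A small Python sketch can make the arithmetic of this protocol concrete: how far a virus stock must be diluted to reach 4 HAU per well volume, and how a titer is read from a row of wells, including the half-minimum rule for negative sera. The function names and well patterns are hypothetical, and the sketch assumes that the HA titer of the stock refers to the same 25 µl volume as the working suspension.

```python
def working_virus_dilution(ha_titer, target_hau=4):
    """Dilution factor that brings a stock of ha_titer HAU per volume to target_hau per volume."""
    if ha_titer < target_hau:
        raise ValueError("stock is too weak to prepare the working suspension")
    return ha_titer / target_hau

def two_fold_series(start=10, n_wells=8):
    """Reciprocal serum dilutions of a 2-fold series starting at 1:start."""
    return [start * 2 ** i for i in range(n_wells)]

def hai_titer(agglutination_by_well, start=10):
    """Reciprocal of the highest dilution without agglutination.

    agglutination_by_well: True where agglutination was observed, ordered from
    the lowest to the highest serum dilution. If no well shows inhibition, the
    titer is recorded as half the starting dilution.
    """
    dilutions = two_fold_series(start, len(agglutination_by_well))
    inhibited = [d for d, agg in zip(dilutions, agglutination_by_well) if not agg]
    return max(inhibited) if inhibited else start / 2

# Hypothetical readings: inhibition in the first two wells only
titer = hai_titer([False, False, True, True])
```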

Data analysis
Linearity of the MN assay data obtained using different readouts was evaluated by fitting a linear regression line and calculating the respective R² in GraphPad Prism software (version 5.01).
Correlations between MN assay results obtained with different readouts and between MN and HAI assay results were calculated by Pearson correlation test; p values were corrected for multiple comparisons by the Bonferroni-Šidák method using GraphPad Prism software (version 8.0). Comparison between different Pearson's correlation coefficients was performed using MedCalc® statistical software. Agreement between titers determined using the different readouts for the MN assay and between titers determined by MN and HAI assay, respectively, was analyzed graphically using Bland-Altman plots [33].
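For readers who want to reproduce this kind of analysis outside GraphPad Prism, the following Python sketch computes Pearson's r for a pair of assays and a Šidák-adjusted p value. It is an illustration under assumptions: the log2 transformation of titers, the single-step Šidák formula, the function names and the example titers are not taken from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def sidak_adjust(p_value, n_comparisons):
    """Single-step Sidak adjustment of a p value for n_comparisons tests."""
    return 1 - (1 - p_value) ** n_comparisons

# Hypothetical titers of the same four sera measured with two readouts
titers_a = [10, 20, 40, 80]
titers_b = [20, 40, 80, 160]
r = pearson_r([math.log2(t) for t in titers_a], [math.log2(t) for t in titers_b])
```

In this made-up example every titer from the second readout is exactly twice that of the first, so r equals 1 even though no serum receives the same nominal titer in both assays.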

MN assays and HAI assay show good linearity and repeatability
When comparing two methods it is important that both show good repeatability. When there is considerable variability in repeated measurements, correlation and agreement between the two methods are bound to be poor. For this reason, we first evaluated linearity and repeatability for each of the four MN assay readouts and for the HAI assay. Each criterion was evaluated individually using standard hyper-immune sera specific for the strains of H1N1pdm09 and H3N2 viruses used. A negative control (human serum depleted of IgA/IgM/IgG) was also included for the evaluation of repeatability. For all assays, repeatability was evaluated for each strain by performing multiple replicate titrations of standard samples (positive and negative sera) on different days. Titers obtained for the standard sera differed by at most 1 dilution step in most cases (Table 1). Linearity of the MN assays and the HAI assay was evaluated by performing the assays starting from different initial dilutions of the H1N1pdm09- and H3N2-specific standard positive sera and by plotting log2 of the initial serum dilutions against the last dilution on the plate where the cells were found to be still protected from infection (MN) or where hemagglutination was still inhibited (HAI). After fitting the data with a linear regression model, the coefficient of determination (R²) was determined for each assay. R² ranged between 0.96 and 0.99 for all the MN assays (Fig. 1), thus confirming the ability of the assays to return titers that are directly proportional to and linearly correlated with the concentration of the antibodies in the sample.

Correlation of MN assay results varies with the readouts compared and the virus strain used
We were next interested in evaluating in how far MN antibody titers determined using different assay readouts correlate with each other. For this purpose 99 human serum samples were tested for the presence of influenza specific neutralizing antibodies against A/California/7/09 H1N1 (49 sera) or A/Hong Kong/4801/2014 (50 sera) using four different infection readout methods for the MN assay, namely CPE, HA, ELISA, and RT qPCR.
Pearson's r correlation coefficients were calculated by comparing the titers measured through the four MN assays in a pairwise fashion (Fig. 2). We observed statistically significant Pearson's correlation in all pairwise comparisons (p = 0.0006). When comparing r coefficients we observed statistically significant differences for just 2 out of 12 pairwise comparisons. Interestingly, assays with rather similar experimental conditions, such as ELISA vs qPCR, did not correlate better than assays with different conditions, such as CPE vs ELISA (Fig. 2). Moreover, the pattern of correlations among the different MN assays was not the same for the two influenza virus strains included in the study. This result indicates that MN titers determined using different assay readouts correlate to various degrees and that the performance of some assays depends on the virus strain investigated.

MN assays with different readouts show poor agreement
Correlation analysis, even though informative and widely used in literature, does not tell us to what extent two assays deliver the same nominal antibody titer for a given serum sample. We therefore determined the level of agreement in the results of each pair of different assays using the method described by Bland and Altman [33]. This method makes use of a simple graphical technique for which the average of two readings derived from two different assays is plotted against the difference of the same readings.
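The coordinates of such a plot can be computed with a few lines of Python. The sketch below is illustrative only: the function name and the example titers are hypothetical, and working on log10-transformed titers is an assumption that matches the axes used in the figures of this study.

```python
import math

def bland_altman_points(titers_a, titers_b):
    """Return (means, differences, mean_difference) on the log10 titer scale.

    means are the x coordinates and differences the y coordinates of the plot.
    """
    log_a = [math.log10(t) for t in titers_a]
    log_b = [math.log10(t) for t in titers_b]
    means = [(a + b) / 2 for a, b in zip(log_a, log_b)]
    differences = [a - b for a, b in zip(log_a, log_b)]
    mean_difference = sum(differences) / len(differences)
    return means, differences, mean_difference

# Hypothetical titers of three sera measured with two readouts
means, differences, mean_difference = bland_altman_points([40, 80, 160], [20, 40, 80])
```

Here the first readout always gives twice the titer of the second, so every difference equals log10(2), about 0.30. The two readouts would correlate perfectly and yet disagree systematically.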
Applying this technique to the data obtained from the MN assays revealed that for H1N1pdm09, agreement was fairly good for CPE vs qPCR and ELISA vs qPCR (differences fairly close to zero) but rather poor for any of the other comparisons (Fig. 3A-F). For H3N2 virus, reasonable agreement was observed for CPE vs HA, CPE vs ELISA and HA vs ELISA but not for the other comparisons (Fig. 3G-N). In order to make the level of (dis-)agreement more quantifiable, we defined a difference in nominal titers of ±2.5-fold (log10 2.5 = 0.3979) as acceptable and required that, for a good agreement, at least 90% of the readings fall in this interval. To determine whether this was the case we indicated the respective intervals in the plots, and assessed the percentage of values falling in these intervals (Table 2). For the two tests which were in best agreement, ELISA and qPCR for H1N1, CPE and ELISA for H3N2, indeed 90% of all readings fell into the ±2.5-fold interval. However, for the two tests in worst agreement, HA and ELISA for H1N1, CPE and qPCR for H3N2, only 12% and 16% of the readings fell into the set interval, respectively. This clearly indicates that titers determined by one or the other readout are, with few exceptions, not interchangeable. Moreover, the readouts giving the best comparable titers differed for the two viruses evaluated, similarly to what we observed when correlation between assays was assessed.
Fig. 3. Agreement of MN titers as determined using different assay readouts. Bland-Altman plots were created as graphical representation of agreement between two MN assay readouts by plotting the differences of the log10 titers between measurements obtained from two readouts for each serum (y axis) against their mean value (x axis). The limits of agreement were set to ±0.39 on the y axis, which represents a 2.5-fold change in titers between one assay and the other (log10 2.5 = 0.39). H1N1pdm09 (A-F) and H3N2 (G-N).
The plots, however, show that for some comparisons there was a systematic difference in the readings from two assays, meaning that within a pairwise comparison one assay consistently resulted in higher or lower readings than the other (e.g. CPE vs HA for H1N1pdm). The average difference between the readings of two assays is called the bias. The calculated biases for all comparisons are listed in Table 2. It becomes immediately obvious that the calculated biases vary largely for the different comparisons and are rather different for the assays performed with H1N1pdm09 and H3N2, respectively.
The bias could potentially be used as a correction factor to make readings from two different assays more comparable. To investigate this possibility, in Fig. 4, we plotted the same data as in Fig. 3 but set the ±2.5-fold interval around the respective bias. This improved the percentage of values falling into the acceptance interval considerably in most cases (Table 2). In particular, the readings for CPE and HA and for CPE and ELISA became rather well interchangeable by this exercise for both viruses. Yet, given the large difference in the biases for the two viruses (2.5 and −0.4 for H1N1pdm09, 1.5 and 1 for H3N2), comparability of nominal titers between assays and for different viruses remains problematic.
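The effect of centering the acceptance interval on the bias can be illustrated with a short Python sketch. The function name and the example titers are hypothetical; the half-width of log10(2.5) and the calculation of the bias as the mean difference of log-transformed titers follow the description in this section.

```python
import math

LIMIT = math.log10(2.5)  # half-width of the acceptance interval on the log10 scale

def percent_within_interval(titers_a, titers_b, center_on_bias=False, limit=LIMIT):
    """Percentage of sera whose log10 titer difference lies within +/- limit.

    The interval is centered on zero, or on the bias (mean difference)
    when center_on_bias is True.
    """
    diffs = [math.log10(a) - math.log10(b) for a, b in zip(titers_a, titers_b)]
    center = sum(diffs) / len(diffs) if center_on_bias else 0.0
    inside = sum(1 for d in diffs if abs(d - center) <= limit)
    return 100.0 * inside / len(diffs)

# Hypothetical titers: assay A reads 4- to 8-fold higher than assay B
titers_a = [80, 160, 320, 640]
titers_b = [20, 40, 80, 80]
uncorrected = percent_within_interval(titers_a, titers_b)
corrected = percent_within_interval(titers_a, titers_b, center_on_bias=True)
```

In this invented data set no serum falls within the interval around zero, while all sera fall within the interval around the bias. This shows why a bias correction can raise the percentages considerably without making the nominal titers themselves equal.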

MN and HAI assays show good correlation and poor agreement with an overall tendency for HAI titers to be lower than MN titers
Additionally, we were interested in understanding whether MN assays could be used as an alternative to HAI assays and which of the MN assay readouts currently in use would be preferable for rendering results best comparable with HAI in terms of both correlation and agreement. In order to answer this question, titers as determined by each of the four MN assays were compared pairwise with titers as determined by HAI assay and Pearson's r correlation coefficient was calculated for every pairwise comparison (Fig. 5). Correlation between the titers of MN and HAI assays was statistically significant in each of the comparisons (p = 0.0006), confirming earlier published results [8,22,35]. When comparing r coefficients we observed statistically significant differences for just 2 out of 12 pairwise comparisons. Analysis of agreement between results from MN assays with the different readouts and from HAI assays using Bland-Altman plots showed that for H1N1pdm09, agreement was fairly good for CPE vs HAI and RT-qPCR vs HAI, but rather poor for the other comparisons (Fig. 6A-D). For H3N2, agreement was reasonably good for HA vs HAI but not for the other comparisons (Fig. 6E-H). The agreement intervals used for the Bland-Altman plots were again set such that they would match a 2.5-fold difference in titers between one assay and the other, following the same rationale mentioned above. In order to score the agreement, the percentages of values falling in these intervals were again quantified (Table 3). The agreement between MN and HAI was fairly poor for all readouts, and for the two tests in worst agreement (ELISA vs HAI), only 38% (H1N1pdm09) and 50% (H3N2) of the readings fell into the set interval, respectively. In no case did at least 90% of the readings fall in the ±2.5-fold interval. This result clearly indicates that titers determined by MN or HAI assays are, in the context of our present study, not interchangeable.
Moreover, the pairs of assays giving the best comparable titers differed for the two viruses evaluated, similarly to what happened in the comparison among MN assays.
The Bland-Altman plots indicate that MN assays consistently conveyed higher titers than the HAI assay in most of the cases for both influenza virus strains analyzed in this study. This becomes evident when we examine the biases between the various MN and HAI assays (Table 3): the average difference in titers (MN vs HAI) indeed appears to be higher than 0 in 3 out of 4 cases for both influenza virus strains. This finding confirms what has been reported in the literature: MN assays tend to be more sensitive than HAI assays [14,25]. The bias could potentially be used as a correction factor to make readings from two different assays more comparable. To investigate this possibility, we again set the ±2.5-fold interval around the respective bias (Fig. 7). Even though this exercise improved the percentage of values falling into the acceptance interval in some cases, it becomes immediately obvious that agreement remains poor for HAI and most MN assay readouts. Moreover, the calculated biases vary largely for the different paired comparisons and are rather different for the assays performed with H1N1pdm09 and H3N2, respectively. Given these results, employing the bias as a correction factor would not bring significant advantages.
* The limits of agreement used for the Bland-Altman plots were set such that they would match a 2.5-fold difference in titers between assays, and the percentage of sera falling in the agreement interval was determined before and after the introduction of the bias as correction factor. Bias was calculated as the mean difference between logarithmically transformed titers.

The use of a standard serum does not lead to better agreement of titers measured with different assays
Expressing MN or HAI titers relative to the titer of a standard serum could potentially help to tackle the issue of inter-assay and inter-laboratory variability in nominal antibody titers [25]. We therefore randomly chose two serum samples for H1N1pdm09 and two serum samples for H3N2 as 'standards', and expressed all other titers relative to these samples. If normalization to a standard made readings across assays more comparable, then the normalized values should be the same for all assays. Inspection of Supplementary Table 1 immediately shows that this is not the case. In order to make the level of (dis-)agreement more quantifiable, we calculated ratios between normalized titers derived from two different assays and defined that these should not vary by more than 2-fold (thus fall in the interval 0.5 ≤ x ≤ 2). We then determined the percentages of values falling in the indicated interval and defined that for a good agreement this percentage should be at least 90%. As depicted in Table 4, for the vast majority of the comparisons the percentage of values in the indicated interval was far below 90%. Moreover, agreement varied largely with the serum used as standard, the assays compared and the virus strain investigated. Interestingly, while in some cases the use of normalized instead of actual titers resulted in an improvement of agreement between assays, this was not true in other cases (compare Table 4 to Tables 2 and 3). Thus, even after normalization to a standard serum, agreement between assays remained poor, with few exceptions.
Fig. 4. Agreement of MN titers as determined using different assay readouts upon introduction of the "bias" as correction factor. The bias was calculated as the mean difference between measurements of two assays. Bland-Altman plots were re-designed by introduction of the "bias" as a correction factor both for H1N1pdm09 (A-F) and H3N2 (G-N). The limits of agreement were set to ±0.39 on the y axis around the bias.
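The normalization and the subsequent ratio check can be sketched in Python as follows. The serum identifiers, titers and function names are hypothetical and serve only to illustrate the calculation described above.

```python
def normalize_to_standard(titers, standard_id):
    """Express every titer relative to the titer of the serum chosen as standard."""
    standard = titers[standard_id]
    return {sid: t / standard for sid, t in titers.items() if sid != standard_id}

def percent_ratios_in_interval(norm_a, norm_b, low=0.5, high=2.0):
    """Percentage of sera whose ratio of normalized titers lies in [low, high]."""
    ratios = [norm_a[sid] / norm_b[sid] for sid in norm_a if sid in norm_b]
    inside = sum(1 for r in ratios if low <= r <= high)
    return 100.0 * inside / len(ratios)

# Hypothetical titers of the same sera in two assays; "std" is the chosen standard
assay_a = {"std": 80, "s1": 160, "s2": 40, "s3": 640}
assay_b = {"std": 40, "s1": 80, "s2": 80, "s3": 40}
percent_ok = percent_ratios_in_interval(
    normalize_to_standard(assay_a, "std"),
    normalize_to_standard(assay_b, "std"),
)
```

In this invented example only one of the three sera has normalized titers that agree within 2-fold between the two assays, which would fall well short of a 90% criterion.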

Discussion
The aim of this study was to investigate whether MN assays could successfully replace the HAI assay given the numerous drawbacks of the latter, and at the same time to find out which MN assay readout method would be most appropriate for a potential future standard protocol. Our results show that MN assay protocols involving CPE, HA, ELISA and qPCR as readouts all performed well in terms of linearity and reproducibility, with consistently high Pearson's correlations. On the other hand, agreement between nominal titers varied with the readouts compared and the virus strain used. Overall agreement was moderate and could not easily be improved by introduction of a correction factor (bias). Moreover, correlation of MN titers with HAI titers was high, but agreement of nominal titers was moderate and the correction factor was virus strain-dependent. Normalizing titers to a standard serum did not improve the comparability among different assays.
Serum titers obtained with different MN readouts have only rarely been compared before. In a recent study, Laurie et al. compared the 2-day MN-ELISA with the 3-day MN-CPE or MN-HAI, respectively, in different laboratories. They observed a high correlation between MN-ELISA and MN-CPE in more than half of the laboratories involved in the study [24]. This observation is in line with the results of our study, which demonstrate high correlation between different MN assay readouts. However, the study by Laurie et al. did not address agreement of nominal titers determined using the different MN readouts. Numerous earlier studies measured the correlation between HAI and MN assays and observed a high correlation between them as well as a tendency for MN assays to be more sensitive than HAI assays [8,22,35]. Our results in this context appear to confirm the existing literature. Yet, most of the studies comparing MN and HAI assays did not measure or do not mention the actual agreement and equivalency between nominal antibody titers [8,22,25,35]. An exception is the study of Truelove et al., which, in line with our results, reports poor equivalency between titers obtained with the two different assays [8].
We believe that a focus on correlation might have led to an overestimation of the comparability of the HAI and MN assays. Indeed, our study shows that, despite good correlation, the two assays did not deliver comparable nominal titers in most cases.
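The distinction between correlation and agreement can be made concrete with a small sketch. The helper names and titer values below are hypothetical and not taken from this study. The sketch assumes that Pearson's r is computed on log10-transformed titers, that the bias is the mean difference of log10 titers, and that the uncorrected agreement interval is centered on zero while the corrected one is centered on the bias, with limits of ±0.39 as used for the Bland-Altman plots.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def bland_altman_agreement(titers_a, titers_b, limit=0.39, correct_bias=False):
    """Return (bias, percentage of sera within +/- limit) on the log10 scale.

    Without correction the interval is centered on zero (assumption);
    with correction it is centered on the bias.
    """
    diffs = [math.log10(a) - math.log10(b) for a, b in zip(titers_a, titers_b)]
    bias = mean(diffs)
    center = bias if correct_bias else 0.0
    within = sum(1 for d in diffs if abs(d - center) <= limit)
    return bias, 100.0 * within / len(diffs)


# Hypothetical example: assay A reads consistently 4-fold higher than assay B.
assay_b = [20, 40, 80, 160, 320]
assay_a = [4 * t for t in assay_b]

r = pearson_r([math.log10(t) for t in assay_a], [math.log10(t) for t in assay_b])
bias, pct_raw = bland_altman_agreement(assay_a, assay_b)
_, pct_corrected = bland_altman_agreement(assay_a, assay_b, correct_bias=True)
print(r, bias, pct_raw, pct_corrected)
```

In this constructed case the correlation is essentially perfect, yet no serum falls within the agreement interval until the bias is applied as a correction factor. Real data, as in this study, additionally show scatter around the bias, so that a correction factor alone does not necessarily restore agreement.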
A possible solution for poor agreement of nominal antibody titers between assays or between laboratories is the use of a standard serum for normalization. However, in our study the use of normalized titers (relative to a standard) instead of the original titers in most cases did not lead to a significant improvement in agreement between different assays for MN-MN and MN-HAI comparisons, and this was true for both influenza virus strains analyzed. Stephenson et al. previously showed that normalization of neutralizing antibody titer values to a standard reduced the geometric coefficient of variation of results obtained in different laboratories [25]. However, only a single serum was used as standard and no information is given about the assay protocols used. From our results we conclude that the potential future use of a standard serum sample on its own will not be sufficient to ensure comparability of results obtained with different assays or in different laboratories.
This implies that replacement of the HAI assay by an MN assay would be successful only if a consensus standard protocol is used for the latter [2,24,36]. Among the readouts studied, we would favor the MN-ELISA as the choice for a new standard assay for the following reasons: (i) The MN-ELISA would be easiest to standardize in terms of reagents and procedure. The reagents in this case are widely available and can easily be harmonized between different facilities. (ii) The MN-ELISA readout depends on an instrument (ELISA reader) and is therefore far less subjective than the MN-CPE or MN-HA readouts, which rely greatly on the expertise of the technician performing the experiment. (iii) A standard protocol already exists for MN-ELISA in the "Manual for the laboratory diagnosis and virological surveillance of influenza", which might represent an effective starting point [3]. (iv) Among the four MN assays analyzed in this study, the MN-ELISA is only moderately expensive and delivers results quickly.
A limitation of our study may be the limited number of serum samples employed. Yet, we believe the number of samples was still high enough to permit valid and reliable statistical analysis. A second limitation may be the use of only two influenza virus strains. However, the two viruses employed are the two influenza A virus strains which are currently circulating in the human population and represent Group 1 and Group 2 viruses, respectively [4]. Moreover, even the use of only these two virus strains demonstrated clear effects of the strain on MN and HAI assay results.
In conclusion, in this study we demonstrate that MN assay results depend on the readout method used, necessitating agreement on a consensus assay protocol for future studies. On the basis of affordability, speed and possibilities for standardization we identified the MN assay with ELISA readout as the best candidate to replace the HAI assay. Development and introduction of a consensus protocol as well as determination of inter-laboratory variability of results will be up to international consortia such as FLUCOP and CONSISE which already have started these activities [10,14,30,31]. Ideally, these efforts will also lead to the definition of a new correlate of protection potentially universally valid for this MN assay.

* The limits of agreement used for the Bland-Altman plots were set such that they would match a difference of 2.5-fold change in titers between assays, and the percentage of sera falling in the agreement interval was determined before and after the introduction of the bias as correction factor. Bias was calculated as mean difference between logarithmically transformed titers.

Fig. 7. Agreement between MN titers (as determined using different assay readouts) and HAI titers upon introduction of the "bias" as correction factor. The bias was again calculated as the mean difference between measurements of two assays. Bland-Altman plots comparing MN assays and HAI assays among each other were re-designed by introduction of the "bias" as a correction factor. The limits of agreement were set to ±0.39 on the y axis around the bias for both H1N1pdm09 (A-D) and H3N2 (E-H).

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

* All titers were normalized by the respective reading for the indicated serum for each of the assays. Subsequently, ratios of normalized values were calculated and the percentage of sera falling in the interval 0.5 ≤ x ≤ 2 was determined.