Using interpretable machine learning to extend heterogeneous antibody-virus datasets

SUMMARY

A central challenge in biology is to use existing measurements to predict the outcomes of future experiments. For the rapidly evolving influenza virus, variants examined in one study will often have little to no overlap with other studies, making it difficult to discern patterns or unify datasets. We develop a computational framework that predicts how an antibody or serum would inhibit any variant from any other study. We validate this method using hemagglutination inhibition data from seven studies and predict 2,000,000 new values ± uncertainties. Our analysis quantifies the transferability between vaccination and infection studies in humans and ferrets, shows that serum potency is negatively correlated with breadth, and provides a tool for pandemic preparedness. In essence, this approach enables a shift in perspective when analyzing data from "what you see is what you get" to "what anyone sees is what everyone gets."


INTRODUCTION
Our understanding of how antibody-mediated immunity drives viral evolution and escape relies upon painstaking measurements of antibody binding, inhibition, or neutralization against variants of concern. 1 While antibodies can cross-react and inhibit multiple variants, viral evolution slowly degrades such immunity, leading to periodic reinfections that elicit new antibodies. To get an accurate snapshot of this complex response, we must not only measure inhibition against currently circulating strains but also against historical variants. 2,3 Every antibody-virus interaction is unique because (1) the antibody response (serum) changes even in the absence of viral exposure and (2) for rapidly evolving viruses such as influenza, the specific variants examined in one study will often have little to no overlap with other studies (Figure 1). This lack of crosstalk hampers our ability to comprehensively characterize viral antigenicity, predict the outcomes of viral evolution, and determine the best composition for the annual influenza vaccine. 4

In this work, we develop a new cross-study matrix completion algorithm that leverages patterns in antibody-virus inhibition data to infer unmeasured interactions. Specifically, we demonstrate that multiple datasets can be combined to predict the behavior of viruses that were entirely absent from one or more datasets (e.g., Figure 2A, predicting values for the green viruses in dataset 2 and the gray viruses in dataset 1). Whereas past efforts could only predict values for partially observed viruses within a single dataset (i.e., predicting the red squares for the blue/gray viruses in dataset 2 or the green/blue viruses in dataset 1), 5-7 here we predict the behavior of viruses that do not have a single measurement in a dataset.
MOTIVATION

To quantify the immune response against a rapidly evolving virus, groups routinely measure antibody inhibition against many virus variants. Over time, the variants being studied change, and there is a need for methods that infer missing interactions and distinguish between confident predictions and hallucinations. Here, we develop a matrix completion framework that uses patterns in antibody-virus inhibition to infer the value and confidence of unmeasured interactions. This same approach can combine general datasets, from drug-cell interactions to user movie preferences, that have partially overlapping features.

Algorithms that predict the behavior of large virus panels are crucial because they render the immunological landscape in higher resolution, helping to reveal which viruses are potently inhibited and which escape antibody immunity. 3,4 For example, polyclonal human sera that strongly neutralize one virus may exhibit 10× weaker neutralization against a variant with one additional mutation. 8 Given the immense diversity and rapid evolution of viruses, it behooves us to pool together measurements from different studies and build a more comprehensive description of serum behavior.
Even when each dataset is individually complete, many interactions can still be inferred by combining studies. The seven datasets examined in this work measured 60%-100% of interactions between their specific virus panel and sera, but against an expanded virus panel containing all variants, fewer than 10% of interactions were measured. Moreover, the missing entries are highly structured, with entire columns (representing viruses; Figure 2A) missing from each dataset. Standard matrix completion approaches assume that values are missing at random. 15,16 In contrast, we construct a framework that harnesses the specific structure of these missing values, enabling us to predict over 2,000,000 new values comprising the remaining 90% of interactions.
The key feature we develop that enables matrix completion across studies is error quantification. Despite numerous algorithms to infer missing values, only a few methods exist that can estimate the error of these predictions under the assumption that missing values are randomly distributed, 17,18 and to our knowledge, no methods can quantify error for general patterns of missing data. Because we do not know a priori whether datasets can inform one another, it is crucial to estimate the confidence of cross-study predictions. Our framework does so using a data-driven approach to quantify the individual error of each prediction so that users can focus on high-confidence inferences (e.g., those with ≤4-fold error) or search for additional datasets that would further reduce this uncertainty.
Our results provide guiding principles in data acquisition and promote the discovery of new mechanisms in several key ways: (1) Existing antibody-virus datasets can be unified to predict each serum against any virus, providing a massive expansion of data and fine-grained resolution of these antibody responses. (2) This expanded virus panel enables an unprecedented direct comparison of human ↔ ferret and vaccination ↔ infection studies, quantifying how distinct the antibody responses are in each category. (3) Using the expanded data, we explore the relation between two key features of the antibody response, showing the tradeoff between potency and breadth. (4) We demonstrate an application for pandemic preparedness, where the inhibition of a new variant measured in one study is immediately extrapolated to other datasets. (5) Our approach paves the way to rationally design virus panels in future studies, saving time and resources by measuring a substantially smaller set of viruses. In particular, we determine which viruses will be maximally informative and quantify the benefits of measuring each additional virus.
Although this work focuses on antibody-virus inhibition measurements for influenza, it readily generalizes to other viruses, other assays (e.g., using binding or neutralization), and more general applications involving intrinsically low-dimensional datasets.

The low dimensionality of antibody-virus interactions empowers matrix completion
Given the vast diversity of antibodies, it is easy to imagine that serum responses cannot inform one another. 21-23 Yet much of the heterogeneity of antibody responses found through sequencing 24 collapses when we consider functional behavior such as binding, inhibition, or neutralization against viruses. 25,26 Previous efforts have leveraged this low dimensionality to predict antibody-virus interactions. 27,28 However, these efforts have almost exclusively focused on individual datasets of ferret sera generated under controlled laboratory conditions, circumventing the many obstacles of predicting across heterogeneous human studies.
In the following sections, we develop a matrix completion algorithm that predicts measurements for a virus in dataset 1 (e.g., the virus-of-interest in Figure 2A, boxed in gold) by finding universal relationships between the other overlapping viruses and the virus-of-interest in dataset 2 and applying them to dataset 1. We first demonstrate the accuracy of matrix completion by withholding all hemagglutination inhibition (HAI) measurements from one virus in one dataset (Figure 2A, gold boxes) and using the other datasets to generate predictions ± errors, where each error quantifies the uncertainty of a prediction. Although we seek accurate predictions with low estimated error, it may be impossible to accurately predict some interactions (e.g., measurements of viruses from 2000-2010 may not be able to predict a distant virus from 1970), and those error estimates should be larger to faithfully reflect this uncertainty. After validating our approach on seven large serological studies, we apply matrix completion to greatly extend their measurements.

Cross-study matrix completion using a random forest
We first predict virus behavior between two studies before considering multiple studies. Figure 2 and Box 1 summarize the leave-one-out analysis, where a virus-of-interest V0 is withheld from one dataset (Figure 2A, blue virus boxed in gold). We create multiple decision trees using a subset of overlapping viruses V1, V2, ..., Vn as features and a subset of antibody responses within dataset 2 for training (STAR Methods). These trees are cross-validated using the remaining antibody responses from dataset 2 to quantify each tree's error σ_Training, and we predict V0 in dataset 1 using the average of the values ± errors from the 5 best trees with the lowest error (Figures 2B and 2C; Box 1).
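To make this step concrete, the following minimal sketch (our illustration in Python with scikit-learn, not the authors' released code; `d2`, `d1`, `overlap`, and `v0` are assumed inputs) grows a forest on dataset 2 and keeps the 5 trees with the lowest cross-validation error. The log10 row-centering detailed in Box 1 is omitted here for brevity.

```python
# Minimal sketch of the forest in Figures 2B-2C. `d2` and `d1` are
# sera x viruses DataFrames of log10 HAI titers; `overlap` lists viruses
# measured in both datasets; `v0` is the withheld virus-of-interest.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def grow_forest(d2, v0, overlap, n_trees=50, n_features=5, f_samples=0.3):
    trees = []
    for _ in range(n_trees):
        feats = list(rng.choice(overlap, size=n_features, replace=False))
        sera = rng.choice(d2.index.to_numpy(),
                          size=int(f_samples * len(d2)),
                          replace=True)                     # bootstrap sera
        train = d2.loc[sera, feats + [v0]].dropna()
        test = d2.drop(index=np.unique(sera))[feats + [v0]].dropna()
        tree = DecisionTreeRegressor().fit(train[feats], train[v0])
        rmse = np.sqrt(np.mean((tree.predict(test[feats]) - test[v0]) ** 2))
        trees.append((rmse, feats, tree))                   # rmse = sigma_Training
    return sorted(trees, key=lambda t: t[0])[:5]            # 5 best trees

def predict_v0(d1, best_trees):
    """Average the best trees' predictions for v0 in dataset 1 (sera are
    assumed to have all overlap viruses measured)."""
    preds = np.column_stack([tree.predict(d1[feats])
                             for _, feats, tree in best_trees])
    sigma_training = np.mean([rmse for rmse, _, _ in best_trees])
    return preds.mean(axis=1), sigma_training
```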
One potential pitfall of this approach is that the estimated error σ_Training derived from dataset 2 will almost always underestimate the true error of these predictions (σ_Actual) in dataset 1, because the antibody responses in the two studies may be very distinct (e.g., sera collected decades apart or from people/animals with different infection histories).
To correct for this effect, we estimate an upper bound for σ_Actual by computing the transferability f_2→1(x), which quantifies the accuracy of a relation found in dataset 2 (e.g., V0 = V1 + V2, although complex non-linear relations are allowed) when applied to dataset 1. More precisely, if a relation has error σ_Training in dataset 2 and σ_Actual in dataset 1, then the transferability gives an upper bound, f_2→1(σ_Training from dataset 2) ≥ σ_Actual in dataset 1, that holds for the majority of decision trees. Thus, a low f_2→1(σ_Training from dataset 2) guarantees accurate predictions.
To calculate the transferability f_2→1, we repeat the above algorithm, but rather than inferring values for V0, we predict each of the overlapping viruses V1-Vn measured in both datasets, whose σ_Training and σ_Actual can be directly computed (Figure 2D; Box 2). We found that transferability was well characterized by a simple linear relationship (Figure S1; note that f_2→1 represents an upper bound and not an equality). Finally, we apply this relation to the training error for virus V0 to estimate the prediction error in dataset 1, σ_Predict ≡ f_2→1(σ_Training). In this way, both values and errors for V0 are inferred using a generic, data-driven approach that can be applied to diverse datasets.
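A sketch of how such a transferability function could be fit is shown below (our own illustration, not the paper's exact procedure; Box 2 specifies a best-fit line with perpendicular offsets, whereas we use ordinary least squares here): fit a line to the (σ_Training, σ_Actual) pairs from the trees predicting the overlapping viruses, inflate it by the residual RMSE so that errors are overestimated, and never return less than σ_Training itself.

```python
# Illustrative fit of f(sigma) = max(a*sigma + b + c, sigma), given paired
# per-tree errors collected from the overlapping viruses V1..Vn.
import numpy as np

def fit_transferability(sigma_training, sigma_actual):
    sigma_training = np.asarray(sigma_training, dtype=float)
    sigma_actual = np.asarray(sigma_actual, dtype=float)
    a, b = np.polyfit(sigma_training, sigma_actual, deg=1)  # best-fit line
    resid = sigma_actual - (a * sigma_training + b)
    c = np.sqrt(np.mean(resid ** 2))    # inflate so errors are overestimated
    return lambda s: np.maximum(a * s + b + c, s)           # f never < sigma
```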
Leave one out: Inferring virus behavior without a single measurement

To assess matrix completion across studies, we applied it to three increasingly difficult scenarios: (1) between two highly similar human vaccination studies, (2) between a human infection and human vaccination study, and (3) between a ferret infection and human vaccination study. We expected prediction accuracy to decrease as the datasets become more distinct, resulting in both a larger error (σ_Actual) and a larger estimated uncertainty (σ_Predict).
For these predictions, we utilized the Fonville influenza datasets consisting of six studies: four human vaccination studies (datasets Vac,1-4), one human infection study (dataset Infect,1), and one ferret infection study (dataset Ferret). 20 In each study, sera were measured against a panel of H3N2 viruses using HAI. Collectively, these studies contained 81 viruses, and each virus was measured in at least two studies.
We first predicted values for the virus V0 = A/Auckland/5/1996 in the most recent vaccination study (dataset Vac,4) using data from another vaccination study (dataset Vac,3) carried out in the preceding year and in the same geographic location (Table S1). After training our decision trees, we found that the two studies had the best possible transferability (σ_Predict = f_Vac,3→Vac,4(σ_Training) ≈ σ_Training), suggesting that there is no penalty in extrapolating virus behavior between these datasets. More precisely, if there exist five viruses, V1-V5, that can accurately predict V0's measurements in dataset Vac,3, then V1-V5 will predict V0 equally well in dataset Vac,4.
Indeed, we found multiple such decision trees that predicted V0's HAI titers with σ_Predict = 2.0-fold uncertainty, meaning that each titer t is expected to lie between t/2 and t·2 with 68% probability (or, equivalently, that log10(t) has a standard deviation of log10(2)) (top panel in Figure 3A, gray bands represent σ_Predict). Notably, this estimated uncertainty closely matched the true error σ_Actual = 1.7-fold. To put these results into perspective, the HAI assay has roughly 2-fold error (i.e., repeated measurements differ by 2-fold 50% of the time and by 4-fold 10% of the time; STAR Methods), implying that these predictions are as good as possible given experimental error.
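The bookkeeping behind "x-fold error" is simple: titers are analyzed in log10 space, and a fold error is 10 raised to the RMSE of the log10 residuals. A short helper (ours, purely for illustration) makes this explicit:

```python
import numpy as np

def fold_error(predicted, measured):
    """RMSE of log10 residuals, reported as a fold change (e.g., 2.0-fold)."""
    resid = np.log10(predicted) - np.log10(measured)
    return 10 ** np.sqrt(np.mean(resid ** 2))

# Two titers each off by exactly 2-fold give a 2.0-fold error:
print(fold_error([40, 80], [80, 40]))  # 2.0
```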
When we inferred every other virus between these vaccine studies (datasets Vac,3→Vac,4), we consistently found the same highly accurate predictions: σ_Predict ≈ σ_Actual ≈ 2-fold (Figure S2A). As an alternative way of quantifying error, we plotted the distribution of predictions within 0.5, 1.0, 1.5, ... standard deviations from the measurement, which we compare against a folded Gaussian distribution (Figure 3A, bottom). For example, 82% of predictions were within 1 standard deviation, somewhat larger than the 68% expected for a Gaussian, confirming that the prediction error was slightly overestimated.
We next predicted values for V0 = A/Netherlands/620/1989 between a human infection and vaccination study (datasets Infect,1→Vac,4). In this case, the predicted values were also highly accurate, with true error σ_Actual = 2.3-fold (Figure 3B; remaining viruses predicted in Figure S2B). When quantifying the uncertainty of these predictions, we found worse transferability of virus behavior (f_Infect,1→Vac,4(σ_Training) ≈ 2.8·σ_Training, where the larger prefactor of 2.8 indicates less transferability; STAR Methods), and hence we overestimated the prediction error as σ_Predict = 4.3-fold. Last, when we predicted values for V0 = A/Victoria/110/2004 between a ferret infection and human vaccination study (datasets Ferret→Vac,4), our predictions had a larger true error, σ_Actual = 4.4-fold (Figure 3C), than the inferences between human data, as expected. Moreover, poor transferability between these datasets led to a poorer guarantee of prediction accuracy, σ_Predict = 6.5-fold, indicative of larger variability when predicting between ferret and human data. Importantly, we purposefully constructed σ_Predict to overestimate σ_Actual when datasets X and Y exhibit disparate behaviors, since matching the average distribution of σ_Predict to σ_Actual could lead to an unwanted underestimation of the true error. With our approach, a low σ_Predict guarantees accurate predictions. As we show in the following section, the estimated values and errors become more precise when we use multiple datasets to infer virus behavior.
Combining influenza datasets to predict 200,000 measurements with ≤3-fold error

When multiple datasets are available to predict virus behavior in dataset 1, we obtain predictions ± errors (μ_j ± σ_j) from dataset 2→1, dataset 3→1, dataset 4→1, and so on. These predictions and their errors are combined using the standard Bayesian (inverse-variance-weighted) approach,

μ = (Σ_j μ_j/σ_j²) / (Σ_j 1/σ_j²), with combined uncertainty σ given by 1/σ² = Σ_j 1/σ_j². (Equation 1)

The uncertainty term in this combined prediction has two key features. First, adding any additional dataset (with predictions μ_k ± σ_k) can only decrease the uncertainty. Second, if a highly uninformative dataset is added (with σ_k → ∞), it will negligibly affect the cumulative prediction. Therefore, as long as the uncertainty estimates are reasonably precise, datasets do not need to be prescreened before matrix completion, and adding more datasets will always result in lower uncertainty.
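A minimal implementation of this combination rule (our sketch of Equation 1, with μ_j and σ_j in log10 titer units) shows both properties directly:

```python
# Precision-weighted (Bayesian) combination of per-dataset predictions.
import numpy as np

def combine(mus, sigmas):
    w = 1.0 / np.asarray(sigmas, dtype=float) ** 2   # precisions
    mu = np.sum(w * np.asarray(mus, dtype=float)) / np.sum(w)
    sigma = np.sqrt(1.0 / np.sum(w))                 # only shrinks as data are added
    return mu, sigma

print(combine([2.0, 2.3], [0.3, 0.3]))               # ~ (2.15, 0.21)
# A very uncertain dataset (large sigma) barely moves the answer:
print(combine([2.0, 2.3, 5.0], [0.3, 0.3, 100.0]))   # essentially unchanged
```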
To test the accuracy of combining multiple datasets, we performed leave-one-out analysis using all six Fonville studies, systematically withholding every virus in each dataset (311 virus-dataset pairs) and predicting the withheld values using all remaining data.Each dataset measured 35-300 sera against 20-75 viruses (with 81 unique viruses across all 6 studies) and had 0.5%-40% missing values (Figure 4A).
Collectively, we predicted the 50,000 measurements across all datasets with a low error of σ_Actual = 2.1-fold (between the measured value and the combined prediction μ from Equation 1). Upon stratifying these predictions by dataset, we found that the four human vaccination studies were predicted with the highest accuracy (datasets Vac,1-4, σ_Actual ≈ 2-fold), while the human infection study had slightly worse accuracy (dataset Infect,1, σ_Actual = 2.7-fold) (Figure 4A). Remarkably, even the least accurate human → ferret predictions had ≤4-fold error on average (σ_Actual = 3.4-fold), demonstrating the potential for these cross-study inferences. As negative controls, permutation testing as well as predictions based solely on virus sequence similarity led to nearly flat predictions with substantially larger error (Figure S3).
In addition to accurately predicting these values, the estimated error closely matched the true error in every human study (σ_Predict ≈ σ_Actual, datasets Vac,1-4 and dataset Infect,1). The uncertainty of the ferret predictions was slightly overestimated (σ_Predict = 4.2-fold, dataset Ferret); mathematically, this occurs because the upper envelope of σ_Training-vs-σ_Actual is steep, making σ_Actual difficult to precisely determine (Figure S1). We visualize the transferability between datasets using a chord diagram (Figure 4B), where wider bands connecting datasets X↔Y represent larger transferability (Figure S4; STAR Methods). As expected, there was high transferability between the human vaccine studies carried out in consecutive years (datasets Vac,1↔Vac,2 and Vac,3↔Vac,4; Table S1) but generally less transferability across vaccine studies more than 10 years apart (datasets Vac,1↔Vac,3, Vac,1↔Vac,4, Vac,2↔Vac,3, or Vac,2↔Vac,4).
Transferability is not necessarily symmetric, because virus inhibition in dataset X could exhibit all patterns in dataset Y (leading to high transferability from X→Y) along with unique patterns not seen in dataset Y (resulting in low transferability from Y→X). For example, all human datasets displayed small transferability to the ferret data, whereas the ferret dataset accurately predicts the human dataset Infect,1; this suggests that the ferret responses show some patterns present in the human data but also display unique phenotypes. As another example, the human infection study carried out from 2007-2012 had high transferability from the human vaccine studies conducted in 2009 and 2010 (datasets Vac,3/4→Infect,1) but showed smaller transferability in the reverse direction.
To show the generality of this approach beyond H3N2 HAI data, we predicted H1N1 virus neutralization across two monoclonal antibody datasets, finding an error σ_Actual = 3.0-3.6-fold across measurements spanning two orders of magnitude (Figure S5). While these serum and monoclonal antibody results lay the foundation to compare datasets and quantify the impact of a person's age, geographic location, and other features on the antibody response, they are not exhaustive characterizations; for example, additional human datasets may be able to more accurately predict these ferret responses. The strength of this approach lies in the fact that cross-study relationships are learned in a data-driven manner. As more datasets are added, the number of predictions between datasets increases, while the uncertainty of these predictions decreases.

Versatility of matrix completion: Predicting values from a distinct assay using only 5 overlapping viruses
To test the limits of our approach, we used the Fonville datasets to predict values from a large-scale serological dataset by Vinh et al., 25 in which only 6 influenza viruses were measured against 25,000 sera. This exceptionally long and skinny matrix is challenging for several reasons. First, after entirely withholding a virus, only 5 other viruses remain to infer its behavior. Furthermore, only 4 of the 6 Vinh viruses had exact matches in the Fonville dataset; given this small virus panel, we utilized the remaining 2 viruses by associating them with the closest Fonville virus based on their hemagglutinin sequences (STAR Methods; sequences available in the GitHub repository). Associating functionally distinct viruses will result in poor transferability, and hence the validity of matching nearly homologous viruses can be directly assessed by comparing the transferability with or without these associations.
Second, the Vinh study used protein microarrays to measure serum binding to the HA1 subunit that forms the hemagglutinin head domain. While HAI also measures how antibodies bind to this head domain, such differences in the experimental assay could lead to fundamentally different patterns of virus inhibition, resulting in smaller transferability and higher error.
Third, there were only 1,200 sera across all Fonville datasets, and hence predicting the behavior of 25,000 Vinh sera would be impossible if they all exhibited distinct phenotypes. Indeed, any such predictions would only be possible if this swarm of sera is highly degenerate, the behavior of each Vinh virus can be determined from the remaining 5 viruses, and these same relations can be learned from the Fonville data. Last, we note one superficial difference: the Vinh data span a continuum of values, while the Fonville data take on discrete 2-fold increments, although this feature does not affect our algorithm. After growing a forest of decision trees to establish the transferability between the Fonville and Vinh datasets (Figure S1), we predicted the 25,000 serum measurements for all 6 Vinh viruses with an average σ_Actual = 3.2-fold error, demonstrating that even a small panel containing 5 viruses can be expanded to predict the behavior of additional strains (Figure 4A, dataset Infect,2).
Notably, 5 of these 6 viruses (which all circulated between 2003 and 2011) had a very low σ_Predict ≈ σ_Actual ≈ 2- to 3-fold error (Figure S6). The final Vinh virus circulated three decades earlier (in 1968), and its larger prediction error was underestimated (σ_Actual = 9.3-fold, σ_Predict = 3.8-fold). This highlights a shortcoming of any matrix completion algorithm; namely, that when a dataset contains one exceptionally distinct column (i.e., one virus circulating 30 years before all other viruses), its values will not be accurately predicted. These predictions would have improved had these six viruses been sampled uniformly between 1968 and 2011.
Leave multi out: Designing a minimal virus panel that maximizes the information gained per experiment

Given the accuracy of leave-one-out analysis, and given that only 5 viruses are needed to expand a dataset, we reasoned that these studies contain a plethora of measurements that could have been inferred by cross-study predictions. Pushing this to the extreme, we combined the Fonville and Vinh datasets and performed leave-multi-out analysis, where multiple viruses were simultaneously withheld and recovered. Future studies seeking to measure any set of viruses, V1-Vn, can use a similar approach to select the minimal virus panel that predicts their full data.
In the present search, we sought the minimum set of viruses needed to recover all Fonville and Vinh measurements with ≤4-fold error; we chose this threshold because it lets us remove dozens of viruses while remaining much smaller than the 1,000-fold range of the data. A virus was randomly selected from a dataset and added to the withheld list when its values, and those of all other withheld viruses, could be predicted with σ_Predict ≤ 4-fold (without using σ_Actual to confirm these predictions; STAR Methods). In this way, 133 viruses were concurrently withheld, representing 15%-60% of the virus panels from every dataset, or a total of N = 70,000 measurements (Figure 5A).
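The sketch below illustrates this greedy selection; `predict_with_uncertainty` is a hypothetical stand-in for the full pipeline described above (forest, transferability, and Equation 1), returning σ_Predict in log10 units, and is assumed rather than defined here.

```python
import numpy as np

def shrink_panel(viruses, datasets, predict_with_uncertainty, max_fold=4.0):
    """Greedily withhold viruses while every withheld virus still has
    sigma_Predict <= max_fold (never peeking at sigma_Actual)."""
    rng = np.random.default_rng(0)
    withheld = []
    for v in rng.permutation(viruses):
        trial = withheld + [v]
        sigmas = [predict_with_uncertainty(u, datasets, exclude=trial)
                  for u in trial]
        if all(10 ** s <= max_fold for s in sigmas):  # sigmas in log10 units
            withheld = trial          # v is redundant; keep it withheld
    return withheld
```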
Even with this hefty withheld set, prediction error was only slightly larger than during leave-one-out analysis (σ_Actual between 2.1- and 3.0-fold for the human datasets and σ_Actual = 3.8-fold for the ferret data). This small increase is due to two competing factors. On one hand, prediction is far harder with fewer viruses. At the same time, our approach specifically withheld the most "redundant" viruses that could be accurately estimated (with σ_Predict ≤ 4-fold). These factors mostly offset one another, so that the 70,000 measurements exhibited the desired σ_Actual ≤ 4-fold.
The transferability between datasets, computed without the withheld viruses, was similar to the transferability between the full datasets (Figure 5B). Some connections were lost when there were <5 overlapping viruses between datasets, while other connections were strengthened when the patterns in the remaining data became more similar across studies. Notably, the ferret data now showed some transferability from the vaccination datasets.

Expanding datasets with 2 × 10^6 new measurements reveals a tradeoff between serum potency and breadth

In the previous section, we combined datasets to predict serum-virus HAI titers, validating our approach on 200,000 existing measurements. Future studies can immediately leverage the Fonville datasets to expedite their efforts. If a new dataset contains at least 5 Fonville viruses (green arrows/boxes in Figure 6A), then HAI values ± errors for the remaining Fonville viruses can be predicted. Viruses with an acceptably low error (purple in Figure 6A) can be added without requiring any additional experiments.
To demonstrate this process, we first focus on the Vinh dataset, where expansion will have the largest impact because the Vinh virus panel is small (6 viruses) but its serum panel is enormous (25,000 sera). By predicting the interactions between these sera and all 81 unique Fonville viruses, we add 2,000,000 new predictions (more than 10× the number of measurements in the original dataset).
For each Fonville virus V0 that was not measured in the Vinh dataset, we grew a forest of decision trees as described above, with the minor modification that the 5 features were restricted to the Vinh viruses to enable expansion. The top trees were combined with the transferability functions (Figure S1) to predict the values ± errors for V0 (Figure S7).
The majority of the added Fonville viruses (67 of 75) had tight predictions of σ_Predict ≤ 4-fold (Figure 6B). As expected, viruses circulating around the same time as the Vinh panel (1968 or 2003-2011) tended to have the lowest uncertainty, whereas the furthest viruses from the 1990s had the largest uncertainty (Figure 6C). To confirm these estimates, we restricted the Fonville datasets to these same 6 viruses and expanded out, finding that any virus with σ_Predict ≤ 6-fold prediction error (which applies to nearly all Vinh predictions) had a true error σ_Actual ≤ 6-fold (Figure S8). We similarly expanded the Fonville datasets, adding 175 new virus columns across the six studies (Figure S7; extended datasets provided on GitHub). In addition, dimensionality reduction via uniform manifold approximation and projection (UMAP) recovered a linear trend from the oldest to newest viruses in both the Fonville and Vinh datasets; this trend is especially noteworthy in the latter case because we did not supply the circulation year for the 75 inferred viruses, yet we can discern its impact on the resulting data (Figure S9).

Figure 4.
(A) We combined seven influenza datasets spanning human vaccination studies (blue boxes), human infection studies (green), and a ferret infection study (orange). Each virus in every dataset was withheld and predicted using the remaining data (shown schematically in gold in the top left box). We display each dataset (left; missing values in dark red and measurements in grayscale) and the collective predictions for all viruses in that dataset (right; gray diagonal bands show the average predicted error σ_Predict). The total number of predictions N from each dataset is shown above the scatterplots; when this number of points is too great to show, we subsampled each distribution evenly while maintaining its shape. The inset at the bottom right of each plot shows the probability density function (PDF) histogram of error measurements (y axis) that were within 0.5σ, 1.0σ, 1.5σ, ... (x axis) compared with a standard folded Gaussian distribution (black curve). The fraction of predictions within 1.0σ is explicitly written and can be compared with the expected 68% for a standard folded Gaussian.
(B) Chord diagram representing the transferability between datasets. For each arc connecting datasets X→Y, transferability is shown near the outer circle of Y, with larger width representing greater transferability (Figures S1 and S4; STAR Methods).
For each Vinh serum, this expansion fills in the 3.5-decade gap between 1968 and 2003 by predicting 47 additional viruses, as well as adding another 28 measurements between 2003 and 2011 (Figure 7A, new interactions highlighted in purple). We also predicted dozens of new viruses in the vaccine studies, and for some sera this increased resolution revealed a more jagged landscape than what was apparent from the direct measurements (Figure 7A). Although HAI titers tend to be similar for viruses circulating around the same time, exceptions do arise (e.g., A/Tasmania/1/1997 vs. A/Perth/5/1997, as well as A/Hanoi/EL201/2009 vs. A/Hanoi/EL134/2008, had >4-fold differences in their predicted titers), and our expanded data reveal these functional differences between variants.
The expanded data also enable a direct comparison of sera across studies, something that is exceedingly difficult with the original measurements given that none of the 81 viruses were in all 7 datasets. Figure 7A shows that an antibody response may be potent against older strains circulating before 2000 but weak against newer variants (bottom), highly specific against strains from 1980-2000 with specific vulnerabilities to viruses from 1976 (center), or relatively uniform across the entire virus panel (top).
We next used the expanded data to probe a fundamental but often unappreciated property of the antibody response; namely, the tradeoff between serum potency and breadth. Given a set of viruses circulating within Δvirus years of each other (the top of Figure 7B shows an example with Δvirus years = 2), how potently can a serum inhibit all of these variants simultaneously? For any set of viruses spanning Δvirus years, we computed HAI_min (the minimum titer against this set of viruses) for each serum and plotted the maximum HAI_min in each dataset (Figure 7B). (While children born after the earliest circulating strains may have artificially smaller HAI_min, every dataset contains adults born before the earliest strain, and we only report the largest potency in each study.) We find that HAI_min decreases with Δvirus years, demonstrating that it is harder to simultaneously inhibit more diverse viruses. This same tradeoff was seen for monoclonal antibodies, 29,30 and it suggests that efforts geared toward finding extremely broad and potentially universal influenza responses may run into an HAI ceiling.
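As an illustration of this calculation (our own sketch; `titers` and `years` are assumed inputs, not objects from the paper's code), the maximum HAI_min for virus sets spanning at most Δ years can be computed as:

```python
import numpy as np
import pandas as pd

def max_hai_min(titers: pd.DataFrame, years: pd.Series, delta: int) -> float:
    """titers: sera x viruses HAI titers; years: circulation year per virus.
    Returns the best HAI_min over all virus windows spanning <= delta years."""
    best = 0.0
    for start in np.unique(years):
        window = years[(years >= start) & (years <= start + delta)].index
        if len(window) < 2:
            continue
        hai_min = titers[window].min(axis=1)    # each serum's weakest titer
        best = max(best, float(hai_min.max()))  # most potent serum in study
    return best
```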

Toward pandemic preparedness
When two studies have high transferability, each serves as a conduit to rapidly propagate information. For example, if a new variant V0 emerges this year, the most pressing question is whether our preexisting immunity will inhibit this new variant or whether it is sufficiently distinct to bypass our antibody response.
Traditionally, antigenic similarity is measured by infecting ferrets with previously circulating strains and measuring their cross-reactivity to the new variant, yet the above analysis (and work by many others 31,32) shows that ferret↔human inferences can be poor. Instead, we can rapidly assess the inhibition of V0 in multiple existing human cohorts that measured HAI against viruses V1-V5 by measuring a single additional human cohort against V0-V5 and then predicting V0's titers in all other studies. As an example, consider the most recent virus strain in the latest vaccine dataset (A/Perth/16/2009 from vaccine study 4, carried out in 2010, around the time this variant emerged). Our framework predicts how all individuals in vaccine study 3 inhibit this variant with σ_Actual = 2.4-fold error (Figure 7C).
Another recent application of pandemic preparedness tested the breadth of an influenza vaccine containing H1N1 A/Michigan/45/2015 by measuring the serum response against one antigenically distinct strain, H1N1 A/Puerto Rico/8/1934. 33 Inferring additional virus behavior would provide greater resolution into the coverage and potential holes of an antibody response. As shown in Figure 7A, ≈5 measurements can extrapolate serum HAI against viruses circulating across multiple decades, providing this needed resolution from a small number of interactions.

Matrix completion via nuclear norm minimization poorly predicts behavior across studies
In this final section, we briefly contrast our algorithm against singular value decomposition (SVD)-based approaches, such as nuclear norm minimization (NNM), which are arguably the simplest and best-studied matrix completion methods.With NNM, missing values are filled by minimizing the sum of singular values of the completed dataset.
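For readers unfamiliar with NNM, a generic soft-impute iteration (a common NNM heuristic; this is our illustrative sketch, not the specific implementation benchmarked below) alternates between filling the missing entries and shrinking the singular values:

```python
import numpy as np

def soft_impute(X, lam=1.0, n_iters=200):
    """Generic nuclear-norm-style completion: X has NaNs at missing entries."""
    mask = ~np.isnan(X)
    Z = np.where(mask, X, 0.0)           # initialize missing entries with 0
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)     # soft-threshold the singular values
        low_rank = (U * s) @ Vt
        Z = np.where(mask, X, low_rank)  # keep observed values, update missing
    return Z
```

Note how the initialization of missing entries, discussed below, is baked into the very first line of the loop's input.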
To compare our results, we reran our leave-multi-out analysis from Figure 5, simultaneously withholding 133 viruses and predicting their values using an established NNM algorithm from Einav and Cleary. 7 The resulting predictions were notably worse, with σ_Actual between 3.4- and 5.4-fold.

Figure 5.
(A) Viruses were concurrently withheld from each dataset (left, gold columns), and their 70,000 values were predicted using the remaining data. We withheld as many viruses as possible while still estimating a low error of σ_Predict ≤ 4-fold (blinding ourselves to the actual measurements), and indeed, the actual prediction error was smaller than 4-fold in every dataset. As in Figure 4, plots and histograms show the collective predictions and error distributions. The plot label enumerates the number of concurrent predictions (and the percent of data predicted).
(B) Chord diagram representing the transferability between datasets after withholding the viruses. For each arc connecting datasets X→Y, transferability is shown near the outer circle of Y, with larger width representing greater transferability (Figures S1 and S4; STAR Methods).
Because of two often neglected features of NNM, we find that our approach significantly outperforms this traditional route of matrix completion when predicting values for a completely withheld virus column. First, NNM is asymmetrical when predicting large and small values for a withheld virus. Consider a simple noise-free example where one virus's measurements are proportional to another's, (virus 2's values) = m × (virus 1's values) (Figure S10A shows m = 5). Surprisingly, even if provided with one perfect template for these measurements, NNM incorrectly predicts that (virus 2's values) = (virus 1's values) for any m ≥ 1 (Figure S10B). This behavior is exacerbated when multiple datasets are combined, emphasizing that NNM can catastrophically fail for very simple examples (Figures S10C and S10D). This artifact can be alleviated by first row-centering a dataset (subtracting the mean of the log10[titers] for each serum in Figure 2A), as in Box 1.
Even with row-centering, a second artifact of NNM is that large swaths of missing values can skew matrix completion because relationships are incorrectly inferred between these missing values. Intuitively, all iterative NNM algorithms must initialize the missing entries (often with either 0 or the row/column means), so that after initialization, two viruses with very different behaviors may end up appearing identical across their missing values.
For example, suppose we want to predict values for virus V0 from dataset X→Y, and that "useful" viruses V1-V4 behave similarly to V0 in datasets X and Y. On the other hand, "useless" viruses V5-V8 are either not measured in dataset 2 or are measured against complementary sera; moreover, these viruses behave very differently from V0 in dataset 1 (Figures S10E and S10F show a concrete example from Fonville). Ideally, matrix completion should ignore V5-V8 (given that they do not match V0 in dataset 2) and only use V1-V4 to infer V0's values in dataset 1. In practice, NNM using V0-V8 results in poor predictions (Figures S10E and S10F). This behavior is disastrous for large serological datasets, where there can be >50% missing values when datasets are combined.
Our algorithm was constructed to specifically avoid both artifacts. First, we infer each virus's behavior using a decision tree on row-centered data, which does not exhibit the asymmetry discussed above. Second, we restrict our analysis to features that have ≥80% observed measurements to ensure that the detected patterns are based on measurements rather than on missing data.
As another point of comparison, consider the leave-one-out predictions of the six Vinh viruses using the Fonville datasets. Whereas our algorithm yields tight predictions across the full range of values (Figure S6), NNM led to a nearly flat response, with all 25,000 sera incorrectly predicted to be the mean of the measurements (see Figure S11 in Einav and Cleary 7). In addition, we utilized an existing SVD-based matrix completion method that quantifies the prediction uncertainty for each entry under the assumption that values are randomly missing from a dataset. 18 Applying this method to the Fonville datasets resulted in predictions whose actual error was >20-fold larger than the estimated error, emphasizing the need for frameworks that specifically handle structured missing data. 34

DISCUSSION
By harnessing the wealth of previously measured antibody-virus interactions, we can catapult future efforts and design experiments that are far larger in size and scope. Here, we developed an algorithm that leverages patterns in HAI data to predict how a virus measured in one study would inhibit sera from another study, without requiring any additional experiments. Even when the original studies only had a few overlapping viruses, the expanded datasets can be directly compared using all variants.
While it is understood that sera cross-react, exhibiting similar inhibition against nearly homologous variants, it was unclear whether there are universal relationships that hold across datasets. We introduce the notion of transferability to quantify how accurately local relations within one dataset map onto another dataset (Figure 4B; STAR Methods). 35 Transferability is based on the functional responses of viruses, and it does not require side information such as virus sequence or structure, although future efforts should quantify how incorporating such information reduces prediction error. In particular, incorporating sequence information could strengthen predictions when virus panels have little direct overlap but contain many nearly homologous variants.
When comparing serological studies, it is natural to ask whether measurements from one cohort, assay, or species can inform another. 37,38 Transferability directly addresses these questions. Through this lens, we compared the Fonville and Vinh studies, which utilized different assays, had different dynamic ranges, and used markedly different virus panels. 20,25 We found surprisingly large transferability between human infection and vaccination studies. For example, vaccine studies from 1997/1998 (datasets Vac,1/2) were moderately informed by the Vinh infection study from 2009-2015 (dataset Infect,2), even though none of the Vinh participants had ever been vaccinated (Figure 4B). Conversely, both infection studies we analyzed were well informed by at least one vaccine study (e.g., dataset Infect,1 was most informed by datasets Vac,3/4).
These results demonstrate that diverse cohorts can inform one another. Hence, instead of thinking about each serum sample as being entirely unique, large collections of sera may often exhibit surprisingly similar inhibition profiles. For example, the 1,200 sera in the Fonville datasets predicted the behavior of the 25,000 Vinh sera with ≤2.5-fold error on average, demonstrating that these Vinh sera were at least 20-fold degenerate. 25 This corroborates recent work showing that different individuals often target the same epitopes, 26 which should limit the number of distinct functional behaviors. As studies continue to measure sera in new locations, their transferability will quantify the level of heterogeneity across the world.
To demonstrate the scope of new antibody-virus interactions that can be inferred using available data, we predicted 2,000,000 new interactions between the Fonville and Vinh sera and their combined 81 H3N2 viruses. Upon stratifying by age, these landscapes can quantify how different exposure histories shape the subsequent antibody response. 25,39 Given the growing interest in universal influenza vaccines that inhibit diverse variants, these high-resolution responses can examine the breadth of the antibody response both forwards in time against newly emerging variants and backwards in time to assess how rapidly immunity decays. 3,23,40,41 We found that serum potency (the minimum HAI titer against a set of viruses) decreases for more distinct viruses (Figure 7B), as shown for monoclonal antibodies, 7,29 suggesting that there is a tug-of-war between antibody potency and breadth. For example, a specific HAI target (e.g., responses with HAI ≥ 80 against multiple variants) may only be possible for viruses spanning 1-2 decades.
Our framework inspires new principles of data acquisition, where future studies can save time and effort by choosing smaller virus panels that are designed to be subsequently expanded (Figure 6A). One powerful approach is to perform experiments in waves. A study measuring serum inhibition against 100 viruses could start by measuring 5 of these viruses that are widely spaced out in time. With these initial measurements, we can compute the values ± errors of the remaining viruses as well as the next 5 maximally informative viruses, whose measurements will further decrease the prediction error. Each additional wave of measurements serves as a test for the predictions, and experiments can stop once enough measurements match the predictions.
Antibody-virus interactions underpin diverse efforts, from virus surveillance 4 to characterizing the composition of antibodies within serum 20,30,42,43 to predicting future antibody-virus coevolution. 44,45 Although we focused on influenza HAI data, our approach readily generalizes to other inherently low-dimensional datasets, both in and out of immunology. In the context of antibody-virus interactions, this approach not only massively extends current datasets but also provides a level playing field where antibody responses from different studies can be directly compared using the same set of viruses. This shift in perspective expands the scope and utility of each measurement, enabling future studies to always build on top of previous results.

Limitations of the study
For cross-study antibody-virus predictions, there must be partial overlap in either the antibodies or the viruses used across datasets. We only investigated cases where the virus panels overlapped, and we found that studies should contain ≥5 overlapping viruses (whose data can inform one another's inhibition) for accurate predictions. For example, pre-pandemic H1N1, post-pandemic H1N1, and H3N2 would all minimally inform one another and should be considered separately (or else both the estimated and actual prediction error will be large). While we mostly investigated influenza HAI data, further work should extend this analysis to other viruses, other assays, and even to non-biological systems. In each context, this framework combines datasets to predict the value ± uncertainty of unmeasured interactions, and it circumvents issues of reproducibility or low-quality data (i.e., garbage in, garbage out) by explicitly computing intra- and inter-study relationships in a data-driven manner.

Figure 2. Combining datasets to predict values and uncertainties for missing viruses
(A) Schematic of data availability; two studies measure antibody responses against overlapping viruses (shades of blue) as well as unique viruses (green/gray). Studies may have different fractions of missing values (dark-red boxes) and measured values (gray). To test whether virus behavior can be inferred across studies, we predict the titers of a virus in dataset 1 (V0, gold squares), using measurements from the overlapping viruses (V1-Vn) as features in a random forest model.
(B) We train a decision tree model using a random subset of antibodies and viruses from dataset 2 (boxed in purple), cross-validate against the remaining antibody responses in dataset 2, and compute the root-mean-square error (RMSE, denoted by σ_Training).
(C) Multiple decision trees are trained, and the average from the 5 trees with the lowest error is used as the model going forward. Applying this model to dataset 1 (which was not used during training) yields the desired predictions, whose RMSE is given by σ_Actual. We repeat this process, withholding each virus in every dataset.
(D) To estimate the prediction error σ_Actual (which we are not allowed to directly compute because V0's titers are withheld), we define the transferability relation f_2→1 between the training error σ_Training in dataset 2 and the actual error σ_Actual in dataset 1 using the decision trees that predict viruses V1-Vn (without using V0). Applying this relation to the training error, f_2→1(σ_Training), estimates σ_Actual for V0.

Box 1. Predicting virus behavior (value ± error) across studies

Input:
• Dataset-of-interest D0 containing the virus-of-interest V0 whose measurements we predict
• Other datasets {D_j}, each containing V0 and at least 5 viruses V_j,1, V_j,2, ... that overlap with the D0 virus panel, used to extrapolate virus behavior
• Antibody responses A_j,1, A_j,2, ... in each dataset D_j. When j ≠ 0, we only consider antibody responses with non-missing values against V0

Steps:
1. For each D_j, create n_Trees = 50 decision trees predicting V0 based on n_Features = 5 other viruses and a fraction f_Samples = 3/10 of sera
  - For robust training, we restrict attention to features with ≥80% non-missing values. If fewer than n_Features viruses in D_j satisfy this criterion, do not grow decision trees for this dataset
  - Bootstrap sample (with replacement) both the viruses and antibody responses
  - Data are analyzed in log10 and row-centered on the features (i.e., for each antibody response in either the training set D_j or testing set D0, subtract the mean of the log10[titers] for the n_Features viruses using all non-missing measurements) to account for systematic shifts between datasets. Row-centering is undone once decision trees make their predictions by adding back the serum-dependent mean
  - Compute the cross-validation root-mean-square error (RMSE, σ_Training) of each tree using the remaining 1 − f_Samples fraction of samples in D_j
2. Predict the (un-row-centered) values of V0 in D0 using the n_BestTrees = 5 decision trees with the lowest σ_Training
  - Trees only make predictions in D0 where all n_Features are non-missing
  - Predict μ_j ± σ_j for each antibody response
    - μ_j = (mean value for n_BestTrees predictions)
    - σ_j = f_Dj→D0(mean σ_Training for n_BestTrees trees), where the transferability f_Dj→D0 is computed by predicting the viruses that overlap between D_j and D0 (Box 2)
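Since Box 1's row-centering is the one preprocessing step shared by training and testing, a small sketch may help (our illustration; function names are ours): subtract each serum's mean log10 titer over the feature viruses before training, and add it back after the trees predict.

```python
import numpy as np
import pandas as pd

def row_center(log_titers: pd.DataFrame, features: list):
    """Subtract each serum's mean log10 titer over the feature viruses."""
    serum_means = log_titers[features].mean(axis=1, skipna=True)
    return log_titers.sub(serum_means, axis=0), serum_means

def undo_row_center(predictions: np.ndarray, serum_means: pd.Series):
    """Restore the serum-dependent mean after the trees predict."""
    return predictions + serum_means.to_numpy()
```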

Box 2. Computing the transferability f_Dj→D0 between datasets

Input:
• The datasets {D_j} from Box 1; each virus must be included in at least two datasets

Steps:
• For each dataset D0 in {D_j}, for each virus V0 in D0, and for every other dataset D_j containing V0:
  - Create n_Trees = 50 decision trees predicting V0 based on n_Features = 5 other viruses, as described in Box 1
  - For each tree, store the following:
    - Antibody responses in D0 and D_j used to construct the tree
    - Viruses used to train the tree
    - RMSE σ_Training on the 1 − f_Samples held-out samples in D_j
    - Predictions of V0's values in D0
    - True RMSE σ_Actual of these predictions for V0 in D0
• When predicting V0 using D_j→D0 in Box 1, we compute f_Dj→D0 between σ_Training and σ_Actual by predicting the other viruses V1, V2, ..., Vn that overlap between D_j and D0 (making sure to only use decision trees that exclude the withheld V0)
  - From the forest of decision trees above, find the top 10 trees for each virus predicted between D_j→D0 and plot σ_Training vs. σ_Actual for all trees (see Figure S1)
  - Find the best-fit line using perpendicular offsets, y = ax + b, where x = σ_Training and y = σ_Actual. Since there is scatter about this best-fit line, and because it is better to overestimate rather than underestimate error, we add a correction factor c = (RMSE between σ_Actual and ax + b). Lastly, we expect that a decision tree's error in another dataset will always be at least as large as its error on the training set (σ_Actual ≥ σ_Training), and hence we define f_Dj→D0 = max(aσ_Training + b + c, σ_Training). This max term is important in a few cases where f_Dj→D0 has a very steep slope but some decision trees have small σ_Training
  - Datasets with high transferability will have f_Dj→D0(σ_Training) ≈ σ_Training, meaning that viruses can be removed from D0 and accurately inferred from D_j. In contrast, two datasets with low transferability will have a nearly vertical line, ∂f_Dj→D0/∂σ_Training ≫ 1, signifying that viruses will be poorly predicted between these studies
• In the chord diagrams (Figures 4B and 5B), the width of the arc between dataset D_j and D0 is proportional to (∂f_Dj→D0/∂σ_Training)^−1

Figure 3. Predicting virus behavior between two datasets
Example predictions between two Fonville studies. Top: plots comparing predicted and withheld HAI measurements (which take the discrete values 5, 10, 20, ...). Estimated error is shown in two ways: (1) as vertical lines emanating from each point and (2) by the diagonal gray bands showing σ_Predict. Bottom: histograms of the standardized absolute prediction errors compared with a standard folded Gaussian distribution (black dashed line). The fraction of predictions within 1.0σ is shown at the top left, which can be compared with the expected 68% for the standard folded Gaussian distribution.
(A) Predicting A/Auckland/5/1996 between two human vaccination studies (datasets Vac,3→Vac,4).
(B) Predicting A/Netherlands/620/1989 between a human infection and human vaccination study (datasets Infect,1→Vac,4).
(C) Predicting A/Victoria/110/2004 between a ferret infection and human vaccination study (datasets Ferret→Vac,4).

Figure 6. Expanding the Vinh dataset with 75 additional viruses
(A) If a new study contains at least 5 previously characterized viruses (green boxes and arrows), we can predict the behavior of all previously characterized viruses in the new dataset. Those with an acceptable error (e.g., ≤4-fold error, boxed in purple) are used to expand the dataset.
(B) Distribution of the estimated uncertainty σ_Predict when predicting how each Fonville virus inhibits the 25,000 Vinh sera. Most viruses are estimated with ≤4-fold error.
(C) Estimated uncertainty of each virus. The six viruses on the left represent the Vinh virus panel. Colors at the bottom represent the year each virus circulated.

Figure 7. Applications of cross-study predictions
(A) We predict HAI titers for 25,000 sera against the same set of 81 viruses, providing high-resolution landscapes that can be directly compared against each other. Representative responses are shown for dataset Infect,2 (top, serum 5130165 in GitHub), dataset Vac,1 (center, subject 525), and dataset Vac,3 (bottom, subject A028).
(B) Tradeoff between serum breadth and potency, showing that viruses spaced apart in time are harder to simultaneously inhibit. For every study and each possible set of viruses circulating within Δvirus years of each other, we calculate the highest HAI_min (i.e., a serum exists with HAI titers ≥ HAI_min against the entire set of viruses).
(C) Top: when a new variant emerges and is measured in a single study, we can predict its titers in all previous studies with ≥5 overlapping viruses. Bottom: example predicting how the newest variant in the newest vaccine dataset is inhibited by sera from a previous vaccine study (datasets Vac,4→Vac,3).