A large-scale systematic survey reveals recurring molecular features of public antibody responses to SARS-CoV-2

Summary
Global research to combat the COVID-19 pandemic has led to the isolation and characterization of thousands of human antibodies to the SARS-CoV-2 spike protein, providing an unprecedented opportunity to study the antibody response to a single antigen. Using the information derived from 88 research publications and 13 patents, we assembled a dataset of ∼8,000 human antibodies to the SARS-CoV-2 spike protein from >200 donors. By analyzing immunoglobulin V and D gene usages, complementarity-determining region H3 sequences, and somatic hypermutations, we demonstrated that the common (public) responses to different domains of the spike protein were quite different. We further used these sequences to train a deep-learning model to accurately distinguish between the human antibodies to SARS-CoV-2 spike protein and those to influenza hemagglutinin protein. Overall, this study provides an informative resource for antibody research and enhances our molecular understanding of public antibody responses.


In brief
Since the start of the COVID-19 pandemic, the isolation of antibodies to SARS-CoV-2 spike protein has been a major research focus. Wang et al. analyzed ∼8,000 published human monoclonal antibodies to the spike protein and identified sequence and molecular features of the public antibody responses to SARS-CoV-2. The results enable the construction of a sequence-based, deep-learning model to predict antibody specificity.

INTRODUCTION
From the beginning of the COVID-19 pandemic, many research groups worldwide turned their attention to SARS-CoV-2 and, in particular, to the immune response to infection and vaccination. Since 2020, thousands of human monoclonal antibodies to SARS-CoV-2 have been isolated and characterized (Li et al., 2022a; Raybould et al., 2021). The major surface antigen to which antibodies are elicited is the SARS-CoV-2 spike (S) protein, a homotrimeric glycoprotein that facilitates virus entry by first engaging the host receptor angiotensin-converting enzyme 2 (ACE2) and then mediating membrane fusion (Shang et al., 2020; Zhou et al., 2020). The S protein has three major domains, namely the N-terminal domain (NTD), receptor-binding domain (RBD), and S2 domain (Wrapp et al., 2020). Most studies on SARS-CoV-2 antibodies have focused on the immunodominant RBD because neutralizing antibodies can be elicited to it with very high potency (Wang et al., 2021). Antibodies to the NTD and the highly conserved S2 domain have also been discovered (Cerutti et al., 2021; Chi et al., 2020; Li et al., 2021a, 2022b; Pinto et al., 2021; Voss et al., 2021; Zhou et al., 2022b).
A common or public antibody response describes antibodies to the same antigen in different donors that share genetic elements that usually result in similar modes of antigen recognition. Deciphering public responses to particular antigens is not only critical for uncovering the molecular features of recurring antibodies within the diverse antibody repertoire at the population level, but also important for the development of effective vaccines (Andrews and McDermott, 2018; Lanzavecchia et al., 2016). A conventional approach to studying public antibody responses is to identify public clonotypes, which are antibodies from different donors that share the same immunoglobulin heavy variable (IGHV) gene and have similar complementarity-determining region (CDR) H3 sequences (Henry Dunand and Wilson, 2015; Jackson et al., 2014; Pieper et al., 2017; Setliff et al., 2018; Trück et al., 2015). While this definition of public clonotypes has improved our understanding of public antibody responses, it generally ignores the contribution of the light chain. Moreover, our recent study has shown that a public antibody response to influenza hemagglutinin (HA) is driven by an IGHD gene with minimal dependence on the IGHV gene (Wu et al., 2018). Therefore, the true extent and molecular characterization of public antibody responses remain to be explored.
Although information on many human monoclonal antibodies to SARS-CoV-2 is now publicly available, it has been difficult to leverage all available information to investigate public antibody responses to SARS-CoV-2. One major challenge is that the data from different studies are rarely in the same format. This inconsistency imposes a huge barrier to data mining. The establishment of the coronavirus antibody database (CoV-AbDab) has enabled researchers to deposit their antibody data in a standardized format and has partially resolved the data formatting issue. However, not every SARS-CoV-2 antibody study has deposited its data in CoV-AbDab. Furthermore, IGHD gene identities, nucleotide sequences, and donor IDs are not available in CoV-AbDab, which makes it challenging to study public antibody responses using CoV-AbDab alone. Thus, additional efforts must be made to fully synergize the information across many different SARS-CoV-2 antibody studies to investigate and decipher public antibody responses.
In this study, we performed a systematic literature survey and assembled a large dataset of human SARS-CoV-2 monoclonal antibodies with donor information. We then analyzed this dataset and uncovered many antibody sequence features that contribute to the public antibody responses to SARS-CoV-2 S. For example, we identified a public antibody response to RBD that is largely independent of the IGHV gene, as well as involvement of a particular IGHD gene in a public antibody response to S2. Our analysis also revealed a number of recurring somatic hypermutations (SHMs) in different public clonotypes. All of these sequence features provide a foundation for using deep learning to identify SARS-CoV-2 S antibodies.

RESULTS
A large-scale collection of SARS-CoV-2 antibody information
Information for 8,048 human antibodies was collected from 88 research publications and 13 patents that described the discovery and characterization of antibodies to SARS-CoV-2 (Figure S1; Table S1). Among these antibodies, which were isolated from 215 different donors, 7,997 (99.4%) react with SARS-CoV-2, and the remaining 51 react with SARS-CoV or seasonal coronaviruses. While 99.1% (7,923/7,997) of the SARS-CoV-2 antibodies in our dataset bind to S protein, 49 bind to N and 25 to ORF8. Epitope information was available for most SARS-CoV-2 S antibodies, with 5,002 to RBD, 513 to NTD, and 890 to S2. In addition, information on neutralization activity, germline gene usage, sequence, structure, bait for isolation (e.g., RBD and S), and donor status (e.g., infected patient, vaccinee, etc.), if available, was collected for individual antibodies. Using this large dataset, we aimed to analyze the sequence features of public antibody responses to SARS-CoV-2 S.
Antibodies to RBD, NTD, and S2 have distinct V gene usage bias
We first performed an analysis on the V gene usage of SARS-CoV-2 S antibodies. Our analysis captured previously known V gene usage patterns, including the prevalence of IGHV3-53/IGKV1-9 and IGHV3-53/IGKV3-20 among RBD antibodies (Cao et al., 2020; Clark et al., 2021; Kim et al., 2021; Tan et al., 2021; Yuan et al., 2020; Zhang et al., 2021; Figure 1A), as well as substantial enrichment of IGHV1-24 among NTD antibodies (Cerutti et al., 2021; Chi et al., 2020; Li et al., 2021a; Voss et al., 2021; Figure 1B). Importantly, our dataset also enabled us to discover previously unknown patterns in gene usage. For example, IGHV3-30 and IGHV3-30-3 were highly enriched among S2 antibodies (Figure 1B). V gene usage bias was also observed in the light chain. For example, IGKV3-20 and IGKV3-11 were most used among S2 antibodies, whereas IGKV1-33 and IGKV1-39 were most used among RBD antibodies (Figure 1C). Overall, these results demonstrated that RBD, NTD, and S2 antibodies have distinct patterns of V gene usage and that both heavy and light-chain V genes contribute to the public antibody response to SARS-CoV-2 S.

CDR H3 analysis reveals domain-specific public antibody response
Most of the antibody sequence diversity comes from the CDR H3 region due to V(D)J recombination (Elhanati et al., 2015; Jung and Alt, 2004; Schatz and Swanson, 2011). To identify the sequence features of CDR H3 in the public antibody response to SARS-CoV-2 S, CDR H3 sequences with the same length were clustered by an 80% sequence identity cutoff. A total of 170 clusters that contained antibodies from at least two different donors were identified (Figure 2A; Table S1).
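The clustering criterion described above can be illustrated with a short sketch. The text specifies only that sequences of the same length were grouped at an 80% identity cutoff, so the single-linkage rule, the function names (`identity`, `cluster_cdrh3`, `public_clusters`), and the toy donor labels and sequences below are all assumptions made for illustration, not the pipeline used in this study.

```python
from collections import defaultdict

def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_cdrh3(records, cutoff=0.8):
    """Group (donor, cdrh3) records by single linkage (an assumption).

    Sequences are compared only within the same length class, and two
    sequences are linked when their identity is at least `cutoff`.
    """
    by_length = defaultdict(list)
    for rec in records:
        by_length[len(rec[1])].append(rec)

    clusters = []
    for group in by_length.values():
        parent = list(range(len(group)))  # union-find forest

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if identity(group[i][1], group[j][1]) >= cutoff:
                    parent[find(i)] = find(j)

        members = defaultdict(list)
        for i, rec in enumerate(group):
            members[find(i)].append(rec)
        clusters.extend(members.values())
    return clusters

def public_clusters(clusters):
    """Keep clusters containing antibodies from at least two donors."""
    return [c for c in clusters if len({donor for donor, _ in c}) >= 2]

# Hypothetical toy input: (donor ID, CDR H3 sequence)
records = [
    ("donor1", "ARDLGWLRGY"),
    ("donor2", "ARDLGWLRGF"),  # 9/10 identical to the sequence above
    ("donor1", "ARGSGSYYFD"),  # same length, low identity
    ("donor3", "AKDYYGMDV"),   # different length, never compared
]
public = public_clusters(cluster_cdrh3(records))
```

Single linkage is only one reasonable reading of "clustered by an 80% sequence identity cutoff"; a centroid-based or complete-linkage rule would give different clusters near the threshold.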
Furthermore, we discovered several S2-specific CDR H3 clusters (clusters 6, 9, and 11) that were predominantly encoded by IGHV3-30 with diverse IGK(L)V genes, suggesting a public heavy-chain response to S2 (Figure 2B). Clusters 10 and 15 were also of particular interest. Cluster 10 featured a very short CDR H3 (6 amino acids, IMGT numbering) and was encoded by IGHV4-59/IGKV3-20, which was a frequent V gene pair among the S2 antibodies (Figure 1A). Cluster 15 was encoded by IGHV1-69/IGKV3-11, which was the most used V gene pair among the S2 antibodies (Figure 1A). Therefore, clusters 10 and 15 represented two major S2 public clonotypes, despite their minimal neutralizing activity (Table S2). In contrast to RBD- and S2-specific clusters, all NTD-specific CDR H3 clusters had a relatively small size (Figure 2A), suggesting that the paratopes for most NTD antibodies are not dominated by CDR H3. Nevertheless, the small number of H3 clusters among NTD antibodies may also be due to fewer antibodies to NTD than to RBD or S2 in our dataset.

Figure 1 (legend, partial). (A) The frequencies of different V gene pairings between heavy and light chains are shown for SARS-CoV-2 S antibodies to RBD, NTD, and S2. The size of each data point represents the frequency of the corresponding IGHV/IGK(L)V pair within its epitope category. Only those antibodies with both IGHV and IGK(L)V information available were included in this analysis. (B) The IGHV gene usage in antibodies to NTD, RBD, and S2 is shown. Only those antibodies with IGHV information available were included in this analysis. (C) The IGK(L)V gene usage in antibodies to NTD, RBD, and S2 is shown. Only those antibodies with IGK(L)V information available were included in this analysis. (B and C) Error bars represent the frequency range among 26 healthy donors (Guo et al., 2019; Soto et al., 2019). See also Figure S1 and Tables S1 and S2.
IGLV6-57 contributes to RBD-specific public antibody response
While most clusters had a dominant IGHV gene, diverse IGHV genes were observed in cluster 7 (Figures 2B and 2C). Most antibodies (42 out of 45) in cluster 7 used IGLV6-57, suggesting that their paratopes are mainly composed of CDR H3 and the light chain. S2A4, which is encoded by IGHV3-7/IGLV6-57 (Piccoli et al., 2020), is an antibody in cluster 7. A previously determined structure of S2A4 in complex with RBD indeed demonstrates that its CDR H3 contributes 38% of the buried surface area (BSA) of the epitope, whereas the light chain contributes 53% (Figures 2D and 2E). Specifically, IGLV6-57 forms an extensive H-bond network with the RBD (Figure 2F), whereas a 97WLRG100 motif at the tip of CDR H3 interacts with the RBD through H-bonds, π-π stacking, and hydrophobic interactions (Figure 2G). Although G100 does not participate in binding, it exhibits backbone torsion angles (Φ = −94°, Ψ = −160°) that are in the preferred region of the Ramachandran plot for glycine, but in the allowed region for non-glycine residues (Figure S4). Consistently, this 97WLRG100 motif is highly conserved in cluster 7 (Figure 2B). This analysis substantiates the importance of the light chain in the public antibody response to SARS-CoV-2 S.

IGHD1-26 contributes to S2-specific public antibody response
As shown in our previous study, the IGHD gene can drive a public antibody response (Wu et al., 2018). Here, we found that IGHD1-26 was highly enriched among S2 antibodies (Figure 3A). These IGHD1-26 S2 antibodies were predominantly encoded by IGHV3-30 (Figure 3B), which was one of the most used IGHV genes among S2 antibodies (Figure 1B). In contrast, the IGK(L)V gene usage was more diverse among these IGHD1-26 S2 antibodies, although several were more frequently used than others (Figure 3C), implying that this public antibody response to S2 was mainly driven by the heavy chain.
Notably, 70% of these IGHD1-26 S2 antibodies had a CDR H3 of 14 amino acids, whereas <20% of other S antibodies had a CDR H3 of 14 amino acids (Figure 3D). In fact, most members of clusters 6, 9, and 11 in our CDR H3 analysis above (Figure 2B) represented this public antibody response to S2. While CDR H3 is also encoded by the IGHJ gene, the distribution of IGHJ gene usage in these IGHD1-26 S2 antibodies did not show a strong deviation from that of other S antibodies in our dataset (Figure 3E). In our dataset, there were 110 IGHD1-26 S2 antibodies from 17 donors with a CDR H3 length of 14 amino acids. Most of these 110 IGHD1-26 S2 antibodies could cross-react with SARS-CoV, but with minimal neutralization activity (Table S2). Sequence logo analysis of these 110 antibodies revealed a conserved 97[S/G]G[S/N]Y100 motif in the middle of their CDR H3 sequences. In-depth analysis of the CDR H3 sequences from three representative IGHD1-26 S2 antibodies from three different donors (Graham et al., 2021; Tong et al., 2021; Wec et al., 2020) further indicated that the conserved 97[S/G]G[S/N]Y100 motif was within the IGHD1-26-encoded region (Figure 3G). Together, these results show that the public antibody response to SARS-CoV-2 S also involves the IGHD gene.

SHM analysis reveals a recurring affinity maturation pathway
Our recent study has shown that VH Y58F is a recurring SHM among IGHV3-53 antibodies to SARS-CoV-2 RBD (Tan et al., 2021), indicating that SHM is involved in the public antibody response to SARS-CoV-2. To identify additional recurring SHMs in SARS-CoV-2 S antibodies, antibodies from at least two donors that had the same IGHV/IGK(L)V genes and CDR H3s from the same CDR H3 cluster were classified as a public clonotype (Figure 4A). An SHM that occurred in at least two donors within a public clonotype was defined as a recurring SHM. This analysis led to the identification of several recurring SHMs in IGHV3-53/3-66-encoded public clonotypes that were previously characterized, including VH F27V, T28I, and Y58F (Hurlburt et al., 2020; Scheid et al., 2021; Tan et al., 2021; Figure S5). Many of the recurring SHMs were not at hotspots for activation-induced deaminase (AID) (Álvarez-Prado et al., 2018; Di Noia and Neuberger, 2007; Yeap et al., 2015). For example, among the seven recurring VH SHMs that had high occurrence frequency in IGHV3-53/3-66-encoded public clonotypes (F27V, F27L, T28I, S31R, S35T, S35N, and Y58F), only VH T28I and S35N involved deamination, and only VH S35N was at the hotspot (nucleotide motif RGYW) for AID (Álvarez-Prado et al., 2018).

Figure 3 (legend, partial). … (Graham et al., 2021; Tong et al., 2021; Wec et al., 2020). While P008_088 and G32M4 were from SARS-CoV-2-infected individuals, ADI-56059 was from a SARS-CoV survivor. Putative germline sequences and segments were identified by IgBlast and are indicated. Somatically mutated nucleotides are underlined. Intervening spaces at the V-D and D-J junctions are N-nucleotide additions. See also Tables S1 and S2.
VL S29R in an IGHV1-58/IGKV3-20 public clonotype represented a previously unknown recurring SHM (Figure 4B). VL S29R emerged in 8 out of 26 (31%) donors that carried this IGHV1-58/IGKV3-20 public clonotype. Antibodies of this IGHV1-58/IGKV3-20 public clonotype bind to the ridge region of SARS-CoV-2 RBD (Figure 5A) and are able to potently neutralize multiple variants of concern (VOCs) (Li et al., 2021b; Schmitz et al., 2021; Wang et al., 2021), including Omicron (Zhou et al., 2022a). Furthermore, the therapeutic antibody tixagevimab is derived from a member of this IGHV1-58/IGKV3-20 public clonotype, namely COV2-2196 (Dong et al., 2021). Here, we compared two previously determined structures of IGHV1-58/IGKV3-20 antibodies in complex with RBD (Dejnirattisai et al., 2021; Wheatley et al., 2021). One has the germline-encoded VL S29 (Figure 5B) and the other carries a somatically mutated VL R29 (Figure 5C). While neither VL S29 nor VL R29 directly interacts with RBD, VL R29 is able to form a cation-π interaction with VL Y32, which in turn forms a T-shaped π-π stacking with RBD-F486 and H-bonds with RBD-C480 (Figure 5C). The positioning of VL R29 can further be stabilized by a salt bridge with another SHM, VL G92D (Figure 5C). The RBD binding affinity of COVOX-253, which is an IGHV1-58/IGKV3-20-encoded antibody, was improved >3-fold by the VL S29R/G92D double mutant but only subtly enhanced or diminished by VL S29R or VL G92D, respectively (Figure 5D), indicating a synergistic effect between VL S29R and VL G92D. In fact, VL G92D seemed to have coevolved with VL S29R, since VL G92D was found in four out of the 67 antibodies in this IGHV1-58/IGKV3-20 public clonotype and all four carried VL S29R (Figure 5E). Moreover, a phylogenetic analysis showed that VL G92D emerged from a cluster of antibodies with VL S29R (Figure 5E). These analyses illustrate that recurring SHMs are associated with the public antibody response to SARS-CoV-2 S and further suggest the existence of common affinity maturation pathways that involve emergence of multiple SHMs in a defined order.

Figure 4 (legend, partial). The x axis represents the position on the V gene (Kabat numbering). The y axis represents the percentage of donors who carry a given recurring SHM among those who carry the public clonotype of interest. For example, VL S29R emerged in 8 of the 26 donors that carry a public clonotype encoded by IGHV1-58/IGKV3-20. As a result, the occurrence frequency of VL S29R (IGHV1-58/IGKV3-20) is 31% (8/26) within the corresponding clonotype. Of note, since each public clonotype is also defined by the similarity of CDR H3 (see STAR Methods), there could be multiple clonotypes with the same heavy- and light-chain V genes (e.g., IGHV3-53/IGKV1-9). The CDR H3 cluster ID for each clonotype is indicated with a prefix "c," following the information of the V genes. For heavy chain, SHMs that emerged in at least 40% of the donors of the corresponding clonotype are labeled. For light chain, SHMs that emerged in at least 20% of the donors of the corresponding clonotype are labeled. See also Figure S5 and Table S1.
Deep learning enables classification of antibody specificity
Since many sequence features of public antibody responses to the S protein could be observed in our dataset, we postulated that the dataset was sufficiently large to train a deep learning model to identify S antibodies. To provide a proof of concept, we trained a deep learning model to distinguish between human antibodies to S and to influenza HA. Among different antigens, HA was chosen here because there were a large number of HA antibodies with published sequences, albeit still fewer than the published SARS-CoV-2 S antibodies. Here, 1,356 unique human antibodies to HA and 3,000 unique human antibodies to SARS-CoV-2 S with complete information for all six CDR sequences were used (Table S3). None of these antibodies had identical sequences in all six CDRs. These antibodies to S and HA were divided into a training set (64%), a validation set (16%), and a test set (20%), with no overlap between the three sets. The overlap of clonotypes was also minimal (Figure S6A). Subsequently, the training set was used to train the deep learning model. The validation set was used to evaluate the model performance during training. The test set was used to evaluate the performance of the final model. Our deep learning model had a simple architecture, which consisted of one encoder per CDR followed by three fully connected layers (Figure 6A). To evaluate the model performance on the test set, the areas under the receiver operating characteristic curve (ROC AUC) and the precision-recall curve (PR AUC) were used to measure the model's ability to avoid misclassification (Flach et al., 2011; Saito and Rehmsmeier, 2015). Model performance was the best when all six CDRs (i.e., H1, H2, H3, L1, L2, and L3) were used to train the model, which resulted in an ROC AUC and a PR AUC of 0.88 and 0.93, respectively (Figure 6B; Table S4). However, reasonable performance was also observed when the model was trained on a subset of CDRs (ROC AUCs = 0.75-0.85 and PR AUCs = 0.84-0.91). These results are consistent with the notion that the public antibody response to SARS-CoV-2 is composed of diverse sequence features on both heavy and light chains.
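As a reminder of what the ROC AUC values above measure, the metric equals the probability that a randomly chosen positive example (here, an S antibody) receives a higher score than a randomly chosen negative example (an HA antibody), with ties counted as one half. The sketch below computes it directly from that pairwise definition; the function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the study's evaluation code.

```python
def roc_auc(y_true, scores):
    """ROC AUC from its pairwise definition.

    Counts, over all (positive, negative) pairs, how often the positive
    is scored higher; ties contribute 0.5. Runs in O(n_pos * n_neg) time,
    which is fine for illustration but slow for large test sets.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy labels (1 = S antibody, 0 = HA antibody) and model scores
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
```

Under this reading, a value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.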
We further tested whether a deep learning model could be trained to distinguish antibodies to different domains of S, namely RBD, NTD, and S2. Since the numbers of NTD and S2 antibodies were small, the model was trained on the heavy-chain CDRs (H1, H2, and H3), so that antibodies without sequence information for the light chain could also be used (Table S3). The ROC AUC and PR AUC of the RBD/NTD/S2 model were 0.79 and 0.62, respectively (Figure S6B), which were much worse than those of the S/HA model above. The poorer performance of the RBD/NTD/S2 model may be attributable to the smaller dataset. Since most known antibodies to SARS-CoV-2 S were RBD-specific, we also examined whether a deep learning model that was trained to distinguish RBD and HA antibodies could achieve a better performance than the S/HA model above. Indeed, the ROC AUC and PR AUC of the RBD/HA model were 0.90 and 0.94, respectively (Figure S6B; Table S3), which were slightly higher than those of the S/HA model. These observations indicate that the size of the training dataset is indeed critical for model performance.
A recent study reported 81 antibodies to SARS-CoV-2 RBD that were elicited by Beta variant infection, of which 44 could cross-react with the ancestral Hu-1 strain and 37 were Beta-specific (Reincke et al., 2022). While these 81 antibodies were not included in the dataset that we assembled (Table S1), they provided an opportunity to further evaluate the performance of our deep learning model. Our deep learning model that was trained on all six CDRs to distinguish between antibodies to S and HA (see above) correctly predicted 77 of the 81 (95%) antibodies as SARS-CoV-2 S antibodies (Figure 6C; Table S5). Of note, since our model was designed to distinguish between antibodies to SARS-CoV-2 S and influenza HA, the prediction on non-S/non-HA antibodies was expected to be close to random. Consistent with that expectation, when we applied our model to 691 HIV antibodies from GenBank (Table S6), 46% were predicted to be S antibodies and 54% were predicted to be HA antibodies (Figure S6C). As different antigenic variants of SARS-CoV-2 emerge and individuals start to accumulate unique SARS-CoV-2 immune histories, the antibody response to SARS-CoV-2 is likely to evolve and diversify. Although our model still performs well on antibodies that were elicited by the Beta variant (Figure 6C), it remains to be explored whether this performance will hold for antibodies that are elicited by SARS-CoV-2 variants that are more antigenically distinct from the ancestral Hu-1 strain originally identified in Wuhan.

DISCUSSION
Through a systematic survey of published information on SARS-CoV-2 antibodies, we identified many molecular features of public antibody responses to SARS-CoV-2. The large amount of published information has allowed us to explore distinct patterns of germline gene usage in antibodies that target different domains on the S protein (i.e., RBD, NTD, and S2). Notably, the types and nature of public antibody responses to different domains appear to be quite different. For example, convergence of CDR H3 sequences can be readily identified in the public antibody responses to RBD and S2. In contrast, the public antibody response to NTD seems to be largely independent of the CDR H3 sequence. Furthermore, an IGHD-dependent public antibody response was enriched against S2, but not RBD or NTD. Together, our study demonstrates the diversity of sequence features that can constitute a public antibody response against a single antigen.

Figure 5 (legend, partial). (D) Binding kinetics between COVOX-253 Fabs (wild type or mutants) and SARS-CoV-2 RBD were measured by biolayer interferometry (BLI). The y axis represents the response. Blue lines represent the response curves and red lines represent the 1:1 binding model. Binding kinetics were measured for five concentrations of the RBDs at 3-fold dilution ranging from 300 to 3.7 nM. The dissociation constant (KD) values ± standard deviations are indicated. (E) A phylogenetic tree was constructed for the light-chain sequences of 67 antibodies in the IGHV1-58/IGKV3-20 public clonotype. The phylogenetic tree was rooted using the germline sequence of IGKV3-20. Each tip represents one antibody and is colored according to the corresponding amino acid variants at VL residues 29 and 92. Amino acid variants that represent SHM are underlined. Numbers of antibodies in the IGHV1-58/IGKV3-20 public clonotype carrying the germline-encoded variant at VL residues 29 and 92 (S29, G92), as well as VL SHM S29R and G92D (red), are listed in the inset table. Of note, one antibody in this IGHV1-58/IGKV3-20 public clonotype carries S29/N92 and another carries S29/V92. However, they are not listed in the table here.
The public antibody response to SARS-CoV-2 has also been examined by a recent data mining study that focused on identifying public clonotypes. This previous study defined public clonotypes as antibodies with the same IGHV/IGHJ/IGK(L)V/IGK(L)V genes and high similarity of CDR H3. While multiple public clonotypes were identified using this stringent definition, the characterization of public antibody responses is likely far from complete. A public antibody response may not always involve a defined pair of IGHV/IGK(L)V genes, especially when either IGHV or IGK(L)V gene-encoded residues make only a minimal contribution to the paratope. In fact, a well-characterized public antibody response to the highly conserved stem region of influenza HA has a paratope that is entirely attributed to the IGHV1-69 heavy chain (Dreyfus et al., 2012; Ekiert et al., 2009; Lang et al., 2017; Sui et al., 2009). IGHV3-30/IGHD1-26 antibodies to S2 in our study may represent a similar type of IGK(L)V-independent public antibody response, although this still needs to be confirmed by structural analysis. At the other extreme, RBD antibodies that are encoded by IGLV6-57 with a 97WLRG100 motif in the CDR H3 represent a public response that is largely independent of IGHV gene usage. Given the diverse types of public antibody responses to SARS-CoV-2 S, we need to acknowledge the limitation of using the conventional strict definition of public clonotype to study public antibody responses. Public antibody responses to different antigens can have very different sequence features. For example, IGHV6-1 and IGHD3-9 are signatures of the public antibody response to influenza virus (Joyce et al., 2016; Kallewaard et al., 2016; Wu et al., 2018, 2020a), whereas IGHV3-23 is frequently used in antibodies to Dengue and Zika viruses (Robbiani et al., 2017). In contrast, these germline genes are seldom used in the antibody response to SARS-CoV-2 as compared with the naive baseline. Since the binding specificity of an antibody is determined by its structure, which in turn is determined by its amino acid sequence, the antigen specificity of an antibody can theoretically be identified based on its sequence. This study provides a proof of concept by training a deep learning model to distinguish between SARS-CoV-2 S antibodies and influenza HA antibodies, solely based on primary sequence information. Technological advancements, such as the development of single-cell high-throughput screening using the Berkeley Lights Beacon optofluidics device (Winters et al., 2019) and advances in paired B cell receptor sequencing (Curtis and Lee, 2020), have been accelerating the speed of antibody discovery and characterization. As more sequence information on antibodies to different antigens is accumulated, we may be able in the future to construct a generalized sequence-based model to accurately predict the antigen specificity of any antibody.

Figure 6 (legend, partial). … Reincke et al. (2022) were predicted by a deep learning model that was trained to distinguish between S antibodies and HA antibodies. See also Figure S6 and Tables S3-S6.
In summary, the amount of publicly available information on SARS-CoV-2 antibodies has provided invaluable biological insights that have not been readily obtained for other pathogens. One reason is that the COVID-19 pandemic has gathered scientists from many fields and around the globe to work intensively on SARS-CoV-2. The parallel efforts by many different research groups have enabled SARS-CoV-2 antibodies to be discovered at unprecedented speed and scale that have not been possible for other pathogens. We anticipate that knowledge of the molecular features of the antibody response to SARS-CoV-2 will keep growing as more antibodies are isolated and characterized. Ultimately, the extensive characterization of antibodies to the SARS-CoV-2 S protein may allow us to address some of the most fundamental questions about antigenicity and immunogenicity, as well as how the human immune repertoire has evolved to respond to specific classes of viral pathogens that have coexisted with humans for hundreds to thousands of years.
Limitations of the study
Many antibodies in our collection were isolated from SARS-CoV-2-infected individuals. However, sequence information of the infecting viral variants was not available in the original publications. Although most of these antibodies were isolated during the early phase of the COVID-19 pandemic, some antibodies in our collection may have been elicited by a SARS-CoV-2 variant rather than the ancestral Hu-1 strain. Relatedly, this study did not examine the antibody specificity to different variants. By leveraging the published information on antibody neutralization activity to different variants, future analysis could investigate the relationship between antibody sequence features and neutralization breadth.

Identification of recurring somatic hypermutation (SHM)
In this analysis, a public clonotype was defined as antibodies from at least two donors that had the same IGHV/IGK(L)V genes and CDR H3s from the same CDR H3 cluster (see "CDR H3 clustering analysis" above). For each antibody, ANARCI was used to number the position of each residue according to Kabat numbering (Dunbar and Deane, 2016). The amino-acid identity at each residue position of an antibody was then compared to that of the putative germline gene. CDR H3, CDR L3, and framework region 4 in both heavy and light chains were not included in this analysis. Insertions and deletions were also ignored in this analysis. An SHM that occurred in at least two donors within a public clonotype was defined as a recurring SHM.
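The counting logic described above can be sketched in a few lines. This is a minimal illustration that assumes residues have already been numbered (for example by ANARCI) and are supplied as position-to-residue dictionaries; the helper names `call_shms` and `recurring_shms`, the record layout, the clonotype label, and the toy germline and donor data are all hypothetical and do not reproduce the study's actual scripts.

```python
from collections import defaultdict

def call_shms(germline, observed, chain):
    """Return SHMs as strings such as 'VH Y58F'.

    `germline` and `observed` map a numbered position (string) to a residue.
    Positions absent from the germline map are skipped, which is how this
    sketch ignores insertions.
    """
    return {
        f"{chain} {germline[pos]}{pos}{aa}"
        for pos, aa in observed.items()
        if pos in germline and germline[pos] != aa
    }

def recurring_shms(antibodies, min_donors=2):
    """Map (clonotype, SHM) to the fraction of the clonotype's donors carrying it.

    Only SHMs seen in at least `min_donors` donors of a clonotype are kept.
    """
    donors_per_clonotype = defaultdict(set)
    donors_per_shm = defaultdict(set)
    for ab in antibodies:
        donors_per_clonotype[ab["clonotype"]].add(ab["donor"])
        for shm in ab["shms"]:
            donors_per_shm[(ab["clonotype"], shm)].add(ab["donor"])
    return {
        key: len(donors) / len(donors_per_clonotype[key[0]])
        for key, donors in donors_per_shm.items()
        if len(donors) >= min_donors
    }

# Hypothetical toy data: three donors sharing one made-up clonotype label
germline_vh = {"27": "F", "28": "T", "58": "Y"}
toy = [
    {"donor": "d1", "clonotype": "toy_c1",
     "shms": call_shms(germline_vh, {"27": "V", "28": "T", "58": "F"}, "VH")},
    {"donor": "d2", "clonotype": "toy_c1",
     "shms": call_shms(germline_vh, {"27": "F", "28": "T", "58": "F"}, "VH")},
    {"donor": "d3", "clonotype": "toy_c1",
     "shms": call_shms(germline_vh, {"27": "F", "28": "T", "58": "Y"}, "VH")},
]
result = recurring_shms(toy)
```

Counting donors rather than antibodies keeps a single donor with many clonally related antibodies from inflating the frequency, which matches the per-donor percentages quoted in the text (e.g., 8/26 donors).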
Expression and purification of SARS-CoV-2 RBD
SARS-CoV-2 spike receptor-binding domain (RBD) was expressed in mammalian cells and purified as described previously (Wu et al., 2020b). Briefly, the RBD (residues 319-541) of the SARS-CoV-2 spike (S) protein (GenBank: QHD43416.1) was cloned into a phCMV3 vector and fused with a C-terminal 6xHis tag. The plasmid was transiently transfected into Expi293F cells using ExpiFectamine 293 Reagent (Thermo Fisher Scientific) according to the manufacturer's instructions. The supernatant was collected at 7 days post-transfection. The protein was purified with Ni Sepharose excel resin (Cytiva) followed by size exclusion chromatography.

Expression and purification of Fabs
The heavy and light chains were individually cloned into a phCMV3 vector. The plasmids were transiently co-transfected into ExpiCHO cells at a ratio of 2:1 (HC:LC) using ExpiFectamine CHO Reagent (Thermo Fisher Scientific) according to the manufacturer's instructions. The supernatant was collected at 7 days post-transfection. The Fabs were purified with a CaptureSelect CH1-XL Affinity Matrix (Thermo Fisher Scientific) followed by size exclusion chromatography.

Model construction
The deep learning model consisted of two networks, namely a multi-encoder (ME) and a stack of multi-layered perceptrons (MLP). The CDR amino-acid sequences were taken as input and passed to the ME. Specifically, each CDR amino-acid sequence was described by a 21-letter alphabet vector $\vec{x} = (x_1, x_2, \ldots, x_{L-1}, x_L)$, $\vec{x} \in \mathbb{R}^{L}$, where $L$ represented the length of the sequence and each $x_i$ represented an amino-acid category. Each of the 20 canonical amino acids was one category, whereas all the ambiguous amino acids were grouped as the 21st category. Before passing to the ME, tokenized amino-acid sequences were processed by zero padding, so that the size of each input was the same. Subsequently, the inputs were mapped to embedding vectors with an additional dimension $d$. Sinusoidal positional encoding vectors were added to the embedding vectors to encode the relative position of tokens (i.e., amino acids) in the sequence. Each embedding vector, $\vec{x} \in \mathbb{R}^{L \times d}$, with a size of $L \times d$, was passed into a transformer encoder layer, which used the self-attention mechanism to learn the sequence features (Vaswani et al., 2017). All learned sequence features were then concatenated and passed to the MLP. Each MLP layer contained leaky rectified linear unit (ReLU) activations to avoid the vanishing gradient. Dropout layers were placed after each MLP block to avoid model overfitting (Srivastava et al., 2014). The final output layer was followed by a sigmoid activation function to predict the probability of different classes. The prediction losses were calculated by binary cross-entropy loss.
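The input preparation described above (tokenization, zero padding, and sinusoidal positional encoding) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the token index layout (0 for padding, 1 to 20 for canonical amino acids, 21 for ambiguous residues), the function names `tokenize` and `positional_encoding`, and the example values of `max_len` and `d` are all hypothetical. The positional encoding follows the sine/cosine formulation of Vaswani et al. (2017).

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids
PAD = 0          # assumed padding token (zero padding)
AMBIGUOUS = 21   # assumed index of the 21st category (ambiguous residues)
TOKEN = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def tokenize(cdr, max_len):
    """Map a CDR sequence to integer tokens and zero-pad it to `max_len`."""
    tokens = [TOKEN.get(aa, AMBIGUOUS) for aa in cdr.upper()]
    if len(tokens) > max_len:
        raise ValueError("sequence is longer than max_len")
    return tokens + [PAD] * (max_len - len(tokens))


def positional_encoding(max_len, d):
    """Return a max_len x d table of sinusoidal positional encodings.

    Even dimensions use sine and odd dimensions use cosine, with the
    wavelength growing geometrically along the embedding dimension.
    """
    table = [[0.0] * d for _ in range(max_len)]
    for pos in range(max_len):
        for j in range(d):
            angle = pos / (10000 ** (2 * (j // 2) / d))
            table[pos][j] = math.sin(angle) if j % 2 == 0 else math.cos(angle)
    return table


# Hypothetical CDR fragment containing one ambiguous residue ("X").
tokens = tokenize("ARDXY", max_len=8)
pe = positional_encoding(max_len=8, d=4)
print(tokens)
print(pe[0])
```

In a full model, each token would first be mapped to a learned $d$-dimensional embedding, and the corresponding row of the positional-encoding table would be added to it before the transformer encoder layer.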

Training detail
SARS-CoV-2 S antibodies and influenza HA antibodies with complete information for all six CDR sequences were identified. Sequences of the HA antibodies were obtained from the NCBI GenBank database (www.ncbi.nlm.nih.gov/genbank) (Table S3). If all six CDR sequences were the same between two or more antibodies, only one of these antibodies was retained. After filtering duplicates, there were 4,736 antibodies to SARS-CoV-2 and 1,356 to influenza HA. To avoid data imbalance, we further down-sampled to 3,000 SARS-CoV-2 antibodies. The CDR sequences were identified by IgBLAST and PyIR (Soto et al., 2020; Ye et al., 2013). This dataset was randomly split into a training set (64%), a validation set (16%), and a test set (20%). The training set was used to train the deep learning model. The validation set was used to evaluate the model performance during training. The test set was used to evaluate the performance of the final model. There was no overlap of antibody sequences among the training set, validation set, and test set. The Adam algorithm was used to optimize the model. The same procedure was used for training the RBD/NTD/S2 model and the RBD/HA model, except that the prediction losses for the RBD/NTD/S2 model were calculated by categorical cross-entropy loss since it has more than two categories. For the RBD/NTD/S2 model, the number of RBD antibodies was down-sampled to 800. Without down-sampling the RBD antibodies, the model would be highly biased towards RBD, with very low recall rates of 0.39 and 0.16 for S2 antibodies and NTD antibodies, respectively. For the RBD/HA model, 3,000 RBD antibodies and 1,356 HA antibodies were used.

Performance metrics
For evaluating model performance, S antibodies and HA antibodies were considered "positive" and "negative", respectively. False positives (FP) and false negatives (FN) were samples that were misclassified by the model, while true negatives (TN) and true positives (TP) were correctly classified ones.
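The four outcome classes defined above can be tallied as in the following Python sketch, with S antibodies as the positive class and HA antibodies as the negative class. Recall is shown because recall rates are reported in this section; precision and accuracy are included only as commonly used companions and are an assumption about which metrics are relevant here. The function names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="S"):
    """Count TP, FP, FN, and TN for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1   # S antibody correctly predicted as S
            else:
                fp += 1   # HA antibody misclassified as S
        else:
            if truth == positive:
                fn += 1   # S antibody misclassified as HA
            else:
                tn += 1   # HA antibody correctly predicted as HA
    return tp, fp, fn, tn


def recall(tp, fn):
    """Fraction of true positives among all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Fraction of true positives among all predicted positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(tp, fp, fn, tn):
    """Fraction of all samples that were classified correctly."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Toy labels (hypothetical): true specificity versus model prediction.
y_true = ["S", "S", "S", "HA", "HA", "S"]
y_pred = ["S", "HA", "S", "HA", "S", "S"]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)
print(recall(tp, fn), precision(tp, fp), accuracy(tp, fp, fn, tn))
```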
The following metrics were computed to evaluate model performance:

In addition, we used the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve to measure the model's ability to avoid misclassification (Flach et al., 2011; Saito and Rehmsmeier, 2015). The areas under the ROC curve (ROC AUC) and the PR curve (PR AUC) were computed using the "keras.metrics" module in TensorFlow (Abadi et al., 2016).
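To make the meaning of ROC AUC concrete, the following Python sketch computes it through its rank-based interpretation: the probability that a randomly chosen positive sample receives a higher predicted score than a randomly chosen negative sample, with ties counted as one half. This is an illustrative, exact pairwise calculation and not the procedure used in this study, which relied on the "keras.metrics" module; that module estimates the area from a finite set of thresholds, so its values can differ slightly. The function name `roc_auc` and the toy scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Compute ROC AUC by comparing every positive score with every negative score.

    `labels` holds 1 for positive (S antibody) and 0 for negative (HA antibody);
    `scores` holds the predicted probability of the positive class.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy predictions (hypothetical): three S antibodies and two HA antibodies.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a test set of a few thousand antibodies but would be replaced by a sort-based calculation for larger datasets.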

QUANTIFICATION AND STATISTICAL ANALYSIS
Standard deviation for $K_D$ estimation was computed by Octet analysis software 9.0.