Human T cell receptor occurrence patterns encode immune history, genetic background, and receptor specificity

The T cell receptor (TCR) repertoire encodes immune exposure history through the dynamic formation of immunological memory. Statistical analysis of repertoire sequencing data has the potential to decode disease associations from large cohorts with measured phenotypes. However, the repertoire perturbation induced by a given immunological challenge is conditioned on genetic background via major histocompatibility complex (MHC) polymorphism. We explore associations between MHC alleles, immune exposures, and shared TCRs in a large human cohort. Using a previously published repertoire sequencing dataset augmented with high-resolution MHC genotyping, our analysis reveals rich structure: striking imprints of common pathogens, clusters of co-occurring TCRs that may represent markers of shared immune exposures, and substantial variations in TCR-MHC association strength across MHC loci. Guided by atomic contacts in solved TCR:peptide-MHC structures, we identify sequence covariation between TCR and MHC. These insights and our analysis framework lay the groundwork for further explorations into TCR diversity.


Introduction
T cells are the effectors of cell-mediated adaptive immunity in jawed vertebrates. To control a broad array of pathogens, massive genetic diversity in loci encoding the T cell receptor (TCR) is generated somatically throughout an individual's life via a process called V(D)J recombination. All nucleated cells regularly process and present internal peptide antigens on cell surface molecules called major histocompatibility complex (MHC). Through the interface of TCR and MHC, a T cell with a TCR having affinity for a peptide antigen complexed with MHC (pMHC) is stimulated to initiate an immune response to an infected (or cancerous) cell. The responding T cell proliferates clonally, and its progeny inherit the same antigen-specific TCR, constituting long-term immunological memory of the antigen. The diverse population of TCR clones in an individual (the TCR repertoire) thus dynamically encodes a history of immunological challenges.
Advances in high-throughput TCR sequencing have shown the potential of the TCR repertoire as a personalized diagnostic of pathogen exposure history, cancer, and autoimmunity (Thomas et al., 2014;Kirsch et al., 2015;Friedensohn et al., 2017;Ostmeyer et al., 2017). Public TCRs-defined as TCR sequences seen in multiple individuals and perhaps associated with a shared disease phenotype-have been found in a range of infectious and autoimmune diseases and cancers including influenza, Epstein-Barr virus, and cytomegalovirus infections, type I diabetes, rheumatoid arthritis, and melanoma (Venturi et al., 2008;Li et al., 2012;Madi et al., 2017;Pogorelyy et al., 2017;Dash et al., 2017;Glanville et al., 2017;Chu et al., 2018;Pogorelyy et al., 2018). By correlating occurrence patterns of public TCRb chains with cytomegalovirus (CMV) serostatus across a large cohort of healthy individuals, Emerson et al. identified a set of CMV-associated TCR chains whose aggregate occurrence was highly predictive of CMV seropositivity (Emerson et al., 2017). Staining with multimerized pMHC followed by flow cytometry has been used to isolate and characterize large populations of T cells that bind to defined pMHC epitopes (Dash et al., 2017;Glanville et al., 2017), providing valuable data on the mapping between TCR sequence and epitope specificity. We and others have leveraged these data to develop learning-based models of TCR:pMHC interactions, using TCR distance measures (Dash et al., 2017), CDR3 sequence motifs (Glanville et al., 2017) and k-mer frequencies (Cinelli et al., 2017), and other techniques.
MHC proteins in humans are encoded by the human leukocyte antigen (HLA) loci and are among the most polymorphic in the human genome (Robinson et al., 2015). Within an individual, six major antigen-presenting proteins are each encoded by polymorphic alleles. The set of these alleles comprise the individual's HLA type, which is unlikely to be shared with an unrelated individual and which determines the subset of peptide epitopes presented to T cells for immune surveillance. Specificity of a given TCR for a given antigen is biophysically modulated by MHC structure: MHC binding specificity determines the specific antigenic peptide that is presented, and the TCR binds to a hybrid molecular surface composed of peptide-and MHC-derived residues. Thus, population-level studies of TCR-disease association are severely complicated by a dependence on individual HLA type. eLife digest The immune system has two major ways of clearing up an infection. A rapid, first line of defense buys time while the second 'adaptive' response disposes of the threat with precision. The adaptive response takes longer to develop but once it has dealt with a disease, it remembers: the next time the body encounters the same threat, the immune system can respond much faster.
When cells are infected by a disease-causing microbe, like a bacterium or a virus, they start carrying fragments of that microbe on their surface. Immune cells known as T cells then recognize these fragments using proteins called T cell receptors. Each T cell has a different receptor, which is specific to a precise fragment of a particular microbe. After successfully clearing an infection, some of the T cells that were mobilized remain in the blood. These memory T cells, and their specific receptors, are a lasting trace of the infections a person has encountered in the past.
The exact portion of the microbial fragments that the T cells receptors can 'see' depends on another set of proteins, called MHC. These hold the fragments at the surface of the infected cells. The genes that code for MHCs are incredibly diverse, to the point that the exact combination of MHCs carried by a cell can be specific to an individual. However, different MHCs present different microbial fragments, and this changes which receptor can recognize the infection. At the level of a population, this mechanism makes it difficult to use T cell receptors to know exactly which diseases people had to face.
Here, DeWitt et al. look at the T cell receptor sequences of 666 healthy participants, as well as their MHC variants, to try to reconstruct their disease history. This revealed that many people have clusters of similar T cells receptors sequences that occur together; these could be linked to exposure to common viruses such as parvovirus, influenza, cytomegalovirus and Epstein-Barr virus. Furthermore, examining 3D structures of T cell receptors binding to fragments carried by MHCs helps to identify how changes in the sequence of the MHC can influence which receptor will be able to attach to the complex.
These results show that, despite the diversity and complexity of T cell receptors and MHCs, it is possible to spot patterns across people, and to start understanding how those patterns emerge. In addition to fighting body invaders, T cells can also use their receptors to recognize certain protein fragments carried by tumor cells. Improving our knowledge of T cell receptors and MHCs could give new insights to fight cancer.
Here we report an analysis of the occurrence patterns of public TCRs in a cohort of 666 healthy volunteer donors, in which information on only TCR sequence and HLA association guide us to inferences concerning disease history. To complement deep TCRb repertoire sequencing available from a previous study (Emerson et al., 2017), we have assembled high-resolution HLA typing data at the major class I and class II HLA loci on the same cohort, as well as information on age, sex, ethnicity, and CMV serostatus. We focus on statistical association of TCR occurrence with HLA type, and show that many of the most highly HLA-associated TCRs are likely responsive to common pathogens: for example, eight of the ten TCRb chains most highly associated with the HLA-A*02:01 allele are likely responsive to one of two viral epitopes (influenza M1 58 and Epstein-Barr virus BMLF1 280 ). We introduce new approaches to cluster TCRs by primary sequence and by the pattern of occurrences among individuals in the cohort, and we identify highly significant TCR clusters that may indicate markers of immunological memory. Four of the top five most significant clusters appear linked with common pathogens (parvovirus B19, influenza virus, CMV, and Epstein-Barr virus), again highlighting the impact of viral pathogens on the public repertoire. We also find HLA-unrestricted TCR clusters, some likely to be mucosal-associated invariant T (MAIT) cells, which recognize bacterial metabolites presented by non-polymorphic MR1 proteins, rather than pMHC (Kjer-Nielsen et al., 2012). Our global analysis of TCR-HLA association identifies striking variation in association strength across HLA loci and highlights trends in V(D)J generation probability and degree of clonal expansion that illuminate selection processes in cellular immunity. Guided by structural analysis, we used our large dataset of HLA-associated TCRb chains to identify statistically significant sequence covaration between the TCR CDR3 loop and the DRB1 allele sequence that preserves charge complementarity at the TCR:pMHC interface. These analyses help elucidate the complex dependence of TCR sharing on HLA type and immune exposure, and will inform the growing number of studies seeking to identify TCR-based disease diagnostics.

The matrix of public TCRs
Of the 80 million unique TCRb chains (defined by V-gene family and CDR3 sequence) in the 666 cohort repertoires, about 11 million chains are found in at least two individuals and referred to here as public chains (for a more nuanced examination of TCR chain sharing see [Elhanati et al., 2018]). The occurrence patterns of these public TCRbs-the subset of subjects in which each distinct chain occurs-can be thought of as forming a very large binary matrix M with about 11 million rows and 666 columns. Entry M i;j contains a one or a zero indicating presence or absence, respectively, of TCR i in the repertoire of subject j (ignoring for the moment the abundance of TCR i in repertoire j; Figure 1 depicts two illustrative sub-matrices of M). (Emerson et al., 2017) demonstrated that this binary occurrence matrix M encodes information on subject genotype and immune history: they were able to successfully predict HLA-A and HLA-B allele type and CMV serostatus by learning sets of public TCRb chains with occurrence patterns that were predictive of these features. Specifically, each feature-such as the presence of a given HLA allele (e.g. HLA-A*02:01) or CMV seropositivitydefines a subset of the cohort members positive for that feature, and can be encoded as a vector of 666 binary digits. This phenotype occurrence pattern of zeros and ones can be compared to the occurrence patterns of all the public TCRb chains to identify similar patterns, as quantified by a pvalue for significance of co-occurrence across the 666 subjects; thresholding on this p-value produces a subset of significantly associated TCRb chains whose collective occurrence in a repertoire was found by Emerson et al. to be predictive of the feature of interest (in cross-validation and, for CMV, on an independent cohort). Generalizing from these results, it is reasonable to expect that other common immune exposures may be encoded in the occurrence matrix M, and that these encodings could be discovered if we had additional phenotypic data to correlate with TCR occurrence patterns. In this study, we set out to discover these encoded exposures de novo, without additional phenotypic correlates, by learning directly from the structure of the occurrence matrix M and using as well the sequences of the TCRb chains (both their similarities to one another and to TCR sequences characterized in the literature). We hypothesized that patterns of TCR co-occurrence (correlations between rows in the matrix M) might indicate shared responses to unknown immune exposures, that co-occurrence between TCR chains and HLA alleles (correlations between rows in M and rows in the HLA allele occurrence matrix) could be used to help identify functional TCR chains, and that clustering TCRb chains by co-occurrence and sequence could highlight functional associations (Figure 1). To support this effort we assembled additional HLA typing data for the subjects, now at 4-digit resolution (e.g., A*02:01 rather than A*02) and including MHC class II alleles, and we compiled a dataset of annotated TCRb chains by combining online TCR sequence databases, structurally characterized TCRs, and published studies (see Materials and methods; [Shugay et al., 2018;Tickotsky et al., 2017;Berman et al., 2000;Dash et al., 2017;Glanville et al., 2017;Song et al., 2017;Kasprowicz et al., 2006]). Here we describe the outcome of this discovery process, and we report a number of intriguing general observations about the role of HLA in shaping the T cell repertoire.
The results of our analysis are organized in the remaining five sections as follows. We begin with an examination of TCR co-occurrence patterns across the full cohort (first section, Figures 2-3). In the next section we examine patterns of TCR-HLA association (Table 1 and  . In the third section we analyze TCR co-occurrence within subsets of the cohort positive for specific HLA alleles, and we identify TCR clusters that may be reflective of shared immune exposures (Figures 6-8). In the fourth section we use our dataset of HLA-associated TCRb chains to identify covariation between HLA and the TCRb CDR3 sequence (Table 2 and Figure 9). In the final section we focus on CMVresponsive TCRb chains, examining their degree of HLA-restriction and the extent to which they may be responding to other antigens ( Figure 10). Figure 1 provides a graphical overview of the co-occurrence analysis.

Globally co-occurring TCR pairs form clusters defined by shared associations
We hypothesized that we could identify unknown immune exposures encoded in the public repertoire by comparing the occurrence patterns of individual TCRb chains to one another. A subset of TCRb chains that strongly co-occur across the 666 cohort subjects might correspond to an unmeasured immune exposure that is common to a subset of subjects. Since shared HLA restriction could represent an alternative explanation for significant TCR co-occurrence, we also compared the TCR occurrence patterns to the occurrence patterns for class I and class II HLA alleles. We began by analyzing TCR occurrence patterns over the full set of cohort members. For each pair of public TCRb chains t 1 and t 2 we computed a co-occurrence p-value P CO t 1 ; t 2 ð Þ that reflects the probability of seeing an equal or greater overlap of shared subjects (i.e., subjects in whose repertoires both t 1 and t 2 are found) if the occurrence patterns of the two TCRs had been chosen randomly (for details, see Materials and methods and Figure 12). In a similar manner we computed, for each HLA allele a and TCR t, an association p-value P HLA a; t ð Þ that measures the degree to which TCR t tends to occur in subjects positive for allele a. Finally, for each pair of strongly co-occurring (P CO <1 Â 10 À8 ) TCRb chains t 1 and t 2 , we looked for a mutual HLA association that might explain their co-occurrence, by finding the allele having the strongest association with both t 1 and t 2 , and noting its association pvalue: where A denotes the set of all HLA alleles. In words, we take the p-value of the strongest HLA allele association with the TCR pair, where the association of an HLA allele with a TCR pair is defined by the weakest association of the allele among the individual TCRs. Based on this analysis, we identified two broad classes of strongly co-occurring TCR pairs (Figure 2): those with a highly significant shared HLA association, where the co-occurrence of the two TCRs can be explained by a shared HLA allele association (i.e. a common HLA restriction), and those with only modest shared HLA-association p-value, for which another explanation of co-occurrence must be sought. Points above the dashed y ¼ x line correspond to pairs of TCRs for which there exists an HLA allele whose co-occurrence with each of the TCRs is stronger than their mutual cooccurrence, while for points below the line no such HLA allele was present in the dataset.
We used a neighbor-based clustering algorithm, DBSCAN (Ester et al., 1996), to link strongly co-occurring TCR pairs together to form larger correlated clusters (see Materials and methods), and then investigated phenotype associations with these clusters. At an approximate family-wise error rate of 0:05 (see Materials and methods), we identified 28 clusters of co-occurring TCRs, with sizes ranging from 7 to 386 TCRs ( Figure 3). Given one of these clusters of co-occurring TCRs, we can Figure 1. Clustering of TCR occurrence patterns across the full cohort (top) and within a cohort subset defined by a shared HLA allele (bottom). As described in detail in the following sections, we used covariation analysis to identify clusters of co-occurring TCRb chains. Here we provide a graphical introduction to these results by depicting occurrence patterns of clustered TCRs over the full cohort and over a cohort subset defined by a single HLA allele (HLA-A*01:01). TCR clusters over the full cohort are largely driven by the occurrence patterns of specific HLA alleles (compare the occurrence patterns of the top five global clusters to those of the top 5 HLA alleles, respectively), whereas HLA-restricted clusters may reflect shared immune exposures, as illustrated here by a CMV-associated TCR cluster (the pink cluster in the bottom panels). In the top left panels, occurrence patterns of HLA alleles and TCRb chains (rows) are indicated for each of the cohort subjects (columns) by filled (black) matrix elements. The TCRb chains chosen for depiction in the occurrence matrix are the members of the 28 global co-occurrence clusters identified in section 'Globally co-occurring TCR pairs form clusters defined by shared associations'. The TCRs (rows) are ordered by cluster membership as indicated by colored bands to the left of the matrix. The selected HLA alleles correspond to the strongest associations for the top 10 clusters (two of which are not HLA-associated). The cohort subjects are ordered by column similarity so as to emphasize block structure present in the matrix. The bottom left panels similarly show occurrence patterns for HLA-A*01:01-associated TCRb chain clusters over the subset of subjects carrying this allele, alongside an indicator of cytomegalovirus seropositivity for each subject (red). In-depth analysis of these (and other) HLA-associated TCRb clusters is presented in section 'HLA-restricted TCR clusters'. For visualization purposes, two-dimensional embeddings of the TCRb chains based on their occurrence patterns (binary strings representing presence/ absence in the subjects) are depicted in the right panels, with the TCR chains colored by cluster assignment and annotated by known associations. DOI: https://doi.org/10.7554/eLife.38358.003 count the number of cluster member TCRs found in each subject's repertoire. The aggregate occurrence pattern of the cluster can be visualized as a rank plot of this cluster TCR count over the subjects (the black curves in Figure 3C-D). This ranking can also be compared with other phenotypic or genotypic features of the same subjects. In particular, by comparing this aggregate occurrence pattern to a control pattern generated by repeatedly choosing equal numbers of subjects independently at random (dotted green lines in Figure 3C-D), we can identify a subset of the cohort with an apparent enrichment of cluster member TCRs and look for overlap between this subset and other defined cohort features. Performing this comparison against the occurrence patterns of class I and class II HLA alleles revealed that the majority of the TCR clusters were strongly associated with at least one HLA allele (as depicted for a DRB1*15:01-associated cluster in Figure 3C and summarized in Figure 3B).
In addition, there were two large clusters of TCRs which were not strongly associated with any of the typed HLA alleles (clusters 6 and 7 in Figure 3). Visual inspection of the CDR3 regions of TCRs in one of these clusters revealed a distinctive 'YV' C-terminal motif that is characteristic of the TRBJ2-7*02 allele (Figure 3-figure supplement 1), and indeed the 41 subjects whose repertoires indicated the presence of this genetic variant were exactly the 41 subjects enriched for members of this TCR cluster ( Figure 3D). This demonstrated that population diversity in germline allele sets Figure 2. Strongly co-occurring TCR pairs form two broad classes distinguished by HLA-association strength. The co-occurrence p-value P CO for each pair of public TCRs is plotted (x-axis) against the HLA-association p-value P HLA for the HLA allele with the strongest mutual association with that TCR pair (y-axis). There are 6092 TCR-pairs above the diagonal (y ¼ x) and 4713 pairs below the diagonal. DOI: https://doi.org/10.7554/eLife.38358.004 The following source data is available for figure 2:  (Howie et al., 2015). DOI: https://doi.org/10.7554/eLife.38358.009 manifests as occurrence pattern clustering. The other large, non-HLA associated TCR cluster had a number of distinctive features as well: strong preference for the TRBV06 family, followed by TRBV20 and TRBV04 (Figure 3-figure supplement 2); low numbers of inserted 'N' nucleotides; and a skewed age distribution biased toward younger subjects ( Figure 3-figure supplement 3). These features, together with the lack of apparent HLA restriction, suggested that this cluster represented an invariant T cell subset, specifically MAIT (mucosal-associated invariant T) cells (Kjer-Nielsen et al., 2012;Venturi et al., 2013;Pogorelyy et al., 2017). Since MAIT cells are defined primarily by their alpha chain sequences, we searched in a recently published paired dataset (Howie et al., 2015) for partner chains of the clustered TCRb chain sequences, and found a striking number that matched the MAIT consensus (TRAV1-2 paired with TRAJ20/TRAJ33 and a 12 residue CDR3, Figure 3-figure supplement 3D). We also looked for these clustered TCRs in a recently published MAIT cell sequence dataset (Howson et al., 2018) and found that 93 of the 138 cluster member TCRs occurred among the 31,654 unique TCRs from this dataset; of these 93 TCRb chains, 27 were found among the 78 most commonly occurring TCRs in the dataset (the TCRs occurring in at least 7 of the 24 sequenced repertoires), a highly significant overlap (P<2 Â 10 À52 in a one-sided hypergeometric test). These concordances indicate that our untargeted approach has detected a well-studied T cell subset de novo through analysis of occurrence patterns.

HLA-associated TCRs
These analyses suggested to us that TCR co-occurrence patterns across the full cohort of subjects are strongly influenced by the distribution of the HLA alleles, in accordance with the expectation that the majority of ab TCRs are HLA-restricted. Covariation between TCRs responding to the same HLA-restricted epitopes would only be expected in subjects positive for the restricting alleles, with TCR presence and absence outside these subjects likely introducing noise into the co-occurrence analysis. We therefore decided to analyze patterns of TCR co-occurrence within subsets of the cohort positive for specific HLA alleles, and to restrict our co-occurrence analysis to TCRs having a statistically significant association with the specific allele defining the cohort subset. To begin, we performed a comprehensive analysis of TCR-HLA association.
At a false discovery rate of 0.05 (estimated from shuffling experiments; see Materials and methods), we were able to assign 16,951 TCRb sequences to an HLA allele (or alleles: DQ and DP alleles were analyzed as ab pairs, and there were 5 DR/DQ haplotypes whose component alleles were so highly correlated across our cohort that we could not assign TCR associations to individual DR or DQ components; see Materials and methods). Table 1 lists the top 50 HLA-associated TCR sequences by association p-value and top 10 associated TCRs for the well-studied A*02:01 allele.
We find that 8 of the top 10 A*02:01-associated TCRs have been previously reported and annotated as being responsive to viral epitopes, specifically influenza M1 58 and Epstein-Barr virus (EBV) BMLF1 280 (Shugay et al., 2018;Tickotsky et al., 2017). Moreover, each of these 8 TCRb chains is present in a recent experimental dataset (Dash et al., 2017) that included tetramer-sorted TCRs positive for these two epitopes; each TCR has a clear similarity to one of the consensus epitope-specific repertoire clusters identified in that work, with the EBV TRBV20, TRBV29, and TRBV14 TCRs, respectively, matching the three largest branches of the BMLF1 280 TCR tree, and the three influenza M1 58 TCRs all matching the dominant TRBV19 'RS' motif consensus (Figure 4-figure supplement 2). TCRs with annotation matches are sparser in the top 50 across all other alleles, which is likely due in part to a paucity of experimentally characterized non-A*02 TCRs, however we again see EBV-epitope responsive TCRs (with B*08:01 and B*35:01 restriction).
A global comparison of TCR feature distributions for HLA-associated versus non-HLA-associated TCRs provides further evidence of functional selection. As shown in Figure 4A, HLA-associated TCRs are on average more clonally expanded than a set of background, non-HLA associated TCRs with matching frequencies in the cohort. They also have lower generation probabilities-are harder to make under a simple random model of the VDJ rearrangement process-which suggests that their observed cohort frequencies may be elevated by selection ( Figure 4B, see Materials and methods for further details on the calculation of clonal expansion indices and generation probabilities; also see (Pogorelyy et al., 2018)). Examination of two-dimensional feature distributions suggests that these shifts are correlated, with HLA-associated TCRs showing an excess of lower-probability, clonally expanded TCRs ( Figure 4C); this trend appears stronger for class-I associated TCRs than for class II-associated TCRs (Figure 4-figure supplement 1). To give a global picture of TCR-HLA association, we counted the number of significant TCR associations found for each HLA allele in the dataset, and plotted this number against the number of subjects in the cohort with that allele ( Figure 5). As expected, the more common HLA alleles have on average greater numbers of associated TCRs (since greater numbers of subjects permit the identification of more public TCRs, and the statistical significance assigned to an observed association of fixed strength grows as the number of subjects increases). What was somewhat more surprising is that the slope of the correlation between cohort frequency and number of associated TCRs varied dramatically among the HLA loci, with HLA-DRB1 alleles having the largest number of associated TCRs for a given allele frequency and HLA-C alleles having the smallest. The best-fit slope for the five DR/DQ haplotypes (12.2) was roughly the sum of the DR (7.99) and DQ (3.39) slopes, suggesting as expected that these haplotypes were capturing TCRs associated with both the DR and DQ component alleles. The smaller rate of TCR association observed at the HLA-C locus could be explained by a relatively lower level of cell surface expression of HLA-C alleles as well as their greater tendency to interact with killer cell immunoglobulin-like receptors (KIR) on natural killer (NK) cells (Kaur et al., 2017).
We assessed the accuracy of our TCR:HLA associations in two ways. First, we compared our HLA allele assignments to those given in the VDJdb database (which provides the peptide:MHC target and hence a putative HLA restriction for all entries; [Shugay et al., 2018]) and found that 90% of the VDJdb assignments for TCRb chains present in both sets matched our associations. This agreement increases to 96% after filtering for the highest level of supporting evidence (VDJdb score of 3). Interestingly, two of the mismatches with VDJdb score three were from the protein structural database: the allo-complex between the B*08-restricted LC13 TCR and HLA-B*44:05 (Macdonald et al., 2009), and the structure of the A*02-restricted JM22 TCR bridged to a class II allele by a A B C D Figure 4. HLA-associated TCRs are more clonally expanded and have lower generation probabilities than equally common, non-HLA associated TCRs.
(A) Comparison of clonal expansion index distributions for the set of HLA-associated TCRs (blue) and a cohort-frequency matched set of non HLAassociated TCRs (green). (B) Comparison of VDJ-rearrangement TCR generation probability (P gen ) distributions for the set of HLA-associated TCRs (blue) and a cohort-frequency matched set of non HLA-associated TCRs (green). (C) Two-dimensional probability density function (PDF) for the distribution of P gen versus clonal expansion index for HLA-associated TCRs. Contours indicate level sets of the PDF. (D) Two-dimensional probability density function (PDF) for the distribution of P gen versus clonal expansion index for background (non HLA-associated) TCRs whose cohort frequencies match the TCRs in (C). DOI: https://doi.org/10.7554/eLife.38358.012 The following source data and figure supplements are available for figure 4: staphylococcal superantigen (Saline et al., 2010). In both of these cases, our data predict the canonical association: B*08 for the LC13 TCRb chain and A*02 for the JM22 TCRb chain. Second, we looked for HLA-associated public TCRb chains in sequenced repertoires from T cell populations that were sorted for the presence of CD4/CD8 surface markers. One would expect that TCRb chains associated with class I MHC molecules should be preferentially found in CD8+ populations, while class II-associated TCRs should be found in CD4+ populations. We selected four repertoire datasets Rubelt et al., 2016;Li et al., 2016;Oakes et al., 2017) with matched CD4 + and CD8+ repertoires from a total of 63 individuals, and we analyzed the occurrence patterns of our HLA-associated TCRb chains in these sequence datasets, producing for each TCRb counts of the number of CD4+ and CD8+ repertoires it was observed in. Figure 5-figure supplement 1 shows that if we assign each TCRb to the class (CD4+ or CD8+) with the higher count, these assignments are largely concordant with the MHC class of its associated HLA allele, and moreover this agreement increases as we increase either the stringency of HLA association or the stringency of the CD4/CD8 assignment (i.e., the minimum absolute difference between the CD4 and CD8 repertoire counts; see Materials and methods).

HLA-restricted TCR clusters
Having identified a set of HLA-associated TCRb chains, we next sought to identify TCR clusters that might represent HLA-restricted responses to shared immune exposures. We performed this analysis for each HLA allele individually, restricting our clustering to the set of TCR chains significantly-associated with that allele and comparing occurrence patterns only over the subset of subjects positive for that allele. To reduce spurious co-occurrence signals driven by the presence/absence of other HLA alleles, we excluded TCR chains that were more strongly associated with a different HLA allele (i.e., not the one defining the cohort subset). The smaller size of many of these allele-positive cohort subsets reduces our statistical power to detect significant clusters using co-occurrence information. To counter this effect, we used the TCRdist similarity measure (Dash et al., 2017) to leverage the TCR sequence similarity which is often present within epitope-specific responses (Dash et al., 2017;Glanville et al., 2017) (see for example the A*02:01 TCRs in Table 1 and Figure 4-figure supplement 2). We augmented the probabilistic similarity measure used to define neighbors for DBSCAN clustering to incorporate information about TCR sequence similarity (as measured by TCRdist), in addition to cohort co-occurrence (see Materials and methods). We independently clustered each allele's associated TCRs and merged the clustering results from all alleles; using the Holm multiple testing criterion (Holm, 1979) to limit the approximate family-wise error rate to 0.05, we found a total of 78 significant TCR clusters. We analyzed the sequences and occurrence patterns of the TCRs belonging to these 78 clusters in order to assess their potential biological significance and prioritize them for further study ( Table 3). Each cluster was assigned two scores ( Figure 6): a size score (S size , x-axis), reflecting the significance of seeing a cluster of that size given the total number of TCRs clustered for its associated allele, and a co-occurrence score (Z CO , y-axis), reflecting the degree to which the TCRs in that cluster co-occur within its allele-positive cohort subset (see Materials and methods). In computing the co-occurrence score, we defined a subset of individuals with an apparent enrichment for the member TCRs in each cluster; the size of this enriched subset of subjects is given in the 'Subjects' column in Table 3. We rank ordered the 78 clusters based on the sum of their size and co-occurrence scores (weighted to equalize dynamic range); the top five clusters are presented in greater detail in Figure 7 and Figure 8. HLA associations, member TCR and enriched subject counts, cluster center TCR sequences, scores, and annotations for all 78 clusters are given in Table 3.
We found that a surprising number of the most significant HLA-restricted clusters had links to common viral pathogens. For example, the top cluster by both size and co-occurrence (Figure 7, upper panels) is an A*24:02-associated group of highly similar TCRb chains, five of which can be found in a set of 12 TCRb sequences reported to respond to the parvovirus B19 epitope FYT-PLADQF as part of a highly focused CD8+ response to acute B19 infection (Kasprowicz et al., 2006). The subject TCR-counts curve for this cluster (Figure 7, top right panel) shows a strong enrichment of member TCRs in roughly 30% of the A*24:02 repertoires, which is on the low end of prevalence estimates for this pathogen (Heegaard and Brown, 2002) and may suggest that, if cluster enrichment does correlate with B19 exposure, there are likely to be other genetic or epidemiologic factors that determine which B19-exposed individuals show enrichment. The second most significant cluster by both measures is an A*02:01-associated group of TRBV19 TCRs with a high frequency of matches to the influenza M1 58 response (41/43 TCRs, labeled 'INF-pGIL' for the first three letters of the GILGFVFTL epitope). Notably, the cluster member sequences recapitulate many of the core features of the tree of experimentally identified M1 58 TCRs (Figure 4-figure supplement 2): a dominant group of length 13 CDR3 sequences with an 'RS' sequence motif together with a smaller group of length 12 CDR3s with the consensus CASSIG.YGYTF.
Rounding out the top five, the third and fifth most significant clusters also appear to be pathogen-associated. Cluster #3 brings together a diverse set of DRB1*07:01-associated TCRb chains (Figure 8, top dendrogram), none of which matched our annotation database. However, it was strongly associated with CMV serostatus: As is evident in the subject TCR-counts panel for this cluster (Figure 8, top right), there is a highly significant (P<3 Â 10 À19 ) association between CMV seropositivity (blue dots at the bottom of the panel) and cluster enrichment (here defined as a subject TCR count ! 3). Finally, the B*08:01-associated cluster #5 (bottom panels in Figure 8) appears to be EBV-associated: four of the TCRb chains in this cluster match TCRs annotated as binding to EBV epitopes (two matches for the B*08:01-restricted FLRGRAYGL epitope and two for the B*08:01-restricted RAKFKQLL epitope). The fact that this cluster brings together sequence-dissimilar TCRs that recognize different epitopes from the same pathogen supports the hypothesis that at least some of the observed co-occurrence may be driven by a shared exposure.
As a preliminary validation of the clusters identified here, we examined the occurrence patterns of cluster member TCRs in two independent cohorts: a set of 120 individuals ('Keck120') that formed the validation cohort for the original Emerson et al. study, and a set of 86 individuals ('Brit86') taken from the aging study of (Britanova et al., 2016). Whereas the Keck120 repertoires were generated using the same platform as our 666-member discovery cohort, the Brit86 repertoires were sequenced from cDNA libraries using 5'-template switching and unique molecular identifiers. In the 1 2 3 4 5 Figure 6. Many HLA-restricted TCR clusters contain TCRb chains annotated as pathogen-responsive. Each point represents one of the 78 significant HLA-restricted TCR clusters, plotted based on a normalized cluster size score (S size , x-axis) and an aggregate TCR co-occurrence score for the member TCRs (Z CO , y-axis). Markers are colored by the locus of the restricting HLA allele and sized based on the strength of the association between cluster member TCRs and the HLA allele. The database annotations associated to TCRs in each cluster are summarized with text labels using the following abbreviations: B19 = parvovirus B19, INF = influenza, EBV = Epstein Barr Virus, RA = rheumatoid arthritis, MS = multiple sclerosis, MELA = melanoma, T1D = type one diabetes, CMV = cytomegalovirus. Clusters labeled 'coCMV' are significantly associated (P<1 Â 10 À5 ) with CMV seropositivity (see main text discussion of cluster #3). Clusters labeled 1-5 are discussed in the text and examined in greater detail in Figure 7 and Source data 1. Paired TCRa chain sequences from the pairSEQ dataset of (Howie et al., 2015) for all clusters with at least 2 matched TCRb chains, along with a score for each cluster that assesses the degree of sequence similarity among the partner chains. DOI: https://doi.org/10.7554/eLife.38358.021 . Top five HLA-restricted clusters (continued on following page). Details on the TCR sequences, occurrence patterns, and annotations for the five most significant clusters (labeled 1-5 in Figure 6) based on size and TCR co-occurrence scores. Each panel consists of a TCRdist dendrogram (left side, labeled with annotation, CDR3 sequence, and occurrence counts for the member TCRs) and a per-subject TCR count profile (right side) showing the aggregate occurrence pattern of the member TCRs (blue curve) and a control pattern (green curve) produced by averaging occurrence counts from multiple independent randomizations of the subject set for each TCR. The numbers in the two 'Counts' columns represent the number of HLA+ (left) and HLA-(right) subjects whose repertoire contained the corresponding TCR, where HLA± means positive/negative for the restricting allele (for example, A*24:02 in the case of cluster 1). Annotations use the following abbreviations: B19 (parvovirus B19), INF (influenza virus), YFV (yellow fever virus), MELA (melanoma), T1D (type 1 diabetes), EBV (Epstein-Barr virus), RA (rheumatoid arthritis). In cases where the peptide epitope for the annotation match is known, the first three peptide amino acids are given after '-p'. Non-germline CDR3 amino acids with 2 or 3 non-templated nucleotides in their codon are shown in uppercase, while amino acids with only a single non-templated coding nucleotide are shown in lowercase. DOI: https://doi.org/10.7554/eLife.38358.022 absence of HLA typing information for these subjects, we simply evaluated the degree to which each cluster's member TCRs co-occurred over the entirety of each of these validation cohorts, using the co-occurrence score described above (Z Keck120 CO and Z Brit86 CO columns in Table 3). Although rare alleles and cluster-associated exposures may not occur with sufficient frequency in these smaller cohorts to generate co-occurrence signal, co-occurrence scores support the validity of the clusterings identified on the discovery cohort: 94% of the Keck120 scores and 92% of the Brit86 scores are greater than 0, indicating a tendency of the clustered TCRs to co-occur (smoothed score distributions are shown in Covariation between CDR3 sequence and HLA allele Given our large dataset of HLA-associated TCRb sequences, we set out to look for correlations between CDR3 sequence and HLA allele sequence. Previous studies have identified correlations between TCR V-gene usage and HLA alleles (Sharon et al., 2016;Blevins et al., 2016); these correlations are consistent with a picture of TCR:peptide:MHC interactions in which the CDR1 and CDR2 loops (whose sequence is determined by the V gene) primarily contact the MHC while the CDR3 loops contact the peptide. To complement these studies and leverage our large set of HLA-associated sequences, we set out to look for correlations between the CDR3 sequence itself and the HLA allele. In our previous work on epitope-specific TCRs (Dash et al., 2017), we identified a significant negative correlation between CDR3 charge and peptide charge, suggesting a tendency toward preserving charge complementarity across the TCR:pMHC interface. Although the CDR3 loop primarily contacts the MHC-bound peptide, computational analysis of solved TCR:peptide:MHC structures in the Protein Data Bank (Berman et al., 2000) (see Materials and methods) identified a number of HLA sequence positions that are frequently contacted by CDR3 amino acids ( Table 2). For each frequently-contacted HLA position with charge variability among alleles we computed the covariation between HLA allele charge at that position and average CDR3 charge for allele-associated TCRs. Since portions of the CDR3 sequence are contributed by the V-and J-gene germline sequences, and covariations are known to exist between HLA and V-gene usage, we also performed a covariation analysis restricting to 'non-germline' CDR3 sequence positions whose coding sequence is determined by at least one non-templated insertion base (based on the most parsimonious VDJ reconstruction; see Materials and methods). We found a significant negative correlation (R ¼ À0:47; P<4 Â 10 À4 for the full CDR3 sequence; R ¼ À0:52; P<7 Â 10 À5 for the non-germline CDR3 sequence) between CDR3 charge and the charge at position 70 of the class II beta chain (correcting these p-values for the fact that we considered 7 positions yields 2:3 Â 10 À3 and 4:3 Â 10 À4 ). We did not see a significant correlation for the frequently contacted position on the class II alpha chain, perhaps due to the lack of sequence variation at the DRa locus and/or the more limited number of DQa and DPa alleles. None of the five class I positions showed significant correlations, which could be due to their lower contact frequencies, a smaller average number of associated TCRs (51 for class I versus 309 for class II), bias toward A*02 in the structural database, or noise introduced from multiple contacted positions varying simultaneously. Further analysis of the class II correlation suggested that it was driven largely by HLA-DRB1 alleles: position 70 correlations were À0:56 versus À0:10 for DR and DQ, respectively, over the full CDR3 and À0:64 vs À0:38 for the non-germline CDR3. Figure 9 provides further detail on this DRB1-TCR charge anti-correlation, including a structural superposition showing the proximity of position 70 to the TCRb CDR3 loop. CMV-associated TCRb chains are largely HLA-restricted We analyzed the HLA associations of strongly CMV-associated TCRb chains to gain insight into their predictive power across genetically diverse individuals. Here we change perspective somewhat from earlier sections, in that we select TCRs based on their CMV association and then evaluate HLA association, rather than the other way around. In their original study, Emerson et al. identified a set of TCRb chains that were enriched in CMV seropositive individuals and showed that by counting these CMV-associated TCRb chains in a query repertoire they could successfully predict CMV serostatus both in cross-validation and on an independent test cohort. The success of this prediction strategy across a diverse cohort of individuals raises the intriguing question of whether these TCRbs are primarily HLA-restricted in their occurrence and in their association with CMV, or whether they span multiple HLA types. To shed light on this question we focused on a set of 68 CMV-associated TCRb chains whose co-occurrence with CMV seropositivity was significant at a p-value threshold of 1.5e-5 (corresponding to an FDR of 0:05; see Materials and methods). For each CMV-associated TCRb chain, we identified its most strongly associated HLA allele and compared the p-value of this association to the p-value of its association with CMV ( Figure 10A). From this plot we can see that the majority of the CMV-associated chains do appear to be HLA-associated, having p-values that exceed the FDR 0:05 threshold for HLA association. The excess of highly significant HLA-association p-values for these CMV-associated TCRbs can be seen in Figure 10B, which compares the observed p-value  [Hennecke and Wiley, 2002;Deng et al., 2007;Harkiolaki et al., 2009;Yin et al., 2011;Deng et al., 2012]) showing the DRa chain in green, the DRb chain in cyan, the peptide in magenta, the TCRb chain in blue with the CDR3 loop colored reddish brown. The TCRa chain is omitted for clarity, and position 70 is highlighted in yellow. DOI: https://doi.org/10.7554/eLife.38358.025 The following source data is available for figure 9: distribution to a background distribution of HLA association p-values for randomly selected frequency-matched public TCRbs. As a next step we looked to see whether these HLA associations fully explained the CMV association, in the sense that the CMV association was only present in subjects positive for the associated allele. For each of the 68 CMV-associated TCRs, we divided the cohort into subjects positive for its most strongly associated HLA allele and subjects negative for that allele. Here we considered both 2-and 4-digit resolution alleles when defining the most strongly associated allele, to allow for TCRs whose association extends beyond a single 4-digit allele. We computed association p-values between TCR occurrence and CMV seropositivity over these two cohort subsets independently and compared them ( Figure 10C). We see that the majority of the points lie below the y ¼ x line-indicating a stronger CMV-association on the subset of the cohort positive for the associated alleleand also below the line corresponding to the expected minimum of 68 uniform random variables (i.e. the expected upper significance limit in the absence of CMV association on the allele-negative cohort subsets). There are however a few TCRbs which do not appear strongly HLA-associated and for which the CMV-association remains strong in the absence of their associated allele (the points above the line y ¼ x in Figure 10C). For example, the public TCRb chain defined by TRBV07 and the CDR3 sequence CASSSDSGGTDTQYF (which corresponds to the highest point in Figure 10C) is strongly CMV-associated (22=23 subjects with this chain are CMV positive; P<3 Â 10 À7 ) but does not show evidence of HLA association in our dataset. TCRs with HLA promiscuity may be especially interesting from a diagnostic perspective, since their phenotype associations may be more robust to differences in genetic background.
Finally, we looked to see whether CMV association completely explained the observed HLA associations, in the sense that a response to one or more CMV epitopes was likely the only driver of HLA association, or whether there might be evidence for other epitope-specific responses by these TCRb chains or a more general affinity for the associated allele, perhaps driven by common self antigens. Put another way, do we see evidence for pre-existing enrichment of any of these TCRb chains when their associated allele is present, even in the absence of CMV, which might suggest that the CMV response recruits from a pre-selected pool enriched for TCRs with intrinsic affinity for the restricting allele? To approach this question we split the cohort into CMV seropositive and seronegative subjects and computed, for each of the 68 CMV-associated TCRs, the strength of its association with its preferred allele over these two subsets separately. Figure 10D compares these HLA-association pvalues computed over the subsets of the cohort positive (289 individuals, x-axis) and negative (352 individuals, y-axis) for CMV. We can see in this case that all of the associations on the CMV-positive subset are stronger than those on the CMV-negative subset, and indeed the CMV-negative p-values do not appear to exceed random expectation given the number of comparisons performed. Thus, the apparent lack of any significant HLA-association on the CMV-negative cohort subset suggests that the HLA associations of these CMV-predictive chains are largely driven by CMV exposure. A limitation of this analysis is that, although the CMV-negative subset of the cohort is larger than the CMV-positive subset, the number of TCR occurrences in the CMV-negative subset is likely lower than in the CMV-positive subset for these CMV-associated chains, which will limit the strength of the HLA associations that can be detected.

Discussion
Each individual's repertoire of circulating immune receptors encodes information on their past and present exposures to infectious and autoimmune diseases, to antigenic stimuli in the environment, and to tumor-derived epitopes. Decoding this exposure information requires an ability to map from amino acid sequences of rearranged receptors to their eliciting antigens, either individually or collectively. One approach to developing such an antigen-mapping capability would involve collecting deep repertoire datasets and detailed phenotypic information on immune exposures for large cohorts of genetically diverse individuals. Correlation between immune exposure and receptor occurrence across such datasets could then be used to train statistical predictors of exposure, as demonstrated by Emerson et al. for CMV serostatus. The main difficulty with such an approach, beyond the cost of repertoire sequencing, is likely to be the challenge of assembling accurate and complete immune exposure information. or CMV-negative (y-axis) subsets of the cohort suggest that for these 68 CMV-associated TCRb chains, HLA-association is driven solely by response to CMV (rather than generic affinity for their associated allele, for example, or additional self or viral epitopes). In panels (A), (C), and (D), points are colored by CMV-association p-value; in all panels we use a modified logarithmic scale based on the square root of the exponent when plotting p-values in order to avoid compression due to a few highly significant associations. DOI: https://doi.org/10.7554/eLife.38358.027 The following source data is available for figure 10: Source data 1. Full and subsetted CMV-and HLA-association p-values for 68 CMV-associated TCRs. DOI: https://doi.org/10.7554/eLife.38358.028 For this reason, we set out to discover potential signatures of immune exposures de novo, in the absence of phenotypic information, using only the structure of the public repertoire-its receptor sequences and their occurrence patterns. By analyzing co-occurrence between pairs of public TCRb chains and between individual TCRb chains and HLA alleles, we were able to identify statistically significant clusters of co-occurring TCRs across a large cohort of individuals and in a variety of HLA backgrounds. Indirect evidence from sequence matches to experimentally-characterized receptors suggests that some of these TCR clusters may reflect hidden immune exposures shared among subsets of the cohort members; indeed, several of the most significant clusters appear linked to common viral pathogens (parvovirus B19, influenza, CMV, and EBV).
The results of this paper demonstrate the potential for a productive dialog between statistical analysis of TCR repertoires and immune exposure analysis. Specifically, sequences from the statistically-inferred clusters defined here could be tested for antigen reactivity or combined with immune exposure data to infer the driver of TCR expansion, as was done here for the handful of CMV-associated clusters based on CMV serostatus information. In either case our clustering approach will reduce the amount of independent data required, since the immune phenotype data is used for annotation of a modest number of defined TCR groupings rather than direct discovery of predictive TCRs from the entire public repertoire. We can also look for the presence of specific TCRs and TCR clusters identified here in other repertoire datasets, for example from studies of specific autoimmune diseases or pathogens, as a means of assigning putative functions. However the answer may not be entirely straightforward: it remains possible that enrichment for other cluster TCRs, rather than being associated with an exposure per se, is instead associated with some subject-specific genetic or epigenetic factor that determines whether a specific TCR response will be elicited by a given exposure.
The finding by Emerson et al.-now replicated and extended in this work-that there are large numbers of TCRb chains whose occurrence patterns (independent of potential TCRa partners) are strongly associated with specific HLA alleles, raises the question of what selective forces drive these biased occurrence patterns. Our observations point to a potential role for responses to common pathogens in selecting some of these chains in an HLA-restricted manner. Self-antigens (presented in the thymus and/or the periphery) may also play a role in enriching for specific chains, as suggested by (Madi et al., 2017) in their work on TCR similarity networks formed by the most frequent CDR3 sequences. Our conclusions diverge somewhat from this previous work, which may be explained by the following factors: our use of HLA-association rather than intra-individual frequency as a filter for selecting TCRs, our inclusion of information on the V-gene family in addition to the CDR3 sequence when defining TCR sharing and computing TCR similarity, and our use of TCR occurrence patterns, rather than CDR3 edit distance, to discover TCR clusters. We also find it interesting that class II loci appear on average to have greater numbers of associated TCRb chains than class I loci ( Figure 5): presumably this reflects differences in selection and/or abundance between the CD4+ and CD8+ T cell compartments (Sinclair et al., 2013), but the underlying explanation for this trend is unclear, although a similar bias was observed by Sharon et al., 2016. One caveat is that it can be difficult to reliably assign TCR associations to individual members of groups of highly correlated HLA alleles; perfectly correlated alleles have been collapsed into haplotypes in our analysis, but there remain allele pairs (particularly between the HLA-DR and HLA-DQ loci) that strongly co-occur across the cohort. In addition, TCRb chains associated with multiple HLA alleles (for example, because they recognize the same peptide presented by several different alleles) might be missed in our approach; although our analysis of HLA-association for CMV-associated TCR chains did not detect a substantial degree of HLA promiscuity, it remains to be seen whether this extends to other classes of functional TCRs. Alternative approaches that focus on other features, such as clonal abundance, to select TCR chains for clustering and downstream analysis are worth pursuing. It is also worth pointing out that our primary focus on presence/absence of TCRb chains (rather than abundance) assumes relatively uniform sampling depths across the cohort; in the limit of very deep repertoire sequencing, pathogen-associated chains may be found (presumably in the naive pool) even in the absence of the associated immune challenge, while shallow sampling reliably picks out only the most expanded T cell clones. Here the use of clusters of responsive TCRs rather than individual chains lessens stochastic fluctuations in TCR occurrence patterns, providing some measure of robustness.
We look forward to the accumulation of new data sets, which will enable future researchers to move beyond the limitations of the study presented here. An ideal study would perform discovery on repertoire data from multiple large cohorts, rather than the single large cohort generated with a single sequencing platform. Although we do validate TCR clusters on two independent datasets, with one from a different immune profiling technology, performing discovery on multiple large cohorts would presumably give more robust results. Future analyses of independent, HLA-typed cohorts will provide additional validation of trends seen here. The lack of sequenced TCRa or paired a/b repertoires for this cohort limits the features we can detect and may introduce bias into some of our conclusions. Certain T cell subsets, such as MAIT and invariant natural killer T cells, are more easily recognized from a chain sequence data. It is likely that many TCRs that are associated with specific immune exposures when considered as paired TCR chains are not detectably associated with those exposures (or with other TCRs responding to those exposures) when analyzing only the a or b chain alone: indeed it is somewhat surprising that we find as many apparent associations and cooccurring clusters as we do given that we are considering only the TCRb chain. Greater sequencing depth and/or analysis of sorted T cell populations will likely be required of future studies that aim to examine the impact of HLA on the composition of the naive T cell repertoire. We also hope that future studies will have rich immune exposure data beyond CMV serostatus: although the cohort members were all nominally healthy at the time of sampling, it is likely that there are a variety of immune exposures, some presaging future pathologies, that can be observed in a diverse collection of 650+ individuals. As an example, two of our EBV-annotated clusters contain TCRb chains also seen in the context of rheumatoid arthritis: cross-reactivity between pathogen and autoimmune epitopes may mean that TCR clusters discovered on the basis of common infections also provide information relevant in the context of autoimmunity.

Materials and methods Datasets
TCRb repertoire sequence data for the 666 members of the discovery cohort was downloaded from the Adaptive biotechnologies website using the link provided in the original (Emerson et al., 2017) publication (https://clients.adaptivebiotech.com/pub/Emerson-2017-NatGen). The repertoire sequence data for the 120 individuals in the 'Keck120' validation set was included in the same download. Repertoire sequence data for the 86 individuals in the 'Brit86' validation set was downloaded from the NCBI SRA archive using the Bioproject accession PRJNA316572 (Britanova et al., 2016) and processed using scripts and data supplied by the authors (https://github.com/mikessh/agingstudy) in order to demultiplex the samples and remove technical replicates. Repertoire sequence data for TCRb chains from MAIT cells was downloaded from the NCBI SRA archive using the Bioproject accession PRJNA412739 (Howson et al., 2018). Repertoire sequence data for TCRb chains from T cells sorted for CD4/CD8 surface markers were taken from the following studies: , available for download at https://clients.adaptivebiotech.com/pub/emerson-2013-jim; (Rubelt et al., 2016), downloaded from the NCBI SRA archive using the Bioproject accession PRJNA300878; (Li et al., 2016), downloaded from the NCBI SRA archive using the Bioproject accession PRJNA348095; and (Oakes et al., 2017), downloaded from the NCBI SRA archive using the Bioproject accession PRJNA390125.
V and J genes were assigned by comparing the TCR nucleotide sequences to the IMGT/GENE-DB (Giudicelli et al., 2005) nucleotide sequences of the human TR genes (sequence data downloaded on 9/6/2017 from http://www.imgt.org/genedb/). CDR3 nucleotide and amino acid sequences and most-parsimonious VDJ recombination scenarios were assigned by the TCRdist pipeline (Dash et al., 2017) (the most parsimonious recombination scenario, used for identifying non-germline CDR3 amino acids, is the one requiring the fewest non-templated nucleotide insertions). To define the occurrence matrix of public TCRs and assess TCR-TCR, TCR-HLA and TCR-CMV association, a TCRb chain was identified by its CDR3 amino acid sequence and its V-gene family (e.g., TRBV6-4*01 was reduced to TRBV06). TCR sequence reads for which a unique V-gene family could not be determined (due to equally well-matched V genes from different families, a rare occurrence in this dataset) were excluded from the analysis. The matrix M of public TCRb occurrences across the discovery cohort, HLA allele occurrence patterns, and other associated data needed to reproduce the findings of this study have been deposited in the Zenodo database (doi:10.5281/ zenodo.1248193).

Eliminating potential cross-contamination
A preliminary analysis of TCR sharing at the nucleotide level was conducted to identify potential cross-contamination in the discovery cohort repertoires. Each TCRb nucleotide sequence that was found in multiple repertoires was assigned a generation probability (P gen , see below) in order to identify nucleotide sequences with suspiciously high sharing rates among repertoires. Visual comparison of the sharing rate (the number of repertoires in which each TCRb nucleotide sequence was found) to the generation probability ( Figure 11) showed that the majority of highly-shared TCRs had correspondingly high generation probabilities; it also revealed a cluster of TCR chains with unexpectedly high sharing rates. Examination of the sequences of these highly-shared TCRs revealed them to be variants of the consensus sequence CFFKQKTAYEQYF (coding sequence: tgttttttcaagcagaagacggcatacgagcagtacttc). Consultation with scientists at Adaptive Biotechnologies confirmed that these sequences were likely to represent a technical artifact of the sequencing pipeline. We elected to remove all TCRb nucleotide sequences whose sharing rates put them outside the decision boundary indicated by the black line in Figure 11, which eliminated the vast majority of the artifactual variants as well as a handful of other highly shared, low-probability sequences (592 nucleotide sequences in total were removed).

Measuring clonal expansion
Each public TCRb chain was assigned a clonal expansion index (I exp ) determined by its frequencies in the repertoires in which it was found. First, the unique TCRb chains present in each repertoire were ordered based on their inferred nucleic acid template count (Carlson et al., 2013), and assigned a rank ranging from 0 (lowest template count) to S À 1 (highest template count), where S is the total number of chains present in the repertoire. TCRs with the same template count were assigned the same tied rank equal to the midpoint of the tied group. In order to compare across repertoires, the ranks for each repertoire were then normalized by dividing by the number of unique sequences in the repertoire. The clonal expansion index for a given public TCR t was taken to be its average normalized rank for the repertoires in which it occurred: where the sum is taken over the m repertoires in which t is found, r i is the template-count rank of TCR t in repertoire i, and S i is the total size of repertoire i.

HLA typing
HLA genotyping was performed and confirmed by molecular means, including sequence specific oligonucleotide probe typing (SSOP), Sanger sequencing (SBT) or next generation sequencing (NGS) (Smith et al., 2014). Independently, HLA alleles were imputed using data generated by high density single-nucleotide polymorphism arrays as previously described (Martin et al., 2017). Imputed alleles were compared with HLA typing data from SBT and NGS, and used to resolve ambiguous HLA codes generated by SSOP and provide a uniform set of four digit allele assignments. HLA typing data availability varied across loci as follows:  Table 1, the cohort was restricted to the subset of subjects with available HLA typing at the relevant locus. For comparing TCR association rates across loci in Figure 5, associations were calculated over the cohort subset (522 subjects) with typing data at all compared loci (A, B, C, DRB1, DQA1, and DQB1) in order to avoid spurious differences in association strengths arising from differential data availability among the loci. Due to their very strong linkage on our cohort, five DR-DQ haplotypes were treated as single allele units for association

TCR generation probability
We implemented a version of the probabilistic model proposed by Walczak and co-workers (Murugan et al., 2012) in order to assign to each public TCRb chain (defined by a V-gene family and a CDR3 amino acid sequence) a generation probability, P gen , which captures the probability of seeing that TCRb in the preselection repertoire. P gen is calculated by summing the probabilities of the possible VDJ rearrangements that could have produced the observed TCR: where S represents the set of possible VDJ recombination scenarios capable of producing the observed TCR V family and CDR3 amino acid sequence. To compute the probability of a given recombination scenario s, we use the factorization proposed by Marcou et al. (2018), which Figure 11. Analysis of TCR sharing at the nucleotide level and VDJ recombination probabilities helps to identify potential contamination. Each point represents a TCRb nucleotide sequence that occurs in more than one repertoire, plotted according to its generation probability (P gen , x-axis) and the number of repertoires in which it was seen (N repertoires , y-axis). Very low probability nucleotide sequences that are shared across many repertoires represent potential cross-contamination, as confirmed for one large cluster of artifactual sequences (see the main text). We excluded all TCRb nucleotide sequences lying above the boundary indicated by the black line (N ¼ 592). DOI: https://doi.org/10.7554/eLife.38358.029 The following source data is available for figure 11: Source data 1. TCRb nucleotide sequences excluded from our analysis. DOI: https://doi.org/10.7554/eLife.38358.030 captures observed dependencies of V-, D-, and J-gene trimming on the identity of the trimmed gene and of inserted nucleotide identity on the identity of the preceding nucleotide: Here the recombination scenario s consists of a choice of V gene (V s ), D gene (D s ), J gene (J s ), number of nucleotides trimmed back from the end of the V gene (del s V) or J gene (del s J) or D gene (del s D5 0 and del s D3 0 ), number of nucleotides inserted between the V and D genes (Ins s VD) and between the D and J genes (Ins s DJ) and the identities of the inserted nucleotides ( n i f g and m i f g respectively). At the start of the calculation, the CDR3 amino acid sequence is converted to a list of potential degenerate coding nucleotide sequences (here degenerate means that nucleotide class symbols such as W (for A and T) and R (for A and G) are allowed). Since each amino acid other than Leucine, Serine, and Arginine has a single degenerate codon (P=CCN, N = AAY, K = AAR, etc.) and these three amino acids have two such codons (S={TCN,AGY}, R={CGN,AGR}, L={CTN,TTR}), this list of nucleotide coding sequences is generally not too long. The generation probability is then taken to be the sum of the probabilities of these degenerate nucleotide sequences. Since the total number of possible recombination scenarios is in principle quite large, we make a number of approximations to speed the calculation: we limit excess trimming of genes to at most three nucleotides, where excess trimming is defined to be trimming back a germline gene nucleotide which matches the target CDR3 nucleotide (therefore requiring non-templated reinsertion of the same nucleotide); at most two palindromic nucleotides are allowed; sub-optimal D gene alignments are only considered up to a score gap of 2 matched nucleotides relative to the best match. The parameters of the probability model are fit by a simple iterative procedure in which we generate rearranged sequences using an initial model, compare the statistics of those sequences to statistics derived from observed out-of-frame rearrangements in the dataset, and adjust the probability model parameters to iteratively improve agreement. We compared the nucleotide sequence generation probabilities computed using our software with those computed using the published tool IGoR (Marcou et al., 2018) and found good overall agreement: a linear regression analysis of the log 10 P gen À Á values gives a correlation coefficient R ¼ 0:97 with slope of 0:98 and an intercept of 0:22 for a set of 800 randomly selected TCRb chains.

Co-occurrence calculations
We performed an analysis of covariation across the cohort for pairs of TCR chains and for TCR chains and HLA alleles ( Figure 12). We used the hypergeometric distribution to assess the significance of an observed overlap between two subsets of the cohort (for example, the subset of subjects positive for a given HLA allele and the subset of subjects with a given TCRb chain in their repertoires), taking our significance p-value to be the probability of seeing an equal or greater overlap if the two subsets had been chosen at random: where k is the size of the overlap, N 1 and N 2 are the sizes of the two subsets, and N is the total cohort size (i.e., the number of individuals in the cohort). We use P overlap to assess the significance of an overlap C a \ C t between an HLA allele a found in the cohort subset C a and a TCRb chain t found in the cohort subset C t as follows: P HLA a; t ð Þ ¼ P overlap jC a \ C t j; jC a j; jC t j; N ð Þ where jCj denotes the cardinality of the set C. A complication arises when assessing TCR-TCR cooccurrence in the presence of variable-sized repertoires: TCRs are more likely to come from the larger repertoires than the smaller ones, which violates the assumptions of the hypergeometric distribution and leads to inflated significance scores. In particular, when we use the hypergeometric distribution to model the overlap between the sets of subjects in which two TCR chains are found, we implicitly assume that all subjects are equally likely to belong to a TCR chain's subject set. If the subject repertoires vary in size, this assumption will not hold. For example, in the limit of a subject with an empty repertoire, no TCR subject sets will contain that subject, which will inflate all the overlap pvalues since we are effectively overstating the size N of the cohort by 1. On the other hand, if one of the subject repertoires contains all the public TCR chains, then each TCR-TCR overlap will automatically contain that subject, again inflating the p-values since we are artificially adding 1 to each of k, N 1 , N 2 , and N. We developed a simple heuristic to correct for this effect using a per-subject bias factor by defining where S i is the size of repertoire i and N is the cohort size. To score an overlap between the occurrence patterns of two TCRb chains t and t 0 , where t is found in the subset C t of the cohort, t 0 is found in the subset C t 0 , and their overlap C t \ C t 0 contains the k subjects s 1 ; :::; s k , we adjust the overlap pvalue (P overlap ) by the product of the bias factors of the subjects in the overlap: Figure 12. Schematic diagram illustrating the co-occurrence analysis. Co-occurrence p-values are calculated to assess TCR-TCR (P CO ) and TCR-HLA (P HLA ) covariation across the cohort. Shared response to unknown immune exposures may explain strongly co-occurring TCR pairs, while significant HLA association can highlight functional TCRs. TCRb chains are compared to a set of previously characterized TCRs for annotation purposes. DOI: https://doi.org/10.7554/eLife.38358.031 Here we are multiplying the hypergeometric p-value (P overlap ) by a term that corrects for the fact that not all overlaps of size k are equally likely (the product of the k bias factors captures the relative bias toward the observed overlap). This has the effect of decreasing the significance assigned to overlaps involving larger repertoires, yet remains fast to evaluate, an important consideration given that the all-vs-all TCR co-occurrence calculation involves about 10 14 pairwise comparisons (and this calculation is repeated multiple times with shuffled occurrence patterns to estimate false-discovery rates). When clustering by co-occurrence, we augmented this heuristic p-value correction by also eliminating repertoires with very low (fewer than 30,000) or very high (more than 120,000) numbers of public TCRb chains (nonzero entries in the occurrence matrix M), as well as five additional repertoires which showed anomalously high levels of TCR nucleotide sharing with another repertoire-all with the goal of reducing potential sources of spurious TCR-TCR co-occurrence signal.

Estimating false-discovery rates
We used the approach of (Storey and Tibshirani, 2003) to estimate false-discovery rates for detecting associations between TCRs and HLA alleles and between TCRs and CMV seropositivity. Briefly, for a fixed significance threshold P we estimate the false-discovery rate (FDR) by randomly permuting the HLA allele or CMV seropositivity assignments 20 times and computing the average number of significant associations discovered at the threshold P in these shuffled datasets. The estimated FDR is then the ratio of this average shuffled association number to the number of significant associations discovered in the true dataset at the same threshold. In order to estimate a false-discovery rate for TCR-TCR co-occurrence over the full cohort, we performed 20 co-occurrence calculations on shuffled occurrence matrices, preserving the per-subject bias factors during shuffling by resampling each TCR's occurrence pattern with the bias distribution b i f g determined by the subject repertoire sizes.

Assigning CD4+/CD8+ status to public TCRs
We assessed the accuracy of our TCR:HLA associations by looking for HLA-associated public TCRb chains in sequenced repertoires from T cell populations that were sorted for the presence of CD4/ CD8 surface markers. We selected four repertoire datasets with matched CD4+ and CD8+ reperrtoires from a total of 63 individuals (see the section Datasets for access details; Rubelt et al., 2016;Li et al., 2016;Oakes et al., 2017]). We analyzed the occurrence patterns of HLA-associated TCRb chains in these sequence datasets, producing for each TCRb counts of the number of CD4+ and CD8+ repertoires it was observed in (N CD4 and N CD8 ). TCRb abundance levels within the individual repertoires were ignored; each occurrence in a repertoire contributed a single count to the respective CD4 or CD8 total (which therefore range between 0 and 63). Given a threshold d on the CD4/CD8 counts difference, we assign to the CD4 compartment all TCRs for which N CD4 À N CD8 ! d, and we assign to the CD8 compartment all TCRs for which N CD8 À N CD4 ! d. Figure 5-figure supplement 1 shows the concordance between these assignments and inferences based on the HLA class of the most strongly associated HLA allele, for all significantly associated TCRb chains and for various threholds d.

TCR clustering
We used the DBSCAN (Ester et al., 1996) algorithm to cluster public TCRb chains by their occurrence patterns. DBSCAN is a simple and robust clustering procedure that requires two input parameters: a similarity/distance threshold (T sim ) at which two points in the dataset are considered to be neighbors, and a minimum number of neighbors (N core ) for a point to be considered a core, as opposed to a border, point. DBSCAN clusters consist of the connected components of the neighbor-graph over the core points, together with any border point neighbors the core cluster members have. To prevent the discovery of fictitious clusters, T sim and N core can be selected so that core points (points with at least N core neighbors) are unlikely to occur by chance. There is a trade-off between the two parameter settings: as T sim is relaxed, points will tend to have more neighbors on average and thus N core should be increased, which biases toward discovery of larger clusters; conversely, more stringent settings of T sim are compatible with smaller values for N core which permits the discovery of smaller, more tightly linked clusters.
For clustering TCRs by co-occurrence over the full cohort, we used a threshold of T sim ¼ 10 À8 and chose a value for N core (6) such that no core points were found in any of the 20 shuffled datasets. In other words, two TCRs t 1 and t 2 were considered to be neighbors for DBSCAN clustering if P CO t 1 ; t 2 ð Þ<10 À8 ; a TCR was considered a core point if it had at least 6 neighbors. Choosing parameters for HLA-restricted TCR clustering was slightly more involved due to the variable number of clustered TCRs for different alleles, and the more complex nature of the similarity metric, whose dependence on TCR sequence makes shuffling-based approaches more challenging. To begin, we transformed the TCRdist sequence-similarity measure into a significance score P TCRdist which captures the probability of seeing an observed or smaller TCRdist score for two randomly selected TCRb chains. Since public TCRb chains are on average shorter and closer to germline than private TCRs, we derived the P TCRdist CDF by performing TCRdist calculations on randomly selected public TCRs seen in at least 5 repertoires. We identified neighbors for DBSCAN clustering using a similarity score P sim that combines co-occurrence and TCR sequence similarity: where the transformation by f x ð Þ ¼ x À x log x ð Þ corrects for taking the product of two p-values because f x ð Þ is the cumulative distribution function of the product of two uniform random variables. Thus, if P TCRdist and P CO are independent and uniformly distributed, the same will be true of P sim .
For HLA-restricted clustering using this combined similarity measure we set a fixed value of T sim ¼ 10 À4 and adjusted the N core parameter as a function of the total number of TCRs clustered for each allele. As in global clustering, our goal was to choose N core such that core points were unlikely to occur by chance (more precisely, had a per-allele probability less than 0:05). We estimated the probability of seeing core points by modeling neighbor number using the binomial distribution, assuming that the observed neighbor number of a given TCR during clustering is determined by M À 1 independent Bernoulli-distributed neighborness tests with rate r, where M is the number of clustered TCRs. Rather than assuming a fixed neighbor-rate r across TCRs, we captured the observed variability in neighbor-rate (due, for example, to unequal V-gene frequencies and variable CDR3 lengths) by using a mixture of 20 rates r j È É estimated from similarity comparisons on randomly chosen public TCRs. More precisely, we choose the smallest value of N core for which the following inequality holds (where M is the number of clustered TCRs for the allele in question): M 20 We also used this neighbor-number model to assign a p-value (P size ) to each cluster reflecting the likelihood of seeing the observed degree of clustering by chance. Since DBSCAN clusters are effectively single-linkage-style partitionings of the core points (together with any neighboring border points), they can have a variety of shapes, ranging from densely interconnected graphs, to extended clusters held together by local neighbor relationships (Ester et al., 1996). Modeling the total size of these arbitrary groupings is challenging, so we took the simpler and more conservative approach of assigning p-values based on the size of the largest TCR neighborhood (set of neighbors for a single TCR) contained within each cluster. We identified the member TCR with the greatest number of neighbors in each cluster (the cluster center) and computed the likelihood of seeing an equal or greater neighbor-number under the mixture model described above. This significance estimate is conservative in that it neglects clustering contributions from TCRs outside the neighborhood of the cluster center, however in practice we observed that the majority of TCR clusters were dominated by a single dense region of repertoire space and therefore reasonably well-captured by a single neighborhood. To control false discovery when combining DBSCAN clusters from independent clustering runs for different HLA alleles, we used the Holm method (Holm, 1979) applied to the sorted list of cluster P size values, with a target family-wise error rate (FWER) of 0:05 (i.e., we attempted to limit the overall probability of seeing a false cluster to 0:05). In the Holm FWER calculation we set the total number of hypotheses equal to the total number of TCRs clustered across all alleles minus the cumulative neighbor-count of the cluster centers (we exclude cluster center neighbors since their neighbor counts are not independent of the neighbor count of the cluster center). When performing HLA-restricted clustering, each TCRb chain was assigned to its most strongly associated HLA allele. Where two alleles had identical or nearly identical (within a factor of 1:25) association p-values, the TCR chain was included in the clustering analysis for both alleles.

Analyzing TCR clusters
For each (global or HLA-restricted) TCR cluster, we analyzed the occurrence patterns of the member TCRs in order to identify a subset of the (full or allele-positive) cohort enriched for those TCRs. We counted the number of cluster member TCRs found in each subject's repertoire and sorted the subjects by this TCR count (rank plots in Figure 3B-C and in the right panels of Figure 7). For comparison, we generated control TCR count plots by independently resampling the subjects for each member TCR, preserving the frequency of each TCR and biasing by subject repertoire size. Each complete resampling of the cluster member TCR occurrence patterns produced a subject TCR rank plot; we repeated this resampling process 1000 times and averaged the rank plots to yield the green ('randomized') curves in Figure 3B-C and Figure 7. To compare the observed and randomized curves, we took a signed difference between the observed counts C j and the randomized counts R j , where the value of the subject index i ¼ i max that maximizes the right-hand side in the equation above represents a switchpoint below which the observed counts generally exceed the randomized counts and above which the reverse is true (both sets of counts are sorted in decreasing order). We take this switchpoint i max as an estimate of the number of enriched subjects for the given cluster (this is the value given in the 'Subjects' column in Table 3).
Since the raw D CO values are not comparable between clusters of different sizes and for different alleles, we transformed these values to a Z-score (Z CO ) by generating, for each cluster, 1000 additional random TCR count curves and computing the mean ( D ) and standard deviation (s D ) of their D rand CO score distribution: We used this co-occurrence score Z CO together with a log-transformed version of the cluster size p-value, S size ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi À log 10 P size ð Þ p for visualizing clustering results in Figure 6 (S size on the x-axis and Z CO on the y-axis) and prioritizing individual clusters for detailed follow-up.

TCR annotations
We annotated public TCRs in our dataset by matching their sequences against two publicly available datasets: VDJdb (Shugay et al., 2018), a curated database of TCR sequences with known antigen specificities (downloaded on 3/29/18; about 17; 000 human TCRb entries) and McPAS-TCR (Tickotsky et al., 2017), a curated database of pathogen-associated TCR sequences (downloaded on 3/29/18; about 9; 000 human TCRb entries). VDJdb entries are associated with a specific MHCpresented epitope, whereas McPAS-TCR also includes sequences of TCRs isolated from diseased tissues whose epitope specificity is not defined. We added to this merged annotation database the sequences of structurally characterized TCRs of known specificity (see below), as well as literaturederived TCRs from a handful of primary studies (Dash et al., 2017;Glanville et al., 2017;Song et al., 2017;Kasprowicz et al., 2006). For matches between HLA-associated TCRs and database TCRs of known specificity, we filtered for agreement (at 2-digit resolution) between the associated HLA allele in our dataset and the presenting allele from the database. In other words, TCRs belonging to B*08:01-restricted clusters were not annotated with matches to database TCRs that bind to A*02:01-presented peptides. Table 3. HLA-restricted TCR clusters with size (S size ) and co-occurrence (Z CO ) scores, annotations (abbreviated as in Figure 6), and validation scores.  Table 3 continued on next page

Structural analysis
We analyzed a set of experimentally determined TCR:peptide-MHC structures to find MHC positions frequently contacted by the CDR3b loop. Crystal structures of complexes involving human TCRs and human class I or class II HLA alleles (Table 4) were identified using BLAST (Altschul et al., 1997) searches against the RCSB PDB (Berman et al., 2000) sequence database (ftp://ftp.wwpdb.org/ pub/pdb/derived_data/pdb_seqres.txt). Structural coverage of HLA loci and alleles is sparse and highly biased toward well studied alleles such as HLA-A*02. Given the high degree of structural similarity among class I and among class II MHC structures solved to date, we elected to share contact     Table 4 continued on next page information across loci using trans-locus sequence alignments. For class I we used the merged alignment (ClassI_prot.txt) available from the IPD-IMGT/HLA (Robinson et al., 2015) database. Starting with multiple sequence alignments for individual class II loci from the IPD-IMGT/HLA database, we inserted gaps as needed in order to created merged alignments for the class II a and b chains. These alignments provided a common reference frame in which to combine residue-residue contacts from the TCR:peptide-MHC structures. We considered two amino acid residues to be in contact if they had a side chain heavyatom contact distance less than or equal to 4:5Å . The CDR3b contact frequency for an alignment position (class I, class II-a, or class II-b) was defined to be the total number of contacted CDR3b amino acids observed for that position, divided by the total number of structures analyzed. Redundancy in the structural database was assessed at the level of TCR and HLA sequence, ignoring the sequence of the peptide. Contacts from a set of n structures all containing the same TCR and HLA were given a weight of 1=n when computing the residue contact frequencies.
The statistical significance of correlations between HLA allele charge and average HLA-associated TCR CDR3 charge were computed using a 2-sided test as implemented in the function scipy.stats. linregress.

Acknowledgements
This work was supported in part through the NIH/NCI Cancer Center Support Grant P30 CA015704 and by NIH NHLBI grant R01-HL105914 to JH, as well as R01 GM113246 and U19 AI117891. The research of Frederick Matsen was supported in part by a Faculty Scholar grant from the Howard