Energy landscapes of peptide-MHC binding

Molecules of the Major Histocompatibility Complex (MHC) present short protein fragments on the cell surface, an important step in T cell immune recognition. MHC-I molecules process peptides from intracellular proteins; MHC-II molecules act in antigen-presenting cells and present peptides derived from extracellular proteins. Here we show that the sequence-dependent energy landscapes of MHC-peptide binding encode class-specific nonlinearities (epistasis). MHC-I has a smooth landscape with global epistasis; the binding energy is a simple deformation of an underlying linear trait. This form of epistasis enhances the discrimination between strong-binding peptides. In contrast, MHC-II has a rugged landscape with idiosyncratic epistasis: binding depends on detailed amino acid combinations at multiple positions of the peptide sequence. The form of epistasis affects the learning of energy landscapes from training data. For MHC-I, a low-complexity problem, we derive a simple matrix model of binding energies that outperforms current models trained by machine learning. For MHC-II, higher complexity prevents learning by simple regression methods. Epistasis also affects the energy and fitness effects of mutations in antigen-derived peptides (epitopes). In MHC-I, large-effect mutations occur predominantly in anchor positions of strong-binding epitopes. In MHC-II, large effects depend on the background epitope sequence but are broadly distributed over the epitope, generating a bigger target for escape mutations due to loss of presentation. Together, our analysis shows how an energy landscape of protein-protein binding constrains the target of escape mutations from T cell immunity, linking the complexity of the molecular interactions to the dynamics of adaptive immune response.

short chains, on average 8 to 11 amino acids long, resulting from the degradation of intracellular proteins carried out by the proteasome [9,10].MHC-II acts in specific antigenpresenting cells, presenting peptides transported from the extracellular space for interaction with helper T cells [11].The amino acid chains that bind to MHCII are more variable in length, on average 15-20 amino acids, and the relevant binding occurs in a subset of about 9 amino acids [12,13].MHC-I and MHC-II have different binding mechanisms.The MHC-I binding pocket is closed at both ends, causing peptides to be anchored at their extremities, usually in positions 2 and 9, while a weaker interaction acts in the center [14,15].In contrast, MHC-II has an open binding pocket, allowing the peptide to be anchored at multiple positions [12,16].
The complexity of interactions between binding positions represents the key challenge for affinity models [17,18].Ini-tial studies on MHC-I include pairwise interactions among specific peptide positions in otherwise additive models [17,19].In contrast, state-of the-art computational tools based on neural networks trained on experimental data contain pervasive nonlinearities in the binding energy function, albeit implicit in the model output.These models are highly effective for MHC-I alleles [20,21,22,23,24,25,26], they are also increasingly successful in addressing the higher complexity of the binding mechanism for MHC-II alleles [13,18,27].However, given their high number of parameters and their complexity, training becomes difficult for rare alleles insufficiently represented in empirical datasets.Moreover, the neural network architecture impedes a mechanistic interpretation of the model output [28].
This paper presents a systematic study of the energy landscapes of protein-MHC binding, which map a given peptide sequence onto the binding affinity to a given MHC allele.These landscapes have nonlinearities, or epistasis [29,30,31,32,33,34], caused by interactions of multiple peptide and MHC amino acids -a common feature of peptide-MHC binding [17].The form of epistasis, however, depends on the MHC class.For MHC-II, we find rugged landscapes where the binding energy depends on specific amino acid combinations of a peptide in a network of several binding positions.The MHC-I landscapes are much simpler: the binding energy is a nonlinear function of an intermediate trait that is additive in the peptide sequence [31], a less complex alternative to the pairwise interactions approach, already well studied in previous works [19].In the literature on evolution, these two types of epistasis, characterising MHC-II and MHC-I energy landscapes, are referred to as idiosyncratic and global, respectively [35,36,37,38].
In the first part of the paper, we infer global and idiosyncratic epistasis in energy landscapes from empirical data of several MHC-I and MHC-II alleles.We show that data and predictive models for MHC-I and MHC-II have low and high complexity, respectively.In the second part, we discuss two important consequences of the form of epistasis.First, we infer a matrix energy model for MHC-I that outperforms state-of-the art machine-learning models (NetMHC 4.0), providing predictions of very similar quality with a drastically reduced number of model parameters.
This model can be learned by simple regression, demonstrating that the complexity of epistasis sets the optimal learning algorithm.Second, we infer spectra of mutational effects in energy landscapes for MHC-I and MHC-II, and we discuss the differences between these spectra induced by the form of epistasis.

Results
Datasets of peptide-MHC binding affinities.For training and testing, we analyze peptide sequences and MHC binding data from the Immune Epitope Database (IEDB) [39].The datasets are curated before training (Mate-84 rial and Methods).For MHC-I, we include the most frequent 85 alleles of HLA-A and HLA-B: HLA-A*02:01, HLA-A*11:01, 86 and HLA-B*07:02 with 6953, 1981, and 1024 curated pep-87 tides of length 9 aa.The resulting dataset is similar but not 88 identical to the data used for benchmarking and retraining 89 of NetMHC 4.0 [21,40].For MHC-II, our dataset contains 90 7434 peptides from allele HLA-DRB1*01:01 of length 15 91 aa with a binding core of length 9 aa.In these data, bind-92 ing affinities are reported as half-inhibitory concentration 93 (IC50), other peptides are annotated qualitatively as binding 94 or non-binding.Throughout this paper, we describe binding 95 affinities by the free energy ∆G = log(IC50/ρ 0 ), which is 96 measured in units of k B T and has its zero point at the 97 physiological binding threshold, ρ 0 = 500nM (MHC-I) and 98 ρ 0 = 1000nM (MHC-II) [41,42].

99
Inference of a linear energy model.Fig. 1A shows 100 position-weight matrices, or binding motifs, for two represen-101 tative alleles HLA-A*02:01 for MHC-I and HLA-DRB1*01:01 102 for MHC-II.These matrices record the frequencies p i (a) 103 of amino acid a at a given position in the binding core 104 (i = 1, . . ., ℓ), evaluated in the set of binding peptides 105 (∆G ≤ 0).The resulting position-specific sequence in-106 formation, ∆S i = a p i (a) log(p i (a)/p 0 (a)), is defined as 107 the Kulback-Leibler distance between the observed amino 108 acid frequency distribution and a null distribution (here 109 p 0 (a) = 1/20).For MHC-I, the sequence information is 110 concentrated in positions 2 and 9, the known anchor posi-111 tions of peptide-MHC binding [43] (Fig. 1B).In contrast, 112 MHC-II has a more homogeneously distributed sequence 113 information, i.e., less bias towards the anchor positions 1, 4, 114 6, and 9.The total additive information, ∆S = ℓ i=1 ∆S i , 115 which measures the power of linear sequence models to dis-116 criminate peptides presented on the cell surface, is lower for 117 MHC-II than for MHC-I.

118
As a starting point for the inference of energy models, 119 we construct a linear energy matrix, 120 where a = (a 1 , . . ., a ℓ ) denotes the peptide sequence and 121 ϵ i (a) is the energy contribution of amino acid a at position i. 122 Energy models of this kind are widely used for transcription 123 factor binding and for protein-protein binding, including 124 early-stage predictions of peptide-MHC-I interactions [44, 125 19].Here we obtain an optimized 20 × 9 energy matrix, 126 which minimises the mean squared error (MSE), via 10-fold 127 cross-validation for each of the selected alleles (Material and 128 Methods).At each position, the energy ranking of amino 129 acids reproduces the frequency ranking in the position weight 130 matrix (Table S2 and Table S8).

131
The heat maps of Fig. 1C show the optimized linear 132 energy L plotted against the validation data for the two representative alleles.For MHC-I, the linear model systematically underestimates the affinity of strong binders (L > ∆G).A similar effect is observed for the weakest binders, albeit supported by fewer data points.For MHC-II, the linear model has a more drastic mismatch with the data and only explains about half of the variation in ∆G.Likely factors contributing to this mismatch include the variation of binding cores and additional energy contributions of peptide flanking regions not included in the model [45,46].In what follows, we will link these deficiencies of the linear model to epistasis, the form of which depends on the MHC class.
Energy landscapes with global and idiosyncratic epistasis.Nonlinear energy landscapes provide a better representation of peptide-MHC binding.Here we refer to such nonlinearities as energy epistasis and parametrize them by interaction terms involving pairs of amino acids in the peptide binding core.For MHC-II, we use a generic landscape of the form for each pair of positions, the 20 × 20 matrix β ij (a, b) characterizes the interaction energy between amino acid a at position i and b at position j.For MHC-I, we use a simpler landscape, where the interactions are given by the linear energy terms, β i,j = λϵ i (α i )ϵ j (α j ).Following the literature on fitness landscapes, we refer to these two types of nonlinearities as idiosyncratic and global epistasis, respectively [38,35].
Under idiosyncratic epistasis, the energy effect of a mutation  such pairs in our datasets, we map instead the deviations of 178 partial data from the previously inferred linear model [49].179 For MHC-I, we form data subsamples containing se-180 quences with particular amino acid pairings in positions 181 2 and 9.For each subsample, we perform a linear regres-182 sion, the slope and offset of which compare validation data 183 with the linear model L (dashed line).Fig. 2A shows these 184 regressions for a few subsamples of peptides binding HLA-185 A*02:01; all subsamples are shown in Fig. S2.We observe 186 a systematic modulation of subsample slopes with linear 187 energy L: deviations from the linear model increase from 188  For comparison, we also analyze epistatic landscapes with random parameters.For idiosyncratic epistasis, we use energy landscapes of the form (2), where the parameters ϵ i (a) and β ij (a, b) are Gaussian random variables with mean 0 and variance σ 2 ϵ and σ 2 β , respectively.For a given landscape, we evaluate the energy ∆G(a) for a sample of random peptides, subject to an additional measurement error of variance σ 2 m = 1.5.The strength of epistasis is given by λ = σ 2 β /σ 2 ϵ (details of the simulations are given in Materials and Methods).With an appropriate strength ( λ = 2), the simulated data generate a regression pattern strikingly similar to the empirical data of MHC-II (Fig. S2, Fig. S3B),  S2).The simulated data are drawn with a measure- Next, we compare the performance of the global-epistasis model with the most widely used neural network model, NetMHC 4.0, in both binding energy predictions and classification.As validation data for binding energy predictions, we use the set of all IEDB peptides that are not contained in the training data for NetMHC 4.0 (obtained from IEDB Analysis Resources [39,40]).The set contains 465, 69 and 183 sequences for HLA*A-02:01, HLA*A-11:01 and HLA*B-07:02, respectively.These peptides are also excluded from the training of our model.The global-epistasis model for HLA*A-02:01 reaches a smaller error than NetMHC 4.0 (MSE = 6.8 compared to MSE = 7.2).The aggregated error computed on all MHC-I alleles included in the analysis reaches a slightly larger value (MSE = 6.9 compared to MSE = 6.8 for NetMHC 4.0), reflecting the reduced availability of training sequences (Fig. S4).To assess the classification performance, we collect sets of positive and negative binders from IEDB.We exclude sequences used for training and present in the IEDB training dataset.The performance metric is reported as Area Under the Curve (AUC, see Material and Methods).For HLA*A-02:01, we obtain a score of 0.837 for the global epistasis model and 0.843 using NetMHC 4.0.Scores for other alleles are detailed in Table S1.This level of performance is achieved with a drastically lower model complexity: the global-epistasis model has 20 × 9 + 1 parameters per allele, significantly less than the number of parameters required to account for all the pairwise interaction terms between each positions [17,19].In comparison, the network model employs > 900 parameters [21].
Factoring in the difference in model complexity by a Bayesian Information Criterion (Materials and Methods), the global-epistasis model strongly outperforms NetMHC 4.0 (Fig. 3B).Hence, for the smooth energy landscapes of MHC-I alleles, neural network models are an overkill: they apply a complex multi-level learning procedure to an inference problem that can be solved by simple regression.In contrast, neural network models are an appropriate tool for the intrinsically more complex idiosyncratic energy landscapes of MHC-II.For HLA-DRB1*01:01, low-complexity models of the form (3) fail to map the energy variation (within-dataset MSE = 5.7), similar to the linear model shown in Fig. 1C (cross-validated MSE = 5.9), while NetMHCII-2.3 still provides a faithful representation of the data (within-dataset MSE = 1.6).Mutational effect spectra in epistatic landscapes.
We can use the energy data of MHC-I and MHC-II for a comparative analysis of mutational effects.In Fig. 4, we show the distribution of energy changes, ∆∆G, plotted separately for mutations at anchor and non-anchor positions and for ancestral sequences in two affinity windows: typical binders (∆G anc ≈ −1) and strong binders (∆G anc ≈ −5).
For MHC-I, we observe homogeneously upscaled effects for mutations of strong binders, reflecting the larger slope of the global epistasis function ∆G(L) (Fig. 4A).This upscaling increases mean and variance of magnitude effects, E(|∆∆G|) and Var(|∆∆G|), in non-anchor and, at larger amplitudes, in anchor positions (inserts in Fig. 4).For MHC-II, the mean effect E(|∆∆G|) is also enhanced in strong binders (Fig. 4B).This is the known global pattern of idiosyncratic epistasis [38]: strong binders have a higher fraction of affinity-increasing bonds (i.e., amino acid pairings for which β ij (a i , a j ) < 0), generating an additional energy increase for mutations breaking these bonds.However, the variance of magnitude effects is largely independent of ∆G anc , and the effect distributions are similar in anchor and non-anchor positions.This is again consistent with strong idiosyncratic 324 epistasis: large mutational effects at non-anchor positions 325 can be generated by energy bonds to anchor positions.

326
To quantify idiosyncratic epistasis directly, we record 327 mutational effect spectra generated by background sequence 328 variation from calibrated and random energy models for 329 MHC-II (these spectra are undersampled in the data).For 330 a given mutation a → a ′ of a peptide with given energy 331 ∆G anc , we resample the ancestral background sequence at 332 the other positions subject to the constraint of maintain-333 ing ∆G anc ; the resulting effect spectrum is then shifted to 334 mean 0. In Fig. S5, we plot these spectra for a sample 335 of mutations in typical binders (∆G anc ≈ −1) and strong 336 binders (∆G anc ≈ −5), together with the corresponding 337 sample-averaged distributions.The rms effect given by 338 these distributions measures idiosyncratic deviations from 339 global epistasis.This effect is sizeable for typical and for 340 strong binders and similar to the global shift in E(|∆∆G|) 341 between the two binding regimes (Fig. 4B, Fig. S5).Hence, 342 in MHC-II, the energy effect of a given mutation depends 343 strongly on the specific amino acid pairings of the ancestral peptide.
Immune escape evolution.The energy statistics of peptide-MHC binding affects the escape evolution of antigens from T cell immune recognition.In both MHC classes, even strong-binding antigenic epitopes can effectively suppress presentation by large-effect mutations (∆∆G ≳ 5, shaded areas in Fig. 4).For MHC-I, large-effect mutations occur in anchor positions, while most mutations in non- MHC-I molecules bind short peptides in a fixed contact map; global epistasis of the binding energy is consistent with mutual reinforcement of binding between the anchor positions that is mediated by the intermediate chain.In contrast, MHC-II binds longer peptides with variable contact maps [16], suggesting that idiosyncratic epistasis can partly be attributed to changes in the core contact points and in the peptide configuration in between these points.

393
The type of epistasis also governs the efficient learning 394 of energy landscapes from data.For MHC-I, we derive 395 a simple energy matrix model that outperforms state-of-396 the art machine-learning models [21] in BIC score.This 397 model can be inferred by a joint regression procedure for 398 the linear trait L(a) and for the nonlinear function ∆G(L).399 Such regression methods fail for MHC-II, while neural net-400 works still produce a faithful energy landscape [13].The 401 difference in learnabilty is not surprising, given the strongly 402 different complexity of global and idiosyncratic energy mod-403 els.This compression can be understood as a dimension-404 ality reduction in phenotype space.In the idiosyncratic 405 case, sequence variation affects not only the focal trait, 406 here the binding energy ∆G, but also a number of co-407 varying traits A, B, . . ., including the spatial configura-408 tion of the interaction surface and allosteric interactions.409 A given mutation a → a ′ on two different backgrounds 410 b, b ′ has effects ∆∆G b = ∆∆G(∆G ab , A ab , B ab , . . . ) and 411 ∆∆G b ′ = ∆∆G(∆G ab ′ , A ab ′ , B ab ′ , . . .).These effects dif-412 fer because they probe the slope of the energy landscape 413 at different values of the co-varying traits, even if they 414 start from similar energies, ∆G ab ≈ ∆G ab ′ .Under global 415 epistasis, the ensemble of traits collapses to a single rele-416 vant trait ∆G.This implies ∆∆G b = ∆∆G(∆G ab ) and 417 ∆∆G b ′ = ∆∆G(∆G ab ′ ); the only remaining source of epis-418 tasis for the mutation a → a ′ is the energy effect of the 419 background mutation b → b ′ .The (lack of) collapse from 420 idiosyncratic to global epistasis can be evaluated in a given 421 training dataset and serve as a diagnostic for the appropriate 422 learning procedure.

423
In summary, we have highlighted the roles of epistasis 424 in the peptide-MHC energy landscapes for T cell immunity.425 The type of epistasis is linked to the underlying biophysical 426 mechanism, governs the learnabilty of landscape models 427 from data, and shapes the evolution of binding agents (here, 428 antigenic peptides).We expect this broad picture of trait-429 based epistasis to be applicable to other protein-protein 430 interaction landscapes.
For both classes, we perform 10 separate realizations for each λ and λ, characterized by different random sets of training sequences and initial conditions for ϵ.A linear model is optimised and tested on random test sets.The partial regressions, performed on similar L ranges, are used to estimate the complexity of each landscape.

Complexity estimates.
For estimating the complexity of the given MHC class, we compute ∆MSE in bins of L with size 2, spanning L values from -11 to 15 for both MHC classes (other bin sizes give broadly consistent results).In each bin, we evaluate ∆MSE = MSE ave − MSE p over a set of p subsamples with specific allele combinations, including only subsamples with more than 2 data points.
The ∆MSE values from simulations are derived from randomly generated sets of 20,000 peptides.The values reported in the main text are obtained by averaging over 10 realizations, and the standard deviation is computed across bins over these realizations.The range of L spanned in the simulated data is determined based on the minimum and maximum values observed in the specific realization under consideration.
Classification score.The classification score is computed on set of positive and negative binders, sourced from IEDB on the March 4th 2024 for all the considered alleles.After curations, the validation sets consist of 2488 positive and 1497 negative binders for HLA*A-02:01, 391 positive and 184 negative binders for HLA*A-11:01 and 382 positive and 505 negative binders for HLA*B-07:02.
We use the curated sets to construct the Receiver Operating Characteristic (ROC) curves, illustrating the relationship between sensitivity, representing the ratio of true positives to the total number of positives, and 1−specificity, which denotes the ratio of false positives to the total negatives [17,50].We define positives and negatives based on the binding probability defined below: We calculate these metrics by varying the thresholds that distinguish positives from negatives and computing the corresponding variables.The area under this curve provides the AUC score, which is presented for the relevant alleles in Table S1.
Model comparison.We use the BIC score [51], where n is the number of sequences in the test set and k the number of parameters, respectively.For NetMHC 4.0, we compute k = k h + k out , in which k h = (n in + 1) × n h , where n in and n h are the number of inputs and hidden neurons.Similarly, k out = (n h + 1) × n out , where n out is the number of output neurons [52].With the architecture information provided in reference [21], n in = 180, n h = 5 and n out = 1, we obtain k is 911.The total BIC is reported in Fig. 3B.
Mutational effect spectra.The mutational effect in Fig. 4 is computed collecting all pairs of single mutants available in data.Specifically, we collect 156 pairs in the strong binders regime and 119 for typical binders regime for HLA-A*02:01.For HLA-DRB1*01:01, we collect 214 and 230 sequence pairs for strong and typical binders regimes, respectively.For each energy window, both quantitative and qualitative measurements are collected.For mutations with qualitative data, ∆G wt > ∆G thr , we define ∆∆G = 0 if ∆G mut > ∆G thr and ∆∆G = ∆G mut − ∆G thr otherwise; mutations with a qualitative ∆G mut are treated in an analogous way.The spectra of background mutations in Fig. S5 are computed based on 50 single mutants presenting the same mutation at the chosen position.The ancestral peptides are selected in the same energy window for strong and typical binders, as above.For both the idiosyncratic model and NetMHCII 2.3, positions 4 and 6 are chosen as main anchors.
Data and Code availability.All curated data, raw data sources, and materials used in this study are available in Supplementary Tables, at: https://github.com/lcollesano/MHC-gemand in the IEDB Analysis Resources at http://tools.iedb.org/mhci/download.

Supplementary Figures
at a given position, ∆∆G, depends on the detailed sequence content at other positions.Under global epistasis, it depends only on the energy of the original sequence.In other words, the full energy can be written as a nonlinear function of an intermediate linear trait, here ∆G = L + λL 2 [31, 48].These two types of energy landscapes are strongly different in model complexity.The idiosyncratic interaction term in Equation (2) has 20 × 20 energy parameters for the amino acid combinations at each each pair of interacting peptide positions.The global epistasis model of Equation (3) requires only 20 × ℓ parameters for the linear term and a single additional parameter λ measuring the strength of epistasis.A common approach to infer epistasis from data is to compare the effect of a mutation on different sequence backgrounds, for example, by analyzing pairs of mutations at two positions in two different orders (ab → a ′ b → a ′ b ′ and ab → ab ′ → a ′ b ′ ).Here, given the limited availability of

Figure 1 .
Figure 1.Binding motifs, sequence information, and linear model.(A) Position weight matrix for HLA-A*02:01 (left) and HLA-DRB1*01:01 (right) inferred from training data.Letter size indicates the frequency of a given amino acid at a given position, pi(a), in the sample of MHC-binding peptides (∆G ≤ 0).Yellow bars highlight anchor positions, the color of letters shows the chemical properties of amino acid (black: hydrophobic, red: acidic, blue: basic, green: polar, purple: neutral) [47].(B) Position-specific sequence information, ∆Si, defined as the Kulback-Leibler distance of the observed distribution pi(a) from an equidistribution.(C) Error plot comparing observed energies, ∆G, and predictions of the linear model, L (contours give densities above 0.01).

Figure 2 . 4 194and 6 .
Figure 2. Inference of epistasis.Scatter plots of experimental ∆G values and the linear energy model L for subsamples with given amino acid pairs in selected positions (max 50 random points are displayed).Top: HLA-A*02:01, selected positions 2 and 9; bottom: HLA-DRB1*01:01, selected positions 4 and 6.From left to right: subsamples containing stronger binders, marginal binders, and non-binders, respectively.Lines give linear regressions for each subsample.See Fig. S2 for regression lines of all subsamples.

Figure 3 .
Figure 3. Global-epistasis model for MHC-I.(A) Error plot comparing observed energies, ∆G, and predictions of the optimal global-epistasis model (λ = −0.04)for HLA-A*02:01.To be compared with the linear model of Figure 1; see Fig. S1 for other alleles.(B) Model comparison of the globalepistasis model and NetMHC-4.0:error function (dark shading) and total BIC score accounting for model complexity (full bars) for allele HLA-A*02*01 (orange) and aggregated over all 3 MHC-I alleles (blue), evaluated by blind testing (see text for details).
236 ment error of variance σ 2 m = 1.For intermediate strength of 237 epistasis (λ = −0.04), the regression pattern is again similar 238 to the empirical data (Fig. S3A) and produces a similar 239 error difference (∆MSE = 0.20 ± 0.1).240 Energy landscapes with global epistasis for MHC-I.241 The type of epistasis has immediate consequences for the 242 learnability of landscape models from training data.For 243 MHC-I, we can infer a global-epistasis model by regression: 244 we learn the linear model L(a) together with the parameter 245 λ of the nonlinear function ∆G(L).For the allele HLA-246 A*02:01, we infer an optimal quadratic model of the form (3) 247 with λ = −0.04.This curvature is associated with moderate 248 increasing-return epistasis: the global nonlinearity reduces 249 the systematic error of the linear model in the strong-binding 250 regime (Fig. 3A, Fig. 1C), leading to a reduced mean square 251 error (MSE = 4.3) compared to the linear model (MSE = 4.6) in the 10 fold cross-validation procedure.Adding a cubic nonlinearity to the function does not further reduce the MSE in a significant way.As shown below, the nonlinearity substantially increases the effects of mutations in strong binders.Very similar global energy landscapes are inferred for the other MHC-I alleles included in our analysis (Fig. S1).
anchor positions maintain presentation.This reflects the modularity of presentation and T cell recognition encoded in peptide sequences.For MHC-II, escape mutations acting on presentation are broadly distributed across anchor and non-anchor positions.The overall target for this type of escape evolution is larger for MHC-II: 21 % of all mutations have ∆∆G ≥ 5, compared to 12 % for MHC-I.This suggests the idiosyncrasy of peptide-MHC-II binding facilitates the escape evolution of antigens by generating large epistatic mutation steps.Discussion Peptide-MHC binding interactions have strong, class-specific nonlinearities.For MHC-I, these interactions can be described by an energy landscape with global epistasis, ∆G(L), where the intermediate trait L is additive in the peptide sequence.The curvature of the landscape increases in slope at strong binding.This enhances the energy discrimination of presented peptides and increases selection on affinitydecreasing escape mutations in pathogen evolution.Moreover, the recognition chain of class I is modular: strong MHC binding energy effects governing peptide presentation are predominantly determined by the anchor positions 2 and 9, while non-anchor positions encode the extracellular interactions of presented peptides with T cell receptors.For MHC-II, we find rugged energy landscapes with idiosyncratic epistasis, describing cooperative binding that depends on the amino acid content at multiple peptide binding positions.Mutational effects at a given binding position depend strongly on the peptide sequence background, and there is no modularity: weak and strong effects are found at anchor and non-anchor positions.These differences in epistasis between energy landscapes are in tune with distinct biophysical binding mechanisms.

Figure S3 .
Figure S3.Regression lines and measure of complexity in simulation.Linear regression for subsamples of simulated data (one realization) according to the global epistasis model and the idiosyncratic epistasis model.A) From left to right, Weak global epistasis with λ = −0.01,moderate with λ = −0.04 and strong λ = −0.1.The lines are color coded based on the minimum of the L spanned by the given subsample.B) From left to right, Weak idiosyncratic epistasis with λ = 0.3, moderate with λ = 1 and strong λ = 2.The transparency of each regression line is proportional to the number of datapoints in the given subsample.

Figure S4 .
Figure S4.Blind set validation for MHC-I alleles.Error plot compairs observed energies contained in the blind set, ∆G, and predictions of the global epistasis model with λ = −0.04 for all the tested alleles (A, on top) and NetMHC 4.0 (B, bottom).Contours give densities above 0.01.The blind set contains 465 sequences for HLA-A*02:01, 69 for HLA-A*11:01 and 183 for HLA-B*07:02.

Table S1 .
AUC scores for classification comparison between the global epistasis model and NetMHC 4.0.

Table S4 .
Position weight matrix for the linear model, L(a), in HLA-A*11:01

Table S5 .
Position weight matrix for the global-epistasis model, L(a), in HLA-A*11:01

Table S6 .
Position weight matrix for the linear model, L(a), in HLA-B*07:02