Improved predictive models for peptide presentation on MHC I

Large surveys of peptides naturally presented on major histocompatibility class I (MHC I) proteins have enabled improved MHC I ligand prediction by dramatically expanding the available data for many MHC I alleles. However, it is unclear to what extent antigen processing signals can also be learned from these datasets. Here, we developed a predictor of antigen processing by training neural networks to discriminate mass spec-identified MHC I ligands from unobserved peptides, where both classes of peptides are predicted to be strong MHC I binders. The resulting predictor shows qualitative consistency with established preferences for the transporter associated with antigen processing, proteasomal cleavage, and endoplasmic reticulum aminopeptidases. When we combined the antigen processing predictor with a novel pan-allele MHC I binding predictor in a logistic regression model, the combination model significantly outperformed the two components alone as well as the NetMHCpan 4.0 and MixMHCpred 2.0.2 tools at predicting mass spec-identified MHC I ligands. Our predictors are implemented in the open source MHCflurry package, version 1.6.0 (github.com/openvax/mhcflurry).


Main text Introduction
Cytotoxic T cells recognize peptides presented in complex with major histocompatibility class I (MHC I) molecules on cell surfaces.These peptides are usually derived from the degradation of endogenous proteins and comprise a snapshot of the protein content of the cell, enabling T cells to distinguish healthy cells from those with viral, bacterial, or tumor-associated mutated proteins ,2 .The repertoire of MHC I-presented peptides is generated through a complex series of biochemical processes, beginning with cleavage of a protein into peptides in the proteosome, further cleavage (or destruction) by cytosolic peptidases, peptide transport into the endoplasmic reticulum (ER) through the transporter associated with antigen processing (TAP) complex, trimming by ER-resident aminopeptidases (ERAP), and stable association with one of the several MHC I proteins expressed by a cell 3 .The MHC I genes ( HLA-A, HLA-B , and HLA-C in humans) are the sites of dramatic allelic variation at the population level, with each of the thousands of known HLA alleles associated with a relatively strict, potentially distinct peptide binding preference.While a high-affinity interaction with MHC I is the most selective requirement for a peptide to be presented, the other processes in the antigen presentation pathway may exert important secondary effects.
Computational prediction of MHC I-presented epitopes are a routine tool in vaccine design and studies of infectious diseases and cancer 4 .Most predictive pipelines in use today focus only on MHC I binding affinity (BA) prediction, although predictors for antigen processing steps such as proteasomal cleavage and TAP transport have also been proposed [5][6][7][8] .The data available to develop predictive models has recently expanded as multiple groups apply mass spec (MS) to sequence the peptides presented by MHC I on cell lines and tissues 9 .The datasets generated in these studies have been used to train improved MHC I binding predictors, which benefit from deeper coverage of MHC I alleles that were poorly-characterized using lower throughput assays, predominantly in vitro peptide / MHC binding affinity measurements.However, besides establishing binding motifs for additional MHC alleles, the MS-identified peptides also implicitly include information on other steps of the antigen presentation process.To some extent, this information likely informs existing MHC I binding predictors trained on MS datasets, although it remains intertwined with MHC I binding preferences, making it is difficult to parse the contributions from other processes and potentially leading to lower predictive accuracy.
In this work, we develop separate predictors for MHC I allele-dependent (MHC I binding) and allele-independent (antigen processing) effects.We first trained a new pan-allele MHC I binding affinity predictor (referred to as MHCflurry 1.6.0BA) on available MHC I ligand data, including affinity measurements and MS datasets.The use of in vitro affinity measurements in the training data, which are independent of antigen processing, is intended to limit the predictor's tendency to learn antigen processing signals.We use this predictor to generate a training set for a model of antigen processing by combining MS-identified peptides (hits) with unobserved peptides (decoys), where both hits and decoys are predicted by the BA predictor to bind the relevant MHC I alleles.The antigen processing predictor (MHCflurry 1.6.0AP) is trained to distinguish hits from decoys on this set without considering MHC I allele, and thus models the residual MHC allele-independent sequence properties that were not learned by the BA predictor.We combined the BA and AP models in a logistic regression model to develop a composite prediction, which we refer to as the presentation score (MHCflurry 1.6.0PS).We found that the PS predictor outperformed both its component models and the commonly-used MHC I ligand predictors NetMHCpan 4.0 and MixMHCpred 2.0.2 on a benchmark of held-out MS datasets.

MS benchmark construction and approach
Dataset curation.To benchmark the binding predictors developed here, we collected datasets from 11 studies that identified MHC I-bound peptides using MS.We included only samples with known four-digit MHC I genotypes.Two of the studies included experiments that used cell lines engineered to express a single MHC I allele.We refer to these as the MONOALLELIC samples, which comprise 72 samples from the recent publication by Sarkizova et al 10 and 8 samples from Abelin et al 11 .We refer to the other samples, in which exact MHC I restrictions were not experimentally determined, as the MULTIALLELIC samples.We divided these into two groups: MUTLIALLELIC-OLD, comprised of 56 experiments from eight studies published before 2018 [12][13][14][15][16][17][18][19] , and MULTIALLELIC-RECENT, comprised of 20 experiments from two studies published in 2019 10,20 .S1, the sample groups used for each benchmark experiment are given in Table S2, and Additional Files 1 and 2 give the full datasets for the MULTIALLELIC and MONOALLELIC benchmarks, respectively.

Curated samples are listed in Table
Decoy selection.We generated accuracy benchmarks from the curated datasets by sampling unobserved peptides (decoys) from the same proteins as the observed peptides (hits) for each sample.We first associated each sample with a publicly-available RNA-seq expression dataset based on its cell type or tissue of origin.
Each MS-identified peptide was searched in the Uniprot human reference proteome (UP000005640_9606).In the case of multiple matches, we used the RNA-seq for each sample to select a single protein with the highest mRNA expression (total across transcripts).For each sample, we randomly selected 99n decoy peptides, where n is the number of hits.Equal numbers of decoy peptides of each length (8, 9, 10, 11) were sampled.
We excluded from this procedure all peptides (hits and decoys) that were present in the training data for any of a sample's alleles for the MHCflurry predictor under evaluation, as well as hits that contained noncanonical amino acids or could not be matched to a Uniprot protein and Ensembl gene (Ensembl release 98).
Comparison to existing tools.We included two existing tools in our benchmarks, NetMHCpan 4.0 (ref. 21) and MixMHCpred 2.0.2 (ref. 12,22).The NetMHCpan 4.0 tool is an ensemble of neural networks trained on peptide-MHC I affinity measurements plus peptides from monoallelic MS experiments.It gives separate predictions for each kind of data, i.e. a binding affinity (BA) prediction and an eluted ligand (EL) prediction, which we evaluated as separate predictors in our benchmarks.MixMHCpred is trained on peptides identified in multiallelic MS experiments, which it deconvolves into clusters that are subsequently associated with MHC I alleles.As both the MHCflurry binding affinity predictor and NetMHCpan 4.0 do not incorporate multiallelic MS in their training sets, we evaluated them on the full MULTIALLELIC benchmark.As MixMHCpred is trained on multiallelic MS, we tested it only on samples published recently (after it was trained), i.e. the MULTIALLELIC-RECENT benchmark.The two studies in the MONOALLELIC dataset were published after NetMHCpan 4.0 and MixMHCpred 2.0.2 were released, so it is also an appropriate test set for them.The released MHCflurry 1.6.0binding predictor incorporates the MONOALLELIC dataset as training data, however, so in evaluations of this benchmark we used a variant of MHCflurry re-trained without these datasets.As the logistic regression model used in the MHCflurry 1.6.0PS predictor was fit to the MULTIALLELIC-OLD subset, we benchmarked it only on the MULTIALLELIC-RECENT samples.

Scoring metrics.
We assessed accuracy at distinguishing MS hits from decoys using two metrics, area under the curve (AUC) and positive predictive value (PPV).The AUC is a standard accuracy metric for classification tasks, interpretable as the probability that a randomly selected hit will be scored higher by a predictor than a randomly selected decoy.The PPV, as defined here, focuses on a predictor's ability to rank hits far above the decoys.To calculate PPV, for each sample we sorted the n hits and 99n decoys by their predictions, and calculated the fraction of the top n peptides that are hits.A random predictor would score 0.01 in PPV and 0.5 in AUC.

MHC I binding affinity predictor
The MHCflurry 1.6.0BA predictor is a new pan-allele MHC I binding affinity predictor that supports variable-length peptides up to 15-mers.It extends an earlier version of MHCflurry 23 -in which separate neural networks were trained for each of 112 supported alleles-to now support 14,993 MHC I alleles using a single neural network ensemble.The input to each neural network consists of (1) an encoding of the peptide amino acid sequence and (2) an encoding of the amino acids at 37 selected positions from a multiple sequence alignment of a large number of MHC I alleles across species.The neural networks each output a nanomolar binding affinity (transformed to a 0-1 scale using the formula ).The mean of the log (nM af f inity) 1 − 50000 transformed predictions over the ensemble gives the overall prediction.The predictor is trained on affinity measurements and MS datasets from cell lines monoallelic for a single MHC I allele.MS hits are assigned a "< 50 nM" binding affinity.The models are trained using a variant of the mean squared loss (MSE) that supports such inequalities, as described previously 23 .
Peptide representation.Peptides of 15 amino acids or shorter are supported.Each peptide is transformed to a length-45 sequence by concatenating three representations: left aligned, centered, and right aligned.Unused positions are represented using a special X symbol, treated as a 21st amino acid.For example, the peptide "GILGFVFTL" is represented by concatenating "GILGFVFTLXXXXXX" (left aligned), "XXXGILGFVFTLXXX" (centered), and "XXXXXXGILGFVFTL" (right aligned), resulting in "GILGFVFTLXXXXXXXXXGILGFVFTLXXXXXXXXXGILGFVFTL".Each amino acid in this sequence is transformed to a 21-dimensional vector using the BLOSUM62 substitution matrix 24 , extended to include the X placeholder, which is assigned similarity 1 to itself and 0 to all amino acids.Allele multiple sequence alignment.Full-length HLA-A , -B , -C , -E , -F , and -G amino acid sequences were downloaded from the IMGT/HLA 25 11 The training set for the BA predictor is available as Additional File 3.For testing on the MONOALLELIC benchmark, a predictor variant trained without these datasets was used (Additional File 4).The set of IEDB affinity measurements was augmented with the "BD2013" dataset from Kim et al. 31 .The final training set consisted of 470,586 MS hits and 219,357 affinity measurements.A summary of the training datasets used for all predictors introduced in this work is given in Table S3.
Neural network architectures.Each neural network in the ensemble corresponds to one of 35 possible model architecture variants.The overall design in all cases is similar: the allele and peptide representations are concatenated, flattened, and passed through a series of two or three dense layers, whose size ranges from 256 to 1024.Each dense layer is followed by a Dropout layer 32 with the dropout rate set to 50%.Architectures vary in terms of the size and number of the dense layers, the amount of regularization applied to dense layer weights (L1 penalty), and whether skip connections are used to give each dense layer direct access to the two preceding layers.

Model training.
The training data was sampled four times to generate four training subsets.Each subset excluded one quarter or 100 of the training points for each MHC I allele, whichever was less for each allele.
Models corresponding to the 35 architectures were fit to each of the four training subsets, for 140 trained models in total.Initial weights were selected using layer-sequential unit-variance initialization 33 .This was followed by a pre-training step, in which the network was fit to synthetic measurements generated by a previous version of MHCflurry (version 1.2.0, using allele-specific models trained without MS datasets).The synthetic data consists of affinity predictions for random peptides across 99 alleles.While pre-training on synthetic data, the training data was used as a test set for early stopping, i.e. pre-training was halted once the mean square error (MSE) was no longer improving on the actual (non-synthetic) training data.Typically, several hundred million synthetic measurements were used.Model fitting then proceeded using the training data, keeping 10% held-out for early stopping.The training dataset was augmented to include random peptides set to have a very weak affinity (>35,000 nM).The lengths of the random peptides were selected to equalize the number of non-binder data points across peptide lengths for each allele.As in NetMHC, the sequences of the random negative peptides were resampled after each epoch.

Model selection.
From the 140 trained models, the model selection procedure selected nine to use in the final predictor ensemble.This was done by selecting a set of models from each of the four training subsets independently, using MSE on the held-out points as the accuracy metric.A forward stepwise procedure was used, in which models are added until the ensemble accuracy no longer improves.The final ensemble was the union of the models selected across training subsets.

Antigen processing predictor
The MHCflurry 1.6.0AP predictor is trained to model the MHC allele-independent effects that were not captured by the BA predictor.We designed it with proteasomal cleavage prediction in mind, although the Model training and selection.The training data was sampled to generate four subsets, where each training subset held out 10 randomly selected MS experiments.Models corresponding to the 128 neural network architectural variants were fit to each training subset (512 models total).Models were trained using the Adam optimizer 34 and binary cross entropy loss with a random 10% of each training subset used for early stopping.
Model selection was performed for each training subset separately using AUC on the held-out data as the accuracy metric.The final selected ensemble included 8 models.

Presentation score model
The MHCflurry 1.6.0PS predictor is a two-input logistic regression model that integrates a BA prediction (tightest predicted binding affinity over the MHC I alleles for a sample) with an AP prediction to give a composite prediction, referred to as the presentation score.It has just three learned parameters: two coefficients and an intercept.In contrast to the BA and AP predictors, which use only monoallelic MS, the PS model is fit to multiallelic MS datasets.Training data for the PS model was generated using the MULTIALLELIC-OLD set of samples by sampling length-matched decoys (two decoys per hit) from the same proteins as the hits (Additional File 6).The training set included 753,777 entries (of which 251,277 were hits) from 56 samples.The model was fit using the logistic regression implementation in scikit-learn 35 with the LBFGS solver and default parameters.Two variants of the PS model were generated: one that made use of the AP with flanks predictor (i.e.upstream and downstream amino acids included) and another that used the AP without flanks predictor.The PS model was benchmarked on the MULTIALLELIC-RECENT samples.

Results
We first compared the new MHCflurry 1.6.0binding affinity (BA) predictor (Fig. 1a) to existing binding affinity (NetMHCpan 4.0 BA) and MS ligand (NetMHCpan 4.0 EL and MixMHCpred 2.0.2) predictors on the MULTIALLELIC benchmark.MHCflurry BA performed best at differentiating MS hits from decoys in terms of AUC, outperforming NetMHCpan BA on 74 of 76 samples, NetMHCpan EL on 56 of 76, and MixMHCpred on 18 of 20 samples in the MULTIALLELIC-RECENT subset (Fig. 1b).The increase in AUC relative to NetMHCpan BA was 5.7% (bootstrap 95% confidence interval 4.7-6.7),1.8% (1.3-2.5)relative to NetMHCpan EL, and 2.6% (1.8-3.4)relative to MixMHCpred, with the greatest improvements observed for non-9-mer peptides (Fig. 1c).By contrast, in terms of PPV, which emphasizes differences at the high end of a predictor's output, MHCflurry BA showed only an advantage in comparison to NetMHCpan BA, with a mean improvement in PPV of 50% (39 -65).Differences were within error in comparison to the MS ligand predictors, with a mean 3.7% (-2.0 -+10.3)improvement over NetMHCpan EL and a trend toward underperforming MixMHCpred, with a mean difference of -5.0% (-12.6 -+4.6;Fig. 1d).Similar results were observed when evaluating the predictors on the MONOALLELIC benchmark using a variant of MHCflurry that was trained without the MONOALLELIC samples (Fig. S1).
We hypothesized that explicitly modeling processes that do not depend on MHC I allele might enable accuracy improvements over MHCflurry BA alone.We therefore developed an MHC I allele-independent model trained to distinguish hits from decoys where both the hits and decoy peptides are predicted to be tight binders (rank less than 2%) by the MHCflurry BA predictor.We refer to this model as the MHCflurry 1.6.0antigen processing (AP) predictor.Its neural network architecture is motivated by the possibility of learning peptide N-and C-terminal cleavage or processing signals (Fig. 2a).We trained two versions of the AP predictor on the MONOALLELIC benchmark dataset.One predictor includes the peptide plus the 15 immediately upstream and downstream residues from its source protein (AP with flanks); the second predictor includes only the peptide (AP without flanks).
To understand if the MHCflurry AP predictors learned a meaningful signal, we evaluated their accuracy on the MULTIALLELIC benchmark (Fig. 2b).While the AP variants underperformed the standard MHC binding predictors and MHCflurry BA, they performed better than might be expected given that they do not take MHC allele as an input.Both the AP with flanks and AP without flanks predictors showed a mean 0.85 AUC (bootstrap 95% CI 0.83 -0.87), compared to 0.91 (0.90 -0.92) for the MHCflurry BA predictor (Fig. 2b).This suggested that the MHCflurry AP predictors had learned a meaningful MHC I allele-independent signal from the MONOALLELIC MS training set.
To understand what the MHCflurry AP predictors may be learning, we ranked all 9-mer peptides in the MULTIALLELIC benchmark by AP prediction, calculated position weight matrices for the top 1% highest predictions, and plotted a sequence logo (Fig. 2c and Additional File 7).This analysis showed the AP predictor learned that hits are depleted for cysteines across the peptide, a known bias of MS 36 .It also showed depletion of prolines from the first position in the peptide (P1) extending upstream into the N-flank, as well as strong but complex specificity at the C-terminus of the peptide and some preferences at the downstream flanking residue.
These signals suggested qualitative agreement with established signatures for TAP transport, proteasomal cleavage, and ERAP trimming (see Discussion).
As cysteine depletion is one of the strongest known MS biases, we were concerned that much of the accuracy of the AP predictor may be due its modeling of this bias.We therefore repeated the AUC analysis on the MULTIALLELIC benchmark after removing peptides (hits and decoys) that contained cysteine.This analysis showed very similar AUC values as the earlier analysis, with a change in AUC of less than 3% for all samples (Fig. 2d).Thus, while the AP predictor learns the cysteine MS bias, this effect alone is not the primary driver of its performance.
To evaluate the extent to which the AP predictors learned a signal that is also learned by the MHCflurry BA predictor, we measured the correlation between AP and BA predictions for random peptides across the HLA-A , HLA-B , and HLA-C alleles in the BA training set (n=181).The correlations were positive, significant, but modest in magnitude (Pearson r < 0.5) for all alleles tested.For example, AP and BA predictions for the HLA-A*02:01 allele showed a Pearson correlation of 0.23.While peptides predicted to bind HLA-A*02:01 tightly tended to have higher AP scores, it was possible to find peptides with high scores for one predictor but not the other (Fig. 2e).Correlations with the AP predictor were somewhat higher for HLA-B and HLA-C than for HLA-A alleles (Fig. 2f).The alleles used to train the AP predictor showed no greater AP vs. BA correlations on average than those that were not included in the AP training set (Fig. 2g).Overall, this analysis suggested that the AP predictor is at least partially non-redundant with the MHCflurry BA predictor.
We next asked if the AP predictor may be combined with the MHCflurry BA predictor to achieve higher performance than either alone.We trained a logistic regression model that takes two inputs: the strongest MHCflurry BA prediction across alleles for the sample (transformed to fall in the range 0.0 -1.0, with higher indicating a stronger binder), and the AP prediction (Fig. 3a).We refer to this logistic regression model as the MHCflurry presentation score (PS) predictor.We trained MHCflurry PS predictors using either the AP with flanking (PS with flanking) or AP without flanking (PS without flanking) predictors on the MULTIALLELIC-OLD dataset and evaluated performance on the MULTIALLELIC-RECENT benchmark.

Discussion
Our MHC I ligand prediction method is comprised of two separate neural network models: the MHCflurry binding affinity (BA) predictor and the MHC I allele-agnostic MHCflurry antigen processing (AP) predictor.The AP predictor is trained to learn what the BA predictor missed, i.e. residual sequence properties that distinguish hits from decoys among peptides predicted to bind MHC I tightly.Both predictors are trained on monoallelic MS datasets (plus affinity measurements for the BA predictor), and their results combined using a logistic regression model fit to multiallelic MS datasets.When evaluated on held-out multiallelic MS experiments, the combined predictor, referred to as MHCflurry presentation score (PS), outperformed the individual components and standard tools.While inclusion of flanking sequences, i.e. the adjacent residues in the peptide's source protein, provided a small additional accuracy boost, the overall improvement over standard tools was also evident when only the peptide was provided to the AP predictor.
Historically, a number of methods for prediction of MHC I antigen presentation have been proposed that integrate MHC I binding prediction with models fit to small datasets for individual processing steps (e.g. proteasomal cleavage products or peptides translocated by TAP) [37][38][39] .However, these methods tended to give only modest improvement in accuracy relative to state-of-the-art MHC I binding predictors, and currently most epitope discovery efforts focus on MHC I binding prediction alone as the key computational step.Learning antigen processing from naturally presented MHC I ligands, as we do here, is an attractive alternative to modeling individual processing components because far more MHC I ligand data is currently available than, for example, information on TAP efficiency.Additionally, the relative contributions of each antigen processing step (and potentially unknown effects) are automatically accounted-for in MS ligand datasets.These advantages may be responsible for the increase in accuracy we observed when we combined the AP and BA predictors.
The AP predictor binding motif (Fig. 2c) shows similarities with the established preferences of key antigen processing steps.Deconvolution of effects, however, is complicated by the overlapping specificity of these steps, potentially a consequence of coevolution of the antigen processing machinery 7 .In particular, the AP predictor's preferences at the C terminus of the peptide may reflect TAP binding and/or proteasomal cleavage.
Work by Tampé and colleagues 40 found that TAP favors peptides with a C terminal Phe, Tyr, Arg, or Leu and disfavors Asp, Glu, Asn and Ser, all recapitulated by the AP predictor.The AP motif is also consistent with the effects of the chymotryptic-like activity of the proteasome (cleavage after Phe, Tyr, Leu, Trp but not Gly) as well as tryptic-like (cleavage after Arg and Lys), but not caspase-like specificity 41,42 .Some agreement with proteasomal processing is also apparent at the interior residues of the peptide, such as a depletion of proline at P8 and enrichment for proline at P6 (referred to as P2 and P4, respectively, in cleavage studies).At the first flanking residue downstream of the peptide, which is free from the effects of TAP and MHC binding, the enrichments for Arg, Ala, Lys, Ser, and Gly are consistent with proteasomal cleavage preferences for the position after the cut site 41,43 , although the enrichment for Arg and Lys could also indicate tryptic-like specificity working from the C-terminus of the protein 36 .The striking depletion of prolines in the N-flanking residues up to and including the first position of the peptide (P1) and the enrichment for proline at P2 is consistent with trimming by ERAP 44 , although again we cannot exclude a contribution from proteasomal cleavage.Overall, there is a reasonable indication that the AP predictor has learned some antigen processing signals, although a quantitative assessment is ongoing work.
Our approach requires that the BA predictor does not fully model the antigen processing signals available in its MS training datasets.If it did, there would be no residual effects for the AP predictor to learn and no improvement in accuracy from combining the BA and AP predictors.While the BA predictor likely includes a contribution from antigen processing, several technical choices may have blunted its ability to learn these effects.The BA predictor's training set includes 69% mass spec-identified ligands and 31% peptide-MHC I affinity measurements.The predictor therefore models a compromise between antigen processing-sensitive (MS ligand) and -insensitive (affinity measurement) training data.The influence of MS data is expected to be further diluted at the extreme strong-binder segment of the BA predictor's output because MS hits are assigned an ic 50 of "< 50 nM" in the training set, i.e. any assigned affinity tighter than 50 nM results in zero contribution to the training loss.This means that MS training data does not guide the relative ranking (exact ic 50 ) of strong binders, potentially the regime where antigen processing signals may make the greatest difference.This effect likely also plays a role in the relatively lackluster performance of the BA predictor when assessed by the PPV metric, which emphasizes the rank-order of peptides at the extreme high-end of the predictor's output, despite good performance on AUC.Finally, an important difference between the BA and AP predictors is that the AP predictor uses peptides from the proteome as decoys (unobserved non-binders), whereas the BA predictor uses random sequences sampled according to the same amino acid distribution as the hits.The AP predictor strategy is expected to be more realistic and informative, at the cost of giving the AP predictor more opportunity to learn MS biases.
An important limitation of this work is that we are using datasets of MHC I ligands detected by MS both to train and to benchmark our predictors.Assay biases, which we expect are modeled by the AP predictor, have the potential to erroneously inflate our accuracy scores.While the main known bias, depletion of cysteine, does not seem to have a dramatic effect on AP predictive accuracy, we cannot rule out the contributions of other kinds of bias.Future work will need to assess whether the predictors described here enable improved prediction of T cell epitopes.Table S2.Summary of sample groups used for each benchmark experiment.The MULTIALLELIC set is the union of the MULTIALLELIC-OLD and MULTIALLELIC-RECENT samples.

Experiment Dataset Figures
Benchmark of NetMHCpan Table S3.Summary of training datasets for the predictors developed in this work.

Predictor Train data type Training dataset Figures
MHCflurry 1.
Allele representation.MHC I alleles are represented to the neural network by the amino acids at 37 positions from a global multiple sequence alignment.This representation is referred to as a "pseudosequence" by the NetMHCpan authors.We use the 34 peptide-contacting positions included in the NetMHCpan 4.0 pseudosequence (re-derived using a new alignment), plus 3 additional positions.The additional positions were selected to differentiate several pairs of alleles (A*23:01/A*24:13, A*29:01/A*29:02, B*44:02/B*44:27, C*03:03/C*03:04) that shared identical 34-mer NetMHCpan pseudosequences.Similar to the peptide encoding, each amino acid was transformed to a 21-dimensional vector using the BLOSUM62 substitution matrix, where a placeholder X character represents global alignment positions with no residue (i.e. a deletion) for a particular MHC allele.
resulting models are expected to capture a range of effects.Since the efficiency of cleavage by proteasomes and peptidases are affected by the residues both before and after the cut site, we experimented with the inclusion of the flanking sequence on either side of the peptide from its source protein.To understand the importance of flanking sequences, we trained two variants of the MHCflurry 1.6.0AP predictor: one takes as input a peptide plus the 15 amino acids on either side in its source protein (AP with flanks) and one takes only a peptide (AP without flanks).Both models are independent of MHC allele.Training data.The AP predictor is trained on the MS hits from the MONOALLELIC benchmark samples (Additional File 5).Hits of length 8-11 were matched 1:100 to decoys of the same length and from the same proteins.The resulting peptides for each sample were sorted by MHCflurry 1.6.0BA prediction for the relevant allele, and the top 2% strongest binding peptides (hits and decoys) were selected for inclusion in the AP predictor's training set.This resulted in a training set of 297,313 distinct peptides (399,392 entries total including duplicate peptides selected from different MS samples), of which 106,738 (36%) were hits.The median BA prediction for the peptides in AP training set was 44 nM for decoys and 23 nM for hits.Neural network architectures.AP models were trained using 128 neural network architectures with a similar overall design but varying in layer sizes, activation function, level of L1 weight regularization, and dropout rate.The input to each network consists of peptides and flanking sequences (if used) as they occur in the source protein (N-flank -peptide -C-flank), with each amino acid encoded as a 21-dimensional vector using the BLOSUM62 substitution matrix extended to include an X symbol, as in the BA predictor.The sequence is right-padded with X characters to generate fixed-length inputs of length 45 (AP with flanks) or 15 (AP without flanks).The first layer is convolutional, with a kernel size of 11-17 and 256 or 512 filters, tanh or relu activation, and dropout.This transforms the input sequence into a new sequence of the same length but with up to 512 channels instead of 21.From this representation, two parallel convolutional layers with a kernel size of 1 (i.e. each position is considered independently) are applied to predict the favorability of an "N-terminal cut" and a "C-terminal cut" at each position in the sequence.The cut site predictors are implemented using two stacked convolutional layers with a kernel size of 1, and thus can be thought of as 2-layer dense networks that consider a single position in the learned representation of the input sequence.For example, in one model architecture the "C-terminal cut" predictor takes a 512-vector corresponding to a position in the input sequence and applies a dense layer with 8 outputs followed by a single-output dense layer.The final layer in the AP architecture is a dense layer that integrates these cut-site predictor results.It takes as input: (1) the N-terminal cut site prediction at the peptide N-terminus, (2) the max of the N-terminal cut site predictions across the rest of the peptide, (3) the C-terminal cut site prediction at the peptide C-terminus, and (4) the max of the C-terminal cut site predictions across the rest of the peptide.This design is motivated by the intuition that presented MHC I ligands must have favorable cleavability at their termini but avoid cleavage at interior residues.To give the model further opportunity to consider average properties of the flanking sequences (e.g.secondary structure), two additional inputs are given to the final layer: (5) the result from a dense layer applied to the per-channel average of the initial convolutional layer across the N-flank, (6) a similar result for the C terminus.The final layer in the AP predictor is thus intended to model the tradeoff between cleavability at the peptide termini, cleavability at interior positions in the peptide, and overall favorability of cleavage given the flanking sequences.

Figure 2 .
Figure 2. The MHCflurry 1.6.0antigen processing (AP) predictor models MHC I allele-independent effects.(a) AP predictor training scheme and neural network architecture.(b) Mean AUC and PPV accuracy on the multiallelic mass spec benchmarks for the AP predictor in comparison to other predictors.The MixMHCpred 2.0.2 tool was benchmarked on the MULTIALLELIC-RECENT subset; the other predictors were tested on the full MULTIALLELIC benchmark.(c) Sequence logo for the motif learned by the AP predictor.Positive values (above the center line) indicate enrichments above proteome level; negative values indicate depletions.(d) Comparison of AP predictor AUC scores on the MULTIALLELIC benchmark samples when peptides containing cysteine were removed (y-axis) vs. retained (x-axis).The inset shows the percentage change in AUC when cysteine-containing peptides are removed.(e) Correlation of the AP predictor with the BA prediction for HLA-A*02:01.Each red dot corresponds to a 9-mer peptide sampled from the proteome.The black line indicates the best-fit regression line, and example peptides are indicated.(f, g) Correlation of AP predictor with the BA prediction across HLA alleles by gene (f) or by allele representation in the AP training set (g) .

Figure 3 .
Figure 3.The MHCflurry 1.6.0presentation score (PS) model combines binding affinity and antigen processing prediction.(a) The PS model is a two-input logistic regression model that integrates BA and AP predictions to give a composite score.It is trained on multiallelic mass spec hits and decoys.(b) Comparison of PPV scores of the PS models with other predictors.The top row shows the performance of the PS model variant that uses the AP without flanks predictor, which considers only the peptide and not the upstream and downstream flanking sequences.The bottom row corresponds to the variant that also considers the flanking sequences.(c) Mean percent change in AUC and PPV for the indicated predictors (y-axis) relative to each of the three existing predictors (columns).Error bars indicate 95% confidence intervals for the mean change.