MHCflurry: open-source class I MHC binding affinity prediction

Machine learning prediction of the interaction between major histocompatibility complex I (MHC I) proteins and their small peptide ligands is important for vaccine design and other applications in adaptive immunity. We describe and benchmark a new open-source MHC I binding prediction package, MHCflurry. The software is a collection of allele-specific binding predictors incorporating a novel neural network architecture and adhering to software development best practices. MHCflurry outperformed the standard predictors NetMHC 4.0 and NetMHCpan 3.0 on a benchmark of mass spec-identified MHC ligands and showed competitive accuracy on a benchmark of affinity measurements. The accuracy improvement was due to substantially better prediction of non-9-mer peptide ligands, which offset a narrowly lower accuracy on 9-mers. MHCflurry was on average 8.6X faster than NetMHC and 44X faster than NetMHCpan; performance is further increased when a graphics processing unit (GPU) is available. MHCflurry is freely available to use, retrain, or extend, includes Python library and command line interfaces, and may be installed using standard package managers.

The widely used NetMHC and NetMHCpan tools are ensembles of shallow neural networks trained on binding affinity measurements deposited in the Immune Epitope Database (IEDB) [8]. NetMHC uses an "allele-specific" approach, in which separate predictors are trained for each MHC allele; the input to the neural network is the peptide of interest. NetMHCpan uses a "pan-allele" approach, in which a single neural network takes as input both the peptide and a representation of the MHC allele. The impressive accuracy of these tools has resulted in wide adoption, despite certain limitations: they are closed source, may be trained (fit) only by their developers, and do not make available their training data.
Here we describe and benchmark a new package of allele-specific class I MHC binding predictors, MHCflurry version 0.9.1. MHCflurry predictors show competitive accuracy with the NetMHC tools and a significant speed improvement while addressing a number of limitations of the NetMHC software. In particular, MHCflurry is open source, retrainable, precisely documents the data and workflow used to train the released models on measurements in IEDB, exposes both a Python API and a command line interface, is installed using standard Python package management tools, and applies software development best practices such as unit tests, continuous integration, and code documentation.
MHCflurry is freely available under the Apache License 2.0. It may be installed from the Python Package Index. Source code is maintained at https://github.com/hammerlab/mhcflurry. All data and scripts used to train the models are available in this repository.

Implementation
MHCflurry is implemented in Python (versions 2.7 and 3.4+ are supported) using the Keras neural network library ( https://github.com/fchollet/keras ). Model training and prediction use a graphics processing unit (GPU) when available.
Similar to NetMHC, MHCflurry is an ensemble of MHC I allele-specific predictors. Separate models are trained for each allele. The input to each model is a peptide. No representation of the MHC allele, such as its amino acid sequence, is used. The models are trained independently and no information is shared between alleles. For each allele, MHCflurry includes an ensemble of eight models. The final nanomolar affinity prediction is taken to be the geometric mean of the individual model outputs. The variance of the individual model predictions gives an indication of the uncertainty of the prediction and is also made available to users.
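As a sketch of this combination step (illustrative only; the function name is not part of the MHCflurry API), the ensemble point estimate and an uncertainty proxy can be computed as:

```python
import numpy as np

def combine_ensemble(predictions_nm):
    """Combine per-model nanomolar affinity predictions: the point
    estimate is the geometric mean of the individual outputs, and the
    spread of the log-predictions indicates prediction uncertainty."""
    log_preds = np.log(np.asarray(predictions_nm, dtype=float))
    point = float(np.exp(log_preds.mean()))  # geometric mean
    spread = float(log_preds.std())          # uncertainty proxy
    return point, spread

# e.g. eight models predicting affinities (nM) for one peptide
point, spread = combine_ensemble([120, 95, 150, 110, 130, 100, 140, 105])
```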
We arrived at the MHCflurry input representation, architecture, and training approach through an informal, iterative process using data held out from the training dataset. The final predictors were trained on the full training dataset and validated on the two benchmarks presented here.

Peptide representation
MHCflurry introduces a novel peptide representation, in which variable-length peptides of length 8-15 are encoded as fixed-length vectors. Unlike the averaging scheme implemented in NetMHC and NetMHCpan, in which non-9mer peptides are encoded as 9-mers by adding or removing amino acids, this approach makes the full peptide available to the network. It also avoids the need for more complex neural network architectures, such as recurrent networks, that would be required to explicitly support variable length model inputs.
The motivation for the encoding is to preserve the positionality of the residues that make the most important stabilizing contacts with the MHC molecule. These are known as anchor positions, and generally occur toward the beginning or end of the peptide for most alleles. For example, the anchor positions of HLA-A*02:01 and many other alleles are at the second and last positions in the peptide, i.e. positions 2 and 9 for a 9-mer peptide. While overhangs, in which longer peptides protrude from the end of the binding groove, have been reported [9][10][11][12] , it is thought that the most common conformation adopted by longer peptides is a bulged or zigzag conformation of the middle residues [13][14][15] . In this case the positions of the anchor residues remain intact relative to the nearest end of the peptide.
The peptide encoding is performed as follows (Figure 1a). Each peptide of length 8-15 is transformed to a length 15 string, in which missing residues are filled with an X character, which is treated as a 21st amino acid. The first four and last four residues in the peptide always map to the first four and last four positions in the representation. The middle seven residues are filled as needed depending on the length of the peptide: an 8-mer leaves all middle positions as an X whereas a 15-mer fills all positions. In this way, the positions most likely to contain anchor residues are consistently mapped to the same positions in the representation relative to the end of the peptide.
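A minimal sketch of this encoding follows. The function name is illustrative, and the left-aligned placement of the middle residues before the padding is an assumption; the text specifies only that the first four and last four residues keep fixed positions.

```python
def encode_peptide(peptide, max_length=15):
    """Map a length 8-15 peptide onto a fixed length-15 string,
    filling unused middle positions with the special character 'X'.
    The first four and last four residues always occupy the first
    four and last four slots; the remaining residues fill the middle
    slots (left-aligned here, which is an assumption)."""
    n = len(peptide)
    if not 8 <= n <= max_length:
        raise ValueError("peptide must be length 8-15")
    middle = peptide[4:n - 4]               # 0 to 7 residues
    padding = "X" * (max_length - n)
    return peptide[:4] + middle + padding + peptide[-4:]
```

For example, an 8-mer maps all seven middle slots to X, while a 15-mer fills every slot.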

Output encoding
As in NetMHC and NetMHCpan, MHCflurry transforms binding affinities to values between 0.0 and 1.0, where 0.0 is a non-binder and 1.0 is a strong binder. The neural networks are trained on the transformed values, and the inverse transform is used to return prediction results as nanomolar affinities. The transform is given by 1 − log₅₀₀₀₀(x), where x is the nanomolar affinity. Affinities are capped at 50,000 nM.
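The transform and its inverse can be written as follows (a sketch; the 1 nM lower clamp is an assumption added here to keep the output within [0, 1]):

```python
import math

MAX_IC50 = 50000.0  # nM cap, as stated in the text

def affinity_to_target(ic50_nm):
    """Map a nanomolar affinity to [0, 1]: y = 1 - log(x) / log(50000).
    Tight binders approach 1.0; 50,000 nM maps to 0.0."""
    x = min(max(ic50_nm, 1.0), MAX_IC50)  # 1 nM clamp is an assumption
    return 1.0 - math.log(x) / math.log(MAX_IC50)

def target_to_affinity(y):
    """Inverse transform: recover the nM affinity from a network output."""
    return MAX_IC50 ** (1.0 - y)
```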

Neural network architecture
The MHCflurry predictors are feedforward neural networks composed of the following layers: the peptide representation (described previously) encoded as a 1-hot (binary) vector, two locally connected layers, a fully connected layer, and a single sigmoidal output neuron. In preliminary investigations (data not shown), we observed that models using more than one fully connected layer performed poorly and that using an amino acid embedding layer did not significantly outperform a 1-hot encoded peptide. We additionally tested two recent ideas from the deep learning literature, dropout [16] and batch normalization [17], which also did not significantly improve performance. While the downloadable MHCflurry models do not use these features, the MHCflurry software includes support for them to enable experimentation.
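To make the layer sequence concrete, here is a toy NumPy forward pass with the same structure. The activations (tanh, ReLU, sigmoid) follow the architecture description, but the window sizes, filter counts, and layer widths are illustrative, not the released MHCflurry hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def locally_connected(x, weights, biases):
    """A locally connected layer: like a 1-D convolution, but each
    output position has its own, unshared weights over its window."""
    n_out, window_times_in, n_filters = weights.shape
    size = window_times_in // x.shape[1]
    out = np.empty((n_out, n_filters))
    for i in range(n_out):
        out[i] = x[i:i + size].reshape(-1) @ weights[i] + biases[i]
    return np.tanh(out)  # tanh activation

# Randomly initialized weights for a (15, 21) one-hot peptide input.
W1 = rng.normal(0, 0.1, (13, 3 * 21, 8)); b1 = np.zeros((13, 8))
W2 = rng.normal(0, 0.1, (11, 3 * 8, 8));  b2 = np.zeros((11, 8))
W3 = rng.normal(0, 0.1, (11 * 8, 32));    b3 = np.zeros(32)
w4 = rng.normal(0, 0.1, 32);              b4 = 0.0

def forward(onehot):
    """Toy forward pass mirroring the described layer sequence."""
    x = locally_connected(onehot, W1, b1)            # (13, 8)
    x = locally_connected(x, W2, b2)                 # (11, 8)
    h = np.maximum(0.0, x.reshape(-1) @ W3 + b3)     # fully connected, ReLU
    return 1.0 / (1.0 + np.exp(-(h @ w4 + b4)))      # sigmoidal output

# A random length-15 encoded peptide as a (15, 21) one-hot matrix.
peptide_onehot = np.eye(21)[rng.integers(0, 21, size=15)]
prediction = forward(peptide_onehot)  # a value in (0, 1)
```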

Construction of the training dataset
The training dataset was assembled from a snapshot of IEDB MHC ligand data downloaded on May 17, 2017, augmented with the BD2013 dataset published in ref. [18].
IEDB entries with non-class I, non-specific, mutant, or unparseable allele names were dropped, as were those with peptides containing post-translational modifications or noncanonical amino acids. To avoid the potential for bias in favor of MHCflurry on the mass-spec validation dataset, entries deriving from mass-spec studies were removed from the training data. This yielded an IEDB dataset of 147,716 quantitative and 43,704 qualitative affinity measurements. We assigned nanomolar affinities to the qualitative measurements as follows: positive-high, 50; positive-intermediate, 500; positive-low, 5000; positive, 100; negative, 50000.
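The qualitative-to-nanomolar assignment is just a lookup table, reproduced here for reference (values as listed above):

```python
# Nanomolar affinities assigned to IEDB qualitative measurements.
QUALITATIVE_TO_NM = {
    "positive-high": 50.0,
    "positive-intermediate": 500.0,
    "positive-low": 5000.0,
    "positive": 100.0,
    "negative": 50000.0,
}
```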
Of 179,692 measurements in the BD2013 dataset published in ref. [18] , 55,473 were not also present in the IEDB dataset. After selecting peptides of length 8-15 and dropping alleles with fewer than 200 measurements, the combined training dataset consists of 235,597 measurements across 101 alleles (Table S1).

Neural network training
The MHCflurry models are trained (fit) using a procedure similar to NetMHC. Eight models are trained for each allele. Each model is trained on a random 80% sample of the data for the allele; the remaining 20% is used as a test set for early stopping. Training proceeds using the RMSprop optimizer [19] until the accuracy on the test set has not improved for ten epochs. Mean squared error is used as both the training loss and test set accuracy metric. At each epoch, 25 synthetic negative peptides for each length 8-15 are randomly generated. These random negative peptides are sampled so as to have the same amino acid distribution as the training peptides and are assigned uniformly random affinities between 20,000 nM and 50,000 nM. No model selection is performed.
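The synthetic-negative scheme can be sketched as follows. This is illustrative, not the MHCflurry training code; the sampling helper and its name are assumptions.

```python
import random
from collections import Counter

def sample_negative_peptides(training_peptides, per_length=25, seed=0):
    """Generate random negative peptides of each length 8-15 whose amino
    acid distribution matches the training peptides, each assigned a
    uniformly random affinity between 20,000 and 50,000 nM."""
    rng = random.Random(seed)
    counts = Counter("".join(training_peptides))
    letters = sorted(counts)
    weights = [counts[a] for a in letters]
    negatives = []
    for length in range(8, 16):
        for _ in range(per_length):
            peptide = "".join(rng.choices(letters, weights=weights, k=length))
            affinity = rng.uniform(20000.0, 50000.0)
            negatives.append((peptide, affinity))
    return negatives

negatives = sample_negative_peptides(["SIINFEKL", "GILGFVFTL"])
```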
The models within an ensemble thus differ from each other due to several sources of randomness. The most important factor is that each model is trained on a different 80% sample of the data. Lesser factors include random initial model weights, the nondeterminacy of stochastic gradient descent, and the random negative peptides.
Training the models described here took 311 minutes on a machine with eight 2.30 GHz Intel Xeon CPUs, one NVIDIA Tesla K80 GPU, and 52 GB memory.

Benchmark approach
As the NetMHC tools are fit to binding data in IEDB, new datasets not included in IEDB are required to assess the performance of these tools. We use two datasets for this purpose: (1) a published dataset of peptides eluted from cell-surface MHC and identified by mass spec [20], which we refer to as the ABELIN dataset, and (2) a new dataset of affinity measurements for peptides derived from the HPV16 E6 and E7 proteins, which we refer to as the HPV benchmark. For the ABELIN benchmark, the mass spec-identified peptides (hits) for each allele were combined with randomly selected decoys. We assessed the accuracy of each predictor at differentiating hits from decoys in terms of positive predictive value (PPV). To compute PPV for an allele with n hits, we ranked the n hits and 100n decoys from tightest to weakest predicted binding affinity and calculated the fraction of the top n peptides that were hits.
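The PPV computation can be sketched directly (assuming a lower predicted nM affinity means tighter binding):

```python
def positive_predictive_value(hit_scores, decoy_scores):
    """Rank the n hits and their decoys from tightest (lowest) to
    weakest predicted nM affinity, then report the fraction of the
    top n peptides that are hits."""
    n = len(hit_scores)
    ranked = sorted(
        [(s, True) for s in hit_scores] + [(s, False) for s in decoy_scores]
    )
    return sum(is_hit for _, is_hit in ranked[:n]) / n
```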
In addition to the standard MHCflurry, NetMHC, and NetMHCpan predictors, we considered seven variations on the MHCflurry architecture in the ABELIN benchmark. Each variant changed one or two aspects of the architecture or training data and was otherwise identical to MHCflurry 0.9.1 (Table 1). For each architecture, we evaluated both a single predictor and an ensemble of eight models.

The HPV benchmark dataset consists of 194 affinity measurements across seven alleles.

Peptides derived from the E6 and E7 proteins of HPV16 were assayed using a cell-based competitive binding assay [22,23] (Supplemental Methods). We assessed accuracy on this benchmark using three well-known metrics: area under the receiver operating characteristic curve (AUC), F1 score, and the Kendall rank correlation coefficient (Kendall's tau). The AUC score estimates the probability that a strong-binding peptide (measured affinity 500 nM or less) will have a stronger predicted affinity than a weak- or non-binding peptide (measured affinity greater than 500 nM). The F1 score summarizes a predictor's precision and recall at classifying peptides as having affinity less or greater than 500 nM. The Kendall tau score measures the correlation in rank when peptides are sorted by measured or predicted affinity; it uses no cutoff and assesses agreement across the affinity spectrum.
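The AUC definition above has a direct pairwise implementation (illustrative; a tied prediction counts as one half, per the standard convention):

```python
def auc_from_affinities(measured_nm, predicted_nm, cutoff=500.0):
    """Estimate the probability that a strong binder (measured affinity
    <= cutoff) receives a tighter (lower) predicted affinity than a
    weak/non-binder, by counting wins over all strong/weak pairs."""
    strong = [p for m, p in zip(measured_nm, predicted_nm) if m <= cutoff]
    weak = [p for m, p in zip(measured_nm, predicted_nm) if m > cutoff]
    wins = sum(
        1.0 if s < w else 0.5 if s == w else 0.0
        for s in strong for w in weak
    )
    return wins / (len(strong) * len(weak))
```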

Results
The accuracy advantage of MHCflurry over the NetMHC tools was due to better performance on non-9-mer peptides, which offset slightly lower accuracy on 9-mers (Figure 2b). While there is likely room for improvement, the MHCflurry 0.9.1 architecture and training strategy appears to be a reasonable compromise between median and minimum performance across alleles. Ensembles showed a consistent improvement in accuracy over single models (Figure 3).
MHCflurry was substantially faster than the other predictors (Figure 2d).

On the ABELIN mass-spec benchmark, MHCflurry outperformed the NetMHC tools overall and in particular on non-9-mer peptides, suggesting that the peptide representation introduced here is a useful approach for training feedforward neural networks on variable-length MHC I peptide ligands. The non-9-mer accuracy improvement is interesting given that the IEDB training dataset is highly biased toward 9-mer peptides and contains very few peptides of length greater than 11 (1.4% of the total). The case of 12-mer peptides is representative. In the ABELIN benchmark, the allele with the most 12-mer peptides is A*68:02, with 163 mass spec-identified 12-mers out of 1,986 total detected peptides. On this allele, the PPV scores were 0.20, 0.11, and 0.13 for MHCflurry, NetMHC, and NetMHCpan, respectively, suggesting that, while no predictor performs well in this context, MHCflurry learned something the other predictors did not.
In the IEDB training data, there were only nine 12-mer peptides with affinity measurements for this allele, and all except one (TLVGLAIGLVLL with 542 nM affinity) had non-binder affinities.
This suggests that the MHCflurry models generalized to 12-mer prediction for this allele by learning from peptides of other lengths.
In contrast, MHCflurry narrowly underperformed the NetMHC tools on predictions for 9-mers, with a median 1.7% lower PPV than NetMHCpan across alleles. Addressing this deficit is important future work.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and material
The MHCflurry source code and training workflow, data, and models are available at https://github.com/hammerlab/mhcflurry.

Competing interests
The authors declare that they have no competing interests.

Funding
This research was supported by the Parker Institute for Cancer Immunotherapy.

Figure 1 caption (excerpt): The length-15 encoded peptide is supplied as a 1-hot vector, with entries for the 20 amino acids plus the X special character. Two locally connected layers are applied with hyperbolic tangent (tanh) activations, followed by a fully connected layer with rectified linear (ReLU) activation and an output neuron with sigmoidal activation.

Supplemental methods
Competitive binding assay (HPV benchmark)

The binding affinity of HPV16 E6 and E7 derived peptides to selected HLA class I molecules was tested in competition-based binding assays as described in [22,23]. Briefly, the binding affinity of each test peptide was determined by non-linear regression analysis as the concentration that inhibits 50% of the binding of a fluorescein-labeled reference peptide (IC50).