The Human Immunopeptidome Project: A Roadmap to Predict and Treat Immune Diseases

Immunopeptidomics is a rapidly evolving field that is catalyzed by neoantigen discovery and cancer immunotherapy. Emulating the path of other omic disciplines, Vizcaíno et al. propose that technological advances in the mass spectrometry field will lead to large immunopeptidomic cohort studies and, ultimately, to the immunopeptidome-wide association study (IWAS) paradigm, which will provide profound insights into disease susceptibility and resistance.

Highlights
Immunopeptidomics has the potential to link diseases to antigen representation.
We suggest achieving this by analyzing the immunopeptidomes of cohorts of patients.
Current mass spectrometry-based techniques to analyze immunopeptidomes are described.
We term the proposed approach "Immunopeptidome-wide association studies" (IWAS).

The science that investigates the ensembles of all peptides associated with human leukocyte antigen (HLA) molecules is termed "immunopeptidomics" and is typically driven by mass spectrometry (MS) technologies. Recent advances in MS technologies, neoantigen discovery and cancer immunotherapy have catalyzed the launch of the Human Immunopeptidome Project (HIPP) with the goal of providing a complete map of the human immunopeptidome and making the technology so robust that it will be available in every clinic. Here, we provide a long-term perspective of the field and use this framework to explore how we think the completion of the HIPP will impact society in the future. In this context, we introduce the concept of immunopeptidome-wide association studies (IWAS). We highlight the importance of large cohort studies for the future and how applying quantitative immunopeptidomics at population scale may provide a new look at individual predisposition to common immune diseases as well as responsiveness to vaccines and immunotherapies. Through this vision, we aim to provide a fresh view of the field to stimulate new discussions within the community, and present what we see as the key challenges for unlocking the full potential of immunopeptidomics in this era of precision medicine.

The Human Genome Project was a major milestone in the life sciences (1,2). First-completed twenty years ago, this mind-shifting project has changed and will continue to change the way we practice medicine. Historically, physicians have largely focused on treating disease already in progress, but modern medicine is now progressively shifting from disease treatment to disease prevention based on an individual's risk (3). The past two decades have seen an enormous success of wide-scale studies in identifying genetic variants that predict an individual's predisposition to common diseases (4). 
In fact, robust, rapid and inexpensive identification of functional genetic variants in individuals is now enabling predictive, preventive and personalized medicine approaches (5)(6)(7). Since the first report of single-nucleotide polymorphisms (SNPs) analyzed for association with myocardial infarction by genome-wide association studies (GWAS) in 2002 (8), the GWAS Catalogue resource has now grown to contain tens of thousands of SNPs associated with hundreds of common diseases (9). In this post-GWAS era, the HLA has been established as the region of the genome that is associated with the greatest number of human diseases (10). In fact, population studies of diverse ancestries have identified hundreds of susceptibility loci within the HLA region that predispose individuals to immune diseases (11)(12)(13)(14)(15)(16)(17)(18). Most HLA disease associations that were reported over the last 50 years are related to the immune system, but the exact mechanisms driving the associations remain largely elusive.
The HLA is divided into two main subclasses: class I, which includes the classical HLA-A, HLA-B and HLA-C molecules, as well as the nonclassical HLA-E, HLA-F and HLA-G molecules; and class II, which includes HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DQA2, HLA-DQB1, HLA-DQB2, HLA-DRA, HLA-DRB1, HLA-DRB2, HLA-DRB3, HLA-DRB4, and HLA-DRB5 molecules (Fig. 1A) (19). To date, more than 23,000 different classical HLA alleles have been identified (https://www.ebi.ac.uk/ipd/imgt/hla/stats.html) and this number may keep climbing to nearly 8-9 million HLA variants (Fig. 1B) (20). The role of HLA molecules is to present a repertoire of peptides at the cell surface for T-cell recognition. The non-random amino acid composition of those peptides is restricted by length, generally between 8 and 12 (but up to 15) amino acids for class I and between 13 and 25 amino acids for class II, and by the presence of allele-specific binding motifs (21). Therefore, the nature of those peptides and the extreme diversity of HLA alleles at the population level greatly enhance the complexity of the peptide repertoires, which are collectively termed the human immunopeptidome (22,23).
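The two constraints described above (a length window and an allele-specific binding motif) can be illustrated with a minimal sketch. The length windows are those quoted in the text. The motif is a deliberately simplified, hypothetical anchor pattern (hydrophobic residue at position 2 and at the C terminus) chosen only for illustration, and the function name and input peptides are likewise assumptions rather than part of any published pipeline.

```python
import re

# Length windows quoted in the text: class I 8-12 (up to 15), class II 13-25.
LENGTH_RANGES = {"I": (8, 15), "II": (13, 25)}

# Hypothetical, highly simplified anchor motif for a single class I allele:
# L or M at position 2, and V, L or I at the C terminus. Real motifs are
# allele-specific and far more nuanced than a single regular expression.
TOY_MOTIF = re.compile(r"^.[LM].*[VLI]$")


def plausible_ligands(peptides, hla_class="I", motif=TOY_MOTIF):
    """Keep peptides that satisfy both the length window and the toy motif."""
    shortest, longest = LENGTH_RANGES[hla_class]
    return [
        pep for pep in peptides
        if shortest <= len(pep) <= longest and motif.match(pep)
    ]


# Illustrative input strings only.
candidates = [
    "GLCTLVAML",                   # 9-mer, L at P2, L at C terminus
    "SIINFEKL",                    # 8-mer, I at P2: fails the toy motif
    "NLVPMVATV",                   # 9-mer, L at P2, V at C terminus
    "ACDEFGHIKLMNPQRSTVWYACDEFG",  # 26-mer: outside the class I window
    "KLG",                         # too short
]
print(plausible_ligands(candidates))
```

In practice, motif-based filtering is done with trained binding predictors rather than fixed patterns; the sketch only shows how length and anchor constraints jointly shrink the space of candidate peptides.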
T cells scan the HLA immunopeptidome and seek "abnormal" peptides that originate from metabolic perturbations, pathogenic sources or neoplastic transformations (24-28). Cells presenting such peptides are then eradicated once engaged by T cells. The extreme diversity of the human immunopeptidome maximizes the probability that at least some individuals within the world population can mount a T-cell attack against an emerging infection and survive (29). This is exemplified by the HIV epidemic and by evidence that HLA-B57 individuals are more likely to be protected against HIV (30,31). However, such an evolutionary advantage comes at the price of disease susceptibility. In fact, individuals expressing specific HLA alleles are more susceptible to specific auto-inflammatory and autoimmune diseases. For instance, birdshot chorioretinopathy was shown to be associated with HLA-A29, ankylosing spondylitis with HLA-B27, Behçet's disease with HLA-B51 and psoriasis with HLA-C06. Combinations of endoplasmic reticulum aminopeptidase 1 (ERAP1) haplotypes, involved in trimming HLA-associated peptides, are also risk factors for these diseases in people who carry specific HLA alleles (e.g. HLA-B27*05) (32)(33)(34)(35). Moreover, an increasing number of GWAS results indicate that amino acid polymorphisms associated with immune diseases are likely to affect the binding affinities of peptides within the groove of HLA proteins, and thus the repertoire of peptides presented to T cells that are capable of triggering or perpetuating human diseases (10, 12, 19, 36-45). Interestingly, this evidence obtained from GWAS, meta-analyses and HLA region fine-mapping studies points toward a fundamental role of the immunopeptidome in driving human diseases. 
We highlight below four cases of such studies: (1) in allergic rhinitis, the most common clinical presentation of allergy affecting 400 million people worldwide, a recent meta-analysis including 59,762 cases and 152,358 controls of European ancestry, confirmed in a replication phase of 60,720 cases and 618,527 controls, indicated that the strongest associated amino acid variants in HLA-DQB1 and HLA-B were both located in the peptide-binding groove (36); (2) in lung cancer, analysis of HLA genetic variation among 26,044 lung cancer patients and 20,836 controls revealed amino acid variants in the HLA-B*08:01 peptide-binding groove (37); (3) in alopecia areata, one of the most prevalent autoimmune diseases, a meta-analysis including 3253 cases and 7543 controls highlighted HLA-DR as the primary risk factor of the disease with amino acid variants located in the peptide-binding groove (39); and (4) in common infections, a meta-analysis including over 200,000 individuals of European ancestry suggested important roles of specific amino acid polymorphisms in the peptide-binding groove for eight common infectious diseases, namely chickenpox, shingles, cold sores, tuberculosis, scarlet fever, tonsillectomy, pneumonia and plantar warts (38).

FIG. 1. Number of HLA class I and class II alleles from the IPD-IMGT/HLA Database. The IPD-IMGT/HLA is a database for sequences of the human major histocompatibility complex (MHC) and is part of the international ImMunoGeneTics project (IMGT). A, Number of named alleles for each HLA gene. B, History of the database growth, from the first version (1.0) in 1998 to the latest version (3.37) in 2019. The number includes other non-HLA alleles and confidential alleles. Data were extracted from the database on 2019/08/20. Updated numbers can be found online at https://www.ebi.ac.uk/ipd/imgt/hla/stats.html.

GWAS have also shown the association of Parkinson's disease with HLA-DRB5*01 and DRB1*15:01, which are present in ~15% of the general population (46). In a follow-up study, the authors showed that a defined set of peptides derived from alpha-synuclein, a protein aggregated in Parkinson's disease, act as antigenic epitopes displayed by these alleles and drive helper and cytotoxic T-cell responses in patients with Parkinson's disease (47). Thus, if further tested and validated for other diseases, cohort studies may ultimately suggest a universal role of the immunopeptidome in disease susceptibility.
In addition to qualitative changes in the peptide repertoire, the level of expression and stability of HLA molecules at the cell surface have been found to be associated with diseases (11,12,19). For instance, a high level of expression of HLA-C increases the risk of Crohn's disease but also promotes the control of HIV infection (13,14); instability of HLA-DQ2 and HLA-DQ8 could increase the risk of type 1 diabetes whereas stability of HLA-DQ6 may confer protection (15)(16)(17). In cancer, a recent analysis of more than 1500 melanoma patients treated by immune checkpoint inhibitors (ICIs) showed that HLA-B44 alleles were associated with an improved survival rate whereas HLA-B62 alleles were associated with a decreased effectiveness of ICIs (48,49). These data suggest that HLA-B44 immunopeptidomes may favor presentation of a broader, and perhaps more stable repertoire of tumor-specific antigens to trigger the function of cytotoxic T cells in response to ICIs. This is consistent with the recent observation that specific HLA genotypes are associated with the appearance of specific oncogenic mutations (50,51).
GWAS can identify genetic variants at the gene level and can be complemented with genome-wide transcript quantitative trait loci (eQTL) and genome-wide protein quantitative trait loci (pQTL) to better understand genotype-phenotype associations (52)(53)(54)(55). However, attempts to understand HLA disease associations by only measuring expression levels of intracellular genes and proteins are unlikely to work accurately for immune diseases in which HLA-peptide-T-cell interactions play a critical role. In contrast, measuring HLA-bound peptides is highly relevant because those peptides are directly "seen" by T cells and are involved in disease progression (56). Therefore, technologies capable of robust and comprehensive measurements of immunopeptidomes at population scale (from tens of thousands of samples) would be extremely powerful, as they would provide direct physical evidence about the identity and quantity of the peptides that are directly "seen" and engaged by T cells. Such technologies could provide key information to better understand associations between HLA alleles and human diseases, or between HLA alleles and responsiveness to treatment. Thinking forward in this context, we anticipate that further development of "immunopeptidomics" (57) technologies will ultimately lead to the immunopeptidome-wide association study (IWAS) paradigm and will provide a new layer of information about disease susceptibility. Emulating the path of transcriptome-, proteome- and metabolome-wide association studies (TWAS, PWAS and MWAS, respectively) (55, 58-61), the road toward IWAS is technically very challenging, but conceptually straightforward: quantify the immunopeptidome of large human population cohorts and statistically link immunopeptidomic variations to different clinical outcomes (Fig. 2A). 
IWAS coming from very large immunopeptidomic sample cohorts could be of great value to treat patients but may also provide highly precise information about the predisposition of an individual to get specific immune diseases, hence representing a new look at human disease risk factors.
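To make the "conceptually straightforward" part concrete, the following is a minimal sketch of one possible association scan, not a proposed IWAS method. It assumes a strong simplification: each peptide is reduced to detected or not detected per sample, each peptide is tested against case/control status with a two-sided Fisher exact test, and p-values are adjusted with the Benjamini-Hochberg procedure. All function names, peptide labels and data are hypothetical; a real analysis would use quantitative abundances and account for HLA type, tissue and other covariates.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def iwas_scan(detected, is_case):
    """detected: {peptide: set of sample ids}; is_case: {sample id: bool}."""
    cases = {s for s, flag in is_case.items() if flag}
    controls = set(is_case) - cases
    peptides = sorted(detected)
    pvals = []
    for pep in peptides:
        a = len(detected[pep] & cases)      # cases presenting the peptide
        c = len(detected[pep] & controls)   # controls presenting the peptide
        pvals.append(
            fisher_exact_two_sided(a, len(cases) - a, c, len(controls) - c)
        )
    return dict(zip(peptides, benjamini_hochberg(pvals)))


# Hypothetical toy cohort: samples 0-9 are cases, 10-19 are controls.
is_case = {s: s < 10 for s in range(20)}
detected = {
    "PEPTIDE_A": set(range(9)) | {10},         # 9 cases, 1 control
    "PEPTIDE_B": set(range(5)) | {10, 11, 12, 13, 14},  # 5 cases, 5 controls
}
for pep, q in iwas_scan(detected, is_case).items():
    print(pep, round(q, 4))
```

The presence/absence encoding was chosen only to keep the example dependency-free; it discards exactly the quantitative information that the text argues will be needed at population scale.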
The road toward IWAS is probably very long, and it is very likely that the feasibility of conducting IWAS from large human population cohorts will face a great deal of skepticism within the immunopeptidomics community because many technical issues remain open at this point in time. In fact, the immunopeptidome differs from one tissue to another within every person, and obvious questions will arise. For instance, (1) which tissues and cell types should be used?; (2) which disease should be prioritized while considering sample accessibility?; (3) what is the minimum number of samples (disease and control) that would be required?; and (4) which HLA alleles would be included and would they need to be shared across all samples to enable proper analysis of the IWAS data set? These questions are difficult to answer now, and addressing them will require the expertise of biostatisticians. To our knowledge, the largest immunopeptidomic cohort study so far was conducted by the group of Arie Admon through the Glioma Actively Personalized Vaccine Consortium (GAPVAC). In this study, the authors isolated HLA-associated peptides from 142 plasma samples collected from glioblastoma patients (62). Although impressive, 142 samples are not enough to conduct an IWAS. Akin to other association studies, it is likely that over 10,000 samples sharing specific HLA alleles (e.g. HLA-A02) would be required to enable IWAS. Access to such cohorts of samples would probably need to be organized through large multi-institutional networks and biobanking efforts, which may include both academic and industrial partners. Such an endeavor may be viewed as overoptimistic by some experts in the field. Here, our intent is neither to answer all potential questions to make IWAS a reality, nor to provide a detailed roadmap to achieve this goal. 
Instead, we aim to emphasize the most pressing technical issues to move faster toward this long-term goal, and most importantly, stimulate a mind-shift within the community to move the field to a higher orbit. If the community acknowledges the impact of moving the field from small to large immunopeptidomic cohort studies, we predict that immunopeptidomics will emerge in the future as a central science of immunology to predict, diagnose, monitor and treat immune diseases, and will therefore profoundly impact medicine (Fig. 2).
FIG. 2. Next-generation immunopeptidomics to predict, diagnose, monitor and treat immune diseases. A, Further advances in MS technologies will ultimately enable IWAS in which robust, quantitative and comprehensive analysis of HLA class I and class II immunopeptidomes will be performed from thousands of patients. Such analysis will add a new layer of biological information and will enable the deciphering of immunopeptidome quantitative trait loci to predict disease and responsiveness to vaccines and immunotherapies. Human diseases listed in the boxes were shown to be associated with particular HLA alleles, with amino acid variants affecting either the peptide-binding groove of the HLA or its level of expression. The literature reference showing evidence for the HLA disease association is indicated in parentheses. B, Integration of immunopeptidomics into a multi-omic biomarker discovery platform for cancer immunotherapy. Patients are treated by checkpoint blockade immunotherapy. In this example, cohort samples will be collected longitudinally and analyzed using an integrative multi-omic approach. HLA-bound peptides will be identified and quantified before, during and after treatment in a robust and comprehensive fashion using advanced immunopeptidomics technologies. Data will be analyzed and correlated with clinical outcomes. Molecular signatures of tumor rejection will be identified in the immunopeptidome of checkpoint blockade high responders. Signatures (or epitope biomarkers) observed from such cohort studies will then be used prospectively to stratify patients and to treat those that will best respond to checkpoint therapy.

The Human Immunopeptidome Project-Development of mass spectrometry (MS) technology platforms may pave the way toward the IWAS paradigm. In fact, MS technologies have recently attracted a high level of interest for providing direct measurements of immunopeptidomes given the ongoing revolution in immuno-oncology and the intense interest in identifying therapeutically relevant antigenic peptides that are actionable in the clinic for vaccine development (63)(64)(65)(66)(67)(68)(69)(70)(71)(72)(73). More specifically, advances in genomics and MS techniques have enabled the development of proteogenomic methods for the direct detection of immunogenic tumor-specific mutated and non-mutated peptides that originate from both coding and non-coding regions of the genome (74-77). Those studies have provided an important proof-of-concept that MS is an effective approach that can be deployed for direct identification of immunogenic tumor-specific antigens. If further developed and routinely applied in cohort studies, such an analytical method would be of great value for biomarker discovery in cancer immunotherapy to provide molecular signatures or epitope biomarkers of tumor rejection to discriminate patients who best respond to ICIs (Fig. 2B) (78). For the time being, imprecise predictions of HLA-associated peptides from genes and mRNA are still preferred over direct HLA-peptide measurements by MS (26,79). This choice can be explained by the state and accessibility of the respective measurement techniques: whereas essentially complete genome and transcriptome analysis is readily available to immunologists through core facilities, MS-based analysis of immunopeptidomes is most effectively performed by expert labs and cannot easily reach the throughput, sensitivity and reproducibility of genome and transcriptome analysis. To solve this issue, the MS-based Human Immunopeptidome Project (HIPP) was launched under the umbrella of the Human Proteome Organization (HUPO) with the long-term goals of 1) mapping the entire composition of the human immunopeptidome and 2) making data and the robust experimental and computational techniques of immunopeptidomics accessible to any researcher and clinical investigator (22,23). 
Ultimately, absolute quantification of immunopeptidomes should become as cheap, fast, sensitive and reproducible as genome and transcriptome analysis is nowadays. Only then will immunopeptidomics become widely accessible to the broader immunology community to enable very large immunopeptidomic cohort studies, the development of epitope biomarker discovery platforms, and ultimately, IWAS.
Historical Milestones in MS-based Immunopeptidomics-Over the last five years, many excellent reviews have been published regarding the analysis of immunopeptidomes using MS technologies (71, 73, 80 -93). To avoid redundancies with those reviews, we focus below on key historical milestones within three different timeframes, which we view as important toward the IWAS paradigm: (1) the groundbreaking work of Rammensee and Hunt/Engelhard, (2) the capability of MS technologies for large-scale sequencing of MHC-associated peptides, and (3) the renaissance of cancer immunotherapy (Fig. 3A).
In the early 1990s, the teams of Hans-Georg Rammensee, Victor Engelhard and Donald Hunt provided truly groundbreaking work (Fig. 3A). They isolated MHC/HLA class I-associated peptides by immunoaffinity purification and applied liquid chromatography (LC)-MS methods to provide the first physical evidence about the nature of MHC class I-bound peptides (94,95). Specifically, in 1990, Rammensee's group applied LC-MS to analyze naturally processed viral peptides recognized by cytotoxic T cells (95). In 1991, his group reported pooled sequencing of peptides eluted from several MHC class I molecules, with identification of allele-specific peptide binding motifs (21). The year after, the findings by Hunt et al. were consistent with the Rammensee motif (94). In fact, Hunt/Engelhard's group quantified ~200 peptide species by LC-MS and partially sequenced ~19 of them, validating the motif concept and providing the first estimate that the total number of different peptides presented by HLA-A2 could easily exceed 1000, with most of the peptides present in 100 or fewer copies per cell, hence suggesting for the first time the existence of a large and complex immunopeptidome (94, 96-101). In 1992, Hunt/Engelhard provided physical evidence for non-canonical long peptides presented by HLA class I molecules (102), and in 1997, sequenced the very first mutated MHC class I peptides (Fig. 3A) (103), now widely referred to as tumor-specific neoantigens (26). Further, in 2006, Hunt/Engelhard's group showed evidence for phosphopeptides as potential targets for cancer immunotherapy (104,105). Since then, MS-based discovery of mutated (75, 76, 80, 106-108), non-canonical (86, 87, 109-111) and posttranslationally modified HLA-associated peptides [phosphorylation (112)(113)(114)(115), methylation (116), citrullination (117), deamidation (118-121), kynurenine (122), glycosylation (123)(124)(125), peptide splicing (126-130)] continues to be a topic of much interest in the field.
A key technical milestone that greatly impacted the immunopeptidomics field was the commercialization of the Orbitrap technology in 2005 (131). In fact, the launch of this instrument has enabled researchers to routinely perform large-scale sequencing of MHC-associated peptides, going from dozens to hundreds, and lately to thousands, of MHC-associated peptides identified in single experiments (87,132). Computational algorithms were created and improved (91)(92)(93)(133)(134)(135)(136) and different MS techniques developed by the proteomics community were used for the analysis of MHC peptides: data-dependent acquisition (DDA) for the discovery of new antigenic peptides, selected/multiple/parallel reaction monitoring (S/M/PRM) for targeted analysis of pre-defined sets of peptides with a high level of sensitivity, reproducibility and quantitative accuracy, and more recently, sequential window acquisition of all theoretical fragment ion spectra/data-independent acquisition (SWATH/DIA). SWATH/DIA makes use of high-resolution qTOF instruments (SWATH) or Orbitrap instruments for high-throughput targeted analysis of large fractions of immunopeptidomes and is reviewed in (71). For three decades, a very small community of experts has deployed these MS techniques to generate MS data (now generally shared through public MS data repositories such as PRIDE and MassIVE), in most cases through collaborations with immunologists and clinical investigators (Fig. 3B). Until now, more than 500 original publications related to MS-based sequencing of MHC-associated peptides have been reported (supplemental Table S1; see filtered PubMed search [...]). [...] were about the application of the available technology to better understand the basic mechanisms of antigen presentation (46%), followed by application in the fields of cancer (21%), autoimmunity or inflammation (12%), and infectious diseases (21%) (Fig. 3C).
A clear milestone in immunopeptidomics also became apparent in 2013 following the recognition of cancer immunotherapy as Breakthrough of the Year by the journal Science (Fig. 3A) (137). In fact, the renaissance of cancer immunotherapy in recent years, thanks to the astonishing clinical success of checkpoint blockade inhibitors (138), has significantly increased the interest of researchers and clinical oncologists in applying MS-based immunopeptidomics in the quest for tumor-specific antigen discovery (62, 73, 139-141). In 2014, Gubin et al. demonstrated that tumor-specific mutated peptides were targeted by T cells in response to checkpoint blockade therapy (106), thereby providing groundbreaking mechanistic insights about how checkpoint blockade works to eradicate cancer in patients (142)(143)(144). In the following years, the capability of MS techniques to directly identify tumor-specific mutated antigens as physical molecules was further demonstrated, collectively providing an important proof-of-concept about the potential utility of MS technologies in the clinic (65,138,145,146). This enthusiasm has created new business opportunities and has resulted in an increasing number of biotech and pharma companies interested in applying MS-based immunopeptidomics in the space of antigen discovery (Table I) (147). Advances in MS technologies and the increasing clinical relevance of the immune system to treat cancer have greatly impacted the culture of the immunopeptidomics field, progressively leading it into a new and exciting era, as highlighted by the recent "Immunopeptidomics" special issue in the journal Proteomics (148).

Sample Preparation Is the Achilles' Heel for Large Immunopeptidomic Cohort Studies-Next-generation sequencing and MS technologies are revolutionizing the life sciences by enabling robust quantitative analysis of genomes, transcriptomes and proteomes in many laboratories, making multiomic cohort studies a reality (149). In contrast, the field of immunopeptidomics is lagging behind even though its potential impact in the clinic is well recognized. In immunopeptidomics, the isolation of MHC-associated peptides is currently in our view the "Achilles' heel" of the whole workflow. Below, we argue that the development of new protocols, reagents and consumables will need to be prioritized over the next five years to accelerate the explosion of large-scale cohort studies in immunopeptidomics.
FIG. 3. [...] the proportion of publications that have answered an immunology question and/or that have tackled a technological limitation. (Right) The pie chart shows the proportion of studies that have applied the available technology to answer an immunology question in the context of basic research (antigen processing and presentation) or applied research (cancer, infectious diseases and autoimmunity/inflammation). See supplemental Table S1 for details about the PubMed and PRIDE/MassIVE searches.

As of today, the field of immunopeptidomics heavily relies on the use of antibodies for the specific isolation of MHC-associated peptides by immunoaffinity purification (see (72) for a list of HLA-specific antibodies). Notably, for the isolation of HLA-ABC-associated peptides, researchers have generally used the exact same antibody (W6/32) over the last 30 years. The method has been refined by a handful of experts but remains fundamentally like the methods used by Donald Hunt and Hans-Georg Rammensee in the early 1990s (72, 150-153). In addition, relatively large quantities of antibodies are required to perform this procedure (i.e. ~1 mg of antibody on average per sample). As of today, any laboratory aiming at conducting large-scale immunopeptidomic studies must generally produce and purify antibodies from hybridoma cell lines, which is relatively costly and time consuming, and therefore represents a considerable limitation for scaling up the procedure and for rapidly implementing immunopeptidomics workflows in new laboratories. Moreover, very little is known about the yield of the immunoaffinity purification procedure. Until now, the use of isotopically labeled peptide-MHC monomers represents the best strategy for the accurate quantification of stepwise yields of the procedure (154). In this regard, the group of Peter van Veelen has recently synthesized 24 different isotopically labeled peptide-HLA monomers and showed that the yield was sequence-dependent. 
Specifically, the yield ranged between 2.1 and 15%, and was below 10% for 70% of the peptides tested (17 out of 24), indicating important losses and/or biases during sample preparation (155). Along these lines, other separation techniques were reported to create a bias in the detectable immunopeptidome, which can be explained by the enormous biochemical diversity of those peptides (156). For instance, highly hydrophilic peptides may not be captured by conventional reversed-phase C18 material whereas highly hydrophobic peptides may be preferentially lost during the procedure simply because they stick to plastic materials during the multiple handling steps. To address the issues mentioned above, exploration of new approaches in digital microfluidics for immunoprecipitation may help increase the sample throughput (157); further development of acoustic technologies for contactless handling may accelerate unbiased analysis of immunopeptidomes (158,159); and low-cost distribution of antibodies may accelerate the testing of new sample preparation protocols, which may ideally enable all sample processing steps to be carried out in a single tube, thereby enhancing sensitivity, throughput and scalability of immunopeptidomic analyses (160,161). As a first step, HUPO-HIPP recently proposed to design a multilaboratory study in which a large library of isotopically labeled peptide-HLA monomers would be distributed across different research groups for the accurate quantification of stepwise yields of the established immunoaffinity purification procedure (22). This type of work has been underappreciated over the years but is nevertheless invaluable to define the physicochemical properties of peptides that are preferentially detected or lost during the isolation procedure. 
Inspired by landmark studies from other -omic disciplines (162)(163)(164)(165)(166)(167), conducting the proposed multi-laboratory study will clarify the uncertainty about the yield of the peptide isolation procedure, will determine the robustness of the methodology used by different groups and will possibly indicate the need to prioritize the development of innovative sample preparation protocols.
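The yield bookkeeping behind such a spike-in experiment can be sketched as follows. This is an illustrative calculation only: the function names and the spiked and recovered amounts are hypothetical, chosen so that the resulting yields fall within the 2.1-15% range quoted above, and they are not data from the cited study.

```python
from math import prod


def percent_yield(spiked_fmol, recovered_fmol):
    """Percent of a spiked, isotopically labeled peptide recovered after isolation."""
    if spiked_fmol <= 0:
        raise ValueError("spiked amount must be positive")
    return 100.0 * recovered_fmol / spiked_fmol


def overall_fraction(step_fractions):
    """Overall recovery as the product of stepwise recovery fractions (0-1)."""
    return prod(step_fractions)


def summarize(yields, threshold=10.0):
    """Range of yields and the share of peptides falling below a threshold."""
    below = sorted(name for name, y in yields.items() if y < threshold)
    return {
        "min": min(yields.values()),
        "max": max(yields.values()),
        "below_threshold": below,
        "fraction_below": len(below) / len(yields),
    }


# Hypothetical measurements: (spiked fmol, recovered fmol) per labeled peptide.
measurements = {
    "peptide_1": (1000.0, 21.0),
    "peptide_2": (1000.0, 150.0),
    "peptide_3": (1000.0, 80.0),
}
yields = {name: percent_yield(s, r) for name, (s, r) in measurements.items()}
print(summarize(yields))

# Hypothetical stepwise recoveries (e.g. capture, elution, cleanup) multiply.
print(overall_fraction([0.5, 0.4, 0.5]))
```

Tracking recovery per step rather than only end to end is what allows losses to be attributed to a specific part of the workflow, which is the stated purpose of the proposed multi-laboratory study.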
In summary, the field has greatly benefited from the traditional immunoaffinity purification method to make ground-breaking discoveries. The method will likely continue to be used for small-scale studies and relatively larger cohort studies in expert laboratories in the near future (62,141). However, the development of new sample preparation protocols will become critical to scale up and bring the field to the next level. Isolation of MHC-associated peptides in a cheap, fast, unbiased, reproducible and high-throughput fashion has the potential to transform the field and increase the impact of immunopeptidomics in biomedical research through robust clinical applications, and ultimately, population-scale immunopeptidomic studies.
Disruptive Technologies for Large-Scale Immunopeptidomics. IWAS and population-scale studies will undoubtedly require immunopeptidomic technologies to be widely accessible and commercially well-supported. In genomics, high-throughput generation of complete genomic maps became widely accessible only with the development and commercialization of technologies and methods that enabled the sequencing of millions of nucleic acid segments in parallel (168). Reaching this state in the field of immunopeptidomics for high-throughput generation of complete digital immunopeptidomic maps is possible, but public-private partnerships will need to be prioritized in the coming years to accelerate the commercialization of new immunopeptidomics technologies/methods and their widespread distribution. It is conceivable that MS technologies, such as SWATH/DIA-MS, will continue to improve toward this goal. However, it is also possible that disruptive new approaches will be developed for massively parallel peptide sequencing, making MS technologies potentially obsolete in the future. In this regard, Swaminathan et al. have recently introduced a groundbreaking peptide fluorosequencing technology that may accelerate the process toward the IWAS paradigm. Akin to massively parallel measurements of DNA using fluorescence as readout, they demonstrated that parallel fluorescence sequencing is also achievable for peptides (169). Their method is promising because it paves the way toward single-molecule peptide sequencing at very high throughput from minute amounts of sample (http://www.erisyon.com/) (170).
Their method builds on three well-characterized techniques: Edman chemistry, massively parallel DNA sequencing and MS-based computational strategies for sequence database searching (171). Briefly, isolated peptides are first labeled with fluorophores on specific amino acid residues; second, labeled peptides are immobilized on a glass surface and imaged by total internal reflection microscopy to monitor decreases in each molecule's fluorescence after consecutive rounds of Edman degradation; third, the pattern of drops in fluorescence intensity is interpreted to provide a sequencing annotation for each peptide, which is matched and scored against a peptide sequence database to infer the most likely set of peptides present in the sample (169,170,172). In our opinion, this approach is very promising for proteins in the near future, but further development of the technology will be essential for sequencing HLA-associated peptides because it is currently impossible to label most of the amino acids with fluorophores. Although this proof-of-concept approach has limitations at this stage (e.g. dyes used, sample complexity, sequencing yield), it capitalizes on three established techniques, which may accelerate its maturation from proof of concept to routine application and wide adoption by the community.
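The matching step described above can be illustrated with a minimal sketch. It is not the published algorithm: it assumes, purely for illustration, that only cysteine (C) and lysine (K) carry a dye, that every Edman cycle succeeds, and that matching is exact rather than probabilistic. The function names and the toy peptide list are hypothetical.

```python
from typing import Iterable, List


def drop_pattern(peptide: str, labeled: Iterable[str] = ("C", "K"), cycles: int = 9) -> str:
    """Idealized fluorescence read-out for one immobilized peptide.

    Position i of the result holds the labeled residue whose dye is lost in
    Edman cycle i + 1, or "." when no drop in fluorescence is expected.
    Assumption: error-free chemistry and perfect dye detection.
    """
    labeled_set = set(labeled)
    # Pad short peptides so that every pattern has the same length.
    window = peptide[:cycles].ljust(cycles, ".")
    return "".join(aa if aa in labeled_set else "." for aa in window)


def match_candidates(observed: str, database: Iterable[str],
                     labeled: Iterable[str] = ("C", "K")) -> List[str]:
    """Return every database peptide whose idealized pattern equals the observed one."""
    labeled = tuple(labeled)
    cycles = len(observed)
    return [p for p in database if drop_pattern(p, labeled, cycles) == observed]


# Hypothetical 9-mer sequences, invented for this example.
toy_database = ["KLCDEFGHK", "KVCAAAAAK", "ALCDEFGHK", "GLYDGMEHL"]
observed = drop_pattern("KLCDEFGHK")
candidates = match_candidates(observed, toy_database)
print(observed, candidates)
```

In this toy case two different peptides share the same pattern, and a peptide without any labeled residue gives no signal at all. This is one way to see why labeling only a few amino acid types limits the approach for short HLA-associated peptides.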
If further developed, highly parallel single-molecule peptide sequencing may offer an improvement of more than one millionfold in sensitivity over conventional MS technologies and may allow for millions of distinct peptides to be sequenced in parallel, identified and digitally quantified (169,171). Given the qualitative and quantitative complexity of the human immunopeptidome, this technology may radically transform the immunopeptidomics field. However, one has to keep in mind that the performance of this approach will always rely on the quality of upstream peptide isolation procedures. Hence, the shortcomings mentioned above related to the isolation of MHC-associated peptides by immunoaffinity purification will need to be overcome for a generally accessible, reliable and truly universal immunopeptidomic technology.
Data Analysis Challenges. The main approaches currently used for the analysis of immunopeptidomics experiments were originally developed for the analysis of MS proteomics datasets. Database searches are the analysis method of choice for data-dependent acquisition workflows. Here, the acquired spectra are compared with theoretical spectra generated from peptide sequences drawn from a given protein sequence database. Several limitations of this type of analysis apply to immunopeptidomics. First, the size of the search database (and the related search space) can become extremely large. For instance, proteasome-generated spliced peptides have been detected by MS in recent studies (126,129,173,174), so the inclusion of spliced variants in the search database could be required (129). However, this strategy has sparked intense discussions in the field over the last two years and remains actively debated (130,175,176). Additionally, in most proteomics experiments, trypsin is used as the digestion enzyme. Because "no enzyme specificity" must be used in immunopeptidome searches, the search space grows dramatically. As a result of these two related issues, the time required for performing the searches can increase enormously, and even more so if variable posttranslational modifications such as phosphorylation have to be considered in the search. More importantly, when making use of the target-decoy approach, it becomes very challenging for search engines to differentiate between true and false matches, considering the huge search space. Consequently, a compromise is often needed, and the accepted peptide-spectrum match (PSM) false discovery rate can become 5% instead of the 1% commonly used in standard MS proteomics experiments.
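The effect of "no enzyme specificity" on the search space can be made concrete with a small sketch. It is a simplification under stated assumptions: the protein sequence is an arbitrary toy string, tryptic cleavage is modeled as a cut after every K or R with no missed cleavages and no proline rule, and the length windows (6 to 30 residues for tryptic peptides, 8 to 12 for MHC class I peptides) are illustrative choices rather than values taken from the text.

```python
from typing import List


def nonspecific_peptides(protein: str, min_len: int = 8, max_len: int = 12) -> List[str]:
    """All contiguous subsequences within the length window (no enzyme specificity)."""
    return [
        protein[start:start + length]
        for length in range(min_len, max_len + 1)
        for start in range(len(protein) - length + 1)
    ]


def tryptic_peptides(protein: str, min_len: int = 6, max_len: int = 30) -> List[str]:
    """Fully tryptic peptides: cut after each K or R (simplified rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR" or i == len(protein) - 1:
            peptides.append(protein[start:i + 1])
            start = i + 1
    return [p for p in peptides if min_len <= len(p) <= max_len]


# Arbitrary 33-residue toy sequence used only for counting candidates.
toy_protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
n_nonspecific = len(nonspecific_peptides(toy_protein))
n_tryptic = len(tryptic_peptides(toy_protein))
print(f"non-specific candidates: {n_nonspecific}, tryptic candidates: {n_tryptic}")
```

Even for this single short sequence, the unspecific enumeration yields far more candidates than the tryptic one. Adding spliced peptides or variable modifications multiplies the candidate list further.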
In parallel, the average percentage of identified spectra can be significantly lower in immunopeptidomics experiments because of several factors (e.g. the shortness of MHC class I peptides, lack of basic amino acid residues, low sample amounts, acquisition parameters), as described previously (71). To address this issue, several approaches have been proposed, such as the sequential use of different search engines or the use of Percolator, a tool that performs machine learning on high-confidence matches to rescore database search results for lower-confidence peptides (177). Other rescoring approaches have also been used successfully in the field recently (133).
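Whether a search is run at a 1% or a 5% false discovery rate, and whether rescoring helps, is usually judged by how many target matches survive a target-decoy threshold. The sketch below shows one simple way to count them. It is an illustration only, not the procedure of any specific tool: the FDR is estimated as the ratio of decoy to target matches above a score cutoff, ties are ignored, and the ranked list of matches is invented.

```python
from typing import Iterable, Tuple


def accepted_targets(psms: Iterable[Tuple[float, bool]], fdr_threshold: float) -> int:
    """Largest number of target PSMs that can be accepted at the given FDR.

    Each PSM is a (score, is_decoy) pair; higher scores are better.
    The FDR at a cutoff is estimated as (#decoys / #targets) above the cutoff.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = best = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= fdr_threshold:
            best = targets  # this cutoff still satisfies the threshold
    return best


# Invented ranking: True marks a decoy match, False a target match.
flags = [False] * 100 + [True] + [False] * 40 + [True] * 6 + [False] * 20
toy_psms = [(float(len(flags) - rank), flag) for rank, flag in enumerate(flags)]

print("accepted at 1% FDR:", accepted_targets(toy_psms, 0.01))
print("accepted at 5% FDR:", accepted_targets(toy_psms, 0.05))
```

Relaxing the threshold accepts more target matches at the price of a higher expected proportion of false ones, which is the compromise described above.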
Moreover, alternative analysis approaches to standard database searches can also be used, either alone or in combination. One example is the use of de novo search methods, in which the spectra are sequenced directly, without the need for a protein sequence database (178). The use of de novo algorithms has recently been recommended in immunopeptidomics protocols (72). Spectral library search approaches can also be employed (179). Here, the experimental spectra are compared with existing collections of spectra in the public domain, called spectral libraries. However, this approach, although increasingly popular, has had limited uptake so far. One of the reasons is that the generation of high-quality spectral libraries is not straightforward, and some researchers are reluctant to use publicly available spectral libraries generated by others.
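The comparison of an experimental spectrum with library spectra can be sketched with a cosine (normalized dot product) score on binned peaks. This is a generic similarity measure chosen for illustration, not the scoring function of any particular spectral library search tool. The bin width, the library entries and all peak values are invented, and a real search would also restrict candidates by precursor mass.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Peaks = List[Tuple[float, float]]  # (m/z, intensity) pairs


def bin_spectrum(peaks: Peaks, bin_width: float = 1.0) -> Dict[int, float]:
    """Sum peak intensities into fixed-width m/z bins."""
    binned: Dict[int, float] = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return dict(binned)


def cosine_similarity(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Normalized dot product of two binned spectra (0 = no shared bins, 1 = same shape)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_library_match(query: Peaks, library: Dict[str, Peaks]) -> Tuple[str, float]:
    """Return the library entry with the highest cosine score against the query."""
    binned_query = bin_spectrum(query)
    scores = {name: cosine_similarity(binned_query, bin_spectrum(peaks))
              for name, peaks in library.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Invented library entries and query spectrum.
toy_library = {
    "PEPTIDE_A": [(100.1, 10.0), (200.2, 50.0), (300.3, 20.0)],
    "PEPTIDE_B": [(150.0, 30.0), (250.0, 30.0)],
}
toy_query = [(100.2, 9.0), (200.4, 55.0), (300.1, 18.0)]
print(best_library_match(toy_query, toy_library))
```

Fixed bins are a coarse choice, since two peaks that fall on either side of a bin edge are treated as unrelated. The sketch only conveys why the quality of the library spectra directly limits the quality of the match.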
Additionally, the increasing popularity of open modification searches makes them a promising approach. The new tools developed in the last few years (e.g. MSFragger, ANN-SoLo and TagGraph, among others (180-182)) have overcome the high computational cost of these methods. Additionally, in our opinion, a promising approach for targeting interesting spectra would be the use of clustering of MS/MS spectra (183)(184)(185). This approach can be used to select those spectra that remain unidentified and that are commonly found across MS runs in different samples. The hypothesis is that the corresponding peptides are potentially interesting and biologically relevant, because they are abundant. The resulting representatives of these clusters of unidentified spectra could then be subjected to the alternative analysis methods explained above. It is certainly possible that a significant fraction of unidentified spectra from such clusters includes contaminating peptides that show up in many samples. They could indeed be actual contaminants usually found in proteomics experiments (e.g. keratins), or peptides that are not biologically relevant and arise as artifacts of the experimental protocol. If the latter case applies, information from such clusters could be used to feed a database of contaminating peptides that are constantly observed in immunopeptidomics. Such a database does not exist yet but would be useful for the growing immunopeptidomics community and would be conceptually similar to the CRAPome, which serves as a repository of protein contaminants in protein-protein interaction studies (186).
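The selection step proposed above can be sketched once spectra have already been clustered. The sketch assumes a hypothetical input in which each spectrum carries a cluster identifier, the MS run it came from, and a peptide assignment or `None`. The clustering itself, the field layout and the `min_runs` cutoff are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Iterable, List, Optional, Tuple

# (cluster_id, run_id, peptide or None when the spectrum is unidentified)
ClusteredSpectrum = Tuple[str, str, Optional[str]]


def recurrent_unidentified_clusters(spectra: Iterable[ClusteredSpectrum],
                                    min_runs: int) -> List[str]:
    """Clusters with no identified member that are seen in at least `min_runs` runs."""
    runs_per_cluster = defaultdict(set)
    identified_clusters = set()
    for cluster_id, run_id, peptide in spectra:
        runs_per_cluster[cluster_id].add(run_id)
        if peptide is not None:
            # One identified member is taken to annotate the whole cluster.
            identified_clusters.add(cluster_id)
    return sorted(
        cluster_id
        for cluster_id, runs in runs_per_cluster.items()
        if cluster_id not in identified_clusters and len(runs) >= min_runs
    )


# Invented example: c1 recurs unidentified, c2 is identified, c3 is seen once.
toy_spectra = [
    ("c1", "run1", None), ("c1", "run2", None), ("c1", "run3", None),
    ("c2", "run1", "GLYDGMEHL"), ("c2", "run2", None), ("c2", "run3", None),
    ("c3", "run1", None),
]
print(recurrent_unidentified_clusters(toy_spectra, min_runs=3))
```

The returned clusters are the candidates that would be passed on to de novo, spectral library or open modification searches, or checked against a list of recurrent contaminants.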
Further, artificial intelligence approaches have the potential to improve the analysis dramatically. In MS-based proteomics, deep-learning techniques have recently been deployed using millions of high-quality MS2 spectra generated from hundreds of thousands of synthetic tryptic peptides to build new algorithms that predict the fragmentation patterns of peptides, thereby enabling the successful generation of in silico peptide spectral libraries for high-throughput targeted analysis of DIA data (187)(188)(189). These developments should also greatly benefit the further improvement of de novo search algorithms. Deep learning was also recently applied for de novo identification of HLA-associated peptides using DIA-MS (190,191).
Finally, it is worth highlighting that a significant number of additional challenges need to be considered with regard to the quantification of immunopeptidomes. This quantitative aspect is much less mature at present. Analogous challenges in quantitative MS proteomics approaches have been described before (192).
The SysteMHC Atlas Project. Open and comprehensive reference atlases in the life sciences are increasingly beneficial for the scientific community (193)(194)(195)(196)(197)(198). Similarly, the creation of a comprehensive atlas of the immunopeptidome in human, mouse, and other species would be of great value both for understanding health and for diagnosing, monitoring and treating immune diseases (22,23,69). In this context, participants of the HIPP initiative recognized the need for an open immunopeptidomics atlas/repository in which output files of mass spectrometric measurements of immunopeptidome samples would be annotated, stored and shared without restriction. In response to this need, the SysteMHC Atlas project was created, a project fully dedicated to the public dissemination and analysis of immunopeptidomic data generated by MS (https://systemhcatlas.org) (Fig. 4) (199). The SysteMHC Atlas uploads raw immunopeptidomics MS data originally deposited into public proteomics databases (mainly the PRIDE database (200), which is the leading ProteomeXchange repository (201,202)) along with the metadata associated with the experiment (203). Raw MS data are then processed through the uniform and open-source Trans-Proteomic Pipeline (TPP)-based computational pipeline for HLA peptide identification, annotation (204-206) and statistical validation (205,206). Lists of HLA peptides as well as allele-specific peptide spectral libraries (207) are generated and presented in the Atlas.
Fig. 4. ... harmonizing and sharing immunopeptidomic data generated by the community. Over the next years, we anticipate that the growing MS community working on various immunopeptidomic-related research topics (e.g. cancer, autoimmunity, infectious diseases) will contribute to the expansion of the SysteMHC Atlas by uploading data into existing repositories of the ProteomeXchange consortium (mainly PRIDE and MassIVE; the others shown are PeptideAtlas, Panorama, iProX and jPOST). Those data will be reprocessed and harmonized using a uniform and open-source computational pipeline developed under the SysteMHC Atlas project. We envision that ProteomeXchange resources will be integrated better with the SysteMHC Atlas in the near future, thereby facilitating the application of deep-learning approaches from quality-controlled immunopeptidomic "Big Data" to improve software tools for MHC peptide identification and quantification by DDA- and DIA-MS techniques.
The SysteMHC Atlas project is still at an early stage of development. The current version is composed of 23 immunopeptidomic datasets (https://systemhcatlas.org/datasets). In the short term, we plan to enhance the capabilities of the SysteMHC Atlas project by integrating emerging computational MS workflows (e.g. proteogenomics approaches, ultrafast search engines, identification of post-translationally modified peptides) and to expand its content through a closer partnership with the ProteomeXchange consortium (Fig. 4). High-quality harmonized immunopeptidomic data stored in the SysteMHC Atlas will find use over the next years to build and improve software algorithms for the identification and quantification of HLA peptides by DDA- and DIA-MS (Fig. 4). Hence, one can anticipate that the increasing number of high-quality MS2 spectra in the SysteMHC Atlas will be of great value for computational scientists to tailor next-generation software for the analysis of immunopeptidomic data. New computational tools will also be used to re-process and re-score data in the Atlas and keep improving the quality of the data. By doing so, the ever-increasing number of HLA peptides in the Atlas will be useful to enhance MS identification and quantification of canonical and non-canonical HLA-associated peptides, including neoantigens that are unpredictable from genomic information as well as those that originate from foreign organisms and the microbiome (208). Moreover, it is our intention to create, in the longer term, the SysteMHC Atlas Data Analysis Center, which will support robust measurements of large immunopeptidomic sample cohorts by SWATH/DIA-MS. In fact, an important utility of the SysteMHC Atlas project is to provide access to reference HLA allele-specific peptide spectral libraries for reproducible, quantitative and comprehensive analysis of immunopeptidomes by SWATH/DIA-MS (71, 209-212).
Hence, we envision that the SysteMHC Atlas project is the beginning of an enterprise that will accelerate the design of large-scale immunopeptidomic studies at population scale and therefore represents a potential strategy toward the IWAS concept.
CONCLUSION
Cohort studies of the post-GWAS era are increasingly impactful and an important driving force for revolutionizing healthcare (4). Recent GWAS-based evidence indicates that generation of digital immunopeptidome maps from thousands of individuals may unveil a vast array of immunopeptidomic signatures (or epitope biomarkers) that could eventually be used in the clinic to predict the susceptibility or resistance of an individual to immune diseases as well as an individual's predisposition to respond to vaccines and immunotherapies.
Inspired by the vision and achievements of pioneers in genomics (213)(214)(215)(216) and proteomics (217,218), we envision that the ongoing enthusiasm for MS profiling of the human immunopeptidome will lead to the development of the IWAS paradigm, in which absolute quantification of immunopeptidomes at population scale will become a reality. If tested and validated, we envision that IWAS will radically change the way we think about immunopeptidomics and will represent a new milestone in the history of the field. Achieving this goal will require tremendous technological development and community efforts, from highly standardized techniques for high-throughput isolation of peptides to advanced knowledge about the baseline and dynamics of the human immunopeptidome, as well as appropriate data and computational resources to handle the analysis of very large immunopeptidomic datasets. With the advent of numerous new "biobank"-type studies, great opportunities will exist to collect samples for immunopeptidomic analyses, together with genomic, transcriptomic, proteomic and metabolomic analyses, that will impact personalised health care management and public health care policy in the future.
Acknowledgments. We thank the members of the Human Immunopeptidome Project for insightful discussions. ‡‡ To whom correspondence should be addressed. E-mail: etienne.caron@umontreal.ca.