TSNAdb v2.0: The Updated Version of Tumor-specific Neoantigen Database

In recent years, neoantigens have been recognized as ideal targets for tumor immunotherapy. With the development of neoantigen-based tumor immunotherapy, comprehensive neoantigen databases are urgently needed to meet the growing demand for clinical studies. We have built the tumor-specific neoantigen database (TSNAdb) previously, which has attracted much attention. In this study, we provide TSNAdb v2.0, an updated version of the TSNAdb. TSNAdb v2.0 offers several new features, including (1) adopting more stringent criteria for neoantigen identification, (2) providing predicted neoantigens derived from three types of somatic mutations, and (3) collecting experimentally validated neoantigens and dividing them according to the experimental level. TSNAdb v2.0 is freely available at https://pgx.zju.edu.cn/tsnadb/.


Introduction
Tumor neoantigens are tumor-specific antigens derived from somatic mutations in tumor cells, which have been recognized as ideal targets for tumor immunotherapy in recent years [1][2][3][4].
Because experimental verification carries a heavy workload, cancer genomics and bioinformatics are preferred for neoantigen identification. Numerous prediction tools that model the biological process of neoantigen generation, such as human leukocyte antigen (HLA)-peptide binding [5][6][7], have been developed and embedded in neoantigen prediction pipelines such as pVACtools [8], tumor-specific neoantigen detector (TSNAD) [9,10], and pTuneos [11]. Neoantigen-related databases such as TRON cell line portal (TCLP) [12], the cancer immunome atlas (TCIA) [13], and tumor-specific neoantigen database (TSNAdb) [14] have also been created to support the use of neoantigens in clinical research. In TSNAdb v1.0, we took the complex of mutated peptides and HLA class I molecules (peptide-HLA pairs, pHLAs) as tumor neoantigens and predicted binding affinities between mutated/wild-type pHLAs by NetMHCpan v2.8/v4.0. We then obtained 3,707,562/1,146,961 potential neoantigens derived from single nucleotide variants (SNVs) of 7748 tumor samples from The Cancer Genome Atlas (TCGA, https://portal.gdc.cancer.gov/). With the development of neoantigen-based tumor immunotherapy, neoantigens from other types of mutations have been identified, and more experimental data have been generated [15,16]. An update of TSNAdb v1.0 is therefore needed.
Here, we present an updated version of TSNAdb v1.0 that improves on the following points. (1) More stringent criteria were used for neoantigen identification to reduce the high false-positive rate of neoantigen prediction in practice. Only the pHLAs that met the thresholds of three tools were considered potential neoantigens (Figure 1). pHLAs were not considered neoantigens if the mutated genes were not expressed in the tumor cells. (2) We provided predicted neoantigens derived not only from SNVs but also from insertions/deletions (INDELs) and gene fusions (Fusions). In total, 372,273 SNV-derived neoantigens, 137,130 INDEL-derived neoantigens, and 11,093 Fusion-derived neoantigens were obtained. The mean number of neoantigens generated for each SNV (0.38) was lower than that for each INDEL (1.22) or Fusion (0.88). (3) We collected as many experimentally validated neoantigens from public databases and literature as possible (1856 neoantigens) and divided them into three tiers according to the level of experimental verification. Corresponding genes and mutations of the collected neoantigens were linked to the cancer-driving site profiling database (CandrisDB) [17], since neoantigens derived from driver genes or driver mutations would be ideal targets for tumor immunotherapy [18].
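The per-mutation means quoted above can be checked with simple arithmetic. The short sketch below divides the neoantigen counts given here by the mutation counts reported in the data collection section; the variable names are illustrative only and are not part of TSNAdb.

```python
# Counts as reported in the text.
neoantigens = {"SNV": 372_273, "INDEL": 137_130, "Fusion": 11_093}
mutations = {"SNV": 972_187, "INDEL": 112_404, "Fusion": 12_639}

# Mean number of predicted neoantigens per mutation, rounded to two decimals.
mean_per_mutation = {
    kind: round(neoantigens[kind] / mutations[kind], 2) for kind in neoantigens
}
print(mean_per_mutation)
```

The rounded ratios agree with the reported values of 0.38, 1.22, and 0.88.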
We believe that the updated database will contribute to neoantigen-based tumor immunotherapy. The database will continue to be updated by predicting neoantigens from more types of mutations and by collecting more experimentally validated neoantigens.

Data collection and preprocessing
The SNVs, INDELs, and the expression levels of the corresponding genes were collected from TCGA. Mutated nucleotide sequences generated by SNVs and INDELs were translated into mutated amino acid sequences and decomposed into peptides of 8 to 11 amino acids using the pipeline TSNAD v2.0 [10]. The Fusions were collected from Gao et al. [19], and the mutated proteins were generated by STAR-Fusion [20]. The HLA alleles of the corresponding samples were collected from TCIA. Finally, 972,187 SNVs from 7748 samples, 112,404 INDELs from 7086 samples, and 12,639 Fusions from 4234 samples were used for neoantigen prediction.
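To illustrate the decomposition step, the sketch below enumerates every 8- to 11-residue window that covers a single substituted residue, as would arise from an SNV. It is a minimal illustration under that assumption, not the TSNAD v2.0 implementation; the function name and the toy sequence are hypothetical.

```python
def mutant_peptides(protein, mut_pos, min_len=8, max_len=11):
    """List all peptides of length min_len..max_len that contain the
    residue at 0-based index mut_pos of the mutated protein sequence."""
    peptides = []
    for k in range(min_len, max_len + 1):
        # A window of length k covers mut_pos when its start lies between
        # mut_pos - k + 1 and mut_pos, clipped to the sequence bounds.
        first = max(0, mut_pos - k + 1)
        last = min(mut_pos, len(protein) - k)
        for start in range(first, last + 1):
            peptides.append(protein[start:start + k])
    return peptides


# Toy 30-residue sequence (hypothetical) with the mutated residue at index 15.
seq = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"
print(len(mutant_peptides(seq, 15)))
```

For a mutation far from either end of the protein, this yields 8 + 9 + 10 + 11 = 38 candidate peptides per SNV. Frameshift INDELs and Fusions create longer novel stretches and would need every window overlapping the novel region.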

Stricter criteria for neoantigen identification
Neoantigen-based tumor immunotherapy has shown good application prospects in clinical practice. However, the high false-positive rate of neoantigen prediction limits its usage, and how to select high-confidence immunogenic neoantigens remains to be resolved. To reduce the potential false-positive rate in our predicted results, three tools (DeepHLApan, MHCflurry, and NetMHCpan v4.0) were used for neoantigen prediction, and only the pHLAs that met all the criteria of the three tools were considered potential neoantigens (Figure 1). We chose these three tools for the following reasons. NetMHCpan [7] is the most frequently used tool for neoantigen prediction in clinical practice. MHCflurry [6] produces predictions efficiently and with high quality. DeepHLApan [5] considers both HLA-peptide binding and the immunogenicity of pHLAs, which the other two tools do not take into consideration for high-confidence neoantigen prediction. The threshold of each tool is as follows. For NetMHCpan v4.0, a pHLA with rank < 2% or affinity < 500 nM is considered binding, and we required both thresholds to be met to select higher-quality neoantigens. MHCflurry outputs a percentile rank and has no specific threshold, so we set rank < 2% as the threshold, the same as for NetMHCpan v4.0. The predicted scores of DeepHLApan are posterior probabilities, so we set the threshold to 0.5. In addition, the pHLAs whose corresponding genes were not expressed [transcripts per million reads (TPM) < 1] were removed.

We further explored the relationship between the number of mutations and the number of neoantigens for the three mutation types. The numbers of SNV-derived and INDEL-derived neoantigens were positively correlated with the numbers of SNVs and INDELs, with Pearson correlation coefficients of r = 0.925 and r = 0.902, respectively (Figure 2A and B). There was no significant correlation between the number of Fusion-derived neoantigens and the number of Fusions (Figure 2C, r = 0.452), which might be attributed to the fact that the number of neopeptides generated by each Fusion varies greatly.
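The selection criteria described in this section can be summarized as a single filter. The sketch below is an assumption-laden illustration rather than the TSNAdb pipeline: the record layout and field names are hypothetical, the DeepHLApan cutoff is assumed to be a strict "greater than 0.5", and the example scores are made up.

```python
from dataclasses import dataclass


@dataclass
class PHLAPrediction:
    """Hypothetical record holding the three tools' outputs for one pHLA."""
    peptide: str
    hla: str
    netmhcpan_rank: float      # percentile rank, in %
    netmhcpan_affinity: float  # predicted affinity, in nM
    mhcflurry_rank: float      # percentile rank, in %
    deephlapan_score: float    # posterior probability between 0 and 1
    gene_tpm: float            # expression of the source gene, in TPM


def is_candidate_neoantigen(p: PHLAPrediction) -> bool:
    """Apply all thresholds from the text; every one must be met."""
    return (
        p.netmhcpan_rank < 2.0
        and p.netmhcpan_affinity < 500.0
        and p.mhcflurry_rank < 2.0
        and p.deephlapan_score > 0.5   # assumed strict inequality
        and p.gene_tpm >= 1.0          # genes with TPM < 1 are removed
    )


# Made-up scores for illustration only.
kept = PHLAPrediction("GLATEKSRW", "HLA-B57:01", 0.4, 120.0, 0.6, 0.8, 15.2)
dropped = PHLAPrediction("GLATEKSRW", "HLA-B57:01", 0.4, 120.0, 0.6, 0.8, 0.3)
print(is_candidate_neoantigen(kept), is_candidate_neoantigen(dropped))
```

Requiring agreement among all tools trades sensitivity for specificity, which matches the stated aim of reducing false positives.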

Shared neoantigens generated from frequent somatic mutations
Currently, most neoantigen-targeted immunotherapies are personalized and expensive, which led us to ask whether we could identify shared neoantigens that can be applied to a wider range of tumor patients. Here, we analyzed the frequency of each neoantigen and obtained 16,913 neoantigens shared in at least two tumor samples (Table S1). Among the three SNV-derived neoantigens shared in more than 20 samples, the mutated peptides are generated from BRAF and KRAS, which are well-known cancer driver genes. The most frequent shared neoantigen derived from an SNV is the complex of HLA-B57:01 and the mutated peptide GLATEKSRW generated by BRAF V600E, which is present in 41 tumor samples. The complexes of HLA-A02:01 with the neopeptides RLMAPVGSV and SLLTQPSPA, generated by the frameshift mutation XYLT2 G529Afs*78, are the most frequent INDEL-derived neoantigens, and both appear in 31 samples (Table 2). The two most frequent Fusion-derived shared neoantigens are the complexes of HLA-A02:01 with the neopeptides ALNSEALSVV and ALNSEALSV, generated by the fusion of the TMPRSS2 and ERG genes, and both appear in 14 samples (Table S1). We believe that these shared neoantigens could be ideal drug targets for tumor immunotherapy, pending further experimental validation.
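The frequency analysis described above amounts to counting, for each pHLA, the number of distinct samples in which it is predicted. A minimal sketch follows; the record layout, sample identifiers, and function name are hypothetical, and the toy input is not TSNAdb data.

```python
from collections import defaultdict


def shared_neoantigens(records, min_samples=2):
    """Count distinct samples per (HLA, peptide) pair and keep the pairs
    seen in at least min_samples samples.

    records: iterable of (sample_id, hla, peptide) tuples.
    """
    samples_by_phla = defaultdict(set)
    for sample_id, hla, peptide in records:
        samples_by_phla[(hla, peptide)].add(sample_id)
    return {
        phla: len(samples)
        for phla, samples in samples_by_phla.items()
        if len(samples) >= min_samples
    }


# Toy input with hypothetical sample identifiers.
toy = [
    ("S1", "HLA-B57:01", "GLATEKSRW"),
    ("S2", "HLA-B57:01", "GLATEKSRW"),
    ("S2", "HLA-B57:01", "GLATEKSRW"),  # repeat within one sample counts once
    ("S3", "HLA-A02:01", "RLMAPVGSV"),
]
print(shared_neoantigens(toy))
```

Keying on the HLA-peptide pair rather than on the peptide alone reflects the definition of a neoantigen as a pHLA complex used throughout this work.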

Experimentally validated neoantigens
On the "Validation" page of TSNAdb v1.0, we only collected experimental data about wild-type pHLAs, which were difficult to identify as neoantigens, because binding data for mutated pHLAs were limited. With the development of clinical studies on neoantigen-based tumor immunotherapy, a large number of experimental results have provided a rich source for the functional confirmation of neoantigens. Here, we collected experimentally validated mutated pHLAs not only from several neoantigen databases (dbPepNeo [21], NeoPeptide [22], NEPdb [23], and Cancer Antigenic Peptide Database [24]) but also from published literature through data mining. For the neoantigens without gene or mutation information, BLAST was used to determine the mutated genes and the positions of the somatic mutations in the proteins. All collected data were further checked to determine whether the neoantigens were immunogenic or presented on the cell surface, and the collected neoantigens were divided into three tiers according to the experimental level. Neoantigens validated both as immunogenic and as presented on the cell surface were labeled tier 1, those validated only as immunogenic were labeled tier 2, and those validated only as presented on the cell surface were labeled tier 3. We collected 1856 experimental neoantigens, of which 67 were classified as tier 1, 1190 as tier 2, and 599 as tier 3. Most of the collected neoantigens were SNV-derived (22 were Fusion-derived, 125 were INDEL-derived, 23 were noncoding-derived, 33 were RNA splice-derived, and the remainder were SNV-derived), and they were enriched in several tumor types (430 belonged to lung cancer, 477 to skin cancer, 361 to B-cell lymphoma, and 123 to colorectal cancer).
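The tiering rule can be written as a small decision function. This is a sketch of the rule as stated in the text, with a hypothetical function name and boolean inputs; it also confirms that the reported tier counts add up to the total.

```python
def assign_tier(immunogenic, presented):
    """Map the two kinds of experimental evidence to a tier (1, 2, or 3).

    Returns None when neither kind of evidence is available, in which case
    the pHLA would not count as experimentally validated.
    """
    if immunogenic and presented:
        return 1
    if immunogenic:
        return 2
    if presented:
        return 3
    return None


# Tier counts as reported in the text.
tier_counts = {1: 67, 2: 1190, 3: 599}
print(assign_tier(True, False), sum(tier_counts.values()))
```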
The "Search" page contains the main page and two subpages, "Gene" and "HLA". On the main page of "Search", users can search for desired neoantigens by selecting the mutation type, tumor type, and gene. Compared with the "Detailed neoantigen" part of the "Browse" page, it provides more customized functions, such as sorting and searching. On the subpages "Gene" and "HLA", the detailed neoantigens of the selected genes or HLAs and their distribution are displayed after a search. The displayed pie charts are linked to the table below them, so that clicking on a section of a pie chart updates the detailed neoantigens shown in the table.
On the "Collected" page, all collected neoantigens have been validated by experiments to be presented on the cell surface or to be immunogenic, which differs from those in TSNAdb v1.0. The corresponding genes and mutations of the neoantigens are linked to CandrisDB, as on the "Shared neoantigens" page.

Discussion and perspectives
Neoantigens play an important role in tumor immunotherapy. A comprehensive and high-confidence neoantigen database would help meet the needs of clinical research. In TSNAdb v2.0, we predict neoantigens derived from more mutation types with stricter criteria, present the tissue-specific and gene-specific distribution of candidate tumor-specific neoantigens of TCGA tumor samples, and collect 1856 neoantigens that have been experimentally validated, making it the most systematic database of tumor-specific neoantigens at present. Compared with other databases, TSNAdb v2.0 has several advantages. First, TSNAdb v2.0 provides both high-quality predicted neoantigens and experimentally validated neoantigens, while most of the other databases except NEPdb provide only one of them. Compared with NEPdb, TSNAdb v2.0 provides more sources of predicted neoantigens and has richer forms of presentation. Second, TSNAdb v2.0 provides the analysis of shared neoantigens and links corresponding genes and mutations to CandrisDB to identify high-quality neoantigens, which other databases do not provide. Finally, TSNAdb will be updated continuously to provide constant service for related researchers and clinicians. We believe that it will contribute to neoantigen-based tumor immunotherapy.

However, neoantigens are derived not only from SNVs, INDELs, and Fusions but also from splice variants [15], the mitochondrial genome [16], and translated unannotated open reading frames [25]. It is necessary to predict all sources of neoantigens to construct a comprehensive neoantigen database. Limited by the difficulty of collecting other mutations and corresponding HLAs, we chose only three sources of neoantigens in this version of the database. In future updates of TSNAdb, we will add neoantigens from more sources and collect more validated neoantigens to construct a more comprehensive neoantigen database.

Figure 2 The relationship between the number of mutations and the number of neoantigens across 16 tumor types for the three mutation types
A. The relationship between the number of SNVs and SNV-derived neoantigens. B. The relationship between the number of INDELs and INDEL-derived neoantigens. C. The relationship between the number of Fusions and Fusion-derived neoantigens. The Pearson correlation coefficient is used for evaluation.

Figure 3 Screenshots of the "Browse" page of TSNAdb v2.0
A. The "Statistics" part of the "Mutation type" page. B. The "Neoantigen with mutation" part of the "Mutation type" page. C. The "Neoantigen with clinical information" part of the "Mutation type" page. D. The "Detailed neoantigen" part of the "Mutation type" page. E. The "Statistics" part of the "Tumor type" page. F. The "Neoantigen with mutation" part of the "Tumor type" page. G. The "Detailed neoantigen" part of the "Tumor type" page. H. The "Shared neoantigens" page.
Neoantigens are generated not only from SNVs but also from other mutations, such as INDELs and Fusions. Based on the analysis of different types of mutations in TCGA tumor samples, we added 137,130 INDEL-derived neoantigens and 11,093 Fusion-derived neoantigens to TSNAdb v2.0. The number of predicted neoantigens derived from SNVs was greater than that derived from INDELs and Fusions because SNVs were the most numerous of the somatic mutations collected. However, the average number of neoantigens derived from each SNV (0.38) was less than that derived from each INDEL (1.22) or each Fusion (0.88) (Table 1).

Figure 1 The neoantigen prediction process of TSNAdb v2.0
SNV, single nucleotide variant; INDEL, insertion/deletion; Fusion, gene fusion; HLA, human leukocyte antigen; TPM, transcripts per million reads.

Table 1 The distribution of mutations and neoantigens across 16 tumor types
Note: SNV, single nucleotide variant; INDEL, insertion/deletion; Fusion, gene fusion.

Table 2 The detailed information of shared neoantigens present in more than 20 samples
Note: fs*78 indicates that 78 amino acids have been changed after the frameshift site. HLA, human leukocyte antigen; fs, frameshift.