The Cancer Exome Generated by Alternative mRNA Splicing Dilutes Predicted HLA Class I Epitope Density

Several studies have shown that cancers actively regulate alternative splicing. Altered splicing mechanisms in cancer lead to cancer-specific transcripts different from the pool of transcripts occurring only in healthy tissue. At the same time, altered presentation of HLA class I epitopes is frequently observed in various types of cancer. Down-regulation of genes related to HLA class I antigen processing has been observed in several cancer types, leading to fewer HLA class I antigens on the cell surface. Here, we use a peptidome wide analysis of predicted alternative splice forms, based on a publicly available database, to show that peptides over-represented in cancer splice variants comprise significantly fewer predicted HLA class I epitopes compared to peptides from normal transcripts. Peptides over-represented in cancer transcripts are in the case of the three most common HLA class I supertype representatives consistently found to contain fewer predicted epitopes compared to normal tissue. We observed a significant difference in amino acid composition between protein sequences associated with normal versus cancer tissue, as transcripts found in cancer are enriched with hydrophilic amino acids. This variation contributes to the observed significant lower likelihood of cancer-specific peptides to be predicted epitopes compared to peptides found in normal tissue.


Introduction
Cancer-specific splice variants are of significant interest as they may be involved in pathogenesis and may further potentially be used as biomarkers and generate novel targets for cancer [1,2]. The human immune system is capable of responding to some of these cancer specific antigens, as first shown by a melanomaspecific antigen, MAGE-1, able to stimulate human T cells [3,4]. More generally, individuals with high or medium cytotoxic activity of peripheral-blood lymphocytes are further associated with a significantly lower risk of cancer, suggesting a role for natural immunological host defense mechanisms in cancer [5].
Alternative splicing can change the structure of mRNA by inclusion or skipping of exons, and this may alter the function, stability or binding properties of encoded proteins and thereby contribute to human diseases such as cancer [6]. In a study investigating alternative splicing events in ovarian and breast tissues affected by tumors it was found that about half of all splicing events in these tissues are altered in tumors, many of them due to exon skipping [7]. Similar trends have been observed in other types of cancers, e.g., in colon cancer and testicular tumor [8,9], as well as in gastric cancer, where genes showing differential expression between cancer cell lines and corresponding normal tissues were found [10]. In addition to cancer being involved in dysregulating pathways, thus contributing to changes in alternative splicing and gene expression controlled by these proteins [11], human leukocyte antigen (HLA) class I antigen processing components and HLA expression have also been shown to be downregulated in connection with cancer [12,13].
A study investigating alterations of HLA class I expression in 12 ovarian cancer patients reported low levels of HLA class I antigens in tumor cells from all patients. One patient-derived tumor cell line showed a complete haplotype loss, including the HLA-A2 locus [14].
These observations are interpreted as mechanisms adopted by tumors to escape immune surveillance and to avoid tumor cell recognition and destruction [15,16]. It has been suggested that elimination of growing tumors by the immune system may lead to selection of tumor variants that are efficient in avoiding immune system recognition [17]. There thus seems to be accumulative evidence for cancer being coupled to alternative splicing as well as to an efficacy in evasion from the immune system by downregulation and altering HLA expression. Most of the studies relating cancer-specific alternative splicing to altered immune system surveillance are, however, of limited size and in most cases anecdotal. Here, we wanted to investigate, in a large-scale study, if the alternative cancer exome already at the step of mRNA splicing contains a bias compared to normal transcripts in the set of possible HLA class I epitopes.

Results
Transcripts over-represented in cancer contain fewer predicted epitopes restricted by the three most common HLA class I supertypes The aim of this study was to investigate, using a large-scale data set, if peptidomes specific for cancer versus normal tissue have different properties related to altered degree of immune system surveillance. To do this, we constructed two sets of peptides, one over-represented by cancer tissue and one over-represented by normal tissue. Globally permutated versions of these sets were produced as described in Material and Methods. The global permutation destroys structural characteristics within the HLAbinding 9-mers, only maintaining global compositional properties. For comparison, we constructed locally permutated normal and cancer sets by permutating each peptide separately, thus preserving the local amino acid composition of each peptide. To investigate immune-related properties, potential epitopes covering all 12 HLA class I supertypes were predicted using NetMHCpan. For each supertype, we calculated the percentages of predicted epitopes for the six peptide data sets: normal, normal globally and normal locally permutated, cancer and cancer globally and cancer locally permutated.
It is well known that some HLA class I supertype representatives are more common than others. It is therefore expected that for the less frequent HLA alleles, the results are more likely to include noise. The source of our data set, the ASTD database, is to a large extent originating from EST data without HLA specific information. EST data is mostly based on Caucasian Europeans [18]; therefore we can safely assume that the more common HLA types in the European population are also more common in our dataset. The HLA allele frequencies were obtained from the dbMHC database [19]. Approximate numbers of expected phenotype per supertype in the European population are given in Table 1.
The three most common supertype representatives in the European population are HLA-A*02:01, HLA-A*01:01 and HLA-A*03:01. For these three supertype representatives, the transcripts associated with normal tissue have a significantly higher percentage of predicted epitopes than transcripts over-represented in cancer. Figure 1 shows the observed numbers, in percentages of predicted epitopes per 9-mers, for the different data sets for these three most common supertype representatives. All observed differences between normal and cancer tissues shown in Figure 1 are significant (p,0.006, 2-sample test for equality of proportions).
For most HLA class I supertypes, cancer transcripts contain fewer predicted epitopes Further, the percentage of predicted epitopes for permutated and not-permutated sequences for all 12 supertype representatives is shown in Table 2. Here, we observed a similar tendency as compared to our observation for the three most common supertypes in the European population. For non-permutated sequences, seven out of the twelve supertype representatives (HLA-A*01:01, HLA-A*02:01, HLA-A*03:01, HLA-A*24:02, HLA-A*26:01, HLA-B*15:01 and HLA-B*58:01) had a significant lower fraction of predicted epitopes in sequences assigned to cancer pathology. A statistical significant difference, where cancerassociated peptides contained more predicted epitopes was, on the other hand, only observed for one supertype representative, namely HLA-B*27:05.
When analyzing permutated sequences, similar results were observed. Only one supertype representative (HLA-B*40:01, locally permutated) had significantly more predicted epitopes in the permutated cancer sequences than in the permutated normal sequences. On the other hand, permutated, normal sequences had consistently for both the local and global permutated sets more predicted epitopes for seven supertype representatives (HLA-A*01:01, HLA-A*02:01, HLA-A*03:01, HLA-A*24:02, HLA-A*26:01, HLA-B*15:01, HLA-B*58:01). For these seven supertype representatives, the difference between normal and cancer data sets is significant in the permutated as well the non-permutated data sets. The observation that cancer transcripts contain fewer predicted epitopes for most HLA class I supertype representatives, is stable, when different thresholds for the prediction of potential epitopes are applied (data not shown).

HLA motif and amino acid composition biases
The relative difference in predicted epitope density between normal and cancer is, for our previously defined most common HLA alleles, relatively stable. Also, the difference in epitope density is largest when comparing non-permutated to globally permutated peptide sets. For HLA-A*02:01, a noticeable decrease of predicted epitopes is observed when comparing normal and cancer non-permutated peptides to normal and cancer permutated peptides. As seen from Table 2 and Figure 1, the difference in percentage of epitopes is the largest when comparing the nonpermutated sequences to the globally permutated sequences (normal: 0.83 vs 0.73, cancer: 0.77 vs 0.69). For HLA-A*01:01, the percentage of epitopes in non-permutated versus permutated sequences appears to be relative stable (normal: 0.82 vs 0.80, cancer: 0.77 vs 0.76), whereas permutated HLA-A*03:01 sequences have more predicted epitopes than the corresponding nonpermutated sequences (normal: 0.94 vs 1.01, cancer: 0.91 vs 0.96). For these three supertype representatives, the percentage of predicted epitopes in locally permutated peptides always falls between the respective percentages for non-permutated and globally permutated sequences. Locally permutated peptides preserve only local amino acid composition, and globally permutated peptides have their local structural properties destroyed and preserve only global amino acid composition. These observations indicate that both global and local structural amino acid properties are factors that define the observed differences in the epitope densities between the normal and cancer peptidome.
An analysis of relative amino acid composition was performed for all over-represented 9-mers associated with normal and cancer. We found that hydrophilic residues are more common in polypeptides from cancer transcripts than from the normal polypeptides. The relations of N/C ratios compared to the hydrophilicity scale of amino acids by Hopp-Woods, the hydrophobicity scale by Wimley-White as well as the mean ranking of amino acids according to the frequency of their occurrence for 38 published hydrophobicity scales are shown in Figure 2. In Figure 2, residues are more common in cancer if N/C is smaller than 1. Hydrophilic residues are marked black. The Hopp-Woods and Wimley-White scales correlate strongly with the N/C ratios with a Spearman rank correlation coefficient of 20.72 and 0.78, respectively. The mean ranking amino acid scale is correlated with a correlation coefficient of 20.65. All three correlation coefficients are significant (P-value ,0.003, exact permutation test). No correlation was found for other amino acid properties like mass, surface area or volume (data not shown).
It is striking to observe that all strong hydrophilic amino acids (KPRQ, Hopp-Woods scale) are enriched in sequences associated with cancer. A similar observation is made for Wimley-White scale: We identified seven amino acids significantly more common in cancer (APERKDQ). Six of these (all except A) are within the seven most hydrophilic amino acids based on the Wimley-White scale. A reversed trend is found for hydrophobic amino acids. The Each bar shows the percentage of predicted epitopes per 9-mer in the respective set. Each set consists of peptides that are either over-represented in normal or cancer tissue. Globally permutated or locally permutated version of the peptide sets were constructed as described in Materials and Methods. All observed differences between cancer and normal tissues are significant (p,0.006, 2-sample test for equality of proportions). doi:10.1371/journal.pone.0038670.g001 Table 2. Epitopes per set for all supertype representatives. % epitopes % epitopes globally permutated % epitopes locally permutated top significant amino acids classified by both Hopp-Woods and Wimley-Scott as hydrophobic (WFICM) are all more common in sequences associated with transcripts from normal tissue. Based on these findings, one could suggest an explanation for the difference in epitope density between the normal and cancer peptidome. The binding motifs for the 3 most frequent supertype representatives are shown in Figure 3. Out of the four most preferred amino acids at the HLA-A*02:01 anchor positions, three amino acids (VMI) are enriched in normal transcripts, whereas only one (L) is as common in normal as in cancer. This leads to the conclusion that at least part of the observed differences in percentage of predicted epitopes in normal versus cancer transcripts are due to amino acid composition. The same tendency is found for HLA-A*01:01. The two most frequent amino acids in the motif (YT) are also more often found in normal tissue, whereas S is neutral and the next common amino acid, D, is more common in cancer. The most frequent amino acid for HLA-A*03:01(K) is slightly more common in cancer, whereas the second-next frequent (Y) is, due to a stronger preference to fit peptides from normal tissues, shifting the bias towards amino acids more common in splice variants associated with normal tissue. For all three motifs, we further calculated average weighted biases, based on N/C ratios and amino acid frequencies (see materials and methods). The weighted biases were calculated for both the respective 5 most frequent amino acids per motif as well as all 20 amino acids. For all three motifs we observed an overall preference for amino acids found in our normal tissue set.

Discussion
Alternative splicing of mRNA transcripts is an important mechanism for generating genomic complexity and has been shown to differ between cancer and the corresponding normal tissues [1,8,9]. In addition, cancers in some cases downregulate HLA class I antigen-processing components and HLA class I expression to avoid detection by the immune system. These observations led us to investigate whether transcripts found in cancer tissue share characteristics that would reduce immune system recognition. Here, we have carried out a large-scale analysis aiming at identifying immune system related imprints that can differentiate cancer from normal transcripts. Based on ASTD database, a database providing predicted splice forms, we identified two peptide data sets; one associated with transcripts over-represented in cancer and one associated with transcripts over-represented in normal tissue. Using state-of-the-art immunoinformatics prediction tools, we next analyzed the two data sets for differences in terms of likelihood of being presented on prevalent HLA class I molecules, and hence potential for activating the immune system.
We found that peptides, which due to alternative splicing are expressed in cancer tissue, contain fewer predicted epitopes restricted by the three most common HLA class I alleles than peptides expressed in normal tissue. Using globally permuted data sets we consistently, for the three most common HLA class I alleles, found that the observed loss in epitope density in the cancer peptidome is maintained also for the permuted data sets. This strongly indicates that differences in amino acid composition between peptides from alternatively spliced normal and cancer transcripts are the driving force of the reduced predicted epitope density.
The reason for the observed change in frequency of specific amino acids in proteins associated with cancer as compared to normal tissue is unknown, but the phenomenon has previously been observed in studies aiming at identifying biomarkers for early stage detection of cancer: In a recent study, the levels of alanine, isoleucine, leucine and valanine were found to be increased in the pancreases of rats with pancreatic cancer as compared to samples from rats with chronic pancreatitis and healthy rats [20]. In another study, the levels of N-methylalanine and lysine were found to be significantly increased in the plasma from pancreatic cancer patients, while the level of glutamine and phenylalanine was found to be decreased [21]. These studies identified differences in amino acid composition in a single cancer type based on blood plasma and tissue samples. We, in contrast, analyze over-represented cancer peptides in general. As to be expected, the findings regarding amino acid concentration reported in this study are not concurrent with those of the single cancer type studies.
A possible explanation as to why we observed fewer predicted epitopes in peptides, which due to alternative splicing are expressed in cancer, could be that the host's immune system restricts the cancer exome. In that case, pressure from the immune system disfavors cancer cells that present new epitopes at the cell surface. An alternative explanation -which does not exclude the previous explanation -takes as starting point the observed change . N/C ratio is the ratio of observed frequencies of the respective amino acids in polypeptides of overrepresented transcripts from normal and cancer tissues. If the N/C value .1, the amino acid is more common in normal tissue; If the N/C value ,1, the amino acid is more common in cancer. Green bars refer to more hydrophobic amino acids, whereas black bars refer to more hydrophilic amino acids. All N/C ratios larger or smaller than 1 are significant (p,0.001, calculated using the Wilson score [45] and Bonferroni corrected). doi:10.1371/journal.pone.0038670.g002 in amino acid frequency, especially the increase in hydrophilic amino acids in cancer proteins.
It has previously been shown that missense mutations in the BRCA domain of high-risk breast and ovarian cancer patients frequently target strongly hydrophobic amino acids [22]. Further, it has been suggested that the stabilization of a protein structure is to a large part due to the hydrophobic effect [23]. Accordingly, the increase in hydrophilic amino acids has a destabilizing effect on protein structure, which is in concordance with the protein loss-offunction that is correlated with cancer progression. This is exemplified by a study concerning inherited missense mutations of the tumor suppressor gene, BRCA-1, which may predispose to breast or ovarian cancer [24]. In this study, it was found that the mutations predominately target conserved hydrophobic amino acids that are responsible for folding and stability. Since, in particular, the most common HLA class I allele, A*02:01, prefers hydrophobic amino acids at the anchor positions, an increase in hydrophilic amino acids will inevitably lead to fewer predicted epitopes. The reduction in epitope density in peptides associated with cancer might therefore be an intrinsic property of proteins that are destabilized by a decrease of hydrophobic amino acids as part of the progression to cancer. We are, however, not aware of studies that show a general increase of hydrophobic amino acids throughout different forms of cancer.
A bias of fewer potential epitopes due to cancer was previously shown by Wiedenfeld et al [25]. Peptides caused by missense mutations of p53 were shown to have a decreased fit to the HLA-A*02:01 motif. All predicted variants of the peptides were either from patients with other alleles or the allele was lost during tumorigenesis. The decreased fit to the HLA motif due to mutations is in coherence with our study investigating differences in epitope density due to alternative splicing. To our knowledge, this is the first study indicating that alternatively spliced cancer transcripts tend to express fewer potential epitopes than alternatively spliced transcripts found in normal tissue. The identified difference in amino acid composition towards hydrophilic amino acids in the alternative spliced cancer exome is a possible explanation for the bias in potential HLA class I epitopes. The preference for hydrophilic amino acids at the step of alternative mRNA splicing could support the development of cancer by providing it with the possibility of evading the host's immune system. In this case by leading to fewer potential HLA class I epitopes presented at the cell surface.

Data extraction from the ASTD database
The Alternative Splicing and Transcript Diversity database (ASTD) provides access to a collection of alternative splice events and transcripts of genes from human, mouse and rat [26]. The aim of the database is to analyze the mechanisms of alternative splicing on a genome-wide scale. It integrates a computational pipeline for detection and characterization of isoform splice patterns as well as alternative introns and exons. The database contains predicted transcripts, generated by mapping expressed sequence tags (ESTs) to genome sequences. Our study is based on ASTD version v1.1 build 9 (accessible at ftp://ftp.ebi.ac.uk/pub/databases/astd/). The database covers 14,194 human genes and lists 50,581 unique transcripts not covered by Ensembl genes. Based on related evidences from cDNA libraries, many of these transcripts are tagged with pathology information. The pathology information is given as eVOC ontologies, which is a controlled vocabulary for unifying gene expression data [27]. As an alternative to the ASTD database, we would have liked to use RNA sequence data, but could not identify any usable database, providing genome-wide coverage of potential transcripts, together with pathology information.
Two data sets were generated based on annotated pathology information. All transcripts tagged with the information of being expressed in normal tissue were assigned to subset N. This subset consisted of 30,739 transcripts derived from 11,980 genes. A second subset, C, with transcripts related to cancer, consisted of 27,967 transcripts derived from 10,730 genes.
The cancer subset consists of all transcripts tagged with eVOC terms related to cancer; that is being a subgroup of tumor in the eVOC ontology hierarchy (Table 3). Several eVOC terms can be associated to the same transcript.
For our analysis, we were interested in transcripts uniquely associated to normal tissue or to one or more of the cancer eVOC terms. Two new subsets consisting of transcripts only associated to either normal or cancer eVOC terms were created. Out of 30,739 transcripts associated to normal, 16,566 were uniquely associated with normal tissue, due to ASTD database, and not with cancer (unique N set). The subset of transcripts uniquely associated with cancer (unique C set) consists of 13,794 transcripts (see Table 4). Transcripts covered by each data set are unique for either normal tissue or cancer as defined by the ASTD database. The ASTD database does not provide pathology information for all transcripts nor lists all potential tissue types or pathologies. Accordingly, we refer to our sets of transcripts uniquely associated to either normal or cancer as over-represented in either normal or cancer tissue.

Translation to proteins
All transcripts assigned to either normal or cancer pathology were translated to their respective protein sequence using Virtual Ribosome [28]. The longest ORF among all three reading frames was chosen as the translated protein sequence. The protein sequence and corresponding transcript were discarded if no ORF was found or if the resulting protein sequence was shorter than 9 amino acids. The threshold of 9 amino acids was chosen as we subsequently apply the epitope prediction on 9-meric peptides, although we are aware that proteins this small might not be functional. Applying this filter resulted in a normal set of 16,490 transcripts and a cancer set of 13,721 transcripts.

Generation of unique 9-mers
All proteins assigned to either normal or cancer pathology states were divided into overlapping 9-meric peptide sequences. Peptide sequences that were found in both groups were removed, leading to the creation of two sets of unique 9-mer peptides. There are 1,856,231 unique 9-mers in the normal group (N-peptidome) and 1,684,028 unique 9-mers in the cancer group (C-peptidome). Note that normal and cancer sets do not consist of complete proteins; they only consist of unique 9-meric peptides not found in the other set. Permutated sets of both the unique N and unique C set were created. For each set, one locally permutated and one globally permutated set of 9-meric peptides was generated. The local permutated sets were constructed by permuting each 9-mer, thus keeping the amino acid composition within each 9-mer fixed. The global permutated sets were made by randomly constructing new 9-mers out of all amino acids within each set. This preserves the overall amino acid composition within the unique N and C sets, local properties within each 9-mer are, however, destroyed.

Prediction of potential HLA class I epitopes
The prediction method NetMHCpan-2.4 [29,30] was used for predicting potential epitopes for the 12 HLA class I supertypes [31]. The NetMHCpan-2.4 method was trained on an experimentally validated data set of more than 100,000 quantitative peptide -HLA class I interactions covering more than 100 HLA molecules and has been evaluated as the best pan-specific method for HLA peptide binding in a large benchmark study [32]. A general accepted threshold for binding is a rank score of 1% [33,34] (binding strength falling within the top 1% compared to a large set of random natural peptides), which is also the threshold, used throughout this study.
The percentages of potential epitopes per 9-mer for all 6 sets (normal 9-mers, normal globally permutated 9-mers, normal locally permutated 9-mers, cancer 9-mers, cancer globally permutated 9-mers and cancer locally permutated 9-mers) were calculated. P-values for difference in percentage of predicted epitopes between normal and cancer 9-mers for non-permutated and permutated subsets were calculated by a 2-sample test for equality of proportions and adjusted for multiple testing (Bonferroni correction).

Amino acid scales
The amino acid abundance for normal tissue compared to cancer tissue was determined based on all unique 9-mers in the two data sets. The relative frequencies for all amino acids in both the normal and cancer sets were calculated. Observed ratio of frequencies (N/C) of amino acids among normal and cancer tissues was correlated with Hopp-Woods hydrophilicity [35] and Wimley-White hydrophobicity scale [36] values. The ratio was further correlated with a mean ranking scale per amino acid as published by Simpso]. According to Simpson [37], the scale is based on the mean ranking of amino acids according to the frequency of their occurrence at each sequence rank for 38 published hydrophobicity scales [38]. Other investigated scales are average volume of buried residues [39,40], van der Waals volume [41] and total accessible surface area [42].
Bootstrapping was applied to test if an amino acid property scale is correlated with enriched expression of residues in either unique normal or cancer 9-mers. For each scale, the Spearman rank correlation coefficient was calculated and the significance of the correlation was estimated using exact permutation test.  HLA motif bias HLA binding motifs were generated from NetMHCpan-2.4 training data. Position specific weight-matrices were calculated using sequence weighting and correction for low counts [43]. Sequence logos were visualized as described by Schneider and Stephens [44], where each letter represents its proportional frequency of the corresponding amino acid at that position. Based on amino acid frequencies and observed ratio of frequencies (N/C) of amino acids among normal and cancer tissues, we calculated for the HLA-A*A02:01, HLA-A*A01:01 and HLA-A*A03:01 motifs their respective overall bias towards either our defined normal or cancer peptide set. This was done for all 20 amino acids and for the 5 most frequent amino acid occurrences per motif. Per position, the tendency to fit preferably to either the normal or the cancer peptidome was calculated by summation of the respective amino acid frequencies multiplied with the related N/C values for all 20 amino acids. Likewise the calculation for the 5 most frequent amino acid occurrences per motif, where only the subset of the motifs 5 most frequent amino acid occurrences at this position is considered. Similar to the N/C ratio, a motifs bias to preferably fit to our normal set is given, if the average over all position for a motif is larger than 1.