Somatic mutations render human exome and pathogen DNA more similar

Immunotherapy has recently shown important clinical successes in a substantial number of oncology indications. Additionally, the tumor somatic mutation load has been shown to associate with response to these therapeutic agents, and specific mutational signatures are hypothesized to improve this association, including signatures related to pathogen insults. We sought to study in silico the validity of these observations and how they relate to each other. We first addressed the question whether somatic mutations typically involved in cancer may increase, in a statistically meaningful manner, the similarity between common pathogens and the human exome. Our study shows that common mutagenic processes like those resulting from exposure to ultraviolet light (in melanoma) or smoking (in lung cancer) increase, in the upper range of biologically plausible frequencies, the similarity between cancer exomes and pathogen DNA at a scale of 12 to 16 nucleotide sequences (corresponding to peptides of 4 – 5 amino acids). Second, we investigated whether this increased similarity is due to the specific mutation distribution of the considered mutagenic processes or whether uniformly random mutations at equal rate would trigger the same effect. Our results show that, depending on the combination of pathogen and mutagenic process, these effects need not be distinguishable. Third, we studied the impact of mutation rate and showed that increasing mutation rate generally results in an increased similarity between the cancer exome and pathogen DNA, again at a scale of 4 – 5 amino acids. Finally, we investigated whether the considered mutational processes result in amino-acid changes with functional relevance that are more likely to be immunogenic. We showed that functional tolerance to mutagenic processes across species generally suggests more resilience to mutagenic processes that are due to exposure to elements of nature than to mutagenic processes that are due to exposure to cancer-causing artificial substances. These results support the idea that recognition of pathogen sequences as well as differential functional tolerance to mutagenic processes may play an important role in the immune recognition process involved in tumor infiltration by lymphocytes.


Introduction
Recent clinical advances firmly establish the role of immunotherapy (in particular, checkpoint inhibition targetting the CTLA4 and PD1/PD-L1 pathways [1]) in the treatment of cancer. However, the rates of response vary by indication, outlining the important role of identifying the patients most likely to respond [2][3][4][5]. In parallel, the analysis of the data in large scale genomic efforts including The Cancer Genome Atlas (TCGA [6]) has identified universal characteristics of the tumor and its environment that ellicit potential recognition by the host immune system. In particular, somatic mutational load as inferred by DNA sequencing [7,8] and cytolytic infiltrate as inferred by immunohistochemistry or RNA sequencing [9] have emerged as hallmarks of an immune-active tumor enviroment. It is thus important to understand the causality and mechanism of action that drives the heterogenous composition of the tumor and its environment and consequently the heterogeneity of response to immunotherapy, in order to select the right patients for treatment, potential combinations, and potential for early intervention.
Multiple recent studies have suggested a strong causal link between the mutational burden of the tumor and clinical response to immunotherapy across multiple indications including Melanoma [10,11], Non Small Cell Lung Cancer [12], Bladder cancer [13] and Colorectal cancer [14]. In these studies, a strong relationship between neoantigen load (the number of mutations with immunogenic potential) and response to immunotherapy has been identified. Importantly, each of these indications are characterized by distinct mutagenic processes that result in abundant neoantigen load [7,8]: UV light exposure in Melanoma, smoking in Non Small Cell Lung Cancer, APOBEC activation in Bladder cancer, and MMR defficiency in MSI-h Colorectal cancer. Whether particular mutations or mutational patterns preferentially induce an immunologic phenotype remains an open question [10,11]. However, several hypotheses have recently been put forward, including the presence of mutations in particular genes [15,16], or the presence of a transversion signature related to smoking [12]. In particular, Snyder et al. [11] put forward a hypothesis linking cancer exomes with patterns present in common pathogens. Namely, their results with exome analysis of Melanoma patients treated with Ipilimumab, a CTLA4 inhibitor, suggest that somatic mutations in cancer genomes that lead to tetrapeptides similar to those found in common pathogens are more likely to elicit a response to the therapy than common somatic mutations. This association is presumably driven by the innate ability of significant portions of the adaptive immune repertoire to recognize such pathogens.
We took an in-silico approach to evaluate the impact of certain mutagenic processes on the similarity between cancer exomes and pathogen DNAs. Somatic mutations are an inherent natural process related to cell division and aging which in some instances is exacerbated by mutagenic factors. We simulated such mutagenic processes using mixtures of mutational signatures with empirically derived mixing parameters. We used a simple similarity metric between the mutated exome and common pathogen exomes to estimate changes in overall potential immunogenicity of cancer exomes as compared to the normal exome. We considered simulations of mutagenic processes that yield most mutated cancer exomes, namely ultra-violet (UV) light (Melanoma), smoking (Non Small Cell Lung Cancer), and APOBEC activation (Bladder cancer) [7,9]. Our results suggest that, in the upper range of biologically plausible mutation rates, mutagenic processes resulting from exposure to these common mutagens lead to cancer exomes that are more similar pathogen DNAs at a scale of 12 to 16 nucleotides. These changes are subtle but nevertheless statistically significant and are particularly important in the range of peptide sizes that are relevant for epitope presentation in the human MHC mechanism; MHC presentation typically involves peptides with lengths between 8-18 nucleotides (8-13 for class I MHC and 13-18 for class II MHC [17]).
However, our results also suggest that the increased similarity need not be caused by the specificity of the mutation distribution. Depending on the pathogen, uniformly random mutations (at the same rate) may result in equal increased similarity. Finally, we show that increasing mutation rate generally results in increased similarity between cancer exomes and pathogen DNAs. These conclusions suggest that mutagenic processes might act as a mechanism of pressure that models the mutational spectra observed in tumors by increasing recognition from the host immune system.
Opposite to the aforementioned effect that increases the likelihood that a cancer exome is recognized by the immune system, an antagonist mechanism of pressure on mutational landscape stems from tolerance by the immune system to natural mutagenic processes. To that extent, we establish that exomes across species are generally more resilient, in terms of a functional point of view related to the synonymity of amino-acid changes, to mutagenic processes that are due to exposure to elements of nature than to mutagenic processes that are due to exposure to cancer-causing artificial substances. In particular, we observe that the functionality of the genetic code (allocation of codons to amino-acids) is more resilient to UV light than smoking mutagenic processes at a fixed rate. This suggests the possibility that there are different tissue-dependent evolutionary tolerance levels, modulated by the pathogen recognition apparatus in terms of both immune recognition and cancer development, which for example reflect in the much higher mutational loads and immune infiltrate in Melanoma compared to Lung cancer [9].

Methods
We sought to assess whether certain mutagenic processes result in somatic alterations that increase the similarity of the mutated human exome with selected pathogens. Accordingly, we first defined a pairwise similarity metric among DNA sequences of different length and evaluated the similarity between pathogens and the normal human exome. Second, we simulated mutations resulting from different mutagenic processes at different mutation rates acting on the human exome and evaluated the consequent change in similarity of the mutated human exome with respect to the pathogen exomes. Third, we investigated the resiliency of exomes (human exome and model organism exomes) in terms of maintained functionality of the resulting amino-acids and compared the sequences of amino acids of the normal and mutated exomes.

Data and computing resources
We obtained the human normal exome from GRCh38 http://www.ensembl.org/Homo_ sapiens/Info/Index.
We considered simulations of mutational signatures resulting from ultra-violet (UV) light (specific to Melanoma), smoking (specific to Non Small Cell Lung Cancer (NSCLC)), and APOBEC activation (specific to Bladder cancer). These simulations were based on the data from [8, Supplementary information, Table S2] restricted to the set of patients with Melanoma  cancer, NSCLC, and Bladder cancer. For simulations we used Python 2.7.6 (libraries random, numpy, and scipy.stats) and ran programs on a shared server with 8 CPUs and 128GB memory.

Pathogen DNA vs. human exome and MHC mechanism
To quantify the similarity between a pathogen DNA, denoted by x, and the human exome, denoted by y, we considered the following similarity score. For a given integer ℓ � 1, the similarity score, denoted by s ℓ (x, y), corresponds to the relative proportion of length-ℓ strings in the pathogen DNA that also appear in the human exome at least once, that is Here L denotes the length of the pathogen DNA, . . . ; x iþ'À 1 denotes the pathogen DNA substring starting at position i and ending at position i + ℓ − 1, and "�" denotes string inclusion. In particular, s ℓ (x, y) = 1 corresponds to the case where all length-ℓ strings in the pathogen DNA also appear in the human exome and s ℓ (x, y) = 0 corresponds to the case Each curve represents the matching probability (similarity score) s ℓ (x, y) between a pathogen DNA x and the human exome y, as a function of the subsequence length ℓ. The "Random" curve refers to the average score of a randomly and uniformly generated "pathogen" DNA sequence.
https://doi.org/10.1371/journal.pone.0197949.g001 Effects of cancer specific mutations on a normal exome modeled as a cancer channel. Cancer exomeỹ ¼ fỹ 1 ;ỹ 2 ; . . . ;ỹ G g is obtained from normal exome y = {y 1 , y 2 , . . ., y G } through a cancer specific probabilistic transformation P c;r ðbjaÞ which assigns to each nucleotide α the probability of being mutated to nucleotide β. This transformation depends on both the mutation distribution specific to cancer c and the mutation rate ρ.
https://doi.org/10.1371/journal.pone.0197949.g002 The height of the red bars represents the proportions of pathogen DNA that are more similar to cancer exomes than to normal exome (one-sample t-test results with p-value � 0.01). The height of the blue bars represents the proportion of pathogens whose DNA are more similar to cancer exomes than to exomes with equal mutation rate but uniformly distributed mutations (two-sample one-sided t-test p-value � 0.01).
https://doi.org/10.1371/journal.pone.0197949.g003 where the pathogen DNA and the human exome have no length-ℓ string in common. Observe that s ℓ (x, y) can be interpreted as the probability that a randomly and uniformly picked lengthℓ string in the pathogen DNA also appears in the human exome. Accordingly, we often refer to s ℓ (x, y) as the matching probability. Finally, notice that s ℓ (x, y) does not count multiplicity, i.e., strings that appear only once in the human exome and strings that appear multiple times in the human exome are note distinguished.
In Fig 1, each curve represents the matching probability s ℓ (x, y) for a specific pathogen DNA x and the normal human exome y, for ℓ 2 {9, 10, . . ., 18}. To benchmark these scores we also considered the matching probability with respect to a randomly and uniformly generated "pathogen" sequence, where each nucleotide is equally likely to occur. The average matching probability with respect to such a sequence is represented by the "Random" curve in Fig 1 and turns out to be independent of its length L. This curve is indistinguishable from the 95% confidence interval corresponding to a randomly generated sequence. Supporting material for Fig 1 is deferred to Section A.1 in the Appendix. We make the following observations: • For all pathogens the similarity score is equal to one for ℓ � 10, that is length ℓ � 10 subsequences of the pathogen DNAs all appear in the human exome as well.
• The similarity scores are non-zero for all pathogens up to length ℓ = 20. At ℓ = 21 the similarity scores is zero for the Ebola virus, the Measles virus, and the Dengue virus.
• From ℓ = 10 there is a steep decrease in the similarity scores, down to less than 15% for ℓ = 15. A closer look at the data (see Tables in the Appendix A.1) reveals that, for all pathogens, the sharpest relative drop of the similarity score occurs from ℓ = 12 to ℓ = 13 or from ℓ = 13 to ℓ = 14.
• The differences in score across pathogens is maximal at ℓ 2 {12, 13}. These in-silico observations are in line with the concept that 4 -5 amino acids are enough for the presentation machinery in terms of both diversity of possible sequences (20 4 -20 5 ) and differentiation of self from foreign sequences in the MHC machinery. Namely, this length is strikingly similar to the length of peptides studied in the signature determined by [11].

Impact of somatic mutations on pathogen DNA and human exome similarity score
To assess the impact of somatic mutations on pathogen DNA and human exome similarity score and identify the roles of mutation distribution and mutation rate we proceeded as follows: • Normal exome vs. cancer exome: we investigated whether cancer somatic mutations render pathogen and human exome more similar, and whether random mutations alone, with uniform distribution across mutations, would produce the same results as (typically nonuniform) cancer-dependent mutations, at the same mutation rate.
• Impact of mutation rate: we investigated whether a higher mutation rate renders pathogen DNA and human exome more similar.
Central to our investigation is a notion of cancer channel described next. Cancer channel. We simulated the changes induced to the normal exome by cancer specific mutagens in a probabilistic way. The cancer exomes were generated from the normal exome by using cancer-dependent mixtures of mutational signatures with empirical weights derived from data in [8]. Note that even if a cancer typically exhibits a dominant mutational signature, the simulated mutagenic process results in a more realistic combination of such signatures. The similarity scores of the normal exome and cancer exome were then computed for each pathogen. To formalize our analysis, we used concepts from information theory, in particular related to communications over a noisy channel. To a given cancer and mutation rate we associated a transformation, referred to as "cancer channel," which mimics the typical effects of the mutagenic process that are specific to the cancer at the given mutation rate. Analogously to a communication channel that alterates a transmitted message because of noise (see, e.g., [18]), a cancer channel alterates a DNA sequence because of somatic mutations. Given a particular cancer c and a mutation rate ρ the cancer channel assigns to each nucleotide α the probability P c;r ðbjaÞ of being mutated into nucleotide β. This probability was derived using data from [8, Supplementary information, Table S2] (see Appendix A.2 in this paper).
To obtain a cancer exomeỹ we "passed" the normal human exome y through cancer channel P c;r ð�j�Þ as shown in Fig 2. Specifically, the cancer exomeỹ was generated from y so that the probability to obtainỹ ¼ fỹ 1 ;ỹ 2 ; . . . ;ỹ G g from normal exome y = {y 1 , y 2 , . . ., y G } was given by Cancer exomes and pathogen DNA Normal vs. cancer specific and random mutations. For given pathogen x, cancer c, and mutation rate ρ we performed two tests. In Test 1, we evaluated the statistical significance of the effect of cancer somatic mutations in making human exome more similar to pathogen DNA sequences. In Test 2, we compared cancer somatic mutations and random mutations in making the human exome more similar to pathogen DNA sequences. Both tests were peformed for ρ-values of 0.0005, 0.001, and 0.01. The lowest mutation rate was chosen to be 0.0005 as it represents a good compromise between biological and statistical relevance. It lies in the upper range of the mutation rates observed in actual cancer samples [8] and in the lower range for statistical relevance-see next subsection.
Test 1: For each ℓ 2 {10, 11, . . ., 18} we independently generated 1000 cancer exomes fỹg from the normal human exome y and computed the corresponding similarity scores fs ' ðx;ỹÞg. P-values were computed for comparing the mean of fs ' ðx;ỹÞg against s ℓ (x, y) using a one-sided t-test with a null hypothesis that the true mean of s ' ðx;ỹÞ is no larger than s ℓ (x, y). Test 2: We replaced the cancer channel by a "random channel" which produced mutations at the same rate but in a uniform (1/3, 1/3, 1/3) manner. For each ℓ 2 {10, 11, . . ., 18} we independently generated 1000 exomes fŷg by passing the normal human exome y through the random channel and computed the corresponding similarity scores fs ' ðx;ŷÞg. P-values were computed for comparing the mean of fs ' ðx;ŷÞg against the mean of fs ' ðx;ỹÞg (obtained in Test 1) using a two-sample one-sided t-test with a null hypothesis that the true mean of s ' ðx;ỹÞ is no larger than the true mean of s ' ðx;ŷÞ-note that directly computing the true mean of s ' ðx;ỹÞ overỹ is impossible as it amounts to computing a sum over all � 4 10 7 possible cancer exomes, and similarly for the mean of s ' ðx;ŷÞ.
In Fig 3, each histogram refers to a particular cancer and mutation rate. Red bars refer to Test 1 and blue bars refer to Test 2. Bar height represents, for any given subsequence length ℓ 2 {10, 11, . . ., 18}, the proportion of pathogens (out of the 8 considered in this paper) for which the p-value is � 0.01. Related data can be found in the tables of the Appendices A.3, A.4, and A.5 for ρ = 0.0005, ρ = 0.001, and ρ = 0.01, respectively. In these tables, the second column refers to s ℓ (x, y), the third column gives a 95% confidence interval for s ' ðx;ỹÞ, the fourth column gives the p-value for Test 1 and the fifth column gives the p-value for Test 2. We make the following observations: • Referring to Test 1 (red bars in Fig 3), all three mutagenic processes render the human exome more similar to all pathogen DNA sequences at all ρ 2 {0.0005, 0.001, 0.01} and ℓ 2 {12, . . ., 16}. For ℓ � 11 or ℓ � 17 the effect of the mutagenic processes on the similarity scores are less conclusive. This suggests that the increase of similarity is particularly relevant in the range of peptide sizes (4 -5 amino-acids) that are relevant for epitope presentation in the human MHC presentation. Note, however, that the changes in similarity are small, typically � 1% (see tables in Sections A.3-A.5, Columns 2, 3).
• Whether the above change of similarity is due to the specificity of the mutation distribution or random mutations trigger the same effect depends on the pathogen, the length, and the Impact of mutation rate. To assess the impact of mutational rate on the similarity between pathogen DNA and human exome, for any given mutagenic process, pathogen DNA, and length we proceeded as follows. We first generated 1000 cancer exomes at mutation rate ρ = 0.0005 and 1000 cancer exomes at mutation rate 0.001. Second, we computed the similarity scores of the two sets of cancer exomes relative to the pathogen DNA. P-values were computed for comparing the means of the two sets of similarity scores using a two-sample one-sided ttest with a null hypothesis that the true mean of the similarity scores at the lowest rate (ρ = 0.0005) is no larger than the true mean of the similarity scores at the higher rate (ρ = 0.001). We then repeated the experiment for ρ = 0.001 vs. ρ = 0.01. In Fig 4, the histograms represent

Resiliency of exomes with respect to mutagenic processes
In order to compare the resiliency of the model organism exomes with respect to mutagenic processes, we evaluated the error correction capabilities of the genetic code (the codon allocation to amino-acids) for each combination of model exome and mutagenic process. Referring to Fig 5, y = {y 1 , . . ., y L } represents a DNA sequence whose corresponding sequence of amino Cancer exomes and pathogen DNA acids is {a 1 , . . ., a L/3 }. This DNA sequence is then passed through a given cancer channel P c;r ð�j�Þ and results in a cancer sequenceỹ ¼ fỹ 1 ;ỹ 2 ; . . . ;ỹ L g and a corresponding sequence of cancer amino acids fã 1 ;ã 2 ; . . . ;ã L=3 g. From {a 1 , a 2 , . . ., a L/3 } and fã 1 ;ã 2 ; . . . ;ã L=3 g we computed the relative proportion of amino acids that were affected, that is : Finally, averaging over all possible realizations ofỹ (and therefore overã), we obtained the average error probability Pðerrorjy; c; rÞ ¼ Ejfi :ã i 6 ¼ a i gj ðL=3Þ : ð2Þ   Fig 6, we obtain the following result: • Although the proportion of non-synonymous mutations varies across exomes for the three types of mutagenic processes, it is always lowest for melanoma and maximal for lung. Moreover, this ordering holds irrespectively of the intensity of the mutation rate. It should be noted that we evaluated the proportions of non-synonymous mutations for several other organisms as well (including the set of pathogens considered in this paper) and this finding was validated in all cases.

Discussion
We employed large scale simulations to model the random (across space) effect of stochastic mutagenic processes on the human normal genome. We believe this is a valid approach since the cancer exome available data does suggest that, while at the granular level mutation rates vary, the mutagenic processes in cancers with large number of mutations affect equally all chromosomal regions of the exome [8]. Essentially, we simplify the analysis using this assumption. Our in-silico results show that, in general, the typical stochastic mutagenic processes encountered in the major cancer indications with abundant neoantigens do appear to shift the peptide distribution of the modified exome universally towards a landscape that appears more similar to pathogenic insult. Specifically, all three mutagenic processes considered induce subtle but robust shifts in the measure by which we characterized the similarity between the normal human exome and pathogen DNA sequences, at mutation rates in the upper range of the mutation rates observed in actual cancer samples (� 0.0005). Moreover, the range of peptide lengths where this shift happens aligns with the typical length of peptides presented by the human MHC presentation system, suggesting an increased potential for recognition of these types of somatic mutations by a pathogen-trained host immune system. We also note that for many combinations of pathogen DNA and mutagenic process cases this increase of similarity cannot be solely attributed to the mutation distribution; randomly and uniformly distributed mutations can cause similar shifts in similarity. By contrast, increasing the mutation rate while keeping the underlying mutation distribution fixed always results in an increased similarity betweeen human exome and pathogen DNA at ℓ 2 {11, . . ., 16}, which again corresponds to the length of peptides presented by the human presentation system. This suggests that the intensity of the mutational rate is an important parameter that directly affects the similarity between cancer exome and pathogen DNA.
We also observe that the effect of the considered mutagenic processes on the likelihood of observing a non-synonymous alteration is strikingly different across processes but consistent across the species studied in our framework (human and model organisms). Melanoma/UV light alterations are the least likely to result in amino acid functional changes, followed by Cancer exomes and pathogen DNA APOBEC-driven alterations and then by smoking alterations, suggesting different error-correcting capabilities of the living exomes towards this various mutagenic insults. This is an attractive observation from an evolutionary perspective: due to universal exposure to sunlight, organisms likely developed similarly universal intrinsic protection from UV light type of modifications to their exomes via the redundancies in the aminoacid codon allocation. Similarly, APOBEC-activation appears to be a universal innate protection mechanism that allows the cell to induce damaging mutations to foreign organisms, while the mutations resulting from tobacco smoking are less likely to have presented evolutionary pressure. In summary, our insilico approach reveals two competing mechanisms of tolerance pressure on the major mutagenic processes present in human cancers that modulate the potential immune recognition of alterations at the exome level through pathogen similarity and through functional redundancy; the balance between these mechanisms may significantly contribute to the eventual mutational landscape of advanced cancers.

A.1 Data for Fig 1
In Table 1 below we listed the similarity scores s ℓ (x, y) of each pathogen x against the human exome y, as a function of the subsequence length ℓ. The column "Random" refers to a 95% confidence interval for the similarity score between a randomly generated pathogen sequence X, where each nucleotide is independently and uniformly selected with probability 1/4, and the normal human exome y. To compute this confidence interval we proceeded as follows. The similarity score for a random instance X of length L is given by Here M ℓ denotes the number of distinct length-ℓ substrings in the human genome and was computed empirically for ℓ 2 {9, 10, . . ., 15}:  A confidence inteval for s ℓ (X, y) was computed via Chebyshev's inequality as follows. We have Furthermore, where for the second equality we used the fact that the Z i 's are identically distributed and that  Cancer exomes and pathogen DNA Therefore, Finally, from (3), (4), and (5) we get To obtain a 95% confidence interval we picked which is below 0.002 for all ℓ 2 {9, 10, . . ., 15} regardless of the pathogen length L.
where p c (i, α ! β) denotes the proportion of α ! β mutations among all mutations in patient i and was computed from [8, Supplementary information, Table S2]. The probability that a nucleotide α in the normal exome results in nucleotide β in the cancer exome is therefore The parameter ρ denotes the overall mutation rate and p(α) denotes the relative number of nucleotide α in the exome and was computed from [8, Supplementary information, Table S2].
Remark. Because in the data from [8, Supplementary information, Table S2] complementary mutations were counted under the same category (e.g., a change from cytosine to tyamine would Cancer exomes and pathogen DNA be treated the same as a change from guanine to adenine), mutation types were considered in pairs. Since the relative proportions of complementary pairs were not given inf, we made the assumption that they were equal. Hence, in the above expression p c (i, α ! β, i) actually corresponds to where (α 0 , β 0 ) is the complementary pair of (α, β). The second column in Tables 2-9 represents s ℓ (x, y) as a function of ℓ. The third column represents a 95% confidence interval for s ' ðx;ỹÞ obtained through a standard application of the central limit theorem. This confidence interval is given by

A.6 Error probability data for Fig 6
To compute Ejfi :ã i 6 ¼ a i gj in (2) we proceeded as follows. We have where the summation ranges over amino acid positions. Let us compute Pðã 1 6 ¼ a 1 Þ-for the other terms we proceed in the same way. Observe that a 1 is a function of the first three nucleotides y 1 , y 2 , y 3 of the normal exome y. To emphasize this, let us write a 1 as a 1 (y 1 , y 2 , y 3 ). Similarly,ã 1 is a function of the first three nucleotidesỹ 1 ;ỹ 2 ;ỹ 3 of the cancer genomeỹ and we write it asã 1 ðỹ 1 ;ỹ 2 ;ỹ 3 Þ. Therefore, we have where P c;r ðỹ j jy j Þ is the cancer channel defined in the Appendix A.2.