A mutation-level covariate model for mutational signatures

Mutational processes and their exposures in particular genomes are key to our understanding of how these genomes are shaped. However, current analyses assume that these processes are uniformly active across the genome without accounting for potential covariates such as strand or genomic region that could impact such activities. Here we suggest the first mutation-covariate models that explicitly model the effect of different covariates on the exposures of mutational processes. We apply these models to test the impact of replication strand on these processes and compare them to strand-oblivious models across a range of data sets. Our models capture replication strand specificity, point to signatures affected by it, and score better on held-out data compared to standard models that do not account for mutation-level covariate information.

Reviewer 1

The interpretation of the results is based on the normalised Dirichlet parameters α_k/Σ_i α_i, which constitute compositional data. Signature-specific conclusions are therefore difficult to draw, and this needs to be made clearer in Figures 5 and 6.
We have now explained the caveats of the parameter-ratio approach more clearly in the discussion of Figure 5, and how they are addressed in the results of Figure 6 to pinpoint the relevant biased signatures (p. 10).
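To illustrate the compositional caveat the reviewer raises, here is a toy numerical sketch (not taken from the paper): normalising a Dirichlet parameter vector discards its overall scale, so vectors that differ only by a constant factor become indistinguishable, even though they imply very different concentrations.

```python
import numpy as np

# Two Dirichlet parameter vectors differing only in overall scale
alpha1 = np.array([1.0, 2.0, 1.0])
alpha2 = 10.0 * alpha1

# Their normalised versions (the quantity interpreted in Figures 5 and 6)
norm1 = alpha1 / alpha1.sum()
norm2 = alpha2 / alpha2.sum()

# The compositions coincide, although Dirichlet(alpha2) concentrates its
# draws much more tightly around the shared mean than Dirichlet(alpha1).
```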
In the paper, the authors claim to present a way of extracting both signatures and exposures (e.g., Conclusions: "We have developed new topic models to account for mutation-level covariates when learning mutational signatures and their exposures"). However, they assume that the mutational signatures are known and only estimate the exposures, which needs to be made clearer throughout the paper.
We have stated this point (of known signatures) more clearly (p. 3, 11).

α′ is not defined in the text.
We added an explicit definition of the a_i and a′ parameters (p. 3).

I did not understand the held-out log-likelihood comparison method for assessing the performance of the competing methods; please expand the explanation.
We have clarified this test procedure (p. 7).

SBS2, SBS13, and SBS26 are put forward as signatures with strand bias. SBS13 does not appear to have strand bias according to https://cancer.sanger.ac.uk/signatures/sbs/sbs13. Please discuss.
We now cite the evidence for the replicative strand bias of SBS13 (" The topography of mutational processes in breast cancer genomes", Table 1, row 4).

Reviewer 2
This manuscript proposes two novel topic models for mutational signature learning. Unlike previous methods, these models explicitly account for the effect of covariates on the exposures to different signatures. The authors apply their methods to real data and conclude that they obtain a better fit to held-out data and that they identify known strand-specific biases, and find new ones.
We have several major concerns, one being that there are no simulation studies that prove that the proposed models and learning algorithms do what they set out to do. Additionally, the text is at times difficult to follow. We detail our concerns below.
We address the specific comments in detail below.
The model description is unclear: what are a and b, and what is z drawn from? The symbols are introduced in the plate models before they are explicitly defined in the text, which makes this a very unpleasant read. In Figure 1, what is z drawn from? Is it a multinomial distribution with N_t and θ as parameters? And how does it generate m? And why is a_k generated from a Gaussian distribution?
The symbols a and b are now defined before the first plate model in which they appear, and z is now explained (p. 3). The generative process here is that of LDA, so z is drawn as the referee guessed: from a multinomial distribution parameterized by θ, with the mutation m then drawn from the signature indicated by z. We have clarified the text accordingly (p. 3). Regarding a′_k, it is drawn from a Gaussian distribution following [20].
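For concreteness, a minimal sketch of the LDA-style generative step described above (illustrative names, not the paper's code): each mutation's signature indicator z is drawn from the per-sample exposure vector θ, and the mutation category m is then drawn from the assigned signature γ_k.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_sample(n_mutations, theta, gamma):
    """Draw (z, m) pairs for one sample.

    theta : (K,)   per-sample exposures (multinomial over K signatures)
    gamma : (K, M) signatures (multinomial over M mutation categories)
    """
    # z: signature assignment of each mutation, drawn from theta
    z = rng.choice(len(theta), size=n_mutations, p=theta)
    # m: mutation category, drawn from the assigned signature's distribution
    m = np.array([rng.choice(gamma.shape[1], p=gamma[k]) for k in z])
    return z, m

# Toy example: 2 signatures over 3 mutation categories
theta = np.array([0.7, 0.3])
gamma = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8]])
z, m = generate_sample(100, theta, gamma)
```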
In figures 1 and 2: The node for \gamma is greyed out as an observed quantity, which seems odd as it is defined in the text as the probability to draw a mutation m from signature k.
Correct. We assume that the signatures themselves are known; we have now stated this assumption explicitly in the text (p. 3).
In figures 3 and 4: The node for the inherent exposures is white, indicating it is an unobserved quantity. Later in the text, we see that it is inferred separately using a non-joint model, so in the scope of the joint model, it is given and so should be greyed out as an observed quantity.
We followed this recommendation and clarified the special status of the exposures in the legend and text.
Page 5: "we also define a guided LDA (gLDA) model variant which is fed with external information on the exposures as in JMCSM". In the plate model this is an unshaded node, which means it is unobserved. But in the text, it is referred to as external information, which is confusing. Which is it?
Fixed, see answer to the previous comment.
In model learning, the authors should clarify what is the objective function. Are they computing the maximum marginal likelihood? If so, which parameters do they marginalize over and which do they optimize? And how are they including the prior information in the estimation? Similarly, it should be explained why model comparison based on the held-out log-likelihood is valid for the particular marginalization applied.
We now clarify that we marginalize over all multinomial parameter draws from the Dirichlet distributions. Since we work with the log-likelihood, the prior parameter σ appears as an additive constant that does not affect the learning process, so it was omitted in Equation (6); we clarified this in the text (p. 7). Likelihood on held-out data is a standard way to compare models with different parameterizations.
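As a sketch of the comparison criterion (our own minimal illustration, assuming fixed known signatures): with exposures θ estimated on training data and signatures γ fixed, the multinomial log-likelihood of a held-out sample is evaluated under the mixture, and models are ranked by this score.

```python
import numpy as np

def heldout_loglik(counts, theta, gamma):
    """Multinomial log-likelihood (up to a constant) of held-out counts.

    counts : (M,)   mutation-category counts of a held-out sample
    theta  : (K,)   exposures estimated on training data
    gamma  : (K, M) fixed, known signatures
    """
    p = theta @ gamma            # per-category probability under the mixture
    return float(counts @ np.log(p))

# A model whose exposures match the held-out data scores higher
gamma = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8]])
counts = np.array([80, 10, 10])  # dominated by signature 0
ll_good = heldout_loglik(counts, np.array([0.9, 0.1]), gamma)
ll_bad = heldout_loglik(counts, np.array([0.1, 0.9]), gamma)
```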
Additionally, the authors must describe their learning algorithms in more detail so that readers are able to understand the procedure and replicate it.
We have detailed the learning algorithm (p. 6).

Page 6: "we run SEM for 50 iterations". Since this is an EM algorithm, can't the authors assess the convergence of the objective function directly?
To show convergence, we present in the following table the incremental difference in held-out log-likelihood over the first 25 iterations (iteration #2 minus iteration #1, and so on).
One can see that our main models, JMCSM and gLDA, converge after 10-20 iterations depending on the database and yield almost noise-free results. This is because the concentration parameters we obtain in this model are very high, resulting in an almost deterministic Dirichlet distribution (from which the exposures are drawn).
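The near-determinism mentioned above can be illustrated numerically (a toy sketch, not data from the paper): scaling a Dirichlet parameter vector by a large factor keeps its mean but shrinks the spread of its draws.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([3.0, 1.0, 1.0])
mean = alpha / alpha.sum()

low = rng.dirichlet(alpha, size=1000)          # low concentration
high = rng.dirichlet(alpha * 1000, size=1000)  # high concentration

# Maximum deviation of sampled exposures from the shared mean
spread_low = np.abs(low - mean).max()
spread_high = np.abs(high - mean).max()
# spread_high is far smaller: the high-concentration draws are
# nearly deterministic, as described above
```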
Our less refined models, MCSM and LDA, stop improving sooner and yield much noisier results. Nevertheless, when testing on MALY and CLLE, MCSM yields better results than LDA almost consistently. In the following table, we present the likelihood difference between MCSM and LDA (MCSM − LDA) over the last 25 iterations for all 3 databases.

Before applying their models to real data, the authors must benchmark them on simulated data, in order to see whether their learning method is working and whether they actually recover ground-truth strand biases. Without such assessment, it is difficult to trust the results on real data.
We ran simulations as requested and describe the results below; however, since the proposed simulation could be artificial and not representative of real data, we chose to exclude it from the paper. We note that this comment already appeared in our point-by-point response to the RECOMB reviews in the original PLOS submission.
To generate data for MCSM, we drew the log-parameters a′ and b′ from a standard normal distribution (μ = 0, σ = 1), took their exponents as the Dirichlet parameters, and generated the mutation data using the generative process described in the text. We did this with different signatures and numbers of samples (and sample sizes) matching the original databases.
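The parameter-generation step can be sketched as follows (illustrative code, not the exact simulation script; K is an arbitrary number of signatures):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # number of signatures (illustrative)

# Log-parameters drawn from a standard normal (mu = 0, sigma = 1) ...
a_prime = rng.normal(0.0, 1.0, size=K)
b_prime = rng.normal(0.0, 1.0, size=K)

# ... exponentiated, so the Dirichlet parameters are strictly positive
a_gen = np.exp(a_prime)
b_gen = np.exp(b_prime)

# Per-strand exposures for one simulated sample are then Dirichlet draws
theta_a = rng.dirichlet(a_gen)
theta_b = rng.dirichlet(b_gen)
```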
In the following examples, a_gen and b_gen are the Dirichlet parameters used to generate the data (where a_gen = exp(a') and b_gen = exp(b')), while a_est and b_est are the results of the reconstruction process. To demonstrate the ability of our MCSM code to reconstruct the parameters, we followed the process described above, and generated 5 independent databases based on BRCA.
The generated Dirichlet parameters and their estimates are reported in the table below. First, it can be seen that we are able to estimate the Dirichlet parameters accurately. We also report the average approximation error over all tests to show how accurately the code can reconstruct the parameters: it is 9%. For comparison, the average approximation error of a random estimator that draws its estimates using the same generation method described above is 228%.
Generating data for JMCSM is a little trickier. Given a database, we first learned the inherent exposures in order to use them as the per-sample inherent exposures in the generative process. Then, we drew Dirichlet parameters, multiplied them by 1000, and used the product as the Dirichlet parameters in the generative process. Note that this is necessary in the JMCSM framework, as it is assumed that each strand's exposure is close to the inherent exposure. Another issue with learning JMCSM is that since we learn the strand bias per signature, we will not necessarily reconstruct the exact parameters. However, the ratio between the normalized parameters, (a_k/∑_{i=1}^K a_i)/(b_k/∑_{i=1}^K b_i) (used in Figure 5), should be preserved.
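The scale-invariance of this ratio is easy to check (a toy sketch with made-up parameter vectors): multiplying both strands' parameter vectors by a constant, such as the factor of 1000 used above, leaves the normalized-parameter ratios unchanged, which is why the ratios rather than the raw parameters are the quantity we expect to recover.

```python
import numpy as np

def strand_bias_ratio(a, b):
    """Per-signature ratio of normalized Dirichlet parameters."""
    return (a / a.sum()) / (b / b.sum())

a = np.array([2.0, 1.0, 1.0])
b = np.array([1.0, 1.0, 2.0])

r1 = strand_bias_ratio(a, b)                # ratios from the raw vectors
r2 = strand_bias_ratio(1000 * a, 1000 * b)  # ratios after rescaling
# r1 and r2 are identical; signature 0 is biased toward strand "a"
```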
To demonstrate the ability of our JMCSM code to reconstruct the ratios, we followed the process described above, and generated 5 independent databases based on BRCA.
The generated ratios and their estimates are reported in the table below. Again, we observe that the algorithm can estimate the ratios (although slightly less accurately than in the MCSM variant). We also report the average approximation error over all tests to show how accurately the code can reconstruct the ratios: it is 40%. For comparison, the average approximation error of a random estimator that draws its estimates using the same generation method described above is 630%.
The authors mention that the Watson/Crick strand information did not lead to good results, but show no data supporting this. They should support this statement with data.
We added the results as Supplementary Table 1.

Reviewer 3
This paper introduces several probabilistic models and corresponding inference algorithms for inferring mutational signature exposures. These models are variations of the Latent Dirichlet Allocation model, capturing covariates among mutational signatures. This paper has been reviewed and accepted at RECOMB-CCB previously. Although I did not review this paper, a point-by-point response is missing, and the paper is still a bit rough (models are hard to read, typos, excessive number of acronyms, etc.). Moreover, the experimental evaluation is not great, lacking a compelling biological use case.
A point-by-point response was included in the original PLOS submission file. We will address the following issues separately.
1. Improve description of the methods

I found the methods hard to read due to undefined variables, distributions, and missing intuition. I will list some examples below.
-"The per-sample mutation category count of the two possible feature values are denoted I_{t,m} and J_{t,m}, respectively." (1) What is m? --from context one can guess it is a mutation category, but it would be good to be explicit about it.
We have clarified the methods and added all missing definitions (p. 03).
(2) What is a single binary feature? What are the two possible feature values? Why only 2? Motivation is needed here.
We now explain more clearly that a binary feature is the simplest scenario, and motivate it with the strand-identity feature we analyze (p. 2).

- Why are a_k′ ~ Norm(0, σ²)? This allows for negative values, whereas Dirichlet concentration parameters must be positive. I have seen gamma distributions being used as a prior for a_k.
Correct. However, we use exp(a_k′) as the Dirichlet parameters to avoid negative values, following the paper "Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression".
- Figure 1 and Section 2.1. gamma is undefined --do these describe signatures? What are the dimensions?
It is described early in the Preliminaries section: "the signatures are modeled as multinomial distributions, such that the probability to draw a mutation $m$ from signature $k$ is notated by $\gamma_{k,m}$".

- Section 2.1: What is a signature count --variable N_{t,k}?
The number of times signature k occurs in sample t. We now clarify its definition in the text (p. 03).
- Figure 1: What are variables z and m?
z indicates the signature category of the current mutation, and m indicates its mutation category. We now clarify their definitions in the text (p. 3).
We added this information as requested (for all relevant figures).
As a general comment, it would help to include a table, comparing the different models. Also, some additional motivation for each model would be welcome.
We clarified the motivation for each model and summarized them in a table as requested.

2. Experimental evaluation
My main concern with the current experimental evaluation is that the case for more complex models that support covariates has not been made. The experimental evaluation does not compare against a baseline method that does not include any covariates (this can be MMM, or just straight up NMF with the signature matrix fixed). Do more complex models allow for better inference with fewer mutations? The paper needs to make this case.
Both models (MCSM and JMCSM) are compared to a covariate-invariant counterpart (LDA and gLDA, respectively). We clarify this in the text (p. 8).
-I find it interesting that the "[paper] could detect no advantage of the models over their strand oblivious counterparts". Is this because the inference is more challenging for the more complex models? To assess this, simulations with known ground truth would be useful. Do the given algorithms for the more complex models have convergence issues?
See answers to comments 7 and 8 of Reviewer #2.
-Equation (6) lists the empirical log-likelihood of the MCSM model. What about the other models?
The Model Evaluation section defines the process by which the empirical log-likelihood is calculated, irrespective of the specific model.
-Do models with more parameters achieve higher empirical likelihoods by default?
There is no such bias since we compare log-likelihood on held-out data.
We clarified the text regarding JMCSM (p. 9).

- Figure 6 is not clear. Include some equations in the main text.

We clarified the test procedure as requested (p. 10).

- Citation needed: "Signatures 2, 13, 26 are known to have a strong replication strand bias".

We added the required citation (p. 9).