A Bayesian model integration for mutation calling through data partitioning

Abstract Motivation Detection of somatic mutations from tumor and matched normal sequencing data has become among the most important analysis methods in cancer research. Some existing mutation callers have focused on additional information, e.g. heterozygous single-nucleotide polymorphisms (SNPs) nearby mutation candidates or overlapping paired-end read information. However, existing methods cannot take multiple information sources into account simultaneously. Existing Bayesian hierarchical model-based methods construct two generative models, the tumor model and error model, and limited information sources have been modeled. Results We proposed a Bayesian model integration framework named as partitioning-based model integration. In this framework, through introducing partitions for paired-end reads based on given information sources, we integrate existing generative models and utilize multiple information sources. Based on that, we constructed a novel Bayesian hierarchical model-based method named as OHVarfinDer. In both the tumor model and error model, we introduced partitions for a set of paired-end reads that cover a mutation candidate position, and applied a different generative model for each category of paired-end reads. We demonstrated that our method can utilize both heterozygous SNP information and overlapping paired-end read information effectively in simulation datasets and real datasets. Availability and implementation https://github.com/takumorizo/OHVarfinDer. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Cancer is driven by genomic alterations. Acquired somatic mutations, together with individual germ line variations, are important factors in cancer evolution. Together with decreasing massively parallel sequencing costs, mutation calling from tumor and matched normal sequence data has become fundamental analysis methods in cancer research (Meyerson et al., 2010).
Previous statistical mutation callers can be mainly categorized into two types. The first type of mutation caller does not assume any probability distribution that is specific for sequence data (Koboldt et al., 2012;Yoshida et al., 2011), and mutation calling is conducted based on Fisher's exact test (Fisher, 1925). For this type, the numbers of reference supporting reads and variant supporting reads are counted in tumor and normal samples, and a P-value is computed based on a 2 Â 2 contingency table. These methods only consider differences in variant allele frequencies between tumor and normal samples and ignore the biases found in sequence errors and mapping errors.
The second type of mutation caller constructs generative models that are specific for sequence data. This type of method first prepares generative models for sequence data, and then computes statistical scores based on techniques, e.g. maximum a posteriori inference of genotypes (Roth et al., 2012) and Bayes factor-based model selection (Cibulskis et al., 2013;Moriyama et al., 2017;Usuyama et al., 2014).
The most important advantage of the second approach is that we can construct generative models based on sequence data-specific information sources, and then utilize the given information sources. Some methods are known to perform well by using some characteristic information of heterozygous single-nucleotide polymorphisms (SNPs) nearby mutation candidates (Usuyama et al., 2014) or overlapping pairedend reads (Moriyama et al., 2017).
Simultaneous usage of multiple characteristic information sources, e.g. heterozygous SNPs nearby mutation candidates and overlapping paired-end reads, is preferable for improving performance for the second type of method. However, existing mutation callers do not consider various information sources simultaneously. To utilize multiple information sources, we proposed a Bayesian model integration framework, named as partitioning-based model integration, and then we developed a novel mutation calling method named as OHVarfinDer based on the framework.
In Section 2, we explain the partitioning-based model integration framework, and then describe details of OHVarfinDer.
In Section 3, we first show that our method can utilize both heterozygous SNPs information and overlapping paired-end read information effectively in simulation datasets. In this experiment, we demonstrate the comparable performance of our method with other methods when only one of the two information sources is available; we also demonstrate the superior performance of our method compared to the other methods when both information sources are available. Second, we demonstrate the better performance of our method for real datasets.
In Section 4, we discuss the advantages and limitations of the proposed method.

Materials and methods
In this section, we first explain multiple characteristic information sources for mutation calling. Second, we elucidate our proposed framework of partitioning-based model integration in a general form. Third, we describe how these multiple information sources are incorporated in OHVarfinDer based on the partitioning-based model integration.

Characteristic information sources for mutation calling
2.1.1 Heterozygous SNPs covered by paired-end reads The first additional information source in somatic mutation calling is heterozygous SNPs near somatic mutation candidates. The human genome is a diploid set of haplotypes, i.e. the maternal haplotype and paternal haplotype. Each somatic mutation is known to occur typically only on one side of the haplotypes, i.e. heterozygous mutation. Therefore, variant supporting reads that cover heterozygous SNPs are generated from only one side of the haplotypes as shown in the left side of Figure 1a. However, when sequence errors occur on the mutation candidate position, variant supporting reads covering heterozygous SNPs probably have both heterozygous SNPs as in the right side of Figure 1a. This information source was used in HapMuC (Usuyama et al., 2014).

Overlaps of paired-end reads
The second additional information source is overlaps of paired-end reads. Through Illumina's sequencing, a pair of paired-end reads, i.e. forward and reverse reads, is sequenced from both sides of the same DNA fragment. If the DNA fragment is shorter than 2-fold the read length, the pair of reads has an overlapping region where sequence process is conducted twice from different directions independently.
If the both forward and reverse reads show the same alteration in the overlapping region as in the left side of Figure 1b, it is likely that the change is because of a mutation and not because of errors, as the occurrence probability of two errors at the same site in the overlapping region is expected to be very low, except for PCR errors in the sample preparation phase (Chen-Harris et al., 2013). In contrast, an error case is probable when only one of the reads contains an alteration in the overlapping region as in the right side of Figure 1b. This information source has been used in OVarCall (Moriyama et al., 2017).

Strand biases of paired-end reads
The third additional information source we considered is strand biases in variant supporting reads that cover a mutation candidate. If only forward (or reverse) reads contain a mutation candidate despite sufficient numbers of both forward and reverse reads, this phenomenon is known as strand bias as in the right side of Figure 1c. If a true somatic mutation exists, strand bias rarely occurs, and the proportion of variant supporting forward/reverse reads should be ideally similar as in the left side of Figure 1c. This information source is used for filtering in MuTect (Cibulskis et al., 2013).

Representative examples in real datasets
We show the examples from real datasets, in which we can find that given mutation candidates are only errors. Figure 2 shows screenshots of IGV (http://software.broadinstitute.org/software/igv/).
The first erroneous case shown in Figure 2a represents the variant supporting reads with both heterozygous SNPs. In this case, variant supporting reads have both heterozygous SNPs, as indicated by red and blue circles. This case corresponds with the erroneous case in Figure 1a.
The second erroneous case shown in Figure 2b represents a paired-end reads with inconsistent bases at a mutation candidate position. In this case, reads in a paired-end reads that are highlighted in red line have different bases at the mutation candidate position. This case corresponds with the erroneous case in Figure 1b.
Simpler methods, e.g. a Fisher's exact test-based method of VarScan2, evaluate these two types of errors as somatic mutations. In the case of Figure 2a, VarScan2 showed a low P-value of 0.043 and in Figure 2a, VarScan2 also showed a low P-value of 0.0050. The main purpose of this paper is to construct a Bayesian method which discriminates these errors from somatic mutations.

Bayes factor for finding mutations
We denote a dataset as X :¼ fx n g d n¼1 , where x n is the n-th string consisting of fA; T; G; Cg and d is the depth on the mutation candidate position. We denote tumor and error models as M T ; M E , and  corresponding parameters as h T ; h E . Next, the Bayes factor (Kass and Raftery, 1995) is written as follows:

Partitioning-based model integration
First, we assume K 2 N models in each tumor and error model, and denote these models as M T;k ; M E;k , where k 2 f1; . . . ; Kg. We denote corresponding parameters as h T;k ; h E;k , where k 2 f1; . . . ; Kg.
We also assume that we can observe indicator variable t n 2 f1; 2; . . . ; Kg with each data x n . We assume that the original dataset is partitioned into K subsets and t n indicates the subset of data to which x n belongs. We also assume that the k-th subset of data is generated through the k-th model of M T;k or M E;k . We denote this augmented dataset as X aug :¼ fðx n ; t n Þg d n¼1 . We assume the graphical model of Figure 3 and that the distribution of each parameter h S;k is dependent on the k-th model of M S;k .
x n jt n ; h S;all ; M S $ Prðx n jh S;tn ; M S;tn Þ; (1) Our purpose here is to compute the following Bayes factor: From the graphical model in Figure 3 and above assumptions of Equations (1) and (2), the joint probability can be computed as follows: where S 2 fT; Eg; Prðt n jM S Þ > 0; h S;all :¼ fh S;1 ; ::; h S;K g. From this joint probability, the marginal likelihood can be computed as follows: If we can assume PrðtjM T Þ ¼ PrðtjM E Þ for any t 2 f1; . . . ; Kg, we do not need to set PrðtjM T Þ; PrðtjM E Þ for computation of Bayes factor, because This manner of model integration requires two conditions. The first condition is a partition rule on the dataset and we can construct a corresponding generative model for each partitioned dataset. The second condition is that partition probabilities should be the same among the tumor and error model ( The merit of this manner is that partition probabilities PrðtjM T Þ; PrðtjM E Þ do not affect the Bayes factor and thus careful and explicit settings of these probabilities are not necessary.

Notations in practical models of OHVarfinDer
The graphical model of OHVarfinder is shown in Figure 4a. r n is the n-th paired-end read, i.e. tuple of two forward/reverse reads of ðr n;þ ; r n;À Þ, and each r n;þ ; r n;À is a string sequence of fA; T; G; Cg. t n is the n-th partition indicator variable. H k is an set of template paired-end reads and contains paired-end reads like Figure 1. z n (if t n ¼ k; z n 2 f0; 1; . . . ; jH k j À 1g) is the n-th categorical latent variable indicating the template paired-end read for the n-th paired-end read r n . For the generation process of r n as a whole, we assume that n-th paired-end read r n is generated from z n -th paired-end read of H k;zn with sequence errors and mapping errors added randomly.

Partition rules for each paired-end read in OHVarfinDer
In our method, we split paired-end reads into five types. t n 2 f0; 1; 2; 3; 4g is determined for each paired-end read r n;6 by the following partitioning rule.

O(1)H(2) category
A paired-end read in this category (t n ¼ 0) is overlapping between the forward read and reverse read at the mutation candidate position and covers no heterozygous SNPs nearby the candidate position.

O(2)H(1) category
A paired-end read in this category (t n ¼ 1) is not overlapping between the forward read and reverse read at the mutation candidate position and covers heterozygous SNPs nearby the candidate position.
Note that global haplotype phasing is not necessary and we only conduct haplotype phasing locally around the mutation candidate positions as previously conducted in (Usuyama et al., 2014). The genotype A and B as in Figure 4b is determined from the number of variant supporting reads for each SNP.

O(1)H(1) category
A paired-end read in this category (t n ¼ 2) is overlapping between the forward read and reverse read at the mutation candidate position and covers heterozygous SNPs nearby the candidate position.

O(2)H(2)S(1) category
A paired-end read in this category (t n ¼ 3) is not overlapping between the forward read and reverse read at the mutation candidate position and covers no heterozygous SNPs nearby the candidate position. The mutation candidate position is covered by the forward read. (Forward/reverse is determined by the mapping direction compared to the reference sequence.)

O(2)H(2)S(2) category
A paired-end read in this category (t n ¼ 4) is not overlapping between the forward read and reverse read at the mutation candidate position and covers no heterozygous SNPs nearby the candidate position. The mutation candidate position is covered by the reverse read.

Suitability of partition based integration
We should note that partitioning-based integration is suited for this problem setting for two reasons. The first reason is that we can set partition rules on paired-end reads and construct generative models for each dataset by referring to existing methods. The second reason is that partitioning probabilities PrðtjM T Þ; PrðtjM E Þ are thought to be the same, e.g. the existence of a mutation does not affect whether a paired-end read will cover a heterozygous SNP.

Details of tumor generative model for O(1)H(1) type of paired-end read
Here, we only show the details of the tumor generative model for O(þ)H(þ) type (t n ¼ 3) due to the limitation of the space. See the Supplementary Material A.1-A.8 for the details of our models. z n is the one of eight expression vector indicating an idealized paired-end read. For the parameters, we used l ; b ; p H (that is h T;3 :¼ ð l ; b ; p H Þ). l 2 ½0; 1 is the error rate when the paired-end read is overlapping at the mutation candidate position, b 2 ½0; 1 is the strand bias rate. p H is a 3-dimensional non-negative simplex, indicating the proportion of paired-end reads from a maternal haplotype, paternal haplotype and haplotype with somatic mutation. Let a l 2 R 2 þ ; a b 2 R 2 þ ; c H 2 R 3 þ the hyperparameters for l ; b ; p H . The tumor generative model for the O(þ)H(þ) type of paired-end read is defined as follows: l ja l $ P beta ð l ja l Þ; b ja b $ P beta ð b ja b Þ; p H jc H $ P dir ðp H jc H Þ; z n j l ; b ; p H ; t n $ P mult ðz n jf T;tn Þ; r n jz n ; H tn ; $P align ðr n;þ jH tn;idxðznÞ;þ ÞP align ðr n;À jH tn;idxðznÞ;À Þ: where P beta ðÁÞ; P dir ðÁÞ; P mult ðÁÞ are the probability density function of beta, Dirichlet, and multinomial distributions respectively. P align ðÁÞ Fig. 3. Graphical model for partitioning-based integration of generative models. Where S 2 fT; Eg states the hypothesis is the alignment probability which is formulated by profile hidden Markov model (HMM) (Albers et al., 2011;Usuyama et al., 2014). f T;k is a non-negative simplex defined by h T;k . We only show the case for f T;3 ; f E;3 in Figure 4b. idxðÁÞ is a function that returns the index where the value is 1 from a given one-hot encoding vector.

Bayes factor in OHVarfinDer
Here, we show the Bayes factor in OHVarfinDer and explain that our method is truly based on the partitioning-based integration and that setting of PrðtjM T Þ and PrðtjM E Þ are not necessary. Let R NT :¼ fr n g d n¼1 is the set of paired-end reads for both tumor and normal sample data which cover a mutation candidate position, and the marginal likelihoods can be computed as follows: where F k ¼ Prðh S;k jM S;k Þ Y fnjtn¼kg Prðz n jh S;k ÞPrðr n jz n ; H tn Þ ð¼ PrðR NT jM S ; ft n g n Þ: Therefore, if PrðtjM T Þ ¼ PrðtjM E Þ, it is not necessary to set these distributions in the Bayes factor of OHVarfinDer as shown in the previous section.

Computation of marginal likelihoods
We applied the variational Bayes procedures for computing PrðR NT jM S Þ. We can obtain a lower bound for lnPrðR NT jM S ; ft n g n Þ from the convexity of log function (Jensen, 1906). lnPrðR NT jM S ; ft n g n Þ ! E q ln PrðR NT ; Z S;NT jM S ; ft n g n Þ qðZ S;NT Þ " # ; ( where we denote all latent variables and parameters of fz n g n ; fh S;k g k as Z S;NT . qðZ S;NT Þ is the variational distribution for Z S;NT which is formulated in the independent form as follows: qðZ S;NT Þ :¼ Y k ½qðZ S;NT;k Þqðh S;k Þ; qðZ S;NT;k Þ :¼ Y njtn¼k qðz n Þ: In the above inequality of Equation (3), the equality holds true when qðZ S;NT Þ is equal to the posterior distribution of the PrðZ S;NT jR NT ; ft n g n ; M S Þ. In the variational Bayes procedure (Beal, 2003), we maximize the lower bound for each variational distribution of qðh S;k Þ and qðZ S;NT;k Þ iteratively until the updated lower bound converges, and approximate the log marginal likelihood using this maximized lower bound. We described the full procedures for variational Bayes in the Supplementary Material A.9-A.16.

Performance evaluation of OHVarfinDer using simulation data
3.1.1 Simulation data generation procedure We tested OHVarfinDer using simulation datasets. The simulation procedure is described as follows. In the following procedure, we prepared two types of errors. The first type of errors are position-specific ones, and known as error prone sites (Moriyama et al., 2017;Shiraishi et al., 2013). The second type of errors are non-position-specific ones.
1. Generate a random reference DNA sequence. 2. Generate a heterozygous germ line variant in a random location, as well as two haplotypes (h1 and h2) 3. Generate a somatic mutation randomly around a heterozygous germ line variant, according to an empirical distribution of whole genome data, as well as two haplotypes (h3 and h4) 4. Randomly generate paired-end reads around 900 somatic mutations and 2100 error prone sites randomly. a. Determine the number of paired-end reads covering the position, by generating a random value d from a norm distribution of N(50, 2), and round d to the nearest integer value. b. Randomly determine the haplotype of the original DNA fragment. We set the frequency of haplotypes as h1: 50-v%, h2: 50%, h3: v%, h4: 0% if a somatic mutation truly exists. We set the frequency of haplotypes as h1: 50%, h2: 50%, h3: 0%, h4: 0% otherwise. c. For each paired-end read, determine the DNA fragment size by generating a random value l from Nðl l ; r l Þ, and round l to the nearest integer value. d. Generate the 100-bp length read sequence on forward strand.
Each observed base flips with the sequence error probability of p error . If the position of each observed base is the error prone site, p error is generated from a beta distribution of Beta(2, 30). If the position of each observed base is not the error prone site, p error is generated from Beta(10, 1000). e. Generate the read sequence on the reverse strand like (d).
3.1.2 Performance evaluation of OHVarfinDer using simulation data As a counterpart method, we prepared OVarCall, HapMuC, and a simple Fisher's exact test (Fisher, 1925) method, which uses a 2 Â 2 contingency table of read counts, tumor and normal samples/variant and reference alleles. We calculated the area under the curve (AUC) values from the plotted receiver-operating characteristic curve (ROC) (Bradley, 1997) for each simulation condition as shown in Table 1. We described the filter conditions in the Supplementary Material B.1.
In the simulation dataset under the condition of B, only overlapping paired-end read information was available. In this case, our method performs comparable with OVarCall. In the simulation dataset in the condition of C, only heterozygous SNP information was available. In this case, our method performed comparably well with HapMuC that can utilize this information source. In the simulation dataset under the condition of A, neither of the above types of information was available. In this case, our method performed comparably well with Fisher's exact test. In the simulation dataset under the condition of D, both overlapping paired-end read information and heterozygous SNP information were available. In this case, our method outperformed both OVarCall and HapMuC. We summarized the ROC curves in the Supplementary Material B.10.1-B.10.3.

SNVs in exome sequence dataset
We confirmed whether the performance of our method could be improved by using overlapping information using real exome datasets, as shown in Table 2 for the real datasets, we used exome sequence data from renal clear-cell carcinoma, which has already been used for performance evaluation of OVarCall (Moriyama et al., 2017). In these datasets, $40% of paired-end reads overlapped, and thus the use of overlapping paired-end reads is expected to affect the performance. In this dataset, true somatic SNVs were validated by deep sequencing (Shiraishi et al., 2013). In both the case of lower variant allele frequency of 2-7% and the case of moderate variant allele frequency above 7%, OHVarfinDer performed comparably well with OVarCall and outperformed HapMuC. Furthermore, we observed that our method returned low Bayes factor of 0.0000011 in the false positive case in Figure 2b. Therefore, we confirmed that our method can incorporate overlapping information and improve its performance. For the details of this experiment, see the Supplementary Material B.2, B.8 and B.10.4.

SNVs and InDels in whole genome dataset
We examined whether we could improve the performance of our method by using heterozygous SNP and strand bias information using whole genome sequence data. The results are summarized in Table 3 for the dataset, we used whole genome sequence datasets from breast cancer cell lines, which are publicly available as a part of The Cancer Genome Atlas (TCGA) Mutation Calling Benchmark 4 datasets (These datasets can be downloaded from https://gdc.cancer. gov/resources-tcga-users/tcga-mutation-calling-benchmark-4-files) and have been used for performance evaluation of HapMuC.
In these datasets, pure cell line sequence datasets of normal and tumor cell line and computational mixtures of these sequence datasets are prepared, e.g. HCC1143_n40t60 represents that 40% of pure normal and 60% of pure tumor sequence data are mixed. In this experiment, we obtained answers of true mutations from these pure cell line datasets, and we conducted performance evaluations for tumor sequence datasets with several mixture rates, i.e. n20t80, n40t60, n60t40, n80t20. For these datasets, the use of heterozygous SNPs information and strand bias information is important for improving performance because the average proportion of overlapping paired-end reads was $3% within these datasets. The highest AUC values are written in italic letters. For the performance of OHVarfinDer, OHVarfinDer performed better than any other mutation caller, except for HCC1954_n80t20. We also observed that our method returned low Bayes factor of 0.000059 in the false positive case in Figure 2a. Therefore, we confirmed that our method can incorporate heterozygous SNP and strand bias information and improve its performance. For the details of this experiment, see the Supplementary Material B.3, B.9 and B.10.5.

Discussion
Some mutation calling methods, e.g. HapMuC and OVarCall, can incorporate a characteristic information source, e.g. heterozygous SNPs and overlapped paired-end reads, in their mutation calling process. However, no existing methods utilize multiple types of such characteristic information sources simultaneously.
In this paper, we first introduced a framework for Bayesian model integration named as partitioning-based model integration, which differs from Bayesian model averaging (Hoeting et al., 1999). In this framework, we first set a partitioning rule for data and augmented the data with indicator variables which show the category of partitioning. Second, we constructed a generative model for each category of partitioned dataset. This framework requires two assumptions. The first assumption is that we can set a partitioning rule and construct corresponding generative models. The second assumption is that partitioning probabilities are common among the tumor model and error model. If the above assumptions hold true, we can compute the Bayes factor without careful setting of prior partitioning probabilities. In our problem setting of mutation calling, the above two assumptions seem natural, and thus we constructed a Bayesian mutation calling method, OHVarfinDer, based on this framework.
We conducted performance evaluations with simulation and real datasets. In the simulation datasets, we showed that our method could utilize multiple information sources, particularly overlapping paired-end read information and heterozygous SNP information. If only one information source was given, our method performed comparably well with other existing methods. If both information sources were given, our method performed better than other existing methods. In the real datasets, e.g. TCGA Mutation Calling Benchmark 4 datasets, we also demonstrated the better performance of our method compared to other existing methods.
We have demonstrated how to integrate known multiple information sources for mutation calling by our framework. We note that mapping quality and base quality of reads are also used in our method by incorporating the profile HMM modeling (Albers et al., 2011;Usuyama et al., 2014). Although our framework is practically useful for mutation calling, there is at least one limitation for this framework, i.e. our framework does not assume inference over the parameter distributions, e.g. prior distributions for the error parameters. Such inference is important if we consider using multiple sequence datasets simultaneously. For example, if we can use pooled normal sequence datasets, we can infer the error distributions depending on the genomic positions. For the future work, we plan to extend our framework to infer the form of the parameter distributions, e.g. incorporating predictive distributions for the error parameters. The highest AUC values are written in italic letters.