EAGLE: Explicit Alternative Genome Likelihood Evaluator

Background: Reliable detection of genome variations, especially insertions and deletions (indels), from single-sample DNA sequencing data remains challenging, partially due to the inherent uncertainty involved in aligning sequencing reads to the reference genome. In practice, a variety of ad hoc quality filtering methods are employed to produce more reliable lists of putative variants, but the resulting lists typically still include numerous false positives. Thus it would be desirable to rigorously evaluate the degree to which each putative variant is supported by the data. Unfortunately, users who wish to do this, e.g. for the purpose of prioritizing validation experiments, have had limited options.

Results: Here we present EAGLE, a method for evaluating the degree to which sequencing data supports a given candidate genome variant. EAGLE incorporates candidate variants into explicit hypotheses about the individual's genome, and then computes the probability of the observed data (the sequencing reads) under each hypothesis. In comparison with methods which rely heavily on a particular alignment of the reads to the reference genome, EAGLE readily accounts for uncertainties that may arise from multi-mapping or local misalignment, and uses the entire length of each read. We compared EAGLE's scores with those assigned by several well-known variant callers on the task of ranking true putative variants, using both simulated data and benchmarks based on real genome sequencing. For indels, EAGLE obtained marked improvement on simulated data and a whole genome sequencing benchmark, and modest but statistically significant improvement on an exome sequencing benchmark.

Conclusions: EAGLE ranked true variants higher than the scores reported by the callers and can be used to improve specificity in variant calling. EAGLE is freely available at https://github.com/tony-kuo/eagle.
Electronic supplementary material The online version of this article (10.1186/s12920-018-0342-1) contains supplementary material, which is available to authorized users.


S1.1 Overview
We derive a probability distribution of the observed data (the sequencing reads) conditioned on an entire hypothetical genome (reference or alternative) and then describe a simple but effective way to use traditional read mapping to approximate the computation. Our model is conceptually close to a previous model by Simola & Kim [1], but the details of our method and the mathematical exposition we present here differ substantially from their work.
Throughout this paper we assume that sequencers do not make indel errors. This is not essential to our theoretical approach, but simplifies the derivation and greatly reduces computation time. The results we obtained suggest this assumption is tenable for the Illumina sequencing platform.

S1.2 Basic Model
To describe our model, we start by deriving the likelihood of one sequencing read r of length ℓ_r, whose base-call quality scores define a probability distribution over any length-ℓ_r DNA sequence s:

    P[s|r] = ∏_{i=1}^{ℓ_r} r_i(s_i)

where s_i is the ith base of s and r_i is the probability vector over {a,c,g,t} corresponding to the base and quality score of the ith position of r. It is useful to reverse the direction of the conditional with Bayes' law, producing:

    P[r|s] = P[s|r] P[r] / P[s]

where P[s] is the prior probability of a DNA sequence s when we know only that s was read by a sequencer. For simplicity, we assume no indel errors, so that the length of each read is the same as the length of the segment it derives from. The sequencing technology and experimental protocol employed determine the prior probability distribution of the length of reads (and equivalently the length of segments read), which we denote as L(ℓ) (by definition Σ_ℓ L(ℓ) = 1). Furthermore, we assume that the prior probabilities of equal-length sequence segments are equal. Thus,

    P[s] = L(ℓ_s) / 4^{ℓ_s}

where 4 is the size of the alphabet of possible nucleotides. A similar type of distribution could be assumed for P[r] using a larger alphabet size to include quality score information, but for our purposes it is more useful to leave P[r] unspecified, as it will cancel later on when we look at probability ratios. Thus,

    P[r|s] = P[s|r] · 4^{ℓ_r} P[r] / L(ℓ_r)

is the probability of a read given that we know it derives from a length-ℓ_r DNA sequence s. But what if we only know that it derives from somewhere in genome G? Then,

    P[r|G] = Σ_{g∈G} P[r|g] P[g|G]

where P[g|G] is the probability that genome sequence segment g is sequenced, given that we know only that some segment from G is sequenced. Although often unrealistic, we assume uniform priors on genome segments g of any particular length (i.e. uniform coverage) and further assume that the length distribution of g follows the general prior defined above as L(ℓ).
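As a concrete illustration of the quantities above, the per-position probability vectors r_i can be derived from Phred quality scores, and P[s|r] is then a simple product over positions. The sketch below assumes the common convention that a call with error probability e = 10^(-Q/10) assigns probability 1 − e to the called base and e/3 to each of the other three bases; the helper names are our own, not part of EAGLE.

```python
import math

def base_prob_vectors(read_seq, quals):
    """Convert called bases plus Phred quality scores into per-position
    probability vectors r_i over {A, C, G, T}.

    Assumption (not from the paper): the called base gets 1 - e and the
    three alternative bases share the error probability e equally.
    """
    vectors = []
    for base, q in zip(read_seq, quals):
        e = 10 ** (-q / 10)
        vectors.append({b: (1 - e) if b == base else e / 3 for b in "ACGT"})
    return vectors

def log_prob_seq_given_read(segment, vectors):
    """log P[s|r] = sum_i log r_i(s_i) for an equal-length segment s."""
    assert len(segment) == len(vectors)  # no-indel assumption: lengths match
    return sum(math.log(vec[base]) for vec, base in zip(vectors, segment))
```

For example, a two-base read "AC" with quality 20 at both positions (e = 0.01) assigns probability 0.99 × 0.99 to the segment "AC".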
Using n_ℓ to denote the number of genome segments of length ℓ:

    P[g|G] = L(ℓ_g) / n_{ℓ_g}

and combining,

    P[r|G] = Σ_{g∈G} P[r|g] L(ℓ_g) / n_{ℓ_g} ≈ (1/n) Σ_{g∈G} P[r|g] L(ℓ_g)

where the approximation in the last line reflects the fact that, since chromosomes are typically much longer than reads, the number of genome segments n_{ℓ_g} can be well approximated by the genome length n. Later, we will explicitly consider genome hypotheses of differing length, but as the length of these differences will be small compared to the genome length, we will still use n to approximate the number of segments of any given length for all genome hypotheses. One important detail is that in practice some fraction ξ (mnemonic for xenos) of the reads will derive from sequences outside of the reference sequence coordinates. Such reads may derive from sequences unrelated to the human genome (e.g. contamination) or regions of the human genome not fully covered in the reference (e.g. telomere regions). Later we will revisit how to model the outside source of reads, but first we proceed with a generic outside source which simply assumes that the reads from the outside follow the same distribution as the general prior P[r].
To compute the probability of the entire set of reads R, we assume that each read is generated independently, an assumption which is not strictly true considering the biases introduced by PCR amplification and other steps of the sequencing process. Nonetheless we make this assumption of read independence to obtain:

    P[R|G] = ∏_{r∈R} ( (1−ξ) P[r|G] + ξ P[r] )

Equivalently, by Bayes' law,

    P[H|R] ∝ P[R|G_H] P[H]

The ratio of the probability of two alternative hypotheses H_v and H_u, assuming genome sequences G_v and G_u respectively (and equal prior probabilities P[H_v] = P[H_u]), is:

    P[H_v|R] / P[H_u|R] = ∏_{r∈R} ( (1−ξ) P[r|G_v] + ξ P[r] ) / ( (1−ξ) P[r|G_u] + ξ P[r] )

This equation can be slightly cleaned up by substituting P[r|G] ≈ (L(ℓ_r)/n) Σ_{g∈G} P[r|g], multiplying the numerator and denominator by n, and defining convenience constants x = nξ and h = 1 − ξ, yielding:

    P[H_v|R] / P[H_u|R] = ∏_{r∈R} ( h L(ℓ_r) Σ_{g∈G_v} P[r|g] + x P[r] ) / ( h L(ℓ_r) Σ_{g∈G_u} P[r|g] + x P[r] )

where G is treated as the multiset of all length-ℓ_r segments of the hypothetical genome (including both chromosomes in each pair; for whole genome sequencing this means both strands of the whole genome, while for exome sequencing it means the targeted regions of the genome). As written, this computation iterates over each genome position once for each read, requiring an unrealistic amount of computation time. Fortunately, we can overcome this problem. First, we note that the probability P[r|g] will be negligible for the vast majority of genome segments g. Only segments similar to the consensus sequence of r will contribute significantly to the sum: precisely the segments which read mapping software, used in multimap mode, is designed to find quickly.
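The product over reads above is best accumulated in log space to avoid floating-point underflow. The sketch below assumes the per-read inside sums (L(ℓ_r) Σ_g P[r|g] for each hypothesis) and the generic outside mass P[r] have already been computed; the interface is hypothetical, not EAGLE's actual code.

```python
import math

def log_ratio_generic_outside(per_read_terms, xi, n):
    """Log of prod over reads of (h*inside_v + x*p_r) / (h*inside_u + x*p_r),
    with the convenience constants h = 1 - xi and x = n * xi from the text.

    per_read_terms: iterable of (inside_v, inside_u, p_r), where inside_v is
    L(l_r) * sum over segments g of G_v of P[r|g] (hypothetical interface).
    """
    h, x = 1.0 - xi, n * xi
    total = 0.0
    for inside_v, inside_u, p_r in per_read_terms:
        # accumulate each read's log-factor; underflow-safe for long products
        total += math.log(h * inside_v + x * p_r) - math.log(h * inside_u + x * p_r)
    return total
```

With ξ = 0 the expression reduces to the plain product of inside likelihood ratios; a large outside term pulls each factor toward 1, dampening the ratio when neither hypothesis explains the read well.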
Some more notation is needed to explain our approximation precisely. Consider the likelihood ratio of two genome hypotheses, e.g. the reference genome G_u (u mnemonic for unchanged) and a variant genome G_v obtained by slightly editing G_u in accordance with a candidate genome variant, say by deleting one base at position e of the reference genome. Further, let us denote the set of reads which cover position e when mapped to the reference genome as R_e, and the set of genome segments to which read r ∈ R_e maps in the reference or variant genome as G^r_u and G^r_v respectively. Importantly, although by definition the reads in R_e all map covering position e of the reference, in general they may map to multiple regions of the genome, and those regions may differ between different reads in R_e. Finally, we define the neighborhood N^r_u (N^r_v) as the set of length-ℓ_r genome segments overlapping any segment in G^r_u (G^r_v). By summing the likelihood over these neighborhoods instead of just a single segment, we cover the case in which the exact alignment of a read to the genome (i.e. gap placement) is ambiguous, as is often the case for indel variants found in repetitive regions and/or occurring simultaneously with other nearby variants. This summation is necessary because in the case of ambiguous local alignments more than one local length-ℓ_r genome segment could generate read r with non-negligible probability. Since reads outside R_e contribute nearly identical factors to both hypotheses, their ratios are approximated as 1, and we can express our approximation of the likelihood ratio as:

    P[R|H_v] / P[R|H_u] ≈ ∏_{r∈R_e} ( h L(ℓ_r) Σ_{g∈N^r_v} P[r|g] + x P[r] ) / ( h L(ℓ_r) Σ_{g∈N^r_u} P[r|g] + x P[r] )

Note that although we make an effort to treat the reference and variant genome hypotheses symmetrically, in the formulation as stated some asymmetry remains. Namely, we define the read set R_e relevant to a candidate variant in terms of which reads map near position e in the reference genome.
Ideally, the set of reads considered would be the union of those which map near position e in either the reference genome or the variant genome, which (especially in the case of longish indels) might not be the same as R_e. Defining R_e only in terms of the reference genome is a practical decision which eases program implementation: all of the reads can be mapped to the reference genome in a precomputation step, and those results can be used to quickly determine R_e for each candidate variant (and candidate variants number in the millions). To remove this reference bias, one would need a way to quickly query the read set for reads similar to a genome segment overlapping the edited region of each variant genome hypothesis. In principle a suffix array or similar type of index built from the reads could be used to do this, but we leave that for future work. Nevertheless, although the set of reads considered potentially relevant to a given candidate variant is biased towards the reference genome, the rest of the computation is completely symmetric, explicitly handling genome segments unique to the alternative genome and the probability mass of each r ∈ R_e derived from sites distant from e.
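A minimal sketch of the neighborhood construction described above: given the start positions where a read maps, collect every length-ℓ_r segment overlapping a mapped segment. The helper below is illustrative only (it assumes a single genome string and 0-based start coordinates), not EAGLE's implementation.

```python
def neighborhood_segments(genome, mapped_starts, read_len):
    """Enumerate the length-read_len segments overlapping any mapped
    segment of a read: every start position within read_len - 1 of a
    mapped start (a sketch of the neighborhood N^r from the text)."""
    starts = set()
    pad = read_len - 1  # any closer start yields an overlapping segment
    for m in mapped_starts:
        lo = max(0, m - pad)
        hi = min(len(genome) - read_len, m + pad)
        starts.update(range(lo, hi + 1))
    return [genome[s:s + read_len] for s in sorted(starts)]
```

Summing P[r|g] over these segments, rather than over the single best alignment, is what lets the model absorb ambiguous gap placement near indels.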

S1.3 Outside Source Revisited
The generic outside source of reads described in our derivation above is inadequate for both technical and biological reasons. In technical terms, the only reads for which a generic source may have a non-negligible posterior probability are reads which are dissimilar to any region of the hypothetical (reference or alternative) genome, i.e. unmappable reads, but in practice we ignore such reads. If our implementation could be improved to perform read mapping to both hypotheses, the generic outside source could play a role in preventing the probability of the read under the hypothesis to which it does not map from becoming excessively small, but in our current implementation we only map to the reference genome. In terms of biology, there is a need to model paralogous sequences found in the individual genome but not in the reference genome; for example, an individual might have an extra occurrence of the Alu repetitive element distinct from any of the numerous Alu elements included in the reference genome. If the extra copy is a recent one, its sequence may be close enough to its paralogous brethren to allow reads deriving from it to map to the reference genome. Thus our method will see those reads, and potentially make incorrect inferences about candidate variants based on them if this clearly non-generic outside source of reads is not modeled effectively.

S1.3.1 Outside Paralog Source
Here we describe a semi-theoretical model of an outside source of sequences paralogous to the human genome. We admit up front that some of the assumptions will seem forced; in fact, this model is reverse engineered from practice. Nevertheless, we feel it is useful to put what we are doing in an explicit theoretical framework, in the hope that exposing the remaining issues will facilitate future improvement of the method and/or theory by ourselves or others. Our generative model of outside paralogs assumes every genome segment g potentially has an additional copy in the sequenced individual which is absent in the reference genome. We use F to denote the "shadow" of G and f_g to denote the doppelgänger paralog of g in F (f is mnemonic for "far", cognate to "para", and F is an alphabetic neighbor of G). Armed with that notation, we seek to model the following:

    P[r|F] = Σ_{g∈G} P[g|F] Σ_f P[f|g] P[r|f]

where P[g|F] is the probability that a read originating from the outside paralog source derives from the doppelgänger copy of g, P[f|g] is the probability of a particular sequence f given that it is an outside paralog derived from a copy of segment g of the inside genome G, and P[r|f] is the probability of a read given sequence f (more generally written as P[r|s]). Realistically modeling P[g|F] would require consideration of the differing tendencies of various types of sequences to repeat frequently; for example, known repetitive elements are much more likely to be repeated polymorphically than segments which occur uniquely in the reference genome. However, for simplicity we choose to assume a uniform distribution over the starting position of g and the general prior distribution L(ℓ_r) on the length of any read derived from a copy of g, so

    P[g|F] = L(ℓ_r) / n

Turning to P[f|g], a realistic model would require assumptions about the distribution of age and mutation rate of paralogs which we do not want to deal with.
Instead, we start with the less ambitious observation that since, by assumption, f is a paralogous copy of g, f should be similar to g, and P[f|g] should be negligible unless f is similar to g. The least informative probability distribution consistent with that assumption is to assign an equal probability to all sequences similar to g. This raises the question of how "similar" is to be defined, which for practical reasons we define in a procedural way via mappability. More precisely, we only consider the possibility that a read r might come from the doppelgänger copy of g if r (multi)maps to g when aligned to G. This is roughly equivalent to viewing the mapping sensitivity threshold as a lower bound on the degree of dissimilarity at which P[f|g] becomes small enough to ignore.
Under these assumptions, summing over the segments of G to which r maps gives:

    P[r|F] ≈ (L(ℓ_r)/n) · m_r · (1/ν_{ℓ_r}) Σ_{f similar to r} P[r|f]

where m_r is the number of segments in G to which r maps, and ν_{ℓ_r} is the size of the set "f similar to r", subscripted with ℓ_r to indicate that the size of this set grows with ℓ_r. The set "f similar to r" is not precisely defined, but could be considered to be implicitly defined by the condition that g and r are similar enough that r is mappable to g; this follows from the assumption that r is a read obtained by sequencing f (a paralog of g). In any case, in practice we replace the set "f similar to r" with a precisely defined one in an approximation given below. As before, we define the prior probability of a DNA sequence, given only that we know it was sequenced to produce some read r, as

    P[f] = L(ℓ_r) / 4^{ℓ_r}

and apply Bayes' law (P[r|f] = P[f|r] P[r] / P[f]) to obtain a formula in which quality scores can be used:

    P[r|F] ≈ (m_r 4^{ℓ_r} P[r] / (n ν_{ℓ_r})) Σ_{f similar to r} P[f|r]

Now we make another approximation.
    Σ_{f similar to r} P[f|r] ≈ a · S_{H_1}(r),   where   S_{H_1}(r) = Σ_{f : HD(f, r*) ≤ 1} P[f|r]

where r* is the sequence called by the read (i.e. the sequence given by the read assuming no errors), HD(f, r*) denotes the Hamming distance between f and r*, and a is a proportionality constant. In other words, the assumption is that summing over sequences with zero or one mismatches relative to the read is an acceptable approximation to summing over all sequences similar to r. Note that, although not completely obvious from the formula above, S_{H_1}(r) can be computed in time proportional to ℓ_r by taking advantage of the fact that changing one character in f changes only one term in the product P[f|r] = ∏_i r_i(f_i). Although motivated by computation cost, the approximation may be a reasonably good one in practice. For example, consider a read of length 100 with all quality scores corresponding to a 1% error rate. The probability that read r was sequenced from its consensus sequence would be 0.99^100 ≈ 0.366, while the probability that it was sequenced from a sequence with exactly one substitution would be 100 × 0.01 × 0.99^99 = 0.99^99 ≈ 0.370. So in this case, the set of sequences within Hamming distance ≤ 1 together accounts for over 73% of the probability mass. In the hope of slightly improving the approximation, we multiply by a constant a ≥ 1, making the somewhat weaker assumption that the sum over the strings within Hamming distance one will be approximately proportional (as opposed to equal) to the full sum.
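The linear-time trick mentioned above can be sketched as follows: the exact-match term is the product ∏_i r_i(r*_i), and each single-substitution term replaces one factor r_i(r*_i) by the total probability of the three alternative bases, 1 − r_i(r*_i). The function and variable names below are our own, a sketch rather than EAGLE's implementation.

```python
def s_h1(vectors, called):
    """S_H1(r): sum of P[f|r] over all sequences f within Hamming
    distance 1 of the called sequence r*, computed in O(l_r) time.

    vectors[i][b] is r_i(b), the quality-score-derived probability of
    base b at position i; called is r* (assumes r_i(r*_i) > 0)."""
    # exact-match term: product of r_i(r*_i) over all positions
    p0 = 1.0
    for vec, base in zip(vectors, called):
        p0 *= vec[base]
    # each single-substitution at position i swaps the factor r_i(r*_i)
    # for the summed probability of the three other bases, 1 - r_i(r*_i),
    # contributing one corrective ratio per position
    correction = sum((1.0 - vec[base]) / vec[base]
                     for vec, base in zip(vectors, called))
    return p0 * (1.0 + correction)
```

For the length-100, 1%-error-rate example in the text this evaluates to 0.99^100 + 100 · 0.01 · 0.99^99, i.e. the "over 73%" of probability mass quoted above.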
We can now replace the generic outside source of reads defined in the previous section, modeling each read as a mixture of the inside and outside-paralog sources:

    P[r|H] ≈ (1 − ξ) P[r|G] + ξ P[r|F]

Assuming the reads are generated independently, applying Bayes' law to write each P[r|g] in terms of P[g|r] (so that the common factor 4^{ℓ_r} P[r] / n cancels in the ratio), and restricting attention to the reads in R_e as before:

    P[R|H_v] / P[R|H_u] ≈ ∏_{r∈R_e} ( Σ_{g∈N^r_v} P[g|r] + h m_r S_{H_1}(r) ) / ( Σ_{g∈N^r_u} P[g|r] + h m_r S_{H_1}(r) )

where in the last line the convenience constant h is redefined as h = ξa / ((1 − ξ) ν_{ℓ_r}). In practice, h is a tunable parameter of EAGLE, and it is not necessary to decompose it into its constituent components ξ, ν_{ℓ_r} and a. The theory described above suggests that the value of h should decrease with the read length ℓ_r, but currently EAGLE uses the same value of h for all reads.
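Given precomputed per-read neighborhood sums, multi-map counts m_r, and S_{H_1}(r) values, the final ratio is a product over R_e, again best accumulated in log space. This is an illustrative sketch with a hypothetical interface, not EAGLE's actual code.

```python
import math

def log_ratio_paralog_outside(per_read_terms, h):
    """Log of the approximate likelihood ratio with the outside-paralog
    source:
        prod over r in R_e of
        (sum_{g in N^r_v} P[g|r] + h * m_r * S_H1(r))
        / (sum_{g in N^r_u} P[g|r] + h * m_r * S_H1(r))

    per_read_terms: iterable of (sum_v, sum_u, m_r, s_h1_r);
    h is the tunable parameter described in the text."""
    total = 0.0
    for sum_v, sum_u, m_r, s1 in per_read_terms:
        outside = h * m_r * s1  # same outside mass in numerator and denominator
        total += math.log(sum_v + outside) - math.log(sum_u + outside)
    return total
```

Note how a large m_r inflates the shared outside term and thereby pulls that read's factor toward 1, which is the multi-map down-weighting discussed next.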
Down-weighting multi-mapped reads: We note in passing that, with the paralog outside source model, multi-mapped reads which potentially support a candidate variant are down-weighted in two ways. First, before even looking at the details of the read, the simple fact that it maps to many places inside the reference genome suggests it may be more likely to also map outside the reference genome if those extra sequences were available; an effect modeled by the m_r term. Second, even if we assume the read comes from the inside, its probability mass will be distributed among the multiple sites it maps to, in accordance with the details (sequence and quality scores) of the read.

S1.3.2 Effectiveness of Outside Paralog Model in Practice
To investigate the practical impact of the outside paralog model, we compared the performance of candidate variant ranking using EAGLE with the full model outside term versus using EAGLE with a constant outside paralog term. We used the average value of the full model outside term over all candidate variants as the value of the constant. As shown in Figure S1, the impact on performance of using the full outside model was small and not even consistently beneficial. This anticlimactic result suggests that for the benchmarks we used, it is sufficient to have an appropriately weighted regularizer term to dampen the likelihood ratio in cases for which all hypotheses considered are poorly supported by the reads. However, we note that the full outside model is designed to be robust in the face of putative variants found in repetitive regions, the hardest case for variant calling. We optimistically speculate that such difficult-to-call variants are under-represented in the gold standard set of variants benchmarked here, and that the full outside model might demonstrate its value if variants in repetitive regions were more fully catalogued. Also, although the datasets we handled have uniform-length reads, we note that the performance of a constant outside term would be expected to degrade for variable read-length datasets. The theoretical reasoning for this expectation is that the inside probability decreases rapidly with increased read length, so to maintain balance the outside term should also decrease with read length.

S2.1 Benchmarking Procedure
To simulate reads from the NS12911 genome, we first applied the phased variants to hg19 to obtain the heterozygous genome. Then we used dnemulator to simulate reads from chromosome 22 using the command:

• fasta-paired-chunks -f 500 -s 30 -l 100 -n 13000000 chr22.SIM.true.fa chr22.SIM.R1.fa chr22.SIM.R2.fa

The fastq quality scores were modeled using a real exome sequencing dataset, DRX000842 from the myelodysplasia study described in the first simulation benchmark, using the command:

We then performed variant calling after mapping the simulated reads to the original hg19 reference genome.
For variant calling, we followed the pre-processing steps described in the GATK 'best practices' to construct a BAM file ready for variant calling as follows:

After variant calling, we also attempted to use the machine learning method Variant Quality Score Recalibration (VQSR) to score the callsets of each variant caller. However, we were only able to analyze the GATK results with VQSR, despite trying various combinations of training data and annotation sets. For Samtools and FreeBayes, the VariantRecalibrator tool exited with an error before finishing. On the Platypus callset it was able to finish, but not all variants in the callset were annotated with VQSR scores. In our simulation experiment using planted NS12911 mutations, we only simulated reads from chr22, which does not overlap with enough variants in the training data to allow training VQSR.

S2.2 Comparison of EAGLE and VQSR
We compared the performance of EAGLE and VQSR on the NA12878 exome and whole genome sequencing datasets (Figure S3). We ran VQSR on the GATK variant calls for the NA12878 exome sequencing and WGS benchmarks with the training sets hapmap 3.3, 1000G omni 2.5, dbsnp (v138 for exome, v1 for WGS), and Mills 1000G gold standard indels, obtained from the GATK resource bundles along with the corresponding 'best practice' recommended parameters.
Tentatively, the performance is similar, except for WGS indels and exome sequencing SNPs at mid-recall, where EAGLE has significantly better precision. However, one advantage of VQSR is its ability to identify variants with abnormally high read depths (indicating CNVs or perhaps PCR errors), which EAGLE does not currently account for (as seen at low recall levels for WGS SNPs). We also note that the number of called variants suitable for VQSR training on this benchmark dataset is probably unusually high. Although the VQSR documentation recommends 30 exome samples to obtain the "thousands" of training points needed, a single NA12878 exome sequencing run yields ∼350,000 training points. This likely does not represent the typical use case.

S2.3 Simple performance evaluation using a doctored reference genome
This section describes the results of an additional performance evaluation not mentioned in the main text. We conducted a simple simulation study using real human exome sequencing data from the normal sample (DRX000842) of a myelodysplasia study [2], but using a doctored version of the reference genome with planted "mutations". This scheme provides a partial gold standard for variant calling; for example, if we add a 6bp insertion into the doctored reference genome, a good variant caller should detect that discrepancy as a 6bp deletion in the sample genome. This simulation scheme is rather limited, but it does have the benefit of using real sequencing reads, and can therefore give some indication of whether our assumption of read independence is tenable.
For each variant caller's callset, we evaluated: the planted mutations, called variants that neighbor the planted mutations, and 1000 randomly selected called SNPs and 1000 randomly selected called indels. We then compared our likelihood ratio with each caller's quality scores in terms of ability to rank variants. Note that while many of these randomly selected calls may be correct, some probably are not, whereas we know with certainty that the planted mutations are correct. Thus, all things being equal, a ranking which places the planted mutations higher is plausibly of higher quality. Overall, the EAGLE results were encouraging (Table S5, Figure S4).

Table S3: NA12878 Runtime (wall-clock, hrs) using 8 cores (2.53 GHz). VC represents the time required for the variant calling procedure, excluding read mapping. Pre-processing (mark duplicates, indel realignment, base quality score recalibration) required 5.78 hrs and 94.07 hrs for GIAB and IPG respectively, and is included in the VC time (note that some variant callers and some pre-processing steps are not multi-processing capable). EAGLE represents the time required to run EAGLE on each callset. Immediately before publication of this manuscript (after the times reported in this table were measured), we implemented computational optimizations which make the current version of EAGLE run faster (to be described elsewhere).