Post hoc deconvolution of human mitochondrial DNA mixtures by EMMA 2 using fine-tuned Phylotree nomenclature

Graphical abstract


Introduction
Mitochondrial DNA (mtDNA) enjoys popularity in different fields of research, particularly with the emerging Massively Parallel (Next Generation) Sequencing (MPS) technologies. Full mtDNA genomes (mitogenomes) can meanwhile reliably be generated from minute amounts of ancient DNA [4], challenging samples in the forensic field using a variety of techniques [23,29,31] or in medically-oriented studies [34]. In contrast to nuclear DNA that is present in two parentally inherited copies in the nucleus, mtDNA, being located in the mitochondria, is exclusively passed along the maternal lineage and present in many (up to thousands) molecules per cell.
Mixtures of mtDNA are commonly observed in sequencing results and can be assigned to the following sources: first, an mtDNA mixture can consist of contributions from two individuals, either in the form of contamination or as genuine mixture from two individuals, which has particular relevance in forensic genetics [33]. Second, mtDNA is also known to be present in heteroplasmic form within the tissue/cell of an individual [3,16,20], which is manifest as mixture of almost identical mtDNA molecules that typically differ only at few nucleotide positions (ntp; mostly between 1 and 3 per individual, [15,30]. Heteroplasmy is specifically relevant in the medical genetic field at disease-associated positions, where the relative amount of the minor, disease-causing variant can be crucial for the severeness of a disease [34]. A third form is observed as mixture of mtDNA with nuclear mitochondrial elements (NUMTs), whose abundance and frequency have been underestimated with earlier Sanger-based sequencing technologies [11]. The diversity of NUMTs was only recognized when MPS-based technologies were introduced and appropriate identification methods for NUMTs were developed [7]. Finally, one can encounter mtDNA mixtures known to originate from genuine mtDNA and artefacts such as drop-in effects, damage patterns, background signal and more. For more details we direct the reader to a recent review for more in-depth discussion of mixtures from an analytical stand-point [22].
It can be challenging to discern the different forms of mixtures (see [22]), particularly for less experienced analysts. In this study we developed and evaluated a strategy to deconvolute mtDNA mixtures using phylogenetic principles. Earlier attempts to separate human mtDNA mixtures applied physical methods, such as denaturing high-performance liquid chromatography [6,18], base composition profiling using mass spectrometry [12], quantitative data in a statistical framework [9] and MPS-based data [5,13,21] with a continuous statistical phasing framework [28]. While some of these strategies have proven useful, there is still a lack of technology-agnostic, non-quantitative and easy to use mtDNA mixture deconvolution tools that provide traceable splittings on any kind of sequencing data (Sanger and MPS).

Mitotypes and symbols
A mitotype or profile describes the nucleotide sequence of an investigated mtDNA fragment as the reading range plus a list of differences to the rCRS [2] coded by the position and the differing symbol. The symbol alphabet has a size of 31 and contains the five unique symbols A,C,G,T and -(the gap for a deletion), and the 26 ambiguous symbols R, Y, S, W, K, M, B, D, H, V, N, a, c, g, t, r, y, s, w, k, m, b, d, h, v, n using the extended IUPAC code [25]. Here, lower-case symbols additionally include the gap. For mitotypes T 1 ,T 2 ,. . .,T k the formal mixture T 1 &T 2 &. . .&T k is the mitotype obtained by joining the symbols of the corresponding sequences at each position. E.g., for T 1 = 73G 263G and T 2 = 152C 263G we have T 1 &T 2 = 73R 152Y 263G. We use the ampersand for mixtures because the plus sign may be contained in the names of the haplogroup motifs, such as H + 195, which denotes the motif of haplogroup H with the additional transition at position 195. Conversely, given an arbitrary mitotype T, we call the representation T = T 1 &T 2 &. . .&T k with suitable mitotypes T 1 ,T 2 ,. . .,T k a splitting of T with components T 1 ,T 2 ,. . .,T k . For instance, if the query profile has symbol Y at some mtDNA position, the theoretically possible splittings with two components at this position are C&T, C&Y, T&C, T&Y and Y&Y.

Database of haplogroup motifs
The original haplogroup motifs of Phylotree [32] Build 17 from 18 February 2016 were revised in [8]. The revised motifs included 966 additional motifs for yet undetermined subclades. Afterwards we noticed that some additional GenBank mitogenomes of [17] are questionable because of the possible omission of the diagnostic 9 bp-deletion at 8181-8189. Therefore, we deleted all subclades affected by these mitogenomes and provide an updated list of revised motifs in Tables S1-S4. This list of 6380 profiles from 5435 haplogroups consists of 5435 refined haplogroup motifs from Phylotree 17 and 945 additional motifs for subclades. Some motifs carry ambiguous symbols because the corresponding mutations are unstable or uncertain according to Phylotree, e.g., 207R for haplogroup L0f or 16362Y for haplogroup H13b.

The deconvolution problem
For a given mitotype Q and a fixed number k of components the challenge is to find all combinations of mitotypes Q 1 ,Q 2 ,. . .,Q k with Q = Q 1 &Q 2 &. . .&Q k where the components Q 1, Q 2 ,. . .,Q k resemble haplogroup motifs as close as possible, and the associated haplogroup combinations. Here, differences between the components and the haplogroup motifs may be measured by parsimony or likelihood.

Irredundant extension splittings
To reduce the number of possible splittings without losing relevant splittings, we only consider irredundant extension splittings. Here, the extension condition requires that at every position each unique symbol contained in both the query and the motif symbol is also contained in the symbol of the corresponding component.
The irredundancy condition requests that at every position no unique symbol of any component can be omitted without destructing the splitting or violating the extension condition. E.g., for query symbol Y (C and T), component number 2, first motif symbol C and the second motif symbol T, the only irredundant extension splitting is Y = C&T. Contrary, for query symbol B (C, G and T), component number 3, first motif symbol M (A and C), second motif symbol C and third motif symbol G, the three irredundant extension splittings are B = Y&C&G, B = C&Y&G and B = C&C&K, where K is a mixture of G and T.

Log-likelihood ratios
Differences between the components and the haplogroup motifs are quantified by costs, which are sums of log-likelihood ratios (LLRs) of fluctuation rates at each position. For unique symbols fluctuation rates were introduced in [27] as probabilistic rates to measure the fluctuation of positional mutations within haplogroups, and have been refined in [14]. In [27] the LLR of component symbol c and motif symbol m has been defined as log 10 (r(c|c)/ r(c|m))/3, where r(c|c) and r(c|m) denote the rates and the normalization of the logarithm has been chosen to obtain LLRs of about 1.0 for standard mutations. E.g., the transition C151T has an LLR of about 0.7, whereas the rarer transversion C16053G has an LLR of about 1.4. In [27] it was shown that for mitotypes minimizing sums of LLRs is an efficient method to maximize the likelihood function.
For an ambiguous symbol in the components the LLR is computed as the arithmetic mean of the LLRs of all unique symbols contained in the component symbol. For an ambiguous symbol in the haplogroup motif the LLR is chosen as the minimal LLR of all unique symbols contained in the motif symbol. E.g., for the component symbol Y and the motif symbol C, we have LLR(Y|C) = LLR(T| C)/2 because LLR(C|C) = 0, while for the component symbol C and the motif symbol Y, we have LLR(C|Y) = 0 for the same reason.
For a perfect splitting, where the components equal haplogroup motifs, the costs are zero. Conversely, if the costs of a splitting are zero, then the components of the query profile differ from haplogroup motifs only by irrelevant mutations. E.g., for the query symbol Y and the motif symbols T and M (A or C) the optimal splitting is Y = T&C with cost LLR(T|T) + LLR(C|M) = 0.

Overview of the deconvolution algorithm
The proposed algorithm is shown in pseudocode in Document S5 and has a single mitotype for the putative mixture and the putative component number as input. Exact genotyping using ambiguous symbols at positions with multiple nucleotides is important because the algorithm relies on the supplied mitotype. Furthermore, the mitotype should be phylogenetically aligned [14] to comply with the haplogroup motifs from Phylotree.
The output is graded by clustering costs of splittings with a margin of 0.5 as for haplogrouping in EMPOP [14]. Splittings have rank 1 if their costs are less than mincost1 + 0.5 where mincost1 denotes the minimal cost among all splittings. Splittings have rank 2 if their costs are equal or greater than mincost1 + 0.5 but less than mincost2 + 0.5 where mincost2 denotes the minimal cost among all splittings that do not reach rank 1. For each rank the algorithm outputs the range of the observed costs, the list of splittings and the list of maximal haplogroup combinations. As the algorithm does not use quantitative information about mixture components, the number of possible splittings can be huge because of the many possibilities of reassigning private mutations of the true components, and displaying long lists of splittings or haplogroup combinations in the output would not be instrumental. Thus, only maximal combinations of haplogroups with respect to the covering relation are listed in Phylotree order [32]. Here, a haplogroup combination is covered by another haplogroup combination if, after a possible reordering, each haplogroup is a subhaplogroup of or equal to the corresponding haplogroup of the other combination. E.g., the two haplogroup combinations R&J and J&U are listed in Phylotree order because R is listed before J and J before U in the tree, but J&U will be omitted in the output, because U is a subhaplogroup of R and thus J&U is covered by R&J. Therefore, when deconvolving mixtures with known components in Section 4, we consider the algorithm successful if the true haplogroup combination is covered by a combination of rank 1 or 2.

Details of the algorithm
The algorithm uses an exhaustive search on the motif combinations to find all optimal or nearly-optimal irredundant extension splittings, has been implemented in C with OpenMP and applies several optimizations described below to achieve acceptable running times for mixtures with at most three components (a mixture of three full mitogenomes takes less than one hour on a custom PC). First, the motif profiles are condensed to the reading range of the query profile to avoid duplicate comparisons. For each condensed database profile its haplogroup is determined as the most common recent ancestor (MRCA) of the haplogroups of the corresponding motif profiles, e.g., for the CR the 6380 haplogroup motifs are condensed to 3439 CR profiles. Second, all possible combinations of motif profiles are tested, which results in about 20 million tests for two components or in about 44 billion tests for three components of full mitogenomes. The algorithm generates all irredundant extension splittings by using precomputed binary tables at every position. These tables have k columns corresponding to the components of the splitting, l rows corresponding to the unique symbols in the query symbol, and entries of 0 or 1. The motif table has the entry 1 if the motif symbol for that component contains the corresponding symbol of the query profile, and 0 otherwise. A table is called an extension table if its entries are 1 whenever the corresponding entries of the motif table are 1. Thus, extension tables are obtained from the motif table by converting zero or more entries from 0 to 1. An extension table is called feasible if each row and each column contains at least one 1, and a feasible extension table is called irredundant if no 1 can be changed to 0 without violating the extension condition or losing feasibility. Then, irredundant extension splittings can be described by irredundant extension tables at every position. In the first example of section 3.1 with query symbol Y (C and T), component number 2, first motif symbol C and the second motif symbol T, the only irredundant extension  table is the motif table shown in Table 1a.
In the second example of section 3.1 with query symbol B (C, G and T), component number 3, first motif symbol M (A and C), second motif symbol C and third motif symbol G, the three irredundant extension tables can be obtained from the motif table shown in Table 1b by replacing one 0 in the last row by a 1.
The cost of an irredundant extension splitting is the sum of the LLRs over all positions where query profile or motif profiles differ from the rCRS. The algorithm stops the summation as soon as the current sum exceeds rank2cost + 0.5, because then the splitting cannot reach rank 1 or 2. To decrease the rank 2 costs rapidly the motif profile combinations are generated in the graded lexico-graphic order of the motifs (A, C, D, . . ., X, Y, Z, A1, A3, A5, . . .) such that the combinations of the superhaplogroups are tested first.

Results and discussion
The interpretation of mixed genetic data deriving from multiple sample donors remains one of the biggest challenges in practical forensic casework. Since the introduction of DNA profiling in the early 1990 0 s, the forensic community has been tackling this issue in various ways and with varying success. A lot of foreground was produced in mixture deconvolution of nuclear DNA markers for human identification, such as Short Tandem Repeats (STRs) over the years, particularly with the development of concepts and software that support the analyst by providing quantifiable information of possible contributors (see e.g., [10]. In contrast to the Mendelian nature of STR markers, haploid genomes carry linked patterns of signature mutations, especially with the analysis of SNPs, such as mtDNA variants, that can be used to assign mixture components to the contributing sources (e.g., [33,28]. Here, we apply a new algorithm for the splitting of mtDNA mixtures based on the mitochondrial phylogeny with the major advantage that this concept can be applied to any mtDNA sequences regardless of the technology used. An obvious limitation is the lack of quantitative information on the contributing sources as only the consensus haplotype is used as input for the deconvolution.
Since the concept is based on phylogenetic principles, mixtures are evaluated based on contributing haplogroups rather than haplotypes. The main reason for this is that some mutations, particularly hot-spots, cannot unambiguously be assigned to only one of the possible components. Instead, they can belong to multiple contributors present in a mixture. Hence, the result supplied by EMMA 2 represents a ranked evaluation of possible contributing haplogroups to 2-and 3-person mixtures. As shown in our examples, this strategy provides useful information for the interpretation of a mixed mtDNA profile.
In the following we evaluate the performance of the algorithm EMMA 2 on a variety of artificially created and real-world mixtures, for which the true components are known. We discuss the output in a forensic context that can however also be applied to any other discipline using mtDNA sequences.

Artificial mixtures
To evaluate the algorithm 3751 quality-controlled full mitogenomes from the EMPOP database [24] with different geographic and phylogenetic background were collected, including 1178 Westeurasian, 2036 East Asian, 529 Native American, and 8 Oceanian mitotypes. From this dataset 1000 random mixtures were generated with two components and 100 random mixtures with three components. The deconvolution algorithm was used to compute the maximal haplogroup combinations of ranks 1 and 2 and checked whether the true haplogroup combination was covered by the rank 1 or 2 combinations. We consider the inclusion of the true mitotype combination in the output as successful outcome of this approach as this provides useful information for the analyst and guides further evaluations or experiments (e.g., targeted sequencing of particular positions of interest). Table 2 shows that Table 1a Motif table for the first example in section 3.1.

Nucleotide
Component 1 Component 2 The results of the artificial mixture experiments aptly demonstrate the general suitability of the EMMA 2 algorithm. The single exception in 1100 tested combinations however pinpoints at a limitation of the approach in those instances, where contributing mitotypes harbor private mutations that constitute signature mutations in other haplogroups. The experimental study shows that this phenomenon may be rare but needs to be kept in mind when interpreting mixtures.

Biological mixtures
In order to evaluate the algorithm with real-world mixtures, we used a variety of different resources that were generated in our laboratory with either Sanger-based or MPS-based technologies (examples 1 and 3) or MPS data that we reviewed for quality control purposes (example 2).

Example Mega-NUMT
The first example is derived from sample IV.3 from [19], where the authors provide evidence for the presence of Mega-NUMTs (multi-copy inserts of full mitogenomes) in the nuclear DNA. The full mitotype listed in the fifth column of Table 2 of [19] has 33 ambiguous symbols spread over the whole mitogenome. Table 3 shows the hypothetical number of components in the first column, the range of costs of rank 1 splittings in the second column, and the haplogroup estimates of EMMA 2 of rank 1 in the last column.
In [19] the authors identified this mitotype mixture by cloning and MPS experiments as a mixture of haplogroups V and U4c1, which confirms the EMMA 2 haplogroup estimates in rank 1. The computed two-component splittings of lowest costs 2.74 have components composed of haplogroup V and U4c1 motifs, and five private mutations that can be arbitrarily assigned to the component.

Example short NUMTs
The In contrast to Mega-NUMTs where ambiguous symbols are expected to be distributed across the entire mitogenome, 13 of the 15 mixed positions occur in the range 9494-9578 here. This suggests the presence of short NUMT sequences overlapping with the genuine mtDNA in this region. Therefore, we applied a modified version of the algorithm with motifs as database profiles only in the first component and NUMTs from [35] as database profiles in the other components. Table 4 shows the hypothetical number of components in the first column, the range of costs of rank 1 splittings in the second column, and the haplogroup/NUMT estimates of EMMA 2 of rank 1 in the last column.

Examples from GEDNAP
As third example two-and three component mixtures generated in the course of the proficiency test program GEDNAP [26] were used to sequence the CR applying Sanger-based sequencing. For each mixture Tables 5a-i (2 persons) and Tables 6a-c (3 persons) show the reported mixed mitotype in the first row, the hypothetical number of components in the first column, the range of costs of rank 1 splittings in the second column, the haplogroup estimates of EMMA 2 of rank 1 in Phylotree order in the last column, and the mitotypes of the single-source components that were known to the laboratory in the last two or three rows. Haplogroup estimates that cover the true haplogroup combinations are indicated in bold-face.
In GEDNAP 30 Stain 2 (Table 5a) the formal mixture of the true components coincides with the mitotype reported by the lab. The two-component splittings of lowest cost 2.09 have components   from haplogroups H5a1j and U6a3a2. As the algorithm also finds splittings into components from haplogroups H5 and U6a3a2 with cost 2.21, the true combination H5a1j&U6a3a2 is not listed in the table, because it is covered by the rank 1 combination H5&U6a3a2.
In GEDNAP 30 Stain 4 (Table 5b) the formal mixture of the true components differs from the mitotype reported by the lab by the variants 309.1c, 573.1c, 573.2c and 573.3c for technical reasons, as it was difficult in this example to interpret C-stretch variation in the mixed haplotype. The two-component splittings of lowest cost 1.99 have haplogroup combination H1a3&H5. The true splitting with haplogroup combination H1a&H5 has cost 2.35 and reaches rank 1. The low cost 1.99 of a hypothetical mixture of three components can be explained by the fact that the motifs of the haplogroups H and U2 differ in the CR only by the transitions 16051G and 73G, and that the true second component harbors both transitions.
In GEDNAP 32 Stain 3 (Table 5c) the formal mixture of the true components differs from the mitotype reported by the lab by the variants 309.2c, 523a, 524c, 16126Y, 16164R and the missing reads 16180 M and 16181 M. This was an extreme mixture as far as the quantitative contribution of the sources were concerned, which is why some of the mixed positions could not be observed in the Sanger sequencing data. The two-component splittings of lowest cost 3.91 have haplogroup combinations D5a2a + 16092&J1c, D5a2a + The results of the GEDNAP samples demonstrate the strength but also pinpoint limitations of the algorithm. GEDNAP samples are generally considered authentic forensic samples, as they derive from a proficiency testing program that provides specimens that closely mimic real-world casework samples. There, unbalanced mixtures occur that do not show the mixture at all theoretically affected positions. This is particularly true for Sanger sequencing that is not a fully quantitative method. Hence, missing variants obviously affect haplogrouping and thus may have an impact on the ranked presentation of haplogroup combinations. Also, homopolymeric stretches that may be already challenging to interpret in single source data result in complex constellations in mixtures that cannot be unambiguously called and thus also may affect haplogrouping. Still, the true components of the GEDNAP mixtures were included in rank 1 and 2 assignments by the algorithm for all examples.
We further observed that the true haplogroup combination was not always explicitly stated in the output, but in some cases masked by the coarser haplogroup combination. This is due to the applied irredundancy setting that minimizes the reported combinations in a convenient way. Also, we note that 3-person mixtures may yield lower costs than true 2-person mixtures, when shared mutations can be assigned in a different, ''cheaper" way. This is also true when private mutations are recognized as signature mutations in other haplogroups as observed above. Still, in the vast majority of cases, the splitting suggested by EMMA 2 is including the true haplogroups in the suggested combinations, which constitutes a useful outcome for further forensic interpretation.

Conclusions
Mixtures of mtDNA occur frequently, either from two contributing individuals, from NUMTs, as heteroplasmic events, as modifications of mtDNA during aging or as artefacts as a result of the analytical process. Some of these mixtures display similar or even identical phenotypes in the raw sequencing data and can therefore not easily be differentiated by the analyst. While heteroplasmy generally leads to less affected positions in the raw data and is restricted to known hot-spot positions in the CR, Mega-NUMTs and genuine mtDNA mixtures generally affect multiple positions at haplogroup-diagnostic sites. These measures alone are however not sufficient to discern the contributing sources of mixtures easily by hand.
We here present a new algorithm that will be available in EMMA 2, a software implemented in the EMPOP database [24] that produces splittings based on phylogenetic principles. There are several advantages to this concept: the algorithm can be used with any kind of sequencing data, regardless of the sequencing chemistry and technology used. It does not require quantitative data, which are sometimes difficult to extract from the raw data, e.g., from BigDye Terminator Sanger-based technologies. The algorithm was optimized with respect to computing time and resources and can be performed in reasonable time on a custom PC. Furthermore, the algorithm is not hidden in a ''black box", but leads to traceable results based on the concept outlined here. This lends this algorithm a suitable post hoc tool for the analysis of general mtDNA sequencing data.
The splitting results, albeit confirmatory of the input mitotypes in most cases shown here, cannot be taken at face value without considering alternative options. Our evaluation demonstrates that (an excess of) private mutations can lead to alternative splittings that result in lower costs and will thus be favored by the algorithm over the true mixture components. This can however be recognized by the analyst, as haplogroup diagnostic and private mutations are provided in the output.
Our results show that the algorithm is generally reliable. It thus adds to existing suite of mixture deconvolution tools that use other concepts and sources, e.g., quantitative data, to identify the components. More research is needed in terms of documenting and cataloguing mtDNA variation to further improve phylogeny-based algorithms. The proposed algorithm is also applicable to NUMTS or to non-human mtDNA if a representative database such as the 6380 revised haplogroup motifs of Phylotree 17 for human mtDNA is available.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.