Dcifer: an IBD-based method to calculate genetic distance between polyclonal infections

Abstract An essential step toward reconstructing pathogen transmission and answering epidemiologically relevant questions from genomic data is obtaining pairwise genetic distance between infections. For recombining organisms such as malaria parasites, relatedness measures quantifying recent shared ancestry would provide a meaningful distance, suggesting methods based on identity by descent (IBD). While the concept of relatedness and consequently an IBD approach is fairly straightforward for individual parasites, the distance between polyclonal infections, which are prevalent in malaria, presents specific challenges, and awaits a general solution that could be applied to infections of any clonality and accommodate multiallelic (e.g. microsatellite or microhaplotype) and biallelic [single nucleotide polymorphism (SNP)] data. Filling this methodological gap, we present Dcifer (Distance for complex infections: fast estimation of relatedness), a method for calculating genetic distance between polyclonal infections, which is designed for unphased data, explicitly accounts for population allele frequencies and complexity of infection, and provides reliable inference. Dcifer’s IBD-based framework allows us to define model parameters that represent interhost relatedness and to propose corresponding estimators with attractive statistical properties. By using combinatorics to account for unobserved phased haplotypes, Dcifer is able to quickly process large datasets and estimate pairwise relatedness along with measures of uncertainty. We show that Dcifer delivers accurate and interpretable results and detects related infections with statistical power that is 2–4 times greater than that of approaches based on identity by state. Applications to real data indicate that relatedness structure aligns with geographic locations. Dcifer is implemented in a comprehensive publicly available software package.


Concept
Many problems that require traversing all combinations of a given type that satisfy a number of conditions and performing computations/manipulations on each combination can be conveniently framed in terms of multisets.A multiset M , which is a collection of elements that do not need to be distinct (unlike elements in a set), may be defined as a 2-tuple (A, m), where A is some set of elements that includes those present in M and m : A → Z ⩾ is a function from A to the set of non-negative integers giving multiplicity of each element in A. Thus A is a superset of the support of M , where Supp(M ) is a set of unique elements of M ; A ⊇ Supp(M ).
If a combination is represented by a multiset, all combinations that need to be enumerated can form a collection of multisets.Let A be a finite set containing all the possible elements of the multisets in this collection; in addition, assign some arbitrary order to the elements of A. Then each multiset can be represented by a sequence of multiplicities of the ordered elements of A, i.e. the number of occurences of each element in a corresponding multiset.For a collection of multisets, these sequences are of the same length and their elements are finite non-negative integers.In turn, each sequence can be thought of as a representation of a number in some positional numeral system, with each element of the sequence being a single digit of that number (regardless of how many digits are used for that element in, for example, the decimal numeral system).Then all the sequences in the collection can be translated into distinct numbers in the same numeral system.
To account for all the combinations, it might be useful to impose some kind of ordering on them and, consequently, a rule prescribing how to move from one combination to the next.Ultimately, finding a rule that would allow one to efficiently traverse all the necessary combinations is the problem to solve.Since a combination can be represented by a single number, moving from that combination to the next can be as simple as incrementing or decrementing that number or, if there are constraints (e.g. a bound on a sum of the digits of the number), moving to the closest "allowed" number.Because different representations of a combination described above are bijective, the "next" number uniquely defines a combination in the collection.Figure 1 provides a diagram of how the chain of these representations can be used to move from a given combination to the next one.

Positional numeral systems
A base, or radix, is the number of unique digits at each position: the smallest digit is 0 and the greatest is radix minus 1.
• Standard systems: fixed base -same base at each position (e.g.decimal system: base 10, binary system: base 2); The number represented by a sequence of digits (a n , a n−1 , . . ., a 1 , a 0 ) (usually written as

Incrementing
Incrementing a number in a mixed radix system can follow the same principal as in standard systems: going from right to left, find the first position that can be incremented, add 1, and set all the positions to the right of that position to 0 (Figure 2).The star in the top panel of Figure 2 (representing x) indicates the position to be incremented.Note that having positions with base 1 can be useful in some situations.The digital representations of x and x + 1 in the same mixed radix numeral system are shown in Figure 3 (in decimal system, their values are 7, 917, 695 and 7, 917, 696).

Constraints
The incrementing procedure described above is outlined in Knuth [2011] Algorithm M (Mixedradix generation) for generating all n-tuples.By incorporating some constraints into that algorithm and adding modifications, we can efficiently solve a surpising number of problems.A very useful constraint would be an upper limit on cardinality of the multisets we want to generate; we might also want to set a lower non-zero bound on a digit at each position.A modification that includes position-associated weights could accomodate another set of problems.Here we extend this simple and intuitive algorithm to include the cardinality constraint, look at other possible extensions, and illustrate the utility of this approach with the following examples: • Generate all the collections of non-unique elements of a given size n from a set A of their unique elements (all the multisets M l such that |M l | = n and Supp(M l ) = A); • Find all the sub-collections of a given collection B with a cardinality constraint n (all the multisets M l such that M l ⊆ B and |M l | ⩽ n); • Find all the binary sequences with elements in {0, 1} of length k where the number of 1's does not exceed n ⩽ k; • Enumerate all the collections of elements whose sum is equal to a given number n.
Including a lower bound S min on the sum as a constraint would be very similar to incorporating an upper bound in the algorithm above.Instead of a sequence of zeros, the initial sequence m init would represent the smallest number satisfying S min condition, e.g. for d max = (3, 9, 1, 2) and S min = 5, m init = (0, 2, 1, 2).

Versions and modifications
In addition to specifying d max,i for each position, it might be useful for some problems to specify d min,i instead of using 0 as the smallest value.The simplest example would be to generate multisets with the same support, where m(a i ) > 0 ∀ i; in other examples varying d min,i from position to position might be needed.The straightforward solution in that case would be to run the algorithm with a new sequence d ′ max , where d ′ max,i = d max,i − d min,i ∀ i, and S ′ max = S max − k i=1 d min,i , then add d min,i to m i for each i in all resulting multisets.Alternatively, it might be a little more efficient to rewrite the algorithm substituting d min,i for 0's and calculating the "largest number" taking d min into account.
Another possible version could have a constraint W max , such that n i=1 a i ⩽ W max instead of S max (a sum of all elements in a multiset instead of its cardinality).In this case, each position could be assigned a weight w i (e.g.w i = W max −i+1) and the condition rewritten as l i=1 m i w i ⩽ W max .
Algorithm 2: Mixed radix incrementing, weighted sum version Input : An integer W max .
Output: A collection C of sequences satisfying W max condition.
Using the mixed radix incrementing approach for this particular problem is not necessarily the most efficient way to traverse the combinations, but it has an advantage of being intuitive and easily constructed without any additional knowledge.

Multisets of given cardinality and support
Generate all multisets M l of cardinality n with Supp(M l ) = A, where A is a given set: 1 for an example).A problem like this can arise, for example, when a multiset of a known cardinality that we are interested in is unobserved and only the set of its unique elements is observed.
The number of multisets satisfying these conditions is called a multiset coefficient (or multiset number); it is a function of n and k ≡ |A| and is equal to n−1 k−1 .It can also be defined recursively: Then the multiset coefficient is equal to for (yi in y:1) { # y:1 for clarity + res <-res + mcoef(x -1, yi) The algorithm outputs multisets of cardinality less than or equal to a given number, so to get the multisets of the given cardinality only, we run the algorithm with one position removed (k ′ = k − 1) and set d min,i = 1 ∀ i.Note that for this problem, the base is the same for all the positions and is equal to n − k + 1.The output is a matrix, in which each multiset M l is represented by a column with multiplicities of elements of A (the third column in Table 1).Below we generate multisets for n = 8 and k = 4; the last row is added after running the algorithm to bring the cardinality of each multiset to n: Table 2: Multisets included in a multiset B of different conditions on cardinality.

Binary sequences with a bound on the sum
Produce all binary sequences (a 1 , . . ., a k ) of length k and k i=1 a i ⩽ n (Table 3).The solution of course is equivalent to generating combinations of l out of k elements, l ⩽ n with the corresponding number of such sequences equal to k The mixed radix incrementing algorithm, however, might provide a more efficient solution, which in this case is as simple as incrementing in binary numeral system but with an optional constraint on the sum of the digits.
As an example, we generate binary sequences of length 5 and the sum bounded by 3: Here we generate multisets for W max = 8: > # The output is represented by the right side of Table 4 > # (each column in the matrix corresponds to a row in the Table ).> s <-8 > # Function mirsaW() runs the weighted version of the algorithm > # eq = TRUE specifies that the sum of the elements should be equal to s > mirsaW(s, eq = TRUE)

Conclusion
We showed just a few examples to demonstrate the utility and efficiency of the mixed radix incrementing approach.The algorithm is used to solve combinatorial problems that arise in Dcifer likelihood calculation, illustrated by the first three examples.Example 3.1 represents all the combinations of non-unique alleles that are compatible with a given COI and detected (unique) alleles; example 3.2 traverses subcollections of non-unique alleles that are shared between two infections at a given locus (the number of subcollection's elements constrained by M , the number of strain pairs that can be related); and 3.3 represents possible (IBD 1,t , . . ., IBD M,t ) sequences, where the number of 1's cannot exceed the number of shared non-unique alleles at a locus t.The problem in 3.4 appears, for example, in calculation of higher-order statistics.When a problem can be framed in terms of generating all the possible multisets with some prescribed conditions, it may be solved with this unified approach and algorithm versions tailored to accommodate various kinds of constraints.