EagleImp: fast and accurate genome-wide phasing and imputation in a single tool

Abstract

Motivation: Reference-based phasing and genotype imputation algorithms have been developed with sublinear theoretical runtime behaviour, but runtimes are still high in practice when large genome-wide reference datasets are used.

Results: We developed EagleImp, a software based on the methods used in the existing tools Eagle2 and PBWT, which allows accurate and accelerated phasing and imputation in a single tool through algorithmic and technical improvements and new features. We compared the accuracy and runtime of EagleImp with Eagle2, PBWT and prominent imputation servers using whole-genome sequencing data from the 1000 Genomes Project, the Haplotype Reference Consortium and simulated data with 1 million reference genomes. EagleImp was 2–30 times faster (depending on the single- or multiprocessor configuration selected and the size of the reference panel) than Eagle2 combined with PBWT, with the same or better phasing and imputation quality in all tested scenarios. For common variants investigated in typical genome-wide association studies, EagleImp provided the same or higher imputation accuracy than the Sanger Imputation Service, the Michigan Imputation Server and the newly developed TOPMed Imputation Server, despite their larger (not publicly available) reference panels. Additional features include automated chromosome splitting and memory management at runtime to avoid job aborts, fast reading and writing of large files, and various user-configurable algorithm and output options. Due to the technical optimisations, EagleImp can perform fast and accurate reference-based phasing and imputation and is ready for future large reference panels in the order of 1 million genomes.

Availability and implementation: EagleImp is implemented in C++ and freely available for download at https://github.com/ikmb/eagleimp.

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Original Core Algorithms of Eagle2 and PBWT

Here, we briefly describe the core algorithms of the original tools Eagle2 and PBWT. For details, we refer to the original publications [4] and [1].

Eagle2 Phasing Core Algorithm
The phasing core of Eagle2 comprises several preliminary steps that precede the actual phasing; all of them are performed independently for each target sample.

K best haplotypes
At first, a subset of the reference consisting of the best-fitting haplotypes is determined. These haplotypes have the lowest genotype-to-haplotype distance to the sample. (This distance is simply calculated by counting the number of haplotype alleles inconsistent with the corresponding target genotype.) The size of the subset is defined by the input parameter K. A smaller K reduces the number of haplotypes selected from the reference panel and thus reduces phasing quality in favor of a faster runtime. Eagle2 uses K = 10,000 by default as a good compromise between quality and speed. This step is omitted only if K is greater than the total number of available reference haplotypes, in which case simply all haplotypes are taken.
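As an illustration of this selection step, the following C++ sketch computes the genotype-to-haplotype distance and selects the K closest reference haplotypes. All names are hypothetical, and the plain integer coding (genotype 0 = hom-ref, 1 = het, 2 = hom-alt) is a simplification; EagleImp itself uses the bit-packed representation described later.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// A haplotype allele (0 or 1) is inconsistent with a genotype only if the
// genotype is homozygous for the other allele; heterozygous genotypes are
// consistent with either allele.
int genotypeToHaplotypeDistance(const std::vector<int>& genotypes,
                                const std::vector<int>& haplotype) {
    int dist = 0;
    for (std::size_t m = 0; m < genotypes.size(); ++m) {
        if ((genotypes[m] == 0 && haplotype[m] == 1) ||
            (genotypes[m] == 2 && haplotype[m] == 0))
            ++dist;
    }
    return dist;
}

// Select the indices of the (up to) K reference haplotypes with the
// smallest genotype-to-haplotype distance to the target.
std::vector<std::size_t> selectKBest(const std::vector<std::vector<int>>& refHaps,
                                     const std::vector<int>& genotypes,
                                     std::size_t K) {
    std::vector<std::pair<int, std::size_t>> dists;
    for (std::size_t i = 0; i < refHaps.size(); ++i)
        dists.emplace_back(genotypeToHaplotypeDistance(genotypes, refHaps[i]), i);
    std::sort(dists.begin(), dists.end());  // by distance, ties by index
    std::vector<std::size_t> best;
    for (std::size_t i = 0; i < std::min(K, dists.size()); ++i)
        best.push_back(dists[i].second);
    return best;
}
```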

Condensed reference and PBWT data structure
The set of the K best haplotypes is afterwards reduced to call sites. These are the sites where a phase call is required (basically heterozygous target sites). The condensed reference stores the allele of each haplotype at each call site, plus a flag indicating whether each segment between two call sites (implicitly containing only homozygous sites) is consistent with the target, i.e. whether the segment matches the corresponding segment in the haplotype according to the genotype-to-haplotype distance.
After this step, the original Eagle2 method applies a so-called IBD check (identical by descent). This check is intended to remove a bias introduced by reference sequences with a high similarity to the target (caused by probable relatedness). Segments in the condensed reference are simply marked inconsistent if runs of consistent segments become too long (the default thresholds are 10 and 20 segments, depending on two different types of IBD).
Based on the condensed reference a PBWT data structure is created in order to perform quick sequence lookups from each call site, which is required by the following phasing process. For details on the PBWT, we refer to Durbin [1].

Phasing
The original phasing method in Eagle2 is divided into three subparts: fast pre-phasing, fine phasing and reverse phasing. All three parts are performed in sequence by a beam search with different parameters. The parameters of the pre-phasing part are runtime-optimised; it is intended to constrain phases with very high probability, and this information is later used in the fine phasing part to reduce its runtime. Here, the phase decisions are made by calculating the keep-phase probability P_keep for each call site (see "Beam search and probability model" below). The reverse phasing part is intended to fix the phasing decisions from the fine phasing and is performed only in the last iteration (in case several phasing iterations are performed). Reverse phasing is a fine phasing iteration in reversed direction, so the phasing starts at the end of all variants. In this part, a phase correction is made if and only if the recalculated probability for a phase keep or switch is higher than the one previously calculated in the fine phasing.

Beam search and probability model
A beam consists of several pairs of haplotype paths at each position in the condensed reference. The probability of such a pair (further referred to as diplotype) forms the basis of the keep-phase probability at a call site: If the sum of the probabilities of all diplotypes in the beam at a call site that keep the current phase is higher than for those introducing a phase switch, the current phase is kept, otherwise switched.
The probability model for a diplotype is the following: with $n_{\text{err}}$ the number of consistency errors between the pair of haplotype paths and the target genotypes (e.g. if both paths carry a common allele at a site, i.e. are homozygous, while the genotype is heterozygous at that site, this counts as an error) and $\varepsilon$ a fixed error probability set to 0.003 per default, the diplotype probability is

$$P(d) = \varepsilon^{n_{\text{err}}} \cdot P(h^A_{1\ldots m}) \cdot P(h^B_{1\ldots m}). \qquad (1)$$

The haplotype path probability is assumed to follow

$$P(h_{1\ldots m}) = \sum_{x=m-H}^{m-1} P(h_{1\ldots x}) \cdot P(\text{rec}\ m|x) \cdot f(h_{x+1\ldots m}). \qquad (2)$$

The parameter $H$ indicates the size of a history, i.e. how many sites are looked back at in order to calculate the actual probability of the current site $m$. This defaults to 30 for the pre-phasing step and is set to 100 in the fine and reverse phasing steps. $P(\text{rec}\ m|x)$ is the recombination probability between the two sites $m$ and $x$. It is computed as a function of the genetic distance between the sites, the effective population size and the number of references. The function $f(h_{x+1\ldots m})$ returns the frequency of the given sequence $h_{x+1\ldots m}$ among all reference sequences from site $x+1$ to $m$. Since this function is called very frequently, a fast execution is crucial. The PBWT data structure allows this information to be accessed in approximately constant time, while a naive implementation would be linear in the number of references. The keep-phase probability can now be quantified as the sum of the probabilities of all diplotypes in the beam that keep the phase, divided by the sum over all diplotype probabilities:

$$P_{\text{keep}} = \frac{\sum_{d \in \text{keep}} P(d)}{\sum_{d \in \text{beam}} P(d)}.$$

The switch-phase probability is quantified as its counterpart $P_{\text{switch}} = 1 - P_{\text{keep}}$. The phasing confidence for a complete sample, which is reported at the end of the phasing part, is the average of the phase confidences over all call sites, i.e. either the keep-phase probability if the phase is kept, or the switch-phase probability otherwise.
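The keep-phase decision over the beam can be sketched as follows. The struct and function names are hypothetical, and the plain `double` probabilities are a simplification; EagleImp keeps probabilities in a scaled representation (see Section 2.2.4).

```cpp
#include <vector>

// One diplotype path in the beam: its probability and whether extending it
// keeps the current phase at the call site under consideration.
struct Diplotype {
    double prob;
    bool keepsPhase;
};

// P_keep: sum of probabilities of phase-keeping diplotypes over the total.
double keepPhaseProbability(const std::vector<Diplotype>& beam) {
    double keep = 0.0, total = 0.0;
    for (const Diplotype& d : beam) {
        total += d.prob;
        if (d.keepsPhase) keep += d.prob;
    }
    return total > 0.0 ? keep / total : 0.5;  // empty beam: no information
}
```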
The beam search algorithm sweeps over the call sites while the beam is updated continuously. Eagle2 uses a HapHedge data structure for this purpose which is basically a tree with the leaf nodes representing active beam paths. Diplotypes that have a probability below ε times the current maximum diplotype probability are removed. Furthermore, the least probable diplotypes are removed until the number of diplotypes is less than or equal to the beam width parameter, which defaults to 50 in general (30 for pre-phasing).
A parameter ∆ lets the beam advance to site m + ∆ before doing the phase call for site m. This ensures that only stable diplotypes that survive until m + ∆ in the beam are taken for calculating the phase probability. This defaults to 20 (10 for pre-phasing).

Set-maximal matches
Preliminary to imputation, the core algorithm requires the reference to be in the PBWT structure format. Together with the target data, a reference panel in PBWT format reduced to common variant positions is created. Then, for each sample haplotype, all set-maximal matches are calculated. A match is defined by an interval $[m_{\text{start}}, m_{\text{end}})$ (inclusive start and exclusive end positions of the match in the reduced panel) where the target haplotype matches a set of reference haplotypes exactly. A match is set-maximal to a sequence in the reference if the match is locally maximal, i.e. there is no longer match including this one, and there is also no other sequence in the reference with a longer match that includes this region. Note that set-maximal matches may (and should) overlap. It is worth mentioning that, in its default mode, the original PBWT tool sets missing target haplotypes to the reference allele without further notice. Missing data may be imputed explicitly only as an extra step before the final imputation, and in this step only the phased target data is used for the imputation instead of the reference panel. In EagleImp we do not follow this approach (see Section 2.3.2).
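To make the definition concrete, the following naive C++ sketch enumerates set-maximal matches directly from raw sequences: a locally maximal match of the target against one reference is set-maximal iff no match against any reference properly contains its interval. This is for illustration only; the PBWT computes the same matches far more efficiently, and the function name is hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Interval = std::pair<std::size_t, std::size_t>;  // inclusive start, exclusive end

std::set<Interval> setMaximalMatches(const std::vector<int>& target,
                                     const std::vector<std::vector<int>>& refs) {
    // 1. Collect all locally maximal matches (maximal runs of agreement
    //    between the target and a single reference sequence).
    std::set<Interval> maximal;
    for (const auto& ref : refs) {
        std::size_t m = 0;
        while (m < target.size()) {
            if (target[m] != ref[m]) { ++m; continue; }
            std::size_t start = m;
            while (m < target.size() && target[m] == ref[m]) ++m;
            maximal.insert({start, m});
        }
    }
    // 2. Keep only matches not properly contained in another match.
    std::set<Interval> result;
    for (const Interval& a : maximal) {
        bool contained = false;
        for (const Interval& b : maximal)
            if (b != a && b.first <= a.first && b.second >= a.second)
                contained = true;
        if (!contained) result.insert(a);
    }
    return result;
}
```

Note that the resulting matches may overlap, as stated above.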

Dosage calculation
For each variant allele that needs to be imputed, the imputation dosage is calculated as follows. For each sequence (indexed with $i$) from a set-maximal match that covers this variant, a score $s_i$ is computed that depends on the position of the variant in the match and the length of the match:

$$s_i = (m - m^{\text{start}}_i) \cdot (m^{\text{end}}_i - m),$$

whereby $m$ is the position of the last common variant in the reduced reference panel and $m^{\text{start}}_i$ and $m^{\text{end}}_i$ are the start and end positions of the corresponding set-maximal match. Note that this equation produces a zero score for all variants in the first segment of a set-maximal match (between the first and the second common site in the reduced panel). In EagleImp we add 1 in the first factor to generate a positive score instead (see main paper).
The allele dosage is then simply calculated as the score-weighted average of the matched alleles:

$$d = \frac{\sum_i s_i \cdot a_i}{\sum_i s_i},$$

where $a_i$ is the allele of matched sequence $i$ at the variant to be imputed. In the case that variants are not covered by any set-maximal match, PBWT simply fills these gaps with the most frequent allele and sets the dosage to the reference panel allele frequency.
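A minimal sketch of this dosage calculation is given below. The `Match` record and function name are hypothetical; the weight is taken as the product of the distances of the segment to both match boundaries, with +1 in the first factor as EagleImp uses (the exact score formula of the original PBWT tool may differ slightly).

```cpp
#include <cstddef>
#include <vector>

// A set-maximal match covering some reference sequences, with inclusive
// start / exclusive end positions in the reduced panel.
struct Match {
    std::size_t mStart, mEnd;
    std::vector<int> alleles;  // alleles of the matched reference sequences
                               // at the variant to be imputed
};

// Dosage at a variant whose last preceding common variant has index m in
// the reduced panel; falls back to the panel allele frequency for gaps.
double imputeDosage(const std::vector<Match>& matches, std::size_t m,
                    double refPanelAltFrequency) {
    double scoreSum = 0.0, weighted = 0.0;
    for (const Match& mt : matches) {
        if (m < mt.mStart || m >= mt.mEnd) continue;  // match does not cover m
        double s = double(m - mt.mStart + 1) * double(mt.mEnd - m);
        for (int a : mt.alleles) {
            scoreSum += s;
            weighted += s * a;
        }
    }
    return scoreSum > 0.0 ? weighted / scoreSum : refPanelAltFrequency;
}
```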

EagleImp Implementation
EagleImp combines the basic algorithmic concepts of Eagle2 [4] and PBWT [1] described above. In the following, we describe our main variations in the algorithm and implementation, divided into a general part with methods addressing the complete application, the part regarding only the phasing step, and finally the imputation part.

Qref reference format
Reading large reference input panels is slow if the panel is provided in VCF format (i.e. .vcf.gz or .bcf). On the one hand, VCF is a very flexible format for storing genetic data, but on the other hand, it becomes cumbersome when only certain parts of the provided information need to be parsed. Thus, several imputation tools introduce their own formats, e.g. .m3vcf for minimac4 (https://genome.sph.umich.edu/wiki/Minimac4) or .pbwt for the PBWT tool (PBWT 3.1, https://github.com/VertebrateResequencing/pbwt.git). This avoids the processing power required for parsing VCF data, and the data is ideally already in the desired format for the tool. To be compatible with other tools, we still support reference panels in VCF format, but since reference panels are reused several times for different target datasets, we targeted this runtime bottleneck with our own Quick Reference format (Qref).
The Qref format is basically a consecutive arrangement of the required metadata of the input file followed by the haplotype data. The metadata consists firstly of numerical values, such as the number of haplotypes and the number of variants, followed by all variant positions, then all variant IDs, all reference allele IDs, all alternative allele IDs, allele frequencies and flags that mark multi-allelic variants (see below). For chromosome X, flags that mark haploid reference samples are also stored in a Qref. With this format, the reading process can directly assign the data to already prepared memory areas without further restructuring and parsing.
The haplotype information, which is by far the largest part, is stored binary encoded in a variant-major format. To save memory, one line of haplotype information is converted to a binary sequence which is run-length encoded afterwards. We use a run-length encoding with 8-bit patterns (7 bits for the run length and 1 bit for the data that is repeated). In most cases, the run-length encoding leads to a significant reduction in size, but in those cases where an encoded line is not smaller than the original, we simply store the unencoded sequence.
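The 8-bit run-length scheme can be sketched as follows. The exact bit layout (run length in the upper 7 bits, repeated allele in the lowest bit) is an assumption for illustration; the actual Qref layout may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode one variant-major line of haplotype bits: each output byte holds
// a run length (1..127, upper 7 bits) and the repeated bit (lowest bit).
std::vector<uint8_t> rleEncode(const std::vector<bool>& line) {
    std::vector<uint8_t> out;
    std::size_t i = 0;
    while (i < line.size()) {
        bool bit = line[i];
        std::size_t run = 1;
        while (i + run < line.size() && line[i + run] == bit && run < 127) ++run;
        out.push_back(uint8_t((run << 1) | (bit ? 1 : 0)));
        i += run;
    }
    return out;
}

std::vector<bool> rleDecode(const std::vector<uint8_t>& enc) {
    std::vector<bool> out;
    for (uint8_t b : enc)
        out.insert(out.end(), b >> 1, (b & 1) != 0);  // expand the run
    return out;
}
```

A monomorphic line of 1000 zeros compresses to eight bytes (seven full runs of 127 plus one remainder run), illustrating the size reduction on typical panel data.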
When creating a Qref file, multi-allelic variants are converted to bi-allelic variants in advance by splitting. However, these variants are flagged such that a user may distinguish them from originally bi-allelic variants and is able to exclude them later with the --excludeMultiAllRef switch. A .qref file can easily be created from a VCF file by applying the --makeQref switch in EagleImp.

PBWT operations
The PBWT data structure is required by the phasing process and the imputation process. For phasing, the PBWT is generated from the condensed reference for each target (once for the prephasing and fine phasing steps in forward direction and once in reverse direction for the reverse phasing step). For imputation a single PBWT is required for the complete reference which is then accessed by all targets.
In its basic form, the PBWT requires a permutation array a for each position, which results in huge memory requirements. Encouraged by Durbin [1], we create and store the PBWT data structures required for the phasing and imputation parts in a compact format with an index similar to the FM-index for a Burrows-Wheeler transform (BWT) [3]. Supplementary Listing 1 shows the pseudo-code for the creation of a PBWT from a reference in our format. The index simply indicates for each position the number of entries that are 0 after each block of 32 entries. Note that the original PBWT algorithm determines the count0 value (i.e. the total number of 0-entries for the current position) during creation of the PBWT. The disadvantage is that two arrays (one of length count0, the other of length N − count0) have to be merged after each position. In EagleImp we either determine these values beforehand on-the-fly during pre-processing (e.g. while reading the reference data) or compute them very quickly, benefiting from our Boolean data representation and the processor instruction popcount. Either way, we can omit the array merging since we know the start position of the second array beforehand, which in turn saves computation time.
Also note that we prepare the input data in transposed (i.e. variant-major) format, as the outer loop of the algorithm iterates over the sites and, for each site, fast access to the haplotype data of several sequences is required. For faster access, we store the PBWT data and the index in an interlaced format: each 32 bits of PBWT data are stored in a 32-bit integer followed by the corresponding index value, also stored in a 32-bit integer.
The basic operation required on the PBWT is to identify where a certain sequence is positioned in the permutation order of a specific site m if we know the sequence's positional index i at the previous site m − 1. This operation should ideally have constant time complexity, as it is required for fast frequency lookups of certain sequences in the phasing step (see below). Basically, this is a lookup of the relative permutation of index i at site m − 1. However, since storing these permutation arrays would irrationally increase the memory requirements of the PBWT, we have to decode this value from the compact form created by Supplementary Listing 1. Clearly, if the corresponding data bit is 0, the relative permutation of i must be the exact number of 0-entries before i at site m − 1. This is easily calculated as the PBWT index value at the field corresponding to i minus the number of 0-entries from i up to the position where the index value was stored (as it is stored only every 32 entries). Analogously, if the data bit is 1, the relative permutation must be the exact number of 1-entries before i at site m − 1 plus the number of all 0-entries (as 1-entries are stored after all 0-entries), which can be calculated likewise. Supplementary Listing 2 shows simplified pseudo-code to determine the relative permutation index of an index i from site m − 1 to m. Based on this, we implement a fast lookup procedure for the frequencies of certain sequences when iterating over the sites of a PBWT. The frequency lookups are required by the core algorithms in phasing and imputation, and for an efficient lookup we deviate from the procedures described by Durbin [1].
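The decoding just described can be sketched in C++ as follows. The struct and its LSB-first bit order are assumptions for illustration (EagleImp's actual layout may differ); the popcount intrinsic is compiler-specific (GCC/Clang).

```cpp
#include <cstdint>
#include <vector>

// Interlaced storage of one PBWT site: words[2k] holds 32 data bits
// (LSB-first), words[2k+1] holds the number of 0-bits in all entries up to
// and including block k.
struct PbwtColumn {
    std::vector<uint32_t> words;
    uint32_t count0;  // total number of 0-entries at this site
};

// Map index i at site m-1 to its position in the permutation at site m,
// decoding the relative permutation from the compact indexed form.
uint32_t mapIndex(const PbwtColumn& c, uint32_t i) {
    uint32_t block = i / 32, off = i % 32;
    uint32_t bits = c.words[2 * block];
    uint32_t cum0 = c.words[2 * block + 1];  // zeros up to the block's end
    // Zeros from position i (inclusive) to the end of its block:
    uint32_t tail0 = (32 - off) - uint32_t(__builtin_popcount(bits >> off));
    uint32_t zerosBefore = cum0 - tail0;     // zeros strictly before i
    bool bit = (bits >> off) & 1u;
    // 0-entries keep their rank among zeros; 1-entries come after all zeros.
    return bit ? c.count0 + (i - zerosBefore) : zerosBefore;
}
```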
Since a main feature of the PBWT structure is that at each position m in the PBWT the prefixes are sorted backwards, the number of occurrences l of a sequence h_x...m in the PBWT at m can be determined from an interval [i_x, j_x]_m with l = j_x − i_x + 1 (the size of the interval), where i_x and j_x point to indices in the actual permutation at m in the PBWT such that the corresponding reference sequences equal h_x...m at positions x to m (and sequences before i_x and after j_x do not).
When advancing from position m − 1 to m, the calculation of an interval [i_x, j_x]_m from its predecessor [i_x, j_x]_{m−1} is simple if the above condition applies: as the corresponding sequences at m − 1 match h_x...m−1 and the extension h_m can only be either 0 or 1, the new interval is either $[i^0_x, j^0_x]_m$ or $[i^1_x, j^1_x]_m$, depending on the value of h_m. Thus, we only need to know where the interval borders map to (which we directly get from the permutation array), and we can calculate the borders of the other interval directly. Say, w.l.o.g., i_x is mapped to $i^0_x$ and j_x is mapped to $j^0_x$ (i.e. both interval borders have a 0-allele at the new position), and let $c_0$ be the number of zero-bits at m. Then the alternative interval $[i^1_x, j^1_x]_m$ can be calculated as follows:

$$i^1_x = c_0 + (i_x - i^0_x), \qquad j^1_x = c_0 + (j_x - j^0_x) - 1.$$

The calculation is analogous if i_x or j_x were mapped to $i^1_x$ or $j^1_x$. Fig. 2 in the main paper illustrates this relation.
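Assuming both interval borders map into the 0-interval, the derivation of the 1-interval needs no further PBWT lookups and can be sketched as a one-liner (hypothetical function name; the counts follow from 1-entries being placed after all 0-entries in the new permutation):

```cpp
#include <cstdint>
#include <utility>

// Given the borders [ix, jx] of a sequence interval at site m-1, their
// 0-allele mappings i0 and j0 at site m, and the number c0 of 0-entries
// at m, derive the 1-interval [i1, j1] directly:
//   ix - i0 = number of 1-entries before ix, shifted past all 0-entries;
//   jx - j0 = number of 1-entries up to jx, so the last one sits at
//             c0 + (jx - j0) - 1.
std::pair<uint32_t, uint32_t> oneInterval(uint32_t ix, uint32_t jx,
                                          uint32_t i0, uint32_t j0,
                                          uint32_t c0) {
    return {c0 + (ix - i0), c0 + (jx - j0) - 1};
}
```

For example, for a site whose 32 entries alternate in nibbles of four 0s and four 1s (16 zeros in total), the interval [0, 11] with border mappings i0 = 0 and j0 = 7 yields the 1-interval [16, 19], covering the four 1-entries at positions 4–7.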

Phasing Core Improvements
As explained above, we keep the main concept of the Eagle2 core algorithm for phasing. In particular, phasing is divided into phasing preliminaries and the three steps fast pre-phasing, fine phasing and reverse phasing (see Supplementary Section 1.1.3; we omit the pre-phasing step by default, though, as we did not encounter significant changes, but the user may explicitly enable it again by applying the --doPrePhasing switch). We also take over the probability model and the concept of the beam search (see below or [4]).

K best haplotypes and the condensed reference
Changes in the phasing procedure include the extensive use of Boolean operations in the selection of the K best haplotypes and the creation of the condensed reference in the phasing preliminaries.
The reduction to common variants in target and reference is already done while loading the input data. As the preliminaries contain procedures that work either sample-wise or variant-wise, we store the reduced reference data twice in sample-major and in variant-major format.
K best haplotypes For the selection of the K best haplotypes, we require the reference data in sample-major format. In general, each allele in a haplotype is encoded as a bit (0 = reference, 1 = alternative). An allele sequence forming a haplotype is encoded in a vector of 64-bit integers, such that one integer can store up to 64 alleles. (Multi-allelic variants in the target sample are excluded from phasing, while the user may allow multi-allelic reference variants, although these are then split into bi-allelic variants.) Genotypes are encoded in two bits per genotype, one bit indicating whether the genotype is homozygous reference (is0) and the other indicating whether it is homozygous alternative (is2). If neither bit is set, the genotype is heterozygous, while both bits set indicate a missing genotype. A genotype sequence is stored in a vector of 64-bit integers (as for haplotypes), with the difference that each two subsequent integers form a pair where the first integer stores 64 bits of is0 information and the second stores the is2 information. With this encoding, choosing the K best haplotypes can be performed very quickly using Boolean operations for comparing genotypes and haplotypes. The vector of inconsistencies (inc) of a haplotype (hap) with a genotype (encoded in is0 and is2) is calculated as follows:

$$\text{inc} = (\text{is0} \wedge \text{hap}) \vee (\text{is2} \wedge \neg\text{hap}). \qquad (8)$$

The genotype-to-haplotype distance is then determined using the processor instruction popcount, which returns the number of set bits in an integer, equal to the number of inconsistencies in this case. The K sequences with the smallest distance finally form the K best haplotypes. Note that Supplementary Equation (8) produces an inconsistency if the genotype is missing. However, this is not a problem, since this inconsistency is generated for every reference sequence and thus does not influence the order of the best haplotypes.
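A sketch of this bit-parallel distance computation in C++ (hypothetical function names; the popcount intrinsic is GCC/Clang-specific):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-word inconsistencies between 64 haplotype alleles (hap) and 64
// genotypes encoded as is0 (homozygous reference) / is2 (homozygous
// alternative) bit masks: a hom-ref genotype conflicts with allele 1,
// a hom-alt genotype conflicts with allele 0, heterozygous (neither bit)
// never conflicts, missing (both bits) always "conflicts".
uint64_t inconsistencies(uint64_t is0, uint64_t is2, uint64_t hap) {
    return (is0 & hap) | (is2 & ~hap);
}

// Genotype-to-haplotype distance over a whole sequence: popcount of the
// inconsistency words.
std::size_t gtHapDistance(const std::vector<uint64_t>& is0,
                          const std::vector<uint64_t>& is2,
                          const std::vector<uint64_t>& hap) {
    std::size_t d = 0;
    for (std::size_t k = 0; k < hap.size(); ++k)
        d += std::size_t(__builtin_popcountll(inconsistencies(is0[k], is2[k], hap[k])));
    return d;
}
```

For instance, four sites with genotypes hom-ref, het, hom-alt, missing give is0 = 0b1001 and is2 = 0b1100; against hap = 0b0010 only the hom-alt site and the (always-counted) missing site are inconsistent.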
Condensed reference For creating the condensed reference, the variant-major reference format comes in handy. For each target split site, the haplotype information from the reference is copied (selectively from the K best haplotypes, or simply all information if K is greater than or equal to the number of reference haplotypes). For the inconsistency segment between two split sites, a complete inconsistency vector of the current target to all references is generated. This is generally faster than selecting the K best haplotypes first, as all Boolean operations can be performed on full 64-bit blocks instead of selecting K references for each site in the segment first. The inconsistency vector is simply calculated by a consecutive bitwise OR operation over the inconsistency vectors of all sites in the segment. The inconsistencies for a single site are easily determined similarly to Supplementary Equation (8), with the difference that the current target genotype is now a constant for the current site. Thus, the inconsistencies are equal to the reference haplotypes themselves if the genotype is homozygous reference, or to their negation if the genotype is homozygous alternative. For missing genotypes we consider all references to be consistent, so no operation is required at all. The same would apply for heterozygous genotypes, but this case can be excluded, as heterozygous target sites must have been selected as split sites before. Only after reaching the next split site (or the end, respectively) are the K best haplotypes selected from the complete inconsistency vector and copied to the condensed reference. An example of a condensed reference and its creation is depicted in Supplementary Fig. 1.
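The accumulation of a segment's inconsistency vector can be sketched as follows. The enum and function name are hypothetical simplifications of EagleImp's internal representation; real code would also mask padding bits in the final word.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Gt { HomRef, HomAlt, Missing };  // het sites are split sites,
                                            // so they cannot occur here

// OR the per-site inconsistency words over all sites of a segment.
// Per site: reference word itself (hom-ref), its negation (hom-alt),
// or nothing (missing: all references consistent).
std::vector<uint64_t> segmentInconsistency(
        const std::vector<std::vector<uint64_t>>& refBySite,  // variant-major
        const std::vector<Gt>& genotypes) {
    std::vector<uint64_t> inc(refBySite.empty() ? 0 : refBySite[0].size(), 0);
    for (std::size_t m = 0; m < refBySite.size(); ++m) {
        for (std::size_t k = 0; k < inc.size(); ++k) {
            switch (genotypes[m]) {
                case Gt::HomRef:  inc[k] |= refBySite[m][k]; break;
                case Gt::HomAlt:  inc[k] |= ~refBySite[m][k]; break;
                case Gt::Missing: break;
            }
        }
    }
    return inc;
}
```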

Beam search
A major difference to Eagle2 phasing applies to the beam search, where Eagle2 uses a HapHedge data structure (see Supplementary Section 1.1.4 and [4] for details), which we omit in favor of a simple list data structure to manage active haplotype and diplotype paths. This is mainly enabled by our interval mapping technique (see above) and the fact that all active paths have to be extended in each iteration (over the sites) anyway, which is why a data structure with fast forward iteration is favourable. As haplotype path extensions are easy to foresee (extensions may be either 0 or 1), it is easy to keep the list sorted by the paths' histories of haplotypes. And as beam paths (or diplotype paths) only manage pointers to the underlying haplotype paths, the required merging and pruning steps in the beam search can easily be implemented using the sorted haplotype path list and fast Boolean operations for matching paths. Furthermore, due to the sort order of the path history, equal PBWT intervals are organised next to each other. This characteristic can be used to prevent unnecessary multiple PBWT lookups in the extensions that require interval mappings.

Haplotype Path Probabilities in the Beam Search
The core equation for calculating the phase probability is Supplementary Equation (2), which computes a haplotype path probability. The most performance-critical step here is the frequency lookup f(h_x+1...m), since it is called up to H times for each path (where H is the history parameter that defaults to H = 100, with x increasing from m − H to m − 1). f returns the (relative) number of occurrences of the haplotype path h_x+1...m in the reference from positions x + 1 to m. So, ideally, the frequency lookup has to run in constant time, which we achieve with our interval mapping technique described in Supplementary Section 2.1.2 above.
To compute the path probability in Supplementary Equation (2), we need the history of recent path probabilities P(h_1...x) and the PBWT intervals [i_x+1, j_x+1]_m to [i_m, j_m]_m that lead to the frequencies f(h_x+1...m). The history is processed from the newest (most recent) to the oldest site. The recombination probabilities P(rec m|x) are computed only once for each site, according to Loh et al. [4]. Note that the condensed reference alternately stores call site and segment inconsistency information. Advancing from one call site to the next thus implies two PBWT interval mappings, whereby from the first mapping (to the inconsistency information) only the 0-interval (containing only consistent sequences) is taken into the second mapping. We observed that many haplotype paths in the beam equal each other at the most recent sites. Thus, we order the path extensions such that paths that match each other at the last sites are located next to each other. This way, we can save computing time by directly copying the interval mappings of the previous paths for those positions that are equal, which can be determined quickly by comparing each path only to its predecessor and only up to the first non-matching position.
Floating point format Haplotype path probabilities may become extremely small, especially within a multiplicative combination to a beam path probability (see Supplementary Equations (1) and (2)). The original Eagle2 tool solves this problem by transforming the probability values into a logarithm-based representation. In this representation, multiplications become additions, while additions would require a back-transformation first. Eagle2 circumvents the back-transformation by approximating the addition of two probabilities in log-based format. We find this solution problematic, as the computation of the haplotype path probability (Supplementary Equation (2)) requires a summation over up to H terms, whereby each term was in turn computed the same way before. This leads to a strong error propagation, which we addressed in EagleImp by keeping the probabilities in the non-transformed format but introducing a separate exponential scaling factor, such that the real probability P is defined by the scaling factor α and the internal probability representation P_int:

$$P = P_{\text{int}} \cdot 2^{\alpha}.$$

To save computation time, we keep the scaling factor non-normalised and update it only after several path extension operations, in case the probability drops below a certain limit.
In addition, we keep the scaling factor equal throughout all history path probabilities, such that the summation in Supplementary Equation (2) does not require the alignment of the scaling factors before addition. We need to take care when comparing path probabilities though, but this way, we keep the computations precise with almost no overhead in computation time.
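The scaled representation with deferred renormalisation can be sketched as follows. The base-2 scaling, the struct, and the renormalisation threshold are illustrative assumptions, not EagleImp's exact internals.

```cpp
#include <cmath>
#include <cstdint>

// Probability kept as P = pInt * 2^alpha.  Renormalisation is deferred:
// alpha is only updated once pInt drops below a limit, so individual
// multiplications stay cheap.
struct ScaledProb {
    double pInt;
    int64_t alpha;

    void mul(const ScaledProb& o) {
        pInt *= o.pInt;
        alpha += o.alpha;
    }

    // Pull pInt back into [0.5, 1) and absorb the exponent into alpha.
    void normalize() {
        if (pInt > 0.0 && pInt < 1e-100) {
            int e;
            pInt = std::frexp(pInt, &e);
            alpha += e;
        }
    }

    double value() const { return std::ldexp(pInt, int(alpha)); }
};
```

Keeping a common alpha across summands, as described above, means the sums in Supplementary Equation (2) can be formed directly on the pInt values without aligning exponents first.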

Imputation Core Improvements
The imputation core algorithm computes the set-maximal matches of a target haplotype and the reference, for which purpose the reference is reduced to common sites only. As for the phasing part, we keep the basic concept of the original PBWT tool in EagleImp, but introduce significant changes in the computation of the set-maximal matches, the handling of missing data, dosage calculation and imputation information (such as allele frequencies and imputation score).

Computation of set-maximal matches
A match is defined by an interval $[m_{\text{start}}, m_{\text{end}})$ (inclusive start and exclusive end positions of the match in the reduced panel) where the target haplotype matches a set of reference haplotypes exactly. Due to the use of the PBWT data structure, the set can be stored as an interval in the PBWT. We recall the original definition by Durbin [1] here: a match is set-maximal to a sequence in the reference if the match is locally maximal, i.e. there is no longer match including this one, and there is also no other sequence in the reference with a longer match that includes this region. In order to find the set-maximal matches, we create a binary tree data structure containing PBWT intervals in each node. Briefly explained, the algorithm sweeps over all sites and updates the tree by extending all subsequences represented by the PBWT intervals with 0 or 1 according to the current target allele. In particular, all nodes in the tree are processed recursively at each new position, i.e. the intervals are mapped as described in Supplementary Section 2.1.2. The complete interval [0, N − 1] is inserted as the new root and the old root is attached as a child. This way, the tree is extended and each node represents an interval that is a subinterval of its parent's.
At the point where an interval cannot be extended further, the subsequence represented by it is reported as a match and the node is deleted. If the match is reported from the deepest node in the tree, it is set-maximal, because all other matches reported in this step must be shorter due to later start points, and all nodes still alive no longer represent a subsequence of this match, as they now have a later endpoint. (For further information on the tree structure, see the detailed explanation below.) To save computation time, nodes representing equal intervals are merged such that the same interval does not need to be mapped multiple times. However, the nodes keep track of their depth levels from before the merge.
It is easy to see that all nodes are linked in a chain if the target alleles are well defined; the tree degenerates to a simple list in that case. However, the tree representation has its justification, as missing data would violate the above condition: with a missing allele, both extensions are allowed at that site and generate two disjoint intervals (see below). Also, we leave the opportunity to allow errors in the set-maximal matches. (This option is implemented, but not activated by default, as first benchmarks did not show a quality improvement for the extra runtime costs. We leave this open for further investigation.)

Tree structure for calculating set-maximal matches For calculating set-maximal matches, a PBWT structure similar to that for the condensed reference is created beforehand. In contrast to the PBWTs for the condensed references for each target, this PBWT is slightly larger (because it targets the complete reference without reduction to the K best haplotypes, and it misses the reduction to call sites and the segments in between). But as the same PBWT can be reused for each target haplotype, its computation is not performance critical.
We use our idea of interval mapping from the phasing part and create a binary tree structure that is updated recursively (depth-first) in every step while iterating m from the first to the last position. A tree node mainly consists of a matching interval I and potentially two subtrees whose root nodes (and therefore all other descendant nodes) contain intervals that are subsets of I.
For convenience, we also add the current depth of the node in the tree, which directly reflects the length of the match represented by the interval. When switching the position from m to m + 1, we insert a new node with the complete interval [0, N − 1] into the tree as root node, and the existing tree is attached w.l.o.g. as the left subtree. With a depth-first strategy, all intervals in the nodes are mapped in the PBWT according to the strategy described in the main paper. Mapping always creates two intervals (the 0-interval and the 1-interval, depending on the allele of the corresponding reference sequences at m + 1). The current interval of the node is updated to the mapping that fits the current haplotype allele being matched. Furthermore, the node is attached to its parent as a left node if the selected interval is the 0-interval, and as a right node otherwise. A special case occurs if the current haplotype allele is missing. In this case, a new node is created and both intervals are attached to the parent. (Note that the original PBWT tool treats missing alleles as reference alleles (0) by default.) At this point, we could allow a certain number of mismatches in a set-maximal match by tracking intervals that introduce an error (and not exceeding the maximal number of allowed errors). However, we did not follow this strategy any further, since first tests showed that allowing even a single error already increased the runtime significantly due to the larger tree size, while the imputation quality did not improve accordingly.
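The central operation in this tree update is mapping a match interval through one PBWT column. A minimal Python sketch of this standard PBWT operation (illustrative only; names are our own and not taken from the EagleImp code base):

```python
def map_interval(y, f, g):
    """Map the match interval [f, g) through one PBWT column.

    y: alleles (0/1) of all reference haplotypes at the next site,
       listed in the current prefix-sorted order.
    Returns the 0-interval and the 1-interval at the next site;
    an empty interval (f == g) means that extension is impossible.
    """
    c = y.count(0)                        # all 0-extensions sort before all 1-extensions
    f0 = sum(1 for a in y[:f] if a == 0)  # zeros before position f
    g0 = sum(1 for a in y[:g] if a == 0)  # zeros before position g
    zero_iv = (f0, g0)
    one_iv = (c + (f - f0), c + (g - g0))
    return zero_iv, one_iv
```

For a well-defined target allele, only the fitting interval is kept in the node; for a missing allele, both intervals spawn child nodes, as described above.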
A tree node is deleted whenever its interval cannot be mapped according to the matched haplotype allele. Descendant subtrees are deleted as well, since if an interval cannot be mapped, its subintervals cannot be mapped either. However, at this point, the algorithm has found a potential set-maximal match at the deepest node of the subtree to be deleted. We say "potential" here since, during depth-first processing, it is not yet clear whether the deleted subtree reached the deepest point of the complete tree.
After the recursion has returned to the initial caller, all reported matches end at position m. Only the longest match is a final candidate for a set-maximal match and needs to be compared to the final depth of the tree: only if its length is greater than or equal to the tree depth can it be ensured that no match survives in the tree that begins at a position less than or equal to that of the candidate, which finally makes it set-maximal. To save computation time, we attach the current tree depth (at the time the recursion is started) as a parameter to the recursion procedure. When a potential set-maximal match candidate is about to be reported, its length is compared to this parameter first. This way, we ensure that only the match from the deepest node in the tree is reported, which makes a comparison of all match candidates after the recursion obsolete. We introduced another runtime improvement by merging nodes that have a single descendant with the same interval (while tracking the depth of the deeper node). This prevents equal intervals from being mapped multiple times and reduces descents into the tree.
The runtime complexity of this algorithm is O(d_m) for each position m, whereby d_m is the number of nodes in the tree, which reflects the number of matching intervals at position m. (In most cases, i.e. without missing alleles or allowed match errors, this equals the length of the longest match up to this point, which is the same as the depth of the tree.)
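For testing purposes, the set-maximal matches that this tree algorithm computes can also be obtained by a naive quadratic scan over all reference haplotypes. A hedged Python sketch of such a reference implementation (our own illustration, not part of EagleImp):

```python
def maximal_matches(target, refs):
    """All locally maximal matches of target against each reference
    haplotype, returned as (ref index, start, end) with end exclusive."""
    M = len(target)
    out = []
    for i, ref in enumerate(refs):
        j = 0
        while j < M:
            if ref[j] != target[j]:
                j += 1
                continue
            k = j
            while k < M and ref[k] == target[k]:
                k += 1                      # extend the match as far as possible
            out.append((i, j, k))
            j = k
    return out

def set_maximal_matches(target, refs):
    """Keep only matches not properly contained in a match to any sequence."""
    ms = maximal_matches(target, refs)
    return [(i, j, k) for (i, j, k) in ms
            if not any(j2 <= j and k <= k2 and (j2, k2) != (j, k)
                       for (_, j2, k2) in ms)]
```

This runs in O(N M) per target and is only practical for small panels, but it is useful as a ground truth when validating the tree-based algorithm.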

Missing target data
The default reference imputation mode of PBWT handles missing target haplotypes by simply setting them to the reference allele. If the option to impute missing alleles (--imputeMissing) is explicitly set, PBWT performs a preliminary step before reference imputation and imputes the missing variants using data from the phased target only, instead of data from the reference panel.
In EagleImp, we simply exclude missing positions from our procedure to compute set-maximal matches, such that these positions are imputed like all other untyped variants. In particular, our tree-based procedure described above allows both extensions (0 and 1) at missing sites, which effectively neutralises such a position.

Calculation of Allele Dosage, Genotype Dosage and Genotype Probabilities
For each variant allele that needs to be imputed, the imputation dosage is calculated as follows. For each sequence (indexed with i) from a set-maximal match that covers this variant, a score s_i is computed that depends on the position of the variant in the match and the length of the match:

s_i = (m - m_i^start + 1) * (m_i^end - m),

whereby m is the position of the last common variant in the reduced reference panel and m_i^start and m_i^end are the start and end positions of the corresponding set-maximal match. Note that, in contrast to the original PBWT, we add 1 in the first factor, such that all variants in the first segment of a set-maximal match (between the first and the second common site in the reduced panel) generate a positive score instead of zero.
Then, the allele dosage is simply calculated as the score-weighted fraction of covering matches whose reference sequence carries the alternative allele at this position:

d = Σ_{i: allele 1} s_i / Σ_i s_i.

Exceptions are variants that were genotyped prior to imputation (which are tagged with TYPED in the output). In this case, allele dosages are 0.0 or 1.0 for homozygous calls; for heterozygous calls, the allele dosage corresponds to the phasing confidence. If a variant is not covered by any set-maximal match, the dosage is set to the allele frequency in the reference panel (which is the same approach as in the original PBWT tool).
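The score and dosage computation can be sketched as follows. This is our own illustrative reading of the description above; the function name and match representation are hypothetical:

```python
def allele_dosage(matches, m, ref_af):
    """Allele dosage at the site following position m in the reduced panel.

    matches: set-maximal matches as (m_start, m_end, allele) tuples, where
             allele is the reference allele (0/1) at the imputed site.
    ref_af:  reference-panel allele frequency, used as a fallback when no
             set-maximal match covers the site (as in the original PBWT).
    """
    cover = [(ms, me, a) for ms, me, a in matches if ms <= m <= me]
    if not cover:
        return ref_af
    # score: distance from match start (plus 1) times distance to match end
    scores = [((m - ms + 1) * (me - m), a) for ms, me, a in cover]
    total = sum(s for s, _ in scores)
    if total == 0:
        return ref_af
    return sum(s for s, a in scores if a == 1) / total
```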
The imputed allele is selected according to the calculated allele dosage: 0 if d ≤ 0.5, 1 otherwise. Furthermore, the user can select which fields in the VCF output should accompany the imputed alleles: allele dosages, genotype dosage, genotype probabilities, a combination of those, or none. This may significantly reduce the size of the output files if fields that are not strictly required are omitted. The default is to provide only the allele dosages with the genotypes, as these form the basis for the other values. The genotype dosage is simply the sum of the maternal and paternal allele dosages:

gd = d_mat + d_pat.

The genotype probabilities are calculated as follows:

P(0/0) = (1 - d_mat)(1 - d_pat), P(0/1) = d_mat (1 - d_pat) + (1 - d_mat) d_pat, P(1/1) = d_mat d_pat.
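A minimal sketch of these genotype fields, assuming independence of the maternal and paternal haplotypes (function name is our own):

```python
def genotype_fields(d_mat, d_pat):
    """Genotype dosage and genotype probabilities from the two allele
    dosages of a sample, treating the haplotypes as independent."""
    gd = d_mat + d_pat
    p_hom_ref = (1 - d_mat) * (1 - d_pat)                  # P(0/0)
    p_het = d_mat * (1 - d_pat) + (1 - d_mat) * d_pat      # P(0/1)
    p_hom_alt = d_mat * d_pat                              # P(1/1)
    return gd, (p_hom_ref, p_het, p_hom_alt)
```

Note that the three probabilities always sum to 1, so providing only the allele dosages in the output is sufficient to recover the other fields.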

Imputation information
In contrast to the original PBWT, we compute the estimated allele frequency (AF) and the minor allele frequency (MAF) in addition to the allele count (AC) and the allele number (AN) for each variant and provide this information in our output VCF/BCF file. We also replace the imputation INFO score by the imputation r^2 calculated as in minimac4:

r^2 = ( (1/N) Σ_n d_n^2 - p^2 ) / ( p (1 - p) ), with p = (1/N) Σ_n d_n.

Here, the d_n denote the allele dosages for the current variant and N is the total number of alleles in this dataset. Note that if the divisor in Supplementary Equation (18) is zero, r^2 is also set to zero.
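The r^2 computation amounts to the ratio of the empirical dosage variance to the variance of a binomial draw at the estimated allele frequency. A short sketch (illustrative, with the zero-divisor guard mentioned above):

```python
def imputation_r2(dosages):
    """minimac-style imputation r^2: dosage variance over p * (1 - p)."""
    n = len(dosages)
    p = sum(dosages) / n                        # estimated allele frequency
    var = sum(d * d for d in dosages) / n - p * p
    denom = p * (1 - p)
    return 0.0 if denom == 0 else var / denom   # guard against zero divisor
```

Hard calls (all dosages exactly 0 or 1) yield r^2 = 1, while dosages shrunk towards the allele frequency yield values closer to 0.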

Multi-processing in imputation
As the imputation of a target sample is independent of the other targets, a simple parallelisation strategy is to run multiple threads that impute several targets concurrently. We follow this strategy for the computation of set-maximal matches. However, for the actual imputation, the variant-major format of the output files presents a major problem, as for each variant the imputed data of all targets is required at once, and writing the output via HTSlib is slow. We address this problem with the following approach. First, we launch a quarter of the available system threads as writer threads (e.g. 8 writer threads when 32 threads are available), with each writer creating a temporary output file. All output variants are evenly distributed over the same number of coherent blocks, with each block assigned to one output file. Each block is then evenly divided into coherent chunks with a variable number of variants that is derived from the number of samples. (In particular, the chunk size times the number of samples is 10 million by default. This value delivered good results but may be subject to optimisation. It is automatically reduced if the target dataset includes many samples, in order to reduce memory consumption; the reduction depends on the available system memory.) Each chunk is processed using the remaining threads, imputing its variants parallelised over the targets. The chunks are processed in alternation across the blocks (i.e. first chunk from the first block, then first chunk from the second block, etc.). Each chunk directs its imputation output to the writer queue of the output thread assigned to the current block. This way, the load on the output queues is balanced and the multi-processing capabilities of the computing machine can be fully utilised.
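The block/chunk distribution and the alternating processing order can be sketched as follows (a simplified illustration with our own names; the real implementation derives the chunk size from the sample count and available memory):

```python
def chunk_schedule(n_variants, n_writers, chunk_size):
    """Distribute variant indices over writer blocks and return the chunk
    processing order, alternating across blocks."""
    per_block = -(-n_variants // n_writers)    # ceiling division
    blocks = [list(range(b * per_block, min((b + 1) * per_block, n_variants)))
              for b in range(n_writers)]
    chunks = [[blk[i:i + chunk_size] for i in range(0, len(blk), chunk_size)]
              for blk in blocks]
    order = []
    for ci in range(max(len(cl) for cl in chunks)):
        for bi, cl in enumerate(chunks):       # round-robin over blocks
            if ci < len(cl):
                order.append((bi, cl[ci]))     # (writer/block id, variant ids)
    return order
```

Because consecutive chunks in the schedule belong to different blocks, consecutive results go to different writer queues, which keeps the writers evenly loaded.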
After the imputation, the generated temporary output files are simply concatenated to a single file. As packed VCF or BCF files are in Blocked GNU Zip Format (BGZF), this is possible without any difficulty, since compression is handled block-wise (it would not have been possible if the output were a single standard gzip stream). Fig. 2 in the main paper illustrates the multi-processing scheme for imputation.
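The block-wise property can be demonstrated with plain gzip members, of which BGZF files are a series: byte-level concatenation of complete members decompresses to the concatenated payload.

```python
import gzip
import io

part1 = gzip.compress(b"variants 1-500\n")    # one complete gzip member each
part2 = gzip.compress(b"variants 501-1000\n")
combined = part1 + part2                      # plain byte-level concatenation

# GzipFile transparently reads across member boundaries
with gzip.GzipFile(fileobj=io.BytesIO(combined)) as fh:
    payload = fh.read()
```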

Preparation of three reference panels
First benchmarks were performed using the Haplotype Reference Consortium (HRC) Release 1.1 [5] portion of the SIS imputation reference (hereafter referred to as the HRC1.1 panel), which is available through the European Genome-phenome Archive (EGA) upon user request (approved and downloaded on Oct 26, 2020). This part of the HRC panel contains 40.4 million variants for chromosomes 1 to 22 and chromosome X, with the number of samples being 22,691 for chromosome 1, 26,199 for chromosome X (including 11,739 males), and 27,165 for all other chromosomes. The second benchmark panel was the 1000 Genomes (1000G) Phase 3 reference panel [7] with the latest updates to version 5a from Jul 31, 2020 (hereafter referred to as the 1000G Phase 3 panel). This panel contains 84.8 million variants for chromosomes 1 to 22 and chromosomes X and Y, for a total of 2,504 samples (including 1,233 male samples). To demonstrate that EagleImp can use large reference panels in the order of one million samples for phasing and imputation in the future, we artificially synthesised a third benchmark panel (hereafter referred to as the simulated panel) based on the HRC1.1 panel described above. We randomly sampled the haplotypes for one million artificial samples for each variant defined in the original HRC1.1 panel with a distribution according to the reported reference panel allele frequencies from the Sanger website (https://imputation.sanger.ac.uk). The result was a panel with one million samples (2 million haplotypes) for all chromosomes 1-22 and X that follows the original allele frequencies from the HRC1.1 panel but does not represent a particular population. Further, no two samples in the panel are identical.
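The per-variant sampling scheme can be sketched as follows (a simplified illustration; the real panel was generated per chromosome from the published HRC1.1 allele frequencies):

```python
import random

def simulate_panel(allele_freqs, n_samples, seed=42):
    """Draw 2 * n_samples haplotypes; each allele is sampled independently
    per variant according to the given allele frequency."""
    rng = random.Random(seed)
    n_haps = 2 * n_samples
    return [[1 if rng.random() < af else 0 for af in allele_freqs]
            for _ in range(n_haps)]
```

Since every allele is drawn independently, the simulated haplotypes reproduce the allele frequencies but carry no linkage structure, which is why the panel does not represent a real population.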

Preparation of 12 target benchmark datasets
For our benchmarks with the HRC1.1 panel, we created five sub-datasets as target files (i.e. GWAS input files) from HRC1.1 : Firstly, we determined all samples from the five superpopulations African (AFR), American (AMR), East Asian (EAS), European (EUR), and South Asian (SAS) from the 1000G Phase 3 panel that are also included in the HRC1.1 panel in all chromosomes (since chromosome 1 contains fewer samples in the HRC1.1 panel than the other chromosomes). We extracted these samples from the HRC1.1 panel to form the basis of the target datasets, respectively named after the corresponding superpopulations HRC.AFR, HRC.AMR, HRC.EAS, HRC.EUR, and HRC.SAS. To create a realistic scenario for phasing and imputation, we then reduced the genomic information of the target datasets to the genetic variants contained on Illumina's Global Screening Array (GSA) (https://www.illumina.com/products/by-type/microarray-kits/infinium-global-screening.html), which combines multi-ethnic genome-wide content, curated clinical research variants, and quality control (QC) markers for precision medicine research; further, we removed the phase information to create unphased genotypes. (In fact, the phase information did not have to be explicitly removed, as both EagleImp and Eagle2 ignore this information by default when reading the target input file for the phasing process.) The final benchmark target datasets contained 619,872 variants found on both the GSA and the HRC1.1 panel. For each of the five target datasets, we further reduced the HRC1.1 panel by the samples from the respective target dataset to avoid duplicate samples in target and reference panel.
Similarly, using the 1000G Phase 3 panel, we created five further benchmark datasets. For each of the five superpopulations, we randomly extracted 50 samples (1kG.AFR, 1kG.AMR, 1kG.EAS, 1kG.EUR, and 1kG.SAS ), and reduced the set of variants to the GSA marker content (resulting in 647,963 variants shared between the GSA and the 1000G Phase 3 panel). Again, we created a reduced 1000G Phase 3 panel without the target samples for each benchmark dataset.
To compare the imputation accuracy of EagleImp with that of current imputation servers (SIS, MIS and TOPMed), we chose two real-world target GWAS datasets from GSA genotyping, named COVID.Italy and COVID.Spain, from a COVID-19 Genome-wide Association Study (GWAS) [2], comprising 2,113 samples (839 cases, 1,274 controls) typed at 559,519 variants and 1,792 samples (842 cases, 950 controls) typed at 549,696 variants, respectively, and performed phasing and imputation with our HRC1.1 reference panel. Finally, we used the larger set, COVID.Italy, also to demonstrate the capabilities of EagleImp with large reference panels by imputing it against our synthetic HRC1.1 panel with 1 million samples.

Benchmark parameters
Quality benchmarks were performed for five datasets with the public HRC1.1 reference panel (HRC.*) and for five datasets with the 1000 Genomes reference panel (1kG.*). The runtime benchmarks were performed with several multi-processing configurations for EagleImp and Eagle2 on the HRC.EUR dataset. For PBWT, we simply processed all chromosome datasets in parallel, as PBWT lacks support for multi-threading.

Switch errors and genotype errors
Supplementary

Multi-processing Runtimes
The different multi-processing runtimes are distinguished by the number of concurrent worker processes started and the number of threads assigned to each worker. An Eagle2/PBWT run in 8x4 configuration means that 8 concurrent worker processes are executed in parallel, with 4 threads assigned to each worker. For EagleImp, the number of worker processes that share a CPU-lock is also relevant: a 2x4x8 configuration means that 2x4 workers are executed in parallel, with every two workers sharing a CPU-lock (mutual exclusion via the lock) and 8 threads assigned to each worker.

Preparation of Reference Data
While Eagle2 handles reference files in .vcf.gz or .bcf format, PBWT requires these files to be converted into .pbwt format. EagleImp can handle reference files in .vcf.gz or .bcf format as well, but we recommend a conversion into our .qref format for performance reasons whenever the user intends to perform more than one analysis with the same reference. All benchmarks for EagleImp presented in this manuscript use .qref references.
In Supplementary Table 12, we present the preparation times for .pbwt and .qref files for the 1000 Genomes and HRC panels used in our benchmarks. Note that we ran all conversions for each chromosome concurrently on our system. The table also contains the runtime for converting our simulated panel with 1 million samples with variants from the public HRC1.1 panel. Note that the .pbwt generation for this panel failed due to insufficient memory, even when converting the files one by one (and not concurrently as for the other conversions), and already for the smallest chromosome 22.
For .pbwt creation we used the following command:

pbwt -readVcfGT <reference vcf/bcf file> -writeAll <reference file prefix>

The following parameters were activated for EagleImp for creating .qref files:

eagleimp --ref <reference bcf file> --makeQref
Supplementary Table 13: Runtimes (in seconds) for COVID.Italy data and the simulated panel with 1 million samples. K = 10,000 is the default in Eagle2, K = 54,330 is the maximum value for a benchmark against the public HRC1.1 panel, and K = max corresponds to all 2 million haplotypes. Note that a PBWT imputation was not possible, as PBWT was not able to generate the required .pbwt files from the reference beforehand. Therefore, we list the runtimes for Eagle2 phasing only.