HapFABIA: Identification of very short segments of identity by descent characterized by rare variants in large sequencing data

Identity by descent (IBD) can be reliably detected for long shared DNA segments, which are found in related individuals. However, many studies contain cohorts of unrelated individuals that share only short IBD segments. New sequencing technologies facilitate identification of short IBD segments through rare variants, which convey more information on IBD than common variants. Current IBD detection methods, however, are not designed to use rare variants for the detection of short IBD segments. Short IBD segments reveal genetic structures at high resolution. Therefore, they can help to improve imputation and phasing, to increase genotyping accuracy for low-coverage sequencing and to increase the power of association studies. Since short IBD segments are further assumed to be old, they can shed light on the evolutionary history of humans. We propose HapFABIA, a computational method that applies biclustering to identify very short IBD segments characterized by rare variants. HapFABIA is designed to detect short IBD segments in genotype data that were obtained from next-generation sequencing, but can also be applied to DNA microarray data. Especially in next-generation sequencing data, HapFABIA exploits rare variants for IBD detection. HapFABIA significantly outperformed competing algorithms at detecting short IBD segments on artificial and simulated data with rare variants. HapFABIA identified 160 588 different short IBD segments characterized by rare variants with a median length of 23 kb (mean 24 kb) in data for chromosome 1 of the 1000 Genomes Project. These short IBD segments contain 752 000 single nucleotide variants (SNVs), which account for 39% of the rare variants and 23.5% of all variants. The vast majority—152 000 IBD segments—are shared by Africans, while only 19 000 and 11 000 are shared by Europeans and Asians, respectively. IBD segments that match the Denisova or the Neandertal genome are found significantly more often in Asians and Europeans but also, in some cases exclusively, in Africans. The lengths of IBD segments and their sharing between continental populations indicate that many short IBD segments from chromosome 1 existed before humans migrated out of Africa. Thus, rare variants that tag these short IBD segments predate human migration from Africa. The software package HapFABIA is available from Bioconductor. All data sets, result files and programs for data simulation, preprocessing and evaluation are supplied at http://www.bioinf.jku.at/research/short-IBD.


S1 Introduction
This report provides the following supplementary information to the manuscript "HapFABIA: Identification of very short segments of identity by descent characterized by rare variants in large sequencing data": details on the FABIA method for identifying short IBD segments in genotype data details on generation of artificial and simulated data a section discussing how to obtain small likelihood of identity by state (IBS) without identity by descent (IBD): investigation of the trade-off between small minor allele frequency and many individuals sharing a segment comparisons of extracted IBD segments based on genotype, haplotype, dosage, and likelihood investigation to what extent linkage disequilibrium (LD) can help to identify IBD segments that are tagged by rare variants commandline calls, parameter settings, and result filters for each method and each experiment discussion whether rare variants are recent or old discussion of the result that IBD segments are shared between continental groups sharing of IBD segments between populations and genomes analyses of IBD lengths distributions additional results: other thresholds for matching an archaic genome and comparison results given by the mean instead of the median example IBD segments

S2 FABIA Biclustering for Identifying IBD Segments in Genotype Data
FABIA models each IBD segment in genotype data X by an outer product z λ T of two vectors λ and z, where the vector λ indicates tagSNVs by non-zero values and the vector z indicates individuals that possess the IBD segment. FABIA can represent a homozygous region of an IBD segment by z = 2, that is, two occurrences of an IBD segment in one diploid individual. If we assume genotyping errors which are accounted for by a noise term Υ, the FABIA model for genotype data X is

S2 FABIA Biclustering for Identifying IBD Segments in Genotype Data
where X ∈ R l×n is the genotyping data; Z ∈ R l×p is the matrix that indicates which individuals possess an IBD segment; Λ ∈ R p×n indicates IBD segment tagSNVs; Υ ∈ R l×n is an additive noise term; n is the number of SNVs; l is the number of individuals (or chromosomes for phased genotypes); p is the number of IBD segments; λ i ∈ R n is the tagSNV indicator vector for the i-th IBD segment (the i-th row of Λ); and z i ∈ R l corresponds to the number of times each of the l individuals contains the i-th IBD segment (the i-th column of Z). The additive noise Υ not only covers genotyping errors but also genotypes which cannot be explained by IBD segments. Such unexplained genotypes may arise from recently acquired SNVs, segments observed in only one individual, and IBD segments that are too short, or segments that are shared by too few individuals to be called.
According to Eq. (S1), the j-th genotype x j , i.e., the j-th column of X, is where j is the j-th column of the noise matrix Υ andz j = (z 1j , . . . , z pj ) T indicates which IBD segments are present in genotype x j (j-th column of the matrix Z with length equal to the number of IBD segments p). Recall that z i = (z i1 , . . . , z il ) T in Eq. (S1) corresponds to the number of times each of the l individuals contains the i-th IBD segment.
If we drop the index j which indicates the individual, the formulation in Eq. (S2) facilitates a generative interpretation by a factor analysis model with p factors: where x ∈ R n is the observed genotype. The vector of factorsz = (z 1 , . . . ,z p ) T indicates which IBD segments are present in genotype x, where componentz i indicates how often the individual with genotype x possesses the i-th IBD segment. ∈ R n is the additive noise, which is normally distributed according to N (0, Ψ) with a diagonal covariance matrix Ψ ∈ R n×n . For large sample sizes and binary genotype data, the normal distribution approximates the binomial well (1, p. 130).
Both the vector z i and the vector λ i should be sparse to describe an IBD segment. Sparse z i means that only few individuals possess the IBD segment, which implies rare tagSNVs. Sparse λ i means that only few SNVs are tagSNVs, which implies short IBD segments. Sparse z i can be achieved if components z ij are sparse, that is, if the vector of factorsz j in Eq. (S2) is sparse or, equivalently, the vectorz in Eq. (S3). In contrast to standard factor analysis, FABIA's model selection is tailored to sparse factors and sparse parameters, which are essential for IBD detection. Sparseness in the FABIA model is obtained by a component-wise independent Laplace distribution both for the prior on the parameters λ i and for the distribution of the factorsz (2): The Laplace distribution of the factors leads to an analytically intractable likelihood: p(x | Λ, Ψ) = p(x |z, Λ, Ψ) p(z) dz .
Therefore, the model selection of FABIA is performed by variational expectation maximization (EM) (3,4,5,6,7). The idea of the variational approach is to express the prior p(z) by the maximum over a model family p(z | ξ) that is parametrized by the variational parameter ξ or by scale mixtures A Laplace distribution can be expressed exactly by the maximum of a Gaussian family or by Gaussian scale mixtures (3,4). Therefore, for each x, the maximumξ of the variational parameter ξ allows for representing the Laplacian prior by a Gaussian: The maximumξ can be computed analytically (see Eq. (S12) below) because the likelihood Eq. (S6) can be computed analytically for each Gaussian.
Eqs. (S10)-(S15) below represent the iterative update rules of the FABIA EM algorithm. If we denote the j-th genotype by x j ∈ R n with corresponding factors z j ∈ R p , then we obtain the following variational E-step (5): where Ξ j stands for diag (ξ j ). The update for the variational parameter ξ j is The variational M-step is (5) The parameter α controls the degree of sparseness (an expectation of how rare the tagSNVs are) and can be introduced as a parameter of the Laplacian prior of the factors (5).

S3 Data Generation
The number of bicluster need not be determined a priori if p is chosen large enough. The sparseness constraint will remove spurious biclusters by setting λ to the zero vector. In this way, FABIA automatically determines the number of biclusters.
Non-negative factors and loading. We enforce FABIA to produce non-negative Λ and nonnegative Z values, because indicator matrix Λ and count matrix Z are positive. The indicator matrix of tagSNVs Λ is 1, if the corresponding SNV is a tagSNV, and 0 otherwise. The matrix Z counts the number of occurrences of IBD segments in individuals or chromosomes. FABIA biclustering does not regard these non-negativity constraints. For HapFABIA, we modified FABIA to ensure that the tagSNV indicator matrix Λ is non-negative, which also implies that Z is nonnegative. The matrix Z is given by the posterior means E z j | x j of z j . E z j | x j is nonnegative according to Eq. (S10) with Ξ j > 0 and Ψ jj > 0 if both genotype data x j and tagSNV indicator matrix Λ are non-negative. Consequently, the new tagSNV indicator matrix Λ new is non-negative according to Eq. (S13). The prior term α l Ψ sign(Λ) in the update rule Eq. (S13) is not allowed to change the signs of the tagSNV indicator matrix Λ in one update step. Therefore, it is sufficient to initialize Λ by positive values to enforce non-negative Λ and Z.
Sparse matrix algebra for efficient computations. Most entries of the genotype vectors x j are zero because the major allele is observed. Most entries of the parameter matrix Λ are zero because only few SNVs become tagSNVs. We exploit this sparsity to speed up computations and to allow IBD segment detection in large sequencing data. We developed a specialized sparse matrix algebra which only stores and computes with non-zero values. The update rules in Eqs. (S10)-(S15) become much faster. In contrast to most existing sparse matrix software packages, our sparse matrix algebra allows for multiplying two sparse vectors.
Interpretation for analyzing genotype data. The vectors λ i and z i for the i-th bicluster obtain a new interpretation when detecting short IBD segments. Components of λ i indicate how many individuals possess the corresponding tagSNV relative to individuals that possess the i-th IBD segment. Thus, λ ij = 1 means the minor allele of j-th SNV is present in all individuals that possess the i-th IBD segment, while λ ij = 0.5 it means the SNV is only present in 50% of individuals that possess the the i-th IBD segment. This means that components of λ i give the quality of the corresponding tagSNV for detecting the i-th IBD segment. As mentioned previously, components of z i indicate how often the i-th IBD segment is present in the corresponding multiploid individual. Additionally components of z i show how well the DNA segment in an individual matches the i-th IBD segment. DNA segments in an individual may contain genotyping errors or may be broken up during meiosis, therefore, the i-th IBD segment need not be matched perfectly.

S3.1 Artificial Genotype Data with IBD Segments
We created artificial phased genotype data with rare variants (MAF <5%) by randomly generating chromosomes. The statistical characteristics (minor allele frequencies and distances between SNVs) were chosen to match the genotyping data from the 1000 Genomes Project. We implanted short IBD segments that are tagged by rare variants. The artificial genotype data consist of 100 and 1,000 diploid individuals (200 and 2,000 chromosomes) and 10,000 SNVs. The average physical distance between adjacent SNVs is 100 bp in order to be close to the 1000 Genomes Project data, S3 Data Generation 9 for which the average is 78 bp. The distance between adjacent SNVs was drawn according to an exponential distribution to simulate a Poisson process. Rare variants were obtained by drawing their minor allele from a binomial distribution with a probability 1/25 (4%) resulting in a very sparse matrix with no coincidental IBD segments. Common SNVs were not considered, as we focus on IBD segments that are tagged by rare variants. This means that we generated a 200×10,000 or a 2,000×10,000 matrix. Then SNVs were randomly switched to their minor allele with a probability of 1/25 and IBD segments were implanted. The lengths of IBD segments were chosen to be very short, containing 100 to 200 SNVs on average, which correspond to a physical length of 10-20 kbp. The IBD segment lengths were motivated by the lengths of haplotype blocks (8,9). For example, Gabriel et al. (10) found that common haplotype blocks have an average length of 9 kbp in Africans and 18 kbp in Europeans and Asians. Each IBD segment possesses a particular number of tagSNVs and is contained in a certain number of chromosomes. An implanted segment in a particular chromosome is shortened by randomly breaking up its beginning and end to simulate recombinations. For data sets artAMis, artBMis, artCMis, and artDMis, genotyping errors are simulated for each implanted segment by randomly switching alleles.

S3.2 Sequencing Data with Implanted IBD Segments
The second data set was constructed by implanting IBD segments into real sequencing data from chromosome 1 of the 1000 Genomes Project. Following Browning and Browning (11), we destroyed existing IBD segments to assess false positive rates. We implanted and then tried to rediscover short IBD segments of 10 or 20 kbp. To ensure that discoveries other than the implanted IBD segments are false discoveries, we destroyed all IBD segments with a length of 5 kbp or larger. For the same reason, detected IBD segments that are shorter than 5 kbp are discarded. Therefore, detected IBD segments are either implanted IBD segments (true positives) or false discoveries (false positives). For destroying IBD segments larger than 5 kbp, we divided chromosome 1 into blocks of 5 kbp and then shuffled the sequential order of these blocks. The 5 kbp blocks from the original data ensure that local linkage disequilibrium still exists. Following Browning and Browning (11) and Su et al. (12), we copied IBD segments of one individual onto several other individuals. In contrast to previous experiments, we implanted very short IBD segments with a length of around 0.01 cM (10 kbp) and 0.02 cM (20 kbp).
Shuffling cannot be done completely at random because then, by chance, blocks that were close in the original chromosome could still be together in the shuffled chromosome. Methods can detect dependencies between such blocks as their SNVs are correlated (e.g. LD exists). Therefore, we require blocks that were close in the original chromosome to be as far apart in the shuffled chromosome as possible. Similarly, we require that blocks that are close in the shuffled chromosome were far apart in the original chromosome. To achieve this, we enumerated the blocks repeatedly from 1 to c, where we will show in the following how to optimize the number of groups c (see top panel in Fig. S1). Next, the s blocks with the same number are grouped, for example, in Fig. S1, the 4 blocks that are enumerated by 1 are put next to each other. Blocks that were adjacent in the original chromosome are separated by s (group size) blocks in the shuffled chromosome (see Fig. S1). Blocks that are adjacent in the shuffled chromosome were separated by c (number of groups) blocks in the original chromosome. Thus, we have to simultaneously maximize the number of groups c and the group size s, therefore, we set s = c. The number of blocks b is chromosome length divided by block length while b = cs = c 2 . Hence, the optimal number of clusters is the squared root of chromosome length divided by block length (c = 249, 000, 000/5, 000 = 223. Consequently, blocks that were adjacent in the original chromosome are now separated in the shuffled chromosome by 1.1 cM (223 * 5 kbp= 1.1 Mbp) and vice versa. However, original IBD segments longer than 1 cM may still introduce segments longer than 5 kbp in the shuffled chromosome which will be considered as false positives in this analysis.
In previous simulations like Browning and Browning (11) and Su et al. (12), segments were copied from one individual onto another individual. However, this procedure is not applicable in our setting, for two reasons: 1. The individual from which the segment is extracted may share short IBD segments with other individuals within one 5 kbp block. The number of minor alleles that tag the short IBD segment may even be large. Methods that consider more than two individuals will either miss the implanted segment or detect too many individuals. In the first case, methods may prune the implanted segment off all SNVs outside the short IBD segment. In the second case, methods consider mismatches outside the short IBD segment as sequencing errors or broken segments and consider all individuals that possess the short IBD segment as having the long segment. The problem is that the residual structure within the 5 kbp blocks may be stronger than the structure generated by implanting segments. This problem does not exist if only two individuals are compared to each other but arises if all individuals are compared simultaneously.
2. In most short segments, the majority of observed minor alleles are minor alleles of common SNVs. Therefore, such a segment is tagged by common variants in contrast to our assumptions in this manuscript (see Introduction of the main manuscript).
To implant IBD segments of 10 kbp and 20 kbp length into the shuffled chromosome, we first randomly extract them from the original chromosome. We copy the SNVs within the extracted segment onto SNVs of the shuffled chromosome. Most common variants of the extracted segment are copied onto rare variants of the shuffled chromosome. In the shuffled chromosome most SNVs are rare, however most observed minor alleles stem from common variants. For example, a common SNV with 30% MAF contributes 300 minor alleles to the data set, which requires 300 private SNVs or 30 SNVs with 1% MAF for the same contribution of minor alleles. Copying the nucleotides of the selected segment would result in many adjacent SNVs with a minor allele (top red rectangle in Fig. S2). It may be hard to detect IBD, if only two individuals that possess the segment are considered. However, IBD is detected easily, if multiple individuals are considered, including those that do not possess the IBD segment. The IBD segment clearly sticks out. Therefore, in order to get more realistic data, we only copy the pattern of minor alleles onto the shuffled chromosomes. At locations where the extracted segment has minor alleles, also the implanted segment has its minor alleles (see lower case characters in in the lower rectangle in Fig. S2). To make the break points of the implanted IBD segment recognizable, it has a minor allele at its first and last SNV. Further, a segment is implanted at a region where the SNVs onto which the segment SNVs are copied span about the same genomic region as the extracted segment. This selection of a sufficiently large (in number of bases) region avoids that the SNVs of a 10 kbp or 20 kbp segment are copied onto a 5 kbp or smaller region where the SNVs are denser.

S3.3 Forward Simulation Genotype Data with Implanted IBD Segments
The third data set was constructed by implanting IBD segments into genotype data that has been generated by forward-time simulations. The data is phased per construction, as the forward-time simulation provides the chromosomes for each diploid individual. We now also compare the IBD detection methods on long and medium sized IBD segments (0.5, 1, and 2 Mbp).
In contrast to previous data sets, forward and coalescent simulation ensures linkage disequilibrium on a larger scale and evolutionary relationships between chromosomes. We investigated the coalescent simulation programs MSMS (13), GENOME (14), and COSI (15). We found out that coalescent simulations based on these software packages did not match the characteristics of sequencing data as well as forward-time simulations. In particular the percentage of private / rare SNVs and the distance between SNVs of forward-time simulations were closer to data from sequencing projects (1000 Genomes Project, UK10K, Korean Personal Genome Project) than coalescent simulations. Forward-time simulators are slower than coalescent simulators, but more powerful (16,17). The forward-time simulator SFS_CODE (18) produced results which were closest to the characteristics of sequencing projects.
With the forward-time simulator SFS_CODE, we generated 147 1 DNA chunks of 300 kbp length each. We used the SFS_CODE commandline  Joining the 147 DNA chunks led to a DNA strand with 44,094,874 bases. When joining two segments, on average, 35 bases are lost to keep the relative DNA distance of SNVs to their predecessors. We obtained 1,148,822 SNVs for 5,000 randomly sampled individuals. The average number of SNVs in 1,000 individuals is 418,000 with an average distance of 105 bp. This is close to the numbers of the 1000 Genomes Project, where 1,092 individuals have an average SNV distance of 78 bp.

a C T C T T C T T T G G G A A G A G A C T G A A a C T C T C T T C T T T G G G A A G A G A C T G A A G C a C T C T g C T T T G G t A A G A G A C a G A A G C a C T C T g C T T T G G t A A G g G A C T G c A G C a C T C T g C T T T G G t A A G A G A C T G A A G C a C T C T g C T T T G G t A A G A G A C T G A A G C T C T C T T C T T T G G G A A G A G A C T G A A G C T C c C T g C T T T G G G A c G g G A C T G A A G C T C T C T T C T a T G G t A A G A G A C T minor allele pattern
To generate data for an experiment, we first sample 1,000 individuals from the 5,000 individuals. Next we selected an interval containing 10,000 SNVs (∼1 Mbp length) for short IBD segments (10 and 20 kbp) and an interval containing 200,000 SNVs (∼20 Mbp length) for long and medium sized IBD segments (0.5, 1, and 2 Mbp). Then we implant IBD segments into the selected interval, where the IBD segments are taken from individuals that do not belong to the sampled 1,000 individuals. Implantation was performed analogously to the previous experiments that were based on shuffled chromosome 1 data from the 1000 Genomes Project. For implantation, we selected IBD segments that had at least 8 tagSNVs. An IBD segment of length 1 Mbp resulted in 140-250 tagSNVs and a length of 8,000-10,000 SNVs. Table S1 shows an overview over the data sets which are characterized by the length of implanted IBD segments, number of different IBD segments implanted, and how many chromosomes possess a particular IBD segment. "L" gives the physical length of the IBD segments in kbp, "F" shows how many chromosomes contain the IBD segment, "#I" reports the number of different IBD segments that were implanted.

S4 How Many Individuals Share an IBD Segment?
How often does one find IBD segments that are shared by more than two individuals? The probability of finding IBD segments increases with decreasing segment size (20). Thus, the smaller the IBD segments, the greater the chance of finding multiple individuals who share this segment. In a growing population with N founding haplotypes, the probability that two randomly chosen haplotypes are IBD at a very short random segment (no recombination) is 1/N (21). The probability segments that are shared by k individuals. This number holds for one DNA location. Genetic drift may have led to more frequent segments than 1/N increasing the likelihood of finding IBD segments that are shared by multiple individuals. In summary, many IBD segments will be shared by more than two individuals if (a) the segments are small and (b) the cohort is large.

S5 Small Likelihood of Identity by State: Small MAF vs. Many Individuals
In step 1 of IBD segment extraction from FABIA models as described in subsection "Extraction of IBD Segments from FABIA Models" of the main manuscript, the likelihood of observing identity by state (IBS) without IBD and, therefore, without recombination, is computed. If q is the minor allele frequency (MAF) for one SNV, the probability of observing the minor allele of this SNV in all t individuals is q t . We assumed that all SNVs have the same MAF q -in the experiments we used the average MAF. According to the main manuscript, the probability of observing k or more counts of model SNVs by chance in an interval of n SNVs is where l is the number of individuals and l t is the number of possibilities to chose t individuals from the l individuals of the study. The number of counts i runs from k to the number of SNVs n.
If we try to minimize this probability, we observe a trade-off between small average minor allele frequency (MAF) q and the number t of individuals that share an IBD segment. For a small IBS likelihood, the average MAF q should be small and t large. However, increasing t also increases the lower bound t l on q. If we assume minimal q = t l , then above probability Eq. (S17) is governed by the term For small q = t l we have 1 − t l t n−i ≈ 1. The function t l it has its minimum at t = l/e independent of i (e is Euler's number). Thus, if the factor l t is ignored and t = l/e individuals share the minor allele, the evidence for IBD is maximal.
To avoid IBS without IBD by recombination, we focused on rare SNVs (see main manuscript). IBD can be distinguished from IBS without IBD by rare alleles, because for them two independent origins are unlikely, so IBS generally implies IBD, which is not true for common alleles (22, chapter 15.3, p. 441). Hence, the minimum t = l/e will not be attained with rare SNVs. For fewer individuals t than the minimum t = l/e, the function t l it is decreasing. This justifies that more individuals give more evidence for IBD because random minor allele sharing (IBS without IBD) is less likely.

S6 IBD Segments based on Likelihood and Dosage
We compare the IBD segments that HapFABIA detects if it uses genotype, haplotype, dosage, or likelihood data from the 1000 Genomes Project (chromosome 1). For the likelihood, HapFABIA uses one minus the likelihood of observing only the major allele. This value is thresholded by 0.6, i.e. only if the likelihood of observing only the major allele is below 0.4, a value different from zero is provided. For dosage, HapFABIA uses the raw value.
First we present a typical example from the 1000 Genomes Project using data from chromosome 1. For the interval from SNV number 1,000,000 to 1,010,000, 205 IBD segments were detected in genotype data, 206 in haplotype data, 207 in dosage data, and 210 in likelihood data. The number of IBD segments that match between two types of data is 143 for G-H, 155 for G-D, 144 for G-L, 137 for H-D, 136 for H-L, and 136 or D-L (G=genotype, H=haplotype, D=dosage, L=likelihood). For deciding whether two IBD segments match or not, we used hierarchical clustering according to the pseudo-distance measure described in the main manuscript in section "Extraction of IBD Segments from FABIA Models". A match of two original IBD segments is indicated by merging them in hierarchical clustering. The number of IBD segments after hierarchical clustering is 218 for G-H, 205 for G-D, 226 for G-L, 222 for H-D, 235 for H-L, and 238 for D-L. For example, 75% of IBD segments extracted from genotype data can be matched with IBD segments from dosage data and vice versa. HapFABIA's step 4 in the second stage ensures that IBD segments within genotype data or within dosage data are not similar to each other, therefore, during hierarchical clustering only IBD segments from genotype data can be merged with IBD segments from dosage data. Percentages are given with respect to the row data. For example, 71% of IBD segments detected in haplotype data are also found in genotype data, but only 64% of IBD segments detected in genotype data are also found in haplotype data. Results are based on genotyping data from chromosome 1 of the 1000 Genomes Project.
We computed the average matching of IBD segments extracted from different kind of data across all intervals of chromosome 1 based on the data from the 1000 Genomes Project. Table S2 reports the average percentage of IBD segments that could be matched between data types. The agreement of the detected IBD segments between detections based on different genotype values is high considering the inconsistencies between the different genotype measurements. In the following we describe some of these inconsistencies.
The difference between genotype and haplotype based IBD segment detection is often caused by phasing errors (examples of such phasing errors can be found at http://www.bioinf.jku.

S7 Identification of IBD Segment tagSNVs by Linkage Disequilibrium
at/research/short-IBD). Genotype and dosage agree at about 74% of their IBD segments. However, many inconsistencies between dosage and genotype exist. For example, the record [0|0:1.000:-0.48,-0.48,-0.48] is observed very often, which means that, for dosage, the minor allele is one, while it is zero in the genotype. Each record describes an individual-SNV pair. The first entry is the phased first haplotype, where 1 means that the minor allele is present. To the right of "|", the minor allele of the second haplotype is given. Next, after ":", the dosage is reported and finally the log 10 likelihood of the genotypes 0|0, 1|0 or 0|1, and 1|1. In summary the agreement of IBD segments based on genotype, haplotype, dosage, and likelihood is high given the inconsistencies in the data.

S7 Identification of IBD Segment tagSNVs by Linkage Disequilibrium
We investigated to what extent linkage disequilibrium (LD) can help to find IBD segments or to identify haplotype blocks with a focus on rare variants. For rare variants, LD has high variance (23,24). Therefore, it is difficult to employ LD for identifying IBD segments or haplotype blocks that contain mainly rare variants.
To see how well LD detects tagSNVs of IBD segments, we constructed a classification task using the artificial data sets. SNV pairs are classified as belonging -or not belonging -to an IBD segments. Pairs of IBD segment tagSNVs are true pairs, and others are false pairs. SNV pairs are ranked according to their LD values and the top-ranked pairs are considered to be positive SNV pairs. We are interested in whether IBD segments can be detected using the LD values, rather than evaluating this classification task. Therefore, we counted how many IBD segment tagSNVs are identified via SNV pairs with high LD values -the number of true positives (TP) -and how many non-IBD segment SNVs are falsely detected -the number of false positives (FP). If an IBD segment tagSNV is detected once via a true SNV pair, it counts as a true positive, even if it participates in false pairs. This evaluation takes two issues into account: First, an IBD segment tagSNV needs to be detected only once. Secondly, spurious similarities are penalized as false positives because they mislead a subsequent IBD segment extraction procedure. The power (true positive rate) and the false discovery rate (FDR) can be computed using the IBD segment tagSNVs (true SNVs) and the SNVs belonging to a top-ranked pair (positive SNVs). For each SNV pair, PLINK computes the correlation r 2 , which measures the LD independently of allele frequencies.
Values for r 2 are only computed within a window of 500 SNVs to consider local agglomerations of IBD segment tagSNVs -the same window size is used for HapFABIA in the experiments with the artificial data sets. Pairs are first ranked according to r 2 and then the top-ranked 100, 1,000, or 10,000 pairs are selected. Table S3 summarizes 100 runs for each data set and reports the median values for the number of true positives (TP), the number of false positives (FP), the power (recall or the true positive rate), and the false discovery rate (FDR). The FDR is extremely high, with a median of 0.99 (mean 0.99) and a minimal average value of 0.964 for data set artC. An SNV is a positive IBD segment tagSNV if it belongs to at least one top-ranking (according to its LD value) SNV pair. Values are given for top-ranking 100, 1,000, and 10,000 SNVs. We report true positives (TP), false positives (FP), power (recall or true positive rate), and false discovery rate (FDR). Each value corresponds to the median of 100 runs for each of the data sets artA100, artA, artB100, artB, artC100, artC, artD100, and artD. The FDR is strikingly high.
If more IBD segments with the same size exist at the same DNA location, even if they contain only common variants, then haplotype block identification may be still possible. However, in the 1000 Genomes Project data we found many IBD segments which contained only few common variants.

S8.1 Commandline Calls
We used the following commandline calls for the methods that were compared. The calling parameters are different for the different data sets and are described in the next subsection.

DASH:
The results of GERMLINE, which are stored in the file germline_output.txt.match, are filtered and then written into the file germline_result_filtered. Then DASH is called by cut -f 1,2,4 germline_result_filtered | dash_cc dataplink.fam \ dash_output -min 2

Relate:
The content of the parameter file relate_options is

S8.2 Arguments and Filters for the Different Data Sets
Except for HapFABIA, we optimized the calling arguments of the methods and filters on the results on 5 additional data sets for each of the experiments. This optimization should ensure that the parameters and filters are close to those the authors of the methods would use. We aim at removing biases that result from being more familiar with one method and therefore being able to choose a more appropriate parameter setting for an experiment. The parameters of HapFABIA are not optimized but chosen according to characteristics of the data set (e.g. average minor allele frequency) and the lengths of IBD segments that should be extracted.
We report the arguments used at the calls of the different methods in the tables below. The term "artificial" denotes data sets with artificially generated chromosomes into which short artificial IBD segments are implanted: artA100, artA, artAMis, artB100, artB, artBMis, artC100, artC, artCMis, artD100, artD, and artDMis (see main manuscript for details on these data sets). Artificial data sets with "no mismatches" are artA100, artA, artB100, artB, artC100, artC, artD100, artD, while artificial data sets with "mismatches" (up to 6 mismatches) are artAMis, artBMis, artCMis, and artDMis. The term "implanted" denotes data sets based on shuffled 5 kbp blocks of the 1000 Genomes Project into which short real IBD segments are implanted: impA, impB, impC, impD, and impE. The term "simulated" denotes data sets based on forward-time simulation into which short IBD segments (also from the simulation) are implanted: simA, simB, simC, simD, and simE. The term "simulated long" denotes data sets based on forward-time simulation into which short and medium sized IBD segments (also from the simulation) are implanted: simAlong, simBlong, simClong, simDlong, and simElong. As simElong and simFlong contain implanted IBD segments of 0.5 Mbp, filter criteria are adjusted to this length. fastPHASE was used with option -T1 to allow only one random start and either without fixing K or with -K400. We used this value of K because up to 400 IBD segments were found in a 10,000 SNV interval in the 1000 Genomes Project data.
For BEAGLE/fastIBD, we used for short IBD segment thresholds of 1e-10 (fastIBD-1) and 1e-13 (fastIBD-2). Smaller values lead to power close to 0. If these thresholds do not yield a detection, they are increased by a factor of 10 until IBD segments are detected. Using the filter "length", detected IBD segments are enforced to have a minimal length in kbp. Table S4 gives an overview over the fastIBD parameters and filters used for the different data sets.
For PLINK, we first pruned SNVs using the option --indep 50 5 2. LD-based pruning was recommended by the authors of PLINK and was performed in most previous comparisons of IBD methods. We adjusted the minimal segment length (--segment-length) and the minimal number of SNVs in segment (--segment-snp). For segment-length, smaller values than 0.1 did not change the result if compared to the value 0.1. Using the filter "length", detected IBD segments are enforced to have a minimal length in kbp. Table S5 gives an overview over the PLINK parameters and filters used for the different data sets.  For GERMLINE, we adjusted the minimum length for matches (min_m), the maximum number of mismatches of homozygous and heterozygous SNVs (-err_hom and -err_het), and the number of SNVs (-bits) in an initial (seed) segment. To avoid an excessive number of false positives, we filtered GERMLINE results to keep only longer segments. For example, with the artificial data, only IBD segments with more than 150 (GERMLINE-1) or more than 200 (GERMLINE-2) SNVs are kept -these values match the sizes of the implanted IBD segments. For implanted IBD segments in the shuffled chromosome 1 of the 1000 Genomes Project, we used -bits 128, and -err_hom 0 as well as -err_het 0 (we did not simulate sequencing errors). For this data set, GERMLINE extracted 30 million IBD regions and had an extremely high FDR with -bits 30. Therefore, we used for this data set stricter filtering criteria for GERMLINE. We kept segments greater or equal to 260 SNVs (GERMLINE-1) or greater or equal to 380 SNVs (GERMLINE-2) in order to reduce the FDR -these lengths were found to be optimal for this experiment.
For simulated data sets simB and simC with 10 kbp long implanted IBD segments , we called GERMLINE with -bits 80, which gives an average physical length of the seed of 8 kbp as the average distance between SNVs is 105 bases. To reduce the FDR we kept segments greater or equal to 70 SNVs (GERMLINE-1) or greater or equal to 90 SNVs (GERMLINE-2). For simulated 22 S8 Calls, Filters, and Parameter Settings for the Methods data sets simA, simC, and simE with 20 kbp long implanted IBD segments , we called GERMLINE with -bits 170, which gives an average physical length of the seed of 17 kbp. To reduce the FDR we kept segments greater or equal to 150 SNVs (GERMLINE-1) or greater or equal to 180 SNVs (GERMLINE-2). Though the filter of GERMLINE-2 is only slightly longer than the seed, the power dropped drastically in the experiments. Using the filter "length", detected IBD segments are enforced to have a minimal length in kbp. Table S6 gives an overview over the GERMLINE parameters and filters used for the different data sets. Minimum length for match (min_m), maximum mismatches of homozygous and heterozygous SNVs (-err_hom and -err_het), number of SNVs in IBD segment for seed (-bits), filter on the length of found segments in number of SNVs (filter 1 and filter 2), filter on minimal IBD segment length in kbp (length). For implanted data sets -bits 30 resulted in more than 100 million detected segments which lead to an FDR of 1, therefore we set -bits 128. For the simulated data sets we used -bits 80 for IBD segment length 10 kbp and -bits 170 for 20 kbp (the average distance between SNVs is 105 bases). If we used more stringent filters for the simulated data like -bits 128 or 150 for the filter on simB or simD, then we had power of zero (all 10 kbp segments were filtered out).
DASH is based on the output of filtered GERMLINE-1 results and filtered for a minimal number of individuals in a cluster (-min). Using the filter "length", detected IBD segments are enforced to have a minimal length in kbp. Table S7 gives an overview over the DASH parameters and filters used for the different data sets.
The default FABIA parameters are p=10 (10 biclusters), alpha=0.03 (default sparseness value), cyc=50 (number of cycles of the EM algorithm), iter=1 (number of iterations). For IBD segment extraction the default settings are IBDsegmentLength=50 kbp (typical length of IBD segments), thresCount=1e-5 (p-value for tagSNV counts in a bin to detect IBD segments and to distinguish IBS from IBD), mintagSNVsFactor=3/4 (minimal IBD segment overlap between two chromosomes to call an IBD segment). Furthermore, Lt=0.1 (percentage of top bicluster SNVs considered) and Zt=0.2 (percentage of top bicluster samples considered). Variable pMAF is the estimated average minor allele frequency. Further filters are minimal number of individuals ("min") possessing an IBD segment and minimal physical length in kbp ("length") of IBD seg-  ments. For the 1000 Genomes Project data, we performed more iterations. The value pMAF=0.035 was estimated on data from the 1000 Genomes Project by counting how many of n = 500 SNV bins contain 10 or more minor alleles in an individual. We know the true pMAF=0.04 only for the artificial data. Table S8 gives an overview of the HapFABIA parameters and filters used for the different data sets. IBDsegmentLength is the typical length of IBD segments (in kbp) for extraction and simultaneously determines the tagSNVs accumulation threshold. This parameter is based on the assumption of an average distance of 100 bases between SNVs: avSNVsDist=100. pMAF is the average minor allele frequency for determining the tagSNVs accumulation threshold. "min" gives the minimal number of individuals that must possess the IBD segment (see DASH). For the artificial data, pMAF=0.04 was known. For data sets "D", we implanted more IBD segments, therefore increased iter to 2. More iterations than the chosen iter value did not change the results because no more biclusters are found. For the 1000 Genomes project, in most cases smaller iter values, like iter=20, would be sufficient as no more biclusters are found in later iterations.  The column labeled "Pop" gives the population, "#Ind" reports the number of individuals, "Grp" gives the continental group, and "Location" the location from where the individuals stem. ay be lower for populations ample, our resource includes s with frequencies of ,0.1%, enomes sequenced in a study e SardiNIA study).

S8 Calls, Filters, and Parameter Settings for the Methods
ween populations iled view of variation across 2a). Most common variants Fig. 2a) were known before ad their haplotype structure ontrast, only 62% of variants ts with frequencies of #0.5% sis, populations are grouped try: Europe (CEU (see Fig. 2a ns), TSI, GBR, FIN and IBS), a (CHB, JPT and CHS) and ariants present at 10% and ost all found in all of the of low-frequency variants in le ancestry group, and 53% of single population (Fig. 2b). nts are weakly differentiated ht's fixation index (F ST ) are ugh below 0.5% frequency ound within the same popus from the ancestry group f rare-variant differentiation e, within Europe, the IBS and

S10 Calculation of IBD Lengths
The raw lengths were computed as the length of the maximal IBD sharing between two individuals that possess the IBD segment. This resulted in overestimation of the lengths, which are corrected as described below.
Furthermore, we correct IBD segment lengths for IBD with archaic genomes, as archaic genomes only match a part of the IBD segment.

S10.1 Correction for the Assumptions of IBD Length Distributions
The raw IBD segment lengths are computed as the maximal IBD sharing of any two individuals that possess the IBD segment. The exponential distribution of IBD lengths as explained in Subsection S13.1 was derived for pairs of haplotypes (21,26,27,28). Thus, our raw IBD segment lengths are longer than the IBD lengths for pairs of haplotypes. To apply the assumption of exponentially distributed IBD segment lengths requires the correction of the raw IBD segment lengths. We observed a second cause for raw IBD segments being longer than expected by the exponential distribution. The more individuals share an IBD segment, the more likely it is to find two individuals that share random minor alleles which would falsely extend the IBD segment.
Therefore, we correct the raw lengths of IBD segments by locating the first tagSNV from the left (upstream) which is shared by 3/4 of the individuals that possess the IBD segment. This tagSNV is the left break point for the IBD segment. Analogously, we determine the right break point by the first tagSNV from the right (downstream) that is shared by 3/4 of the individuals. The distance between these break points is the (corrected) length of an IBD segment.

S10.2 Length Correction for IBD with Archaic Genomes
We are interested in IBD between human and archaic genomes. However, the human IBD segment length is not an appropriate measure for the length of IBD with archaic genomes because only a part of the IBD segment may match an archaic genome (see Figures S31, S32, and S33).
We correct the IBD segment lengths to obtain the IBD lengths between human and archaic genomes. The corrected length of an IBD segment is the length of the "archaic part" that matches a particular archaic genome. First, the left (upstream) break point of the "archaic part" of an IBD segment genome is detected. This left break point is defined as the first location in the IBD segment from the left (upstream), where at least 4 out of 9 tagSNVs match the archaic genome. From the right (downstream), the right break point of the "archaic part" of an IBD segment was detected analogously. Since not all bases of the Neandertal genome were called, we modified the definition of the break points. For the Neandertal genome, a break point requires at least 5 or 6 bases of the 9 tagSNVs to be called of which 2 or 3, respectively, have to match the Neandertal genome. If either the left or right break point of an "archaic part" could not be found, then this IBD segment does not contain an "archaic part".
Matching of an IBD segment and an archaic genome for IBD segment lengths analyses was defined as: 1. at least 15% of the tagSNVs of the IBD segment must match the archaic genome, S11 Are Rare Variants Recent or Old? 27 2. 30% of tagSNVs in the "archaic part" of the IBD segment must match the archaic genome, and 3. the "archaic part" of the IBD segment must at least contain 9 tagSNVs. S11 Are Rare Variants Recent or Old?
Rare variants are found to be old which might seem to contradict results of other researchers. In the following two subsections, we show that our conclusion that rare variants can be old does not contradict previous findings. In the first subsection, we argue that only because rare variants are in general recent does not contradict that some rare variants are old. In the second subsection, we support the viewpoint that rare variants are old by other publications.

S11.1 Some Rare Variants are Recent and Some Old
That rare variants are, in general, recent does not contradict our results. On the contrary, that the majority of IBD segments extracted by HapFABIA is found in Africans can explain why three times as many variants with 0.5-5% minor allele frequency are found in Africans as in European or Asians in the 1000 Genomes Project (25). In particular, if rare variants are caused by a recent population growth, then the numbers found by the 1000 Genomes Project (25) are difficult to interpret. We want to discuss two reasons why rare variant may be old.
(I) Many rare variants are old, however, most rare variants may be recent, e.g. because of a recent growth in population. We did not consider private variants and have a bias toward less rare variants because, the more individuals share an IBD segment, the higher its significance, the more likely it is detected by HapFABIA. In our analysis, only 39% of the rare variants (not counting private variants) are in IBD segments. Therefore, the remaining majority of rare and private variants might be recent.
The vast majority of IBD segments is found in Africans. This would explain why Africans have more rare and low frequent variants than Europeans or Asians. The publication of the 1000 Genomes Project (25) reports: "individuals from populations with substantial African ancestry (YRI, LWK and ASW) carry up to three times as many low-frequency variants (0.5-5% frequency) as those of European or East Asian origin," Table S14 in the supplementary information of the publication of the 1000 Genomes Project (25) provides the number of derived variants per individual in each population in its last rows (DAF denotes "derived allele frequency", i.e. the frequency of the mutation): Recent variants are supposed to be derived variants. The table shows about four times more variants with derived allele frequency of 0.5-5% in Africans than in other populations (if we ignore the admixed Americans that have African admixture). For derived allele frequency < 0.5%, we observe 2.5 times more variants in Africans than in Europeans or Asians.
(II) Many rare variants are old, however, compared to common SNPs, they are recent, i.e. the temporal relations remain, but variants are dated further back. We state in the manuscript that many rare variants are old and from times before humans migrated out of Africa. However, many common SNPs are even older and stem from common ancestors of human and chimpanzee. This means that many rare variants are old, but compared to common SNPs they are recent.
That common SNPs are old is supported by findings of the Chimpanzee Consortium. In Hacia et al. (29) it was found that, of 397 human SNP sites, 214 were ancestral (shared with common ancestors of chimpanzee and human). Of the ancestral SNPs, 1/4 had the minor allele as ancestral allele. For the chimpanzee genome (30), it was found that "Of ∼7.2 million SNPs mapped to the human genome in the current public database, we could assign the alleles as ancestral or derived in 80% of the cases according to which allele agrees with the chimpanzee genome sequence" and that "a significant proportion of derived alleles have high frequencies: 9.1% of derived alleles have frequency ≥ 80%." According to (30, Suppl. Fig. S9) about 25% of the derived alleles have frequency ≥ 50% in which case the minor allele is the ancestral allele.
That some SNPs are very old and that some haplotypes are shared between humans and chimpanzee was also found in Leffler et al. (31): "We conducted a genome-wide scan for long-lived balancing selection by looking for combinations of SNPs shared between humans and chimpanzees. In addition to the major histocompatibility complex, we identified 125 regions in which the same haplotypes are segregating in the two species," S11 Are Rare Variants Recent or Old? 29 se that resulted in changes within groups (t-test; P , 0.00001 by mutation; Supplementary Fig. 10a). These differences in average le age are probably due to varying intensities of selective constraint ong different classes of SNVs 12 . Consistent with this prediction, we erved significantly higher values of the neutrality index, a measure he direction and degree of departure from neutral evolution, in omic regions enriched for younger variants (Spearman's correla-; P 5 0.004 and P 5 0.001 for European Americans and African ericans, respectively; Supplementary Fig. 11), indicating a higher den of deleterious SNVs. o more directly identify putatively deleterious SNVs, we used four ctional prediction methods (SIFT 13 , PolyPhen2 (ref. 14), a likelihood o test 15 , MutationTaster 16 ) applicable to non-synonymous SNVs and two conservation-based methods ( plicable to all SNVs (Supplementary Inf inverse relationship between average SN thods that predicted a variant to be delet predicted to be deleterious by multiple (on average) more intense purifying sele interest in disease mapping studies. The predicted to be deleterious by all 6 metho European Americans and African Ame were less than 5,000 years old (92.9% and and African Americans, respectively).
The strengths and weaknesses of fu vary substantially and as a result the a those that resulted in changes within groups (t-test; P , 0.00001 by permutation; Supplementary Fig. 10a). These differences in average allele age are probably due to varying intensities of selective constraint among different classes of SNVs 12 . Consistent with this prediction, we observed significantly higher values of the neutrality index, a measure of the direction and degree of departure from neutral evolution, in genomic regions enriched for younger variants (Spearman's correlation; P 5 0.004 and P 5 0.001 for European Americans and African Americans, respectively; Supplementary Fig. 11), indicating a higher burden of deleterious SNVs. To more directly identify putatively deleterious SNVs, we used four functional prediction methods (SIFT 13 , PolyPhen2 (ref. 14), a likelihood ratio test 15 , MutationTaster 16 ) applicable to non-synonymous SNVs and two conservation-based methods (GERP11 17 and PhyloP 18 ) applicable to all SNVs (Supplementary Information). We found a strong inverse relationship between average SNV age and the number of methods that predicted a variant to be deleterious (Fig. 2a, b). Thus, SNVs predicted to be deleterious by multiple methods probably experience (on average) more intense purifying selection and may be of particular interest in disease mapping studies. The age of non-synonymous SNVs predicted to be deleterious by all 6 methods was 3,000 and 6,200 years in European Americans and African Americans, respectively, and 88.7% were less than 5,000 years old (92.9% and 80.6% in European Americans and African Americans, respectively).
The strengths and weaknesses of functional prediction methods vary substantially and as a result the accuracy of any single method

S11.2 Old SNVs Have Been Reported in Other Publications
In a recent publication (32), it was found that "The average age across all SNVs was 34,200 ± 900 years (± s.d.) in European Americans and 47,600 ± 1,500 years in African Americans, and these estimates were robust to sequencing errors [...]".
The authors further write "SNVs shared between European Americans and African Americans were significantly older (104,400 years and 115,800 years for European Americans and African Americans, respectively)".
These figures are visualized as bar plots in Fig. S4.

S12.1 Sharing of IBD Segments Between Continental Groups
We found that most IBD segments are observed in only one continental group, which are, in most cases, Africans. However, for other continental groups, we saw a considerable sharing of IBD segments across continental groups. The sharing between continental groups seems to contradict statements in other publications. For example, Gravel et al. (33) report that sharing of SNVs between continental groups is low.
Sharing between continental groups has been reported in the publication of the 1000 Genomes Project (25). In our nomenclature, the terms "low-frequency variants" and "rare variants" from (25) are unified to the term "rare variants". The 1000 Genomes Project (25) use the term "ancestry groups" to denote EUR, EAS (ASN in our nomenclature), AFR, and AMR according to their Fig. 2b. The authors of the 1000 Genomes Project (25) write: "By contrast, 17% of low-frequency variants in the range 0.5-5% were observed in a single ancestry group, and 53% of rare variants at 0.5% were observed in a single population (Fig. 2b)." This means that 83% of the low-frequency variants are shared between continental groups and 47% (almost half) of the rare variants are shared between continental groups.
Another example taken from the 1000 Genomes Project paper (25) is: "Some sharing patterns are surprising; for example, 2.5% of the f 2 FIN-X variants are shared with YRI or LWK populations." f 2 variants are those variants for which their minor allele is observed in exactly 2 individuals. f 2 variants are rarer than most SNVs that tag our extracted IBD segments because, on average, 6 individuals possess an IBD segment. According to (33), sharing between continental groups decreases with rarer variants. Thus, in our extracted IBD segments, we would expect to find sharing between continental groups considerably more often than for f 2 variants. Furthermore, the same publication (25) reports: "By contrast, between populations with no recent shared ancestry, f 2 variants are present on very short haplotypes, for example, an average of 11 kb for FIN-YRI f 2 variants (median between ancestry groups excluding admixture is 15 kb), and are therefore likely to reflect recurrent mutations and chance ancient coalescent events." This statement also supports our findings of IBD sharing between continental groups, but we could exclude recurrent mutations because IBD segments are shared between multiple individuals.

S12.2 IBD Segments that Match Archaic Genomes
We tested whether IBD segments that match particular archaic genomes to a large extent are found more often in certain populations than expected randomly. For each IBD segment, we compute two values: The first value is the proportion of tagSNVs that match a particular archaic genome, which we call "genome proportion" of an IBD segment (e.g. "Denisova proportion"). The second value is the proportion of individuals that possess an IBD segment and are from a certain population as opposed to the overall number of individuals that possess this IBD segment. We call this value the "population proportion" of an IBD segment (e.g. "Asian proportion"). Consider the following illustrative examples: If an IBD segment has 20 tagSNVs of which 10 match Denisova bases with their minor allele, then we obtain 10/20=0.5=50% for the Denisova proportion. If an IBD segment is observed in 6 individuals of which 4 are Africans and 2 Europeans, then the African proportion is 4/6=0.67=67% and the European proportion is 0.33=33%.
We define an IBD segment to match a particular archaic genome if the genome proportion is 30% or higher. We plot densities of population proportions for IBD segments that match a particular archaic genome and for those that do not match that genome. Figure S5 shows the density of Asian proportions of Denisova-matching IBD segments (pink) vs. density of Asian proportions of non-Denisova-matching IBD segments (cyan). Figure S6 shows analogous densities for Europeans. Denisova-matching IBD segments are observed more often in Asians and Europeans than non-matching IBD segments. This can be seen by the higher densities of matching IBD segments (pink) compared to densities of non-matching IBD segments (cyan) if the population proportions are not very close to zero -or, conversely, it can be seen at the lower peak at zero (less IBD segments that match the Denisova genome without Asians or European sharing). Many Denisova-matching IBD segments are shared exclusively among Asians which is indicated by the high density at a population proportion of 1 (pink density in Fig. S5). Population proportion of 1 means that IBD segments are only shared among Asians. Figs. S8 and S9 show analogous densities as in Figs. S5 and S6, but for the Neandertal genome. The differences we already observed for Denisova are even more prominent for Neandertal: Neandertal-matching IBD segments are observed even more often in Asians and Europeans than non-matching IBD segments. The higher densities (pink) are now more clearly observable if the population proportion is not close to zero -or conversely, it can be seen at the lower peak at zero (less IBD segments that match the Neandertal genome without Asians or European sharing). For Asians, the peak at 1 in Fig. S8 is very prominent, representing IBD segments that are shared exclusively among Asians. IBD segment sharing exclusively within one population is very common, as the blue peaks at 1 in Fig. S5 and Fig. S8 show. Figs. S7 and S10 show the same densities for Africans. Both the density of African proportions of Denisova-matching and the same density for Neandertal-matching IBD segments have two peaks: one at a low and one at a high proportion of Africans. For Neandertal-matching IBD segments, the density at low proportions of Africans is even larger than for high proportions. Thus, IBD segments that match archaic genomes are either shared by a very low or a very high proportion of Africans. The low proportion of African density peak hints at admixture of ancestors of modern humans and Denisovans / Neandertals outside of Africa. The density peak at high proportions of Africans may be be archaic DNA segments shared by hominid groups. IBD segments that match the "Archaic genome" are the IBD segments that match both the Denisova and Neandertal genome. Population proportion densities for the "Archaic genome" are presented in Figs. S11, S12, and S13 for Asians, Europeans, and Africans, respectively. For the "Archaic genome" we see the same figure as for the Neandertal and Denisova genome: the African density is bimodal that means either it is dominated by Africans or it contains no or only few Africans.
We found that Japanese share parts of the Denisovan genome. An example of an IBD segment 32 S12 Sharing of IBD Segments Between Populations and Genomes that is shared by Japanese and that matches the Denisovan genome can be found in Fig. S26. Further examples can be seen in Figures S32 and S33.
Asian proportion of IBD segment occurrence

S12 Sharing of IBD Segments Between Populations and Genomes
European proportion of IBD segment occurrence The lengths of an IBD segment are exponentially distributed with a mean of 100/(2g) cM (cen-tiMorgans), where g is the number of generations which separate the two haplotypes that share a segment from their common ancestor (21,26,27,28,34). We follow Ulgen and Li (35), who recommend to use a cM-to-Mbp ratio of 1. However, the recombination rate, the cM-to-Mbp ratio, varies from 0 to 9 along a chromosome (36). An average IBD segment length of 10 kbp corresponds to a most recent common ancestor 5,000 generations or 100 thousand years ago (kya) using 20 years per generation, and 20 kbp corresponds to 2,500 generations or 50 kya. Therefore, individuals sharing short IBD segments provide insight into the origin and migration of modern humans.
Age estimations of the IBD segments should be considered with care. Computing the time since the last common ancestor depends on many assumptions like the time span of a generation, the ratio of centiMorgans and megabases which we assumed to be 1, and the distribution of IBD lengths.

S13.2 Correction of the Length-Year Relation For Ancient Genomes
The genomes of the Neandertals and the Denisovans are older than present human genomes, therefore the assumptions that g generations separate the first haplotype and g the second haplotype from a common ancestor are wrong.
The Neandertal bones from which the DNA was extracted were dated to 38,310 and 44,450 years before present day by carbon-14 accelerator mass spectrometry (37). This gives an average of 40 kya, which corresponds to 2,000 generations. The corrected formula for the average IBD segment length would be 100/(2g − 2000) cM. The Denisovan finger bone from which the Denisovan DNA was sequenced is also dated back about 40 kya. Thus, 10 kbp correspond to 6,000 generations or to 120 kya and 20 kbp to 3,500 generations or 70 kya. The correction described above adds 1,000 generations or 20 kya as compared to the number if two present day genomes would have been compared, that is, if human IBD would be considered. Figure S14 show the histograms of IBD segment lengths for all IBD segments (human genome) and for IBD segments that match the Neandertal genome. For the human genome a peak is at 24,200 bp (41 kya), whereas, for the Neandertal genome, peaks are at 18,500 (74 kya) and 43,000 bp (43 kya). It can be seen that IBD segments that match the Neandertal genome are shorter, thus also older, though there is also an enrichment for longer IBD segments around 43,000 bp. Figure S15 show the histograms of IBD segment lengths for all IBD segments that match the Denisova and the "Archaic" genome ("Archaic genome" contains IBD segments that match both the Denisova and Neandertal genome). For the Denisova genome, we have peaks at 24,200 bp (61 kya) and 9,000 bp (131 kya), whereas, for the Archaic genome, peaks are at 11,200 bp (109 kya), 24,200 bp (61 kya), and 41,000 bp (44 kya). It can be seen that IBD segments that match the Neandertal genome are shorter, thus also older, though there is also an enrichment for longer IBD segments around 43,000 bp. Figure S16 shows the density of the lengths of IBD segments that are private to Asians vs. the density of the lengths of IBD segments that are private to Europeans. Since IBD segments are private to each population, the densities are based on disjoint IBD segment sets. Asians show a global peak at 25,800 bp (39 kya in the past from present), while Europeans show a global peak at 24,200 bp (41 kya). Europeans have also smaller peaks at 17,000 bp (59 kya) and 29,000 bp (34.5 kya).  Figure S17A shows the density of lengths of IBD segments that are private to Asians vs. lengths of IBD segments that are only shared between Asians and Africans. Figure S17B shows the same plot for Europeans instead of Asians. A small difference is visible in the global peaks of lengths distributions of IBD segments that are private to a population and those shared with  22,000 (45.5 kya) for Europeans. Interestingly, even though the IBD segments used in both plots are disjoint, African-sharing IBD segment lengths distributions agree in their major peak at about 22,000 bp (45.5 kya) and in a peak at 5,000 bp (200 kya). The global Asian peak is at 25,800 bp (red dashed line in Fig. S17A), while the global peak for Europeans is at 24,200 bp (red dashed line in Fig. S17B). Europeans in Fig. S17B have additional peaks at 16,500 bp and 28,900 bp. The peak at 5,000 bp is in a range where African-sharing IBD segment lengths are enriched. This range is from 3,000 bp (333 kya) to 10,000 bp (100 kya). We follow up on the 5,000 bp peak by investigating the lengths of IBD segments that are shared by Asians, Europeans, and Africans.  Figure S18A shows the density of lengths of IBD segments that are private to Asians vs. lengths of IBD segments that are shared between Asians, Europeans, and Africans. Figure S18B shows the same plot for Europeans instead of Asians. Of course, the peaks for Asians and Europeans are the same as in the previous plot. The peak at 5,000 bp (200 kya) is also visible for lengths of IBD segments that are shared by all populations and which marks the higher density range 3,000-10,000 bp (333-100 kya). IBD segments that are shared by all populations seem to be very old. Fu et al. (32) found that the SNVs shared between European Americans and African Americans date back 104,400 and 115,800 years, respectively. We would date our rare tagSNVs even further back. However, the estimated age of the IBD segments should be considered with care. The computation of the time since the last common ancestor depends on many assumptions like the length of a generation, the recombination rate (ratio between centiMorgans and megabases), and the distribution of IBD lengths.

S13.4 IBD Segment Lengths of Human Populations
Next we investigated the effect on the lengths distribution if IBD segments are removed that are shared by Africans. Figure S19A shows the density of lengths of IBD segments that are private to Asians vs. the density of IBD segment lengths shared by Asians and Europeans, but not by Africans. In Fig. S18B, the same plot as in Fig. S18A is shown, but now compared to IBD segments that are private to Asians. In Fig. S18A the density for IBD segments that are shared with Africans, has again a peak at 5,000 bp and high values in the range 3,000-10,000 bp (blue). In Fig. S18B, this peak vanishes, because IBD segments that are shared with Africans are removed.
A density peak at 50,000 bp (20 kya) becomes very prominent for the lengths of IBD segments that are shared between Europeans and Asians, but not with Africans. The density of lengths of IBD segments shared between Europeans and Asians (blue) is higher between 35,000 bp (28.5 kya) and 55,000 bp (18 kya) than all lengths densities of IBD segment that are shared by Africans .  Figure S19: Panel A: Density of lengths of IBD segments that are private to Asians vs. density of IBD segment lengths shared by Asians and Europeans but not by Africans that means, Africansharing IBD segments are removed as opposed to Fig. S18A. Panel B: The same plot as in panel A but, but IBD segments that are shared by Europeans and Asians are compared to IBD segments that are private to Europeans. African-sharing IBD segments are removed as opposed to Fig. S18B. Both panels show that removing IBD segments shared with Africans in turn remove the peak at 5,000 bp and the high density range 3,000-10,000 bp which were visible in Fig. S18. In panel B a peak at 50,000 bp (blue) becomes very prominent for lengths of IBD segments shared between Europeans and Asians (no Africans).

S13.5 Lengths of IBD Segments that Match the Denisova or Neandertal Genome
We investigated the age of IBD segments that match the Denisova or Neandertal genome. Toward this end, we consider the lengths distributions of IBD segments that match the Denisova or the Neandertal genome. S13.5.1 Lengths of IBD Segments that Match the Denisova Genome Figure S20A shows the density of the lengths of all IBD segments (human genome) vs. the density of the lengths of IBD segments that match the Denisovan genome. The human IBD segment lengths density has a peak at 24,200 bp (41 kya), while the lengths distribution of IBD with the Denisovan genome has a peak at 9,000 bp (131 kya). Clearly, segments of IBD with the Denisovan genome are shorter than among humans (red density above blue at the left hand side). We are interested in how different populations share the Denisovan genome. Figure S20B shows densities of lengths of IBD segments that match the Denisova genome and are enriched by a particular population. For Africans, the density peak is at 8,500 bp (137 kya). Asians have a clear enrichment of IBD lengths in the range 18,000-28,000 bp (75-56 kya). Europeans show three peaks at 18,000 (75 kya), 25,500 (59 kya), and 40,000 (45 kya). This density seems to reveal that there was a gene flow from Denisovans into the Asian genome outside Africa. However, also Europeans show some hints of introgression from the Denisovans after migration out of Africa. These peaks can be seen more clearly if only IBD segments are considered that are private to a population. Figure S21A shows densities of lengths of IBD segments that match the Denisova genome and are private to a population. For all populations together, the peak of the lengths density is at 8,500 bp (137 kya); for IBD segments private to Africans, it is at 9,700 (123 kya); 44 S13 Analyses of Lengths of IBD Segments for IBD segments private to Europeans, it is at 14,800 (87 kya); and for IBD segments private to Asians, it is at 27,000 (57 kya). The density of IBD lengths that match the Denisova genome and are private to Asians is also high around 40,000 bp (45 kya). Figure S21B shows densities of lengths of IBD segments that match the Denisova genome and are private to Africans vs. IBD segments that are not observed in Africans. The peak for Africans is at 10,000 bp (120 kya), while the density of lengths of IBD segments that are not observed in Africans has peaks at 20,000 (70 kya) and 28,000 bp (56 kya). Africans have older segments probably stemming from common ancestors of Denisovans and humans. For non-African populations, the high densities for longer IBD segments hint at an introgression from Denisovans after migration out of Africa. Parts of the Denisovan or archaic genomes may also have been introduced by Neandertals after migration out of Africa. Neandertals may have re-introduced parts of archaic genomes that were lost in humans or parts of the Denisovan genome stemming from introgression of one hominid group into another.  Figure S21: Panel A: Densities of lengths of IBD segments that match the Denisova genome and are private to a population. The peaks are at 8,500 for ll IBD segments matching the Denisova genome, 9,700 for Africans, 14,800 for Europeans, and 27,000 for Asians. Panel B: Densities of lengths of IBD segments that match the Denisova genome and are private to Africans vs. IBD segments that are not observed in Africans. The peak for Africans is at 10,000 bp, while IBD segment lengths that are not observed in Africans have peaks at 20,000 and 28,000 bp. Figure S22A shows the density of the lengths of all IBD segments (human genome) vs. the density of the lengths of IBD segments that match the Neandertal genome. The human IBD segment lengths density has a peak at 24,200 bp (41 kya), while the lengths distribution of IBD segments with the Neandertal genome (the part of the IBD segment that matches the Neandertal genome) has a global peak at 18,000 bp (75 kya) and a smaller peak at 43,000 bp (43 kya). Figure S22B shows densities of lengths of IBD segments that match the Neandertal genome and are enriched in a particular population. The peak of the lengths distribution for IBD segments that are shared by Africans is at 17,000 bp (79 kya). Asians have a density peak at 25,800 bp (59 kya) and Europeans have a peak at 24,000 bp (62 kya). Both the densities of lengths of IBD segments that are private to Europeans and that are private to Asians and match the Neandertal genome have a peak around 42,000 bp (44 kya). The density peak for Africans is clearly separated from the density peaks for Europeans and Asians which match each other. This hints to introgression from the Neandertals into anatomically modern humans that were the ancestors of Europeans and Asians after these humans left Africa. The higher density of short IBD segments which are prominent in Africans in the range 5,000-15,000 bp (220-87 kya) hint at old DNA segments that humans share with the Neandertal genome. Next we consider IBD segments that are private to a population. Figure S23A shows densities of lengths of IBD segments that match the Neandertal genome and are private to a population. The peaks are at 15,500 bp (84.5 kya) for Africans, at 19,500 bp (71 kya) for all humans, at 22,800 bp (64 kya) for Europeans, and at 23,800 bp (62 kya) for Asians. The densities of Asians and Europeans agree very well to each other. The peak close to 43,000 bp is no longer visible and seems to have been caused by IBD segments that are shared between Asians and Europeans. Therefore, we are interested in IBD segments that match the Neandertal genome and that are private to Africans and those which are not observed in Africans. Figure S23B shows densities of lengths of IBD segments that match the Neandertal genome and are private to Africans vs. IBD segments that are not observed in Africans. The peak for Africans is at 17,000 bp (79 kya), while IBD segments that are not observed in Africans have a global lengths density peak at 23,800 (62,000 years) and a smaller peak at 21,000 bp (68 kya). Interestingly, the peak near 43,000 bp (43 kya) is again visible. Thus, this peak is from IBD segments that are shared between Europeans and Asians. Most prominently, non-African IBD segments that match the Neandertal genome are enriched in the range 40,000-55,000 bp (45-38 kya).

S13.5.3 Lengths of IBD Segments that Match an Archaic Genome
IBD segments that match the "Archaic genome" are IBD segments that match both the Neandertal and the Denisova genome. The archaic genome segments stem either from a genome of archaic hominids which were ancestors of Neandertals and Denisovans or stem from introgression of one hominid group into another. Hammer et al. (38) write "The observation that populations from many parts of the world, including Africa, show evidence of introgression of archaic variants (6,16,19) suggests that genetic exchange between morphologically divergent forms may be a common feature of human evolution." Elizabeth Pennisi (39) writes: "The analyses paint a complex picture of mingling among ancient human groups, Pääbo reported. The data suggest inbreeding in Neandertals, a large Denisovan population, and mixing between Denisovans and an even earlier mystery species." Figure S24A shows the density of the lengths of all IBD segments (human genome) vs. the density of the lengths of IBD segments that match the Archaic genome. For the human genome, the global density peak is at 24,200 bp (61 kya). For IBD with the Archaic genome, the global density peak is at 11,200 bp (103 kya), but a smaller peak at 41,000 bp (103 kya) can be observed. Figure S24B shows densities of lengths of IBD segments that match the Archaic genome and are enriched in a particular population. IBD lengths density peaks can be seen at 12,000 bp (103 kya) S13 Analyses of Lengths of IBD Segments 47 for most populations, 22,000 bp (65.5 kya) for Europeans, and 24,400 bp (61 kya) for Asians. The global peak for Africans is not separated from the global peaks of non-Africans, as we observed in Neandertal-or Denisova-matching IBD segments. Interestingly, a smaller peak is visible for both Europeans and Asians at 42,000 bp (44 kya). This peak at 42,000 bp has already been observed in Neandertal-matching IBD segments. There may be Neandertal matching IBD segments that do not match the single Denisova genome that has been sequenced, but might match other Denisova genomes. Therefore, some Neandertal matching segments might be also old and would belong to the "Archaic genome" matching IBD segments. Next we consider IBD segments that are private to a population. Figure S25A shows densities of lengths of IBD segments that match the Archaic genome and are private to a population. The peaks are at 13,000 bp (97 kya) for humans (1000 Genomes Project), 13,600 bp (93.5 kya) for Europeans, 19,500 bp (71 kya) for Asians, and 21,000 (68 kya) for Africans. A peak appears at 46,000 bp (42 kya) for IBD segments that are private to Europeans. Figure S25B shows densities of lengths of IBD segments that match the Archaic genome and are private to Africans vs. IBD segments that are not observed in Africans. The peak for Africans is at 22,000 bp (65.5 kya), while lengths of IBD segments that are not observed in Africans have a global peak at 14,700 bp (88 kya). However, Africans have also a peak at 14,300 bp (90 kya), while IBD segments that are not observed in Africans have a peak at 21,000 (68 kya). These peaks match quite well on a global scale. Most prominently, non-African IBD segments that match the Archaic genome are enriched in the range 33,000-55,000 bp (50-38 kya). This enrichment seem to be caused by events after humans migrated out of Africa. The introgression from the Neandertal into ancestors of modern humans may also have introduced a part of the Denisovan genome that has been contained in the Neandertal genome. We would consider this part of the human genome as Archaic genome. Another explanation would be that introgression from the Neandertal into human has 48 S13 Analyses of Lengths of IBD Segments introduced archaic genome parts that were lost in the human population in Africa, but still match the Denisovan genome. In the main manuscript, we defined an IBD segment to match a particular archaic genome if the genome proportion is 30% or higher. Instead of 30%, we also tested thresholds of 40% and 50%, which resulted in odds ratios of 5.4 and 5.8 with p-values < 1.9e-272 and < 1.3e-252, respectively, for the dependency between the Asian population and the Denisova genome. Hence, we obtain the same figures with other thresholds for defining an IBD segment to match an archaic genome.

S14.2 Comparison Results Based on Mean Instead of the Median
We report the results of the comparisons using the mean instead of the median. The p-values are still based on the Wilcoxon rank sum test over the 100 experiments.   Columns labeled "mean" show the mean F1 score over 100 experiments. Columns labeled "p" provide the p-values of a Wilcoxon rank sum test over the 100 experiments with the null hypothesis that HapFABIA and another method yield the same value for the F1 score. The F1 score is the harmonic mean of precision (1 -FDR) and power and has an optimal value of 1 for perfect IBD detection. HapFABIA performs significantly better than all other methods on all data sets except artBMis for which the performance of PLINK is not significantly worse.  Columns labeled "m" show the mean F1 score over 100 experiments. Columns labeled "p" provide the p-values of a Wilcoxon rank sum test over the 100 experiments with the null hypothesis that HapFABIA and another method yield the same value for the F1 score. However, for simBlong, simDlong, simElong, and simFlong HapFABIA has a significantly higher F1 score than other methods.

58
S15 Examples of IBD Segments that Match Archaic Genomes S15 Examples of IBD Segments that Match Archaic Genomes Figure S26 shows a typical example of an IBD segment that matches the Denisova genome and is shared exclusively among Asians. Figure S27 shows an IBD segment that matches the Denisova genome and is shared by Africans and admixed Americans, and Fig. S28 shows an IBD segment that matches the Denisova genome and is shared across all populations. Furthermore, Fig. S28 reveals three phasing errors in the individuals where a yellow line starts from the left hand side: HG01271, NA12890, and NA19137. Figure S29 shows an example of an IBD segment that matches the Neandertal genome and is shared by Europeans and admixed American. Figure S30 shows an example of an IBD segment that matches the Neandertal genome and is shared by Europeans and Asians. Figure S31 shows an IBD segment that matches the Neandertal genome and is found exclusively in Africans. Another phasing error is visible in this figure in individual NA19317. Interestingly, IBD segments containing parts of the Denisova genome or the Neandertal genome are often shared by Africans. Figure S32 shows an example of an IBD segment that matches the Denisova genome and is shared by Japanese. Further it can be seen that only a part of the IBD segment matches the Denisova genome. Figure S33 shows a second example of an IBD segment that matches the Denisova genome and is shared by Japanese. Again only a part of the IBD segment matches the Denisova genome.  Figure S26: Example of an IBD segment matching the Denisova genome shared exclusively among Asians. The data analyzed by HapFABIA were phased genotypes from chromosome 1 of the 1000 Genomes Project. The rows give all chromosomes that contain the IBD segment and columns consecutive SNVs. If both chromosomes of an individual contain the IBD segment then two adjacent identical row labels are present. Major alleles are shown in yellow, minor alleles of tagSNVs in violet, and minor alleles of other SNVs in cyan. The row labeled "model L" indicates tagSNVs identified by HapFABIA in violet. The rows "Ancestor", "Neandertal", and "Denisova" show bases of the respective genomes in violet if they match the minor allele of the tagSNVs (in yellow otherwise). Neandertal tagSNV bases that are not called are shown in orange.  Figure S31: Example of an IBD segment matching the Neandertal genome shared exclusively among Africans. A phasing error is visible at individual NA19317. See Fig. S26 for a description.  Figure S32: Example 1 of an IBD segment that matches the Denisovan genome and is shared by Japanese. Only a part of the IBD segment matches the Denisovan genome. See Fig. S26 for a description.