Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects

As new proposals aim to sequence ever larger collections of humans, it is critical to have a quantitative framework to evaluate the statistical power of these projects. We developed a new algorithm, UnseenEst, and applied it to the exomes of 60,706 individuals to estimate the frequency distribution of all protein-coding variants, including rare variants that have not yet been observed in current cohorts. Our results quantify the number of new variants that we expect to identify as sequencing cohorts reach hundreds of thousands of individuals. We find that with 500K individuals, we expect to capture 7.5% of all possible loss-of-function variants and 12% of all possible missense variants. We also estimate that 2,900 genes have loss-of-function frequency of <0.00001 in healthy humans, consistent with very strong intolerance to gene inactivation.


Supplementary Figure 5. Predicted number of unique variants in cohorts of size up to 500K individuals with the same demographic distribution as the ExAC dataset.
The x-axis indicates the number of individuals in the cohort and the y-axis indicates the fraction of possible variants that we expect to observe in a cohort of that size. We trained UnseenEst on the full ExAC dataset and made predictions for synonymous (grey), missense (orange) and loss-of-function (brown) variants.

Figure 6. Predicted number of unique missense variants in gene families.
We trained the model on the cohort that matches U.S. demographics and predicted the fraction of possible missense variants in each gene family that we can expect to observe in cohorts of size up to 500K individuals. (a) Recessive genes (red) and dominant genes (blue). (b) All genes (red) and genes with cerebral-specific expression (blue). (c) Genes associated with GWAS loci (red) and drug target genes (blue).

Figure 7. Validation of the estimated number of genes with at least 10 LoF alleles.
We trained UnseenEst on random subsamples of 10% of the alleles in the U.S. Census-matched cohort and applied it to estimate the number of genes with at least 10 LoF alleles in the entire cohort. The red curve is the actual number of genes with at least 10 LoF alleles and the blue curve is the average of the predictions over the different subsamples. The shaded blue region corresponds to one standard deviation of the predictions.

Supplementary Figure 8. Discovery rate of LoF genes in non-Finnish Europeans.
Estimated number of genes with at least 10 LoF alleles in non-Finnish Europeans as a function of the sample size. The number of genes with at least 10 LoF alleles saturates around 16K genes, in agreement with the saturation level of LoF genes in the U.S. Census-matched population (Figure 1d).

Supplementary Tables
Allele counts            0-10            10-100          100-1000   >1000
All variants (ExAC)      6.55M           0.57M           0.17M      109,422
All variants (predicted) 6.59M (0.57M)   0.55M (0.09M)   0.18M

Supplementary Note 1: Overview

We begin by formalizing the model in which we are working, and describe the sense in which our algorithm recovers the distribution of variant frequencies. The core of our approach is a linear programming (LP) based algorithm, and we discuss the intuition behind this method. We then establish the performance guarantees of our algorithm, proving that, with high probability, it recovers an accurate estimate of the true frequency distribution and yields accurate predictions for the number of new variants that will be observed in larger samples. We also discuss in detail related approaches for estimating statistical properties of frequency distributions, based on other linear programming, jackknife, and Bayesian methods, in Section 5.
The model. Let S denote a particular variant class of interest. For example, S can correspond to all possible missense mutations in a gene family. Each possible variant s ∈ S is associated with a probability p_s, which is the probability that an allele contains s. We model all the alleles as independent and all variants as independent. Hence the p_s's are the parameters of independent Bernoulli random variables. When we sample an allele, we obtain an independent draw from the Bernoulli at each s ∈ S. In a sample of k alleles, the frequency of observing variant s is distributed according to bin(p_s, k). We associate with P = {p_s} its histogram h_P, where h_P(x) is the number of variants s with p_s = x. The histogram represents all of the information of P except for the labels of the variants.
In this work, we are interested in accurately recovering the histogram h_P. For the purpose of estimating any property of the p_s's that does not depend on the specific labels of the variants themselves, the histogram h_P contains all of the useful information. Such properties are referred to as symmetric, as they are unaffected by 'renaming' the variants.
The following examples illustrate several interesting symmetric properties:
Examples:
• The total number of variants that occur with probability more than c is a symmetric property, and is given by $\sum_{x > c:\, h(x) > 0} h(x)$.
• The expected number of unique variants that will be observed in a sample of k alleles is a symmetric property, and is given by $\sum_x h(x)\left(1 - (1 - x)^k\right)$.
• The expected number of unique variants that will be observed more than 10 times in a sample of k alleles is a symmetric property, and is given by $\sum_x h(x) \sum_{i > 10} \mathrm{bin}(x, k, i)$.
Because our goal is to recover an accurate approximation of the histogram h_P, it will be useful to define a metric on histograms to provide a concrete notion of what it means for two histograms to be "similar".
Definition 1.2. Given two histograms, g and h, assume without loss of generality that $\sum_x x \cdot g(x) \le \sum_x x \cdot h(x)$. The generalized relative earthmover distance between them, denoted R(g, h), is the difference in total mass, $\sum_x x \cdot h(x) - \sum_x x \cdot g(x)$, plus the minimum over all schemes of moving the mass of histogram g to yield a histogram h′ obtained from h by removing mass equal to this difference, where
• the cost, per unit mass, of moving from probability value x to probability value y is $\left|\log \frac{x}{y}\right|$.
Note that the amount of mass in histogram g at probability value x is given by x · g(x).
The generalized relative earthmover distance allows for comparisons of histograms with different total masses, which is necessary since the histogram inferred from data will typically have a slightly different mass from the true distribution. Intuitively, the relative earthmover distance also highlights the importance of estimating the rare variants well: mistaking variants of frequency 10^{-5} for frequency 10^{-6} incurs a substantial cost. The other main reason for using the relative earthmover distance is that many properties of interest are Lipschitz continuous with respect to it: if two histograms are close in relative earthmover distance, then they have similar property values. In particular, if we guarantee that, with high probability, our algorithm recovers an estimate of the underlying histogram that is accurate in relative earthmover distance, then the estimates of properties that we obtain from the recovered histogram will be accurate.
The following proposition, whose proof is given in Section 6.3, illustrates this point, and shows that if two histograms are close in relative earthmover distance, then the expected number of variants that will be observed in any given sized sample will be correspondingly similar.

Proposition 1.4. Let V(h, k) denote the expected number of unique variants observed in a sample of k alleles drawn from a distribution with histogram h. Then for any two sets of probabilities P and Q and any k,
$\left|V(h_P, k) - V(h_Q, k)\right| \le O(k) \cdot R(h_P, h_Q)$,
where R(h_P, h_Q) is the generalized relative earthmover distance between the histograms corresponding to P and Q.
In analogy to the histogram h P giving us a label-less representation of the true underlying P = {p s }, it is convenient to have a label-less representation of the observed variant counts from a sample of alleles. To this end, we define the fingerprint of the observed variants, which is also known as the site frequency spectrum (SFS) in genetics, or the "pattern" of the sample in some statistics contexts.
Definition 1.5. Given sample X of k alleles, the associated fingerprint, F = (F 1 , F 2 , ...) is the "histogram of the histogram" of X. Formally, F is the vector whose ith component, F i , is the number of variants in S that occur exactly i times in sample X.
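As an illustration, the fingerprint of Definition 1.5 can be computed directly from per-variant observation counts. The function name and toy counts below are ours, for illustration only:

```python
from collections import Counter

def fingerprint(variant_counts):
    """Compute the fingerprint (site frequency spectrum) F, where F[i-1] is
    the number of variants observed exactly i times in the sample."""
    tallies = Counter(c for c in variant_counts if c > 0)
    max_count = max(tallies, default=0)
    return [tallies.get(i, 0) for i in range(1, max_count + 1)]

# Toy example: per-variant observation counts in a sample of alleles.
counts = [1, 1, 1, 2, 2, 5, 0, 0]  # three singletons, two doubletons, one variant seen 5x
print(fingerprint(counts))  # [3, 2, 0, 0, 1]
```

Unobserved variants (count 0) do not contribute to any fingerprint entry, which is exactly why the number of unseen variants must be inferred rather than read off.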
Remarks on the model. Our model assumes that all the variants are independent random variables. Population demography and linkage disequilibrium introduce correlations, especially between the common genetic variants. For the common variants, UnseenEst uses the empirical frequency to accurately estimate the true population frequency. For the very rare variants, whose frequencies UnseenEst estimates using the independence assumption, this assumption is a better approximation of the real data.
While the discussion here focuses on estimating the histogram of genetic variation, UnseenEst is a general approach for estimating the histogram and statistical properties of any finite list of probabilities {p_1, ..., p_n} from independent Bernoulli samples, and can have broad applications beyond genetics. Note that $\sum_s p_s$ can be significantly smaller or larger than 1.

Supplementary Note 2: The UnseenEst Algorithm
We partition the variants into two classes: common variants and rare variants. In our applications, variants with empirical allele frequency above 1% are defined to be common. With current cohorts of tens of thousands of alleles, we are likely to have observed all the common variants, and the empirical allele frequencies of the common variants should be very close to the true population frequencies. Therefore we focus the efforts of the algorithm on estimating the frequencies of rare variants.
Given a sample of k alleles and the associated fingerprint F, we truncate the fingerprint to only the rare variants with frequency less than 1%, i.e. we consider $\{F_i : i/k \le 0.01\}$. For the common variants with frequency above 1%, we simply use their empirical frequency as an estimator of the true frequency. On the truncated fingerprint, we solve the following linear program for variables corresponding to h(x), x ∈ X, for a finite mesh of probabilities X.
Algorithm UnseenEst.
Input:
• Fingerprint F from k alleles; $m \equiv \sum_i i \cdot F_i$ is the total number of variants observed.
• n = upper bound on the number of possible variants.
Solve for h(x), x ∈ X, to minimize the objective function
$\sum_i \frac{1}{\sqrt{1 + F_i}} \left| F_i - \sum_{x \in X} h(x) \cdot \mathrm{bin}(x, k, i) \right|$
subject to
$h(x) \ge 0$ for all $x \in X$; $\sum_{x \in X} x \cdot h(x) = m/k$; $\sum_{x \in X} h(x) \le n$.
For a given histogram h, $\sum_{x \in X} h(x) \cdot \mathrm{bin}(x, k, i)$ is the expected number of variants observed i times in k alleles. The objective function of the LP captures how much this expected number of variants deviates from the empirical number of variants observed i times (represented by the fingerprint entries F_i). Each deviation is normalized by an estimate of its standard deviation. The constraints enforce that h(x) ≥ 0, namely that there cannot be a negative number of variants that arise with a given probability, and that the total sum of the probabilities matches the empirical estimate of the sum of the probabilities. Note that m/k is a very accurate estimator of $\sum_s p_s$ for large k. In practice, we found it sufficient to use a probability mesh X with geometrically increasing probabilities at rate α = 1.05. Using a smaller α can marginally improve accuracy at the cost of run-time. UnseenEst is very efficient; on the ExAC dataset (described below), the computation took less than 10 seconds on a standard laptop.
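As a concrete sketch of the quantities entering the LP, the snippet below computes the expected fingerprint $\sum_{x} h(x)\,\mathrm{bin}(x,k,i)$ and a normalized-deviation objective for a candidate histogram. The $1/\sqrt{1+F_i}$ normalization is our assumption standing in for "the standard deviation" mentioned in the text; the function names and toy numbers are illustrative:

```python
from math import comb

def bin_pmf(x, k, i):
    """bin(x, k, i): probability that a variant of frequency x is observed
    exactly i times among k alleles."""
    return comb(k, i) * x**i * (1 - x) ** (k - i)

def expected_fingerprint(h, k, num_entries):
    """E[F_i] = sum_x h(x) * bin(x, k, i) for i = 1..num_entries;
    h maps probability values x to variant counts h(x)."""
    return [sum(hx * bin_pmf(x, k, i) for x, hx in h.items())
            for i in range(1, num_entries + 1)]

def lp_objective(h, F, k):
    """Deviation between expected and observed fingerprints, each term
    normalized by sqrt(1 + F_i) (our stand-in for the standard deviation)."""
    EF = expected_fingerprint(h, k, len(F))
    return sum(abs(F_i - EF_i) / (1 + F_i) ** 0.5 for F_i, EF_i in zip(F, EF))

# Toy candidate histogram: 50 variants at frequency 1e-3, 5 at 1e-2.
h = {1e-3: 50, 1e-2: 5}
print(expected_fingerprint(h, k=1000, num_entries=3))
```

The LP then searches over histograms h supported on the mesh X to minimize this objective; any LP solver handles the absolute values via standard slack-variable linearization.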
Estimating the number of unique variants. Given the estimated histogram h produced by UnseenEst, the expected number of unique variants in a sample of j alleles is $V(h, j) = \sum_x h(x)\left(1 - (1 - x)^j\right)$.

Confidence interval for prediction. We quantify the uncertainty around the prediction V(h, j) using bootstrapping as follows. Suppose histogram h(F, k) is the output of UnseenEst trained on a sample of k alleles. We use binomial sampling from h(F, k) to generate N_s synthetic cohorts, each of k alleles, with the corresponding fingerprints $\{F^1, ..., F^{N_s}\}$. Then we apply UnseenEst to each of these fingerprints to generate histograms $\{h_1, ..., h_{N_s}\}$. These are what the estimated histogram could be if we had a different sample of k alleles from the population, assuming the frequency distribution is h. The uncertainty of V(h, j) is quantified as the standard deviation of the estimates $\{V(h_1, j), ..., V(h_{N_s}, j)\}$.
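The binomial resampling step of the bootstrap can be sketched as follows; `synthetic_fingerprint` is our illustrative name, and in the full procedure each synthetic fingerprint would then be refit with UnseenEst before evaluating V(h, j):

```python
import random

def synthetic_fingerprint(h, k, rng):
    """Draw one synthetic cohort of k alleles from histogram h: each of the
    h[p] variants at frequency p is observed Binomial(k, p) times.
    Returns the fingerprint of the synthetic cohort."""
    counts = []
    for p, n_variants in h.items():
        for _ in range(round(n_variants)):
            counts.append(sum(rng.random() < p for _ in range(k)))
    max_c = max((c for c in counts if c > 0), default=0)
    return [sum(1 for c in counts if c == i) for i in range(1, max_c + 1)]

rng = random.Random(0)
boots = [synthetic_fingerprint({1e-2: 30}, k=200, rng=rng) for _ in range(5)]
```

The standard deviation of the discovery predictions across the refit histograms then gives the reported uncertainty interval.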

Performance Guarantees
For the performance guarantees, we analyze the slightly modified linear program, UnseenEst2, below. The core algorithm of UnseenEst2 is identical to UnseenEst. The only differences are that UnseenEst2 uses a finer grid for the probability values, X, and it uses a different cutoff frequency threshold for what are considered to be the common variants. These parameter settings simplify the theoretical analysis of the algorithm. Empirically, we find that using the probability grid and frequency threshold as in UnseenEst gives very similar results compared to UnseenEst2 and is faster.
To simplify the notation, we fix constants B, C and D. Given as input an untruncated fingerprint F of m total variants generated from k alleles, the linear program algorithm is:
Algorithm UnseenEst2.
Input:
• Fingerprint F from k alleles; $m \equiv \sum_i i \cdot F_i$ is the total number of variants observed.
• n = upper bound on the number of possible variants.
Solve for h(x), x ∈ X, to minimize the same objective function as in UnseenEst, subject to the same constraints. The UnseenEst2 algorithm satisfies the following guarantee.
Theorem 2.1. Let n be the support size (the number of possible variants), k be the number of alleles sequenced, and P = {p_s} denote the true distribution of the variant frequencies, with $\bar{p} = \sum_s p_s$ the expected number of variants per allele. For sufficiently large n, with probability at least $1 - e^{-(k\bar{p})^{\Omega(1)}}$, the algorithm will return a histogram g satisfying
$R(g, h_P) \le O(\delta)$, where $\delta = \sqrt{\frac{n}{(k\bar{p}) \log(k\bar{p})}}$
and the 'O' notation hides an absolute constant.
Theorem 2.1 demonstrates that UnseenEst2 is asymptotically consistent. The interesting regime in which to interpret the bound of Theorem 2.1 is when, for a fixed n, δ < 1. As the number of sequenced samples, k, increases, the error bound δ decreases to 0, implying that the estimated frequency distribution, g, becomes arbitrarily close to the true distribution, h_P. Moreover, UnseenEst2 is efficient in leveraging the information in the samples: as k increases, the error bound decreases rapidly as $1/\sqrt{(k\bar{p}) \log(k\bar{p})}$. Here $\bar{p}$ can be interpreted as the average number of variants per allele, and $k\bar{p}$ is the total number of variants we expect to observe in the cohort.
The above theorem, together with Proposition 1.4, implies the following corollary. Let n be the support size (the number of possible variants), k be the number of alleles sequenced, and $\bar{p} = \sum_s p_s$ the expected number of variants per allele. Given a sample of k alleles, with probability at least $1 - e^{-(k\bar{p})^{\Omega(1)}}$, the algorithm estimates the expected number of unique variants that will be observed in a sample of k′ alleles to within additive error $O(k' \bar{p}\, \delta)$. One interpretation of the above corollary is that the estimate of the expected number of unique variants will be accurate, relative to the total expected number of observed variants, $k'\bar{p}$, provided $n < (k\bar{p}) \log(k\bar{p})$.

Supplementary Note 3: Datasets
We used the exome sequencing data from the Exome Aggregation Consortium (ExAC) [1]. This dataset consists of high-quality sequencing of the protein-coding regions in the genome (exomes) from 60,706 healthy individuals. Consistent with the ExAC analysis, we considered only regions of the exome with sufficient sequencing depth: each nucleotide must be covered by at least 10 reads in at least 80% of all ExAC individuals.
Loss-of-function (LoF) variants. We define LoF variants to be single-nucleotide substitutions that introduce a stop codon in the reading frame or disrupt a splice donor or acceptor site. We do not include insertions/deletions (indels) in the class of LoF variants. Variant annotation was performed using the Variant Effect Predictor (VEP) v81 on Gencode v19 and genome build GRCh37. LoF annotation was performed using the LOFTEE plugin to VEP (version 0.2; available at https://github.com/konradjk/loftee). While early stop codons and splice donor/acceptor disruptions often lead to truncated proteins, this does not imply that the protein has lost all of its function. Our annotation of LoF variants does not explicitly assess protein function and hence serves only as a proxy for the true deleteriousness of the variant.
Upper bound on the number of possible variants. A natural way to interpret the discovery rate of a given variant class is to calculate, among all possible variants in this class, what fraction of them we expect to observe at a given sample size. To estimate an upper bound for the total number of possible variants in each class, we first identified all the nucleotides for which we have sufficient read coverage (at least 10 reads in at least 80% of all ExAC individuals). Then, at each well-covered nucleotide, we identified the number of possible variants that belong to a given class (Supplementary Table 3).
For each variant class, we divided the number of unique variants we expect to identify by this upper bound to obtain the fraction of possible variants observed at a given cohort size. As technology improves in future sequencing projects, we expect the well-covered regions of the exome to increase and hence the number of identified variants to also increase.
LoF genes. We used the same set of 18,225 genes as in the ExAC analysis [1]. Briefly, we summed all exon-level variant counts across Gencode v19 canonical transcripts. If an exon had a median depth < 1, the variant counts for that exon were not included in the total for the transcript. We then removed all transcripts in which no variants were observed. We also removed outliers whose observed synonymous and missense counts deviated significantly from the expected counts. This left 18,225 genes for which ExAC had high-quality data.
We associated with each gene, g, a Bernoulli random variable with probability p g , which corresponds to the probability that an allele of the gene contains at least one LoF variant as defined above or at least one insertion-deletion (indel) that disrupts the reading frame. The presence of such a LoF variant or indel is a proxy for true loss-of-function and does not necessarily mean that the gene is entirely non-functional on that allele. For example, if the LoF variant introduces a stop codon near the 3' end of the gene, then the corresponding truncated protein may still retain some functions.
UnseenEst can be applied to estimate the histogram of any set of probabilities {p_g}, and hence it directly applies in this setting. On the U.S. Census-matched cohort, we assign a gene the value 1 on an allele if that allele carries at least one LoF variant or frame-shift indel in the gene; otherwise it is assigned 0. The fingerprint F_i here corresponds to the number of genes that are LoF in exactly i alleles. We trained UnseenEst on this gene-level fingerprint.
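The gene-level fingerprint described above can be computed from a 0/1 allele-by-gene LoF indicator matrix; the function and toy matrix below are illustrative sketches, not code from the paper:

```python
def gene_lof_fingerprint(lof_matrix):
    """Given a 0/1 matrix lof_matrix[allele][gene] indicating whether the
    allele carries at least one LoF variant or frame-shift indel in the gene,
    return the gene-level fingerprint: F[i-1] = number of genes that are
    LoF in exactly i alleles."""
    n_genes = len(lof_matrix[0])
    gene_counts = [sum(row[g] for row in lof_matrix) for g in range(n_genes)]
    max_c = max((c for c in gene_counts if c > 0), default=0)
    return [sum(1 for c in gene_counts if c == i) for i in range(1, max_c + 1)]

# Toy cohort: 4 alleles x 3 genes.
M = [[1, 0, 0],
     [1, 1, 0],
     [0, 1, 0],
     [1, 0, 0]]
print(gene_lof_fingerprint(M))  # gene LoF counts are [3, 2, 0] -> F = [0, 1, 1]
```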
Gene lists. We describe the curation of the various gene lists below.
• GWAS genes: 2801 genes that are the closest 3' and 5' genes to GWAS hits in the NHGRI GWAS catalog as of February 9, 2015.
• Drug target genes: 460 genes whose protein products are known to be the mechanistic targets of drugs; curated from [4] [5].
• Genes with cerebral specific expression: 979 genes with cerebral specific expression downloaded from [6].

Supplementary Note 4: Validation experiments
We performed multiple experiments to validate the prediction accuracy of UnseenEst.
Accuracy of allele frequency estimation. For each class of variant (synonymous, missense, LoF, CpG) we randomly partitioned all the ExAC alleles into ten groups. We trained UnseenEst on the site frequency spectrum of one partition (i.e. 10% of the alleles) and used the model to predict the allele frequency distribution of the entire ExAC cohort. We grouped variants into 4 frequency bins: 1) variants that occur in 0-10 alleles; 2) variants that occur in 11-100 alleles; 3) variants that occur in 101-1000 alleles; and 4) variants that occur in more than 1000 alleles. We repeated this procedure for each of the ten random partitions and computed the average and the standard deviation for the number of variants predicted to belong to each bin. These estimates are compared with the observed number of variants in each bin in ExAC.
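The binning step of this validation can be sketched as follows (illustrative helper, not the paper's code):

```python
def bin_variants(allele_counts):
    """Group variants into the four allele-count bins used for validation."""
    bins = {"0-10": 0, "11-100": 0, "101-1000": 0, ">1000": 0}
    for c in allele_counts:
        if c <= 10:
            bins["0-10"] += 1
        elif c <= 100:
            bins["11-100"] += 1
        elif c <= 1000:
            bins["101-1000"] += 1
        else:
            bins[">1000"] += 1
    return bins

# Toy allele counts for five variants.
print(bin_variants([3, 15, 700, 5000, 2]))
# {'0-10': 2, '11-100': 1, '101-1000': 1, '>1000': 1}
```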
Accuracy of the estimated number of unique variants. For each variant class, we randomly sampled 10% of the alleles and applied UnseenEst to the SFS of this subsample to estimate the histogram ĥ of the variant frequencies. For any positive integer k, the number of unique variants we expect to see in k alleles is $\sum_p \hat{h}(p)\left(1 - (1 - p)^k\right)$. As before, we computed the average and the standard deviation of the estimates across the different 10% subsamples. To produce the 'true' discovery rate, we created a random ordering of all the ExAC alleles. Then, for each k less than the ExAC cohort size, we counted the number of unique variants observed in the first k alleles.
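A minimal sketch of the predicted and 'true' discovery curves, following the formula above; `predicted_discovery`, `true_discovery`, and the toy histogram are our illustrative names and values:

```python
def predicted_discovery(h, k):
    """V(h, k) = sum_p h(p) * (1 - (1 - p)**k): expected number of unique
    variants observed in k alleles under frequency histogram h."""
    return sum(hp * (1 - (1 - p) ** k) for p, hp in h.items())

def true_discovery(alleles, k):
    """'True' discovery rate: number of unique variants among the first k
    alleles of a (pre-shuffled) cohort, where alleles[i] is the set of
    variants carried by allele i."""
    seen = set()
    for allele in alleles[:k]:
        seen.update(allele)
    return len(seen)

# Toy estimated histogram: 1000 very rare variants and 10 moderately rare ones.
h = {1e-4: 1000, 1e-2: 10}
for k in (100, 1000, 10000):
    print(k, round(predicted_discovery(h, k), 1))
```

Plotting the predicted curve against the counts from the shuffled cohort reproduces the validation comparison described in the text.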
Accuracy of gene LoF frequency. We randomly partitioned the alleles into ten subsets. For each subset with 10% of the alleles, we generated the gene-level LoF fingerprint from this subsample. We trained UnseenEst on this subsampled fingerprint and compared the predicted number of genes with at least 10 LoF alleles with the observed number in the entire ExAC data. The means and standard deviations of the predictions were computed from the 10 different partitions.

Supplementary Note 5: Related works
While our algorithm and analysis are closely related to the approach in [7][8], there are important differences in the model. In [7], there is an unknown discrete distribution P on n elements and we have k independent samples from P. This model was motivated by the classic problem of estimating the vocabulary size of Shakespeare from a sample of his works [9][10]. The discrete distribution setting can be reformulated by associating with each element s an independent Poisson random variable poi(p_s), where p_s is the weight of P at s. Here, unlike in our model, $\sum_s p_s = 1$. The number of times that s appears in k samples is distributed according to poi(k · p_s). In our genetics model, the number of times that a variant s appears in k alleles is distributed according to bin(p_s, k) (a related linear program proposed in [11] uses hypergeometric distributions to model the variant frequencies). While poi(k · p_s) and bin(p_s, k) both have expectation k · p_s, the Poisson has a slightly larger variance. Because the number of elements n is potentially very large, this difference between Poisson and binomial aggregates over all the elements and can give rise to substantial differences in the expected fingerprints between the two models.
Harmonic jackknife estimators have been used to estimate discovery rates [12,4]. Unlike UnseenEst, the harmonic jackknife does not estimate the frequency distribution of variants. Instead it seeks to directly estimate the number of variants that are missed in the samples observed so far by positing a specific parametric form for the number of missing variants. Let V(n) denote the number of distinct variants we expect to find in a cohort of n samples. For n_2 > n_1, the harmonic jackknife of order p assumes that
$V(n_2) = V(n_1) + \sum_{i=1}^{p} a_i F_i$,
where the F_i are the entries of the site frequency spectrum of the n_1 samples. The constants a_i are given in [12] and, in practice, the 3rd order harmonic jackknife is used. In general, there is no guarantee that the actual discovery rate of variants has the parametric form above, and this model mismatch can lead to biased predictions.
We tested the harmonic jackknife on the ExAC data in exactly the same way that we validated UnseenEst. For each ancestry population, we applied the jackknife estimator to a random sampling of 10% of the alleles and compared the predicted discovery rate with the true discovery rate for the entire ExAC cohort (Supp. Figure 3). The number of distinct variants predicted by the harmonic jackknife is consistently lower than the ground truth in all ancestry groups. At the actual cohort size, the harmonic jackknife on average underestimated the true number of distinct variants by 3.6%. In comparison, UnseenEst was unbiased and the average difference between its prediction and the ground truth was 0.15%. In addition, the uncertainty interval of the jackknife is very narrow (too narrow to be visible in Supp. Figure 3) and does not accurately capture the real uncertainty in the predictions.
Recently, [11] also proposed using a linear program to estimate the discovery rate of new variants. They solve two linear programs, with hypergeometric coefficients, to estimate the upper and lower bounds on the number of unique variants at a given sample size that are consistent with the observed site frequency spectrum. Under the infinite genome assumption (i.e. there are infinitely many possible variants), [11] showed that there exist solutions to these two linear programs. The approach of [11] tries to identify the range of the number of unique variants that is consistent with the observed data, though it does not guarantee how wide this interval is or whether it concentrates around the true value in general. Our linear program is guaranteed to produce a histogram that is close to the true frequency histogram. Moreover, our analysis makes explicit the dependence on the sample size k and the frequency distribution {p_s}, which was not present in [11].
Bayesian approaches have also been applied to estimate the number of unseen variants [10] [13]. A limitation of this approach is that it requires parametric forms for the distribution of variant frequencies, which corresponds to assumptions on the underlying demographic process and selection. For example, in [10], the mutation probabilities of the variants are assumed to be i.i.d. samples from a Beta(a, b) prior, where the hyperparameters a, b are fitted from data. The Beta prior is a reasonable model for neutrally evolving variants but may not be appropriate for deleterious mutations. The advantage of UnseenEst is that it does not require any modeling assumption about selective pressure and demographic history, i.e. it is non-parametric. Theorem 2.1 applies in all settings where the independence assumption is a reasonable approximation.
How to design sequencing studies to optimize variant discovery was analyzed in [13]. This study was limited to 500-kilobase regions in the genomes of 39 ENCODE individuals. Consistent with our results, [13] showed that adjusting the distribution of the cohort over different ancestries can increase the number of variants discovered.

Supplementary Note 6: Proofs of the Guarantees
The proof of Theorem 2.1 for UnseenEst2 has three main components. First we show that, given a sample of k alleles from the above model, with high probability the empirical fingerprint entries F_i are close to their expected values $\sum_{p_s} h(p_s) \cdot \mathrm{bin}(p_s, k, i)$. Such a sample of k alleles is what we call a faithful sample. Next we show that, given a faithful sample, the histogram of the true distribution, h(p), rounded so as to be supported on the set X of discrete probability values, is a point in the plausible region of the linear program in UnseenEst2. Intuitively, the plausible region captures all the histograms that could plausibly generate the observed SFS. The last component of the proof argues that any two points in the plausible region must be close in generalized relative earthmover distance. This completes the proof because the solution returned by the linear program in UnseenEst2 is in the plausible region and hence must be close in relative earthmover distance to the rounded true histogram, which is in turn close to the true histogram.
The proof of Theorem 2.1 follows the steps of the proof of Theorem 2 in [7]. We have to replace calculations involving Poisson distributions with Binomials in the appropriate places. We also have to rescale all the earthmoving costs by p s . We provide explicit analysis where our proof differs from that of [7]; otherwise we refer to the appropriate part of [7] when the calculations are identical.

Faithful samples
Definition 6.1. A sample of k alleles with fingerprint F, drawn from a set P = {p_s} of probabilities with histogram h and sum $t = \sum_s p_s$, is said to be faithful if the following conditions hold:
• The total number of observed variant occurrences, $m = \sum_i i \cdot F_i$, differs from its expectation kt by at most $(kt)^{0.5+D}$.
• For all i, $|F_i - E[F_i]| \le \max\left(E[F_i]^{0.5+D}, (kt)^{B(0.5+D)}\right)$.
• For all possible variants s ∈ S, letting p_s denote the true probability of s, the number of times s occurs in the sample from P differs from its expectation $k \cdot p_s$ by at most $\max\left((kp_s)^{0.5+D}, (kt)^{B(0.5+D)}\right)$.

Lemma 6.2 (Analogous to Lemma 11 in [7]). There is a constant γ > 0 such that for a sufficiently large number of alleles, k, the empirical distribution is faithful with probability at least $1 - e^{-(kt)^{\gamma}}$, where $t = \sum_s p_s$.
Proof. The first condition follows from the Hoeffding bound with high probability. In our model, $E[F_i] = \sum_{p_s} h(p_s) \cdot \mathrm{bin}(p_s, k, i)$. Each fingerprint entry F_i is the sum of independent binary variables, indicating whether each variant occurred exactly i times in the sample; hence Chernoff bounds apply. The analysis showing that the second condition is satisfied is the same as in the proof of Lemma 11 in [7]. We include it here for completeness.
The analysis of the second condition is split into two cases, according to whether $E[F_i] \ge (kt)^B$. If $E[F_i] \ge (kt)^B$, a Chernoff bound gives $\Pr\left[|F_i - E[F_i]| > E[F_i]^{0.5+D}\right] \le 2e^{-E[F_i]^{2D}/3} \le 2e^{-(kt)^{2BD}/3}$. If $E[F_i] < (kt)^B$, the allowed deviation is $(kt)^{B(0.5+D)}$, and hence this quantity is bounded by setting $E[F_i] = (kt)^B$. A union bound over the first 2kt fingerprint entries shows that the probability that a sample of k alleles violates this condition is at most $2kt \cdot 2e^{-(kt)^{2BD}/3} \le e^{-(kt)^{\Omega(1)}}$. Note that the probability that there are more than 2kt nonzero fingerprint entries is similarly bounded, as the probability that a variant is observed more than 2kt times is inverse exponential in kt.
For the third condition, we want to show that for all variants s, the number of times that s is observed in k alleles differs from its expectation $p_s k$ by at most $\max\left((kp_s)^{0.5+D}, (kt)^{B(0.5+D)}\right)$. The analysis again splits into two cases, depending on whether $p_s k \ge (kt)^B$, and follows from the same Chernoff bound as before, replacing F_i by the number of times s occurs in the sample and replacing $E[F_i]$ by $p_s k$.

Definition 6.3. Given a fingerprint F, an upper bound on the support size n, $m = \sum_i i \cdot F_i$, and a finite set of probability values X, the plausible region is the set of histograms h supported on X whose expected fingerprints are close to the observed fingerprint F and which satisfy $h(x) \ge 0$ for all $x \in X$, $\sum_{x \in X} x \cdot h(x) = m/k$, and $\sum_{x \in X} h(x) \le n$. As the name suggests, the plausible region is the set of histograms that can plausibly generate the observed fingerprint F. The last three requirements of plausibility are the same as the LP constraints in UnseenEst2.
The following lemma shows that, given a faithful sample of k alleles, the corresponding plausible region has a point that is extremely close to the histogram of the true distribution.
Lemma 6.4 (Analogous to Lemma 12 of [7]). For sufficiently large k, and $n < m^{2+B/2}/k$: given a distribution of support size at most n and a faithful sample of k alleles with fingerprint F, the plausible region has a point v′ such that the histogram $h_{v'}$ is close to the true histogram h in generalized relative earthmover distance, where $h_{v'}$ is obtained from v′ by appending the empirical fingerprint entries F_i for $i \ge m^B + 2m^C$.
Proof. The idea of the proof is to show that, provided the sample is faithful, the true histogram h can be minimally modified into a plausible point v′. We construct v′ by taking the portion of h with probabilities at most $\frac{m^B + m^C}{m}$ and rounding the support of h to the closest multiple of $1/m^2$, so as to be supported at points in the set $X = \{1/m^2, 2/m^2, \ldots\}$.
We construct h′ and v′ as in [7]. The first two steps of the construction are the same. In the third step, we want to normalize the total probability mass $m_F + \sum_x x \cdot v'_x$ to be m/k instead of 1. This involves rescaling $v'_x$ by a factor of $s = (m/k - m_F)/\sum_x x \cdot v'_x$.
Next we show that the discretization does not violate the requirements of plausibility. We note that $\left|\frac{d}{dx} \mathrm{bin}(x, k, i)\right| \le k$. Since we discretize to multiples of $1/m^2$, the discretization alters the contribution of each site to each expected fingerprint entry by at most $k/m^2$. Since the support size is bounded by n, the discretization alters each expected fingerprint entry by at most $n \cdot k/m^2$. The rescaling step also does not violate the plausibility conditions. Finally, the last part of the proof bounds the per-unit earthmoving cost, which does not use any properties of the Poisson distribution, so we can apply the same earthmoving scheme and analysis of the per-unit cost. The final cost $R(h, h_{v'})$ needs to be scaled by m/k, since that is the total amount of probability mass.

Chebyshev construction
The previous section established that, given a faithful sample (which we obtain with high probability), there exists a plausible point that is very close to the true histogram. In this section, we show that any two plausible points are close in generalized relative earthmover distance. By the triangle inequality, this guarantees that the solution returned by UnseenEst2 is close to the true histogram. To establish the closeness of the histograms, we explicitly construct an earthmoving scheme using Chebyshev polynomials. This is analogous to the earthmoving scheme in [7], replacing all instances of $\mathrm{poi}(kx, i)$ by $\mathrm{bin}(x, k, i)$. We define binomial Chebyshev bumps, following [7].

Definition 6.6. Let $s = \lfloor 0.2 \log kt \rfloor$, where $t = \sum_s p_s$. Define $g_1(\theta) = \sum_{j=-s}^{s-1} \cos(j\theta)$, an approximation of the delta function truncated at Fourier degree $s$. Define a slightly fatter version $g_2(\theta)$ and, for $i \in \{1, \ldots, s-1\}$, its shifted versions $g_3^i(\theta) = g_2(\theta - \frac{i\pi}{s}) + g_2(\theta + \frac{i\pi}{s})$, with $g_3^0 = g_2$ and $g_3^s(\theta) = g_2(\theta + \pi)$. Let $t_i(x)$ be the linear combination of Chebyshev polynomials such that $t_i(\cos\theta) = g_3^i(\theta)$. We define $s+1$ functions, the "skinny bumps", $B_i(x) = t_i(1 - \frac{xk}{2s}) \sum_{j=0}^{s-1} \mathrm{bin}(x, k, j)$, for $i \in \{0, \ldots, s\}$. We now prove a number of useful properties of the Chebyshev earthmoving scheme.
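To see why $g_1$ acts as a truncated delta function, the following sketch (the value of $s$ is an arbitrary illustration) evaluates the kernel of Definition 6.6 directly: it peaks at $2s$ at $\theta = 0$ and stays small away from the origin:

```python
from math import cos

def g1(theta, s):
    """Truncated-Fourier delta approximation: sum_{j=-s}^{s-1} cos(j*theta)."""
    return sum(cos(j * theta) for j in range(-s, s))

s = 20
peak = g1(0.0, s)                                        # equals 2s at the center
tail = max(abs(g1(t / 10.0, s)) for t in range(10, 32))  # theta in [1.0, 3.1]
```

The tail magnitude is bounded by $1/|\sin(\theta/2)|$, so it stays a small constant while the central peak grows linearly in $s$.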
Lemma 6.8 (Lemma 18 in [7]). For any $\theta$ and for any $x$, the bound of Lemma 18 of [7] holds. Proof. Same as in Lemma 18 of [7]; nothing specific to the Poisson density was used in that proof.
Proof. We decompose $g_3^i(\theta)$ into a linear combination of $\cos(\ell\theta)$ for $\ell \in \{0, \ldots, s\}$. Since $\cos(-\ell\theta) = \cos(\ell\theta)$, $g_1(\theta)$ consists of one copy of $\cos(s\theta)$, two copies of $\cos(\ell\theta)$ for each $\ell$ strictly between 0 and $s$, and one copy of $\cos(0\cdot\theta)$. $g_2(\theta)$ consists of ($\frac{1}{16s}$ times) 8 shifted copies of $g_1(\theta)$. The shifts change the phases of the Fourier coefficients but not their magnitudes. Sine components may be introduced by the shifts, but since each $g_3^i$ is an even function, the sine components cancel out. Since each $g_3^i$ contains at most two shifted copies of $g_2$, each $g_3^i(\theta)$ is a linear combination $\sum_{\ell=0}^{s} b_{i\ell}\cos(\ell\theta)$ with Fourier coefficients bounded by $|b_{i\ell}| \leq \frac{2}{s}$.
Proof. Same proof as in Lemma 20 of [7]; that lemma does not involve the Poisson density at all.

Proof. The analysis has two parts. For the first part, we consider the cost of the bumps $f_i$ for $i \geq s+1$, where recall that $s = \lfloor 0.2 \log kt \rfloor$. This is the cost of moving $\mathrm{bin}(x,k,i)$ mass from $x$ to $\frac{i}{k}$. The unit cost of moving mass from $x$ to $\frac{i}{k}$ is $|\log\frac{xk}{i}|$, which is upper bounded by $\frac{xk}{i} - 1$ when $i < xk$, and by $\frac{i}{xk} - 1$ otherwise. We split the calculation into two parts. First, for $i \geq \lceil xk \rceil$: when summed over $i \geq \max\{s, \lceil xk\rceil\}$, the sum telescopes to an expression bounded by $\mathrm{bin}(x, k, \max\{s, \lceil xk\rceil\} - 1) = O(\frac{1}{\sqrt{s}})$, and the remaining term sums to at most $\frac{1}{s}$. Note that $\mathrm{bin}(x,k,i)\,\frac{x(k+1)}{i+1} = \mathrm{bin}(x,k+1,i+1) \leq \mathrm{bin}(x,k,i+1)$, where the last inequality holds because $i \leq \lceil xk\rceil - 1$. Therefore the rest of the sum telescopes to $\mathrm{bin}(x,k,\lceil xk\rceil) - \mathrm{bin}(x,k,s) = O(\frac{1}{\sqrt{s}})$. Thus in total, the $f_i$ for $i \geq s+1$ contribute $O(\frac{1}{\sqrt{s}})$ to the relative earthmover cost, per unit of weight moved.
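The binomial identity used in the telescoping step is easy to verify numerically. The sketch below (the values of $k$ and $x$ are arbitrary) checks both the identity $\mathrm{bin}(x,k,i)\,\frac{x(k+1)}{i+1} = \mathrm{bin}(x,k+1,i+1)$ and the inequality $\mathrm{bin}(x,k+1,i+1) \leq \mathrm{bin}(x,k,i+1)$ for $i \leq \lceil xk \rceil - 1$:

```python
from math import comb, ceil

def bin_pmf(x, k, i):
    """bin(x, k, i): binomial probability of i successes in k trials."""
    return comb(k, i) * x**i * (1 - x)**(k - i)

k, x = 100, 0.03

# Identity: bin(x,k,i) * x(k+1)/(i+1) == bin(x,k+1,i+1), exact up to float rounding.
max_err = max(abs(bin_pmf(x, k, i) * x * (k + 1) / (i + 1) - bin_pmf(x, k + 1, i + 1))
              for i in range(40))

# Inequality: bin(x,k+1,i+1) <= bin(x,k,i+1) for i <= ceil(x*k) - 1.
monotone = all(bin_pmf(x, k + 1, i + 1) <= bin_pmf(x, k, i + 1)
               for i in range(ceil(x * k)))
```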
Next we analyze the skinny bumps $f_i(x)$ for $i \leq s$. The simple case is when $xk \geq 4s$. Recall the definition $f_i(x) = t_i(1 - \frac{xk}{2s})\sum_{j=0}^{s-1}\mathrm{bin}(x,k,j)$. Since $xk \geq 4s > s$, we can bound $\sum_{j=0}^{s-1}\mathrm{bin}(x,k,j) \leq s\cdot\mathrm{bin}(x,k,s)$. Each $f_i(x)$ is exponentially small in both $x$ and $s$, and thus the cost of the total earthmoving scheme, per unit of mass above $\frac{4s}{k}$, is exponentially small.
Since $\sum_x x\cdot h(x) = m/k$ and $\sum_x h(x) \leq n$, by the Cauchy-Schwarz inequality, the total earthmoving cost in this regime is $O\!\left(\frac{m}{k}\sqrt{\frac{n}{m\log kt}}\right)$, and hence we need $n = \delta\, m \log kt$ to ensure that the total cost here is $O(m\sqrt{\delta}/k)$.
Finally we put all the pieces together. The total probability mass that needs to be moved is $O(m/k)$. For the bump centers $i \geq s+1$, the same analysis as in [7] shows that the relative earthmover cost is $O(\frac{1}{k^{\Omega(1)}})$. We then consider the first $s + 1 = O(\log kt)$ bump centers, corresponding to the skinny Chebyshev bumps. Recall that for these centers $c_i$, the bump functions $B_i(x)$ may be expressed as $\sum_{j=0}^{s}\sum_{q=0}^{s} a_{ijq}\,\mathrm{bin}(x, k+q, j+q)$ for $a_{ijq}$ satisfying $\sum_{q=0}^{s}\sum_{j=0}^{s}|a_{ijq}| \leq \beta \equiv 2(kt)^{0.3}$.
Using the shorthand $\sum_x$ for $\sum_{x:\, h(x)+g(x)\neq 0}$, we bound the discrepancy at each center in terms of $\mathrm{bin}(x, k+q+1, j+q+1)$, where we use the triangle inequality and the first condition of plausibility in the last inequality. Since $B < 0.1$, this discrepancy is $O(\frac{\max\{1, ps\}}{k^{\Omega(1)}})$ for each center $c_i$, and since there are $\log kt$ centers, the total discrepancy is also $O(\frac{\max\{1, ps\}}{k^{\Omega(1)}})$. Putting all the pieces together, by the triangle inequality, we have $R(p,q) \leq R(p,h) + R(h,h') + R(h',g') + R(g',g) \leq O(m\sqrt{\delta}/k)$.

Proof of Proposition 1.4
For convenience, we restate the proposition in a slightly more general form:

Proposition 1.4. Given two lists of probabilities $P = \{p_s : s \in S\}$ and $Q = \{q_s : s \in S\}$ with $\sum_s p_s \geq \sum_s q_s$, let $E[S_{k,P}] = \sum_{s\in S}\Pr[\mathrm{bin}(k, p_s) > 0]$ denote the expected number of variants observed in a sample of $k$ alleles with the distribution of frequencies given by $P$, and let $E[S_{k,Q}]$ denote the analogous quantity corresponding to frequencies $Q$. Let $P' = \{p'_s : s \in S\}$ be any list of probabilities satisfying:

1. Either for all $s \in S$, $p'_s \leq p_s$, or for all $s \in S$, $p'_s \geq p_s$;

2. $\sum_s p'_s = \sum_s q_s$.

Then, for any $k$, the difference $|E[S_{k,P}] - E[S_{k,Q}]|$ is controlled by $R(h_{P'}, h_Q)$, the relative earthmover distance between the histograms corresponding to $P'$ and $Q$; the stated bound then follows for $k > 3$.

Proof. By the triangle inequality, it suffices to bound two terms. The first term is trivially bounded by $k\sum_s |p_s - p'_s| = k\left|\sum_s p_s - \sum_s q_s\right|$ (using conditions 1 and 2), since each unit of probability mass can, in expectation, account for at most $k$ distinct observations. To bound the second term, first note that since both the relative earthmover cost and the expected number of distinct elements observed are linear functions of the number of elements of $P$ and $Q$ at each probability value, it suffices to analyze the earthmoving cost and the change in the expected number of distinct elements for a single earthmoving operation: consider moving $c$ units of mass from probability value $x$ to $y$. The change to the expected number of distinct elements observed is exactly $c$ times the per-unit change, and the relative earthmover cost of the operation is $c\,|\log\frac{x}{y}|$. We now show that the ratio of these quantities is always at most $k/2$.
Thus, for $x \leq \frac{1}{2}$, the ratio of derivatives is bounded as
$$\frac{1 - e^{-x(k-1)}\left((k-1)x + 1\right)}{x} + \frac{e^{-x(k-1)}\, x^2 (k-1)\left((k-1)x + 1\right)}{x}. \qquad (1)$$
The first term of the right-hand side, after dividing by $k-1$, can be re-expressed in terms of $y = x(k-1)$ as $\frac{1 - e^{-y}(y+1)}{y}$, which has a global maximum less than 0.3; the second term of the right-hand side, after the same variable substitution, equals $e^{-y}y(y+1)$, which has a global maximum less than 1. Thus, for $x \leq \frac{1}{2}$, the absolute value of the ratio of derivatives is bounded by $0.3(k-1) + 1$. For $x \geq \frac{1}{2}$, the right-hand side of Equation (1) is $\frac{1}{x}$ minus some positive quantity, and is hence at most 2. Since $0.3(k-1) + 1 \geq 2$ for any $k \geq 5$, all that remains is to check by hand the cases $k = 2, 3, 4$, where $0.3(k-1) + 1 < 2$, to confirm that $0.3(k-1) + 1$ is in fact a global bound.
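The two global maxima invoked above are straightforward to confirm numerically. The sketch below (the grid resolution is an arbitrary choice) evaluates both substituted terms over a wide range of $y = x(k-1)$:

```python
from math import exp

def first_term(y):
    """(1 - e^{-y}(y+1)) / y, the first term after substituting y = x(k-1)."""
    return (1 - exp(-y) * (y + 1)) / y

def second_term(y):
    """e^{-y} * y * (y+1), the second term after the same substitution."""
    return exp(-y) * y * (y + 1)

grid = [t / 1000.0 for t in range(1, 100001)]   # y in (0, 100]
m1 = max(first_term(y) for y in grid)           # observed maximum just below 0.3
m2 = max(second_term(y) for y in grid)          # observed maximum around 0.84
```

The first maximum is attained near $y \approx 1.8$ and the second near $y = \frac{1+\sqrt{5}}{2}$, both comfortably below the claimed bounds.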