Quantifying the unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects

As new proposals aim to sequence ever larger collections of humans, it is critical to have a quantitative framework for evaluating the statistical power of these projects. We developed a new algorithm, UnseenEst, and applied it to the exomes of 60,706 individuals to estimate the frequency distribution of all protein-coding variants, including rare variants that have not yet been observed in current cohorts. Our results quantify the number of new variants that we expect to identify as sequencing cohorts reach hundreds of thousands of individuals. With 500K individuals, we expect to capture 7.5% of all possible loss-of-function variants and 12% of all possible missense variants. We also estimate that 2,900 genes have loss-of-function frequency below 10^-5 in healthy humans, consistent with very strong intolerance to gene inactivation.


Introduction
Recent efforts aggregating the genomes and exomes of tens of thousands of individuals have provided unprecedented insights into the landscape of rare human genetic variation 1,2 and generated critical resources for clinical and population genetics. The recently announced U.S. Precision Medicine Initiative raises the prospect of growing these databases to encompass hundreds of thousands of human genomes. In the context of these ambitious efforts, it is important to quantify the power of large sequencing projects to discover rare functional genetic variants 3 . In particular, we need to understand, as we sequence ever larger cohorts of individuals, how many new variants we can expect to identify and their expected allele frequencies. Accurate estimates of these quantities will enable better study design and quantitative evaluation of the potential and limitations of these datasets for precision medicine.

Results
Predicting the number of new variants we expect to identify in larger cohorts requires accurate estimates of allele frequencies of all the genetic variation in the human population, including the rare variants that have not been observed in the current sequencing cohorts [4][5][6] . The population frequencies of the unobserved rare variants determine the discovery rate of new variants as the cohort sizes increase. We developed a new method, UnseenEst, to estimate the frequency distribution of all variants using the observed site frequency spectrum (SFS) of the current cohort.
The method is based on linear program estimators of the SFS 7 , and our mathematical analysis shows that it enables accurate extrapolation of the SFS from current data to cohort sizes more than an order of magnitude larger (Supplementary Information).
Protein-coding variants represent the most readily interpretable and medically relevant slice of human genetic variation, and have been assessed in large sample sizes through the widespread application of exome sequencing approaches 2 . We leveraged data from the Exome Aggregation Consortium (ExAC) 8 to estimate the discovery rates of different classes of protein-coding variants in larger cohorts. We validated UnseenEst by training it on a random 10% of the alleles in ExAC and then using the estimated frequency distribution to predict the number of distinct variants identified in the entire ExAC cohort. For every variant type (Supp. Figure 1) and every population (Supp. Figure 2), UnseenEst accurately predicted the number of unique variants identified in the entire ExAC cohort, as well as the empirical SFS of ExAC (Supp. Table 1).
From the full ExAC dataset, we generated a cohort of 33,778 healthy individuals that matched the ancestral population breakdown of the 2010 U.S. Census (Supp. Table 2). We trained UnseenEst on this U.S. Census-matched cohort and predicted the frequency distributions of variants in the entire population (Supp. Figure 3). In particular, we estimated the number of distinct variants we expect to identify in cohorts of up to 500K individuals. These results provide a quantitative framework for evaluating the power and limitations of precision medicine initiatives in discovering rare coding variants.
We categorized the variants by their predicted functional consequence: synonymous, missense, and loss-of-function (LoF), defined as point substitutions that introduce stop codons or disrupt splice donor/acceptor sites (Figure 1a). The discovery rate of LoF variants is the lowest, reflecting the fact that LoFs are likely to be deleterious and hence tend to be comparatively rare in the healthy population. With 500K individuals, we expect to identify 400K distinct LoF variants, or 7.5% of all possible LoF point mutations in the human exome. In the same cohort, we expect to identify 3.4 million synonymous and 7.5 million missense variants, corresponding to 18% and 12% of possible synonymous and missense variants, respectively. These estimates indicate that the discovery rates of rare LoF, missense and synonymous variants are far from saturation, even with 500K individuals. We note that slightly higher numbers of distinct synonymous and missense variants would be discovered if the 500K individuals were instead sampled from the same ancestral composition as the current ExAC cohort (Supp. Figure 4), which contains higher fractions of South and East Asian individuals than the U.S., indicating that the overall discovery rate of rare variants can be boosted by optimizing the population composition of the sequencing cohort.
We additionally classified the variants by their biochemical properties (Figure 1b). With the 34K individuals of the current cohort, we can already identify close to 50% of all possible variants at CpG sites (the most highly mutable substitution class), and the discovery rate for this class of variant quickly saturates as cohorts grow larger. Transversions, in contrast, are discovered much more slowly, attaining only 7.6% of all possible transversions with 500K individuals, which is consistent with their much lower mutation rate. We further applied UnseenEst to quantify the number of distinct missense variants we expect to discover in specific gene sets of interest, for example genes near GWAS hits and known drug target genes (Supp. Figure 5). Missense mutations in drug target genes are particularly suppressed, suggesting that these genes are more likely to be essential in humans.
LoF variants are likely to disrupt the normal function of genes, and by studying individuals carrying such variants we can quantify the phenotypic consequences of disrupting particular genes. A catalog of the number of human alleles harboring candidate LoF variants in each gene is therefore an important resource for drug development and disease diagnosis. We applied UnseenEst to estimate the LoF frequency of genes in the U.S. population (Figure 1c, Supp. Figure 6). About 2,900 genes have LoF allele frequency lower than 10^-5, consistent with strong intolerance to inactivation, whereas 1,700 genes are expected to harbor LoF variants in at least 0.1% of the population. With 250K individuals, we expect to identify 14K genes that harbor LoFs in at least 10 individuals, substantially expanding the current catalog of 10K such genes in ExAC (Figure 1d, Supp. Figure 7). We estimate that the discovery rate of these genes with multiple LoF occurrences will saturate around 16K, providing an upper bound on the number of genes that can tolerate LoF variants on one allele.

Discussion
We describe a framework for estimating the power of sequencing cohorts to discover protein-coding variants. We apply it to the largest available collection of sequenced individuals to estimate the discovery power of much larger cohorts, such as the ones proposed by the Precision Medicine Initiative. While our predictions here assumed that the samples are representative of U.S. demography, UnseenEst can be directly applied to estimate the discovery rate of cohorts with different ancestral compositions. Our results show that sequencing a cohort of 500K randomly selected U.S. individuals would provide access to over 12% of all possible missense variants and 7.5% of all possible LoF variants, thereby permitting exploration of a substantial fraction of human biological diversity.

Figure legends

Figure 1. We trained UnseenEst on the U.S. Census-matched ExAC cohort ("current") and predicted the number of unique variants we expect to find in up to 500K individuals. The number of unique variants in the cohort was estimated for synonymous, missense and loss-of-function (LoF) variants in (a), and for CpGs, transitions and transversions in (b). The shaded regions correspond to one standard deviation around the estimates. (c) A gene is classified as LoF on a given allele if that allele contains at least one variant that introduces a stop codon, disrupts a splice donor/acceptor site, or disrupts the reading frame. Genes are partitioned into bins based on their LoF allele frequencies: less than 10^-5, 10^-5 to 10^-4, 10^-4 to 10^-3, and greater than 10^-3. The y-axis indicates the number of genes with LoF allele frequency belonging to each bin. Error bars correspond to one standard deviation. (d) Estimated number of genes with at least 10 and 20 LoF alleles.

Supplementary Figure 1. Using 10% of the ExAC alleles to predict the number of unique variants in the entire ExAC cohort. Each panel corresponds to one variant type. For each variant type, we applied UnseenEst on 10% of the ExAC alleles (5,919 individuals) to predict the number of unique variants that we would expect to observe in a cohort of size less than or equal to that of ExAC (59,198 individuals). The blue curves are the average predictions over the different 10% sub-samples and the blue shaded regions correspond to one standard deviation from the average. The red curves are the actual numbers of unique variants observed in ExAC. For all variant types, the predicted number of unique variants is in good agreement with the observed number.

Supplementary Figure 2. Using 10% of the alleles in each ExAC population to predict the total number of observed variants. For each of the ExAC populations, we trained UnseenEst on a random 10% of the alleles and applied it to predict the total number of unique variants in the entire population. The x-axis of each panel indicates the number of individuals of that population; the first mark (e.g. 5,919 in (a)) indicates the size of the training set and the last mark (e.g. 59,198 in (a)) is the total cohort size of that population in ExAC. The blue curves are the average predictions over the different 10% sub-samples and the blue shaded regions correspond to one standard deviation from the average. The red curves are the actual numbers of unique variants observed in ExAC. For all variant types, the predicted number of unique variants is in good agreement with the observed number.

Preliminaries
Given the genetic variation observed in a sample of individuals, what can one infer about the genetic variation across the entire population? We introduce a robust, general, and theoretically sound algorithm, UnseenEst, for accurately quantifying the distribution of frequencies of all the genetic variation, including variants that have not been observed in the current samples, based on sequences from surprisingly small sets of individuals. This estimated frequency distribution can then be leveraged to yield accurate estimates of a number of useful properties, including the number of new variants that are likely to be observed in larger cohorts of individuals.
We begin by formalizing the model in which we are working, and describe the sense in which our algorithm recovers the distribution of variant frequencies. The core of our approach is a linear programming (LP) based algorithm, and we discuss the intuition behind this method. We then establish the performance guarantees of our algorithm, proving that, with high probability, it will recover an accurate estimate of the true frequency distribution, and yields accurate predictions for the number of new variants that will be observed in larger samples.
The model. Let S denote a particular variant class of interest; for example, S can correspond to all possible missense mutations in a gene family. Each possible variant s ∈ S is associated with a probability p_s, the probability that an allele contains s. We model all alleles as independent and all variants as independent; hence the p_s's are the parameters of independent Bernoulli random variables. When we sample an allele, we obtain an independent Bernoulli(p_s) draw for each s ∈ S. In a sample of k alleles, the number of times variant s is observed is distributed according to bin(p_s, k).
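To make the model concrete, here is a minimal simulation sketch (illustrative code, not from the paper; the toy probabilities and sample size are assumptions):

```python
import random
from math import comb

def sample_variant_counts(probs, k, seed=0):
    """For each variant with inclusion probability p_s, count how many of
    k independently sampled alleles carry it (a Binomial(k, p_s) draw)."""
    rng = random.Random(seed)
    return [sum(1 for _ in range(k) if rng.random() < p) for p in probs]

def binom_pmf(p, k, i):
    """bin(p, k, i): probability that a variant with frequency p is
    observed exactly i times in a sample of k alleles."""
    return comb(k, i) * p**i * (1 - p) ** (k - i)

# toy variant class with one common, one rare, and one very rare variant
probs = [0.5, 0.01, 0.001]
counts = sample_variant_counts(probs, k=1000)
```

Each entry of `counts` is an independent draw from bin(p_s, k), matching the model above.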
Definition 1.1. Given the set of variant probabilities P = {p_s}, the histogram of P is the mapping h_P(x) = |{s : p_s = x}|, i.e. the number of variants whose probability is exactly x.

The histogram represents all of the information in P except for the labels of the variants. In this work, we are interested in accurately recovering the histogram h_P. For the purpose of estimating any property of the p_s's that does not depend on the specific labels of the variants, the histogram h_P contains all of the useful information. Such properties are referred to as symmetric, as they are unaffected by 'renaming' the variants.
The following examples illustrate several interesting symmetric properties:
• The total number of variants that occur with probability more than c is a symmetric property, and is given by Σ_{x > c} h(x).
• The expected number of unique variants that will be observed in a sample of k alleles is a symmetric property, and is given by Σ_x h(x)·(1 − (1 − x)^k).
• The expected number of unique variants that will be observed more than 10 times in a sample of k alleles is a symmetric property, and is given by Σ_x h(x)·(1 − Σ_{i=0}^{10} bin(x, k, i)).

Because our goal is to recover an accurate approximation of the histogram h_P, it will be useful to define a metric on histograms to provide a concrete notion of what it means for two histograms to be "similar".
Definition 1.2. Given two histograms, g and h, the generalized relative earthmover distance R(g, h) is the difference between their total probability masses, |Σ_x x·g(x) − Σ_x x·h(x)|, plus the minimum over all schemes of moving the mass of histogram g to yield h, where the cost, per unit mass, of moving from probability value x to probability value y is |log(x/y)|. Note that the amount of mass in histogram g at probability value x is given by x · g(x).
The generalized relative earthmover distance allows for comparisons of histograms with different total masses, which is necessary since the inferred histogram from data will typically have a slightly different mass from the true distribution. Intuitively, relative earthmover also highlights the importance of estimating the rare variants well: mistaking variants with frequency 10 −5 for frequency 10 −6 suffers substantial distance cost. The other main reason for using the relative earthmover distance is that many properties of interest are Lipschitz continuous with respect to this distance: if two histograms are close in relative earthmover distance, then they have similar property values. In particular, if we guarantee that, with high probability, our algorithm recovers an estimate of the underlying histogram that is accurate in relative earthmover distance, then estimates of properties that we obtain from the recovered histogram will be accurate.
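For intuition, the equal-mass case of this distance can be computed by greedy matching of sorted mass, since one-dimensional optimal transport pairs up mass in sorted order. The sketch below is illustrative, not the authors' code:

```python
from math import log

def rel_earthmover(g, h, eps=1e-15):
    """Relative earthmover distance between histograms of equal total mass.
    g, h: dicts mapping probability value x -> number of variants h(x).
    The mass sitting at x is x * h(x); moving one unit of mass from x to y
    costs |log(x / y)|."""
    src = sorted([x, x * c] for x, c in g.items() if c > 0)
    dst = sorted([x, x * c] for x, c in h.items() if c > 0)
    cost, i, j = 0.0, 0, 0
    while i < len(src) and j < len(dst):
        moved = min(src[i][1], dst[j][1])       # move as much as both ends allow
        cost += moved * abs(log(src[i][0] / dst[j][0]))
        src[i][1] -= moved
        dst[j][1] -= moved
        if src[i][1] <= eps:
            i += 1
        if dst[j][1] <= eps:
            j += 1
    return cost
```

For example, mistaking one variant of frequency 10^-5 for ten variants of frequency 10^-6 (the same total mass) costs 10^-5 · log 10, echoing the point above about the cost of misplacing rare variants.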
The following proposition, whose proof is given in Section 6.3, illustrates this point: it shows that if two histograms are close in generalized relative earthmover distance, then the expected numbers of variants that will be observed in any given sized sample will be correspondingly similar.

Proposition 1.4. For any two frequency distributions P and Q and any sample size, the difference between the expected numbers of unique variants observed under P and under Q is bounded in terms of R(h_P, h_Q), where R(h_P, h_Q) is the generalized relative earthmover distance between the histograms corresponding to P and Q.
In analogy to the histogram h P giving us a label-less representation of the true underlying P = {p s }, it is convenient to have a label-less representation of the observed variant counts from a sample of alleles. To this end, we define the fingerprint of the observed variants, which is also known as the site frequency spectrum (SFS) in genetics, or the "pattern" of the sample in some statistics contexts.
Definition 1.5. Given a sample X of k alleles, the associated fingerprint F = (F_1, F_2, ...) is the "histogram of the histogram" of X. Formally, F is the vector whose i-th component, F_i, is the number of variants in S that occur exactly i times in the sample X.
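Computing a fingerprint from observed per-variant allele counts is straightforward; a sketch with an illustrative data layout:

```python
from collections import Counter

def fingerprint(counts):
    """counts: number of alleles carrying each possible variant.
    Returns {i: F_i}, where F_i is the number of variants observed
    exactly i times; unobserved variants (count 0) are excluded."""
    return dict(Counter(c for c in counts if c > 0))

F = fingerprint([1, 1, 1, 2, 5, 0, 2])
```

Here F_1 = 3 (singletons), F_2 = 2 (doubletons), and F_5 = 1.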
Remarks on the model. Our model assumes that all the variants are independent random variables. Population demography and linkage disequilibrium introduce correlations, especially among the common genetic variants. For the common variants, UnseenEst uses the empirical frequency, which accurately estimates the true population frequency. For the very rare variants, which UnseenEst estimates under the independence assumption, this assumption is a better approximation of the real data.
While the discussion here focuses on estimating the histogram of genetic variation, UnseenEst is a general approach for estimating the histogram and symmetric statistical properties of any finite list of probabilities {p_1, ..., p_n} from independent Bernoulli samples, and can have broad applications beyond genetics. Note that Σ_s p_s can be significantly smaller or larger than 1.

The UnseenEst Algorithm
We partition the variants into two classes: common variants and rare variants. In our applications, variants with empirical allele frequency above 1% are defined to be common. With current cohorts of tens of thousands of alleles, we are likely to have observed all the common variants, and their empirical allele frequencies should be very close to the true population frequencies. We therefore focus the efforts of the algorithm on estimating the frequencies of rare variants.
Given a sample of k alleles and the associated fingerprint F, we truncate the fingerprint to the rare variants, i.e. we consider {F_i : i/k ≤ 0.01}. For the common variants with frequency above 1%, we simply use their empirical frequency as an estimator of the true frequency. On the truncated fingerprint, we solve the following linear program for the variables h(x), x ∈ X, where X is a finite mesh of probabilities.

Input:
• the truncated fingerprint {F_i : i/k ≤ 0.01} from k alleles;
• n = upper bound on the number of possible variants;
• m = Σ_{i : i/k ≤ 0.01} i·F_i, the total number of observations of variants with empirical frequency less than or equal to 1%.

Solve for h(x), x ∈ X, to minimize the objective function

Σ_{i : i/k ≤ 0.01} (1/√(1 + F_i)) · |F_i − Σ_{x ∈ X} h(x)·bin(x, k, i)|

subject to the constraints h(x) ≥ 0 for all x ∈ X, Σ_{x ∈ X} x·h(x) = m/k, and Σ_{x ∈ X} h(x) ≤ n.

For a given histogram h, Σ_{x ∈ X} h(x)·bin(x, k, i) is the expected number of variants observed i times in k alleles. The objective function of the LP captures how much this expected number deviates from the empirical number of variants observed i times (the fingerprint entries F_i); the term 1/√(1 + F_i) normalizes each deviation by an estimate of its standard deviation. The constraints enforce that h(x) ≥ 0, namely that there cannot be a negative number of variants occurring with a given probability, and that the total sum of the probabilities matches the empirical estimate of that sum; note that m/k is a very accurate estimator of Σ_s p_s for large k. In practice, we found it sufficient to use a probability mesh X with geometrically increasing values at rate α = 1.05. Using a smaller α can marginally improve accuracy at the cost of run time. UnseenEst is very efficient; on the ExAC dataset (described below), the computation took less than 10 seconds on a standard laptop.
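The LP can be assembled in standard form and handed to an off-the-shelf solver. The sketch below is illustrative, not the authors' implementation: the mesh endpoints, the toy fingerprint, and the variable names are assumptions, and the solver call is shown only in a comment since it requires scipy.

```python
from math import comb, sqrt

def binom_pmf(x, k, i):
    """bin(x, k, i): probability of seeing a frequency-x variant i times in k alleles."""
    return comb(k, i) * x**i * (1 - x) ** (k - i)

def build_unseenest_lp(F, k, n, x_min=1e-7, x_max=0.01, alpha=1.05):
    """Assemble the LP as (c, A_ub, b_ub, A_eq, b_eq) matrices.

    F: truncated fingerprint {i: F_i} with i/k <= 0.01.
    Variables: h(x) for each x in the geometric mesh X, followed by one
    slack u_i per fingerprint entry encoding |F_i - sum_x h(x) bin(x,k,i)|.
    """
    X, x = [], x_min
    while x <= x_max:                      # geometric mesh at rate alpha
        X.append(x)
        x *= alpha
    I = sorted(F)
    m = sum(i * F[i] for i in I)           # total observations of rare variants
    nX, nI = len(X), len(I)
    # objective: zero weight on each h(x); weight 1/sqrt(1+F_i) on each slack u_i
    c = [0.0] * nX + [1.0 / sqrt(1 + F[i]) for i in I]
    A_ub, b_ub = [], []
    for r, i in enumerate(I):              # |F_i - sum_x h(x) bin(x,k,i)| <= u_i
        row = [binom_pmf(x, k, i) for x in X]
        u = [0.0] * nI
        u[r] = -1.0
        A_ub.append(row + u)
        b_ub.append(float(F[i]))
        A_ub.append([-v for v in row] + u)
        b_ub.append(-float(F[i]))
    A_ub.append([1.0] * nX + [0.0] * nI)   # support-size bound: sum_x h(x) <= n
    b_ub.append(float(n))
    A_eq = [X + [0.0] * nI]                # mass constraint: sum_x x*h(x) = m/k
    b_eq = [m / k]
    return c, A_ub, b_ub, A_eq, b_eq, X

F = {1: 500, 2: 120, 3: 40}   # toy truncated fingerprint observed in k=1000 alleles
c, A_ub, b_ub, A_eq, b_eq, X = build_unseenest_lp(F, k=1000, n=10**6)
# With scipy available, the histogram estimate would be obtained via:
#   from scipy.optimize import linprog
#   res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)  # bounds default to >= 0
#   h_hat = dict(zip(X, res.x[:len(X)]))
```

The absolute value in the objective is encoded in the standard way, with each slack u_i bounded below by both signs of the deviation; linprog's default nonnegative bounds cover h(x) ≥ 0 and u_i ≥ 0.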
Estimating the number of unique variants. Given the estimated histogram ĥ produced by UnseenEst, the expected number of unique variants in a sample of k alleles is Σ_{x ∈ X} ĥ(x)·(1 − (1 − x)^k).
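This extrapolation, together with the other symmetric properties defined in the Preliminaries, translates directly into code; a sketch with an assumed toy histogram:

```python
from math import comb

def n_variants_above(h, c):
    """Number of variants with probability greater than c."""
    return sum(count for x, count in h.items() if x > c)

def expected_unique(h, k):
    """Expected number of distinct variants observed in k alleles:
    sum_x h(x) * (1 - (1 - x)^k)."""
    return sum(count * (1 - (1 - x) ** k) for x, count in h.items())

def expected_seen_more_than(h, k, threshold=10):
    """Expected number of variants observed more than `threshold` times in k alleles."""
    def tail(x):
        head = sum(comb(k, i) * x**i * (1 - x) ** (k - i) for i in range(threshold + 1))
        return 1.0 - head
    return sum(count * tail(x) for x, count in h.items())

# toy estimated histogram: 50,000 rare variants and 100 common ones
h_hat = {1e-4: 50000, 1e-2: 100}
```

As k grows, `expected_unique(h_hat, k)` climbs toward the total of 50,100 variants, mirroring the shape of the discovery curves discussed in the Results.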

Performance Guarantees
For the performance guarantees, we analyze the slightly modified linear program below. To simplify the notation, we fix constants B, C, D; the requirements they must satisfy appear in the analysis in Section 6. Given as input an untruncated fingerprint F of m total variants generated from k alleles, the linear program algorithm is UnseenEst2, described below.
Input: fingerprint F from k alleles, n = upper bound on the number of possible variants, and m = Σ_i i·F_i = the total number of variants observed in k alleles.
• For each x ∈ X, define the associated LP variable h(x).
Output: histogram h with support on X.
Solve for h(x), x ∈ X, to minimize the objective function described above, now applied to the untruncated fingerprint.

The UnseenEst2 algorithm satisfies the following guarantee.
Theorem 2.1. Let n be the support size (the number of possible variants), k be the number of alleles sequenced, and P = {p_s} denote the true distribution of the variant frequencies, with t = Σ_s p_s the expected number of variants per allele. For sufficiently large n, with probability at least 1 − e^{−(kt)^{Ω(1)}}, the algorithm will return a histogram g satisfying R(g, h_P) ≤ O(√δ), where δ = n / ((kt)·log(kt)) and the 'O' notation hides an absolute constant.
The above theorem, together with Proposition 1.4, implies the following corollary: let n be the support size (the number of possible variants), k be the number of alleles sequenced, and t = Σ_s p_s the expected number of variants per allele. Given a sample of k alleles, with probability at least 1 − e^{−(kt)^{Ω(1)}}, the algorithm estimates the expected number of unique variants that will be observed in a sample of k alleles to within additive error O(kt·√δ), with δ as in Theorem 2.1.

One interpretation of the above corollary is that the estimate of the expected number of unique variants will be accurate, relative to the total expected number of observed variants kt, provided n < (kt)·log(kt). For comparison, the naive algorithm that attempts to learn the distribution P directly will only be accurate in the regime where n < kt.

Datasets
We used the exome sequencing data from the Exome Aggregation Consortium (ExAC) [1]. This dataset consists of high-quality sequencing of the protein-coding regions of the genome (exomes) from 60,706 healthy individuals. Consistent with the ExAC analysis, we considered only regions of the exome with sufficient sequencing depth: each nucleotide must be covered by at least 10 reads in at least 80% of all ExAC individuals.
Loss-of-function (LoF) variants. We define LoF variants to be single-nucleotide substitutions that introduce a stop codon into the reading frame or disrupt a splice donor or acceptor site. We do not include insertions/deletions (indels) in the class of LoF variants. Variant annotation was performed using the Variant Effect Predictor (VEP) v81 on Gencode v19 and genome build GRCh37. LoF annotation was performed using the LOFTEE plugin (version 0.2; available at https://github.com/konradjk/loftee) for VEP. While early stop codons and splice donor/acceptor disruptions often lead to truncated proteins, this does not imply that the protein has lost all of its function. Our annotation of LoF variants does not explicitly assess protein function and hence serves only as a proxy for the true deleteriousness of the variant.
Upper bound on the number of possible variants. A natural way to interpret the discovery rate of a given variant class is to calculate what fraction of all possible variants in this class we expect to observe at a given sample size. To estimate an upper bound on the total number of possible variants in each class, we first identified all the nucleotides with sufficient read coverage (at least 10 reads in at least 80% of all ExAC individuals). Then, at each well-covered nucleotide, we counted the number of possible variants belonging to the given class. For each variant class, we divided the number of unique variants we expect to identify by this upper bound to obtain the fraction of possible variants observed at a given cohort size. As technology improves in future sequencing projects, we expect the well-covered regions of the exome to grow and hence the number of identified variants to increase as well.
LoF genes. We used the same set of 18,225 genes as in the ExAC analysis [1]. Briefly, we summed all exon-level variant counts across Gencode v19 canonical transcripts. If an exon had a median depth < 1, the variant counts for that exon were not included in the transcript total. We then removed all transcripts in which no variants were observed, as well as outliers whose observed synonymous and missense counts deviated significantly from expectation. This left 18,225 genes for which ExAC had high-quality data.
We associated with each gene g a Bernoulli random variable with probability p_g, the probability that an allele of the gene contains at least one LoF variant as defined above, or at least one insertion/deletion (indel) that disrupts the reading frame. The presence of such a LoF variant or indel is a proxy for true loss of function and does not necessarily mean that the gene is entirely non-functional on that allele. For example, if the LoF variant introduces a stop codon near the 3' end of the gene, the corresponding truncated protein may still retain some function.
UnseenEst can be applied to estimate the histogram of any set of probabilities {p_g}, and hence it directly applies in this setting. On the U.S. Census-matched cohort, we assign a gene the value 1 on a given allele if that allele contains at least one LoF variant or frame-shift indel, and 0 otherwise. The fingerprint entry F_i here corresponds to the number of genes that are LoF in exactly i alleles. We trained UnseenEst on this gene-level fingerprint.
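The gene-level fingerprint just described can be sketched as follows (the data layout and gene names are illustrative):

```python
from collections import Counter

def gene_lof_fingerprint(lof_genes_per_allele):
    """lof_genes_per_allele: one set of gene names per allele, listing the
    genes with at least one LoF variant or frame-shift indel on that allele.
    Returns {i: number of genes that are LoF in exactly i alleles}."""
    lof_alleles_per_gene = Counter()
    for genes in lof_genes_per_allele:
        lof_alleles_per_gene.update(genes)
    return dict(Counter(lof_alleles_per_gene.values()))

# toy cohort of four alleles (gene names are hypothetical)
alleles = [{"GENE_A", "GENE_B"}, {"GENE_A"}, {"GENE_C"}, set()]
F_gene = gene_lof_fingerprint(alleles)
```

In this toy cohort, one gene is LoF in two alleles and two genes are LoF in one allele each.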
Gene lists. We describe the curation of the various gene lists below.
• GWAS genes: 2801 genes that are the closest 3' and 5' genes to GWAS hits in the NHGRI GWAS catalog as of February 9, 2015.
• Drug target genes: 460 genes whose protein products are known to be the mechanistic targets of drugs; curated from [4] [5].
• Genes with cerebral specific expression: 979 genes with cerebral specific expression downloaded from [6].

Validation experiments
We performed multiple experiments to validate the prediction accuracy of UnseenEst.
Accuracy of allele frequency estimation. For each class of variant (synonymous, missense, LoF, CpG) we randomly partitioned all the ExAC alleles into ten groups. We trained UnseenEst on the site frequency spectrum of one partition (i.e. 10% of the alleles) and used the model to predict the allele frequency distribution of the entire ExAC cohort. We grouped variants into 4 frequency bins: 1) variants that occur in 0-10 alleles; 2) variants that occur in 11-100 alleles; 3) variants that occur in 101-1000 alleles; and 4) variants that occur in more than 1000 alleles. We repeated this procedure for each of the ten random partitions and computed the average and the standard deviation for the number of variants predicted to belong to each bin. These estimates are compared with the observed number of variants in each bin in ExAC.
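The frequency binning used in this validation can be sketched as follows (bin edges are from the text; the function name is illustrative):

```python
def bin_by_allele_count(counts):
    """Assign each variant's allele count to one of the four bins used in
    the validation: 0-10, 11-100, 101-1000, and >1000 alleles."""
    bins = [0, 0, 0, 0]
    for c in counts:
        if c <= 10:
            bins[0] += 1
        elif c <= 100:
            bins[1] += 1
        elif c <= 1000:
            bins[2] += 1
        else:
            bins[3] += 1
    return bins

bins = bin_by_allele_count([1, 5, 50, 99, 500, 2000])
```

The predicted bin counts from the trained model and the observed ExAC bin counts can then be compared bin by bin.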
Accuracy of the estimated number of unique variants. For each variant class, we randomly sampled 10% of the alleles and applied UnseenEst to the SFS of this subsample to estimate the histogram ĥ of the variant frequencies. For any positive integer k, the number of unique variants we expect to see in k alleles is Σ_p ĥ(p)·(1 − (1 − p)^k). As before, we computed the average and the standard deviation of the estimates across the different 10% subsamples. To produce the 'true' discovery rate, we created a random ordering of all the ExAC alleles; then, for each k less than the ExAC cohort size, we counted the number of unique variants observed in the first k alleles.
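The empirical discovery curve can be produced with a simple cumulative pass over a random allele ordering (a sketch; the data layout is illustrative):

```python
import random

def discovery_curve(variants_per_allele, seed=0):
    """variants_per_allele: one set of observed variant IDs per allele.
    Returns, for each k, the number of distinct variants seen in the
    first k alleles of a random ordering."""
    order = list(variants_per_allele)
    random.Random(seed).shuffle(order)
    seen, curve = set(), []
    for variants in order:
        seen.update(variants)
        curve.append(len(seen))
    return curve

alleles = [{"v1"}, {"v1", "v2"}, {"v3"}, set()]
curve = discovery_curve(alleles)
```

The resulting curve is non-decreasing by construction and ends at the total number of distinct variants in the cohort.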
Accuracy of gene LoF frequency. We randomly partitioned the alleles into ten subsets. For each subset with 10% of the alleles, we generated the gene-level LoF fingerprint from this subsample. We trained UnseenEst on this subsampled fingerprint and compared the predicted number of genes with at least 10 LoF alleles with that of the observed in the entire ExAC data. The mean and standard deviations of the predictions were computed from the 10 different partitions.

Related work
While our algorithm and analysis are closely related to the approach of [7] [8], there are important differences in the model. In [7], there is an unknown discrete distribution P on n elements and we observe k independent samples from P. This model was motivated by the classic problem of estimating the vocabulary size of Shakespeare from a sample of his works [9]. The discrete distribution setting can be reformulated by associating with each element s an independent Poisson random variable, where p_s is the weight of P on s; here, unlike in our model, Σ_s p_s = 1. The number of times that s appears in k samples is then distributed according to poi(k·p_s). In our genetics model, the number of times that a variant s appears in k alleles is distributed according to bin(p_s, k). While poi(k·p_s) and bin(p_s, k) both have expectation k·p_s, the Poisson has a slightly larger variance. Because the number of elements n is potentially very large, this difference between Poisson and binomial aggregates over all the elements and can give rise to substantial differences in the expected fingerprints between the two models.
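The Poisson-versus-binomial point can be checked numerically; the toy parameters below are assumptions chosen only to illustrate how a small per-variant difference aggregates over many variants:

```python
from math import comb, exp, factorial

def bin_pmf(p, k, i):
    """Binomial probability of observing a frequency-p variant i times in k alleles."""
    return comb(k, i) * p**i * (1 - p) ** (k - i)

def poi_pmf(lam, i):
    """Poisson probability of observing a count of i with mean lam."""
    return exp(-lam) * lam**i / factorial(i)

# toy setting: a million equally rare variants (both models have mean k*p per variant)
k, p, n_variants = 10_000, 1e-4, 1_000_000
per_variant_gap = bin_pmf(p, k, 1) - poi_pmf(k * p, 1)
expected_F1_gap = n_variants * per_variant_gap   # difference in expected singleton count F_1
```

For a single variant the two probabilities differ by roughly 10^-5, but across a million variants the expected singleton counts F_1 differ by a double-digit number of variants.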
Recently, [10] also proposed using linear programs to estimate the discovery rate of new variants. They solve two linear programs with hypergeometric coefficients to estimate upper and lower bounds on the number of unique variants at a given sample size that are consistent with the observed site frequency spectrum. Under the infinite-genome assumption (i.e. that there are infinitely many possible variants), [10] showed that there exist solutions to these two linear programs. The approach of [10] tries to identify the range of the number of unique variants that is consistent with the observed data, but it does not guarantee how wide this interval is or whether it concentrates around the true value in general. Our linear program is guaranteed to produce a histogram that is close to the true frequency histogram. Moreover, our analysis makes explicit the dependence on the sample size k and the frequency distribution {p_s}, which was not present in [10].
Bayesian approaches have also been applied to estimate the number of unseen variants [11] [12].
Other approaches based on the jackknife estimator have also been applied to similar settings [13] [4].
In [11], the mutation probabilities of the variants are assumed to be i.i.d. samples from a Beta(a, b) prior, where the hyperparameters a, b are fitted from data. A limitation of this approach is that it requires a parametric form for the distribution of variant frequencies, which in turn requires some model of demography and selection. For example, the Beta prior used in [11] is a reasonable assumption for neutrally evolving variants but may not be appropriate for deleterious mutations. The advantage of UnseenEst is that it does not require any modeling assumptions about selective pressure or demographic history, i.e. it is non-parametric. Theorem 2.1 applies in all settings where the independence assumption is a reasonable approximation.

Proofs of the Guarantees
The proof of Theorem 2.1 for UnseenEst2 has three main components. First, we show that given a sample of k alleles from the above model, with high probability the empirical fingerprint entries F_i are close to their expected values Σ_{p_s} h(p_s)·bin(p_s, k, i); such a sample of k alleles is what we call a faithful sample. Next, we show that given a faithful sample, the histogram of the true distribution, h(p), rounded so as to be supported on the set X of discrete probability values, is a point in the plausible region of the linear program in UnseenEst2; intuitively, the plausible region captures all the histograms that could plausibly generate the observed SFS. The last component of the proof argues that any two points in the plausible region must be close in generalized relative earthmover distance. This completes the proof because the solution returned by the linear program in UnseenEst2 is in the plausible region, and hence must be close in relative earthmover distance to the rounded true histogram, which is in turn close to the true histogram.
The proof of Theorem 2.1 follows the steps of the proof of Theorem 2 in [7]. We have to replace calculations involving Poisson distributions with Binomials in the appropriate places. We also have to rescale all the earthmoving costs by p s . We provide explicit analysis where our proof differs from that of [7]; otherwise we refer to the appropriate part of [7] when the calculations are identical.

Faithful samples
Definition 6.1. A sample of k alleles with fingerprint F, drawn from a set P = {p_s} of probabilities with histogram h and total mass t = Σ_s p_s, is said to be faithful if the following conditions hold:
• For all i, F_i differs from its expectation E[F_i] by at most max(E[F_i]^{0.5+D}, (kt)^{B(0.5+D)}).
• For all possible variants s ∈ S, letting p_s denote the true probability of s, the number of times s occurs in the sample from P differs from its expectation k·p_s by at most max((k·p_s)^{0.5+D}, (kt)^{B(0.5+D)}).
Lemma 6.2 (Analogous to Lemma 11 in [7]). There is a constant γ > 0 such that, for a sufficiently large sample size k, the sample is faithful with probability at least 1 − e^{−(kt)^γ}, where t = Σ_s p_s.
Proof. The conditions follow from standard Chernoff–Hoeffding concentration bounds. In our model, E[F_i] = Σ_s h(p_s)·bin(p_s, k, i). Each fingerprint F_i is a sum of independent binary variables, one per variant, indicating whether that variant occurs exactly i times in the sample; hence Chernoff bounds apply. The analysis is the same as in the proof of Lemma 11 in [7]; we include it here for completeness.
The analysis of the first condition splits into two cases, according to whether E[F_i] ≥ (kt)^B. When E[F_i] ≥ (kt)^B, the allowed deviation is E[F_i]^{0.5+D} and the Chernoff bound gives failure probability at most 2e^{−E[F_i]^{2D}/3} ≤ 2e^{−(kt)^{2BD}/3}; when E[F_i] < (kt)^B, the same bound follows by setting E[F_i] = (kt)^B. A union bound over the first 2kt fingerprints shows that the probability that a sample of k alleles violates the first condition is at most k · 2e^{−(kt)^{2BD}/3} ≤ e^{−(kt)^{Ω(1)}}. The probability that there are more than 2kt nonzero fingerprints is similarly bounded, as the probability that a variant is observed more than 2kt times is inverse exponential in kt.
For the second condition, we want to show that for all variants s, the number of times that s is observed among the k alleles differs from its expectation p_s·k by at most max((k·p_s)^{0.5+D}, (kt)^{B(0.5+D)}). The analysis again splits into two cases, depending on whether p_s·k ≥ (kt)^B, and follows from the same Chernoff bound as before, replacing F_i by the number of times s occurs in the sample and E[F_i] by p_s·k.
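The concentration inequality doing the work here is the multiplicative Chernoff bound for binomial variables, P(|X − μ| ≥ εμ) ≤ 2e^{−ε²μ/3} for 0 < ε ≤ 1. A quick numerical sanity check of that bound against the exact binomial tail (illustrative parameters; the constants B, D of the faithfulness definition are not used):

```python
from math import comb, exp

def binom_pmf(n, p, i):
    return comb(n, i) * p**i * (1 - p)**(n - i)

def exact_two_sided_tail(n, p, delta):
    # P(|X - np| >= delta) for X ~ Binomial(n, p), by direct summation
    mu = n * p
    return sum(binom_pmf(n, p, i) for i in range(n + 1) if abs(i - mu) >= delta)

def chernoff_bound(mu, eps):
    # P(|X - mu| >= eps * mu) <= 2 * exp(-eps**2 * mu / 3), valid for 0 < eps <= 1
    return 2 * exp(-eps**2 * mu / 3)

n, p, eps = 1000, 0.05, 0.5
tail = exact_two_sided_tail(n, p, eps * n * p)
bound = chernoff_bound(n * p, eps)
```

The exact tail sits well below the Chernoff bound, as the proof requires.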
Definition 6.3. Given a fingerprint F, an upper bound n on the support size, m = Σ_i i·F_i, and a finite set of probability values X, the plausible region is the set of histograms h supported on X satisfying the plausibility conditions, the last three of which are the same as the LP constraints in UnseenEst2. As the name suggests, the plausible region is the set of histograms that could plausibly have generated the observed fingerprint F.
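The plausibility conditions are referenced only abstractly here; purely as a hypothetical illustration of their generic shape in [7] — expected fingerprints close to the observed ones, total probability mass close to m/k, and support size at most n — a membership check might look as follows (the grid, tolerance, and constraint forms are assumptions, not the paper's exact LP):

```python
from math import comb

def binom_pmf(p, k, i):
    return comb(k, i) * p**i * (1 - p)**(k - i)

def is_plausible(h, F, k, m, n, tol):
    # h: candidate histogram, mapping grid frequency x -> number of variants at x.
    # Generic checks (tolerances and forms are illustrative, not the paper's LP):
    # 1. expected fingerprints close to the observed fingerprint F
    for i, F_i in enumerate(F, start=1):
        expected = sum(cnt * binom_pmf(x, k, i) for x, cnt in h.items())
        if abs(expected - F_i) > tol:
            return False
    # 2. total probability mass close to m/k
    if abs(sum(x * cnt for x, cnt in h.items()) - m / k) > tol:
        return False
    # 3. support size at most n, all counts nonnegative
    return sum(h.values()) <= n and all(cnt >= 0 for cnt in h.values())

# toy instance: two variants, each at frequency 0.1, sampled as k = 10 alleles
k = 10
h = {0.1: 2}
F = [2 * binom_pmf(0.1, k, i) for i in range(1, k + 1)]  # idealized fingerprint
m = sum(i * F_i for i, F_i in enumerate(F, start=1))     # m = sum_i i * F_i
```

The true histogram passes the check, while a histogram concentrated at the wrong frequency fails the fingerprint condition.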
The following lemma shows that, given a faithful sample of k alleles, the corresponding plausible region has a point that is extremely close to the histogram of the true distribution.
Lemma 6.4 (Analogous to Lemma 12 of [7]). For sufficiently large k and n < m^{2+B/2}/k: given a distribution of support size at most n and a faithful sample of k alleles with fingerprint F, the plausible region contains a point v' such that h_{v'} is close to the true histogram h, where h_{v'} is obtained from v' by appending the empirical fingerprint entries F_i for i ≥ m^B + 2m^C.
Proof. The idea of the proof is to show that, provided the sample is faithful, the true histogram h can be minimally modified into a plausible point v'. We construct v' by taking the portion of h with probabilities at most (m^B + m^C)/m and rounding the support of h to the closest multiple of 1/m², so that it is supported at points in the set X = {1/m², 2/m², ...}.
We construct h' and v' as in [7]. The first two steps of the construction are the same. In the third step, we normalize the total probability mass m_F + Σ_x x·v'_x to be m/k instead of 1. This involves rescaling each v'_x by a factor of c = (m/k − m_F)/Σ_x x·v'_x.
Next we show that the discretization does not violate the requirements of plausibility. Note that |d/dx bin(x, k, i)| ≤ k. Since we discretize to multiples of 1/m², the discretization alters the contribution of each site to each expected fingerprint by at most k/m²; as the support size is bounded by n, the discretization alters each expected fingerprint by at most n·k/m². The rescaling step likewise does not violate the plausibility conditions. Finally, the last part of the proof bounds the per-unit earthmoving cost, and since it does not use any properties of the Poisson distribution, we can apply the same earthmoving scheme and the same analysis of the per-unit cost. The final cost R(h, h_{v'}) needs to be scaled by m/k, since that is the total amount of probability mass.
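The Lipschitz bound |d/dx bin(x, k, i)| ≤ k that drives the discretization argument can be spot-checked with finite differences (illustrative k; no secant slope of the binomial pmf in x should exceed k):

```python
from math import comb

def binom_pmf(x, k, i):
    # bin(x, k, i) viewed as a function of the frequency x
    return comb(k, i) * x**i * (1 - x)**(k - i)

def max_slope(k, i, steps=2000):
    # largest finite-difference (secant) slope of x -> bin(x, k, i) over (0, 1)
    dx = 1.0 / steps
    return max(
        abs(binom_pmf((j + 1) * dx, k, i) - binom_pmf(j * dx, k, i)) / dx
        for j in range(steps)
    )

# the steepest secant over all i should stay below k
k = 50
worst = max(max_slope(k, i) for i in range(k + 1))
```

The extreme slope is approached at i = 0 near x = 0, where d/dx (1 − x)^k = −k(1 − x)^{k−1} has magnitude k.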

Chebyshev construction
The previous section established that, given a faithful sample (which we obtain with high probability), there exists a plausible point which is very close to the true histogram. In this section, we show that any two plausible points are close in generalized relative earthmover distance. By the triangle inequality, this guarantees that the solution returned by UnseenEst2 is close to the true histogram. To establish the closeness of the histograms, we explicitly construct an earthmoving scheme using Chebyshev polynomials. This is analogous to the earthmoving scheme in [7], replacing all instances of poi(kx, i) by bin(x, k, i). We define binomial Chebyshev bumps, following [7].

Definition 6.6. Let s = 0.2·log(kt), where t = Σ_s p_s. Define g_1(θ) = Σ_{j=−s}^{s−1} cos(jθ) to be an approximation of the delta function, truncated at Fourier degree s. Define a slightly fatter version g_2(θ) = (1/16s)·[g_1(θ − 3π/2s) + 3·g_1(θ − π/2s) + 3·g_1(θ + π/2s) + g_1(θ + 3π/2s)] and, for i ∈ {1, ..., s−1}, define its shifted versions g_3^i(θ) = g_2(θ − iπ/s) + g_2(θ + iπ/s), together with g_3^0 = g_2 and g_3^s(θ) = g_2(θ + π). Let t_i(x) be the linear combination of Chebyshev polynomials such that t_i(cos θ) = g_3^i(θ). We define s + 1 functions, the "skinny bumps", to be B_i(x) = t_i(1 − xk/2s)·Σ_{j=0}^{s−1} bin(x, k, j), for i ∈ {0, ..., s}. We now prove a number of useful properties of this Chebyshev earthmoving scheme.
Lemma 6.8 (Lemma 18 in [7]). For any θ, Σ_{i=0}^{s} g_3^i(θ) = 1, and for any x, Σ_{i=0}^{s} B_i(x) = Σ_{j=0}^{s−1} bin(x, k, j).
Proof. Same as in Lemma 18 of [7]; nothing specific to the Poisson density was used in that proof.
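If, following [7], the fatter bump g_2 is taken to be the (1, 3, 3, 1)-weighted average of shifted copies of g_1, scaled by 1/16s (an assumption here, since the displayed definition above is incomplete), the partition-of-unity identity Σ_{i=0}^{s} g_3^i(θ) = 1 can be verified numerically:

```python
from math import cos, pi

s = 8  # illustrative value of the degree parameter (the text sets s = 0.2 * log(kt))

def g1(theta):
    # truncated delta approximation: sum_{j=-s}^{s-1} cos(j*theta)
    return sum(cos(j * theta) for j in range(-s, s))

def g2(theta):
    # ASSUMED smoothing of g1, as in [7]: (1,3,3,1)-weighted shifts, scaled by 1/16s
    shifts = [(1, -3), (3, -1), (3, 1), (1, 3)]
    return sum(w * g1(theta + q * pi / (2 * s)) for w, q in shifts) / (16 * s)

def g3(i, theta):
    # shifted bumps g3^i for i = 0, ..., s
    if i == 0:
        return g2(theta)
    if i == s:
        return g2(theta + pi)
    return g2(theta - i * pi / s) + g2(theta + i * pi / s)

# partition of unity: sum_i g3^i(theta) should equal 1 for every theta
total = max(
    abs(sum(g3(i, th) for i in range(s + 1)) - 1.0)
    for th in (0.0, 0.3, 1.1, 2.5)
)
```

The identity holds because the g_2 copies sit at 2s equally spaced positions on the circle, so every nonconstant Fourier mode of g_1 cancels and only the constant term survives.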
The next lemma (analogous to Lemma 19 in [7]) bounds the Fourier coefficients of the bumps: each g_3^i(θ) can be expressed as a linear combination Σ_{ℓ=0}^{s} b_i^ℓ·cos(ℓθ) with |b_i^ℓ| ≤ 2/s.
Proof. We decompose g_3^i(θ) into a linear combination of cos(ℓθ) for ℓ ∈ {0, ..., s}. Since cos(−ℓθ) = cos(ℓθ), g_1(θ) consists of one copy of cos(sθ), two copies of cos(ℓθ) for each ℓ strictly between 0 and s, and one copy of cos(0·θ). g_2(θ) consists of (1/16s times) 8 shifted copies of g_1(θ). The shifts change the phases of the Fourier coefficients but not their magnitudes. Sine components may be introduced by the shifts, but since g_3^i is an even function, the sine components cancel out. Since each g_3^i contains at most two shifted copies of g_2, each g_3^i(θ) is a linear combination Σ_{ℓ=0}^{s} b_i^ℓ·cos(ℓθ) with the Fourier coefficients bounded by |b_i^ℓ| ≤ 2/s.
Proof. Same as the proof of Lemma 20 in [7]; that lemma does not involve the Poisson density at all.
Proof. The analysis has two parts. For the first part, we consider the cost of the bumps f_i for i ≥ s + 1, where recall that s = 0.2·log(kt). This is the cost of moving the bin(x, k, i) mass from x to i/k. The per-unit cost of moving mass from x to i/k is |log(xk/i)|, which is upper bounded by xk/i − 1 when i < xk and by i/xk − 1 otherwise. We split the calculation into two parts. For i ≤ xk − 1, the corresponding term equals bin(x, k + 1, i + 1) ≤ bin(x, k, i + 1), where the last inequality holds because i ≤ xk − 1; the rest of the sum then telescopes to bin(x, k, ⌈xk⌉) − bin(x, k, s) = O(1/√s). Thus in total, the bumps f_i for i ≥ s + 1 contribute O(1/√s) to the relative earthmover cost, per unit of weight moved.
Next we analyze the skinny bumps f_i(x) for i ≤ s. The simple case is when xk ≥ 4s. Recall the definition f_i(x) = t_i(1 − xk/2s)·Σ_{j=0}^{s−1} bin(x, k, j). Since xk ≥ 4s > s, the probabilities bin(x, k, j) are increasing in j for j ≤ s, so we can bound Σ_{j=0}^{s−1} bin(x, k, j) ≤ s·bin(x, k, s). Each f_i(x) is then exponentially small in both xk and s, and thus the total cost of the earthmoving scheme, per unit of mass above 4s/k, is exponentially small.
In the remaining case, since Σ_x x·h(x) = m/k and Σ_x h(x) ≤ n, the Cauchy–Schwarz inequality bounds the total earthmoving cost in this regime by O((m/k)·√(n/(m·log kt))); hence we need n = δ·m·log(kt) to ensure that the total cost here is O(m√δ/k).
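Two small numerical checks of this step (with made-up histogram values): the Cauchy–Schwarz inequality in the form Σ_x h(x)·√x ≤ √(Σ_x h(x) · Σ_x x·h(x)), and the fact that substituting n = δ·m·log(kt) into the bound (m/k)·√(n/(m·log kt)) collapses it to (m/k)·√δ, with the log kt factors canceling:

```python
from math import sqrt

# Cauchy-Schwarz: sum_x h(x)*sqrt(x) = sum_x sqrt(h(x)) * sqrt(x*h(x))
#                                   <= sqrt( sum_x h(x) * sum_x x*h(x) )
h = {0.001: 50, 0.004: 20, 0.02: 5}  # made-up histogram: frequency -> count
lhs = sum(cnt * sqrt(x) for x, cnt in h.items())
rhs = sqrt(sum(h.values()) * sum(x * cnt for x, cnt in h.items()))

def regime_cost(m, k, n, log_kt):
    # the bound O((m/k) * sqrt(n / (m * log kt))), with constants suppressed
    return (m / k) * sqrt(n / (m * log_kt))

# substituting n = delta * m * log(kt) collapses the bound to (m/k) * sqrt(delta)
m, k, log_kt, delta = 1e6, 5e5, 12.0, 0.01
n = delta * m * log_kt
collapsed = regime_cost(m, k, n, log_kt)
```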
Finally we put all the pieces together. The total probability mass that needs to be moved is O(m/k). For the bump centers i ≥ s + 1, the same analysis as in [7] shows that the relative earthmover cost is O(1/k^{Ω(1)}). We then consider the first s + 1 = O(log kt) bump centers, corresponding to the skinny Chebyshev bumps. Recall that for these centers c_i, the bump functions B_i(x) may be expressed as Σ_{j=0}^{s} Σ_{q=0}^{s} a_{ijq}·bin(x, k + q, j + q) for suitable coefficients a_{ijq}. The discrepancy contributed by each center involves terms of the form ((j + q + 1)/(k + q + 1))·Σ_x (h(x) − g(x))·bin(x, k + q + 1, j + q + 1), which we bound using the triangle inequality and the first condition of plausibility. Since B < 0.1, this discrepancy is O(max{1, t}/k^{Ω(1)}) for each center c_i, and since there are O(log kt) centers, the total discrepancy is also O(max{1, t}/k^{Ω(1)}). Putting all the pieces together, by the triangle inequality, R(p, q) ≤ R(p, h) + R(h, h') + R(h', g') + R(g', g) ≤ O(m√δ/k).

Proof of Proposition 1.4
For convenience, we restate the proposition in a slightly more general form. Suppose P = {p_s} and Q = {q_s} are two sets of probabilities over S satisfying at least one of:

1. either p_s ≤ q_s for all s ∈ S, or p_s ≥ q_s for all s ∈ S;

2. Σ_s p_s = Σ_s q_s.

Then, for any k > 3, the difference between the expected numbers of distinct variants observed in k alleles drawn from P and from Q is at most k·|Σ_s p_s − Σ_s q_s| + (k/2)·R(h_P, h_Q), where R(h_P, h_Q) is the relative earthmover distance between the histograms corresponding to P and Q.

The first term is trivially bounded by k·Σ_s |p_s − q_s| = k·|Σ_s p_s − Σ_s q_s|, since each unit of probability mass can, in expectation, account for at most k distinct observations. To bound the second term, first note that both the relative earthmover cost and the expected number of distinct elements observed are linear functions of the number of elements of P and Q at each probability value, so it suffices to analyze the earthmover cost and the change in the expected number of distinct elements for a single earthmoving operation: moving c units of mass from probability value x to probability value y. The relative earthmover cost of this operation is c·|log(x/y)|, and the change in the expected number of distinct elements can be computed explicitly; we now show that the ratio of the latter to the former is always at most k/2 for k > 3, by bounding the corresponding ratio of derivatives.
Thus, for x ≤ 1/2, the ratio of derivatives is bounded as follows. The first term of the right-hand side, after dividing by k − 1, can be re-expressed in terms of y = x(k − 1) as (1 − e^{−y}(y + 1))/y, which has a global maximum less than 0.3; the second term of the right-hand side, after the same variable substitution, equals e^{−y}·y·(y + 1), which has a global maximum less than 1. Thus, for x ≤ 1/2, the absolute value of the ratio of derivatives is bounded by 0.3(k − 1) + 1. For x ≥ 1/2, the right-hand side of Equation 1 is 1/x minus some positive quantity, and is hence at most 2. Since 0.3(k − 1) + 1 ≥ 2 for any k ≥ 5, all that remains is to check the cases k = 2, 3, 4, where 0.3(k − 1) + 1 < 2, by hand to confirm that 0.3(k − 1) + 1 is in fact a global bound.
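Both global-maximum claims are easy to confirm numerically: the two functions of y decay for large y, so a grid over a bounded interval suffices.

```python
from math import exp

def term1(y):
    # (1 - e^{-y}(y + 1)) / y : the first term after substituting y = x(k - 1)
    return (1 - exp(-y) * (y + 1)) / y

def term2(y):
    # e^{-y} * y * (y + 1) : the second term after the same substitution
    return exp(-y) * y * (y + 1)

ys = [0.001 * j for j in range(1, 50001)]  # grid over (0, 50]
max1 = max(term1(y) for y in ys)
max2 = max(term2(y) for y in ys)
```

The maxima occur near y ≈ 1.8 and y ≈ 1.6, at roughly 0.298 and 0.84, comfortably below the claimed bounds of 0.3 and 1.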