Theoretical and Simulation-Based Investigation of the Relationship between Sequencing Effort, Microbial Community Richness, and Diversity in Binning Metagenome-Assembled Genomes

Short-read sequencing with Illumina sequencing technology provides an accurate, high-throughput method for characterizing the metabolic potential of microbial communities. Short-read sequences can be assembled and binned into metagenome-assembled genomes, thus shedding light on the function of microbial ecosystems that are important for health, agriculture, and Earth system processes. The work presented here provides an analytical framework for selecting sequencing effort as a function of genome relative abundance. As such, experimental goals in metagenome-assembled genome creation projects can select sequencing effort based on the rarest target genome as a constrained threshold. We hope that the results presented here, as well as GRASE, will be valuable to researchers planning sequencing experiments.

T he reconstruction of high-accuracy short-read sequences into metagenome-assembled genomes (MAGs) is a powerful approach to characterize microbial metabolisms within complex communities (1). The recent creation of ϳ8,000 MAGs from largely uncultured organisms across the tree of life (2), the spatial characterization of microbial metabolisms and ecology across Earth's oceans (3), and the characterization of the potential impact that fermentation-based microbial metabolisms have on biogeochemical cycling in subsurface sediment environments (4) provide a few examples of how MAGs have helped to elucidate the relationships between microbial ecology, microbial metabolisms, and biogeochemistry.
Sampling environmental microbial DNA involves selecting a target environment, sequencing effort, bioinformatic pipeline software and parameters, metabolism characterization software (i.e., for gene identification and similarity searches), and databases ( Fig. 1). At present, there is little information to guide how much sequencing is appropriate to achieve scientific goals in such experiments (5). This gap in knowledge is partly attributed to the unknown structure of target microbial communities. A further challenge is that the accuracy and efficiency of bioinformatic pipelines are often difficult to characterize, and thus obscure the relationship between sequencing effort and MAG retrieval. Recent estimates compiled by Quince et al. (5) suggest that typical metagenomic shotgun sequencing experiments usually sequence between 1 Gbp and 10 Gbp. Researchers require more precise guidance to select an appropriate shotgun sequencing effort in order to maximize information and minimize cost for their specific experimental question.
Illumina sequencing technology is currently the most popular platform to generate metagenomic shotgun sequences (5). Previous investigators established theoretical relationships between contig formation rate (6) and single genome coverage (7) as a function of short-read sequencing effort. On the community level, heuristic approaches have been proposed for evaluating community-level coverage in respect to increases in sequencing effort. For example, it has been proposed to utilize short-read redundancies as a function of sequencing effort to estimate community-level coverage (8). Without a priori knowledge of the microbial community structure, practical application of these methods to estimate MAG retrieval as a function of sequencing effort is hindered.
Here, we present two distinct analyses which constrain the relationship between the quantity of Illumina metagenomic shotgun sequences and the community-level sequence coverage. First, we applied a theoretical model and numerical simulations to estimate the minimum sequencing effort needed to sequence a metagenome to a target fraction of exhaustion. Our theoretical model is unique compared to previous models (6,7) in that we characterize sequencing effort in the context of community structure and consider all sequenceable combinations of k-mers in a metagenome. Second, we performed in silico experiments to simulate the effect of sequencing effort on retrieved MAGs for Illumina sequence data sets. Coupling results from the two analyses provides a framework for investigators to define sequencing experiments in the context of selecting a rarity and fraction of exhaustion for a desired target genome when sequencing a community. The patterns presented here can be used to guide sequencing effort decisions in future sequencing projects when MAG reconstruction is a primary goal.

RESULTS
Theoretical and numerical sequencing effort simulations. Using equation 6, we calculated the number of sequence reads required to sequence four hypothetical microbial communities to exhaustion: a perfectly even, moderately uneven, highly uneven, and a lognormally distributed structured community (Fig. 2a to d). Here, we define sequencing a community to exhaustion as sequencing all possible combinations of DNA k-mers within a genome. Note that, to simplify the model, this model treats identical k-mers (i.e., same DNA sequence) at different locations in the genome as mathematically different k-mers. The expected number of sequence reads to fully sequence the hypothetical communities was linear after log-transforming both expected sequences and metagenome size (unit of unique k-mers), suggesting a powerlaw relationship between metagenome size and the number of sequence reads required to sequence the metagenome to exhaustion (Fig. 2e). For all community structures, the slope of the relationship between log-transformed sequencing effort (sequenced reads) and metagenome size (unique k-mers) was within 1% of 1.06. The structure of the population strongly influenced the number of reads required such that more-even community structures required far fewer reads than less-even structures.
A limitation from using equation 6 for modeling sequencing effort is that it estimates only the sequencing effort for sequencing a metagenome to exhaustion. We circumnavigated this limitation by applying a numerical simulation to estimate the sequencing effort necessary to sequence a metagenome to a fraction of exhaustion for the same community structures analyzed earlier. The numerical simulation was validated by comparing the sequencing effort predicted from the numerical simulation and those from equation 6 when sequencing a metagenome to exhaustion (Fig. 3). Sequencing effort, with units of log of sequence reads, required to fully sequence four different community structures, one with relatively high community evenness (a), relatively moderate community evenness (b), relatively low community evenness (c), and one with a lognormal community structure (d), were predicted using linear regressions (e) and the log of metagenome size (total possible k-mers) as a predictor.
Again, the relationship between metagenome size and sequencing effort fit well to a power law. This observation was independent of the selected target fraction of exhaustion. Nonetheless, an increase in target fraction of exhaustion resulted in uniform increases in sequencing effort (log units) when the community structure was fixed; however, the rate of this increase varied with community structure evenness.
We were interested in quantitatively relating community evenness to sequencing effort. These communities ranged from perfectly even (a ϭ 0, equation 8) to more uneven (a ϭ 0.02, Fig. 4a). Evenness was quantified using the Pielou evenness index, which expresses Shannon diversity relative to the diversity of a perfectly even community (9). The sequencing effort required to characterize genomes depended strongly on both the evenness and the target fraction of exhaustion (Fig. 4b). Again, less-even communities required more sequence reads than more-even communities did. The strength of this relationship also depended on the target fraction of completion. A community with a Pielou evenness of 0.97 required 3 orders of magnitude more sequencing effort to sequence a metagenome to a target fraction of exhaustion than a perfectly even community, while the same community required only about 42% more reads to sequence 50% of the metagenome.
The inherent limitation to the theoretical and numerical constraints presented above is that community structure is not known a priori. Nonetheless, a simple line of rationalization can be applied to circumnavigate this issue for practical applications of our model. Equation 6 constrains expected sequencing effort based on the proportion of the population under consideration. That is, we can limit the model to consider only DNA k-mers associated with a set of genomes that represent a certain fraction of the community. Here, we limit the population such that our model predicts the expected sequencing effort of the rarest genome that we wish to sequence. Inherently, any 17 17 . member more abundant than the selected rarest genome will also be sequenced to the minimum coverage and depth of the selected rarest genome. To determine the rareness of a given genome within a metagenome, the total number of unique k-mers within a genome (equation 2) is scaled by the true abundance of the microbe and divided by the size of the metagenome (equation 7). Thus, the proper application of this rationale requires a desired target fraction of exhaustion, an assumed genome size, and the relative abundance of the rarest genome to sequence. Numerical simulations, like those described earlier, were performed to determine the sequencing effort necessary to achieve a target fraction of exhaustion. The difference here is that these simulations analyzed the sequencing effort necessary to sequence a genome of a certain rareness to a target fraction of exhaustion, whereas the simulations above analyze the effort necessary to sequence the entire community. A generalized additive model (GAM) was built from simulation outputs to extrapolate sequencing effort required for relative abundances of less than 1% as simulations with lower abundances became computationally too intensive (GAM) (Fig. 5). The GAM shows expected sequencing effort required for microbial genome sizes of 0.5, 2, 5 , 10, and 20 Mbp, target genome completeness fractions from 0.5 to 1.0, and genome relative abundances from 1 to 0.0001. The smooth dimensions for target fraction, genome size, and fraction of the metagenome community were 50, 6, and 29, respectively. To normalize for different sequence read length, sequence reads were converted to bases and ranged from 1 ϫ 10 7 to 1 ϫ 10 15 total bases. More bases were required to sequence microorganisms (i) when the genome was relatively rarer in the community, (ii) to achieve better coverage of the genome, and (iii) when average genome sizes were larger.
Rarefying MAG binning as a function of sequencing effort. We rarefied four sequence read data sets to nine different fractions of the complete sequence data sets in triplicate. The rarefied data sets were then assembled and binned for a total of 108 analyzed metagenomes. Raw MAG data are provided in Data Set S1 in the supplemental material. The sum of medium-and high-quality MAGs as a function of high-quality bases empirically fit the Gompertz equation (equation 12; Fig. 6b to e; parameters in Table 1; Data Set S1). The sum of medium-and high-quality MAGs (henceforth referred to as quality MAGs for brevity) reduces sensitivity to whether bioinformatic pipelines tend to lump contigs into fewer, more-complete MAGs versus split them into more, less-complete MAGs. This was important for our analysis due to the large number of metagenomes (n ϭ 108) which required assembly and an unsupervised binning algo- rithm. The large number of metagenomes made manual bin curation impractical during rarefication simulations. For each environment, the quality MAGs as a function of simulated sequencing effort followed a sigmoidal relationship. In order to make the parameters of the fit intuitive to interpret, we fit the data to the Gompertz equation as rewritten by Zwietering et al. (10) (equation 12). Here, A, , and correspond to the maximum quality MAGs assembled with the pipeline, the maximum rate which the quality MAGs form with more sequencing, and the "lag bases," or the bases which must be sequenced before a sufficient number of sequence reads exist to generate overlap and form contigs (6). The predicted maximum quality MAGs varied from ϳ9.6 in the estuary sediment community to ϳ19 in the maize soil community. The predicted maximum rate that the quality MAGs increased varied from ϳ0.9 to ϳ3.8 MAGs Gbp Ϫ1 . Last, the minimum threshold of sequencing necessary prior to seeing quality MAGs varied from ϳ0.6 to ϳ6.1 Gbp. The Tara Oceans data set, where the quality decreased at a sequencing effort of Ͼ20 Gbp, was an exception. For the estuary, maize, and human gut data sets, the quality MAGs yield began to asymptote with increasing sequencing efforts. The Tara Oceans data set followed a similar pattern at Ͻ25 Gbp. However, when the number of sequenced bases was Ͼ25 Gbp, the quality MAGs decreased and became insensitive to sequencing effort.
Constraining MAG rarefaction analyses to community structure. Using the relationship shown in Fig. 5, we can convert sequencing effort to an abundance (rareness) cutoff if we assume a genome size. Genome sizes for genomes in RefSeq v92 (11) have 25th, 50th, and 75th quantiles of 2.73 Mbp, 4.30 Mbp, and 5.14 Mbp, respectively, and provide reasonable constraints for assumptions of genome size. Figure 7 shows quality MAGs retrieved for a genome of a given level of abundance that is sequenced to a target fraction of completeness for the human gut, maize soil, estuarine sediment, and surface ocean microbiomes analyzed earlier. Correlation coefficients for the regressions used to relate log-transformed sequencing effort (in base pairs) to genome relative abundance were R ϭ 1 for all three genome sizes evaluated (1 Mbp, 5 Mbp, and 20 Mbp). Evaluation of 1 Mbp and 20 Mbp define the range of uncertainty in predicting quality MAGs as the true size of genomes is unknown. Unlike the asymptotic response of quality MAGs to sequencing effort ( Fig. 6b to e), quality MAGs increase exponentially as the abundance cutoff decreases (note that the abundance cutoff is on a log scale). The target genome completeness fraction was held at a constant of 0.5 for all regressions, and the sensitivity of quality MAGs will change with respect to the genome abundance cutoff with different values of target genome completeness fraction.

DISCUSSION
We sought to establish evidence-based guidelines for selecting a sequencing effort during shotgun metagenomic sequencing experiments. The model proposed here (equation 6) addresses this goal. Our model establishes an intrinsic relationship between a community structure and the sequencing effort necessary to sequence members with different rareness, or relative abundance in a community (equation 1; Fig. 8).
It is important to emphasize that the proposed model treats individual k-mers within a genome as members of the total population of DNA in an environment (equation 4; Fig. 8). Thus, the relative abundance of any given k-mer in a population is equal to the abundance of the host genome divided by the total number of k-mers in the entire metagenome (equation 7). Summing the probabilities of sequencing all individual k-mers from the same genome (equation 2; Fig. 8), should equal the relative abundance of the genome within the population of genomes. The theoretical model utilizes these individual probabilities of sampling k-mers to determine how much sequencing effort is required to sequence all possible k-mers in a community ( Fig. 2 and 3) and for a member of some rareness in a community (Fig. 5). Practical sequencing challenges associated with strain-level microdiversity, extracellular DNA, and sequencer contamination are not problematic for the proposed model. However, these issues can still lead to problems during assembly. In particular, abundant lineages with a large amount of strain-level microdiversity can be entirely missing from the assembled data set despite a high coverage. Strains could be treated as independent taxonomic units, and ultimately, sequencing effort should be selected based on a target rareness of DNA being sequenced in the sequencer. Proper measures such as removing extracellular DNA (e.g., reference 12) and properly removing contaminated DNA are essential to avoid skewing the sequenced DNA population. Last, in practice, homologous DNA regions across genomes (e.g., shared genes) will be sequenced faster than unique regions. For simplicity of our model, we cannot predict the degree of homologous DNA and simply treat unique loci at DNA as a unique k-mer. Nonetheless, if this information is known a priori, the proposed model can account for homologous DNA. Equation 6 calculates sequencing effort for individual k-mer regions based on their relative abundance (p j ) with respect to all k-mers in the community. Theoretically, the relative abundances for

FIG 7
Quality MAGs (medium and high quality) as a function of the rarest genome sequenced to 50% exhaustion for human gut, maize soil, estuarian sediment, and surface ocean sequence data sets. Sequencing effort was converted to genome relative abundance sequenced to target fraction using the GAM presented in Fig. 5 The translucent shaded areas correspond to uncertainty from the target genome size (1 Mbp, or the lower bound, to 20 Mbp, or the upper bound), while the solid lines correspond to genome sizes of 5 Mbp. k-mers in homologous regions could be treated as an independent "reservoir," or genome (g GM,i ; equation 3), such that the relative abundance of the jth k-mer in this reservoir equals the sum of relative abundances from all host genomes that the homologous DNA actually exists in. The theoretical probability model (equation 6) demonstrated that the sequencing effort to sequence a metagenome to exhaustion was predictable, regardless of community structure (Fig. 2). Furthermore, our characterization of community complexity demonstrates that less-complex communities require less sequencing to sequence all available DNA (Fig. 4). A limitation to equation 6 is that it predicts only the sequencing effort to sequence a metagenome to completion. In practice, assemblers do not require every unique k-mer to generate an accurate contig. As long as sufficient overlap exists between sequenced k-mers, assemblers can generate contigs. We therefore used a numerical simulation to predict the sequencing effort necessary to sequence an individual metagenome to a target fraction of exhaustion. The numerical simulation agreed with the theoretical model when predicting the sequencing effort to sequence a metagenome to exhaustion (Fig. 3). A last analysis explored the semiquantitative relationship between community evenness and richness and the necessary sequencing effort to achieve a target fraction of exhaustion for a metagenome (Fig. 4).
In practice, information about a target community structure may not be available for estimating sequencing effort. The GAM built here predicts the minimum number of sequences necessary to sequence a given fraction of a target genome as a function of average genome size and the relative abundance of the target genome in the community (Fig. 5). Even without knowledge of a target community's structure, the GAM Equation 3 Equation 2 provides a useful constraint for designing whole-genome shotgun metagenomic sequencing experiments. The size of prokaryotic genomes is fairly constrained, with 50% of the genomes in RefSeq v92 spanning from 2.74 Mbp to 5.15 Mbp (11). The limited range in genome size allows for reasonable assumptions about genome size for prokaryotes. We wanted to relate our model to actual sequencing experiments. As such, the subsampling analysis on existing short-read data sets (individually sampled, assembled, and binned) simulated the effect of creating MAGs from data sets of different sizes from different environments. The data sets analyzed here are representative of both the sequencing effort (1 to 10 Gbp) (5) and the types of target environments that bacterial and archaeal ecologists often investigate (13). We want to emphasize that the data sets analyzed here do not necessarily reflect all the variability in characteristics of their parent communities (i.e., these data sets do not reflect global/temporal variations of these environments). Furthermore, a wide variety of software packages are available for all steps of MAG creation pipelines, and the quantity/quality of MAGs will depend on software selection, software configuration, and sequenced environment (Fig. 1) (5). The best practice is to manually curate algorithmically created MAG bins (14). Due to the large number of metagenomes that were assembled in this work (n ϭ 108), manual curation was an impractical approach. As with most sequencing experiments, viral/ eukaryotic DNA is likely included during assembly and binning. Thus, we do not argue that the pipeline used here is objectively optimal for generating high-quality (highcompleteness and low-contamination) MAGs. Rather, our pipeline was configured to minimize contamination (MAGs with Ͻ10%) associated with retrieved MAGs at the expense of reduced completeness (see Text S1 in the supplemental material). For this reason, we reported the number of quality MAGs (medium-and high-quality MAGs) rather than the actual number of MAGs. Reporting quality MAGs reduces bias associated with the generalized approach used with the binning software. As such, we interpret the data sets analyzed here as reflecting general community properties (richness, abundances, and phylogeny) which are generally known to be different from one another (4,(15)(16)(17).
All communities demonstrated similar responses to rarefaction. All subsampled depths were performed in triplicate to account for possible variation in assembly and binning. As sequencing effort increased, there was an initial lag in quality MAGs followed by a rapid increase in quality MAGs, and then diminishing returns at higher sequencing efforts. Previous investigators modeled the response of 16S RNA gene (18)(19)(20), Hill's number diversity (21), taxon-resolved abundance (22), and gene abundance (22) as a function of sequencing effort with rarefaction curves, or collector's curves. The quality MAGs as a function of sequencing effort did not match a traditional collector's curve, which lacks an initial lag. Rather, the data appear sigmoidal. We modeled the data using the Gompertz function (equation 12), because its parameters can be interpreted in terms of quantities that are familiar from microbial growth curves (lag time, growth rate, and maximum density) (10). The fits to the Gompertz function illustrate that there is an optimal sequencing effort for MAG creation efforts corresponding to the upper shoulder of the Gompertz curve ("late log phase"). When sequencing effort is too close to , MAGs bin poorly, and when sequencing effort is too great, the number and quality of MAGs per unit sequencing effort (and therefore cost) are low. We speculate that our choice of pipeline, and specifically the fact that we discarded contigs of Ͻ3 kb, caused poor performance at higher sequencing effort for the Tara Oceans data set. Species-level microdiversity and interspecies homologous DNA can cause "bubbles," which impair assembly in larger data sets (23,24). Improved assembly would likely have yielded more quality MAGs for our assembly of the largest subsets of the Tara Oceans data.
Metagenomic shotgun sequencing experimental designs should be rationally designed such that sequencing effort is selected to capture a desired fraction of a target microbial genome. Investigators should be cognizant of the rarest microbial genome desired to be characterized as well as the degree of characterization of that microbial metagenome when designing a sequencing experiment. To that end, we have built a tool, Genome Relative Abundance to Sequencing Effort (GRASE), to report estimated sequencing effort required to capture a defined fraction of a genome as a function of the relative abundance of the corresponding microorganism in the community and average genome size. This R-based graphical user interface (GUI) app can be accessed online at http://adsteen.shinyapps.io/grase and is archived at http://github.com/ adsteen/GRASE, from which it can be downloaded and run locally.
When the sequence read data sets analyzed here ( Table 2 and Fig. 6) are reevaluated in the context of the relative abundance of a microbial metagenome (g MG ) sequenced to a target fraction of exhaustion (0.5), quality MAGs increase appreciably in response to minor increases in deeper characterization of the community metagenome (Fig. 7). This observation contrasts the quality MAGs response to sequencing effort (in base pairs), where substantial increases in sequencing effort (by contemporary standards) leads to diminishing returns in quality MAGs. It is important to note that abundance cutoff and sequencing effort are interchangeable; however, the responses of quality MAGs to changes in the respective predictor (i.e., base pairs versus abundance cutoff) alter the optics of the collector's curve. Modest increases in sequencing effort contribute minor amounts to extending abundance cutoff. A substantial amount of genomic and metabolic data can be gained from targeting rarer microbes (metagenomic abundances of Ͻ0.005), with the caveat that whole-genome shotgun sequencing technology (as well as computational power) requires significant increases in either the number or length of reads generated per run.

MATERIALS AND METHODS
Defining the microbial metagenome and sequencing probability. Here, we draw on set theory to provide a theoretical grounding for our in silico simulations described below. The expected number of sequences to sequence a fraction of an individual microbe's genome can be modeled with probability theory by defining a community metagenome with set theory. Figure 8a to e provide cartoons illustrating the application of this set theory on a hypothetical microbial population, G. G contains unique genomes (g) with finite abundances (n). The definition of microbial species is somewhat contentious (25). Here, we define g as a genome that is unique in length and composition for all loci compared to all other genomes in a community. As such, the species richness (s; unique g) of G will vary on how it is defined and should be consistent with the objectives of the investigator. In the example communities (Fig. 8a to e), s is 6 and the total n is 13. Thus, G is represented as follows (Fig. 8a): G ϭ ͕n1g1, n 2 g 2 ѧ n s g s |n ʦ N͖ (1) where s is the species richness. When characterizing G via shotgun metagenomics, the ith genome, g i , can be sequenced at K unique sections given a characteristic read length, k, and average genome size, l, in number of base pairs (Fig. 8b). Thus, the number of possible k-mers, K, associated with the ith genome, g i , within G is equal to: Note that equation 2 considers homologous DNA as unique k-mers. This is for model simplification. From equation 2, the metagenome, g MG , for g i is defined as the set of all k-mers (Fig. 8c) or: where the subscripts for g i represent a given k-sized read spanning from an arbitrary starting base pair to the arbitrary starting base pair plus k. By substituting g MG,i into all g for equation 1 (Fig. 8d), the G MG ϭ ͕n1gMG,1, n 2 g MG,2 ѧ n s g MG,s |n ʦ N͖ (4) while the population of k-mers in the metagenome, G MG (Fig. 8e), is represented as: From equation 5, one can determine the cardinality, or the total number, of k-mers associated with G MG (expressed as |K MG |). To an effect, |K MG | is analogous with "metagenomic richness" of an environment. When attempting to fully sequence G MG using shotgun metagenomics, we assume that sampling events (sequence reads) are independent and are sampled with replacement (26).
The probability of sequencing all elements in G MG reduces to a coupon collector problem (27) by making the above assumptions. Using the general functional form for calculating expected samples for sampling all unique elements in a set (equation 13b in reference 8), one can predict the number of sequences necessary to sequence all elements in K MG , such that the expected number of sequences, E(G MG ), is: where j is a given element within K MG , t is the number of sampling events, and p j is equal to the proportion of the jth k-sized read within a given population of k-sized reads. p j can be expressed as follows: where n i is the respective abundance for the species whose MAG contains the jth k-sized read within K MG , and |G MG | is the cardinality of G MG , or the total number of k-sized reads in the metagenome, G MG . Modeling expected sequences. Equation 6 provides an estimate for the total number of sequences to sequence all K MG . The influence of increasing species richness (i.e., s in equation 1) on the expected number of sequences was tested for four hypothetical communities. The first community had an even structure such that all k-mers were equally distributed across all K MG . In the second community, 90% of the k-mers were equally distributed in 50% of K MG , and the remaining 10% of the k-mers were distributed equally across the remaining 50% of K MG . This community represented a community with relatively moderate species evenness. In the third community, 90% of the k-mers were equally distributed across 10% of K MG , and the remaining 10% of the k-mers were distributed equally across the remaining 90% of K MG . This community represented a community with relatively low species evenness. The last community had 10 equally sized groups. The abundance of the k-mers in each group was based on the function form of a lognormal community (28) which has been observed in microbial populations (e.g., references 21 and 29), such that: S ͑ R ͒ ϭ S 0 e Ϫa 2 R 2 (8) where S 0 was treated as the maximum relative of abundance (S 0 ϭ 1), a was the inverse width of the distribution, R was treated as the positive octave range spanning 0 to 9, and S(R) represented the abundance for a given group. For the lognormal abundance distribution in Fig. 2d, a was set at a value of 0.2. Each hypothetical community started with |K MG | ϭ 1 ϫ 10 6 . |K MG | incrementally increased at 10 equally spaced, linear steps to a maximum of |K MG | ϭ 1 ϫ 10 8 . As |K MG | increased, all community structures remained constant. Graphical representation of rank abundance in Fig. 2a to d was normalized by a given |K MG | to reflect that populations retained the same structure even as population size varied. We defined a normalized rank abundance r n such that r n ϭ r s (9) where r and s are untransformed rank abundance and richness, respectively. For each community, at each step, the expected number of sequences was calculated using equation 6. The expected number of sequences as a function of |K MG | were modeled with linear regressions. Equation 6 gives the expected number of sequences required to sequence any size community to exhaustion. Numerical sequencing simulations were performed to determine the number of sequences necessary to sequence a subset of all k-mers (K MG ). These numerical sequencing simulations were applied to four hypothetical community structures described above. Numerical simulations were performed such that |K MG | ϭ 3 ϫ 10 7 , 4 ϫ 10 7 , 5 ϫ 10 7 , 7 ϫ 10 7 , 9 ϫ 10 7 , and 1 ϫ 10 8 . During each of these simulations, the parameters read length (k) and average genome size (l) were set at 100 and 1 ϫ 10 6 , respectively, for all g. Random elements from K MG were selected with replacement to simulate sequencing events. Numerical simulations were performed until the percentage of |K MG | sequenced was 50%, 70%, 90%, 95%, 99%, or 100%. A weight distribution was applied to elements in a given K MG . The weight distribution biased sequencing to reflect the relative abundances of the four hypothetical communities described above. The percentage of |K MG | sequenced was evaluated every 1 ϫ 10 7 sequences. Numerical simulations were performed in triplicate for all |K MG | and all target fractions of |K MG |.
We explored the influence of community evenness on required sequencing effort by performing sequencing simulations on six different lognormally distributed communities. The numerical sequencing simulations followed the simulations described above. The six lognormal communities were modeled such that each community had S 0 ϭ 1, R ϭ 10, and |K MG | ϭ 1 ϫ 10 7 . The values of a for the six lognormal