Measurement Error and Resolution in Quantitative Stable Isotope Probing: Implications for Experimental Design

One of the biggest challenges in microbial ecology is correlating the identity of microorganisms with the roles they fulfill in natural environmental systems. Studies of microbes in pure culture reveal much about their genomic content and potential functions but may not reflect an organism’s activity within its natural community. Culture-independent studies supply a community-wide view of composition and function in the context of community interactions but often fail to link the two. Quantitative stable isotope probing (qSIP) is a method that can link the identity and functional activity of specific microbes within a naturally occurring community. Here, we explore how the resolution of density gradient fractionation affects the error and precision of qSIP results, how they may be improved via additional experimental replication, and discuss cost-benefit balanced scenarios for SIP experimental design.

relative to amplicon sequencing and the amount of data per sample may become a computational limitation.
The majority of SIP studies published in the past 20 years targeted marker genes to identify substrate assimilators, usually with 16S rRNA gene-based analysis. However, focusing only on 16S rRNA misses a wealth of genetic information. A handful of more recent studies have combined SIP with metagenomics or metatranscriptomics to investigate genomic potential and actively expressed genes by functional guild (23)(24)(25), but the combination of SIP with shotgun sequence analysis quickly becomes limiting both financially and computationally. Some investigators have tried to downsize these burdens by sequencing only highly labeled fractions (12), pooling density fractions, or sequencing unfractionated DNA and matching assembled genomes to SIP-identified substrate assimilators (26)(27)(28)(29)(30). One additional way to address the cost issues inherent to metagenomic analysis (e.g., larger amounts of DNA required, higher sequencing costs, and higher computational complexity) is to use only the minimum number of density fractions that are needed to yield a desired level of sensitivity. In addition, adding more sample replicates but collecting a reduced number of fractions (gradient resolution) could lead to higher accuracy with similar effort.
Using multiple SIP data sets (Table 1), we tested the effects of fraction size and sample replication on the robustness and sensitivity of qSIP. We combined (in silico) density fractions and measured the effects of lower gradient resolution on per-taxon density shifts and unlabeled weighted mean density. We also calculated the repercussions of reducing the number of density fractions in replicated and unreplicated data sets from both marine and terrestrial microbial communities. Our results show that reducing the gradient resolution from an average density fraction size of 0.002 g ml Ϫ1 (ϳ50 fractions) down to 0.011 g ml Ϫ1 (ϳ9 fractions in a 5.1-ml tube) yields comparable shift detection and a detection limit of 0.005 g ml Ϫ1 (equal to 9% enrichment with 13 C). We discuss using the small inherent variability between replicates as a way to define a shift detection limit and show that this inherent variability is comparable between replicates centrifuged together (within spin) and between replicates centrifuged separately (between spins). Finally, we stress the need for internal standards that can be spiked into each sample, and evaluate the statistical and financial benefits of different experimental design options.

RESULTS
Variability of taxon density shifts is greatest for rare OTUs. In SIP, taxon density shifts, or changes in weighted mean density (WMD) due to incorporation of a stableisotope-labeled substrate, are the basis for calculating isotope enrichment. In silico qSIP analyses have shown that these shifts are detectable in moderately to highly abundant operational taxonomic units (OTUs) (Ͼ0.1% relative abundance) (10). Using unlabeled sample experimental data from data set 2, we first examined variation in qSIP-derived density estimates for both rare and common OTUs. We found that OTU abundance (log 10 transformed) was positively correlated with the proportion of density fractions in which that OTU was detected (Pearson's r ϭ 0.704, P Ͻ 0.0001). In other words, the most abundant OTUs tended to occur in all (or nearly all) fractions of the density gradient, whereas the least abundant OTUs tended to occur in fewer density fractions. Variability of the unlabeled WMD was lowest for common OTUs and greatest for rare E. coli and P. putida 4-6 30 5 Mock community 9 48 OTUs (Fig. 1). WMD variation (expressed as 2ϫ standard deviation of the WMD for a given OTU) correlated negatively with the proportion of density fractions in which that OTU was detected ( Fig. 1; Pearson's r ϭ Ϫ0.560, P Ͻ 0.0001). Similarly, WMD variation also correlated negatively with log 10 -transformed OTU abundance (Pearson's r ϭ Ϫ0.217, P Ͻ 0.0001). Density shifts are consistent across medium-to-high gradient fraction resolution. While it is widely assumed that the detection limit in a SIP experiment will depend on the number of fractions collected from each density gradient (18), to our knowledge, this has not been comprehensively tested. We first examined how fraction resolution affects density shift, using the 100 most abundant OTUs (from all fractions combined) in an unreplicated data set of OTUs from naphthalene-enriched seawater (data set 1) where the DNA had been divided into 50 fractions, of which 45 had quantifiable DNA. Consecutive density fractions were consolidated in silico (every 2, 3, 4 fractions, etc.) to represent a range of fraction sizes spanning 0.002 to 0.028 g ml Ϫ1 (4 to 50 fractions, assuming a standard 5.1-ml tube). We found that the estimated magnitude of OTU density shifts remained consistent at fraction sizes from 0.002 up to 0.011 g ml Ϫ1 , expanding previous results that demonstrated this trend with fractions of 0.003 to 0.008 g ml Ϫ1 ( Fig. 2A) (10). This indicates that fractionating the sample into more than nine fractions had negligible additional effects on OTU density shifts.
The magnitude of density shift data can also be represented as a relative error, which we define as the density shift in resolution r (r Ͻ original number of fractions) compared to the density shift realized with maximum resolution. As relative error increases, power declines, so the likelihood that taxa are misassigned as nonisotope incorporators increases. We found a positive linear correlation between fraction size and mean relative error (R 2 ϭ 0.95; Fig. 2B). Additionally, the increase in the mean relative error is accompanied by an increase in its variation.
We next assessed the correlation between the number of fractions and the relative standard deviation of the WMD in a study with replication (data set 2, n ϭ 5, preincu- Mean proportion of fractions occupied 2 x standard deviation of WMD (g ml 1   )   1<x 10  10<x 1e2  1e2<x 1e3  1e3<x 1e4  1e4<x 1e5 OTU abundance (16S rRNA copies Pl 1 )

FIG 1
Effect of OTU abundance on the standard deviation of density measured using qSIP. Points represent individual OTUs from a qSIP experiment (n ϭ 320, data set 2), where H 2 18 O was added to a forested peat bog soil. Only unlabeled (H 2 16 O) experimental replicates incubated at an intermediate temperature (n ϭ 5) are plotted here. We calculated the mean proportion of density fractions in which each OTU occurred (x axis) as well as 2ϫ standard deviation of WMD for each OTU (y axis). OTUs are colored by their mean unfractionated abundance across all unlabeled replicates (units, 16S rRNA gene copies l Ϫ1 ). The variability of qSIP-calculated unlabeled weighted mean density (WMD) was lowest for common OTUs and greatest for rare OTUs. Highly abundant OTUs tended to occur in the majority of density fractions. bation temperature of 15°C); here we found the relationship is not linear. Instead, between 2 and 11 fractions, every additional fraction reduced the standard deviation exponentially, but for 12 fractions or more, the proportional differences are much smaller and linearly correlated with the number of fractions (Fig. 3A). Mean fraction size (g*ml −1 ) Mean density shift (g*ml −1 ) 50 9 4

A B
Mean fraction size (g*ml −1 ) Low variance influences minimum detectable density shift. In a SIP experiment, it is important to identify statistically significant density shifts between samples treated with a labeled substrate versus control samples (unlabeled substrate). To define a minimum detectable effect size, we calculated the inherent variability of WMD of unlabeled DNA from taxa in data set 2-a well-replicated SIP experiment (n ϭ 10, time zero). Reducing the resolution of the density gradient may impact this variability. We evaluated this by merging and averaging data from an increasing number of consecutive fractions. Our initial analysis with medium resolution (fraction size 0.007 g ml Ϫ1 ; e.g., 18 fractions in a 4.7-ml tube) revealed that the WMD of abundant taxa varied little between replicates (Fig. 3B). The WMD per OTU was calculated over 10 replicates. The resulting WMDs varied around a median of 0.004 to 0.007 g ml Ϫ1 at gradient resolutions varying from 0.007 to 0.034 g ml Ϫ1 (3 to 18 fractions). With progressively fewer fractions (decreasing gradient resolution), variation in WMD remained statistically indistinguishable for fraction sizes between 0.007 and 0.027 g ml Ϫ1 (4 to 18 fractions) but increased significantly (analysis of variance [ANOVA], Tukey 95% confidence level) with a fraction size of 0.034 g ml Ϫ1 (3 fractions in a 4.7-ml tube) (Fig. 3B). This variability of the WMD as a function of fraction size can be used to choose an acceptable minimum shift in density which a researcher would like to detect with a desired confidence level.
We found the range of relative error increased with larger fraction sizes. To explore how this variation affects the detection of substrate incorporators, we calculated the putative sensitivity (proportion of true positives) and specificity (proportion of true negatives) as a function of the shift detection threshold for all gradient resolutions in data set 1. The shift detection threshold can be thought of as the smallest difference between labeled and unlabeled WMD that would be considered a significant density shift. We performed these calculations using the assumption that a density shift higher than a specific threshold in the original experimental setup (data set 1, 50 fractions) represented significant enrichment. Therefore, all our comparisons are relative to the originally measured results, which likely contain some inherent error. As the sensitivity and specificity of qSIP have been shown to be high, we assume this error is very small. Both sensitivity and specificity parameters were stable down to a fraction resolution of 0.011 g ml Ϫ1 (9 fractions), using a shift detection threshold of 0.005 g ml Ϫ1 or higher. Specificity was Ͼ95% for all gradient resolutions at a threshold of Ͼ0.003 g ml Ϫ1 , but sensitivity was more impacted by a gradient resolution of Ͼ0.013 g ml Ϫ1 (Fig. 4).
We also tested how fraction number affects error in the estimates of isotope incorporation using a replicated experiment (data set 2, SPRUCE [Spruce and Peatland Responses Under Changing Environments] soil). Our analyses show that confidence intervals widen as the number of density fractions declines (see Fig. S2 in the supplemental material). In other words, as the number of density fractions declines, the likelihood of detecting a given level of isotope assimilation declines as well.
Replication, statistical power, and detection thresholds in SIP experiments. In experiments with replicate samples, the number of replicates and desired statistical power influence the detection limit. When designing an experiment, it is valuable to use the desired statistical power and desired enrichment detection threshold in advance, to decide on the number of replicates needed. Both parameters depend on the scientific purpose of the study. The higher the power and threshold, the fewer replicates are necessary (Fig. 5).
Variation in mean weighted density is comparable within and between spins. To assess the impact of spin-to-spin and tube-to-tube variability on WMD, we tested the variability of WMD using pure culture DNA (data set 4). DNA extracted from pure cultures of unlabeled Escherichia coli and unlabeled or 100% 13 C-labeled Pseudomonas putida was aliquoted into replicates, ultracentrifuged in CsCl, and fractionated in an automated pipeline. The known genome differences in %GC of these organisms (i.e., distance between their peak densities) allowed us to calculate their WMD, even when both were unlabeled. When we compared the WMD mean of E. coli replicates between spins, we found the variation was comparable to the range of WMDs measured within a spin (up to 0.004 g ml Ϫ1 ) (Fig. 6).
We also tested within-and between-spin effects with a genomic mock community (data set 5) by centrifuging triplicates of a DNA mixture comprised of four organisms with distinguishable genomic %GC in equal quantities. In this test, between-spin variation was 0.0013 to 0.0025 g ml Ϫ1 , whereas within-spin variation ranged up to 0.0056 g ml Ϫ1 (see Data Set S1 in the supplemental material [raw data]).
Using genomic mock communities to explore density-to-GC content conversion. One potential strategy for decreasing the costs and labor needed for meta- OTUs from a naphthalene-enriched seawater SIP study (data set 1) and plotted versus fraction size. The colors represent detection limit thresholds (g ml Ϫ1 ). Here, we use the phrase "true positive" in the statistical sense, where the power of a test refers to its statistical sensitivity, its true positive rate, or its probability of detection. "True negative" refers to the proportion of actual negatives that are correctly identified. For example, the minimum number of replicates necessary to detect a density shift corresponding to 5 atom% excess 18 O and a statistical power of 0.60 would be 4. This plot is calculated for 18 O based on data set 2 (SPRUCE) but can easily be converted for carbon or nitrogen isotopes. Ten atom% enrichment (at percent excess [APE]) due to incorporation of 0.0065 g ml Ϫ1 18 O-labeled substrates corresponds to 12 APE with the same incorporation of 13 C substrates, or 31 APE with 15 N substrates (50% GC).
qSIP Sensitivity and Reproducibility Analysis genomic-SIP experiments would be to sequence only the labeled samples and calculate an approximate density shift using the %GC of the genomic bins. The conversion of density to the GC content of a genome is a linear function of the unlabeled ( 12 C) weighted mean. There is a canonical equation describing this function (31)(32)(33), but it has also been determined empirically in the past with a ladder of three bacterial taxa with different GC contents (9). If this equation were identical for each gradient in a SIP experiment (as is usually assumed), unlabeled replicates of the same organism should have the same weighted mean in every run. However, as shown here and previously, this is not the case (9, 16). To address this variation, we hypothesized that an external standard could help to calibrate the conversion of %GC to the mean weighted density for each sample. To test this, we ran nine replicates of a mock community (data set 5) with a wide range of known GC content, generated a calibration curve from each, and fitted a linear equation to it. In this mock community, which consisted of highmolecular-weight genomic DNA from four bacterial taxa, we found mean weighted density and %GC were highly correlated with a linear relationship (n ϭ 9, R 2 Ͼ 0.94) but that the slopes and intercepts for each individual replicate varied (Table 1, Fig. S3). We found similar results when using the peak (highest) buoyant density per genome rather than the WMD. There was a significant difference between the WMD as calculated according to the canonical equation ϭ ͑0.098͓G ϩ C͔͒ ϩ 1.66 (32) and the observed mean of WMDs per genome over all replicates (n ϭ 9, paired t test, P ϭ 0.02). For example, using the canonical equation, the GC content of the four mock community members in replicate 1 are 38%, 46.5%, 62%, and 72.1%, respectively. These values deviate by several percent from the known GC content of these genomes ( Table 2).

DISCUSSION
SIP is a powerful tool for investigating taxon-specific microbial functions in complex assemblages. Like any method, SIP-derived measurements have some inherent variability which can be managed-within limits-to address research questions of interest. Our results show how experimental design (number of replicates, number of fractions, use of standards) can be customized to achieve the goals of a particular research question and the level of sensitivity/detection required. Despite the wide use of SIP, there has been relatively little benchmarking of interpretation of its results. Here, we attempt to shed light on some practical aspects of the method and discuss how to adjust it to maximize results within the limits of labor and cost.

Spin1
Spin2 Spin3  Within a replicated qSIP experiment, it is important to evaluate statistical powerthe probability of detecting a given level of isotopic enrichment. Yet, power is evaluated infrequently, because avoiding type I errors is typically prioritized above avoiding type II errors, reflecting the common practice of identifying thresholds for alpha (0.01, 0.05), while beta (␤) thresholds remain unreported. This perspective makes sense in some contexts, e.g., incorrectly inferring that a medical treatment is effective is often viewed as more hazardous than concluding it is not effective, when in fact it is. However, in contrast, applying the precautionary principle to the impact of a potential pollutant or toxin, it is in some cases wiser to avoid type II errors over type I (i.e., erring on the side of environmental caution [34]). For a SIP experiment, power analysis involves evaluating the trade-offs among several parameters: (i) the effect size of interest, which for qSIP studies is the density shift (or amount of isotope incorporation) that the researcher wishes to detect (this can be thought of as the minimum detectable difference); (ii) the acceptable ␣ value, or acceptable probability of type I error (for qSIP, a type I error occurs when the researcher concludes that there is isotope incorporation, when in fact none occurred); (iii) the acceptable ␤ value, or acceptable probability of type II error (for qSIP, a type II error occurs when the researcher infers "no isotope incorporation," when in fact some isotope incorporation actually occurred); and (iv) the number of true, independent replicates (sample size) used in the experiment. Power is defined as 1 -␤ and is the probability that a true difference will be detected in a given experimental design. Applied to qSIP, power analysis can show how increasing the number of replicates increases the probability of detecting a given level of isotope incorporation. Power analysis can also show, at a constant level of power, how increasing the number of replicates decreases the threshold level of isotope incorporation that can be detected-making it more likely to detect taxa that are slower growing, or perhaps are secondary consumers in a substrate trophic chain. Last, power analysis can clarify the trade-offs between type I and type II errors and gives useful context for interpreting results from qSIP experiments.
In any SIP experiment, the detection of density shifts is a critical goal. However, variability in a number of factors, including fraction resolution and replication, can limit this ability. In addition, variation of the weighted mean density (WMD) of the same unlabeled taxon over numerous replicates, observed even between samples processed simultaneously and identically, implies that there are physical and/or chemical factors unrelated to genomic %GC that can affect SIP metrics. The results of the unlabeled WMD variation and interpretation error analyses we conducted imply that when using qSIP, researchers can expect that a minimum detectable effect size of 0.005 g ml Ϫ1 in unreplicated 13 C experiments with would result in low type I and type II errors. This cutoff applies when a DNA density gradient has been divided into at least four density fractions, but we note that precision will increase if additional fractions are utilized. Adding additional replication would lead to increased statistical power using the same detection limit. However, we found that density shift estimates of taxon-specific isotope incorporation are broadly robust across a wide range of fraction sizes. For example, the relative error of high incorporators only varied by an average of 0.02% in shift from fraction size 0.002 to 0.011 g ml -1 (n ϭ 5, 50 to 9 fractions in a 5.1-ml tube).
Variable WMD values mean that the same organism may peak at a density anywhere within a specific range. Moreover, the unlabeled mean and labeled mean of a single replicate can deviate in different directions, increasing the observed density shift and leading to a type I error. We show that the potential for such deviation increases at low gradient resolution. This may explain why variation in the WMD relative error increases as resolution decreases.
The physics of DNA behavior within a density gradient during centrifugation and processing will affect the number of fractions where OTUs can be detected. For example, the long tails of DNA concentration versus density distributions (e.g., see Fig. S1 in the supplemental material) have been attributed to DNA smearing along the centrifuge tube walls (10). It stands to reason that the more abundant an OTU, the higher its representation in this smear will be. In addition, the detection limit of an OTU affects the number of fractions where it can be detected. We found that when inspecting presence/absence of OTUs in all fractions, OTU abundance was positively correlated with the proportion of fractions in which it was present. Abundant OTUs appear in almost all fractions, whereas rare OTUs appear in relatively few fractions, and in some cases only one fraction. These rare OTUs also tended to have highly variable WMD values.
Low gradient resolution in combination with higher variability can lead to false classification of borderline taxa as isotope incorporators when in truth, they are not (type I error). For example, a recent qSIP model simulation showed that the rate of true negatives (specificity) and true positives (sensitivity) is 88% and Ͼ90%, respectively, with no effect of fraction size (in the range of 0.003 to 0.008 g ml Ϫ1 ) (10). When we examined the relative specificity and sensitivity of qSIP using real unreplicated data over an even wider range of fraction sizes, we found that gradient resolution and shift detection limit strongly affect specificity and sensitivity. However, the reliability of qSIP remained extremely high as long as the detection limit was 0.005 g ml Ϫ1 or higher (sensitivity Ͼ 90% and specificity Ͼ 95%) regardless of gradient resolution. This detection limit is comparable to 2 standard deviations of unreplicated unlabeled WMD, further supporting this threshold. This means that if one decides to add experimental replication, even as few as 3 replicates, one can increase the power of this analysis to have very low type II error (assuming a 0.005 g ml Ϫ1 detection threshold). Thus, our analysis can be used for experimental design based on the desired statistical power.
SIP is an expensive method (18). All of the experimental design factors we assessed (number of replicates, number of fractions, distribution of fractions across tubes/spins, internal standards) directly affect the cost of a SIP experiment-in terms of supplies, enriched isotope labels, labor, and sequencing. For many researchers, processing only 9 to 12 density fractions (as our analysis suggests is sufficient) could significantly mitigate labor and sequencing costs. Cost savings will be considerable for amplicon-SIP studies, and even more so for metagenomic SIP (MG-SIP), given the high cost of metagenomic analyses.
The combination of metagenomics and SIP (26,35) involves sequencing metagenomes instead of amplified marker genes from density fractions, and assembly of genomic bins from those metagenomes. But this approach, while powerful, remains rarely used. Early obstacles such as low DNA yield, multiple displacement amplification (MDA) biases, and low throughput (5) have now been resolved by improvements in sequencing platforms and library preparation kits. However, to date, most MG-SIP studies have sequenced only a few heavy fractions, or at best, two or three light fractions as well (27,28,36). While analyzing so few fractions saves money, doing so limits detection of substrate incorporators in several ways. (i) Choosing which fractions to sequence relies on density shift of the entire community, which may be subtle. (ii) Low-GC genomes, even if highly enriched, may not become heavy enough to reach the heavy fractions. (iii) Low-GC genomic islands may not be well covered and therefore not assembled, leading to increased genome fragmentation (Fig. S4, data set 3). (iv) Depending on the density of the fractions sequenced, high-GC genomes may be highly represented regardless of enrichment. (v) As we and others have demonstrated, abundant organisms can be found in all density fractions (10) and may be erroneously classified as incorporators. Sequencing all fractions in MG-SIP studies should overcome all of these obstacles, even if only a small number of fractions are used (37). While low gradient resolution (Ͻ4 fractions) may limit detection to only highly enriched taxa, medium gradient resolution (and the decreasing price of shotgun sequencing) should keep the financial and computational costs of MG-SIP manageable while maintaining acceptable detection limits. Specifically, our data suggest that with circa 10 density fractions (fraction size 0.011 g ml Ϫ1 ), the increase in error compared to higher resolution is minor.
Reducing the number of fractions from ϳ40 (typical for many high-resolution SIP studies [16,17,21,38]) to only 10 will substantially impact the feasibility of a SIP study, particularly for MG-SIP. As an example, we compared the resources and yield of a simplified experiment. For a high-resolution MG-SIP study (40 fractions ϫ labeled/ unlabeled ϭ 80 metagenomes) with minimal replication (3), one could easily need to sequence 240 metagenomes. Assuming shallow sequencing of 2 Gbp per metagenome, this translates to 480 Gbp of data that would need to be stored, manipulated, and assembled and total costs (sequencing, library prep) approaching $50,000. This highly simplified experiment could easily balloon if experimental treatments or time points were added. In comparison, our findings suggest one may reduce the number of fractions to only ϳ10 (0.011 g ml Ϫ1 ) and keep to a relative error lower than 0.005 g ml Ϫ1 . We show that for fraction sizes below 0.011 g ml Ϫ1 , the mean relative error increases, as does its variability (in both replicated and unreplicated data sets). However, this increase in variability can be mitigated by reallocating effort toward replication. Indeed, reducing the number of fractions by just a factor of 2 allows for doubling the number of replicates without additional costs, while increasing the statistical power of downstream analyses.
All commonly used SIP protocols rely on a linear conversion between the peak density (as a measure of the central tendency of the distribution) of an unlabeled genome and its GC content (4,8,29,32,33). However, the inherent variations that our results illustrate suggest that GC content cannot be accurately converted to density with the canonical equation (32), since the relationships between calculated %GC and known GC content have a very high R 2 but various slopes and intercepts. Instead, researchers may need to create a calibration curve of %GC/WMD (or peak density) per tube, using an internal standard. Accurate conversion between WMD and %GC is particularly important for MG-SIP experiments where only the heavy fractions of labeled samples are sequenced. Once genomic bins are assembled, their %GC can be converted into a theoretical unlabeled WMD which can be used to calculate the density shift, and thus the enrichment level of those bins. It is conceivable that a reliable calculation like this could obviate the need to analyze most of the unlabeled controls and is thus another experimental design customization that can save substantial labor and costs.
If one were to combine MG-SIP with an internal %GC/density ladder, one could significantly decrease the number of unlabeled controls sequenced and use the calibration curve from the labeled tubes with the GC content of genome bins to calculate unlabeled WMD and density shift values. The internal standard should be informatically separable from the sample. This could be done by creating a mock community of organisms which are highly unlikely to appear in the sample (e.g., in a soil sample, use genomes of strictly marine organisms) and could be customized per experiment. Due to the variation within spin, an external ladder (a mock community in a separate ultracentrifuge tube) would be insufficient. However, it could be argued that finding a qSIP Sensitivity and Reproducibility Analysis suitable set of nonindigenous genomes distinguishable from a highly diverse environment such as soil may prove difficult. Alternatively, if highly complete genomic bins can be assembled, then their %GC would be more reliable, and their WMD can be calculated from the gradient. Such bins could be used as an internal standard for generating a WMD-to-GC formula. As the generation of high-completion bins could only be assessed post hoc, we would still recommend the use of an internal standard.
Conclusions. The inherent variability in qSIP stems from its many steps: replicate variation, bottle effects during incubation, extraction efficiencies, tube-to-tube variation in the gradient, amplification bias, strain heterogeneity, among-treatment shifts in community composition, and OTU clustering errors (for marker genes). Quantifying the sensitivity of qSIP to these factors will improve existing amplicon-based qSIP techniques and facilitate efficient ways of extending SIP to more ambitious applications, such as metagenome-assembled genome-based SIP. The analyses we present here illustrate how customizing experimental design factors can allow researchers to achieve ideal levels of sensitivity and specificity while keeping costs in check.

MATERIALS AND METHODS
We used five unpublished data sets from different ecosystems for our in silico analyses: a highresolution unreplicated SIP study of [ 13 Table 1 contains the number of density fractions and replicates per data set. As these experiments were performed by different laboratories, using slightly different protocols, we describe their SIP pipelines separately (see Text S1 in the supplemental material). All postsequencing steps were performed identically for all 16S rRNA operational taxonomic units (OTUs  2 18 O. For full details, see Text S1. Data set 4. E. coli and P. putida cultures. A 1:1 mix of DNA from unlabeled E. coli and 0.5 g DNA from unlabeled (n ϭ 8) or labeled (n ϭ 12) P. putida was processed on an automated SIP pipeline at the Joint Genome Institute. For full details, see Text S1. Data set 5. Genomic mock communities. DNA for a genomic mock community was purchased from ATCC, resuspended in Tris-EDTA buffer, mixed in equal proportions, and aliquoted into replicates. The mock community was composed of high-molecular-weight DNA of Thermoanaerobacter pseudethanolicus, Bacillus licheniformis, Bifidobacterium longum subsp. infantis, and Streptomyces violaceoruber (see Table S1 for accession numbers), and these genomes were selected for their distinct %GC content (34.5%, 46%, 60% and 73%, respectively). For full details, see Text S1.
16S rRNA gene-based microbial community composition in individual fractions. Each data set was processed separately. For data sets 1 and 2, reads were quality trimmed using Trimmomatic version 0.33 (40) with parameters set at LEADING:20 TRAILING:20 SLIDINGWINDOW:15:25. The resulting reads were merged using Usearch version 7 (41), clustered in mothur following the MiSeq standard operating procedure (SOP) (42)(43)(44). To track individual operational taxonomic units (OTUs) in separate density fractions, the relative abundance of the OTU in each fraction was multiplied by either the concentration of DNA in the same fraction (seawater, data set 1; as in reference 20) or by the total 16S rRNA gene copy number (SPRUCE soil, data set 2; as in reference 9).
Density shifts. The weighted mean density of each OTU in labeled and control samples was calculated by multiplying density by OTU abundance (amount of DNA/16S rRNA gene copies ϫ OTU relative abundance) within each fraction, summing the products and dividing them by the sum of abundances of the OTU across all fractions. The density shift was calculated by subtracting the weighted mean density of the OTU in the unlabeled control from the its weighted mean density in the labeled treatment (9). Density shifts were then plotted in R (45) for the 100 most abundant OTUs in each sample. The relative error was calculated as the difference between the density shift per OTU in resolution r minus the shift per OTU in the original high-resolution r max .
Sensitivity analysis. We used the variation in the weighted mean density (WMD) of the same OTU in replicated samples to assess the minimum density shift that can be detected. The leading principle here was that a shift smaller than the natural variation could not be statistically significant.
Using unlabeled replicated (n ϭ 10) samples from data set 2 time zero (SPRUCE soil, with five subsamples from each of two preincubation temperatures [15°C and 45°C]), we calculated the WMD for each OTU in each replicate and the standard deviation of the 10 resulting means for each of the 100 most abundant OTUs. We also assessed the effect of OTU abundance on WMD variability using all 320 OTUs from the same data set (Text S1). The use of standard deviation has an underlying assumption of normality. Consistent with this assumption, we observed only slight leptokurtosis in the distribution of taxon relative abundance as a function of density that is not thought to strongly influence inference (Fig. S1).
Sensitivity to number of fractions. We used data set 1 (seawater) and data set 2 (SPRUCE soil) to estimate how precision in the estimate of taxon isotope incorporation varies with the number of density fractions collected in a qSIP experiment. We combined fractions in silico to simulate the results of these experiments had they been run with fewer fractions (f), with the following principles. (i) Only adjacent fractions were combined. (ii) Fraction combinations were conducted to create new, combined fractions that were approximately equal in size and sequencing depth (i.e., with minimal variation in the range of densities represented by each simulated fraction). For example, to simulate an experiment where only two density fractions were collected instead of 18, we ran three possible scenarios using the original empirical data: combining the lightest 8, 9, or 10 fractions into a simulated fraction and combining the remaining heaviest fractions into a second simulated fraction. Similarly, to simulate an experiment with 10 density fractions instead of 18, we ran all 45 possible scenarios of combining eight pairs of adjacent density fractions with two uncombined single density fractions. In this way, we simulated data sets across the full range of possible fraction numbers, from 2 to 18. We did this to simulate typical approaches to SIP experiments, where fractions that span similar density ranges are usually combined. We computed the weighted-average density for each taxon in each replicate of each simulated data set using the midpoint of each density fraction (combined or not) and weighting each of those density values by the area under the curve for that fraction, where the curve is delineated by the plot of 16S rRNA gene copies versus density. The sum of those weighted densities, divided by the total area under the curve, yielded the weighted-average density metric used in subsequent analyses. For nonpermuted analyses, fractions were combined from preincubation samples (temperature, 15°C and 45°C [n ϭ 10]). For each permuted combination (using the replicated data set 2), we ran the qSIP code (13,46) and estimated the atom percent excess 18 O for each replicate sample and then calculated the standard deviation of that estimate across all replicates (n ϭ 5, preincubation temperature of 15°C). Finally, we calculated the relative standard deviation as a function of increasing number of simulated fractions compared to the original number of fractions.
Power analysis. We evaluated statistical power using the SPRUCE data set (data set 2). These data came from a SIP experiment where peat bog soils were incubated for 10 days at an intermediate temperature (15°C) and then harvested after 5 or 10 days of exposure to H 2 18 O. An unlabeled control sample was sampled at day 0. For the power analysis, we omitted 27 taxa that did not occur in all 15 samples (n ϭ 5 for control, H 2 18 O at day 5, and H 2 18 O at day 10), 5 rare taxa (relative abundance of Ͻ0.026%), and 7 taxa identified as outliers for variance of the estimate of weighted average density (P Ͻ 0.05, Grubb's test). Excluding taxa with high variance will inflate power, but the effect is likely negligible given the small number of taxa excluded for this reason. Of the remaining 236 taxa, 211 were included in the power analysis. We used the observed variation among taxa in weighted average density shift, which ranged from Ϫ0.003 to 0.033 g ml Ϫ1 . This captures a wide range of possible values of isotope uptake, from ϳ0 to ϳ60 atom% excess 18 O. We used in silico resampling with replacement to estimate power. For each taxon at each sample date, n random samples were drawn (with replacement) from each of the 18 O-labeled and unlabeled data sets, a t test was performed, and the P value was recorded. This was repeated 1,000 times, and power was estimated as the frequency of significant t tests among the 1,000 simulations (47,48). N (number of sample replicates) was varied between 2 and 6 to simulate experiments with different replication by pruning or duplicating replicates from the original data set. Average power was calculated across all taxa. The upper 10th percentile of power was also calculated to estimate power typical for more dominant taxa.
Data availability. Data set 1 sequence data are available in ENA under project number PRJEB26952 and accession numbers ERS2507530 to ERS2507619. Data set 2 sequence data were submitted to NCBI Sequence Read Archive (SRA) under project ID PRJNA641177 and accession numbers SRR12071929 to SRR12072082.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. TEXT S1, DOCX file, 0.02 MB.