Complementing 16S rRNA gene amplicon sequencing with estimates of total bacterial load to infer absolute species concentrations in the vaginal microbiome

Whereas 16S rRNA gene amplicon sequencing quantifies relative abundances of bacterial taxa, variation in total bacterial load between samples restricts its ability to reflect absolute concentration of individual species. Quantitative PCR (qPCR) can quantify individual species, but it is not practical to develop a suite of qPCR assays for every bacterium present in a diverse sample. We analyzed 1320 samples from 20 women with a history of frequent bacterial vaginosis, who self-collected vaginal swabs daily over 60 days. We inferred bacterial concentrations by taking the product of species relative abundance (assessed by 16S rRNA gene amplicon sequencing) and total bacterial load (measured by broad-range 16S rRNA gene qPCR). Log10-converted inferred concentrations correlated with targeted qPCR (r = 0. 935, p<2.2e-16) for seven key bacterial species. The mean inferred concentration error varied across bacteria, with rarer bacterial vaginosis-associated bacteria associated with larger errors. 92% of errors >0.5 log10 occurred when relative abundance was <10%. Many errors occurred during early bacterial expansion or late contraction. When relative abundance of a species is >10%, inferred concentrations are reliable proxies for targeted qPCR. However, targeted qPCR is required to capture bacteria at low relative abundance, particularly with BV-associated bacteria during the early onset of bacterial vaginosis.


Introduction 29
For most infectious diseases, the absolute concentration of a single pathogen is often the most 30 specific marker of disease severity and therapeutic response(1-3). In constrast, studies of 31 bacterial communities usually rely on broad-range consensus sequence PCR of taxonomically 32 informative genes (such as 16S rRNA) coupled with next generation sequencing (NGS) to assess 33 relative, but not absolute abundances of bacteria. At a mechanistic level, specific combinations 34 of bacteria and bacterial gene products are thought to play a causative role in the pathogenesis 35 of many microbiome associated conditions(4-6), and this approach of characterizing the 36 microbiota is valuable. However, absolute concentration of individual bacterial taxa within 37 communities may be a better predictor of biological activity or disease risk compared to relative 38 abundances of these taxa. Quantitating absolute concentration of individual species with qPCR 39 is time intensive, requires generation of a standard curve for each organism using known 40 concentrations of DNA, is expensive and only available in specialized laboratories. Moreover, 41 each qPCR assay requires significant development and validation costs. qPCR is therefore not 42 typically comprehensive for all species in a community. Moreover, selection of the most 43 appropriate species for analysis may reflect investigator bias. 44 A method to infer absolute concentration of multiple bacterial species from NGS data would be 45 extremely useful for the field including studies of the vaginal microbiome. NGS amplicon 46 sequencing is a fractional approach that has been used to help define conditions such as bacterial 47 using broad-range PCR and sequencing. Sequence reads have been submitted to the NCBI Short 93 Read Archive (in submission. Accession numbers pending). Relative abundances and absolute 94 concentrations of specific vaginal bacteria were measured on all samples in two participants and 95 in daily morning samples for the remaining 18. We performed qPCR on all samples collected from 96 each participant, but for the purpose of this work only consider the morning samples. 97 All data generated or analysed during this study are included in the supplementary material. 98

Statistical considerations. We calculated inferred concentrations using equation 1. 99
Equation 1 100 Inferred Concentration = Relative abundance × Total Bacterial Load 101 (16S rRNA gene copies) (%) (16S rRNA gene copies) 102 species across all samples (Figure 6b & 7b, right upper and left lower quadrants). Such rIC error 267 often occurred when species were transitioning to or from a low abundance (<10 6 16S rRNA gene 268 copies per sample). The second type of rIC error, occurred when two consecutive points had 269 inferred concentrations of zero, resulting in underestimation of growth or contraction rates for 270 individual species (Figure 7b). This phenomenon also commonly occurred when a species was 271 transitioning to or from a low abundance (<10 6 16S rRNA gene copies per sample). These two 272 forms of transitions accounted for 91.7% of rIC error > 0.05 (Figure 7c). If all transitions involving 273 a zero value were eliminated from the analysis, we observed excellent correlation between 274 inferred and observed rate of change (r=0. 876, p<2.2e-16). It follows that inferred concentrations 275 do not capture kinetics during microbial blooming or contraction when bacteria are at low 276 concentration or not detected using the less sensitive broad-range PCR with NGS approach. 277 However, inferred concentrations can be used to estimate individual species growth and 278 contraction rates when bacteria are present at higher concentrations such as >10 6 16S rRNA gene 279 copies/swab . 280 281

Complete linkage clustering by inferred and absolute concentrations shows general agreement. 282
To assess whether inferred concentrations provide similar or disparate classification of samples, 283 we clustered samples using complete linkage hierarchical clustering based on Euclidean distances 284 (21) by inferred and absolute concentrations ( Figure S3). We compared the resulting 285 dendrograms using the entanglement coefficient from the dendextend package in R (24), where 286 a value of 1 corresponds to complete discordance and a value of 0 indicates perfect alignment. 287 The two dendrograms were found to be in agreement, with a low entanglement coefficient 0.11. 288 289 We next determined the number of clusters using NbClust package in R (27). Absolute 290 concentration identifies two whereas inferred concentration identifies three clusters. The third 291 cluster arose from a general disctinction between samples dominated by L. crispatus from L. iners 292 as the inferred concentrations had a lower threshold (1 copy per swab) than the qPCR (93.8 293 copies per swab). 294 295

Inferred concentration may provide the most comprehensive overview of individual species 296
kinetics. Inferred concentrations can be calculated for all species captured by NGS. In Figure 1  297 and S1, we show the inferred concentrations of the most abundant species across all samples. 298 We imposed a 1% relative abundance threshold to limit the possible 0.5 IC error described in 299 Figure 4b. This relative abundance cut-off results in abrupt appearance and disapparence of 300 organisms. Although we cannot validate our projections for species outside of the seven key 301 bacterial species for which we have targeted qPCR assays, inferred concentrations have the 302 potential to describe the kinetics of relevant species present at moderate to high concentrations 303 during bacterial shifts in the microbiome. 304 We carried out complete linkage hierarchical clustering based on Euclidean distance by 305 inferred concentration and relative abundance for the 20 most abundant species of the data set 306 ( Figure S4). The resulting dendrograms showed general agreement, with an entanglement 307 assays. This variability may arise from different bacterial targets having varying GC content, 352 secondary structures and amplification product size. In this sense, absolute concentration by 353 qPCR may not be a perfect gold standard for comparing inferred concentration. 354 In summary, we developed and validated a simple, user-friendly method to estimate absolute 355 species abundance in complex polymicrobial communities. This method is best employed when 356 species are present at >10% relative abundance and must be validated for each species of 357 interest. Ultimately, inferred concentration of one or several species may serve as a more 358 predictive variable of disease association, compared to relative abundance, and may advance our Daily samples from a woman, Participant 18, who performed self-swabbing of the vagina were analyzed by: a) targeted qPCR of seven specific species, b) high throughput sequencing using 16S rRNA and c) inferred concentration for species with relative abundance above 1%. qPCR allows measures of absolute concentration, whereas broad range PCR with sequencing provides a measure of bacterial diversity in a given sample. Targeted qPCR often detects shifts in single species prior to NGS. Inferred concentration follows qPCR more closely than relative abundance does and may project concentration of species for which targeted qPCR assays are not available.  Hours in Study Hours in Study Relative Abundance (%) Relative Abundance (%) log 10 (16S rRNA copies) log 10 (16S rRNA copies) a) b) Figure 3. Inferred concentration correlates more strongly with absolute concentration than relative abundance. a) Scatter plot of relative abundance vs absolute concentration. Pearson correlation coefficient (pcc), r is 0.936, P<2.2e-16. b) Scatter plot of inferred concentration vs absolute concentration. Both axes are plotted on a logarithmic scale. Pearson correlation coefficient (pcc), r is 0.935, P<2.2e-16. Samples which were negative by NGS but not by targeted qPCR are plotted on the x-axis while samples negative by targeted qPCR but positive by NGS are listed on the reported threshold for targeted qPCR (log10(93.8) 16S rRNA gene copies). Relative abundances and inferred concentrations were generally falsely negative at low absolute concentrations. Variance in the relationship between absolute concentration and relative abundance is inversely proportional to species concentrations (Breusch-Pagan test, P=2e-3). Whereas this relationship was not statistically significant between absolute concentration and inferred abundance (Breusch-Pagan test, P=0.06).
log 10 (16S rRNA copies) log 10 (Relative Abundance (%)) a)   Bar-chart of incidence of >0.5 IC error by relative abundance group. Black is inclusive of double-negatives (0 inferred concentration and threshold absolute concentration): 93% of >0.5 IC errors are accounted for by relative abundances <10% (85% by relative abundances <1%). In grey, concurrent negative samples are removed: 84% of >0.5 IC errors are accounted for by relative abundances <10% (66% by relative abundances <1%). c) Boxplots displaying IC error for samples with unconfirmed and confirmed underestimates of total bacterial load by broad range qPCR assay (samples where BR16S is lower than the sum of concentrations of the seven targeted species). Data points with zero inferred concentration were removed. Samples with underestimates of total bacterial load overestimate concentration more than other samples. Overall however, the range of IC error is comparable between both groups. Boxes are the interquartile range; whiskers are 1.5x the IQR and dots are samples outside of this range; crosses are mean.  Scatter-plots color-coded by IC error. Each dot is a sample for a specific species from a single participant. a) Relative abundance versus Shannon diversity index. High IC error predominately occurred at low relative abundance but across both low and high diversity samples. b) Relative abundance versus sequencing depth. High IC error predominately occurred at low relative abundance but across various levels of sequencing depth. c) Sequencing depth vs species counts. High IC error occurred at species counts below 100, although >0.5 IC error is observed across all species counts.   Figure 7. Inferred concentration measures allow accurate inference of kinetic changes between two sequential non-negative samples. a) Top: Levels of A. vaginae over time in a single participant (dotted is inferred and solid is absolute concentration); bottom: rate of change in levels of A. vaginae over time in the same participant (dotted is inferred and solid absolute concentration); divergence in swab to swab levels between inferred and absolute concentrations varies only when inferred concentration is zero in one of the sequential samples. b) Scatter plot of rate of change of inferred concentration as predicted by NGS vs qPCR observed values. Both axes are plotted on a logarithmic scale. Data is the same as in Fig 6  a and b. Triangles correspond to panel c. Points are colored according to whether consecutive samples were double positive (both >0 inferred concentration), single positive (one >0 and one 0 inferred concentration) or double negative (both 0 inferred concentration). Data points in which both samples are positive (no zeroes) are much more highly correlated (r is 0.876 , P<2.2e-16). c) A majority of rIC errors > 0.05 occur during transitions between positive and negative samples (one zero).