Differential Functional Constraints Cause Strain-Level Endemism in Polynucleobacter Populations

Understanding the biological factors influencing habitat-wide genetic endemism is important for explaining observed biogeographic patterns. Polynucleobacter is a genus of bacteria that seems to have found a way to colonize myriad freshwater ecosystems and by doing so has become one of the most abundant bacteria in these environments. We sequenced metagenomes from locations across the Chicago River system and assembled Polynucleobacter genomes from different sites and compared how the nucleotide composition, gene codon usage, and the ratio of synonymous (codes for the same amino acid) to nonsynonymous (codes for a different amino acid) mutations varied across these population genomes at each site. The environmental pressures at each site drove purifying selection for functional traits that maintained a streamlined core genome across the Chicago River Polynucleobacter population while allowing for site-specific genomic adaptation. These adaptations enable Polynucleobacter to become dominant across different riverine environmental gradients.

Polynucleobacter 16S rRNA sequences were resolved to the strain level by oligotyping (12), and the oligotypes had a pattern of reduced beta diversity within sites in the same region similar to that found between regions ( Fig. 2d and e). Oligotyping was performed on the dominant Polynucleobacter OTU (OTU32), resulting in six oligotypes across all seven sites. Oligotype 2 was extremely dominant and was found at all of the sites except WW99 (South Branch Chicago River [SBCR]). (For clarity, we use the term SBCR for the WW99 and WW108 sites throughout this paper.) The beta diversity pattern of these Polynucleobacter oligotypes showed a significant positive correlation with the concentration of ammonia (Biology Environ, UniFrac R 2 ϭ 0.7, P Ͻ 0.01), as did the abundance of oligotype 2 (R 2 ϭ 0.56, P Ͻ 0.05). Geographic localization (kilometers) had no significant correlation with either OTU or oligotype distribution (see Fig. S2B and C in the supplemental material). However, the beta diversity pattern of physicochemical measurements (see Fig. S3 in the supplemental material) was similar to the taxonomy analysis (Fig. 1b), suggesting that physicochemical factors, and hence local adaptation, shape Polynucleobacter diversity. Metagenomic annotation and method-independent validation of Polynucleobacter abundance. Fourteen metagenomes (two technical replicates per sample site) were sequenced, producing numbers of quality-trimmed reads ranging from 25 to 100 million (Table 1). 16S rRNA gene rarefaction curves (see Fig. S4A in the supplemental material) and abundance-based weighted coverage (13) Fig. S4B in the supplemental material), which could indicate oligotrophy through ecological adaptation (14). MetaPhAln analysis and metagenome assembly (contigs assigned to taxa) were used to validate the 16S rRNA amplicon analysis. Polynucleobacter bacteria were abundant at all of the sites (see Fig. S1B in the supplemental material). The beta-diversity trend of unassembled metagenomic data, annotated to known functional genes (see Fig. S4C in the supplemental material), was similar to the taxonomic analysis (Fig. 2b), suggesting that functional adaptations underlie taxon-specific differentiation across regions (15).
Discounting core metabolic functions (e.g., energy metabolism, transcription, and translation), the relative abundance of aminobenzoate degradation, amino acid biosynthesis (valine, leucine, and isoleucine), mismatch repair, phosphotransferase system, and polycyclic aromatic hydrocarbon degradation were significantly different among the regions (ANOVA, Tukey-Kramer post hoc test, and Bonferroni correction, P Ͻ 0.05; see Table S1 in the supplemental material). Enzyme-level functional analysis (BLASTXbased KEGG mapping) highlighted the TonB-linked outer membrane protein (SusC/ RagA family), leucine/isoleucine/valine transporter permease, and amine transporter categories as differentially abundant across all three regions. The SBCR and NBCR regions had the most divergent functional profiles (R 2 ϭ 0.76; two-sided Welch t test, P Ͻ 0.05), while Calumet (n ϭ 3) and SBCR (n ϭ 2) had significantly similar functional profiles (R 2 ϭ 0.96, P Ͻ 0.005).
Population-level genetic differentiation across in situ Polynucleobacter cohorts. Corresponding to previously published culture-dependent and -independent studies (4), our 16S rRNA gene amplicon-and shotgun metagenomics-based analysis highlighted that in situ Polynucleobacter populations follow the ubiquity-bydiversification theory, whereby lineage-specific ecological preferences result in niche separation ( Fig. 2c to e) (2). Intrataxon-specific AGS variations (see Fig. S4B in the supplemental material) and distance decay analysis (see Fig. S2A in the supplemental material) suggest that local functional constraints (physical environment) are fundamental for lineage-specific genetic adaptation. To test this hypothesis at the level of the population genome, we employed nucleotide composition-based (tetranucleotide frequency usage and GϩC content [percent]) genotype reconstruction methods to bin Polynucleobacter contigs (representing in situ strains) across each sample (see Table S2 in the supplemental material). Marker gene copy number variation (CNV) analysis revealed that sites WW73 (NBCR) and WW76 (Calumet) had the maximum (n ϭ 12) and minimum (n ϭ 3) numbers of Polynucleobacter species, respectively. Individual-readbased AGS variation analysis (see Fig. S4B in the supplemental material) revealed that Polynucleobacter cohorts had a smaller AGS (NBCR, 1.9 Ϯ 0.19 Mb; Calumet, 2.7 Ϯ 0.23 Mb; SBCR, 2.8 Ϯ 0.16 Mb) than the total community (i.e., NBCR, 2.6 Ϯ 0.11 Mb; Calumet, 3.4 Ϯ 0.26 Mb; SBCR, 4.0 Ϯ 0.19 Mb). Polynucleobacter population bin size (total) varied across metagenome samples, which positively correlated with sequencing depth (Table 1). WW76 and WW57 (Calumet) had the greatest and least sequencing depths (100 million and 25 million reads, respectively) and Polynucleobacter population bin sizes (7.1 and 1.9 Mb, respectively).
P. necessarius subspecies can maintain free-living and endosymbiont modes of life (8,9). In order to study the in situ population-level evolutionary dynamics of these two modes of life (endosymbiotic and free living), we compared the whole-genome sequences of the available symbiotic (P. necessarius subsp. necessarius STIR1) (16) and free-living taxa (P. necessarius subsp. asymbioticus QLW-P1DMWA-1 and betaproteobacterium CB [17]). The Polynucleobacter population bins (protein-coding genes) from the present study were thus demarcated into endosymbiont (n ϭ 205)-, free-living organism (n ϭ 336)-, and core (n ϭ 1,579)-specific genetic repertoires (see Tables S3 and S4 in the supplemental material). Pairwise orthologous gene identification categorized 205 protein-coding genes as endosymbiont specific, which contrasts with the 105 genes identified previously (9). The majority of the samples showed a greater abundance of free-living organism (48 to 71%)-and core genome (49 to 71%)-specific genotypes than endosymbiont-specific gene content (10 to 24%) (see Table S3 in the supplemental material). Consistent with previous reports (4), the Polynucleobacter population core genome has the potential for urea (Pnuc_1190 to Pnuc_1202), inorganic sulfur (Pnuc_1476 to Pnuc_1494), and nitrogen (Pnuc_0987 to Pnuc_1003) metabolism. However, pathway-and enzyme-level comparisons (intergroup) of the core genome revealed significant differences in energy metabolism across Polynucleobacter ecotypes (Fig. 3a). Pantothenate (vitamin B 5 , Pnec_0171)-and ubiquinone-based cofactor biosynthesis (Pnec_0171) was highly abundant (ANOVA, P Ͻ 0.05) across NBCR and Calumet ecotypes ( Fig. 3b and c), in contrast to SBCR samples. However, SBCR ecotypes had a greater abundance of genes involved in pyrimidine metabolism ( Fig. 3a and d, dUMP biosynthesis from dCTP, Pnec_0171). Similarly, one-carbon metabolism (tetrahydrofolate interconversion, methylenetetrahydrofolate dehydrogenase, Pnec_0171) showed significant variations across all three regions ( Fig. 3a and e). These habitatwise differential patterns of substrate utilization and of energy metabolism demonstrate mechanisms of local environmental adaptation, as hypothesized for cultured Polynucleobacter strains (4).
The trends in beta diversity between samples for the functional annotations of the Polynucleobacter bins (protein-coding genes annotated to metabolic enzymes) were similar to the trends observed in the taxonomic analysis (see Fig. S4D in the supplemental material). Polynucleobacter bins had pseudogene profiles more closely related to those of free-living subspecies (P. necessarius subsp. asymbioticus QLW-P1DMWA-1) than to those of endosymbionts (P. necessarius subsp. necessarius STIR1). Also, distance decay analyses of the core genome-based allele frequency (pairwise Fst index) sug-gested that geographic distance had no significant impact on the intrapopulation-level genetic differentiation of Polynucleobacter, unlike the genus-level 16S rRNA amplicon analysis, which suggests that genotypic studies provide higher-resolution investigations (see Fig. S2D in the supplemental material). This observation further highlights that, despite a potentially high dispersal rate, environmental selection has an important role in shaping the in situ Polynucleobacter community gene content.
Effect of environmental selection on Polynucleobacter evolution. Populationlevel orthologous gene pairs (Ͼ300 bp) for endosymbiont (n ϭ 205)-, free-living organism (n ϭ 336)-, and core (n ϭ 1,579)-specific gene contents were analyzed for natural selection patterns (ratio of nonsynonymous to synonymous evolutionary changes [dN/dS ratio]) across all seven sites. Highlighting the strength of purifying natural selection (dN/dS ratio, Ͻ1), the majority of genes had median dN/dS ratios ranging in 0.01 to 0.09, whereas fast-evolving (dN/dS ratio, Ͼ1) protein-coding genes were less prevalent (n ϭ 1 to 8/site) across all of the groups (Fig. 4). Using BLASTX annotation results against the NCBI nr data set, the majority (Ͼ80%) of the positively selected genes (annotated; dN/dS ratio, Ͼ1) were assigned to the transferase and membrane transport enzyme-based categories. Interestingly, two putative horizontal gene transfer events (specifically, shikimate dehydrogenase from bacterium UASB14 and heme ABC transporter permease from Pusillimonas) were also observed to have higher dN/dS ratios (Ͼ1) in the core genotypes of WW76 (Calumet) and WW96 (NBCR) samples, respectively.
Recently, Ran et al. (18) revealed that patterns of correlation between the CUB (codon usage bias) and dN/dS ratio of orthologous genes from closely related genomes could be used as a metric to analyze lineage-specific environmental adaptation. In other words, the relationship between two aspects of selection (CUB and dN/dS ratio) can be used as a metric to highlight the importance of translation fine-tuning genetic adaptation to environment changes. In order to validate our hypothesis that strong in situ functional constraints are shaping the natural selection trends across in situ Polynucleobacter populations, orthologous protein pairs were analyzed to study the coupling between the dN/dS ratio and CUB. Interestingly, a limited but statistically significant (negative) correlation pattern was observed for the free-living and core genotypes (Fig. 4).

DISCUSSION
Freshwater ecosystems have greater microbial AGSs than marine ecosystems (14, 19). These AGS differences indicate the selection pressure of local metabolic (functional)

FIG 4 Coupling between natural selection (y axis) and CUB (x axis) validates the habitat-wide impact of in situ functional constraints across free-living-organism-, core-, and endosymbiont-specific gene contents of Polynucleobacter taxa. Reverse BLAST hit-based orthologous genes were used to perform pairwise dN/dS ratio and CUB analyses. The codon deviation coefficient was computed by methods explained in reference 41.
constraints on genomic adaptation. At the population level, the AGS variation may represent the in situ metabolic potential of a taxon. Specifically, an ecogenetically adapted generalist copiotroph should have a larger genome size than a specialized yet abundant oligotroph, but variation within a single-species population also suggests that AGS is significantly influenced by adaptation to local environmental conditions. Using 16S rRNA amplicon and shotgun metagenomic sequencing, we have investigated the population ecology of the dominant freshwater bacterium Polynucleobacter across all three main regions of the CAWS, i.e., NBCR, SBCR, and Little Calumet River (LCR) Discounting methodological bias (e.g., amplification, sequencing quality, and coverage), this study suggests that Polynucleobacter bacteria are abundant across all three regions of the CAWS, suggesting that interhabitat strain-specific genetic adaptation may enable Polynucleobacter taxa to become ubiquitous in freshwater ecosystems (7).
The AGS of the Polynucleobacter assembled bins was smaller than that of the total community (see Fig. S4B in the supplemental material), suggesting that the genus Polynucleobacter is a specialized taxon. Geographic distance had no major role in shaping the interhabitat-level population dynamics of Polynucleobacter bacteria, as highlighted by the distance decay analysis of oligotyping-based beta diversity patterns and the genetic differentiation of the assembled core genome (Fst; see Fig. S2C and D in the supplemental material). Metabolic differences in energy metabolism, substrate utilization (Fig. 3), and the concentration of available nutrients (e.g., ammoniacal nitrogen) seem to have a stronger influence on the Polynucleobacter population structure. This variance in genetic fitness suggests that the ubiquity of Polynucleobacter strains results from stenoecious (specialist) genetic adaptations and not from euryoecious (generalist) adaptations.
Therefore, we hypothesize that in situ metabolic constraints were the primary reason for the genotypic diversity seen. The genome-wide influence of environmental selection (functional constraints) on the CAWS Polynucleobacter population was determined by using pairwise dN/dS ratios for orthologous genes. Habitatwise selection analysis of the core-, free-living-organism-, and endosymbiont-specific genotypes revealed that while the majority of genes were under purifying natural selection (median dN/dS ratios, 0.01 to 0.09), which is already known for most bacterial genomes (20), there was differential selection pressure for each genotype (free-living, core, and endosymbiont) for specific genes across each habitat (Fig. 4). Specifically, the core genes had the lowest median dN/dS ratio (Fig. 4) across each habitat. However, this observation is expected because a core gene codes for important metabolic enzymes and has a direct metabolic interaction with the environment (18; see also Table S4 in the supplemental material).
To further confirm that the different selection patterns were due to strong and diverse environmental selection (in situ functional constraints), we further analyzed the patterns of correlation between CUB and the dN/dS ratio across orthologous proteincoding genes. Interestingly, a weak but significant negative correlation was observed between the dN/dS ratio and CUB across core and free-living genotypes. Since we have used evolutionarily conserved (orthologous genes) and population-level genetic information for natural selection and CUB analysis, we assume that pairwise dN/dS ratio and CUB variations of these genes directly represent the strength of in situ functional constraints. The observation of habitat-wide differential correlation patterns between the dN/dS ratio and CUB supports our hypothesis that functional constraints cause and maintain the strain-level genetic endemism, and hence the genotypic diversity, of Polynucleobacter populations, and thus, loss of function would be deleterious to the organism.

MATERIALS AND METHODS
Site selection, sampling, and physicochemical analysis. Figure 1 shows the seven sites selected for this study. The sites represent highly altered channelized streams of the CAWS, including the NBCR (sites WW73 and WW96), SBCR (sites WW108 and WW99), South Fork SBCR (known as Bubbly Creek), and the LCR (sites WW56, WW57, and WW76), which is not directly connected to the NBCR. Sites WW56 and WW76 are upstream and downstream of a major inflow into the CAWS from the Calumet Water Reclamation Plant (WRP), and site 73 is downstream of the Terrence J. O'Brien WRP, another major inflow. Sites WW96 and WW57 include tributaries that contribute flow to the CAWS. These sites constitute part of the current ambient water quality monitoring sampling stations of the Metropolitan Water Reclamation District (MWRD) of Greater Chicago. All locations were sampled monthly by surface grab sampling and analyzed for physicochemical parameters including pH, water temperature, alkalinity, total suspended solids, ammonia, nitrate, phosphorus, total metals, dissolved metals, cyanide, phenol, and fecal coliform bacteria, while organic priority pollutants and nonylphenols were sampled semiannually and quarterly, respectively (Table 1 shows sample parameters). Both pH and water temperature were measured at each site.
16S rRNA gene amplicon data analysis. Paired-end reads were quality trimmed and processed for OTU clustering with the UPARSE pipeline (21), set at a 97% identity cutoff. High-quality (Ͻ1% incorrect bases) OTUs were assigned to various taxonomic levels by using the parallel_assign_taxonomy_rdp.py script from QIIME software (22). Multiple-sequence alignment and phylogenetic reconstruction were performed with PyNast and FastTree (22). The Phyloseq package (23) was used for detailed downstream analysis, e.g., alpha and beta diversity-based ordination on a rarefied abundance matrix. The OTU matrix was processed to remove OTUs containing fewer than three reads and rarified to the minimum numbers of reads present in the smallest library (11,083 reads). We used the oligotyping pipeline (12) to identify the sub-OTU-level differences across the Polynucleobacter taxon, i.e., one of the five most differentially abundant genera, i.e., Rhodobacter, Synechococcus, Sediminibacterium, Polynucleobacter, and Novosphingobium, as predicted by MetagenomeSeq (24).
Quality filtering, coverage estimation, metagenome assembly, and annotation. Paired-end metagenome reads were quality trimmed with nesoni (GitHub Victorian Bioinformatics Consortium) by using the following parameters: a minimum length of 75, a quality cutoff of 30, adapter trimming, and 0 ambiguous bases. Taxonomic and functional information was assigned to the individual metagenome reads with MetaPhlAn (25) and MGRAST (26), respectively. Individual-read-based functional annotations were used for functional diversity and richness estimation. Quality-trimmed metagenome reads were assembled into contigs with IDBA_UD (27) by using k-mer lengths ranging from 31 to 41. Metagenome contigs with lengths of Ͻ300 bp were excluded from further analysis. Metagenome contigs were assigned to various taxonomic levels by NBC Classifier (28). Average metagenome coverage and sequence diversity were computed for each sample with Nonpareil (13) set at default parameters. AGS was computed for each metagenome sample and Polynucleobacter bin with MicrobeCensus (29). FragGeneScan (30) was also used to predict the protein-coding genes across metagenome contigs. Functional annotation of individual metagenome reads and contigs (ORFs) was performed with paladin (GitHub) and prokka (31), respectively. Genotype binning and population-level comparative genomics. In order to understand the population-level dynamics (taxonomic, functional, and evolutionary) of Polynucleobacter across these sites, we focused our further assembly efforts to bin population genomes (genotypes, not individual genomes) for this taxon. Tetranucleotide frequency usage and GϩC content values (percent) were computed for each metagenome contig with 2TBinning (32). Contigs were clustered into bins with hierarchical agglomerative clustering performed with an interprofile correlation cutoff (R 2 ) of 0.9. Chimeric contigs, i.e., those that differed from the mean GϩC content (percent) by more than 1 standard deviation, were removed from the individual population bin. Polynucleobacter bins were further screened (Nmer ϭ 12) for the contaminants (assigned to different taxons) with NBC Classifier (28). Single-copy marker gene-based CNV analysis (33) was used to estimate the number of species across each bin. To predict the number of species across each site, single-copy genes were clustered at 97% identity. By using reference genomes pairwise, the orthologous gene prediction method (34) was used to demarcate in situ Polynucleobacter population gene contents into core-, free-living-organism-, and endosymbiontspecific genes. Pseudogenes were predicted across population genomes with GenePRIMP (35). Reconstructed population genomes were uploaded to the RAST server (36) for automated genome annotation. Fst calculations were performed for the core gene contents of Polynucleobacter populations across all seven metagenome samples with the PoPoolation2 software (37).
Evaluation and validation of the influence of in situ functional constraints. Pairwise selected orthologous protein-coding genes were aligned with Clustal W (38). Multiple-codon alignments were constructed from the corresponding aligned protein sequences with pal2nal script (39). Final alignments (stop codons removed) were processed for dN/dS ratio analysis with PAML (40). To further validate the influence of in situ functional constraints on the observed natural selection patterns, we processed the orthologous gene pairs by using codon bias variation. By methods explained in reference 41, the codon deviation coefficient was used as the measure of codon bias across orthologous gene pairs predicted across free-living, endosymbiont, and core genotypes. The mean value of two orthologous genes was used for the correlation analysis against dN/dS ratios.
Statistical analysis. All of the statistical analyses done in this study were performed in the R environment (R Development Core Team, 2012). Multigroup and two-group comparisons were performed by ANOVA (Tukey-Kramer post hoc test, effect size ϭ Eta-squared and multiple-test correction by Storey's false-discovery rate [FDR]) and Welch t test (two sided and multiple-test correction by Storey's FDR), respectively. Beta diversity was analyzed by calculating distance decay with the vegdist package implemented in R (R Development Core Team, 2012). Briefly, Bray-Curtis similarity matrices were created from the OTU and oligotype data and Euclidean distance matrices were created from the distances between individual samples. The MetagenomeSeq package (24) was used to identify the differentially abundant taxons. The number of reads per kilobase per genome equivalent was also used to normalize and quantify the pathway and/or subsystem abundance from shotgun metagenomes as predicted by MGRAST (26).
Conclusions. Different in situ functional constraints cause and maintain a conserved core genome in Polynucleobacter populations. Observed patterns of coupling between the dN/dS ratio and CUB highlight that translational fine-tuning likely helps Polynucleobacter bacteria to adapt to subtle metabolic changes in the local environment. However, dominant taxa are known to have complex interspecies metabolic tradeoffs (5) that can influence their genetic evolution, and therefore, understanding the biological factors that influence this habitat-wide genetic conservation remains a challenge for future studies.
Nucleotide sequence accession numbers. The metagenome data obtained in this study have been uploaded to MG-RAST under project no. 7450 and accession numbers 4549281. 3