Diversity at single nucleotide to pangenome scales among sulfur cycling bacteria in salt marshes

ABSTRACT Sulfur-cycling microbial communities in salt marsh rhizosphere sediments mediate a recycling and detoxification system central to plant productivity. Despite the importance of sulfur-cycling microbes, their biogeographic, phylogenetic, and functional diversity remain poorly understood. Here, we use metagenomic data sets from Massachusetts (MA) and Alabama (AL) salt marshes to examine the distribution and genomic diversity of sulfur-cycling plant-associated microbes. Samples were collected from sediments under Sporobolus alterniflorus and Sporobolus pumilus in separate MA vegetation zones, and under S. alterniflorus and Juncus roemerianus co-occuring in AL. We grouped metagenomic data by plant species and site and identified 38 MAGs that included pathways for sulfate reduction or sulfur oxidation. Phylogenetic analyses indicated that 29 of the 38 were affiliated with uncultivated lineages. We showed differentiation in the distribution of MAGs between AL and MA, between S. alterniflorus and S. pumilus vegetation zones in MA, but no differentiation between S. alterniflorus and J. roemerianus in AL. Pangenomic analyses of eight ubiquitous MAGs also detected site- and vegetation-specific genomic features, including varied sulfur-cycling operons, carbon fixation pathways, fixed single-nucleotide variants, and active diversity-generating retroelements. This genetic diversity, detected at multiple scales, suggests evolutionary relationships affected by distance and local environment, and demonstrates differential microbial capacities for sulfur and carbon cycling in salt marsh sediments. IMPORTANCE Salt marshes are known for their significant carbon storage capacity, and sulfur cycling is closely linked with the ecosystem-scale carbon cycling in these ecosystems. Sulfate reducers are key for the decomposition of organic matter, and sulfur oxidizers remove toxic sulfide, supporting the productivity of marsh plants. To date, the complexity of coastal environments, heterogeneity of the rhizosphere, high microbial diversity, and uncultured majority hindered our understanding of the genomic diversity of sulfur-cycling microbes in salt marshes. Here, we use comparative genomics to overcome these challenges and provide an in-depth characterization of sulfur-cycling microbial diversity in salt marshes. We characterize communities across distinct sites and plant species and uncover extensive genomic diversity at the taxon level and specific genomic features present in MAGs affiliated with uncultivated sulfur-cycling lineages. Our work provides insights into the partnerships in salt marshes and a roadmap for multiscale analyses of diversity in complex biological systems.


Distribution of primary MAGs genomic content across samples.
For each MAG, we also analyzed the distribution of the mapped reads across the different contigs composing each MAG by plotting the read recruitment results, expressed as GCPM (Fig. S4-S6).The individual contig coverage was obtained using a custom-made script to extract the contig-specific coverage values calculated by Salmon v.1.9.0 (4) metaWRAP v.1.3.2 (5) Quant bin module for each sample and MAG (Supplemental . Analysis of genomic diversity.We used Anvi'o v.7.1 "hope" (available from https://github.com/merenlab/anvio/releases/v7) to profile mapping results, finalize genomic bins, annotate, and visualize results following the workflow outlined in (6) .In brief, we used anvi-gen-contigs-database to identify open reading frames (ORFs) with Prodigal v.2.6.3 (7) and HMMER (8) to identify single-single copy core genes of bacterial and archaeal origin.We used anvi-run-ncbi-cogs to annotate the identified ORFs using the NCBI Clusters of Orthologous Genes (COG) database (COG20 release) (9) and KEGG Orthology using GhostKoala (10)(11)(12) .Gene taxonomy was determined using the top hits against Centrifuge v.1.0.4 (13) and from the Kegg Taxonomy results produced by GhostKoala.The MAG taxonomical identity was resolved using anvi-run-scg-taxonomy.Anvi-init-bam was used to incorporate into Anvi'o the mapping results generated with Bowtie2 v.2.4.2 (14) using sensitive presets ( -D 15 -R 2 -N 0 -L 22 -i S,1,1.15) and end-to-end alignment to minimize spurious read recruitment.Anvi-profile with the flag -profile-SCVs was used to process each BAM file and to calculate coverage and genetic variability metrics.Anvi-merge was used to combine profiles from each metagenomic sample and create a merged Anvi'o profile of each MAG in the dataset.Finally, the results of the analysis were visualized with anvi-interactive.
For each profiled primary MAG, we selected a subset of representative genes to analyze site-specific variability using average entropy as a measurement of diversity.The gene subset included the genes encoding the ribosomal proteins, DNA-directed RNA polymerase subunits, and translation initiation factors used by CheckM (15) to calculate metagenome completeness.We also analyzed average entropy for genes representative of the broad metabolism (30) , and genes involved in sulfur-cycling processes (see Supplemental Data 9 for the full list).For each of the genes, we identified single codon variants (SCVs) using the program anvi-gen-variability-profile as implemented in Anvi'o (6) run with the flag -kiefl-mode.In all cases, we computed gene-specific coverage statistics.We used custom-made R scripts to filter, summarize and visualize the data generated by anvi-gen-variability-profile.To summarize the variability level of each gene, we created the function average gene entropy.In it, and for each gene, we calculated the gene average entropy normalized to the gene length.[(entropy.sum/entropy.n)/gene_length]Reconstruction of reference-guided, sample-specific, assembled MAGs.We generated sample-specific reassembled MAGs using the reassemble_bins module implemented in metaWRAP v. 1.3.2 (5) .The MAGs selected for further analysis were the same as those previously selected for Anvi'o profiling.In this portion of the analysis, the MAGs generated through the coassembly strategy were used as a general reference to guide sample-specific assemblies using each of the 24 metagenomic samples listed in Table S1.In brief, the reassemble_bins module metaWRAP uses BWA v.0.7.15 (16) to align the reads in a metagenomic sample to a reference MAG followed by a genome reconstruction using SPAdes v. 3.14.1 (17) .This pipeline uses a parallel approach to reassembly derived from permissive and strict mapping conditions in the BWA step (2 and 5 mismatches respectively).Here, we only considered the results of the reassembly under strict conditions.The number of reads mapped per metagenomic sample to each MAG is presented in Table S5.Initial completeness of the reference-guided, sample-specific, assembled MAGs was estimated using CheckM v.1.0.12 (15) .

Pangenomic analysis of the reference-guided, sample-specific, assembled MAGs
We used the Anvi'o pangenomics pipeline [ https://merenlab.org/2016/11/08/pangenomics-v2/ ] for the analysis of the suite of reference-guide genomes reassembled from each of the eight MAGs selected for further analysis.For each bacterial taxon, we stored the primary MAG generated through a coassembly (see Assembly and taxonomic assignment of contigs section) and each of the reference-guided reassembled metagenome in an Anvi'o database using the command anvi-gen-genomes-storage.We identified single-copy genes and rRNAs using anvil-run-hmms.Gene functions were assigned using NCBI COG20 (9) , HMM hits from Kofam, and EBI's Pfam database (12) .Additionally, tRNAs were scanned using anvi-scan-trnas.Anvi-run-scg-taxonomy was used to confirm taxonomy and anvi-estimate-metabolism to estimate metabolic capabilities.
To create the pangenomes from each metagenome, we used the command anvi-pan-genome which used NCBI's BLAST v.2.11.0 (18) to quantify gene similarity within and between genomes in each database and Markov Cluster algorithm (MCL) (19) to cluster groups of similar genes.We set the flag --min-ocurrence to 2 to remove singletons.The resulting pangenomes were visualized using anvi-display-pan.In the case of the closely related MAGs ALJR36 and MASA10, we completed an additional pangenomic analysis comparing the groups of metagenomes reassembled across samples for each of these sulfur-oxidizer bacteria affiliated with the UBA6429 family.We completed a functional enrichment analysis using the module anvi-compute-functional-enrichment to identify those genes, functions, and pathways over-represented in the pan-groups (reference MAG, and site and vegetation).

Analysis of the phylogenetic relationships among the reference-guided reconstructed MAGs.
We used the command anvi-get-sequences-for-gene-clusters to extract, from each metagenome, the predicted amino acid sequence of a subset of core genes (ribosomic proteins, polymerases, and transcription factors), housekeeping genes, and sulfur-cycling genes (Supplemental Data 9).Amino acid gene sequences were concatenated and samples with a high percentage of missing data in the gene alignments (>90% of missing sites) were removed from the analysis.Maximum-likelihood phylogenetic trees were calculated using FastTree v.2.1.3(20) as implemented in anvi-gen-phylogenomic-tree.

Identification of Diversity-Generating Retroelements in sulfur-cycling bacteria.
DGRs were identified with the python package DGRpy (available at GitHub: https://pypi.org/project/DGR-package/), which provides an integrated pipeline to annotate the essential features, RT, VR, and TR within 20kbp windows, as previously described (21,22) .Genomic sequences were used as the input for DGRpy and open reading frame (ORF) predictions determined using Prodigal v.2.6.3 (7) with -p meta and otherwise default parameters.Next, ORFs were analyzed for RT homology via HMMER v.3.2.1 (8) against a custom RT-HMM profile (21) .The genomes were processed into local 20 Kbp regions, extracted with +/-10 Kbp flanking the RT-like sequence.The RT-proximal sequences were next used to generate short sliding windows as 200 nt fragments that overlap by 50 nt.All 200 nt sequences were compared via BLASTn (23) and the resulting XML was parsed to search for VR/TR pairs in alignments having an HSP score < 100 and alignment length ≥ 60 nt.A VR/TR pair was identified in alignments having mismatches that correspond to adenines (or thymidine if on the reverse strand) for one of the two pairwise sequences.Alignments with i) a relative proportion of 80% A-mismatches (or reverse-strand T-mismatches) along one sequence, relative to other mismatches (G, C, or T) and ii) a minimum of five A-/T-specific mismatches, were considered VR/TR pairs.Since the BLAST xml output contains both full and partial TRs/VRs, sequences were clustered at 100% identity using CD-HIT (24) c, to obtain the full-length representatives.Finally, RTs, VRs, and TRs were manually inspected in the original MAGs to identify the putative DGR hypermutation target genes.

MAGs occurrence across all vegetated samples.
We completed an analysis of GCPM values at the individual contig level.The coverage patterns, as relatively constant values across contigs included in a bin, were similar to those expected in samples containing S-cycling bacteria that were highly similar to the MAGs used as references for mapping reads.This was especially apparent with some of the MAGs reconstructed using samples collected in sediments inhabited by S. alterniflorus in Massachusetts (see Supplemental Data 4; MASP12, MASA6, MASA5, MASA4, MASA3, MASA1, ALSA234, ALSA191, ALSA113, ALJR36, ALJR32, ALJR30, ALJR18, ALJR15, ALJR4).As was the case on the vegetated sediments, higher read recruitment in sulfur-oxidizing lineages when compared to sulfate-reducers were also observed in these subtidal metagenome samples.In Particular, we detected the presence of sulfur-oxidizing taxa affiliated with Alphaproteobacteria and Burkholderiales MAGs across all sites and environments (Fig. S6).

Gene heterogeneity as average entropy.
We calculated the average entropy for a selected set of genes (listed in Supplemental Data 9) to get a more accurate picture of the distribution of the genetic heterogeneity in the coding regions of the genomes of these bacterial populations.While the SNV density is useful to identify mismatches to a 'reference' sequence, entropy provides a quantification of the disagreement from the consensus, measuring the diversity in a bacterial population by removing the possible effect of the fixed SNV in a given bacterial population.By focusing on the entropy of single codon variants (SCV) we specifically explored the changes affecting the triplets of nucleotides encoding amino acids, independently if they were synonymous changes or not.
Only four MAGs (ALJR36, ALJR15, ALSA124, and MASA10), specifically in the metagenomic samples collected in Alabama, independently of vegetation type, had enough coverage to fulfill the minimal coverage depth set as in the analysis.Average gene entropy values varied among genes annotated for a given MAG.In most cases, average entropy values were higher for the genes encoding ribosomal proteins than for housekeeping or sulfur-related genes (Fig. S18 and Supplemental Data 7).Entropy values were not dependent on the average gene coverage as indicated for the similar values of average entropy calculated for a given gene in diverse samples (Fig. S18A, heatmap in purple) despite variations in coverage (Fig. S18A heatmap in orange).
The main result we observed is that for a given gene of a given MAG, the entropy values calculated remained relatively constant across different bacterial populations in different samples analyzed here, independently of the year of collection or the plant dominating the sediments.More interestingly, we identified that the average entropy values of a given gene were MAG-specific.To better showcase these results, we compared the average entropy values of housekeeping genes and sulfur-cycling genes in the SOXs ALJR36 and MASA10 (Fig. S18B) and specifically, using samples collected in Alabama as they were consistently abundant and therefore produced more robust results.The average entropy values, and therefore diversity in the bacterial population, calculated e.g. for the gene Succinate dehydrogenase Sdh A−1 (COG1053) were consistently higher for ALJR36 than for MASA10.This was also observed for the genes encoding subunits alpha (K17993) and delta (K17994) of the Sulfhydrogenase ( Hyd ) while in MASA10 the entropy values of the two subunits were similar.In ALJR36 the values were not only much higher but the average entropy value for the alpha subunit almost doubled that for the delta subunit.In the case of the dissimilatory sulfite reductase ( dsr ), the average entropy calculated for the genes encoding subunits alpha (K11180) and beta (K11181) were similar for the bacterial populations matching MASA10 (~0.001).Values of dsrA in ALJR36 were also in that range while dsr B displayed a considerably lower genetic diversity.Table S7 .Results of the functional enrichment analysis comparing the metabolic functionalities of ALJR36 and MASA10 pangenomes at COG20 Category level.Each pangenome group is represented by the primary MAG and the site-specific reassembled MAGs.

COG20 CATEGORY Adjusted q value COG ID
Categories significantly enriched in the ALJR36 pangenome group.Table S8 .Results of the functional enrichment analysis comparing the metabolic functionalities of ALJR36 and MASA10 pangenomes at the KEGG Module level.Each pangenome group includes the primary MAG and the site-specific reassembled MAGs.

KEGG Module Adjusted q value KEGG ID
Modules significantly enriched in the ALJR36 pangenome group.Subtidal Sediments Nutrient-enriched creek Reference creek salinity salinity . Distribution of sulfur cycling primary MAG sequences across environmental metagenomes.Samples included published metagenomes from subtidal sediments along a salinity gradient.Circle size reflects averaged GCPM across all contigs for each MAG.Contig-specific coverage values can be found in Supplemental Data 4 and sample description in Table S9.Contig-specific coverage values for each salt marsh sample included in this study of each sulfur cycling bacteria MAG identified in the co-assembly from Alabama in sediments inhabited by the plant Sporobulus alterniflorus (ALSA).For each sample, contig coverage (expressed as CMP, y-axis) is sorted from low to high.For comparison purposes, samples collected in the same geographical area and in the proximity of the same plant are colored with the same color.Full data is presented in Supplemental Data 3.For each sample, contig coverage (expressed as CMP, y-axis) is sorted from low to high.For comparison purposes, samples collected in the same geographical area and in the proximity of the same plant are colored with the same color.Full data is presented in Supplemental Data 3.

Figure S7-S14.
A)Anvi'o coverage profiles of eight selected sulfur-cycling bacteria MAGs.In each anviogram, outer cycles represent the contig coverage (as mean coverage) of each of the 24 salt-marsh metagenomic samples included in this analysis.Samples are color-coded according to geographical site and the dominant plant species in the area where sediment collection occurred.Inner rings represent GC content (dark gray) and contig length (gray).Contigs are clustered (inner tree) based on the sequence composition and differential coverage using Euclidean distance and Ward hierarchical clustering method.The samples' order (rings) was determined using a clustering method based on the mean coverage.Bar plots represent (top to bottom) the total number of reads in each library, the total number of reads mapped to each respective MAG, the percentage of mapped reads, and the total number of Single Nucleotide Variants (SNV), Single Codon Variants (SCV) and indels.B) Anvi'o pangenomic analysis of eight sulfur-cycling bacteria MAGs.In each ring, each vertical line represents a gene.The genes' order was determined using Euclidean distance and Ward clustering method based on the presence or absence of a gene cluster across all MAGs.In the outer rings, we indicate the number of genes in each gene cluster, the number of paralogous, and the results of the combined homogeneity test.Dark gray bars in the next set of rings indicate a known gene function, annotated according to COG 2020.The single-copy core gene cluster, present in all reference-guided reassembled MAGs, is highlighted in purple.The primary MAG used as a reference is shown in dark teal.Each of the remaining circles represents a reference-guided reassembled MAG from each of the salt-mash metagenomic samples included in this analysis.Samples are color-coded according to geographical site and the dominant plant species in the area where sediment collection occurs.

Figure S9
Reassembled MAGs ALJR ( 8) MASA ( 4) MASP ( 2)   5) MASP ( 5) MASA ( 5) MASP ( 5)    Figure S18.A) Coverage (left) and gene variability (right) of selected genes.In the heatmaps, each row represents a gen (see Supplementary Data 9 for the full list of genes), and each column is a metagenomic sample.Genes with average coverage under 5 were excluded from the analysis and are indicated in gray.Gene variability was calculated as average gene entropy from single codon variants (SCVs) and results are summarized in Supplemental Data 9. B) Comparison of coverage and entropy values for the two closely-related MAGs ALJR36 and MASA10.

Primary MAG
secretion, and vesicular transport-Intracellular trafficking, secretion, and vesicular transport 1.45E-07 U-U Amino acid transport and metabolism-General function prediction only-Replication, recombination and repair 1.45E-07 E-R-L Amino acid transport and metabolism-Posttranslational modification, protein turnover, chaperones 3.21E-06 E-O Coenzyme transport and metabolism-Inorganic ion transport and metabolism 1

Figure S1 .Figure S2 .
Figure S1.Maximum likelihood phylogenetic trees of the dsrA (top) and dsrB (bottom) genes identified in sulfur-cycling MAGs from Alabama and Massachusetts salt marshes.Phyla are shown on the right of each branch.

Figure S4 .
Figure S4.Contig-specific coverage values for each salt marsh sample included in this study of each sulfur cycling bacteria MAG identified in the co-assembly from Alabama in sediments inhabited by the plant Juncus roemerianus (ALJR).For each sample, contig coverage (expressed as CMP, y-axis) is sorted from low to high.For comparison purposes, samples collected in the same geographical area and in the proximity of the same plant are colored with the same color.Full data is presented in Supplemental Data 3.

Figure S6 .
Figure S6.Contig-specific coverage values for each salt marsh sample included in this study of each sulfur cycling bacteria MAG identified in the co-assembly from Massachusetts in sediments inhabited by Sporobolus alterniflorus (MASA) or Sporobolus pumilus (MASP).For each sample, contig coverage (expressed as CMP, y-axis) is sorted from low to high.For comparison purposes, samples collected in the same geographical area and in the proximity of the same plant are colored with the same color.Full data is presented in Supplemental Data 3.

Table S1 .
Summary of sampling locations, dominant vegetation, depth, dates, and the number of reads of the selected metagenomic datasets.(AL, Alabama; MA, Massachusetts).

Table S2 .
Summary statistics for each of the metagenomic co-assemblies.

Table S5 .
Number of reads recruited per MAG after Bowtie2 metagenomic sample alignment.

Table S6 .
Estimation of completeness (CheckM) of the primary MAGs and the reference-guided reassembled MAGs.1
Squeme of the Reverse Citrate Cycle (Arnon-Buchanan) and Phosphate acetyltransferase-acetate kinase pathways.The enzymes detected in the reference-guided reassembled metagenomes corresponding to the group ALJR36 are indicated in dark grey and those in the MASA10 reassembled group are in light grey.Adapted from Kegg map (00720) representing the Carbon Fixation Pathways in Prokaryotes.Phylogeny reconstructed for MA/AL DGRs.Reverse transcriptase sequences were aligned with representative RTs from previously identified DGRs in Paul et al. (2017).The extracted subtree shows closely related DGR-RTs from this study (dashed turquoise box) and other bacterial clades.Scale bar indicates substitutions per site.