Decomposing a San Francisco estuary microbiome using long-read metagenomics reveals species- and strain-level dominance from picoeukaryotes to viruses

ABSTRACT Although long-read sequencing has enabled obtaining high-quality and complete genomes from metagenomes, many challenges still remain to completely decompose a metagenome into its constituent prokaryotic and viral genomes. This study focuses on decomposing an estuarine metagenome to obtain a more accurate estimate of microbial diversity. To achieve this, we developed a new bead-based DNA extraction method, a novel bin refinement method, and obtained 150 Gbp of Nanopore sequencing. We estimate that there are ~500 bacterial and archaeal species in our sample and obtained 68 high-quality bins (>90% complete, <5% contamination, ≤5 contigs, contig length of >100 kbp, and all ribosomal and tRNA genes). We also obtained many contigs of picoeukaryotes, environmental DNA of larger eukaryotes such as mammals, and complete mitochondrial and chloroplast genomes and detected ~40,000 viral populations. Our analysis indicates that there are only a few strains that comprise most of the species abundances.

IMPORTANCE Ocean and estuarine microbiomes play critical roles in global element cycling and ecosystem function. Despite the importance of these microbial communities, many species still have not been cultured in the lab. Environmental sequencing is the primary way the function and population dynamics of these communities can be studied. Long-read sequencing provides an avenue to overcome limitations of short-read technologies to obtain complete microbial genomes but comes with its own technical challenges, such as needed sequencing depth and obtaining high-quality DNA. We present here new sampling and bioinformatics methods to attempt decomposing an estuarine microbiome into its constituent genomes. Our results suggest there are only a few strains that comprise most of the species abundances from viruses to picoeukaryotes, and to fully decompose a metagenome of this diversity requires 1 Tbp of long-read sequencing. We anticipate that as long-read sequencing technologies continue to improve, less sequencing will be needed.
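The high-quality bin thresholds quoted in the abstract can be written as a simple filter. The sketch below is illustrative only and is not the authors' code: the `Bin` class, its field names, and the example bins are hypothetical, and it assumes that the ">100 kbp" criterion applies to every contig in a bin.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bin:
    """Summary statistics for one genome bin (all field names are hypothetical)."""
    name: str
    completeness: float            # percent complete
    contamination: float           # percent contamination
    contig_lengths_bp: List[int]   # length of each contig in the bin
    has_all_rrna_and_trna: bool    # all ribosomal and tRNA genes found


def is_high_quality(b: Bin) -> bool:
    """Apply the thresholds quoted in the abstract."""
    return (
        b.completeness > 90.0
        and b.contamination < 5.0
        and 0 < len(b.contig_lengths_bp) <= 5
        # Assumption: ">100 kbp" is read as applying to every contig in the bin.
        and all(length > 100_000 for length in b.contig_lengths_bp)
        and b.has_all_rrna_and_trna
    )


# Made-up example bins, not data from the study.
bins = [
    Bin("bin_a", 98.2, 1.1, [2_400_000], True),
    Bin("bin_b", 93.0, 6.5, [1_800_000, 250_000], True),            # too contaminated
    Bin("bin_c", 95.4, 0.8, [900_000, 80_000], True),               # one contig too short
    Bin("bin_d", 91.7, 2.3, [1_200_000, 400_000, 150_000], False),  # missing RNA genes
]

high_quality = [b.name for b in bins if is_high_quality(b)]
print(high_quality)
```

In this made-up example only `bin_a` passes all five criteria; each of the other bins fails exactly one threshold, which makes the effect of each criterion easy to see.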

• Upload point-by-point responses to the issues raised by the reviewers in a file named "Response to Reviewers," NOT in your cover letter.
• Upload a compare copy of the manuscript (without figures) as a "Marked-Up Manuscript" file.
• Upload a clean .DOC/.DOCX version of the revised manuscript and remove the previous version.
• Each figure must be uploaded as a separate, editable, high-resolution file (TIFF or EPS preferred), and any multipanel figures must be assembled into one file.
• Any supplemental material intended for posting by ASM should be uploaded with their legends separate from the main manuscript. You can combine all supplemental material into one file (preferred) or split it into a maximum of 10 files with all associated legends included.
For complete guidelines on revision requirements, see our Submission and Review Process webpage. Submission of a paper that does not conform to guidelines may delay acceptance of your manuscript.
Data availability: ASM policy requires that data be available to the public upon online posting of the article, so please verify all links to sequence records, if present, and make sure that each number retrieves the full record of the data. If a new accession number is not linked or a link is broken, provide mSystems production staff with the correct URL for the record. If the accession numbers for new data are not publicly accessible before the expected online posting of the article, publication may be delayed; please contact production staff (mSystems@asmusa.org) immediately with the expected release date.
Publication Fees: For information on publication fees and which article types are subject to charges, visit our website. If your manuscript is accepted for publication and any fees apply, you will be contacted separately about payment during the production process; please follow the instructions in that e-mail. Arrangements for payment must be made before your article is published.

ASM Membership:
Corresponding authors may join or renew ASM membership to obtain discounts on publication fees. Need to upgrade your membership level? Please contact Customer Service at Service@asmusa.org.
The ASM Journals program strives for constant improvement in our submission and publication process. Please tell us how we can improve your experience by taking this quick Author Survey.
Thank you for submitting your paper to mSystems.

Sincerely,
Xiao-Hua Zhang
Editor
mSystems

Reviewer #1 (Comments for the Author): Ocean and estuarine microbiomes cause high interest because of their fundamental roles in global element cycling. San Francisco Estuary, the largest estuary on the west coast of the United States, has nutrient loadings that are higher than other estuaries. This, along with increased algal toxins and primary production in recent years, supports the hypothesis that the San Francisco Estuary is the best model to investigate the health of the Ocean ecosystems.
The manuscript titled "Decomposing a San Francisco estuary microbiome using long read metagenomics reveals species- and strain-level dominance from picoeukaryotes to viruses" is devoted to the study of metagenome to obtain a more accurate estimate of microbial diversity. To achieve this, a new bead-based DNA extraction method was developed, a novel bin refinement method designed, and 150 Gbases of Nanopore sequencing were obtained. An estimated ~500 bacteria and archaeal species in the sample, and 68 high-quality bins (>90% complete, <5% contamination, ≤5 contigs, contig length >100 Kbases, and all ribosomal and tRNA genes) were obtained. Many contigs of picoeukaryotes, environmental DNA of larger eukaryotes such as mammals, complete mitochondrial and chloroplast genomes were obtained. ~40,000 viral populations were detected too. Ocean and estuarine microbiomes play critical roles in global element cycling. New sampling and bioinformatics methods to attempt decomposing an estuarine microbiome into its constituent genomes. The results suggest there are only a few strains that comprise most of the species abundances from viruses to picoeukaryotes. Characterization of a metagenome of this diversity requires 1 Tbase of long read sequencing. The manuscript is well-written, the presented data support conclusions, and this work can be very important for further development of metagenome and ecosystems analysis.
Reviewer #2 (Comments for the Author): In this article, Lui and Nielsen describe a detailed and thorough interrogation of the microbial composition of a single water sample from the San Francisco Estuary using nanopore sequencing. Their observations are interesting and the assembly methods are rigorous, but the objectives and takeaways from this n of 1 study are not entirely clear. It is also not clear if it is intended to be presented as a methods study demonstrating improvement in genomic methodologies, or as a results paper presenting new observations about the estuary community.
My major recommendation would be to strengthen the validation of the methods demonstrated in the paper by comparing the findings more thoroughly with what would be found by using Illumina sequencing alone and/or with more standard nanopore methods (e.g., fully automated binning, a single classification tool). To claim that these results are "advances in the use of long reads to obtain genomes from metagenomes", there should be a more detailed comparison with previous methods to put the authors' findings into context. This could include comparing overall properties of the assembly as well as more detailed features. For example, plasmids are thought to be retained better by Illumina sequencing, since they can be lost from a nanopore library prep due to circularity and/or size selection. It would be very interesting to compare plasmids detected in your Illumina data with those found in the nanopore assembly. Similarly, there is a nice comparison with the Illumina taxonomic assignments in lines 269-275, but the details of this comparison (the higher-level Illumina classifications, or which classifications are gained and lost with each method) don't appear to be actually included in the results anywhere.
Other major comments:
-The introduction mentions the importance of high nitrogen and phosphorus loadings in this environment and the possibility of the estuary being at a "tipping point". It would be helpful if the analysis and/or discussion could address whether/how the improved assembly helps address these questions (perhaps gaining deeper insights into the set of organisms with particular metabolic capabilities).
-There are occasional value judgment statements presented without concrete support or details. For example:
-lines 152-154 "Sorting through the different classification methods also points out the hazards of relying on one method" -what are the hazards?
-lines 290-294: "We were selective about choosing binning software...eventually we settled on GraphMB" How/why was GraphMB chosen?
-Are the authors planning to make code available for their binning workflow? (since that is one of the main advances in this paper)
-I recommend the authors reframe their conclusion to focus more on the major takeaways of the study and how they might be useful for other researchers, instead of previewing next steps/unpublished data by the authors.

Minor comments:
-Define SSU when it is first used (line 133?). I would recommend referring to these as "SSU genes" for clarity but not necessary.
-lines 326-328: It would be helpful to explain a little more specifically why SAR11 genomes are difficult to assemble
-lines 342-346: Did the authors distinguish integrated prophages when classifying contigs as viral? (other than based on the presence of SSU genes)
-The "Results and Discussion" section includes short descriptions of the sampling, sequencing, and taxonomic classification, but not of the assembly and/or polishing. It would help with clarity to add a sentence or two to fill in this gap.
-There is some assembly jargon that should be defined for the more general mSystems reader, e.g., line 170: "taxonomic collapse"; line 304: "chaos" bins
-I recommend replacing "manpower/man hours" in line 394 with a gender neutral term such as "labor", "person hours".

Reviewer #1 (Comments for the Author):
Ocean and estuarine microbiomes cause high interest because of their fundamental roles in global element cycling. San Francisco Estuary, the largest estuary on the west coast of the United States, has nutrient loadings that are higher than other estuaries. This, along with increased algal toxins and primary production in recent years, supports the hypothesis that the San Francisco Estuary is the best model to investigate the health of the Ocean ecosystems. The manuscript titled "Decomposing a San Francisco estuary microbiome using long read metagenomics reveals species- and strain-level dominance from picoeukaryotes to viruses" is devoted to the study of metagenome to obtain a more accurate estimate of microbial diversity. To achieve this, a new bead-based DNA extraction method was developed, a novel bin refinement method designed, and 150 Gbases of Nanopore sequencing were obtained. An estimated ~500 bacteria and archaeal species in the sample, and 68 high-quality bins (>90% complete, <5% contamination, ≤5 contigs, contig length >100 Kbases, and all ribosomal and tRNA genes) were obtained.
Many contigs of picoeukaryotes, environmental DNA of larger eukaryotes such as mammals, complete mitochondrial and chloroplast genomes were obtained. ~40,000 viral populations were detected too. Ocean and estuarine microbiomes play critical roles in global element cycling. New sampling and bioinformatics methods to attempt decomposing an estuarine microbiome into its constituent genomes.
The results suggest there are only a few strains that comprise most of the species abundances from viruses to picoeukaryotes. Characterization of a metagenome of this diversity requires 1 Tbase of long read sequencing.
The manuscript is well-written, the presented data support conclusions, and this work can be very important for further development of metagenome and ecosystems analysis.

Response:
We thank the reviewer for their comments.

Reviewer #2 (Comments for the Author):
In this article, Lui and Nielsen describe a detailed and thorough interrogation of the microbial composition of a single water sample from the San Francisco Estuary using nanopore sequencing. Their observations are interesting and the assembly methods are rigorous, but the objectives and takeaways from this n of 1 study are not entirely clear. It is also not clear if it is intended to be presented as a methods study demonstrating improvement in genomic methodologies, or as a results paper presenting new observations about the estuary community.

Comment #1:
My major recommendation would be to strengthen the validation of the methods demonstrated in the paper by comparing the findings more thoroughly with what would be found by using Illumina sequencing alone and/or with more standard nanopore methods (e.g., fully automated binning, a single classification tool). To claim that these results are "advances in the use of long reads to obtain genomes from metagenomes", there should be a more detailed comparison with previous methods to put the authors' findings into context. This could include comparing overall properties of the assembly as well as more detailed features. For example, plasmids are thought to be retained better by Illumina sequencing, since they can be lost from a nanopore library prep due to circularity and/or size selection. It would be very interesting to compare plasmids detected in your Illumina data with those found in the nanopore assembly.
Similarly, there is a nice comparison with the Illumina taxonomic assignments in lines 269-275, but the details of this comparison (the higher-level Illumina classifications, or which classifications are gained and lost with each method) don't appear to be actually included in the results anywhere.

Response:
We appreciate the reviewer's comments regarding putting the Nanopore assemblies in context with Illumina assemblies. We had previously avoided comparing the Illumina and Nanopore assemblies because of the vast difference in sequencing effort (150 Gbp vs 21 Gbp), as this affects the quality of the assembly and makes comparisons difficult. In general, the shorter average length of the contigs in the Illumina assembly also makes it more difficult to classify the contigs as plasmids or to assign taxonomy. We still think that these points hold true, but we agree that an additional comparison to the Illumina assembly provides useful analysis and discussion, and we have added text on plasmids (lines 211-226) and taxonomic assignments of SSUs (lines 323-326).

Comment #2:
-The introduction mentions the importance of high nitrogen and phosphorus loadings in this environment and the possibility of the estuary being at a "tipping point". It would be helpful if the analysis and/or discussion could address whether/how the improved assembly helps address these questions (perhaps gaining deeper insights into the set of organisms with particular metabolic capabilities).
We have added additional text to the introduction to clarify this point and explain the potential impact of improved assembly in relation to biogeochemical cycling studies (lines 69-78).
-There are occasional value judgment statements presented without concrete support or details.
We thank the reviewer for pointing out the statements below, as the manuscript benefits from the clarifying text. We have included explanations below and the corresponding edited lines in the revised manuscript.

For example:
-lines 152-154 "Sorting through the different classification methods also points out the hazards of relying on one method" -what are the hazards?
In more than one case, the database used had a strong influence on the results. For example, if only MMseqs2 with the GTDB database was used for classification, then ~85K contigs were classified as bacterial. This is in contrast with our final evaluation, where we estimate that only ~41K contigs were bacterial after using all of the different classification methods. We have added more description in the main text to clarify this point (lines 161-168, line 180).
-lines 290-294: "We were selective about choosing binning software...eventually we settled on GraphMB" How/why was GraphMB chosen?
Older metagenomics binners do not take full advantage of the assembly graphs that are available from modern assemblers. They are also not optimized for long-read assemblies. GraphMB employs graph neural networks that take full advantage of the assembly graph and the long reads. We have added text to address this point (lines 345-346).
-Are the authors planning to make code available for their binning workflow? (since that is one of the main advances in this paper)
We would like to make all of the code we use available. However, it is not in the form of a finished script that can run automatically. Rather, it is a collection of small pieces that are run iteratively and require human intervention between iterations. Longer term, we are working on ways of building automated workflows to help with this.
-I recommend the authors reframe their conclusion to focus more on the major takeaways of the study and how they might be useful for other researchers, instead of previewing next steps/unpublished data by the authors.
We have added more text on the major takeaways (lines 363-368).

Minor comments:
-Define SSU when it is first used (line 133?). I would recommend referring to these as "SSU genes" for clarity but not necessary.
This edit has been added to the manuscript (lines 151-152).
-lines 326-328: It would be helpful to explain a little more specifically why SAR11 genomes are difficult to assemble
We have added additional text regarding this point (lines 392-395).
-lines 342-346: Did the authors distinguish integrated prophages when classifying contigs as viral? (other than based on the presence of SSU genes)
We used geNomad to predict whether contigs were viral, and it also reports whether a contig contains a prophage. Contigs containing prophages were excluded from the viral contig counts. We have added clarifying text (lines 572-573).
-The "Results and Discussion" section includes short descriptions of the sampling, sequencing, and taxonomic classification, but not of the assembly and/or polishing. It would help with clarity to add a sentence or two to fill in this gap.
Additional text has been added to describe the assembly and polishing steps (lines 128-132).
-There is some assembly jargon that should be defined for the more general mSystems reader, e.g., line 170: "taxonomic collapse"; line 304: "chaos" bins
We thank the reviewer for pointing this out and have added clarifying text for the above jargon (lines 197-199, 357-365).
-I recommend replacing "manpower/man hours" in line 394 with a gender neutral term such as "labor", "person hours".
We appreciate this comment about gender neutral terms and have changed it to "person" hours (line 468).
1st Revision - Editorial Decision

Re: mSystems00242-24R1 (Decomposing a San Francisco estuary microbiome using long read metagenomics reveals species- and strain-level dominance from picoeukaryotes to viruses)

Dear Dr. Lauren M Lui:

I am happy with the current version.
Your manuscript has been accepted, and I am forwarding it to the ASM production staff for publication. Your paper will first be checked to make sure all elements meet the technical requirements. ASM staff will contact you if anything needs to be revised before copyediting and production can begin. Otherwise, you will be notified when your proofs are ready to be viewed.
Data Availability: ASM policy requires that data be available to the public upon online posting of the article, so please verify all links to sequence records, if present, and make sure that each number retrieves the full record of the data. If a new accession number is not linked or a link is broken, provide production staff with the correct URL for the record. If the accession numbers for new data are not publicly accessible before the expected online posting of the article, publication may be delayed; please contact ASM production staff immediately with the expected release date.
Publication Fees: For information on publication fees and which article types have charges, please visit our website. We have partnered with Copyright Clearance Center (CCC) to collect author charges. If fees apply to your paper, you will receive a message from no-reply@copyright.com with further instructions. For questions related to paying charges through RightsLink, please contact CCC at ASM_Support@copyright.com or toll free at +1-877-622-5543. CCC makes every attempt to respond to all emails within 24 hours.
ASM Membership: Corresponding authors may join or renew ASM membership to obtain discounts on publication fees. Need to upgrade your membership level? Please contact Customer Service at Service@asmusa.org.
PubMed Central: ASM deposits all mSystems articles in PubMed Central and international PubMed Central-like repositories immediately after publication. Thus, your article is automatically in compliance with the NIH access mandate. If your work was supported by a funding agency that has public access requirements like those of the NIH (e.g., the Wellcome Trust), you may post your article in a similar public access site, but we ask that you specify that the release date be no earlier than the date of publication on the mSystems website.

Embargo Policy:
A press release may be issued as soon as the manuscript is posted on the mSystems Latest Articles webpage. The corresponding author will receive an email with the subject line "ASM Journals Author Services Notification" when the article is available online.
Cover Image Submissions: If you would like to submit a potential Cover Image, please email a file and a short legend to msystems@asmusa.org. Please note that we can only consider images that (i) the authors created or own and (ii) have not been previously published. By submitting, you agree that the image can be used under the same terms as the published article. Image File requirements: TIF/EPS, 7.5 inches wide by 8.25 inches tall (at least 2,250 pixels wide by 2,475 pixels tall), minimum 300 dpi resolution (600 dpi preferred), RGB, and no figure elements, e.g., arrows or panel labels. The legend should be a short description of the image, 1-2 sentences recommended. Please download and use this interactive template in Adobe to ensure that your proposed cover image meets our size requirements (https://journals.asm.org/pb-assets/pdf-text-excel-files/ASM-Interactive-Sizing-Cover-Template-1715689791.pdf).
Author Video: For mSystems research articles, you are welcome to submit a short author video for your recently accepted paper. Videos are normally 1 minute long and are a great opportunity for junior authors to get greater exposure. Importantly, this video will not hold up the publication of your paper and you can submit it at any time.

Details of the video are:
• Minimum resolution of 1280 x 720
• .mov or .mp4 video format
• Provide video in the highest quality possible but do not exceed 1080p
• Provide a still/profile picture that is 640 (w) x 720 (h) max
• Provide the script that was used
We recognize that the video files can become quite large, so to avoid quality loss ASM suggests sending the video file via https://www.wetransfer.com/. When you have a final version of the video and the still ready to share, please send it to mSystems staff at mSystems@asmusa.org.
Thank you for submitting your paper to mSystems.
Sincerely,
Xiao-Hua Zhang
Editor
mSystems

Reviewer #1 (Comments for the Author): The manuscript "Decomposing a San Francisco estuary microbiome using long read metagenomics reveals species- and strain-level dominance from picoeukaryotes to viruses" has received a high score in the first round of review. There is no doubt that it would be even better after some revisions. There is no doubt that it is even better after some revision.
Reviewer #2 (Comments for the Author): I thank the authors for their thoughtful responses -the paper is much improved. The comparison of plasmid annotations between the two assemblies is very interesting. Two comments:
-Line 461 still discusses "manhours" although the line above it was changed to "person hours"
-I would still strongly recommend that the authors make their binning code publicly available for reference (consistent with mSystems policy), even if not in the form of a fully polished software tool.