Deduplication Improves Cost-Efficiency and Yields of De Novo Assembly and Binning of Shotgun Metagenomes in Microbiome Research

ABSTRACT In the last decade, metagenomics has revolutionized the study of microbial communities. However, artificial duplicate reads, arising mainly from the preparation of metagenomic DNA sequencing libraries, and their impacts on metagenomic assembly and binning have received little attention. Here, we explicitly investigated the effects of duplicate reads on metagenomic assemblies and binning based on analyses of five groups of representative metagenomes with distinct microbiome complexities. Our results showed that deduplication considerably increased the binning yields (by 3.5% to 80%) for most of the metagenomic data sets examined, thanks to the improved contig length and coverage profiling of metagenome-assembled contigs, whereas it slightly decreased the binning yields of metagenomes with low complexity (e.g., human gut metagenomes). Specifically, 411 versus 397, 331 versus 317, 104 versus 88, and 9 versus 5 metagenome-assembled genomes (MAGs) were recovered from MEGAHIT assemblies of bioreactor sludge, surface water, lake sediment, and forest soil metagenomes, respectively. Noticeably, deduplication significantly reduced the computational costs of the metagenomic assembly, including the elapsed time (9.0% to 29.9%) and the maximum memory requirement (4.3% to 37.1%). Collectively, we recommend the removal of duplicate reads in metagenomes with high complexity before assembly and binning analyses, for example, the forest soil metagenomes examined in this study. IMPORTANCE Duplicated reads in shotgun metagenomes are usually considered technical artifacts. Their presence in metagenomes would theoretically not only introduce bias into the quantitative analysis but also result in mistakes in the coverage profile, leading to adverse effects on or even failures in metagenomic assembly and binning, as the widely used metagenome assemblers and binners all need coverage information for graph partitioning and assembly binning, respectively.
However, this issue was seldom noticed, and its impacts on downstream essential bioinformatic procedures (e.g., assembly and binning) remained unclear. In this study, we comprehensively evaluated for the first time the implications of duplicate reads for the de novo assembly and binning of real metagenomic data sets by comparing the assembly qualities, binning yields, and requirements for computational resources with and without the removal of duplicate reads. It was revealed that deduplication considerably increased the binning yields of metagenomes with high complexity and significantly reduced the computational costs, including the elapsed time and the maximum memory requirement, for most of the metagenomes studied. These results provide empirical references for more cost-efficient metagenomic analyses in microbiome research.

4. Two popular metagenome assembly tools, i.e., MEGAHIT and metaSPAdes, were used for the evaluation. While most results consistently suggest improved assembly and binning performance after deduplication, it seems that the selection of the two assembly tools matters for the comparative patterns before and after deduplication (e.g., Fig. 3). This should be clarified and properly discussed.
5. The idea of digital normalization and partitioning should be discussed in light of saving computational cost for metagenomic datasets. a) Howe, Adina Chuang, Janet K. Jansson, Stephanie A. Malfatti, Susannah G. Tringe, James M. Tiedje, and C. Titus Brown. "Tackling Soil Diversity with the Assembly of Large, Complex Metagenomes." Proceedings of the National Academy of Sciences 111, no. 13 (April 1, 2014): 4904-9. https://doi.org/10.1073/pnas.1402564111.
6. The comparative results showed that the binning yields of metaSPAdes assemblies are remarkably higher than those of MEGAHIT assemblies, while the disagreement rates of MAGs recovered from deduplicated data were lower than those of MAGs recovered from clean data (without deduplication) at most taxonomic levels. These interesting results seem to show that both the assembly tools and deduplication co-affect the quality of the MAGs obtained, which should be, but has not yet been, discussed with explanations of the observed differences between assembly tools and before and after deduplication.
Minor comments: FastQC should be mentioned, as it is the most robust tool for checking the duplication level of a metagenomic dataset.
Line 123-125: how was the coverage of microbiome diversity computed? Please clarify.
Line 131: add 'The results showed that' before 'the assembly quality'.
Line 144: use lowercase for 'Significant'.
Line 145: rephrase 'lower with deduplication'.
Line 166: unify the use of 'MAGs' and 'bins', which refer to the same data type.
Line 171: rephrase 'Results showed' as 'The results showed' throughout.
How was the chimera in MAGs checked? Please specify.
Line 293: what did 'this' refer to? Please specify.
Line 567: Nonpareil should be the software used for the analysis, right? This can be briefly mentioned to ensure that all readers can easily understand what it is.
Fig. 1: X-axis 'Sequencing effort' should be 'Sequencing depth'.
Fig. 2: what is the unit for the Y-axis of (c) and (d)? Please add the label.
Table 1: how was a high-quality MAG defined? This should be clearly described in the legend.
Reviewer #2 (Comments for the Author): Summary: The paper by Zhang et al. provides a systematic analysis of the effect of deduplicating sequencing reads prior to performing MAG binning. The authors show that deduplication significantly decreases computational effort in the assembly and binning process, and results in equivalent or higher numbers of MAGs across varied natural environments. This result has the potential to reduce computational effort exerted for assembly when the objective is creating MAG bins, as well as to increase recovery of novel species into MAG bins in environmental datasets.
The paper is well-composed, but would benefit from some editing for grammatical clarity, as well as some naming and organizational revision (the latter is described in more detail in the Major and Minor Comments). Overall, the most confusing aspect of the paper was the use of the two distinct assemblers across the environmental datasets. In many cases, differences between the MEGAHIT and metaSPAdes assemblers overshadowed or obscured differences attributable to the deduplication approach. The authors are encouraged to reconsider the organization of sections into MEGAHIT and metaSPAdes-specific paragraphs, and to potentially revise the focus to be on the recovery in each collection of environmental metagenomes, regardless of assembler. This would help to clarify the discrepancy that metaSPAdes results were occasionally unaffected by deduplication.
The authors are encouraged to either provide a public repository containing the code used to generate dereplication, assembly, binning, and post-analysis workflows, or to provide more detailed information on the parameters used for each tool that was applied to the data. This would ensure that the analysis is reproducible and clarify the methods. More specific notes on each tool referenced are provided in the Minor Comments below.
Major Comments: -Beyond microbiome complexity, it is unclear how the samples varied with respect to the taxonomic diversity of MAGs, and whether specific lineages might have been disproportionately affected by your deduplication approach. I suggest that you add information about the rate of recovery of different taxonomic groups and potentially functional genes across the different environmental samples, rather than comparing the results of using two different assemblers.
-I suggest that you add a section to the discussion, in addition to identifying specific differences between the MAGs recovered with respect to captured functional and taxonomic diversity, which addresses the effect that your approach could have on existing studies. Based on the increase in taxonomic diversity of species-specific MAGs, what is the potential for extending known MAG representatives from environmental samples? This may be a more exciting note to end on than the technical methods of avoiding the presence of duplicated reads in the first place.
-No details are provided in the methods about the statistical tests that were applied to the data; this information is only included in figure legends and parenthetical remarks.
Minor Comments: -The use of "clean" throughout to refer to quality filtered reads is confusing, since the deduplicated reads are also "cleaned." I would suggest changing the terminology to "standard" or similar to emphasize that your key advance is the deduplication, rather than the cleaning process.
-Lines 176-178: This is confusing; I think what is meant here is that most of the MAGs recovered with the standard/"clean" assembly were also recovered after deduplication of reads, but that the new MAGs generated by the deduplication process were not just duplicate species; they were new, not previously recovered species.
-Lines 206-207: Consider revising language. Fewer chimeras being present implies that there were some MAGs that fully lacked disagreement, whereas here what is being assessed is a rate of disagreement on a gradient from fully distinct to "fully" chimeric (not interpretable as a distinct organism/species-level genome).
-Lines 255-256: Consider expanding this list of references; this would be more impactful if a more representative set of studies was included.
-Lines 284-286: It is unclear how the duplicated reads and skewed coverage information specifically lead to contamination in this explanation. Why would it not affect only yields, beyond your observations?
-Lines 293-294: Is there a consensus between binning developers, or could this be specifically associated with the binning tool you used in the study?
-Lines 326-328: Can duplicate reads be wholly avoided? Is there a recommendation for identifying when duplicate reads may exist?
-Lines 343-344: The parameters used to decide which reads were eliminated, as well as a list of adapters, should be provided (either in the text or in a public repository).
-Lines 350-353: For the cross-sample coverage, it is unclear what method was used to calculate coverage profiles. If this was using the companion script to MetaBAT2, this should be specified, and the workflow used to generate BAM coverage files should also be included.
-Line 354: What parameters/settings were used for MetaBAT2?
-Lines 364-365: Do you expect all MAGs to be prokaryotic? Details on the filtering protocol used are missing from the sequencing methodology.
-Lines 376-377: Can the in-house script be published alongside the paper?
-Lines 370-372: How were these 10 marker genes selected? Can we be sure that they are balanced?
-Line 378+: Details on how figures were generated are missing; were all statistical tests and figures generated using base R utilities? If so, it would be good to explicitly specify this, as well as to list the R functions used for statistical analyses.
-Figure 2: It appears as though the significance test is being calculated across samples, yet bars are used to show the outcomes of individual assemblies for each environmental sample. This figure would be much easier to read and interpret if boxplots or other distribution-based plots compared each set of environmental samples, rather than distinct bars that do not appear to be ordered in any particular way from left to right (unless this information is missing from the figure).
-Figure 3: A scatter plot with colors or shapes indicating each environmental sample type (along with a one-to-one line) may make it easier to read and interpret the difference in the number of MAGs before and after deduplication. The two adjacent bars are similar in size, and it is challenging to compare the relative difference between each sample type.

Staff Comments:

Preparing Revision Guidelines
To submit your modified manuscript, log onto the eJP submission site at https://spectrum.msubmit.net/cgi-bin/main.plex. Go to Author Tasks and click the appropriate manuscript title to begin the revision process. The information that you entered when you first submitted the paper will be displayed. Please update the information as necessary. Here are a few examples of required updates that authors must address:
• Point-by-point responses to the issues raised by the reviewers in a file named "Response to Reviewers," NOT IN YOUR COVER LETTER.
• Upload a compare copy of the manuscript (without figures) as a "Marked-Up Manuscript" file.
• Each figure must be uploaded as a separate file, and any multipanel figures must be assembled into one file.
For complete guidelines on revision requirements, please see the journal Submission and Review Process requirements at https://journals.asm.org/journal/Spectrum/submission-review-process. Submission of a paper that does not conform to Microbiology Spectrum guidelines will delay acceptance of your manuscript. Please return the manuscript within 60 days; if you cannot complete the modification within this time period, please contact me. If you do not wish to modify the manuscript and prefer to submit it to another journal, please notify me of your decision immediately so that the manuscript may be formally withdrawn from consideration by Microbiology Spectrum.
If your manuscript is accepted for publication, you will be contacted separately about payment when the proofs are issued; please follow the instructions in that e-mail. Arrangements for payment must be made before your article is published. For a complete list of Publication Fees, including supplemental material costs, please visit our website.
Remarks to the Editor:
In my opinion, the biggest gap in the analysis itself is a detailed comparison of the lineages of MAGs that were most considerably affected by the deduplication process. The authors rarely point out taxonomic distinctions between recovered MAGs, in particular because their environmental samples of varying complexity had such dramatically different rates of MAG binning. I suggest that the authors deemphasize the observed differences between metaSPAdes and MEGAHIT assembly, which have been previously reported, and refocus their efforts on explaining the features of the datasets that lead to the discrepancies. This will make the analysis more interpretable and actionable for researchers who hope to use the findings to guide their own work. I suggest that the authors expand their evaluation of these differences in their Results and Discussion sections. The authors also need to expand their explanation of the statistical tests performed, which appear to exist and be sufficient but are not described in the methods or supplementary file, as well as their description of the bioinformatic workflow used. For the latter in particular, publishing code or parameter details would be tremendously helpful and would make the manuscript more reproducible, as the major advance presented is their computational workflow. After these changes have been made, I recommend that the authors resubmit the manuscript.

Responses to the Comments of the Editor and Reviewers
Dear Editor and Reviewers,

Thank you for your valuable comments on our paper entitled "Deduplication Improves Cost-Efficiency and Yields of De Novo Assembly and Binning of Shotgun Metagenomes in Microbiome Research."

Editor's comments:

1. We have received helpful but also critical comments from two reviewers regarding your manuscript. I agree with the reviewers that this study provides a valuable comparative analysis that will be useful for the growing community of scientists using MAG binning to answer ecological questions. I also concur with the reviewers that substantial revision is needed to improve the clarity and writing and to provide more solid evidence and better discussion.
Response: Thank you for the decision on a major revision of our work. We have revised the manuscript according to your and the reviewers' comments and suggestions.
2. For instance, a fair amount of copy editing is needed throughout the manuscript.
Important and representative habitats like the human gut are suggested to be considered or included.
Response: We have edited the language throughout the manuscript. Following the suggestion, we also investigated the effects of deduplication on the assembly and binning of human gut metagenomes and included the results in the revised version of the manuscript.
3. The biggest gap in the analysis itself could be a detailed comparison of the lineages of MAGs that were most considerably affected by the deduplication process. The authors are suggested to deemphasize the observed differences between metaSPAdes and MEGAHIT assembly, which have been previously reported, and refocus their efforts on explaining the features of the datasets that lead to the discrepancies. This will make the analysis more interpretable and actionable for researchers who hope to use the findings to guide their own work.

Response: Thank you for the kind reminder. We have prepared the responses and the submitted files as you suggested.

Reviewer #1:
Metagenomics has dramatically promoted the direct study of the genetic material of microbial communities. This manuscript systematically evaluates the impacts of artificial duplicate reads in metagenomic datasets on metagenome assembly and binning, which represent the most commonly used bioinformatic procedures in today's metagenomic studies. Overall, this interesting study is rationally designed and well organized to evaluate the impacts of duplicate reads on assembly-based metagenomic analysis, an important but still unaddressed technical issue in the field.
The authors found that deduplication not only considerably increased the binning yields (by 3.5% to 80%), thanks to improved contig length and coverage profiling of contigs, but also reduced the computational costs of metagenomic assembly, including the elapsed time and maximum memory requirement. Based on these results, they recommended removing duplicate reads to achieve more cost-efficient processing of metagenomic data. The manuscript is well written, and the results are logically arranged to support the major findings. The study provides a valuable evaluation and new knowledge on the impact of artificially duplicated reads on downstream metagenomic analyses and their computational cost.
Response: Thank you very much for the positive comments and recognition of our work. We have made point-by-point responses to your valuable comments as follows.
1. The impacts of duplication were explored by analyzing metagenomic datasets of four representative ecosystem habitats, namely bioreactor sludge, surface water, lake sediment, and forest soil, which showed different levels of microbiome complexity.
These samples covered typical natural or engineered environmental microbiomes, while human, animal, or other host-associated environments were not evaluated. How about the human gut microbiome, in which the microbial diversity and complexity are generally lower?
What is the potential relationship between the generation of artificial duplicates and microbiome complexity or diversity? This could be properly discussed to expand the impact of the findings of this work.
Response: Thank you for bringing up this important point. Following your helpful suggestion, we evaluated the impacts of deduplication on the metagenomic assembly and binning of human gut metagenomes. On average, their microbial complexity was the lowest among the five sample types, as shown in Fig. 1, which would also increase the duplication rate. The duplication level is also correlated with the sequencing depth, i.e., the duplication level increases with increasing sequencing depth (The trouble with PCR duplicates | The Molecular Ecologist). We also observed a significant positive relationship between the duplication rate and the sequencing depth, as shown in the following figure. We clarified this point in Line 129-130 of the revised manuscript.

4. Two popular metagenome assembly tools, i.e., MEGAHIT and metaSPAdes, were used for the evaluation. While most results consistently suggest improved assembly and binning performance after deduplication, it seems that the selection of the two assembly tools matters for the comparative patterns before and after deduplication (e.g., Fig. 3). This should be clarified and properly discussed.

Line 337-349: Therefore, assembling a big-size complex metagenome will significantly challenge the computational capacity. Previously, digital normalization and partitioning of metagenomic data have been applied as preassembly filtering approaches in the analysis of complex soil metagenomes, which could significantly reduce the demand for computational memory and time for metagenomic de novo assembly and simultaneously produce nearly identical assemblies as compared with the unprocessed dataset (32). In this study, we revealed that deduplication could decrease the demand for computational resources and improve the binning yields simultaneously, as exemplified by five distinct groups of metagenomes, i.e., human gut, bioreactor sludge, surface water, lake sediment, and forest soil.
Therefore, deduplication, digital normalization, and partitioning could be considered together as preassembly filtering approaches for reducing the demand for computational resources, particularly for big-size complex environmental metagenomes.
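To make the duplication-rate concept discussed above concrete, the following is a minimal sketch (not the authors' pipeline, which relies on dedicated deduplication tools) of how exact duplicate reads can be flagged by hashing read sequences. Real deduplication tools also handle paired-end reads and near-duplicates; this toy version considers exact sequence matches only, and the example reads are made up for illustration.

```python
def duplication_rate(reads):
    """Return the fraction of reads that are exact duplicates of an
    earlier read (n_duplicates / n_reads)."""
    seen = set()
    duplicates = 0
    for seq in reads:
        if seq in seen:
            duplicates += 1  # every repeat copy after the first counts
        else:
            seen.add(seq)
    return duplicates / len(reads) if reads else 0.0

# Toy example: "ACGT" appears three times, so two copies are duplicates.
reads = ["ACGT", "ACGT", "TTGA", "ACGT", "GGCC"]
print(f"duplication rate: {duplication_rate(reads):.2f}")  # prints 0.40
```

Deduplication then simply keeps the first copy of each sequence; the same set-based bookkeeping yields the deduplicated read set.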
6. The comparative results showed that the binning yields of metaSPAdes assemblies are remarkably higher than those of MEGAHIT assemblies, while the disagreement rates of MAGs recovered from deduplicated data were lower than those of MAGs recovered from clean data (without deduplication) at most taxonomic levels. These interesting results seem to show that both the assembly tools and deduplication co-affect the quality of the MAGs obtained, which should be, but has not yet been, discussed with explanations of the observed differences between assembly tools and before and after deduplication.
Response: Thank you for bringing up this point. We observed that the binning yields of metaSPAdes assemblies are higher than those of MEGAHIT assemblies, and more high-quality MAGs were recovered from metaSPAdes assemblies than from MEGAHIT assemblies for all five types of metagenomes (Table 1). Previous studies reported that the assembly quality (e.g., N50, the longest contig, and the number of contigs over 10,000 bp) of metaSPAdes was better than that of MEGAHIT, which might contribute to the better binning results. However, we did not observe a significant difference between assemblies of deduplicated and non-deduplicated data, except for the soil metagenomes.
Therefore, the reasons for the higher yields of binning metaSPAdes assemblies are still unclear.
In this study, we mainly focused on comparing the quantity and quality of MAGs recovered from data with and without deduplication. We found that the disagreement rates of MAGs recovered from deduplicated data were lower than those of MAGs recovered from clean data (without deduplication) at most taxonomic levels. We speculate that the presence of duplicate reads might distort the coverage profiles of contigs, leading to an increased contamination level of the MAGs. We also provided explanations in Line 309-310 of the revised manuscript.
Minor comments:

We annotated 10 single-copy marker genes recommended by STAG and evaluated the homogeneity of their taxonomic annotation for each MAG, as follows: "No annotation" if fewer than two marker genes were annotated; "Agreeing" if all marker genes had the same taxonomic annotation at all classification levels; and "Disagreeing" if different taxonomic annotations of marker genes were present at any classification level.
The disagreement rate was calculated as the number of "Disagreeing" MAGs divided by the number of all MAGs excluding "No annotation" ones. This method was also adopted by a recent Nature paper (5).

11. Fig. 1: X-axis 'Sequencing effort' should be 'Sequencing depth'.

Response: Thank you for your suggestion. We have changed it to "Sequencing depth".
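The disagreement-rate calculation described in the response above can be sketched as follows. This is an illustrative simplification of the STAG-based procedure, not the authors' actual script: each MAG is represented here simply as the list of taxonomy strings assigned to its annotated marker genes, and the example phylum names are placeholders.

```python
def classify_mag(marker_annotations):
    """Classify one MAG from the taxonomy strings of its annotated markers."""
    if len(marker_annotations) < 2:
        return "No annotation"  # fewer than two markers annotated
    # "Agreeing" only if every marker carries the same annotation
    return "Agreeing" if len(set(marker_annotations)) == 1 else "Disagreeing"

def disagreement_rate(mags):
    """Disagreeing MAGs / all MAGs excluding 'No annotation' ones."""
    labels = [classify_mag(m) for m in mags]
    annotated = [label for label in labels if label != "No annotation"]
    if not annotated:
        return 0.0
    return annotated.count("Disagreeing") / len(annotated)

mags = [
    ["p__Nitrospirota"] * 10,                        # Agreeing
    ["p__Nitrospirota"] * 9 + ["p__Chloroflexota"],  # Disagreeing
    ["p__Actinobacteriota"],                         # No annotation (<2 markers)
]
print(disagreement_rate(mags))  # 1 of 2 annotated MAGs disagrees -> 0.5
```

The "No annotation" MAGs are excluded from the denominator, matching the calculation stated in the text.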
12. Fig. 2: what is the unit for the Y-axis of (c) and (d)? Please add the label.

Response: Thank you for the reminder. The unit for the Y-axis of (c) and (d) is the same as that of (a) and (b). We have added the label.
13. Response: Thank you for the positive comments on our work.
2. The paper is well-composed, but would benefit from some editing for grammatical clarity, as well as some naming and organizational revision (the latter is described in more detail in the Major and Minor Comments). Overall, the most confusing aspect of the paper was the use of the two distinct assemblers across the environmental datasets.
In many cases, differences between the MEGAHIT and metaSPAdes assemblers overshadowed or obscured differences attributable to the deduplication approach. The authors are encouraged to reconsider the organization of sections into MEGAHIT and metaSPAdes-specific paragraphs, and to potentially revise the focus to be on the recovery in each collection of environmental metagenomes, regardless of assembler.
This would help to clarify the discrepancy that metaSPAdes results were occasionally unaffected by deduplication.

3. The authors are encouraged to either provide a public repository containing the code used to generate the dereplication, assembly, binning, and post-analysis workflows, or to provide more detailed information on the parameters used for each tool that was applied to the data. This would ensure that the analysis is reproducible and clarify the methods. More specific notes on each tool referenced are provided in the Minor Comments below.
Response: Thanks for your thoughtful suggestions. We have provided the detailed parameters used by each tool in the Materials and methods of the revised manuscript.
For example, Lines 403-410: the standard data sets and deduplicated data sets were individually de novo assembled using MEGAHIT (v1.2.9) (48) with the following parameters: "--min-contig-len 1000 -m 300 -t 30" and metaSPAdes (v3.14.1) (33) with the following parameters: "-m 400 -t 30 --only-assembler --meta". The resulting assembly of each sample was subsequently binned into metagenome-assembled genomes (MAGs) using the MetaWRAP pipeline (43) with the following parameters: "metawrap binning -o path_to_output_file binning_results -a path_to_the_assembly assembly.fasta --metabat2 -t 30 path_to_the_nine_samples/*.fastq".

Lines 189-200: When the taxonomic composition of MAGs was compared, identical phylum-level taxonomic groups were found to be recovered from the standard and deduplicated data of both the human gut metagenomes and the bioreactor sludge metagenomes (Fig. 4a). However, deduplication failed to recover Myxococcota_A MAGs from the surface water metagenomes, and it failed to recover Cyanobacteria MAGs but instead recovered Methylomirabilota MAGs from the lake sediment metagenomes (Fig. 4a). In contrast, extra phylum-level MAG representatives (i.e., Nitrospirota and Actinobacteriota) were recovered from the deduplicated data but not the standard data of the forest soil metagenomes (Fig. 4a), revealing a promoting effect of deduplication on genome recovery from soil metagenomes.
Lines 217-224: In terms of detailed composition, identical phylum-level taxonomic groups were recovered from the standard and deduplicated data of the human gut metagenomes and of the surface water metagenomes (Fig. 4b). However, deduplication failed to recover Firmicutes_C MAGs from the bioreactor sludge metagenomes and Cyanobacteria MAGs from the lake sediment metagenomes (Fig. 4b). In contrast, extra phylum-level representative MAGs (i.e., Thermoproteota and Krumholzibacteriota) were recovered from the deduplicated data but not the standard data of the forest soil metagenomes (Fig. 4b).
-I suggest that you add a section to the discussion, in addition to identifying specific differences between the MAGs recovered with respect to captured functional and taxonomic diversity, which addresses the effect that your approach

Minor comments:
-The use of "clean" throughout to refer to quality filtered reads is confusing, since the deduplicated reads are also "cleaned." I would suggest changing the terminology to "standard" or similar to emphasize that your key advance is the deduplication, rather than the cleaning process.
Response: Thanks for your suggestion. We have replaced "cleaned" with "standard" throughout the revised manuscript.
-Lines 176-178: This is confusing; I think what is meant here is that most of the MAGs recovered with the standard/"clean" assembly were also recovered after deduplication of reads, but that the new MAGs generated by the deduplication process were not just duplicate species, they were new, not previously recovered species.
Response: Our apologies for making you confused. Your understanding is precisely right. We have revised the sentence to avoid further confusion.
Lines 186-188: Most of the MAGs recovered from the standard data were also recovered from the data with deduplication. In addition, extra species-level MAGs were recovered from the deduplicated data that were not recovered from the standard data (Fig. S4).
-Lines 206-207: Consider revising language. Fewer chimeras being present implies that there were some MAGs that fully lacked disagreement, whereas here what is being assessed is a rate of disagreement on a gradient from fully distinct to "fully" chimeric (not interpretable as a distinct organism/species-level genome).

Response: Thank you for bringing up this point, and our apologies for the confusion. We intended to compare the proportions of MAGs that contain chimeras between the data with and without deduplication. We assessed chimera presence by evaluating the homogeneity of the taxonomic annotations for each MAG, as follows: "No annotation" if fewer than two marker genes were annotated; "Agreeing" if all marker genes had the same taxonomic annotation at all classification levels; and "Disagreeing" if the marker genes differed in taxonomic annotation at any classification level. The disagreement rate was calculated as the number of MAGs containing chimeras divided by the total number of MAGs excluding those with "No annotation". We compared the disagreement rate at different taxonomic levels. For a given MAG, we first checked the taxonomic annotations of the 10 marker genes at the phylum level. If at least two marker genes were found in the MAG and they had the same taxonomic annotation at the phylum level, the MAG was considered "Agreeing"; otherwise, it was regarded as "Disagreeing". We then checked the disagreement rate at the other taxonomic levels, i.e., class, order, family, and genus. We have revised the language in the Results of the manuscript.
Lines 229-237: In addition to the counts of single-copy marker genes, FastQC could be used to evaluate the duplicate rate of metagenomic samples.
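The homogeneity check and disagreement-rate calculation described in this response can be sketched as follows. This is an illustrative re-implementation, not the authors' script; the input structure (a dict mapping each MAG to the per-level taxonomy of its annotated marker genes) and the function names are hypothetical.

```python
# Illustrative sketch of the MAG taxonomic-homogeneity assessment:
# each MAG is labeled "No annotation", "Agreeing", or "Disagreeing"
# at a given taxonomic level based on its annotated marker genes.

def classify_mag(annotations, level):
    """annotations: list of dicts {level: taxon}, one per annotated marker gene."""
    if len(annotations) < 2:
        return "No annotation"          # fewer than two annotated markers
    taxa = {a.get(level) for a in annotations}
    return "Agreeing" if len(taxa) == 1 else "Disagreeing"

def disagreement_rate(mags, level):
    """Disagreeing MAGs / all MAGs, excluding those with "No annotation"."""
    labels = [classify_mag(markers, level) for markers in mags.values()]
    evaluated = [lab for lab in labels if lab != "No annotation"]
    return labels.count("Disagreeing") / len(evaluated) if evaluated else 0.0
```

The same two functions can then be applied level by level (phylum, class, order, family, genus) to reproduce the per-level comparison described above.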
-Lines 343-344: parameters used to decide on which reads were eliminated, as well as a list of adapters, should be provided (either in the text or in a public repository).
Response: Thank you for bringing up this point. Fastp was used for deduplication in this study. Adapters can be detected by per-read overlap analysis, which seeks the overlap of each pair of reads. This method is robust and fast, so the adapter sequences usually do not need to be supplied. If "N" bases account for more than 10% of the total read bases, the paired reads are removed. If low-quality (Q <= 5) bases account for more than 50% of the total read bases, the paired reads are removed. The parameters we used for quality control and deduplication in this study are as follows: "--qualified_quality_phred 5", "--unqualified_percent_limit 50", "--n_base_limit 15", "--dedup", and "--dup_calc_accuracy 3". The accuracy level for calculating duplication ranges from 1 to 6, and "3" is the default value for deduplication.
Higher levels use more memory and more time. We also provided these parameters at Lines 395-410 of the revised manuscript.
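The per-read filtering rules quoted in this response can be illustrated with a small sketch. This mirrors the thresholds in the quoted fastp parameters rather than fastp's internal implementation; `phred` and `keep_read` are helpers defined here for illustration, not fastp functions.

```python
# Illustrative re-implementation of the two read-pair filtering rules
# described above (not fastp itself). Thresholds mirror the quoted
# parameters; a pair is dropped when either mate fails a rule.

N_BASE_LIMIT = 15          # --n_base_limit 15 (max N bases per read)
QUALIFIED_PHRED = 5        # --qualified_quality_phred 5
UNQUALIFIED_LIMIT = 50.0   # --unqualified_percent_limit 50 (percent)

def phred(qual_char):
    return ord(qual_char) - 33  # Phred+33 encoding

def keep_read(seq, qual):
    if seq.count("N") > N_BASE_LIMIT:
        return False
    low_q = sum(1 for q in qual if phred(q) <= QUALIFIED_PHRED)
    return 100.0 * low_q / len(seq) <= UNQUALIFIED_LIMIT

def keep_pair(r1, r2):
    # both mates of a pair are removed when either read fails
    return keep_read(*r1) and keep_read(*r2)
```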
-Lines 350-353: for the cross-sample coverage, it is unclear what method was used to calculate coverage profiles. If this was using the companion script to MetaBAT2, this should be specified and the workflow used to generate bam coverage files should also be included.
Response: Thank you for bringing up this point. We used the MetaWRAP pipeline for the binning of assemblies, which was clarified in the revised manuscript (Line 408).
The script "jgi_summarize_bam_contig_depths" within MetaBAT2 was used for coverage calculation. MAGs were filtered to retain those with a quality score over 50, where quality was calculated as completeness - 5 × contamination. We also clarified this point at Line 433 of the revised manuscript.
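The quality filter described here (completeness - 5 × contamination > 50) reduces to a one-line check; the sketch below assumes a simple dict of per-MAG completeness/contamination percentages and is illustrative only.

```python
# Minimal sketch of the MAG quality filter described above:
# quality = completeness - 5 * contamination (both in percent),
# and MAGs scoring over 50 are retained.

def mag_quality(completeness, contamination):
    return completeness - 5 * contamination

def filter_mags(mags, threshold=50):
    """mags: dict of name -> (completeness, contamination)."""
    return {name: vals for name, vals in mags.items()
            if mag_quality(*vals) > threshold}
```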
-Lines 376-377: can the in-house script be published alongside the paper?
Response: Thank you for bringing up this point. We used the "time" utility (\time -f %e -v -o Running.log) in the Linux shell to record the memory requirement and elapsed time.
Then, the log files of all samples were merged using the "paste" command in the Linux shell. We also added this information to the revised manuscript.
Lines 455-457: The elapsed time and RAM taken by the software to complete the metagenomic assembly and binning were recorded using the "time" utility in the Linux shell with customized options: \time -f %e -v -o Running.log.
-Lines 370-372: How were these 10 marker genes selected? Can we be sure that they are balanced?
Response: Thank you for bringing up this point. The 10 marker genes were selected from the mOTUs profiler (9-11), a widely used tool for microbial community profiling.
The 10 chosen marker genes have the properties of universal occurrence and a low rate of horizontal gene transfer, meaning that they have good phylogenetic properties and can be used to accurately track the evolutionary relationships between different microorganisms (9). We also made additions to the Materials and Methods of the revised manuscript.
Lines 444-448: We annotated 10 single-copy marker genes selected from the mOTUs profiler (20). The 10 chosen marker genes have the properties of universal occurrence and a low rate of horizontal gene transfer, meaning that they have good phylogenetic properties and can be used to accurately track the evolutionary relationships between different microorganisms (20, 55).
-Line 378+: Details on how the figures were generated are missing; were all statistical tests and figures generated using base R utilities? If so, it would be good to explicitly specify this, as well as to list the R functions used for the statistical analyses.
Response: Thanks for the kind reminder. Yes, all statistical tests and figures in this study were generated using packages in R. We have added this information to the revised manuscript.
Thank you for submitting your manuscript to Microbiology Spectrum. Your manuscript has been read again by the two reviewers, and both of them are satisfied with your revision. You can also find some additional minor comments below this message. I expect that your manuscript is very close to acceptance. When submitting the revised version of your paper, please provide (1) point-by-point responses to the issues raised by the reviewers as file type "Response to Reviewers," not in your cover letter, and (2) a PDF file that indicates the changes from the original submission (by highlighting or underlining the changes) as file type "Marked Up Manuscript - For Review Only". Please use this link to submit your revised manuscript; we strongly recommend that you submit your paper within the next 60 days or reach out to me. Detailed instructions on submitting your revised paper are below.

Link Not Available
Below you will find instructions from the Microbiology Spectrum editorial office and comments generated during the review.
ASM policy requires that data be available to the public upon online posting of the article, so please verify all links to sequence records, if present, and make sure that each number retrieves the full record of the data. If a new accession number is not linked or a link is broken, provide production staff with the correct URL for the record. If the accession numbers for new data are not publicly accessible before the expected online posting of the article, publication of your article may be delayed; please contact the ASM production staff immediately with the expected release date.