To Dereplicate or Not To Dereplicate?

Metagenome-assembled genomes (MAGs) expand our understanding of microbial diversity, evolution, and ecology. Concerns have been raised on how sequencing, assembly, binning, and quality assessment tools may result in MAGs that do not reflect single populations in nature. Here, we reflect on another issue, i.e., how to handle highly similar MAGs assembled from independent data sets. Obtaining multiple genomic representatives for a species is highly valuable, as it allows for population genomic analyses; however, when retaining genomes of closely related populations, it complicates MAG quality assessment and abundance inferences.

W hile initially the reconstruction of metagenome-assembled genomes (MAGs) was only achievable in lower-diversity or highly uneven communities (1), in the past 5 years, reports on the reconstruction of hundreds to thousands of MAGs have become routine (2,3). In the past year, highly automated assembly and binning pipelines have accelerated this trend (4,5). These advances open up exciting prospects for addressing questions regarding the physiology, ecology, and evolution of microbial life. However, MAGs are inherently less reliable than isolate genomes due to their assembly and binning from DNA sequences originating from a mixed community. Various reports have highlighted issues associated with MAGs. Misassemblies and/or incorrect binning can lead to composite genomes that contain contigs originating from independent genomes, which can lead to incorrect inferences about the metabolic potential and phylogenetic novelty of populations represented by these MAGs (6,7). Additionally, binning software performs poorly on contigs smaller than 2 to 5 kb (8); hence, smaller contigs generally are excluded from binning procedures. However, the assembly of metagenomic data from communities containing closely related strains often leads to highly fragmented assemblies with many contigs below these cutoffs, leading to incomplete genomes that can lead to wrong conclusions regarding ecological differentiation between populations (9). As sample series from the same environment across time or space typically will sample many closely related populations, the independent assembly of each sample is often preferable to avoid assembly fragmentation due to genomic variation between conspecific populations in different samples. However, this often leads to highly similar MAGs being generated across the sample data set. Therefore, multiple tools have been developed to remove these seemingly redundant MAGs, mainly based on average nucleotide identity (ANI) between MAGs after sequence alignment. Generally, due to the high number of MAGs, which makes doing all pairwise alignments too computationally intensive, MAGs are first grouped using a fast but less accurate alignment method, such as Mash (10). This step is included in the tool dRep, which subsequently uses genomewide alignments using animf with mummer/nucmer to align contigs (dRep default [11][12][13]) or gene-based alignments using NSimScan to align open reading frames identified in each MAG (dRep gANI [14]). Other tools, such as pyani (15), calculate sequence identity using BLAST-based genomewide alignments of MAG contigs. While slower, we consider it the reference for comparison due to the higher accuracy of BLAST-based alignments (16) (Fig. 1). When using Mash as a step preceding pyani, we used default parameters (10). The computed pairwise distances then were used to cluster genomes into similar groups with hierarchical clustering using a custom python script with fcluster from SciPy (http://www .scipy.org/) with a threshold of 2 (https://github.com/DenefLab/Dereplication-Letter -Code). pyani then was run within each group created from the clustering (Fig. 1). We used the recommended default percent sequence alignment threshold for dRep (10%) but used 75% for pyani, similar to the cutoff used for identifying orthologs using BLAST alignments. Only MAGs meeting this threshold with both comparisons from the pairwise comparison were dereplicated.

WHY DEREPLICATE?
Dereplication is the reduction of a set of genomes, typically assembled from metagenomic data, based on high sequence similarity (e.g., Ͼ99% average nucleotide identity) between these genomes. When redundancy in a database of genomes is maintained, the subsequent step of mapping sequencing reads back to this database of genomes leads to sequencing reads having multiple high-quality alignments. Depending on the software used and parameters chosen, this leads to sequencing reads either being randomly distributed across the redundant genomes, with one random alignment reported from many possible options, or being reported at all redundant locations. When using these data to make inferences about the relative abundance and population dynamics across samples, relative abundance for each taxon represented by these redundant MAGs will look artificially low, and it will appear that multiple ecologically equivalent populations co-occur. Instead, the more likely conclusion should be that one abundant population exists across all samples (Fig. 2). This issue has been acknowledged in multiple studies, and authors have chosen various cutoffs to avoid this issue (e.g., Ͼ95% average nucleotide identity [4], Ͼ98% average nucleotide identity [2], Ͼ95% amino acid identity [17], and Ͼ99.5% amino acid identity [3]). In our experience, the presence of multiple closely related genomes can also complicate manually curating MAGs using, for example, Anvi'o (18). When multiple similar genomes are present, particularly when some of the closely related MAGs are less complete, these differential coverage patterns will be less reliable due to the distribu- Overview of dereplication approaches used in this study. All approaches first cluster similar genomes (Mash clusters are delineated with boxes) using a fast, less accurate approach (Mash), which is included in the dRep package but is a separate preprocessing step we carried out for the pyani analysis (indicated with the dotted line). Each cluster of MAGs then is separately dereplicated using pairwise alignments by identifying MAGs within each Mash cluster that share ANI above the specified threshold. These clusters are indicated by boxes, with Mash clusters split in two multiple cluster groups using the same line type (full or dashed lines). Which genomes end up in the same cluster varies depending on the approach used; only one clustering is shown. Finally, a representative MAG is selected, either as part of the package (dRep) or using a custom script (our approach that used pyani for pairwise comparisons, indicated by the dotted line), selecting the MAG with the highest estimated completion and lowest estimated contamination.
tion of sequencing reads across MAGs in some parts of the genome and not in others, generating divergent differential coverage patterns for contigs originating from the same population. This could lead to the removal of parts of the genome that do in fact belong.

WHY NOT DEREPLICATE?
Traditional population genomic analyses rely on the availability of genome sequences of many conspecific isolates. Metagenomics-based population genomic analyses could enhance cultivation-dependent methods by obtaining sequences of multiple closely related, independently sampled populations (a population being defined as individuals of the same species occurring at the same time and place). While many methodological issues remain (see reference 19), MAGs of Ͼ99% ANI originating from multiple samples could be a valuable resource, which would be lost if we dereplicate our MAG data sets. Most approaches to dereplicate remove genomes based on the sequence identity of shared parts of the genome. As such, when removing genomes, in addition to data on single-nucleotide variation, we may lose information  3), grouped based on sequence similarity by Mash. A box outline indicates the genome was preserved after dereplication, while white space indicates it was removed. The dRep default does not remove multiple nearly identical MAGs, while dRep-gANI removes MAGs that are more distantly related than the 99% or 96.5% ANI cutoff. Black bars show the average sequence read coverage across all contigs of each MAG, ranging from 0 to 2,000, when aligning a metagenomic data set (Sequence Read Archive accession no. SRR1702559) using all genomes in the tree (none) or dereplicated genome sets using different tools. Reads were mapped to each Multi-FASTA file of retained MAGs using BWA-MEM with default parameters (26). Average coverage per contig was computed with pileup.sh from bbtools (https://sourceforge.net/projects/bbmap/). The phylogenetic tree was created by searching for marker genes with PhyloSift (27) using its default set of marker genes. All MAGs had estimated completeness levels of Ͼ90% (3). The genes then were aligned with PhyloSift and the resulting alignments concatenated, and the tree was created with FastTree (28) using the -nt and -gtr parameters. on variability in the auxiliary gene content among representatives from the same species. As an example, we analyzed the effect of dereplication on database auxiliary gene content using two of the most commonly used tools (dRep and redundancy removal based on pyani results). We used a set of 46 Microcystis aeruginosa MAGs that we previously generated with extensive manual curation (9). The ANI between pairs of these 46 genomes averages to 96.4%. Out of a total of 9,175 unique gene clusters across the 46 MAGs, dereplication led to the removal of up to 2,228 auxiliary gene clusters when using dRep gANI with a 96.5% cutoff (used for species delineation using genome sequences [14]) (Fig. 2). At the other extreme, using the dRep default, no genomes were removed from the MAG set; thus, no gene clusters were lost, while intermediate numbers of gene clusters were removed when using pyani (213) and dRep gANI (447) at 99% thresholds.

VARIABLE PERFORMANCE OF COMMONLY USED SOFTWARE
As already indicated from the analysis shown in the first part of Table 1, different dereplication tools lead to different outcomes, even when using the same sequence identity cutoffs. Using publicly available MAG data sets (3)(4)(5), we evaluated the performance of two commonly used dereplication tools, dRep and pyani. For dRep, we used the default parameters, a cutoff of 99% (dRep default), and the gANI option, which aligns predicted open reading frames rather than the entire genomes, with cutoffs of 99% and 96.5% (dRep gANI). For pyani, we used a 99% ANI cutoff. As outlined in Fig. 1, to reduce computation time, all approaches first cluster MAGs by calculating pairwise distances using Mash.
First, we performed a comprehensive analysis of a set of 7,800 genomes generated from 1,550 public metagenomes (3). In this study, no dereplication was done for most analyses, except for building the tree represented in Fig. 2 in the study originally reporting these MAGs (3). For the latter analysis by Parks and coauthors, dereplication was performed by removing genomes with an amino acid identity (AAI) of Ն99.5%, as calculated using CompareM (https://github.com/dparks1134/CompareM), resulting in the removal of 27.5% of all MAGs. In our own analyses, relative to the pyani reference (32.9% removal), default dRep removed fewer genomes (19.3%), while the gANI dRep approach removed more MAGs (48.1% [99% ANI], 56.9% [96.5% ANI]) ( Table 1). A closer look at one cluster of related MAGs indicated that dRep gANI regularly removed genomes that did not require removal, while dRep with default parameters did not remove a sufficient number of MAGs (Fig. 2).
For a recent study that generated more than 90,000 MAGs (4), we performed our comparative dereplication analysis on the 1,952 uncultured bacterial species that were identified and examined by the authors. These were MAGs not classified at the species level in current databases that had been dereplicated by removing less complete MAGs that shared an ANI of Ͼ95% across 60% of their sequence length. In this case, pyani  (9). A gene cluster consists of all genes across all genomes that had a minimum bit score of at least 0.5 when using the pangenome analysis workflow in Anvi'o (18). Retention of at least one representative in each gene cluster was evaluated when using different dereplication tools and ANI settings. b Number of MAGs remaining after dereplication tools were run shows tool-dependent results of dereplication using data from three published data sets. SGB refers to numbered species-level clusters generated in analyses by Pasolli et al. (5).
removed four times fewer MAGs than the different implementations of dRep (Table 2). In contrast to our preceding analyses, dRep default removed more MAGs than pyani, potentially because the authors had already dereplicated their MAG set at 95% ANI. Finally, we analyzed two MAG groups, clustered at the species level (95% ANI) by the authors of a recent study generating more than 150,000 MAGs (5). In this case, dRep default again removed fewer MAGs than pyani, while dRep using gANI removed many more MAGs (Table 2). Various factors may explain the differences in outcome between dereplication approaches, including the faster but less accurate alignment and percent identity calculation methods used in dRep, the diverging percent alignment thresholds used, and, potentially, different implementations of Mash in dRep and in our combined Mash-pyani approach. As for the percent alignment thresholds used, when we tried different coverage thresholds with the set of genomes represented in Fig. 2, dropping the coverage threshold for pyani to 10%, used by dRep, led to the removal of one extra MAG. When doing the same for the Microcystis MAG set, we observed no change in the number of genomes removed by pyani (10 to 75% alignment thresholds when using 100% complete MAGs) ( Table 2). Finally, dereplication tools also vary in their performance based on the completeness of the MAGs that are being compared. Previous guidelines for the use of dRep have shown that Mash, the first step in all the tested dereplication approaches (Fig. 1), underestimates genome similarity for incomplete genomes and recommends against dereplicating genomes below 50% completeness (https://drep.readthedocs.io/en/ latest/choosing_parameters.html). When using the pyani approach we favor, using our data set of Microcystis MAGs artificially reduced to lower completeness by randomly removing contigs, only 1 out of 46 genomes was removed at completeness levels between 10 and 60%. At 70, 80, and 90% completeness, pyani removed 2, 8, and 22 genomes at a threshold of Ͼ99% identity and alignments covering Ͼ75% of the MAG (Table 2). At their highest estimated completeness level, which ranged between 95 and 100%, 25 genomes were removed. While performance will likely vary depending on the data set, these results suggest that dereplication with our preferred combined Mashpyani approach is not able to prune highly incomplete genomes and will retain more genomes than optimal even for genomes that are estimated to be 80 to 90% complete. Considering the limited effect of percent alignment threshold on the pyani approach when using complete MAGs, lowering this threshold helps remove more genomes and, hence, should be lowered to 50% or less when trying to prune more incomplete MAGs (Table 2).

AVAILABLE APPROACHES TO LEVERAGE SAMPLING OF BETWEEN-POPULATION VARIATION
Several tools have been developed to maintain the auxiliary genomes of closely related strains while avoiding redundancy when tracking strain-resolved population dynamics in the environment using metagenomic data (reviewed in reference 19). They typically use metagenomic data in combination with a database of genomes of closely related isolates or MAGs based on whether alleles of shared genes (StrainPhlAn [20] and  (%)   10  20  30  40  50  60  70  80  90  100  10  37  35  36  33  33  31  29  27  23  21  25  45  44  37  33  33  31  29  27  23  21  50  45  45  45  45  40  31  29  27  23  21  75  45  45  45  45  45  45  44  38  24  21 a Microcystis aeruginosa MAGs were artificially reduced in completeness by random subsampling of contigs. The full MAGs ranged in estimated completeness between 95 and 100%. Shown are the numbers of the original 46 MAGs retained by our combined Mash-pyani approach using different percent alignment thresholds.
ConStrains [21]), strain-specific auxiliary genes (PanPhlAn [22]), or both are present in a sample (MIDAS [23]). Similarly, the Anvi'o package incorporates a metapangenome workflow that reduces a set of user-defined conspecific genomes to gene clusters representing core and auxiliary genes and then estimates strain abundances across metagenomic data sets (18). In principle, all of these approaches avoid the issues associated with database redundancy highlighted in Table 1 and Fig. 2 and the loss of population-specific auxiliary genes highlighted in Table 1. Although variant identification errors do remain, which are tool and likely database and metagenomic data set dependent, this has been reported to be as low as 0.1% (20). While potential issues with these approaches have not been fully evaluated, analyses focusing on populations where the dominant strain can be more readily resolved have been able to go as far as tracking in situ bacterial evolution in environmental biofilms and the human gut (24,25).

CONCLUSIONS
Genome-centric metagenomics has opened a view into the undescribed branches of the tree of life. However, full awareness of the risks associated with MAGs is needed to avoid the misinterpretation of the data and populating databases with questionable genomes. Dereplication is a step carried out by many researchers as part of metagenomic informatic pipelines, but we highlight large differences between commonly used tools in how many genomes are removed. Tools designed to resolve closely related genomes in a database exist and may circumvent issues with redundancy while maximally leveraging all data from conspecific populations obtained from generating MAGs. As the ability to resolve closely related genomes is dependent on the genetic distance between genomes in the database and between database genomes and those of sampled populations, these tools need broader adaptation and evaluation to fully evaluate their accuracy. Additionally, we hope there will be further development of tools that resolve the issues highlighted in this study.

DATA AVAILABILITY
All code written and used for the analyses described in the manuscript can be found at https://github.com/DenefLab/Dereplication-Letter-Code.