Deconvolute individual genomes from metagenome sequences through short read clustering

Metagenome assembly from short next-generation sequencing data is a challenging process due to its large scale and computational complexity. Clustering short reads by species before assembly offers a unique opportunity for parallel downstream assembly of genomes with individualized optimization. However, current read clustering methods suffer either false negative (under-clustering) or false positive (over-clustering) problems. Here we extended our previous read clustering software, SpaRC, by exploiting statistics derived from multiple samples in a dataset to reduce the under-clustering problem. Using synthetic and real-world datasets we demonstrated that this method has the potential to cluster almost all of the short reads from genomes with sufficient sequencing coverage. The improved read clustering in turn leads to improved downstream genome assembly quality.

assemble the short reads into longer contigs, then cluster the contigs into individual draft genomes through 32 the binning process (Roumpeka et al., 2017;Kang et al., 2016). The assembly step in these software 33 tools simultaneously tackles the computational and algorithmic challenges by constructing a huge de 34 Bruijn graph and subsequently partitions it in parallel (reviewed in Breitwieser et al. (2017) ;Quince 35 et al. (2017)). These tools, including MEGAHIT (Li et al., 2015), metaSpades (Nurk et al., 2017) and 36 MetaHipmer (Georganas et al., 2018), have achieved considerable success and are widely used. To 37 overcome the limitation of this "assembly-then-cluster" approach that does not allow optimization for 38 individual genome assembly, a "cluster-then-assembly" alternative has recently been proposed. This 39 strategy first clusters the reads based on their genome of origin (Guo et al., 2015;Shi et al., 2018), and 40 then each cluster can be assembled individually and potentially optimized. Tools adopting this strategy 41 take advantage of the scalability and robustness of Apache TM Hadoop (Guo et al., 2015) or Spark (Shi 42 et al., 2018) platforms to construct and partition an overlap graph in parallel. 43 We previously reported that an Apache Spark TM -based read clustering method, SpaRC, that showed 44 a great potential in achieving good scalability and clustering performance (Shi et al., 2018). SpaRC can 45 be flexibly deployed to the cloud or HPC computing environments. However, the demonstrated clustering 46 success was limited to long-read sequencing technologies. Even though SpaRC can form pure clusters 47 (low false positives), clustering short-read datasets suffered a false negative problem, or one genome is 48 clustered into many small clusters (under-clustering). This is not desirable as most of the metagenome 49 datasets are based only on short-read sequencing technologies. Clustering short reads to recover single 50 genomes has been previously shown to be possible by a latent strain analysis approach (LSA, Cleary et al.

51
(2015)). However, clustering metagenome reads directly based on k-mer statistics across multiple samples 52 is very challenging (Wang et al., 2012;Liao et al., 2013). 53 In this paper, we describe a new method to target the under-clustering problem of SpaRC by 54 exploiting statistics derived from multiple, independent samples in short-read datasets. This method 55 first estimates the abundance of each read cluster using a set of short, representative k-mers, and then 56 calculates the similarity among the clusters and uses it to construct a graph of clusters. Finally, it partitions 57 the cluster graph to obtain larger read clusters. We name the new clustering algorithm developed here 58 as "global clustering", as it deals with cross-sample information from the entire dataset. Conversely, the 59 clustering algorithm in SpaRC we had reported previously is now renamed as "local clustering", as it 60 only deals with read overlap information. We implemented the global clustering algorithm on the Apache 61 Spark platform to achieve data scalability and computing robustness. In addition, we adopted minimizers 62 (Roberts et al., 2004) as a replacement for k-mers to improve computing and memory efficiency. We 63 compared the clustering performance of the global clustering algorithm to the local clustering using 64 a synthetic mouse gut microbiome dataset from the CAMI2 project (Sczyrba et al., 2017). Several 65 clustering parameters were also explored to gauge their influence on global clustering performance. Using 66 a real-world metagenome dataset, we showed that clustering the reads before the assembly can greatly 67 improve the assembly quality of the species with high sequencing coverage.

69
Clustering strategies 70 An overview of the two clustering strategies is shown in Figure 1. The local clustering strategy in 71 the original SpaRC has been described in (Shi et al., 2018). In brief, during the local clustering step, we 72 cluster reads by their overlap. The read clusters are further clustered into bigger clusters by the global 73 clustering strategy, which we will describe in detail below.

75
In SpaRC, the number of shared k-mers is used to estimate similarity between reads (Shi et al.,76 2018). As it takes 100-200 times more space after reads are transformed into k-mers and edges, SpaRC is 77 neither space nor time-efficient. To improve computing efficiency, we implemented a new function to 78 use minimizers (Roberts et al., 2004) instead of k-mers to estimate similarity between reads. As many 79 adjacent shared k-mers can be represented by a single minimizer without losing sensitivity, in theory, the 80 minimizer-based method should greatly reduce the memory requirement in SpaRC (as fewer k-mers and 81 edges will be produced). In practice we did observe a 3.2-fold memory usage reduction, and 3.3-fold 82 speed-up (Supplemental Figure 1). It is worth noting that minimizers may not be applicable to uncorrected 83 long reads from PacBio and Nanopore sequencing technologies due to their high error rate.

Global clustering
Reads from a genome can form many read clusters after the local clustering step, leading to low 86 clustering completeness. The ultimate goal of global clustering is to predict all the read clusters originated 87 from the same genome. It does this based on the assumption that the sequencing coverage of each region 88 of a genome, defined by the read clusters, closely resembles the sequencing coverage of the same genome 89 across different metagenomic samples. In other words, if two clusters, c1 and c2, belong to the same 90 genome g. After c1 and c2 are assembled into contigs C1 and C2, the coverage of C1 and C2 in sample S, 91 in theory, should be equal to the coverage of g. In the context of single genome assembly, the sequencing coverage of a genome can be robustly 95 estimated from unassembled reads by k-mer analysis (Chor et al., 2009;Lo and Chain, 2014). Similarly, 96 we can estimate the sequencing coverage of the latent genome represented by a read cluster. As shown in 97 Figure 1B, clusters from different genomes, in theory, will show different k-mer frequency peaks (genome 98 coverage) in a given sample, while clusters from the same genome will show similar peaks, even though 99 the number of k-mers at those peaks could be very different, with larger read clusters having more k-mers. 100 We therefore sample several k-mers around the k-mer frequency peak of a cluster, and use the median of 101 their counts among each sample to estimate the coverage of this cluster in each sample. We term these 102 k-mers as representative k-mers for each cluster. Because very small clusters do not generate reliable 103 k-mer spectra, we only select clusters with more than 50 reads for further clustering. We modified the 104 original KMR function in SpaRC to track k-mer counts of each sample.

105
The cluster coverage information is stored in a m by n coverage matrix, where m is the number of 106 samples and n is the number of read clusters.

108
In the above cluster coverage matrix, every cluster is represented by a vector of counts. If two clusters 109 are derived from the same genome, we expect that their vectors should be very similar. We chose cosine 110 similarity to measure the similarity of cluster vectors as it is most commonly used in high-dimensional 111 positive spaces. After all pairwise similarities are calculated, the coverage matrix is transformed into 112 a n by n similarity matrix, where n is the number of clusters. We only keep the similarity exceeding a 113 predefined threshold because of the sparse nature of this matrix. This threshold parameter has a direct 114 impact on clustering performance, as higher thresholds produce smaller but purer clusters. An optimal 115 threshold parameter could be determined by performing a grid search for the one that gives the best 116 clustering accuracy on a labelled dataset. For real-world metagenome datasets without known reference, 117 this parameter has to be guessed..

118
Graph construction and partitioning 119 By using the cosine similarity calculated above as weighted edges and the clusters as nodes, a 120 weighted graph can be constructed. This cluster graph can then be partitioned into big clusters the same 121 way as in the local clustering step by using the Label Propagation Algorithm (Raghavan et al., 2007).

122
Cluster assembly and quality evaluation 123 We selected large clusters (more than 1,000 reads for CAMI2 and more than 8,000 reads for MetaHIT) 124 to assemble. Under these criteria, more than 90% of original reads were retained in these two datasets.  MetaQuast (ver 5.0.2), (Mikheenko et al., 2015) was used for metagenome assembly evaluation.

129
As MetaHIT dataset does not have known references, we built a reference database by BLASTing the 130 assembled contigs against NCBI nonredundant reference genomes, and selected the subset of references 131 that have sequencing coverage of 30x or more (n=68). There were many bins from the MetaHIT dataset 132 having no genomes mapped to them, and they were omitted from further analyses. There were also bins 133 mapped to multiple genomes, and genomes split into multiple bins. If the predominant part of a genome 134 is included in a bin, then the completeness and purity of this genome were calculated according to all the 135 contigs within this bin. Genome assembly quality evaluation metrics were obtained with MetaQuast using 136 default parameters. This mock dataset is a synthetic community with real-world sequence data (Singer et al., 2016). It 140 contains Illumina reads from 23 bacteria and 3 Archaea species with known reference genomes. The 141 sequence length is (90-150)x2 bp totaling 3.3 Gb (Table 1). This dataset was used as a toy dataset for 142 testing local clustering with minimizers. smaller dataset allowed us to reduce computation cost and obtain results faster. We clustered the reads 167 using two clustering methods: in the "local clustering" method, we combined all the reads from the 50 168 samples for clustering and selected clusters with 50 reads or larger; in the "global clustering" method, we 169 further applied the global clustering module to these clusters to form big clusters (Material and Methods).

170
This labeled synthetic dataset enabled us to systematically compare the clustering performance between these two methods for cluster size, purity, completeness. The results are shown in Figure 2. We used the 172 same purity and completeness metrics as in (Shi et al., 2018). Briefly, the purity of a cluster is defined 173 as the percentage of reads from the predominant genome within the cluster, while the completeness of 174 a cluster is defined as the percentage of all the reads from the predominant genome that are captured 175 by the cluster. Because almost identical strains from the same species were engineered in the dataset, 176 both of the two metrics, especially purity, likely underestimate species-level clustering performance. For 177 example, if a species has two closely related strains with equal number of reads, clusters derived from 178 this species will have a purity of 50%. We therefore also measured cluster purity at the species level. In 179 this experiment, the parameters were k=41, m=22, min shared kmers=2, max degree=25, representative 180 k-mer count=100, and cosine threshold=0.925.

181
The local clustering step resulted in 78.29% of the reads clustered into many small clusters  We obtained clusters from these datasets by running SpaRC with the same parameters (k=41, m=22, 205 min shared kmers=2, min read per cluster=50) for local clustering, and representative kmer count=100 206 for global clustering.

207
As in the previous section, we evaluated the purity and completeness of the resulting clusters. As 208 shown in Figure 3, the median purity of the clusters at the genome level slightly dropped as the number 209 of samples increases, from the highest 96.96% (at n=5) to the lowest 88.89% (at n=20, Figure 3A). This  Table S1).. In contrast, the completeness continuously 213 increases with increasing number of samples (median completeness rises from 7.69% at n=5 to 19.97% 214 at n=50, Figure 3B). This result supports the hypothesis that more samples enhance global clustering Manuscript to be reviewed performance, likely due to better estimation of cluster similarity.

216
The ultimate goal of read clustering is to recover the complete set of any genome without any 217 contamination from other genomes. In order to measure how many genomes can be recovered by 218 read clustering, here we define "a clustered genome" as a read cluster that simultaneously satisfies two 219 criteria: purity > 95% and completeness > 80%. It is worth noting that these criteria are very strict.

220
As strain-level heterogeneity can greatly reduce purity, and one small region larger than a read with 221 no sequencing coverage, either due to statistical sampling or systematic sequencing biases, will greatly 222 decrease completeness.

224
Since the ability to obtain a clustered genome depends on the sequencing coverage, and there are 20 225 genomes with at least 100X coverage, this translates to a recovery rate of 35% with 50 samples. Other 226 genomes with lower sequencing coverage also benefit from more samples included in clustering, as Figure   227 3C shows the extent of recovery for all genomes with a sequence coverage > 10X. The details of these 7 228 recovered genome can be found in Supplemental Table 1.

229
Parameters that may impact clustering performance 230 There are 10 parameters in the local clustering algorithm (Shi et al., 2018). In the global clustering  Table 3.  (Chikhi and Medvedev, 2013).

257
The number of representative k-mers used for cluster abundance estimation seems to have a very 258 minor effect on clustering performance. If the number is too low, then the performance is slightly lower.

259
As expected, very low cluster similarity thresholds cause over-clustering, and very high ones lead to 260 under-clustering. Same as the other parameters, the best parameter choice should be determined by the 261 underlying scientific requirements to balance sensitivity and specificity.

262
Assembly quality comparison between SpaRC-based and the classic approach 263 Compared to the current metagenome assembly strategy without clustering, i.e., assemble the entire 264 dataset followed by binning (hereafter referred as "the classical approach"), clustering the reads into 265 individual clusters by SpaRC followed by assembly and binning (hereafter referred as "SpaRC-based 266 approach") may produce better results. To test this hypothesis, we carried out both approaches on the 267 above CAMI2 testing dataset (Methods) and compared their assembly results.

268
As shown above, the effectiveness of clustering depends on sequencing coverage, we therefore 269 evaluated the assembly performance for genomes with coverage of 100x or above, 50-100X, and 30-50X, 270 respectively. We observed a comparable number of mis-assemblies from both approaches (Supplemental 271   Table S2), so we focused on the following four metrics: genome coverage (percent of reference covered 272 by the assembled contigs), contamination (percent of contigs not belonging to the reference at the species 273 level), N50 and L50 (measuring contiguity of the assembly). Specially, we only reported genome bins 274 with at least 95% purity, or less than 5% contigs from other species. We evaluated their assembly accuracy 275 in terms of near-complete genomes (95% genome coverage and above) and fairly complete genomes 276 (80-95% genome coverage). The results are shown in Figure 4.

277
At high sequencing coverage (100X or above), the SpaRC-based clustering approach was able 278 to recover more near-complete genomes than the classical approach, 16 for SpaRC vs 11 for Classic 279 ( Figure 4A). Among them, the SpaRC approach assembled 6 genomes that were not completely assembled 280 by the classical approach, while missed only one assembled by the classical approach ( Figure 4B). The 281 number of fairly complete genomes assembled by the two approaches are comparable at this sequencing 282 coverage. Besides, 11 genomes assembled by the classical approach have smaller L50s and larger N50s 283 while 6 genomes assembled by SpaRC approach do (Supplemental Table S2). As sequencing coverage 284 is getting lower, the classical approach has an advantage over the SpaRC-based approach, especially 285 when coverage drops to below 50X ( Figure 4A). These genomes tend to spread across many small pure 286 clusters. This result is consistent with the above clustering performance analyses, suggesting there is still 287 an "under-clustering" problem for global clustering.

288
We added a binning step using MetaBAT after the assembly of each of the clusters because some 289 of the large ones are mixtures of a few genomes. Clusters with more than 100,000 reads contain 290 3.125 genomes on average, suggesting the species complexity of these clusters are much more reduced 291 comparing to the original metagenome. The metrics presented in Figure 4 were calculated after the 292 binning step.

293
To test whether these conclusions can be generalized to real-world datasets, we applied the two 294 approaches to a human microbiome dataset (MetaHIT, Methods). As there is no ground truth for this 295 dataset, we mapped the metagenomic short reads to NCBI non-redundant reference genome database, and 296 used the 68 genomes with 30x or more sequencing coverage as a reference set.

297
Similar to the previous experiment on CAMI2 dataset, the SpaRC-based approach can cluster 298 genomes with 100X sequencing coverage from MetaHIT dataset, and this ability decreases dramatically as 299 coverage decreases. The SpaRC-based clustering approach was able to recover 9 near complete genomes 300 with 100X or more coverage (total 16), while the classic method recovered 10 ( Figure 4C). Among them,

301
SpaRC approach recovered 3 genomes that were not completely recovered by the classical approach, 302 while missed 4 genomes ( Figure 4D). For a complete list of the recovered genomes from these two 303 approaches on the two datasets, please refer to Supplemental Table S3.

Manuscript to be reviewed
The above results suggest that adding a clustering step can complement the common metagenome 305 assembly strategy, at least for the species with high sequencing coverage. SpaRC-based approach, we were able to distribute the read clustering step on an AWS Elastic MapReduce 311 (EMR) cluster with 350 nodes, each with 8 CPU cores and 61 GB of RAM (Table 2). The clustering step 312 took 8.9 hours to complete. These results suggest that SpaRC is not as good as the classic approach in 313 terms of costs and computational efficiency, instead it offers an advantage to scale up to bigger datasets 314 (over assemblers only run on single nodes), and overall shorter computational execution time.

316
Even with global clustering, there is still a lot of room left to further improve the accuracy of 317 metagenome read clustering, as both under-clustering and over-clustering problems are still outstanding.

318
One idea is to employ better metrics to improve the prediction that different clusters belong to the same 319 species. The read clustering problem is similar to the metagenome binning problem. In the unsupervised 320 metagenome binning problem, contigs are further clustered into genomes based on two metrics, tera-

331
Another idea to improve the accuracy of metagenome clustering may be leveraging long read 332 sequencing technologies. Long read technologies (such as PacBio and Nanopore) are increasingly applied 333 to metagenome sequencing, a hybrid clustering approach leveraging long reads to cluster short reads may 334 also greatly increase clustering performance. Long reads should be very helpful to increase clustering 335 completeness as they span the regions where short reads have low coverage, and improve clustering purity 336 at the strain level as they can distinguish repeats and closely related species or strains.

337
The global clustering step added more parameters to the entire clustering pipeline. We have shown 338 that clustering accuracy can be influenced by some of these parameters, including k-mer/minimizer size, 339 minimum shared k-mers to detect overlap, and abundance similarity thresholds among clusters to construct 340 the cluster graph. Deriving an optimal set of parameters is challenging because it is likely dependent on 341 the underlying data characteristics and the scoring metrics. Further, searching for an optimal parameter 342 set from a large parameter space on a large dataset using grid-search or random-search strategies is 343 computational prohibitive. There isn't an evident law to select the default parameters. This problem may 344 be a good candidate for Bayesian optimization (Snoek et al., 2012;Hutter et al., 2011).

345
It was worth noting that while the global clustering procedure improves clustering completeness, 346 but this comes at a small cost of lower clustering purity. This is a trade-off inherent to all clustering 347 problems, and the above suggested potential improvements, including better metrics, longer reads and 348 optimal parameters, may only improve completeness or purity, but not both. TNF can help completeness, 349 while introducing impurity as TNF signals from smaller clusters tend to be noisy. Long reads can help 350 link small clusters, but they are not useful to separate impure clusters because of their limited sequencing 351 coverage. Larger k-mer/minimizer sizes, more k-mers/minimizers required for a valid overlap, larger 352 abundance similarity thresholds can all lead to clusters with higher purity, but will inevitably also lead to 353 lower clustering completeness.

354
Running SpaRC on very large metagenome datasets like the MetaHIT was still very challenging.

355
In addition to requiring a large number of nodes, the memory overflow problem may occur during the 356 execution when the number of executors per node or the number of cores per executor is not set properly.  Figure 1. An overview of the clustering strategies. (A) Local clustering: short reads sequences from multiple samples of a microbial communities (such as derived from different sample sites or times, S1, S2 ... Sm) are combined and clustered using the scalable overlap-based clustering algorithm in SpaRC. Many small clusters are formed and reads from the same genomes scatter across many clusters (under-clustering). (B) Estimating genome coverage from unassembled read clusters. In the left illustration, two read clusters show different k-mer frequency peaks, each corresponding to the coverage of their underlying genome (dotted lines). In the right illustration, multiple read clusters derived from the same genome in theory will have the same genome coverage in a given sample, while the height of the peak (number of k-mers) can be very different depending on the size of the read clusters. (C) Global clustering. First, sequencing coverage of each small cluster from the local clustering step is estimated and a cluster coverage matrix is derived. Second, a square similarity matrix is obtained by computing pair-wise cosine similarities between all clusters. Finally, a graph is constructed using clusters as nodes and their similarity as weighted edges. Larger clusters containing all the reads from individual genomes can be obtained by partitioning the graph using the Label Propagation Algorithm (LPA).

12/15
PeerJ reviewing PDF | (2019:05:38058:2:0:NEW 18 Mar 2020) Manuscript to be reviewed The distribution of clustered reads among the clusters for local clustering (light gray) and global clustering (dark gray). x-axis is cluster size (log10) and y-axis is the percent of reads that are clustered at a given cluster size. Cluster size refers to the number of reads in a cluster. B. Violin plots of cluster completeness and purity (at genome level and species level). Global clustering metrics are plots filled in dark gray. The units on y-axis are percentages. C. A scatter plot of sequencing coverage of the genomes and their completeness from local clustering (light gray triangles) and global clustering (dark gray circles). x-axis is the sequencing coverage (log2) and y-axis is the completeness in percentage.

13/15
PeerJ reviewing PDF | (2019:05:38058:2:0:NEW 18 Mar 2020) Manuscript to be reviewed Median completeness comparison among different number of samples. x-axis is the number of samples and y-axis is median completeness. C. The number of clustered individual genomes with purity > 95% and completeness > 80% among different number of samples. There are 97 genomes with sequencing coverage > 10X in this dataset. Different shades of gray represent different completeness levels. x-axis is the number of genomes and y-axis is the number of samples.