Deconvoluting simulated metagenomes: the performance of hard- and soft-clustering algorithms applied to metagenomic chromosome conformation capture (3C)

Background. Chromosome conformation capture, coupled with high throughput DNA sequencing in protocols like Hi-C and 3C-seq, has been proposed as a viable means of generating data to resolve the genomes of microorganisms living in naturally occurring environments. Metagenomic Hi-C and 3C-seq datasets have begun to emerge, but the feasibility of resolving genomes when closely related organisms (strain-level diversity) are present in the sample has not yet been systematically characterised.

Methods. We developed a computational simulation pipeline for metagenomic 3C and Hi-C sequencing to evaluate the accuracy of genomic reconstructions at, above, and below an operationally defined species boundary. We simulated datasets and measured accuracy over a wide range of parameters. Five clustering algorithms were evaluated (2 hard, 3 soft) using an adaptation of the extended B-cubed validation measure.

Results. When all genomes in a sample are below 95% sequence identity, all of the tested clustering algorithms performed well. When sequence data contains genomes above 95% identity (our operational definition of strain-level diversity), a naive soft-clustering extension of the Louvain method achieves the highest performance.

Discussion. Previously, only hard-clustering algorithms have been applied to metagenomic 3C and Hi-C data, yet none of these perform well when strain-level diversity exists in a metagenomic sample. Our simple extension of the Louvain method performed the best in these scenarios; however, accuracy remained well below the levels observed for samples without strain-level diversity. Strain resolution is also highly dependent on the amount of available 3C sequence data, suggesting that depth of sequencing must be carefully considered during experimental design. Finally, there appears to be great scope to improve the accuracy of strain resolution through further algorithm development.


INTRODUCTION
membership is cluster overlap, or more formally, that the intersection between clusters κ_k and κ_l is no longer strictly empty (i.e. |κ_k ∩ κ_l| > 0).

Possibly motivated by a desire to obtain the plainest answer with maximal contrast, and for the sake of relative mathematical simplicity, hard-clustering is the more widely applied approach. Despite this, many problem domains exist in which cluster overlap reflects real phenomena. For instance, in metagenomes containing closely related species or strains, the highly conserved core genome tends to co-assemble into single contigs while more distinct accessory regions do not. Co-assembly implies that uniquely placing (a 1-to-1 mapping of) contigs into source-genome bins (clusters) is not possible. Rather, an overlapping model is required, allowing co-assembled contigs to be placed multiple times according to their degree of source-heterogeneity.

From the aspect of prior knowledge, classification and clustering algorithms fall into three categories (Jajuga et al., 2002): supervised classification, where for a known set of classes a set of class-labelled objects is used to determine a membership function; semi-supervised classification/clustering, which leverages additional unlabelled data as a means of improving the supervised membership function; and unsupervised clustering, where these prerequisites are not required. Unsupervised algorithms, in removing this a priori condition, are preferable if not necessary in situations where prior knowledge is unavailable (perhaps due to cost or accessibility) or the uncertainty in this information is high.

Simply put, clustering algorithms attempt to group together objects that are similar (the same cluster) and separate objects that differ (different clusters). Although algorithmic complexity can ultimately dictate applicability to a given problem domain, the quality of a clustering solution remains a primary concern in assessing an algorithm's value. To fully assess the quality of a given clustering solution, multiple aspects must be considered. Measures that fail to account for one aspect or another can give a misleading picture of quality. One such aspect is the notion of a ragbag. Here, a ragbag means that preference is given to placing uncertain assignments in a single catch-all cluster, rather than spreading them across otherwise potentially homogeneous clusters or leaving them as isolated nodes.

External measures, which compare a given solution to a gold-standard, are a powerful means of assessing quality, and they themselves vary in effectiveness. The F1-score, the harmonic mean (Equation 1) of the traditional measures precision and recall, is frequently employed in the assessment of bioinformatics algorithms. For clustering algorithms, it is perhaps not well known that the F1-score fails to properly consider the aspect of completeness (Amigó et al., 2009) and, further, is sensitive to a preprocessing step in which clusters and class labels must first be matched (Hirschberg and Rosenberg, 2007). The entropy-based V-measure (Hirschberg and Rosenberg, 2007) was conceived to address these shortcomings but does not suit our purposes; nor does the approach of Burton et al. (2014), as it requires the number of clusters to be specified a priori.

Runtime parameters particular to each algorithm were controlled in the sweep as necessary (Table 2). The widely used MCL (Markov clustering) algorithm (van Dongen, 2001) uses stochastic flow analysis to produce hard-clustering solutions, where cluster granularity is controlled via a single inflation parameter. Its soft-clustering extension takes four additional runtime parameters (balance, quality, redundancy and penalty ratio). It was determined that default settings were apparently optimal for these additional parameters (results not shown), and therefore only inflation was varied, over the same range as for MCL.

The modularity Q (Newman and Girvan, 2004) quantifies the degree to which a graph is composed of pockets of more densely interconnected subgraphs. Density is uniform across a graph when Q = 0, indicating essentially no community structure, while as Q → 1 there is significant community structure, with a strong contrast between the degree to which nodes are linked within and between communities.
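For reference, the quantity being described is the standard Newman–Girvan modularity (the equation itself is not reproduced in this excerpt):

```latex
Q = \frac{1}{2m}\sum_{ij}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\,\delta(c_i, c_j)
```

where A_ij is the adjacency (edge) weight between nodes i and j, k_i is the degree of node i, m is the total edge weight of the graph, and δ(c_i, c_j) is 1 when i and j belong to the same community and 0 otherwise.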

Louvain clustering builds upon this modularity score (Blondel et al., 2008), following a greedy heuristic to determine the best partitioning of a graph by the measure of local modularity, identifying sets of nodes more tightly interconnected with each other than with the remainder of the graph. Although a hierarchical solution can be obtained by recursive application of the Louvain method to the resulting subgraphs, the result at each step is a hard-clustering. We implemented a one-step Louvain clustering algorithm in Python, making use of the modules python-louvain and NetworkX. We further extended this hard-clustering method (Louvain-hard) to optionally elicit a naive soft-clustering solution (Louvain-soft): after producing the hard-clustering, any two nodes in different clusters that are connected by an edge in the original graph are made members of both clusters.

Several steps in the simulation and analysis process introduce uncertainty. In particular, the loss of read placement information in de Bruijn graph assembly means we must infer the genomic origin of each contig rather than obtain it explicitly from assembly output metadata.
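The naive Louvain-soft extension described above can be sketched in pure Python (a minimal sketch; the function and variable names are ours, not those of the pipeline):

```python
def louvain_soften(hard, edges):
    """Naive soft-clustering extension (Louvain-soft).

    hard  : dict mapping node -> cluster id, a hard partition such as
            the output of one Louvain pass.
    edges : iterable of (u, v) node pairs from the original graph.

    Any edge spanning two clusters makes both of its endpoints members
    of both clusters; all other memberships are left unchanged.
    """
    member = {node: {cid} for node, cid in hard.items()}
    for u, v in edges:
        cu, cv = hard[u], hard[v]
        if cu != cv:
            member[u].add(cv)
            member[v].add(cu)
    return member
```

Applied to the partition `{'a': 0, 'b': 0, 'c': 1}` with edges `[('a', 'b'), ('b', 'c')]`, only the cluster-spanning edge (b, c) has an effect: b and c each gain membership in the other's cluster, while a remains a member of cluster 0 alone.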

In this study, the gold-standard must accurately map the set of assembly contigs C to the set of community source genomes G, while supporting the notion of one-to-many associations from a contig c_i to one or more genomes g ∈ G. It is this one-to-many association that represents the overlap between genomes at low evolutionary divergence. The mapping must also contend with spurious overlap signal from significant local alignments due to such factors as conserved gene content, and try to minimize false associations. An association between contig c_i and reference genome g_k was accepted if µ_k > 0.96.
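The acceptance rule above can be sketched as follows. Note that the precise definition of µ_k is not given in this excerpt; here it is assumed to be the fraction of a contig's length covered by significant alignments to genome k, and the function name is ours:

```python
def gold_standard(coverage, threshold=0.96):
    """One-to-many gold-standard assignment of contigs to genomes.

    coverage : dict mapping contig id -> {genome id: mu}, where mu is
               assumed to be the fraction of the contig covered by
               significant alignments to that genome.
    Returns a dict mapping each contig to its set of accepted source
    genomes; a contig may be accepted against several genomes, which is
    exactly the one-to-many overlap case.
    """
    return {contig: {g for g, mu in hits.items() if mu > threshold}
            for contig, hits in coverage.items()}
```

A contig covered almost entirely by two closely related genomes is assigned to both, while a spurious partial alignment falls below the threshold and is rejected.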

Undirected 3C-contig graphs were generated by mapping simulated 3C read-pairs to WGS assembly contigs using BWA MEM (v0.7.9a-r786) (Li, 2013). Read alignments were accepted only for matches with 100% coverage of each read and zero mismatches. In general, this restriction to 100% coverage and identity should be relaxed when working with real data, and we found the iterative strategy
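Construction of the undirected 3C-contig graph from accepted read-pair mappings can be sketched as below. The choice of raw pair counts as edge weights and the discarding of intra-contig pairs are illustrative assumptions, not necessarily the pipeline's exact choices:

```python
from collections import Counter

def build_3c_graph(pair_mappings):
    """Build an undirected, weighted 3C-contig graph.

    pair_mappings : iterable of (contig_a, contig_b), one entry per
                    accepted 3C read-pair (both ends mapped).
    Returns (nodes, edges), where edges maps a sorted contig pair to
    the number of read-pairs supporting that link.
    """
    nodes, edges = set(), Counter()
    for a, b in pair_mappings:
        nodes.update((a, b))
        if a == b:
            continue  # intra-contig pairs carry no between-contig signal
        edges[tuple(sorted((a, b)))] += 1  # undirected: canonical order
    return nodes, dict(edges)
```

Each accepted read-pair whose two ends map to different contigs contributes one unit of weight to the edge joining them; the resulting weighted graph is the input to the clustering algorithms.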

where P_B3 and R_B3 are the weighted arithmetic means of the per-object multiplicity precision and recall. Unchanged from Extended Bcubed (Amigó et al., 2009), the expressions for the Multiplicity Bcubed precision and recall between a pair of objects o_i and o_j,

P_mult(o_i, o_j) = min(|K(o_i) ∩ K(o_j)|, |Θ(o_i) ∩ Θ(o_j)|) / |K(o_i) ∩ K(o_j)|

R_mult(o_i, o_j) = min(|K(o_i) ∩ K(o_j)|, |Θ(o_i) ∩ Θ(o_j)|) / |Θ(o_i) ∩ Θ(o_j)|

account for the non-binary relationship between any two items in the set when dealing with overlapping clustering, where K(o_i) is the set of clusters and Θ(o_i) is the set of classes containing object o_i.
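A minimal sketch of these Bcubed quantities in Python follows. Note that it uses uniform weighting, whereas the measure described above uses weighted arithmetic means; that simplification, and the function names, are ours:

```python
def bcubed_scores(K, T):
    """Extended Bcubed precision, recall and F for overlapping
    clusterings (after Amigó et al., 2009), with uniform weighting.

    K : dict mapping object -> set of clusters containing it, K(o_i).
    T : dict mapping object -> set of classes containing it, Theta(o_i).
    """
    def mean(xs):
        xs = list(xs)
        return sum(xs) / len(xs)

    objs = list(K)
    # precision: averaged over object pairs sharing at least one cluster
    prec = mean(
        mean(min(len(K[i] & K[j]), len(T[i] & T[j])) / len(K[i] & K[j])
             for j in objs if K[i] & K[j])
        for i in objs)
    # recall: averaged over object pairs sharing at least one class
    rec = mean(
        mean(min(len(K[i] & K[j]), len(T[i] & T[j])) / len(T[i] & T[j])
             for j in objs if T[i] & T[j])
        for i in objs)
    return prec, rec, 2 * prec * rec / (prec + rec)
```

For a clustering identical to the gold-standard, every term equals 1 and F = 1; merging two unrelated objects into one cluster lowers precision while leaving recall intact.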

Simulating HiC/3C read-pairs

A tool for simulating HiC/3C read-pairs was implemented in Python (simForward.py). Read-pairs were generated for a given community directly from its reference genomes, where the relative proportion of read-pairs drawn from each genome reflected the community abundance profile. A long-tailed probability distribution, expressed as a function of genomic separation and calibrated by fitting to real experimental data (Beitel et al., 2014), was used to model intra-chromosomal pairs. For these 3C reads, modeling of experimental/sequencing error was not performed. Variation in intra-chromosomal probability attributable to 3D chromosomal structure was not included; in effect, chromosomes were treated as flat unfolded rings.

295
The tool takes as input a seed, read length, number of read-pairs, abundance profile and inter-chromosomal probability, and outputs reads in either interleaved FastA or FastQ format.

From a given phylogenetic tree and an ancestral sequence, the community generation module produces a set of descendent taxa with an evolutionary divergence defined by the phylogeny and evolutionary model.
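The separation-sampling scheme of the read-pair simulator described above might be sketched as follows. The exponential decay used here is an illustrative stand-in for the fitted long-tailed distribution, and the `p_inter` and `decay` parameters are assumptions, not the pipeline's calibrated values:

```python
import random

def sample_3c_pair(genome_len, p_inter=0.1, decay=1e-4, rng=random):
    """Sample one simulated 3C contact as (pos1, pos2, is_inter).

    With probability p_inter the contact is inter-chromosomal and the
    second position is drawn uniformly. Otherwise the separation is
    drawn from a long-tailed distribution (exponential here, for
    illustration) and the chromosome is treated as a flat, unfolded
    ring, so positions wrap around the origin.
    """
    pos1 = rng.randrange(genome_len)
    if rng.random() < p_inter:
        return pos1, rng.randrange(genome_len), True
    sep = min(int(rng.expovariate(decay)), genome_len - 1)
    if rng.random() < 0.5:
        sep = -sep  # contacts occur on either side of pos1
    return pos1, (pos1 + sep) % genome_len, False
```

Repeated sampling, with genomes chosen in proportion to the abundance profile, yields the simulated read-pair set; seeding `rng` reproduces a run exactly.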

The simulated evolutionary process is implemented by sgEvolver (Darling et al., 2004), which models both local changes (e.g. single nucleotide substitutions and indels) and larger genomic changes (e.g. gene gain, loss, and rearrangement). The degree of divergence is controlled through a single scale factor α_BL (Table S1) that uniformly scales tree branch lengths prior to simulated evolution. As data inputs, the module takes a phylogenetic tree and an ancestral genome. As data outputs, the module generates a set of descendent genomes G and an accompanying gold-standard. Overall, community generation introduces the following two sweep parameters: the branch length scale factor α_BL and a random seed (S).

After the assembly and mapping module comes the community deconvolution module, taking as input the 3C-contig graph.

End-to-end, the simulation pipeline makes a large number of variables available for manipulation, and the size and dimensionality of the resulting space is much larger than can be explored with available computational resources. We therefore decided to focus our initial exploration on a small part of this space. We used two simple phylogenetic tree topologies (a four-taxon ladder and a four-taxon star) (Figure 2) to develop insight into the challenges facing metagenomics researchers choosing to apply 3C to communities which contain closely related taxa.
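The effect of α_BL can be illustrated on a Newick tree string. This is only a sketch: the pipeline applies the scaling before invoking sgEvolver, and the regex-based parsing here is suitable only for simple trees with plain numeric branch lengths:

```python
import re

# matches a branch length such as :0.1 or :1e-3
_BRANCH = re.compile(r':([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)')

def scale_branch_lengths(newick, alpha):
    """Uniformly scale every branch length in a Newick string by alpha
    (the branch length scale factor alpha_BL)."""
    return _BRANCH.sub(lambda m: ':%g' % (float(m.group(1)) * alpha), newick)
```

Sweeping α_BL from large to small values moves the community from easily separable species toward strain-level divergence without changing the tree's topology.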
Graph Complexity

Although simple intrinsic graph properties such as order, size and density can provide a sense of complexity, they do not consider the internal structure or information content present in a graph. One such measure that does, the graphical complexity H_L, is defined over the graph spectrum, where {λ : λ > 0} is the set of non-zero eigenvalues of the normalized Laplacian N.
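The exact form of H_L is not reproduced in this excerpt; the sketch below assumes a von Neumann-style entropy over the normalized distribution of non-zero eigenvalues of N, which matches the definition's ingredients but is our reconstruction. Eigenvalues are obtained with a small pure-Python Jacobi routine rather than the pipeline's actual numerics:

```python
import math

def jacobi_eigenvalues(a, tol=1e-12, max_sweeps=100):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    n = len(a)
    a = [row[:] for row in a]
    for _ in range(max_sweeps):
        off = math.sqrt(sum(a[i][j] ** 2
                            for i in range(n) for j in range(n) if i != j))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < tol:
                    continue
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) +
                                                 math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # A <- A J (rotate columns p, q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # A <- J^T A (rotate rows p, q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))

def laplacian_entropy(adj):
    """Entropy over the non-zero eigenvalue distribution of the
    normalized Laplacian N = I - D^{-1/2} A D^{-1/2}."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    N = [[(1.0 if i == j else 0.0) -
          (adj[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] and deg[j] else 0.0)
          for j in range(n)] for i in range(n)]
    lam = [x for x in jacobi_eigenvalues(N) if x > 1e-9]
    total = sum(lam)
    return -sum((x / total) * math.log(x / total) for x in lam)
```

For the three-node path graph the normalized Laplacian has spectrum {0, 1, 2}, giving an entropy of ln 3 − (2/3) ln 2 ≈ 0.637 under this formulation.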

Experimental Design

We implemented a computational pipeline capable of simulating arbitrary metagenomic HiC/3C sequencing experiments (Figure 1). The pipeline exposes parameters governing both the sequencing process and community composition for control by the researcher and, further, provides the facility for performing parametric sweeps on these parameters (Table 1).

The pipeline was used to vary community composition, in particular the degree of within-community evolutionary divergence. Taking as input an ancestral sequence, a phylogenetic tree and an abundance profile, 10 communities were generated with varying evolutionary divergence by scaling branch length (Figure 2). The range of evolutionary divergence was chosen so as to span from a region of easily separable species (≈ 85% ANI) to one of very closely related strains. As genomic sources became more closely related, the resulting metagenomic assembly contigs were of increasingly mixed origin.

Regarding the assembly process as a dynamic system in terms of evolutionary divergence, a turning point in assembly behaviour was observed.

To ease deployment, we introduced a grid-agnostic layer so that redeployment in varying environments is only a configuration detail.

Although a single global seed is used in all random sampling processes, the possibility of irreproducibility remains due to side-effects brought on by variance in a deployment target's operating system and codebase.

We chose to limit the representation of the combined WGS and 3C read data to a 3C-contig graph.

While other representations built around smaller genomic features, such as SNVs, could in principle offer greater power to resolve strains, they bring with them a significant increase in graphical complexity.

How more detailed representations might impact downstream algorithmic scaling, or simply increase the difficulty of accurately estimating a gold-standard, remains to be investigated. Only a portion of the parameter space exposed by our pipeline could be explored. Ultimately, the parameter set we defined for the pipeline (Table 1) has the benefit of being domain-specific and therefore meaningful to experimental researchers.

Detection of overlapping communities in networks is a developing field, and much recent work has left the performance of many clustering algorithms untested for the purpose of deconvolving microbial communities via 3C read data. Not all algorithms are wholly unsupervised; individually, they fall into the categories of prior knowledge described above.

As evolutionary divergence decreased from easily separable species (ANI_b ≈ 85%) to very closely related strains (ANI_b → 1), assemblies went through a transition from a state of high purity (S_mixing ≈ 0) to a highly degenerate state (S_mixing ≈ 1), where many contigs were composed of reads from all community members. A crisis point was observed for small evolutionary divergence (α_BL < 0.2924, ANI_b < 95%), where a sharp change in contiguity (implied by N50 and L50) occurred. At very low divergence, N50 and L50 statistics implied that assemblies were recovering, while source degeneracy (S_mixing) monotonically increased. Graphical complexity (H_L) exhibited a similar turning point to L50 and was dominated by graph order |n| (number of contigs/nodes).

PeerJ reviewing PDF | (2016:04:10448:2:0:NEW 28 Sep 2016)

Figure 5. For the 300 3C-contig graphs pertaining to uniform abundance, a PCA biplot is shown for the two most significant components (PC1, PC2). Respectively, PC1 and PC2 explain 53% and 13.6% of the variation within the data-set. Here vectors represent sweep variables, while points represent individual 3C-contig graphs and are coloured by 3C sequencing depth (n3c: 10k–1M pairs). Double-sized points show mean values of these n3c groupings. Vectors labelled after the five clustering algorithms represent performance as measured by the scoring metric F_B3 (Equation 1).

The first two principal components (PC1 and PC2) explained 53% and 13.6% of the variation within the data-set, respectively. PC1 was most strongly correlated with graphical complexity (H_L) and the number of graph nodes (order), both of which increase with decreasing evolutionary divergence (ANI_b and α_BL), and it explained the majority of variation in performance for four of the five clustering algorithms. The notable exception was Louvain-soft, which had significant support on PC2. PC2 was related to HiC/3C sampling depth (n3c), which correlated strongly with the number of graph edges (size). The positive response of Louvain-soft to an increasing number of HiC/3C read-pairs (n3c), relative to the remaining four algorithms, is evident.