Multiple Comparative Metagenomics using Multiset k-mer Counting

Large scale metagenomic projects aim to extract biodiversity knowledge between different environmental conditions. Current methods for comparing microbial communities face important limitations. Those based on taxonomical or functional assignation rely on a small subset of the sequences that can be associated to known organisms. On the other hand, de novo methods, that compare the whole set of sequences, do not scale up on ambitious metagenomic projects. These limitations motivated the development of a new de novo metagenomic comparative method, called Simka. This method computes a large collection of standard ecology distances by replacing species counts by k-mer counts. Simka scales-up today metagenomic projects thanks to a new parallel k-mer counting strategy on multiple datasets. Experiments on public Human Microbiome Project datasets demonstrate that Simka captures the essential underlying biological structure. Simka was able to compute in a few hours both qualitative and quantitative ecology distances on hundreds of metagenomic samples (690 samples, 32 billions of reads). We also demonstrate that analyzing metagenomes at the k-mer level is highly correlated with extremely precise de novo comparison techniques which rely on all-versus-all sequences alignment strategy. Simka is an open source software, distributed under GNU affero GPL License, available for download at https://gatb.inria.fr/software/simka/


Introduction
It is estimated that only a fraction of 10 −24 to 10 −22 of the total DNA on earth has been sequenced [1]. In large scale metagenomics studies such as Tara Oceans [2] or Human Microbiome Project (HMP) [3] most of the sequenced data comes from unknown organisms and their short reads assembly remains an inaccessible task (see for instance results from the CAMI challenge http://cami-challenge.org/). When precise taxonomic assignation is not feasible, microbial ecosystems can nevertheless be compared on the basis of their diversity, inferred from metagenomic read sets. In this framework, the beta-diversity, introduced in [4], measures the dissimilarities between communities in terms of species composition. Such compositions may be approximated by sequencing marker genes, such as the rRNA 16S in bacterial communities [5], and clustering the sequences into Operational Taxonomic Units (OTU) or working species. However, marker genes surveys suffer from amplification and primer bias [6] and therefore may not capture the whole microbial diversity of a sample. Furthermore, even within the captured diversity, the marker may not be informative enough to discriminate between sub-species or even species strains [7]. Finally, this approach is impractical for whole metagenomic sets for at least two reasons: clustering reads into putative species is computationally costly and leaves out a large fraction of the reads [8].
In this context, it is more practical to ditch species composition altogether and compare microbial communities using directly the sequence content of metagenomic read sets. This has first performed by using Blast [9] for comparing read content [10]. This approach was successful but can not scale up to large studies made up of dozens or hundreds of large read sets, such as those generated from Illumina sequencers.
In 2012, the Compareads method [11] was proposed. The method compares the whole sequence content of two read sets. It introduced a rough approximation of read similarity based on the number of shared words of length k (k-mer) and used it for providing so defined similar reads between read sets. The number of similar reads was then used for computing a Jaccard distance between pairs of read sets. Commet [12] is an extended version of Compareads. It better handles the comparison of large read sets and provides a read sub-set representation that facilitates result analyses and that reduces disk footprint. In 2014, Seth et al. [13] showed that k-mer based distances are accurate enough to recover known biological structure. Given a query sample, their method, called DSM, retrieves similar samples (indexed in a database) based on shared k-mers with their abundance.
The Commet and DSM methods have in common the fact that they use k-mers as a fundamental information source for comparisons. Actually, k-mers are a natural unit for comparing communities: (1) sufficiently long k-mers are usually specific of a genome [14], (2) k-mer frequency is linearly related to genome's abundance [15], (3) k-mer aggregates organisms with very similar k-mer composition (e.g. related strains from the same bacterial species) without need for a classification of those organisms [16].
Even if Commet and DSM approaches were designed to scale-up to large metagenomic read sets, their use on data generated by big projects is turning into a bottleneck in terms of time and/or memory requirements.
In this paper, we present Simka. Simka compares N metagenomic datasets based on their k-mers counts. It computes a large collection of distances classically used in ecology to compare communities by replacing species counts by k-mer counts. Simka is, to our knowledge, the first method able to rapidly compute a full range of distances enabling the comparison of any number of datasets. This is performed by processing data on-the-fly (i.e. without storage of large temporary results) and it outperforms state-of-the-art read comparison methods in terms of computational needs. For instance, Simka was run on 690 samples from the HMP project (totalling 32 billion reads) in 14 hours.
The contributions of this manuscript are three-fold. First we propose a new method for efficiently counting k-mers from a large number of metagenomic samples. The usefulness of such counting is not limited to comparative metagenomics and may have applications in many other fields. Second, we show how to derive a large number of ecological distances from k-mer counts. And third, we show on real datasets that substituting k-mer counts to species counts gives admittedly different distances but that those distances are biologically relevant as they capture the same underlying structure and lead to the same conclusions as those based on taxonomic composition.

Materials and Methods
Overview Given N metagenomic datasets, denoted as S 1 , S 2 , S i , ...S N , the objective is to provide a N × N distance matrix D where D i,j represents an ecological distance between datasets S i and S j . Such possible distances are listed in Table 1. The computation of the distance matrix can be theoretically decomposed into two distinct steps: … Figure 1. Simka strategy. The first step takes as input N datasets and generates multiple streams of abundance vector from disjoint sets of k-mers. The abundance vector of a k-mer consists of its N counts in the N datasets. These abundance vectors are taken as input by the second step to iteratively update independent contributions to the ecological distance in parallel. Once an abundance vector has been processed, there is no need to keep it on record. The final step aggregates each contribution and computes the final distance matrix.
Sorting Count Each k-mer of a dataset is extracted and its canonical representation is stored (the canonical representation of a k-mer is the smallest lexicographic value between the k-mer and its reverse complement). Canonical k-mers are then sorted in lexicographical order. Distinct k-mers can thus be identified and their occurrence computed.
As the number of distinct k-mers is generally huge, the sorting step is divided into two sub-tasks and proceeds as follows: The k-mers are first separated into P partitions, then stored on disk. After this preliminary task, each partition is sorted and counted independently, and stored again on disk. Conceptually, at the end of the sorting count process, we dispose of N × P sorted partitions. As the same distribution function is applied to all datasets, a partition P i contains a specific subset of k-mers common to all datasets. Fig. 2-A illustrates the Sorting Count phase.
The Sorting Count phase has a high parallelism potential. A first parallelism level is given by the independent counts of each dataset. N processes can thus be run in parallel, each one dealing with a specific dataset. A second level is given by the fine grained parallelism implemented in software such as DSK [18] or KMC2 [17] that intensively exploit today multicore processor capabilities. Thus, the overall Sorting Count process especially suits for grid infrastructures made of hundred of nodes, and where each node implements 8 or 16-core systems.
Furthermore, to limit disk bandwidth and avoid I/O bottleneck, partitions are compressed. A dictionary-based approach such as the one provided in zlib [19] is used. This type of compression is very well suited here since it efficiently packs the high redundancy of sorted k-mers.
Merging Count Here, the data partitioning introduced in the previous step is advantageously used to generate abundance vectors. The N files associated to a partition P i , are taken as input of a merging process. These files contain k-mer counts sorted in lexicographical order. A Merge-Sort algorithm can thus be efficiently applied to directly generate abundance vectors.
In that scheme, P processes can be run independently, resulting of the generation of P abundance vectors in parallel, allowing to compute simultaneously P contributions of the ecological distance. Note that the abundance vectors do not need to be stored. They are only used as input streams for the next step. Fig. 2-B illustrates the Merging Count phase.
k-mer abundance filter Distinct k-mers with very low abundance usually come from sequencing errors. As a matter of fact, a single sequencing error creates up to k erroneous distinct k-mers. Filtering out these  k-mers speeds-up the Simka process, as it greatly reduces the overall number of distinct k-mers, but may also impact the content of the distance matrix. This point is evaluated and discussed in the result section. This filter is activated during the count process. Only k-mers whose abundance is equal or above a given abundance threshold are kept. By default the threshold is set to 2. The k-mers that pass the filter are called "solid k-mers".

Ecological distance computation
Simka computes a collection of distances for all pairs of datasets. As detailed in the previous section, abundance vectors are used as input data. For the sake of simplicity, we first explain the computations on the Jaccard distance. All other distances, presented later on, can be computed in the same way, with only small adaptations.
Computing the abundance-based Jaccard distance The abundance-based Jaccard distance is given by the following equation: where w is a k-mer and N Si (w) is the abundance of w in the dataset S i . We consider here that w ∈ S i ∩ S j Algorithm 1: Compute a Jaccard distance (equation 1) between N datasets Input: -V s : vector of size P representing the abundance vector streams -V ∪ : vector of size N containing the number of k-mers in each dataset Output: a distance matrix Dist 1 M ∩ ← empty square matrix of size N // number of shared k-mers by dataset pairs 2 In parallel: foreach abundance vector stream S in V s do 3 M ∩part ← empty squared matrix of size N // part of number of shared k-mers Write M ∩part to disk The equation involves marginal (or dataset specific) terms (i.e. w∈Si N Si (w) is the total amount of k-mers in dataset S i ) acting as normalizing constants and crossed terms that capture the (dis)similarity between datasets (i.e. w∈Si∩Sj N Si (w) + N Sj (w) is the total amount of shared k-mers between datasets S i and S j ). Marginal and crossed terms are then combined to compute the final distance.
Algorithm 1 shows that it is straightforward to compute the distance matrix between N datasets from the abundance vectors. Inputs of this algorithm are provided by the Multiple k-mer Counting algorithm (MKC). These are the P streams of abundance vectors and the marginal terms of the distance, i.e. the number of k-mers in each dataset, determined during the first step of the MKC which counts the k-mers.
A matrix, denoted M ∩ , of dimension N × N is initialized (step 1) to record the final value of the crossed terms of each pair of datasets. P independent processes are run (step 2) to compute P partial crossed term matrices, denoted M ∩part (step 3), in parallel. Each process iterates over its abundance vector stream (step 4). For each abundance vector, we loop over each possible pair of datasets (steps 5-6). The matrix M ∩part is updated (step 8) if the k-mer is shared, meaning that it has positive abundance in both datasets S i and S j (step 7). Since a distance matrix is symmetric with null diagonal, we limit the computation to the upper triangular part of the matrix M ∩part . The current abundance vector is then released. Each process writes its matrix M ∩part on the disk when its stream is done (step 9).
When all streams are done, the algorithm reads each written M ∩part and accumulates it to M ∩ (step 10-11). The last loop (steps 13 to 16) computes the Jaccard distance for each pair of datasets and fills the distance matrix reported by Simka.
The amount of abundance vectors streamed by the MKC is equal to W s , which is also the total amount of distinct solid k-mers in the N datasets. This algorithm has thus a time complexity of O(W s × N 2 ).
Other ecological distances The distance introduced in Eq. 1 is a single example of ecological distance. There exists numerous other ecological distances that can be broadly classified into two categories (see [20] for a finer classification): distances based on presence-absence data (hereafter called qualitative) and distances based on proper abundance data (hereafter called quantitative). Qualitative distances are more sensitive to factors that affect presence-absence of organisms (such as pH, salinity, depth, humidity, absence of light, etc) and therefore useful to study bioregions. Quantitative distances focus on factors that affect relative changes (seasonal changes, nutrient availability, concentration of oxygen, depth, diet, disease, etc) and are therefore useful to monitor communities over time or along an environmental gradient. Note that some factors, such as pH, are likely to affect both presence-absence (for large changes in pH) and relative abundances (for small changes in pH). Algorithmically, most ecological distances (including most of those mentioned in [20]) can be expressed for two datasets S i and S j as: where g and f are simple functions, and C Si is a marginal (i.e. dataset-specific) term of dataset S i , usually of size 1 (i.e. a scalar). In most distances, C Si is simply the total number of k-mers in S i . By contrast, the value of f corresponds to crossed terms and requires knowledge of both N Si (w) and N Sj (w) (and potentially C Si and C Sj as well). For instance, for the abundance-based Jaccard distance of Eq. 1, we have . Those distances can be computed in a single pass over the data using a slightly modified variant of Algorithm 1. The marginal terms C Si are computed during the first step of the MKC which counts the k-mers of each dataset. The crossed terms involving f are computed and summed in steps 7-8 (but exact instructions depend on the nature of f ). Finally, the actual distances are computed in steps 15-16 and depend on both f and g.
Qualitative distances form a special case of ecological distances: they can all be expressed in terms of quantities a, b and c where a is the number of distinct k-mers shared between datasets S i and S j , b is the number of distinct k-mers specific to dataset S i and c is the number of distinct k-mers specific to dataset S j . Those distances easily fit in the previous framework as a = w∈Si∩Sj 1 {N S i (w)N S j (w)>0} , C Si = w∈Si 1 {N S i (w)>0} = a + b and similarly C Sj = a + c. Therefore, a is a crossed term and b and c can be deduced from a and the marginal terms.
In the same vein, Chao et al. [21] introduced variations of presence-absence distances incorporating abundance information to account for unobserved species. Those distances, named AB-Jaccard, AB-Ochiai and AB-Sorensen can all be expressed in terms of quantities U and V where U is the probability that a k-mer from dataset S i is found in dataset S j and V is the reverse: the probability that a k-mer from dataset S j is found in dataset S i . U and V play the same role here as a, b and c do in qualitative distances. U and V are not known in practice and must be estimated from the data. Chao et al. [21] proposed several estimates but the simplest one are U = Y SiSj /C Si and V = Y Sj Si /C Sj where Y SiSj = w∈Si∩Sj N Si (w)1 {N S j (w)>0} and C Si = w∈Si N Si (w). Note that Y SiSj corresponds to crossed terms and is asymmetric, i.e. Y SiSj = Y Sj Si . Table 1 gives the definitions of the collection of distances computed by Simka while replacing species counts by k-mer counts. These are qualitative, quantitative and abundance-based variants of qualitative ecological distances. The table also provides their expression in terms of C i , f and g, adopting the notations of Eq. 2.
Finally, note the additive nature of the computed distances over k-mers is instrumental in achieving a linear time complexity (in W s , the amount of distinct solid k-mers) and in the parallel nature of the algorithm. The algorithm is therefore not amenable to other, more complex classes of distances that account for ecological similarities between species [22], or edit distances between k-mers as those complex distances require all versus all k-mer comparisons.

Name
Definition Jensen-Shannon Table 1. Definition of some classical ecological distances computed by Simka. All quantitative distances can be expressed in terms of CS, f = f (x, y, X, Y ) and g = g(x), using the notations of Eq. 2, and computed in one pass. Qualitative ecological distances (resp. AB-variants of qualitative distances) can also be computed in a single pass over the data by computing first a, b and c (resp. U and V ). See main text for the definition of a, b, c, U and V .

Implementation
Simka is based on the GATB library [23], a C++ library highly optimized to handle very large sets of k-mers. It includes a powerful implementation of the sorting count algorithm with the latest improvements in term of k-mer counting introduced by [17]. Simka is usable on standard computers and has also been entirely parallelized for grid or cloud systems. It automatically splits the process into jobs according to the available number of nodes and cores. These jobs are sent to the job scheduling system, while the overall synchronization is performed at the data level.

Results
In this section, Simka performances are first evaluated in terms of computation time, memory footprint and disk usage. Then the distance quality is compared with other approaches and known results.
We conduct our experiments with data from the Human Microbiome Project (HMP) [3] which is currently one of the largest publicly available metagenomic dataset: 690 available samples that passed the quality control assessment (http://www.hmpdacc.org/HMASM/). The whole dataset contains 2*16 billions of Illumina paired reads distributed non uniformly across the 690 samples. These samples have been gathered from different human body sites. Supplementary file S1 details precisely how the datasets used for each experiment were built.

Performance Evaluation
Experiments have been conducted on a 200-CPU core platform: 25 × 8-CPU nodes of 64 GBytes memory running at 2.60 GHz. Fig. 3-A reports the computation time of Simka for an increasing numbers of metagenomic samples. The figure indicates the execution time of the different steps: MKC-count, MKC-merge, simple and complex distance computation (as defined below). It can be seen that the time devoted to the computation of complex distances increases exponentially with the number of metagenomic samples, the other steps having a linear behavior. The difference of execution time between simple and complex distances computation comes from the amount of operations required to calculate the crossed terms from an abundance vector, which has quadratic time complexity relatively to N (the number of samples). For a given abundance vector, the simple distances only need to be updated for each pair (S i , S j ) such that N Si > 0 and N Sj > 0 whereas complex distances need to be updated for each pair such that N Si > 0 or N Sj > 0, entailing a lot more update operations. Among all distances listed in Table 1, all distances are simple, except the Whittaker, Bray-Curtis, Jensen-Shannon and Canberra distances. Fig. 3-B plots the memory footprint. It shows that memory requirement is independent of the number of samples (N ). On the other hand, the disk space needed to perform the MKC step increases linearly with N ( Fig. 3-C). The maximum memory used by Simka is a parameter. Increasing this parameter speeds-up the computations.
Results presented in this section show that Simka performances in term of time, memory and disk consumption offer the possibility to apply it on a important number of large samples. Note that, as presented in the next section, except in term of disk usage, Simka performances outperform state-of-the-art tools Commet and DSM.
Comparison with Commet and DSM Each tool has been set to provide the quantitative Jaccard distance. Simka was run with k = 31 without filtering option. Commet was run with its default parameters (k = 33), ensuring its best time and memory usage. DSM does not directly provide distance matrices. Its strategy for counting the k-mers and merging the k-mer counts was thus evaluated with k = 31. Running Simka on HMP dataset Simka has been run on the full dataset of the HMP project (690 samples), with and without the filtering option in order to estimate its impact on the overall performances. The threshold of the filtering is 2, meaning that k-mers seen only once are discarded. Table 3 reports this experimentation. Supplementary file S1 Fig provides more information of the Simka scalability. Solid distinct k-mers (k-mers seen at least twice) represent less than half of all distinct k-mers. But they account for 95% of the whole dataset. Since the overall complexity of Simka is linear with the number of distinct k-mers, its performances are twice as good, both in terms of computation time and disk usage when considering only solid k-mers.
With the filtering option enabled, the overall computation time on a 200-CPU platform does not exceed 24 hours with very low memory requirements. By comparison, Commet took several days to compute 1-vs-all distance matrix and therefore would require years of computation to achieve the N × N distance matrix. DSM ran out of memory on a 1.6 TB memory cluster.

Evaluation of the distances
We evaluate the quality of the distances computed by Simka in two ways. First, are they similar to distances between read sets computed using other approaches? Second, do they recover the known biological structure of HMP samples? For the first evaluation, we focus on comparing a read-based Jaccard distance against a kmer-based Jaccard distances. We considered a quantitative Jaccard distance in both cases. To make the comparison easier, we transform Jaccard distance into a similarity measure.
Formally, the percentage of matched k-mers between two read sets S i and S j is given by the following equation: which is indeed exactly 1 minus the Jaccard distance given by Eq. 1, and multiplied by 100 to get values in [0, 100]. Reads were also compared using Commet [12] and an alignment-based method using BLAT [24]. Both these read-based approaches define and use a read similarity notion. They then derive the percentage of reads from S i similar to at least one read from S j , noted by S i − → ∩ S j . The read-based similarity between read sets S i and S j is then given by: which is the exact counterpart of Eq. 3 on reads using the read similarity notion.
Correlation with Commet Looking at the correlation with Commet is interesting because this tool uses a heuristic based on shared k-mers but its final distance is based on read counts. Both tools were run on 50 samples from the HMP project. Fig. 4 shows that the two methods are highly correlated (Mantel test, r = 0.992, p = 0.001).  Correlation with an alignment-based approach using BLAT We conducted a similar experiment using BLAT to measure the percentage of shared reads between two read sets. BLAT was used on a dataset made of the 15 smallest HMP samples and similarity was defined based on several identity thresholds: two reads were considered similar if their alignment spanned at least 70 nucleotides and had a percentage of identity higher than 92%, 95% or 98%.
Independently, Simka was run with three k-mer sizes: 15, 21 and 31. Fig. 5 shows the correlation between Simka and BLAT for each identity threshold and each k-mer sizes. Theses plots show a clear correlation between the percentage of matched k-mers and the percentage of similar reads. The lowest correlation value is 0.896 obtained for k = 31 and 92% identity. As expected, larger k-mer sizes correlate better with higher identity thresholds and vice versa.
One may notice (S2 Fig), that increasing the k-mer size impacts Simka performances. Mainly, when k increases, the disk usage increases sub-linearly whereas the computation time and the memory usage stay constant.
Applying Simka to the full HMP dataset Simka was run on the samples from the HMP project in order to assess the impact of filtering k-mers by abundance and in order to compare results obtained from quantitative distances to those obtained from qualitative ones.
To easily visualize the structure found by each distance, we used the Principal Coordinate Analysis (PCoA) [25] to get a 2-D representation of the distance matrix and of the samples: distances in the 2-D plane optimally preserve values of the distance matrix. Fig. 6-A shows the PCoA of the samples using the abundance-based Ochiai distance without the abundance filter. We can see that the samples are clearly segregated by body sites. This is in line with results from studies of the HMP consortium [3,26,27]. It should be highlighted that the Ochiai distance is one of the simplest and fastest to compute. Results using the Hellinger distance with no filter (Fig. 6-C) show that this distance provides a different characterization of the oral samples. This new characterization is biologically relevant: it tends to separate in three clusters the samples coming from three distinct parts  of the oral site. This confirms the fact that it is important to conduct analyses using several distances as suggested in [20,27] as different distances capture different features of the samples.
Figs. 6-B and 6-D show the impact of the solid filter on the two previously mentioned distances. One may notice only slight differences. As shown in Table 3, even if the solid filter discards more than half of the distinct k-mers, solid k-mers represent more than 95% of the k-mer abundances of the HMP dataset.
From Fig. 6-E, we can see that there are no major difference between the quantitative and qualitative versions of the Ochiai distance when using all k-mers. However, the results change dramatically when using only solid k-mers (Fig. 6-F). In this case, different sampling sites from the oral cavity become well separated but the distinction between nasal, skin and urogenital samples becomes more confused. This experiment shows that, as expected, solid filtering affects qualitative distances more than quantitative ones. For qualitative distances, user is invited to compare both filtered and unfiltered results to assess whether or not they rely on rare and potentially erroneous k-mers.

Discussions
In this article, we introduced Simka, a new method for computing a collection of ecological distances, based on k-mer composition, between numerous and large metagenomic datasets. This was made possible thanks to the Multiple k-mer Count algorithm (MKC), a new strategy that counts k-mers with state-of-the-art time, memory and disk performances. The novelty of this strategy is that it counts simultaneously k-mers from any number of datasets, and that it represents results as a stream of data, providing counts in each dataset, k-mer per k-mer. The motivation for computing a collection of distances rather than just one is two folds: different distances capture different features of the data [20,22,27] and all the distances computed by Simka have in common that they are additive over k-mers and can thus be simultaneously computed using the same algorithm.
Simka also provides qualitative distances, based only on the presence or absence of k-mers, not on their abundance. This is a novelty as there is, to our knowledge, no alternative method to efficiently compute presence-absence based distances from sets of reads as they require efficient read dereplication from these sets. Although qualitative distances are more fragile to sequencing errors than quantitative ones, they remain useful to define bioregions and study small-scales changes (colonization, contamination, etc). Indeed, quantitative distances are obscured by changes in most abundant organisms and thus most abundant k-mers.
A notable key point of our proposal is to estimate beta-diversity using k-mers diversity only. We are conscious this may lead to biased estimates of the beta-diversities defined from species composition data. The bias can run both ways: on the one hand, shared genomic regions or horizontal transfers between species will bias the k-mer-based distance downwards. On the other hand, genome size heterogeneity and k-mer composition variation along a microbe genome will bias the k-mer-based distance upwards. However, species composition based approaches are not feasible for large read sets from complex ecosystems (soil, seawater) due to the lack of good references and/or mapping scaling limitations. Moreover, our proposal has the advantage of being a de novo approach, unbiased by reference banks inconsistency and incompleteness. Finally, the results on the HMP datasets show that Simka distances recover the same biological structure as taxonomic studies do.
There is nevertheless room for improving Simka performances. The distance computation has a time complexity in O(W × N 2 ), with W is the number of considered distinct k-mers and N is the number of input samples. N is usually limited to a few dozens or hundreds and can not be reduced. However, W may be in the range of hundreds of billions. Future work will consist in reducing W using new filtering approaches based on k-mer sequences signatures and/or on k-mer counts for selecting only informative k-mers, that discriminate the samples. For instance, DSM method proposed to filter the k-mers based on their Shannon entropy over the N samples. Other potential filters include the chi-square statistics and the variance of the abundance vector to screen out uninformative k-mers. It would be interesting to investigate how robust the beta-diversity distances are to the choice of discriminant k-mers.
Since metagenomic projects are constantly growing, it is important to offer the possibility to add new sample(s) to a set for which distances are already computed, without starting back the whole computation from scratch. This is straightforward to adapt the MKC algorithm to such operation, but the merging step and distance computation step have to be done again. However, adding a new sample does not modify previously computed distances and only requires to compute a single line of the distance matrix, it can thus be achieved in linear time. q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q qq q q qq qq q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q PC1 (13.11%) PC2 (8.72%) F PA − Ochiai (with filter) Gastrointestinal Nasal Oral Skin Urogenital Figure 6. Distribution of the diversity of the HMP samples by body sites with respect to computed distance and filtering approaches. PCoA of the samples is based on distances computed either on all k-mers (left) or solid k-mers only (right). Each color corresponds to a distinct human body sites. The color shades correspond to different subparts of a body site. Distances were computed with k = 21.