GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs

Abstract Genome search and/or classification typically involves finding the best-match database (reference) genomes and has become increasingly challenging due to the growing number of available database genomes and the fact that traditional methods do not scale well with large databases. By combining k-mer hashing-based probabilistic data structures (i.e. ProbMinHash, SuperMinHash, Densified MinHash and SetSketch) to estimate genomic distance, with a graph based nearest neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), we created a new data structure and developed an associated computer program, GSearch, that is orders of magnitude faster than alternative tools while maintaining high accuracy and low memory usage. For example, GSearch can search 8000 query genomes against all available microbial or viral genomes for their best matches (n = ∼318 000 or ∼3 000 000, respectively) within a few minutes on a personal laptop, using ∼6 GB of memory (2.5 GB via SetSketch). Notably, GSearch has an O(log(N)) time complexity and will scale well with billions of genomes based on a database splitting strategy. Further, GSearch implements a three-step search strategy depending on the degree of novelty of the query genomes to maximize specificity and sensitivity. Therefore, GSearch solves a major bottleneck of microbiome studies that require genome search and/or classification.

and 10,000 query genomes were used, respectively, for this analysis. Note that GSearch and BinDash are plotted on different axes (primary and secondary y-axis, respectively). A total of 24 threads was used. A sketch size of m = 10^5 was used to achieve accuracy similar to that of ProbMinHash.

Figure S11. Correlation between ANI estimates from our implementation of Densified MinHash (optimal densification) and the Mash bottom-m implementation. A sketch size of m = 12,000 was used for improved accuracy in Densified MinHash, while 15,000 was used in Mash. Our implementation, as also reported in the BinDash paper, is at least 10 times faster than Mash.
For Jaccard similarity, m is the sketch size. We used a set size (number of elements in the set) of n = 300,000 in all cases; m is 10^5 for classic MinHash.
1 Broder (1997); 2 Shrivastava (2017); 3 Ertl (2017); 4 Ertl (2020); 5 Ertl (2018); 6 Christiani (2020); 7 Flajolet et al. (2007); 8 Karppa and Pagh (2022); 9 Ertl (2021). 10 Mergeable means that adding an element or taking the union of multiple subsets can be performed in sketch space, an important feature for large-scale applications when new data need to be added to existing sketches or in distributed environments. 11 OD indicates optimal densification; since it is not mergeable, it is difficult to apply in distributed environments. 12 SetSketch can be applied in a weighted fashion, but not in this paper.

Table S2. Theoretical comparison of estimator variance (relative mean standard error) of MinHash-like algorithms for the Jaccard index, where m is the number of registers (sketch size). The same m was used for all tools except for SetSketch, which requires a smaller m to achieve similar accuracy. J or Jp represents the Jaccard index or Jaccard probability index, respectively.
SetSketch
a ef_search = M: the width of search for querying equals M. b Large M and ef_construct take more time for both build and query search. In the MNIST fashion dataset, the Euclidean distance is precomputed and not considered in this test. The same distance is used in Tables S8, S9 and S10. c Tests were run on a 24-thread Intel(R) Xeon(R) Gold 6226 CPU @ 2.70GHz. Speed was tested using 10,000 queries.
Table S9. Benchmarking of the hnswlib-rs library for scalability using the SIFT1M dataset. a Tests were run on an Intel(R) Xeon(R) Gold 6226 CPU @ 2.70GHz. For each run, the number of threads shown in column #2 was requested. Speed was tested using 10,000 queries.

Table S14. GSearch (ProbMinHash), Sourmash, Dashing and BinDash benchmarked against blastn-ANI at the nucleotide level. The query genome is "OceanDNA-b42278.fa" from (Nishimura and Yoshizawa, 2022). Database genomes are all NCBI/RefSeq genomes (318k) after removing the top 10 ground-truth genomes found in Table S13 above. For each column, matches are ranked by the respective distance provided in the output of each tool. The top 10 matches found by blastn-ANI showed between 75% and 80% ANI to the query genome, corresponding to Jaccard indices of 0.009 to 0.015, respectively. Boldface denotes the genomes found by each method compared to the ground truth in column 1.

Table S17. Recall (top 10) of GSearch (ProbMinHash option), Mash and Sourmash against the BLAST-AAI results. The same query genome as in Table S14 was used, but at the proteome level after gene prediction. Dashing and BinDash were not included because these tools do not provide an amino-acid-level search option. Same query genome as Table S13.

such as HULK. Motivated by the SuperMinHash algorithm for conventional Jaccard similarity estimation 11 and the BagMinHash algorithm for weighted Jaccard similarity estimation 12, ProbMinHash (the probminhash3(a) and 4 algorithms) is orders of magnitude faster than the original P-MinHash algorithm proposed in D2 histosketch 14. ProbMinHash estimates the Jaccard probability similarity Jp; 1−Jp is indeed a metric on probability distributions and is Pareto optimal (Supplementary Note 1) 8,14. Densified MinHash, or One Permutation MinHash with Optimal/Faster Densification, is the fastest MinHash algorithm due to a theoretical breakthrough (average case O(n+m)), despite having larger variance than classic MinHash (Table S2). It uses only one hash function but copies values from non-empty bins into empty bins ("densification") in the sketch vector, mapping forward, backward or both 13,15,16. We chose Sourmash, Mash, Dashing 1/2 and BinDash for benchmarking because they all provide estimates of the Jaccard index, which correlates very well with ANI after transformation, and they had all been previously benchmarked against BLAST-based ANI and fastANI 2,19.
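As a rough illustration of the densification idea described above, the following minimal Rust sketch (a toy under stated assumptions, not GSearch's actual implementation) hashes each element once into one of m bins, keeps the minimum hash per bin, and fills empty bins from the nearest non-empty bin to the right (the simple rotation scheme; optimal densification instead randomizes the borrowing direction per bin):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One-permutation MinHash with simple rotation densification.
/// Each element is hashed once and routed to bin (h mod m); empty bins
/// then borrow the value of the next non-empty bin (circularly).
fn densified_minhash(elements: &[&str], m: usize) -> Vec<u64> {
    let mut bins = vec![u64::MAX; m];
    for e in elements {
        let mut h = DefaultHasher::new();
        e.hash(&mut h);
        let v = h.finish();
        let bin = (v % m as u64) as usize;
        if v < bins[bin] {
            bins[bin] = v;
        }
    }
    // Densify: each empty bin copies the value of the next non-empty bin.
    (0..m)
        .map(|i| {
            (0..m)
                .map(|d| bins[(i + d) % m])
                .find(|&v| v != u64::MAX)
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Jaccard estimate: fraction of matching registers between two sketches.
fn jaccard_est(a: &[u64], b: &[u64]) -> f64 {
    let eq = a.iter().zip(b).filter(|(x, y)| x == y).count();
    eq as f64 / a.len() as f64
}

fn main() {
    let a: Vec<String> = (0..1000).map(|i| format!("kmer{}", i)).collect();
    let b: Vec<String> = (500..1500).map(|i| format!("kmer{}", i)).collect();
    let ar: Vec<&str> = a.iter().map(|s| s.as_str()).collect();
    let br: Vec<&str> = b.iter().map(|s| s.as_str()).collect();
    let (sa, sb) = (densified_minhash(&ar, 256), densified_minhash(&br, 256));
    // True Jaccard is 500/1500 ≈ 0.333; the sketch estimate is noisy around it.
    println!("estimated Jaccard = {:.3}", jaccard_est(&sa, &sb));
}
```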

Comparison with other genome/sequence search algorithms
There are many other data structures designed for general-purpose sequence search problems, such as the Sequence Bloom Tree (SBT) and its variants [20][21][22], COBS/BIGSI 23,24, Layered LSH (which approximates ungapped alignment and is not applicable to ANI-like distances, which involve gapped alignment) 25 and RAMBO (Repeated and Merged Bloom Filter) 26, which can achieve linear or sometimes even sub-linear genomic database search performance. However, those tools have never been benchmarked against BLAST-based ANI/AAI in the context of microbial genome search (that is, whether the best hits/genomes found by the mentioned tools are the same as the ANI/AAI best hits/genomes), which is a popular reference standard for measuring microbial genomic distance/identity and thus inferring microbial taxonomy 27,28. MinHash-based tools such as Mash and FastANI have been benchmarked against BLAST-based ANI and were shown to be the most accurate k-mer-based sub-linear algorithms for ANI comparisons and/or searching 2,19.

HNSW in Rust benchmarked against testing datasets
To benchmark our reimplementation of hnswlib, we followed standard ANN benchmark procedures using two popular testing datasets (MNIST and SIFT1M) based on their Euclidean distance 29. Our results showed that, for the MNIST fashion dataset (784 dimensions, 60,000 vectors), recall for the top 100 neighbors of 10,000 query vectors is greater than 98% for small values of M and ef_construct, and even higher (99.86%) for medium M and ef_construct, while query speed is not compromised (Supplementary Table S7). For the SIFT1M dataset (128 dimensions, 1,000,000 vectors), recall for the top 100 neighbors of 10,000 query vectors was 99.77% for medium M and ef_construct (Supplementary Tables S7 and S8). In terms of speed, we compared it with hnswlib using the MNIST fashion dataset, and it is as fast as hnswlib: database building and searching took 18.06 s and 0.89 s with hnswlib-rs, versus 18.47 s and 1.07 s with the C++ hnswlib (Supplementary Table S9).
The Rust package hnswlib-rs can be found at: https://github.com/jean-pierreBoth/hnswlib-rs. For each genomic database, we chose M and ef_construct experimentally, by gradually increasing M and ef_construct while monitoring query speed and recall, similar to what is shown in Supplementary Table S2 for the MNIST dataset. We stopped the assessment when there was only a marginal increase in accuracy but a decent decrease in speed. To balance recall and speed, we used M=128 and ef_search=1600 for graph building for the GTDB and fungal databases, and M=128 and ef_search=3200 for the phage database.
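The recall monitored during this tuning is simply the overlap between the approximate and exact top-k neighbor lists. A minimal sketch (the helper name is ours, not part of hnswlib-rs):

```rust
/// Recall@k: the fraction of an exact top-k neighbor list that the
/// approximate index also returned — the metric watched while raising
/// M and ef_construct.
fn recall_at_k(approx: &[usize], exact: &[usize]) -> f64 {
    let hits = approx.iter().filter(|&&x| exact.contains(&x)).count();
    hits as f64 / exact.len() as f64
}

fn main() {
    // Hypothetical neighbor ids: the index missed two of the true top 10.
    let exact: [usize; 10] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    let approx: [usize; 10] = [1, 2, 3, 4, 5, 6, 7, 8, 11, 12];
    println!("recall@10 = {}", recall_at_k(&approx, &exact)); // 0.8
}
```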

Details of program implementation in Rust
Tohnsw starts by reading database genomes and generating k-mer profiles and sketches using the ProbMinHash3a algorithm (or SuperMinHash, SetSketch, Densified MinHash) for distance calculation (Figure 1a and c). Next, tohnsw selects, at random, the first batch of genomes to insert into the graph (Figure 1a (1)), following the HNSW construction rules mentioned above and taking into account the computed ProbMinHash distance or SetSketch-approximated Jaccard distance (1−J) between genomes to connect genomes based on their relatedness (Figure 1b), until all genomes in the database have been inserted (Figure 1a (2), (3) and (4)). Finding the nearest genomes for the genome to be inserted is essentially a search process, but in a partially built graph, which is similar to the request/search module: each genome in the graph is associated with a list that stores its M closest neighbors/genomes and the distances to these neighbors. Then, the distances of this genome to the nearest neighbors (M) of the entry genome in each layer are computed (ef_construct times) using the ProbMinHash3a algorithm or SetSketch, and the neighbor genome with the smallest distance becomes the new entry genome (Figure 1d and e). This process is repeated until the nearest genomes (<=M) in the layer are found; subsequently, the program proceeds to the layer below, using the nearest genome found in the layer above as the new entry genome in the new layer. The search-layer algorithm is repeated until the bottom layer is reached (Figure 1c). In contrast to the default settings in the original hnswlib, we set the two neighbor-selection heuristic parameters extendCandidates to true and keepPrunedConnections to false, because our genomic data is extremely clustered and there is no need to fix the number of connections per element at the maximum connection allowed. The "Add" module adds new genomes to a prebuilt HNSW database using the default parameters loaded from the pre-built database files. The request module loads the graph database and then searches query genomes against it to return the best neighbors of each query, following exactly the same procedure as the building step but without updating the database.

Details of the SetSketch implementation
We implemented the locality-sensitive hashing (LSH) section of SetSketch algorithm 1 according to the lower and upper bounds of the Jaccard index, which are functions of D0, the number of registers in the sketch of genome A that are equal to those in the sketch of genome B. In practice, Jlow is closer to the true Jaccard index for small J, and thus we use it. We used the parameters m = 6144, b = 1.0005, a = 20 and q = 65534 instead of the default ones to obtain a smaller RMSE (<0.8%) around J = 0.015 (corresponding to ANI 77.99% according to the Mash equation) and acceptable running time (Figure S7); with these settings, SetSketch is equivalent to HyperLogLog for estimating the Jaccard index in terms of space, but with smaller variance (with b close to 2, SetSketch becomes equivalent to SuperMinHash). A SetSketch with this configuration can represent any set with up to 10^19 distinct elements/k-mers (much larger than the total number of k-mers of a microbial genome). The expected error of the cardinality estimates or LSH is very small 30, close to that of MinHash for large m (10^4 or above), while using much smaller space. We also implemented the Joint Maximum Likelihood Estimator (JMLE): the ML estimate for Jaccard was found by a standard univariate optimization algorithm, Brent's method, based on the argmin Rust package, since the ML function is strictly concave for the parameters mentioned above 30. The RMSE of the JMLE, based on its Fisher information, can be found in Supplementary Note 7. Although the RMSE of the JMLE is smaller than that of the LSH estimator, it is much slower in practice; however, we only use it for filtering false positives after the top k best neighbors have been found for each query by LSH and HNSW, so it does not affect overall speed.
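The J-to-ANI correspondence quoted above (J = 0.015 ↔ ANI ≈ 77.99%) follows from the Mash distance equation D = −(1/k)·ln(2J/(1+J)) with ANI ≈ 1 − D. A quick check in Rust, assuming the nucleotide k-mer size k = 16 from Note 2:

```rust
/// Mash equation: distance D = -(1/k) * ln(2J / (1 + J)); ANI ≈ 1 - D.
fn ani_from_jaccard(j: f64, k: f64) -> f64 {
    1.0 + (2.0 * j / (1.0 + j)).ln() / k
}

fn main() {
    // J = 0.015 with k = 16 lands at the ~78% ANI threshold quoted above.
    println!("ANI ≈ {:.2}%", 100.0 * ani_from_jaccard(0.015, 16.0)); // 77.99%
}
```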
P-MinHash is considered a more general case of Jaccard index estimation, and Jp is closer to the true Jaccard index than Jw and Jn 13,14. More importantly, Jp is Pareto optimal and 1−Jp is a proper metric 13. It has been shown that ANI estimated from Jp, computed by ProbMinHash (the ProbMinHash2 algorithm) in Dashing 2, is slightly better for bacterial genomes than traditional MinHash as in Mash 15. ProbMinHash builds upon this MinHash approach with further computational optimization for speed 12. Hence, ProbMinHash is currently the default option in GSearch. HyperLogLog can also be used to estimate the Jaccard index: the distinct k-mer counts of sets/genomes A and B are estimated via HyperLogLog sketches (for memory efficiency), and the inclusion-exclusion rule then gives J(A,B) = (|A| + |B| − |A∪B|)/|A∪B|, as implemented in Dashing, despite being overall less accurate for small cardinalities (e.g. viral genomes) 16,17. The Maximum Likelihood Estimator (MLE) 16, in addition to the original estimator in HyperLogLog and the improved estimator in 16 for distinct element counting, was thought to have the smallest variance, meeting the Cramér-Rao lower bound, despite being slower and more difficult to compute and update 18. This is the idea behind Dashing, which we benchmark against GSearch.
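The inclusion-exclusion step above is simple arithmetic once the three cardinalities are available; a minimal sketch (the HyperLogLog estimation of the cardinalities themselves is omitted):

```rust
/// Inclusion-exclusion: J(A,B) = (|A| + |B| - |A∪B|) / |A∪B|,
/// where the three cardinalities would come from HyperLogLog sketches.
fn jaccard_incl_excl(n_a: f64, n_b: f64, n_union: f64) -> f64 {
    (n_a + n_b - n_union) / n_union
}

fn main() {
    // |A| = 1000, |B| = 1000, |A∪B| = 1500 → intersection 500, J = 1/3.
    println!("J = {:.3}", jaccard_incl_excl(1000.0, 1000.0, 1500.0));
}
```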

Note 2. K-mer selection rationale for nucleotide (nt) and amino-acid (aa) level searches for MinHash/ProbMinHash tools. Assume amino-acid/nucleotide sequences evolve at a constant rate, with alphabet size |A| = 20 for amino acids and 4 for nucleotides. Consider two sequences x, y ∈ A^N drawn randomly over the alphabet of size |A| = 20, where N is the genome length, and let vx, vy ∈ R^(20^k) denote the k-mer frequency profiles (multiplicity of each k-mer divided by the total number of k-mers) of each sequence. For long sequences with N ≫ 20^k, the k-mer frequencies converge to their mean, that is, 20^-k for all components, which implies that ||vx − vy||1 → 0. Therefore, any k-mer profile-based method will severely underestimate the distance between these two random sequences. To avoid this, k has to be restricted to large values, k ≥ log_|A|(N). In practice, since genome sequences are not completely random, k can be slightly smaller than log4(N) without underestimating genomic distance. Ondov et al., in the Mash paper, introduced a probability term (1−q)/q to account for the probability of observing a random k-mer: k ≥ log_|A|(N(1−q)/q) 19. Assuming a probability q = 0.01, for a typical bacterial proteome, k ≥ log20(1,000,000 × 0.99/0.01) = 6.146, so a k-mer size of at least 6 should be used. For a typical bacterial genome at the nucleotide level, N is about 4×10^6, so k ≥ log4(4×10^6 × 0.99/0.01) = 14.280, and a k-mer size of at least 14 should be used. For universal genes only, N is about 350 aa × 120 = 42,000, so k ≥ log20(42,000 × 0.99/0.01) = 5.087, and a k-mer size of at least 5 should be used. Since universal genes do not evolve randomly (e.g., some regions in genes that encode key metabolic functions, such as RNA processing units, rarely mutate), k = 5 should be appropriate. To optimize between sensitivity and specificity of the k-mer size, we followed the practice suggested by Ondov et al. and Jain et al. 19,20; that is, we use k = 16 for bacterial nucleotide genome sequences and k = 7 for bacterial proteome sequences. For fungal genomes, the genome size is 10 to 20 times larger than that of bacteria, but the same formula applies, so we have k ≥ log4(20 × 4×10^6 × 0.99/0.01) = 16.441, and accordingly we use k = 21. For fungal proteomes, the coding density is much smaller than that of bacteria (we use an empirical 0.5 coding density); thus, we have k ≥ log20(10 × 1×10^6 × 0.99/0.01) = 6.914. We use k = 11 for fungal proteomes to increase the specificity of the searches. For virus/bacteriophage genomes, we use k = 11 for nt and k = 7 for aa, considering also the jumbo/giant phage genomes that were found recently.
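The lower bounds derived above can be reproduced with a one-line helper implementing k ≥ log_|A|(N(1−q)/q); the function name is ours, for illustration only:

```rust
/// Minimum k so that a random k-mer match is unlikely:
/// k >= log_|A|( N * (1 - q) / q )  (Ondov et al., Mash).
fn min_k(alphabet: f64, genome_len: f64, q: f64) -> f64 {
    (genome_len * (1.0 - q) / q).ln() / alphabet.ln()
}

fn main() {
    println!("bacterial nt:    k >= {:.3}", min_k(4.0, 4e6, 0.01)); // ≈ 14.28
    println!("bacterial aa:    k >= {:.3}", min_k(20.0, 1e6, 0.01)); // ≈ 6.15
    println!("universal genes: k >= {:.3}", min_k(20.0, 42_000.0, 0.01)); // ≈ 5.09
}
```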
Note 3. Assessment of the time complexity of ProbMinHash and HNSW. The search time complexity is estimated based on the expression (big-O notation) O(v·d·log(n)), where n is the dataset size, v is the maximum out-degree of the graph to be built, and d is the number of dimensions (k-mer features of a genome) of the dataset 21,22. In nearest neighbor search studies, the L2 (Euclidean) distance is often used, and its cost scales with the dataset dimension, which is very large for k-mer features of genomes (~10^6). In the case of the ProbMinHash distance, which is a MinHash-based method that samples the k-mer space of genomes and thus serves as a dimensionality reduction for approximating genomic distance, the number of sampled k-mers (minwise hashes) is always a constant (the sketch size, normally around 10,000), much smaller than the total number of k-mers of a genome (the number of dimensions of the dataset).
Therefore, d can be ignored (MinHash samples only a small fraction of the k-mer space/dimensions). Also, v is a small constant given the graph structure to be built (normally smaller than 100). Therefore, the search time complexity, as also explained in the main text, can be safely written as O(log(n)) for low-dimensional datasets 22. Similar reasoning applies to the time complexity of the graph build step, which is O(v·d·n·log(n)) in general and thus O(n·log(n)) in our case, where d is the dataset dimension.
The time complexity O(n + m·log(m)) of the ProbMinHash3a algorithm, where n is the set size (total number of k-mers sampled) and m is the number of weighted elements (k-mers), is dominated by the first term in the context of high-quality genomes, given that m is very small for prokaryotic and bacteriophage genomes (normally only about 5% of genomic sequences are multi-copy). Therefore, the ProbMinHash time complexity is very close to O(n) in practice. For a given genome, for example a bacterial or viral genome, n is the number of sampled k-mers of that genome, and thus also a bounded number.
Note 4. K-mer-based genomic distance estimation is less accurate for distantly related genomes. For a random mutation rate r ∈ (0,1), the probability that a k-mer (k is the length of the k-mer) belonging to sequences X and Y is not mutated is (1−r)^k, indicating that k ≲ 1/r; that is, k must be small enough to capture the mutations and also big enough to avoid underestimation (Note 2). This means that for closely related bacterial genomes (r < 0.01, for example), k can easily meet both conditions, while for distantly related genomes, for example r = 0.125 (corresponding to an ANI value of 87.5%), k ≤ 8, which contradicts k ≥ log4(N) = 14.28 (Note 2). Thus, k-mer-based methods lose accuracy for distantly related genomes. In practice, we observed that this r threshold is around 0.215 (ANI = 78.5%), because mutation is not completely random as assumed above.
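The tension between the two constraints on k can be made concrete: under the random-mutation model above, a k-mer survives unmutated with probability (1−r)^k, so at r = 0.125 a k of 16 (the nucleotide choice from Note 2) leaves only ~12% of k-mers shared, while the bound k ≲ 1/r would allow at most k = 8. A small illustration:

```rust
/// Probability that a k-mer survives unmutated under per-site rate r.
fn kmer_survival(r: f64, k: u32) -> f64 {
    (1.0 - r).powi(k as i32)
}

fn main() {
    let r = 0.125; // corresponds to ~87.5% ANI
    println!("upper bound k ≈ 1/r = {:.0}", 1.0 / r); // 8
    println!("survival at k = 8:  {:.3}", kmer_survival(r, 8)); // ≈ 0.344
    println!("survival at k = 16: {:.3}", kmer_survival(r, 16)); // ≈ 0.118
}
```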
The RMSE of the ML estimate is expected to be I(J)^(-1/2), where I denotes the Fisher information with respect to J given the cardinalities n_U and n_V. The SetSketch paper showed that this estimation error (RMSE) for J is almost the same as that of MinHash with the same number of registers m.

Figure S2. Relationship between AAI (entire proteome) and ProbMASH distance for (a) all AAI values and (b) AAI values between 0.52 and 0.95, among 2,000 genomes randomly selected from the GTDB v207 database. Both Spearman and Pearson correlation significance tests showed p < 0.01.

Figure S4. Performance of the GSearch search step. (a) Search time (blue) and speed (orange) for an increasing number of query genomes against a fixed database of 67,503 genomes, and (b) search time (blue) and speed (orange) for an increasing number of database genomes and a fixed number of 1,059 query genomes. Search speed is defined as the number of database genomes searched, on average, per query genome per second. The orange line with square data points represents Mash, whose speed was constant for any database size for 1,059 query genomes, shown for comparison.

Figure S6. Database build time for the IMG/VR phage species database (935,122 genomes) at the amino acid level (a), and total search (request) time for 10,000 query phages (b) against this database. The database size was about 15.8 GB, and thus loading the database takes a substantial fraction of the total search time.

Figure S9. Parallel efficiency and memory consumption of the GSearch tohnsw and request modules (ProbMinHash) at the nucleotide level. (a) Real-time CPU usage (left axis) and memory consumption (right axis) of the tohnsw build module for the GTDB v207 database (65,703 genomes) on a 24-thread node. (b) The same as (a) but on a 64-thread node. (c) Real-time CPU usage and memory consumption for searching 8,466 genomes against the GTDB prebuilt database from (a) on a 24-thread node. (d) Real-time CPU usage and memory consumption for searching 8,466 genomes against all NCBI/RefSeq genomes (~318K). The two phases visible in all panels are sketching and graph building/searching, respectively.

Figure S13. FastANI (y-axis) versus OrthoANI (x-axis). Results are based on a testing dataset of 3,000 genomes randomly subsampled from the large collection of genomes used in Figure S1. The MAE is 1.4%.

Table S1. Comparisons of MinHash-like and HyperLogLog algorithms. Results in boldface were obtained as part of this study (via the gsearch --algo prob/super/hll option); the remaining results are from the literature. Jp is closer to the real Jaccard index than Jw, despite both algorithms considering k-mer weights.

d Request time for USCG (min)
c The Parallel package was used to run multiple processes at the same time. FGSrs stands for FragGeneScanRs (comparisons with Prodigal can be found in Table S11). Note that in practice only those genomes failing in the request step at the nucleotide (nt) level (best match found at less than 78% ANI) are used in this amino-acid step. d Only 1,000 genomes were used for testing hmmsearch and USCG request in GSearch, because this step is for assessing the few genomes that are novel at the order level or above. The Parallel package was used to run multiple processes of hmmsearch, one thread per process.

Table S4. GSearch 3-step classification pipeline accuracy using GTDB-tk as reference for 1,000 test genomes of various degrees of novelty relative to the database genomes.

Table S5. GSearch search (request) performance on major CPU platforms using the GTDB v207 database for graph building and 1,000 query genomes. The default ProbMinHash option was used. a RHEL v7.9, Linux v3.10.0-1160, all threads used. b MacOS v12.3, Darwin 21.4.0, all threads used. c The Parallel package was used to run multiple processes at the same time. FGSrs stands for FragGeneScanRs. Note that in practice only those genomes failing at the request step for nucleotide-level search (best match found at <78% ANI) are used at this step. d Only 100 genomes were used for testing hmmsearch, because this step is for very novel (deep-branching) genomes at the order level or above, which are not very common in real-world datasets. The Parallel package was used to run multiple processes of hmmsearch, one thread per process.

Table S6. GSearch performance and accuracy for the viral database using BLAST-based AAI as the reference standard. The MASH dist command was run using 24 threads for this analysis.

Table S10. Comparison of hnswlib-rs with the C++ hnswlib using the MNIST fashion dataset (70,000 vectors). For this test, 60,000 vectors were used for building the database and 10,000 as queries, requesting the 50 best neighbors.

Table S11. Benchmark results of FGSrs against the ground truth (full version v0.0.1) and comparison with other tools. True positives are defined as the number of base pairs in a prediction on the correct strand; false positives, the number of base pairs in a prediction outside gene annotations or on the wrong strand; true negatives, the number of base pairs outside gene annotations that were not in any prediction; and false negatives, the number of base pairs inside gene annotations that were not in any prediction. Prec denotes precision, Sens sensitivity and Spec specificity. Negative Predictive Value and Matthews Correlation Coefficient are given as percentages.

Table S12. Effect of genome completeness on GSearch recall. Genomes of different degrees of completeness were obtained by randomly sampling gene sequences from the predicted gene collection of the complete genome. Hydrogenimonas urashimensis, an H2-dependent chemolithoautotrophic bacterium isolated from a deep-sea hydrothermal vent (Mino et al., 2021), was used as the query. There is no identical genome to H. urashimensis in the GTDB database.

Table S13. GSearch, Sourmash, Dashing and BinDash benchmarked against blastn-ANI at the nucleotide level. The query genome is "OceanDNA-b42278.fa" from (Nishimura and Yoshizawa, 2022). Database genomes were all NCBI/RefSeq genomes. Mash, FastANI and blastn-based ANI give the same top 10 for this query genome. For each column, matches are ranked by the respective distance, taken directly from the software output. The top 10 matches found by blastn-ANI were at about 80% to 97% ANI.

Table S15. GSearch (SetSketch) and Dashing (Ertl's joint MLE) benchmarked against blastn-ANI. The table is similar to Table S14 above but uses the SetSketch algorithm.

Note 5. Theoretical guarantee of graph-based NNS algorithms
Only recently, a theoretical analysis based on a dataset evenly distributed on a d-dimensional Euclidean sphere (d ≪ log(n), where n is the dataset size) showed that, under certain conditions, there is a guarantee that the best neighbors can be found, compared to brute-force distance comparisons 23. However, a theoretical analysis under more general conditions, e.g., other metric spaces, or d not much smaller than log(n), is still not available, to the best of our knowledge. Despite the lack of theoretical analysis under more general conditions, graph-based algorithms work well in practice, as also reflected by the large number of graph-based NNS libraries available 21,24.

Note 6. Parallelism of ProbMinHash and HNSW
Due to the ownership mechanism of Rust (so-called memory and thread safety), fearless concurrency (e.g., no data races) is made possible. The crossbeam crate was used in GSearch for communicating data among threads for task-level parallelism, while the rayon crate was used for data-level parallelism 25. Accordingly, by default, GSearch uses all available processors/threads to make full use of multi-processor CPUs.
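The data-level parallelism pattern of Note 6 (independent sketching per genome, no shared mutable state, hence no data races) can be sketched with plain std::thread; GSearch itself uses the rayon crate, and the sketch function below is a placeholder for the real ProbMinHash sketching:

```rust
use std::thread;

/// Placeholder for ProbMinHash sketching of one genome (here: its length).
fn sketch(genome: &str) -> usize {
    genome.len()
}

/// Split the genomes into chunks and sketch each chunk on its own thread;
/// results are re-assembled in input order. No shared mutable state.
fn parallel_sketch(genomes: Vec<String>, n_threads: usize) -> Vec<usize> {
    let chunk = (genomes.len() + n_threads - 1) / n_threads;
    let mut handles = Vec::new();
    for part in genomes.chunks(chunk) {
        let part: Vec<String> = part.to_vec();
        handles.push(thread::spawn(move || {
            part.iter().map(|g| sketch(g)).collect::<Vec<_>>()
        }));
    }
    handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
}

fn main() {
    let genomes: Vec<String> = (0..8).map(|i| "ACGT".repeat(i + 1)).collect();
    let sketches = parallel_sketch(genomes, 4);
    println!("{:?}", sketches); // lengths 4, 8, ..., 32 in input order
}
```

rayon's parallel iterators express the same pattern in one line (`genomes.par_iter().map(sketch).collect()`), with work stealing instead of fixed chunks.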

Details of the Joint Maximum Likelihood Estimator in the SetSketch algorithm.
For cardinalities n_U and n_V of sets U and V, estimated by the cardinality estimation section of SetSketch according to equation 12 in 26, the maximum likelihood method can be used to estimate J. The log-likelihood has the form D_+ log(p_+) + D_- log(p_-) + D_0 log(1 − p_+ − p_-), where D_+ = |{i: K_U,i > K_V,i}|, D_- = |{i: K_U,i < K_V,i}| and D_0 = |{i: K_U,i = K_V,i}| are the numbers of registers in the sketch of U that are greater than, less than, or equal to those in the sketch of V, respectively; K_U,i and K_V,i are the registers of sets U and V, the probabilities p_+ and p_- depend on J and the SetSketch parameters, and u and v are the relative cardinalities of U and V.
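Because the log-likelihood above is strictly concave in J, its maximizer can be found by any bracketing univariate optimizer. GSearch uses Brent's method from the argmin crate; a dependency-free ternary search illustrates the same idea, with a toy concave objective standing in for the real log-likelihood:

```rust
/// Maximize a strictly concave function on [lo, hi] by ternary search.
/// A stand-in for Brent's method (argmin crate) used in the JMLE step.
fn maximize(f: &dyn Fn(f64) -> f64, mut lo: f64, mut hi: f64) -> f64 {
    for _ in 0..200 {
        let m1 = lo + (hi - lo) / 3.0;
        let m2 = hi - (hi - lo) / 3.0;
        if f(m1) < f(m2) {
            lo = m1;
        } else {
            hi = m2;
        }
    }
    (lo + hi) / 2.0
}

fn main() {
    // Toy concave "log-likelihood" peaking at J = 0.3 (illustration only).
    let ll = |j: f64| -(j - 0.3_f64).powi(2);
    let j_hat = maximize(&ll, 0.0, 1.0);
    println!("J_hat ≈ {:.4}", j_hat); // ≈ 0.3000
}
```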