Fast and compact matching statistics analytics

Abstract Motivation Fast, lightweight methods for comparing the sequence of ever larger assembled genomes from ever growing databases are increasingly needed in the era of accurate long reads and pan-genome initiatives. Matching statistics is a popular method for computing whole-genome phylogenies and for detecting structural rearrangements between two genomes, since it is amenable to fast implementations that require a minimal setup of data structures. However, current implementations use a single core, take too much memory to represent the result, and do not provide efficient ways to analyze the output in order to explore local similarities between the sequences. Results We develop practical tools for computing matching statistics between large-scale strings, and for analyzing its values, faster and using less memory than the state-of-the-art. Specifically, we design a parallel algorithm for shared-memory machines that computes matching statistics 30 times faster with 48 cores in the cases that are most difficult to parallelize. We design a lossy compression scheme that shrinks the matching statistics array to a bitvector that takes from 0.8 to 0.2 bits per character, depending on the dataset and on the value of a threshold, and that achieves 0.04 bits per character in some variants. And we provide efficient implementations of range-maximum and range-sum queries that take a few tens of milliseconds while operating on our compact representations, and that allow computing key local statistics about the similarity between two strings. Our toolkit makes construction, storage and analysis of matching statistics arrays practical for multiple pairs of the largest genomes available today, possibly enabling new applications in comparative genomics. Availability and implementation Our C/C++ code is available at https://github.com/odenas/indexed_ms under GPL-3.0. The data underlying this article are available in NCBI Genome at https://www.ncbi.nlm.nih.gov/genome and in the International Genome Sample Resource (IGSR) at https://www.internationalgenome.org. Supplementary information Supplementary data are available at Bioinformatics online.


Compressing frequency and position arrays
Given a position i of the query S, knowing the frequency of the matching statistics string S[i..i + MS S,T [i] − 1] in the text T can be useful in genome-genome and read-genome comparison, since it can tell for example whether the longest match belongs to an exact repeat of T (see e.g. Figure 22 in the supplement). One might also want to keep the exact values of MS just for the positions with low frequency, i.e. one might want to compute a frequency thresholded MS array in which every window of ms between two lowfrequency one-bits can be permuted to achieve compression. We define the frequency array . We can detect the latter case since we know the frequency of the current string after every backward step. Consider now the first few operations of the second scan of S (from left to right): we managed to match some prefix of S, and we are now witnessing a Weiner link fail from some node v of ST, thus we move to the parent u of v. Node u must have a different frequency in T than v, so we know that the first parent operation we take leads to the position i of the first zero-bit from the left in the runs bitvector. We can measure f (u) with the topology, and we know that . We repeat this process after every parent operation. At some point the Weiner link succeeds, so we derive the updated frequency from the BWT and we restart the whole process. This algorithm can be parallelized using the same methods as in Section 3.
When S and T are similar, most matches are long and the values of F are more likely to be small or equal to one (and vice versa when S and T are dissimilar), thus delta-coding F achieves better compression when S and T are similar. Moreover, assume that most edges of ST T are labelled by long strings. If S and T are dissimilar, they mostly have short matches, the loci of such short matches have low tree depth in ST T , and they do not create long runs in F. If S and T are similar, they have many long matches, the loci of such long matches are more likely to have large tree depth in ST, and this can induce long runs in F. If the edges of ST have short labels, there is little to be gained in having S similar to T . In practice, for pairs of genomes from different species, representing F as a delta coding of its runs produces bigger files than just delta-coding every entry of F in isolation; but for pairs of human individuals, the run-length encoded F is about 15 times smaller (data not shown for brevity). Note that the values in the run-length encoded F can be accessed easily using the runs bitvector.
We conclude this section by mentioning one more array that is easy to compress. Let can be reconstructed by recurring on P[j − 1] + 1. Statistical properties of informative positions in random sequences were described by [5]. In practice, when comparing genomes from different species, approximately half of all positions are informative; this fraction ranges from 0.7 to 0.2 in proteomes, whereas in human individuals it becomes smaller than 0.01 (see Figure 5 in the supplement). Note also that, in the rangemax queries described in Section 5 of the main text, the maximum of each block is assumed by an informative position, thus it suffices to consider just those during construction.
One can imagine other position arrays that might be useful in genome comparison, for example an array that stores zero if a match occurs in distinct chromosomes of the text, or the identifier of the only chromosome that contains the match otherwise.

Acknowledgements
We thank Giorgio Vinciguerra for help with the software in [2], and Massimiliano Rossi and Dominik Koeppl for help with the software in [3].           Sizes are measured on disk, and refer to the RLEVector data structure by [7].        Figure 20: Scanning the ms bitvector in range-max queries: ratio between the time of the baseline approach that accesses every bit of the bitvector, and the time of our optimized code. Red: bit vector data structure from SDSL [4]. Green: RLEVector data structure from [7]. Left: chromosome 1 of two human individuals. Right: Homo sapiens and Mus musculus. Every point is the average of 50 random queries of the same size. Figure 21: Scanning the ms bitvector in range-sum queries: time of the baseline approach that accesses every bit of the bitvector (dotted lines), and time of our optimized code (continuous lines). Red: bit vector data structure from SDSL [4]. Green: RLEVector data structure from [7]. Blue: rrr vector from SDSL. Left: chromosome 1 of two human individuals. Right: Homo sapiens and Mus musculus. Every point is the average of 50 random queries of the same size.