Dynamic compression schemes for graph coloring

Abstract Motivation Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata. Results We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain. Availability and implementation We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation. Supplementary information Supplementary data are available at Bioinformatics online.


B Wavelet trie updating
Internal insertion of rows is done similarly to parallel construction. The alignment step is identical, while merging proceeds by internal insertion of β j into β j instead of concatenation. When traversing down the tree, the index i j at which β j is inserted is updated using i j ← rank 1 (β j , i j ) if traversing to the right and i j ← rank 0 (β j , i j ) if traversing to the left.
As an optimization for updating bits in a computed wavelet trie operates in three steps. Suppose we are given a set of entries {A i 1 j 1 , . . . , A i j } for the wavelet trie compressing A of size n × m. Then 1. Insert Query the rows {A i 1 , . . . , A i } and flip the appropriate bits in their decompressed forms. Concatenate these rows to produce a new matrix A of size (n + ) × m. 2. Swap For each i k , swap the rows (A i k , A n+k ) by swapping bits in the βs. 3. Remove Remove the rows {A n , . . . , A n+ } to form the final matrix A of size n × m.

C Data used for evaluation
The datasets used to evaluate the performance of our compression schemes originate either from viruses (Virus100-Virus50000), bacteria (Lactobacillus) or human (chr22+gnomAD and hg19+gnomAD) and are chosen to test the methods on different color distributions, annotation matrix sizes and densities. They further reflect varying graph topologies and allow us to study the effect of topology-informed compression in a robust testing bed. We construct de Bruijn graphs of order k = 63 for each dataset and compare the compression performance of all methods by measuring the compression ratio for each dataset, defined as the ratio of the number of bits in an annotation matrix and the number of bits in its respective compressed representation.

C.1 Virus Datasets
The virus datasets are generated from publicly available GenBank (Clark et al., 2016) complete genome data. The Virus50000 dataset consists of the set of 53,412 virus strains present in GenBank on September 30th, 2016. The Virus1000, Virus3000 and Virus20000 datasets are randomly selected subsets of the Virus50000 set. The Virus100 dataset is a subset consisting of the first 100 genomes in Virus1000. On these datasets, the colors are defined to represent the membership to each genome ID, while the class indicator bits are defined by each genome's taxonomic genera. The rows of these annotation matrices are very sparse and present degenerate cases with respect to compressability by wavelet tries.

C.1.1 Virus50000
This dataset consists of 53,412 viral genome sequences downloaded from GenBank via the eUtils API on 09/30/2016. The search term applied was txid10239 [orgn] AND "complete genome" Fig. S-1: Merging of wavelet tries T 1 and T 2 to form the wavelet trie T 1 + T 2 . Starting from the root node, the common prefix of the two α vectors is found and new β vectors are computed from their respective next significant bit after the common prefix. These become new parent nodes and the initial nodes' α vectors are updated to their respective remainders after removing the common prefix (e.g., the conversion from from T 1 to T 1 ). When the two αs are equal, their respective βs are concatenated and the merging function is applied to their children. When a leaf is reached in one tree, but the equivalent node in the other tree is internal (e.g., the left child of the root in T 1 + T 2 ), the leaf is merged by appending or prepending additional zeros to the β vectors of all left ancestors. Note that extra leaf nodes producing trailing zeros in the decoded bit vectors are added during the merging process. See Section 2.3.1 for more details.
[Title] NOT txid131567 [orgn]. The data represents a set of sequences with both high variability to each other but also good pairwise conservation for sub-species. It reflects a wide range of variability and provides a good testing bed for the application within a colored de Bruijn graph setting. The graph corresponding to this dataset contains 622,587,315 nodes, 625,110,390 edges, and 1,359,843 unique edge colorings.
C.1.2 Virus1000, Virus3000, Virus20000 The Virus1000 dataset is composed of 1000 virus genomes randomly selected from the Virus50000 set, meant to study a graph whose topology is a series of almost mutually-exclusive loops with slight variation. The columns in this graph's annotation matrix indicate presence of edges in each of the virus genomes. Similar to the Lactobacillus dataset, the viruses were grouped by the first word of their names and the first species in each group was assigned as a backbone path. The resulting annotation bit matrix is very sparse and adjacent rows are either almost identical or almost mutually exclusive. This graph contains 30,310,634 unique k-mers, 30,347,373 edges, and 11,612 unique edge colorings. The Virus3000 and Virus20000 datasets are supersets of Virus1000. The Virus3000 graph contains 82,418,835 unique k-mers, 82,579,519 edges, and 52,187 unique edge colorings. The Virus20000 graph contains 357,552,076 unique k-mers, 358,683,520 edges, and 537,344 unique edge colorings.

C.1.3 Virus100
This is a subset of the Virus1000 set containing only 100 virus strains used to facilitate the permutation tests in Section 3.1.1. This graph contains 2,954,719 unique k-mers, 2,956,113 edges, and 463 unique edge colorings.

C.1.4 Compressor update virus datasets
These datasets consist in subsets of the Virus50000 dataset of varying sizes, but fixed number of columns. Given a number of columns C, 50 different contiguous subsets of C viruses are selected uniformly (i.e., a sliding window). Graphs and compressed annotations are then constructed for these sets and a randomly-selected virus genome not in any of the windows (gi|1000033404|gb|KP975430.1|) was selected to update the graph. Values of C ∈ {200, 500, 2000} were used.

C.2 Details for Lactobacillus
This dataset is composed of 135 strains of bacteria in GenBank Clark et al. (2016) from the Lactobacillus genus, leading to a linear topology in the graphs with many shorter paths (bubbles) diverging from and reconnecting to the main reference genome paths. The colors are defined to represent the membership to genome IDs, while the class indicator bits are defined by each genome's species. The columns in this graph's annotation matrix indicate presence of an edge in each of the strains. Because of the low variability in the input sequences, they are represented as a graph with a predominantly linear topology and short variant paths (called bubbles). One genome from each of the species was chosen as a backbone path. The resulting graph had 134,951,429 unique k-mers, 135,369,397 edges, and 6,630 unique edge colorings (i.e., color combinations). See Supplementary Section D for a list of the bacterial strains used.

C.3 Human Data
For the human datasets, the hg19 reference assembly is used as the main reference backbone, together with exome variants from the gnomAD dataset (Consortium et al., 2016). The hg19+gnomAD dataset consists of the human autosomal portion of this data and chr22+gnomAD dataset consists of chromosome 22 only. On these datasets, the colors are defined to represent the membership to the reference chromosomes and the ethnic groups present in the gnomAD data. The colors corresponding to reference chromosomes are designated as the class indicators without adding additional columns (i.e., the sequence variant edges are additionally colored by their corresponding reference chromosome colors).

C.3.1 chr22+gnomAD
This graph consists of chromosome 22 from the hg19 assembly of the human reference genome as the main reference backbone. To provide genetic variability, the set of exome variants from the gnomAD dataset were incorporated into the graph Lek et al. (2016). This larger dataset is meant to analyze the properties of the trie when the underlying graph is large, but with little variability. The columns in this graph's annotation matrix are defined to represent the membership of its edges to 9 ethnic groups defined in the dataset. The first column in the matrix is used to indicate edges which are present in the reference genome and serves as the backbone bit. The graph contains 178,196,890 unique k-mers, 180,023,641 edges, and 510 unique edge colorings.

C.3.2 hg19+gnomAD
This graph was constructed from the same datasets as the one described above, using data from the full human autosome. The same definition is used for the annotation matrix columns, with 9 columns being used to indicate edges observed in the defined ethnic groups and 22 prefix columns being used to indicate presence in the first 22 reference chromosomes as the backbone bits. This graph's topology was designed to be analogous to the Virus1000 dataset, but with 1000× the number of rows and one-tenth of the number of annotation columns. It contains 5,714,136,751 unique k-mers, 5,728,489,633 edges, and 380,051 unique edge colorings.

E List of viral strains used
See the GitHub repository for a list of viral strains used.  Column-order effect on the Virus100 and Lactobacillus datasets Shown are distributions of the file sizes of wavelet tries over 100 random permutations of the Virus100 and Lactobacillus annotation matrix column orders. The red dashed lines link the points corresponding to the original column orderings in the annotation matrices. Adding class indicator (CI) bits as the matrix prefices in both datasets leads to decreases in the sizes of the compressed files, with the original ordering being optimal when CI bits are set as the initial columns. The original orderings are not optimal when CI bits are not set. In both datasets, the CDFs are not disjoint, so there exist permutations of the columns with CI bits that exhibit worse compression ratios than some permutations without CI bits set.     RAM usage values for the graph and annotation compressor objects are reported separately. In the plots, each data point represents a mean value across the genomes of each virus collection of a given size (e.g., six draws of the Virus100 dataset are averaged in both axes and represented by one point). Error bars in both axes represent standard deviation. To reduce RAM usage to fit with our computing system's 400GB job RAM usage limit, the wavelet trie for the Virus50000 dataset with class indicator bits was computing in two steps, first computing the graph, then loading it in chunks to compress the annotation. Bloom filter runtimes are for a single thread, while wavelet trie runtimes are for ten threads. *: performed with a succinct graph representation. † : Constructed with an alternate method in which the graph was unloaded to disk, then read in chunks for wavelet trie construction.