Published October 28, 2022 | Version v1
Presentation Open

Approximate Pattern Matching Using Search Schemes and In-Text Verification

Description

Approximate pattern matching entails finding approximate occurrences of a search pattern P in a search text T. In a typical bioinformatics scenario, P is a DNA fragment (a read) and T is a reference genome or a collection of reference genomes. Search schemes are an efficient computational technique that is guaranteed to identify all occurrences within a pre-specified number of allowed errors (substitutions and indels). Using a bidirectional FM-index, they describe how to traverse the search space in such a way that runtime is minimized. Despite their optimal time complexity, basic operations on the FM-index require expensive random memory access.
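To make the in-index side concrete, the following is a minimal sketch of exact backward search on a toy FM-index. It is not Columba's implementation: it is unidirectional, handles no errors, and builds the suffix array naively. The names `build_fm_index` and `backward_search` are hypothetical and chosen for illustration. Each pattern character triggers two lookups in the occurrence table at unrelated positions, which is the random memory access pattern mentioned above.

```python
def build_fm_index(text):
    """Build a toy FM-index (illustrative only; naive suffix sorting)."""
    text += "$"  # sentinel, sorts before A, C, G, T in ASCII
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" for i == 0)
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # counts[c] = number of characters in text strictly smaller than c
    counts, total = {}, 0
    for c in alphabet:
        counts[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, counts, occ


def backward_search(pattern, sa, counts, occ):
    """Return sorted start positions of exact occurrences of pattern."""
    lo, hi = 0, len(sa)  # half-open interval over the suffix array
    for ch in reversed(pattern):
        if ch not in counts:
            return []
        # Two occ lookups per character, at positions lo and hi
        lo = counts[ch] + occ[ch][lo]
        hi = counts[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    sa, counts, occ = build_fm_index("ACGTACGTAC")
    print(backward_search("ACG", sa, counts, occ))
```

A search scheme extends this idea by allowing a bounded number of substitutions and indels while extending the match in both directions, which requires a bidirectional index rather than the one-directional sketch shown here.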

We examine whether in-text verification, in which a candidate occurrence is validated in the search text via a bit-parallel pairwise alignment approach, can be used to supplement in-index matching using search schemes. In comparison to pure in-index matching, we find that hybrid in-index/in-text matching can reduce running time by more than a factor of two. 
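As an illustration of bit-parallel verification, here is a minimal sketch of Myers' bit-vector algorithm for approximate matching, using Python integers as bit vectors. The abstract does not specify which bit-parallel method Columba uses, so this choice is an assumption, and the function name `bit_parallel_search` is hypothetical. It scans a text window and reports every end position where the pattern matches within `k` edits.

```python
def bit_parallel_search(pattern, text, k):
    """Report (end_index, edit_distance) for each text position where
    pattern matches a substring ending there with at most k edits.
    Sketch of Myers' bit-vector algorithm; not taken from Columba."""
    m = len(pattern)
    if m == 0:
        return []
    mask = (1 << m) - 1      # keep only the m low bits
    high = 1 << (m - 1)      # bit of the last pattern row
    # peq[c] has bit i set when pattern[i] == c
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv = mask, 0         # vertical +1 / -1 deltas of the DP column
    score = m                # DP value in the last row
    hits = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 deltas
        mh = pv & xh                    # horizontal -1 deltas
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shift without setting the low bit: a match may start anywhere
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= k:
            hits.append((j, score))
    return hits


if __name__ == "__main__":
    print(bit_parallel_search("ACGT", "TTACGTTT", 1))
```

In a hybrid approach, a candidate found partway through the index traversal would be located in the text and the remaining comparison handed to a routine like this one, which reads contiguous memory instead of jumping through the index.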

We introduce Columba 1.1, an open-source C++ software program that efficiently implements these ideas. Columba 1.1 can locate, within a maximum edit distance of 4, all occurrences of 100,000 reads (150 bp) in the human reference genome in about 30 seconds. This outperforms existing, state-of-the-art lossless alignment tools such as Bwolo, Yara and GEM and is comparable to the lossy aligner BWA in mem mode.

Files

fears_2022_pitch.pdf (75.9 kB)
md5:3aa85f31218543a63a7b7f6954164f95