String Matching in DNA Databases

The recent development of next-generation sequencing has changed the way we carry out the molecular biology and genomic studies. It has allowed us to sequence a DNA (Deoxyribonucleic acid) sequence at a significantly increased base coverage, as well as at a much faster rate [

The core and the first step to take advantage of the new sequencing technology is termed as read aligning, where a read is a short nucleotide sequence of 30-1000 base pairs (bp) [3] generated by a high throughput sequencing machine made by Illumina, Roche, ABI/Life Technologies [4], which is in fact a sequence fragment fetched from a longer DNA molecule present in a sample that is fed into the machine (https://github.com/lh3/wgsim/).Most of the next-generation sequencing projects begin with a reference sequence which is a previously well studied, known genome.The process of a read aligning is to find the meaning of reads, or in other words, to determine their positions within a reference sequence, which will then be used for an effective statistical analysis.
Compared to the traditional pattern matching problems, the new challenge from the read aligning is its enormous volume, and usually millions to billions of reads need to be aligned within a same genome sequence.For example, to sequence a human molecule sample with 15X coverage, one may need to align 1.5 billion reads of length about 100 characters (bps).Another challenge is the socalled inexact matching, by which we will find all the subsequences in a genome sequence having at most k positions different from a read or with k errors.Due to the polymorphisms or mutations among individuals or even sequencing errors, a read may disagree in some positions by an occurrence in the corresponding genome.Again, due to the huge number of reads, the existing methods have to be greatly modified to fit into such a new environment.In the past decade, a new indexing structure based on the [5] to speed up the matching of massive reads against different genome sequences, based on the so-called Burrows-Wheeler transformation (BWT-transformation for short) ( [6] and also www.youtube.com/watch?v=4n7NPk5-lwbI) and trie structures [7] or automata over reads [8].In terms of the test results reported in [9], an exact read matching can be done 40% faster using our new method than any existing strategies.The algorithm has been successfully used in a biological research project conducted in an RNA laboratory at University of Manitoba [10] (http://home.cc.um anitoba.ca/~xiej/).The BWT can also be utilized to expedite the string matching with k-mismatches [11,12].However, this method cannot be easily extended to handle the massive string matching with k-mismatches or k-errors since when searching a trie or an automaton the mismatch information or the word periodicity cannot be directly employed in a BWT array scanning.

Literature Review
The matching of DNA sequences is just a special case of the general string matching problem, which has always been one of the main focuses in computer science.All the methods developed up to now can be roughly divided into two categories: exact matching and inexact matching.By the former, all the occurrences of a pattern string p in a target string s will be searched.By the latter, a best alignment between p and s (i.e., a correspondence with the highest score) is looked for in terms of a given score matrix M, which is established to indicate the relevance between characters [13] (more exactly, the meanings represented by them.)

Exact matching
Scanning-based: By this kind of algorithms, both pattern p and targets are scanned from left to right, but often with an auxiliary data structure used to speed up the search, which is typically constructed by a pre-processor.The first of them is the famous table (for p) containing the so-called shift information (or say, failure function values) to indicate how far to shift the pattern from right to left when the current character in p fails to match the current character in s.Its time complexity is bounded by O(m+n), where m=|p| and n=|s|.The Boyer & Moore [15] approach works a little bit better than the Knuth & Morris [14].In addition to the next-table, a skip-table (also for p) is kept.For a large alphabet and small pattern, the expected number of character comparisons is about n/m, and is O(m+n) in the worst case.Although these two algorithms have never been used in practice [16], they sparked a series of research on this problem, and are improved by different researchers in different ways, such as the algorithms discussed in [17][18][19][20][21].However, the worst-case time complexity remains unchanged.The idea of the 'shift information' has also been adopted indifferent approaches [8,[22][23][24] for the multiple-string matching, by which s is searched for the occurrences of any one of a set of k patterns: {p 1 ,p 2 ,…,p k }.Their algorithm needs only 1 ( ) where m i =|p i | (i=1,…,k).However, this algorithm cannot simply be adapted to an index environment due to its working fashion to search the characters in some by one, which is totally unsuitable for indexes.

Index-based:
In situations where a fixed string s is to be searched repeatedly, it is worthwhile constructing an index over s, such as suffix trees, suffix arrays, and more recently the BWTtransformation.A suffix tree is in fact a trie structure [7] over all the suffixes of s; and by using the Weiner's algorithm it can be built in O(n) time [25].However, in comparison with suffix trees, the BWT-transformation is more suitable for DNA sequences due to its small alphabet  since the smaller  is, the smaller space will be occupied by the corresponding BWT array [26].According to a survey done by Li & Homer [1] on sequence alignment algorithms for next-generation sequencing, the average space required for each character is 12-17 bytes for suffix trees while only 0.5-2 byte for the BWT.The experiments reported in [9] also confirm this distinction.For example, the file size of chromosome 1 of human is 270Mb.But its suffix tree is of 26Gb in size while its BWT needs only 390Mb-1Gb for different compression rates of auxiliary arrays, completely handle-able on PC or laptop machines.The huge size of a suffix tree may greatly increase the computation time.For example, for the Zebra fish and Rat genomes (sizes 1,464,443,456pb, and 2,909,701,677pb, respectively), one cannot finish the construction of their suffix trees within two days in a computer with 32GB RAM [27].

Hash-based:
Intrinsically, all hash-table-based algorithms [28][29][30] extract short subsequences called 'seeds' from a pattern sequence p and create a signature (a bit string) for each of them.The search of a target sequence s is similar to that of the Brute Force searching, but rather than directly comparing the pattern at successive positions in s, their respective signatures are compared.Then, stick each matching seed together to form a complete alignment.Its expected time is O(m+n), but in the worst case, which is extremely unlikely, it takes O(m.n) time.The hash technique has also been extensively used in the DNA sequence research [30,31], and all experiments show that they are generally inferior to the suffix tree and the BWT index in both running time and space requirements [27,31,32].

Inexact string matching
K-mismatches and k-errors: The inexact matching [33] ranges from the score-based to the k-mismatching [34][35][36], as well as the k-error [37].By the score-based method, a matrix M of size |||| is used to indicate the relevance between characters.The algorithm designed is to find the best alignment (or say, the alignment with the highest scores) between two given strings, which can be DNA sequences, protein sequences, or XML documents; and the dynamic programming paradigm is often utilized to solve the problem [38].By the string matching with k-mismatching or k-errors, we will find all those subsequences q of s such that d(p,q)k, where d() is a distance function.When it is the Hamming distance (defined to be the number of different positions), the problem is known as the sequence matching with k mismatches.When it is the Levenshtein distance, the problem is known as the sequence matching with k errors [8,26].By the Levenshtein distance, the distance function is defined as follows: ( ) where d i,j represents the distance between p[1.
.i] and q[1..j], p i (q j ) the ith character in p(j th character in q),  an empty character, and w(pi,qj) the cost to transform pi into qj.
There is a bunch of algorithms proposed for this problem, such as for the k-mismatch; and [39,40] for the k-error.By the best method for the k-mismatch a time complexity ( log ) O n k k can be achieved.Especially, for small k and large , the search requires a sublinear time on average.In addition, the BWT and suffix trees can also be used as indexes for this problem.For the k-error, the worst case time complexity is bounded by O(mn).But the expected time can reach O(kn) by an algorithm discussed in [33].

Don't care symbols:
As a different kind of inexact matching, the string matching with Don't-Cares (or wild-cards) [41] has also been an active research topic for decades, by which we may have wild-cards in p, in s, or in both of them.A wild card matches any character.Due to this property, the 'match' relation is no longer transitive, which precludes straightforward adaption of the shift information used by Knuth & Morris [14].All the methods proposed to solve this problem also require quadratic time [42].Using a suffix array as the index, however, the searching time can be reduced to O(logn) for some patterns, which contain only a sequence of consecutive Don't Cares [43].

Massive string pattern mapping
By the massive pattern matching, we mean to find all the occurrences of all the patterns in a target string s.The number of patterns can be millions to billions.In [9], a method is proposed to build a BWT-array for s, denoted as BWT(s), and to organize the patterns into a trie structure T, which is in fact a tree such that all those patterns with the same prefixes will be clustered together.Then, exploring a path in T, a group of patterns will be searched, instead of a single one, against BWT(s).Especially, the so-called multi-character checking is used, by which multiple characters from different patterns will be checked at each step when a segment in BWT(s) is scanned.In this way, high efficiency can be achieved.This method can be further improved by organizing all the patterns into an automaton [44], which is constructed by adding some links between nodes to trie T with each from a node v to another node u such that the string represented by the path from the root to u is the same as a suffix of the string represented by the path from the root to v.One objective of this method is to extend this kind of triebased methods to an automaton-based method.For this, two issues should be investigated: How to use links to speed up a search of BWT(s)?

B.
Whether the multi-character checking and links can be simultaneously used to speed up a searching?

Inexact matching
As mentioned in Section 2, by the inexact string matching, we distinguish between k-mismatches and k-errors, as well as the string matching with don't-care symbols.What needs to be done is to shift all the problems to an environment of massive patterns and very long targets.

String matching with k-mismatches
By the string matching with k mismatches, we mean a problem to find all the occurrences of a pattern string pin a target string s with each occurrence having up to k positions different between p and s.This problem is important for DNA databases to support the biological research, where we need to locate all the appearances of a read in a genome sequence for disease diagnosis or some other purposes.Due to possible mistakes or mutations among different individuals, the occurrence of p in target s with some differences may need to be considered.As an example, consider a target s=ccacacagaagcc, and a pattern p=aaaaacaaac.Assume that k=4.Let us see whether there is an occurrence of pwith k mismatches that starts at the third location in s.
At only four locations s and p have different characters, implying an occurrence of p starting at the third location of s.Note that the case k=0 is the extensively studied string matching problem.
An initial strategy for this problem is to build a BWT array A for s, and for a pattern p search A against p repeatedly to find all the possible string matching with k mismatches.This requires O(mn) time in the worst case, even worse than some existing on-line algorithms.To avoid redundant work done in the above process, a special tree structure, called an S-tree, needs to be utilized to record information on the mismatches between p and each path P made up of some entries of A, which are searched to find some possible string matchings with k mismatches.Concretely, each path in the S-tree can be considered as a vector V of length k+1 such that V[i]=j if and only if p[j]P[j] and it is the ith mismatch between p and P.More importantly, this kind of information can be used to derive mismatches between p and some other parts of s in terms of the mismatching information among different parts of p, instead of searching them in a normal way.Using the S-tree, the time complexity can be reduced to O(kn), where n≪n.This is better than ( log ) O n k k , the best time complexity achievable by an on- line algorithm, when . In extensive experiments have also been conducted, showing that the method with the S-tree uniformly outperforms any existing on-line approach.In a probabilistic analysis of n is given.In a next step, it will be investigated how to use the periodicity of pto speed up the computation, which has been used by the on-line algorithm discussed in [34].However, it is not clear whether the periodicity of p can also be employed in an indexing environment.So, a thorough research is required on all different indexing mechanisms: suffix trees, suffix arrays, hashing and BWT arrays to see how the periodicity of p can be integrated.

String matching with k-errors
The string matching with k-errors is quite different from the string matching with k-mismatches.Hence, the methods developed for solving k mismatches cannot be directly used for them.In fact, almost all the algorithms proposed for k errors are based on the dynamical programming paradigm, using an nm matrix to store differences between p[1..i] and s[1..j] for each pair i, j(i=1,…,m; j=1,…,n).However, for the string matching with k errors, the matrix can be created column by column and thus only O(m) space is required for the task.In addition, by using the so-called diagonal wise monotocinity of the Levenshtein distance table recognized by Ukkonen [40], many entries in the matrix needn't be generated, leading to an average time complexity O(kn) [42], which has been further improved by involving the use of mismatching statistics, and suffix trees built for patterns, as well as the partition of columns and the so-called shift-add strategy proposed by Baeza & Manber [43].However, no effort has been made to index the target string in a way like the exact matching.As for the k mismatches, a BWT array for s needs to be constructed and searched, but combined with the dynamical programming computation.Specifically, the following investigation will be conducted. A.
Check whether it is possible to create a dynamical programming matrix columnal wise during a BWT array search.

B.
Check whether the diagonal wise monotocinity of the Levenshtein distance table can be used during a BWT search.

C.
Finally, we will also try to integrate the partition of columns and the use of matching statistics into a BWT array search.