On the Average-case Complexity of Pattern Matching with Wildcards

In this paper we present a number of fast average-case algorithms for pattern matching with wildcards. We consider the problems where wildcards are restricted to either the pattern or the text; however, the results can be easily adapted to the case where wildcards are allowed in both. We analyse the algorithms' average-case complexity and their expected-case complexity, and show new lower bounds for the average-case complexity of the problems. To the best of our knowledge these are the first results on the average-case complexity of wildcard matching.


Introduction
String matching with wildcards is a string matching problem where the alphabet consists of standard letters, which only match themselves, and a special letter φ, the wildcard, which matches every character in the alphabet. Given a text of length n and a pattern of length m < n, the problem consists of finding all factors of the text that match the pattern. There exist a number of variants of this problem: the number of wildcards may be bounded, wildcards may be restricted to one string, wildcards may match more than one symbol, or wildcards may be optional.
An early and important result in pattern matching with wildcards was the FFT (Fast Fourier Transform) based algorithm of Fischer and Paterson [8] with running time O(n log m log σ), where σ is the size of the alphabet. This algorithm was the first to exploit the similarities between string matching and integer multiplication. Later a study by Pinter [19] outlined a number of difficulties in designing wildcard algorithms; he illustrates the problem of intransitivity, which prevents algorithms such as KMP being used or easily modified when wildcards are present. After Fischer and Paterson much work focused on improving the algorithms by removing the dependency on the alphabet size, with randomized O(n log n) and O(n log m) solutions being proposed by [14] and [15] respectively. Later, deterministic O(n log m) solutions were proposed, first by Cole and Hariharan [6] and then simplified by Clifford and Clifford [4]. All of these algorithms make use of the theoretically fast computation of the FFT; in fact, this is the only technique known for theoretically fast algorithms for pattern matching with wildcards.
The indexing version of the problem has also been studied by various groups. In [13] Iliopoulos and Rahman presented an index supporting queries in $O(m + \alpha)$ time where wildcards only occur in the pattern, and in $O(m + \alpha \log\log n)$ time in the case of optional wildcards in the pattern. Understanding the time complexity requires us to recall their notation: denote the pattern as $p = p_0 \phi^{g_0} p_1 \phi^{g_1} \ldots \phi^{g_{h-1}} p_h$, where $\phi^{g_i}$ represents a group of consecutive wildcards and $p_i \in \Sigma^+$; $\alpha$ is then defined as the sum of the number of matches of each $p_i$ in $t$ for $0 \le i < h$. A shortcoming of this approach is that in the worst case $\alpha$ may be $\Theta(hn)$. In [5] Cole et al presented an index which, given a text with k wildcards and an integer d, allows searching for any pattern with at most d wildcards. For a pattern containing $g \le d$ wildcards, the matching takes $O(m + 2^g \log^k n \log\log n + occ)$ time; when wildcards are restricted to either the pattern or the text the query time becomes $O(m + 2^g \log\log n + occ)$ and $O(m + \log^k n \log\log n + occ)$ respectively. A drawback of the index of Cole et al is that once the index has been built it can only be used to search for patterns with at most d wildcards. Very recently a number of indexes were presented by Bille et al [3], who give a linear space index with query time $O(m + \sigma^g \log\log n + occ)$ and a linear query time index with space complexity $O(\sigma^{k^2} n \log^k \log n)$. These are then modified for variable length wildcards by reducing the problem to searching with optional wildcards. In the area of succinct indexes for strings with wildcards, Lam et al [17] have presented indexes for a number of problems shown in Table 1, where $p_i$ is the same as defined above, $t_i$ is analogously defined for a text t of length n, $occ(u, v)$ denotes the number of occurrences of u in v, $\gamma = \sum_{j=1}^{\ell+1} occ(t_j, p)$, h is the number of groups of consecutive wildcards in the text, g is the total number of wildcards in the text, and $\beta = \min_{1 \le i \le h+1} \{occ(p_i, t)\}$:

Problem                                       Query Time
Wildcards in the text                         $O(m \log n + \gamma + occ)$
Wildcards in the pattern                      $O(m + h\beta)$
Wildcards in the text and pattern             $O(m \log n + h\beta + \gamma + occ)$
Optional wildcards in the text                $O(m^2 \log n + m \log^2 n + \gamma \log n + occ)$
Optional wildcards in the pattern             $O(m + gh\beta)$
Optional wildcards in the text and pattern    $O(m^2 \log n + m \log^2 n + gh\beta + \gamma \log n + occ)$

Table 1. Properties of the indexes presented in [17]

The space for all of the above is O(n). Other succinct indexes have been presented in [20] with a space usage of $(2 + o(1)) n \log\sigma + O(n) + O(d \log n) + O(k \log k)$ bits for a text containing d groups of k wildcards in total; this index is based on an augmented compressed suffix array. The authors of [12] propose a compressed index for the case where wildcards can only occur in the text.

The rest of this article is structured as follows. In Section 2 we present preliminaries and problem definitions. In Section 3 we present algorithms for the case where wildcards are only allowed in the text, and then for the case where wildcards are only allowed in the pattern. In Section 4 we give a general lower bound, and in Section 5 we give concluding remarks and discuss future work.

Preliminaries
An alphabet Σ is a finite non-empty set, of size σ = O(1), whose elements are called letters. A string on an alphabet Σ is a finite, possibly empty, sequence of elements of Σ. The zero-letter sequence is called the empty string, and is denoted by ε. The length of a string x is defined as the length of the sequence associated with the string x, and is denoted by |x|. We denote by x[i], for all 0 ≤ i < |x|, the letter at index i of x. Each index i, for all 0 ≤ i < |x|, is a position in x when x ≠ ε. It follows that the i-th letter of x is the letter at position i − 1 in x, and that x = x[0 . . |x| − 1]. A string x is a factor of a string y if there exist two strings u and v, such that y = uxv. Consider the strings x, y, u, and v, such that y = uxv. If u = ε, then x is a prefix of y. If v = ε, then x is a suffix of y.
A wildcard letter is a special letter that does not belong to alphabet Σ, and matches with itself as well as with any letter of Σ; it is denoted by φ. Two letters a and b of alphabet Σ ∪ {φ} are said to correspond (denoted by a ≈ b) if they are equal or at least one of them is the wildcard letter.
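The correspondence relation can be made concrete with a small sketch. The names `WILDCARD`, `corresponds` and `occurrences` are hypothetical and introduced only for illustration, with "?" standing in for φ; the quadratic scan is a naive baseline, not one of the algorithms analysed in this paper.

```python
WILDCARD = "?"  # hypothetical stand-in for the wildcard letter φ


def corresponds(a: str, b: str) -> bool:
    """Two letters correspond if they are equal or at least one is the wildcard."""
    return a == b or a == WILDCARD or b == WILDCARD


def occurrences(pattern: str, text: str) -> list[int]:
    """Naive O(nm) scan: starting positions i where pattern corresponds to text[i . . i+m-1]."""
    m, n = len(pattern), len(text)
    return [
        i
        for i in range(n - m + 1)
        if all(corresponds(pattern[j], text[i + j]) for j in range(m))
    ]


# Correspondence is not transitive: "a" ≈ "?" and "?" ≈ "b", but "a" and "b" differ.
assert corresponds("a", WILDCARD) and corresponds(WILDCARD, "b")
assert not corresponds("a", "b")
```

The final two assertions illustrate the intransitivity that prevents a direct reuse of classical techniques such as KMP.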
Let x be a non-empty string and y be a string. We say that there exists an occurrence of x in y or, more simply, that x occurs in y when x is a factor of y. Every occurrence of x can be characterised by a position in y. Thus we say that x occurs at the starting position i in y when y[i . . i + |x| − 1] = x. It is sometimes more suitable to consider the ending position i + |x| − 1. Any logarithm is base 2 unless explicitly stated otherwise. In this paper the problems we consider are the following.

Algorithms
In the following section we present a number of average-case algorithms for pattern matching in the presence of wildcards. All of the algorithms follow the same scheme but have different details; the scheme is outlined below.
-Build a dictionary of the pattern for all the factors of length r over either Σ ∪ {φ} or Σ, depending on the problem being considered.
-Create a sliding window of size m, then for each window do the following.
  -Check the suffix of size r; if it matches any factor of the pattern, run an O(m log m) algorithm for wildcard matching on a factor of size 2m.

For the rest of the article we assume that the text t is of length n and is random and uniformly drawn from either Σ or Σ ∪ {φ}, depending on the problem being considered. For each problem we design an algorithm according to the above scheme and analyse it in a number of settings. We determine the average-case complexity of the algorithm for an arbitrary pattern and then under the assumption that the pattern is also random and uniformly drawn from either Σ or Σ ∪ {φ}.
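The scheme outlined at the start of this section can be sketched as follows. This is a minimal illustration under several assumptions that are not in the text: the function names are hypothetical, "?" stands in for φ, the dictionary is a Python set built by brute-force enumeration, a naive verifier replaces the FFT-based algorithm, and the window always advances by m − r + 1 so that no starting position is skipped.

```python
from itertools import product

WILDCARD = "?"  # hypothetical stand-in for φ


def strings_correspond(x, y):
    """True if x and y have equal length and correspond letter by letter."""
    return len(x) == len(y) and all(
        a == b or a == WILDCARD or b == WILDCARD for a, b in zip(x, y)
    )


def build_dictionary(pattern, r, text_alphabet):
    """All length-r strings over the text alphabet that correspond to some pattern factor."""
    factors = {pattern[i:i + r] for i in range(len(pattern) - r + 1)}
    return {
        "".join(w)
        for w in product(text_alphabet, repeat=r)
        if any(strings_correspond(w, f) for f in factors)
    }


def filter_and_verify(text, pattern, r, text_alphabet):
    """Sliding-window filter: verify a window only if its length-r suffix is in the dictionary."""
    n, m = len(text), len(pattern)
    assert 1 <= r <= m
    dictionary = build_dictionary(pattern, r, text_alphabet)
    found = []
    s = 0  # start of the current window text[s . . s+m-1]
    while s + m <= n:
        if text[s + m - r:s + m] in dictionary:
            # Every occurrence starting in [s, s+m-r] covers the checked suffix.
            for q in range(s, min(s + m - r, n - m) + 1):
                if strings_correspond(text[q:q + m], pattern):
                    found.append(q)
        s += m - r + 1  # starts in [s, s+m-r] are now settled
    return found
```

If the suffix is not in the dictionary, no occurrence can cover it, which is what allows the window to move on after reading only r characters.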
We consider both the case where the pattern is random and the case where it is not, as we believe each provides some insight into the nature of these problems; additionally, both assumptions are used in the literature. For example, an arbitrary pattern is used in [7,21,16]; however, in later work such as [9,11,2,10,1,18] the assumption of a random pattern is made. Perhaps the first notion of average-case optimality for string algorithms is in [16], where Knuth conjectured his algorithm was average-case optimal in the following sense: 'Patterns of length m exist for arbitrarily large m, such that an average of at least cn(log m)/m bits must be inspected for all large n'.
Yao [21] then proved a stronger statement: 'We have demonstrated that, for almost any pattern of length m, and a random text of length n we must, on average examine $\log_\sigma m$ characters'.
Note that Yao's result and Knuth's conjecture do not require that the pattern be random; Yao shows the existence of a set of patterns for which string matching takes $O(n \log_\sigma m / m)$ on average in a random text. It has since become customary to analyse algorithms as if both the pattern and text are random; however, as we can see, this was not always the case. In [10] the algorithm for multiple pattern matching was analysed as if the patterns are random; the same algorithm was later used in a case where the patterns are not random, and it was shown that the original analysis of the problem considered is still valid, although under different, stricter, conditions on the variables involved. The difference between the two notions is that for arbitrary patterns the complexity must hold for any given pattern, whereas for a random pattern the average is over all patterns, so patterns may exist which perform significantly worse than the complexity suggests. For this reason we use average-case complexity to refer to the case where the pattern is not random, and expected-case complexity to refer to the case where the pattern is also random.
The algorithm sketched at the beginning of this section has the property that, for each window of the text considered, characters are inspected in a fixed order dependent only on m; this is a ubiquitous property in string algorithms. There is some existing work on the effect the order of inspection has on the performance of algorithms. In [21] it is shown that, for classical string matching, inspecting the text in a fixed order prevents any such algorithm being optimal when m and n are sufficiently close. In this article we also explore the effect that this property can have on the performance of algorithms, and refer to algorithms which examine a sliding window in a fixed order as fixed algorithms. For the purpose of showing lower bounds in this article we consider a simplified model of computation for fixed algorithms and define them as follows.
Definition 3 Consider t partitioned into non-overlapping sections of size 2m. Let $(i_1, i_2, \ldots, i_{2m})$ be an arbitrary but fixed permutation of $(0, 1, \ldots, 2m - 1)$. An algorithm is fixed with respect to $(i_1, i_2, \ldots, i_{2m})$ if for every section characters are inspected in the order specified by $(i_1, i_2, \ldots, i_{2m})$. The problem is then to find all factors of t that correspond to p and are contained entirely within a section.
We only consider occurrences contained entirely within a section to simplify the analysis. Clearly this gives a lower bound for the procedure that finds all occurrences by examining the characters of a sliding window in a fixed order. We show a tight upper bound on the best performance for these types of algorithms, and for non-fixed algorithms we show a lower bound which we only know is tight for large g. Additionally we show that a greedy inspection scheme is optimal. We consider the following simplified definition of non-fixed algorithms.
Definition 4 Consider t partitioned into non-overlapping sections of size 2m. An algorithm is non-fixed if characters of t can be inspected in any order. The problem is then to find all factors of t that correspond to p and are contained entirely within a section.
Our Contribution: In this paper, we present fast average-case algorithms for pattern matching with wildcards and explore the models and assumptions discussed above and establish lower bounds. We present fixed algorithms with average-case search time O(

Wildcards in Text Only
We begin by considering the case where wildcards may appear only in the text. We refer to the algorithm of this section as algorithm Wt. First we establish, with the lemma below, the size r of the suffix checked for each window of the text and the probability that it matches.

Lemma 5 The probability of a random string of size $2\log_{\frac{\sigma+1}{2}} m$ over Σ ∪ {φ} matching in a string of size m over Σ is at most $1/m$.

We now describe how the dictionary of factors will be built. For this dictionary we index all the factors of the text we may read and determine if they correspond to a factor of the pattern. To do this we generate all factors of length $2\log_{\frac{\sigma+1}{2}} m$ over the alphabet of the text and determine if they correspond to any factor in the pattern. By building a dictionary in this way we guarantee $O(\log_\sigma m)$ search time in the worst case.

We start by declaring a binary array of size $(\sigma + 1)^{2\log_{\frac{\sigma+1}{2}} m}$. For each generated string, if there is a factor of the pattern that corresponds to it, we encode the generated string as a decimal number and set this index in the array to 1. To then search for a string of size $2\log_{\frac{\sigma+1}{2}} m$, we compute the numerical representation of the string and check if the value at this index in the array is set to 1. We can see that this is a polynomial preprocessing scheme, as the maximum exponent occurs when σ = 2, and in this case $3^{2\log_{1.5} m} \approx m^{5.42}$, so the total preprocessing is $O(m^{6.42} \log m)$, although for larger alphabets this exponent is greatly reduced. Following the general scheme we described at the start of Section 3, we set $r = 2\log_{\frac{\sigma+1}{2}} m$ and use the dictionary specified above. From this we get the following.

In this section we gave a lower bound and an average-case optimal algorithm for wildcard matching when wildcards occur only in the text. In this situation we only require polynomial preprocessing, and the lower bound on searching matches that of exact string matching within a constant factor. In this case the assumption that the pattern is randomly drawn from Σ does not alter the result.
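The array-based dictionary described in this section could be sketched as below. This is an illustration under assumptions: the names `encode`, `build_table`, `lookup` and `table_exponent` are hypothetical, "?" stands in for φ, strings are encoded as base-(σ + 1) integers, and r is passed in directly rather than derived from m.

```python
import math
from itertools import product

WILDCARD = "?"  # hypothetical stand-in for φ


def encode(s, alphabet):
    """Read s as a number in base len(alphabet); this is its index in the table."""
    rank = {c: i for i, c in enumerate(alphabet)}
    value = 0
    for c in s:
        value = value * len(alphabet) + rank[c]
    return value


def build_table(pattern, r, sigma_letters):
    """Binary array of size (sigma + 1)^r marking strings that correspond to a pattern factor."""
    alphabet = sigma_letters + WILDCARD  # text alphabet: Σ plus the wildcard
    factors = {pattern[i:i + r] for i in range(len(pattern) - r + 1)}
    table = bytearray(len(alphabet) ** r)
    for w in product(alphabet, repeat=r):
        # The pattern has no wildcards here, so only letters of w can be wildcards.
        if any(all(a == b or a == WILDCARD for a, b in zip(w, f)) for f in factors):
            table[encode(w, alphabet)] = 1
    return table, alphabet


def lookup(table, alphabet, s):
    """Membership test: one pass over s to compute its index, then one array access."""
    return table[encode(s, alphabet)] == 1


def table_exponent(sigma):
    """Exponent e such that (sigma + 1)^(2 log_{(sigma+1)/2} m) = m^e."""
    return 2 * math.log(sigma + 1) / math.log((sigma + 1) / 2)
```

For σ = 2, `table_exponent` evaluates to roughly 5.42, in line with the exponent quoted in the text, and it decreases as the alphabet grows.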

Wildcards in Pattern Only
Now we consider the case where wildcards only appear in the pattern. First we consider the average-case time taken to search for a pattern with wildcards in a random string without wildcards. In this situation we can pick any pattern, so for some pattern p we denote the number of wildcards in the pattern by g. Then we consider the average-case complexity where both the pattern and text are random strings. We refer to the algorithm of this section as Wp. We now establish the size of r in the lemma below.

Lemma 8
The probability of a random string of size $g + 2\log_\sigma m$ over Σ matching in a string of size m over Σ ∪ {φ} with g wildcards is no more than $1/m$.
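Lemma 8 can be checked exhaustively on small instances. The sketch below is illustrative only: the function names are hypothetical, "?" stands in for φ, r is rounded up to an integer, and the enumeration is exponential in r, so it is only usable for small m and σ.

```python
from itertools import product

WILDCARD = "?"  # hypothetical stand-in for φ


def factor_length(m, g, sigma):
    """Smallest integer r >= g + 2 log_sigma(m), computed without floating point."""
    extra = 0
    while sigma ** extra < m * m:
        extra += 1
    return g + extra


def match_probability(pattern, sigma_letters):
    """Exact probability that a uniform random length-r string over Σ matches a pattern factor."""
    m, g = len(pattern), pattern.count(WILDCARD)
    r = factor_length(m, g, len(sigma_letters))
    factors = {pattern[i:i + r] for i in range(m - r + 1)}
    hits = sum(
        1
        for w in product(sigma_letters, repeat=r)
        if any(all(b == WILDCARD or a == b for a, b in zip(w, f)) for f in factors)
    )
    return r, hits / len(sigma_letters) ** r
```

For example, `match_probability("ab?ab?baabbababa", "ab")` uses m = 16, g = 2 and r = 10, and the returned probability can be compared directly against 1/m.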
We can easily adapt the dictionary used in the previous section to this case: change the length of factors to $r = g + 2\log_\sigma m$ and the alphabet to Σ; the space complexity becomes $O(\sigma^{g + 2\log_\sigma m})$ with preprocessing time $O(\sigma^{g + 2\log_\sigma m}\, m\,(g + \log_\sigma m))$.

For the next result we borrow some notation from the literature on average-case approximate string matching. When considering string matching with k differences, the algorithm's tolerance to errors is often expressed as an error ratio k/m, where k is the number of errors allowed and m the length of the pattern; analogously, by g/m we denote the wildcard ratio. Again we use the details established above and the algorithm sketched at the beginning of Section 3 and achieve the following result.

The wildcard ratio we specify is actually very permissive. To see this, note that we are free to pick any value for ϵ subject to ϵ < 1. So for any ratio g/m < 1 it is possible to pick a value for ϵ and a sufficiently large value of m such that the algorithm can run in the desired running time. In the following theorem we show that for any fixed algorithm it is impossible to do any better than this. We show that for any integer g, there exist arbitrarily large m such that for fixed algorithms $\Omega(ng/m)$ characters must be inspected for sufficiently large n. Note that for these patterns, even in the best case a fixed algorithm must inspect g + 1 characters.
Theorem 10 Algorithm Wp runs in average-case time $O(n(g + \log_\sigma m)/m)$ and O(n log m) in the worst case, and no fixed algorithm can do better.

Now we consider the expected-case complexity of the algorithms when the pattern is randomly drawn from Σ ∪ {φ}; the result changes in the following way. By setting $r = 2\log_{\frac{\sigma+1}{2}} m$ the probability of a match remains $1/m$, and by following a similar argument as for wildcards only in the text we get the following.

We can combine the techniques we have presented above and apply them to the case where wildcards may appear in both. We have excluded this due to space constraints; however, the reader can verify the results by setting r = g + 2 log σ+1

A General Lower Bound
In the previous section we considered the average-case and expected-case complexity of each algorithm. In the average case, patterns can be designed that give bad performance if the algorithm is fixed. Fixed algorithms consider the characters in each window of the text in a fixed order, an approach that is ubiquitous in string algorithms. In this section we consider an arbitrary inspection scheme and derive an average-case lower bound for any algorithm solving the problem of wildcards in the pattern with an arbitrary pattern.
Consider that we have a pattern of length m which contains g wildcard characters and a text of length n. We partition the text into non-overlapping segments of size 2m, referred to as blocks, and only require that we report all matches, or exclude all positions, within blocks. This is optimistic, as it excludes those matches which may overlap two blocks. In the following we determine a lower bound for the number of character inspections required for one block. The lower bound for the problem can then be derived.
For each block we call the possible starting positions 0, . . , m − 1 candidates, and when we inspect a character from the block we call this a block access. The candidates affected by a block access are said to be intersected by it. Given a block access $i_j$ to block b we can only rule out candidate c if $b_{i_j} \ne p_{i_j - c + 1}$. For all non-wildcard positions aligned with a given block access $i_j$, there is a probability of at most 1/σ that the candidate will not be ruled out. For those candidates where this block access aligns with a wildcard, there is probability 1 that it will not be ruled out. Now we outline a few optimistic assumptions.
-Any access intersects all m candidates.
-Intersections are distributed uniformly across all candidates.
The effect of this is that m − g candidates have a chance of being ruled out at every block access, since we consider all m candidates and g of the aligned pattern positions may be wildcards. After k block accesses in this model we have made (m − g)k intersections, and we assume that these are distributed uniformly across all m candidates. This is optimistic: if $k_i$ represents the number of accesses to candidate i, the expected number of candidates left under the uniform scheme is no larger than under any other distribution of the same (m − g)k intersections. Informally this means that we may only overestimate the number of candidates already ruled out.
We need to either rule out every candidate or declare a match at a candidate. So the optimal strategy is determined by when we would expect to have ruled out every candidate position: we look for the smallest number of block accesses after which we expect to have at most one candidate left, stopping once we have read all 2m positions.
Solving for k, the lower bound for each block is $\Omega\left(\frac{m \log_\sigma m}{m - g}\right)$. The result below follows.
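The per-block bound can be written out explicitly. The following is a sketch of one derivation consistent with the stated result, under the assumptions that every informative intersection leaves a candidate alive with probability exactly 1/σ and that intersections act independently; the symbol $E(k)$ for the expected number of surviving candidates is introduced here for illustration only.

```latex
\begin{align*}
E(k) &= \sum_{i=0}^{m-1} \sigma^{-(m-g)k/m}
      = m\,\sigma^{-(m-g)k/m}
      \;\le\; \sum_{i=0}^{m-1} \sigma^{-k_i},
      \qquad \text{where } \sum_{i=0}^{m-1} k_i = (m-g)k, \\
m\,\sigma^{-(m-g)k/m} \le 1
     &\iff \frac{(m-g)k}{m} \ge \log_\sigma m
      \iff k \ge \frac{m \log_\sigma m}{m-g}.
\end{align*}
```

The inequality in the first line follows from the convexity of $x \mapsto \sigma^{-x}$: for a fixed total number of intersections, spreading them evenly minimises the sum.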

Theorem 13
The average-case lower bound for wildcard matching with wildcards only in the pattern is $\Omega\left(\frac{n \log_\sigma m}{m - g}\right)$.
For values of g such that m − g = Θ(m) this does not give us much additional insight, as the lower bound matches that of exact string matching. However, consider the extreme cases where g = m − f with $\frac{\log_\sigma m}{2} \le f$; substituting m − g = f into the bound, we see that we must inspect $\Omega(n \log_\sigma m / f)$ characters.
For values of f less than $\frac{\log_\sigma m}{2}$ we must check all 2m characters in the block. So for $g = m - x\log_\sigma m$ the presented algorithm is optimal. Intuitively this is because, as we increase the number of wildcards, each block access gives us less information. This argument also generalises to the case where wildcards may appear in both strings; due to space constraints we have excluded this.
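To see how the per-block bound behaves as g grows, it can be evaluated numerically. The helper below is a hypothetical illustration that ignores constant factors and simply caps the bound at the block size 2m.

```python
import math


def block_inspections(m, g, sigma):
    """Per-block bound min(2m, m log_sigma(m) / (m - g)), with constants ignored."""
    if g >= m:
        return 2 * m  # an all-wildcard pattern gives no information per access
    return min(2 * m, m * math.log(m, sigma) / (m - g))


if __name__ == "__main__":
    m, sigma = 1024, 4
    for g in (0, 512, 1000, 1021, 1022):
        print(g, block_inspections(m, g, sigma))
```

With m = 1024 and σ = 4 we have $\log_\sigma m = 5$, so the cap of 2m is reached once f = m − g drops below 2.5, matching the threshold $\frac{\log_\sigma m}{2}$ in the text.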
Finally we show that a greedy inspection scheme performs an optimal number of character comparisons. By greedy we mean that at each step the block access which would most greatly reduce the expected number of remaining positions is chosen. For a candidate i let $k_i$ be the number of times it has been accessed. Now for each position in a block we define the set of candidates it affects: for each 0 ≤ i ≤ m − 1 let $B_i$ be the set of candidates that the block access i intersects.
The effect of inspecting some position ℓ on the expected number of candidates not ruled out is given by the following. Let $U = \{0, 1, \ldots, m - 1\} \setminus B_\ell$ and let $k_i$ denote the number of times candidate i has been intersected before the block access to ℓ.
The greedy inspection maximises the last two terms of the above summation at each step. Now we show that this is in fact optimal.
Theorem 14 The greedy inspection scheme performs an optimal number of character comparisons.
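A greedy inspection order can be computed directly in the simplified model. The sketch below makes several assumptions that are not in the text: the function name is hypothetical, "?" stands in for φ, each informative intersection is taken to leave a candidate alive with probability exactly 1/σ, and ties are broken by the lowest block position.

```python
def greedy_inspection(pattern, sigma, wildcard="?"):
    """Greedy order of block accesses and the expected number of surviving candidates."""
    m = len(pattern)
    informative = [c != wildcard for c in pattern]
    k = [0] * m                    # informative intersections per candidate 0..m-1
    unread = list(range(2 * m))    # block positions not yet inspected
    order = []

    def hit(pos):
        """Candidates c that block position pos aligns with a non-wildcard of the pattern."""
        return [c for c in range(max(0, pos - m + 1), min(m - 1, pos) + 1)
                if informative[pos - c]]

    def gain(pos):
        """Expected reduction in surviving candidates if pos is inspected next."""
        return sum(sigma ** -k[c] * (1 - 1 / sigma) for c in hit(pos))

    expected = float(m)
    while expected > 1 and unread:
        best = max(unread, key=gain)
        if gain(best) == 0:
            break                  # only wildcard alignments remain
        unread.remove(best)
        order.append(best)
        for c in hit(best):
            k[c] += 1
        expected = sum(sigma ** -kc for kc in k)
    return order, expected
```

For the pattern "ab" over a binary alphabet, position 1 is the only position that intersects both candidates, so the greedy scheme inspects it first and stops with an expected single survivor.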

Discussion & Concluding Remarks
In this article we have investigated the average-case complexity of two wildcard matching problems by analysing the algorithms' average-case and expected-case complexity. The original notion of average-case complexity in string matching is that of Knuth; however, for exact and approximate string matching the expected-case and average-case complexities are the same. Clearly the arbitrary pattern model is a stronger notion of average-case complexity. However, considering the expected-case complexity also gives us some insight into the problems.
Here we see that although there exist hard patterns which must take longer in both the fixed and non-fixed models, given a random pattern we do not expect it to be much harder than exact string matching. This suggests that most patterns are actually easy to process on average. Although we have shown that the greedy inspection scheme is optimal, it is not clear if it matches the lower bound. Settling this, and designing an algorithm that implements the greedy scheme, are left as future work.
It may seem that the time and space complexities of the dictionary are quite large. However, the state-of-the-art linear query time index of Bille et al has a space complexity of $O(\sigma^{g^2} n \log^g \log n)$.

Proof. Recall that the lower bound for exact string matching is $\Omega(n \log_\sigma m / m)$ [21]. Clearly pattern matching with wildcards in the text only is at least as hard as exact string matching, as we are also required to find non-wildcard matches. Algorithm Wt runs in average-case time $O(n \log_\sigma m / m)$, matching the lower bound for exact string matching. Therefore the algorithm is optimal and the lower bound on pattern matching with wildcards in the text only is also $\Omega(n \log_\sigma m / m)$.
In the worst case we run the algorithm for every window and the proof is identical to that of [4]. □

Proof of Lemma 8
Proof. Consider a random factor of size $g + 2\log_\sigma m$ drawn from Σ. In the worst case all g wildcards of the pattern appear in a single factor of size $g + 2\log_\sigma m$; this factor contains $2\log_\sigma m$ positions which are not wildcards, and the probability that these match the corresponding characters in a random factor of size $g + 2\log_\sigma m$ over Σ is no more than $1/m^2$. For any factor with fewer than g wildcards the probability is less than $1/m^2$. Pessimistically assume all factors of length $g + 2\log_\sigma m$ have a probability to match of $1/m^2$; there are $m - (g + 2\log_\sigma m)$ factors, so the probability is certainly no more than $1/m$. □

Proof of Theorem 9
Proof. We set $r = g + 2\log_\sigma m$ and create a sliding window on the text of length m, check the suffix of length $g + 2\log_\sigma m$ and, if it matches, run a standard O(m log m) algorithm on a factor of size 2m. After this we shift by at least m − r if the suffix did not match and by m if it did. We have at most $\frac{n}{m - g - 2\log_\sigma m}$ windows and at each one we may do $O(m \log m + g + \log_\sigma m)$ work. For the algorithm to run in the claimed time it must be the case that $\frac{n}{m - g - 2\log_\sigma m} = O(n/m)$, and for this to be true it must be that $g + 2\log_\sigma m < \epsilon m$ where ϵ < 1; this makes the denominator Θ(m). Dividing by m, this places the following condition on the wildcard ratio: $g/m < \epsilon - \frac{2\log_\sigma m}{m}$. □

Proof of Theorem 10
Proof. Recall that the lower bound for exact string matching is $\Omega(n \log_\sigma m / m)$. Clearly when $g = O(\log_\sigma m)$ the bound of Yao [21] holds, as this problem requires us to report non-wildcard matches as well. We now show that any fixed algorithm has a lower bound of $\Omega(ng/m)$ in the case where there are more than $O(\log_\sigma m)$ wildcards in the pattern. If there is no occurrence of the pattern then we must check at least g characters before we can declare there is no match.
Assume the text is partitioned into non-overlapping blocks of size 2m and that we only find occurrences contained entirely within these blocks. Let $\pi_{2m}$ denote all the permutations of $(0, 1, \ldots, 2m - 1)$ and assume that we examine the characters of each block in the same fixed but arbitrary order $(i_1, i_2, \ldots, i_{2m}) \in \pi_{2m}$. We can construct patterns, for any g < m, such that we must examine at least g + 1 characters before all start positions can be ruled out.
We construct a pattern in the following way: if the positions $i_j$ for 0 < j ≤ g occur in some range (k, ℓ) such that ℓ − k ≤ m, then place the wildcards in positions $i_1, \ldots, i_g$ of the pattern. Otherwise, for 0 < j ≤ g and $i_j < m$, place a wildcard at position $i_j$. Any remaining wildcards may be placed anywhere in the pattern; the remaining positions of the pattern can be any character from Σ. After inspecting characters $i_1, \ldots, i_g$ of the block, at least one position can neither be ruled out nor declared as a match. Combining the lower bound of Yao with this, we see that any fixed algorithm has a lower bound of $\Omega(n(g + \log_\sigma m)/m)$ for this problem. Algorithm Wp runs in average-case time $O(n(g + \log_\sigma m)/m)$, matching the lower bound; therefore the algorithm is optimal in the family of fixed algorithms. The proof of the worst case runtime is identical to that of [4].
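The pattern construction used in this proof can be illustrated for a concrete inspection order. The sketch below covers only the second branch of the construction (wildcards placed at inspected positions below m, with leftover wildcards placed at the first free positions); the function names, the filler letter and "?" as a stand-in for φ are all assumptions made for illustration.

```python
def adversarial_pattern(order, m, g, letter="a", wildcard="?"):
    """Pattern of length m whose wildcards sit where the first g inspections land."""
    assert g < m
    pattern = [letter] * m
    for pos in order[:g]:
        if pos < m:                       # block position pos aligns with p[pos] for candidate 0
            pattern[pos] = wildcard
    remaining = g - pattern.count(wildcard)
    for i in range(m):                    # leftover wildcards may go anywhere
        if remaining == 0:
            break
        if pattern[i] != wildcard:
            pattern[i] = wildcard
            remaining -= 1
    return "".join(pattern)


def candidate_zero_undecided(order, pattern, g, wildcard="?"):
    """True if none of the first g inspections can rule out the candidate starting at 0."""
    m = len(pattern)
    return all(pos >= m or pattern[pos] == wildcard for pos in order[:g])
```

Since every inspected position that touches candidate 0 is aligned with a wildcard, and the m − g non-wildcard positions of the pattern are still unread, candidate 0 can be neither excluded nor reported after g inspections.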

Proof of Theorem 14
Proof. Let α be the minimum number of inspections required to reduce the expected number of candidates to less than 1. Now consider an arbitrary inspection scheme of length α; for each such scheme it is possible to consider the inspections it performs in any order. This is possible because, for any inspection scheme of length α, we are required to make all of these inspections, so the order in which these α inspections are made makes no difference to the final probability of any scheme.
For the rest of the proof we consider each inspection scheme in a greedy order, that is with respect to the α block accesses in the inspection scheme at each step pick the one that minimises the expected value.
Let $g_0, g_1, \ldots, g_{s-1}$ be the block accesses made by the greedy inspection scheme and let $k_0, k_1, \ldots, k_{q-1}$ be an arbitrary inspection scheme considered in greedy order. When considered in this order it is the case that for an arbitrary inspection scheme the following holds. Let $E^k_i$ and $E^g_i$ be the expected number of candidates not ruled out after i accesses for the inspection schemes $k_0, k_1, \ldots, k_{q-1}$ and $g_0, g_1, \ldots, g_{s-1}$ respectively.
We proceed by induction on the number of block accesses. Assume that for all 1 ≤ f ≤ ℓ it holds that $E^g_f \le E^k_f$. Now consider the case for ℓ + 1. We pick the next block access by the greedy method; if this causes $E^g_{\ell+1}$ to be less than or equal to $E^k_{\ell+1}$ then we are done. Now we show that this must always be the case. Assume that it is possible to pick a block access such that $E^k_{\ell+1} < E^g_{\ell+1}$. For $E^k_{\ell+1} < E^g_{\ell+1}$ to be true it must be the case that the chosen block access reduces $E^k_{\ell+1}$ by at least the difference between $E^g_\ell$ and $E^k_\ell$ and the reduction to the expectation of the greedy scheme given by the block access picked by the greedy scheme. Let $\delta^g_i = E^g_{i-1} - E^g_i$ and $\delta^k_i = E^k_{i-1} - E^k_i$. More formally for $E^k_{\ell+1} < E^g_{\ell+1}$ it must be the case that And therefore also that $E^k_\ell - E^g_\ell > \delta^g_{\ell+1} - \delta^k_{\ell+1}$ (4). By rearranging the definition of $\delta^k_i$ with i = ℓ + 1 we get that $E^k_{\ell+1} = E^k_\ell - \delta^k_{\ell+1}$. By substituting $\delta^k_{\ell+1}$ from (3) into the above formula the following is derived Rearranging it becomes Applying the definition of $\delta^k_{\ell+1}$ this becomes Finally with some rearrangement we get that By (4) this is a contradiction, and therefore $E^g_{\ell+1}$ is the smallest and by the induction the greedy inspection scheme is optimal.