Combinatorics of minimal absent words for a sliding window

A string $w$ is called a minimal absent word (MAW) for another string $T$ if $w$ does not occur in $T$ but all proper substrings of $w$ occur in $T$. For example, let $\Sigma = \{\mathtt{a, b, c}\}$ be the alphabet. Then, the set of MAWs for the string $\mathtt{abaab}$ is $\{\mathtt{aaa, aaba, bab, bb, c}\}$. In this paper, we study combinatorial properties of MAWs in the sliding window model, namely, how the set of MAWs changes when a sliding window of fixed length $d$ is shifted over the input string $T$ of length $n$, where $1 \leq d<n$. We present \emph{tight} upper and lower bounds on the maximum number of changes in the set of MAWs for a sliding window over $T$, both in the cases of general alphabets and binary alphabets. Our bounds improve on the previously known best bounds [Crochemore et al., 2020].


Introduction
We say that a string $s$ occurs in another string $T$ if $s$ is a substring of $T$. A non-empty string $w$ is said to be a minimal absent word (a MAW) for a string $T$ if $w$ does not occur in $T$ but every proper substring of $w$ occurs in $T$. Note that, by definition, a string of length 1 (namely a character) which does not occur in $T$ is also a MAW for $T$. On the other hand, any MAW for $T$ of length at least 2 can be represented as $aub$, where $a$ and $b$ are single characters and $u$ is a (possibly empty) string, such that both $au$ and $ub$ occur in $T$. For example, let $\Sigma = \{\mathtt{a, b, c}\}$ be the alphabet. Then, the set of MAWs for the string $\mathtt{abaab}$ is $\{\mathtt{aaa, aaba, bab, bb, c}\}$.
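The definition above can be turned directly into a brute-force procedure. The following Python sketch is only an illustration of the definition (it is not the suffix-tree based algorithm of [10]); the function names `substrings` and `maws` are hypothetical helpers introduced here. It uses the $aub$ characterization: a candidate $w = xb$ is a MAW iff $x$ occurs, $w$ with its first character removed occurs, and $w$ itself does not occur.

```python
def substrings(t):
    """Return the set of all substrings of t, including the empty string."""
    return {t[i:j] for i in range(len(t) + 1) for j in range(i, len(t) + 1)}


def maws(t, alphabet):
    """Brute-force MAW(t) over the given alphabet (quadratically many substrings)."""
    subs = substrings(t)
    result = set()
    for x in subs:                  # x plays the role of au (or of the empty string)
        for b in alphabet:
            w = x + b               # candidate word aub
            # w must be absent, while w[:-1] = x and w[1:] must both occur
            if w not in subs and w[1:] in subs:
                result.add(w)
    return result


# The running example of the text: alphabet {a, b, c}, string abaab.
print(sorted(maws("abaab", "abc")))  # ['aaa', 'aaba', 'bab', 'bb', 'c']
```

Since every MAW has length at most $|T| + 1$, extending each substring by one character enumerates all candidates.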

Algorithms for finding MAWs for string
Given the aforementioned motivations, finding MAWs for a given string has been an important and interesting string algorithmic problem, and several nice solutions have been proposed.

MAWs for sliding window
This paper follows the recent line of research on MAWs for the sliding window model, which was initiated by Crochemore et al. [10]. In this model, the goal is to compute or analyze $\mathsf{MAW}(T[i..i+d-1])$ for every window $T[i..i+d-1]$ of fixed length $d \geq 1$ that slides over $T$ from left to right with increasing $i = 1, \ldots, n-d+1$.
Crochemore et al. [10] presented a suffix-tree based algorithm that maintains the set of all MAWs for a sliding window in $O(\sigma n)$ time using $O(\sigma d)$ working space. Crochemore et al. [10] also showed how their algorithm can be applied to approximate pattern matching under the length weighted index (LWI) metric [6].
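The (in)efficiency of their algorithms depends heavily on combinatorial properties of MAWs for the sliding window. In particular, Crochemore et al. [10] studied the number of MAWs to be added or deleted when the current window is shifted to the right by one character. As was done in [10], for ease of discussion let us separately consider two operations: appending a new character to the right of the window, and deleting the leftmost character from the window. We remark that these two operations are symmetric.
Crochemore et al. [10] considered how many MAWs can change before and after the window has been shifted by one position, and bounded the size of the symmetric difference $\triangle$ of the two sets of MAWs in terms of two quantities $s_i - s_\alpha$ and $p_i - p_\beta$. Since both $s_i - s_\alpha$ and $p_i - p_\beta$ can be at most $d - 1$ in the worst case, the asymptotic bounds for the numbers of changes in the set of MAWs obtained by Crochemore et al. [10] are:
$$|\mathsf{MAW}(T[i..j]) \,\triangle\, \mathsf{MAW}(T[i..j+1])| \in O(\sigma d), \qquad |\mathsf{MAW}(T[i..j]) \,\triangle\, \mathsf{MAW}(T[i+1..j])| \in O(\sigma d). \tag{1}$$
Crochemore et al. [10] also considered the total changes in the set of MAWs for every sliding window over the string $T$, and showed that
$$\sum_{i=1}^{n-d} |\mathsf{MAW}(T[i..i+d-1]) \,\triangle\, \mathsf{MAW}(T[i+1..i+d])| \in O(\sigma n). \tag{2}$$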

Our contribution
The goal of this paper is to give more rigorous analyses of the number of MAWs for the sliding window model. This study is well motivated, since revealing more combinatorial insights into the sets of MAWs for the sliding windows can lead to more efficient algorithms for computing them.
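In this paper, we first give the following upper bounds:
$$|\mathsf{MAW}(T[i..j]) \,\triangle\, \mathsf{MAW}(T[i..j+1])| \leq d + \sigma' + 1, \tag{3}$$
where $d = j - i + 1$ and $\sigma'$ is the number of distinct characters occurring in the window $T[i..j]$, together with a symmetric bound for the deletion of the leftmost character. Since $\sigma' \leq d$, these bounds imply
$$|\mathsf{MAW}(T[i..j]) \,\triangle\, \mathsf{MAW}(T[i..j+1])| \in O(d), \qquad |\mathsf{MAW}(T[i..j]) \,\triangle\, \mathsf{MAW}(T[i+1..j])| \in O(d). \tag{4}$$
Our new upper bounds in (4) improve Crochemore et al.'s upper bounds in (1) for any alphabet of size $\sigma \in \omega(1)$. Our upper bounds in (4) are also tight, as there exists a family of strings achieving the matching lower bounds $\Omega(d)$.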
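In this paper, we also present a new upper bound for the total changes of MAWs:
$$\sum_{i=1}^{n-d} |\mathsf{MAW}(T[i..i+d-1]) \,\triangle\, \mathsf{MAW}(T[i+1..i+d])| \in O(\min\{d, \sigma\}n), \tag{5}$$
which improves the previous bound $O(\sigma n)$ in (2). We then show that this new upper bound in (5) is also tight.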
All of our aforementioned new bounds are tight for any alphabet of size $\sigma \geq 3$. We further explore the case of binary alphabets with $\sigma = 2$, and show that there exist even tighter bounds in the binary case. Namely, for $\sigma = 2$, we prove that
$$|\mathsf{MAW}(T[i..j]) \,\triangle\, \mathsf{MAW}(T[i..j+1])| \leq \max\{3, d\}, \qquad |\mathsf{MAW}(T[i..j]) \,\triangle\, \mathsf{MAW}(T[i+1..j])| \leq \max\{3, d\}. \tag{6}$$
We remark that plugging $\sigma = 2$ into (3) for the general case only gives $d + \sigma + 1 = d + 3$, which is larger than $\max\{3, d\}$ in (6). We consider the case $\sigma \geq d$ in Lemmas 10 and 11. We also show that the upper bounds $\max\{3, d\}$ in (6) are tight by giving the matching lower bounds with a family of binary strings.
A part of the results reported in this article appeared in a preliminary version of this paper [18]. In addition, the present article considers the case of binary alphabets and presents tight upper and lower bounds for this case.

Strings
Let $\Sigma$ be an alphabet. An element of $\Sigma$ is called a character. An element of $\Sigma^*$ is called a string. The length of a string $T$ is denoted by $|T|$. The empty string $\varepsilon$ is the string of length 0. If $T = xyz$, then $x$, $y$, and $z$ are called a prefix, substring, and suffix of $T$, respectively. We say that a string $w$ occurs in a string $T$ if $w$ is a substring of $T$. Note that, by definition, the empty string $\varepsilon$ is a substring of any string $T$ and hence $\varepsilon$ always occurs in $T$.

Minimal absent words (MAWs)
A string $w$ is called an absent word for a string $S$ if $w$ does not occur in $S$. An absent word $w$ for $S$ is called a minimal absent word or MAW for $S$ if every proper substring of $w$ occurs in $S$. We denote by $\mathsf{MAW}(S)$ the set of all MAWs for $S$. By the definition of MAWs, it is clear that $w \in \mathsf{MAW}(S)$ iff the three following conditions hold:
(A) $w$ does not occur in $S$;
(B) $w[2..|w|]$ occurs in $S$;
(C) $w[1..|w|-1]$ occurs in $S$.
We note that if $w$ is a string of length 1 which does not occur in $S$ (i.e. $w$ is a single character of the alphabet $\Sigma$ of size $\sigma$ not occurring in $S$), then $w$ is a MAW for $S$, since $w[2..|w|] = w[1..|w|-1] = \varepsilon$ occurs in $S$.

MAWs for a sliding window
Given a string $T$ of length $n$ and a sliding window $S_i = T[i..j]$ of length $d = j - i + 1$ for increasing $i = 1, \ldots, n-d+1$, our goal is to analyze how many MAWs for the sliding window can change when the window shifts over the string $T$. We will consider both the maximum change per shift and the maximum total number of changes when sliding the window from the beginning to the end.
As was done in [10], for simplicity, we separately consider two symmetric operations of appending a new character to the right of the window and of deleting the leftmost character from the window.
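To make the two operations concrete, the following Python sketch splits every shift of the window into an append step and a delete step and reports how many MAWs change in each. It is a brute-force illustration only, not the algorithm of [10]; `maws` and `window_changes` are hypothetical names, and indices are 0-based as usual in Python.

```python
def substrings(t):
    """All substrings of t, including the empty string."""
    return {t[i:j] for i in range(len(t) + 1) for j in range(i, len(t) + 1)}


def maws(t, alphabet):
    """Brute-force set of minimal absent words of t."""
    subs = substrings(t)
    return {x + b for x in subs for b in alphabet
            if x + b not in subs and (x + b)[1:] in subs}


def window_changes(t, d, alphabet):
    """For each shift, return (changes on append, changes on delete)."""
    out = []
    for i in range(len(t) - d):
        cur = maws(t[i:i + d], alphabet)          # window T[i..j]
        ext = maws(t[i:i + d + 1], alphabet)      # after appending T[j+1]
        nxt = maws(t[i + 1:i + d + 1], alphabet)  # after deleting T[i]
        out.append((len(cur ^ ext), len(ext ^ nxt)))  # ^ is symmetric difference
    return out


print(window_changes("aab", 2, "ab"))  # [(3, 2)]
```

In the example, appending `b` to the window `aa` removes the MAW `b` and adds `ba` and `bb`; deleting the leftmost `a` then replaces `aaa` by `aa`.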

Tight bounds on the changes to MAWs for sliding window
In this section, we present our new bounds for the changes of MAWs for the sliding window over the string $T$. In Section 3.1, we consider the number of changes of MAWs when the current window $T[i..j]$ is extended by adding a new character $T[j+1]$. Section 3.2 is for the symmetric case where the leftmost character $T[i]$ is deleted from $T[i..j+1]$. Finally, in Section 3.3, we consider the total number of changes of MAWs while the window is shifted from the beginning of $T$ until its end.

Changes to MAWs when a character is appended to the right
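We consider the number of changes of MAWs when appending $T[j+1]$ to the current window $T[i..j]$. For the number of deleted MAWs, the next lemma is known [10]: exactly one MAW is deleted, i.e. $|\mathsf{MAW}(T[i..j]) \setminus \mathsf{MAW}(T[i..j+1])| = 1$.
Next, we consider the number of added MAWs. We classify each MAW $w \in \mathsf{MAW}(T[i..j+1]) \setminus \mathsf{MAW}(T[i..j])$ into the following three types (see Figure 1). $w$ is said to be of:
• Type 1 if neither $w[2..|w|]$ nor $w[1..|w|-1]$ occurs in $T[i..j]$;
• Type 2 if $w[2..|w|]$ occurs in $T[i..j]$ but $w[1..|w|-1]$ does not occur in $T[i..j]$;
• Type 3 if $w[1..|w|-1]$ occurs in $T[i..j]$ but $w[2..|w|]$ does not occur in $T[i..j]$.
We denote by $M_1$, $M_2$, and $M_3$ the sets of MAWs of Type 1, Type 2 and Type 3, respectively. Recall that $w$ is a MAW for $T[i..j+1]$.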
Let $\sigma_{i,j}$ be the number of distinct characters occurring in the current window $T[i..j]$. We also use $\sigma' = \sigma_{i,j}$ for simplicity.
The next three lemmas show the upper bounds on the sizes of $M_1$, $M_2$, and $M_3$, namely $|M_1| \leq 1$, $|M_2| \leq \sigma'$, and $|M_3| \leq d - 1$.
Proof. It is shown in [10].
Proof. We show that there is an injection $f$ that maps the Type-3 MAWs in $M_3$ to positions in the current window. For the sake of contradiction, we assume that $f$ is not an injection, i.e. there are two distinct MAWs $w_1$ and $w_2$ with $f(w_1) = f(w_2)$. If $|w_1| = |w_2|$, then $w_1 = w_2$, which contradicts $w_1 \neq w_2$. If $|w_1| > |w_2|$, then $w_2$ is a proper suffix of $w_1$, which contradicts the fact that $w_2$ is absent from $T[i..j+1]$ (see Figure 2). Therefore, $f$ is an injection.
Summing up all the upper bounds for $M_1$, $M_2$, and $M_3$, we obtain the following: $|\mathsf{MAW}(T[i..j+1]) \setminus \mathsf{MAW}(T[i..j])| = |M_1| + |M_2| + |M_3| \leq \sigma' + d$.
Figure 2: Illustration for the contradiction in the proof of Lemma 5. Consider two strings $w_1 = a_1 x_1 b_1$ and $w_2 = a_2 x_2 b_2$ that are MAWs for $T$ of Type 3, where $a_1, a_2, b_1, b_2 \in \Sigma$ and $x_1, x_2 \in \Sigma^*$. If $|x_1| > |x_2|$, then $x_2$ is a proper suffix of $x_1$, which contradicts that $a_2 x_2 b_2$ is absent from $T$.
Proof. Immediately follows from Lemmas 2, 3, and 4 and the fact that $M_1$, $M_2$, and $M_3$ are mutually disjoint.
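Now we obtain the main result of this subsection (Theorem 1), which shows the matching upper and lower bounds for $|\mathsf{MAW}(T[i..j]) \,\triangle\, \mathsf{MAW}(T[i..j+1])|$: for any string $T$ of length $n > d$, $|\mathsf{MAW}(T[i..j]) \,\triangle\, \mathsf{MAW}(T[i..j+1])| \leq d + \sigma' + 1$, where $d = j - i + 1$ and $\sigma'$ is the number of distinct characters occurring in $T[i..j]$. The upper bound is tight when $\sigma \geq 3$ and $\sigma' + 1 \leq \sigma$.
In the following, we show that the upper bound is tight, i.e. there is a string $Z$ of length $d$ and a character $\alpha$ where $|\mathsf{MAW}(Z) \,\triangle\, \mathsf{MAW}(Z\alpha)| = \sigma' + d + 1$ for any two integers $d$ and $\sigma'$ with $1 \leq \sigma' \leq d$ and $\sigma' + 1 \leq \sigma$. Let $\Sigma = \{a_1, a_2, \ldots, a_\sigma\}$ be an alphabet. Given two integers $d$ and $\sigma'$ with $1 \leq \sigma' \leq d$ and $\sigma' + 1 \leq \sigma$, consider a string $Z = a_1 a_2 \cdots a_{\sigma'-1} a_{\sigma'}^{d-\sigma'+1}$ of length $d$ and a character $\alpha = a_{\sigma'+1}$. Comparing $\mathsf{MAW}(Z)$ with $\mathsf{MAW}(Z\alpha)$ then leads to the matching lower bound $|\mathsf{MAW}(Z) \,\triangle\, \mathsf{MAW}(Z\alpha)| = \sigma' + d + 1$.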
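A concrete example for our lower-bound strings $Z$ and $Z\alpha$: for $d = 3$ and $\sigma' = 2$ over $\Sigma = \{\mathtt{a, b, c}\}$, we have $Z = \mathtt{abb}$ and $\alpha = \mathtt{c}$. Then $\mathsf{MAW}(Z) = \{\mathtt{aa, ba, bbb, c}\}$ and $\mathsf{MAW}(Z\alpha) = \{\mathtt{aa, ac, ba, ca, cb, cc, abc, bbb}\}$, so that $|\mathsf{MAW}(Z) \,\triangle\, \mathsf{MAW}(Z\alpha)| = 6 = \sigma' + d + 1$.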
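The lower-bound construction can be checked by brute force for small parameters. The Python sketch below builds $Z$ and $\alpha$ over the lowercase letters and counts the changed MAWs. All function names are hypothetical, the parameter `k` plays the role of $\sigma'$, and the cases tried here all have $\sigma' \geq 2$; this is a sanity check under these assumptions rather than a proof.

```python
import string


def substrings(t):
    """All substrings of t, including the empty string."""
    return {t[i:j] for i in range(len(t) + 1) for j in range(i, len(t) + 1)}


def maws(t, alphabet):
    """Brute-force set of minimal absent words of t."""
    subs = substrings(t)
    return {x + b for x in subs for b in alphabet
            if x + b not in subs and (x + b)[1:] in subs}


def lower_bound_instance(d, k):
    """Z = a_1 ... a_{k-1} a_k^{d-k+1} and alpha = a_{k+1}, with k = sigma'."""
    letters = string.ascii_lowercase
    z = letters[:k - 1] + letters[k - 1] * (d - k + 1)
    return z, letters[k]


def changes_on_append(d, k, sigma):
    """|MAW(Z) symmetric-difference MAW(Z alpha)| over an alphabet of size sigma."""
    z, alpha = lower_bound_instance(d, k)
    alphabet = string.ascii_lowercase[:sigma]
    return len(maws(z, alphabet) ^ maws(z + alpha, alphabet))


for d, k in [(3, 2), (4, 2), (3, 3)]:
    print(d, k, changes_on_append(d, k, k + 1), k + d + 1)
```

Each printed line shows $d$, $\sigma'$, the measured number of changes, and the value $\sigma' + d + 1$ for comparison.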
• If $|u'| \geq k$, then $c^{k+1}$ is a suffix of $u'$ as shown in Figure 5. However, by the definition of Type-2 MAWs, $u'c$ must occur in $S$ (see also the middle of Figure 1), which implies that $c^{k+1}$ occurs in $S$. This contradicts that $c^k$ is the longest run of $c$'s in $S$.
• If $|u'| < k$, then $a'u'c = c^{|a'u'c|}$ with $|a'u'c| \leq k + 1$ occurs in $T[i..j+1]$ as a suffix, and this contradicts that $a'u'c$ is a MAW for $T[i..j+1]$.
Hence $a'u'c$ cannot be in $M_2$, which leads to $|M_2| \leq \sigma_{i,j} - 1$, and thus $|M_1| + |M_2| \leq \sigma_{i,j}$ for any string $T[i..j+1]$ such that $T[i..j]$ contains at least one character that is equal to $T[j+1]$.
Recall that Lemma 2 and Lemma 3 in the case where $\sigma_{i,j+1} \geq \sigma_{i,j}$ give us $|M_1| + |M_2| \leq \sigma' + 1 = \sigma_{i,j} + 1$, where $\sigma' = \sigma_{i,j}$. Compared to this, Lemma 6 shaves the total size of $M_1$ and $M_2$ by one in the case where $T[j+1]$ already occurs in $T[i..j]$. Coupled with Lemma 4, Lemma 6 leads us to the following corollary: if $T[j+1]$ already occurs in $T[i..j]$, then $|\mathsf{MAW}(T[i..j]) \,\triangle\, \mathsf{MAW}(T[i..j+1])| \leq d + \sigma'$. The upper bound is tight when $\sigma \geq 3$ and $\sigma' = \sigma$.
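As a sanity check of the bounds discussed in this subsection, one can enumerate all short windows over a small alphabet. The sketch below assumes the bounds in the form $d + \sigma' + 1$ when the appended character is new to the window and $d + \sigma'$ when it already occurs in the window; the latter form is inferred from Lemma 6 and is labeled as an assumption here. The function `append_bound_violations` is a hypothetical helper and returns every counterexample it finds.

```python
from itertools import product


def substrings(t):
    """All substrings of t, including the empty string."""
    return {t[i:j] for i in range(len(t) + 1) for j in range(i, len(t) + 1)}


def maws(t, alphabet):
    """Brute-force set of minimal absent words of t."""
    subs = substrings(t)
    return {x + b for x in subs for b in alphabet
            if x + b not in subs and (x + b)[1:] in subs}


def append_bound_violations(alphabet, max_d):
    """List (window, appended char, changes, bound) for every violated bound."""
    bad = []
    for d in range(1, max_d + 1):
        for tup in product(alphabet, repeat=d):
            s = "".join(tup)
            distinct = len(set(s))            # sigma' for this window
            before = maws(s, alphabet)
            for c in alphabet:
                changed = len(before ^ maws(s + c, alphabet))
                # assumed bounds: d + sigma' + 1 in general, d + sigma' if c occurs in s
                bound = d + distinct + (0 if c in s else 1)
                if changed > bound:
                    bad.append((s, c, changed, bound))
    return bad


print(append_bound_violations("abc", 4))
```

An empty list means that no window of length at most 4 over a ternary alphabet exceeds the assumed bounds.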

Changes to MAWs when the leftmost character is deleted
Next, we analyze the number of changes of MAWs when deleting the leftmost character from a string. By a symmetric argument to Theorem 1, we obtain an analogous upper bound for the deletion (Corollary 2), where $d = j - i + 1$ and $\sigma'$ is the number of distinct characters occurring in $T[i..j]$. Also, the upper bound is tight when $\sigma \geq 3$ and $\sigma' + 1 \leq \sigma$.
Proof. Symmetric to the proof of Lemma 1.
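Finally, by combining Theorem 1 and Corollary 2, we obtain the next theorem (Theorem 2): the number of changes of MAWs per shift of the window, $|\mathsf{MAW}(T[i..j]) \,\triangle\, \mathsf{MAW}(T[i+1..j+1])|$, is in $O(d)$.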

Total changes of MAWs when sliding window on string
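In this subsection, we consider the total number of changes of MAWs when sliding the window of length $d$ from the beginning of $T$ to the end of $T$. We denote the total number of changes of MAWs by $S(T, d) = \sum_{i=1}^{n-d} |\mathsf{MAW}(T[i..i+d-1]) \,\triangle\, \mathsf{MAW}(T[i+1..i+d])|$. The following lemma is known:
Lemma 7 ([10]). For a string $T$ of length $n > d$ over an alphabet $\Sigma$ of size $\sigma$, $S(T, d) \in O(\sigma n)$.
The aim of this subsection is to give a more rigorous bound for $S(T, d)$. We first show that the above bound is tight under some conditions.
Lemma 8. The upper bound of Lemma 7 is tight when $\sigma \leq d$ and $n - d \in \Omega(n)$.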
Next, we consider the case where σ ≥ d + 1.

Proof. By Theorem 2, it is clear that $S(T, d) \in O(d(n - d))$.
The main result of this section follows from the above lemmas:
Theorem 3. For a string $T$ of length $n > d$ over an alphabet $\Sigma$ of size $\sigma$, $S(T, d) \in O(\min\{d, \sigma\}n)$. This upper bound is tight when $n - d \in \Omega(n)$.
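For small inputs, the quantity $S(T, d)$ can be computed directly. The Python sketch below assumes that $S(T, d)$ is the sum, over all consecutive window positions, of the size of the symmetric difference of the two MAW sets; `total_changes` is a hypothetical name, and the brute-force `maws` helper is for illustration only.

```python
def substrings(t):
    """All substrings of t, including the empty string."""
    return {t[i:j] for i in range(len(t) + 1) for j in range(i, len(t) + 1)}


def maws(t, alphabet):
    """Brute-force set of minimal absent words of t."""
    subs = substrings(t)
    return {x + b for x in subs for b in alphabet
            if x + b not in subs and (x + b)[1:] in subs}


def total_changes(t, d, alphabet):
    """S(T, d): total MAW changes over all shifts of a length-d window."""
    sets = [maws(t[i:i + d], alphabet) for i in range(len(t) - d + 1)]
    return sum(len(a ^ b) for a, b in zip(sets, sets[1:]))


print(total_changes("aab", 2, "ab"))   # windows aa -> ab
print(total_changes("aaaa", 2, "ab"))  # all windows are identical
```

Comparing the output against $\min\{d, \sigma\}n$ for growing $n$ gives an empirical feel for the bound, although the hidden constant is not determined by such experiments.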
We remark that $n - d \in \Omega(n)$ covers most interesting cases for the window length $d$, since the value of $d$ can range from $O(1)$ to $cn$ for any $0 < c < 1$.

Tighter bounds for binary alphabets
In what follows, let us denote by $\Sigma_2 = \{0, 1\}$ the binary alphabet, and assume without loss of generality that we append the new character $\alpha = 0$ to the window $S$ of length $d$ and obtain the extended window $S\alpha = S0$.
As a warm up, we begin with the two following lemmas (Lemmas 10 and 11), which show that at most 3 MAWs can change in the cases where $d = 1$ and $d = 2$ for any binary string.
We move on to the case where $d \geq 3$. Our first observation is that it is sufficient to consider the case that $S$ is not unary. For any $d$, it is clear that $|\mathsf{MAW}(0^d) \,\triangle\, \mathsf{MAW}(0^{d+1})| = 2$. Now let us consider $1^d$ in the next lemma (Lemma 12).
According to Lemmas 10, 11 and 12, in what follows we focus on the case where $d \geq 3$ and the current window $S = T[i..i+d-1]$ contains at least one 0. The latter condition implies that we focus on the case where the new character $\alpha = 0$ already occurs in the window $S$.
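As in the case of non-binary alphabets, we analyze the numbers of added Type-1/Type-2 MAWs and of added Type-3 MAWs separately. We begin with Type-3 MAWs (Lemma 13): $|M_3| \leq d - 2$ for any binary string $S$ of length $d \geq 3$.
Proof. Recall the injection $f$ used to bound $|M_3|$ in the general case. Here we show that the range of such an injection $f$ is $[2, d-1]$ for any binary string $S$ with $\sigma = 2$. Since the appended character is $\alpha = 0$, and since the candidate $x$ for the MAW of Type 3 which should be mapped to the first position in $S$ is of length 2, the candidate $x$ has to be either 00 or 10.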
(1) If $x = 00$, then $S[1] = 0$. If 00 does not occur in $S$ (see also the top picture of Figure 4), then 00 is already a MAW for $S$ (i.e. $00 \in \mathsf{MAW}(S)$). Thus $00 \notin \mathsf{MAW}(S0) \setminus \mathsf{MAW}(S)$ in this case. Otherwise (00 occurs in $S$), clearly 00 is not a MAW for $S0$ (see also the middle picture of Figure 4).
(2) If $x = 10$, then $S[1] = 1$. However, since the appended character is 0, 10 must occur somewhere in $S0$ (see also the bottom picture of Figure 4). Thus 10 is not a MAW for $S0$.
Hence, the first position of $S$ cannot be assigned to any MAW of Type 3 for $S0$, leading to $|M_3| \leq d - 2$ for any binary string $S$ of length $d \geq 3$.
In other words, Lemma 13 shows that in the binary case with σ = 2, the maximum number of added Type-3 MAWs is 1 less than in the case with σ ≥ 3.
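The overall binary bound $\max\{3, d\}$ stated in the introduction can be sanity-checked exhaustively for small window lengths. The following Python sketch enumerates every binary window of length $d$ and both possible appended characters; `max_binary_change` is a hypothetical name, and the brute-force `maws` helper is only feasible for small $d$.

```python
from itertools import product


def substrings(t):
    """All substrings of t, including the empty string."""
    return {t[i:j] for i in range(len(t) + 1) for j in range(i, len(t) + 1)}


def maws(t, alphabet):
    """Brute-force set of minimal absent words of t."""
    subs = substrings(t)
    return {x + b for x in subs for b in alphabet
            if x + b not in subs and (x + b)[1:] in subs}


def max_binary_change(d):
    """Maximum of |MAW(S) symmetric-difference MAW(Sc)| over binary S of length d."""
    best = 0
    for tup in product("01", repeat=d):
        s = "".join(tup)
        before = maws(s, "01")
        for c in "01":
            best = max(best, len(before ^ maws(s + c, "01")))
    return best


for d in range(1, 7):
    print(d, max_binary_change(d), max(3, d))
```

Each line prints $d$, the observed maximum number of changes on an append, and $\max\{3, d\}$ for comparison.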
Next, we consider the total number of added Type-1/Type-2 MAWs. From Lemma 6, the next corollary holds: $|M_1| + |M_2| \leq 2$ for any binary window $S$ in which the appended character $\alpha = 0$ already occurs.
Proof. First we consider Type-2 MAWs for $S0$. By Lemma 3, there are at most two MAWs in $M_2$. We assume that there are two MAWs in $M_2$ and let $au0$ and $a'u'1$ be the two MAWs, where $a, a' \in \Sigma_2$ and $u, u' \in \Sigma_2^*$.
Second we consider Type-3 MAWs for $S0$. Let $au0$ be the Type-3 MAW, where $a \in \Sigma_2$ and $u \in \Sigma_2^*$, since $u0$ must be a suffix of $S0$. By the definition of Type-3 MAWs, there has to be an occurrence of $au$ in $S$. Note that this occurrence has to be immediately followed by a 1.

We also gave an asymptotically tight bound $O(\min\{d, \sigma\}n)$ for the number $S(T, d)$ of total changes in the set of MAWs for every sliding window of length $d$ over any string $T$ of length $n$, where $\sigma$ is the alphabet size for the whole input string $T$.
The following open questions are intriguing:
• We showed a matching lower bound $S(T, d) \in \Omega(\min\{d, \sigma\}n)$ when $n - d \in \Omega(n)$. Is there a similar lower bound when $n - d \in o(n)$?
• Crochemore et al. [10] gave an online algorithm that maintains the set of MAWs for a sliding window of length $d$ in $O(\sigma n)$ time. Can one improve the running time to the optimal $O(\min\{d, \sigma\}n)$?