Sensitivity of string compressors and repetitiveness measures

The sensitivity of a string compression algorithm $C$ asks how much the output size $C(T)$ for an input string $T$ can increase when a single character edit operation is performed on $T$. This notion enables one to measure the robustness of compression algorithms in terms of errors and/or dynamic changes occurring in the input string. In this paper, we analyze the worst-case multiplicative sensitivity of string compression algorithms, which is defined by $\max_{T \in \Sigma^n}\{C(T')/C(T) : ed(T, T') = 1\}$, where $ed(T, T')$ denotes the edit distance between $T$ and $T'$. For the most common versions of the Lempel-Ziv 77 compressors, we prove that the worst-case multiplicative sensitivity is upper bounded by a small constant, and give matching lower bounds. We generalize these results to the smallest bidirectional scheme $b$. In addition, we show that the sensitivity of a grammar-based compressor called GCIS is also a small constant. Further, we extend the notion of the worst-case sensitivity to string repetitiveness measures such as the smallest string attractor size $\gamma$ and the substring complexity $\delta$, and show that the worst-case sensitivity of $\delta$ is also a small constant. These results contrast with previously known results showing that the size $z_{\rm 78}$ of the Lempel-Ziv 78 factorization can increase by a factor of $\Omega(n^{1/4})$ [Lagarde and Perifel, 2018], and that the number $r$ of runs in the Burrows-Wheeler transform can increase by a factor of $\Omega(\log n)$ [Giuliani et al., 2021] when a character is prepended to an input string of length $n$. By applying our sensitivity bounds for $\delta$ or the smallest grammar to known results (cf. [Navarro, 2021]), we derive non-trivial upper bounds for the sensitivities of important string compressors and repetitiveness measures including $\gamma$, $r$, LZ-End, RePair, LongestMatch, and AVL-grammar.


Introduction
In this paper we introduce a new notion to quantify the efficiency of (lossless) compression algorithms, which we call the sensitivity of compressors. Let C be a compression algorithm and let C(T) denote the size of the output of C applied to an input text (string) T. Roughly speaking, the sensitivity of C measures how much the compressed size C(T) can change when a single-character edit operation is performed at an arbitrary position in T. Namely, the worst-case multiplicative sensitivity of C is defined by

max_{T ∈ Σ^n} {C(T')/C(T) : ed(T, T') = 1},

where ed(T, T') denotes the edit distance between T and T'. This new and natural notion enables one to measure the robustness of compression algorithms in terms of errors and/or dynamic changes occurring in the input string. Such errors and dynamic changes are commonly seen in real-world texts such as DNA sequences and versioned documents.
The so-called highly repetitive sequences, which are strings containing a lot of repeated fragments, are abundant today: Semi-automatically generated strings via M2M communications, and collections of individual genomes of the same/close species are typical examples. By intuition, such highly repetitive sequences should be highly compressible, however, statistical compressors are known to fail to capture repetitiveness in a string [37]. Therefore, other types of compressors, such as dictionary-based, grammar-based, and/or lex-based compressors are often used to compress highly repetitive sequences [41,63,38,24,48].
Let us recall two examples of well-known compressors. The run-length Burrows-Wheeler transform (RLBWT) is a compressor based on the lexicographically sorted rotations of the input string. The number r of equal-character runs in the BWT of a string is known to be very small in practice: indeed, the BWT is used in the bzip2 compression format, and several compressed data structures supporting efficient queries have been proposed [16,3,55,56]. The Lempel-Ziv 78 compression (LZ78) [69] is one of the most fundamental dictionary-based compressors and is at the core of the gif and tiff compression formats. While LZ78 only allows Ω(√n) compression for any string of length n, its simple structure allows for designing efficient compressed pattern matching algorithms and compressed self-indices (cf. [32,18,19,46,15] and references therein).
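To make the LZ78 parsing concrete, here is a minimal sketch that counts LZ78 factors; the function name and the naive set-based dictionary are ours, chosen for clarity rather than efficiency:

```python
def lz78_size(t: str) -> int:
    """Number of factors in the LZ78 factorization of t.

    Each factor is the longest previously produced factor that is a
    prefix of the remaining suffix, extended by one fresh character.
    """
    factors = set()
    count = 0
    i, n = 0, len(t)
    while i < n:
        j = i + 1
        # grow the candidate while it coincides with an existing factor
        while j < n and t[i:j] in factors:
            j += 1
        factors.add(t[i:j])
        count += 1
        i = j
    return count
```

For the unary string a^n this yields Θ(√n) factors (e.g. a | aa | aaa | aaaa for n = 10), matching the Ω(√n) compression limit mentioned above.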
The recent work by Giuliani et al. [22], however, shows that the number r of runs in the BWT of a string of length n can grow by a multiplicative factor of Ω(log n) when a single character is prepended to the input string. It is noteworthy that the family of strings discovered by Giuliani et al. [22] satisfies r(T) = O(1) and r(T') = Ω(log n), where r(T) and r(T') respectively denote the numbers of runs in the BWTs of T and T'. The other work by Lagarde and Perifel [40] shows that the size of the dictionary of LZ78, which is equal to the number of factors in the respective LZ78 factorization, can grow by a multiplicative factor of Ω(n^{1/4}), again when a single character is prepended to the input string. Letting z_78 denote the LZ78 dictionary size, this multiplicative increase can also be described as Ω(z_78^{3/2}). Lagarde and Perifel call this phenomenon on LZ78 the "one-bit catastrophe". Based on these known results, here we introduce the following three classes of string compressors depending on their sensitivity.
(A) Those whose sensitivity is O(1); (B) Those whose sensitivity is polylog(n); (C) Those whose sensitivity is proportional to n^c for some constant 0 < c ≤ 1.
By generalizing the work of Lagarde and Perifel [40], we say that Class (C) is catastrophic in terms of the sensitivity. Class (B) may not be catastrophic, but the change in the compressed size can still be quite large for a mere single-character edit operation on the input string. Class (A) is the most robust against one-character edit operations among the three classes. Recall that the LZ78 size z_78 belongs to Class (C), while it is not yet clear whether the RLBWT size r belongs to Class (B) or (C) (note that the work of Giuliani et al. [22] showed only a lower bound of Ω(log n)). In this paper, we show that the other major dictionary compressors, the Lempel-Ziv 77 compression family, belong to Class (A), and thus such a catastrophe never happens with this family. The LZ77 compression [68], which is the greedy parsing of the input string T where each factor of length more than one refers to a previous occurrence to its left, is the most important dictionary-based compressor both in theory and in practice. The LZ77 compression without self-references (resp. with self-references) can achieve O(log n) compression (resp. O(1) compression) in the best case, as opposed to the Ω(√n) compression by the LZ78 counterpart, and the LZ77 compression is at the core of common lossless compression formats including gzip, zip, and png. In addition, its famous version called LZSS (Lempel-Ziv-Storer-Szymanski) [64] has numerous applications in string processing, including finding repetitions [13,36,23,4], approximation of the smallest grammar-based compression [62,11], and compressed self-indexing [7,8,47,5], just to mention a few.
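As an illustration of the two parsing rules discussed here, the following sketch counts the factors of the self-referencing LZ77 parsing (longest previous match plus one fresh character) and of LZSS (each factor is either a single character or a longest previous match); the names lz77_size and lzss_size and the quadratic-time scans are ours, for exposition only:

```python
def lz77_size(t: str) -> int:
    """Factors of the self-referencing LZ77 parsing of t: each factor is
    the longest prefix of the remaining suffix that occurs starting at an
    earlier position (overlaps allowed), extended by one fresh character."""
    n, count, i = len(t), 0, 0
    while i < n:
        ell = 0
        # longest match whose previous occurrence starts before position i
        while i + ell < n and t.find(t[i:i + ell + 1], 0, i + ell) != -1:
            ell += 1
        i += ell + 1  # matched part plus one fresh character
        count += 1
    return count

def lzss_size(t: str) -> int:
    """Factors of the self-referencing LZSS parsing of t: each factor is
    either a single fresh character or a longest previous match."""
    n, count, i = len(t), 0, 0
    while i < n:
        ell = 1
        while i + ell <= n and t.find(t[i:i + ell], 0, i + ell - 1) != -1:
            ell += 1
        i += max(1, ell - 1)  # take the longest match, or one character
        count += 1
    return count
```

For example, lz77_size("ababab") == 3 (the parsing a | b | abab) and lzss_size("aaaa") == 2 (a single a followed by a self-referencing match of length 3).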
We show that the multiplicative sensitivity of LZ77 with/without self-references is at most 2; namely, the number of factors in the respective LZ77 factorization can increase by at most a factor of 2 for all types of edit operations (substitution, insertion, and deletion of a character). Then, we prove that the multiplicative sensitivity of LZSS with/without self-references is at most 3 for substitutions and deletions, and at most 2 for insertions. We also present matching lower bounds for the multiplicative sensitivity of LZ77/LZSS with/without self-references for all types of edit operations. In addition, the multiplicative sensitivity of the RLBWT size r turns out to be O(log r log n), which implies that r belongs to Class (B). These results suggest that LZ77 and LZSS of Class (A) may better capture the repetitiveness of strings than the RLBWT of Class (B) and LZ78 of Class (C), since a mere single-character edit operation should not much influence the repetitiveness of a sufficiently long string. We also consider the smallest bidirectional scheme [64], a generalization of the LZ family in which each factor can refer to another occurrence of itself to its left or right. We show that, for all types of edit operations, the multiplicative sensitivity of the size b of the smallest bidirectional scheme is at most 2, and that there exist strings for which the multiplicative sensitivity of b is 2 for insertions and substitutions, and 1.5 for deletions. The smallest grammar problem [11] is a famous NP-hard problem that asks to compute a grammar of the smallest size g* that derives only the input string. We show that the multiplicative sensitivity of the smallest grammar size g* is at most 2. Further, we extend the notion of the worst-case multiplicative sensitivity to string repetitiveness measures such as the size γ of the smallest string attractor [30] and the substring complexity δ [35], both receiving recent attention [29,60,39,43,12].
We prove that the value of δ can increase by at most a factor of 2 for substitutions and insertions, and by at most a factor of 1.5 for deletions. We show these upper bounds are also tight by presenting matching lower bounds for the sensitivity of δ. We also present non-trivial upper and lower bounds for the sensitivity of γ.
As is mentioned above, the work by Lagarde and Perifel [40] considered only the case of prepending a character to the string for the multiplicative sensitivity of LZ78. We show that the same lower bounds hold for the multiplicative sensitivity of LZ78 in the case of substitutions and deletions, and insertions inside the string, by using a completely different instance from the one used in [40].
Studying the relations between different string repetitiveness measures/string compressor output sizes has attracted much attention in the last two decades (for details see the survey [48]). Combining these known relations with our new sensitivity upper bounds mentioned above gives us a kind of "sandwich" argument, which is formalized in Lemma 1. Using this lemma, non-trivial upper bounds for the sensitivity of other measures can be derived, including the LZ-End compressor [37] and the grammar-based compressors RePair [41], LongestMatch [33], Greedy [2], Sequential [66], LZ78 [69], α-balanced grammars [11], AVL-grammars [62], and Simple [26]. These upper bound results are reported as corollaries in the following sections.
Moreover, we consider the sensitivity of other compressors and repetitiveness measures including Bisection [52], GCIS [58,59], and CDAWGs [10]. Table 1 summarizes our results on the multiplicative sensitivity of the string compressors and repetitiveness measures.

Table 1: Multiplicative sensitivity of the string compressors and string repetitiveness measures studied in this paper and in the literature, where n is the input string length and Σ is the alphabet. In the table, "sr" stands for "with self-references". The upper bounds marked with "†" are obtained by applying known results [30,35,28,37,31,11,62,26] and our results on the sensitivity of the substring complexity δ or the smallest grammar g* to Lemma 1.

In addition to the afore-mentioned multiplicative sensitivity, we also introduce the worst-case additive sensitivity, which is defined by max_{T ∈ Σ^n} {C(T') − C(T) : ed(T, T') = 1} for all the string compressors/repetitiveness measures C dealt with in this paper. We remark that the additive sensitivity allows one to observe and evaluate the changes of the output sizes in more detail, as summarized in Table 2. For instance, we obtain strictly tight upper and lower bounds for the additive sensitivity of LZ77 with and without self-references in the case of substitutions and insertions. Studying the additive sensitivities of string compressors is also motivated by approximation of the Kolmogorov complexity. Let K(T) denote the Kolmogorov complexity of string T, that is, the length of a shortest program that produces T. While K(T) is known to be uncomputable, the additive sensitivity K(T') − K(T) for deletions is at most O(log n) bits, since it suffices to append "Delete the i-th character T[i] from T." to the end of the program. Similarly, the additive sensitivity of K for insertions and substitutions is at most O(log n + log σ) bits, where σ is the alphabet size. Therefore, a "good approximation" of the Kolmogorov complexity K should have a small additive sensitivity.

String monotonicity
A string repetitiveness measure C is called monotone if, for any string T of length n, C(T') ≤ C(T) holds for any of its prefixes T' = T[1..i] and suffixes T' = T[j..n] [35]. Kociumaka et al. [35] pointed out that δ is monotone, and posed the question of whether γ or the size b of the smallest bidirectional macro scheme [64] is monotone. This monotonicity of C can be seen as a special case of our sensitivity for deletions: if we restrict T' to be the string obtained by deleting either the first or the last character from T, then it is equivalent to asking whether max{C(T') : T' ∈ {T[1..n−1], T[2..n]}}/C(T) ≤ 1. Mantaci et al. [43] proved that γ is not monotone by showing a family of strings T such that γ(T) = 2 and γ(T') = 3 with T' = T[1..n−1], which immediately leads to a lower bound of 3/2 = 1.5 for the multiplicative sensitivity of γ. In this paper, we present a new lower bound of 2 for the multiplicative sensitivity of γ. Mitsuya et al. [45] considered the monotonicity of the LZ77 factorization size without self-references z_77 and presented a family of strings T for which z_77(T')/z_77(T) ≈ 4/3 with T' = T[2..n]. Again, our matching upper and lower bounds for the multiplicative sensitivity of z_77, which are both 2, improve this 4/3 bound.

Comparison to sensitivity of other algorithms
The notion of the sensitivity of (general) algorithms was first introduced by Varma and Yoshida [65]. They studied the average sensitivity of well-known graph algorithms, and presented interesting lower and upper bounds on the expected number of changes in the output of an algorithm A when a randomly chosen edge is deleted from the input graph G. The worst-case sensitivity of a graph algorithm for edge deletions and vertex deletions was considered by Yoshida and Zhou [67]. As opposed to these existing works on the sensitivity of graph algorithms, our notion of the sensitivity of string compressors focuses on the size of their compressed outputs and does not formulate the perturbation of their structural changes. This is because the primary task of data compression is to represent the input data with as little memory as possible, and the structural changes of the compressed outputs can be of secondary importance.
We remark that most strings in Σ^n are not compressible; in other words, a randomly chosen string T from Σ^n is not compressible. Such a string T does not become highly compressible just after a one-character edit operation, and hence C(T) and C(T') are expected to be almost the same. Therefore, the average sensitivity of string compressors and repetitiveness measures does not seem worth discussing, and this is why we focus on the worst-case sensitivity of string compressors and repetitiveness measures. Still, our notion permits one to evaluate the worst-case size changes of several known compressed string data structures in the dynamic setting, as will be discussed in the following subsection.

Table 2: Additive sensitivity of the string compressors and string repetitiveness measures studied in this paper, where n is the input string length and Σ is the alphabet. Some upper/lower bounds are described in terms of both the measure and n. In the table, "sr" stands for "with self-references". The upper bounds marked with "†" are obtained by applying known results [30,35,28,37,31,11,62,26] and our results on the sensitivity of the substring complexity δ or the smallest grammar g* to Lemma 1.

Compressed string data structures
A compressed string data structure is built on a compressed representation of the string and supports efficient queries such as pattern matching and substring extraction within compressed space. Since the string compressors and string repetitiveness measures that we deal with in this paper are models for highly repetitive strings, we mention some compressed string indexing structures for highly repetitive sequences below.
The Block tree of a string of length n uses O(z_SS log(n/z_SS)) words of space and supports random access queries in O(log(n/z_SS)) time. Navarro [47] proposed an LZ-based indexing structure that uses O(z_SS log(n/z_SS)) words of space and counts the number of occurrences of a query pattern in the text string in O(m log^{2+ε} n) time, where m is the length of the pattern and ε > 0 is any constant. An O(log n)-time longest common extension (LCE) data structure that takes O(z_SS log(n/z_SS)) space and is based on Recompression [26] was proposed by I [25]. Nishimoto et al. [54] presented a dynamic O(min{z_SS log n log* n, n})-space compressed data structure that supports pattern matching and substring insertions/deletions in O(m · polylog(n)) time, where m is the length of the pattern/substring. Kociumaka et al. [35] proposed a compressed indexing structure that uses O(δ log(n/δ)) words of space, performs random access in O(log(n/δ)) time, and finds all the occ occurrences of a given pattern of length m in O(m log n + occ log n) time. Very recently, Kociumaka et al. [34] proposed an improved O(δ log(n/δ))-space data structure that supports pattern matching queries in O(m + (occ + 1) log n) time. Two independent compressed indexing structures based on the grammar compression GCIS (Grammar Compression by Induced Sorting) [58] have been proposed [1,14]. Our constant upper bounds on the multiplicative sensitivity for z_SS, δ, and g_is imply that the afore-mentioned compressed data structures retain their asymptotic space complexities even after a one-character edit operation at an arbitrary position, though they may incur a certain amount of structural changes.
The r-index [16], the refined r-index [3], and the OptBWTR [55] are efficient indexing structures built on the RLBWT that use O(r) words of space. The result by Giuliani et al. [22], which uses a family of strings of length n with r = O(1), shows that the space complexity of these indexing structures can grow from O(1) words to O(log n) words after prepending a character to the string. In turn, our upper bound for the sensitivity of r implies that after a one-character edit operation, the space usage of these indexing structures is bounded by O(r log r log n) for any string of length n.
There also exist compressed data structures based on other string compressors and/or repetitiveness measures. Kempa and Prezza [30] presented an O(γτ log_τ(n/γ))-space data structure that allows for extracting substrings of length ℓ in O(log_τ(n/γ) + ℓ log(σ)/ω) time, where τ ≥ 2 is an integer parameter, σ is the alphabet size, and ω is the machine word size in the RAM model. Navarro and Prezza [50] gave a data structure of size O(γ log(n/γ)) that supports pattern matching queries in O(m log n + occ log n) time. Christiansen et al. [12] introduced a compressed indexing structure that occupies O(γ log(n/γ) log n) space and finds all the occ pattern occurrences in optimal O(m + occ) time (other trade-offs between the space and the query time are also reported in [12]). Gawrychowski et al. [21] presented a data structure for maintaining a dynamic set of strings, which is based on Recompression by Jeż [26]. Kempa and Saha [31] developed a compressed data structure that occupies O(z_End) space and supports random access and LCE queries in O(polylog(n)) time. A compressed indexing structure that can be built directly from the LZ77-compressed text is also known [28,27]. For other compressed string indexing structures, see the survey [49].

Paper organization
Section 2 introduces necessary notations. We then present the worst-case sensitivity of string compressors and repetitiveness measures in increasing order of their respective sizes, from δ to γ, the LZ77 family, LZ-End, and grammars: Section 3 deals with the substring complexity δ; Section 4 deals with the smallest string attractor γ; Section 5 deals with the RLBWT r; Section 6 deals with the smallest bidirectional scheme b; Section 7 deals with LZ77 with/without self-references z_77 and z_77sr; Section 8 deals with LZSS with/without self-references z_SS and z_SSsr; Section 9 deals with LZ-End z_End; Section 10 deals with LZ78 z_78; Section 11 deals with the smallest grammar g*, and its applications to practical and/or approximation grammars RePair g_rpair, LongestMatch g_long, Greedy g_grdy, Sequential g_seq, LZ78 z_78, α-balanced grammars g_α, AVL-grammars g_avl, and Simple grammars g_simple; Section 12 deals with the GCIS grammar g_is; Section 13 deals with the Bisection grammar g_bsc; Section 14 deals with the CDAWG size e. In Section 15 we conclude the paper and list several open questions of interest.

Strings, factorizations, and grammars
Let Σ be an alphabet of size σ. An element of Σ* is called a string. For any non-negative integer n, let Σ^n denote the set of strings of length n over Σ. The length of a string T is denoted by |T|. The empty string ε is the string of length 0, namely, |ε| = 0. The i-th character of a string T is denoted by T[i] for 1 ≤ i ≤ |T|, and the substring of T that begins at position i and ends at position j is denoted by T[i..j] for 1 ≤ i ≤ j ≤ |T|. T[1..j] and T[i..|T|] are respectively called a prefix and a suffix of T.
A factorization of a non-empty string T is a sequence f 1 , . . . , f x of non-empty substrings of T such that T = f 1 · · · f x . Each f i is called a factor. The size of the factorization is the number x of factors in the factorization.
A context-free grammar G which generates only a single string T is called a grammar compression for T . The size of G is the total length of the right-hand sides of all the production rules in G. The height of G is the height of the derivation tree of G.

Worst-case sensitivity of compressors and repetitiveness measures
For a string compression algorithm C and an input string T , let C(T ) denote the size of the compressed representation of T obtained by applying C to T . For convenience, we use the same notation when C is a string repetitiveness measure, namely, C(T ) is the value of the measure C for T .
Let us consider the following edit operations on strings: character substitution (sub), character insertion (ins), and character deletion (del). For two strings T and S, let ed(T, S) denote the edit distance between T and S, namely, ed(T, S) is the minimum number of edit operations that transform T into S.
Our interest in this paper is: "How much can the compression size or the repetitiveness measure change when a single-character edit operation is performed on a string?" To answer this question, for a given string length n, we consider an arbitrarily fixed string T of length n and all strings T' that can be obtained by applying a single edit operation to T, that is, ed(T, T') = 1. We define the worst-case multiplicative sensitivity of C w.r.t. a substitution, insertion, and deletion as follows:

MS_sub(C, n) = max_{T ∈ Σ^n} {C(T')/C(T) : ed(T, T') = 1, |T'| = n},
MS_ins(C, n) = max_{T ∈ Σ^n} {C(T')/C(T) : ed(T, T') = 1, |T'| = n + 1},
MS_del(C, n) = max_{T ∈ Σ^n} {C(T')/C(T) : ed(T, T') = 1, |T'| = n − 1}.

We also consider the worst-case additive sensitivity of C w.r.t. a substitution, insertion, and deletion, as follows:

AS_sub(C, n) = max_{T ∈ Σ^n} {C(T') − C(T) : ed(T, T') = 1, |T'| = n},
AS_ins(C, n) = max_{T ∈ Σ^n} {C(T') − C(T) : ed(T, T') = 1, |T'| = n + 1},
AS_del(C, n) = max_{T ∈ Σ^n} {C(T') − C(T) : ed(T, T') = 1, |T'| = n − 1}.

We remark that, in general, C(T') can be larger than C(T) even when T' is obtained by a character deletion from T (i.e., |T'| = n − 1). Such strings T are already known for the Lempel-Ziv 77 factorization size z_77 when T' = T[2..n] [45], and for the smallest string attractor size γ when T' = T[1..n−1] [43]. This remark implies that in general the multiplicative/additive sensitivities for insertions and deletions may not be symmetric, and therefore they need to be discussed separately for some C. Note, on the other hand, that the maximum difference between C(T) and C(T') when |T'| = n − 1 (deletion) and C(T') − C(T) < 0 is equivalent to AS_ins(C, n − 1), and symmetrically the maximum difference between C(T) and C(T') when |T'| = n + 1 (insertion) and C(T') − C(T) < 0 is equivalent to AS_del(C, n + 1), with the roles of T and T' exchanged. Similar arguments hold for the multiplicative sensitivity with insertions/deletions. Consequently, it suffices to consider MS_ins(C, n), MS_del(C, n), AS_ins(C, n), and AS_del(C, n) for insertions/deletions.

Consider two measures α and β. An upper bound for the multiplicative sensitivity of β can readily be derived in some cases, as follows:

Lemma 1. Let T be any string of length n and let T' be any string with ed(T, T') = 1.
Suppose the following condition holds: β(T) = O(α(T) · f(n, α(T))), where f is a function such that for any constant c there exists a constant c' satisfying f(n, c · α(T)) ≤ c' · f(n, α(T)).
Proof. Let c = α(T')/α(T), where c is a constant. Then we have β(T') = O(α(T') · f(n, α(T'))) = O(c · α(T) · f(n, c · α(T))) = O(α(T) · f(n, α(T))). The functions satisfying f(n, c · α(T)) ≤ c' · f(n, α(T)) include functions f that are polynomial, poly-logarithmic, or constant in terms of α(T).
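For small n, the worst-case sensitivities defined above can be evaluated by exhaustive search. The sketch below (our own illustrative code over a binary alphabet, with the run-length-encoding size as a stand-in measure C) enumerates all T ∈ {a, b}^n and all T' with ed(T, T') = 1:

```python
from itertools import product

def single_edits(t, alphabet="ab"):
    """All strings T' with ed(T, T') = 1."""
    for i in range(len(t)):
        yield t[:i] + t[i + 1:]                  # deletion
        for a in alphabet:
            if a != t[i]:
                yield t[:i] + a + t[i + 1:]      # substitution
    for i in range(len(t) + 1):
        for a in alphabet:
            yield t[:i] + a + t[i:]              # insertion

def rle_size(t):
    """Stand-in measure: number of maximal equal-character runs."""
    return 1 + sum(1 for i in range(1, len(t)) if t[i] != t[i - 1])

def multiplicative_sensitivity(c, n, alphabet="ab"):
    """Brute-force worst-case ratio c(T') / c(T) over all T of length n."""
    return max(c(t2) / c(t)
               for tup in product(alphabet, repeat=n)
               for t in ["".join(tup)]
               for t2 in single_edits(t, alphabet))
```

For instance, multiplicative_sensitivity(rle_size, 4) evaluates to 3.0: the worst case is T = aaaa (one run) and, e.g., T' = aabaa (three runs).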

Substring Complexity
In this section, we consider the worst-case sensitivity of the string repetitiveness measure δ, which is the substring complexity of strings [35]. For any string T of length n, the substring complexity δ(T ) is defined as δ(T ) = max 1≤k≤n (Substr(T, k)/k), where Substr(T, k) is the number of distinct substrings of length k in T . It is known that δ(T ) ≤ γ(T ) holds for any T [35].
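The definition of δ can be evaluated naively for short strings; the following sketch (an illustrative cubic-time implementation, with a function name of our choosing) mirrors the formula δ(T) = max_k Substr(T, k)/k:

```python
def delta(t: str) -> float:
    """Substring complexity of t: the maximum over k of the number of
    distinct length-k substrings of t, divided by k."""
    n = len(t)
    return max(len({t[i:i + k] for i in range(n - k + 1)}) / k
               for k in range(1, n + 1))
```

For example, delta("aaaaaaaa") == 1.0, while delta("aaaaaaab") == 2.0 (two distinct characters at k = 1).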
In what follows, we present tight upper and lower bounds for the multiplicative sensitivity of δ for all cases of substitutions, insertions, and deletions. We also present the additive sensitivity of δ.
3.1 Lower bounds for the sensitivity of δ

Theorem 1. The following lower bounds on the sensitivity of δ hold:

Proof. Substitutions: Consider the strings T = a^n and T' = a^{n−1}b. Then δ(T) = 1 and δ(T') = 2 hold. Thus we get MS_sub(δ, n) ≥ 2 and AS_sub(δ, n) ≥ 1.
• For 3m < k ≤ n: The prefix w_1 a w_2 contains at most three distinct substrings for every such k, and the substrings w_3 and w_4 contain no substrings of length k > 3m. The remaining distinct substrings must again contain the positions in [6m + 4, 6m + 5] or [9m + 4, 9m + 5]. These substrings can also be described in a similar way to the previous case for 3 ≤ k ≤ 3m, except for how we should remove duplicates. We have the following two sub-cases: – For k = 3m + 1: Since a^k = a^{3m+1} has no occurrences in T but (abb)^{k/3} has other occurrences and has already been counted, the number of such distinct substrings is at most 2(k − 1) − 1.
Consider the string T' that can be obtained from T by removing T[3m + 1] = a between w_1 and w_2. We consider the number of distinct substrings of length 3m + 1 in T': because of the lengths of the w_j with j ∈ {1, 2, 3, 4}, each substring of length 3m + 1 either is completely contained in w_2 or contains some boundary of a w_j.
• The suffix w_3 w_4 = a^{3m}(bba)^m contains 3m − 1 distinct substrings of length 3m + 1 (note that a(bba)^m is a duplicate and is not counted here).

Upper bounds for the sensitivity of δ

Theorem 2. The following upper bounds on the sensitivity of δ hold:

Proof. First we consider the additive sensitivity of δ. For each k, the number of substrings of length k that contain the edited position i is clearly at most k. Therefore, after a substitution or insertion, at most k new distinct substrings of length k can appear in the modified string T'. Also, after a deletion, at most k − 1 new distinct substrings of length k can appear in T'. Hence, in the case of substitutions and insertions, δ(T') ≤ max_{1≤k≤n}((Substr(T, k) + k)/k) ≤ max_{1≤k≤n}(Substr(T, k)/k) + max_{1≤k≤n}(k/k) = δ(T) + 1 holds. Also, in the case of deletions, δ(T') ≤ max_{1≤k≤n}((Substr(T, k) + k − 1)/k) ≤ δ(T) + max_{1≤k≤n}((k − 1)/k) holds. Thus we obtain AS_sub(δ, n) ≤ 1, AS_ins(δ, n) ≤ 1, and lim sup_{n→∞} AS_del(δ, n) ≤ lim sup_{k→∞}(k − 1)/k = 1.
Next we consider the multiplicative sensitivity of δ. Note that δ(T) ≥ 1 for any non-empty string T, since Substr(T, 1) ≥ 1. Combining this with the afore-mentioned additive sensitivity, we obtain MS_sub(δ, n) ≤ 2 and MS_ins(δ, n) ≤ 2. For the case of deletions, observe that δ(T) = 1 only if T is a unary string; however, δ cannot increase after a deletion in this case, since T' is also a unary string. Thus we can restrict ourselves to the case where T contains at least two distinct characters, so that δ(T) ≥ 2. Then we have lim sup_{n→∞} MS_del(δ, n) ≤ 1.5, which is achieved when δ(T) = 2 and δ(T') = 2 + (k − 1)/k with k → ∞.
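As a sanity check, the multiplicative bounds above can be verified exhaustively on short binary strings; this self-contained sketch (a naive δ computation and edit enumeration, with helper names of our choosing) confirms that no single edit more than doubles δ:

```python
from itertools import product

def delta(t):
    """Naive substring complexity: max_k |Substr(t, k)| / k."""
    n = len(t)
    return max(len({t[i:i + k] for i in range(n - k + 1)}) / k
               for k in range(1, n + 1))

def edits(t, alphabet="ab"):
    """All strings at edit distance 1 from t."""
    for i in range(len(t)):
        yield t[:i] + t[i + 1:]                  # deletion
        for a in alphabet:
            if a != t[i]:
                yield t[:i] + a + t[i + 1:]      # substitution
    for i in range(len(t) + 1):
        for a in alphabet:
            yield t[:i] + a + t[i:]              # insertion

# worst ratio delta(T') / delta(T) over all binary T with 1 <= |T| <= 8
worst = max(delta(t2) / delta(t)
            for n in range(1, 9)
            for tup in product("ab", repeat=n)
            for t in ["".join(tup)]
            for t2 in edits(t) if t2)
assert worst <= 2  # consistent with the upper bounds above
```

The maximum ratio 2 is attained by substitutions such as T = aa and T' = ab.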

String Attractors
In this section, we consider the worst-case sensitivity of the string repetitiveness measure γ, which is the size of the smallest string attractor [30]. A string attractor Γ(T) for a string T is a set of positions in T such that every substring of T has an occurrence containing a position in Γ(T). We denote the size of the smallest string attractor of T by γ(T). It is known that γ(T) is upper bounded by each of z_77(T), r(T), and e(T) for any string T [30].
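The attractor property can be verified directly from the definition; the following naive sketch (function name is ours, 0-based positions, intended only for tiny examples) checks whether a given set of positions is a string attractor:

```python
def is_attractor(t: str, positions: set) -> bool:
    """Check that every distinct substring of t has at least one
    occurrence covering some position in `positions` (0-based)."""
    n = len(t)
    substrings = {t[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    for s in substrings:
        covered = False
        start = t.find(s)
        while start != -1:                        # scan all occurrences of s
            if any(start <= p < start + len(s) for p in positions):
                covered = True
                break
            start = t.find(s, start + 1)
        if not covered:
            return False
    return True
```

For example, is_attractor("abab", {1, 2}) holds, while {0} alone fails because no occurrence of "b" covers position 0; a single position can cover only one of the two distinct characters.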

Upper Bounds for the sensitivity of γ
In this section, we present some upper bounds for the worst-case sensitivity of the smallest string attractor size γ. We use the following known result:

Theorem 4 (Lemma 3.7 of [30]). For any string T, γ(T) ≤ z_SSsr(T).
We are ready to show our results. The following upper bounds on the sensitivity of γ hold:

Proof. Let T be any string of length n, and let T' be any string such that ed(T, T') = 1.

Run-Length Burrows-Wheeler Transform (RLBWT)
The Burrows-Wheeler transform (BWT) of a string T, denoted BWT(T), is the string obtained by concatenating the last characters of the lexicographically sorted rotations of T. The run-length BWT (RLBWT) of T is the run-length encoding of BWT(T), and r(T) denotes its size, i.e., the number of maximal character runs in BWT(T). For example, for the string T = abbaabababab, r(T) = 4 since BWT(T) = babbbbbaaaaa consists of the four maximal character runs b^1 a^1 b^5 a^5.
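This definition can be reproduced with a short sketch (a naive construction via sorted rotations; function names are ours), which recovers the example above:

```python
def bwt(t: str) -> str:
    """Burrows-Wheeler transform of t: last characters of the
    lexicographically sorted rotations of t."""
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)

def rlbwt_runs(t: str) -> int:
    """r(t): number of maximal equal-character runs in BWT(t)."""
    b = bwt(t)
    return 1 + sum(1 for i in range(1, len(b)) if b[i] != b[i - 1])
```

Indeed, bwt("abbaabababab") returns "babbbbbaaaaa" and rlbwt_runs("abbaabababab") == 4, matching the example.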
Theorem 7 (Theorem 1 of [22]). There exists a family of strings S such that r(S) = 2 and r(S') = Θ(log n), where n = |S| and S' is a string obtained by prepending a character to S. The string S is a reversed Fibonacci word.
Theorem 7 immediately leads to the following lower bounds for the sensitivity of r. The following lower bounds on the sensitivity of the RLBWT with |Σ| = 2 hold: insertions: MS_ins(r, n) = Ω(log n) and AS_ins(r, n) = Ω(log n).
To obtain a non-trivial upper bound for the sensitivity of r, we can use the following known result:

Theorem 8 (Theorem III.7 of [28]). For any string T of length n,

Proof. For any string T, it is known that δ(T) ≤ r(T) [30,35]. We also use a simplified and relaxed bound r(T) = O(δ(T) log n log δ(T)) from Theorem 8, which always holds and is sufficient for our purpose.
Let T' be any string with ed(T, T') = 1. It follows from Theorem 2 that δ(T') ≤ 2δ(T). Therefore, we obtain r(T') = O(δ(T') log n log δ(T')) = O(δ(T) log n log δ(T)) = O(r(T) log n log r(T)) by Lemma 1. This leads to the claimed upper bounds for the sensitivity of r.
We remark that the lower bounds MS_ins(r, n) = Ω(log n) and AS_ins(r, n) = Ω(log n) from Theorem 7 and Corollary 2 are asymptotically tight when r = O(1), since MS_ins(r, n) = O(log n log r) = O(log n) and AS_ins(r, n) = O(log n) in this case.

Bidirectional Scheme
In this section, we consider the worst-case sensitivity of the size b of the smallest bidirectional scheme [64]. For example, for the string T = abaabababbbba, B shown below is a valid bidirectional scheme of the smallest size possible, together with its corresponding factorization. In what follows, we present upper and lower bounds for the multiplicative/additive sensitivity of b. It is noteworthy that our upper and lower bounds for the multiplicative sensitivity of b for substitutions and insertions are tight.

Lower bounds for the sensitivity of b
Theorem 9. The following lower bounds on the sensitivity of b hold: MS_sub(b, n) ≥ 2 and MS_ins(b, n) ≥ 2.

Proof. substitutions: Consider strings T = a^n and T' = a^{⌈n/2⌉−1} b a^{⌊n/2⌋}. Then b(T) = 2 and b(T') = 4 hold. Thus we get MS_sub(b, n) ≥ 2.
insertions: Consider strings T = a^n and T' = a^{⌈n/2⌉} b a^{⌊n/2⌋}. Then b(T) = 2 and b(T') = 4 hold. Thus we get MS_ins(b, n) ≥ 2.
The family of strings used in Theorem 9 gives us tight lower bounds for the multiplicative sensitivities. However, this family only provides a weak lower bound of 2 for the additive sensitivity of b. The following theorem gives stronger lower bounds for the additive sensitivity of b. We remark that this theorem also leads to a non-trivial lower bound for the multiplicative sensitivity of b in the case of deletions.
Theorem 10. The following lower bounds on the sensitivity of b hold: AS_sub(b, n), AS_ins(b, n), AS_del(b, n) = Ω(√n).

Proof. Consider the string T in which #_j for every 1 ≤ j ≤ k is a distinct character. One of the valid bidirectional schemes B for T has size 2k + 4, and thus b(T) ≤ 2k + 4.
As for substitutions, let T' be the string obtained by substituting the leftmost occurrence of x, at position k + 1 in T, with a character y such that y ≠ x. Then one of the valid bidirectional schemes B' of T' has size 3k + 5. We show that B' is a valid bidirectional scheme for T' of the smallest size possible, namely, b(T') = 3k + 5. Since y and #_j for every 1 ≤ j ≤ k are unique characters in T', they have to be ground phrases. Also, since each substring a^{k−j+1} x a^j of length k + 2 for all 1 ≤ j ≤ k and a^{k+1} are unique in T', each corresponding interval has to contain at least one boundary of phrases. In addition, at least one occurrence of x has to be a ground phrase. Then, b(T') = 3k + 5 holds. Since |T| = n = k^2 + 5k + 2, we have k = Θ(√n). Hence, we get lim inf_{n→∞} MS_sub(b, n) ≥ 1.5 and AS_sub(b, n) ≥ k + 1 = b(T)/2 − 1 = Ω(√n). Moreover, by considering the case where the character T[k + 1] is deleted and the case where the character y is inserted between positions k + 1 and k + 2, we obtain Theorem 10.

Upper bounds for the sensitivity of b
Theorem 11. The following upper bounds on the sensitivity of b hold: b(T') ≤ 2b(T) + 2 for any string T and any string T' with ed(T, T') = 1.

Proof. In the following, we consider the case where T[i] = a is substituted by a character # that does not occur in T. The other cases, insertions, deletions, and substitutions with another character b (≠ a) occurring in T, can be proven similarly. We show how to construct a valid bidirectional scheme for T' of some size b' ≥ b(T') by dividing each phrase of B into several phrases, where B is a valid bidirectional scheme for T of the smallest size possible. Here the jth phrase f_j starts at position p_j and, if it is not a ground phrase, copies from reference position q_j; such a phrase is written as (q_j, |f_j|). We categorize each phrase of B into one of the three following cases: (1) the phrase f_j contains the edited position i; (2) neither f_j nor its reference interval contains position i; (3) f_j does not contain position i but its reference interval contains it.

Case (1): Writing the phrase in T' as f_j = w_1 a w_2 # w_3, where # is at the edited position i and a is the character (if any) whose reference points to position i, we divide the phrase f_j into at most five phrases w_1 = (q_j, |w_1|), a, w_2 = (q_j + |w_1| + 1, |w_2|), #, w_3 = (q_j + |w_1| + |w_2| + 2, |w_3|). See also the middle of Figure 1.

Case (2): No changes are made to the phrase f_j in this case, since f_j can continue to refer to the same reference.

Case (3): Among all phrases in Case (3), let f_k be the phrase whose reference interval ends rightmost. Let T[p_k..p_k + |f_k| − 1] = u_1 a u_2, where u_1, u_2 ∈ Σ* and q_k + |u_1| = i. Then we divide the phrase f_k into at most three phrases u_1 = (q_k, |u_1|), a, u_2 = (q_k + |u_1| + 1, |u_2|) in T'. For the other phrases of Case (3), we divide f_j = v_1 a v_2, where v_1, v_2 ∈ Σ* and q_j + |v_1| = i, into at most two phrases v_1 = (q_j, |v_1|) and a v_2 = (p_k + |u_1|, |v_2| + 1). From the above operations, the character that referred to position i in T either becomes a ground phrase or refers to position p_k + |u_1|, which is a ground phrase, in T'. The other substrings refer to their original reference positions or to a subinterval of [p_k + |u_1|..p_k + |f_k| − 1], whose reference corresponds to the original reference of the substring. See also the bottom of Figure 1.
Then, the bidirectional scheme obtained from the above operations is guaranteed to be valid. The size b' of the bidirectional scheme is maximized if exactly one phrase of Case (1) is divided into five phrases and the remaining b(T) − 1 phrases belong to Case (3). Since at most one of the b(T) − 1 phrases of Case (3) can be divided into three phrases, and all the others can be divided into two phrases, b' is at most 5 + 3 + 2(b(T) − 2) = 2b(T) + 4. Furthermore, if T is a unary string, then b(T) = 2 and a valid bidirectional scheme of size 4 (= 2b(T)) can be constructed easily. Otherwise, there are at least two ground phrases in T, and these phrases cannot be divided further in T'. Then we get b' ≤ 2b(T) + 2, and Theorem 11 follows.

Lempel-Ziv 77 factorizations with/without self-references
In this section, we consider the worst-case sensitivity of the Lempel-Ziv 77 factorizations (LZ77 ) [68] with/without self-references.
For convenience, let f_0 = ε. A factorization f_1 ⋯ f_z of a string T of length n is the non self-referencing LZ77 factorization LZ77(T) of T if, for each 1 ≤ k < z, f_k[1..|f_k| − 1] is the longest prefix of f_k ⋯ f_z that occurs in f_1 ⋯ f_{k−1}. Since the copied part T[|f_1 ⋯ f_{k−1}| + 1..|f_1 ⋯ f_k| − 1] never overlaps with its previous occurrence, the factorization is called non self-referencing. The last factor f_z is the suffix of T of length n − |f_1 ⋯ f_{z−1}| and it may have multiple occurrences in f_1 ⋯ f_z.
A factorization f_1 ⋯ f_z of a string T of length n is the self-referencing LZ77 factorization LZ77sr(T) of T if, for each 1 ≤ k < z, f_k[1..|f_k| − 1] is the longest prefix of f_k ⋯ f_z that has an occurrence starting at some position in f_1 ⋯ f_{k−1}. Since such an occurrence may overlap with the copied part itself, the factorization is called self-referencing. The last factor f_z is the suffix of T of length n − |f_1 ⋯ f_{z−1}| and it may have multiple occurrences in f_1 ⋯ f_z.
If we use the common convention that the string T terminates with a unique character $, then the last factor f_z satisfies the same properties as f_1, …, f_{z−1}, in both cases of (non) self-referencing LZ77 factorizations.
To avoid confusion, we use different notations to denote the sizes of these factorizations. For a string T, let z_77(T) and z_77sr(T) denote the numbers z of factors in LZ77(T) and LZ77sr(T), respectively.
For example, for string T = abaabababababab$, LZ77(T) = a|b|aa|bab|ababa|bab$ and LZ77sr(T) = a|b|aa|bab|abababab$, where | denotes the right-end of each factor in the factorizations. Here we have z_77(T) = 6 and z_77sr(T) = 5.
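Both variants can be sketched directly from the definitions. The following is a simple quadratic-time illustration (function names are ours), which reproduces the example above:

```python
def lz77(t):
    """Non self-referencing LZ77: the copied part of each factor must occur
    entirely inside the already-parsed prefix t[:i]."""
    factors, i, n = [], 0, len(t)
    while i < n:
        l = 0
        while i + l < n and t[i:i + l + 1] in t[:i]:
            l += 1
        if i + l < n:                       # copied part plus one fresh character
            factors.append(t[i:i + l + 1])
            i += l + 1
        else:                               # last factor: remaining suffix
            factors.append(t[i:])
            i = n
    return factors

def lz77sr(t):
    """Self-referencing LZ77: the previous occurrence of the copied part may
    start anywhere before position i and overlap the factor itself."""
    factors, i, n = [], 0, len(t)
    while i < n:
        l = 0
        # t.find(...) gives the leftmost occurrence; < i means some occurrence
        # starts strictly before the current position
        while i + l < n and t.find(t[i:i + l + 1]) < i:
            l += 1
        if i + l < n:
            factors.append(t[i:i + l + 1])
            i += l + 1
        else:
            factors.append(t[i:])
            i = n
    return factors
```

Running both on T = abaabababababab$ yields the six and five factors shown above; the self-referencing variant's last factor abababab$ copies an occurrence that overlaps itself.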
In what follows, we present tight upper and lower bounds for the multiplicative sensitivity of z 77 and z 77sr for all cases of substitutions, insertions, and deletions. We also present the additive sensitivity of z 77 and z 77sr .

Lower bounds for the sensitivity of z 77
Theorem 12. The following lower bounds on the sensitivity of the non self-referencing LZ77 factorization hold.

Proof. Let p ≥ 2 and Σ = {0, 1, 2}. We use the same string T for our analysis in all cases of substitutions, insertions, and deletions, where each Q_k forms a single factor in the non self-referencing LZ77 factorization of T.

substitutions: Consider the string T' which can be obtained from T by substituting the first 0 with 2. Let us analyze the structure of the non self-referencing LZ77 factorization LZ77(T') of T'. We prove by induction that Q_k is divided into exactly two factors for every k. Since Q_k has 01^{k−1} as a suffix and this is the leftmost occurrence of 1^{k−1} in T', the next factor is the remaining suffix Q_2 ⋯ Q_{k−2} 11 of Q_k.

insertions: Let T' be the string obtained by inserting 2 immediately after the first character T[1] = 0. Then, by similar arguments to the case of substitutions, we obtain the claimed bounds.

deletions: Let T' be the string obtained by deleting the first character T[1] = 0. Then, by similar arguments to the case of substitutions, we obtain the claimed bounds.

The strings T and T' used in Theorem 12 give us optimal additive lower bounds in terms of z_77, are highly compressible (z_77(T) = O(log n)), and use only two or three distinct characters. By using more characters, we can obtain larger lower bounds for the additive sensitivity of the non self-referencing LZ77 factorization in terms of the string length n, as follows:

Theorem 13. The following lower bounds on the sensitivity of the non self-referencing LZ77 factorization hold: AS_sub(z_77, n) = Ω(√n), AS_ins(z_77, n) = Ω(√n), and AS_del(z_77, n) = Ω(√n). Proof. In Appendix A.1.

Upper bounds for the sensitivity of z 77
Theorem 14. The following upper bounds on the sensitivity of the non self-referencing LZ77 factorization hold: MS_sub(z_77, n) ≤ 2, MS_ins(z_77, n) ≤ 2, MS_del(z_77, n) ≤ 2, and AS(z_77, n) ≤ z_77 for each operation.

Proof. In the following, we consider the case where T[i] = a is substituted by a character # that does not occur in T. The other cases, insertions, deletions, and substitutions with another character b (≠ a) occurring in T, can be proven similarly, as discussed at the end of the proof. We denote the factorizations by LZ77(T) = f_1 ⋯ f_z and LZ77(T') = f'_1 ⋯ f'_{z'}, and denote the interval of the jth factor f_j (resp. f'_j) by [p_j, q_j] (resp. [p'_j, q'_j]). Now we prove the following claim:

Claim. Each interval [p_j, q_j] contains at most two starting positions p'_k and p'_{k+1} of factors in LZ77(T') for some 1 ≤ k < z'.
Proof of claim. There are the three following cases: (1) When the interval [p_j, q_j] satisfies q_j < i: f'_j = f_j holds for any such j. Therefore, the interval [p_j, q_j] contains exactly one starting position p'_j = p_j of a factor in LZ77(T').
(2) When the interval [p_j, q_j] satisfies p_j ≤ i ≤ q_j: Let f_j = w_1 a w_2 c, where a, c, # ∈ Σ and w_1, w_2 ∈ Σ*, and a is the character at the edited position i. By definition, w_1 a w_2 has at least one previous occurrence in f_1 ⋯ f_{j−1}. After the substitution, w_1 # becomes a factor f'_j of LZ77(T') since # is a fresh character, and w_2 c becomes a prefix of the next factor f'_{j+1} in LZ77(T'). This means that p'_j = p_j and q'_{j+1} ≥ q_j. Therefore, the interval [p_j, q_j] contains at most two starting positions p'_j and p'_{j+1} of factors in LZ77(T').
(3) When the interval [p_j, q_j] satisfies i < p_j: There are the two following sub-cases:

(3-A) When T[p_j..q_j] has a previous occurrence which does not contain the edited position i in T: In this case, any suffix of T[p_j..q_j] has a previous occurrence in T'. Therefore, [p'_k, q'_k] with p_j ≤ p'_k satisfies q'_k ≥ q_j. Hence, the interval [p_j, q_j] contains at most one starting position p'_k of a factor in LZ77(T').

The above proof can be generalized to all the other cases by replacing # in T' with the corresponding edited character; the analyses for Case (2) and Case (3) then carry over.
Lower bounds for the sensitivity of z 77sr

Theorem 15. The following lower bounds on the sensitivity of the self-referencing LZ77 factorization hold: MS_sub(z_77sr, n) ≥ 2, MS_ins(z_77sr, n) ≥ 2, lim inf_{n→∞} MS_del(z_77sr, n) ≥ 2, and AS(z_77sr, n) = Ω(log n) for each operation.

Proof. substitutions: We consider the string T' = 02 · 001 · 000011 · 000010000111 ⋯ R_p, which can be obtained from T by substituting the second 0 with 2; here z_77sr(T) = p. Let us analyze the structure of the self-referencing LZ77 factorization of T'. The second factor 0001 in LZ77sr(T) becomes 2001 in the edited string T', and this part is divided into exactly three factors as 2|00|1 in LZ77sr(T') because 2 is a fresh character, 00 is the shortest prefix of T'[3..n] = 001R_3 ⋯ R_p that does not occur in T'[1..2] = 02, and 1 occurs there for the first time. Our claim is that each remaining block R_k is divided into exactly two factors, which means that the next factor is a prefix of R_k ⋯ R_p. Since its prefix 0R_2 ⋯ R_{k−2}1 has a previous occurrence and 0R_2 ⋯ R_{k−2}11 has a suffix 01^{k−1} which is the leftmost occurrence of 1^{k−1} in T', the remaining part 0R_2 ⋯ R_{k−2}11 becomes the next factor in LZ77sr(T'). Thus, the self-referencing LZ77 factorization of T' satisfies z_77sr(T') = 2p, which leads to MS_sub(z_77sr, n) ≥ 2p/p = 2 and AS_sub(z_77sr, n) ≥ 2p − p = p = z_77sr(T) = Ω(log n).

insertions: We use the same string T as in the case of substitutions. Let T' be the string obtained by inserting 2 immediately after T[1] = 0. Then, by similar arguments to the case of substitutions, we have z_77sr(T') = 2p, which leads to MS_ins(z_77sr, n) ≥ 2p/p = 2 and AS_ins(z_77sr, n) ≥ 2p − p = p = z_77sr(T) = Ω(log n).
deletions: As for deletions, we use the same strings T and T' from Theorem 12. This string and the deletion also achieve the same lower bound for the self-referencing LZ77 factorization in the case of deletions. We obtain z_77sr(T) = p and z_77sr(T') = 2p − 2, which leads to lim inf_{n→∞} MS_del(z_77sr, n) ≥ 2 and AS_del(z_77sr, n) ≥ z_77sr(T) − 2 = Ω(log n).
The strings T and T' used in Theorem 15 give us optimal additive lower bounds in terms of z_77sr, are highly compressible (z_77sr(T) = O(log n)), and use only two or three distinct characters. By using more characters, we can obtain larger lower bounds for the additive sensitivity of the self-referencing LZ77 factorization in terms of the string length n, as follows:

Theorem 16. The following lower bounds on the sensitivity of the self-referencing LZ77 factorization hold: AS_sub(z_77sr, n) = Ω(√n), AS_ins(z_77sr, n) = Ω(√n), and AS_del(z_77sr, n) = Ω(√n).

Upper bounds for the sensitivity of z 77sr
Theorem 17. The following upper bounds on the sensitivity of the self-referencing LZ77 factorization hold: z_77sr(T') ≤ 2z_77sr(T) + 1 for any string T and any string T' with ed(T, T') = 1.

Proof. We use the same notations as in Theorem 14 of Section 7.2. We consider the case where T[i] is substituted by a fresh character #, as in the proof of Theorem 14. We prove the following claim:

Claim. Each interval [p_j, q_j] contains at most two starting positions p'_k and p'_{k+1} of factors in LZ77sr(T') for some 1 ≤ k < z', excluding the interval [p_I, q_I] that contains the edited position i. The interval [p_I, q_I] contains at most three starting positions of factors in LZ77sr(T').
Proof of claim. Cases (1) and (3) can be shown by the same arguments as in the proof of Theorem 14, and the interval [p_I, q_I] containing the edited position yields at most three starting positions. This completes the proof of the claim.
Using the same character(s) as in the proof for Theorem 14, we can generalize this proof to the other types of edit operations.
LZSS factorizations with/without self-references

In this section, we consider the worst-case sensitivity of the Lempel-Ziv-Storer-Szymanski (LZSS) factorizations with/without self-references. A factorization f_1 ⋯ f_z of a string T is defined as follows:

• it is the non self-referencing LZSS factorization LZSS(T) of T if for each 1 ≤ i ≤ z the factor f_i is either the first occurrence of a character in T, or the longest prefix of f_i ⋯ f_z that occurs in f_1 ⋯ f_{i−1}.
• it is the self-referencing LZSS factorization LZSSsr(T ) of T if for each 1 ≤ i ≤ z the factor f i is either the first occurrence of a character in T , or the longest prefix of f i · · · f z occurs at least twice in f 1 · · · f i .
To avoid confusions, we use different notations to denote the sizes of these factorizations. For a string T let z SS (T ) and z SSsr (T ) denote the number z of factors in the non self-referencing LZSS factorization and in the self-referencing LZSS factorization of T , respectively.
For example, for string T = abaabababababab, we have LZSS(T) = a|b|a|aba|ba|baba|bab and LZSSsr(T) = a|b|a|aba|babababab, where | denotes the right-end of each factor in the factorizations. Here we have z_SS(T) = 7 and z_SSsr(T) = 5.
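As with LZ77 above, both LZSS variants admit a short quadratic-time sketch directly from the definitions (function names are ours); note that LZSS factors carry no appended fresh character:

```python
def lzss(t):
    """Non self-referencing LZSS: each factor is a fresh character or the
    longest prefix occurring entirely inside the already-parsed prefix."""
    factors, i, n = [], 0, len(t)
    while i < n:
        l = 1
        while i + l <= n and t[i:i + l] in t[:i]:
            l += 1
        l -= 1                       # longest prefix occurring in t[:i]
        if l == 0:                   # first occurrence of a character
            factors.append(t[i])
            i += 1
        else:
            factors.append(t[i:i + l])
            i += l
    return factors

def lzsssr(t):
    """Self-referencing LZSS: the previous occurrence may start before
    position i and overlap the factor itself."""
    factors, i, n = [], 0, len(t)
    while i < n:
        l = 1
        while i + l <= n and t.find(t[i:i + l]) < i:
            l += 1
        l -= 1
        if l == 0:
            factors.append(t[i])
            i += 1
        else:
            factors.append(t[i:i + l])
            i += l
    return factors
```

On T = abaabababababab these produce the seven and five factors shown above.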

Lower bounds for the sensitivity of z SS
Theorem 18. The following lower bounds on the sensitivity of the non self-referencing LZSS factorization hold: AS_sub(z_SS, n) = Ω(√n), AS_ins(z_SS, n) = Ω(√n), and AS_del(z_SS, n) = Ω(√n).

Upper bounds for the sensitivity of z SS
Theorem 19. The following upper bounds on the sensitivity of the non self-referencing LZSS factorization hold: MS_sub(z_SS, n) ≤ 3, MS_ins(z_SS, n) ≤ 2, lim sup_{n→∞} MS_del(z_SS, n) ≤ 3, AS_sub(z_SS, n) ≤ 2z_SS, AS_ins(z_SS, n) ≤ z_SS, and AS_del(z_SS, n) ≤ 2z_SS − 3.

Proof. Let LZSS(T) = f_1 ⋯ f_z and LZSS(T') = f'_1 ⋯ f'_{z'}. We denote the interval of the jth factor f_j (resp. f'_j) by [p_j, q_j] (resp. [p'_j, q'_j]), namely T[p_j..q_j] = f_j and T'[p'_j..q'_j] = f'_j. Also, let f_I be the factor of LZSS(T) whose interval [p_I, q_I] contains the edited position i, namely p_I ≤ i ≤ q_I.

substitutions: In the following, we consider the case where the ith character T[i] = a is substituted by a fresh character # which does not occur in T. The other cases can be proven similarly. Now we show the following claim:

Claim. After the substitution, each interval [p_j, q_j] contains at most three starting positions p'_k, p'_{k+1}, and p'_{k+2} of factors in LZSS(T') for some 1 ≤ k ≤ z' − 2.
Proof of claim. There are the three following cases: (i) When the interval [p_j, q_j] satisfies q_j < i: By the same argument as in Case (1) for LZ77, the interval [p_j, q_j] contains exactly one starting position p'_j = p_j.
(ii) When the interval [p_j, q_j] satisfies p_j ≤ i ≤ q_j (namely, f_j = f_I): For the string w_1 a w_2 = T[p_j..q_j], it is guaranteed that w_1 a w_2 has at least one occurrence in f_1 ⋯ f_{j−1}. After the substitution, which gives T'[p_j..q_j] = w_1 # w_2, the strings w_1 and # become the factors f'_j and f'_{j+1}, and w_2 becomes a prefix of the factor f'_{j+2}. This means that p'_j = p_j and q'_{j+2} ≥ q_j. Therefore, the interval [p_j, q_j] contains at most three starting positions p'_j, p'_{j+1}, and p'_{j+2} of factors in LZSS(T').
(iii) When the interval [p_j, q_j] satisfies i < p_j: We consider the two following sub-cases:

(iii-A) When T[p_j..q_j] has at least one occurrence which does not contain the edited position i in T: Any suffix of T[p_j..q_j] still has a previous occurrence in T'. Therefore, [p'_k, q'_k] with p_j ≤ p'_k satisfies q'_k ≥ q_j, meaning that the interval [p_j, q_j] contains at most one starting position p'_k of a factor in LZSS(T').

(iii-B) When every previous occurrence of T[p_j..q_j] contains the edited position i in T: Write T[p_j..q_j] = u_1 a u_2, where the previous occurrence aligns the shown a with position i. If p'_k is in u_2, then q'_k ≥ q_j and thus there is only one starting position of a factor of LZSS(T') in the interval [p_j..q_j]. Suppose p'_k is in u_1. If a has no previous occurrence (which happens when T[i] was the only previous occurrence of a), then T'[p'_k + |u_1|] is the first occurrence of a in T' and thus q'_k = p'_k + |u_1| − 1, p'_{k+1} = q'_k + 1, and q'_{k+1} = p'_{k+1} + 1.
This completes the proof for the claim.
insertions: In the following, we consider the case where # is inserted between positions i − 1 and i. The other cases can be proven similarly. Now we show the following claim:

Claim. After the insertion, each interval [p_j, q_j] contains at most two starting positions of factors in LZSS(T').

Proof of claim. For the interval containing the insertion point, let w_1 w_2 = T[p_j..q_j] with w_1, w_2 ∈ Σ*, where # is inserted between w_1 and w_2. It is guaranteed that w_1 and w_2 still have previous occurrences in T'. Therefore, each range of w_1 and w_2 can contain at most one starting position of a factor in LZSS(T').
It follows from the above claim that z_SS(T') ≤ 2z_SS(T) + 1 holds for any string T and insertions of #. By using the same discussion as for f_1, we obtain that z_SS(T') ≤ 2z_SS(T) holds. Then we have MS_ins(z_SS, n) ≤ 2 and AS_ins(z_SS, n) ≤ z_SS.
deletions: In the following, we consider the case where T[i] = a is deleted. Now we show the following claim:

Claim. After the deletion, each interval [p_j, q_j] contains at most three starting positions of factors in LZSS(T').

Proof of claim. For Cases (i) and (iii), we can use the same discussions as in the case of substitutions. Now we consider Case (ii): When the interval [p_j, q_j] satisfies p_j ≤ i ≤ q_j (namely, f_j = f_I): Let w_1 a w_2 = T[p_j..q_j] with a ∈ Σ and w_1, w_2 ∈ Σ*. It is guaranteed that w_1 a w_2 has at least one previous occurrence in f_1 ⋯ f_{j−1}. Therefore, after the deletion of a, each range of w_1 and w_2 can contain at most one starting position of a factor in LZSS(T').
It follows from the above claim that z_SS(T') ≤ 3z_SS(T) − 1 holds for any string T and deletions. By using the same discussion as for f_1, z_SS(T') ≤ 3z_SS(T) − 3 holds. Then we get lim sup_{n→∞} MS_del(z_SS, n) ≤ 3 and AS_del(z_SS, n) ≤ 2z_SS − 3.

Lower bound for the sensitivity of z SSsr
Theorem 20. The following lower bounds on the sensitivity of the self-referencing LZSS factorization hold: AS_sub(z_SSsr, n) = Ω(√n), AS_ins(z_SSsr, n) = Ω(√n), and AS_del(z_SSsr, n) = Ω(√n).
Proof. We use the same strings T and T' as in the proof of Theorem 18, which shows the lower bounds for the sensitivity of the non self-referencing LZSS factorization. For the string T and each edit operation, the self-referencing LZSS factorization coincides with the non self-referencing LZSS factorization. Hence, we obtain Theorem 20.

LZ-End factorizations
In this section, we consider the worst-case sensitivity of the LZ-End factorizations [37]. This is an LZ77-like compressor in which the previous occurrence of each factor f_i ends at the ending position of a previous factor. This property allows for fast substring extraction in practice [37].
A factorization T = f_1 ⋯ f_{z_End} of a string T of length n is the LZ-End factorization LZEnd(T) of T such that, for each 1 ≤ i < z_End, f_i[1..|f_i| − 1] is the longest prefix of f_i ⋯ f_{z_End} which has a previous occurrence as a suffix of some string in {ε, f_1, f_1 f_2, …, f_1 ⋯ f_{i−1}}. The last factor f_{z_End} is the suffix of T of length n − |f_1 ⋯ f_{z_End−1}|. Again, if we use the common convention that the string T terminates with a unique character $, then the last factor f_{z_End} satisfies the same properties as f_1, …, f_{z_End−1}. Let z_End(T) denote the number of factors in the LZ-End factorization of string T.
For example, for string T = abaabababababab$, LZEnd(T) = a|b|aa|ba|bab|ababa|b$, where | denotes the right-end of each factor in the factorization. Here we have z_End(T) = 7.
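A direct (quadratic) implementation of the definition above can be sketched as follows; the bookkeeping of factor-end positions and the function name are ours:

```python
def lzend(t):
    """LZ-End factorization: the copied part of each factor must be a suffix
    of some already-parsed prefix f_1 ... f_j (boundary positions in `ends`)."""
    factors, ends, i, n = [], [0], 0, len(t)   # ends: lengths of f_1..f_j prefixes
    while i < n:
        l = 0
        # grow the copied part while it is a suffix of some parsed prefix t[:e]
        while i + l < n and any(e >= l + 1 and t[e - l - 1:e] == t[i:i + l + 1]
                                for e in ends):
            l += 1
        if i + l < n:                   # copied part plus one explicit character
            factors.append(t[i:i + l + 1])
            i += l + 1
        else:                           # last factor: remaining suffix
            factors.append(t[i:])
            i = n
        ends.append(i)
    return factors
```

For T = abaabababababab$ this computes a|b|aa|ba|bab|ababa|b$; for instance, the copied part abab of the sixth factor ends exactly at the boundary of the fifth factor.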

Upper bounds for the sensitivity of z End
To show a non-trivial upper bound for the sensitivity of z_End, we use the following known result ([37]): for any string T, z_SSsr(T) ≤ z_End(T).
Lempel-Ziv 78 factorizations

For convenience, let f_0 = ε. A factorization f_1 ⋯ f_{z_78} of a string T of length n is the LZ78 factorization LZ78(T) of T if, for each 1 ≤ i < z_78, f_i[1..|f_i| − 1] is the longest prefix of f_i ⋯ f_{z_78} that equals some previous factor f_j (0 ≤ j < i); i.e., each factor extends a previous factor by a single character. The last factor f_{z_78} is the suffix of T of length n − |f_1 ⋯ f_{z_78−1}| and it may be equal to some previous factor f_j (1 ≤ j < z_78). Again, if we use the common convention that the string T terminates with a unique character $, then the last factor f_{z_78} can be defined analogously to the previous factors. Let z_78(T) denote the number of factors in the LZ78 factorization of string T.
For example, for string T = abaabababababab$, LZ78(T) = a|b|aa|ba|bab|ab|aba|b$, where | denotes the right-end of each factor in the factorization. Here we have z_78(T) = 8.
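The LZ78 parsing maintains a dictionary of previous factors, each new factor being a previous factor extended by one character. A minimal sketch (the function name is ours):

```python
def lz78(t):
    """LZ78 factorization: each factor is the longest previous factor
    (initially the empty string) extended by one character."""
    factors, dic, i, n = [], {""}, 0, len(t)
    while i < n:
        l = 0
        while i + l < n and t[i:i + l + 1] in dic:
            l += 1
        # extend by one character, unless we hit the end of the string
        f = t[i:i + l + 1] if i + l < n else t[i:]
        factors.append(f)
        dic.add(f)
        i += len(f)
    return factors
```

On T = abaabababababab$ this yields the eight factors shown above.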
As for the sensitivity of LZ78, Lagarde and Perifel [40] showed that MS_ins(z_78, n) = Ω(n^{1/4}), AS_ins(z_78, n) = Ω(z_78^{3/2}), and AS_ins(z_78, n) = Ω(n/log n) for insertions. In this section, we present lower bounds for the multiplicative/additive sensitivity of LZ78 for the remaining cases, i.e., for substitutions and deletions, by using a completely different string from [40]. We show that MS_sub(z_78, n) = Ω(n^{1/4}), AS_sub(z_78, n) = Ω(z_78^{3/2}), and AS_sub(z_78, n) = Ω(n^{3/4}) hold.

Proof. Consider the string T in which σ_i for every 1 ≤ i ≤ 2k is a distinct character and y_j for every 1 ≤ j ≤ k satisfies the following property: y_j is the maximum integer at most k such that 2 + ℓ_j + j − 1 ≡ y_j (mod ℓ_j), where ℓ_j is the integer satisfying (1/2)ℓ_j(ℓ_j − 1) + 1 ≤ j ≤ (1/2)ℓ_j(ℓ_j + 1). We remark that the parentheses ( and ) in T are shown only for better visualization and exposition, and therefore they are not characters of T.
See also Figure 2 for a concrete example. As for deletions, by considering T' obtained from T by deleting the first character of the (2k + 1)-th factor in LZ78(T), we obtain a similar decomposition as the above. Thus, MS_del(z_78, n) = Ω(n^{1/4}), AS_del(z_78, n) = Ω(z_78^{3/2}), and AS_del(z_78, n) = Ω(n^{3/4}) also hold. We remark that our string also achieves MS_ins(z_78, n) = Ω(n^{1/4}), AS_ins(z_78, n) = Ω(z_78^{3/2}), and AS_ins(z_78, n) = Ω(n^{3/4}) for insertions, if we consider the string T' obtained from T by inserting # between the first and second characters of the (2k + 1)-th factor of LZ78(T).
In Section 11, we will present an O((n/log n)^{2/3}) upper bound for the multiplicative sensitivity of LZ78.

Smallest grammars and approximation grammars
In this section, we consider the sensitivity of the smallest grammar size g * and several grammars whose sizes satisfy some approximation ratios to g * .

Smallest grammar
In this section (and also in the following sections), we consider grammar-based compressors for input string T .
It is known that the problem of computing the size g * (T ) of the smallest grammar only generating T is NP-hard [64,11]. It is also known that z SS (T ) is a lower bound of the size of any grammar generating T , namely, z SS (T ) ≤ g * (T ) holds for any string T [62,11].
We have the following upper bounds for the sensitivity of g*:

Theorem 26. The following upper bounds on the sensitivity of g* hold: MS_sub(g*, n) ≤ 2, MS_ins(g*, n) ≤ 2, and MS_del(g*, n) ≤ 2.

Proof. Let T be any string of length n, and let G*(T) be a grammar of size g*(T) that only generates T. We describe the case of substitutions. Let T' be the string that can be obtained by substituting the ith character T[i] of T with a character c, where c ≠ T[i]. Let X be a non-terminal of G*(T) on the path P from the root to the leaf for the ith character in the derivation tree of G*(T). Let X → Y_1 ⋯ Y_k be the production from X, and let Y_j (1 ≤ j ≤ k) be the non-terminal that is the child of X on the path P. Then, we introduce a new non-terminal X' and a new production X' → Y_1 ⋯ Y_{j−1} Y'_j Y_{j+1} ⋯ Y_k, where Y'_j will be the new non-terminal at the next depth of the path P (at the lowest level, the terminal on P is replaced by c). By applying this operation in a top-down manner on P, we can obtain a grammar G(T') of size g(T') ≤ 2g*(T) that generates T'. Since g*(T') ≤ g(T'), we have the claimed bounds. The cases of insertions and deletions are analogous.
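The path-copying argument in the proof above can be sketched in code. The grammar representation (a dict from non-terminal to its right-hand side), the fresh-name scheme, and all names are ours; the quadratic re-expansion is for clarity only:

```python
import itertools

def expand(g, sym):
    """Expand a symbol to the string it derives (terminals map to themselves)."""
    return "".join(expand(g, s) for s in g[sym]) if sym in g else sym

def substitute(g, start, pos, ch):
    """Return a grammar (and new start symbol) for the string with position
    pos (0-based) replaced by ch, copying every non-terminal on the
    root-to-leaf path; the size at most doubles."""
    g = dict(g)
    fresh = ("N%d" % k for k in itertools.count())  # assumed not to collide
    def rec(sym, pos):
        if sym not in g:                 # reached the terminal leaf: replace it
            return ch
        rhs, off = g[sym], 0
        for j, s in enumerate(rhs):
            l = len(expand(g, s))
            if pos < off + l:            # the path continues into child j
                new = next(fresh)
                g[new] = rhs[:j] + [rec(s, pos - off)] + rhs[j + 1:]
                return new
            off += l
    return g, rec(start, pos)

g = {"S": ["A", "A"], "A": ["B", "B"], "B": ["a", "a"]}
g2, s2 = substitute(g, "S", 3, "b")      # a^8 with position 3 replaced by b
```

Only the three non-terminals on the path are copied, so the edited grammar has at most twice as many productions, mirroring the 2g*(T) bound.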

Approximation grammars
There also exist (better) approximation algorithms in terms of the smallest grammar size g * .

Grammar compression by induced sorting (GCIS)
In this section, we consider the worst-case sensitivity of the grammar compression by induced sorting (GCIS) [58,59]. GCIS is based on the idea from the famous SAIS algorithm [57] that builds the suffix array of an input string in linear time. Recently, it has been shown that GCIS has a locally consistent parsing property similar to the ESP-index [44] and the SE-index [54], and grammar-based indexing structures based on GCIS have been proposed [1,14].
First we explain how the GCIS algorithm constructs its grammar from the input string. For any text position 1 ≤ i ≤ |T|, position i is of type L if T[i..|T|] is lexicographically larger than T[i+1..|T|], and it is of type S otherwise. For any 2 ≤ i < |T|, we call position i an LMS (LeftMost S) position if i is of type S and i − 1 is of type L. For convenience, we append a special character $ to T which does not occur elsewhere in T, and assume that positions 1 and |T$| are LMS positions.
Let i_1, …, i_{z+1} be the sequence of the LMS positions in T sorted in increasing order. Let D_j = T[i_j..i_{j+1} − 1] for any 1 ≤ j ≤ z. When z ≥ 2, T = D_1, …, D_z is called the GCIS parsing of T.
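One level of the parsing (L/S types, LMS positions, factors D_j) can be sketched as follows. The function name is ours; the sentinel handling follows the conventions stated above (position 1 and the final $ are treated as LMS), and the types are computed right to left as in SAIS:

```python
def gcis_parse(t):
    """One level of GCIS parsing: split t (with $ appended) at LMS positions."""
    s = t + "$"
    n = len(s)
    # typ[i] = "L" iff s[i:] is lexicographically larger than s[i+1:]
    typ = ["S"] * n
    for i in range(n - 2, -1, -1):
        if s[i] > s[i + 1] or (s[i] == s[i + 1] and typ[i + 1] == "L"):
            typ[i] = "L"
    # LMS positions: type S preceded by type L; first and last are forced
    lms = [0] + [i for i in range(1, n) if typ[i] == "S" and typ[i - 1] == "L"]
    if lms[-1] != n - 1:
        lms.append(n - 1)
    return [s[lms[j]:lms[j + 1]] for j in range(len(lms) - 1)]
```

For example, gcis_parse("banana") returns the factors b|an|ana, which concatenate back to the input (the sentinel $ is excluded from the factors).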
Next, we create new non-terminal symbols R_1, …, R_z such that R_i = 1 + σ + |{D_j : D_j ≺ D_i, 1 ≤ j ≤ z}| for each i. Intuitively, we pick the least unused character from Π and assign it to R_i. Then, G_1 = R_1 ⋯ R_z is called the GCIS-string of T. Let G_1 also denote, as a set, the z symbols occurring in G_1, and let P_1 = {R_i → D_i : 1 ≤ i ≤ z} be the set of production rules. Let D_1 = {D_1, …, D_z} be the set of all distinct factors. Let G_0 = T; then we define GCIS recursively, as follows:

Definition 1. For k ≥ 0, let the sequence i_1, i_2, …, i_{z_k+1} be all LMS positions of G_k sorted in increasing order, and let D_j = G_k[i_j..i_{j+1} − 1] for any 1 ≤ j ≤ z_k. Then G_k = D_1, D_2, …, D_{z_k} is the GCIS parsing of G_k. For all 1 ≤ i ≤ z_k, we define R_i analogously to the above, by the lexicographic ranks of the factors. Then, G_{k+1} = R_1 ⋯ R_{z_k} is the GCIS-string of G_k, G_{k+1} (as a set) is the set of non-terminals, P_k = {R_i → D_i : 1 ≤ i ≤ z_k} is the set of production rules, and D_k = {D_1, …, D_{z_k}} is the set of all distinct factors in the GCIS parsing of G_k.
Again, each R_i is chosen to be the least unused character from Π. G_{k+1} is not defined if there are no LMS positions in G_k[2..|G_k|]. Then, the GCIS grammar of T is (Σ, ∪_{t=1}^{k} G_t, ∪_{t=1}^{k−1} P_t, G_k): T is derived by recursively applying the rules in ∪_{t=1}^{k−1} P_t (the third component) to the start string G_k (the fourth component) until no non-terminal characters, i.e., no characters of ∪_{t=1}^{k} G_t ⊆ Π (the second component), remain in the string. Let r = k be the height of the GCIS grammar, in other words, the number of times the GCIS parsing is applied recursively to T. Let g_is(T) be the size of the GCIS grammar of T. Then, if r = 0, g_is(T) = |T|, and if r ≥ 1, g_is(T) = ‖D_1‖ + ⋯ + ‖D_r‖ + |G_r|, where ‖S‖ for a set S of strings denotes the total length of the strings in S. Figure 3 shows an example of how the GCIS grammar is constructed from an input string. From now on, we perform an edit operation on the input string T and consider how the GCIS grammar changes after the edit. For two strings, let F count the numbers of deleted and inserted characters, so that our single-character edit operation performed on T can be described as F(T, T') = (1, 1) for a substitution, F(T, T') = (0, 1) for an insertion, and F(T, T') = (1, 0) for a deletion. We will apply this notation F to the GCIS-strings for T and T', in which case the components a and b of F can be larger than 1. Still, we will prove that a and b are small constants for the GCIS-strings.
As with the definitions for T, T' = D'_1, …, D'_{z'} is the GCIS parsing of T', G'_1 = R'_1 ⋯ R'_{z'} is the GCIS-string of T', G'_1 (as a set) is the set of non-terminals for T', D'_1 = {D'_1, …, D'_{z'}} is the set of all distinct factors of the GCIS parsing of T', and P'_1 = {R'_i → D'_i : 1 ≤ i ≤ z'} is the set of production rules. Let G'_0 = T'; then we can recursively define G'_1, G'_2, …, G'_{r'} similarly to T, where r' is the height of the GCIS grammar for T'.

Upper bounds for the sensitivity of g is
This section presents the following upper bounds for the sensitivity of GCIS.
Theorem 27. The following upper bounds on the sensitivity of GCIS hold: g_is(T') ≤ 4g_is(T) + O(1) for any string T and any string T' with ed(T, T') = 1.

We will prove this theorem as follows. We unify substitutions, insertions, and deletions by using the F function in Definition 2. First, we prove that character replacements which do not change the relative order of the characters do not affect the size of the GCIS grammar. Second, we divide the size g_is(T) of the GCIS grammar into ‖D_1‖ and g_is(G_1), and bound the increase of each part separately. The essence is to find two special strings Ĝ_1 and Ĝ'_1 which satisfy:

• Ĝ'_1 can be obtained from Ĝ_1 by some substitutions, insertions, and deletions;
• Ĝ_1 and Ĝ'_1 can be obtained by replacing the characters in G_1 and G'_1, respectively, without changing the relative order of any characters.
Then, we can apply the method at each height. The extra additive O(1) term can be charged to the process of the GCIS compression, which is to be proved in Lemma 12. Finally, we will obtain g_is(T') ≤ 4g_is(T). Lemma 3. Let G_1 and Ĝ_1 denote the GCIS-strings of T and T̂, respectively, where T̂ is obtained by replacing characters of T without changing their relative order. Then Ĝ_1 is the string that can be obtained by replacing the characters in G_1 without changing the ranks of any characters in G_1, and g_is(T̂) = g_is(T).
Proof. The lemma immediately follows from Lemma 2 and the fact that rank_T̂[i] = rank_T[i] for every 1 ≤ i ≤ |T|. Figure 4 shows a concrete example for Lemma 3. A natural consequence of Lemma 3 is that edit operations which do not change the relative order of the characters in T do not affect the size of the grammar.
From now on, we analyze how the size of the GCIS grammar of the string T can increase after an edit operation on T. In the following lemmas, let 1 ≤ h ≤ r, where r is the height of the GCIS grammar for T.
Proof. Considering k where i_k ≤ c < i_{k+1} in G_h and l where i_{z−l−1} ≤ c + x < i_{z−l}, the total length of the new factors to be added to D'_h is bounded accordingly.

Proof. Assume |G'_{h+1}| > |G_{h+1}| + 1 + ⌈y/2⌉. In other words, there are at least 2 + ⌈y/2⌉ positions which are not LMS positions in G_h but are LMS positions in G'_h. Let i be the right-most such position.

Proof. We immediately get a ≤ 2 + ⌈(x + 1)/2⌉ and b ≤ 2 + ⌈(y + 1)/2⌉ by a direct application of Lemma 4. Assume y mod 2 = 1 and b = 2 + ⌈(y + 1)/2⌉. Then, Lemma 6 shows that there is only one possible combination of the b new LMS positions i + 1, c + 1, c + 3, …, c + y in G'_h. For that, neither i + 1 nor c + x can be an LMS position in G_h in this case, since they must be new LMS positions in G'_h. Therefore, a ≤ 2 + ⌈(x + 1)/2⌉ − 1, since there is no possible combination of a + 1 LMS positions in G_h. Assume x mod 2 = 1 and a = 2 + ⌈(x + 1)/2⌉. Then, Lemma 6 shows that there is only one possible combination of the a disappearing LMS positions i + 1, c + 1, c + 3, …, c + x in G_h. For that, neither i + 1 nor c + y can be an LMS position in G'_h in this case, since they must be disappearing LMS positions in G_h. Therefore, b ≤ 2 + ⌈(y + 1)/2⌉ − 1, since there is no possible combination of b + 1 new LMS positions in G'_h.

Lemma 8. If F(G_h, G'_h) = (x, y), then there are two strings Ĝ_{h+1}, Ĝ'_{h+1} such that Ĝ_{h+1}, Ĝ'_{h+1} can be obtained by replacing some characters in G_{h+1}, G'_{h+1} without changing the relative order of any characters in G_{h+1}, G'_{h+1}, respectively, and F(Ĝ_{h+1}, Ĝ'_{h+1}) = (a, b), where a ≤ 2 + ⌈(x + 1)/2⌉, b ≤ 2 + ⌈(y + 1)/2⌉, and a + b ≤ 4 + ⌈(x + y)/2⌉.
Proof. The bound follows by Lemma 10.

Proof. If |D_h| = 1, then G_{h+1} must be a unary string, and therefore no G_{h+2} is constructed. If |D_h| = 2 and there is a factor of length 1 in D_h, then G_{h+1} is still a unary string except for its first position, and therefore no G_{h+2} is constructed. Therefore, G_{h+2} is constructed only if |D_h| ≥ 2 and there are at least two factors of length at least 2, and hence ‖D'_h‖ ≤ 4(‖D_h‖ − 2) − x + y holds.

Lower bounds for the sensitivity of g is
Theorem 28. The following lower bounds on the sensitivity of GCIS hold: Proof. Assume p > 1.

Bisection
In this section, we consider the worst-case sensitivity of the compression algorithm Bisection [52], a grammar-based compressor that has a tight connection to BDDs.
Given a string $T$ of length $n$, the Bisection algorithm builds a grammar generating $T$ as follows. We consider a binary tree $\mathcal{T}$ whose root corresponds to $T$. The left and right children of the root correspond to $T_1 = T[1..2^j]$ and $T_2 = T[2^j+1..n]$, respectively, where $j$ is the largest integer such that $2^j < n$. We apply the same rule to $T_1$ and to $T_2$ recursively, until we obtain single characters, which are the leaves of $\mathcal{T}$. After $\mathcal{T}$ is built, we assign a label (non-terminal) to each node of $\mathcal{T}$: if there are multiple nodes such that the leaves of their subtrees spell out the same substring of $T$, we assign the same non-terminal to all these nodes. The labeled tree $\mathcal{T}$ is the derivation tree of the Bisection grammar for $T$. We denote by $g_{\rm bsc}(T)$ the size of the Bisection grammar for $T$. Recall that $\Sigma$ is the alphabet.
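The construction above is easy to simulate. The following sketch (an illustration, not the paper's implementation) collects the distinct strings labeling nodes of the Bisection derivation tree, one per non-terminal; note that this node count may differ from the paper's size measure $g_{\rm bsc}$ by a constant factor:

```python
def bisection_nonterminals(T):
    """Collect the distinct strings that label nodes of the Bisection
    derivation tree of T; each distinct string corresponds to one
    non-terminal of the Bisection grammar."""
    labels = set()

    def split(s):
        if s in labels:
            return
        labels.add(s)
        if len(s) == 1:
            return
        j = 1  # largest power of two strictly smaller than |s|
        while 2 * j < len(s):
            j *= 2
        split(s[:j])   # left child: first 2^j characters
        split(s[j:])   # right child: the rest

    split(T)
    return labels

print(len(bisection_nonterminals("a" * 8)))  # a^(2^3): 4 non-terminals
print(len(bisection_nonterminals("a" * 7)))  # after one deletion: 5 non-terminals
```

On $a^{2^k}$ the tree is perfectly balanced and only $k+1$ distinct labels occur; the deletion example illustrates how an edit that breaks the power-of-two alignment introduces new labels, in line with the unary-alphabet discussion below.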
Let us briefly consider the case of the unary alphabet $\Sigma_1 = \{a\}$. Let $h(T)$ denote the height of the derivation tree $\mathcal{T}$ for $T = a^n$. After obtaining $T' = a^{n+1}$ by an insertion or $T' = a^{n-1}$ by a deletion, at most $h(T) - 1$ new productions are added (note that $X \to a$ exists both for $T$ and for $T'$). Thus the additive sensitivity of Bisection for unary alphabets is at most $h(T) - 1$. This bound is almost tight: e.g., deleting a single a from $T = a^{2^k}$ adds $k - 2 = h(T) - 2$ new non-terminals to the existing $k = h(T)$ non-terminals (note that the production $X \to a$ remains and the existing root of $\mathcal{T}$ is replaced with the new one). The multiplicative sensitivity of Bisection for unary alphabets is thus asymptotically …

In what follows, let us consider the case of multi-character alphabets, where at least one of $T$ and $T'$ contains two or more distinct characters.
Proof. substitutions: Consider the unary string $T = a^n$ with $n = 2^k$. The set of productions for $T$ consists of $X_1 \to a$ (generating a) and …, with $g_{\rm bsc}(T) = 2k - 1$. Let $T' = a^{n-1}b$ be the string obtained by replacing the last a of $T$ with b.
deletions: Assume that $|\Sigma| = 2^i$ for a positive integer $i \ge 1$. Let $Q$ be a string with $|Q| = |\Sigma|^2 + 1$ that contains $t = |\Sigma|^2$ distinct bigrams, and let $Q' = Q[2..|Q|]$. Let $\sigma_i$ denote the lexicographically $i$th character of $\Sigma$. We consider the string … Note that $p = \log(n/|\Sigma|)$. The set of productions for $T$ from depth 1 to $p$ is … Thus, the derivation tree of $T$ has $p|\Sigma|$ internal nodes with distinct labels. Additionally, the string at height $p$ consists of $t - 1$ distinct bigrams and contains no run of length 2. Then the derivation tree of $T$ has $t - 1$ further internal nodes with distinct labels at heights above $p$. Finally, $g_{\rm bsc}(T) = p|\Sigma| + t - 1$.
We consider the string $T'$ in which $T[1]$ is removed, namely, … The set of productions for $T'$ at height 1 is … Thus, the derivation tree for $T'$ has $t = |\Sigma|^2$ internal nodes with distinct labels at height one. Because of this, the number of internal nodes of the derivation tree of $T'$ at each height $2 \le p' \le p$ is also at least $t = |\Sigma|^2$. Moreover, the string at height $p$ consists of $t$ distinct bigrams and contains no run of length 2, which is the same condition as for $T$. Then the derivation tree of $T'$ has an additional $t - 1$ internal nodes with distinct labels at heights above $p$. Finally, $g_{\rm bsc}(T') = tp + t$. Then we obtain …, where $\Omega(|\Sigma|^2 p) = \Omega(|\Sigma|^2 \log\frac{n}{|\Sigma|})$ and $\Omega(|\Sigma|^2 p) = \Omega(|\Sigma|\, g_{\rm bsc}(T))$. insertions: We use the same string $T$ as in the case of deletions, and consider the string $T'$ obtained by prepending $Q'[1]$ to $T$, namely, … The set of productions for $T'$ at height 1 is … Thus, the derivation tree for $T'$ has $t + 1$ internal nodes with distinct labels at height one. Because of this, the number of internal nodes of the derivation tree of $T'$ at each height $2 \le p' \le p$ is also at least $t = |\Sigma|^2$. Moreover, the string at height $p$ consists of $t$ distinct bigrams and contains no run of length 2, which is the same condition as for $T$. Then the derivation tree of $T'$ has an additional $t - 1$ internal nodes with distinct labels at heights above $p$. Finally, $g_{\rm bsc}(T') = (t + 1)p + t$. Then we obtain …, where $\Omega(|\Sigma|^2 p) = \Omega(|\Sigma|^2 \log\frac{n}{|\Sigma|})$ and $\Omega(|\Sigma|^2 p) = \Omega(|\Sigma|\, g_{\rm bsc}(T))$.
We show a concrete example of how the derivation tree of Bisection changes by an insertion in Figure 7.
Proof. substitutions: Let $i$ be the position at which we substitute the character $T[i]$. We consider the path $P$ from the root of $\mathcal{T}$ to the $i$th leaf of $\mathcal{T}$, which corresponds to $T[i]$. We only need to change the labels of the nodes on the path $P$, since no other node contains the $i$th leaf. Since $\mathcal{T}$ is a balanced binary tree, the height $h$ of $\mathcal{T}$ is $\lceil \log_2 n \rceil$, and hence $|P| \le h = \lceil \log_2 n \rceil$. Since $h \le g_{\rm bsc}$, we get $\mathrm{MS}_{\rm sub}(g_{\rm bsc}, n) \le 2$. Since each production is in Chomsky normal form and since $\lceil \log_2 n \rceil \le g_{\rm bsc}$, we get $\mathrm{AS}_{\rm sub}(g_{\rm bsc}, n) \le 2\lceil \log_2 n \rceil \le 2 g_{\rm bsc}$.
insertions: Let $i$ be the position at which we insert a new character a into $T$, and let $\mathcal{T}$ and $\mathcal{T}'$ be the derivation trees for the strings $T$ and $T'$ before and after the insertion, respectively. For any node $v$ in the derivation tree $\mathcal{T}$, let $\mathcal{T}(v)$ denote the subtree rooted at $v$. Let $\ell(v)$ and $r(v)$ denote the text positions that respectively correspond to the leftmost and rightmost leaves of $\mathcal{T}(v)$. We use the same analysis as for substitutions for the left children of the nodes on the path $P$ from the root to the new $i$th leaf, which corresponds to the inserted character a. From now on, let us focus on the subtrees $\mathcal{T}'(v')$ of $\mathcal{T}'$ such that $\ell(v') > i$ and $v'$ is not on the rightmost path from the root of $\mathcal{T}'$. Let str$(v')$ denote the string derived from the non-terminal of $v'$, and let $v$ be the node in $\mathcal{T}$ which corresponds to $v'$. Observe that str$(v')$ has been shifted by one position in the string due to the new character a inserted at position $i$. Since $T[\ell(v)..r(v)]$ is represented by the node $v$ in $\mathcal{T}$, there exist at most $g_{\rm bsc}$ distinct substrings of $T$ that can be the "seed" of the strings represented by the nodes $v'$ of $\mathcal{T}'$ with $\ell(v') > i$. Since the number of left-contexts of each $T[\ell(v)..r(v)]$ is at most $|\Sigma|$, there can be at most $|\Sigma|$ distinct shifts from the seed $T[\ell(v)..r(v)]$.

[Figure: There are nodes $X_1, X_{\sigma+1}, X_{2\sigma+1}, X_{3\sigma+1}$ in the leftmost path of the derivation tree of $T = a^{2^4} b^{2^4} b^{2^4} \cdots$ (upper). After z is prepended to $T$ (yielding $T'$), new internal nodes $X'_1, X'_{\sigma+1}, X'_{2\sigma+1}, X'_{3\sigma+1}$ that correspond to $za, za^3, za^7, za^{15}$ occur in the derivation tree for $T'$ (lower). This propagates to the other $\sigma - 1$ bigrams ab, bc, …, which consist of distinct characters.]
Since the rightmost paths from the roots of $\mathcal{T}$ and $\mathcal{T}'$ are all distinct except at the root, and since inserting a character can increase the length of the rightmost path by at most 1, overall we have $g_{\rm bsc}(T') \le |\Sigma|\, g_{\rm bsc}(T) + \lceil \log_2 n \rceil + 1 \le |\Sigma|\, g_{\rm bsc}(T) + h(T) + 1$, where $h(T)$ is the height of $\mathcal{T}$. For the case of multi-character alphabets, $g_{\rm bsc}(T) \ge h(T) + 1$ holds, and hence $g_{\rm bsc}(T') \le (|\Sigma|+1)\, g_{\rm bsc}(T)$ follows from formula (2). Hence we get $\mathrm{MS}_{\rm ins}(g_{\rm bsc}, n) \le |\Sigma|+1$ and $\mathrm{AS}_{\rm ins}(g_{\rm bsc}, n) \le |\Sigma|\, g_{\rm bsc}$. deletions: By arguments similar to the case of insertions, we get $\mathrm{MS}_{\rm del}(g_{\rm bsc}, n) \le |\Sigma| + 1$ and $\mathrm{AS}_{\rm del}(g_{\rm bsc}, n) \le |\Sigma|\, g_{\rm bsc}$.

Compact Directed Acyclic Word Graphs (CDAWGs)
In this section, we consider the worst-case sensitivity of the size of Compact Directed Acyclic Word Graphs (CDAWGs) [10]. The CDAWG of a string T , denoted CDAWG(T ), is a string data structure that represents the set of suffixes of T , such that the number v of internal nodes in CDAWG(T ) is equal to the number of distinct maximal repeats in T , and the number e of edges in CDAWG(T ) is equal to the number of right-extensions of maximal repeats occurring in T . Therefore, the smaller CDAWG(T ) is, the more repetitive T is. Since v ≤ e always holds, we simply use the number e of edges in the CDAWG as the size of CDAWG(T ), and denote it by e(T ). It is known (c.f. [6]) that CDAWG(T ) induces a grammar-based compression of size e for T .
Proof. deletions: Consider the string $T = a^m b a^m b$ of length $n = 2m + 2$. All the maximal repeats of $T$ are either of the form (1) $a^h$ with $1 \le h < m$, or (2) $a^m b$. Each repeat in group (1) has exactly two out-going edges, labeled a and b, and the repeat in group (2) has exactly one out-going edge, labeled $a^m b$. Summing up these edges together with the two out-going edges from the source, the total number of edges in CDAWG($T$) is $2m + 1 = n - 1$ (see also the left diagram of Figure 8). Consider the string $T' = a^{2m}b$ of length $n - 1 = 2m + 1$ that can be obtained by removing the middle b from $T$. CDAWG($T'$) has $2m - 1$ internal nodes, each of which represents a maximal repeat $a^k$ with $1 \le k < 2m$ and has two out-going edges labeled a and b. Thus, CDAWG($T'$) has exactly $4m = 2n - 4$ edges, including the two out-going edges from the source (see also the right diagram of Figure 8). This gives us $\liminf_{n\to\infty} \mathrm{MS}_{\rm del}(e, n) \ge 2$, $\mathrm{AS}_{\rm del}(e, n) \ge n - 2$, and $\mathrm{AS}_{\rm del}(e, n) \ge e - 2$. substitutions: By replacing the middle b of $T$ with a, we obtain the string $T' = a^{2m+1}b$, which gives us similar bounds for substitutions.
insertions: Consider the string $S = a^n$ of length $n$. The maximal repeats of $S$ are exactly the strings $a^h$ with $1 \le h < n$, and each of them has exactly one out-going edge labeled a. The total number of edges in CDAWG($S$) is thus $n$, including the one from the source. Consider the string $S' = a^n b$ of length $n + 1$. The set of maximal repeats does not change from $S$, but b is a right-extension of $a^h$ for each $1 \le h < n$. Thus, CDAWG($S'$) has a total of $2n - 2$ edges, including the two out-going edges from the source. Thus we have $e(S')/e(S) = \frac{2n-2}{n}$ and $e(S') - e(S) = n - 2$. This gives us $\liminf_{n\to\infty} \mathrm{MS}_{\rm ins}(e, n) \ge 2$, $\mathrm{AS}_{\rm ins}(e, n) \ge n - 2$, and $\mathrm{AS}_{\rm ins}(e, n) \ge e - 2$.
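Since the internal nodes of the CDAWG correspond to maximal repeats, the repeat sets used in this proof can be checked by brute force on small instances. The following sketch (quadratic space, for illustration only) lists the maximal repeats of a string, treating the start and end of the string as distinguished left and right contexts:

```python
def maximal_repeats(T):
    """Brute-force enumeration of the maximal repeats of T: substrings
    occurring at least twice whose occurrences have at least two
    distinct left contexts and at least two distinct right contexts
    ('^' and '$' stand for the string boundaries)."""
    n = len(T)
    occ = {}
    for i in range(n):                      # record every occurrence
        for j in range(i + 1, n + 1):       # of every substring
            occ.setdefault(T[i:j], []).append(i)
    reps = []
    for w, pos in occ.items():
        if len(pos) < 2:
            continue
        lefts = {T[i - 1] if i > 0 else '^' for i in pos}
        rights = {T[i + len(w)] if i + len(w) < n else '$' for i in pos}
        if len(lefts) >= 2 and len(rights) >= 2:
            reps.append(w)
    return sorted(reps, key=len)

print(maximal_repeats("aaabaaab"))  # T  = a^3 b a^3 b (m = 3)
print(maximal_repeats("aaaaaab"))   # T' = a^6 b
```

For $m = 3$, the string $T = a^3 b a^3 b$ has the $m = 3$ maximal repeats a, aa, aaab, while $T' = a^6 b$ has the maximal repeats $a^k$ for $1 \le k < 6$, matching the node counts in the deletion argument above.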

Concluding remarks and future work
In the seminal paper by Varma and Yoshida [65] which first introduced the notion of sensitivity for (general) algorithms and studied the sensitivity of graph algorithms, the authors wrote: "Although we focus on graphs here, we note that our definition can also be extended to the study of combinatorial objects other than graphs such as strings and constraint satisfaction problems." Our study was inspired by the afore-quoted suggestion, and our sensitivity for string compressors and repetitiveness measures enables one to evaluate the robustness and stability of compressors and repetitiveness measures. The major technical contributions of this paper are the tight and constant upper and lower bounds for the multiplicative sensitivity of the LZ77 family, the smallest bidirectional scheme $b$, and the substring complexity $\delta$. We also presented tight and constant upper and lower bounds for the multiplicative sensitivity of the recently proposed grammar compressor GCIS, which is based on the idea of the Induced Sorting algorithm for suffix sorting. We also reported non-trivial upper and/or lower bounds for other string compressors, including RLBWT, LZ-End, LZ78, AVL-grammar, α-balanced grammar, RePair, LongestMatch, Greedy, Bisection, and CDAWG. Some of the upper bounds reported here follow from previous important work [30,35,28,37,31,11,62,26].
Obvious future work is to complete Tables 1 and 2 by filling in the missing entries and by closing the gaps between the upper and lower bounds that are not yet tight.
While we dealt with a number of string compressors and repetitiveness measures, it has to be noted that our list is far from comprehensive: it is intriguing to analyze the sensitivity of other important and useful compressors and repetitiveness measures, including the size $\nu$ of the smallest NU-systems [51] and the sizes of other locally-consistent compressed indexes such as the ESP-index [44] and the SE-index [54].
Our notion of the sensitivity for string compressors/repetitiveness measures can naturally be extended to labeled tree compressors/repetitiveness measures. It would be interesting to analyze the sensitivity for the smallest tree attractor [61], the run-length XBWT [61], the tree LZ77 factorization [20], tree grammars [42,17], and top-tree compression of trees [9].

A Omitted proofs
In this section, we present omitted proofs.
The self-referencing LZ77 factorization of $T$ is LZ77sr$(T) = a|a^{p-1}b|c|abc\#_1|a^2bc\#_2|\cdots|a^pbc\#_p|$ with $z_{\rm 77sr}(T) = p + 3$. Notice that the second factor $a^{p-1}b$ is self-referencing. Consider the string $T' = a^p b \cdot abc\#_1 \cdot a^2bc\#_2 \cdots a^pbc\#_p$ that can be obtained from $T$ by deleting the first c, which is at position $p + 2$. Let us analyze the structure of the self-referencing LZ77 factorization of $T'$. The first two factors are unchanged. The third factor c of LZ77sr$(T)$ is removed, and each of the remaining factors of the form $a^kbc\#_k$ in LZ77sr$(T)$ is divided into two factors as $a^kbc|\#_k|$. Thus the self-referencing LZ77 factorization of $T'$ is LZ77sr$(T') = a|a^{p-1}b|abc|\#_1|a^2bc|\#_2|\cdots|a^pbc|\#_p|$ with $z_{\rm 77sr}(T') = 2p + 2$, which leads to $\liminf_{n\to\infty} \mathrm{MS}_{\rm del}(z_{\rm 77sr}, n) \ge \liminf_{p\to\infty} (2p + 2)/(p + 3) = 2$ and $\mathrm{AS}_{\rm del}(z_{\rm 77sr}, n) \ge 2p + 2 - (p + 3) = p - 1 = \Omega(\sqrt{n})$.
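Both factorizations can be reproduced with a brute-force sketch of the self-referencing LZ77 variant used here: each factor is the longest match starting at an earlier position (the occurrence may overlap the factor itself), extended by one fresh character. For concreteness, the separators $\#_k$ are modeled below by the digits 1, 2, 3:

```python
def lz77_self_ref(T):
    """Greedy self-referencing LZ77: each factor is the longest prefix
    of the remaining suffix that also occurs starting at an earlier
    position (overlaps allowed), extended by one fresh character
    (the extension is omitted at the very end of the string)."""
    factors = []
    i, n = 0, len(T)
    while i < n:
        best = 0
        for j in range(i):  # candidate earlier starting position
            l = 0
            while i + l < n and T[j + l] == T[i + l]:
                l += 1
            best = max(best, l)
        end = min(i + best + 1, n)  # longest match plus one literal
        factors.append(T[i:end])
        i = end
    return factors

# T  = a^3 b c abc#1 a^2bc#2 a^3bc#3   (p = 3, #k written as digits)
T = "aaab" + "c" + "abc1" + "aabc2" + "aaabc3"
# T' = T with its first c (at position p + 2) deleted
T2 = "aaab" + "abc1" + "aabc2" + "aaabc3"
print(lz77_self_ref(T))   # 6 = p + 3 factors
print(lz77_self_ref(T2))  # 8 = 2p + 2 factors
```

With $p = 3$ the code yields the factors a, aab, c, abc1, aabc2, aaabc3 for $T$, and a, aab, abc, 1, aabc, 2, aaabc, 3 for $T'$, exactly as in the proof.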
It is also possible to binarize the strings $T$ and $T'$ in the above proof for the cases of substitutions and insertions, while retaining the same lower bounds: Corollary 8. For the self-referencing LZ77 factorization, there are binary strings of length $n$ that satisfy $\mathrm{MS}_{\rm sub}(z_{\rm 77sr}, n) \ge 2$ and $\mathrm{MS}_{\rm ins}(z_{\rm 77sr}, n) \ge 2$, respectively.