Relationship between superstring and compression measures: New insights on the greedy conjecture

the compression ratio and the superstring ratio of an approximation algorithm in general, and derive a bound of the superstring ratio as a function of the compression ratio. When applied to greedy on words of fixed length (r-SSP), we obtain a superstring approximation ratio of 2 for 3-SSP, and this ratio increases with r to reach for r = 6 a value of 7/2, which is the best known ratio for the greedy algorithm [10]. But we also get a tight superstring ratio of 3/2 for 2-SSP, thereby demonstrating that the greedy algorithm can achieve a ratio strictly smaller than 2. This shows first that


Introduction
Given a set of p words $P := \{s_1, s_2, \ldots, s_p\}$ over a finite alphabet Σ, a superstring of P is a string containing each $s_i$ for 1 ≤ i ≤ p as a substring. The Shortest Superstring Problem (SSP) asks for a superstring of P of minimal length. SSP is a well-studied problem (also known as Shortest Common Superstring), with a strong relation to the Asymmetric Travelling Salesman Problem, and is known to be NP-hard even on a binary alphabet [7]. The restriction to instances where all input strings share the same length, say r > 1, is denoted r-SSP; it is polynomial if r ≤ 2, but remains NP-hard as soon as the strings are of length at least 3 [1]. Two approximation measures can be optimised for SSP: either the length of the superstring is minimised, or the compression is maximised (i.e., the sum of the lengths of the input strings minus that of the superstring). For a word x, |x| denotes the length of x. Let $\|P\|$ denote $\sum_{s_i \in P} |s_i|$ and let t be the output superstring; then the compression equals $\|P\| - |t|$. With both measures SSP is hard to approximate (MAX-SNP-hard, see [1]). Since 1991, a long series of elaborate algorithms has improved the approximation ratio for both measures, culminating in $2\frac{11}{23}$ for the superstring measure [13] and in 3/4 for the compression measure [14]. A recent table listing these ratios and the literature, as well as known inapproximability bounds, appears in [9]. A detailed survey gives an overview of the numerous application contexts of SSP [8].
In 1988, a seminal paper introduced a simple greedy algorithm, which repeatedly merges two words that exhibit the largest (prefix-suffix) overlap until only one string remains [16]. With P := {abba, bbaa, aaba} for example, abba is first merged with bbaa, yielding abbaa (they share a 3-letter overlap); then abbaa is merged with aaba, resulting in the superstring abbaaba of length 7. As $\|P\| = 12$, the compression obtained equals $\|P\| - |t| = 12 - 7 = 5$. Note that this greedy algorithm, denoted by greedy, can be seen as the greedy algorithm of a specific hereditary system [4]. Tarhio and Ukkonen proved in [16] that greedy achieves a compression ratio of 1/2 and formulated the greedy conjecture: the greedy algorithm yields a superstring ratio of 2. Despite a lot of research dedicated to SSP, this conjecture has remained open since 1988. A weaker form of this conjecture asks to prove this ratio for r-SSP and some values of r. Blum et al. have shown for greedy a superstring ratio of 4 [1], which was later improved to 3.5 in [10]. The greedy conjecture is supported by simulated experiments [18,15]. Moreover, the superstring approximation ratio obtained by the greedy algorithm remains a crucial question, especially since other approximation algorithms are usually less efficient than greedy [10].
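The merging procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of [16]: the function names `overlap`, `greedy_superstring` and `compression` are hypothetical, ties between equal overlaps are broken by scan order (an arbitrary choice), and the input is assumed to contain no word that is a substring of another.

```python
def overlap(x, y):
    """Length of the longest suffix of x that is also a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return k
    return 0


def greedy_superstring(words):
    """Repeatedly merge the pair with the largest overlap (quadratic scan per step)."""
    strings = list(dict.fromkeys(words))  # drop duplicates, keep input order
    if not strings:
        return ""
    while len(strings) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, x in enumerate(strings):
            for j, y in enumerate(strings):
                if i != j:
                    k = overlap(x, y)
                    if k > best_k:  # strict: the first maximal pair wins ties
                        best_k, best_i, best_j = k, i, j
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings)
                   if idx not in (best_i, best_j)] + [merged]
    return strings[0]


def compression(words, t):
    """Compression of superstring t: total input length minus |t|."""
    return sum(len(w) for w in set(words)) - len(t)


P = ["abba", "bbaa", "aaba"]
t = greedy_superstring(P)
print(t, len(t), compression(P, t))
```

On the instance of the text, this sketch merges abba with bbaa first and then appends aaba, which reproduces the superstring of length 7 and the compression of 5.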
Recently, it has been proven that in the case where all input words have length 4 (for 4-SSP) the greedy algorithm achieves a superstring ratio of at most 2, as stated by the conjecture [11]. This proof is valid only for words of length 4 and cannot be adapted to words of length 3, for instance. Kulikov and colleagues [11] suggest that the conjecture for 3-SSP follows from the fact that greedy achieves a 2-approximation of the compression measure, citing [16]. To our knowledge, no proof of the greedy conjecture for words of length 3 has ever been published and there is no mention of it in a recent survey [8]. Here, we study the relationship between the compression ratio and the superstring ratio of an approximation algorithm in general, and derive a bound of the superstring ratio as a function of the compression ratio. When applied to greedy on words of fixed length (r-SSP), we obtain a superstring approximation ratio of 2 for 3-SSP, and this ratio increases with r to reach for r = 6 a value of 7/2, which is the best known ratio for the greedy algorithm [10]. But we also get a tight superstring ratio of 3/2 for 2-SSP, thereby demonstrating that the greedy algorithm can achieve a ratio strictly smaller than 2. This shows first that the general relationship between the superstring and compression measures is important and can serve for future research. Second, the ratio smaller than 2 does not contradict known bounds or instances. Indeed, the known examples give a bound that converges towards 2 from below when the length of the input words tends to infinity. Thus, we propose a more precise conjecture for r-SSP, in which the superstring ratio equals 2 − 1/r instead of 2.
Notation: An alphabet Σ is a finite set of letters. A linear word or string over Σ is a finite sequence of elements of Σ. The set of all finite words over Σ is denoted by $\Sigma^\star$, and $\Sigma^r$ denotes the subset of $\Sigma^\star$ of words of length r for any positive integer r.
Given two words x and y, we denote by xy the concatenation of x and y.

Relation between maximum compression and shortest superstring approximation ratios for SSP
Here, we exhibit for SSP an upper bound on the superstring approximation ratio of an algorithm as a function of its compression ratio.
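Let A be a polynomial-time approximation algorithm for SSP. As all approximation algorithms considered here take polynomial time in the input size, we omit this characteristic in the sequel. We denote by $s_A(P)$ the output of algorithm A with input P, and by $s_{\mathrm{opt}}(P)$ an optimal superstring for this input. Note that $s_{\mathrm{opt}}(P)$ also achieves a maximum compression for P. We only consider approximation algorithms that return a superstring whose length is bounded by $\|P\|$. In other words, we disregard algorithms that insert additional symbols beyond those required by the words of the instance. Without this restriction, the approximation ratio super(A) would not be defined for any algorithm A, and the ratio comp(A) could be negative; both ratios are defined a few lines below. Instances where the optimal superstring is the concatenation of all the words of the instance satisfy $|s_{\mathrm{opt}}(P)| = \|P\|$. In such cases, for any approximation algorithm A, one has $\|P\| = |s_{\mathrm{opt}}(P)| \le |s_A(P)| \le \|P\|$, so A returns an optimal superstring. Such instances are excluded from Theorem 1.
Let us define the superstring approximation ratio of algorithm A, denoted super(A), as the smallest real value such that for any input P:
$$|s_A(P)| \le \mathrm{super}(A) \times |s_{\mathrm{opt}}(P)|.$$
Similarly, we define the compression ratio comp(A) as the largest real value such that, for any input P:
$$\|P\| - |s_A(P)| \ge \mathrm{comp}(A) \times \left(\|P\| - |s_{\mathrm{opt}}(P)|\right).$$
Theorem 1. Let A be an approximation algorithm for SSP and let γ be a real number with 0 < γ < 1 such that $|s_{\mathrm{opt}}(P)| \ge \gamma \|P\|$ for any input P. Then
$$\mathrm{super}(A) \le \frac{(\gamma - 1) \times \mathrm{comp}(A) + 1}{\gamma}.$$
Proof. Let $\alpha = \frac{(\gamma - 1) \times \mathrm{comp}(A) + 1}{\gamma}$ and consider the function $f : x \mapsto \frac{(x - 1) \times \mathrm{comp}(A) + 1}{x}$. Its derivative is $f'(x) = \frac{\mathrm{comp}(A) - 1}{x^2}$, which is negative whenever 0 < comp(A) < 1 (if comp(A) = 1, then A is optimal and the claim is trivial). Thus f is decreasing, and as γ < 1, we get α = f(γ) > f(1) = 1. We obtain that $\gamma = \frac{1 - \mathrm{comp}(A)}{\alpha - \mathrm{comp}(A)}$. It follows that:
$$|s_{\mathrm{opt}}(P)| \ge \frac{1 - \mathrm{comp}(A)}{\alpha - \mathrm{comp}(A)} \|P\|, \quad \text{that is,} \quad \alpha \, |s_{\mathrm{opt}}(P)| \ge \|P\| - \mathrm{comp}(A) \times \left(\|P\| - |s_{\mathrm{opt}}(P)|\right).$$
By definition A achieves the compression ratio comp(A), so using the previous inequality we get
$$|s_A(P)| \le \|P\| - \mathrm{comp}(A) \times \left(\|P\| - |s_{\mathrm{opt}}(P)|\right) \le \alpha \, |s_{\mathrm{opt}}(P)|.$$
As super(A) is the smallest value that bounds $|s_A(P)| / |s_{\mathrm{opt}}(P)|$ from above for any set P of input words, and as α does not depend on P, we get:
$$\mathrm{super}(A) \le \alpha = \frac{(\gamma - 1) \times \mathrm{comp}(A) + 1}{\gamma}.$$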
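To make the quantity α of the proof concrete, the following sketch evaluates it with exact rational arithmetic. The function name `superstring_bound` is hypothetical, and the domain check (0 < γ < 1, 0 < comp(A) ≤ 1) simply mirrors the conditions used in the proof.

```python
from fractions import Fraction


def superstring_bound(gamma, comp):
    """Evaluate alpha = ((gamma - 1) * comp + 1) / gamma exactly."""
    gamma, comp = Fraction(gamma), Fraction(comp)
    if not (0 < gamma < 1 and 0 < comp <= 1):
        raise ValueError("expected 0 < gamma < 1 and 0 < comp <= 1")
    return ((gamma - 1) * comp + 1) / gamma


# With the compression ratio 1/2 of greedy and gamma = 1/r:
for r in (2, 3, 6):
    print(r, superstring_bound(Fraction(1, r), Fraction(1, 2)))
```

For γ = 1/r and a compression ratio of 1/2, the values obtained for r = 2, 3 and 6 are 3/2, 2 and 7/2, which agree with the ratios quoted in the introduction.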

Approximation of r-SSP
Let r be an integer satisfying r > 1. Now, let us study the superstring approximation for the restriction of SSP to instances in which all input words have the same length r. First we show a theorem bounding the superstring ratio as a function of the compression ratio for r-SSP for any algorithm. Then, we derive an upper bound and prove a lower bound for the superstring ratio of the greedy algorithm. Finally, applying this theorem improves the superstring ratio for r < 6 compared to the 7/2 bound of [10], and solves the greedy conjecture for 3-SSP.
Since the instance P is a subset of $\Sigma^r$, we have $\|P\| = r \times p$. As all words of P are different, any word differs from the others by at least one symbol and any two words overlap by at most r − 1 positions, which implies the following property.
Proposition 1. Let t be a superstring of P. Then |t| ≥ r + p − 1.
We derive the following theorem.
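Theorem 2. Let r be an integer such that r > 1 and let P be a subset of $\Sigma^r$. For any approximation algorithm A, we have:
$$\frac{|s_A(P)|}{|s_{\mathrm{opt}}(P)|} \le r - (r - 1) \times \mathrm{comp}(A).$$
Proof. From Proposition 1, we know that $|s_{\mathrm{opt}}(P)| \ge r + p - 1$, which implies
$$|s_{\mathrm{opt}}(P)| \ge p = \frac{1}{r} \|P\|.$$
Using Theorem 1 with γ = 1/r, we obtain
$$\frac{|s_A(P)|}{|s_{\mathrm{opt}}(P)|} \le \frac{(1/r - 1) \times \mathrm{comp}(A) + 1}{1/r} = r - (r - 1) \times \mathrm{comp}(A).$$
Theorem 2 bounds the ratio of an algorithm A for any instance of r-SSP. Consequently, super(A) also satisfies the same inequality.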
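The key step of this proof, the lower bound |t| ≥ r + p − 1 of Proposition 1, can be checked on small instances by exhaustive search. The sketch below is only an illustration: `shortest_superstring_length` is a hypothetical helper that tries every ordering of the words, so it is usable for a handful of words only, and it assumes distinct words of equal length.

```python
from itertools import permutations


def overlap(x, y):
    """Length of the longest suffix of x that is also a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return k
    return 0


def shortest_superstring_length(words):
    """Brute force: best ordering with maximal overlaps between neighbours."""
    words = list(dict.fromkeys(words))
    best = None
    for order in permutations(words):
        length = len(order[0])
        for prev, cur in zip(order, order[1:]):
            length += len(cur) - overlap(prev, cur)
        if best is None or length < best:
            best = length
    return best


for P in (["ab", "bc", "ca"], ["abba", "bbaa", "aaba"]):
    r, p = len(P[0]), len(P)
    opt = shortest_superstring_length(P)
    assert opt >= r + p - 1  # Proposition 1
    print(P, opt, r + p - 1)
```

The first instance meets the bound with equality (abca has length 4 = r + p − 1), while the second one, taken from the introduction, stays strictly above it.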
We can now provide a bound on the approximation ratio of the greedy algorithm for r-SSP, knowing that its compression ratio is 1/2 [16,4,3].
Proposition 2. The superstring approximation ratio of greedy for r-SSP is at least 2 − 1/r.
Proof. Theorem 2 gives an upper bound on the approximation ratio of greedy. To obtain the desired lower bound, we exhibit an instance where $|s_{\mathrm{greedy}}(P)| / |s_{\mathrm{opt}}(P)| = 2 - 1/r$ (see Fig. 1).
In the instance of the proof above, because of the fixed word length, both the alphabet cardinality and the number of words go to infinity to reach the bound. Thanks to Proposition 2 and to Theorem 2, and by using the compression ratio of greedy, which equals 1/2, we obtain new bounds on the approximation ratio of greedy for r-SSP. The known greedy superstring ratio of 3.5 [10] allows us to refine the upper bound of super(greedy) for r-SSP.

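Theorem 3. The superstring approximation ratio of greedy for r-SSP is bounded by
$$2 - \frac{1}{r} \le \mathrm{super}(\mathrm{greedy}) \le \min\left(\frac{r + 1}{2}, \frac{7}{2}\right).$$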
Note that the lower and upper bounds meet for r = 2. Theorem 3 suggests a more precise version of the greedy conjecture: the superstring approximation ratio of greedy on r-SSP is 2 − 1/r. Note that this ratio has been proven, but only for a subset of instances corresponding to a restricted class of orders in which strings are merged, known as "linear greedy orders" [17].
Table 1 shows the actual bounds for small values of r. One observes that greedy achieves a superstring ratio that increases from 3/2 for words of length 2 up to 7/2 for r = 6. It reaches a ratio of 2 for 3-SSP, which solves the classical greedy conjecture for 3-SSP. As the previously known bound on the approximation ratio of greedy for r-SSP is 7/2 [10], our theorem improves on this bound for all values of r below 6. Surprisingly, for 2-SSP, greedy achieves a ratio of 3/2, which is tight. This shows that greedy can do better than the ratio of 2 stated by the classical greedy conjecture. Note that other approximation algorithms (which are more complex than greedy) yield better approximation ratios for small values of r. For instance, an algorithm that combines a de Bruijn graph approach and an overlap graph approach yields a ratio of $(r^2 + r - 4)/(4r - 6)$, which is 4/3 for 3-SSP [9]. The greedy conjecture remains open for r > 5 and in general for SSP.
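The values discussed here can be recomputed with a short script. It is a sketch under stated assumptions and not a reproduction of Table 1: the upper bound is taken as (r + 1)/2, obtained by plugging the compression ratio 1/2 and γ = 1/r into α, capped by the known ratio 7/2 of [10]; the lower bound is the conjectured ratio 2 − 1/r. The function name `greedy_bounds` is hypothetical.

```python
from fractions import Fraction


def greedy_bounds(r):
    """Return (lower, upper) bounds on super(greedy) for r-SSP, r > 1."""
    if r < 2:
        raise ValueError("r-SSP is considered for r > 1")
    lower = 2 - Fraction(1, r)
    upper = min(Fraction(r + 1, 2), Fraction(7, 2))
    return lower, upper


for r in range(2, 7):
    lower, upper = greedy_bounds(r)
    print(f"r = {r}: {lower} <= super(greedy) <= {upper}")
```

Under these assumptions the two bounds coincide only for r = 2, and the cap of 7/2 becomes active from r = 6 onwards.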

Conclusion
The Shortest Superstring Problem is a crucial problem in computer science and has many practical applications in data compression, and in bioinformatics where it models genome assembly [8]. In this context, the case of r-SSP is realistic since sequencers often produce sequencing reads of the same length. Because it is simple and more efficient than other methods [10], and because it yields very good solutions in practice [12,15], the greedy algorithm is important. More generally, we exploit the relationship between the two approximation measures, the superstring length and the compression, to bound the superstring ratio as a function of the compression ratio, which to our knowledge is new. This bound applies to SSP in general, and our results could prove useful for variants of SSP, like SSP for DNA strings, SSP with flippings, or for cyclic superstrings [2,6].
Maximising the compression or minimising the superstring length are dual problems (known as Maximum Compression and SSP, respectively). An optimal solution for one is also optimal for the other, while good approximate solutions differ for both. To solve this artificial asymmetry, another definition of approximation ratio has been proposed: the differential approximation ratio [5], which incorporates the size of the worst solution. For SSP in general, there is no longest superstring (no worst solution). With the natural restriction we considered for Theorem 1, the superstring ratio is the classical ratio, while the compression ratio is the differential approximation ratio for SSP. For the Maximum Compression problem, the compression ratio is both the classical and the differential approximation ratio. As we conjecture that computing a longest superstring obtained from a permutation of the input words is NP-hard, the study of the differential approximability of SSP appears as an appealing future line of research.
In addition, the greedy algorithm also gives an exact solution for finding the Shortest Cyclic Cover of Strings. Proving the greedy conjecture in general remains a challenging open question. Here, we prove the greedy conjecture of a superstring approximation ratio of 2 for 3-SSP, a restriction of SSP known to be NP-hard. Our proof also implies better superstring ratios for r < 6 (except 4). In addition, we show that greedy has a tight approximation bound of 3/2 on 2-SSP, meaning that it can yield ratios strictly smaller than 2, which was unknown. It suggests that the ratio depends on the length of input words. Hence, we propose to revise the greedy conjecture for input words of fixed length: is the superstring ratio of greedy equal to 2 − 1/r?
Fig. 1. Illustration of the instance considered in the proof of Proposition 2, which gives the lower bound of the greedy superstring ratio for r-SSP. The prefix graph for this instance is shown in (a), the path corresponding to the optimal solution in (b), and the path of the greedy solution in (c). The prefix graph is a complete digraph in which each input word is a node, and the weight of an arc (x, y) equals the length of x minus the length of the overlap between x and y.

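Then in the worst case, the greedy solution is
$$s_{\mathrm{greedy}}(P) = a_1 a_2^{r-1} a_3^{r-1} \cdots a_{m-1}^{r-1} a_m a_2^{r} a_3^{r} \cdots a_{m-1}^{r},$$
while an optimum superstring is
$$s_{\mathrm{opt}}(P) = a_1 a_2^{r} a_3^{r} \cdots a_{m-1}^{r} a_m.$$
Thus, we get
$$\frac{|s_{\mathrm{greedy}}(P)|}{|s_{\mathrm{opt}}(P)|} = \frac{2 + (2r - 1)(m - 2)}{2 + r(m - 2)},$$
which tends to 2 − 1/r as m grows.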
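The lengths of these two strings can be checked numerically. The sketch below reads the greedy string as a_1 a_2^{r−1} ⋯ a_{m−1}^{r−1} a_m a_2^r ⋯ a_{m−1}^r and the optimal one as a_1 a_2^r ⋯ a_{m−1}^r a_m. The words of the instance are not restated here, so the code assumes, purely for illustration, that the instance consists of all factors of length r of the optimal string; the helper names and the mapping of a_1, ..., a_m to lowercase letters (hence m ≤ 26) are hypothetical. The code does not run greedy itself, it only verifies that both strings are superstrings of that assumed instance and compares their lengths.

```python
from fractions import Fraction
import string


def lower_bound_strings(r, m):
    """Build (s_greedy, s_opt) with a_1..a_m mapped to 'a', 'b', ..."""
    if r < 2 or not 3 <= m <= 26:
        raise ValueError("need r >= 2 and 3 <= m <= 26")
    letters = string.ascii_lowercase[:m]
    first, middle, last = letters[0], letters[1:-1], letters[-1]
    s_opt = first + "".join(c * r for c in middle) + last
    s_greedy = (first + "".join(c * (r - 1) for c in middle) + last
                + "".join(c * r for c in middle))
    return s_greedy, s_opt


def factors(s, r):
    """All distinct substrings of s of length r."""
    return {s[i:i + r] for i in range(len(s) - r + 1)}


r, m = 3, 26
s_greedy, s_opt = lower_bound_strings(r, m)
instance = factors(s_opt, r)  # assumed instance, see the lead-in
assert all(w in s_greedy for w in instance)
assert len(s_opt) == 2 + r * (m - 2)
assert len(s_greedy) == 2 + (2 * r - 1) * (m - 2)
ratio = Fraction(len(s_greedy), len(s_opt))
print(float(ratio), float(2 - Fraction(1, r)))
```

For fixed r the ratio stays below 2 − 1/r and approaches it as m increases, in line with the remark that the number of words and the alphabet size must grow to reach the bound.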

Table 1
Bounds on the approximation ratio of the greedy algorithm for r-SSP for r < 7. It achieves a bound of 2 for 3-SSP. Ratio 7/2 is the currently best known ratio for r-SSP in general. It also gives a tight ratio of 3/2 for 2-SSP, which is polynomial.