Enhanced string factoring from alphabet orderings

In this note we consider the concept of alphabet ordering in the context of string factoring. We propose a greedy-type algorithm which produces Lyndon factorizations with small numbers of factors along with a modification for large numbers of factors. For the technique we introduce the Exponent Parikh vector. Applications and research directions derived from circ-UMFFs are discussed.


Introduction
Factoring strings is a powerful form of the divide-and-conquer problem-solving paradigm for strings or words. Notably, the Lyndon factorization [1] is both efficient to compute [5] and useful in practice [7]. We study the effect of an alphabet's order on the number of factors in a Lyndon factorization and propose a greedy algorithm that assigns an ordering to the alphabet. In addition, we formalize the distinction between the sets of Lyndon and co-Lyndon words [3] as avenues for alternative string factorizations. More generally, circ-UMFFs provide the opportunity for achieving further diversity with string factors [2,3].

✩ The authors were part-funded by the European Regional Development Fund through the Welsh Government, grant 80761-AU-137 (West).
✩✩ A preliminary version of this paper was accepted as a poster in IWOCA 2018 (International Workshop on Combinatorial Algorithms).

Notation
Given an integer n ≥ 1 and a nonempty set of symbols Σ (bounded or unbounded), a string of length n, equivalently a word, over Σ takes the form x = x_1 ... x_n with each x_i ∈ Σ. For brevity, we write x = x[1..n] with x[i] = x_i. The length n of a string x is denoted by |x|. The set Σ is called an alphabet whose members are letters or characters, and Σ⁺ denotes the set of all nonempty finite strings over Σ.
If x = uwv for strings u, w, v ∈ Σ*, then u is a prefix, w is a substring or factor, and v is a suffix of x; we say that u ≠ x is a proper prefix, and similarly for the other terms. If x = uv, then vu is said to be a rotation (cyclic shift or conjugate) of x. A string x is said to be a repetition if and only if it has a factorization x = u^k for some integer k > 1; otherwise, x is said to be primitive. For a string x, the reversed string x̄ is defined as x̄ = x[n] x[n−1] ... x[1]. A string which is both a proper prefix and a proper suffix of a string x ≠ ε is called a border of x; a string is border-free if the only border it has is the empty string ε.

A UMFF W is a circ-UMFF if it contains exactly one rotation of every primitive string x ∈ Σ⁺. The classic and foundational circ-UMFF is the set of Lyndon words, which we denote L, where the rotation chosen is the one that is strictly least in the lexorder derived from an ordering of the letters of the alphabet [1,5,2]. The co-Lyndon circ-UMFF, co-L, was introduced in [3], where a co-Lyndon word is strictly least amongst its rotations in co-lexorder.
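The definitions of rotation, primitivity, border and (co-)Lyndon word can be checked by brute force on short strings. The following is a minimal sketch, assuming the natural character order of Python strings as the alphabet ordering and reading co-lexorder as lexorder of the reversed strings; all function names are illustrative and do not come from the paper.

```python
from itertools import product


def rotations(w):
    """All rotations (conjugates) of w, starting with w itself."""
    return [w[i:] + w[:i] for i in range(len(w))]


def is_primitive(w):
    """w is primitive iff it differs from every nontrivial rotation of itself."""
    return all(w != r for r in rotations(w)[1:])


def is_border_free(w):
    """True iff no nonempty proper prefix of w is also a suffix of w."""
    return not any(w[:k] == w[-k:] for k in range(1, len(w)))


def is_lyndon(w):
    """Strictly least among its rotations in lexorder (assumed: natural char order)."""
    return all(w < r for r in rotations(w)[1:])


def is_co_lyndon(w):
    """Strictly least among its rotations when reversed strings are compared."""
    return all(w[::-1] < r[::-1] for r in rotations(w)[1:])


# Enumerate all words of length 1..5 over the (assumed) alphabet {a < b < c}.
words = ["".join(p) for n in range(1, 6) for p in product("abc", repeat=n)]
lyndon = {w for w in words if is_lyndon(w)}
co_lyndon = {w for w in words if is_co_lyndon(w)}
```

In this small enumeration every Lyndon word is primitive and border-free, and the only words belonging to both sets are the single letters.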
Every circ-UMFF W yields a strict order relation, the W-order: if W contains strings u, v and uv then u <_W v. For the Lyndon circ-UMFF, its specific W-order is lexorder: for u, v ∈ L, we have uv ∈ L if and only if u < v (Theorem 1). It was observed in [3] that the analogue of Theorem 1 does not hold for every circ-UMFF; we will establish this phenomenon for the co-Lyndon circ-UMFF.
Applying Theorem 1 we have v̄ū ∈ L and hence uv ∈ co-L. Next, if uv ∈ co-L then it must be primitive and border-free [2]. Thus u ≠ v, which gives rise to two cases. Suppose first that u ≺ v. If u is a proper suffix of v then uv = uwu for some w ≠ ε, contradicting the border-free property. Otherwise, with |u| = n there is some largest j, and so v ≺ u as required. □

The sets of Lyndon and co-Lyndon words are distinct and almost disjoint.
Lemma 2. For a given Σ, L ≠ co-L and L ∩ co-L = Σ.
Proof. Let v ∈ L and w ∈ co-L with |v|, |w| ≥ 2. Then v starts with some letter α which is minimal in v. Since v is border-free [2], it ends with some β where α < β. Similarly, w starts with γ and ends with δ, where γ > δ. Therefore v ≠ w. Finally, every circ-UMFF contains the alphabet, as expressed in [2,3]. □

The following result generalizes the Lyndon factorization theorem [1] and is a key to further applications of string decomposition.

Theorem 2. [2] Let W be a circ-UMFF and suppose x =

Alphabet ordering
Suppose the goal is to optimize a Lyndon factorization by minimizing or maximizing the number of factors. For this we consider choosing the order of the letters in the (assumed unordered) alphabet so as to influence the number of factors. To illustrate, consider the string x = abcabcdabcaba. Under the ordering {a < b < c < d}, the Lyndon factorization of x is abcabcd ≥ abc ≥ ab ≥ a; whereas, if we choose the alphabet ordering to be {b < c < a < d}, the Lyndon factorization of x becomes a ≥ bcabcdabcaba.
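The effect of the alphabet ordering can be reproduced with Duval's algorithm [5] parameterized by a letter ranking. The sketch below is illustrative only: the function name and the convention of passing the ordering as a string listing the letters from least to greatest are assumptions, and the test string is the concatenation of the two factors a and bcabcdabcaba quoted above.

```python
def lyndon_factorization(s, order):
    """Duval's algorithm; `order` lists the alphabet from least to greatest letter."""
    rank = {letter: r for r, letter in enumerate(order)}
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend the current run of repeated Lyndon prefixes as far as possible.
        while j < n and rank[s[k]] <= rank[s[j]]:
            k = i if rank[s[k]] < rank[s[j]] else k + 1
            j += 1
        # Emit each complete copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


x = "abcabcdabcaba"
natural = lyndon_factorization(x, "abcd")    # ordering a < b < c < d
reordered = lyndon_factorization(x, "bcad")  # ordering b < c < a < d
```

Here `natural` has four factors and `reordered` has two, so for this string the reordering halves the number of factors.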
Towards this goal we now describe a greedy algorithm for producing small numbers of factors, which has performed well in practice on the biological {A, C, G, T} alphabet; the experimentation compared its results with those for the 4! letter permutations. For a string v = v_1 ... v_n, we suppose that the number of distinct characters in v is δ ≤ σ; for practical purposes we can assume σ is at most

The proposed method requires an extension to a Parikh vector, p(v), of a finite word v, where p(v) enumerates the occurrences of each letter of the alphabet in v. Our modification is that for each distinct letter we record its individual RLE (run-length encoding) exponent pattern, that is, the left-to-right sequence of exponents for that letter, so the sum of these exponents is the Parikh entry for that letter. We call this the Exponent Parikh vector, or EP vector, implemented as an EP array. For example, over the alphabet ]; whereas, for the EP vector we record the sequences [(3, 2), (2, 1), (1)]. So with a Parikh vector the letters are usually processed in alphabet order, while in the EP case we process them in order of first occurrence.
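As an illustration of the EP vector, the following sketch records the run-length exponents of each letter in order of first occurrence. The input string aaabbaabc is an assumption, chosen only because it reproduces the sequences [(3, 2), (2, 1), (1)] quoted above; it is not necessarily the example used in the paper, and the function names are hypothetical.

```python
from itertools import groupby


def ep_vector(s):
    """Map each letter to its left-to-right run lengths (exponents).

    Keys appear in order of first occurrence, since dicts keep insertion order.
    """
    ep = {}
    for letter, run in groupby(s):
        ep.setdefault(letter, []).append(len(list(run)))
    return ep


def parikh_from_ep(ep):
    """Each Parikh entry is the sum of that letter's exponents."""
    return {letter: sum(exponents) for letter, exponents in ep.items()}


ep = ep_vector("aaabbaabc")  # assumed illustrative string
parikh = parikh_from_ep(ep)
```

A single linear scan suffices, which matches the O(n) cost stated for this step of the algorithm.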
An overview of the method is that we use the fact that in a Lyndon factorization the first factor is the longest prefix which is a Lyndon word. Then the heuristic is that the left-most letter, α say, in the given string whose exponents, when read as a string, have a Lyndon factorization with the minimal number of factors is chosen as the least letter in the alphabet ordering. In order to construct a Lyndon word using the exponents of letters we require the order of the exponent integer alphabet to be inverted, that is, let Σ̄ = {... < 3 < 2 < 1}. Next, the algorithm attempts to assign order to letters in the substrings between runs of α characters, where these substrings are denoted X_i; if it gets stuck it tries backtracking. Finally, if there is a nonempty prefix prior to the first α then it is processed. So note that with this algorithm the required property for the exponents of α is that they form a Lyndon word over Σ̄, and in conjunction a requirement for assigning letters to the X_i substrings is that the ordering will be cycle-free. The algorithm can be modified to generate large numbers of factors, which involves assigning distinct letters to be in decreasing order.

Greedy algorithm
The pseudocode in Algorithm 1 greedily assigns an alphabet order to letters.

Algorithm 1:
Order the alphabet so as to reduce the number of factors in a Lyndon factorization.
    // assign first letter to be minimal in Σ;
    if q = 1 assign each new letter in X_1 successively in Σ
    for h = 2 to q do
        if j_h = j_1 then
            // same exponents so assign alphabet in order to letters in X_1 and
            assign each new letter successively in Σ; d++;

The following example, which uses the notation of Algorithm 1, illustrates how backtracking can lead the algorithm from an inconsistent ordering to a successful assignment and associated factorization.
Only the letter a has an exponent greater than 1.

Experimentation: factorization of DNA strings
We chose as an example the 120 prokaryotic reference genomes from RefSeq to investigate the results of the algorithm in practice. Most of these genomes are provided as a single contiguous sequence but some of them have additional smaller pieces representing plasmids or other information. The longest contiguous sequence was chosen for each genome in these cases, and smaller pieces were discarded. The retained sequences ranged from 640,681 letters to 10,236,715 in length, with a mean of 3,629,792.
In order to determine how often our greedy algorithm found a good or optimal alphabet reordering in practice, we calculated the Lyndon factorizations resulting from all possible (4! = 24) alphabet reorderings of the characters A, C, G and T across this collection of genomes. The improvement that could potentially be made to the factorization by reordering is substantial, with at least a halving of the number of factors in most cases and an improvement reducing 25 factors down to 3 in one case. For each genome we ranked the results of all possible reorderings by the number of factors produced and compared the reordering produced by the algorithm. The algorithm found the optimal reordering for 21/120 genomes and the second-best reordering for 31/120 genomes.
The EP vector is used to determine the least letter in the reordering. If the first choice leads to inconsistency (and hence small factors), backtracking to inspect other possible choices can be helpful. However, in many cases, the initial choice is still better than the next possible consistent solution found via backtracking through the EP vector. Without backtracking, the algorithm found the optimal ordering for 23/120 genomes and the second-best ordering for a further 31/120 genomes. Histograms of the full results, with and without backtracking, can be seen in the Supplemental Information.
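The exhaustive comparison against all orderings of a small alphabet can be sketched as follows. This is not the authors' experimental code: the function names are hypothetical, and the short string from the alphabet-ordering illustration stands in for a genome, where the same call would be made with the alphabet "ACGT".

```python
from itertools import permutations


def count_factors(s, order):
    """Number of Lyndon factors of s (Duval's algorithm) under the given ordering."""
    rank = {letter: r for r, letter in enumerate(order)}
    count, i, n = 0, 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and rank[s[k]] <= rank[s[j]]:
            k = i if rank[s[k]] < rank[s[j]] else k + 1
            j += 1
        while i <= k:
            count += 1
            i += j - k
    return count


def rank_orderings(s, alphabet):
    """All orderings of `alphabet`, sorted by the number of factors they produce."""
    counts = {"".join(p): count_factors(s, "".join(p)) for p in permutations(alphabet)}
    return sorted(counts.items(), key=lambda item: item[1])


ranked = rank_orderings("abcabcdabcaba", "abcd")
best_order, best_count = ranked[0]
```

The position of a candidate ordering in `ranked` then indicates how close a heuristic choice is to the optimum for that string.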

Applications
In many cases, such as natural language text processing, the order of the alphabet is prescribed, and hence the Lyndon factors of an input text cannot be manipulated. On the other hand, bioinformatics alphabets have no inherent ordering suggested by biological systems, and applications involving Lyndon words, such as the Burrows-Wheeler transform (BWT), will allow for useful manipulation of the Lyndon factors. The co-BWT is the regular BWT of the reversed string, or the BWT with co-lexorder, which has been applied in the highly successful Bowtie sequence alignment program [6]. Integral to the BWT is the computation of suffix arrays via induced suffix-sorting. We also propose that pattern matching can be implemented with the Lyndon factorization in big data applications, such as sequence alignment, and further enhanced by fortuitous arrangements of the alphabet.
In [7] a new method is presented for constructing the suffix array of a text by using its Lyndon factorization advantageously. Partitioning the text according to its Lyndon properties allows tackling the problem in local portions of the text, local suffixes, prior to extending the solution globally, to achieve the suffix array of the entire text. It is stated that the algorithm's time complexity is not competitive for the construction of the overall suffix array; we propose that manipulating the factors by alphabet ordering could improve the efficiency.

Research problems
• As a complementary structure to the Lyndon array we introduce and propose studies of the Lyndon factorization array. The Lyndon array λ = λ_x[1..n] of a given x = x[1..n] gives at each i the length of the maximal Lyndon word starting at i; reverse engineering in [4] includes a linear-time test for whether an integer array is a Lyndon array. We define the Lyndon factorization array F = F_x[1..n] of x to give at each position i the number of factors in the Lyndon factorization starting at i.
• The greedy algorithm does not necessarily produce an optimal solution; hence natural problems are to design algorithms for Lyndon factorizations with a guaranteed minimal/maximal number of Lyndon factors.
• Using Duval's Lyndon factorization algorithm [5] as a benchmark, modify the alphabet order so as to increase/decrease the number of factors.
• Theorem 2 supports the following optimization problem from [3]: Determine the circ-UMFF(s) which factors a string x into the minimal/maximal number of factors, possibly combined with alphabet ordering.
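To make the Lyndon array and the proposed Lyndon factorization array concrete, the sketch below computes both naively by factoring every suffix. It assumes that "the Lyndon factorization starting at i" means the factorization of the suffix x[i..n], uses the natural character order, and returns 0-indexed lists; the function names are illustrative and the quadratic running time is a property of this sketch, not of any method in the paper.

```python
def lyndon_factorization(s):
    """Duval's algorithm under the natural character order."""
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def lyndon_arrays(x):
    """Return (lam, fac): lam[i] is the length of the longest Lyndon word
    starting at i, fac[i] the number of Lyndon factors of the suffix x[i:]."""
    lam, fac = [], []
    for i in range(len(x)):
        factors = lyndon_factorization(x[i:])
        lam.append(len(factors[0]))  # first factor = longest Lyndon prefix
        fac.append(len(factors))
    return lam, fac


lam, fac = lyndon_arrays("aba")
```

For x = aba the suffixes aba, ba and a factor as ab ≥ a, b ≥ a and a, giving λ = [2, 1, 1] and F = [2, 2, 1].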
The exponent string 22212221 of the letter a has the minimal number of factors with f_1 = f_2 = 2221. After assigning a to be the first letter in Σ, processing f_1 causes inconsistency (and similarly f_2), since the substrings X_1 and X_2 give b < c while X_1 and X_3 would require c < b. So the algorithm then backtracks through the EP array and chooses the letter with the least number of factors (albeit singletons), namely d; the result is Σ = {d < c < a < b} with F_L(x) = aab ≥ dcaacdaabdbabaabcaacaacab. Note that the order {a < b < c < d} would have given 3 factors.
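The numbers in this example can be checked with a short sketch. The string x is read off by concatenating the two factors of F_L(x) above; the exponent strings are factored with the inverted integer order, implemented here by negating each exponent. The helper names are hypothetical, and the consistency check on the X_i substrings and the backtracking step are deliberately not implemented.

```python
from itertools import groupby


def lyndon_factorization(seq, key):
    """Duval's algorithm on a string or list; `key` maps a symbol to its rank."""
    factors, i, n = [], 0, len(seq)
    while i < n:
        j, k = i + 1, i
        while j < n and key(seq[k]) <= key(seq[j]):
            k = i if key(seq[k]) < key(seq[j]) else k + 1
            j += 1
        while i <= k:
            factors.append(seq[i:i + j - k])
            i += j - k
    return factors


def exponent_strings(s):
    """EP vector: run lengths of each letter, keyed in order of first occurrence."""
    ep = {}
    for letter, run in groupby(s):
        ep.setdefault(letter, []).append(len(list(run)))
    return ep


def factor_with_order(s, order):
    """Lyndon factorization of s with `order` listing letters from least to greatest."""
    rank = {letter: r for r, letter in enumerate(order)}
    return lyndon_factorization(s, key=rank.__getitem__)


x = "aabdcaacdaabdbabaabcaacaacab"
ep = exponent_strings(x)
# Inverted integer alphabet {... < 3 < 2 < 1}: a larger exponent ranks lower.
factor_counts = {
    letter: len(lyndon_factorization(exps, key=lambda e: -e))
    for letter, exps in ep.items()
}
after_backtracking = factor_with_order(x, "dcab")  # d < c < a < b
natural_order = factor_with_order(x, "abcd")       # a < b < c < d
```

The exponent string of a yields two factors while those of d, c and b yield three, five and six singleton factors respectively, which agrees with d being the next candidate after a.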

If Σ is a totally ordered alphabet then lexicographic ordering (lexorder) u < v with u, v ∈ Σ⁺ means that either u is a proper prefix of v, or u = ras, v = rbt for some a, b ∈ Σ such that a < b and for some r, s, t ∈ Σ*. We call the ordering ≺ based on lexorder of reversed strings co-lexorder.

https://doi.org/10.1016/j.ipl.2018.10.011
0020-0190/© 2018 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

2. Unique Maximal Factorization Families (UMFFs)
    With a linear scan record the Exponent Parikh (EP) vector of the string for δ distinct letters - O(n);
    Compute F_L(p_r) of each exponent string p_r over Σ̄ and record its number of factors - O(n);
    bool ← true;
    // initiate alphabet ordering
    while bool = true do
        Select the next leftmost p_r, p_i say, with minimal number of factors, t say - O(n);
        // assign alphabet order to the t factors of