Counting Lyndon factors

In this paper, we determine the maximum number of distinct Lyndon factors that a word of length $n$ can contain. We also derive formulas for the expected total number of Lyndon factors in a word of length $n$ on an alphabet of size $\sigma$, as well as the expected number of distinct Lyndon factors in such a word. The minimum number of distinct Lyndon factors in a word of length $n$ is $1$ and the minimum total number is $n$, with both bounds being achieved by $x^n$ where $x$ is a letter. A more interesting question is: what is the minimum number of distinct Lyndon factors in a Lyndon word of length $n$? In this direction, it is known (Saari, 2014) that an optimal lower bound for the number of distinct Lyndon factors in a Lyndon word of length $n$ is $\lceil\log_{\phi}(n) + 1\rceil$, where $\phi$ denotes the golden ratio $(1 + \sqrt{5})/2$. Moreover, this lower bound is attained by the so-called finite "Fibonacci Lyndon words", which are precisely the Lyndon factors of the well-known "infinite Fibonacci word", a special example of an "infinite Sturmian word". Saari (2014) conjectured that if $w$ is a Lyndon word of length $n$, $n\ne 6$, containing the least number of distinct Lyndon factors over all Lyndon words of the same length, then $w$ is a Christoffel word (i.e., a Lyndon factor of an infinite Sturmian word). We give a counterexample to this conjecture. Furthermore, we generalise Saari's result on the number of distinct Lyndon factors of a Fibonacci Lyndon word by determining the number of distinct Lyndon factors of a given Christoffel word. We end with two open problems.


INTRODUCTION
This paper is concerned with counting Lyndon words occurring in a given word of length n.
First, let us recall some terminology and notation from combinatorics on words (see, e.g., [10,11]). A word is a (possibly empty) finite or infinite sequence of symbols, called letters, drawn from a given finite set $\Sigma$, called an alphabet, of size $\sigma = |\Sigma|$. A finite word $w := x_1 x_2 \cdots x_n$ with each $x_i \in \Sigma$ is said to have length $n$, written $|w| = n$. The empty word is the unique word of length 0, denoted by $\varepsilon$. The set of all finite words over $\Sigma$ (including the empty word) is denoted by $\Sigma^*$, and for each integer $n \geq 2$, the set of all words of length $n$ over $\Sigma$ is denoted by $\Sigma^n$.
A finite word z is said to be a factor of a given finite word w if there exist words u, v such that w = uzv. If u = ε, then z is said to be a prefix of w, and if v = ε, then z is said to be a suffix of w. If both u and v are non-empty, we say that z is a proper factor of w. A prefix (respectively, suffix) of w that is not equal to w itself is said to be a proper prefix (respectively, proper suffix) of w. A factor of an infinite word is a finite word that occurs within it.
A non-empty word $x$ that is both a proper prefix and a proper suffix of a finite word $w$ is said to be a border of $w$. A word that has no such border is said to be borderless. If, for some word $x$, $w = xx \cdots x$ ($k$ times for some integer $k \geq 1$), we write $w = x^k$, and $w$ is called the $k$-th power of $x$. A non-empty finite word is said to be primitive if it is not a power of a shorter word. Two finite words $u$, $v$ are said to be conjugate if there exist words $x$, $y$ such that $u = xy$ and $v = yx$. Accordingly, conjugate words are cyclic shifts of one another, and thus conjugacy is an equivalence relation. A primitive word of length $n$ has exactly $n$ distinct conjugates. For example, the primitive word abacaba of length 7 has 7 distinct conjugates; namely, itself and the six words bacabaa, acabaab, cabaaba, abaabac, baabaca, aabacab. The set of all conjugates of a finite word $w$ is called the conjugacy class of $w$.
In this paper we consider only words on an ordered alphabet $\Sigma = \{a_1, a_2, \ldots, a_\sigma\}$ where $a_1 < a_2 < \cdots < a_\sigma$. This total order on $\Sigma$ naturally induces a lexicographical order (i.e., an alphabetical order) on the set of all finite words over $\Sigma$. A Lyndon word over $\Sigma$ is a non-empty primitive word that is the lexicographically least word in its conjugacy class, i.e., $w \in \Sigma$ or $w < vu$ for all non-empty words $u$, $v$ such that $w = uv$ (e.g., see [10]). Equivalently, a non-empty finite word $w$ over $\Sigma$ is Lyndon if and only if $w \in \Sigma$ or $w < v$ for all non-empty proper suffixes $v$ of $w$ [7]. Note, in particular, that there is a unique Lyndon word in the conjugacy class of any given primitive word. For example, aabacab is the unique Lyndon conjugate of the primitive word abacaba. Lyndon words are named after R.C. Lyndon [12], who introduced them in 1954 under the name of "standard lexicographic sequences". Such words are well known to be borderless [7].
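The definitions above translate directly into a few lines of Python. The following is a minimal sketch, not taken from the paper: the helper names `conjugates`, `is_lyndon` and `lyndon_conjugate` are hypothetical, and the alphabet order is assumed to be Python's default character order (so a < b < c).

```python
def conjugates(w: str) -> list[str]:
    """All cyclic shifts of w, starting with w itself."""
    return [w[i:] + w[:i] for i in range(len(w))]


def is_lyndon(w: str) -> bool:
    """Suffix characterisation: a non-empty word is Lyndon iff it is
    strictly smaller than each of its non-empty proper suffixes."""
    return bool(w) and all(w < w[i:] for i in range(1, len(w)))


def lyndon_conjugate(w: str) -> str:
    """Lexicographically least conjugate of w (Lyndon when w is primitive)."""
    return min(conjugates(w))


# The example from the text: abacaba is primitive, so it has 7 distinct
# conjugates, and its Lyndon conjugate is aabacab.
print(len(set(conjugates("abacaba"))))
print(lyndon_conjugate("abacaba"))
print(is_lyndon("aabacab"), is_lyndon("abab"))
```

The suffix test is quadratic in the length of the word, which is adequate for the small examples considered here.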
We begin in Section 2 by computing $D(\sigma, n)$, the maximum number of distinct Lyndon factors in a word of length $n$ on an alphabet $\Sigma$ of size $\sigma$. In Section 3 we compute $ET(\sigma, n)$, the expected total number of Lyndon factors (that is, counted according to their multiplicity) in a word of length $n$ over $\Sigma$, while Section 4 computes $ED(\sigma, n)$, the expected number of distinct Lyndon factors in a word of length $n$ over $\Sigma$. Section 5 considers distinct Lyndon factors in a Lyndon word of length $n$; in particular, we generalise a result of Saari [13] on the number of distinct Lyndon factors of a Fibonacci Lyndon word by determining the number of distinct Lyndon factors of a given Christoffel word (i.e., a Lyndon factor of an infinite Sturmian word, to be defined later). Lastly, in Section 6, we state some open problems.

THE MAXIMUM NUMBER OF DISTINCT LYNDON FACTORS IN A WORD
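Let $D(\sigma, n)$ be the maximum number of distinct Lyndon factors in a word of length $n$ on the alphabet $\Sigma = \{a_1, a_2, \ldots, a_\sigma\}$. We want to find a word $w$ that achieves $D(\sigma, n)$, given $\sigma$ and $n$. It is clear that a necessary condition for attaining the maximum is that $w$ takes the form $a_1^{k_1} a_2^{k_2} \cdots a_\sigma^{k_\sigma}$, where $k_1 + k_2 + \cdots + k_\sigma = n$. This word contains $\binom{n+1}{2}$ factors of lengths $1, 2, \ldots, n$, of which each is a Lyndon word except those of the form $a_i^k$, $k > 1$. The number of powers of each $a_i$ is $\binom{k_i+1}{2}$, including $a_i$ itself. The total number of Lyndon factors in $w$ is therefore
$$\binom{n+1}{2} - \sum_{i=1}^{\sigma} \binom{k_i+1}{2} + \sigma,$$
where the final $\sigma$ counts the single letters $a_i$. We claim that the summation is minimised when the $k_i$ differ by at most one. Suppose to the contrary that $k_j = k_i + s$ for some $i$, $j$ and $s \geq 2$. It is easily checked that
$$\binom{k_i+1}{2} + \binom{k_i+s+1}{2} > \binom{k_i+2}{2} + \binom{k_i+s}{2}$$
for $s \geq 2$. Thus the summation term will be minimised when each $k_i$ equals either $\lfloor n/\sigma \rfloor$ or $\lceil n/\sigma \rceil$. If $n = m\sigma + p$, where $0 < p < \sigma$, then $\lfloor n/\sigma \rfloor = m$ and $\lceil n/\sigma \rceil = m + 1$. If $p = 0$ then each $k_i$ equals $m$. We therefore have the following result.
Theorem 1. If $n = m\sigma + p$, where $0 \leq p < \sigma$, then
$$D(\sigma, n) = \binom{n+1}{2} - (\sigma - p)\binom{m+1}{2} - p\binom{m+2}{2} + \sigma.$$
Proof. If n = mσ, Theorem 1 gives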
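As a sanity check on the counting argument of this section, the sketch below compares an exhaustive search with the closed form obtained from the balanced choice of exponents: with $n = m\sigma + p$ and $0 \leq p < \sigma$, exactly $p$ exponents equal $m+1$ and $\sigma - p$ equal $m$, giving $\binom{n+1}{2} - (\sigma-p)\binom{m+1}{2} - p\binom{m+2}{2} + \sigma$. The function names are hypothetical, and the check assumes $n \geq \sigma$, since the final $+\sigma$ term counts every letter of the alphabet as occurring.

```python
from itertools import product
from math import comb


def is_lyndon(w: str) -> bool:
    # Non-empty w is Lyndon iff it is smaller than each non-empty proper suffix.
    return bool(w) and all(w < w[i:] for i in range(1, len(w)))


def distinct_lyndon_factors(w: str) -> set[str]:
    # Collect every factor w[i:j] that is a Lyndon word; the set removes repeats.
    return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)
            if is_lyndon(w[i:j])}


def d_closed_form(sigma: int, n: int) -> int:
    # Balanced exponents: p of them equal m + 1, the other sigma - p equal m.
    m, p = divmod(n, sigma)
    return (comb(n + 1, 2) - (sigma - p) * comb(m + 1, 2)
            - p * comb(m + 2, 2) + sigma)


def d_brute_force(sigma: int, n: int) -> int:
    # Exhaustive maximum over all sigma**n words; only feasible for small n.
    alphabet = "abcdefghijklmnopqrstuvwxyz"[:sigma]
    return max(len(distinct_lyndon_factors("".join(t)))
               for t in product(alphabet, repeat=n))


for sigma, n in [(2, 5), (3, 5)]:
    print(sigma, n, d_closed_form(sigma, n), d_brute_force(sigma, n))
```

The exhaustive search grows like $\sigma^n$, so it is intended only as a check for very small parameters.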

THE EXPECTED TOTAL NUMBER OF LYNDON FACTORS IN A WORD
We now wish to calculate the total number $M(\sigma, n)$ of Lyndon factors (that is, counted according to multiplicity) appearing in all words in $\Sigma^n$. Consider a Lyndon word $L$ of length $m \leq n$ and a position $i$, $1 \leq i \leq n - m + 1$, in words of length $n$. Words containing $L$ starting at position $i$ have the form $xLy$, where $xy$ is any word on $\Sigma$ with length $n - m$. Thus there are $\sigma^{n-m}$ words in $\Sigma^n$ which contain $L$ in this position. This is the same for each of the $n - m + 1$ possible values of $i$, so in the $\sigma^n$ words in $\Sigma^n$ there are $(n - m + 1)\sigma^{n-m}$ appearances of $L$. This is the same for all Lyndon words of this length. The number of such Lyndon words is $1/m$ of the number of primitive words of this length, since exactly one conjugate of each primitive word is Lyndon. The number of primitive words of length $m$ (see [10]) is
$$\sum_{d \mid m} \mu\left(\frac{m}{d}\right)\sigma^{d},$$
where $\mu$ is the Möbius function. To get the total number of Lyndon factors appearing in $\Sigma^n$, we sum over possible values of $m$:
$$M(\sigma, n) = \sum_{m=1}^{n} (n - m + 1)\,\sigma^{n-m}\,\frac{1}{m}\sum_{d \mid m} \mu\left(\frac{m}{d}\right)\sigma^{d}.$$
Dividing by $\sigma^n$ gives the expected total number $ET(\sigma, n) := M(\sigma, n)/\sigma^n$ of Lyndon factors in a word of length $n$ on the alphabet $\Sigma$:
$$ET(\sigma, n) = \sum_{m=1}^{n} \frac{n - m + 1}{m\,\sigma^{m}}\sum_{d \mid m} \mu\left(\frac{m}{d}\right)\sigma^{d}. \tag{2}$$
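The count described in this section can be checked numerically. The sketch below assumes the classical Möbius-inversion count of primitive words, $\sum_{d \mid m} \mu(m/d)\sigma^d$, and sums $(n-m+1)\sigma^{n-m}$ times the number of Lyndon words of length $m$; it then compares the result with a direct enumeration. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def mobius(n: int) -> int:
    """Möbius function by trial division (fine for small n)."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:          # a squared prime factor
                return 0
            result = -result
        p += 1
    return -result if n > 1 else result


def lyndon_count(sigma: int, m: int) -> int:
    """Number of Lyndon words of length m: primitive words divided by m."""
    primitive = sum(mobius(m // d) * sigma**d
                    for d in range(1, m + 1) if m % d == 0)
    return primitive // m


def total_in_all_words(sigma: int, n: int) -> int:
    """M(sigma, n): Lyndon factor occurrences summed over all words of length n."""
    return sum((n - m + 1) * sigma**(n - m) * lyndon_count(sigma, m)
               for m in range(1, n + 1))


def expected_total(sigma: int, n: int) -> Fraction:
    """ET(sigma, n) = M(sigma, n) / sigma**n, as an exact fraction."""
    return Fraction(total_in_all_words(sigma, n), sigma**n)


def is_lyndon(w: str) -> bool:
    return bool(w) and all(w < w[i:] for i in range(1, len(w)))


def brute_total(sigma: int, n: int) -> int:
    """Direct count of Lyndon factor occurrences over all words of length n."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"[:sigma]
    total = 0
    for t in product(alphabet, repeat=n):
        w = "".join(t)
        total += sum(is_lyndon(w[i:j])
                     for i in range(n) for j in range(i + 1, n + 1))
    return total


print(total_in_all_words(2, 3), brute_total(2, 3), expected_total(2, 3))
```

Exact rational arithmetic via `Fraction` avoids rounding when the two counts are compared.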

THE EXPECTED NUMBER OF DISTINCT LYNDON FACTORS IN A WORD
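We use the notation from above, with $[n]$ being the set $\{1, 2, \ldots, n\}$. Most of the following analysis counts the number of words in $\Sigma^n$ that contain at least one factor equal to a specific Lyndon word $L$. At the end we sum over all possible $L$. Let $S$ be a non-empty set of positions in a word $w$ and let $P(L, S, w) = 1$ if $w$ contains a factor equal to $L$ beginning at each position in the set $S$, and 0 otherwise. Note that $w$ may contain other factors equal to $L$. We claim that, if $w$ contains at least one factor equal to $L$, then
$$\sum_{\emptyset \neq S \subseteq [n]} (-1)^{|S|+1} P(L, S, w) = 1, \tag{4}$$
while the sum is clearly 0 if $w$ contains no factor equal to $L$. To see this, let $T$ be the set of all positions in $w$ at which a factor equal to $L$ begins, and let $t = |T|$. Then $P(L, S, w)$ equals 1 if and only if $S$ is any non-empty subset of $T$, so the left hand side of (4) becomes
$$\sum_{s=1}^{t} (-1)^{s+1}\binom{t}{s} = 1 - \sum_{s=0}^{t} (-1)^{s}\binom{t}{s}.$$
This equals 1 since the final sum is the binomial expansion of $(1 - 1)^t$. The number of words in $\Sigma^n$ which contain at least one factor equal to $L$ is therefore
$$\sum_{w \in \Sigma^n}\ \sum_{\emptyset \neq S \subseteq [n]} (-1)^{|S|+1} P(L, S, w) = \sum_{\emptyset \neq S \subseteq [n]} (-1)^{|S|+1} \sum_{w \in \Sigma^n} P(L, S, w). \tag{5}$$
We now evaluate $\sum_{w \in \Sigma^n} P(L, S, w)$. This counts the words in $\Sigma^n$ which have factors $L$ beginning at positions $i \in S$. Write $s = |S|$. The sum clearly equals 0 if $s|L| > n$ since then there is no room in $w$ for $s$ factors $L$ (recalling that $L$ is Lyndon, therefore borderless, and therefore cannot intersect a copy of itself). We also need the members of $S$ to be separated by at least $|L|$. The number of such sets $S$ is
$$\binom{n - s|L| + s}{s}.$$
Once $S$ is chosen there are $\sigma^{n - s|L|}$ ways of choosing the letters in $w$ which are not in the specified factors $L$. Thus
$$\sum_{|S| = s}\ \sum_{w \in \Sigma^n} P(L, S, w) = \binom{n - s|L| + s}{s}\sigma^{n - s|L|}.$$
Substituting in (5) we see that the number of words in $\Sigma^n$ which contain at least one occurrence of $L$ is
$$\sum_{s=1}^{\lfloor n/|L| \rfloor} (-1)^{s+1}\binom{n - s|L| + s}{s}\sigma^{n - s|L|}.$$
To get the expected number $ED(\sigma, n)$ of distinct Lyndon factors in a word of length $n$, we sum this over all $L$ with length at most $n$, using the same technique as in the previous section, and divide by $\sigma^n$. Replacing $|L|$ with $m$ we get the following:
$$ED(\sigma, n) = \sum_{m=1}^{n} \frac{1}{m}\sum_{d \mid m} \mu\left(\frac{m}{d}\right)\sigma^{d}\ \sum_{s=1}^{\lfloor n/m \rfloor} (-1)^{s+1}\binom{n - sm + s}{s}\sigma^{-sm}. \tag{6}$$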
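The inclusion-exclusion argument of this section can be tested on small cases. The sketch below rests on one reading of that argument, stated here as an assumption: a fixed Lyndon word of length $m$ occurs in exactly $\sum_{s=1}^{\lfloor n/m \rfloor} (-1)^{s+1}\binom{n-sm+s}{s}\sigma^{n-sm}$ words of $\Sigma^n$, where the binomial coefficient counts the sets of $s$ pairwise non-overlapping starting positions. Summing over all Lyndon words and dividing by $\sigma^n$ gives the expectation, which is compared with a direct enumeration. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb


def mobius(n: int) -> int:
    """Möbius function by trial division (fine for small n)."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0
            result = -result
        p += 1
    return -result if n > 1 else result


def lyndon_count(sigma: int, m: int) -> int:
    """Number of Lyndon words of length m over sigma letters."""
    return sum(mobius(m // d) * sigma**d
               for d in range(1, m + 1) if m % d == 0) // m


def words_containing(sigma: int, n: int, m: int) -> int:
    """Words of length n containing a fixed Lyndon word of length m
    (inclusion-exclusion over sets of s non-overlapping positions)."""
    return sum((-1)**(s + 1) * comb(n - s * m + s, s) * sigma**(n - s * m)
               for s in range(1, n // m + 1))


def expected_distinct(sigma: int, n: int) -> Fraction:
    """ED(sigma, n) as an exact fraction."""
    total = sum(lyndon_count(sigma, m) * words_containing(sigma, n, m)
                for m in range(1, n + 1))
    return Fraction(total, sigma**n)


def is_lyndon(w: str) -> bool:
    return bool(w) and all(w < w[i:] for i in range(1, len(w)))


def brute_expected_distinct(sigma: int, n: int) -> Fraction:
    """Average number of distinct Lyndon factors over all words of length n."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"[:sigma]
    total = 0
    for t in product(alphabet, repeat=n):
        w = "".join(t)
        total += len({w[i:j] for i in range(n) for j in range(i + 1, n + 1)
                      if is_lyndon(w[i:j])})
    return Fraction(total, sigma**n)


print(expected_distinct(2, 3), brute_expected_distinct(2, 3))
```

The count depends on $L$ only through its length because a Lyndon word is borderless, so two occurrences never overlap.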

DISTINCT LYNDON FACTORS IN A LYNDON WORD
Minimising the number of Lyndon factors over words of length $n$ is not very interesting: the minimum number of distinct Lyndon factors is 1 and the minimum total number is $n$. Both bounds are achieved by $x^n$ where $x$ is a letter. A more interesting question has been studied by Saari [13]: what is the minimum number of distinct Lyndon factors in a Lyndon word of length $n$? He proved that an optimal lower bound for the number of distinct Lyndon factors in a Lyndon word of length $n$ is $\lceil \log_{\phi}(n) + 1 \rceil$, where $\phi$ denotes the golden ratio $(1 + \sqrt{5})/2$. Moreover, this lower bound is attained by the so-called finite Fibonacci Lyndon words, which are precisely the Lyndon factors of the well-known infinite Fibonacci word $f$, a special example of a characteristic Sturmian word.
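Following the notation and terminology in [11, Ch. 2], an infinite word $s$ over $\{a, b\}$ is Sturmian if and only if there exists an irrational $\alpha \in (0, 1)$, and a real number $\rho$, such that $s$ is one of the following two infinite words $s_{\alpha,\rho}$, $s'_{\alpha,\rho}$, defined for $n \geq 0$ by
$$s_{\alpha,\rho}(n) = \begin{cases} a & \text{if } \lfloor (n+1)\alpha + \rho \rfloor = \lfloor n\alpha + \rho \rfloor, \\ b & \text{otherwise;} \end{cases} \qquad s'_{\alpha,\rho}(n) = \begin{cases} a & \text{if } \lceil (n+1)\alpha + \rho \rceil = \lceil n\alpha + \rho \rceil, \\ b & \text{otherwise.} \end{cases}$$
The irrational $\alpha$ is called the slope of $s$ and $\rho$ is the intercept. If $\rho = 0$, we have $s_{\alpha,0} = ac_\alpha$ and $s'_{\alpha,0} = bc_\alpha$ where $c_\alpha$ is called the characteristic Sturmian word of slope $\alpha$. Sturmian words of the same slope have the same set of factors [11, Prop. 2.1.18], so when studying the factors of Sturmian words, it suffices to consider only the characteristic ones.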
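To make the slope and intercept concrete, here is a minimal sketch that generates prefixes of the two words, assuming the standard mechanical-word definition of [11, Ch. 2]: letter $n$ of $s_{\alpha,\rho}$ is $b$ exactly when $\lfloor (n+1)\alpha + \rho \rfloor > \lfloor n\alpha + \rho \rfloor$, and $s'_{\alpha,\rho}$ uses ceilings instead. The function names are hypothetical, and floating-point arithmetic is used, which is only reliable for short prefixes.

```python
from math import ceil, floor, sqrt


def lower_mechanical(alpha: float, rho: float, length: int) -> str:
    """Prefix of s_{alpha,rho}: 'b' where the floor increases, 'a' elsewhere."""
    return "".join(
        "b" if floor((n + 1) * alpha + rho) > floor(n * alpha + rho) else "a"
        for n in range(length))


def upper_mechanical(alpha: float, rho: float, length: int) -> str:
    """Prefix of s'_{alpha,rho}: same rule with ceilings."""
    return "".join(
        "b" if ceil((n + 1) * alpha + rho) > ceil(n * alpha + rho) else "a"
        for n in range(length))


# Slope of the infinite Fibonacci word, intercept 0.
alpha = (3 - sqrt(5)) / 2
print(lower_mechanical(alpha, 0.0, 9))   # a followed by a prefix of c_alpha
print(upper_mechanical(alpha, 0.0, 9))   # b followed by the same prefix
```

With intercept 0 the two outputs differ only in their first letter, matching the relations $s_{\alpha,0} = ac_\alpha$ and $s'_{\alpha,0} = bc_\alpha$.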
The infinite Fibonacci word $f$ is the characteristic Sturmian word of slope $\alpha = (3 - \sqrt{5})/2$. It can be constructed as the limit of an infinite sequence of so-called finite Fibonacci words $\{f_n\}_{n \geq 1}$, defined by:
$$f_{-1} = b, \quad f_0 = a, \quad f_n = f_{n-1} f_{n-2} \ \text{ for } n \geq 1.$$
That is, $f_1 = ab$, $f_2 = aba$, $f_3 = abaab$, $f_4 = abaababa$, $f_5 = abaababaabaab$, etc. (where $f_n$ is a prefix of $f_{n+1}$ for each $n \geq 1$), and we have $f = \lim_{n \to \infty} f_n = abaababaabaab \cdots$
Note. The length of the $n$-th finite Fibonacci word $f_n$ is the $n$-th Fibonacci number $F_n$, defined by: $F_{-1} = 1$, $F_0 = 1$, $F_n = F_{n-1} + F_{n-2}$ for $n \geq 1$.
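The listed words $f_1, \ldots, f_5$ are consistent with the recurrence $f_n = f_{n-1} f_{n-2}$ started from $f_{-1} = b$, $f_0 = a$. The sketch below assumes that recurrence and checks the prefix and length properties stated above; the function name `fibonacci_words` is hypothetical.

```python
def fibonacci_words(count: int) -> list[str]:
    """Return [f_1, ..., f_count], assuming f_{-1} = b, f_0 = a and
    f_n = f_{n-1} f_{n-2} for n >= 1."""
    older, newer = "b", "a"            # f_{-1}, f_0
    words = []
    for _ in range(count):
        older, newer = newer, newer + older
        words.append(newer)
    return words


words = fibonacci_words(5)
print(words)
# Lengths follow F_n = F_{n-1} + F_{n-2} with F_{-1} = F_0 = 1.
print([len(f) for f in words])
# Each word is a prefix of the next one.
print(all(nxt.startswith(cur) for cur, nxt in zip(words, words[1:])))
```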
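More generally, any characteristic Sturmian word can be constructed as the limit of an infinite sequence of finite words. To this end, we recall that every irrational $\alpha \in (0, 1)$ has a unique simple continued fraction expansion:
$$\alpha = [0; a_1, a_2, a_3, \ldots] = \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \cdots}}}$$
where each $a_i$ is a positive integer. The $n$-th convergent of $\alpha$ is defined by $p_n/q_n = [0; a_1, a_2, \ldots, a_n]$ for all $n \geq 1$, where the sequences $\{p_n\}_{n \geq 0}$ and $\{q_n\}_{n \geq 0}$ are given by
$$p_0 = 0, \quad p_1 = 1, \quad p_n = a_n p_{n-1} + p_{n-2}, \quad n \geq 2,$$
$$q_0 = 1, \quad q_1 = a_1, \quad q_n = a_n q_{n-1} + q_{n-2}, \quad n \geq 2.$$
Writing $\alpha = [0; 1 + d_1, d_2, d_3, \ldots]$ with $d_1 \geq 0$ and all other $d_n \geq 1$ (so that $a_1 = 1 + d_1$ and $a_n = d_n$ for $n \geq 2$), we associate with $\alpha$ the sequence of words $\{s_n\}_{n \geq -1}$ defined by
$$s_{-1} = b, \quad s_0 = a, \quad s_n = s_{n-1}^{d_n} s_{n-2} \ \text{ for } n \geq 1.$$
Such a sequence of words is called a standard sequence, and we have $|s_n| = q_n$ for all $n \geq 0$.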
Note that $ab$ is a suffix of $s_{2n-1}$ and $ba$ is a suffix of $s_{2n}$ for all $n \geq 1$.
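The length identity and the suffix pattern can be checked on a small example. The sketch below assumes the recurrence used later in this section, $s_{-1} = b$, $s_0 = a$, $s_i = s_{i-1}^{d_i} s_{i-2}$, together with the convention that the partial quotients are $a_1 = 1 + d_1$ and $a_i = d_i$ for $i \geq 2$. The function names and the sample sequence $(1, 2, 1, 3)$ are illustrative choices, not taken from the paper.

```python
def standard_words(d: list[int]) -> list[str]:
    """Return [s_1, ..., s_n] for the finite sequence d = [d_1, ..., d_n]."""
    older, newer = "b", "a"                    # s_{-1}, s_0
    words = []
    for d_i in d:
        older, newer = newer, newer * d_i + older
        words.append(newer)
    return words


def convergent_denominators(d: list[int]) -> list[int]:
    """Return [q_1, ..., q_n] with q_0 = 1, q_1 = a_1, q_i = a_i q_{i-1} + q_{i-2}."""
    a = [1 + d[0]] + list(d[1:])               # a_1 = 1 + d_1, a_i = d_i otherwise
    older, newer = 1, a[0]                     # q_0, q_1
    qs = [newer]
    for a_i in a[1:]:
        older, newer = newer, a_i * newer + older
        qs.append(newer)
    return qs


d = [1, 2, 1, 3]
words = standard_words(d)
print(words[:3])
print([len(s) for s in words], convergent_denominators(d))
# Last two letters of s_1, s_2, s_3, s_4.
print([s[-2:] for s in words])
```

For this sample sequence the suffixes alternate between ab (odd index) and ba (even index), as noted above.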
Standard sequences are related to characteristic Sturmian words in the following way. Observe that, for any $n \geq 0$, $s_n$ is a prefix of $s_{n+1}$, which gives obvious meaning to $\lim_{n \to \infty} s_n$ as an infinite word. In fact, one can prove [8,3] that each $s_n$ is a prefix of $c_\alpha$, and we have
$$c_\alpha = \lim_{n \to \infty} s_n.$$
The following lemma collects together some properties of the standard words $s_n$. Note that from now on when referring to Lyndon words over the alphabet $\{a, b\}$ we assume the natural order $a < b$.
◮ For all $n \geq 1$, $s_n$ is a primitive word [6].
Proof. First we show that, for all $n \geq 1$, the Lyndon conjugate of $s_n$ is the word $ap_nb$. If $s_n = p_nba$, then $ap_nb$ is clearly a conjugate of $s_n$ and it is Lyndon [2,5]. On the other hand, if $s_n = p_nab$, then $bp_na$ is clearly a conjugate of $s_n$, and since the conjugacy class of $s_n$ is closed under reversal and $p_n$ is a palindrome (by Lemma 3), it follows that $ap_nb$ is a conjugate of $s_n$ and it is Lyndon [2,5].
To prove the second claim, it suffices to show that if $k < n$, then the Lyndon conjugate of $s_k$ is a prefix or suffix of $ap_nb$ (since, by Lemma 3, the Lyndon factors of $c_\alpha$ of length at least 2 are precisely the Lyndon conjugates of the (primitive) standard words in $c_\alpha$). The claim is true for $k = -1$ and $k = 0$ since $s_{-1} = b$ and $s_0 = a$. It is also true for $k = 1$ because $s_1 = a^{d_1}b$ is the Lyndon conjugate of itself, and is a prefix of $ap_nb$ if $d_1 \geq 1$ and a suffix of $ap_nb$ if $d_1 = 0$. Now suppose that $k \geq 2$. Then $k < n$ implies that $s_k$ is a prefix of $p_n$. Furthermore, since $p_n$ is a palindrome, the reversal of $s_k$ is a suffix of $p_n$. Therefore if $s_k = p_kba$, then its Lyndon conjugate $ap_kb$ is a prefix of $ap_nb$; otherwise, if $s_k = p_kab$, then its Lyndon conjugate $ap_kb$ is a suffix of $ap_nb$.
The Lyndon factors of (characteristic) Sturmian words of length at least 2 (i.e., the Lyndon conjugates of standard words) over $\{a, b\}$ are precisely the so-called Christoffel words beginning with the letter $a$ (see, e.g., the nice survey [1]). Christoffel words take the form $a\,Pal(v)\,b$ and $b\,Pal(v)\,a$ where $v \in \{a, b\}^*$ and $Pal$ is iterated palindromic closure, defined by: $Pal(\varepsilon) = \varepsilon$ and $Pal(wx) = (Pal(w)x)^+$ for any finite word $w$ and letter $x$, where $u^+$ denotes the shortest palindrome beginning with $u$ (called the palindromic closure of $u$). For example, $Pal(aba) = \underline{a}\,\underline{b}a\,\underline{a}ba$ where the underlined letters indicate the points at which palindromic closure is applied.
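Iterated palindromic closure is easy to compute directly from its definition. The following is a minimal sketch with hypothetical function names: the closure $u^+$ is obtained by finding the longest palindromic suffix of $u$ and appending the reversal of the remaining prefix.

```python
def palindromic_closure(u: str) -> str:
    """Shortest palindrome that has u as a prefix."""
    for i in range(len(u) + 1):
        suffix = u[i:]
        if suffix == suffix[::-1]:     # first hit is the longest palindromic suffix
            return u + u[:i][::-1]
    return u                           # unreachable: the empty suffix is a palindrome


def pal(v: str) -> str:
    """Iterated palindromic closure: Pal(empty) = empty, Pal(wx) = (Pal(w)x)^+."""
    w = ""
    for x in v:
        w = palindromic_closure(w + x)
    return w


print(pal("ab"), pal("aba"))
# A Christoffel word of the form a Pal(v) b, here with v = ab.
print("a" + pal("ab") + "b")
```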
Let $p$, $q$ be co-prime integers with $0 < p < q$. The rational $p/q$ has two distinct simple continued fraction expansions: $p/q = [0; 1 + d_1, d_2, \ldots, d_n, 1] = [0; 1 + d_1, d_2, \ldots, d_n + 1]$ where $d_1 \geq 0$ and all other $d_i \geq 1$. The so-called Christoffel word of slope $p/q$ beginning with the letter $a$ is the unique Sturmian Lyndon word over $\{a, b\}$ of length $q$ containing $p$ occurrences of the letter $b$, given by: ] beginning with a.
Saari [13, Thm. 1] proved that if $w$ is a Lyndon word with $|w| \geq F_n$ for some $n \geq 1$, then $w$ contains at least $n + 2$ distinct Lyndon factors, with equality if and only if $w$ is the Fibonacci Lyndon word of length $F_n$. For example, $a\,Pal(ab)\,b = aabab$ is the Fibonacci Lyndon word of length $F_3 = 5$ and contains the minimum number ($3 + 2 = 5$) of distinct Lyndon factors over all Lyndon words of the same length. Saari also made the following conjecture.

Conjecture 7. [13]
If w is a Lyndon word of length n, n ≠ 6, containing the least number of distinct Lyndon factors over all Lyndon words of the same length, then w is a Christoffel word.
The number 6 is excluded because the following words all contain 7 distinct Lyndon factors, which is the minimum for length 6 words, and only the first and last are Christoffel: aaaaab, aaabab, aabbab, ababbb, ababac, abacac, acbacc, abbbbb.
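The counts quoted for these eight words are easy to confirm by direct enumeration. The sketch below uses hypothetical helper names and Python's default character order a < b < c; the search over all Lyndon words of length 6 is restricted to the binary alphabet to keep the example small.

```python
from itertools import product


def is_lyndon(w: str) -> bool:
    # Non-empty w is Lyndon iff it is smaller than each non-empty proper suffix.
    return bool(w) and all(w < w[i:] for i in range(1, len(w)))


def distinct_lyndon_factors(w: str) -> set[str]:
    return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)
            if is_lyndon(w[i:j])}


def min_over_lyndon_words(alphabet: str, n: int) -> int:
    """Least number of distinct Lyndon factors over Lyndon words of length n."""
    return min(len(distinct_lyndon_factors("".join(t)))
               for t in product(alphabet, repeat=n) if is_lyndon("".join(t)))


WORDS = ["aaaaab", "aaabab", "aabbab", "ababbb",
         "ababac", "abacac", "acbacc", "abbbbb"]

for w in WORDS:
    print(w, is_lyndon(w), sorted(distinct_lyndon_factors(w), key=len))

print(min_over_lyndon_words("ab", 6))
```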
However, the conjecture is not true. The following Lyndon word has length 28 and contains 10 distinct . . . , $d_n$, 1] for some co-prime integers $p$, $q$ with $0 < p < q$, then $L(w) = d_1 + d_2 + \cdots + d_n + 3$.
Proof. The word $w$ is the Lyndon conjugate of the standard word $s_{n+1} = s_n s_{n-1}$ with $s_{-1} = b$, $s_0 = a$, and $s_i = s_{i-1}^{d_i} s_{i-2}$ for $1 \leq i \leq n$ (see Remark 5). By Lemma 4, the Lyndon conjugates of $s_i$ for $1 \leq i \leq n$ are either prefixes or suffixes of $w$. Moreover, for each $i$ with $1 \leq i \leq n$, the standard word $s_i$ contains $d_i$ distinct Lyndon factors of lengths $|s_{i-1}^m s_{i-2}|$ for $m = 1, 2, \ldots, d_i$. By Lemma 3, these are the only Lyndon factors of the Lyndon word $w$ besides itself and the two letters $a$ and $b$. Hence $L(w) = (d_1 + d_2 + \cdots + d_n) + 3$.
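The count $d_1 + d_2 + \cdots + d_n + 3$ can be checked experimentally on small cases. The sketch below follows the construction in the proof: it builds $s_{n+1} = s_n s_{n-1}$ from $s_{-1} = b$, $s_0 = a$, $s_i = s_{i-1}^{d_i} s_{i-2}$, takes the Lyndon conjugate, and counts distinct Lyndon factors by enumeration. The function names and the sample sequences are illustrative assumptions.

```python
def is_lyndon(w: str) -> bool:
    return bool(w) and all(w < w[i:] for i in range(1, len(w)))


def distinct_lyndon_factors(w: str) -> set[str]:
    return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)
            if is_lyndon(w[i:j])}


def christoffel_from_directive(d: list[int]) -> str:
    """Lyndon conjugate of s_{n+1} = s_n s_{n-1} for d = [d_1, ..., d_n]."""
    older, newer = "b", "a"                    # s_{-1}, s_0
    for d_i in d:
        older, newer = newer, newer * d_i + older
    s_next = newer + older                     # s_{n+1} = s_n s_{n-1}
    return min(s_next[i:] + s_next[:i] for i in range(len(s_next)))


for d in [[1], [1, 1], [2, 1], [0, 2], [1, 2], [1, 1, 1, 1]]:
    w = christoffel_from_directive(d)
    print(d, w, len(distinct_lyndon_factors(w)), sum(d) + 3)
```

For the all-ones sequences the construction returns Fibonacci Lyndon words, so the count agrees with the value $n + 2$ for length $F_n$ quoted earlier from Saari.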
The above result is a generalisation of [13, Lemma 9], which reworded (with the indexing of Fibonacci words and numbers shifted back by 2) states that if $w$ is the Fibonacci Lyndon word of length $F_{n-2}$ for some $n \geq 3$, i.e., the Christoffel word of slope
Open Problem 1: Is it true that the minimum number of distinct Lyndon factors over all Lyndon words of the same length is attained by at least one Christoffel word of that length?
Open Problem 2: Tables 2 and 3, showing values for ET(σ, n) and ED(σ, n), raise the question of whether there may exist asymptotic formulas for these quantities, simpler than the exact values displayed in equations (2) and (6), respectively.